PAMI is a Python library containing 100+ algorithms to discover useful patterns in various databases across multiple computing platforms. (Active)
Frequent pattern mining aims to discover all interesting patterns in a transactional database that satisfy the user-specified minimum support (minSup) constraint. The minSup controls the minimum number of transactions that a pattern must cover in a database. Since only a single minSup is employed for the entire database, this technique implicitly assumes that all items in a database have uniform frequencies or similar occurrence behavior. However, this is seldom the case in real-world applications, where some items appear frequently in the data while others occur rarely. If the frequencies of the items in a database vary a great deal, then finding frequent patterns with a single minSup leads to the following two problems:
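- If minSup is set too high, patterns that involve rare items are missed.
- If minSup is set too low, the number of frequent patterns explodes combinatorially, and many of them are uninteresting.

This dilemma is known as the “rare item problem.”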
When confronted with the above problem in real-world applications, researchers tried to tackle it by finding correlated patterns in a database. Several alternative measures have been described in the literature to find correlated patterns. Each measure has a selection bias that justifies the significance of one pattern over another. Consequently, there is no universally accepted best measure for finding correlated patterns. However, finding correlated patterns using the all-confidence measure has gained popularity because it satisfies both the null-invariant and anti-monotonic properties. In this context, we have developed correlated pattern mining algorithms using the all-confidence measure.
According to the all-confidence based correlated pattern mining model, a pattern is said to be correlated if it satisfies both the minimum support (minSup) and minimum all-confidence (minAllConf) constraints.
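As a minimal sketch of this model (not PAMI's implementation), the all-confidence of a pattern is commonly defined as the support of the pattern divided by the largest support of any single item in it. The brute-force code below applies that definition to the hypothetical ten-transaction database used on this page. The function names and the thresholds minSup = 4 and minAllConf = 0.7 are illustrative assumptions.

```python
from itertools import combinations

# Hypothetical ten-transaction database (same rows as the sample on this page).
DATABASE = [
    "a b c g", "b c d e", "a b c d", "a c d f", "a b c d g",
    "c d e f", "a b c d", "a e f", "a b c d", "b c d e",
]
TRANSACTIONS = [set(row.split()) for row in DATABASE]


def support(pattern):
    """Number of transactions that contain every item of the pattern."""
    items = set(pattern)
    return sum(1 for t in TRANSACTIONS if items <= t)


def all_confidence(pattern):
    """sup(pattern) divided by the largest support of any single item in it."""
    return support(pattern) / max(support([item]) for item in pattern)


def correlated_patterns(min_sup, min_all_conf):
    """Brute-force enumeration; fine for a toy database, exponential in general."""
    items = sorted(set().union(*TRANSACTIONS))
    found = {}
    for size in range(1, len(items) + 1):
        for pattern in combinations(items, size):
            sup = support(pattern)
            if sup >= min_sup and all_confidence(pattern) >= min_all_conf:
                found[pattern] = (sup, all_confidence(pattern))
    return found


if __name__ == "__main__":
    for pattern, (sup, conf) in correlated_patterns(4, 0.7).items():
        print(" ".join(pattern), sup, round(conf, 4))
```

For example, the pattern `c d` has support 8 and the most frequent item in it, `c`, has support 9, so its all-confidence is 8/9. A real miner would prune the search space using the anti-monotonic property instead of enumerating every item combination.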
References:
E. R. Omiecinski, “Alternative interest measures for mining associations in databases,” in IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 1, pp. 57-69, Jan.-Feb. 2003, doi: 10.1109/TKDE.2003.1161582.
Y.-K. Lee, W.-Y. Kim, Y. D. Cai, and J. Han, “CoMine: Efficient Mining of Correlated Patterns,” in Proc. IEEE International Conference on Data Mining (ICDM), 2003, pp. 581-584.
A transactional database is a collection of transactions, where each transaction contains a transaction-identifier and a set of items.
A hypothetical transactional database containing the items a, b, c, d, e, f, and g is shown below:
tid | Transactions |
---|---|
1 | a b c g |
2 | b c d e |
3 | a b c d |
4 | a c d f |
5 | a b c d g |
6 | c d e f |
7 | a b c d |
8 | a e f |
9 | a b c d |
10 | b c d e |
Note: Duplicate items must not exist in a transaction.
Each row in a transactional database file must contain only items. To reduce storage and processing costs, the frequent pattern mining algorithms in PAMI implicitly treat the row number of a transaction as its transaction identifier.
A sample transactional database, say sampleTransactionalDatabase.txt, is provided below.
a b c g
b c d e
a b c d
a c d f
a b c d g
c d e f
a b c d
a e f
a b c d
b c d e
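The following sketch shows one way to create this file and read it back in plain Python, using the row number as the transaction identifier and rejecting duplicate items, as described above. The helper names `write_database` and `read_database` are hypothetical and are not part of PAMI.

```python
import os
import tempfile

SAMPLE_ROWS = [
    "a b c g", "b c d e", "a b c d", "a c d f", "a b c d g",
    "c d e f", "a b c d", "a e f", "a b c d", "b c d e",
]


def write_database(path, rows):
    """Write one transaction per line; no tid column is stored."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(rows) + "\n")


def read_database(path, sep=" "):
    """Return {tid: set of items}, using the 1-based row number as the tid."""
    database = {}
    with open(path, encoding="utf-8") as handle:
        for tid, line in enumerate(handle, start=1):
            items = line.strip().split(sep)
            if len(items) != len(set(items)):
                raise ValueError(f"duplicate item in transaction {tid}")
            database[tid] = set(items)
    return database


if __name__ == "__main__":
    # Write to a temporary folder so the demo does not touch the working directory.
    folder = tempfile.mkdtemp()
    path = os.path.join(folder, "sampleTransactionalDatabase.txt")
    write_database(path, SAMPLE_ROWS)
    print(len(read_database(path)), "transactions read")
```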
The performance of a pattern mining algorithm primarily depends on the statistical nature of a database. Thus, it is important to know the following details of a database:
The below sample code prints the statistical details of a database.
import PAMI.extras.dbStats.TransactionalDatabase as stats
obj = stats.TransactionalDatabase('sampleTransactionalDatabase.txt', ' ')
obj.run()
obj.printStats()
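For readers who want to see what such statistics look like without installing PAMI, here is a small plain-Python sketch. The choice of statistics (database size, number of distinct items, transaction lengths, item frequencies) and the function name `database_stats` are assumptions made for illustration; they do not describe what `printStats()` reports.

```python
from collections import Counter
from statistics import mean

ROWS = [
    "a b c g", "b c d e", "a b c d", "a c d f", "a b c d g",
    "c d e f", "a b c d", "a e f", "a b c d", "b c d e",
]


def database_stats(rows, sep=" "):
    """Summarise a transactional database given as a list of text rows."""
    transactions = [row.split(sep) for row in rows if row.strip()]
    lengths = [len(t) for t in transactions]
    # Count each item once per transaction, which gives its support.
    item_freq = Counter(item for t in transactions for item in set(t))
    return {
        "database_size": len(transactions),
        "distinct_items": len(item_freq),
        "min_length": min(lengths),
        "avg_length": mean(lengths),
        "max_length": max(lengths),
        "item_frequencies": dict(item_freq),
    }


if __name__ == "__main__":
    for name, value in database_stats(ROWS).items():
        print(name, value)
```

The item frequencies are what make a single minSup problematic: in this toy database `c` occurs in 9 transactions while `g` occurs in only 2.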
The input parameters to a correlated pattern mining algorithm are:
- Transactional database (input file), in one of the following formats:
  - String : E.g., ‘transactionalDatabase.txt’
  - URL : E.g., https://u-aizu.ac.jp/~udayrage/datasets/transactionalDatabases/transactional_T10I4D100K.csv
  - DataFrame with the header titled ‘Transactions’
- minSup, specified as:
  - a count (between 0 and the length of the database) or
  - a fraction in [0, 1]
- minAllConf, specified as:
  - a value in [0, 1]
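Because a support threshold can be given either as a count or as a value in [0, 1], it can help to normalise it to a count before comparing it with pattern supports. The helper below is a hypothetical sketch of one such convention: an `int` is treated as an absolute count and a `float` as a fraction of the database size. How PAMI itself interprets the value is not shown here, so treat this rule as an assumption.

```python
def to_support_count(min_sup, database_size):
    """Convert a minSup given as a count (int) or a fraction (float) to a count.

    Assumed convention: int means absolute count, float means fraction of
    the database size.
    """
    if isinstance(min_sup, bool):
        raise TypeError("minSup must be an int or a float, not a bool")
    if isinstance(min_sup, int):
        if not 0 <= min_sup <= database_size:
            raise ValueError("count must lie between 0 and the database size")
        return min_sup
    if isinstance(min_sup, float):
        if not 0.0 <= min_sup <= 1.0:
            raise ValueError("fraction must lie in [0, 1]")
        return min_sup * database_size
    raise TypeError("minSup must be an int or a float")


if __name__ == "__main__":
    print(to_support_count(4, 10))
    print(to_support_count(0.5, 10))
```

Note the ambiguity this convention resolves: `1` is a count of one transaction, whereas `1.0` means every transaction in the database.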
The patterns discovered by a correlated pattern mining algorithm can be saved into a file or a data frame.
foo@bar: cd PAMI/correlatedPattern/basic
foo@bar: python3 algorithmName.py inputFile outputFile minSup minAllConf separator
Example: python3 CPGrowth.py inputFile.txt outputFile.txt 3 0.4 ' '
import PAMI.correlatedPattern.basic.CoMine as alg
iFile = 'sampleTransactionalDatabase.txt' # specify the input transactional database
minSup = 4 # specify the minSup value
minAllConf = 0.7 # specify the minAllConf value
seperator = ' ' # specify the separator. The default separator is a tab.
oFile = 'correlatedPattern.txt' # specify the output file name
obj = alg.CoMine(iFile, minSup, minAllConf, seperator) # initialize the algorithm
obj.mine() # start the mining process
obj.save(oFile) # store the patterns in file
df = obj.getPatternsAsDataFrame() # Get the patterns discovered into a dataframe
obj.printResults()
Correlated Frequent patterns were generated successfully using CorrelatedPatternGrowth algorithm
Total number of Correlated Patterns: 9
Total Memory in USS: 81182720
Total Memory in RSS 119152640
Total ExecutionTime in ms: 0.0003771781921386719
!cat correlatedPattern.txt
#format: correlatedPattern:support:all-confidence
e:4:1.0
d:8:1.0
d c:8:0.8888888888888888
c:9:1.0
b:7:1.0
b d:6:0.75
b c:7:0.7777777777777778
b a:5:0.7142857142857143
a:7:1.0
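The saved file uses the `pattern:support:all-confidence` layout shown above, so it can be read back with a few lines of plain Python. The function `parse_patterns` below is a hypothetical helper, not part of PAMI. It splits each line from the right so that only the last two colons are treated as field separators, and it splits the items on any whitespace.

```python
SAVED_OUTPUT = """\
#format: correlatedPattern:support:all-confidence
e:4:1.0
d:8:1.0
d c:8:0.8888888888888888
c:9:1.0
b:7:1.0
b d:6:0.75
b c:7:0.7777777777777778
b a:5:0.7142857142857143
a:7:1.0
"""


def parse_patterns(lines):
    """Parse 'items:support:all-confidence' lines into tuples."""
    patterns = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the format header
        items, support, all_conf = line.rsplit(":", 2)
        patterns.append((tuple(items.split()), int(support), float(all_conf)))
    return patterns


if __name__ == "__main__":
    for items, support, all_conf in parse_patterns(SAVED_OUTPUT.splitlines()):
        print(items, support, all_conf)
```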
The dataframe contains the following information:
df # The dataframe containing the patterns is shown below. In each pattern, items are separated from each other by a tab space (or \t).
 | Patterns | Support | Confidence |
---|---|---|---|
0 | e | 4 | 1.000000 |
1 | d | 8 | 1.000000 |
2 | d c | 8 | 0.888889 |
3 | c | 9 | 1.000000 |
4 | b | 7 | 1.000000 |
5 | b d | 6 | 0.750000 |
6 | b c | 7 | 0.777778 |
7 | b a | 5 | 0.714286 |
8 | a | 7 | 1.000000 |