PAMI is a Python library containing 100+ algorithms to discover useful patterns in various databases across multiple computing platforms. (Active)
Frequent pattern mining aims to discover all interesting patterns in a transactional database that satisfy the user-specified minimum support (minSup) constraint. The minSup controls the minimum number of transactions that a pattern must cover in a database. Since only a single minSup is employed for the entire database, this technique implicitly assumes that all items in a database have uniform frequencies or similar occurrence behavior. However, this is seldom not the case in many real-world applications. In many applications, some items appear frequently in the data, while others occur rarely. If the frequencies of the items in a database vary a great deal, then finding frequent patterns with a single minSup leads to the following two problems:
This dillema is known as the ‘‘rare item problem.’’
When confronted with the above problem in the real-world applications, researchers tried to tackle it by finding the frequent patterns using multiple minimum support values. In this extended model, each item in the database is specified with a minSup-like constraint known as minimum item support, and the minSup of a pattern is expressed with the lowest minimum support of its items.
References:
Bing Liu, Wynne Hsu, and Yiming Ma. 1999. Mining association rules with multiple minimum supports. In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining (KDD ‘99). Association for Computing Machinery, New York, NY, USA, 337–341. Link
R. Uday Kiran, P. Krishna Reddy: Novel techniques to reduce search space in multiple minimum supports-based frequent pattern mining algorithms. EDBT 2011: 11-20. Link
A transactional database is a collection of transactions, where each transaction contains a transaction-identifier and a set of items.
A hypothetical transactional database containing the items a, b, c, d, e, f, and g as shown below.
tid | Transactions |
---|---|
1 | a b c g |
2 | b c d e |
3 | a b c d |
4 | a c d f |
5 | a b c d g |
6 | c d e f |
7 | a b c d |
8 | a e f |
9 | a b c d |
10 | b c d e |
Note: Duplicate items must not exist in a transaction.
Each row in a transactional database must contain only items. The frequent pattern mining algorithms in PAMI implicitly assume the row number of a transaction as its transactional-identifier to reduce storage and processing costs. A sample transactional database, say sampleInputFile.txt, is provided below.
a b c g
b c d e
a b c d
a c d f
a b c d g
c d e f
a b c d
a e f
a b c d
b c d e
The performance of a pattern mining algorithm primarily depends on the satistical nature of a database. Thus, it is important to know the following details of a database:
The below sample code prints the statistical details of a database.
import PAMI.extras.dbStats.TransactionalDatabase as stats
obj = stats.TransactionalDatabase('sampleTransactionalDatabase.txt', ' ')
obj.run()
obj.printStats()
A multiple minimum support file contains a set of items and its respective minimum support value.
A hypothetical multiple minSup database containing the items a, b, c, d, e, f, and g as shown below.
Item | MIS |
---|---|
a | 4 |
b | 4 |
c | 3 |
d | 4 |
e | 4 |
f | 5 |
g | 5 |
In many real-world applications, it is often difficult for the users to specify minimum item supports for every item in a database. Different methods have been described in the literature to derive the items’ minimum item support values. The procedures to specify the items’ MIS values is provided at derivingItemsMISValues.
a 4
b 4
c 3
d 4
e 4
f 5
g 5
The input parameters to a frequent pattern mining algorithm are:
- String : E.g., ‘transactionalDatabase.txt’
- URL : E.g., https://u-aizu.ac.jp/~udayrage/datasets/transactionalDatabases/transactional_T10I4D100K.csv
- DataFrame with the header titled ‘Transactions’
- String : E.g., ‘minSupDatabase.txt’
- URL : E.g., https://u-aizu.ac.jp/~udayrage/datasets/transactionalDatabases/transactional_T10I4D100K.csv
- DataFrame with the header titled ‘item’ and ‘MIS’
The patterns discovered by a frequent pattern mining with multiple minimum support algorithm can be saved into a file or a data frame.
foo@bar: cd PAMI/multipleminimumSupportbasedFrequentPattern/basic
foo@bar: python3 algorithmName.py inputFile outputFile multipleMinSupFile.txt seperator
Example: python3 CFPGrowth.py
inputFile.txt
outputFile.txt
MIS.txt
' '
import PAMI.multipleMinimumSupportBasedFrequentPattern.basic.CFPGrowth as alg
iFile = 'sampleTransactionalDatabase.txt' #specify the input transactional database
mFile = 'MIS.txt' #specify the multiple minsup file
seperator = ' ' #specify the seperator. Default seperator is tab space.
oFile = 'frequentPatterns.txt' #specify the output file name
obj = alg.CFPGrowth(iFile, mFile, seperator) #initialize the algorithm
obj.mine() #start the mining process
obj.save(oFile) #store the patterns in file
df = obj.getPatternsAsDataFrame() #Get the patterns discovered into a dataframe
obj.printResults() #Print the stats of mining process
Frequent patterns were generated successfully using CFPGrowth algorithm
Total number of Frequent Patterns: 10
Total Memory in USS: 100147200
Total Memory in RSS 138452992
Total ExecutionTime in ms: 0.0017397403717041016
!cat frequentPatterns.txt
#format: frequentPattern:support
e:4
e d c:3
e c:3
b:5
b d:6
b d c:6
b c:7
d:8
d c:8
c:8
The dataframe containing the patterns is shown below:
df
Patterns | Support | |
---|---|---|
0 | e | 4 |
1 | e d c | 3 |
2 | e c | 3 |
3 | b | 5 |
4 | b d | 6 |
5 | b d c | 6 |
6 | b c | 7 |
7 | d | 8 |
8 | d c | 8 |
9 | c | 8 |