PAMI is a Python library containing 100+ algorithms to discover useful patterns in various databases across multiple computing platforms. (Active)
Frequent pattern mining (FPM) is an important knowledge discovery technique with many real-world applications. It involves finding all frequently occurring patterns in a (binary) transactional database. A fundamental limitation of this technique is its inability to find interesting patterns in a quantitative transactional database. High utility pattern mining (HUPM) has been introduced to tackle this limitation.
HUPM generalizes the model of frequent patterns by discovering all patterns that have utility more than the user-specified minimum utility (minUtil) in a quantitative transactional database. The minUtil constraint controls the minimum utility (or value) of a pattern must maintain in a database.
Reference: Hong Yao and Howard J. Hamilton. 2006. Mining itemset utilities from transaction databases. Data Knowl. Eng. 59, 3 (December 2006), 603–626. Link
A utility database consists of an ‘internal utility database’ and an ‘external utility database’.
In an internal utility database, every transaction contains a set of items and a positive integer called internal utility respectively.
In an external utility database, every transaction contains an item and it external utility value.
An hypothetical internal utility database is shown in below table.
Transactions |
---|
(a,2) (b,3) (c,1) (g,1) |
(b,3) (c,2) (d,3) (e,2) |
(a,2) (b,1) (c,3) (d,4) |
(a,3) (c,2) (d,1) (f,2) |
(a,3) (b,1) (c,2) (d,1) (g,2) |
(c,2) (d,2) (e,3) (f,1) |
(a,2) (b,1) (c,1) (d,2) |
(a,1) (e,2) (f,2) |
(a,2) (b,2) (c,4) (d,2) |
(b,3) (c,2) (d,2) (e,2) |
A hypothetical external utility database is shown in below table.
Item | Profit |
---|---|
a | 4 |
b | 3 |
c | 6 |
d | 2 |
e | 5 |
f | 2 |
g | 3 |
Note: Duplicate items must not exist in a transaction.
Each row in a utility database must contain the following information:
All of the above three fields have to be seperated using the colan symbol.
A sample utility database, say sampleUtility.txt, is shown below:
a b c g:7:2 3 1 1
b c d e:10:3 2 3 2
a b c d:10:2 1 3 4
a c d f:7:3 2 1 2
a b c d g:9:3 1 2 1 2
c d e f:8:2 2 3 1
a b c d:6:2 1 1 2
a e f:5:1 2 2
a b c d:10:2 2 4 2
b c d e:9:3 2 2 2
The performance of a pattern mining algorithm primarily depends on the satistical nature of a database. Thus it is important to know the following details of a database:
The below sample code prints the statistical details of a database.
import PAMI.extras.dbStats.UtilityDatabase as stats
obj = stats.UtilityDatabase('sampleUtility.txt', ' ')
obj.run()
obj.printStats()
The input parameters to a frequent pattern mining algorithm are:
- String : E.g., ‘utilityDatabase.txt’
- URL : E.g., https://u-aizu.ac.jp/~udayrage/datasets/utilityDatabases/utility_T10I4D100K.csv
- DataFrame with the header titled ‘Transactions’, ‘Utility’ and ‘TransactionUtility’
- count
The patterns discovered by a high utility pattern mining algorithm can be saved into a file or a data frame.
foo@bar: cd PAMI/highUtilityPattern/basic
foo@bar: python3 algorithmName.py iFile outputFile minUtil seperator
Example: python3 EFIM.py
inputFile.txt
outputFile.txt
$20$ ' '
import PAMI.highUtilityPattern.basic.EFIM as alg
iFile = 'sampleUtility.txt' #specify the input utility database <
minUtil = 20 #specify the minSupvalue
seperator = ' ' #specify the seperator. Default seperator is tab space.
oFile = 'utilityPatterns.txt' #specify the output file name
obj = alg.EFIM(iFile, minUtil, seperator) #initialize the algorithm
obj.mine() #start the mining process
obj.save(oFile) #store the patterns in file
df = obj.getPatternsAsDataFrame() #Get the patterns discovered into a dataframe
obj.printResults() #Print the stats of mining process
High Utility patterns were generated successfully using EFIM algorithm
Total number of High Utility Patterns: 11
Total Memory in USS: 110927872
Total Memory in RSS 149282816
Total ExecutionTime in seconds: 0.0006177425384521484
The utilityPatterns.txt file contains the following patterns (format: pattern:utility):!cat utilityPatterns.txt
!cat utilityPatterns.txt
e d c:20
a b d:23
a b d c:33
a b c:30
a d:22
a d c:34
a c:27
b d:25
b d c:39
b c:29
d c:35
The dataframe containing the patterns is shown below:
df
Patterns | Utility | |
---|---|---|
0 | e d c | 20 |
1 | a b d | 23 |
2 | a b d c | 33 |
3 | a b c | 30 |
4 | a d | 22 |
5 | a d c | 34 |
6 | a c | 27 |
7 | b d | 25 |
8 | b d c | 39 |
9 | b c | 29 |
10 | d c | 35 |