PAMI is a Python library containing 100+ algorithms to discover useful patterns in various databases across multiple computing platforms. (Active)
The high utility pattern mining model disregards the frequency of a pattern in a database. However, in many real-world applications, the interestingness of a pattern is determined by both its value and its frequency. High utility frequent pattern mining was introduced to discover only those patterns that have a high value and occur at least a certain number of times in a database.
High utility frequent pattern mining aims to discover all patterns whose utility is no less than the user-specified minimum utility (minUtil) and whose support is no less than the user-specified minimum support (minSup).
Reference: R. Uday Kiran, T. Yashwanth Reddy, Philippe Fournier-Viger, Masashi Toyoda, P. Krishna Reddy, Masaru Kitsuregawa: Efficiently Finding High Utility-Frequent Itemsets Using Cutoff and Suffix Utility. PAKDD (2) 2019: 191-203
A utility database consists of an ‘internal utility database’ and an ‘external utility database’.
In an internal utility database, every transaction contains a set of items, each paired with a positive integer called its internal utility.
In an external utility database, every entry contains an item and its external utility value.
A hypothetical internal utility database is shown in the table below.
Transactions |
---|
(a,2) (b,3) (c,1) (g,1) |
(b,3) (c,2) (d,3) (e,2) |
(a,2) (b,1) (c,3) (d,4) |
(a,3) (c,2) (d,1) (f,2) |
(a,3) (b,1) (c,2) (d,1) (g,2) |
(c,2) (d,2) (e,3) (f,1) |
(a,2) (b,1) (c,1) (d,2) |
(a,1) (e,2) (f,2) |
(a,2) (b,2) (c,4) (d,2) |
(b,3) (c,2) (d,2) (e,2) |
A hypothetical external utility database is shown in the table below.
Item | Profit |
---|---|
a | 4 |
b | 3 |
c | 6 |
d | 2 |
e | 5 |
f | 2 |
g | 3 |
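The two tables are commonly combined by multiplying each item's internal utility (for example, a purchased quantity) by its external utility (for example, a unit profit) and summing over the transaction. That is the usual definition in the high utility mining literature rather than something stated on this page, so the sketch below is only an illustration. The names `internal_db`, `external_utility`, and `transaction_utility` are made up for this example. Note that the sample file shown further down stores one utility value per item directly.

```python
# Internal utility database from the first table: one dict per transaction.
# Using a dict per transaction also guarantees that no item appears twice.
internal_db = [
    {"a": 2, "b": 3, "c": 1, "g": 1},
    {"b": 3, "c": 2, "d": 3, "e": 2},
    {"a": 2, "b": 1, "c": 3, "d": 4},
    {"a": 3, "c": 2, "d": 1, "f": 2},
    {"a": 3, "b": 1, "c": 2, "d": 1, "g": 2},
    {"c": 2, "d": 2, "e": 3, "f": 1},
    {"a": 2, "b": 1, "c": 1, "d": 2},
    {"a": 1, "e": 2, "f": 2},
    {"a": 2, "b": 2, "c": 4, "d": 2},
    {"b": 3, "c": 2, "d": 2, "e": 2},
]

# External utility database from the second table (item -> profit).
external_utility = {"a": 4, "b": 3, "c": 6, "d": 2, "e": 5, "f": 2, "g": 3}


def transaction_utility(transaction, profits):
    """Sum of internal utility times external utility over all items (assumed definition)."""
    return sum(quantity * profits[item] for item, quantity in transaction.items())


utilities = [transaction_utility(t, external_utility) for t in internal_db]
print(utilities)
```

For the first transaction this gives 2*4 + 3*3 + 1*6 + 1*3 = 26.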
Note: Duplicate items must not exist in a transaction.
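Each row in a utility database must contain the following information:
- the items in the transaction,
- the total utility of the transaction, and
- the utility of each item in the transaction.
These three fields must be separated by the colon symbol (:).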
A sample utility database, say sampleUtility.txt, is shown below:
a b c g:7:2 3 1 1
b c d e:10:3 2 3 2
a b c d:10:2 1 3 4
a c d f:8:3 2 1 2
a b c d g:9:3 1 2 1 2
c d e f:8:2 2 3 1
a b c d:6:2 1 1 2
a e f:5:1 2 2
a b c d:10:2 2 4 2
b c d e:9:3 2 2 2
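To make the row layout concrete, here is a minimal parsing sketch that splits a row on the colon and then splits the item and utility fields on the separator. It uses the first three rows of the sample and checks two things: that every item has a utility, and that no item is repeated. The function `parse_utility_row` is hypothetical and is not part of PAMI.

```python
SAMPLE_ROWS = """\
a b c g:7:2 3 1 1
b c d e:10:3 2 3 2
a b c d:10:2 1 3 4
"""


def parse_utility_row(line, sep=" "):
    """Split one 'items:totalUtility:itemUtilities' row into its three fields."""
    items_field, total_field, utilities_field = line.strip().split(":")
    items = items_field.split(sep)
    utilities = [int(value) for value in utilities_field.split(sep)]
    if len(items) != len(utilities):
        raise ValueError("each item needs exactly one utility value")
    if len(set(items)) != len(items):
        raise ValueError("duplicate items are not allowed in a transaction")
    return items, int(total_field), utilities


rows = [parse_utility_row(line) for line in SAMPLE_ROWS.splitlines()]
for items, total, utilities in rows:
    print(items, total, utilities)
```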
The performance of a pattern mining algorithm depends primarily on the statistical nature of the database. Thus, it is important to know the following details of a database:
The sample code is provided below:
import PAMI.extras.dbStats.UtilityDatabase as stats
obj = stats.UtilityDatabase('sampleUtility.txt', ' ')
obj.run()
obj.printStats()
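Which statistics `printStats()` reports is not listed here. As a rough, library-independent illustration, the sketch below computes a few figures that commonly affect mining performance (database size, number of distinct items, transaction lengths, and per-item support) from the item lists of the ten sample transactions. The `summarize` function and its output keys are invented for this example and are not part of PAMI.

```python
from collections import Counter
from statistics import mean

# Item lists of the ten sample transactions (utilities omitted).
transactions = [
    "a b c g", "b c d e", "a b c d", "a c d f", "a b c d g",
    "c d e f", "a b c d", "a e f", "a b c d", "b c d e",
]


def summarize(rows, sep=" "):
    """Return a few basic statistics about a list of transactions."""
    item_lists = [row.split(sep) for row in rows]
    lengths = [len(items) for items in item_lists]
    support = Counter(item for items in item_lists for item in items)
    return {
        "transactions": len(item_lists),
        "distinct_items": len(support),
        "min_length": min(lengths),
        "avg_length": mean(lengths),
        "max_length": max(lengths),
        "item_support": dict(support),
    }


summary = summarize(transactions)
for key, value in summary.items():
    print(key, value)
```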
Algorithms that mine high utility frequent patterns require a utility database, minUtil, and minSup, all specified by the user.
The utility database can be given in any of the following formats:
- String: e.g., ‘utilityDatabase.txt’
- URL: e.g., https://u-aizu.ac.jp/~udayrage/datasets/utilityDatabases/utility_T10I4D100K.csv
- DataFrame: a dataframe variable with the headings Transactions, Utilities, and TransactionUtility
minUtil can be specified as:
- a value in [0, 1]
minSup can be specified as:
- a count (between 0 and the length of the database), or
- a value in [0, 1]
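A threshold such as minSup may thus be given either as an absolute count or as a value in [0, 1]. How PAMI resolves a fractional value internally is not described here. The helper below is a hypothetical sketch of one reasonable convention: an integer is taken as a count, while a float is read as a fraction of the database size and rounded up.

```python
import math


def to_count(threshold, database_size):
    """Hypothetical helper: turn a count-or-fraction threshold into a count."""
    if isinstance(threshold, int):
        # Integers are interpreted as absolute counts.
        if not 0 <= threshold <= database_size:
            raise ValueError("count must lie between 0 and the database size")
        return threshold
    if 0.0 <= threshold <= 1.0:
        # Floats are interpreted as a fraction of the database size (assumption).
        return math.ceil(threshold * database_size)
    raise ValueError("fractional threshold must lie in [0, 1]")


print(to_count(5, 10))
print(to_count(0.5, 10))
```

Under this convention the integer `1` means one transaction, whereas the float `1.0` means every transaction, so the type of the argument matters.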
The patterns discovered by a high utility frequent pattern mining algorithm can be saved to a file or a data frame.
foo@bar: cd PAMI/highUtilityFrequentPattern/basic
foo@bar: python3 algorithmName.py inputFile outputFile minUtil minSup separator
Example: python3 HUFIM.py inputFile.txt outputFile.txt 20 5 ' '
import PAMI.highUtilityFrequentPattern.basic.HUFIM as alg
iFile = 'sampleUtility.txt' # specify the input utility database
minUtil = 25 # specify the minUtil value
minSup = 5 # specify the minSup value
separator = ' ' # specify the separator; the default separator is a tab
oFile = 'utilityfrequentPatterns.txt' # specify the output file name
obj = alg.HUFIM(iFile, minUtil, minSup, separator) # initialize the algorithm
obj.mine() # start the mining process
obj.save(oFile) # store the patterns in a file
df = obj.getPatternsAsDataFrame() # get the discovered patterns as a dataframe
obj.printResults() # print the statistics of the mining process
High Utility Frequent patterns were generated successfully using HUFIM algorithm
Total number of High Utility Frequent Patterns: 7
Total Memory in USS: 81223680
Total Memory in RSS 119382016
Total ExecutionTime in seconds: 0.0004372596740722656
!cat utilityfrequentPatterns.txt
# The format of the file is pattern:utility:support
c d:35:8
c d a:34:5
c d b:39:6
c a:27:6
c a b:30:5
c b:29:7
d b:25:6
df
#The dataframe containing the patterns is shown below.
| | Patterns | Utility | Support |
|---|---|---|---|
0 | c d | 35 | 8 |
1 | c d a | 34 | 5 |
2 | c d b | 39 | 6 |
3 | c a | 27 | 6 |
4 | c a b | 30 | 5 |
5 | c b | 29 | 7 |
6 | d b | 25 | 6 |
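As a cross-check, the seven patterns above can be reproduced without PAMI by brute force: enumerate every itemset, sum the utilities of its items over the transactions that contain it, count those transactions, and keep the itemsets that meet both thresholds. The sketch below does this for the ten sample transactions with minUtil = 25 and minSup = 5. It is exponential in the number of items and only illustrates the definition; it is not how HUFIM works, and the function name `brute_force_hufp` is made up for this example.

```python
from itertools import combinations

# Item utilities of the ten sample transactions (item -> utility).
database = [
    {"a": 2, "b": 3, "c": 1, "g": 1},
    {"b": 3, "c": 2, "d": 3, "e": 2},
    {"a": 2, "b": 1, "c": 3, "d": 4},
    {"a": 3, "c": 2, "d": 1, "f": 2},
    {"a": 3, "b": 1, "c": 2, "d": 1, "g": 2},
    {"c": 2, "d": 2, "e": 3, "f": 1},
    {"a": 2, "b": 1, "c": 1, "d": 2},
    {"a": 1, "e": 2, "f": 2},
    {"a": 2, "b": 2, "c": 4, "d": 2},
    {"b": 3, "c": 2, "d": 2, "e": 2},
]


def brute_force_hufp(db, min_util, min_sup):
    """Return {itemset: (utility, support)} for itemsets meeting both thresholds."""
    items = sorted({item for transaction in db for item in transaction})
    found = {}
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            # Transactions that contain every item of the candidate itemset.
            covering = [t for t in db if all(item in t for item in itemset)]
            support = len(covering)
            utility = sum(t[item] for t in covering for item in itemset)
            if support >= min_sup and utility >= min_util:
                found[itemset] = (utility, support)
    return found


patterns = brute_force_hufp(database, min_util=25, min_sup=5)
for itemset, (utility, support) in sorted(patterns.items()):
    print(" ".join(itemset), utility, support, sep=":")
```

The items inside each pattern are printed in alphabetical order, so `c d a` in the PAMI output corresponds to `a c d` here; the utility and support values are the same.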