Mining High-Utility Patterns in Utility Databases

PAMI is a Python library containing 100+ algorithms to discover useful patterns in various databases across multiple computing platforms. (Active)

Mining High-Utility Patterns in Utility Databases

1. What is High-Utility pattern mining?

Frequent pattern mining (FPM) is an important knowledge discovery technique with many real-world applications. It involves finding all frequently occurring patterns in a (binary) transactional database. A fundamental limitation of this technique is its inability to find interesting patterns in a quantitative transactional database. High utility pattern mining (HUPM) has been introduced to tackle this limitation.

HUPM generalizes the model of frequent patterns by discovering all patterns that have utility more than the user-specified minimum utility (minUtil) in a quantitative transactional database. The minUtil constraint controls the minimum utility (or value) of a pattern must maintain in a database.

Reference: Hong Yao and Howard J. Hamilton. 2006. Mining itemset utilities from transaction databases. Data Knowl. Eng. 59, 3 (December 2006), 603–626. Link

2. What is a utility database?

A utility database consists of an ‘internal utility database’ and an ‘external utility database’.

In an internal utility database, every transaction contains a set of items and a positive integer called internal utility respectively.

In an external utility database, every transaction contains an item and it external utility value.

An hypothetical internal utility database is shown in below table.

Transactions
(a,2) (b,3) (c,1) (g,1)
(b,3) (c,2) (d,3) (e,2)
(a,2) (b,1) (c,3) (d,4)
(a,3) (c,2) (d,1) (f,2)
(a,3) (b,1) (c,2) (d,1) (g,2)
(c,2) (d,2) (e,3) (f,1)
(a,2) (b,1) (c,1) (d,2)
(a,1) (e,2) (f,2)
(a,2) (b,2) (c,4) (d,2)
(b,3) (c,2) (d,2) (e,2)

A hypothetical external utility database is shown in below table.

Item Profit
a 4
b 3
c 6
d 2
e 5
f 2
g 3

Note: Duplicate items must not exist in a transaction.

3. What is acceptable format of a utility databases in PAMI?

Each row in a utility database must contain the following information:

All of the above three fields have to be seperated using the colan symbol.

A sample utility database, say sampleUtility.txt, is shown below:

a b c g:7:2 3 1 1
b c d e:10:3 2 3 2
a b c d:10:2 1 3 4
a c d f:7:3 2 1 2
a b c d g:9:3 1 2 1 2
c d e f:8:2 2 3 1
a b c d:6:2 1 1 2
a e f:5:1 2 2
a b c d:10:2 2 4 2
b c d e:9:3 2 2 2

4. What is the need for understanding the statistics of database?

The performance of a pattern mining algorithm primarily depends on the satistical nature of a database. Thus it is important to know the following details of a database:

The below sample code prints the statistical details of a database.

import PAMI.extras.dbStats.UtilityDatabase as stats

obj = stats.UtilityDatabase('sampleUtility.txt', ' ')
obj.run()
obj.printStats()

5. What are the input parameters to a high utility pattern mining algorithm?

The input parameters to a frequent pattern mining algorithm are:

6. How to store the output of a high utility pattern mining algorithm?

The patterns discovered by a high utility pattern mining algorithm can be saved into a file or a data frame.

How to run the high utility pattern mining algorithms in a terminal?

Example: python3 EFIM.py inputFile.txt outputFile.txt $20$   ' '

7. How to execute a High utility pattern mining algorithm in a Jupyter Notebook?

import PAMI.highUtilityPattern.basic.EFIM as alg 

iFile = 'sampleUtility.txt'  #specify the input utility database <
minUtil = 20                #specify the minSupvalue
seperator = ' '               #specify the seperator. Default seperator is tab space. 
oFile = 'utilityPatterns.txt'   #specify the output file name

obj = alg.EFIM(iFile, minUtil, seperator) #initialize the algorithm 
obj.startMine()                       #start the mining process 
obj.save(oFile)               #store the patterns in file 
df = obj.getPatternsAsDataFrame()     #Get the patterns discovered into a dataframe 
obj.printResults()                      #Print the stats of mining process
High Utility patterns were generated successfully using EFIM algorithm
Total number of High Utility Patterns: 11
Total Memory in USS: 110927872
Total Memory in RSS 149282816
Total ExecutionTime in seconds: 0.0006177425384521484

The utilityPatterns.txt file contains the following patterns (format: pattern:utility):!cat utilityPatterns.txt

!cat utilityPatterns.txt
e	d	c:20 
a	b	d:23 
a	b	d	c:33 
a	b	c:30 
a	d:22 
a	d	c:34 
a	c:27 
b	d:25 
b	d	c:39 
b	c:29 
d	c:35 

The dataframe containing the patterns is shown below:

df
Patterns Utility
0 e d c 20
1 a b d 23
2 a b d c 33
3 a b c 30
4 a d 22
5 a d c 34
6 a c 27
7 b d 25
8 b d c 39
9 b c 29
10 d c 35