Data mining algorithms: Association rules

Motivation and terminology

  1. Data mining perspective
  2. Machine Learning approach: treat every possible combination of attribute values as a separate class, learn rules using the rest of attributes as input and then evaluate them for support and confidence. Problem: computationally intractable (too many classes and consequently, too many rules).
  3. Basic terminology:
    1. Tuples are transactions, attribute-value pairs are items.
    2. Association rule: {A,B,C,D,...} => {E,F,G,...}, where A,B,C,D,E,F,G,... are items.
    3. Confidence (accuracy) of A => B : P(B|A) = (# of transactions containing both A and B) / (# of transactions containing A).
    4. Support (coverage) of A => B : P(A,B) = (# of transactions containing both A and B) / (total # of transactions)
    5. We are looking for rules that exceed a pre-defined support threshold (minimum support) and have high confidence.
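These definitions can be sketched in a few lines of Python (the function names and the toy market-basket data are illustrative, not from the text):

```python
def support(transactions, items):
    """Fraction of transactions containing every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """P(rhs | lhs): transactions with both sides / transactions with lhs."""
    lhs, rhs = set(lhs), set(rhs)
    both = sum((lhs | rhs) <= t for t in transactions)
    lhs_only = sum(lhs <= t for t in transactions)
    return both / lhs_only

# Toy data: each transaction is a set of items.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
]
print(support(transactions, {"A", "B"}))       # 2/4 = 0.5
print(confidence(transactions, {"A"}, {"B"}))  # 2/3 ≈ 0.667
```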

Example

  1. Load the weather data in Weka (click on Preprocess and then on Open file... weather.nominal.arff). The data are shown below in tabular form.
 
outlook temperature humidity windy play
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no

Click on Associate and then on Start. You get the following 10 association rules in the Associator output window:

 1. humidity=normal windy=FALSE 4 ==> play=yes 4    conf:(1)
 2. temperature=cool 4 ==> humidity=normal 4    conf:(1)
 3. outlook=overcast 4 ==> play=yes 4    conf:(1)
 4. temperature=cool play=yes 3 ==> humidity=normal 3    conf:(1)
 5. outlook=rainy windy=FALSE 3 ==> play=yes 3    conf:(1)
 6. outlook=rainy play=yes 3 ==> windy=FALSE 3   conf:(1)
 7. outlook=sunny humidity=high 3 ==> play=no 3    conf:(1)
 8. outlook=sunny play=no 3 ==> humidity=high 3    conf:(1)
 9. temperature=cool windy=FALSE 2 ==> humidity=normal play=yes 2    conf:(1)
10. temperature=cool humidity=normal windy=FALSE 2 ==> play=yes 2    conf:(1)
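These rules can be checked against the table by hand. A small sketch, representing each row as a set of attribute=value strings (the capitalization of windy follows Weka's output) and verifying rule 1:

```python
# The 14 weather rows from the table above.
rows = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]
attrs = ("outlook", "temperature", "humidity", "windy", "play")
transactions = [{f"{a}={v}" for a, v in zip(attrs, row)} for row in rows]

# Rule 1: humidity=normal windy=FALSE ==> play=yes
lhs = {"humidity=normal", "windy=FALSE"}
rhs = {"play=yes"}
n_lhs = sum(lhs <= t for t in transactions)
n_both = sum((lhs | rhs) <= t for t in transactions)
print(n_lhs, n_both, n_both / n_lhs)  # 4 4 1.0, matching "4 ==> ... 4  conf:(1)"
```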

Basic idea: item sets

  1. Item set: the set of all items in a rule (both LHS and RHS).
  2. Item sets for the weather data: 12 one-item sets (3 values for outlook + 3 for temperature + 2 for humidity + 2 for windy + 2 for play), 47 two-item sets, 39 three-item sets, 6 four-item sets, and 0 five-item sets (with a minimum support of two).
One-item sets:
  Outlook = Sunny (5)
  Temperature = Cool (4)
  ...

Two-item sets:
  Outlook = Sunny, Temperature = Mild (2)
  Outlook = Sunny, Humidity = High (3)
  ...

Three-item sets:
  Outlook = Sunny, Temperature = Hot, Humidity = High (2)
  Outlook = Sunny, Humidity = High, Windy = False (2)
  ...

Four-item sets:
  Outlook = Sunny, Temperature = Hot, Humidity = High, Play = No (2)
  Outlook = Rainy, Temperature = Mild, Windy = False, Play = Yes (2)
  ...
  3. Generating rules from item sets: once all item sets with the minimum support have been generated, we can turn them into rules.
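A minimal sketch of this step in Python: every split of a frequent item set into a non-empty LHS and RHS is a candidate rule, and only those meeting a minimum confidence are kept (the function name and toy data are illustrative):

```python
from itertools import combinations

def rules_from_itemset(itemset, transactions, min_conf=0.9):
    """Split a frequent item set into LHS ==> RHS candidates and
    keep those whose confidence meets min_conf."""
    items = tuple(itemset)
    count = lambda s: sum(set(s) <= t for t in transactions)
    total = count(items)  # support count of the whole item set
    rules = []
    for r in range(1, len(items)):            # every possible LHS size
        for lhs in combinations(items, r):
            rhs = tuple(i for i in items if i not in lhs)
            conf = total / count(lhs)
            if conf >= min_conf:
                rules.append((lhs, rhs, conf))
    return rules

transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "B", "C"}, {"B", "C"}]
for lhs, rhs, conf in rules_from_itemset({"A", "B"}, transactions, 0.9):
    print(lhs, "==>", rhs, conf)
```

Here {A} ==> {B} survives (confidence 3/3 = 1.0) while {B} ==> {A} is dropped (confidence 3/4 = 0.75).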

Generating item sets efficiently

  1. Frequent item sets: item sets that meet the pre-defined minimum support.
  2. Observation: if {A,B} is a frequent item set, then both {A} and {B} are frequent item sets too. The converse, however, is not true (find a counter-example).
  3. Basic idea (Apriori algorithm): generate candidate (k+1)-item sets only from frequent k-item sets (justified by the observation above), then count their support in one pass over the data; repeat for increasing k until no new frequent item sets are found.
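The Apriori idea can be sketched as a minimal Python version, assuming transactions are sets of items (variable names are illustrative):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all frequent item sets as {frozenset: support count}."""
    # Frequent 1-item sets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 1
    while frequent:
        # Candidate (k+1)-sets: unions of frequent k-sets, pruned so that
        # every k-subset is itself frequent (the Apriori property).
        candidates = set()
        freq_sets = list(frequent)
        for i in range(len(freq_sets)):
            for j in range(i + 1, len(freq_sets)):
                union = freq_sets[i] | freq_sets[j]
                if len(union) == k + 1 and all(
                    frozenset(sub) in frequent
                    for sub in combinations(union, k)
                ):
                    candidates.add(union)
        # Count candidate support in one pass over the data.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result

transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"},
                {"B", "C"}, {"A", "B", "C"}]
freq = apriori(transactions, min_support=3)
# {A}, {B}, {C} and all three pairs are frequent; {A,B,C} occurs only
# twice, so it is pruned at the counting step.
```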

Generating rules efficiently

  1. Brute-force method (for small item sets): try every split of the item set into a non-empty LHS and RHS, and keep the rules that meet the minimum confidence.
  2. Better way: iterative rule generation, building candidate rules with larger consequents from accepted rules with smaller consequents and checking each against the minimum confidence.
  3. Weka's approach (default settings for Apriori): generate the best 10 rules. Begin with a minimum support of 100% and decrease it in steps of 5%; stop when 10 rules have been generated or the support falls below 10%. The minimum confidence is 90%.

Advanced association rules

  1. Multi-level association rules: using concept hierarchies.
  2. Approaches to mining multi-level association rules.
  3. Interpretation of association rules.
  4. Handling numeric attributes.

Correlation analysis

  1. High support and high confidence rules are not necessarily interesting. Example:
  2. Support-confidence framework: confidence is only an estimate of the conditional probability of B given A. We need a measure of the certainty of the implication A => B, that is, whether A implies B and to what extent.
  3. Correlation between occurrences of A and B: corr(A,B) = P(A,B) / (P(A) P(B)), also called lift. A value of 1 indicates independence, below 1 negative correlation, above 1 positive correlation.
  4. Contingency table:
     
                  outlook=sunny   outlook<>sunny   Row total
    play=yes            2               7              9
    play=no             3               2              5
    Column total        5               9             14
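From this table the correlation (lift) of outlook=sunny and play=yes can be computed directly; a small sketch:

```python
# Counts taken from the contingency table above.
n_both = 2    # outlook=sunny and play=yes
n_sunny = 5   # column total for outlook=sunny
n_yes = 9     # row total for play=yes
n = 14        # total number of transactions

support = n_both / n                             # P(A, B) = 2/14
confidence = n_both / n_sunny                    # P(B | A) = 2/5
lift = support / ((n_sunny / n) * (n_yes / n))   # P(A,B) / (P(A) P(B))
print(round(lift, 3))  # 0.622: below 1, so outlook=sunny and play=yes
                       # occur together LESS often than independence predicts
```

So although the rule outlook=sunny ==> play=yes has non-trivial support and confidence, the two events are in fact negatively correlated.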