## Data mining algorithms: Prediction

1. Supervised learning task: predict the class value of a new instance from the data (in instance-based methods the data are used directly, with no explicit model created).
2. Basic approaches:
• Instance-based (nearest neighbor)
• Statistical (naive Bayes)
• Bayesian networks
• Regression (a kind of concept learning for a continuous class)

### Statistical modeling

1. Basic assumptions
• Opposite of OneR: use all the attributes
• Attributes are assumed to be:
• equally important: all attributes have the same relevance to the classification task.
• statistically independent (given the class value): knowledge about the value of a particular attribute doesn't tell us anything about the value of another attribute (if the class is known).
• Although based on assumptions that are almost never correct, this scheme works well in practice!
2. Probabilities of weather data

| outlook | temp | humidity | windy | play |
|----------|------|----------|-------|------|
| sunny | hot | high | false | no |
| sunny | hot | high | true | no |
| overcast | hot | high | false | yes |
| rainy | mild | high | false | yes |
| rainy | cool | normal | false | yes |
| rainy | cool | normal | true | no |
| overcast | cool | normal | true | yes |
| sunny | mild | high | false | no |
| sunny | cool | normal | false | yes |
| rainy | mild | normal | false | yes |
| sunny | mild | normal | true | yes |
| overcast | mild | high | true | yes |
| overcast | hot | normal | false | yes |
| rainy | mild | high | true | no |

3. Conditional probabilities and classification of a new instance:

• outlook = sunny  [yes (2/9); no (3/5)];
• temperature = cool  [yes (3/9); no (1/5)];
• humidity = high [yes (3/9); no (4/5)];
• windy = true [yes (3/9); no (3/5)];
• play = yes [(9/14)]
• play = no [(5/14)]
• New instance: [outlook=sunny, temp=cool, humidity=high, windy=true, play=?]
• Likelihood of the two classes (play=yes; play=no):
• yes = (2/9)*(3/9)*(3/9)*(3/9)*(9/14) = 0.0053;
• no = (3/5)*(1/5)*(4/5)*(3/5)*(5/14) = 0.0206;
• Conversion into probabilities by normalization:
• P(yes) = 0.0053 / (0.0053 + 0.0206) = 0.205
• P(no) = 0.0206 / (0.0053 + 0.0206) = 0.795
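The likelihood calculation above can be reproduced in a few lines of Python, with counts taken straight from the table:

```python
# Naive Bayes by hand for the weather example above; the instance is
# [outlook=sunny, temp=cool, humidity=high, windy=true].

# Product of P(attribute value | class) times the class prior
likelihood_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)
likelihood_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)

# Normalize so the two probabilities sum to 1
total = likelihood_yes + likelihood_no
p_yes = likelihood_yes / total
p_no  = likelihood_no / total

print(round(likelihood_yes, 4))  # 0.0053
print(round(likelihood_no, 4))   # 0.0206
print(round(p_yes, 3))           # 0.205
print(round(p_no, 3))            # 0.795
```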
4. Bayes theorem (Bayes rule)
• Probability of event H, given evidence E: P(H|E) = P(E|H) * P(H) / P(E);
• P(H): a priori probability of H (probability of event before evidence has been seen);
• P(H|E): a posteriori (conditional) probability of H (probability of event after evidence has been seen);
5. Bayes for classification
• What is the probability of the class given an instance?
• Evidence E = instance
• Event H = class value for instance
• Naïve Bayes assumption: evidence can be split into independent parts (attributes of the instance).
• E = [A1,A2,...,An]
• P(E|H) = P(A1|H)*P(A2|H)*...*P(An|H)
• Bayes: P(H|E) = P(A1|H)*P(A2|H)*...*P(An|H)*P(H) / P(E)
• Weather data:
• E = [outlook=sunny, temp=cool, humidity=high, windy=true]
• P(yes|E) = P(outlook=sunny|yes) * P(temp=cool|yes) * P(humidity=high|yes) * P(windy=true|yes) * P(yes) / P(E) = (2/9)*(3/9)*(3/9)*(3/9)*(9/14) / P(E)
6. The zero-frequency problem
• What if an attribute value doesn't occur with some class value (e.g. outlook = overcast for class no)?
• The estimated probability will be zero: P(outlook=overcast|no) = 0/5 = 0;
• The a posteriori probability will also be zero: P(no|E) = 0 (no matter how likely the other values are!)
• Remedy: add 1 to the count for every attribute value-class combination (the Laplace estimator): the estimate becomes (p+1) / (n+k), where p is the count, n is the number of instances of the class, and k is the number of values of the attribute.
• Result: probabilities will never be zero! (also stabilizes probability estimates)
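A tiny sketch of Laplace smoothing, assuming the common form where 1 is added to each count and the number of attribute values to the denominator; the helper name is illustrative:

```python
# Laplace-smoothed estimate of P(outlook = overcast | play = no).
# The raw count is 0/5; adding 1 to each of the three value counts
# gives (0 + 1) / (5 + 3), so the probability is never zero.

def laplace(count, class_total, n_values):
    """(count + 1) / (class_total + number of attribute values)."""
    return (count + 1) / (class_total + n_values)

p = laplace(0, 5, 3)   # outlook has 3 values: sunny, overcast, rainy
print(p)               # 0.125
```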
7. Missing values
• Calculating probabilities: instance is not included in frequency count for attribute value-class combination.
• Classification: attribute will be omitted from calculation
• Example: [outlook=?, temp=cool, humidity=high, windy=true, play=?]
• Likelihood of yes = (3/9)*(3/9)*(3/9)*(9/14) = 0.0238;
• Likelihood of no = (1/5)*(4/5)*(3/5)*(5/14) = 0.0343;
• P(yes) = 0.0238 / (0.0238 + 0.0343) = 0.41
• P(no) = 0.0343 / (0.0238 + 0.0343) = 0.59
8. Numeric attributes
• Assumption: attributes have a normal or Gaussian probability distribution (given the class)
• Parameters involved: mean, standard deviation, and the probability density function.
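A minimal sketch of the class-conditional density under the normality assumption; the mean (73) and standard deviation (6.2) are illustrative numbers, not derived from the table above:

```python
import math

# Density of a numeric attribute value under the class-conditional
# Gaussian N(mean, std); used in place of a frequency-based probability.
def gaussian_density(x, mean, std):
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

# Hypothetical example: temperature with mean 73 and std 6.2 for play = yes
print(round(gaussian_density(66, 73.0, 6.2), 3))  # 0.034
```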
9. Discussion
• Naïve Bayes works surprisingly well (even if independence assumption is clearly violated).
• Why? Because classification doesn't require accurate probability estimates as long as the maximum probability is assigned to the correct class.
• Adding too many redundant attributes will cause problems (e. g. identical attributes).
• Numeric attributes are often not normally distributed.
• Yet another problem: estimating prior probability is difficult.

### Bayesian networks

1. Basics of BN
• Define joint conditional probabilities.
• Combine Bayesian reasoning with causal relationships between attributes.
• Also known as belief networks, probabilistic networks.
• Defined by:
• Directed acyclic graph, with nodes representing random variables and links - probabilistic dependence.
• Conditional probability tables (CPT) for each variable (node): each specifies P(X|parents(X)), i.e. the probability of each value of X, given every possible combination of values for its parents.
• Reasoning: given the probabilities at some nodes (inputs) BN calculates the probabilities in other nodes (outputs).
• Classification: inputs - attribute values, output - class value probability.
• There are mechanisms for training BN from examples, given variables and network structure, i.e. creating CPT's.
2. Example:
• Variables: Burglary (B), Earthquake (E), Alarm (A), JohnCalls (J), MaryCalls (M)
• Structure ("->" denotes causal relation): Burglary -> Alarm; Earthquake -> Alarm; Alarm -> JohnCalls; Alarm -> MaryCalls.
• CPT's (for brevity, probability of false is not given, rows must sum to 1):
• P(B) = 0.001
• P(E) = 0.002
| B | E | P(A) |
|---|---|------|
| T | T | 0.95 |
| T | F | 0.94 |
| F | T | 0.29 |
| F | F | 0.001 |

| A | P(J) |
|---|------|
| T | 0.90 |
| F | 0.05 |

| A | P(M) |
|---|------|
| T | 0.70 |
| F | 0.01 |

• Calculation of joint probabilities (~ means not): P(J, M, A, ~B, ~E) = P(J|A) * P(M|A) * P(A|~B and ~E) * P(~B) * P(~E) = 0.9 * 0.7 * 0.001 * 0.999 * 0.998 = 0.000628.
• Reasoning (using complete joints distribution or other more efficient methods):
• Diagnostic (from effect to cause): P(B|J) = 0.016; P(B|J and M) = 0.29; P(A|J and M) = 0.76
• Predictive (from cause to effect): P(J|B) = 0.86; P(M|B) = 0.67;
• Other: intercausal P(B|A), mixed P(A|J and ~E)
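The joint-probability factorization above can be checked directly from the CPTs:

```python
# Joint probability in the burglary network, following the factorization
# P(J, M, A, ~B, ~E) = P(J|A) * P(M|A) * P(A|~B,~E) * P(~B) * P(~E).
# CPT values are the ones given above.

p_b, p_e = 0.001, 0.002
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A | B, E)
p_j = {True: 0.90, False: 0.05}                      # P(J | A)
p_m = {True: 0.70, False: 0.01}                      # P(M | A)

joint = p_j[True] * p_m[True] * p_a[(False, False)] * (1 - p_b) * (1 - p_e)
print(round(joint, 6))  # 0.000628
```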
3. Naive Bayes as a BN
• Variables: play, outlook, temp, humidity, windy.
• Structure: play -> outlook, play -> temp, play -> humidity, play -> windy.
• CPT's:
• play: P(play=yes)=9/14; P(play=no)=5/14;
• outlook:
• P(outlook=overcast | play=yes) = 4/9
• P(outlook=sunny | play=yes) = 2/9
• P(outlook=rainy | play=yes) = 3/9
• P(outlook=overcast | play=no) = 0/5
• P(outlook=sunny | play=no) = 3/5
• P(outlook=rainy | play=no) = 2/5
• ...

### Instance-based methods

1. Distance function defines what's learned.
• Most instance-based schemes use Euclidean distance (for numeric attributes): D(X,Y) = sqrt[(x1-y1)^2 + (x2-y2)^2 + ... + (xn-yn)^2], where X = {x1, x2, ..., xn} and Y = {y1, y2, ..., yn}. Taking the square root is not required when comparing distances.
• Another popular metric is city-block (Manhattan) distance: D(X,Y) = |x1-y1| + |x2-y2| + ... + |xn-yn|.
• As different attributes use different scales, normalization is required: Vnorm = (V - Vmin) / (Vmax - Vmin). Thus Vnorm lies within [0,1].
• Nominal attributes: count the number of differences, i.e. city-block distance where |xi-yi| = 0 if xi = yi and 1 if xi ≠ yi.
• Missing attribute values: assumed to be maximally distant (given normalized attributes).
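The two building blocks, nominal distance and min-max normalization, can be sketched as:

```python
# Distance between two instances with nominal attributes: the number of
# mismatches (city-block distance with a 0/1 contribution per attribute).
def nominal_distance(x, y):
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

# Min-max normalization for a numeric value, so every attribute lies in [0, 1]
def normalize(v, v_min, v_max):
    return (v - v_min) / (v_max - v_min)

x = ("sunny", "cool", "high", "true")
y = ("sunny", "hot", "high", "true")
print(nominal_distance(x, y))  # 1 (only temp differs)
print(normalize(5, 0, 10))     # 0.5
```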
2. Example: weather data
| ID | outlook | temp | humidity | windy | play |
|----|----------|------|----------|-------|------|
| 1 | sunny | hot | high | false | no |
| 2 | sunny | hot | high | true | no |
| 3 | overcast | hot | high | false | yes |
| 4 | rainy | mild | high | false | yes |
| 5 | rainy | cool | normal | false | yes |
| 6 | rainy | cool | normal | true | no |
| 7 | overcast | cool | normal | true | yes |
| 8 | sunny | mild | high | false | no |
| 9 | sunny | cool | normal | false | yes |
| 10 | rainy | mild | normal | false | yes |
| 11 | sunny | mild | normal | true | yes |
| 12 | overcast | mild | high | true | yes |
| 13 | overcast | hot | normal | false | yes |
| 14 | rainy | mild | high | true | no |
| X | sunny | cool | high | true | ? |

• The nearest neighbors of X:

| ID | D(X, ID) | play |
|----|----------|------|
| 2 | 1 | no |
| 8 | 2 | no |
| 9 | 2 | yes |
| 11 | 2 | yes |
3. Discussion
• Instance space: Voronoi diagram
• 1-NN is very accurate but also slow: scans entire training data to derive a prediction (possible improvements: use a sample)
• Assumes all attributes are equally important. Remedy: attribute selection or weights (see attribute relevance).
• Dealing with noise (wrong values of some attributes)
• Taking a majority vote over the k nearest neighbors (k-NN).
• Removing noisy instances from dataset (difficult!)
• Numeric class attribute: take the mean of the class values of the k nearest neighbors.
• k-NN has been used by statisticians since the early 1950s. Question: how to choose k?
• Distance weighted k-NN:
• Weight each vote or class value (for numeric) with the distance.
• For example: instead of summing up votes, sum up 1 / D(X,Y) or 1 / D(X,Y)^2.
• Then it makes sense to use all instances (k=n).
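Distance-weighted voting can be sketched as follows, using the nominal mismatch distance and a subset of the weather data from the example above (the function names are illustrative):

```python
from collections import Counter

# Distance-weighted vote: each neighbor votes for its class with weight
# 1 / distance; a distance of zero is an exact match and wins outright.

def nominal_distance(x, y):
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

def weighted_nn(query, data):
    votes = Counter()
    for instance, label in data:
        d = nominal_distance(query, instance)
        if d == 0:
            return label           # exact match
        votes[label] += 1.0 / d
    return votes.most_common(1)[0][0]

# The four nearest neighbors of X from the weather-data example
data = [(("sunny", "hot",  "high",   "true"),  "no"),
        (("sunny", "mild", "high",   "false"), "no"),
        (("sunny", "cool", "normal", "false"), "yes"),
        (("sunny", "mild", "normal", "true"),  "yes")]
query = ("sunny", "cool", "high", "true")
print(weighted_nn(query, data))  # no  (weight 1 + 1/2 = 1.5 vs 1/2 + 1/2 = 1)
```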

### Linear models

1. Basic idea
• Work most naturally with numeric attributes. The standard technique for numeric prediction is linear regression.
• Predicted class value is linear combination of attribute values (ai): C =  w0*a0 + w1*a1 + w2*a2 + ... + wk*ak. For k attributes we have k+1 coefficients. To simplify notation we add a0 that is always 1.
• Squared error: the sum over all instances of (actual class value - predicted class value)^2.
• Deriving the coefficients (wi): minimizing squared error on training data. Using standard numerical analysis techniques (matrix operations). Can be done if there are more instances than attributes (roughly speaking).
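A minimal least-squares sketch using NumPy (an assumption; any linear-algebra routine would do), on a made-up dataset where C = 2 + 3*a1 exactly:

```python
import numpy as np

# Least-squares fit of C = w0*a0 + w1*a1 with a0 = 1 (the bias column),
# minimizing the squared error on the training data.
A = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # a0 = 1 column
c = np.array([2.0, 5.0, 8.0, 11.0])                             # C = 2 + 3*a1

w, residuals, rank, _ = np.linalg.lstsq(A, c, rcond=None)
print(np.round(w, 6))  # [2. 3.]
```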
2. Classification by linear regression
• Multi-response linear regression (learning a membership function for each class)
• Training: perform a regression (create a model) for each class, setting the output to 1 for training instances that belong to the class, and 0 for those that do not.
• Prediction: predict the class corresponding to the model with largest output value
• Pairwise regression (designed especially for classification)
• Training: perform regression for every pair of classes assigning output 1 for one class and -1 for the other.
• Prediction: predict the class that receives most "votes" (outputs > 0) from the regression lines.
• More accurate than multi-response linear regression, however more computationally expensive.
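Multi-response linear regression can be sketched on a made-up one-attribute, two-class dataset (all names and data are illustrative; NumPy is assumed):

```python
import numpy as np

# One least-squares model per class, trained on 0/1 membership targets;
# prediction picks the class whose model outputs the largest value.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 9.0], [1.0, 10.0]])  # bias + 1 attribute
labels = ["a", "a", "b", "b"]          # class "a" near 0, class "b" near 10
classes = sorted(set(labels))

weights = {}
for cls in classes:
    t = np.array([1.0 if l == cls else 0.0 for l in labels])  # membership target
    weights[cls], *_ = np.linalg.lstsq(X, t, rcond=None)

def predict(x):
    return max(classes, key=lambda cls: x @ weights[cls])

print(predict(np.array([1.0, 0.5])))   # a
print(predict(np.array([1.0, 9.5])))   # b
```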
3. Discussion
• Creates a hyperplane for any two classes
• Pairwise: the regression line between the two classes
• Multi-response: (w0-v0)*a0 + (w1-v1)*a1 + ... + (wk-vk)*ak,  where wi and vi are the coefficients of the models for the two classes.
• Not appropriate if data exhibits non-linear dependencies. For example, instances that cannot be separated by a hyperplane. Classical example: XOR function.