Data mining algorithms: Prediction
The prediction task
- Supervised learning task where the data are used directly (no explicit model is created) to predict the class value of a new instance.
- Basic approaches:
  - Instance-based (nearest neighbor)
  - Statistical (naive Bayes)
  - Bayesian networks
  - Regression (a kind of concept learning for a continuous class)
Statistical modeling
- Basic assumptions
- Opposite of OneR: use all the attributes.
- Attributes are assumed to be:
  - equally important: all attributes have the same relevance to the classification task;
  - statistically independent (given the class value): knowledge about the value of a particular attribute tells us nothing about the value of another attribute (if the class is known).
- Although based on assumptions that are almost never correct, this scheme works well in practice!
- Probabilities of weather data

outlook  | temp | humidity | windy | play
sunny    | hot  | high     | false | no
sunny    | hot  | high     | true  | no
overcast | hot  | high     | false | yes
rainy    | mild | high     | false | yes
rainy    | cool | normal   | false | yes
rainy    | cool | normal   | true  | no
overcast | cool | normal   | true  | yes
sunny    | mild | high     | false | no
sunny    | cool | normal   | false | yes
rainy    | mild | normal   | false | yes
sunny    | mild | normal   | true  | yes
overcast | mild | high     | true  | yes
overcast | hot  | normal   | false | yes
rainy    | mild | high     | true  | no
- outlook = sunny [yes (2/9); no (3/5)]
- temperature = cool [yes (3/9); no (1/5)]
- humidity = high [yes (3/9); no (4/5)]
- windy = true [yes (3/9); no (3/5)]
- play = yes [9/14]
- play = no [5/14]
- New instance: [outlook=sunny, temp=cool, humidity=high, windy=true, play=?]
- Likelihood of the two classes (play=yes; play=no):
  - yes = (2/9)*(3/9)*(3/9)*(3/9)*(9/14) = 0.0053
  - no = (3/5)*(1/5)*(4/5)*(3/5)*(5/14) = 0.0206
- Conversion into probabilities by normalization:
  - P(yes) = 0.0053 / (0.0053 + 0.0206) = 0.205
  - P(no) = 0.0206 / (0.0053 + 0.0206) = 0.795
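- The arithmetic above in a few lines of Python (a minimal sketch; the counts are read off the table and the variable names are mine):

    from math import prod

    # conditional probabilities for [outlook=sunny, temp=cool, humidity=high, windy=true]
    p_given_yes = [2/9, 3/9, 3/9, 3/9]        # P(sunny|yes), P(cool|yes), P(high|yes), P(true|yes)
    p_given_no  = [3/5, 1/5, 4/5, 3/5]        # the same conditionals for class no

    like_yes = prod(p_given_yes) * 9/14       # 0.0053
    like_no  = prod(p_given_no) * 5/14        # 0.0206

    p_yes = like_yes / (like_yes + like_no)   # 0.205
    p_no  = like_no / (like_yes + like_no)    # 0.795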
- Bayes theorem (Bayes rule)
- Probability of event H, given evidence E: P(H|E) = P(E|H) * P(H) / P(E)
  - P(H): a priori probability of H (probability of the event before evidence has been seen)
  - P(H|E): a posteriori (conditional) probability of H (probability of the event after evidence has been seen)
- Bayes for classification
- What is the probability of the class given an instance?
  - Evidence E = instance
  - Event H = class value for the instance
- Naïve Bayes assumption: the evidence can be split into independent parts (the attributes of the instance):
  - E = [A1, A2, ..., An]
  - P(E|H) = P(A1|H) * P(A2|H) * ... * P(An|H)
  - Bayes: P(H|E) = P(A1|H) * P(A2|H) * ... * P(An|H) * P(H) / P(E)
- Weather data:
  - E = [outlook=sunny, temp=cool, humidity=high, windy=true]
  - P(yes|E) = P(outlook=sunny|yes) * P(temp=cool|yes) * P(humidity=high|yes) * P(windy=true|yes) * P(yes) / P(E) = (2/9)*(3/9)*(3/9)*(3/9)*(9/14) / P(E)
- The “zero-frequency problem”
- What if an attribute value never occurs with a particular class value (e.g. outlook = overcast never occurs with class no)?
  - The conditional probability will be zero: P(outlook=overcast|no) = 0;
  - the a posteriori probability will also be zero: P(no|E) = 0 (no matter how likely the other values are!)
- Remedy: add 1 to the count for every attribute value-class combination (the Laplace estimator): the estimate becomes (count + 1) / (total + v), where v is the number of values of the attribute.
- Result: probabilities will never be zero! (This also stabilizes probability estimates.)
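- A minimal sketch of the Laplace estimator in Python (the function name is mine):

    def laplace(count, class_total, n_values):
        # add 1 to every attribute value-class count; n_values is the number of
        # values the attribute can take, so the smoothed estimates still sum to 1
        return (count + 1) / (class_total + n_values)

    # outlook = overcast never occurs with play = no (0 of 5 instances):
    print(laplace(0, 5, 3))   # overcast|no: 1/8 = 0.125 instead of 0
    print(laplace(3, 5, 3))   # sunny|no:    4/8 = 0.5
    print(laplace(2, 5, 3))   # rainy|no:    3/8 = 0.375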
- Missing values
- Calculating probabilities: the instance is not included in the frequency count for the attribute value-class combination.
- Classification: the attribute is omitted from the calculation.
- Example: [outlook=?, temp=cool, humidity=high, windy=true, play=?]
  - Likelihood of yes = (3/9)*(3/9)*(3/9)*(9/14) = 0.0238
  - Likelihood of no = (1/5)*(4/5)*(3/5)*(5/14) = 0.0343
  - P(yes) = 0.0238 / (0.0238 + 0.0343) = 0.41
  - P(no) = 0.0343 / (0.0238 + 0.0343) = 0.59
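- In code, omitting a missing attribute just means skipping its factor; a sketch (the helper name and data layout are mine):

    from math import prod

    def likelihood(instance, cond, prior):
        # multiply P(attr=value | class) over the known attributes only;
        # attributes with a missing value (None) contribute no factor
        return prior * prod(cond[(a, v)] for a, v in instance.items() if v is not None)

    cond_yes = {('temp', 'cool'): 3/9, ('humidity', 'high'): 3/9, ('windy', 'true'): 3/9}
    x = {'outlook': None, 'temp': 'cool', 'humidity': 'high', 'windy': 'true'}
    print(likelihood(x, cond_yes, 9/14))   # 0.0238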
- Numeric attributes
- Assumption: attributes have a normal (Gaussian) probability distribution, given the class.
- Parameters involved: the mean μ, the standard deviation σ, and the probability density function f(x) = exp(-(x-μ)^2 / (2σ^2)) / (sqrt(2π) * σ).
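- A sketch of the class-conditional density computation (the mean and standard deviation below are illustrative, not derived from the table above):

    from math import sqrt, pi, exp

    def gaussian_density(x, mu, sigma):
        # f(x) = exp(-(x - mu)^2 / (2*sigma^2)) / (sqrt(2*pi) * sigma)
        return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

    # e.g. density of temperature = 66 for a class with mu = 73, sigma = 6.2:
    print(gaussian_density(66, mu=73, sigma=6.2))   # ~0.034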
- Discussion
- Naïve Bayes works surprisingly well (even if the independence assumption is clearly violated).
- Why? Because classification doesn't require accurate probability estimates, as long as the maximum probability is assigned to the correct class.
- Adding too many redundant attributes will cause problems (e.g. identical attributes).
- Numeric attributes are often not normally distributed.
- Yet another problem: estimating prior probabilities is difficult.
- Advanced approaches: Bayesian networks.
Bayesian networks
- Basics of BN
- Define joint conditional probabilities.
- Combine Bayesian reasoning with causal relationships between attributes.
- Also known as belief networks or probabilistic networks.
- Defined by:
  - a directed acyclic graph, with nodes representing random variables and links representing probabilistic dependence;
  - conditional probability tables (CPTs), one for each variable (node): a CPT specifies P(X|parents(X)), i.e. the probability of each value of X given every possible combination of values of its parents.
- Reasoning: given the probabilities at some nodes (inputs), the BN calculates the probabilities at other nodes (outputs).
- Classification: inputs are attribute values; the output is the class value probability.
- There are mechanisms for training a BN from examples, given the variables and the network structure, i.e. for creating the CPTs.
- Example:
  - Variables: Burglary (B), Earthquake (E), Alarm (A), JohnCalls (J), MaryCalls (M)
  - Structure ("->" denotes a causal relation): Burglary -> Alarm; Earthquake -> Alarm; Alarm -> JohnCalls; Alarm -> MaryCalls.
  - CPT for Alarm (for brevity, the probability of false is not given; P(A) and P(~A) in each row sum to 1):

B E | P(A)
T T | 0.95
T F | 0.94
F T | 0.29
F F | 0.001

  - The joint calculation below also uses the priors P(B) = 0.001 and P(E) = 0.002, and the CPT entries P(J|A) = 0.9 and P(M|A) = 0.7.
- Calculation of joint probabilities (~ means "not"): P(J, M, A, ~B, ~E) = P(J|A) * P(M|A) * P(A|~B,~E) * P(~B) * P(~E) = 0.9 * 0.7 * 0.001 * 0.999 * 0.998 = 0.000628.
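- The same joint probability spelled out in Python (values as given above):

    p_j_given_a = 0.9          # P(J|A)
    p_m_given_a = 0.7          # P(M|A)
    p_a_given_nb_ne = 0.001    # P(A | ~B, ~E), from the CPT
    p_nb, p_ne = 0.999, 0.998  # P(~B), P(~E)

    joint = p_j_given_a * p_m_given_a * p_a_given_nb_ne * p_nb * p_ne
    print(joint)   # ~0.000628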
- Reasoning (using the complete joint distribution or other, more efficient methods):
  - Diagnostic (from effect to cause): P(B|J) = 0.016; P(B|J and M) = 0.29; P(A|J and M) = 0.76
  - Predictive (from cause to effect): P(J|B) = 0.86; P(M|B) = 0.67
  - Other: intercausal, e.g. P(B|A); mixed, e.g. P(A|J and ~E)
- Naive Bayes as a BN
- Variables: play, outlook, temp, humidity, windy.
- Structure: play -> outlook, play -> temp, play -> humidity, play -> windy.
- CPTs:
  - play: P(play=yes) = 9/14; P(play=no) = 5/14
  - outlook:
    - P(outlook=overcast | play=yes) = 4/9
    - P(outlook=sunny | play=yes) = 2/9
    - P(outlook=rainy | play=yes) = 3/9
    - P(outlook=overcast | play=no) = 0/5
    - P(outlook=sunny | play=no) = 3/5
    - P(outlook=rainy | play=no) = 2/5
  - ... (similarly for temp, humidity, and windy)
Instance-based methods
- The distance function defines what is learned.
- Most instance-based schemes use Euclidean distance (for numeric attributes): D(X,Y) = sqrt((x1-y1)^2 + (x2-y2)^2 + ... + (xn-yn)^2), where X = {x1, x2, ..., xn} and Y = {y1, y2, ..., yn}. Taking the square root is not required when comparing distances.
- Another popular metric is city-block distance: D(X,Y) = |x1-y1| + |x2-y2| + ... + |xn-yn|.
- As different attributes use different scales, normalization is required: Vnorm = (V - Vmin) / (Vmax - Vmin), so Vnorm lies within [0,1].
- Nominal attributes: count the number of differences, i.e. city-block distance with |xi-yi| = 0 if xi = yi and 1 if xi != yi.
- Missing attribute values: assumed to be maximally distant (given normalized attributes).
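- A minimal Python sketch of these distance measures (the function names are mine):

    from math import sqrt

    def normalize(v, v_min, v_max):
        # rescale a numeric attribute value into [0, 1]
        return (v - v_min) / (v_max - v_min)

    def euclidean(x, y):
        return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

    def city_block(x, y):
        return sum(abs(xi - yi) for xi, yi in zip(x, y))

    def nominal_distance(x, y):
        # number of attributes on which the two instances differ
        return sum(1 for xi, yi in zip(x, y) if xi != yi)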
- Example: weather data

ID | outlook  | temp | humidity | windy | play
1  | sunny    | hot  | high     | false | no
2  | sunny    | hot  | high     | true  | no
3  | overcast | hot  | high     | false | yes
4  | rainy    | mild | high     | false | yes
5  | rainy    | cool | normal   | false | yes
6  | rainy    | cool | normal   | true  | no
7  | overcast | cool | normal   | true  | yes
8  | sunny    | mild | high     | false | no
9  | sunny    | cool | normal   | false | yes
10 | rainy    | mild | normal   | false | yes
11 | sunny    | mild | normal   | true  | yes
12 | overcast | mild | high     | true  | yes
13 | overcast | hot  | normal   | false | yes
14 | rainy    | mild | high     | true  | no
X  | sunny    | cool | high     | true  | ?

Distances from X to its nearest neighbors:

ID       | 2  | 8  | 9   | 11
D(X, ID) | 1  | 2  | 2   | 2
play     | no | no | yes | yes
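- A sketch of k-NN over this table, using the difference count as the distance (the data are transcribed from the table; the names are mine):

    from collections import Counter

    train = [
        (('sunny', 'hot', 'high', 'false'), 'no'),        # 1
        (('sunny', 'hot', 'high', 'true'), 'no'),         # 2
        (('overcast', 'hot', 'high', 'false'), 'yes'),    # 3
        (('rainy', 'mild', 'high', 'false'), 'yes'),      # 4
        (('rainy', 'cool', 'normal', 'false'), 'yes'),    # 5
        (('rainy', 'cool', 'normal', 'true'), 'no'),      # 6
        (('overcast', 'cool', 'normal', 'true'), 'yes'),  # 7
        (('sunny', 'mild', 'high', 'false'), 'no'),       # 8
        (('sunny', 'cool', 'normal', 'false'), 'yes'),    # 9
        (('rainy', 'mild', 'normal', 'false'), 'yes'),    # 10
        (('sunny', 'mild', 'normal', 'true'), 'yes'),     # 11
        (('overcast', 'mild', 'high', 'true'), 'yes'),    # 12
        (('overcast', 'hot', 'normal', 'false'), 'yes'),  # 13
        (('rainy', 'mild', 'high', 'true'), 'no'),        # 14
    ]

    def knn_classify(x, train, k):
        dist = lambda a, b: sum(ai != bi for ai, bi in zip(a, b))
        neighbors = sorted(train, key=lambda t: dist(x, t[0]))[:k]
        return Counter(label for _, label in neighbors).most_common(1)[0][0]

    x = ('sunny', 'cool', 'high', 'true')
    print(knn_classify(x, train, k=1))   # 'no' (instance 2, at distance 1)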
- Discussion
- Instance space: Voronoi diagram.
- 1-NN is very accurate but also slow: it scans the entire training data to derive a prediction (possible improvement: use a sample).
- Assumes all attributes are equally important. Remedy: attribute selection or attribute weights (see attribute relevance).
- Dealing with noise (wrong values of some attributes):
  - taking a majority vote over the k nearest neighbors (k-NN);
  - removing noisy instances from the dataset (difficult!).
- Numeric class attribute: take the mean of the class values of the k nearest neighbors.
- k-NN has been used by statisticians since the early 1950s. Question: how should k be chosen?
- Distance-weighted k-NN:
  - weight each vote (or class value, for a numeric class) by the distance;
  - for example, instead of summing up votes, sum up 1/D(X,Y) or 1/D(X,Y)^2;
  - then it makes sense to use all instances (k = n).
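- A sketch of the distance-weighted vote (the names are mine; an exact match, distance 0, is handled crudely here):

    from collections import defaultdict

    def weighted_vote(x, train, dist):
        # sum 1/d (or 1/d**2) per class over all training instances (k = n)
        votes = defaultdict(float)
        for xi, label in train:
            d = dist(x, xi)
            votes[label] += float('inf') if d == 0 else 1.0 / d
        return max(votes, key=votes.get)

    # e.g. with the nominal weather data and difference-count distance above:
    # weighted_vote(('sunny', 'cool', 'high', 'true'), train,
    #               lambda a, b: sum(ai != bi for ai, bi in zip(a, b)))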
Linear models
- Basic idea
- Linear models work most naturally with numeric attributes. The standard technique for numeric prediction is linear regression.
- The predicted class value is a linear combination of the attribute values ai: C = w0*a0 + w1*a1 + w2*a2 + ... + wk*ak. For k attributes we have k+1 coefficients; to simplify notation we add an attribute a0 that is always 1.
- Squared error: the sum over all instances of (actual class value - predicted class value)^2.
- Deriving the coefficients wi: minimize the squared error on the training data, using standard numerical analysis techniques (matrix operations). This can be done if there are more instances than attributes (roughly speaking).
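- A sketch of deriving the coefficients with numpy's least-squares solver (toy data, not from the source):

    import numpy as np

    # rows are instances; the first column is the constant attribute a0 = 1
    A = np.array([[1.0, 2.0, 3.0],
                  [1.0, 1.0, 4.0],
                  [1.0, 3.0, 1.0],
                  [1.0, 2.5, 2.0]])
    c = np.array([5.0, 4.5, 6.0, 5.5])          # actual class values
    w, *_ = np.linalg.lstsq(A, c, rcond=None)   # w minimizes the squared error
    print(A @ w)                                # predicted class values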
- Classification by linear regression
- Multi-response linear regression (learning a membership function for each class):
  - Training: perform a regression (create a model) for each class, setting the output to 1 for training instances that belong to the class and 0 for those that do not.
  - Prediction: predict the class corresponding to the model with the largest output value.
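- A sketch of multi-response regression on toy data (all names and data are mine):

    import numpy as np

    def train_multiresponse(X, labels, classes):
        X1 = np.hstack([np.ones((len(X), 1)), X])    # prepend the constant a0 = 1
        models = {}
        for c in classes:
            y = (labels == c).astype(float)          # 1 for members, 0 otherwise
            models[c], *_ = np.linalg.lstsq(X1, y, rcond=None)
        return models

    def predict(models, x):
        x1 = np.concatenate(([1.0], x))
        return max(models, key=lambda c: models[c] @ x1)   # largest output wins

    X = np.array([[0.1, 0.9], [0.2, 0.8], [0.9, 0.2], [0.8, 0.1]])
    labels = np.array(['a', 'a', 'b', 'b'])
    models = train_multiresponse(X, labels, ['a', 'b'])
    print(predict(models, np.array([0.85, 0.15])))   # 'b'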
- Pairwise regression (designed especially for classification):
  - Training: perform a regression for every pair of classes, assigning output 1 to one class and -1 to the other.
  - Prediction: predict the class that receives the most "votes" (outputs > 0) from the regression lines.
  - More accurate than multi-response linear regression, but more computationally expensive.
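- A sketch of pairwise regression in the same style (names and data layout as in the previous sketch, all mine):

    import numpy as np
    from itertools import combinations

    def train_pairwise(X, labels, classes):
        X1 = np.hstack([np.ones((len(X), 1)), X])   # prepend the constant a0 = 1
        models = {}
        for c1, c2 in combinations(classes, 2):
            mask = (labels == c1) | (labels == c2)  # only the two classes' instances
            y = np.where(labels[mask] == c1, 1.0, -1.0)
            models[(c1, c2)], *_ = np.linalg.lstsq(X1[mask], y, rcond=None)
        return models

    def predict(models, x):
        x1 = np.concatenate(([1.0], x))
        votes = {}
        for (c1, c2), w in models.items():
            winner = c1 if w @ x1 > 0 else c2       # which side of the regression line?
            votes[winner] = votes.get(winner, 0) + 1
        return max(votes, key=votes.get)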
- Discussion
- Creates a hyperplane separating any two classes:
  - Pairwise: the regression line between the two classes.
  - Multi-response: the boundary (w0-v0)*a0 + (w1-v1)*a1 + ... + (wk-vk)*ak = 0, where wi and vi are the coefficients of the models for the two classes.
- Not appropriate if the data exhibit non-linear dependencies, i.e. instances that cannot be separated by a hyperplane. Classical example: the XOR function.