Data mining algorithms: Prediction
The prediction task

Supervised learning task where the data are used directly (no explicit
model is created) to predict the class value of a new instance.

Basic approaches:

Instance-based (nearest neighbor)

Statistical (naive Bayes)

Bayesian networks

Regression (a kind of concept learning for a continuous class)
Statistical modeling

Basic assumptions

Opposite of OneR: use all the attributes

Attributes are assumed to be:

equally important: all attributes have the same relevance to the classification
task.

statistically independent (given the class value): knowledge about the
value of a particular attribute doesn't tell us anything about the value
of another attribute (if the class is known).

Although based on assumptions that are almost never correct, this scheme
works well in practice!

Probabilities of weather data

outlook   temp  humidity  windy  play
sunny     hot   high      false  no
sunny     hot   high      true   no
overcast  hot   high      false  yes
rainy     mild  high      false  yes
rainy     cool  normal    false  yes
rainy     cool  normal    true   no
overcast  cool  normal    true   yes
sunny     mild  high      false  no
sunny     cool  normal    false  yes
rainy     mild  normal    false  yes
sunny     mild  normal    true   yes
overcast  mild  high      true   yes
overcast  hot   normal    false  yes
rainy     mild  high      true   no

P(outlook=sunny | yes) = 2/9; P(outlook=sunny | no) = 3/5;

P(temp=cool | yes) = 3/9; P(temp=cool | no) = 1/5;

P(humidity=high | yes) = 3/9; P(humidity=high | no) = 4/5;

P(windy=true | yes) = 3/9; P(windy=true | no) = 3/5;

P(play=yes) = 9/14

P(play=no) = 5/14

New instance: [outlook=sunny, temp=cool, humidity=high, windy=true, play=?]

Likelihood of the two classes (play=yes; play=no):

yes = (2/9)*(3/9)*(3/9)*(3/9)*(9/14) = 0.0053;

no = (3/5)*(1/5)*(4/5)*(3/5)*(5/14) = 0.0206;

Conversion into probabilities by normalization:

P(yes) = 0.0053 / (0.0053 + 0.0206) = 0.205

P(no) = 0.0206 / (0.0053 + 0.0206) = 0.795
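The calculation above can be sketched in Python; the conditional probabilities are hard-coded from the frequency counts in the weather table:

```python
# P(attribute=value | class) for the new instance, read off the weather table.
cond_yes = {"outlook=sunny": 2/9, "temp=cool": 3/9,
            "humidity=high": 3/9, "windy=true": 3/9}
cond_no  = {"outlook=sunny": 3/5, "temp=cool": 1/5,
            "humidity=high": 4/5, "windy=true": 3/5}
prior_yes, prior_no = 9/14, 5/14

def likelihood(conds, prior):
    # Naive Bayes: product of the per-attribute conditionals and the prior.
    p = prior
    for v in conds.values():
        p *= v
    return p

lh_yes = likelihood(cond_yes, prior_yes)   # ~0.0053
lh_no  = likelihood(cond_no,  prior_no)    # ~0.0206

# Normalize the two likelihoods into probabilities.
p_yes = lh_yes / (lh_yes + lh_no)          # ~0.205
p_no  = lh_no  / (lh_yes + lh_no)          # ~0.795
```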

Bayes theorem (Bayes rule)

Probability of event H, given evidence E: P(H|E) = P(E|H) * P(H) / P(E);

P(H): a priori probability of H (probability of the event before
the evidence has been seen);

P(H|E): a posteriori (conditional) probability of H (probability
of the event after the evidence has been seen);

Bayes for classification

What is the probability of the class given an instance?

Evidence E = instance

Event H = class value for instance

Naïve Bayes assumption: evidence can be split into independent parts
(attributes of the instance).

E = [A_{1},A_{2},...,A_{n}]

P(E|H) = P(A_{1}|H)*P(A_{2}|H)*...*P(A_{n}|H)

Bayes: P(H|E) = P(A_{1}|H)*P(A_{2}|H)*...*P(A_{n}|H)*P(H)
/ P(E)

Weather data:

E = [outlook=sunny, temp=cool, humidity=high, windy=true]

P(yes|E) = P(outlook=sunny|yes) * P(temp=cool|yes) * P(humidity=high|yes)
* P(windy=true|yes) * P(yes) / P(E) = (2/9)*(3/9)*(3/9)*(3/9)*(9/14) /
P(E)

The “zero-frequency problem”

What if an attribute value doesn't occur with some class value (e.g.
outlook = overcast never occurs with class no)?

The probability estimate will be zero: P(outlook=overcast|no) = 0;

The a posteriori probability will also be zero: P(no|E) = 0 (no matter how
likely the other values are!)

Remedy: add 1 to the count for every attribute value-class combination
(the Laplace estimator): an observed count p out of n becomes (p+1) / (n+k),
where k is the number of values of the attribute.

Result: probabilities will never be zero! (also stabilizes probability
estimates)
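A minimal sketch of the Laplace estimator, using the overcast/no counts from the weather data (outlook has k=3 values):

```python
def laplace(count, total, n_values):
    # Add 1 to each value's count; the denominator grows by the number
    # of possible attribute values, so the estimates still sum to 1.
    return (count + 1) / (total + n_values)

# outlook=overcast never occurs with play=no (0 out of 5 instances).
p = laplace(0, 5, 3)   # 1/8 = 0.125 instead of 0
```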

Missing values

Calculating probabilities: the instance is not included in the frequency
count for the attribute value-class combination.

Classification: attribute will be omitted from calculation

Example: [outlook=?, temp=cool, humidity=high, windy=true, play=?]

Likelihood of yes = (3/9)*(3/9)*(3/9)*(9/14) = 0.0238;

Likelihood of no = (1/5)*(4/5)*(3/5)*(5/14) = 0.0343;

P(yes) = 0.0238 / (0.0238 + 0.0343) = 0.41

P(no) = 0.0343 / (0.0238 + 0.0343) = 0.59

Numeric attributes

Assumption: attributes have a normal or Gaussian probability
distribution (given the class)

Parameters involved: the mean and standard deviation of the attribute for
each class; the normal probability density function replaces the frequency count.
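A sketch of the density calculation; the numbers (temperature 66, mean 73, standard deviation 6.2 for one class) are illustrative, not taken from the table above:

```python
import math

def gauss_density(x, mean, std):
    # Normal density f(x) = 1/(sqrt(2*pi)*std) * exp(-(x-mean)^2 / (2*std^2));
    # used in place of a frequency count for a numeric attribute.
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

d = gauss_density(66, 73.0, 6.2)   # density, not a probability; may exceed 1
```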

Discussion

Naïve Bayes works surprisingly well (even if independence assumption
is clearly violated).

Why? Because classification doesn't require accurate probability estimates,
as long as the maximum probability is assigned to the correct class.

Adding too many redundant attributes will cause problems (e. g. identical
attributes).

Numeric attributes are often not normally distributed.

Yet another problem: estimating prior probability is difficult.

Advanced approaches: Bayesian networks.
Bayesian networks

Basics of BN

Define joint conditional probabilities.

Combine Bayesian reasoning with causal relationships between attributes.

Also known as belief networks, probabilistic networks.

Defined by:

Directed acyclic graph, with nodes representing random variables and links
representing probabilistic dependence.

Conditional probability tables (CPT) for each variable (node): specifies
all P(X|parents(X)), i.e. the probability of each value of X, given every
possible combination of values for its parents.

Reasoning: given the probabilities at some nodes (inputs), a BN calculates
the probabilities at other nodes (outputs).

Classification: inputs are the attribute values; output is the class value probability.

There are mechanisms for training BN from examples, given variables and
network structure, i.e. creating CPT's.

Example:

Variables: Burglary (B), Earthquake (E), Alarm (A), JohnCalls (J), MaryCalls
(M)

Structure ("->" denotes a causal relation): Burglary -> Alarm;
Earthquake -> Alarm; Alarm -> JohnCalls; Alarm -> MaryCalls.

CPT's (for brevity, the probability of false is not given; each row's two
probabilities sum to 1):

B  E  | P(A)
T  T  | 0.95
T  F  | 0.94
F  T  | 0.29
F  F  | 0.001

Calculation of joint probabilities (~ means not): P(J, M, A, ~B, ~E) =
P(J|A) * P(M|A) * P(A|~B and ~E) * P(~B) * P(~E) = 0.9 * 0.7 * 0.001 *
0.999 * 0.998 = 0.000628.
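The joint probability above factors over the network structure (each variable conditioned only on its parents); the values are those used in the calculation:

```python
# CPT entries (probabilities of 'true') used by the example.
p_j_given_a = 0.90            # P(J|A): John calls when the alarm sounds
p_m_given_a = 0.70            # P(M|A): Mary calls when the alarm sounds
p_a_given_not_b_not_e = 0.001 # P(A|~B,~E), bottom row of the CPT
p_not_b = 0.999               # P(~B)
p_not_e = 0.998               # P(~E)

# Chain rule over the DAG: multiply each variable's probability
# given its parents.
joint = p_j_given_a * p_m_given_a * p_a_given_not_b_not_e * p_not_b * p_not_e
# joint ~ 0.000628
```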

Reasoning (using the complete joint distribution or other more efficient methods):

Diagnostic (from effect to cause): P(B|J) = 0.016; P(B|J and M) = 0.29;
P(A|J and M) = 0.76

Predictive (from cause to effect): P(J|B) = 0.86; P(M|B) = 0.67;

Other: intercausal P(B|A), mixed P(A|J and ~E)


Naive Bayes as a BN

Variables: play, outlook, temp, humidity, windy.

Structure: play -> outlook, play -> temp, play -> humidity, play -> windy.

CPT's:

play: P(play=yes)=9/14; P(play=no)=5/14;

outlook:

P(outlook=overcast | play=yes) = 4/9

P(outlook=sunny | play=yes) = 2/9

P(outlook=rainy | play=yes) = 3/9

P(outlook=overcast | play=no) = 0/5

P(outlook=sunny | play=no) = 3/5

P(outlook=rainy | play=no) = 2/5

...
Instance-based methods

Distance function defines what's learned.

Most instance-based schemes use Euclidean distance (for numeric
attributes):
D(X,Y) = sqrt[(x_{1}-y_{1})^{2} + (x_{2}-y_{2})^{2}
+ ... + (x_{n}-y_{n})^{2}], where X = {x_{1},
x_{2}, ..., x_{n}}, Y = {y_{1}, y_{2},
..., y_{n}}. Taking the square root is not required when comparing
distances.

Other popular metric: city-block (Manhattan) distance. D(X,Y) = |x_{1}-y_{1}|
+ |x_{2}-y_{2}| + ... + |x_{n}-y_{n}|.

As different attributes use different scales, normalization is required:
V_{norm} = (V - V_{min}) / (V_{max} - V_{min}).
Thus V_{norm} is within [0,1].

Nominal attributes: number of differences, i.e. city-block distance, where
|x_{i}-y_{i}| = 0 (if x_{i}=y_{i}) or 1 (if x_{i}<>y_{i}).

Missing attributes: assumed to be maximally distant (given normalized attributes).
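These rules for nominal and missing attributes can be sketched as follows (missing values are encoded as None here, an assumption for illustration):

```python
def distance(x, y):
    # City-block distance over nominal attributes: 0 if the values are
    # equal, 1 if different; a missing value counts as maximally
    # distant (1, since normalized attributes lie in [0,1]).
    d = 0
    for xi, yi in zip(x, y):
        if xi is None or yi is None or xi != yi:
            d += 1
    return d

# New instance X from the weather example: sunny, cool, high, true.
x = ("sunny", "cool", "high", "true")
print(distance(x, ("sunny", "hot", "high", "true")))    # instance 2 -> 1
print(distance(x, ("sunny", "mild", "high", "false")))  # instance 8 -> 2
```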

Example: weather data

ID  outlook   temp  humidity  windy  play
1   sunny     hot   high      false  no
2   sunny     hot   high      true   no
3   overcast  hot   high      false  yes
4   rainy     mild  high      false  yes
5   rainy     cool  normal    false  yes
6   rainy     cool  normal    true   no
7   overcast  cool  normal    true   yes
8   sunny     mild  high      false  no
9   sunny     cool  normal    false  yes
10  rainy     mild  normal    false  yes
11  sunny     mild  normal    true   yes
12  overcast  mild  high      true   yes
13  overcast  hot   normal    false  yes
14  rainy     mild  high      true   no
X   sunny     cool  high      true   ?

Nearest neighbors of X:

ID        2   8   9    11
D(X, ID)  1   2   2    2
play      no  no  yes  yes

Discussion

Instance space: Voronoi diagram

1-NN is very accurate but also slow: it scans the entire training data to
derive a prediction (possible improvement: use a sample)

Assumes all attributes are equally important. Remedy: attribute selection
or weights (see attribute relevance).

Dealing with noise (wrong values of some attributes)

Taking a majority vote over the k nearest neighbors (kNN).

Removing noisy instances from dataset (difficult!)

Numeric class attribute: take the mean of the class values of the k nearest neighbors.

kNN has been used by statisticians since the early 1950s. Question: k=?

Distance weighted kNN:

Weight each vote (or class value, for a numeric class) by the inverse of the distance.

For example: instead of summing up votes, sum up 1 / D(X,Y) or 1 / D(X,Y)^{2}

Then it makes sense to use all instances (k=n).
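A sketch of distance-weighted voting with 1/D^2 weights, applied to the neighbors from the weather example (instance 2 at distance 1, instances 8, 9, 11 at distance 2):

```python
from collections import defaultdict

def weighted_vote(neighbors):
    # neighbors: list of (distance, class) pairs; each vote is weighted
    # by 1 / distance^2, so closer instances count more.
    votes = defaultdict(float)
    for d, cls in neighbors:
        votes[cls] += 1.0 / (d * d)
    return max(votes, key=votes.get)

# no: 1/1 + 1/4 = 1.25; yes: 1/4 + 1/4 = 0.5 -> "no"
print(weighted_vote([(1, "no"), (2, "no"), (2, "yes"), (2, "yes")]))
```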
Linear models

Basic idea

Work most naturally with numeric attributes. The standard technique
for numeric prediction is linear regression.

Predicted class value is linear combination of attribute values
(a_{i}): C = w_{0}*a_{0} + w_{1}*a_{1}
+ w_{2}*a_{2} + ... + w_{k}*a_{k}.
For k attributes we have k+1 coefficients. To simplify notation
we add a_{0} that is always 1.

Squared error: sum over all instances of (actual class value - predicted
one)^{2}

Deriving the coefficients (w_{i}): minimizing squared
error on training data. Using standard numerical analysis techniques
(matrix operations). Can be done if there are more instances than attributes
(roughly speaking).
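For a single attribute the least-squares coefficients have a simple closed form; a minimal sketch (the toy data are illustrative):

```python
def fit_line(xs, ys):
    # Least-squares fit of C = w0 + w1*a for a single attribute:
    # the closed-form solution that minimizes the squared error
    # on the training data.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    w1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
         sum((x - mx) ** 2 for x in xs)
    w0 = my - w1 * mx
    return w0, w1

# Toy data lying exactly on C = 1 + 2*a.
w0, w1 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

With several attributes the same minimization is solved with matrix operations (the normal equations), as the text notes.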

Classification by linear regression

Multiresponse linear regression (learning a membership function for each
class)

Training: perform a regression (create a model) for each class, setting
the output to 1 for training instances that belong to the class, and 0
for those that do not.

Prediction: predict the class corresponding to the model with largest output
value
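The prediction step can be sketched as follows; the coefficient values are hypothetical, standing in for models produced by the training step:

```python
def predict_class(models, instance):
    # models: class -> coefficient list [w0, w1, ..., wk];
    # instance: [a1, ..., ak] (a0 = 1 is implicit).
    # Predict the class whose membership model gives the largest output.
    def output(w):
        return w[0] + sum(wi * ai for wi, ai in zip(w[1:], instance))
    return max(models, key=lambda c: output(models[c]))

# Hypothetical two-class models over two attributes.
models = {"yes": [0.1, 0.8, -0.2], "no": [0.9, -0.5, 0.1]}
print(predict_class(models, [1.0, 0.5]))  # "yes": 0.8 vs 0.45
```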

Pairwise regression (designed especially for classification)

Training: perform a regression for every pair of classes, assigning output
1 for one class and -1 for the other.

Prediction: predict the class that receives most "votes" (outputs > 0)
from the regression lines.

More accurate than multiresponse linear regression, but more computationally
expensive.

Discussion

Creates a hyperplane for any two classes

Pairwise: the regression line between the two classes

Multiresponse: the boundary is (w_{0}-v_{0})*a_{0} + (w_{1}-v_{1})*a_{1}
+ ... + (w_{k}-v_{k})*a_{k} = 0, where w_{i}
and v_{i} are the coefficients of the models for the two
classes.

Not appropriate if data exhibits nonlinear dependencies. For example,
instances that cannot be separated by a hyperplane. Classical example:
XOR function.