Machine Learning - Summer 2003

Project 3 - Evaluating hypotheses/algorithms

Posted: 6/10/2003
Due:     6/16/2003

1. Holdout procedure. Find two stratified splits of the PlayTennis data into a training set (2/3, or 9 examples) and a test set (1/3, or 5 examples), such that the hypothesis generated by id3 from the first split has a higher error than the hypothesis generated from the second split. Include in your report: the training and test sets for both splits, and a description of the procedure you used to calculate the error (e.g. the outputs of the Zprolog queries you used).
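
As an illustration of what "stratified" means here, the following Python sketch draws a 9/5 split that keeps the class proportions roughly equal in both parts. It is not part of the course software: the function names are hypothetical, and the label list assumes the standard 14-example PlayTennis labelling (9 yes, 5 no); check it against the data file you are actually using.

```python
import random

def stratified_split(labels, n_test, seed):
    """Return (train_ids, test_ids), 1-based, with class proportions roughly preserved."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels, start=1):
        by_class.setdefault(label, []).append(idx)
    total = len(labels)
    # Ideal (fractional) number of test examples per class.
    exact = {c: len(ids) * n_test / total for c, ids in by_class.items()}
    quota = {c: int(x) for c, x in exact.items()}
    # Give the remaining test slots to the classes with the largest fractional parts.
    leftover = n_test - sum(quota.values())
    by_fraction = sorted(exact, key=lambda c: exact[c] - quota[c], reverse=True)
    for c in by_fraction[:leftover]:
        quota[c] += 1
    test = []
    for c, ids in by_class.items():
        test.extend(rng.sample(ids, quota[c]))
    test.sort()
    train = [i for i in range(1, total + 1) if i not in test]
    return train, test

def error_rate(predicted, actual):
    """Fraction of test examples whose predicted class differs from the actual class."""
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)

# Assumed labelling of examples 1..14 (verify against your PlayTennis file).
labels = ["no", "no", "yes", "yes", "yes", "no", "yes",
          "no", "yes", "yes", "yes", "yes", "yes", "no"]
train_ids, test_ids = stratified_split(labels, n_test=5, seed=1)
print("train:", train_ids)
print("test: ", test_ids)
```

Changing the seed produces different stratified splits; the error of each split is then the fraction of the 5 test examples that the id3 hypothesis built from the 9 training examples misclassifies.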

2. Leave-one-out cross-validation. Rate the algorithms id3.pl, lgg.pl and search.pl according to the hypothesis accuracy they achieve on the data sets animals.pl, loandata.pl and PlayTennis. Evaluate the hypothesis accuracy using leave-one-out (LOO) cross-validation. Include in your report: a description of the procedure you used to calculate the error (e.g. the outputs of the Zprolog queries you used).
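
The LOO procedure itself can be sketched in a few lines of Python. This is only an outline of the bookkeeping: `learn_majority` and `classify_majority` are a stand-in majority-class learner, not id3, lgg or search, and all names are hypothetical. In the actual assignment the learning and classification steps are done with the Prolog programs.

```python
from collections import Counter

def loo_error(examples, learn, classify):
    """Leave-one-out error: train on all examples but one, test on the one left out."""
    errors = 0
    for i, (x, y) in enumerate(examples):
        training = examples[:i] + examples[i + 1:]
        hypothesis = learn(training)
        if classify(hypothesis, x) != y:
            errors += 1
    return errors / len(examples)

# Stand-in learner (an assumption for illustration only): always predict
# the most frequent class seen in the training set.
def learn_majority(training):
    return Counter(y for _, y in training).most_common(1)[0][0]

def classify_majority(hypothesis, x):
    return hypothesis

# Toy data: 9 positive and 5 negative examples; attributes are ignored by the stand-in.
toy = [({"id": i}, "yes") for i in range(9)] + [({"id": i}, "no") for i in range(9, 14)]
print(loo_error(toy, learn_majority, classify_majority))
```

With n examples this builds n hypotheses, each from n-1 examples; the accuracy used to rate an algorithm on a data set is 1 minus the returned error.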

3. Evaluating error-free hypotheses with MDL. Use the three algorithms id3.pl, lgg.pl and search.pl to build three hypotheses (sets of rules) for the complete PlayTennis data set. Then compute the information compression for each hypothesis, that is, compr(Hi) = L(E) - L(Hi), where Hi (i = 1, 2, 3) is the corresponding hypothesis and E is the data set. (We cannot use the code length of the exceptions L(E|Hi) here, because the hypotheses are error-free, i.e. there are no exceptions.) Include in your report: all the work done to compute the information compression, including Zprolog queries and output.
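
To make the arithmetic concrete, here is a small Python sketch of compr(H) = L(E) - L(H). The encoding is an assumption made for illustration: a rule or example containing k attribute-value terms, chosen from a vocabulary of t distinct terms, is charged log2 C(t, k) bits. Use the code-length definition given in class if it differs. The term counts below are made-up numbers, not the output of any of the three algorithms.

```python
import math

def term_set_bits(num_terms, vocabulary_size):
    """Bits to pick one set of num_terms terms out of vocabulary_size (assumed encoding)."""
    return math.log2(math.comb(vocabulary_size, num_terms))

def code_length(term_counts, vocabulary_size):
    """Total code length of a list of rules or examples, given the number of terms in each."""
    return sum(term_set_bits(k, vocabulary_size) for k in term_counts)

def compression(example_term_counts, rule_term_counts, vocabulary_size):
    """compr(H) = L(E) - L(H) for an error-free hypothesis."""
    l_e = code_length(example_term_counts, vocabulary_size)
    l_h = code_length(rule_term_counts, vocabulary_size)
    return l_e - l_h

# Hypothetical inputs: 14 examples of 5 terms each, a 12-term vocabulary,
# and a hypothesis with three rules of 2, 3 and 3 terms.
t = 12
examples = [5] * 14
rules = [2, 3, 3]
print("L(E)     =", code_length(examples, t))
print("L(H)     =", code_length(rules, t))
print("compr(H) =", compression(examples, rules, t))
```

A positive result means the hypothesis is a shorter description than the raw data; the hypothesis with the largest compression is preferred under MDL.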

4. Evaluating hypotheses with MDL by encoding exceptions. Take the PlayTennis split with the lower error rate that you found in Problem 1. Then, using the three algorithms id3.pl, lgg.pl and search.pl, build three hypotheses H1, H2 and H3 from the larger (2/3) subset of this split and compute the MDL compression with exceptions for each one, that is, compr(Hi) = L(E) - L(Hi) - L(E|Hi), for i = 1, 2, 3. Note that E here is the whole PlayTennis data set. Include in your report: all the work done to compute the information compression, including Zprolog queries and output.
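
The only new ingredient compared with Problem 3 is the set of exceptions: the examples in the whole data set E that the hypothesis misclassifies. The Python sketch below shows one way to collect them and plug their code length into the formula. The predictor, the toy examples and the bit counts are all hypothetical placeholders; the code lengths should come from whatever encoding you use for L(E) and L(Hi).

```python
def find_exceptions(examples, predict):
    """Return the examples from the whole data set E that the hypothesis misclassifies."""
    return [(x, y) for x, y in examples if predict(x) != y]

def compression_with_exceptions(l_e, l_h, l_exceptions):
    """compr(H) = L(E) - L(H) - L(E|H), all quantities in bits."""
    return l_e - l_h - l_exceptions

# Toy hypothesis (an assumption for illustration): play tennis iff humidity is normal.
def toy_predict(x):
    return "yes" if x["humidity"] == "normal" else "no"

# Toy examples, not the real PlayTennis data.
toy_examples = [
    ({"humidity": "normal"}, "yes"),
    ({"humidity": "high"}, "no"),
    ({"humidity": "high"}, "yes"),    # misclassified by toy_predict
    ({"humidity": "normal"}, "no"),   # misclassified by toy_predict
]

misses = find_exceptions(toy_examples, toy_predict)
print(len(misses), "exceptions:", misses)

# Placeholder code lengths in bits; compute real ones with your chosen encoding.
print(compression_with_exceptions(l_e=40.0, l_h=6.0, l_exceptions=20.0))
```

Because each hypothesis is built from only 9 examples but evaluated on all 14, exceptions can come from either part of the split, and a hypothesis with many exceptions can end up with a low or even negative compression.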