1. Holdout procedure. Find two stratified splits of the PlayTennis data into a training set (about 2/3, i.e. 9 examples) and a test set (about 1/3, i.e. 5 examples), such that the id3-generated hypothesis for the first split has a higher error than the hypothesis generated from the second split. Include in your report: the training and test sets for both splits, as well as a description of the procedure you used to calculate the error (e.g. the outputs of the Zprolog queries you used).
2. Leave-one-out cross-validation. Rate the algorithms id3.pl, lgg.pl and search.pl according to the hypothesis accuracy they achieve on the data sets animals.pl, loandata.pl and PlayTennis. Evaluate the hypothesis accuracy by using leave-one-out (LOO) cross-validation. Include in your report: a description of the procedure you used to calculate the error (e.g. the outputs of the Zprolog queries you used).
3. Evaluating error-free hypotheses with MDL. Use the three algorithms id3.pl, lgg.pl and search.pl to build three hypotheses (sets of rules) for the complete PlayTennis data set. Then compute the information compression for each hypothesis, that is, compr(Hi) = L(E) - L(Hi), where Hi (i = 1, 2, 3) is the corresponding hypothesis and E is the data set. (The code length of the exceptions, L(E|Hi), cannot be used here because the hypotheses are error free, i.e. there are no exceptions.) Include in your report: all of the work involved in computing the information compression, including the Zprolog queries and their output.
4. Evaluating hypotheses with MDL by encoding exceptions. Take the PlayTennis split with the lower error rate that you found in Problem 1. Then, using the three algorithms id3.pl, lgg.pl and search.pl, build three hypotheses H1, H2 and H3 from the larger (2/3) subset of this split and compute the MDL compression using exceptions for each one, that is, compr(Hi) = L(E) - L(Hi) - L(E|Hi), for i = 1, 2, 3. Note that E here is the whole PlayTennis data set. Include in your report: all of the work involved in computing the information compression, including the Zprolog queries and their output.
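
The two error estimates in Problems 1 and 2 can be sketched outside Prolog as follows. This is only an illustration of the procedures, not the course's implementation: all function names are hypothetical, the learner is a trivial majority-class stand-in rather than id3.pl, lgg.pl or search.pl, and the data are assumed to have 9 positive and 5 negative examples (as in the common textbook version of PlayTennis), with the attributes omitted.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, n_train, seed):
    """Split example indices so each class keeps roughly its overall share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train = []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Per-class quota; for other class counts the rounded quotas
        # may not add up to n_train exactly.
        quota = round(n_train * len(members) / len(labels))
        train.extend(members[:quota])
    train_set = set(train)
    test = [i for i in range(len(labels)) if i not in train_set]
    return sorted(train), test


def majority_learner(train_examples):
    """Stand-in learner (assumption): always predicts the most frequent label."""
    most_common = Counter(y for _, y in train_examples).most_common(1)[0][0]
    return lambda x: most_common


def holdout_error(examples, train_ids, test_ids, learn):
    """Train on the training indices, return the error rate on the test indices."""
    model = learn([examples[i] for i in train_ids])
    wrong = sum(model(examples[i][0]) != examples[i][1] for i in test_ids)
    return wrong / len(test_ids)


def loo_error(examples, learn):
    """Leave-one-out: train on all examples but one, test on the one left out."""
    wrong = 0
    for i, (x, y) in enumerate(examples):
        model = learn(examples[:i] + examples[i + 1:])
        wrong += model(x) != y
    return wrong / len(examples)


# Assumed class distribution: 9 "yes" and 5 "no"; x is just a placeholder id.
examples = [(i, "yes") for i in range(9)] + [(i, "no") for i in range(9, 14)]
labels = [y for _, y in examples]

train_ids, test_ids = stratified_split(labels, n_train=9, seed=0)
print("train:", train_ids, "test:", test_ids)
print("holdout error:", holdout_error(examples, train_ids, test_ids, majority_learner))
print("LOO error:", loo_error(examples, majority_learner))
```

Because the stand-in learner ignores the attributes, every stratified split gives it the same holdout error. Finding two splits with different errors, as Problem 1 asks, requires a real learner such as id3, with a different seed (or a hand-picked split) for each attempt.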
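
The bookkeeping for Problems 3 and 4 reduces to one formula, with Problem 3 as the special case L(E|Hi) = 0. The sketch below only organizes that arithmetic. It does not compute the code lengths themselves, which depend on the encoding used in the course, and the function names and all numbers in it are made-up placeholders, not results for PlayTennis.

```python
def compression(l_e, l_h, l_e_given_h=0.0):
    """compr(H) = L(E) - L(H) - L(E|H), all lengths in the same unit (e.g. bits).

    Leaving l_e_given_h at 0 gives the error-free case compr(H) = L(E) - L(H).
    """
    return l_e - l_h - l_e_given_h


def rank_by_compression(l_e, hypotheses):
    """hypotheses maps a name to (L(H), L(E|H)); best compression comes first."""
    scores = {
        name: compression(l_e, l_h, l_exc)
        for name, (l_h, l_exc) in hypotheses.items()
    }
    order = sorted(scores, key=scores.get, reverse=True)
    return order, scores


# Placeholder code lengths (hypothetical values, for illustration only).
l_e = 100.0
hypotheses = {
    "H1": (40.0, 0.0),   # no exceptions
    "H2": (30.0, 25.0),  # shorter hypothesis, but exceptions must be encoded
    "H3": (55.0, 10.0),
}

order, scores = rank_by_compression(l_e, hypotheses)
for name in order:
    print(name, scores[name])
```

A positive value means that the hypothesis together with its exceptions is shorter than the data set itself. A shorter hypothesis does not automatically win, because the exceptions it leaves uncovered add to the total code length.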