UCSC Machine Learning - Homework 2 (About Decision Tree)

1) This question uses the following ages for a set of trees: 19, 23, 30, 30, 45, 25, 24, 20. Store them in R using the syntax

ages<-c(19,23,30,30,45,25,24,20)

a) Compute the standard deviation in R using the sd() function. Also compute the mean and median.

b) Compute the same standard deviation in R without using the sd() function.

c) Using R, how does the standard deviation from part a) change if you add 10 to all the values?

d) Using R, how does the standard deviation in part a) change if you multiply all the values by 100?

e) Next add another tree of age 70 to the sample. Compute the mean and median with this tree added to the sample. How have the mean and median changed?

## Some housekeeping:
# Clear the workspace:
rm(list=ls())

# a):
# Prepare data:
ages<-c(19,23,30,30,45,25,24,20)
sd(ages)

## [1] 8.315218

mean(ages)

## [1] 27

median(ages)

## [1] 24.5

# b): 
# Sample standard deviation: square root of the sum of squared deviations over n - 1
sqrt(sum((ages - mean(ages))^2)/(length(ages) - 1))

## [1] 8.315218

# Same as sd(ages)
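The formula in part b) divides by n - 1, which is the sample standard deviation that sd() reports. A minimal sketch contrasting it with the population version, which divides by n. The names `ss`, `sample_sd` and `population_sd` are illustrative and not part of the assignment.

```r
ages <- c(19, 23, 30, 30, 45, 25, 24, 20)
n  <- length(ages)
ss <- sum((ages - mean(ages))^2)     # sum of squared deviations from the mean

sample_sd     <- sqrt(ss / (n - 1))  # denominator n - 1, as sd() uses
population_sd <- sqrt(ss / n)        # denominator n, so slightly smaller

all.equal(sample_sd, sd(ages))         # TRUE
all.equal(sample_sd, sqrt(var(ages)))  # TRUE: sd is the square root of var
population_sd < sample_sd              # TRUE
```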
# c) 
sd(ages + 10)  # No change: a constant shift does not affect the spread

## [1] 8.315218

# d) 
sd(ages * 100)  # Increased 100-fold

## [1] 831.5218
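Parts c) and d) are two cases of one rule: for constants a and b, the standard deviation of a*x + b equals |a| times the standard deviation of x. A short sketch that checks this on the same data. The variable names `a` and `b` are chosen here for illustration.

```r
ages <- c(19, 23, 30, 30, 45, 25, 24, 20)
a <- 100   # scale factor from part d)
b <- 10    # shift from part c)

# A shift moves the mean but leaves every deviation from the mean unchanged.
all.equal(sd(ages + b), sd(ages))

# A scale multiplies every deviation by a, so the sd is multiplied by |a|.
all.equal(sd(a * ages), abs(a) * sd(ages))

# Both at once, with a negative scale factor: the sign does not matter.
all.equal(sd(-a * ages + b), abs(a) * sd(ages))
```

All three comparisons should return TRUE.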

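# e)
ages_with_70 <- c(ages, 70)   # append one more tree of age 70 to the sample
mean(ages_with_70)            # the mean rises from 27 to about 31.78

## [1] 31.77778

median(ages_with_70)          # the median rises only from 24.5 to 25

## [1] 25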

2) Here is the data table for question 2 (Using table4_8pg199.txt)

The following tree was created using rpart for the data table given above.

Use this tree to predict the class labels (either a + or -) for the following test observations:

library(rpart)

# Load the training data:
train <- read.csv("C:/Users/Andrew/SkyDrive/AGZ_Home/workspace_R/UCSC/MachinLearning/All_data/table4_8pg199.txt",header=TRUE)
train

##   Instance    a1    a2 a3 Target
## 1        1  TRUE  TRUE  1      1
## 2        2  TRUE  TRUE  6      1
## 3        3  TRUE FALSE  5      0
## 4        4 FALSE FALSE  4      1
## 5        5 FALSE  TRUE  7      0
## 6        6 FALSE  TRUE  3      0
## 7        7 FALSE FALSE  8      0
## 8        8  TRUE FALSE  7      1
## 9        9 FALSE  TRUE  5      0

str(train)

## 'data.frame':    9 obs. of  5 variables:
##  $ Instance: int  1 2 3 4 5 6 7 8 9
##  $ a1      : logi  TRUE TRUE TRUE FALSE FALSE FALSE ...
##  $ a2      : logi  TRUE TRUE FALSE FALSE TRUE TRUE ...
##  $ a3      : num  1 6 5 4 7 3 8 7 5
##  $ Target  : int  1 1 0 1 0 0 0 1 0

y <- as.factor(train[,5])  # class labels: 0 or 1
y

## [1] 1 1 0 1 0 0 0 1 0
## Levels: 0 1

x<-train[,2:4]
x

##      a1    a2 a3
## 1  TRUE  TRUE  1
## 2  TRUE  TRUE  6
## 3  TRUE FALSE  5
## 4 FALSE FALSE  4
## 5 FALSE  TRUE  7
## 6 FALSE  TRUE  3
## 7 FALSE FALSE  8
## 8  TRUE FALSE  7
## 9 FALSE  TRUE  5

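# Fit the tree on the training data. minsplit=0 and minbucket=0 remove the
# node-size limits, so rpart may keep splitting even on this tiny data set.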
fit<-rpart(y~.,x,control=rpart.control(minsplit=0,minbucket=0,maxdepth=5))
fit

## n= 9 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 9 4 0 (0.5555556 0.4444444)  
##    2) a1< 0.5 5 1 0 (0.8000000 0.2000000)  
##      4) a2>=0.5 3 0 0 (1.0000000 0.0000000) *
##      5) a2< 0.5 2 1 0 (0.5000000 0.5000000)  
##       10) a3>=6 1 0 0 (1.0000000 0.0000000) *
##       11) a3< 6 1 0 1 (0.0000000 1.0000000) *
##    3) a1>=0.5 4 1 1 (0.2500000 0.7500000)  
##      6) a2< 0.5 2 1 0 (0.5000000 0.5000000)  
##       12) a3< 6 1 0 0 (1.0000000 0.0000000) *
##       13) a3>=6 1 0 1 (0.0000000 1.0000000) *
##      7) a2>=0.5 2 0 1 (0.0000000 1.0000000) *

plot(fit)
text(fit)
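For classification, rpart splits on the Gini index by default. The sketch below recomputes that impurity by hand for the two logical attributes, to show why a split on a1 is preferred over a2 at the root. The helper names `gini` and `split_gini` are hypothetical, the data are copied from the printed training table, and the numeric attribute a3 is left out for brevity.

```r
# Training labels and the two logical attributes, copied from the printed table.
target <- c(1, 1, 0, 1, 0, 0, 0, 1, 0)
a1 <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE)
a2 <- c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE)

# Gini impurity of a set of labels: 1 minus the sum of squared class proportions.
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Size-weighted impurity of the two children produced by a logical split.
split_gini <- function(labels, split) {
  w <- mean(split)   # fraction of observations sent to the TRUE branch
  w * gini(labels[split]) + (1 - w) * gini(labels[!split])
}

gini(target)             # impurity at the root, 40/81
split_gini(target, a1)   # after splitting on a1, 31/90
split_gini(target, a2)   # after splitting on a2, 22/45
```

The a1 split leaves the lower weighted impurity (about 0.344 against about 0.489), which is consistent with a1 appearing at the root of the printed tree.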

# Now predict on the test data:
test.csv <- read.csv("C:/Users/Andrew/SkyDrive/AGZ_Home/workspace_R/UCSC/MachinLearning/All_data/HW2_Q2.csv",header=TRUE)
test.txt <- read.csv("C:/Users/Andrew/SkyDrive/AGZ_Home/workspace_R/UCSC/MachinLearning/All_data/HW2_Q2.txt",header=TRUE)

## Warning in read.table(file = file, header = header, sep = sep, quote =
## quote, : incomplete final line found by readTableHeader on
## 'C:/Users/Andrew/SkyDrive/AGZ_Home/workspace_R/UCSC/MachinLearning/All_data/HW2_Q2.txt'

test.csv

##   Observation    a1    a2  a3
## 1           1  TRUE  TRUE 2.5
## 2           2  TRUE FALSE 5.5
## 3           3 FALSE  TRUE 2.5
## 4           4 FALSE FALSE 8.5

test.txt

##   Observation    a1    a2  a3
## 1           1  TRUE  TRUE 2.5
## 2           2  TRUE  TRUE 5.5
## 3           3  TRUE FALSE 2.5
## 4           4 FALSE FALSE 8.5

str(test.csv)

## 'data.frame':    4 obs. of  4 variables:
##  $ Observation: int  1 2 3 4
##  $ a1         : logi  TRUE TRUE FALSE FALSE
##  $ a2         : logi  TRUE FALSE TRUE FALSE
##  $ a3         : num  2.5 5.5 2.5 8.5

str(test.txt)

## 'data.frame':    4 obs. of  4 variables:
##  $ Observation: int  1 2 3 4
##  $ a1         : logi  TRUE TRUE TRUE FALSE
##  $ a2         : logi  TRUE TRUE FALSE FALSE
##  $ a3         : num  2.5 5.5 2.5 8.5

# Predicted class labels for each version of the test file:
predict_test.csv <- predict(fit, test.csv, type="class")
predict_test.txt <- predict(fit, test.txt, type="class")
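The predictions can also be traced by hand through the printed tree. The sketch below hard-codes the printed splits as plain if statements and applies them to the four observations shown for test.csv. The function name `predict_by_hand` is hypothetical, and mapping Target 1 to "+" and 0 to "-" is an assumption, since the data file codes the class as 0/1.

```r
# Hand-coded copy of the printed tree; returns the predicted Target (0 or 1).
predict_by_hand <- function(a1, a2, a3) {
  if (!a1) {                  # node 2: a1 < 0.5
    if (a2) return(0)         # node 4: a2 >= 0.5
    if (a3 >= 6) return(0)    # node 10
    return(1)                 # node 11: a3 < 6
  }
  # node 3: a1 >= 0.5
  if (a2) return(1)           # node 7: a2 >= 0.5
  if (a3 < 6) return(0)       # node 12
  1                           # node 13: a3 >= 6
}

# Test observations copied from the printed test.csv.
test_obs <- data.frame(
  a1 = c(TRUE, TRUE, FALSE, FALSE),
  a2 = c(TRUE, FALSE, TRUE, FALSE),
  a3 = c(2.5, 5.5, 2.5, 8.5)
)

pred <- mapply(predict_by_hand, test_obs$a1, test_obs$a2, test_obs$a3)
pred                          # 1 0 0 0

# Assumption: Target 1 corresponds to "+" and 0 to "-".
ifelse(pred == 1, "+", "-")   # "+" "-" "-" "-"
```

The printed test.txt lists different attribute values for observations 2 and 3, so its predictions need not match these.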


# Class probabilities for the training observations (no newdata supplied):
predict(fit, type="prob")

##   0 1
## 1 0 1
## 2 0 1
## 3 1 0
## 4 0 1
## 5 1 0
## 6 1 0
## 7 1 0
## 8 0 1
## 9 1 0

fit$frame

##       var n wt dev yval complexity ncompete nsurrogate  yval2.V1  yval2.V2
## 1      a1 9  9   4    1      0.500        2          0 1.0000000 5.0000000
## 2      a2 5  5   1    1      0.125        1          0 1.0000000 4.0000000
## 4  <leaf> 3  3   0    1      0.010        0          0 1.0000000 3.0000000
## 5      a3 2  2   1    1      0.125        0          0 1.0000000 1.0000000
## 10 <leaf> 1  1   0    1      0.010        0          0 1.0000000 1.0000000
## 11 <leaf> 1  1   0    2      0.010        0          0 2.0000000 0.0000000
## 3      a2 4  4   1    2      0.125        1          0 2.0000000 1.0000000
## 6      a3 2  2   1    1      0.125        0          0 1.0000000 1.0000000
## 12 <leaf> 1  1   0    1      0.010        0          0 1.0000000 1.0000000
## 13 <leaf> 1  1   0    2      0.010        0          0 2.0000000 0.0000000
## 7  <leaf> 2  2   0    2      0.010        0          0 2.0000000 0.0000000
##     yval2.V3  yval2.V4  yval2.V5 yval2.nodeprob
## 1  4.0000000 0.5555556 0.4444444      1.0000000
## 2  1.0000000 0.8000000 0.2000000      0.5555556
## 4  0.0000000 1.0000000 0.0000000      0.3333333
## 5  1.0000000 0.5000000 0.5000000      0.2222222
## 10 0.0000000 1.0000000 0.0000000      0.1111111
## 11 1.0000000 0.0000000 1.0000000      0.1111111
## 3  3.0000000 0.2500000 0.7500000      0.4444444
## 6  1.0000000 0.5000000 0.5000000      0.2222222
## 12 0.0000000 1.0000000 0.0000000      0.1111111
## 13 1.0000000 0.0000000 1.0000000      0.1111111
## 7  2.0000000 0.0000000 1.0000000      0.2222222

fit$frame[1,1]

## [1] a1
## Levels: <leaf> a1 a2 a3

3) Question 3

Consider the table from exercise 5 on page 200 of the textbook (copied below). It is a binary classification problem. Is it possible to build a model that classifies this training data correctly? If so, create a tree that gives the correct answer (either + or -) for each training observation. Otherwise, explain why it is not possible.

library(rpart)

HW2_Q3_data <- data.frame(
  Observation = 1:10,
  A = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE),
  B = c(FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE),
  # data.frame() renames "Class Label" to Class.Label (check.names = TRUE)
  "Class Label" = as.factor(c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0))
)

HW2_Q3_data

##    Observation     A     B Class.Label
## 1            1  TRUE FALSE           1
## 2            2  TRUE  TRUE           1
## 3            3  TRUE  TRUE           1
## 4            4  TRUE FALSE           0
## 5            5  TRUE  TRUE           1
## 6            6 FALSE FALSE           0
## 7            7 FALSE FALSE           0
## 8            8 FALSE FALSE           0
## 9            9  TRUE  TRUE           0
## 10          10  TRUE FALSE           0

str(HW2_Q3_data)

## 'data.frame':    10 obs. of  4 variables:
##  $ Observation: int  1 2 3 4 5 6 7 8 9 10
##  $ A          : logi  TRUE TRUE TRUE TRUE TRUE FALSE ...
##  $ B          : logi  FALSE TRUE TRUE FALSE TRUE FALSE ...
##  $ Class.Label: Factor w/ 2 levels "0","1": 2 2 2 1 2 1 1 1 1 1

y = HW2_Q3_data$Class.Label
y

##  [1] 1 1 1 0 1 0 0 0 0 0
## Levels: 0 1

x = HW2_Q3_data[,2:3]
x

##        A     B
## 1   TRUE FALSE
## 2   TRUE  TRUE
## 3   TRUE  TRUE
## 4   TRUE FALSE
## 5   TRUE  TRUE
## 6  FALSE FALSE
## 7  FALSE FALSE
## 8  FALSE FALSE
## 9   TRUE  TRUE
## 10  TRUE FALSE

fit<-rpart(y~., x, control=rpart.control(minsplit=0,minbucket=0,maxdepth=5))
fit

## n= 10 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 10 4 0 (0.6000000 0.4000000)  
##   2) B< 0.5 6 1 0 (0.8333333 0.1666667) *
##   3) B>=0.5 4 1 1 (0.2500000 0.7500000) *

# Fraction of training observations that the fitted tree misclassifies:
error_training = 1-sum(y==predict(fit,x,type="class"))/length(y)
error_training

## [1] 0.2

cat("Training error:", error_training)

## Training error: 0.2
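The 0.2 training error is not a weakness of this particular tree. Some observations share identical values of A and B but carry different class labels, so no model built on A and B alone can classify all ten correctly. A minimal base-R sketch that finds those conflicts and the smallest achievable training error. The names `q3`, `counts`, `conflicting` and `min_errors` are illustrative.

```r
# Attributes and labels copied from the table for question 3.
q3 <- data.frame(
  A = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE),
  B = c(FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE),
  label = c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0)
)

# Count each label within every distinct (A, B) combination.
counts <- table(paste(q3$A, q3$B), q3$label)
counts

# A combination in which both labels occur cannot be classified perfectly.
conflicting <- rowSums(counts > 0) > 1
names(conflicting)[conflicting]   # "TRUE FALSE" "TRUE TRUE"

# Best case: predict the majority label within each combination.
min_errors <- sum(counts) - sum(apply(counts, 1, max))
min_errors / nrow(q3)             # 0.2
```

Observation 1 conflicts with observations 4 and 10 (A = TRUE, B = FALSE), and observation 9 conflicts with observations 2, 3 and 5 (A = TRUE, B = TRUE). At least two of the ten observations are therefore misclassified by any such model, which matches the training error of the fitted tree.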

plot(fit)
text(fit)

Andrew Zhang

Friday, January 16, 2015