Machine Learning Week 2

Running basic linear regression(regression) and decision tree(classification) on data set Guerry from package HistData

data set

head(Guerry)
##   dept Region   Department Crime_pers Crime_prop Literacy Donations
## 1    1      E          Ain      28870      15890       37      5098
## 2    2      N        Aisne      26226       5521       51      8901
## 3    3      C       Allier      26747       7925       13     10973
## 4    4      E Basses-Alpes      12935       7289       46      2733
## 5    5      E Hautes-Alpes      17488       8174       69      6962
## 6    7      S      Ardeche       9474      10263       27      3188
##   Infants Suicides MainCity Wealth Commerce Clergy Crime_parents
## 1   33120    35039    2:Med     73       58     11            71
## 2   14572    12831    2:Med     22       10     82             4
## 3   17044   114121    2:Med     61       66     68            46
## 4   23018    14238     1:Sm     76       49      5            70
## 5   23076    16171     1:Sm     83       65     10            22
## 6   42117    52547     1:Sm     84        1     28            76
##   Infanticide Donation_clergy Lottery Desertion Instruction Prostitutes
## 1          60              69      41        55          46          13
## 2          82              36      38        82          24         327
## 3          42              76      66        16          85          34
## 4          12              37      80        32          29           2
## 5          23              64      79        35           7           1
## 6          47              67      70        19          62           1
##   Distance Area Pop1831
## 1  218.372 5762  346.03
## 2   65.945 7369  513.00
## 3  161.927 7340  298.26
## 4  351.399 6925  155.90
## 5  320.280 5549  129.10
## 6  279.413 5529  340.73
str(Guerry)
## 'data.frame':    86 obs. of  23 variables:
##  $ dept           : int  1 2 3 4 5 7 8 9 10 11 ...
##  $ Region         : Factor w/ 5 levels "C","E","N","S",..: 2 3 1 2 2 4 3 4 2 4 ...
##  $ Department     : Factor w/ 86 levels "Ain","Aisne",..: 1 2 3 11 40 4 5 6 7 8 ...
##  $ Crime_pers     : int  28870 26226 26747 12935 17488 9474 35203 6173 19602 15647 ...
##  $ Crime_prop     : int  15890 5521 7925 7289 8174 10263 8847 9597 4086 10431 ...
##  $ Literacy       : int  37 51 13 46 69 27 67 18 59 34 ...
##  $ Donations      : int  5098 8901 10973 2733 6962 3188 6400 3542 3608 2582 ...
##  $ Infants        : int  33120 14572 17044 23018 23076 42117 16106 22916 18642 20225 ...
##  $ Suicides       : int  35039 12831 114121 14238 16171 52547 26198 123625 10989 66498 ...
##  $ MainCity       : Ord.factor w/ 3 levels "1:Sm"<"2:Med"<..: 2 2 2 1 1 1 2 1 2 2 ...
##  $ Wealth         : int  73 22 61 76 83 84 33 72 14 17 ...
##  $ Commerce       : int  58 10 66 49 65 1 4 60 3 35 ...
##  $ Clergy         : int  11 82 68 5 10 28 50 39 42 15 ...
##  $ Crime_parents  : int  71 4 46 70 22 76 53 74 77 80 ...
##  $ Infanticide    : int  60 82 42 12 23 47 85 28 54 35 ...
##  $ Donation_clergy: int  69 36 76 37 64 67 49 63 9 27 ...
##  $ Lottery        : int  41 38 66 80 79 70 31 75 28 50 ...
##  $ Desertion      : int  55 82 16 32 35 19 62 22 86 63 ...
##  $ Instruction    : int  46 24 85 29 7 62 9 77 15 48 ...
##  $ Prostitutes    : int  13 327 34 2 1 1 83 3 207 1 ...
##  $ Distance       : num  218.4 65.9 161.9 351.4 320.3 ...
##  $ Area           : int  5762 7369 7340 6925 5549 5529 5229 4890 6004 6139 ...
##  $ Pop1831        : num  346 513 298 156 129 ...
summary(Guerry)
##       dept         Region      Department   Crime_pers      Crime_prop   
##  Min.   :  1.00   C   :17   Ain     : 1   Min.   : 2199   Min.   : 1368  
##  1st Qu.: 24.25   E   :17   Aisne   : 1   1st Qu.:14156   1st Qu.: 5933  
##  Median : 45.50   N   :17   Allier  : 1   Median :18749   Median : 7595  
##  Mean   : 46.88   S   :17   Ardeche : 1   Mean   :19754   Mean   : 7843  
##  3rd Qu.: 66.75   W   :17   Ardennes: 1   3rd Qu.:25938   3rd Qu.: 9182  
##  Max.   :200.00   NA's: 1   Ariege  : 1   Max.   :37014   Max.   :20235  
##                             (Other) :80                                  
##     Literacy       Donations        Infants         Suicides     
##  Min.   :12.00   Min.   : 1246   Min.   : 2660   Min.   :  3460  
##  1st Qu.:25.00   1st Qu.: 3447   1st Qu.:14300   1st Qu.: 15463  
##  Median :38.00   Median : 5020   Median :17142   Median : 26744  
##  Mean   :39.26   Mean   : 7076   Mean   :19050   Mean   : 36523  
##  3rd Qu.:51.75   3rd Qu.: 9447   3rd Qu.:22682   3rd Qu.: 44058  
##  Max.   :74.00   Max.   :37015   Max.   :62486   Max.   :163241  
##                                                                  
##   MainCity      Wealth         Commerce         Clergy     
##  1:Sm :10   Min.   : 1.00   Min.   : 1.00   Min.   : 1.00  
##  2:Med:66   1st Qu.:22.25   1st Qu.:21.25   1st Qu.:22.25  
##  3:Lg :10   Median :43.50   Median :42.50   Median :43.50  
##             Mean   :43.50   Mean   :42.80   Mean   :43.43  
##             3rd Qu.:64.75   3rd Qu.:63.75   3rd Qu.:64.75  
##             Max.   :86.00   Max.   :86.00   Max.   :86.00  
##                                                            
##  Crime_parents    Infanticide    Donation_clergy    Lottery     
##  Min.   : 1.00   Min.   : 1.00   Min.   : 1.00   Min.   : 1.00  
##  1st Qu.:22.25   1st Qu.:22.25   1st Qu.:22.25   1st Qu.:22.25  
##  Median :43.50   Median :43.50   Median :43.50   Median :43.50  
##  Mean   :43.50   Mean   :43.51   Mean   :43.50   Mean   :43.50  
##  3rd Qu.:64.75   3rd Qu.:64.75   3rd Qu.:64.75   3rd Qu.:64.75  
##  Max.   :86.00   Max.   :86.00   Max.   :86.00   Max.   :86.00  
##                                                                 
##    Desertion      Instruction     Prostitutes        Distance    
##  Min.   : 1.00   Min.   : 1.00   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.:22.25   1st Qu.:23.25   1st Qu.:   6.0   1st Qu.:121.4  
##  Median :43.50   Median :41.50   Median :  33.0   Median :200.6  
##  Mean   :43.50   Mean   :43.13   Mean   : 141.9   Mean   :208.0  
##  3rd Qu.:64.75   3rd Qu.:64.75   3rd Qu.: 113.8   3rd Qu.:289.7  
##  Max.   :86.00   Max.   :86.00   Max.   :4744.0   Max.   :539.2  
##                                                                  
##       Area          Pop1831     
##  Min.   :  762   Min.   :129.1  
##  1st Qu.: 5401   1st Qu.:283.0  
##  Median : 6070   Median :346.2  
##  Mean   : 6147   Mean   :378.6  
##  3rd Qu.: 6816   3rd Qu.:444.4  
##  Max.   :10000   Max.   :989.9  
## 

linear regression

In this simple application, I choose Crime_pers, Wealth and Distance as independent variables(predictors), and choose Commerce as dependent variables(response).

select columns in data set that we interesed in

data <- Guerry[, c("Distance", "Wealth", "Crime_pers", "Commerce")]

view data

head(data)
##   Distance Wealth Crime_pers Commerce
## 1  218.372     73      28870       58
## 2   65.945     22      26226       10
## 3  161.927     61      26747       66
## 4  351.399     76      12935       49
## 5  320.280     83      17488       65
## 6  279.413     84       9474        1

use simple 1-fold cross validation. Randomly select 60% percent from data as training set, leaving other 40% as testing set.

# randomly generate train set and test set by ratio 6 : 4
set.seed(2)
group <- sample(1:2, size = nrow(Guerry), replace = TRUE, prob = c(0.6, 0.4))
train.set <- data[group == 1, ]
test.set <- data[group == 2, ]
# display ratio
nrow(train.set) / nrow(test.set)
## [1] 1.457143
6/4
## [1] 1.5

generate linear regression model by train.set

lm.fit <- lm(Commerce~Wealth+Crime_pers+Distance, data = train.set)
summary(lm.fit)
## 
## Call:
## lm(formula = Commerce ~ Wealth + Crime_pers + Distance, data = train.set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.588 -11.409  -2.651  15.000  37.666 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -8.5150365 12.0940999  -0.704   0.4849   
## Wealth       0.4034246  0.1238861   3.256   0.0021 **
## Crime_pers   0.0009415  0.0004316   2.181   0.0342 * 
## Distance     0.0662524  0.0317821   2.085   0.0426 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.43 on 47 degrees of freedom
## Multiple R-squared:  0.3691, Adjusted R-squared:  0.3288 
## F-statistic: 9.164 on 3 and 47 DF,  p-value: 6.959e-05

predict Commerce rank using linear regression model

pred <- predict(lm.fit, test.set, interval = "predict")
plot(pred[,"fit"], type = "o", pch = 21, ylim = c(0,100), col = "blue")
lines(test.set$Commerce, type = "o", pch = 22, col = "red")
legend(1,100,c("predict", "real"), cex = 0.5, pch = 21:22, lty = 1:2, col = c("blue","red"))

Decision Tree

In this simple application, I choose Crime_pers, Wealth and Distance as independent variables(predictors), and choose MainCity as labels. For simiplicity, we only decision whether a city is Med or not.

select columns in data set that we interesed in

data <- Guerry[, c("Distance", "Wealth", "Crime_pers")]
data$Label <- as.factor(Guerry[, "MainCity"] == "2:Med") 

view data

head(data)
##   Distance Wealth Crime_pers Label
## 1  218.372     73      28870  TRUE
## 2   65.945     22      26226  TRUE
## 3  161.927     61      26747  TRUE
## 4  351.399     76      12935 FALSE
## 5  320.280     83      17488 FALSE
## 6  279.413     84       9474 FALSE

use simple 1-fold cross validation. Randomly select 60% percent from data as training set, leaving other 40% as testing set.

# randomly generate train set and test set by ratio 6 : 4
set.seed(2)
group <- sample(1:2, size = nrow(Guerry), replace = TRUE, prob = c(0.6, 0.4))
train.set <- data[group == 1, ]
test.set <- data[group == 2, ]
# display ratio
nrow(train.set) / nrow(test.set)
## [1] 1.457143
6/4
## [1] 1.5

grow tree

fit <- tree(Label ~ ., data = train.set)
summary(fit)
## 
## Classification tree:
## tree(formula = Label ~ ., data = train.set)
## Number of terminal nodes:  5 
## Residual mean deviance:  0.5554 = 25.55 / 46 
## Misclassification error rate: 0.1176 = 6 / 51
plot(fit)
text(fit, pretty = 0)

predict Label in test.set

pred <- predict(fit, test.set, type = "class")
summary(pred)
## FALSE  TRUE 
##     4    31
table(pred, test.set$Label) #confusion matrix
##        
## pred    FALSE TRUE
##   FALSE     0    4
##   TRUE      9   22