Running basic linear regression(regression) and decision tree(classification) on data set Guerry from package HistData
head(Guerry)
## dept Region Department Crime_pers Crime_prop Literacy Donations
## 1 1 E Ain 28870 15890 37 5098
## 2 2 N Aisne 26226 5521 51 8901
## 3 3 C Allier 26747 7925 13 10973
## 4 4 E Basses-Alpes 12935 7289 46 2733
## 5 5 E Hautes-Alpes 17488 8174 69 6962
## 6 7 S Ardeche 9474 10263 27 3188
## Infants Suicides MainCity Wealth Commerce Clergy Crime_parents
## 1 33120 35039 2:Med 73 58 11 71
## 2 14572 12831 2:Med 22 10 82 4
## 3 17044 114121 2:Med 61 66 68 46
## 4 23018 14238 1:Sm 76 49 5 70
## 5 23076 16171 1:Sm 83 65 10 22
## 6 42117 52547 1:Sm 84 1 28 76
## Infanticide Donation_clergy Lottery Desertion Instruction Prostitutes
## 1 60 69 41 55 46 13
## 2 82 36 38 82 24 327
## 3 42 76 66 16 85 34
## 4 12 37 80 32 29 2
## 5 23 64 79 35 7 1
## 6 47 67 70 19 62 1
## Distance Area Pop1831
## 1 218.372 5762 346.03
## 2 65.945 7369 513.00
## 3 161.927 7340 298.26
## 4 351.399 6925 155.90
## 5 320.280 5549 129.10
## 6 279.413 5529 340.73
str(Guerry)
## 'data.frame': 86 obs. of 23 variables:
## $ dept : int 1 2 3 4 5 7 8 9 10 11 ...
## $ Region : Factor w/ 5 levels "C","E","N","S",..: 2 3 1 2 2 4 3 4 2 4 ...
## $ Department : Factor w/ 86 levels "Ain","Aisne",..: 1 2 3 11 40 4 5 6 7 8 ...
## $ Crime_pers : int 28870 26226 26747 12935 17488 9474 35203 6173 19602 15647 ...
## $ Crime_prop : int 15890 5521 7925 7289 8174 10263 8847 9597 4086 10431 ...
## $ Literacy : int 37 51 13 46 69 27 67 18 59 34 ...
## $ Donations : int 5098 8901 10973 2733 6962 3188 6400 3542 3608 2582 ...
## $ Infants : int 33120 14572 17044 23018 23076 42117 16106 22916 18642 20225 ...
## $ Suicides : int 35039 12831 114121 14238 16171 52547 26198 123625 10989 66498 ...
## $ MainCity : Ord.factor w/ 3 levels "1:Sm"<"2:Med"<..: 2 2 2 1 1 1 2 1 2 2 ...
## $ Wealth : int 73 22 61 76 83 84 33 72 14 17 ...
## $ Commerce : int 58 10 66 49 65 1 4 60 3 35 ...
## $ Clergy : int 11 82 68 5 10 28 50 39 42 15 ...
## $ Crime_parents : int 71 4 46 70 22 76 53 74 77 80 ...
## $ Infanticide : int 60 82 42 12 23 47 85 28 54 35 ...
## $ Donation_clergy: int 69 36 76 37 64 67 49 63 9 27 ...
## $ Lottery : int 41 38 66 80 79 70 31 75 28 50 ...
## $ Desertion : int 55 82 16 32 35 19 62 22 86 63 ...
## $ Instruction : int 46 24 85 29 7 62 9 77 15 48 ...
## $ Prostitutes : int 13 327 34 2 1 1 83 3 207 1 ...
## $ Distance : num 218.4 65.9 161.9 351.4 320.3 ...
## $ Area : int 5762 7369 7340 6925 5549 5529 5229 4890 6004 6139 ...
## $ Pop1831 : num 346 513 298 156 129 ...
summary(Guerry)
## dept Region Department Crime_pers Crime_prop
## Min. : 1.00 C :17 Ain : 1 Min. : 2199 Min. : 1368
## 1st Qu.: 24.25 E :17 Aisne : 1 1st Qu.:14156 1st Qu.: 5933
## Median : 45.50 N :17 Allier : 1 Median :18749 Median : 7595
## Mean : 46.88 S :17 Ardeche : 1 Mean :19754 Mean : 7843
## 3rd Qu.: 66.75 W :17 Ardennes: 1 3rd Qu.:25938 3rd Qu.: 9182
## Max. :200.00 NA's: 1 Ariege : 1 Max. :37014 Max. :20235
## (Other) :80
## Literacy Donations Infants Suicides
## Min. :12.00 Min. : 1246 Min. : 2660 Min. : 3460
## 1st Qu.:25.00 1st Qu.: 3447 1st Qu.:14300 1st Qu.: 15463
## Median :38.00 Median : 5020 Median :17142 Median : 26744
## Mean :39.26 Mean : 7076 Mean :19050 Mean : 36523
## 3rd Qu.:51.75 3rd Qu.: 9447 3rd Qu.:22682 3rd Qu.: 44058
## Max. :74.00 Max. :37015 Max. :62486 Max. :163241
##
## MainCity Wealth Commerce Clergy
## 1:Sm :10 Min. : 1.00 Min. : 1.00 Min. : 1.00
## 2:Med:66 1st Qu.:22.25 1st Qu.:21.25 1st Qu.:22.25
## 3:Lg :10 Median :43.50 Median :42.50 Median :43.50
## Mean :43.50 Mean :42.80 Mean :43.43
## 3rd Qu.:64.75 3rd Qu.:63.75 3rd Qu.:64.75
## Max. :86.00 Max. :86.00 Max. :86.00
##
## Crime_parents Infanticide Donation_clergy Lottery
## Min. : 1.00 Min. : 1.00 Min. : 1.00 Min. : 1.00
## 1st Qu.:22.25 1st Qu.:22.25 1st Qu.:22.25 1st Qu.:22.25
## Median :43.50 Median :43.50 Median :43.50 Median :43.50
## Mean :43.50 Mean :43.51 Mean :43.50 Mean :43.50
## 3rd Qu.:64.75 3rd Qu.:64.75 3rd Qu.:64.75 3rd Qu.:64.75
## Max. :86.00 Max. :86.00 Max. :86.00 Max. :86.00
##
## Desertion Instruction Prostitutes Distance
## Min. : 1.00 Min. : 1.00 Min. : 0.0 Min. : 0.0
## 1st Qu.:22.25 1st Qu.:23.25 1st Qu.: 6.0 1st Qu.:121.4
## Median :43.50 Median :41.50 Median : 33.0 Median :200.6
## Mean :43.50 Mean :43.13 Mean : 141.9 Mean :208.0
## 3rd Qu.:64.75 3rd Qu.:64.75 3rd Qu.: 113.8 3rd Qu.:289.7
## Max. :86.00 Max. :86.00 Max. :4744.0 Max. :539.2
##
## Area Pop1831
## Min. : 762 Min. :129.1
## 1st Qu.: 5401 1st Qu.:283.0
## Median : 6070 Median :346.2
## Mean : 6147 Mean :378.6
## 3rd Qu.: 6816 3rd Qu.:444.4
## Max. :10000 Max. :989.9
##
In this simple application, I choose Crime_pers, Wealth and Distance as independent variables(predictors), and choose Commerce as dependent variables(response).
data <- Guerry[, c("Distance", "Wealth", "Crime_pers", "Commerce")]
head(data)
## Distance Wealth Crime_pers Commerce
## 1 218.372 73 28870 58
## 2 65.945 22 26226 10
## 3 161.927 61 26747 66
## 4 351.399 76 12935 49
## 5 320.280 83 17488 65
## 6 279.413 84 9474 1
# randomly generate train set and test set by ratio 6 : 4
set.seed(2)
group <- sample(1:2, size = nrow(Guerry), replace = TRUE, prob = c(0.6, 0.4))
train.set <- data[group == 1, ]
test.set <- data[group == 2, ]
# display ratio
nrow(train.set) / nrow(test.set)
## [1] 1.457143
6/4
## [1] 1.5
lm.fit <- lm(Commerce~Wealth+Crime_pers+Distance, data = train.set)
summary(lm.fit)
##
## Call:
## lm(formula = Commerce ~ Wealth + Crime_pers + Distance, data = train.set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.588 -11.409 -2.651 15.000 37.666
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.5150365 12.0940999 -0.704 0.4849
## Wealth 0.4034246 0.1238861 3.256 0.0021 **
## Crime_pers 0.0009415 0.0004316 2.181 0.0342 *
## Distance 0.0662524 0.0317821 2.085 0.0426 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.43 on 47 degrees of freedom
## Multiple R-squared: 0.3691, Adjusted R-squared: 0.3288
## F-statistic: 9.164 on 3 and 47 DF, p-value: 6.959e-05
pred <- predict(lm.fit, test.set, interval = "predict")
plot(pred[,"fit"], type = "o", pch = 21, ylim = c(0,100), col = "blue")
lines(test.set$Commerce, type = "o", pch = 22, col = "red")
legend(1,100,c("predict", "real"), cex = 0.5, pch = 21:22, lty = 1:2, col = c("blue","red"))
In this simple application, I choose Crime_pers, Wealth and Distance as independent variables(predictors), and choose MainCity as labels. For simiplicity, we only decision whether a city is Med or not.
data <- Guerry[, c("Distance", "Wealth", "Crime_pers")]
data$Label <- as.factor(Guerry[, "MainCity"] == "2:Med")
head(data)
## Distance Wealth Crime_pers Label
## 1 218.372 73 28870 TRUE
## 2 65.945 22 26226 TRUE
## 3 161.927 61 26747 TRUE
## 4 351.399 76 12935 FALSE
## 5 320.280 83 17488 FALSE
## 6 279.413 84 9474 FALSE
# randomly generate train set and test set by ratio 6 : 4
set.seed(2)
group <- sample(1:2, size = nrow(Guerry), replace = TRUE, prob = c(0.6, 0.4))
train.set <- data[group == 1, ]
test.set <- data[group == 2, ]
# display ratio
nrow(train.set) / nrow(test.set)
## [1] 1.457143
6/4
## [1] 1.5
fit <- tree(Label ~ ., data = train.set)
summary(fit)
##
## Classification tree:
## tree(formula = Label ~ ., data = train.set)
## Number of terminal nodes: 5
## Residual mean deviance: 0.5554 = 25.55 / 46
## Misclassification error rate: 0.1176 = 6 / 51
plot(fit)
text(fit, pretty = 0)
pred <- predict(fit, test.set, type = "class")
summary(pred)
## FALSE TRUE
## 4 31
table(pred, test.set$Label) #confusion matrix
##
## pred FALSE TRUE
## FALSE 0 4
## TRUE 9 22