We begin by importing the Boston data set, which contains 506 observations on 14 variables describing housing in the Boston area. The response variable, medv, is the median value of owner-occupied homes in USD 1000's.
We then set the random seed to 20 for reproducibility.
Next, we create 2 data sets: one for testing, with 10% of the original observations (51 rows), and one for training, with the remaining 90% (455 rows).
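The report's own code is not shown, so the split can only be sketched. Below is a minimal Python illustration using the standard library; the function name `train_test_indices` is hypothetical, and Python's random generator will not select the same rows as R does for seed 20. Only the resulting sizes, which follow from the degrees of freedom reported later (441 + 14 and 39 + 12), are meant to match.

```python
import random

def train_test_indices(n_rows, train_fraction, seed):
    """Shuffle row indices reproducibly and split them into train/test lists."""
    rng = random.Random(seed)                 # local generator, global state untouched
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_train = int(n_rows * train_fraction)    # truncates: int(506 * 0.9) == 455
    return sorted(indices[:n_train]), sorted(indices[n_train:])

train_idx, test_idx = train_test_indices(n_rows=506, train_fraction=0.9, seed=20)
print(len(train_idx), len(test_idx))          # prints: 455 51
```

Calling the function again with the same seed returns the same partition, which is the point of fixing the seed before sampling.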
Below is a summary of the Boston training data set.
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.74 Min. :0.00000
## 1st Qu.: 0.08193 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.29090 Median : 0.00 Median : 8.56 Median :0.00000
## Mean : 3.77846 Mean : 10.96 Mean :11.20 Mean :0.07253
## 3rd Qu.: 3.94406 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 6.00 Min. : 1.130
## 1st Qu.:0.4510 1st Qu.:5.893 1st Qu.: 46.05 1st Qu.: 2.075
## Median :0.5380 Median :6.211 Median : 79.80 Median : 3.152
## Mean :0.5565 Mean :6.295 Mean : 69.78 Mean : 3.797
## 3rd Qu.:0.6275 3rd Qu.:6.633 3rd Qu.: 94.35 3rd Qu.: 5.117
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:281.0 1st Qu.:17.40 1st Qu.:373.29
## Median : 5.000 Median :330.0 Median :19.10 Median :391.34
## Mean : 9.774 Mean :411.5 Mean :18.48 Mean :354.03
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.10
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.91 1st Qu.:16.55
## Median :11.34 Median :21.10
## Mean :12.71 Mean :22.42
## 3rd Qu.:17.14 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Below are 2 box plots of the independent variables, showing their medians, 1st and 3rd quartiles, and any outliers (marked in red).
Interpretation: On the one hand, variables such as indus (proportion of non-retail business acres), tax, age, nox (nitrogen oxide concentration), rm (average number of rooms per dwelling) and dis (distance to employment centers) appear to have fairly symmetric distributions with few outliers, except for rm. This suggests that the proportion of non-retail business acres is fairly consistent across towns, the nitrogen oxide concentration is similar across most homes, and most residences have a similar number of rooms, as expected (apart from a few houses with many rooms). Homes also look similar in age and tax rate in the plots, although the summary above shows that the mean and median differ noticeably for both (age: 69.78 vs. 79.80; tax: 411.5 vs. 330.0), so some skew is present. Most residents live a similar distance from employment centers, with a few cases farther away, and the pupil-to-teacher ratio is fairly symmetric with only a few outliers on the higher end.
On the other hand, crim (per capita crime rate), lstat (percentage of lower-status population), zn (proportion of residential land zoned for lots over 25,000 sq. ft.) and black seem to have rather skewed distributions.
The distribution of crim is concentrated near zero, suggesting that most areas report low crime rates, while the many outliers correspond to areas with very high crime rates.
The distribution of zn indicates that most areas have little or no land zoned for large lots, while some areas have substantial large-lot zoning.
The black variable, which is derived from the proportion of Black residents by town, shows many outliers, indicating high variability in that proportion across the Boston area.
lstat also exhibits many outliers, which points to a wide spread in socioeconomic status across the Boston area.
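Because the plots themselves are not reproduced here, the visual reading can be cross-checked against the quartiles in the training summary. The sketch below is a Python standard-library illustration, not part of the original analysis. It computes the quartile (Bowley) skewness, (Q3 + Q1 − 2·median) / (Q3 − Q1), for a few of the variables; the helper name `bowley_skewness` is hypothetical.

```python
def bowley_skewness(q1, median, q3):
    """Quartile (Bowley) skewness: 0 for a symmetric box, positive for right skew."""
    return (q3 + q1 - 2 * median) / (q3 - q1)

# (1st Qu., Median, 3rd Qu.) copied from the training summary above
quartiles = {
    "crim":  (0.08193, 0.29090, 3.94406),
    "nox":   (0.4510, 0.5380, 0.6275),
    "rm":    (5.893, 6.211, 6.633),
    "age":   (46.05, 79.80, 94.35),
    "tax":   (281.0, 330.0, 666.0),
    "lstat": (6.91, 11.34, 17.14),
}

skew = {name: bowley_skewness(*q) for name, q in quartiles.items()}
for name, value in skew.items():
    print(f"{name:6s} {value:+.2f}")
```

On these numbers nox and rm come out close to 0, crim and tax are clearly positive, and age is negative. The measure only uses the quartiles, so it says nothing about the outliers visible in the plots.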
Now we take a look at the distribution of medv (median value of owner-occupied homes in USD 1000's).
Interpretation: The data appear to be right-skewed, with a high concentration of properties valued around $20,000 and fewer properties with high median values. Moreover, there is a slight peak at the upper bound, around $50,000.
##
## Call:
## lm(formula = medv ~ ., data = Boston_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.319 -2.596 -0.451 1.744 26.328
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.89442 5.14766 7.75 6.4e-14 ***
## crim -0.10116 0.03239 -3.12 0.00191 **
## zn 0.04471 0.01433 3.12 0.00193 **
## indus 0.03947 0.06383 0.62 0.53668
## chas 2.24948 0.85909 2.62 0.00914 **
## nox -18.60362 3.82144 -4.87 1.6e-06 ***
## rm 3.65413 0.41838 8.73 < 2e-16 ***
## age 0.00689 0.01402 0.49 0.62325
## dis -1.36995 0.20174 -6.79 3.6e-11 ***
## rad 0.33475 0.06839 4.89 1.4e-06 ***
## tax -0.01457 0.00391 -3.73 0.00022 ***
## ptratio -1.06538 0.13248 -8.04 8.2e-15 ***
## black 0.00847 0.00265 3.20 0.00149 **
## lstat -0.52775 0.05171 -10.21 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.57 on 441 degrees of freedom
## Multiple R-squared: 0.761, Adjusted R-squared: 0.754
## F-statistic: 108 on 13 and 441 DF, p-value: <2e-16
Interpretation: Based on this model, indus (p = 0.537) and age (p = 0.623) are not significantly related to property value once the other predictors are accounted for; every other predictor is significant at the 1% level.
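The t values and p-values in the coefficient table follow from the estimates and standard errors. As a rough check, the Python sketch below recomputes them for indus and age. It assumes a standard normal approximation to the t distribution, which is reasonable with 441 residual degrees of freedom but is not what R computes exactly; the helper name `t_and_p` is hypothetical.

```python
import math

def t_and_p(estimate, std_error):
    """t statistic and a two-sided p-value from the standard normal approximation."""
    t = estimate / std_error
    p = math.erfc(abs(t) / math.sqrt(2))   # equals 2 * (1 - Phi(|t|))
    return t, p

# Estimates and standard errors copied from the coefficient table above
for name, est, se in [("indus", 0.03947, 0.06383), ("age", 0.00689, 0.01402)]:
    t, p = t_and_p(est, se)
    print(f"{name}: t = {t:.2f}, p ~ {p:.3f}")
```

Both approximate p-values land within about 0.001 of the printed ones, far above any conventional significance threshold.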
For this step, we use a stepwise selection procedure based on AIC, starting from the intercept-only model and allowing variables to be added or removed at each step.
## Start: AIC=2021.8
## medv ~ 1
##
## Df Sum of Sq RSS AIC
## + lstat 1 21541 16994 1651
## + rm 1 18934 19600 1716
## + ptratio 1 11187 27347 1868
## + indus 1 9662 28872 1892
## + tax 1 9522 29013 1895
## + nox 1 7330 31205 1928
## + rad 1 6535 32000 1939
## + age 1 6116 32419 1945
## + crim 1 6063 32472 1946
## + zn 1 5225 33310 1957
## + black 1 4602 33933 1966
## + dis 1 2596 35938 1992
## + chas 1 1116 37419 2010
## <none> 38534 2022
##
## Step: AIC=1651.2
## medv ~ lstat
##
## Df Sum of Sq RSS AIC
## + rm 1 3650 13343 1543
## + ptratio 1 2789 14204 1572
## + chas 1 674 16320 1635
## + dis 1 615 16378 1636
## + tax 1 351 16643 1644
## + age 1 285 16709 1646
## + black 1 184 16810 1648
## + crim 1 163 16830 1649
## + zn 1 143 16851 1649
## + indus 1 98 16895 1651
## <none> 16994 1651
## + rad 1 62 16932 1652
## + nox 1 0 16994 1653
## - lstat 1 21541 38534 2022
##
## Step: AIC=1543.2
## medv ~ lstat + rm
##
## Df Sum of Sq RSS AIC
## + ptratio 1 1814 11529 1479
## + tax 1 487 12856 1528
## + black 1 461 12883 1529
## + chas 1 452 12891 1530
## + crim 1 299 13044 1535
## + dis 1 247 13096 1537
## + rad 1 220 13124 1538
## + indus 1 63 13281 1543
## <none> 13343 1543
## + zn 1 52 13292 1543
## + nox 1 29 13314 1544
## + age 1 26 13318 1544
## - rm 1 3650 16994 1651
## - lstat 1 6257 19600 1716
##
## Step: AIC=1478.7
## medv ~ lstat + rm + ptratio
##
## Df Sum of Sq RSS AIC
## + dis 1 368 11162 1466
## + black 1 351 11178 1467
## + chas 1 264 11266 1470
## + crim 1 115 11414 1476
## + tax 1 69 11460 1478
## + age 1 66 11464 1478
## <none> 11529 1479
## + nox 1 51 11479 1479
## + zn 1 16 11513 1480
## + rad 1 1 11528 1481
## + indus 1 0 11529 1481
## - ptratio 1 1814 13343 1543
## - rm 1 2675 14204 1572
## - lstat 1 4583 16112 1629
##
## Step: AIC=1466
## medv ~ lstat + rm + ptratio + dis
##
## Df Sum of Sq RSS AIC
## + nox 1 742 10420 1437
## + black 1 448 10713 1449
## + tax 1 264 10897 1457
## + indus 1 221 10941 1459
## + crim 1 211 10951 1459
## + chas 1 186 10975 1460
## + zn 1 95 11067 1464
## <none> 11162 1466
## + age 1 31 11131 1467
## + rad 1 26 11136 1467
## - dis 1 368 11529 1479
## - ptratio 1 1934 13096 1537
## - rm 1 2289 13451 1549
## - lstat 1 4748 15910 1625
##
## Step: AIC=1436.7
## medv ~ lstat + rm + ptratio + dis + nox
##
## Df Sum of Sq RSS AIC
## + black 1 263 10157 1427
## + chas 1 240 10179 1428
## + crim 1 127 10292 1433
## + zn 1 108 10312 1434
## <none> 10420 1437
## + rad 1 45 10375 1437
## + tax 1 19 10401 1438
## + indus 1 9 10410 1438
## + age 1 1 10418 1439
## - nox 1 742 11162 1466
## - dis 1 1059 11479 1479
## - rm 1 2187 12606 1521
## - ptratio 1 2213 12633 1522
## - lstat 1 3256 13676 1558
##
## Step: AIC=1427
## medv ~ lstat + rm + ptratio + dis + nox + black
##
## Df Sum of Sq RSS AIC
## + chas 1 206 9951 1420
## + zn 1 132 10024 1423
## + rad 1 119 10038 1424
## + crim 1 62 10095 1426
## <none> 10157 1427
## + indus 1 4 10153 1429
## + tax 1 0 10156 1429
## + age 1 0 10156 1429
## - black 1 263 10420 1437
## - nox 1 557 10713 1449
## - dis 1 1012 11169 1468
## - ptratio 1 2078 12235 1510
## - rm 1 2347 12503 1520
## - lstat 1 2746 12902 1534
##
## Step: AIC=1419.7
## medv ~ lstat + rm + ptratio + dis + nox + black + chas
##
## Df Sum of Sq RSS AIC
## + zn 1 142 9809 1415
## + rad 1 123 9827 1416
## + crim 1 50 9901 1419
## <none> 9951 1420
## + indus 1 8 9943 1421
## + age 1 1 9949 1422
## + tax 1 0 9951 1422
## - chas 1 206 10157 1427
## - black 1 229 10179 1428
## - nox 1 608 10559 1445
## - dis 1 957 10908 1460
## - ptratio 1 1899 11849 1497
## - rm 1 2281 12232 1512
## - lstat 1 2686 12637 1526
##
## Step: AIC=1415.2
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + zn
##
## Df Sum of Sq RSS AIC
## + crim 1 82 9726 1413
## + rad 1 82 9727 1413
## <none> 9809 1415
## + indus 1 10 9798 1417
## + tax 1 7 9801 1417
## + age 1 1 9808 1417
## - zn 1 142 9951 1420
## - chas 1 216 10024 1423
## - black 1 251 10060 1425
## - nox 1 616 10425 1441
## - dis 1 1072 10880 1460
## - ptratio 1 1502 11311 1478
## - rm 1 2049 11858 1500
## - lstat 1 2729 12538 1525
##
## Step: AIC=1413.3
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + zn +
## crim
##
## Df Sum of Sq RSS AIC
## + rad 1 193 9533 1406
## <none> 9726 1413
## + indus 1 10 9716 1415
## - crim 1 82 9809 1415
## + age 1 0 9726 1415
## + tax 1 0 9726 1415
## - zn 1 174 9901 1419
## - black 1 184 9910 1420
## - chas 1 202 9928 1421
## - nox 1 575 10302 1437
## - dis 1 1135 10861 1462
## - ptratio 1 1336 11062 1470
## - rm 1 2070 11796 1499
## - lstat 1 2484 12211 1515
##
## Step: AIC=1406.2
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + zn +
## crim + rad
##
## Df Sum of Sq RSS AIC
## + tax 1 307 9226 1393
## <none> 9533 1406
## + indus 1 25 9508 1407
## + age 1 4 9529 1408
## - zn 1 125 9659 1410
## - rad 1 193 9726 1413
## - crim 1 194 9727 1413
## - chas 1 197 9730 1414
## - black 1 243 9776 1416
## - nox 1 755 10288 1439
## - dis 1 1120 10653 1455
## - ptratio 1 1518 11052 1471
## - rm 1 1891 11424 1487
## - lstat 1 2532 12065 1511
##
## Step: AIC=1393.3
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + zn +
## crim + rad + tax
##
## Df Sum of Sq RSS AIC
## <none> 9226 1393
## + indus 1 8 9218 1395
## + age 1 5 9221 1395
## - chas 1 154 9381 1399
## - zn 1 194 9420 1401
## - crim 1 208 9434 1401
## - black 1 217 9443 1402
## - tax 1 307 9533 1406
## - rad 1 500 9726 1415
## - nox 1 511 9738 1416
## - dis 1 1223 10449 1448
## - ptratio 1 1342 10568 1453
## - rm 1 1688 10915 1468
## - lstat 1 2444 11670 1498
Interpretation: The best model has 11 variables and an AIC of 1393.3.
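The AIC values in the stepwise output can be recovered from the RSS column. For linear models, R's `step()` reports n·log(RSS/n) + 2k, where k counts the estimated coefficients including the intercept. The Python sketch below applies that formula to the first and last models in the output; the function name `step_aic` is hypothetical, and small differences from the printed values come from the RSS being rounded in the table.

```python
import math

def step_aic(rss, n_obs, n_coef):
    """AIC as reported by R's step() for a linear model: n*log(RSS/n) + 2*k."""
    return n_obs * math.log(rss / n_obs) + 2 * n_coef

n_train = 455  # 441 residual df + 14 coefficients in the full model

aic_null = step_aic(rss=38534, n_obs=n_train, n_coef=1)    # intercept only
aic_final = step_aic(rss=9226, n_obs=n_train, n_coef=12)   # intercept + 11 predictors
print(round(aic_null, 1), round(aic_final, 1))
```

Because this AIC depends on n, values are only comparable between models fitted to the same rows.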
##
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + black + dis + nox +
## chas + zn + crim + rad + tax, data = Boston_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.375 -2.718 -0.405 1.716 26.602
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.57816 5.12366 7.72 7.6e-14 ***
## lstat -0.51534 0.04758 -10.83 < 2e-16 ***
## rm 3.66671 0.40724 9.00 < 2e-16 ***
## ptratio -1.05036 0.13087 -8.03 9.1e-15 ***
## black 0.00851 0.00264 3.23 0.00135 **
## dis -1.42847 0.18643 -7.66 1.2e-13 ***
## nox -17.44340 3.52054 -4.95 1.0e-06 ***
## chas 2.32069 0.85225 2.72 0.00672 **
## zn 0.04309 0.01412 3.05 0.00241 **
## crim -0.10203 0.03231 -3.16 0.00170 **
## rad 0.32058 0.06543 4.90 1.4e-06 ***
## tax -0.01349 0.00352 -3.84 0.00014 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.56 on 443 degrees of freedom
## Multiple R-squared: 0.761, Adjusted R-squared: 0.755
## F-statistic: 128 on 11 and 443 DF, p-value: <2e-16
Interpretation: With an adjusted R-squared of 0.755 and a significant overall p-value (< 2e-16), we can conclude that our model is significant. Using the estimates above, the regression equation is: medv = 39.578 − 0.515⋅lstat + 3.667⋅rm − 1.050⋅ptratio + 0.00851⋅black − 1.428⋅dis − 17.443⋅nox + 2.321⋅chas + 0.043⋅zn − 0.102⋅crim + 0.321⋅rad − 0.0135⋅tax
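To make the fitted equation concrete, the Python sketch below evaluates it with the coefficients printed in the summary. The input is an illustrative "median home" assembled from the training-set medians in the first summary table. It is not an actual row of the data, and the names `predict_medv` and `median_home` are hypothetical.

```python
# Coefficients copied from the summary of the stepwise model above
intercept = 39.57816
coefficients = {
    "lstat": -0.51534, "rm": 3.66671, "ptratio": -1.05036, "black": 0.00851,
    "dis": -1.42847, "nox": -17.44340, "chas": 2.32069, "zn": 0.04309,
    "crim": -0.10203, "rad": 0.32058, "tax": -0.01349,
}

def predict_medv(features):
    """Linear prediction: intercept plus the sum of coefficient * feature value."""
    return intercept + sum(coefficients[name] * features[name] for name in coefficients)

# Training-set medians from the first summary table (chas and zn have median 0)
median_home = {
    "lstat": 11.34, "rm": 6.211, "ptratio": 19.10, "black": 391.34,
    "dis": 3.152, "nox": 0.538, "chas": 0.0, "zn": 0.0,
    "crim": 0.29090, "rad": 5.0, "tax": 330.0,
}

print(round(predict_medv(median_home), 2))
```

The prediction is about 23, i.e. roughly $23,000, which is in line with the training median of medv (21.10).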
Interpretation: Based on our residual plots, the model appears to meet the key assumptions of linear regression.
Using the final linear model built on 90% of the original data, we now test it on the remaining 10%, the testing data set.
##
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + chas +
## black + rad + crim + tax + zn, data = Boston_test)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.69 -3.62 -0.54 2.26 20.16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.40484 24.31874 0.59 0.557
## lstat -0.39117 0.24404 -1.60 0.117
## rm 5.43350 2.06838 2.63 0.012 *
## ptratio -0.20633 0.55091 -0.37 0.710
## dis -2.88316 1.07125 -2.69 0.010 *
## nox -27.95957 20.56268 -1.36 0.182
## chas 5.95475 4.71038 1.26 0.214
## black 0.02275 0.02047 1.11 0.273
## rad 0.39228 0.28699 1.37 0.179
## crim -0.31270 0.22017 -1.42 0.163
## tax -0.00645 0.01205 -0.53 0.596
## zn 0.08133 0.04971 1.64 0.110
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.03 on 39 degrees of freedom
## Multiple R-squared: 0.656, Adjusted R-squared: 0.559
## F-statistic: 6.75 on 11 and 39 DF, p-value: 3.44e-06
## [1] "Mean Squared Error (MSE): 27.8497109831055"
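Interpretation: Refitting the same formula on the 51 test observations gives an MSE of 27.85, an adjusted R-squared of 0.559 and an F-statistic of 6.75 (p = 3.44e-06). The model is still significant overall, but the fit is noticeably weaker than on the training data, and only rm and dis remain significant at the 5% level, which is not surprising with so few observations. Because this summary comes from a refit on the test set, it describes in-sample fit on that subset rather than predictive accuracy.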
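The MSE printed after the refit can be related to the residual standard error in the same summary. The residual standard error is sqrt(RSS/df), so RSS/n follows from it directly. The Python sketch below does that conversion (the helper name `in_sample_mse` is hypothetical). It suggests, without proving, that the 27.85 is the mean squared residual of the model refit on the test rows.

```python
def in_sample_mse(residual_std_error, residual_df, n_obs):
    """Recover RSS/n from the residual standard error sqrt(RSS/df) printed by summary()."""
    rss = residual_std_error ** 2 * residual_df
    return rss / n_obs

# Refit on the 10% test set: RSE 6.03 on 39 df; 39 + 12 coefficients = 51 rows
mse_refit = in_sample_mse(residual_std_error=6.03, residual_df=39, n_obs=51)
print(round(mse_refit, 2))
```

The result agrees with the printed MSE up to the rounding of the residual standard error to two decimals.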
## [1] "Mean Squared Error (MSE): 23.5689617375491"
Interpretation: Given an MSE of 23.569 (an RMSE of about 4.85, i.e. roughly $4,850), which is small compared to the range of the dependent variable medv (5 to 50), we can conclude that our model performs reasonably well on unseen data.
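An MSE is easier to judge on the scale of medv and against a naive baseline. The Python sketch below takes the square root of the printed MSE and compares the MSE with the mean squared deviation of medv around its mean, read off the intercept-only row of the first stepwise output. That baseline comes from the training rows rather than the test rows, so the comparison is only a rough yardstick.

```python
import math

test_mse = 23.5689617375491        # MSE printed above
rmse = math.sqrt(test_mse)         # back on the scale of medv (USD 1000's)

# Yardstick: mean squared deviation of medv around its training mean,
# taken from the intercept-only model in the stepwise output (RSS 38534, n = 455).
baseline_mse = 38534 / 455

print(f"RMSE: {rmse:.2f}")
print(f"share of baseline MSE: {test_mse / baseline_mse:.2f}")
```

On these figures the typical prediction error is a little under 5 (about $4,850), and the MSE is a bit more than a quarter of what always predicting the mean would give.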
First, we split the Boston data set into a training data set with 80% of the original values and a testing data set with the remaining 20%. Below is a summary of our new training data set.
## crim zn indus chas
## Min. : 0.006 Min. : 0.0 Min. : 0.46 Min. :0.0000
## 1st Qu.: 0.082 1st Qu.: 0.0 1st Qu.: 5.13 1st Qu.:0.0000
## Median : 0.252 Median : 0.0 Median : 9.69 Median :0.0000
## Mean : 3.616 Mean : 11.3 Mean :11.13 Mean :0.0767
## 3rd Qu.: 3.675 3rd Qu.: 12.5 3rd Qu.:18.10 3rd Qu.:0.0000
## Max. :88.976 Max. :100.0 Max. :27.74 Max. :1.0000
## nox rm age dis rad
## Min. :0.385 Min. :3.56 Min. : 2.9 Min. : 1.13 Min. : 1.00
## 1st Qu.:0.448 1st Qu.:5.87 1st Qu.: 43.2 1st Qu.: 2.10 1st Qu.: 4.00
## Median :0.538 Median :6.18 Median : 76.2 Median : 3.27 Median : 5.00
## Mean :0.555 Mean :6.27 Mean : 68.1 Mean : 3.82 Mean : 9.59
## 3rd Qu.:0.624 3rd Qu.:6.62 3rd Qu.: 94.3 3rd Qu.: 5.21 3rd Qu.:24.00
## Max. :0.871 Max. :8.78 Max. :100.0 Max. :12.13 Max. :24.00
## tax ptratio black lstat medv
## Min. :187 Min. :12.6 Min. : 0.32 Min. : 1.73 Min. : 5.0
## 1st Qu.:277 1st Qu.:17.4 1st Qu.:376.00 1st Qu.: 6.77 1st Qu.:17.1
## Median :334 Median :19.0 Median :391.77 Median :11.36 Median :21.1
## Mean :409 Mean :18.5 Mean :360.54 Mean :12.60 Mean :22.6
## 3rd Qu.:666 3rd Qu.:20.2 3rd Qu.:396.90 3rd Qu.:16.91 3rd Qu.:25.1
## Max. :711 Max. :22.0 Max. :396.90 Max. :37.97 Max. :50.0
##
## Call:
## lm(formula = medv ~ ., data = Boston_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.319 -2.596 -0.451 1.744 26.328
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.89442 5.14766 7.75 6.4e-14 ***
## crim -0.10116 0.03239 -3.12 0.00191 **
## zn 0.04471 0.01433 3.12 0.00193 **
## indus 0.03947 0.06383 0.62 0.53668
## chas 2.24948 0.85909 2.62 0.00914 **
## nox -18.60362 3.82144 -4.87 1.6e-06 ***
## rm 3.65413 0.41838 8.73 < 2e-16 ***
## age 0.00689 0.01402 0.49 0.62325
## dis -1.36995 0.20174 -6.79 3.6e-11 ***
## rad 0.33475 0.06839 4.89 1.4e-06 ***
## tax -0.01457 0.00391 -3.73 0.00022 ***
## ptratio -1.06538 0.13248 -8.04 8.2e-15 ***
## black 0.00847 0.00265 3.20 0.00149 **
## lstat -0.52775 0.05171 -10.21 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.57 on 441 degrees of freedom
## Multiple R-squared: 0.761, Adjusted R-squared: 0.754
## F-statistic: 108 on 13 and 441 DF, p-value: <2e-16
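Interpretation: Both indus and age again appear to be insignificant predictors of median property value, as in model 1. Note that the call above uses Boston_train (441 residual degrees of freedom), so this output is identical to the full model from the 90% split; refitting on Boston_train.2 would be needed to confirm the result for the 80% split.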
## Start: AIC=1811.1
## medv ~ 1
##
## Df Sum of Sq RSS AIC
## + lstat 1 19742 15832 1486
## + rm 1 16843 18731 1554
## + ptratio 1 9515 26059 1687
## + indus 1 7669 27905 1715
## + tax 1 7655 27919 1715
## + nox 1 6430 29144 1733
## + rad 1 5274 30300 1748
## + crim 1 5142 30432 1750
## + zn 1 4835 30739 1754
## + age 1 4794 30780 1755
## + black 1 3815 31759 1767
## + dis 1 2132 33442 1788
## + chas 1 1322 34252 1798
## <none> 35574 1811
##
## Step: AIC=1486
## medv ~ lstat
##
## Df Sum of Sq RSS AIC
## + rm 1 2992 12840 1403
## + ptratio 1 2217 13615 1427
## + chas 1 869 14964 1465
## + dis 1 707 15125 1470
## + black 1 285 15548 1481
## + age 1 282 15551 1481
## + tax 1 187 15645 1483
## + zn 1 161 15672 1484
## + crim 1 107 15725 1485
## <none> 15832 1486
## + indus 1 25 15808 1487
## + rad 1 13 15820 1488
## + nox 1 10 15822 1488
## - lstat 1 19742 35574 1811
##
## Step: AIC=1403.4
## medv ~ lstat + rm
##
## Df Sum of Sq RSS AIC
## + ptratio 1 1519 11322 1355
## + chas 1 602 12238 1386
## + black 1 508 12332 1389
## + dis 1 393 12448 1393
## + tax 1 279 12561 1397
## + crim 1 234 12606 1398
## + rad 1 104 12737 1402
## <none> 12840 1403
## + age 1 56 12785 1404
## + zn 1 50 12791 1404
## + indus 1 14 12826 1405
## + nox 1 1 12839 1405
## - rm 1 2992 15832 1486
## - lstat 1 5891 18731 1554
##
## Step: AIC=1354.5
## medv ~ lstat + rm + ptratio
##
## Df Sum of Sq RSS AIC
## + dis 1 451 10870 1340
## + chas 1 420 10901 1341
## + black 1 389 10933 1342
## + age 1 113 11209 1353
## + crim 1 94 11228 1353
## <none> 11322 1355
## + rad 1 22 11299 1356
## + tax 1 13 11309 1356
## + indus 1 12 11310 1356
## + nox 1 8 11313 1356
## + zn 1 6 11315 1356
## - ptratio 1 1519 12840 1403
## - rm 1 2294 13615 1427
## - lstat 1 4337 15659 1484
##
## Step: AIC=1340.1
## medv ~ lstat + rm + ptratio + dis
##
## Df Sum of Sq RSS AIC
## + nox 1 548 10323 1321
## + black 1 508 10362 1323
## + chas 1 308 10562 1331
## + crim 1 182 10688 1335
## + zn 1 165 10705 1336
## + tax 1 145 10725 1337
## + indus 1 137 10734 1337
## <none> 10870 1340
## + age 1 17 10854 1342
## + rad 1 5 10866 1342
## - dis 1 451 11322 1355
## - ptratio 1 1577 12448 1393
## - rm 1 1998 12868 1406
## - lstat 1 4671 15541 1483
##
## Step: AIC=1321.2
## medv ~ lstat + rm + ptratio + dis + nox
##
## Df Sum of Sq RSS AIC
## + chas 1 335 9988 1310
## + black 1 327 9996 1310
## + zn 1 172 10151 1316
## + crim 1 113 10210 1319
## + rad 1 71 10252 1320
## <none> 10323 1321
## + age 1 7 10315 1323
## + indus 1 2 10320 1323
## + tax 1 1 10321 1323
## - nox 1 548 10870 1340
## - dis 1 991 11313 1356
## - ptratio 1 1748 12071 1382
## - rm 1 1937 12260 1389
## - lstat 1 3303 13625 1431
##
## Step: AIC=1309.9
## medv ~ lstat + rm + ptratio + dis + nox + chas
##
## Df Sum of Sq RSS AIC
## + black 1 276 9713 1301
## + zn 1 189 9799 1304
## + crim 1 89 9899 1308
## + rad 1 76 9912 1309
## <none> 9988 1310
## + indus 1 7 9981 1312
## + age 1 1 9987 1312
## + tax 1 0 9988 1312
## - chas 1 335 10323 1321
## - nox 1 574 10562 1331
## - dis 1 884 10872 1342
## - ptratio 1 1564 11552 1367
## - rm 1 1842 11830 1376
## - lstat 1 3219 13208 1421
##
## Step: AIC=1300.6
## medv ~ lstat + rm + ptratio + dis + nox + chas + black
##
## Df Sum of Sq RSS AIC
## + zn 1 215 9498 1294
## + rad 1 164 9549 1296
## <none> 9713 1301
## + crim 1 41 9672 1301
## + tax 1 12 9700 1302
## + indus 1 3 9710 1303
## + age 1 0 9713 1303
## - black 1 276 9988 1310
## - chas 1 283 9996 1310
## - nox 1 397 10110 1315
## - dis 1 842 10554 1332
## - ptratio 1 1448 11161 1355
## - rm 1 1977 11690 1373
## - lstat 1 2887 12600 1404
##
## Step: AIC=1293.6
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn
##
## Df Sum of Sq RSS AIC
## + rad 1 106 9391 1291
## + crim 1 72 9425 1292
## <none> 9498 1294
## + indus 1 3 9495 1295
## + age 1 2 9496 1296
## + tax 1 0 9498 1296
## - zn 1 215 9713 1301
## - chas 1 298 9796 1304
## - black 1 301 9799 1304
## - nox 1 397 9895 1308
## - ptratio 1 1054 10552 1334
## - dis 1 1055 10553 1334
## - rm 1 1700 11198 1358
## - lstat 1 2969 12467 1401
##
## Step: AIC=1291
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn +
## rad
##
## Df Sum of Sq RSS AIC
## + crim 1 184 9207 1285
## + tax 1 174 9217 1285
## <none> 9391 1291
## + age 1 7 9384 1293
## + indus 1 7 9384 1293
## - rad 1 106 9498 1294
## - zn 1 157 9549 1296
## - chas 1 295 9686 1302
## - black 1 368 9760 1305
## - nox 1 501 9892 1310
## - dis 1 1009 10400 1330
## - ptratio 1 1125 10516 1335
## - rm 1 1561 10953 1351
## - lstat 1 3056 12447 1403
##
## Step: AIC=1285
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn +
## rad + crim
##
## Df Sum of Sq RSS AIC
## + tax 1 182 9025 1279
## <none> 9207 1285
## + indus 1 11 9197 1287
## + age 1 9 9199 1287
## - zn 1 183 9390 1291
## - crim 1 184 9391 1291
## - rad 1 218 9425 1292
## - chas 1 269 9477 1295
## - black 1 311 9518 1296
## - nox 1 544 9751 1306
## - dis 1 1087 10295 1328
## - ptratio 1 1152 10360 1331
## - rm 1 1564 10772 1346
## - lstat 1 2720 11928 1388
##
## Step: AIC=1279
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn +
## rad + crim + tax
##
## Df Sum of Sq RSS AIC
## <none> 9025 1279
## + age 1 12 9013 1280
## + indus 1 8 9017 1281
## - tax 1 182 9207 1285
## - crim 1 192 9217 1285
## - chas 1 231 9256 1287
## - zn 1 236 9261 1287
## - black 1 292 9317 1290
## - nox 1 371 9396 1293
## - rad 1 393 9418 1294
## - ptratio 1 997 10022 1319
## - dis 1 1148 10173 1325
## - rm 1 1434 10459 1337
## - lstat 1 2698 11723 1383
Interpretation: The model with the lowest AIC score (1279) includes the following independent variables: lstat, rm, ptratio, dis, nox, chas, black, zn, rad, crim and tax.
Having the lowest AIC among the candidate models suggests that this model best balances goodness of fit and simplicity, and it is consistent with our first model, which also excluded age and indus as predictors of median property value.
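Why tax enters the model while age and indus stay out can be read from the RSS column. Adding one term changes the AIC reported by `step()` by n·log(RSS_after/RSS_before) + 2, so the drop in RSS has to outweigh the penalty of 2. The Python sketch below evaluates this for the last two steps of the output; the function name `delta_aic_add_one` is hypothetical.

```python
import math

def delta_aic_add_one(rss_before, rss_after, n_obs):
    """Change in step() AIC when one term is added: n*log(RSS_after/RSS_before) + 2."""
    return n_obs * math.log(rss_after / rss_before) + 2

n_train_2 = 404  # 392 residual df + 12 coefficients in the final model

# RSS values copied from the last two steps of the output above
print(round(delta_aic_add_one(9207, 9025, n_train_2), 1))  # adding tax
print(round(delta_aic_add_one(9025, 9013, n_train_2), 1))  # adding age
print(round(delta_aic_add_one(9025, 9017, n_train_2), 1))  # adding indus
```

Adding tax lowers the AIC by about 6, matching the move from 1285 to 1279, whereas adding age or indus would raise it, which is why the procedure stops.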
Having determined which variables to use for the regression model, we construct the new linear model.
##
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + black + dis + nox +
## chas + zn + crim + rad + tax, data = Boston_train.2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.946 -2.716 -0.432 1.911 25.114
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.91948 5.60435 6.59 1.4e-10 ***
## lstat -0.57903 0.05349 -10.82 < 2e-16 ***
## rm 3.54605 0.44933 7.89 3.0e-14 ***
## ptratio -0.96698 0.14697 -6.58 1.5e-10 ***
## black 0.01135 0.00319 3.56 0.00041 ***
## dis -1.44048 0.20399 -7.06 7.6e-12 ***
## nox -15.95473 3.97687 -4.01 7.2e-05 ***
## chas 2.93341 0.92589 3.17 0.00165 **
## zn 0.04889 0.01526 3.20 0.00146 **
## crim -0.09967 0.03449 -2.89 0.00407 **
## rad 0.29125 0.07051 4.13 4.4e-05 ***
## tax -0.01053 0.00374 -2.81 0.00514 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.8 on 392 degrees of freedom
## Multiple R-squared: 0.746, Adjusted R-squared: 0.739
## F-statistic: 105 on 11 and 392 DF, p-value: <2e-16
Interpretation: An adjusted R-squared of 0.739 indicates that model 2 explains about 74% of the variation in medv in the training data, making it a relatively strong model.
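The fit statistics in this summary can be reproduced from two numbers in the stepwise output: the RSS of the intercept-only model, which is the total sum of squares, and the RSS of the final model. The Python sketch below applies the usual definitions; the function name `fit_summary` is hypothetical.

```python
import math

def fit_summary(tss, rss, n_obs, n_predictors):
    """R-squared, adjusted R-squared, F statistic and residual standard error."""
    df_resid = n_obs - n_predictors - 1
    r2 = 1 - rss / tss
    adj_r2 = 1 - (rss / df_resid) / (tss / (n_obs - 1))
    f_stat = ((tss - rss) / n_predictors) / (rss / df_resid)
    rse = math.sqrt(rss / df_resid)
    return r2, adj_r2, f_stat, rse

# TSS is the RSS of the intercept-only model; both values come from the stepwise output
r2, adj_r2, f_stat, rse = fit_summary(tss=35574, rss=9025, n_obs=404, n_predictors=11)
print(f"R2 = {r2:.3f}, adj R2 = {adj_r2:.3f}, F = {f_stat:.0f}, RSE = {rse:.1f}")
```

The four values agree with the summary above (0.746, 0.739, 105 and 4.8), which confirms that the stepwise search and the final fit used the same 404 training rows.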
##Test on the testing data set.
Using the final linear model built on 80% of the original data, we now test it on the remaining 20%, the testing data set.
##
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + chas +
## black + rad + crim + tax + zn, data = Boston_test.2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.762 -2.759 -0.921 1.801 13.006
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.12551 12.68390 2.53 0.01305 *
## lstat -0.28257 0.10645 -2.65 0.00939 **
## rm 4.85354 1.01080 4.80 6.2e-06 ***
## ptratio -0.78210 0.27383 -2.86 0.00532 **
## dis -1.94525 0.47980 -4.05 0.00011 ***
## nox -23.17097 7.94751 -2.92 0.00448 **
## chas 1.25285 2.37714 0.53 0.59946
## black 0.00503 0.00513 0.98 0.32993
## rad 0.40464 0.15750 2.57 0.01184 *
## crim -0.18639 0.11623 -1.60 0.11230
## tax -0.02054 0.00800 -2.57 0.01192 *
## zn 0.05810 0.03037 1.91 0.05889 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.4 on 90 degrees of freedom
## Multiple R-squared: 0.755, Adjusted R-squared: 0.725
## F-statistic: 25.2 on 11 and 90 DF, p-value: <2e-16
## [1] "Mean Squared Error (MSE): 17.0986009862146"
Interpretation: An MSE of 17.1, an adjusted R-squared of 0.725 and an F-statistic of 25.2 seem to indicate that our second model is also relatively strong at predicting median property values in Boston.
## [1] "Mean Squared Error (MSE): 23.492531982248"
Interpretation: Given a low MSE of 23.493 (an RMSE of about 4.85), we can conclude that both models 1 and 2 are reasonably strong predictors of median property value in Boston, with lstat, rm, ptratio, dis, nox, chas, black, rad, crim, tax and zn as useful predictors.