Boston Housing data.

Exploratory Analysis

We begin by importing the Boston data set, which contains 506 observations on 14 variables. The response variable, medv, is the median value of owner-occupied homes in thousands of USD.

We then set the random seed to 20 for reproducibility.

Sampling

We then create two data sets: a training set with 90% of the original observations and a testing set with the remaining 10%.
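The split described above can be sketched as follows. This is a minimal sketch, assuming the data come from the MASS package and that the training rows are drawn with `sample()`; the exact call used for this report is not shown, so the rows selected here may differ from those behind the output below.

```r
library(MASS)  # provides the Boston data frame

set.seed(20)   # seed stated in the text

# Draw 90% of the row indices for training; the remaining rows form the test set.
n <- nrow(Boston)
train_idx <- sample(n, size = floor(0.90 * n))
Boston_train <- Boston[train_idx, ]
Boston_test  <- Boston[-train_idx, ]

dim(Boston_train)
dim(Boston_test)
```

With 506 rows, `floor(0.90 * n)` gives 455 training rows and leaves 51 test rows, which matches the degrees of freedom reported in the model summaries.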

Below is a summary of the Boston training data set.

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.74   Min.   :0.00000  
##  1st Qu.: 0.08193   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.29090   Median :  0.00   Median : 8.56   Median :0.00000  
##  Mean   : 3.77846   Mean   : 10.96   Mean   :11.20   Mean   :0.07253  
##  3rd Qu.: 3.94406   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  6.00   Min.   : 1.130  
##  1st Qu.:0.4510   1st Qu.:5.893   1st Qu.: 46.05   1st Qu.: 2.075  
##  Median :0.5380   Median :6.211   Median : 79.80   Median : 3.152  
##  Mean   :0.5565   Mean   :6.295   Mean   : 69.78   Mean   : 3.797  
##  3rd Qu.:0.6275   3rd Qu.:6.633   3rd Qu.: 94.35   3rd Qu.: 5.117  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:281.0   1st Qu.:17.40   1st Qu.:373.29  
##  Median : 5.000   Median :330.0   Median :19.10   Median :391.34  
##  Mean   : 9.774   Mean   :411.5   Mean   :18.48   Mean   :354.03  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.10  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.91   1st Qu.:16.55  
##  Median :11.34   Median :21.10  
##  Mean   :12.71   Mean   :22.42  
##  3rd Qu.:17.14   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Below are two box plots of the independent variables, showing their medians, first and third quartiles, and any outliers (marked in red).

Interpretation: Variables such as indus (proportion of non-retail business acres), tax, age, nox (nitrogen oxide concentration), rm (average rooms per dwelling) and dis (distance to employment centers) appear to have fairly symmetric distributions with few outliers, except for rm. This suggests that the proportion of non-retail business acres in Boston is fairly consistent, that the nitrogen oxide concentration is similar across most homes, that most residences have a similar number of rooms (apart from a few houses with many rooms) and are similar in age, and that most residents live a similar distance from employment centers, with a few living farther away. Most properties also have similar tax rates, and the pupil-teacher ratio is fairly symmetric, with only a few outliers on the higher end.

Interpretation: The distribution of medv appears to be right-skewed, with a high concentration of properties valued in the $20,000 range and fewer properties with high median values. There also seems to be a slight peak at the upper bound, around $50,000.

Linear Regression - Model Building

## 
## Call:
## lm(formula = medv ~ ., data = Boston_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.319  -2.596  -0.451   1.744  26.328 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.89442    5.14766    7.75  6.4e-14 ***
## crim         -0.10116    0.03239   -3.12  0.00191 ** 
## zn            0.04471    0.01433    3.12  0.00193 ** 
## indus         0.03947    0.06383    0.62  0.53668    
## chas          2.24948    0.85909    2.62  0.00914 ** 
## nox         -18.60362    3.82144   -4.87  1.6e-06 ***
## rm            3.65413    0.41838    8.73  < 2e-16 ***
## age           0.00689    0.01402    0.49  0.62325    
## dis          -1.36995    0.20174   -6.79  3.6e-11 ***
## rad           0.33475    0.06839    4.89  1.4e-06 ***
## tax          -0.01457    0.00391   -3.73  0.00022 ***
## ptratio      -1.06538    0.13248   -8.04  8.2e-15 ***
## black         0.00847    0.00265    3.20  0.00149 ** 
## lstat        -0.52775    0.05171  -10.21  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.57 on 441 degrees of freedom
## Multiple R-squared:  0.761,  Adjusted R-squared:  0.754 
## F-statistic:  108 on 13 and 441 DF,  p-value: <2e-16

Interpretation: Based on this model, indus and age are not statistically significant predictors of property value once the other variables are included (p-values of 0.537 and 0.623, respectively).
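A fit like the one summarised above can be reproduced roughly as follows. This is a sketch under the same assumptions as before (MASS data, a `sample()`-based 90% split); the object names `full_model`, `coef_table` and `p_values` are illustrative, and the 0.05 cut-off is the conventional threshold rather than one stated in the text.

```r
library(MASS)

set.seed(20)
train_idx <- sample(nrow(Boston), size = floor(0.90 * nrow(Boston)))
Boston_train <- Boston[train_idx, ]

# Regress medv on all 13 remaining variables.
full_model <- lm(medv ~ ., data = Boston_train)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|).
coef_table <- coef(summary(full_model))

# Predictors (intercept excluded) whose p-value exceeds 0.05.
p_values <- coef_table[-1, "Pr(>|t|)"]
names(p_values)[p_values > 0.05]
```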

Variable Selection

For this step, we use stepwise selection based on AIC, starting from the intercept-only model and considering both additions and removals of variables at each step.

## Start:  AIC=2021.8
## medv ~ 1
## 
##           Df Sum of Sq   RSS  AIC
## + lstat    1     21541 16994 1651
## + rm       1     18934 19600 1716
## + ptratio  1     11187 27347 1868
## + indus    1      9662 28872 1892
## + tax      1      9522 29013 1895
## + nox      1      7330 31205 1928
## + rad      1      6535 32000 1939
## + age      1      6116 32419 1945
## + crim     1      6063 32472 1946
## + zn       1      5225 33310 1957
## + black    1      4602 33933 1966
## + dis      1      2596 35938 1992
## + chas     1      1116 37419 2010
## <none>                 38534 2022
## 
## Step:  AIC=1651.2
## medv ~ lstat
## 
##           Df Sum of Sq   RSS  AIC
## + rm       1      3650 13343 1543
## + ptratio  1      2789 14204 1572
## + chas     1       674 16320 1635
## + dis      1       615 16378 1636
## + tax      1       351 16643 1644
## + age      1       285 16709 1646
## + black    1       184 16810 1648
## + crim     1       163 16830 1649
## + zn       1       143 16851 1649
## + indus    1        98 16895 1651
## <none>                 16994 1651
## + rad      1        62 16932 1652
## + nox      1         0 16994 1653
## - lstat    1     21541 38534 2022
## 
## Step:  AIC=1543.2
## medv ~ lstat + rm
## 
##           Df Sum of Sq   RSS  AIC
## + ptratio  1      1814 11529 1479
## + tax      1       487 12856 1528
## + black    1       461 12883 1529
## + chas     1       452 12891 1530
## + crim     1       299 13044 1535
## + dis      1       247 13096 1537
## + rad      1       220 13124 1538
## + indus    1        63 13281 1543
## <none>                 13343 1543
## + zn       1        52 13292 1543
## + nox      1        29 13314 1544
## + age      1        26 13318 1544
## - rm       1      3650 16994 1651
## - lstat    1      6257 19600 1716
## 
## Step:  AIC=1478.7
## medv ~ lstat + rm + ptratio
## 
##           Df Sum of Sq   RSS  AIC
## + dis      1       368 11162 1466
## + black    1       351 11178 1467
## + chas     1       264 11266 1470
## + crim     1       115 11414 1476
## + tax      1        69 11460 1478
## + age      1        66 11464 1478
## <none>                 11529 1479
## + nox      1        51 11479 1479
## + zn       1        16 11513 1480
## + rad      1         1 11528 1481
## + indus    1         0 11529 1481
## - ptratio  1      1814 13343 1543
## - rm       1      2675 14204 1572
## - lstat    1      4583 16112 1629
## 
## Step:  AIC=1466
## medv ~ lstat + rm + ptratio + dis
## 
##           Df Sum of Sq   RSS  AIC
## + nox      1       742 10420 1437
## + black    1       448 10713 1449
## + tax      1       264 10897 1457
## + indus    1       221 10941 1459
## + crim     1       211 10951 1459
## + chas     1       186 10975 1460
## + zn       1        95 11067 1464
## <none>                 11162 1466
## + age      1        31 11131 1467
## + rad      1        26 11136 1467
## - dis      1       368 11529 1479
## - ptratio  1      1934 13096 1537
## - rm       1      2289 13451 1549
## - lstat    1      4748 15910 1625
## 
## Step:  AIC=1436.7
## medv ~ lstat + rm + ptratio + dis + nox
## 
##           Df Sum of Sq   RSS  AIC
## + black    1       263 10157 1427
## + chas     1       240 10179 1428
## + crim     1       127 10292 1433
## + zn       1       108 10312 1434
## <none>                 10420 1437
## + rad      1        45 10375 1437
## + tax      1        19 10401 1438
## + indus    1         9 10410 1438
## + age      1         1 10418 1439
## - nox      1       742 11162 1466
## - dis      1      1059 11479 1479
## - rm       1      2187 12606 1521
## - ptratio  1      2213 12633 1522
## - lstat    1      3256 13676 1558
## 
## Step:  AIC=1427
## medv ~ lstat + rm + ptratio + dis + nox + black
## 
##           Df Sum of Sq   RSS  AIC
## + chas     1       206  9951 1420
## + zn       1       132 10024 1423
## + rad      1       119 10038 1424
## + crim     1        62 10095 1426
## <none>                 10157 1427
## + indus    1         4 10153 1429
## + tax      1         0 10156 1429
## + age      1         0 10156 1429
## - black    1       263 10420 1437
## - nox      1       557 10713 1449
## - dis      1      1012 11169 1468
## - ptratio  1      2078 12235 1510
## - rm       1      2347 12503 1520
## - lstat    1      2746 12902 1534
## 
## Step:  AIC=1419.7
## medv ~ lstat + rm + ptratio + dis + nox + black + chas
## 
##           Df Sum of Sq   RSS  AIC
## + zn       1       142  9809 1415
## + rad      1       123  9827 1416
## + crim     1        50  9901 1419
## <none>                  9951 1420
## + indus    1         8  9943 1421
## + age      1         1  9949 1422
## + tax      1         0  9951 1422
## - chas     1       206 10157 1427
## - black    1       229 10179 1428
## - nox      1       608 10559 1445
## - dis      1       957 10908 1460
## - ptratio  1      1899 11849 1497
## - rm       1      2281 12232 1512
## - lstat    1      2686 12637 1526
## 
## Step:  AIC=1415.2
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + zn
## 
##           Df Sum of Sq   RSS  AIC
## + crim     1        82  9726 1413
## + rad      1        82  9727 1413
## <none>                  9809 1415
## + indus    1        10  9798 1417
## + tax      1         7  9801 1417
## + age      1         1  9808 1417
## - zn       1       142  9951 1420
## - chas     1       216 10024 1423
## - black    1       251 10060 1425
## - nox      1       616 10425 1441
## - dis      1      1072 10880 1460
## - ptratio  1      1502 11311 1478
## - rm       1      2049 11858 1500
## - lstat    1      2729 12538 1525
## 
## Step:  AIC=1413.3
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + zn + 
##     crim
## 
##           Df Sum of Sq   RSS  AIC
## + rad      1       193  9533 1406
## <none>                  9726 1413
## + indus    1        10  9716 1415
## - crim     1        82  9809 1415
## + age      1         0  9726 1415
## + tax      1         0  9726 1415
## - zn       1       174  9901 1419
## - black    1       184  9910 1420
## - chas     1       202  9928 1421
## - nox      1       575 10302 1437
## - dis      1      1135 10861 1462
## - ptratio  1      1336 11062 1470
## - rm       1      2070 11796 1499
## - lstat    1      2484 12211 1515
## 
## Step:  AIC=1406.2
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + zn + 
##     crim + rad
## 
##           Df Sum of Sq   RSS  AIC
## + tax      1       307  9226 1393
## <none>                  9533 1406
## + indus    1        25  9508 1407
## + age      1         4  9529 1408
## - zn       1       125  9659 1410
## - rad      1       193  9726 1413
## - crim     1       194  9727 1413
## - chas     1       197  9730 1414
## - black    1       243  9776 1416
## - nox      1       755 10288 1439
## - dis      1      1120 10653 1455
## - ptratio  1      1518 11052 1471
## - rm       1      1891 11424 1487
## - lstat    1      2532 12065 1511
## 
## Step:  AIC=1393.3
## medv ~ lstat + rm + ptratio + dis + nox + black + chas + zn + 
##     crim + rad + tax
## 
##           Df Sum of Sq   RSS  AIC
## <none>                  9226 1393
## + indus    1         8  9218 1395
## + age      1         5  9221 1395
## - chas     1       154  9381 1399
## - zn       1       194  9420 1401
## - crim     1       208  9434 1401
## - black    1       217  9443 1402
## - tax      1       307  9533 1406
## - rad      1       500  9726 1415
## - nox      1       511  9738 1416
## - dis      1      1223 10449 1448
## - ptratio  1      1342 10568 1453
## - rm       1      1688 10915 1468
## - lstat    1      2444 11670 1498

Interpretation: The best model has 11 variables and an AIC of 1393.3.
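A trace of this kind is what `step()` prints when it starts from the intercept-only model and may both add and drop terms. The sketch below assumes that setup and the same `sample()`-based split as earlier; `null_model`, `full_model` and `step_model` are illustrative names.

```r
library(MASS)

set.seed(20)
train_idx <- sample(nrow(Boston), size = floor(0.90 * nrow(Boston)))
Boston_train <- Boston[train_idx, ]

null_model <- lm(medv ~ 1, data = Boston_train)  # starting point
full_model <- lm(medv ~ ., data = Boston_train)  # largest model allowed

# Search between the two models, adding or removing one term per step.
step_model <- step(
  null_model,
  scope = list(lower = formula(null_model), upper = formula(full_model)),
  direction = "both",
  trace = 0  # set to 1 to print every step
)

formula(step_model)
extractAIC(step_model)  # equivalent degrees of freedom, then AIC
```

Starting from the null model keeps the search cheap, and allowing both directions lets a variable that entered early be dropped again if later additions make it redundant.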

## 
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + black + dis + nox + 
##     chas + zn + crim + rad + tax, data = Boston_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.375  -2.718  -0.405   1.716  26.602 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.57816    5.12366    7.72  7.6e-14 ***
## lstat        -0.51534    0.04758  -10.83  < 2e-16 ***
## rm            3.66671    0.40724    9.00  < 2e-16 ***
## ptratio      -1.05036    0.13087   -8.03  9.1e-15 ***
## black         0.00851    0.00264    3.23  0.00135 ** 
## dis          -1.42847    0.18643   -7.66  1.2e-13 ***
## nox         -17.44340    3.52054   -4.95  1.0e-06 ***
## chas          2.32069    0.85225    2.72  0.00672 ** 
## zn            0.04309    0.01412    3.05  0.00241 ** 
## crim         -0.10203    0.03231   -3.16  0.00170 ** 
## rad           0.32058    0.06543    4.90  1.4e-06 ***
## tax          -0.01349    0.00352   -3.84  0.00014 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.56 on 443 degrees of freedom
## Multiple R-squared:  0.761,  Adjusted R-squared:  0.755 
## F-statistic:  128 on 11 and 443 DF,  p-value: <2e-16

Interpretation: With an adjusted R-squared of 0.755 and a significant overall F-test (p-value < 2e-16), we can conclude that our model is significant. Using the coefficient estimates above, the regression equation is: medv = 39.578 − 0.515⋅lstat + 3.667⋅rm − 1.050⋅ptratio + 0.00851⋅black − 1.428⋅dis − 17.443⋅nox + 2.321⋅chas + 0.0431⋅zn − 0.102⋅crim + 0.321⋅rad − 0.0135⋅tax
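To illustrate how a fitted equation of this form turns predictor values into a predicted medv, the sketch below multiplies the coefficient estimates from the summary table above by the values of one tract and sums the products. The tract is hypothetical: its values are chosen near the training medians purely for illustration.

```r
# Coefficient estimates copied from the summary table above.
beta <- c(
  "(Intercept)" = 39.57816, lstat = -0.51534, rm = 3.66671,
  ptratio = -1.05036, black = 0.00851, dis = -1.42847,
  nox = -17.44340, chas = 2.32069, zn = 0.04309,
  crim = -0.10203, rad = 0.32058, tax = -0.01349
)

# Hypothetical tract (illustrative values, not a row of the data set).
x <- c(
  "(Intercept)" = 1, lstat = 11.34, rm = 6.211, ptratio = 19.10,
  black = 391.34, dis = 3.152, nox = 0.538, chas = 0, zn = 0,
  crim = 0.2909, rad = 5, tax = 330
)

# Predicted medv, in $1000s: sum of coefficient times predictor value.
medv_hat <- sum(beta * x[names(beta)])
medv_hat
```

For this tract the prediction is roughly 23, that is, about $23,000, which is close to the median medv in the training summary.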

Residual Diagnosis

Interpretation: Based on our residual plots, we can conclude that our model meets the key assumptions of a linear regression.
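Residual plots of this kind are commonly produced with the `plot()` method for `lm` objects. The sketch below assumes that approach and the same `sample()`-based split as earlier; `final_model` is an illustrative name for the 11-variable model.

```r
library(MASS)

set.seed(20)
train_idx <- sample(nrow(Boston), size = floor(0.90 * nrow(Boston)))
Boston_train <- Boston[train_idx, ]

final_model <- lm(
  medv ~ lstat + rm + ptratio + black + dis + nox + chas + zn + crim + rad + tax,
  data = Boston_train
)

# Default diagnostics: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage.
par(mfrow = c(2, 2))
plot(final_model)
par(mfrow = c(1, 1))
```

The first and third panels help check linearity and constant variance, the Q-Q plot checks normality of the residuals, and the leverage plot highlights influential observations.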

Test on testing data set

Using the final linear model built on 90% of the original data, we now test it on the remaining 10% (the testing data set).

## 
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + chas + 
##     black + rad + crim + tax + zn, data = Boston_test)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8.69  -3.62  -0.54   2.26  20.16 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  14.40484   24.31874    0.59    0.557  
## lstat        -0.39117    0.24404   -1.60    0.117  
## rm            5.43350    2.06838    2.63    0.012 *
## ptratio      -0.20633    0.55091   -0.37    0.710  
## dis          -2.88316    1.07125   -2.69    0.010 *
## nox         -27.95957   20.56268   -1.36    0.182  
## chas          5.95475    4.71038    1.26    0.214  
## black         0.02275    0.02047    1.11    0.273  
## rad           0.39228    0.28699    1.37    0.179  
## crim         -0.31270    0.22017   -1.42    0.163  
## tax          -0.00645    0.01205   -0.53    0.596  
## zn            0.08133    0.04971    1.64    0.110  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.03 on 39 degrees of freedom
## Multiple R-squared:  0.656,  Adjusted R-squared:  0.559 
## F-statistic: 6.75 on 11 and 39 DF,  p-value: 3.44e-06
## [1] "Mean Squared Error (MSE): 27.8497109831055"

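Interpretation: The output above reports an MSE of 27.85, an adjusted R-squared of 0.559, and an F-statistic of 6.75. Because the call uses data = Boston_test, these statistics describe the 11-variable model refit on the 51 test observations. The fit is noticeably weaker than on the training data (adjusted R-squared of 0.755), and only rm and dis remain significant at the 5% level, which is not surprising for such a small sample.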
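One way to measure error on data the model was not fitted to is to keep the coefficients estimated on the training set and call `predict()` on the test set. The sketch below assumes the same `sample()`-based split as earlier; `final_model`, `pred` and `test_mse` are illustrative names, and the resulting value is not expected to equal the MSE printed above.

```r
library(MASS)

set.seed(20)
train_idx <- sample(nrow(Boston), size = floor(0.90 * nrow(Boston)))
Boston_train <- Boston[train_idx, ]
Boston_test  <- Boston[-train_idx, ]

# Fit on the training rows only.
final_model <- lm(
  medv ~ lstat + rm + ptratio + black + dis + nox + chas + zn + crim + rad + tax,
  data = Boston_train
)

# Predict the held-out rows and average the squared prediction errors.
pred <- predict(final_model, newdata = Boston_test)
test_mse <- mean((Boston_test$medv - pred)^2)
test_mse
```

Because the test rows play no part in estimating the coefficients, this quantity is an out-of-sample error, unlike the residual-based statistics of a model fitted to the test set itself.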

Cross Validation on the original data

## [1] "Mean Squared Error (MSE): 23.5689617375491"

Interpretation: Given an MSE of 23.569, which corresponds to a typical prediction error of about 4.85 (in $1000s) and is small compared to the range of the dependent variable medv (5 to 50), we can conclude that our model performs well on unseen data.
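A cross-validated MSE of this kind can be computed by hand with a short loop. The sketch below assumes 10 folds, which the text does not state, and uses illustrative names (`k`, `folds`, `sq_err`, `cv_mse`); its result will depend on how the folds are drawn.

```r
library(MASS)

set.seed(20)
k <- 10  # assumed number of folds

# Assign every row of the full data set to one of k folds at random.
folds <- sample(rep(seq_len(k), length.out = nrow(Boston)))

form <- medv ~ lstat + rm + ptratio + black + dis + nox + chas + zn +
  crim + rad + tax

sq_err <- numeric(nrow(Boston))
for (i in seq_len(k)) {
  held_out <- folds == i
  # Fit on the other k - 1 folds, then predict the held-out fold.
  fit <- lm(form, data = Boston[!held_out, ])
  pred <- predict(fit, newdata = Boston[held_out, ])
  sq_err[held_out] <- (Boston$medv[held_out] - pred)^2
}

cv_mse <- mean(sq_err)
cv_mse
sqrt(cv_mse)  # same units as medv ($1000s)
```

Every observation is predicted exactly once by a model that did not see it, so the average squared error uses all 506 rows while remaining out-of-sample.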

Model 2 on another random sample

First, we split the Boston data set into a training data set with 80% of the original values and a testing data set with the remaining 20%. Below is a summary of our new training data set.

##       crim              zn            indus            chas       
##  Min.   : 0.006   Min.   :  0.0   Min.   : 0.46   Min.   :0.0000  
##  1st Qu.: 0.082   1st Qu.:  0.0   1st Qu.: 5.13   1st Qu.:0.0000  
##  Median : 0.252   Median :  0.0   Median : 9.69   Median :0.0000  
##  Mean   : 3.616   Mean   : 11.3   Mean   :11.13   Mean   :0.0767  
##  3rd Qu.: 3.675   3rd Qu.: 12.5   3rd Qu.:18.10   3rd Qu.:0.0000  
##  Max.   :88.976   Max.   :100.0   Max.   :27.74   Max.   :1.0000  
##       nox              rm            age             dis             rad       
##  Min.   :0.385   Min.   :3.56   Min.   :  2.9   Min.   : 1.13   Min.   : 1.00  
##  1st Qu.:0.448   1st Qu.:5.87   1st Qu.: 43.2   1st Qu.: 2.10   1st Qu.: 4.00  
##  Median :0.538   Median :6.18   Median : 76.2   Median : 3.27   Median : 5.00  
##  Mean   :0.555   Mean   :6.27   Mean   : 68.1   Mean   : 3.82   Mean   : 9.59  
##  3rd Qu.:0.624   3rd Qu.:6.62   3rd Qu.: 94.3   3rd Qu.: 5.21   3rd Qu.:24.00  
##  Max.   :0.871   Max.   :8.78   Max.   :100.0   Max.   :12.13   Max.   :24.00  
##       tax         ptratio         black            lstat            medv     
##  Min.   :187   Min.   :12.6   Min.   :  0.32   Min.   : 1.73   Min.   : 5.0  
##  1st Qu.:277   1st Qu.:17.4   1st Qu.:376.00   1st Qu.: 6.77   1st Qu.:17.1  
##  Median :334   Median :19.0   Median :391.77   Median :11.36   Median :21.1  
##  Mean   :409   Mean   :18.5   Mean   :360.54   Mean   :12.60   Mean   :22.6  
##  3rd Qu.:666   3rd Qu.:20.2   3rd Qu.:396.90   3rd Qu.:16.91   3rd Qu.:25.1  
##  Max.   :711   Max.   :22.0   Max.   :396.90   Max.   :37.97   Max.   :50.0

Model Building - Linear Regression

## 
## Call:
## lm(formula = medv ~ ., data = Boston_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.319  -2.596  -0.451   1.744  26.328 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.89442    5.14766    7.75  6.4e-14 ***
## crim         -0.10116    0.03239   -3.12  0.00191 ** 
## zn            0.04471    0.01433    3.12  0.00193 ** 
## indus         0.03947    0.06383    0.62  0.53668    
## chas          2.24948    0.85909    2.62  0.00914 ** 
## nox         -18.60362    3.82144   -4.87  1.6e-06 ***
## rm            3.65413    0.41838    8.73  < 2e-16 ***
## age           0.00689    0.01402    0.49  0.62325    
## dis          -1.36995    0.20174   -6.79  3.6e-11 ***
## rad           0.33475    0.06839    4.89  1.4e-06 ***
## tax          -0.01457    0.00391   -3.73  0.00022 ***
## ptratio      -1.06538    0.13248   -8.04  8.2e-15 ***
## black         0.00847    0.00265    3.20  0.00149 ** 
## lstat        -0.52775    0.05171  -10.21  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.57 on 441 degrees of freedom
## Multiple R-squared:  0.761,  Adjusted R-squared:  0.754 
## F-statistic:  108 on 13 and 441 DF,  p-value: <2e-16

Interpretation: As in model 1, both indus and age appear to be insignificant predictors of median property value. Note that the call above fits on Boston_train, so the estimates shown are identical to those of model 1.

Variable Selection

## Start:  AIC=1811.1
## medv ~ 1
## 
##           Df Sum of Sq   RSS  AIC
## + lstat    1     19742 15832 1486
## + rm       1     16843 18731 1554
## + ptratio  1      9515 26059 1687
## + indus    1      7669 27905 1715
## + tax      1      7655 27919 1715
## + nox      1      6430 29144 1733
## + rad      1      5274 30300 1748
## + crim     1      5142 30432 1750
## + zn       1      4835 30739 1754
## + age      1      4794 30780 1755
## + black    1      3815 31759 1767
## + dis      1      2132 33442 1788
## + chas     1      1322 34252 1798
## <none>                 35574 1811
## 
## Step:  AIC=1486
## medv ~ lstat
## 
##           Df Sum of Sq   RSS  AIC
## + rm       1      2992 12840 1403
## + ptratio  1      2217 13615 1427
## + chas     1       869 14964 1465
## + dis      1       707 15125 1470
## + black    1       285 15548 1481
## + age      1       282 15551 1481
## + tax      1       187 15645 1483
## + zn       1       161 15672 1484
## + crim     1       107 15725 1485
## <none>                 15832 1486
## + indus    1        25 15808 1487
## + rad      1        13 15820 1488
## + nox      1        10 15822 1488
## - lstat    1     19742 35574 1811
## 
## Step:  AIC=1403.4
## medv ~ lstat + rm
## 
##           Df Sum of Sq   RSS  AIC
## + ptratio  1      1519 11322 1355
## + chas     1       602 12238 1386
## + black    1       508 12332 1389
## + dis      1       393 12448 1393
## + tax      1       279 12561 1397
## + crim     1       234 12606 1398
## + rad      1       104 12737 1402
## <none>                 12840 1403
## + age      1        56 12785 1404
## + zn       1        50 12791 1404
## + indus    1        14 12826 1405
## + nox      1         1 12839 1405
## - rm       1      2992 15832 1486
## - lstat    1      5891 18731 1554
## 
## Step:  AIC=1354.5
## medv ~ lstat + rm + ptratio
## 
##           Df Sum of Sq   RSS  AIC
## + dis      1       451 10870 1340
## + chas     1       420 10901 1341
## + black    1       389 10933 1342
## + age      1       113 11209 1353
## + crim     1        94 11228 1353
## <none>                 11322 1355
## + rad      1        22 11299 1356
## + tax      1        13 11309 1356
## + indus    1        12 11310 1356
## + nox      1         8 11313 1356
## + zn       1         6 11315 1356
## - ptratio  1      1519 12840 1403
## - rm       1      2294 13615 1427
## - lstat    1      4337 15659 1484
## 
## Step:  AIC=1340.1
## medv ~ lstat + rm + ptratio + dis
## 
##           Df Sum of Sq   RSS  AIC
## + nox      1       548 10323 1321
## + black    1       508 10362 1323
## + chas     1       308 10562 1331
## + crim     1       182 10688 1335
## + zn       1       165 10705 1336
## + tax      1       145 10725 1337
## + indus    1       137 10734 1337
## <none>                 10870 1340
## + age      1        17 10854 1342
## + rad      1         5 10866 1342
## - dis      1       451 11322 1355
## - ptratio  1      1577 12448 1393
## - rm       1      1998 12868 1406
## - lstat    1      4671 15541 1483
## 
## Step:  AIC=1321.2
## medv ~ lstat + rm + ptratio + dis + nox
## 
##           Df Sum of Sq   RSS  AIC
## + chas     1       335  9988 1310
## + black    1       327  9996 1310
## + zn       1       172 10151 1316
## + crim     1       113 10210 1319
## + rad      1        71 10252 1320
## <none>                 10323 1321
## + age      1         7 10315 1323
## + indus    1         2 10320 1323
## + tax      1         1 10321 1323
## - nox      1       548 10870 1340
## - dis      1       991 11313 1356
## - ptratio  1      1748 12071 1382
## - rm       1      1937 12260 1389
## - lstat    1      3303 13625 1431
## 
## Step:  AIC=1309.9
## medv ~ lstat + rm + ptratio + dis + nox + chas
## 
##           Df Sum of Sq   RSS  AIC
## + black    1       276  9713 1301
## + zn       1       189  9799 1304
## + crim     1        89  9899 1308
## + rad      1        76  9912 1309
## <none>                  9988 1310
## + indus    1         7  9981 1312
## + age      1         1  9987 1312
## + tax      1         0  9988 1312
## - chas     1       335 10323 1321
## - nox      1       574 10562 1331
## - dis      1       884 10872 1342
## - ptratio  1      1564 11552 1367
## - rm       1      1842 11830 1376
## - lstat    1      3219 13208 1421
## 
## Step:  AIC=1300.6
## medv ~ lstat + rm + ptratio + dis + nox + chas + black
## 
##           Df Sum of Sq   RSS  AIC
## + zn       1       215  9498 1294
## + rad      1       164  9549 1296
## <none>                  9713 1301
## + crim     1        41  9672 1301
## + tax      1        12  9700 1302
## + indus    1         3  9710 1303
## + age      1         0  9713 1303
## - black    1       276  9988 1310
## - chas     1       283  9996 1310
## - nox      1       397 10110 1315
## - dis      1       842 10554 1332
## - ptratio  1      1448 11161 1355
## - rm       1      1977 11690 1373
## - lstat    1      2887 12600 1404
## 
## Step:  AIC=1293.6
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn
## 
##           Df Sum of Sq   RSS  AIC
## + rad      1       106  9391 1291
## + crim     1        72  9425 1292
## <none>                  9498 1294
## + indus    1         3  9495 1295
## + age      1         2  9496 1296
## + tax      1         0  9498 1296
## - zn       1       215  9713 1301
## - chas     1       298  9796 1304
## - black    1       301  9799 1304
## - nox      1       397  9895 1308
## - ptratio  1      1054 10552 1334
## - dis      1      1055 10553 1334
## - rm       1      1700 11198 1358
## - lstat    1      2969 12467 1401
## 
## Step:  AIC=1291
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn + 
##     rad
## 
##           Df Sum of Sq   RSS  AIC
## + crim     1       184  9207 1285
## + tax      1       174  9217 1285
## <none>                  9391 1291
## + age      1         7  9384 1293
## + indus    1         7  9384 1293
## - rad      1       106  9498 1294
## - zn       1       157  9549 1296
## - chas     1       295  9686 1302
## - black    1       368  9760 1305
## - nox      1       501  9892 1310
## - dis      1      1009 10400 1330
## - ptratio  1      1125 10516 1335
## - rm       1      1561 10953 1351
## - lstat    1      3056 12447 1403
## 
## Step:  AIC=1285
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn + 
##     rad + crim
## 
##           Df Sum of Sq   RSS  AIC
## + tax      1       182  9025 1279
## <none>                  9207 1285
## + indus    1        11  9197 1287
## + age      1         9  9199 1287
## - zn       1       183  9390 1291
## - crim     1       184  9391 1291
## - rad      1       218  9425 1292
## - chas     1       269  9477 1295
## - black    1       311  9518 1296
## - nox      1       544  9751 1306
## - dis      1      1087 10295 1328
## - ptratio  1      1152 10360 1331
## - rm       1      1564 10772 1346
## - lstat    1      2720 11928 1388
## 
## Step:  AIC=1279
## medv ~ lstat + rm + ptratio + dis + nox + chas + black + zn + 
##     rad + crim + tax
## 
##           Df Sum of Sq   RSS  AIC
## <none>                  9025 1279
## + age      1        12  9013 1280
## + indus    1         8  9017 1281
## - tax      1       182  9207 1285
## - crim     1       192  9217 1285
## - chas     1       231  9256 1287
## - zn       1       236  9261 1287
## - black    1       292  9317 1290
## - nox      1       371  9396 1293
## - rad      1       393  9418 1294
## - ptratio  1       997 10022 1319
## - dis      1      1148 10173 1325
## - rm       1      1434 10459 1337
## - lstat    1      2698 11723 1383

Interpretation: The model with the lowest AIC score (1279) includes the following independent variables: lstat, rm, ptratio, dis, nox, chas, black, zn, rad, crim and tax.

Having the lowest AIC among the candidate models suggests that this model offers the best balance between goodness of fit and model simplicity. It is also consistent with our first model, which excluded age and indus as predictors of median property value.
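The AIC values in the trace can be checked by hand. For a linear model, `step()` reports AIC up to an additive constant as n * log(RSS / n) + 2 * p, where p counts the estimated coefficients including the intercept. The sketch below applies that formula to the RSS values printed above, assuming n = 404 training rows (80% of 506, rounded down); `step_aic` is an illustrative helper name.

```r
# AIC as computed by extractAIC() for lm fits (up to an additive constant).
step_aic <- function(rss, n, p) n * log(rss / n) + 2 * p

n <- 404  # assumed number of training rows: floor(0.8 * 506)

aic_null  <- step_aic(rss = 35574, n = n, p = 1)   # medv ~ 1
aic_final <- step_aic(rss = 9025,  n = n, p = 12)  # 11 predictors + intercept

round(c(null = aic_null, final = aic_final), 1)
```

Both values agree with the Start and final Step lines of the trace to the printed precision. The 2 * p term is the penalty for model size: a variable is only worth adding if it lowers n * log(RSS / n) by more than 2.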

Having determined which variables to use for the regression model, we fit the new linear model.

## 
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + black + dis + nox + 
##     chas + zn + crim + rad + tax, data = Boston_train.2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.946  -2.716  -0.432   1.911  25.114 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  36.91948    5.60435    6.59  1.4e-10 ***
## lstat        -0.57903    0.05349  -10.82  < 2e-16 ***
## rm            3.54605    0.44933    7.89  3.0e-14 ***
## ptratio      -0.96698    0.14697   -6.58  1.5e-10 ***
## black         0.01135    0.00319    3.56  0.00041 ***
## dis          -1.44048    0.20399   -7.06  7.6e-12 ***
## nox         -15.95473    3.97687   -4.01  7.2e-05 ***
## chas          2.93341    0.92589    3.17  0.00165 ** 
## zn            0.04889    0.01526    3.20  0.00146 ** 
## crim         -0.09967    0.03449   -2.89  0.00407 ** 
## rad           0.29125    0.07051    4.13  4.4e-05 ***
## tax          -0.01053    0.00374   -2.81  0.00514 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.8 on 392 degrees of freedom
## Multiple R-squared:  0.746,  Adjusted R-squared:  0.739 
## F-statistic:  105 on 11 and 392 DF,  p-value: <2e-16

Interpretation: An adjusted R-squared of 0.739 indicates that model 2 explains about 74% of the variance in medv in the training data, which makes it a relatively strong model.

Test on testing data set

Using the final linear model built on 80% of the original data, we now test it on the remaining 20% (the testing data set).

## 
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + chas + 
##     black + rad + crim + tax + zn, data = Boston_test.2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.762 -2.759 -0.921  1.801 13.006 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  32.12551   12.68390    2.53  0.01305 *  
## lstat        -0.28257    0.10645   -2.65  0.00939 ** 
## rm            4.85354    1.01080    4.80  6.2e-06 ***
## ptratio      -0.78210    0.27383   -2.86  0.00532 ** 
## dis          -1.94525    0.47980   -4.05  0.00011 ***
## nox         -23.17097    7.94751   -2.92  0.00448 ** 
## chas          1.25285    2.37714    0.53  0.59946    
## black         0.00503    0.00513    0.98  0.32993    
## rad           0.40464    0.15750    2.57  0.01184 *  
## crim         -0.18639    0.11623   -1.60  0.11230    
## tax          -0.02054    0.00800   -2.57  0.01192 *  
## zn            0.05810    0.03037    1.91  0.05889 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.4 on 90 degrees of freedom
## Multiple R-squared:  0.755,  Adjusted R-squared:  0.725 
## F-statistic: 25.2 on 11 and 90 DF,  p-value: <2e-16
## [1] "Mean Squared Error (MSE): 17.0986009862146"

Interpretation: An MSE of 17.1, an adjusted R-squared of 0.725, and an F-statistic of 25.2 suggest that our second model is relatively strong at predicting median property values in Boston. These figures are better than those reported for the first, smaller test set (MSE of 27.85, adjusted R-squared of 0.559).

Cross Validation on the original data

## [1] "Mean Squared Error (MSE): 23.492531982248"

Interpretation: Given a low cross-validated MSE of 23.493, close to the 23.569 obtained for model 1, we can conclude that both models 1 and 2 are strong predictors of median property value in Boston, with lstat, rm, ptratio, dis, nox, chas, black, rad, crim, tax and zn as useful predictors.