The Boston Real Estate Dataset

This dataset contains information collected by the U.S. Census Service concerning real estate in the area of Boston, Massachusetts. The information was obtained from the StatLib archive and has been used extensively throughout the literature to benchmark algorithms.

In R, the Boston dataset is found in the MASS package and has 506 rows and 14 columns. The median value of a home (medv) and the nitrogen oxides concentration in parts per 10 million (nox) are the quantities to be predicted.

The variables given in the dataset are the following:

  1. crim — per capita crime rate by town.

  2. zn — proportion of residential land zoned for lots over 25,000 sq.ft.

  3. indus — proportion of non-retail business acres per town.

  4. chas — Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

  5. nox — nitrogen oxides concentration (parts per 10 million).

  6. rm — average number of rooms per dwelling.

  7. age — proportion of owner-occupied units built prior to 1940.

  8. dis — weighted mean of distances to five Boston employment centres.

  9. rad — index of accessibility to radial highways.

  10. tax — full-value property-tax rate per $10,000.

  11. ptratio — pupil-teacher ratio by town.

  12. black — 1000(Bk−0.63)^2 where Bk is the proportion of blacks by town.

  13. lstat — lower status of the population (percent).

  14. medv — median value of owner-occupied homes in $1000s.

Taking a look at the Boston Dataset

## Load the MASS library
library(MASS)
## Print out the column names of Boston dataset using names function
names(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
# Show rows and columns of the Boston dataset
dim(Boston)
## [1] 506  14
# Show first 6 rows of data
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7

Data Types Present in the Dataset

crim — per capita crime rate by town — Numerical, continuous

zn — proportion of residential land zoned for lots over 25,000 sq.ft. — Numerical, continuous

indus — proportion of non-retail business acres per town. — Numerical, continuous

chas — Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). — Categorical, nominal

nox — nitrogen oxides concentration (parts per 10 million). — Numerical, continuous

rm — average number of rooms per dwelling. — Numerical, continuous

age — proportion of owner-occupied units built prior to 1940. — Numerical, continuous

dis — weighted mean of distances to five Boston employment centres. — Numerical, continuous

rad — index of accessibility to radial highways; a larger index denotes better accessibility. — Categorical, ordinal

tax — full-value property-tax rate per $10,000. — Numerical, continuous

ptratio — pupil-teacher ratio by town. — Numerical, continuous

black — 1000(Bk − 0.63)^2 where Bk is the proportion of blacks by town. — Numerical, continuous

lstat — lower status of the population (percent). — Numerical, continuous

medv — median value of owner-occupied homes in $1000s. — Numerical, continuous

Checking Data Integrity

sapply(Boston, anyNA)
##    crim      zn   indus    chas     nox      rm     age     dis     rad     tax 
##   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE 
## ptratio   black   lstat    medv 
##   FALSE   FALSE   FALSE   FALSE

Looks like there are no missing values in the Boston Dataset. We are good to use this data!
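
Had anyNA() flagged any column, a per-column count of missing values would show where they are. A minimal sketch (output not shown):

## Count missing values in each column
colSums(is.na(Boston))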

Summary of Boston Data

Summary of the 14 variables.

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
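
summary() reports quartiles but no measure of spread. As a small supplement (a sketch, output not shown), the standard deviation of each variable can be computed with:

## Standard deviation of each variable
sapply(Boston, sd)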

Creating boxplots for Boston variables

library(ggplot2)
attach(Boston)
boxplot(crim, main = "Crime rate by town")

boxplot(rm, main = "Average rooms per dwelling")

## Using ggplot to plot boxplot of nox by chas groups 

ggplot(Boston, aes(x = as.factor(chas), y = nox)) + geom_boxplot()

## Adding some colours
ggplot(Boston, aes(x = as.factor(chas), y = rm , colour = as.factor(chas))) + geom_boxplot()

Histograms for all the variables in Boston Dataset

library(ggplot2)

## Using hist function to plot histogram
hist(Boston$medv, main = "Histogram of medv")

Creating one histogram per variable by hand can be slow. Try the Hmisc package, which plots them all at once.

Using the Hmisc Package to Create Histograms in Bulk

library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
par(mar=c(1,1,1,1))
hist.data.frame(Boston)

Using ggplot to create a combined line and scatter plot

## Using ggplot to create line graph of medv (y) vs nox (x)

ggplot(Boston, aes(x = nox, y = medv)) + geom_point(colour = "blue") + geom_line(colour = "red")

Creating Bar Charts and Pie Charts

attach(Boston)
## The following objects are masked from Boston (pos = 7):
## 
##     age, black, chas, crim, dis, indus, lstat, medv, nox, ptratio, rad,
##     rm, tax, zn
counts = table(chas)
barplot(counts, main = "Chas Distribution", xlab = "chas (1 = tract bounds river, 0 = tract does not bound river)")

## Grouped Barchart of rad vs chas
counts <- table(Boston$chas, Boston$rad)
barplot(counts, main="Boston Population Distribution by chas and rad",
        xlab="rad", col=c("darkblue","red"),
        legend = rownames(counts), beside=TRUE, args.legend = list(x = "topright",
                           inset = c(-0.1, 0)))

## Pie Chart showing distribution of index of accessibility to radial highways
library(plotrix)
counts_rad = table(rad)
pie3D(counts_rad, labels = names(counts_rad), explode = 0.1,
      main = "Pie Chart of rad (Index of accessibility to radial highways) Distribution")

Creating Scatterplots

attach(Boston)
## The following objects are masked from Boston (pos = 4):
## 
##     age, black, chas, crim, dis, indus, lstat, medv, nox, ptratio, rad,
##     rm, tax, zn
## The following objects are masked from Boston (pos = 9):
## 
##     age, black, chas, crim, dis, indus, lstat, medv, nox, ptratio, rad,
##     rm, tax, zn
plot(rm, medv)

plot(lstat, medv)

Using a correlation matrix to discover relationships between the variables

Correlation coefficients fall between -1 and +1: values below 0 indicate negative correlations and values above 0 indicate positive correlations.
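
Before computing the full matrix, a single pairwise correlation can be obtained with cor(). For example, lstat vs medv returns approximately -0.74, a strong negative correlation, as the matrix below confirms:

## A single pairwise correlation as a quick illustration
cor(Boston$lstat, Boston$medv)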

library(ggcorrplot)

## Use the cor function to compute the correlation matrix of the Boston dataset
corr_boston = cor(Boston)
corr_boston
##                crim          zn       indus         chas         nox
## crim     1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn      -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus    0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas    -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox      0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm      -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age      0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis     -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad      0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax      0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## black   -0.38506394  0.17552032 -0.35697654  0.048788485 -0.38005064
## lstat    0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv    -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
##                  rm         age         dis          rad         tax    ptratio
## crim    -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431  0.2899456
## zn       0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus   -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018  0.3832476
## chas     0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox     -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320  0.1889327
## rm       1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783 -0.3555015
## age     -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559  0.2615150
## dis      0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad     -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819  0.4647412
## tax     -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000  0.4608530
## ptratio -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304  1.0000000
## black    0.12806864 -0.27353398  0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat   -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341  0.3740443
## medv     0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593 -0.5077867
##               black      lstat       medv
## crim    -0.38506394  0.4556215 -0.3883046
## zn       0.17552032 -0.4129946  0.3604453
## indus   -0.35697654  0.6037997 -0.4837252
## chas     0.04878848 -0.0539293  0.1752602
## nox     -0.38005064  0.5908789 -0.4273208
## rm       0.12806864 -0.6138083  0.6953599
## age     -0.27353398  0.6023385 -0.3769546
## dis      0.29151167 -0.4969958  0.2499287
## rad     -0.44441282  0.4886763 -0.3816262
## tax     -0.44180801  0.5439934 -0.4685359
## ptratio -0.17738330  0.3740443 -0.5077867
## black    1.00000000 -0.3660869  0.3334608
## lstat   -0.36608690  1.0000000 -0.7376627
## medv     0.33346082 -0.7376627  1.0000000
## Visualizing the correlation matrix

ggcorrplot(corr_boston)

## Circle

ggcorrplot(corr_boston, method = "circle")
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

## Reordering the Boston correlation matrix
ggcorrplot(corr_boston, hc.order = TRUE)

## Get the lower triangle

ggcorrplot(corr_boston, hc.order = TRUE, type = "lower", outline.col = "white")

## Get the upper triangle 
ggcorrplot(corr_boston, hc.order = TRUE, type = "upper", outline.col = "white")

## Computing the correlation p-value matrix so non-significant correlations can be blanked out
p_boston = cor_pmat(Boston)
p_boston
##                 crim           zn        indus         chas           nox
## crim    0.000000e+00 5.506472e-06 1.450349e-21 2.094345e-01  3.751739e-23
## zn      5.506472e-06 0.000000e+00 1.289161e-38 3.378103e-01  7.231578e-36
## indus   1.450349e-21 1.289161e-38 0.000000e+00 1.574628e-01  7.913361e-98
## chas    2.094345e-01 3.378103e-01 1.574628e-01 0.000000e+00  4.029050e-02
## nox     3.751739e-23 7.231578e-36 7.913361e-98 4.029050e-02  0.000000e+00
## rm      6.346703e-07 6.935337e-13 5.328458e-20 4.018410e-02  3.818694e-12
## age     2.854869e-16 7.575575e-45 8.409642e-61 5.177446e-02  7.452392e-86
## dis     8.519949e-19 9.748287e-66 3.586280e-78 2.568848e-02 4.233063e-100
## rad     2.693844e-56 6.988109e-13 8.368289e-50 8.686789e-01  3.342034e-53
## tax     2.357127e-47 4.385492e-13 3.018199e-82 4.244225e-01  1.093287e-66
## ptratio 2.942922e-11 5.325074e-20 3.774843e-19 6.203916e-03  1.885692e-05
## black   2.487274e-19 7.207719e-05 1.184586e-16 2.733379e-01  7.816936e-19
## lstat   2.654277e-27 2.908736e-22 1.381948e-51 2.258990e-01  5.979284e-49
## medv    1.173987e-19 5.713584e-17 4.900260e-31 7.390623e-05  7.065042e-24
##                   rm          age           dis           rad           tax
## crim    6.346703e-07 2.854869e-16  8.519949e-19  2.693844e-56  2.357127e-47
## zn      6.935337e-13 7.575575e-45  9.748287e-66  6.988109e-13  4.385492e-13
## indus   5.328458e-20 8.409642e-61  3.586280e-78  8.368289e-50  3.018199e-82
## chas    4.018410e-02 5.177446e-02  2.568848e-02  8.686789e-01  4.244225e-01
## nox     3.818694e-12 7.452392e-86 4.233063e-100  3.342034e-53  1.093287e-66
## rm      0.000000e+00 4.459649e-08  3.237746e-06  1.918446e-06  2.086816e-11
## age     4.459649e-08 0.000000e+00  9.857534e-92  2.360876e-27  2.551067e-34
## dis     3.237746e-06 9.857534e-92  0.000000e+00  1.418269e-32  1.025931e-38
## rad     1.918446e-06 2.360876e-27  1.418269e-32  0.000000e+00 4.129920e-195
## tax     2.086816e-11 2.551067e-34  1.025931e-38 4.129920e-195  0.000000e+00
## ptratio 1.610820e-16 2.338885e-09  1.229920e-07  1.778554e-28  5.686833e-28
## black   3.906695e-03 3.911801e-10  2.278649e-11  6.592918e-26  1.367562e-25
## lstat   1.033009e-53 2.783924e-51  6.356331e-33  9.904457e-32  2.583867e-40
## medv    2.487229e-74 1.569982e-18  1.206612e-08  5.465933e-19  5.637734e-29
##              ptratio        black        lstat         medv
## crim    2.942922e-11 2.487274e-19 2.654277e-27 1.173987e-19
## zn      5.325074e-20 7.207719e-05 2.908736e-22 5.713584e-17
## indus   3.774843e-19 1.184586e-16 1.381948e-51 4.900260e-31
## chas    6.203916e-03 2.733379e-01 2.258990e-01 7.390623e-05
## nox     1.885692e-05 7.816936e-19 5.979284e-49 7.065042e-24
## rm      1.610820e-16 3.906695e-03 1.033009e-53 2.487229e-74
## age     2.338885e-09 3.911801e-10 2.783924e-51 1.569982e-18
## dis     1.229920e-07 2.278649e-11 6.356331e-33 1.206612e-08
## rad     1.778554e-28 6.592918e-26 9.904457e-32 5.465933e-19
## tax     5.686833e-28 1.367562e-25 2.583867e-40 5.637734e-29
## ptratio 0.000000e+00 6.017320e-05 3.003524e-18 1.609509e-34
## black   6.017320e-05 0.000000e+00 1.712280e-17 1.318113e-14
## lstat   3.003524e-18 1.712280e-17 0.000000e+00 5.081103e-88
## medv    1.609509e-34 1.318113e-14 5.081103e-88 0.000000e+00
ggcorrplot(corr_boston, hc.order = TRUE, type = "lower", p.mat = p_boston, insig = "blank")

Models to predict medv and nox values

In this section, we will use the other 13 variables as candidate predictors for medv (median value of owner-occupied homes in $1000s) and nox (nitrogen oxides concentration in parts per 10 million). Using the correlation matrix, we will find the predictor with the strongest linear correlation with each of medv and nox and build a model.

## Adding the correlation coefficients and changing label size

ggcorrplot(corr_boston, hc.order = TRUE, type = "upper", lab = TRUE, lab_size = 3, p.mat = p_boston, insig = "blank")

Finding the variables with the strongest correlation with medv and nox

From the correlation plot above, the strongest linear correlation with medv is the inverse correlation with lstat, the percentage of the population of lower socioeconomic status. The linear correlation between indus and nox is also very strong, but we will cover that later.
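
The same conclusion can be reached programmatically by ranking the absolute correlations with medv. A sketch (output not shown):

## Rank the other 13 variables by absolute correlation with medv
## (row 14 of corr_boston is medv itself, so it is dropped)
sort(abs(corr_boston[-14, "medv"]), decreasing = TRUE)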

Creating model(s) for medv vs lstat

We will now visualize the relationship between the two variables, with medv as the dependent variable (y) and lstat as the independent variable (x).

## medv against lstat scatterplot
ggplot(Boston, aes(x = lstat, y = medv)) + geom_point(size = 2)

The scatterplot above shows a strong negative relationship between medv and lstat, with a slight curvature.

## Correlation Coefficient of medv v lstat
attach(Boston)
## The following objects are masked from Boston (pos = 4):
## 
##     age, black, chas, crim, dis, indus, lstat, medv, nox, ptratio, rad,
##     rm, tax, zn
## The following objects are masked from Boston (pos = 6):
## 
##     age, black, chas, crim, dis, indus, lstat, medv, nox, ptratio, rad,
##     rm, tax, zn
## The following objects are masked from Boston (pos = 11):
## 
##     age, black, chas, crim, dis, indus, lstat, medv, nox, ptratio, rad,
##     rm, tax, zn
cor(lstat, medv)
## [1] -0.7376627

Hypothesis Testing (medv vs lstat)

We will go with the alternative hypothesis that as the percentage of lower-status population (lstat) increases, the median value of owner-occupied homes (medv) decreases.

We test this against the null hypothesis: an increase in lstat is not associated with a decrease in medv.

cor.test(lstat, medv)
## 
##  Pearson's product-moment correlation
## 
## data:  lstat and medv
## t = -24.528, df = 504, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7749982 -0.6951959
## sample estimates:
##        cor 
## -0.7376627
pval = cor.test(lstat, medv)$p.value

Since the p-value (pval) is approximately 5.08e-88, which is less than α = 0.05, we reject the null hypothesis: medv has a significant inverse linear correlation with lstat.
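
The test above is two-sided. Since our hypothesis is directional, cor.test() also accepts a one-sided alternative; the two-sided p-value is conservative here, so the conclusion is unchanged. A sketch (output not shown):

## One-sided test of a negative correlation between lstat and medv
cor.test(Boston$lstat, Boston$medv, alternative = "less")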

Fitting linear regression model (medv vs lstat)

fit.lm = lm(medv~lstat, data = Boston)
fit.lm
## 
## Call:
## lm(formula = medv ~ lstat, data = Boston)
## 
## Coefficients:
## (Intercept)        lstat  
##       34.55        -0.95

Forming the estimated linear regression equation (medv vs lstat)

coef(fit.lm)
## (Intercept)       lstat 
##  34.5538409  -0.9500494

The estimated equation is y = -0.95x + 34.554.
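
As a quick sanity check, plugging lstat = 10 into the estimated equation by hand reproduces the fitted value that predict() returns later (25.053):

## Manual prediction at lstat = 10
34.5538409 - 0.9500494 * 10   # = 25.05335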

Testing model accuracy (medv vs lstat)

summary(fit.lm)
## 
## Call:
## lm(formula = medv ~ lstat, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.168  -3.990  -1.318   2.034  24.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16

The F-statistic is 601.6 with a corresponding p-value < 2.2e-16, which is below 0.05, so the linear regression model is statistically significant.

Multiple R-squared = 0.5441, meaning that 54.41% of the variation in the response variable y (medv) can be explained by the predictor variable x (lstat). This is a decent R^2 value, but perhaps we can achieve a better fit with a curvilinear model.

The coefficient estimate of lstat is -0.95. This means that each additional 1% increase in lstat is associated with an average decrease of $950 in medv (medv is measured in $1000s).

Estimated regression equation: y = -0.95x + 34.554

Diagnostic plots of Linear Regression Model (medv vs lstat)

## Create diagnostic plots

plot(fit.lm)

1st Plot: Residuals vs Fitted Plot

This plot helps to determine if the residuals exhibit non-linear patterns. If the red line across the center of the plot is roughly horizontal then we can assume that the residuals follow a linear pattern.

In our case, the red line deviates slightly from horizontal on the right side, but not severely. The residuals follow a roughly linear pattern, so a linear regression model is appropriate for this dataset, though it is possible we can improve the fit with a curvilinear model.

2nd Plot: Normal Q-Q

This plot is used to determine if the residuals of the regression model are normally distributed. If the points in this plot fall roughly along a straight diagonal line, then we can assume the residuals are normally distributed.

Observations #268, #372, and #373 deviate far from the line. We should also note that the residuals start to depart from normality as they increase.

3rd Plot: Scale-Location Plot

This plot is used to check the assumption of equal variance (also called “homoscedasticity”) among the residuals in our regression model. If the red line is roughly horizontal across the plot, then the assumption of equal variance is likely met.

The red line displayed for our case is not exactly a straight horizontal line but it does not deviate too drastically.

4th Plot: Residuals vs Leverage Plot

This plot is used to identify influential observations. If any points in this plot fall outside of Cook’s distance (the dashed lines) then it is an influential observation.

All observations fall well within Cook's distance, and there are no strongly influential points in the dataset.
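
As a side note, plot.lm accepts a which argument to display one diagnostic at a time instead of paging through all four. A minimal sketch:

## Show only the Normal Q-Q diagnostic plot
plot(fit.lm, which = 2)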

Plotting the fitted regression model (medv vs lstat)

#create scatterplot of medv vs lstat 
plot(Boston$lstat, Boston$medv, col = "blue", main = "Summary of Linear Regression Model", xlab = "lstat", ylab = "medv")

#add fitted regression line
abline(fit.lm)

Making Predictions (medv vs lstat)

predict_lm =  predict(fit.lm, data.frame(lstat = c(10, 25, 40)), se.fit = TRUE, interval = "predict")

predict_lm
## $fit
##         fit        lwr      upr
## 1 25.053347  12.827626 37.27907
## 2 10.802607  -1.457504 23.06272
## 3 -3.448133 -15.848067  8.95180
## 
## $se.fit
##         1         2         3 
## 0.2948138 0.5523293 1.0946895 
## 
## $df
## [1] 504
## 
## $residual.scale
## [1] 6.21576
names(predict_lm)
## [1] "fit"            "se.fit"         "df"             "residual.scale"
predict_lm$fit
##         fit        lwr      upr
## 1 25.053347  12.827626 37.27907
## 2 10.802607  -1.457504 23.06272
## 3 -3.448133 -15.848067  8.95180

This linear regression model predicts medv values of 25.05, 10.80, and -3.45 (about $25,053, $10,803, and -$3,448) for lstat values of 10, 25, and 40 respectively. Logically, the value of a property should not be negative, so the model is not perfect.

Let us explore improving the accuracy of this model by using a curvilinear model instead.

Fitting a non-linear regression model (medv vs lstat)

## Creating the non linear regression model

fit_nlm = lm(medv ~ lstat + I(lstat^2), data = Boston)

fit_nlm
## 
## Call:
## lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
## 
## Coefficients:
## (Intercept)        lstat   I(lstat^2)  
##    42.86201     -2.33282      0.04355
coef(fit_nlm)
## (Intercept)       lstat  I(lstat^2) 
## 42.86200733 -2.33282110  0.04354689

The non-linear regression model equation is estimated to be y = 0.0435x^2 - 2.33x + 42.9.

summary(fit_nlm)
## 
## Call:
## lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2834  -3.8313  -0.5295   2.3095  25.4148 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 42.862007   0.872084   49.15   <2e-16 ***
## lstat       -2.332821   0.123803  -18.84   <2e-16 ***
## I(lstat^2)   0.043547   0.003745   11.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared:  0.6407, Adjusted R-squared:  0.6393 
## F-statistic: 448.5 on 2 and 503 DF,  p-value: < 2.2e-16

The F-statistic is 448.5 with a corresponding p-value < 2.2e-16, which is below 0.05, so the non-linear regression model is statistically significant.

Multiple R-squared = 0.6407, meaning that 64.07% of the variation in the response variable y (medv) can be explained by the predictor variable x (lstat). That is an improvement of roughly 10 percentage points in R^2 over the linear model.
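
The nested linear and quadratic models can also be compared formally with an F-test; it should agree with the highly significant t-statistic on the I(lstat^2) term. A sketch (output not shown):

## Nested-model F-test comparing the linear and quadratic fits
anova(fit.lm, fit_nlm)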

Diagnostic plots of Non-linear Regression Model (medv vs lstat)

## Create diagnostic plots

plot(fit_nlm)

1st Plot: Residuals vs Fitted Plot

The red line is more nearly horizontal than in the linear model, where it was U-shaped. The non-linear regression is more appropriate for this dataset.

2nd Plot: Normal Q-Q

Observations #268, #372, and #373 still deviate far from the line. However, the residuals deviate less from the diagonal than in the linear model, so they are more normally distributed than in the linear model.

3rd Plot: Scale-Location Plot

The red line displayed is still not a straight horizontal line but it deviates slightly less than the linear model.

4th Plot: Residuals vs Leverage Plot

All observations fall well within Cook's distance, and there are no strongly influential points in the dataset. It is interesting to note that the red dashed Cook's distance lines are visible for the non-linear model, indicating the observations are relatively more influential than in the linear model.

Summary of the non-linear regression line (medv vs lstat)

data_b = data.frame(x = Boston$lstat, y = Boston$medv)
nls_fit = nls(I(y ~ a + (b*x) + (c*x^2)), data = data_b, start = list(a = 42.862, b = -2.333, c = 0.0435), trace = T)
## 15347.39    (3.14e-03): par = (42.862 -2.333 0.0435)
## 15347.24    (2.12e-09): par = (42.86201 -2.332821 0.04354689)
summary(nls_fit)
## 
## Formula: y ~ a + (b * x) + (c * x^2)
## 
## Parameters:
##    Estimate Std. Error t value Pr(>|t|)    
## a 42.862007   0.872084   49.15   <2e-16 ***
## b -2.332821   0.123803  -18.84   <2e-16 ***
## c  0.043547   0.003745   11.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.524 on 503 degrees of freedom
## 
## Number of iterations to convergence: 1 
## Achieved convergence tolerance: 2.118e-09

The nls fit converges to the same coefficients as the quadratic lm fit: intercept 42.862, lstat coefficient -2.3328, and lstat^2 coefficient 0.04355.

Making Predictions (medv vs lstat)

predict_nlm =  predict(fit_nlm, data.frame(lstat = c(10, 25, 40)), se.fit = TRUE, interval = "predict")

predict_nlm
## $fit
##        fit        lwr      upr
## 1 23.88849 13.0221089 34.75486
## 2 11.75829  0.8619344 22.65464
## 3 19.22419  7.5578547 30.89052
## 
## $se.fit
##         1         2         3 
## 0.2804907 0.4976684 2.1790803 
## 
## $df
## [1] 503
## 
## $residual.scale
## [1] 5.523714
names(predict_nlm)
## [1] "fit"            "se.fit"         "df"             "residual.scale"
predict_nlm$fit
##        fit        lwr      upr
## 1 23.88849 13.0221089 34.75486
## 2 11.75829  0.8619344 22.65464
## 3 19.22419  7.5578547 30.89052
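
Note that the predicted medv at lstat = 40 (19.2) is higher than at lstat = 25 (11.8): the fitted parabola bends upward beyond the observed range of lstat, so extrapolation should be treated with caution. Below is a sketch that overlays the fitted curve on the scatterplot (lstat_grid is an illustrative helper, not part of the original analysis):

## Scatterplot with the fitted quadratic curve
plot(Boston$lstat, Boston$medv, col = "blue",
     main = "Quadratic fit of medv vs lstat", xlab = "lstat", ylab = "medv")
lstat_grid = seq(min(Boston$lstat), max(Boston$lstat), length.out = 200)
lines(lstat_grid, predict(fit_nlm, data.frame(lstat = lstat_grid)),
      col = "red", lwd = 2)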

Finding correlation between nox and another variable

ggcorrplot(corr_boston, hc.order = TRUE, type = "upper", lab = TRUE, lab_size = 3, p.mat = p_boston, insig = "blank")

Creating model(s) for nox vs indus

Based on the correlation map above, nox has its strongest correlation, 0.76, with indus.

nox against indus scatterplot

ggplot(Boston, aes(x =indus, y = nox)) + geom_point(size = 2)

There is a clear positive, roughly linear relationship between nox and indus.

## Correlation Coefficient of nox v indus
attach(Boston)
## The following objects are masked from Boston (pos = 3):
## 
##     age, black, chas, crim, dis, indus, lstat, medv, nox, ptratio, rad,
##     rm, tax, zn
## The following objects are masked from Boston (pos = 5):
## 
##     age, black, chas, crim, dis, indus, lstat, medv, nox, ptratio, rad,
##     rm, tax, zn
## The following objects are masked from Boston (pos = 7):
## 
##     age, black, chas, crim, dis, indus, lstat, medv, nox, ptratio, rad,
##     rm, tax, zn
## The following objects are masked from Boston (pos = 12):
## 
##     age, black, chas, crim, dis, indus, lstat, medv, nox, ptratio, rad,
##     rm, tax, zn
cor(indus, nox)
## [1] 0.7636514

Hypothesis Testing (nox vs indus)

We will go with the alternative hypothesis that as the proportion of non-retail business acres per town (indus) increases, the nitrogen oxides concentration (nox) increases.

We test this against the null hypothesis: an increase in indus is not associated with an increase in nox.

cor.test(indus, nox)
## 
##  Pearson's product-moment correlation
## 
## data:  indus and nox
## t = 26.554, df = 504, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7247252 0.7977188
## sample estimates:
##       cor 
## 0.7636514
pval2 = cor.test(indus, nox)$p.value

Since the p-value (pval2) is approximately 7.91e-98, which is less than α = 0.05, we reject the null hypothesis: nox has a significant positive linear correlation with indus.

Fitting linear regression model (nox vs indus)

fit.lm2 = lm(nox~indus, data = Boston)
fit.lm2
## 
## Call:
## lm(formula = nox ~ indus, data = Boston)
## 
## Coefficients:
## (Intercept)        indus  
##      0.4110       0.0129

Forming the estimated linear regression equation (nox vs indus)

coef(fit.lm2)
## (Intercept)       indus 
##  0.41104425  0.01289878

The estimated equation is y = 0.0129x + 0.411.

Testing model accuracy (nox vs indus)

summary(fit.lm2)
## 
## Call:
## lm(formula = nox ~ indus, data = Boston)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.160898 -0.052175 -0.001458  0.037011  0.207398 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.4110442  0.0063521   64.71   <2e-16 ***
## indus       0.0128988  0.0004858   26.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07489 on 504 degrees of freedom
## Multiple R-squared:  0.5832, Adjusted R-squared:  0.5823 
## F-statistic: 705.1 on 1 and 504 DF,  p-value: < 2.2e-16

The F-statistic is 705.1 with a corresponding p-value < 2.2e-16, which is below 0.05, so the linear regression model is statistically significant.

Multiple R-squared = 0.5832, meaning that 58.32% of the variation in the response variable y (nox) can be explained by the predictor variable x (indus). This is a decent R^2 value, but perhaps we can achieve a better fit with a curvilinear model.

# Worked example: a 1-unit increase in indus changes the predicted nox by the slope, 0.0129
x.1 = 1
y.1 = 0.0129*x.1 + 0.411
y.1
## [1] 0.4239
x.2 = 2
y.2 = 0.0129*x.2 + 0.411
y.2
## [1] 0.4368
y.2 - y.1
## [1] 0.0129

The coefficient estimate of indus is 0.0129. This means that each additional 1-unit increase in indus is associated with an average increase of 0.0129 parts per 10 million in nox.

Estimated regression equation: y = 0.0129x + 0.411

Diagnostic plots of Linear Regression Model (nox vs indus)

## Create diagnostic plots

plot(fit.lm2)

1st Plot: Residuals vs Fitted Plot

In our case, the red line is fairly horizontal until roughly the 0.65 mark on the x-axis, where it starts to slant downwards. A linear regression model should work for nox vs indus.

2nd Plot: Normal Q-Q

Observation #145 deviates far from the line. Most points fall along the straight diagonal line.

3rd Plot: Scale-Location Plot

The red line is not very horizontal in this case: it starts flat and then slants diagonally upwards. The deviation is quite drastic, suggesting the equal-variance assumption is not well met.

4th Plot: Residuals vs Leverage Plot

All observations fall well within Cook's distance, and there are no strongly influential points in the dataset.

Plotting the fitted regression model (nox vs indus)

#create scatterplot of nox vs indus 
plot(Boston$indus, Boston$nox, col = "blue", main = "Summary of Linear Regression Model", xlab = "indus", ylab = "nox")

#add fitted regression line
abline(fit.lm2)

Making Predictions (nox vs indus)

predict_lm2 =  predict(fit.lm2, data.frame(indus = c(5, 15, 25)), se.fit = TRUE, interval = "predict")

predict_lm2
## $fit
##         fit       lwr       upr
## 1 0.4755381 0.3281450 0.6229312
## 2 0.6045259 0.4572030 0.7518487
## 3 0.7335136 0.5856439 0.8813834
## 
## $se.fit
##           1           2           3 
## 0.004468758 0.003821658 0.007512171 
## 
## $df
## [1] 504
## 
## $residual.scale
## [1] 0.07488814
names(predict_lm2)
## [1] "fit"            "se.fit"         "df"             "residual.scale"
predict_lm2$fit
##         fit       lwr       upr
## 1 0.4755381 0.3281450 0.6229312
## 2 0.6045259 0.4572030 0.7518487
## 3 0.7335136 0.5856439 0.8813834

This linear regression model predicts nox values of 0.476, 0.605, and 0.734 for indus values of 5, 15, and 25 respectively.

Let us explore improving the accuracy of this model by using a curvilinear model instead.

Fitting a non-linear regression model (nox vs indus)

## Creating the non linear regression model

fit_nlm2 = lm(nox ~ indus + I(lstat^2), data = Boston)

fit_nlm2
## 
## Call:
## lm(formula = nox ~ indus + I(lstat^2), data = Boston)
## 
## Coefficients:
## (Intercept)        indus   I(lstat^2)  
##   4.105e-01    1.134e-02    8.502e-05
coef(fit_nlm2)
##  (Intercept)        indus   I(lstat^2) 
## 4.104546e-01 1.134104e-02 8.501568e-05

The fitted equation is estimated to be nox = 4.10e-1 + 1.13e-2·indus + 8.50e-5·lstat^2. Note that the squared term uses lstat rather than indus, which appears to be a typo for I(indus^2); see the corrected sketch after the model summary below.

summary(fit_nlm2)
## 
## Call:
## lm(formula = nox ~ indus + I(lstat^2), data = Boston)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.190946 -0.045466 -0.002307  0.032283  0.233845 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.105e-01  6.193e-03  66.273  < 2e-16 ***
## indus       1.134e-02  5.595e-04  20.272  < 2e-16 ***
## I(lstat^2)  8.502e-05  1.626e-05   5.229  2.5e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.073 on 503 degrees of freedom
## Multiple R-squared:  0.6047, Adjusted R-squared:  0.6031 
## F-statistic: 384.7 on 2 and 503 DF,  p-value: < 2.2e-16

The F-statistic is 384.7 with a corresponding p-value < 2.2e-16, which is below 0.05, so the non-linear regression model is statistically significant.

Multiple R-squared = 0.6047, meaning that 60.47% of the variation in the response variable y (nox) can be explained by the two predictors (indus and lstat^2). That is an improvement of roughly 2 percentage points in R^2 over the linear model.
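
Because the squared term above is in lstat, this model is not actually a curve in indus, and its coefficients will not match the nls fit below, which is quadratic in indus. A sketch of the presumably intended model (fit_nlm2b is a hypothetical name; output not shown):

## Quadratic in indus, as presumably intended
fit_nlm2b = lm(nox ~ indus + I(indus^2), data = Boston)
coef(fit_nlm2b)   # should match the nls estimates below (~0.3823, 0.0199, -0.000289)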

Diagnostic plots of Non-linear Regression Model (nox vs indus)

## Create diagnostic plots

plot(fit_nlm2)

1st Plot: Residuals vs Fitted Plot

In our case, the red line is more nearly horizontal than the red line for the linear regression model, so the quadratic model captures the residual pattern better.

2nd Plot: Normal Q-Q

Observations #152 and #160 deviate far from the line. Most points fall along the straight diagonal line.

3rd Plot: Scale-Location Plot

The red line is still not horizontal: it starts flat and then slants diagonally upwards. The deviation is quite drastic, again suggesting unequal variance.

4th Plot: Residuals vs Leverage Plot

All observations fall well within Cook's distance, and there are no strongly influential points in the dataset.

Summary of the non-linear regression line (nox vs indus)

data_c = data.frame(x = Boston$indus, y = Boston$nox)
nls_fit2 = nls(I(y ~ a + (b*x) + (c*x^2)), data = data_c, start = list(a = 4.11e-01 , b = 1.13e-02, c = 8.50e-05), trace = T)
## 2.880089    (2.06e-01): par = (0.411 0.0113 8.5e-05)
## 2.763038    (1.47e-07): par = (0.3822677 0.01992304 -0.0002891891)
summary(nls_fit2)
## 
## Formula: y ~ a + (b * x) + (c * x^2)
## 
## Parameters:
##     Estimate Std. Error t value Pr(>|t|)    
## a  3.823e-01  1.054e-02  36.260  < 2e-16 ***
## b  1.992e-02  2.121e-03   9.393  < 2e-16 ***
## c -2.892e-04  8.505e-05  -3.400 0.000727 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07412 on 503 degrees of freedom
## 
## Number of iterations to convergence: 1 
## Achieved convergence tolerance: 1.465e-07
nls_fit2
## Nonlinear regression model
##   model: y ~ a + (b * x) + (c * x^2)
##    data: data_c
##          a          b          c 
##  0.3822677  0.0199230 -0.0002892 
##  residual sum-of-squares: 2.763
## 
## Number of iterations to convergence: 1 
## Achieved convergence tolerance: 1.465e-07
coef(nls_fit2)
##             a             b             c 
##  0.3822677247  0.0199230378 -0.0002891891

Note that the nls coefficients (intercept 0.3823, indus 0.0199, indus^2 -0.000289) differ from the earlier lm fit (0.4105, 0.0113, 8.50e-05) because the lm model's squared term was lstat^2 rather than indus^2; the nls model is a genuine quadratic in indus.

Making Predictions (nox vs indus)

# Predictions for indus = 10..25 using the earlier lm coefficients,
# treating the model as a one-variable quadratic in indus
# (note: the lm actually fitted lstat^2, so these values are approximate)
for(x2 in 10:25)
{y3 = 8.50e-5*x2^2 + 1.13e-2*x2 + 4.10e-1
  print(y3)}
## [1] 0.5315
## [1] 0.544585
## [1] 0.55784
## [1] 0.571265
## [1] 0.58486
## [1] 0.598625
## [1] 0.61256
## [1] 0.626665
## [1] 0.64094
## [1] 0.655385
## [1] 0.67
## [1] 0.684785
## [1] 0.69974
## [1] 0.714865
## [1] 0.73016
## [1] 0.745625
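
A cleaner way to generate these predictions is to call predict() on the fitted nls model, which is a genuine quadratic in indus. A sketch (output not shown; recall that data_c names the predictor x):

## Predict nox for indus = 10..25 from the nls quadratic fit
predict(nls_fit2, newdata = data.frame(x = 10:25))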

Conclusion

Testing several models is always worthwhile in order to find the one that fits best and makes the most accurate predictions.