This dataset contains information collected by the U.S Census Service concerning real estate information in the area of Boston Mass. The information was obtained from the StatLib archive and has been used extensively throughout the literature to bench algorithms.
In the context of R, the Boston dataset is found in the MASS library and has 506 rows and 14 columns. The median value of a home, medv, and nitrogen oxides concentration (parts per 10 million), nox, is to be predicted.
The variables give in the dataset are the following:
crim —per capita crime rate by town.
zn — proportion of residential land zoned for lots over 25,000 sq.ft.
indus — proportion of non-retail business acres per town.
chas — Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox — nitrogen oxides concentration (parts per 10 million).
rm — average number of rooms per dwelling.
age — proportion of owner-occupied units built prior to 1940.
dis — weighted mean of distances to five Boston employment centres.
rad — index of accessibility to radial highways.
tax — full-value property-tax rate per $10,000.
ptratio — pupil-teacher ratio by town.
black — 1000(Bk−0.63)^2 where Bk is the proportion of blacks by town.
lstat — lower status of the population (percent).
medv — median value of owner-occupied homes in $1000s.
## Call MASS library
library(MASS)
## Print out the column names of Boston dataset using names function
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
## Data Types present in Dataset
# Show rows and columns of the Boston dataset
dim(Boston)
## [1] 506 14
# Show first 6 rows of data
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
crim — per capita crime rate by town — Numerical, continuous
zn — proportion of residential land zoned for lots over 25,000 sq.ft. — Numerical, continuous
indus — proportion of non-retail business acres per town. — Numerical, continuous
chas — Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). — categorical, nominal
nox — nitrogen oxides concentration (parts per 10 million). — Numerical, continuous
rm — average number of rooms per dwelling. — Numerical, continuous
age — proportion of owner-occupied units built prior to 1940. — Numerical, continuous
dis — weighted mean of distances to five Boston employment centres. — Numerical, continuous
rad — index of accessibility to radial highways. larger index denotes better accessibility — categorical, ordinal
tax — full-value property-tax rate per $10,000. — Numerical, continuous
ptratio — pupil-teacher ratio by town. — Numerical, continuous
black — 1000(Bk−0.63)2 where Bk is the proportion of blacks by town. — Numerical, continuous
lstat — lower status of the population (percent). — Numerical, continuous
medv — median value of owner-occupied homes in $1000s. — Numerical, continuous
sapply(Boston, anyNA)
## crim zn indus chas nox rm age dis rad tax
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## ptratio black lstat medv
## FALSE FALSE FALSE FALSE
Looks like there are no missing values in the Boston Dataset. We are good to use this data!
Summary of the 14 variables.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
library(ggplot2)
attach(Boston)
boxplot(crim, main = "Crim rate in town")
boxplot(rm, main = "Average room per dwelling")
## Using ggplot to plot boxplot of nox by chas groups
ggplot(Boston, aes(x = as.factor(chas), y = nox)) + geom_boxplot()
## Adding some colours
ggplot(Boston, aes(x = as.factor(chas), y = rm , colour = as.factor(chas))) + geom_boxplot()
library(ggplot2)
## Using hist function to plot histogram
hist(Boston$medv, main = "Histogram of nox")
Creating 1 histogram per variable in the dataset can be slow. Try out the Hmisc package.
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
par(mar=c(1,1,1,1))
hist.data.frame(Boston)
## Using ggplot to create line graph of medv (y) vs nox (x)
ggplot(Boston, aes(x = nox, y = medv)) + geom_point(colour = "blue") + geom_line(colour = "red")
attach(Boston)
## The following objects are masked from Boston (pos = 7):
##
## age, black, chas, crim, dis, indus, lstat, medv, nox, ptratio, rad,
## rm, tax, zn
counts = table(chas)
barplot(counts, main = "Chas Distribution" , xlab = "Tract bounds river = 1 and Tract does not bound river = 0")
## Grouped Barchart of rad vs chas
counts <- table(Boston$chas, Boston$rad)
barplot(counts, main="Boston Population Distribution by chas and rad",
xlab="rad", col=c("darkblue","red"),
legend = rownames(counts), beside=TRUE, args.legend = list(x = "topright",
inset = c(-0.1, 0)))
## Pie Chart showing distribution of index of accessibility to radial highways
library(plotrix)
counts_rad = table(rad)
pie3D(counts_rad,labels=names(counts_rad),explode=0.1,
main="Pie Chart of Rad (Index of accessibility to radial highways) Distribtion",)
attach(Boston)
## The following objects are masked from Boston (pos = 4):
##
## age, black, chas, crim, dis, indus, lstat, medv, nox, ptratio, rad,
## rm, tax, zn
## The following objects are masked from Boston (pos = 9):
##
## age, black, chas, crim, dis, indus, lstat, medv, nox, ptratio, rad,
## rm, tax, zn
plot(rm, medv)
plot(lstat, medv)
Correlations that fall between 0 to -1 are negative and 0 to +1 are positive correlations.
library(ggcorrplot)
## using cor function to draw correlation matrix between each variable in Boston Dataset
corr_boston = cor(Boston)
corr_boston
## crim zn indus chas nox
## crim 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## zn -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## chas -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## nox 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## rm -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## age 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## dis -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## tax 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## ptratio 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## black -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064
## lstat 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## medv -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## rm age dis rad tax ptratio
## crim -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431 0.2899456
## zn 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332 -0.3916785
## indus -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018 0.3832476
## chas 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152
## nox -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320 0.1889327
## rm 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783 -0.3555015
## age -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559 0.2615150
## dis 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158 -0.2324705
## rad -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819 0.4647412
## tax -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000 0.4608530
## ptratio -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304 1.0000000
## black 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801 -0.1773833
## lstat -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341 0.3740443
## medv 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593 -0.5077867
## black lstat medv
## crim -0.38506394 0.4556215 -0.3883046
## zn 0.17552032 -0.4129946 0.3604453
## indus -0.35697654 0.6037997 -0.4837252
## chas 0.04878848 -0.0539293 0.1752602
## nox -0.38005064 0.5908789 -0.4273208
## rm 0.12806864 -0.6138083 0.6953599
## age -0.27353398 0.6023385 -0.3769546
## dis 0.29151167 -0.4969958 0.2499287
## rad -0.44441282 0.4886763 -0.3816262
## tax -0.44180801 0.5439934 -0.4685359
## ptratio -0.17738330 0.3740443 -0.5077867
## black 1.00000000 -0.3660869 0.3334608
## lstat -0.36608690 1.0000000 -0.7376627
## medv 0.33346082 -0.7376627 1.0000000
## Visualization the correlation matrix
ggcorrplot(corr_boston)
## Circle
ggcorrplot(corr_boston, method = "circle")
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## Reordering the Boston correlation matrix
ggcorrplot(corr_boston, hc.order = TRUE)
## Get the lower triangle
ggcorrplot(corr_boston, hc.order = TRUE, type = "lower", outline.col = "white", )
## Get the upper triangle
ggcorrplot(corr_boston, hc.order= TRUE, type = "upper", outline.col = "white", )
## Adding crosses to show those correlation with no significance coefficient
p_boston = cor_pmat(Boston)
p_boston
## crim zn indus chas nox
## crim 0.000000e+00 5.506472e-06 1.450349e-21 2.094345e-01 3.751739e-23
## zn 5.506472e-06 0.000000e+00 1.289161e-38 3.378103e-01 7.231578e-36
## indus 1.450349e-21 1.289161e-38 0.000000e+00 1.574628e-01 7.913361e-98
## chas 2.094345e-01 3.378103e-01 1.574628e-01 0.000000e+00 4.029050e-02
## nox 3.751739e-23 7.231578e-36 7.913361e-98 4.029050e-02 0.000000e+00
## rm 6.346703e-07 6.935337e-13 5.328458e-20 4.018410e-02 3.818694e-12
## age 2.854869e-16 7.575575e-45 8.409642e-61 5.177446e-02 7.452392e-86
## dis 8.519949e-19 9.748287e-66 3.586280e-78 2.568848e-02 4.233063e-100
## rad 2.693844e-56 6.988109e-13 8.368289e-50 8.686789e-01 3.342034e-53
## tax 2.357127e-47 4.385492e-13 3.018199e-82 4.244225e-01 1.093287e-66
## ptratio 2.942922e-11 5.325074e-20 3.774843e-19 6.203916e-03 1.885692e-05
## black 2.487274e-19 7.207719e-05 1.184586e-16 2.733379e-01 7.816936e-19
## lstat 2.654277e-27 2.908736e-22 1.381948e-51 2.258990e-01 5.979284e-49
## medv 1.173987e-19 5.713584e-17 4.900260e-31 7.390623e-05 7.065042e-24
## rm age dis rad tax
## crim 6.346703e-07 2.854869e-16 8.519949e-19 2.693844e-56 2.357127e-47
## zn 6.935337e-13 7.575575e-45 9.748287e-66 6.988109e-13 4.385492e-13
## indus 5.328458e-20 8.409642e-61 3.586280e-78 8.368289e-50 3.018199e-82
## chas 4.018410e-02 5.177446e-02 2.568848e-02 8.686789e-01 4.244225e-01
## nox 3.818694e-12 7.452392e-86 4.233063e-100 3.342034e-53 1.093287e-66
## rm 0.000000e+00 4.459649e-08 3.237746e-06 1.918446e-06 2.086816e-11
## age 4.459649e-08 0.000000e+00 9.857534e-92 2.360876e-27 2.551067e-34
## dis 3.237746e-06 9.857534e-92 0.000000e+00 1.418269e-32 1.025931e-38
## rad 1.918446e-06 2.360876e-27 1.418269e-32 0.000000e+00 4.129920e-195
## tax 2.086816e-11 2.551067e-34 1.025931e-38 4.129920e-195 0.000000e+00
## ptratio 1.610820e-16 2.338885e-09 1.229920e-07 1.778554e-28 5.686833e-28
## black 3.906695e-03 3.911801e-10 2.278649e-11 6.592918e-26 1.367562e-25
## lstat 1.033009e-53 2.783924e-51 6.356331e-33 9.904457e-32 2.583867e-40
## medv 2.487229e-74 1.569982e-18 1.206612e-08 5.465933e-19 5.637734e-29
## ptratio black lstat medv
## crim 2.942922e-11 2.487274e-19 2.654277e-27 1.173987e-19
## zn 5.325074e-20 7.207719e-05 2.908736e-22 5.713584e-17
## indus 3.774843e-19 1.184586e-16 1.381948e-51 4.900260e-31
## chas 6.203916e-03 2.733379e-01 2.258990e-01 7.390623e-05
## nox 1.885692e-05 7.816936e-19 5.979284e-49 7.065042e-24
## rm 1.610820e-16 3.906695e-03 1.033009e-53 2.487229e-74
## age 2.338885e-09 3.911801e-10 2.783924e-51 1.569982e-18
## dis 1.229920e-07 2.278649e-11 6.356331e-33 1.206612e-08
## rad 1.778554e-28 6.592918e-26 9.904457e-32 5.465933e-19
## tax 5.686833e-28 1.367562e-25 2.583867e-40 5.637734e-29
## ptratio 0.000000e+00 6.017320e-05 3.003524e-18 1.609509e-34
## black 6.017320e-05 0.000000e+00 1.712280e-17 1.318113e-14
## lstat 3.003524e-18 1.712280e-17 0.000000e+00 5.081103e-88
## medv 1.609509e-34 1.318113e-14 5.081103e-88 0.000000e+00
ggcorrplot(corr_boston, hc.order = TRUE, type = "lower", insig = "blank")
In this section, we will use 13 factors or predictors to predict medv (Median Value of owner-occupied properties in $1000s) and nox (nitrogen oxides concentration (parts per 10 million)). By using correlation matrix, we will seek for the predictor with the strongest linear correlation with medv and nox and build a model.
## Adding the correlation coefficients and changing label size
ggcorrplot(corr_boston, hc.order = TRUE, type = "upper", lab = TRUE, lab_size = 3, insig = "blank")
## Finding the variables with the strongest correlation with medv and
nox
From the correlation plot above, the inverse linear correlation between lstat and medv is the very strong where lstat is the % lower status of the population. The linear correlation between indus and nox is also very strong but we will cover this in the later part.
We will now visualize the linear relationship between the 2 variables where y = medv (dependent variable) and x = lstat (independent variable).
## medv against lstat scatterplot
ggplot(Boston, aes(x = lstat, y = medv)) + geom_point(size = 2)
The scatterplot above demonstrates a slightly curved linear relationship between medv and lstat.
## Correlation Coefficient of medv v lstat
attach(Boston)
## The following objects are masked from Boston (pos = 4):
##
## age, black, chas, crim, dis, indus, lstat, medv, nox, ptratio, rad,
## rm, tax, zn
## The following objects are masked from Boston (pos = 6):
##
## age, black, chas, crim, dis, indus, lstat, medv, nox, ptratio, rad,
## rm, tax, zn
## The following objects are masked from Boston (pos = 11):
##
## age, black, chas, crim, dis, indus, lstat, medv, nox, ptratio, rad,
## rm, tax, zn
cor(lstat, medv)
## [1] -0.7376627
We will go with the hypothesis that if % lower status of the population (lstat) increase then median value of owner occupied properties (medv) will decrease.
To proof that the Null Hypothesis: Increase in lstat will not decrease medv.
cor.test(lstat, medv)
##
## Pearson's product-moment correlation
##
## data: lstat and medv
## t = -24.528, df = 504, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7749982 -0.6951959
## sample estimates:
## cor
## -0.7376627
pval = cor.test(lstat, medv)$p.value
Since, p value (pval) = 5.08e-88 < α =0.05, we reject the null hypothesis. Medv has a significant inverse linear correlation with lstat.
fit.lm = lm(medv~lstat, data = Boston)
fit.lm
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Coefficients:
## (Intercept) lstat
## 34.55 -0.95
coef(fit.lm)
## (Intercept) lstat
## 34.5538409 -0.9500494
Estimated equation is y = -0.95x + 34.554.
summary(fit.lm)
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
F-statistic = 601.6 and corresponding p value < 2.2e-16 which is < 0.05, the linear regression model is statistically significant.
Multiple R-squared = 0.5441 meaning that 54.41% of the variation in the response variable, y or medv can be explained by the predictor variable x or lstat. This is a decent r2 value but perhaps we can achieve higher accuracy with a curvilinear model.
Coefficient estimate of lstat = -0.95. This means that each additional 1% increase in lstat is associated with $-950 average decrease in medv.
Estimated regression equation:y = -0.95x + 34.554
##Create diagnostic plots
plot(fit.lm)
This plot helps to determine if the residuals exhibit non-linear patterns. If the red line across the center of the plot is roughly horizontal then we can assume that the residuals follow a linear pattern.
In our case, the red line deviates slightly on the right side from a perfect horizontal line but not severely. The residuals follow a roughly linear pattern and a linear regression model is appropriate for this dataset but it is possible we can
This plot is used to determine if the residuals of the regression model are normally distributed. If the points in this plot fall roughly along a straight diagonal line, then we can assume the residuals are normally distributed.
The observations #268/373/372 deviate far from the line. However, we should note that the residuals start to exhibit non-normal distribution behaviour as residuals increase.
This plot is used to check the assumption of equal variance (also called “homoscedasticity”) among the residuals in our regression model. If the red line is roughly horizontal across the plot, then the assumption of equal variance is likely met.
The red line displayed for our case is not exactly a straight horizontal line but it does not deviate too drastically.
This plot is used to identify influential observations. If any points in this plot fall outside of Cook’s distance (the dashed lines) then it is an influential observation.
All observations are far within cook’s distance and there are not many overly influential points within the data set.
#create scatterplot of medv vs lstat
plot(Boston$lstat, Boston$medv, col = "blue", main = "Summary of Linear Regression Model", xlab = "lstat", ylab = "medv")
#add fitted regression line
abline(fit.lm)
predict_lm = predict(fit.lm, data.frame(lstat = c(10, 25, 40)), se.fit = TRUE, interval = "predict")
predict_lm
## $fit
## fit lwr upr
## 1 25.053347 12.827626 37.27907
## 2 10.802607 -1.457504 23.06272
## 3 -3.448133 -15.848067 8.95180
##
## $se.fit
## 1 2 3
## 0.2948138 0.5523293 1.0946895
##
## $df
## [1] 504
##
## $residual.scale
## [1] 6.21576
names(predict_lm)
## [1] "fit" "se.fit" "df" "residual.scale"
predict_lm$fit
## fit lwr upr
## 1 25.053347 12.827626 37.27907
## 2 10.802607 -1.457504 23.06272
## 3 -3.448133 -15.848067 8.95180
This linear regression model predicts a medv value of 25053, 10802 and -345 for lstat values of 10,25 and 40 respectively. Logically speaking, value of a property should not be negative and hence, the model is not perfect.
Let us explore improving the accuracy of this model by using a curvilinear model instead.
## Creating the non linear regression model
fit_nlm = lm(medv ~ lstat + I(lstat^2), data = Boston)
fit_nlm
##
## Call:
## lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
##
## Coefficients:
## (Intercept) lstat I(lstat^2)
## 42.86201 -2.33282 0.04355
coef(fit_nlm)
## (Intercept) lstat I(lstat^2)
## 42.86200733 -2.33282110 0.04354689
The non-linear regression model equation is estimated to be y = 0.0435x^2 -2.33x + 42.9
summary(fit_nlm)
##
## Call:
## lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2834 -3.8313 -0.5295 2.3095 25.4148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.862007 0.872084 49.15 <2e-16 ***
## lstat -2.332821 0.123803 -18.84 <2e-16 ***
## I(lstat^2) 0.043547 0.003745 11.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
## F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
F-statistic = 448.5 and corresponding p value < 2.2e-16 which is < 0.05, the linear regression model is statistically significant.
Multiple R-squared = 0.6407 meaning that 64.07% of the variation in the response variable, y or medv can be explained by the predictor variable x or lstat. There is an approximately 10% improvement in the r^2 value in the non-linear model compared to the linear model.
##Create diagnostic plots
plot(fit_nlm)
The red line is more horizontal compared to the linear model which was more U-shaped. The non-linear regression is more appropriate for this dataset.
The observations #268/373/372 still deviate far from the line. However, we should note that the residuals do not deviate as far from the diagonal line compared to the linear model. The residual more normally distributed compared to the linear model.
The red line displayed is still not a straight horizontal line but it deviates slightly less than the linear model.
All observations are far within cook’s distance and there are not many overly influential points within the data set. It is interesting to note that the red dash lines are visible for the non-linear regression line and the observations are relatively more influential than the observations in the linear model.
data_b = data.frame(x = Boston$lstat, y = Boston$medv)
nls_fit = nls(I(y ~ a + (b*x) + (c*x^2)), data = data_b, start = list(a = 42.862, b = -2.333, c = 0.0435), trace = T)
## 15347.39 (3.14e-03): par = (42.862 -2.333 0.0435)
## 15347.24 (2.12e-09): par = (42.86201 -2.332821 0.04354689)
summary(nls_fit)
##
## Formula: y ~ a + (b * x) + (c * x^2)
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## a 42.862007 0.872084 49.15 <2e-16 ***
## b -2.332821 0.123803 -18.84 <2e-16 ***
## c 0.043547 0.003745 11.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.524 on 503 degrees of freedom
##
## Number of iterations to convergence: 1
## Achieved convergence tolerance: 2.118e-09
(Intercept) lstat I(lstat^2) 42.86200733 -2.33282110 0.04354689
predict_nlm = predict(fit_nlm, data.frame(lstat = c(10, 25, 40)), se.fit = TRUE, interval = "predict")
predict_nlm
## $fit
## fit lwr upr
## 1 23.88849 13.0221089 34.75486
## 2 11.75829 0.8619344 22.65464
## 3 19.22419 7.5578547 30.89052
##
## $se.fit
## 1 2 3
## 0.2804907 0.4976684 2.1790803
##
## $df
## [1] 503
##
## $residual.scale
## [1] 5.523714
names(predict_nlm)
## [1] "fit" "se.fit" "df" "residual.scale"
predict_nlm$fit
## fit lwr upr
## 1 23.88849 13.0221089 34.75486
## 2 11.75829 0.8619344 22.65464
## 3 19.22419 7.5578547 30.89052
ggcorrplot(corr_boston, hc.order = TRUE, type = "upper", lab = TRUE, lab_size = 3, insig = "blank")
## Creating model(s) for nox and indus
Based on the correlation map above, nox has a strongest correlation level of 0.76 with indus.
ggplot(Boston, aes(x =indus, y = nox)) + geom_point(size = 2)
There is an obvious linear relationship between nox and indus.
## Correlation Coefficient of nox v indus
attach(Boston)
## The following objects are masked from Boston (pos = 3):
##
## age, black, chas, crim, dis, indus, lstat, medv, nox, ptratio, rad,
## rm, tax, zn
## The following objects are masked from Boston (pos = 5):
##
## age, black, chas, crim, dis, indus, lstat, medv, nox, ptratio, rad,
## rm, tax, zn
## The following objects are masked from Boston (pos = 7):
##
## age, black, chas, crim, dis, indus, lstat, medv, nox, ptratio, rad,
## rm, tax, zn
## The following objects are masked from Boston (pos = 12):
##
## age, black, chas, crim, dis, indus, lstat, medv, nox, ptratio, rad,
## rm, tax, zn
cor(indus, nox)
## [1] 0.7636514
We will go with the hypothesis that if proportion of non-retail business acres per town (indus) increases then nitrogen oxides concentration (nox) will increase.
To proof that the Null Hypothesis: Increase in indus will not increase nox.
cor.test(indus, nox)
##
## Pearson's product-moment correlation
##
## data: indus and nox
## t = 26.554, df = 504, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7247252 0.7977188
## sample estimates:
## cor
## 0.7636514
pval2 = cor.test(indus, nox)$p.value
Since, p value (pval) = 2.2e-16 < α =0.05, we reject the null hypothesis. Nox has a significant linear correlation with indus.
fit.lm2 = lm(nox~indus, data = Boston)
fit.lm2
##
## Call:
## lm(formula = nox ~ indus, data = Boston)
##
## Coefficients:
## (Intercept) indus
## 0.4110 0.0129
coef(fit.lm2)
## (Intercept) indus
## 0.41104425 0.01289878
Estimated equation is y = 0.0129x + 0.411
summary(fit.lm2)
##
## Call:
## lm(formula = nox ~ indus, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.160898 -0.052175 -0.001458 0.037011 0.207398
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.4110442 0.0063521 64.71 <2e-16 ***
## indus 0.0128988 0.0004858 26.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07489 on 504 degrees of freedom
## Multiple R-squared: 0.5832, Adjusted R-squared: 0.5823
## F-statistic: 705.1 on 1 and 504 DF, p-value: < 2.2e-16
F-statistic = 705.1 and corresponding p value < 2.2e-16 which is < 0.05, the linear regression model is statistically significant.
Multiple R-squared = 0.5832 meaning that 58.32% of the variation in the response variable, y or nox can be explained by the predictor variable x or indus. This is a decent r2 value but perhaps we can achieve higher accuracy with a curvilinear model.
x.1 = 1
y.1 = 0.0129*x.1 + 0.411
y.1
## [1] 0.4239
x.2 = 2
y.2 = 0.0129*x.2 + 0.411
y.2
## [1] 0.4368
y.2 - y.1
## [1] 0.0129
Coefficient estimate of indus = 0.0129. This means that each additional 1 unit increase in indus is associated with average increase of 0.0129 PP10M in nox.
Estimated regression equation:y = 0.0129x + 0.411
##Create diagnostic plots
plot(fit.lm2)
In our case, the red line is fairly horizontal till the 0.65 mark on the x axis where it starts to slant downwards. Linear regression model should work on nox vs indus.
The observation #145 deviate far from the line. Most points fall within the straight diagonal line.
The red line is not too horizontal for this case as it starts horizontal and moves diagonally upwards. The deviation is quite drastic and poor variance is seen.
All observations are far within cook’s distance and there are not many overly influential points within the data set.
#create scatterplot of nox vs indus
plot(Boston$indus, Boston$nox, col = "blue", main = "Summary of Linear Regression Model", xlab = "indus", ylab = "nox")
#add fitted regression line
abline(fit.lm2)
predict_lm2 = predict(fit.lm2, data.frame(indus = c(5, 15, 25)), se.fit = TRUE, interval = "predict")
predict_lm2
## $fit
## fit lwr upr
## 1 0.4755381 0.3281450 0.6229312
## 2 0.6045259 0.4572030 0.7518487
## 3 0.7335136 0.5856439 0.8813834
##
## $se.fit
## 1 2 3
## 0.004468758 0.003821658 0.007512171
##
## $df
## [1] 504
##
## $residual.scale
## [1] 0.07488814
names(predict_lm2)
## [1] "fit" "se.fit" "df" "residual.scale"
predict_lm2$fit
## fit lwr upr
## 1 0.4755381 0.3281450 0.6229312
## 2 0.6045259 0.4572030 0.7518487
## 3 0.7335136 0.5856439 0.8813834
This linear regression model predicts a nox value of 0.476, 0.605 and 0.734 for lstat values of 5, 15 and 25 respectively.
Let us explore improving the accuracy of this model by using a curvilinear model instead.
## Creating the non linear regression model
fit_nlm2 = lm(nox ~ indus + I(lstat^2), data = Boston)
fit_nlm2
##
## Call:
## lm(formula = nox ~ indus + I(lstat^2), data = Boston)
##
## Coefficients:
## (Intercept) indus I(lstat^2)
## 4.105e-01 1.134e-02 8.502e-05
coef(fit_nlm2)
## (Intercept) indus I(lstat^2)
## 4.104546e-01 1.134104e-02 8.501568e-05
The non-linear regression model equation is estimated to be y = 8.50e-5x^2 + 1.13e-2x + 4.10e-1
summary(fit_nlm2)
##
## Call:
## lm(formula = nox ~ indus + I(lstat^2), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.190946 -0.045466 -0.002307 0.032283 0.233845
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.105e-01 6.193e-03 66.273 < 2e-16 ***
## indus 1.134e-02 5.595e-04 20.272 < 2e-16 ***
## I(lstat^2) 8.502e-05 1.626e-05 5.229 2.5e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.073 on 503 degrees of freedom
## Multiple R-squared: 0.6047, Adjusted R-squared: 0.6031
## F-statistic: 384.7 on 2 and 503 DF, p-value: < 2.2e-16
F-statistic = 384.7 and corresponding p value < 2.2e-16 which is < 0.05, the non-linear regression model is statistically significant.
Multiple R-squared = 0.6047 meaning that 60.47% of the variation in the response variable, y or nox can be explained by the predictor variable x or indus. There is an approximately 2% improvement in the r^2 value in the non-linear model compared to the linear model.
##Create diagnostic plots
plot(fit_nlm2)
In our case, the red line is more horizontal compared to the red line for the linear regression model. We can assume that a non-linear regression model is more appropriate for the residuals.
The observation #160/152 deviate far from the line. Most points fall within the straight diagonal line.
The red line is not horizontal for this case as it starts horizontal and moves diagonally upwards. The deviation is quite drastic and poor variance is seen.
All observations are far within cook’s distance and there are not many overly influential points within the data set.
data_c = data.frame(x = Boston$indus, y = Boston$nox)
nls_fit2 = nls(I(y ~ a + (b*x) + (c*x^2)), data = data_c, start = list(a = 4.11e-01 , b = 1.13e-02, c = 8.50e-05), trace = T)
## 2.880089 (2.06e-01): par = (0.411 0.0113 8.5e-05)
## 2.763038 (1.47e-07): par = (0.3822677 0.01992304 -0.0002891891)
summary(nls_fit2)
##
## Formula: y ~ a + (b * x) + (c * x^2)
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## a 3.823e-01 1.054e-02 36.260 < 2e-16 ***
## b 1.992e-02 2.121e-03 9.393 < 2e-16 ***
## c -2.892e-04 8.505e-05 -3.400 0.000727 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07412 on 503 degrees of freedom
##
## Number of iterations to convergence: 1
## Achieved convergence tolerance: 1.465e-07
nls_fit2
## Nonlinear regression model
## model: y ~ a + (b * x) + (c * x^2)
## data: data_c
## a b c
## 0.3822677 0.0199230 -0.0002892
## residual sum-of-squares: 2.763
##
## Number of iterations to convergence: 1
## Achieved convergence tolerance: 1.465e-07
coef(nls_fit2)
## a b c
## 0.3822677247 0.0199230378 -0.0002891891
Coefficients: (Intercept) indus I(lstat^2)
4.105e-01 1.134e-02 8.502e-05
for(x2 in 10:25)
{y3 = 8.50e-5*x2^2 + 1.13e-2*x2 + 4.10e-1
print(y3)}
## [1] 0.5315
## [1] 0.544585
## [1] 0.55784
## [1] 0.571265
## [1] 0.58486
## [1] 0.598625
## [1] 0.61256
## [1] 0.626665
## [1] 0.64094
## [1] 0.655385
## [1] 0.67
## [1] 0.684785
## [1] 0.69974
## [1] 0.714865
## [1] 0.73016
## [1] 0.745625
Testing several models is always good in order to find the model with the best fit and can make the most accurate predictions.