Let’s start the necessary packages for this exercise
library(reshape2)
library(ggplot2)
library(data.table)
library(GGally)
library(corrplot)
library(gridExtra)
housing<-read.table(file = "~/Data_projects/Date_Guide/housing.data")
colnames(housing) <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV")
attach(housing)
summary(housing)
## CRIM ZN INDUS CHAS
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## NOX RM AGE DIS
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## RAD TAX PTRATIO B
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## LSTAT MEDV
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
str(housing)
## 'data.frame': 506 obs. of 14 variables:
## $ CRIM : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ ZN : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ INDUS : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ CHAS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ NOX : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ RM : num 6.58 6.42 7.18 7 7.15 ...
## $ AGE : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ DIS : num 4.09 4.97 4.97 6.06 6.06 ...
## $ RAD : int 1 2 2 3 3 3 5 5 5 5 ...
## $ TAX : num 296 242 242 222 222 222 311 311 311 311 ...
## $ PTRATIO: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ B : num 397 397 393 395 397 ...
## $ LSTAT : num 4.98 9.14 4.03 2.94 5.33 ...
## $ MEDV : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Zum erste let’s load our dataset that concerns housing values in suburbs of Boston provided by the StatLib library at Carnegie Mellon University.
It contains 506 instances of the following variables:
Attribute Information:
1. CRIM per capita crime rate by town
2. ZN proportion of residential land zoned for lots over
25,000 sq.ft.
3. INDUS proportion of non-retail business acres per town
4. CHAS Charles River dummy variable (= 1 if tract bounds
river; 0 otherwise)
5. NOX nitric oxides concentration (parts per 10 million)
6. RM average number of rooms per dwelling
7. AGE proportion of owner-occupied units built prior to 1940
8. DIS weighted distances to five Boston employment centres
9. RAD index of accessibility to radial highways
10. TAX full-value property-tax rate per $10,000
11. PTRATIO pupil-teacher ratio by town
12. B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks
by town
13. LSTAT % lower status of the population
14. MEDV Median value of owner-occupied homes in $1000's
Let’s get to some EDA to take a look what we have here:
#pairs(housing)
#ggpairs(housing)
#ggplot(melt(housing),aes(x=value)) + geom_histogram() + facet_wrap(~variable)
par(mfrow=c(2,3), mai = c(1, 0.1, 0.1, 0.1))
hist(CRIM,breaks=50)
hist(ZN)
hist(INDUS)
qqnorm(CRIM); qqline(CRIM, col = 2)
qqnorm(ZN); qqline(ZN, col = 2)
qqnorm(INDUS); qqline(INDUS,col=2)
shapiro.test(CRIM)
##
## Shapiro-Wilk normality test
##
## data: CRIM
## W = 0.45, p-value < 2.2e-16
shapiro.test(ZN)
##
## Shapiro-Wilk normality test
##
## data: ZN
## W = 0.5559, p-value < 2.2e-16
shapiro.test(INDUS)
##
## Shapiro-Wilk normality test
##
## data: INDUS
## W = 0.8998, p-value < 2.2e-16
par(mfrow=c(2,3), mai = c(1, 0.1, 0.1, 0.1))
hist(CHAS)
hist(NOX)
hist(RM)
qqnorm(CHAS); qqline(CHAS, col = 2)
qqnorm(NOX); qqline(NOX, col = 2)
qqnorm(RM); qqline(RM,col=2)
shapiro.test(CHAS)
##
## Shapiro-Wilk normality test
##
## data: CHAS
## W = 0.2748, p-value < 2.2e-16
shapiro.test(NOX)
##
## Shapiro-Wilk normality test
##
## data: NOX
## W = 0.9356, p-value = 5.776e-14
shapiro.test(RM)
##
## Shapiro-Wilk normality test
##
## data: RM
## W = 0.9609, p-value = 2.412e-10
par(mfrow=c(2,3), mai = c(1, 0.1, 0.1, 0.1))
hist(AGE,breaks=80)
hist(DIS,breaks=50)
hist(RAD,breaks=30)
qqnorm(AGE); qqline(AGE, col = 2)
qqnorm(DIS); qqline(DIS, col = 2)
qqnorm(RAD); qqline(RAD,col=2)
shapiro.test(AGE)
##
## Shapiro-Wilk normality test
##
## data: AGE
## W = 0.892, p-value < 2.2e-16
shapiro.test(DIS)
##
## Shapiro-Wilk normality test
##
## data: DIS
## W = 0.9032, p-value < 2.2e-16
shapiro.test(RAD)
##
## Shapiro-Wilk normality test
##
## data: RAD
## W = 0.6796, p-value < 2.2e-16
par(mfrow=c(2,3), mai = c(1, 0.1, 0.1, 0.1))
hist(TAX,breaks=20)
hist(PTRATIO,breaks=30)
hist(B,breaks=30)
qqnorm(TAX); qqline(TAX, col = 2)
qqnorm(PTRATIO); qqline(PTRATIO, col = 2)
qqnorm(B); qqline(B,col=2)
shapiro.test(TAX)
##
## Shapiro-Wilk normality test
##
## data: TAX
## W = 0.8152, p-value < 2.2e-16
shapiro.test(PTRATIO)
##
## Shapiro-Wilk normality test
##
## data: PTRATIO
## W = 0.9036, p-value < 2.2e-16
shapiro.test(B)
##
## Shapiro-Wilk normality test
##
## data: B
## W = 0.4768, p-value < 2.2e-16
par(mfrow=c(2,3), mai = c(1, 0.1, 0.1, 0.1))
hist(LSTAT,breaks=30)
hist(MEDV, breaks=30)
qqnorm(LSTAT); qqline(LSTAT, col = 2)
qqnorm(MEDV); qqline(MEDV,col=2)
shapiro.test(LSTAT)
##
## Shapiro-Wilk normality test
##
## data: LSTAT
## W = 0.9369, p-value = 8.287e-14
shapiro.test(MEDV)
##
## Shapiro-Wilk normality test
##
## data: MEDV
## W = 0.9172, p-value = 4.941e-16
In order to construct linear model we are mostly interested if any variables are correlated with the MEDV - our response variable (y)
correlations <- cor(housing);
correlations
## CRIM ZN INDUS CHAS NOX
## CRIM 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171
## ZN -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371
## INDUS 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145
## CHAS -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281
## NOX 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000
## RM -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819
## AGE 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010
## DIS -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011
## RAD 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056
## TAX 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320
## PTRATIO 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268
## B -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064
## LSTAT 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892
## MEDV -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077
## RM AGE DIS RAD TAX
## CRIM -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431
## ZN 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332
## INDUS -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018
## CHAS 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652
## NOX -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320
## RM 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783
## AGE -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559
## DIS 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158
## RAD -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819
## TAX -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000
## PTRATIO -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304
## B 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801
## LSTAT -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341
## MEDV 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593
## PTRATIO B LSTAT MEDV
## CRIM 0.2899456 -0.38506394 0.4556215 -0.3883046
## ZN -0.3916785 0.17552032 -0.4129946 0.3604453
## INDUS 0.3832476 -0.35697654 0.6037997 -0.4837252
## CHAS -0.1215152 0.04878848 -0.0539293 0.1752602
## NOX 0.1889327 -0.38005064 0.5908789 -0.4273208
## RM -0.3555015 0.12806864 -0.6138083 0.6953599
## AGE 0.2615150 -0.27353398 0.6023385 -0.3769546
## DIS -0.2324705 0.29151167 -0.4969958 0.2499287
## RAD 0.4647412 -0.44441282 0.4886763 -0.3816262
## TAX 0.4608530 -0.44180801 0.5439934 -0.4685359
## PTRATIO 1.0000000 -0.17738330 0.3740443 -0.5077867
## B -0.1773833 1.00000000 -0.3660869 0.3334608
## LSTAT 0.3740443 -0.36608690 1.0000000 -0.7376627
## MEDV -0.5077867 0.33346082 -0.7376627 1.0000000
par(mfrow=c(1,1))
#boxplot(housing)
corrplot.mixed(correlations)
The variable we are interested in predicting is Median value of owner-occupied homes in $1000’s (MEDV). The following variables are correlated to it and are suited to be response variables: RM 0.7 (average number of rooms per dwelling), TAX -0.47 (full-value property-tax rate per $10,000), INDUS -0.48 (proportion of non-retail business acres per town) LSTAT -0.74 (% lower status of the population), PTRATIO -0.51 (pupil-teacher ratio by town), NOX -0.42 (nitric oxides concentration)
Let’s build a model
bob <- housing[,c("MEDV", "RM", "TAX", "INDUS", "LSTAT", "PTRATIO", "NOX")]
housing_lm <- lm(MEDV ~., bob)
summary(housing_lm)
##
## Call:
## lm(formula = MEDV ~ ., data = bob)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.9802 -3.0470 -0.9347 1.7100 30.4545
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.145818 4.309359 4.443 1.09e-05 ***
## RM 4.655928 0.431815 10.782 < 2e-16 ***
## TAX -0.002901 0.002225 -1.304 0.193
## INDUS 0.087187 0.061080 1.427 0.154
## LSTAT -0.545935 0.050641 -10.780 < 2e-16 ***
## PTRATIO -0.913819 0.131157 -6.967 1.03e-11 ***
## NOX -3.403117 3.478085 -0.978 0.328
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.225 on 499 degrees of freedom
## Multiple R-squared: 0.681, Adjusted R-squared: 0.6772
## F-statistic: 177.6 on 6 and 499 DF, p-value: < 2.2e-16
Adjusted R-squared value is 0.6772 meaning that that proportion of housing prices can be predicted by the above meantioned variables
qqnorm(housing_lm$residuals, ylab = "Residuals Quantiles"); qqline(housing_lm$residuals, col = "red")
Finally let’s visualize the model
layout(matrix(c(1,2,3,4),2,2))
plot(housing_lm)