Let’s start the necessary packages for this exercise

library(reshape2)
library(ggplot2)
library(data.table)
library(GGally)
library(corrplot)
library(gridExtra)
housing<-read.table(file = "~/Data_projects/Date_Guide/housing.data")
colnames(housing) <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV") 
attach(housing) 
summary(housing)
##       CRIM                ZN             INDUS            CHAS        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       NOX               RM             AGE              DIS        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       RAD              TAX           PTRATIO            B         
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      LSTAT            MEDV      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
str(housing)
## 'data.frame':    506 obs. of  14 variables:
##  $ CRIM   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ ZN     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ INDUS  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ CHAS   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ NOX    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ RM     : num  6.58 6.42 7.18 7 7.15 ...
##  $ AGE    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ DIS    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ RAD    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ TAX    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ PTRATIO: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ B      : num  397 397 393 395 397 ...
##  $ LSTAT  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ MEDV   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Zum erste let’s load our dataset that concerns housing values in suburbs of Boston provided by the StatLib library at Carnegie Mellon University.

It contains 506 instances of the following variables:

Attribute Information:

1. CRIM      per capita crime rate by town
2. ZN        proportion of residential land zoned for lots over 
             25,000 sq.ft.
3. INDUS     proportion of non-retail business acres per town
4. CHAS      Charles River dummy variable (= 1 if tract bounds 
             river; 0 otherwise)
5. NOX       nitric oxides concentration (parts per 10 million)
6. RM        average number of rooms per dwelling
7. AGE       proportion of owner-occupied units built prior to 1940
8. DIS       weighted distances to five Boston employment centres
9. RAD       index of accessibility to radial highways
10. TAX      full-value property-tax rate per $10,000
11. PTRATIO  pupil-teacher ratio by town
12. B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks 
             by town
13. LSTAT    % lower status of the population
14. MEDV     Median value of owner-occupied homes in $1000's

Let’s get to some EDA to take a look what we have here:

#pairs(housing)
#ggpairs(housing)
#ggplot(melt(housing),aes(x=value)) + geom_histogram() + facet_wrap(~variable)
par(mfrow=c(2,3), mai = c(1, 0.1, 0.1, 0.1))
hist(CRIM,breaks=50)
hist(ZN)
hist(INDUS)
qqnorm(CRIM); qqline(CRIM, col = 2)
qqnorm(ZN); qqline(ZN, col = 2)
qqnorm(INDUS); qqline(INDUS,col=2)

shapiro.test(CRIM)
## 
##  Shapiro-Wilk normality test
## 
## data:  CRIM
## W = 0.45, p-value < 2.2e-16
shapiro.test(ZN)
## 
##  Shapiro-Wilk normality test
## 
## data:  ZN
## W = 0.5559, p-value < 2.2e-16
shapiro.test(INDUS)
## 
##  Shapiro-Wilk normality test
## 
## data:  INDUS
## W = 0.8998, p-value < 2.2e-16
par(mfrow=c(2,3), mai = c(1, 0.1, 0.1, 0.1))
hist(CHAS)
hist(NOX)
hist(RM)
qqnorm(CHAS); qqline(CHAS, col = 2)
qqnorm(NOX); qqline(NOX, col = 2)
qqnorm(RM); qqline(RM,col=2)

shapiro.test(CHAS)
## 
##  Shapiro-Wilk normality test
## 
## data:  CHAS
## W = 0.2748, p-value < 2.2e-16
shapiro.test(NOX)
## 
##  Shapiro-Wilk normality test
## 
## data:  NOX
## W = 0.9356, p-value = 5.776e-14
shapiro.test(RM)
## 
##  Shapiro-Wilk normality test
## 
## data:  RM
## W = 0.9609, p-value = 2.412e-10
par(mfrow=c(2,3), mai = c(1, 0.1, 0.1, 0.1))
hist(AGE,breaks=80)
hist(DIS,breaks=50)
hist(RAD,breaks=30)
qqnorm(AGE); qqline(AGE, col = 2)
qqnorm(DIS); qqline(DIS, col = 2)
qqnorm(RAD); qqline(RAD,col=2)

shapiro.test(AGE)
## 
##  Shapiro-Wilk normality test
## 
## data:  AGE
## W = 0.892, p-value < 2.2e-16
shapiro.test(DIS)
## 
##  Shapiro-Wilk normality test
## 
## data:  DIS
## W = 0.9032, p-value < 2.2e-16
shapiro.test(RAD)
## 
##  Shapiro-Wilk normality test
## 
## data:  RAD
## W = 0.6796, p-value < 2.2e-16
par(mfrow=c(2,3), mai = c(1, 0.1, 0.1, 0.1))
hist(TAX,breaks=20)
hist(PTRATIO,breaks=30)
hist(B,breaks=30)
qqnorm(TAX); qqline(TAX, col = 2)
qqnorm(PTRATIO); qqline(PTRATIO, col = 2)
qqnorm(B); qqline(B,col=2)

shapiro.test(TAX)
## 
##  Shapiro-Wilk normality test
## 
## data:  TAX
## W = 0.8152, p-value < 2.2e-16
shapiro.test(PTRATIO)
## 
##  Shapiro-Wilk normality test
## 
## data:  PTRATIO
## W = 0.9036, p-value < 2.2e-16
shapiro.test(B)
## 
##  Shapiro-Wilk normality test
## 
## data:  B
## W = 0.4768, p-value < 2.2e-16
par(mfrow=c(2,3), mai = c(1, 0.1, 0.1, 0.1))
hist(LSTAT,breaks=30)
hist(MEDV, breaks=30)
qqnorm(LSTAT); qqline(LSTAT, col = 2)
qqnorm(MEDV); qqline(MEDV,col=2)
shapiro.test(LSTAT)
## 
##  Shapiro-Wilk normality test
## 
## data:  LSTAT
## W = 0.9369, p-value = 8.287e-14
shapiro.test(MEDV)
## 
##  Shapiro-Wilk normality test
## 
## data:  MEDV
## W = 0.9172, p-value = 4.941e-16

In order to construct linear model we are mostly interested if any variables are correlated with the MEDV - our response variable (y)

correlations <- cor(housing); 
correlations
##                CRIM          ZN       INDUS         CHAS         NOX
## CRIM     1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## ZN      -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## INDUS    0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## CHAS    -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## NOX      0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## RM      -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## AGE      0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## DIS     -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## RAD      0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## TAX      0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## PTRATIO  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## B       -0.38506394  0.17552032 -0.35697654  0.048788485 -0.38005064
## LSTAT    0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## MEDV    -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
##                  RM         AGE         DIS          RAD         TAX
## CRIM    -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431
## ZN       0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332
## INDUS   -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018
## CHAS     0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652
## NOX     -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320
## RM       1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783
## AGE     -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559
## DIS      0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158
## RAD     -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819
## TAX     -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000
## PTRATIO -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304
## B        0.12806864 -0.27353398  0.29151167 -0.444412816 -0.44180801
## LSTAT   -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341
## MEDV     0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593
##            PTRATIO           B      LSTAT       MEDV
## CRIM     0.2899456 -0.38506394  0.4556215 -0.3883046
## ZN      -0.3916785  0.17552032 -0.4129946  0.3604453
## INDUS    0.3832476 -0.35697654  0.6037997 -0.4837252
## CHAS    -0.1215152  0.04878848 -0.0539293  0.1752602
## NOX      0.1889327 -0.38005064  0.5908789 -0.4273208
## RM      -0.3555015  0.12806864 -0.6138083  0.6953599
## AGE      0.2615150 -0.27353398  0.6023385 -0.3769546
## DIS     -0.2324705  0.29151167 -0.4969958  0.2499287
## RAD      0.4647412 -0.44441282  0.4886763 -0.3816262
## TAX      0.4608530 -0.44180801  0.5439934 -0.4685359
## PTRATIO  1.0000000 -0.17738330  0.3740443 -0.5077867
## B       -0.1773833  1.00000000 -0.3660869  0.3334608
## LSTAT    0.3740443 -0.36608690  1.0000000 -0.7376627
## MEDV    -0.5077867  0.33346082 -0.7376627  1.0000000
par(mfrow=c(1,1))
#boxplot(housing)
corrplot.mixed(correlations)

Analysis

The variable we are interested in predicting is Median value of owner-occupied homes in $1000’s (MEDV). The following variables are correlated to it and are suited to be response variables: RM 0.7 (average number of rooms per dwelling), TAX -0.47 (full-value property-tax rate per $10,000), INDUS -0.48 (proportion of non-retail business acres per town) LSTAT -0.74 (% lower status of the population), PTRATIO -0.51 (pupil-teacher ratio by town), NOX -0.42 (nitric oxides concentration)

Let’s build a model

bob <- housing[,c("MEDV", "RM", "TAX", "INDUS", "LSTAT", "PTRATIO", "NOX")]
housing_lm <- lm(MEDV ~., bob)
summary(housing_lm)
## 
## Call:
## lm(formula = MEDV ~ ., data = bob)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.9802  -3.0470  -0.9347   1.7100  30.4545 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 19.145818   4.309359   4.443 1.09e-05 ***
## RM           4.655928   0.431815  10.782  < 2e-16 ***
## TAX         -0.002901   0.002225  -1.304    0.193    
## INDUS        0.087187   0.061080   1.427    0.154    
## LSTAT       -0.545935   0.050641 -10.780  < 2e-16 ***
## PTRATIO     -0.913819   0.131157  -6.967 1.03e-11 ***
## NOX         -3.403117   3.478085  -0.978    0.328    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.225 on 499 degrees of freedom
## Multiple R-squared:  0.681,  Adjusted R-squared:  0.6772 
## F-statistic: 177.6 on 6 and 499 DF,  p-value: < 2.2e-16

Adjusted R-squared value is 0.6772 meaning that that proportion of housing prices can be predicted by the above meantioned variables

qqnorm(housing_lm$residuals, ylab = "Residuals Quantiles"); qqline(housing_lm$residuals, col = "red")

Finally let’s visualize the model

layout(matrix(c(1,2,3,4),2,2))
plot(housing_lm)