FINAL ANALYSIS: The influence of various factors on Housing Prices.

Description of the Project and the Source File

My project relates to real estate. I study the median value of owner-occupied homes (MEDV) and investigate how factors such as the per capita crime rate, the full-value property-tax rate per $10,000, and the index of accessibility to radial highways affect it.

Description of CSV file

Number of data columns: 14

Number of rows: 506

There are 14 attributes in each case of the dataset. They are:

CRIM - per capita crime rate by town

ZN - proportion of residential land zoned for lots over 25,000 sq.ft.

INDUS - proportion of non-retail business acres per town.

CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)

NOX - nitric oxides concentration (parts per 10 million)

RM - average number of rooms per dwelling

AGE - proportion of owner-occupied units built prior to 1940

DIS - weighted distances to five employment centers

RAD - index of accessibility to radial highways

TAX - full-value property-tax rate per $10,000

PTRATIO - pupil-teacher ratio by town

B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town

LSTAT - % lower status of the population

MEDV - Median value of owner-occupied homes in $1000’s

Reading the dataset

house.df <- read.csv("housingdata.csv")
View(house.df)
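A quick sanity check after loading can catch a wrong path or mis-parsed columns early. The sketch below is illustrative only: housingdata.csv is not assumed to be available, so it writes a two-row stand-in file (values taken from the first two rows of the dataset) to a temporary path. The names csv_path, toy.df and expected_cols are hypothetical.

```r
# Stand-in for housingdata.csv: header plus the first two rows of three columns.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("CRIM,CHAS,MEDV",
             "0.00632,0,24.0",
             "0.02731,0,21.6"), csv_path)

toy.df <- read.csv(csv_path)

# Columns the analysis relies on (hypothetical subset for this sketch).
expected_cols <- c("CRIM", "CHAS", "MEDV")
missing_cols  <- setdiff(expected_cols, names(toy.df))
na_counts     <- colSums(is.na(toy.df))

stopifnot(length(missing_cols) == 0)  # every expected column is present
print(na_counts)                      # number of missing values per column
```

The same checks could be run on house.df directly once the real file has been read.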

Visualizing the length and breadth of the dataset

dim(house.df)

## [1] 506  14

Structure of the dataframe

str(house.df)

## 'data.frame':    506 obs. of  14 variables:
##  $ CRIM   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ ZN     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ INDUS  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ CHAS   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ NOX    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ RM     : num  6.58 6.42 7.18 7 7.15 ...
##  $ AGE    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ DIS    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ RAD    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ TAX    : int  296 242 242 222 222 222 311 311 311 311 ...
##  $ PTRATIO: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ B      : num  397 397 393 395 397 ...
##  $ LSTAT  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ MEDV   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Attaching the dataset

attach(house.df)

Summary and description of the dataset

summary(house.df)

##       CRIM                ZN             INDUS            CHAS        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       NOX               RM             AGE              DIS        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       RAD              TAX           PTRATIO            B         
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      LSTAT            MEDV      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

library(psych)
describe(house.df)

##         vars   n   mean     sd median trimmed    mad    min    max  range
## CRIM       1 506   3.61   8.60   0.26    1.68   0.33   0.01  88.98  88.97
## ZN         2 506  11.36  23.32   0.00    5.08   0.00   0.00 100.00 100.00
## INDUS      3 506  11.14   6.86   9.69   10.93   9.37   0.46  27.74  27.28
## CHAS       4 506   0.07   0.25   0.00    0.00   0.00   0.00   1.00   1.00
## NOX        5 506   0.55   0.12   0.54    0.55   0.13   0.38   0.87   0.49
## RM         6 506   6.28   0.70   6.21    6.25   0.51   3.56   8.78   5.22
## AGE        7 506  68.57  28.15  77.50   71.20  28.98   2.90 100.00  97.10
## DIS        8 506   3.80   2.11   3.21    3.54   1.91   1.13  12.13  11.00
## RAD        9 506   9.55   8.71   5.00    8.73   2.97   1.00  24.00  23.00
## TAX       10 506 408.24 168.54 330.00  400.04 108.23 187.00 711.00 524.00
## PTRATIO   11 506  18.46   2.16  19.05   18.66   1.70  12.60  22.00   9.40
## B         12 506 356.67  91.29 391.44  383.17   8.09   0.32 396.90 396.58
## LSTAT     13 506  12.65   7.14  11.36   11.90   7.11   1.73  37.97  36.24
## MEDV      14 506  22.53   9.20  21.20   21.56   5.93   5.00  50.00  45.00
##          skew kurtosis   se
## CRIM     5.19    36.60 0.38
## ZN       2.21     3.95 1.04
## INDUS    0.29    -1.24 0.30
## CHAS     3.39     9.48 0.01
## NOX      0.72    -0.09 0.01
## RM       0.40     1.84 0.03
## AGE     -0.60    -0.98 1.25
## DIS      1.01     0.46 0.09
## RAD      1.00    -0.88 0.39
## TAX      0.67    -1.15 7.49
## PTRATIO -0.80    -0.30 0.10
## B       -2.87     7.10 4.06
## LSTAT    0.90     0.46 0.32
## MEDV     1.10     1.45 0.41

One-way contingency tables for the categorical variables in the dataset

table(house.df$CHAS)

## 
##   0   1 
## 471  35
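The counts above can be turned into proportions as a cross-check against the summary statistics: the share of tracts with CHAS = 1 should equal the mean of CHAS reported earlier (0.06917). A minimal sketch, with the counts typed in by hand from the table (the name chas_counts is illustrative):

```r
# Counts copied from table(house.df$CHAS)
chas_counts <- c("0" = 471, "1" = 35)

# Convert counts to shares of the 506 tracts
chas_share <- prop.table(chas_counts)
print(round(chas_share, 5))
```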

table(house.df$RAD)

## 
##   1   2   3   4   5   6   7   8  24 
##  20  24  38 110 115  26  17  24 132

table(house.df$ZN)

## 
##    0 12.5 17.5   18   20   21   22   25   28   30   33   34   35   40   45 
##  372   10    1    1   21    4   10   10    3    6    4    3    3    7    6 
## 52.5   55   60   70   75   80 82.5   85   90   95  100 
##    3    3    4    3    3   15    2    2    5    4    1

table(house.df$TAX)

## 
## 187 188 193 198 216 222 223 224 226 233 241 242 243 244 245 247 252 254 
##   1   7   8   1   5   7   5  10   1   9   1   2   4   1   3   4   2   5 
## 255 256 264 265 270 273 276 277 279 280 281 284 285 287 289 293 296 300 
##   1   1  12   2   7   5   9  11   4   1   4   7   1   8   5   3   8   7 
## 304 305 307 311 313 315 329 330 334 335 337 345 348 351 352 358 370 384 
##  14   4  40   7   1   2   6  10   2   2   2   3   2   1   2   3   2  11 
## 391 398 402 403 411 422 430 432 437 469 666 711 
##   8  12   2  30   2   1   3   9  15   1 132   5

Two-way contingency tables for the categorical variables in the dataset

table(house.df$CHAS,house.df$RAD)

##    
##       1   2   3   4   5   6   7   8  24
##   0  19  24  36 102 104  26  17  19 124
##   1   1   0   2   8  11   0   0   5   8

table(house.df$CHAS,house.df$ZN)

##    
##       0 12.5 17.5  18  20  21  22  25  28  30  33  34  35  40  45 52.5  55
##   0 344   10    1   1  18   4  10  10   3   6   4   3   3   4   6    3   3
##   1  28    0    0   0   3   0   0   0   0   0   0   0   0   3   0    0   0
##    
##      60  70  75  80 82.5  85  90  95 100
##   0   4   3   3  15    2   2   4   4   1
##   1   0   0   0   0    0   0   1   0   0

table(house.df$RAD, house.df$ZN)

##     
##        0 12.5 17.5  18  20  21  22  25  28  30  33  34  35  40  45 52.5
##   1    6    0    0   1   0   0   0   0   0   0   0   0   3   2   0    0
##   2   18    0    0   0   0   0   0   0   0   0   0   0   0   0   0    0
##   3   26    0    1   0   5   0   0   0   0   0   0   0   0   0   0    0
##   4   77    3    0   0   0   4   0   4   3   0   0   0   0   5   0    0
##   5   78    7    0   0  16   0   0   0   0   0   0   0   0   0   6    0
##   6   17    0    0   0   0   0   0   0   0   6   0   0   0   0   0    3
##   7    0    0    0   0   0   0  10   0   0   0   4   3   0   0   0    0
##   8   18    0    0   0   0   0   0   6   0   0   0   0   0   0   0    0
##   24 132    0    0   0   0   0   0   0   0   0   0   0   0   0   0    0
##     
##       55  60  70  75  80 82.5  85  90  95 100
##   1    1   2   0   0   3    0   0   2   0   0
##   2    0   0   0   0   3    2   1   0   0   0
##   3    0   0   0   3   0    0   0   1   2   0
##   4    0   2   0   0   9    0   1   0   2   0
##   5    2   0   3   0   0    0   0   2   0   1
##   6    0   0   0   0   0    0   0   0   0   0
##   7    0   0   0   0   0    0   0   0   0   0
##   8    0   0   0   0   0    0   0   0   0   0
##   24   0   0   0   0   0    0   0   0   0   0

Boxplot of the variables that belong to the study

boxplot(house.df$CRIM,main="per capita crime rate by town",xlab="CRIM"
        , horizontal = T, col="light blue")

boxplot(house.df$INDUS,main="proportion of non-retail business acres per town",
        xlab="INDUS", col="light blue")

boxplot(house.df$NOX, main="nitric oxides concentration (parts per 10 million)",
        xlab="NOX", col="light blue")

boxplot(house.df$RM, main="average number of rooms per dwelling", xlab="RM",
        col="light blue")

boxplot(house.df$DIS, main="weighted distances to five employment centres",
        xlab="DIS", col="light blue")

boxplot(house.df$TAX, main="full-value property-tax rate per $10,000",
        xlab="TAX", col="light blue")

boxplot(house.df$PTRATIO, main="pupil-teacher ratio by town ",
        xlab="PTRATIO", col="light blue")

boxplot(house.df$LSTAT, main="lower status of the population(%)",
        xlab="LSTAT", col="light blue")

boxplot(house.df$MEDV, main="Median value of owner-occupied homes in $1000's",
        xlab="MEDV", col="light blue")

Histograms for suitable data fields

library(lattice)
histogram(house.df$ZN, main="proportion of residential land zoned\nfor lots over 25,000 sq.ft",
          xlab="ZN", col="maroon")

charles=factor(house.df$CHAS, levels=c(1,0), labels=c("tract bounds river","Otherwise"))
histogram(charles,col="maroon", main="Charles River dummy variable")

histogram(house.df$AGE, main="proportion of owner-occupied units built prior to 1940",
          xlab="AGE", col="maroon")

histogram(house.df$RAD, main="index of accessibility to radial highways",
          xlab="RAD", col="maroon")

Suitable plot for data fields

plot(house.df$MEDV, house.df$CRIM, main="plot of CRIM v/s MEDV",
     ylab="per capita crime rate by town",
     xlab="Median value of owner-occupied homes in $1000's")

plot(house.df$MEDV, house.df$INDUS, main="plot of INDUS v/s MEDV",
     xlab="Median value of owner-occupied homes in $1000's",
     ylab="proportion of non-retail business acres per town")

plot(house.df$MEDV, house.df$TAX, main="plot of TAX v/s MEDV",
     xlab="Median value of owner-occupied homes in $1000's",
     ylab="full-value property-tax rate per $10,000")

plot(house.df$MEDV, house.df$RAD, main="plot of RAD v/s MEDV",
     xlab="Median value of owner-occupied homes in $1000's",
     ylab="Index of accessibility to radial highways")

Correlation Matrix (numeric type)

hnum<-house.df[,c(1,2,3,5,6,7,8,11,12,13,14)]
cor(hnum)

##               CRIM         ZN      INDUS        NOX         RM        AGE
## CRIM     1.0000000 -0.2004692  0.4065834  0.4209717 -0.2192467  0.3527343
## ZN      -0.2004692  1.0000000 -0.5338282 -0.5166037  0.3119906 -0.5695373
## INDUS    0.4065834 -0.5338282  1.0000000  0.7636514 -0.3916759  0.6447785
## NOX      0.4209717 -0.5166037  0.7636514  1.0000000 -0.3021882  0.7314701
## RM      -0.2192467  0.3119906 -0.3916759 -0.3021882  1.0000000 -0.2402649
## AGE      0.3527343 -0.5695373  0.6447785  0.7314701 -0.2402649  1.0000000
## DIS     -0.3796701  0.6644082 -0.7080270 -0.7692301  0.2052462 -0.7478805
## PTRATIO  0.2899456 -0.3916785  0.3832476  0.1889327 -0.3555015  0.2615150
## B       -0.3850639  0.1755203 -0.3569765 -0.3800506  0.1280686 -0.2735340
## LSTAT    0.4556215 -0.4129946  0.6037997  0.5908789 -0.6138083  0.6023385
## MEDV    -0.3883046  0.3604453 -0.4837252 -0.4273208  0.6953599 -0.3769546
##                DIS    PTRATIO          B      LSTAT       MEDV
## CRIM    -0.3796701  0.2899456 -0.3850639  0.4556215 -0.3883046
## ZN       0.6644082 -0.3916785  0.1755203 -0.4129946  0.3604453
## INDUS   -0.7080270  0.3832476 -0.3569765  0.6037997 -0.4837252
## NOX     -0.7692301  0.1889327 -0.3800506  0.5908789 -0.4273208
## RM       0.2052462 -0.3555015  0.1280686 -0.6138083  0.6953599
## AGE     -0.7478805  0.2615150 -0.2735340  0.6023385 -0.3769546
## DIS      1.0000000 -0.2324705  0.2915117 -0.4969958  0.2499287
## PTRATIO -0.2324705  1.0000000 -0.1773833  0.3740443 -0.5077867
## B        0.2915117 -0.1773833  1.0000000 -0.3660869  0.3334608
## LSTAT   -0.4969958  0.3740443 -0.3660869  1.0000000 -0.7376627
## MEDV     0.2499287 -0.5077867  0.3334608 -0.7376627  1.0000000
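To see at a glance which variables are most strongly associated with MEDV, the MEDV row of the matrix can be sorted by absolute value. The sketch below copies that row by hand; r_medv and ranked are illustrative names, and on the loaded data the same vector could presumably be obtained from cor(hnum)["MEDV", ] with the MEDV entry dropped.

```r
# Correlations with MEDV, copied from the MEDV row of the matrix above.
r_medv <- c(CRIM = -0.3883046, ZN = 0.3604453, INDUS = -0.4837252,
            NOX = -0.4273208, RM = 0.6953599, AGE = -0.3769546,
            DIS = 0.2499287, PTRATIO = -0.5077867, B = 0.3334608,
            LSTAT = -0.7376627)

# Order by strength of linear association, ignoring the sign.
ranked <- r_medv[order(abs(r_medv), decreasing = TRUE)]
print(round(ranked, 2))
```

With these values, LSTAT, RM and PTRATIO show the three strongest linear associations with MEDV, and DIS the weakest.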

Correlation matrix using corrgram

library(corrgram)
corrgram(house.df, order=FALSE, lower.panel=panel.shade, upper.panel=panel.pie, 
         text.panel=panel.txt, main="Corrgram of housing dataset")

corrgram(hnum, order=FALSE, lower.panel=panel.shade, upper.panel=panel.pie, 
         text.panel=panel.txt, main="Corrgram of housing dataset (numeric type)")

Scatter plot matrix for your data set

library(car)

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

scatterplotMatrix(~MEDV+CRIM+INDUS+RAD+TAX, data=house.df, main="Scatterplot matrix Median value of owner-occupied homes in $1000's v/s other factors")

Suitable test to check hypothesis for suitable assumptions

Null Hypothesis 1: There is no relationship between CRIM (per capita crime rate by town) and MEDV (Median value of owner-occupied homes in $1000’s)

cor.test(house.df$CRIM,house.df$MEDV)

## 
##  Pearson's product-moment correlation
## 
## data:  house.df$CRIM and house.df$MEDV
## t = -9.4597, df = 504, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4599064 -0.3116859
## sample estimates:
##        cor 
## -0.3883046

The p-value is less than 0.05, so we reject the null hypothesis: there is a significant negative correlation (r ≈ -0.39) between CRIM and MEDV.
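The t statistic reported by cor.test follows from the sample correlation through t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom. The sketch below recomputes it from the values printed above for CRIM and MEDV; the variable names are illustrative.

```r
r <- -0.3883046   # sample correlation between CRIM and MEDV
n <- 506          # number of observations

# Test statistic for H0: true correlation equals 0
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(round(t_stat, 4))
print(p_value)
```

The result should agree with the reported t = -9.4597 up to rounding of r.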

Null Hypothesis 2: There is no relationship between CHAS (Charles River dummy variable) and MEDV (Median value of owner-occupied homes in $1000’s)

cor.test(house.df$CHAS, house.df$MEDV)

## 
##  Pearson's product-moment correlation
## 
## data:  house.df$CHAS and house.df$MEDV
## t = 3.9964, df = 504, p-value = 7.391e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.08945816 0.25848001
## sample estimates:
##       cor 
## 0.1752602

The p-value (7.391e-05) is less than 0.05, so we reject the null hypothesis: there is a significant, though weak, positive correlation (r ≈ 0.18) between CHAS and MEDV.

Null Hypothesis 3: There is no relationship between TAX (full-value property-tax rate per $10,000) and MEDV (Median value of owner-occupied homes in $1000’s)

cor.test(house.df$TAX, house.df$MEDV)

## 
##  Pearson's product-moment correlation
## 
## data:  house.df$TAX and house.df$MEDV
## t = -11.906, df = 504, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5338993 -0.3976061
## sample estimates:
##        cor 
## -0.4685359

The p-value is less than 0.05, hence, we reject the null hypothesis and establish that there is a significant relationship between TAX and MEDV.

Null Hypothesis 4: There is no relationship between CRIM (per capita crime rate by town) and RAD (index of accessibility to radial highways)

cor.test(house.df$CRIM,house.df$RAD)

## 
##  Pearson's product-moment correlation
## 
## data:  house.df$CRIM and house.df$RAD
## t = 17.998, df = 504, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5693817 0.6758248
## sample estimates:
##       cor 
## 0.6255051

The p-value is less than 0.05, hence, we reject the null hypothesis and establish that there is a significant relationship between CRIM and RAD.

t-test to analyse your hypothesis

Null Hypothesis 5: There is no relationship between AGE (proportion of owner-occupied units built prior to 1940) and TAX (full-value property-tax rate per $10,000)

t.test(house.df$AGE, house.df$TAX)

## 
##  Welch Two Sample t-test
## 
## data:  house.df$AGE and house.df$TAX
## t = -44.715, df = 533.15, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -354.5843 -324.7402
## sample estimates:
## mean of x mean of y 
##   68.5749  408.2372

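The p-value is less than 0.05, so the t-test rejects the hypothesis that AGE and TAX have equal means (68.57 versus 408.24). Because a two-sample t-test compares means, this result does not by itself show whether AGE and TAX are related. The two variables are also measured on different scales, so a correlation test such as cor.test is the appropriate check for the stated hypothesis.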
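A two-sample t-test compares the means of two variables, whereas a correlation measures how they move together, and the two can disagree completely. The sketch below illustrates this on small made-up vectors (x, y and z are hypothetical and unrelated to the housing data).

```r
# Toy data (hypothetical): x is symmetric around zero.
x <- rep(-2:2, times = 10)
y <- x            # perfectly correlated with x, identical mean
z <- x^2 + 100    # linearly uncorrelated with x, very different mean

p_same <- t.test(x, y)$p.value  # equal means: no evidence of a difference
r_same <- cor(x, y)             # yet the correlation is perfect

p_diff <- t.test(x, z)$p.value  # very different means: tiny p-value
r_diff <- cor(x, z)             # yet the linear correlation is zero

print(c(p_same = p_same, r_same = r_same, p_diff = p_diff, r_diff = r_diff))
```

A small t-test p-value therefore says nothing about the strength of association, and a large one does not rule association out.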

Null Hypothesis 6: There is no relationship between DIS (weighted distances to five employment centers) and MEDV (Median value of owner-occupied homes in $1000’s)

t.test(house.df$DIS, house.df$MEDV)

## 
##  Welch Two Sample t-test
## 
## data:  house.df$DIS and house.df$MEDV
## t = -44.673, df = 557.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -19.56164 -17.91389
## sample estimates:
## mean of x mean of y 
##  3.795043 22.532806

The p-value is less than 0.05, so the t-test rejects the hypothesis that DIS and MEDV have equal means (3.80 versus 22.53). This says nothing about whether the two variables are related; the correlation matrix above gives r ≈ 0.25 between DIS and MEDV.

Further Analysis

Hypothesis that you could test using a Regression Model

HYPOTHESIS 1: If the house’s tract bounds the river, then the median value of the house is affected.

scatterplot(house.df$CHAS, house.df$MEDV, ylab="Median value of owner-occupied homes in $1000's",
            xlab="Charles River dummy variable")

Checking Hypothesis 1 using t-test.

t.test(house.df$CHAS,house.df$MEDV)

## 
##  Welch Two Sample t-test
## 
## data:  house.df$CHAS and house.df$MEDV
## t = -54.921, df = 505.77, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -23.26722 -21.66005
## sample estimates:
##   mean of x   mean of y 
##  0.06916996 22.53280632

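The p-value is less than 0.05, but this test only shows that the mean of the 0/1 CHAS dummy (0.069) differs from the mean of MEDV (22.53). It does not compare home values between tracts that bound the river and tracts that do not, so it cannot establish that CHAS and MEDV are correlated.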
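To compare home values between the two kinds of tract, MEDV can be split by CHAS using the formula interface of t.test. The sketch below uses a small made-up data frame (toy.df and its values are hypothetical, not the housing data); on the real data the analogous call would presumably be t.test(MEDV ~ CHAS, data = house.df).

```r
# Hypothetical toy data: six tracts away from the river, six bounding it.
toy.df <- data.frame(
  CHAS = rep(c(0, 1), each = 6),
  MEDV = c(20, 21, 22, 19, 23, 21, 27, 29, 28, 30, 26, 28)
)

# Welch two-sample t-test of MEDV between the CHAS = 0 and CHAS = 1 groups
res <- t.test(MEDV ~ CHAS, data = toy.df)

print(res$estimate)  # mean MEDV in each group
print(res$p.value)   # evidence against equal group means
```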

Checking Hypothesis 1 using regression model.

fit1<- lm(CHAS~MEDV, data=house.df)
summary(fit1)

## 
## Call:
## lm(formula = CHAS ~ MEDV, data = house.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.20211 -0.07869 -0.05860 -0.03223  0.97503 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.039891   0.029471  -1.354    0.176    
## MEDV         0.004840   0.001211   3.996 7.39e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2503 on 504 degrees of freedom
## Multiple R-squared:  0.03072,    Adjusted R-squared:  0.02879 
## F-statistic: 15.97 on 1 and 504 DF,  p-value: 7.391e-05

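The p-value is less than 0.05, confirming that CHAS and MEDV are associated. Note that the model above regresses CHAS on MEDV rather than MEDV on CHAS; the slope's p-value (7.39e-05) matches the cor.test result for Null Hypothesis 2, and its positive sign indicates that tracts bounding the river tend to have higher median home values.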
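To estimate how much MEDV differs between the two kinds of tract, MEDV would be the response and CHAS the predictor. With a 0/1 predictor, the intercept is the mean of the CHAS = 0 group and the slope is the difference between the group means. The sketch below shows this on made-up data (toy.df and fit_toy are hypothetical names and values, not the housing data).

```r
# Hypothetical toy data: six tracts away from the river, six bounding it.
toy.df <- data.frame(
  CHAS = rep(c(0, 1), each = 6),
  MEDV = c(20, 21, 22, 19, 23, 21, 27, 29, 28, 30, 26, 28)
)

# Regress the outcome (MEDV) on the dummy predictor (CHAS)
fit_toy <- lm(MEDV ~ CHAS, data = toy.df)

# Group means for comparison with the fitted coefficients
group_means <- tapply(toy.df$MEDV, toy.df$CHAS, mean)

print(coef(fit_toy))   # intercept and slope
print(group_means)     # mean MEDV for CHAS = 0 and CHAS = 1
```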

HYPOTHESIS 2: Accessibility to radial highways affects the median value of houses.

scatterplot(house.df$RAD, house.df$MEDV, ylab="Median value of owner-occupied homes in $1000's",xlab="index of accessibility to radial highways")

Checking hypothesis 2 using t-test.

t.test(house.df$RAD, house.df$MEDV)

## 
##  Welch Two Sample t-test
## 
## data:  house.df$RAD and house.df$MEDV
## t = -23.06, df = 1007, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -14.08824 -11.87855
## sample estimates:
## mean of x mean of y 
##  9.549407 22.532806

The p-value is less than 0.05, but this only shows that the mean of RAD (9.55) differs from the mean of MEDV (22.53). A two-sample t-test does not measure association, so it cannot show whether highway accessibility raises or lowers median home values.

Checking hypothesis 2 using regression model

fit2<-lm(RAD~MEDV, data=house.df)
summary(fit2)

## 
## Call:
## lm(formula = RAD ~ MEDV, data = house.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.391  -5.862  -3.658   8.893  24.375 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 17.69052    0.94853  18.650   <2e-16 ***
## MEDV        -0.36130    0.03898  -9.269   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.056 on 504 degrees of freedom
## Multiple R-squared:  0.1456, Adjusted R-squared:  0.1439 
## F-statistic: 85.91 on 1 and 504 DF,  p-value: < 2.2e-16

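The p-value is less than 0.05, indicating that RAD and MEDV are associated. The negative slope shows that a higher accessibility index goes with a lower median value of owner-occupied homes. However, the model regresses RAD on MEDV and explains only about 15% of the variance (R-squared = 0.1456), so it does not establish that highway accessibility causes the decrease.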

Linear Regression

lr.df<-house.df[,c(1,2,3,4,5,6,7,8,9,10,11,13,14)]
corrgram(lr.df, order=FALSE, lower.panel=panel.shade, upper.panel=panel.pie, 
         text.panel=panel.txt, main="Corrgram of housing dataset (linear regression)")

fit3<- lm(MEDV~CRIM+ZN+INDUS+CHAS+NOX+RM+AGE+DIS+RAD+TAX+PTRATIO+
     LSTAT,data=house.df)
summary(fit3)

## 
## Call:
## lm(formula = MEDV ~ CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + 
##     DIS + RAD + TAX + PTRATIO + LSTAT, data = house.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.1304  -2.7673  -0.5814   1.9414  26.2526 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  41.617270   4.936039   8.431 3.79e-16 ***
## CRIM         -0.121389   0.033000  -3.678 0.000261 ***
## ZN            0.046963   0.013879   3.384 0.000772 ***
## INDUS         0.013468   0.062145   0.217 0.828520    
## CHAS          2.839993   0.870007   3.264 0.001173 ** 
## NOX         -18.758022   3.851355  -4.870 1.50e-06 ***
## RM            3.658119   0.420246   8.705  < 2e-16 ***
## AGE           0.003611   0.013329   0.271 0.786595    
## DIS          -1.490754   0.201623  -7.394 6.17e-13 ***
## RAD           0.289405   0.066908   4.325 1.84e-05 ***
## TAX          -0.012682   0.003801  -3.337 0.000912 ***
## PTRATIO      -0.937533   0.132206  -7.091 4.63e-12 ***
## LSTAT        -0.552019   0.050659 -10.897  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.798 on 493 degrees of freedom
## Multiple R-squared:  0.7343, Adjusted R-squared:  0.7278 
## F-statistic: 113.5 on 12 and 493 DF,  p-value: < 2.2e-16

Insight:

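The linear regression above is statistically significant overall: the F-test p-value is less than 0.05 (p < 2.2e-16), and the model explains about 73% of the variance in MEDV (R-squared = 0.7343).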

From the coefficient estimates, holding the other predictors fixed, we can conclude that:

CRIM (per capita crime rate by town), NOX (nitric oxides concentration), DIS (weighted distances to five employment centers), TAX (full-value property-tax rate per $10,000), PTRATIO (pupil-teacher ratio by town) and LSTAT (% lower status of the population) have a significant negative effect on MEDV (median value of owner-occupied homes in $1000s).
INDUS (proportion of non-retail business acres per town) and AGE (proportion of owner-occupied units built prior to 1940) do not significantly influence MEDV (p-values of about 0.83 and 0.79).
ZN (proportion of residential land zoned for lots over 25,000 sq.ft.), CHAS (Charles River dummy variable), RM (average number of rooms per dwelling) and RAD (index of accessibility to radial highways) have a significant positive effect on MEDV.
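The adjusted R-squared and F-statistic printed by summary(fit3) can be cross-checked from the multiple R-squared, the number of observations and the number of predictors. The sketch below uses the rounded values from the output above, so the results should match only up to rounding; the variable names are illustrative.

```r
r2 <- 0.7343   # multiple R-squared from summary(fit3)
n  <- 506      # observations
p  <- 12       # predictors in fit3 (intercept excluded)

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic on p and n - p - 1 degrees of freedom
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

print(round(c(adj_r2 = adj_r2, f_stat = f_stat), 4))
```

The values should come out close to the reported 0.7278 and 113.5 on 12 and 493 degrees of freedom.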

Housing Data Analysis

Shreya Singireddy

26 January 2018