Name: Anshuman Raina

Email: anshuman.raina@gmail.com

College: MAHARAJA AGARASEN INSTITUTE OF TECHNOLOGY

Capstone Project

This is a capstone project analysing the Boston Housing dataset to identify the factors that people take into consideration when moving to Boston.

Reading The DataSet

data.df<-read.csv("G:/R Intern/Capstone/bostonhousing.csv")
View(data.df)
dim(data.df)
## [1] 506  14

First, we read the dataset and store it in a data frame (data.df). Then we apply dim() to find its dimensions: 506 rows and 14 columns.
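
The same read-and-check step can be sketched in a self-contained way. The file below is a temporary two-row stand-in written only for this illustration (toy_path and toy.df are hypothetical names, not part of the project), and the missing-value count is an extra check that the report itself does not perform.

```r
# Write a tiny stand-in CSV to a temporary file (illustration only).
toy_path <- tempfile(fileext = ".csv")
write.csv(data.frame(CRIM = c(0.00632, 0.02731), MV = c(24, 21.6)),
          toy_path, row.names = FALSE)

# Read it back, then check its shape and count missing values per column.
toy.df <- read.csv(toy_path)
toy_dim <- dim(toy.df)
toy_missing <- colSums(is.na(toy.df))
print(toy_dim)
print(toy_missing)
```

Applied to the real file, the same two checks would confirm the 506 x 14 shape and reveal any columns with missing entries.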

Creating Descriptive Analysis

We first describe what each column means. The data frame contains the following columns:

1.crim: per capita crime rate by town.

2.zn: proportion of residential land zoned for lots over 25,000 sq.ft.

3.indus: proportion of non-retail business acres per town.

4.chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

5.nox: nitrogen oxides concentration (parts per 10 million).

6.rm: average number of rooms per dwelling.

7.age: proportion of owner-occupied units built prior to 1940.

8.dis: weighted mean of distances to five Boston employment centres.

9.rad: index of accessibility to radial highways.

10.tax: full-value property-tax rate per $10,000.

11.pt: pupil-teacher ratio by town.

12.b: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

13.lstat: lower status of the population (percent).

14.mv: median value of owner-occupied homes in $1000s.

Here, we do the descriptive analysis, i.e. the mean, median, quartiles, etc. of the various columns.

str(data.df)
## 'data.frame':    506 obs. of  14 variables:
##  $ CRIM : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ ZN   : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ INDUS: num  2.31 7.07 7.07 2.18 2.18 ...
##  $ CHAS : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ NOX  : num  0.538 0.469 0.469 0.458 0.458 ...
##  $ RM   : num  6.57 6.42 7.18 7 7.15 ...
##  $ AGE  : num  65.2 78.9 61.1 45.8 54.2 ...
##  $ DIS  : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ RAD  : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ TAX  : int  296 242 242 222 222 222 311 311 311 311 ...
##  $ PT   : num  15.3 17.8 17.8 18.7 18.7 ...
##  $ B    : num  397 397 393 395 397 ...
##  $ LSTAT: num  4.98 9.14 4.03 2.94 5.33 ...
##  $ MV   : num  24 21.6 34.7 33.4 36.2 ...
summary(data.df)
##       CRIM                ZN             INDUS            CHAS        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       NOX               RM             AGE              DIS        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       RAD              TAX              PT              B         
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      LSTAT             MV       
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Column-wise description:

library(psych)
describe(data.df)
##       vars   n   mean     sd median trimmed    mad    min    max  range
## CRIM     1 506   3.61   8.60   0.26    1.68   0.33   0.01  88.98  88.97
## ZN       2 506  11.36  23.32   0.00    5.08   0.00   0.00 100.00 100.00
## INDUS    3 506  11.14   6.86   9.69   10.93   9.37   0.46  27.74  27.28
## CHAS     4 506   0.07   0.25   0.00    0.00   0.00   0.00   1.00   1.00
## NOX      5 506   0.55   0.12   0.54    0.55   0.13   0.38   0.87   0.49
## RM       6 506   6.28   0.70   6.21    6.25   0.51   3.56   8.78   5.22
## AGE      7 506  68.57  28.15  77.50   71.20  28.98   2.90 100.00  97.10
## DIS      8 506   3.80   2.11   3.21    3.54   1.91   1.13  12.13  11.00
## RAD      9 506   9.55   8.71   5.00    8.73   2.97   1.00  24.00  23.00
## TAX     10 506 408.24 168.54 330.00  400.04 108.23 187.00 711.00 524.00
## PT      11 506  18.46   2.16  19.05   18.66   1.70  12.60  22.00   9.40
## B       12 506 356.67  91.29 391.44  383.17   8.09   0.32 396.90 396.58
## LSTAT   13 506  12.65   7.14  11.36   11.90   7.11   1.73  37.97  36.24
## MV      14 506  22.53   9.20  21.20   21.56   5.93   5.00  50.00  45.00
##        skew kurtosis   se
## CRIM   5.19    36.60 0.38
## ZN     2.21     3.95 1.04
## INDUS  0.29    -1.24 0.30
## CHAS   3.39     9.48 0.01
## NOX    0.72    -0.09 0.01
## RM     0.40     1.84 0.03
## AGE   -0.60    -0.98 1.25
## DIS    1.01     0.46 0.09
## RAD    1.00    -0.88 0.39
## TAX    0.67    -1.15 7.49
## PT    -0.80    -0.30 0.10
## B     -2.87     7.10 4.06
## LSTAT  0.90     0.46 0.32
## MV     1.10     1.45 0.41

Thus we get the mean, median, standard deviation, skewness, kurtosis and other statistics for each column.
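
Neither summary() nor describe() reports a mode. A minimal sketch of one way to compute it is given below; stat_mode is a hypothetical helper name, and rad_example is a toy vector rather than the real RAD column.

```r
# Hypothetical helper: return the most frequent value(s) of a vector.
stat_mode <- function(x) {
  ux <- unique(x)                      # distinct values, in order of appearance
  counts <- tabulate(match(x, ux))     # how often each distinct value occurs
  ux[counts == max(counts)]            # keep every value tied for the top count
}

rad_example <- c(1, 2, 2, 3, 24, 24, 24)  # toy values, not the real data
rad_mode <- stat_mode(rad_example)
print(rad_mode)
```

On the real data the call would be stat_mode(data.df$RAD), and similarly for the other columns.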

Create one-way contingency tables for the categorical variables in your dataset.

To do this, we must first identify the categorical variables.

The first is chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

This is a dummy variable that takes only the values 0 and 1.

mytab<-table(data.df$CHAS)
mytab
## 
##   0   1 
## 471  35

So we see that 471 tracts do not bound the river and only 35 do.
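
The counts can also be expressed as percentages. The sketch below re-enters the two counts printed above by hand (chas_counts is a hypothetical name) and uses prop.table() to convert them to shares.

```r
# Counts copied from the one-way table of CHAS shown above.
chas_counts <- c("0" = 471, "1" = 35)

# prop.table() divides each count by the total; scale to percent and round.
chas_pct <- round(100 * prop.table(chas_counts), 1)
print(chas_pct)
```

On the real data, prop.table(table(data.df$CHAS)) would give the same shares directly.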

mytab<-table(data.df$ZN)
mytab
## 
##    0 12.5 17.5   18   20   21   22   25   28   30   33   34   35   40   45 
##  372   10    1    1   21    4   10   10    3    6    4    3    3    7    6 
## 52.5   55   60   70   75   80 82.5   85   90   95  100 
##    3    3    4    3    3   15    2    2    5    4    1

ZN is a numeric proportion rather than a true categorical variable, so its one-way table is long; 372 of the 506 tracts have ZN = 0.

mytab<-table(data.df$INDUS)
mytab
## 
## 0.460000008  0.74000001 1.210000038 1.220000029        1.25 1.320000052 
##           1           1           1           1           2           1 
## 1.379999995 1.470000029 1.519999981 1.690000057  1.75999999 1.889999986 
##           1           2           4           2           1           1 
## 1.909999967  2.00999999 2.019999981 2.029999971 2.180000067  2.24000001 
##           2           1           1           2           7           3 
##        2.25 2.309999943 2.460000038 2.680000067 2.890000105 2.930000067 
##           1           1           8           2           5           2 
## 2.950000048 2.970000029  3.24000001 3.329999924 3.369999886 3.410000086 
##           2           1           3           4           2           4 
## 3.440000057 3.640000105        3.75 3.779999971 3.970000029           4 
##           6           2           1           2          12           1 
## 4.050000191 4.150000095 4.389999866 4.489999771 4.860000134 4.929999828 
##           7           1           2           4           4           6 
## 4.949999809 5.130000114 5.190000057 5.320000172 5.639999866 5.860000134 
##           3           6           8           3           4          10 
## 5.960000038 6.059999943 6.070000172 6.090000153 6.199999809 6.409999847 
##           4           2           3           3          18           5 
## 6.909999847 6.960000038 7.070000172 7.380000114 7.869999886 8.140000343 
##           9           5           2           8           7          22 
##  8.56000042  9.68999958 9.899999619 10.01000023 10.59000015 10.81000042 
##          11           8          12           9          11           4 
## 11.93000031 12.82999992 13.89000034 13.92000008 15.03999996 18.10000038 
##           5           6           4           5           3         132 
## 19.57999992 21.88999939 25.64999962 27.73999977 
##          30          15           7           5

Like ZN, INDUS is numeric, so its one-way table is long, with many repeated values; the most frequent value, 18.1, occurs 132 times.

mytab<-table(data.df$NOX)
mytab
## 
##  0.38499999 0.388999999  0.39199999 0.393999994 0.398000002 0.400000006 
##           1           1           2           1           2           4 
## 0.400999993 0.402999997 0.404000014 0.405000001 0.409000009 0.409999996 
##           3           3           3           3           3           3 
## 0.411000013 0.412999988 0.414999992 0.416099995 0.421999991 0.425999999 
##           6           6           2           3           1           4 
## 0.428000003  0.42899999 0.430999994 0.432999998 0.435000002 0.437000006 
##           8           3          10           3           1          17 
## 0.437900007  0.43900001 0.442000002 0.442900002 0.444999993 0.446999997 
##           2           4           3           4           5           5 
## 0.448000014 0.449000001 0.453000009 0.458000004 0.460000008 0.463999987 
##           9           4           6           3           3           8 
## 0.469000012 0.472000003 0.483999997 0.488000005 0.488999993 0.493000001 
##           2           4           2           8          15           8 
## 0.499000013 0.504000008 0.507000029  0.50999999 0.514999986 0.518000007 
##           4           8          10           7           8           1 
## 0.519999981 0.523999989 0.532000005 0.537999988  0.54400003 0.546999991 
##          11           7           5          23          12           9 
## 0.550000012 0.573000014 0.574999988 0.579999983  0.58099997 0.583000004 
##           4           5           2           4           7           4 
## 0.583999991 0.584999979 0.597000003 0.605000019 0.609000027 0.614000022 
##           8           8           6          14           5           7 
## 0.624000013 0.630999982 0.647000015 0.654999971 0.658999979 0.667999983 
##          15           5          10           3           2           3 
## 0.671000004  0.67900002 0.693000019 0.699999988       0.713 0.717999995 
##           7           8          14          11          18           6 
##  0.74000001 0.769999981 0.870999992 
##          13           8          16

NOX is also numeric with many repeated values, which again gives a long one-way table.

mytab<-table(data.df$PT)
mytab
## 
## 12.60000038          13 13.60000038 14.39999962 14.69999981 14.80000019 
##           3          12           1           1          34           3 
## 14.89999962 15.10000038 15.19999981 15.30000019        15.5 15.60000038 
##           4           1          13           3           1           2 
## 15.89999962          16 16.10000038 16.39999962 16.60000038 16.79999924 
##           2           5           5           6          16           4 
## 16.89999962          17 17.29999924 17.39999962 17.60000038 17.79999924 
##           5           4           1          18           7          23 
## 17.89999962          18 18.20000076 18.29999924 18.39999962        18.5 
##          11           5           4           4          16           4 
## 18.60000038 18.70000076 18.79999924 18.89999962          19 19.10000038 
##          17           9           2           3           4          17 
## 19.20000076 19.60000038 19.70000076 20.10000038 20.20000076 20.89999962 
##          19           8           8           5         140          11 
##          21 21.10000038 21.20000076          22 
##          27           1          15           2

PT likewise gives a long one-way table; the most common ratio, 20.2, occurs 140 times.

mytab<-table(data.df$TAX)
mytab
## 
## 187 188 193 198 216 222 223 224 226 233 241 242 243 244 245 247 252 254 
##   1   7   8   1   5   7   5  10   1   9   1   2   4   1   3   4   2   5 
## 255 256 264 265 270 273 276 277 279 280 281 284 285 287 289 293 296 300 
##   1   1  12   2   7   5   9  11   4   1   4   7   1   8   5   3   8   7 
## 304 305 307 311 313 315 329 330 334 335 337 345 348 351 352 358 370 384 
##  14   4  40   7   1   2   6  10   2   2   2   3   2   1   2   3   2  11 
## 391 398 402 403 411 422 430 432 437 469 666 711 
##   8  12   2  30   2   1   3   9  15   1 132   5

TAX also creates a long one-way table; the value 666 occurs 132 times.

mytab<-table(data.df$RAD)
mytab
## 
##   1   2   3   4   5   6   7   8  24 
##  20  24  38 110 115  26  17  24 132

RAD takes only nine distinct values, so its one-way table is compact; the index value 24 occurs 132 times.
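
Because ZN, INDUS, NOX, PT and TAX are numeric, one option for shortening their tables is to bin the values before tabulating. The sketch below is only an illustration: tax_example is a toy vector of a few TAX values taken from the table above, and the break points are arbitrary assumptions, not choices made in this study.

```r
# A few TAX values taken from the one-way table above (toy sample).
tax_example <- c(187, 222, 296, 311, 330, 403, 666, 711)

# cut() assigns each value to an interval; table() then counts per interval.
tax_bins <- table(cut(tax_example, breaks = c(0, 300, 400, 500, 800)))
print(tax_bins)
```

With the full column, table(cut(data.df$TAX, breaks = ...)) would give a compact table in place of the 66-entry one.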

Create two-way contingency tables for the categorical variables in your dataset.

Two-way tables are built by crossing the variables tabulated above. To summarize, one-way tables were created for the following variables (some of them too long to be practical):

CHAS, ZN, INDUS, NOX, PT, TAX, RAD

Creating Two Way Tables:

With 7 variables there are 7C2 = 21 possible pairs. Eliminating those that would produce very large tables, we get:

mytab<-xtabs(~CHAS+RAD,data=data.df)
mytab
##     RAD
## CHAS   1   2   3   4   5   6   7   8  24
##    0  19  24  36 102 104  26  17  19 124
##    1   1   0   2   8  11   0   0   5   8

This two-way table shows, for example, that none of the tracts with RAD equal to 2, 6 or 7 bound the river.

mytab<-xtabs(~CHAS+ZN,data=data.df)
mytab
##     ZN
## CHAS   0 12.5 17.5  18  20  21  22  25  28  30  33  34  35  40  45 52.5
##    0 344   10    1   1  18   4  10  10   3   6   4   3   3   4   6    3
##    1  28    0    0   0   3   0   0   0   0   0   0   0   0   3   0    0
##     ZN
## CHAS  55  60  70  75  80 82.5  85  90  95 100
##    0   3   4   3   3  15    2   2   4   4   1
##    1   0   0   0   0   0    0   0   1   0   0

This second two-way table shows that 28 of the 35 river-bounding tracts have ZN = 0.
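
Row and column totals make a two-way table easier to read. The sketch below re-enters the CHAS by RAD counts printed above as a matrix (chas_rad is a hypothetical name) and uses addmargins() and prop.table() to add totals and within-column shares.

```r
# CHAS x RAD counts copied from the xtabs() output above.
chas_rad <- matrix(
  c(19, 24, 36, 102, 104, 26, 17, 19, 124,
     1,  0,  2,   8,  11,  0,  0,  5,   8),
  nrow = 2, byrow = TRUE,
  dimnames = list(CHAS = c("0", "1"),
                  RAD = c("1", "2", "3", "4", "5", "6", "7", "8", "24"))
)

# Add a "Sum" row and column, then compute the share of each CHAS level per RAD value.
with_totals <- addmargins(chas_rad)
share_by_rad <- round(prop.table(chas_rad, margin = 2), 2)
print(with_totals)
print(share_by_rad)
```

On the real data the same functions can be applied directly to the xtabs() result, e.g. addmargins(mytab).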

Draw a boxplot of the variables that belong to your study.

Visualising variables in our study using Boxplots:

CRIM : Crime Rate

boxplot(data.df$CRIM,col=c("Grey"),main="Boxplot of Crime Rate")

The above graph tells us that the per capita crime rate is usually very low: the box is squeezed near 0, with outliers reaching about 89. Magnifying this area:

boxplot(data.df$CRIM,col=c("Grey"),main="Boxplot of Crime Rate",ylim=c(0,5))

With the y-axis limited to 0-5, the box becomes visible. The inference is that most tracts have a low crime rate; the median is about 0.26.
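
The points drawn as outliers follow the 1.5 x IQR rule that boxplot() uses by default, and boxplot.stats() applies the same rule without plotting. In the sketch below, crim_example is a toy vector holding only the five summary values of CRIM reported by summary() above, not the real column.

```r
# Min, quartiles and max of CRIM as printed by summary() (toy stand-in).
crim_example <- c(0.00632, 0.08205, 0.25651, 3.67708, 88.9762)

# boxplot.stats() returns the hinges/whisker ends in $stats and flagged points in $out.
crim_stats <- boxplot.stats(crim_example)
print(crim_stats$stats)
print(crim_stats$out)
```

On the real data, length(boxplot.stats(data.df$CRIM)$out) would count how many tracts are flagged as outliers.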


ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
boxplot(data.df$ZN,col=c("Pink"),main="Boxplot of proportion of residential zones")

The median is 0, implying that most tracts have no residential land zoned for lots over 25,000 sq.ft.


INDUS: proportion of non-retail business acres per town.
boxplot(data.df$INDUS,col=c("yellowgreen"),main="Boxplot of INDUS")

This graph has a median of around 9-10%, implying that in a typical town about one tenth of the acreage is used for non-retail business.

chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

Since CHAS is a 0/1 categorical variable, a bar plot represents it better than a boxplot.

mytable<-table(data.df$CHAS)
barplot(mytable,col=c("salmon"),main="Bar plot of River tract bound",ylab = "Frequency Count",xlab="Chas dummy variable for river tract bound")

pie(mytable,col=c("red","blue"),main="Proportion Chart",radius = 0.6)
legend("topright",c("tract bounds river","tract does not bound river"),fill=c("blue","red"))

nox: nitrogen oxides concentration (parts per 10 million).
boxplot(data.df$NOX,col=c("orchid1"),main="Boxplot of nitrogen oxides concentration")

The median NOX concentration is about 0.54 parts per 10 million, and half of the tracts lie between roughly 0.45 and 0.62.

rm: average number of rooms per dwelling.
boxplot(data.df$RM,col=c("paleturquoise2"),main="Boxplot of number of rooms per home",ylim=c(0,10))

From the above graph, we see that most tracts average about 6 rooms per dwelling (median 6.21).

age: proportion of owner-occupied units built prior to 1940.

boxplot(data.df$AGE,col=c("maroon4"),main="Boxplot of owner-occupied units built prior to 1940")

In the median tract, 77.5% of owner-occupied units were built prior to 1940, so most owners live in areas with older housing.

dis: weighted mean of distances to five Boston employment centres.
boxplot(data.df$DIS,col=c("firebrick"),main="Boxplot of mean of distances to employment")

rad: index of accessibility to radial highways.

boxplot(data.df$RAD,col=c("firebrick"),main="Boxplot of index of accessibility to highways")

tax: full-value property-tax rate per $10,000.

boxplot(data.df$TAX,col=c("sienna"),main="Boxplot of property tax")

The median property-tax rate is $330 per $10,000 of full value, and the third quartile is $666.

pt: pupil-teacher ratio by town.
boxplot(data.df$PT,col=c("wheat"),main="Boxplot of pupil-teacher ratio")

The median pupil-teacher ratio is about 19, i.e. 19 students for 1 teacher. The boxplot alone does not show whether people prefer areas with lower ratios.

b: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

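boxplot(data.df$B,col=c("purple"),main="Boxplot of B index")

Most tracts have B close to its maximum of 396.9. Because B = 1000(Bk - 0.63)^2, that maximum corresponds to Bk near 0, so B is a transformed index and cannot be read directly as a proportion. No conclusion about racism can be drawn from this plot.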
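
The transformation behind B can be checked with a line of arithmetic. The sketch below (b_index is a hypothetical helper name) evaluates 1000(Bk - 0.63)^2 at a few proportions, showing that the maximum observed value of 396.9 corresponds to Bk = 0 and that the index is not monotonic in Bk.

```r
# B as defined in the column description: 1000 * (Bk - 0.63)^2.
b_index <- function(bk) 1000 * (bk - 0.63)^2

# Evaluate at a few illustrative proportions (chosen for this example only).
bk_values <- c(0, 0.5, 0.63, 0.76)
b_values <- b_index(bk_values)
print(data.frame(Bk = bk_values, B = b_values))
```

Two different proportions (0.5 and 0.76 here) give the same B, which is why B cannot be inverted to a unique Bk.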

lstat: lower status of the population (percent).

boxplot(data.df$LSTAT,col=c("olivedrab"),main="Boxplot of lower status of the population (percent)")

The median is about 11%, so most tracts have only a minority of lower-status residents.

mv: median value of owner-occupied homes in $1000s.

boxplot(data.df$MV,col=c("steelblue2"),main="Boxplot of median value of owner-occupied homes")

This is the variable we seek to predict. It gives the median price of owner-occupied homes in a tract, which we model on the basis of the other variables.

Viewing the distribution of the median home values:
hist(data.df$MV)

qqnorm(data.df$MV)
qqline(data.df$MV)
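
A Q-Q plot is a visual check of normality; a formal test such as Shapiro-Wilk could complement it. The sketch below uses simulated right-skewed values (mv_sim is a hypothetical stand-in, not the real MV column), so its result says nothing about the actual data.

```r
# Simulated right-skewed sample standing in for MV (illustration only).
set.seed(123)
mv_sim <- rlnorm(100, meanlog = 3, sdlog = 0.4)

# Shapiro-Wilk test: a small p-value indicates departure from normality.
normality <- shapiro.test(mv_sim)
print(normality)
```

On the real data the call would be shapiro.test(data.df$MV).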

Create a correlation matrix.

To understand how the variables relate to one another, we compute their correlation matrix, rounded to two decimals.

round(cor(data.df),2)
##        CRIM    ZN INDUS  CHAS   NOX    RM   AGE   DIS   RAD   TAX    PT
## CRIM   1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58  0.29
## ZN    -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31 -0.39
## INDUS  0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72  0.38
## CHAS  -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04 -0.12
## NOX    0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67  0.19
## RM    -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29 -0.36
## AGE    0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51  0.26
## DIS   -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53 -0.23
## RAD    0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91  0.46
## TAX    0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00  0.46
## PT     0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46  1.00
## B     -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44 -0.18
## LSTAT  0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54  0.37
## MV    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47 -0.51
##           B LSTAT    MV
## CRIM  -0.39  0.46 -0.39
## ZN     0.18 -0.41  0.36
## INDUS -0.36  0.60 -0.48
## CHAS   0.05 -0.05  0.18
## NOX   -0.38  0.59 -0.43
## RM     0.13 -0.61  0.70
## AGE   -0.27  0.60 -0.38
## DIS    0.29 -0.50  0.25
## RAD   -0.44  0.49 -0.38
## TAX   -0.44  0.54 -0.47
## PT    -0.18  0.37 -0.51
## B      1.00 -0.37  0.33
## LSTAT -0.37  1.00 -0.74
## MV     0.33 -0.74  1.00

Thus we obtain the correlation matrix of the dataset.

Inferences: Among the highest correlations are those between INDUS and NOX (0.76), TAX and RAD (0.91), and TAX and INDUS (0.72). It makes sense that nitrogen oxide levels as well as tax levels are highest near industrial areas. These are possible sources of multicollinearity, since such predictors explain much of the same variation in MV.

As for MV itself, the average number of rooms (RM) has the highest positive correlation (0.70), while LSTAT (-0.74) and the pupil-teacher ratio (-0.51) have the strongest negative correlations.
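
Strongly correlated pairs can also be picked out programmatically instead of by eye. The sketch below re-enters a 4 x 4 corner of the matrix above by hand (cm is a hypothetical name) and lists the pairs whose absolute correlation exceeds 0.7; that threshold is an arbitrary assumption made for illustration.

```r
# Sub-matrix of the correlations printed above (INDUS, NOX, RAD, TAX).
vars <- c("INDUS", "NOX", "RAD", "TAX")
cm <- matrix(c(1.00, 0.76, 0.60, 0.72,
               0.76, 1.00, 0.61, 0.67,
               0.60, 0.61, 1.00, 0.91,
               0.72, 0.67, 0.91, 1.00),
             nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Look only above the diagonal so that each pair is reported once.
idx <- which(upper.tri(cm) & abs(cm) > 0.7, arr.ind = TRUE)
strong_pairs <- data.frame(var1 = rownames(cm)[idx[, "row"]],
                           var2 = colnames(cm)[idx[, "col"]],
                           r = cm[idx],
                           row.names = NULL)

# Sort from the strongest to the weakest correlation.
strong_pairs <- strong_pairs[order(-abs(strong_pairs$r)), ]
print(strong_pairs)
```

Replacing cm with cor(data.df) would apply the same filter to all 14 variables.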

Visualize your correlation matrix using corrgram.

The corrgram is shown below:

library(corrgram)
corrgram(data.df,upper.panel = panel.pie)

In the corrgram above, the pies in the upper panel are filled in proportion to the strength of each correlation.

Create a scatter plot matrix for your data set.

library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
scatterplotMatrix(formula = ~ CRIM + ZN + INDUS + CHAS + NOX + RM +AGE + DIS + RAD +TAX +PT +B +LSTAT+ MV, cex=0.6,
                       data=data.df, diagonal="histogram")                  
## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth

Run a suitable test to check your hypothesis for your suitable assumptions.

Null hypothesis: The price of houses (MV) is independent of all the criteria mentioned.

Model<-MV~CRIM + ZN + INDUS + CHAS + NOX + RM +AGE + DIS + RAD +TAX +PT +B +LSTAT
fit<-lm(Model,data=data.df)
summary(fit)
## 
## Call:
## lm(formula = Model, data = data.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.595  -2.730  -0.518   1.777  26.199 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
## CRIM        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
## ZN           4.642e-02  1.373e-02   3.382 0.000778 ***
## INDUS        2.056e-02  6.150e-02   0.334 0.738287    
## CHAS         2.687e+00  8.616e-01   3.118 0.001925 ** 
## NOX         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
## RM           3.810e+00  4.179e-01   9.116  < 2e-16 ***
## AGE          6.922e-04  1.321e-02   0.052 0.958229    
## DIS         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
## RAD          3.060e-01  6.635e-02   4.613 5.07e-06 ***
## TAX         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
## PT          -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
## B            9.312e-03  2.686e-03   3.467 0.000573 ***
## LSTAT       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338 
## F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16

Inferences:

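  1. Out of 13 predictors, 11 are statistically significant at the 0.01 level and help in predicting MV (price of houses); INDUS and AGE are not.
  2. CRIM has a negative coefficient (-0.108, p = 0.0011): higher crime rates go with lower house prices.
  3. ZN, the residential zoning proportion, has a positive coefficient (0.046, p = 0.0008).
  4. INDUS has no statistically significant effect on house prices (p = 0.74) once the other predictors are in the model.
  5. CHAS has a positive coefficient (2.687, p = 0.0019).
  6. NOX has a large negative coefficient (-17.77, p < 0.001).
  7. DIS, TAX, PT and LSTAT also have negative coefficients, while RM, RAD and B have positive ones; all of these are significant. AGE is not significant (p = 0.96).

The residual standard error is 4.745 on 492 degrees of freedom (506 observations minus 14 estimated coefficients).

The model explains 74.06% of the variance in MV (multiple R-squared); the adjusted R-squared is 73.38%.

The p-value of the F-statistic is below 2.2e-16, well under 0.001, so the hypothesis that MV is independent of all the predictors is rejected.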
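
Since some predictors are strongly correlated with each other (RAD and TAX, for instance, at 0.91), variance inflation factors are one way to gauge how much multicollinearity affects the coefficient estimates. The sketch below computes them by hand as 1 / (1 - R^2) on simulated data; vif_manual, toy and the variables x1 to x3 are hypothetical names, and x2 is deliberately built to be collinear with x1.

```r
# Simulated data with one deliberately collinear pair (illustration only).
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)   # nearly a copy of x1
x3 <- rnorm(n)
y  <- 2 * x1 - x3 + rnorm(n)
toy <- data.frame(y, x1, x2, x3)

# VIF of one predictor: regress it on the other predictors, then 1 / (1 - R^2).
vif_manual <- function(predictor, data, response = "y") {
  others <- setdiff(names(data), c(predictor, response))
  r2 <- summary(lm(reformulate(others, response = predictor), data = data))$r.squared
  1 / (1 - r2)
}

vifs <- sapply(c("x1", "x2", "x3"), vif_manual, data = toy)
print(round(vifs, 2))
```

For the model fitted above, the car package (already loaded for the scatter plot matrix) provides vif(fit), which reports the same quantity for each predictor.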

Run a t-test to analyse your hypothesis

Running a T-Test.

t.test(data.df$MV)
## 
##  One Sample t-test
## 
## data:  data.df$MV
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  21.72953 23.33608
## sample estimates:
## mean of x 
##  22.53281

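The p-value here is also below 2.2e-16, but this one-sample t-test only checks whether the mean of MV differs from 0. That null hypothesis is rejected, and the 95% confidence interval for the mean is 21.73 to 23.34 (in $1000s). The test does not by itself address whether MV is independent of the other variables; the regression F-test above does.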
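
A t-test that bears on dependence would compare MV between two groups, for example tracts that bound the river and tracts that do not. The sketch below shows the form of such a two-sample (Welch) test on simulated data; toy_groups and its group means are made-up assumptions, so the output says nothing about the real dataset.

```r
# Simulated groups standing in for CHAS = 0 and CHAS = 1 (illustration only).
set.seed(42)
chas <- rep(c(0, 1), times = c(60, 20))
mv <- rnorm(80, mean = ifelse(chas == 1, 28, 22), sd = 8)
toy_groups <- data.frame(mv, chas)

# Welch two-sample t-test: does mean MV differ between the two groups?
group_test <- t.test(mv ~ chas, data = toy_groups)
print(group_test)
```

On the real data the call would be t.test(MV ~ CHAS, data = data.df).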

This concludes our initial report of the Capstone Project on the Boston Housing Dataset.