Name: Anshuman Raina
Email: anshuman.raina@gmail.com
College: MAHARAJA AGARASEN INSTITUTE OF TECHNOLOGY
This is a capstone project analysing the Boston Housing dataset to identify the factors that people take into consideration when moving to Boston.
data.df <- read.csv("G:/R Intern/Capstone/bostonhousing.csv")
View(data.df)
dim(data.df)
## [1] 506 14
First, we read the dataset and store it in a data frame (data.df). We then call dim() to get its dimensions: 506 rows and 14 columns.
The data frame contains the following columns:
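As a minimal sketch of the same read-and-check step, the snippet below writes a tiny stand-in CSV to a temporary file, reads it back and verifies its shape before any analysis. The two data rows and the names `csv_path`, `housing` and `housing_dim` are made up for illustration; they are not taken from bostonhousing.csv.

```r
# Hypothetical stand-in for the real file: three columns, two rows.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("CRIM,RM,MV",
             "0.00632,6.575,24.0",
             "0.02731,6.421,21.6"), csv_path)

housing <- read.csv(csv_path)

# dim() returns the number of rows first, then the number of columns.
housing_dim <- dim(housing)
print(housing_dim)

# Fail early if the shape is unexpected or values are missing.
stopifnot(identical(housing_dim, c(2L, 3L)), !anyNA(housing))
```

For the real dataset the corresponding check would compare against c(506L, 14L).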
1.crim: per capita crime rate by town.
2.zn: proportion of residential land zoned for lots over 25,000 sq.ft.
3.indus: proportion of non-retail business acres per town.
4.chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
5.nox: nitrogen oxides concentration (parts per 10 million).
6.rm: average number of rooms per dwelling.
7.age: proportion of owner-occupied units built prior to 1940.
8.dis: weighted mean of distances to five Boston employment centres.
9.rad: index of accessibility to radial highways.
10.tax: full-value property-tax rate per $10,000.
11.pt: pupil-teacher ratio by town.
12.b: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
13.lstat: lower status of the population (percent).
14.mv: median value of owner-occupied homes in $1000s.
Here we carry out the descriptive analysis: the structure of the data frame, followed by the minimum, quartiles, median, mean and maximum of each column.
str(data.df)
## 'data.frame': 506 obs. of 14 variables:
## $ CRIM : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ ZN : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ INDUS: num 2.31 7.07 7.07 2.18 2.18 ...
## $ CHAS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ NOX : num 0.538 0.469 0.469 0.458 0.458 ...
## $ RM : num 6.57 6.42 7.18 7 7.15 ...
## $ AGE : num 65.2 78.9 61.1 45.8 54.2 ...
## $ DIS : num 4.09 4.97 4.97 6.06 6.06 ...
## $ RAD : int 1 2 2 3 3 3 5 5 5 5 ...
## $ TAX : int 296 242 242 222 222 222 311 311 311 311 ...
## $ PT : num 15.3 17.8 17.8 18.7 18.7 ...
## $ B : num 397 397 393 395 397 ...
## $ LSTAT: num 4.98 9.14 4.03 2.94 5.33 ...
## $ MV : num 24 21.6 34.7 33.4 36.2 ...
summary(data.df)
## CRIM ZN INDUS CHAS
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## NOX RM AGE DIS
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## RAD TAX PT B
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## LSTAT MV
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Column-wise description:
library(psych)
describe(data.df)
## vars n mean sd median trimmed mad min max range
## CRIM 1 506 3.61 8.60 0.26 1.68 0.33 0.01 88.98 88.97
## ZN 2 506 11.36 23.32 0.00 5.08 0.00 0.00 100.00 100.00
## INDUS 3 506 11.14 6.86 9.69 10.93 9.37 0.46 27.74 27.28
## CHAS 4 506 0.07 0.25 0.00 0.00 0.00 0.00 1.00 1.00
## NOX 5 506 0.55 0.12 0.54 0.55 0.13 0.38 0.87 0.49
## RM 6 506 6.28 0.70 6.21 6.25 0.51 3.56 8.78 5.22
## AGE 7 506 68.57 28.15 77.50 71.20 28.98 2.90 100.00 97.10
## DIS 8 506 3.80 2.11 3.21 3.54 1.91 1.13 12.13 11.00
## RAD 9 506 9.55 8.71 5.00 8.73 2.97 1.00 24.00 23.00
## TAX 10 506 408.24 168.54 330.00 400.04 108.23 187.00 711.00 524.00
## PT 11 506 18.46 2.16 19.05 18.66 1.70 12.60 22.00 9.40
## B 12 506 356.67 91.29 391.44 383.17 8.09 0.32 396.90 396.58
## LSTAT 13 506 12.65 7.14 11.36 11.90 7.11 1.73 37.97 36.24
## MV 14 506 22.53 9.20 21.20 21.56 5.93 5.00 50.00 45.00
## skew kurtosis se
## CRIM 5.19 36.60 0.38
## ZN 2.21 3.95 1.04
## INDUS 0.29 -1.24 0.30
## CHAS 3.39 9.48 0.01
## NOX 0.72 -0.09 0.01
## RM 0.40 1.84 0.03
## AGE -0.60 -0.98 1.25
## DIS 1.01 0.46 0.09
## RAD 1.00 -0.88 0.39
## TAX 0.67 -1.15 7.49
## PT -0.80 -0.30 0.10
## B -2.87 7.10 4.06
## LSTAT 0.90 0.46 0.32
## MV 1.10 1.45 0.41
Thus we get the mean, median, standard deviation and other summary statistics for each column.
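Several of the describe() columns can be reproduced with base R, which makes their meaning concrete. The sketch below uses a short made-up vector `x`, and `manual_stats` is an illustrative name. It assumes that the trimmed column uses psych's default of 10% trimming.

```r
# Ten made-up values, so that 10% trimming drops exactly one value per end.
x <- c(5, 12, 17, 19, 21, 22, 23, 25, 31, 50)

manual_stats <- c(
  mean    = mean(x),
  sd      = sd(x),
  median  = median(x),
  trimmed = mean(x, trim = 0.1),       # mean after dropping 10% from each end
  mad     = mad(x),                    # median absolute deviation (scaled)
  range   = max(x) - min(x),
  se      = sd(x) / sqrt(length(x))    # standard error of the mean
)
round(manual_stats, 2)
```

Applied to MV, the se formula gives 9.20 / sqrt(506), which is about 0.41 and agrees with the se column above.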
Next, we build contingency tables. To do this, we must first analyze the variables that can be tabulated. They are:
First, chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
This is a dummy variable, i.e. a categorical variable coded as 1 or 0.
mytab<-table(data.df$CHAS)
mytab
##
## 0 1
## 471 35
So 471 tracts do not bound the river and only 35 do.
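The counts are easier to compare as shares. A minimal sketch, re-entering the two counts from the table above by hand (the names `chas_counts` and `chas_share` are illustrative):

```r
# Counts copied from the one-way table of CHAS.
chas_counts <- c("0" = 471, "1" = 35)

# prop.table() divides each count by the total; scale to percent.
chas_share <- round(100 * prop.table(chas_counts), 1)
chas_share
```

With the data frame loaded, the same result would come from prop.table(table(data.df$CHAS)). The share for CHAS = 1 agrees with the mean of CHAS (0.06917) reported by summary().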
mytab<-table(data.df$ZN)
mytab
##
## 0 12.5 17.5 18 20 21 22 25 28 30 33 34 35 40 45
## 372 10 1 1 21 4 10 10 3 6 4 3 3 7 6
## 52.5 55 60 70 75 80 82.5 85 90 95 100
## 3 3 4 3 3 15 2 2 5 4 1
The table for ZN (the proportion of residential land zoned for large lots) is also a one-way table, although a long one.
mytab<-table(data.df$INDUS)
mytab
##
## 0.460000008 0.74000001 1.210000038 1.220000029 1.25 1.320000052
## 1 1 1 1 2 1
## 1.379999995 1.470000029 1.519999981 1.690000057 1.75999999 1.889999986
## 1 2 4 2 1 1
## 1.909999967 2.00999999 2.019999981 2.029999971 2.180000067 2.24000001
## 2 1 1 2 7 3
## 2.25 2.309999943 2.460000038 2.680000067 2.890000105 2.930000067
## 1 1 8 2 5 2
## 2.950000048 2.970000029 3.24000001 3.329999924 3.369999886 3.410000086
## 2 1 3 4 2 4
## 3.440000057 3.640000105 3.75 3.779999971 3.970000029 4
## 6 2 1 2 12 1
## 4.050000191 4.150000095 4.389999866 4.489999771 4.860000134 4.929999828
## 7 1 2 4 4 6
## 4.949999809 5.130000114 5.190000057 5.320000172 5.639999866 5.860000134
## 3 6 8 3 4 10
## 5.960000038 6.059999943 6.070000172 6.090000153 6.199999809 6.409999847
## 4 2 3 3 18 5
## 6.909999847 6.960000038 7.070000172 7.380000114 7.869999886 8.140000343
## 9 5 2 8 7 22
## 8.56000042 9.68999958 9.899999619 10.01000023 10.59000015 10.81000042
## 11 8 12 9 11 4
## 11.93000031 12.82999992 13.89000034 13.92000008 15.03999996 18.10000038
## 5 6 4 5 3 132
## 19.57999992 21.88999939 25.64999962 27.73999977
## 30 15 7 5
Like ZN, INDUS gives a long one-way table in which many values repeat across tracts.
mytab<-table(data.df$NOX)
mytab
##
## 0.38499999 0.388999999 0.39199999 0.393999994 0.398000002 0.400000006
## 1 1 2 1 2 4
## 0.400999993 0.402999997 0.404000014 0.405000001 0.409000009 0.409999996
## 3 3 3 3 3 3
## 0.411000013 0.412999988 0.414999992 0.416099995 0.421999991 0.425999999
## 6 6 2 3 1 4
## 0.428000003 0.42899999 0.430999994 0.432999998 0.435000002 0.437000006
## 8 3 10 3 1 17
## 0.437900007 0.43900001 0.442000002 0.442900002 0.444999993 0.446999997
## 2 4 3 4 5 5
## 0.448000014 0.449000001 0.453000009 0.458000004 0.460000008 0.463999987
## 9 4 6 3 3 8
## 0.469000012 0.472000003 0.483999997 0.488000005 0.488999993 0.493000001
## 2 4 2 8 15 8
## 0.499000013 0.504000008 0.507000029 0.50999999 0.514999986 0.518000007
## 4 8 10 7 8 1
## 0.519999981 0.523999989 0.532000005 0.537999988 0.54400003 0.546999991
## 11 7 5 23 12 9
## 0.550000012 0.573000014 0.574999988 0.579999983 0.58099997 0.583000004
## 4 5 2 4 7 4
## 0.583999991 0.584999979 0.597000003 0.605000019 0.609000027 0.614000022
## 8 8 6 14 5 7
## 0.624000013 0.630999982 0.647000015 0.654999971 0.658999979 0.667999983
## 15 5 10 3 2 3
## 0.671000004 0.67900002 0.693000019 0.699999988 0.713 0.717999995
## 7 8 14 11 18 6
## 0.74000001 0.769999981 0.870999992
## 13 8 16
This table also has many repeated values, giving another long one-way table.
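For a continuous variable such as NOX, grouping the values into intervals before tabulating gives a much shorter table. The sketch below is one way to do this with cut(); the seven values and the break points are illustrative choices within the observed range, not the dataset itself.

```r
# Illustrative NOX-like values (not the full column).
nox <- c(0.385, 0.437, 0.538, 0.538, 0.624, 0.713, 0.871)

# cut() assigns each value to an interval (lower, upper].
nox_bins <- cut(nox, breaks = c(0.3, 0.5, 0.7, 0.9))

# The one-way table now has one entry per interval instead of one per value.
nox_table <- table(nox_bins)
nox_table
```

On the real data the call would be table(cut(data.df$NOX, breaks = ...)), with breaks chosen to suit the analysis.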
mytab<-table(data.df$PT)
mytab
##
## 12.60000038 13 13.60000038 14.39999962 14.69999981 14.80000019
## 3 12 1 1 34 3
## 14.89999962 15.10000038 15.19999981 15.30000019 15.5 15.60000038
## 4 1 13 3 1 2
## 15.89999962 16 16.10000038 16.39999962 16.60000038 16.79999924
## 2 5 5 6 16 4
## 16.89999962 17 17.29999924 17.39999962 17.60000038 17.79999924
## 5 4 1 18 7 23
## 17.89999962 18 18.20000076 18.29999924 18.39999962 18.5
## 11 5 4 4 16 4
## 18.60000038 18.70000076 18.79999924 18.89999962 19 19.10000038
## 17 9 2 3 4 17
## 19.20000076 19.60000038 19.70000076 20.10000038 20.20000076 20.89999962
## 19 8 8 5 140 11
## 21 21.10000038 21.20000076 22
## 27 1 15 2
Here too we get a one-way table, that is, a table over a single variable, although a long one.
mytab<-table(data.df$TAX)
mytab
##
## 187 188 193 198 216 222 223 224 226 233 241 242 243 244 245 247 252 254
## 1 7 8 1 5 7 5 10 1 9 1 2 4 1 3 4 2 5
## 255 256 264 265 270 273 276 277 279 280 281 284 285 287 289 293 296 300
## 1 1 12 2 7 5 9 11 4 1 4 7 1 8 5 3 8 7
## 304 305 307 311 313 315 329 330 334 335 337 345 348 351 352 358 370 384
## 14 4 40 7 1 2 6 10 2 2 2 3 2 1 2 3 2 11
## 391 398 402 403 411 422 430 432 437 469 666 711
## 8 12 2 30 2 1 3 9 15 1 132 5
TAX also creates a long one way table.
mytab<-table(data.df$RAD)
mytab
##
## 1 2 3 4 5 6 7 8 24
## 20 24 38 110 115 26 17 24 132
RAD also creates a one way table.
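A one-way table also yields the mode, which summary() does not report and for which base R has no dedicated function. The sketch below rebuilds the RAD column from the counts printed above and picks the most frequent value; `stat_mode` is a hypothetical helper written for this example.

```r
# Levels and counts copied from the one-way table of RAD.
rad_levels <- c(1, 2, 3, 4, 5, 6, 7, 8, 24)
rad_counts <- c(20, 24, 38, 110, 115, 26, 17, 24, 132)
rad <- rep(rad_levels, times = rad_counts)

# Hypothetical helper: the value with the highest frequency.
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}

stat_mode(rad)   # most frequent RAD value
mean(rad)        # should agree with the mean reported by summary()
```

If several values tie for the highest count, which.max() returns only the first of them.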
Two-way tables can be built from the variables tabulated above. To summarize, the following variables each form a one-way table:
CHAS, ZN, INDUS, NOX, PT, TAX, RAD
Choosing 2 of these 7 variables gives 7C2 = 21 possible combinations. Leaving out the combinations whose tables are too long to read, we get:
mytab<-xtabs(~CHAS+RAD,data=data.df)
mytab
## RAD
## CHAS 1 2 3 4 5 6 7 8 24
## 0 19 24 36 102 104 26 17 19 124
## 1 1 0 2 8 11 0 0 5 8
Thus here we see a two way table.
mytab<-xtabs(~CHAS+ZN,data=data.df)
mytab
## ZN
## CHAS 0 12.5 17.5 18 20 21 22 25 28 30 33 34 35 40 45 52.5
## 0 344 10 1 1 18 4 10 10 3 6 4 3 3 4 6 3
## 1 28 0 0 0 3 0 0 0 0 0 0 0 0 3 0 0
## ZN
## CHAS 55 60 70 75 80 82.5 85 90 95 100
## 0 3 4 3 3 15 2 2 4 4 1
## 1 0 0 0 0 0 0 0 1 0 0
Here is one more contingency table as above.
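Two-way tables are easier to read with margins and row proportions added. A minimal sketch, re-entering the CHAS by RAD counts from the table above by hand; the object names are illustrative.

```r
# Counts copied from the CHAS x RAD table (rows: CHAS = 0, 1).
chas_rad <- as.table(matrix(
  c(19, 24, 36, 102, 104, 26, 17, 19, 124,
     1,  0,  2,   8,  11,  0,  0,  5,   8),
  nrow = 2, byrow = TRUE,
  dimnames = list(CHAS = c("0", "1"),
                  RAD  = c("1", "2", "3", "4", "5", "6", "7", "8", "24"))
))

# Row and column totals.
addmargins(chas_rad)

# Share of each RAD level within each CHAS group (rows sum to 1).
row_share <- prop.table(chas_rad, margin = 1)
round(row_share, 2)

# Number of variable pairs available from 7 variables.
n_pairs <- choose(7, 2)
n_pairs
```

The row totals reproduce the one-way CHAS counts (471 and 35), which is a quick consistency check on the table.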
Visualising variables in our study using Boxplots:
boxplot(data.df$CRIM,col=c("Grey"),main="Boxplot of Crime Rate")
The above graph shows that the per capita crime rate is usually very low: the box sits in the 0-10 range, with a few outliers reaching about 89. Magnifying this area:
boxplot(data.df$CRIM,col=c("Grey"),main="Boxplot of Crime Rate",ylim=c(0,5))
This is how the lower part of the graph looks. The inference is that most of the housing data comes from zones with a low crime rate; the median is about 0.26.
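The points drawn as outliers follow from the usual boxplot rule: anything more than 1.5 times the interquartile range beyond the quartiles. The sketch below works this out for CRIM from the quartiles printed by summary(). It is an approximation, since boxplot() uses hinges that can differ slightly from these quartiles.

```r
# Quartiles of CRIM copied from the summary() output.
q1 <- 0.08205
q3 <- 3.67708

iqr <- q3 - q1

# Whiskers extend at most 1.5 * IQR beyond the quartiles.
upper_fence <- q3 + 1.5 * iqr
lower_fence <- q1 - 1.5 * iqr

upper_fence   # 9.069625: larger crime rates are drawn as outliers
lower_fence   # negative, so no low outliers are possible for CRIM
```

This is why the box and whiskers stay below about 10 while individual points extend to the maximum of 88.98.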
boxplot(data.df$ZN,col=c("Pink"),main="Boxplot of proportion of residential zones")
The median is 0, so more than half of the tracts (372 of 506) have no residential land zoned for lots over 25,000 sq.ft.
boxplot(data.df$INDUS,col=c("yellowgreen"),main="Boxplot of INDUS")
The median is around 9-10%, so in a typical town roughly a tenth of the acreage is used by non-retail business.
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
Since this is a categorical variable with a short one-way table, a bar plot and a pie chart represent it better than a boxplot.
mytable<-table(data.df$CHAS)
barplot(mytable,col=c("salmon"),main="Barplot of River tract bound",ylab = "Frequency Count",xlab="Chas dummy variable for river tract bound")
pie(mytable,col=c("red","blue"),main="Proportion Chart",radius = 0.6)
legend("topright",c("tract bounds river","tract does not bound river"),fill=c("blue","red"))
boxplot(data.df$NOX,col=c("orchid1"),main="Boxplot of Nitrogen Oxides Concentration")
The middle half of the tracts have nox between about 0.45 and 0.62 parts per 10 million, with a median of about 0.54.
boxplot(data.df$RM,col=c("paleturquoise2"),main="Boxplot of number of rooms per home",ylim=c(0,10))
From the above graph, we see that a typical dwelling has about 6 rooms (the median is 6.2).
boxplot(data.df$AGE,col=c("maroon4"),main="Boxplot of owner-occupied units built prior to 1940")
In the median tract, 77.5% of owner-occupied units were built prior to 1940, so most owners live in older homes.
boxplot(data.df$DIS,col=c("firebrick"),main="Boxplot of mean of distances to employment")
rad: index of accessibility to radial highways.
boxplot(data.df$RAD,col=c("firebrick"),main="Boxplot of index of accessibility to highways")
tax: full-value property-tax rate per $10,000.
boxplot(data.df$TAX,col=c("sienna"),main="Boxplot of property tax")
The median property-tax rate is $330 per $10,000 of property value, with a large group of 132 tracts at $666.
boxplot(data.df$PT,col=c("wheat"),main="Boxplot of mean of pupil teacher ratio")
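The median pupil-teacher ratio is about 19, i.e. 19 students for 1 teacher. Its correlation with MV, computed below, is -0.51, so home values tend to be lower where the ratio is higher.
boxplot(data.df$B,col=c("purple"),main="Boxplot of B")
Most tracts have B close to its maximum of 396.9. Because B = 1000(Bk - 0.63)^2, this value corresponds to Bk near 0, so the plot cannot be read directly as the proportion itself and does not support conclusions about racism.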
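The definition of B listed earlier, 1000(Bk - 0.63)^2, can be evaluated directly to see how B relates to the underlying proportion Bk. The sketch below does this for a few illustrative values of Bk; `b_from_bk` is a hypothetical helper name.

```r
# B as defined in the column description: 1000 * (Bk - 0.63)^2.
b_from_bk <- function(bk) 1000 * (bk - 0.63)^2

# A few illustrative proportions between 0 and 1.
bk <- c(0, 0.25, 0.63, 1)
b_values <- round(b_from_bk(bk), 1)

data.frame(Bk = bk, B = b_values)
```

B is largest (396.9) at Bk = 0, falls to 0 at Bk = 0.63 and rises again above that, so a high B does not mean a high proportion.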
boxplot(data.df$LSTAT,col=c("olivedrab"),main="Boxplot of lower status of the population (percent)")
Most tracts have a fairly small share of lower-status population; the median is about 11%.
boxplot(data.df$MV,col=c("steelblue2"),main="Boxplot of median value of owner-occupied homes")
This is the variable we seek to predict. It gives the median home value of each tract, which we model on the basis of the other variables.
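# Refer to the column explicitly instead of attach(), which can mask other objects
hist(data.df$MV,main="Histogram of MV",xlab="MV")
qqnorm(data.df$MV)
qqline(data.df$MV)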
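The visual impression from a histogram and Q-Q plot can be backed by a formal normality test. The sketch below applies the Shapiro-Wilk test from base R to a simulated right-skewed sample; the sample, the seed and the names `skewed` and `sw` are assumptions made for illustration, and the result says nothing about MV itself.

```r
# Simulated right-skewed data standing in for a home-value column.
set.seed(1)
skewed <- rexp(200, rate = 1 / 20)

# Shapiro-Wilk test: the null hypothesis is that the sample is normal.
sw <- shapiro.test(skewed)
sw$statistic
sw$p.value

# A small p-value means the sample departs from normality.
sw$p.value < 0.05
```

On the real data the call would be shapiro.test(data.df$MV).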
To understand the correlation:
Creating a correlation Matrix of the above.
round(cor(data.df),2)
## CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PT
## CRIM 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## ZN -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## INDUS 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## CHAS -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## NOX 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## RM -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## AGE 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## DIS -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## RAD 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## TAX 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## PT 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## B -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## LSTAT 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## MV -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## B LSTAT MV
## CRIM -0.39 0.46 -0.39
## ZN 0.18 -0.41 0.36
## INDUS -0.36 0.60 -0.48
## CHAS 0.05 -0.05 0.18
## NOX -0.38 0.59 -0.43
## RM 0.13 -0.61 0.70
## AGE -0.27 0.60 -0.38
## DIS 0.29 -0.50 0.25
## RAD -0.44 0.49 -0.38
## TAX -0.44 0.54 -0.47
## PT -0.18 0.37 -0.51
## B 1.00 -0.37 0.33
## LSTAT -0.37 1.00 -0.74
## MV 0.33 -0.74 1.00
Thus we obtain the correlation matrix of the dataset. Inferences: The strongest correlation is between RAD and TAX (0.91), followed by NOX and DIS (-0.77), INDUS and NOX (0.76) and INDUS and TAX (0.72). It makes sense that nitrogen oxide levels as well as tax levels tend to be higher near industrial areas. These pairs are possible sources of multicollinearity, since the variables in each pair explain much of the same variation in MV.
For MV itself, the average number of rooms (RM, 0.70) has the highest positive correlation, while LSTAT (-0.74) and the pupil-teacher ratio (PT, -0.51) have the strongest negative correlations.
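The strongest pairs can also be listed programmatically instead of being read off the matrix. The sketch below does this for a four-variable excerpt of the correlation matrix, with values copied from the output above; the object names are illustrative.

```r
# Excerpt of the correlation matrix (values copied from the output above).
vars <- c("INDUS", "NOX", "RAD", "TAX")
cm <- matrix(c(1.00, 0.76, 0.60, 0.72,
               0.76, 1.00, 0.61, 0.67,
               0.60, 0.61, 1.00, 0.91,
               0.72, 0.67, 0.91, 1.00),
             nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Keep each pair once by taking only the upper triangle.
idx <- which(upper.tri(cm), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)

# Sort by absolute correlation, strongest first.
ranked <- cor_pairs[order(-abs(cor_pairs$r)), ]
ranked
```

On the full data the matrix would be cor(data.df), and sorting by the absolute value keeps strong negative correlations near the top as well.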
The corrgram is below:
library(corrgram)
corrgram(data.df,upper.panel = panel.pie)
The above corrgram is thus obtained.
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplotMatrix(formula = ~ CRIM + ZN + INDUS + CHAS + NOX + RM +AGE + DIS + RAD +TAX +PT +B +LSTAT+ MV, cex=0.6,
data=data.df, diagonal="histogram")
## Warning in smoother(x, y, col = col[2], log.x = FALSE, log.y = FALSE,
## spread = spread, : could not fit smooth
Null hypothesis: The price of houses (MV) is independent of all the criteria mentioned.
Model<-MV~CRIM + ZN + INDUS + CHAS + NOX + RM +AGE + DIS + RAD +TAX +PT +B +LSTAT
fit<-lm(Model,data=data.df)
summary(fit)
##
## Call:
## lm(formula = Model, data = data.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## CRIM -1.080e-01 3.286e-02 -3.287 0.001087 **
## ZN 4.642e-02 1.373e-02 3.382 0.000778 ***
## INDUS 2.056e-02 6.150e-02 0.334 0.738287
## CHAS 2.687e+00 8.616e-01 3.118 0.001925 **
## NOX -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## RM 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## AGE 6.922e-04 1.321e-02 0.052 0.958229
## DIS -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## RAD 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## TAX -1.233e-02 3.760e-03 -3.280 0.001112 **
## PT -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## B 9.312e-03 2.686e-03 3.467 0.000573 ***
## LSTAT -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
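Inferences:
The residual standard error is 4.745 (in $1000s) on 492 degrees of freedom.
The model explains 74.06% of the variance in MV (adjusted R-squared: 73.38%).
The F-test p-value (p < 2.2e-16) is far below 0.001, so the model as a whole is highly significant. Among the individual predictors, only INDUS and AGE are not significant.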
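The summary figures are linked by simple formulas, which can be checked by hand. The sketch below recomputes the residual degrees of freedom, the adjusted R-squared and the F-statistic from n = 506 observations, p = 13 predictors and the reported R-squared. Because the inputs are rounded, the results match the output only approximately.

```r
n  <- 506      # observations
p  <- 13       # predictors (excluding the intercept)
r2 <- 0.7406   # multiple R-squared from summary(fit)

# Residual degrees of freedom: observations minus estimated coefficients.
df_resid <- n - p - 1

# Adjusted R-squared penalises the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# F-statistic for the null hypothesis that all slopes are zero.
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

c(df_resid = df_resid, adj_r2 = round(adj_r2, 4), f_stat = round(f_stat, 1))
```

The values come out close to the reported 492, 0.7338 and 108.1.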
Running a T-Test.
t.test(data.df$MV)
##
## One Sample t-test
##
## data: data.df$MV
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
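The p-value obtained here is again below 2.2e-16. Note that this one-sample t-test only checks whether the mean of MV differs from 0; it is the F-test of the regression above that leads us to reject the null hypothesis that MV is independent of the predictors.
This concludes our initial report of the Capstone Project on the Boston Housing Dataset.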
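The one-sample t-statistic is the sample mean divided by its standard error, which shows what this test measures. The sketch below recomputes it from the mean in the t-test output and the standard deviation of MV from describe() (9.20, rounded), so the results agree with the output only up to rounding.

```r
n      <- 506
mean_x <- 22.53281   # sample mean of MV from the t-test output
sd_x   <- 9.20       # standard deviation of MV from describe(), rounded

# Standard error of the mean and t-statistic against a true mean of 0.
se     <- sd_x / sqrt(n)
t_stat <- (mean_x - 0) / se

# 95% confidence interval using the t quantile with n - 1 degrees of freedom.
ci <- mean_x + c(-1, 1) * qt(0.975, df = n - 1) * se

t_stat
ci
```

The large t-value only says that the mean of MV is far from 0, not that MV depends on any predictor.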