Libraries installed for this project: tidyverse, reshape2, kableExtra.
We will be using the California housing data set (Kaggle site archive). Its columns, whose names are largely self-explanatory, are: 1. longitude, 2. latitude, 3. housing_median_age, 4. total_rooms, 5. total_bedrooms, 6. population, 7. households, 8. median_income, 9. median_house_value, 10. ocean_proximity (distance to the ocean: <1H OCEAN / INLAND / ISLAND / NEAR BAY / NEAR OCEAN).
Does house value depend on ocean_proximity, and does housing_median_age help predict the variation in house value?
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
---|---|---|---|---|---|---|---|---|---|
-122.23 | 37.88 | 41 | 880 | 129 | 322 | 126 | 8.3252 | 452600 | NEAR BAY |
-122.22 | 37.86 | 21 | 7099 | 1106 | 2401 | 1138 | 8.3014 | 358500 | NEAR BAY |
-122.24 | 37.85 | 52 | 1467 | 190 | 496 | 177 | 7.2574 | 352100 | NEAR BAY |
-122.25 | 37.85 | 52 | 1274 | 235 | 558 | 219 | 5.6431 | 341300 | NEAR BAY |
-122.25 | 37.85 | 52 | 1627 | 280 | 565 | 259 | 3.8462 | 342200 | NEAR BAY |
-122.25 | 37.85 | 52 | 919 | 213 | 413 | 193 | 4.0368 | 269700 | NEAR BAY |
## longitude latitude housing_median_age total_rooms
## Min. :-124.3 Min. :32.54 Min. : 1.00 Min. : 2
## 1st Qu.:-121.8 1st Qu.:33.93 1st Qu.:18.00 1st Qu.: 1448
## Median :-118.5 Median :34.26 Median :29.00 Median : 2127
## Mean :-119.6 Mean :35.63 Mean :28.64 Mean : 2636
## 3rd Qu.:-118.0 3rd Qu.:37.71 3rd Qu.:37.00 3rd Qu.: 3148
## Max. :-114.3 Max. :41.95 Max. :52.00 Max. :39320
##
## total_bedrooms population households median_income
## Min. : 1.0 Min. : 3 Min. : 1.0 Min. : 0.4999
## 1st Qu.: 296.0 1st Qu.: 787 1st Qu.: 280.0 1st Qu.: 2.5634
## Median : 435.0 Median : 1166 Median : 409.0 Median : 3.5348
## Mean : 537.9 Mean : 1425 Mean : 499.5 Mean : 3.8707
## 3rd Qu.: 647.0 3rd Qu.: 1725 3rd Qu.: 605.0 3rd Qu.: 4.7432
## Max. :6445.0 Max. :35682 Max. :6082.0 Max. :15.0001
## NA's :207
## median_house_value ocean_proximity
## Min. : 14999 <1H OCEAN :9136
## 1st Qu.:119600 INLAND :6551
## Median :179700 ISLAND : 5
## Mean :206856 NEAR BAY :2290
## 3rd Qu.:264725 NEAR OCEAN:2658
## Max. :500001
##
From the summary we can see that there are 207 NA’s in total_bedrooms, which need to be addressed during cleaning.
## Using ocean_proximity as id variables
The data need to be reshaped in order to aid exploration of the data and modeling to predict the house value.
Replace NA’s with the median for total_bedrooms. The median is used instead of mean because it is less influenced by outliers.
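The imputation step described above can be sketched as follows. This is a minimal illustration on a toy vector, not the report's actual code: the values are borrowed from the table and summary shown earlier, but the positions of the missing entries and the names `bedrooms_median` and `imputed` are assumptions made for the example.

```r
# Toy stand-in for housing_DS$total_bedrooms; the NA positions are invented
total_bedrooms <- c(129, 1106, NA, 235, 280, NA, 6445)

# Median and mean of the observed values only (na.rm = TRUE skips the NAs)
bedrooms_median <- median(total_bedrooms, na.rm = TRUE)
bedrooms_mean   <- mean(total_bedrooms, na.rm = TRUE)

# Overwrite only the missing entries with the median
imputed <- total_bedrooms
imputed[is.na(imputed)] <- bedrooms_median

print(bedrooms_median)  # 280
print(bedrooms_mean)    # 1639
print(imputed)
```

In this toy vector the single large value (6445) pulls the mean up to 1639 while the median stays at 280, which is the reason for preferring the median here.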
Specify the predictor (independent) variables, or features, from the housing data and the dependent variable, which is the house value.
## 'data.frame': 20640 obs. of 10 variables:
## $ longitude : num -122 -122 -122 -122 -122 ...
## $ latitude : num 37.9 37.9 37.9 37.9 37.9 ...
## $ housing_median_age: num 41 21 52 52 52 52 52 52 42 52 ...
## $ total_rooms : num 880 7099 1467 1274 1627 ...
## $ total_bedrooms : num 129 1106 190 235 280 ...
## $ population : num 322 2401 496 558 565 ...
## $ households : num 126 1138 177 219 259 ...
## $ median_income : num 8.33 8.3 7.26 5.64 3.85 ...
## $ median_house_value: num 452600 358500 352100 341300 342200 ...
## $ ocean_proximity : Factor w/ 5 levels "<1H OCEAN","INLAND",..: 4 4 4 4 4 4 4 4 4 4 ...
There are 20640 observations and 10 variables.
The response variable, median_house_value, is numerical. The explanatory variables are of two types:
Categorical: ocean_proximity (a 5-level factor)
Numerical: longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income
This is an observational study, as we are drawing inferences from already collected data and looking for correlations.
Each of the variables is explored for distribution, variance, and predictability. We will explore numeric variables.
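The correlation matrix below shows that most pairs of numerical variables are only weakly correlated. The clear exception is total_bedrooms and population (r ≈ 0.87), which are strongly correlated and so cannot be treated as independent; housing_median_age is moderately negatively correlated with both (about -0.3).
correlationNumerical = cor(housing_DS[, c("housing_median_age", "total_bedrooms", "population", "median_income")])
diag(correlationNumerical) = 0 # zero the diagonal to hide each variable's correlation with itself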
correlationNumerical
## housing_median_age total_bedrooms population
## housing_median_age 0.0000000 -0.319026332 -0.296244240
## total_bedrooms -0.3190263 0.000000000 0.873534861
## population -0.2962442 0.873534861 0.000000000
## median_income -0.1190340 -0.007616874 0.004834346
## median_income
## housing_median_age -0.119033990
## total_bedrooms -0.007616874
## population 0.004834346
## median_income 0.000000000
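Reading a correlation matrix by eye gets harder as columns are added, so one option is to list the strongly correlated pairs programmatically. The sketch below runs on the first six rows of the table shown earlier, which are far too few for a meaningful estimate and serve only to show the mechanics. The cutoff `threshold` and the names `sample_rows` and `strong_pairs` are assumptions for this example.

```r
# First six rows of the housing table, four numeric columns only
sample_rows <- data.frame(
  housing_median_age = c(41, 21, 52, 52, 52, 52),
  total_bedrooms     = c(129, 1106, 190, 235, 280, 213),
  population         = c(322, 2401, 496, 558, 565, 413),
  median_income      = c(8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368)
)

corr <- cor(sample_rows)

# Keep each pair once (upper triangle) and drop the diagonal
corr[lower.tri(corr, diag = TRUE)] <- NA

# Hypothetical cutoff for calling a pair "strongly correlated"
threshold <- 0.8
strong <- which(abs(corr) > threshold, arr.ind = TRUE)

strong_pairs <- data.frame(
  var1 = rownames(corr)[strong[, "row"]],
  var2 = colnames(corr)[strong[, "col"]],
  r    = corr[strong]
)
print(strong_pairs)
```

On the full data set the same code would take `housing_DS[, c(...)]` in place of `sample_rows`.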
## [1] NEAR BAY <1H OCEAN INLAND NEAR OCEAN ISLAND
## Levels: <1H OCEAN INLAND ISLAND NEAR BAY NEAR OCEAN
# Bar chart of the number of rows in each ocean_proximity category, one facet per category
ggplot(housing_DS, aes(ocean_proximity, fill = factor(ocean_proximity))) +
  geom_bar() +
  facet_grid(. ~ ocean_proximity) +
  ggtitle("House value, Ocean Proximity") +
  theme_classic() +
  # theme() comes after theme_classic(), which would otherwise reset it
  theme(axis.text.x = element_blank()) +
  labs(fill = "ocean_proximity")
We will consider two variables and test the relationship between them: median_house_value (i.e. price) and ocean_proximity.
H0: Ocean proximity has no effect on the price of the house.
Ha: Ocean proximity has a significant effect on the price of the house.
The p-value is less than 0.05, hence we can reject the null hypothesis.
##
## Call:
## lm(formula = median_house_value ~ factor(ocean_proximity), data = housing_DS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -236712 -66247 -21005 42273 375196
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 240084 1054 227.804 < 2e-16 ***
## factor(ocean_proximity)INLAND -115279 1631 -70.686 < 2e-16 ***
## factor(ocean_proximity)ISLAND 140356 45062 3.115 0.00184 **
## factor(ocean_proximity)NEAR BAY 19128 2354 8.125 4.71e-16 ***
## factor(ocean_proximity)NEAR OCEAN 9350 2220 4.212 2.55e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 100700 on 20635 degrees of freedom
## Multiple R-squared: 0.2381, Adjusted R-squared: 0.238
## F-statistic: 1612 on 4 and 20635 DF, p-value: < 2.2e-16
The model summary shows that the p-value is close to zero (< 2.2e-16), so we reject the null hypothesis: there is a statistically significant association between the price of a house (median_house_value) and the ocean_proximity variable.
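With a factor predictor, `lm` reports each level relative to a baseline level. In the output above the intercept (240084) is the mean value for <1H OCEAN, and the INLAND coefficient (-115279) puts the INLAND mean at 240084 - 115279 = 124805. The sketch below reproduces that structure on made-up numbers; the data frame `toy` and its prices are hypothetical and only three of the five levels are used.

```r
# Hypothetical toy data: three proximity groups with invented prices
toy <- data.frame(
  ocean_proximity = factor(
    rep(c("<1H OCEAN", "INLAND", "NEAR BAY"), each = 4),
    levels = c("<1H OCEAN", "INLAND", "NEAR BAY")  # first level is the baseline
  ),
  median_house_value = c(240, 250, 230, 260,   # <1H OCEAN
                         120, 130, 110, 140,   # INLAND
                         255, 265, 270, 250)   # NEAR BAY
)

fit <- lm(median_house_value ~ ocean_proximity, data = toy)

# Group means, for comparison with the fitted coefficients
group_means <- tapply(toy$median_house_value, toy$ocean_proximity, mean)
baseline <- levels(toy$ocean_proximity)[1]

# Overall F-test: H0 says all group means are equal
f <- summary(fit)$fstatistic
p_value <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

print(coef(fit))     # intercept = baseline mean, others = differences from it
print(group_means)
print(p_value < 0.05)
```

The overall p-value tests whether any group mean differs from the others; the per-level p-values in the coefficient table compare each level with the baseline only.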
The diagnostic plots above suggest that all three explanatory variables (housing_median_age, total_bedrooms, ocean_proximity) have an effect on median_house_value. The multiple R-squared of 0.238 means the model explains roughly 24% of the variation in median_house_value, so the relationship is statistically significant but only moderate in strength. These factors are therefore useful but incomplete predictors of house price. A next step would be to run inference (both theoretical and simulation-based), add more explanatory variables, and use backward elimination and forward selection to arrive at a better-fitting model.
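As a rough sketch of what backward elimination could look like, the example below uses base R's `step()`, which drops terms by AIC rather than by p-value, so it is one possible variant and not necessarily the procedure intended here. The built-in `mtcars` data set is only a stand-in for the housing data, and the choice of predictors is arbitrary.

```r
# mtcars ships with R and is only a stand-in for the housing data
full_model <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)

# Backward elimination by AIC: drop one term at a time while AIC improves
reduced_model <- step(full_model, direction = "backward", trace = 0)

full_terms <- attr(terms(full_model), "term.labels")
kept_terms <- attr(terms(reduced_model), "term.labels")

# R-squared is the share of variance in the response explained by the model
r_squared <- summary(reduced_model)$r.squared

print(kept_terms)
print(r_squared)
```

For the housing data the full model would start from `median_house_value ~ .` or a chosen subset of columns, and `direction = "forward"` with a `scope` argument would give forward selection instead.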