Libraries installed for this project: tidyverse, reshape2, kableExtra.

1 Problem statement

We will be using a data set on houses in California. Its columns, whose names are largely self-explanatory, are: 1. longitude, 2. latitude, 3. housing_median_age, 4. total_rooms, 5. total_bedrooms, 6. population, 7. households, 8. median_income, 9. median_house_value, 10. ocean_proximity (distance to the ocean, one of <1H OCEAN / INLAND / ISLAND / NEAR BAY / NEAR OCEAN). Source: Kaggle site archive.

Does house value depend on ocean_proximity, and does housing_median_age predict the variation in house value?

2 Load in the data

longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
-122.23 37.88 41 880 129 322 126 8.3252 452600 NEAR BAY
-122.22 37.86 21 7099 1106 2401 1138 8.3014 358500 NEAR BAY
-122.24 37.85 52 1467 190 496 177 7.2574 352100 NEAR BAY
-122.25 37.85 52 1274 235 558 219 5.6431 341300 NEAR BAY
-122.25 37.85 52 1627 280 565 259 3.8462 342200 NEAR BAY
-122.25 37.85 52 919 213 413 193 4.0368 269700 NEAR BAY
##    longitude         latitude     housing_median_age  total_rooms   
##  Min.   :-124.3   Min.   :32.54   Min.   : 1.00      Min.   :    2  
##  1st Qu.:-121.8   1st Qu.:33.93   1st Qu.:18.00      1st Qu.: 1448  
##  Median :-118.5   Median :34.26   Median :29.00      Median : 2127  
##  Mean   :-119.6   Mean   :35.63   Mean   :28.64      Mean   : 2636  
##  3rd Qu.:-118.0   3rd Qu.:37.71   3rd Qu.:37.00      3rd Qu.: 3148  
##  Max.   :-114.3   Max.   :41.95   Max.   :52.00      Max.   :39320  
##                                                                     
##  total_bedrooms     population      households     median_income    
##  Min.   :   1.0   Min.   :    3   Min.   :   1.0   Min.   : 0.4999  
##  1st Qu.: 296.0   1st Qu.:  787   1st Qu.: 280.0   1st Qu.: 2.5634  
##  Median : 435.0   Median : 1166   Median : 409.0   Median : 3.5348  
##  Mean   : 537.9   Mean   : 1425   Mean   : 499.5   Mean   : 3.8707  
##  3rd Qu.: 647.0   3rd Qu.: 1725   3rd Qu.: 605.0   3rd Qu.: 4.7432  
##  Max.   :6445.0   Max.   :35682   Max.   :6082.0   Max.   :15.0001  
##  NA's   :207                                                        
##  median_house_value   ocean_proximity
##  Min.   : 14999     <1H OCEAN :9136  
##  1st Qu.:119600     INLAND    :6551  
##  Median :179700     ISLAND    :   5  
##  Mean   :206856     NEAR BAY  :2290  
##  3rd Qu.:264725     NEAR OCEAN:2658  
##  Max.   :500001                      
## 

From the summary we can see that there are 207 NAs in total_bedrooms, which need to be addressed in cleaning.

## Using ocean_proximity as id variables

3 Clean the data

The data need to be reshaped in order to aid exploration of the data and modeling to predict the house value.

3.1 Imputing missing values

Replace the NAs in total_bedrooms with the median of that column. The median is used instead of the mean because it is less influenced by outliers.
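The imputation step described above can be sketched as follows. This is a minimal illustration in base R, not necessarily the code used in this report: the data frame `housing_toy` is a hypothetical stand-in built from five of the total_bedrooms values shown in the data preview, with one NA inserted by hand.

```r
# Hypothetical toy data: five total_bedrooms values from the data preview,
# plus one missing value added by hand for illustration.
housing_toy <- data.frame(
  total_bedrooms = c(129, 1106, NA, 235, 280, 213)
)

# Median of the observed (non-missing) values.
bedrooms_median <- median(housing_toy$total_bedrooms, na.rm = TRUE)

# Overwrite only the missing entries with that median.
missing_rows <- is.na(housing_toy$total_bedrooms)
housing_toy$total_bedrooms[missing_rows] <- bedrooms_median

# Compare with the mean, which the large value 1106 pulls upward.
bedrooms_mean <- mean(c(129, 1106, 235, 280, 213))

print(housing_toy$total_bedrooms)
print(c(median = bedrooms_median, mean = bedrooms_mean))
```

In this toy example the single large value moves the mean well above the median, which is the reason given above for preferring the median.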

4 Describe Data

Specify the predictor variables (independent variables, or features) from the housing data, and the dependent variable, which is the house value.

4.1 Cases

## 'data.frame':    20640 obs. of  10 variables:
##  $ longitude         : num  -122 -122 -122 -122 -122 ...
##  $ latitude          : num  37.9 37.9 37.9 37.9 37.9 ...
##  $ housing_median_age: num  41 21 52 52 52 52 52 52 42 52 ...
##  $ total_rooms       : num  880 7099 1467 1274 1627 ...
##  $ total_bedrooms    : num  129 1106 190 235 280 ...
##  $ population        : num  322 2401 496 558 565 ...
##  $ households        : num  126 1138 177 219 259 ...
##  $ median_income     : num  8.33 8.3 7.26 5.64 3.85 ...
##  $ median_house_value: num  452600 358500 352100 341300 342200 ...
##  $ ocean_proximity   : Factor w/ 5 levels "<1H OCEAN","INLAND",..: 4 4 4 4 4 4 4 4 4 4 ...

There are 20640 observations and 10 variables.

4.2 Dependent Variables

The response variable, median_house_value, is numerical. It is examined against a qualitative explanatory variable with two or more levels: ocean_proximity.

4.3 Independent Variables

Categorical: ocean_proximity (the data provide a 5-level variable for ocean proximity)
Numerical: longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income

4.4 Type of study

This is an observational study, as we are drawing inferences from already collected data and looking for correlations.

5 Exploratory Data Analysis

Each of the variables is explored for distribution, variance, and predictability. We will explore numeric variables.

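The correlation matrix below (with the diagonal shown as 0) indicates that most of the numerical (continuous) variables are only weakly correlated with each other. The exception is total_bedrooms and population, which are strongly correlated (about 0.87) and so cannot be treated as independent of each other.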

##                    housing_median_age total_bedrooms   population
## housing_median_age          0.0000000   -0.319026332 -0.296244240
## total_bedrooms             -0.3190263    0.000000000  0.873534861
## population                 -0.2962442    0.873534861  0.000000000
## median_income              -0.1190340   -0.007616874  0.004834346
##                    median_income
## housing_median_age  -0.119033990
## total_bedrooms      -0.007616874
## population           0.004834346
## median_income        0.000000000
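A correlation matrix of this shape can be produced along the following lines. This is a sketch under assumptions: `housing_toy` holds only the six rows shown in the data preview, so its correlations will differ from the full-data values above, and setting the diagonal to 0 is inferred from the printed output rather than taken from the original code.

```r
# Toy data: the six rows shown in the data preview, four numeric columns only.
housing_toy <- data.frame(
  housing_median_age = c(41, 21, 52, 52, 52, 52),
  total_bedrooms     = c(129, 1106, 190, 235, 280, 213),
  population         = c(322, 2401, 496, 558, 565, 413),
  median_income      = c(8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368)
)

# Pairwise Pearson correlations; rows with NAs are dropped pair by pair.
cor_matrix <- cor(housing_toy, use = "pairwise.complete.obs")

# Zero the diagonal so self-correlations (always 1) do not dominate the table.
diag(cor_matrix) <- 0

# List the variable pairs whose absolute correlation exceeds 0.8.
strong_pairs <- which(abs(cor_matrix) > 0.8 & upper.tri(cor_matrix), arr.ind = TRUE)

print(round(cor_matrix, 3))
print(strong_pairs)
```

Flagging pairs above a threshold (0.8 here, an arbitrary choice) makes strongly related predictors such as total_bedrooms and population easy to spot.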

6 Inference

We will consider two variables and examine the relationship between them: median_house_value (i.e. price) and ocean_proximity.

  1. Check the conditions:
    1. The data come from a simple random sample and represent less than 10% of the overall population. In this case the data appear to be randomly selected, and because this is a sample of California housing data we can be reasonably sure that it is less than 10% of the overall population.
    2. The observations are independent of each other. We assume that the observations are independent both across and within the ocean_proximity groups.

H0: Ocean proximity has no effect on the price of the house.
Ha: Ocean proximity has a significant effect on the price of the house.

The p-value is less than 0.05, hence we can reject the null hypothesis.

## 
## Call:
## lm(formula = median_house_value ~ factor(ocean_proximity), data = housing_DS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -236712  -66247  -21005   42273  375196 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         240084       1054 227.804  < 2e-16 ***
## factor(ocean_proximity)INLAND      -115279       1631 -70.686  < 2e-16 ***
## factor(ocean_proximity)ISLAND       140356      45062   3.115  0.00184 ** 
## factor(ocean_proximity)NEAR BAY      19128       2354   8.125 4.71e-16 ***
## factor(ocean_proximity)NEAR OCEAN     9350       2220   4.212 2.55e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 100700 on 20635 degrees of freedom
## Multiple R-squared:  0.2381, Adjusted R-squared:  0.238 
## F-statistic:  1612 on 4 and 20635 DF,  p-value: < 2.2e-16
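To show how output like the above arises, here is a small sketch of the same kind of model on made-up data. The data frame `toy` and all of its values are invented for illustration only; the formula mirrors the call shown in the output. With R's default treatment contrasts, the intercept is the mean of the reference level (here <1H OCEAN) and each other coefficient is that level's difference from it.

```r
# Invented toy data (NOT the housing data): three ocean_proximity levels,
# three made-up prices per level.
toy <- data.frame(
  median_house_value = c(250000, 230000, 240000,
                         120000, 130000, 125000,
                         260000, 255000, 265000),
  ocean_proximity = factor(
    rep(c("<1H OCEAN", "INLAND", "NEAR BAY"), each = 3),
    levels = c("<1H OCEAN", "INLAND", "NEAR BAY")  # first level is the reference
  )
)

# Same model form as in the output above.
fit <- lm(median_house_value ~ factor(ocean_proximity), data = toy)

# Group means, for comparison with the coefficients.
group_means <- tapply(toy$median_house_value, toy$ocean_proximity, mean)

# Overall F-test p-value for H0: all group means are equal.
f_p_value <- anova(fit)[["Pr(>F)"]][1]

print(coef(fit))
print(group_means)
print(f_p_value)
```

Read this way, the real output says the INLAND group mean is about 115279 below the <1H OCEAN mean of 240084, and the F-statistic tests whether all five group means are equal.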

The model summary shows that the p-value is nearly zero, so we can reject the null hypothesis. This indicates a statistically significant relationship between house price (median_house_value) and the ocean_proximity variable.

7 Conclusion

The diagnostic plots above suggest that all three explanatory variables (housing_median_age, total_bedrooms, ocean_proximity) are related to median_house_value. The multiple R-squared of 0.238 means that the ocean_proximity model explains about 24% of the variation in house price (median_house_value), which is a moderate rather than a strong relationship. These factors can therefore be useful predictors of house price, although they leave most of the variation unexplained. A possible next step would be to run inference (both theoretical and simulation-based), add more explanatory variables, and use backward elimination and forward selection to arrive at a well-fitting model.
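As a rough sketch of that next step, base R's `step()` can perform backward elimination and forward selection. Two assumptions to note: `step()` compares models by AIC, which may not be the criterion intended here, and the data frame `sim` below is simulated purely for illustration (including a deliberately irrelevant `noise` column), so its results say nothing about the real housing data.

```r
set.seed(1)
n <- 200

# Simulated stand-in data; column names borrow from the housing data,
# values are randomly generated.
sim <- data.frame(
  housing_median_age = runif(n, 1, 52),
  median_income      = runif(n, 0.5, 15),
  noise              = rnorm(n),  # irrelevant predictor by construction
  ocean_proximity    = factor(sample(c("<1H OCEAN", "INLAND", "NEAR BAY"),
                                     n, replace = TRUE))
)
sim$median_house_value <- 50000 + 30000 * sim$median_income +
  ifelse(sim$ocean_proximity == "INLAND", -60000, 0) +
  rnorm(n, sd = 40000)

# Largest and smallest models considered.
full_model <- lm(median_house_value ~ housing_median_age + median_income +
                   noise + ocean_proximity, data = sim)
null_model <- lm(median_house_value ~ 1, data = sim)

# Backward elimination: start from the full model and drop terms.
backward_fit <- step(full_model, direction = "backward", trace = 0)

# Forward selection: start from the intercept-only model and add terms.
forward_fit <- step(null_model,
                    scope = list(lower = ~ 1, upper = formula(full_model)),
                    direction = "forward", trace = 0)

print(formula(backward_fit))
print(formula(forward_fit))
```

Comparing the two selected formulas is a quick check on whether the choice of search direction changes the final model.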