info
I found the data from kaggle and download locally, for the purpose of code repreduction, it would better to download the data set and save it with this R markdwon file. Finally, set the working directory where you save these files.
Read data
housing <- read.csv('housing.csv')
# what is in the data set
head(housing)## longitude latitude housing_median_age total_rooms total_bedrooms population
## 1 -122.23 37.88 41 880 129 322
## 2 -122.22 37.86 21 7099 1106 2401
## 3 -122.24 37.85 52 1467 190 496
## 4 -122.25 37.85 52 1274 235 558
## 5 -122.25 37.85 52 1627 280 565
## 6 -122.25 37.85 52 919 213 413
## households median_income median_house_value ocean_proximity
## 1 126 8.3252 452600 NEAR BAY
## 2 1138 8.3014 358500 NEAR BAY
## 3 177 7.2574 352100 NEAR BAY
## 4 219 5.6431 341300 NEAR BAY
## 5 259 3.8462 342200 NEAR BAY
## 6 193 4.0368 269700 NEAR BAY
From the first few row of data, there are 10 independent variables whose name are straight forward.
# out of curiousity, wanna see the distribution of these house
library(dplyr)##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
ggplot(data = housing, aes(x = ocean_proximity)) + geom_bar() + labs( title = 'house location of distribution', x = 'location', y = 'number of housing') # Explore data
summary
# number of observations
nrow(housing)## [1] 20640
# summary
summary(housing)## longitude latitude housing_median_age total_rooms
## Min. :-124.3 Min. :32.54 Min. : 1.00 Min. : 2
## 1st Qu.:-121.8 1st Qu.:33.93 1st Qu.:18.00 1st Qu.: 1448
## Median :-118.5 Median :34.26 Median :29.00 Median : 2127
## Mean :-119.6 Mean :35.63 Mean :28.64 Mean : 2636
## 3rd Qu.:-118.0 3rd Qu.:37.71 3rd Qu.:37.00 3rd Qu.: 3148
## Max. :-114.3 Max. :41.95 Max. :52.00 Max. :39320
##
## total_bedrooms population households median_income
## Min. : 1.0 Min. : 3 Min. : 1.0 Min. : 0.4999
## 1st Qu.: 296.0 1st Qu.: 787 1st Qu.: 280.0 1st Qu.: 2.5634
## Median : 435.0 Median : 1166 Median : 409.0 Median : 3.5348
## Mean : 537.9 Mean : 1425 Mean : 499.5 Mean : 3.8707
## 3rd Qu.: 647.0 3rd Qu.: 1725 3rd Qu.: 605.0 3rd Qu.: 4.7432
## Max. :6445.0 Max. :35682 Max. :6082.0 Max. :15.0001
## NA's :207
## median_house_value ocean_proximity
## Min. : 14999 Length:20640
## 1st Qu.:119600 Class :character
## Median :179700 Mode :character
## Mean :206856
## 3rd Qu.:264725
## Max. :500001
##
we see that the max value for variable total_rooms, total_bedrooms, population, households, median_income are multiple times larger than median and mean value. It looks weird. So I double check the description of meanings of these variables, the measurement unit of those variables is per block. That makes sense to me. Depends on different location, housing type can be vary.
subseting
from previous summary, we see that only column total_bedrooms contains 207 NAs. Regarless of the total number of observations, 207 is a small amount which we can discard, and there will still be 20K data available.
at the same time, I would like to see the relationships among these independent variable. Therefore, I subset the data which exclude categorical variable.
# create subset of housing
sub_housing <- housing[ , 1:(ncol(housing)-1)]
# remove NA
sub_housing <- sub_housing %>% filter(!is.na(total_bedrooms))
pairs(sub_housing) from the plot, we see that median_house_value doesn’t seem to have relationship with house_age.
predictors
now I am going to have some predictor. As I mention before, house_age is not useful, so exclude it for sure. instead of using longitude and latitude, I am going to keep ocean_proximity since it is more general for me. category ‘ISLAND’ from ocean_proximity only has 5 values, it will def cause skewness, so I remove that category as well.
# forget remove NA in whole data :(
housing <- housing %>% filter(!is.na(housing)) %>% filter(ocean_proximity != 'ISLAND')
# first model
housing.lm <- lm(median_house_value ~ total_rooms + total_bedrooms + population + households + median_income + ocean_proximity, data = housing)see from summary of linear model, the median of residual is far way from zero, that is bad, even though p-values for each variable are below the threshold and \(R^2\) explains 62% of data.
summary(housing.lm)##
## Call:
## lm(formula = median_house_value ~ total_rooms + total_bedrooms +
## population + households + median_income + ocean_proximity,
## data = housing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -530093 -43144 -11176 28039 780198
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.189e+04 1.805e+03 39.828 <2e-16 ***
## total_rooms -8.675e+00 8.136e-01 -10.661 <2e-16 ***
## total_bedrooms 6.894e+01 7.021e+00 9.819 <2e-16 ***
## population -3.764e+01 1.106e+00 -34.016 <2e-16 ***
## households 8.598e+01 7.603e+00 11.308 <2e-16 ***
## median_income 3.927e+04 3.430e+02 114.497 <2e-16 ***
## ocean_proximityINLAND -7.495e+04 1.252e+03 -59.869 <2e-16 ***
## ocean_proximityNEAR BAY 1.389e+04 1.685e+03 8.244 <2e-16 ***
## ocean_proximityNEAR OCEAN 1.304e+04 1.583e+03 8.236 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 71010 on 20419 degrees of freedom
## (207 observations deleted due to missingness)
## Multiple R-squared: 0.6216, Adjusted R-squared: 0.6215
## F-statistic: 4193 on 8 and 20419 DF, p-value: < 2.2e-16
look into residuals
welp, clearly there is something going on. the varience is not constant. majority residuals cluster together, and a few leverage points.
plot(fitted(housing.lm), resid(housing.lm))
abline(housing.lm)## Warning in abline(housing.lm): only using the first two of 9 regression
## coefficients
Last hope :(
QQ plot shows more clear.
qqnorm(resid(housing.lm))
qqline(resid(housing.lm))conclusion
unfortunely, the data set I found, it cannot use linear regression to build model. Well, maybe there are some other regression works such as logistic regression?