info

I found the data from kaggle and download locally, for the purpose of code repreduction, it would better to download the data set and save it with this R markdwon file. Finally, set the working directory where you save these files.

Read data

housing <- read.csv('housing.csv')

# what is in the data set
head(housing)

##   longitude latitude housing_median_age total_rooms total_bedrooms population
## 1   -122.23    37.88                 41         880            129        322
## 2   -122.22    37.86                 21        7099           1106       2401
## 3   -122.24    37.85                 52        1467            190        496
## 4   -122.25    37.85                 52        1274            235        558
## 5   -122.25    37.85                 52        1627            280        565
## 6   -122.25    37.85                 52         919            213        413
##   households median_income median_house_value ocean_proximity
## 1        126        8.3252             452600        NEAR BAY
## 2       1138        8.3014             358500        NEAR BAY
## 3        177        7.2574             352100        NEAR BAY
## 4        219        5.6431             341300        NEAR BAY
## 5        259        3.8462             342200        NEAR BAY
## 6        193        4.0368             269700        NEAR BAY

From the first few row of data, there are 10 independent variables whose name are straight forward.

# out of curiousity, wanna see the distribution of these house
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
ggplot(data = housing, aes(x = ocean_proximity)) + geom_bar() + labs( title = 'house location of distribution', x = 'location', y = 'number of housing')

# Explore data

summary

# number of observations
nrow(housing)

## [1] 20640

# summary
summary(housing)

##    longitude         latitude     housing_median_age  total_rooms   
##  Min.   :-124.3   Min.   :32.54   Min.   : 1.00      Min.   :    2  
##  1st Qu.:-121.8   1st Qu.:33.93   1st Qu.:18.00      1st Qu.: 1448  
##  Median :-118.5   Median :34.26   Median :29.00      Median : 2127  
##  Mean   :-119.6   Mean   :35.63   Mean   :28.64      Mean   : 2636  
##  3rd Qu.:-118.0   3rd Qu.:37.71   3rd Qu.:37.00      3rd Qu.: 3148  
##  Max.   :-114.3   Max.   :41.95   Max.   :52.00      Max.   :39320  
##                                                                     
##  total_bedrooms     population      households     median_income    
##  Min.   :   1.0   Min.   :    3   Min.   :   1.0   Min.   : 0.4999  
##  1st Qu.: 296.0   1st Qu.:  787   1st Qu.: 280.0   1st Qu.: 2.5634  
##  Median : 435.0   Median : 1166   Median : 409.0   Median : 3.5348  
##  Mean   : 537.9   Mean   : 1425   Mean   : 499.5   Mean   : 3.8707  
##  3rd Qu.: 647.0   3rd Qu.: 1725   3rd Qu.: 605.0   3rd Qu.: 4.7432  
##  Max.   :6445.0   Max.   :35682   Max.   :6082.0   Max.   :15.0001  
##  NA's   :207                                                        
##  median_house_value ocean_proximity   
##  Min.   : 14999     Length:20640      
##  1st Qu.:119600     Class :character  
##  Median :179700     Mode  :character  
##  Mean   :206856                       
##  3rd Qu.:264725                       
##  Max.   :500001                       
##

we see that the max value for variable total_rooms, total_bedrooms, population, households, median_income are multiple times larger than median and mean value. It looks weird. So I double check the description of meanings of these variables, the measurement unit of those variables is per block. That makes sense to me. Depends on different location, housing type can be vary.

subseting

from previous summary, we see that only column total_bedrooms contains 207 NAs. Regarless of the total number of observations, 207 is a small amount which we can discard, and there will still be 20K data available.

at the same time, I would like to see the relationships among these independent variable. Therefore, I subset the data which exclude categorical variable.

# create subset of housing
sub_housing <- housing[ , 1:(ncol(housing)-1)]

# remove NA
sub_housing <- sub_housing %>% filter(!is.na(total_bedrooms))

pairs(sub_housing)

from the plot, we see that median_house_value doesn’t seem to have relationship with house_age.

predictors

now I am going to have some predictor. As I mention before, house_age is not useful, so exclude it for sure. instead of using longitude and latitude, I am going to keep ocean_proximity since it is more general for me. category ‘ISLAND’ from ocean_proximity only has 5 values, it will def cause skewness, so I remove that category as well.

# forget remove NA in whole data :(
housing <- housing %>% filter(!is.na(housing)) %>% filter(ocean_proximity != 'ISLAND')

# first model
housing.lm <- lm(median_house_value ~ total_rooms + total_bedrooms + population + households + median_income + ocean_proximity, data = housing)

see from summary of linear model, the median of residual is far way from zero, that is bad, even though p-values for each variable are below the threshold and \(R^2\) explains 62% of data.

summary(housing.lm)

## 
## Call:
## lm(formula = median_house_value ~ total_rooms + total_bedrooms + 
##     population + households + median_income + ocean_proximity, 
##     data = housing)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -530093  -43144  -11176   28039  780198 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                7.189e+04  1.805e+03  39.828   <2e-16 ***
## total_rooms               -8.675e+00  8.136e-01 -10.661   <2e-16 ***
## total_bedrooms             6.894e+01  7.021e+00   9.819   <2e-16 ***
## population                -3.764e+01  1.106e+00 -34.016   <2e-16 ***
## households                 8.598e+01  7.603e+00  11.308   <2e-16 ***
## median_income              3.927e+04  3.430e+02 114.497   <2e-16 ***
## ocean_proximityINLAND     -7.495e+04  1.252e+03 -59.869   <2e-16 ***
## ocean_proximityNEAR BAY    1.389e+04  1.685e+03   8.244   <2e-16 ***
## ocean_proximityNEAR OCEAN  1.304e+04  1.583e+03   8.236   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 71010 on 20419 degrees of freedom
##   (207 observations deleted due to missingness)
## Multiple R-squared:  0.6216, Adjusted R-squared:  0.6215 
## F-statistic:  4193 on 8 and 20419 DF,  p-value: < 2.2e-16

look into residuals

welp, clearly there is something going on. the varience is not constant. majority residuals cluster together, and a few leverage points.

plot(fitted(housing.lm), resid(housing.lm))
abline(housing.lm)

## Warning in abline(housing.lm): only using the first two of 9 regression
## coefficients

Last hope :(

QQ plot shows more clear.

qqnorm(resid(housing.lm))
qqline(resid(housing.lm))

conclusion

unfortunely, the data set I found, it cannot use linear regression to build model. Well, maybe there are some other regression works such as logistic regression?

605: discussion12

Jie Zou

2021-11-11