DATA 607 Final Project: House price prediction

# Load libraries

library(tidyverse)

## Warning: package 'ggplot2' was built under R version 4.0.3

library(plyr)
library(kableExtra)
library(plotly)

## Warning: package 'plotly' was built under R version 4.0.3

library(corrplot)

## Warning: package 'corrplot' was built under R version 4.0.3

library(PerformanceAnalytics)

## Warning: package 'PerformanceAnalytics' was built under R version 4.0.3

## Warning: package 'xts' was built under R version 4.0.3

## Warning: package 'zoo' was built under R version 4.0.3

library(stats)
library(xgboost)

## Warning: package 'xgboost' was built under R version 4.0.3

Problem Statement and background

One of the big decisions to make in life is purchasing a home. There are many factors that need to be taken into consideration while making such a decision. In this project, we tend to retrieve data from different data sources. The data will have metrics such as median income, unemployment rate, public schools, hospitals, hospital ratings, crime rate, … Our task will be to get data from those multiple data sources using different methods (read csv, web scrapping…) learned throughout the course of this class, to store them (on a database, cloud,…), to clean and transform them, and to analyze and visualize them to get some useful information for houses price. We will then go further on using different models to predict the house price based on different features that we would find are necessary and weight on houses price.

One of the big questions that we ask ourselves is : “Which variable is a best predictor of housing prices?”

Work process

Process <- c('Data Collection',
             'Data Transformation (Cleaning & Tidying data)',
             'Data Analysis', 'Visualization',
             'Modeling','Conclusion', 'Presentation')

Team <- c('Jered & Zhouxin', 'Jered & Zhouxin', 'Jered ', 
          'Jered', 'Jered', 'Jered & Zhouxin',
          'Jered & Zhouxin')
df_team <- data.frame(Process, Team)
names(df_team) <- c('Process', 'Team Members')

df_team

##                                         Process    Team Members
## 1                               Data Collection Jered & Zhouxin
## 2 Data Transformation (Cleaning & Tidying data) Jered & Zhouxin
## 3                                 Data Analysis          Jered 
## 4                                 Visualization           Jered
## 5                                      Modeling           Jered
## 6                                    Conclusion Jered & Zhouxin
## 7                                  Presentation Jered & Zhouxin

Cleaning the data

# Get raw data from Github

data_raw <- read.csv("data_raw_final.csv")

head(data_raw)

##   FIPS AverageHousePrice CountyName StateName UnemploymentRate NumberOfSchools
## 1 1001          114483.7    Autauga   Alabama              3.6              15
## 2 1003          164861.7    Baldwin   Alabama              3.6              47
## 3 1005                NA    Barbour   Alabama              5.3              11
## 4 1013                NA     Butler   Alabama              4.8               8
## 5 1015                NA    Calhoun   Alabama              4.7              40
## 6 1019                NA   Cherokee   Alabama              3.7               8
##   NumberOfHospitals AverageHospitalRating
## 1                 1              4.000000
## 2                 4              3.000000
## 3                 1              3.000000
## 4                 2              3.000000
## 5                 4              2.666667
## 6                 1              4.000000

# Drop rows with missing Average house price

data_clean <- data_raw %>%
    drop_na(AverageHousePrice)

# Move the target in the end

data_final <- data_clean %>%
    select(-AverageHousePrice, AverageHousePrice)

head(data_final)

##   FIPS CountyName StateName UnemploymentRate NumberOfSchools NumberOfHospitals
## 1 1001    Autauga   Alabama              3.6              15                 1
## 2 1003    Baldwin   Alabama              3.6              47                 4
## 3 1033    Colbert   Alabama              4.7              27                 2
## 4 1049    De Kalb   Alabama              3.8              20                 1
## 5 1051     Elmore   Alabama              3.4              20                 2
## 6 1055     Etowah   Alabama              4.1              45                 4
##   AverageHospitalRating AverageHousePrice
## 1                  4.00         114483.67
## 2                  3.00         164861.69
## 3                  2.50          92332.05
## 4                  3.00          98315.49
## 5                  3.40         125561.37
## 6                  2.75          77994.25

Exploratory Data Analysis

Which state can you effort living ?

We are going to calculate the average county home price per state.

data_1 <- data_final %>%
    group_by(StateName) %>%
    transmute(StateName, avg_house_price = mean(AverageHousePrice))

data_1 <- data_1 %>%
    distinct(StateName, avg_house_price)

data_1

## # A tibble: 50 x 2
## # Groups:   StateName [50]
##    StateName            avg_house_price
##    <chr>                          <dbl>
##  1 Alabama                      111315.
##  2 Alaska                       231293.
##  3 Arizona                      154756.
##  4 Arkansas                      96635.
##  5 California                   313204.
##  6 Colorado                     246091.
##  7 Connecticut                  222477.
##  8 District of Columbia         351877.
##  9 Florida                      147434.
## 10 Georgia                      120758.
## # ... with 40 more rows

Now we are going to order the state from the least affordable to the most affordable

data_2 <- data_1 %>%
    arrange(desc(avg_house_price))

data_2 %>%
    kbl(caption = "Home price per state") %>%
    kable_material(c("striped", "hover")) %>%
    row_spec(0, color = "indigo")

Home price per state
StateName	avg_house_price
Hawaii	367663.35
District of Columbia	351876.68
California	313204.10
Massachusetts	252502.39
Colorado	246090.52
New Jersey	240695.04
Rhode Island	232272.22
Alaska	231292.74
Connecticut	222477.39
Maryland	219910.15
Nevada	211545.48
Vermont	197167.04
Washington	187952.16
Virginia	184240.35
Wyoming	178689.44
Oregon	177559.35
Utah	175536.27
New Hampshire	174137.05
New Mexico	173700.21
South Dakota	173370.37
New York	173113.39
Maine	159582.94
Idaho	159432.27
Montana	155786.82
Arizona	154756.12
Minnesota	153400.68
Wisconsin	152580.67
Florida	147433.63
North Dakota	146838.69
Pennsylvania	122026.17
Georgia	120757.50
North Carolina	118907.18
Texas	118704.49
Missouri	117672.74
South Carolina	117201.77
Kentucky	115372.95
Michigan	114320.94
Nebraska	112512.59
Alabama	111315.46
Illinois	110639.05
Iowa	108837.37
Ohio	107811.92
Louisiana	106155.52
West Virginia	104864.69
Mississippi	103178.53
Indiana	99398.11
Arkansas	96635.30
Tennessee	96511.72
Kansas	92633.36
Oklahoma	84037.59

Visualization

Visualize the least affordable states based on average county home price per state

# Top 10

top_n(ungroup(data_2), 10) %>%
    ggplot(aes(reorder(StateName, avg_house_price), avg_house_price)) +
    geom_col(aes(fill = avg_house_price)) +
    
    coord_flip() +
    
    labs(title = '10 most expensive state to buy a house', x = "State")

## Selecting by avg_house_price

Visualize the most affordable states based on average county home price per state

# Top 10

top_n(ungroup(data_2), -10) %>%
    ggplot(aes(reorder(StateName, avg_house_price), avg_house_price)) +
    geom_bar(stat="identity", color="blue", fill="purple") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=0.5)) +
    
    
    labs(title = '10 least expensive state to buy a house', x = "State")

## Selecting by avg_house_price

Visualize the average house price by state

# Add state abbreviation 

data_3 <- data_1 %>%
    mutate(code = state.abb[match(StateName,  state.name)])


# Plot the map
w <- list(color = toRGB("white"), width = 2)
g <- list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('white')
)
p <- plot_geo(data_3, locationmode = 'USA-states') %>%
  add_trace(
    z = ~avg_house_price, locations = ~code,
    color = ~avg_house_price, colors = 'Purples'
  ) %>%
  colorbar(title = "Avg house price") %>%
  layout(
    title = 'Avg house price by State',
    geo = g
  )

## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

Modeling

For this predictive analysis, we are going to use Linear Regression.

Since wee have many explanatory variables, this case will be a multiple linear regression model.

The explanatory variables or predictors are: unemployment rate, number of schools, number of hospitals, average hospital ratings. Our response variable is the average house price.

Our research question is “which variable is a best predictor of average house price?”

We are going then to test around the relationship between housing prices and each one of the predictors. The model selection will be based on adjusted R square. Thus, we are going to apply “backward-selection”.

The general idea behind backward-selection is to start with the full model and eliminate one variable at a time until the ideal model is reached: Start with the full model, refit all possible models omitting one variable at a time, and choose the model with the highest adjusted R squared, repeat until maximum possible adjusted R squared is reached.

Subset the data set with numeric variables to get it ready for modeling

# Subset the data set with only numerical variables

data_4 <- data_final %>%
    group_by(StateName) %>%
    transmute(StateName,
              UnemploymentRate = round(mean(UnemploymentRate), 0),
              NumberOfSchools = sum(NumberOfSchools),
              NumberOfHospitals = sum(NumberOfHospitals),
              AverageHospitalRating = round(mean(AverageHospitalRating), 0),
              avg_house_price = mean(AverageHousePrice))

data_4 <- data_4 %>%
    distinct(StateName, UnemploymentRate, NumberOfSchools, NumberOfHospitals, AverageHospitalRating, avg_house_price)

data_4 <- subset(data_4, select = -c(StateName))
head(data_4)

## # A tibble: 6 x 5
##   UnemploymentRate NumberOfSchools NumberOfHospita~ AverageHospital~
##              <dbl>           <int>            <int>            <dbl>
## 1                4             731               51                3
## 2                6             206               17                3
## 3                7            2323              133                3
## 4                4             542               68                3
## 5                5            9942              544                3
## 6                3            1567               90                4
## # ... with 1 more variable: avg_house_price <dbl>

Correlation matrix

res <- cor(data_4)
round(res, 2)

##                       UnemploymentRate NumberOfSchools NumberOfHospitals
## UnemploymentRate                  1.00            0.23              0.22
## NumberOfSchools                   0.23            1.00              0.95
## NumberOfHospitals                 0.22            0.95              1.00
## AverageHospitalRating            -0.42           -0.02             -0.03
## avg_house_price                   0.07            0.12             -0.02
##                       AverageHospitalRating avg_house_price
## UnemploymentRate                      -0.42            0.07
## NumberOfSchools                       -0.02            0.12
## NumberOfHospitals                     -0.03           -0.02
## AverageHospitalRating                  1.00           -0.22
## avg_house_price                       -0.22            1.00

Performance analytics

data_4 %>%
    chart.Correlation(histogram=TRUE, pch=19)

Summary table of correlation between predictors and house price

Features <- c('NumberOfSchools', 'NumberOfHospitals', 'AverageHospitalRating', 'UnemploymentRate')
Correlation <- c(0.12, -0.02, -0.22, 0.07)

df <- data.frame(Features, Correlation)
df

##                Features Correlation
## 1       NumberOfSchools        0.12
## 2     NumberOfHospitals       -0.02
## 3 AverageHospitalRating       -0.22
## 4      UnemploymentRate        0.07

Multiple linear model:

We are going to evaluate the avg_house_price with each of the predictors

Since we are using backward-selection, let first start with full model:

res_mul <- lm(avg_house_price ~  NumberOfSchools + NumberOfHospitals + AverageHospitalRating + UnemploymentRate, data = data_4)
summary(res_mul)

## 
## Call:
## lm(formula = avg_house_price ~ NumberOfSchools + NumberOfHospitals + 
##     AverageHospitalRating + UnemploymentRate, data = data_4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -104171  -43067    -280   19905  192927 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)   
## (Intercept)           307782.82   91954.46   3.347  0.00166 **
## NumberOfSchools           51.88      16.13   3.216  0.00241 **
## NumberOfHospitals       -811.62     263.03  -3.086  0.00347 **
## AverageHospitalRating -39218.47   22684.77  -1.729  0.09069 . 
## UnemploymentRate       -3794.32    9168.35  -0.414  0.68095   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 59780 on 45 degrees of freedom
## Multiple R-squared:  0.2275, Adjusted R-squared:  0.1588 
## F-statistic: 3.313 on 4 and 45 DF,  p-value: 0.0184

Now, let analyze each individual predictor with house price:

avg_house_price~UnemploymentRate

res_1 <- lm(avg_house_price ~ UnemploymentRate, data = data_4)
summary(res_1)

## 
## Call:
## lm(formula = avg_house_price ~ UnemploymentRate, data = data_4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -79759 -49213  -8008  22744 212819 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        145893      36797   3.965 0.000244 ***
## UnemploymentRate     4476       8857   0.505 0.615632    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 65670 on 48 degrees of freedom
## Multiple R-squared:  0.005292,   Adjusted R-squared:  -0.01543 
## F-statistic: 0.2554 on 1 and 48 DF,  p-value: 0.6156

avg_house_price~NumberOfSchools

res_2 <- lm(avg_house_price ~ NumberOfSchools, data = data_4)
summary(res_2)

## 
## Call:
## lm(formula = avg_house_price ~ NumberOfSchools, data = data_4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -78233 -50871 -10537  21548 209009 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.574e+05  1.221e+04  12.896   <2e-16 ***
## NumberOfSchools 4.600e+00  5.652e+00   0.814     0.42    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 65400 on 48 degrees of freedom
## Multiple R-squared:  0.01362,    Adjusted R-squared:  -0.006934 
## F-statistic: 0.6626 on 1 and 48 DF,  p-value: 0.4197

avg_house_price~NumberOfHospitals

res_3 <- lm(avg_house_price ~ NumberOfHospitals, data = data_4)
summary(res_3)

## 
## Call:
## lm(formula = avg_house_price ~ NumberOfHospitals, data = data_4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -79799 -51291  -9966  23105 202933 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       165048.36   13088.48  12.610   <2e-16 ***
## NumberOfHospitals    -11.77      93.12  -0.126      0.9    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 65840 on 48 degrees of freedom
## Multiple R-squared:  0.0003325,  Adjusted R-squared:  -0.02049 
## F-statistic: 0.01597 on 1 and 48 DF,  p-value: 0.9

avg_house_price~AverageHospitalRating

res_4 <- lm(avg_house_price ~ AverageHospitalRating, data = data_4)
summary(res_4)

## 
## Call:
## lm(formula = avg_house_price ~ AverageHospitalRating, data = data_4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -83312 -51520  -9665  25213 200313 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             271267      68881   3.938 0.000265 ***
## AverageHospitalRating   -34639      22026  -1.573 0.122364    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 64220 on 48 degrees of freedom
## Multiple R-squared:  0.049,  Adjusted R-squared:  0.02919 
## F-statistic: 2.473 on 1 and 48 DF,  p-value: 0.1224

Summary table

Features <- c('NumberOfSchools', 'NumberOfHospitals', 'AverageHospitalRating', 'UnemploymentRate')
Correlation <- c(0.12, -0.02, -0.22, 0.07)
P_values <- c(0.00241, 0.00347, 0.09069,  0.68095)
Adj_r_square <- c(-0.006934, -0.02049,  0.02919, -0.01543)

df_final <- data.frame(Features, Correlation, P_values, Adj_r_square)
df_final

##                Features Correlation P_values Adj_r_square
## 1       NumberOfSchools        0.12  0.00241    -0.006934
## 2     NumberOfHospitals       -0.02  0.00347    -0.020490
## 3 AverageHospitalRating       -0.22  0.09069     0.029190
## 4      UnemploymentRate        0.07  0.68095    -0.015430

Conclusion

As to answer to our main question to test around relationship between housing prices and each predictors, We realize that the house price has a positive relationship with the number of schools and the unemployment rate. Those are also the two big factors (from the features we explored) that contribute the most to the price of home in US. The two other factors have a negative relationship with house prices. Though the predictors we used, we should have taken into consideration the crime rate which we believe should be a great predictor of home price. This will be part of further work we will have to do to make this model more efficient. We need also to mention that the project presented some challenges such as we needed to find appropriate data set for various factors, understand different terms such FIPS which we never heard before, merge different data set to make one useful data set for analysis and prediction.

References

CUNY DATA606: https://fall2020.data606.net/chapters/chapter9/