# Load libraries
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.0.3
library(plyr)
library(kableExtra)
library(plotly)
## Warning: package 'plotly' was built under R version 4.0.3
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.0.3
library(PerformanceAnalytics)
## Warning: package 'PerformanceAnalytics' was built under R version 4.0.3
## Warning: package 'xts' was built under R version 4.0.3
## Warning: package 'zoo' was built under R version 4.0.3
library(stats)
library(xgboost)
## Warning: package 'xgboost' was built under R version 4.0.3
One of the big decisions to make in life is purchasing a home. There are many factors that need to be taken into consideration while making such a decision. In this project, we tend to retrieve data from different data sources. The data will have metrics such as median income, unemployment rate, public schools, hospitals, hospital ratings, crime rate, … Our task will be to get data from those multiple data sources using different methods (read csv, web scrapping…) learned throughout the course of this class, to store them (on a database, cloud,…), to clean and transform them, and to analyze and visualize them to get some useful information for houses price. We will then go further on using different models to predict the house price based on different features that we would find are necessary and weight on houses price.
One of the big questions that we ask ourselves is : “Which variable is a best predictor of housing prices?”
Process <- c('Data Collection',
'Data Transformation (Cleaning & Tidying data)',
'Data Analysis', 'Visualization',
'Modeling','Conclusion', 'Presentation')
Team <- c('Jered & Zhouxin', 'Jered & Zhouxin', 'Jered ',
'Jered', 'Jered', 'Jered & Zhouxin',
'Jered & Zhouxin')
df_team <- data.frame(Process, Team)
names(df_team) <- c('Process', 'Team Members')
df_team
## Process Team Members
## 1 Data Collection Jered & Zhouxin
## 2 Data Transformation (Cleaning & Tidying data) Jered & Zhouxin
## 3 Data Analysis Jered
## 4 Visualization Jered
## 5 Modeling Jered
## 6 Conclusion Jered & Zhouxin
## 7 Presentation Jered & Zhouxin
# Get raw data from Github
data_raw <- read.csv("data_raw_final.csv")
head(data_raw)
## FIPS AverageHousePrice CountyName StateName UnemploymentRate NumberOfSchools
## 1 1001 114483.7 Autauga Alabama 3.6 15
## 2 1003 164861.7 Baldwin Alabama 3.6 47
## 3 1005 NA Barbour Alabama 5.3 11
## 4 1013 NA Butler Alabama 4.8 8
## 5 1015 NA Calhoun Alabama 4.7 40
## 6 1019 NA Cherokee Alabama 3.7 8
## NumberOfHospitals AverageHospitalRating
## 1 1 4.000000
## 2 4 3.000000
## 3 1 3.000000
## 4 2 3.000000
## 5 4 2.666667
## 6 1 4.000000
# Drop rows with missing Average house price
data_clean <- data_raw %>%
drop_na(AverageHousePrice)
# Move the target in the end
data_final <- data_clean %>%
select(-AverageHousePrice, AverageHousePrice)
head(data_final)
## FIPS CountyName StateName UnemploymentRate NumberOfSchools NumberOfHospitals
## 1 1001 Autauga Alabama 3.6 15 1
## 2 1003 Baldwin Alabama 3.6 47 4
## 3 1033 Colbert Alabama 4.7 27 2
## 4 1049 De Kalb Alabama 3.8 20 1
## 5 1051 Elmore Alabama 3.4 20 2
## 6 1055 Etowah Alabama 4.1 45 4
## AverageHospitalRating AverageHousePrice
## 1 4.00 114483.67
## 2 3.00 164861.69
## 3 2.50 92332.05
## 4 3.00 98315.49
## 5 3.40 125561.37
## 6 2.75 77994.25
We are going to calculate the average county home price per state.
data_1 <- data_final %>%
group_by(StateName) %>%
transmute(StateName, avg_house_price = mean(AverageHousePrice))
data_1 <- data_1 %>%
distinct(StateName, avg_house_price)
data_1
## # A tibble: 50 x 2
## # Groups: StateName [50]
## StateName avg_house_price
## <chr> <dbl>
## 1 Alabama 111315.
## 2 Alaska 231293.
## 3 Arizona 154756.
## 4 Arkansas 96635.
## 5 California 313204.
## 6 Colorado 246091.
## 7 Connecticut 222477.
## 8 District of Columbia 351877.
## 9 Florida 147434.
## 10 Georgia 120758.
## # ... with 40 more rows
Now we are going to order the state from the least affordable to the most affordable
data_2 <- data_1 %>%
arrange(desc(avg_house_price))
data_2 %>%
kbl(caption = "Home price per state") %>%
kable_material(c("striped", "hover")) %>%
row_spec(0, color = "indigo")
| StateName | avg_house_price |
|---|---|
| Hawaii | 367663.35 |
| District of Columbia | 351876.68 |
| California | 313204.10 |
| Massachusetts | 252502.39 |
| Colorado | 246090.52 |
| New Jersey | 240695.04 |
| Rhode Island | 232272.22 |
| Alaska | 231292.74 |
| Connecticut | 222477.39 |
| Maryland | 219910.15 |
| Nevada | 211545.48 |
| Vermont | 197167.04 |
| Washington | 187952.16 |
| Virginia | 184240.35 |
| Wyoming | 178689.44 |
| Oregon | 177559.35 |
| Utah | 175536.27 |
| New Hampshire | 174137.05 |
| New Mexico | 173700.21 |
| South Dakota | 173370.37 |
| New York | 173113.39 |
| Maine | 159582.94 |
| Idaho | 159432.27 |
| Montana | 155786.82 |
| Arizona | 154756.12 |
| Minnesota | 153400.68 |
| Wisconsin | 152580.67 |
| Florida | 147433.63 |
| North Dakota | 146838.69 |
| Pennsylvania | 122026.17 |
| Georgia | 120757.50 |
| North Carolina | 118907.18 |
| Texas | 118704.49 |
| Missouri | 117672.74 |
| South Carolina | 117201.77 |
| Kentucky | 115372.95 |
| Michigan | 114320.94 |
| Nebraska | 112512.59 |
| Alabama | 111315.46 |
| Illinois | 110639.05 |
| Iowa | 108837.37 |
| Ohio | 107811.92 |
| Louisiana | 106155.52 |
| West Virginia | 104864.69 |
| Mississippi | 103178.53 |
| Indiana | 99398.11 |
| Arkansas | 96635.30 |
| Tennessee | 96511.72 |
| Kansas | 92633.36 |
| Oklahoma | 84037.59 |
Visualize the least affordable states based on average county home price per state
# Top 10
top_n(ungroup(data_2), 10) %>%
ggplot(aes(reorder(StateName, avg_house_price), avg_house_price)) +
geom_col(aes(fill = avg_house_price)) +
coord_flip() +
labs(title = '10 most expensive state to buy a house', x = "State")
## Selecting by avg_house_price
Visualize the most affordable states based on average county home price per state
# Top 10
top_n(ungroup(data_2), -10) %>%
ggplot(aes(reorder(StateName, avg_house_price), avg_house_price)) +
geom_bar(stat="identity", color="blue", fill="purple") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=0.5)) +
labs(title = '10 least expensive state to buy a house', x = "State")
## Selecting by avg_house_price
Visualize the average house price by state
# Add state abbreviation
data_3 <- data_1 %>%
mutate(code = state.abb[match(StateName, state.name)])
# Plot the map
w <- list(color = toRGB("white"), width = 2)
g <- list(
scope = 'usa',
projection = list(type = 'albers usa'),
showlakes = TRUE,
lakecolor = toRGB('white')
)
p <- plot_geo(data_3, locationmode = 'USA-states') %>%
add_trace(
z = ~avg_house_price, locations = ~code,
color = ~avg_house_price, colors = 'Purples'
) %>%
colorbar(title = "Avg house price") %>%
layout(
title = 'Avg house price by State',
geo = g
)
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
p
For this predictive analysis, we are going to use Linear Regression.
Since wee have many explanatory variables, this case will be a multiple linear regression model.
The explanatory variables or predictors are: unemployment rate, number of schools, number of hospitals, average hospital ratings. Our response variable is the average house price.
Our research question is “which variable is a best predictor of average house price?”
We are going then to test around the relationship between housing prices and each one of the predictors. The model selection will be based on adjusted R square. Thus, we are going to apply “backward-selection”.
The general idea behind backward-selection is to start with the full model and eliminate one variable at a time until the ideal model is reached: Start with the full model, refit all possible models omitting one variable at a time, and choose the model with the highest adjusted R squared, repeat until maximum possible adjusted R squared is reached.
Subset the data set with numeric variables to get it ready for modeling
# Subset the data set with only numerical variables
data_4 <- data_final %>%
group_by(StateName) %>%
transmute(StateName,
UnemploymentRate = round(mean(UnemploymentRate), 0),
NumberOfSchools = sum(NumberOfSchools),
NumberOfHospitals = sum(NumberOfHospitals),
AverageHospitalRating = round(mean(AverageHospitalRating), 0),
avg_house_price = mean(AverageHousePrice))
data_4 <- data_4 %>%
distinct(StateName, UnemploymentRate, NumberOfSchools, NumberOfHospitals, AverageHospitalRating, avg_house_price)
data_4 <- subset(data_4, select = -c(StateName))
head(data_4)
## # A tibble: 6 x 5
## UnemploymentRate NumberOfSchools NumberOfHospita~ AverageHospital~
## <dbl> <int> <int> <dbl>
## 1 4 731 51 3
## 2 6 206 17 3
## 3 7 2323 133 3
## 4 4 542 68 3
## 5 5 9942 544 3
## 6 3 1567 90 4
## # ... with 1 more variable: avg_house_price <dbl>
Correlation matrix
res <- cor(data_4)
round(res, 2)
## UnemploymentRate NumberOfSchools NumberOfHospitals
## UnemploymentRate 1.00 0.23 0.22
## NumberOfSchools 0.23 1.00 0.95
## NumberOfHospitals 0.22 0.95 1.00
## AverageHospitalRating -0.42 -0.02 -0.03
## avg_house_price 0.07 0.12 -0.02
## AverageHospitalRating avg_house_price
## UnemploymentRate -0.42 0.07
## NumberOfSchools -0.02 0.12
## NumberOfHospitals -0.03 -0.02
## AverageHospitalRating 1.00 -0.22
## avg_house_price -0.22 1.00
Performance analytics
data_4 %>%
chart.Correlation(histogram=TRUE, pch=19)
Summary table of correlation between predictors and house price
Features <- c('NumberOfSchools', 'NumberOfHospitals', 'AverageHospitalRating', 'UnemploymentRate')
Correlation <- c(0.12, -0.02, -0.22, 0.07)
df <- data.frame(Features, Correlation)
df
## Features Correlation
## 1 NumberOfSchools 0.12
## 2 NumberOfHospitals -0.02
## 3 AverageHospitalRating -0.22
## 4 UnemploymentRate 0.07
Multiple linear model:
We are going to evaluate the avg_house_price with each of the predictors
Since we are using backward-selection, let first start with full model:
res_mul <- lm(avg_house_price ~ NumberOfSchools + NumberOfHospitals + AverageHospitalRating + UnemploymentRate, data = data_4)
summary(res_mul)
##
## Call:
## lm(formula = avg_house_price ~ NumberOfSchools + NumberOfHospitals +
## AverageHospitalRating + UnemploymentRate, data = data_4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -104171 -43067 -280 19905 192927
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 307782.82 91954.46 3.347 0.00166 **
## NumberOfSchools 51.88 16.13 3.216 0.00241 **
## NumberOfHospitals -811.62 263.03 -3.086 0.00347 **
## AverageHospitalRating -39218.47 22684.77 -1.729 0.09069 .
## UnemploymentRate -3794.32 9168.35 -0.414 0.68095
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 59780 on 45 degrees of freedom
## Multiple R-squared: 0.2275, Adjusted R-squared: 0.1588
## F-statistic: 3.313 on 4 and 45 DF, p-value: 0.0184
Now, let analyze each individual predictor with house price:
avg_house_price~UnemploymentRate
res_1 <- lm(avg_house_price ~ UnemploymentRate, data = data_4)
summary(res_1)
##
## Call:
## lm(formula = avg_house_price ~ UnemploymentRate, data = data_4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -79759 -49213 -8008 22744 212819
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 145893 36797 3.965 0.000244 ***
## UnemploymentRate 4476 8857 0.505 0.615632
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 65670 on 48 degrees of freedom
## Multiple R-squared: 0.005292, Adjusted R-squared: -0.01543
## F-statistic: 0.2554 on 1 and 48 DF, p-value: 0.6156
avg_house_price~NumberOfSchools
res_2 <- lm(avg_house_price ~ NumberOfSchools, data = data_4)
summary(res_2)
##
## Call:
## lm(formula = avg_house_price ~ NumberOfSchools, data = data_4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -78233 -50871 -10537 21548 209009
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.574e+05 1.221e+04 12.896 <2e-16 ***
## NumberOfSchools 4.600e+00 5.652e+00 0.814 0.42
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 65400 on 48 degrees of freedom
## Multiple R-squared: 0.01362, Adjusted R-squared: -0.006934
## F-statistic: 0.6626 on 1 and 48 DF, p-value: 0.4197
avg_house_price~NumberOfHospitals
res_3 <- lm(avg_house_price ~ NumberOfHospitals, data = data_4)
summary(res_3)
##
## Call:
## lm(formula = avg_house_price ~ NumberOfHospitals, data = data_4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -79799 -51291 -9966 23105 202933
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 165048.36 13088.48 12.610 <2e-16 ***
## NumberOfHospitals -11.77 93.12 -0.126 0.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 65840 on 48 degrees of freedom
## Multiple R-squared: 0.0003325, Adjusted R-squared: -0.02049
## F-statistic: 0.01597 on 1 and 48 DF, p-value: 0.9
avg_house_price~AverageHospitalRating
res_4 <- lm(avg_house_price ~ AverageHospitalRating, data = data_4)
summary(res_4)
##
## Call:
## lm(formula = avg_house_price ~ AverageHospitalRating, data = data_4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -83312 -51520 -9665 25213 200313
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 271267 68881 3.938 0.000265 ***
## AverageHospitalRating -34639 22026 -1.573 0.122364
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 64220 on 48 degrees of freedom
## Multiple R-squared: 0.049, Adjusted R-squared: 0.02919
## F-statistic: 2.473 on 1 and 48 DF, p-value: 0.1224
Summary table
Features <- c('NumberOfSchools', 'NumberOfHospitals', 'AverageHospitalRating', 'UnemploymentRate')
Correlation <- c(0.12, -0.02, -0.22, 0.07)
P_values <- c(0.00241, 0.00347, 0.09069, 0.68095)
Adj_r_square <- c(-0.006934, -0.02049, 0.02919, -0.01543)
df_final <- data.frame(Features, Correlation, P_values, Adj_r_square)
df_final
## Features Correlation P_values Adj_r_square
## 1 NumberOfSchools 0.12 0.00241 -0.006934
## 2 NumberOfHospitals -0.02 0.00347 -0.020490
## 3 AverageHospitalRating -0.22 0.09069 0.029190
## 4 UnemploymentRate 0.07 0.68095 -0.015430
As to answer to our main question to test around relationship between housing prices and each predictors, We realize that the house price has a positive relationship with the number of schools and the unemployment rate. Those are also the two big factors (from the features we explored) that contribute the most to the price of home in US. The two other factors have a negative relationship with house prices. Though the predictors we used, we should have taken into consideration the crime rate which we believe should be a great predictor of home price. This will be part of further work we will have to do to make this model more efficient. We need also to mention that the project presented some challenges such as we needed to find appropriate data set for various factors, understand different terms such FIPS which we never heard before, merge different data set to make one useful data set for analysis and prediction.
CUNY DATA606: https://fall2020.data606.net/chapters/chapter9/