[Covid Infections](https://data.wprdc.org/dataset/allegheny-county-covid-19-tests-cases-and-deaths/resource/0f214885-ff3e-44e1-9963-e9e9062a04d1?inner_span=True)
[Vaccine Reports](https://data.wprdc.org/dataset/allegheny-county-covid-19-vaccinations/resource/61ab4ad4-fb84-4789-95c9-cbe116414287)
[House Prices](https://data.wprdc.org/dataset/real-estate-sales/resource/5bbe6c55-bce6-4edb-9d04-68edeb6bf7b1/view/fc32217e-2f0e-437d-9f68-f2922dfdf71f)
#HOUSE DATASET
house_data = read.csv("house_sales.csv")
#drop placeholder prices: some sales are recorded as 0, 1, or 2, so keep only PRICE > 2
house_data = house_data[house_data$PRICE > 2,]
complete = complete.cases(house_data)
house_data = house_data[complete, ]
#filter house data to after Covid hit since this dataset includes house sales from 2013-present
house_data$SALEDATE = gsub("/", "-", house_data$SALEDATE)
house_data$SALEDATE = as.Date(house_data$SALEDATE, format = "%m-%d-%Y")
house_data = subset(house_data, SALEDATE >= as.Date("2020-06-01"))
#make dataframe for median house prices in each neighborhood
median_house_price = aggregate(house_data$PRICE, by = list(neighborhood = house_data$MUNIDESC), FUN = median)
colnames(median_house_price) = c("neighborhood", "median_price")
median_house_data = median_house_price
#read in covid dataset
covid_data = read.csv("covid_data.csv")
#some neighborhood names in covid_data have "(Pittsburgh)", take that out
covid_data$neighborhood_municipality = gsub("\\(Pittsburgh\\)", "", covid_data$neighborhood_municipality)
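# Normalize names in both datasets for matching: keep only lowercase letters and digits
# (capital letters, spaces, and punctuation are all dropped, because gsub runs before tolower)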
median_house_data$neighborhood = tolower(gsub("[^a-z0-9]+", "", median_house_data$neighborhood, perl = TRUE))
covid_data$neighborhood_municipality = tolower(gsub("[^a-z0-9]+", "", covid_data$neighborhood_municipality, perl = TRUE))
#merge house_data and covid_data according to neighborhood
house_and_covid_data = merge(covid_data, median_house_data, by.x = "neighborhood_municipality", by.y = "neighborhood")
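Because `merge` performs an inner join on the cleaned names, any neighborhood whose cleaned name differs between the two files is dropped without warning. The sketch below shows one way this could be checked. It uses made-up rows rather than the real files, and `clean_key`, `houses`, and `cases` are hypothetical names introduced only for illustration.

```r
# Hypothetical helper mirroring the cleaning step above:
# drop every character that is not a lowercase letter or digit.
clean_key = function(x) tolower(gsub("[^a-z0-9]+", "", x, perl = TRUE))

# Toy rows standing in for the real datasets (all values are invented)
houses = data.frame(neighborhood = c("Penn Hills", "Mt. Lebanon", "Shadyside"),
                    median_price = c(150000, 320000, 410000))
cases = data.frame(neighborhood_municipality = c("Penn Hills", "Mount Lebanon", "Shadyside "),
                   infections = c(900, 700, 400))

houses$key = clean_key(houses$neighborhood)
cases$key = clean_key(cases$neighborhood_municipality)

# Keys present in one dataset but not the other are lost by merge()
unmatched_houses = setdiff(houses$key, cases$key)
unmatched_cases = setdiff(cases$key, houses$key)
print(unmatched_houses)
print(unmatched_cases)

merged = merge(cases, houses, by = "key")
print(nrow(merged))
```

In this toy example "Penn Hills" becomes "ennills" on both sides and still matches, but "Mt. Lebanon" and "Mount Lebanon" produce different keys, so only two of the three rows survive the merge.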
#linear regression model on covid cases and house prices in Pittsburgh neighborhoods
lm_model1 = lm(house_and_covid_data$median_price ~ house_and_covid_data$infections, data = house_and_covid_data)
summary(lm_model1)
Call:
lm(formula = house_and_covid_data$median_price ~ house_and_covid_data$infections,
data = house_and_covid_data)
Residuals:
Min 1Q Median 3Q Max
-187010 -100285 -42292 46008 1061726
Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)
(Intercept)                     2.132e+05  2.151e+04   9.909   <2e-16 ***
house_and_covid_data$infections 1.018e-01  1.360e+00   0.075     0.94
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 179000 on 112 degrees of freedom
Multiple R-squared: 5.003e-05, Adjusted R-squared: -0.008878
F-statistic: 0.005603 on 1 and 112 DF, p-value: 0.9405
#plot model
plot(house_and_covid_data$infections, house_and_covid_data$median_price, xlab = "Covid Infections", ylab = "Median House Prices", main = "Covid Infections vs Median House Prices")
abline(lm_model1, col="blue")
#correlation between covid infections and house prices
correlation = cor(house_and_covid_data$median_price, house_and_covid_data$infections)
correlation_test = cor.test(house_and_covid_data$median_price, house_and_covid_data$infections)
correlation_test
Pearson's product-moment correlation
data: house_and_covid_data$median_price and house_and_covid_data$infections
t = 0.074855, df = 112, p-value = 0.9405
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.1770722 0.1907396
sample estimates:
cor
0.007072941
cat("correlation coefficient between covid cases and house prices", correlation)
correlation coefficient between covid cases and house prices 0.007072941
#get slope of fitted line
slope1 = coef(lm_model1)[2]
cat("\nslope of fitted line", slope1)
slope of fitted line 0.1017766
The plot shows no apparent trend between covid infections and median house prices in Pittsburgh neighborhoods: as infections increase, median house prices stay roughly the same. The slope of the fitted line is about 0.10 with a standard error of 1.36, so it is statistically indistinguishable from 0. The p-value from the linear regression is 0.9405, so we cannot reject the hypothesis that the slope is zero, and the model explains essentially none of the variation in price (R-squared of 5.003e-05). The correlation coefficient of 0.007 is likewise close to 0, so there is essentially no linear relationship between house prices and covid infections.
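The residual summary above is strongly right-skewed (the largest residual is 1061726 against a median of -42292), so a few expensive municipalities could dominate a Pearson correlation. One possible robustness check is to run a rank-based (Spearman) correlation alongside it. The sketch below uses simulated data, not the real merged data frame: the column names match the ones above, but `toy` and all of its values are invented.

```r
set.seed(1)
# Simulated stand-in for house_and_covid_data (all values are invented)
toy = data.frame(infections = rpois(114, lambda = 5000))
toy$median_price = exp(rnorm(114, mean = 12, sd = 0.6))  # right-skewed prices

# Formula interface: refer to columns by name and let `data` supply them
fit = lm(median_price ~ infections, data = toy)

# Pearson (linear) and Spearman (rank-based) correlation tests;
# exact = FALSE avoids the tie warning for the Spearman test
pearson = cor.test(toy$median_price, toy$infections, method = "pearson")
spearman = cor.test(toy$median_price, toy$infections, method = "spearman", exact = FALSE)

print(coef(fit))
print(c(pearson = unname(pearson$estimate), spearman = unname(spearman$estimate)))

# In a simple regression the slope's p-value equals the Pearson test's p-value
print(c(lm_p = summary(fit)$coefficients[2, 4], pearson_p = pearson$p.value))
```

The last line illustrates why the regression summary and `cor.test` above both report a p-value of 0.9405: with a single predictor the two tests are equivalent.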
# read in vaccine dataset
vaccine_data = read.csv("vaccine.csv")
#some neighborhood names have "(Pittsburgh)", take that out
vaccine_data$neighborhood_municipality = gsub("\\(Pittsburgh\\)", "", vaccine_data$neighborhood_municipality)
# Remove special and capital characters (applied to the vaccine data's own names)
vaccine_data$neighborhood_municipality = tolower(gsub("[^a-z0-9]+", "", vaccine_data$neighborhood_municipality, perl = TRUE))
#merge median_house_data and vaccine_data according to neighborhood
house_and_vaccine_data = merge(vaccine_data, median_house_data, by.x = "neighborhood_municipality", by.y = "neighborhood")
#linear regression model on vaccine count and med house prices in Pittsburgh neighborhoods
lm_model2 = lm(house_and_vaccine_data$median_price ~ house_and_vaccine_data$bivalent_booster, data = house_and_vaccine_data)
summary(lm_model2)
Call:
lm(formula = house_and_vaccine_data$median_price ~ house_and_vaccine_data$bivalent_booster,
data = house_and_vaccine_data)
Residuals:
Min 1Q Median 3Q Max
-193515 -104758 -43359 39903 1057190
Coefficients:
                                          Estimate Std. Error t value Pr(>|t|)
(Intercept)                             222471.775  20136.897  11.048   <2e-16 ***
house_and_vaccine_data$bivalent_booster     -6.959      9.429  -0.738    0.462
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 178600 on 112 degrees of freedom
Multiple R-squared: 0.00484, Adjusted R-squared: -0.004046
F-statistic: 0.5447 on 1 and 112 DF, p-value: 0.462
#plot model
plot(house_and_vaccine_data$bivalent_booster, house_and_vaccine_data$median_price, main = "Vaccine Rate vs Median House Prices", ylab = "Median House Prices", xlab = "Vaccine Rate")
abline(lm_model2, col="blue")
#correlation between vaccine rate and house prices
correlation2 = cor(house_and_vaccine_data$median_price, house_and_vaccine_data$bivalent_booster)
correlation_test2 = cor.test(house_and_vaccine_data$median_price, house_and_vaccine_data$bivalent_booster)
correlation_test2
Pearson's product-moment correlation
data: house_and_vaccine_data$median_price and house_and_vaccine_data$bivalent_booster
t = -0.73803, df = 112, p-value = 0.462
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.2502805 0.1158288
sample estimates:
cor
-0.06956795
cat("correlation coefficient between covid vaccine rate and house prices", correlation2)
correlation coefficient between covid vaccine rate and house prices -0.06956795
#get slope of fitted line
slope2 = coef(lm_model2)[2]
cat("\nslope of fitted line", slope2)
slope of fitted line -6.958619
The plot shows no apparent trend between the vaccine rate and median house prices in Pittsburgh neighborhoods. The slope of the fitted line is about -7, which is smaller in magnitude than its standard error of 9.429, so it is statistically indistinguishable from 0. The p-value from the linear regression is 0.462, so the model is not significant. The correlation coefficient of -0.07 is also close to 0, so there is close to no linear relationship between house prices and the vaccine rate.
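The size of a slope depends on the units of the predictor, so a slope of about -7 is hard to call small or large on its own, whereas the correlation coefficient and the p-value do not depend on units. The sketch below illustrates this on simulated data: `boosters`, `boosters_k`, and `price` are hypothetical variables with invented values, not the real columns.

```r
set.seed(2)
boosters = round(runif(114, 100, 5000))          # invented booster counts
price = rnorm(114, mean = 220000, sd = 150000)   # invented prices

fit_raw = lm(price ~ boosters)
boosters_k = boosters / 1000                     # same predictor, in thousands
fit_scaled = lm(price ~ boosters_k)

slope_raw = unname(coef(fit_raw)[2])
slope_scaled = unname(coef(fit_scaled)[2])
p_raw = summary(fit_raw)$coefficients[2, 4]
p_scaled = summary(fit_scaled)$coefficients[2, 4]

# The slope changes by the rescaling factor; r and the p-value do not change
print(c(slope_raw = slope_raw, slope_scaled = slope_scaled))
print(c(r_raw = cor(price, boosters), r_scaled = cor(price, boosters_k)))
print(c(p_raw = p_raw, p_scaled = p_scaled))

# In simple linear regression, slope = r * sd(y) / sd(x)
slope_from_r = cor(price, boosters) * sd(price) / sd(boosters)
print(slope_from_r)
```

Comparing the slope with its standard error, or reading the correlation coefficient, is therefore a more reliable guide than the raw size of the slope.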
#check for collinearity between vaccine rate and covid cases
#merge covid infections, vaccine rate, and med house prices into one dataset
all_data = merge(house_and_covid_data, house_and_vaccine_data, by = "neighborhood_municipality")
#1) plot covid infections vs house prices
plot(house_and_covid_data$infections, house_and_covid_data$median_price, xlab = "Covid Infections", ylab = "Median House Prices", main = "Covid Infections vs Median House Prices")
#lm between covid infections and house prices
summary(lm(house_and_covid_data$median_price ~ house_and_covid_data$infections, data = all_data))
Call:
lm(formula = house_and_covid_data$median_price ~ house_and_covid_data$infections,
data = all_data)
Residuals:
Min 1Q Median 3Q Max
-187010 -100285 -42292 46008 1061726
Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)
(Intercept)                     2.132e+05  2.151e+04   9.909   <2e-16 ***
house_and_covid_data$infections 1.018e-01  1.360e+00   0.075     0.94
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 179000 on 112 degrees of freedom
Multiple R-squared: 5.003e-05, Adjusted R-squared: -0.008878
F-statistic: 0.005603 on 1 and 112 DF, p-value: 0.9405
#2) plot vaccine rate vs house prices
plot(house_and_vaccine_data$bivalent_booster, house_and_vaccine_data$median_price, main = "Vaccine Rate vs Median House Prices", ylab = "Median House Prices", xlab = "Vaccine Rate")
#lm between vaccine rate and house prices
summary(lm(all_data$median_price.x ~ house_and_vaccine_data$bivalent_booster, data = all_data))
Call:
lm(formula = all_data$median_price.x ~ house_and_vaccine_data$bivalent_booster,
data = all_data)
Residuals:
Min 1Q Median 3Q Max
-193515 -104758 -43359 39903 1057190
Coefficients:
                                          Estimate Std. Error t value Pr(>|t|)
(Intercept)                             222471.775  20136.897  11.048   <2e-16 ***
house_and_vaccine_data$bivalent_booster     -6.959      9.429  -0.738    0.462
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 178600 on 112 degrees of freedom
Multiple R-squared: 0.00484, Adjusted R-squared: -0.004046
F-statistic: 0.5447 on 1 and 112 DF, p-value: 0.462
#3) plot the two independent variables
plot(all_data$infections, all_data$bivalent_booster, main = "Vaccine Rate vs Covid Infections", ylab = "Vaccine Rate", xlab = "Covid Infections")
#multiple linear regression of median house price on both independent variables
lm_model3 = lm(all_data$median_price.x ~ all_data$infections + all_data$bivalent_booster, data = all_data)
#summary of linear regression model
summary(lm_model3)
Call:
lm(formula = all_data$median_price.x ~ all_data$infections +
all_data$bivalent_booster, data = all_data)
Residuals:
Min 1Q Median 3Q Max
-193474 -104706 -43352 39852 1057244
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.224e+05 2.498e+04 8.903 1.19e-14 ***
all_data$infections 5.715e-03 1.369e+00 0.004 0.997
all_data$bivalent_booster -6.955e+00 9.515e+00 -0.731 0.466
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 179400 on 111 degrees of freedom
Multiple R-squared: 0.00484, Adjusted R-squared: -0.01309
F-statistic: 0.2699 on 2 and 111 DF, p-value: 0.7639
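The plot shows no trend between the two independent variables, so they do not appear to be collinear. This is consistent with the regression output: adding infections to the model leaves the bivalent_booster estimate almost unchanged (-6.955 versus -6.959) and barely increases its standard error (9.515 versus 9.429). The multiple regression's overall p-value of 0.7639 does not measure collinearity; it shows that the two predictors together do not explain median house prices either. Thus, multiple linear regression adds nothing here, and simple linear regression should suffice.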
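Collinearity is a property of the predictors alone, so it is usually checked with the correlation between them or with a variance inflation factor (VIF) rather than with the p-value of the price model. The sketch below is a minimal illustration on simulated data: `toy` and its values are invented, and the VIF is computed by hand as 1 / (1 - R²), where R² comes from regressing one predictor on the other.

```r
set.seed(3)
# Simulated stand-in for the two predictors in all_data (values are invented)
toy = data.frame(infections = rpois(114, lambda = 5000),
                 bivalent_booster = rpois(114, lambda = 1500))

# Correlation between the two predictors
r_pred = cor(toy$infections, toy$bivalent_booster)

# With two predictors, the VIF of either one is 1 / (1 - R^2),
# where R^2 is from regressing that predictor on the other
r2 = summary(lm(infections ~ bivalent_booster, data = toy))$r.squared
vif = 1 / (1 - r2)

print(c(correlation = r_pred, vif = vif))
```

A VIF near 1 suggests negligible collinearity; values above roughly 5 to 10 are commonly treated as a warning sign.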