[Covid Infections](https://data.wprdc.org/dataset/allegheny-county-covid-19-tests-cases-and-deaths/resource/0f214885-ff3e-44e1-9963-e9e9062a04d1?inner_span=True)
[Vaccine Reports](https://data.wprdc.org/dataset/allegheny-county-covid-19-vaccinations/resource/61ab4ad4-fb84-4789-95c9-cbe116414287)
[House Prices](https://data.wprdc.org/dataset/real-estate-sales/resource/5bbe6c55-bce6-4edb-9d04-68edeb6bf7b1/view/fc32217e-2f0e-437d-9f68-f2922dfdf71f)
#HOUSE DATASET
house_data = read.csv("house_sales.csv")
#filter out nominal sales: some recorded prices are 0, 1, or 2, so keep only PRICE > 2
house_data = house_data[house_data$PRICE > 2,]
complete = complete.cases(house_data)
house_data = house_data[complete, ]
#filter house data to after Covid hit since this dataset includes house sales from 2013-present
house_data$SALEDATE = gsub("/", "-", house_data$SALEDATE)
house_data$SALEDATE = as.Date(house_data$SALEDATE, format = "%m-%d-%Y")
house_data = subset(house_data, SALEDATE >= as.Date("2020-06-01"))
#make dataframe for median house prices in each neighborhood
median_house_price = aggregate(house_data$PRICE, by = list(neighborhood = house_data$MUNIDESC), FUN = median)
colnames(median_house_price) = c("neighborhood", "median_price")
median_house_data = median_house_price
#read in covid dataset
covid_data = read.csv("covid_data.csv")
#some neighborhood names in covid_data have "(Pittsburgh)", take that out
covid_data$neighborhood_municipality = gsub("\\(Pittsburgh\\)", "", covid_data$neighborhood_municipality)
# normalize neighborhood names in both datasets: lowercase, then strip non-alphanumeric characters
median_house_data$neighborhood = gsub("[^a-z0-9]+", "", tolower(median_house_data$neighborhood), perl = TRUE)
covid_data$neighborhood_municipality = gsub("[^a-z0-9]+", "", tolower(covid_data$neighborhood_municipality), perl = TRUE)
#merge house_data and covid_data according to neighborhood
house_and_covid_data = merge(covid_data, median_house_data, by.x = "neighborhood_municipality", by.y = "neighborhood")
#linear regression model on covid cases and house prices in Pittsburgh neighborhoods
lm_model1 = lm(house_and_covid_data$median_price ~ house_and_covid_data$infections, data = house_and_covid_data)
summary(lm_model1)
Call:
lm(formula = house_and_covid_data$median_price ~ house_and_covid_data$infections, data = house_and_covid_data)

Residuals:
    Min      1Q  Median      3Q     Max
-187010 -100285  -42292   46008 1061726

Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)
(Intercept)                     2.132e+05  2.151e+04   9.909   <2e-16 ***
house_and_covid_data$infections 1.018e-01  1.360e+00   0.075     0.94
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 179000 on 112 degrees of freedom
Multiple R-squared:  5.003e-05, Adjusted R-squared:  -0.008878
F-statistic: 0.005603 on 1 and 112 DF,  p-value: 0.9405
#plot model
plot(house_and_covid_data$infections, house_and_covid_data$median_price, xlab = "Covid Infections", ylab = "Median House Prices", main = "Covid Infections vs Median House Prices")
abline(lm_model1, col="blue")
#correlation between covid infections and house prices
correlation = cor(house_and_covid_data$median_price, house_and_covid_data$infections)
correlation_test = cor.test(house_and_covid_data$median_price, house_and_covid_data$infections)
correlation_test
Pearson's product-moment correlation
data: house_and_covid_data$median_price and house_and_covid_data$infections
t = 0.074855, df = 112, p-value = 0.9405
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.1770722 0.1907396
sample estimates:
cor
0.007072941
cat("correlation coefficient between covid cases and house prices", correlation)
correlation coefficient between covid cases and house prices 0.007072941
#get slope of fitted line
slope1 = coef(lm_model1)[2]
cat("
slope of fitted line", slope1)
slope of fitted line 0.1017766
Looking at the plot, there is no apparent trend between covid infections and median house prices across Pittsburgh neighborhoods: as infections increase, median house prices stay roughly flat. The slope of the fitted regression line is close to 0, and its p-value of 0.9405 means the relationship is not statistically significant. The correlation coefficient is also close to 0, so there is essentially no linear relationship between house prices and covid infections.
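As a quick numeric check (a minimal sketch, assuming lm_model1 from above is still in the workspace), the slope's p-value and the R-squared can be pulled directly from the fitted model object instead of being read off the printed summary:
#extract headline statistics from the fitted model (sketch; assumes lm_model1 exists)
model1_summary = summary(lm_model1)
slope_p_value = model1_summary$coefficients[2, "Pr(>|t|)"]  #p-value for the infections slope
r_squared = model1_summary$r.squared                        #proportion of variance explained
cat("slope p-value:", slope_p_value, "\nR-squared:", r_squared)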
# read in vaccine dataset
vaccine_data = read.csv("vaccine.csv")
#some neighborhood names have "(Pittsburgh)", take that out
vaccine_data$neighborhood_municipality = gsub("\\(Pittsburgh\\)", "", vaccine_data$neighborhood_municipality)
# normalize neighborhood names: lowercase, then strip non-alphanumeric characters
vaccine_data$neighborhood_municipality = gsub("[^a-z0-9]+", "", tolower(vaccine_data$neighborhood_municipality), perl = TRUE)
#merge median_house_data and vaccine_data according to neighborhood
house_and_vaccine_data = merge(vaccine_data, median_house_data, by.x = "neighborhood_municipality", by.y = "neighborhood")
#linear regression model on vaccine count and med house prices in Pittsburgh neighborhoods
lm_model2 = lm(house_and_vaccine_data$median_price ~ house_and_vaccine_data$bivalent_booster, data = house_and_vaccine_data)
summary(lm_model2)
Call:
lm(formula = house_and_vaccine_data$median_price ~ house_and_vaccine_data$bivalent_booster, data = house_and_vaccine_data)

Residuals:
    Min      1Q  Median      3Q     Max
-193515 -104758  -43359   39903 1057190

Coefficients:
                                          Estimate Std. Error t value Pr(>|t|)
(Intercept)                             222471.775  20136.897  11.048   <2e-16 ***
house_and_vaccine_data$bivalent_booster     -6.959      9.429  -0.738    0.462
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 178600 on 112 degrees of freedom
Multiple R-squared:  0.00484, Adjusted R-squared:  -0.004046
F-statistic: 0.5447 on 1 and 112 DF,  p-value: 0.462
#plot model
plot(house_and_vaccine_data$bivalent_booster, house_and_vaccine_data$median_price, main = "Vaccine Rate vs Median House Prices", ylab = "Median House Prices", xlab = "Vaccine Rate")
abline(lm_model2, col="blue")
#correlation between vaccine rate and house prices
correlation2 = cor(house_and_vaccine_data$median_price, house_and_vaccine_data$bivalent_booster)
correlation_test2 = cor.test(house_and_vaccine_data$median_price, house_and_vaccine_data$bivalent_booster)
correlation_test2
Pearson's product-moment correlation
data: house_and_vaccine_data$median_price and house_and_vaccine_data$bivalent_booster
t = -0.73803, df = 112, p-value = 0.462
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.2502805 0.1158288
sample estimates:
cor
-0.06956795
cat("correlation coefficient between covid vaccine rate and house prices", correlation2)
correlation coefficient between covid vaccine rate and house prices -0.06956795
#get slope of fitted line
slope2 = coef(lm_model2)[2]
cat("
slope of fitted line", slope2)
slope of fitted line -6.958619
Looking at the plot, there is no apparent trend between the vaccine rate and median house prices across Pittsburgh neighborhoods. The slope of the fitted regression line is about -7, which is small relative to the scale of the prices in the data, and its p-value of 0.462 means the relationship is not statistically significant. The correlation coefficient is also close to 0, so there is essentially no linear relationship between house prices and the vaccine rate.
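One caveat: the residuals of both models are strongly right-skewed (large positive maxima in the residual summaries), which is typical for house prices. As a hedged robustness sketch (not part of the original analysis), the same regression can be refit on log-transformed median prices to see whether the conclusion changes:
#robustness sketch: refit on the log of median price (prices are all > 2 after filtering, so the log is defined)
lm_model2_log = lm(log(median_price) ~ bivalent_booster, data = house_and_vaccine_data)
summary(lm_model2_log)  #check whether the "no relationship" conclusion still holds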
#check for collinearity between vaccine rate and covid cases
#merge covid infections, vaccine rate, and med house prices into one dataset
all_data = merge(house_and_covid_data, house_and_vaccine_data, by = "neighborhood_municipality")
#1) plot covid infections vs house prices
plot(house_and_covid_data$infections, house_and_covid_data$median_price, xlab = "Covid Infections", ylab = "Median House Prices", main = "Covid Infections vs Median House Prices")
#lm between covid infections and house prices
summary(lm(house_and_covid_data$median_price ~ house_and_covid_data$infections, data = all_data))
Call:
lm(formula = house_and_covid_data$median_price ~ house_and_covid_data$infections, data = all_data)

Residuals:
    Min      1Q  Median      3Q     Max
-187010 -100285  -42292   46008 1061726

Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)
(Intercept)                     2.132e+05  2.151e+04   9.909   <2e-16 ***
house_and_covid_data$infections 1.018e-01  1.360e+00   0.075     0.94
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 179000 on 112 degrees of freedom
Multiple R-squared:  5.003e-05, Adjusted R-squared:  -0.008878
F-statistic: 0.005603 on 1 and 112 DF,  p-value: 0.9405
#2) plot vaccine rate vs house prices
plot(house_and_vaccine_data$bivalent_booster, house_and_vaccine_data$median_price, main = "Vaccine Rate vs Median House Prices", ylab = "Median House Prices", xlab = "Vaccine Rate")
#lm between vaccine rate and house prices
summary(lm(all_data$median_price.x ~ house_and_vaccine_data$bivalent_booster, data = all_data))
Call:
lm(formula = all_data$median_price.x ~ house_and_vaccine_data$bivalent_booster, data = all_data)

Residuals:
    Min      1Q  Median      3Q     Max
-193515 -104758  -43359   39903 1057190

Coefficients:
                                          Estimate Std. Error t value Pr(>|t|)
(Intercept)                             222471.775  20136.897  11.048   <2e-16 ***
house_and_vaccine_data$bivalent_booster     -6.959      9.429  -0.738    0.462
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 178600 on 112 degrees of freedom
Multiple R-squared:  0.00484, Adjusted R-squared:  -0.004046
F-statistic: 0.5447 on 1 and 112 DF,  p-value: 0.462
#3) plot the two independent variables
plot(all_data$infections, all_data$bivalent_booster, main = "Vaccine Rate vs Covid Infections", ylab = "Vaccine Rate", xlab = "Covid Infections")
#multiple linear regression of median house price on both predictors
lm_model3 = lm(all_data$median_price.x ~ all_data$infections + all_data$bivalent_booster, data = all_data)
#summary of linear regression model
summary(lm_model3)
Call:
lm(formula = all_data$median_price.x ~ all_data$infections + all_data$bivalent_booster, data = all_data)

Residuals:
    Min      1Q  Median      3Q     Max
-193474 -104706  -43352   39852 1057244

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)
(Intercept)                2.224e+05  2.498e+04   8.903 1.19e-14 ***
all_data$infections        5.715e-03  1.369e+00   0.004    0.997
all_data$bivalent_booster -6.955e+00  9.515e+00  -0.731    0.466
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 179400 on 111 degrees of freedom
Multiple R-squared:  0.00484, Adjusted R-squared:  -0.01309
F-statistic: 0.2699 on 2 and 111 DF,  p-value: 0.7639
The plot shows no visible trend between the two independent variables, and in the joint model neither covid infections (p = 0.997) nor the bivalent booster count (p = 0.466) is significant, with an overall F-test p-value of 0.7639. The two variables therefore do not appear to be collinear, so multiple linear regression adds little here and simple linear regression should suffice.
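A more direct collinearity check (a sketch, assuming the car package is installed; it is not used elsewhere in this analysis) is the correlation between the two predictors together with their variance inflation factors:
#collinearity sketch: predictor correlation and VIF (requires the car package)
cor(all_data$infections, all_data$bivalent_booster, use = "complete.obs")
library(car)
vif(lm(median_price.x ~ infections + bivalent_booster, data = all_data))  #values near 1 indicate little collinearity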
library(dplyr)
#Read and rename files
housing_prices <- read.csv("house_sales.csv")
#change name
housing_prices <- housing_prices %>% rename(neighborhood = MUNIDESC)
#remove trailing spaces
housing_prices$neighborhood <- trimws(housing_prices$neighborhood)
#convert SALEDATE to Date (the sale dates use a four-digit year, m/d/YYYY)
housing_prices$SALEDATE <- as.Date(housing_prices$SALEDATE, format = "%m/%d/%Y")
# filter to only sales after covid
housing_prices <- housing_prices %>%
filter(SALEDATE > as.Date("2020-06-01"))
vaccines <- read.csv("vaccine.csv")
#change name
vaccines <- vaccines %>% rename(neighborhood = neighborhood_municipality)
covid_cases <- read.csv("covid_data.csv")
#change name
covid_cases <- covid_cases %>% rename(neighborhood = neighborhood_municipality)
Group by neighborhood and calculate summary
#remove empty
housing_prices <- housing_prices %>%
filter(!is.na(PRICE))
#mean price of each neighborhood
housing_prices_summary <- housing_prices %>%
group_by(neighborhood) %>%
summarize(mean_housing_price = mean(PRICE, na.rm = TRUE))
#summarize vaccinations and covid cases per neighborhood (assuming one row per neighborhood, sum() simply carries the value through)
vaccines_summary <- vaccines %>%
group_by(neighborhood) %>%
summarize(total_vaccinations = sum(bivalent_booster, na.rm = TRUE))
covid_cases_summary <- covid_cases %>%
group_by(neighborhood) %>%
summarize(total_covid_cases = sum(infections, na.rm = TRUE))
#merge everything by neighborhood
merged_data <- merge(housing_prices_summary, covid_cases_summary, by = "neighborhood")
merged_data <- merge(merged_data, vaccines_summary, by = "neighborhood")
Multiple linear regression
multi_lm <- lm(mean_housing_price ~ total_covid_cases + total_vaccinations, data = merged_data)
summary(multi_lm)
Call:
lm(formula = mean_housing_price ~ total_covid_cases + total_vaccinations,
data = merged_data)
Residuals:
Min 1Q Median 3Q Max
-156167 -84381 -52448 57716 647951
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 176916.810 17048.883 10.377 <2e-16 ***
total_covid_cases -7.527 3.270 -2.302 0.0232 *
total_vaccinations 53.890 22.310 2.415 0.0174 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 140000 on 109 degrees of freedom
Multiple R-squared: 0.05084, Adjusted R-squared: 0.03342
F-statistic: 2.919 on 2 and 109 DF, p-value: 0.05822
#plot
plot(merged_data$total_covid_cases, merged_data$mean_housing_price, xlab = "Covid Cases", ylab = "House Prices")
plot(merged_data$total_vaccinations, merged_data$mean_housing_price, xlab = "Vaccinations", ylab = "House Prices")
plot(merged_data$total_covid_cases, merged_data$total_vaccinations, xlab = "Covid Cases", ylab = "Vaccinations")
Which variables are significant? Both total covid cases (p = 0.0232) and total vaccinations (p = 0.0174) have p-values below 0.05, so each is statistically significant at the 5% level in this model, although the overall F-statistic's p-value (0.05822) is only marginal (see the confidence-interval sketch below).
What is the model fit? The multiple R-squared value is 0.05084, meaning total covid cases and total vaccinations together explain only about 5.1% of the variation in mean housing prices.
Residual error difference between different models: A residual standard error of 140,000 is high relative to typical house prices, suggesting the model leaves a large amount of variability unexplained; simpler one-predictor models would not necessarily reduce it.
Insights from R-squared error: The adjusted R-squared (0.03342) is even lower than the multiple R-squared, again suggesting that total covid cases and total vaccinations are weak predictors of housing prices.
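As another way to read the coefficient table (a minimal sketch, assuming multi_lm from above is in the workspace), 95% confidence intervals for the coefficients show whether zero is a plausible value for each slope:
#confidence intervals for the multiple-regression coefficients (sketch; assumes multi_lm exists)
confint(multi_lm, level = 0.95)  #an interval that excludes zero corresponds to p < 0.05 for that coefficient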
Loading in the datasets and editing
library(dplyr)
housing_prices <- read.csv("house_sales.csv")
housing_prices <- housing_prices %>% rename(neighborhood = MUNIDESC)
housing_prices$neighborhood <- trimws(housing_prices$neighborhood)
housing_prices$SALEDATE <- as.Date(housing_prices$SALEDATE, format = "%m/%d/%Y")
vaccines <- read.csv("vaccine.csv")
vaccines <- vaccines %>% rename(neighborhood = neighborhood_municipality)
covid_cases <- read.csv("covid_data.csv")
covid_cases <- covid_cases %>% rename(neighborhood = neighborhood_municipality)
Merging datasets into one set
housing_prices <- housing_prices %>%
filter(!is.na(PRICE))
housing_prices_summary <- housing_prices %>%
group_by(neighborhood) %>%
summarize(mean_housing_price = mean(PRICE, na.rm = TRUE))
vaccines_summary <- vaccines %>%
group_by(neighborhood) %>%
summarize(total_vaccinations = sum(bivalent_booster, na.rm = TRUE))
covid_cases_summary <- covid_cases %>%
group_by(neighborhood) %>%
summarize(total_covid_cases = sum(infections, na.rm = TRUE))
merged_data <- merge(housing_prices_summary, covid_cases_summary, by = "neighborhood")
merged_data <- merge(merged_data, vaccines_summary, by = "neighborhood")
Splitting data into training and test sets
training_set <- merged_data %>% dplyr::sample_frac(0.8)
testing_set <- dplyr::anti_join(merged_data, training_set, by = "neighborhood")
training_mlr <- lm(mean_housing_price ~ total_covid_cases + total_vaccinations, data = training_set)
summary(training_mlr)
Call:
lm(formula = mean_housing_price ~ total_covid_cases + total_vaccinations,
data = training_set)
Residuals:
Min 1Q Median 3Q Max
-160421 -96591 -61738 56151 567879
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 189092.333 20067.412 9.423 6.12e-15 ***
total_covid_cases -8.447 3.785 -2.232 0.0282 *
total_vaccinations 57.803 26.031 2.221 0.0290 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 147000 on 87 degrees of freedom
Multiple R-squared: 0.05548, Adjusted R-squared: 0.03376
F-statistic: 2.555 on 2 and 87 DF, p-value: 0.08351
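The split above is only used to fit the training model; a sketch of how the held-out neighborhoods could be used to estimate out-of-sample error (assuming training_mlr and testing_set from above; calling set.seed() before sample_frac() would make the split reproducible):
#out-of-sample sketch: predict on the held-out neighborhoods and compute the RMSE
test_predictions <- predict(training_mlr, newdata = testing_set)
test_rmse <- sqrt(mean((testing_set$mean_housing_price - test_predictions)^2))
cat("test-set RMSE:", test_rmse)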
wine_quality = read.csv("winequality-red.csv")
#add a row id so the held-out rows can be recovered with anti_join (joining by "quality" would drop almost every test row, since quality takes only a few distinct values)
wine_quality$id = seq_len(nrow(wine_quality))
wine_training_set <- wine_quality %>% dplyr::sample_frac(0.8)
wine_testing_set <- dplyr::anti_join(wine_quality, wine_training_set, by = "id")
wine_mlr <- lm(quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = wine_training_set)
summary(wine_mlr)
Call:
lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid +
residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
density + pH + sulphates + alcohol, data = wine_training_set)
Residuals:
Min 1Q Median 3Q Max
-2.71146 -0.36404 -0.04722 0.45004 2.02685
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.422e+01 2.393e+01 1.012 0.31172
fixed.acidity 2.667e-02 2.896e-02 0.921 0.35741
volatile.acidity -8.976e-01 1.353e-01 -6.631 4.91e-11 ***
citric.acid -1.476e-01 1.658e-01 -0.890 0.37350
residual.sugar 1.332e-02 1.661e-02 0.802 0.42271
chlorides -1.348e+00 4.685e-01 -2.877 0.00409 **
free.sulfur.dioxide 5.388e-03 2.425e-03 2.222 0.02644 *
total.sulfur.dioxide -3.473e-03 8.159e-04 -4.256 2.23e-05 ***
density -2.049e+01 2.442e+01 -0.839 0.40161
pH -3.562e-01 2.141e-01 -1.664 0.09638 .
sulphates 8.711e-01 1.303e-01 6.687 3.42e-11 ***
alcohol 2.767e-01 2.986e-02 9.265 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6506 on 1267 degrees of freedom
Multiple R-squared: 0.3383, Adjusted R-squared: 0.3325
F-statistic: 58.88 on 11 and 1267 DF, p-value: < 2.2e-16
Housing model (training set):
RSE: 147,000 on 87 degrees of freedom.
This RSE is very high relative to the scale of house prices, and the R-squared of 0.05548 means covid cases and vaccinations account for only about 5.5% of the variation in mean housing price.
Wine model (training set):
RSE: 0.6506 on 1267 degrees of freedom.
This RSE is low relative to the scale of the quality scores, and the R-squared of 0.3383 means the chemical predictors account for roughly 33.8% of the variation in quality.
The housing model appears to be underfitting the data: its low R-squared means it accounts for only a small portion of the variation in housing prices. Improving it would likely require either a more flexible model with stronger predictors or more training data; it is also possible that covid cases and vaccinations simply carry little information about housing prices.
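For completeness, a sketch of how the out-of-sample error of the wine model could be computed on its held-out rows (assuming wine_mlr and the id-based wine_testing_set from above), for comparison against the housing model's test error:
#out-of-sample sketch for the wine model
wine_predictions <- predict(wine_mlr, newdata = wine_testing_set)
wine_test_rmse <- sqrt(mean((wine_testing_set$quality - wine_predictions)^2))
cat("wine test-set RMSE:", wine_test_rmse)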