State the problem and the setting of the analysis (i.e., Philadelphia).
Present either a brief review of the literature (use Google Scholar) or simply speculate as to why the predictors we’re using might be related to the response variable.
The original Philadelphia block group dataset has 1816 observations. We clean the data by removing the following block groups:
The final dataset contains 1720 block groups.
State that you will examine the summary statistics and distributions of variables.
Also state that as part of your exploratory data analysis, you will examine the correlations between the predictors.
Explain what a correlation is, and provide the formula for the sample correlation coefficient r. Also mention the possible range of r values, and what correlation of 0 means.
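For reference, the sample correlation coefficient between two variables $x$ and $y$ is

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}},$$

where $\bar{x}$ and $\bar{y}$ are the sample means; $r$ ranges from $-1$ to $1$, and $r = 0$ means there is no linear association between the variables.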
Describe the method of regression in several sentences. I.e., what is it used for, what does it do?
State the equation for y for this problem. The equation should be of the form $y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \varepsilon$. However, in your report, instead of $y$ and $x_1, \dots, x_k$, fill in the actual variable names (as in the regression example given above). Be sure to mention what the $\beta_i$’s and $\varepsilon$ are as well.
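One concrete instance, matching the model actually fitted in the Results section below (note that the outline expects the dependent variable to be log-transformed, so the left-hand side may instead be $\ln(\mathrm{MEDHVAL})$):

$$\mathrm{MEDHVAL} = \beta_0 + \beta_1\,\mathrm{PCTVACANT} + \beta_2\,\mathrm{PCTSINGLES} + \beta_3\,\mathrm{PCTBACHMOR} + \beta_4\,\mathrm{LNNBELPOV} + \varepsilon,$$

where $\beta_0$ is the intercept, each $\beta_i$ gives the change in the response associated with a one-unit change in its predictor holding the others constant, and $\varepsilon$ is the random error term.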
State and explain regression assumptions (e.g., linearity; independence of observations; normality of residuals; homoscedasticity; no multicollinearity).
Mention the parameters that need to be estimated in multiple regression ($\sigma^2, \beta_0, \dots, \beta_k$). State what $\sigma^2$ is (you should have already talked about the $\beta_i$ in (ii) above).
Talk about how the parameters are estimated. (Hint: present the equation on the slide ‘β Coefficient Estimation – Least Squares’ for multiple regression and briefly discuss what the equation does.)
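We cannot reproduce the slide’s exact notation here, but the standard form is: least squares chooses the estimates $\hat{\beta}_0, \dots, \hat{\beta}_k$ that minimize the residual sum of squares $\sum_{i}(y_i - \hat{y}_i)^2$, which in matrix notation yields

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y},$$

where $\mathbf{X}$ is the design matrix of predictor values (with a leading column of 1s for the intercept) and $\mathbf{y}$ is the vector of observed responses.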
Talk about the coefficient of multiple determination R2, and the adjusted R2. Present and explain the relevant formulas and all the terms that are used in the formulas.
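The relevant formulas:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}, \qquad R^2_{adj} = 1 - \frac{SS_{res}/(n-k-1)}{SS_{tot}/(n-1)},$$

where $SS_{res} = \sum_{i}(y_i - \hat{y}_i)^2$ is the residual sum of squares, $SS_{tot} = \sum_{i}(y_i - \bar{y})^2$ is the total sum of squares, $n$ is the number of observations, and $k$ is the number of predictors; the adjustment penalizes predictors that contribute little explanatory power.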
State the hypotheses you test. Specifically, talk about the F-ratio and the H0 and Ha associated with it, as well as the hypotheses you test about each of the βi’s (again, state H0 and Ha).
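In symbols: the overall test is $H_0\!: \beta_1 = \beta_2 = \dots = \beta_k = 0$ against $H_a\!:$ at least one $\beta_i \neq 0$, using

$$F = \frac{(SS_{tot} - SS_{res})/k}{SS_{res}/(n-k-1)},$$

while each individual coefficient is tested as $H_0\!: \beta_i = 0$ against $H_a\!: \beta_i \neq 0$ with a $t$-statistic (the estimate divided by its standard error).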
Talk about stepwise regression – discuss what it does and its limitations.
Talk about k-fold cross-validation (mentioning that k = 5) – discuss what it is used for, describe how it is operationalized, and mention that the RMSE is used to compare models (explain what the RMSE is and how it is calculated, presenting and describing any relevant formulas).
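For reference,

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2},$$

the square root of the mean squared difference between the observed values $y_i$ and the predictions $\hat{y}_i$ for the held-out fold; lower values indicate better out-of-sample performance.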
The analyses and visualizations for this report have all been done in R. Relevant packages and Markdown settings can be seen by unhiding the code in this section.
library(tidyverse) #general
library(sf) #spatial
library(mapview) #quick mapping
library(tmap) #full mapping
library(ggpubr) #for ggarrange
library(gt) #for tables
library(glue) #for tables
library(janitor) #to clean col names
library(corrplot) #for easy correlation matrix
library(MASS) #for stepwise regression
library(caret) #for k-fold cross-validation
knitr::opts_chunk$set(echo = T, message = F, warning = F, error = F)
In order to complete this entire project in R (rather than also using ArcGIS), we use the shapefile version of the data rather than the .csv. Below, we import the shapefile and use a custom function to apply log transformations to the relevant columns: the function checks whether each column contains zero values and applies log(x) if not, or log(1 + x) if it does.
reg_data = st_read('C:/Users/Nissim/Desktop/Fall 2022/Spat Stats/ass_1_data_shp/RegressionData.shp')
## Reading layer `RegressionData' from data source
## `C:\Users\Nissim\Desktop\Fall 2022\Spat Stats\ass_1_data_shp\RegressionData.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 1720 features and 13 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 2660605 ymin: 207610.6 xmax: 2750171 ymax: 304858.8
## CRS: NA
#define a function that returns the proportion of non-zero values in a column
#(a return value of 1 means the column contains no zeros)
col_zeros = function(a, b) {
  pct_col_zeros = count(subset(st_drop_geometry(a), b != 0)) |>
    pull(n) / nrow(st_drop_geometry(a))
  return(pct_col_zeros)
}
#apply the function with a case_when statement: log(x) for columns with no
#zeros, log(1 + x) otherwise. ifelse() would not work here: its result has the
#length of its (length-1) condition, so every row of the mutated column would
#get the same recycled value. case_when() recycles the length-1 condition
#across the full column instead, which is the behavior we want.
reg_data = reg_data |>
mutate(
ln_med_h_val = case_when(col_zeros(reg_data, reg_data$MEDHVAL) == 1 ~ log(reg_data$MEDHVAL),
TRUE ~ log(1 + reg_data$MEDHVAL)),
ln_pct_bach_more = case_when(col_zeros(reg_data, reg_data$PCTBACHMOR) == 1 ~ log(reg_data$PCTBACHMOR),
TRUE ~ log(1 + reg_data$PCTBACHMOR)),
ln_n_bel_pov_100 = case_when(col_zeros(reg_data, reg_data$NBelPov100) == 1 ~ log(reg_data$NBelPov100),
TRUE ~ log(1 + reg_data$NBelPov100)),
ln_pct_vacant = case_when(col_zeros(reg_data, reg_data$PCTVACANT) == 1 ~ log(reg_data$PCTVACANT),
TRUE ~ log(1 + reg_data$PCTVACANT)),
ln_pct_singles = case_when(col_zeros(reg_data, reg_data$PCTSINGLES) == 1 ~ log(reg_data$PCTSINGLES),
TRUE ~ log(1 + reg_data$PCTSINGLES)),
)
Present and briefly talk about the table of summary statistics, which includes the dependent variable and the predictors (i.e., the mean and standard deviation of each).
med_house_val = c("Median House Value", mean(reg_data$MEDHVAL), sd(reg_data$MEDHVAL))
hhs_in_pov = c("# Households Living in Poverty", mean(reg_data$NBelPov100), sd(reg_data$NBelPov100))
pct_w_bach_or_higher = c("% of Individuals with Bachelor's Degrees or Higher", mean(reg_data$PCTBACHMOR), sd(reg_data$PCTBACHMOR))
pct_vac_houses = c("% of Vacant Houses", mean(reg_data$PCTVACANT), sd(reg_data$PCTVACANT))
pct_sing_house_units = c("% of Single House Units", mean(reg_data$PCTSINGLES), sd(reg_data$PCTSINGLES))
table = as.data.frame(t(data.frame(
med_house_val,
hhs_in_pov,
pct_w_bach_or_higher,
pct_vac_houses,
pct_sing_house_units
)))
colnames(table) = c("Variable", "Mean", "SD")
table$Mean = as.numeric(table$Mean)
table$SD = as.numeric(table$SD)
table = table |>
mutate_if(is.numeric, round, digits = 3)
table_out = table |>
gt() |>
tab_header(
title = md("**Summary Statistics**")
) |>
tab_row_group(
label = md('**Predictors**'),
rows = 2:5
) |>
tab_row_group(
label = md('**Dependent Variable**'),
rows = 1
)
#print output
table_out
**Summary Statistics**

| Variable | Mean | SD |
|---|---|---|
| **Dependent Variable** | | |
| Median House Value | 66287.733 | 60006.076 |
| **Predictors** | | |
| # Households Living in Poverty | 189.771 | 164.318 |
| % of Individuals with Bachelor's Degrees or Higher | 16.081 | 17.770 |
| % of Vacant Houses | 11.289 | 9.628 |
| % of Single House Units | 9.226 | 13.249 |
Also state whether the variables are normal before and after the logarithmic transformation.
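A Shapiro–Wilk test offers a numeric complement to the visual checks below; its null hypothesis is normality, so a small p-value signals departure from it. A minimal sketch for the dependent variable (the same pattern applies to the predictors):

#Shapiro-Wilk normality test (H0: the variable is normally distributed);
#n = 1720 is within the test's 5000-observation limit
shapiro.test(reg_data$MEDHVAL)      #before the log transformation
shapiro.test(reg_data$ln_med_h_val) #after the log transformation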
house_val = ggplot(reg_data) +
geom_histogram(aes(MEDHVAL)) +
geom_vline(xintercept = mean(reg_data$MEDHVAL), color = 'darkred') +
geom_vline(xintercept = (mean(reg_data$MEDHVAL) + sd(reg_data$MEDHVAL)), linetype = 'dashed')+
geom_vline(xintercept = (mean(reg_data$MEDHVAL) - sd(reg_data$MEDHVAL)), linetype = 'dashed') +
labs(title = "Figure 1a",
subtitle = "Histogram of Median House Values",
x = "Median House Value") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.title.y = element_blank()) +
annotate("text", x = mean(reg_data$MEDHVAL), y = 850, label = "bar(x) ", size = 7, parse = T)+
annotate("text", x = (mean(reg_data$MEDHVAL) + sd(reg_data$MEDHVAL)+1000), y = 850, label = "~sigma ", size = 7, parse = T)+
annotate("text", x = (mean(reg_data$MEDHVAL) - sd(reg_data$MEDHVAL)-2500), y = 850, label = "~-sigma ", size = 7, parse = T)
pct_bach = ggplot(reg_data) +
geom_histogram(aes(PCTBACHMOR)) +
geom_vline(xintercept = mean(reg_data$PCTBACHMOR), color = 'darkred') +
geom_vline(xintercept = (mean(reg_data$PCTBACHMOR) + sd(reg_data$PCTBACHMOR)), linetype = 'dashed')+
geom_vline(xintercept = (mean(reg_data$PCTBACHMOR) - sd(reg_data$PCTBACHMOR)), linetype = 'dashed') +
labs(title = "Figure 1b",
subtitle = "Histogram of Educational Achievement",
x = "% w/a Bachelor's or Higher") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.title.y = element_blank()) +
annotate("text", x = mean(reg_data$PCTBACHMOR), y = 300, label = "bar(x) ", size = 5, parse = T)+
annotate("text", x = (mean(reg_data$PCTBACHMOR) + sd(reg_data$PCTBACHMOR)), y = 300, label = "~sigma ", size = 5, parse = T)+
annotate("text", x = (mean(reg_data$PCTBACHMOR) - sd(reg_data$PCTBACHMOR)), y = 300, label = "~-sigma ", size = 5, parse = T)
nbelpov = ggplot(reg_data) +
geom_histogram(aes(NBelPov100)) +
geom_vline(xintercept = mean(reg_data$NBelPov100), color = 'darkred') +
geom_vline(xintercept = (mean(reg_data$NBelPov100) + sd(reg_data$NBelPov100)), linetype = 'dashed')+
geom_vline(xintercept = (mean(reg_data$NBelPov100) - sd(reg_data$NBelPov100)), linetype = 'dashed') +
labs(title = "Figure 1c",
subtitle = "Histogram of Poverty Levels",
x = "# Below Poverty Line") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.title.y = element_blank()) +
annotate("text", x = mean(reg_data$NBelPov100), y = 300, label = "bar(x) ", size = 5, parse = T)+
annotate("text", x = (mean(reg_data$NBelPov100) + sd(reg_data$NBelPov100)), y = 300, label = "~sigma ", size = 5, parse = T)+
annotate("text", x = (mean(reg_data$NBelPov100) - sd(reg_data$NBelPov100)), y = 300, label = "~-sigma ", size = 5, parse = T)
pct_vac = ggplot(reg_data) +
geom_histogram(aes(PCTVACANT)) +
geom_vline(xintercept = mean(reg_data$PCTVACANT), color = 'darkred') +
geom_vline(xintercept = (mean(reg_data$PCTVACANT) + sd(reg_data$PCTVACANT)), linetype = 'dashed')+
geom_vline(xintercept = (mean(reg_data$PCTVACANT) - sd(reg_data$PCTVACANT)), linetype = 'dashed') +
labs(title = "Figure 1d",
subtitle = "Histogram of Vacancy Rates",
x = "% Vacancy Rate") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.title.y = element_blank()) +
annotate("text", x = mean(reg_data$PCTVACANT), y = 275, label = "bar(x) ", size = 5, parse = T)+
annotate("text", x = (mean(reg_data$PCTVACANT) + sd(reg_data$PCTVACANT)), y = 275, label = "~sigma ", size = 5, parse = T)+
annotate("text", x = (mean(reg_data$PCTVACANT) - sd(reg_data$PCTVACANT)), y = 275, label = "~-sigma ", size = 5, parse = T)
pct_sing = ggplot(reg_data) +
geom_histogram(aes(PCTSINGLES)) +
geom_vline(xintercept = mean(reg_data$PCTSINGLES), color = 'darkred') +
geom_vline(xintercept = (mean(reg_data$PCTSINGLES) + sd(reg_data$PCTSINGLES)), linetype = 'dashed')+
geom_vline(xintercept = (mean(reg_data$PCTSINGLES) - sd(reg_data$PCTSINGLES)), linetype = 'dashed') +
labs(title = "Figure 1e",
subtitle = "Histogram of Single House Units",
x = "% Single House Units") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.title.y = element_blank()) +
annotate("text", x = mean(reg_data$PCTSINGLES), y = 450, label = "bar(x) ", size = 5, parse = T)+
annotate("text", x = (mean(reg_data$PCTSINGLES) + sd(reg_data$PCTSINGLES)), y = 450, label = "~sigma ", size = 5, parse = T)+
annotate("text", x = (mean(reg_data$PCTSINGLES) - sd(reg_data$PCTSINGLES)), y = 450, label = "~-sigma ", size = 5, parse = T)
house_val
ggarrange(pct_bach, nbelpov, pct_vac, pct_sing)
Present the histograms of the transformed variables and clearly state whether you’re using the log-transformed or original variable in your regression.
State that the other regression assumptions will be examined in a separate section below (Regression Assumption Checks).
ln_house_val = ggplot(reg_data) +
geom_histogram(aes(ln_med_h_val)) +
geom_vline(xintercept = mean(reg_data$ln_med_h_val), color = 'darkred') +
geom_vline(xintercept = (mean(reg_data$ln_med_h_val) + sd(reg_data$ln_med_h_val)), linetype = 'dashed')+
geom_vline(xintercept = (mean(reg_data$ln_med_h_val) - sd(reg_data$ln_med_h_val)), linetype = 'dashed') +
labs(title = "Figure 2a",
subtitle = "Histogram of Ln of Median House Values",
x = "Ln of Median House Value") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.title.y = element_blank()) +
annotate("text", x = mean(reg_data$ln_med_h_val), y = 225, label = "bar(x) ", size = 7, parse = T)+
annotate("text", x = (mean(reg_data$ln_med_h_val) + sd(reg_data$ln_med_h_val)), y = 225, label = "~sigma ", size = 7, parse = T)+
annotate("text", x = (mean(reg_data$ln_med_h_val) - sd(reg_data$ln_med_h_val)), y = 225, label = "~-sigma ", size = 7, parse = T)
ln_pct_bach = ggplot(reg_data) +
geom_histogram(aes(ln_pct_bach_more)) +
geom_vline(xintercept = mean(reg_data$ln_pct_bach_more), color = 'darkred') +
geom_vline(xintercept = (mean(reg_data$ln_pct_bach_more) + sd(reg_data$ln_pct_bach_more)), linetype = 'dashed')+
geom_vline(xintercept = (mean(reg_data$ln_pct_bach_more) - sd(reg_data$ln_pct_bach_more)), linetype = 'dashed') +
labs(title = "Figure 2b",
subtitle = "Histogram of Ln of Educational Achievement",
x = "Ln of Educational Achievement") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.title.y = element_blank()) +
annotate("text", x = mean(reg_data$ln_pct_bach_more), y = 140, label = "bar(x) ", size = 5, parse = T)+
annotate("text", x = (mean(reg_data$ln_pct_bach_more) + sd(reg_data$ln_pct_bach_more)), y = 140, label = "~sigma ", size = 5, parse = T)+
annotate("text", x = (mean(reg_data$ln_pct_bach_more) - sd(reg_data$ln_pct_bach_more)), y = 140, label = "~-sigma ", size = 5, parse = T)
ln_nbelpov = ggplot(reg_data) +
geom_histogram(aes(ln_n_bel_pov_100)) +
geom_vline(xintercept = mean(reg_data$ln_n_bel_pov_100), color = 'darkred') +
geom_vline(xintercept = (mean(reg_data$ln_n_bel_pov_100) + sd(reg_data$ln_n_bel_pov_100)), linetype = 'dashed')+
geom_vline(xintercept = (mean(reg_data$ln_n_bel_pov_100) - sd(reg_data$ln_n_bel_pov_100)), linetype = 'dashed') +
labs(title = "Figure 2c",
subtitle = "Histogram of Ln of Poverty Levels",
x = "Ln of # Below Poverty Line") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.title.y = element_blank()) +
annotate("text", x = mean(reg_data$ln_n_bel_pov_100), y = 225, label = "bar(x) ", size = 5, parse = T)+
annotate("text", x = (mean(reg_data$ln_n_bel_pov_100) + sd(reg_data$ln_n_bel_pov_100)), y = 225, label = "~sigma ", size = 5, parse = T)+
annotate("text", x = (mean(reg_data$ln_n_bel_pov_100) - sd(reg_data$ln_n_bel_pov_100)), y = 225, label = "~-sigma ", size = 5, parse = T)
ln_pct_vac = ggplot(reg_data) +
geom_histogram(aes(ln_pct_vacant)) +
geom_vline(xintercept = mean(reg_data$ln_pct_vacant), color = 'darkred') +
geom_vline(xintercept = (mean(reg_data$ln_pct_vacant) + sd(reg_data$ln_pct_vacant)), linetype = 'dashed')+
geom_vline(xintercept = (mean(reg_data$ln_pct_vacant) - sd(reg_data$ln_pct_vacant)), linetype = 'dashed') +
labs(title = "Figure 2d",
subtitle = "Histogram of Ln of Vacancy Rates",
x = "Ln of % Vacancy Rate") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.title.y = element_blank()) +
annotate("text", x = mean(reg_data$ln_pct_vacant), y = 150, label = "bar(x) ", size = 5, parse = T)+
annotate("text", x = (mean(reg_data$ln_pct_vacant) + sd(reg_data$ln_pct_vacant)), y = 150, label = "~sigma ", size = 5, parse = T)+
annotate("text", x = (mean(reg_data$ln_pct_vacant) - sd(reg_data$ln_pct_vacant)), y = 150, label = "~-sigma ", size = 5, parse = T)
ln_pct_sing = ggplot(reg_data) +
geom_histogram(aes(ln_pct_singles)) +
geom_vline(xintercept = mean(reg_data$ln_pct_singles), color = 'darkred') +
geom_vline(xintercept = (mean(reg_data$ln_pct_singles) + sd(reg_data$ln_pct_singles)), linetype = 'dashed')+
geom_vline(xintercept = (mean(reg_data$ln_pct_singles) - sd(reg_data$ln_pct_singles)), linetype = 'dashed') +
labs(title = "Figure 2e",
subtitle = "Histogram of Ln of Single House Units",
x = "Ln of % Single House Units") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.title.y = element_blank()) +
annotate("text", x = mean(reg_data$ln_pct_singles), y = 225, label = "bar(x) ", size = 5, parse = T)+
annotate("text", x = (mean(reg_data$ln_pct_singles) + sd(reg_data$ln_pct_singles)), y = 225, label = "~sigma ", size = 5, parse = T)+
annotate("text", x = (mean(reg_data$ln_pct_singles) - sd(reg_data$ln_pct_singles)), y = 225, label = "~-sigma ", size = 5, parse = T)
ln_house_val
ggarrange(ln_pct_bach, ln_nbelpov, ln_pct_vac, ln_pct_sing)
Present the choropleth maps of the dependent variable and the predictors which you created.
Refer to the maps in the text, and talk about the following:
Which maps look similar? Which maps look different? That is, which predictors do you expect to be strongly associated with the dependent variable based on the visualization? Also, given your examination of the maps, are there any predictors that you think will be strongly inter-correlated? That is, do you expect severe multicollinearity to be an issue here? Discuss this in a paragraph.
#lifted from lovelace: https://geocompr.robinlovelace.net/adv-map.html#faceted-maps
tmap_mode("plot")
phl_city_lims = st_read("C:/Users/Nissim/Desktop/Fall 2022/Spat Stats/phl_city_limits/City_Limits.shp")
## Reading layer `City_Limits' from data source
## `C:\Users\Nissim\Desktop\Fall 2022\Spat Stats\phl_city_limits\City_Limits.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 3 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -75.28031 ymin: 39.86747 xmax: -74.95575 ymax: 40.13793
## Geodetic CRS: WGS 84
tm_shape(reg_data) +
tm_polygons(title = "Ln of Median House Value", col = "ln_med_h_val", border.col = NA, border.alpha = 0, lwd = 0, palette = "Blues", style = "jenks") +
tm_shape(phl_city_lims) +
tm_borders(col = "grey", lwd = 5) +
tm_compass(position = c("left", "top")) +
tm_layout(main.title = "Figure 3a",
legend.position = c("right", "bottom"))
facets = c("ln_pct_bach_more",
"ln_n_bel_pov_100",
"ln_pct_vacant",
"ln_pct_singles")
facet_titles = c("Ln of Edu. Attain.",
"Ln of Pov. Levels",
"Ln of Vacancy",
"Ln of Single Occ")
tm_shape(reg_data) +
tm_polygons(facets, title = facet_titles, border.col = NA, border.alpha = 0,lwd = 0, palette = "Blues", style = "jenks") +
tm_facets(nrow = 2, sync = TRUE) +
tm_layout(legend.position = c("right", "bottom"),
panel.labels = c("Figure 3b",
"Figure 3c",
"Figure 3d",
"Figure 3e"))
Present the correlation matrix of the predictors which you obtained from R.
Talk about whether the correlation matrix shows that there is severe multicollinearity.
Does the correlation matrix support your conclusions based on your visual comparison of predictor maps?
#https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html
corr_reg_data = reg_data |>
st_drop_geometry() |>
dplyr::select(
PCTVACANT,
PCTSINGLES,
PCTBACHMOR,
LNNBELPOV)
corrplot(cor(corr_reg_data), method = "number", type = "lower", tl.col = "black", tl.cex = 0.75, number.cex = 1)
Present the regression output from R. Be sure that your output presents the parameter estimates (and associated standard errors, t-statistics and p-values), as well as the R2, the adjusted R2, and the relevant F-ratio and associated p-value.
Referencing the regression output in (i) above, interpret the results as in the example included above this report outline. Note: your dependent variable (and some predictors) will be log-transformed, unlike in the example here; look at the slides for examples of interpreting regression output with log-transformed variables.
lm = lm(MEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + ln_n_bel_pov_100, data = reg_data)
summary(lm)
##
## Call:
## lm(formula = MEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR +
## ln_n_bel_pov_100, data = reg_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -142167 -13065 -1156 10103 845813
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60167.52 5095.12 11.809 < 2e-16 ***
## PCTVACANT -578.91 107.07 -5.407 7.32e-08 ***
## PCTSINGLES 754.91 76.99 9.805 < 2e-16 ***
## PCTBACHMOR 2051.89 59.48 34.499 < 2e-16 ***
## ln_n_bel_pov_100 -5681.22 925.99 -6.135 1.05e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40130 on 1715 degrees of freedom
## Multiple R-squared: 0.5538, Adjusted R-squared: 0.5528
## F-statistic: 532.2 on 4 and 1715 DF, p-value: < 2.2e-16
anova(lm)
## Analysis of Variance Table
##
## Response: MEDHVAL
## Df Sum Sq Mean Sq F value Pr(>F)
## PCTVACANT 1 6.6026e+11 6.6026e+11 410.032 < 2.2e-16 ***
## PCTSINGLES 1 5.0734e+11 5.0734e+11 315.064 < 2.2e-16 ***
## PCTBACHMOR 1 2.1998e+12 2.1998e+12 1366.112 < 2.2e-16 ***
## ln_n_bel_pov_100 1 6.0614e+10 6.0614e+10 37.642 1.053e-09 ***
## Residuals 1715 2.7616e+12 1.6103e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
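To tie back to the multicollinearity question raised in the correlation-matrix section, variance inflation factors can be computed from this fitted model. A minimal sketch, assuming the car package is installed (it is not loaded in the setup above):

#variance inflation factors as a complementary multicollinearity check;
#common rules of thumb flag values above ~4 (or, more leniently, 10)
library(car)
vif(lm)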
pred_vals = fitted(lm)
resids = residuals(lm)
stand_resids = rstandard(lm)
lm_df = data.frame(reg_data$MEDHVAL, pred_vals, stand_resids) |>
  rename(MEDHVAL = reg_data.MEDHVAL)
First state that in this section, you will be talking about testing model assumptions and aptness. State that you have already looked at the variable distributions earlier.
Present scatter plots of the dependent variable and each of the predictors. State whether each of the relationships seems to be linear, as assumed by the regression model. [Hint: they will not look linear.]
Question here: are we meant to be using the original predictors or the log-transformed columns? (See section 1b)
#scatter plots of the dependent variable against each predictor
#(dependent variable on the y-axis by convention)
pct_bach_plot = ggplot(reg_data) +
  geom_point(aes(x = PCTBACHMOR,
                 y = MEDHVAL)) +
  theme_minimal()
nbelpov_plot = ggplot(reg_data) +
  geom_point(aes(x = NBelPov100,
                 y = MEDHVAL)) +
  theme_minimal()
pct_vac_plot = ggplot(reg_data) +
  geom_point(aes(x = PCTVACANT,
                 y = MEDHVAL)) +
  theme_minimal()
pct_sing_plot = ggplot(reg_data) +
  geom_point(aes(x = PCTSINGLES,
                 y = MEDHVAL)) +
  theme_minimal()
ggarrange(pct_bach_plot, nbelpov_plot, pct_vac_plot, pct_sing_plot)
Present the histogram of the standardized residuals. State whether the residuals look normal.
#attach fitted values and standardized residuals back to reg_data for mapping.
#fitted() and rstandard() return values in the original row order, so we can
#bind them as new columns; joining by MEDHVAL would be unsafe because it is
#not a unique key, and duplicated values would multiply rows.
reg_data$pred_vals = pred_vals
reg_data$stand_resids = stand_resids
ggplot(reg_data) +
geom_histogram(aes(x = stand_resids)) +
theme_minimal()
Present the ‘Standardized Residual by Predicted Value’ scatter plot. What conclusions can you draw from that? Does there seem to be heteroscedasticity? Do there seem to be outliers? Anything else? Discuss.
ggplot(lm_df) +
  geom_point(aes(x = pred_vals, y = stand_resids)) +
  geom_hline(yintercept = 0, linetype = 'dashed') + #reference line at zero
  theme_minimal()
Mention what standardized residuals are.
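For reference, the $i$-th standardized residual (what `rstandard()` computes) is

$$r_i = \frac{e_i}{s\sqrt{1 - h_{ii}}},$$

where $e_i$ is the raw residual, $s$ is the residual standard error of the model, and $h_{ii}$ is the observation’s leverage; dividing by the estimated standard deviation puts the residuals on a common scale.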
Referencing the maps of the dependent variable and the predictors that you presented earlier, state whether there seems to be spatial autocorrelation in your variables. That is, does it seem that the observations (i.e., block groups) are independent of each other? Briefly discuss.
Now, present the choropleth map of the standardized regression residuals. Do there seem to be any noticeable spatial patterns in them? That is, do they seem to be spatially autocorrelated?
You will examine the spatial autocorrelation of the variables and residuals and run spatial regressions in the next assignment.
tm_shape(reg_data) +
  tm_polygons(col = "stand_resids", border.col = NA, border.alpha = 0.1, lwd = 0, palette = "RdBu", style = "jenks") + #diverging palette, since residuals are signed
  tm_layout(legend.position = c("right", "bottom"))
Present the results of the stepwise regression and state whether all 4 predictors in the original model are kept in the final model.
stepAIC(lm)
## Start: AIC=36468.43
## MEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + ln_n_bel_pov_100
##
## Df Sum of Sq RSS AIC
## <none> 2.7616e+12 36468
## - PCTVACANT 1 4.7073e+10 2.8087e+12 36495
## - ln_n_bel_pov_100 1 6.0614e+10 2.8222e+12 36504
## - PCTSINGLES 1 1.5480e+11 2.9164e+12 36560
## - PCTBACHMOR 1 1.9165e+12 4.6781e+12 37373
##
## Call:
## lm(formula = MEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR +
## ln_n_bel_pov_100, data = reg_data)
##
## Coefficients:
## (Intercept) PCTVACANT PCTSINGLES PCTBACHMOR
## 60167.5 -578.9 754.9 2051.9
## ln_n_bel_pov_100
## -5681.2
Present the cross-validation results – that is, compare the RMSE of the original model that includes all 4 predictors with the RMSE of the model that only includes PCTVACANT and MEDHHINC as predictors.
#note: we first attempted the cross-validation with DAAG::CVlm, but ran into
#errors with this model object, so we use caret instead
#rmse for full model
lm_ii = trainControl(method = "cv", number = 5)
cvlm_model = train(MEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + ln_n_bel_pov_100, data = reg_data, method = "lm", trControl = lm_ii)
print(cvlm_model)
## Linear Regression
##
## 4844 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 3875, 3876, 3875, 3876, 3874
## Resampling results:
##
## RMSE Rsquared MAE
## 29437.05 0.5619683 15822.81
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
#rmse for reduced model (just PCTVACANT and MEDHHINC)
lm_ii_reduced = trainControl(method = "cv", number = 5)
cvlm_model_reduced = train(MEDHVAL ~ PCTVACANT + MEDHHINC, data = reg_data, method = "lm", trControl = lm_ii_reduced)
print(cvlm_model_reduced)
## Linear Regression
##
## 4844 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 3877, 3874, 3875, 3876, 3874
## Resampling results:
##
## RMSE Rsquared MAE
## 35888.3 0.3475246 19336.82
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Recap what you did in the paper and your findings. Discuss what conclusions you can draw, which variables were significant, and whether that was surprising.
Talk about the quality of the model – that is, state whether this is a good model overall (e.g., based on the R2 and the F-ratio test), and what other predictors that we didn’t include in our model might be associated with our dependent variable.
If you ran the stepwise regression, did the final model include all 4 predictors or were some dropped? What does that tell you about the quality of the model?
If you used cross-validation, was the RMSE better for the 4 predictor model or the 2 predictor model?
If you haven’t done so in the Results section, talk explicitly about the limitations of the model – that is, mention which assumptions were violated and, if applicable, how that may affect the model, the parameter estimates, and their estimated significance.
In addition, talk about the limitations of using the NBELPOV100 variable as a predictor – that is, what are some limitations of using the raw number of households living in poverty rather than a percentage?