0.1 Introduction

State the problem and the setting of the analysis (i.e., Philadelphia).

Present either a brief review of the literature (use Google Scholar) or simply speculate as to why the predictors we’re using might be related to the response variable.


0.2 Methods

0.2.1 Data Cleaning

The original Philadelphia block group dataset has 1816 observations. We clean the data by removing the following block groups:

  1. Block groups where population < 40
  2. Block groups where there are no housing units
  3. Block groups where the median house value is lower than $10,000
  4. One North Philadelphia block group that had a very high median house value (over $800,000) and a very low median household income (less than $8,000)

The final dataset contains 1720 block groups.
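
As a quick sketch, these filters could be expressed in dplyr as follows. POP and HOUSEUNITS are hypothetical names for the population and housing-unit count columns; MEDHVAL and MEDHHINC are columns used later in this report.

reg_data_clean = reg_data |>
  filter(POP >= 40,                               #keep population >= 40
         HOUSEUNITS > 0,                          #keep block groups with housing units
         MEDHVAL >= 10000,                        #drop very low median house values
         !(MEDHVAL > 800000 & MEDHHINC < 8000))   #drop the North Philadelphia outlier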

0.2.2 Exploratory Data Analysis

State that you will examine the summary statistics and distributions of variables.

Also state that as part of your exploratory data analysis, you will examine the correlations between the predictors.

Explain what a correlation is, and provide the formula for the sample correlation coefficient $r$. Also mention the possible range of $r$ values, and what a correlation of 0 means.
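
For reference, the sample correlation coefficient between two variables $x$ and $y$ is

$$ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}} $$

where $\bar{x}$ and $\bar{y}$ are the sample means and $n$ is the number of observations. $r$ always lies between $-1$ and $1$, and $r = 0$ indicates no linear association between the two variables.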

0.2.3 Multiple Regression Analysis

Describe the method of regression in several sentences. That is, what is it used for, and what does it do?

State the equation for $y$ for this problem. The equation should be of the form $y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \varepsilon$. However, in your report, instead of $y$ and $x_1, \dots, x_k$, fill in the actual variable names (as in the regression example given above). Be sure to mention what the $\beta_i$’s and $\varepsilon$ are as well.
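
One possible form for this problem, assuming the log-transformed dependent variable and poverty count used elsewhere in this report (the variable set should match whichever model is ultimately fitted):

$$ \mathrm{LNMEDHVAL} = \beta_0 + \beta_1\,\mathrm{PCTVACANT} + \beta_2\,\mathrm{PCTSINGLES} + \beta_3\,\mathrm{PCTBACHMOR} + \beta_4\,\mathrm{LNNBELPOV} + \varepsilon $$

Here each $\beta_i$ is the expected change in the dependent variable for a one-unit increase in the corresponding predictor, holding the other predictors constant, and $\varepsilon$ is the random error term.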

State and explain regression assumptions (e.g., linearity; independence of observations; normality of residuals; homoscedasticity; no multicollinearity).

Mention the parameters that need to be estimated in multiple regression ($\sigma^2, \beta_0, \dots, \beta_k$). State what $\sigma^2$ is (you should have already talked about the $\beta_i$’s in (ii) above).

Talk about how the parameters are estimated. (Hint: present the equation on the slide ‘β Coefficient Estimation – Least Squares’ for multiple regression and briefly discuss what the equation does.)
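
In matrix form, the least squares estimator can be written as

$$ \hat{\boldsymbol{\beta}} = (\mathbf{X}^{\mathsf{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{y} $$

where $\mathbf{X}$ is the matrix of predictor values (with a leading column of 1s for the intercept) and $\mathbf{y}$ is the vector of observed responses. This chooses the coefficient estimates that minimize the sum of squared residuals, $\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$.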

Talk about the coefficient of multiple determination $R^2$ and the adjusted $R^2$. Present and explain the relevant formulas and all the terms that are used in the formulas.
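
The standard formulas are

$$ R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} \qquad \text{and} \qquad R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1} $$

where $SSE$ is the error (residual) sum of squares, $SST$ is the total sum of squares, $n$ is the number of observations, and $k$ is the number of predictors. The adjusted version penalizes $R^2$ for predictors that add little explanatory power.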

State the hypotheses you test. Specifically, talk about the F-ratio and the $H_0$ and $H_a$ associated with it, as well as the hypotheses you test about each of the $\beta_i$’s (again, state $H_0$ and $H_a$).
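
For the F-ratio, the hypotheses are

$$ H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0 \qquad \text{vs.} \qquad H_a: \text{at least one } \beta_i \neq 0 $$

and for each individual coefficient, tested with a t-statistic,

$$ H_0: \beta_i = 0 \qquad \text{vs.} \qquad H_a: \beta_i \neq 0 $$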

0.2.4 Additional Analyses

Talk about stepwise regression – discuss what it does and its limitations.
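
A minimal sketch of stepwise selection with MASS::stepAIC(), assuming a fitted lm object named fit (the stepwise run for this report appears in the Results section):

library(MASS) #for stepAIC

#stepwise selection in both directions, judged by AIC
step_fit = stepAIC(fit, direction = "both", trace = FALSE)
summary(step_fit)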

Talk about k-fold cross-validation (mentioning that k = 5) – discuss what it is used for, describe how it is operationalized, and mention that the RMSE is used to compare models (explain what the RMSE is and how it is calculated, presenting and describing any relevant formulas).
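
For a model evaluated on $n$ held-out observations,

$$ RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2} $$

where $y_i$ is the observed value and $\hat{y}_i$ is the model’s prediction; lower values indicate better out-of-sample accuracy.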

0.2.5 Tools

The analyses and visualizations for this report have all been done in R. Relevant packages and Markdown settings can be seen by unhiding the code in this section.

library(tidyverse) #general
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.0      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(sf) #spatial
## Linking to GEOS 3.9.1, GDAL 3.4.3, PROJ 7.2.1; sf_use_s2() is TRUE
library(mapview) #quick mapping
library(tmap) #full mapping
library(ggpubr) #for ggarrange
library(gt) #for tables
library(glue) #for tables
library(janitor) #to clean col names
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(corrplot) #for easy correlation matrix
## corrplot 0.92 loaded
library(MASS) #for stepwise regression
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library(DAAG) #for CVlm
## 
## Attaching package: 'DAAG'
## 
## The following object is masked from 'package:MASS':
## 
##     hills
library(caret) #for a different attempt at cvlm
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
knitr::opts_chunk$set(echo = T, message = F, warning = F, error = F)

0.3 Results

0.3.1 Exploratory Results

0.3.1.1 Import Data

In order to complete this entire project in R (rather than also using ArcGIS), we have chosen to use the shapefile version of the data rather than the .csv. Below, we import the shapefile and use a custom function to apply log transformations to the relevant columns. The function checks whether each column contains zero values and applies the appropriate transformation: log(x) when there are no zeros, and log(1 + x) otherwise.

reg_data = st_read('C:/Users/Nissim/Desktop/Fall 2022/Spat Stats/ass_1_data_shp/RegressionData.shp')
## Reading layer `RegressionData' from data source 
##   `C:\Users\Nissim\Desktop\Fall 2022\Spat Stats\ass_1_data_shp\RegressionData.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 1720 features and 13 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 2660605 ymin: 207610.6 xmax: 2750171 ymax: 304858.8
## CRS:           NA
#helper: returns the proportion of non-zero values in a column
#(a value of 1 means the column contains no zeros)
col_zeros = function(a, b) {
                  pct_col_zeros = count(subset(st_drop_geometry(a), b != 0)) |>
                                      pull(n) / nrow(st_drop_geometry(a))
                  return(pct_col_zeros)
                  }


#apply the function with a case_when statement:
#if a column has no zeros, take log(x); otherwise take log(1 + x).
#note: base ifelse() returns a result the same length as its condition
#(length 1 here), so every row would get the same value;
#case_when() recycles its outputs to the full column length.
reg_data = reg_data |>
            mutate(
                ln_med_h_val = case_when(col_zeros(reg_data, reg_data$MEDHVAL) == 1 ~ log(reg_data$MEDHVAL),
                                     TRUE ~ log(1 + reg_data$MEDHVAL)),
                   ln_pct_bach_more = case_when(col_zeros(reg_data, reg_data$PCTBACHMOR) == 1 ~ log(reg_data$PCTBACHMOR),
                                     TRUE ~ log(1 + reg_data$PCTBACHMOR)),
                   ln_n_bel_pov_100 = case_when(col_zeros(reg_data, reg_data$NBelPov100) == 1 ~ log(reg_data$NBelPov100),
                                     TRUE ~ log(1 + reg_data$NBelPov100)),
                   ln_pct_vacant = case_when(col_zeros(reg_data, reg_data$PCTVACANT) == 1 ~ log(reg_data$PCTVACANT),
                                     TRUE ~ log(1 + reg_data$PCTVACANT)),
                   ln_pct_singles = case_when(col_zeros(reg_data, reg_data$PCTSINGLES) == 1 ~ log(reg_data$PCTSINGLES),
                                     TRUE ~ log(1 + reg_data$PCTSINGLES)),
                  )

0.3.1.2 Data Table

Present and briefly discuss the table of summary statistics (mean and standard deviation) for the dependent variable and the predictors.

med_house_val = c("Median House Value", mean(reg_data$MEDHVAL), sd(reg_data$MEDHVAL))

hhs_in_pov = c("# Households Living in Poverty", mean(reg_data$NBelPov100), sd(reg_data$NBelPov100))

pct_w_bach_or_higher = c("% of Individuals with Bachelor's Degrees or Higher", mean(reg_data$PCTBACHMOR), sd(reg_data$PCTBACHMOR))

pct_vac_houses = c("% of Vacant Houses", mean(reg_data$PCTVACANT), sd(reg_data$PCTVACANT))

pct_sing_house_units = c("% of Single House Units", mean(reg_data$PCTSINGLES), sd(reg_data$PCTSINGLES))

table = as.data.frame(t(data.frame(
              med_house_val,
              hhs_in_pov,
              pct_w_bach_or_higher,
              pct_vac_houses,
              pct_sing_house_units
              )))

colnames(table) = c("Variable", "Mean", "SD")

table$Mean = as.numeric(table$Mean)
table$SD = as.numeric(table$SD)

table = table |>
          mutate_if(is.numeric, round, digits = 3)

table_out = table |>
        gt() |>
        tab_header(
          title = md("**Summary Statistics**")
        ) |>
        tab_row_group(
          label = md('**Predictors**'),
          rows = 2:5
        ) |>
        tab_row_group(
          label = md('**Dependent Variable**'),
          rows = 1
        )

#print output
table_out
Summary Statistics

Variable                                               Mean        SD
Dependent Variable
  Median House Value                                66287.733 60006.076
Predictors
  # Households Living in Poverty                      189.771   164.318
  % of Individuals with Bachelor's Degrees or Higher   16.081    17.770
  % of Vacant Houses                                   11.289     9.628
  % of Single House Units                               9.226    13.249

0.3.1.3 Histograms

State whether the variables appear normal before and after the logarithmic transformation.

 house_val = ggplot(reg_data) +
                geom_histogram(aes(MEDHVAL)) +
                geom_vline(xintercept = mean(reg_data$MEDHVAL), color = 'darkred') +
    geom_vline(xintercept = (mean(reg_data$MEDHVAL) + sd(reg_data$MEDHVAL)), linetype = 'dashed')+
    geom_vline(xintercept = (mean(reg_data$MEDHVAL) - sd(reg_data$MEDHVAL)), linetype = 'dashed') +
    labs(title = "Figure 1a",
        subtitle = "Histogram of Median House Values",
        x = "Median House Value") +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5),
          axis.title.y = element_blank()) +
    annotate("text", x = mean(reg_data$MEDHVAL), y = 850, label = "bar(x) ", size = 7, parse = T)+
    annotate("text", x = (mean(reg_data$MEDHVAL) + sd(reg_data$MEDHVAL)+1000), y = 850, label = "~sigma ", size = 7, parse = T)+
    annotate("text", x = (mean(reg_data$MEDHVAL) - sd(reg_data$MEDHVAL)-2500), y = 850, label = "~-sigma ", size = 7, parse = T)
  
  pct_bach = ggplot(reg_data) +
    geom_histogram(aes(PCTBACHMOR)) +
    geom_vline(xintercept = mean(reg_data$PCTBACHMOR), color = 'darkred') +
    geom_vline(xintercept = (mean(reg_data$PCTBACHMOR) + sd(reg_data$PCTBACHMOR)), linetype = 'dashed')+
    geom_vline(xintercept = (mean(reg_data$PCTBACHMOR) - sd(reg_data$PCTBACHMOR)), linetype = 'dashed') +
     labs(title = "Figure 1b",
        subtitle = "Histogram of Educational Achievement",
        x = "% w/a Bachelor's or Higher") +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5),
          axis.title.y = element_blank()) +
    annotate("text", x = mean(reg_data$PCTBACHMOR), y = 300, label = "bar(x) ", size = 5, parse = T)+
    annotate("text", x = (mean(reg_data$PCTBACHMOR) + sd(reg_data$PCTBACHMOR)), y = 300, label = "~sigma ", size = 5, parse = T)+
    annotate("text", x = (mean(reg_data$PCTBACHMOR) - sd(reg_data$PCTBACHMOR)), y = 300, label = "~-sigma ", size = 5, parse = T)
  
  nbelpov = ggplot(reg_data) +
    geom_histogram(aes(NBelPov100)) +
    geom_vline(xintercept = mean(reg_data$NBelPov100), color = 'darkred') +
    geom_vline(xintercept = (mean(reg_data$NBelPov100) + sd(reg_data$NBelPov100)), linetype = 'dashed')+
    geom_vline(xintercept = (mean(reg_data$NBelPov100) - sd(reg_data$NBelPov100)), linetype = 'dashed') +
     labs(title = "Figure 1c",
        subtitle = "Histogram of Poverty Levels",
        x = "# Below Poverty Line") +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5),
          axis.title.y = element_blank()) +
    annotate("text", x = mean(reg_data$NBelPov100), y = 300, label = "bar(x) ", size = 5, parse = T)+
    annotate("text", x = (mean(reg_data$NBelPov100) + sd(reg_data$NBelPov100)), y = 300, label = "~sigma ", size = 5, parse = T)+
    annotate("text", x = (mean(reg_data$NBelPov100) - sd(reg_data$NBelPov100)), y = 300, label = "~-sigma ", size = 5, parse = T)
  
  pct_vac = ggplot(reg_data) +
    geom_histogram(aes(PCTVACANT)) +
    geom_vline(xintercept = mean(reg_data$PCTVACANT), color = 'darkred') +
    geom_vline(xintercept = (mean(reg_data$PCTVACANT) + sd(reg_data$PCTVACANT)), linetype = 'dashed')+
    geom_vline(xintercept = (mean(reg_data$PCTVACANT) - sd(reg_data$PCTVACANT)), linetype = 'dashed') +
     labs(title = "Figure 1d",
        subtitle = "Histogram of Vacancy Rates",
        x = "% Vacancy Rate") +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5),
          axis.title.y = element_blank()) +
    annotate("text", x = mean(reg_data$PCTVACANT), y = 275, label = "bar(x) ", size = 5, parse = T)+
    annotate("text", x = (mean(reg_data$PCTVACANT) + sd(reg_data$PCTVACANT)), y = 275, label = "~sigma ", size = 5, parse = T)+
    annotate("text", x = (mean(reg_data$PCTVACANT) - sd(reg_data$PCTVACANT)), y = 275, label = "~-sigma ", size = 5, parse = T)
  
  pct_sing = ggplot(reg_data) +
    geom_histogram(aes(PCTSINGLES)) +
    geom_vline(xintercept = mean(reg_data$PCTSINGLES), color = 'darkred') +
    geom_vline(xintercept = (mean(reg_data$PCTSINGLES) + sd(reg_data$PCTSINGLES)), linetype = 'dashed')+
    geom_vline(xintercept = (mean(reg_data$PCTSINGLES) - sd(reg_data$PCTSINGLES)), linetype = 'dashed') +
     labs(title = "Figure 1e",
        subtitle = "Histogram of Single House Units",
        x = "% Single House Units") +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5),
          axis.title.y = element_blank()) +
    annotate("text", x = mean(reg_data$PCTSINGLES), y = 450, label = "bar(x) ", size = 5, parse = T)+
    annotate("text", x = (mean(reg_data$PCTSINGLES) + sd(reg_data$PCTSINGLES)), y = 450, label = "~sigma ", size = 5, parse = T)+
    annotate("text", x = (mean(reg_data$PCTSINGLES) - sd(reg_data$PCTSINGLES)), y = 450, label = "~-sigma ", size = 5, parse = T)
  
  house_val
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  ggarrange(pct_bach, nbelpov, pct_vac, pct_sing)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

0.3.1.4 Log Transform Histograms

Present the histograms of the transformed variables and clearly state whether you’re using the log-transformed or original variable in your regression.

State that the other regression assumptions will be examined in a separate section below (Regression Assumption Checks).

 ln_house_val = ggplot(reg_data) +
                geom_histogram(aes(ln_med_h_val)) +
                geom_vline(xintercept = mean(reg_data$ln_med_h_val), color = 'darkred') +
                geom_vline(xintercept = (mean(reg_data$ln_med_h_val) + sd(reg_data$ln_med_h_val)), linetype = 'dashed')+
                geom_vline(xintercept = (mean(reg_data$ln_med_h_val) - sd(reg_data$ln_med_h_val)), linetype = 'dashed') +
                labs(title = "Figure 2a",
                    subtitle = "Histogram of Ln of Median House Values",
                    x = "Ln of Median House Value") +
                theme_minimal() +
                theme(plot.title = element_text(hjust = 0.5),
                      plot.subtitle = element_text(hjust = 0.5),
                      axis.title.y = element_blank()) +
                annotate("text", x = mean(reg_data$ln_med_h_val), y = 225, label = "bar(x) ", size = 7, parse = T)+
                annotate("text", x = (mean(reg_data$ln_med_h_val) + sd(reg_data$ln_med_h_val)), y = 225, label = "~sigma ", size = 7, parse = T)+
                annotate("text", x = (mean(reg_data$ln_med_h_val) - sd(reg_data$ln_med_h_val)), y = 225, label = "~-sigma ", size = 7, parse = T)
  
  ln_pct_bach = ggplot(reg_data) +
                geom_histogram(aes(ln_pct_bach_more)) +
                geom_vline(xintercept = mean(reg_data$ln_pct_bach_more), color = 'darkred') +
                  geom_vline(xintercept = (mean(reg_data$ln_pct_bach_more) + sd(reg_data$ln_pct_bach_more)), linetype = 'dashed')+
                  geom_vline(xintercept = (mean(reg_data$ln_pct_bach_more) - sd(reg_data$ln_pct_bach_more)), linetype = 'dashed') +
                  labs(title = "Figure 2b",
                      subtitle = "Histogram of Ln of Educational Achievement",
                      x = "Ln of Educational Achievement") +
                  theme_minimal() +
                  theme(plot.title = element_text(hjust = 0.5),
                        plot.subtitle = element_text(hjust = 0.5),
                        axis.title.y = element_blank()) +
                  annotate("text", x = mean(reg_data$ln_pct_bach_more), y = 140, label = "bar(x) ", size = 5, parse = T)+
                  annotate("text", x = (mean(reg_data$ln_pct_bach_more) + sd(reg_data$ln_pct_bach_more)), y = 140, label = "~sigma ", size = 5, parse = T)+
                  annotate("text", x = (mean(reg_data$ln_pct_bach_more) - sd(reg_data$ln_pct_bach_more)), y = 140, label = "~-sigma ", size = 5, parse = T)
  
  ln_nbelpov = ggplot(reg_data) +
                geom_histogram(aes(ln_n_bel_pov_100)) +
                geom_vline(xintercept = mean(reg_data$ln_n_bel_pov_100), color = 'darkred') +
                  geom_vline(xintercept = (mean(reg_data$ln_n_bel_pov_100) + sd(reg_data$ln_n_bel_pov_100)), linetype = 'dashed')+
                  geom_vline(xintercept = (mean(reg_data$ln_n_bel_pov_100) - sd(reg_data$ln_n_bel_pov_100)), linetype = 'dashed') +
                  labs(title = "Figure 2c",
                      subtitle = "Histogram of Ln of Poverty Levels",
                      x = "Ln of # Below Poverty Line") +
                  theme_minimal() +
                  theme(plot.title = element_text(hjust = 0.5),
                        plot.subtitle = element_text(hjust = 0.5),
                        axis.title.y = element_blank()) +
                  annotate("text", x = mean(reg_data$ln_n_bel_pov_100), y = 225, label = "bar(x) ", size = 5, parse = T)+
                  annotate("text", x = (mean(reg_data$ln_n_bel_pov_100) + sd(reg_data$ln_n_bel_pov_100)), y = 225, label = "~sigma ", size = 5, parse = T)+
                  annotate("text", x = (mean(reg_data$ln_n_bel_pov_100) - sd(reg_data$ln_n_bel_pov_100)), y = 225, label = "~-sigma ", size = 5, parse = T)
  
  ln_pct_vac = ggplot(reg_data) +
                geom_histogram(aes(ln_pct_vacant)) +
                geom_vline(xintercept = mean(reg_data$ln_pct_vacant), color = 'darkred') +
                  geom_vline(xintercept = (mean(reg_data$ln_pct_vacant) + sd(reg_data$ln_pct_vacant)), linetype = 'dashed')+
                  geom_vline(xintercept = (mean(reg_data$ln_pct_vacant) - sd(reg_data$ln_pct_vacant)), linetype = 'dashed') +
                  labs(title = "Figure 2d",
                      subtitle = "Histogram of Ln of Vacancy Rates",
                      x = "Ln of % Vacancy Rate") +
                  theme_minimal() +
                  theme(plot.title = element_text(hjust = 0.5),
                        plot.subtitle = element_text(hjust = 0.5),
                        axis.title.y = element_blank()) +
                  annotate("text", x = mean(reg_data$ln_pct_vacant), y = 150, label = "bar(x) ", size = 5, parse = T)+
                  annotate("text", x = (mean(reg_data$ln_pct_vacant) + sd(reg_data$ln_pct_vacant)), y = 150, label = "~sigma ", size = 5, parse = T)+
                  annotate("text", x = (mean(reg_data$ln_pct_vacant) - sd(reg_data$ln_pct_vacant)), y = 150, label = "~-sigma ", size = 5, parse = T)
  
  ln_pct_sing = ggplot(reg_data) +
                geom_histogram(aes(ln_pct_singles)) +
                geom_vline(xintercept = mean(reg_data$ln_pct_singles), color = 'darkred') +
                  geom_vline(xintercept = (mean(reg_data$ln_pct_singles) + sd(reg_data$ln_pct_singles)), linetype = 'dashed')+
                  geom_vline(xintercept = (mean(reg_data$ln_pct_singles) - sd(reg_data$ln_pct_singles)), linetype = 'dashed') +
                  labs(title = "Figure 2e",
                      subtitle = "Histogram of Ln of Single House Units",
                      x = "Ln of % Single House Units") +
                  theme_minimal() +
                  theme(plot.title = element_text(hjust = 0.5),
                        plot.subtitle = element_text(hjust = 0.5),
                        axis.title.y = element_blank()) +
                  annotate("text", x = mean(reg_data$ln_pct_singles), y = 225, label = "bar(x) ", size = 5, parse = T)+
                  annotate("text", x = (mean(reg_data$ln_pct_singles) + sd(reg_data$ln_pct_singles)), y = 225, label = "~sigma ", size = 5, parse = T)+
                  annotate("text", x = (mean(reg_data$ln_pct_singles) - sd(reg_data$ln_pct_singles)), y = 225, label = "~-sigma ", size = 5, parse = T)
  
  ln_house_val
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  ggarrange(ln_pct_bach, ln_nbelpov, ln_pct_vac, ln_pct_sing)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

0.3.1.5 Choropleth Maps

Present the choropleth maps of the dependent variable and the predictors which you created.

Refer to the maps in the text, and talk about the following:

Which maps look similar? Which maps look different? That is, which predictors do you expect to be strongly associated with the dependent variable based on the visualization? Also, given your examination of the maps, are there any predictors that you think will be strongly inter-correlated? That is, do you expect severe multicollinearity to be an issue here? Discuss this in a paragraph.

#lifted from lovelace: https://geocompr.robinlovelace.net/adv-map.html#faceted-maps
tmap_mode("plot")
## tmap mode set to plotting
phl_city_lims = st_read("C:/Users/Nissim/Desktop/Fall 2022/Spat Stats/phl_city_limits/City_Limits.shp")
## Reading layer `City_Limits' from data source 
##   `C:\Users\Nissim\Desktop\Fall 2022\Spat Stats\phl_city_limits\City_Limits.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 3 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -75.28031 ymin: 39.86747 xmax: -74.95575 ymax: 40.13793
## Geodetic CRS:  WGS 84
tm_shape(reg_data) + 
  tm_polygons(title = "Ln of Median House Value", col = "ln_med_h_val", border.col = NA, border.alpha = 0, lwd = 0, palette = "Blues", style = "jenks") + 
  tm_shape(phl_city_lims) +
  tm_borders(col = "grey", lwd = 5) +
  tm_compass(position = c("left", "top")) +
  tm_layout(main.title = "Figure 3a",
            legend.position = c("right", "bottom")) 

facets = c("ln_pct_bach_more",
           "ln_n_bel_pov_100",
           "ln_pct_vacant",
           "ln_pct_singles")

facet_titles = c("Ln of Edu. Attain.",
                 "Ln of Pov. Levels",
                 "Ln of Vacancy",
                 "Ln of Single Occ")

tm_shape(reg_data) + 
  tm_polygons(facets, title = facet_titles, border.col = NA, border.alpha = 0,lwd = 0, palette = "Blues", style = "jenks") + 
  tm_facets(nrow = 2, sync = TRUE) +
  tm_layout(legend.position = c("right", "bottom"),
            panel.labels = c("Figure 3b",
                             "Figure 3c",
                             "Figure 3d",
                             "Figure 3e"))

0.3.1.6 Correlation Matrix

Present the correlation matrix of the predictors which you obtained from R.

Talk about whether the correlation matrix shows that there is severe multicollinearity.

Does the correlation matrix support your conclusions based on your visual comparison of predictor maps?

#https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html
corr_reg_data = reg_data |>
                  st_drop_geometry() |>
                  dplyr::select(
                                PCTVACANT,
                                PCTSINGLES,
                                PCTBACHMOR,
                                LNNBELPOV)

corrplot(cor(corr_reg_data), method = "number", type = "lower", tl.col = "black", tl.cex = 0.75, number.cex = 1)

0.3.2 Regression Results

Present the regression output from R. Be sure that your output presents the parameter estimates (and associated standard errors, t-statistics, and p-values), as well as the $R^2$, the adjusted $R^2$, and the relevant F-ratio and associated p-value.

Referencing the regression output in (i) above, interpret the results as in the example included above this report outline. NOTE: YOUR DEPENDENT VARIABLE (AND SOME PREDICTORS) WOULD BE LOG-TRANSFORMED, UNLIKE IN THE EXAMPLE HERE. LOOK AT THE SLIDES FOR EXAMPLES OF INTERPRETING REGRESSION OUTPUT WITH LOG-TRANSFORMED VARIABLES.

0.3.2.1 Regression

lm = lm(MEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + ln_n_bel_pov_100, data = reg_data) #note: this name masks stats::lm() for the rest of the session

summary(lm)
## 
## Call:
## lm(formula = MEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + 
##     ln_n_bel_pov_100, data = reg_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -142167  -13065   -1156   10103  845813 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      60167.52    5095.12  11.809  < 2e-16 ***
## PCTVACANT         -578.91     107.07  -5.407 7.32e-08 ***
## PCTSINGLES         754.91      76.99   9.805  < 2e-16 ***
## PCTBACHMOR        2051.89      59.48  34.499  < 2e-16 ***
## ln_n_bel_pov_100 -5681.22     925.99  -6.135 1.05e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40130 on 1715 degrees of freedom
## Multiple R-squared:  0.5538, Adjusted R-squared:  0.5528 
## F-statistic: 532.2 on 4 and 1715 DF,  p-value: < 2.2e-16
anova(lm)
## Analysis of Variance Table
## 
## Response: MEDHVAL
##                    Df     Sum Sq    Mean Sq  F value    Pr(>F)    
## PCTVACANT           1 6.6026e+11 6.6026e+11  410.032 < 2.2e-16 ***
## PCTSINGLES          1 5.0734e+11 5.0734e+11  315.064 < 2.2e-16 ***
## PCTBACHMOR          1 2.1998e+12 2.1998e+12 1366.112 < 2.2e-16 ***
## ln_n_bel_pov_100    1 6.0614e+10 6.0614e+10   37.642 1.053e-09 ***
## Residuals        1715 2.7616e+12 1.6103e+09                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pred_vals = fitted(lm)

resids = residuals(lm)

stand_resids = rstandard(lm)

lm_df = data.frame(reg_data$MEDHVAL, pred_vals, stand_resids) |>
          rename(MEDHVAL = reg_data.MEDHVAL)

0.3.3 Regression Assumption Checks

First state that in this section, you will discuss testing the model’s assumptions and aptness. State that you have already looked at the variable distributions earlier.

0.3.3.1 Scatterplots of Predictors

Present scatter plots of the dependent variable and each of the predictors. State whether each of the relationships seems to be linear, as assumed by the regression model. [Hint: they will not look linear.]

Question: are we meant to be using the original predictors or the log-transformed columns here? (See section 1b.)

  #plot the dependent variable on the y-axis against each predictor
  pct_bach_plot = ggplot(reg_data) +
                    geom_point(aes(x = PCTBACHMOR, 
                                   y = MEDHVAL)) +
                    theme_minimal()
  
  nbelpov_plot = ggplot(reg_data) +
                    geom_point(aes(x = NBelPov100, 
                                   y = MEDHVAL)) +
                    theme_minimal()
  
  pct_vac_plot = ggplot(reg_data) +
                    geom_point(aes(x = PCTVACANT, 
                                   y = MEDHVAL)) +
                    theme_minimal()
    
  pct_sing_plot = ggplot(reg_data) +
                    geom_point(aes(x = PCTSINGLES, 
                                   y = MEDHVAL)) +
                    theme_minimal()
  
  ggarrange(pct_bach_plot, nbelpov_plot, pct_vac_plot, pct_sing_plot)

0.3.3.2 Histogram of Standardized Residuals

Present the histogram of the standardized residuals. State whether the residuals look normal.

#attach the fitted values and standardized residuals back onto reg_data.
#fitted() and rstandard() preserve the original row order, so a positional
#assignment is safer than joining on MEDHVAL, which duplicates rows
#wherever two block groups share the same median value.
reg_data$pred_vals = pred_vals
reg_data$stand_resids = stand_resids

ggplot(reg_data) +
  geom_histogram(aes(x = stand_resids)) +
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

0.3.3.3 Standardized Residual by Predicted Value Scatter Plot

Present the ‘Standardized Residual by Predicted Value’ scatter plot. What conclusions can you draw from that? Does there seem to be heteroscedasticity? Do there seem to be outliers? Anything else? Discuss.

ggplot(lm_df) +
  geom_point(aes(x = pred_vals, y = stand_resids)) +
  theme_minimal()

Mention what standardized residuals are.
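
For reference, the standardized residual for observation $i$ is

$$ r_i = \frac{e_i}{s\sqrt{1 - h_{ii}}} $$

where $e_i$ is the raw residual, $s$ is the residual standard error, and $h_{ii}$ is the leverage of observation $i$; this is what R’s rstandard() computes.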

Referencing the maps of the dependent variable and the predictors that you presented earlier, state whether there seems to be spatial autocorrelation in your variables. That is, does it seem that the observations (i.e., block groups) are independent of each other? Briefly discuss.

0.3.3.4 Choropleth of Standardized Regression Residuals

Now, present the choropleth map of the standardized regression residuals. Do there seem to be any noticeable spatial patterns in them? That is, do they seem to be spatially autocorrelated?

You will examine the spatial autocorrelation of the variables and residuals and run spatial regressions in the next assignment.

tm_shape(reg_data) + 
  tm_polygons(col = "stand_resids", border.col = NA, border.alpha = 0.1, lwd = 0, palette = "Blues", style = "jenks") + 
  tm_layout(legend.position = c("right", "bottom"))

0.3.4 Additional Models

0.3.4.1 Stepwise Regression

Present the results of the stepwise regression and state whether all 4 predictors in the original model are kept in the final model.

stepAIC(lm)
## Start:  AIC=36468.43
## MEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + ln_n_bel_pov_100
## 
##                    Df  Sum of Sq        RSS   AIC
## <none>                           2.7616e+12 36468
## - PCTVACANT         1 4.7073e+10 2.8087e+12 36495
## - ln_n_bel_pov_100  1 6.0614e+10 2.8222e+12 36504
## - PCTSINGLES        1 1.5480e+11 2.9164e+12 36560
## - PCTBACHMOR        1 1.9165e+12 4.6781e+12 37373
## 
## Call:
## lm(formula = MEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + 
##     ln_n_bel_pov_100, data = reg_data)
## 
## Coefficients:
##      (Intercept)         PCTVACANT        PCTSINGLES        PCTBACHMOR  
##          60167.5            -578.9             754.9            2051.9  
## ln_n_bel_pov_100  
##          -5681.2

0.3.4.2 K-Fold Cross-Validation

Present the cross-validation results – that is, compare the RMSE of the original model that includes all 4 predictors with the RMSE of the model that only includes PCTVACANT and MEDHHINC as predictors.

#note: the DAAG CVlm() function produced errors with this dataset,
#so we use caret's train() with 5-fold cross-validation instead

#rmse for full model
lm_ii = trainControl(method = "cv", number = 5)

cvlm_model = train(MEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + ln_n_bel_pov_100, data = reg_data, method = "lm", trControl = lm_ii)

print(cvlm_model)
## Linear Regression 
## 
## 4844 samples
##    5 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 3875, 3876, 3875, 3876, 3874 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   29437.05  0.5619683  15822.81
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
#rmse for reduced model (just PCTVACANT and MEDHHINC)
lm_ii_reduced = trainControl(method = "cv", number = 5)

cvlm_model_reduced = train(MEDHVAL ~ PCTVACANT + MEDHHINC, data = reg_data, method = "lm", trControl = lm_ii_reduced)

print(cvlm_model_reduced)
## Linear Regression 
## 
## 4844 samples
##    3 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 3877, 3874, 3875, 3876, 3874 
## Resampling results:
## 
##   RMSE     Rsquared   MAE     
##   35888.3  0.3475246  19336.82
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

0.4 Discussion and Limitations

0.4.1 Recap

Recap what you did in the paper and your findings. Discuss what conclusions you can draw, which variables were significant and whether that was surprising or not.

0.4.2 Quality of Model

Talk about the quality of the model – that is, state whether this is a good model overall (e.g., $R^2$, F-ratio test), and what other predictors that we didn’t include in our model might be associated with our dependent variable.

If you ran the stepwise regression, did the final model include all 4 predictors or were some dropped? What does that tell you about the quality of the model?

If you used cross-validation, was the RMSE better for the 4 predictor model or the 2 predictor model?

0.4.3 Limitations of Model

If you haven’t done that in the Results section, talk explicitly about the limitations of the model – that is, mention which assumptions were violated, and if applicable, how that may affect the model/parameter estimation/estimated significance.

In addition, talk about the limitations of using the NBelPov100 variable as a predictor – that is, what are some limitations of using the raw number of households living in poverty rather than a percentage?