# Load packages
library(tidyverse)
library(car)
# Load dataset
ericksen_data <- read.table("data/Ericksen.txt") %>%
  rownames_to_column("census_unit") %>%
  as_tibble()

Regression Analysis
The dataset used for this analysis is Ericksen.txt. This data is a record of the 1980 U.S. Census Undercount.
Source: Ericksen, E. P., Kadane, J. B. and Tukey, J. W. (1989) Adjusting the 1980 Census of Population and Housing. Journal of the American Statistical Association 84, 927–944 [Tables 7 and 8].
Here is a short description of the variables contained in Ericksen.txt:
minority - Percentage black or Hispanic
crime - Rate of serious crimes per 1000 population
poverty - Percentage poor
language - Percentage having difficulty speaking or writing English
highschool - Percentage age 25 or older who had not finished high school
housing - Percentage of housing in small, multiunit buildings
city - city, major city; state, state or state-remainder
conventional - Percentage of households counted by conventional personal enumeration
undercount - Preliminary estimate of percentage undercount

Let us explore fitting a simple linear regression using undercount as the response variable and minority as the explanatory variable.
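Before fitting, a quick look at the imported data can serve as a sanity check (a sketch; this output is not part of the original write-up):
# inspect variable types and the response
glimpse(ericksen_data)
summary(ericksen_data$undercount)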
Fit a simple linear regression.
# ericksen regression
ggplot(ericksen_data, aes(x = minority, y = undercount)) +
  geom_point(alpha = 0.6) +
  geom_smooth(se = FALSE, color = "red", linetype = "dashed") +
  geom_smooth(se = FALSE, method = "lm") +
  theme_minimal()
# ericksen fit model
ex3_original <- lm(undercount ~ minority, ericksen_data)
summary(ex3_original)
Call:
lm(formula = undercount ~ minority, data = ericksen_data)
Residuals:
Min 1Q Median 3Q Max
-3.9302 -1.2390 0.0676 0.9355 4.0625
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.005076 0.327215 -0.016 0.988
minority 0.099100 0.012549 7.897 4.9e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.772 on 64 degrees of freedom
Multiple R-squared: 0.4935, Adjusted R-squared: 0.4856
F-statistic: 62.36 on 1 and 64 DF, p-value: 4.9e-11
# correlation
sqrt(summary(ex3_original)$r.squared)
[1] 0.7025047
# or cor(ericksen_data$undercount, ericksen_data$minority)

Model:
\[\widehat{undercount}_i = A + B \times minority_i\]
Intercept \(A\): When minority is 0, the predicted undercount is -0.005076 percent.
Slope for minority \(B\): For every one percentage point increase in minority, the undercount is expected to increase by 0.0991 percentage points on average.
Residual Standard Error \(S_E\): On average, the predicted undercount was off by 1.772 percent.
Correlation \(r\): The correlation between minority and undercount is 0.7025, which indicates a strong positive relationship.
R-squared \(r^2\): About 49% of the variance in undercount is explained by percent minority.
The least-squares line is a reasonable summary of the relationship between the two variables: the adjusted R-squared of 0.4856 indicates a moderately good fit.
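Before moving on, a minimal residual check for the simple model (a sketch; the original analysis does not show diagnostics):
# residuals vs. fitted values for the simple regression
ggplot(data.frame(fitted = fitted(ex3_original), resid = resid(ex3_original)),
       aes(x = fitted, y = resid)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  theme_minimal()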
Let us now explore fitting a multiple linear regression using undercount as the response variable with minority, crime, poverty, and housing as the explanatory variables.
Fit the multiple regression.
# multiple regression
ericksen_data %>%
  select(undercount, minority, crime, poverty, housing) %>%
  GGally::ggpairs() +
  theme_minimal()
# fit model
ex3_model <- lm(undercount ~ minority + crime + poverty + housing,
                data = ericksen_data)
summary(ex3_model)
Call:
lm(formula = undercount ~ minority + crime + poverty + housing,
data = ericksen_data)
Residuals:
Min 1Q Median 3Q Max
-3.0871 -1.1194 -0.0427 1.0271 3.8537
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.512005 0.989953 -0.517 0.606886
minority 0.087078 0.022167 3.928 0.000221 ***
crime 0.033242 0.012600 2.638 0.010562 *
poverty -0.094745 0.071570 -1.324 0.190505
housing -0.005084 0.025321 -0.201 0.841524
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.651 on 61 degrees of freedom
Multiple R-squared: 0.581, Adjusted R-squared: 0.5535
F-statistic: 21.15 on 4 and 61 DF, p-value: 5.626e-11
# correlation
sqrt(summary(ex3_model)$r.squared)
[1] 0.7622268
Model:
\[\widehat{undercount}_i = A + B_1 \times minority_i + B_2 \times crime_i + B_3 \times poverty_i + B_4 \times housing_i\]
Intercept \(A\): When minority, crime, poverty, and housing are all 0, the preliminary estimate of undercount is -0.512 percent. However, this does not make sense because there cannot be a negative percent of undercount.
Slope for minority \(B_1\): For every one percentage point increase in minority, the preliminary estimated undercount is expected to increase by 0.08708 percent on average, controlling for the rate of serious crimes per 1000 population, the percentage poor, and the percentage of housing in small, multiunit buildings.
Slope for crime \(B_2\): For every additional serious crime per 1000 population, the preliminary estimated undercount is expected to increase by 0.03324 percent on average, controlling for the percentage minority, the percentage poor, and the percentage of housing in small, multiunit buildings.
Slope for poverty \(B_3\): For every one percentage point increase in poverty, the preliminary estimated undercount is expected to decrease by 0.09475 percent on average, controlling for the percentage minority, the rate of serious crimes per 1000 population, and the percentage of housing in small, multiunit buildings.
Slope for housing \(B_4\): For every one percentage point increase in housing, the preliminary estimated undercount is expected to decrease by 0.005084 percent on average, controlling for the percentage minority, the rate of serious crimes per 1000 population, and the percentage poor.
Residual Standard Error \(S_E\): On average, the predicted undercount was off by 1.651 percent.
Correlation \(r\): A single correlation coefficient is not very useful for a multiple regression. Instead, examine the correlation matrix of all pairwise correlations, as in the ggpairs plot above.
R-squared \(r^2\): About 58% of the variance in percent undercount is explained by the multiple regression model.
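To make the slope interpretations concrete, here is a prediction for a hypothetical census unit (a sketch; the predictor values below are illustrative, not taken from the dataset):
# predicted undercount for a hypothetical census unit (illustrative values)
new_unit <- tibble(minority = 25, crime = 60, poverty = 15, housing = 20)
predict(ex3_model, newdata = new_unit, interval = "prediction")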
The results from the multiple regression model are somewhat similar to those of the simple regression model fit above. The slope for minority is 0.0991 in the simple regression and 0.08708 in the multiple regression, where the other regressors are held constant. The intercept is -0.005076 in the simple regression and -0.512 in the multiple regression. This difference may be caused by the other regressors now being included in the model.
Comparison of the multiple regression and simple regression
Still, the multiple regression model is an improvement over the simple regression: the R-squared increases from 49% to 58% and the residual standard error decreases from 1.772% to 1.651%, so its predictions are closer to the actual data.
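Since the car package is loaded at the top, one additional check (not part of the original analysis) is to look at variance inflation factors, which flag collinearity among the four predictors:
# variance inflation factors for the multiple regression
vif(ex3_model)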
The multiple regression fit in Part 2 was modified to compute standardized partial regression coefficients, \(B^*_j\)'s, by z-scoring (standardizing) each variable in the model, including the response, and then re-fitting the model on the standardized data.
# standardize data
scaled_ericksen <- ericksen_data %>%
  select(undercount, minority, crime, poverty, housing) %>%
  scale() %>%
  as_tibble()
scaled_model <- lm(undercount ~ minority + crime + poverty + housing - 1,
                   data = scaled_ericksen)
summary(scaled_model)
Call:
lm(formula = undercount ~ minority + crime + poverty + housing -
1, data = scaled_ericksen)
Residuals:
Min 1Q Median 3Q Max
-1.24950 -0.45307 -0.01729 0.41573 1.55976
Coefficients:
Estimate Std. Error t value Pr(>|t|)
minority 0.61728 0.15587 3.960 0.000196 ***
crime 0.33490 0.12591 2.660 0.009940 **
poverty -0.17184 0.12875 -1.335 0.186882
housing -0.02022 0.09991 -0.202 0.840237
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6628 on 62 degrees of freedom
Multiple R-squared: 0.581, Adjusted R-squared: 0.554
F-statistic: 21.49 on 4 and 62 DF, p-value: 3.698e-11
# correlation
sqrt(summary(scaled_model)$r.squared)
[1] 0.7622268
Coefficients for:
Minority \(B^*_1\): For a one standard deviation increase in minority, undercount is expected to increase by 0.6173 standard deviations on average, holding the other predictors constant.
Crime \(B^*_2\): For a one standard deviation increase in crime, undercount is expected to increase by 0.3349 standard deviations on average, holding the other predictors constant.
Poverty \(B^*_3\): For a one standard deviation increase in poverty, undercount is expected to decrease by 0.1718 standard deviations on average, holding the other predictors constant.
Housing \(B^*_4\): For a one standard deviation increase in housing, undercount is expected to decrease by 0.02022 standard deviations on average, holding the other predictors constant.
What has happened to the intercept \(A^*\)? Explain why this is not surprising.
The intercept is 0. This is not surprising: the least-squares fit always passes through the point of means, and after z-scoring every variable has mean 0, so the fitted line must pass through the origin.
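A quick check of this claim (a sketch, not in the original output): refit the standardized model with an intercept and confirm the estimate is numerically zero.
# the intercept of the standardized fit should be (numerically) zero
coef(lm(undercount ~ minority + crime + poverty + housing,
        data = scaled_ericksen))["(Intercept)"]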
Comparison of multiple \(R^2\)'s from the model fit in Part 2 and from the standardized model fit
The R-squared of the model fit in Part 2 and of the standardized model fit here are the same: 0.581, or about 58%. Intuitively, standardizing is a linear transformation of each variable; it rescales the coefficients and removes the intercept, but it does not change the proportion of variance explained.
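One way to see the connection between the two fits (a sketch using the identity \(B^*_j = B_j \, s_{x_j} / s_y\), which follows from standardizing each variable):
# recover the standardized coefficients from the unstandardized fit
b  <- coef(ex3_model)[-1]
sx <- sapply(ericksen_data[c("minority", "crime", "poverty", "housing")], sd)
sy <- sd(ericksen_data$undercount)
b * sx / sy  # should match the coefficients of scaled_model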
A 95% confidence interval was constructed for the least-squares estimates of the slope, \(\beta\), and intercept, \(\alpha\).
# fit the model
ex3_mod <- lm(undercount ~ minority, ericksen_data)
# confidence interval
confint(ex3_mod)
                  2.5 %    97.5 %
(Intercept) -0.65876210 0.6486110
minority     0.07402965 0.1241696
Estimated Model:
\[\widehat{undercount}_i = \alpha + \beta \times minority_i\]
Confidence Interval for slope \(\beta\): [0.0740, 0.124]
Confidence Interval for intercept \(\alpha\): [-0.659, 0.649]
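These intervals can also be reproduced by hand from the summary output (a sketch; qt() gives the t critical value for the 64 residual degrees of freedom):
# manual 95% interval for the minority slope
est <- coef(summary(ex3_mod))["minority", "Estimate"]
se  <- coef(summary(ex3_mod))["minority", "Std. Error"]
est + c(-1, 1) * qt(0.975, df = df.residual(ex3_mod)) * se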
In Part 2, I fit a multiple linear regression using undercount as the response variable with minority, crime, poverty, and housing as the explanatory variables.
Now, a 95% confidence interval was constructed for all the least-squares estimates (slopes, \(\beta_j\)'s, and intercept, \(\alpha\)).
# fit the model
ex4_mod <- lm(undercount ~ minority + crime + poverty + housing,
               data = ericksen_data)
# confidence interval
confint(ex4_mod)
                   2.5 %     97.5 %
(Intercept) -2.491538658 1.46752795
minority     0.042751422 0.13140447
crime        0.008046421 0.05843797
poverty     -0.237856985 0.04836755
housing     -0.055715814 0.04554717
Estimated Model:
\[\widehat{undercount}_i = \alpha + \beta_1 \times minority_i + \beta_2 \times crime_i + \beta_3 \times poverty_i + \beta_4 \times housing_i\]
Confidence Interval for intercept \(\alpha\): [-2.49, 1.47]
Confidence Interval for minority slope \(\beta_1\): [0.0428, 0.131]
Confidence Interval for crime slope \(\beta_2\): [0.00805, 0.0584]
Confidence Interval for poverty slope \(\beta_3\): [-0.238, 0.0484]
Confidence Interval for housing slope \(\beta_4\): [-0.0557, 0.0455]
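The same estimates and intervals can be collected in one tidy table (assuming the broom package is installed; it is not loaded in the original code):
# coefficient table with 95% confidence intervals
broom::tidy(ex4_mod, conf.int = TRUE, conf.level = 0.95)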
The ANOVA Table for Regression
| Source | Sum of Squares | df | Mean Square | F |
|---|---|---|---|---|
| Regression | 230.53 | 4 | 57.63 | 21.15 |
| Residuals | 166.26 | 61 | 2.73 | |
| Total | 396.78 | 65 | | |
# baseline model
ex4_baseline <- lm(undercount ~ 1, data = ericksen_data)
# anova values
anova(ex4_baseline, ex4_mod)
Analysis of Variance Table
Model 1: undercount ~ 1
Model 2: undercount ~ minority + crime + poverty + housing
Res.Df RSS Df Sum of Sq F Pr(>F)
1 65 396.78
2 61 166.26 4 230.53 21.145 5.626e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# summary
summary(ex4_mod)
Call:
lm(formula = undercount ~ minority + crime + poverty + housing,
data = ericksen_data)
Residuals:
Min 1Q Median 3Q Max
-3.0871 -1.1194 -0.0427 1.0271 3.8537
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.512005 0.989953 -0.517 0.606886
minority 0.087078 0.022167 3.928 0.000221 ***
crime 0.033242 0.012600 2.638 0.010562 *
poverty -0.094745 0.071570 -1.324 0.190505
housing -0.005084 0.025321 -0.201 0.841524
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.651 on 61 degrees of freedom
Multiple R-squared: 0.581, Adjusted R-squared: 0.5535
F-statistic: 21.15 on 4 and 61 DF, p-value: 5.626e-11
# OR calculate sum of squares directly
# RegSS
## regss <- sum((fitted(ex4_mod) - mean(ericksen_data$undercount))^2)
# RSS
## rss <- sum(resid(ex4_mod)^2)
# TSS
## tss <- regss + rss
## # or equivalently:
## sum((ericksen_data$undercount - mean(ericksen_data$undercount))^2)
# OR find p-value directly
## pf(21.15, 4, 61, lower.tail = FALSE)

Omnibus Test
Using the ANOVA table for regression, the omnibus null hypothesis test that all \(\beta_j\)’s are 0 was conducted.
Model:
\[undercount_i = \alpha + \beta_1 \times minority_i + \beta_2 \times crime_i + \beta_3 \times poverty_i + \beta_4 \times housing_i + \epsilon_i\]
\(H_0: \beta_{minority} = \beta_{crime} = \beta_{poverty} = \beta_{housing} = 0\)
\(H_A\): at least one of the \(\beta\)s is non zero
For significance level \(\alpha = 0.05\),
Since the p-value is less than 0.05, the result is statistically significant, so we reject \(H_0\) in favor of \(H_A\). The data suggest that at least one of the slope parameters in the regression model is not 0.
Furthermore, the ANOVA table shows that the value of \(F_0\) is 21.2, which is much larger than 1, so we reject the omnibus null hypothesis.
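For reference, the critical value that the observed \(F_0\) is compared against can be computed directly (a sketch; at \(\alpha = 0.05\) with 4 and 61 degrees of freedom it is roughly 2.5, far below 21.2):
# critical F value for the omnibus test
qf(0.95, df1 = 4, df2 = 61)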
Subset of Slopes Test
Using a subset of slopes test, I examined whether there is sufficient evidence that our multiple regression's inclusion of crime, poverty, and housing as additional explanatory variables fits the data better than the simple linear regression that uses only minority as an explanatory variable.
# anova test for nested models
anova(ex4_mod, ex3_mod)
Analysis of Variance Table
Model 1: undercount ~ minority + crime + poverty + housing
Model 2: undercount ~ minority
Res.Df RSS Df Sum of Sq F Pr(>F)
1 61 166.25
2 64 200.96 -3 -34.709 4.245 0.00865 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Models:
Multiple Regression
\[undercount_i = \alpha + \beta_1 \times minority_i + \beta_2 \times crime_i + \beta_3 \times poverty_i + \beta_4 \times housing_i + \epsilon_i\]
Simple Linear Regression
\[undercount_i = \alpha + \beta \times minority_i + \epsilon_i\]
\(H_0: \beta_{crime} = \beta_{poverty} = \beta_{housing} = 0\)
\(H_A\): at least one of the \(\beta\)s is non zero
For significance level \(\alpha = 0.05\),
Since the p-value is less than 0.05, the result is statistically significant, so we reject \(H_0\) in favor of \(H_A\). The data suggest that at least one of the additional slope parameters is not 0. The subset of slopes test shows that the multiple regression's inclusion of crime, poverty, and housing as additional explanatory variables fits the data better than the simple linear regression that uses only minority as an explanatory variable.
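The incremental F-statistic can also be computed by hand from the two residual sums of squares (a sketch of the formula \(F = \frac{(RSS_{reduced} - RSS_{full})/q}{RSS_{full}/df_{full}}\), with \(q = 3\) dropped slopes):
# partial F-test computed directly from the residual sums of squares
rss_full <- sum(resid(ex4_mod)^2)
rss_red  <- sum(resid(ex3_mod)^2)
f_stat   <- ((rss_red - rss_full) / 3) / (rss_full / df.residual(ex4_mod))
f_stat
pf(f_stat, df1 = 3, df2 = df.residual(ex4_mod), lower.tail = FALSE)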
Subset of Slopes Test
Considering the multiple regression with its 4 explanatory variables, I tested whether the slope parameter for minority is 0 using a subset of slopes test and again using a t-test.
# fit the multiple subset model
ex4_minority <- lm(undercount ~ crime + poverty + housing, data = ericksen_data)
# anova test for nested models (only minority slope)
anova_minority <- anova(ex4_minority, ex4_mod)
# t-test for minority slope
lm_fit <- summary(ex4_mod)
# calculate values
lm_fit$coefficients[2, 3]^2
[1] 15.4307
anova_minority$F[2]
[1] 15.4307
lm_fit$coefficients[2, 3]
[1] 3.928192
anova_minority$`Pr(>F)`[2]
[1] 0.0002207146
For both the subset of slopes test (the F-test with 1 and 61 degrees of freedom) and a two-sided t-test at the 0.05 level with a null value of 0, the p-value for the slope parameter of minority is 0.000221. The p-values are the same for the two tests. Since the p-value is less than 0.05, the result is statistically significant, so we reject \(H_0\) in favor of \(H_A\). The data suggest that the slope parameter for minority in the multiple regression model is not 0.
The t-test statistic is 3.93, while the F-test statistic is 15.4. The F-test statistic is equal to the t-test statistic squared.
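This identity (3.928² ≈ 15.43) can be confirmed numerically:
# verify that t^2 equals the partial F-statistic
t_val <- summary(ex4_mod)$coefficients["minority", "t value"]
all.equal(t_val^2, anova_minority$F[2])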