# Load packages
library(tidyverse)
library(car)
# Load dataset
ericksen_data <- read.table("data/Ericksen.txt") %>%
  rownames_to_column("census_unit") %>%
  as_tibble()

Regression Analysis
The dataset used for this analysis is Ericksen.txt. This data is a record of the 1980 U.S. Census Undercount.
Source: Ericksen, E. P., Kadane, J. B. and Tukey, J. W. (1989) Adjusting the 1980 Census of Population and Housing. Journal of the American Statistical Association 84, 927–944 [Tables 7 and 8].
Here is a short description of the variables contained in Ericksen.txt:
minority - Percentage black or Hispanic
crime - Rate of serious crimes per 1000 population
poverty - Percentage poor
language - Percentage having difficulty speaking or writing English
highschool - Percentage age 25 or older who had not finished high school
housing - Percentage of housing in small, multiunit buildings
city - city, major city; state, state or state-remainder
conventional - Percentage of households counted by conventional personal enumeration
undercount - Preliminary estimate of percentage undercount

Let us explore fitting a simple linear regression using undercount as the response variable and minority as the explanatory variable.
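Before fitting, a quick look at the imported data can serve as a sanity check (a sketch; this output is not part of the original write-up):
# inspect variable types and the response
glimpse(ericksen_data)
summary(ericksen_data$undercount)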
Fit a simple linear regression.
# ericksen regression
ggplot(ericksen_data, aes(x = minority, y = undercount)) +
  geom_point(alpha = 0.6) +
  geom_smooth(se = FALSE, color = "red", linetype = "dashed") +
  geom_smooth(se = FALSE, method = "lm") +
  theme_minimal()
# ericksen fit model
ex3_original <- lm(undercount ~ minority, ericksen_data)
summary(ex3_original)
Call:
lm(formula = undercount ~ minority, data = ericksen_data)
Residuals:
Min 1Q Median 3Q Max
-3.9302 -1.2390 0.0676 0.9355 4.0625
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.005076 0.327215 -0.016 0.988
minority 0.099100 0.012549 7.897 4.9e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.772 on 64 degrees of freedom
Multiple R-squared: 0.4935, Adjusted R-squared: 0.4856
F-statistic: 62.36 on 1 and 64 DF, p-value: 4.9e-11
# correlation
sqrt(summary(ex3_original)$r.squared)
[1] 0.7025047
# or cor(ericksen_data$undercount, ericksen_data$minority)

Model:
\[\widehat{undercount}_i = A + B \times minority_i\]
Intercept \(A\): When minority is 0, the predicted undercount is -0.005076 percent.
Slope for minority \(B\): For every one percentage point increase in minority, the undercount is expected to increase by 0.0991 percentage points on average.
Residual Standard Error \(S_E\): On average, the predicted undercount was off by 1.772 percent.
Correlation \(r\): The correlation between minority and undercount is 0.7025, which indicates a strong positive relationship.
R-squared \(r^2\): About 49% of the variance in undercount is explained by percent minority.
The least-squares line is a reasonable summary of the relationship between the two variables: the adjusted R-squared of 0.4856 indicates a moderately good fit.
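Before moving on, a minimal residual check for the simple model (a sketch; the original analysis does not show diagnostics):
# residuals vs. fitted values for the simple regression
ggplot(data.frame(fitted = fitted(ex3_original), resid = resid(ex3_original)),
       aes(x = fitted, y = resid)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  theme_minimal()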
Let us now explore fitting a multiple linear regression using undercount as the response variable with minority, crime, poverty, and housing as the explanatory variables.
Fit the multiple regression.
# multiple regression
ericksen_data %>%
  select(undercount, minority, crime, poverty, housing) %>%
  GGally::ggpairs() +
  theme_minimal()
# fit model
ex3_model <- lm(undercount ~ minority + crime + poverty + housing,
                data = ericksen_data)
summary(ex3_model)
Call:
lm(formula = undercount ~ minority + crime + poverty + housing,
data = ericksen_data)
Residuals:
Min 1Q Median 3Q Max
-3.0871 -1.1194 -0.0427 1.0271 3.8537
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.512005 0.989953 -0.517 0.606886
minority 0.087078 0.022167 3.928 0.000221 ***
crime 0.033242 0.012600 2.638 0.010562 *
poverty -0.094745 0.071570 -1.324 0.190505
housing -0.005084 0.025321 -0.201 0.841524
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.651 on 61 degrees of freedom
Multiple R-squared: 0.581, Adjusted R-squared: 0.5535
F-statistic: 21.15 on 4 and 61 DF, p-value: 5.626e-11
# correlation
sqrt(summary(ex3_model)$r.squared)
[1] 0.7622268
Model:
\[\widehat{undercount}_i = A + B_1 \times minority_i + B_2 \times crime_i + B_3 \times poverty_i + B_4 \times housing_i\]
Intercept \(A\): When minority, crime, poverty, and housing are all 0, the preliminary estimate of undercount is -0.512 percent. However, this does not make sense because there cannot be a negative percent of undercount.
Slope for minority \(B_1\): For every one percentage point increase in minority, the preliminary estimated undercount is expected to increase by 0.08708 percent on average, controlling for the rate of serious crimes per 1000 population, the percentage poor, and the percentage of housing in small, multiunit buildings.
Slope for crime \(B_2\): For every additional serious crime per 1000 population, the preliminary estimated undercount is expected to increase by 0.03324 percent on average, controlling for the percentage minority, the percentage poor, and the percentage of housing in small, multiunit buildings.
Slope for poverty \(B_3\): For every one percentage point increase in poverty, the preliminary estimated undercount is expected to decrease by 0.09475 percent on average, controlling for the percentage minority, the rate of serious crimes per 1000 population, and the percentage of housing in small, multiunit buildings.
Slope for housing \(B_4\): For every one percentage point increase in housing, the preliminary estimated undercount is expected to decrease by 0.005084 percent on average, controlling for the percentage minority, the rate of serious crimes per 1000 population, and the percentage poor.
Residual Standard Error \(S_E\): On average, the predicted undercount was off by 1.651 percent.
Correlation \(r\): A single correlation coefficient is not very useful for a multiple regression. Instead, examine the correlation matrix of all pairwise correlations, as in the ggpairs plot above.
R-squared \(r^2\): About 58% of the variance in percent undercount is explained by the multiple regression model.
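To make the slope interpretations concrete, here is a prediction for a hypothetical census unit (a sketch; the predictor values below are illustrative, not taken from the dataset):
# predicted undercount for a hypothetical census unit (illustrative values)
new_unit <- tibble(minority = 25, crime = 60, poverty = 15, housing = 20)
predict(ex3_model, newdata = new_unit, interval = "prediction")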
The results from the multiple regression model are somewhat similar to those of the simple regression model fit above. The slope for minority is 0.0991 in the simple regression and 0.08708 in the multiple regression, where the other regressors are held constant. The intercept is -0.005076 in the simple regression and -0.512 in the multiple regression. This difference may be caused by the other regressors now being included in the model.
Comparison of the multiple regression and simple regression
Still, the multiple regression model is an improvement over the simple regression: the R-squared increases from 49% to 58% and the residual standard error decreases from 1.772% to 1.651%, so its predictions are closer to the actual data.
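Since the car package is loaded at the top, one additional check (not part of the original analysis) is to look at variance inflation factors, which flag collinearity among the four predictors:
# variance inflation factors for the multiple regression
vif(ex3_model)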
The multiple regression fit in Part 2 was modified to compute standardized partial regression coefficients, \(B^*_j\)'s, by z-scoring (standardizing) each variable in the model, including the response, and then re-fitting the model on the standardized data.
# standardize data
scaled_ericksen <- ericksen_data %>%
  select(undercount, minority, crime, poverty, housing) %>%
  scale() %>%
  as_tibble()
scaled_model <- lm(undercount ~ minority + crime + poverty + housing - 1,
                   data = scaled_ericksen)
summary(scaled_model)
Call:
lm(formula = undercount ~ minority + crime + poverty + housing -
1, data = scaled_ericksen)
Residuals:
Min 1Q Median 3Q Max
-1.24950 -0.45307 -0.01729 0.41573 1.55976
Coefficients:
Estimate Std. Error t value Pr(>|t|)
minority 0.61728 0.15587 3.960 0.000196 ***
crime 0.33490 0.12591 2.660 0.009940 **
poverty -0.17184 0.12875 -1.335 0.186882
housing -0.02022 0.09991 -0.202 0.840237
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6628 on 62 degrees of freedom
Multiple R-squared: 0.581, Adjusted R-squared: 0.554
F-statistic: 21.49 on 4 and 62 DF, p-value: 3.698e-11
# correlation
sqrt(summary(scaled_model)$r.squared)
[1] 0.7622268
Coefficients for:
Minority \(B^*_1\): For a one standard deviation increase in minority, undercount is expected to increase by 0.6173 standard deviations on average, holding the other predictors constant.
Crime \(B^*_2\): For a one standard deviation increase in crime, undercount is expected to increase by 0.3349 standard deviations on average, holding the other predictors constant.
Poverty \(B^*_3\): For a one standard deviation increase in poverty, undercount is expected to decrease by 0.1718 standard deviations on average, holding the other predictors constant.
Housing \(B^*_4\): For a one standard deviation increase in housing, undercount is expected to decrease by 0.02022 standard deviations on average, holding the other predictors constant.
What has happened to the intercept \(A^*\)? Explain why this is not surprising.
The intercept is 0. This is not surprising: the least-squares fit always passes through the point of means, and after z-scoring every variable has mean 0, so the fitted line must pass through the origin.
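A quick check of this claim (a sketch, not in the original output): refit the standardized model with an intercept and confirm the estimate is numerically zero.
# the intercept of the standardized fit should be (numerically) zero
coef(lm(undercount ~ minority + crime + poverty + housing,
        data = scaled_ericksen))["(Intercept)"]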
Comparison of multiple \(R^2\)'s from the model fit in Part 2 and from the standardized model fit
The R-squared of the model fit in Part 2 and of the standardized model fit here are the same: 0.581, or about 58%. Intuitively, standardizing is a linear transformation of each variable; it rescales the coefficients and removes the intercept, but it does not change the proportion of variance explained.
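One way to see the connection between the two fits (a sketch using the identity \(B^*_j = B_j \, s_{x_j} / s_y\), which follows from standardizing each variable):
# recover the standardized coefficients from the unstandardized fit
b  <- coef(ex3_model)[-1]
sx <- sapply(ericksen_data[c("minority", "crime", "poverty", "housing")], sd)
sy <- sd(ericksen_data$undercount)
b * sx / sy  # should match the coefficients of scaled_model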
A 95% confidence interval was constructed for the least-squares estimates of the slope, \(\beta\), and intercept, \(\alpha\).
# fit the model
ex3_mod <- lm(undercount ~ minority, ericksen_data)
# confidence interval
confint(ex3_mod)
                  2.5 %    97.5 %
(Intercept) -0.65876210 0.6486110
minority     0.07402965 0.1241696
Estimated Model:
\[\widehat{undercount}_i = \alpha + \beta \times minority_i\]
Confidence Interval for slope \(\beta\): [0.0740, 0.124]
Confidence Interval for intercept \(\alpha\): [-0.659, 0.649]
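These intervals can also be reproduced by hand from the summary output (a sketch; qt() gives the t critical value for the 64 residual degrees of freedom):
# manual 95% interval for the minority slope
est <- coef(summary(ex3_mod))["minority", "Estimate"]
se  <- coef(summary(ex3_mod))["minority", "Std. Error"]
est + c(-1, 1) * qt(0.975, df = df.residual(ex3_mod)) * se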
In Part 2, I fit a multiple linear regression using undercount as the response variable with minority, crime, poverty, and housing as the explanatory variables.
Now, a 95% confidence interval was constructed for all the least-squares estimates (slopes, \(\beta_j\)'s, and intercept, \(\alpha\)).
# fit the model
ex4_mod <- lm(undercount ~ minority + crime + poverty + housing,
               data = ericksen_data)
# confidence interval
confint(ex4_mod)
                   2.5 %     97.5 %
(Intercept) -2.491538658 1.46752795
minority     0.042751422 0.13140447
crime        0.008046421 0.05843797
poverty     -0.237856985 0.04836755
housing     -0.055715814 0.04554717
Estimated Model:
\[\widehat{undercount}_i = \alpha + \beta_1 \times minority_i + \beta_2 \times crime_i + \beta_3 \times poverty_i + \beta_4 \times housing_i\]
Confidence Interval for intercept \(\alpha\): [-2.49, 1.47]
Confidence Interval for minority slope \(\beta_1\): [0.0428, 0.131]
Confidence Interval for crime slope \(\beta_2\): [0.00805, 0.0584]
Confidence Interval for poverty slope \(\beta_3\): [-0.238, 0.0484]
Confidence Interval for housing slope \(\beta_4\): [-0.0557, 0.0455]
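The same estimates and intervals can be collected in one tidy table (assuming the broom package is installed; it is not loaded in the original code):
# coefficient table with 95% confidence intervals
broom::tidy(ex4_mod, conf.int = TRUE, conf.level = 0.95)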
The ANOVA Table for Regression
| Source | Sum of Squares | df | Mean Square | F |
|---|---|---|---|---|
| Regression | 230.53 | 4 | 57.63 | 21.15 |
| Residuals | 166.26 | 61 | 2.73 | |
| Total | 396.78 | 65 | | |
# baseline model
ex4_baseline <- lm(undercount ~ 1, data = ericksen_data)
# anova values
anova(ex4_baseline, ex4_mod)
Analysis of Variance Table
Model 1: undercount ~ 1
Model 2: undercount ~ minority + crime + poverty + housing
Res.Df RSS Df Sum of Sq F Pr(>F)
1 65 396.78
2 61 166.26 4 230.53 21.145 5.626e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# summary
summary(ex4_mod)
Call:
lm(formula = undercount ~ minority + crime + poverty + housing,
data = ericksen_data)
Residuals:
Min 1Q Median 3Q Max
-3.0871 -1.1194 -0.0427 1.0271 3.8537
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.512005 0.989953 -0.517 0.606886
minority 0.087078 0.022167 3.928 0.000221 ***
crime 0.033242 0.012600 2.638 0.010562 *
poverty -0.094745 0.071570 -1.324 0.190505
housing -0.005084 0.025321 -0.201 0.841524
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.651 on 61 degrees of freedom
Multiple R-squared: 0.581, Adjusted R-squared: 0.5535
F-statistic: 21.15 on 4 and 61 DF, p-value: 5.626e-11
# OR calculate sum of squares directly
# RegSS
## regss <- sum((fitted(ex4_mod) - mean(ericksen_data$undercount))^2)
# RSS
## rss <- sum(resid(ex4_mod)^2)
# TSS
## tss <- regss + rss
## # or equivalently:
## sum((ericksen_data$undercount - mean(ericksen_data$undercount))^2)
# OR find p-value directly
## pf(21.15, 4, 61, lower.tail = FALSE)

Omnibus Test
Using the ANOVA table for regression, the omnibus null hypothesis test that all \(\beta_j\)’s are 0 was conducted.
Model:
\[undercount_i = \alpha + \beta_1 \times minority_i + \beta_2 \times crime_i + \beta_3 \times poverty_i + \beta_4 \times housing_i + \epsilon_i\]
\(H_0: \beta_{minority} = \beta_{crime} = \beta_{poverty} = \beta_{housing} = 0\)
\(H_A\): at least one of the \(\beta\)s is non zero
For significance level \(\alpha = 0.05\),
Since the p-value is less than 0.05, the result is statistically significant, so we reject \(H_0\) in favor of \(H_A\). The data suggest that at least one of the slope parameters in the regression model is not 0.
Furthermore, the ANOVA table shows that the value of \(F_0\) is 21.2, which is much larger than 1, so we reject the omnibus null hypothesis.
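For reference, the critical value that the observed \(F_0\) is compared against can be computed directly (a sketch; at \(\alpha = 0.05\) with 4 and 61 degrees of freedom it is roughly 2.5, far below 21.2):
# critical F value for the omnibus test
qf(0.95, df1 = 4, df2 = 61)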
Subset of Slopes Test
Using a subset of slopes test, I examined whether there is sufficient evidence that our multiple regression's inclusion of crime, poverty, and housing as additional explanatory variables fits the data better than the simple linear regression that uses only minority as an explanatory variable.
# anova test for nested models
anova(ex4_mod, ex3_mod)
Analysis of Variance Table
Model 1: undercount ~ minority + crime + poverty + housing
Model 2: undercount ~ minority
Res.Df RSS Df Sum of Sq F Pr(>F)
1 61 166.25
2 64 200.96 -3 -34.709 4.245 0.00865 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Models:
Multiple Regression
\[undercount_i = \alpha + \beta_1 \times minority_i + \beta_2 \times crime_i + \beta_3 \times poverty_i + \beta_4 \times housing_i + \epsilon_i\]
Simple Linear Regression
\[undercount_i = \alpha + \beta \times minority_i + \epsilon_i\]
\(H_0: \beta_{crime} = \beta_{poverty} = \beta_{housing} = 0\)
\(H_A\): at least one of the \(\beta\)s is non zero
For significance level \(\alpha = 0.05\),
Since the p-value is less than 0.05, the result is statistically significant, so we reject \(H_0\) in favor of \(H_A\). The data suggest that at least one of the additional slope parameters is not 0. The subset of slopes test shows that the multiple regression's inclusion of crime, poverty, and housing as additional explanatory variables fits the data better than the simple linear regression that uses only minority as an explanatory variable.
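The incremental F-statistic can also be computed by hand from the two residual sums of squares (a sketch of the formula \(F = \frac{(RSS_{reduced} - RSS_{full})/q}{RSS_{full}/df_{full}}\), with \(q = 3\) dropped slopes):
# partial F-test computed directly from the residual sums of squares
rss_full <- sum(resid(ex4_mod)^2)
rss_red  <- sum(resid(ex3_mod)^2)
f_stat   <- ((rss_red - rss_full) / 3) / (rss_full / df.residual(ex4_mod))
f_stat
pf(f_stat, df1 = 3, df2 = df.residual(ex4_mod), lower.tail = FALSE)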
Subset of Slopes Test
Considering the multiple regression with its 4 explanatory variables, I tested whether the slope parameter for minority is 0 using a subset of slopes test and again using a t-test.
# fit the multiple subset model
ex4_minority <- lm(undercount ~ crime + poverty + housing, data = ericksen_data)
# anova test for nested models (only minority slope)
anova_minority <- anova(ex4_minority, ex4_mod)
# t-test for minority slope
lm_fit <- summary(ex4_mod)
# calculate values
lm_fit$coefficients[2, 3]^2
[1] 15.4307
anova_minority$F[2]
[1] 15.4307
lm_fit$coefficients[2, 3]
[1] 3.928192
anova_minority$`Pr(>F)`[2]
[1] 0.0002207146
For both the subset of slopes test (the F-test with 1 and 61 degrees of freedom) and a two-sided t-test at the 0.05 level with a null value of 0, the p-value for the slope parameter of minority is 0.000221. The p-values are the same for the two tests. Since the p-value is less than 0.05, the result is statistically significant, so we reject \(H_0\) in favor of \(H_A\). The data suggest that the slope parameter for minority in the multiple regression model is not 0.
The t-test statistic is 3.93, while the F-test statistic is 15.4. The F-test statistic is equal to the t-test statistic squared.
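This identity (3.928² ≈ 15.43) can be confirmed numerically:
# verify that t^2 equals the partial F-statistic
t_val <- summary(ex4_mod)$coefficients["minority", "t value"]
all.equal(t_val^2, anova_minority$F[2])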