Inference for Regression

Harold Nelson

2025-12-02

Setup

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(moderndive)
library(infer)
load("OAW2309.Rdata")

Dummy Variables Model

Use the OAW2309 dataframe Create a linear regression model to predict the mean value of TMAX from the calendar month. Call this TMAX_mo_T.

Solution

TMAX_mo_T = lm(TMAX~mo,data=OAW2309)

Use summary() to examine the model.

Solution

summary(TMAX_mo_T)
## 
## Call:
## lm(formula = TMAX ~ mo, data = OAW2309)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.940  -5.033  -0.530   4.470  38.967 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 44.90791    0.14427 311.284   <2e-16 ***
## mo2          4.03164    0.20892  19.297   <2e-16 ***
## mo3          8.44918    0.20404  41.409   <2e-16 ***
## mo4         14.05624    0.20580  68.300   <2e-16 ***
## mo5         21.14569    0.20372 103.796   <2e-16 ***
## mo6         26.12542    0.20507 127.400   <2e-16 ***
## mo7         32.62203    0.20341 160.377   <2e-16 ***
## mo8         32.56430    0.20339 160.109   <2e-16 ***
## mo9         26.78366    0.20507 130.610   <2e-16 ***
## mo10        15.71011    0.20400  77.009   <2e-16 ***
## mo11         5.62339    0.20570  27.338   <2e-16 ***
## mo12        -0.06303    0.20404  -0.309    0.757    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.272 on 30063 degrees of freedom
## Multiple R-squared:  0.7198, Adjusted R-squared:  0.7197 
## F-statistic:  7019 on 11 and 30063 DF,  p-value: < 2.2e-16

Use the Improved Modern Dive Alternative

Use get_regression_table(). Compare with summary().

Solution

regression_table_T = get_regression_table(TMAX_mo_T)
regression_table_T
## # A tibble: 12 × 7
##    term      estimate std_error statistic p_value lower_ci upper_ci
##    <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
##  1 intercept   44.9       0.144   311.      0       44.6     45.2  
##  2 mo: 2        4.03      0.209    19.3     0        3.62     4.44 
##  3 mo: 3        8.45      0.204    41.4     0        8.05     8.85 
##  4 mo: 4       14.1       0.206    68.3     0       13.7     14.5  
##  5 mo: 5       21.1       0.204   104.      0       20.7     21.5  
##  6 mo: 6       26.1       0.205   127.      0       25.7     26.5  
##  7 mo: 7       32.6       0.203   160.      0       32.2     33.0  
##  8 mo: 8       32.6       0.203   160.      0       32.2     33.0  
##  9 mo: 9       26.8       0.205   131.      0       26.4     27.2  
## 10 mo: 10      15.7       0.204    77.0     0       15.3     16.1  
## 11 mo: 11       5.62      0.206    27.3     0        5.22     6.03 
## 12 mo: 12      -0.063     0.204    -0.309   0.757   -0.463    0.337

Obtain fitted and residuals dataframe.

Solution

fitted_and_residuals_T = get_regression_points(TMAX_mo_T)
glimpse(fitted_and_residuals_T)
## Rows: 30,075
## Columns: 5
## $ ID       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
## $ TMAX     <dbl> 66, 63, 58, 55, 57, 59, 58, 65, 68, 85, 84, 75, 72, 59, 61, 5…
## $ mo       <fct> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6…
## $ TMAX_hat <dbl> 66.054, 66.054, 66.054, 66.054, 66.054, 66.054, 66.054, 66.05…
## $ residual <dbl> -0.054, -3.054, -8.054, -11.054, -9.054, -7.054, -8.054, -1.0…

Examine Graphically

Get a density plot of the residuals. Does it look like a normal distribution?

Solution

fitted_and_residuals_T %>% 
  ggplot(aes(x = residual)) +
  geom_density()

Look at Individual Months.

Compare the fitted values with the actuals. Put mo on the vertical axis. Put the fitted value and the actual values on the horizontal axis. Make the fitted value large and red. Use geom_jitter() with a size of .05 for the actuals.

Solution

fitted_and_residuals_T %>%
   
  ggplot(aes(x = TMAX_hat,y = mo)) +
           geom_point(color = "red",size = 2) +
           geom_jitter(aes(x = TMAX),size = .05,alpha = .1) 

A Second Graphic.

Use facet_wrap(~mo) to separate the months. Plot a density for the actuals and a vline for the fitted.

Solution

fitted_and_residuals_T %>% 
  ggplot() + 
  geom_density(aes(x = TMAX)) +
  geom_vline(aes(xintercept = TMAX_hat),color = "red") +
  facet_wrap(~mo)

Repeat with the Infer Package

First get observed_fit.

Solution

observed_fit <- OAW2309 |> 
  specify(
    TMAX ~ mo
  ) |>
  fit()
observed_fit
## # A tibble: 12 × 2
##    term      estimate
##    <chr>        <dbl>
##  1 intercept  44.9   
##  2 mo2         4.03  
##  3 mo3         8.45  
##  4 mo4        14.1   
##  5 mo5        21.1   
##  6 mo6        26.1   
##  7 mo7        32.6   
##  8 mo8        32.6   
##  9 mo9        26.8   
## 10 mo10       15.7   
## 11 mo11        5.62  
## 12 mo12       -0.0630

Note that the coeffcients are identical to the ones from the theoretical model.

Bootstrap

Get the bootstrap distribution of the model coefficients,

Solution

boot_distribution_model <- OAW2309 |>
  specify(
    TMAX ~ mo
  ) |>
  generate(reps = 1000, type = "bootstrap") |>
  fit()
boot_distribution_model
## # A tibble: 12,000 × 3
## # Groups:   replicate [1,000]
##    replicate term      estimate
##        <int> <chr>        <dbl>
##  1         1 intercept    44.8 
##  2         1 mo2           3.91
##  3         1 mo3           8.50
##  4         1 mo4          14.2 
##  5         1 mo5          21.5 
##  6         1 mo6          26.0 
##  7         1 mo7          32.8 
##  8         1 mo8          32.9 
##  9         1 mo9          27.1 
## 10         1 mo10         16.0 
## # ℹ 11,990 more rows

Get Confidence Intervals

Solution

`

confidence_intervals_model <- boot_distribution_model |> 
  get_confidence_interval(
    level = 0.95,
    type = "percentile",
    point_estimate = observed_fit)
confidence_intervals_model
## # A tibble: 12 × 3
##    term      lower_ci upper_ci
##    <chr>        <dbl>    <dbl>
##  1 intercept   44.7     45.2  
##  2 mo10        15.4     16.1  
##  3 mo11         5.29     5.99 
##  4 mo12        -0.403    0.279
##  5 mo2          3.68     4.40 
##  6 mo3          8.09     8.81 
##  7 mo4         13.7     14.4  
##  8 mo5         20.7     21.6  
##  9 mo6         25.7     26.5  
## 10 mo7         32.2     33.0  
## 11 mo8         32.1     32.9  
## 12 mo9         26.4     27.1

Compare

Repeat the display of the regression table from the theoretical model and compare.

Solution

regression_table_T %>% 
  select(term,lower_ci,upper_ci)
## # A tibble: 12 × 3
##    term      lower_ci upper_ci
##    <chr>        <dbl>    <dbl>
##  1 intercept   44.6     45.2  
##  2 mo: 2        3.62     4.44 
##  3 mo: 3        8.05     8.85 
##  4 mo: 4       13.7     14.5  
##  5 mo: 5       20.7     21.5  
##  6 mo: 6       25.7     26.5  
##  7 mo: 7       32.2     33.0  
##  8 mo: 8       32.2     33.0  
##  9 mo: 9       26.4     27.2  
## 10 mo: 10      15.3     16.1  
## 11 mo: 11       5.22     6.03 
## 12 mo: 12      -0.463    0.337