Excel_Replication

Author

1 Setup and Data Import

We begin by clearing the workspace, loading required libraries, and reading in the data exported from Excel.

remove(list=ls())

getwd()

[1] "/Users/arvindsharma/Library/CloudStorage/Dropbox/WCAS/predictive_analytics/Spring 2026/Week1"

list.files()

 [1] "Discussion_Template_files"                            
 [2] "Discussion_Template.html"                             
 [3] "Discussion_Template.pdf"                              
 [4] "Discussion_Template.qmd"                              
 [5] "images"                                               
 [6] "in_class.html"                                        
 [7] "in_class.pdf"                                         
 [8] "in_class.qmd"                                         
 [9] "MyLibrary.bib"                                        
[10] "references.bib"                                       
[11] "rsconnect"                                            
[12] "tssible"                                              
[13] "U.S._Natural_Gas_Residential_Consumption_inclass.csv" 
[14] "U.S._Natural_Gas_Residential_Consumption_inclass.xlsx"
[15] "U.S._Natural_Gas_Residential_Consumption.html"        
[16] "U.S._Natural_Gas_Residential_Consumption.qmd"         
[17] "U.S._Natural_Gas_Residential_Consumption.rmarkdown"   
[18] "week 1-2_Annotated.pdf"                               
[19] "week 1-2.pdf"                                         
[20] "week 1-2.pptx"

library(readr)

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(lubridate)


Attaching package: 'lubridate'

The following objects are masked from 'package:base':

    date, intersect, setdiff, union

library(fastDummies)
library(tsibble)

Registered S3 method overwritten by 'tsibble':
  method               from 
  as_tibble.grouped_df dplyr


Attaching package: 'tsibble'

The following object is masked from 'package:lubridate':

    interval

The following objects are masked from 'package:base':

    intersect, setdiff, union

library(visdat)
library(ggplot2)


df <- read_csv("U.S._Natural_Gas_Residential_Consumption_inclass.csv",  skip = 4)

Rows: 629 Columns: 2

── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Month
dbl (1): U.S. Natural Gas Residential Consumption  Million Cubic Feet

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

1.1 Data Inspection

We quickly inspect the structure and missingness to confirm the data was read correctly.

str(df)

spc_tbl_ [629 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Month                                                       : chr [1:629] "May 2025" "Apr 2025" "Mar 2025" "Feb 2025" ...
 $ U.S. Natural Gas Residential Consumption  Million Cubic Feet: num [1:629] 188079 327128 525950 798740 1037050 ...
 - attr(*, "spec")=
  .. cols(
  ..   Month = col_character(),
  ..   `U.S. Natural Gas Residential Consumption  Million Cubic Feet` = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

visdat::vis_dat(df)

1.2 Creating Monthly Dummy Variables

The Month column is converted to a proper year-month format.

We then extract the month and create dummy variables to replicate Excel’s seasonal regression.

df_dummies <- df %>%
  mutate(
    Month_ym = tsibble::yearmonth(Month),
    Month_name = factor(
      month(Month_ym, label = TRUE, abbr = TRUE),
      levels = c("Jan","Feb","Mar","Apr","May","Jun",
                 "Jul","Aug","Sep","Oct","Nov","Dec")
    )
  ) %>%
  select(
    Consumption = `U.S. Natural Gas Residential Consumption  Million Cubic Feet`,
    Month_name
  ) %>%
  dummy_cols(
    select_columns = "Month_name",
    remove_first_dummy = FALSE,   # keep all 12 months
    remove_selected_columns = TRUE
  )

str(df_dummies)

tibble [629 × 13] (S3: tbl_df/tbl/data.frame)
 $ Consumption   : num [1:629] 188079 327128 525950 798740 1037050 ...
 $ Month_name_Jan: int [1:629] 0 0 0 0 1 0 0 0 0 0 ...
 $ Month_name_Feb: int [1:629] 0 0 0 1 0 0 0 0 0 0 ...
 $ Month_name_Mar: int [1:629] 0 0 1 0 0 0 0 0 0 0 ...
 $ Month_name_Apr: int [1:629] 0 1 0 0 0 0 0 0 0 0 ...
 $ Month_name_May: int [1:629] 1 0 0 0 0 0 0 0 0 0 ...
 $ Month_name_Jun: int [1:629] 0 0 0 0 0 0 0 0 0 0 ...
 $ Month_name_Jul: int [1:629] 0 0 0 0 0 0 0 0 0 0 ...
 $ Month_name_Aug: int [1:629] 0 0 0 0 0 0 0 0 0 1 ...
 $ Month_name_Sep: int [1:629] 0 0 0 0 0 0 0 0 1 0 ...
 $ Month_name_Oct: int [1:629] 0 0 0 0 0 0 0 1 0 0 ...
 $ Month_name_Nov: int [1:629] 0 0 0 0 0 0 1 0 0 0 ...
 $ Month_name_Dec: int [1:629] 0 0 0 0 0 1 0 0 0 0 ...

head(df_dummies)

# A tibble: 6 × 13
  Consumption Month_name_Jan Month_name_Feb Month_name_Mar Month_name_Apr
        <dbl>          <int>          <int>          <int>          <int>
1      188079              0              0              0              0
2      327128              0              0              0              1
3      525950              0              0              1              0
4      798740              0              1              0              0
5     1037050              1              0              0              0
6      758478              0              0              0              0
# ℹ 8 more variables: Month_name_May <int>, Month_name_Jun <int>,
#   Month_name_Jul <int>, Month_name_Aug <int>, Month_name_Sep <int>,
#   Month_name_Oct <int>, Month_name_Nov <int>, Month_name_Dec <int>

2 Regression with Monthly Dummies

We regress consumption on month indicators. This mirrors an Excel regression with all months included.

reg <- lm(data =df_dummies,
          formula = Consumption ~ .)

# str(reg)

summary(reg)


Call:
lm(formula = Consumption ~ ., data = df_dummies)

Residuals:
    Min      1Q  Median      3Q     Max 
-206820  -24199   -1414   21833  220508 

Coefficients: (1 not defined because of singularities)
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      693832       7387  93.922  < 2e-16 ***
Month_name_Jan   172866      10398  16.625  < 2e-16 ***
Month_name_Feb    63970      10398   6.152 1.37e-09 ***
Month_name_Mar   -83862      10398  -8.065 3.81e-15 ***
Month_name_Apr  -296965      10398 -28.560  < 2e-16 ***
Month_name_May  -457522      10398 -44.001  < 2e-16 ***
Month_name_Jun  -540075      10447 -51.695  < 2e-16 ***
Month_name_Jul  -567629      10447 -54.333  < 2e-16 ***
Month_name_Aug  -577021      10447 -55.232  < 2e-16 ***
Month_name_Sep  -565792      10447 -54.157  < 2e-16 ***
Month_name_Oct  -473732      10447 -45.345  < 2e-16 ***
Month_name_Nov  -270796      10447 -25.920  < 2e-16 ***
Month_name_Dec       NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 53270 on 617 degrees of freedom
Multiple R-squared:  0.9612,    Adjusted R-squared:  0.9605 
F-statistic:  1388 on 11 and 617 DF,  p-value: < 2.2e-16

3 Verifying Excel vs R Calculations

These quick checks confirm that R’s fitted values match Excel’s month-based calculations.

693832 - 577021 # Aug R coefficient calc matches Aug Excel coefficient below (116881)

[1] 116811

693832 + 172866 # Jan R coefficient calc matches

[1] 866698

116810 + 749887 # Jan Excel coefficient below

[1] 866697

3.1 Plotting Actual vs Fitted Values

Finally, we plot the actual consumption against the fitted values from our regression model.

df_plot <- df %>%
  mutate(
    Month = tsibble::yearmonth(Month),
    Fitted = fitted(reg)
  )

ggplot(df_plot, aes(x = Month)) +
  geom_line(aes(y = `U.S. Natural Gas Residential Consumption  Million Cubic Feet`,
                color = "Actual"), linewidth = 1) +
  geom_line(aes(y = Fitted,
                color = "Fitted"), linewidth = .8, linetype = "dashed") +
  labs(
    title = "Actual vs Fitted Residential Natural Gas Consumption",
    x = "Time",
    y = "Consumption (Million Cubic Feet)",
    color = ""
  ) +
  theme_minimal()

3.2 Excel Comparison

The figure below shows the corresponding Excel chart used for replication.

Key Takeaway

This exercise demonstrates how a month-dummy regression reproduces Excel’s seasonal logic exactly, while providing a clearer and more extensible framework in R.

Replicate the regression using TSLM() from fpp3