Predicting Sewer Water Flow Quantity

library(tidyverse)
setwd("C:/Users/sarah/Downloads")
overflow <- read_csv("Reported_Sewer_Overflows_20260505.csv")

I. Introduction

What variables are best at predicting the quantity of sewer overflows (in gallons). Data on reported sewer overflows is provided by the Maryland Department of the Environment. This dataset shows sewer overflows all throughout Maryland from January 1, 2005 through January 31, 2023. The dataset has 27,479 observations and 22 variables. Each observation/case shows an overflow that occurred. Of the 22 variables, I find Quantity In Gallons (Estimated), Cause, duration variables, County, and Overflow Type variables most important.

A. Variables

Quantity In Gallons (Estimated) is a numerical variable that shows how much water, in gallons, the sewer overflow had. This is an estimate.

Cause is a categorical variable that shows the cause of the overflow.

Duration variables, such as Duration - Days and Duration - Hours, are numerical variables that show how long the overflow lasted.

County is a categorical variable that shows which county the overflow happened in.

Overflow Type is a categorical variable that shows the overflow was either a sanitary sewer overflow (SSO) or a combined sewer overflow (CSO) or a bypass.

B. Motives & Objectives

I chose this project because it can answer many questions if all assumptions pass: Does the county of where the overflow happens important to the quantity–in gallons–of the overflow? Is duration of the overflow important to how much overflow actually happens (quantity)? Is the type of overflow important to the quantity of overflow? Is the cause of the overflow important at predicting how much overflow will actually happen?

Furthermore, if these variables explain the overflow quantity well, then initiatives can be taken that give strategic attention to such variables and reduce the quantity of overflows.

It’s important to consider that there may not even be any significant variables. If all assumptions pass, this signifies that none of these variables are meaningful at predicting the quantity of the overflow.

In this project, I will see which variables are most significant at predicting the quantity of sewer overflows. I will first clean the dataset and then do EDA. Then, I will use a multiple linear model, use backwards elimination based on p-values, and then display the results. Then, I will create visualizations and end it off with a conclusion, summarizing the whole project.

Variables I’ll be using for the multiple linear regression:

  • duration_day
  • duration_hour
  • duration_minutes
  • quantity_in_gallons_estimated (the variable being predicted)
  • cause
  • county
  • overflow_type

II. EDA and Cleaning

In the EDA section, I go over the structure and summary of the dataset to learn more about it. Then, I check for N/A values. Since R automatically gets rid of N/A values in the model, I really don’t really need to deal with them with that part. Besides that, I discover that the “causes” variable has over 2,000 categories; this stresses out my multiple linear model because it’ll have to generate statistics (such as p-values, coefficients, etc.) for each category. So, I select categories that have at least 200 counts of overflows in them.

In the cleaning section, I rename variable names and do the process of selecting categories in the “causes” variable that have at least 200 counts in them. Furthermore, I found that some categories in the “causes” and “overflow_type” variables are redundant, so I had to combine them.

A. EDA

Check the structure of the dataset.

str(overflow)
## spc_tbl_ [27,479 × 22] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Overflow Type                  : chr [1:27479] "SSO" "SSO" "SSO" "SSO" ...
##  $ Municipality/Facility          : chr [1:27479] "Orbital ATK" "Constellation Power Source Generation, Inc." "Holcim, U.S., Inc." "Holcim, U.S., Inc." ...
##  $ NPDES #                        : chr [1:27479] "MD0000078" "MD0001503" "MD0002151" "MD0002151" ...
##  $ Date Discovered                : chr [1:27479] "07/26/2017" "04/30/2018" "08/10/2015" "08/10/2015" ...
##  $ Time Discovered                : chr [1:27479] NA "9:20:00 AM" "8:34:00 AM" "8:34:00 AM" ...
##  $ Duration - Days                : num [1:27479] NA NA 0 0 NA 0 0 0 1 0 ...
##  $ Duration - Hours               : num [1:27479] NA NA 0 0 NA 0 1 0 0 0 ...
##  $ Duration - Minutes             : num [1:27479] NA NA 36 36 NA 15 0 43 0 6 ...
##  $ Location                       : chr [1:27479] "55 Thiokol Rd." "3000 Brandon Shores Rd." "1260 Security Road, Hagerstown, MD" "1260 Security Road, Hagerstown, MD" ...
##  $ Zip Code                       : num [1:27479] 21921 21226 21742 21742 20640 ...
##  $ Collection-System              : chr [1:27479] "Alliant Techsystems Operations LLC - Elkton Facility" "H.A. Wagner Generating Station" "Holcim WWTP" "Holcim WWTP" ...
##  $ Quantity in Gallons (Estimated): num [1:27479] NA 1500 300 300 0 ...
##  $ Net in Gallons (Estimated)     : num [1:27479] NA 1500 300 300 0 ...
##  $ Cause                          : chr [1:27479] "Broken pipe." "Sump discharge line failure." "Stuck toilet" "Stuck toilet" ...
##  $ Watershed                      : chr [1:27479] NA "Patapsco River" NA NA ...
##  $ Receiving waters               : chr [1:27479] NA "storm drain" "Unknown" "Unknown" ...
##  $ County                         : chr [1:27479] "Cecil" "Anne Arundel" "Washington" "Washington" ...
##  $ Notes                          : chr [1:27479] NA NA "Blank fields = No Information Submitted" "Blank fields = No Information Submitted" ...
##  $ Penalty Collected              : chr [1:27479] NA NA NA NA ...
##  $ Penalty Comments               : chr [1:27479] NA NA NA NA ...
##  $ Latitude                       : num [1:27479] 39.6 39.2 NA NA NA ...
##  $ Longitude                      : num [1:27479] -75.9 -76.5 NA NA NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Overflow Type` = col_character(),
##   ..   `Municipality/Facility` = col_character(),
##   ..   `NPDES #` = col_character(),
##   ..   `Date Discovered` = col_character(),
##   ..   `Time Discovered` = col_character(),
##   ..   `Duration - Days` = col_double(),
##   ..   `Duration - Hours` = col_double(),
##   ..   `Duration - Minutes` = col_number(),
##   ..   Location = col_character(),
##   ..   `Zip Code` = col_double(),
##   ..   `Collection-System` = col_character(),
##   ..   `Quantity in Gallons (Estimated)` = col_number(),
##   ..   `Net in Gallons (Estimated)` = col_number(),
##   ..   Cause = col_character(),
##   ..   Watershed = col_character(),
##   ..   `Receiving waters` = col_character(),
##   ..   County = col_character(),
##   ..   Notes = col_character(),
##   ..   `Penalty Collected` = col_character(),
##   ..   `Penalty Comments` = col_character(),
##   ..   Latitude = col_double(),
##   ..   Longitude = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Create a summary of the dataset

summary(overflow)
##  Overflow Type      Municipality/Facility   NPDES #          Date Discovered   
##  Length:27479       Length:27479          Length:27479       Length:27479      
##  Class :character   Class :character      Class :character   Class :character  
##  Mode  :character   Mode  :character      Mode  :character   Mode  :character  
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##  Time Discovered    Duration - Days   Duration - Hours Duration - Minutes
##  Length:27479       Min.   :  0.000   Min.   :  0.00   Min.   :    0.00  
##  Class :character   1st Qu.:  0.000   1st Qu.:  1.00   1st Qu.:    0.00  
##  Mode  :character   Median :  0.000   Median :  2.00   Median :   10.00  
##                     Mean   :  0.401   Mean   :  7.04   Mean   :   39.59  
##                     3rd Qu.:  0.000   3rd Qu.:  7.00   3rd Qu.:   30.00  
##                     Max.   :730.000   Max.   :576.00   Max.   :91730.00  
##                     NA's   :6933      NA's   :6870     NA's   :6884      
##    Location            Zip Code      Collection-System 
##  Length:27479       Min.   :  2122   Length:27479      
##  Class :character   1st Qu.: 21210   Class :character  
##  Mode  :character   Median : 21403   Mode  :character  
##                     Mean   : 21348                     
##                     3rd Qu.: 21532                     
##                     Max.   :215832                     
##                     NA's   :1439                       
##  Quantity in Gallons (Estimated) Net in Gallons (Estimated)    Cause          
##  Min.   :        0               Min.   :        0          Length:27479      
##  1st Qu.:      200               1st Qu.:      200          Class :character  
##  Median :     3400               Median :     3350          Mode  :character  
##  Mean   :   255930               Mean   :   254732                            
##  3rd Qu.:    26746               3rd Qu.:    26702                            
##  Max.   :106102592               Max.   :106102592                            
##  NA's   :59                      NA's   :59                                   
##   Watershed         Receiving waters      County             Notes          
##  Length:27479       Length:27479       Length:27479       Length:27479      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Penalty Collected  Penalty Comments      Latitude       Longitude      
##  Length:27479       Length:27479       Min.   :30.01   Min.   :-83.601  
##  Class :character   Class :character   1st Qu.:38.99   1st Qu.:-77.013  
##  Mode  :character   Mode  :character   Median :39.35   Median :-76.065  
##                                        Mean   :39.05   Mean   : -1.484  
##                                        3rd Qu.:39.39   3rd Qu.: 78.491  
##                                        Max.   :42.73   Max.   : 81.528  
##                                        NA's   :9288    NA's   :9293

Average quantity of sewage overflow: 255930 gallons.

Check for N/A values

colSums(is.na(overflow))
##                   Overflow Type           Municipality/Facility 
##                               1                               0 
##                         NPDES #                 Date Discovered 
##                            5403                               0 
##                 Time Discovered                 Duration - Days 
##                            6198                            6933 
##                Duration - Hours              Duration - Minutes 
##                            6870                            6884 
##                        Location                        Zip Code 
##                               6                            1439 
##               Collection-System Quantity in Gallons (Estimated) 
##                            3940                              59 
##      Net in Gallons (Estimated)                           Cause 
##                              59                             216 
##                       Watershed                Receiving waters 
##                           21384                            2356 
##                          County                           Notes 
##                               3                            8665 
##               Penalty Collected                Penalty Comments 
##                           26748                           27418 
##                        Latitude                       Longitude 
##                            9288                            9293

Lots of N/A values.

Check the column names of the dataset

names(overflow)
##  [1] "Overflow Type"                   "Municipality/Facility"          
##  [3] "NPDES #"                         "Date Discovered"                
##  [5] "Time Discovered"                 "Duration - Days"                
##  [7] "Duration - Hours"                "Duration - Minutes"             
##  [9] "Location"                        "Zip Code"                       
## [11] "Collection-System"               "Quantity in Gallons (Estimated)"
## [13] "Net in Gallons (Estimated)"      "Cause"                          
## [15] "Watershed"                       "Receiving waters"               
## [17] "County"                          "Notes"                          
## [19] "Penalty Collected"               "Penalty Comments"               
## [21] "Latitude"                        "Longitude"

I will need to rename some of these in the cleaning section.

Check unique values for categorical variables.

unique(overflow$county)
## Warning: Unknown or uninitialised column: `county`.
## NULL
unique(overflow$overflow_type)
## Warning: Unknown or uninitialised column: `overflow_type`.
## NULL

Check the length of unique categories in the “cause” variable.

length(unique(overflow$cause)) # command documentation in references
## Warning: Unknown or uninitialised column: `cause`.
## [1] 0

There are a lot of unique values for the cause variable. This can stress out the regression model because there is so much data. To combat this, I need to combine categories into “Other” and causes with at least 200 counts of overflow into their own names.

For example, if a certain overflow has only 50 counts of an overflow, then it’ll be put into the “Other” category. If a certain overflow has a 250 counts of an overflow, then it’ll have it’s own name in the new variable. I will do this process in the cleaning section.

B. Cleaning

Remove special characters, replace spaces with underscores, replace “()” with underscores, replace slashes with underscores, and lowercase all variable names.

names(overflow) <- gsub("[ ,\\.,\\-]", "_", tolower(names(overflow)))
names(overflow) <- gsub("\\(", "_", names(overflow))
names(overflow) <- gsub(")$", "_", names(overflow))
names(overflow) <- gsub("#$", "", names(overflow))
names(overflow) <- gsub("_$", "", names(overflow))
names(overflow) <- gsub("/", "_", names(overflow))
names(overflow) <- gsub("___", "_", names(overflow))
names(overflow) <- gsub("__", "_", names(overflow))
head(overflow)
## # A tibble: 6 × 22
##   overflow_type municipality_facility      npdes date_discovered time_discovered
##   <chr>         <chr>                      <chr> <chr>           <chr>          
## 1 SSO           Orbital ATK                MD00… 07/26/2017      <NA>           
## 2 SSO           Constellation Power Sourc… MD00… 04/30/2018      9:20:00 AM     
## 3 SSO           Holcim, U.S., Inc.         MD00… 08/10/2015      8:34:00 AM     
## 4 SSO           Holcim, U.S., Inc.         MD00… 08/10/2015      8:34:00 AM     
## 5 BYPASS        Navy, Department of        MD00… 09/09/2011      11:25:00 AM    
## 6 SSO           Washington County DPW      MD00… 05/27/2016      12:00:00 PM    
## # ℹ 17 more variables: duration_days <dbl>, duration_hours <dbl>,
## #   duration_minutes <dbl>, location <chr>, zip_code <dbl>,
## #   collection_system <chr>, quantity_in_gallons_estimated <dbl>,
## #   net_in_gallons_estimated <dbl>, cause <chr>, watershed <chr>,
## #   receiving_waters <chr>, county <chr>, notes <chr>, penalty_collected <chr>,
## #   penalty_comments <chr>, latitude <dbl>, longitude <dbl>

Create a new dataset called “counts” that has the counts of each cause.

counts <- overflow |>
  group_by(cause) |>
  count() |>
  arrange(desc(n)) 
counts
## # A tibble: 2,766 × 2
## # Groups:   cause [2,766]
##    cause             n
##    <chr>         <int>
##  1 Precipitation  6755
##  2 Rain.          1611
##  3 Grease          880
##  4 Blockage        680
##  5 Rainwater.      654
##  6 Precipiation    633
##  7 Rain            611
##  8 Roots           527
##  9 PRECIPITATION   470
## 10 Rags            431
## # ℹ 2,756 more rows

Join the counts dataset with the overflow dataset; name it overflow2.

overflow2 <- left_join(overflow, counts, by = "cause") # code from data 110

Create a new variable “causes_top” that has causes with 200 or more counts in them. Replace counts less than 200 with “Other.”

Categorize N/A values into “Other.” I do this because some causes are not listed in the dataset, and since “Other” has every other cause besides the top causes, we can safely put N/A values into “Other.”

overflow2 <- overflow2 |>
  mutate(causes_top = ifelse(n >= 200, cause, "Other")) |>
  mutate(causes_top = ifelse(is.na(causes_top), "Other", causes_top))
overflow2$causes_top <- as.factor(overflow2$causes_top)
unique(overflow2$causes_top)
##  [1] Other                                 
##  [2] Precipitation                         
##  [3] Mechanical Failure                    
##  [4] PRECIPITATION                         
##  [5] Rain                                  
##  [6] RAIN                                  
##  [7] Excess Flow with Inflow & Infiltration
##  [8] Rain.                                 
##  [9] Precipiation                          
## [10] Unknown                               
## [11] Blockage                              
## [12] Rags & grease                         
## [13] Grease                                
## [14] Rainfall.                             
## [15] Rainwater.                            
## [16] Excessive flow                        
## [17] Pipe failure                          
## [18] Debris                                
## [19] Roots                                 
## [20] Rags                                  
## [21] Rags, roots.                          
## [22] Rags, trash/debris.                   
## 22 Levels: Blockage Debris ... Unknown

Since they have precipitation and Rain three different times, I’ll combine them.

overflow2 <- overflow2 |>
  mutate(causes_top = case_when(causes_top %in% c("Precipitation", "Precipiation", "PRECIPITATION") ~ "Precipitation", causes_top %in% c("RAIN", "Rain", "Rain.") ~ "Rain", TRUE ~ causes_top))
unique(overflow2$causes_top)
##  [1] "Other"                                 
##  [2] "Precipitation"                         
##  [3] "Mechanical Failure"                    
##  [4] "Rain"                                  
##  [5] "Excess Flow with Inflow & Infiltration"
##  [6] "Unknown"                               
##  [7] "Blockage"                              
##  [8] "Rags & grease"                         
##  [9] "Grease"                                
## [10] "Rainfall."                             
## [11] "Rainwater."                            
## [12] "Excessive flow"                        
## [13] "Pipe failure"                          
## [14] "Debris"                                
## [15] "Roots"                                 
## [16] "Rags"                                  
## [17] "Rags, roots."                          
## [18] "Rags, trash/debris."
unique(overflow2$overflow_type)
## [1] "SSO"                        "BYPASS"                    
## [3] "CSO"                        "Bypass"                    
## [5] NA                           "Partially Treated Effluent"
## [7] "Filter Overflow"

Since they have two bypasses, I’ll combine them.

overflow2 <- overflow2 |>
  mutate(overflow_type = case_when(overflow_type %in% c("BYPASS", "Bypass") ~ "Bypass", TRUE ~ overflow_type))
overflow2$overflow_type <- as.factor(overflow2$overflow_type)
unique(overflow2$overflow_type)
## [1] SSO                        Bypass                    
## [3] CSO                        <NA>                      
## [5] Partially Treated Effluent Filter Overflow           
## Levels: Bypass CSO Filter Overflow Partially Treated Effluent SSO

III. Statistical Analysis - Multiple Linear Model

In this section, I create the multiple linear model and use backwards elimination through ANOVA F-tests to figure out which p-values are large. Then, I go over assumptions. A multiple linear regression model is appropriate to answering which variables are significant at predicting overflow quantity because it gives you p-values for each variable and category.

A. Multiple Linear Models - Backwards Elimination

Full Model:

fit <- lm(quantity_in_gallons_estimated ~ overflow_type + duration_days + duration_minutes + duration_hours + county + causes_top, data = overflow2)

summary(fit)
## 
## Call:
## lm(formula = quantity_in_gallons_estimated ~ overflow_type + 
##     duration_days + duration_minutes + duration_hours + county + 
##     causes_top, data = overflow2)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -22355128   -212483   -151654     -6189  84046249 
## 
## Coefficients:
##                                                    Estimate Std. Error t value
## (Intercept)                                       9.372e+05  1.986e+06   0.472
## overflow_typeCSO                                 -1.058e+06  1.359e+05  -7.786
## overflow_typePartially Treated Effluent          -1.177e+06  1.992e+06  -0.591
## overflow_typeSSO                                 -1.359e+06  1.167e+05 -11.646
## duration_days                                     3.091e+04  2.082e+03  14.845
## duration_minutes                                 -2.255e+00  1.365e+01  -0.165
## duration_hours                                    6.428e+03  8.690e+02   7.396
## countyAllegamy                                   -1.697e+04  2.798e+06  -0.006
## countyAllegany                                    5.900e+05  1.979e+06   0.298
## countyALLEGANY                                    1.485e+05  2.048e+06   0.073
## countyAnne Arundel                                3.974e+05  1.981e+06   0.201
## countyAnneArundel                                 2.633e+05  2.799e+06   0.094
## countyBaltimore City                              4.451e+05  1.980e+06   0.225
## countyBaltimore CIty                              1.446e+05  2.216e+06   0.065
## countyBaltimore County                            4.715e+05  1.981e+06   0.238
## countyBALTIMORE COUNTY                            2.180e+05  1.989e+06   0.110
## countyBatimore City                               2.699e+05  2.799e+06   0.096
## countyCalvert                                     2.908e+05  1.993e+06   0.146
## countyCALVERT                                     2.903e+05  2.214e+06   0.131
## countyCaroline                                    1.089e+05  2.035e+06   0.053
## countyCarroll                                     2.692e+05  1.992e+06   0.135
## countyCecil                                       2.575e+05  1.992e+06   0.129
## countyCecil County                                2.523e+05  2.799e+06   0.090
## countyCharles                                     7.466e+05  1.983e+06   0.377
## countyDistrict of Columbia                        2.266e+05  2.799e+06   0.081
## countyDorchester                                  3.795e+05  1.995e+06   0.190
## countyFrederick                                   3.072e+05  1.983e+06   0.155
## countyGarrett                                     2.830e+05  1.991e+06   0.142
## countyHarford                                     2.809e+05  1.984e+06   0.142
## countyHarford County                              1.851e+05  2.799e+06   0.066
## countyHoward                                      5.071e+05  1.983e+06   0.256
## countyKent                                        3.118e+05  1.993e+06   0.156
## countyLoudoun County, VA                          6.176e+05  2.427e+06   0.254
## countyMontgomery                                  3.089e+05  1.981e+06   0.156
## countyPrince George's                             4.193e+05  1.981e+06   0.212
## countyQueen Anne's                                2.566e+05  1.989e+06   0.129
## countySomerset                                    2.187e+05  1.989e+06   0.110
## countySt. Mary's                                  3.261e+05  1.987e+06   0.164
## countyTalbot                                      2.906e+05  1.990e+06   0.146
## countyVirginia                                    3.606e+05  2.214e+06   0.163
## countyWashington                                  2.033e+05  1.984e+06   0.102
## countyWashington DC                               2.505e+05  2.799e+06   0.089
## countyWicomico                                    3.892e+05  1.988e+06   0.196
## countyWorcester                                   3.257e+05  1.987e+06   0.164
## causes_topDebris                                  2.017e+04  1.485e+05   0.136
## causes_topExcess Flow with Inflow & Infiltration  1.908e+05  1.796e+05   1.062
## causes_topExcessive flow                          3.640e+06  1.725e+05  21.098
## causes_topGrease                                 -2.048e+03  1.114e+05  -0.018
## causes_topMechanical Failure                      3.515e+04  1.734e+05   0.203
## causes_topOther                                   1.396e+05  8.820e+04   1.583
## causes_topPipe failure                           -3.960e+04  1.483e+05  -0.267
## causes_topPrecipitation                          -3.173e+05  9.999e+04  -3.173
## causes_topRags                                   -3.396e+04  1.573e+05  -0.216
## causes_topRags & grease                          -3.510e+04  1.489e+05  -0.236
## causes_topRags, roots.                           -3.226e+04  4.246e+05  -0.076
## causes_topRags, trash/debris.                    -4.637e+04  2.797e+05  -0.166
## causes_topRain                                    1.472e+05  1.032e+05   1.427
## causes_topRainfall.                              -3.066e+05  1.577e+05  -1.945
## causes_topRainwater.                             -5.280e+05  1.265e+05  -4.175
## causes_topRoots                                  -2.099e+04  1.326e+05  -0.158
## causes_topUnknown                                -4.774e+02  1.633e+05  -0.003
##                                                  Pr(>|t|)    
## (Intercept)                                       0.63696    
## overflow_typeCSO                                 7.22e-15 ***
## overflow_typePartially Treated Effluent           0.55479    
## overflow_typeSSO                                  < 2e-16 ***
## duration_days                                     < 2e-16 ***
## duration_minutes                                  0.86880    
## duration_hours                                   1.45e-13 ***
## countyAllegamy                                    0.99516    
## countyAllegany                                    0.76556    
## countyALLEGANY                                    0.94220    
## countyAnne Arundel                                0.84106    
## countyAnneArundel                                 0.92505    
## countyBaltimore City                              0.82219    
## countyBaltimore CIty                              0.94796    
## countyBaltimore County                            0.81185    
## countyBALTIMORE COUNTY                            0.91270    
## countyBatimore City                               0.92317    
## countyCalvert                                     0.88397    
## countyCALVERT                                     0.89567    
## countyCaroline                                    0.95734    
## countyCarroll                                     0.89249    
## countyCecil                                       0.89714    
## countyCecil County                                0.92817    
## countyCharles                                     0.70651    
## countyDistrict of Columbia                        0.93546    
## countyDorchester                                  0.84914    
## countyFrederick                                   0.87691    
## countyGarrett                                     0.88702    
## countyHarford                                     0.88739    
## countyHarford County                              0.94726    
## countyHoward                                      0.79819    
## countyKent                                        0.87568    
## countyLoudoun County, VA                          0.79916    
## countyMontgomery                                  0.87610    
## countyPrince George's                             0.83239    
## countyQueen Anne's                                0.89733    
## countySomerset                                    0.91244    
## countySt. Mary's                                  0.86963    
## countyTalbot                                      0.88391    
## countyVirginia                                    0.87061    
## countyWashington                                  0.91837    
## countyWashington DC                               0.92869    
## countyWicomico                                    0.84482    
## countyWorcester                                   0.86976    
## causes_topDebris                                  0.89197    
## causes_topExcess Flow with Inflow & Infiltration  0.28824    
## causes_topExcessive flow                          < 2e-16 ***
## causes_topGrease                                  0.98534    
## causes_topMechanical Failure                      0.83932    
## causes_topOther                                   0.11350    
## causes_topPipe failure                            0.78940    
## causes_topPrecipitation                           0.00151 ** 
## causes_topRags                                    0.82914    
## causes_topRags & grease                           0.81369    
## causes_topRags, roots.                            0.93944    
## causes_topRags, trash/debris.                     0.86833    
## causes_topRain                                    0.15370    
## causes_topRainfall.                               0.05182 .  
## causes_topRainwater.                             2.99e-05 ***
## causes_topRoots                                   0.87418    
## causes_topUnknown                                 0.99767    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1978000 on 20411 degrees of freedom
##   (7007 observations deleted due to missingness)
## Multiple R-squared:  0.06406,    Adjusted R-squared:  0.06131 
## F-statistic: 23.28 on 60 and 20411 DF,  p-value: < 2.2e-16

Backwards Elimination:

anova(fit)
## Analysis of Variance Table
## 
## Response: quantity_in_gallons_estimated
##                     Df     Sum Sq    Mean Sq  F value    Pr(>F)    
## overflow_type        3 9.4859e+14 3.1620e+14  80.8037 < 2.2e-16 ***
## duration_days        1 9.9049e+14 9.9049e+14 253.1198 < 2.2e-16 ***
## duration_minutes     1 8.5385e+09 8.5385e+09   0.0022    0.9627    
## duration_hours       1 1.4356e+14 1.4356e+14  36.6878 1.411e-09 ***
## county              37 1.3012e+14 3.5169e+12   0.8987    0.6453    
## causes_top          17 3.2539e+15 1.9141e+14  48.9141 < 2.2e-16 ***
## Residuals        20411 7.9871e+16 3.9131e+12                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

duration_minutes has the highest p-value.

fit2 <- lm(quantity_in_gallons_estimated ~ overflow_type + duration_days + duration_hours + county + causes_top, data = overflow2)

summary(fit2)
## 
## Call:
## lm(formula = quantity_in_gallons_estimated ~ overflow_type + 
##     duration_days + duration_hours + county + causes_top, data = overflow2)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -22391013   -214524   -154108     -6309  84048681 
## 
## Coefficients:
##                                                    Estimate Std. Error t value
## (Intercept)                                        936670.2  1985351.0   0.472
## overflow_typeCSO                                 -1055352.7   135671.7  -7.779
## overflow_typePartially Treated Effluent          -1176975.1  1991903.4  -0.591
## overflow_typeSSO                                 -1359492.5   116599.6 -11.659
## duration_days                                       30959.7     2080.7  14.880
## duration_hours                                       6401.2      868.4   7.372
## countyAllegamy                                     -16996.4  2797110.1  -0.006
## countyAllegany                                     590625.5  1978283.8   0.299
## countyALLEGANY                                     147404.9  2048116.5   0.072
## countyAnne Arundel                                 398356.9  1981129.0   0.201
## countyAnneArundel                                  264459.5  2798591.7   0.094
## countyBaltimore City                               445332.5  1980040.0   0.225
## countyBaltimore CIty                               145255.6  2215591.9   0.066
## countyBaltimore County                             473138.6  1980377.2   0.239
## countyBALTIMORE COUNTY                             219002.5  1988561.9   0.110
## countyBatimore City                                270936.7  2798597.6   0.097
## countyCalvert                                      291793.8  1992384.1   0.146
## countyCALVERT                                      291325.0  2213203.4   0.132
## countyCaroline                                     109844.6  2034500.3   0.054
## countyCarroll                                      270440.9  1991293.1   0.136
## countyCecil                                        258497.1  1991424.2   0.130
## countyCecil County                                 253353.5  2798609.3   0.091
## countyCharles                                      747582.3  1982560.6   0.377
## countyDistrict of Columbia                         227952.3  2798561.9   0.081
## countyDorchester                                   380940.3  1994847.2   0.191
## countyFrederick                                    308109.5  1982974.6   0.155
## countyGarrett                                      284028.7  1991170.4   0.143
## countyHarford                                      281739.2  1983582.6   0.142
## countyHarford County                               186236.5  2798575.3   0.067
## countyHoward                                       508185.6  1982913.1   0.256
## countyKent                                         312888.7  1992805.6   0.157
## countyLoudoun County, VA                           619074.4  2427040.3   0.255
## countyMontgomery                                   310005.5  1980858.7   0.157
## countyPrince George's                              420333.7  1980657.9   0.212
## countyQueen Anne's                                 257669.7  1988396.3   0.130
## countySomerset                                     219884.2  1988599.6   0.111
## countySt. Mary's                                   327607.1  1986391.3   0.165
## countyTalbot                                       291864.5  1989718.5   0.147
## countyVirginia                                     361502.9  2213271.6   0.163
## countyWashington                                   204376.7  1983511.6   0.103
## countyWashington DC                                251657.1  2798580.7   0.090
## countyWicomico                                     390286.4  1988133.2   0.196
## countyWorcester                                    325665.5  1986235.5   0.164
## causes_topDebris                                    19730.0   148096.4   0.133
## causes_topExcess Flow with Inflow & Infiltration   171188.1   178001.4   0.962
## causes_topExcessive flow                          3637533.3   172434.2  21.095
## causes_topGrease                                    -2247.3   111339.1  -0.020
## causes_topMechanical Failure                        34884.6   173283.8   0.201
## causes_topOther                                    139259.2    88098.8   1.581
## causes_topPipe failure                             -39872.0   148189.5  -0.269
## causes_topPrecipitation                           -317035.9    99875.4  -3.174
## causes_topRags                                     -33592.9   157270.7  -0.214
## causes_topRags & grease                            -34757.6   148859.8  -0.233
## causes_topRags, roots.                             -31782.0   424532.0  -0.075
## causes_topRags, trash/debris.                      -45867.3   279625.5  -0.164
## causes_topRain                                     145178.9   103063.4   1.409
## causes_topRainfall.                               -307595.7   157553.0  -1.952
## causes_topRainwater.                              -530428.9   126358.2  -4.198
## causes_topRoots                                    -21116.8   132499.2  -0.159
## causes_topUnknown                                    -616.2   163243.2  -0.004
##                                                  Pr(>|t|)    
## (Intercept)                                        0.6371    
## overflow_typeCSO                                 7.67e-15 ***
## overflow_typePartially Treated Effluent            0.5546    
## overflow_typeSSO                                  < 2e-16 ***
## duration_days                                     < 2e-16 ***
## duration_hours                                   1.75e-13 ***
## countyAllegamy                                     0.9952    
## countyAllegany                                     0.7653    
## countyALLEGANY                                     0.9426    
## countyAnne Arundel                                 0.8406    
## countyAnneArundel                                  0.9247    
## countyBaltimore City                               0.8221    
## countyBaltimore CIty                               0.9477    
## countyBaltimore County                             0.8112    
## countyBALTIMORE COUNTY                             0.9123    
## countyBatimore City                                0.9229    
## countyCalvert                                      0.8836    
## countyCALVERT                                      0.8953    
## countyCaroline                                     0.9569    
## countyCarroll                                      0.8920    
## countyCecil                                        0.8967    
## countyCecil County                                 0.9279    
## countyCharles                                      0.7061    
## countyDistrict of Columbia                         0.9351    
## countyDorchester                                   0.8486    
## countyFrederick                                    0.8765    
## countyGarrett                                      0.8866    
## countyHarford                                      0.8871    
## countyHarford County                               0.9469    
## countyHoward                                       0.7977    
## countyKent                                         0.8752    
## countyLoudoun County, VA                           0.7987    
## countyMontgomery                                   0.8756    
## countyPrince George's                              0.8319    
## countyQueen Anne's                                 0.8969    
## countySomerset                                     0.9120    
## countySt. Mary's                                   0.8690    
## countyTalbot                                       0.8834    
## countyVirginia                                     0.8703    
## countyWashington                                   0.9179    
## countyWashington DC                                0.9283    
## countyWicomico                                     0.8444    
## countyWorcester                                    0.8698    
## causes_topDebris                                   0.8940    
## causes_topExcess Flow with Inflow & Infiltration   0.3362    
## causes_topExcessive flow                          < 2e-16 ***
## causes_topGrease                                   0.9839    
## causes_topMechanical Failure                       0.8405    
## causes_topOther                                    0.1140    
## causes_topPipe failure                             0.7879    
## causes_topPrecipitation                            0.0015 ** 
## causes_topRags                                     0.8309    
## causes_topRags & grease                            0.8154    
## causes_topRags, roots.                             0.9403    
## causes_topRags, trash/debris.                      0.8697    
## causes_topRain                                     0.1590    
## causes_topRainfall.                                0.0509 .  
## causes_topRainwater.                             2.71e-05 ***
## causes_topRoots                                    0.8734    
## causes_topUnknown                                  0.9970    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1978000 on 20444 degrees of freedom
##   (6975 observations deleted due to missingness)
## Multiple R-squared:  0.06404,    Adjusted R-squared:  0.06134 
## F-statistic: 23.71 on 59 and 20444 DF,  p-value: < 2.2e-16
anova(fit2)
## Analysis of Variance Table
## 
## Response: quantity_in_gallons_estimated
##                   Df     Sum Sq    Mean Sq  F value    Pr(>F)    
## overflow_type      3 9.5533e+14 3.1844e+14  81.4033 < 2.2e-16 ***
## duration_days      1 9.9473e+14 9.9473e+14 254.2823 < 2.2e-16 ***
## duration_hours     1 1.4261e+14 1.4261e+14  36.4566 1.588e-09 ***
## county            37 1.3003e+14 3.5144e+12   0.8984    0.6459    
## causes_top        17 3.2494e+15 1.9114e+14  48.8616 < 2.2e-16 ***
## Residuals      20444 7.9975e+16 3.9119e+12                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

County has the highest p-value

FINAL MODEL

fit3 <- lm(quantity_in_gallons_estimated ~ overflow_type + duration_days + duration_hours + causes_top, data = overflow2)

summary(fit3)
## 
## Call:
## lm(formula = quantity_in_gallons_estimated ~ overflow_type + 
##     duration_days + duration_hours + causes_top, data = overflow2)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -22494717   -194825   -136884       947  84060912 
## 
## Coefficients:
##                                                    Estimate Std. Error t value
## (Intercept)                                       1297041.4   137577.2   9.428
## overflow_typeCSO                                  -891517.9   111784.6  -7.975
## overflow_typePartially Treated Effluent          -1275494.8  1980512.7  -0.644
## overflow_typeSSO                                 -1313118.8   109163.6 -12.029
## duration_days                                       30950.5     2076.0  14.908
## duration_hours                                       6797.5      845.2   8.042
## causes_topDebris                                     -162.9   145523.8  -0.001
## causes_topExcess Flow with Inflow & Infiltration   239254.9   174966.6   1.367
## causes_topExcessive flow                          3699024.7   170400.1  21.708
## causes_topGrease                                    -5057.4   109001.5  -0.046
## causes_topMechanical Failure                        -6626.6   171871.6  -0.039
## causes_topOther                                    146263.5    86934.4   1.682
## causes_topPipe failure                             -54499.2   146349.9  -0.372
## causes_topPrecipitation                           -250078.2    95694.0  -2.613
## causes_topRags                                       2708.8   153952.7   0.018
## causes_topRags & grease                              2943.8   144836.0   0.020
## causes_topRags, roots.                              -7584.7   420820.0  -0.018
## causes_topRags, trash/debris.                       -8829.5   277296.5  -0.032
## causes_topRain                                     205315.8    99596.2   2.061
## causes_topRainfall.                               -181579.6   147397.3  -1.232
## causes_topRainwater.                              -467962.1   123476.3  -3.790
## causes_topRoots                                    -11403.7   130618.7  -0.087
## causes_topUnknown                                   -8278.1   160532.7  -0.052
##                                                  Pr(>|t|)    
## (Intercept)                                       < 2e-16 ***
## overflow_typeCSO                                  1.6e-15 ***
## overflow_typePartially Treated Effluent          0.519568    
## overflow_typeSSO                                  < 2e-16 ***
## duration_days                                     < 2e-16 ***
## duration_hours                                    9.3e-16 ***
## causes_topDebris                                 0.999107    
## causes_topExcess Flow with Inflow & Infiltration 0.171505    
## causes_topExcessive flow                          < 2e-16 ***
## causes_topGrease                                 0.962994    
## causes_topMechanical Failure                     0.969245    
## causes_topOther                                  0.092495 .  
## causes_topPipe failure                           0.709607    
## causes_topPrecipitation                          0.008974 ** 
## causes_topRags                                   0.985962    
## causes_topRags & grease                          0.983784    
## causes_topRags, roots.                           0.985620    
## causes_topRags, trash/debris.                    0.974599    
## causes_topRain                                   0.039270 *  
## causes_topRainfall.                              0.217998    
## causes_topRainwater.                             0.000151 ***
## causes_topRoots                                  0.930429    
## causes_topUnknown                                0.958875    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1978000 on 20483 degrees of freedom
##   (6973 observations deleted due to missingness)
## Multiple R-squared:  0.06255,    Adjusted R-squared:  0.06154 
## F-statistic: 62.12 on 22 and 20483 DF,  p-value: < 2.2e-16

Intercept: 1297041.4. This means that the quantity of overflow is ~1,297,041 gallons when every other predictor is zero.

Adjusted-R^2: 0.06154 or 6.2%. This means that this model explains about 6.2% of the variability in the estimated quantity of overflows in gallons. This model doesn’t explain the variability well.

P-Value of Entire Model: p-value < 2.2e-16. This means that this model is significant.

anova(fit3)
## Analysis of Variance Table
## 
## Response: quantity_in_gallons_estimated
##                   Df     Sum Sq    Mean Sq F value    Pr(>F)    
## overflow_type      3 9.5525e+14 3.1842e+14  81.422 < 2.2e-16 ***
## duration_days      1 9.9474e+14 9.9474e+14 254.364 < 2.2e-16 ***
## duration_hours     1 1.4265e+14 1.4265e+14  36.477 1.571e-09 ***
## causes_top        17 3.2518e+15 1.9128e+14  48.913 < 2.2e-16 ***
## Residuals      20483 8.0103e+16 3.9107e+12                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Significant Variables: overflow_type with p < 2.2e-16. duration_days with p < 2.2e-16. duration_hours with p = 1.571e-09. causes_top with p < 2.2e-16

Coefficients (holding other variables constant): Since I have multiple variables that have multiple categories and each category has a coefficient, I will only list the coefficients of the significant categories (p < 0.05).

  • overflow_typeCSO = -891518 | This means that the CSO overflow type is 891,518 gallons of overflow less than the bypass type (baseline).

  • overflow_typeSSO = -1313118.8 | This means that the SSO overflow type is 1,313,119 gallons of overflow less than the bypass type.

  • duration_days = 30950.5 | This means that for every one day increase in the duration of the overflow, 30,951 more gallons of overflow occur.

  • duration_hours = 6797.5 | This means that for every hour increase in the duration of the overflow, 6,798 more gallons of overflow occur.

  • causes_topExcessive flow = 3699024.7 | This means that overflows caused by excessive flows generate, on average, 3,699,025 more gallons of overflow than the blockage cause (baseline).

  • causes_topPrecipitation = -250078.2 | This means that overflows caused by precipitation generate, on average, 250,078 less gallons of overflow than the blockage cause.

  • causes_topRain. = 205315.8 | This means that overflows caused by rain generate, on average, 205,315.8 more gallons of overflow than the blockage cause.

  • causes_topRainwater. = -467962.1 | This means that overflows caused by rain water generate, on average, 467,962 less gallons of overflow than the blockage cause.

Note: I didn’t combine “Rain” and “Rainwater” categories because there may be some distinction I do not know of (and it wasn’t documented in the source), so I decided to leave them differently.

This answers our question of “What variables are most significant at predicting overflow quantity, in gallons.”

The overflow type, duration day, duration hours, and causes_top are most significant at predicting overflow quantity.

B. Assumptions

Multiple Linear Regression assumes:

  • Linearity: Relationship between predictors and response is linear.
  • Independence: Observations are independent.
  • Homoscedasticity: Constant variance of residuals.
  • Normality: Residuals are normally distributed.
  • No multicollinearity: Predictors should not be too correlated.

1) Linearity

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
crPlots(fit3)

duration_hours: Linearity for duration_hours with quantity_in_gallons_estimated is good as the dashed line follows the fitted linear effect (straight line).

duration_days: Linearity doesn’t look too good with duration_days and quantity_in_gallons_estimated because the dashed line has a higher slope than the straight line. Linearity most likely fails here.

overflow_type: Treated as a factor, so linearity doesn’t apply but box plots have a lot of outliers.

causes_top: Also treated as a factor and boxplots have a lot of outliers too.

2) Independence

plot(resid(fit3), type="b",
     main="Residuals vs Order", ylab="Residuals"); abline(h=0, lty=2)

Residuals vs Order: We see that a lot of the residuals are condensed around zero with a lot of positive residual bursts, with one major negative burst. Although most residuals hover around zero, there is no constant spread in this plot because of the large spikes. At n~5000, residuals are tightly packed, whereas n~20000, residuals seem to spread out more with a lot of outliers. Furthermore, at n~7000, there seems to be strong activity of high spikes. There doesn’t seem to be a pattern though and most points are around zero, horizontally.

All of this indicates that we likely pass independence because most residuals are centered around zero, but the large, positive bursts are a concern.

3) Core diagnostics (covers: linearity, homoscedasticity, normality, influence)

par(mfrow=c(2,2)); plot(fit3); par(mfrow=c(1,1))
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

Residuals vs Fitted: Clustered towards lower values. Points are not randomly scattered around zero. We most likely fail linearity.

Homoscedasticity (equal variance): The spread of residuals is not constant. Because of the clustering at the start of the residuals vs fitted plot, we likely have heteroscedasticity.

Scale–Location: Again, points are clustered towards lower values. This shows that this model doesn’t have equal or constant variance. This further suggests that we have strong heteroscedasticity.

Q–Q Residuals: Points follow a straight line until the right tail. There is strong, positive deviation at the right tail. This means we have large positive residuals and outliers, specifically towards the right. Since the bending is strong, we likely have mid to strong non-normality.

Residuals vs Leverage: One point is a strong outlier, likely indicating influence on the regression line. Most points have low leverage and low standardized residuals, though some points are very high on the y-axis. This likely indicates that outliers are influencing the model.

4) Multicollinearity:

cor(overflow2[, c("duration_days", "duration_hours")], use = "complete.obs")
##                duration_days duration_hours
## duration_days   1.0000000000   0.0002601796
## duration_hours  0.0002601796   1.0000000000

Low correlation between duration_hours and duration_days.

Since overflow_type and causes_top are categorical variables, I will perform VIF.

vif(fit3) # Code from Professor
##                    GVIF Df GVIF^(1/(2*Df))
## overflow_type  2.498165  3        1.164851
## duration_days  1.004203  1        1.002099
## duration_hours 1.068331  1        1.033601
## causes_top     2.636924 17        1.028929

Interpretation:

  1. overflow_type –> degrees of freedom: 3 –> GVIF: 1.164851 –> Conclusion: No multicollinearity.

  2. causes_top –> degrees of freedom: 17 –> GVIF: 1.028929 –> Conclusion: No multicollinearity.

For duration_days and duration_hours (numerical variables), we already checked their correlation but it’s still worth noting their GVIF because a (G)VIF measures how well that predictor can be explained by every other predictor together.

duration_days GVIF: 1.002099 | No multicollinearity.

duration_hours GVIF: 1.033601 | No multicollinearity.

No multicollinearity assumption: Pass

IV. Visualizations

options(scipen = 999)
ggplot(overflow2, aes(x = duration_days, y = quantity_in_gallons_estimated, color = causes_top)) +
  geom_point() +
  labs(title = "Gallons of Overflow by the Duration of Overflow", color = "Causes", x = "Duration (Days)", y = "Overflow Quantity (in gallons)") 

Interpretation: This scatterplot shows the overflow quantity by the duration it lasted, in days. The points are colored by their causes. Most cases reside between 0 and a 100 days, and below 30,000,000 gallons of overflow. What I find interesting is one overflow case lasting over 600 days, while having very little overflow. This case was caused by a pipe failure.

We know that according to the multiple linear model, 30951 more gallons of sewer overflow happen per day. So, this point (the pipe failure point) is really interesting.

overflow3 <- overflow2 |>
  group_by(causes_top) |>
  count()

ggplot(overflow3, aes(x = causes_top, y = n, fill = n)) +
  geom_col() +
  coord_flip() +
  labs(fill = "Count", title = "Causes of Overflow by Count", x = "Top Causes", y = "Count")

Interpretation: This bar graph shows the top causes with the count of that cause. For example, the “Precipitation” cause has around 8000 counts of overflows. Without the “Other” cause, precipitation has the highest count. This shows how much overflows happen because of precipitation. After precipitation, it is rain. Note: precipitation and rain aren’t the same.

A. Tableau Visualization

Here is the link if you’d like to see the Tableau Map visualization: https://public.tableau.com/views/OverflowQuantitygallonsinMarylandCounties/Sheet1?:language=en-US&publish=yes&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link

Interpretation: This Tableau plot shows the map of counties and the frequency of the overflows. We see that Charles County has the highest overflow quantity, with a total sum of approximately 260 million gallons of sewer overflow.

V. Conclusion and Future Directions

Using the multiple linear regression model and backwards elimination, we conclude with an Adjusted R^2 value of 6.2% and an overall p-value less than 2.2e-16 that the most significant predictors of quantity overflow are the overflow type, the duration (in days) of the overflow, the duration (in hours) of the overflow, and the top causes (causes with over 200 counts).

The multiple linear regression model fails multiple assumptions. This model fails the linearity assumption, homoscedasticity assumption, and the normality assumption. However, it passes the no multicollinearity assumption and likely passes the independence assumption.

Visualizations prove worthy of uncovering important insights. The bar graph shows which causes have the highest count of overflows, giving actionable insight to prevent such causes in the future. The scatterplot shows outliers that may warrant investigation to explain extremes, for example, the overflow that lasted over 600 days yet had very low overflow.

Since a multiple linear regression model with a limited number of variables isn’t the best at predicting overflow quantities, future analysis can develop a regression model that takes in every variable.

If such model is still not good, then PCA might be a better option to uncover this complex dataset.