What variables are best at predicting the quantity of sewer overflows (in gallons). Data on reported sewer overflows is provided by the Maryland Department of the Environment. This dataset shows sewer overflows all throughout Maryland from January 1, 2005 through January 31, 2023. The dataset has 27,479 observations and 22 variables. Each observation/case shows an overflow that occurred. Of the 22 variables, I find Quantity In Gallons (Estimated), Cause, duration variables, County, and Overflow Type variables most important.
Quantity In Gallons (Estimated) is a numerical variable that shows how much water, in gallons, the sewer overflow had. This is an estimate.
Cause is a categorical variable that shows the cause of the overflow.
Duration variables, such as Duration - Days and Duration - Hours, are numerical variables that show how long the overflow lasted.
County is a categorical variable that shows which county the overflow happened in.
Overflow Type is a categorical variable that shows the overflow was either a sanitary sewer overflow (SSO) or a combined sewer overflow (CSO) or a bypass.
I chose this project because it can answer many questions if all assumptions pass: Does the county of where the overflow happens important to the quantity–in gallons–of the overflow? Is duration of the overflow important to how much overflow actually happens (quantity)? Is the type of overflow important to the quantity of overflow? Is the cause of the overflow important at predicting how much overflow will actually happen?
Furthermore, if these variables explain the overflow quantity well, then initiatives can be taken that give strategic attention to such variables and reduce the quantity of overflows.
It’s important to consider that there may not even be any significant variables. If all assumptions pass, this signifies that none of these variables are meaningful at predicting the quantity of the overflow.
In this project, I will see which variables are most significant at predicting the quantity of sewer overflows. I will first clean the dataset and then do EDA. Then, I will use a multiple linear model, use backwards elimination based on p-values, and then display the results. Then, I will create visualizations and end it off with a conclusion, summarizing the whole project.
Variables I’ll be using for the multiple linear regression:
In the EDA section, I go over the structure and summary of the dataset to learn more about it. Then, I check for N/A values. Since R automatically gets rid of N/A values in the model, I really don’t really need to deal with them with that part. Besides that, I discover that the “causes” variable has over 2,000 categories; this stresses out my multiple linear model because it’ll have to generate statistics (such as p-values, coefficients, etc.) for each category. So, I select categories that have at least 200 counts of overflows in them.
In the cleaning section, I rename variable names and do the process of selecting categories in the “causes” variable that have at least 200 counts in them. Furthermore, I found that some categories in the “causes” and “overflow_type” variables are redundant, so I had to combine them.
Check the structure of the dataset.
## spc_tbl_ [27,479 × 22] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Overflow Type : chr [1:27479] "SSO" "SSO" "SSO" "SSO" ...
## $ Municipality/Facility : chr [1:27479] "Orbital ATK" "Constellation Power Source Generation, Inc." "Holcim, U.S., Inc." "Holcim, U.S., Inc." ...
## $ NPDES # : chr [1:27479] "MD0000078" "MD0001503" "MD0002151" "MD0002151" ...
## $ Date Discovered : chr [1:27479] "07/26/2017" "04/30/2018" "08/10/2015" "08/10/2015" ...
## $ Time Discovered : chr [1:27479] NA "9:20:00 AM" "8:34:00 AM" "8:34:00 AM" ...
## $ Duration - Days : num [1:27479] NA NA 0 0 NA 0 0 0 1 0 ...
## $ Duration - Hours : num [1:27479] NA NA 0 0 NA 0 1 0 0 0 ...
## $ Duration - Minutes : num [1:27479] NA NA 36 36 NA 15 0 43 0 6 ...
## $ Location : chr [1:27479] "55 Thiokol Rd." "3000 Brandon Shores Rd." "1260 Security Road, Hagerstown, MD" "1260 Security Road, Hagerstown, MD" ...
## $ Zip Code : num [1:27479] 21921 21226 21742 21742 20640 ...
## $ Collection-System : chr [1:27479] "Alliant Techsystems Operations LLC - Elkton Facility" "H.A. Wagner Generating Station" "Holcim WWTP" "Holcim WWTP" ...
## $ Quantity in Gallons (Estimated): num [1:27479] NA 1500 300 300 0 ...
## $ Net in Gallons (Estimated) : num [1:27479] NA 1500 300 300 0 ...
## $ Cause : chr [1:27479] "Broken pipe." "Sump discharge line failure." "Stuck toilet" "Stuck toilet" ...
## $ Watershed : chr [1:27479] NA "Patapsco River" NA NA ...
## $ Receiving waters : chr [1:27479] NA "storm drain" "Unknown" "Unknown" ...
## $ County : chr [1:27479] "Cecil" "Anne Arundel" "Washington" "Washington" ...
## $ Notes : chr [1:27479] NA NA "Blank fields = No Information Submitted" "Blank fields = No Information Submitted" ...
## $ Penalty Collected : chr [1:27479] NA NA NA NA ...
## $ Penalty Comments : chr [1:27479] NA NA NA NA ...
## $ Latitude : num [1:27479] 39.6 39.2 NA NA NA ...
## $ Longitude : num [1:27479] -75.9 -76.5 NA NA NA ...
## - attr(*, "spec")=
## .. cols(
## .. `Overflow Type` = col_character(),
## .. `Municipality/Facility` = col_character(),
## .. `NPDES #` = col_character(),
## .. `Date Discovered` = col_character(),
## .. `Time Discovered` = col_character(),
## .. `Duration - Days` = col_double(),
## .. `Duration - Hours` = col_double(),
## .. `Duration - Minutes` = col_number(),
## .. Location = col_character(),
## .. `Zip Code` = col_double(),
## .. `Collection-System` = col_character(),
## .. `Quantity in Gallons (Estimated)` = col_number(),
## .. `Net in Gallons (Estimated)` = col_number(),
## .. Cause = col_character(),
## .. Watershed = col_character(),
## .. `Receiving waters` = col_character(),
## .. County = col_character(),
## .. Notes = col_character(),
## .. `Penalty Collected` = col_character(),
## .. `Penalty Comments` = col_character(),
## .. Latitude = col_double(),
## .. Longitude = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
Create a summary of the dataset
## Overflow Type Municipality/Facility NPDES # Date Discovered
## Length:27479 Length:27479 Length:27479 Length:27479
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Time Discovered Duration - Days Duration - Hours Duration - Minutes
## Length:27479 Min. : 0.000 Min. : 0.00 Min. : 0.00
## Class :character 1st Qu.: 0.000 1st Qu.: 1.00 1st Qu.: 0.00
## Mode :character Median : 0.000 Median : 2.00 Median : 10.00
## Mean : 0.401 Mean : 7.04 Mean : 39.59
## 3rd Qu.: 0.000 3rd Qu.: 7.00 3rd Qu.: 30.00
## Max. :730.000 Max. :576.00 Max. :91730.00
## NA's :6933 NA's :6870 NA's :6884
## Location Zip Code Collection-System
## Length:27479 Min. : 2122 Length:27479
## Class :character 1st Qu.: 21210 Class :character
## Mode :character Median : 21403 Mode :character
## Mean : 21348
## 3rd Qu.: 21532
## Max. :215832
## NA's :1439
## Quantity in Gallons (Estimated) Net in Gallons (Estimated) Cause
## Min. : 0 Min. : 0 Length:27479
## 1st Qu.: 200 1st Qu.: 200 Class :character
## Median : 3400 Median : 3350 Mode :character
## Mean : 255930 Mean : 254732
## 3rd Qu.: 26746 3rd Qu.: 26702
## Max. :106102592 Max. :106102592
## NA's :59 NA's :59
## Watershed Receiving waters County Notes
## Length:27479 Length:27479 Length:27479 Length:27479
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Penalty Collected Penalty Comments Latitude Longitude
## Length:27479 Length:27479 Min. :30.01 Min. :-83.601
## Class :character Class :character 1st Qu.:38.99 1st Qu.:-77.013
## Mode :character Mode :character Median :39.35 Median :-76.065
## Mean :39.05 Mean : -1.484
## 3rd Qu.:39.39 3rd Qu.: 78.491
## Max. :42.73 Max. : 81.528
## NA's :9288 NA's :9293
Average quantity of sewage overflow: 255930 gallons.
Check for N/A values
## Overflow Type Municipality/Facility
## 1 0
## NPDES # Date Discovered
## 5403 0
## Time Discovered Duration - Days
## 6198 6933
## Duration - Hours Duration - Minutes
## 6870 6884
## Location Zip Code
## 6 1439
## Collection-System Quantity in Gallons (Estimated)
## 3940 59
## Net in Gallons (Estimated) Cause
## 59 216
## Watershed Receiving waters
## 21384 2356
## County Notes
## 3 8665
## Penalty Collected Penalty Comments
## 26748 27418
## Latitude Longitude
## 9288 9293
Lots of N/A values.
Check the column names of the dataset
## [1] "Overflow Type" "Municipality/Facility"
## [3] "NPDES #" "Date Discovered"
## [5] "Time Discovered" "Duration - Days"
## [7] "Duration - Hours" "Duration - Minutes"
## [9] "Location" "Zip Code"
## [11] "Collection-System" "Quantity in Gallons (Estimated)"
## [13] "Net in Gallons (Estimated)" "Cause"
## [15] "Watershed" "Receiving waters"
## [17] "County" "Notes"
## [19] "Penalty Collected" "Penalty Comments"
## [21] "Latitude" "Longitude"
I will need to rename some of these in the cleaning section.
Check unique values for categorical variables.
## Warning: Unknown or uninitialised column: `county`.
## NULL
## Warning: Unknown or uninitialised column: `overflow_type`.
## NULL
Check the length of unique categories in the “cause” variable.
## Warning: Unknown or uninitialised column: `cause`.
## [1] 0
There are a lot of unique values for the cause variable. This can stress out the regression model because there is so much data. To combat this, I need to combine categories into “Other” and causes with at least 200 counts of overflow into their own names.
For example, if a certain overflow has only 50 counts of an overflow, then it’ll be put into the “Other” category. If a certain overflow has a 250 counts of an overflow, then it’ll have it’s own name in the new variable. I will do this process in the cleaning section.
Remove special characters, replace spaces with underscores, replace “()” with underscores, replace slashes with underscores, and lowercase all variable names.
names(overflow) <- gsub("[ ,\\.,\\-]", "_", tolower(names(overflow)))
names(overflow) <- gsub("\\(", "_", names(overflow))
names(overflow) <- gsub(")$", "_", names(overflow))
names(overflow) <- gsub("#$", "", names(overflow))
names(overflow) <- gsub("_$", "", names(overflow))
names(overflow) <- gsub("/", "_", names(overflow))
names(overflow) <- gsub("___", "_", names(overflow))
names(overflow) <- gsub("__", "_", names(overflow))
head(overflow)## # A tibble: 6 × 22
## overflow_type municipality_facility npdes date_discovered time_discovered
## <chr> <chr> <chr> <chr> <chr>
## 1 SSO Orbital ATK MD00… 07/26/2017 <NA>
## 2 SSO Constellation Power Sourc… MD00… 04/30/2018 9:20:00 AM
## 3 SSO Holcim, U.S., Inc. MD00… 08/10/2015 8:34:00 AM
## 4 SSO Holcim, U.S., Inc. MD00… 08/10/2015 8:34:00 AM
## 5 BYPASS Navy, Department of MD00… 09/09/2011 11:25:00 AM
## 6 SSO Washington County DPW MD00… 05/27/2016 12:00:00 PM
## # ℹ 17 more variables: duration_days <dbl>, duration_hours <dbl>,
## # duration_minutes <dbl>, location <chr>, zip_code <dbl>,
## # collection_system <chr>, quantity_in_gallons_estimated <dbl>,
## # net_in_gallons_estimated <dbl>, cause <chr>, watershed <chr>,
## # receiving_waters <chr>, county <chr>, notes <chr>, penalty_collected <chr>,
## # penalty_comments <chr>, latitude <dbl>, longitude <dbl>
Create a new dataset called “counts” that has the counts of each cause.
## # A tibble: 2,766 × 2
## # Groups: cause [2,766]
## cause n
## <chr> <int>
## 1 Precipitation 6755
## 2 Rain. 1611
## 3 Grease 880
## 4 Blockage 680
## 5 Rainwater. 654
## 6 Precipiation 633
## 7 Rain 611
## 8 Roots 527
## 9 PRECIPITATION 470
## 10 Rags 431
## # ℹ 2,756 more rows
Join the counts dataset with the overflow dataset; name it overflow2.
Create a new variable “causes_top” that has causes with 200 or more counts in them. Replace counts less than 200 with “Other.”
Categorize N/A values into “Other.” I do this because some causes are not listed in the dataset, and since “Other” has every other cause besides the top causes, we can safely put N/A values into “Other.”
overflow2 <- overflow2 |>
mutate(causes_top = ifelse(n >= 200, cause, "Other")) |>
mutate(causes_top = ifelse(is.na(causes_top), "Other", causes_top))
overflow2$causes_top <- as.factor(overflow2$causes_top)## [1] Other
## [2] Precipitation
## [3] Mechanical Failure
## [4] PRECIPITATION
## [5] Rain
## [6] RAIN
## [7] Excess Flow with Inflow & Infiltration
## [8] Rain.
## [9] Precipiation
## [10] Unknown
## [11] Blockage
## [12] Rags & grease
## [13] Grease
## [14] Rainfall.
## [15] Rainwater.
## [16] Excessive flow
## [17] Pipe failure
## [18] Debris
## [19] Roots
## [20] Rags
## [21] Rags, roots.
## [22] Rags, trash/debris.
## 22 Levels: Blockage Debris ... Unknown
Since they have precipitation and Rain three different times, I’ll combine them.
overflow2 <- overflow2 |>
mutate(causes_top = case_when(causes_top %in% c("Precipitation", "Precipiation", "PRECIPITATION") ~ "Precipitation", causes_top %in% c("RAIN", "Rain", "Rain.") ~ "Rain", TRUE ~ causes_top))
unique(overflow2$causes_top)## [1] "Other"
## [2] "Precipitation"
## [3] "Mechanical Failure"
## [4] "Rain"
## [5] "Excess Flow with Inflow & Infiltration"
## [6] "Unknown"
## [7] "Blockage"
## [8] "Rags & grease"
## [9] "Grease"
## [10] "Rainfall."
## [11] "Rainwater."
## [12] "Excessive flow"
## [13] "Pipe failure"
## [14] "Debris"
## [15] "Roots"
## [16] "Rags"
## [17] "Rags, roots."
## [18] "Rags, trash/debris."
## [1] "SSO" "BYPASS"
## [3] "CSO" "Bypass"
## [5] NA "Partially Treated Effluent"
## [7] "Filter Overflow"
Since they have two bypasses, I’ll combine them.
overflow2 <- overflow2 |>
mutate(overflow_type = case_when(overflow_type %in% c("BYPASS", "Bypass") ~ "Bypass", TRUE ~ overflow_type))
overflow2$overflow_type <- as.factor(overflow2$overflow_type)
unique(overflow2$overflow_type)## [1] SSO Bypass
## [3] CSO <NA>
## [5] Partially Treated Effluent Filter Overflow
## Levels: Bypass CSO Filter Overflow Partially Treated Effluent SSO
In this section, I create the multiple linear model and use backwards elimination through ANOVA F-tests to figure out which p-values are large. Then, I go over assumptions. A multiple linear regression model is appropriate to answering which variables are significant at predicting overflow quantity because it gives you p-values for each variable and category.
Full Model:
fit <- lm(quantity_in_gallons_estimated ~ overflow_type + duration_days + duration_minutes + duration_hours + county + causes_top, data = overflow2)
summary(fit)##
## Call:
## lm(formula = quantity_in_gallons_estimated ~ overflow_type +
## duration_days + duration_minutes + duration_hours + county +
## causes_top, data = overflow2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22355128 -212483 -151654 -6189 84046249
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 9.372e+05 1.986e+06 0.472
## overflow_typeCSO -1.058e+06 1.359e+05 -7.786
## overflow_typePartially Treated Effluent -1.177e+06 1.992e+06 -0.591
## overflow_typeSSO -1.359e+06 1.167e+05 -11.646
## duration_days 3.091e+04 2.082e+03 14.845
## duration_minutes -2.255e+00 1.365e+01 -0.165
## duration_hours 6.428e+03 8.690e+02 7.396
## countyAllegamy -1.697e+04 2.798e+06 -0.006
## countyAllegany 5.900e+05 1.979e+06 0.298
## countyALLEGANY 1.485e+05 2.048e+06 0.073
## countyAnne Arundel 3.974e+05 1.981e+06 0.201
## countyAnneArundel 2.633e+05 2.799e+06 0.094
## countyBaltimore City 4.451e+05 1.980e+06 0.225
## countyBaltimore CIty 1.446e+05 2.216e+06 0.065
## countyBaltimore County 4.715e+05 1.981e+06 0.238
## countyBALTIMORE COUNTY 2.180e+05 1.989e+06 0.110
## countyBatimore City 2.699e+05 2.799e+06 0.096
## countyCalvert 2.908e+05 1.993e+06 0.146
## countyCALVERT 2.903e+05 2.214e+06 0.131
## countyCaroline 1.089e+05 2.035e+06 0.053
## countyCarroll 2.692e+05 1.992e+06 0.135
## countyCecil 2.575e+05 1.992e+06 0.129
## countyCecil County 2.523e+05 2.799e+06 0.090
## countyCharles 7.466e+05 1.983e+06 0.377
## countyDistrict of Columbia 2.266e+05 2.799e+06 0.081
## countyDorchester 3.795e+05 1.995e+06 0.190
## countyFrederick 3.072e+05 1.983e+06 0.155
## countyGarrett 2.830e+05 1.991e+06 0.142
## countyHarford 2.809e+05 1.984e+06 0.142
## countyHarford County 1.851e+05 2.799e+06 0.066
## countyHoward 5.071e+05 1.983e+06 0.256
## countyKent 3.118e+05 1.993e+06 0.156
## countyLoudoun County, VA 6.176e+05 2.427e+06 0.254
## countyMontgomery 3.089e+05 1.981e+06 0.156
## countyPrince George's 4.193e+05 1.981e+06 0.212
## countyQueen Anne's 2.566e+05 1.989e+06 0.129
## countySomerset 2.187e+05 1.989e+06 0.110
## countySt. Mary's 3.261e+05 1.987e+06 0.164
## countyTalbot 2.906e+05 1.990e+06 0.146
## countyVirginia 3.606e+05 2.214e+06 0.163
## countyWashington 2.033e+05 1.984e+06 0.102
## countyWashington DC 2.505e+05 2.799e+06 0.089
## countyWicomico 3.892e+05 1.988e+06 0.196
## countyWorcester 3.257e+05 1.987e+06 0.164
## causes_topDebris 2.017e+04 1.485e+05 0.136
## causes_topExcess Flow with Inflow & Infiltration 1.908e+05 1.796e+05 1.062
## causes_topExcessive flow 3.640e+06 1.725e+05 21.098
## causes_topGrease -2.048e+03 1.114e+05 -0.018
## causes_topMechanical Failure 3.515e+04 1.734e+05 0.203
## causes_topOther 1.396e+05 8.820e+04 1.583
## causes_topPipe failure -3.960e+04 1.483e+05 -0.267
## causes_topPrecipitation -3.173e+05 9.999e+04 -3.173
## causes_topRags -3.396e+04 1.573e+05 -0.216
## causes_topRags & grease -3.510e+04 1.489e+05 -0.236
## causes_topRags, roots. -3.226e+04 4.246e+05 -0.076
## causes_topRags, trash/debris. -4.637e+04 2.797e+05 -0.166
## causes_topRain 1.472e+05 1.032e+05 1.427
## causes_topRainfall. -3.066e+05 1.577e+05 -1.945
## causes_topRainwater. -5.280e+05 1.265e+05 -4.175
## causes_topRoots -2.099e+04 1.326e+05 -0.158
## causes_topUnknown -4.774e+02 1.633e+05 -0.003
## Pr(>|t|)
## (Intercept) 0.63696
## overflow_typeCSO 7.22e-15 ***
## overflow_typePartially Treated Effluent 0.55479
## overflow_typeSSO < 2e-16 ***
## duration_days < 2e-16 ***
## duration_minutes 0.86880
## duration_hours 1.45e-13 ***
## countyAllegamy 0.99516
## countyAllegany 0.76556
## countyALLEGANY 0.94220
## countyAnne Arundel 0.84106
## countyAnneArundel 0.92505
## countyBaltimore City 0.82219
## countyBaltimore CIty 0.94796
## countyBaltimore County 0.81185
## countyBALTIMORE COUNTY 0.91270
## countyBatimore City 0.92317
## countyCalvert 0.88397
## countyCALVERT 0.89567
## countyCaroline 0.95734
## countyCarroll 0.89249
## countyCecil 0.89714
## countyCecil County 0.92817
## countyCharles 0.70651
## countyDistrict of Columbia 0.93546
## countyDorchester 0.84914
## countyFrederick 0.87691
## countyGarrett 0.88702
## countyHarford 0.88739
## countyHarford County 0.94726
## countyHoward 0.79819
## countyKent 0.87568
## countyLoudoun County, VA 0.79916
## countyMontgomery 0.87610
## countyPrince George's 0.83239
## countyQueen Anne's 0.89733
## countySomerset 0.91244
## countySt. Mary's 0.86963
## countyTalbot 0.88391
## countyVirginia 0.87061
## countyWashington 0.91837
## countyWashington DC 0.92869
## countyWicomico 0.84482
## countyWorcester 0.86976
## causes_topDebris 0.89197
## causes_topExcess Flow with Inflow & Infiltration 0.28824
## causes_topExcessive flow < 2e-16 ***
## causes_topGrease 0.98534
## causes_topMechanical Failure 0.83932
## causes_topOther 0.11350
## causes_topPipe failure 0.78940
## causes_topPrecipitation 0.00151 **
## causes_topRags 0.82914
## causes_topRags & grease 0.81369
## causes_topRags, roots. 0.93944
## causes_topRags, trash/debris. 0.86833
## causes_topRain 0.15370
## causes_topRainfall. 0.05182 .
## causes_topRainwater. 2.99e-05 ***
## causes_topRoots 0.87418
## causes_topUnknown 0.99767
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1978000 on 20411 degrees of freedom
## (7007 observations deleted due to missingness)
## Multiple R-squared: 0.06406, Adjusted R-squared: 0.06131
## F-statistic: 23.28 on 60 and 20411 DF, p-value: < 2.2e-16
Backwards Elimination:
## Analysis of Variance Table
##
## Response: quantity_in_gallons_estimated
## Df Sum Sq Mean Sq F value Pr(>F)
## overflow_type 3 9.4859e+14 3.1620e+14 80.8037 < 2.2e-16 ***
## duration_days 1 9.9049e+14 9.9049e+14 253.1198 < 2.2e-16 ***
## duration_minutes 1 8.5385e+09 8.5385e+09 0.0022 0.9627
## duration_hours 1 1.4356e+14 1.4356e+14 36.6878 1.411e-09 ***
## county 37 1.3012e+14 3.5169e+12 0.8987 0.6453
## causes_top 17 3.2539e+15 1.9141e+14 48.9141 < 2.2e-16 ***
## Residuals 20411 7.9871e+16 3.9131e+12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
duration_minutes has the highest p-value.
fit2 <- lm(quantity_in_gallons_estimated ~ overflow_type + duration_days + duration_hours + county + causes_top, data = overflow2)
summary(fit2)##
## Call:
## lm(formula = quantity_in_gallons_estimated ~ overflow_type +
## duration_days + duration_hours + county + causes_top, data = overflow2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22391013 -214524 -154108 -6309 84048681
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 936670.2 1985351.0 0.472
## overflow_typeCSO -1055352.7 135671.7 -7.779
## overflow_typePartially Treated Effluent -1176975.1 1991903.4 -0.591
## overflow_typeSSO -1359492.5 116599.6 -11.659
## duration_days 30959.7 2080.7 14.880
## duration_hours 6401.2 868.4 7.372
## countyAllegamy -16996.4 2797110.1 -0.006
## countyAllegany 590625.5 1978283.8 0.299
## countyALLEGANY 147404.9 2048116.5 0.072
## countyAnne Arundel 398356.9 1981129.0 0.201
## countyAnneArundel 264459.5 2798591.7 0.094
## countyBaltimore City 445332.5 1980040.0 0.225
## countyBaltimore CIty 145255.6 2215591.9 0.066
## countyBaltimore County 473138.6 1980377.2 0.239
## countyBALTIMORE COUNTY 219002.5 1988561.9 0.110
## countyBatimore City 270936.7 2798597.6 0.097
## countyCalvert 291793.8 1992384.1 0.146
## countyCALVERT 291325.0 2213203.4 0.132
## countyCaroline 109844.6 2034500.3 0.054
## countyCarroll 270440.9 1991293.1 0.136
## countyCecil 258497.1 1991424.2 0.130
## countyCecil County 253353.5 2798609.3 0.091
## countyCharles 747582.3 1982560.6 0.377
## countyDistrict of Columbia 227952.3 2798561.9 0.081
## countyDorchester 380940.3 1994847.2 0.191
## countyFrederick 308109.5 1982974.6 0.155
## countyGarrett 284028.7 1991170.4 0.143
## countyHarford 281739.2 1983582.6 0.142
## countyHarford County 186236.5 2798575.3 0.067
## countyHoward 508185.6 1982913.1 0.256
## countyKent 312888.7 1992805.6 0.157
## countyLoudoun County, VA 619074.4 2427040.3 0.255
## countyMontgomery 310005.5 1980858.7 0.157
## countyPrince George's 420333.7 1980657.9 0.212
## countyQueen Anne's 257669.7 1988396.3 0.130
## countySomerset 219884.2 1988599.6 0.111
## countySt. Mary's 327607.1 1986391.3 0.165
## countyTalbot 291864.5 1989718.5 0.147
## countyVirginia 361502.9 2213271.6 0.163
## countyWashington 204376.7 1983511.6 0.103
## countyWashington DC 251657.1 2798580.7 0.090
## countyWicomico 390286.4 1988133.2 0.196
## countyWorcester 325665.5 1986235.5 0.164
## causes_topDebris 19730.0 148096.4 0.133
## causes_topExcess Flow with Inflow & Infiltration 171188.1 178001.4 0.962
## causes_topExcessive flow 3637533.3 172434.2 21.095
## causes_topGrease -2247.3 111339.1 -0.020
## causes_topMechanical Failure 34884.6 173283.8 0.201
## causes_topOther 139259.2 88098.8 1.581
## causes_topPipe failure -39872.0 148189.5 -0.269
## causes_topPrecipitation -317035.9 99875.4 -3.174
## causes_topRags -33592.9 157270.7 -0.214
## causes_topRags & grease -34757.6 148859.8 -0.233
## causes_topRags, roots. -31782.0 424532.0 -0.075
## causes_topRags, trash/debris. -45867.3 279625.5 -0.164
## causes_topRain 145178.9 103063.4 1.409
## causes_topRainfall. -307595.7 157553.0 -1.952
## causes_topRainwater. -530428.9 126358.2 -4.198
## causes_topRoots -21116.8 132499.2 -0.159
## causes_topUnknown -616.2 163243.2 -0.004
## Pr(>|t|)
## (Intercept) 0.6371
## overflow_typeCSO 7.67e-15 ***
## overflow_typePartially Treated Effluent 0.5546
## overflow_typeSSO < 2e-16 ***
## duration_days < 2e-16 ***
## duration_hours 1.75e-13 ***
## countyAllegamy 0.9952
## countyAllegany 0.7653
## countyALLEGANY 0.9426
## countyAnne Arundel 0.8406
## countyAnneArundel 0.9247
## countyBaltimore City 0.8221
## countyBaltimore CIty 0.9477
## countyBaltimore County 0.8112
## countyBALTIMORE COUNTY 0.9123
## countyBatimore City 0.9229
## countyCalvert 0.8836
## countyCALVERT 0.8953
## countyCaroline 0.9569
## countyCarroll 0.8920
## countyCecil 0.8967
## countyCecil County 0.9279
## countyCharles 0.7061
## countyDistrict of Columbia 0.9351
## countyDorchester 0.8486
## countyFrederick 0.8765
## countyGarrett 0.8866
## countyHarford 0.8871
## countyHarford County 0.9469
## countyHoward 0.7977
## countyKent 0.8752
## countyLoudoun County, VA 0.7987
## countyMontgomery 0.8756
## countyPrince George's 0.8319
## countyQueen Anne's 0.8969
## countySomerset 0.9120
## countySt. Mary's 0.8690
## countyTalbot 0.8834
## countyVirginia 0.8703
## countyWashington 0.9179
## countyWashington DC 0.9283
## countyWicomico 0.8444
## countyWorcester 0.8698
## causes_topDebris 0.8940
## causes_topExcess Flow with Inflow & Infiltration 0.3362
## causes_topExcessive flow < 2e-16 ***
## causes_topGrease 0.9839
## causes_topMechanical Failure 0.8405
## causes_topOther 0.1140
## causes_topPipe failure 0.7879
## causes_topPrecipitation 0.0015 **
## causes_topRags 0.8309
## causes_topRags & grease 0.8154
## causes_topRags, roots. 0.9403
## causes_topRags, trash/debris. 0.8697
## causes_topRain 0.1590
## causes_topRainfall. 0.0509 .
## causes_topRainwater. 2.71e-05 ***
## causes_topRoots 0.8734
## causes_topUnknown 0.9970
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1978000 on 20444 degrees of freedom
## (6975 observations deleted due to missingness)
## Multiple R-squared: 0.06404, Adjusted R-squared: 0.06134
## F-statistic: 23.71 on 59 and 20444 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Response: quantity_in_gallons_estimated
## Df Sum Sq Mean Sq F value Pr(>F)
## overflow_type 3 9.5533e+14 3.1844e+14 81.4033 < 2.2e-16 ***
## duration_days 1 9.9473e+14 9.9473e+14 254.2823 < 2.2e-16 ***
## duration_hours 1 1.4261e+14 1.4261e+14 36.4566 1.588e-09 ***
## county 37 1.3003e+14 3.5144e+12 0.8984 0.6459
## causes_top 17 3.2494e+15 1.9114e+14 48.8616 < 2.2e-16 ***
## Residuals 20444 7.9975e+16 3.9119e+12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
County has the highest p-value
FINAL MODEL
fit3 <- lm(quantity_in_gallons_estimated ~ overflow_type + duration_days + duration_hours + causes_top, data = overflow2)
summary(fit3)##
## Call:
## lm(formula = quantity_in_gallons_estimated ~ overflow_type +
## duration_days + duration_hours + causes_top, data = overflow2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22494717 -194825 -136884 947 84060912
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 1297041.4 137577.2 9.428
## overflow_typeCSO -891517.9 111784.6 -7.975
## overflow_typePartially Treated Effluent -1275494.8 1980512.7 -0.644
## overflow_typeSSO -1313118.8 109163.6 -12.029
## duration_days 30950.5 2076.0 14.908
## duration_hours 6797.5 845.2 8.042
## causes_topDebris -162.9 145523.8 -0.001
## causes_topExcess Flow with Inflow & Infiltration 239254.9 174966.6 1.367
## causes_topExcessive flow 3699024.7 170400.1 21.708
## causes_topGrease -5057.4 109001.5 -0.046
## causes_topMechanical Failure -6626.6 171871.6 -0.039
## causes_topOther 146263.5 86934.4 1.682
## causes_topPipe failure -54499.2 146349.9 -0.372
## causes_topPrecipitation -250078.2 95694.0 -2.613
## causes_topRags 2708.8 153952.7 0.018
## causes_topRags & grease 2943.8 144836.0 0.020
## causes_topRags, roots. -7584.7 420820.0 -0.018
## causes_topRags, trash/debris. -8829.5 277296.5 -0.032
## causes_topRain 205315.8 99596.2 2.061
## causes_topRainfall. -181579.6 147397.3 -1.232
## causes_topRainwater. -467962.1 123476.3 -3.790
## causes_topRoots -11403.7 130618.7 -0.087
## causes_topUnknown -8278.1 160532.7 -0.052
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## overflow_typeCSO 1.6e-15 ***
## overflow_typePartially Treated Effluent 0.519568
## overflow_typeSSO < 2e-16 ***
## duration_days < 2e-16 ***
## duration_hours 9.3e-16 ***
## causes_topDebris 0.999107
## causes_topExcess Flow with Inflow & Infiltration 0.171505
## causes_topExcessive flow < 2e-16 ***
## causes_topGrease 0.962994
## causes_topMechanical Failure 0.969245
## causes_topOther 0.092495 .
## causes_topPipe failure 0.709607
## causes_topPrecipitation 0.008974 **
## causes_topRags 0.985962
## causes_topRags & grease 0.983784
## causes_topRags, roots. 0.985620
## causes_topRags, trash/debris. 0.974599
## causes_topRain 0.039270 *
## causes_topRainfall. 0.217998
## causes_topRainwater. 0.000151 ***
## causes_topRoots 0.930429
## causes_topUnknown 0.958875
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1978000 on 20483 degrees of freedom
## (6973 observations deleted due to missingness)
## Multiple R-squared: 0.06255, Adjusted R-squared: 0.06154
## F-statistic: 62.12 on 22 and 20483 DF, p-value: < 2.2e-16
Intercept: 1297041.4. This means that the quantity of overflow is ~1,297,041 gallons when every other predictor is zero.
Adjusted-R^2: 0.06154 or 6.2%. This means that this model explains about 6.2% of the variability in the estimated quantity of overflows in gallons. This model doesn’t explain the variability well.
P-Value of Entire Model: p-value < 2.2e-16. This means that this model is significant.
## Analysis of Variance Table
##
## Response: quantity_in_gallons_estimated
## Df Sum Sq Mean Sq F value Pr(>F)
## overflow_type 3 9.5525e+14 3.1842e+14 81.422 < 2.2e-16 ***
## duration_days 1 9.9474e+14 9.9474e+14 254.364 < 2.2e-16 ***
## duration_hours 1 1.4265e+14 1.4265e+14 36.477 1.571e-09 ***
## causes_top 17 3.2518e+15 1.9128e+14 48.913 < 2.2e-16 ***
## Residuals 20483 8.0103e+16 3.9107e+12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Significant Variables: overflow_type with p < 2.2e-16. duration_days with p < 2.2e-16. duration_hours with p = 1.571e-09. causes_top with p < 2.2e-16
Coefficients (holding other variables constant): Since I have multiple variables that have multiple categories and each category has a coefficient, I will only list the coefficients of the significant categories (p < 0.05).
overflow_typeCSO = -891518 | This means that the CSO overflow type is 891,518 gallons of overflow less than the bypass type (baseline).
overflow_typeSSO = -1313118.8 | This means that the SSO overflow type is 1,313,119 gallons of overflow less than the bypass type.
duration_days = 30950.5 | This means that for every one day increase in the duration of the overflow, 30,951 more gallons of overflow occur.
duration_hours = 6797.5 | This means that for every hour increase in the duration of the overflow, 6,798 more gallons of overflow occur.
causes_topExcessive flow = 3699024.7 | This means that overflows caused by excessive flows generate, on average, 3,699,025 more gallons of overflow than the blockage cause (baseline).
causes_topPrecipitation = -250078.2 | This means that overflows caused by precipitation generate, on average, 250,078 less gallons of overflow than the blockage cause.
causes_topRain. = 205315.8 | This means that overflows caused by rain generate, on average, 205,315.8 more gallons of overflow than the blockage cause.
causes_topRainwater. = -467962.1 | This means that overflows caused by rain water generate, on average, 467,962 less gallons of overflow than the blockage cause.
Note: I didn’t combine “Rain” and “Rainwater” categories because there may be some distinction I do not know of (and it wasn’t documented in the source), so I decided to leave them differently.
This answers our question of “What variables are most significant at predicting overflow quantity, in gallons.”
The overflow type, duration day, duration hours, and causes_top are most significant at predicting overflow quantity.
Multiple Linear Regression assumes:
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
duration_hours: Linearity for duration_hours with quantity_in_gallons_estimated is good as the dashed line follows the fitted linear effect (straight line).
duration_days: Linearity doesn’t look too good with duration_days and quantity_in_gallons_estimated because the dashed line has a higher slope than the straight line. Linearity most likely fails here.
overflow_type: Treated as a factor, so linearity doesn’t apply but box plots have a lot of outliers.
causes_top: Also treated as a factor and boxplots have a lot of outliers too.
Residuals vs Order: We see that a lot of the residuals are condensed around zero with a lot of positive residual bursts, with one major negative burst. Although most residuals hover around zero, there is no constant spread in this plot because of the large spikes. At n~5000, residuals are tightly packed, whereas n~20000, residuals seem to spread out more with a lot of outliers. Furthermore, at n~7000, there seems to be strong activity of high spikes. There doesn’t seem to be a pattern though and most points are around zero, horizontally.
All of this indicates that we likely pass independence because most residuals are centered around zero, but the large, positive bursts are a concern.
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
Residuals vs Fitted: Clustered towards lower values. Points are not randomly scattered around zero. We most likely fail linearity.
Homoscedasticity (equal variance): The spread of residuals is not constant. Because of the clustering at the start of the residuals vs fitted plot, we likely have heteroscedasticity.
Scale–Location: Again, points are clustered towards lower values. This shows that this model doesn’t have equal or constant variance. This further suggests that we have strong heteroscedasticity.
Q–Q Residuals: Points follow a straight line until the right tail. There is strong, positive deviation at the right tail. This means we have large positive residuals and outliers, specifically towards the right. Since the bending is strong, we likely have mid to strong non-normality.
Residuals vs Leverage: One point is a strong outlier, likely indicating influence on the regression line. Most points have low leverage and low standardized residuals, though some points are very high on the y-axis. This likely indicates that outliers are influencing the model.
## duration_days duration_hours
## duration_days 1.0000000000 0.0002601796
## duration_hours 0.0002601796 1.0000000000
Low correlation between duration_hours and duration_days.
Since overflow_type and causes_top are categorical variables, I will perform VIF.
## GVIF Df GVIF^(1/(2*Df))
## overflow_type 2.498165 3 1.164851
## duration_days 1.004203 1 1.002099
## duration_hours 1.068331 1 1.033601
## causes_top 2.636924 17 1.028929
Interpretation:
overflow_type –> degrees of freedom: 3 –> GVIF: 1.164851 –> Conclusion: No multicollinearity.
causes_top –> degrees of freedom: 17 –> GVIF: 1.028929 –> Conclusion: No multicollinearity.
For duration_days and duration_hours (numerical variables), we already checked their correlation but it’s still worth noting their GVIF because a (G)VIF measures how well that predictor can be explained by every other predictor together.
duration_days GVIF: 1.002099 | No multicollinearity.
duration_hours GVIF: 1.033601 | No multicollinearity.
No multicollinearity assumption: Pass
options(scipen = 999)
ggplot(overflow2, aes(x = duration_days, y = quantity_in_gallons_estimated, color = causes_top)) +
geom_point() +
labs(title = "Gallons of Overflow by the Duration of Overflow", color = "Causes", x = "Duration (Days)", y = "Overflow Quantity (in gallons)") Interpretation: This scatterplot shows the overflow quantity by the duration it lasted, in days. The points are colored by their causes. Most cases reside between 0 and a 100 days, and below 30,000,000 gallons of overflow. What I find interesting is one overflow case lasting over 600 days, while having very little overflow. This case was caused by a pipe failure.
We know that according to the multiple linear model, 30951 more gallons of sewer overflow happen per day. So, this point (the pipe failure point) is really interesting.
overflow3 <- overflow2 |>
group_by(causes_top) |>
count()
ggplot(overflow3, aes(x = causes_top, y = n, fill = n)) +
geom_col() +
coord_flip() +
labs(fill = "Count", title = "Causes of Overflow by Count", x = "Top Causes", y = "Count")Interpretation: This bar graph shows the top causes with the count of that cause. For example, the “Precipitation” cause has around 8000 counts of overflows. Without the “Other” cause, precipitation has the highest count. This shows how much overflows happen because of precipitation. After precipitation, it is rain. Note: precipitation and rain aren’t the same.
Here is the link if you’d like to see the Tableau Map visualization: https://public.tableau.com/views/OverflowQuantitygallonsinMarylandCounties/Sheet1?:language=en-US&publish=yes&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link
Interpretation: This Tableau plot shows the map of counties and the frequency of the overflows. We see that Charles County has the highest overflow quantity, with a total sum of approximately 260 million gallons of sewer overflow.
Using the multiple linear regression model and backwards elimination, we conclude with an Adjusted R^2 value of 6.2% and an overall p-value less than 2.2e-16 that the most significant predictors of quantity overflow are the overflow type, the duration (in days) of the overflow, the duration (in hours) of the overflow, and the top causes (causes with over 200 counts).
The multiple linear regression model fails multiple assumptions. This model fails the linearity assumption, homoscedasticity assumption, and the normality assumption. However, it passes the no multicollinearity assumption and likely passes the independence assumption.
Visualizations prove worthy of uncovering important insights. The bar graph shows which causes have the highest count of overflows, giving actionable insight to prevent such causes in the future. The scatterplot shows outliers that may warrant investigation to explain extremes, for example, the overflow that lasted over 600 days yet had very low overflow.
Since a multiple linear regression model with a limited number of variables isn’t the best at predicting overflow quantities, future analysis can develop a regression model that takes in every variable.
If such model is still not good, then PCA might be a better option to uncover this complex dataset.
Source of Dataset: https://opendata.maryland.gov/Energy-and-Environment/Reported-Sewer-Overflows/3rgd-zjxx/about_data
length(unique(object)) - documentation: https://www.geeksforgeeks.org/r-language/count-unique-values-in-r/
YAML Code Customization: https://stackoverflow.com/questions/48261379/rmarkdown-floating-toc-and-toc-at-beginning