Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

The dataset I chose for this discussion post comes from NYC OpenData's Department of Sanitation collection figures, which I previously used in a visualization assignment for Data602. I’m curious to see whether the paper recycling collection totals are linearly related to other recycling collections and whether they can be used to predict the number of tons that will be collected.

library(tidyverse)  # provides read_csv() and ggplot()
location <- 'https://data.cityofnewyork.us/resource/ebb7-mvp5.csv'
df <- read_csv(location)  # read_csv() already returns a tibble
## Rows: 1000 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): month, borough, communitydistrict
## dbl (8): refusetonscollected, papertonscollected, mgptonscollected, resorgan...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df)
## # A tibble: 6 × 11
##   month  borough commu…¹ refus…² paper…³ mgpto…⁴ resor…⁵ schoo…⁶ leave…⁷ xmast…⁸
##   <chr>  <chr>   <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 2023 … Bronx   01         534.    25.4    30.4      NA    16.9      NA      NA
## 2 2023 … Bronx   02         341.    20.9    28.3      NA    NA        NA      NA
## 3 2023 … Bronx   03         449.    29.7    35.2      NA    14.1      NA      NA
## 4 2023 … Bronx   04         874.    55.0    69.7      NA    15.1      NA      NA
## 5 2023 … Bronx   05         822.    52.2    86.8      NA    16.3      NA      NA
## 6 2023 … Bronx   06         490.    30.8    42.1      NA    12.8      NA      NA
## # … with 1 more variable: borough_id <dbl>, and abbreviated variable names
## #   ¹​communitydistrict, ²​refusetonscollected, ³​papertonscollected,
## #   ⁴​mgptonscollected, ⁵​resorganicstons, ⁶​schoolorganictons,
## #   ⁷​leavesorganictons, ⁸​xmastreetons
summary(df)
##     month             borough          communitydistrict  refusetonscollected
##  Length:1000        Length:1000        Length:1000        Min.   : 302.9     
##  Class :character   Class :character   Class :character   1st Qu.:2277.9     
##  Mode  :character   Mode  :character   Mode  :character   Median :3221.7     
##                                                           Mean   :3365.3     
##                                                           3rd Qu.:4356.9     
##                                                           Max.   :7874.6     
##                                                           NA's   :2          
##  papertonscollected mgptonscollected resorganicstons  schoolorganictons
##  Min.   :  19.11    Min.   : 28.27   Min.   :  2.06   Min.   :  2.83   
##  1st Qu.: 247.31    1st Qu.:254.91   1st Qu.: 20.95   1st Qu.: 22.86   
##  Median : 356.04    Median :377.19   Median : 39.15   Median : 36.37   
##  Mean   : 380.80    Mean   :391.31   Mean   : 62.07   Mean   : 51.12   
##  3rd Qu.: 485.57    3rd Qu.:496.99   3rd Qu.: 70.56   3rd Qu.: 64.59   
##  Max.   :1298.39    Max.   :977.52   Max.   :544.27   Max.   :236.45   
##  NA's   :2          NA's   :2        NA's   :811      NA's   :683      
##  leavesorganictons  xmastreetons      borough_id   
##  Min.   :  0.980   Min.   : 2.080   Min.   :1.000  
##  1st Qu.:  5.875   1st Qu.: 7.982   1st Qu.:2.000  
##  Median : 11.520   Median :13.500   Median :3.000  
##  Mean   : 35.923   Mean   :17.780   Mean   :2.719  
##  3rd Qu.: 25.640   3rd Qu.:23.293   3rd Qu.:4.000  
##  Max.   :453.490   Max.   :89.190   Max.   :5.000  
##  NA's   :897       NA's   :882      NA's   :2

Boxplots of tons collected: paper vs other recyclables

Both boxplots have fairly similar medians, and both distributions are somewhat right-skewed, with a tail of higher collection totals.

par(mfrow=c(1,1))
boxplot(df$papertonscollected, df$mgptonscollected,
        main = "Paper vs Other Recyclables",
        ylab = "Tons Collected",
        names = c("Paper", "Other"))

Scatterplot of Paper vs Other Recyclables

ggplot(df,aes(x=papertonscollected,y=mgptonscollected)) +
    geom_point() +
    geom_smooth(method='lm',na.rm=TRUE) +
    labs(x='Paper Collected (Tons)',y='Other Recyclables Collected (Tons)',title='Comparing Recycling Amounts in NYC')
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing missing values (geom_point).

There appears to be a fairly strong linear relationship between the two variables, but the points fan out at larger values of paper collected. This suggests that a simple linear regression model may predict other recyclables less accurately at higher values, where the residuals will likely be larger. Even so, from a purely visual standpoint, the preliminary best-fit line appears to approximate the overall relationship fairly well.

Calculate the Pearson Correlation Coefficient

cor(df$papertonscollected,df$mgptonscollected,use = "complete.obs")
## [1] 0.7675549

The correlation coefficient of 0.77 indicates a strong positive linear relationship between the two variables, and I’m curious how much of the variability in other recyclables paper will be able to account for using regression.

Running Simple Linear Regression

lm.out <- lm(mgptonscollected ~ papertonscollected, data = df)
summary(lm.out)
## 
## Call:
## lm(formula = mgptonscollected ~ papertonscollected, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -365.82  -77.97  -20.59   62.70  492.66 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        126.08678    7.95437   15.85   <2e-16 ***
## papertonscollected   0.69648    0.01843   37.79   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 118.3 on 996 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.5891, Adjusted R-squared:  0.5887 
## F-statistic:  1428 on 1 and 996 DF,  p-value: < 2.2e-16

The summary output from the linear regression model indicates that paper is a statistically significant predictor of other recyclables, although it accounts for only approximately 59% of the variation in other recyclable collections (multiple R-squared of 0.5891). The residual summary shows some of the skew we expected from the scatterplot: the median residual is below zero (-20.59), and the largest positive residual (492.66) is larger in magnitude than the largest negative one (-365.82).
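As a quick arithmetic check on the output above, the sketch below squares the Pearson correlation to recover the multiple R-squared (the two coincide in simple linear regression) and uses the fitted coefficients to make a point prediction. All numbers are copied from the `cor()` and `summary(lm.out)` output; the 400-ton input is a hypothetical value chosen only for illustration.

```r
# Values copied from the cor() and summary(lm.out) output above
r         <- 0.7675549   # Pearson correlation
intercept <- 126.08678   # (Intercept) estimate
slope     <- 0.69648     # papertonscollected estimate

# In simple linear regression, R-squared equals the squared correlation
r_squared <- r^2
round(r_squared, 4)      # 0.5891, matching "Multiple R-squared"

# Point prediction for a hypothetical district-month with 400 tons of paper
paper_tons    <- 400     # assumption: illustrative input, not from the data
predicted_mgp <- intercept + slope * paper_tons
round(predicted_mgp, 2)  # 404.68 tons of other recyclables
```

With the fitted model in memory, the same prediction could be obtained with `predict(lm.out, newdata = data.frame(papertonscollected = 400))`, and adding `interval = "prediction"` would also give a range around it.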

Diagnostic Plots to assess model

par(mfrow=c(2,2))
plot(lm.out)

The residuals vs fitted plot (top left) should show no clear pattern around zero, but the spread of the residuals does appear to widen for fitted values between 400 and 600. While this would not inherently disqualify the model, it might be beneficial to add additional terms to refine the prediction, as indicated in the textbook. The Q-Q plot is fairly close to normal apart from some departure at the upper end, so it does not appear to violate the normality assumption of linear regression. The scale-location plot shows more of a trend between 400 and 600 tons, but overall it looks fairly similar to the residuals vs fitted plot. Lastly, the residuals vs leverage plot suggests there might be an outlier, but overall few points have much leverage.
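The visual impression of changing spread in the residual plots can be backed up with a formal test. The sketch below computes the studentized (Koenker) form of the Breusch-Pagan test by hand in base R: regress the squared residuals on the predictor and compare n times that auxiliary R-squared against a chi-square distribution. It runs on simulated data whose noise grows with x, not on the sanitation dataset, so the variable names and numbers here are assumptions for illustration only.

```r
# Simulated data (assumption: NOT the NYC data) with noise that fans out as x grows
set.seed(42)
n <- 500
x <- runif(n, 20, 1300)
y <- 126 + 0.7 * x + rnorm(n, sd = 0.2 * x)

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the predictor
aux <- lm(residuals(fit)^2 ~ x)

# Studentized Breusch-Pagan statistic: n * R-squared of the auxiliary model,
# compared against a chi-square with df = number of predictors (1 here)
bp_stat <- n * summary(aux)$r.squared
p_value <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

# A small p-value is evidence against constant residual variance
p_value < 0.05
```

To try this on the model from this post, the auxiliary regression would be `lm(residuals(lm.out)^2 ~ papertonscollected, data = model.frame(lm.out))` with `n <- nobs(lm.out)`. The `lmtest` package's `bptest()` should give the same kind of result if that package is installed.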

Conclusion

Overall, simple linear regression is appropriate for estimating the number of tons of other recyclables collected. None of the core assumptions needed to run SLR appear to be seriously violated based on the diagnostic plots, although the wider residual spread at higher values is worth keeping in mind. There are also likely other variables that could be added to the model to improve the prediction.
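One way to explore whether an additional variable improves the model is to fit a second, larger model and compare the two with adjusted R-squared and a nested-model F-test. The sketch below does this on simulated data; the columns `paper`, `refuse`, and `mgp` are hypothetical stand-ins, and the relationship between them is invented for the example rather than estimated from the sanitation data.

```r
# Simulated stand-in data (assumption: coefficients and noise levels are made up)
set.seed(1)
n      <- 300
paper  <- runif(n, 20, 1300)
refuse <- 5 * paper + rnorm(n, sd = 800)
mgp    <- 100 + 0.4 * paper + 0.05 * refuse + rnorm(n, sd = 60)
sim    <- data.frame(paper, refuse, mgp)

# Simple model vs model with one extra predictor
fit_simple <- lm(mgp ~ paper, data = sim)
fit_multi  <- lm(mgp ~ paper + refuse, data = sim)

# Adjusted R-squared penalizes extra terms, so it is a fairer comparison
adj_simple <- summary(fit_simple)$adj.r.squared
adj_multi  <- summary(fit_multi)$adj.r.squared

# F-test for nested models: does adding `refuse` reduce residual error?
comparison <- anova(fit_simple, fit_multi)
p_value    <- comparison$`Pr(>F)`[2]

c(adj_simple = adj_simple, adj_multi = adj_multi, p_value = p_value)
```

On the real data, a candidate for the extra predictor might be `refusetonscollected` or `borough`, with the caveat that `anova()` needs both models fit on the same rows, so rows with missing values would have to be dropped first.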