Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?
The dataset I chose for this discussion post comes from NYC OpenData's Department of Sanitation collection figures, which I previously used in a visualization assignment for Data602. I’m curious to see whether the paper recycling collection totals are linearly related to other recycling collections and whether they can be used to predict the number of tons that will be collected.
library(tidyverse)  # provides read_csv() and ggplot()
location <- 'https://data.cityofnewyork.us/resource/ebb7-mvp5.csv'
df <- read_csv(location)  # read_csv() already returns a tibble
## Rows: 1000 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): month, borough, communitydistrict
## dbl (8): refusetonscollected, papertonscollected, mgptonscollected, resorgan...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df)
## # A tibble: 6 × 11
## month borough commu…¹ refus…² paper…³ mgpto…⁴ resor…⁵ schoo…⁶ leave…⁷ xmast…⁸
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2023 … Bronx 01 534. 25.4 30.4 NA 16.9 NA NA
## 2 2023 … Bronx 02 341. 20.9 28.3 NA NA NA NA
## 3 2023 … Bronx 03 449. 29.7 35.2 NA 14.1 NA NA
## 4 2023 … Bronx 04 874. 55.0 69.7 NA 15.1 NA NA
## 5 2023 … Bronx 05 822. 52.2 86.8 NA 16.3 NA NA
## 6 2023 … Bronx 06 490. 30.8 42.1 NA 12.8 NA NA
## # … with 1 more variable: borough_id <dbl>, and abbreviated variable names
## # ¹communitydistrict, ²refusetonscollected, ³papertonscollected,
## # ⁴mgptonscollected, ⁵resorganicstons, ⁶schoolorganictons,
## # ⁷leavesorganictons, ⁸xmastreetons
summary(df)
## month borough communitydistrict refusetonscollected
## Length:1000 Length:1000 Length:1000 Min. : 302.9
## Class :character Class :character Class :character 1st Qu.:2277.9
## Mode :character Mode :character Mode :character Median :3221.7
## Mean :3365.3
## 3rd Qu.:4356.9
## Max. :7874.6
## NA's :2
## papertonscollected mgptonscollected resorganicstons schoolorganictons
## Min. : 19.11 Min. : 28.27 Min. : 2.06 Min. : 2.83
## 1st Qu.: 247.31 1st Qu.:254.91 1st Qu.: 20.95 1st Qu.: 22.86
## Median : 356.04 Median :377.19 Median : 39.15 Median : 36.37
## Mean : 380.80 Mean :391.31 Mean : 62.07 Mean : 51.12
## 3rd Qu.: 485.57 3rd Qu.:496.99 3rd Qu.: 70.56 3rd Qu.: 64.59
## Max. :1298.39 Max. :977.52 Max. :544.27 Max. :236.45
## NA's :2 NA's :2 NA's :811 NA's :683
## leavesorganictons xmastreetons borough_id
## Min. : 0.980 Min. : 2.080 Min. :1.000
## 1st Qu.: 5.875 1st Qu.: 7.982 1st Qu.:2.000
## Median : 11.520 Median :13.500 Median :3.000
## Mean : 35.923 Mean :17.780 Mean :2.719
## 3rd Qu.: 25.640 3rd Qu.:23.293 3rd Qu.:4.000
## Max. :453.490 Max. :89.190 Max. :5.000
## NA's :897 NA's :882 NA's :2
Both boxplots show fairly similar medians (about 356 tons for paper and 377 tons for other recyclables), and both distributions are somewhat right-skewed, with a long tail toward higher collection totals.
par(mfrow=c(1,1))
boxplot(df$papertonscollected,df$mgptonscollected,
main = "Paper vs Other Recyclables",
ylab = "Tons Collected",
names = c("Paper", "Other"))
ggplot(df,aes(x=papertonscollected,y=mgptonscollected)) +
geom_point() +
geom_smooth(method='lm',na.rm=TRUE) +
labs(x='Paper Collected (Tons)',y='Other Recyclables Collected (Tons)',title='Comparing Recycling Amounts in NYC')
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing missing values (geom_point).
There appears to be a fairly strong linear relationship between the two variables, and from a purely visual standpoint the preliminary best-fit line approximates it fairly well. However, the points fan out at larger values of paper collected, which suggests the variance is not constant. A simple linear regression model may therefore predict other recyclables less precisely at higher values, and the residuals will likely be larger there.
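To make the concern about fanning concrete, here is a minimal sketch on simulated data (not the DSNY values). The variable names, coefficients, and noise model are assumptions chosen only to mimic the shape of the scatterplot: the noise standard deviation grows with the predictor, and the residual spread is then compared between the lower and upper halves of the predictor.

```r
set.seed(123)

# Simulated stand-in data (assumption): noise grows with `paper`
n     <- 1000
paper <- runif(n, 20, 1300)
other <- 126 + 0.7 * paper + rnorm(n, sd = 30 + 0.2 * paper)

fit <- lm(other ~ paper)

# Compare residual spread below and above the median of the predictor
high    <- paper > median(paper)
sd_low  <- sd(resid(fit)[!high])
sd_high <- sd(resid(fit)[high])

c(low = sd_low, high = sd_high)
```

When the data fan out like this, the residual standard deviation in the upper half is noticeably larger than in the lower half, which is the pattern to watch for in the diagnostic plots.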
cor(df$papertonscollected,df$mgptonscollected,use = "complete.obs")
## [1] 0.7675549
The correlation coefficient of 0.77 shows a strong linear relationship between the two variables, and I’m curious how much of the variability in other recyclables paper will be able to account for using regression.
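For a simple linear regression with a single predictor, the R-squared equals the squared correlation coefficient, so the correlation already hints at the answer. The sketch below squares the reported value and, as a sanity check of the identity, compares both quantities on R's built-in `mtcars` data (used here only as an illustration, not as part of this analysis).

```r
# Correlation reported by cor() above
r <- 0.7675549
r_squared <- r^2
round(r_squared, 4)  # 0.5891

# Illustration of the identity on a built-in dataset (not the DSNY data)
fit_demo <- lm(mpg ~ wt, data = mtcars)
identity_holds <- isTRUE(all.equal(cor(mtcars$wt, mtcars$mpg)^2,
                                   summary(fit_demo)$r.squared))
identity_holds
```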
lm.out <- lm(mgptonscollected ~ papertonscollected, data = df)
summary(lm.out)
##
## Call:
## lm(formula = mgptonscollected ~ papertonscollected, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -365.82 -77.97 -20.59 62.70 492.66
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 126.08678 7.95437 15.85 <2e-16 ***
## papertonscollected 0.69648 0.01843 37.79 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 118.3 on 996 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.5891, Adjusted R-squared: 0.5887
## F-statistic: 1428 on 1 and 996 DF, p-value: < 2.2e-16
The summary output from the linear regression model indicates that paper is a statistically significant predictor of other recyclables, although it only accounts for approximately 59% of the variation in other recyclable collections (multiple R-squared of 0.5891). The residual summary also shows some of the skew we expected from the scatterplot: the median residual is below zero (-20.59) and the largest positive residual (492.66) is larger in magnitude than the largest negative one (-365.82).
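Since the goal is prediction, it may help to see what the fitted line implies for a single case. The sketch below plugs the coefficients from the summary output into the regression equation for a hypothetical district-month with 400 tons of paper; the value 400 is an assumption for illustration, and the plus-or-minus two residual standard errors band is only a rough approximation of a prediction interval.

```r
# Coefficients and residual standard error copied from summary(lm.out)
b0  <- 126.08678   # intercept
b1  <- 0.69648     # slope for papertonscollected
rse <- 118.3       # residual standard error

paper_tons <- 400  # hypothetical input, not taken from the data

# Point prediction from the fitted line
pred <- b0 + b1 * paper_tons
pred  # about 404.7 tons

# Rough band: +/- 2 residual standard errors (approximation only)
rough_band <- c(lower = pred - 2 * rse, upper = pred + 2 * rse)
rough_band
```

With the fitted object available, `predict(lm.out, newdata = data.frame(papertonscollected = 400), interval = "prediction")` would return the exact interval. The width of the rough band illustrates how much uncertainty remains with an R-squared near 0.59.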
par(mfrow=c(2,2))
plot(lm.out)
The Residuals vs Fitted plot (top left) should show no clear pattern around zero, but the spread of the residuals widens for fitted values between roughly 400 and 600. While this would not inherently disqualify the model, it might be beneficial to add additional terms to refine the prediction, as indicated in the textbook. The Normal Q-Q plot follows the reference line fairly well apart from some deviation at the higher end, so the normality assumption does not appear to be violated. The Scale-Location plot shows more of a trend between fitted values of 400 and 600 tons, but overall it is fairly similar to the Residuals vs Fitted plot. Lastly, the Residuals vs Leverage plot suggests there might be an outlier, but overall few points have much leverage.
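The leverage plot can be backed up numerically with Cook's distance and hat values. The sketch below uses simulated data with one planted extreme point (an assumption for illustration, not the DSNY values) and flags observations with the common 4/n rule of thumb, which is a convention rather than a formal test.

```r
set.seed(42)

# Simulated stand-in data (assumption) plus one planted outlier
n     <- 200
paper <- c(runif(n, 20, 900), 1300)          # last point is far out in x
other <- c(126 + 0.7 * paper[1:n] + rnorm(n, sd = 100),
           126 + 0.7 * 1300 + 600)           # and far above the line

fit <- lm(other ~ paper)

cd  <- cooks.distance(fit)  # influence of each point on the fitted model
lev <- hatvalues(fit)       # leverage of each point

# Rule-of-thumb flag: Cook's distance above 4 / (number of observations)
flagged <- which(cd > 4 / length(cd))

most_influential <- unname(which.max(cd))
most_influential  # index of the planted point
```

On the real model, `cooks.distance(lm.out)` and `hatvalues(lm.out)` would give the same quantities that the Residuals vs Leverage panel displays.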
Overall, simple linear regression (SLR) is appropriate for estimating the number of tons of other recyclables collected. None of the core assumptions of SLR appear to be seriously violated in the diagnostic plots; however, there are likely other variables that could be added to the model to improve the prediction.
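One way to check whether an additional variable improves the model is to compare nested fits. The sketch below does this on simulated data; the second predictor `refuse`, the coefficients, and the noise level are all assumptions for illustration and say nothing about what the real data would show.

```r
set.seed(7)

# Simulated stand-in data (assumption): two predictors, one response
n   <- 300
sim <- data.frame(paper  = runif(n, 20, 1300),
                  refuse = runif(n, 300, 7800))
sim$other <- 100 + 0.5 * sim$paper + 0.03 * sim$refuse + rnorm(n, sd = 80)

fit1 <- lm(other ~ paper, data = sim)            # simple model
fit2 <- lm(other ~ paper + refuse, data = sim)   # adds a second predictor

# F-test for whether the extra term reduces residual variance
cmp     <- anova(fit1, fit2)
p_value <- cmp[["Pr(>F)"]][2]

c(adj_r2_simple = summary(fit1)$adj.r.squared,
  adj_r2_full   = summary(fit2)$adj.r.squared,
  p_value       = p_value)
```

On the real data, the analogous comparison would be between `lm.out` and a model such as `lm(mgptonscollected ~ papertonscollected + refusetonscollected, data = df)`, with the diagnostic plots repeated for the larger model.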