The following is the data analysis I did for the Macalester Sustainability team from July 28th to August 7th, 2014. The purpose of this document is not only to record the work on a page that can be easily shared, but also to serve as a reference if we need to reproduce these results in the future. The data behind these graphs run from September 2006 to May 2012, the last month before Macalester changed to its current recycling and composting vendor. I have made graphs for five categories (Diversion Rate, Trash, Mixed Papers, Bottles and Cans, and Total Recycling) on three timelines: month by month from September 2006 to May 2012, by School Year, and by calendar Month.
The other portion of my work has been an attempt at what is called “data imputation”. Because our current vendor does not weigh our trash, we do not have reliable data past May 2012, and my task has been to handle the missing data. However, the rough model below shows that the variability in these numbers is too high to project or predict with any real confidence. If we look at the confidence interval for each specific month, the numbers fluctuate far too much.
This could be due to the model, so one option would be to remove outlier months such as May, since we know intuitively that we will have more trash when students are moving out. However, the multi-colored graph shows that Mixed Papers, one of the categories we would be trying to predict, has abnormally high totals in August and September of certain years. Mixed Papers is not the only case: many of the categories behave this way, which makes a reliable trend hard to find.
Even if we found a model that predicted more accurately, filling in the missing months and years raises another issue: we would essentially be lying to readers by analyzing data that we handpicked using our own model. First, this would not be correct, because the imputed numbers would fit our model perfectly, which we know real data never does. Second, it is unethical to produce unreliable data so that it can be phrased in a way that benefits us. All that being said, there is research on missing-data imputation techniques that go beyond what is taught in school, so my next task will be to implement those techniques, and I will document them in a later post similar to this one.
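library(mosaic)  # assumed source of makeFun()
# Model monthly Mixed Papers totals by school year and month
mod <- lm(Mixed.Papers ~ School.Year + Month, data = trash_data)
trash_function <- makeFun(mod)  # fitted model as a function
summary(mod)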
##
## Call:
## lm(formula = Mixed.Papers ~ School.Year + Month, data = trash_data)
##
## Residuals:
##    Min     1Q Median     3Q    Max
##  -5343  -1688    262   1301   9149
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)    -1962610     383642   -5.12  4.0e-06 ***
## School.Year         982        191    5.14  3.6e-06 ***
## MonthOctober       -834       1510   -0.55   0.5828
## MonthNovember     -1324       1510   -0.88   0.3845
## MonthDecember     -1296       1510   -0.86   0.3946
## MonthJanuary      -3510       1510   -2.32   0.0238 *
## MonthFebruary     -1154       1510   -0.76   0.4479
## MonthMarch        -2114       1510   -1.40   0.1670
## MonthApril        -1304       1510   -0.86   0.3913
## MonthMay           4557       1510    3.02   0.0038 **
## MonthJune         -3109       1587   -1.96   0.0551 .
## MonthJuly         -3153       1587   -1.99   0.0518 .
## MonthAugust         527       1587    0.33   0.7411
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2620 on 56 degrees of freedom
## Multiple R-squared: 0.562, Adjusted R-squared: 0.468
## F-statistic: 5.99 on 12 and 56 DF, p-value: 1.38e-06
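To put the residual standard error in context, here is a rough back-of-the-envelope calculation that was not part of the original analysis. A 95% prediction interval for a single new month is at least as wide as plus or minus the t critical value times the residual standard error; the exact interval from `predict(mod, newdata, interval = "prediction")` would be somewhat wider, since it also accounts for uncertainty in the coefficients. The variable names below are my own.

```r
# Values copied from the summary(mod) output above (rounded as printed)
sigma_hat <- 2620  # residual standard error
df_res    <- 56    # residual degrees of freedom

# Two-sided 95% critical value from the t distribution
t_crit <- qt(0.975, df = df_res)

# Lower bound on the half-width of a 95% prediction interval
half_width <- t_crit * sigma_hat
half_width
```

The half-width comes out above 5,000, which is larger in absolute value than every month effect in the coefficient table, including May. That is the sense in which the model is too noisy to fill in missing months with confidence.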
confint(mod)
##                    2.5 %     97.5 %
## (Intercept)   -2731136.0 -1.194e+06
## School.Year        598.8  1.364e+03
## MonthOctober     -3859.3  2.191e+03
## MonthNovember    -4348.5  1.701e+03
## MonthDecember    -4320.5  1.729e+03
## MonthJanuary     -6534.8 -4.848e+02
## MonthFebruary    -4179.2  1.871e+03
## MonthMarch       -5139.0  9.110e+02
## MonthApril       -4329.5  1.720e+03
## MonthMay          1532.3  7.582e+03
## MonthJune        -6287.1  6.971e+01
## MonthJuly        -6331.1  2.571e+01
## MonthAugust      -2651.5  3.705e+03
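As a sanity check on the intervals above, each one is the estimate plus or minus a t critical value times the standard error. The sketch below rebuilds the May interval from the rounded numbers printed by `summary(mod)`, so it should agree with `confint(mod)` only up to rounding. The variable names are my own and not part of the original analysis.

```r
# Values copied from the rounded summary(mod) output
est_may <- 4557  # MonthMay estimate
se_may  <- 1510  # MonthMay standard error
df_res  <- 56    # residual degrees of freedom

# Two-sided 95% critical value from the t distribution
t_crit <- qt(0.975, df = df_res)

# Interval = estimate +/- t_crit * standard error
ci_may <- est_may + c(-1, 1) * t_crit * se_may
ci_may
```

The result lands close to the 1532.3 to 7582 reported for MonthMay. The interval is about 6,000 wide, which is wider than the May effect itself, and that is the kind of month-to-month fluctuation described earlier.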