Macalester Sustainability Data Analysis July 28 - August 7

The following is the data analysis I did for the Macalester Sustainability team from July 28 to August 7, 2014. This document has two purposes: to record the work on a page that can be easily shared, and to serve as a reference if we need to reproduce these results in the future. The data behind these graphs run from September 2006 to May 2012, the month before Macalester changed to its current recycling and composting vendor. I made graphs for five categories (Diversion Rate, Trash, Mixed Papers, Bottles and Cans, and Total Recycling) on three timelines: monthly from September 2006 to May 2012, with each point representing one month; by School Year; and by Month.


Plots

[15 plots (chunk unnamed-chunk-2): Diversion Rate, Trash, Mixed Papers, Bottles and Cans, and Total Recycling, each shown monthly, by School Year, and by Month.]

Data Imputation

The other portion of my work has been an attempt at what is called “data imputation”. Because our current vendor does not weigh our trash, we do not have reliable data past May 2012, and my task has been to handle the missing data. However, the rough model below shows that the variability in these numbers is too high to project or predict with any real confidence. If we look at the confidence interval for each specific month, the numbers fluctuate far too much.

This could be due to the model, so another option would be to remove outlier months such as May, since we know intuitively that there is more trash when students are moving out. However, the multi-colored graph shows that Mixed Papers, one of the categories we would be trying to predict, has abnormally high totals in August and September of certain years. Mixed Papers is not the only case: many of the categories behave this way, which makes a reliable trend hard to find.

Even if we found a model that predicted more accurate numbers, filling in the missing months and years raises another issue: we would essentially be lying to readers by analyzing data that we handpicked using our model. First, it would not be correct, because the filled-in numbers would match the model perfectly, which we know is impossible for real data. Second, it is unethical to produce unreliable data for the purpose of phrasing it in a way that benefits us.

All that being said, there are studies on missing-data imputation techniques that are not taught in school, so my next task will be to implement those techniques, and I will document them in a later post similar to this one.

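library(mosaic)  # provides makeFun()
# Linear model: Mixed Papers total as a function of school year and month
mod <- lm(Mixed.Papers ~ School.Year + Month, data=trash_data)
trash_function <- makeFun(mod)  # wrap the fitted model as an ordinary function
summary(mod)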
## 
## Call:
## lm(formula = Mixed.Papers ~ School.Year + Month, data = trash_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5343  -1688    262   1301   9149 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1962610     383642   -5.12  4.0e-06 ***
## School.Year        982        191    5.14  3.6e-06 ***
## MonthOctober      -834       1510   -0.55   0.5828    
## MonthNovember    -1324       1510   -0.88   0.3845    
## MonthDecember    -1296       1510   -0.86   0.3946    
## MonthJanuary     -3510       1510   -2.32   0.0238 *  
## MonthFebruary    -1154       1510   -0.76   0.4479    
## MonthMarch       -2114       1510   -1.40   0.1670    
## MonthApril       -1304       1510   -0.86   0.3913    
## MonthMay          4557       1510    3.02   0.0038 ** 
## MonthJune        -3109       1587   -1.96   0.0551 .  
## MonthJuly        -3153       1587   -1.99   0.0518 .  
## MonthAugust        527       1587    0.33   0.7411    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2620 on 56 degrees of freedom
## Multiple R-squared:  0.562,  Adjusted R-squared:  0.468 
## F-statistic: 5.99 on 12 and 56 DF,  p-value: 1.38e-06
confint(mod)
##                    2.5 %     97.5 %
## (Intercept)   -2731136.0 -1.194e+06
## School.Year        598.8  1.364e+03
## MonthOctober     -3859.3  2.191e+03
## MonthNovember    -4348.5  1.701e+03
## MonthDecember    -4320.5  1.729e+03
## MonthJanuary     -6534.8 -4.848e+02
## MonthFebruary    -4179.2  1.871e+03
## MonthMarch       -5139.0  9.110e+02
## MonthApril       -4329.5  1.720e+03
## MonthMay          1532.3  7.582e+03
## MonthJune        -6287.1  6.971e+01
## MonthJuly        -6331.1  2.571e+01
## MonthAugust      -2651.5  3.705e+03
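
The intervals above describe the model's coefficients. The uncertainty in one predicted month is wider still, because it also includes the residual scatter. The following is a minimal sketch of that idea, not the analysis itself: `trash_data` is not reproduced in this post, so the block simulates a stand-in data set. The names `sim`, `sim_mod` and `new_month` are hypothetical, and the baseline of 8000 is invented. Only the yearly trend of roughly 980 and the noise level of roughly 2600 are borrowed from the summary output above.

```r
# Simulated stand-in for trash_data. These are NOT the real Macalester numbers.
set.seed(1)
months <- c("September", "October", "November", "December", "January",
            "February", "March", "April", "May", "June", "July", "August")
sim <- expand.grid(Month = factor(months, levels = months),
                   School.Year = 2006:2011)

# Assumed data-generating process: linear yearly trend plus normal noise.
# The trend (980) and noise SD (2600) echo the fitted model above;
# the baseline (8000) is made up for illustration.
sim$Mixed.Papers <- 8000 + 980 * (sim$School.Year - 2006) +
  rnorm(nrow(sim), sd = 2600)

# Same model formula as in the analysis, fitted to the simulated data.
sim_mod <- lm(Mixed.Papers ~ School.Year + Month, data = sim)

# One month beyond the end of the simulated data.
new_month <- data.frame(School.Year = 2012,
                        Month = factor("October", levels = months))

# interval = "prediction" gives the range for a single new observation,
# not just the range for the fitted mean.
pred <- predict(sim_mod, newdata = new_month,
                interval = "prediction", level = 0.95)
print(pred)

# Width of the 95% prediction interval, in the units of Mixed.Papers.
interval_width <- pred[1, "upr"] - pred[1, "lwr"]
print(interval_width)
```

With a residual standard error near 2600, a 95% prediction interval spans roughly four residual standard errors or more, which is large compared with the month-to-month effects in the coefficient table.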

[4 plots (chunk unnamed-chunk-4)]
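
The concern raised in the Data Imputation section is that numbers filled in by a model look better than real data ever could. This can be illustrated with a small sketch. It is an assumption-laden toy, not a result from the Macalester data: the data are simulated, the names `obs`, `gap`, `fit_obs` and `fit_filled` are hypothetical, and the two "missing" years are filled with the model's own predictions, which is the naive approach the section warns against.

```r
# Simulated stand-in data. These are NOT the real Macalester numbers.
set.seed(2)
months <- c("September", "October", "November", "December", "January",
            "February", "March", "April", "May", "June", "July", "August")

# "Observed" years, generated from an invented trend plus noise.
obs <- expand.grid(Month = factor(months, levels = months),
                   School.Year = 2006:2011)
obs$Mixed.Papers <- 8000 + 980 * (obs$School.Year - 2006) +
  rnorm(nrow(obs), sd = 2600)
fit_obs <- lm(Mixed.Papers ~ School.Year + Month, data = obs)

# "Missing" years, filled in with the model's own predictions.
gap <- expand.grid(Month = factor(months, levels = months),
                   School.Year = 2012:2013)
gap$Mixed.Papers <- predict(fit_obs, newdata = gap)

# Refit on observed plus filled-in rows.
fit_filled <- lm(Mixed.Papers ~ School.Year + Month, data = rbind(obs, gap))

# The filled-in rows sit exactly on the fitted surface, so they add
# degrees of freedom without adding any residual error.
sigma_obs    <- summary(fit_obs)$sigma
sigma_filled <- summary(fit_filled)$sigma
print(c(observed_only = sigma_obs, with_filled_rows = sigma_filled))
```

The coefficients stay the same, but the residual standard error shrinks, so the combined data set appears more precise than the observed data justify. Imputation methods that add random variation to the filled-in values are designed to avoid this understatement.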