In this homework we will begin to take a look at ANOVAs and GLM. Data for this assignment can be found at: http://asayanalytics.com/telework_csv The data is a compilation of US Census data that has been cleaned up for use in this assignment
Let us take a simple look at how telecommuting impacts a person’s weekly income
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = weekly_earnings ~ telecommute.text, data = census.data)
##
## $telecommute.text
## diff lwr upr p adj
## Yes-No 350.7614 313.8213 387.7015 0
Based on our ANOVA model, we would expect the average telecommuting worker to make $350 more per week. We can see from our Tukey HSD test that we do have a statistically significant difference between our two types of workers.
We can visually see this difference in the following graph:
Even though this model shows that a person who telecommutes makes a statistically significant greater weekly income than a non-telecommuter worker, it fails to take into account many potential factors. Things like the type of industry that a person works in (a person that telecommuting is more like to work in a professional environment, while someone who doesn’t is more likely to be a non-skilled worker) The model also fails to take into account the level of education that the person has, as people with college degrees tend to make more money than those without
In an effort to continue to understand impacts to weekly income, let us now look at how both telecommuting and gender impact it:
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = weekly_earnings ~ telecommute.text + sex.word, data = census.data)
##
## $telecommute.text
## diff lwr upr p adj
## Yes-No 350.7614 314.4721 387.0508 0
##
## $sex.word
## diff lwr upr p adj
## Male-Female 242.1472 208.6566 275.6378 0
From this model we can see that telecommuter workers make $351 dollars per week more than a non-telecommuter and a female worker will make $242 less per week than their male counterpart.
Visually these impacts on weekly income look like this:
Finally let us see if our second model does a better job of explaining how much variation of weekly income is explained by our independent variables (i.e. telecommuting and gender):
## Analysis of Variance Table
##
## Model 1: weekly_earnings ~ telecommute.text
## Model 2: weekly_earnings ~ telecommute.text + sex.word
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 5540 2319766548
## 2 5539 2238348491 1 81418057 201.48 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the output we can see that there is a statistically significant increase in the amount of residual sum of squares explained by the second model vs the first model (i.e. the second model better explains the differences in weekly income.).
While weekly income is better explained by the second model we still have several factors that could have a potential impact to weekly income (i.e. industry, education level, type of industry, type of job, etc.).
Let us now take a look at a linear model, and how the number of hours worked has an impact on income. Mathematically we can express linear models in the following form:
\[\beta_{WeeklyIncome}=C + \beta_{HoursWorked}+\epsilon\] Where C is some constant (or y intercept) and \[\epsilon\] is some error term.
For our model the mathematic equation can be written as:
\[\beta_{WeeklyIncome}=60.40 + 22.59\beta_{HoursWorked}+\epsilon_{i}\] And the lm analyzed is given from this R output:
##
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = census.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1758.0 -411.2 -185.1 230.4 2796.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.0433 28.5766 2.311 0.0209 *
## hours_worked 22.5887 0.7072 31.943 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 613 on 5540 degrees of freedom
## Multiple R-squared: 0.1555, Adjusted R-squared: 0.1554
## F-statistic: 1020 on 1 and 5540 DF, p-value: < 2.2e-16
Based on this output, we can see that the number of hours worked does have a statistically significant effect on the amount of weekly income. We can also see that for every hour worked we expect to see a $22.59 increase in pay. We also see from the output that about 16% of the amount of the variation in weekly income is explained by the amount of hours that a person works. We can also see that the mean residual of our model is around $613 per week.
While this model does help us to understand around 16% of the variation in weekly income, it still doesn’t take into account level of education, gender, and race.
If we take a closer look at our model we can see that our model doesn’t do an excellent job of modeling how weekly income is impacted:
If we consider the above recommendations, the following model is created:
##
## Call:
## lm(formula = weekly_earnings ~ hours_worked + as.factor(education) +
## sex.word + as.factor(race), data = census.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1781.0 -336.6 -101.4 207.4 2745.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -211.7033 223.6891 -0.946 0.343977
## hours_worked 19.0812 0.6477 29.461 < 2e-16 ***
## as.factor(education)32 -241.8049 329.8095 -0.733 0.463489
## as.factor(education)33 -114.2745 247.5396 -0.462 0.644357
## as.factor(education)34 -114.1348 254.9862 -0.448 0.654451
## as.factor(education)35 -104.6667 241.5775 -0.433 0.664840
## as.factor(education)36 -0.5719 233.8553 -0.002 0.998049
## as.factor(education)37 -9.4669 230.5358 -0.041 0.967246
## as.factor(education)38 -74.6092 239.6286 -0.311 0.755544
## as.factor(education)39 121.0830 222.7948 0.543 0.586826
## as.factor(education)40 168.2064 222.9247 0.755 0.450555
## as.factor(education)41 230.7979 224.6322 1.027 0.304255
## as.factor(education)42 304.2107 224.1146 1.357 0.174713
## as.factor(education)43 573.2131 222.7541 2.573 0.010099 *
## as.factor(education)44 767.3268 223.4553 3.434 0.000599 ***
## as.factor(education)45 1034.3799 228.9592 4.518 6.38e-06 ***
## as.factor(education)46 923.8269 228.3151 4.046 5.27e-05 ***
## sex.wordMale 163.9762 15.0648 10.885 < 2e-16 ***
## as.factor(race)2 -145.3370 23.9043 -6.080 1.28e-09 ***
## as.factor(race)3 -3.9834 71.4993 -0.056 0.955573
## as.factor(race)4 -16.6237 34.2584 -0.485 0.627522
## as.factor(race)5 -54.8574 89.9543 -0.610 0.541995
## as.factor(race)6 -115.5352 107.0931 -1.079 0.280711
## as.factor(race)7 138.2888 113.8896 1.214 0.224710
## as.factor(race)8 152.4390 140.8376 1.082 0.279134
## as.factor(race)9 -198.4867 192.7541 -1.030 0.303177
## as.factor(race)10 -91.9921 192.6509 -0.478 0.633020
## as.factor(race)11 -104.7986 544.1356 -0.193 0.847282
## as.factor(race)12 -176.6050 544.0587 -0.325 0.745491
## as.factor(race)13 67.5411 544.0796 0.124 0.901210
## as.factor(race)16 -373.9605 314.3578 -1.190 0.234254
## as.factor(race)21 1411.2649 544.0705 2.594 0.009515 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 543.8 on 5510 degrees of freedom
## Multiple R-squared: 0.3389, Adjusted R-squared: 0.3352
## F-statistic: 91.12 on 31 and 5510 DF, p-value: < 2.2e-16
While this is an eye sore to look at, we can see that almost 34% of our variation in weekly income is now being explained. We also have a statistically significant model. We might also want to consider such things as interaction effects.
Now let us see how weekly earnings change as a function of age. The general form of our model is:
\[\beta_{WeeklyIncome}=C + \beta_{Age}+\epsilon\]
##
## Call:
## lm(formula = weekly_earnings ~ age, data = census.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1245.3 -445.3 -178.1 284.7 2133.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 548.9457 28.2350 19.44 <2e-16 ***
## age 9.1941 0.6306 14.58 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 654.6 on 5540 degrees of freedom
## Multiple R-squared: 0.03696, Adjusted R-squared: 0.03678
## F-statistic: 212.6 on 1 and 5540 DF, p-value: < 2.2e-16
The specific form of our model is: \[\beta_{WeeklyIncome}=584.95 + 9.19\beta_{Age}+\epsilon\]
From this model we expect the average weekly income to increase by $9.19. This is a statistically significant income, however it as a very small r squared. There is also a very large residual sum of squares. The model is also a statistically significant model.
This model is also not complete. While we have taken into account age/experience, we have failed to take into account industry, gender and hours worked.
As we look at our linearity assumption, let us take a look at the studentized residual qqnorm plot:
Based on the plot above we can see that our model has some curvature, which indicates that we have some bias in our data. The bias that we are likely seeing is likely due to younger professionals with high paying jobs and experienced individuals who do not make much (i.e. a semi-retired professional who has gone back to work as a bus drive to stay busy)
There is also concern that this model is not linear but some other type of model (i.e. uniform, bivariate, Weibull, etc.). To identify this we will need to do a scatter plot of our data. Weekly income also as a limit to the weekly income that is collected by the census bureau, causing there to be a high number of $3000+ weekly earnings. Because we cannot recollect this data we should consider removing them from the dataset and making note of that in our date. There are also several number of individuals who are working well beyond retirement age (i.e. 65). The model also doesn’t take into consideration the different levels of education and a person’s gender. We also have a very low representation of minorities. The graphs below help us understand some of the concerns that this data and its corresponding model have:
In the previous question we identified some concerns that we had with the data. We will now take a moment to clean up our data and run the model again.
For this model we will remove any person whose weekly income is over $2750, any person who is over the age of 65, earned only a bachelor’s degree and are white.
##
## Call:
## lm(formula = weekly_earnings ~ age + as.factor(sex) + hours_worked,
## data = census.data.2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1305.8 -326.5 -104.8 340.7 1754.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 596.633 129.449 4.609 5.10e-06 ***
## age 6.850 1.896 3.613 0.000332 ***
## as.factor(sex)2 -69.184 46.849 -1.477 0.140348
## hours_worked 14.952 2.094 7.140 3.17e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 515 on 519 degrees of freedom
## Multiple R-squared: 0.1229, Adjusted R-squared: 0.1178
## F-statistic: 24.24 on 3 and 519 DF, p-value: 1.075e-14
The general form of our model is:
\[\beta_{WeeklyIncome}=C + \beta_{Age}+\beta_{Gender}+\beta_{HoursWorked}+\epsilon\] The specific form of our model is: \[\beta_{WeeklyIncome}=527.45 + 6.85\beta_{Age}+69.18\beta_{Gender}+14.95\beta_{HoursWorked}+\epsilon\]
## age as.factor(sex) hours_worked
## 1.004217 1.074777 1.078789
For this model there is likely very little collinearity. Our Pairs graph above shows that our data has very little corrilation. Our subsetting out of the data has allowed us to remove much of the collinearity from our model. The vif score further confirms this result.
Let us now take a look at the ranges of our independent variables (Gender only has 2 levels, male and female):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22.00 32.00 43.00 42.82 53.00 64.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 40.00 40.00 41.04 46.00 84.00
From these summaries above we can see that any person between the ages of 22 and 64 who works between 1 and 84 hours a week we can potentially make a prediction about. However, because we have so few observations past 60 hours, any prediction we make out there will need very large prediction intervals and should only be used as a directional prediction.
Now that we have a model, let us take a look at a person similar to me. I a 32 year old white male, with a college degree who works 40 hours per week.
## fit lwr upr
## 1 1413.932 399.535 2428.328
Based upon the inputs above, on average we would predict that a 32 year old white bachelor degree male working 40 hours a week would make on average $1,413.93 per week. However, that same male could be between anything from $399.54 and $2428.33. I currently make $1,400 per week, placing myself right in the range that I should be at.
However, this model still has a very small percent of the variation in weekly earning explained (12%). The residual standard error is also fairly high (515). Based on these findings there is still other factors that could be analyzed for our model.