BAIS 660 GLM Homework Clifford Gammon

In this homework we will begin to take a look at ANOVAs and GLMs. Data for this assignment can be found at http://asayanalytics.com/telework_csv. The data is a compilation of US Census data that has been cleaned up for use in this assignment.

Question 1: One-way ANOVA

Let us take a simple look at how telecommuting impacts a person’s weekly income.
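The Tukey output below can be reproduced with a fit along these lines (a sketch; the formula matches the Fit line shown in the output, while the object name telecommute.aov is my own):

# One-way ANOVA of weekly earnings on telecommuting status
telecommute.aov <- aov(weekly_earnings ~ telecommute.text, data = census.data)

# Tukey HSD pairwise comparison of the group means
TukeyHSD(telecommute.aov)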

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = weekly_earnings ~ telecommute.text, data = census.data)
## 
## $telecommute.text
##            diff      lwr      upr p adj
## Yes-No 350.7614 313.8213 387.7015     0

Based on our ANOVA model, we would expect the average telecommuting worker to make about $351 more per week than a non-telecommuting worker. We can see from the Tukey HSD test that there is a statistically significant difference between the two types of workers.

We can visually see this difference in the following graph:
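The figure is not reproduced here; a boxplot along these lines would show the gap between the two groups (a sketch using base R graphics):

# Boxplot of weekly earnings by telecommuting status
boxplot(weekly_earnings ~ telecommute.text, data = census.data,
        xlab = "Telecommutes", ylab = "Weekly earnings ($)")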

Even though this model shows that a person who telecommutes earns a statistically significantly higher weekly income than a non-telecommuting worker, it fails to take many potential factors into account. One is the type of industry a person works in (someone who telecommutes is more likely to work in a professional environment, while someone who does not is more likely to be a non-skilled worker). The model also fails to take into account the person’s level of education, as people with college degrees tend to make more money than those without.

Question 2: Two-way ANOVA

In an effort to continue to understand what drives weekly income, let us now look at how both telecommuting and gender impact it:
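This two-way fit and its Tukey comparisons can be produced along these lines (a sketch; the formula matches the Fit line below, and the object name two.way.aov is my own):

# Two-way ANOVA: weekly earnings by telecommuting status and gender
two.way.aov <- aov(weekly_earnings ~ telecommute.text + sex.word, data = census.data)
TukeyHSD(two.way.aov)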

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = weekly_earnings ~ telecommute.text + sex.word, data = census.data)
## 
## $telecommute.text
##            diff      lwr      upr p adj
## Yes-No 350.7614 314.4721 387.0508     0
## 
## $sex.word
##                 diff      lwr      upr p adj
## Male-Female 242.1472 208.6566 275.6378     0

From this model we can see that telecommuting workers make about $351 more per week than non-telecommuters, and that a female worker makes about $242 less per week than her male counterpart.

Visually these impacts on weekly income look like this:

Finally, let us see whether our second model does a better job of explaining the variation in weekly income than the first (i.e., whether adding gender to telecommuting improves the fit):
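Assuming the one-way and two-way fits are stored in objects like telecommute.aov and two.way.aov from the earlier sketches, the nested comparison below can be run with anova():

# Compare the nested models: does adding sex.word significantly reduce the
# residual sum of squares?
anova(telecommute.aov, two.way.aov)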

## Analysis of Variance Table
## 
## Model 1: weekly_earnings ~ telecommute.text
## Model 2: weekly_earnings ~ telecommute.text + sex.word
##   Res.Df        RSS Df Sum of Sq      F    Pr(>F)    
## 1   5540 2319766548                                  
## 2   5539 2238348491  1  81418057 201.48 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the output we can see a statistically significant reduction in the residual sum of squares for the second model versus the first (i.e., the second model better explains the differences in weekly income).

While weekly income is better explained by the second model, there are still several factors that could have an impact on weekly income (e.g., industry, education level, type of job, etc.).

Question 3: GLM 1 Independent Variable

Let us now take a look at a linear model and how the number of hours worked impacts weekly income. Mathematically, we can express a linear model in the following form:

\[WeeklyIncome_{i}=\beta_{0} + \beta_{1}HoursWorked_{i}+\epsilon_{i}\]

where \(\beta_{0}\) is a constant (the y intercept) and \(\epsilon_{i}\) is an error term.

For our fitted model the equation can be written as:

\[WeeklyIncome_{i}=66.04 + 22.59\,HoursWorked_{i}+\epsilon_{i}\]

The fitted lm is summarized in the R output below.
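A call along these lines would reproduce that summary (a sketch; the formula matches the Call line, and the object name hours.model is my own):

# Simple linear regression of weekly earnings on hours worked
hours.model <- lm(weekly_earnings ~ hours_worked, data = census.data)
summary(hours.model)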

## 
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = census.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1758.0  -411.2  -185.1   230.4  2796.0 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   66.0433    28.5766   2.311   0.0209 *  
## hours_worked  22.5887     0.7072  31.943   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 613 on 5540 degrees of freedom
## Multiple R-squared:  0.1555, Adjusted R-squared:  0.1554 
## F-statistic:  1020 on 1 and 5540 DF,  p-value: < 2.2e-16

Based on this output, we can see that the number of hours worked does have a statistically significant effect on weekly income. We can also see that for every additional hour worked we expect a $22.59 increase in weekly pay. The output also shows that about 16% of the variation in weekly income is explained by the number of hours a person works, and that the residual standard error of our model is around $613 per week.

While this model does help explain around 16% of the variation in weekly income, it still doesn’t take into account level of education, gender, or race.

If we take a closer look at the model’s diagnostic plots, we can see that it does not do an excellent job of capturing how weekly income varies:
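The diagnostic plots referenced here are not reproduced; they can be generated along these lines, assuming the hours.model object from the earlier sketch:

# Standard lm diagnostics: residuals vs. fitted, normal Q-Q, scale-location,
# and residuals vs. leverage, shown on one page
par(mfrow = c(2, 2))
plot(hours.model)
par(mfrow = c(1, 1))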

If we incorporate the recommendations above (education, gender, and race), the following model is created:
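The expanded fit can be produced with a call along these lines (a sketch; the formula matches the Call line below, and the object name full.model is my own):

# Add education, gender, and race as categorical predictors alongside hours worked
full.model <- lm(weekly_earnings ~ hours_worked + as.factor(education) +
                   sex.word + as.factor(race), data = census.data)
summary(full.model)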

## 
## Call:
## lm(formula = weekly_earnings ~ hours_worked + as.factor(education) + 
##     sex.word + as.factor(race), data = census.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1781.0  -336.6  -101.4   207.4  2745.2 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -211.7033   223.6891  -0.946 0.343977    
## hours_worked             19.0812     0.6477  29.461  < 2e-16 ***
## as.factor(education)32 -241.8049   329.8095  -0.733 0.463489    
## as.factor(education)33 -114.2745   247.5396  -0.462 0.644357    
## as.factor(education)34 -114.1348   254.9862  -0.448 0.654451    
## as.factor(education)35 -104.6667   241.5775  -0.433 0.664840    
## as.factor(education)36   -0.5719   233.8553  -0.002 0.998049    
## as.factor(education)37   -9.4669   230.5358  -0.041 0.967246    
## as.factor(education)38  -74.6092   239.6286  -0.311 0.755544    
## as.factor(education)39  121.0830   222.7948   0.543 0.586826    
## as.factor(education)40  168.2064   222.9247   0.755 0.450555    
## as.factor(education)41  230.7979   224.6322   1.027 0.304255    
## as.factor(education)42  304.2107   224.1146   1.357 0.174713    
## as.factor(education)43  573.2131   222.7541   2.573 0.010099 *  
## as.factor(education)44  767.3268   223.4553   3.434 0.000599 ***
## as.factor(education)45 1034.3799   228.9592   4.518 6.38e-06 ***
## as.factor(education)46  923.8269   228.3151   4.046 5.27e-05 ***
## sex.wordMale            163.9762    15.0648  10.885  < 2e-16 ***
## as.factor(race)2       -145.3370    23.9043  -6.080 1.28e-09 ***
## as.factor(race)3         -3.9834    71.4993  -0.056 0.955573    
## as.factor(race)4        -16.6237    34.2584  -0.485 0.627522    
## as.factor(race)5        -54.8574    89.9543  -0.610 0.541995    
## as.factor(race)6       -115.5352   107.0931  -1.079 0.280711    
## as.factor(race)7        138.2888   113.8896   1.214 0.224710    
## as.factor(race)8        152.4390   140.8376   1.082 0.279134    
## as.factor(race)9       -198.4867   192.7541  -1.030 0.303177    
## as.factor(race)10       -91.9921   192.6509  -0.478 0.633020    
## as.factor(race)11      -104.7986   544.1356  -0.193 0.847282    
## as.factor(race)12      -176.6050   544.0587  -0.325 0.745491    
## as.factor(race)13        67.5411   544.0796   0.124 0.901210    
## as.factor(race)16      -373.9605   314.3578  -1.190 0.234254    
## as.factor(race)21      1411.2649   544.0705   2.594 0.009515 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 543.8 on 5510 degrees of freedom
## Multiple R-squared:  0.3389, Adjusted R-squared:  0.3352 
## F-statistic: 91.12 on 31 and 5510 DF,  p-value: < 2.2e-16

While this output is an eyesore, we can see that almost 34% of the variation in weekly income is now being explained, and the model as a whole is statistically significant. We might also want to consider interaction effects.

Question 4: GLM 2

Now let us see how weekly earnings change as a function of age. The general form of our model is:

\[WeeklyIncome_{i}=\beta_{0} + \beta_{1}Age_{i}+\epsilon_{i}\]
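The fit summarized below can be produced along these lines (a sketch; the object name age.model is my own):

# Simple linear regression of weekly earnings on age
age.model <- lm(weekly_earnings ~ age, data = census.data)
summary(age.model)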

## 
## Call:
## lm(formula = weekly_earnings ~ age, data = census.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1245.3  -445.3  -178.1   284.7  2133.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 548.9457    28.2350   19.44   <2e-16 ***
## age           9.1941     0.6306   14.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 654.6 on 5540 degrees of freedom
## Multiple R-squared:  0.03696,    Adjusted R-squared:  0.03678 
## F-statistic: 212.6 on 1 and 5540 DF,  p-value: < 2.2e-16

The specific form of our model is:

\[WeeklyIncome_{i}=548.95 + 9.19\,Age_{i}+\epsilon_{i}\]

From this model we expect average weekly income to increase by $9.19 for each additional year of age. The effect is statistically significant, and the model as a whole is statistically significant, but the R-squared is very small (about 0.037) and the residual standard error is large (about $655 per week).

This model is also not complete. While we have taken into account age/experience, we have failed to take into account industry, gender and hours worked.

As we check our model assumptions, let us take a look at a Q-Q plot of the studentized residuals:
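The plot itself is not reproduced here; it can be generated along these lines, assuming the age.model object from the earlier sketch:

# Normal Q-Q plot of the studentized residuals from the age model
stud.res <- rstudent(age.model)
qqnorm(stud.res, main = "Studentized residuals")
qqline(stud.res)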

Based on the plot above we can see that our model has some curvature, which indicates some bias in our data. The bias we are seeing is likely due to younger professionals with high-paying jobs and experienced individuals who do not make much (e.g., a semi-retired professional who has gone back to work as a bus driver to stay busy).

There is also some concern that the relationship is not linear but follows some other form (e.g., uniform, bivariate, Weibull, etc.). To check this we will need to do a scatter plot of our data. The Census Bureau also caps the weekly income it collects, causing a pile-up of observations at roughly $3,000+ in weekly earnings. Because we cannot recollect this data, we should consider removing those records from the dataset and making note of that in our write-up. There are also a number of individuals who are working well beyond retirement age (i.e., past 65). The model also doesn’t take into consideration the different levels of education or a person’s gender, and we have a very low representation of minorities. The graphs below help us understand some of the concerns with this data and its corresponding model:

Question 5: Your Enhanced GLM

In the previous question we identified some concerns that we had with the data. We will now take a moment to clean up our data and run the model again.

For this model we will remove anyone whose weekly income is over $2,750 or who is over the age of 65, and we will keep only workers who hold a bachelor’s degree (and no higher) and are white.
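A subsetting and refitting step along these lines would produce census.data.2 and the summary below (a sketch; the cutoffs come from the text above, while the education and race codes and the object name clean.model are assumptions on my part):

# Subset the data per the criteria above
# NOTE: education == 43 (bachelor's degree) and race == 1 (white) are assumed
# Census codes; verify them against the codebook before relying on this
census.data.2 <- subset(census.data,
                        weekly_earnings <= 2750 &
                        age <= 65 &
                        education == 43 &
                        race == 1)

# Refit the earnings model on the cleaned subset
clean.model <- lm(weekly_earnings ~ age + as.factor(sex) + hours_worked,
                  data = census.data.2)
summary(clean.model)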

## 
## Call:
## lm(formula = weekly_earnings ~ age + as.factor(sex) + hours_worked, 
##     data = census.data.2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1305.8  -326.5  -104.8   340.7  1754.1 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      596.633    129.449   4.609 5.10e-06 ***
## age                6.850      1.896   3.613 0.000332 ***
## as.factor(sex)2  -69.184     46.849  -1.477 0.140348    
## hours_worked      14.952      2.094   7.140 3.17e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 515 on 519 degrees of freedom
## Multiple R-squared:  0.1229, Adjusted R-squared:  0.1178 
## F-statistic: 24.24 on 3 and 519 DF,  p-value: 1.075e-14

The general form of our model is:

\[WeeklyIncome_{i}=\beta_{0} + \beta_{1}Age_{i}+\beta_{2}Female_{i}+\beta_{3}HoursWorked_{i}+\epsilon_{i}\]

The specific form of our model is:

\[WeeklyIncome_{i}=596.63 + 6.85\,Age_{i} - 69.18\,Female_{i}+14.95\,HoursWorked_{i}+\epsilon_{i}\]

where \(Female_{i}\) is 1 when sex is coded 2 and 0 when it is coded 1 (male).
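The variance inflation factors shown below can be computed with the car package’s vif() function (a sketch, assuming the car package is installed and the clean.model object from the earlier sketch):

# Check for multicollinearity among age, sex, and hours worked
library(car)
vif(clean.model)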

##            age as.factor(sex)   hours_worked 
##       1.004217       1.074777       1.078789

For this model there is likely very little collinearity. Our pairs graph above shows that the predictors have very little correlation with one another, and subsetting the data has removed much of the potential collinearity from our model. The VIF scores further confirm this result.

Let us now take a look at the ranges of our independent variables (Gender only has 2 levels, male and female):
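The two summaries below (age first, then hours worked) can be produced with summary() on the subset data:

# Ranges of the numeric predictors in the cleaned subset
summary(census.data.2$age)
summary(census.data.2$hours_worked)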

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22.00   32.00   43.00   42.82   53.00   64.00

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   40.00   40.00   41.04   46.00   84.00

From the summaries above, we can potentially make a prediction for any person between the ages of 22 and 64 who works between 1 and 84 hours a week. However, because we have so few observations past 60 hours, any prediction in that range will need very large prediction intervals and should only be used as a directional prediction.

Now that we have a model, let us take a look at a person similar to me. I am a 32-year-old white male with a college degree who works 40 hours per week.
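The prediction interval below can be generated with predict() (a sketch; new.person is a name I am assuming, and sex = 1 is assumed to code male, consistent with the model output above):

# Predict weekly earnings for a 32-year-old male working 40 hours per week,
# with a 95% prediction interval
new.person <- data.frame(age = 32, sex = 1, hours_worked = 40)
predict(clean.model, newdata = new.person, interval = "prediction")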

##        fit     lwr      upr
## 1 1413.932 399.535 2428.328

Based on the inputs above, we would predict that a 32-year-old white male with a bachelor’s degree working 40 hours a week would make, on average, $1,413.93 per week. However, that same male could make anywhere from $399.54 to $2,428.33. I currently make $1,400 per week, which places me right in the range the model says I should be in.

However, this model still explains a very small percentage of the variation in weekly earnings (about 12%), and the residual standard error is fairly high ($515). Based on these findings, there are still other factors that could be added to our model.