GLM Assignment

Question 1: Telecommuting and Income

First, we will look at a model of weekly earnings (income) as it directly relates to whether someone telecommutes or not.

A simple one-way ANOVA provides these results:

##               Df    Sum Sq   Mean Sq F value Pr(>F)    
## telecommute    1 1.451e+08 145093488   346.5 <2e-16 ***
## Residuals   5540 2.320e+09    418730                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A. This model suggests that telecommuting status is associated with income. The p-value is below 2e-16, far under the 0.05 threshold implied by a 95% confidence level, so there is strong evidence that the dependent variable, weekly earnings, differs according to the independent variable, the boolean telecommute variable.
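
As a check on the table, the F value and the share of variance explained follow directly from the printed sums of squares. The second quantity, eta squared, is not part of the original output; it is computed here from the rounded values:

\[F = \frac{MS_{telecommute}}{MS_{residual}} = \frac{145093488}{418730} \approx 346.5\]

\[\eta^{2} = \frac{SS_{telecommute}}{SS_{telecommute} + SS_{residual}} = \frac{1.451 \times 10^{8}}{1.451 \times 10^{8} + 2.320 \times 10^{9}} \approx 0.059\]

The effect is statistically clear, but telecommute status on its own accounts for only about 6% of the variation in weekly earnings.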

B. A boxplot gives a graphical representation showing that the range of income for those who telecommute is higher than the range for those who do not:

C. This should be considered a naive model for a couple of reasons:

Correlation between the variables does not imply causation. Logically, the act of telecommuting is likely not in itself a driver of higher income.
There may be other confounding variables in the data that give the appearance of influence between these two selected variables. In this case, it is more likely that the types of occupation in which telecommuting is an option also tend to pay more.

Question 2: Telecommuting, Income, and Full or Part-Time Status

A second model includes full or part-time status as a variable in addition to the telecommute status.

A two-way ANOVA gives these results:

##                     Df    Sum Sq   Mean Sq F value Pr(>F)    
## telecommute          1 1.451e+08 145093488   394.6 <2e-16 ***
## full_or_part_time    1 2.833e+08 283266408   770.4 <2e-16 ***
## Residuals         5539 2.037e+09    367666                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A. There is logic to this model, as one would assume that full-time employment directly increases income, while a part-time worker would likely have a comparatively lower income.

B. In this model, both variables show significance, as the p-values in the above results are both very low. That is, both telecommute status and full or part-time status are associated with income.

Telecommute status may still be correlated with income, but this relationship is likely mediated in part by the full-time/part-time status variable. If that is the case, then this model is still somewhat naive, because the two predictors are correlated with each other.

C. A boxplot of this model shows the similarity in the spread of earnings for those who telecommute and those who do not. While weekly earnings are higher for full-time workers who telecommute, the similar pattern between the ranges of the two telecommute categories is apparent:

D. Comparing the first model with only telecommute as a variable to this model that includes full or part time status, this new model appears to be a better fit:

##               Df    Sum Sq   Mean Sq F value Pr(>F)    
## telecommute    1 1.451e+08 145093488   346.5 <2e-16 ***
## Residuals   5540 2.320e+09    418730                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##                     Df    Sum Sq   Mean Sq F value Pr(>F)    
## telecommute          1 1.451e+08 145093488   394.6 <2e-16 ***
## full_or_part_time    1 2.833e+08 283266408   770.4 <2e-16 ***
## Residuals         5539 2.037e+09    367666                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comparing these ANOVA results, both terms have very low p-values, and the residual sum of squares falls from 2.320e+09 to 2.037e+09, which indicates a better fit with the additional variable.
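
The size of the improvement can be worked out from the two tables. The figures below are computed here from the rounded sums of squares and are not part of the original output:

\[\frac{RSS_{1} - RSS_{2}}{RSS_{1}} = \frac{2.320 \times 10^{9} - 2.037 \times 10^{9}}{2.320 \times 10^{9}} \approx 0.122\]

\[R^{2}_{2} = \frac{1.451 \times 10^{8} + 2.833 \times 10^{8}}{1.451 \times 10^{8} + 2.833 \times 10^{8} + 2.037 \times 10^{9}} \approx 0.174\]

Adding full or part-time status removes about 12% of the unexplained variation left by the first model, and the two predictors together account for roughly 17% of the variation in weekly earnings.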

Question 3: Simple Regression of Income / Hours Worked

Next is a simple linear regression model for weekly earnings (income) as a function of the number of hours worked.

A. The general equation for this model is: \[\hat Y = b_{0} + b_{1}X_{i}\]

B. Incorporating the above variables, the specific form for this model will be: \[weekly~earnings_{i} = \beta_{0} + \beta_{1}(hours~worked)\]

The results of the regression are:

## 
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1758.0  -411.2  -185.1   230.4  2796.0 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   66.0433    28.5766   2.311   0.0209 *  
## hours_worked  22.5887     0.7072  31.943   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 613 on 5540 degrees of freedom
## Multiple R-squared:  0.1555, Adjusted R-squared:  0.1554 
## F-statistic:  1020 on 1 and 5540 DF,  p-value: < 2.2e-16

From this, the specific equation for the model is:

\[weekly~earnings_{i} = 66.0433071 + 22.5887368(hours~worked)\]

C. This is a naive model for several reasons:

While hours worked may influence weekly earnings under comparable circumstances (same job, same industry), hours worked alone explains only a small share of the variation in income (R-squared of 0.1555).
Hours worked is also likely a function of full-time or part-time status, as one who works full-time is also likely to have a higher number of hours worked.
Hourly or non-hourly status also likely has a strong effect on how hours worked relate to earnings. Non-hourly (salaried) employees are typically in management or professional occupations. They may work close to 40 hours a week but earn significantly more.
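
To make the limits of this model concrete, the fitted equation from part B can be evaluated for a 40-hour week (a value chosen here purely for illustration) and set against the fit statistics in the output:

\[\widehat{weekly~earnings} = 66.0433 + 22.5887(40) \approx 969.59\]

The residual standard error of 613 means a typical observation misses this prediction by roughly $600, and the R-squared of 0.1555 means hours worked accounts for only about 16% of the variation in weekly earnings.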

D. Using only these two variables, this is not a helpful model. The effect of hourly and non-hourly employment described above certainly should be considered. We should expect hours worked to have a greater influence on earnings for hourly workers than for non-hourly workers.

With only the two variables of hours worked and weekly earnings, the model appears as:

E. However, when the hourly/non-hourly status is taken into account, hours worked appears to be more closely related to weekly earnings for hourly workers. In the graphs below, the narrower shaded band in the left panel (hourly) compared with the right panel (non-hourly) shows that the relationship between hours worked and earnings differs between the two categories.
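
The split described above can be sketched in R either as one regression per group or as a single model with an interaction term. The sketch below runs on simulated data, not the telework data: the data frame `sim`, the level names `hourly` and `non_hourly`, and every number in the data-generating step are assumptions made for illustration only.

```r
# Simulated stand-in for the telework data; every value here is made up.
set.seed(42)
n <- 400
sim <- data.frame(
  hours_worked = round(runif(n, 10, 60)),
  hourly_non_hourly = factor(sample(c("hourly", "non_hourly"), n, replace = TRUE))
)

# Assumed pattern: pay tracks hours closely for hourly workers and only
# weakly, with much more noise, for non-hourly (salaried) workers.
is_hourly <- sim$hourly_non_hourly == "hourly"
sim$weekly_earnings <- ifelse(
  is_hourly,
  50 + 20 * sim$hours_worked + rnorm(n, sd = 100),
  1000 + 5 * sim$hours_worked + rnorm(n, sd = 500)
)

# One regression per group, mirroring the two panels.
fit_hourly <- lm(weekly_earnings ~ hours_worked, data = sim,
                 subset = hourly_non_hourly == "hourly")
fit_salaried <- lm(weekly_earnings ~ hours_worked, data = sim,
                   subset = hourly_non_hourly == "non_hourly")

# Single model: the interaction term lets the slope differ by status.
fit_both <- lm(weekly_earnings ~ hours_worked * hourly_non_hourly, data = sim)

# Compare how much of the variation hours worked explains in each group.
r2 <- c(hourly = summary(fit_hourly)$r.squared,
        non_hourly = summary(fit_salaried)$r.squared)
print(round(r2, 3))
print(coef(fit_both))
```

The interaction model reproduces the per-group slopes in one fit, which makes it possible to test whether the difference in slopes is statistically meaningful rather than judging it from the width of the shaded bands.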

Question 4: Simple Regression of Income / Age

Next is a simple linear regression model for weekly earnings (income) as a function of age.

A. The general equation for the model is: \[\hat Y = b_{0} + b_{1}X_{i}\]

B. Incorporating the above variables, the specific form for this model will be: \[weekly~earnings_{i} = \beta_{0} + \beta_{1}(age)\]

The regression results give the estimated form of the model as: \[weekly~earnings_{i} = 548.9457 + 9.1941(age)\]

C. This model is naive for several reasons:

This model may account for the increasing income an employee would experience with age, but it ignores factors of how much work is done such as part-time/full-time status, hours worked, and hourly/non-hourly status.
Geographic location likely has a large influence on earnings, e.g. a 50-year-old in California with the same occupation as a 50-year-old in Wyoming likely has much higher earnings.
Other demographic information such as sex and citizenship is not accounted for in this model.

D. Testing the linearity assumptions of this model can start with the summary of the linear model:

## 
## Call:
## lm(formula = weekly_earnings ~ age, data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1245.3  -445.3  -178.1   284.7  2133.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 548.9457    28.2350   19.44   <2e-16 ***
## age           9.1941     0.6306   14.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 654.6 on 5540 degrees of freedom
## Multiple R-squared:  0.03696,    Adjusted R-squared:  0.03678 
## F-statistic: 212.6 on 1 and 5540 DF,  p-value: < 2.2e-16

These results show deficiencies in the model:

The R-squared is very low at 0.03696, suggesting the model is not a good fit.
The residual standard error is large. At 654.6, it shows wide variability in the residuals, likely caused in part by a number of outliers.
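
Two identities tie these summary figures together in a one-predictor regression. Both values are computed here from the printed output:

\[r = \sqrt{R^{2}} = \sqrt{0.03696} \approx 0.19 \qquad t^{2} = 14.58^{2} \approx 212.6 = F\]

A correlation of about 0.19 between age and weekly earnings is weak, even though with 5542 observations it is easily distinguished from zero.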

Another test is to visualize the residuals against the fitted values:

From this plot, the curve in the trend line suggests that the relationship is not linear, since the residuals should scatter evenly around zero across the fitted values.
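
A residuals-versus-fitted check of this kind can be sketched in R as follows. The data are simulated with a deliberately curved age profile, so the variable names match the report but the numbers do not. The quadratic comparison at the end is an addition for illustration and is not a test used in the original analysis.

```r
# Simulated data only: earnings rise and then flatten with age, so a
# straight line is deliberately the wrong shape.
set.seed(7)
n <- 500
age <- round(runif(n, 18, 80))
weekly_earnings <- 200 + 50 * age - 0.4 * age^2 + rnorm(n, sd = 150)

fit_linear <- lm(weekly_earnings ~ age)

# Lowess smoother through the residuals; a flat line near zero would
# support linearity, a bowed one argues against it.
smooth_line <- lowess(fitted(fit_linear), resid(fit_linear))

# Draw the plot only in an interactive session.
if (interactive()) {
  plot(fitted(fit_linear), resid(fit_linear),
       xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)
  lines(smooth_line, col = "red")
}

# Numeric companion to the plot: does adding a squared term help?
fit_quadratic <- lm(weekly_earnings ~ age + I(age^2))
curvature_test <- anova(fit_linear, fit_quadratic)
print(curvature_test)
```

A small p-value in the comparison indicates that the squared term captures structure the straight line misses, which is the same message a bowed smoother gives visually.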

E. Several other concerns become apparent from the plot of this model:

There appears to be a cap or placeholder value for weekly earnings at around $2800. Surely some employees make more than this amount. The use of a placeholder or a cap skews the data towards that value. Data should be collected for all income levels, or the capped observations should be removed or explicitly accounted for in the model.
There is no data between the ages of 80 and about 85. Every other age is represented. The data is therefore skewed away from this range. If data is not available for this range, the best solution would be to remove or filter out the data for employees at age 85 and qualify the data as only up to 80.
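
One way to check for a cap like the one described in the first point is to measure how many observations sit exactly at the maximum. The sketch below uses simulated earnings with an assumed cap of 2800 (the report only says "around $2800"); the data frame `sim` and all distribution parameters are hypothetical.

```r
# Simulated earnings, top-coded at an assumed cap of 2800 to mimic the plot.
set.seed(3)
cap <- 2800
earnings_true <- rlnorm(1000, meanlog = 6.8, sdlog = 0.7)
sim <- data.frame(
  age = sample(18:80, 1000, replace = TRUE),
  weekly_earnings = pmin(earnings_true, cap)
)

# Share of observations sitting exactly at the maximum; a visible spike
# suggests top-coding rather than genuine earnings.
at_cap <- sim$weekly_earnings == max(sim$weekly_earnings)
share_at_cap <- mean(at_cap)
print(share_at_cap)

# One simple option: refit without the capped rows and compare coefficients.
fit_all <- lm(weekly_earnings ~ age, data = sim)
fit_uncapped <- lm(weekly_earnings ~ age, data = sim[!at_cap, ])
print(rbind(all = coef(fit_all), uncapped = coef(fit_uncapped)))
```

Dropping the capped rows is itself a form of truncation, so the comparison is best read as a sensitivity check. A censored-regression model would be the more principled way to account for the cap.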

Question 5: Linearity with Multiple Variables

This final model will be a regression of weekly earnings with the following variables:

X₁ = age
X₂ = sex
X₃ = full-time/part-time status
X₄ = hourly/non-hourly status

A. The regression equation is: \[\hat Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \beta_4X_4\]

The results for this model are:

## 
## Call:
## lm(formula = weekly_earnings ~ age + sex + full_or_part_time + 
##     hourly_non_hourly, data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1396.8  -349.4  -105.7   202.8  2558.1 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         611.7809    25.5655   23.93   <2e-16 ***
## age                   6.6874     0.5421   12.34   <2e-16 ***
## sex2               -179.9994    15.1774  -11.86   <2e-16 ***
## full_or_part_time2 -493.1572    21.8289  -22.59   <2e-16 ***
## hourly_non_hourly2  479.0939    15.6047   30.70   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 557.5 on 5537 degrees of freedom
## Multiple R-squared:  0.3017, Adjusted R-squared:  0.3012 
## F-statistic: 598.2 on 4 and 5537 DF,  p-value: < 2.2e-16

B. From the results of the regression, the model becomes:

\[weekly\ earnings = 611.7809 + 6.6874(age) - 179.9994(sex) - 493.1572(full/part\ status) + 479.0939(hourly/nonhourly\ status)\]

C. There does not appear to be any obvious collinearity: a variance inflation factor (VIF) test returns a low value for every variable:

##               age               sex full_or_part_time hourly_non_hourly 
##          1.018760          1.026135          1.064558          1.061522
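
The variance inflation factor for predictor j is defined from the R-squared obtained when that predictor is regressed on the other predictors. Inverting the largest printed value shows how little overlap there is (the R-squared figure is computed here and is not part of the original output):

\[VIF_{j} = \frac{1}{1 - R_{j}^{2}} \quad\Rightarrow\quad R_{j}^{2} = 1 - \frac{1}{VIF_{j}} = 1 - \frac{1}{1.064558} \approx 0.061\]

Only about 6% of the variation in full-time/part-time status is explained by the other three predictors. Values close to 1 indicate little collinearity; common rules of thumb raise concern only at values of 5 or 10.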

D. Plausible ranges for the coefficients can be taken from the 95% confidence intervals of the linear model:

##                          2.5 %      97.5 %
## (Intercept)         561.662481  661.899371
## age                   5.624653    7.750082
## sex2               -209.753148 -150.245677
## full_or_part_time2 -535.950431 -450.363994
## hourly_non_hourly2  448.502523  509.685339
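
Each interval is the estimate plus or minus a critical t value times its standard error. With 5537 residual degrees of freedom the critical value for a 95% interval is about 1.96, so the age row can be reproduced approximately from the coefficient table:

\[\hat\beta_{1} \pm t_{0.975,\,5537} \cdot SE(\hat\beta_{1}) = 6.6874 \pm 1.96 \times 0.5421 \approx (5.62,\ 7.75)\]

None of the five intervals contains zero, which agrees with the very small p-values in the regression output.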

E. A hypothetical observation might be a person who is 40 years old, male, works full-time, and is non-hourly. Using this model, the equation for this observation would be:

\[611.7809 + 6.6874 * age - 179.9994 * sex - 493.1572 * full\_or\_part\_time + 479.0939 * hourly\_non\_hourly\]

From this model, the predicted weekly earnings for this observation would be $671.15.

Using the ranges above, the lower limit for weekly earnings would be $402 and the upper limit would be $940.30.
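
An alternative to combining the coefficient ranges by hand is to let `predict()` compute an interval for the observation directly. This is a different technique from the one used above, shown only as a sketch on simulated data. The data frame `sim`, the coefficients in the data-generating step, and the coding of the factor levels (1 = male, 1 = full-time, 2 = non-hourly) are assumptions, since the original output does not show the codebook.

```r
# Simulated stand-in for the telework data; all values are made up.
set.seed(11)
n <- 600
sim <- data.frame(
  age = sample(18:80, n, replace = TRUE),
  sex = factor(sample(1:2, n, replace = TRUE)),
  full_or_part_time = factor(sample(1:2, n, replace = TRUE)),
  hourly_non_hourly = factor(sample(1:2, n, replace = TRUE))
)
sim$weekly_earnings <- 600 + 7 * sim$age -
  180 * (sim$sex == 2) -
  490 * (sim$full_or_part_time == 2) +
  480 * (sim$hourly_non_hourly == 2) +
  rnorm(n, sd = 550)

fit <- lm(weekly_earnings ~ age + sex + full_or_part_time + hourly_non_hourly,
          data = sim)

# Hypothetical observation: age 40, with ASSUMED level coding
# (sex 1 = male, full_or_part_time 1 = full-time, hourly_non_hourly 2 = non-hourly).
new_obs <- data.frame(
  age = 40,
  sex = factor(1, levels = 1:2),
  full_or_part_time = factor(1, levels = 1:2),
  hourly_non_hourly = factor(2, levels = 1:2)
)

# Interval for the mean earnings of such people, and for one new person.
pred_mean <- predict(fit, newdata = new_obs, interval = "confidence")
pred_new <- predict(fit, newdata = new_obs, interval = "prediction")
print(pred_mean)
print(pred_new)
```

The confidence interval describes uncertainty about the average earnings of people with these characteristics, while the wider prediction interval describes the plausible range for a single individual. Which one is appropriate depends on the question being asked.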