First, we will look at a model of weekly earnings (income) as it directly relates to whether someone telecommutes or not.
A simple one-way ANOVA provides these results:
## Df Sum Sq Mean Sq F value Pr(>F)
## telecommute 1 1.451e+08 145093488 346.5 <2e-16 ***
## Residuals 5540 2.320e+09 418730
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
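The F value in this table is the ratio of the telecommute mean square to the residual mean square, and the sums of squares also show how much of the total variation in weekly earnings is attributed to telecommute status. The sketch below is only an arithmetic check on the printed numbers; the variable names are illustrative and the inputs are the rounded values from the table, so results match only up to rounding.

```r
# Values copied from the one-way ANOVA table (rounded as printed).
ss_telecommute <- 1.451e+08   # Sum Sq, telecommute
ss_residual    <- 2.320e+09   # Sum Sq, Residuals
ms_telecommute <- 145093488   # Mean Sq, telecommute
ms_residual    <- 418730      # Mean Sq, Residuals

# The F value is the ratio of the two mean squares.
f_value <- ms_telecommute / ms_residual

# Share of the total variation in weekly earnings attributed to telecommute.
share_explained <- ss_telecommute / (ss_telecommute + ss_residual)

print(round(f_value, 1))
print(round(share_explained, 3))
```

The ratio reproduces the reported F value of 346.5, while the share of variation works out to roughly 6%. A very small p-value therefore does not by itself mean that telecommute status explains most of the differences in earnings.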
A. This model suggests that whether or not someone telecommutes is associated with their income. The resulting p-value, less than 2e-16, is far below the 0.05 threshold implied by a 95% confidence level, so we reject the null hypothesis that mean weekly earnings are the same for both groups. In other words, there is strong evidence that the dependent variable, weekly earnings, differs by the status of the independent variable, the boolean telecommute variable.
B. A boxplot gives a graphical representation showing that the range of income for those who telecommute is higher than the range for those who do not:
C. This should be considered a naive model for a couple of reasons:
A second model includes full or part-time status as a variable in addition to the telecommute status.
A two-way ANOVA gives these results:
## Df Sum Sq Mean Sq F value Pr(>F)
## telecommute 1 1.451e+08 145093488 394.6 <2e-16 ***
## full_or_part_time 1 2.833e+08 283266408 770.4 <2e-16 ***
## Residuals 5539 2.037e+09 367666
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A. There is logic to this model: one would expect full-time employment to be associated with higher income, while a part-time worker would likely have a comparatively lower income.
B. In this model, both variables are significant, as the p-values in the above results are both very low. That is, both telecommute status and full or part-time status are associated with income.
Concerning the telecommute status, however, there may still be some correlation with income, but this is likely mediated by the full-time/part-time status variable. If that is the case, then this model is still somewhat naive, because the two predictor variables are correlated with each other (collinearity).
C. A boxplot of this model shows that the spread of earnings is similar for those who telecommute and those who do not. While weekly earnings are higher for full-time workers who telecommute, the similar pattern between the ranges of the two telecommute categories is apparent:
D. Comparing the first model with only telecommute as a variable to this model that includes full or part time status, this new model appears to be a better fit:
## Df Sum Sq Mean Sq F value Pr(>F)
## telecommute 1 1.451e+08 145093488 346.5 <2e-16 ***
## Residuals 5540 2.320e+09 418730
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## telecommute 1 1.451e+08 145093488 394.6 <2e-16 ***
## full_or_part_time 1 2.833e+08 283266408 770.4 <2e-16 ***
## Residuals 5539 2.037e+09 367666
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Comparing these ANOVA results, both predictors have very low p-values, and the residual sum of squares falls from 2.320e+09 to 2.037e+09 when full or part-time status is added, which indicates a better fit with the additional variable.
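The improvement in fit can be quantified directly from the two ANOVA tables. The following is a minimal sketch that uses only the printed (rounded) sums of squares; the variable names are illustrative, and the total sum of squares is taken as the telecommute plus residual sums of squares from the first table.

```r
# Sums of squares copied from the two ANOVA tables (rounded as printed).
ss_telecommute <- 1.451e+08   # telecommute, same in both tables
rss_model1     <- 2.320e+09   # Residuals, telecommute only
rss_model2     <- 2.037e+09   # Residuals, telecommute + full_or_part_time

# Total variation in weekly earnings, reconstructed from the first table.
ss_total <- ss_telecommute + rss_model1

# Proportion of variation explained by each model.
r2_model1 <- 1 - rss_model1 / ss_total
r2_model2 <- 1 - rss_model2 / ss_total

# Partial F statistic for adding full_or_part_time:
# one extra parameter, 5539 residual degrees of freedom in the larger model.
f_added <- ((rss_model1 - rss_model2) / 1) / (rss_model2 / 5539)

print(round(c(model1 = r2_model1, model2 = r2_model2), 3))
print(round(f_added))
```

The explained share rises from about 6% to about 17%, and the partial F statistic is close to the 770.4 reported for full_or_part_time, with the small difference coming from rounding in the printed sums of squares. If the fitted model objects are available, `anova(model1, model2)` performs the same comparison; `model1` and `model2` are hypothetical object names.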
Next is a simple linear regression model for weekly earnings (income) as a function of the number of hours worked.
A. The general equation for this model is: \[\hat Y_{i} = b_{0} + b_{1}X_{i}\]
B. Incorporating the above variables, the specific form for this model will be: \[weekly~earnings_{i} = \beta_{0} + \beta_{1}(hours~worked)\]
The results of the regression are:
##
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = telework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1758.0 -411.2 -185.1 230.4 2796.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.0433 28.5766 2.311 0.0209 *
## hours_worked 22.5887 0.7072 31.943 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 613 on 5540 degrees of freedom
## Multiple R-squared: 0.1555, Adjusted R-squared: 0.1554
## F-statistic: 1020 on 1 and 5540 DF, p-value: < 2.2e-16
From this, the specific equation for the model is:
\[weekly~earnings_{i} = {66.0433071} + {22.5887368}(hours~ worked)\]
C. This is a naive model for several reasons:
D. Using only these two variables, this is not a helpful model. The effect of hourly versus non-hourly employment stated above certainly should be considered. We should expect that hours worked would have a greater influence on the earnings of hourly workers than on those of non-hourly workers.
With only the two variables of hours worked and weekly earnings, the model appears as:
E. However, when accounting for the hourly/non-hourly status, the models show that hours worked appears to be more closely related to weekly earnings for hourly workers. In the graphs below, the narrower shaded area in the left visualization (hourly) as compared to the right (non-hourly) shows that earnings in these two categories respond differently to hours worked.
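To make the fitted hours-worked equation concrete, the coefficients from the regression output can be applied to a hypothetical 40-hour week. The 40 hours is an illustrative assumption, not a value taken from the data:

\[weekly~earnings = 66.0433 + 22.5887(40) = 66.0433 + 903.548 = 969.5913\]

Under this model, each additional hour worked is associated with roughly $22.59 more in weekly earnings. The R-squared of 0.1555 indicates that hours worked alone accounts for only about 16% of the variation in earnings.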
Next is a simple linear regression model for weekly earnings (income) as a function of age.
A. The general equation for the model is: \[\hat Y_{i} = b_{0} + b_{1}X_{i}\]
B. Incorporating the above variables, the specific form for this model will be: \[weekly~earnings_{i} = \beta_{0} + \beta_{1}(age)\]
These results give the estimated form of the regression as: \[weekly~earnings_{i} = 548.9457 + 9.1941(age)\]
C. This model is naive for several reasons:
D. Testing the linearity assumptions of this model can start with the summary of the linear model:
##
## Call:
## lm(formula = weekly_earnings ~ age, data = telework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1245.3 -445.3 -178.1 284.7 2133.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 548.9457 28.2350 19.44 <2e-16 ***
## age 9.1941 0.6306 14.58 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 654.6 on 5540 degrees of freedom
## Multiple R-squared: 0.03696, Adjusted R-squared: 0.03678
## F-statistic: 212.6 on 1 and 5540 DF, p-value: < 2.2e-16
These results show deficiencies with the model:
Another test is to visualize the residuals against the fitted values:
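From this plot, the curve of the line shows that the residuals vary systematically with the fitted values, which suggests that a straight line is not a good fit for the relationship between age and weekly earnings.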
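A residuals-versus-fitted plot of this kind can be drawn with a few lines of base R. The sketch below uses simulated data only, because the telework data are not reproduced here. The simulated earnings curve, which rises and then flattens with age, is a hypothetical assumption chosen to show what curvature in the residuals looks like, and the object names are illustrative.

```r
# Simulated data only: a hypothetical earnings curve that rises and then flattens with age.
set.seed(1)
age      <- runif(500, min = 18, max = 70)
earnings <- 200 + 40 * age - 0.4 * age^2 + rnorm(500, sd = 100)

# Fit a straight line even though the simulated relationship is curved.
fit <- lm(earnings ~ age)

# Residuals against fitted values, with a smoothed trend line.
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted (simulated data)")
lines(lowess(fitted(fit), resid(fit)), col = "red")
abline(h = 0, lty = 2)  # reference line: residuals should scatter evenly around zero
```

When the linear form is adequate, the smoothed line stays close to the dashed zero line; a pronounced bend, as in this simulated example, points to a relationship that a straight line does not capture. For a fitted model object, `plot(fit, which = 1)` produces a similar diagnostic plot.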
E. Other concerns become apparent looking at the plot of this model:
A. The regression equation is: \[\hat Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \beta_4X_4\]
The results for this model are:
##
## Call:
## lm(formula = weekly_earnings ~ age + sex + full_or_part_time +
## hourly_non_hourly, data = telework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1396.8 -349.4 -105.7 202.8 2558.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 611.7809 25.5655 23.93 <2e-16 ***
## age 6.6874 0.5421 12.34 <2e-16 ***
## sex2 -179.9994 15.1774 -11.86 <2e-16 ***
## full_or_part_time2 -493.1572 21.8289 -22.59 <2e-16 ***
## hourly_non_hourly2 479.0939 15.6047 30.70 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 557.5 on 5537 degrees of freedom
## Multiple R-squared: 0.3017, Adjusted R-squared: 0.3012
## F-statistic: 598.2 on 4 and 5537 DF, p-value: < 2.2e-16
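The overall F-statistic in this output is tied to the R-squared and the degrees of freedom. As a rough arithmetic check using only the printed values (variable names are illustrative, and the rounded R-squared means the result matches only approximately):

```r
# Values copied from the regression summary (rounded as printed).
r_squared   <- 0.3017
df_model    <- 4      # number of predictors
df_residual <- 5537   # residual degrees of freedom

# Overall F statistic expressed through R-squared.
f_stat <- (r_squared / df_model) / ((1 - r_squared) / df_residual)

print(round(f_stat, 1))
```

The result is close to the reported 598.2. The R-squared itself says that the four predictors together account for about 30% of the variation in weekly earnings.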
B. From the results of the regression, the model becomes:
\[weekly\ earnings = 611.7809 + 6.6874(age) - 179.9994(sex) - 493.1572(full/part\ status) + 479.0939(hourly/nonhourly\ status)\]
C. There does not appear to be any obvious collinearity, as shown by the variance inflation factors (VIF), which are low for all of the variables:
## age sex full_or_part_time hourly_non_hourly
## 1.018760 1.026135 1.064558 1.061522
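A variance inflation factor is defined as 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing predictor j on the other predictors. Inverting that relationship shows how little each predictor is explained by the others. This is a small sketch on the printed VIF values; the vector names are illustrative.

```r
# VIF values copied from the output above.
vif_values <- c(age = 1.018760, sex = 1.026135,
                full_or_part_time = 1.064558, hourly_non_hourly = 1.061522)

# VIF_j = 1 / (1 - R_j^2), so R_j^2 = 1 - 1 / VIF_j.
r2_from_others <- 1 - 1 / vif_values

print(round(r2_from_others, 3))
```

Each predictor has less than about 7% of its variation explained by the other three, which is consistent with the reading that collinearity is not a concern in this model.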
D. A plausible range for each coefficient is given by its 95% confidence interval from the linear model:
## 2.5 % 97.5 %
## (Intercept) 561.662481 661.899371
## age 5.624653 7.750082
## sex2 -209.753148 -150.245677
## full_or_part_time2 -535.950431 -450.363994
## hourly_non_hourly2 448.502523 509.685339
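These intervals follow from the coefficient table: each is the estimate plus or minus a t critical value times the standard error. The sketch below rebuilds them from the printed estimates and standard errors; the object names are illustrative, and small differences from the table are due to rounding of the printed inputs.

```r
# Estimates and standard errors copied from the regression summary.
estimate <- c("(Intercept)" = 611.7809, age = 6.6874, sex2 = -179.9994,
              full_or_part_time2 = -493.1572, hourly_non_hourly2 = 479.0939)
std_error <- c(25.5655, 0.5421, 15.1774, 21.8289, 15.6047)

# Two-sided 95% critical value with the model's residual degrees of freedom.
t_crit <- qt(0.975, df = 5537)

# Confidence limits: estimate -/+ t_crit * standard error.
ci <- cbind(lower = estimate - t_crit * std_error,
            upper = estimate + t_crit * std_error)

print(round(ci, 3))
```

With the fitted model object available, `confint(model)` returns the same table directly; `model` is a hypothetical object name.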
E. A hypothetical observation might be a person who is 40 years old, male, works full-time, and is non-hourly. Using this model, the equation for this observation would be:
\[611.7809 + 6.6874(age) - 179.9994(sex) - 493.1572(full\_or\_part\_time) + 479.0939(hourly\_non\_hourly)\]
From this model, the predicted weekly earnings for this observation would be $671.15.
Using the ranges above, the weekly earnings lower limit would be $402 and the upper limit would be $940.3.