Introduction

Packages Required

Package                 Purpose
tidyverse               Core packages for data analysis and manipulation
car
stats                   General-purpose literate programming engine for output control
knitr                   For importing rectangular CSV data
readr                   For transforming and rescaling data
syuzhet                 For general summary statistics about the data set
prettydoc               Creating documents in R Markdown
PerformanceAnalytics    Companion to applied regression analysis
fmsb                    For table placement in HTML

Data Preparation

Data cleaning tasks were performed

Numeric data columns converted to factors
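
The conversion code itself is not shown in the report. The following is a minimal sketch of how such a step might look, assuming a data frame named telework (the name that appears in the later model calls). The sample values and the choice of which columns to convert are assumptions made for illustration only.

```r
# Toy stand-in for the survey data; all values are invented for illustration.
telework <- data.frame(
  telecommute       = c(1, 2, 2, 1, 2),
  full_or_part_time = c(1, 1, 2, 1, 2),
  weekly_earnings   = c(1250, 800, 320, 1500, 410)
)

# Columns assumed to hold category codes rather than quantities.
code_cols <- c("telecommute", "full_or_part_time")

# Convert each coded column to a factor so that aov() and lm()
# treat it as categorical instead of as a numeric scale.
telework[code_cols] <- lapply(telework[code_cols], as.factor)

str(telework)
```

Columns that measure quantities, such as weekly_earnings, are left numeric.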

Question 1: Using a simple one-way ANOVA, does teleworking appear to have a significant effect on income?

As indicated in the one-way ANOVA, given a p-value < 0.05, we can reject the null hypothesis of equal group means and conclude that weekly earnings differ significantly by telecommuting status. The boxplot indicates that those who telework (1) have higher weekly earnings than those who do not telework (2).

Based on the higher average weekly_earnings for telecommuters (1) as compared to non-telecommuters (2), it would appear that there is an income effect. However, is this observed difference statistically significant?

##                          Df    Sum Sq   Mean Sq F value Pr(>F)    
## as.factor(telecommute)    1 1.451e+08 145093488   346.5 <2e-16 ***
## Residuals              5540 2.320e+09    418730                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = weekly_earnings ~ as.factor(telecommute), data = telework)
## 
## $`as.factor(telecommute)`
##          diff       lwr       upr p adj
## 2-1 -350.7614 -387.7015 -313.8213     0

Given a p-value < 0.05 in the ANOVA table and a similarly small adjusted p-value in the Tukey test, we can reject the null hypothesis and conclude at the 95% confidence level that the difference in mean weekly earnings is statistically significant. The Tukey interval places non-telecommuters (2) roughly $314 to $388 per week below telecommuters (1).

The boxplot below depicts the dispersion in weekly earnings by telecommute status.
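
The calls behind the ANOVA table, the Tukey comparison, and the boxplot are not shown in the report. A minimal sketch of the workflow follows, using the same formula as in the output above but with simulated data. The data frame name demo, the group means, and the spread are assumptions and do not reproduce the survey values.

```r
set.seed(42)

# Simulated stand-in data; group means and spread are invented.
n <- 200
demo <- data.frame(
  telecommute     = rep(c(1, 2), each = n),
  weekly_earnings = c(rnorm(n, mean = 1200, sd = 600),
                      rnorm(n, mean = 850,  sd = 600))
)

# One-way ANOVA of earnings on telecommute status.
fit <- aov(weekly_earnings ~ as.factor(telecommute), data = demo)
summary(fit)

# Tukey's HSD; with two groups this yields a single 2-1 comparison.
tk <- TukeyHSD(fit)
tk

# Boxplot of earnings by group.
boxplot(weekly_earnings ~ telecommute, data = demo,
        xlab = "telecommute (1 = yes, 2 = no)",
        ylab = "weekly_earnings")
```

With only two groups, the Tukey comparison carries the same information as the ANOVA F-test; it mainly adds the confidence interval for the difference in means.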

Question 2: Build a simple regression model estimating weekly earnings by hours worked. Do not make any transformations at this point; only generate a naïve model and answer the following questions:

## [1] 66.04331
## 
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1758.0  -411.2  -185.1   230.4  2796.0 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   66.0433    28.5766   2.311   0.0209 *  
## hours_worked  22.5887     0.7072  31.943   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 613 on 5540 degrees of freedom
## Multiple R-squared:  0.1555, Adjusted R-squared:  0.1554 
## F-statistic:  1020 on 1 and 5540 DF,  p-value: < 2.2e-16

From the summary above of a simple regression model estimating weekly earnings by hours worked, we see that the model is statistically significant (p < 2.2e-16), and we can reject the null hypothesis that the slope is zero in favor of the alternative.

  a. The generalized form of the regression is:

weekly_earnings = B0 + B1 * (hours_worked)

  b. The estimated form of the regression equation is:

weekly_earnings = 66.0433 + 22.5887 * (hours_worked)
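
To make the estimated equation concrete, the sketch below plugs an hours value into it. The coefficients are copied from the regression summary above; the function name predict_earnings and the 40-hour example are assumptions introduced only for illustration.

```r
# Coefficients copied from the naive model summary.
b0 <- 66.0433   # intercept
b1 <- 22.5887   # change in weekly earnings per additional hour worked

# Hypothetical helper: estimated weekly earnings for a given number of hours.
predict_earnings <- function(hours) {
  b0 + b1 * hours
}

# Estimated weekly earnings for a 40-hour week.
predict_earnings(40)
```

Under this model a 40-hour week corresponds to an estimate of about $970, and each extra hour adds about $22.59 regardless of occupation or pay structure.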

  c. Provide at least 3 explanations for why we would consider this a naïve model.

One reason we might consider this a naive model is that the R-squared is only 0.1555, which indicates that the independent variable, hours worked, has weak explanatory power for weekly earnings.

Another reason is that weekly earnings are not solely based on hours worked. For example, many professionals earn a salary that does not depend on the number of hours worked per period.

Additionally, the type of job and skill level matter: in some occupations, payment or fees are based on an outcome rather than an hourly wage. Examples include a tax preparer or a lawyer writing up one’s will.

Finally, people with more experience tend to be paid a higher rate than those with less time on the job. Therefore, weekly earnings are not solely a function of the hours worked per week but also reflect level of experience.

  d. In your opinion, what do you think should be done to better model these two variables or do you think it does not make sense to model one as a function of the other under any circumstances? Provide a visualization to support your position.
## 
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1758.0  -411.2  -185.1   230.4  2796.0 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   66.0433    28.5766   2.311   0.0209 *  
## hours_worked  22.5887     0.7072  31.943   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 613 on 5540 degrees of freedom
## Multiple R-squared:  0.1555, Adjusted R-squared:  0.1554 
## F-statistic:  1020 on 1 and 5540 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = log(weekly_earnings) ~ hours_worked, data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8090 -0.4577 -0.0287  0.4182  2.6466 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.286861   0.030671  172.37   <2e-16 ***
## hours_worked 0.033644   0.000759   44.33   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6579 on 5540 degrees of freedom
## Multiple R-squared:  0.2618, Adjusted R-squared:  0.2617 
## F-statistic:  1965 on 1 and 5540 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = log(weekly_earnings) ~ hours_worked + I(hours_worked * 
##     hours_worked), data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7182 -0.4519 -0.0412  0.4193  2.9261 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.990e+00  4.812e-02 103.697  < 2e-16 ***
## hours_worked                    5.136e-02  2.346e-03  21.896  < 2e-16 ***
## I(hours_worked * hours_worked) -2.381e-04  2.984e-05  -7.978  1.8e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6542 on 5539 degrees of freedom
## Multiple R-squared:  0.2702, Adjusted R-squared:  0.2699 
## F-statistic:  1025 on 2 and 5539 DF,  p-value: < 2.2e-16

By progressing from a linear-linear model to log-linear and then log-polynomial, we get a slightly better adjusted R-squared (0.1554, 0.2617, and 0.2699, respectively); however, the explanatory power of the model is still best described as weak. The standardized residuals plot illustrates the low predictive value of the model. Additional independent variables would be needed to improve the ability of the model to explain the dependent variable, weekly earnings.
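
The three specifications compared above can be fit side by side as in the following sketch. It uses simulated data, so the numbers will not match the report; the data frame demo and the data-generating process are assumptions chosen only so that the example runs on its own.

```r
set.seed(1)

# Simulated stand-in data: log earnings are assumed concave in hours.
n <- 500
hours <- runif(n, min = 5, max = 70)
earnings <- exp(5 + 0.05 * hours - 0.0003 * hours^2 + rnorm(n, sd = 0.6))
demo <- data.frame(hours_worked = hours, weekly_earnings = earnings)

# The three model forms used in the report.
m_lin  <- lm(weekly_earnings ~ hours_worked, data = demo)
m_log  <- lm(log(weekly_earnings) ~ hours_worked, data = demo)
m_poly <- lm(log(weekly_earnings) ~ hours_worked + I(hours_worked^2),
             data = demo)

# Collect the adjusted R-squared of each fit.
adj_r2 <- sapply(
  list(linear = m_lin, log_linear = m_log, log_poly = m_poly),
  function(m) summary(m)$adj.r.squared
)
round(adj_r2, 4)

# Standardized residuals against fitted values for the last model.
plot(fitted(m_poly), rstandard(m_poly),
     xlab = "Fitted log(weekly_earnings)",
     ylab = "Standardized residuals")
abline(h = 0, lty = 2)
```

One caveat when reading such a comparison: the log models measure fit on the log scale, so their adjusted R-squared values are not strictly comparable with that of the untransformed model.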

  e. If your recommendation from part d above is possible with the data available, do your best to execute your proposed adjustments. If your recommendation from part d is not possible using this dataset, propose another solution using additional variables if possible, a different data set or another strategy.
## 
## Call:
## lm(formula = weekly_earnings ~ hours_worked + full_or_part_time + 
##     age + detailed_occupation_group, data = telework)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1328.52  -333.64   -97.27   203.51  2684.44 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  426.0640    46.9233   9.080  < 2e-16 ***
## hours_worked                  14.3835     0.7715  18.644  < 2e-16 ***
## full_or_part_time2          -266.7181    25.2941 -10.545  < 2e-16 ***
## age                            7.4330     0.5255  14.144  < 2e-16 ***
## detailed_occupation_group2   -14.0418    37.4129  -0.375 0.707437    
## detailed_occupation_group3    90.6285    45.0338   2.012 0.044220 *  
## detailed_occupation_group4   103.2604    48.2186   2.142 0.032277 *  
## detailed_occupation_group5  -138.8140    66.1487  -2.099 0.035905 *  
## detailed_occupation_group6  -371.7495    60.5991  -6.135 9.13e-10 ***
## detailed_occupation_group7   247.8487    66.9185   3.704 0.000215 ***
## detailed_occupation_group8  -131.2259    41.2853  -3.179 0.001488 ** 
## detailed_occupation_group9  -159.7264    59.4970  -2.685 0.007283 ** 
## detailed_occupation_group10 -122.4776    35.3752  -3.462 0.000540 ***
## detailed_occupation_group11 -674.5186    47.5637 -14.181  < 2e-16 ***
## detailed_occupation_group12 -332.9938    50.4692  -6.598 4.56e-11 ***
## detailed_occupation_group13 -614.1451    39.5249 -15.538  < 2e-16 ***
## detailed_occupation_group14 -677.9578    48.3717 -14.016  < 2e-16 ***
## detailed_occupation_group15 -673.1004    48.9391 -13.754  < 2e-16 ***
## detailed_occupation_group16 -393.4039    31.2111 -12.605  < 2e-16 ***
## detailed_occupation_group17 -555.3577    28.3138 -19.614  < 2e-16 ***
## detailed_occupation_group18 -601.2618   105.4690  -5.701 1.25e-08 ***
## detailed_occupation_group19 -365.3050    41.5688  -8.788  < 2e-16 ***
## detailed_occupation_group20 -301.0605    42.9177  -7.015 2.58e-12 ***
## detailed_occupation_group21 -538.5115    36.8069 -14.631  < 2e-16 ***
## detailed_occupation_group22 -495.0241    36.6261 -13.516  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 536 on 5517 degrees of freedom
## Multiple R-squared:  0.357,  Adjusted R-squared:  0.3542 
## F-statistic: 127.7 on 24 and 5517 DF,  p-value: < 2.2e-16

In this new model, the overall F-test has a p-value < 0.05, which indicates that the model is statistically significant, and we can reject the null hypothesis that all slope coefficients equal 0. The adjusted R-squared has improved to 0.3542, which increases the explanatory power of the model, and the residual standard error has fallen to 536 (from 613 in the naive model). However, when drilling down into the individual independent variables, we notice that not all are significant: detailed_occupation_group2 (p = 0.707) does not differ significantly from the reference group and therefore adds noise to the model.

Question 3: Build a simple regression model estimating weekly earnings as a function of age. Do not make any transformations at this point; only generate a naïve model and answer the following questions:

## 
## Call:
## lm(formula = weekly_earnings ~ age, data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1245.3  -445.3  -178.1   284.7  2133.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 548.9457    28.2350   19.44   <2e-16 ***
## age           9.1941     0.6306   14.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 654.6 on 5540 degrees of freedom
## Multiple R-squared:  0.03696,    Adjusted R-squared:  0.03678 
## F-statistic: 212.6 on 1 and 5540 DF,  p-value: < 2.2e-16
  a. Write the generalized form of the regression using beta notation

The generalized form of the regression is:

weekly_earnings = B0 + B1 * (age)

  b. Write the estimated form of the regression using your results

The estimated form of the regression is:

weekly_earnings = 548.95 + 9.19 * (age)

  c. Provide at least 3 explanations for why this is considered a naïve model.

This can be considered a naive model for the following reasons:

  1. Age is not the only factor that would explain weekly_earnings. While we would typically expect individuals to earn increasing wages over time, other factors would play a greater part in predicting weekly earnings, such as the amount of work done in a given week, as captured by the full- or part-time variable.

  2. One’s occupation would clearly also play a tremendous role in determining weekly earnings. Professional occupations, as compared to non-skilled work, could have a greater impact on weekly earnings.

  3. Given the large cost of living differences among the various regions of the country, location could also play a substantial role in explaining weekly earnings.

  d. Test the linearity assumption of this model. Provide output of the tests you run to assess linearity and comment on the results.

The linearity assumption can be checked by examining the residuals-versus-fitted plot alongside the scale-location plot, also known as the spread-location plot. The former shows whether the residuals follow a systematic pattern, and the latter shows whether residuals are spread equally along the range of the predictor.

The graphs indicate that the model does not provide a good linear fit for the data, which also exhibit potential heteroskedasticity, a sign that the model may not be well specified.
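
The diagnostic graphs referred to above are not reproduced in this text. A minimal sketch of how they could be generated with base R is given below, using simulated data. The data frame demo, the curved age profile, and the growing spread are all assumptions made so the example is self-contained.

```r
set.seed(7)

# Simulated stand-in data: earnings rise and then flatten with age,
# and the spread grows with the mean (both invented for illustration).
n <- 300
age <- round(runif(n, min = 18, max = 70))
mu <- 200 + 40 * age - 0.4 * age^2
demo <- data.frame(age = age,
                   weekly_earnings = mu + rnorm(n, sd = 0.3 * mu))

# Naive straight-line model, as in the report.
fit <- lm(weekly_earnings ~ age, data = demo)

# Residuals vs fitted (curvature suggests non-linearity) and
# scale-location (a trend suggests heteroskedasticity).
op <- par(mfrow = c(1, 2))
plot(fit, which = 1)
plot(fit, which = 3)
par(op)

# A simple numeric check: does adding a squared term improve the fit?
fit_quad <- lm(weekly_earnings ~ age + I(age^2), data = demo)
anova(fit, fit_quad)
```

A small p-value in the final comparison would indicate that the straight-line form leaves out curvature that the squared term picks up.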

  e. Identify at least 3 other possible concerns regarding this model beyond those inherent in the naïve design. For each possible concern, comment on how you would assess and address the concern.

Given the heteroskedasticity concerns outlined above, log or polynomial transformations could stabilize the variance of the residuals and improve the model’s explanatory value. In addition, using more IVs should improve the model’s ability to explain the dependent variable, weekly earnings, as age does not appear to be the only explanation for the level of weekly earnings.

Another concern relates to the y-intercept value and its relevance and accuracy. A person of age 0 would be predicted to receive $548.95 in weekly earnings, which is not a meaningful estimate.

An additional concern is distinguishing between age and experience. It could be argued that an older person might receive less than a younger person with more experience, ceteris paribus. A way to improve the model may be to use a measure of experience as a substitute for age.

A final concern is the assumption of a linear relationship between age and weekly earnings. Intuitively, we might expect earnings to grow at a changing rate as experience accumulates rather than by a constant amount per year.

Question 4: Modify your model from Q3 by adding at least 3 other IVs to the regression and transforming the age variable (and others) as necessary to meet the linearity assumption. Interpret your results and answer the following questions:

## 
## Call:
## lm(formula = weekly_earnings ~ age + I(age * age) + hours_worked + 
##     hourly_non_hourly + full_or_part_time + occupation_group, 
##     data = telework)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1492.68  -318.70   -81.97   194.24  2634.69 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -48.36956   74.79910  -0.647 0.517879    
## age                  22.74989    3.17170   7.173 8.32e-13 ***
## I(age * age)         -0.18662    0.03556  -5.249 1.59e-07 ***
## hours_worked         13.04987    0.74904  17.422  < 2e-16 ***
## hourly_non_hourly2  325.31496   16.08411  20.226  < 2e-16 ***
## full_or_part_time2 -218.78703   24.84086  -8.808  < 2e-16 ***
## occupation_group2   -19.58143   22.58074  -0.867 0.385884    
## occupation_group3  -435.61031   26.12580 -16.674  < 2e-16 ***
## occupation_group4  -293.58199   28.47477 -10.310  < 2e-16 ***
## occupation_group5  -416.83376   25.82964 -16.138  < 2e-16 ***
## occupation_group6  -426.23175  102.34706  -4.165 3.17e-05 ***
## occupation_group7  -167.48226   39.93989  -4.193 2.79e-05 ***
## occupation_group8  -150.23296   40.82552  -3.680 0.000236 ***
## occupation_group9  -347.83247   35.17233  -9.889  < 2e-16 ***
## occupation_group10 -326.05283   34.70294  -9.396  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 521.4 on 5527 degrees of freedom
## Multiple R-squared:  0.3903, Adjusted R-squared:  0.3887 
## F-statistic: 252.7 on 14 and 5527 DF,  p-value: < 2.2e-16

  a. Write the generalized form of the regression using beta notation

weekly_earnings = B0 + B1 * (age) + B2 * (age^2) + B3 * (hours_worked) + B4 * (hourly_non_hourly2) + B5 * (full_or_part_time2) + B6 * (occupation_group2) + ... + B14 * (occupation_group10)

In the new model, I have used age, age squared, hours worked, full or part time, hourly/non-hourly, and occupation group. The result is that the ability to explain the dependent variable, weekly earnings, has improved to an adjusted R-squared of 0.3887 (multiple R-squared of 0.3903). Additionally, the overall model is statistically significant at p less than alpha = 0.05, and we can reject the null hypothesis that all slope coefficients are equal to 0.

  b. Write the estimated form of the regression using your results

The regression equation using the coefficients generated from the regression analysis is:

weekly_earnings = -48.37 + 22.75(age) - 0.187(age^2) + 13.05(hours_worked) + 325.31(hourly_non_hourly2) - 218.79(full_or_part_time2) - 19.58(occupation_group2) - ... - 326.05(occupation_group10)
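
Because age enters both linearly and squared, its effect on earnings is not constant. The sketch below works out the marginal effect of one more year of age and the age at which the fitted profile peaks, holding the other variables fixed. The coefficients are copied from the regression summary; the variable and function names are assumptions introduced for illustration.

```r
# Coefficients copied from the Question 4 model summary.
b_age  <- 22.74989
b_age2 <- -0.18662

# Marginal effect of age: derivative of b_age*age + b_age2*age^2.
marginal_effect <- function(age) {
  b_age + 2 * b_age2 * age
}

# Estimated change in weekly earnings per extra year at several ages.
marginal_effect(c(25, 45, 65))

# Age at which the fitted quadratic reaches its maximum.
peak_age <- -b_age / (2 * b_age2)
peak_age
```

The fitted profile rises until roughly age 61 and declines afterwards, which is the concave shape the squared term was added to capture.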
  
  c. Do you suspect any of your independent variables are collinear? Explain why or why not.

Given the low VIF scores in the table below, I conclude that there is little correlation between each predictor and the remaining predictor variables. As a general rule of thumb, a Variance Inflation Factor (VIF) of 1 means there is no correlation, and values exceeding 4 warrant further investigation.

##               age      hours_worked hourly_non_hourly full_or_part_time 
##          1.019456          1.511649          1.074684          1.507727
## [1] 1.475405
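
The call that produced the VIF values is not shown. For numeric predictors, a VIF can be computed by hand as 1 / (1 - R-squared), where R-squared comes from regressing one predictor on the others; functions such as car::vif() report the same quantity. The sketch below illustrates the idea with simulated data. The helper vif_manual, the data frame demo, and its values are assumptions, not the report's code.

```r
set.seed(3)

# Simulated stand-in predictors; part_time is built to correlate with hours.
n <- 400
hours <- runif(n, min = 5, max = 60)
demo <- data.frame(
  age          = runif(n, min = 18, max = 70),
  hours_worked = hours,
  part_time    = as.numeric(hours + rnorm(n, sd = 8) < 30)
)

# Hypothetical helper: VIF of each predictor from its auxiliary regression.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    aux <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

v <- vif_manual(demo, c("age", "hours_worked", "part_time"))
round(v, 3)
```

In this toy data the hours and part-time variables inflate each other's VIF while age stays near 1, which mirrors the pattern in the table above.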
  d. Judging by your output, what ranges of values (for your IVs and the DV, weekly earnings) are you most comfortable using this model to estimate future observations?

The table below shows the 95% confidence intervals, which indicate the range of values that I am most comfortable using the model with.

##                          2.5 %      97.5 %
## (Intercept)        -117.023022   42.098498
## age                   5.826261    7.921016
## hours_worked         12.237957   15.292780
## hourly_non_hourly2  423.943649  484.595471
## full_or_part_time2 -329.169851 -228.819259
  e. Generate a hypothetical observation that exists within the ranges specified in part d and estimate the weekly earnings for that individual. In addition to this estimate, calculate the estimated range of weekly earnings this individual may have and comment on the results. HINT: use the coefficient confidence intervals to estimate the highest and lowest values of this person’s income.
##        (Intercept)                age       hours_worked hourly_non_hourly2 
##         -37.462262           6.873638          13.765369         454.269560 
## full_or_part_time2 
##        -278.994555

My model predicts that a specific observation in the data set, which has actual weekly earnings of $1,000.00, should have the following estimated weekly earnings:

-37.4622 + 6.8736(44) + 13.7653(1) + 454.2695(2) - 278.9946(1) =

## [1] 908.2859
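
The arithmetic above can be reproduced as in the following sketch, which also applies the hint from the question by substituting the lower and upper confidence bounds for each coefficient. The coefficients and bounds are copied from the output above, and the input values are taken exactly as written in the report's equation. The vector names are assumptions, and stacking all bounds at once gives only a crude range, not a formal prediction interval.

```r
# Rounded coefficients as used in the report's equation.
coefs <- c(intercept          = -37.4622,
           age                = 6.8736,
           hours_worked       = 13.7653,
           hourly_non_hourly2 = 454.2695,
           full_or_part_time2 = -278.9946)

# Values plugged in by the report; the first entry multiplies the intercept.
x <- c(1, 44, 1, 2, 1)

# Point estimate of weekly earnings.
estimate <- sum(coefs * x)
estimate

# 95% confidence bounds for each coefficient, copied from the table above.
lower <- c(-117.023022, 5.826261, 12.237957, 423.943649, -329.169851)
upper <- c(42.098498, 7.921016, 15.292780, 484.595471, -228.819259)

# Because every input value is positive, the lower bounds give the lowest
# estimate and the upper bounds give the highest.
est_range <- c(lowest = sum(lower * x), highest = sum(upper * x))
est_range
```

The observed value of $1,000.00 can then be compared against this range to judge how far the model's estimate is from the actual earnings.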