We use data from a Census Bureau study of internet and technology use to evaluate the impact of telecommuting on wages. A simple one-way ANOVA shows that teleworking appears to have a significant effect on income.

##                          Df    Sum Sq   Mean Sq F value Pr(>F)    
## as.factor(telecommute)    1 1.451e+08 145093488   346.5 <2e-16 ***
## Residuals              5540 2.320e+09    418730                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A one-way ANOVA was conducted to compare weekly income between those who telework and those who do not. The effect of teleworking on income was significant at the p < .05 level [F(1, 5540) = 346.51, p < 2.2e-16]. We can see below that those who telecommute generally have higher weekly earnings than those who do not.

As the above visualization shows, telecommuters on average make more money than non-telecommuters. In fact, the boxplot suggests that the 75th percentile of weekly earnings among those who do not telecommute is roughly equal to the median (50th percentile) of weekly earnings among those who do.
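The F statistic in the ANOVA table can be checked by hand from the mean squares. The short Python sketch below does that arithmetic using only the numbers printed in the table; it is not the code that produced the table, and the variable names are illustrative. It also computes eta squared, a common effect-size measure the original output does not report, so treat that figure as an approximation based on the rounded table values.

```python
# Values copied from the ANOVA table above (rounded as printed there).
DF_GROUP = 1
DF_RESIDUAL = 5540
MEAN_SQ_GROUP = 145093488
MEAN_SQ_RESIDUAL = 418730

# F is the between-group mean square divided by the residual mean square.
f_statistic = MEAN_SQ_GROUP / MEAN_SQ_RESIDUAL

# Eta squared: share of the total sum of squares attributed to the grouping factor.
ss_group = MEAN_SQ_GROUP * DF_GROUP
ss_residual = MEAN_SQ_RESIDUAL * DF_RESIDUAL
eta_squared = ss_group / (ss_group + ss_residual)

print(f"F = {f_statistic:.1f}")
print(f"eta squared = {eta_squared:.3f}")
```

The ratio reproduces the reported F value of 346.5. The eta squared of roughly 0.06 suggests that, although the difference is highly significant, telecommuting status by itself accounts for only a modest share of the variation in weekly earnings.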

Estimating Weekly Earnings by Hours Worked

We are interested in determining how much of an impact hours worked has on weekly earnings. We can build a simple regression model to explore this relationship.

## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                           weekly_earnings      
## -----------------------------------------------
## hours_worked                 22.589***         
##                               (0.707)          
##                                                
## Constant                     66.043**          
##                              (28.577)          
##                                                
## -----------------------------------------------
## Observations                   5,542           
## R2                             0.156           
## Adjusted R2                    0.155           
## Residual Std. Error     612.961 (df = 5540)    
## F Statistic         1,020.352*** (df = 1; 5540)
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

The predicted weekly earnings are $66.04 + 22.59(hours worked). In general form, the model is \(y = \beta_0 + \beta_1(\text{hours worked}) + \epsilon\).
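As a quick worked example of the fitted line, the Python sketch below plugs a few values into the equation using the coefficients from the regression table. The function name and the example hour values are illustrative assumptions, not part of the original analysis.

```python
# Coefficients copied from the regression table above (weekly_earnings ~ hours_worked).
INTERCEPT = 66.043
SLOPE_PER_HOUR = 22.589


def predicted_weekly_earnings(hours_worked: float) -> float:
    """Point prediction from the simple linear model; no uncertainty is included."""
    return INTERCEPT + SLOPE_PER_HOUR * hours_worked


# Example inputs chosen only for illustration.
for hours in (20, 40, 60):
    print(hours, round(predicted_weekly_earnings(hours), 2))
```

For a 40-hour week the fitted line gives roughly $969.60, and each additional hour adds about $22.59 to the prediction.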

This model is naive because it does not attempt to explain the underlying causal relationship between hours worked and weekly earnings. It also uses only a linear function of a single predictor and ignores other potentially predictive variables. Finally, it assumes linearity and homoskedasticity without testing either assumption.
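To make the homoskedasticity assumption concrete, here is a minimal Python sketch of one crude way to probe it: fit a line, then compare the average squared residual among the highest values of the predictor with that among the lowest. The data are synthetic and generated inside the sketch (they are not the survey data), the helper names are hypothetical, and this is an informal check rather than a formal test.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def spread_ratio(x, y):
    """Mean squared residual in the top third of x divided by the bottom third."""
    intercept, slope = fit_line(x, y)
    pairs = sorted(zip(x, y))  # order observations by the predictor
    squared = [(yi - (intercept + slope * xi)) ** 2 for xi, yi in pairs]
    third = len(squared) // 3
    low = sum(squared[:third]) / third
    high = sum(squared[-third:]) / third
    return high / low


rng = random.Random(0)  # fixed seed so the sketch is repeatable
hours = [rng.uniform(10, 60) for _ in range(900)]

# Synthetic earnings, NOT the survey data: one series with constant noise,
# one where the noise grows with hours worked.
constant_noise = [66 + 22.6 * h + rng.gauss(0, 300) for h in hours]
growing_noise = [66 + 22.6 * h + rng.gauss(0, 12 * h) for h in hours]

ratio_constant = spread_ratio(hours, constant_noise)
ratio_growing = spread_ratio(hours, growing_noise)
print("constant noise ratio:", round(ratio_constant, 2))
print("growing noise ratio:", round(ratio_growing, 2))
```

A ratio near 1 is what constant error variance would produce, while a ratio well above 1 suggests that the spread of earnings widens as hours increase. Applying the same comparison to the actual residuals would be one way to test the assumption the model currently takes for granted.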

Model alternatives

We know going in that weekly earnings are going to be highly correlated with hours worked, because one is essentially a function of the other: most workers are either paid by the hour or given a salary based on the assumption that they will work a specific number of hours each week. If we are more interested in the impact of hours worked on earnings potential, a better model would compare hourly earnings to hours worked. The relationship between those two variables is harder to predict in advance, so this model may provide more useful insight.

## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                           hourly_earnings      
## -----------------------------------------------
## hours_worked                 -2.446***         
##                               (0.122)          
##                                                
## Constant                    127.534***         
##                               (4.938)          
##                                                
## -----------------------------------------------
## Observations                   5,542           
## R2                             0.067           
## Adjusted R2                    0.067           
## Residual Std. Error     105.915 (df = 5540)    
## F Statistic          400.846*** (df = 1; 5540) 
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

If we compare hours worked to hourly earnings, we discover that working more hours is actually negatively correlated with hourly earnings: each additional hour worked is associated with hourly earnings that are about $2.45 lower. This perhaps provides support for the old adage “work smarter, not harder.”

As we can see above, hourly earnings decrease as hours worked increase.
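The hourly-earnings model can be explored numerically in the same way. The Python sketch below uses only the coefficients and R2 from the regression table; the function and variable names are illustrative. It also derives the correlation implied by R2, which for a one-predictor regression is the square root of R2 carrying the sign of the slope.

```python
import math

# Values copied from the regression table above (hourly_earnings ~ hours_worked).
INTERCEPT = 127.534
SLOPE_PER_HOUR = -2.446
R_SQUARED = 0.067


def predicted_hourly_earnings(hours_worked: float) -> float:
    """Point prediction from the simple linear model."""
    return INTERCEPT + SLOPE_PER_HOUR * hours_worked


# Hours at which the fitted straight line reaches zero hourly earnings.
zero_crossing_hours = -INTERCEPT / SLOPE_PER_HOUR

# With a single predictor, |correlation| = sqrt(R^2); the slope supplies the sign.
implied_correlation = math.copysign(math.sqrt(R_SQUARED), SLOPE_PER_HOUR)

print(round(predicted_hourly_earnings(40), 2))
print(round(zero_crossing_hours, 1))
print(round(implied_correlation, 2))
```

At 40 hours the line predicts about $29.69 per hour. The line reaches zero at roughly 52 hours and predicts negative hourly earnings beyond that point, which is a reminder that this straight-line fit should not be extrapolated to long work weeks. The implied correlation of about -0.26 also indicates that the negative relationship, while statistically significant, is fairly weak.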

Age and Weekly Earnings

## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                           weekly_earnings      
## -----------------------------------------------
## age                          9.194***          
##                               (0.631)          
##                                                
## Constant                    548.946***         
##                              (28.235)          
##                                                
## -----------------------------------------------
## Observations                   5,542           
## R2                             0.037           
## Adjusted R2                    0.037           
## Residual Std. Error     654.583 (df = 5540)    
## F Statistic          212.588*** (df = 1; 5540) 
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

The predicted weekly earnings are $548.95 + 9.19(age).

One problem with this model is that the relationship between age and earnings is unlikely to be linear. Earnings are likely to increase up to a certain age and then start to decrease. There is also a disconnect between age and career experience. A 36-year-old is not twice as experienced as an 18-year-old, for example, since most people do not enter the workforce until they are 16 years old, and many do not enter it full-time until their 20s.

In the visualization below, we are plotting the average weekly earnings for each group. We can see a clear curvilinear pattern.

If we instead plot each individual’s weekly earnings by age and apply a smoothed line to the data, we again see a curvilinear pattern rather than the straight line that a linear relationship would produce.

Using the car package, we can also test the linearity by plotting the residuals. As we can see above, the residuals are not randomly distributed. Instead, we clearly see a curvilinear pattern where residuals are negative at both the lower and upper ends of the age range.
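The residual pattern described here is what a straight line produces whenever the true relationship is concave. The Python sketch below illustrates this with a noise-free curve rather than the survey data: it borrows the rounded age coefficients from the expanded model reported later in this document purely as a convenient concave shape, fits a straight line to it, and averages the residuals in the youngest, middle, and oldest thirds of the age range. The helper names and the three-way split are assumptions made for illustration.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean(values):
    return sum(values) / len(values)


ages = list(range(16, 76))  # ages 16 through 75

# Illustrative concave curve (no noise), using the rounded age terms
# from the expanded model in this report.
earnings = [-188.141 + 46.150 * a - 0.446 * a * a for a in ages]

intercept, slope = fit_line(ages, earnings)
residuals = [e - (intercept + slope * a) for a, e in zip(ages, earnings)]

# Average residual in each third of the age range (20 ages per group).
youngest = mean(residuals[:20])
middle = mean(residuals[20:40])
oldest = mean(residuals[40:])
print(round(youngest, 1), round(middle, 1), round(oldest, 1))
```

The straight line overpredicts at both ends of the age range (negative residuals) and underpredicts in the middle (positive residuals), which is the same signature visible in the residual plot.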

Concerns with age model

The first concern is linearity. We can see that the relationship is not linear, and we can attempt to address this by modeling age as a polynomial. Another concern is that we do not have much data at the extreme ends of the age range. We can address this by applying the model only to ages within a range where we have enough observations to make relatively confident predictions. We also know that there will be multicollinearity issues with other important variables, such as level of education.

Expanded Model

## 
## =========================================================
##                                   Dependent variable:    
##                               ---------------------------
##                                     weekly_earnings      
## ---------------------------------------------------------
## age                                    46.150***         
##                                         (3.391)          
##                                                          
## I(age * age)                           -0.446***         
##                                         (0.038)          
##                                                          
## as.factor(telecommute)2               -218.292***        
##                                        (17.314)          
##                                                          
## as.factor(hourly_non_hourly)2         495.241***         
##                                        (16.345)          
##                                                          
## Constant                              -188.141***        
##                                        (71.697)          
##                                                          
## ---------------------------------------------------------
## Observations                             5,542           
## R2                                       0.250           
## Adjusted R2                              0.250           
## Residual Std. Error               577.701 (df = 5537)    
## F Statistic                    462.148*** (df = 4; 5537) 
## =========================================================
## Note:                         *p<0.1; **p<0.05; ***p<0.01

The model predicts weekly earnings to be -188.14 + 46.15(age) - 0.446(age)^2 - 218.29(non-telecommuter) + 495.24(non-hourly). In general, predicted weekly earnings rise with age until about age 52, where the quadratic term begins to outweigh the linear term, and then decline for older workers. Holding the other variables constant, weekly earnings are expected to be about $218 higher for telecommuters and about $495 higher for non-hourly employees.
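The age at which predicted earnings stop rising follows directly from the two age coefficients. The Python sketch below works through that calculation using the rounded values from the table; the function and variable names are illustrative, and the result is approximate because the coefficients are rounded.

```python
# Age coefficients copied from the expanded model table above.
B_AGE = 46.150
B_AGE_SQUARED = -0.446


def marginal_effect_of_age(age: float) -> float:
    """Change in predicted weekly earnings per additional year at a given age.

    This is the derivative of B_AGE*age + B_AGE_SQUARED*age**2 with respect to age.
    """
    return B_AGE + 2 * B_AGE_SQUARED * age


# The quadratic peaks where the marginal effect equals zero.
peak_age = -B_AGE / (2 * B_AGE_SQUARED)

print(round(peak_age, 1))
print(round(marginal_effect_of_age(25), 2))
print(round(marginal_effect_of_age(60), 2))
```

Predicted earnings peak at roughly age 52. At age 25 an additional year is associated with about $24 more per week, while at age 60 an additional year is associated with about $7 less per week.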

We have enough observations to feel comfortable making predictions for ages above 16 and below 75. Outside of that range, we do not have enough observations to make predictions with any confidence.

Making a prediction

## 
## ===============================
##      fit       lwr       upr   
## -------------------------------
## 1 1,326.772 1,293.753 1,359.791
## -------------------------------

For a 32-year-old individual who telecommutes and is non-hourly, our model predicts weekly earnings of $1,326.77. The lower end of our confidence interval is $1,293.75 and the upper end is $1,359.79.
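This prediction can be approximately reproduced by hand from the expanded model's coefficients. The Python sketch below does so under one assumption taken from the interpretation given earlier: level 2 of the telecommute factor is read as "does not telecommute" and level 2 of the hourly factor as "non-hourly", each coded as a 0/1 indicator. The function name is hypothetical, and a small gap from the reported fit is expected because the table coefficients are rounded.

```python
def predict_weekly_earnings(age: float, non_telecommuter: int, non_hourly: int) -> float:
    """Point prediction from the rounded expanded-model coefficients.

    non_telecommuter and non_hourly are 0/1 indicators (assumed coding).
    """
    return (
        -188.141
        + 46.150 * age
        - 0.446 * age ** 2
        - 218.292 * non_telecommuter
        + 495.241 * non_hourly
    )


# A 32-year-old who telecommutes (indicator 0) and is non-hourly (indicator 1).
point = predict_weekly_earnings(32, non_telecommuter=0, non_hourly=1)

# Values copied from the prediction output above.
REPORTED_FIT, LOWER, UPPER = 1326.772, 1293.753, 1359.791
midpoint = (LOWER + UPPER) / 2
half_width = (UPPER - LOWER) / 2

print(round(point, 2), round(midpoint, 3), round(half_width, 3))
```

The hand calculation lands within about 50 cents of the reported fit, and the reported fit sits at the midpoint of the interval. The half-width of about $33 is small relative to the residual standard error of 577.7, which suggests the interval describes uncertainty in the average earnings of workers with these characteristics rather than the range in which one individual's earnings would fall.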