1.1 Does Teleworking appear to have a significant effect on income?

The data contains information on over 5,500 employees and 31% of those individuals worked from home, or telecommuted. The remaining 69% commuted to work and listed below is the average salary of those 2 groups. Listed first are Telecommuters with an average salary approximately $1,100

1.2 Comparing telecommuter status

In this graphic, we note a wider range in salaries for telecommuters than commuters and that the average salary is higher for a telecommuter than non-telecommuter.

1.3 A significant effect

It does appear that teleworking has a significant effect on income given the p-value < 0.05. These values are provided by an analysis of variance test. On average, commuters make $350.76 less as compared to telecommuters,

##                          Df    Sum Sq   Mean Sq F value Pr(>F)    
## as.factor(telecommute)    1 1.451e+08 145093488   346.5 <2e-16 ***
## Residuals              5540 2.320e+09    418730                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = weekly_earnings ~ as.factor(telecommute), data = telework)
## 
## $`as.factor(telecommute)`
##          diff       lwr       upr p adj
## 2-1 -350.7614 -387.7015 -313.8213     0

1.4 Naïve model

This model can be considered a naïve model because it is only using one dependent and one independent variable. It also does not account for any outside influencers that might also have an effect on the variability. For example, full- and part-time employees often times receive pay at different rates. Also, this model does not account for an employee’s position or tenure at an organization which could also contribute to higher wages for some.

2.1 Explain your Model Design

My hypothesis is that hourly workers earn less than non-hourly workers, whether they worked from home or not. I wanted to test this hypothesis and added hourly_non_hourly to my analysis to see the effect it has on predicting weekly earnings.

##                                                       Df    Sum Sq
## as.factor(telecommute)                                 1 1.451e+08
## as.factor(hourly_non_hourly)                           1 3.755e+08
## as.factor(telecommute):as.factor(hourly_non_hourly)    1 9.009e+06
## Residuals                                           5538 1.935e+09
##                                                       Mean Sq F value
## as.factor(telecommute)                              145093488  415.20
## as.factor(hourly_non_hourly)                        375496285 1074.53
## as.factor(telecommute):as.factor(hourly_non_hourly)   9009218   25.78
## Residuals                                              349451        
##                                                       Pr(>F)    
## as.factor(telecommute)                               < 2e-16 ***
## as.factor(hourly_non_hourly)                         < 2e-16 ***
## as.factor(telecommute):as.factor(hourly_non_hourly) 3.95e-07 ***
## Residuals                                                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = weekly_earnings ~ as.factor(telecommute) * as.factor(hourly_non_hourly), data = telework)
## 
## $`as.factor(telecommute)`
##          diff       lwr       upr p adj
## 2-1 -350.7614 -384.5076 -317.0153     0
## 
## $`as.factor(hourly_non_hourly)`
##         diff      lwr      upr p adj
## 2-1 512.5514 481.0693 544.0335     0
## 
## $`as.factor(telecommute):as.factor(hourly_non_hourly)`
##              diff       lwr        upr   p adj
## 2:1-1:1 -125.7199 -191.1042  -60.33548 4.8e-06
## 1:2-1:1  662.9527  587.9357  737.96978 0.0e+00
## 2:2-1:1  357.5900  286.4375  428.74251 0.0e+00
## 1:2-2:1  788.6726  732.0702  845.27496 0.0e+00
## 2:2-2:1  483.3099  431.9392  534.68055 0.0e+00
## 2:2-1:2 -305.3627 -368.5402 -242.18525 0.0e+00

2.2 Interpret Results

The independent variables, and the interaction between them, are statistically significant because their p-values are less than 0.05. The largest difference exists between salaried telecommuters and commuters who work per hour, who on average earned $788.67 more hourly commuters. This could likely be explained by the type of job that telecommuters are doing from home. For this reason I consider this to be a naïve model. The additional variable certainly helps strengthen the overall model, however, additional variables should continue to be analyzed to ensure that all relevant variables are captured.

2.3 Visualization

In this graphic we see that on average salaried telecommuters (blue box on left) make more than commuters who work per hour (red box on right)

2.4 Test Differences in Models

In our first linear model, approximately 6% of the variability in weekly earnings is explained by variability in telecommuter status which is not a very good fit.

## 
## Call:
## lm(formula = weekly_earnings ~ as.factor(telecommute), data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1170.3  -447.4  -159.4   277.4  2052.2 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              1183.19      15.69   75.43   <2e-16 ***
## as.factor(telecommute)2  -350.76      18.84  -18.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 647.1 on 5540 degrees of freedom
## Multiple R-squared:  0.05886,    Adjusted R-squared:  0.05869 
## F-statistic: 346.5 on 1 and 5540 DF,  p-value: < 2.2e-16

Upon adding hourly_non-hourly to the model we see an improvement in fit as the overall standard error of the model decreases and 21% of the variability in weekly earnings can be explained by telecommuter status.

## 
## Call:
## lm(formula = weekly_earnings ~ as.factor(telecommute) + as.factor(hourly_non_hourly), 
##     data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1389.1  -387.7  -125.8   258.2  2241.8 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     861.40      17.41   49.48   <2e-16 ***
## as.factor(telecommute)2        -218.62      17.72  -12.34   <2e-16 ***
## as.factor(hourly_non_hourly)2   540.66      16.53   32.71   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 592.5 on 5539 degrees of freedom
## Multiple R-squared:  0.2112, Adjusted R-squared:  0.2109 
## F-statistic: 741.5 on 2 and 5539 DF,  p-value: < 2.2e-16

The difference between these two models is significant given that the p-value for the hypothesis test is below 0.05 when we run an analysis of variance between the two.

## Analysis of Variance Table
## 
## Model 1: weekly_earnings ~ as.factor(telecommute)
## Model 2: weekly_earnings ~ as.factor(telecommute) + as.factor(hourly_non_hourly)
##   Res.Df        RSS Df Sum of Sq      F    Pr(>F)    
## 1   5540 2319766548                                  
## 2   5539 1944270263  1 375496285 1069.7 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.1 Weekly earnings per hours worked

a) Write the generalized form of the regression using beta notation

Y = 66.04 + 22.69(hours_worked)

## 
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1758.0  -411.2  -185.1   230.4  2796.0 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   66.0433    28.5766   2.311   0.0209 *  
## hours_worked  22.5887     0.7072  31.943   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 613 on 5540 degrees of freedom
## Multiple R-squared:  0.1555, Adjusted R-squared:  0.1554 
## F-statistic:  1020 on 1 and 5540 DF,  p-value: < 2.2e-16

3.2 Estimated Regression

Estimating 160 hours worked in a one month period

Weekly earnings = 66.04 + 22.69(160hours) Weekly earnings = $3,680.24

##        1 
## 3680.241

3.3 Explanations why this is a naïve model

This model is naïve for several reasons. First, this is a simple regression with only one dependent and one independent variable. Second, the model does not account for any interaction between the number of hours worked and at least one other variable. Lastly, when using this model, only 15.5% of the variability in weekly earnings can be explained by hours worked which implies there are other factors that also influence weekly earnings.

3.4 How to better the model

To better model these two continuous variables, I account for the interaction between hours worked and itself. A linear model does not fit the data so I convert the data into a parabolic formula to find a better fitting line using the interaction between hours worked and itself. In this model, we note the that the line curves up at an increasing rate as the number of hours worked increase.

4.1 Weekly Earnings as a function of Age

A generalized form of the equation

Weekly earnings = 548.95 + 9.19(age)

## 
## Call:
## lm(formula = weekly_earnings ~ age, data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1245.3  -445.3  -178.1   284.7  2133.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 548.9457    28.2350   19.44   <2e-16 ***
## age           9.1941     0.6306   14.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 654.6 on 5540 degrees of freedom
## Multiple R-squared:  0.03696,    Adjusted R-squared:  0.03678 
## F-statistic: 212.6 on 1 and 5540 DF,  p-value: < 2.2e-16

4.2 Estimated Regression

To estimate weekly earnings for someone 38 years of age:

Weekly earnings = 548.95 + 9.19(38) Weekly earnings = $898.32

##        1 
## 898.3197

4.3 Explaination of a naïve model

First, this is a naïve model because as a simple regression it only uses one dependent and one indepedent variable. Second, previous models confirm a statistical significance with among other variables, like total number of hours worked and whether the employee telecommutes, that should be considered and included in the overall model. Lastly, there is a weak overall fit of the model as only 3% of the variability in weekly earnings can be explained by age.

4.4 Testing Linearity Assumption

In this model, we see that the regression line is under-predicitng a majority of the weekly earnings values.

plot(telework$age, telework$weekly_earnings)
twlm <- lm(weekly_earnings~age, data = telework)
abline(twlm, col="red")

From the data points, it looks as though weekly earnings should increase from ages 20-40, peak or level off between 40 - 60 before the earnings should begin to decrease after age 60.

The regression line indicates that weekly earnings will continue to increase as a person’s age continues to increase which we know is not true.

summary(twlm)
## 
## Call:
## lm(formula = weekly_earnings ~ age, data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1245.3  -445.3  -178.1   284.7  2133.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 548.9457    28.2350   19.44   <2e-16 ***
## age           9.1941     0.6306   14.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 654.6 on 5540 degrees of freedom
## Multiple R-squared:  0.03696,    Adjusted R-squared:  0.03678 
## F-statistic: 212.6 on 1 and 5540 DF,  p-value: < 2.2e-16

4.5 Additional Concerns

  1. Concern 1: Outliers are a concern in this data given that many of the values are at extremes of the ranges. For example, some individuals earned close to $3,000 after working less than 10 hours. These individuals are likely managers whose salary skew the data. To address this, I would remove the top 5% outliers

  2. Concern 2: This model does not consider whether the employee was working full- or part-time. To address this, I would make sure to add this variable to the overal model to account for its influence.

  3. Concern 3: Lastly, gender is a likely influencer in weekly earnings and it should be included in the model. Below is an graph that shows the difference between pay between men and women, regardless of their age.

5.1 Modify Model Q4

A generalized form of the equation

Weekly earnings = 242.13 + 9.06(age) - 151 (sex) + 15.19 (hours_worked) - 345.34 (as.factor(part-time))

## 
## Call:
## lm(formula = weekly_earnings ~ age + as.factor(sex) + hours_worked + 
##     as.factor(full_or_part_time), data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1482.8  -393.3  -146.5   219.4  2820.8 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     91.1355    45.0100   2.025   0.0429 *  
## age                              9.0609     0.5652  16.032   <2e-16 ***
## as.factor(sex)2               -150.9902    16.1485  -9.350   <2e-16 ***
## hours_worked                    15.1904     0.8362  18.165   <2e-16 ***
## as.factor(full_or_part_time)2 -345.3448    27.1803 -12.706   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 585.9 on 5537 degrees of freedom
## Multiple R-squared:  0.2288, Adjusted R-squared:  0.2283 
## F-statistic: 410.8 on 4 and 5537 DF,  p-value: < 2.2e-16

5.2 Estimated Regression

To estimate weekly earnings for a 38 y/o male who works 80 hours a month as a part-timer: Weekly earnings = 242.13 + 9.06(age) - 151 (sex) + 15.19 (hours_worked) - 345.34 (as.factor(part-time)) Weekly earnings = $1,305.33

## Warning in as.data.frame.numeric(age, sex, hours_worked,
## full_or_part_time): 'row.names' is not a character vector of length 1 --
## omitting it. Will be an error!
##        1 
## 1305.332

5.3 Are any independent variables collinear?

I do not suspect any of the independent variables to be collinear because each of the variables are unique enough from one another. The table below shows that none of the variables have correlation values above .5 which means that are not collinear.

##             hours_worked full_or_part_time
## sex         -0.218094436        0.15208703
## citizenship  0.005233085       -0.03281023
## age         -0.003209048       -0.03457020

Additionally, the vif’s for each individual variable is within the acceptable range (less than 5) while the vif for the entire model is also within the acceptable range at 1.5.

##                          age               as.factor(sex) 
##                     1.002630                     1.051816 
##                 hours_worked as.factor(full_or_part_time) 
##                     1.530502                     1.494447
## [1] 1.296741

5.4 Comfortable range of values for estimation

To estimate future observations using this model, I am comfortable using a range of:

Weekly earnings between $0 and $2,820 Between the ages of 25 and 65 Are females And work 100 hours part-time

5.5 Hypothetical Observation

A 40 y/o female that works 100 part-time hours is estimated to earn $1,476.

## Warning in as.data.frame.numeric(age, sex, hours_worked,
## full_or_part_time): 'row.names' is not a character vector of length 1 --
## omitting it. Will be an error!
##       1 
## 1476.27