The data contains information on over 5,500 employees and 31% of those individuals worked from home, or telecommuted. The remaining 69% commuted to work and listed below is the average salary of those 2 groups. Listed first are Telecommuters with an average salary approximately $1,100
In this graphic, we note a wider range in salaries for telecommuters than commuters and that the average salary is higher for a telecommuter than non-telecommuter.
It does appear that teleworking has a significant effect on income given the p-value < 0.05. These values are provided by an analysis of variance test. On average, commuters make $350.76 less as compared to telecommuters,
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(telecommute) 1 1.451e+08 145093488 346.5 <2e-16 ***
## Residuals 5540 2.320e+09 418730
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = weekly_earnings ~ as.factor(telecommute), data = telework)
##
## $`as.factor(telecommute)`
## diff lwr upr p adj
## 2-1 -350.7614 -387.7015 -313.8213 0
This model can be considered a naïve model because it is only using one dependent and one independent variable. It also does not account for any outside influencers that might also have an effect on the variability. For example, full- and part-time employees often times receive pay at different rates. Also, this model does not account for an employee’s position or tenure at an organization which could also contribute to higher wages for some.
My hypothesis is that hourly workers earn less than non-hourly workers, whether they worked from home or not. I wanted to test this hypothesis and added hourly_non_hourly to my analysis to see the effect it has on predicting weekly earnings.
## Df Sum Sq
## as.factor(telecommute) 1 1.451e+08
## as.factor(hourly_non_hourly) 1 3.755e+08
## as.factor(telecommute):as.factor(hourly_non_hourly) 1 9.009e+06
## Residuals 5538 1.935e+09
## Mean Sq F value
## as.factor(telecommute) 145093488 415.20
## as.factor(hourly_non_hourly) 375496285 1074.53
## as.factor(telecommute):as.factor(hourly_non_hourly) 9009218 25.78
## Residuals 349451
## Pr(>F)
## as.factor(telecommute) < 2e-16 ***
## as.factor(hourly_non_hourly) < 2e-16 ***
## as.factor(telecommute):as.factor(hourly_non_hourly) 3.95e-07 ***
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = weekly_earnings ~ as.factor(telecommute) * as.factor(hourly_non_hourly), data = telework)
##
## $`as.factor(telecommute)`
## diff lwr upr p adj
## 2-1 -350.7614 -384.5076 -317.0153 0
##
## $`as.factor(hourly_non_hourly)`
## diff lwr upr p adj
## 2-1 512.5514 481.0693 544.0335 0
##
## $`as.factor(telecommute):as.factor(hourly_non_hourly)`
## diff lwr upr p adj
## 2:1-1:1 -125.7199 -191.1042 -60.33548 4.8e-06
## 1:2-1:1 662.9527 587.9357 737.96978 0.0e+00
## 2:2-1:1 357.5900 286.4375 428.74251 0.0e+00
## 1:2-2:1 788.6726 732.0702 845.27496 0.0e+00
## 2:2-2:1 483.3099 431.9392 534.68055 0.0e+00
## 2:2-1:2 -305.3627 -368.5402 -242.18525 0.0e+00
The independent variables, and the interaction between them, are statistically significant because their p-values are less than 0.05. The largest difference exists between salaried telecommuters and commuters who work per hour, who on average earned $788.67 more hourly commuters. This could likely be explained by the type of job that telecommuters are doing from home. For this reason I consider this to be a naïve model. The additional variable certainly helps strengthen the overall model, however, additional variables should continue to be analyzed to ensure that all relevant variables are captured.
In this graphic we see that on average salaried telecommuters (blue box on left) make more than commuters who work per hour (red box on right)
In our first linear model, approximately 6% of the variability in weekly earnings is explained by variability in telecommuter status which is not a very good fit.
##
## Call:
## lm(formula = weekly_earnings ~ as.factor(telecommute), data = telework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1170.3 -447.4 -159.4 277.4 2052.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1183.19 15.69 75.43 <2e-16 ***
## as.factor(telecommute)2 -350.76 18.84 -18.61 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 647.1 on 5540 degrees of freedom
## Multiple R-squared: 0.05886, Adjusted R-squared: 0.05869
## F-statistic: 346.5 on 1 and 5540 DF, p-value: < 2.2e-16
Upon adding hourly_non-hourly to the model we see an improvement in fit as the overall standard error of the model decreases and 21% of the variability in weekly earnings can be explained by telecommuter status.
##
## Call:
## lm(formula = weekly_earnings ~ as.factor(telecommute) + as.factor(hourly_non_hourly),
## data = telework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1389.1 -387.7 -125.8 258.2 2241.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 861.40 17.41 49.48 <2e-16 ***
## as.factor(telecommute)2 -218.62 17.72 -12.34 <2e-16 ***
## as.factor(hourly_non_hourly)2 540.66 16.53 32.71 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 592.5 on 5539 degrees of freedom
## Multiple R-squared: 0.2112, Adjusted R-squared: 0.2109
## F-statistic: 741.5 on 2 and 5539 DF, p-value: < 2.2e-16
The difference between these two models is significant given that the p-value for the hypothesis test is below 0.05 when we run an analysis of variance between the two.
## Analysis of Variance Table
##
## Model 1: weekly_earnings ~ as.factor(telecommute)
## Model 2: weekly_earnings ~ as.factor(telecommute) + as.factor(hourly_non_hourly)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 5540 2319766548
## 2 5539 1944270263 1 375496285 1069.7 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Y = 66.04 + 22.69(hours_worked)
##
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = telework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1758.0 -411.2 -185.1 230.4 2796.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.0433 28.5766 2.311 0.0209 *
## hours_worked 22.5887 0.7072 31.943 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 613 on 5540 degrees of freedom
## Multiple R-squared: 0.1555, Adjusted R-squared: 0.1554
## F-statistic: 1020 on 1 and 5540 DF, p-value: < 2.2e-16
Weekly earnings = 66.04 + 22.69(160hours) Weekly earnings = $3,680.24
## 1
## 3680.241
This model is naïve for several reasons. First, this is a simple regression with only one dependent and one independent variable. Second, the model does not account for any interaction between the number of hours worked and at least one other variable. Lastly, when using this model, only 15.5% of the variability in weekly earnings can be explained by hours worked which implies there are other factors that also influence weekly earnings.
To better model these two continuous variables, I account for the interaction between hours worked and itself. A linear model does not fit the data so I convert the data into a parabolic formula to find a better fitting line using the interaction between hours worked and itself. In this model, we note the that the line curves up at an increasing rate as the number of hours worked increase.
Weekly earnings = 548.95 + 9.19(age)
##
## Call:
## lm(formula = weekly_earnings ~ age, data = telework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1245.3 -445.3 -178.1 284.7 2133.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 548.9457 28.2350 19.44 <2e-16 ***
## age 9.1941 0.6306 14.58 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 654.6 on 5540 degrees of freedom
## Multiple R-squared: 0.03696, Adjusted R-squared: 0.03678
## F-statistic: 212.6 on 1 and 5540 DF, p-value: < 2.2e-16
Weekly earnings = 548.95 + 9.19(38) Weekly earnings = $898.32
## 1
## 898.3197
First, this is a naïve model because as a simple regression it only uses one dependent and one indepedent variable. Second, previous models confirm a statistical significance with among other variables, like total number of hours worked and whether the employee telecommutes, that should be considered and included in the overall model. Lastly, there is a weak overall fit of the model as only 3% of the variability in weekly earnings can be explained by age.
In this model, we see that the regression line is under-predicitng a majority of the weekly earnings values.
plot(telework$age, telework$weekly_earnings)
twlm <- lm(weekly_earnings~age, data = telework)
abline(twlm, col="red")
From the data points, it looks as though weekly earnings should increase from ages 20-40, peak or level off between 40 - 60 before the earnings should begin to decrease after age 60.
The regression line indicates that weekly earnings will continue to increase as a person’s age continues to increase which we know is not true.
summary(twlm)
##
## Call:
## lm(formula = weekly_earnings ~ age, data = telework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1245.3 -445.3 -178.1 284.7 2133.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 548.9457 28.2350 19.44 <2e-16 ***
## age 9.1941 0.6306 14.58 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 654.6 on 5540 degrees of freedom
## Multiple R-squared: 0.03696, Adjusted R-squared: 0.03678
## F-statistic: 212.6 on 1 and 5540 DF, p-value: < 2.2e-16
Concern 1: Outliers are a concern in this data given that many of the values are at extremes of the ranges. For example, some individuals earned close to $3,000 after working less than 10 hours. These individuals are likely managers whose salary skew the data. To address this, I would remove the top 5% outliers
Concern 2: This model does not consider whether the employee was working full- or part-time. To address this, I would make sure to add this variable to the overal model to account for its influence.
Concern 3: Lastly, gender is a likely influencer in weekly earnings and it should be included in the model. Below is an graph that shows the difference between pay between men and women, regardless of their age.
Weekly earnings = 242.13 + 9.06(age) - 151 (sex) + 15.19 (hours_worked) - 345.34 (as.factor(part-time))
##
## Call:
## lm(formula = weekly_earnings ~ age + as.factor(sex) + hours_worked +
## as.factor(full_or_part_time), data = telework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1482.8 -393.3 -146.5 219.4 2820.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.1355 45.0100 2.025 0.0429 *
## age 9.0609 0.5652 16.032 <2e-16 ***
## as.factor(sex)2 -150.9902 16.1485 -9.350 <2e-16 ***
## hours_worked 15.1904 0.8362 18.165 <2e-16 ***
## as.factor(full_or_part_time)2 -345.3448 27.1803 -12.706 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 585.9 on 5537 degrees of freedom
## Multiple R-squared: 0.2288, Adjusted R-squared: 0.2283
## F-statistic: 410.8 on 4 and 5537 DF, p-value: < 2.2e-16
To estimate weekly earnings for a 38 y/o male who works 80 hours a month as a part-timer: Weekly earnings = 242.13 + 9.06(age) - 151 (sex) + 15.19 (hours_worked) - 345.34 (as.factor(part-time)) Weekly earnings = $1,305.33
## Warning in as.data.frame.numeric(age, sex, hours_worked,
## full_or_part_time): 'row.names' is not a character vector of length 1 --
## omitting it. Will be an error!
## 1
## 1305.332
I do not suspect any of the independent variables to be collinear because each of the variables are unique enough from one another. The table below shows that none of the variables have correlation values above .5 which means that are not collinear.
## hours_worked full_or_part_time
## sex -0.218094436 0.15208703
## citizenship 0.005233085 -0.03281023
## age -0.003209048 -0.03457020
Additionally, the vif’s for each individual variable is within the acceptable range (less than 5) while the vif for the entire model is also within the acceptable range at 1.5.
## age as.factor(sex)
## 1.002630 1.051816
## hours_worked as.factor(full_or_part_time)
## 1.530502 1.494447
## [1] 1.296741
To estimate future observations using this model, I am comfortable using a range of:
Weekly earnings between $0 and $2,820 Between the ages of 25 and 65 Are females And work 100 hours part-time
A 40 y/o female that works 100 part-time hours is estimated to earn $1,476.
## Warning in as.data.frame.numeric(age, sex, hours_worked,
## full_or_part_time): 'row.names' is not a character vector of length 1 --
## omitting it. Will be an error!
## 1
## 1476.27