Census data was collected to explore employment. The data includes employees' weekly earnings along with other information about them, including occupation, gender, age, geographic location, and telecommute status. We will explore how these variables affect weekly earnings and build a model that predicts weekly earnings from randomly chosen values of the variables found to be statistically significant.
| Package | Summary |
|---|---|
| tidyverse | The tidyverse collection of packages |
| skimr | Quick data check tool |
| car | Used for testing linearity of regression models |
| kableExtra | Formatting Data Tables |

| Variable | Transformation |
|---|---|
| telecommute | 1 = Telecommute |
| | 2 = Traditional |
| hourly_non_hourly | 1 = Hourly Worker |
| | 2 = Salaried Worker |
| | -1 = Not in Universe |
| sex | 1 = male |
| | 2 = female |
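
The recoding in the table can be sketched with base R's `factor()`. This is a minimal illustration on made-up code vectors, not the actual census data; the vector names `telecommute_raw` and `hourly_raw` are hypothetical, and only the code-to-label pairs come from the table.

```r
# Toy vectors of raw survey codes (illustrative values, not the census data)
telecommute_raw <- c(1, 2, 2, 1)
hourly_raw <- c(1, 2, -1, 2)

# Map each numeric code to its label from the transformation table
telecommute <- factor(telecommute_raw, levels = c(1, 2),
                      labels = c("Telecommute", "Traditional"))
hourly_non_hourly <- factor(hourly_raw, levels = c(1, 2, -1),
                            labels = c("Hourly Worker", "Salaried Worker",
                                       "Not in Universe"))

# The first level acts as the reference category when a factor enters lm()
print(levels(telecommute))
print(table(hourly_non_hourly))
```

Because the first level is the reference category, a regression coefficient reported for "Traditional" is a difference relative to "Telecommute".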
Teleworking does appear to have a statistically significant association with weekly earnings. The summary table below shows that the p-value for the telecommute independent variable (< 2e-16) is well below 0.05, which means that the variable is statistically significant.
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(telecommute) 1 1.451e+08 145093488 346.5 <2e-16 ***
## Residuals 5540 2.320e+09 418730
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
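
As a rough check on how the F value and p-value in this table relate, the sketch below recomputes them from the printed sums of squares. The inputs are the rounded figures from the output, so the result is approximate; the variable names are illustrative.

```r
# Values copied from the ANOVA table
ss_between <- 1.451e+08   # Sum Sq for as.factor(telecommute)
df_between <- 1
ms_resid   <- 418730      # Mean Sq for Residuals
df_resid   <- 5540

# F is the between-group mean square divided by the residual mean square
f_value <- (ss_between / df_between) / ms_resid

# The upper tail of the F distribution gives the p-value
p_value <- pf(f_value, df_between, df_resid, lower.tail = FALSE)

print(round(f_value, 1))
print(p_value < 0.05)
```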
The plot below shows that mean weekly earnings are significantly higher for telecommuters than for those who have a traditional commute, or simply, those who do not telecommute. While the range for both groups is similar, Q1 and Q3 are both higher for telecommuters.
weekly_earnings = b0 + b1(hours_worked)
##
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = telework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1758.0 -411.2 -185.1 230.4 2796.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.0433 28.5766 2.311 0.0209 *
## hours_worked 22.5887 0.7072 31.943 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 613 on 5540 degrees of freedom
## Multiple R-squared: 0.1555, Adjusted R-squared: 0.1554
## F-statistic: 1020 on 1 and 5540 DF, p-value: < 2.2e-16
The table above shows the relationship between hours worked and weekly earnings. Based on the p-value, hours worked is statistically significant. The coefficient for hours worked is 22.59 and the intercept is 66.04. This means that, starting from a baseline of $66.04, weekly earnings increase by $22.59 for every additional hour worked. The residual standard error is 613 on 5540 degrees of freedom, meaning a prediction will typically fall about $613 away from the actual weekly earnings when using only hours worked to make a prediction. Finally, the R-squared is only 0.1555, meaning that 15.55% of the variance in weekly earnings is explained by hours worked alone. All of this can be summarized in the estimated equation weekly_earnings = 66.04 + 22.59(hours_worked).
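
To make the estimated equation concrete, here is a small sketch that plugs a few hours-worked values into it and cross-checks the reported R-squared against the F-statistic. The coefficients are copied from the summary output; the helper `predict_weekly_earnings` and the chosen hours are illustrative assumptions.

```r
# Coefficients copied from the summary output
b0 <- 66.0433   # intercept
b1 <- 22.5887   # slope for hours_worked

# Hypothetical helper that applies the estimated equation
predict_weekly_earnings <- function(hours_worked) b0 + b1 * hours_worked

hours <- c(20, 40, 60)
predicted <- predict_weekly_earnings(hours)
print(data.frame(hours_worked = hours, predicted = round(predicted, 2)))

# Cross-check: with one predictor, R-squared = F / (F + residual df)
f_stat <- 1020
df_resid <- 5540
r_squared <- f_stat / (f_stat + df_resid)
print(round(r_squared, 4))
```

The cross-check uses the rounded F-statistic, so it only agrees with the reported R-squared to about three or four decimal places.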
This is a naïve model because there are many other factors that determine how much someone makes. First, there are both hourly and salaried workers. A salaried worker gets paid the same amount whether they work 10 or 100 hours a week. On the other hand, hours worked directly affects how much an hourly worker makes in a week. Other variables, such as occupation and education, can be correlated with hours worked. It is reasonable to expect an investment banking associate right out of an Ivy League school to work 70+ hours a week, while a retail worker may only work 25 to 30 hours a week. While we would expect the person working on Wall Street to make more money, it is not just hours worked that determines how they are paid. Finally, there are other variables, such as geography, that would have a major impact on weekly earnings. If that retail worker working 25 hours a week lives in Seattle or New York, they would be making at least $15 an hour, while someone working the same job and hours in Iowa may only be making the federal minimum wage of $7.25. The more than doubled weekly earnings have nothing to do with hours worked; they are determined by the laws of the state where the worker is employed.
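
The retail-worker comparison can be worked through directly. This sketch uses only the figures from the paragraph (25 hours, $15 and $7.25 per hour); the variable names are illustrative.

```r
hours <- 25
wage_city <- 15.00     # hourly wage assumed in the text for Seattle or New York
wage_federal <- 7.25   # federal minimum wage cited in the text

# Same hours, different wage floors
weekly_city <- hours * wage_city
weekly_federal <- hours * wage_federal

print(c(city = weekly_city,
        federal = weekly_federal,
        ratio = weekly_city / weekly_federal))
```

Hours worked are identical in both cases, so a model with hours as its only predictor would assign both workers the same predicted earnings.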
To better model these two variables, hours worked can be changed from a continuous variable to a factor variable. Other than that, I do not think it makes sense to use hours worked alone to predict weekly earnings unless some sort of filtering is done to control for other things, such as salaried versus non-salaried workers.
##
## Call:
## lm(formula = weekly_earnings ~ as.factor(hours_worked), data = telework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1632.5 -386.5 -153.8 213.5 2531.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 990.815 86.776 11.418 < 2e-16 ***
## as.factor(hours_worked)2 -442.103 279.845 -1.580 0.114208
## as.factor(hours_worked)3 -896.131 279.845 -3.202 0.001371 **
## as.factor(hours_worked)4 -514.070 199.260 -2.580 0.009909 **
## as.factor(hours_worked)5 -621.672 192.414 -3.231 0.001241 **
## as.factor(hours_worked)6 -612.339 192.414 -3.182 0.001469 **
## as.factor(hours_worked)7 -610.509 216.458 -2.820 0.004813 **
## as.factor(hours_worked)8 -419.548 136.345 -3.077 0.002101 **
## as.factor(hours_worked)9 -671.065 309.854 -2.166 0.030373 *
## as.factor(hours_worked)10 -625.625 133.938 -4.671 3.07e-06 ***
## as.factor(hours_worked)11 -739.127 257.907 -2.866 0.004175 **
## as.factor(hours_worked)12 -763.864 142.021 -5.379 7.82e-08 ***
## as.factor(hours_worked)13 -245.565 309.854 -0.793 0.428092
## as.factor(hours_worked)14 -748.198 309.854 -2.415 0.015782 *
## as.factor(hours_worked)15 -765.548 131.762 -5.810 6.60e-09 ***
## as.factor(hours_worked)16 -515.392 118.174 -4.361 1.32e-05 ***
## as.factor(hours_worked)17 -752.965 309.854 -2.430 0.015128 *
## as.factor(hours_worked)18 -677.350 164.900 -4.108 4.06e-05 ***
## as.factor(hours_worked)19 -612.935 429.521 -1.427 0.153631
## as.factor(hours_worked)20 -637.923 98.428 -6.481 9.91e-11 ***
## as.factor(hours_worked)21 -665.224 257.907 -2.579 0.009926 **
## as.factor(hours_worked)22 -397.373 161.732 -2.457 0.014042 *
## as.factor(hours_worked)23 -612.882 216.458 -2.831 0.004651 **
## as.factor(hours_worked)24 -325.775 104.720 -3.111 0.001875 **
## as.factor(hours_worked)25 -586.618 106.279 -5.520 3.55e-08 ***
## as.factor(hours_worked)26 -583.788 186.426 -3.131 0.001748 **
## as.factor(hours_worked)27 -573.860 156.152 -3.675 0.000240 ***
## as.factor(hours_worked)28 -563.276 140.478 -4.010 6.16e-05 ***
## as.factor(hours_worked)29 -348.008 192.414 -1.809 0.070562 .
## as.factor(hours_worked)30 -387.214 96.968 -3.993 6.60e-05 ***
## as.factor(hours_worked)31 -429.634 216.458 -1.985 0.047213 *
## as.factor(hours_worked)32 -126.997 98.231 -1.293 0.196122
## as.factor(hours_worked)33 -103.723 199.260 -0.521 0.602708
## as.factor(hours_worked)34 -71.693 158.827 -0.451 0.651727
## as.factor(hours_worked)35 -326.612 97.506 -3.350 0.000815 ***
## as.factor(hours_worked)36 -161.189 107.695 -1.497 0.134524
## as.factor(hours_worked)37 -204.910 124.076 -1.651 0.098697 .
## as.factor(hours_worked)38 -179.297 108.367 -1.655 0.098076 .
## as.factor(hours_worked)39 -218.162 151.386 -1.441 0.149616
## as.factor(hours_worked)40 -44.338 87.550 -0.506 0.612577
## as.factor(hours_worked)41 292.709 176.422 1.659 0.097144 .
## as.factor(hours_worked)42 -4.121 117.214 -0.035 0.971953
## as.factor(hours_worked)43 -6.798 133.938 -0.051 0.959525
## as.factor(hours_worked)44 102.695 127.977 0.802 0.422325
## as.factor(hours_worked)45 263.183 94.735 2.778 0.005486 **
## as.factor(hours_worked)46 -135.655 147.264 -0.921 0.357006
## as.factor(hours_worked)47 132.231 192.414 0.687 0.491972
## as.factor(hours_worked)48 138.669 117.686 1.178 0.238731
## as.factor(hours_worked)49 81.115 207.176 0.392 0.695424
## as.factor(hours_worked)50 496.916 92.357 5.380 7.74e-08 ***
## as.factor(hours_worked)51 606.676 257.907 2.352 0.018693 *
## as.factor(hours_worked)52 -94.109 145.404 -0.647 0.517515
## as.factor(hours_worked)53 398.068 192.414 2.069 0.038611 *
## as.factor(hours_worked)54 138.817 309.854 0.448 0.654165
## as.factor(hours_worked)55 517.677 107.063 4.835 1.37e-06 ***
## as.factor(hours_worked)56 226.877 156.152 1.453 0.146301
## as.factor(hours_worked)57 -125.435 429.521 -0.292 0.770271
## as.factor(hours_worked)58 91.446 186.426 0.491 0.623783
## as.factor(hours_worked)59 -119.815 429.521 -0.279 0.780293
## as.factor(hours_worked)60 501.582 98.167 5.109 3.34e-07 ***
## as.factor(hours_worked)62 305.951 257.907 1.186 0.235562
## as.factor(hours_worked)63 1044.503 279.845 3.732 0.000192 ***
## as.factor(hours_worked)64 -1.735 241.018 -0.007 0.994256
## as.factor(hours_worked)65 597.004 149.254 4.000 6.42e-05 ***
## as.factor(hours_worked)66 -40.308 309.854 -0.130 0.896503
## as.factor(hours_worked)68 -244.982 354.263 -0.692 0.489265
## as.factor(hours_worked)70 691.676 149.254 4.634 3.67e-06 ***
## as.factor(hours_worked)72 -298.515 601.204 -0.497 0.619541
## as.factor(hours_worked)74 788.025 429.521 1.835 0.066611 .
## as.factor(hours_worked)75 379.937 279.845 1.358 0.174626
## as.factor(hours_worked)76 316.875 429.521 0.738 0.460705
## as.factor(hours_worked)80 814.039 186.426 4.367 1.29e-05 ***
## as.factor(hours_worked)81 739.945 601.204 1.231 0.218462
## as.factor(hours_worked)84 695.665 257.907 2.697 0.007011 **
## as.factor(hours_worked)86 393.185 601.204 0.654 0.513143
## as.factor(hours_worked)87 -56.815 601.204 -0.095 0.924713
## as.factor(hours_worked)90 6.935 309.854 0.022 0.982145
## as.factor(hours_worked)96 1893.795 601.204 3.150 0.001642 **
## as.factor(hours_worked)98 1893.185 601.204 3.149 0.001647 **
## as.factor(hours_worked)99 767.925 354.263 2.168 0.030227 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 594.9 on 5462 degrees of freedom
## Multiple R-squared: 0.2157, Adjusted R-squared: 0.2044
## F-statistic: 19.02 on 79 and 5462 DF, p-value: < 2.2e-16
When hours worked is changed to a factor, R-squared improves from 0.1555 to 0.2157 (adjusted R-squared from 0.1554 to 0.2044) and the residual standard error improves from 613 on 5540 degrees of freedom to 594.9 on 5462 degrees of freedom. This is a slight improvement over the original model, but many of the individual hours-worked levels are not statistically significant.
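
Because the factor model estimates 79 coefficients instead of one, part of its R-squared gain is offset once the extra parameters are penalized. The sketch below applies the standard adjusted R-squared formula to the reported values; the sample size of 5542 is inferred from the residual degrees of freedom (5540 + 2), and the helper name `adjusted_r2` is an assumption.

```r
# Adjusted R-squared penalizes R-squared for the number of predictors p
adjusted_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 5542  # inferred: 5540 residual df + 2 coefficients in the continuous model

adj_continuous <- adjusted_r2(0.1555, n, p = 1)   # hours_worked as a number
adj_factor     <- adjusted_r2(0.2157, n, p = 79)  # hours_worked as a factor

# Reported values are 0.1554 and 0.2044; small gaps come from rounded inputs
print(round(c(continuous = adj_continuous, factor = adj_factor), 3))
```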
weekly_earnings = b0 + b1(age)
##
## Call:
## lm(formula = weekly_earnings ~ age, data = telework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1245.3 -445.3 -178.1 284.7 2133.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 548.9457 28.2350 19.44 <2e-16 ***
## age 9.1941 0.6306 14.58 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 654.6 on 5540 degrees of freedom
## Multiple R-squared: 0.03696, Adjusted R-squared: 0.03678
## F-statistic: 212.6 on 1 and 5540 DF, p-value: < 2.2e-16
The table above shows that the intercept is 548.9457 and the age coefficient is 9.1941. The p-values show that both are statistically significant. The residual standard error is 654.6 on 5540 degrees of freedom, meaning a prediction will typically fall about $654.60 away from the actual weekly earnings when using age to make predictions. Finally, the R-squared is 0.03696, meaning age explains only 3.7% of the variance in weekly earnings.
The estimated form of the regression is weekly_earnings = 548.95 + 9.19(age).
This is a naïve model for many reasons. First, as with the previous model, there are many other independent or correlated variables that would improve accuracy. Next, the R-squared is extremely low, meaning the model explains very little of the variance. Finally, age cannot increase infinitely, and as someone ages they will most likely make less money or, at the very least, stop receiving the promotions that lead to increases in weekly earnings.
Above are two plots showing the relationship between age and weekly earnings. The first is a simple ggplot of age on the x-axis and weekly earnings on the y-axis, fitted with a line using linear modeling. This line is upward sloping and matches the estimated regression from part 3b. The second is a crPlot testing the linearity of the data. As you can see, the pink line starts with an upward slope, but at around age 40 it stops increasing, suggesting that the relationship is not linear.
weekly_earnings = b0 + b1(age) + b2(age)^2 + b3(sex) + b4(hourly_non_hourly) + b5(telecommute)
##
## Call:
## lm(formula = weekly_earnings ~ age + I(age * age) + sex + hourly_non_hourly +
## telecommute, data = telework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1467.1 -359.0 -101.7 232.1 2436.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -313.85642 70.91403 -4.426 9.79e-06 ***
## age 46.44067 3.32871 13.952 < 2e-16 ***
## I(age * age) -0.44689 0.03734 -11.968 < 2e-16 ***
## sexmale 221.56198 15.29148 14.489 < 2e-16 ***
## hourly_non_hourlyNonhourly Worker 481.80686 16.07149 29.979 < 2e-16 ***
## telecommuteTraditional -208.91662 17.00905 -12.283 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 567.1 on 5536 degrees of freedom
## Multiple R-squared: 0.2777, Adjusted R-squared: 0.277
## F-statistic: 425.7 on 5 and 5536 DF, p-value: < 2.2e-16
The summary output results in the below estimated regression equation:
weekly_earnings = -313.86 + 46.44(age) - 0.45(age)^2 + 221.56(male) + 481.81(Salaried Worker) - 208.92(Traditional)
All of the coefficients are statistically significant according to their respective p-values. The R-squared for the regression is 0.2777, meaning 27.8% of the variation in weekly earnings can be explained using this model. The residual standard error is 567.1 on 5536 degrees of freedom, meaning a prediction will typically fall about $567.10 away from the actual weekly earnings when using this regression equation.
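
One consequence of the negative age-squared coefficient is that predicted earnings rise with age only up to a point. As a sketch, the turning point of the fitted quadratic and the marginal effect of one more year of age can be computed from the two age coefficients; the variable and function names here are illustrative.

```r
# Age coefficients copied from the summary output
b_age    <- 46.44067
b_age_sq <- -0.44689

# Vertex of b_age * age + b_age_sq * age^2: where predicted earnings peak
peak_age <- -b_age / (2 * b_age_sq)
print(round(peak_age, 1))

# Marginal effect of one additional year at a given age (derivative in age)
marginal_effect <- function(age) b_age + 2 * b_age_sq * age
print(round(marginal_effect(c(25, 40, 60)), 2))
```

Under these estimates the marginal effect of age shrinks steadily and turns negative after the peak, holding the other variables fixed.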
I think the only variables that could be collinear are whether a worker is hourly or salaried and whether an employee telecommutes. I would expect a lower probability that an hourly worker is able to telecommute, but that may not be the case.
Age is the only independent variable that has a range, as the others are all factors. The minimum and maximum ages in the dataset are 15 and 85, respectively, so I would not feel comfortable estimating an employee's weekly earnings outside that range. For the dependent variable, the dataset has a range of 7.5 to 2,884.61, so I would not feel comfortable with a predicted value over 3,000.
| Variable | Value |
|---|---|
| Age | 37 |
| Hourly_Non_Hourly | Salaried Worker |
| Sex | Male |
| Telecommute | Telecommute |
## 1
## 1496.02
Estimated: 1,496.02
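
The estimate can be reproduced by hand from the fitted coefficients. The sketch below plugs the values from the table (age 37, salaried, male, telecommute) into the regression; the coefficient vector is copied from the summary output and its element names are illustrative.

```r
# Coefficients copied from the summary output
coefs <- c(intercept   = -313.85642,
           age         = 46.44067,
           age_sq      = -0.44689,
           male        = 221.56198,
           nonhourly   = 481.80686,
           traditional = -208.91662)

# Inputs: age 37, male, salaried (non-hourly), telecommuter (Traditional = 0)
x <- c(1, 37, 37^2, 1, 1, 0)

estimate <- sum(coefs * x)
print(round(estimate, 2))
```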
High:

weekly_earnings = -313.86 + (46.44 + 2 × 3.33)(age) + (-0.45 + 2 × 0.04)(age)^2 + (221.56 + 2 × 15.29)(male) + (481.81 + 2 × 16.07)(Salaried Worker) + (-208.92 + 2 × 17.01)(Traditional)

weekly_earnings = -313.86 + 53.10(age) - 0.37(age)^2 + 252.14(male) + 513.95(Salaried Worker) - 174.90(Traditional)

weekly_earnings = 1,910.40

Low:

weekly_earnings = -313.86 + (46.44 - 2 × 3.33)(age) + (-0.45 - 2 × 0.04)(age)^2 + (221.56 - 2 × 15.29)(male) + (481.81 - 2 × 16.07)(Salaried Worker) + (-208.92 - 2 × 17.01)(Traditional)

weekly_earnings = -313.86 + 39.78(age) - 0.53(age)^2 + 190.98(male) + 449.67(Salaried Worker) - 242.94(Traditional)

weekly_earnings = 1,073.08
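
The high and low figures can be reproduced with a few lines of R. This sketch follows the approach used in this report: each slope is shifted by two standard errors (rounded to two decimals) while the intercept is held fixed. It is not a formal prediction interval; with the fitted model object available, `predict()` with `interval = "prediction"` would be the usual route.

```r
intercept <- -313.86

# Rounded slope estimates and standard errors, as used in the equations
est <- c(age = 46.44, age_sq = -0.45, male = 221.56,
         salaried = 481.81, traditional = -208.92)
se  <- c(age = 3.33, age_sq = 0.04, male = 15.29,
         salaried = 16.07, traditional = 17.01)

# Inputs: age 37, male, salaried, telecommuter (Traditional = 0)
x <- c(37, 37^2, 1, 1, 0)

# Shift every slope up or down by two standard errors
high <- intercept + sum((est + 2 * se) * x)
low  <- intercept + sum((est - 2 * se) * x)

print(round(c(low = low, high = high), 2))
```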