Introduction

Census data were collected to explore employment. The data include employees' weekly earnings along with other information about them, including occupation, gender, age, geographic location, and telecommute status. We will explore how these variables affect weekly earnings and build a model to predict weekly earnings from hypothetical values of the variables found to be statistically significant.

Packages Used

Package      Summary
tidyverse    The tidyverse collection of packages
skimr        Quick data check tool
car          Used for testing linearity of regression models
kableExtra   Formatting data tables

Data Cleaning

Variable            Transformation
telecommute         1 = Telecommute; 2 = Traditional
hourly_non_hourly   1 = Hourly Worker; 2 = Salaried Worker; -1 = Not in Universe
sex                 1 = male; 2 = female

1. Using a simple one-way ANOVA, answer the following questions:

a. Does Teleworking appear to have a significant effect on income? How do you know, and explain the effect (if it exists.)

Teleworking does appear to have a significant effect on income. The ANOVA summary below shows an F value of 346.5 for the telecommute factor with a p-value below 2e-16, far under the 0.05 threshold, so the difference in mean weekly earnings between telecommuters and traditional commuters is statistically significant.

##                          Df    Sum Sq   Mean Sq F value Pr(>F)    
## as.factor(telecommute)    1 1.451e+08 145093488   346.5 <2e-16 ***
## Residuals              5540 2.320e+09    418730                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
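As a rough check on how to read this table, the sketch below recomputes the F value and p-value from the printed sums of squares and mean squares. It also computes eta-squared, an effect-size measure that is not part of the original output and is added here only as an illustration. All inputs are copied from the table, so the results are limited by its rounding.

```r
# Values copied from the one-way ANOVA table above
ss_between <- 1.451e+08   # Sum Sq, as.factor(telecommute)
ss_within  <- 2.320e+09   # Sum Sq, Residuals
ms_between <- 145093488   # Mean Sq, as.factor(telecommute)
ms_within  <- 418730      # Mean Sq, Residuals
df_between <- 1
df_within  <- 5540

# F is the ratio of between-group to within-group mean squares
f_value <- ms_between / ms_within

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

# Eta-squared (an added illustration): share of total variation
# in weekly earnings attributable to the telecommute factor
eta_squared <- ss_between / (ss_between + ss_within)

round(f_value, 1)      # 346.5
round(eta_squared, 3)  # 0.059
p_value < 2e-16        # TRUE
```

Read this way, telecommute status is clearly significant but accounts for only about 6% of the total variation in weekly earnings.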

b. Provide a visualization that helps a reader understand the model

The plot below shows that weekly earnings for telecommuters are noticeably higher than for those who have a traditional commute, or simply, those who do not telecommute. While the ranges for telecommuters and traditional commuters are similar, Q1 and Q3 are both higher for telecommuters.

2. Build a simple regression model estimating weekly earnings by hours worked. Do not make any transformations at this point—only generate a naïve model and answer the following questions:

a. Write the generalized form of the regression using beta notation

weekly_earnings = b0 + b1(hours_worked)

b. Write the estimated form of the regression using your results

## 
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1758.0  -411.2  -185.1   230.4  2796.0 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   66.0433    28.5766   2.311   0.0209 *  
## hours_worked  22.5887     0.7072  31.943   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 613 on 5540 degrees of freedom
## Multiple R-squared:  0.1555, Adjusted R-squared:  0.1554 
## F-statistic:  1020 on 1 and 5540 DF,  p-value: < 2.2e-16

The table above shows the relationship between hours worked and weekly earnings. Based on its p-value, hours worked is statistically significant. The coefficient for hours worked is 22.59 and the intercept is 66.04. This means that predicted weekly earnings start at $66.04 for zero hours worked and increase by $22.59 for every additional hour worked. The table also reports a residual standard error (RMSE) of 613 on 5540 degrees of freedom, meaning that a typical prediction will fall about $613 away from the actual weekly earnings when only hours worked is used. Finally, the R-squared is only 0.1555, meaning that 15.55% of the variance in weekly earnings is explained by hours worked alone. All of this can be summarized in the estimated equation weekly_earnings = 66.04 + 22.59(hours_worked).
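To make the estimated equation concrete, here is a minimal sketch that plugs the reported coefficients into a small helper and recovers R-squared from the printed F-statistic. The helper name predict_earnings and the 40-hour example are assumptions made for illustration. The F-statistic is printed to only four significant digits, so the recovered R-squared is approximate.

```r
# Coefficients copied from the summary output above
b0 <- 66.0433   # (Intercept)
b1 <- 22.5887   # hours_worked

# Hypothetical helper: predicted weekly earnings for a given number of hours
predict_earnings <- function(hours) b0 + b1 * hours

predict_earnings(40)                         # 969.5913
predict_earnings(41) - predict_earnings(40)  # the slope, about 22.59

# R-squared recovered from the F-statistic: F * df1 / (F * df1 + df2)
f_stat <- 1020
df1 <- 1
df2 <- 5540
r_squared <- (f_stat * df1) / (f_stat * df1 + df2)
round(r_squared, 4)  # 0.1555
```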

c. Provide at least 3 explanations for why we would consider this a naïve model.

This is a naïve model because there are many other factors that determine how much someone makes. First, there are both hourly and salaried workers. A salaried worker gets paid the same amount whether they work 10 or 100 hours a week. On the other hand, hours worked directly affects how much an hourly worker makes in a week. Second, other variables such as occupation and education can be correlated with hours worked. It is reasonable to expect an investment banking associate right out of an Ivy League school to work 70+ hours a week, while a retail worker may only work something like 25-30 hours a week. While we would expect the person working on Wall Street to make more money, it is not just hours worked that determines how they are paid. Finally, there are other variables such as geography that would have a major impact on weekly earnings. If that retail worker working 25 hours a week lives in Seattle or New York, they would be making at least $15 an hour, while someone working the same job and hours in Iowa may only be making the federal minimum wage of $7.25. The more than doubled weekly earnings have nothing to do with hours worked but are determined by the state laws where they are employed.

d. In your opinion, what do you think should be done to better model these two variables or do you think it does not make sense to model one as a function of the other under any circumstances? Provide a visualization to support your position.

To better model these two variables, hours worked can be changed from a continuous variable to a factor. Other than that, I do not think it makes sense to use hours worked alone to predict weekly earnings unless some sort of filtering is done to control for other things, such as salaried versus hourly workers.

e. If your recommendation from part d above is possible with the data available, do your best to execute your proposed adjustments. If your recommendation from part d is not possible using this dataset, propose another solution using additional variables if possible, a different data set or another strategy.

## 
## Call:
## lm(formula = weekly_earnings ~ as.factor(hours_worked), data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1632.5  -386.5  -153.8   213.5  2531.7 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                990.815     86.776  11.418  < 2e-16 ***
## as.factor(hours_worked)2  -442.103    279.845  -1.580 0.114208    
## as.factor(hours_worked)3  -896.131    279.845  -3.202 0.001371 ** 
## as.factor(hours_worked)4  -514.070    199.260  -2.580 0.009909 ** 
## as.factor(hours_worked)5  -621.672    192.414  -3.231 0.001241 ** 
## as.factor(hours_worked)6  -612.339    192.414  -3.182 0.001469 ** 
## as.factor(hours_worked)7  -610.509    216.458  -2.820 0.004813 ** 
## as.factor(hours_worked)8  -419.548    136.345  -3.077 0.002101 ** 
## as.factor(hours_worked)9  -671.065    309.854  -2.166 0.030373 *  
## as.factor(hours_worked)10 -625.625    133.938  -4.671 3.07e-06 ***
## as.factor(hours_worked)11 -739.127    257.907  -2.866 0.004175 ** 
## as.factor(hours_worked)12 -763.864    142.021  -5.379 7.82e-08 ***
## as.factor(hours_worked)13 -245.565    309.854  -0.793 0.428092    
## as.factor(hours_worked)14 -748.198    309.854  -2.415 0.015782 *  
## as.factor(hours_worked)15 -765.548    131.762  -5.810 6.60e-09 ***
## as.factor(hours_worked)16 -515.392    118.174  -4.361 1.32e-05 ***
## as.factor(hours_worked)17 -752.965    309.854  -2.430 0.015128 *  
## as.factor(hours_worked)18 -677.350    164.900  -4.108 4.06e-05 ***
## as.factor(hours_worked)19 -612.935    429.521  -1.427 0.153631    
## as.factor(hours_worked)20 -637.923     98.428  -6.481 9.91e-11 ***
## as.factor(hours_worked)21 -665.224    257.907  -2.579 0.009926 ** 
## as.factor(hours_worked)22 -397.373    161.732  -2.457 0.014042 *  
## as.factor(hours_worked)23 -612.882    216.458  -2.831 0.004651 ** 
## as.factor(hours_worked)24 -325.775    104.720  -3.111 0.001875 ** 
## as.factor(hours_worked)25 -586.618    106.279  -5.520 3.55e-08 ***
## as.factor(hours_worked)26 -583.788    186.426  -3.131 0.001748 ** 
## as.factor(hours_worked)27 -573.860    156.152  -3.675 0.000240 ***
## as.factor(hours_worked)28 -563.276    140.478  -4.010 6.16e-05 ***
## as.factor(hours_worked)29 -348.008    192.414  -1.809 0.070562 .  
## as.factor(hours_worked)30 -387.214     96.968  -3.993 6.60e-05 ***
## as.factor(hours_worked)31 -429.634    216.458  -1.985 0.047213 *  
## as.factor(hours_worked)32 -126.997     98.231  -1.293 0.196122    
## as.factor(hours_worked)33 -103.723    199.260  -0.521 0.602708    
## as.factor(hours_worked)34  -71.693    158.827  -0.451 0.651727    
## as.factor(hours_worked)35 -326.612     97.506  -3.350 0.000815 ***
## as.factor(hours_worked)36 -161.189    107.695  -1.497 0.134524    
## as.factor(hours_worked)37 -204.910    124.076  -1.651 0.098697 .  
## as.factor(hours_worked)38 -179.297    108.367  -1.655 0.098076 .  
## as.factor(hours_worked)39 -218.162    151.386  -1.441 0.149616    
## as.factor(hours_worked)40  -44.338     87.550  -0.506 0.612577    
## as.factor(hours_worked)41  292.709    176.422   1.659 0.097144 .  
## as.factor(hours_worked)42   -4.121    117.214  -0.035 0.971953    
## as.factor(hours_worked)43   -6.798    133.938  -0.051 0.959525    
## as.factor(hours_worked)44  102.695    127.977   0.802 0.422325    
## as.factor(hours_worked)45  263.183     94.735   2.778 0.005486 ** 
## as.factor(hours_worked)46 -135.655    147.264  -0.921 0.357006    
## as.factor(hours_worked)47  132.231    192.414   0.687 0.491972    
## as.factor(hours_worked)48  138.669    117.686   1.178 0.238731    
## as.factor(hours_worked)49   81.115    207.176   0.392 0.695424    
## as.factor(hours_worked)50  496.916     92.357   5.380 7.74e-08 ***
## as.factor(hours_worked)51  606.676    257.907   2.352 0.018693 *  
## as.factor(hours_worked)52  -94.109    145.404  -0.647 0.517515    
## as.factor(hours_worked)53  398.068    192.414   2.069 0.038611 *  
## as.factor(hours_worked)54  138.817    309.854   0.448 0.654165    
## as.factor(hours_worked)55  517.677    107.063   4.835 1.37e-06 ***
## as.factor(hours_worked)56  226.877    156.152   1.453 0.146301    
## as.factor(hours_worked)57 -125.435    429.521  -0.292 0.770271    
## as.factor(hours_worked)58   91.446    186.426   0.491 0.623783    
## as.factor(hours_worked)59 -119.815    429.521  -0.279 0.780293    
## as.factor(hours_worked)60  501.582     98.167   5.109 3.34e-07 ***
## as.factor(hours_worked)62  305.951    257.907   1.186 0.235562    
## as.factor(hours_worked)63 1044.503    279.845   3.732 0.000192 ***
## as.factor(hours_worked)64   -1.735    241.018  -0.007 0.994256    
## as.factor(hours_worked)65  597.004    149.254   4.000 6.42e-05 ***
## as.factor(hours_worked)66  -40.308    309.854  -0.130 0.896503    
## as.factor(hours_worked)68 -244.982    354.263  -0.692 0.489265    
## as.factor(hours_worked)70  691.676    149.254   4.634 3.67e-06 ***
## as.factor(hours_worked)72 -298.515    601.204  -0.497 0.619541    
## as.factor(hours_worked)74  788.025    429.521   1.835 0.066611 .  
## as.factor(hours_worked)75  379.937    279.845   1.358 0.174626    
## as.factor(hours_worked)76  316.875    429.521   0.738 0.460705    
## as.factor(hours_worked)80  814.039    186.426   4.367 1.29e-05 ***
## as.factor(hours_worked)81  739.945    601.204   1.231 0.218462    
## as.factor(hours_worked)84  695.665    257.907   2.697 0.007011 ** 
## as.factor(hours_worked)86  393.185    601.204   0.654 0.513143    
## as.factor(hours_worked)87  -56.815    601.204  -0.095 0.924713    
## as.factor(hours_worked)90    6.935    309.854   0.022 0.982145    
## as.factor(hours_worked)96 1893.795    601.204   3.150 0.001642 ** 
## as.factor(hours_worked)98 1893.185    601.204   3.149 0.001647 ** 
## as.factor(hours_worked)99  767.925    354.263   2.168 0.030227 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 594.9 on 5462 degrees of freedom
## Multiple R-squared:  0.2157, Adjusted R-squared:  0.2044 
## F-statistic: 19.02 on 79 and 5462 DF,  p-value: < 2.2e-16

When hours worked is changed to a factor, R-squared improves from 0.1555 to 0.2157 (adjusted R-squared from 0.1554 to 0.2044) and the RMSE improves from 613 on 5540 degrees of freedom to 594.9 on 5462 degrees of freedom. This is a slight improvement over the original model, but many of the individual hours-worked levels are not statistically significant.
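Because the factor model spends 79 coefficients instead of one, adjusted R-squared is the fairer comparison. The sketch below shows how that adjustment follows from the printed R-squared and degrees of freedom. The helper adjusted_r2 is a hypothetical name, and the inputs are the rounded values from the two outputs, so the last digit can differ slightly from what R prints.

```r
# Hypothetical helper: adjusted R-squared from R-squared, the number of
# estimated slope coefficients, and the residual degrees of freedom
adjusted_r2 <- function(r2, n_coef, df_resid) {
  n <- df_resid + n_coef + 1          # observations = residual df + slopes + intercept
  1 - (1 - r2) * (n - 1) / df_resid
}

# Factor model: 79 dummy coefficients, 5462 residual degrees of freedom
adj_factor <- adjusted_r2(0.2157, 79, 5462)
round(adj_factor, 4)  # 0.2044

# Naive model: 1 slope, 5540 residual degrees of freedom
adj_naive <- adjusted_r2(0.1555, 1, 5540)
round(adj_naive, 3)   # 0.155
```

The penalty is small for the one-slope model but takes roughly a percentage point off the factor model's R-squared.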

3. Build a simple regression model estimating weekly earnings as a function of age. Do not make any transformations at this point—only generate a naïve model and answer the following questions:

a. Write the generalized form of the regression using beta notation

weekly_earnings = b0 + b1(age)

b. Write the estimated form of the regression using your results

## 
## Call:
## lm(formula = weekly_earnings ~ age, data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1245.3  -445.3  -178.1   284.7  2133.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 548.9457    28.2350   19.44   <2e-16 ***
## age           9.1941     0.6306   14.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 654.6 on 5540 degrees of freedom
## Multiple R-squared:  0.03696,    Adjusted R-squared:  0.03678 
## F-statistic: 212.6 on 1 and 5540 DF,  p-value: < 2.2e-16

The table above shows that the intercept is 548.9457 and the age coefficient is 9.1941. The p-values show that both are statistically significant. The residual standard error (RMSE) is 654.6 on 5540 degrees of freedom, meaning a typical prediction will fall about $654.60 away from the actual weekly earnings when using age to make predictions. Finally, the R-squared is 0.03696, meaning age explains only 3.7% of the variance in weekly earnings.

The estimated form of the regression is weekly_earnings = 548.95 + 9.19(age).

c. Provide at least 3 explanations for why this is considered a naïve model.

This is a naïve model for many reasons. First, as with the previous model, there are many other independent or correlated variables that would improve accuracy. Next, the R-squared is extremely low, meaning the model explains very little of the variance. Finally, age cannot increase indefinitely, and as someone ages they will most likely make less money or, at the very least, stop receiving the promotions that lead to increases in weekly earnings.

d. Test the linearity assumption of this model. Provide output of the tests you run to assess linearity and comment on the results.

Above are two plots showing the relationship between age and weekly earnings. The first is a simple ggplot with age on the x-axis and weekly earnings on the y-axis, fitted with a line using linear modeling. This line is upward sloping and matches the estimated regression from part 3b. The second is a component-plus-residual plot (crPlot) testing the linearity of the relationship. As you can see, the pink line starts with an upward slope, but at around age 40 it stops increasing, suggesting that the relationship is not linear.
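A numeric companion to the crPlot is to compare the straight-line fit with one that also allows curvature in age. This is a different technique from the crPlot used in this report, and the sketch below runs on simulated data, not the census sample. The sample size, coefficients, and noise level are all assumptions chosen only to produce a concave shape.

```r
set.seed(1)  # reproducible simulated data, NOT the census sample

# Simulate a concave age-earnings relationship for illustration only
n   <- 2000
age <- runif(n, min = 16, max = 85)
earnings <- -300 + 46 * age - 0.45 * age^2 + rnorm(n, sd = 100)
sim <- data.frame(age = age, earnings = earnings)

# Straight-line fit versus a fit that also allows curvature in age
fit_linear    <- lm(earnings ~ age, data = sim)
fit_quadratic <- lm(earnings ~ age + I(age^2), data = sim)

# Nested-model F-test: a small p-value means the squared term adds
# explanatory power, so the straight-line assumption is doubtful
comparison  <- anova(fit_linear, fit_quadratic)
p_curvature <- comparison[["Pr(>F)"]][2]
p_curvature < 0.05
```

On the real data, the same comparison would use weekly_earnings and age from the telework data frame in place of the simulated columns.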

e. Identify at least 3 other possible concerns regarding this model beyond those inherent in the naïve design. For each possible concern, comment on how you would assess and address the concern.

4. Modify your model from Q3 by adding at least 3 other IVs to the regression and transforming the age variable (and others) as necessary to meet the linearity assumption. Interpret your results and answer the following questions:

a. Write the generalized form of the regression using beta notation

weekly_earnings = b0 + b1(age) + b2(age)^2 + b3(sex) + b4(hourly_non_hourly) + b5(telecommute)

b. Write the estimated form of the regression using your results

## 
## Call:
## lm(formula = weekly_earnings ~ age + I(age * age) + sex + hourly_non_hourly + 
##     telecommute, data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1467.1  -359.0  -101.7   232.1  2436.4 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       -313.85642   70.91403  -4.426 9.79e-06 ***
## age                                 46.44067    3.32871  13.952  < 2e-16 ***
## I(age * age)                        -0.44689    0.03734 -11.968  < 2e-16 ***
## sexmale                            221.56198   15.29148  14.489  < 2e-16 ***
## hourly_non_hourlyNonhourly Worker  481.80686   16.07149  29.979  < 2e-16 ***
## telecommuteTraditional            -208.91662   17.00905 -12.283  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 567.1 on 5536 degrees of freedom
## Multiple R-squared:  0.2777, Adjusted R-squared:  0.277 
## F-statistic: 425.7 on 5 and 5536 DF,  p-value: < 2.2e-16

The summary output results in the below estimated regression equation:

weekly_earnings = -313.86 + 46.44(age) - 0.45(age)^2 + 221.56(male) + 481.81(Salaried Worker) - 208.92(Traditional)

All of the coefficients are statistically significant according to their respective p-values. The R-squared for the regression is 0.2777, meaning 27.8% of the variation in weekly earnings can be explained using this model. The residual standard error (RMSE) is 567.1 on 5536 degrees of freedom, meaning a typical prediction will fall about $567.10 away from the actual weekly earnings when using this regression equation.
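One way to interpret the age and age-squared terms together is to ask where the fitted curve peaks and how the effect of one more year changes with age. The sketch below is a minimal illustration using only the two age coefficients from the output above. The helper name marginal_effect and the example ages are assumptions.

```r
# Coefficients copied from the summary output above
b_age  <- 46.44067    # age
b_age2 <- -0.44689    # I(age * age)

# The quadratic b_age * age + b_age2 * age^2 peaks where its derivative
# is zero: b_age + 2 * b_age2 * age = 0
peak_age <- -b_age / (2 * b_age2)
round(peak_age, 1)  # 52

# Hypothetical helper: marginal effect of one more year at a given age
marginal_effect <- function(age) b_age + 2 * b_age2 * age
round(marginal_effect(c(25, 40, 60)), 2)  # 24.10 10.69 -7.19
```

Holding the other variables fixed, predicted weekly earnings rise with age at a decreasing rate, peak near age 52, and decline afterward.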

c. Do you suspect any of your independent variables are colinear? Explain why or why not.

I think the only variables that could be colinear are whether a worker is hourly or salaried and whether an employee telecommutes. I would expect an hourly worker to be less likely to be able to telecommute, but that may not be the case.

d. Judging by your output, what ranges of values (for your IVs and the DV, weekly earnings) are you most comfortable using this model to estimate future observations?

Age is the only independent variable with a numeric range, as the others are all factors. The minimum and maximum ages in the dataset are 15 and 85, respectively, so I would not feel comfortable estimating employees' weekly earnings outside that range. For the dependent variable, weekly earnings in the dataset range from 7.5 to 2,884.61, so I would not feel comfortable with any estimate over 3,000.

e. Generate a hypothetical observation that exists within the ranges specified in part d and estimate the weekly earnings for that individual. In addition to this estimate, calculate the estimated range of weekly earnings this individual may have and comment on the results.

Variable            Value
Age                 37
Hourly_Non_Hourly   Salaried Worker
Sex                 Male
Telecommute         Telecommute
##       1 
## 1496.02

Estimated: 1,496.02
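The estimate can be reproduced by hand from the coefficient table, which is a useful check on the dummy coding. The sketch below assumes Telecommute, Hourly Worker, and female are the baseline levels, as implied by the coefficient names in the output. The vector names are chosen for illustration.

```r
# Coefficients copied from the summary output of the multiple regression
coefs <- c(intercept = -313.85642, age = 46.44067, age_sq = -0.44689,
           male = 221.56198, salaried = 481.80686, traditional = -208.91662)

# Hypothetical observation: 37-year-old salaried male who telecommutes.
# Telecommute is the baseline level, so the Traditional dummy is 0.
x <- c(intercept = 1, age = 37, age_sq = 37^2,
       male = 1, salaried = 1, traditional = 0)

# The prediction is the sum of each coefficient times its input value
estimate <- sum(coefs * x)
round(estimate, 2)  # 1496.02
```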

High:
weekly_earnings = -313.86 + (46.44 + 2*3.33)(age) + (-0.45 + 2*0.04)(age)^2 + (221.56 + 2*15.29)(male) + (481.81 + 2*16.07)(Salaried Worker) + (-208.92 + 2*17.01)(Traditional)
weekly_earnings = -313.86 + 53.10(age) - 0.37(age)^2 + 252.14(male) + 513.95(Salaried Worker) - 174.90(Traditional)
weekly_earnings = 1,910.40

Low:
weekly_earnings = -313.86 + (46.44 - 2*3.33)(age) + (-0.45 - 2*0.04)(age)^2 + (221.56 - 2*15.29)(male) + (481.81 - 2*16.07)(Salaried Worker) + (-208.92 - 2*17.01)(Traditional)
weekly_earnings = -313.86 + 39.78(age) - 0.53(age)^2 + 190.98(male) + 449.67(Salaried Worker) - 242.94(Traditional)
weekly_earnings = 1,073.08
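The high and low figures above can be reproduced with a few lines of arithmetic. The sketch below follows the same approach used here: each slope is shifted by two standard errors and the intercept is held fixed. This is a rough range, not a formal prediction interval. With the fitted model object available, predict() with interval = "prediction" would be the more usual route, which this sketch does not attempt.

```r
# Rounded coefficients and standard errors as used in the calculation above
est <- c(age = 46.44, age_sq = -0.45, male = 221.56,
         salaried = 481.81, traditional = -208.92)
se  <- c(age = 3.33, age_sq = 0.04, male = 15.29,
         salaried = 16.07, traditional = 17.01)
intercept <- -313.86

# Same hypothetical observation: age 37, male, salaried, telecommuter
x <- c(age = 37, age_sq = 37^2, male = 1, salaried = 1, traditional = 0)

# Shift every slope up or down by two standard errors, intercept unchanged
high <- intercept + sum((est + 2 * se) * x)
low  <- intercept + sum((est - 2 * se) * x)
round(c(low = low, high = high), 2)  # 1073.08 1910.40
```

The range spans roughly $840, driven mostly by the uncertainty in the two age terms once they are multiplied by 37 and 37^2.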