Introduction
Packages Required
| Package | Purpose |
|---|---|
| tidyverse | Core packages for data analysis and manipulation |
| car | |
| stats | General-purpose literate programming engine for output control |
| knitr | For importing rectangular data such as CSV files |
| readr | For transforming and rescaling data |
| syuzhet | For general summary statistics about the data set |
| prettydoc | Creating documents in R Markdown |
| PerformanceAnalytics | Companion to applied regression analysis |
| fmsb | For table placement in HTML |
Data Preparation
Data cleaning tasks were performed
Numeric data columns converted to factors
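A minimal sketch of the factor conversion, assuming the data frame is named `telework` (as in the model calls later in the report) and that the coded columns are the ones that appear as factors in the regression output. The toy rows are invented purely for illustration.

```r
# Toy stand-in for the telework data; the values are invented for illustration.
telework <- data.frame(
  weekly_earnings   = c(950, 1200, 400, 780),
  full_or_part_time = c(1, 1, 2, 1),
  hourly_non_hourly = c(2, 1, 1, 2),
  occupation_group  = c(1, 3, 5, 1)
)

# Columns assumed to hold category codes rather than quantities.
code_cols <- c("full_or_part_time", "hourly_non_hourly", "occupation_group")

# Convert each coded column to a factor so lm() treats it as categorical.
telework[code_cols] <- lapply(telework[code_cols], factor)

str(telework)
```

Without this step, `lm()` would treat a code such as occupation group 1 to 10 as a numeric scale and fit a single slope to it.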
Question 1: Using a simple one-way ANOVA, does teleworking appear to have a significant effect on income?
As indicated in the one-way ANOVA, given a p-value < 0.05, we can reject the null hypothesis and conclude that telecommuting has a statistically significant effect on weekly earnings. The boxplot indicates that those who telework (1) have higher weekly earnings than those who do not telework (2).
Based on the higher average weekly earnings for telecommuters (1) as compared to non-telecommuters (2), it would appear that there is an income effect. However, is this observed difference statistically significant?
- Ho: There is no difference in the average weekly earnings by telecommute status.
- Ha: There is a difference in average weekly earnings by telecommute status.
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(telecommute) 1 1.451e+08 145093488 346.5 <2e-16 ***
## Residuals 5540 2.320e+09 418730
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = weekly_earnings ~ as.factor(telecommute), data = telework)
##
## $`as.factor(telecommute)`
## diff lwr upr p adj
## 2-1 -350.7614 -387.7015 -313.8213 0
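Given a p-value < 0.05 in the ANOVA table and a similarly small adjusted p-value in the Tukey test, we can reject the null hypothesis and conclude at the 95% confidence level that the difference in mean weekly earnings is statistically significant. The Tukey comparison estimates that non-telecommuters (2) earn about $350.76 less per week than telecommuters (1), with a 95% confidence interval of $313.82 to $387.70.
The boxplot below depicts the dispersion in weekly earnings by telecommute status.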
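The ANOVA, Tukey comparison, and boxplot described above can be sketched as follows. The model formula is taken from the output; the data are simulated stand-ins, so the numbers will not match the report.

```r
set.seed(1)

# Simulated stand-in for the telework data (not the real survey values).
n <- 200
telework <- data.frame(telecommute = rep(c(1, 2), each = n / 2))
telework$weekly_earnings <- ifelse(telework$telecommute == 1, 1200, 850) +
  rnorm(n, sd = 300)

# One-way ANOVA of weekly earnings by telecommute status.
fit <- aov(weekly_earnings ~ as.factor(telecommute), data = telework)
summary(fit)

# Tukey HSD pairwise comparison (with two groups there is a single contrast).
TukeyHSD(fit)

# Boxplot of the earnings distribution in each group.
boxplot(weekly_earnings ~ telecommute, data = telework,
        xlab = "Telecommute status (1 = yes, 2 = no)",
        ylab = "Weekly earnings")
```

With only two groups, the ANOVA F-test and the Tukey contrast test the same difference, which is why their p-values agree.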
2. Build a simple regression model estimating weekly earnings by hours worked. Do not make any transformations at this point—only generate a naïve model and answer the following questions:
## [1] 66.04331
##
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = telework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1758.0 -411.2 -185.1 230.4 2796.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.0433 28.5766 2.311 0.0209 *
## hours_worked 22.5887 0.7072 31.943 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 613 on 5540 degrees of freedom
## Multiple R-squared: 0.1555, Adjusted R-squared: 0.1554
## F-statistic: 1020 on 1 and 5540 DF, p-value: < 2.2e-16
From the table above of a simple regression model estimating weekly earnings by hours worked, we see that the model is statistically significant and we can reject the null hypothesis that the slope on hours worked is zero in favor of the alternative.
- The generalized form of the regression is:
weekly_earnings = B0 + B1 * (hours_worked)
- The estimated form of the regression equation is:
weekly earnings = 66.0433 + 22.5887 (Hrs Worked)
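As a small worked example of the estimated equation, the sketch below plugs an hours value into it. The coefficients are copied from the regression output; the helper name `predict_earnings` and the 40-hour input are illustrative assumptions.

```r
# Coefficients copied from the regression output above.
b0 <- 66.0433   # intercept
b1 <- 22.5887   # dollars per additional hour worked

# Hypothetical helper: predicted weekly earnings for a given number of hours.
predict_earnings <- function(hours_worked) b0 + b1 * hours_worked

predict_earnings(40)  # a 40-hour week, chosen only as an example
```

For a 40-hour week this gives 66.0433 + 22.5887 * 40, or about $969.59.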
- Provide at least 3 explanations for why we would consider this a naïve model.
We would consider this a naive model for the following reasons:
First, the R-squared is only 0.1555, which indicates that hours worked alone has weak explanatory power for weekly earnings.
Second, weekly earnings are not solely based on hours worked. For example, many professionals are paid a salary that does not depend on the number of hours worked per period.
Third, for some jobs and skill levels, payment or fees are based on an outcome rather than an hourly wage. For example, a tax preparer completing a return or a lawyer writing up one’s will.
Finally, people with more experience tend to be paid a higher rate than those with less time on the job. Therefore, weekly earnings are not solely a function of the number of hours worked per week but also reflect level of experience.
- In your opinion, what do you think should be done to better model these two variables or do you think it does not make sense to model one as a function of the other under any circumstances? Provide a visualization to support your position.
##
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = telework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1758.0 -411.2 -185.1 230.4 2796.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.0433 28.5766 2.311 0.0209 *
## hours_worked 22.5887 0.7072 31.943 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 613 on 5540 degrees of freedom
## Multiple R-squared: 0.1555, Adjusted R-squared: 0.1554
## F-statistic: 1020 on 1 and 5540 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = log(weekly_earnings) ~ hours_worked, data = telework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8090 -0.4577 -0.0287 0.4182 2.6466
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.286861 0.030671 172.37 <2e-16 ***
## hours_worked 0.033644 0.000759 44.33 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6579 on 5540 degrees of freedom
## Multiple R-squared: 0.2618, Adjusted R-squared: 0.2617
## F-statistic: 1965 on 1 and 5540 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = log(weekly_earnings) ~ hours_worked + I(hours_worked *
## hours_worked), data = telework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7182 -0.4519 -0.0412 0.4193 2.9261
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.990e+00 4.812e-02 103.697 < 2e-16 ***
## hours_worked 5.136e-02 2.346e-03 21.896 < 2e-16 ***
## I(hours_worked * hours_worked) -2.381e-04 2.984e-05 -7.978 1.8e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6542 on 5539 degrees of freedom
## Multiple R-squared: 0.2702, Adjusted R-squared: 0.2699
## F-statistic: 1025 on 2 and 5539 DF, p-value: < 2.2e-16
By progressing from a linear-linear model to a log-linear and then a log-polynomial model, the adjusted R-squared rises from 0.1554 to 0.2617 and then only slightly to 0.2699, so the explanatory power of the model is still best described as weak. The standardized residuals plot, seen above, illustrates the low predictive value of the model. Additional independent variables would be needed to improve the ability of the model to explain the dependent variable, weekly earnings.
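The three specifications compared above can be fitted side by side as in this sketch. The formulas follow the output; the data are simulated stand-ins, so the values will differ from the report.

```r
set.seed(42)

# Simulated stand-in for the telework data (not the real survey values).
n <- 500
hours_worked <- round(runif(n, 5, 70))
weekly_earnings <- exp(5 + 0.05 * hours_worked - 0.0002 * hours_worked^2 +
                         rnorm(n, sd = 0.6))
telework <- data.frame(hours_worked, weekly_earnings)

# The three specifications compared in the text.
m_linear <- lm(weekly_earnings ~ hours_worked, data = telework)
m_log    <- lm(log(weekly_earnings) ~ hours_worked, data = telework)
m_poly   <- lm(log(weekly_earnings) ~ hours_worked + I(hours_worked^2),
               data = telework)

# Collect adjusted R-squared for each model.
adj_r2 <- sapply(list(linear = m_linear, log_linear = m_log, log_poly = m_poly),
                 function(m) summary(m)$adj.r.squared)
round(adj_r2, 4)

# Standardized residuals against fitted values for the last model.
plot(fitted(m_poly), rstandard(m_poly),
     xlab = "Fitted log(weekly earnings)", ylab = "Standardized residuals")
abline(h = 0, lty = 2)
```

One caveat when reading such a comparison: R-squared for a model of `weekly_earnings` and for a model of `log(weekly_earnings)` describe different response scales, so they are not strictly comparable.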
- If your recommendation from part d above is possible with the data available, do your best to execute your proposed adjustments. If your recommendation from part d is not possible using this dataset, propose another solution using additional variables if possible, a different data set or another strategy.
##
## Call:
## lm(formula = weekly_earnings ~ hours_worked + full_or_part_time +
## age + detailed_occupation_group, data = telework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1328.52 -333.64 -97.27 203.51 2684.44
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 426.0640 46.9233 9.080 < 2e-16 ***
## hours_worked 14.3835 0.7715 18.644 < 2e-16 ***
## full_or_part_time2 -266.7181 25.2941 -10.545 < 2e-16 ***
## age 7.4330 0.5255 14.144 < 2e-16 ***
## detailed_occupation_group2 -14.0418 37.4129 -0.375 0.707437
## detailed_occupation_group3 90.6285 45.0338 2.012 0.044220 *
## detailed_occupation_group4 103.2604 48.2186 2.142 0.032277 *
## detailed_occupation_group5 -138.8140 66.1487 -2.099 0.035905 *
## detailed_occupation_group6 -371.7495 60.5991 -6.135 9.13e-10 ***
## detailed_occupation_group7 247.8487 66.9185 3.704 0.000215 ***
## detailed_occupation_group8 -131.2259 41.2853 -3.179 0.001488 **
## detailed_occupation_group9 -159.7264 59.4970 -2.685 0.007283 **
## detailed_occupation_group10 -122.4776 35.3752 -3.462 0.000540 ***
## detailed_occupation_group11 -674.5186 47.5637 -14.181 < 2e-16 ***
## detailed_occupation_group12 -332.9938 50.4692 -6.598 4.56e-11 ***
## detailed_occupation_group13 -614.1451 39.5249 -15.538 < 2e-16 ***
## detailed_occupation_group14 -677.9578 48.3717 -14.016 < 2e-16 ***
## detailed_occupation_group15 -673.1004 48.9391 -13.754 < 2e-16 ***
## detailed_occupation_group16 -393.4039 31.2111 -12.605 < 2e-16 ***
## detailed_occupation_group17 -555.3577 28.3138 -19.614 < 2e-16 ***
## detailed_occupation_group18 -601.2618 105.4690 -5.701 1.25e-08 ***
## detailed_occupation_group19 -365.3050 41.5688 -8.788 < 2e-16 ***
## detailed_occupation_group20 -301.0605 42.9177 -7.015 2.58e-12 ***
## detailed_occupation_group21 -538.5115 36.8069 -14.631 < 2e-16 ***
## detailed_occupation_group22 -495.0241 36.6261 -13.516 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 536 on 5517 degrees of freedom
## Multiple R-squared: 0.357, Adjusted R-squared: 0.3542
## F-statistic: 127.7 on 24 and 5517 DF, p-value: < 2.2e-16
In this new model, the overall F-test has a p-value < 0.05, which indicates that the model is statistically significant and that we can reject the null hypothesis that all slope coefficients equal 0. The adjusted R-squared has improved to 0.3542, which increases the explanatory power of the model, and the residual standard error has fallen to 536 from 613 in the naive model. However, when drilling down into the individual independent variables, we notice that not all are significant: detailed_occupation_group2 (p = 0.707) does not differ significantly from the reference group, and groups 3, 4, and 5 are significant only at the 0.05 level. Such terms add little to the model and increase the level of noise.
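A sketch of how a multi-variable model of this shape might be fitted and its summary statistics extracted. The formula follows the output above; the data are simulated stand-ins with only five occupation groups, so none of the numbers correspond to the report.

```r
set.seed(7)

# Simulated stand-in for the telework data (not the real survey values).
n <- 400
telework <- data.frame(
  hours_worked              = round(runif(n, 10, 60)),
  age                       = round(runif(n, 18, 70)),
  full_or_part_time         = factor(sample(1:2, n, replace = TRUE)),
  detailed_occupation_group = factor(sample(1:5, n, replace = TRUE))
)
telework$weekly_earnings <- 300 + 15 * telework$hours_worked +
  7 * telework$age -
  250 * (telework$full_or_part_time == "2") +
  rnorm(n, sd = 400)

fit <- lm(weekly_earnings ~ hours_worked + full_or_part_time + age +
            detailed_occupation_group, data = telework)

sigma(fit)                    # residual standard error
summary(fit)$adj.r.squared    # adjusted R-squared

# Coefficient rows with p-values above 0.05, i.e. the weak predictors.
coefs <- summary(fit)$coefficients
coefs[coefs[, "Pr(>|t|)"] > 0.05, , drop = FALSE]
```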
3. Build a simple regression model estimating weekly earnings as a function of age. Do not make any transformations at this point—only generate a naïve model and answer the following questions:
##
## Call:
## lm(formula = weekly_earnings ~ age, data = telework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1245.3 -445.3 -178.1 284.7 2133.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 548.9457 28.2350 19.44 <2e-16 ***
## age 9.1941 0.6306 14.58 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 654.6 on 5540 degrees of freedom
## Multiple R-squared: 0.03696, Adjusted R-squared: 0.03678
## F-statistic: 212.6 on 1 and 5540 DF, p-value: < 2.2e-16
- Write the generalized form of the regression using beta notation
weekly_earnings = B0 + B1 * (age)
- Write the estimated form of the regression using your results
The estimated form of the regression is:
Weekly Earnings = $548.95 + $9.19(Age)
- Provide at least 3 explanations for why this is considered a naïve model.
This can be considered a naive model for the following reasons:
Age is not the only factor which would explain weekly_earnings. While typically, we would expect individuals to earn increasing wages over time, other factors would play a greater part in predicting weekly earnings, such as the amount of work done in a given week, as captured by the full- or part-time variable.
Clearly, one’s occupation would also play a tremendous role in determining weekly earnings. Professional occupations, as compared to non-skilled work, could have a greater impact on weekly earnings.
Given the large cost of living differences among the various regions of the country, location could also play a substantial role in explaining weekly earnings.
- Test the linearity assumption of this model. Provide output of the tests you run to assess linearity and comment on the results.
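The linearity assumption can be checked by examining the residuals-versus-fitted plot together with the scale-location plot, also known as the spread-location plot. The residuals-versus-fitted plot shows whether the residuals follow a systematic pattern, and the scale-location plot shows whether the residuals are spread equally along the range of fitted values.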
##
## Call:
## lm(formula = log(weekly_earnings) ~ hours_worked, data = telework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8090 -0.4577 -0.0287 0.4182 2.6466
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.286861 0.030671 172.37 <2e-16 ***
## hours_worked 0.033644 0.000759 44.33 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6579 on 5540 degrees of freedom
## Multiple R-squared: 0.2618, Adjusted R-squared: 0.2617
## F-statistic: 1965 on 1 and 5540 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = log(weekly_earnings) ~ hours_worked + I(hours_worked *
## hours_worked), data = telework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7182 -0.4519 -0.0412 0.4193 2.9261
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.990e+00 4.812e-02 103.697 < 2e-16 ***
## hours_worked 5.136e-02 2.346e-03 21.896 < 2e-16 ***
## I(hours_worked * hours_worked) -2.381e-04 2.984e-05 -7.978 1.8e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6542 on 5539 degrees of freedom
## Multiple R-squared: 0.2702, Adjusted R-squared: 0.2699
## F-statistic: 1025 on 2 and 5539 DF, p-value: < 2.2e-16
The graphs indicate that the model does not provide a good linear fit for the data, which also exhibit potential heteroskedasticity, a sign that the model may not be well specified.
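Diagnostic plots of the kind discussed here can be produced with base R, as in this sketch. The data are simulated (with noise that deliberately grows with age), not the survey data.

```r
set.seed(3)

# Simulated stand-in for the telework data (not the real survey values).
n <- 400
age <- round(runif(n, 18, 70))
# Noise grows with age here, so the variance is deliberately non-constant.
weekly_earnings <- 550 + 9 * age + rnorm(n, sd = 10 * age)
telework <- data.frame(age, weekly_earnings)

fit <- lm(weekly_earnings ~ age, data = telework)

# Base R diagnostic plots for an lm object:
#   which = 1 -> residuals vs fitted (curvature suggests non-linearity)
#   which = 3 -> scale-location (a trend suggests heteroskedasticity)
par(mfrow = c(1, 2))
plot(fit, which = c(1, 3))
par(mfrow = c(1, 1))
```

In the simulated case the scale-location panel trends upward because the spread of the residuals increases with the fitted value.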
- Identify at least 3 other possible concerns regarding this model beyond those inherent in the naïve design. For each possible concern, comment on how you would assess and address the concern.
Given the heteroskedasticity concerns outlined and assessed above, log or polynomial transformations could stabilize the variance and improve the model’s explanatory value. In addition, using more IVs should improve the model’s ability to explain the dependent variable, weekly earnings, as age does not appear to be the only explanation for the level of weekly earnings.
Another concern relates to the y-intercept value and its relevance and accuracy. A person of age 0 would be predicted to receive $548.95 in weekly earnings, which is not meaningful because age 0 lies outside any plausible working age.
An additional concern would be related to distinguishing the difference between age and experience. It could be argued that a person of greater age might receive less than a younger person with more experience, ceteris paribus. A way to improve the model may be to use a measure of experience as a substitute for age.
A final concern would be the assumption of a linear relationship between age and weekly earnings. Intuitively, we might expect earnings to grow at a changing rate as experience accumulates rather than by a constant amount per year.
4. Modify your model from Q3 by adding at least 3 other IVs to the regression and transforming the age variable (and others) as necessary to meet the linearity assumption. Interpret your results and answer the following questions:
##
## Call:
## lm(formula = weekly_earnings ~ age + I(age * age) + hours_worked +
## hourly_non_hourly + full_or_part_time + occupation_group,
## data = telework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1492.68 -318.70 -81.97 194.24 2634.69
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -48.36956 74.79910 -0.647 0.517879
## age 22.74989 3.17170 7.173 8.32e-13 ***
## I(age * age) -0.18662 0.03556 -5.249 1.59e-07 ***
## hours_worked 13.04987 0.74904 17.422 < 2e-16 ***
## hourly_non_hourly2 325.31496 16.08411 20.226 < 2e-16 ***
## full_or_part_time2 -218.78703 24.84086 -8.808 < 2e-16 ***
## occupation_group2 -19.58143 22.58074 -0.867 0.385884
## occupation_group3 -435.61031 26.12580 -16.674 < 2e-16 ***
## occupation_group4 -293.58199 28.47477 -10.310 < 2e-16 ***
## occupation_group5 -416.83376 25.82964 -16.138 < 2e-16 ***
## occupation_group6 -426.23175 102.34706 -4.165 3.17e-05 ***
## occupation_group7 -167.48226 39.93989 -4.193 2.79e-05 ***
## occupation_group8 -150.23296 40.82552 -3.680 0.000236 ***
## occupation_group9 -347.83247 35.17233 -9.889 < 2e-16 ***
## occupation_group10 -326.05283 34.70294 -9.396 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 521.4 on 5527 degrees of freedom
## Multiple R-squared: 0.3903, Adjusted R-squared: 0.3887
## F-statistic: 252.7 on 14 and 5527 DF, p-value: < 2.2e-16
- Write the generalized form of the regression using beta notation
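weekly_earnings = B0 + B1 * (age) + B2 * (age^2) + B3 * (hours_worked) + B4 * (hourly_non_hourly2) + B5 * (full_or_part_time2) + B6 * (occupation_group2) + ... + B14 * (occupation_group10)
In the new model, I have used age, age squared, hours worked, full or part time, hourly/non-hourly, and occupation group. The result of this new model is that the ability to explain the dependent variable, weekly earnings, has improved: the adjusted R-squared is now 0.3887 (multiple R-squared 0.3903). Additionally, the overall model is statistically significant, with a p-value below alpha = 0.05, so we can reject the null hypothesis that all slope coefficients are equal to 0.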
- Write the estimated form of the regression using your results
The regression equation using the coefficients generated from the regression analysis is:
weekly earnings = -48.37 + 22.75(age) - 0.187(age^2) + 13.0499(hours_worked) + 325.315(hourly_non_hourly2) - 218.787(full_or_part_time2) - 19.5814(occupation_group2) - ... - 326.053(occupation_group10)
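Because age enters the equation as both age and age^2, its effect on earnings changes with age. The sketch below works out the implied turning point and the marginal effect at a few ages, using the two age coefficients from the output; the ages 25, 45, and 65 are arbitrary illustrations.

```r
# Age coefficients copied from the regression output above.
b_age  <- 22.74989   # coefficient on age
b_age2 <- -0.18662   # coefficient on age^2

# For b_age * age + b_age2 * age^2, the derivative with respect to age,
# b_age + 2 * b_age2 * age, is zero at the turning point below.
peak_age <- -b_age / (2 * b_age2)
peak_age

# Marginal effect of one more year of age at a few illustrative ages.
ages <- c(25, 45, 65)
marginal_effect <- b_age + 2 * b_age2 * ages
setNames(round(marginal_effect, 2), ages)
```

Holding the other variables fixed, the fitted curve peaks at roughly age 61, and the gain from an additional year shrinks steadily before that point.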
- Do you suspect any of your independent variables are colinear? Explain why or why not.
Given the low VIF scores presented in the table below, I can conclude that there is little correlation between each predictor and the remaining predictor variables. The general rule of thumb is that a Variance Inflation Factor (VIF) of 1 means that there is no correlation, and values exceeding 4 warrant further investigation. The largest values here, about 1.51 for hours_worked and full_or_part_time, are well below that threshold.
## age hours_worked hourly_non_hourly full_or_part_time
## 1.019456 1.511649 1.074684 1.507727
## [1] 1.475405
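The call that produced these scores is not shown; the `car` package listed earlier provides `vif()`, which is presumably what was used. The sketch below computes the same quantity by hand in base R to show what the number means. The helper `vif_manual` and the simulated predictors are illustrative assumptions.

```r
set.seed(11)

# Simulated predictors (not the real survey values); hours_worked and
# full_time are made to be related, the way hours and full/part-time status are.
n <- 300
age          <- round(runif(n, 18, 70))
full_time    <- rbinom(n, 1, 0.7)
hours_worked <- 20 + 18 * full_time + rnorm(n, sd = 8)
predictors   <- data.frame(age, full_time, hours_worked)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j
# on all of the other predictors.
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    r2 <- summary(lm(reformulate(setdiff(names(df), v), response = v),
                     data = df))$r.squared
    1 / (1 - r2)
  })
}

round(vif_manual(predictors), 3)
```

A predictor that the others cannot explain has an R-squared near 0 and a VIF near 1, which matches the rule of thumb quoted above.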
- Judging by your output, what ranges of values (for your IVs and the DV, weekly earnings) are you most comfortable using this model to estimate future observations?
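The table below gives the 95% confidence intervals for the coefficients of the model containing age, hours worked, hourly/non-hourly status, and full- or part-time status, which indicate the range of values that I am most comfortable using the model with.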
## 2.5 % 97.5 %
## (Intercept) -117.023022 42.098498
## age 5.826261 7.921016
## hours_worked 12.237957 15.292780
## hourly_non_hourly2 423.943649 484.595471
## full_or_part_time2 -329.169851 -228.819259
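A table like the one above comes from `confint()` on a fitted model. The sketch below shows the call on simulated data and, as a complement, the observed range of each IV, since predictions are generally safest inside the range the model was fitted on. The data and the two-variable formula are simplifying assumptions.

```r
set.seed(5)

# Simulated stand-in for the telework data (not the real survey values).
n <- 300
telework <- data.frame(
  age          = round(runif(n, 18, 70)),
  hours_worked = round(runif(n, 10, 60))
)
telework$weekly_earnings <- 7 * telework$age + 14 * telework$hours_worked +
  rnorm(n, sd = 300)

fit <- lm(weekly_earnings ~ age + hours_worked, data = telework)

# 95% confidence intervals for the coefficients.
ci <- confint(fit, level = 0.95)
ci

# Observed range of each IV and of the DV in the fitting data.
sapply(telework, range)
```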
- Generate a hypothetical observation that exists within the ranges specified in part d and estimate the weekly earnings for that individual. In addition to this estimate, calculate the estimated range of weekly earnings this individual may have and comment on the results. HINT: use the coefficient confidence intervals to estimate the highest and lowest values if this person’s income.
## (Intercept) age hours_worked hourly_non_hourly2
## -37.462262 6.873638 13.765369 454.269560
## full_or_part_time2
## -278.994555
My model predicts that a specific observation in the data set, which has actual weekly earnings of $1000.00, should have the following weekly earnings:
-37.4622 + 6.8736(44) + 13.7653(1) + 454.2695(2) - 278.9946(1) =
## [1] 908.2859
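The point estimate above, and the low and high estimates suggested by the hint, can be reproduced as in this sketch. The coefficients and confidence bounds are copied from the output, and the input values (44, 1, 2, 1) are taken exactly as they appear in the calculation above, without re-checking how the categorical variables were coded.

```r
# Point estimates and 95% confidence bounds copied from the output above.
est <- c(intercept = -37.4622, age = 6.8736, hours_worked = 13.7653,
         hourly_non_hourly2 = 454.2695, full_or_part_time2 = -278.9946)
lwr <- c(-117.023022, 5.826261, 12.237957, 423.943649, -329.169851)
upr <- c(  42.098498, 7.921016, 15.292780, 484.595471, -228.819259)

# Input values exactly as used in the calculation in the text
# (1 for the intercept, then age, hours_worked, and the two coded variables).
x <- c(1, 44, 1, 2, 1)

point <- sum(est * x)
# Every input is positive, so all-lower and all-upper bounds bracket the estimate.
range_est <- c(low = sum(lwr * x), high = sum(upr * x))

point
range_est
```

Combining all lower or all upper coefficient bounds gives a rough bracket in the spirit of the hint, not a formal prediction interval; `predict()` with `interval = "prediction"` on the fitted model would provide the latter.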