## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(telecommute) 1 1.451e+08 145093488 346.5 <2e-16 ***
## Residuals 5540 2.320e+09 418730
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = weekly_earnings ~ as.factor(telecommute), data = telework)
##
## $`as.factor(telecommute)`
## diff lwr upr p adj
## 2-1 -350.7614 -387.7015 -313.8213 0
## R Squared: 0.05886171
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(telecommute) 1 1.451e+08 145093488 359.0 <2e-16 ***
## as.factor(sex) 1 8.142e+07 81418057 201.5 <2e-16 ***
## Residuals 5539 2.238e+09 404107
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = weekly_earnings ~ as.factor(telecommute) + as.factor(sex), data = telework)
##
## $`as.factor(telecommute)`
## diff lwr upr p adj
## 2-1 -350.7614 -387.0508 -314.4721 0
##
## $`as.factor(sex)`
## diff lwr upr p adj
## 2-1 -242.1472 -275.6378 -208.6566 0
## R Squared: 0.09191242
## Analysis of Variance Table
##
## Model 1: weekly_earnings ~ as.factor(telecommute) + as.factor(sex)
## Model 2: weekly_earnings ~ as.factor(telecommute)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 5539 2238348491
## 2 5540 2319766548 -1 -81418057 201.48 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
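The F statistic in the model comparison above can be reproduced by hand from the two residual sums of squares. This is a minimal sketch that only re-uses numbers printed in the table; the variable names are illustrative.

```r
# Values copied from the printed anova() comparison above.
rss_full    <- 2238348491  # Model 1: telecommute + sex
rss_reduced <- 2319766548  # Model 2: telecommute only
df_full     <- 5539        # residual degrees of freedom of the full model
df_diff     <- 1           # one extra parameter (sex) in the full model

# F = [(RSS_reduced - RSS_full) / df_diff] / [RSS_full / df_full]
f_stat  <- ((rss_reduced - rss_full) / df_diff) / (rss_full / df_full)
p_value <- pf(f_stat, df_diff, df_full, lower.tail = FALSE)

print(round(f_stat, 2))
print(p_value < 2.2e-16)
```

A large F with a tiny p-value means that adding sex reduces the residual sum of squares by more than chance alone would explain.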
##
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = telework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1758.0 -411.2 -185.1 230.4 2796.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.0433 28.5766 2.311 0.0209 *
## hours_worked 22.5887 0.7072 31.943 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 613 on 5540 degrees of freedom
## Multiple R-squared: 0.1555, Adjusted R-squared: 0.1554
## F-statistic: 1020 on 1 and 5540 DF, p-value: < 2.2e-16
a. Y = β0 + β1 * Hours_Worked
Weekly_Earnings = 66.04 + 22.59 * Hours_Worked
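As a quick check of the fitted equation, the sketch below plugs an example value into it. The coefficients are copied from the regression output above; the helper name `predict_earnings` and the 40-hour example are illustrative assumptions.

```r
# Coefficients taken from the printed lm() summary above.
b0 <- 66.0433   # intercept
b1 <- 22.5887   # slope for hours_worked

# Hypothetical helper: predicted weekly earnings for a given number of hours.
predict_earnings <- function(hours) b0 + b1 * hours

# Example: a 40-hour week (40 is an illustrative value).
print(predict_earnings(40))

# Each extra hour adds b1 to the prediction.
print(predict_earnings(41) - predict_earnings(40))
```

The slope reads as roughly 22.59 more in predicted weekly earnings per additional hour worked.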
d. I think we should narrow the dataset down to well-defined subgroups, because right now we are comparing apples to oranges. The earnings function for someone in a management, business, or finance position won't be the same as for someone in a service or construction occupation. It is also naive to assume that the relationship is linear (it is not). Hours worked alone explains only about 16% of the variation in weekly earnings (R-squared = 0.1555), so much more than hours determines how much money you make.
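One way to act on this is to fit a separate regression for each occupation group and compare the coefficients. The sketch below uses synthetic data, not the telework survey; the data frame `toy` and the "true" intercepts and slopes are made up for illustration, and base R `split()` stands in for the `filter()` calls used in the report.

```r
set.seed(1)
n <- 300

# Synthetic stand-in for the telework data (NOT the real survey).
toy <- data.frame(
  occupation_group = rep(c(1, 3, 7), each = n / 3),
  hours_worked     = round(runif(n, 10, 60))
)

# Hypothetical group-specific intercepts and slopes, used only to generate data.
true_b0 <- c("1" = 400, "3" = 20, "7" = 200)
true_b1 <- c("1" = 22,  "3" = 15, "7" = 18)
g <- as.character(toy$occupation_group)
toy$weekly_earnings <- unname(true_b0[g] + true_b1[g] * toy$hours_worked) +
  rnorm(n, sd = 50)

# Fit one simple regression per occupation group.
fits <- lapply(split(toy, toy$occupation_group),
               function(d) lm(weekly_earnings ~ hours_worked, data = d))

# Collect intercepts and slopes side by side, one row per group.
coefs <- t(sapply(fits, coef))
print(round(coefs, 2))
```

If the slopes and intercepts differ clearly across groups, a single pooled line is hiding real differences between occupations.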
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
##
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = filter(telework,
## occupation_group == 1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1516.8 -551.6 -193.9 461.6 2426.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 436.458 96.844 4.507 7.39e-06 ***
## hours_worked 21.629 2.232 9.688 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 715.2 on 961 degrees of freedom
## Multiple R-squared: 0.08898, Adjusted R-squared: 0.08804
## F-statistic: 93.87 on 1 and 961 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = filter(telework,
## occupation_group == 3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1020.59 -195.36 -104.88 58.74 2568.51
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.389 39.125 0.444 0.657
## hours_worked 14.936 1.052 14.201 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 378.8 on 876 degrees of freedom
## Multiple R-squared: 0.1871, Adjusted R-squared: 0.1862
## F-statistic: 201.7 on 1 and 876 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = filter(telework,
## occupation_group == 7))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1366.7 -315.4 -115.4 210.4 1969.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 190.348 120.612 1.578 0.116
## hours_worked 18.125 2.793 6.489 5.6e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 516.2 on 221 degrees of freedom
## Multiple R-squared: 0.16, Adjusted R-squared: 0.1562
## F-statistic: 42.11 on 1 and 221 DF, p-value: 5.603e-10
##
## Call:
## lm(formula = weekly_earnings ~ age + as.factor(sex), data = telework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1305.6 -430.2 -160.9 271.8 2236.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 668.7690 28.7230 23.28 <2e-16 ***
## age 9.4213 0.6177 15.25 <2e-16 ***
## as.factor(sex)2 -265.5980 17.2322 -15.41 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 641 on 5539 degrees of freedom
## Multiple R-squared: 0.07656, Adjusted R-squared: 0.07623
## F-statistic: 229.6 on 2 and 5539 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Response: weekly_earnings
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 91089447 91089447 221.67 < 2.2e-16 ***
## as.factor(sex) 1 97620084 97620084 237.56 < 2.2e-16 ***
## Residuals 5539 2276150505 410932
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Y = β0 + β1 * Age
Weekly_Earnings = 548.94 + 9.19 * Age
The model doesn't account for the type of occupation. In a business occupation you might earn more as you get older, whereas for a construction worker it might be the opposite.
As we said earlier, gender by itself is not a causal factor of higher or lower earnings. Many things are not accounted for, such as workload, nature of the task, market forces, etc.
Based on visual exploration of the diagnostic plots, the model is not really linear. We would like to see the red line in the Residuals vs Fitted plot stay straight and flat, and the points in the Q-Q plot fall on a straight line as well.
Heteroskedasticity: The variance of the errors doesn't seem to be constant, based on the plots and the widening spread of earnings as age increases. This is difficult to fix. We might try to address it by building a model with a narrower scope, say, earnings for people between 20 and 30 years old.
Non-linearity: The non-linearity of this model might be addressed by applying transformations, e.g. log-log, log-linear, quadratic, cubic, or higher-order polynomial terms.
Non-normal distribution: Based on the Q-Q plot, the residuals do not seem to be normally distributed. This could be addressed with more transformations, such as logs or square roots.
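The remedies listed above can be sketched on synthetic data. The example below is an illustration under stated assumptions, not a fit to the telework survey: earnings are simulated with multiplicative noise so that their spread grows with their level, and `spread_ratio` is a hypothetical helper that compares residual spread above and below the median fitted value.

```r
set.seed(123)
n <- 2000

# Synthetic data (NOT the telework survey): multiplicative noise makes the
# spread of earnings grow with their level.
age <- runif(n, 18, 65)
log_earnings <- 5 + 0.08 * age - 0.0008 * age^2 + rnorm(n, sd = 0.5)
toy <- data.frame(age = age, weekly_earnings = exp(log_earnings))

# Untransformed linear model vs. log-linear model with a quadratic age term.
fit_raw <- lm(weekly_earnings ~ age, data = toy)
fit_log <- lm(log(weekly_earnings) ~ age + I(age^2), data = toy)

# Ratio of residual spread above vs. below the median fitted value;
# a value close to 1 indicates roughly constant variance.
spread_ratio <- function(fit) {
  hi <- fitted(fit) > median(fitted(fit))
  sd(resid(fit)[hi]) / sd(resid(fit)[!hi])
}
spread_raw <- spread_ratio(fit_raw)
spread_log <- spread_ratio(fit_log)
print(c(raw = spread_raw, log = spread_log))

# The usual diagnostic plots can then be re-checked:
# par(mfrow = c(2, 2)); plot(fit_log)
```

On data like this, the log transformation brings the spread ratio closer to 1, which is the pattern one would hope to see in the Residuals vs Fitted plot after transforming.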
##
## Call:
## lm(formula = weekly_earnings ~ poly(age, 6, raw = T) + as.factor(sex) +
## as.factor(union_member) + as.factor(hourly_non_hourly) +
## as.factor(telecommute) + as.factor(full_or_part_time) + poly(hours_worked,
## 2, raw = T), data = telework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1802.76 -329.54 -90.09 193.75 2569.09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.799e+03 2.266e+03 2.559 0.010510 *
## poly(age, 6, raw = T)1 -8.407e+02 3.461e+02 -2.429 0.015173 *
## poly(age, 6, raw = T)2 5.169e+01 2.097e+01 2.465 0.013728 *
## poly(age, 6, raw = T)3 -1.564e+00 6.465e-01 -2.419 0.015575 *
## poly(age, 6, raw = T)4 2.516e-02 1.073e-02 2.344 0.019096 *
## poly(age, 6, raw = T)5 -2.059e-04 9.124e-05 -2.256 0.024089 *
## poly(age, 6, raw = T)6 6.727e-07 3.114e-07 2.160 0.030780 *
## as.factor(sex)2 -1.353e+02 1.479e+01 -9.150 < 2e-16 ***
## as.factor(union_member)2 -8.119e+01 2.383e+01 -3.407 0.000661 ***
## as.factor(hourly_non_hourly)2 4.024e+02 1.552e+01 25.934 < 2e-16 ***
## as.factor(telecommute)2 -1.894e+02 1.605e+01 -11.800 < 2e-16 ***
## as.factor(full_or_part_time)2 -2.784e+02 2.650e+01 -10.504 < 2e-16 ***
## poly(hours_worked, 2, raw = T)1 8.834e-01 2.146e+00 0.412 0.680575
## poly(hours_worked, 2, raw = T)2 1.417e-01 2.541e-02 5.576 2.58e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 533.9 on 5528 degrees of freedom
## Multiple R-squared: 0.3607, Adjusted R-squared: 0.3592
## F-statistic: 239.9 on 13 and 5528 DF, p-value: < 2.2e-16
Y = β0 + β1 Age + β2 Age^2 + β3 Age^3 + β4 Age^4 + β5 Age^5 + β6 Age^6 + β7 Sex2 + β8 Union_Member2 + β9 Hourly_Non_Hourly2 + β10 Telecommute2 + β11 Full_or_Part_Time2 + β12 Hours_Worked + β13 Hours_Worked^2
Weekly_Earnings = 5.799e+03 - 8.407e+02 x Age + 5.169e+01 x Age^2 - 1.564e+00 x Age^3 + 2.516e-02 x Age^4 - 2.059e-04 x Age^5 + 6.727e-07 x Age^6 - 1.353e+02 x sex2 - 8.119e+01 x union_member2 + 4.024e+02 x hourly_non_hourly2 - 1.894e+02 x telecommute2 - 2.784e+02 x full_or_part_time2 + 8.834e-01 x hours_worked + 1.417e-01 x hours_worked^2
c. Based on the VIF test, none of the predictors appear to be collinear. However, one might be tempted to believe that being a union member could be collinear with hourly pay status, since blue-collar jobs tend to pay by the hour.
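For reference, a variance inflation factor can be computed by hand as VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors. The sketch below uses synthetic numeric predictors rather than the telework variables, and `vif_by_hand` is a hypothetical helper, not the function used in the report.

```r
set.seed(42)
n <- 500

# Synthetic predictors (NOT the telework data); x2 is built to correlate with x1.
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)
x3 <- rnorm(n)
toy <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the others.
vif_by_hand <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_by_hand(toy, c("x1", "x2", "x3"))
print(round(vifs, 2))
```

Values near 1 indicate little shared variance with the other predictors; correlated predictors such as `x1` and `x2` here show inflated values.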
## Warning in as.data.frame.numeric(age, sex, union_member,
## hourly_non_hourly, : 'row.names' is not a character vector of length 1 --
## omitting it. Will be an error!
## fit lwr upr
## 1 494.4783 -554.6931 1543.65
## Hourly Wage: 12.36175
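The warning above is what R typically emits when several vectors are passed positionally to `as.data.frame()`; building the new observation with `data.frame()` and named columns avoids it. The sketch below shows that pattern on a small synthetic model. The data, the reduced set of predictors, and the example worker are all assumptions for illustration. The printed hourly wage is consistent with dividing the fitted value by 40 hours, which is an inference rather than something the output states.

```r
set.seed(7)
n <- 400

# Synthetic data (NOT the telework survey), with a reduced set of predictors.
toy <- data.frame(
  age          = round(runif(n, 18, 65)),
  sex          = factor(sample(1:2, n, replace = TRUE)),
  hours_worked = round(runif(n, 10, 60))
)
toy$weekly_earnings <- 100 + 8 * toy$age - 150 * (toy$sex == 2) +
  20 * toy$hours_worked + rnorm(n, sd = 200)

fit <- lm(weekly_earnings ~ age + sex + hours_worked, data = toy)

# Build newdata with data.frame(): one named column per predictor,
# with factor levels matching those used in the fit.
new_worker <- data.frame(
  age          = 25,
  sex          = factor(2, levels = levels(toy$sex)),
  hours_worked = 40
)

# 95% prediction interval for a single new observation.
pred <- predict(fit, newdata = new_worker, interval = "prediction", level = 0.95)
print(pred)

# Implied hourly wage: predicted weekly earnings divided by weekly hours.
hourly_wage <- pred[1, "fit"] / new_worker$hours_worked
print(hourly_wage)
```

A prediction interval is much wider than a confidence interval for the mean because it includes the residual variation of an individual worker, which is why the interval in the output above spans such a large range.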