1a) The p-value (<2e-16) is less than the 0.05 level of significance, meaning mean weekly earnings differ significantly between people who telecommute and people who do not.
The Tukey comparison (level 2 minus level 1, labelled No-Yes in the second model's output) estimates the difference in mean weekly earnings at -$350.76, with a 95% confidence interval from -$387.70 to -$313.82. Because this interval excludes zero, the difference is significant at the 0.05 level.
A person who telecommutes is expected to earn $350.76 more per week on average than a person who does not telecommute.
1b) Below is a visual representation of this difference based on the data.
##                          Df    Sum Sq   Mean Sq F value Pr(>F)
## as.factor(telecommute)    1 1.451e+08 145093488   346.5 <2e-16 ***
## Residuals              5540 2.320e+09    418730
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = weekly_earnings ~ as.factor(telecommute), data = working)
##
## $`as.factor(telecommute)`
##          diff       lwr       upr p adj
## 2-1 -350.7614 -387.7015 -313.8213     0
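As a quick check on the ANOVA table above, the F value and the share of variation explained by telecommute status can be recomputed from the printed sums of squares. This is a minimal sketch that uses only the rounded figures shown in the output; the variable names are illustrative and do not come from the original analysis.

```r
# Values copied from the one-way ANOVA table above (rounded as printed)
ss_telecommute <- 1.451e+08   # Sum Sq for as.factor(telecommute)
ss_residual    <- 2.320e+09   # Sum Sq for Residuals
ms_telecommute <- 145093488   # Mean Sq for as.factor(telecommute)
ms_residual    <- 418730      # Mean Sq for Residuals

# F is the ratio of the between-group mean square to the residual mean square
f_value <- ms_telecommute / ms_residual

# Eta-squared: share of total variation attributable to telecommute status
eta_sq <- ss_telecommute / (ss_telecommute + ss_residual)

round(f_value, 1)  # 346.5, matching the table
round(eta_sq, 3)   # about 0.059
```

Although the difference is highly significant, telecommute status on its own accounts for only about 6% of the variation in weekly earnings.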
1c) This is a naïve model because it has only one independent variable. Other influences need to be taken into consideration. For example, a person's gender could affect weekly earnings, and the state a person lives in could affect the ability to telecommute.
2a) For a more in-depth analysis I added geography region as a factor. The purpose is to understand whether region, alongside telecommuting, has an impact on weekly earnings. The expectation is that weekly earnings vary by region.
2b) The p-values for both factors are less than the 0.05 level of significance, meaning mean weekly earnings differ significantly by telecommute status and by geography region.
People who do not telecommute earn between $313.85 and $387.67 less per week than people who do (95% confidence interval), and the adjusted p-value is effectively 0. For region, the Northeast averages $95.70 more than the Midwest, $40.49 more than the South, and $26.61 more than the West. The Midwest averages $55.21 less than the South and $69.10 less than the West. Finally, the South averages $13.89 less than the West. Only the Northeast-Midwest (p = 0.004) and West-Midwest (p = 0.033) differences are significant at the 0.05 level; the confidence intervals for the other four pairs include zero.
Yes, this is a naïve model because many other factors can easily influence weekly earnings, such as state, industry, and part-time/full-time employment status. Each of these has categories and should have its interactions tested.
2c) Below is a visual representation of the differences.
##                               Df    Sum Sq   Mean Sq F value  Pr(>F)
## as.factor(telecommute)         1 1.451e+08 145093488 347.124 < 2e-16 ***
## as.factor(geography_region)    3 5.370e+06   1790092   4.283 0.00501 **
## Residuals                   5537 2.314e+09    417987
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = weekly_earnings ~ as.factor(telecommute) + as.factor(geography_region), data = working)
##
## $`as.factor(telecommute)`
##             diff       lwr       upr p adj
## No-Yes -350.7614 -387.6687 -313.8541     0
##
## $`as.factor(geography_region)`
##          diff         lwr       upr     p adj
## 2-1 -95.70419 -168.509675 -22.89870 0.0040941
## 3-1 -40.49404 -106.210612  25.22254 0.3881232
## 4-1 -26.60643  -95.954029  42.74116 0.7574817
## 3-2  55.21015   -6.249370 116.66967 0.0961779
## 4-2  69.09776    3.770169 134.42534 0.0333266
## 4-3  13.88760  -43.433624  71.20883 0.9248676
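To see which regional pairs actually differ, the Tukey rows above can be put in a data frame and filtered on the adjusted p-value. This is a sketch built from the printed output only; `tukey_region` and its column names are illustrative.

```r
# Tukey results for geography_region, copied from the output above
tukey_region <- data.frame(
  pair  = c("2-1", "3-1", "4-1", "3-2", "4-2", "4-3"),
  diff  = c(-95.70419, -40.49404, -26.60643, 55.21015, 69.09776, 13.88760),
  lwr   = c(-168.509675, -106.210612, -95.954029, -6.249370, 3.770169, -43.433624),
  upr   = c(-22.89870, 25.22254, 42.74116, 116.66967, 134.42534, 71.20883),
  p_adj = c(0.0040941, 0.3881232, 0.7574817, 0.0961779, 0.0333266, 0.9248676),
  stringsAsFactors = FALSE
)

# A pair differs at the 5% level when its adjusted p-value is below 0.05,
# which is the same as its 95% interval excluding zero
tukey_region$significant   <- tukey_region$p_adj < 0.05
tukey_region$excludes_zero <- tukey_region$lwr > 0 | tukey_region$upr < 0

significant_pairs <- tukey_region$pair[tukey_region$significant]
significant_pairs  # "2-1" "4-2"
```

Only two of the six pairwise differences are significant, so the remaining regional gaps should be read as estimates that are compatible with no difference.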
2d) Model 2 fits slightly better. Adding geography region (here together with its interaction with telecommute) reduces the residual sum of squares by 7,031,362, from 2,319,766,548 to 2,312,735,186. With a p-value of 0.01, this improvement is statistically significant at the 0.05 level.
## Analysis of Variance Table
##
## Model 1: weekly_earnings ~ as.factor(telecommute)
## Model 2: weekly_earnings ~ as.factor(telecommute) * as.factor(geography_region)
##   Res.Df        RSS Df Sum of Sq      F  Pr(>F)
## 1   5540 2319766548
## 2   5534 2312735186  6   7031362 2.8042 0.01003 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
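The F statistic in the model comparison above can be reproduced by hand from the two residual sums of squares, which makes clear what the test measures. This is a minimal sketch using the printed values; the variable names are illustrative.

```r
# Values copied from the Analysis of Variance Table above
rss_1 <- 2319766548; df_1 <- 5540   # Model 1: telecommute only
rss_2 <- 2312735186; df_2 <- 5534   # Model 2: telecommute * geography_region

df_diff <- df_1 - df_2    # extra parameters in Model 2
ss_diff <- rss_1 - rss_2  # reduction in residual sum of squares

# Nested-model F test: drop in RSS per extra parameter,
# relative to the residual mean square of the larger model
f_value <- (ss_diff / df_diff) / (rss_2 / df_2)
p_value <- pf(f_value, df_diff, df_2, lower.tail = FALSE)

round(f_value, 4)
round(p_value, 5)

# Relative reduction in unexplained variation
ss_diff / rss_1
```

The test is significant, but the relative drop in the residual sum of squares is only about 0.3%, so the practical gain from adding region is small.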
3a) Linear Regression Beta notation: Y[i] = Beta[0] + Beta[1]X[1] + e, where Y is Weekly Earnings, Beta[0] is the y-intercept constant, Beta[1] is the slope coefficient on the predictor X[1] (Hours Worked), and e is the prediction error.
3b) Estimated form notation for our general model: Weekly Earnings = 66.04 + 22.59(Hours Worked)
##
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = working)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1758.0  -411.2  -185.1   230.4  2796.0
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)   66.0433    28.5766   2.311   0.0209 *
## hours_worked  22.5887     0.7072  31.943   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 613 on 5540 degrees of freedom
## Multiple R-squared: 0.1555, Adjusted R-squared: 0.1554
## F-statistic: 1020 on 1 and 5540 DF, p-value: < 2.2e-16
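The estimated equation can be turned into a small prediction function, and the R-squared can be recovered from the F-statistic as a consistency check. This is a sketch based on the coefficients printed above; `predict_earnings` is a hypothetical helper, not part of the original script.

```r
# Coefficients copied from the lm() summary above
intercept <- 66.0433
slope     <- 22.5887

# Predicted weekly earnings for a given number of hours worked
predict_earnings <- function(hours) intercept + slope * hours

predict_earnings(c(20, 40))  # 517.8173 969.5913

# With one predictor, R-squared = F / (F + residual df)
f_stat   <- 1020
df_resid <- 5540
r_sq <- f_stat / (f_stat + df_resid)
round(r_sq, 4)  # 0.1555, matching Multiple R-squared
```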
3c) This is a naive model. 1) It makes assumptions that may or may not turn out to be correct, because it leaves out other variables or factors. 2) The model assumes the variables are independent when they might be correlated in some way. 3) The R-squared shows that hours worked explains only 15.5% of the variance in weekly earnings.
3d) I would start to include other predictors to see if we can improve the R-squared of the model and so have more confidence in it. In the visualization below, the linear regression line does not fit the data well. In addition, the Residuals vs. Fitted chart shows the residuals are not scattered randomly around the 0 line, suggesting that the relationship is not linear.
3e) When running a correlation matrix against Weekly Earnings, 5 factors were identified with a positive or negative correlation of 0.35 or stronger. The R-squared increased to 42%, which points to a better-fitting model than one using only the hours worked variable. The predictors hours worked, full or part-time, education, hourly or non-hourly, and detailed occupation group produced a statistically significant model, with ~42% of the variance in weekly earnings explained by the data.
4a) Linear Regression Beta notation: Y[i] = Beta[0] + Beta[1]X[1] + e, where Y is Weekly Earnings, Beta[0] is the y-intercept constant, Beta[1] is the slope coefficient on the predictor X[1] (age), and e is the prediction error.
4b) Estimated form notation for our general model: Weekly Earnings = 548.95 + 9.19(age)
##
## Call:
## lm(formula = weekly_earnings ~ age, data = working)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1245.3  -445.3  -178.1   284.7  2133.4
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 548.9457    28.2350   19.44   <2e-16 ***
## age           9.1941     0.6306   14.58   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 654.6 on 5540 degrees of freedom
## Multiple R-squared: 0.03696, Adjusted R-squared: 0.03678
## F-statistic: 212.6 on 1 and 5540 DF, p-value: < 2.2e-16
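The t values in the summary above are each estimate divided by its standard error, and the intercept is simply the model's prediction at age 0. The sketch below reproduces both from the printed coefficients; `predict_by_age` and `coefs` are illustrative names.

```r
# Coefficients and standard errors copied from the lm() summary above
coefs <- data.frame(
  term      = c("(Intercept)", "age"),
  estimate  = c(548.9457, 9.1941),
  std_error = c(28.2350, 0.6306),
  stringsAsFactors = FALSE
)

# t value = estimate / standard error
coefs$t_value <- coefs$estimate / coefs$std_error
round(coefs$t_value, 2)  # 19.44 14.58

# Predicted weekly earnings at a given age
predict_by_age <- function(age) coefs$estimate[1] + coefs$estimate[2] * age

predict_by_age(c(0, 39))  # 548.9457 907.5156
```

The prediction at age 0 lies far outside the range of working ages, so it is an extrapolation rather than a meaningful estimate.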
4c) This is considered a naive model. In real life it is almost impossible to get a set of predictors that are completely independent, and in our model we used only age. We should consider weekly earnings as a function of age, education, home location, etc. Even outside the data set we are given, there could be an unknown variable driving other variables that is not under our control.
4d) When plotting the residuals against the fitted values, a curved red line appears, showing that the fit is not ideal; we would like the red line to lie along the residuals' 0 line.
4e) 1) In reviewing the residuals vs fitted values chart above, the shape of the data resembles a parabola. Even with the outliers present (see the next point), the bulk of the data still appears curved. I would compare this model with a quadratic model to see which gives the better fit.
2) Outliers appear to be an issue. Weekly Earnings appears to have a ceiling of $3000, causing the data to skew. I would investigate removing these observations and analyze for other possible outliers. 3) Based on the equation, a newborn would have average weekly earnings of $548.95. This is not realistic, so I would treat observations at very young ages as outliers as well and seek to remove them.
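The comparison with a quadratic model suggested in point 1 could look like the sketch below. It runs on simulated data, not on the `working` data set, and the coefficients used to generate the data are invented purely for illustration.

```r
set.seed(42)  # simulated data only; NOT the `working` data set

n   <- 500
age <- runif(n, 18, 70)
# Hypothetical earnings that rise, peak, and then fall with age
earnings <- 200 + 45 * age - 0.5 * age^2 + rnorm(n, sd = 150)
sim <- data.frame(age = age, earnings = earnings)

fit_linear    <- lm(earnings ~ age, data = sim)
fit_quadratic <- lm(earnings ~ age + I(age^2), data = sim)

# Nested-model F test: does the squared term improve the fit?
comparison <- anova(fit_linear, fit_quadratic)
comparison

# Adjusted R-squared for each model
c(linear    = summary(fit_linear)$adj.r.squared,
  quadratic = summary(fit_quadratic)$adj.r.squared)
```

On the real data the same two `lm()` calls would use `weekly_earnings ~ age` and `weekly_earnings ~ age + I(age^2)` with `data = working`; a small p-value in the `anova()` row for the second model would support the quadratic form.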
5a) Multi-Variate Regression Beta notation: Y[i] = Beta[0] + Beta[1]X[1] + Beta[2]X[2] + Beta[3]X[3] + … + Beta[k]X[k] + e, where Y is Weekly Earnings, Beta[0] is the y-intercept constant, Beta[1] through Beta[k] are the coefficients on the predictors X[1] through X[k], and e is the prediction error. The model can include any number of predictors, as indicated by the Beta[k]X[k] term.
5b) Estimated form notation for our general model: Weekly Earnings = -2509.19 + 14.09(hours worked) + 6.28(age) + 65.59(education) + 271.99(hourly or non-hourly) - 257.28(full or part-time) - 12.46(detailed occupation group)
5c) Taking the same explanatory variables identified in 3e, we will examine the collinearity of our factors. I would suspect collinearity based on the variable names. For example, hourly_non_hourly and full_or_part_time could be collinear, since part-time status and hourly pay typically go together. Also, education and age could be collinear because education level is typically higher as people get older.
We can test these suspicions by running the Variance Inflation Factor (VIF) to detect multicollinearity. The suspected collinearity turned out to be weaker than expected. With the VIF of each independent variable being less than 5, we shouldn't be worried about multicollinearity. If a value were greater than 5, we would investigate further.
library(car)  # provides vif()
vif(weeklyearnings_wkearnings5.lm1)
##                                          GVIF Df GVIF^(1/(2*Df))
## hours_worked                         1.619233  1        1.272491
## as.factor(age)                       2.628669 66        1.007349
## as.factor(education)                 3.858904 15        1.046041
## as.factor(hourly_non_hourly)         1.454462  1        1.206011
## as.factor(full_or_part_time)         1.653899  1        1.286040
## as.factor(detailed_occupation_group) 3.692145 21        1.031589
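The last column of the output above adjusts each GVIF for the number of coefficients the term uses, and it can be recomputed directly from the first two columns. The sketch below does that; comparing the squared adjusted value with the usual VIF cutoff of 5 is a commonly cited convention and is treated here as an assumption, not as part of the original analysis.

```r
# GVIF and Df copied from the vif() output above
vif_table <- data.frame(
  term = c("hours_worked", "age", "education", "hourly_non_hourly",
           "full_or_part_time", "detailed_occupation_group"),
  gvif = c(1.619233, 2.628669, 3.858904, 1.454462, 1.653899, 3.692145),
  df   = c(1, 66, 15, 1, 1, 21),
  stringsAsFactors = FALSE
)

# Adjusted value reported in the last column: GVIF^(1 / (2 * Df))
vif_table$adjusted <- vif_table$gvif^(1 / (2 * vif_table$df))

# Squaring the adjusted value puts multi-df terms on the ordinary VIF scale
# (assumed convention), so they can be compared with the cutoff of 5
vif_table$adjusted_sq <- vif_table$adjusted^2

vif_table
any(vif_table$adjusted_sq > 5)  # FALSE
```

On this scale every term is below 2, which supports the conclusion that multicollinearity is not a concern for this model.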
5d) With no multicollinearity between the variables and a statistically significant model in which all variables are significant, I would use this model only if this is the only data I can collect. There is still the concern that the model explains only about 40% of the variance in weekly earnings, so I would use it cautiously and alongside further data analysis.
5e) The sample data represents an hourly worker, 39 years of age, with a high school diploma or equivalent, in sales and related occupations, working full-time at 40 hours. This worker is predicted to earn $202.25 per week on average, with a possible range from $160.19 to $244.31.
Visualization of confidence interval: Note: the confidence interval is narrow, but the explanatory variables in the model account for only about 40% of the variance in weekly earnings.