a. Does Teleworking appear to have a significant effect on income? How do you know, and explain the effect (if it exists).
Based on a simple one-way ANOVA of teleworking and income, there appears to be a significant effect on income because the p-value is less than 0.05. The Tukey comparison shows that telecommuting is associated with an increase of 350.76 dollars in weekly earnings, with a 95% confidence range of +/- 36.94 dollars (313.82 to 387.70).
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(telecommute) 1 1.451e+08 145093488 346.5 <2e-16 ***
## Residuals 5540 2.320e+09 418730
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = weekly_earnings ~ as.factor(telecommute), data = telework)
##
## $`as.factor(telecommute)`
## diff lwr upr p adj
## 1-0 350.7614 313.8213 387.7015 0
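The following is a minimal sketch of how output like the above can be produced and how the +/- range is read off the Tukey interval. The data frame `sim` and its effect sizes are simulated stand-ins (an assumption made for illustration), not the `telework` data.

```r
# Simulated stand-in data (hypothetical values, not the telework data set).
set.seed(1)
n <- 400
telecommute <- rbinom(n, 1, 0.3)
weekly_earnings <- 900 + 350 * telecommute + rnorm(n, sd = 600)
sim <- data.frame(weekly_earnings, telecommute = as.factor(telecommute))

# One-way ANOVA followed by Tukey's honest significant difference.
fit <- aov(weekly_earnings ~ telecommute, data = sim)
print(summary(fit))
tk <- TukeyHSD(fit)
print(tk)

# The "+/-" range is half the width of the interval around the difference.
est <- tk$telecommute["1-0", "diff"]
lwr <- tk$telecommute["1-0", "lwr"]
upr <- tk$telecommute["1-0", "upr"]
half_width <- (upr - lwr) / 2
cat("difference:", round(est, 2), "+/-", round(half_width, 2), "\n")
```

With only two groups, the Tukey difference is simply the difference between the two group means.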
b. Provide a visualization that helps a reader understand the model.
c. Provide at least two explanations for why this is considered a naive model.
This analysis provides a naive model because (1) it omits other independent variables and therefore any interaction between teleworking and those variables, and (2) a one-way ANOVA on its own only tells us that at least two groups differ from each other; it does not tell us which groups differ, so a follow-up comparison such as Tukey's test is needed.
a. Explain your model design. Why did you use this additional factor variable compared to others?
I wanted to examine the effect of sex (male versus female) on weekly earnings. Since an earnings difference already exists in the workplace in general, I wanted to see whether it would still hold true in a telecommute setting.
b. Interpret the results of your model. Do you consider this a naive model? Why or why not?
Based on a two-way ANOVA of the effect of teleworking and sex on income, both variables appear to have a significant effect on income because the p-value is less than 0.05 for each. Telecommuting is associated with an increase of 350.76 dollars, with a 95% confidence range of +/- 36.29 dollars. Being male is associated with an increase of 242.15 dollars, with a confidence range of +/- 33.49 dollars.
This model would be considered naive because it does not evaluate the interaction between teleworking and sex.
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(telecommute) 1 1.451e+08 145093488 359.0 <2e-16 ***
## sex 1 8.142e+07 81418057 201.5 <2e-16 ***
## Residuals 5539 2.238e+09 404107
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = weekly_earnings ~ as.factor(telecommute) + sex, data = telework)
##
## $`as.factor(telecommute)`
## diff lwr upr p adj
## 1-0 350.7614 314.4721 387.0508 0
##
## $sex
## diff lwr upr p adj
## M-F 242.1472 208.6566 275.6378 0
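The +/- ranges for this model follow from the Tukey intervals as half of each interval's width (values rounded to two decimals):

\[
\frac{387.05 - 314.47}{2} = 36.29, \qquad \frac{275.64 - 208.66}{2} = 33.49
\]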
c. Provide a visualization that helps a reader understand the model.
d. Test the difference between your model from Q1 and your model in Q2. Is there an improvement in fit? How do you know?
Adding sex to the ANOVA improves the fit. The reduction in the residual sum of squares (81,418,057) is different from zero and statistically significant (F = 201.48, p < 0.05).
## Analysis of Variance Table
##
## Model 1: weekly_earnings ~ as.factor(telecommute)
## Model 2: weekly_earnings ~ as.factor(telecommute) + sex
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 5540 2319766548
## 2 5539 2238348491 1 81418057 201.48 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
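The F statistic in this table can be reproduced by hand from the two residual sums of squares, which shows what the test is measuring: the drop in RSS per added parameter, relative to the residual variance of the larger model.

\[
F = \frac{(RSS_1 - RSS_2)/(df_1 - df_2)}{RSS_2/df_2}
  = \frac{(2319766548 - 2238348491)/(5540 - 5539)}{2238348491/5539}
  = \frac{81418057}{404107} \approx 201.48
\]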
##
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = telework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1758.0 -411.2 -185.1 230.4 2796.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.0433 28.5766 2.311 0.0209 *
## hours_worked 22.5887 0.7072 31.943 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 613 on 5540 degrees of freedom
## Multiple R-squared: 0.1555, Adjusted R-squared: 0.1554
## F-statistic: 1020 on 1 and 5540 DF, p-value: < 2.2e-16
a. Write the generalized form of the regression using beta notation.
\(Weekly~Earnings_i = \beta_0 + \beta_1(Hours~Worked_i) + \epsilon_i\)
b. Write the estimated form of the regression using your results.
\(\widehat{Weekly~Earnings}_i = 66.04 + 22.59(Hours~Worked_i)\)
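As an illustration of how to read the estimated equation, take a hypothetical worker with 40 hours per week (a value chosen only as an example):

\[
66.04 + 22.59 \times 40 = 969.64
\]

Each additional hour worked is associated with about 22.59 dollars more in predicted weekly earnings.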
c. Provide at least 3 explanations for why we would consider this a naive model
d. In your opinion, what do you think should be done to better model these two variables or do you think it does not make sense to model one as a function of the other under any circumstances? Provide a visualization to support your position.
By examining the chart below, I do not think that the two variables could be adjusted to provide comparable information. For instance, a full-time, salaried worker will earn 0 for each hour worked past 40 per week. Also, there are full-time employees listed with hours lower than the 30-hour requirement to be classified as full-time. I think a better way to look at the data presented is to determine weekly earnings by hourly wage for those employees that have telecommuted for their employer.
e. If your recommendation from part d above is possible with the data available, do your best to execute your proposed adjustments. If your recommendation from part d is not possible using this dataset, propose another solution using additional variables if possible, a different data set or another strategy.
As stated above - I would determine weekly earnings by hourly wage for those employees that have telecommuted for their employer.
##
## Call:
## lm(formula = weekly_earnings ~ age, data = telework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1245.3 -445.3 -178.1 284.7 2133.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 548.9457 28.2350 19.44 <2e-16 ***
## age 9.1941 0.6306 14.58 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 654.6 on 5540 degrees of freedom
## Multiple R-squared: 0.03696, Adjusted R-squared: 0.03678
## F-statistic: 212.6 on 1 and 5540 DF, p-value: < 2.2e-16
a. Write the generalized form of the regression using beta notation.
\(Weekly~Earnings_i = \beta_0 + \beta_1(Age_i) + \epsilon_i\)
b. Write the estimated form of the regression using your results.
\(\widehat{Weekly~Earnings}_i = 548.95 + 9.19(Age_i)\)
c. Provide at least 3 explanations for why we would consider this a naive model
d. Test the linearity assumption of this model. Provide output of the tests you run to assess linearity and comment on the results.
The linearity line is rather flat and excludes the possibility of a more parabolic shape of the data.
e. Identify at least 3 other possible concerns regarding this model beyond those inherent in the naive design. For each possible concern, comment on how you would assess and address the concern.
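One way to check for the parabolic shape mentioned above is to compare the straight-line fit against a fit with a squared age term. The sketch below uses simulated data with a built-in curve (an assumption for illustration, not the `telework` data), so the names `sim`, `linear_fit`, and `quadratic_fit` are hypothetical.

```r
# Simulated stand-in data (hypothetical): earnings rise with age, then fall.
set.seed(123)
n <- 1000
age <- runif(n, 18, 70)
weekly_earnings <- 200 + 40 * age - 0.4 * age^2 + rnorm(n, sd = 150)
sim <- data.frame(age, weekly_earnings)

linear_fit    <- lm(weekly_earnings ~ age, data = sim)
quadratic_fit <- lm(weekly_earnings ~ age + I(age^2), data = sim)

# Residuals vs fitted: a curved band suggests the straight line misses a parabola.
plot(fitted(linear_fit), resid(linear_fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Nested-model F test: does the squared term reduce the residual sum of squares?
comparison <- anova(linear_fit, quadratic_fit)
print(comparison)
```

A small p-value in the second row of the comparison would indicate that the squared term improves the fit, which is evidence against a purely linear relationship.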
Assessment: Most evident in a plot of observed versus predicted values or a plot of residuals versus predicted values.
Correction: Apply a nonlinear transformation to the dependent and/or independent variables if you can think of a transformation that seems appropriate.
Assessment: Examine a plot of residuals versus predicted values.
Correction: Seasonal patterns in the data are a common source of heteroscedasticity in the errors.
Assessment: The best test for normally distributed errors is a normal probability plot.
Correction: If the problem with the error distribution is mainly due to one or two very large errors, such values should be scrutinized closely.
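The assessments listed above can be sketched with base R diagnostics. The data here are simulated with error spread that grows with age (an assumption for illustration, not the `telework` data), and the 4/n cutoff for Cook's distance is a common rule of thumb rather than a fixed standard.

```r
# Simulated stand-in data (hypothetical): error spread increases with age.
set.seed(7)
n <- 500
age <- runif(n, 18, 70)
weekly_earnings <- 550 + 9 * age + rnorm(n, sd = 5 * age)
fit <- lm(weekly_earnings ~ age)

# Residuals vs predicted values: curvature suggests nonlinearity,
# a fan shape suggests heteroscedasticity.
plot(fitted(fit), resid(fit),
     xlab = "Predicted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal probability plot of the residuals.
qqnorm(resid(fit))
qqline(resid(fit))

# Flag observations with unusually large influence for closer scrutiny.
cooks <- cooks.distance(fit)
influential <- which(cooks > 4 / n)
cat("observations flagged:", length(influential), "\n")
```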
##
## Call:
## lm(formula = weekly_earnings ~ age_group + education_group +
## sex + as.factor(full_or_part_time) + as.factor(msa_size),
## data = telework_adj1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1645.61 -336.92 -92.96 224.85 2694.58
##
## Coefficients:
##                                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)                                  569.05      29.97  18.988  < 2e-16 ***
## age_group30-39                               158.36      22.39   7.071 1.72e-12 ***
## age_group40-49                               269.00      23.01  11.693  < 2e-16 ***
## age_group50-59                               312.71      22.72  13.761  < 2e-16 ***
## age_group60+                                 269.04      26.05  10.326  < 2e-16 ***
## age_groupUnder 20                             15.48      55.05   0.281  0.77855
## education_groupBachelor Degree               282.77      25.74  10.987  < 2e-16 ***
## education_groupHigh School                  -165.14      25.03  -6.596 4.61e-11 ***
## education_groupMasters/Professional Degree   508.42      29.04  17.507  < 2e-16 ***
## education_groupSome College                  -77.22      26.54  -2.909  0.00364 **
## sexM                                         211.26      14.73  14.341  < 2e-16 ***
## as.factor(full_or_part_time)2               -526.60      21.62 -24.356  < 2e-16 ***
## as.factor(msa_size)2                          13.61      29.00   0.470  0.63872
## as.factor(msa_size)3                          61.36      28.72   2.137  0.03266 *
## as.factor(msa_size)4                          46.91      25.54   1.837  0.06626 .
## as.factor(msa_size)5                          47.03      23.35   2.014  0.04404 *
## as.factor(msa_size)6                         244.17      24.96   9.782  < 2e-16 ***
## as.factor(msa_size)7                         130.12      22.92   5.677 1.44e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 540 on 5524 degrees of freedom
## Multiple R-squared: 0.3465, Adjusted R-squared: 0.3445
## F-statistic: 172.3 on 17 and 5524 DF, p-value: < 2.2e-16
a. Write the generalized form of the regression using beta notation.
Generalized regression formula. The omitted reference categories are age group 20-29, education level of Associate degree, female, working full-time, and MSA size category 1.
\(Weekly~Earnings_i = \beta_0 + \beta_1(Age~Group_i) + \beta_2(Education~Group_i) + \beta_3(Sex_i) + \beta_4(Employment~Status_i) + \beta_5(MSA~Size_i) + \epsilon_i\)
b. Write the estimated form of the regression using your results.
Estimated regression formula. The omitted reference categories are age group 20-29, education level of Associate degree, female, working full-time, and MSA size category 1.
Weekly Earnings = 569.05 + 15.48 x age_groupUnder20 + 158.36 x age_group30-39 + 269.00 x age_group40-49 + 312.71 x age_group50-59 + 269.04 x age_group60+ + 282.77 x education_levelBachelorDegree - 165.14 x education_levelHighSchool + 508.42 x education_levelMasters/ProfessionalDegree - 77.22 x education_levelSomeCollege + 211.26 x sex_male - 526.60 x part-time + 13.61 x msa_size_100,000-249,999 + 61.36 x msa_size_250,000-499,999 + 46.91 x msa_size_500,000-999,999 + 47.03 x msa_size_1,000,000-2,499,999 + 244.17 x msa_size_2,500,000-4,999,999 + 130.12 x msa_size_5,000,000+
c. Do you suspect any of your independent variables are collinear? Explain why or why not.
I do not suspect that any of the independent variables are collinear, because none of them appears to follow the same pattern as another.
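A suspicion about collinearity can be checked numerically with variance inflation factors (VIFs). The sketch below computes a VIF for each dummy column from the inverse of the predictors' correlation matrix, using simulated, independently drawn factors (an assumption for illustration, not the `telework_adj1` data). Values near 1 indicate little collinearity; a common rule of thumb treats values above 5 or 10 as a concern.

```r
# Simulated stand-in predictors (hypothetical), drawn independently of each other.
set.seed(42)
n <- 500
sim <- data.frame(
  age_group = factor(sample(c("20-29", "30-39", "40-49"), n, replace = TRUE)),
  sex       = factor(sample(c("F", "M"), n, replace = TRUE)),
  part_time = factor(sample(c(1, 2), n, replace = TRUE, prob = c(0.8, 0.2)))
)

# Dummy-coded design matrix without the intercept column.
X <- model.matrix(~ age_group + sex + part_time, data = sim)[, -1]

# VIF_j = 1 / (1 - R^2_j), which equals the j-th diagonal entry
# of the inverse correlation matrix of the predictors.
vif <- diag(solve(cor(X)))
print(round(vif, 2))
```

Dummy columns that belong to the same factor are correlated with each other by construction, so per-dummy VIFs slightly above 1 for those columns are expected.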
d. Judging by your output, what ranges of values are you most comfortable using this model to estimate future observations?
Please see information below:
## 2.5 % 97.5 %
## (Intercept) 510.302975 627.80578
## age_group30-39 114.459050 202.26294
## age_group40-49 223.896575 314.09631
## age_group50-59 268.161212 357.25765
## age_group60+ 217.961571 320.11478
## age_groupUnder 20 -92.436378 123.39795
## education_groupBachelor Degree 232.313169 333.22078
## education_groupHigh School -214.215241 -116.05981
## education_groupMasters/Professional Degree 451.490512 565.35228
## education_groupSome College -129.254535 -25.18429
## sexM 182.376021 240.13415
## as.factor(full_or_part_time)2 -568.981957 -484.21011
## as.factor(msa_size)2 -43.232113 70.46210
## as.factor(msa_size)3 5.063832 117.65392
## as.factor(msa_size)4 -3.150957 96.97092
## as.factor(msa_size)5 1.255404 92.80598
## as.factor(msa_size)6 195.241424 293.10679
## as.factor(msa_size)7 85.190103 175.05228
e. Generate a hypothetical observation that exists within the ranges specified in part d and estimate the weekly earnings for that individual. In addition to this estimate, calculate the estimated range of weekly earnings this individual may have and comment on the results.
For this hypothetical observation I will use a 40-year-old male with a Master’s degree who works full-time and lives in a metropolitan area of 2.1M people.
Weekly earnings = 569.05 + 269.00 x age_group40-49 + 508.42 x education_levelMasters/ProfessionalDegree + 211.26 x sex_male + 47.03 x msa_size_1,000,000-2,499,999
Weekly earnings = 1,604.76
Weekly earnings (Low) = 510.30 + 223.90 x age_group40-49 + 451.49 x education_levelMasters/ProfessionalDegree + 182.38 x sex_male + 1.26 x msa_size_1,000,000-2,499,999
Weekly earnings (Low) = 1,369.33
Weekly earnings (High) = 627.81 + 314.10 x age_group40-49 + 565.35 x education_levelMasters/ProfessionalDegree + 240.13 x sex_male + 92.81 x msa_size_1,000,000-2,499,999
Weekly earnings (High) = 1,840.20
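The range above is built by adding up the lower and upper bounds of each coefficient. As a possible cross-check, R's `predict()` can return an interval for a single hypothetical individual directly. The sketch below uses a simulated data set and a reduced two-variable model (both are assumptions for illustration, not the `telework_adj1` data or the full model).

```r
# Simulated stand-in data (hypothetical effect sizes).
set.seed(2024)
n <- 800
sim <- data.frame(
  age_group = factor(sample(c("20-29", "30-39", "40-49"), n, replace = TRUE)),
  sex       = factor(sample(c("F", "M"), n, replace = TRUE))
)
sim$weekly_earnings <- 570 +
  160 * (sim$age_group == "30-39") +
  270 * (sim$age_group == "40-49") +
  210 * (sim$sex == "M") +
  rnorm(n, sd = 540)

fit <- lm(weekly_earnings ~ age_group + sex, data = sim)

# Hypothetical individual: male, aged 40-49.
new_person <- data.frame(
  age_group = factor("40-49", levels = levels(sim$age_group)),
  sex       = factor("M", levels = levels(sim$sex))
)

# Confidence interval: uncertainty about the average earnings of such people.
conf_int <- predict(fit, newdata = new_person, interval = "confidence", level = 0.95)
# Prediction interval: range for one individual's earnings (always wider).
pred_int <- predict(fit, newdata = new_person, interval = "prediction", level = 0.95)

print(conf_int)
print(pred_int)
```

The prediction interval is wider than the confidence interval because it also includes the residual spread of individual earnings around the fitted mean.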