knitr::opts_chunk$set(prompt=FALSE,
message=FALSE,
warning=FALSE,
error=TRUE,
eval=TRUE)
Introduction
Evaluation of telecommuting using GLM.
Packages Required
Below are the packages used in this project.
Import data
Import the data from a CSV file.
Question 1
- Using a simple one-way ANOVA, answer the following questions:
- Does Teleworking appear to have a significant effect on income? How do you know, and explain the effect (if it exists.)
- Provide a visualization that helps a reader understand the model
Answers:
- Based on the one-way ANOVA model, the p-value of the as.factor(telecommute) term is very small (reported as <2e-16, so p < 0.001). This indicates that teleworking has a statistically significant effect on income.
- In addition to the model, I present the relationship between teleworking and income with a table and a boxplot. Both show that average weekly earnings with teleworking are 1,183 dollars, higher than without teleworking (832.4 dollars).
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(telecommute) 1 1.451e+08 145093488 346.5 <2e-16 ***
## Residuals 5540 2.320e+09 418730
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = weekly_earnings ~ as.factor(telecommute), data = tele)
##
## $`as.factor(telecommute)`
## diff lwr upr p adj
## 2-1 -350.7614 -387.7015 -313.8213 0
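
The ANOVA, Tukey comparison, and boxplot described in the answers could be reproduced along these lines. This is only a sketch: `tele_sim` is simulated stand-in data (not the survey data), and the coding 1 = teleworks, 2 = does not is an assumption inferred from the sign of the `2-1` difference in the output above.

```r
# Simulated stand-in for the `tele` data; the numbers are NOT the real survey values.
set.seed(1)
n <- 200
tele_sim <- data.frame(
  telecommute = rep(c(1, 2), each = n)  # assumed coding: 1 = teleworks, 2 = does not
)
tele_sim$weekly_earnings <- ifelse(tele_sim$telecommute == 1, 1180, 830) +
  rnorm(2 * n, sd = 600)

# One-way ANOVA with the same formula that appears in the output above
fit_aov <- aov(weekly_earnings ~ as.factor(telecommute), data = tele_sim)
print(summary(fit_aov))

# Group means (the "table") and the pairwise difference with its 95% interval
group_means <- tapply(tele_sim$weekly_earnings, tele_sim$telecommute, mean)
tukey <- TukeyHSD(fit_aov)
print(group_means)
print(tukey)

# Boxplot of earnings by telework status
boxplot(weekly_earnings ~ as.factor(telecommute), data = tele_sim,
        xlab = "telecommute", ylab = "weekly_earnings")
```

With only two groups, the Tukey difference is simply the gap between the two group means, which is why the table of means and the `diff` column tell the same story.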

Question 2
- Build a simple regression model estimating weekly earnings by hours worked. Do not make any transformations at this point—only generate a naïve model and answer the following questions:
- Write the generalized form of the regression using beta notation
- Write the estimated form of the regression using your results
- Provide at least 3 explanations for why we would consider this a naïve model.
- In your opinion, what do you think should be done to better model these two variables or do you think it does not make sense to model one as a function of the other under any circumstances? Provide a visualization to support your position.
- If your recommendation from part d above is possible with the data available, do your best to execute your proposed adjustments. If your recommendation from part d is not possible using this dataset, propose another solution using additional variables if possible, a different data set or another strategy.
Answers:
- weekly_earnings = beta_0 + beta_1(hours_worked)
- weekly_earnings = 66 + 22.59(hours_worked)
- 1) The R-squared is 0.16 (p < 0.001), which means only about 16% of the variation in weekly earnings is explained by hours worked. 2) The model assumes the two variables are linearly related, but weekly earnings are clearly not a simple linear function of hours. 3) Intuitively, the longer you work, the more you earn, but earnings also depend on other factors such as location (state), education, and occupation.
- Whether a worker is hourly or non-hourly should also be included in the regression model. For hourly workers there is a direct relationship between weekly earnings and hours worked, while for non-hourly workers it is hard to find a definite relationship between the two.
- I fit a multiple regression with hourly_non_hourly treated as a discrete variable via as.factor. The R-squared increases to 0.286 (p < 0.001) from 0.16 in the naive model, which means the model explains 28.6% of the variation in weekly earnings based on hours worked and the hourly/non-hourly factor.
##
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = tele)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1758.0 -411.2 -185.1 230.4 2796.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.0433 28.5766 2.311 0.0209 *
## hours_worked 22.5887 0.7072 31.943 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 613 on 5540 degrees of freedom
## Multiple R-squared: 0.1555, Adjusted R-squared: 0.1554
## F-statistic: 1020 on 1 and 5540 DF, p-value: < 2.2e-16

##
## Call:
## lm(formula = weekly_earnings ~ hours_worked + as.factor(hourly_non_hourly),
## data = tele)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1819.8 -331.6 -135.5 213.8 2728.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.0556 26.3076 0.876 0.381
## hours_worked 18.2135 0.6645 27.410 <2e-16 ***
## as.factor(hourly_non_hourly)2 498.5295 15.6472 31.861 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 563.5 on 5539 degrees of freedom
## Multiple R-squared: 0.2863, Adjusted R-squared: 0.2861
## F-statistic: 1111 on 2 and 5539 DF, p-value: < 2.2e-16
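
A minimal sketch of the comparison between the naive model and the model with the pay-type factor, again on simulated data. The data frame `sim` and the coding 1 = hourly, 2 = non-hourly are assumptions made for illustration; only the two model formulas come from the output above.

```r
# Simulated data; coefficients are chosen for illustration, not estimated from `tele`.
set.seed(2)
n <- 500
sim <- data.frame(
  hours_worked      = round(runif(n, 10, 60)),
  hourly_non_hourly = sample(c(1, 2), n, replace = TRUE)  # assumed: 1 = hourly, 2 = non-hourly
)
sim$weekly_earnings <- 20 * sim$hours_worked +
  500 * (sim$hourly_non_hourly == 2) + rnorm(n, sd = 400)

# Naive model and the model that adds pay type as a factor
naive    <- lm(weekly_earnings ~ hours_worked, data = sim)
adjusted <- lm(weekly_earnings ~ hours_worked + as.factor(hourly_non_hourly),
               data = sim)

# Compare the share of variance explained by each model
r2 <- c(naive    = summary(naive)$r.squared,
        adjusted = summary(adjusted)$r.squared)
print(round(r2, 3))

# Scatter plot coloured by pay type, with the naive fit overlaid
plot(sim$hours_worked, sim$weekly_earnings, col = sim$hourly_non_hourly,
     xlab = "hours_worked", ylab = "weekly_earnings")
abline(naive)
```

Because the second model nests the first, its R-squared cannot be lower; the size of the increase is what indicates whether pay type adds useful information.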
Question 3
- Build a simple regression model estimating weekly earnings as a function of age. Do not make any transformations at this point—only generate a naïve model and answer the following questions:
- Write the generalized form of the regression using beta notation
- Write the estimated form of the regression using your results
- Provide at least 3 explanations for why this is considered a naïve model.
- Test the linearity assumption of this model. Provide output of the tests you run to assess linearity and comment on the results.
- Identify at least 3 other possible concerns regarding this model beyond those inherent in the naïve design. For each possible concern, comment on how you would assess and address the concern.
Answers:
- weekly_earnings = beta_0 + beta_1(age)
- weekly_earnings = 549 + 9.19(age)
- 1) The R-squared is 0.037 (p < 0.001), which means only 3.7% of the variation in weekly earnings is explained by age, so the model is too simple. 2) The model assumes the relationship is linear, which is not the case here. 3) Weekly earnings may also depend on state, education, occupation, and so on, so they are not a simple linear function of age.
- To test linearity, I plot a scatter plot with the lm trend line added. The figure shows weekly earnings increasing slightly with age. I also build a model that treats age as a categorical variable using as.factor to see whether it fits better; the R-squared increases slightly to 0.09 (p < 0.001) from 0.037. The chi-square test also suggests that the model with age as a factor fits better.
- With age (as a factor) alone, the model explains only 9% of the variation in weekly earnings. Weekly earnings are likely related to several other factors, such as hours worked, location, occupation, and education. Multiple regression should therefore be considered as one solution here, with a chi-square test to see which model fits better.
##
## Call:
## lm(formula = weekly_earnings ~ age, data = tele)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1245.3 -445.3 -178.1 284.7 2133.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 548.9457 28.2350 19.44 <2e-16 ***
## age 9.1941 0.6306 14.58 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 654.6 on 5540 degrees of freedom
## Multiple R-squared: 0.03696, Adjusted R-squared: 0.03678
## F-statistic: 212.6 on 1 and 5540 DF, p-value: < 2.2e-16

##
## Call:
## lm(formula = weekly_earnings ~ as.factor(age), data = tele)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1187.7 -437.6 -145.6 258.2 2420.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 162.08 261.30 0.620 0.535092
## as.factor(age)16 10.18 320.03 0.032 0.974637
## as.factor(age)17 82.89 303.94 0.273 0.785086
## as.factor(age)18 99.58 280.22 0.355 0.722329
## as.factor(age)19 135.53 281.18 0.482 0.629806
## as.factor(age)20 249.07 274.27 0.908 0.363848
## as.factor(age)21 248.47 271.83 0.914 0.360715
## as.factor(age)22 301.79 269.97 1.118 0.263669
## as.factor(age)23 424.47 268.46 1.581 0.113909
## as.factor(age)24 458.29 267.92 1.711 0.087218 .
## as.factor(age)25 558.76 267.17 2.091 0.036540 *
## as.factor(age)26 592.74 267.50 2.216 0.026744 *
## as.factor(age)27 639.57 267.40 2.392 0.016800 *
## as.factor(age)28 704.27 267.00 2.638 0.008371 **
## as.factor(age)29 733.31 266.80 2.748 0.006007 **
## as.factor(age)30 765.64 267.22 2.865 0.004183 **
## as.factor(age)31 733.54 266.84 2.749 0.005998 **
## as.factor(age)32 876.93 267.86 3.274 0.001068 **
## as.factor(age)33 775.30 266.51 2.909 0.003640 **
## as.factor(age)34 767.55 266.84 2.876 0.004038 **
## as.factor(age)35 864.38 267.09 3.236 0.001218 **
## as.factor(age)36 744.06 267.86 2.778 0.005492 **
## as.factor(age)37 732.49 268.03 2.733 0.006299 **
## as.factor(age)38 768.71 267.86 2.870 0.004123 **
## as.factor(age)39 918.00 268.53 3.419 0.000634 ***
## as.factor(age)40 874.05 268.15 3.260 0.001123 **
## as.factor(age)41 976.63 268.27 3.640 0.000275 ***
## as.factor(age)42 882.03 268.81 3.281 0.001040 **
## as.factor(age)43 786.62 268.27 2.932 0.003380 **
## as.factor(age)44 948.26 268.09 3.537 0.000408 ***
## as.factor(age)45 855.30 267.36 3.199 0.001386 **
## as.factor(age)46 977.56 268.40 3.642 0.000273 ***
## as.factor(age)47 774.97 268.15 2.890 0.003867 **
## as.factor(age)48 900.31 267.92 3.360 0.000784 ***
## as.factor(age)49 848.85 268.03 3.167 0.001549 **
## as.factor(age)50 910.43 266.84 3.412 0.000650 ***
## as.factor(age)51 942.30 268.53 3.509 0.000453 ***
## as.factor(age)52 995.68 267.50 3.722 0.000200 ***
## as.factor(age)53 887.22 267.45 3.317 0.000915 ***
## as.factor(age)54 950.65 267.86 3.549 0.000390 ***
## as.factor(age)55 1028.78 267.70 3.843 0.000123 ***
## as.factor(age)56 846.52 267.65 3.163 0.001571 **
## as.factor(age)57 876.20 267.92 3.270 0.001081 **
## as.factor(age)58 901.67 267.50 3.371 0.000755 ***
## as.factor(age)59 927.92 269.26 3.446 0.000573 ***
## as.factor(age)60 887.80 268.95 3.301 0.000970 ***
## as.factor(age)61 877.69 268.81 3.265 0.001101 **
## as.factor(age)62 782.45 270.26 2.895 0.003805 **
## as.factor(age)63 880.58 271.17 3.247 0.001172 **
## as.factor(age)64 928.62 273.65 3.393 0.000695 ***
## as.factor(age)65 809.06 274.71 2.945 0.003242 **
## as.factor(age)66 886.10 282.24 3.140 0.001701 **
## as.factor(age)67 947.19 282.24 3.356 0.000796 ***
## as.factor(age)68 955.83 286.24 3.339 0.000846 ***
## as.factor(age)69 866.46 296.29 2.924 0.003466 **
## as.factor(age)70 470.48 292.14 1.610 0.107360
## as.factor(age)71 910.85 312.32 2.916 0.003555 **
## as.factor(age)72 721.59 356.09 2.026 0.042772 *
## as.factor(age)73 154.62 324.84 0.476 0.634112
## as.factor(age)74 759.98 330.52 2.299 0.021525 *
## as.factor(age)75 767.81 356.09 2.156 0.031113 *
## as.factor(age)76 200.52 356.09 0.563 0.573390
## as.factor(age)77 114.61 387.57 0.296 0.767464
## as.factor(age)78 1246.58 387.57 3.216 0.001306 **
## as.factor(age)79 320.48 452.59 0.708 0.478911
## as.factor(age)80 680.27 337.34 2.017 0.043790 *
## as.factor(age)85 796.24 345.67 2.303 0.021290 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 640.1 on 5475 degrees of freedom
## Multiple R-squared: 0.09003, Adjusted R-squared: 0.07906
## F-statistic: 8.207 on 66 and 5475 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Model 1: weekly_earnings ~ age
## Model 2: weekly_earnings ~ as.factor(age)
## Res.Df RSS Df Sum of Sq Pr(>Chi)
## 1 5540 2373770588
## 2 5475 2242958893 65 130811695 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
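
The linearity check described in the answers (linear age versus age as a factor, compared with a chi-square test) can be sketched as below. The data are simulated with a deliberately curved age profile, so the numbers will not match the output above; `sim`, `linear_fit`, and `factor_fit` are illustrative names.

```r
# Simulated data with a concave age-earnings profile (an assumption for illustration).
set.seed(3)
n <- 1000
sim <- data.frame(age = sample(16:70, n, replace = TRUE))
sim$weekly_earnings <- -600 + 70 * sim$age - 0.75 * sim$age^2 +
  rnorm(n, sd = 500)

# Linear-in-age model versus one separate mean per age
linear_fit <- lm(weekly_earnings ~ age, data = sim)
factor_fit <- lm(weekly_earnings ~ as.factor(age), data = sim)

# Nested-model comparison; a small p-value suggests the straight line is too restrictive
cmp <- anova(linear_fit, factor_fit, test = "Chisq")
print(cmp)

# Scatter plot with the linear fit (red) and a lowess smoother (blue) for visual comparison
plot(sim$age, sim$weekly_earnings, xlab = "age", ylab = "weekly_earnings")
abline(linear_fit, col = "red")
lines(lowess(sim$age, sim$weekly_earnings), col = "blue")
```

The factor model spends one parameter per distinct age, so a better fit is expected; the test asks whether the improvement is larger than those extra parameters alone would produce.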
Question 4
- Modify your model from Q3 by adding at least 3 other IVs to the regression and transforming the age variable (and others) as necessary to meet the linearity assumption. Interpret your results and answer the following questions:
- Write the generalized form of the regression using beta notation
- Write the estimated form of the regression using your results
- Do you suspect any of your independent variables are colinear? Explain why or why not.
- Judging by your output, what ranges of values (for your IVs and the DV, weekly earnings) are you most comfortable using this model to estimate future observations?
- Generate a hypothetical observation that exists within the ranges specified in part d and estimate the weekly earnings for that individual. In addition to this estimate, calculate the estimated range of weekly earnings this individual may have and comment on the results. HINT: use the coefficient confidence intervals to estimate the highest and lowest values of this person’s income.
Answers:
- weekly_earnings = beta_0 + beta_1(hours_worked) + beta_2(age) + beta_3(state) + beta_4(detailed_occupation_group) + beta_5(education)
- weekly_earnings = -3206 + 20.45(hours_worked) + 7.43(age) - 2.46(state) - 17.97(detailed_occupation_group) + 80.94(education)
- Based on the coefficients alone, I conclude that no independent variables are colinear, since the estimates are all quite different. Distinct coefficients do not by themselves rule out colinearity, however; pairwise correlations or variance inflation factors would be a more direct check.
- In general, the observed minimum and maximum give the appropriate range for both the IVs and the DV. Specifically: weekly_earnings [7.5, 2885], hours_worked [1, 99], age [15, 85], state [1, 56], detailed_occupation_group [1, 22], education [31, 46]. The ranges are presented in the table below.
- Assume the individual's age is 35, hours_worked is 40, state is 36 (New York), detailed_occupation_group is 3 (Computer and mathematical science occupations), and education is 46 (doctorate). The estimated weekly earnings would then be 1452.82 dollars based on the equation obtained in part b.
The estimated range for this individual is [606, 2745], where age, education, and occupation are held fixed at the values above and the other factors are allowed to vary.
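
The point estimate for the hypothetical individual can be checked by plugging the values into the estimated equation. The sketch below uses the rounded coefficients from the answer to part b; the vector names `coefs` and `person` are illustrative.

```r
# Rounded coefficients copied from the estimated equation in the answers above
coefs <- c(intercept = -3206, hours_worked = 20.45, age = 7.43,
           state = -2.46, detailed_occupation_group = -17.97,
           education = 80.94)

# Hypothetical individual from the answer (the intercept is multiplied by 1)
person <- c(intercept = 1, hours_worked = 40, age = 35,
            state = 36, detailed_occupation_group = 3, education = 46)

# Linear predictor: sum of coefficient * value, matched by name
estimate <- sum(coefs * person[names(coefs)])
print(round(estimate, 2))  # 1452.82
```

With the fitted model object at hand, `predict()` with `interval = "prediction"` would be another way to obtain a range for a new observation, although that is a different calculation from combining the coefficient confidence intervals as the hint suggests.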
##
## Call:
## lm(formula = weekly_earnings ~ hours_worked + age + state + detailed_occupation_group +
## education, data = tele)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1658.0 -343.8 -101.1 224.2 2931.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3205.7797 152.0913 -21.078 < 2e-16 ***
## hours_worked 20.4492 0.6219 32.883 < 2e-16 ***
## age 7.4271 0.5198 14.288 < 2e-16 ***
## state -2.4569 0.4506 -5.452 5.19e-08 ***
## detailed_occupation_group -17.9736 1.1633 -15.451 < 2e-16 ***
## education 80.9430 3.5009 23.121 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 536.5 on 5536 degrees of freedom
## Multiple R-squared: 0.3535, Adjusted R-squared: 0.3529
## F-statistic: 605.3 on 5 and 5536 DF, p-value: < 2.2e-16
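
One way to check the independent variables for colinearity more directly is to look at pairwise correlations and variance inflation factors (VIFs). The sketch below computes VIFs by hand on simulated predictors, so it needs no extra packages; the data and the name `vif_manual` are assumptions for illustration, not results from `tele`.

```r
# Simulated, independently drawn predictors (illustration only).
set.seed(4)
n <- 500
sim <- data.frame(
  hours_worked = round(runif(n, 1, 99)),
  age          = sample(15:85, n, replace = TRUE),
  education    = sample(31:46, n, replace = TRUE)
)

# Manual VIF: regress each predictor on the others, then VIF_j = 1 / (1 - R^2_j)
vif_manual <- sapply(names(sim), function(v) {
  others <- setdiff(names(sim), v)
  r2 <- summary(lm(reformulate(others, response = v), data = sim))$r.squared
  1 / (1 - r2)
})

print(round(cor(sim), 3))    # pairwise correlations
print(round(vif_manual, 3))  # values near 1 indicate little colinearity
```

A VIF close to 1 means a predictor is nearly unrelated to the others, while large values indicate that its coefficient estimate is inflated by overlap with the other predictors.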
| Variable | Min | Max |
|---|---|---|
| hours_worked | 1 | 99 |
| age | 15 | 85 |
| state | 1 | 56 |
| detailed_occupation_group | 1 | 22 |