GLM Assignment 1
Part 1
The first part of this assignment will answer the following questions.
A. Does teleworking appear to have a significant effect on income? How do you know? Explain the effect (if it exists).
B. Provide a visualization that helps a reader understand the model.
Part A: As seen in the results below, the ANOVA p-value for telecommute is far below 0.05 (< 2e-16), which suggests that mean weekly earnings differ significantly between people who telecommute and people who do not. A Tukey test was performed as a post hoc test. Its results are also statistically significant: the 95% confidence interval for the difference in means (-387.70 to -313.82) does not contain zero. The estimated difference of -350.76 means that group 2 earns about 350.76 less per week on average than group 1.
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(telecommute) 1 1.451e+08 145093488 346.5 <2e-16 ***
## Residuals 5540 2.320e+09 418730
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = weekly_earnings ~ as.factor(telecommute), data = telework)
##
## $`as.factor(telecommute)`
## diff lwr upr p adj
## 2-1 -350.7614 -387.7015 -313.8213 0
The boxplot below displays the two populations visually. It suggests that people who telecommute have higher weekly earnings on average than people who do not, and that weekly earnings vary more among telecommuters. Overall, the difference between the two populations is statistically significant.
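The analysis in Part 1 can be reproduced along the following lines. This is a minimal sketch on simulated data, since the `telework` data set is not included here; the group means, spreads, and sample sizes, as well as the names `sim`, `fit`, and `tukey`, are assumptions chosen only for illustration.

```r
# Simulated stand-in for the telework data (all values are illustrative assumptions).
set.seed(1)
sim <- data.frame(
  telecommute = rep(c(1, 2), times = c(300, 700)),
  weekly_earnings = c(rnorm(300, mean = 1250, sd = 700),
                      rnorm(700, mean = 900, sd = 600))
)

# One-way ANOVA of earnings on the telecommute indicator.
fit <- aov(weekly_earnings ~ as.factor(telecommute), data = sim)
summary(fit)

# Tukey post hoc comparison; with two groups this is a single pairwise difference.
tukey <- TukeyHSD(fit)
tukey

# Side-by-side boxplot of the two groups (drawn only in an interactive session).
if (interactive()) {
  boxplot(weekly_earnings ~ as.factor(telecommute), data = sim,
          xlab = "telecommute", ylab = "weekly_earnings")
}
```

A confidence interval in the `tukey` output that excludes zero points to a significant difference between the two group means, which is the same reading applied to the real output above.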
Part 2
Build a simple regression model estimating weekly earnings by hours worked. Do not make any transformations at this point—only generate a naïve model and answer the following questions:
- Write the generalized form of the regression using beta notation
- Write the estimated form of the regression using your results
- Provide at least 3 explanations for why we would consider this a naïve model.
- In your opinion, what do you think should be done to better model these two variables or do you think it does not make sense to model one as a function of the other under any circumstances? Provide a visualization to support your position.
- If your recommendation from part d above is possible with the data available, do your best to execute your proposed adjustments. If your recommendation from part d is not possible using this dataset, propose another solution using additional variables if possible, a different data set or another strategy.
Part A: Generalized form of the regression:
Weekly_Earnings = β_0 + β_1(Hours_Worked) + ε
Part B: The summary below shows that the estimated regression equation is:
Weekly_Earnings = 66.04 + 22.59(Hours_Worked)
##
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = telework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1758.0 -411.2 -185.1 230.4 2796.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.0433 28.5766 2.311 0.0209 *
## hours_worked 22.5887 0.7072 31.943 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 613 on 5540 degrees of freedom
## Multiple R-squared: 0.1555, Adjusted R-squared: 0.1554
## F-statistic: 1020 on 1 and 5540 DF, p-value: < 2.2e-16
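To make the estimated equation concrete, the short sketch below plugs a few values of hours worked into it. The coefficients are copied from the summary above; the hours values (0, 20, and 40) and the variable names are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the regression summary above.
b0 <- 66.0433   # intercept
b1 <- 22.5887   # slope on hours_worked

# Hypothetical hours values chosen for illustration.
hours <- c(0, 20, 40)
predicted <- b0 + b1 * hours

data.frame(hours_worked = hours,
           predicted_weekly_earnings = round(predicted, 2))
```

Each additional hour adds about 22.59 to the predicted weekly earnings. The prediction at zero hours (about 66.04) has no sensible interpretation, which is one more sign that the model is naive.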
Part C: This is a naive model because it assumes that hours worked is the only factor that explains weekly earnings. The adjusted R^2 is 0.1554 with a p-value less than 0.05, which means that only about 15.5% of the variance in weekly earnings is explained by hours worked. Several other factors could help explain weekly earnings, such as occupation and state of residence. The model also assumes a linear relationship between X and Y, which may not be the case. Linear regression models only the mean of the dependent variable, and because it is sensitive to outliers, outliers in the data could have a large effect on the fit. Heteroscedasticity (unequal scatter in the data) may also be present.
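The heteroscedasticity concern raised in Part C can be checked numerically. The sketch below is a hand-computed Breusch-Pagan style test (Koenker's studentized form) in base R. It runs on simulated data in which the error spread grows with hours by construction; the data, the coefficients used to generate it, and all variable names are assumptions for illustration, not results from the `telework` data.

```r
# Simulated data: the error standard deviation grows with hours (assumption).
set.seed(42)
n <- 500
hours <- sample(1:80, n, replace = TRUE)
earnings <- 60 + 22 * hours + rnorm(n, sd = 5 * hours)

# Naive linear fit, analogous to weekly_earnings ~ hours_worked.
fit <- lm(earnings ~ hours)

# Auxiliary regression of the squared residuals on the predictor.
aux <- lm(resid(fit)^2 ~ hours)

# Test statistic n * R^2, compared with a chi-squared distribution whose
# degrees of freedom equal the number of predictors (1 here).
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)
c(statistic = lm_stat, p_value = p_value)
```

A small p-value rejects constant error variance. A plot of `resid(fit)` against `fitted(fit)` gives the visual counterpart: a funnel shape indicates unequal scatter.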
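Part D: To better model these two variables, hours worked could be treated as a factor instead of a continuous variable. This raises the adjusted R^2 from 0.155 to 0.204 (multiple R^2 of 0.216), with an overall p-value below 0.05, so hours as a factor explains roughly 20% of the variance in weekly earnings. The cost is that the model estimates 79 coefficients instead of one, many of which are not individually significant.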
##
## Call:
## lm(formula = weekly_earnings ~ as.factor(hours_worked), data = telework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1632.5 -386.5 -153.8 213.5 2531.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 990.815 86.776 11.418 < 2e-16 ***
## as.factor(hours_worked)2 -442.103 279.845 -1.580 0.114208
## as.factor(hours_worked)3 -896.131 279.845 -3.202 0.001371 **
## as.factor(hours_worked)4 -514.070 199.260 -2.580 0.009909 **
## as.factor(hours_worked)5 -621.672 192.414 -3.231 0.001241 **
## as.factor(hours_worked)6 -612.339 192.414 -3.182 0.001469 **
## as.factor(hours_worked)7 -610.509 216.458 -2.820 0.004813 **
## as.factor(hours_worked)8 -419.548 136.345 -3.077 0.002101 **
## as.factor(hours_worked)9 -671.065 309.854 -2.166 0.030373 *
## as.factor(hours_worked)10 -625.625 133.938 -4.671 3.07e-06 ***
## as.factor(hours_worked)11 -739.127 257.907 -2.866 0.004175 **
## as.factor(hours_worked)12 -763.864 142.021 -5.379 7.82e-08 ***
## as.factor(hours_worked)13 -245.565 309.854 -0.793 0.428092
## as.factor(hours_worked)14 -748.198 309.854 -2.415 0.015782 *
## as.factor(hours_worked)15 -765.548 131.762 -5.810 6.60e-09 ***
## as.factor(hours_worked)16 -515.392 118.174 -4.361 1.32e-05 ***
## as.factor(hours_worked)17 -752.965 309.854 -2.430 0.015128 *
## as.factor(hours_worked)18 -677.350 164.900 -4.108 4.06e-05 ***
## as.factor(hours_worked)19 -612.935 429.521 -1.427 0.153631
## as.factor(hours_worked)20 -637.923 98.428 -6.481 9.91e-11 ***
## as.factor(hours_worked)21 -665.224 257.907 -2.579 0.009926 **
## as.factor(hours_worked)22 -397.373 161.732 -2.457 0.014042 *
## as.factor(hours_worked)23 -612.882 216.458 -2.831 0.004651 **
## as.factor(hours_worked)24 -325.775 104.720 -3.111 0.001875 **
## as.factor(hours_worked)25 -586.618 106.279 -5.520 3.55e-08 ***
## as.factor(hours_worked)26 -583.788 186.426 -3.131 0.001748 **
## as.factor(hours_worked)27 -573.860 156.152 -3.675 0.000240 ***
## as.factor(hours_worked)28 -563.276 140.478 -4.010 6.16e-05 ***
## as.factor(hours_worked)29 -348.008 192.414 -1.809 0.070562 .
## as.factor(hours_worked)30 -387.214 96.968 -3.993 6.60e-05 ***
## as.factor(hours_worked)31 -429.634 216.458 -1.985 0.047213 *
## as.factor(hours_worked)32 -126.997 98.231 -1.293 0.196122
## as.factor(hours_worked)33 -103.723 199.260 -0.521 0.602708
## as.factor(hours_worked)34 -71.693 158.827 -0.451 0.651727
## as.factor(hours_worked)35 -326.612 97.506 -3.350 0.000815 ***
## as.factor(hours_worked)36 -161.189 107.695 -1.497 0.134524
## as.factor(hours_worked)37 -204.910 124.076 -1.651 0.098697 .
## as.factor(hours_worked)38 -179.297 108.367 -1.655 0.098076 .
## as.factor(hours_worked)39 -218.162 151.386 -1.441 0.149616
## as.factor(hours_worked)40 -44.338 87.550 -0.506 0.612577
## as.factor(hours_worked)41 292.709 176.422 1.659 0.097144 .
## as.factor(hours_worked)42 -4.121 117.214 -0.035 0.971953
## as.factor(hours_worked)43 -6.798 133.938 -0.051 0.959525
## as.factor(hours_worked)44 102.695 127.977 0.802 0.422325
## as.factor(hours_worked)45 263.183 94.735 2.778 0.005486 **
## as.factor(hours_worked)46 -135.655 147.264 -0.921 0.357006
## as.factor(hours_worked)47 132.231 192.414 0.687 0.491972
## as.factor(hours_worked)48 138.669 117.686 1.178 0.238731
## as.factor(hours_worked)49 81.115 207.176 0.392 0.695424
## as.factor(hours_worked)50 496.916 92.357 5.380 7.74e-08 ***
## as.factor(hours_worked)51 606.676 257.907 2.352 0.018693 *
## as.factor(hours_worked)52 -94.109 145.404 -0.647 0.517515
## as.factor(hours_worked)53 398.068 192.414 2.069 0.038611 *
## as.factor(hours_worked)54 138.817 309.854 0.448 0.654165
## as.factor(hours_worked)55 517.677 107.063 4.835 1.37e-06 ***
## as.factor(hours_worked)56 226.877 156.152 1.453 0.146301
## as.factor(hours_worked)57 -125.435 429.521 -0.292 0.770271
## as.factor(hours_worked)58 91.446 186.426 0.491 0.623783
## as.factor(hours_worked)59 -119.815 429.521 -0.279 0.780293
## as.factor(hours_worked)60 501.582 98.167 5.109 3.34e-07 ***
## as.factor(hours_worked)62 305.951 257.907 1.186 0.235562
## as.factor(hours_worked)63 1044.503 279.845 3.732 0.000192 ***
## as.factor(hours_worked)64 -1.735 241.018 -0.007 0.994256
## as.factor(hours_worked)65 597.004 149.254 4.000 6.42e-05 ***
## as.factor(hours_worked)66 -40.308 309.854 -0.130 0.896503
## as.factor(hours_worked)68 -244.982 354.263 -0.692 0.489265
## as.factor(hours_worked)70 691.676 149.254 4.634 3.67e-06 ***
## as.factor(hours_worked)72 -298.515 601.204 -0.497 0.619541
## as.factor(hours_worked)74 788.025 429.521 1.835 0.066611 .
## as.factor(hours_worked)75 379.937 279.845 1.358 0.174626
## as.factor(hours_worked)76 316.875 429.521 0.738 0.460705
## as.factor(hours_worked)80 814.039 186.426 4.367 1.29e-05 ***
## as.factor(hours_worked)81 739.945 601.204 1.231 0.218462
## as.factor(hours_worked)84 695.665 257.907 2.697 0.007011 **
## as.factor(hours_worked)86 393.185 601.204 0.654 0.513143
## as.factor(hours_worked)87 -56.815 601.204 -0.095 0.924713
## as.factor(hours_worked)90 6.935 309.854 0.022 0.982145
## as.factor(hours_worked)96 1893.795 601.204 3.150 0.001642 **
## as.factor(hours_worked)98 1893.185 601.204 3.149 0.001647 **
## as.factor(hours_worked)99 767.925 354.263 2.168 0.030227 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 594.9 on 5462 degrees of freedom
## Multiple R-squared: 0.2157, Adjusted R-squared: 0.2044
## F-statistic: 19.02 on 79 and 5462 DF, p-value: < 2.2e-16
The scatterplot shows a positive relationship between hours worked and weekly earnings. It also shows some heteroscedasticity: the spread of weekly earnings is not constant across hours worked.
Part E: Treating hours as a factor increases the adjusted R^2 to about 0.20, with a p-value below 0.05, so it explains slightly more of the variance in weekly earnings than the naive model. The results for hours as a factor are shown in Part D.
Part 3
Build a simple regression model estimating weekly earnings as a function of age. Do not make any transformations at this point—only generate a naïve model and answer the following questions:
- Write the generalized form of the regression using beta notation
- Write the estimated form of the regression using your results
- Provide at least 3 explanations for why this is considered a naïve model.
- Test the linearity assumption of this model. Provide output of the tests you run to assess linearity and comment on the results.
- Identify at least 3 other possible concerns regarding this model beyond those inherent in the naïve design. For each possible concern, comment on how you would assess and address the concern.
Part A: Generalized form of the regression:
Weekly_Earnings = β_0 + β_1(Age) + ε
Part B: Weekly_Earnings = 548.95 + 9.19(Age)
##
## Call:
## lm(formula = weekly_earnings ~ age, data = telework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1245.3 -445.3 -178.1 284.7 2133.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 548.9457 28.2350 19.44 <2e-16 ***
## age 9.1941 0.6306 14.58 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 654.6 on 5540 degrees of freedom
## Multiple R-squared: 0.03696, Adjusted R-squared: 0.03678
## F-statistic: 212.6 on 1 and 5540 DF, p-value: < 2.2e-16
Part C: This is a naive model because age is the only predictor and the R^2 is just 0.037 (with a p-value less than 0.05), which means that only about 3.7% of the variance in weekly earnings is explained by age. Several other factors could help explain weekly earnings, such as occupation and state of residence. The model also assumes a linear relationship, which may not be the case; age can be tested as a factor to see whether that fits the data better. Finally, a linear model is sensitive to outliers.
Part D: To test linearity, a plot was generated. The plot shows an upward trend. Age alone explains little of the variance in weekly earnings: the naive linear model has an R^2 of about 0.037, and even with age treated as a factor the figure is only about 8%. A chi-squared test was used to compare the linear model with the model that treats age as a factor. The p-value (< 2.2e-16) indicates that age as a factor fits the data significantly better, which is evidence against a strictly linear relationship.
## Analysis of Variance Table
##
## Model 1: weekly_earnings ~ age
## Model 2: weekly_earnings ~ (as.factor(age))
## Res.Df RSS Df Sum of Sq Pr(>Chi)
## 1 5540 2373770588
## 2 5475 2242958893 65 130811695 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
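The comparison in the table above is a test between two nested models, which can be sketched as follows. The example uses simulated data with a curved age effect built in; the data-generating formula and all names are assumptions for illustration only.

```r
# Simulated data: earnings rise and then fall with age (curved by construction).
set.seed(7)
n <- 2000
age <- sample(16:80, n, replace = TRUE)
earnings <- 200 + 45 * age - 0.45 * age^2 + rnorm(n, sd = 400)

# Model 1 forces a straight line; model 2 gives every age its own mean.
fit_linear <- lm(earnings ~ age)
fit_factor <- lm(earnings ~ as.factor(age))

# Model 1 is nested in model 2, so anova() can test whether the extra
# flexibility reduces the residual sum of squares by more than chance.
comparison <- anova(fit_linear, fit_factor, test = "Chisq")
comparison
```

A small p-value in the second row says the straight line leaves out structure that the factor model picks up, which is how the real output above is read.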
Part E: One concern is that age explains only about 8% of the variance in weekly earnings, so important variables are likely omitted. More independent variables should be taken into account, such as type of occupation, geography, and schooling. I would use multiple regression to see whether these variables explain more of the variance, and a chi-squared test to compare the fit of the models. Another concern is the large spread in the data, which a simple linear regression does not fully capture.
Part 4
Modify your model from Q3 by adding at least 3 other IVs to the regression and transforming the age variable (and others) as necessary to meet the linearity assumption. Interpret your results and answer the following questions:
- Write the generalized form of the regression using beta notation
Weekly_Earnings = β_0 + β_1(Age) + β_2(education) + β_3(hours_worked) + β_4(occupation_group) + ε
- Write the estimated form of the regression using your results
Weekly_Earnings = -3968.9694 + 7.55(Age) + 93.9772(education) + 21.1732(hours_worked) - 23.2713(occupation_group)
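- Do you suspect any of your independent variables are collinear? Explain why or why not.
I do not suspect strong collinearity. Collinearity concerns correlation among the independent variables rather than how similar their coefficients are, so the evidence comes from how the estimates behave when the variables are combined. The coefficients for age and hours worked change only modestly when the other variables are added (age from 9.19 to 7.55, hours worked from 22.59 to 21.17), and their standard errors do not grow, which would not be expected if these predictors were strongly correlated with the others.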
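A direct check for collinearity looks at the predictors themselves. The sketch below computes pairwise correlations and variance inflation factors (VIFs) by hand in base R. It uses simulated predictors with a mild age and education relationship built in; the data and names are assumptions for illustration, and `occupation_group` is left out because it is a category code.

```r
# Simulated predictors (illustrative assumptions, not the telework data).
set.seed(3)
n <- 1000
age <- round(runif(n, 18, 70))
education <- round(36 + 0.05 * age + rnorm(n, sd = 3))  # mildly related to age
hours_worked <- round(pmin(pmax(rnorm(n, mean = 40, sd = 10), 1), 99))
X <- data.frame(age, education, hours_worked)

# Pairwise correlations among the predictors.
round(cor(X), 2)

# VIF for each predictor: regress it on the others, then 1 / (1 - R^2).
vif <- sapply(names(X), function(v) {
  f <- reformulate(setdiff(names(X), v), response = v)
  1 / (1 - summary(lm(f, data = X))$r.squared)
})
round(vif, 2)
```

VIF values near 1 indicate little collinearity; values above roughly 5 to 10 are a common rule-of-thumb warning sign.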
- Judging by your output, what ranges of values (for your IVs and the DV, weekly earnings) are you most comfortable using this model to estimate future observations?
All variables are statistically significant, as is the model as a whole. I am comfortable using the model for values within the observed minimum and maximum of each independent variable.
| IV | Min | Max |
|---|---|---|
| Age | 15 | 85 |
| Education | 31 | 46 |
| Hours Worked | 1 | 99 |
| Occupation Group | 15 | 85 |
For the dependent variable, I am most comfortable with values between the minimum and maximum obtained by plugging the minimum and maximum values of the independent variables into the model. Zero is used as the lower bound for any such estimate, since negative earnings do not make sense.
| DV | Min | Max |
|---|---|---|
| Weekly Earnings | 7.5 | 2884.61 |
- Generate a hypothetical observation that exists within the ranges specified in part d and estimate the weekly earnings for that individual. In addition to this estimate, calculate the estimated range of weekly earnings this individual may have and comment on the results.
Hypothetical observation:
| IV | Hypothetical Observation |
|---|---|
| Age | 37 |
| Education | 44 (Master’s Degree) |
| Hours Worked | 45 hours |
| Occupation Group | 1 (i.e. business/management) |
Weekly Earnings = 1374.9001
The range of weekly earnings was calculated using the confidence intervals of the coefficients.
Min = 1020.27
Max = 1729.53
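The point estimate can be reproduced directly from the reported coefficients, and a rough prediction interval can be attached to it. The sketch below uses the coefficients, residual standard error, and residual degrees of freedom from the regression summary for this model. The interval formula (point estimate plus or minus t times the residual standard error) is a simplifying assumption that ignores uncertainty in the coefficients, so it is slightly too narrow. With the fitted model object available, `predict()` with `interval = "prediction"` would give the exact interval.

```r
# Coefficients and residual standard error copied from the regression summary.
coefs <- c(intercept = -3968.9694, age = 7.5523, education = 93.9772,
           hours_worked = 21.1732, occupation_group = -23.2713)
resid_se <- 546.5
resid_df <- 5537

# Hypothetical observation from the table above (the leading 1 multiplies the intercept).
obs <- c(intercept = 1, age = 37, education = 44,
         hours_worked = 45, occupation_group = 1)

# Point estimate: sum of coefficient times value.
point <- sum(coefs * obs)

# Rough 95% prediction interval (assumption: coefficient uncertainty ignored).
half_width <- qt(0.975, df = resid_df) * resid_se
c(estimate = point, lower = point - half_width, upper = point + half_width)
```

With the unrounded age coefficient the point estimate is about 1374.99; the 1374.9001 reported above follows from rounding that coefficient to 7.55. The rough prediction interval is about 1071 on either side of the estimate, considerably wider than the range of 1020.27 to 1729.53. This reflects the difference between uncertainty about an average and uncertainty about one individual's earnings.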
The summary below shows the results of the multiple regression used in the equations above.
##
## Call:
## lm(formula = weekly_earnings ~ age + education + hours_worked +
## occupation_group, data = telework)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1809.3 -350.2 -109.0 231.0 3004.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3968.9694 148.8232 -26.669 < 2e-16 ***
## age 7.5523 0.5293 14.269 < 2e-16 ***
## education 93.9772 3.5166 26.724 < 2e-16 ***
## hours_worked 21.1732 0.6335 33.421 < 2e-16 ***
## occupation_group -23.2713 2.9854 -7.795 7.65e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 546.5 on 5537 degrees of freedom
## Multiple R-squared: 0.3292, Adjusted R-squared: 0.3287
## F-statistic: 679.3 on 4 and 5537 DF, p-value: < 2.2e-16