GLM Assignment 1

Part 1

The first part of this assignment will answer the following questions.

A. Does teleworking appear to have a significant effect on income? How do you know, and explain the effect (if it exists.)

B. Provide a visualization that helps a reader understand the model

Part A: As seen in the results below, the ANOVA F-test for telecommute has a p-value below 0.05 (p < 2e-16), which indicates that mean weekly earnings differ significantly between the two telecommute groups. The Tukey test was performed as a post hoc test. Its results agree: the 95% confidence interval for the difference in means (-387.70 to -313.82) does not contain zero, so the difference between the pair of groups is statistically significant. On average, group 2 earns about 351 less per week than group 1.

##                          Df    Sum Sq   Mean Sq F value Pr(>F)    
## as.factor(telecommute)    1 1.451e+08 145093488   346.5 <2e-16 ***
## Residuals              5540 2.320e+09    418730                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = weekly_earnings ~ as.factor(telecommute), data = telework)
## 
## $`as.factor(telecommute)`
##          diff       lwr       upr p adj
## 2-1 -350.7614 -387.7015 -313.8213     0

The boxplot below displays the two populations visually. Average weekly earnings are higher for people who telecommute than for people who do not, and weekly earnings also vary more among telecommuters. Overall, the difference between the two populations is statistically significant.
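The workflow described above (one-way ANOVA, Tukey post hoc test, boxplot) can be sketched as follows. This is a minimal sketch on synthetic data: the data frame `sim`, its group means, and its standard deviation are assumptions made up for illustration, not the assignment's `telework` data, so the printed numbers will not match the output above.

```r
# Synthetic stand-in for the telework data (hypothetical values, not the assignment data)
set.seed(1)
n <- 200
sim <- data.frame(
  telecommute = rep(c(1, 2), each = n),
  weekly_earnings = c(rnorm(n, mean = 1200, sd = 600),
                      rnorm(n, mean = 850, sd = 600))
)

# One-way ANOVA: does mean weekly earnings differ by telecommute group?
fit <- aov(weekly_earnings ~ as.factor(telecommute), data = sim)
summary(fit)

# Tukey HSD: estimated difference (group 2 minus group 1) with a 95% interval
tukey <- TukeyHSD(fit)
tukey

# Side-by-side boxplot of the two groups
boxplot(weekly_earnings ~ as.factor(telecommute), data = sim,
        xlab = "telecommute", ylab = "weekly_earnings")
```

With only two groups, the Tukey interval conveys the same conclusion as the F-test; it adds the size and direction of the difference.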

Part 2

Build a simple regression model estimating weekly earnings by hours worked. Do not make any transformations at this point—only generate a naïve model and answer the following questions:

A. Write the generalized form of the regression using beta notation
B. Write the estimated form of the regression using your results
C. Provide at least 3 explanations for why we would consider this a naïve model.
D. In your opinion, what do you think should be done to better model these two variables or do you think it does not make sense to model one as a function of the other under any circumstances? Provide a visualization to support your position.
E. If your recommendation from part D above is possible with the data available, do your best to execute your proposed adjustments. If your recommendation from part D is not possible using this dataset, propose another solution using additional variables if possible, a different data set or another strategy.

Part A: Generalized form of the regression:

Weekly_Earnings = β_0 + β_1 (Hours_worked) + ε

Part B: The summary below shows that the estimated regression equation is:

Weekly_Earnings = 66.04 + 22.59(Hours_Worked)

## 
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1758.0  -411.2  -185.1   230.4  2796.0 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   66.0433    28.5766   2.311   0.0209 *  
## hours_worked  22.5887     0.7072  31.943   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 613 on 5540 degrees of freedom
## Multiple R-squared:  0.1555, Adjusted R-squared:  0.1554 
## F-statistic:  1020 on 1 and 5540 DF,  p-value: < 2.2e-16
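To make the fitted line concrete, the sketch below plugs a few hours values into the estimated equation. The coefficients are copied from the summary above. The hours values 20, 40 and 60 are arbitrary choices for illustration.

```r
# Coefficients copied from the summary output above
b0 <- 66.0433   # intercept
b1 <- 22.5887   # slope for hours_worked

# Fitted weekly earnings at a few illustrative hours values
hours <- c(20, 40, 60)
fitted_earnings <- b0 + b1 * hours
data.frame(hours, fitted_earnings)
```

Each additional hour worked adds about 22.59 to the fitted weekly earnings, so the fitted values rise by the same amount for every 20-hour step.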

Part C: This is a naive model because it assumes that hours worked is the only factor that explains weekly earnings. The R^2 is 0.1555 (adjusted 0.1554) with a p-value less than 0.05, which means that hours worked explains only about 15.5% of the variance in weekly earnings. Several other factors could help explain weekly earnings, such as occupation, state of residence, and many more. The model also assumes a linear relationship between X and Y, which may not be the case. Linear regression models only the mean of the dependent variable. Outliers could have a large effect on the fit because linear regression is sensitive to them. Heteroscedasticity, meaning unequal scatter of the residuals across the range of hours worked, may also be present.
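The heteroscedasticity and outlier concerns can be checked from the residuals. The following is a rough sketch on synthetic data. The simulated variables and the noise pattern (scatter that grows with hours) are assumptions chosen to make the effect visible, and the cutoff of 3 for standardized residuals is a common rule of thumb rather than a fixed standard.

```r
# Synthetic data whose noise grows with hours (hypothetical, for illustration only)
set.seed(2)
hours_worked <- sample(1:80, 1000, replace = TRUE)
weekly_earnings <- 66 + 22.6 * hours_worked + rnorm(1000, sd = 5 * hours_worked)
fit <- lm(weekly_earnings ~ hours_worked)

# Compare residual spread for low versus high fitted values;
# a large gap points to heteroscedasticity
res <- resid(fit)
high <- fitted(fit) > median(fitted(fit))
spread <- c(low = sd(res[!high]), high = sd(res[high]))
spread

# Count observations with unusually large standardized residuals (possible outliers)
n_large <- sum(abs(rstandard(fit)) > 3)
n_large
```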

Part D: To better model these two variables, hours worked could be treated as a factor instead of a continuous variable. This raises the R^2 to about 0.20 (multiple R^2 0.2157, adjusted 0.2044) with a p-value below 0.05, so hours as a factor explains roughly 20% of the variance in weekly earnings.

## 
## Call:
## lm(formula = weekly_earnings ~ as.factor(hours_worked), data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1632.5  -386.5  -153.8   213.5  2531.7 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                990.815     86.776  11.418  < 2e-16 ***
## as.factor(hours_worked)2  -442.103    279.845  -1.580 0.114208    
## as.factor(hours_worked)3  -896.131    279.845  -3.202 0.001371 ** 
## as.factor(hours_worked)4  -514.070    199.260  -2.580 0.009909 ** 
## as.factor(hours_worked)5  -621.672    192.414  -3.231 0.001241 ** 
## as.factor(hours_worked)6  -612.339    192.414  -3.182 0.001469 ** 
## as.factor(hours_worked)7  -610.509    216.458  -2.820 0.004813 ** 
## as.factor(hours_worked)8  -419.548    136.345  -3.077 0.002101 ** 
## as.factor(hours_worked)9  -671.065    309.854  -2.166 0.030373 *  
## as.factor(hours_worked)10 -625.625    133.938  -4.671 3.07e-06 ***
## as.factor(hours_worked)11 -739.127    257.907  -2.866 0.004175 ** 
## as.factor(hours_worked)12 -763.864    142.021  -5.379 7.82e-08 ***
## as.factor(hours_worked)13 -245.565    309.854  -0.793 0.428092    
## as.factor(hours_worked)14 -748.198    309.854  -2.415 0.015782 *  
## as.factor(hours_worked)15 -765.548    131.762  -5.810 6.60e-09 ***
## as.factor(hours_worked)16 -515.392    118.174  -4.361 1.32e-05 ***
## as.factor(hours_worked)17 -752.965    309.854  -2.430 0.015128 *  
## as.factor(hours_worked)18 -677.350    164.900  -4.108 4.06e-05 ***
## as.factor(hours_worked)19 -612.935    429.521  -1.427 0.153631    
## as.factor(hours_worked)20 -637.923     98.428  -6.481 9.91e-11 ***
## as.factor(hours_worked)21 -665.224    257.907  -2.579 0.009926 ** 
## as.factor(hours_worked)22 -397.373    161.732  -2.457 0.014042 *  
## as.factor(hours_worked)23 -612.882    216.458  -2.831 0.004651 ** 
## as.factor(hours_worked)24 -325.775    104.720  -3.111 0.001875 ** 
## as.factor(hours_worked)25 -586.618    106.279  -5.520 3.55e-08 ***
## as.factor(hours_worked)26 -583.788    186.426  -3.131 0.001748 ** 
## as.factor(hours_worked)27 -573.860    156.152  -3.675 0.000240 ***
## as.factor(hours_worked)28 -563.276    140.478  -4.010 6.16e-05 ***
## as.factor(hours_worked)29 -348.008    192.414  -1.809 0.070562 .  
## as.factor(hours_worked)30 -387.214     96.968  -3.993 6.60e-05 ***
## as.factor(hours_worked)31 -429.634    216.458  -1.985 0.047213 *  
## as.factor(hours_worked)32 -126.997     98.231  -1.293 0.196122    
## as.factor(hours_worked)33 -103.723    199.260  -0.521 0.602708    
## as.factor(hours_worked)34  -71.693    158.827  -0.451 0.651727    
## as.factor(hours_worked)35 -326.612     97.506  -3.350 0.000815 ***
## as.factor(hours_worked)36 -161.189    107.695  -1.497 0.134524    
## as.factor(hours_worked)37 -204.910    124.076  -1.651 0.098697 .  
## as.factor(hours_worked)38 -179.297    108.367  -1.655 0.098076 .  
## as.factor(hours_worked)39 -218.162    151.386  -1.441 0.149616    
## as.factor(hours_worked)40  -44.338     87.550  -0.506 0.612577    
## as.factor(hours_worked)41  292.709    176.422   1.659 0.097144 .  
## as.factor(hours_worked)42   -4.121    117.214  -0.035 0.971953    
## as.factor(hours_worked)43   -6.798    133.938  -0.051 0.959525    
## as.factor(hours_worked)44  102.695    127.977   0.802 0.422325    
## as.factor(hours_worked)45  263.183     94.735   2.778 0.005486 ** 
## as.factor(hours_worked)46 -135.655    147.264  -0.921 0.357006    
## as.factor(hours_worked)47  132.231    192.414   0.687 0.491972    
## as.factor(hours_worked)48  138.669    117.686   1.178 0.238731    
## as.factor(hours_worked)49   81.115    207.176   0.392 0.695424    
## as.factor(hours_worked)50  496.916     92.357   5.380 7.74e-08 ***
## as.factor(hours_worked)51  606.676    257.907   2.352 0.018693 *  
## as.factor(hours_worked)52  -94.109    145.404  -0.647 0.517515    
## as.factor(hours_worked)53  398.068    192.414   2.069 0.038611 *  
## as.factor(hours_worked)54  138.817    309.854   0.448 0.654165    
## as.factor(hours_worked)55  517.677    107.063   4.835 1.37e-06 ***
## as.factor(hours_worked)56  226.877    156.152   1.453 0.146301    
## as.factor(hours_worked)57 -125.435    429.521  -0.292 0.770271    
## as.factor(hours_worked)58   91.446    186.426   0.491 0.623783    
## as.factor(hours_worked)59 -119.815    429.521  -0.279 0.780293    
## as.factor(hours_worked)60  501.582     98.167   5.109 3.34e-07 ***
## as.factor(hours_worked)62  305.951    257.907   1.186 0.235562    
## as.factor(hours_worked)63 1044.503    279.845   3.732 0.000192 ***
## as.factor(hours_worked)64   -1.735    241.018  -0.007 0.994256    
## as.factor(hours_worked)65  597.004    149.254   4.000 6.42e-05 ***
## as.factor(hours_worked)66  -40.308    309.854  -0.130 0.896503    
## as.factor(hours_worked)68 -244.982    354.263  -0.692 0.489265    
## as.factor(hours_worked)70  691.676    149.254   4.634 3.67e-06 ***
## as.factor(hours_worked)72 -298.515    601.204  -0.497 0.619541    
## as.factor(hours_worked)74  788.025    429.521   1.835 0.066611 .  
## as.factor(hours_worked)75  379.937    279.845   1.358 0.174626    
## as.factor(hours_worked)76  316.875    429.521   0.738 0.460705    
## as.factor(hours_worked)80  814.039    186.426   4.367 1.29e-05 ***
## as.factor(hours_worked)81  739.945    601.204   1.231 0.218462    
## as.factor(hours_worked)84  695.665    257.907   2.697 0.007011 ** 
## as.factor(hours_worked)86  393.185    601.204   0.654 0.513143    
## as.factor(hours_worked)87  -56.815    601.204  -0.095 0.924713    
## as.factor(hours_worked)90    6.935    309.854   0.022 0.982145    
## as.factor(hours_worked)96 1893.795    601.204   3.150 0.001642 ** 
## as.factor(hours_worked)98 1893.185    601.204   3.149 0.001647 ** 
## as.factor(hours_worked)99  767.925    354.263   2.168 0.030227 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 594.9 on 5462 degrees of freedom
## Multiple R-squared:  0.2157, Adjusted R-squared:  0.2044 
## F-statistic: 19.02 on 79 and 5462 DF,  p-value: < 2.2e-16

The scatterplot shows a positive relationship between hours worked and weekly earnings. It also shows some heteroscedasticity in the data.

Part E: Treating hours as a factor increases the R^2 to about 0.20, and the p-value remains below 0.05. Hours as a factor therefore explains slightly more of the variance in weekly earnings than hours as a continuous variable. The results for hours as a factor are shown in Part D.
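Because the factor model uses 79 coefficients for hours where the continuous model uses one, adjusted R^2 is the fairer basis for comparison. The sketch below recomputes it from the figures printed in the two summaries. The helper name `adj_r2` is made up for illustration, and small differences from the printed values are expected because the inputs are already rounded.

```r
# Values read from the two summaries above
n <- 5542  # observations: residual df plus estimated coefficients (5540 + 2)

adj_r2 <- function(r2, p) {
  # p = number of slope coefficients (excluding the intercept)
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

adj_linear <- adj_r2(0.1555, 1)    # hours as a continuous predictor
adj_factor <- adj_r2(0.2157, 79)   # hours as a factor
c(linear = adj_linear, factor = adj_factor)
```

The penalty for the extra coefficients is small at this sample size, so the factor model keeps most of its advantage after adjustment.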

Part 3

Build a simple regression model estimating weekly earnings as a function of age. Do not make any transformations at this point—only generate a naïve model and answer the following questions:

A. Write the generalized form of the regression using beta notation
B. Write the estimated form of the regression using your results
C. Provide at least 3 explanations for why this is considered a naïve model.
D. Test the linearity assumption of this model. Provide output of the tests you run to assess linearity and comment on the results.
E. Identify at least 3 other possible concerns regarding this model beyond those inherent in the naïve design. For each possible concern, comment on how you would assess and address the concern.

Part A: Generalized form of the regression:

Weekly_Earnings = β_0 + β_1 (Age) + ε

Part B: Weekly_Earnings = 548.95 + 9.19(Age)

## 
## Call:
## lm(formula = weekly_earnings ~ age, data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1245.3  -445.3  -178.1   284.7  2133.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 548.9457    28.2350   19.44   <2e-16 ***
## age           9.1941     0.6306   14.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 654.6 on 5540 degrees of freedom
## Multiple R-squared:  0.03696,    Adjusted R-squared:  0.03678 
## F-statistic: 212.6 on 1 and 5540 DF,  p-value: < 2.2e-16

Part C: This is a naive model because age is the only predictor and the R^2 is just 0.037, even though the p-value is less than 0.05. Age explains only 3.7% of the variance in weekly earnings. Several other factors could help explain weekly earnings, such as occupation, state of residence, and many more. Age could also be tested as a factor to see whether it fits the data better. The model also assumes a linear relationship, which may not be the case, and a linear model is sensitive to outliers.

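Part D: To test linearity, a plot is generated. The plot shows an upward trend. The linear model was then compared with a model that treats age as a factor, using a chi-squared test. The test is significant (p < 2.2e-16), so age as a factor fits the data better than a straight line, which is evidence against the linearity assumption. Even so, the R^2 and p-value suggest that only ~8% of the variance in weekly earnings can be explained by age alone.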

## Analysis of Variance Table
## 
## Model 1: weekly_earnings ~ age
## Model 2: weekly_earnings ~ (as.factor(age))
##   Res.Df        RSS Df Sum of Sq  Pr(>Chi)    
## 1   5540 2373770588                           
## 2   5475 2242958893 65 130811695 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
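The comparison above can be reproduced in outline as follows. This is a sketch on synthetic data: the curved age profile, the noise level, and the data frame `sim` are assumptions for illustration only. The straight-line model is nested inside the age-as-factor model, so `anova()` acts as a lack-of-fit test.

```r
# Synthetic data with a curved age-earnings profile (hypothetical, for illustration)
set.seed(3)
age <- sample(20:65, 2000, replace = TRUE)
weekly_earnings <- 200 + 60 * age - 0.65 * age^2 + rnorm(2000, sd = 300)
sim <- data.frame(age, weekly_earnings)

m_linear <- lm(weekly_earnings ~ age, data = sim)
m_factor <- lm(weekly_earnings ~ as.factor(age), data = sim)

# A small p-value says the straight line misses structure
# that the separate age categories capture
cmp <- anova(m_linear, m_factor, test = "Chisq")
cmp
```

A significant result only says that a straight line is inadequate. It does not say which transformation of age would fix it.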

Part E: One concern is that age explains only about 8% of the variance in weekly earnings. More independent variables should be taken into account, such as type of occupation, geography, and schooling. I would use multiple regression to assess whether these variables explain more of the variance, and a chi-squared test to compare the fit of the models. There is also a lot of spread in the data, and a linear regression does not capture the entire relationship.

Part 4

Modify your model from Q3 by adding at least 3 other IVs to the regression and transforming the age variable (and others) as necessary to meet the linearity assumption. Interpret your results and answer the following questions:

Write the generalized form of the regression using beta notation

Weekly_Earnings = β_0 + β_1 (Age) + β_2 (education) + β_3 (hours_worked) + β_4 (occupation_group) + ε

Write the estimated form of the regression using your results

Weekly_Earnings = -3968.9694 + 7.55(Age) + 93.9772(education) + 21.1732(hours_worked) - 23.2713(occupation_group)

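Do you suspect any of your independent variables are collinear? Explain why or why not.

I do not suspect severe collinearity. Collinearity concerns correlation among the independent variables themselves, so it cannot be judged by comparing the sizes of the coefficients. The evidence here is indirect: adding the other variables changed the age coefficient only modestly (9.19 to 7.55) and the hours coefficient only slightly (22.59 to 21.17), and the standard errors remain small. A correlation matrix or variance inflation factors would confirm this.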
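One direct way to check for collinearity is to look at the correlations among the predictors and at variance inflation factors (VIF). The sketch below computes both in base R on synthetic predictors. The data frame `sim`, its distributions, and the 1 to 10 coding of `occupation_group` are assumptions for illustration, not the assignment data.

```r
# Synthetic predictors (hypothetical values, not the assignment data)
set.seed(4)
n <- 1000
sim <- data.frame(
  age = runif(n, 18, 70),
  education = rnorm(n, 40, 3),
  hours_worked = rnorm(n, 40, 10),
  occupation_group = sample(1:10, n, replace = TRUE)
)

# Pairwise correlations among the predictors
round(cor(sim), 2)

# VIF for each predictor: regress it on the others, then take 1 / (1 - R^2)
vif <- sapply(names(sim), function(v) {
  others <- setdiff(names(sim), v)
  r2 <- summary(lm(reformulate(others, response = v), data = sim))$r.squared
  1 / (1 - r2)
})
vif
```

Values near 1 indicate that a predictor is almost unrelated to the others. A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity.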
Judging by your output, what ranges of values (for your IVs and the DV, weekly earnings) are you most comfortable using this model to estimate future observations?

All variables are statistically significant, as is the model as a whole. I am comfortable using values within the observed minimum and maximum of each independent variable.

IV	Min	Max
Age	15	85
Education	31	46
Hours Worked	1	99
Occupation Group	15	85

For the dependent variable, I am most comfortable with values between the minimum and maximum obtained by evaluating the model at the min/max of the independent variables. Zero is used as the lower bound because negative earnings do not make sense.

DV	Min	Max
Weekly Earnings	7.5	2884.61

Generate a hypothetical observation that exists within the ranges specified in part d and estimate the weekly earnings for that individual. In addition to this estimate, calculate the estimated range of weekly earnings this individual may have and comment on the results.

Hypothetical observation:

IV	Hypothetical Observation
Age	37
Education	44 (Master’s Degree)
Hours Worked	45 hours
Occupation Group	1 (i.e. business/management)

Weekly Earnings = 1374.9001
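The estimate is the estimated equation evaluated at the hypothetical observation. A short sketch of that arithmetic follows, using the coefficients as written in the equation (with the age coefficient rounded to 7.55). The vector names are illustrative.

```r
# Coefficients as written in the estimated equation
coefs <- c(intercept = -3968.9694, age = 7.55, education = 93.9772,
           hours_worked = 21.1732, occupation_group = -23.2713)

# Hypothetical observation, with a leading 1 for the intercept
obs <- c(1, age = 37, education = 44, hours_worked = 45, occupation_group = 1)

estimate <- sum(coefs * obs)
estimate
```

Using the unrounded age coefficient from the summary (7.5523) would shift the result by less than 0.1.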

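The range of the weekly earnings is calculated from the uncertainty in the coefficients: each coefficient is shifted down or up by one standard error and the equation is re-evaluated for the hypothetical observation. The resulting band is about 355 on either side of the estimate, which is narrower than the residual standard error of 546.5, so it should be read as a rough range rather than a prediction interval.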

Min = 1020.27

Max = 1729.53
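The reported bounds are consistent with shifting every coefficient by one standard error in the same direction. That reading is inferred from the numbers rather than stated in the text. The sketch below reproduces the arithmetic using the estimates from the equation and the standard errors from the summary below. The vector names are illustrative.

```r
# Estimates (as used in the equation) and standard errors (from the summary)
est <- c(-3968.9694, 7.55, 93.9772, 21.1732, -23.2713)
se  <- c(148.8232, 0.5293, 3.5166, 0.6335, 2.9854)

# Hypothetical observation: intercept, age, education, hours_worked, occupation_group
obs <- c(1, 37, 44, 45, 1)

center <- sum(est * obs)
half_width <- sum(se * obs)   # every coefficient moved by one SE at once
rng <- c(min = center - half_width, max = center + half_width)
round(rng, 2)
```

A conventional alternative is `predict()` on the fitted `lm` object with `interval = "prediction"`, which also accounts for the residual variance and for the covariance between coefficients.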

The summary below shows the results of the multiple regression used in the equations above.

## 
## Call:
## lm(formula = weekly_earnings ~ age + education + hours_worked + 
##     occupation_group, data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1809.3  -350.2  -109.0   231.0  3004.0 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -3968.9694   148.8232 -26.669  < 2e-16 ***
## age                  7.5523     0.5293  14.269  < 2e-16 ***
## education           93.9772     3.5166  26.724  < 2e-16 ***
## hours_worked        21.1732     0.6335  33.421  < 2e-16 ***
## occupation_group   -23.2713     2.9854  -7.795 7.65e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 546.5 on 5537 degrees of freedom
## Multiple R-squared:  0.3292, Adjusted R-squared:  0.3287 
## F-statistic: 679.3 on 4 and 5537 DF,  p-value: < 2.2e-16