GLM 1

Question 1

  1. Using a simple one-way ANOVA, answer the following questions:
  1. Does Teleworking appear to have a significant effect on income? How do you know, and explain the effect (if it exists.)
  2. Provide a visualization that helps a reader understand the model
##                          Df    Sum Sq   Mean Sq F value Pr(>F)    
## as.factor(telecommute)    1 1.451e+08 145093488   346.5 <2e-16 ***
## Residuals              5540 2.320e+09    418730                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = weekly_earnings ~ as.factor(telecommute), data = telework)
## 
## $`as.factor(telecommute)`
##          diff       lwr       upr p adj
## 2-1 -350.7614 -387.7015 -313.8213     0

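Teleworking appears to have a significant effect on income: the one-way ANOVA gives F = 346.5 on 1 and 5540 degrees of freedom with p < 2e-16, well below 0.05. The TukeyHSD comparison at the 95% family-wise confidence level agrees, with an adjusted p-value reported as 0 and a confidence interval for the difference that excludes zero.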

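From the boxplot and the Tukey output, people who telework make on average about $351 more a week than those who do not (difference of 350.76, 95% confidence interval 313.82 to 387.70), reading telecommute group 1 as the teleworkers.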
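As a quick arithmetic check on the two tables above, the sketch below recomputes the F value from the mean squares and recovers the group difference as the midpoint of the Tukey interval. It uses Python purely as a calculator; every number is copied from the R output, and the variable names are only illustrative.

```python
# Values copied from the ANOVA and Tukey output above.
mean_sq_group = 145093488   # Mean Sq for as.factor(telecommute)
mean_sq_resid = 418730      # Mean Sq for Residuals

# The F statistic is the ratio of the two mean squares.
f_value = mean_sq_group / mean_sq_resid

# The Tukey interval is symmetric, so its midpoint is the estimated difference.
lwr, upr = -387.7015, -313.8213
diff = (lwr + upr) / 2

print(f"F = {f_value:.1f}")
print(f"difference (group 2 - group 1) = {diff:.2f}")
```

The size of the effect is the difference in group means (about 350.76), not the F value (346.5); the two are easy to confuse because they happen to be close here.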

Question 2

  1. Build a simple regression model estimating weekly earnings by hours worked. Do not make any transformations at this point—only generate a naïve model and answer the following questions:
## 
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1758.0  -411.2  -185.1   230.4  2796.0 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   66.0433    28.5766   2.311   0.0209 *  
## hours_worked  22.5887     0.7072  31.943   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 613 on 5540 degrees of freedom
## Multiple R-squared:  0.1555, Adjusted R-squared:  0.1554 
## F-statistic:  1020 on 1 and 5540 DF,  p-value: < 2.2e-16
  1. Write the generalized form of the regression using beta notation

weekly_earnings = B0 + B1(hours_worked) + e, where e is the error term

  1. Write the estimated form of the regression using your results

weekly_earnings = 66.04 + 22.59(hours_worked)

For each additional hour worked, estimated weekly earnings increase by $22.59.
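To make the estimated equation concrete, here is a minimal sketch that plugs a few hours values into it. The coefficients are copied from the lm() summary above; the function name and the hours values (0, 20 and 40) are illustrative assumptions, not values taken from the data.

```python
# Coefficients copied from the lm() summary above.
intercept = 66.0433
slope = 22.5887


def predicted_earnings(hours_worked):
    """Fitted weekly earnings (dollars) for a given number of hours."""
    return intercept + slope * hours_worked


# 0, 20 and 40 hours are illustrative inputs, not observations.
for hours in (0, 20, 40):
    print(hours, round(predicted_earnings(hours), 2))
```

Note that the line predicts about $66 a week for someone who works zero hours, which is one practical sign that a straight line through these data should be treated with caution.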

  1. Provide at least 3 explanations for why we would consider this a naïve model.

This is a naive model for three reasons. First, it assumes that one independent variable (hours_worked) is the only factor influencing weekly earnings. Second, a straight line does not provide the best fit for this relationship. Third, the summary shows a weak fit: the R-squared value is low (0.156) and the residual standard error is high (about $613 a week).

  1. In your opinion, what do you think should be done to better model these two variables or do you think it does not make sense to model one as a function of the other under any circumstances? Provide a visualization to support your position.

## 
## Call:
## lm(formula = log(weekly_earnings) ~ hours_worked, data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8090 -0.4577 -0.0287  0.4182  2.6466 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.286861   0.030671  172.37   <2e-16 ***
## hours_worked 0.033644   0.000759   44.33   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6579 on 5540 degrees of freedom
## Multiple R-squared:  0.2618, Adjusted R-squared:  0.2617 
## F-statistic:  1965 on 1 and 5540 DF,  p-value: < 2.2e-16

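To better model this, we may have to use a log transformation of weekly earnings due to the shape of the plot. With the log model, R-squared improves from 0.156 to 0.262, although the two values describe different response scales, so the comparison is only a rough guide. The residual standard error (0.6579) is now in log units, so it cannot be compared directly with the 613 from the untransformed model.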
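Because the response in the log model is log(weekly_earnings), its slope is easier to read as a percentage change. The sketch below converts the hours_worked coefficient from the output above; the variable names are illustrative.

```python
import math

# hours_worked coefficient copied from the log model summary above.
slope_log = 0.033644

# exp(b) - 1 is the proportional change in earnings per additional hour.
pct_per_hour = (math.exp(slope_log) - 1) * 100

print(f"{pct_per_hour:.2f}% per additional hour")
```

Under this model each additional hour worked is associated with roughly a 3.4% increase in weekly earnings, rather than a fixed dollar amount.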

Question 3

  1. Build a simple regression model estimating weekly earnings as a function of age. Do not make any transformations at this point—only generate a naïve model and answer the following questions:
## 
## Call:
## lm(formula = weekly_earnings ~ age, data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1245.3  -445.3  -178.1   284.7  2133.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 548.9457    28.2350   19.44   <2e-16 ***
## age           9.1941     0.6306   14.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 654.6 on 5540 degrees of freedom
## Multiple R-squared:  0.03696,    Adjusted R-squared:  0.03678 
## F-statistic: 212.6 on 1 and 5540 DF,  p-value: < 2.2e-16
  1. Write the generalized form of the regression using beta notation

weekly_earnings = B0 + B1(age) + e, where e is the error term

  1. Write the estimated form of the regression using your results

weekly_earnings = 548.95 + 9.19(age)

For each additional year of age, estimated weekly earnings increase by $9.19.

  1. Provide at least 3 explanations for why this is considered a naïve model.

This model is naive because the residual standard error is high (654.6), the R-squared value is small (0.037), and it does not satisfy the assumption of conditional independence.

  1. Test the linearity assumption of this model. Provide output of the tests you run to assess linearity and comment on the results.

From the visualization, we see a lot of outliers, which casts doubt on the linearity assumption of the model.

  1. Identify at least 3 other possible concerns regarding this model beyond those inherent in the naïve design. For each possible concern, comment on how you would assess and address the concern.

A parabolic model would provide a better fit.

Age is in numeric form, so I could change it to a categorical variable.

The outliers could be taken out of the model.

Question 4

  1. Modify your model from Q3 by adding at least 3 other IVs to the regression and transforming the age variable (and others) as necessary to meet the linearity assumption. Interpret your results and answer the following questions:
## 
## Call:
## lm(formula = weekly_earnings ~ as.factor(age1) + as.factor(union_member) + 
##     as.factor(full_or_part_time) + as.factor(sex) + hours_worked, 
##     data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1550.6  -382.0  -146.6   220.2  2708.5 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    556.3111    44.6924  12.448  < 2e-16 ***
## as.factor(age1)Old              76.7933    64.9826   1.182   0.2374    
## as.factor(age1)Senior          100.6472    18.5851   5.415 6.37e-08 ***
## as.factor(age1)Young-Adult    -244.8636    20.0590 -12.207  < 2e-16 ***
## as.factor(union_member)2       -49.4413    26.0311  -1.899   0.0576 .  
## as.factor(full_or_part_time)2 -330.7424    27.3691 -12.085  < 2e-16 ***
## as.factor(sex)2               -149.2354    16.1592  -9.235  < 2e-16 ***
## hours_worked                    14.9190     0.8357  17.851  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 584.8 on 5534 degrees of freedom
## Multiple R-squared:  0.2321, Adjusted R-squared:  0.2312 
## F-statistic:   239 on 7 and 5534 DF,  p-value: < 2.2e-16
  1. Write the generalized form of the regression using beta notation

weekly_earnings = B0 + B1(age1Old) + B2(age1Senior) + B3(age1Young-Adult) + B4(union_member) + B5(full_or_part_time) + B6(sex) + B7(hours_worked) + e

  1. Write the estimated form of the regression using your results

weekly_earnings = 556.31 + 76.79(Old) + 100.65(Senior) - 244.86(Young-Adult) - 49.44(union_member) - 330.74(full_or_part_time) - 149.24(sex) + 14.92(hours_worked)

Holding the other variables constant: being in the Old age group is associated with earnings about $77 higher than the reference age group; being in the Senior group, about $101 higher; being in the Young-Adult group, about $245 lower; no union membership is associated with earnings about $49 lower than union members; part-time work, about $331 lower than full-time work; female workers earn about $149 less than male workers; and each additional hour worked is associated with an increase in earnings of about $15.

  1. Do you suspect any of your independent variables are colinear? Explain why or why not.
##                   union_member full_or_part_time         sex hours_worked
## union_member        1.00000000        0.06813902  0.05401392  -0.02791061
## full_or_part_time   0.06813902        1.00000000  0.15208703  -0.57766801
## sex                 0.05401392        0.15208703  1.00000000  -0.23673674
## hours_worked       -0.02791061       -0.57766801 -0.23673674   1.00000000

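None of the independent variables appear to be colinear: no pair has a correlation of 0.70 or more in absolute value. The strongest correlation is between full_or_part_time and hours_worked (-0.58), which is moderate.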
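The threshold check can be made explicit with a short sketch that scans the correlation matrix above for the strongest pair. The matrix values are copied from the output; the 0.70 cut-off is the rule of thumb used in the answer, and the variable names are illustrative.

```python
# Correlation matrix copied from the output above.
names = ["union_member", "full_or_part_time", "sex", "hours_worked"]
corr = [
    [1.00000000, 0.06813902, 0.05401392, -0.02791061],
    [0.06813902, 1.00000000, 0.15208703, -0.57766801],
    [0.05401392, 0.15208703, 1.00000000, -0.23673674],
    [-0.02791061, -0.57766801, -0.23673674, 1.00000000],
]
threshold = 0.70  # rule-of-thumb cut-off used in the answer

# Collect the absolute correlation for every distinct pair of variables.
pairs = [
    (abs(corr[i][j]), names[i], names[j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
]

strongest = max(pairs)
flagged = [p for p in pairs if p[0] >= threshold]

print("strongest pair:", strongest)
print("pairs at or above threshold:", flagged)
```

The age1 factor is not part of the matrix above, so this check does not cover it.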

  1. Judging by your output, what ranges of values (for your IVs and the DV, weekly earnings) are you most comfortable using this model to estimate future observations?

All independent variables are usable except (age1)Old, because its p-value (0.237) is well above 0.05. The union_member coefficient is borderline (p = 0.058).

  1. Generate a hypothetical observation that exists within the ranges specified in part d and estimate the weekly earnings for that individual. In addition to this estimate, calculate the estimated range of weekly earnings this individual may have and comment on the results. HINT: use the coefficient confidence intervals to estimate the highest and lowest values of this person’s income.

Hypothetical Observation: Full time Young-Adult female worker who is a union member

Her estimated weekly earnings are $712.63.

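Her estimated range of weekly earnings is -$1394.28 to $2883.42. The lower bound is negative, which is not a possible value for weekly earnings, so the range is very wide relative to the estimate.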
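The approach suggested in the hint can be sketched as follows: build an approximate 95% interval for each coefficient from its estimate and standard error, then add up the lower limits and the upper limits for the chosen profile. The estimates and standard errors are copied from the Q4 summary. The hours value used for the estimate above is not stated, so hours_worked = 40 is an assumption made only for this illustration and the result is not meant to reproduce the figures above. The 1.96 multiplier is the usual large-sample approximation.

```python
# (estimate, standard error) copied from the Q4 summary above.
coefs = {
    "intercept": (556.3111, 44.6924),
    "young_adult": (-244.8636, 20.0590),
    "female": (-149.2354, 16.1592),
    "hours_worked": (14.9190, 0.8357),
}

# Hypothetical profile: full-time, Young-Adult, female, union member.
# Following the interpretation above, union member and full-time are the
# baseline levels, so they contribute no extra term.
# hours_worked = 40 is an assumption for illustration only.
profile = {"intercept": 1, "young_adult": 1, "female": 1, "hours_worked": 40}

z = 1.96  # approximate 95% multiplier for a large sample

# Point estimate: plug the profile into the fitted equation.
estimate = sum(coefs[k][0] * x for k, x in profile.items())

# Range: use every coefficient's lower (or upper) limit at once.
# All profile values are non-negative, so lower limits give the low end.
low = sum((coefs[k][0] - z * coefs[k][1]) * x for k, x in profile.items())
high = sum((coefs[k][0] + z * coefs[k][1]) * x for k, x in profile.items())

print(f"estimate: {estimate:.2f}")
print(f"range:    {low:.2f} to {high:.2f}")
```

Stacking all the coefficient limits at once is a rough bracket rather than a formal prediction interval, since it ignores both the correlation between coefficient estimates and the residual variation around the fitted value.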