This is an R Markdown document for the Intro to GLM assignment. I will be using the telework dataset.

Q1

Based on the ANOVA, telework does appear to have a statistically significant effect on income, and that difference can be seen in the boxplot as well.

The ANOVA is the naive model because it only compares mean earnings across the levels of the telecommute variable. It doesn't describe the magnitude of the effect, and it doesn't incorporate any other variables that might also have an effect.

##                          Df    Sum Sq   Mean Sq F value Pr(>F)    
## as.factor(telecommute)    1 1.451e+08 145093488   346.5 <2e-16 ***
## Residuals              5540 2.320e+09    418730                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
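
A minimal sketch of how this ANOVA and the accompanying boxplot might be produced (the response is assumed to be weekly_earnings and the data frame telework, matching the later questions):

```r
# One-way ANOVA: do mean weekly earnings differ by telecommute status?
m_aov <- aov(weekly_earnings ~ as.factor(telecommute), data = telework)
summary(m_aov)

# Boxplot of earnings by telecommute status
boxplot(weekly_earnings ~ as.factor(telecommute), data = telework,
        xlab = "Telecommute", ylab = "Weekly earnings")
```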

Q2

For the second model I chose to add the occupation group variable, because a person's occupation likely has a significant impact on their weekly earnings.

I believe this is still a naive model, even though both variables are statistically significant. Many factors go into determining how much people make, and these two variables alone don't capture that.

This model looks better than the original one because the residual sum of squares drops from the first model to the second, and that difference is statistically significant.

##                               Df    Sum Sq   Mean Sq F value Pr(>F)    
## as.factor(telecommute)         1 1.451e+08 145093488   412.3 <2e-16 ***
## as.factor(occupation_group)    9 3.735e+08  41495293   117.9 <2e-16 ***
## Residuals                   5531 1.946e+09    351891                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
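
A sketch of the expanded model and the formal comparison against the Q1 model (the partial F-test on the drop in residual sum of squares described above); the variable names come from the output, the rest is assumed:

```r
# Q1 model vs. the model with occupation group added
m1 <- lm(weekly_earnings ~ as.factor(telecommute), data = telework)
m2 <- lm(weekly_earnings ~ as.factor(telecommute) + as.factor(occupation_group),
         data = telework)

# Partial F-test: is the reduction in residual sum of squares significant?
anova(m1, m2)
```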

Q3

weekly_earnings = B0 + B1*hours_worked + e

weekly_earnings = 66.04 + 22.59*hours_worked + e

While this model is statistically significant and captures some of the relationship between these two variables, it can definitely be improved. There is a lot of space between the regression line and the points, and the residual standard error is high. The R-squared is also pretty small, around 16%, so those are two reasons why I would call the model naive. It also appears that the relationship starts off somewhat flat and then gets steeper, so there may be a nonlinear relationship between the variables.

I decided to log-transform the weekly_earnings variable, and that appeared to provide a better fit for the model.

## 
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1758.0  -411.2  -185.1   230.4  2796.0 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   66.0433    28.5766   2.311   0.0209 *  
## hours_worked  22.5887     0.7072  31.943   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 613 on 5540 degrees of freedom
## Multiple R-squared:  0.1555, Adjusted R-squared:  0.1554 
## F-statistic:  1020 on 1 and 5540 DF,  p-value: < 2.2e-16
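
A sketch of the visual check described above, i.e. the scatterplot of earnings against hours worked with the fitted line overlaid (base graphics assumed):

```r
m_hours <- lm(weekly_earnings ~ hours_worked, data = telework)

# Scatterplot with the fitted line; the wide vertical spread around the line
# reflects the high residual standard error noted above
plot(weekly_earnings ~ hours_worked, data = telework,
     xlab = "Hours worked per week", ylab = "Weekly earnings")
abline(m_hours, col = "red", lwd = 2)
```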

## 
## Call:
## lm(formula = log(weekly_earnings) ~ hours_worked, data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8090 -0.4577 -0.0287  0.4182  2.6466 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.286861   0.030671  172.37   <2e-16 ***
## hours_worked 0.033644   0.000759   44.33   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6579 on 5540 degrees of freedom
## Multiple R-squared:  0.2618, Adjusted R-squared:  0.2617 
## F-statistic:  1965 on 1 and 5540 DF,  p-value: < 2.2e-16
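
A sketch of the log-transformed fit and a quick side-by-side look at the two R-squared values (only a rough comparison, since the responses are on different scales):

```r
m_raw <- lm(weekly_earnings ~ hours_worked, data = telework)
m_log <- lm(log(weekly_earnings) ~ hours_worked, data = telework)

# Roughly 0.16 for the raw model vs. roughly 0.26 after the log transform
c(raw = summary(m_raw)$r.squared, log = summary(m_log)$r.squared)
```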

Q4

weekly_earnings = B0 + B1*age + e

weekly_earnings = 548.95 + 9.19*age + e

This model is considered naive because it has an incredibly low R-squared and a very high residual standard error. Also, based on the visualization, the fitted line doesn't appear to estimate the data well.

Based on the residuals vs. fitted plot, the relationship does not look linear; it is clearly curved.

I'm concerned that age is a discrete variable, and I think it might be better to bin it into categories (such as ages 20-29, for example) and model it as a categorical variable rather than a numeric one. There also appear to be several outliers in the weekly_earnings variable, so I would most likely log-transform that variable again to adjust for them. Also, because the relationship is curved, if I didn't convert age to a categorical variable I would add some kind of nonlinear term for age to address that as well.

## 
## Call:
## lm(formula = weekly_earnings ~ age, data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1245.3  -445.3  -178.1   284.7  2133.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 548.9457    28.2350   19.44   <2e-16 ***
## age           9.1941     0.6306   14.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 654.6 on 5540 degrees of freedom
## Multiple R-squared:  0.03696,    Adjusted R-squared:  0.03678 
## F-statistic: 212.6 on 1 and 5540 DF,  p-value: < 2.2e-16
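
A sketch of the residuals-vs-fitted check and the age binning described above (the exact cut points are an assumption; the agegroup levels used in Q5 run from 30-39 through 70+):

```r
m_age <- lm(weekly_earnings ~ age, data = telework)

# Residuals vs. fitted: curvature here is what suggests the age effect is not linear
plot(m_age, which = 1)

# Bin age into decades and treat it as a categorical variable instead
telework$agegroup <- cut(telework$age,
                         breaks = c(20, 30, 40, 50, 60, 70, Inf),
                         labels = c("20-29", "30-39", "40-49", "50-59", "60-69", "70+"),
                         right = FALSE)
```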

Q5

log(weekly_earnings) = B0 + B1*agegroup30-39 + B2*agegroup40-49 + B3*agegroup50-59 + B4*agegroup60-69 + B5*agegroup70+ + B6*hours_worked + B7*telecommute + e

log(weekly_earnings) = 5.32 + 0.30*agegroup30-39 + 0.36*agegroup40-49 + 0.42*agegroup50-59 + 0.42*agegroup60-69 + 0.17*agegroup70+ + 0.03*hours_worked - 0.28*telecommute

I think age and hours worked might be collinear, simply because it would make sense that once a person reaches a certain age they probably don't work as many hours as they did when they were younger.

For weekly_earnings, the model appears reliable for estimating values from roughly 500 to 2,000, but outside of that range it gets a little hairy. For hours worked, 0-65 looks fine, but above that the fit doesn't look great.

For someone who is 30, works 45 hours a week, and doesn't telecommute, the estimated weekly earnings would be between 727.78 and 845.56 at a 95% confidence level.

## 
## Call:
## lm(formula = log(weekly_earnings) ~ agegroup + hours_worked + 
##     as.factor(telecommute), data = telework1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8026 -0.3904 -0.0226  0.3938  2.6895 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              5.3161431  0.0364135 145.994  < 2e-16 ***
## agegroup30-39            0.2963560  0.0252182  11.752  < 2e-16 ***
## agegroup40-49            0.3609947  0.0259451  13.914  < 2e-16 ***
## agegroup50-59            0.4203338  0.0255760  16.435  < 2e-16 ***
## agegroup60-69            0.4219720  0.0310285  13.599  < 2e-16 ***
## agegroup70+              0.1714422  0.0622655   2.753  0.00592 ** 
## hours_worked             0.0307896  0.0007387  41.684  < 2e-16 ***
## as.factor(telecommute)2 -0.2847866  0.0182713 -15.587  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6222 on 5499 degrees of freedom
##   (35 observations deleted due to missingness)
## Multiple R-squared:  0.3264, Adjusted R-squared:  0.3255 
## F-statistic: 380.6 on 7 and 5499 DF,  p-value: < 2.2e-16

##                               2.5 %      97.5 %
## (Intercept)              5.24475828  5.38752785
## agegroup30-39            0.24691830  0.34579363
## agegroup40-49            0.31013202  0.41185733
## agegroup50-59            0.37019471  0.47047284
## agegroup60-69            0.36114382  0.48280017
## agegroup70+              0.04937722  0.29350715
## hours_worked             0.02934159  0.03223769
## as.factor(telecommute)2 -0.32060566 -0.24896756
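
A sketch of the collinearity check and the prediction behind the interval quoted above (how the 727.78-845.56 range was obtained is an assumption, including which telecommute level means "does not telecommute"):

```r
# Model from the summary above (telework1 is assumed to already contain agegroup)
m_full <- lm(log(weekly_earnings) ~ agegroup + hours_worked + as.factor(telecommute),
             data = telework1)

# Quick collinearity check between age and hours worked
cor(telework1$age, telework1$hours_worked, use = "complete.obs")

# Predicted weekly earnings (back-transformed from the log scale) for a 30-year-old
# who works 45 hours a week and does not telecommute; telecommute = 2 is assumed
# to mean "does not telecommute" here
newdata <- data.frame(agegroup = "30-39", hours_worked = 45, telecommute = 2)
exp(predict(m_full, newdata = newdata, interval = "confidence", level = 0.95))
```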