GLM HW

GLM1

Question 1

##                          Df    Sum Sq   Mean Sq F value Pr(>F)    
## as.factor(telecommute)    1 1.451e+08 145093488   346.5 <2e-16 ***
## Residuals              5540 2.320e+09    418730                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = weekly_earnings ~ as.factor(telecommute), data = telework)
## 
## $`as.factor(telecommute)`
##          diff       lwr       upr p adj
## 2-1 -350.7614 -387.7015 -313.8213     0

## R Squared: 0.05886171

Based on the ANOVA output, teleworking appears to be associated with weekly earnings. The F value is statistically significant (F = 346.5, p < 2e-16), and the Tukey test shows a difference of about -$351 between the two telecommute groups, with an adjusted p-value of essentially 0. However, my main issue with this approach is the small R^2: teleworking explains only about 6% of the variation in weekly earnings, so the model has very little explanatory power. The main takeaway is that, based on the model, someone who doesn't telecommute can expect to earn about $351 less per week than someone who does.

b. The boxplot illustrates the difference in the earnings ranges of people who telework vs. those who don't. Weekly earnings for those who telework appear higher than for those who don't. This is a naive model because:
1. Weekly earnings depend on many more variables than telecommuting, and this model does not account for them (note the low R^2 of only about 6%).
2. It is extremely naive to think that telecommuting is a causal factor of higher weekly earnings. Perhaps something else is causing it, and telecommuting is just capturing some of that effect.
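
To make the R^2 point concrete, here is a small sketch of how the value can be approximately recovered from the ANOVA table itself. The numbers are copied from the output above; the residual sum of squares is rebuilt from the rounded Mean Sq, so the result only agrees with the reported R^2 to about three significant digits.

```r
# Values copied from the one-way ANOVA table above.
ss_telecommute <- 145093488        # Mean Sq with Df = 1, so it equals Sum Sq
ss_residual    <- 418730 * 5540    # residual Mean Sq times residual Df (rounded)

# R^2 is the share of total variation explained by the factor.
r_squared <- ss_telecommute / (ss_telecommute + ss_residual)
print(round(r_squared, 3))
```

Roughly 94% of the variation in weekly earnings is left in the residuals, which is why the significant F value alone says little about how well the model predicts.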

Question 2

##                          Df    Sum Sq   Mean Sq F value Pr(>F)    
## as.factor(telecommute)    1 1.451e+08 145093488   359.0 <2e-16 ***
## as.factor(sex)            1 8.142e+07  81418057   201.5 <2e-16 ***
## Residuals              5539 2.238e+09    404107                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = weekly_earnings ~ as.factor(telecommute) + as.factor(sex), data = telework)
## 
## $`as.factor(telecommute)`
##          diff       lwr       upr p adj
## 2-1 -350.7614 -387.0508 -314.4721     0
## 
## $`as.factor(sex)`
##          diff       lwr       upr p adj
## 2-1 -242.1472 -275.6378 -208.6566     0

## R Squared: 0.09191242

The model estimates earnings as a function of teleworking and gender. I chose gender as my other factor because there is a lot of talk nowadays about the pay gap between males and females, and I want to see a brief model on the subject.
The estimated difference in weekly earnings for teleworkers is the same as in the model above (about $351). In addition, the model estimates that females earn about $242 less per week on average than males. This is a naive model because it assumes that being a woman by itself leads to lower earnings. It doesn't account for other things such as the kinds of jobs and tasks men and women tend to perform on average, the risk taken on the job, the number of hours worked, salary negotiation, etc.
The box plot illustrates the difference between telecommuters and non-telecommuters, facet-wrapped by gender. Earnings appear lower for females than for males, regardless of telecommuting.

## Analysis of Variance Table
## 
## Model 1: weekly_earnings ~ as.factor(telecommute) + as.factor(sex)
## Model 2: weekly_earnings ~ as.factor(telecommute)
##   Res.Df        RSS Df Sum of Sq      F    Pr(>F)    
## 1   5539 2238348491                                  
## 2   5540 2319766548 -1 -81418057 201.48 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the output of the ANOVA, adding sex improves the fit. The RSS of model 1 (2,238,348,491) is lower than that of model 2 (2,319,766,548), and the F-test (F = 201.48, p < 2.2e-16) shows that the difference is statistically significant.
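
As a check on where that F value comes from, the sketch below rebuilds it by hand from the two RSS values and degrees of freedom in the comparison table. Only the variable names are my own; the numbers are copied from the output.

```r
# Values copied from the "Analysis of Variance Table" above.
rss_full    <- 2238348491   # Model 1: telecommute + sex
rss_reduced <- 2319766548   # Model 2: telecommute only
df_full     <- 5539
df_reduced  <- 5540

# Partial F-test: drop in RSS per extra parameter, relative to the
# residual variance of the larger model.
f_stat <- ((rss_reduced - rss_full) / (df_reduced - df_full)) /
  (rss_full / df_full)
p_value <- pf(f_stat, df_reduced - df_full, df_full, lower.tail = FALSE)

print(round(f_stat, 2))
```

The same number appears as the F value for as.factor(sex) in the two-factor ANOVA table, because both tests compare the same pair of nested models.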

Question 3

## 
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1758.0  -411.2  -185.1   230.4  2796.0 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   66.0433    28.5766   2.311   0.0209 *  
## hours_worked  22.5887     0.7072  31.943   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 613 on 5540 degrees of freedom
## Multiple R-squared:  0.1555, Adjusted R-squared:  0.1554 
## F-statistic:  1020 on 1 and 5540 DF,  p-value: < 2.2e-16

a. Y = β0 + β1 * Hours_Worked

Weekly_Earnings = 66.04 + 22.59 * Hours_Worked
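
A quick sketch of what this fitted line implies, using the coefficients from the output above. The helper name predict_weekly is my own and is only for illustration.

```r
# Coefficients copied from the lm() output above.
b0 <- 66.0433    # intercept
b1 <- 22.5887    # dollars per additional hour worked

# Hypothetical helper: predicted weekly earnings for a given number of hours.
predict_weekly <- function(hours) b0 + b1 * hours

at_40 <- predict_weekly(40)

# Under a straight line, every extra hour is worth the same amount,
# whether it is hour 21 or hour 61.
step_low  <- predict_weekly(21) - predict_weekly(20)
step_high <- predict_weekly(61) - predict_weekly(60)

print(c(at_40 = at_40, step_low = step_low, step_high = step_high))
```

The constant step is exactly the linearity assumption: the model cannot represent earnings that flatten out or jump at very long hours.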
This model is naive because:
1. It assumes that earnings are a function of hours worked, but many low-paying jobs have short hours, and many high-paying jobs have really long hours and pay a base salary with bonuses rather than an hourly rate. The model doesn't account for that.
2. Other factors also determine weekly earnings, such as education, prior experience, risk of the job, etc.
3. It assumes that weekly earnings are a linear function of hours worked. However, at the far end of hours worked (more than 50 to 60) that isn't the case.

d. I think we should narrow down our dataset to well-defined subgroups. I believe we are comparing apples to oranges with this dataset: the earnings function for someone in a management, business, or finance position won't be the same as for someone in a service or construction occupation. It is also naive to assume that the relationships are linear (they are not). There is much more behind the amount of money you make than hours worked alone.

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

## 
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = filter(telework, 
##     occupation_group == 1))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1516.8  -551.6  -193.9   461.6  2426.5 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   436.458     96.844   4.507 7.39e-06 ***
## hours_worked   21.629      2.232   9.688  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 715.2 on 961 degrees of freedom
## Multiple R-squared:  0.08898,    Adjusted R-squared:  0.08804 
## F-statistic: 93.87 on 1 and 961 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = filter(telework, 
##     occupation_group == 3))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1020.59  -195.36  -104.88    58.74  2568.51 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    17.389     39.125   0.444    0.657    
## hours_worked   14.936      1.052  14.201   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 378.8 on 876 degrees of freedom
## Multiple R-squared:  0.1871, Adjusted R-squared:  0.1862 
## F-statistic: 201.7 on 1 and 876 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = filter(telework, 
##     occupation_group == 7))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1366.7  -315.4  -115.4   210.4  1969.2 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   190.348    120.612   1.578    0.116    
## hours_worked   18.125      2.793   6.489  5.6e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 516.2 on 221 degrees of freedom
## Multiple R-squared:   0.16,  Adjusted R-squared:  0.1562 
## F-statistic: 42.11 on 1 and 221 DF,  p-value: 5.603e-10

While all three R^2 values are low (0.09, 0.19, and 0.16 for the three models above), the difference in R^2 between management, business, and finance occupations and construction or service occupations seems substantial. For blue-collar occupations the R^2 is higher than for white-collar or business/management occupations. This provides some evidence for the comments we made in d.
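
The per-occupation comparison can be sketched in one pass instead of three separate lm() calls. Since the telework data is not reproduced here, the example below uses a simulated data frame (toy) with the same column names; the noise levels are made up, so the resulting R^2 values are illustrative only.

```r
set.seed(1)
n <- 300

# Simulated stand-in for the telework data (assumption, not the real dataset).
toy <- data.frame(
  occupation_group = rep(c(1, 3, 7), each = n / 3),
  hours_worked     = runif(n, 10, 60)
)
# Made-up noise level per group, looked up by group code.
noise_sd <- c("1" = 700, "3" = 380, "7" = 500)
toy$weekly_earnings <- 100 + 20 * toy$hours_worked +
  rnorm(n, sd = noise_sd[as.character(toy$occupation_group)])

# Fit the same simple regression within each occupation group
# and collect the R^2 of each fit.
r2_by_group <- sapply(split(toy, toy$occupation_group), function(d) {
  summary(lm(weekly_earnings ~ hours_worked, data = d))$r.squared
})

print(round(r2_by_group, 3))
```

On the real data the same pattern would replace toy with telework, which makes it easy to compare all occupation groups rather than only three of them.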

Question 4

## 
## Call:
## lm(formula = weekly_earnings ~ age + as.factor(sex), data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1305.6  -430.2  -160.9   271.8  2236.5 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      668.7690    28.7230   23.28   <2e-16 ***
## age                9.4213     0.6177   15.25   <2e-16 ***
## as.factor(sex)2 -265.5980    17.2322  -15.41   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 641 on 5539 degrees of freedom
## Multiple R-squared:  0.07656,    Adjusted R-squared:  0.07623 
## F-statistic: 229.6 on 2 and 5539 DF,  p-value: < 2.2e-16

## Analysis of Variance Table
## 
## Response: weekly_earnings
##                  Df     Sum Sq  Mean Sq F value    Pr(>F)    
## age               1   91089447 91089447  221.67 < 2.2e-16 ***
## as.factor(sex)    1   97620084 97620084  237.56 < 2.2e-16 ***
## Residuals      5539 2276150505   410932                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Y = β0 + β1 * Age
Weekly_Earnings = 548.94 + 9.19 * Age
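
The coefficient table above also includes sex, so here is a short sketch of what that age + sex fit predicts. The coefficients are copied from the output; the helper predict_earnings and the 0/1 indicator sex2 (1 for the second level of sex, 0 for the baseline) are my own names for illustration.

```r
# Coefficients copied from the lm(weekly_earnings ~ age + as.factor(sex)) output.
b0     <- 668.7690
b_age  <- 9.4213
b_sex2 <- -265.5980

# Hypothetical helper: sex2 = 1 for the second sex level, 0 for the baseline.
predict_earnings <- function(age, sex2) b0 + b_age * age + b_sex2 * sex2

baseline_40 <- predict_earnings(40, 0)
level2_40   <- predict_earnings(40, 1)

print(c(baseline_40 = baseline_40, level2_40 = level2_40))
```

Because there is no interaction term, the gap between the two predictions is the same 265.60 at every age, which is one more way in which the model is restrictive.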
This model is naive because:
1. Age is not a causal factor in being able to generate higher earnings.
2. It doesn't account for the type of occupation. If you are in business you might make more money when you are older, whereas if you are a construction worker it might be the opposite.
3. As we said earlier, gender is not a causal factor of higher or lower earnings. There are many things that are not accounted for, such as workload, nature of the task, market forces, etc.

The relationship is not really linear based on visual exploration of the diagnostic plots. We would like to see the red line roughly flat in the residuals vs. fitted plot, and the points falling along a straight line in the Q-Q plot.
Heteroskedasticity: The variance of the errors doesn't seem to be constant, based on the plots and the increasing spread of earnings as age increases. This is difficult to fix. We might try to address it by building a model with a narrower scope, say, earnings for people between 20 and 30 years old.

Non-linearity: The non-linearity might be addressed by transforming our variables, e.g. log-log, log-linear, quadratic, cubic, or higher-order polynomial terms.

Non-normality: Based on the Q-Q plot, the residuals don't seem to be normally distributed. This could be addressed with further transformations, such as logs or square roots.
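
As an illustration of the log-linear idea, the sketch below fits log(earnings) on age. It uses simulated data with multiplicative errors (an assumption chosen so that the log transform is appropriate), not the telework data, so it only shows the mechanics and how to read the coefficient.

```r
set.seed(123)
n <- 2000

# Simulated data (assumption): earnings grow about 1% per year of age,
# with noise that scales with the level of earnings.
age <- runif(n, 20, 65)
weekly_earnings <- exp(6 + 0.01 * age + rnorm(n, sd = 0.5))
toy <- data.frame(age, weekly_earnings)

# Log-linear model: the response is transformed, the predictor is not.
fit_log <- lm(log(weekly_earnings) ~ age, data = toy)

# In a log-linear model the slope reads as an approximate percent change
# in earnings per additional year of age.
pct_per_year <- (exp(coef(fit_log)[["age"]]) - 1) * 100
print(round(pct_per_year, 2))

# Diagnostic plots to re-check after transforming (uncomment to draw):
# plot(fit_log, which = 1:2)
```

Whether a log transform actually helps on the real data would have to be judged by re-drawing the residuals vs. fitted and Q-Q plots for the transformed model.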

Question 5

## 
## Call:
## lm(formula = weekly_earnings ~ poly(age, 6, raw = T) + as.factor(sex) + 
##     as.factor(union_member) + as.factor(hourly_non_hourly) + 
##     as.factor(telecommute) + as.factor(full_or_part_time) + poly(hours_worked, 
##     2, raw = T), data = telework)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1802.76  -329.54   -90.09   193.75  2569.09 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      5.799e+03  2.266e+03   2.559 0.010510 *  
## poly(age, 6, raw = T)1          -8.407e+02  3.461e+02  -2.429 0.015173 *  
## poly(age, 6, raw = T)2           5.169e+01  2.097e+01   2.465 0.013728 *  
## poly(age, 6, raw = T)3          -1.564e+00  6.465e-01  -2.419 0.015575 *  
## poly(age, 6, raw = T)4           2.516e-02  1.073e-02   2.344 0.019096 *  
## poly(age, 6, raw = T)5          -2.059e-04  9.124e-05  -2.256 0.024089 *  
## poly(age, 6, raw = T)6           6.727e-07  3.114e-07   2.160 0.030780 *  
## as.factor(sex)2                 -1.353e+02  1.479e+01  -9.150  < 2e-16 ***
## as.factor(union_member)2        -8.119e+01  2.383e+01  -3.407 0.000661 ***
## as.factor(hourly_non_hourly)2    4.024e+02  1.552e+01  25.934  < 2e-16 ***
## as.factor(telecommute)2         -1.894e+02  1.605e+01 -11.800  < 2e-16 ***
## as.factor(full_or_part_time)2   -2.784e+02  2.650e+01 -10.504  < 2e-16 ***
## poly(hours_worked, 2, raw = T)1  8.834e-01  2.146e+00   0.412 0.680575    
## poly(hours_worked, 2, raw = T)2  1.417e-01  2.541e-02   5.576 2.58e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 533.9 on 5528 degrees of freedom
## Multiple R-squared:  0.3607, Adjusted R-squared:  0.3592 
## F-statistic: 239.9 on 13 and 5528 DF,  p-value: < 2.2e-16

Y = β0 + β1 * Age + β2 * Age^2 + β3 * Age^3 + β4 * Age^4 + β5 * Age^5 + β6 * Age^6 + β7 * Sex2 + β8 * Union_Member2 + β9 * Hourly_Non_Hourly2 + β10 * Telecommute2 + β11 * Full_or_Part_Time2 + β12 * Hours_Worked + β13 * Hours_Worked^2
Weekly_Earnings = 5.799e+03 - 8.407e+02 x Age + 5.169e+01 x Age^2 - 1.564e+00 x Age^3 + 2.516e-02 x Age^4 - 2.059e-04 x Age^5 + 6.727e-07 x Age^6 - 1.353e+02 x sex2 - 8.119e+01 x union_member2 + 4.024e+02 x hourly_non_hourly2 - 1.894e+02 x telecommute2 - 2.784e+02 x full_or_part_time2 + 8.834e-01 x hours_worked + 1.417e-01 x hours_worked^2

c. Based on the VIF test, none of the predictors appear to be collinear. However, one might be tempted to believe that union membership could be collinear with hourly vs. non-hourly status, since blue-collar jobs tend to pay by the hour.
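
For reference, the VIF of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the others. The sketch below computes it by hand in base R on simulated predictors; vif_manual is a hypothetical helper written for this example, not the function used to produce the result above.

```r
# Hypothetical helper: VIF for each column of a data frame of numeric predictors.
vif_manual <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    # Regress predictor v on all remaining predictors.
    fit <- lm(reformulate(others, response = v), data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

set.seed(42)
x1 <- rnorm(500)
x2 <- rnorm(500)                    # independent of the others
x3 <- x1 + rnorm(500, sd = 0.3)     # deliberately correlated with x1

vifs <- vif_manual(data.frame(x1, x2, x3))
print(round(vifs, 2))
```

In this toy example x1 and x3 get large VIFs while x2 stays near 1. A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity.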

The model should work best for people who are younger than retirement age and are hourly workers (either full time or part time); they can be union or non-union members, of either gender, and can either telecommute or not. I think the main factors that would lead to badly off predictions are age and salaried status: the variance of salaried workers' earnings might differ across salary ranges (something similar might apply to age).

## Warning in as.data.frame.numeric(age, sex, union_member,
## hourly_non_hourly, : 'row.names' is not a character vector of length 1 --
## omitting it. Will be an error!

##        fit       lwr     upr
## 1 494.4783 -554.6931 1543.65

## Hourly Wage: 12.36175

Based on the model output, a 38-year-old female who is a union member, is paid by the hour, does not telecommute, works part time, and works 40 hours per week should make about $494.48 per week, or roughly $12.36/hour. The interval given by the model goes from -554.69 to 1543.65 a week. However, since weekly earnings cannot be negative, it would be more realistic to say that the range goes from 0 to 1543.65 a week, unless negative earnings mean that the person is so indebted that their net earnings are negative (but I don't believe the dataset accounts for that).
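
A minimal sketch of how a prediction like this can be produced without triggering the row.names warning shown above: build the new observation with data.frame() and named columns, then call predict() with an interval. The model here is a smaller stand-in fitted on simulated data (an assumption), so the numbers it prints are not comparable to the ones reported above.

```r
set.seed(7)
n <- 400

# Simulated stand-in for the telework data (assumption, not the real dataset).
toy <- data.frame(
  age          = round(runif(n, 18, 65)),
  sex          = factor(sample(1:2, n, replace = TRUE)),
  hours_worked = round(runif(n, 10, 60))
)
toy$weekly_earnings <- 200 + 5 * toy$age - 130 * (toy$sex == 2) +
  15 * toy$hours_worked + rnorm(n, sd = 300)

fit <- lm(weekly_earnings ~ age + sex + hours_worked, data = toy)

# One row of new data, with column names matching the model's predictors.
new_person <- data.frame(
  age          = 38,
  sex          = factor(2, levels = levels(toy$sex)),
  hours_worked = 40
)

# interval = "prediction" gives a range for one individual, not for the mean.
pred <- predict(fit, newdata = new_person, interval = "prediction", level = 0.95)
hourly_wage <- pred[1, "fit"] / new_person$hours_worked
print(pred)

# The hourly figure reported above is the fitted value divided by 40 hours.
reported_hourly <- 494.4783 / 40
print(round(reported_hourly, 2))
```

A prediction interval includes the residual variance, which is why it can be very wide and even dip below zero for a model with a residual standard error this large.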