Intro to the GLM

1. Using a simple one-way ANOVA, answer the following questions:

a. Does Teleworking appear to have a significant effect on income? How do you know, and explain the effect (if it exists).

Based on a simple one-way ANOVA of weekly earnings on teleworking, teleworking appears to have a significant effect on income: the F-test gives F(1, 5540) = 346.5 with p < 2e-16, well below 0.05. The Tukey comparison estimates that telecommuters earn 350.76 dollars more per week than non-telecommuters, with a 95% confidence range of +/- 36.94 dollars (313.82 to 387.70).

##                          Df    Sum Sq   Mean Sq F value Pr(>F)    
## as.factor(telecommute)    1 1.451e+08 145093488   346.5 <2e-16 ***
## Residuals              5540 2.320e+09    418730                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = weekly_earnings ~ as.factor(telecommute), data = telework)
## 
## $`as.factor(telecommute)`
##         diff      lwr      upr p adj
## 1-0 350.7614 313.8213 387.7015     0
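The model call shown in the `Fit:` line can be sketched on simulated data. This is only an illustration: the `telework` data set is not reproduced here, so the data frame `telework_sim` and its invented group means are assumptions. Only the column names `weekly_earnings` and `telecommute` and the formula come from the output above.

```r
# Simulated stand-in for the telework data (all values invented for illustration).
set.seed(42)
n <- 400
telework_sim <- data.frame(telecommute = rep(c(0, 1), each = n / 2))
telework_sim$weekly_earnings <- 900 + 350 * telework_sim$telecommute +
  rnorm(n, mean = 0, sd = 600)

# One-way ANOVA, using the same formula as the Fit line above.
fit1 <- aov(weekly_earnings ~ as.factor(telecommute), data = telework_sim)
print(summary(fit1))

# Tukey comparison: diff is the difference in group means (group 1 minus group 0).
tukey1 <- TukeyHSD(fit1)
print(tukey1)

# Half-width of the 95% interval, i.e. the "+/-" figure quoted in the text.
ci <- tukey1[["as.factor(telecommute)"]]
half_width <- (ci[1, "upr"] - ci[1, "lwr"]) / 2
```

With only two groups, the Tukey `diff` is simply the difference between the two group means, which is why it can be read directly as the estimated effect of telecommuting.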

b. Provide a visualization that helps a reader understand the model.

c. Provide at least two explanations for why this is considered a naive model.

This analysis provides a naive model because (1) it excludes other independent variables and any interaction between those variables and teleworking, and (2) the one-way ANOVA F-test only tells us that at least two groups differ from each other; on its own it does not tell us which groups differ.

2. Using a two-way ANOVA, create a hierarchical variation on your result from Q1 by adding another factor variable and answer the following questions:

a. Explain your model design. Why did you use this additional factor variable compared to others?

I wanted to examine the effect of sex (male versus female) on weekly earnings. Since an earnings difference between men and women already exists in the workplace in general, I wanted to see whether it still holds in a telecommute setting.

b. Interpret the results of your model. Do you consider this a naive model? Why or why not?

Based on a two-way ANOVA of the effects of teleworking and sex on income, both variables appear to have a significant effect because the p-value is less than 0.05 for each (p < 2e-16). The Tukey comparisons estimate that telecommuting is associated with an increase of 350.76 dollars in weekly earnings, with a confidence range of +/- 36.29 dollars, and that being male is associated with an increase of 242.15 dollars, with a confidence range of +/- 33.49 dollars.

This model would be considered naive because it does not evaluate the interaction between teleworking and sex.

##                          Df    Sum Sq   Mean Sq F value Pr(>F)    
## as.factor(telecommute)    1 1.451e+08 145093488   359.0 <2e-16 ***
## sex                       1 8.142e+07  81418057   201.5 <2e-16 ***
## Residuals              5539 2.238e+09    404107                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = weekly_earnings ~ as.factor(telecommute) + sex, data = telework)
## 
## $`as.factor(telecommute)`
##         diff      lwr      upr p adj
## 1-0 350.7614 314.4721 387.0508     0
## 
## $sex
##         diff      lwr      upr p adj
## M-F 242.1472 208.6566 275.6378     0
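The interaction between teleworking and sex, which the additive model leaves out, could be tested by fitting a second model with the `*` operator and comparing the two. The sketch below uses simulated data: the data frame `sim` and its effect sizes are invented assumptions, and only the column names and the additive formula come from the output above.

```r
# Simulated stand-in data (values invented; no interaction is built in).
set.seed(7)
n <- 800
sim <- data.frame(
  telecommute = sample(c(0, 1), n, replace = TRUE),
  sex = factor(sample(c("F", "M"), n, replace = TRUE))
)
sim$weekly_earnings <- 800 + 350 * sim$telecommute + 240 * (sim$sex == "M") +
  rnorm(n, sd = 600)

# Additive model (as in the text) and a model that adds the interaction term.
additive <- aov(weekly_earnings ~ as.factor(telecommute) + sex, data = sim)
interact <- aov(weekly_earnings ~ as.factor(telecommute) * sex, data = sim)

# Nested F-test: a small p-value would suggest the interaction improves the fit.
comparison <- anova(additive, interact)
print(comparison)
```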

c. Provide a visualization that helps a reader understand the model.

d. Test the difference between your model from Q1 and your model in Q2. Is there an improvement in fit? How do you know?

There is an improvement in fit from adding sex to the ANOVA. The residual sum of squares falls by 81,418,057, and this reduction is statistically significant (F = 201.48, p < 2.2e-16, well below 0.05).

## Analysis of Variance Table
## 
## Model 1: weekly_earnings ~ as.factor(telecommute)
## Model 2: weekly_earnings ~ as.factor(telecommute) + sex
##   Res.Df        RSS Df Sum of Sq      F    Pr(>F)    
## 1   5540 2319766548                                  
## 2   5539 2238348491  1  81418057 201.48 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
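The F statistic in this table can be recomputed by hand from the two residual sums of squares. The sketch below uses only the numbers printed above; the variable names are illustrative.

```r
# Residual sums of squares and residual degrees of freedom from the table.
rss1 <- 2319766548; df1 <- 5540   # Model 1: telecommute only
rss2 <- 2238348491; df2 <- 5539   # Model 2: telecommute + sex

# F = (drop in RSS per added parameter) / (residual mean square of the larger model)
f_stat <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)

# Upper-tail probability of the F distribution.
p_value <- pf(f_stat, df1 - df2, df2, lower.tail = FALSE)

round(f_stat, 2)
```

The numerator, `rss1 - rss2`, is the 81418057 shown in the `Sum of Sq` column.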

3. Build a simple regression model estimating weekly earnings by hours worked. Do not make any transformations at this point - only generate a naive model and answer the following questions:

## 
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1758.0  -411.2  -185.1   230.4  2796.0 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   66.0433    28.5766   2.311   0.0209 *  
## hours_worked  22.5887     0.7072  31.943   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 613 on 5540 degrees of freedom
## Multiple R-squared:  0.1555, Adjusted R-squared:  0.1554 
## F-statistic:  1020 on 1 and 5540 DF,  p-value: < 2.2e-16

a. Write the generalized form of the regression using beta notation.

$Weekly~Earnings_i = \beta_0 + \beta_1(Hours~Worked_i) + \epsilon_i$

b. Write the estimated form of the regression using your results.

$\widehat{Weekly~Earnings}_i = 66.04 + 22.59(Hours~Worked_i)$
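To see what the estimated equation implies, the coefficients from the regression output can be plugged into a small helper. The function name `predict_earnings` is hypothetical; the two coefficients are copied from the output above.

```r
# Coefficients copied from the lm() summary above.
b0 <- 66.0433   # intercept
b1 <- 22.5887   # dollars of weekly earnings per additional hour worked

# Hypothetical helper: fitted weekly earnings for a given number of hours.
predict_earnings <- function(hours_worked) b0 + b1 * hours_worked

# Fitted earnings for a 20-, 40-, and 60-hour week.
predict_earnings(c(20, 40, 60))
```

For a 40-hour week the fitted value is about 969.59 dollars. With an R-squared of 0.1555, individual earnings vary widely around that line.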

c. Provide at least 3 explanations for why we would consider this a naive model

The model includes only one variable to explain weekly earnings.
The model cannot evaluate the interaction between multiple variables.
The model has only one iteration, so its fit cannot be compared against additional models.

d. In your opinion, what do you think should be done to better model these two variables or do you think it does not make sense to model one as a function of the other under any circumstances? Provide a visualization to support your position.

Based on the chart below, I do not think the two variables can be adjusted to provide comparable information. For instance, a full-time salaried worker earns nothing extra for each hour worked past 40 per week. Also, some employees listed as full-time report fewer hours than the 30-hour requirement for full-time classification. I think a better way to look at the data is to determine weekly earnings by hourly wage for those employees who have telecommuted for their employer.

e. If your recommendation from part d above is possible with the data available, do your best to execute your proposed adjustments. If your recommendation from part d is not possible using this dataset, propose another solution using additional variables if possible, a different data set or another strategy.

As stated above, I would determine weekly earnings by hourly wage for those employees who have telecommuted for their employer.

I would create an adjusted hours column to:

Set full-time hourly employees to a minimum of 30 hours.
Set part-time hourly employees to a maximum of 39 hours.
Set full-time salary employees to a maximum of 40 hours.
Set part-time salary employees to a maximum of 39 hours.

Calculate an adjusted pay rate to chart.
Exclude all rates above $100 per hour.
Show differences between full-time (1) and part-time (2).
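The adjustment steps listed above could be sketched as follows. The sample rows are invented, and `pay_type` is a hypothetical column name for the hourly/salary distinction; `hours_worked`, `weekly_earnings`, and `full_or_part_time` are the names that appear in the model output.

```r
# Invented sample rows; pay_type is an assumed column, not taken from the data set.
workers <- data.frame(
  weekly_earnings   = c(800, 300, 2000, 500, 5000),
  hours_worked      = c(25, 45, 55, 42, 40),
  full_or_part_time = c(1, 2, 1, 2, 1),   # 1 = full-time, 2 = part-time
  pay_type          = c("hourly", "hourly", "salary", "salary", "salary")
)

full_time <- workers$full_or_part_time == 1
hourly    <- workers$pay_type == "hourly"

# Apply the four rules: one floor and three caps.
adj <- workers$hours_worked
adj[full_time & hourly]   <- pmax(adj[full_time & hourly], 30)    # minimum of 30
adj[!full_time & hourly]  <- pmin(adj[!full_time & hourly], 39)   # maximum of 39
adj[full_time & !hourly]  <- pmin(adj[full_time & !hourly], 40)   # maximum of 40
adj[!full_time & !hourly] <- pmin(adj[!full_time & !hourly], 39)  # maximum of 39
workers$adj_hours <- adj

# Adjusted pay rate, then drop rates above $100 per hour before charting.
workers$adj_rate <- workers$weekly_earnings / workers$adj_hours
charted <- workers[workers$adj_rate <= 100, ]
```

In this toy example the last row has an adjusted rate of 125 dollars per hour and is excluded, leaving four rows to chart by `full_or_part_time`.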

4. Build a simple regression model estimating weekly earnings as a function of age. Do not make any transformations at this point - only generate a naive model and answer the following questions:

## 
## Call:
## lm(formula = weekly_earnings ~ age, data = telework)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1245.3  -445.3  -178.1   284.7  2133.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 548.9457    28.2350   19.44   <2e-16 ***
## age           9.1941     0.6306   14.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 654.6 on 5540 degrees of freedom
## Multiple R-squared:  0.03696,    Adjusted R-squared:  0.03678 
## F-statistic: 212.6 on 1 and 5540 DF,  p-value: < 2.2e-16

a. Write the generalized form of the regression using beta notation.

$Weekly~Earnings_i = \beta_0 + \beta_1(Age_i) + \epsilon_i$

b. Write the estimated form of the regression using your results.

$\widehat{Weekly~Earnings}_i = 548.95 + 9.19(Age_i)$

c. Provide at least 3 explanations for why we would consider this a naive model

The model includes only one variable to explain weekly earnings.
The model cannot evaluate the interaction between multiple variables.
The model has only one iteration, so its fit cannot be compared against additional models.

d. Test the linearity assumption of this model. Provide output of the tests you run to assess linearity and comment on the results.

The fitted line is rather flat, and a straight line excludes the possibility that the data follow a more parabolic shape.
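One way to test for a parabolic shape is to add a squared age term and compare the two models with a nested F-test. This is a sketch on simulated data, not the test that was run for this report: the data frame `sim` and its earnings curve are invented assumptions.

```r
# Simulated data with an invented earnings curve that rises and then falls with age.
set.seed(3)
n <- 500
sim <- data.frame(age = runif(n, 18, 70))
sim$weekly_earnings <- 200 + 50 * sim$age - 0.5 * sim$age^2 + rnorm(n, sd = 150)

# Straight-line model and a model that allows curvature.
linear    <- lm(weekly_earnings ~ age, data = sim)
quadratic <- lm(weekly_earnings ~ age + I(age^2), data = sim)

# Nested F-test: does the squared term explain significantly more variance?
lin_test <- anova(linear, quadratic)
print(lin_test)
```

A small p-value in the second row would indicate that the linear form is inadequate. A residuals-versus-fitted plot, for example `plot(linear, which = 1)`, gives a visual version of the same check.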

e. Identify at least 3 other possible concerns regarding this model beyond those inherent in the naive design. For each possible concern, comment on how you would assess and address the concern.

Linearity and additivity of the relationship between dependent and independent variables.

Assessment: Violations are most evident in a plot of observed versus predicted values or a plot of residuals versus predicted values.

Correction: Apply a nonlinear transformation to the dependent and/or independent variables if you can think of a transformation that seems appropriate.

Homoscedasticity (constant variance) of the errors.

Assessment: Examine a plot of residuals versus predicted values.

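Correction: If the spread of the residuals grows with the predicted value, a log transformation of the dependent variable is a common remedy. Seasonal patterns, a common source of heteroscedasticity in time series, are less likely to matter in this cross-sectional data.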

Normality of the error distribution.

Assessment: The best test for normally distributed errors is a normal probability plot of the residuals.

Correction: If the problem with the error distribution is mainly due to one or two very large errors, such values should be scrutinized closely.

5. Modify your model from Q4 by adding at least 3 other IVs to the regression and transforming the age variable (and others) as necessary to meet the linearity assumption. Interpret your results and answer the following questions:

## 
## Call:
## lm(formula = weekly_earnings ~ age_group + education_group + 
##     sex + as.factor(full_or_part_time) + as.factor(msa_size), 
##     data = telework_adj1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1645.61  -336.92   -92.96   224.85  2694.58 
## 
## Coefficients:
##                                            Estimate Std. Error t value
## (Intercept)                                  569.05      29.97  18.988
## age_group30-39                               158.36      22.39   7.071
## age_group40-49                               269.00      23.01  11.693
## age_group50-59                               312.71      22.72  13.761
## age_group60+                                 269.04      26.05  10.326
## age_groupUnder 20                             15.48      55.05   0.281
## education_groupBachelor Degree               282.77      25.74  10.987
## education_groupHigh School                  -165.14      25.03  -6.596
## education_groupMasters/Professional Degree   508.42      29.04  17.507
## education_groupSome College                  -77.22      26.54  -2.909
## sexM                                         211.26      14.73  14.341
## as.factor(full_or_part_time)2               -526.60      21.62 -24.356
## as.factor(msa_size)2                          13.61      29.00   0.470
## as.factor(msa_size)3                          61.36      28.72   2.137
## as.factor(msa_size)4                          46.91      25.54   1.837
## as.factor(msa_size)5                          47.03      23.35   2.014
## as.factor(msa_size)6                         244.17      24.96   9.782
## as.factor(msa_size)7                         130.12      22.92   5.677
##                                            Pr(>|t|)    
## (Intercept)                                 < 2e-16 ***
## age_group30-39                             1.72e-12 ***
## age_group40-49                              < 2e-16 ***
## age_group50-59                              < 2e-16 ***
## age_group60+                                < 2e-16 ***
## age_groupUnder 20                           0.77855    
## education_groupBachelor Degree              < 2e-16 ***
## education_groupHigh School                 4.61e-11 ***
## education_groupMasters/Professional Degree  < 2e-16 ***
## education_groupSome College                 0.00364 ** 
## sexM                                        < 2e-16 ***
## as.factor(full_or_part_time)2               < 2e-16 ***
## as.factor(msa_size)2                        0.63872    
## as.factor(msa_size)3                        0.03266 *  
## as.factor(msa_size)4                        0.06626 .  
## as.factor(msa_size)5                        0.04404 *  
## as.factor(msa_size)6                        < 2e-16 ***
## as.factor(msa_size)7                       1.44e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 540 on 5524 degrees of freedom
## Multiple R-squared:  0.3465, Adjusted R-squared:  0.3445 
## F-statistic: 172.3 on 17 and 5524 DF,  p-value: < 2.2e-16

a. Write the generalized form of the regression using beta notation.

Generalized regression formula. The omitted reference categories, absorbed into the intercept, are age group 20-29, an Associate degree, female, full-time employment, and MSA size level 1.

$Weekly~Earnings_i = \beta_0 + \beta_1(Age~Group_i) + \beta_2(Education~Group_i) + \beta_3(Sex_i) + \beta_4(Employment~Status_i)+ \beta_5(MSA~Size_i) + \epsilon_i$

b. Write the estimated form of the regression using your results.

Estimated regression formula. The omitted reference categories, absorbed into the intercept, are age group 20-29, an Associate degree, female, full-time employment, and MSA size level 1.

Weekly Earnings = 569.05 + 15.48 x age_groupUnder20 + 158.36 x age_group30-39 + 269.00 x age_group40-49 + 312.71 x age_group50-59 + 269.04 x age_group60+ + 282.77 x education_levelBachelorDegree - 165.14 x education_levelHighSchool + 508.42 x education_levelMasters/ProfessionalDegree - 77.22 x education_levelSomeCollege + 211.26 x sex_male - 526.60 x part-time + 13.61 x msa_size_100,000-249,999 + 61.36 x msa_size_250,000-499,999 + 46.91 x msa_size_500,000-999,999 + 47.03 x msa_size_1,000,000-2,499,999 + 244.17 x msa_size_2,500,000-4,999,999 + 130.12 x msa_size_5,000,000+

c. Do you suspect any of your independent variables are colinear? Explain why or why not.

I do not suspect that any of the independent variables are collinear, because no two of them appear to follow the same pattern.
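A suspicion like this can be checked numerically with variance inflation factors (VIFs). The sketch below computes a VIF for each dummy column by hand in base R, on simulated data: the data frame `sim` and its values are invented assumptions, and this per-column VIF is a simplification of the generalized VIF that some packages report for factors.

```r
# Simulated stand-in data with independent categorical predictors (values invented).
set.seed(1)
n <- 300
sim <- data.frame(
  sex               = factor(sample(c("F", "M"), n, replace = TRUE)),
  full_or_part_time = factor(sample(1:2, n, replace = TRUE)),
  age_group         = factor(sample(c("20-29", "30-39", "40-49"), n, replace = TRUE))
)
sim$weekly_earnings <- 600 + 200 * (sim$sex == "M") -
  500 * (sim$full_or_part_time == "2") + rnorm(n, sd = 300)

fit <- lm(weekly_earnings ~ age_group + sex + full_or_part_time, data = sim)

# Design matrix without the intercept column.
X <- model.matrix(fit)[, -1, drop = FALSE]

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing column j on the others.
vif <- sapply(seq_len(ncol(X)), function(j) {
  r2 <- summary(lm(X[, j] ~ X[, -j]))$r.squared
  1 / (1 - r2)
})
names(vif) <- colnames(X)
print(round(vif, 2))
```

Values near 1 indicate little collinearity; a common rule of thumb treats values above roughly 5 to 10 as a warning sign.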

d. Judging by your output, what ranges of values are you most comfortable using this model to estimate future observations?

The 95% confidence intervals for the model coefficients are shown below:

##                                                  2.5 %     97.5 %
## (Intercept)                                 510.302975  627.80578
## age_group30-39                              114.459050  202.26294
## age_group40-49                              223.896575  314.09631
## age_group50-59                              268.161212  357.25765
## age_group60+                                217.961571  320.11478
## age_groupUnder 20                           -92.436378  123.39795
## education_groupBachelor Degree              232.313169  333.22078
## education_groupHigh School                 -214.215241 -116.05981
## education_groupMasters/Professional Degree  451.490512  565.35228
## education_groupSome College                -129.254535  -25.18429
## sexM                                        182.376021  240.13415
## as.factor(full_or_part_time)2              -568.981957 -484.21011
## as.factor(msa_size)2                        -43.232113   70.46210
## as.factor(msa_size)3                          5.063832  117.65392
## as.factor(msa_size)4                         -3.150957   96.97092
## as.factor(msa_size)5                          1.255404   92.80598
## as.factor(msa_size)6                        195.241424  293.10679
## as.factor(msa_size)7                         85.190103  175.05228
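Each interval in this table is the coefficient estimate plus or minus a t critical value times its standard error. A minimal check for the `sexM` row, using the rounded estimate and standard error printed in the coefficient table (so the result agrees with the table only to within rounding):

```r
# Values copied from the coefficient table for sexM.
estimate  <- 211.26   # coefficient estimate
std_error <- 14.73    # standard error
resid_df  <- 5524     # residual degrees of freedom

# Two-sided 95% interval: estimate +/- t critical value * standard error.
t_crit <- qt(0.975, df = resid_df)
lower  <- estimate - t_crit * std_error
upper  <- estimate + t_crit * std_error

round(c(lower = lower, upper = upper), 2)
```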

e. Generate a hypothetical observation that exists within the ranges specified in part d and estimate the weekly earnings for that individual. In addition to this estimate, calculate the estimated range of weekly earnings this individual may have and comment on the results.

For this hypothetical observation I will use a 40-year-old male, with a Master’s degree, who works full-time and lives in a metropolitan area of 2.1M people.

Weekly earnings = 569.05 + 269.00 x age_group40-49 + 508.42 x education_levelMasters/ProfessionalDegree + 211.26 x sex_male + 47.03 x msa_size_1,000,000-2,499,999

Weekly earnings = 1,604.76

Weekly earnings (Low) = 510.30 + 223.90 x age_group40-49 + 451.49 x education_levelMasters/ProfessionalDegree + 182.38 x sex_male + 1.26 x msa_size_1,000,000-2,499,999

Weekly earnings (Low) = 1,369.33

Weekly earnings (High) = 627.81 + 314.10 x age_group40-49 + 565.35 x education_levelMasters/ProfessionalDegree + 240.13 x sex_male + 92.81 x msa_size_1,000,000-2,499,999

Weekly earnings (High) = 1,840.20
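The three sums can be checked in a few lines of R. The vectors below hold the rounded coefficient estimates and confidence bounds from the output above (with 211.26 for `sexM`, as printed in the coefficient table); the element names are illustrative. This reproduces the approach used here of summing the interval endpoints term by term.

```r
# Terms that apply to a 40-year-old male, Master's degree, full-time, MSA size 5.
estimate <- c(intercept = 569.05, age_40_49 = 269.00, masters = 508.42,
              male = 211.26, msa_size5 = 47.03)
low      <- c(intercept = 510.30, age_40_49 = 223.90, masters = 451.49,
              male = 182.38, msa_size5 = 1.26)
high     <- c(intercept = 627.81, age_40_49 = 314.10, masters = 565.35,
              male = 240.13, msa_size5 = 92.81)

# Point estimate and the summed lower and upper bounds.
totals <- c(estimate = sum(estimate), low = sum(low), high = sum(high))
print(totals)
```

The sums come to 1,604.76, 1,369.33, and 1,840.20. Summing the bounds of each coefficient is a rough range only; calling `predict()` on the fitted model with `interval = "prediction"` would be the more usual way to obtain a range for a single new individual.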

Patrick Bastin

11/26/2019
