Intro to GLM - Census bureau survey

Sections

  • Packages Required
  • Introduction
  • Questions

Introduction

Source: http://asayanalytics.com/telework_csv

Variables

Variable          Type         Description
weekly_earnings   Continuous   Amount of money made during one week

Data Summary

1. One-Way ANOVA

A. Effect of Teleworking on Income

The coefficient of the independent variable telecommute is statistically significant (p-value < 0.05). It is the only predictor in this model.

                Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)     1183       15.69        75.43     0           ***
telecommute2    -350.8     18.84        -18.61    4.572e-75   ***

Fitting linear model: weekly_earnings ~ telecommute

Observations   Residual Std. Error   \(R^2\)   Adjusted \(R^2\)
5542           647.1                 0.05886   0.05869

B. Plot

The mean weekly income of telecommuters is about $351 higher than that of non-telecommuters.
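The $351 gap can be read from the coefficient table: with treatment (dummy) coding, the intercept is the mean of the reference group and the `telecommute2` coefficient is the offset of the second group. The following is a minimal Python sketch of that arithmetic, using the rounded estimates reported in the output. The group labels are an assumption: the output does not say which level codes telecommuters, and level 1 is taken here because that is what the interpretation in the text implies.

```python
# Coefficients copied from the one-way model output (rounded as reported).
intercept = 1183.0       # mean weekly earnings of the reference level (telecommute = 1)
telecommute2 = -350.8    # offset of level 2 relative to the reference level

# Assumption: level 1 = telecommuters, level 2 = non-telecommuters.
mean_telecommuters = intercept
mean_non_telecommuters = intercept + telecommute2

gap = mean_telecommuters - mean_non_telecommuters
print(f"telecommuters:     {mean_telecommuters:.1f}")
print(f"non-telecommuters: {mean_non_telecommuters:.1f}")
print(f"gap:               {gap:.1f}")
```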

C. Naive Explanation

  • It is simple. It uses only one variable to build the model and does not explain the underlying causal relationships that produce the variable being investigated.
  • For example, the model does not capture the number of hours worked or whether an employee is paid hourly.

2. Two-Way ANOVA

A. Effect of Teleworking + Hourly Paid on Income

Explanation of model design:

When designing the model, we assumed based on real life observations that hourly paid employees make less than exempt employees.

                                   Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)                        788.6      22.52        35.02     7.508e-243   ***
telecommute2                       -125.7     25.44        -4.941    7.993e-07    ***
hourly_non_hourly2                 663        29.19        22.71     3.025e-109   ***
telecommute2:hourly_non_hourly2    -179.6     35.38        -5.078    3.949e-07    ***

Fitting linear model: weekly_earnings ~ telecommute + hourly_non_hourly + telecommute:hourly_non_hourly

Observations   Residual Std. Error   \(R^2\)   Adjusted \(R^2\)
5542           591.1                 0.2149    0.2144

B. Model result explanation

  • The coefficients of telecommute, hourly_non_hourly and their interaction are all significant (p-values < 0.05).
  • Adjusted R-squared increased to about 21%, compared to about 6% for the one-way ANOVA model.
  • This is still a naive model. Many factors that could better explain income are not considered, such as age, industry and geography.

C. Model Visualization

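Mean income is highest for employees who are both telecommuters and exempt. Because the model includes an interaction, each main effect applies at the reference level of the other factor: among hourly paid employees, telecommuters earn $125.72 more per week than non-telecommuters, and among telecommuters, exempt employees earn $662.95 more per week than hourly paid employees.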
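With an interaction term in the model, the fitted mean for each of the four groups is a sum of the relevant coefficients, which makes the individual effects easier to compare. The sketch below rebuilds those four fitted means from the rounded estimates in the output. The function name `cell_mean` is illustrative, and the output does not say what levels 1 and 2 mean, so the loop prints level numbers rather than labels.

```python
# Coefficients copied from the two-way model output (rounded as reported).
b0 = 788.6          # reference cell: telecommute = 1, hourly_non_hourly = 1
b_tele2 = -125.7    # telecommute level 2, at hourly_non_hourly = 1
b_pay2 = 663.0      # hourly_non_hourly level 2, at telecommute = 1
b_inter = -179.6    # additional shift when both factors are at level 2

def cell_mean(tele2: bool, pay2: bool) -> float:
    """Fitted mean weekly earnings for one combination of the two factors."""
    return b0 + b_tele2 * tele2 + b_pay2 * pay2 + b_inter * (tele2 and pay2)

for tele2 in (False, True):
    for pay2 in (False, True):
        print(f"telecommute level {1 + tele2}, hourly_non_hourly level {1 + pay2}: "
              f"{cell_mean(tele2, pay2):.1f}")
```

The gap between the two telecommute levels is 125.7 when hourly_non_hourly is at level 1 and 125.7 + 179.6 = 305.3 when it is at level 2, which is what the significant interaction term expresses.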

D. Comparison LM 1 vs LM 2

The model with telecommute, hourly_non_hourly and their interaction fits better than the simple linear model at explaining changes in weekly earnings. The F-test comparing the two models is significant (p-value < 0.05).

Analysis of Variance Table

Res.Df   RSS         Df   Sum of Sq   F       Pr(>F)
5540     2.32e+09    NA   NA          NA      NA
5538     1.935e+09   2    384505502   550.2   1.166e-218

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
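The F statistic in the table can be reproduced from the other columns: it is the reduction in residual sum of squares per added coefficient, divided by the residual mean square of the larger model. A minimal sketch using the rounded values printed in the table, so the result agrees only to the precision shown:

```python
# Values copied from the analysis of variance table (rounded as reported).
rss_full = 1.935e9       # residual sum of squares of the two-way model
res_df_full = 5538       # residual degrees of freedom of the two-way model
sum_of_sq = 384505502    # reduction in RSS from the added terms
df_added = 2             # number of added coefficients

# F = (reduction in RSS per added coefficient) / (residual mean square)
f_stat = (sum_of_sq / df_added) / (rss_full / res_df_full)
print(f"F = {f_stat:.1f}")
```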

3. Weekly earnings by hours worked

A. \(\text{weekly\_earnings} = b_1 \cdot \text{hours\_worked} + b_0\)

B. Model

Weekly Earnings = 22.5887 × hours_worked + 66.0433

                Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)     66.04      28.58        2.311     0.02086      *
hours_worked    22.59      0.7072       31.94     1.169e-205   ***

Fitting linear model: weekly_earnings ~ hours_worked

Observations   Residual Std. Error   \(R^2\)   Adjusted \(R^2\)
5542           613                   0.1555    0.1554
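To make the fitted line concrete, the sketch below evaluates it at a few hours-worked values. The function name `predict_weekly_earnings` and the chosen hours are illustrative and are not taken from the data.

```python
def predict_weekly_earnings(hours_worked: float) -> float:
    """Fitted weekly earnings from the simple hours-worked model."""
    slope = 22.5887       # dollars per additional hour worked
    intercept = 66.0433   # fitted earnings at zero hours
    return slope * hours_worked + intercept

for hours in (20, 40, 60):   # illustrative values, not taken from the data
    print(hours, round(predict_weekly_earnings(hours), 2))
```

Each fitted value should be read alongside the residual standard error of 613, which is large relative to the predictions themselves.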

C. Naive Explanation

  • It is simple. It uses only one variable to build the model and does not explain the underlying causal relationships that produce the variable being investigated.
  • It does not consider the effect of being hourly paid or exempt, which is an important factor in the number of hours reported.
  • Hours worked explains only about 15.5% of the variability in weekly earnings.

D. Model Analysis

  • We cannot see a clear linear relationship between hours worked and weekly earnings. The data are widely scattered.
  • One reason could be that exempt employees make a fixed salary regardless of the number of hours worked, so their earnings would not scale linearly with hours.
  • The first recommendation is to exclude exempt employees and build a model that explains weekly earnings by hours worked using only hourly paid employees.
  • The second recommendation is to try other variables that may better explain weekly earnings.

E. Recommendation

  • First recommendation: Excluding exempt employees did not improve the model much (\(R^2\) rose only from about 15.5% to about 19%), so we can discard the first recommendation.

                Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)     15.15      25.83        0.5864    0.5576
hours_worked    18.43      0.6746       27.32     8.611e-148   ***

Fitting linear model: weekly_earnings ~ hours_worked

Observations   Residual Std. Error   \(R^2\)   Adjusted \(R^2\)
3182           424.6                 0.1901    0.1898

  • Second recommendation: Adding more variables, for example detailed_occupation_group, helps to explain more of the variability in weekly earnings. Hours worked alone is not enough to explain weekly earnings. We need to consider more variables that influence earnings, for instance a combination of geography, profession, sex, industry and so on.

For example, adding detailed_occupation_group increases adjusted R-squared to about 32%.

                                          Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)                               568.2      35.58        15.97     3.716e-56    ***
hours_worked                              18.6       0.6629       28.06     4.479e-162   ***
as.factor(detailed_occupation_group)2     -13.7      38.45        -0.3563   0.7217
as.factor(detailed_occupation_group)3     74.54      46.22        1.613     0.1068
as.factor(detailed_occupation_group)4     94.24      49.55        1.902     0.05723
as.factor(detailed_occupation_group)5     -154.7     67.95        -2.277    0.02283      *
as.factor(detailed_occupation_group)6     -362       62.3         -5.811    6.561e-09    ***
as.factor(detailed_occupation_group)7     245.2      68.81        3.563     0.0003697    ***
as.factor(detailed_occupation_group)8     -139.9     42.43        -3.298    0.0009804    ***
as.factor(detailed_occupation_group)9     -195.3     61.14        -3.194    0.001411     **
as.factor(detailed_occupation_group)10    -149.4     36.34        -4.11     4.008e-05    ***
as.factor(detailed_occupation_group)11    -735.2     48.78        -15.07    2.487e-50    ***
as.factor(detailed_occupation_group)12    -358.8     51.87        -6.917    5.12e-12     ***
as.factor(detailed_occupation_group)13    -746.2     39.92        -18.69    1.193e-75    ***
as.factor(detailed_occupation_group)14    -717.1     49.68        -14.43    2.204e-46    ***
as.factor(detailed_occupation_group)15    -750.2     50.12        -14.97    1.096e-49    ***
as.factor(detailed_occupation_group)16    -442.7     31.96        -13.85    6.572e-43    ***
as.factor(detailed_occupation_group)17    -570.5     29.1         -19.61    8.15e-83     ***
as.factor(detailed_occupation_group)18    -719.7     108.2        -6.65     3.218e-11    ***
as.factor(detailed_occupation_group)19    -397.6     42.66        -9.321    1.626e-20    ***
as.factor(detailed_occupation_group)20    -313.6     44.11        -7.111    1.303e-12    ***
as.factor(detailed_occupation_group)21    -551.7     37.81        -14.59    2.479e-47    ***
as.factor(detailed_occupation_group)22    -536.3     37.58        -14.27    2.064e-45    ***

Fitting linear model: weekly_earnings ~ hours_worked + as.factor(detailed_occupation_group)

Observations   Residual Std. Error   \(R^2\)   Adjusted \(R^2\)
5542           551.1                 0.32      0.3173
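Adding 21 occupation dummies raises \(R^2\) almost automatically, so the comparison between models rests on adjusted \(R^2\), which penalizes the number of predictors. The sketch below applies the standard formula to the reported values. The helper name `adjusted_r2` is illustrative, and because the inputs are rounded, the results match the output only to about three decimals.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# hours_worked plus 21 occupation dummies (levels 2 to 22) gives 22 predictors.
print(round(adjusted_r2(0.32, 5542, 22), 4))

# Single-predictor hours_worked model, for comparison.
print(round(adjusted_r2(0.1555, 5542, 1), 4))
```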

4. Weekly earnings by age

A. \(\text{weekly\_earnings} = b_1 \cdot \text{age} + b_0\)

B. Model

Weekly Earnings = 9.1941 × age + 548.9457

                Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)     548.9      28.24        19.44     1.681e-81   ***
age             9.194      0.6306       14.58     2.787e-47   ***

Fitting linear model: weekly_earnings ~ age

Observations   Residual Std. Error   \(R^2\)   Adjusted \(R^2\)
5542           654.6                 0.03696   0.03678

C. Naive Explanation

  • It is using only age to explain weekly earnings.
  • It does not use other variables that can influence weekly earnings. For example, the effect of age may differ by gender.
  • It does not look at any interaction between age and other variables.

D. Visualization

E. Additional Explanation

  • The visualization in part D shows that the relationship between weekly earnings and age is not quite linear. It is very scattered, even though the p-value shows that the relationship is significant.
  • The variability explained by age alone is very small, only about 3.7%.
  • The residual standard error is 654.6, while the model intercept is 548.9457. Note that the intercept is the fitted earnings at age 0, not the mean of weekly earnings. The ratio 654.6/548.95 = 1.19 means the typical prediction error is about 119% of the intercept, so predictions from this model are very imprecise.
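The size of the residual standard error relative to a prediction depends on which fitted value it is compared against. The sketch below computes the ratio against the intercept and against fitted values at a few ages. The ages are illustrative choices, not values taken from the data.

```python
intercept = 548.9457   # fitted earnings at age 0
slope = 9.1941         # dollars per additional year of age
rse = 654.6            # residual standard error of the age model

print(f"RSE / intercept = {rse / intercept:.2f}")

for age in (25, 40, 55):   # illustrative ages, not taken from the data
    fitted = intercept + slope * age
    print(f"age {age}: fitted {fitted:.0f}, RSE / fitted = {rse / fitted:.2f}")
```

Even at working ages the ratio stays well above one half, which supports the conclusion that age alone predicts weekly earnings poorly.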

5. Aggregated model

A. weekly_earnings = b0 + b1(age) + b2(education) + b3(hours_worked) + b4(hourly_non_hourly) + b5(occupation_group) + b6(industry)

B. Model

                                  Estimate   Std. Error   t value    Pr(>|t|)
(Intercept)                       -222.1     229.9        -0.9661    0.334
age                               5.691      0.5049       11.27      3.788e-29    ***
as.factor(education)32            -151.4     306.1        -0.4945    0.621
as.factor(education)33            -92.8      230          -0.4034    0.6867
as.factor(education)34            -27.13     237.3        -0.1143    0.909
as.factor(education)35            -32.97     224.9        -0.1466    0.8834
as.factor(education)36            98.48      217.3        0.4531     0.6505
as.factor(education)37            50.51      214.3        0.2357     0.8136
as.factor(education)38            24.02      222.7        0.1079     0.9141
as.factor(education)39            109.9      207.1        0.5304     0.5959
as.factor(education)40            123.5      207.4        0.5957     0.5514
as.factor(education)41            157.2      208.9        0.7526     0.4517
as.factor(education)42            191.6      208.5        0.9191     0.3581
as.factor(education)43            367        207.5        1.769      0.07702
as.factor(education)44            469.9      208.4        2.255      0.02416      *
as.factor(education)45            732.2      213.7        3.426      0.0006159    ***
as.factor(education)46            655.3      213          3.076      0.002108     **
hours_worked                      15.99      0.619        25.83      5.511e-139   ***
as.factor(hourly_non_hourly)2     245.2      16.24        15.1       1.715e-50    ***
as.factor(occupation_group)2      -72.59     23.34        -3.111     0.001877     **
as.factor(occupation_group)3      -297.1     27.41        -10.84     4.173e-27    ***
as.factor(occupation_group)4      -220.4     30.75        -7.167     8.696e-13    ***
as.factor(occupation_group)5      -344.9     25.66        -13.45     1.42e-40     ***
as.factor(occupation_group)6      -237.7     128.1        -1.855     0.0636
as.factor(occupation_group)7      -99.26     49.19        -2.018     0.04365      *
as.factor(occupation_group)8      -63.14     41.08        -1.537     0.1243
as.factor(occupation_group)9      -285.3     37.97        -7.514     6.677e-14    ***
as.factor(occupation_group)10     -275.6     36.79        -7.491     7.908e-14    ***
as.factor(industry)2              413.1      114.7        3.6        0.0003205    ***
as.factor(industry)3              192.2      100          1.921      0.05474
as.factor(industry)4              239.3      93.98        2.546      0.01092      *
as.factor(industry)5              79.46      94.14        0.844      0.3987
as.factor(industry)6              269.3      97.2         2.77       0.005619     **
as.factor(industry)7              221.2      103.2        2.142      0.0322       *
as.factor(industry)8              179.7      95.11        1.89       0.05885
as.factor(industry)9              231.7      93.66        2.474      0.01341      *
as.factor(industry)10             53.33      92.7         0.5753     0.5651
as.factor(industry)11             26.67      94.33        0.2828     0.7774
as.factor(industry)12             -7.591     97.4         -0.07794   0.9379
as.factor(industry)13             198        94.78        2.089      0.03678      *

Fitting linear model: weekly_earnings ~ age + as.factor(education) + hours_worked + as.factor(hourly_non_hourly) + as.factor(occupation_group) + as.factor(industry)

Observations   Residual Std. Error   \(R^2\)   Adjusted \(R^2\)
5542           504.6                 0.4316    0.4276

C. Validation of Collinearity

Some pairs of variables show problematically high correlations: hourly_non_hourly & occupation_group, hourly_non_hourly & education, and occupation_group & industry.
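The report does not state how these correlations were measured. One common option for two categorical variables is Cramér's V, computed from a contingency table of counts, where 0 means no association and 1 means perfect association. The sketch below is a plain-Python version under that assumption. The function name `cramers_v` and the toy counts are illustrative and do not come from the survey.

```python
import math

def cramers_v(table: list[list[float]]) -> float:
    """Cramér's V for a contingency table of counts (rows x columns).

    Assumes every row and column total is positive.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))

# Toy counts, NOT from the survey: rows = two pay types, columns = three groups.
toy = [[40, 25, 5],
       [10, 25, 45]]
print(round(cramers_v(toy), 3))
```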

D. Ranges Judgement

  • The model works better for education levels equal to or higher than a Master's degree. In the output, the significant p-values are those for levels 44 and above, which correspond to Master's, Professional School and PhD.
  • Industries 2, 4, 6, 7, 9 and 13 are the only significant ones at the 0.05 level. However, it may be too early to say that industry 2 (mining) is the most relevant, because it has few observations compared to the other industries.

E. Hypothetical Observation

Observations that fit the ranges in part D: Education = 45 and Industry = 4.

  • Prediction Results
       1        2        3        4        5        6        7        8 
1794.791 1947.052 1880.151 1827.548 1912.365 2027.276 2175.233 1674.571 
       9 
2044.931 
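Each predicted value is the intercept plus the coefficients that apply to the observation. The sketch below assembles one such prediction by hand from the rounded coefficients of the aggregated model, with Education = 45 and Industry = 4 as in part E. The remaining inputs (age 40, 40 hours worked, hourly_non_hourly level 2, reference occupation group) are hypothetical, so the result is not one of the nine predictions listed.

```python
# Coefficients copied from the aggregated model output (rounded as reported).
coef = {
    "intercept": -222.1,
    "age": 5.691,                  # per year of age
    "education_45": 732.2,         # offset vs. the reference education level
    "hours_worked": 15.99,         # per hour worked
    "hourly_non_hourly_2": 245.2,  # offset vs. hourly_non_hourly level 1
    "industry_4": 239.3,           # offset vs. the reference industry
}

# Hypothetical inputs: age, hours and pay type are assumptions for illustration.
# occupation_group is left at its reference level, so it adds nothing.
age, hours = 40, 40

prediction = (coef["intercept"]
              + coef["age"] * age
              + coef["education_45"]
              + coef["hours_worked"] * hours
              + coef["hourly_non_hourly_2"]
              + coef["industry_4"])
print(round(prediction, 2))
```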

2019-11-24