# Global knitr chunk options: hide messages and warnings, keep errors visible
knitr::opts_chunk$set(prompt=FALSE,
                      message=FALSE,
                      warning=FALSE,
                      error=TRUE,
                      eval=TRUE)

Introduction

This project evaluates the effect of telecommuting on weekly earnings using general linear models (GLM).

Packages Required

Below are the packages used in this project.

Import data

Import the data from a CSV file.
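The import step is not echoed in the output, so the call below is only a minimal sketch. The file name is a placeholder, while the data frame name tele matches the model calls later in the report.

# Read the survey data; "telecommute.csv" is a placeholder file name
tele <- read.csv("telecommute.csv")

# Quick look at the imported columns
str(tele)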

Question 1

  1. Using a simple one-way ANOVA, answer the following questions:

Answers:

    1. Based on the one-way ANOVA model, the p-value of the as.factor(telecommute) variable is very small (p < 0.001; reported as < 2e-16 in the output below). This indicates that teleworking has a significant effect on weekly earnings.
    2. In addition to the model, the relationship between teleworking and earnings is summarized with a table of group means and a boxplot. Average weekly earnings with teleworking are 1,183 dollars, higher than average earnings without teleworking (832.4 dollars); the Tukey comparison below puts the difference at roughly 351 dollars. A code sketch for this analysis follows the output.
##                          Df    Sum Sq   Mean Sq F value Pr(>F)    
## as.factor(telecommute)    1 1.451e+08 145093488   346.5 <2e-16 ***
## Residuals              5540 2.320e+09    418730                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = weekly_earnings ~ as.factor(telecommute), data = tele)
## 
## $`as.factor(telecommute)`
##          diff       lwr       upr p adj
## 2-1 -350.7614 -387.7015 -313.8213     0
as.factor(telecommute)   average
1                           1183
2                          832.4
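The output and the group means above can be reproduced with calls of the following form. This is a minimal sketch: the aov formula matches the Fit line in the Tukey output, while the object name, the aggregate() call, and the base-R boxplot are assumptions.

# One-way ANOVA of weekly earnings by telecommuting status
tele_aov <- aov(weekly_earnings ~ as.factor(telecommute), data = tele)
summary(tele_aov)

# Pairwise comparison of the two telecommute groups
TukeyHSD(tele_aov)

# Group means reported in the table above
aggregate(weekly_earnings ~ as.factor(telecommute), data = tele, FUN = mean)

# Boxplot of earnings by telecommute status (answer 2)
boxplot(weekly_earnings ~ as.factor(telecommute), data = tele)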

Question 2

  1. Build a simple regression model estimating weekly earnings by hours worked. Do not make any transformations at this point—only generate a naïve model and answer the following questions:

Answers:

    1. weekly_earnings = beta_0 + beta_1 * hours_worked
    2. weekly_earnings = 66 + 22.59 * hours_worked
    3. 1) The R-squared is 0.16 (p < 0.001), so only about 16% of the variance in weekly earnings is explained by hours worked. 2) The model assumes a linear relationship between the two variables, but weekly earnings is clearly not a simple linear function of hours worked. 3) Intuitively, the longer you work the more you earn, but earnings also depend on other factors such as location (state), education, and occupation.
    4. Whether a worker is paid hourly or non-hourly should also be included in the regression model. For hourly workers there is a direct relationship between weekly earnings and hours worked, whereas for non-hourly workers it is hard to find a clear relationship between the two.
    5. A multiple regression was fit with hourly_non_hourly treated as a categorical variable via as.factor (see the summaries and the code sketch after the output below). The R-squared increases from 0.16 for the naive model to 0.286 (p < 0.001), meaning the model explains 28.6% of the variance in weekly earnings from hours worked and the hourly/non-hourly factor.
## 
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = tele)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1758.0  -411.2  -185.1   230.4  2796.0 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   66.0433    28.5766   2.311   0.0209 *  
## hours_worked  22.5887     0.7072  31.943   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 613 on 5540 degrees of freedom
## Multiple R-squared:  0.1555, Adjusted R-squared:  0.1554 
## F-statistic:  1020 on 1 and 5540 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = weekly_earnings ~ hours_worked + as.factor(hourly_non_hourly), 
##     data = tele)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1819.8  -331.6  -135.5   213.8  2728.0 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    23.0556    26.3076   0.876    0.381    
## hours_worked                   18.2135     0.6645  27.410   <2e-16 ***
## as.factor(hourly_non_hourly)2 498.5295    15.6472  31.861   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 563.5 on 5539 degrees of freedom
## Multiple R-squared:  0.2863, Adjusted R-squared:  0.2861 
## F-statistic:  1111 on 2 and 5539 DF,  p-value: < 2.2e-16
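Both summaries above correspond to lm() calls of the following form (a minimal sketch; the object names are assumptions):

# Naive model: weekly earnings as a function of hours worked only
mod_hours <- lm(weekly_earnings ~ hours_worked, data = tele)
summary(mod_hours)

# Adding hourly/non-hourly pay status as a factor
mod_hours_pay <- lm(weekly_earnings ~ hours_worked + as.factor(hourly_non_hourly),
                    data = tele)
summary(mod_hours_pay)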

Question 3

  1. Build a simple regression model estimating weekly earnings as a function of age. Do not make any transformations at this point—only generate a naïve model and answer the following questions:

Answers:

    1. weekly_earnings = beta_0 + beta_1 * age
    2. weekly_earnings = 549 + 9.19 * age
    3. 1) The R-squared is 0.037 (p < 0.001), so only 3.7% of the variance in weekly earnings is explained by age; the model is too simple. 2) The model assumes the relationship is linear, which is not the case here. 3) Weekly earnings also depend on state, education, occupation, and so on, so they are not a simple linear function of age.
    4. A scatter plot with a fitted linear trend was used to check linearity; it shows weekly earnings increasing only slightly with age. In addition, a model treating age as a categorical variable (via as.factor) was fit to see whether it does better: the R-squared increases from 0.037 to 0.09 (p < 0.001), and a chi-square test comparing the two models also suggests that the factor model fits better (see the code sketch after the output below).
    5. Age alone, even as a factor, explains only 9% of the variance in weekly earnings. Earnings clearly relate to several other factors such as hours worked, location, occupation, and education, so multiple regression should be considered as one solution here, with chi-square tests used to check which model fits better.
## 
## Call:
## lm(formula = weekly_earnings ~ age, data = tele)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1245.3  -445.3  -178.1   284.7  2133.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 548.9457    28.2350   19.44   <2e-16 ***
## age           9.1941     0.6306   14.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 654.6 on 5540 degrees of freedom
## Multiple R-squared:  0.03696,    Adjusted R-squared:  0.03678 
## F-statistic: 212.6 on 1 and 5540 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = weekly_earnings ~ as.factor(age), data = tele)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1187.7  -437.6  -145.6   258.2  2420.7 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        162.08     261.30   0.620 0.535092    
## as.factor(age)16    10.18     320.03   0.032 0.974637    
## as.factor(age)17    82.89     303.94   0.273 0.785086    
## as.factor(age)18    99.58     280.22   0.355 0.722329    
## as.factor(age)19   135.53     281.18   0.482 0.629806    
## as.factor(age)20   249.07     274.27   0.908 0.363848    
## as.factor(age)21   248.47     271.83   0.914 0.360715    
## as.factor(age)22   301.79     269.97   1.118 0.263669    
## as.factor(age)23   424.47     268.46   1.581 0.113909    
## as.factor(age)24   458.29     267.92   1.711 0.087218 .  
## as.factor(age)25   558.76     267.17   2.091 0.036540 *  
## as.factor(age)26   592.74     267.50   2.216 0.026744 *  
## as.factor(age)27   639.57     267.40   2.392 0.016800 *  
## as.factor(age)28   704.27     267.00   2.638 0.008371 ** 
## as.factor(age)29   733.31     266.80   2.748 0.006007 ** 
## as.factor(age)30   765.64     267.22   2.865 0.004183 ** 
## as.factor(age)31   733.54     266.84   2.749 0.005998 ** 
## as.factor(age)32   876.93     267.86   3.274 0.001068 ** 
## as.factor(age)33   775.30     266.51   2.909 0.003640 ** 
## as.factor(age)34   767.55     266.84   2.876 0.004038 ** 
## as.factor(age)35   864.38     267.09   3.236 0.001218 ** 
## as.factor(age)36   744.06     267.86   2.778 0.005492 ** 
## as.factor(age)37   732.49     268.03   2.733 0.006299 ** 
## as.factor(age)38   768.71     267.86   2.870 0.004123 ** 
## as.factor(age)39   918.00     268.53   3.419 0.000634 ***
## as.factor(age)40   874.05     268.15   3.260 0.001123 ** 
## as.factor(age)41   976.63     268.27   3.640 0.000275 ***
## as.factor(age)42   882.03     268.81   3.281 0.001040 ** 
## as.factor(age)43   786.62     268.27   2.932 0.003380 ** 
## as.factor(age)44   948.26     268.09   3.537 0.000408 ***
## as.factor(age)45   855.30     267.36   3.199 0.001386 ** 
## as.factor(age)46   977.56     268.40   3.642 0.000273 ***
## as.factor(age)47   774.97     268.15   2.890 0.003867 ** 
## as.factor(age)48   900.31     267.92   3.360 0.000784 ***
## as.factor(age)49   848.85     268.03   3.167 0.001549 ** 
## as.factor(age)50   910.43     266.84   3.412 0.000650 ***
## as.factor(age)51   942.30     268.53   3.509 0.000453 ***
## as.factor(age)52   995.68     267.50   3.722 0.000200 ***
## as.factor(age)53   887.22     267.45   3.317 0.000915 ***
## as.factor(age)54   950.65     267.86   3.549 0.000390 ***
## as.factor(age)55  1028.78     267.70   3.843 0.000123 ***
## as.factor(age)56   846.52     267.65   3.163 0.001571 ** 
## as.factor(age)57   876.20     267.92   3.270 0.001081 ** 
## as.factor(age)58   901.67     267.50   3.371 0.000755 ***
## as.factor(age)59   927.92     269.26   3.446 0.000573 ***
## as.factor(age)60   887.80     268.95   3.301 0.000970 ***
## as.factor(age)61   877.69     268.81   3.265 0.001101 ** 
## as.factor(age)62   782.45     270.26   2.895 0.003805 ** 
## as.factor(age)63   880.58     271.17   3.247 0.001172 ** 
## as.factor(age)64   928.62     273.65   3.393 0.000695 ***
## as.factor(age)65   809.06     274.71   2.945 0.003242 ** 
## as.factor(age)66   886.10     282.24   3.140 0.001701 ** 
## as.factor(age)67   947.19     282.24   3.356 0.000796 ***
## as.factor(age)68   955.83     286.24   3.339 0.000846 ***
## as.factor(age)69   866.46     296.29   2.924 0.003466 ** 
## as.factor(age)70   470.48     292.14   1.610 0.107360    
## as.factor(age)71   910.85     312.32   2.916 0.003555 ** 
## as.factor(age)72   721.59     356.09   2.026 0.042772 *  
## as.factor(age)73   154.62     324.84   0.476 0.634112    
## as.factor(age)74   759.98     330.52   2.299 0.021525 *  
## as.factor(age)75   767.81     356.09   2.156 0.031113 *  
## as.factor(age)76   200.52     356.09   0.563 0.573390    
## as.factor(age)77   114.61     387.57   0.296 0.767464    
## as.factor(age)78  1246.58     387.57   3.216 0.001306 ** 
## as.factor(age)79   320.48     452.59   0.708 0.478911    
## as.factor(age)80   680.27     337.34   2.017 0.043790 *  
## as.factor(age)85   796.24     345.67   2.303 0.021290 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 640.1 on 5475 degrees of freedom
## Multiple R-squared:  0.09003,    Adjusted R-squared:  0.07906 
## F-statistic: 8.207 on 66 and 5475 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Model 1: weekly_earnings ~ age
## Model 2: weekly_earnings ~ as.factor(age)
##   Res.Df        RSS Df Sum of Sq  Pr(>Chi)    
## 1   5540 2373770588                           
## 2   5475 2242958893 65 130811695 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
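The output above (the naive model, the factor-coded model, and the model comparison) can be produced with calls along these lines. This is a minimal sketch; the object names and the base-R scatter plot are assumptions.

# Naive linear model: earnings on age
mod_age_lin <- lm(weekly_earnings ~ age, data = tele)
summary(mod_age_lin)

# Scatter plot with the fitted linear trend, to eyeball linearity
plot(weekly_earnings ~ age, data = tele)
abline(mod_age_lin)

# Age treated as a categorical variable
mod_age_fac <- lm(weekly_earnings ~ as.factor(age), data = tele)
summary(mod_age_fac)

# Nested-model comparison via a chi-square test
anova(mod_age_lin, mod_age_fac, test = "Chisq")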

Question 4

  1. Modify your model from Q3 by adding at least 3 other IVs to the regression and transforming the age variable (and others) as necessary to meet the linearity assumption. Interpret your results and answer the following questions:

Answers:

    1. weekly_earnings = beta_0 + beta_1 * hours_worked + beta_2 * age + beta_3 * state + beta_4 * detailed_occupation_group + beta_5 * education
    2. weekly_earnings = -3206 + 20.45 * hours_worked + 7.43 * age - 2.46 * state - 17.97 * detailed_occupation_group + 80.94 * education
    3. Based on the estimated coefficients, none of the independent variables appear to be collinear: each variable keeps a distinct, significant coefficient in the joint model.
    4. In general, the observed minimum and maximum give an appropriate range for both the IVs and the DV: weekly_earnings [7.5, 2885], hours_worked [1, 99], age [15, 85], state [1, 56], detailed_occupation_group [1, 22], education [31, 46]. The ranges are also presented in the table below.
    5. Assume an individual aged 35 who works 40 hours per week, with state 36 (New York), detailed_occupation_group 3 (computer and mathematical science occupations), and education 46 (doctorate). The estimated weekly earnings from the equation in answer 2 are about 1,452.82 dollars.
      The estimated range for this individual is [606, 2745], obtained by holding age, education, and occupation fixed at those values while letting hours worked and state vary over their observed ranges (a code sketch for this calculation appears at the end of the section).
## 
## Call:
## lm(formula = weekly_earnings ~ hours_worked + age + state + detailed_occupation_group + 
##     education, data = tele)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1658.0  -343.8  -101.1   224.2  2931.9 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -3205.7797   152.0913 -21.078  < 2e-16 ***
## hours_worked                 20.4492     0.6219  32.883  < 2e-16 ***
## age                           7.4271     0.5198  14.288  < 2e-16 ***
## state                        -2.4569     0.4506  -5.452 5.19e-08 ***
## detailed_occupation_group   -17.9736     1.1633 -15.451  < 2e-16 ***
## education                    80.9430     3.5009  23.121  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 536.5 on 5536 degrees of freedom
## Multiple R-squared:  0.3535, Adjusted R-squared:  0.3529 
## F-statistic: 605.3 on 5 and 5536 DF,  p-value: < 2.2e-16
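The regression above and the variable ranges tabulated below can be generated with calls of this form (a minimal sketch; the object name and the sapply() range summary are assumptions):

# Multiple regression: age plus hours worked, state, occupation group, and education
mod_q4 <- lm(weekly_earnings ~ hours_worked + age + state +
               detailed_occupation_group + education, data = tele)
summary(mod_q4)

# Observed min/max of the DV and each IV (the ranges shown in the table below)
sapply(tele[, c("weekly_earnings", "hours_worked", "age", "state",
                "detailed_occupation_group", "education")], range)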
variable                    min   max
hours_worked                  1    99
age                          15    85
state                         1    56
detailed_occupation_group     1    22
education                    31    46
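Below is a sketch of how the point estimate and the [606, 2745] range in answer 5 can be computed from the fitted model; the object name mod_q4 from the sketch above is an assumption.

# Point estimate for the individual in answer 5:
# age 35, 40 hours worked, state 36, occupation group 3, education 46
new_obs <- data.frame(hours_worked = 40, age = 35, state = 36,
                      detailed_occupation_group = 3, education = 46)
predict(mod_q4, newdata = new_obs)   # about 1452.82 dollars

# Range when age, occupation, and education stay fixed while hours worked
# and state vary over their observed extremes
grid <- expand.grid(hours_worked = c(1, 99), state = c(1, 56),
                    age = 35, detailed_occupation_group = 3, education = 46)
range(predict(mod_q4, newdata = grid))   # about 606 to 2745 dollars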