# Import the data from a web-hosted source
library(readr)  # provides read_csv()
telew <- read_csv("http://asayanalytics.com/telework_csv")
telew$state <- as.character(telew$state)
telew$education <- as.factor(telew$education)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(telecommute) 1 1.451e+08 145093488 346.5 <2e-16 ***
## Residuals 5540 2.320e+09 418730
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Telework does show an effect on weekly earnings: the p-value (< 2e-16) is highly significant.
1C This model is naive because it looks at only one variable, telecommuting. Other variables in the dataset could change the results.
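The call that produced the one-way ANOVA table for question 1 is not shown in this output. A minimal sketch of how such a table could be produced follows. It assumes the column names `weekly_earnings` and `telecommute` that appear elsewhere in the output, and it uses a small simulated data frame (the hypothetical `demo`, with made-up values) so that it runs without downloading anything.

```r
# Simulated stand-in data; the values are invented for illustration only.
set.seed(1)
demo <- data.frame(telecommute = rep(c(0, 1), each = 50))
demo$weekly_earnings <- 900 - 150 * demo$telecommute + rnorm(100, sd = 100)

# One-way ANOVA: does mean weekly earnings differ by telecommute status?
fit1 <- aov(weekly_earnings ~ as.factor(telecommute), data = demo)

# summary() returns a list; its first element is the ANOVA table
# with columns Df, Sum Sq, Mean Sq, F value and Pr(>F).
tab1 <- summary(fit1)[[1]]
print(tab1)
```

With the real data, `demo` would be replaced by `telew`.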
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(telecommute) 1 1.451e+08 145093488 359.0 <2e-16 ***
## sex 1 8.142e+07 81418057 201.5 <2e-16 ***
## Residuals 5539 2.238e+09 404107
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2B I chose sex because I wanted to see whether there was any difference between males and females related to telecommuting. I hear of more and more companies offering telecommuting, and I was curious whether it was offered to males more than to females.
2D Yes, the residual mean square fell from 418730 to 404107.
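To put the change in residual mean square in perspective, the short calculation below expresses it as a percentage. It uses only the two Mean Sq values printed on the Residuals rows of the ANOVA tables; the variable names are illustrative.

```r
# Residual Mean Sq values copied from the two ANOVA tables
ms_one <- 418730   # telecommute only
ms_two <- 404107   # telecommute + sex

drop <- ms_one - ms_two        # absolute reduction: 14623
pct  <- 100 * drop / ms_one    # reduction as a percentage of the original

round(pct, 1)                  # about 3.5 percent
```

The reduction is real but modest, so most of the unexplained variation in weekly earnings remains after sex is added.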
##
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = telew)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1758.0 -411.2 -185.1 230.4 2796.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.0433 28.5766 2.311 0.0209 *
## hours_worked 22.5887 0.7072 31.943 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 613 on 5540 degrees of freedom
## Multiple R-squared: 0.1555, Adjusted R-squared: 0.1554
## F-statistic: 1020 on 1 and 5540 DF, p-value: < 2.2e-16
3A WE = b0 + b1 * hours_worked
3B WE = 66.04 + 22.59 * hours_worked
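As a worked example of reading the fitted equation, the sketch below plugs a 40-hour week into it. The coefficients are copied from the regression output; the helper name `predict_we` and the 40-hour input are illustrative choices, not part of the original analysis.

```r
# Coefficients copied from the lm() output for weekly_earnings ~ hours_worked
b0 <- 66.0433    # intercept
b1 <- 22.5887    # change in weekly earnings per additional hour worked

# Hypothetical helper: predicted weekly earnings for a given number of hours
predict_we <- function(hours) b0 + b1 * hours

predict_we(40)   # 66.0433 + 22.5887 * 40 = 969.5913
```

Each additional hour worked is associated with about 22.59 more in weekly earnings under this simple model.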
3C The model is naive because it does not consider the other variables in the dataset. Variables such as hourly/non-hourly and industry could possibly change the model results.
3D Adding another variable helps display the relationship between weekly earnings and hours worked more accurately. Below I added hourly/non-hourly.
##
## Call:
## lm(formula = weekly_earnings ~ hours_worked + hourly_non_hourly,
## data = telew)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1819.8 -331.6 -135.5 213.8 2728.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -475.4739 31.2913 -15.20 <2e-16 ***
## hours_worked 18.2135 0.6645 27.41 <2e-16 ***
## hourly_non_hourly 498.5295 15.6472 31.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 563.5 on 5539 degrees of freedom
## Multiple R-squared: 0.2863, Adjusted R-squared: 0.2861
## F-statistic: 1111 on 2 and 5539 DF, p-value: < 2.2e-16
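One way to check whether the added predictor actually improves the model is to compare the two nested fits directly. The sketch below does this with a partial F-test and a side-by-side R-squared comparison. It runs on simulated data (the hypothetical `demo`, with invented values), and the 1/2 coding of `hourly_non_hourly` is an assumption, since the output does not show how that variable is coded.

```r
# Simulated stand-in data; values and the 1/2 coding are assumptions.
set.seed(2)
n <- 200
demo <- data.frame(
  hours_worked      = runif(n, 10, 60),
  hourly_non_hourly = sample(1:2, n, replace = TRUE)
)
demo$weekly_earnings <- -400 + 18 * demo$hours_worked +
  500 * demo$hourly_non_hourly + rnorm(n, sd = 300)

# Nested models: the second adds one predictor to the first
m_small <- lm(weekly_earnings ~ hours_worked, data = demo)
m_big   <- lm(weekly_earnings ~ hours_worked + hourly_non_hourly, data = demo)

# Partial F-test: does the extra predictor reduce the residual sum of squares?
cmp <- anova(m_small, m_big)
print(cmp)

# R-squared for both models, for a quick side-by-side comparison
r2 <- c(small = summary(m_small)$r.squared, big = summary(m_big)$r.squared)
print(r2)
```

In the actual output, the same comparison is visible in the rise of Multiple R-squared from 0.1555 to 0.2863 and the fall of the residual standard error from 613 to 563.5.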
#4A
##
## Call:
## lm(formula = weekly_earnings ~ age, data = telew)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1245.3 -445.3 -178.1 284.7 2133.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 548.9457 28.2350 19.44 <2e-16 ***
## age 9.1941 0.6306 14.58 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 654.6 on 5540 degrees of freedom
## Multiple R-squared: 0.03696, Adjusted R-squared: 0.03678
## F-statistic: 212.6 on 1 and 5540 DF, p-value: < 2.2e-16
#4A #WE = b0 + b1 * age
#4B #WE = 548.95 + 9.19 * age
#4C #The model is naive because it does not consider the other variables in the dataset.
#4D #When the model is checked with the plot function, the residuals show a parabolic pattern, so a straight line does not fully describe the relationship between age and weekly earnings.
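A parabolic residual pattern is often handled by adding a squared term to the model. The sketch below illustrates the idea on simulated data (the hypothetical `demo`, with invented values in which earnings rise and then fall with age). Whether a quadratic term is the right remedy for the real data is an assumption that would need to be checked against `telew`.

```r
# Simulated stand-in data with a deliberately curved age/earnings relationship
set.seed(3)
age <- round(runif(300, 18, 70))
weekly_earnings <- 200 + 60 * age - 0.65 * age^2 + rnorm(300, sd = 150)
demo <- data.frame(age, weekly_earnings)

m_lin  <- lm(weekly_earnings ~ age, data = demo)             # straight line
m_quad <- lm(weekly_earnings ~ age + I(age^2), data = demo)  # adds a squared term

# When run as a script, draw to a null device so no Rplots.pdf is written
if (!interactive()) pdf(NULL)

# Residuals vs fitted values for the linear model; a curve here signals misfit
plot(fitted(m_lin), resid(m_lin), xlab = "Fitted", ylab = "Residual")
abline(h = 0, lty = 2)

# Compare adjusted R-squared of the two fits
adj <- c(linear    = summary(m_lin)$adj.r.squared,
         quadratic = summary(m_quad)$adj.r.squared)
print(adj)
```

`I(age^2)` is needed because `^` has a special meaning inside an R model formula; wrapping it in `I()` makes R treat it as ordinary arithmetic.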
#5
##
## Call:
## lm(formula = weekly_earnings ~ age + sex + education + hourly_non_hourly +
## telecommute, data = telew)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1912.77 -335.99 -79.74 235.30 2306.33
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 334.3109 230.3405 1.451 0.14673
## age 6.2963 0.5389 11.684 < 2e-16 ***
## sex -235.5380 14.8961 -15.812 < 2e-16 ***
## education32 13.2233 332.7601 0.040 0.96830
## education33 25.2158 249.8495 0.101 0.91961
## education34 38.6175 257.3598 0.150 0.88073
## education35 85.3745 243.9379 0.350 0.72636
## education36 103.7881 236.1498 0.440 0.66032
## education37 158.4271 232.7226 0.681 0.49606
## education38 98.7581 241.9673 0.408 0.68318
## education39 275.7300 224.8574 1.226 0.22016
## education40 282.3217 225.0231 1.255 0.20966
## education41 355.0099 226.7153 1.566 0.11743
## education42 379.2753 226.2046 1.677 0.09366 .
## education43 574.0971 224.9565 2.552 0.01074 *
## education44 680.9568 225.7510 3.016 0.00257 **
## education45 958.5711 231.3593 4.143 3.48e-05 ***
## education46 901.2189 230.5840 3.908 9.40e-05 ***
## hourly_non_hourly 360.3897 16.7336 21.537 < 2e-16 ***
## telecommute -143.1313 16.8648 -8.487 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 549.3 on 5522 degrees of freedom
## Multiple R-squared: 0.3241, Adjusted R-squared: 0.3218
## F-statistic: 139.4 on 19 and 5522 DF, p-value: < 2.2e-16
#5a #WE = B0 + B1 * age + B2 * sex + B3 * education + B4 * hourly_non_hourly + B5 * telecommute
#5c Based on the VIF results, I don't think there is any high correlation among the predictors.
#5d I would be comfortable using this output; however, the education variable doesn't always return strong significance.
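The VIF results mentioned in 5c are not shown in this output, and the function used to get them is unknown; `car::vif()` is a common choice. The sketch below instead computes VIFs by hand to show what the number means: each predictor is regressed on the others, and VIF = 1 / (1 - R^2). It uses simulated numeric predictors (the hypothetical `demo`, with invented values), and `vif_one` is an illustrative helper. Factor predictors such as education need a generalized VIF, which this sketch does not cover.

```r
# Simulated stand-in predictors; values and codings are assumptions.
set.seed(4)
n <- 500
demo <- data.frame(
  age         = runif(n, 18, 70),
  sex         = sample(1:2, n, replace = TRUE),
  telecommute = sample(0:1, n, replace = TRUE)
)

# VIF for one predictor: regress it on all the others, then 1 / (1 - R^2)
vif_one <- function(var, data) {
  others <- setdiff(names(data), var)
  r2 <- summary(lm(reformulate(others, response = var), data = data))$r.squared
  1 / (1 - r2)
}

vifs <- sapply(names(demo), vif_one, data = demo)
print(round(vifs, 2))
```

A VIF of 1 means a predictor is uncorrelated with the others. Common rules of thumb treat values above roughly 5 or 10 as a sign of problematic multicollinearity.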