#Import the data from a web-hosted source
library(readr)  # provides read_csv()
telew <- read_csv("http://asayanalytics.com/telework_csv")

#Store state as text and education as a categorical variable
telew$state <- as.character(telew$state)
telew$education <- as.factor(telew$education)

1A

##                          Df    Sum Sq   Mean Sq F value Pr(>F)    
## as.factor(telecommute)    1 1.451e+08 145093488   346.5 <2e-16 ***
## Residuals              5540 2.320e+09    418730                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Telework does show an effect on weekly earnings: the p-value (<2e-16) is far below 0.05, so the result is statistically significant.
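
A one-way ANOVA of this form can be sketched as follows. The data are simulated stand-ins, not the `telew` data, and the group codes (1 and 2) and group means are made-up values, so only the shape of the call carries over.

```r
# Simulated stand-in for the telework data (all values are hypothetical)
set.seed(1)
n <- 200
sim <- data.frame(telecommute = rep(c(1, 2), each = n / 2))
# Give the two groups different mean earnings plus random noise
sim$weekly_earnings <- ifelse(sim$telecommute == 1, 1200, 900) +
  rnorm(n, sd = 300)

# One-way ANOVA: does mean weekly earnings differ by telecommute group?
fit_1a <- aov(weekly_earnings ~ as.factor(telecommute), data = sim)
tab_1a <- summary(fit_1a)[[1]]
print(tab_1a)

# p-value for the telecommute term (first row of the ANOVA table)
p_1a <- tab_1a[["Pr(>F)"]][1]
p_1a < 0.05
```

A p-value below 0.05 leads to the same conclusion as in 1A: the group means differ by more than chance alone would explain.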

1B

1C This model is naive because it considers only one explanatory variable, teleworking. Other variables in the dataset, such as sex, education, or hours worked, may also affect weekly earnings and could change the estimated effect of teleworking.

2A

##                          Df    Sum Sq   Mean Sq F value Pr(>F)    
## as.factor(telecommute)    1 1.451e+08 145093488   359.0 <2e-16 ***
## sex                       1 8.142e+07  81418057   201.5 <2e-16 ***
## Residuals              5539 2.238e+09    404107                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

2B I chose sex because I wanted to see whether there is any difference between males and females related to telecommuting. I hear of more and more companies offering telecommuting, and I was curious whether it is offered to males more than to females.

2C

2D Yes, the residual mean square fell from 418730 to 404107.
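
The comparison in 2D can be reproduced in code by computing the residual mean square (residual sum of squares divided by residual degrees of freedom) for each model. This is a minimal sketch on simulated data; the variable codes and effect sizes are assumptions chosen for illustration.

```r
# Simulated stand-in data (values and codes are hypothetical)
set.seed(2)
n <- 400
sim <- data.frame(
  telecommute = sample(c(1, 2), n, replace = TRUE),
  sex = sample(c(1, 2), n, replace = TRUE)
)
sim$weekly_earnings <- 1500 - 250 * sim$telecommute - 200 * sim$sex +
  rnorm(n, sd = 300)

# Model with telecommute only, then with sex added
fit_one <- aov(weekly_earnings ~ as.factor(telecommute), data = sim)
fit_two <- aov(weekly_earnings ~ as.factor(telecommute) + sex, data = sim)

# Residual mean square = residual sum of squares / residual df
rms_one <- deviance(fit_one) / df.residual(fit_one)
rms_two <- deviance(fit_two) / df.residual(fit_two)
c(telecommute_only = rms_one, with_sex = rms_two)
```

A smaller residual mean square in the second model means the added variable explains variation that the first model left in the residuals.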

3

## 
## Call:
## lm(formula = weekly_earnings ~ hours_worked, data = telew)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1758.0  -411.2  -185.1   230.4  2796.0 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   66.0433    28.5766   2.311   0.0209 *  
## hours_worked  22.5887     0.7072  31.943   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 613 on 5540 degrees of freedom
## Multiple R-squared:  0.1555, Adjusted R-squared:  0.1554 
## F-statistic:  1020 on 1 and 5540 DF,  p-value: < 2.2e-16

3A
WE = b0 + b1 * hours worked

3B WE = 66.04 + 22.59 * hours worked
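
To see what the fitted equation implies, it can be evaluated for a single worker. The coefficients below are copied from the output above; the 40-hour week is an illustrative value, not taken from the data.

```r
# Coefficients copied from the lm() output above
b0 <- 66.0433   # intercept
b1 <- 22.5887   # change in weekly earnings per additional hour worked

# Hypothetical worker: 40 hours per week (illustrative value)
hours <- 40
predicted_we <- b0 + b1 * hours
predicted_we
# [1] 969.5913
```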

3C The model is naive because it does not consider the other variables in the dataset. Variables such as hourly/non-hourly status and industry could change the model results.

3D Adding another variable helps display the relationship between weekly earnings and hours worked more accurately. Below I added hourly/non-hourly status, which raises the adjusted R-squared from 0.1554 to 0.2861.

## 
## Call:
## lm(formula = weekly_earnings ~ hours_worked + hourly_non_hourly, 
##     data = telew)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1819.8  -331.6  -135.5   213.8  2728.0 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -475.4739    31.2913  -15.20   <2e-16 ***
## hours_worked        18.2135     0.6645   27.41   <2e-16 ***
## hourly_non_hourly  498.5295    15.6472   31.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 563.5 on 5539 degrees of freedom
## Multiple R-squared:  0.2863, Adjusted R-squared:  0.2861 
## F-statistic:  1111 on 2 and 5539 DF,  p-value: < 2.2e-16
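
One way to check whether the added variable improves the model is to compare adjusted R-squared and run an F test on the two nested models. The sketch below uses simulated data; the column names mirror the source, but the values and effect sizes are assumptions.

```r
# Simulated stand-in data (values are hypothetical)
set.seed(3)
n <- 500
sim <- data.frame(
  hours_worked = round(runif(n, 10, 60)),
  hourly_non_hourly = sample(c(1, 2), n, replace = TRUE)
)
sim$weekly_earnings <- -400 + 18 * sim$hours_worked +
  500 * sim$hourly_non_hourly + rnorm(n, sd = 400)

# Simple model and the model with the extra predictor
fit_simple <- lm(weekly_earnings ~ hours_worked, data = sim)
fit_added <- lm(weekly_earnings ~ hours_worked + hourly_non_hourly, data = sim)

# Adjusted R-squared penalizes extra predictors, so a rise is informative
adj_r2 <- c(
  simple = summary(fit_simple)$adj.r.squared,
  added = summary(fit_added)$adj.r.squared
)
adj_r2

# F test comparing the nested models
anova(fit_simple, fit_added)
```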

#4

## 
## Call:
## lm(formula = weekly_earnings ~ age, data = telew)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1245.3  -445.3  -178.1   284.7  2133.4 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 548.9457    28.2350   19.44   <2e-16 ***
## age           9.1941     0.6306   14.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 654.6 on 5540 degrees of freedom
## Multiple R-squared:  0.03696,    Adjusted R-squared:  0.03678 
## F-statistic: 212.6 on 1 and 5540 DF,  p-value: < 2.2e-16

#4A #WE = b0 + b1 * age

#4B #WE = 548.95 + 9.19 * age

#4C #The model is naive because it does not consider the other variables in the dataset.

#4D #Plotting the fitted model with the plot function shows that the residuals follow a parabolic (curved) pattern, which suggests the relationship between age and weekly earnings is not linear.
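
A curved residual pattern like the one described in 4D can be illustrated on simulated data. The sketch below also fits a quadratic term in age, which is one common response to such a pattern and not something the original answer did; the data and the shape of the age effect are assumptions.

```r
# Simulated data where earnings rise and then fall with age (hypothetical)
set.seed(4)
n <- 500
sim <- data.frame(age = round(runif(n, 18, 70)))
sim$weekly_earnings <- -800 + 90 * sim$age - 1 * sim$age^2 +
  rnorm(n, sd = 250)

# Straight-line model, as in the original answer
fit_linear <- lm(weekly_earnings ~ age, data = sim)

# Residuals vs fitted plot; a curved band indicates missed nonlinearity
if (interactive()) {
  plot(fit_linear, which = 1)
}

# Adding a squared age term lets the model bend
fit_quad <- lm(weekly_earnings ~ age + I(age^2), data = sim)
summary(fit_quad)$coefficients

adj_r2 <- c(
  linear = summary(fit_linear)$adj.r.squared,
  quadratic = summary(fit_quad)$adj.r.squared
)
adj_r2
```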

#5

## 
## Call:
## lm(formula = weekly_earnings ~ age + sex + education + hourly_non_hourly + 
##     telecommute, data = telew)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1912.77  -335.99   -79.74   235.30  2306.33 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        334.3109   230.3405   1.451  0.14673    
## age                  6.2963     0.5389  11.684  < 2e-16 ***
## sex               -235.5380    14.8961 -15.812  < 2e-16 ***
## education32         13.2233   332.7601   0.040  0.96830    
## education33         25.2158   249.8495   0.101  0.91961    
## education34         38.6175   257.3598   0.150  0.88073    
## education35         85.3745   243.9379   0.350  0.72636    
## education36        103.7881   236.1498   0.440  0.66032    
## education37        158.4271   232.7226   0.681  0.49606    
## education38         98.7581   241.9673   0.408  0.68318    
## education39        275.7300   224.8574   1.226  0.22016    
## education40        282.3217   225.0231   1.255  0.20966    
## education41        355.0099   226.7153   1.566  0.11743    
## education42        379.2753   226.2046   1.677  0.09366 .  
## education43        574.0971   224.9565   2.552  0.01074 *  
## education44        680.9568   225.7510   3.016  0.00257 ** 
## education45        958.5711   231.3593   4.143 3.48e-05 ***
## education46        901.2189   230.5840   3.908 9.40e-05 ***
## hourly_non_hourly  360.3897    16.7336  21.537  < 2e-16 ***
## telecommute       -143.1313    16.8648  -8.487  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 549.3 on 5522 degrees of freedom
## Multiple R-squared:  0.3241, Adjusted R-squared:  0.3218 
## F-statistic: 139.4 on 19 and 5522 DF,  p-value: < 2.2e-16

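#5a #WE = B0 + B1 * age + B2 * sex + B3 * education + B4 * hourly_non_hourly + B5 * telecommute, where education is a factor, so B3 stands for a set of 15 dummy-variable coefficients (one per education level other than the baseline).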
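
The reason a single factor produces many coefficients can be seen by inspecting the design matrix that lm() builds. This is a small sketch with a few hypothetical education codes, not the full set of levels in the data.

```r
# A few hypothetical education codes; the real data have more levels
education <- factor(c(39, 40, 43, 39, 44, 40))

# A factor becomes one 0/1 column per level except the baseline
X <- model.matrix(~ education)
colnames(X)
# [1] "(Intercept)" "education40" "education43" "education44"

# Number of dummy coefficients = number of levels minus one
n_dummies <- nlevels(education) - 1
n_dummies
# [1] 3
```

Each dummy coefficient is the difference in expected weekly earnings between that level and the baseline level, holding the other predictors fixed.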

#5c Based on the VIF results, I don’t think there is any high correlation among the predictors.
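
The source does not show how the VIF values were obtained; they are commonly computed with `vif()` from the car package. As a minimal base R sketch, the VIF of a numeric predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on the others. The data below are simulated, the helper `vif_one` is a hypothetical name, and the education factor is left out because factors need a generalized VIF.

```r
# Simulated, mutually independent predictors (values are hypothetical)
set.seed(5)
n <- 500
sim <- data.frame(
  age = round(runif(n, 18, 70)),
  sex = sample(c(1, 2), n, replace = TRUE),
  hourly_non_hourly = sample(c(1, 2), n, replace = TRUE),
  telecommute = sample(c(1, 2), n, replace = TRUE)
)

# VIF for one predictor: regress it on the others, then 1 / (1 - R^2)
vif_one <- function(var, data) {
  others <- setdiff(names(data), var)
  r2 <- summary(lm(reformulate(others, response = var), data = data))$r.squared
  1 / (1 - r2)
}

vifs <- sapply(names(sim), vif_one, data = sim)
round(vifs, 2)
```

Values close to 1 indicate little correlation among predictors; a common rule of thumb treats values above 5 or 10 as a warning sign.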

#5d I would be comfortable using this output; however, not every education level is statistically significant (only levels 43 through 46 are significant at the 0.05 level).