460 Assignment3 Tin Yun Hon
tyh6518 6-8pm

1. Consider the prediction of annual salary from age and experience.

# Find and interpret the regression equation and regression coefficients
employee_data <- read.csv('Employee_database.csv')
reg <- lm(Salary ~ Age + Experience, employee_data)
reg
## 
## Call:
## lm(formula = Salary ~ Age + Experience, data = employee_data)
## 
## Coefficients:
## (Intercept)          Age   Experience  
##     22380.6        300.6       1579.3
  1. Salary = 22380.6 + 300.6Age + 1579.3Experience. The intercept says that the expected salary is 22380.6 dollars when age and experience are both 0, which is an extrapolation rather than a realistic case. Each additional year of age is associated with an increase in salary of 300.6 dollars, on average, holding experience constant. Each additional year of experience is associated with an increase in salary of 1579.3 dollars, on average, holding age constant.
# Find and interpret the standard error of estimate. Find and interpret the coefficient of determination. Is the model significant? What does this tell you?
summary(reg)
## 
## Call:
## lm(formula = Salary ~ Age + Experience, data = employee_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -23420  -5321   1328   6785  15337 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22380.6     6749.5   3.316  0.00147 ** 
## Age            300.6      157.6   1.908  0.06068 .  
## Experience    1579.3      355.6   4.441 3.37e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8910 on 68 degrees of freedom
## Multiple R-squared:  0.3395, Adjusted R-squared:  0.3201 
## F-statistic: 17.48 on 2 and 68 DF,  p-value: 7.507e-07
  1. The standard error of estimate is 8910, which means that actual salaries typically differ from the predicted salaries by about 8910 dollars.
  2. The coefficient of determination is 0.3395, which means that 33.95% of the variation in salary is explained by age and experience.
  3. The model is significant because the F test is significant: the p-value is 7.507e-07, less than 0.05. Age and experience, taken together, explain a very highly significant amount of the variation in salary.
  4. The p-value for age is 0.06068, which is greater than the 0.05 significance level, so age is not a significant predictor of salary, holding experience constant. The p-value for experience is 3.37e-05, less than the 0.001 significance level, so there is a significant relationship between experience and salary, holding age constant.
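The figures in this summary are tied together by standard identities, so they can be cross-checked without the data file. The sketch below assumes only the rounded values printed by `summary(reg)`; the variable names are illustrative and small rounding differences are expected.

```r
# Values copied from the summary(reg) output (rounded as printed)
r2       <- 0.3395            # Multiple R-squared
k        <- 2                 # number of predictors (Age, Experience)
df_resid <- 68                # residual degrees of freedom
n        <- df_resid + k + 1  # implied number of employees

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F statistic and its p-value, rebuilt from R-squared
f_stat <- (r2 / k) / ((1 - r2) / df_resid)
p_val  <- pf(f_stat, k, df_resid, lower.tail = FALSE)

# t statistic for Age: estimate divided by its standard error
t_age <- 300.6 / 157.6
p_age <- 2 * pt(abs(t_age), df_resid, lower.tail = FALSE)

print(round(c(adj_r2 = adj_r2, f_stat = f_stat, t_age = t_age), 4))
print(signif(c(p_val = p_val, p_age = p_age), 3))
```

The recomputed values should agree with the printed adjusted R-squared, F statistic, and t value for age up to rounding.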
## standardized regression coefficients
library(betas)
betas.lm(reg)
##                 beta   se.beta
## Age        0.2034582 0.1066604
## Experience 0.4737158 0.1066604
  1. The standardized regression coefficients mean that as age increases by one standard deviation, salary increases by 0.2035 of a standard deviation, holding experience constant, and as experience increases by one standard deviation, salary increases by 0.4737 of a standard deviation, holding age constant. This suggests that experience is more important than age in its effect on salary, because the absolute value of the standardized regression coefficient for experience is larger.

  2. The diagnostic plot shows the residuals scattered around 0 with no clear outliers. However, four points have studentized residuals close to -3 and might be outliers, since they lie far from the other data points.
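Both checks in this part, the standardized coefficients and the studentized residuals, can be reproduced in base R. The sketch below runs on simulated data, because the employee file is not included here: the `toy` data frame and its coefficients are hypothetical, and the cutoff of 2 is a common rule of thumb rather than something taken from the assignment.

```r
# Simulated stand-in for the employee data (hypothetical values, not the real file)
set.seed(460)
n <- 71
toy <- data.frame(
  Age        = round(runif(n, 25, 60)),
  Experience = round(runif(n, 0, 15))
)
toy$Salary <- 22000 + 300 * toy$Age + 1600 * toy$Experience + rnorm(n, sd = 9000)

fit <- lm(Salary ~ Age + Experience, data = toy)

# Standardized coefficient = raw slope * sd(predictor) / sd(response)
b <- coef(fit)[-1]
beta_manual <- b * sapply(toy[names(b)], sd) / sd(toy$Salary)

# The same numbers come from refitting on z-scored variables
toy_z <- as.data.frame(scale(toy))
fit_z <- lm(Salary ~ Age + Experience, data = toy_z)
beta_scaled <- coef(fit_z)[-1]

# Studentized residuals; points beyond +/- 2 are flagged for a closer look
rs <- rstudent(fit)
flagged <- which(abs(rs) > 2)

print(round(rbind(beta_manual, beta_scaled), 4))
print(flagged)
```

On the real data, replacing `toy` with `employee_data` and `fit` with `reg` would apply the same steps to the fitted model.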

2. Continue using predictions of annual salary based on age and experience.

employee_data[33,]
##    Number Salary Gender Age Experience Level
## 33     33  35018      M  39          1     A
employee_data[52,]
##    Number Salary Gender Age Experience Level
## 52     52  50175      F  42          5     A
  1. Predicted Salary = 35681.42. Prediction Error = Residual = -663.42. This employee's actual salary is $663 lower than you would expect for this age and experience.
  2. Predicted Salary = 42900.11. Prediction Error = Residual = 7274.89. This employee's actual salary is $7,275 more than you would expect for this age and experience.
## highest salary employee
salary_order <- employee_data[order(employee_data$Salary, decreasing = TRUE),]
salary_order[1,]
##    Number Salary Gender Age Experience Level
## 23     23  62530      M  50         10     B
  1. Predicted Salary = 53200.82. Prediction Error = Residual = 9329.18. This employee's actual salary is $9,329 more than you would expect for this age and experience.
## lowest salary employee
salary_order <- employee_data[order(employee_data$Salary),]
salary_order[1,]
##    Number Salary Gender Age Experience Level
## 10     10  23975      F  58          4     A
  1. Predicted Salary = 46129.68. Prediction Error = Residual = -22154.68. This employee's actual salary is $22,155 less than you would expect for this age and experience.
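The predicted salaries and residuals in this question follow directly from the regression equation in question 1. The sketch below is a hand calculation using the rounded coefficients, so its results differ from the reported values by a few dollars; `predict_salary` is an illustrative helper, not part of the original analysis.

```r
# Rounded coefficients from the fitted equation in question 1
predict_salary <- function(age, experience) {
  22380.6 + 300.6 * age + 1579.3 * experience
}

# Age, experience, and actual salary copied from the rows printed above
emp <- data.frame(
  Number     = c(33, 52, 23, 10),
  Age        = c(39, 42, 50, 58),
  Experience = c(1, 5, 10, 4),
  Salary     = c(35018, 50175, 62530, 23975)
)

emp$Predicted <- predict_salary(emp$Age, emp$Experience)
emp$Residual  <- emp$Salary - emp$Predicted   # actual minus predicted
print(emp)
```

With the fitted model object available, `predict(reg, newdata = emp)` and `residuals(reg)` would give the unrounded values.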

3. Consider the prediction of annual salary from age alone (as compared to exercise 1, where experience was also used as an X variable).

lm_age <- lm(Salary ~ Age, employee_data)
summary(lm_age)
## 
## Call:
## lm(formula = Salary ~ Age, data = employee_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -28248  -6036   2343   7121  18791 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19271.7     7569.5   2.546 0.013134 *  
## Age            568.1      164.2   3.461 0.000928 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10050 on 69 degrees of freedom
## Multiple R-squared:  0.1479, Adjusted R-squared:  0.1356 
## F-statistic: 11.98 on 1 and 69 DF,  p-value: 0.000928
  1. The regression equation is Salary = 19271.7 + 568.1Age.
  2. The model p-value is 0.000928 when using only age, compared with 7.507e-07 in question 1, which means that the model is more significant when we use both age and experience to predict salary. The regression coefficient for age is 300.6 in question 1, compared to 568.1 in this question, so the estimated effect of age on annual salary is larger without an adjustment for experience, and including the adjustment weakens it. For two employees with the same experience (this is what the “adjustment” does), we expect the older one to earn $301 more per year of age difference than the younger one. For two employees in general, we expect the older one to earn $568 more per year of age difference than the younger one.
  3. In question 1, age does not have a significant impact on annual salary because its p-value is 0.06068, higher than 0.05. Here age has a significant impact on annual salary, with a p-value of 0.000928, lower than 0.001. Adjusting for experience weakens the effect of age on salary.

If you choose two random employees, the older one is paid significantly more (on average) than the younger worker. But if you choose two random employees with the same experience, then the older one will not be paid significantly more on average than the younger one.
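The change in the age coefficient between the two models can be decomposed with a standard least-squares identity: the unadjusted age slope equals the adjusted age slope plus the experience slope times the slope from regressing Experience on Age. The sketch below uses only the rounded coefficients reported in questions 1 and 3, so `d` is an implied value, not one estimated from the data.

```r
b_age_alone    <- 568.1    # Age slope from lm(Salary ~ Age)
b_age_adjusted <- 300.6    # Age slope from lm(Salary ~ Age + Experience)
b_experience   <- 1579.3   # Experience slope from the two-predictor model

# Identity: b_age_alone = b_age_adjusted + b_experience * d,
# where d is the slope from regressing Experience on Age
d <- (b_age_alone - b_age_adjusted) / b_experience

# Share of the unadjusted age effect that is carried by experience
share_via_experience <- (b_age_alone - b_age_adjusted) / b_age_alone

print(round(c(d = d, share = share_via_experience), 3))
```

Under these numbers, each extra year of age goes with roughly 0.17 extra years of experience, and a little under half of the unadjusted age effect is attributable to experience.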

4. Now examine the effect of gender on annual salary, with and without adjusting for age and experience.

# Using a two-sided test at the 5% level, test whether men are paid significantly more than women.
t.test(Salary ~ Gender, data = employee_data, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  Salary by Gender
## t = -3.491, df = 43.154, p-value = 0.001123
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -14342.727  -3840.019
## sample estimates:
## mean in group F mean in group M 
##        39635.46        48726.84
  1. The average annual salary is $39,635 for women and $48,727 for men. Men have the higher average salary; the difference is $9,091.
  2. The p-value is significant, which means that the average male salary is significantly higher than the average female salary (p-value = 0.001123 < 0.01, based on an unpaired t test with unequal variances assumed).
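The pieces of the Welch test output are linked by t = difference / standard error, so the standard error, p-value, and confidence interval can be rebuilt from the printed numbers. This is a sketch under the assumption that the rounded t statistic and degrees of freedom are accurate enough; the variable names are illustrative.

```r
# Values copied from the t.test() output
mean_f <- 39635.46
mean_m <- 48726.84
t_stat <- -3.491
df_w   <- 43.154   # Welch degrees of freedom (non-integer is expected)

diff_means <- mean_f - mean_m       # F minus M, the order used in the output
se_diff    <- diff_means / t_stat   # standard error implied by t = diff / se

# Two-sided p-value and 95% confidence interval for the difference
p_val <- 2 * pt(abs(t_stat), df_w, lower.tail = FALSE)
ci    <- diff_means + c(-1, 1) * qt(0.975, df_w) * se_diff

print(round(c(diff = diff_means, se = se_diff), 2))
print(signif(p_val, 4))
print(round(ci, 1))
```

The interval excludes 0, which is the same conclusion as the p-value being below 0.05.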
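# factor() keeps relevel() working when Gender is read in as character (the read.csv default since R 4.0)
emp1 <- within(employee_data, Gender <- relevel(factor(Gender), ref = 'M'))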
lm_gen <- lm(Salary ~ Gender + Age + Experience, emp1)
summary(lm_gen)
## 
## Call:
## lm(formula = Salary ~ Gender + Age + Experience, data = emp1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -19224  -4286   -184   6688  14965 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  27777.5     6662.8   4.169 8.98e-05 ***
## GenderF      -6216.7     2123.8  -2.927 0.004668 ** 
## Age            260.2      150.1   1.733 0.087632 .  
## Experience    1386.7      343.7   4.035 0.000142 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8452 on 67 degrees of freedom
## Multiple R-squared:  0.4144, Adjusted R-squared:  0.3882 
## F-statistic:  15.8 on 3 and 67 DF,  p-value: 7.109e-08
contrasts(emp1$Gender)
##   F
## M 0
## F 1
  1. The multiple linear regression equation is Salary = 27777.5 - 6216.7GenderF + 260.2Age + 1386.7Experience.
  2. The regression coefficient for gender is -6216.7, indicating that at the same age and experience, women are paid about $6,217 less than men, on average.
  3. Yes, gender has a significant impact on annual salary after adjustment for age and experience: the p-value is 0.004668, less than the 0.01 significance level.
  4. Both part b and part e show that men are paid significantly more than women, and both results are significant at the 1% level (p-value = 0.001123 in part b and 0.004668 in part e). Adjusting for age and experience shrinks the estimated gap from about $9,091 to about $6,217, but the gap remains significant.
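The part b and part e results can be put side by side numerically. The sketch below assumes only the rounded values printed by `t.test()` and `summary(lm_gen)`; the variable names are illustrative.

```r
# GenderF row from summary(lm_gen)
est      <- -6216.7
se       <- 2123.8
df_resid <- 67

# t statistic, two-sided p-value, and 95% confidence interval for the adjusted gap
t_gender  <- est / se
p_gender  <- 2 * pt(abs(t_gender), df_resid, lower.tail = FALSE)
ci_gender <- est + c(-1, 1) * qt(0.975, df_resid) * se

# Unadjusted gap from the group means in part b versus the adjusted gap
unadjusted_gap <- 48726.84 - 39635.46
adjusted_gap   <- -est
shrink         <- 1 - adjusted_gap / unadjusted_gap

print(round(c(t = t_gender, p = p_gender), 4))
print(round(ci_gender, 1))
print(round(c(unadjusted = unadjusted_gap, adjusted = adjusted_gap, shrink = shrink), 3))
```

The adjusted gap is roughly a third smaller than the unadjusted gap, and its confidence interval still lies entirely below 0.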

The results from parts b and e of this exercise are in agreement. The conclusion of part b is that men are paid significantly more than women. In part e, it is found that even when adjustments are made for age and experience, men are still paid significantly more than women.

5. Examine the effect of training level on annual salary, with and without adjusting for age and experience.

## Find the average annual salary for each of the three Case training levels and compare them.
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
emp2 <- employee_data %>% group_by(Level) %>% summarise(avg_salary = mean(Salary))
emp2
## # A tibble: 3 x 2
##   Level avg_salary
##   <fct>      <dbl>
## 1 A         41011.
## 2 B         48387.
## 3 C         53927.
  1. The average annual salary for level A is 41011, for level B is 48387 and for level C is 53927.
# Find the multiple regression equation to predict annual salary from age, experience, and training level, using indicator variables for training level. Omit the indicator variable for level A as the baseline.
lm3 <- lm(Salary ~ Age + Experience + Level, employee_data)
lm4 <- lm(Salary ~ Level, employee_data)
summary(lm3)
## 
## Call:
## lm(formula = Salary ~ Age + Experience + Level, data = employee_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -20100  -3380   1046   4264  13256 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  16023.1     5920.1   2.707  0.00864 ** 
## Age            369.9      136.4   2.712  0.00852 ** 
## Experience    1450.2      305.8   4.742 1.17e-05 ***
## LevelB        6647.7     1994.4   3.333  0.00141 ** 
## LevelC       13377.2     2857.1   4.682 1.46e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7638 on 66 degrees of freedom
## Multiple R-squared:  0.529,  Adjusted R-squared:  0.5004 
## F-statistic: 18.53 on 4 and 66 DF,  p-value: 3e-10
summary(lm4)
## 
## Call:
## lm(formula = Salary ~ Level, data = employee_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -18781  -5790   2113   7964  15873 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    41011       1596  25.704  < 2e-16 ***
## LevelB          7376       2564   2.876 0.005367 ** 
## LevelC         12916       3646   3.542 0.000721 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9835 on 68 degrees of freedom
## Multiple R-squared:  0.1952, Adjusted R-squared:  0.1716 
## F-statistic: 8.249 on 2 and 68 DF,  p-value: 0.0006204
  1. The multiple linear regression equation is Salary = 16023.1 + 369.9Age + 1450.2Experience + 6647.7LevelB + 13377.2LevelC.
  2. The regression coefficient for training level B says that employees at this level earn $6,648 more, on average, than employees of the same age and experience who are at training level A (the baseline category). It appears that the initial training (B) is worth $6,648 in salary. The regression coefficient for training level C says that employees at this level earn $13,377 more, on average, than those of the same age and experience who are at training level A. It seems that taking both training courses is worth a total of $13,377 in salary.

  3. Yes, training level has a significant effect on annual salary after adjusting for age and experience. The p-value for the regression coefficient of training level B is 0.00141, lower than 0.01, so level B has a significant impact on annual salary after adjustment for age and experience at the 1% level. The p-value for the regression coefficient of training level C is 1.46e-05, lower than 0.001, so level C has a significant impact on annual salary after adjustment for age and experience at the 0.1% level.
  4. Adjusted for age and experience: the estimated average salary for level C is 13377.2 higher than for level A, holding age and experience constant. Not adjusted for age and experience: the estimated average salary for level C is 12916 higher than for level A. After adjusting for age and experience, the estimated effect of level C on salary is slightly larger than without the adjustment.
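The indicator-variable coding can be made concrete by predicting salary at each training level for one fixed profile. The sketch below uses the rounded coefficients from `summary(lm3)` and `summary(lm4)`; the helper `predict_salary_level` and the example profile (age 40, 5 years of experience) are hypothetical choices for illustration.

```r
# Level A is the baseline, so it has no indicator term
predict_salary_level <- function(age, experience, level) {
  level_b <- as.numeric(level == "B")   # 1 for level B, else 0
  level_c <- as.numeric(level == "C")   # 1 for level C, else 0
  16023.1 + 369.9 * age + 1450.2 * experience +
    6647.7 * level_b + 13377.2 * level_c
}

levels_abc <- c("A", "B", "C")
adjusted <- predict_salary_level(age = 40, experience = 5, level = levels_abc)
names(adjusted) <- levels_abc

# Differences from level A equal the LevelB and LevelC coefficients
adjusted_gap <- adjusted - adjusted[["A"]]

# Without adjustment, intercept plus indicator coefficients give the group means
unadjusted <- 41011 + c(A = 0, B = 7376, C = 12916)

print(adjusted)
print(adjusted_gap)
print(unadjusted)
```

The unadjusted values match the level averages reported earlier in this question, which is what a regression on indicator variables alone always produces.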

6. Consider predicting annual salary from age, experience, and an interaction term.

library(dplyr)
# Create a new variable, “interaction,” by multiplying age by experience for each employee.
employee_data <- mutate(employee_data, Interaction = Age * Experience)
lm5 <- lm(Salary ~ Age + Experience + Interaction, employee_data)
summary(lm5)
## 
## Call:
## lm(formula = Salary ~ Age + Experience + Interaction, data = employee_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24650  -3326   1780   6345  14109 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -2942.23   13665.57  -0.215  0.83019   
## Age           867.00     308.83   2.807  0.00654 **
## Experience   6578.67    2389.52   2.753  0.00759 **
## Interaction  -107.90      51.03  -2.115  0.03819 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8691 on 67 degrees of freedom
## Multiple R-squared:  0.3808, Adjusted R-squared:  0.3531 
## F-statistic: 13.74 on 3 and 67 DF,  p-value: 4.422e-07
  1. The regression equation is Salary = -2942.23 + 867Age + 6578.67Experience - 107.9Interaction.
  2. The p-value for the regression coefficient of Interaction is 0.03819, lower than 0.05, so the interaction is significant at the 5% level.
  3. An extra year of experience for a 40-year-old employee is expected to increase annual salary by 6578.67 - 107.9 * 40 = 2262.67 dollars.
  4. An extra year of experience for a 50-year-old employee is expected to increase annual salary by 6578.67 - 107.9 * 50 = 1183.67 dollars.
  5. The average effect of an extra year of experience on annual salary decreases as age increases because the interaction between age and experience is negative and statistically significant. An extra year of experience is worth more at age 40 than it is at age 50.
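With an interaction term, the effect of one more year of experience is no longer a single number but a function of age: 6578.67 - 107.9 * Age. The sketch below tabulates it from the rounded coefficients; the function names and the extra ages (30 and 60) are illustrative, and the break-even age is an extrapolation that should be read with caution.

```r
# Effect of one extra year of experience at a given age
experience_effect <- function(age) 6578.67 - 107.90 * age

ages <- c(30, 40, 50, 60)
effects <- data.frame(Age = ages, Effect = experience_effect(ages))
print(effects)

# Age at which the estimated effect of experience reaches zero
break_even_age <- 6578.67 / 107.90
print(round(break_even_age, 2))

# By symmetry, the effect of one extra year of age at a given experience
age_effect <- function(experience) 867.00 - 107.90 * experience
print(age_effect(c(0, 5, 10)))
```

The same interaction coefficient appears in both functions, which is why the interaction can be read either as age dampening the payoff to experience or as experience dampening the payoff to age.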