# Find and interpret the regression equation and regression coefficients
employee_data <- read.csv('Employee_database.csv')
reg <- lm(Salary ~ Age + Experience, employee_data)
reg
##
## Call:
## lm(formula = Salary ~ Age + Experience, data = employee_data)
##
## Coefficients:
## (Intercept)          Age   Experience
##     22380.6        300.6       1579.3
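The coefficients above give the fitted equation Salary = 22380.6 + 300.6 × Age + 1579.3 × Experience. The short sketch below plugs in one hypothetical employee (age 40 with 5 years of experience; not a row from the data set) to show how the equation is used. The coefficients are typed in by hand from the output rather than pulled from `reg`.

```r
# Coefficients copied from the lm() output above
b <- c(intercept = 22380.6, age = 300.6, experience = 1579.3)

# Hypothetical employee (an assumption for illustration only)
new_age <- 40
new_experience <- 5

# Fitted regression equation evaluated at those values
pred_salary <- unname(b["intercept"] + b["age"] * new_age +
                        b["experience"] * new_experience)
pred_salary  # 42301.1
```

Read this way, each extra year of age adds about $301 to the predicted salary at fixed experience, and each extra year of experience adds about $1,579 at fixed age.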
# Find and interpret the standard error of estimate. Find and interpret the coefficient of determination. Is the model significant? What does this tell you?
summary(reg)
##
## Call:
## lm(formula = Salary ~ Age + Experience, data = employee_data)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -23420  -5321   1328   6785  15337
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  22380.6     6749.5   3.316  0.00147 **
## Age            300.6      157.6   1.908  0.06068 .
## Experience    1579.3      355.6   4.441 3.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8910 on 68 degrees of freedom
## Multiple R-squared: 0.3395, Adjusted R-squared: 0.3201
## F-statistic: 17.48 on 2 and 68 DF, p-value: 7.507e-07
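As a rough reading of this output: the residual standard error says predictions miss actual salaries by about $8,910 on typical, R-squared says age and experience together explain about 34% of the variation in salary, and the F test (p = 7.5e-07) says the model is significant. The sketch below is a consistency check, not part of the original analysis. It recomputes the adjusted R-squared and the F statistic from the reported R-squared, assuming n = 71 employees (68 residual degrees of freedom plus 3 estimated coefficients).

```r
# Values taken from summary(reg); n is inferred from the degrees of freedom
r2 <- 0.3395
n  <- 71   # 68 residual df + 3 coefficients
p  <- 2    # number of predictors (Age, Experience)

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F statistic expressed through R-squared
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

# Upper-tail p-value of the F test
p_value <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

round(c(adj_r2 = adj_r2, f_stat = f_stat), 4)
p_value
```

The results should agree with the printed 0.3201 and 17.48 up to rounding of the inputs.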
## standardized regression coefficients
library(betas)
betas.lm(reg)
##                 beta   se.beta
## Age        0.2034582 0.1066604
## Experience 0.4737158 0.1066604
The standardized regression coefficients mean that as age increases by one standard deviation, holding experience constant, salary is expected to increase by 0.2035 of one standard deviation. Likewise, as experience increases by one standard deviation, holding age constant, salary is expected to increase by 0.4737 of one standard deviation. This suggests that experience is more important than age in its effect on salary, because the absolute value of the standardized regression coefficient for experience is larger.
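A standardized coefficient can also be obtained without a dedicated package, either by rescaling the raw slope by sd(x) / sd(y) or by refitting the model on z-scored variables. The sketch below demonstrates that equivalence on simulated data. The data frame `sim` and its generating parameters are made up for illustration and are not the employee data.

```r
set.seed(1)

# Simulated stand-in data (assumption: NOT the employee data set)
n <- 71
sim <- data.frame(
  age        = round(runif(n, 25, 60)),
  experience = round(runif(n, 0, 15))
)
sim$salary <- 20000 + 300 * sim$age + 1500 * sim$experience +
  rnorm(n, sd = 9000)

# Route 1: rescale the raw slopes by sd(x) / sd(y)
fit <- lm(salary ~ age + experience, data = sim)
beta_manual <- coef(fit)[-1] *
  c(sd(sim$age), sd(sim$experience)) / sd(sim$salary)

# Route 2: refit on z-scored variables; the slopes are the betas
sim_std <- as.data.frame(scale(sim))
fit_std <- lm(salary ~ age + experience, data = sim_std)
beta_scaled <- coef(fit_std)[-1]

# The two routes agree up to floating-point error
all.equal(beta_manual, beta_scaled)
```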
The diagnostic plot shows the residuals scattered around 0 with no extreme outliers. However, four points have studentized residuals close to -3; they lie far from the other data points and may be outliers.
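One way to find such points numerically rather than by eye is to compute studentized residuals with `rstudent()` and flag the large ones. The sketch below does this on simulated data with one artificially planted low salary in row 33. The data, the planted value, and the cutoff of 3 are all assumptions for illustration; applied to the real fit, the same two lines would use `reg` in place of `fit`.

```r
set.seed(42)

# Simulated stand-in data (assumption: NOT the employee data set)
n <- 71
sim <- data.frame(
  age        = round(runif(n, 25, 60)),
  experience = round(runif(n, 0, 15))
)
sim$salary <- 50000 + 300 * sim$age + 1500 * sim$experience +
  rnorm(n, sd = 3000)

# Plant one unusually low salary so there is something to detect
sim$salary[33] <- sim$salary[33] - 30000

fit <- lm(salary ~ age + experience, data = sim)

# Externally studentized residuals: each point is judged against a
# fit and error estimate that leave that point out
r_stud <- rstudent(fit)

# Flag observations beyond an (arbitrary) cutoff of 3 in absolute value
flagged <- which(abs(r_stud) > 3)
sim[flagged, ]
```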
employee_data[33,]
##    Number Salary Gender Age Experience Level
## 33     33  35018      M  39          1     A
employee_data[52,]
##    Number Salary Gender Age Experience Level
## 52     52  50175      F  42          5     A
## highest salary employee
salary_order <- employee_data[order(employee_data$Salary, decreasing = TRUE),]
salary_order[1,]
##    Number Salary Gender Age Experience Level
## 23     23  62530      M  50         10     B
## lowest salary employee
salary_order <- employee_data[order(employee_data$Salary),]
salary_order[1,]
##    Number Salary Gender Age Experience Level
## 10     10  23975      F  58          4     A
lm_age <- lm(Salary ~ Age, employee_data)
summary(lm_age)
##
## Call:
## lm(formula = Salary ~ Age, data = employee_data)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -28248  -6036   2343   7121  18791
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  19271.7     7569.5   2.546 0.013134 *
## Age            568.1      164.2   3.461 0.000928 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10050 on 69 degrees of freedom
## Multiple R-squared: 0.1479, Adjusted R-squared: 0.1356
## F-statistic: 11.98 on 1 and 69 DF, p-value: 0.000928
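Without adjusting for experience, each additional year of age is associated with about $568 more in salary, and the effect is significant (p = 0.0009). So if you choose two employees at random, the older one is paid significantly more (on average) than the younger one. Once experience is held constant, the age coefficient drops to about $301 and is no longer significant at the 5% level (p = 0.061). So if you choose two employees with the same experience, the older one is not paid significantly more on average than the younger one.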
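The drop in the age coefficient follows from a standard least-squares identity: the slope in the simple regression equals the adjusted age slope plus the adjusted experience slope times the slope from regressing experience on age. The sketch below uses that identity, with the coefficients copied from the two outputs above, to back out the implied experience-on-age slope. That slope is derived here, not reported in the original analysis.

```r
# Coefficients copied from the outputs above
b_age_simple <- 568.1    # Salary ~ Age
b_age_adj    <- 300.6    # Salary ~ Age + Experience
b_exp_adj    <- 1579.3   # Salary ~ Age + Experience

# Identity: b_age_simple = b_age_adj + b_exp_adj * slope(Experience ~ Age)
implied_slope <- (b_age_simple - b_age_adj) / b_exp_adj
round(implied_slope, 4)  # 0.1694

# Part of the unadjusted age effect that is really carried by experience
b_exp_adj * implied_slope  # 267.5
```

In words, older employees tend to have more experience (roughly 0.17 extra years of experience per year of age, by this calculation), and about $267 of the unadjusted $568 per year of age reflects that extra experience.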
# Using a two-sided test at the 5% level, test whether men are paid significantly more than women.
t.test(Salary ~ Gender, data = employee_data, var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: Salary by Gender
## t = -3.491, df = 43.154, p-value = 0.001123
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -14342.727  -3840.019
## sample estimates:
## mean in group F mean in group M
##        39635.46        48726.84
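Since the p-value (0.0011) is below 0.05 and the confidence interval excludes 0, the difference is significant, with women earning less on average. The sketch below is a hand check that rebuilds the pieces of the Welch output from the printed numbers: the difference in means, the standard error implied by the t statistic, and the 95% interval. All inputs are copied from the output above, so small rounding differences are expected.

```r
# Values copied from the Welch t-test output
mean_f <- 39635.46
mean_m <- 48726.84
t_stat <- -3.491
df     <- 43.154

# Difference in means (F minus M, the order t.test uses here)
diff_fm <- mean_f - mean_m

# Standard error implied by t = difference / SE
se_diff <- diff_fm / t_stat

# 95% confidence interval using the Welch degrees of freedom
ci <- diff_fm + c(-1, 1) * qt(0.975, df) * se_diff

diff_fm  # -9091.38
ci       # close to the printed (-14342.727, -3840.019)
```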
# factor() makes this work even when read.csv() imports Gender as character (R >= 4.0)
emp1 <- within(employee_data, Gender <- relevel(factor(Gender), ref = 'M'))
lm_gen <- lm(Salary ~ Gender + Age + Experience, emp1)
summary(lm_gen)
##
## Call:
## lm(formula = Salary ~ Gender + Age + Experience, data = emp1)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -19224  -4286   -184   6688  14965
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  27777.5     6662.8   4.169 8.98e-05 ***
## GenderF      -6216.7     2123.8  -2.927 0.004668 **
## Age            260.2      150.1   1.733 0.087632 .
## Experience    1386.7      343.7   4.035 0.000142 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8452 on 67 degrees of freedom
## Multiple R-squared: 0.4144, Adjusted R-squared: 0.3882
## F-statistic: 15.8 on 3 and 67 DF, p-value: 7.109e-08
contrasts(emp1$Gender)
##   F
## M 0
## F 1
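With M as the reference level, the GenderF coefficient is the estimated salary difference for a woman relative to a man of the same age and experience. The sketch below is an added check, using only the printed estimate, standard error, and residual degrees of freedom. It builds an approximate 95% confidence interval for that adjusted gap and compares it with the unadjusted gap from the t-test.

```r
# Values copied from summary(lm_gen) and from the t-test output
b_gender <- -6216.7   # GenderF estimate (F relative to M)
se_gender <- 2123.8
df_resid <- 67
raw_gap <- 39635.46 - 48726.84   # unadjusted F minus M difference

# 95% confidence interval for the adjusted gender gap
ci_gender <- b_gender + c(-1, 1) * qt(0.975, df_resid) * se_gender
ci_gender

# Share of the raw gap that remains after adjusting for age and experience
share_remaining <- b_gender / raw_gap
round(share_remaining, 2)  # 0.68
```

The interval lies entirely below zero, which matches the significant t test on GenderF; roughly two thirds of the raw gap remains after the adjustment.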
The results from parts b and e of this exercise are in agreement. The conclusion of part b is that men are paid significantly more than women. In part e, it is found that even when adjustments are made for age and experience, men are still paid significantly more than women.

#### 5. Examine the effect of training level on annual salary, with and without adjusting for age and experience.
## Find the average annual salary for each of the three Case training levels and compare them.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
emp2 <- employee_data %>% group_by(Level) %>% summarise(avg_salary = mean(Salary))
emp2
## # A tibble: 3 x 2
##   Level avg_salary
##   <fct>      <dbl>
## 1 A         41011.
## 2 B         48387.
## 3 C         53927.
# Find the multiple regression equation to predict annual salary from age, experience, and training level, using indicator variables for training level. Omit the indicator variable for level A as the baseline.
lm3 <- lm(Salary ~ Age + Experience + Level, employee_data)
lm4 <- lm(Salary ~ Level, employee_data)
summary(lm3)
##
## Call:
## lm(formula = Salary ~ Age + Experience + Level, data = employee_data)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -20100  -3380   1046   4264  13256
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  16023.1     5920.1   2.707  0.00864 **
## Age            369.9      136.4   2.712  0.00852 **
## Experience    1450.2      305.8   4.742 1.17e-05 ***
## LevelB        6647.7     1994.4   3.333  0.00141 **
## LevelC       13377.2     2857.1   4.682 1.46e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7638 on 66 degrees of freedom
## Multiple R-squared: 0.529, Adjusted R-squared: 0.5004
## F-statistic: 18.53 on 4 and 66 DF, p-value: 3e-10
summary(lm4)
##
## Call:
## lm(formula = Salary ~ Level, data = employee_data)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -18781  -5790   2113   7964  15873
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    41011       1596  25.704  < 2e-16 ***
## LevelB          7376       2564   2.876 0.005367 **
## LevelC         12916       3646   3.542 0.000721 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9835 on 68 degrees of freedom
## Multiple R-squared: 0.1952, Adjusted R-squared: 0.1716
## F-statistic: 8.249 on 2 and 68 DF, p-value: 0.0006204
The regression coefficient for training level B says that employees at this level earn $6,648 more, on average, than employees of the same age and experience who are at training level A (the baseline category). It appears that the initial training (B) is worth $6,648 in salary. The regression coefficient for training level C says that employees at this level earn $13,377 more, on average, than those of the same age and experience who are at training level A. It seems that taking both training courses is worth a total of $13,377 in salary.
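To make the indicator-variable equation concrete, the sketch below evaluates it for one hypothetical employee profile (age 40 with 5 years of experience; an assumption, not a row of the data) at each training level. The helper `predict_salary()` is written here for illustration and uses coefficients typed in from the `lm3` output.

```r
# Coefficients copied from summary(lm3); level A is the baseline
b <- c(intercept = 16023.1, age = 369.9, experience = 1450.2,
       levelB = 6647.7, levelC = 13377.2)

# Hypothetical helper: evaluates the fitted equation with 0/1 indicators
predict_salary <- function(age, experience, level = c("A", "B", "C")) {
  level <- match.arg(level)
  unname(b["intercept"] + b["age"] * age + b["experience"] * experience +
           (level == "B") * b["levelB"] + (level == "C") * b["levelC"])
}

# Same age and experience, three training levels
preds <- sapply(c(A = "A", B = "B", C = "C"),
                function(l) predict_salary(40, 5, l))
preds  # A 38070.1, B 44717.8, C 51447.3

# Estimated increment of level C over level B
unname(preds["C"] - preds["B"])  # 6729.5
```

The three predictions differ only by the level coefficients, so the gap between C and B (about $6,730) is the estimated value of the second course on top of the first.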
Adjusted for age and experience: the estimated average salary for level C is $13,377 higher than for level A, holding age and experience constant. Not adjusted for age and experience: the estimated average salary for level C is $12,916 higher than for level A; this is simply the difference between the two group means, since the model contains no other x variables. Adjusting for age and experience therefore slightly increases the estimated effect of level C (from $12,916 to $13,377), while it slightly decreases the estimated effect of level B (from $7,376 to $6,648).
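In the unadjusted model the intercept is the level A mean and each level coefficient is a difference from it, so the group means computed earlier can be recovered from the `lm4` output. The sketch below checks this with the printed, rounded values and lists the adjusted and unadjusted level effects side by side.

```r
# Coefficients copied from summary(lm4) (unadjusted model)
lm4_coef <- c(intercept = 41011, levelB = 7376, levelC = 12916)

# Group means implied by the unadjusted model
implied_means <- c(
  A = unname(lm4_coef["intercept"]),
  B = unname(lm4_coef["intercept"] + lm4_coef["levelB"]),
  C = unname(lm4_coef["intercept"] + lm4_coef["levelC"])
)
implied_means  # 41011 48387 53927, matching the dplyr summary

# Level effects with and without adjusting for age and experience
effects <- data.frame(
  level      = c("B", "C"),
  unadjusted = c(7376, 12916),
  adjusted   = c(6647.7, 13377.2)   # from summary(lm3)
)
effects$change <- effects$adjusted - effects$unadjusted
effects
```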
library(dplyr)
# Create a new variable, “interaction,” by multiplying age by experience for each employee.
employee_data <- mutate(employee_data, Interaction = Age * Experience)
lm5 <- lm(Salary ~ Age + Experience + Interaction, employee_data)
summary(lm5)
##
## Call:
## lm(formula = Salary ~ Age + Experience + Interaction, data = employee_data)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -24650  -3326   1780   6345  14109
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2942.23   13665.57  -0.215  0.83019
## Age           867.00     308.83   2.807  0.00654 **
## Experience   6578.67    2389.52   2.753  0.00759 **
## Interaction  -107.90      51.03  -2.115  0.03819 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8691 on 67 degrees of freedom
## Multiple R-squared: 0.3808, Adjusted R-squared: 0.3531
## F-statistic: 13.74 on 3 and 67 DF, p-value: 4.422e-07
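The interaction term is significant at the 5% level (p = 0.038) and negative, so the estimated value of an extra year of experience depends on age: it equals 6578.67 - 107.90 × Age. The sketch below evaluates that expression at a few illustrative ages, which are chosen arbitrarily here. The coefficients are copied from the `lm5` output.

```r
# Coefficients copied from summary(lm5)
b_exp <- 6578.67    # Experience
b_int <- -107.90    # Age x Experience interaction

# Estimated salary change per extra year of experience at a given age
experience_effect <- function(age) b_exp + b_int * age

ages <- c(30, 40, 50)   # illustrative ages (assumption)
effects_by_age <- experience_effect(ages)
effects_by_age  # 3341.67 2262.67 1183.67

# Age at which the estimated effect of experience reaches zero
age_zero <- -b_exp / b_int
round(age_zero, 2)  # 60.97
```

Under this model an extra year of experience is worth roughly $3,300 at age 30 but only about $1,200 at age 50. Because Age and Experience now appear in the interaction as well, their individual coefficients describe the effect of each variable when the other equals zero and should not be compared directly with those from the model without the interaction.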