# Read the SPSS employee data and coerce the numeric study variables
emp <- foreign::read.spss("../Week 5/Employee data.sav", to.data.frame = TRUE)
emp %<>% mutate_at(.vars = vars(salary, salbegin, jobtime, prevexp, educ),
    .funs = funs(as.numeric(as.character(.))))
# SPSS stores dates as seconds since 1582-10-14; convert birth date to age in years
emp$age <- emp$bdate %>% {
    ./(24 * 60 * 60)  # seconds -> days
} %>% as.Date(origin = "1582-10-14") %>% {
    as.Date(lubridate::today()) - .  # days old as of today
} %>% {
    ./365.25  # days -> years
} %>% round %>% as.numeric
# Recode the indicator variables as factors
emp %<>% mutate_at(.vars = vars(female, custodial, manager),
    .funs = funs(factor(., labels = c(FALSE, TRUE))))
# Drop rows with missing birth dates
emp <- emp[!is.na(emp$age), ]
If you haven’t already, compute bivariate correlations between educational level (in years) and the log of current salary (salary_ln, used because we previously determined that salary is not normally distributed), and between age in years (age_yr) and salary_ln.
emp$logsal <- log(emp$salary)
tags$h2("Testing Assumptions and evaluating normality:")
invisible(rbind(c("educ", "logsal"), c("age", "salaryLN")) %>% sapply(emp = emp,
function(x, emp) {
sp <- shapiro.test(emp[[x]])
g <- list()
g$hist <- emp %>% ggplot(data = ., mapping = aes_string(x = x)) + geom_histogram() +
labs(title = paste0("Histogram of ", x), caption = paste0(sp[["method"]],
": ", sp[["p.value"]] %>% p.txt, "\n", ifelse(sp[["p.value"]] < 0.05,
"not-normally distributed", "normally distributed")))
g$qq <- ggpubr::ggqqplot(data = emp[[x]], ylab = x, main = paste0("QQplot of ",
x))
g <- gridExtra::grid.arrange(grobs = g, nrow = 1)
return(g)
}))
tags$p("None of the variables appear to be normally distributed. Spearman's Rho will be used in addition to Pearson's R.")
None of the variables appear to be normally distributed. Spearman's Rho will be used in addition to Pearson's R.
tags$h2("1a. Correlation testing & Scatter Plots")
invisible(apply(rbind(c("educ", "logsal"), c("age", "salaryLN")), emp = emp, 1, function(r,
emp) {
pear <- cor.test(emp[[r[[1]]]], emp[[r[[2]]]], method = "pearson")
spear <- cor.test(emp[[r[[1]]]], emp[[r[[2]]]], method = "spearman")
g <- ggplot(data = emp, mapping = aes_string(x = r[[1]], y = r[[2]])) + geom_point() +
geom_smooth(method = "lm") + annotate("text", x = sort(emp[[r[[1]]]])[length(emp[[r[[1]]]])/3],
y = sort(emp[[r[[2]]]])[length(emp[[r[[2]]]])], label = paste0("Pearson's R: ",
round(cor(emp[[r[[1]]]], emp[[r[[2]]]], method = c("pearson")), 6), "\n",
"Spearman's Rho: ", round(cor(emp[[r[[1]]]], emp[[r[[2]]]], method = c("spearman")),
6))) + labs(title = paste0("Bivariate Correlation between ", r[[1]],
" & ", r[[2]]), xlab = r[[1]], ylab = r[[2]], caption = paste0(pear[["method"]],
": ", pear[["p.value"]] %>% p.txt, " ", ifelse(pear[["p.value"]] < 0.05,
"Correlated", "No-correlation"), "\n", spear[["method"]], ": ", spear[["p.value"]] %>%
p.txt, " ", ifelse(spear[["p.value"]] < 0.05, "Correlated", "No-correlation")))
print(g)
return(g)
}))
Note that Spearman’s Rho, which is more suitable for this data due to its deviations from normality, did not indicate a significant correlation at \(\alpha = .05\) but did at \(\alpha = .1\).
1b. Do you expect that educational level and age would be predictive of salary and why? Which one will be a stronger predictor?
Education level and age will both be predictive of salary, and education will be the stronger predictor. Higher levels of education tend to give people a deeper understanding of specific disciplines, a greater capacity to grasp complex problems, a more nuanced worldview, and (hopefully) a more developed ability to work tactfully and productively with people of different intellectual and cultural backgrounds; all of these ideally contribute to increased earnings over one’s lifetime. Age will be a more general and less significant predictor.
The linear regression line in the figure above shows that the correlation between age and the natural log of salary is negative. This runs counter to the notion that individuals tend to earn more as they accumulate years of experience. Perhaps this negative relationship can be accounted for by low-salary outliers around retirement age?
emp %>% filter(age > 65) %>% group_by(age) %>% summarize(salary = mean(salary)) %>%
mutate(p.val = pnorm(salary, mean(emp$salary), sd = sd(emp$salary)))
pear <- cor.test(emp$age, emp$salary, method = "pearson")
spear <- cor.test(emp$age, emp$salary, method = "spearman")
emp %>% mutate(p.val = pnorm(salary, mean(emp$salary), sd = sd(emp$salary))) %>%
mutate(`P<.1` = p.val < 0.1) %>% ggplot(data = ., mapping = aes(x = age, y = salary,
color = `P<.1`)) + geom_point() + geom_smooth(method = "lm") + annotate("text",
x = sort(emp$age)[length(emp$age)/3], y = sort(emp$salary)[length(emp$salary)],
label = paste0("Pearson's R: ", round(cor(emp$age, emp$salary, method = c("pearson")),
6), "\n", "Spearman's Rho: ", round(cor(emp$age, emp$salary, method = c("spearman")),
6))) + labs(title = paste0("Bivariate Correlation between ", "Age", " & ",
"Salary grouped by P<.1"), xlab = "Age", ylab = "Salary", caption = paste0(pear[["method"]],
": ", pear[["p.value"]] %>% p.txt, " ", ifelse(pear[["p.value"]] < 0.05, "Correlated",
"No-correlation"), "\n", spear[["method"]], ": ", spear[["p.value"]] %>%
p.txt, " ", ifelse(spear[["p.value"]] < 0.05, "Correlated", "No-correlation")))
The table and graph above suggest that retirement could be exerting a significant influence on the relationship between age and salary. If we control for this factor, what does the relationship become?
pear <- cor.test(emp$age, emp$salary, method = "pearson")
spear <- cor.test(emp$age, emp$salary, method = "spearman")
emp %>% mutate(p.val = pnorm(salary, mean(emp$salary), sd = sd(emp$salary))) %>%
mutate(`P<.1` = p.val < 0.1) %>% filter(age < 65) %>% ggplot(data = ., mapping = aes(x = age,
y = salary, color = `P<.1`)) + geom_point() + geom_smooth(method = "lm") + annotate("text",
x = sort(emp$age)[length(emp$age)/3], y = sort(emp$salary)[length(emp$salary)],
label = paste0("Pearson's R: ", round(cor(emp$age, emp$salary, method = c("pearson")),
6), "\n", "Spearman's Rho: ", round(cor(emp$age, emp$salary, method = c("spearman")),
6))) + labs(title = paste0("Bivariate Correlation between ", "Age", " & ",
"Salary grouped by P<.1"), subtitle = "All Ages > 65 are removed", xlab = "Age",
ylab = "Salary", caption = paste0(pear[["method"]], ": ", pear[["p.value"]] %>%
p.txt, " ", ifelse(pear[["p.value"]] < 0.05, "Correlated", "No-correlation"),
"\n", spear[["method"]], ": ", spear[["p.value"]] %>% p.txt, " ", ifelse(spear[["p.value"]] <
0.05, "Correlated", "No-correlation")))
When we flag the bottom 10% of the salary distribution, we see a positive relationship between age and salary indicated by the pink fitted line. However, the Pearson and Spearman coefficients in the caption do not control for these points, and the two individuals over age 60 with salaries below the 10% threshold are enough to keep the overall relationship between age and salary negative.
Write out a theoretical equation for regression with log salary as an outcome and education as a predictor.
\(y_{logsal} = \beta_0 + \beta_1 X_{educ} + \epsilon\)
2a. Fit the model in a software of your choice. Comment on the model fit (ANOVA table).
(emp_lm <- lm(salaryLN ~ educ, data = emp)) %>% anova()
emp_lm %>% summary
##
## Call:
## lm(formula = salaryLN ~ educ, data = emp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.66302 -0.19375 -0.03601 0.16519 0.95172
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.061217 0.062816 144.25 <2e-16 ***
## educ 0.096050 0.004555 21.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2856 on 471 degrees of freedom
## Multiple R-squared: 0.4856, Adjusted R-squared: 0.4845
## F-statistic: 444.7 on 1 and 471 DF, p-value: < 0.00000000000000022
The model indicates that education is a significant predictor of the natural log of salary (p < .001), and the adjusted \(R^2\) shows that the model explains roughly 49% of the variance in log salary, leaving about 51% unexplained by education.
2b. Interpret regression coefficients.
The regression coefficient of education shows that for each additional year of education the natural log of salary increases by 0.09605. A coefficient on the log scale is not easy to explain in plain terms, however. Regressing salary itself on education (even though this violates the assumptions of linear regression more than the log model does) gives a more interpretable coefficient: for each additional year of education, salary increases by about 3915.84.
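For reference, the multiplicative reading of the log-scale coefficient can also be computed directly from the fit above (a quick sketch):
# exp(beta) is the factor by which salary changes per additional year of education
exp(coef(emp_lm)[["educ"]])  # ~1.10, i.e. roughly a 10% increase per year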
2c. Can education be used to predict salary and what would be a prediction equation?
\(y_{salary} = -18391.589074 + 3915.840513 X_{educ} + \epsilon\)
2d. What would be predicted values of salary for someone with 12 years of education, 16 years, 21 years?
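Note that the prediction below relies on an emp.sal_lm object that is not defined in the code shown above; presumably it is the raw salary-on-education fit behind the equation in 2c, something like:
# Assumed definition (not shown in this excerpt): salary regressed on education
emp.sal_lm <- lm(salary ~ educ, data = emp)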
cbind(data.frame(educ = c(12, 16, 21)), salary = predict(emp.sal_lm, newdata = data.frame(educ = c(12,
16, 21)))) %>% kableExtra::kable("html") %>% kableExtra::kable_styling(position = "center")
educ | salary |
---|---|
12 | 28598.50 |
16 | 44261.86 |
21 | 63841.06 |
2e. Repeat the same for age as a predictor for salary.
(emp.age_lm <- lm(salaryLN ~ age, data = emp)) %>% anova()
emp.age_lm %>% summary
##
## Call:
## lm(formula = salaryLN ~ age, data = emp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.64494 -0.26138 -0.09744 0.17266 1.54894
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.804292 0.096177 112.337 < 2e-16 ***
## age -0.007203 0.001520 -4.738 0.00000286 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.389 on 471 degrees of freedom
## Multiple R-squared: 0.04549, Adjusted R-squared: 0.04346
## F-statistic: 22.45 on 1 and 471 DF, p-value: 0.000002865
The model indicates that age is also a significant predictor of the natural log of salary (p < .001). The negative coefficient shows that the relationship between age and log salary is negative, which could be due to the decrease in earnings after retirement; perhaps all ages over 65 should be removed from the dataset so as not to skew the fit. Additionally, the adjusted \(R^2\) shows that the age model explains only about 4% of the variance, leaving roughly 96% unexplained by age. So although age is a statistically significant predictor of salary, it does not adequately explain the natural log of salary, given the low \(R^2\).
Interpreting regression coefficients.
The regression coefficient of age shows that for each additional year of age the natural log of salary decreases by 0.007203. In more meaningful terms (using the raw-salary model): for each additional year of age, salary decreases by about 208.98.
Prediction equation
\(y_{salary} = 47408.836951 - 208.980847\,X_{age} + \epsilon\)
2f. Predict values of salary for someone 20, 30 and 40 years old, if age is a useful predictor.
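As in 2d, the emp.age.sal_lm object used below is not defined in this excerpt; presumably it is the raw salary-on-age fit behind the prediction equation above, something like:
# Assumed definition (not shown in this excerpt): salary regressed on age
emp.age.sal_lm <- lm(salary ~ age, data = emp)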
cbind(data.frame(Age = c(20, 30, 40)), salary = predict(emp.age.sal_lm, newdata = data.frame(age = c(20,
30, 40)))) %>% kableExtra::kable("html") %>% kableExtra::kable_styling(position = "center")
Age | salary |
---|---|
20 | 43229.22 |
30 | 41139.41 |
40 | 39049.60 |
If we include only individuals under 60 years old, we might fit a more accurate model.
(emp.ret.age_lm <- lm(salary ~ 0 + age, data = emp[emp$age < 60, ])) %>% summary
##
## Call:
## lm(formula = salary ~ 0 + age, data = emp[emp$age < 60, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -14786 -9497 -5174 3152 55783
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## age 657.39 15.66 41.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14110 on 279 degrees of freedom
## Multiple R-squared: 0.8633, Adjusted R-squared: 0.8628
## F-statistic: 1762 on 1 and 279 DF, p-value: < 0.00000000000000022
Setting the y-intercept to 0 and excluding those likely to be retired (ages 60 and over), we now see a positive relationship between age and salary. The adjusted \(R^2\) of ~.86 is far higher than the previous model’s ~.04, although \(R^2\) is computed differently when the intercept is suppressed, so the two values are not strictly comparable. It nonetheless appears that we have found the additional factor shaping the relationship between age and salary: retirement.
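As a sanity check (a sketch, not part of the original analysis), the same subset can be refit with the intercept retained, to see whether the high \(R^2\) survives when it is measured against the mean rather than against zero:
# Same age < 60 subset, but keeping the intercept
lm(salary ~ age, data = emp[emp$age < 60, ]) %>% summary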
cbind(data.frame(Age = c(20, 30, 40)), salary = predict(emp.ret.age_lm, newdata = data.frame(age = c(20,
30, 40)))) %>% kableExtra::kable("html") %>% kableExtra::kable_styling(position = "center")
Age | salary |
---|---|
20 | 13147.85 |
30 | 19721.77 |
40 | 26295.69 |
We now see more encouraging (and more plausible) predictions, with salary increasing with age.
In the previous week, the homogeneity of variance assumption was not tested.
homoVariance <- function(form, dat) {
form <- as.formula(form)
out <- list(Levene = car::leveneTest(form, data = dat), Bartlett = bartlett.test(form,
data = dat), Fligner = fligner.test(form, data = dat))
out
}
hv.age <- homoVariance(salaryLN ~ cut(age, 5), emp)
hv.edu <- homoVariance(salaryLN ~ as.factor(educ_cat), emp)
All tests of homogeneity of variance of the log of salary across age groups indicate that the variance is heterogeneous: Levene’s test \(F(4, 468) = 4.09\), \(p < .01\); Bartlett’s test \(K^2(4) = 24.14\), \(p < .001\); and the Fligner-Killeen test \(\chi^2(4) = 17.56\), \(p < .01\). This violates an assumption required by linear regression. The same is true across categories of education.
The previous week’s write-up discussed the regression model fit in terms of the raw correlation (Pearson’s R) and \(R^2\), and interpreted the regression coefficients. In that discussion, however, we regressed salary itself on age or education so the coefficients could be explained in meaningful terms. On review, more accuracy can be achieved by keeping the regression of the log of salary on age or educational level and simply exponentiating to convert predictions back from log salary to salary; this transformation is demonstrated below. First we determine which log base produces the most normal-looking distribution and rerun the model using salary transformed with the selected log function.
emp2 <- emp %>% mutate(logsal = log(salary), log2 = log2(salary), log10 = log10(salary)) %>%
select(salary, logsal, log2, log10, age, educ, educ_cat)
visEDA(emp2[, c("logsal", "log2", "log10")])
## [1] 1 2 3
The distributions of the different log transforms appear roughly equivalent. For simplicity, the natural log of salary will be used as the response variable.
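A quick numeric check of this observation (a sketch): because log2 and log10 are constant multiples of the natural log, any normality statistic is identical across the three bases, so the choice is purely cosmetic.
# Shapiro-Wilk W is invariant to linear rescaling, so all three columns give the same value
sapply(emp2[, c("logsal", "log2", "log10")], function(v) shapiro.test(v)$statistic)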
# apply(expand.grid(.x = names(emp2)[c(5,6,7)],.y =
# names(emp2)[c(1:2)],stringsAsFactors = F),1,dat=emp2,function(x,dat){ form <-
# paste0(x[['.y']],'~',x[['.x']],'+','I(',x[['.x']],'^2)+I(',x[['.x']],'^3)+log(',x[['.x']],')+exp(',x[['.x']],')')
# x_lm <- lm(as.formula(form),data=dat) mod.packages <- c('glmnet', 'doParallel',
# 'foreach', 'pROC') invisible(startPkgs(mod.packages)) registerDoParallel()
# train <- caret::createDataPartition(dat[[x[['.y']]]],list=F) a <- seq(0.1, 0.9,
# 0.05) search <- foreach(i = a, .combine = rbind) %dopar% { cv <-
# glmnet::cv.glmnet(as.matrix(dat[train,x[['.x']]]),
# as.matrix(dat[train,x[['.y']]]), family = 'gaussian', nfold = 5, type.measure =
# 'deviance', paralle = TRUE, alpha = i) data.frame(cvm = cv$cvm[cv$lambda ==
# cv$lambda.1se], lambda.1se = cv$lambda.1se, alpha = i) } cv <-
# search[search$cvm == min(search$cvm), ] md <-
# glmnet(as.matrix(dat[train,x[['.x']]]), as.matrix(dat[train,x[['.y']]]), family
# = 'gaussian', lambda = cv$lambda.1se, alpha = cv$alpha)
# unloadPkgs(mod.packages) registerDoSeq() roc(dat[!train %in%
# row.names(dat),x[['.y']]], as.numeric(predict(md, dat[!train %in%
# row.names(dat),x[['.x']]], type = 'response'))) })
# For each candidate predictor (age, educ, educ_cat), build polynomial/log/exp
# features and fit a cross-validated elastic net predicting salary
mds <- purrr::map(.x = names(emp2)[c(5, 6, 7)], y = "salary", dat = emp2, function(x,
y, dat) {
dat <- dat %>% mutate(sq = (!!rlang::sym(x))^2, cu = (!!rlang::sym(x))^3, log = log((!!rlang::sym(x))),
exp = exp((!!rlang::sym(x)))) %>% select(salary, (!!x), sq, cu, log, exp)
xvar <- colnames(dat[, -1])
form <- paste0(y, "~", paste0(xvar, collapse = "+"))
mod.packages <- c("glmnet", "doParallel", "foreach")
invisible(startPkgs(mod.packages))
registerDoParallel()
train <- caret::createDataPartition(dat[[y]], list = F)
a <- seq(0.1, 0.9, 0.05)
    # Grid-search the elastic-net mixing parameter alpha with 5-fold CV
    search <- foreach(i = a, .combine = rbind) %dopar% {
        cv <- glmnet::cv.glmnet(x = as.matrix(dat[train, xvar]), y = as.matrix(dat[train,
            y]), family = "gaussian", nfolds = 5, type.measure = "deviance", parallel = TRUE,
            alpha = i)
data.frame(cvm = cv$cvm[cv$lambda == cv$lambda.1se], lambda.1se = cv$lambda.1se,
alpha = i)
}
cv <- search[search$cvm == min(search$cvm), ]
out <- list()
out$md <- md <- glmnet(as.matrix(dat[train, xvar]), as.matrix(dat[train, y]),
family = "gaussian", lambda = cv$lambda.1se, alpha = cv$alpha)
registerDoSEQ()
out$rmse <- caret::RMSE(dat[!as.numeric(row.names(dat)) %in% train, y], as.numeric(predict(md,
as.matrix(dat[!as.numeric(row.names(dat)) %in% train, xvar]), type = "link")))
unloadPkgs(mod.packages)
return(out)
})
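The elastic-net fits are stored in mds but not inspected in this excerpt; as a sketch, the held-out RMSE for each candidate predictor could be pulled out for comparison:
# Held-out RMSE for each of the three feature sets (age, educ, educ_cat)
setNames(purrr::map_dbl(mds, "rmse"), names(emp2)[c(5, 6, 7)])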
Examining Model fits
g <- apply(expand.grid(.x = names(emp2)[c(5, 6, 7)], .y = names(emp2)[c(1, 2)], stringsAsFactors = F),
1, dat = emp2, function(x, dat) {
lms <- vector("list", length = 4)
names(lms) <- c("log", "exp", "sq", "cu")
lms$log$form <- as.formula(paste0(x[[".y"]], "~log(", x[[".x"]], ")"))
lms$exp$form <- as.formula(paste0(x[[".y"]], "~exp(", x[[".x"]], ")"))
lms$sq$form <- as.formula(paste0(x[[".y"]], "~I(", x[[".x"]], "^2)"))
lms$cu$form <- as.formula(paste0(x[[".y"]], "~I(", x[[".x"]], "^3)"))
lms <- lapply(lms, dat = dat, function(l, dat) {
l$lm <- lm(l$form, data = dat)
l$p <- l$lm %>% summary %>% capture.output() %>% str_extract("(?<=p.value\\:[\\s\\<]{1,3})[\\d\\.]+") %>%
.[!is.na(.)] %>% as.numeric() %>% p.txt
l$r2 <- l$lm %>% summary %>% .[["adj.r.squared"]] %>% round(3)
return(l)
})
cap <- vector()
cap[1] <- paste("R2:", lapply(lms, purrr::pluck, "r2") %>% paste0(names(.),
.), collapse = " ")
cap[2] <- paste("p:", lapply(lms, purrr::pluck, "p") %>% paste0(names(.),
.), collapse = " ")
g <- dat %>% ggplot(data = ., mapping = aes_string(x = x[[".x"]], y = x[[".y"]])) +
geom_point() + geom_smooth(method = "lm", se = F) + geom_smooth(method = "loess",
se = F) + labs(title = paste0("Scatter of ", x[[".y"]], " v ", x[[".x"]]),
caption = paste0(cap, collapse = "\n"), x = "", y = "") + theme_bw() +
theme(plot.title = element_text(hjust = 0.5, size = unit(10, "mm"), margin = unit(rep(0,
4), "mm"), vjust = 0), plot.subtitle = element_text(hjust = 0.5,
margin = unit(rep(0, 4), "mm")), plot.caption = element_text(size = unit(7,
"mm"), vjust = 0, margin = unit(rep(0, 4), "mm")), plot.margin = unit(rep(0,
4), "mm"), axis.title.x = element_text(margin = unit(rep(0, 4), "mm")),
axis.text.x = element_text(margin = unit(rep(0, 4), "mm")))
for (i in seq_along(lms)) {
g <- g + geom_line(aes(x = x, y = y), color = RColorBrewer::brewer.pal(8,
"Accent")[i], data = data.frame(y = predict(lm(lms[[i]]$form, data = dat),
newdata = dat), x = dat[[x[[".x"]]]]))
}
return(g)
})
purrr::walk(split(g, rep(1:{
numbers::divisors(length(g))[3]
}, length(g)/numbers::divisors(length(g))[3])), function(g) gridExtra::grid.arrange(grobs = g,
ncol = 2))
The exploratory plots above indicate that the cubic transformation of years of education best explains the variance in the log of salary, with an \(R^2\) of 0.576. Similarly, the cubic transformation of age best explains the log of salary compared to other transformations of age, with an \(R^2\) of 0.066.
\[y_{log(salary)} = \beta_0 + \beta_1\, educ^3 + \epsilon\]
lm(logsal ~ I(educ^3), data = emp2) %>% assign("emp2_lm", ., envir = .GlobalEnv) %>%
anova()
emp2_lm %>% summary
##
## Call:
## lm(formula = logsal ~ I(educ^3), data = emp2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.62456 -0.17476 -0.02374 0.15956 0.95524
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.850335457 0.023273766 423.24 <2e-16 ***
## I(educ^3) 0.000181680 0.000007175 25.32 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2591 on 471 degrees of freedom
## Multiple R-squared: 0.5765, Adjusted R-squared: 0.5756
## F-statistic: 641.1 on 1 and 471 DF, p-value: < 0.00000000000000022
\[y_{log(salary)} = 9.850335 + 0.000182\, educ^3\] \[y_{log(salary)} = \beta_0 + \beta_1\, age^3 + \epsilon\]
lm(logsal ~ I(age^3), data = emp2) %>% assign("emp2.age_lm", ., envir = .GlobalEnv) %>%
anova()
emp2.age_lm %>% summary
##
## Call:
## lm(formula = logsal ~ I(age^3), data = emp2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.64507 -0.26245 -0.09806 0.17909 1.55440
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.5260873430 0.0338052765 311.374 < 2e-16 ***
## I(age^3) -0.0000006340 0.0000001078 -5.882 0.0000000077 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3843 on 471 degrees of freedom
## Multiple R-squared: 0.06843, Adjusted R-squared: 0.06645
## F-statistic: 34.6 on 1 and 471 DF, p-value: 0.000000007698
\[y_{log(salary)} = 10.526087 - 0.00000063\, age^3\]
The regression coefficient of education cubed shows the following relationship between education and salary:
pred.edu <- data.frame(educ = seq(10, 19, 1))
pred.edu$y <- exp(predict(emp2_lm, newdata = pred.edu[, "educ", drop = F]))
pred.edu %>% kableExtra::kable("html") %>% kableExtra::kable_styling(position = "center")
educ | y |
---|---|
10 | 22743.07 |
11 | 24152.71 |
12 | 25959.13 |
13 | 28268.03 |
14 | 31221.61 |
15 | 35014.09 |
16 | 39914.59 |
17 | 46301.51 |
18 | 54715.04 |
19 | 65938.60 |
The regression coefficient of age cubed shows the following relationship between age and salary:
pred.age <- data.frame(age = seq(30, 39, 1))
pred.age$y <- exp(predict(emp2.age_lm, newdata = pred.age[, "age", drop = F]))
pred.age %>% kableExtra::kable("html") %>% kableExtra::kable_styling(position = "center")
age | y |
---|---|
30 | 36642.71 |
31 | 36577.93 |
32 | 36508.96 |
33 | 36435.68 |
34 | 36357.99 |
35 | 36275.77 |
36 | 36188.92 |
37 | 36097.33 |
38 | 36000.90 |
39 | 35899.55 |
Using the step function (stepwise selection by AIC), we can determine which combination of transformations of each independent variable best fits the data.
best_lm <- list()
best_lm$edu <- invisible(step(lm(salary ~ educ + log(educ) + exp(educ) + I(educ^2) +
I(educ^3), data = emp2), direction = "both"))
## Start: AIC=8752.3
## salary ~ educ + log(educ) + exp(educ) + I(educ^2) + I(educ^3)
##
## Df Sum of Sq RSS AIC
## <none> 50112552236 8752.3
## - exp(educ) 1 699016472 50811568708 8756.9
## - I(educ^3) 1 3425487453 53538039689 8781.6
## - I(educ^2) 1 3574584291 53687136526 8782.9
## - educ 1 3661992608 53774544843 8783.7
## - log(educ) 1 3711100000 53823652236 8784.1
best_lm$age <- invisible(step(lm(salary ~ age + log(age) + exp(age) + I(age^2) +
I(age^3), data = emp2), direction = "both"))
## Start: AIC=9123.34
## salary ~ age + log(age) + exp(age) + I(age^2) + I(age^3)
##
## Df Sum of Sq RSS AIC
## - exp(age) 1 35613230 109841471510 9121.5
## - log(age) 1 123159522 109929017801 9121.9
## - age 1 216673191 110022531471 9122.3
## - I(age^2) 1 316462648 110122320928 9122.7
## - I(age^3) 1 409601069 110215459349 9123.1
## <none> 109805858280 9123.3
##
## Step: AIC=9121.5
## salary ~ age + log(age) + I(age^2) + I(age^3)
##
## Df Sum of Sq RSS AIC
## - log(age) 1 87869826 109929341336 9119.9
## - age 1 184027895 110025499405 9120.3
## - I(age^2) 1 295602106 110137073616 9120.8
## - I(age^3) 1 406672367 110248143877 9121.2
## <none> 109841471510 9121.5
## + exp(age) 1 35613230 109805858280 9123.3
##
## Step: AIC=9119.88
## salary ~ age + I(age^2) + I(age^3)
##
## Df Sum of Sq RSS AIC
## <none> 109929341336 9119.9
## + log(age) 1 87869826 109841471510 9121.5
## + exp(age) 1 323534 109929017801 9121.9
## - I(age^3) 1 9140005097 119069346433 9155.7
## - I(age^2) 1 10508615238 120437956574 9161.1
## - age 1 11831105780 121760447115 9166.2
best_lm$edu3 <- emp2_lm
best_lm$age3 <- emp2.age_lm
# Quick influence diagnostics for each selected model: Cook's distance, hat values, studentized residuals
invisible(lapply(names(best_lm), blm = best_lm, function(nm, blm) {
    cooks.distance(blm[[nm]]) %>% plot(main = paste("Cook's distance on", nm))
    hatvalues(blm[[nm]]) %>% plot(main = paste("hatvalues on", nm))
    rstudent(blm[[nm]]) %>% plot(main = paste("Rstudent on", nm))
}))
tags$p("The lm function in R comes built in with useful residual diagnostic plotting functions:")
The lm function in R comes built in with useful residual diagnostic plotting functions:
purrr::map(names(best_lm), blm = best_lm, function(.x, blm) {
out <- list()
par(mfrow = c(2, 2))
out$resid <- plot(blm[[.x]])
out$sum <- summary(blm[[.x]])
out$outlier <- car::outlierTest(blm[[.x]])
out$leverage <- car::leveragePlots(blm[[.x]])
    out$influence <- car::influencePlot(blm[[.x]], id.method = "identify", main = "Influence Plot",
        sub = paste0("Circle size is proportional to Cook's Distance", "\n", blm[[.x]]$call))
out$hetero <- car::ncvTest(blm[[.x]], sub = blm[[.x]]$call)
out$spread <- car::spreadLevelPlot(blm[[.x]], sub = blm[[.x]]$call)
return(out)
})
## [[1]]
## [[1]]$sum
##
## Call:
## lm(formula = salary ~ educ + log(educ) + exp(educ) + I(educ^2) +
## I(educ^3), data = emp2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35625 -5055 -1305 4095 63375
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8477029.21161617 1435958.03513955 -5.903 0.00000000686 ***
## educ -2964114.70248752 507400.73806618 -5.842 0.00000000969 ***
## log(educ) 12462555.49251605 2119193.42206785 5.881 0.00000000779 ***
## exp(educ) 0.00006250 0.00002449 2.552 0.011 *
## I(educ^2) 113274.35299651 19626.08338658 5.772 0.00000001432 ***
## I(educ^3) -1856.18856461 328.53052140 -5.650 0.00000002794 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10360 on 467 degrees of freedom
## Multiple R-squared: 0.6366, Adjusted R-squared: 0.6328
## F-statistic: 163.6 on 5 and 467 DF, p-value: < 0.00000000000000022
##
##
## [[1]]$outlier
## rstudent unadjusted p-value Bonferonni p
## 29 6.481779 0.00000000023121 0.00000010936
## 18 5.738225 0.00000001724800 0.00000815830
## 343 5.711401 0.00000001999100 0.00000945590
## 445 5.338390 0.00000014677000 0.00006942400
## 218 4.676841 0.00000382180000 0.00180770000
##
## [[1]]$leverage
## [1] 0
##
## [[1]]$influence
## StudRes Hat CookD
## 18 5.7382252 0.009585959 0.049716711
## 29 6.4817791 0.030894688 0.205206776
## 32 3.8815404 0.030894688 0.077710859
## 130 -0.6493864 0.113266841 0.008988829
## 137 -0.1325615 0.935419666 0.042511330
##
## [[1]]$hetero
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 242.7391 Df = 1 p = 9.941761e-55
##
## [[1]]$spread
##
## Suggested power transformation: -0.142165
##
##
## [[2]]
## [[2]]$sum
##
## Call:
## lm(formula = salary ~ age + I(age^2) + I(age^3), data = emp2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23280 -9032 -3580 4282 104431
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1157801.0809 160071.8638 -7.233 0.00000000000194 ***
## age 51976.4754 7315.8459 7.105 0.00000000000451 ***
## I(age^2) -734.5035 109.6962 -6.696 0.00000000006148 ***
## I(age^3) 3.3700 0.5397 6.245 0.00000000095310 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15310 on 469 degrees of freedom
## Multiple R-squared: 0.2029, Adjusted R-squared: 0.1978
## F-statistic: 39.8 on 3 and 469 DF, p-value: < 0.00000000000000022
##
##
## [[2]]$outlier
## rstudent unadjusted p-value Bonferonni p
## 29 7.222224 0.0000000000020866 0.00000000098695
## 32 4.530767 0.0000074697000000 0.00353320000000
## 343 4.036663 0.0000633310000000 0.02995600000000
## 18 3.967333 0.0000840760000000 0.03976800000000
##
## [[2]]$leverage
## [1] 0
##
## [[2]]$influence
## StudRes Hat CookD
## 29 7.2222244 0.010673081 0.1268432610
## 32 4.5307672 0.006383024 0.0316500560
## 152 0.2308602 0.065473535 0.0009353835
## 442 -0.3867922 0.065473535 0.0026251723
##
## [[2]]$hetero
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 75.08295 Df = 1 p = 4.513471e-18
##
## [[2]]$spread
##
## Suggested power transformation: -1.136487
##
##
## [[3]]
## [[3]]$sum
##
## Call:
## lm(formula = logsal ~ I(educ^3), data = emp2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.62456 -0.17476 -0.02374 0.15956 0.95524
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.850335457 0.023273766 423.24 <2e-16 ***
## I(educ^3) 0.000181680 0.000007175 25.32 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2591 on 471 degrees of freedom
## Multiple R-squared: 0.5765, Adjusted R-squared: 0.5756
## F-statistic: 641.1 on 1 and 471 DF, p-value: < 0.00000000000000022
##
##
## [[3]]$outlier
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## 18 3.743687 0.00020381 0.096401
##
## [[3]]$leverage
## [1] 0
##
## [[3]]$influence
## StudRes Hat CookD
## 18 3.743687 0.003429353 0.02346578
## 29 2.806470 0.014833362 0.05844213
## 130 -1.219408 0.022958697 0.01745235
## 137 -1.774181 0.034261355 0.05558208
## 343 3.733951 0.003429353 0.02334740
##
## [[3]]$hetero
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 16.72445 Df = 1 p = 0.0000432204
##
## [[3]]$spread
##
## Suggested power transformation: -3.097898
##
##
## [[4]]
## [[4]]$sum
##
## Call:
## lm(formula = logsal ~ I(age^3), data = emp2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.64507 -0.26245 -0.09806 0.17909 1.55440
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.5260873430 0.0338052765 311.374 < 2e-16 ***
## I(age^3) -0.0000006340 0.0000001078 -5.882 0.0000000077 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3843 on 471 degrees of freedom
## Multiple R-squared: 0.06843, Adjusted R-squared: 0.06645
## F-statistic: 34.6 on 1 and 471 DF, p-value: 0.000000007698
##
##
## [[4]]$outlier
## rstudent unadjusted p-value Bonferonni p
## 29 4.121237 0.000044544 0.021069
##
## [[4]]$leverage
## [1] 0
##
## [[4]]$influence
## StudRes Hat CookD
## 29 4.1212374 0.003991895 0.0329190426
## 32 3.3221511 0.002118292 0.0114698623
## 137 2.3437321 0.007599687 0.0208339439
## 152 0.7082557 0.018878325 0.0048311499
## 442 -0.2190513 0.018878325 0.0004625737
##
## [[4]]$hetero
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 2.189006 Df = 1 p = 0.138999
##
## [[4]]$spread
##
## Suggested power transformation: -5.888566
If there were any questions about outliers, heteroscedasticity, or influential points in the data, these residual diagnostic visualizations should answer them.