# Read the SPSS employee data and coerce the numeric study variables
emp <- foreign::read.spss("../Week 5/Employee data.sav", to.data.frame = TRUE)
emp %<>% mutate_at(.vars = vars(salary, salbegin, jobtime, prevexp, educ),
    .funs = funs(as.numeric(as.character(.))))
# SPSS stores dates as seconds since 1582-10-14; convert birth date to age in years
emp$age <- emp$bdate %>% {
    ./(24 * 60 * 60)  # seconds -> days
} %>% as.Date(origin = "1582-10-14") %>% {
    as.Date(lubridate::today()) - .  # days old as of today
} %>% {
    ./365.25  # days -> years
} %>% round %>% as.numeric
# Recode the indicator variables as factors
emp %<>% mutate_at(.vars = vars(female, custodial, manager),
    .funs = funs(factor(., labels = c(FALSE, TRUE))))
# Drop rows with missing birth dates
emp <- emp[!is.na(emp$age), ]
If you haven’t already, compute bivariate correlations between educational level (in years) and the log of current salary (salary_ln, used because we previously determined that salary is not normally distributed), and between age in years (age_yr) and salary_ln.
emp$logsal <- log(emp$salary)
tags$h2("Testing Assumptions and evaluating normality:")
invisible(rbind(c("educ", "logsal"), c("age", "salaryLN")) %>% sapply(emp = emp,
function(x, emp) {
sp <- shapiro.test(emp[[x]])
g <- list()
g$hist <- emp %>% ggplot(data = ., mapping = aes_string(x = x)) + geom_histogram() +
labs(title = paste0("Histogram of ", x), caption = paste0(sp[["method"]],
": ", sp[["p.value"]] %>% p.txt, "\n", ifelse(sp[["p.value"]] < 0.05,
"not-normally distributed", "normally distributed")))
g$qq <- ggpubr::ggqqplot(data = emp[[x]], ylab = x, main = paste0("QQplot of ",
x))
g <- gridExtra::grid.arrange(grobs = g, nrow = 1)
return(g)
}))
tags$p("None of the variables appear to be normally distributed. Spearman's Rho will be used in addition to Pearson's R.")
None of the variables appear to be normally distributed. Spearman's Rho will be used in addition to Pearson's R.
tags$h2("1a. Correlation testing & Scatter Plots")
invisible(apply(rbind(c("educ", "logsal"), c("age", "salaryLN")), emp = emp, 1, function(r,
emp) {
pear <- cor.test(emp[[r[[1]]]], emp[[r[[2]]]], method = "pearson")
spear <- cor.test(emp[[r[[1]]]], emp[[r[[2]]]], method = "spearman")
g <- ggplot(data = emp, mapping = aes_string(x = r[[1]], y = r[[2]])) + geom_point() +
geom_smooth(method = "lm") + annotate("text", x = sort(emp[[r[[1]]]])[length(emp[[r[[1]]]])/3],
y = sort(emp[[r[[2]]]])[length(emp[[r[[2]]]])], label = paste0("Pearson's R: ",
round(cor(emp[[r[[1]]]], emp[[r[[2]]]], method = c("pearson")), 6), "\n",
"Spearman's Rho: ", round(cor(emp[[r[[1]]]], emp[[r[[2]]]], method = c("spearman")),
6))) + labs(title = paste0("Bivariate Correlation between ", r[[1]],
" & ", r[[2]]), xlab = r[[1]], ylab = r[[2]], caption = paste0(pear[["method"]],
": ", pear[["p.value"]] %>% p.txt, " ", ifelse(pear[["p.value"]] < 0.05,
"Correlated", "No-correlation"), "\n", spear[["method"]], ": ", spear[["p.value"]] %>%
p.txt, " ", ifelse(spear[["p.value"]] < 0.05, "Correlated", "No-correlation")))
print(g)
return(g)
}))
Note that Spearman’s Rho, which is more suitable for this data due to its deviations from normality, did not indicate a significant correlation at \(\alpha = .05\) but did at \(\alpha = .1\).
1b. Do you expect that educational level and age would be predictive of salary and why? Which one will be a stronger predictor?
Education level and age will both be predictive of salary, and education will be the stronger predictor. Higher levels of education tend to give people a deeper understanding of specific disciplines, a greater capacity to grasp complex problems, a more nuanced worldview, and (hopefully) a more developed ability to work tactfully and productively with people of different intellectual and cultural backgrounds; all of these ideally contribute to increased earnings over one’s lifetime. Age will be a more general and less significant predictor.
The linear regression line in the figure above shows that the correlation between age and the natural log of salary is negative. This runs counter to the notion that individuals tend to earn more as they accumulate years of experience. Perhaps this negative relationship can be accounted for by low-salary outliers around retirement age?
emp %>% filter(age > 65) %>% group_by(age) %>% summarize(salary = mean(salary)) %>%
mutate(p.val = pnorm(salary, mean(emp$salary), sd = sd(emp$salary)))
pear <- cor.test(emp$age, emp$salary, method = "pearson")
spear <- cor.test(emp$age, emp$salary, method = "spearman")
emp %>% mutate(p.val = pnorm(salary, mean(emp$salary), sd = sd(emp$salary))) %>%
mutate(`P<.1` = p.val < 0.1) %>% ggplot(data = ., mapping = aes(x = age, y = salary,
color = `P<.1`)) + geom_point() + geom_smooth(method = "lm") + annotate("text",
x = sort(emp$age)[length(emp$age)/3], y = sort(emp$salary)[length(emp$salary)],
label = paste0("Pearson's R: ", round(cor(emp$age, emp$salary, method = c("pearson")),
6), "\n", "Spearman's Rho: ", round(cor(emp$age, emp$salary, method = c("spearman")),
6))) + labs(title = paste0("Bivariate Correlation between ", "Age", " & ",
"Salary grouped by P<.1"), xlab = "Age", ylab = "Salary", caption = paste0(pear[["method"]],
": ", pear[["p.value"]] %>% p.txt, " ", ifelse(pear[["p.value"]] < 0.05, "Correlated",
"No-correlation"), "\n", spear[["method"]], ": ", spear[["p.value"]] %>%
p.txt, " ", ifelse(spear[["p.value"]] < 0.05, "Correlated", "No-correlation")))
The table and graph above suggest that retirement could be exerting a significant influence on the relationship between age and salary. If we control for this factor, what does the relationship become?
pear <- cor.test(emp$age, emp$salary, method = "pearson")
spear <- cor.test(emp$age, emp$salary, method = "spearman")
emp %>% mutate(p.val = pnorm(salary, mean(emp$salary), sd = sd(emp$salary))) %>%
mutate(`P<.1` = p.val < 0.1) %>% filter(age < 65) %>% ggplot(data = ., mapping = aes(x = age,
y = salary, color = `P<.1`)) + geom_point() + geom_smooth(method = "lm") + annotate("text",
x = sort(emp$age)[length(emp$age)/3], y = sort(emp$salary)[length(emp$salary)],
label = paste0("Pearson's R: ", round(cor(emp$age, emp$salary, method = c("pearson")),
6), "\n", "Spearman's Rho: ", round(cor(emp$age, emp$salary, method = c("spearman")),
6))) + labs(title = paste0("Bivariate Correlation between ", "Age", " & ",
"Salary grouped by P<.1"), subtitle = "All Ages > 65 are removed", xlab = "Age",
ylab = "Salary", caption = paste0(pear[["method"]], ": ", pear[["p.value"]] %>%
p.txt, " ", ifelse(pear[["p.value"]] < 0.05, "Correlated", "No-correlation"),
"\n", spear[["method"]], ": ", spear[["p.value"]] %>% p.txt, " ", ifelse(spear[["p.value"]] <
0.05, "Correlated", "No-correlation")))
When we flag the bottom 10% of the salary distribution, we see a positive relationship between age and salary indicated by the pink fitted line. However, the Pearson and Spearman coefficients in the caption do not control for these points, and the two individuals over age 60 with salaries below the 10% threshold are enough to keep the overall relationship between age and salary negative.
Write out a theoretical equation for regression with log salary as an outcome and education as a predictor.
\(y_{logsal} = \beta_0 + \beta_1 X_{educ} + \epsilon\)
2a. Fit the model in a software of your choice. Comment on the model fit (ANOVA table).
(emp_lm <- lm(salaryLN ~ educ, data = emp)) %>% anova()
emp_lm %>% summary
##
## Call:
## lm(formula = salaryLN ~ educ, data = emp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.66302 -0.19375 -0.03601 0.16519 0.95172
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.061217 0.062816 144.25 <2e-16 ***
## educ 0.096050 0.004555 21.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2856 on 471 degrees of freedom
## Multiple R-squared: 0.4856, Adjusted R-squared: 0.4845
## F-statistic: 444.7 on 1 and 471 DF, p-value: < 0.00000000000000022
The model indicates that education is a significant predictor of the natural log of salary (p < .001), and the adjusted \(R^2\) shows that the model explains roughly 49% of the variance in log salary, leaving about 51% unexplained by education.
2b. Interpret regression coefficients.
The regression coefficient of education shows that for each additional year of education the natural log of salary increases by 0.09605. A coefficient on the log scale is not easy to explain in plain terms, however. Regressing salary itself on education (even though this violates the assumptions of linear regression more than the log model does) gives a more interpretable coefficient: for each additional year of education, salary increases by about 3915.84.
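For reference, the multiplicative reading of the log-scale coefficient can also be computed directly from the fit above (a quick sketch):
# exp(beta) is the factor by which salary changes per additional year of education
exp(coef(emp_lm)[["educ"]])  # ~1.10, i.e. roughly a 10% increase per year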
2c. Can education be used to predict salary and what would be a prediction equation?
\(y_{salary} = -18391.589074 + 3915.840513 X_{educ} + \epsilon\)
2d. What would be predicted values of salary for someone with 12 years of education, 16 years, 21 years?
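Note that the prediction below relies on an emp.sal_lm object that is not defined in the code shown above; presumably it is the raw salary-on-education fit behind the equation in 2c, something like:
# Assumed definition (not shown in this excerpt): salary regressed on education
emp.sal_lm <- lm(salary ~ educ, data = emp)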
cbind(data.frame(educ = c(12, 16, 21)), salary = predict(emp.sal_lm, newdata = data.frame(educ = c(12,
16, 21)))) %>% kableExtra::kable("html") %>% kableExtra::kable_styling(position = "center")
educ | salary |
---|---|
12 | 28598.50 |
16 | 44261.86 |
21 | 63841.06 |
2e. Repeat the same for age as a predictor for salary.
(emp.age_lm <- lm(salaryLN ~ age, data = emp)) %>% anova()
emp.age_lm %>% summary
##
## Call:
## lm(formula = salaryLN ~ age, data = emp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.64494 -0.26138 -0.09744 0.17266 1.54894
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.804292 0.096177 112.337 < 2e-16 ***
## age -0.007203 0.001520 -4.738 0.00000286 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.389 on 471 degrees of freedom
## Multiple R-squared: 0.04549, Adjusted R-squared: 0.04346
## F-statistic: 22.45 on 1 and 471 DF, p-value: 0.000002865
The model indicates that age is also a significant predictor of the natural log of salary (p < .001). The negative coefficient shows that the relationship between age and log salary is negative, which could be due to the decrease in earnings after retirement; perhaps all ages over 65 should be removed from the dataset so as not to skew the fit. Additionally, the adjusted \(R^2\) shows that the age model explains only about 4% of the variance, leaving roughly 96% unexplained by age. So although age is a statistically significant predictor of salary, it does not adequately explain the natural log of salary, given the low \(R^2\).
Interpreting regression coefficients.
The regression coefficient of age shows that for each additional year of age the natural log of salary decreases by 0.007203. In more meaningful terms (using the raw-salary model): for each additional year of age, salary decreases by about 208.98.
Prediction equation
\(y_{salary} = 47408.836951 - 208.980847\,X_{age} + \epsilon\)
2f. Predict values of salary for someone 20, 30 and 40 years old, if age is a useful predictor.
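As in 2d, the emp.age.sal_lm object used below is not defined in this excerpt; presumably it is the raw salary-on-age fit behind the prediction equation above, something like:
# Assumed definition (not shown in this excerpt): salary regressed on age
emp.age.sal_lm <- lm(salary ~ age, data = emp)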
cbind(data.frame(Age = c(20, 30, 40)), salary = predict(emp.age.sal_lm, newdata = data.frame(age = c(20,
30, 40)))) %>% kableExtra::kable("html") %>% kableExtra::kable_styling(position = "center")
Age | salary |
---|---|
20 | 43229.22 |
30 | 41139.41 |
40 | 39049.60 |
If we include only individuals under 60 years old, we might fit a more accurate model.
(emp.ret.age_lm <- lm(salary ~ 0 + age, data = emp[emp$age < 60, ])) %>% summary
##
## Call:
## lm(formula = salary ~ 0 + age, data = emp[emp$age < 60, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -14786 -9497 -5174 3152 55783
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## age 657.39 15.66 41.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14110 on 279 degrees of freedom
## Multiple R-squared: 0.8633, Adjusted R-squared: 0.8628
## F-statistic: 1762 on 1 and 279 DF, p-value: < 0.00000000000000022
Setting the y-intercept to 0 and excluding those likely to be retired (ages 60 and over), we now see a positive relationship between age and salary. The adjusted \(R^2\) of ~.86 is far higher than the previous model’s ~.04, although \(R^2\) is computed differently when the intercept is suppressed, so the two values are not strictly comparable. It nonetheless appears that we have found the additional factor shaping the relationship between age and salary: retirement.
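As a sanity check (a sketch, not part of the original analysis), the same subset can be refit with the intercept retained, to see whether the high \(R^2\) survives when it is measured against the mean rather than against zero:
# Same age < 60 subset, but keeping the intercept
lm(salary ~ age, data = emp[emp$age < 60, ]) %>% summary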
cbind(data.frame(Age = c(20, 30, 40)), salary = predict(emp.ret.age_lm, newdata = data.frame(age = c(20,
30, 40)))) %>% kableExtra::kable("html") %>% kableExtra::kable_styling(position = "center")
Age | salary |
---|---|
20 | 13147.85 |
30 | 19721.77 |
40 | 26295.69 |
We now see more encouraging (and more plausible) predictions, with salary increasing with age.
In the previous week, the homogeneity of variance assumption was not tested.
homoVariance <- function(form, dat) {
form <- as.formula(form)
out <- list(Levene = car::leveneTest(form, data = dat), Bartlett = bartlett.test(form,
data = dat), Fligner = fligner.test(form, data = dat))
out
}
hv.age <- homoVariance(salaryLN ~ cut(age, 5), emp)
hv.edu <- homoVariance(salaryLN ~ as.factor(educ_cat), emp)
All tests of homogeneity of variance of the log of salary across age groups indicate that the variance is heterogeneous: Levene’s test \(F(4, 468) = 4.09\), \(p < .01\); Bartlett’s test \(K^2(4) = 24.14\), \(p < .001\); and the Fligner-Killeen test \(\chi^2(4) = 17.56\), \(p < .01\). This violates an assumption required by linear regression. The same is true across categories of education.
The previous week’s write-up discussed the regression model fit in terms of the raw correlation (Pearson’s R) and \(R^2\), and interpreted the regression coefficients. In that discussion, however, we regressed salary itself on age or education so the coefficients could be explained in meaningful terms. On review, more accuracy can be achieved by keeping the regression of the log of salary on age or educational level and simply exponentiating to convert predictions back from log salary to salary; this transformation is demonstrated below. First we determine which log base produces the most normal-looking distribution and rerun the model using salary transformed with the selected log function.
emp2 <- emp %>% mutate(logsal = log(salary), log2 = log2(salary), log10 = log10(salary)) %>%
select(salary, logsal, log2, log10, age, educ, educ_cat)
visEDA(emp2[, c("logsal", "log2", "log10")])
## [1] 1 2 3
The distributions of the different log transforms appear roughly equivalent. For simplicity, the natural log of salary will be used as the response variable.
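A quick numeric check of this observation (a sketch): because log2 and log10 are constant multiples of the natural log, any normality statistic is identical across the three bases, so the choice is purely cosmetic.
# Shapiro-Wilk W is invariant to linear rescaling, so all three columns give the same value
sapply(emp2[, c("logsal", "log2", "log10")], function(v) shapiro.test(v)$statistic)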
# apply(expand.grid(.x = names(emp2)[c(5,6,7)],.y =
# names(emp2)[c(1:2)],stringsAsFactors = F),1,dat=emp2,function(x,dat){ form <-
# paste0(x[['.y']],'~',x[['.x']],'+','I(',x[['.x']],'^2)+I(',x[['.x']],'^3)+log(',x[['.x']],')+exp(',x[['.x']],')')
# x_lm <- lm(as.formula(form),data=dat) mod.packages <- c('glmnet', 'doParallel',
# 'foreach', 'pROC') invisible(startPkgs(mod.packages)) registerDoParallel()
# train <- caret::createDataPartition(dat[[x[['.y']]]],list=F) a <- seq(0.1, 0.9,
# 0.05) search <- foreach(i = a, .combine = rbind) %dopar% { cv <-
# glmnet::cv.glmnet(as.matrix(dat[train,x[['.x']]]),
# as.matrix(dat[train,x[['.y']]]), family = 'gaussian', nfold = 5, type.measure =
# 'deviance', paralle = TRUE, alpha = i) data.frame(cvm = cv$cvm[cv$lambda ==
# cv$lambda.1se], lambda.1se = cv$lambda.1se, alpha = i) } cv <-
# search[search$cvm == min(search$cvm), ] md <-
# glmnet(as.matrix(dat[train,x[['.x']]]), as.matrix(dat[train,x[['.y']]]), family
# = 'gaussian', lambda = cv$lambda.1se, alpha = cv$alpha)
# unloadPkgs(mod.packages) registerDoSeq() roc(dat[!train %in%
# row.names(dat),x[['.y']]], as.numeric(predict(md, dat[!train %in%
# row.names(dat),x[['.x']]], type = 'response'))) })
# For each candidate predictor (age, educ, educ_cat), build polynomial/log/exp
# features and fit a cross-validated elastic net predicting salary
mds <- purrr::map(.x = names(emp2)[c(5, 6, 7)], y = "salary", dat = emp2, function(x,
y, dat) {
dat <- dat %>% mutate(sq = (!!rlang::sym(x))^2, cu = (!!rlang::sym(x))^3, log = log((!!rlang::sym(x))),
exp = exp((!!rlang::sym(x)))) %>% select(salary, (!!x), sq, cu, log, exp)
xvar <- colnames(dat[, -1])
form <- paste0(y, "~", paste0(xvar, collapse = "+"))
mod.packages <- c("glmnet", "doParallel", "foreach")
invisible(startPkgs(mod.packages))
registerDoParallel()
train <- caret::createDataPartition(dat[[y]], list = F)
a <- seq(0.1, 0.9, 0.05)
    # Grid-search the elastic-net mixing parameter alpha with 5-fold CV
    search <- foreach(i = a, .combine = rbind) %dopar% {
        cv <- glmnet::cv.glmnet(x = as.matrix(dat[train, xvar]), y = as.matrix(dat[train,
            y]), family = "gaussian", nfolds = 5, type.measure = "deviance", parallel = TRUE,
            alpha = i)
data.frame(cvm = cv$cvm[cv$lambda == cv$lambda.1se], lambda.1se = cv$lambda.1se,
alpha = i)
}
cv <- search[search$cvm == min(search$cvm), ]
out <- list()
out$md <- md <- glmnet(as.matrix(dat[train, xvar]), as.matrix(dat[train, y]),
family = "gaussian", lambda = cv$lambda.1se, alpha = cv$alpha)
registerDoSEQ()
out$rmse <- caret::RMSE(dat[!as.numeric(row.names(dat)) %in% train, y], as.numeric(predict(md,
as.matrix(dat[!as.numeric(row.names(dat)) %in% train, xvar]), type = "link")))
unloadPkgs(mod.packages)
return(out)
})
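The elastic-net fits are stored in mds but not inspected in this excerpt; as a sketch, the held-out RMSE for each candidate predictor could be pulled out for comparison:
# Held-out RMSE for each of the three feature sets (age, educ, educ_cat)
setNames(purrr::map_dbl(mds, "rmse"), names(emp2)[c(5, 6, 7)])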
Examining Model fits
g <- apply(expand.grid(.x = names(emp2)[c(5, 6, 7)], .y = names(emp2)[c(1, 2)], stringsAsFactors = F),
1, dat = emp2, function(x, dat) {
lms <- vector("list", length = 4)
names(lms) <- c("log", "exp", "sq", "cu")
lms$log$form <- as.formula(paste0(x[[".y"]], "~log(", x[[".x"]], ")"))
lms$exp$form <- as.formula(paste0(x[[".y"]], "~exp(", x[[".x"]], ")"))
lms$sq$form <- as.formula(paste0(x[[".y"]], "~I(", x[[".x"]], "^2)"))
lms$cu$form <- as.formula(paste0(x[[".y"]], "~I(", x[[".x"]], "^3)"))
lms <- lapply(lms, dat = dat, function(l, dat) {
l$lm <- lm(l$form, data = dat)
l$p <- l$lm %>% summary %>% capture.output() %>% str_extract("(?<=p.value\\:[\\s\\<]{1,3})[\\d\\.]+") %>%
.[!is.na(.)] %>% as.numeric() %>% p.txt
l$r2 <- l$lm %>% summary %>% .[["adj.r.squared"]] %>% round(3)
return(l)
})
cap <- vector()
cap[1] <- paste("R2:", lapply(lms, purrr::pluck, "r2") %>% paste0(names(.),
.), collapse = " ")
cap[2] <- paste("p:", lapply(lms, purrr::pluck, "p") %>% paste0(names(.),
.), collapse = " ")
g <- dat %>% ggplot(data = ., mapping = aes_string(x = x[[".x"]], y = x[[".y"]])) +
geom_point() + geom_smooth(method = "lm", se = F) + geom_smooth(method = "loess",
se = F) + labs(title = paste0("Scatter of ", x[[".y"]], " v ", x[[".x"]]),
caption = paste0(cap, collapse = "\n"), x = "", y = "") + theme_bw() +
theme(plot.title = element_text(hjust = 0.5, size = unit(10, "mm"), margin = unit(rep(0,
4), "mm"), vjust = 0), plot.subtitle = element_text(hjust = 0.5,
margin = unit(rep(0, 4), "mm")), plot.caption = element_text(size = unit(7,
"mm"), vjust = 0, margin = unit(rep(0, 4), "mm")), plot.margin = unit(rep(0,
4), "mm"), axis.title.x = element_text(margin = unit(rep(0, 4), "mm")),
axis.text.x = element_text(margin = unit(rep(0, 4), "mm")))
for (i in seq_along(lms)) {
g <- g + geom_line(aes(x = x, y = y), color = RColorBrewer::brewer.pal(8,
"Accent")[i], data = data.frame(y = predict(lm(lms[[i]]$form, data = dat),
newdata = dat), x = dat[[x[[".x"]]]]))
}
return(g)
})
purrr::walk(split(g, rep(1:{
numbers::divisors(length(g))[3]
}, length(g)/numbers::divisors(length(g))[3])), function(g) gridExtra::grid.arrange(grobs = g,
ncol = 2))
The exploratory plots above indicate that the cubic transformation of years of education best explains the variance in the log of salary, with an \(R^2\) of 0.576. Similarly, the cubic transformation of age best explains the log of salary compared to other transformations of age, with an \(R^2\) of 0.066.
\[y_{log(salary)} = \beta_0 + \beta_1\, educ^3 + \epsilon\]
lm(logsal ~ I(educ^3), data = emp2) %>% assign("emp2_lm", ., envir = .GlobalEnv) %>%
anova()
emp2_lm %>% summary
##
## Call:
## lm(formula = logsal ~ I(educ^3), data = emp2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.62456 -0.17476 -0.02374 0.15956 0.95524
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.850335457 0.023273766 423.24 <2e-16 ***
## I(educ^3) 0.000181680 0.000007175 25.32 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2591 on 471 degrees of freedom
## Multiple R-squared: 0.5765, Adjusted R-squared: 0.5756
## F-statistic: 641.1 on 1 and 471 DF, p-value: < 0.00000000000000022
\[y_{log(salary)} = 9.850335 + 0.000182\, educ^3\] \[y_{log(salary)} = \beta_0 + \beta_1\, age^3 + \epsilon\]
lm(logsal ~ I(age^3), data = emp2) %>% assign("emp2.age_lm", ., envir = .GlobalEnv) %>%
anova()
emp2.age_lm %>% summary
##
## Call:
## lm(formula = logsal ~ I(age^3), data = emp2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.64507 -0.26245 -0.09806 0.17909 1.55440
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.5260873430 0.0338052765 311.374 < 2e-16 ***
## I(age^3) -0.0000006340 0.0000001078 -5.882 0.0000000077 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3843 on 471 degrees of freedom
## Multiple R-squared: 0.06843, Adjusted R-squared: 0.06645
## F-statistic: 34.6 on 1 and 471 DF, p-value: 0.000000007698
\[y_{log(salary)} = 10.526087 - 0.00000063\, age^3\]
The regression coefficient of education cubed shows the following relationship between education and salary:
pred.edu <- data.frame(educ = seq(10, 19, 1))
pred.edu$y <- exp(predict(emp2_lm, newdata = pred.edu[, "educ", drop = F]))
pred.edu %>% kableExtra::kable("html") %>% kableExtra::kable_styling(position = "center")
educ | y |
---|---|
10 | 22743.07 |
11 | 24152.71 |
12 | 25959.13 |
13 | 28268.03 |
14 | 31221.61 |
15 | 35014.09 |
16 | 39914.59 |
17 | 46301.51 |
18 | 54715.04 |
19 | 65938.60 |
The regression coefficient of age cubed shows the following relationship between age and salary:
pred.age <- data.frame(age = seq(30, 39, 1))
pred.age$y <- exp(predict(emp2.age_lm, newdata = pred.age[, "age", drop = F]))
pred.age %>% kableExtra::kable("html") %>% kableExtra::kable_styling(position = "center")
age | y |
---|---|
30 | 36642.71 |
31 | 36577.93 |
32 | 36508.96 |
33 | 36435.68 |
34 | 36357.99 |
35 | 36275.77 |
36 | 36188.92 |
37 | 36097.33 |
38 | 36000.90 |
39 | 35899.55 |
Using the step function (stepwise selection by AIC), we can determine which combination of transformations of each independent variable best fits the data.
best_lm <- list()
best_lm$edu <- invisible(step(lm(salary ~ educ + log(educ) + exp(educ) + I(educ^2) +
I(educ^3), data = emp2), direction = "both"))
## Start: AIC=8752.3
## salary ~ educ + log(educ) + exp(educ) + I(educ^2) + I(educ^3)
##
## Df Sum of Sq RSS AIC
## <none> 50112552236 8752.3
## - exp(educ) 1 699016472 50811568708 8756.9
## - I(educ^3) 1 3425487453 53538039689 8781.6
## - I(educ^2) 1 3574584291 53687136526 8782.9
## - educ 1 3661992608 53774544843 8783.7
## - log(educ) 1 3711100000 53823652236 8784.1
best_lm$age <- invisible(step(lm(salary ~ age + log(age) + exp(age) + I(age^2) +
I(age^3), data = emp2), direction = "both"))
## Start: AIC=9123.34
## salary ~ age + log(age) + exp(age) + I(age^2) + I(age^3)
##
## Df Sum of Sq RSS AIC
## - exp(age) 1 35613230 109841471510 9121.5
## - log(age) 1 123159522 109929017801 9121.9
## - age 1 216673191 110022531471 9122.3
## - I(age^2) 1 316462648 110122320928 9122.7
## - I(age^3) 1 409601069 110215459349 9123.1
## <none> 109805858280 9123.3
##
## Step: AIC=9121.5
## salary ~ age + log(age) + I(age^2) + I(age^3)
##
## Df Sum of Sq RSS AIC
## - log(age) 1 87869826 109929341336 9119.9
## - age 1 184027895 110025499405 9120.3
## - I(age^2) 1 295602106 110137073616 9120.8
## - I(age^3) 1 406672367 110248143877 9121.2
## <none> 109841471510 9121.5
## + exp(age) 1 35613230 109805858280 9123.3
##
## Step: AIC=9119.88
## salary ~ age + I(age^2) + I(age^3)
##
## Df Sum of Sq RSS AIC
## <none> 109929341336 9119.9
## + log(age) 1 87869826 109841471510 9121.5
## + exp(age) 1 323534 109929017801 9121.9
## - I(age^3) 1 9140005097 119069346433 9155.7
## - I(age^2) 1 10508615238 120437956574 9161.1
## - age 1 11831105780 121760447115 9166.2
best_lm$edu3 <- emp2_lm
best_lm$age3 <- emp2.age_lm
# Quick influence diagnostics for each selected model: Cook's distance, hat values, studentized residuals
invisible(lapply(names(best_lm), blm = best_lm, function(nm, blm) {
    cooks.distance(blm[[nm]]) %>% plot(main = paste("Cook's distance on", nm))
    hatvalues(blm[[nm]]) %>% plot(main = paste("hatvalues on", nm))
    rstudent(blm[[nm]]) %>% plot(main = paste("Rstudent on", nm))
}))
tags$p("The lm function in R comes built in with useful residual diagnostic plotting functions:")
The lm function in R comes built in with useful residual diagnostic plotting functions:
purrr::map(names(best_lm), blm = best_lm, function(.x, blm) {
out <- list()
par(mfrow = c(2, 2))
out$resid <- plot(blm[[.x]])
out$sum <- summary(blm[[.x]])
out$outlier <- car::outlierTest(blm[[.x]])
out$leverage <- car::leveragePlots(blm[[.x]])
    out$influence <- car::influencePlot(blm[[.x]], id.method = "identify", main = "Influence Plot",
        sub = paste0("Circle size is proportional to Cook's Distance", "\n", blm[[.x]]$call))
out$hetero <- car::ncvTest(blm[[.x]], sub = blm[[.x]]$call)
out$spread <- car::spreadLevelPlot(blm[[.x]], sub = blm[[.x]]$call)
return(out)
})
## [[1]]
## [[1]]$sum
##
## Call:
## lm(formula = salary ~ educ + log(educ) + exp(educ) + I(educ^2) +
## I(educ^3), data = emp2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35625 -5055 -1305 4095 63375
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8477029.21161617 1435958.03513955 -5.903 0.00000000686 ***
## educ -2964114.70248752 507400.73806618 -5.842 0.00000000969 ***
## log(educ) 12462555.49251605 2119193.42206785 5.881 0.00000000779 ***
## exp(educ) 0.00006250 0.00002449 2.552 0.011 *
## I(educ^2) 113274.35299651 19626.08338658 5.772 0.00000001432 ***
## I(educ^3) -1856.18856461 328.53052140 -5.650 0.00000002794 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10360 on 467 degrees of freedom
## Multiple R-squared: 0.6366, Adjusted R-squared: 0.6328
## F-statistic: 163.6 on 5 and 467 DF, p-value: < 0.00000000000000022
##
##
## [[1]]$outlier
## rstudent unadjusted p-value Bonferonni p
## 29 6.481779 0.00000000023121 0.00000010936
## 18 5.738225 0.00000001724800 0.00000815830
## 343 5.711401 0.00000001999100 0.00000945590
## 445 5.338390 0.00000014677000 0.00006942400
## 218 4.676841 0.00000382180000 0.00180770000
##
## [[1]]$leverage
## [1] 0
##
## [[1]]$influence
## StudRes Hat CookD
## 18 5.7382252 0.009585959 0.049716711
## 29 6.4817791 0.030894688 0.205206776
## 32 3.8815404 0.030894688 0.077710859
## 130 -0.6493864 0.113266841 0.008988829
## 137 -0.1325615 0.935419666 0.042511330
##
## [[1]]$hetero
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 242.7391 Df = 1 p = 9.941761e-55
##
## [[1]]$spread
##
## Suggested power transformation: -0.142165
##
##
## [[2]]
## [[2]]$sum
##
## Call:
## lm(formula = salary ~ age + I(age^2) + I(age^3), data = emp2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23280 -9032 -3580 4282 104431
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1157801.0809 160071.8638 -7.233 0.00000000000194 ***
## age 51976.4754 7315.8459 7.105 0.00000000000451 ***
## I(age^2) -734.5035 109.6962 -6.696 0.00000000006148 ***
## I(age^3) 3.3700 0.5397 6.245 0.00000000095310 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15310 on 469 degrees of freedom
## Multiple R-squared: 0.2029, Adjusted R-squared: 0.1978
## F-statistic: 39.8 on 3 and 469 DF, p-value: < 0.00000000000000022
##
##
## [[2]]$outlier
## rstudent unadjusted p-value Bonferonni p
## 29 7.222224 0.0000000000020866 0.00000000098695
## 32 4.530767 0.0000074697000000 0.00353320000000
## 343 4.036663 0.0000633310000000 0.02995600000000
## 18 3.967333 0.0000840760000000 0.03976800000000
##
## [[2]]$leverage
## [1] 0
##
## [[2]]$influence
## StudRes Hat CookD
## 29 7.2222244 0.010673081 0.1268432610
## 32 4.5307672 0.006383024 0.0316500560
## 152 0.2308602 0.065473535 0.0009353835
## 442 -0.3867922 0.065473535 0.0026251723
##
## [[2]]$hetero
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 75.08295 Df = 1 p = 4.513471e-18
##
## [[2]]$spread
##
## Suggested power transformation: -1.136487
##
##
## [[3]]
## [[3]]$sum
##
## Call:
## lm(formula = logsal ~ I(educ^3), data = emp2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.62456 -0.17476 -0.02374 0.15956 0.95524
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.850335457 0.023273766 423.24 <2e-16 ***
## I(educ^3) 0.000181680 0.000007175 25.32 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2591 on 471 degrees of freedom
## Multiple R-squared: 0.5765, Adjusted R-squared: 0.5756
## F-statistic: 641.1 on 1 and 471 DF, p-value: < 0.00000000000000022
##
##
## [[3]]$outlier
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferonni p
## 18 3.743687 0.00020381 0.096401
##
## [[3]]$leverage
## [1] 0
##
## [[3]]$influence
## StudRes Hat CookD
## 18 3.743687 0.003429353 0.02346578
## 29 2.806470 0.014833362 0.05844213
## 130 -1.219408 0.022958697 0.01745235
## 137 -1.774181 0.034261355 0.05558208
## 343 3.733951 0.003429353 0.02334740
##
## [[3]]$hetero
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 16.72445 Df = 1 p = 0.0000432204
##
## [[3]]$spread
##
## Suggested power transformation: -3.097898
##
##
## [[4]]
## [[4]]$sum
##
## Call:
## lm(formula = logsal ~ I(age^3), data = emp2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.64507 -0.26245 -0.09806 0.17909 1.55440
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.5260873430 0.0338052765 311.374 < 2e-16 ***
## I(age^3) -0.0000006340 0.0000001078 -5.882 0.0000000077 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3843 on 471 degrees of freedom
## Multiple R-squared: 0.06843, Adjusted R-squared: 0.06645
## F-statistic: 34.6 on 1 and 471 DF, p-value: 0.000000007698
##
##
## [[4]]$outlier
## rstudent unadjusted p-value Bonferonni p
## 29 4.121237 0.000044544 0.021069
##
## [[4]]$leverage
## [1] 0
##
## [[4]]$influence
## StudRes Hat CookD
## 29 4.1212374 0.003991895 0.0329190426
## 32 3.3221511 0.002118292 0.0114698623
## 137 2.3437321 0.007599687 0.0208339439
## 152 0.7082557 0.018878325 0.0048311499
## 442 -0.2190513 0.018878325 0.0004625737
##
## [[4]]$hetero
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 2.189006 Df = 1 p = 0.138999
##
## [[4]]$spread
##
## Suggested power transformation: -5.888566
If there were any questions about outliers, heteroscedasticity, or influential points in the data, these residual diagnostic visualizations should answer them.