# Attach dependencies
rmarkdown::html_dependency_jquery()
rmarkdown::html_dependency_bootstrap("spacelab")
rmarkdown::html_dependency_jqueryui()The dataset “Employee_data.sav” was collected to assess a number of educational and demographic characteristics and their relationship to salary. For this homework, we will look at variables “salary” (current salary), “jobcat” (employment category with 3 levels), and “educ” (education). 1. Evaluate whether job type has an impact on salary. Formulate the overall hypothesis and test it with an omnibus F test. Test model assumptions. To evaluate which groups are different, carry out several post-hoc tests and comment on the group differences, comparing results across the post-hoc approaches. Present a graphical summary (e.g., error bar plot) for your results. Summarize the results. 2. Evaluate whether educational level is related to salary. Recode the “salary” variable into 1) high school or less (12 years or less), 2) above high school to bachelor’s degree (years 13 – 16), and 3) some or graduate degree (17 years or more). Provide a plot of means and carry out analyses of trends for these data. Summarize the output and comment on the findings. Include ONLY relevant output in your write-up.
emp <- haven::read_sav("Employee data.sav")
emp <- lapply(emp, unclass) %>% as.data.frame()Evaluate whether job type has an impact on salary. Formulate the overall hypothesis and test it with an omnibus F test. Test model assumptions. To evaluate which groups are different, carry out several post-hoc tests and comment on the group differences, comparing results across the post-hoc approaches. Present a graphical summary (e.g., error bar plot) for your results. Summarize the results.
Does job type influence salary?
\[H_0: \mu_1 = \mu_2 = \mu_3 \\ \text{Job Type has no influence on salary}\\ H_a: \mu_1 \neq \mu_2 \neq \mu_3 \\ \text{Job Type influences salary}\]
ANOVA Assumption Testing
emp %>% ggplot(data = ., aes(x = salary)) + geom_histogram(fill = "lightblue", color = "grey30",
alpha = 0.8) + theme_light() + labs(title = "Histogram of Salary", caption = paste("Shapiro-Wilks test of normality:",
shapiro.test(emp$salary) %>% .[["p.value"]] %>% p.txt), xlab = "Count", "Salary") +
theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5),
plot.caption = element_text(hjust = 0))tags$p("The data does not appear to fit a normal distribution.", HTML("<br>"), "Does the log of current salary fit a normal distribution and meet the assumptions of the ANOVA test?")
The data does not appear to fit a normal distribution.
Does the log of current salary fit a normal distribution and meet the assumptions of the ANOVA test?
emp %>% ggplot(data = ., aes(x = salaryLN)) + geom_histogram(fill = "lightblue",
color = "grey30", alpha = 0.8) + theme_light() + labs(title = "Histogram of Salary",
caption = paste("Shapiro-Wilks test of normality:", shapiro.test(emp$salary) %>%
.[["p.value"]] %>% p.txt), xlab = "Count", "Salary") + theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5), plot.caption = element_text(hjust = 0))tags$p("The natural log of current salary does not fit ANOVA assumptions. A Chi-squared test is a non-parametric test that tests the same hypotheses as ANOVA without the requirement of normality and is suitable for categorical (non-ordered data) and will be used here.")The natural log of current salary does not fit ANOVA assumptions. A Chi-squared test is a non-parametric test that tests the same hypotheses as ANOVA without the requirement of normality and is suitable for categorical (non-ordered data) and will be used here.
emp.chisq <- with(emp, chisq.test(table(salary, jobcat)))
ggplot(data = emp, mapping = aes(y = salary, x = jobcat, group = jobcat)) + geom_boxplot(alpha = 0) +
theme_light() + labs(title = "Boxplot of Salary grouped by Job Category", subtitle = "",
caption = paste("Chi-squared Test:", emp.chisq$p.value %>% p.txt), x = "Job Category",
y = "Salary") + scale_y_continuous(breaks = seq(0, max(emp$salary), 10000), lim = c(0,
100000)) + theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5),
plot.caption = element_text(hjust = 0))As indicated by the boxplot and Chi-squared p-value, at the .05 \(\alpha\) level there is support for the alternative hypothesis that job category does indeed have an influence on salary.
emp.PH <- apply(combn(emp$jobcat %>% unique, 2), 2, dat = emp, function(facs, dat) {
t.out <- t.test(x = dat$salary[dat$jobcat == facs[1]], y = dat$salary[dat$jobcat ==
facs[2]])
t.out[["data.name"]] <- paste("T-test of Salary where Job category is", facs[1],
"vs where Job category is", facs[2])
return(t.out)
})
emp.PH.Cd <- apply(combn(emp$jobcat %>% unique, 2), 2, dat = emp, function(facs,
dat) {
v <- dat$jobcat == facs[1] | dat$jobcat == facs[2]
d <- dat$salary[v] %>% as.numeric
f <- dat$jobcat[v] %>% as.factor
d.out <- effsize::cohen.d(d = d, f = f)
d.out[["data.name"]] <- paste("Cohen's D of Salary where Job category is", facs[1],
"vs where Job category is", facs[2])
return(d.out)
})
lapply(seq_along(emp.PH), PH = emp.PH, Cd = emp.PH.Cd, function(ind, PH, Cd) {
out <- tags$p(PH[[ind]][["data.name"]], ":", PH[[ind]] %>% apa_t, HTML("<br>"),
Cd[[ind]][["data.name"]], ":", Cd[[ind]] %>% apa_t)
return(out)
})
T-test of Salary where Job category is 3 vs where Job category is 1 : t (90) =17.8, CI[32106.31,40172.21] p<.001
Cohen’s D of Salary where Job category is 3 vs where Job category is 1 : d=2.15, CI[1.88,2.43] This effect size is considered large.
T-test of Salary where Job category is 3 vs where Job category is 2 : t (90) =16.26, CI[29002.05,37075.77] p<.001
Cohen’s D of Salary where Job category is 3 vs where Job category is 2 : d=3.63, CI[2.98,4.28] This effect size is considered large.
T-test of Salary where Job category is 1 vs where Job category is 2 : t (93) =-5.45, CI[-4229.62,-1971.08] p<.001
Cohen’s D of Salary where Job category is 1 vs where Job category is 2 : d=-0.42, CI[-0.82,-0.03] This effect size is considered small.
Post-Hoc tests, also at the .05 \(\alpha\) level indicate that salaries associated with each job category have statistically significant differences. In other words, the alternative hypothesis that job category influences salary is well-supported across all job categories.
data.frame(Lbl = apply(combn(emp$jobcat %>% unique, 2), 2, FUN = paste, collapse = ","),
meandiff = sapply(emp.PH, function(l) {
l[["estimate"]][1] - l[["estimate"]][2]
}), ymin = sapply(emp.PH, function(l) {
l[["conf.int"]][1]
}), ymax = sapply(emp.PH, function(l) {
l[["conf.int"]][2]
})) %>% ggplot(aes(x = Lbl, y = meandiff)) + geom_errorbar(aes(ymin = ymin, ymax = ymax),
position = position_dodge(width = 1), alpha = 1, size = 1, width = 0.2, color = "lightblue") +
geom_point(stat = "identity", position = position_dodge(width = 1), size = 3,
color = "lightblue") + theme_light() + xlab("T-Test: Job Categories Compared") +
ylab("Difference in Means & Conf.Int.") + ggtitle("Error Bar Plot for Difference in Means Test") +
theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))Evaluate whether educational level is related to salary. Recode the “salary” variable into 1) high school or less (12 years or less), 2) above high school to bachelor’s degree (years 13 – 16), and 3) some or graduate degree (17 years or more). Provide a plot of means and carry out analyses of trends for these data. Summarize the output and comment on the findings. Include ONLY relevant output in your write-up.
emp$educat <- emp$educ %>% as.numeric %>% sapply(function(.) {
if ({
. <= 12
}) {
out <- "<=HS"
} else if ({
. >= 13 & . <= 16
}) {
out <- "<=BD"
} else if ({
. >= 17
}) {
out <- ">=GS"
}
return(out)
}) %>% factor(levels = c("<=HS", "<=BD", ">=GS"))Does education level influence salary?
\[H_0: \mu_{\leq HS} = \mu_{\leq BD} = \mu_{\geq GS} \\ \text{Education level has no influence on salary}\\ H_a: \mu_{\leq HS} = \mu_{\leq BD} = \mu_{\geq GS} \\ \text{Education level influences salary}\]
Salary, the response variable, was determined to be non-parametric in the previous experiment. The independent variable is now ordinal, rather than categorical, so a Kruskal-Wallis test is more appropriate than a Chi-squared in this case.
emp.kw <- kruskal.test(salary ~ educat, data = emp)
ggplot(data = emp, mapping = aes(y = salary, x = educat, group = educat)) + geom_boxplot(alpha = 0) +
theme_light() + labs(title = "Boxplot of Salary grouped by Education Level",
subtitle = "", caption = paste("Kruskal Wallis Test:", emp.kw$p.value %>% p.txt),
x = "Education Level", y = "Salary") + scale_y_continuous(breaks = seq(0, max(emp$salary),
10000), lim = c(0, 100000)) + theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5),
plot.caption = element_text(hjust = 0))The Kruskal-Wallis test at the .05 \(\alpha\) level and the boxplot with education level salary means show that there is support for the alternative hypothesis that education level affects salary. The groupings of education level make the differences in salary between education levels very pronounced, especially in stepping from college to graduate school.
(emp.edu.ph <- with(emp, pairwise.t.test(x = salary, g = educat, p.adjust.method = "bonf",
pool.sd = F)))##
## Pairwise comparisons using t tests with non-pooled SD
##
## data: salary and educat
##
## <=HS <=BD
## <=BD < 2e-16 -
## >=GS < 2e-16 0.0000000000000017
##
## P value adjustment method: bonferroni
emp.lm <- lm(salary ~ educ, data = emp)
emp.sum <- emp.lm %>% summary
emp.sum$coefficients[, 1:3] %<>% apply(2, round, 2)
emp.sum$coefficients[, 4] %<>% sapply(p.txt)
emp$sal.ols <- emp$salary %>% scale %>% {
. > qt(0.999, df = length(scale(emp$salary)))
}
emp %>% ggplot(data = ., mapping = aes(x = educ, y = salary)) + geom_point(position = position_jitter(),
aes(shape = sal.ols, size = sal.ols, color = educat)) + geom_smooth(method = "lm") +
annotation_custom(gridExtra::tableGrob(emp.sum$coefficients), xmin = 2, ymin = 110000) +
labs(title = "Scatter Plot with Regression Trend Line", subtitle = "Table with Regression Output Coefficients",
caption = "", x = "Education level", y = "Salary") + theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) + scale_y_continuous(breaks = seq(0,
max(emp$salary), 10000)) + scale_size_manual(values = c(`TRUE` = 2, `FALSE` = 1))The linear regression model (though we know salary violates the assumption of normality) and scatter plot above illustrates the statistically significant relationship between education level and salary. Notable are the majority of outliers in salary that correspond only to education levels beyond high school. The geom_jitter also makes it easy to see how variance in salary increases with education level.
Homework 6
This homework is an extension of HW5 on the “employee” dataset. 1. In the 1 st question, you evaluated an effect of a job type on salary. The overall F statistic yielded significant results. Carry out a meaningful set of a priori contrasts to evaluate what groups are different from each other. Remember to specify an appropriate number of orthogonal contrasts and create a coefficients table. Briefly, report your findings 2. In the 2 nd question, you evaluated an effect of education on salary. The overall F statistic yielded significant results. Carry out an analysis of trends for these research question and report on any patterns observed in the relationship between education and salary. Briefly, report your findings. Refer to any necessary graphical summary to make your interpretation more meaningful.
Review your previous homework on one-way ANOVA and t-tests to make sure you understand major steps involved in the analyses, SPSS or R functions, interpretation of findings, and the progressive flow of hypothesis testing and the write-up. If you have any questions, consult posted answers to previous HW assignments, review class notes, and book readings. If anything remains unclear, send your questions along with your homework assignment and I will respond to them.
In the 1st question, you evaluated an effect of a job type on salary. The overall F statistic yielded significant results. Carry out a meaningful set of a priori contrasts to evaluate what groups are different from each other. Remember to specify an appropriate number of orthogonal contrasts and create a coefficients table. Briefly, report your findings
The hypotheses for the ANOVA are identical to those of the Kruskal Wallis test from Q1 above
Does job type influence salary?
\[H_0: \mu_1 = \mu_2 = \mu_3 \\ \text{Job Type has no influence on salary}\\ H_a: \mu_1 \neq \mu_2 \neq \mu_3 \\ \text{Job Type influences salary}\]
We know from previous assumption testing that the variables are independent, and the data is non-normally distributed, even with a log transformation. However, the ANOVA is robust to deviations from normality, so the natural log transformation of salary will be used as the dependent variable. A further assumption that is relevant to tests is whether the standard deviation can be pooled. Levene’s test will be used to test this assumption.
car::leveneTest(y = emp$salaryLN, group = emp$jobcat)# https://stats.stackexchange.com/questions/286706/setting-custom-contrasts-in-r
# IT appears that it is only possible to do a number of contrasts equal to the
# degrees of freedom.
emp$jobcat %<>% factor(levels = c(1, 2, 3), labels = c("Clerical", "Custodial", "Manager"))
contr.all <- function(fac, nm) {
mat <- contr.helmert(levels(fac)) %>% cbind(matrix(rep(gtools::permute(.[, 2]),
8), byrow = T, nrow = length(levels(fac)))) %>% .[, duplicated.matrix(.,
MARGIN = 2) %>% not] %>% .[, -1]
out <- mat
splits <- rep(1:(length(levels(fac)) - 1), length(levels(fac)))[1:length(levels(fac))] %>%
split(., ceiling(seq_along(1:length(.))/(length(levels(fac)) - 1)))
split.names <- apply(mat, 2, FUN = function(clm) {
paste(row.names(mat)[sign(clm) == -1] %>% paste(collapse = " & "), row.names(mat)[sign(clm) ==
1] %>% paste(collapse = " & "), sep = " v ")
}) %>% split(., ceiling(seq_along(1:length(.))/(length(levels(fac)) - 1)))
splits <- purrr::map2(.x = splits, .y = split.names, .f = function(.x, .y) {
names(.x) <- unlist(.y)
return(.x)
})
names(splits) <- rep(nm, length(splits))
row.names(out) <- levels(fac)
attributes(out)$splits <- splits
return(out)
}
tags$p("Contrast Matrix:")Contrast Matrix:
(contr <- contr.all(fac = emp$jobcat, "jobcat"))## [,1] [,2] [,3]
## Clerical -1 2 -1
## Custodial -1 -1 2
## Manager 2 -1 -1
## attr(,"splits")
## attr(,"splits")$jobcat
## Clerical & Custodial v Manager Custodial & Manager v Clerical
## 1 2
##
## attr(,"splits")$jobcat
## Clerical & Manager v Custodial
## 1
contr.aov <- function(form, dat, contr) {
deg.f <- length(levels(dat[[form[[3]] %>% as.character]])) - 1
reps <- 1:ceiling(ncol(contr)/deg.f)
subs <- sapply(seq_along(reps), deg.f = deg.f, function(r, deg.f) {
rep(r, times = deg.f)
}) %>% as.vector(.) %>% .[1:(deg.f + 1)]
contr.mats <- lapply(split(1:ncol(contr), subs), mat = contr, function(ind, mat) {
mat[, ind]
})
splits <- attributes(contr)$splits
saovs <- purrr::map(seq_along(splits), c.mats = contr.mats, form = form, dat = dat,
splits = splits, .f = function(.x, c.mats, form, dat, splits) {
contrasts(dat[[form[[3]] %>% as.character()]]) <- c.mats[[.x]]
list(model = aov.out <- aov(form, data = dat), saov = summary.aov(aov.out,
split = splits[.x]))
})
list(table = rbind(purrr::pluck(saovs, list(1, "saov", 1))[1, ], lapply(saovs,
function(l) {
purrr::pluck(l, "saov", 1) %>% .[-c(1, nrow(.)), ]
}) %>% do.call("rbind", .) %>% .[duplicated(.[, 2]) %>% not, ], purrr::pluck(saovs,
list(1, "saov", 1)) %>% .[nrow(.), ]), ANOVA = saovs)
}
tagList(strong("ANOVA with Contrasts:"), br(), p("According to this", a("cross-validated post",
href = "https://stats.stackexchange.com/questions/286706/setting-custom-contrasts-in-r",
target = "_blank"), "it appears that it is only possible to do a number of contrasts equal to the degrees of freedom. A custom formula had to be created to perform the ANOVA with comparisons exceeding the number of degrees of freedom."))According to this cross-validated post it appears that it is only possible to do a number of contrasts equal to the degrees of freedom. A custom formula had to be created to perform the ANOVA with comparisons exceeding the number of degrees of freedom.
(caov <- contr.aov("salaryLN~jobcat" %>% as.formula, dat = emp, contr = contr))$tableCustodial and Clerical grouped together show the most significant difference from manager, indicative of a pay gap between these types of positions. Manager and Clerical also show a significant difference from Custodial, though not as significant. When Custodial and Manager are taken together, there is no significant different to clerical, indicating that clerical is close enough to the mean of these two groups to not be considered significantly different.
Pairwise comparisons between groups are represented in the table below. Each comparison achieves significance.
pairwise.t.test(x = emp$salaryLN, g = emp$jobcat, p.adj = "bonf", pool.sd = F)$p.value## Clerical Custodial
## Custodial 4.508176e-10 NA
## Manager 9.721600e-50 5.31961e-40
In the 2nd question, you evaluated an effect of education on salary. The overall F statistic yielded significant results. Carry out an analysis of trends for these research question and report on any patterns observed in the relationship between education and salary. Briefly, report your findings. Refer to any necessary graphical summary to make your interpretation more meaningful.
Linear and Quadratic ANOVA Comparisons
summary.aov(caov$ANOVA[[1]]$model, split = list(jobcat = list(Linear = 1, Quadratic = 2)))## Df Sum Sq Mean Sq F value Pr(>F)
## jobcat 2 46.67 23.34 392.564 < 2e-16 ***
## jobcat: Linear 1 46.22 46.22 777.433 < 2e-16 ***
## jobcat: Quadratic 1 0.46 0.46 7.695 0.00576 **
## Residuals 471 28.00 0.06
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The data has significant linear and quadratic trends, as the model indicates. Though the ANOVA comparisons here indicate a more significant relationship for the linear trend. From previous graphical exploration, intuitively, it would appear that a quadratic trend fits the data better. We can re-graph the data with both linear and quadratic trendlines to assess this intuition:
Graphical Exploration of Linear and Quadratic Trends
emp %>% ggplot(data = ., mapping = aes(x = educ, y = salary)) + geom_point(position = position_jitter(),
aes(shape = sal.ols, size = sal.ols, color = educat)) + geom_smooth(method = "lm") +
geom_smooth(method = "lm", formula = y ~ x + I(x^2), size = 1, color = "red") +
annotation_custom(gridExtra::tableGrob(emp.sum$coefficients), xmin = 2, ymin = 110000) +
labs(title = "Scatter Plot with Regression Trend Line", subtitle = "Table with Regression Output Coefficients",
caption = "", x = "Education level", y = "Salary") + theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) + scale_y_continuous(breaks = seq(0,
max(emp$salary), 10000)) + scale_size_manual(values = c(`TRUE` = 2, `FALSE` = 1))It does appear from the visualization above that a quadratic line better fits the data, we can test this hypothesis by looking at linear models of each trend separately, and then together for comparison.
Linear Models
tags$p("Linear")Linear
lm(salaryLN ~ educ, data = emp) %>% summary##
## Call:
## lm(formula = salaryLN ~ educ, data = emp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.66260 -0.19303 -0.03559 0.16538 0.95223
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.062102 0.062738 144.4 <2e-16 ***
## educ 0.095963 0.004548 21.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2853 on 472 degrees of freedom
## Multiple R-squared: 0.4854, Adjusted R-squared: 0.4844
## F-statistic: 445.3 on 1 and 472 DF, p-value: < 0.00000000000000022
tags$p("Quadratic")Quadratic
lm(salaryLN ~ I(educ^2), data = emp) %>% summary##
## Call:
## lm(formula = salaryLN ~ I(educ^2), data = emp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.64854 -0.17940 -0.02153 0.15148 0.94540
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6393645 0.0327259 294.55 <2e-16 ***
## I(educ^2) 0.0037695 0.0001592 23.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2689 on 472 degrees of freedom
## Multiple R-squared: 0.5428, Adjusted R-squared: 0.5419
## F-statistic: 560.4 on 1 and 472 DF, p-value: < 0.00000000000000022
tags$p("Linear and Quadratic")Linear and Quadratic
lm(salaryLN ~ educ + I(educ^2), data = emp) %>% summary##
## Call:
## lm(formula = salaryLN ~ educ + I(educ^2), data = emp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.68571 -0.15861 -0.02415 0.16519 0.96679
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.132436 0.189092 58.873 < 2e-16 ***
## educ -0.230009 0.028742 -8.002 0.00000000000000958 ***
## I(educ^2) 0.012229 0.001068 11.454 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2526 on 471 degrees of freedom
## Multiple R-squared: 0.5975, Adjusted R-squared: 0.5958
## F-statistic: 349.7 on 2 and 471 DF, p-value: < 0.00000000000000022
Each model indicates a statistically significant fit of the trendline to the data, with the p-value and significance level between the linear and quadratic models undifferentiable. However, the Adjusted R-Squared value for the quadratic model is greater than that of the linear model confirming the hypothesis that the variance in the data is better explained by the quadratic model.
Combining the linear and quadratic models into a single model makes the difference in fit evident in the statistical significance of each line, with a marginally lower p.value for the quadratic trendline. Predictably, adding both terms to the model further increases the \(R^2\) value and better explaining the data.
The quadratic trendline is a better fit because of the exponential increases in salary visible in the graph that occur as higher levels of education are attained. The magnitude increases gradually after the inflection point corresponding to some high school education, growing steepr with some college education and then reflecting the exponential growth in salary for those with bachelor’s degrees and/or graduate degrees.
This dataset would indicate that higher educaton is a lucrative investment. It would be interesting to further examine these findings by using data stratified by periods corresponding to political cycles and extrapolate that analysis across countries to see if the trend remains intact across time and demographics.