# Attach dependencies
rmarkdown::html_dependency_jquery()
rmarkdown::html_dependency_bootstrap("spacelab")
rmarkdown::html_dependency_jqueryui()

HW 5

The dataset “Employee_data.sav” was collected to assess a number of educational and demographic characteristics and their relationship to salary. For this homework, we will look at variables “salary” (current salary), “jobcat” (employment category with 3 levels), and “educ” (education). 1. Evaluate whether job type has an impact on salary. Formulate the overall hypothesis and test it with an omnibus F test. Test model assumptions. To evaluate which groups are different, carry out several post-hoc tests and comment on the group differences, comparing results across the post-hoc approaches. Present a graphical summary (e.g., error bar plot) for your results. Summarize the results. 2. Evaluate whether educational level is related to salary. Recode the “salary” variable into 1) high school or less (12 years or less), 2) above high school to bachelor’s degree (years 13 – 16), and 3) some or graduate degree (17 years or more). Provide a plot of means and carry out analyses of trends for these data. Summarize the output and comment on the findings. Include ONLY relevant output in your write-up.

emp <- haven::read_sav("Employee data.sav")
emp <- lapply(emp, unclass) %>% as.data.frame()

Q1

Evaluate whether job type has an impact on salary. Formulate the overall hypothesis and test it with an omnibus F test. Test model assumptions. To evaluate which groups are different, carry out several post-hoc tests and comment on the group differences, comparing results across the post-hoc approaches. Present a graphical summary (e.g., error bar plot) for your results. Summarize the results.

Hypothesis

Hypothesis

Does job type influence salary?

\[H_0: \mu_1 = \mu_2 = \mu_3 \\ \text{Job Type has no influence on salary}\\ H_a: \mu_1 \neq \mu_2 \neq \mu_3 \\ \text{Job Type influences salary}\]

Assumption Testing

ANOVA Assumption Testing

emp %>% ggplot(data = ., aes(x = salary)) + geom_histogram(fill = "lightblue", color = "grey30", 
    alpha = 0.8) + theme_light() + labs(title = "Histogram of Salary", caption = paste("Shapiro-Wilks test of normality:", 
    shapiro.test(emp$salary) %>% .[["p.value"]] %>% p.txt), xlab = "Count", "Salary") + 
    theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5), 
        plot.caption = element_text(hjust = 0))

tags$p("The data does not appear to fit a normal distribution.", HTML("<br>"), "Does the log of current salary fit a normal distribution and meet the assumptions of the ANOVA test?")

The data does not appear to fit a normal distribution.
Does the log of current salary fit a normal distribution and meet the assumptions of the ANOVA test?

emp %>% ggplot(data = ., aes(x = salaryLN)) + geom_histogram(fill = "lightblue", 
    color = "grey30", alpha = 0.8) + theme_light() + labs(title = "Histogram of Salary", 
    caption = paste("Shapiro-Wilks test of normality:", shapiro.test(emp$salary) %>% 
        .[["p.value"]] %>% p.txt), xlab = "Count", "Salary") + theme(plot.title = element_text(hjust = 0.5), 
    plot.subtitle = element_text(hjust = 0.5), plot.caption = element_text(hjust = 0))

tags$p("The natural log of current salary does not fit ANOVA assumptions. A Chi-squared test is a non-parametric test that tests the same hypotheses as ANOVA without the requirement of normality and is suitable for categorical (non-ordered data) and will be used here.")

The natural log of current salary does not fit ANOVA assumptions. A Chi-squared test is a non-parametric test that tests the same hypotheses as ANOVA without the requirement of normality and is suitable for categorical (non-ordered data) and will be used here.

Chi-squared Test

emp.chisq <- with(emp, chisq.test(table(salary, jobcat)))
ggplot(data = emp, mapping = aes(y = salary, x = jobcat, group = jobcat)) + geom_boxplot(alpha = 0) + 
    theme_light() + labs(title = "Boxplot of Salary grouped by Job Category", subtitle = "", 
    caption = paste("Chi-squared Test:", emp.chisq$p.value %>% p.txt), x = "Job Category", 
    y = "Salary") + scale_y_continuous(breaks = seq(0, max(emp$salary), 10000), lim = c(0, 
    100000)) + theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5), 
    plot.caption = element_text(hjust = 0))

As indicated by the boxplot and Chi-squared p-value, at the .05 \(\alpha\) level there is support for the alternative hypothesis that job category does indeed have an influence on salary.

Post-Hoc Tests

emp.PH <- apply(combn(emp$jobcat %>% unique, 2), 2, dat = emp, function(facs, dat) {
    t.out <- t.test(x = dat$salary[dat$jobcat == facs[1]], y = dat$salary[dat$jobcat == 
        facs[2]])
    t.out[["data.name"]] <- paste("T-test of Salary where Job category is", facs[1], 
        "vs where Job category is", facs[2])
    return(t.out)
})

emp.PH.Cd <- apply(combn(emp$jobcat %>% unique, 2), 2, dat = emp, function(facs, 
    dat) {
    v <- dat$jobcat == facs[1] | dat$jobcat == facs[2]
    d <- dat$salary[v] %>% as.numeric
    f <- dat$jobcat[v] %>% as.factor
    d.out <- effsize::cohen.d(d = d, f = f)
    d.out[["data.name"]] <- paste("Cohen's D of Salary where Job category is", facs[1], 
        "vs where Job category is", facs[2])
    return(d.out)
})
lapply(seq_along(emp.PH), PH = emp.PH, Cd = emp.PH.Cd, function(ind, PH, Cd) {
    out <- tags$p(PH[[ind]][["data.name"]], ":", PH[[ind]] %>% apa_t, HTML("<br>"), 
        Cd[[ind]][["data.name"]], ":", Cd[[ind]] %>% apa_t)
    return(out)
})
[[1]]

T-test of Salary where Job category is 3 vs where Job category is 1 : t (90) =17.8, CI[32106.31,40172.21] p<.001
Cohen’s D of Salary where Job category is 3 vs where Job category is 1 : d=2.15, CI[1.88,2.43] This effect size is considered large.

[[2]]

T-test of Salary where Job category is 3 vs where Job category is 2 : t (90) =16.26, CI[29002.05,37075.77] p<.001
Cohen’s D of Salary where Job category is 3 vs where Job category is 2 : d=3.63, CI[2.98,4.28] This effect size is considered large.

[[3]]

T-test of Salary where Job category is 1 vs where Job category is 2 : t (93) =-5.45, CI[-4229.62,-1971.08] p<.001
Cohen’s D of Salary where Job category is 1 vs where Job category is 2 : d=-0.42, CI[-0.82,-0.03] This effect size is considered small.

Post-Hoc tests, also at the .05 \(\alpha\) level indicate that salaries associated with each job category have statistically significant differences. In other words, the alternative hypothesis that job category influences salary is well-supported across all job categories.

data.frame(Lbl = apply(combn(emp$jobcat %>% unique, 2), 2, FUN = paste, collapse = ","), 
    meandiff = sapply(emp.PH, function(l) {
        l[["estimate"]][1] - l[["estimate"]][2]
    }), ymin = sapply(emp.PH, function(l) {
        l[["conf.int"]][1]
    }), ymax = sapply(emp.PH, function(l) {
        l[["conf.int"]][2]
    })) %>% ggplot(aes(x = Lbl, y = meandiff)) + geom_errorbar(aes(ymin = ymin, ymax = ymax), 
    position = position_dodge(width = 1), alpha = 1, size = 1, width = 0.2, color = "lightblue") + 
    geom_point(stat = "identity", position = position_dodge(width = 1), size = 3, 
        color = "lightblue") + theme_light() + xlab("T-Test: Job Categories Compared") + 
    ylab("Difference in Means & Conf.Int.") + ggtitle("Error Bar Plot for Difference in Means Test") + 
    theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

Q2

Evaluate whether educational level is related to salary. Recode the “salary” variable into 1) high school or less (12 years or less), 2) above high school to bachelor’s degree (years 13 – 16), and 3) some or graduate degree (17 years or more). Provide a plot of means and carry out analyses of trends for these data. Summarize the output and comment on the findings. Include ONLY relevant output in your write-up.

emp$educat <- emp$educ %>% as.numeric %>% sapply(function(.) {
    if ({
        . <= 12
    }) {
        out <- "<=HS"
    } else if ({
        . >= 13 & . <= 16
    }) {
        out <- "<=BD"
    } else if ({
        . >= 17
    }) {
        out <- ">=GS"
    }
    return(out)
}) %>% factor(levels = c("<=HS", "<=BD", ">=GS"))

Hypothesis

Does education level influence salary?

\[H_0: \mu_{\leq HS} = \mu_{\leq BD} = \mu_{\geq GS} \\ \text{Education level has no influence on salary}\\ H_a: \mu_{\leq HS} = \mu_{\leq BD} = \mu_{\geq GS} \\ \text{Education level influences salary}\]

Assumption Testing

Salary, the response variable, was determined to be non-parametric in the previous experiment. The independent variable is now ordinal, rather than categorical, so a Kruskal-Wallis test is more appropriate than a Chi-squared in this case.

Kruskal-Wallis Test

emp.kw <- kruskal.test(salary ~ educat, data = emp)
ggplot(data = emp, mapping = aes(y = salary, x = educat, group = educat)) + geom_boxplot(alpha = 0) + 
    theme_light() + labs(title = "Boxplot of Salary grouped by Education Level", 
    subtitle = "", caption = paste("Kruskal Wallis Test:", emp.kw$p.value %>% p.txt), 
    x = "Education Level", y = "Salary") + scale_y_continuous(breaks = seq(0, max(emp$salary), 
    10000), lim = c(0, 100000)) + theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5), 
    plot.caption = element_text(hjust = 0))

The Kruskal-Wallis test at the .05 \(\alpha\) level and the boxplot with education level salary means show that there is support for the alternative hypothesis that education level affects salary. The groupings of education level make the differences in salary between education levels very pronounced, especially in stepping from college to graduate school.

Post-Hoc Analysis

(emp.edu.ph <- with(emp, pairwise.t.test(x = salary, g = educat, p.adjust.method = "bonf", 
    pool.sd = F)))
## 
##  Pairwise comparisons using t tests with non-pooled SD 
## 
## data:  salary and educat 
## 
##      <=HS    <=BD              
## <=BD < 2e-16 -                 
## >=GS < 2e-16 0.0000000000000017
## 
## P value adjustment method: bonferroni

Trend Analysis

emp.lm <- lm(salary ~ educ, data = emp)
emp.sum <- emp.lm %>% summary
emp.sum$coefficients[, 1:3] %<>% apply(2, round, 2)
emp.sum$coefficients[, 4] %<>% sapply(p.txt)
emp$sal.ols <- emp$salary %>% scale %>% {
    . > qt(0.999, df = length(scale(emp$salary)))
}
emp %>% ggplot(data = ., mapping = aes(x = educ, y = salary)) + geom_point(position = position_jitter(), 
    aes(shape = sal.ols, size = sal.ols, color = educat)) + geom_smooth(method = "lm") + 
    annotation_custom(gridExtra::tableGrob(emp.sum$coefficients), xmin = 2, ymin = 110000) + 
    labs(title = "Scatter Plot with Regression Trend Line", subtitle = "Table with Regression Output Coefficients", 
        caption = "", x = "Education level", y = "Salary") + theme(plot.title = element_text(hjust = 0.5), 
    plot.subtitle = element_text(hjust = 0.5)) + scale_y_continuous(breaks = seq(0, 
    max(emp$salary), 10000)) + scale_size_manual(values = c(`TRUE` = 2, `FALSE` = 1))

The linear regression model (though we know salary violates the assumption of normality) and scatter plot above illustrates the statistically significant relationship between education level and salary. Notable are the majority of outliers in salary that correspond only to education levels beyond high school. The geom_jitter also makes it easy to see how variance in salary increases with education level.

HW 6

Homework 6

This homework is an extension of HW5 on the “employee” dataset. 1. In the 1 st question, you evaluated an effect of a job type on salary. The overall F statistic yielded significant results. Carry out a meaningful set of a priori contrasts to evaluate what groups are different from each other. Remember to specify an appropriate number of orthogonal contrasts and create a coefficients table. Briefly, report your findings 2. In the 2 nd question, you evaluated an effect of education on salary. The overall F statistic yielded significant results. Carry out an analysis of trends for these research question and report on any patterns observed in the relationship between education and salary. Briefly, report your findings. Refer to any necessary graphical summary to make your interpretation more meaningful.

Review your previous homework on one-way ANOVA and t-tests to make sure you understand major steps involved in the analyses, SPSS or R functions, interpretation of findings, and the progressive flow of hypothesis testing and the write-up. If you have any questions, consult posted answers to previous HW assignments, review class notes, and book readings. If anything remains unclear, send your questions along with your homework assignment and I will respond to them.

Q1

In the 1st question, you evaluated an effect of a job type on salary. The overall F statistic yielded significant results. Carry out a meaningful set of a priori contrasts to evaluate what groups are different from each other. Remember to specify an appropriate number of orthogonal contrasts and create a coefficients table. Briefly, report your findings

Hypothesis

The hypotheses for the ANOVA are identical to those of the Kruskal Wallis test from Q1 above

Does job type influence salary?

\[H_0: \mu_1 = \mu_2 = \mu_3 \\ \text{Job Type has no influence on salary}\\ H_a: \mu_1 \neq \mu_2 \neq \mu_3 \\ \text{Job Type influences salary}\]

Assumption Testing

We know from previous assumption testing that the variables are independent, and the data is non-normally distributed, even with a log transformation. However, the ANOVA is robust to deviations from normality, so the natural log transformation of salary will be used as the dependent variable. A further assumption that is relevant to tests is whether the standard deviation can be pooled. Levene’s test will be used to test this assumption.

car::leveneTest(y = emp$salaryLN, group = emp$jobcat)
# https://stats.stackexchange.com/questions/286706/setting-custom-contrasts-in-r
# IT appears that it is only possible to do a number of contrasts equal to the
# degrees of freedom.
emp$jobcat %<>% factor(levels = c(1, 2, 3), labels = c("Clerical", "Custodial", "Manager"))
contr.all <- function(fac, nm) {
    mat <- contr.helmert(levels(fac)) %>% cbind(matrix(rep(gtools::permute(.[, 2]), 
        8), byrow = T, nrow = length(levels(fac)))) %>% .[, duplicated.matrix(., 
        MARGIN = 2) %>% not] %>% .[, -1]
    out <- mat
    splits <- rep(1:(length(levels(fac)) - 1), length(levels(fac)))[1:length(levels(fac))] %>% 
        split(., ceiling(seq_along(1:length(.))/(length(levels(fac)) - 1)))
    split.names <- apply(mat, 2, FUN = function(clm) {
        paste(row.names(mat)[sign(clm) == -1] %>% paste(collapse = " & "), row.names(mat)[sign(clm) == 
            1] %>% paste(collapse = " & "), sep = " v ")
    }) %>% split(., ceiling(seq_along(1:length(.))/(length(levels(fac)) - 1)))
    splits <- purrr::map2(.x = splits, .y = split.names, .f = function(.x, .y) {
        names(.x) <- unlist(.y)
        return(.x)
    })
    names(splits) <- rep(nm, length(splits))
    row.names(out) <- levels(fac)
    attributes(out)$splits <- splits
    return(out)
}
tags$p("Contrast Matrix:")

Contrast Matrix:

(contr <- contr.all(fac = emp$jobcat, "jobcat"))
##           [,1] [,2] [,3]
## Clerical    -1    2   -1
## Custodial   -1   -1    2
## Manager      2   -1   -1
## attr(,"splits")
## attr(,"splits")$jobcat
## Clerical & Custodial v Manager Custodial & Manager v Clerical 
##                              1                              2 
## 
## attr(,"splits")$jobcat
## Clerical & Manager v Custodial 
##                              1
contr.aov <- function(form, dat, contr) {
    deg.f <- length(levels(dat[[form[[3]] %>% as.character]])) - 1
    reps <- 1:ceiling(ncol(contr)/deg.f)
    subs <- sapply(seq_along(reps), deg.f = deg.f, function(r, deg.f) {
        rep(r, times = deg.f)
    }) %>% as.vector(.) %>% .[1:(deg.f + 1)]
    contr.mats <- lapply(split(1:ncol(contr), subs), mat = contr, function(ind, mat) {
        mat[, ind]
    })
    splits <- attributes(contr)$splits
    saovs <- purrr::map(seq_along(splits), c.mats = contr.mats, form = form, dat = dat, 
        splits = splits, .f = function(.x, c.mats, form, dat, splits) {
            contrasts(dat[[form[[3]] %>% as.character()]]) <- c.mats[[.x]]
            list(model = aov.out <- aov(form, data = dat), saov = summary.aov(aov.out, 
                split = splits[.x]))
        })
    list(table = rbind(purrr::pluck(saovs, list(1, "saov", 1))[1, ], lapply(saovs, 
        function(l) {
            purrr::pluck(l, "saov", 1) %>% .[-c(1, nrow(.)), ]
        }) %>% do.call("rbind", .) %>% .[duplicated(.[, 2]) %>% not, ], purrr::pluck(saovs, 
        list(1, "saov", 1)) %>% .[nrow(.), ]), ANOVA = saovs)
}
tagList(strong("ANOVA with Contrasts:"), br(), p("According to this", a("cross-validated post", 
    href = "https://stats.stackexchange.com/questions/286706/setting-custom-contrasts-in-r", 
    target = "_blank"), "it appears that it is only possible to do a number of contrasts equal to the degrees of freedom. A custom formula had to be created to perform the ANOVA with comparisons exceeding the number of degrees of freedom."))
ANOVA with Contrasts:

According to this cross-validated post it appears that it is only possible to do a number of contrasts equal to the degrees of freedom. A custom formula had to be created to perform the ANOVA with comparisons exceeding the number of degrees of freedom.

(caov <- contr.aov("salaryLN~jobcat" %>% as.formula, dat = emp, contr = contr))$table

Custodial and Clerical grouped together show the most significant difference from manager, indicative of a pay gap between these types of positions. Manager and Clerical also show a significant difference from Custodial, though not as significant. When Custodial and Manager are taken together, there is no significant different to clerical, indicating that clerical is close enough to the mean of these two groups to not be considered significantly different.

Pairwise comparisons between groups are represented in the table below. Each comparison achieves significance.

pairwise.t.test(x = emp$salaryLN, g = emp$jobcat, p.adj = "bonf", pool.sd = F)$p.value
##               Clerical   Custodial
## Custodial 4.508176e-10          NA
## Manager   9.721600e-50 5.31961e-40

Q2

In the 2nd question, you evaluated an effect of education on salary. The overall F statistic yielded significant results. Carry out an analysis of trends for these research question and report on any patterns observed in the relationship between education and salary. Briefly, report your findings. Refer to any necessary graphical summary to make your interpretation more meaningful.