About

This notebook contains analysis for a study of data librarian competencies and skills. This code was created using R version 3.3.2

Data Preparation

Load required packages and read in data, removing unwanted columns exported automatically from Survey Monkey and rename variables whose names didn’t get properly imported.

packages <- c("likert", "poLCA", "stringr", "ggplot2", "tidyverse", "dplyr", "ade4", "cluster", "fpc", "grid")
x = sapply(packages, function(x) if(!require(x, character.only = T)) install.packages(x))

#read in the data - assumes your working directory is set to where the data file is
dat <- read_csv("Sheet_12_redacted.csv", skip = 1) %>% 
  dplyr::rename(`Job title` = `Open-Ended Response`, `Years in current position` = `Open-Ended Response_1`, `Years in librarianship` = `Open-Ended Response_2`, `Other important skills` = `Open-Ended Response_3`, `Additional thoughts or comments` = `Open-Ended Response_4`) 
dat <- dat[, 10:78]

Data Wrangling

Calculate how many years of experience the person had when they started the job (i.e. subtract years in current job from years in the field total) and put it next to the other year columns.

dat <- dat %>% 
  mutate(`Years experience when started position` = `Years in librarianship` - `Years in current position`) %>% 
  dplyr::select(1:3, `Years experience when started position`, everything())
## Warning: package 'bindrcpp' was built under R version 3.4.3

Convert likert scale variables to ordered factors for easier analysis and charting.

levs <- c("Don't know or N/A", "Not at all important", "Slightly important", "Important", "Very important", "Absolutely essential")

dat[21:68] <- lapply(dat[21:68], parse_factor, levels = levs)

Survey Monkey does some weird things with columns that allow multiple selections from a list, so clean those up by replacing all non-NA values with a 1.

fix_rows <- c(5:10, 14:20) #select the offending rows
dat[fix_rows] <- replace(dat[fix_rows], !is.na(dat[fix_rows]), 1) #make the replacement
dat[fix_rows] <- replace(dat[fix_rows], is.na(dat[fix_rows]), 0)
dat[fix_rows] <- sapply(dat[fix_rows], as.numeric) #the replacement gets created as a character var, so switch it to numeric

Some people didn’t complete any of the likert scale items at all, so their responses aren’t really very helpful. Partial responses are okay but remove anyone who didn’t answer any likert scale questions. Also participant #5 asked asked to have their response removed in the comments column.

ind <- apply(dat[21:68], 1, function(x) all(is.na(x)))
dat <- dat[ !ind, ]
dat <- dat[-5, ]

Demographic analysis

Looking at some characteristics of the people who responded.

Job titles

Convert to lower case to compare and look at how often some common terms occur.

dat$`Job title` <- tolower(dat$`Job title`)
job_titles <- data.frame(table(dat$`Job title`))

word_list <- c("librarian", "data services", "data management", "research data", "informationist", "manager", "director")

for (i in 1:length(word_list)) {
  sentence <- paste(length(grep(word_list[i], dat$`Job title`)), "job titles contain the word", word_list[i])
  print(sentence)
}
## [1] "0 job titles contain the word librarian"
## [1] "0 job titles contain the word data services"
## [1] "0 job titles contain the word data management"
## [1] "0 job titles contain the word research data"
## [1] "0 job titles contain the word informationist"
## [1] "0 job titles contain the word manager"
## [1] "0 job titles contain the word director"

Disciplinary support

What disciplines do people work in? Make a chart that shows percentage of people supporting a given discipline.

discipline <- data.frame(discipline = names(dat[5:10]), count = colSums(dat[5:10], na.rm = TRUE))
discipline$percent <- discipline$count/nrow(dat) *100
discipline$label <- paste("n =", discipline$count)

discipline
##                                                              discipline
## Biomedical and/or health sciences     Biomedical and/or health sciences
## Life sciences                                             Life sciences
## Physical sciences                                     Physical sciences
## Mathematics and/or statistics             Mathematics and/or statistics
## Engineering and/or computer science Engineering and/or computer science
## Social sciences                                         Social sciences
##                                     count  percent  label
## Biomedical and/or health sciences      67 81.70732 n = 67
## Life sciences                          43 52.43902 n = 43
## Physical sciences                      31 37.80488 n = 31
## Mathematics and/or statistics          25 30.48780 n = 25
## Engineering and/or computer science    33 40.24390 n = 33
## Social sciences                        31 37.80488 n = 31
ggplot(discipline, aes(x = reorder(discipline, percent), y = percent)) + geom_bar(stat = "identity") + coord_flip() + ylab("Percent of Respondents\nSupporting Discipline") + theme_bw() + ggtitle("Disciplinary Support") + scale_x_discrete(labels = function(x) lapply(strwrap(x, width = 35, simplify = FALSE), paste, collapse="\n")) + theme(plot.title = element_text(hjust = 0.5)) + xlab("Discipline") + geom_text(aes(label = label, x = discipline, y = count), position = position_dodge(width = 0.8), hjust = .8, color = "white")

How many people are supporting more than one discipline?

dat <- dat %>% 
  mutate(disciplines_served = `Biomedical and/or health sciences` + `Life sciences` + `Physical sciences`+ `Mathematics and/or statistics` + `Engineering and/or computer science` + `Social sciences`)
n_disciplines <- dat %>% filter(disciplines_served > 1) %>% nrow
mean_disciplines <- mean(dat$disciplines_served)
## 55 participants serve more than one discipline 
## The mean number of disciplines served is 2.804878

What are the most common combinations of disciplines for people who support more than one?

Combos with biomed:

lapply(dat[, 6:10],table, dat$`Biomedical and/or health sciences`)
## $`Life sciences`
##    
##      0  1
##   0  8 31
##   1  7 36
## 
## $`Physical sciences`
##    
##      0  1
##   0  6 45
##   1  9 22
## 
## $`Mathematics and/or statistics`
##    
##      0  1
##   0  9 48
##   1  6 19
## 
## $`Engineering and/or computer science`
##    
##      0  1
##   0  5 44
##   1 10 23
## 
## $`Social sciences`
##    
##      0  1
##   0  9 42
##   1  6 25

Combos with life sciences

lapply(dat[, 7:10],table, dat$`Life sciences`)
## $`Physical sciences`
##    
##      0  1
##   0 33 18
##   1  6 25
## 
## $`Mathematics and/or statistics`
##    
##      0  1
##   0 34 23
##   1  5 20
## 
## $`Engineering and/or computer science`
##    
##      0  1
##   0 32 17
##   1  7 26
## 
## $`Social sciences`
##    
##      0  1
##   0 32 19
##   1  7 24

Combos with physical sciences

lapply(dat[, 8:10],table, dat$`Physical sciences`)
## $`Mathematics and/or statistics`
##    
##      0  1
##   0 46 11
##   1  5 20
## 
## $`Engineering and/or computer science`
##    
##      0  1
##   0 46  3
##   1  5 28
## 
## $`Social sciences`
##    
##      0  1
##   0 43  8
##   1  8 23

Combos with math

lapply(dat[, 9:10],table, dat$`Mathematics and/or statistics`)
## $`Engineering and/or computer science`
##    
##      0  1
##   0 46  3
##   1 11 22
## 
## $`Social sciences`
##    
##      0  1
##   0 48  3
##   1  9 22

Combos with engineering

lapply(dat[, 10],table, dat$`Engineering and/or computer science`)
## $`Social sciences`
##    
##      0  1
##   0 41 10
##   1  8 23

Time spent on data vs non data activities

First remove any entries where the times are NA, then make charts.

dat %>% mutate(sum_of_time = `Percent of time spent on other work` + `Percent of time spent on data-related work`) %>% 
  filter(!is.na(sum_of_time)) %>% ggplot(aes(x = `Percent of time spent on data-related work`)) + geom_histogram() + ylab("Number of respondents") + theme_bw() + ggtitle("Time Spent on Data-Related Work") + theme(plot.title = element_text(hjust = 0.5)) + xlab("Percent of time")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

mean(dat$`Percent of time spent on data-related work`, na.rm= TRUE)
## [1] 55.375

Years of experience

Some visualizations of how long people have been in their current position and stuff. Remove anyone who put they had been in their current position longer than their whole career, since this would be impossible.

fixed_dat <- dat %>% filter(`Years in librarianship` >= `Years in current position`) 

require(gridExtra)
## Loading required package: gridExtra
## Warning: package 'gridExtra' was built under R version 3.4.2
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
plot1 <- ggplot(fixed_dat, aes(x = `Years in current position`)) + geom_histogram() + ggtitle("Years in Current Position") + xlab("Years") + ylab("Number of Respondents") + theme_bw() + theme(plot.title = element_text(hjust = 0.5))

plot2 <- ggplot(fixed_dat, aes(x = `Years in librarianship`)) + geom_histogram() + ggtitle("Years in Librarianship Total") + xlab("Years") + ylab("Number of Respondents") + theme_bw() + theme(plot.title = element_text(hjust = 0.5))

grid.arrange(plot1, plot2, ncol=2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

For how many people is this their first library job, i.e. years in librarianship is equal to years in current position?

n_first <- dat %>% filter(`Years in librarianship` == `Years in current position`) %>% nrow
## For 20 participants, their current job is their first in the field of librarianship.

Educational Experience

How many people have various degrees? Sum up the counts and plot them as a percentage.

degrees <- data.frame(degree = names(dat[14:20]), count = colSums(dat[14:20], na.rm = TRUE))
degrees$percent <- degrees$count/nrow(dat) *100
degrees$label <- paste("n =", degrees$count)

ggplot(degrees, aes(x = reorder(degree, percent), y = percent)) + geom_bar(stat = "identity") + coord_flip() + ylab("Percent of Respondents\nwith Degree") + theme_bw() + ggtitle("Educational Experience") + scale_x_discrete(labels = function(x) lapply(strwrap(x, width = 35, simplify = FALSE), paste, collapse="\n")) + theme(plot.title = element_text(hjust = 0.5)) + xlab("Degree or certificate") + geom_text(aes(label = label, x = degree, y = count), position = position_dodge(width = 0.8), hjust = .8, color = "white")

dat <- dat %>% 
  mutate(degrees_held = `ALA-accredited masters degree` + `Science masters degree` + `Other non-ALA, non-science masters degree` + `Undergraduate science degree` + `PhD (any discipline)` + `Specialized librarianship certification (such as data or medical library certification)` + `Other non-degree, non-certificate training in data, science, or specialized librarianship`)
n_degrees <- dat %>% filter(degrees_held > 1) %>% nrow
mean_degrees <- mean(dat$degrees_held)
degrees
##                                                                                                                                                                              degree
## ALA-accredited masters degree                                                                                                                         ALA-accredited masters degree
## Science masters degree                                                                                                                                       Science masters degree
## Other non-ALA, non-science masters degree                                                                                                 Other non-ALA, non-science masters degree
## Undergraduate science degree                                                                                                                           Undergraduate science degree
## PhD (any discipline)                                                                                                                                           PhD (any discipline)
## Specialized librarianship certification (such as data or medical library certification)     Specialized librarianship certification (such as data or medical library certification)
## Other non-degree, non-certificate training in data, science, or specialized librarianship Other non-degree, non-certificate training in data, science, or specialized librarianship
##                                                                                           count
## ALA-accredited masters degree                                                                68
## Science masters degree                                                                       19
## Other non-ALA, non-science masters degree                                                    10
## Undergraduate science degree                                                                 32
## PhD (any discipline)                                                                         14
## Specialized librarianship certification (such as data or medical library certification)       9
## Other non-degree, non-certificate training in data, science, or specialized librarianship    31
##                                                                                            percent
## ALA-accredited masters degree                                                             82.92683
## Science masters degree                                                                    23.17073
## Other non-ALA, non-science masters degree                                                 12.19512
## Undergraduate science degree                                                              39.02439
## PhD (any discipline)                                                                      17.07317
## Specialized librarianship certification (such as data or medical library certification)   10.97561
## Other non-degree, non-certificate training in data, science, or specialized librarianship 37.80488
##                                                                                            label
## ALA-accredited masters degree                                                             n = 68
## Science masters degree                                                                    n = 19
## Other non-ALA, non-science masters degree                                                 n = 10
## Undergraduate science degree                                                              n = 32
## PhD (any discipline)                                                                      n = 14
## Specialized librarianship certification (such as data or medical library certification)    n = 9
## Other non-degree, non-certificate training in data, science, or specialized librarianship n = 31
## 62 participants have more than one degree 
## The mean number of degrees held is 2.231707

Analysis of Skill/Knowledge Importance Ratings

Do some data cleaning first. the relevant data (columns 21:68) need to be converted to an organization that ggplot2 can deal with - i.e. switch from wide to long.

##not all these packages know how to handle tibbles - make a regular data frame instead
dat_df <- as.data.frame(dat)
##need to handle the NA/don't know - convert these from a level to an NA
dat_df[21:68] <- lapply(dat_df[21:68], recode_factor, "Don't know or N/A" = NA_character_)
#fix ordered factor levels
levs <- c("Not at all important", "Slightly important", "Important", "Very important", "Absolutely essential")
dat_df[21:68] <- lapply(dat_df[21:68], factor, levels = levs)

demo_rows <- 1:10
data_mgmt1 <- 21:25
data_mgmt2 <- 26:30
tech <- 31:35
eval_assess <- 36:38
teaching <- 39:43
library <- 44:49
outreach <- 50:53
involvement <- 54:56
personal_attrs <- 57:63
education <- 64:68
subset_cols <- list(data_mgmt1, data_mgmt2, tech, eval_assess, teaching, library, outreach, involvement, personal_attrs, education)
names(subset_cols) <- c("Data Management Skills, Part 1", "Data Management Skills, Part 2", "Technology Skills", "Evaluation and Assessment Skills", "Teaching Skills", "Library Skills", "Networking and Outreach Skills", "Professional Involvement", "Personal Attributes", "Education")

Likert scale charts

multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

 if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}

plots <- list()
for (i in 1:length(subset_cols)) {
  col_nums <- unlist(subset_cols[i])
  lik_dat <- likert(dat_df[col_nums])
  p <- plot(lik_dat, low.color ="grey78", high.color="grey8", neutral.color="grey45", neutral.color.ramp="grey46", text.color="black", text.size = 3) + ggtitle(names(subset_cols)[i]) + theme(plot.title = element_text(hjust = 0.5), legend.position = "none", axis.text.x = element_text(size = 6.5))
  plots[[i]] <- p
}
multiplot(plotlist = plots, cols = 2)

#make one plot with a legend to be able to put it on the final chart
plot(lik_dat, low.color ="grey78", high.color="grey8", neutral.color="grey45", neutral.color.ramp="grey46", text.color="black", text.size = 3) + ggtitle(names(subset_cols)[i]) + theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(size = 6.5))

How many skills do people think are important?

Basically, how much do you have to know?

We just want the likert_based items, which are columns 21:68. For each row, count up how many times a person has given that rating.

dat_df$absolutely_essential_count <- rowSums(dat_df[, 21:68] == "Absolutely essential", na.rm = TRUE)
dat_df$very_imp_count <- rowSums(dat_df[, 21:68] == "Very important", na.rm = TRUE)
dat_df$imp_count <- rowSums(dat_df[, 21:68] == "Important", na.rm = TRUE)
dat_df$slightly_imp_count <- rowSums(dat_df[, 21:68] == "Slightly important", na.rm = TRUE)
dat_df$not_imp_count <- rowSums(dat_df[, 21:68] == "Not at all important", na.rm = TRUE)

Let’s make a boxplot of this. We have to convert the data first.

likert_counts <- gather(dat_df[, 73:77], rating, count, absolutely_essential_count:not_imp_count)
levs <- c("absolutely_essential_count", "very_imp_count", "imp_count", "slightly_imp_count", "not_imp_count")
likert_counts$rating <- factor(likert_counts$rating, levels = levs)
likert_counts$rating <- plyr::revalue(likert_counts$rating, c("absolutely_essential_count" = "Absolutely essential", "very_imp_count" = "Very important", "imp_count" = "Important", "slightly_imp_count" = "Slightly important", "not_imp_count" = "Not at all important"))

boxp <- ggplot(likert_counts, aes(x = rating, y = count)) + geom_boxplot() + xlab("") + ylab("Number of items (per individual)") + theme(axis.text.x = element_text(angle = 45, hjust = 1))

Would it be easier to see this in a density plot?

density <- ggplot(likert_counts, aes(x = count)) + geom_density(fill = "grey") + facet_wrap(~rating, nrow = 5) + xlab("Number of items (per individual)") + theme_bw()

Just for fun let’s put them together

grid.arrange(boxp, density, top = "Distribution of Number of Items Ranked at Each Level ", nrow = 1)

Cluster Analysis

Make a new converting all likert scale responses to one-hot encoded variables, then add back in the non-likert variables. Also discard that one where the person said they’ve been in their current position longer than in librarianship total.

set.seed(100)
library(ade4)
one_hot <- acm.disjonctif(dat_df[1:82, 21:68])
one_hot <- cbind(dat_df[1:82, c(1:3, 5:10, 12:20, 71:77)], one_hot)

Calculate the similarities using gower distance

library(cluster)
gower_dist <- daisy(one_hot[, -1],
                    metric = "gower")
## Warning in daisy(one_hot[, -1], metric = "gower"): binary variable(s) 3,
## 4, 5, 6, 7, 8, 11, 12, 13, 14, 15, 16, 17, 25, 26, 27, 28, 29, 30, 31, 32,
## 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
## 52, 53, 54, 55, 56, 57, 58, 59, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71,
## 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
## 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107,
## 108, 109, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123,
## 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138,
## 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153,
## 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,
## 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183,
## 184, 187, 188, 189, 190, 191, 192, 193, 194, 196, 197, 198, 199, 200, 201,
## 202, 203, 204, 206, 207, 208, 209, 211, 212, 213, 214, 216, 217, 218, 219,
## 222, 223, 224, 225, 227, 228, 229, 232, 233, 234, 236, 237, 238, 239, 240,
## 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255,
## 256, 257, 258, 259, 260, 261, 262, 263, 264 treated as interval scaled
sil_width <- c(NA)
for(i in 2:10){
  
  pam_fit <- pam(gower_dist,
                 diss = TRUE,
                 k = i)
  
  sil_width[i] <- pam_fit$silinfo$avg.width
}

# Plot sihouette width (higher is better)

plot(1:10, sil_width,
     xlab = "Number of clusters",
     ylab = "Silhouette Width")
lines(1:10, sil_width)

This suggests 2 clusters is best.

pam_fit <- pam(gower_dist, diss = TRUE, k = 2)

pam_results <- one_hot %>%
  dplyr::select(-`Job title`) %>%
  mutate(cluster = pam_fit$clustering) %>%
  group_by(cluster) %>%
  do(the_summary = summary(.))

Trying a different implementation of this

library(fpc)
pc = pamk(gower_dist, krange=1:5, criterion="asw")
pc[2:3]
## $nc
## [1] 2
## 
## $crit
## [1] 0.00000000 0.06944619 0.03780652 0.03603893 0.04442350
hc.m = hclust(gower_dist, method="median")
hc.s = hclust(gower_dist, method="single")
hc.c = hclust(gower_dist, method="complete")
plot(hc.m)

plot(hc.s)

plot(hc.c)

table(cutree(hc.c, k=2), cutree(hc.s, k=2))
##    
##      1  2
##   1 56  1
##   2 25  0

Looking at the complete clustering. The first group on the dendrogram ends with observation 81, so let’s see which observation that is.

which(hc.c$order == 81)
## [1] 25

This tells us that group 1 includes observations 1:25 and group 2 observations 26:82

group1l <- hc.c$order[1:25]
group2l <- hc.c$order[26:82]

dat_df[,5:20] <- lapply(dat_df[, 5:20], as.factor)

generalists <- dat_df[group1l, ]
specialists <- dat_df[group2l, ]

specialists$group <- "Specialists"
generalists$group <- "Generalists"

grouped <- rbind(specialists, generalists)

Look at groupings

for (i in 1:length(subset_cols)) {
  col_nums <- unlist(subset_cols[i])
  lik_dat <- likert(grouped[, col_nums], grouping = grouped$group)
  p <- plot(lik_dat) + ggtitle(names(subset_cols)[i])+ theme(plot.title = element_text(hjust = 0.5)) + scale_x_discrete(labels = function(x) lapply(strwrap(x, width = 35, simplify = FALSE), paste, collapse="\n"), "") + scale_fill_grey() 
  print(p)
}
## Scale for 'fill' is already present. Adding another scale for 'fill',
## which will replace the existing scale.
## Scale for 'fill' is already present. Adding another scale for 'fill',
## which will replace the existing scale.

## Scale for 'fill' is already present. Adding another scale for 'fill',
## which will replace the existing scale.

## Scale for 'fill' is already present. Adding another scale for 'fill',
## which will replace the existing scale.

## Scale for 'fill' is already present. Adding another scale for 'fill',
## which will replace the existing scale.

## Scale for 'fill' is already present. Adding another scale for 'fill',
## which will replace the existing scale.

## Scale for 'fill' is already present. Adding another scale for 'fill',
## which will replace the existing scale.

## Scale for 'fill' is already present. Adding another scale for 'fill',
## which will replace the existing scale.

## Scale for 'fill' is already present. Adding another scale for 'fill',
## which will replace the existing scale.

## Scale for 'fill' is already present. Adding another scale for 'fill',
## which will replace the existing scale.

generalists[5:10] <- lapply(generalists[5:10], as.numeric)
generalists[5:10] <- generalists[5:10] - 1
discipline1 <- data.frame(discipline = names(generalists[5:10]), count = colSums(generalists[5:10], na.rm = TRUE))
discipline1$percent <- discipline1$count/nrow(generalists) *100
discipline1$label <- paste("n =", discipline1$count)


specialists[5:10] <- lapply(specialists[5:10], as.numeric)
specialists[5:10] <- specialists[5:10] - 1
discipline2 <- data.frame(discipline = names(specialists[5:10]), count = colSums(specialists[5:10], na.rm = TRUE))
discipline2$percent <- discipline2$count/nrow(specialists) *100
discipline2$label <- paste("n =", discipline2$count)

discipline1$Group <- "Generalists"
discipline2$Group <- "Specialists"

discipline <- rbind(discipline1, discipline2)

ggplot(discipline, aes(x = reorder(discipline, percent), y = percent, fill = Group)) + geom_bar(stat = "identity", position = "dodge") + coord_flip() + ylab("Percent of Respondents\nSupporting Discipline") + theme_bw() + ggtitle("Disciplinary Support by Group") + scale_x_discrete(labels = function(x) lapply(strwrap(x, width = 35, simplify = FALSE), paste, collapse="\n")) + theme(plot.title = element_text(hjust = 0.5)) + xlab("Discipline") + scale_fill_grey()

generalists <- generalists %>% 
  mutate(disciplines_served = `Biomedical and/or health sciences` + `Life sciences` + `Physical sciences`+ `Mathematics and/or statistics` + `Engineering and/or computer science` + `Social sciences`)
n_disciplines1 <- generalists %>% filter(disciplines_served > 1) %>% nrow
mean_disciplines1 <- mean(generalists$disciplines_served)

specialists <- specialists %>% 
  mutate(disciplines_served = `Biomedical and/or health sciences` + `Life sciences` + `Physical sciences`+ `Mathematics and/or statistics` + `Engineering and/or computer science` + `Social sciences`)
n_disciplines2 <- specialists %>% filter(disciplines_served > 1) %>% nrow
mean_disciplines2 <- mean(specialists$disciplines_served)
## 23 participants in the generalist group serve more than one discipline 
## The mean number of disciplines served is 4.36 .  32 participants in the specialist group serve more than one discipline 
## The mean number of disciplines served is 2.122807
generalists[14:20] <- lapply(generalists[14:20], as.numeric)
generalists[14:20] <- generalists[14:20] - 1
degree1 <- data.frame(degree = names(generalists[14:20]), count = colSums(generalists[14:20], na.rm = TRUE))
degree1$percent <- degree1$count/nrow(generalists) *100
degree1$label <- paste("n =", degree1$count)

specialists[14:20] <- lapply(specialists[14:20], as.numeric)
specialists[14:20] <- specialists[14:20] - 1
degree2 <- data.frame(degree = names(specialists[14:20]), count = colSums(specialists[14:20], na.rm = TRUE))
degree2$percent <- degree2$count/nrow(specialists) *100
degree2$label <- paste("n =", degree2$count)


degree1$Group <- "Generalists"
degree2$Group <- "Specialists"

degree3 <- rbind(degree1, degree2)

ggplot(degree3, aes(x = reorder(degree, percent), y = percent, fill = Group)) + geom_bar(stat = "identity", position = "dodge") + coord_flip() + ylab("Percent of Respondents\nwith Degree") + theme_bw() + ggtitle("Educational Experience by Group") + scale_x_discrete(labels = function(x) lapply(strwrap(x, width = 35, simplify = FALSE), paste, collapse="\n")) + theme(plot.title = element_text(hjust = 0.5)) + xlab("Degree or certificate") + scale_fill_grey()

generalists <- generalists %>% 
  mutate(degrees_held = `ALA-accredited masters degree` + `Science masters degree` + `Other non-ALA, non-science masters degree` + `Undergraduate science degree` + `PhD (any discipline)` + `Specialized librarianship certification (such as data or medical library certification)` + `Other non-degree, non-certificate training in data, science, or specialized librarianship`)
n_degrees1 <- generalists %>% filter(degrees_held > 1) %>% nrow
mean_degrees1 <- mean(generalists$degrees_held)

specialists <- specialists %>% 
  mutate(degrees_held = `ALA-accredited masters degree` + `Science masters degree` + `Other non-ALA, non-science masters degree` + `Undergraduate science degree` + `PhD (any discipline)` + `Specialized librarianship certification (such as data or medical library certification)` + `Other non-degree, non-certificate training in data, science, or specialized librarianship`)
n_degrees2 <- specialists %>% filter(degrees_held > 1) %>% nrow
mean_degrees2 <- mean(specialists$degrees_held)
## 23 participants in the generalist group have more than one degree and the mean number of degrees held is 2.6 .  39 participants in the specialist group have more than one degree and The mean number of degrees held is 2.070175
plot1 <- ggplot(grouped, aes(y = `Years in current position`, x = group)) + geom_boxplot() + ggtitle("Years in Current Position by Group") + ylab("Years") + theme_bw() + theme(plot.title = element_text(hjust = 0.5)) + xlab("")

plot2 <- ggplot(grouped, aes(y = `Years in librarianship`, x = group)) + geom_boxplot() + ggtitle("Years in Librarianship Total by Group") + ylab("Years") + theme_bw() + theme(plot.title = element_text(hjust = 0.5)) + xlab("")

grid.arrange(plot1, plot2, ncol=2)
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).

## Warning: Removed 1 rows containing non-finite values (stat_boxplot).

n_first1 <- specialists %>% filter(`Years in librarianship` == `Years in current position`) %>% nrow
n_first2 <- generalists %>% filter(`Years in librarianship` == `Years in current position`) %>% nrow
## For 15 Specialists, their current job is their first in the field of librarianship.  For 5 generalists their current job is their first in the field.
generalists$data_time <- as.numeric(as.character(generalists$`Percent of time spent on data-related work`))
specialists$data_time <- as.numeric(as.character(specialists$`Percent of time spent on data-related work`))


plot1 <- specialists %>% ggplot(aes(x = data_time)) + geom_density() + ylab("Number of respondents") + theme_bw() + ggtitle("Subject Specialists") + theme(plot.title = element_text(hjust = 0.5)) + xlab("Percent of time")

plot2 <- generalists %>% ggplot(aes(x = data_time)) + geom_density() + ylab("Number of respondents") + theme_bw() + ggtitle("Data Generalists") + theme(plot.title = element_text(hjust = 0.5)) + xlab("Percent of time")


grid.arrange(plot1, plot2, ncol=2, top="Time Spent on Data-related Work")
## Warning: Removed 2 rows containing non-finite values (stat_density).

Or maybe as a boxplot

grouped$data_time <- as.numeric(as.character(grouped$`Percent of time spent on data-related work`))
ggplot(grouped, aes(x = group, y = data_time)) + geom_boxplot() + ggtitle("Time Spent on Data-related Work") + theme_bw()
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).

generalists$majority_data_time <- ifelse(generalists$data_time >= 50,
c(1), c(0)) 

specialists$majority_data_time <- ifelse(specialists$data_time >= 50,
c(1), c(0)) 

Tests of statistical significance

Let’s do some t-tests! Mean number of disciplines served

t.test(generalists$disciplines_served, specialists$disciplines_served)
## 
##  Welch Two Sample t-test
## 
## data:  generalists$disciplines_served and specialists$disciplines_served
## t = 5.1904, df = 37.546, p-value = 7.539e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.364278 3.110108
## sample estimates:
## mean of x mean of y 
##  4.360000  2.122807

Mean years in current position

t.test(generalists$`Years in current position`, specialists$`Years in current position`)
## 
##  Welch Two Sample t-test
## 
## data:  generalists$`Years in current position` and specialists$`Years in current position`
## t = -1.5394, df = 71.814, p-value = 0.1281
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.8851825  0.4994682
## sample estimates:
## mean of x mean of y 
##  3.200000  4.892857

Time in profession overall

t.test(specialists$`Years in librarianship`, generalists$`Years in librarianship`)
## 
##  Welch Two Sample t-test
## 
## data:  specialists$`Years in librarianship` and generalists$`Years in librarianship`
## t = 0.10393, df = 50.739, p-value = 0.9176
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.846955  4.266955
## sample estimates:
## mean of x mean of y 
##     10.25     10.04

percent of time on data-related tasks

t.test(specialists$data_time, generalists$data_time)
## 
##  Welch Two Sample t-test
## 
## data:  specialists$data_time and generalists$data_time
## t = -3.8951, df = 54.064, p-value = 0.0002723
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -39.27219 -12.58236
## sample estimates:
## mean of x mean of y 
##  47.27273  73.20000

Degrees and certificates

t.test(specialists$degrees_held, generalists$degrees_held)
## 
##  Welch Two Sample t-test
## 
## data:  specialists$degrees_held and generalists$degrees_held
## t = -2.5223, df = 55.511, p-value = 0.01455
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.9506947 -0.1089544
## sample estimates:
## mean of x mean of y 
##  2.070175  2.600000

Number of tasks considered Absolutely essential

t.test(specialists$absolutely_essential_count, generalists$absolutely_essential_count)
## 
##  Welch Two Sample t-test
## 
## data:  specialists$absolutely_essential_count and generalists$absolutely_essential_count
## t = -4.8029, df = 49.03, p-value = 1.514e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.147171  -4.570723
## sample estimates:
## mean of x mean of y 
##  11.42105  19.28000
t.test(specialists$not_imp_count, generalists$not_imp_count)
## 
##  Welch Two Sample t-test
## 
## data:  specialists$not_imp_count and generalists$not_imp_count
## t = 2.9975, df = 70.402, p-value = 0.003759
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.7981089 3.9710139
## sample estimates:
## mean of x mean of y 
##  4.824561  2.440000