Data Preparation

Load required packages and read in data, removing unwanted columns exported automatically from Survey Monkey and rename variables whose names didn’t get properly imported.

packages <- c("likert", "poLCA", "stringr", "ggplot2", "tidyverse", "dplyr")
x = sapply(packages, function(x) if(!require(x, character.only = T)) install.packages(x))
Loading required package: poLCA
package 㤼㸱poLCA㤼㸲 was built under R version 3.3.3Loading required package: scatterplot3d
package 㤼㸱scatterplot3d㤼㸲 was built under R version 3.3.3Loading required package: MASS
Loading required package: stringr
package 㤼㸱stringr㤼㸲 was built under R version 3.3.3Loading required package: tidyverse
package 㤼㸱tidyverse㤼㸲 was built under R version 3.3.3Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
package 㤼㸱purrr㤼㸲 was built under R version 3.3.3Conflicts with tidy packages ----------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats
recode(): dplyr, likert
select(): dplyr, MASS

Data Wrangling

Calculate how many years of experience the person had when they started the job and put it next to the other year columns.

dat <- dat %>% 
  mutate(`Years experience when started position` = `Years in librarianship` - `Years in current position`) %>% 
  select(1:3, `Years experience when started position`, everything())
Error in select(., 1:3, `Years experience when started position`, everything()) : 
  unused arguments (1:3, `Years experience when started position`, everything())

Convert likert scale variables to ordered factors for easier analysis.

levs <- c("Don't know or N/A", "Not at all important", "Slightly important", "Important", "Very important", "Absolutely essential")
dat[21:68] <- lapply(dat[21:68], parse_factor, levels = levs)

Survey Monkey does some weird things with columns that allow multiple selections from a list, so clean those up by replacing all non-NA values with a 1.

fix_rows <- c(5:10, 14:20) #select the offending rows
dat[fix_rows] <- replace(dat[fix_rows], !is.na(dat[fix_rows]), 1) #make the replacement
dat[fix_rows] <- replace(dat[fix_rows], is.na(dat[fix_rows]), 0)
dat[fix_rows] <- sapply(dat[fix_rows], as.numeric) #the replacement gets created as a character var, so switch it to numeric

Some people didn’t complete any of the likert scale items at all, so their responses aren’t really very helpful. Partial responses are okay but remove anyone who didn’t answer any likert scale questions.

ind <- apply(dat[21:68], 1, function(x) all(is.na(x)))
dat <- dat[ !ind, ]

Demographic analysis

Looking at some characteristics of the people who responded.

Job titles

dat$`Job title` <- tolower(dat$`Job title`)
job_titles <- data.frame(table(dat$`Job title`))
word_list <- c("librarian", "data services", "data management", "research data", "informationist", "manager", "director")
for (i in 1:length(word_list)) {
  sentence <- paste(length(grep(word_list[i], dat$`Job title`)), "job titles contain the word", word_list[i])
  print(sentence)
}
[1] "47 job titles contain the word librarian"
[1] "13 job titles contain the word data services"
[1] "5 job titles contain the word data management"
[1] "11 job titles contain the word research data"
[1] "6 job titles contain the word informationist"
[1] "5 job titles contain the word manager"
[1] "5 job titles contain the word director"

Years of experience

Some visualizations of how long people have been in their current position and stuff. Filter out anyone who put they had been in their current position longer than their whole career.

fixed_dat <- dat %>% filter(`Years in librarianship` >= `Years in current position`) 
require(gridExtra)
Loading required package: gridExtra
package 㤼㸱gridExtra㤼㸲 was built under R version 3.3.3
Attaching package: 㤼㸱gridExtra㤼㸲

The following object is masked from 㤼㸱package:dplyr㤼㸲:

    combine
plot1 <- ggplot(fixed_dat, aes(x = `Years in current position`)) + geom_histogram() + ggtitle("Years in Current Position") + xlab("Years") + ylab("Number of Respondents") + theme_bw() + theme(plot.title = element_text(hjust = 0.5))
plot2 <- ggplot(fixed_dat, aes(x = `Years in librarianship`)) + geom_histogram() + ggtitle("Years in Librarianship Total") + xlab("Years") + ylab("Number of Respondents") + theme_bw() + theme(plot.title = element_text(hjust = 0.5))
grid.arrange(plot1, plot2, ncol=2)

For how many people is this their first library job, i.e. years in librarianship is equal to years in current position?

n_first <- dat %>% filter(`Years in librarianship` == `Years in current position`) %>% nrow
For 20 participants, their current job is their first in the field of librarianship.

Educational Experience

How many people have various degrees? Sum up the counts and plot them as a percentage.

62 participants havemore than one degree 
The mean number of degrees held is 2.216867

Areas supported

What disciplines do people work in?

How many people are supporting more than one discipline?

dat <- dat %>% 
  mutate(disciplines_served = `Biomedical and/or health sciences` + `Life sciences` + `Physical sciences`+ `Mathematics and/or statistics` + `Engineering and/or computer science` + `Social sciences`)
n_disciplines <- dat %>% filter(disciplines_served > 1) %>% nrow
mean_disciplines <- mean(dat$disciplines_served)
56 participants serve more than one discipline 
The mean number of disciplines served is 2.831325

Time spent on data vs non data activities

First remove any NA entries then make charts.

Analysis of Skill/Knowledge Importance Ratings

Do some data cleaning first for the likert package.

test <- for (i in 21:68) {
  levels(dat_df[i])[levels(dat_df[i] == "Don't know or N/A"] <- NA
Error: unexpected ']' in:
"test <- for (i in 21:68) {
  levels(dat_df[i])[levels(dat_df[i] == "Don't know or N/A"]"

Break the different skills/knowledge areas out into their category groupings to make this a little easier. First get the demographic info we want in all the data subsets, then pull out the category subsets. There’s too many in the data management category and the charts look crappy so break it into two.

demo_rows <- 1:10
data_mgmt1 <- 21:25
data_mgmt2 <- 26:30
tech <- 31:35
eval_assess <- 36:38
teaching <- 39:43
library <- 44:49
outreach <- 50:53
involvement <- 54:56
personal_attrs <- 57:63
education <- 64:68
subset_cols <- list(data_mgmt1, data_mgmt2, tech, eval_assess, teaching, library, outreach, involvement, personal_attrs, education)
##come back later and pretty up these names for your plots
names(subset_cols) <- c("data_mgmt1", "data_mgmt2", "tech", "eval_assess", "teaching", "library", "outreach", "involvement", "personal_attrs", "education")

Likert scale charts

Note: the likert package can’t handle tibbles, so convert to a data frame when you pass it to the likert function.

for (i in 1:length(subset_cols)) {
  col_nums <- unlist(subset_cols[i])
  lik_dat <- likert(dat_df[col_nums])
  p <- plot(lik_dat) + ggtitle(names(subset_cols)[i]) + scale_fill_grey()
  print(p)
}
Scale for 'fill' is already present. Adding another scale for
'fill', which will replace the existing scale.

Charting How many skills people think are important for various categories

for (i in 1:length(subset_cols)) {
  g_name <- names(subset_cols)[i]
  g_rows <- subset_cols[[i]]
  t_all$group[g_rows, ] <- g_name
}
Error in t_all$group[g_rows, ] <- g_name : 
  incorrect number of subscripts on matrix

Breaking down the analysis by demographics

Do people who spend more than half their time doing data related work feel differently about what’s important?

dat_df$majority_data_time <- ifelse(dat_df$`Percent of time spent on data-related work` >= 50,
c("yes"), c("no")) 
dat_df$majority_data_time <- as.factor(dat_df$majority_data_time)

dat_df <- dat_df[!is.na(dat_df$majority_data_time), ]

Make the same Likert scales but broken out by group.

for (i in 1:length(subset_cols)) {
  col_nums <- unlist(subset_cols[i])
  lik_dat <- likert(dat_df[, col_nums], grouping = dat_df$majority_data_time)
  p <- plot(lik_dat) + ggtitle(names(subset_cols)[i])
  print(p)
}

PhD vs none

Some job ads specifically ask for a PhD. Let’s see if there’s anything different about the people who have one or don’t.

MLIS vs none

dat_df$hasmlis <- ifelse(dat_df$`ALA-accredited master’s degree` == 1,
c("yes"), c("no")) 
dat_df$hasmlis <- as.factor(dat_df$hasmlis)
dat_df <- dat_df[!is.na(dat_df$hasmlis), ]

for (i in 1:length(subset_cols)) {
  col_nums <- unlist(subset_cols[i])
  lik_dat <- likert(dat_df[, col_nums], grouping = dat_df$hasmlis)
  p <- plot(lik_dat) + ggtitle(names(subset_cols)[i]) + theme(text = element_text(size = 6))
  print(p)
}

Breakdown by disciplines supported

Latent Class Analysis

Prepare the data for the analysis

Convert the likert data to a numeric scale, with 1 being low, 5 being high, and 6 NA. Some levels didn’t have any don’t know/NA so add a 6 level just in case. Also, I think we need to collapse this into just 3 levels instead of 5 to make the model work.

dat_lca <- dat
dat_lca[21:68] <- lapply(dat_lca[21:68], recode_factor, 
"Don't know or N/A" = "6", "Not at all important" = "1", "Slightly important" = "2", "Important" = "3", "Very important" = "4", "Absolutely essential" = "5")

##We used dat for this rather than dat_df so add in the majority data tiome variable
dat_lca$majority_data_time <- ifelse(dat_lca$`Percent of time spent on data-related work` >= 50,
c("yes"), c("no")) 
dat_lca$majority_data_time <- as.factor(dat_lca$majority_data_time)

Add 1 to all 0/1 columns so they are 1/2 as per instructions. Also make the majority of time on data a numeric column.

recode_factor(dat_lca$majority_data_time, "yes" = "2", "no" = "1")
 [1] 2    <NA> 2    1    2    1    1    2    2    <NA> 2    2   
[13] 2    1    2    1    1    2    1    2    2    2    2    1   

Remove unwanted columns

drops <- c(11:13, 69:70)
dat_lca <- dat_lca[-drops]

Replace any remaining NAs with 6

dat_lca[is.na(dat_lca)] <- 6
invalid factor level, NA generated

K-Means and One Hot Encoding

Make a new converting all likert scale responses to one-hot encoded variables, then add back in the non-likert variables. Also discard that one where the person said they’ve been in their current position longer than in librarianship total.

library(ade4)
one_hot <- acm.disjonctif(dat_df[1:82, 21:68])
one_hot <- cbind(dat_df[1:82, c(1:3, 5:10, 12:20)], one_hot)

Calculate the similarities using gower distance

This suggests 2 clusters is best.

pam_fit <- pam(gower_dist, diss = TRUE, k = 2)
pam_results <- one_hot %>%
  dplyr::select(-`Job title`) %>%
  mutate(cluster = pam_fit$clustering) %>%
  group_by(cluster) %>%
  do(the_summary = summary(.))
Error: could not find function "%>%"

Trying a different implementation of this

library(fpc)
pc = pamk(gower_dist, krange=1:5, criterion="asw")
pc[2:3]
$nc
[1] 2

$crit
[1]  0.000000000  0.022472510 -0.005254221 -0.013158479 -0.013707184
hc.m = hclust(gower_dist, method="median")
hc.s = hclust(gower_dist, method="single")
hc.c = hclust(gower_dist, method="complete")
plot(hc.m)

plot(hc.s)

plot(hc.c)

table(cutree(hc.m, k=2), cutree(hc.s, k=2))
   
     1  2
  1 80  1
  2  1  0

Looking at the complete clustering.

dat_df$group[group1l, ] <- 1
Error in dat_df$group[group1l, ] <- 1 : 
  incorrect number of subscripts on matrix

Look at groupings

group1 <- group1 %>% 
  mutate(disciplines_served = `Biomedical and/or health sciences` + `Life sciences` + `Physical sciences`+ `Mathematics and/or statistics` + `Engineering and/or computer science` + `Social sciences`)
n_disciplines1 <- group1 %>% filter(disciplines_served > 1) %>% nrow
mean_disciplines1 <- mean(group1$disciplines_served)

group2 <- group2 %>% 
  mutate(disciplines_served = `Biomedical and/or health sciences` + `Life sciences` + `Physical sciences`+ `Mathematics and/or statistics` + `Engineering and/or computer science` + `Social sciences`)
n_disciplines2 <- group2 %>% filter(disciplines_served > 1) %>% nrow
mean_disciplines2 <- mean(group2$disciplines_served)
13 participants in Group 1 serve more than one discipline 
The mean number of disciplines served is 1.964286 .  43 participants in Group 2 serve more than one discipline 
The mean number of disciplines served is 3.314815

group1 <- group1 %>% 
  mutate(degrees_held = `ALA-accredited master’s degree` + `Science master’s degree` + `Other non-ALA, non-science master’s degree` + `Undergraduate science degree` + `PhD (any discipline)` + `Specialized librarianship certification (such as data or medical library certification)` + `Other non-degree, non-certificate training in data, science, or specialized librarianship`)
n_degrees1 <- group1 %>% filter(degrees_held > 1) %>% nrow
mean_degrees1 <- mean(group1$degrees_held)

group2 <- group2 %>% 
  mutate(degrees_held = `ALA-accredited master’s degree` + `Science master’s degree` + `Other non-ALA, non-science master’s degree` + `Undergraduate science degree` + `PhD (any discipline)` + `Specialized librarianship certification (such as data or medical library certification)` + `Other non-degree, non-certificate training in data, science, or specialized librarianship`)
n_degrees2 <- group2 %>% filter(degrees_held > 1) %>% nrow
mean_degrees2 <- mean(group2$degrees_held)
15 participants in Group 1 hvae more than one degree 
The mean number of degrees held is 1.75 .  43 participants in Group 2 have more than one degree 
The mean number of degrees held is 2.481481

n_first1 <- group1 %>% filter(`Years in librarianship` == `Years in current position`) %>% nrow
n_first2 <- group2 %>% filter(`Years in librarianship` == `Years in current position`) %>% nrow
For 7 Specialists, their current job is their first in the field of librarianship.  For 13 generalists their current job is their first in the field.

group1$majority_data_time <- ifelse(group1$`Percent of time spent on data-related work` >= 50,
c("yes"), c("no")) 
㤼㸱>=㤼㸲 not meaningful for factors
group1$majority_data_time <- as.factor(group1$majority_data_time)
g1 <- gather(g1, value, n, -title) %>%
  group_by(title, value, n) %>%
  tally
Using n as weighting variable
Error in summarise_impl(.data, dots) : 
  invalid 'type' (character) of argument
---
title: "Data Librarian Competencies"
output:
  html_notebook: default
  html_document: default
---

#Data Preparation
Load required packages and read in data, removing unwanted columns exported automatically from Survey Monkey and rename variables whose names didn't get properly imported.  
```{r message = F, warning=F}
packages <- c("likert", "poLCA", "stringr", "ggplot2", "tidyverse", "dplyr")
x = sapply(packages, function(x) if(!require(x, character.only = T)) install.packages(x))


dat <- read_csv("Sheet_1.csv", skip = 1) %>% 
  select(-c(1:9)) %>% 
  rename(`Job title` = `Open-Ended Response`, `Years in current position` = `Open-Ended Response_1`, `Years in librarianship` = `Open-Ended Response_2`, `Other important skills` = `Open-Ended Response_3`, `Additional thoughts or comments` = `Open-Ended Response_4`) 

```

##Data Wrangling
Calculate how many years of experience the person had when they started the job and put it next to the other year columns.
```{r}
dat <- dat %>% 
  mutate(`Years experience when started position` = `Years in librarianship` - `Years in current position`) %>% 
  select(1:3, `Years experience when started position`, everything())
```

Convert likert scale variables to ordered factors for easier analysis.
```{r}
levs <- c("Don't know or N/A", "Not at all important", "Slightly important", "Important", "Very important", "Absolutely essential")

dat[21:68] <- lapply(dat[21:68], parse_factor, levels = levs)
```

Survey Monkey does some weird things with columns that allow multiple selections from a list, so clean those up by replacing all non-NA values with a 1.
```{r}
fix_rows <- c(5:10, 14:20) #select the offending rows
dat[fix_rows] <- replace(dat[fix_rows], !is.na(dat[fix_rows]), 1) #make the replacement
dat[fix_rows] <- replace(dat[fix_rows], is.na(dat[fix_rows]), 0)
dat[fix_rows] <- sapply(dat[fix_rows], as.numeric) #the replacement gets created as a character var, so switch it to numeric

```
Some people didn't complete any of the likert scale items at all, so their responses aren't really very helpful.  Partial responses are okay but remove anyone who didn't answer any likert scale questions.
```{r}
ind <- apply(dat[21:68], 1, function(x) all(is.na(x)))
dat <- dat[ !ind, ]
```


#Demographic analysis
Looking at some characteristics of the people who responded.

##Job titles
```{r}
dat$`Job title` <- tolower(dat$`Job title`)
job_titles <- data.frame(table(dat$`Job title`))

word_list <- c("librarian", "data services", "data management", "research data", "informationist", "manager", "director")

for (i in 1:length(word_list)) {
  sentence <- paste(length(grep(word_list[i], dat$`Job title`)), "job titles contain the word", word_list[i])
  print(sentence)
}

```


##Years of experience
Some visualizations of how long people have been in their current position and stuff.  Filter out anyone who put they had been in their current position longer than their whole career.

```{r}
fixed_dat <- dat %>% filter(`Years in librarianship` >= `Years in current position`) 

require(gridExtra)

plot1 <- ggplot(fixed_dat, aes(x = `Years in current position`)) + geom_histogram() + ggtitle("Years in Current Position") + xlab("Years") + ylab("Number of Respondents") + theme_bw() + theme(plot.title = element_text(hjust = 0.5))

plot2 <- ggplot(fixed_dat, aes(x = `Years in librarianship`)) + geom_histogram() + ggtitle("Years in Librarianship Total") + xlab("Years") + ylab("Number of Respondents") + theme_bw() + theme(plot.title = element_text(hjust = 0.5))

grid.arrange(plot1, plot2, ncol=2)

```


For how many people is this their first library job, i.e. years in librarianship is equal to years in current position?
```{r}
n_first <- dat %>% filter(`Years in librarianship` == `Years in current position`) %>% nrow
```
```{r echo = F}
cat("For", n_first, "participants, their current job is their first in the field of librarianship.")
```

##Educational Experience
How many people have various degrees?  Sum up the counts and plot them as a percentage.
```{r}
degrees <- data.frame(degree = names(dat[14:20]), count = colSums(dat[14:20], na.rm = TRUE))
degrees$percent <- degrees$count/nrow(dat) *100
degrees$label <- paste("n =", degrees$count)

ggplot(degrees, aes(x = reorder(degree, percent), y = percent)) + geom_bar(stat = "identity") + coord_flip() + ylab("Percent of Respondents\nwith Degree") + theme_bw() + ggtitle("Educational Experience") + scale_x_discrete(labels = function(x) lapply(strwrap(x, width = 35, simplify = FALSE), paste, collapse="\n")) + theme(plot.title = element_text(hjust = 0.5)) + xlab("Degree or certificate") + geom_text(aes(label = label, x = degree, y = count), position = position_dodge(width = 0.8), hjust = .8, color = "white")
  

dat <- dat %>% 
  mutate(degrees_held = `ALA-accredited master’s degree` + `Science master’s degree` + `Other non-ALA, non-science master’s degree` + `Undergraduate science degree` + `PhD (any discipline)` + `Specialized librarianship certification (such as data or medical library certification)` + `Other non-degree, non-certificate training in data, science, or specialized librarianship`)
n_degrees <- dat %>% filter(degrees_held > 1) %>% nrow
mean_degrees <- mean(dat$degrees_held)
```
```{r echo = F}
cat(n_degrees, "participants havemore than one degree", "\nThe mean number of degrees held is", mean_degrees)
```


##Areas supported
What disciplines do people work in?
```{r}
discipline <- data.frame(discipline = names(dat[5:10]), count = colSums(dat[5:10], na.rm = TRUE))
discipline$percent <- discipline$count/nrow(dat) *100
discipline$label <- paste("n =", discipline$count)


ggplot(discipline, aes(x = reorder(discipline, percent), y = percent)) + geom_bar(stat = "identity") + coord_flip() + ylab("Percent of Respondents\nSupporting Discipline") + theme_bw() + ggtitle("Disciplinary Support") + scale_x_discrete(labels = function(x) lapply(strwrap(x, width = 35, simplify = FALSE), paste, collapse="\n")) + theme(plot.title = element_text(hjust = 0.5)) + xlab("Discipline") + geom_text(aes(label = label, x = discipline, y = count), position = position_dodge(width = 0.8), hjust = .8, color = "white")

```

How many people are supporting more than one discipline?
```{r}
dat <- dat %>% 
  mutate(disciplines_served = `Biomedical and/or health sciences` + `Life sciences` + `Physical sciences`+ `Mathematics and/or statistics` + `Engineering and/or computer science` + `Social sciences`)
n_disciplines <- dat %>% filter(disciplines_served > 1) %>% nrow
mean_disciplines <- mean(dat$disciplines_served)
```
```{r echo = F}
cat(n_disciplines, "participants serve more than one discipline", "\nThe mean number of disciplines served is", mean_disciplines)
```



##Time spent on data vs non data activities
First remove any NA entries then make charts.

```{r}
dat %>% mutate(sum_of_time = `Percent of time spent on other work` + `Percent of time spent on data-related work`) %>% 
  filter(!is.na(sum_of_time)) %>% ggplot(aes(x = `Percent of time spent on data-related work`)) + geom_histogram() + ylab("Number of respondents") + theme_bw() + ggtitle("Time Spent on Data-Related Work") + theme(plot.title = element_text(hjust = 0.5)) + xlab("Percent of time")




```

#Analysis of Skill/Knowledge Importance Ratings
Do some data cleaning first for the likert package.
```{r}
dat_df <- as.data.frame(dat) ##likert package can't handle tibbles, so convert to a df

##need to handle the NA/don't know - convert these from a level to an NA
dat_df[21:68] <- lapply(dat_df[21:68], recode_factor, "Don't know or N/A" = NA_character_)

```



Break the different skills/knowledge areas out into their category groupings to make this a little easier.  First get the demographic info we want in all the data subsets, then pull out the category subsets.  There's too many in the data management category and the charts look crappy so break it into two.

```{r}
demo_rows <- 1:10
data_mgmt1 <- 21:25
data_mgmt2 <- 26:30
data_mgmttotal <- 21:30
tech <- 31:35
eval_assess <- 36:38
teaching <- 39:43
library <- 44:49
outreach <- 50:53
involvement <- 54:56
personal_attrs <- 57:63
education <- 64:68

subset_cols <- list(data_mgmt1, data_mgmt2, data_mgmttotal, tech, eval_assess, teaching, library, outreach, involvement, personal_attrs, education)

#Pretty names for plots
names(subset_cols) <- c("Data Management Skills, Part 1", "Data Management Skills, Part 2", "Data Management Skills", "Technology Skills", "Evaluation and Assessment Skills", "Teaching Skills", "Library Skills", "Networking and Outreach Skills", "Professional Involvement", "Personal Attributes", "Education")



```


##Likert scale charts
Note: the likert package can't handle tibbles, so convert to a data frame when you pass it to the likert function.
```{r}
par(mfrow = c(2, 4))

for (i in 1:length(subset_cols)) {
  col_nums <- unlist(subset_cols[i])
  lik_dat <- likert(dat_df[col_nums])
  p <- plot(lik_dat) + ggtitle(names(subset_cols)[i]) + scale_fill_grey()
  print(p)
}

```



#Charting How many skills people think are important for various categories
```{r}
t_all <- as.data.frame(t(dat_df))
t_all$group <- NA

for (i in 1:length(subset_cols)) {
  g_name <- names(subset_cols)[i]
  t_all$group[subset_cols[[i]]] <- g_name
}
t_all <- t_all[21:68,]
```






##Breaking down the analysis by demographics

Do people who spend more than half their time doing data related work feel differently about what's important?

```{r}
dat_df$majority_data_time <- ifelse(dat_df$`Percent of time spent on data-related work` >= 50,
c("yes"), c("no")) 
dat_df$majority_data_time <- as.factor(dat_df$majority_data_time)

dat_df <- dat_df[!is.na(dat_df$majority_data_time), ]
```

Make the same Likert scales but broken out by group.


```{r}
for (i in 1:length(subset_cols)) {
  col_nums <- unlist(subset_cols[i])
  lik_dat <- likert(dat_df[, col_nums], grouping = dat_df$majority_data_time)
  p <- plot(lik_dat) + ggtitle(names(subset_cols)[i])
  print(p)
}
```



##PhD vs none
Some job ads specifically ask for a PhD. Let's see if there's anything different about the people who have one or don't.

```{r}
dat_df$hasphd <- ifelse(dat_df$`PhD (any discipline)` == 1,
c("yes"), c("no")) 
dat_df$hasphd <- as.factor(dat_df$hasphd)

dat_df <- dat_df[!is.na(dat_df$hasphd), ]

for (i in 1:length(subset_cols)) {
  col_nums <- unlist(subset_cols[i])
  lik_dat <- likert(dat_df[, col_nums], grouping = dat_df$hasphd)
  p <- plot(lik_dat) + ggtitle(names(subset_cols)[i])
  print(p)
}
```

##MLIS vs none
```{r}
dat_df$hasmlis <- ifelse(dat_df$`ALA-accredited master’s degree` == 1,
c("yes"), c("no")) 
dat_df$hasmlis <- as.factor(dat_df$hasmlis)
dat_df <- dat_df[!is.na(dat_df$hasmlis), ]

for (i in 1:length(subset_cols)) {
  col_nums <- unlist(subset_cols[i])
  lik_dat <- likert(dat_df[, col_nums], grouping = dat_df$hasmlis)
  p <- plot(lik_dat) + ggtitle(names(subset_cols)[i]) + theme(text = element_text(size = 6))
  print(p)
}
```


##Breakdown by disciplines supported

#Latent Class Analysis
##Prepare the data for the analysis

Convert the likert data to a numeric scale, with 1 being low, 5 being high, and 6 NA.  Some levels didn't have any don't know/NA so add a 6 level just in case.  Also, I think we need to collapse this into just 3 levels instead of 5 to make the model work.
```{r}
dat_lca <- dat
dat_lca[21:68] <- lapply(dat_lca[21:68], recode_factor, 
"Don't know or N/A" = "6", "Not at all important" = "1", "Slightly important" = "2", "Important" = "3", "Very important" = "4", "Absolutely essential" = "5")

##We used dat for this rather than dat_df so add in the majority data tiome variable
dat_lca$majority_data_time <- ifelse(dat_lca$`Percent of time spent on data-related work` >= 50,
c("yes"), c("no")) 
dat_lca$majority_data_time <- as.factor(dat_lca$majority_data_time)

```

Add 1 to all 0/1 columns so they are 1/2 as per instructions.  Also make the majority of time on data a numeric column.
```{r}
dat_lca[c(5:10, 14:20)] <- dat_lca[c(5:10, 14:20)] + 1
dat_lca$majority_data_time <- recode_factor(dat_lca$majority_data_time, "yes" = "2", "no" = "1")

dat_lca[c(21:68, 71)] <- lapply(dat_lca[c(21:68, 71)], factor, levels = c(1, 2, 3, 4, 5, 6))
```

Remove unwanted columns
```{r}
drops <- c(11:13, 69:70)
dat_lca <- dat_lca[-drops]
```

Replace any remaining NAs with 6
```{r}
dat_lca[is.na(dat_lca)] <- 6
```

#K-Means and One Hot Encoding

Make a new converting all likert scale responses to one-hot encoded variables, then add back in the non-likert variables.  Also discard that one where the person said they've been in their current position longer than in librarianship total.
```{r}
library(ade4)
one_hot <- acm.disjonctif(dat_df[1:82, 21:68])
one_hot <- cbind(dat_df[1:82, c(1:3, 5:10, 12:20)], one_hot)
```

Calculate the similarities using gower distance
```{r}
gower_dist <- daisy(one_hot[, -1],
                    metric = "gower")



sil_width <- c(NA)

for(i in 2:10){
  
  pam_fit <- pam(gower_dist,
                 diss = TRUE,
                 k = i)
  
  sil_width[i] <- pam_fit$silinfo$avg.width
  
}

# Plot sihouette width (higher is better)

plot(1:10, sil_width,
     xlab = "Number of clusters",
     ylab = "Silhouette Width")
lines(1:10, sil_width)

```

This suggests 2 clusters is best.



```{r}
pam_fit <- pam(gower_dist, diss = TRUE, k = 2)

pam_results <- one_hot %>%
  dplyr::select(-`Job title`) %>%
  mutate(cluster = pam_fit$clustering) %>%
  group_by(cluster) %>%
  do(the_summary = summary(.))

pam_results$the_summary
```


Trying a different implementation of this
```{r}
library(fpc)
pc = pamk(gower_dist, krange=1:5, criterion="asw")
pc[2:3]

hc.m = hclust(gower_dist, method="median")
hc.s = hclust(gower_dist, method="single")
hc.c = hclust(gower_dist, method="complete")
plot(hc.m)
plot(hc.s)
plot(hc.c)


table(cutree(hc.m, k=2), cutree(hc.s, k=2))
```

Looking at the complete clustering.
```{r}
group1l <- hc.c$order[1:28]
group2l <- hc.c$order[29:82]

dat_df[,5:20] <- lapply(dat_df[, 5:20], as.factor)

group1 <- dat_df[group1l, ]
group2 <- dat_df[group2l, ]

group1$Group <- "Specialists"
group2$Group <- "Generalists"

grouped <- rbind(group1, group2)
```

Look at groupings
```{r}
for (i in 1:length(subset_cols)) {
  col_nums <- unlist(subset_cols[i])
  lik_dat <- likert(grouped[, col_nums], grouping = grouped$group)
  p <- plot(lik_dat) + ggtitle(names(subset_cols)[i])
  print(p)
}
```



```{r}
group1[5:10] <- lapply(group1[5:10], as.numeric)
group1[5:10] <- group1[5:10] - 1
discipline1 <- data.frame(discipline = names(group1[5:10]), count = colSums(group1[5:10], na.rm = TRUE))
discipline1$percent <- discipline1$count/nrow(group1) *100
discipline1$label <- paste("n =", discipline1$count)


group2[5:10] <- lapply(group2[5:10], as.numeric)
group2[5:10] <- group2[5:10] - 1
discipline2 <- data.frame(discipline = names(group2[5:10]), count = colSums(group2[5:10], na.rm = TRUE))
discipline2$percent <- discipline2$count/nrow(group2) *100
discipline2$label <- paste("n =", discipline2$count)

discipline2$Group <- "Generalists"
discipline1$Group <- "Specialists"

discipline <- rbind(discipline1, discipline2)

ggplot(discipline, aes(x = reorder(discipline, percent), y = percent, fill = Group)) + geom_bar(stat = "identity", position = "dodge") + coord_flip() + ylab("Percent of Respondents\nSupporting Discipline") + theme_bw() + ggtitle("Disciplinary Support by Group") + scale_x_discrete(labels = function(x) lapply(strwrap(x, width = 35, simplify = FALSE), paste, collapse="\n")) + theme(plot.title = element_text(hjust = 0.5)) + xlab("Discipline") + scale_fill_grey()

```

```{r}
group1 <- group1 %>% 
  mutate(disciplines_served = `Biomedical and/or health sciences` + `Life sciences` + `Physical sciences`+ `Mathematics and/or statistics` + `Engineering and/or computer science` + `Social sciences`)
n_disciplines1 <- group1 %>% filter(disciplines_served > 1) %>% nrow
mean_disciplines1 <- mean(group1$disciplines_served)

group2 <- group2 %>% 
  mutate(disciplines_served = `Biomedical and/or health sciences` + `Life sciences` + `Physical sciences`+ `Mathematics and/or statistics` + `Engineering and/or computer science` + `Social sciences`)
n_disciplines2 <- group2 %>% filter(disciplines_served > 1) %>% nrow
mean_disciplines2 <- mean(group2$disciplines_served)

```
```{r echo = F}
cat(n_disciplines1, "participants in Group 1 serve more than one discipline", "\nThe mean number of disciplines served is", mean_disciplines1, ". ", n_disciplines2, "participants in Group 2 serve more than one discipline", "\nThe mean number of disciplines served is", mean_disciplines2)


```

```{r}
group1[14:20] <- lapply(group1[14:20], as.numeric)
group1[14:20] <- group1[14:20] - 1
degree1 <- data.frame(degree = names(group1[14:20]), count = colSums(group1[14:20], na.rm = TRUE))
degree1$percent <- degree1$count/nrow(group1) *100
degree1$label <- paste("n =", degree1$count)

group2[14:20] <- lapply(group2[14:20], as.numeric)
group2[14:20] <- group2[14:20] - 1
degree2 <- data.frame(degree = names(group2[14:20]), count = colSums(group2[14:20], na.rm = TRUE))
degree2$percent <- degree2$count/nrow(group2) *100
degree2$label <- paste("n =", degree2$count)


degree2$Group <- "Generalists"
degree1$Group <- "Specialists"

degree3 <- rbind(degree1, degree2)

ggplot(degree3, aes(x = reorder(degree, percent), y = percent, fill = Group)) + geom_bar(stat = "identity", position = "dodge") + coord_flip() + ylab("Percent of Respondents\nwith Degree") + theme_bw() + ggtitle("Educational Experience by Group") + scale_x_discrete(labels = function(x) lapply(strwrap(x, width = 35, simplify = FALSE), paste, collapse="\n")) + theme(plot.title = element_text(hjust = 0.5)) + xlab("Degree or certificate") + scale_fill_grey()



```

```{r}
group1 <- group1 %>% 
  mutate(degrees_held = `ALA-accredited master’s degree` + `Science master’s degree` + `Other non-ALA, non-science master’s degree` + `Undergraduate science degree` + `PhD (any discipline)` + `Specialized librarianship certification (such as data or medical library certification)` + `Other non-degree, non-certificate training in data, science, or specialized librarianship`)
n_degrees1 <- group1 %>% filter(degrees_held > 1) %>% nrow
mean_degrees1 <- mean(group1$degrees_held)

group2 <- group2 %>% 
  mutate(degrees_held = `ALA-accredited master’s degree` + `Science master’s degree` + `Other non-ALA, non-science master’s degree` + `Undergraduate science degree` + `PhD (any discipline)` + `Specialized librarianship certification (such as data or medical library certification)` + `Other non-degree, non-certificate training in data, science, or specialized librarianship`)
n_degrees2 <- group2 %>% filter(degrees_held > 1) %>% nrow
mean_degrees2 <- mean(group2$degrees_held)

```
```{r echo = F}
cat(n_degrees1, "participants in Group 1 have more than one degree", "\nThe mean number of degrees held is", mean_degrees1, ". ", n_disciplines2, "participants in Group 2 have more than one degree", "\nThe mean number of degrees held is", mean_degrees2)
```

```{r}

plot1 <- ggplot(grouped, aes(y = `Years in current position`, x = Group)) + geom_boxplot() + ggtitle("Years in Current Position by Group") + ylab("Years") + theme_bw() + theme(plot.title = element_text(hjust = 0.5)) + xlab("")

plot2 <- ggplot(grouped, aes(y = `Years in librarianship`, x = Group)) + geom_boxplot() + ggtitle("Years in Librarianship Total by Group") + ylab("Years") + theme_bw() + theme(plot.title = element_text(hjust = 0.5)) + xlab("")

grid.arrange(plot1, plot2, ncol=2)
```


```{r}
n_first1 <- group1 %>% filter(`Years in librarianship` == `Years in current position`) %>% nrow
n_first2 <- group2 %>% filter(`Years in librarianship` == `Years in current position`) %>% nrow

```
```{r echo = F}
cat("For", n_first1, "Specialists, their current job is their first in the field of librarianship.  For", n_first2, "generalists their current job is their first in the field.")
```

```{r}
group2$data_time <- as.numeric(as.character(group2$`Percent of time spent on data-related work`))
group1$data_time <- as.numeric(as.character(group1$`Percent of time spent on data-related work`))

group1 %>% ggplot(aes(x = data_time)) + geom_histogram() + ylab("Number of respondents") + theme_bw() + ggtitle("Time Spent on Data-Related Work for Specialists") + theme(plot.title = element_text(hjust = 0.5)) + xlab("Percent of time")

group2 %>% ggplot(aes(x = data_time)) + geom_histogram() + ylab("Number of respondents") + theme_bw() + ggtitle("Time Spent on Data-Related Work for Generalists") + theme(plot.title = element_text(hjust = 0.5)) + xlab("Percent of time")


```

```{r}
group1$majority_data_time <- ifelse(group1$data_time >= 50,
c("yes"), c("no")) 
group1$majority_data_time <- as.factor(group1$majority_data_time)

group2$majority_data_time <- ifelse(group2$data_time >= 50,
c("yes"), c("no")) 
group2$majority_data_time <- as.factor(group2$majority_data_time)


```

```{r}
t_group1 <- group1[,c(1, 21:68)]
t_group2 <- group2[,c(1, 21:68)]

melted_g1 <- tbl_df(melt(t_group1, id = c("Job title")))
names(melted_g1)[1] <- "title"

g1 <- gather(melted_g1, variable, value, -title) %>%
  group_by(title, variable, value) %>%
  tally %>% 
  subset(variable == "value")




```

