As the demand for data scientists has grown in the economy, the number of educational programs has also expanded. For this project, we decided to examine language used in the curricula of a sample of graduate programs to see how the key terms used by educators compare to the key terms found in job postings and resumes. In other words:

What does academia consider the most important skills for a data scientist?

To identify the schools in the study, we began with a simple Google search for the “Top Data Scientist Masters Programs.” Our results included the following websites:

We selected a sample of 14 schools from these lists by choosing the schools with full curriculum and course descriptions available online. The process of scraping the information was highly manual, as every school formats their data differently and must be tackled fresh. In the end, while we were able to develop sophisticated code to parse many sites, we did encounter several schools that we eliminated from the sample simply due to the level of scraping required (MIT - we’ll be back to battle again!).

Set up environment and load libraries

rm(list = ls())
library(kableExtra)
library(dplyr)
library(class)
library(knitr)
library(RCurl)
library(XML)
library(jsonlite)
library(rvest)
library(stringr)
library(tidyr)

Institutions Included

I partnered on this part of the assignment with YoungKoung Kim, and we kept a shared Google document to keep track of our progress. I uploaded the spreadsheet to GitHub and then pulled the table into R.

schools <- read.csv("https://raw.githubusercontent.com/stipton/CUNY-SPS/master/DATA%20607%20Project%203/DATA%20607%20Proj%203%20Schools.csv")

kable_styling(kable(schools, "html"), bootstrap_options = "striped")
School.Code | School | Degree | Degree.Name | Website
msu | Michigan State University | MS | Business Analytics | http://broad.msu.edu/businessanalytics/
cin | University of Cincinnati | MS | Business Analytics | http://business.uc.edu/graduate/msbana.html
cuny | CUNY SPS | MS | Data Science | http://catalog.sps.cuny.edu/preview_program.php?catoid=2&poid=607
norw | Northwestern University SPS | MS | Data Science | http://sps.northwestern.edu/program-areas/graduate/predictive-analytics/
asu | Arizona State University | MS | Business Analytics | http://wpcarey.asu.edu/masters-programs/business-analytics/curriculum
pur | Purdue University | MS | Business Analytics and Information Management | http://www.krannert.purdue.edu/masters/programs/business-analytics-and-information-management/
umd | University of Maryland | MS | Business Analytics | http://www.rhsmith.umd.edu/programs/ms-programs/marketing-analytics
bum | Boston University Metropolitan College | MS | Computer Science concentration in Data Analytics | http://www.bu.edu/met/programs/graduate/computer-science/data-analytics/
ncs | North Carolina State University | MS | Analytics | http://analytics.ncsu.edu/
nyu | New York University | MS | Business Analytics | http://www.stern.nyu.edu/programs-admissions/global-degrees/business-analytics/
umuc | University of Maryland University College | MS | Data Analytics | http://www.umuc.edu/academic-programs/masters-degrees/data-analytics.cfm
txam | Texas A&M University | MS | Analytics | https://analytics.stat.tamu.edu/
duke | Duke University | MQM (Master of Quantitative Management) | Business Analytics | https://www.fuqua.duke.edu/programs/mqm-business-analytics
berk | Berkeley University | MA | Information and Data Science | https://www.ischool.berkeley.edu/programs/mids

From Steve

For my portion of the web scraping, I compiled data from 9 university master's programs.

The basic process for most universities was to read in the website, and then use the html_nodes and html_text functions to extract the course data. Once retrieved, I used functions from the tidyverse to clean the data. Finally, I brought all the fields together into a data frame for the school, making sure to add and rename columns as needed so that every school's data frame has the same columns for the final union of all the data sets.
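
As a rough illustration of that read, extract, and assemble pattern, here is a minimal sketch that runs on a made-up HTML snippet instead of a live site. The page content, the course names, and the `toy.*` object names are all invented for the example; real school pages needed different selectors and more clean-up.

```r
library(rvest)

# Made-up HTML standing in for a course-description page (an assumption;
# no real school site is structured exactly like this).
toy.page <- read_html(
  "<html><body><h2>Courses</h2><ul>
   <li>Intro to Analytics (BA 101) Basics of working with data.</li>
   <li>Data Mining (BA 202) Finding patterns in large data sets.</li>
   </ul></body></html>"
)

# Pull the text of every list item that follows the h2 heading.
toy.text <- toy.page %>%
  html_nodes("h2 ~ ul li") %>%
  html_text()

# Assemble a per-school data frame with a school identifier column.
toy.classes <- data.frame(
  School = rep("toy", length(toy.text)),
  Raw = toy.text,
  stringsAsFactors = FALSE
)
```

From here the `Raw` column would still need to be split into title, number, and description, which is the school-specific part of the work.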

1. Michigan State University

msu <- read_html('https://accounting.broad.msu.edu/academic-programs/ms-business-analytics/course-descriptions/')

msu.classes <- msu %>%
  html_nodes('h2~ ul li') %>% ## using selector gadget
  html_text() %>%
  as.data.frame() 

msu.classes <- msu.classes %>%
  separate(colnames(msu.classes[1]), c("Title","Number"), sep = '\\(', extra = "merge") %>%
  separate(Number, c("Number","Description"), sep = '\\)', extra = "merge")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [6].
msu.classes[6,3] <- msu.classes[6,1] %>%
  str_extract(":.*") %>%
  str_remove(":")
msu.classes[6,1] <- msu.classes[6,1] %>%
  str_extract(".*:") %>%
  str_remove(":")

School <- rep("msu", nrow(msu.classes))
msu.classes <- cbind.data.frame(School, msu.classes)
msu.classes[,"Type"] <- character(nrow(msu.classes))
msu.classes[,"Credits"] <- character(nrow(msu.classes))

2. University of Cincinnati

cin <- read_html("https://business.uc.edu/graduate/masters/ms-business-analytics/msba-academics.html")

cin.table <- cin %>% 
  html_nodes("table.table") %>%
  html_table(fill = TRUE, header = TRUE)

cin.classes <- as.data.frame(cin.table[2])

School <- rep("cin", nrow(cin.classes))
cin.classes <- cbind.data.frame(School, cin.classes)

colnames(cin.classes) <- c("School", "Type", "Number", "Title", "Description", "Syllabus")
cin.classes[,"Credits"] <- character(nrow(cin.classes))
cin.classes <- cin.classes[-6]

3. CUNY SPS

I’m such a nerd - when planning my coursework at SPS, I made an Excel sheet with all the courses and their descriptions by hand, copying and pasting from the website. May as well put it to use!

cuny.classes <- read.csv("https://raw.githubusercontent.com/stipton/CUNY-SPS/master/DATA%20607%20Project%203/cuny.classes.csv")

School <- rep("cuny", nrow(cuny.classes))
cuny.classes <- cbind.data.frame(School, cuny.classes)

colnames(cuny.classes) <- c("School", "Type", "Number", "Title", "Description", "Credits", "Prerequisites")

cuny.classes <- cuny.classes[,-7]
cuny.classes$Credits <- as.character(cuny.classes$Credits)

4. Northwestern University

For Northwestern, the course names and course descriptions are not on the same web page. I used the xpathSApply function to locate links on the curriculum page that led to different web pages containing the course descriptions.

norw <- read_html("https://sps.northwestern.edu/masters/data-science/program-courses.php")

norw.links <- norw %>%
  htmlParse() %>%
  xpathSApply("//table//a/@href") %>%
  str_extract_all("/program-courses.*") %>%
  unlist()

norw.classes <- data.frame()

for(i in 1:length(norw.links)) {
  url1 <- "https://sps.northwestern.edu/masters/data-science"
  url2 <- norw.links[i]
  final.url <- paste(url1, url2, sep = "")
  course <- read_html(final.url)
  course.title <- course %>%
    html_nodes('h3') %>%
    html_text()
  norw.classes[i,1] <- course.title
  course.desc <- course %>%
    html_nodes('#main-content p') %>%
    html_text() %>%
    paste(collapse = "")
  norw.classes[i,2] <- course.desc
}

School <- rep("norw", nrow(norw.classes))
norw.classes <- cbind.data.frame(School, norw.classes)

colnames(norw.classes) <- c("School", "Title", "Description")
norw.classes[,"Type"] <- character(nrow(norw.classes))
norw.classes[,"Credits"] <- character(nrow(norw.classes))
norw.classes[,"Number"] <- character(nrow(norw.classes))
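
The link-following idea can be sketched on a made-up index page. This version uses rvest's html_attr instead of xpathSApply, purely to keep the example within one package; the hrefs, course names, and the example.edu base URL are placeholders and not Northwestern's real addresses.

```r
library(rvest)

# Made-up index page; the hrefs and course names are invented for illustration.
toy.index <- read_html(
  '<table>
     <tr><td><a href="/program-courses.php?course_id=1">Course One</a></td></tr>
     <tr><td><a href="/program-courses.php?course_id=2">Course Two</a></td></tr>
   </table>'
)

# Collect the relative links found inside the table.
toy.links <- toy.index %>%
  html_nodes("table a") %>%
  html_attr("href")

# Prefix each relative link with a base URL (example.edu is a placeholder).
toy.urls <- paste0("https://example.edu/masters/data-science", toy.links)
```

Each element of `toy.urls` could then be passed to read_html inside a loop, as in the Northwestern code, to pull the title and description from the individual course page.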

5. Arizona State University

asu <- read_html("https://wpcarey.asu.edu/masters-programs/business-analytics/curriculum")

asu.titles <- asu %>%
  html_nodes("h3.panel-title") %>% 
  html_text() %>%
  as.character() 

asu.descriptions <- asu %>%
  html_nodes("div.panel-body p") %>%
  html_text() %>%
  as.character()

asu.descriptions[10] <- paste(asu.descriptions[10:12], collapse = "")
asu.descriptions <- asu.descriptions[1:10]

asu.classes <- cbind.data.frame(asu.titles, asu.descriptions)

School <- rep("asu", nrow(asu.classes))
asu.classes <- cbind.data.frame(School, asu.classes)

colnames(asu.classes) <- c("School", "Title", "Description")
asu.classes[,"Type"] <- character(nrow(asu.classes))
asu.classes[,"Credits"] <- character(nrow(asu.classes))
asu.classes[,"Number"] <- character(nrow(asu.classes))

6. Purdue University

pur <- read_html("http://www.krannert.purdue.edu/masters/programs/business-analytics-and-information-management/curriculum/home.php")

pur.classes <- pur %>%
  html_nodes("h2~ p ,h2 ,h2~ ul li") %>% 
  html_text() %>%
  as.character()

pur.classes <- pur.classes[-c(1,2,3,18,26,39,71:76)]
pur.classes[16] <- paste(pur.classes[16:21], collapse = " ")
pur.classes[39] <- paste(pur.classes[39:40], collapse = " ")
pur.classes[44] <- paste(pur.classes[44:45], collapse = " ")
pur.classes[55] <- paste(pur.classes[55:58], collapse = " ")
pur.classes <- pur.classes[-c(17:21,40,45,56:58)]

pur.titles <- pur.classes[c(TRUE,FALSE)]
pur.desc <- pur.classes[c(FALSE,TRUE)]
pur.classes <- cbind.data.frame(pur.titles, pur.desc)

School <- rep("pur", nrow(pur.classes))
pur.classes <- cbind.data.frame(School, pur.classes)

colnames(pur.classes) <- c("School", "Title", "Description")
pur.classes[,"Type"] <- character(nrow(pur.classes))
pur.classes[,"Credits"] <- character(nrow(pur.classes))
pur.classes[,"Number"] <- character(nrow(pur.classes))

7. University of Maryland

umd <- read_html("https://www.rhsmith.umd.edu/programs/ms-business-analytics/academics")

umd.classes <- umd %>%
  html_nodes("p") %>% 
  html_text() %>%
  as.character()

umd.classes <- umd.classes[str_detect(umd.classes, "BU[A-Z]{2} \\d")]

School <- rep("umd", length(umd.classes))
umd.classes <- cbind.data.frame(School, umd.classes)

umd.classes <- umd.classes %>%
  separate(umd.classes, c("Title","Credits"), sep = '\\(', extra = "merge") %>%
  separate(Credits, c("Credits","Description"), sep = '\\):', extra = "merge")

umd.classes[,"Type"] <- character(nrow(umd.classes))
umd.classes[,"Number"] <- character(nrow(umd.classes))

8. Boston University Metropolitan College

bum <- read_html("http://www.bu.edu/met/programs/graduate/computer-science/data-analytics/")

bum.classes <- bum %>%
  html_nodes(".bu_collapsible_container, .bu_collapsible_section") %>% 
  html_text() %>%
  as.character()

bum.classes <- bum.classes[-c(1,2)]
bum.desc <- bum.classes[c(FALSE,TRUE)]
bum.titles <- bum.classes[c(TRUE,FALSE)]
bum.titles <- str_sub(bum.titles, 1, 45) ## not perfect, but an approximation of titles

School <- rep("bum", length(bum.titles))
bum.classes <- cbind.data.frame(School, bum.titles, bum.desc)

colnames(bum.classes) <- c("School", "Title", "Description")
bum.classes[,"Type"] <- character(nrow(bum.classes))
bum.classes[,"Credits"] <- character(nrow(bum.classes))
bum.classes[,"Number"] <- character(nrow(bum.classes))

9. North Carolina State University

Scraping the data from NCSU returned a single vector mixing together titles and descriptions, so a little more manual clean-up was required than usual.

ncs <- read_html("http://analytics.ncsu.edu/?page_id=123")

ncs.classes <- ncs %>%
  html_nodes("ul+ h3 , #main li , hr+ h3") %>% 
  html_text() %>%
  as.character()

ncs.classes[2] <- paste(ncs.classes[2:14], collapse = " ")
ncs.classes[16] <- paste(ncs.classes[16:25], collapse = " ")
ncs.classes[27] <- paste(ncs.classes[27:36], collapse = " ")
ncs.classes[38] <- paste(ncs.classes[38:48], collapse = " ")
ncs.classes[50] <- paste(ncs.classes[50:59], collapse = " ")
ncs.classes[61] <- paste(ncs.classes[61:67], collapse = " ")
ncs.classes <- ncs.classes[-c(3:14,17:25,28:36,39:48,51:59,62:67)]

ncs.titles <- ncs.classes[c(TRUE,FALSE)]
ncs.desc <- ncs.classes[c(FALSE,TRUE)]

School <- rep("ncs", length(ncs.titles))
ncs.classes <- cbind.data.frame(School, ncs.titles, ncs.desc)

colnames(ncs.classes) <- c("School", "Title", "Description")
ncs.classes[,"Type"] <- character(nrow(ncs.classes))
ncs.classes[,"Credits"] <- character(nrow(ncs.classes))
ncs.classes[,"Number"] <- character(nrow(ncs.classes))
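
The merge-then-alternate clean-up used here (and for Purdue) is easier to see on a tiny vector. The sketch below uses invented course text; the index positions are specific to this toy vector, just as the positions above are specific to the page as it was scraped.

```r
# Toy vector mixing titles with descriptions that were split into pieces.
toy.raw <- c(
  "Course A",
  "First part of A.",
  "Second part of A.",
  "Course B",
  "All of B."
)

# Merge the description fragments for Course A into a single element ...
toy.raw[2] <- paste(toy.raw[2:3], collapse = " ")
# ... then drop the fragment that has been merged in.
toy.raw <- toy.raw[-3]

# With titles and descriptions now strictly alternating, a recycled logical
# index picks out every other element.
toy.titles <- toy.raw[c(TRUE, FALSE)]
toy.desc <- toy.raw[c(FALSE, TRUE)]
```

The approach depends on hard-coded positions, so it would need to be rechecked whenever the source page changes.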

Union to combine school data

Since I constructed all of the data sets for each school to have the same columns, I can use the union function to connect them in one data set. Note that I added an identifier column to each individual data set in order to tag which school offers the course. This identifier also allows me to join the data set for courses to the school data set.

full.school.set.ST <- asu.classes %>%
  union(bum.classes) %>%
  union(cin.classes) %>%
  union(cuny.classes) %>%
  union(msu.classes) %>%
  union(ncs.classes) %>%
  union(norw.classes) %>%
  union(pur.classes) %>%
  union(umd.classes)
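
One property of dplyr's union worth keeping in mind is that it returns only distinct rows, so a course listed twice with identical fields is kept once. The toy data frames below are invented to show this next to bind_rows, which keeps every row.

```r
library(dplyr)

# Two toy data sets with identical columns; the first contains a duplicate row.
toy.a <- data.frame(
  School = c("aaa", "aaa"),
  Title = c("Stats", "Stats"),
  stringsAsFactors = FALSE
)
toy.b <- data.frame(
  School = "bbb",
  Title = "Databases",
  stringsAsFactors = FALSE
)

# union() stacks the rows and removes exact duplicates.
toy.union <- union(toy.a, toy.b)

# bind_rows() stacks the rows and keeps duplicates.
toy.bound <- bind_rows(toy.a, toy.b)
```

Which behavior is preferable depends on whether repeated course listings should count once or several times in the later keyword tallies.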

From YoungKoung

YoungKoung also delivered school information to the project, and I include her work below.

1. NYU

nyu <- read_html("http://www.stern.nyu.edu/programs-admissions/ms-business-analytics/academics/course-index")
nyu.Description <- nyu %>%
  html_nodes("#region-2 :nth-child(2) .content") %>%
  html_text()  %>%
  str_replace_all("[\r\n]" , " ") %>%
  str_replace_all("Module I: NYU Stern - New York", "  ") %>%
  str_replace_all("Module II: London", "  ") %>%
  str_replace_all("Module III: NYU Shanghai - Shanghai", "  ") %>%
  str_replace_all("Module IV: NYU Stern - New York", " ") %>%
  str_replace_all("Module V: NYU Stern - New York", " ") %>%
  str_split("Course description:|Strategic Capstone") %>%
  data.frame(stringsAsFactors=FALSE)


title <- html_nodes(nyu, "strong") %>%
  html_text()

previous <- html_nodes(nyu, "em") %>%
  html_text() 

# Remove course title 
for(i in 2:16)
{ 
  d <- nyu.Description[i, ] %>%
    str_replace_all(title[i], "  ") 
  nyu.Description[i, ] <- d
}
# Remove <em> field
for(i in 2:14)
{ 
  d <- nyu.Description[i, ] %>%
    str_replace_all(previous[i], "  ") 
  nyu.Description[i, ] <- d
}


nyu.Description <- nyu.Description[2:16, ]
title <- data.frame(title[1:15])
colnames(title) <- "Name"
School <- "nyu"
Program <- "Master of Science in Business Analytics"

nyu.class <- cbind(School, Program, title, nyu.Description)
colnames(nyu.class)[4] <- "Description"

2. University of Maryland University College

umuc <- read_html("http://www.umuc.edu/academic-programs/masters-degrees/data-analytics.cfm")
umuc.class<- html_nodes(umuc, "div.course-popup") %>%
  html_text() %>%
  data.frame()
umuc.class<- umuc.class %>%
  separate(colnames(umuc.class[1]), c("fill","Name", "CodeCredits", "Description"), sep = "\\t") %>%
  separate(CodeCredits, c("Code", "Credits"), sep = "\\|")

umuc.class$fill <- NULL

School <- "umuc"
Program <- "Master of Science in Data Analytics"

umuc.class <- cbind(School, Program, umuc.class)

3. Duke University

duke <- read_html("https://www.fuqua.duke.edu/programs/mqm-business-analytics/curriculum")
duke.class.Name <-html_nodes(duke, ".accordion_item_content strong") %>%
  html_text() %>%
  data.frame()
colnames(duke.class.Name) <- "Name"
duke.Description <-html_nodes(duke, ".accordion_item_content p") %>%
  html_text() %>%
  data.frame()
duke.Description <- duke.Description[2:28, ] 

School <- "duke"
Program <- "Master of Quantitative Management Business Analytics"

duke.class <- cbind(School, Program, duke.class.Name, duke.Description)
colnames(duke.class)[4] <- "Description"

4. Berkeley

berkeley <- read_html("https://www.ischool.berkeley.edu/courses/datasci")
berkeley.class.Name <-html_nodes(berkeley, ".course-title a") %>%
  html_text() %>%
  data.frame()
colnames(berkeley.class.Name) <- "Name"
berkeley.Description <-html_nodes(berkeley, ".views-field-field-course-catalog-description .field-content ") %>%
  html_text() %>%
  data.frame()

School <- "berk"
Program <- "Master of Information and Data Science"

berkeley.class <- cbind(School, Program, berkeley.class.Name, berkeley.Description)
colnames(berkeley.class)[4] <- "Description"

5. Texas A&M University

txam <- read_html("https://analytics.stat.tamu.edu/for-students-2/")
txam.class.Name <-html_nodes(txam, "h4 a") %>%
  html_text() %>%
  str_replace_all("[\r\n\t]" , "") %>%
  data.frame()

txam.class.Name <- txam.class.Name  %>%
  separate(colnames(txam.class.Name [1]), c("Name","Credits"), sep = '\\(', extra = "merge")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 3 rows [1,
## 8, 15].
txam.class.Name$Credits <- str_replace(txam.class.Name$Credits, "\\)", " ")

txam.Description1 <-html_nodes(txam, "h4") %>%
  html_text() %>%
  data.frame()
# select odd rows in addition to rows 2 and 42
txam.Description <- txam.Description1[c(2, seq(1, nrow(txam.Description1), 2), 42),] 
txam.Description <- txam.Description[-c(2, 3, 22)]

School <- "txam"
Program <- "Master of Science Analytics"

txam.class <- cbind(School, Program, txam.class.Name, txam.Description)
colnames(txam.class)[5] <- "Description"

Combine Schools

MSDSprogram <- bind_rows(nyu.class, umuc.class, berkeley.class, duke.class, txam.class)

Combine Steve and YoungKoung Data Sets

After a little formatting to align our data frames, we are able to combine them into the full data set of courses that we will examine.

colnames(MSDSprogram)[3] <- "Title"
colnames(MSDSprogram)[5] <- "Number"
MSDSprogram.formatted <- MSDSprogram[,-2]
MSDSprogram.formatted[,"Type"] <- character(nrow(MSDSprogram.formatted))

full.school.set <- union(full.school.set.ST, MSDSprogram.formatted)
colnames(full.school.set)[1] <- "School.Code"

kable_styling(kable(head(full.school.set, 10), "html"), bootstrap_options = "striped")
School.Code Title Description Type Credits Number
cin Simulation Modeling & Methods Building and using simulation models of complex static and dynamic, stochastic systems using both spreadsheets and high-level simulation software. Topics include generating random numbers, random variates, and random processes, modeling systems, simulating static models in spreadsheets, modeling complex dynamic stochastic systems with high-level commercial simulation software, basic input modeling and statistical analysis of terminating and steady-state simulation output, and managing simulation projects. Applications in complex queueing and inventory models representing real systems such as manufacturing, supply chains, healthcare, and service operations. Core Courses BANA 7030
norw MSDS 454-DL : Advanced Modeling Techniques Drawing upon previous course work in data science, this course build on earlier courses in analytics and modeling, providing a advanced review of traditional statistics and machine learning. It explores computer-intensive methods for parameter and error estimation, model selection, and model validation. Example topics include ordinary least squares regression, logistic regression, multinomial logistic regression, classification and regression trees, neural networks, support vector machines, naïve Bayes methods, principal components analysis, cluster analysis, and regularization techniques. Students work on individual and team assignments using open-source programming tools.Prerequisites: MSDS 411-DL Generalized Linear Models and MSDS 422-DL Practical Machine Learning.
umuc Introduction to Statistics Prerequisite: MATH 012, MATH 037, or an appropriate score on a placement test. An introduction to statistics. The objective is to assess the validity of statistical conclusions; organize, summarize, interpret, and present data using graphical and tabular representations; and apply principles of inferential statistics. Focus is on selecting and applying appropriate statistical tests and determining reasonable inferences and predictions from a set of data. Topics include methods of sampling; percentiles; concepts of probability; probability distributions; normal, t-, and chi-square distributions; confidence intervals; hypothesis testing of one and two means; proportions; binomial experiments; sample size calculations; correlation; regression; and analysis of variance (ANOVA). Students may receive credit for only one of the following courses: BEHS 202, BEHS 302, BMGT 230, ECON 321, GNST 201, MATH 111, MGMT 316, PSYC 200, SOCY 201, STAT 100, STAT 200, STAT 225, or STAT 230. 3 Credits STAT 200
bum MET CS 599 Biometrics Sprg ‘18In this course In this course we will study the fundamental and design applications of various biometric systems based on fingerprints, voice, face, hand geometry, palm print, iris, retina, and other modalities. Multimodal biometric systems that use two or more of the above characteristics will be discussed. Biometric system performance and issues related to the security and privacy aspects of these systems will also be addressed.   [ 4 cr. ]Section Type Instructor Location Days Times A1 IND Djordjevic FLR 266 M 6:00 pm – 8:45 pm
norw MSDS 456-DL : Sports Performance Analytics An introduction to sports performance measurement and analytics, this course reviews roles of athletes at each position in sports selected by the instructor. With a focus on the individual athlete, the course discusses the development and use of accurate assessments and variability due to factors such as body type, climate, and playing surface. The course reviews athletic performance measurements, including jumping ability, running speed, agility, and strength. It utilizes exploratory data analysis, predictive modeling, and presentation graphics, showing real-world implications for athletes, coaches, team managers, and the sports industry.Prerequisites: MSDS 400-DL Math for Data Scientists and MSDS 401-DL Statistical Analysis.
norw MSDS 455-DL : Data Visualization This course begins with a review of human perception and cognition, drawing upon psychological studies of perceptual accuracy and preferences. The course reviews principles of graphic design, what makes for a good graph, and why some data visualizations effectively present information and others do not. It considers visualization as a component of systems for data science and presents examples of exploratory data analysis, visualizing time, networks, and maps. It reviews methods for static and interactive graphics and introduces tools for building web-browser-based presentations. This is a project-based course with programming assignments.Prerequisites: MSDS 400-DL Math for Data Scientists and MSDS 401-DL Statistical Analysis.
duke Intermediate Finance Gain a working knowledge of key concepts in portfolio management. This course covers mutual funds, multifactor models, asset classes and allocation, foreign exchange markets, international investment and capital budgeting, hedge funds, private equity, and venture capital. NA NA
pur MGMT 66000: Intro to Operations Management As goods and services are produced and distributed, they move through a set of inter-related operations or processes in order to match supply with demand. The design of these operations for strategic advantage, investment in improving their efficiency and effectiveness, and controlling these operations to meet performance objectives is the domain of Operations Management. The primary objective of this course is to provide an overview of this important functional area of business.
bum MET CS 555 Data Analysis and Visualization Sp This course provides an overview of the statistical tools most commonly used to process, analyze, and visualize data. Topics include simple linear regression, multiple regression, logistic regression, analysis of variance, and survival analysis. These topics are explored using the statistical package R, with a focus on understanding how to use and interpret output from this software as well as how to visualize results. In each topic area, the methodology, including underlying assumptions and the mechanics of how it all works along with appropriate interpretation of the results, are discussed. Concepts are presented in context of real world examples. Recommended Prerequisite: MET CS 544 or equivalent knowledge, or instructor’s consent.  [ 4 cr. ]Section Type Instructor Location Days Times A1 IND Teymourian CAS 325 M 6:00 pm – 8:45 pm D1 IND Teymourian CAS B27 R 6:00 pm – 8:45 pm J1 IND Teymourian CAS 325 M 6:00 pm – 8:45 pm O1 IND Pedley ARR –
duke Introductory Finance Cover all the basic concepts in finance—discounting, equities, bonds, portfolio diversification, the capital asset pricing model (CAPM), and the weighted average cost of capital (WACC)—in order to establish your knowledge foundation for exposure to more complex financial concepts through the program. NA NA

Prepare list of key terms - from Heather G. and Raj K.

In order to compare our data across different areas of study (education vs. job seekers vs. job postings), our group constructed one list of key terms to search for in our data sets.

The list of data science skills is based on the list found here: https://www.thebalance.com/list-of-data-scientist-skills-2062381

We culled the list and categorized it into two skill types to use in our analysis, soft skills and technical skills.

Heather and Raj developed the code to create the keyword list in R, including synonyms that count as the same keyword (such as “collaboration” and “collaborative”). The next R code chunk is Heather’s code for creating regular expressions based on the keyword list (huge thanks to Heather and Raj here!).

keywords <- read.table("https://raw.githubusercontent.com/heathergeiger/Data607_Project3_Group3/master/heathergeiger_individual_work/combine_ny_and_san_francisco/keywords.txt",header=TRUE,check.names=FALSE,stringsAsFactors=FALSE,sep="\t")
keywords <- keywords[grep('This is probably too tough',keywords$Other.notes,invert=TRUE),]

keyword_list <- vector("list",length=nrow(keywords))

for(i in 1:nrow(keywords)) {
    keywords_this_row <- keywords$Skill[i]
    if(keywords$Synonyms[i] != "None"){
        keywords_this_row <- c(keywords_this_row,unlist(strsplit(keywords$Synonyms[i],",")[[1]]))
    }
    keyword_list[[i]] <- keywords_this_row
}

#Couldn't figure out how to get a regex for a space, comma, or word boundary. However did get one that can get either a space or comma.
space_or_comma <- "[[:space:],]"
word_boundary <- "\\b"

pattern_for_one_keyword <- function(keyword){
    regexes <- paste0(space_or_comma,keyword,space_or_comma)
    regexes <- c(regexes,paste0(word_boundary,keyword,word_boundary))
    regexes <- c(regexes,paste0(word_boundary,keyword,space_or_comma))
    regexes <- c(regexes,paste0(space_or_comma,keyword,word_boundary))
    return(paste0(regexes,collapse="|"))
}

pattern_for_multiple_keywords <- function(keyword_vector){
    if(length(keyword_vector) == 1){return(pattern_for_one_keyword(keyword_vector))}
    if(length(keyword_vector) > 1){
        individual_regexes <- c()
        for(i in 1:length(keyword_vector))
        {
            individual_regexes <- c(individual_regexes,pattern_for_one_keyword(keyword_vector[i]))
        }
    return(paste0(individual_regexes,collapse="|")) 
    }
}

keyword_regexes <- unlist(lapply(keyword_list,function(x)pattern_for_multiple_keywords(x)))
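
One caveat with building patterns by pasting keywords in verbatim: skills such as “c++” and “d3.js” contain characters that a regex engine treats as operators, so the resulting pattern may match more or less than the literal keyword. A possible safeguard, sketched below, is to backslash-escape those characters first. The `escape_regex` and `delimited` helpers are hypothetical additions for illustration and are not part of Heather's code.

```r
# Hypothetical helper (not in the original code): backslash-escape regex
# metacharacters so that a keyword is matched literally.
escape_regex <- function(x) {
  gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", x)
}

toy.escaped <- escape_regex(c("c++", "d3.js", "python"))

# Wrap a keyword in the same space-or-comma delimiters used above.
delimited <- function(keyword) {
  paste0("[[:space:],]", keyword, "[[:space:],]")
}

toy.text <- "experience with c++ , d3.js and python required"

# Check each escaped keyword against the toy text.
toy.matches <- sapply(toy.escaped, function(k) grepl(delimited(k), toy.text))

# Without escaping, the dot in "d3.js" matches any character.
unescaped.hit <- grepl(delimited("d3.js"), " d3xjs ")
escaped.hit <- grepl(delimited(escape_regex("d3.js")), " d3xjs ")
```

Word boundaries do not help for a keyword that ends in a symbol such as “+”, which is one reason the space-or-comma alternatives in the patterns above are useful.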

Compare key terms to title/course data

With the keyword list now prepared, we began the search on the education data list. First, we decided to treat the title and the description for each course as one unit, so I combined the Title column and the Description column together into one field. We then created a loop to check each keyword against the Title.Desc column and populate a TRUE/FALSE column showing whether or not that course contained the keyword.

cols <- c("Title", "Description")
full.school.set$Title.Desc <- do.call(paste, c(full.school.set[cols], sep=" "))

for(i in 1:length(keyword_regexes)) {
full.school.set[,keywords$Skill[i]] <- NA
skill <- keyword_regexes[i]
new.skill.col <- unlist(str_detect(tolower(full.school.set$Title.Desc),skill))
full.school.set[,keywords$Skill[i]] <- new.skill.col
}
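
The same flagging loop can be followed on a two-course toy example. The course text and the three-skill list below are invented, and base R's grepl with a simple word-boundary pattern stands in for the fuller regexes built earlier.

```r
# Toy course data (made up for illustration).
toy.courses <- data.frame(
  Title = c("Data Mining", "Business Communication"),
  Description = c("Covers SQL and Python.", "Presentation skills."),
  stringsAsFactors = FALSE
)

# Treat title and description as one searchable unit.
toy.courses$Title.Desc <- paste(toy.courses$Title, toy.courses$Description)

# A short, hypothetical skill list.
toy.skills <- c("sql", "python", "presentation")

# Add one TRUE/FALSE column per skill.
for (skill in toy.skills) {
  pattern <- paste0("\\b", skill, "\\b")
  toy.courses[[skill]] <- grepl(pattern, tolower(toy.courses$Title.Desc), perl = TRUE)
}
```

Lower-casing the text before matching only works if the keywords themselves are stored in lower case, which this sketch assumes.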

kable_styling(kable(head(full.school.set, 3), "html"), bootstrap_options = "striped")
School.Code Title Description Type Credits Number Title.Desc algorithm appengine aws big data c++ collaboration communication prediction couchdb creativity critical thinking customer service data manipulation data wrangling data mining d3.js decision making decision tree ecl flare google visualization api hadoop java leadership machine learning matlab microsoft excel mining social media modeling perl powerpoint presentation problem solving python r raphael.js risk modeling sas scripting languages sql statistics tableau a/b testing data visualization
cin Simulation Modeling & Methods Building and using simulation models of complex static and dynamic, stochastic systems using both spreadsheets and high-level simulation software. Topics include generating random numbers, random variates, and random processes, modeling systems, simulating static models in spreadsheets, modeling complex dynamic stochastic systems with high-level commercial simulation software, basic input modeling and statistical analysis of terminating and steady-state simulation output, and managing simulation projects. Applications in complex queueing and inventory models representing real systems such as manufacturing, supply chains, healthcare, and service operations. Core Courses BANA 7030 Simulation Modeling & Methods Building and using simulation models of complex static and dynamic, stochastic systems using both spreadsheets and high-level simulation software. Topics include generating random numbers, random variates, and random processes, modeling systems, simulating static models in spreadsheets, modeling complex dynamic stochastic systems with high-level commercial simulation software, basic input modeling and statistical analysis of terminating and steady-state simulation output, and managing simulation projects. Applications in complex queueing and inventory models representing real systems such as manufacturing, supply chains, healthcare, and service operations. FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
norw MSDS 454-DL : Advanced Modeling Techniques Drawing upon previous course work in data science, this course build on earlier courses in analytics and modeling, providing a advanced review of traditional statistics and machine learning. It explores computer-intensive methods for parameter and error estimation, model selection, and model validation. Example topics include ordinary least squares regression, logistic regression, multinomial logistic regression, classification and regression trees, neural networks, support vector machines, naïve Bayes methods, principal components analysis, cluster analysis, and regularization techniques. Students work on individual and team assignments using open-source programming tools.Prerequisites: MSDS 411-DL Generalized Linear Models and MSDS 422-DL Practical Machine Learning. MSDS 454-DL : Advanced Modeling Techniques Drawing upon previous course work in data science, this course build on earlier courses in analytics and modeling, providing a advanced review of traditional statistics and machine learning. It explores computer-intensive methods for parameter and error estimation, model selection, and model validation. Example topics include ordinary least squares regression, logistic regression, multinomial logistic regression, classification and regression trees, neural networks, support vector machines, naïve Bayes methods, principal components analysis, cluster analysis, and regularization techniques. Students work on individual and team assignments using open-source programming tools.Prerequisites: MSDS 411-DL Generalized Linear Models and MSDS 422-DL Practical Machine Learning. FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
umuc Introduction to Statistics Prerequisite: MATH 012, MATH 037, or an appropriate score on a placement test. An introduction to statistics. The objective is to assess the validity of statistical conclusions; organize, summarize, interpret, and present data using graphical and tabular representations; and apply principles of inferential statistics. Focus is on selecting and applying appropriate statistical tests and determining reasonable inferences and predictions from a set of data. Topics include methods of sampling; percentiles; concepts of probability; probability distributions; normal, t-, and chi-square distributions; confidence intervals; hypothesis testing of one and two means; proportions; binomial experiments; sample size calculations; correlation; regression; and analysis of variance (ANOVA). Students may receive credit for only one of the following courses: BEHS 202, BEHS 302, BMGT 230, ECON 321, GNST 201, MATH 111, MGMT 316, PSYC 200, SOCY 201, STAT 100, STAT 200, STAT 225, or STAT 230. 3 Credits STAT 200 Introduction to Statistics Prerequisite: MATH 012, MATH 037, or an appropriate score on a placement test. An introduction to statistics. The objective is to assess the validity of statistical conclusions; organize, summarize, interpret, and present data using graphical and tabular representations; and apply principles of inferential statistics. Focus is on selecting and applying appropriate statistical tests and determining reasonable inferences and predictions from a set of data. Topics include methods of sampling; percentiles; concepts of probability; probability distributions; normal, t-, and chi-square distributions; confidence intervals; hypothesis testing of one and two means; proportions; binomial experiments; sample size calculations; correlation; regression; and analysis of variance (ANOVA). Students may receive credit for only one of the following courses: BEHS 202, BEHS 302, BMGT 230, ECON 321, GNST 201, MATH 111, MGMT 316, PSYC 200, SOCY 201, STAT 100, STAT 200, STAT 225, or STAT 230. FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE

Convert to long form and join with keywords and schools data frames

As a final step before analysis, we converted the wide data set, which stores each skill in its own column, into a long data set with one row for each course/skill combination. We then joined it with the keywords data set (to add the skill type variable, Soft.or.technical) and the schools data set (to add the complete school information).

We saved the final data set so that we won’t need to rerun the web scraping code when we begin the analysis.
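The reshape-and-join step described above can be sketched on a toy example. This is only an illustration under stated assumptions: the data frames `wide`, `keywords.toy`, and `schools.toy` and their values are made up for the sketch (they are not the scraped data), and the join keys are spelled out with `by` rather than inferred.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one row per course, one logical column per skill
wide <- data.frame(
  School.Code = c("cin", "umuc"),
  Title = c("Simulation Modeling & Methods", "Introduction to Statistics"),
  algorithm = c(FALSE, FALSE),
  regression = c(FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical lookup tables standing in for the keywords and schools data frames
keywords.toy <- data.frame(
  Skill = c("algorithm", "regression"),
  Soft.or.technical = c("technical", "technical"),
  stringsAsFactors = FALSE
)
schools.toy <- data.frame(
  School.Code = c("cin", "umuc"),
  School = c("University of Cincinnati",
             "University of Maryland University College"),
  stringsAsFactors = FALSE
)

# Gather every column except the identifiers into Skill/Appears pairs,
# then attach the skill type and the school details
long <- wide %>%
  gather("Skill", "Appears", -School.Code, -Title) %>%
  inner_join(keywords.toy, by = "Skill") %>%
  inner_join(schools.toy, by = "School.Code")

nrow(long)  # 2 courses x 2 skills = 4 rows
```

Each course now appears once per skill, so counting the rows where Appears is TRUE gives the number of course/skill matches.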

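# Reshape the skill columns (column 8 onward) into Skill/Appears pairs, then
# attach the skill type from keywords and the school details from schools
long.school.set <- full.school.set %>% 
  gather("Skill", "Appears", 8:ncol(full.school.set)) %>%
  inner_join(keywords) %>%
  select(-c(Synonyms, Other.notes)) %>%
  inner_join(schools)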
## Joining, by = "Skill"
## Joining, by = "School.Code"
## Warning: Column `School.Code` joining character vector and factor, coercing
## into character vector
kable_styling(kable(head(long.school.set, 3), "html"), bootstrap_options = "striped")
School.Code Title Description Type Credits Number Title.Desc Skill Appears Soft.or.technical School Degree Degree.Name Website
cin Simulation Modeling & Methods Building and using simulation models of complex static and dynamic, stochastic systems using both spreadsheets and high-level simulation software. Topics include generating random numbers, random variates, and random processes, modeling systems, simulating static models in spreadsheets, modeling complex dynamic stochastic systems with high-level commercial simulation software, basic input modeling and statistical analysis of terminating and steady-state simulation output, and managing simulation projects. Applications in complex queueing and inventory models representing real systems such as manufacturing, supply chains, healthcare, and service operations. Core Courses BANA 7030 Simulation Modeling & Methods Building and using simulation models of complex static and dynamic, stochastic systems using both spreadsheets and high-level simulation software. Topics include generating random numbers, random variates, and random processes, modeling systems, simulating static models in spreadsheets, modeling complex dynamic stochastic systems with high-level commercial simulation software, basic input modeling and statistical analysis of terminating and steady-state simulation output, and managing simulation projects. Applications in complex queueing and inventory models representing real systems such as manufacturing, supply chains, healthcare, and service operations. algorithm FALSE technical University of Cincinnati MS Business Analytics http://business.uc.edu/graduate/msbana.html
norw MSDS 454-DL : Advanced Modeling Techniques Drawing upon previous course work in data science, this course build on earlier courses in analytics and modeling, providing a advanced review of traditional statistics and machine learning. It explores computer-intensive methods for parameter and error estimation, model selection, and model validation. Example topics include ordinary least squares regression, logistic regression, multinomial logistic regression, classification and regression trees, neural networks, support vector machines, naïve Bayes methods, principal components analysis, cluster analysis, and regularization techniques. Students work on individual and team assignments using open-source programming tools.Prerequisites: MSDS 411-DL Generalized Linear Models and MSDS 422-DL Practical Machine Learning. MSDS 454-DL : Advanced Modeling Techniques Drawing upon previous course work in data science, this course build on earlier courses in analytics and modeling, providing a advanced review of traditional statistics and machine learning. It explores computer-intensive methods for parameter and error estimation, model selection, and model validation. Example topics include ordinary least squares regression, logistic regression, multinomial logistic regression, classification and regression trees, neural networks, support vector machines, naïve Bayes methods, principal components analysis, cluster analysis, and regularization techniques. Students work on individual and team assignments using open-source programming tools.Prerequisites: MSDS 411-DL Generalized Linear Models and MSDS 422-DL Practical Machine Learning. algorithm FALSE technical Northwestern University SPS MS Data Science http://sps.northwestern.edu/program-areas/graduate/predictive-analytics/
umuc Introduction to Statistics Prerequisite: MATH 012, MATH 037, or an appropriate score on a placement test. An introduction to statistics. The objective is to assess the validity of statistical conclusions; organize, summarize, interpret, and present data using graphical and tabular representations; and apply principles of inferential statistics. Focus is on selecting and applying appropriate statistical tests and determining reasonable inferences and predictions from a set of data. Topics include methods of sampling; percentiles; concepts of probability; probability distributions; normal, t-, and chi-square distributions; confidence intervals; hypothesis testing of one and two means; proportions; binomial experiments; sample size calculations; correlation; regression; and analysis of variance (ANOVA). Students may receive credit for only one of the following courses: BEHS 202, BEHS 302, BMGT 230, ECON 321, GNST 201, MATH 111, MGMT 316, PSYC 200, SOCY 201, STAT 100, STAT 200, STAT 225, or STAT 230. 3 Credits STAT 200 Introduction to Statistics Prerequisite: MATH 012, MATH 037, or an appropriate score on a placement test. An introduction to statistics. The objective is to assess the validity of statistical conclusions; organize, summarize, interpret, and present data using graphical and tabular representations; and apply principles of inferential statistics. Focus is on selecting and applying appropriate statistical tests and determining reasonable inferences and predictions from a set of data. Topics include methods of sampling; percentiles; concepts of probability; probability distributions; normal, t-, and chi-square distributions; confidence intervals; hypothesis testing of one and two means; proportions; binomial experiments; sample size calculations; correlation; regression; and analysis of variance (ANOVA). 
Students may receive credit for only one of the following courses: BEHS 202, BEHS 302, BMGT 230, ECON 321, GNST 201, MATH 111, MGMT 316, PSYC 200, SOCY 201, STAT 100, STAT 200, STAT 225, or STAT 230. algorithm FALSE technical University of Maryland University College MS Data Analytics http://www.umuc.edu/academic-programs/masters-degrees/data-analytics.cfm
save(long.school.set, file="long_school_set.Rdata")
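As a rough sketch of how a saved .Rdata file is read back in a later session, the round trip below uses a made-up one-row data frame and a temporary file path (both are assumptions for the illustration, not the project's actual file). For the real data set the equivalent call would presumably be `load("long_school_set.Rdata")`.

```r
# Hypothetical stand-in for the saved data set
toy.set <- data.frame(Skill = "algorithm", Appears = FALSE,
                      stringsAsFactors = FALSE)

# Write to a temporary location so the sketch leaves no files behind
path <- file.path(tempdir(), "toy_set.Rdata")
save(toy.set, file = path)

# Simulate a fresh session by removing the object, then restore it
rm(toy.set)
restored <- load(path)  # load() returns the names of the objects it restored

restored
exists("toy.set")
```

Because `load()` restores objects under their original names, the analysis script can refer to the data frame by the same name used when it was saved.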