As the demand for data scientists has grown in the economy, the number of educational programs has also expanded. For this project, we decided to examine language used in the curricula of a sample of graduate programs to see how the key terms used by educators compare to the key terms found in job postings and resumes. In other words:
What does academia consider the most important skills for a data scientist?
To identify the schools in the study, we began with a simple Google search for the “Top Data Scientist Masters Programs.” Our results included the following websites:
We selected a sample of 14 schools from these lists by choosing the schools with full curriculum and course descriptions available online. The process of scraping the information was highly manual, as every school formats their data differently and must be tackled fresh. In the end, while we were able to develop sophisticated code to parse many sites, we did encounter several schools that we eliminated from the sample simply due to the level of scraping required (MIT - we’ll be back to battle again!).
rm(list = ls())
library(kableExtra)
library(dplyr)
library(class)
library(knitr)
library(RCurl)
library(XML)
library(jsonlite)
library(rvest)
library(stringr)
library(tidyr)
I partnered on this part of the assignment with YoungKoung Kim, and we kept a shared Google document to track our progress. I uploaded the spreadsheet to GitHub and then pulled the table into R.
schools <- read.csv("https://raw.githubusercontent.com/stipton/CUNY-SPS/master/DATA%20607%20Project%203/DATA%20607%20Proj%203%20Schools.csv")
kable_styling(kable(schools, "html"), bootstrap_options = "striped")
School.Code | School | Degree | Degree.Name | Website |
---|---|---|---|---|
msu | Michigan State University | MS | Business Analytics | http://broad.msu.edu/businessanalytics/ |
cin | University of Cincinnati | MS | Business Analytics | http://business.uc.edu/graduate/msbana.html |
cuny | CUNY SPS | MS | Data Science | http://catalog.sps.cuny.edu/preview_program.php?catoid=2&poid=607 |
norw | Northwestern University SPS | MS | Data Science | http://sps.northwestern.edu/program-areas/graduate/predictive-analytics/ |
asu | Arizona State University | MS | Business Analytics | http://wpcarey.asu.edu/masters-programs/business-analytics/curriculum |
pur | Purdue University | MS | Business Analytics and Information Management | http://www.krannert.purdue.edu/masters/programs/business-analytics-and-information-management/ |
umd | University of Maryland | MS | Business Analytics | http://www.rhsmith.umd.edu/programs/ms-programs/marketing-analytics |
bum | Boston University Metropolitan College | MS | Computer Science concentration in Data Analytics | http://www.bu.edu/met/programs/graduate/computer-science/data-analytics/ |
ncs | North Carolina State University | MS | Analytics | http://analytics.ncsu.edu/ |
nyu | New York University | MS | Business Analytics | http://www.stern.nyu.edu/programs-admissions/global-degrees/business-analytics/ |
umuc | University of Maryland University College | MS | Data Analytics | http://www.umuc.edu/academic-programs/masters-degrees/data-analytics.cfm |
txam | Texas A&M University | MS | Analytics | https://analytics.stat.tamu.edu/ |
duke | Duke University | MQM (Master of Quantitative Management) | Business Analytics | https://www.fuqua.duke.edu/programs/mqm-business-analytics |
berk | Berkeley University | MA | Information and Data Science | https://www.ischool.berkeley.edu/programs/mids |
For my portion of the web scraping, I compiled data from nine university master’s programs.
The basic process for most universities was to read in the website and then use the `html_nodes` and `html_text` functions to extract the course data. Once the data was retrieved, I used functions from the `tidyverse` to clean it. Finally, I brought all the fields together into a data frame for the school, adding and renaming columns as needed so that a final union could be performed on all the data sets.
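The read, select, clean, and pad steps described above can be sketched on a small inline page, so nothing needs to be downloaded. This is only an illustration: the HTML, the course names, and the `toy` school code are made up, and the regular expression assumes every item follows a "Title (Number) Description" layout, which real pages often do not.

```r
library(rvest)

# Toy page standing in for a real course listing (hypothetical markup).
page <- read_html('<html><body>
  <h2>Courses</h2>
  <ul>
    <li>Applied Statistics (STT 801) Regression and inference.</li>
    <li>Data Mining (ITM 881) Classification and clustering.</li>
  </ul>
</body></html>')

# Select the list items that follow the heading and keep only their text.
raw <- page %>%
  html_nodes("h2 ~ ul li") %>%
  html_text()

# Split each item into title, course number, and description.
parts <- regmatches(raw, regexec("^(.*?) \\((.*?)\\) (.*)$", raw, perl = TRUE))

# Assemble the common column layout; fields the page lacks stay empty.
classes <- data.frame(
  School      = "toy",
  Title       = sapply(parts, `[`, 2),
  Number      = sapply(parts, `[`, 3),
  Description = sapply(parts, `[`, 4),
  Type        = "",
  Credits     = "",
  stringsAsFactors = FALSE
)
```

Padding the missing fields with empty strings at this stage is what lets every school's data frame share one set of columns later.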
msu <- read_html('https://accounting.broad.msu.edu/academic-programs/ms-business-analytics/course-descriptions/')
msu.classes <- msu %>%
html_nodes('h2~ ul li') %>% ## using selector gadget
html_text() %>%
as.data.frame()
msu.classes <- msu.classes %>%
separate(colnames(msu.classes[1]), c("Title","Number"), sep = '\\(', extra = "merge") %>%
separate(Number, c("Number","Description"), sep = '\\)', extra = "merge")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [6].
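# Row 6 has no course number in parentheses, so the split above left its
# description empty. Take the text after the colon as the description...
msu.classes[6,3] <- msu.classes[6,1] %>%
str_extract(":.*") %>%
str_remove(":")
# ...and keep only the text before the colon as the title.
msu.classes[6,1] <- msu.classes[6,1] %>%
str_extract(".*:") %>%
str_remove(":")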
School <- rep("msu", nrow(msu.classes))
msu.classes <- cbind.data.frame(School, msu.classes)
msu.classes[,"Type"] <- character(nrow(msu.classes))
msu.classes[,"Credits"] <- character(nrow(msu.classes))
cin <- read_html("https://business.uc.edu/graduate/masters/ms-business-analytics/msba-academics.html")
cin.table <- cin %>%
html_nodes("table.table") %>%
html_table(fill = TRUE, header = TRUE)
cin.classes <- as.data.frame(cin.table[2])
School <- rep("cin", nrow(cin.classes))
cin.classes <- cbind.data.frame(School, cin.classes)
colnames(cin.classes) <- c("School", "Type", "Number", "Title", "Description", "Syllabus")
cin.classes[,"Credits"] <- character(nrow(cin.classes))
cin.classes <- cin.classes[-6]
I’m such a nerd - when planning my coursework at SPS, I made an Excel sheet with all the courses and their descriptions by hand, copying and pasting from the website. May as well put it to use!
cuny.classes <- read.csv("https://raw.githubusercontent.com/stipton/CUNY-SPS/master/DATA%20607%20Project%203/cuny.classes.csv")
School <- rep("cuny", nrow(cuny.classes))
cuny.classes <- cbind.data.frame(School, cuny.classes)
colnames(cuny.classes) <- c("School", "Type", "Number", "Title", "Description", "Credits", "Prerequisites")
cuny.classes <- cuny.classes[,-7]
cuny.classes$Credits <- as.character(cuny.classes$Credits)
For Northwestern, the course names and course descriptions are not on the same web page. I used the `xpathSApply` function to locate links on the curriculum page that led to the separate web pages containing the course descriptions.
norw <- read_html("https://sps.northwestern.edu/masters/data-science/program-courses.php")
norw.links <- norw %>%
htmlParse() %>%
xpathSApply("//table//a/@href") %>%
str_extract_all("/program-courses.*") %>%
unlist()
norw.classes <- data.frame()
for(i in 1:length(norw.links)) {
url1 <- "https://sps.northwestern.edu/masters/data-science"
url2 <- norw.links[i]
final.url <- paste(url1, url2, sep = "")
course <- read_html(final.url)
course.title <- course %>%
html_nodes('h3') %>%
html_text()
norw.classes[i,1] <- course.title
course.desc <- course %>%
html_nodes('#main-content p') %>%
html_text() %>%
paste(collapse = "")
norw.classes[i,2] <- course.desc
}
School <- rep("norw", nrow(norw.classes))
norw.classes <- cbind.data.frame(School, norw.classes)
colnames(norw.classes) <- c("School", "Title", "Description")
norw.classes[,"Type"] <- character(nrow(norw.classes))
norw.classes[,"Credits"] <- character(nrow(norw.classes))
norw.classes[,"Number"] <- character(nrow(norw.classes))
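The link-following step can be sketched offline as follows. The table markup, the `example.edu` base URL, and the file names are placeholders rather than Northwestern's real pages, and the sketch uses `html_attr` from `rvest` instead of `xpathSApply` to pull out the `href` values.

```r
library(rvest)

# Toy index page with one link per course (hypothetical markup).
index <- read_html('<table>
  <tr><td><a href="/program-courses/msds-400.php">MSDS 400</a></td></tr>
  <tr><td><a href="/program-courses/msds-401.php">MSDS 401</a></td></tr>
</table>')

# Collect the relative links found inside the table.
links <- index %>%
  html_nodes("table a") %>%
  html_attr("href")

# Turn each relative link into a full URL (base URL is a placeholder).
base.url <- "https://example.edu/masters/data-science"
course.urls <- paste0(base.url, links)
```

Each element of `course.urls` would then be passed to `read_html` inside the loop, one request per course page.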
asu <- read_html("https://wpcarey.asu.edu/masters-programs/business-analytics/curriculum")
asu.titles <- asu %>%
html_nodes("h3.panel-title") %>%
html_text() %>%
as.character()
asu.descriptions <- asu %>%
html_nodes("div.panel-body p") %>%
html_text() %>%
as.character()
asu.descriptions[10] <- paste(asu.descriptions[10:12], collapse = "")
asu.descriptions <- asu.descriptions[1:10]
asu.classes <- cbind.data.frame(asu.titles, asu.descriptions)
School <- rep("asu", nrow(asu.classes))
asu.classes <- cbind.data.frame(School, asu.classes)
colnames(asu.classes) <- c("School", "Title", "Description")
asu.classes[,"Type"] <- character(nrow(asu.classes))
asu.classes[,"Credits"] <- character(nrow(asu.classes))
asu.classes[,"Number"] <- character(nrow(asu.classes))
pur <- read_html("http://www.krannert.purdue.edu/masters/programs/business-analytics-and-information-management/curriculum/home.php")
pur.classes <- pur %>%
html_nodes("h2~ p ,h2 ,h2~ ul li") %>%
html_text() %>%
as.character()
pur.classes <- pur.classes[-c(1,2,3,18,26,39,71:76)]
pur.classes[16] <- paste(pur.classes[16:21], collapse = " ")
pur.classes[39] <- paste(pur.classes[39:40], collapse = " ")
pur.classes[44] <- paste(pur.classes[44:45], collapse = " ")
pur.classes[55] <- paste(pur.classes[55:58], collapse = " ")
pur.classes <- pur.classes[-c(17:21,40,45,56:58)]
pur.titles <- pur.classes[c(TRUE,FALSE)]
pur.desc <- pur.classes[c(FALSE,TRUE)]
pur.classes <- cbind.data.frame(pur.titles, pur.desc)
School <- rep("pur", nrow(pur.classes))
pur.classes <- cbind.data.frame(School, pur.classes)
colnames(pur.classes) <- c("School", "Title", "Description")
pur.classes[,"Type"] <- character(nrow(pur.classes))
pur.classes[,"Credits"] <- character(nrow(pur.classes))
pur.classes[,"Number"] <- character(nrow(pur.classes))
umd <- read_html("https://www.rhsmith.umd.edu/programs/ms-business-analytics/academics")
umd.classes <- umd %>%
html_nodes("p") %>%
html_text() %>%
as.character()
umd.classes <- umd.classes[str_detect(umd.classes, "BU[A-Z]{2} \\d")]
School <- rep("umd", length(umd.classes))
umd.classes <- cbind.data.frame(School, umd.classes)
umd.classes <- umd.classes %>%
separate(umd.classes, c("Title","Credits"), sep = '\\(', extra = "merge") %>%
separate(Credits, c("Credits","Description"), sep = '\\):', extra = "merge")
umd.classes[,"Type"] <- character(nrow(umd.classes))
umd.classes[,"Number"] <- character(nrow(umd.classes))
bum <- read_html("http://www.bu.edu/met/programs/graduate/computer-science/data-analytics/")
bum.classes <- bum %>%
html_nodes(".bu_collapsible_container, .bu_collapsible_section") %>%
html_text() %>%
as.character()
bum.classes <- bum.classes[-c(1,2)]
bum.desc <- bum.classes[c(FALSE,TRUE)]
bum.titles <- bum.classes[c(TRUE,FALSE)]
bum.titles <- str_sub(bum.titles, 1, 45) ## not perfect, but an approximation of titles
School <- rep("bum", length(bum.titles))
bum.classes <- cbind.data.frame(School, bum.titles, bum.desc)
colnames(bum.classes) <- c("School", "Title", "Description")
bum.classes[,"Type"] <- character(nrow(bum.classes))
bum.classes[,"Credits"] <- character(nrow(bum.classes))
bum.classes[,"Number"] <- character(nrow(bum.classes))
Scraping the data from NCSU returned a single vector mixing together titles and descriptions, so a little more manual clean-up was required than usual.
ncs <- read_html("http://analytics.ncsu.edu/?page_id=123")
ncs.classes <- ncs %>%
html_nodes("ul+ h3 , #main li , hr+ h3") %>%
html_text() %>%
as.character()
ncs.classes[2] <- paste(ncs.classes[2:14], collapse = " ")
ncs.classes[16] <- paste(ncs.classes[16:25], collapse = " ")
ncs.classes[27] <- paste(ncs.classes[27:36], collapse = " ")
ncs.classes[38] <- paste(ncs.classes[38:48], collapse = " ")
ncs.classes[50] <- paste(ncs.classes[50:59], collapse = " ")
ncs.classes[61] <- paste(ncs.classes[61:67], collapse = " ")
ncs.classes <- ncs.classes[-c(3:14,17:25,28:36,39:48,51:59,62:67)]
ncs.titles <- ncs.classes[c(TRUE,FALSE)]
ncs.desc <- ncs.classes[c(FALSE,TRUE)]
School <- rep("ncs", length(ncs.titles))
ncs.classes <- cbind.data.frame(School, ncs.titles, ncs.desc)
colnames(ncs.classes) <- c("School", "Title", "Description")
ncs.classes[,"Type"] <- character(nrow(ncs.classes))
ncs.classes[,"Credits"] <- character(nrow(ncs.classes))
ncs.classes[,"Number"] <- character(nrow(ncs.classes))
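The clean-up pattern used for NCSU (and for Purdue and Boston University) relies on two ideas: merging description fragments back into one element, then splitting the vector with a recycled logical index. A minimal base R sketch on made-up data:

```r
# A scraped vector where titles and descriptions alternate (toy data).
scraped <- c("Course A", "Intro to A.",
             "Course B", "B part one.", "B part two.",
             "Course C", "All about C.")

# The description for Course B arrived as two elements; merge them first,
# then drop the leftover fragment.
scraped[4] <- paste(scraped[4:5], collapse = " ")
scraped <- scraped[-5]

# c(TRUE, FALSE) is recycled along the vector, keeping elements 1, 3, 5, ...
titles <- scraped[c(TRUE, FALSE)]
# c(FALSE, TRUE) keeps elements 2, 4, 6, ...
descriptions <- scraped[c(FALSE, TRUE)]

# The two halves must line up before they are bound together.
stopifnot(length(titles) == length(descriptions))
toy.classes <- data.frame(Title = titles, Description = descriptions,
                          stringsAsFactors = FALSE)
```

The hard-coded positions are the fragile part: if the page changes, the indices shift, so a length check like the one above is a cheap safeguard.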
Since I constructed all of the data sets for each school to have the same columns, I can use the `union` function to connect them in one data set. Note that I added an identifier column to each individual data set in order to tag which school offers the course. This identifier also allows me to join the data set for courses to the school data set.
full.school.set.ST <- asu.classes %>%
union(bum.classes) %>%
union(cin.classes) %>%
union(cuny.classes) %>%
union(msu.classes) %>%
union(ncs.classes) %>%
union(norw.classes) %>%
union(pur.classes) %>%
union(umd.classes)
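Combining the per-school tables only works when every table has the same columns. The base R sketch below shows one way to enforce that before combining. `align_columns` is a hypothetical helper, not part of the project code, and the two data frames are toy inputs. Note that `rbind` keeps duplicate rows, whereas `union` from `dplyr` drops them.

```r
# Column layout shared by every per-school table.
target.cols <- c("School", "Title", "Description", "Type", "Credits", "Number")

# Hypothetical helper: add any missing columns as empty strings, coerce
# everything to character, and put the columns in a fixed order.
align_columns <- function(df, cols = target.cols) {
  for (col in setdiff(cols, names(df))) {
    df[[col]] <- character(nrow(df))
  }
  df[] <- lapply(df, as.character)
  df[, cols, drop = FALSE]
}

# Two toy tables with different columns in a different order.
a <- data.frame(School = "aaa", Title = "Stats", Description = "Intro.",
                stringsAsFactors = FALSE)
b <- data.frame(Number = "X 101", School = "bbb", Title = "Mining",
                Description = "Patterns.", Credits = "3",
                stringsAsFactors = FALSE)

combined <- rbind(align_columns(a), align_columns(b))
```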
YoungKoung also delivered school information to the project, and I include her work below.
nyu <- read_html("http://www.stern.nyu.edu/programs-admissions/ms-business-analytics/academics/course-index")
nyu.Description <- nyu %>%
html_nodes("#region-2 :nth-child(2) .content") %>%
html_text() %>%
str_replace_all("[\r\n]" , " ") %>%
str_replace_all("Module I: NYU Stern - New York", " ") %>%
str_replace_all("Module II: London", " ") %>%
str_replace_all("Module III: NYU Shanghai - Shanghai", " ") %>%
str_replace_all("Module IV: NYU Stern - New York", " ") %>%
str_replace_all("Module V: NYU Stern - New York", " ") %>%
str_split("Course description:|Strategic Capstone") %>%
data.frame(stringsAsFactors=FALSE)
title <- html_nodes(nyu, "strong") %>%
html_text()
previous <- html_nodes(nyu, "em") %>%
html_text()
# Remove course title
for(i in 2:16)
{
d <- nyu.Description[i, ] %>%
str_replace_all(title[i], " ")
nyu.Description[i, ] <- d
}
# Remove <em> field
for(i in 2:14)
{
d <- nyu.Description[i, ] %>%
str_replace_all(previous[i], " ")
nyu.Description[i, ] <- d
}
nyu.Description <- nyu.Description[2:16, ]
title <- data.frame(title[1:15])
colnames(title) <- "Name"
School <- "nyu"
Program <- "Master of Science in Business Analytics"
nyu.class <- cbind(School, Program, title, nyu.Description)
colnames(nyu.class)[4] <- "Description"
umuc <- read_html("http://www.umuc.edu/academic-programs/masters-degrees/data-analytics.cfm")
umuc.class<- html_nodes(umuc, "div.course-popup") %>%
html_text() %>%
data.frame()
umuc.class<- umuc.class %>%
separate(colnames(umuc.class[1]), c("fill","Name", "CodeCredits", "Description"), sep = "\\t") %>%
separate(CodeCredits, c("Code", "Credits"), sep = "\\|")
umuc.class$fill <- NULL
School <- "umuc"
Program <- "Master of Science in Data Analytics"
umuc.class <- cbind(School, Program, umuc.class)
duke <- read_html("https://www.fuqua.duke.edu/programs/mqm-business-analytics/curriculum")
duke.class.Name <-html_nodes(duke, ".accordion_item_content strong") %>%
html_text() %>%
data.frame()
colnames(duke.class.Name) <- "Name"
duke.Description <-html_nodes(duke, ".accordion_item_content p") %>%
html_text() %>%
data.frame()
duke.Description <- duke.Description[2:28, ]
School <- "duke"
Program <- "Master of Quantitative Management Business Analytics"
duke.class <- cbind(School, Program, duke.class.Name, duke.Description)
colnames(duke.class)[4] <- "Description"
berkeley <- read_html("https://www.ischool.berkeley.edu/courses/datasci")
berkeley.class.Name <-html_nodes(berkeley, ".course-title a") %>%
html_text() %>%
data.frame()
colnames(berkeley.class.Name) <- "Name"
berkeley.Description <-html_nodes(berkeley, ".views-field-field-course-catalog-description .field-content ") %>%
html_text() %>%
data.frame()
School <- "berk"
Program <- "Master of Information and Data Science"
berkeley.class <- cbind(School, Program, berkeley.class.Name, berkeley.Description)
colnames(berkeley.class)[4] <- "Description"
txam <- read_html("https://analytics.stat.tamu.edu/for-students-2/")
txam.class.Name <-html_nodes(txam, "h4 a") %>%
html_text() %>%
str_replace_all("[\r\n\t]" , "") %>%
data.frame()
txam.class.Name <- txam.class.Name %>%
separate(colnames(txam.class.Name [1]), c("Name","Credits"), sep = '\\(', extra = "merge")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 3 rows [1,
## 8, 15].
txam.class.Name$Credits <- str_replace(txam.class.Name$Credits, "\\)", " ")
txam.Description1 <-html_nodes(txam, "h4") %>%
html_text() %>%
data.frame()
# select odd rows in addition to rows 2 and 42
txam.Description <- txam.Description1[c(2, seq(1, nrow(txam.Description1), 2), 42),]
txam.Description <- txam.Description[-c(2, 3, 22)]
School <- "txam"
Program <- "Master of Science Analytics"
txam.class <- cbind(School, Program, txam.class.Name, txam.Description)
colnames(txam.class)[5] <- "Description"
MSDSprogram <- bind_rows(nyu.class, umuc.class, berkeley.class, duke.class, txam.class)
After a little formatting to align our data frames, we are able to combine them into the full data set of courses that we will examine.
colnames(MSDSprogram)[3] <- "Title"
colnames(MSDSprogram)[5] <- "Number"
MSDSprogram.formatted <- MSDSprogram[,-2]
MSDSprogram.formatted[,"Type"] <- character(nrow(MSDSprogram.formatted))
full.school.set <- union(full.school.set.ST, MSDSprogram.formatted)
colnames(full.school.set)[1] <- "School.Code"
kable_styling(kable(head(full.school.set, 10), "html"), bootstrap_options = "striped")
School.Code | Title | Description | Type | Credits | Number |
---|---|---|---|---|---|
cin | Simulation Modeling & Methods | Building and using simulation models of complex static and dynamic, stochastic systems using both spreadsheets and high-level simulation software. Topics include generating random numbers, random variates, and random processes, modeling systems, simulating static models in spreadsheets, modeling complex dynamic stochastic systems with high-level commercial simulation software, basic input modeling and statistical analysis of terminating and steady-state simulation output, and managing simulation projects. Applications in complex queueing and inventory models representing real systems such as manufacturing, supply chains, healthcare, and service operations. | Core Courses | BANA 7030 | |
norw | MSDS 454-DL : Advanced Modeling Techniques | Drawing upon previous course work in data science, this course build on earlier courses in analytics and modeling, providing a advanced review of traditional statistics and machine learning. It explores computer-intensive methods for parameter and error estimation, model selection, and model validation. Example topics include ordinary least squares regression, logistic regression, multinomial logistic regression, classification and regression trees, neural networks, support vector machines, naïve Bayes methods, principal components analysis, cluster analysis, and regularization techniques. Students work on individual and team assignments using open-source programming tools.Prerequisites: MSDS 411-DL Generalized Linear Models and MSDS 422-DL Practical Machine Learning. | |||
umuc | Introduction to Statistics | Prerequisite: MATH 012, MATH 037, or an appropriate score on a placement test. An introduction to statistics. The objective is to assess the validity of statistical conclusions; organize, summarize, interpret, and present data using graphical and tabular representations; and apply principles of inferential statistics. Focus is on selecting and applying appropriate statistical tests and determining reasonable inferences and predictions from a set of data. Topics include methods of sampling; percentiles; concepts of probability; probability distributions; normal, t-, and chi-square distributions; confidence intervals; hypothesis testing of one and two means; proportions; binomial experiments; sample size calculations; correlation; regression; and analysis of variance (ANOVA). Students may receive credit for only one of the following courses: BEHS 202, BEHS 302, BMGT 230, ECON 321, GNST 201, MATH 111, MGMT 316, PSYC 200, SOCY 201, STAT 100, STAT 200, STAT 225, or STAT 230. | 3 Credits | STAT 200 | |
bum | MET CS 599 Biometrics Sprg 18In this course | In this course we will study the fundamental and design applications of various biometric systems based on fingerprints, voice, face, hand geometry, palm print, iris, retina, and other modalities. Multimodal biometric systems that use two or more of the above characteristics will be discussed. Biometric system performance and issues related to the security and privacy aspects of these systems will also be addressed. [ 4 cr. ]Section Type Instructor Location Days Times A1 IND Djordjevic FLR 266 M 6:00 pm 8:45 pm | |||
norw | MSDS 456-DL : Sports Performance Analytics | An introduction to sports performance measurement and analytics, this course reviews roles of athletes at each position in sports selected by the instructor. With a focus on the individual athlete, the course discusses the development and use of accurate assessments and variability due to factors such as body type, climate, and playing surface. The course reviews athletic performance measurements, including jumping ability, running speed, agility, and strength. It utilizes exploratory data analysis, predictive modeling, and presentation graphics, showing real-world implications for athletes, coaches, team managers, and the sports industry.Prerequisites: MSDS 400-DL Math for Data Scientists and MSDS 401-DL Statistical Analysis. | |||
norw | MSDS 455-DL : Data Visualization | This course begins with a review of human perception and cognition, drawing upon psychological studies of perceptual accuracy and preferences. The course reviews principles of graphic design, what makes for a good graph, and why some data visualizations effectively present information and others do not. It considers visualization as a component of systems for data science and presents examples of exploratory data analysis, visualizing time, networks, and maps. It reviews methods for static and interactive graphics and introduces tools for building web-browser-based presentations. This is a project-based course with programming assignments.Prerequisites: MSDS 400-DL Math for Data Scientists and MSDS 401-DL Statistical Analysis. | |||
duke | Intermediate Finance | Gain a working knowledge of key concepts in portfolio management. This course covers mutual funds, multifactor models, asset classes and allocation, foreign exchange markets, international investment and capital budgeting, hedge funds, private equity, and venture capital. | NA | NA | |
pur | MGMT 66000: Intro to Operations Management | As goods and services are produced and distributed, they move through a set of inter-related operations or processes in order to match supply with demand. The design of these operations for strategic advantage, investment in improving their efficiency and effectiveness, and controlling these operations to meet performance objectives is the domain of Operations Management. The primary objective of this course is to provide an overview of this important functional area of business. | |||
bum | MET CS 555 Data Analysis and Visualization Sp | This course provides an overview of the statistical tools most commonly used to process, analyze, and visualize data. Topics include simple linear regression, multiple regression, logistic regression, analysis of variance, and survival analysis. These topics are explored using the statistical package R, with a focus on understanding how to use and interpret output from this software as well as how to visualize results. In each topic area, the methodology, including underlying assumptions and the mechanics of how it all works along with appropriate interpretation of the results, are discussed. Concepts are presented in context of real world examples. Recommended Prerequisite: MET CS 544 or equivalent knowledge, or instructor’s consent. [ 4 cr. ]Section Type Instructor Location Days Times A1 IND Teymourian CAS 325 M 6:00 pm 8:45 pm D1 IND Teymourian CAS B27 R 6:00 pm 8:45 pm J1 IND Teymourian CAS 325 M 6:00 pm 8:45 pm O1 IND Pedley ARR | |||
duke | Introductory Finance | Cover all the basic concepts in finance - discounting, equities, bonds, portfolio diversification, the capital asset pricing model (CAPM), and the weighted average cost of capital (WACC) - in order to establish your knowledge foundation for exposure to more complex financial concepts through the program. | NA | NA |
In order to compare our data across different areas of study (education vs. job seekers vs. job postings), our group constructed one list of key terms to search for in our data sets.
The list of data science skills is based on the list found here: https://www.thebalance.com/list-of-data-scientist-skills-2062381
We culled the list and categorized it into two skill types to use in our analysis, soft skills and technical skills.
Heather and Raj developed the code to create the keyword list in R, including synonyms for certain words to count the same (such as “collaboration” and “collaborative”). The next R code chunk is Heather’s code for creating regular expressions based on the keyword list (huge thanks to Heather and Raj here!).
keywords <- read.table("https://raw.githubusercontent.com/heathergeiger/Data607_Project3_Group3/master/heathergeiger_individual_work/combine_ny_and_san_francisco/keywords.txt",header=TRUE,check.names=FALSE,stringsAsFactors=FALSE,sep="\t")
keywords <- keywords[grep('This is probably too tough',keywords$Other.notes,invert=TRUE),]
keyword_list <- vector("list",length=nrow(keywords))
for(i in 1:nrow(keywords)) {
keywords_this_row <- keywords$Skill[i]
if(keywords$Synonyms[i] != "None"){
keywords_this_row <- c(keywords_this_row,unlist(strsplit(keywords$Synonyms[i],",")[[1]]))
}
keyword_list[[i]] <- keywords_this_row
}
#Couldn't figure out how to get a regex for a space, comma, or word boundary. However did get one that can get either a space or comma.
space_or_comma <- "[[:space:],]"
word_boundary <- "\\b"
pattern_for_one_keyword <- function(keyword){
regexes <- paste0(space_or_comma,keyword,space_or_comma)
regexes <- c(regexes,paste0(word_boundary,keyword,word_boundary))
regexes <- c(regexes,paste0(word_boundary,keyword,space_or_comma))
regexes <- c(regexes,paste0(space_or_comma,keyword,word_boundary))
return(paste0(regexes,collapse="|"))
}
pattern_for_multiple_keywords <- function(keyword_vector){
if(length(keyword_vector) == 1){return(pattern_for_one_keyword(keyword_vector))}
if(length(keyword_vector) > 1){
individual_regexes <- c()
for(i in 1:length(keyword_vector))
{
individual_regexes <- c(individual_regexes,pattern_for_one_keyword(keyword_vector[i]))
}
return(paste0(individual_regexes,collapse="|"))
}
}
keyword_regexes <- unlist(lapply(keyword_list,function(x)pattern_for_multiple_keywords(x)))
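One thing that may be worth checking in a keyword list like this: skills such as `c++` or `d3.js` contain characters that regular expressions treat as operators rather than literal text. The sketch below is an alternative, not Heather's method. It escapes those characters and replaces the space, comma, and word-boundary combinations with lookarounds that require a non-alphanumeric neighbour on each side. `escape_regex` and `keyword_pattern` are hypothetical helper names, and the sketch assumes base R matching with `perl = TRUE`.

```r
# Hypothetical helper: backslash-escape regex metacharacters in a keyword.
escape_regex <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

# Hypothetical helper: one pattern for a keyword and its synonyms.
# The lookarounds require that the match is not glued to a letter or digit.
keyword_pattern <- function(keyword_vector) {
  escaped <- escape_regex(tolower(keyword_vector))
  paste0("(?<![[:alnum:]])(?:", paste(escaped, collapse = "|"),
         ")(?![[:alnum:]])")
}

# Toy course text, lower-cased as in the analysis.
text <- tolower(c("Programming in R, Python and C++.",
                  "Regression models for collaborative teams."))

has.cpp    <- grepl(keyword_pattern("c++"), text, perl = TRUE)
has.r      <- grepl(keyword_pattern("r"), text, perl = TRUE)
has.collab <- grepl(keyword_pattern(c("collaboration", "collaborative")),
                    text, perl = TRUE)
```

In this toy example, `has.cpp` and `has.r` are `TRUE` only for the first string, and `has.collab` only for the second. The single letter "r" does not match inside "regression".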
With the keyword list now prepared, we began the search on the education data list. First, we decided to treat the title and the description for each course as one unit, so I combined the Title column and the Description column together into one field. We then created a loop to check each keyword against the `Title.Desc` column and populate a `TRUE`/`FALSE` column showing whether or not that course contained the keyword.
cols <- c("Title", "Description")
full.school.set$Title.Desc <- do.call(paste, c(full.school.set[cols], sep=" "))
for(i in 1:length(keyword_regexes)) {
full.school.set[,keywords$Skill[i]] <- NA
skill <- keyword_regexes[i]
new.skill.col <- unlist(str_detect(tolower(full.school.set$Title.Desc),skill))
full.school.set[,keywords$Skill[i]] <- new.skill.col
}
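The same flagging step can also be written without an explicit loop. This is a small base R sketch on made-up courses and patterns. The three patterns are simplified stand-ins for the real `keyword_regexes`, and `sapply` returns one logical column per named pattern.

```r
# Toy course table (made-up data).
courses <- data.frame(
  Title       = c("Data Mining", "Business Communication"),
  Description = c("Uses Python and SQL.", "Presentation skills for analysts."),
  stringsAsFactors = FALSE
)

# Simplified stand-ins for the keyword regexes, named by skill.
patterns <- c(python       = "\\bpython\\b",
              sql          = "\\bsql\\b",
              presentation = "\\bpresentation\\b")

# Treat title and description as one lower-cased unit of text.
text <- tolower(paste(courses$Title, courses$Description))

# One logical column per skill: does the pattern occur in each course?
flags <- sapply(patterns, grepl, x = text, perl = TRUE)
flagged <- cbind(courses, flags)
```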
kable_styling(kable(head(full.school.set, 3), "html"), bootstrap_options = "striped")
School.Code | Title | Description | Type | Credits | Number | Title.Desc | algorithm | appengine | aws | big data | c++ | collaboration | communication | prediction | couchdb | creativity | critical thinking | customer service | data manipulation | data wrangling | data mining | d3.js | decision making | decision tree | ecl | flare | google visualization api | hadoop | java | leadership | machine learning | matlab | microsoft excel | mining social media | modeling | perl | powerpoint | presentation | problem solving | python | r | raphael.js | risk modeling | sas | scripting languages | sql | statistics | tableau | a/b testing | data visualization |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
cin | Simulation Modeling & Methods | Building and using simulation models of complex static and dynamic, stochastic systems using both spreadsheets and high-level simulation software. Topics include generating random numbers, random variates, and random processes, modeling systems, simulating static models in spreadsheets, modeling complex dynamic stochastic systems with high-level commercial simulation software, basic input modeling and statistical analysis of terminating and steady-state simulation output, and managing simulation projects. Applications in complex queueing and inventory models representing real systems such as manufacturing, supply chains, healthcare, and service operations. | Core Courses | BANA 7030 | Simulation Modeling & Methods Building and using simulation models of complex static and dynamic, stochastic systems using both spreadsheets and high-level simulation software. Topics include generating random numbers, random variates, and random processes, modeling systems, simulating static models in spreadsheets, modeling complex dynamic stochastic systems with high-level commercial simulation software, basic input modeling and statistical analysis of terminating and steady-state simulation output, and managing simulation projects. Applications in complex queueing and inventory models representing real systems such as manufacturing, supply chains, healthcare, and service operations. | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | |
norw | MSDS 454-DL : Advanced Modeling Techniques | Drawing upon previous course work in data science, this course build on earlier courses in analytics and modeling, providing a advanced review of traditional statistics and machine learning. It explores computer-intensive methods for parameter and error estimation, model selection, and model validation. Example topics include ordinary least squares regression, logistic regression, multinomial logistic regression, classification and regression trees, neural networks, support vector machines, naïve Bayes methods, principal components analysis, cluster analysis, and regularization techniques. Students work on individual and team assignments using open-source programming tools.Prerequisites: MSDS 411-DL Generalized Linear Models and MSDS 422-DL Practical Machine Learning. | MSDS 454-DL : Advanced Modeling Techniques Drawing upon previous course work in data science, this course build on earlier courses in analytics and modeling, providing a advanced review of traditional statistics and machine learning. It explores computer-intensive methods for parameter and error estimation, model selection, and model validation. Example topics include ordinary least squares regression, logistic regression, multinomial logistic regression, classification and regression trees, neural networks, support vector machines, naïve Bayes methods, principal components analysis, cluster analysis, and regularization techniques. Students work on individual and team assignments using open-source programming tools.Prerequisites: MSDS 411-DL Generalized Linear Models and MSDS 422-DL Practical Machine Learning. | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | |||
umuc | Introduction to Statistics | Prerequisite: MATH 012, MATH 037, or an appropriate score on a placement test. An introduction to statistics. The objective is to assess the validity of statistical conclusions; organize, summarize, interpret, and present data using graphical and tabular representations; and apply principles of inferential statistics. Focus is on selecting and applying appropriate statistical tests and determining reasonable inferences and predictions from a set of data. Topics include methods of sampling; percentiles; concepts of probability; probability distributions; normal, t-, and chi-square distributions; confidence intervals; hypothesis testing of one and two means; proportions; binomial experiments; sample size calculations; correlation; regression; and analysis of variance (ANOVA). Students may receive credit for only one of the following courses: BEHS 202, BEHS 302, BMGT 230, ECON 321, GNST 201, MATH 111, MGMT 316, PSYC 200, SOCY 201, STAT 100, STAT 200, STAT 225, or STAT 230. | 3 Credits | STAT 200 | Introduction to Statistics Prerequisite: MATH 012, MATH 037, or an appropriate score on a placement test. An introduction to statistics. The objective is to assess the validity of statistical conclusions; organize, summarize, interpret, and present data using graphical and tabular representations; and apply principles of inferential statistics. Focus is on selecting and applying appropriate statistical tests and determining reasonable inferences and predictions from a set of data. Topics include methods of sampling; percentiles; concepts of probability; probability distributions; normal, t-, and chi-square distributions; confidence intervals; hypothesis testing of one and two means; proportions; binomial experiments; sample size calculations; correlation; regression; and analysis of variance (ANOVA). 
Students may receive credit for only one of the following courses: BEHS 202, BEHS 302, BMGT 230, ECON 321, GNST 201, MATH 111, MGMT 316, PSYC 200, SOCY 201, STAT 100, STAT 200, STAT 225, or STAT 230. | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE |
keywords and schools data frames
As a final step before analysis, we converted the wide data set, with all the skills in separate columns, into a long data set with one row for each course/skill combination. We also joined the data set with the keywords data set (to add the Skill Type variable) and the schools data set (to add the complete school information).
To facilitate data analysis, we saved the final data set so that we would not need to rerun the web scraping code when beginning our analysis.
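# Reshape wide to long: columns 8 onward hold one TRUE/FALSE flag per skill
long.school.set <- full.school.set %>%
  gather("Skill", "Appears", 8:length(full.school.set)) %>%
  # Add the skill type from the keywords data set (joins on Skill)
  inner_join(keywords) %>%
  select(-c(Synonyms, Other.notes)) %>%
  # Add the complete school information (joins on School.Code)
  inner_join(schools)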
## Joining, by = "Skill"
## Joining, by = "School.Code"
## Warning: Column `School.Code` joining character vector and factor, coercing
## into character vector
kable_styling(kable(head(long.school.set, 3), "html"), bootstrap_options = "striped")
School.Code | Title | Description | Type | Credits | Number | Title.Desc | Skill | Appears | Soft.or.technical | School | Degree | Degree.Name | Website |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
cin | Simulation Modeling & Methods | Building and using simulation models of complex static and dynamic, stochastic systems using both spreadsheets and high-level simulation software. Topics include generating random numbers, random variates, and random processes, modeling systems, simulating static models in spreadsheets, modeling complex dynamic stochastic systems with high-level commercial simulation software, basic input modeling and statistical analysis of terminating and steady-state simulation output, and managing simulation projects. Applications in complex queueing and inventory models representing real systems such as manufacturing, supply chains, healthcare, and service operations. | Core Courses | | BANA 7030 | Simulation Modeling & Methods Building and using simulation models of complex static and dynamic, stochastic systems using both spreadsheets and high-level simulation software. Topics include generating random numbers, random variates, and random processes, modeling systems, simulating static models in spreadsheets, modeling complex dynamic stochastic systems with high-level commercial simulation software, basic input modeling and statistical analysis of terminating and steady-state simulation output, and managing simulation projects. Applications in complex queueing and inventory models representing real systems such as manufacturing, supply chains, healthcare, and service operations. | algorithm | FALSE | technical | University of Cincinnati | MS | Business Analytics | http://business.uc.edu/graduate/msbana.html |
norw | MSDS 454-DL : Advanced Modeling Techniques | Drawing upon previous course work in data science, this course build on earlier courses in analytics and modeling, providing a advanced review of traditional statistics and machine learning. It explores computer-intensive methods for parameter and error estimation, model selection, and model validation. Example topics include ordinary least squares regression, logistic regression, multinomial logistic regression, classification and regression trees, neural networks, support vector machines, naïve Bayes methods, principal components analysis, cluster analysis, and regularization techniques. Students work on individual and team assignments using open-source programming tools.Prerequisites: MSDS 411-DL Generalized Linear Models and MSDS 422-DL Practical Machine Learning. | | | | MSDS 454-DL : Advanced Modeling Techniques Drawing upon previous course work in data science, this course build on earlier courses in analytics and modeling, providing a advanced review of traditional statistics and machine learning. It explores computer-intensive methods for parameter and error estimation, model selection, and model validation. Example topics include ordinary least squares regression, logistic regression, multinomial logistic regression, classification and regression trees, neural networks, support vector machines, naïve Bayes methods, principal components analysis, cluster analysis, and regularization techniques. Students work on individual and team assignments using open-source programming tools.Prerequisites: MSDS 411-DL Generalized Linear Models and MSDS 422-DL Practical Machine Learning. | algorithm | FALSE | technical | Northwestern University SPS | MS | Data Science | http://sps.northwestern.edu/program-areas/graduate/predictive-analytics/ |
umuc | Introduction to Statistics | Prerequisite: MATH 012, MATH 037, or an appropriate score on a placement test. An introduction to statistics. The objective is to assess the validity of statistical conclusions; organize, summarize, interpret, and present data using graphical and tabular representations; and apply principles of inferential statistics. Focus is on selecting and applying appropriate statistical tests and determining reasonable inferences and predictions from a set of data. Topics include methods of sampling; percentiles; concepts of probability; probability distributions; normal, t-, and chi-square distributions; confidence intervals; hypothesis testing of one and two means; proportions; binomial experiments; sample size calculations; correlation; regression; and analysis of variance (ANOVA). Students may receive credit for only one of the following courses: BEHS 202, BEHS 302, BMGT 230, ECON 321, GNST 201, MATH 111, MGMT 316, PSYC 200, SOCY 201, STAT 100, STAT 200, STAT 225, or STAT 230. | | 3 Credits | STAT 200 | Introduction to Statistics Prerequisite: MATH 012, MATH 037, or an appropriate score on a placement test. An introduction to statistics. The objective is to assess the validity of statistical conclusions; organize, summarize, interpret, and present data using graphical and tabular representations; and apply principles of inferential statistics. Focus is on selecting and applying appropriate statistical tests and determining reasonable inferences and predictions from a set of data. Topics include methods of sampling; percentiles; concepts of probability; probability distributions; normal, t-, and chi-square distributions; confidence intervals; hypothesis testing of one and two means; proportions; binomial experiments; sample size calculations; correlation; regression; and analysis of variance (ANOVA). Students may receive credit for only one of the following courses: BEHS 202, BEHS 302, BMGT 230, ECON 321, GNST 201, MATH 111, MGMT 316, PSYC 200, SOCY 201, STAT 100, STAT 200, STAT 225, or STAT 230. | algorithm | FALSE | technical | University of Maryland University College | MS | Data Analytics | http://www.umuc.edu/academic-programs/masters-degrees/data-analytics.cfm |
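To make the wide-to-long reshaping and the keyword join easier to follow, here is a minimal sketch on a toy data set. The data frames toy.wide and toy.keywords, their values, and the "regression" skill column are invented for illustration and are not part of the scraped data. The sketch also sets stringsAsFactors = FALSE and passes an explicit by argument, which is one way to avoid the factor/character coercion warning and the "Joining, by" messages.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one row per course, one logical column per skill
toy.wide <- data.frame(
  School.Code = c("cin", "umuc"),
  Title = c("Simulation Modeling & Methods", "Introduction to Statistics"),
  algorithm = c(FALSE, FALSE),
  regression = c(FALSE, TRUE),
  stringsAsFactors = FALSE  # keep join keys as character vectors
)

# Hypothetical keyword lookup: one row per skill with its type
toy.keywords <- data.frame(
  Skill = c("algorithm", "regression"),
  Soft.or.technical = c("technical", "technical"),
  stringsAsFactors = FALSE
)

# Gather the skill columns (3 onward here) into Skill/Appears pairs,
# then attach the skill type with an explicit join key
toy.long <- toy.wide %>%
  gather("Skill", "Appears", 3:ncol(toy.wide)) %>%
  inner_join(toy.keywords, by = "Skill")

print(toy.long)
```

Each course now contributes one row per skill, so the two toy courses and two toy skills yield four rows.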
save(long.school.set, file="long_school_set.Rdata")
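A later analysis session can restore the saved object with load() instead of rerunning the scrapers. The sketch below shows the round trip on a small stand-in object written to a temporary file; toy.set and the temporary path are assumptions for illustration, and in practice the file would be long_school_set.Rdata.

```r
# Hypothetical stand-in for long.school.set
toy.set <- data.frame(Skill = "algorithm", Appears = FALSE,
                      stringsAsFactors = FALSE)

# Save to a temporary file so the sketch does not touch the project folder
tmp.file <- file.path(tempdir(), "toy_set.Rdata")
save(toy.set, file = tmp.file)

# Simulate a fresh session by removing the object, then restore it
rm(toy.set)
restored.names <- load(tmp.file)  # load() returns the names it restored

print(restored.names)
print(toy.set)
```

Because load() restores objects under their original names, the analysis script can refer to the data set by the same name that was used when it was saved.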