DATA 607 Project 3

As the demand for data scientists has grown in the economy, the number of educational programs has also expanded. For this project, we decided to examine language used in the curricula of a sample of graduate programs to see how the key terms used by educators compare to the key terms found in job postings and resumes. In other words:

What does academia consider the most important skills for a data scientist?

To identify the schools in the study, we began with a simple Google search for the “Top Data Scientist Masters Programs.” Our results included the following websites:

We selected a sample of 14 schools from these lists by choosing the schools with full curriculum and course descriptions available online. The process of scraping the information was highly manual, as every school formats their data differently and must be tackled fresh. In the end, while we were able to develop sophisticated code to parse many sites, we did encounter several schools that we eliminated from the sample simply due to the level of scraping required (MIT - we’ll be back to battle again!).

Set up environment and load libraries

rm(list = ls())
library(kableExtra)
library(dplyr)
library(class)
library(knitr)
library(RCurl)
library(XML)
library(jsonlite)
library(rvest)
library(stringr)
library(tidyr)

Institutions Included

I partnered on this part of the assignment with YoungKoung Kim, and we kept a shared Google document to keep track of our progress. I uploaded the spreadsheet to github and then pulled the table into R.

schools <- read.csv("https://raw.githubusercontent.com/stipton/CUNY-SPS/master/DATA%20607%20Project%203/DATA%20607%20Proj%203%20Schools.csv")

kable_styling(kable(schools, "html"), bootstrap_options = "striped")

School.Code	School	Degree	Degree.Name	Website
msu	Michigan State University	MS	Business Analytics	http://broad.msu.edu/businessanalytics/
cin	University of Cincinnati	MS	Business Analytics	http://business.uc.edu/graduate/msbana.html
cuny	CUNY SPS	MS	Data Science	http://catalog.sps.cuny.edu/preview_program.php?catoid=2&poid=607
norw	Northwestern University SPS	MS	Data Science	http://sps.northwestern.edu/program-areas/graduate/predictive-analytics/
asu	Arizona State University	MS	Business Analytics	http://wpcarey.asu.edu/masters-programs/business-analytics/curriculum
pur	Purdue University	MS	Business Analytics and Information Management	http://www.krannert.purdue.edu/masters/programs/business-analytics-and-information-management/
umd	University of Maryland	MS	Business Analytics	http://www.rhsmith.umd.edu/programs/ms-programs/marketing-analytics
bum	Boston University Metropolitan College	MS	Computer Science concentration in Data Analytics	http://www.bu.edu/met/programs/graduate/computer-science/data-analytics/
ncs	North Carolina State University	MS	Analytics	http://analytics.ncsu.edu/
nyu	New York University	MS	Business Analytics	http://www.stern.nyu.edu/programs-admissions/global-degrees/business-analytics/
umuc	University of Maryland University College	MS	Data Analytics	http://www.umuc.edu/academic-programs/masters-degrees/data-analytics.cfm
txam	Texas A&M University	MS	Analytics	https://analytics.stat.tamu.edu/
duke	Duke University	MQM (Master of Quantitative Management)	Business Analytics	https://www.fuqua.duke.edu/programs/mqm-business-analytics
berk	Berkeley University	MA	Information and Data Science	https://www.ischool.berkeley.edu/programs/mids

From Steve

For my portion of the web scraping, I compiled data from 9 university masters programs.

The basic process for most universities was to read in the website, and then use the html_nodes and html_text functions to extract the course data. Once retrieved, I used functions from the tidyverse to clean the data. Finally, I brought all the fields together into a data frame for the school, making sure to name add and rename columns as needed in order to perform a final union on all the data sets.

1. Michigan State University

msu <- read_html('https://accounting.broad.msu.edu/academic-programs/ms-business-analytics/course-descriptions/')

msu.classes <- msu %>%
  html_nodes('h2~ ul li') %>% ## using selector gadget
  html_text() %>%
  as.data.frame() 

msu.classes <- msu.classes %>%
  separate(colnames(msu.classes[1]), c("Title","Number"), sep = '\\(', extra = "merge") %>%
  separate(Number, c("Number","Description"), sep = '\\)', extra = "merge")

## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [6].

msu.classes[6,3] <- msu.classes[6,1] %>%
  str_extract(":.*") %>%
  str_remove(":")
msu.classes[6,1] <- msu.classes[6,1] %>%
  str_extract(".*:") %>%
  str_remove(":")

School <- rep("msu", nrow(msu.classes))
msu.classes <- cbind.data.frame(School, msu.classes)
msu.classes[,"Type"] <- character(nrow(msu.classes))
msu.classes[,"Credits"] <- character(nrow(msu.classes))

2. University of Cincinnati

cin <- read_html("https://business.uc.edu/graduate/masters/ms-business-analytics/msba-academics.html")

cin.table <- cin %>% 
  html_nodes("table.table") %>%
  html_table(fill = TRUE, header = TRUE)

cin.classes <- as.data.frame(cin.table[2])

School <- rep("cin", nrow(cin.classes))
cin.classes <- cbind.data.frame(School, cin.classes)

colnames(cin.classes) <- c("School", "Type", "Number", "Title", "Description", "Syllabus")
cin.classes[,"Credits"] <- character(nrow(cin.classes))
cin.classes <- cin.classes[-6]

3. CUNY SPS

I’m such a nerd - when planning my coursework at SPS, I made an Excel sheet with all the courses and their descriptions by hand, copying and pasting from the website. May as well put it to use!

cuny.classes <- read.csv("https://raw.githubusercontent.com/stipton/CUNY-SPS/master/DATA%20607%20Project%203/cuny.classes.csv")

School <- rep("cuny", nrow(cuny.classes))
cuny.classes <- cbind.data.frame(School, cuny.classes)

colnames(cuny.classes) <- c("School", "Type", "Number", "Title", "Description", "Credits", "Prerequisites")

cuny.classes <- cuny.classes[,-7]
cuny.classes$Credits <- as.character(cuny.classes$Credits)

4. Northwestern University

For Northwestern, the course names and course descriptions are not on the same web page. I used the xpathSApply function to locate links on the curricullum page that led to different web pages containing the course descriptions.

norw <- read_html("https://sps.northwestern.edu/masters/data-science/program-courses.php")

norw.links <- norw %>%
  htmlParse() %>%
  xpathSApply("//table//a/@href") %>%
  str_extract_all("/program-courses.*") %>%
  unlist()

norw.classes <- data.frame()

for(i in 1:length(norw.links)) {
  url1 <- "https://sps.northwestern.edu/masters/data-science"
  url2 <- norw.links[i]
  final.url <- paste(url1, url2, sep = "")
  course <- read_html(final.url)
  course.title <- course %>%
    html_nodes('h3') %>%
    html_text()
  norw.classes[i,1] <- course.title
  course.desc <- course %>%
    html_nodes('#main-content p') %>%
    html_text() %>%
    paste(collapse = "")
  norw.classes[i,2] <- course.desc
}

School <- rep("norw", nrow(norw.classes))
norw.classes <- cbind.data.frame(School, norw.classes)

colnames(norw.classes) <- c("School", "Title", "Description")
norw.classes[,"Type"] <- character(nrow(norw.classes))
norw.classes[,"Credits"] <- character(nrow(norw.classes))
norw.classes[,"Number"] <- character(nrow(norw.classes))

5. Arizona State University

asu <- read_html("https://wpcarey.asu.edu/masters-programs/business-analytics/curriculum")

asu.titles <- asu %>%
  html_nodes("h3.panel-title") %>% 
  html_text() %>%
  as.character() 

asu.descriptions <- asu %>%
  html_nodes("div.panel-body p") %>%
  html_text() %>%
  as.character()

asu.descriptions[10] <- paste(asu.descriptions[10:12], collapse = "")
asu.descriptions <- asu.descriptions[1:10]

asu.classes <- cbind.data.frame(asu.titles, asu.descriptions)

School <- rep("asu", nrow(asu.classes))
asu.classes <- cbind.data.frame(School, asu.classes)

colnames(asu.classes) <- c("School", "Title", "Description")
asu.classes[,"Type"] <- character(nrow(asu.classes))
asu.classes[,"Credits"] <- character(nrow(asu.classes))
asu.classes[,"Number"] <- character(nrow(asu.classes))

6. Purdue University

pur <- read_html("http://www.krannert.purdue.edu/masters/programs/business-analytics-and-information-management/curriculum/home.php")

pur.classes <- pur %>%
  html_nodes("h2~ p ,h2 ,h2~ ul li") %>% 
  html_text() %>%
  as.character()

pur.classes <- pur.classes[-c(1,2,3,18,26,39,71:76)]
pur.classes[16] <- paste(pur.classes[16:21], collapse = " ")
pur.classes[39] <- paste(pur.classes[39:40], collapse = " ")
pur.classes[44] <- paste(pur.classes[44:45], collapse = " ")
pur.classes[55] <- paste(pur.classes[55:58], collapse = " ")
pur.classes <- pur.classes[-c(17:21,40,45,56:58)]

pur.titles <- pur.classes[c(TRUE,FALSE)]
pur.desc <- pur.classes[c(FALSE,TRUE)]
pur.classes <- cbind.data.frame(pur.titles, pur.desc)

School <- rep("pur", nrow(pur.classes))
pur.classes <- cbind.data.frame(School, pur.classes)

colnames(pur.classes) <- c("School", "Title", "Description")
pur.classes[,"Type"] <- character(nrow(pur.classes))
pur.classes[,"Credits"] <- character(nrow(pur.classes))
pur.classes[,"Number"] <- character(nrow(pur.classes))

7. University of Maryland

umd <- read_html("https://www.rhsmith.umd.edu/programs/ms-business-analytics/academics")

umd.classes <- umd %>%
  html_nodes("p") %>% 
  html_text() %>%
  as.character()

umd.classes <- umd.classes[str_detect(umd.classes, "BU[A-Z]{2} \\d")]

School <- rep("umd", length(umd.classes))
umd.classes <- cbind.data.frame(School, umd.classes)

umd.classes <- umd.classes %>%
  separate(umd.classes, c("Title","Credits"), sep = '\\(', extra = "merge") %>%
  separate(Credits, c("Credits","Description"), sep = '\\):', extra = "merge")

umd.classes[,"Type"] <- character(nrow(umd.classes))
umd.classes[,"Number"] <- character(nrow(umd.classes))

8. Boston University Metropolitan College

bum <- read_html("http://www.bu.edu/met/programs/graduate/computer-science/data-analytics/")

bum.classes <- bum %>%
  html_nodes(".bu_collapsible_container, .bu_collapsible_section") %>% 
  html_text() %>%
  as.character()

bum.classes <- bum.classes[-c(1,2)]
bum.desc <- bum.classes[c(FALSE,TRUE)]
bum.titles <- bum.classes[c(TRUE,FALSE)]
bum.titles <- str_sub(bum.titles, 1, 45) ## not perfect, but an approximation of titles

School <- rep("bum", length(bum.titles))
bum.classes <- cbind.data.frame(School, bum.titles, bum.desc)

colnames(bum.classes) <- c("School", "Title", "Description")
bum.classes[,"Type"] <- character(nrow(bum.classes))
bum.classes[,"Credits"] <- character(nrow(bum.classes))
bum.classes[,"Number"] <- character(nrow(bum.classes))

9. North Carolina State University

Scraping the data from NCSU returned a single vector mixing together titles and descriptions, so a little more manual clean-up was required than usual.

ncs <- read_html("http://analytics.ncsu.edu/?page_id=123")

ncs.classes <- ncs %>%
  html_nodes("ul+ h3 , #main li , hr+ h3") %>% 
  html_text() %>%
  as.character()

ncs.classes[2] <- paste(ncs.classes[2:14], collapse = " ")
ncs.classes[16] <- paste(ncs.classes[16:25], collapse = " ")
ncs.classes[27] <- paste(ncs.classes[27:36], collapse = " ")
ncs.classes[38] <- paste(ncs.classes[38:48], collapse = " ")
ncs.classes[50] <- paste(ncs.classes[50:59], collapse = " ")
ncs.classes[61] <- paste(ncs.classes[61:67], collapse = " ")
ncs.classes <- ncs.classes[-c(3:14,17:25,28:36,39:48,51:59,62:67)]

ncs.titles <- ncs.classes[c(TRUE,FALSE)]
ncs.desc <- ncs.classes[c(FALSE,TRUE)]

School <- rep("ncs", length(ncs.titles))
ncs.classes <- cbind.data.frame(School, ncs.titles, ncs.desc)

colnames(ncs.classes) <- c("School", "Title", "Description")
ncs.classes[,"Type"] <- character(nrow(ncs.classes))
ncs.classes[,"Credits"] <- character(nrow(ncs.classes))
ncs.classes[,"Number"] <- character(nrow(ncs.classes))

Union to combine school data

Since I constructed all of the data sets for each school to have the same columns, I can use the union function to connect them in one data set. Note that I added an identifier column to each individual data set in order to tag which school offers the course. This identifier also allows me to join the data set for courses to the school data set.

full.school.set.ST <- asu.classes %>%
  union(bum.classes) %>%
  union(cin.classes) %>%
  union(cuny.classes) %>%
  union(msu.classes) %>%
  union(ncs.classes) %>%
  union(norw.classes) %>%
  union(pur.classes) %>%
  union(umd.classes)

From Youngkoung

YoungKoung also delivered school information to the project, and I include her work below.

1. NYU

nyu <- read_html("http://www.stern.nyu.edu/programs-admissions/ms-business-analytics/academics/course-index")
nyu.Description <- nyu %>%
  html_nodes("#region-2 :nth-child(2) .content") %>%
  html_text()  %>%
  str_replace_all("[\r\n]" , " ") %>%
  str_replace_all("Module I: NYU Stern - New York", "  ") %>%
  str_replace_all("Module II: London", "  ") %>%
  str_replace_all("Module III: NYU Shanghai - Shanghai", "  ") %>%
  str_replace_all("Module IV: NYU Stern - New York", " ") %>%
  str_replace_all("Module V: NYU Stern - New York", " ") %>%
  str_split("Course description:|Strategic Capstone") %>%
  data.frame(stringsAsFactors=FALSE)


title <- html_nodes(nyu, "strong") %>%
  html_text()

previous <- html_nodes(nyu, "em") %>%
  html_text() 

# Remove course title 
for(i in 2:16)
{ 
  d <- nyu.Description[i, ] %>%
    str_replace_all(title[i], "  ") 
    nyu.Description[i, ] <- d
}
# Remove <em> field
for(i in 2:14)
{ 
  d <- nyu.Description[i, ] %>%
    str_replace_all(previous[i], "  ") 
  nyu.Description[i, ] <- d
}


nyu.Description <- nyu.Description[2:16, ]
title <- data.frame(title[1:15])
colnames(title) <- "Name"
School <- "nyu"
Program <- "Master of Science in Business Analytics"

nyu.class <- cbind(School, Program, title, nyu.Description)
colnames(nyu.class)[4] <- "Description"

2. University of Maryland University College

umuc <- read_html("http://www.umuc.edu/academic-programs/masters-degrees/data-analytics.cfm")
umuc.class<- html_nodes(umuc, "div.course-popup") %>%
  html_text() %>%
  data.frame()
umuc.class<- umuc.class %>%
  separate(colnames(umuc.class[1]), c("fill","Name", "CodeCredits", "Description"), sep = "\\t") %>%
  separate(CodeCredits, c("Code", "Credits"), sep = "\\|")

umuc.class$fill <- NULL

School <- "umuc"
Program <- "Master of Science in Data Analytics"

umuc.class <- cbind(School, Program, umuc.class)

3. Duke University

duke <- read_html("https://www.fuqua.duke.edu/programs/mqm-business-analytics/curriculum")
duke.class.Name <-html_nodes(duke, ".accordion_item_content strong") %>%
  html_text() %>%
  data.frame()
colnames(duke.class.Name) <- "Name"
duke.Description <-html_nodes(duke, ".accordion_item_content p") %>%
  html_text() %>%
  data.frame()
duke.Description <- duke.Description[2:28, ] 

School <- "duke"
Program <- "Master of Quantitative Management Business Analytics"

duke.class <- cbind(School, Program, duke.class.Name, duke.Description)
colnames(duke.class)[4] <- "Description"

4. Berkeley

berkeley <- read_html("https://www.ischool.berkeley.edu/courses/datasci")
berkeley.class.Name <-html_nodes(berkeley, ".course-title a") %>%
  html_text() %>%
  data.frame()
colnames(berkeley.class.Name) <- "Name"
berkeley.Description <-html_nodes(berkeley, ".views-field-field-course-catalog-description .field-content ") %>%
  html_text() %>%
  data.frame()

School <- "berk"
Program <- "Master of Information and Data Science"

berkeley.class <- cbind(School, Program, berkeley.class.Name, berkeley.Description)
colnames(berkeley.class)[4] <- "Description"

5. Texas A&M University

txam <- read_html("https://analytics.stat.tamu.edu/for-students-2/")
txam.class.Name <-html_nodes(txam, "h4 a") %>%
  html_text() %>%
  str_replace_all("[\r\n\t]" , "") %>%
  data.frame()

txam.class.Name <- txam.class.Name  %>%
  separate(colnames(txam.class.Name [1]), c("Name","Credits"), sep = '\\(', extra = "merge")

## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 3 rows [1,
## 8, 15].

txam.class.Name$Credits <- str_replace(txam.class.Name$Credits, "\\)", " ")

txam.Description1 <-html_nodes(txam, "h4") %>%
  html_text() %>%
  data.frame()
# select odd rows in addition to rows 2 and 42
txam.Description <- txam.Description1[c(2, seq(1, nrow(txam.Description1), 2), 42),] 
txam.Description <- txam.Description[-c(2, 3, 22)]

School <- "txam"
Program <- "Master of Science Analytics"

txam.class <- cbind(School, Program, txam.class.Name, txam.Description)
colnames(txam.class)[5] <- "Description"

Combine Shools

MSDSprogram <- bind_rows(nyu.class, umuc.class, berkeley.class, duke.class, txam.class)

Combine Steve and Youngkoung Data Sets

After a little formatting to align our data frames, we are able to combine them into the full data set of courses that we will examine.

colnames(MSDSprogram)[3] <- "Title"
colnames(MSDSprogram)[5] <- "Number"
MSDSprogram.formatted <- MSDSprogram[,-2]
MSDSprogram.formatted[,"Type"] <- character(nrow(MSDSprogram.formatted))

full.school.set <- union(full.school.set.ST, MSDSprogram.formatted)
colnames(full.school.set)[1] <- "School.Code"

kable_styling(kable(head(full.school.set, 10), "html"), bootstrap_options = "striped")

School.Code	Title	Description	Type	Credits	Number
cin	Simulation Modeling & Methods	Building and using simulation models of complex static and dynamic, stochastic systems using both spreadsheets and high-level simulation software. Topics include generating random numbers, random variates, and random processes, modeling systems, simulating static models in spreadsheets, modeling complex dynamic stochastic systems with high-level commercial simulation software, basic input modeling and statistical analysis of terminating and steady-state simulation output, and managing simulation projects. Applications in complex queueing and inventory models representing real systems such as manufacturing, supply chains, healthcare, and service operations.	Core Courses		BANA 7030
norw	MSDS 454-DL : Advanced Modeling Techniques	Drawing upon previous course work in data science, this course build on earlier courses in analytics and modeling, providing a advanced review of traditional statistics and machine learning. It explores computer-intensive methods for parameter and error estimation, model selection, and model validation. Example topics include ordinary least squares regression, logistic regression, multinomial logistic regression, classification and regression trees, neural networks, support vector machines, naïve Bayes methods, principal components analysis, cluster analysis, and regularization techniques. Students work on individual and team assignments using open-source programming tools.Prerequisites: MSDS 411-DL Generalized Linear Models and MSDS 422-DL Practical Machine Learning.
umuc	Introduction to Statistics	Prerequisite: MATH 012, MATH 037, or an appropriate score on a placement test. An introduction to statistics. The objective is to assess the validity of statistical conclusions; organize, summarize, interpret, and present data using graphical and tabular representations; and apply principles of inferential statistics. Focus is on selecting and applying appropriate statistical tests and determining reasonable inferences and predictions from a set of data. Topics include methods of sampling; percentiles; concepts of probability; probability distributions; normal, t-, and chi-square distributions; confidence intervals; hypothesis testing of one and two means; proportions; binomial experiments; sample size calculations; correlation; regression; and analysis of variance (ANOVA). Students may receive credit for only one of the following courses: BEHS 202, BEHS 302, BMGT 230, ECON 321, GNST 201, MATH 111, MGMT 316, PSYC 200, SOCY 201, STAT 100, STAT 200, STAT 225, or STAT 230.		3 Credits	STAT 200
bum	MET CS 599 Biometrics Sprg 18In this course	In this course we will study the fundamental and design applications of various biometric systems based on fingerprints, voice, face, hand geometry, palm print, iris, retina, and other modalities. Multimodal biometric systems that use two or more of the above characteristics will be discussed. Biometric system performance and issues related to the security and privacy aspects of these systems will also be addressed. [ 4 cr. ]Section Type Instructor Location Days Times A1 IND Djordjevic FLR 266 M 6:00 pm 8:45 pm
norw	MSDS 456-DL : Sports Performance Analytics	An introduction to sports performance measurement and analytics, this course reviews roles of athletes at each position in sports selected by the instructor. With a focus on the individual athlete, the course discusses the development and use of accurate assessments and variability due to factors such as body type, climate, and playing surface. The course reviews athletic performance measurements, including jumping ability, running speed, agility, and strength. It utilizes exploratory data analysis, predictive modeling, and presentation graphics, showing real-world implications for athletes, coaches, team managers, and the sports industry.Prerequisites: MSDS 400-DL Math for Data Scientists and MSDS 401-DL Statistical Analysis.
norw	MSDS 455-DL : Data Visualization	This course begins with a review of human perception and cognition, drawing upon psychological studies of perceptual accuracy and preferences. The course reviews principles of graphic design, what makes for a good graph, and why some data visualizations effectively present information and others do not. It considers visualization as a component of systems for data science and presents examples of exploratory data analysis, visualizing time, networks, and maps. It reviews methods for static and interactive graphics and introduces tools for building web-browser-based presentations. This is a project-based course with programming assignments.Prerequisites: MSDS 400-DL Math for Data Scientists and MSDS 401-DL Statistical Analysis.
duke	Intermediate Finance	Gain a working knowledge of key concepts in portfolio management. This course covers mutual funds, multifactor models, asset classes and allocation, foreign exchange markets, international investment and capital budgeting, hedge funds, private equity, and venture capital.		NA	NA
pur	MGMT 66000: Intro to Operations Management	As goods and services are produced and distributed, they move through a set of inter-related operations or processes in order to match supply with demand. The design of these operations for strategic advantage, investment in improving their efficiency and effectiveness, and controlling these operations to meet performance objectives is the domain of Operations Management. The primary objective of this course is to provide an overview of this important functional area of business.
bum	MET CS 555 Data Analysis and Visualization Sp	This course provides an overview of the statistical tools most commonly used to process, analyze, and visualize data. Topics include simple linear regression, multiple regression, logistic regression, analysis of variance, and survival analysis. These topics are explored using the statistical package R, with a focus on understanding how to use and interpret output from this software as well as how to visualize results. In each topic area, the methodology, including underlying assumptions and the mechanics of how it all works along with appropriate interpretation of the results, are discussed. Concepts are presented in context of real world examples. Recommended Prerequisite: MET CS 544 or equivalent knowledge, or instructor’s consent. [ 4 cr. ]Section Type Instructor Location Days Times A1 IND Teymourian CAS 325 M 6:00 pm 8:45 pm D1 IND Teymourian CAS B27 R 6:00 pm 8:45 pm J1 IND Teymourian CAS 325 M 6:00 pm 8:45 pm O1 IND Pedley ARR
duke	Introductory Finance	Cover all the basic concepts in financediscounting, equities, bonds, portfolio diversification, the capital asset pricing model (CAPM), and the weighted average cost of capital (WACC)in order to establish your knowledge foundation for exposure to more complex financial concepts through the program.		NA	NA

Prepare list of key terms - from Heather G. and Raj K.

In order to compare our data across different areas of study (education vs. job seekers vs. job postings), our group constructed one list of key terms to search for in our data sets.

The list of data science skills is based off the list found here: https://www.thebalance.com/list-of-data-scientist-skills-2062381

We culled the list and categorized it into two skill types to use in our analysis, soft skills and technical skills.

Heather and Raj developed the code to create the keyword list in R, including synonyms for certain words to count the same (such as “collaboration” and “collaborative”). The next R code chunk is Heather’s code for creating regular expressions based on the keyword list (huge thanks to Heather and Raj here!).

keywords <- read.table("https://raw.githubusercontent.com/heathergeiger/Data607_Project3_Group3/master/heathergeiger_individual_work/combine_ny_and_san_francisco/keywords.txt",header=TRUE,check.names=FALSE,stringsAsFactors=FALSE,sep="\t")
keywords <- keywords[grep('This is probably too tough',keywords$Other.notes,invert=TRUE),]

keyword_list <- vector("list",length=nrow(keywords))

for(i in 1:nrow(keywords)) {
keywords_this_row <- keywords$Skill[i]
if(keywords$Synonyms[i] != "None"){
    keywords_this_row <- c(keywords_this_row,unlist(strsplit(keywords$Synonyms[i],",")[[1]]))
    }
keyword_list[[i]] <- keywords_this_row
}

#Couldn't figure out how to get a regex for a space, comma, or word boundary. However did get one that can get either a space or comma.
space_or_comma <- "[[:space:],]"
word_boundary <- "\\b"

pattern_for_one_keyword <- function(keyword){
    regexes <- paste0(space_or_comma,keyword,space_or_comma)
    regexes <- c(regexes,paste0(word_boundary,keyword,word_boundary))
    regexes <- c(regexes,paste0(word_boundary,keyword,space_or_comma))
    regexes <- c(regexes,paste0(space_or_comma,keyword,word_boundary))
    return(paste0(regexes,collapse="|"))
}

pattern_for_multiple_keywords <- function(keyword_vector){
    if(length(keyword_vector) == 1){return(pattern_for_one_keyword(keyword_vector))}
    if(length(keyword_vector) > 1){
        individual_regexes <- c()
        for(i in 1:length(keyword_vector))
        {
            individual_regexes <- c(individual_regexes,pattern_for_one_keyword(keyword_vector[i]))
        }
    return(paste0(individual_regexes,collapse="|")) 
    }
}

keyword_regexes <- unlist(lapply(keyword_list,function(x)pattern_for_multiple_keywords(x)))

Compare key terms to title/course data

With the keyword list now prepared, we began the search on the education data list. First, we decided to treat the title and the description for each course as one unit, so I combined the Title column and the Description column together into one field. We then created a loop to check each keyword against the Title.Desc column and populate a TRUE/FALSE column showing whether or not that course contained the keyword.

cols <- c("Title", "Description")
full.school.set$Title.Desc <- do.call(paste, c(full.school.set[cols], sep=" "))

for(i in 1:length(keyword_regexes)) {
full.school.set[,keywords$Skill[i]] <- NA
skill <- keyword_regexes[i]
new.skill.col <- unlist(str_detect(tolower(full.school.set$Title.Desc),skill))
full.school.set[,keywords$Skill[i]] <- new.skill.col
}

kable_styling(kable(head(full.school.set, 3), "html"), bootstrap_options = "striped")

School.Code	Title	Description	Type	Credits	Number	Title.Desc	algorithm	appengine	aws	big data	c++	collaboration	communication	prediction	couchdb	creativity	critical thinking	customer service	data manipulation	data wrangling	data mining	d3.js	decision making	decision tree	ecl	flare	google visualization api	hadoop	java	leadership	machine learning	matlab	microsoft excel	mining social media	modeling	perl	powerpoint	presentation	problem solving	python	r	raphael.js	risk modeling	sas	scripting languages	sql	statistics	tableau	a/b testing	data visualization
cin	Simulation Modeling & Methods	Building and using simulation models of complex static and dynamic, stochastic systems using both spreadsheets and high-level simulation software. Topics include generating random numbers, random variates, and random processes, modeling systems, simulating static models in spreadsheets, modeling complex dynamic stochastic systems with high-level commercial simulation software, basic input modeling and statistical analysis of terminating and steady-state simulation output, and managing simulation projects. Applications in complex queueing and inventory models representing real systems such as manufacturing, supply chains, healthcare, and service operations.	Core Courses		BANA 7030	Simulation Modeling & Methods Building and using simulation models of complex static and dynamic, stochastic systems using both spreadsheets and high-level simulation software. Topics include generating random numbers, random variates, and random processes, modeling systems, simulating static models in spreadsheets, modeling complex dynamic stochastic systems with high-level commercial simulation software, basic input modeling and statistical analysis of terminating and steady-state simulation output, and managing simulation projects. Applications in complex queueing and inventory models representing real systems such as manufacturing, supply chains, healthcare, and service operations.	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	TRUE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	TRUE	FALSE	FALSE	FALSE
norw	MSDS 454-DL : Advanced Modeling Techniques	Drawing upon previous course work in data science, this course build on earlier courses in analytics and modeling, providing a advanced review of traditional statistics and machine learning. It explores computer-intensive methods for parameter and error estimation, model selection, and model validation. Example topics include ordinary least squares regression, logistic regression, multinomial logistic regression, classification and regression trees, neural networks, support vector machines, naïve Bayes methods, principal components analysis, cluster analysis, and regularization techniques. Students work on individual and team assignments using open-source programming tools.Prerequisites: MSDS 411-DL Generalized Linear Models and MSDS 422-DL Practical Machine Learning.				MSDS 454-DL : Advanced Modeling Techniques Drawing upon previous course work in data science, this course build on earlier courses in analytics and modeling, providing a advanced review of traditional statistics and machine learning. It explores computer-intensive methods for parameter and error estimation, model selection, and model validation. Example topics include ordinary least squares regression, logistic regression, multinomial logistic regression, classification and regression trees, neural networks, support vector machines, naïve Bayes methods, principal components analysis, cluster analysis, and regularization techniques. Students work on individual and team assignments using open-source programming tools.Prerequisites: MSDS 411-DL Generalized Linear Models and MSDS 422-DL Practical Machine Learning.	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	TRUE	FALSE	FALSE	FALSE	TRUE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	TRUE	FALSE	FALSE	FALSE
umuc	Introduction to Statistics	Prerequisite: MATH 012, MATH 037, or an appropriate score on a placement test. An introduction to statistics. The objective is to assess the validity of statistical conclusions; organize, summarize, interpret, and present data using graphical and tabular representations; and apply principles of inferential statistics. Focus is on selecting and applying appropriate statistical tests and determining reasonable inferences and predictions from a set of data. Topics include methods of sampling; percentiles; concepts of probability; probability distributions; normal, t-, and chi-square distributions; confidence intervals; hypothesis testing of one and two means; proportions; binomial experiments; sample size calculations; correlation; regression; and analysis of variance (ANOVA). Students may receive credit for only one of the following courses: BEHS 202, BEHS 302, BMGT 230, ECON 321, GNST 201, MATH 111, MGMT 316, PSYC 200, SOCY 201, STAT 100, STAT 200, STAT 225, or STAT 230.		3 Credits	STAT 200	Introduction to Statistics Prerequisite: MATH 012, MATH 037, or an appropriate score on a placement test. An introduction to statistics. The objective is to assess the validity of statistical conclusions; organize, summarize, interpret, and present data using graphical and tabular representations; and apply principles of inferential statistics. Focus is on selecting and applying appropriate statistical tests and determining reasonable inferences and predictions from a set of data. Topics include methods of sampling; percentiles; concepts of probability; probability distributions; normal, t-, and chi-square distributions; confidence intervals; hypothesis testing of one and two means; proportions; binomial experiments; sample size calculations; correlation; regression; and analysis of variance (ANOVA). Students may receive credit for only one of the following courses: BEHS 202, BEHS 302, BMGT 230, ECON 321, GNST 201, MATH 111, MGMT 316, PSYC 200, SOCY 201, STAT 100, STAT 200, STAT 225, or STAT 230.	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	TRUE	FALSE	FALSE	FALSE

Convert to long form and join with `keywords` and `schools` data frames

As a final step before analysis, we converted the wide data set with all the skills in separate columns into a long data set with one row for each course/skill combination. We also joined the data set together with the keywords data set (to add the Skill Type variable) and the schoolsdata set (to add the complete school information).

To facilitate data analysis, we saved the final data set so we won’t need to run the web scraping applications again when beginning our analysis.

long.school.set <- full.school.set %>% 
  gather("Skill", "Appears", 8:length(full.school.set)) %>%
  inner_join(keywords) %>%
  select(-c(Synonyms, Other.notes)) %>%
  inner_join(schools)

## Joining, by = "Skill"

## Joining, by = "School.Code"

## Warning: Column `School.Code` joining character vector and factor, coercing
## into character vector

kable_styling(kable(head(long.school.set, 3), "html"), bootstrap_options = "striped")

School.Code	Title	Description	Type	Credits	Number	Title.Desc	Skill	Appears	Soft.or.technical	School	Degree	Degree.Name	Website
cin	Simulation Modeling & Methods	Building and using simulation models of complex static and dynamic, stochastic systems using both spreadsheets and high-level simulation software. Topics include generating random numbers, random variates, and random processes, modeling systems, simulating static models in spreadsheets, modeling complex dynamic stochastic systems with high-level commercial simulation software, basic input modeling and statistical analysis of terminating and steady-state simulation output, and managing simulation projects. Applications in complex queueing and inventory models representing real systems such as manufacturing, supply chains, healthcare, and service operations.	Core Courses		BANA 7030	Simulation Modeling & Methods Building and using simulation models of complex static and dynamic, stochastic systems using both spreadsheets and high-level simulation software. Topics include generating random numbers, random variates, and random processes, modeling systems, simulating static models in spreadsheets, modeling complex dynamic stochastic systems with high-level commercial simulation software, basic input modeling and statistical analysis of terminating and steady-state simulation output, and managing simulation projects. Applications in complex queueing and inventory models representing real systems such as manufacturing, supply chains, healthcare, and service operations.	algorithm	FALSE	technical	University of Cincinnati	MS	Business Analytics	http://business.uc.edu/graduate/msbana.html
norw	MSDS 454-DL : Advanced Modeling Techniques	Drawing upon previous course work in data science, this course build on earlier courses in analytics and modeling, providing a advanced review of traditional statistics and machine learning. It explores computer-intensive methods for parameter and error estimation, model selection, and model validation. Example topics include ordinary least squares regression, logistic regression, multinomial logistic regression, classification and regression trees, neural networks, support vector machines, naïve Bayes methods, principal components analysis, cluster analysis, and regularization techniques. Students work on individual and team assignments using open-source programming tools.Prerequisites: MSDS 411-DL Generalized Linear Models and MSDS 422-DL Practical Machine Learning.				MSDS 454-DL : Advanced Modeling Techniques Drawing upon previous course work in data science, this course build on earlier courses in analytics and modeling, providing a advanced review of traditional statistics and machine learning. It explores computer-intensive methods for parameter and error estimation, model selection, and model validation. Example topics include ordinary least squares regression, logistic regression, multinomial logistic regression, classification and regression trees, neural networks, support vector machines, naïve Bayes methods, principal components analysis, cluster analysis, and regularization techniques. Students work on individual and team assignments using open-source programming tools.Prerequisites: MSDS 411-DL Generalized Linear Models and MSDS 422-DL Practical Machine Learning.	algorithm	FALSE	technical	Northwestern University SPS	MS	Data Science	http://sps.northwestern.edu/program-areas/graduate/predictive-analytics/
umuc	Introduction to Statistics	Prerequisite: MATH 012, MATH 037, or an appropriate score on a placement test. An introduction to statistics. The objective is to assess the validity of statistical conclusions; organize, summarize, interpret, and present data using graphical and tabular representations; and apply principles of inferential statistics. Focus is on selecting and applying appropriate statistical tests and determining reasonable inferences and predictions from a set of data. Topics include methods of sampling; percentiles; concepts of probability; probability distributions; normal, t-, and chi-square distributions; confidence intervals; hypothesis testing of one and two means; proportions; binomial experiments; sample size calculations; correlation; regression; and analysis of variance (ANOVA). Students may receive credit for only one of the following courses: BEHS 202, BEHS 302, BMGT 230, ECON 321, GNST 201, MATH 111, MGMT 316, PSYC 200, SOCY 201, STAT 100, STAT 200, STAT 225, or STAT 230.		3 Credits	STAT 200	Introduction to Statistics Prerequisite: MATH 012, MATH 037, or an appropriate score on a placement test. An introduction to statistics. The objective is to assess the validity of statistical conclusions; organize, summarize, interpret, and present data using graphical and tabular representations; and apply principles of inferential statistics. Focus is on selecting and applying appropriate statistical tests and determining reasonable inferences and predictions from a set of data. Topics include methods of sampling; percentiles; concepts of probability; probability distributions; normal, t-, and chi-square distributions; confidence intervals; hypothesis testing of one and two means; proportions; binomial experiments; sample size calculations; correlation; regression; and analysis of variance (ANOVA). Students may receive credit for only one of the following courses: BEHS 202, BEHS 302, BMGT 230, ECON 321, GNST 201, MATH 111, MGMT 316, PSYC 200, SOCY 201, STAT 100, STAT 200, STAT 225, or STAT 230.	algorithm	FALSE	technical	University of Maryland University College	MS	Data Analytics	http://www.umuc.edu/academic-programs/masters-degrees/data-analytics.cfm

save(long.school.set, file="long_school_set.Rdata")

DATA 607 Project 3 - Part 1

Steve Tipton

March 25, 2018

Set up environment and load libraries

Institutions Included

From Steve

1. Michigan State University

2. University of Cincinnati

3. CUNY SPS

4. Northwestern University

5. Arizona State University

6. Purdue University

7. University of Maryland

8. Boston University Metropolitan College

9. North Carolina State University

Union to combine school data

From Youngkoung

1. NYU

2. University of Maryland University College

3. Duke University

4. Berkeley

5. Texas A&M University

Combine Shools

Combine Steve and Youngkoung Data Sets

Prepare list of key terms - from Heather G. and Raj K.

Compare key terms to title/course data

Convert to long form and join with `keywords` and `schools` data frames

DATA 607 Project 3 - Part 1

Steve Tipton

March 25, 2018

Set up environment and load libraries

Institutions Included

From Steve

1. Michigan State University

2. University of Cincinnati

3. CUNY SPS

4. Northwestern University

5. Arizona State University

6. Purdue University

7. University of Maryland

8. Boston University Metropolitan College

9. North Carolina State University

Union to combine school data

From Youngkoung

1. NYU

2. University of Maryland University College

3. Duke University

4. Berkeley

5. Texas A&M University

Combine Shools

Combine Steve and Youngkoung Data Sets

Prepare list of key terms - from Heather G. and Raj K.

Compare key terms to title/course data

Convert to long form and join with keywords and schools data frames

Convert to long form and join with `keywords` and `schools` data frames