Overview: This is a group project to answer the question “Which are the most valued data science skills?”
Load all the required packages.
library(tidyverse)
I googled the question and looked at 10 websites. I manually collected the data and put it in a CSV file. Read the CSV file:
theFile <- "https://raw.githubusercontent.com/ferrysany/CUNY607P3/master/valuedSkills.csv"
skills <- read_csv(theFile, skip=2)
Tame and Tidy the CSV file
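# Largest number of whitespace characters in any Skill entry; a + 1 is the maximum word count
a <- max(str_count(skills$Skill, "\\s"))
# Column names "1", "2", ..., one per possible word position
newCol <- as.character(seq_len(a + 1))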
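# Split each Skill entry into one column per word, then stack columns 2 to 9 into long format
newSkills <- skills %>%
  separate(Skill, into = newCol, sep = "\\s", fill = "right") %>%
  gather(key = "nCol", value = "nSkill", 2:9) %>%
  mutate(nSkill = str_to_lower(nSkill)) %>%               # normalise case
  mutate(nSkill = str_replace(nSkill, "[ ,?:]", "")) %>%  # drop the first space, comma, "?" or ":" in each word
  filter(!is.na(nSkill)) %>%
  arrange(desc(nSkill))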
# Count how often each word occurs, most frequent first
gskills <- newSkills %>%
  group_by(nSkill) %>%
  summarise(n = n()) %>%
  arrange(desc(n))
gskills
## # A tibble: 93 x 2
##    nSkill            n
##    <chr>         <int>
##  1 data             13
##  2 learning         11
##  3 programming       9
##  4 communication     7
##  5 machine           7
##  6 sql               7
##  7 and               6
##  8 &                 5
##  9 python            5
## 10 r                 5
## # … with 83 more rows
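The split-and-count steps above can be illustrated on a small made-up table. This is only a sketch, not the pipeline that produced the results: `toy` and `toy_counts` are hypothetical names, the rows are invented, and `separate_rows()` with `count()` is swapped in for the `separate()`/`gather()`/`group_by()` steps, assuming one word per row is the goal.

```r
library(tidyverse)

# Hypothetical toy data standing in for the real CSV
toy <- tibble(
  Source = c("site A", "site B", "site C", "site D"),
  Skill  = c("Machine Learning", "Data Visualization", "SQL, Python", "Data Analysis")
)

toy_counts <- toy %>%
  separate_rows(Skill, sep = "\\s+") %>%        # one row per whitespace-separated word
  mutate(nSkill = str_to_lower(Skill),          # normalise case
         nSkill = str_remove_all(nSkill, "[,?:]")) %>%  # strip every comma, "?" and ":"
  filter(!is.na(nSkill), nSkill != "") %>%      # drop missing and empty words
  count(nSkill, sort = TRUE)                    # frequency table, most frequent first

toy_counts
```

Because the number of words per entry never has to be computed, this variant avoids the hard-coded column range, at the cost of losing the word-position column.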
Plot a bar chart
# Horizontal bar chart of the 15 most frequent words
ggplot(data = gskills[1:15, ]) +
  geom_col(mapping = aes(x = fct_reorder(nSkill, n), y = n)) +
  coord_flip()
The top skills are “Data” (including data analysis, data visualization, etc.), “Machine Learning”, “Programming”, “Communication” and “SQL”. A second look at the CSV file shows that “machine” and “learning” belong together as “machine learning”; the word-by-word split counted them as separate tokens.
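Since “machine” and “learning” were split apart, and filler tokens such as “and” and “&” made it into the top ten, one possible refinement is to join known phrases before splitting and to drop fillers afterwards. The sketch below runs on invented rows; `raw`, `fillers` and `phrase_counts` are hypothetical names, and the filler list is an assumption to be extended as needed.

```r
library(tidyverse)

# Hypothetical raw entries; the real values live in valuedSkills.csv
raw <- tibble(Skill = c("Machine Learning",
                        "machine learning & statistics",
                        "Programming and SQL",
                        "Data Analysis"))

# Assumed list of filler tokens that are not skills
fillers <- c("and", "&")

phrase_counts <- raw %>%
  mutate(Skill = str_to_lower(Skill),
         # Join the two-word phrase so it survives the whitespace split
         Skill = str_replace_all(Skill, "machine\\s+learning", "machine_learning")) %>%
  separate_rows(Skill, sep = "\\s+") %>%             # one row per token
  filter(Skill != "", !Skill %in% fillers) %>%       # drop empty and filler tokens
  count(Skill, sort = TRUE)                          # most frequent first

phrase_counts
```

The same idea would extend to other multi-word skills (for example “data visualization”) by adding one replacement per phrase.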