Overview: This is a group project to answer the question “Which are the most valued data science skills?”

Load all the required packages.

library(tidyverse)

I searched the web for this question and reviewed 10 websites. I manually collected the skills they listed and saved them in a CSV file. Read the CSV file.

theFile <- "https://raw.githubusercontent.com/ferrysany/CUNY607P3/master/valuedSkills.csv"
skills <- read_csv(theFile, skip=2)

Tame and Tidy the CSV file

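# Largest number of whitespace characters in any Skill entry
a <- max(str_count(skills$Skill, "\\s"))

# Column names "1", "2", ..., one for each word of the longest entry
newCol <- as.character(seq_len(a + 1))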

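newSkills <- skills %>%
  # Split each Skill entry on whitespace, padding shorter entries with NA
  separate(Skill, into = newCol, sep = "\\s", fill = "right") %>%
  # Stack columns 2 to 9 into one word per row
  gather(key = "nCol", value = "nSkill", 2:9) %>%
  # Lower-case each word so that "SQL" and "sql" are counted together
  mutate(nSkill = str_to_lower(nSkill)) %>%
  # Remove the first stray space, comma, question mark or colon in each word
  mutate(nSkill = str_replace(nSkill, "[ ,?:]", "")) %>%
  # Drop the NA padding added by separate()
  filter(!is.na(nSkill)) %>%
  arrange(desc(nSkill))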
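
To make the split-and-stack step easier to follow, here is a minimal sketch on a toy table. The `toy` data, its `Site` column and the names `nWords`, `wordCols` and `toyWords` are made up for illustration and are not taken from the real CSV file; the sketch also stacks every column except `Site` rather than selecting columns by position.

```r
library(tidyverse)

# Hypothetical toy data standing in for the real CSV file
toy <- tibble(Site  = c("A", "B"),
              Skill = c("Machine Learning", "SQL, Python"))

# Number of words in the longest Skill entry
nWords   <- max(str_count(toy$Skill, "\\s")) + 1
wordCols <- as.character(seq_len(nWords))

toyWords <- toy %>%
  # One column per word
  separate(Skill, into = wordCols, sep = "\\s", fill = "right") %>%
  # One row per word; every column except Site is stacked
  gather(key = "nCol", value = "nSkill", -Site) %>%
  mutate(nSkill = str_to_lower(nSkill),
         nSkill = str_replace(nSkill, "[ ,?:]", "")) %>%
  filter(!is.na(nSkill))

toyWords
```

Each row of `toyWords` holds a single lower-case word, which is the shape needed for counting.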

gskills <- newSkills %>%
  group_by(nSkill)%>%
  summarise(n=n())%>%
  arrange(desc(n))

gskills
## # A tibble: 93 x 2
##    nSkill            n
##    <chr>         <int>
##  1 data             13
##  2 learning         11
##  3 programming       9
##  4 communication     7
##  5 machine           7
##  6 sql               7
##  7 and               6
##  8 &                 5
##  9 python            5
## 10 r                 5
## # … with 83 more rows
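
The table above still contains filler tokens such as “and” and “&”. One possible way to drop them before ranking is sketched below; the `fillers` vector is an assumed list, not part of the original analysis, and `toyCounts` is a made-up table with the same shape as `gskills`.

```r
library(tidyverse)

# Hypothetical word counts with the same columns as gskills
toyCounts <- tibble(nSkill = c("data", "and", "&", "sql"),
                    n      = c(13L, 6L, 5L, 7L))

# Assumed list of filler tokens; extend it as needed
fillers <- c("and", "&", "of", "the", "")

cleanCounts <- toyCounts %>%
  filter(!nSkill %in% fillers) %>%
  arrange(desc(n))

cleanCounts
```

Applying the same `filter()` to `gskills` would keep only the words that name a skill.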

Plot a bar chart of the 15 most frequent words

ggplot(data=(gskills[1:15,])) +
  geom_col(mapping = aes(x = fct_reorder(nSkill, n), y = n)) +
  coord_flip()

The top skills appear to be “Data” (including data analysis, data visualization, etc.), “Machine Learning”, “Programming”, “Communication” and “SQL”. (“machine” and “learning” are counted separately above, but a second review of the CSV file shows that they belong together as “machine learning”.)
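
Since “machine” and “learning” belong together, one option is to join the phrase into a single token before splitting on whitespace. The sketch below shows the idea on made-up entries; the `toy` data and the `machine_learning` token are assumptions for illustration, not part of the original analysis.

```r
library(tidyverse)

# Hypothetical Skill entries
toy <- tibble(Skill = c("Machine Learning", "Deep learning",
                        "machine learning, SQL"))

# Join the phrase with an underscore so the whitespace split keeps it whole
merged <- toy %>%
  mutate(Skill = str_to_lower(Skill),
         Skill = str_replace_all(Skill, "machine learning", "machine_learning"))

# Split into words, strip stray punctuation, then count
words <- merged$Skill %>%
  str_split("\\s") %>%
  unlist() %>%
  str_replace_all("[,?:]", "")

phraseCounts <- tibble(nSkill = words) %>%
  count(nSkill, sort = TRUE)

phraseCounts
```

With this approach “machine_learning” is counted as one skill, while “learning” on its own (for example in “deep learning”) is still counted separately.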