1 Introduction

According to Glassdoor, data scientist is ranked the number 1 profession in the US for 2017 and has held that position for three years running. The median base salary paid in the field is $110,000 and there are 4,524 job openings in the US alone. Job indicators point to aggressive data growth, as the increasing digitalization of all aspects of our lives produces vast amounts of data. In turn, private entities see abundant business opportunities and are investing heavily in the infrastructure needed to process that data. This has led to a proliferation of data science work opportunities, and demand is far outpacing the supply of data science talent. By 2018, the United States alone may face a 50% to 60% gap in supply versus demand. Worldwide, the dynamic job market and the demand for data science as a business driver are expected to cause a shortage of 190,000 data specialists by 2030.

2 Project Question

W. Edwards Deming said, “In God we trust, all others must bring data.” Please use data to answer the question, “Which are the most valued data science skills?” Consider your work as an exploration; there is not necessarily a “right answer.”

3 Group Members and Role

All members of this group played a critical role, whether actively or passively, in all aspects of project implementation. The active roles undertaken by each member are listed below:

Juanelle Marks - Group Leader/Analysis

Guang Qiu - Data Transforming, Tidying, Debugging/ Analysis

Amanda Arce - Data Sourcer, Tidying and Transforming/Analysis

Calvin Wong - Data Visualizer/Analysis

Vijaya Cherukuri - Data Visualizer/Analysis

Saayed Alam - Data Tidying and Transforming/Analysis

4 Project Implementation Approach

It is important to understand the concept of “Data Science” before exploring the “most valued skills” associated with this term. This was the first step undertaken by the group.

4.0.0.1 Tools for Communication and Collaboration

The group agreed to hold weekly meetings on Skype to discuss project progress. In practice, these meetings were held as the need arose. Communication was also conducted through a Slack group, which allowed for quick messaging and resource sharing (files, web links, etc.). Code sharing and collaboration were done using GitHub.

4.0.0.2 Analysis of Project Question:

We conducted meetings to break down and analyse the project requirements. At the very core, relevant data that would help answer the project question had to be sourced and gathered.

4.0.0.3 Data Sourcing and Gathering:

Each group member was tasked with researching possible data sources and sharing these with the group. Each data source shared was useful for gathering data to draw conclusions on the most valued data science skills. The most suitable sources were “Indeed.com”, “Kaggle.com” and “Glassdoor”. However, as a group, we were limited by a lack of members with sufficient web-scraping skills. After much deliberation, we decided on the Kaggle Survey 2017 as our chief data source for finding out what the most valued data science skills are.

4.0.0.4 Data Description

As part of the project requirements, our analysis is based on a Kaggle survey of individuals working in data science in varying capacities. The survey contained 16,716 responses from 171 countries and territories. The respondents were found through Kaggle channels such as email lists, discussion forums, and social media. The aim of this industry-wide survey is to identify the state of data science and machine learning across the globe. Given the wide interest in data science and the development of supporting technologies, the survey provides a comprehensive view of the demographics, compensation, data languages and skillsets of data scientists.

4.0.0.5 Data Composition

The schema of this dataset is defined in a schema file provided by Kaggle, alongside the multiple choice responses file:

https://www.kaggle.com/kaggle/kaggle-survey-2017#schema.csv

https://www.kaggle.com/kaggle/kaggle-survey-2017#multipleChoiceResponses.csv

Here is a summary:

Columns: 290, Rows: 16,716

Respondents’ answers to multiple choice and ranking questions. These questions are centered around demographics, skill, and compensation.

4.0.0.6 Tidying and Transforming

The dataset from the Kaggle survey was somewhat messy, so some tidying and transformation was needed. The process included subsetting and filtering the dataset, gathering and spreading, removing irrelevant rows based on conditions, renaming variables, and handling missing values. Tidying and transforming made it easier to conduct the desired analysis.
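The tidying steps listed above can be illustrated on a small scale. The sketch below is only an illustration under stated assumptions: the data frame `wide` is made up, and it uses `pivot_longer()` (the newer tidyr equivalent of `gather()`) to reshape, rename and drop blank answers.

```r
library(dplyr)
library(tidyr)

# Made-up wide data: one column per skill, "" marks a skipped question
wide <- data.frame(
  id = 1:3,
  JobSkillImportancePython = c("Necessary", "", "Nice to have"),
  JobSkillImportanceR = c("Nice to have", "Necessary", ""),
  stringsAsFactors = FALSE
)

long <- wide %>%
  # gather the skill columns into key/value pairs
  pivot_longer(-id, names_to = "Skill", values_to = "Rating") %>%
  # rename: strip the shared prefix from the skill names
  mutate(Skill = sub("JobSkillImportance", "", Skill)) %>%
  # treat empty strings as missing, then drop those rows
  mutate(Rating = na_if(Rating, "")) %>%
  drop_na(Rating)
```

The long layout has one row per respondent and skill, which is the shape that grouped summaries and ggplot2 expect.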

4.0.0.7 Analyzing and Visualizing

This step followed tidying and transforming. Various summary statistics and visualisations were generated to aid the drawing of conclusions on what the most valued data science skills are.

4.0.0.8 Project Implementation Goals

Our group will focus on two factors to answer the research question: first, how skills were ranked, and second, how salary was affected by skills.

4.0.0.8.1 Qualitative

Goal: To determine how respondents rank skill relevance

Our primary focus will be on how respondents replied to skill pertinence. We believe there is a direct correlation between skill relevance and perceived value of that skill. The dataset shows a large respondent demographic who reported freely on their skill usage. Therefore, we will analyze which of the skills garnered the highest ranking. We will also consider corporate influence because a respondent may have to use a particular skill due to work reasons. Therefore, when factoring respondent skill usage, there could be a strong element of influence between working respondents and non-working respondents.

4.0.0.8.2 Quantitative

Goal: To determine the economic viability of a skill

We believe that conducting an analysis based on salary can produce a definitive conclusion on which skill sets are most valuable to a data scientist. Salary tends to be a sizeable expense for most organizations and economic potential will draw individuals to develop those abilities. Therefore, a high salary or earning would indicate high-interest levels of those skillsets. We believe that individuals pursuing skills in data science would align themselves with the economic opportunities available in the labor market. Therefore, this analysis framework will evaluate skillset relevancies based on salary in this point-in-time snapshot.

5 Install libraries (options)

#install.packages('NLP')
#install.packages('slam')
#install.packages("tm", repos="http://R-Forge.R-project.org")
#install.packages("RSQLite")
#install.packages("wordcloud")
#install.packages('kableExtra')
#install.packages('gridExtra')
#install.packages('stats')
#install.packages('DT')
#install.packages('treemap')
#install.packages("stringr")
#install.packages('pandoc')

5.1 Libraries

library(tidyverse)
library(tm)
library(wordcloud)
library(knitr)
library(kableExtra)
library(reshape2)
library(RSQLite)
library(gridExtra)
library(stats)
library(DT)
library(ggplot2)
library(treemap)
library(stringr)

5.2 Load source data into dataframe

response <-  read.csv(file = "https://raw.githubusercontent.com/mandiemannz/Data-607--Fall-18/master/multipleChoiceResponses.csv", header= TRUE)
schema <- read.csv(file ="https://raw.githubusercontent.com/dikepower/607_DataAcquisition_And_Management/master/Project3/kaggle-survey-2017/schema.csv", header = TRUE)

6 Create SQL Database

6.1 Load dataframe into an RSQLite Database

#create connection to SQLite DB (stored in the file ML_Survey.sqlite in the working directory)
con <- dbConnect(RSQLite::SQLite(), "ML_Survey.sqlite")

#dbListTables(con)

#choose selected columns for the analysis based on the information in the schema
#schema dataframe info
#str(schema)

6.2 Subset the response data into df2

#col_index <- c(1:15, 38:47,71:82,134,168:172,208:211)
# Suggested columns for the analysis phase of the project, used to draw conclusions on the top data scientist skills and why
col_index <- c(1:4, 7, 9, 14, 37:49, 57, 66, 67, 76:77, 79:172, 197, 202:205, 207, 208)
df2 <- response[, col_index]
#names(df2) #extracted variables
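Positional indices such as `col_index` depend on the exact column order of the source file. As a hedged alternative, columns can be chosen by name. The sketch below uses a made-up data frame `toy` whose column names only imitate the survey's; it is not the subset used in this report.

```r
library(dplyr)

# Made-up data frame with survey-like column names (values are invented)
toy <- data.frame(
  GenderSelect = c("Male", "Female"),
  JobSkillImportancePython = c("Necessary", "Nice to have"),
  JobSkillImportanceR = c("Nice to have", "Necessary"),
  UnrelatedColumn = c("x", "y"),
  CompensationAmount = c("100,000", "85,000"),
  stringsAsFactors = FALSE
)

# Select by name and by shared prefix instead of by position
subset_by_name <- toy %>%
  select(GenderSelect, starts_with("JobSkillImportance"), CompensationAmount)
```

Selecting by name keeps working if columns are reordered upstream, at the cost of listing the names or prefixes explicitly.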

6.3 Load data into SQLite

dbWriteTable(con,  "MCR_Tb_Source", response, overwrite= T)
dbWriteTable(con,  "MCR_Tb", df2, overwrite= T)
dbWriteTable(con,  "MCR_Schema", schema, overwrite= T)
dbListTables(con)

## [1] "MCR_Schema"    "MCR_Tb"        "MCR_Tb_Source"
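The write-then-list round trip can be tried without touching the survey files. The sketch below is an illustration only: it uses a throwaway in-memory database (`":memory:"`) rather than a file, and the names `toy`, `con_demo` and `Toy_Tb` are made up for the example.

```r
library(DBI)

# Toy stand-in for the survey responses (made-up values, not survey data)
toy <- data.frame(
  GenderSelect = c("Male", "Female", "Male"),
  Age = c(24, 31, 45),
  stringsAsFactors = FALSE
)

# ":memory:" gives a temporary database that disappears on disconnect
con_demo <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con_demo, "Toy_Tb", toy, overwrite = TRUE)

tables <- dbListTables(con_demo)

# Pull back only the rows of interest with a SQL query
under_40 <- dbGetQuery(con_demo, "SELECT GenderSelect, Age FROM Toy_Tb WHERE Age < 40")

dbDisconnect(con_demo)
```

Filtering in SQL before reading into R can be handy when only part of a large table is needed.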

6.4 Display Database table fields

dbListFields(con, "MCR_Tb")

##   [1] "GenderSelect"                              
##   [2] "Country"                                   
##   [3] "Age"                                       
##   [4] "EmploymentStatus"                          
##   [5] "CodeWriter"                                
##   [6] "CurrentJobTitleSelect"                     
##   [7] "LanguageRecommendationSelect"              
##   [8] "JobSkillImportanceBigData"                 
##   [9] "JobSkillImportanceDegree"                  
##  [10] "JobSkillImportanceStats"                   
##  [11] "JobSkillImportanceEnterpriseTools"         
##  [12] "JobSkillImportancePython"                  
##  [13] "JobSkillImportanceR"                       
##  [14] "JobSkillImportanceSQL"                     
##  [15] "JobSkillImportanceKaggleRanking"           
##  [16] "JobSkillImportanceMOOC"                    
##  [17] "JobSkillImportanceVisualizations"          
##  [18] "JobSkillImportanceOtherSelect1"            
##  [19] "JobSkillImportanceOtherSelect2"            
##  [20] "JobSkillImportanceOtherSelect3"            
##  [21] "Tenure"                                    
##  [22] "MLSkillsSelect"                            
##  [23] "MLTechniquesSelect"                        
##  [24] "WorkHardwareSelect"                        
##  [25] "WorkDataTypeSelect"                        
##  [26] "WorkDatasetSize"                           
##  [27] "WorkAlgorithmsSelect"                      
##  [28] "WorkToolsSelect"                           
##  [29] "WorkToolsFrequencyAmazonML"                
##  [30] "WorkToolsFrequencyAWS"                     
##  [31] "WorkToolsFrequencyAngoss"                  
##  [32] "WorkToolsFrequencyC"                       
##  [33] "WorkToolsFrequencyCloudera"                
##  [34] "WorkToolsFrequencyDataRobot"               
##  [35] "WorkToolsFrequencyFlume"                   
##  [36] "WorkToolsFrequencyGCP"                     
##  [37] "WorkToolsFrequencyHadoop"                  
##  [38] "WorkToolsFrequencyIBMCognos"               
##  [39] "WorkToolsFrequencyIBMSPSSModeler"          
##  [40] "WorkToolsFrequencyIBMSPSSStatistics"       
##  [41] "WorkToolsFrequencyIBMWatson"               
##  [42] "WorkToolsFrequencyImpala"                  
##  [43] "WorkToolsFrequencyJava"                    
##  [44] "WorkToolsFrequencyJulia"                   
##  [45] "WorkToolsFrequencyJupyter"                 
##  [46] "WorkToolsFrequencyKNIMECommercial"         
##  [47] "WorkToolsFrequencyKNIMEFree"               
##  [48] "WorkToolsFrequencyMathematica"             
##  [49] "WorkToolsFrequencyMATLAB"                  
##  [50] "WorkToolsFrequencyAzure"                   
##  [51] "WorkToolsFrequencyExcel"                   
##  [52] "WorkToolsFrequencyMicrosoftRServer"        
##  [53] "WorkToolsFrequencyMicrosoftSQL"            
##  [54] "WorkToolsFrequencyMinitab"                 
##  [55] "WorkToolsFrequencyNoSQL"                   
##  [56] "WorkToolsFrequencyOracle"                  
##  [57] "WorkToolsFrequencyOrange"                  
##  [58] "WorkToolsFrequencyPerl"                    
##  [59] "WorkToolsFrequencyPython"                  
##  [60] "WorkToolsFrequencyQlik"                    
##  [61] "WorkToolsFrequencyR"                       
##  [62] "WorkToolsFrequencyRapidMinerCommercial"    
##  [63] "WorkToolsFrequencyRapidMinerFree"          
##  [64] "WorkToolsFrequencySalfrod"                 
##  [65] "WorkToolsFrequencySAPBusinessObjects"      
##  [66] "WorkToolsFrequencySASBase"                 
##  [67] "WorkToolsFrequencySASEnterprise"           
##  [68] "WorkToolsFrequencySASJMP"                  
##  [69] "WorkToolsFrequencySpark"                   
##  [70] "WorkToolsFrequencySQL"                     
##  [71] "WorkToolsFrequencyStan"                    
##  [72] "WorkToolsFrequencyStatistica"              
##  [73] "WorkToolsFrequencyTableau"                 
##  [74] "WorkToolsFrequencyTensorFlow"              
##  [75] "WorkToolsFrequencyTIBCO"                   
##  [76] "WorkToolsFrequencyUnix"                    
##  [77] "WorkToolsFrequencySelect1"                 
##  [78] "WorkToolsFrequencySelect2"                 
##  [79] "WorkFrequencySelect3"                      
##  [80] "WorkMethodsSelect"                         
##  [81] "WorkMethodsFrequencyA.B"                   
##  [82] "WorkMethodsFrequencyAssociationRules"      
##  [83] "WorkMethodsFrequencyBayesian"              
##  [84] "WorkMethodsFrequencyCNNs"                  
##  [85] "WorkMethodsFrequencyCollaborativeFiltering"
##  [86] "WorkMethodsFrequencyCross.Validation"      
##  [87] "WorkMethodsFrequencyDataVisualization"     
##  [88] "WorkMethodsFrequencyDecisionTrees"         
##  [89] "WorkMethodsFrequencyEnsembleMethods"       
##  [90] "WorkMethodsFrequencyEvolutionaryApproaches"
##  [91] "WorkMethodsFrequencyGANs"                  
##  [92] "WorkMethodsFrequencyGBM"                   
##  [93] "WorkMethodsFrequencyHMMs"                  
##  [94] "WorkMethodsFrequencyKNN"                   
##  [95] "WorkMethodsFrequencyLiftAnalysis"          
##  [96] "WorkMethodsFrequencyLogisticRegression"    
##  [97] "WorkMethodsFrequencyMLN"                   
##  [98] "WorkMethodsFrequencyNaiveBayes"            
##  [99] "WorkMethodsFrequencyNLP"                   
## [100] "WorkMethodsFrequencyNeuralNetworks"        
## [101] "WorkMethodsFrequencyPCA"                   
## [102] "WorkMethodsFrequencyPrescriptiveModeling"  
## [103] "WorkMethodsFrequencyRandomForests"         
## [104] "WorkMethodsFrequencyRecommenderSystems"    
## [105] "WorkMethodsFrequencyRNNs"                  
## [106] "WorkMethodsFrequencySegmentation"          
## [107] "WorkMethodsFrequencySimulation"            
## [108] "WorkMethodsFrequencySVMs"                  
## [109] "WorkMethodsFrequencyTextAnalysis"          
## [110] "WorkMethodsFrequencyTimeSeriesAnalysis"    
## [111] "WorkMethodsFrequencySelect1"               
## [112] "WorkMethodsFrequencySelect2"               
## [113] "WorkMethodsFrequencySelect3"               
## [114] "TimeGatheringData"                         
## [115] "TimeModelBuilding"                         
## [116] "TimeProduction"                            
## [117] "TimeVisualizing"                           
## [118] "TimeFindingInsights"                       
## [119] "TimeOtherSelect"                           
## [120] "WorkDataVisualizations"                    
## [121] "WorkDataStorage"                           
## [122] "WorkDataSharing"                           
## [123] "WorkDataSourcing"                          
## [124] "WorkCodeSharing"                           
## [125] "CompensationAmount"                        
## [126] "CompensationCurrency"

6.4.1 Load data from the SQLite database

df<- dbReadTable(con, "MCR_Tb")
dim(df) # dimension of new dataframe

## [1] 16716   126

#head(display_cols, 3)

7 Analyses

7.0.1 Analysis 1

7.0.1.1 Demographics

It is apparent that technology is one of the key drivers of social and economic change. However, women remain strongly under-represented in technology. This holds for the field of Data Science in the Kaggle survey as well: the Distribution of Gender graph shows four times as many male respondents as female respondents. This may reflect lower participation by women in the field.

The Distribution of Age graph is strongly skewed towards younger respondents. This could be because Data Science is a relatively young field. An overwhelming number of respondents are between 20 and 35 years old. The graph peaks at 24 years of age, with nearly 1,000 respondents.

We can see that most respondents are employed; they account for more than half of all respondents. This suggests that Data Science is a stable career, even when factoring in respondents who reside in less developed countries.

#Calculates the count and percentage of each answer in a column
percentage <- function(question, filteredData = df){
  filteredData %>% 
    drop_na(all_of(question))  %>%  # Remove NA values (blank strings are kept)
    group_by(!! (sym(question))) %>% 
    summarise(count = n()) %>% # Count
    mutate(percent = round((count / sum(count)) * 100, digits=2)) %>% # calculates percentage
    arrange(desc(count))
}
#Apply percentage function to Gender
gender <- percentage("GenderSelect") %>% filter(GenderSelect == "Male" | GenderSelect == "Female" )

#Plot gender
p1 <- ggplot(gender,aes(x= reorder(GenderSelect, -percent), y= percent, fill=GenderSelect)) + 
  geom_bar(stat="identity", width=.5) +
  labs(x="Gender",y="Percent",title="Distribution of Gender") + 
  theme(legend.position="none") + 
  theme(plot.title = element_text(hjust = 0.5))

#Convert Age from char to num
df$Age <- as.numeric(as.character(df$Age))

#Remove missing age values (Age is numeric after the conversion, so test for NA)
agedf <- df %>% 
  filter(!is.na(Age)) %>% 
  select(Age)

#Plot age
p2 <- ggplot(agedf, aes(Age)) + 
  geom_bar(fill = "#FF6666") + 
  labs(x="Age",y="Respondents",title="Distribution of Age") +   
  theme(legend.position="none") + 
  theme(plot.title = element_text(hjust = 0.5))

#Remove null employment values
employmentdf <- df %>% 
  filter(!EmploymentStatus == "") %>% 
  select(EmploymentStatus)

#Plot employment status
p3 <- ggplot(employmentdf, aes(EmploymentStatus)) + 
  labs(x="Category",y="Number of Respondents",title="Employment Status") +
  geom_bar(fill = "#00BFC4") +
  coord_flip()

grid.arrange(arrangeGrob(p1,p2, ncol=2, nrow=1), arrangeGrob(p3, nrow=2), heights=c(1,2))

Figure 1: Basic Information of Respondents
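With `read.csv`'s default settings, blank answers in text columns arrive as empty strings rather than `NA`, so a percentage helper may want to drop both. The sketch below is a variant under that assumption, not the function used for the figure: `percent_of` and `toy` are made-up names, and `across(all_of(...))` is used so the column can be passed as a string.

```r
library(dplyr)

# Count and percentage of each non-blank answer in the column named by `question`
percent_of <- function(data, question) {
  data %>%
    filter(!is.na(.data[[question]]), .data[[question]] != "") %>%
    group_by(across(all_of(question))) %>%
    summarise(count = n(), .groups = "drop") %>%
    mutate(percent = round(count / sum(count) * 100, digits = 2)) %>%
    arrange(desc(count))
}

# Made-up answers, including one blank and one NA
toy <- data.frame(
  GenderSelect = c("Male", "Male", "Male", "Female", "", NA),
  stringsAsFactors = FALSE
)

gender_pct <- percent_of(toy, "GenderSelect")
```

Here the percentages are taken over respondents who answered the question, which is a different denominator from one that keeps blank answers.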

#Combine

7.0.2 Analysis 2

7.0.2.1 Job Skills

The dataset allowed for interesting insights into how respondents assigned value to their skillsets. We analyzed how respondents viewed job skill importance versus which skills respondents recommended to aspiring data scientists. The Job Skill Importance graph was developed by subsetting ten different columns and summing a score for each. We applied a scoring method to the answers so that the skills could be ranked against each other. The answers were scored as follows:

Category	Points
Necessary	2
Nice to have	1
Unnecessary	-1
Null	0

This method prioritizes respondents who ranked a certain skill “Necessary” and gives less weight to “Nice to have” and “Unnecessary” responses. As expected, the two main data science languages, R and Python, did not fare too differently from each other in a job setting. We also determined that statistics, big data, and visualization are important job skill requirements.

However, the picture changes quite drastically when respondents are asked which language they recommend. More than twice as many respondents recommended Python as recommended R. Other languages such as SQL, Matlab and Julia ranked much lower and were negligible for our analysis.
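The scoring scheme in the table can also be written as a lookup. The sketch below is a minimal vectorized version under the same point values; the names `points`, `score_answers` and the sample `answers` are made up for illustration.

```r
# Points per answer, as in the table; anything else (blank, NA) scores 0
points <- c("Necessary" = 2, "Nice to have" = 1, "Unnecessary" = -1)

score_answers <- function(answers) {
  # Look up each answer by name; unmatched answers come back as NA
  scores <- unname(points[as.character(answers)])
  scores[is.na(scores)] <- 0
  scores
}

answers <- c("Necessary", "Nice to have", "", "Unnecessary", NA, "Necessary")
scores <- score_answers(answers)
total <- sum(scores)
```

Summing the scores for a column gives that skill's total, which is the quantity the bars in the Job Skill Importance graph are meant to compare.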

df_target <- df
# Tidy JobSkillImportance* columns based on categories

# Map one survey answer to its point value
JobImportance <- function(skill) {
  skill <- as.character(skill)
  if (skill == "Necessary") {
    return(2)
  } else if (skill == "Nice to have") {
    return(1)
  } else if (skill == "Unnecessary") {
    return(-1)
  } else {
    return(0)
  }
}

df_target$JobSkillImportanceBigData <- sapply(df_target$JobSkillImportanceBigData, JobImportance)
df_target$JobSkillImportanceDegree <- sapply(df_target$JobSkillImportanceDegree, JobImportance)
df_target$JobSkillImportanceStats <- sapply(df_target$JobSkillImportanceStats, JobImportance)
df_target$JobSkillImportanceEnterpriseTools <- sapply(df_target$JobSkillImportanceEnterpriseTools, JobImportance)
df_target$JobSkillImportancePython <- sapply(df_target$JobSkillImportancePython, JobImportance)
df_target$JobSkillImportanceR <- sapply(df_target$JobSkillImportanceR, JobImportance)
df_target$JobSkillImportanceSQL <- sapply(df_target$JobSkillImportanceSQL, JobImportance)
df_target$JobSkillImportanceKaggleRanking <- sapply(df_target$JobSkillImportanceKaggleRanking, JobImportance)
df_target$JobSkillImportanceMOOC <- sapply(df_target$JobSkillImportanceMOOC, JobImportance)
df_target$JobSkillImportanceVisualizations <- sapply(df_target$JobSkillImportanceVisualizations, JobImportance)

# Tidy gender variable

# Recode gender to "F", "M" or "NA" (any other answer)
GenderClean <- function(gender) {
  gender <- as.character(gender)
  if (gender == "Female") {
    return("F")
  } else if (gender == "Male") {
    return("M")
  } else {
    return("NA")
  }
}
df_target$GenderSelect <- sapply(df_target$GenderSelect, GenderClean)
df_target$GenderSelect <- factor(df_target$GenderSelect)

# Tidy employee status

# Collapse the employment answers into shorter category labels
EmployeeClean <- function(emp) {
  emp <- as.character(emp)
  if (emp == "Independent contractor, freelancer, or self-employed") {
    return("Self-employed")
  } else if (emp == "Not employed, and not looking for work" | emp == "Not employed, but looking for work") {
    return("Unemployed")
  } else if (emp == "I prefer not to say") {
    return("NA")
  } else if (emp == "Employed full-time") {
    return("Full_Time")
  } else if (emp == "Employed part-time") {
    return("Part_Time")
  } else {
    return(emp)
  }
}
df_target$EmploymentStatus <- sapply(df_target$EmploymentStatus, EmployeeClean)
df_target$EmploymentStatus <- factor(df_target$EmploymentStatus)

# Tidy column name

JobskillImportance <- df_target %>% 
    select(starts_with("JobSkillImportance")) %>% 
    select(c(1:10))  # keep the ten named skills, drop the three OtherSelect columns
JobskillImportance <- tibble::rowid_to_column(JobskillImportance, "ID") 
names(JobskillImportance) <- sub("JobSkillImportance", "", names(JobskillImportance))

#Visualization of JobSkillImportance

# Columns 2:11 hold the ten skill scores (column 1 is ID)
gather_p4 <- JobskillImportance %>% gather(JobSkill, Rate, 2:11) %>% select(ID, JobSkill, Rate)
 
# Each bar is the sum of the scores for one skill (fun = sum needs ggplot2 3.3.0 or later)
p4 <- ggplot(gather_p4, aes(JobSkill, Rate)) + 
          stat_summary(fun = sum, geom = "bar", aes(fill = JobSkill), position = position_stack(reverse = TRUE)) +   
          guides(fill = "none") +
          theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
          labs(x = "Job Skill", y = "Total Score", title = "Job Skill Importance")

# Tidy language recommended: blank and "Other" answers become "Not specified"
LanguageClean <- function(skill) {
  skill <- as.character(skill)
  if (skill == "" || skill == "Other") {
    return("Not specified")
  } else {
    return(skill)
  }
}

df_target$LanguageRecommendationSelect <- sapply(df_target$LanguageRecommendationSelect, LanguageClean)

# Visualization
plot5 <- df_target %>% 
        group_by(LanguageRecommendationSelect,GenderSelect) %>% 
        summarise(n=n()) %>% 
        arrange(desc(n)) 

p5 <- plot5 %>% ggplot(aes(x=reorder(LanguageRecommendationSelect,-n),n)) +
        geom_bar(aes(),stat = 'identity', fill = "#00BFC4") + 
        theme_minimal() +
        theme(axis.text.x = element_text(angle = 60, hjust = 1)) + 
        labs(x="Recommended Language",y="# of Respondents",title="Recommended Languages")      

grid.arrange(p4, p5, ncol=1)

***

7.0.3 Analysis 3 - Part A

7.0.3.1 Comparison of learners’ and code writers’ responses

Learners were given a non-exhaustive list of key data science skills. Below is a table that shows how each skill was ranked by learners. This gives us some insight into how important learners perceived each of these skills to be.

7.0.3.1.0.1 Learners’ responses on rank of job skill importance

library(dplyr)
rank_by_learners<- select(df2, starts_with("JobSkillImportance")) #select those columns in the dataset where learners ranked the importance of having certain data science skills.
#rank_by_learners

#Create tables to show count of each skill
bigdata<-table(rank_by_learners$JobSkillImportanceBigData)
stats<-table(rank_by_learners$JobSkillImportanceStats)
enterpriseTools<-table(rank_by_learners$JobSkillImportanceEnterpriseTools)
python<-table(rank_by_learners$JobSkillImportancePython)
R<-table(rank_by_learners$JobSkillImportanceR)
sql<-table(rank_by_learners$JobSkillImportanceSQL)
visualisations<-table(rank_by_learners$JobSkillImportanceVisualizations)

Skills<-as.data.frame(rbind(bigdata,stats,enterpriseTools,python,R,sql,visualisations))# combine tables
Skills<-Skills%>%rownames_to_column("Skill_Name") # change row names to column

colnames(Skills)[colnames(Skills)=="V1"] <- "Missing_Values"  # rename the unnamed (blank answer) column

Skills

7.0.3.2 Visualizing this information

Now that we have tabulated the skills, we will view a visualisation of this information to gain a clearer picture.

#Use a bar plot to visualise the data. Since this was displayed earlier, it will not be shown now.
Skills_df <- Skills %>% gather(key = Importance, value = Count, c(3:4))
#Skills
g <- ggplot(Skills_df, aes(x = Skill_Name, y = Count)) +
  geom_bar(stat = "identity", aes(fill = Importance), position = position_stack(reverse = TRUE)) +
  coord_flip()
#g

> The diagram shows that the learners’ group in the survey ranked “Python” as the most important skill needed to get a data scientist job.

7.0.3.2.0.1 Code writers’ responses on frequency of skill use on the job: frequently used data science/analytic tools, technologies and languages over the past year

Another group in the survey was labelled coding workers. These were persons who write code on a regular basis. They were given a list of tools used by data scientists and were asked to indicate how frequently they use these tools on the job.
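The visualisation code that follows reads from a table named `Frequent_tools`, whose construction is not shown in this report. As a sketch only, a table of that shape (one row per tool, one column per frequency level) could be built as below. The data frame `toy`, the name `frequent_tools_demo`, and every frequency label other than “Most of the time” are assumptions made for illustration.

```r
library(dplyr)
library(tidyr)

# Made-up responses; the real columns are the WorkToolsFrequency* fields
toy <- data.frame(
  WorkToolsFrequencyPython = c("Most of the time", "Often", "Most of the time", ""),
  WorkToolsFrequencyR      = c("Often", "Most of the time", "", "Rarely"),
  WorkToolsFrequencySQL    = c("", "Sometimes", "Most of the time", "Often"),
  stringsAsFactors = FALSE
)

frequent_tools_demo <- toy %>%
  # one row per respondent and tool
  pivot_longer(everything(), names_to = "Tool", values_to = "Frequency") %>%
  # drop tools the respondent did not rate
  filter(Frequency != "") %>%
  mutate(Tool = sub("WorkToolsFrequency", "", Tool)) %>%
  # count respondents per tool and frequency level
  count(Tool, Frequency) %>%
  # one column per frequency level, zero where nobody chose it
  pivot_wider(names_from = Frequency, values_from = n, values_fill = 0)
```

A table like this can then be subset to the `Tool` and `Most of the time` columns, which is the shape the lollipop chart and treemap expect.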

7.0.3.3 Visualising responses by code writers on most frequently used tool

7.0.3.3.0.1 Visualization One

tool_frequency<-select(Frequent_tools,"Tool","Most of the time")# subset the variable Most of the time
names(tool_frequency)<-c("Tool", "Most_Time") #rename column two

tool_frequency1<-tool_frequency %>%
  arrange(Most_Time) %>%
  #mutate(CTool=factor(Tool, Tool)) %>%
  ggplot( aes(x=Tool, y=Most_Time) ) +
    geom_segment( aes(x=Tool ,xend=Tool, y=0, yend=Most_Time), color="red") +
    geom_point(size=2, color="#69b3a2") +
    coord_flip() +
    theme(
      panel.grid.minor.y = element_blank(),
      panel.grid.major.y = element_blank(),
      legend.position="none"
    ) +
    xlab("")

7.0.3.3.0.2 Visualization Two

library(treemap)

# Plot
tool_frequency2<-treemap(tool_frequency,
            
            # data
            index="Tool",
            vSize="Most_Time",
            type="index",
            
            # Main
            title="",
            palette="Dark2",

            # Borders:
            border.col=c("black"),             
            border.lwds=1,                         
        
            # Labels
            fontsize.labels=0.5,
            fontcolor.labels="white",
            fontface.labels=1,            
            bg.labels=c("transparent"),              
            align.labels=c("left", "top"),                                  
            overlap.labels=0.5,
            inflate.labels=T                        # If true, labels are bigger when rectangle is bigger.

            
            )

Both the lollipop chart and the treemap show that Python was ranked as the most frequently used tool by the code writers in the survey.

7.0.3.3.1 Summary of Analysis 3

It is interesting to note that both learners and code writers in the survey identified knowledge of and competency in the Python programming language as a very valuable data scientist skill. From their responses we can conclude that Python is the most valued skill.

However, this dataset also reveals some other skills that are very valuable for data scientists to have. Two of these are examined in Analysis 3 - Part B.

7.0.4 Analysis 3 - Part B

This part examines a list of data science work methods to reveal how frequently code writers in the dataset used these methods on the job. This is important to find out, as it indicates which valued methods should be part of a data scientist’s skillset.

7.0.4.0.1 Frequently used data science methods

Code writers in the survey were asked the frequency with which they used the various data science methods. The treemap below gives a visualisation of how each method ranked.

7.0.4.1 Visualization of code writers’ responses on work frequency of data science methods

> The treemap shows that data visualisation was the most often used data science method. It can thus be said that, among the most valued skills data scientists should have, being able to produce data visualisations is extremely critical.

7.0.5 Analysis 3 - Part C

Finally, code writers were asked to indicate the average amount of time that is devoted to aspects of their data science work. Below is a summary statistic and visualisation of their responses.

rank_by_time<- select(df2, starts_with("Time"))
rank_by_time2<-rank_by_time %>% gather(key = Activity, value = Hours, c(1:5))%>% select (Activity, Hours)  
rank_by_time2 <- subset(rank_by_time2, (!is.na(rank_by_time2['Hours'])))
#summary(rank_by_time)

p <- ggplot(rank_by_time2, aes(Activity, Hours))
# fill is set outside aes() so the bars are actually drawn in red
p + geom_bar(stat = "identity", position = position_dodge(), fill = "red") + xlab('Tasks') + ylab('Average Hours')

The visualisation above shows that data scientists devoted most of their hours to finding insights.
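Because `geom_bar(stat = "identity")` draws a bar for every row rather than an average, a per-activity mean has to be computed explicitly if that is the quantity of interest. The sketch below shows one way to do so on made-up numbers; `toy_time` and `avg_time` are hypothetical names, and the values are not survey results.

```r
library(dplyr)
library(tidyr)

# Made-up time values for three respondents (NA = question skipped)
toy_time <- data.frame(
  TimeGatheringData = c(40, 20, NA),
  TimeModelBuilding = c(20, 30, 10),
  TimeVisualizing   = c(10, 10, 20)
)

avg_time <- toy_time %>%
  pivot_longer(everything(), names_to = "Activity", values_to = "Value") %>%
  group_by(Activity) %>%
  # average over respondents who answered, ignoring missing values
  summarise(Average = mean(Value, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(Average))
```

A summary table like `avg_time` can be passed to `ggplot()` with `geom_col()` to draw one bar per activity.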

7.0.6 Conclusion based on Analysis 3

Based on all the analyses conducted for Analysis 3, it can be concluded that Python, data visualisation and finding insights are among the most valued skills that data scientists should possess.

#library(DT)
#skills_US<-filter(dataskills, Country=="United States") ##select observations where country is United states only
#names(skills_US)
#datatable(skills_US, options = list(pageLength = 15))
#head(skills_US,5)

7.0.7 Text mining

Use text mining (TM) to extract count of words using a corpus. Text Mining package also filters out “stop words” - words that don’t have value (this, is, and), numbers, and other unnecessary words that don’t add value (as defined by us).

# Install
#install.packages("tm")  # for text mining
#install.packages("SnowballC") # for text stemming
#install.packages("wordcloud") # word-cloud generator 
#install.packages("RColorBrewer") # color palettes
# Load
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")

7.1 Create corpus from data

review_text <- paste(df$LanguageRecommendationSelect, collapse=" ")
review_source <- VectorSource(review_text)
corpus <- Corpus(review_source)

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removeWords, c("important", "kaggle", "somewhat", "useful", "yes", "etc", "often", "enough", "courses", "non", "nice", "laptop", "coursera", "year", "udemy", "run", "youtube", "socrata", "workstation", "online", "edx", "sometimes", "employed", "logistic", "male", "necessary", "company", "increased"))
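# Note: by default tm keeps only terms of three or more characters
# (wordLengths = c(3, Inf)), so the one-letter term "r" is not counted below.
dtm <- DocumentTermMatrix(corpus)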
dtm2 <- as.matrix(dtm)
frequency <- colSums(dtm2)
frequency <- sort(frequency, decreasing=T)
frequency

##  python     sql  matlab    java   scala     sas   julia   stata haskell 
##    6941     385     238     138      94      88      30      28      17
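By default, tm's document-term matrix keeps only terms of at least three characters, which is why a one-letter language name such as R cannot appear in the frequencies above. As a cross-check that keeps short terms, the same kind of count can be sketched in base R. The `answers` vector below is made up and stands in for the recommendation column.

```r
# Made-up answers standing in for LanguageRecommendationSelect
answers <- c("Python", "R", "Python", "SQL", "R", "Python", "")

# Lower-case, split on whitespace, and drop empty tokens
tokens <- unlist(strsplit(tolower(paste(answers, collapse = " ")), "\\s+"))
tokens <- tokens[tokens != ""]

# Frequency of every token, including one-letter terms such as "r"
token_counts <- sort(table(tokens), decreasing = TRUE)
```

Within tm itself, the minimum term length is controlled by the `wordLengths` entry of the `control` list passed to `DocumentTermMatrix()`.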

#table <- head(frequency, 20)

7.2 Wordcloud of top words from within our dataset

Wordclouds give a quick and easy display of our top words. This allows us to quickly see which words are among the top for data science skills.

8 Build a term-document matrix

dtm <- TermDocumentMatrix(corpus)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)

9 Generating a Word cloud

set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

After running this analysis, we determined that Python is the most recommended language, which coincides with our earlier analysis.

10 Conclusion

The benefits of addressing the gender gap and encouraging more women to enter technology are clear. The economic ramifications of including more women in the data science sector are significant. According to research by the European Commission, encouraging women to fill more roles in technology would raise the EU’s GDP by 9 billion euros a year.

In an analysis of survey data from more than 330,000 U.S. employees and candidates, research from the analytics platform Visier found that Silicon Valley appears to be ageist in its hiring practices. In technology, older workers are more likely to be passed over than their younger counterparts. It makes sense, then, that the average tech worker is under 40 years of age. Visier found that the average tech worker is 38 years old, five years younger than the average worker in other industries.

Data is the new corporate currency, as advancing digitization sweeps every horizontal and vertical market throughout the world. The impact on the data science sector is far-reaching and, as a result, a range of new roles and skill sets are in demand. The long-term career potential is high even in less developed countries.

Although respondents rated Python and R nearly equally in a job setting, a majority of respondents recommended Python as an important skill to have for the future. Twice as many respondents recommended Python as recommended R. We believe this divide may not be noticeable within the corporate world but is very apparent at the user level.

10.1 Lessons learned

Will be provided during the oral presentation.

10.2 Cited works

https://insidebigdata.com/2018/08/19/infographic-data-scientist-shortage/

https://www.kaggle.com/mhajabri/what-do-kagglers-say-about-data-science?scriptVersionId=2658373

https://www.weforum.org/agenda/2018/01/close-the-tech-gender-gap-gillian-tans/

https://www.forbes.com/sites/louiscolumbus/2018/01/29/data-scientist-is-the-best-job-in-america-according-glassdoors-2018-rankings/#4bf3149c5535

https://www.theladders.com/career-advice/older-tech-workers-fear-age-discrimination

CUNY MSDS DATA 607 Project

Amanda Arce, Juanelle Marks, Vijaya Cherukuri, Guang Qiu, Calvin Wong, Saayed Alam

October 1, 2018