Abstract

For Project 3, we decided to conduct a survey and analyze the responses to identify the soft skills and technical skills a data scientist needs. We also wanted to find out which programming language to learn first and which programming language is necessary to be a successful data scientist.

Data Source

Context

Kaggle already provides a broad survey on data science and machine learning with more than 16,000 respondents, but a data scientist also needs to know how to conduct a survey in order to generate original datasets.

Methodology

This survey received 72 responses from around the world. We excluded respondents who answered “No” to the question asking whether they are IT professionals, because their responses could affect the overall analysis. This was the first required question in the survey, and these respondents did not proceed further: they were redirected to the end of the survey.

The survey was open from October 17 to October 20. Most of the questions were optional, so respondents could provide as much information as they wished and were not forced to select unnecessary or incorrect options.

Platform used

This survey was conducted using Google Forms. Most of the respondents came from the LinkedIn groups “Data Scientist & Analyst” and “Database Administrator (+5000 DBAs from around the world) Join now!”, and from the “Datacamp Community” Slack channel.

Environment Prep

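# Load each package; if it is missing, install it and then load it
if (!require('ggplot2')) { install.packages('ggplot2'); library('ggplot2') }
if (!require('car')) { install.packages('car'); library('car') }
if (!require('rlist')) { install.packages('rlist'); library('rlist') }
if (!require('tidyverse')) { install.packages('tidyverse'); library('tidyverse') }
if (!require('likert')) { install.packages('likert'); library('likert') }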

Importing data

# Import survey data
rawData <- read.csv("https://raw.githubusercontent.com/Vinayak234/SPS_DATA_607/master/SPS_DATA_607/Project_3/To%20find%20Most%20Valued%20Data%20Science%20Skills.csv", stringsAsFactors = FALSE, header = TRUE, na.strings = c("", "NA"))

rawData

Clean data

# Drop records from respondents who are not IT professionals
clean_rawData<- rawData[!grepl("No", rawData$Are.you.related.to.IT.Profession.),]

# Drop columns that are not needed
drops_c <- c("Timestamp","Are.you.related.to.IT.Profession.")
clean_rawData<-clean_rawData[ , !(names(clean_rawData) %in% drops_c)]

# Short names for the columns
Short_name <- c("Gender","Country","Age","Are you in School","Eduction","Major","Are you learning data science skill","Emplyment Status", "Title", "Experience", "Blogs", "College","Projects", "Online Course", "Friends", "Co-workers", "Youtube","Textbooks","First language","must language","Amazon Machine Learning","Big Data","College Degree","Data Visualizations","Enterprise Tools","Google Cloud","Hadoop/Hive/Pig","IBM SPSS","Java","Microsoft Excel","NoSQL","Oracle Data Mining","Python","R","Relational data","SAS","SQL","Tableau","Intellectual curiosity","Business acumen", "Communication skills", "Teamwork", "Collaboration", "Creative Thinking","Problem Solving", "Active Learning","Perceptiveness","Interpersonal Skills","Generating Hypotheses")

colnames(clean_rawData) <- Short_name
head(clean_rawData)
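
The exclusion step above depends on how the answer "No" is matched. The following is a minimal sketch, using hypothetical answers that are not from the survey, comparing substring matching with `grepl()` against exact matching with `%in%`. The vector `answers` and the other names are assumptions made for illustration.

```r
# Hypothetical answers, for illustration only (not survey data)
answers <- c("Yes", "No", "Not sure", NA, "Yes")

# Substring match: flags any answer containing "No", including "Not sure"
substring_hits <- grepl("No", answers)

# Exact match: flags only the literal answer "No"; NA counts as no match
exact_hits <- answers %in% "No"

# Answers that survive each kind of filter
kept_substring <- answers[!substring_hits]
kept_exact <- answers[!exact_hits]

print(kept_substring)
print(kept_exact)
```

If the form only offers "Yes" and "No", both approaches keep the same rows; they diverge only when other answers happen to contain the substring.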

Analysis

Gender, Demography, and Age Information

We want to find the gender ratio, demographics, and age distribution of the participants.

DataAge <- clean_rawData%>%
  select(Age) %>%
  filter(!is.na(Age)) %>%
  group_by(Age) %>%
  summarise(count=n()) %>%
  arrange(desc(count))
head(DataAge, n=5)
Gender_ratio <- clean_rawData%>%
  select(Gender)%>%
  group_by(Gender)%>%
  summarise(count=n())%>%
  mutate(percent = (count / sum(count)) * 100) %>%
  arrange(desc(count))

Gender_ratio
# Count participants by age and gender for the graph
DataGender <- clean_rawData%>%
  # Filter before select() so that Country is still available in the pipeline
  filter(!is.na(Country), trimws(Country)!='', !is.na(Age)) %>%
  select(Age,Gender) %>%
  group_by(Age,Gender) %>%
  summarise(count=n()) %>%
  arrange(desc(count))
# Plot
ggplot(data = DataGender, 
       mapping = aes(x = Age, fill = Gender, 
                     y = ifelse(test = Gender == "Female", 
                                yes = -count, no = count)))+
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = abs) +
  labs(y="Count") 

Results

From the findings, the most common participant ages are 24, 29, and 30, with 5 participants at each of these ages.

By gender, the majority of participants are male: about 73.9 percent, compared with only about 26.1 percent female.
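
The percentages above come from dividing each group's count by the total. The following is a minimal sketch of that arithmetic with rounding applied. The counts of 51 and 18 are an assumption: they are not printed in this report, but they reproduce the reported percentages for 69 respondents.

```r
# Counts assumed for illustration: 51 male and 18 female out of 69
gender_counts <- c(Male = 51, Female = 18)

# Share of each group as a percentage, rounded to one decimal place
gender_percent <- round(100 * gender_counts / sum(gender_counts), 1)

print(gender_percent)
```

Rounding at the reporting step keeps the full precision in the summary table while making the written results easier to read.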

Demography

DemoGraphicsData <- clean_rawData %>%
  group_by(Country)%>% # Group by country
  summarise(count=n()) %>%# Count how many respondents selected each option
  mutate(percent = (count / sum(count)) * 100) %>%     
  arrange(desc(count))# Arrange the counts in descending order
 
DemoGraphicsData
# Plot the number of participants per country
DemoGraphicsData %>%
  arrange(count) %>%
  mutate(Country=factor(Country, levels=Country)) %>%
  ggplot( aes(x=Country, y=count)) +
    geom_segment( aes(xend=Country, yend=0)) +
    geom_point( size=2, color="orange") +
    coord_flip() + labs(title="Participants by Country")+ 
    theme_minimal()

Results

By country, 39 participants are from the United States, followed by 10 participants from India in second place. The United States accounts for about 56.5 percent of all participants in the survey.


Programming language

Most recommended programming language

We asked participants to recommend a first programming language, and we also asked which programming language a data scientist must know.

# Find the most recommended language

First_language <- clean_rawData %>%  
    # Remove any rows where the respondent didn't answer the question
    filter(trimws(`First language`)!='')%>%
    # Group by the responses to the question
    group_by(`First language`) %>% 
    # Count how many respondents selected each option
    summarise(count = n()) %>% 
    # Calculate what percent of respondents selected each option
    mutate(percent = (count / sum(count)) * 100) %>% 
    # Arrange the counts in descending order
    arrange(desc(count))

First_language

Must-know programming language

# Find the language respondents consider a must

Neccecery_language <- clean_rawData %>%  
  # Remove any rows where the respondent didn't answer the question
  filter(trimws(`must language`)!='')%>%
  # Group by the responses to the question
  group_by(`must language`) %>% 
  # Count how many respondents selected each option
  summarise(count = n()) %>% 
  # Calculate what percent of respondents selected each option
  mutate(percent = (count / sum(count)) * 100) %>% 
  # Arrange the counts in descending order
  arrange(desc(count))

Neccecery_language

Results

The results show that Python is the most recommended programming language for data scientists, chosen by about 56.7 percent of participants. The second most recommended language is R, with about 26.9 percent of the votes.

The results also suggest that about 50.7 percent of participants consider Python a must-know language. R is second with about 32.8 percent.
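
To compare the two questions side by side, the two summaries can be joined on the language name. The following is a minimal sketch that uses only the Python and R percentages reported above, rounded to one decimal place. The data frame and column names are assumptions made for illustration.

```r
# Percentages as reported in the results above, rounded to one decimal place
first_language <- data.frame(language = c("Python", "R"),
                             first_pct = c(56.7, 26.9))
must_language <- data.frame(language = c("Python", "R"),
                            must_pct = c(50.7, 32.8))

# Join the two summaries on the language name
language_compare <- merge(first_language, must_language, by = "language")

# Positive: recommended as a first language more often than rated a must
language_compare$difference <-
  language_compare$first_pct - language_compare$must_pct

print(language_compare)
```

In this comparison Python is recommended as a first language slightly more often than it is rated a must, while R shows the opposite pattern.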


Helpfulness of different ways to learn

clean_rawData_helpful <- clean_rawData %>%
  select(Blogs:Textbooks)

clean_rawData_helpful
# Drop rows with missing ratings
clean_rawData_helpful<- na.omit(clean_rawData_helpful)

# Convert every rating column to a factor, as likert() requires
clean_rawData_helpful[] <- lapply(clean_rawData_helpful, factor)
clean_rawData_helpful
# Plot the helpfulness ratings
helpful_chart <- likert(clean_rawData_helpful) 
plot(helpful_chart)

df.v<- summary(helpful_chart) # Summary table referenced in the results

Results

The analysis of learning methods shows that online courses are a very useful way to learn, topping the table at about 94.3 percent. College and projects are also rated highly, at around 91.4 percent.
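
The `likert()` function generally expects every item to be a factor with the same set of levels, which a per-column `factor()` conversion does not guarantee when some rating was never chosen for an item. The following is a minimal sketch of one way to enforce shared levels. The rating labels and responses are hypothetical, since the survey's actual labels are not shown in this report.

```r
# Hypothetical rating labels, ordered from least to most helpful (assumed)
rating_levels <- c("Not useful", "Somewhat useful", "Very useful")

# Hypothetical responses; "Not useful" never appears for Blogs
ratings <- data.frame(
  Blogs = c("Very useful", "Somewhat useful", "Very useful"),
  Textbooks = c("Not useful", "Very useful", "Somewhat useful"),
  stringsAsFactors = FALSE
)

# Convert every column to a factor with the same ordered set of levels
ratings[] <- lapply(ratings, factor, levels = rating_levels)

print(sapply(ratings, nlevels))
```

Fixing the level order explicitly also controls the order in which the categories appear in the chart.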


Technical skills/tools a data scientist needs to know

Here, we try to identify the most valuable technical skills and tools needed to be a successful data scientist.

Techdata <- clean_rawData %>%
  select(`Amazon Machine Learning`:Tableau)

Techdata<- na.omit(Techdata)
Techdata
Techdata[ ,1:18]  <- lapply(Techdata[ ,1:18],
                             FUN = function(x) Recode(x, "'Nice to have' =1 ;'Necessary'=2 ;'Unnecessary' = 0"))

Techdata

Here, we take a different approach to identifying the most valuable technical skills and tools. Because the survey asked respondents to rate each one as Necessary, Nice to have, or Unnecessary, we convert that scale to the numeric values 2, 1, and 0 respectively, so that we can sum the scores and rank the skills on a single scale.
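
The scoring described above can be illustrated on a small example. The following is a minimal sketch with hypothetical responses for two skills. It uses a named lookup vector in base R in place of `Recode()` from the `car` package, and all object names are assumptions made for illustration.

```r
# Hypothetical responses for two skills (not survey data)
responses <- data.frame(
  SQL = c("Necessary", "Necessary", "Nice to have"),
  SAS = c("Unnecessary", "Nice to have", "Unnecessary"),
  stringsAsFactors = FALSE
)

# Lookup table for the scale described above
score_map <- c("Unnecessary" = 0, "Nice to have" = 1, "Necessary" = 2)

# Replace each label with its numeric score, column by column
scores <- as.data.frame(lapply(responses, function(x) unname(score_map[x])))

# Total score per skill; a higher total means the skill was rated more needed
skill_totals <- colSums(scores)
print(skill_totals)
```

Because every respondent contributes between 0 and 2 points per skill, the totals are only comparable across skills when each skill has the same number of responses, which is why rows with missing values are dropped first.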

# Total score per skill, stored as a two-column data frame
skill_totals <- colSums(Techdata)
scale <- data.frame(Tech_Skill = names(skill_totals),
                    Rank = unname(skill_totals))

scale <- scale %>%
  arrange(desc(Rank))
scale
#plot graph
scale %>%
  mutate(Tech_Skill = fct_reorder(Tech_Skill, Rank)) %>%
  ggplot( aes(x=Tech_Skill, y=Rank)) +
    geom_segment( aes(xend=Tech_Skill, yend=1)) +
    geom_point( size=2, color="orange") +
    coord_flip() + labs(title="Technical Skills Survey")+ 
    theme_minimal()

Results

From the above analysis, we conclude that the top five technical skills/tools in the data science field are SQL, Python, Data Visualizations, R, and Microsoft Excel.


Soft skills a data scientist needs to know

softdata <- clean_rawData %>%
  select(`Intellectual curiosity`:`Generating Hypotheses`)

softdata<- na.omit(softdata)
softdata
# Total score per soft skill, stored as a two-column data frame
soft_totals <- colSums(softdata)
scale1 <- data.frame(Soft_skill = names(soft_totals),
                     Total = unname(soft_totals))

scale1 <- scale1 %>%
  arrange(desc(Total))
scale1
#plot graph

scale1 %>%
  mutate(Soft_skill = fct_reorder(Soft_skill, Total)) %>%
  ggplot( aes(x=Soft_skill, y=Total)) +
    geom_segment( aes(xend=Soft_skill, yend=1)) +
    geom_point( size=2, color="orange") +
    coord_flip() + labs(title="Soft Skills Survey")+ 
    theme_minimal()

Results

From the above analysis, we conclude that the top five soft skills in the data science field are Communication skills, Problem Solving, Intellectual curiosity, Business acumen, and Creative Thinking.


Conclusion

There is much to learn from this project. I learned how to conduct a survey and how to use the resulting data in an analysis. In my findings, Python is the most recommended programming language for data scientists. The top five technical skills/tools in the data science field are SQL, Python, data visualization tools, R, and Microsoft Excel. The top five soft skills are Communication skills, Problem Solving, Intellectual curiosity, Business acumen, and Creative Thinking.

Reference

  1. The R-Graph Gallery, bar plot for Likert-type items: https://www.r-graph-gallery.com/202-barplot-for-likert-type-items/
  2. Kaggle ML and Data Science Survey, 2017: https://www.kaggle.com/kaggle/kaggle-survey-2017
  3. Platform used for survey