For Project 3, we decided to conduct a survey and analyze the responses to identify the soft skills and technical skills a data scientist needs. We also wanted to learn which programming language should be learned first and which programming language is necessary to be a successful data scientist.
Kaggle already offers a broad data science and machine learning survey with more than 16,000 respondents, but a data scientist also needs to know how to conduct a survey in order to generate their own datasets.
This survey received 72 responses from around the world. We excluded respondents who answered “No” to the question about whether they are IT professionals, because their responses could skew the overall analysis. This was the first required question, and respondents who answered “No” were redirected to the end of the survey without proceeding further.
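The exclusion step described above can be sketched on toy data. This is a minimal illustration, not the survey's actual data: the column name `it_professional` and the answers are invented. It contrasts a pattern match with an exact match, since a pattern such as "No" would also drop any answer that merely contains those letters.

```r
# Toy data; the column name and the answers are made up for illustration
responses <- data.frame(
  it_professional = c("Yes", "No", NA, "Yes", "Not sure"),
  age = c(24, 31, 29, 30, 27),
  stringsAsFactors = FALSE
)

# Pattern match: drops every answer containing "No", including "Not sure"
by_pattern <- responses[!grepl("No", responses$it_professional), ]

# Exact match: drops only the literal answer "No"; unanswered (NA) rows are kept
by_exact <- responses[is.na(responses$it_professional) |
                        responses$it_professional != "No", ]

nrow(by_pattern)
nrow(by_exact)
```

When the question only allows "Yes" or "No", both approaches keep the same rows; the difference matters only if free-text answers are possible.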
The survey ran from October 17 to October 20. Most questions were optional, so respondents could provide as much information as they wished and were not forced to select unnecessary or incorrect options.
The survey was conducted using Google Forms. Most respondents came from the LinkedIn groups “Data Scientist & Analyst” and “Database Administrator (+5000 DBAs from around the world) Join now!”, and from the “Datacamp Community” Slack channel.
# Install any missing packages, then load each one
packages <- c("ggplot2", "car", "rlist", "tidyverse", "likert")
for (pkg in packages) {
if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
library(pkg, character.only = TRUE)
}
# Import survey data
rawData <- read.csv("https://raw.githubusercontent.com/Vinayak234/SPS_DATA_607/master/SPS_DATA_607/Project_3/To%20find%20Most%20Valued%20Data%20Science%20Skills.csv", stringsAsFactors = FALSE, header = TRUE, na.strings = c("", "NA"))
rawData
Clean data
# Drop records from respondents who are not IT professionals
clean_rawData <- rawData[!grepl("No", rawData$Are.you.related.to.IT.Profession.), ]
# Drop columns that are not needed
drops_c <- c("Timestamp", "Are.you.related.to.IT.Profession.")
clean_rawData <- clean_rawData[ , !(names(clean_rawData) %in% drops_c)]
# Short names for the columns
Short_name <- c("Gender","Country","Age","Are you in School","Eduction","Major","Are you learning data science skill","Emplyment Status", "Title", "Experience", "Blogs", "College","Projects", "Online Course", "Friends", "Co-workers", "Youtube","Textbooks","First language","must language","Amazon Machine Learning","Big Data","College Degree","Data Visualizations","Enterprise Tools","Google Cloud","Hadoop/Hive/Pig","IBM SPSS","Java","Microsoft Excel","NoSQL","Oracle Data Mining","Python","R","Relational data","SAS","SQL","Tableau","Intellectual curiosity","Business acumen", "Communication skills", "Teamwork", "Collaboration", "Creative Thinking","Problem Solving", "Active Learning","Perceptiveness","Interpersonal Skills","Generating Hypotheses")
colnames(clean_rawData) <- Short_name
head(clean_rawData)
We want to examine the gender ratio, demographics, and age distribution of the respondents.
DataAge <- clean_rawData%>%
select(Age) %>%
filter(!is.na(Age)) %>%
group_by(Age) %>%
summarise(count=n()) %>%
arrange(desc(count))
head(DataAge, n=5)
Gender_ratio <- clean_rawData%>%
select(Gender)%>%
group_by(Gender)%>%
summarise(count=n())%>%
mutate(percent = (count / sum(count)) * 100) %>%
arrange(desc(count))
Gender_ratio
# Count participants by age and gender for the plot
DataGender <- clean_rawData %>%
select(Age, Gender, Country) %>%
# Keep rows with a non-empty country and a known age
filter(!is.na(Country), trimws(Country) != '', !is.na(Age)) %>%
group_by(Age, Gender) %>%
summarise(count = n()) %>%
arrange(desc(count))
# Plot
ggplot(data = DataGender,
mapping = aes(x = Age, fill = Gender,
y = ifelse(test = Gender == "Female",
yes = -count, no = count)))+
geom_bar(stat = "identity") +
scale_y_continuous(labels = abs) +
labs(y="Count")
Results
The most common ages among participants are 24, 29, and 30, with 5 respondents each.
By gender, the majority of participants are male: 73.9 percent, compared with 26.1 percent female.
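As a small worked example of how such shares are computed and rounded for reporting, the sketch below uses base R on an invented vector of answers. The values are illustrative only and are not taken from the survey.

```r
# Invented answers for illustration; NA marks an unanswered question
gender <- c("Male", "Male", "Female", "Male", NA, "Female", "Male")

# table() leaves NA out by default, so shares are based on answered rows only
counts <- table(gender)

# Convert counts to percentages and round to one decimal place
percent <- round(100 * prop.table(counts), 1)
percent
```

Rounding at the reporting step keeps the stored values exact while making the text easier to read.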
Demography
DemoGraphicsData <- clean_rawData %>%
group_by(Country)%>% # Group by country
summarise(count=n()) %>%# Count how many respondents selected each option
mutate(percent = (count / sum(count)) * 100) %>%
arrange(desc(count))# Arrange the counts in descending order
DemoGraphicsData
# Plot the number of participants per country
DemoGraphicsData %>%
arrange(count) %>%
mutate(Country=factor(Country, levels=Country)) %>%
ggplot( aes(x=Country, y=count)) +
geom_segment( aes(xend=Country, yend=0)) +
geom_point( size=2, color="orange") +
coord_flip() + labs(title="Participants by Country")+
theme_minimal()
Results
By country, 39 participants are from the United States, followed by India with 10 participants. The United States accounts for 56.5 percent of all participants in the survey.
We asked participants to recommend a first programming language, and we also asked which programming language a data scientist must know.
# Find the most recommended first language
First_language <- clean_rawData %>%
# Remove any rows where the respondent didn't answer the question
filter(!is.na(`First language`), trimws(`First language`) != '') %>%
# Group by the responses to the question
group_by(`First language`) %>%
# Count how many respondents selected each option
summarise(count = n()) %>%
# Calculate what percent of respondents selected each option
mutate(percent = (count / sum(count)) * 100) %>%
# Arrange the counts in descending order
arrange(desc(count))
First_language
Must-have programming language
# Find the language respondents consider a must
Neccecery_language <- clean_rawData %>%
# Remove any rows where the respondent didn't answer the question
filter(!is.na(`must language`), trimws(`must language`) != '') %>%
# Group by the responses to the question
group_by(`must language`) %>%
# Count how many respondents selected each option
summarise(count = n()) %>%
# Calculate what percent of respondents selected each option
mutate(percent = (count / sum(count)) * 100) %>%
# Arrange the counts in descending order
arrange(desc(count))
Neccecery_language
Results
Python is the most recommended first programming language for data scientists, chosen by 56.7 percent of participants. R is second with 26.9 percent of the votes.
The results also show that 50.7 percent of participants consider Python a must-have language. R is second with 32.8 percent.
clean_rawData_helpful <- clean_rawData %>%
select(Blogs:Textbooks)
clean_rawData_helpful
#clean record
clean_rawData_helpful<- na.omit(clean_rawData_helpful)
clean_rawData_helpful[] <- lapply(clean_rawData_helpful, factor)
clean_rawData_helpful
# Plot how helpful respondents found each learning resource
helpful_chart <- likert(clean_rawData_helpful)
plot(helpful_chart)
df.v <- summary(helpful_chart) # kept for reference in the results
Results
The analysis of learning resources shows that online courses are considered a very useful way to learn, topping the table at 94.3 percent. College and projects follow closely at about 91.4 percent each.
Here, we try to identify the most valuable technical skills and tools needed to be a successful data scientist.
Techdata <- clean_rawData %>%
select(`Amazon Machine Learning`:Tableau)
Techdata<- na.omit(Techdata)
Techdata
Techdata[ ,1:18] <- lapply(Techdata[ ,1:18],
FUN = function(x) Recode(x, "'Nice to have' =1 ;'Necessary'=2 ;'Unnecessary' = 0"))
Techdata
Here, we take a different approach to find the most valuable technical skills and tools. The survey asked respondents to rate each one as Necessary, Nice to have, or Unnecessary. We convert those ratings to the numeric values 2, 1, and 0 respectively, so that we can sum them per skill and rank the skills on a single scale.
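The scoring idea can be sketched on a tiny invented data set using only base R. The skills, answers, and the lookup vector `weights` below are assumptions made for illustration; the analysis itself uses `Recode` from the car package on the real survey columns.

```r
# Toy ratings; the skills and answers are invented for illustration
ratings <- data.frame(
  SQL     = c("Necessary", "Necessary", "Nice to have"),
  Tableau = c("Nice to have", "Unnecessary", "Necessary"),
  SAS     = c("Unnecessary", "Unnecessary", "Nice to have"),
  stringsAsFactors = FALSE
)

# Named lookup vector: rating label -> numeric weight
weights <- c("Unnecessary" = 0, "Nice to have" = 1, "Necessary" = 2)

# Replace each label with its weight, column by column
scored <- as.data.frame(lapply(ratings, function(x) unname(weights[x])))

# Sum the weights per skill and sort from highest to lowest total
totals <- sort(colSums(scored), decreasing = TRUE)
totals
```

In this toy case SQL totals 5, Tableau 3, and SAS 1. Note that a simple sum treats one "Necessary" as equal to two "Nice to have" answers, which is a modeling choice rather than a property of the data.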
# Build a table with one row per skill and its summed score
scale <- data.frame(Tech_Skill = names(Techdata),
Rank = unname(colSums(Techdata)))
scale <- scale %>%
arrange(desc(Rank))
scale
#plot graph
scale %>%
mutate(Tech_Skill = fct_reorder(Tech_Skill, Rank)) %>%
ggplot( aes(x=Tech_Skill, y=Rank)) +
geom_segment( aes(xend=Tech_Skill, yend=1)) +
geom_point( size=2, color="orange") +
coord_flip() + labs(title="Technical Skills Survey")+
theme_minimal()
Results
From the above analysis, we conclude that the top 5 technical skills and tools in the data science field are SQL, Python, Data Visualizations, R, and Microsoft Excel.
softdata <- clean_rawData %>%
select(`Intellectual curiosity`:`Generating Hypotheses`)
softdata<- na.omit(softdata)
softdata
# Build a table with one row per soft skill and its summed score
scale1 <- data.frame(Soft_skill = names(softdata),
Total = unname(colSums(softdata)))
scale1 <- scale1 %>%
arrange(desc(Total))
scale1
#plot graph
scale1 %>%
mutate(Soft_skill = fct_reorder(Soft_skill, Total)) %>%
ggplot( aes(x=Soft_skill, y=Total)) +
geom_segment( aes(xend=Soft_skill, yend=1)) +
geom_point( size=2, color="orange") +
coord_flip() + labs(title="Soft Skills Survey")+
theme_minimal()
Results
From the above analysis, we conclude that the top 5 soft skills in the data science field are Communication skills, Problem Solving, Intellectual curiosity, Business acumen, and Creative Thinking.
There is much to learn from this project. I learned how to conduct a survey and how to use the resulting data in an analysis. In my findings, Python is the most recommended programming language for data scientists. The top 5 technical skills and tools in the data science field are SQL, Python, data visualization tools, R, and Microsoft Excel, and the top 5 soft skills are Communication skills, Problem Solving, Intellectual curiosity, Business acumen, and Creative Thinking.