1. Welcome to the world of data science
There are many languages and tools in data science that can be used to complete a given task. While you can often use whichever tool you prefer, it helps when analysts work on similar platforms so that they can share code with one another. Learning what professionals in the data science industry use at work gives us a better understanding of what we may be asked to do in the future. In this project, I am going to find out what tools and languages professionals use in their day-to-day work. The data comes from the Kaggle Data Science Dataset, which includes responses from over 10,000 people who write code to analyze data in their daily work.
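The code in this walkthrough does not show its package setup. As a minimal sketch, assuming the functions used here come from their usual packages (`read_excel` from readxl, `rowid_to_column` from tibble, `unnest` from tidyr, `kable` from knitr, `kable_styling` and `scroll_box` from kableExtra, plus dplyr and ggplot2), the following checks what is installed before attaching anything:

```r
# Packages whose functions appear in this walkthrough (assumed from the calls used)
pkgs <- c("readxl", "tibble", "dplyr", "tidyr", "ggplot2", "knitr", "kableExtra")

# Find any that are not installed, without attaching or installing anything
not_installed <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

if (length(not_installed) > 0) {
  message("Install before running: ", paste(not_installed, collapse = ", "))
} else {
  # Attach everything so the later code can call the functions directly
  invisible(lapply(pkgs, library, character.only = TRUE))
}
```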
# Loading the data set
response <- read_excel("Kaggle_Data_Set.xlsx", sheet = 2)
## Warning in strptime(x, format, tz = tz): unknown timezone 'zone/tz/2018c.1.0/zoneinfo/America/New_York'
# Viewing the first 10 rows of response
kable(head(response,10),'html') %>%
kable_styling() %>%
scroll_box(width = "800px", height = "700px")

| WorkAlgorithmsSelect | WorkToolsSelect | LanguageRecommendationSelect |
|---|---|---|
| Bayesian Techniques,HMMs,Neural Networks,RNNs | Amazon Web services,C/C++,Jupyter notebooks,MATLAB/Octave,Python,SQL,TensorFlow,Unix shell / awk | R |
| Bayesian Techniques,HMMs,Neural Networks,RNNs | Amazon Web services,C/C++,Jupyter notebooks,MATLAB/Octave,Python,SQL,TensorFlow,Unix shell / awk | Python |
| Bayesian Techniques,HMMs,Neural Networks,RNNs | Amazon Web services,C/C++,Jupyter notebooks,MATLAB/Octave,Python,SQL,TensorFlow,Unix shell / awk | Python |
| Bayesian Techniques,HMMs,Neural Networks,RNNs | Amazon Web services,C/C++,Jupyter notebooks,MATLAB/Octave,Python,SQL,TensorFlow,Unix shell / awk | Python |
| Bayesian Techniques,Markov Logic Networks,Neural Networks,Regression/Logistic Regression | Microsoft Azure Machine Learning,NoSQL,SQL,Tableau | C/C++/C# |
| Bayesian Techniques,Markov Logic Networks,Neural Networks,Regression/Logistic Regression | Microsoft Azure Machine Learning,NoSQL,SQL,Tableau | Matlab |
| Bayesian Techniques,Markov Logic Networks,Neural Networks,Regression/Logistic Regression | Microsoft Azure Machine Learning,NoSQL,SQL,Tableau | Java |
| Bayesian Techniques,Regression/Logistic Regression | R,Other | R |
| Bayesian Techniques,Regression/Logistic Regression | R,Other | Python |
| Bayesian Techniques,Regression/Logistic Regression | R,Other | Other |
2. Using multiple tools
Now that we’ve loaded in the survey results, we want to focus on the tools and languages that the survey respondents use at work.
#Adding a Respondent column to the data frame
responses<-tibble::rowid_to_column(response,"Respondent")
# Creating a new data frame called tools
tools <- responses
# Adding a new column to tools which splits the WorkToolsSelect column at the commas and unnests the new column
tools <- tools %>%
mutate(work_tools = strsplit(as.character(WorkToolsSelect), ","))%>%
unnest(work_tools)
# Viewing the first 10 rows of tools
kable( head(tools,10),'html') %>%
kable_styling() %>%
scroll_box(width = "800px", height = "700px")

| Respondent | WorkAlgorithmsSelect | WorkToolsSelect | LanguageRecommendationSelect | work_tools |
|---|---|---|---|---|
| 1 | Bayesian Techniques,HMMs,Neural Networks,RNNs | Amazon Web services,C/C++,Jupyter notebooks,MATLAB/Octave,Python,SQL,TensorFlow,Unix shell / awk | R | Amazon Web services |
| 1 | Bayesian Techniques,HMMs,Neural Networks,RNNs | Amazon Web services,C/C++,Jupyter notebooks,MATLAB/Octave,Python,SQL,TensorFlow,Unix shell / awk | R | C/C++ |
| 1 | Bayesian Techniques,HMMs,Neural Networks,RNNs | Amazon Web services,C/C++,Jupyter notebooks,MATLAB/Octave,Python,SQL,TensorFlow,Unix shell / awk | R | Jupyter notebooks |
| 1 | Bayesian Techniques,HMMs,Neural Networks,RNNs | Amazon Web services,C/C++,Jupyter notebooks,MATLAB/Octave,Python,SQL,TensorFlow,Unix shell / awk | R | MATLAB/Octave |
| 1 | Bayesian Techniques,HMMs,Neural Networks,RNNs | Amazon Web services,C/C++,Jupyter notebooks,MATLAB/Octave,Python,SQL,TensorFlow,Unix shell / awk | R | Python |
| 1 | Bayesian Techniques,HMMs,Neural Networks,RNNs | Amazon Web services,C/C++,Jupyter notebooks,MATLAB/Octave,Python,SQL,TensorFlow,Unix shell / awk | R | SQL |
| 1 | Bayesian Techniques,HMMs,Neural Networks,RNNs | Amazon Web services,C/C++,Jupyter notebooks,MATLAB/Octave,Python,SQL,TensorFlow,Unix shell / awk | R | TensorFlow |
| 1 | Bayesian Techniques,HMMs,Neural Networks,RNNs | Amazon Web services,C/C++,Jupyter notebooks,MATLAB/Octave,Python,SQL,TensorFlow,Unix shell / awk | R | Unix shell / awk |
| 2 | Bayesian Techniques,HMMs,Neural Networks,RNNs | Amazon Web services,C/C++,Jupyter notebooks,MATLAB/Octave,Python,SQL,TensorFlow,Unix shell / awk | Python | Amazon Web services |
| 2 | Bayesian Techniques,HMMs,Neural Networks,RNNs | Amazon Web services,C/C++,Jupyter notebooks,MATLAB/Octave,Python,SQL,TensorFlow,Unix shell / awk | Python | C/C++ |
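To make the split-and-unnest step concrete, here is a small base R sketch on made-up data (the two rows below are illustrative, not survey records). It shows the same reshaping: one row per respondent becomes one row per respondent and tool.

```r
# Toy data: one comma-separated tool list per respondent (hypothetical values)
toy <- data.frame(
  Respondent = 1:2,
  WorkToolsSelect = c("Python,SQL", "R,Other"),
  stringsAsFactors = FALSE
)

# strsplit() returns a list with one character vector per respondent
split_tools <- strsplit(toy$WorkToolsSelect, ",")

# Repeat each respondent id once per tool, then flatten the list
long <- data.frame(
  Respondent = rep(toy$Respondent, lengths(split_tools)),
  work_tools = unlist(split_tools),
  stringsAsFactors = FALSE
)
long
```

Note that `strsplit()` turns a missing answer into a single `NA` entry, so a respondent with no tools listed still contributes one row after this step.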
3. Counting users of each tool
Now that we’ve split apart all of the tools used by each respondent, we can figure out which tools are the most popular.
# Creating a new data frame
tool_count <- tools
# Grouping the data by work_tools, calculate the number of responses in each group
tool_count <- tool_count %>%
group_by(work_tools) %>%
summarize(responses=n())
# Sorting tool_count so that the most popular tools are at the top
tool_count <- tool_count[order(tool_count$responses, decreasing = TRUE), ]
# Printing the first 6 results
kable( head(tool_count,6),'html') %>%
kable_styling() %>%
scroll_box(width = "800px", height = "300px")

| work_tools | responses |
|---|---|
| Python | 6025 |
| R | 4632 |
| SQL | 3840 |
| Jupyter notebooks | 2998 |
| TensorFlow | 2533 |
| Unix shell / awk | 2282 |
Python, R and SQL, which are also my favorite tools, are the top 3 tools that professionals use to analyze data.
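The group-count-sort pattern used above can be sketched in base R on made-up data (the values are hypothetical). One difference to keep in mind: `table()` drops `NA` by default, whereas `group_by()` followed by `summarize(n())` keeps `NA` as its own group, so any respondents with no tools listed would appear as a row in `tool_count`.

```r
# Toy long-format data (hypothetical values, not taken from the survey)
toy_tools <- data.frame(
  Respondent = c(1, 1, 2, 2, 3),
  work_tools = c("Python", "SQL", "R", "Python", NA),
  stringsAsFactors = FALSE
)

# Count rows per tool; table() silently ignores the NA entry
toy_count <- sort(table(toy_tools$work_tools), decreasing = TRUE)
toy_count
```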
4. Plotting the most popular tools
Let’s see how my favorite tools stack up against the rest.
# Creating a bar chart of the work_tools column.
# Arranging the bars so that the tallest are on the far left
# na.rm is an argument of the geom, not an aesthetic, so it is set in geom_bar()
pl <- ggplot(tool_count, aes(x = reorder(work_tools, -responses), y = responses)) +
geom_bar(stat = 'identity', fill = 'tan2', na.rm = TRUE)
#Rotating the bar labels on x-axis by 90 degrees
pl2<-pl + theme(axis.text.x=element_text(angle=90, hjust=1))
pl2 + xlab('Work Tools') + ylab('Total Responses') + ggtitle('Bar Chart of Work Tools Used')
5. Python vs R is more like Batman vs Superman
Within the field of data science, there is a lot of debate among professionals about whether R or Python should reign supreme. You can see from the last figure that R and Python are the two most commonly used languages, but it’s possible that many respondents use both R and Python. Let’s take a look at how many people use R, Python, and both tools.
# Creating a new data frame called debate_tools
debate_tools <- responses
# Creating a new column called language_preference, based on the following conditions
#"R" if WorkToolsSelect contains "R" but not "Python".
#"Python" if WorkToolsSelect contains "Python" but not "R".
#"Both" if WorkToolsSelect contains both "R" and "Python".
#"Neither" if WorkToolsSelect contains neither "R" nor "Python".
debate_tools <- debate_tools %>%
mutate(language_preference = case_when(
grepl('R', WorkToolsSelect) & grepl('Python', WorkToolsSelect) ~ 'Both',
grepl('R', WorkToolsSelect) & !grepl('Python', WorkToolsSelect) ~ 'R',
!grepl('R', WorkToolsSelect) & grepl('Python', WorkToolsSelect) ~ 'Python',
!grepl('R', WorkToolsSelect) & !grepl('Python', WorkToolsSelect) ~ 'Neither'))
# Printing the first 10 rows of debate_tools
kable( head(debate_tools,10),'html') %>%
kable_styling() %>%
scroll_box(width = "800px", height = "600px")

| Respondent | WorkAlgorithmsSelect | WorkToolsSelect | LanguageRecommendationSelect | language_preference |
|---|---|---|---|---|
| 1 | Bayesian Techniques,HMMs,Neural Networks,RNNs | Amazon Web services,C/C++,Jupyter notebooks,MATLAB/Octave,Python,SQL,TensorFlow,Unix shell / awk | R | Python |
| 2 | Bayesian Techniques,HMMs,Neural Networks,RNNs | Amazon Web services,C/C++,Jupyter notebooks,MATLAB/Octave,Python,SQL,TensorFlow,Unix shell / awk | Python | Python |
| 3 | Bayesian Techniques,HMMs,Neural Networks,RNNs | Amazon Web services,C/C++,Jupyter notebooks,MATLAB/Octave,Python,SQL,TensorFlow,Unix shell / awk | Python | Python |
| 4 | Bayesian Techniques,HMMs,Neural Networks,RNNs | Amazon Web services,C/C++,Jupyter notebooks,MATLAB/Octave,Python,SQL,TensorFlow,Unix shell / awk | Python | Python |
| 5 | Bayesian Techniques,Markov Logic Networks,Neural Networks,Regression/Logistic Regression | Microsoft Azure Machine Learning,NoSQL,SQL,Tableau | C/C++/C# | Neither |
| 6 | Bayesian Techniques,Markov Logic Networks,Neural Networks,Regression/Logistic Regression | Microsoft Azure Machine Learning,NoSQL,SQL,Tableau | Matlab | Neither |
| 7 | Bayesian Techniques,Markov Logic Networks,Neural Networks,Regression/Logistic Regression | Microsoft Azure Machine Learning,NoSQL,SQL,Tableau | Java | Neither |
| 8 | Bayesian Techniques,Regression/Logistic Regression | R,Other | R | R |
| 9 | Bayesian Techniques,Regression/Logistic Regression | R,Other | Python | R |
| 10 | Bayesian Techniques,Regression/Logistic Regression | R,Other | Other | R |
6. Plotting R vs Python users
Taking a closer look at how many respondents use R, Python, and both!
# Creating a new data frame
debate_plot <- debate_tools
# Grouping by language preference and calculate number of responses
debate_plot <- debate_plot %>%
group_by(language_preference) %>%
summarise(language_preference_response=n())
# Removing the row for users of "Neither"
debate_plot<-filter(debate_plot, language_preference != 'Neither')
#Viewing debate_plot
kable(debate_plot,'html') %>%
kable_styling() %>%
scroll_box(width = "700px", height = "200px")

| language_preference | language_preference_response |
|---|---|
| Both | 3510 |
| Python | 2515 |
| R | 1250 |
# Creating a bar chart
pl<-ggplot(debate_plot,aes(x=language_preference,y=language_preference_response))+
geom_bar(stat = 'identity',fill='tan4')+xlab("Languages Preferred")+ylab('Count')+ggtitle('Languages Preferred by Respondent')
print(pl)
Using both tools is the most common pattern, followed by Python only and then R only.
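One caveat on these counts. Python only plus Both gives 2515 + 3510 = 6025, which matches the Python count in section 3. R only plus Both gives 1250 + 3510 = 4760, which is 128 more than the 4632 R users counted in section 3. A likely explanation is that `grepl('R', ...)` matches any tool name containing a capital "R", not just the entry "R". The sketch below contrasts the two approaches; `has_tool` is a hypothetical helper and the selections are illustrative, not survey rows.

```r
# Exact-match helper (hypothetical name): TRUE when `tool` is one of the
# comma-separated entries in `selection`, not merely a substring of one
has_tool <- function(selection, tool) {
  entries <- strsplit(as.character(selection), ",", fixed = TRUE)
  vapply(entries, function(x) tool %in% trimws(x), logical(1))
}

# Illustrative selections; "RapidMiner" stands in for any tool name with a capital R
toy <- c("Python,RapidMiner", "R,Other", "Python,R", "SQL,Tableau", NA)

substring_match <- grepl("R", toy)    # also flags the RapidMiner row
exact_match <- has_tool(toy, "R")     # flags only rows listing R itself

data.frame(toy, substring_match, exact_match)
```

Swapping `has_tool(WorkToolsSelect, "R")` in for `grepl('R', WorkToolsSelect)` inside `case_when()` would be one way to apply this, though the counts reported above would then need to be recomputed.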
7. Language recommendations
It looks like the largest group of professionals program in both Python and R. But what happens when they are asked which language they recommend to new learners? Let’s find out.
# Creating a new data frame
recommendations <- debate_tools
# Grouping by language_preference and LanguageRecommendationSelect
recommendations <- recommendations %>%
group_by(language_preference,LanguageRecommendationSelect)%>%
summarise(count_recommendation_language=n())
# Removing NA's and keeping the top 4 recommendations in each language_preference group
recommendations <- recommendations %>%
filter(!is.na(LanguageRecommendationSelect)) %>%
arrange(desc(count_recommendation_language)) %>%
filter(row_number() <= 4)
Here we keep the top 4 languages recommended by professionals in each of the Both, Neither, Python and R language preference categories.
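The "top N per group" step relies on the data still being grouped by `language_preference` after `summarise()`, so `row_number()` restarts within each category. The same idea can be sketched in base R on made-up counts (all numbers below are hypothetical), keeping the top 2 per group for brevity:

```r
# Toy recommendation counts (hypothetical numbers)
counts <- data.frame(
  language_preference = c("Both", "Both", "Both", "R", "R", "R"),
  recommended = c("R", "Python", "SQL", "Python", "SQL", "R"),
  n = c(3, 5, 1, 2, 1, 4),
  stringsAsFactors = FALSE
)

# Sort by group, then by descending count within each group
ordered <- counts[order(counts$language_preference, -counts$n), ]

# Position of each row within its group (1 = most recommended)
position <- ave(ordered$n, ordered$language_preference, FUN = seq_along)

# Keep the top 2 rows of every group
top2 <- ordered[position <= 2, ]
top2
```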
8. The most recommended language by the language preference categories
Let’s graphically determine which languages are most recommended based on the language preferences of professionals.
# Creating a faceted bar plot
ggplot(recommendations, aes(x=LanguageRecommendationSelect,y=count_recommendation_language,fill=count_recommendation_language))+
geom_bar(stat='identity')+
facet_wrap(~language_preference)+
xlab('Language Recommended')+ylab('Count')+
theme(axis.text.x=element_text(angle=90, hjust=1),legend.position = 'none')
Professionals in the Both, Neither and Python categories recommend Python to new users, whereas professionals in the R category recommend R. Overall, Python is the language most often recommended to new users.
My advice to budding data scientists based on this analysis
a. Learn Python, R and SQL, as they are the tools most used by data scientists.
b. Python and R are both open source languages with strong community support and plenty of material on the internet, including good MOOCs, which will help new learners grasp these languages effectively.
c. Python and R will help with analytics and predictive modeling, while SQL is best for querying databases.