1. Welcome to the world of data science

Data science offers many languages and tools for completing a given task. You can often use whichever tool you prefer, but analysts frequently need to work on similar platforms so that they can share code with one another. Learning what data science professionals use at work can give you a better sense of what you may be asked to do in the future.

In this project, we are going to find out what tools and languages professionals use in their day-to-day work. Our data comes from the Kaggle Data Science Survey, which includes responses from over 10,000 people who write code to analyze data in their daily work.

# Load necessary packages
library(tidyverse)
library(knitr)
library(kableExtra)

# Load data
responses <- read.csv("kagglesurvey.csv")

# Print the first 6 rows
kable_styling(kable(head(responses), caption = "Print the first 6 rows"))
Print the first 6 rows
Respondent WorkToolsSelect LanguageRecommendationSelect EmployerIndustry WorkAlgorithmsSelect
1 Amazon Web services,Oracle Data Mining/ Oracle R Enterprise,Perl F# Internet-based Neural Networks,Random Forests,RNNs
2 Amazon Machine Learning,Amazon Web services,Cloudera,Hadoop/Hive/Pig,Impala,Java,Mathematica,MATLAB/Octave,Microsoft Excel Data Mining,Microsoft SQL Server Data Mining,NoSQL,Python,R,SAS Base,SAS JMP,SQL,Tableau Python Mix of fields Bayesian Techniques,Decision Trees,Random Forests,Regression/Logistic Regression
3 C/C++,Jupyter notebooks,MATLAB/Octave,Python,R,TensorFlow Python Technology Bayesian Techniques,CNNs,Ensemble Methods,Neural Networks,Regression/Logistic Regression,SVMs
4 Jupyter notebooks,Python,SQL,TensorFlow Python Academic Bayesian Techniques,CNNs,Decision Trees,Gradient Boosted Machines,Neural Networks,Random Forests,Regression/Logistic Regression
5 C/C++,Cloudera,Hadoop/Hive/Pig,Java,NoSQL,R,Unix shell / awk R Government
6 SQL Python Non-profit

2. Using multiple tools

Now that we’ve loaded in the survey results, we want to focus on the tools and languages that the survey respondents use at work.

# Print the first respondent's tools and languages
responses %>% select(WorkToolsSelect, LanguageRecommendationSelect) %>% head(1)
# Create a new data frame called tools
tools <-data.frame(responses$WorkToolsSelect)

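# Add a new column by splitting on commas (and any whitespace after them),
# skipping commas inside parentheses, then unnest it to one tool per row
tools <- tools %>%
    mutate(work_tools = strsplit(as.character(responses$WorkToolsSelect), '\\([^)]+,(*SKIP)(*FAIL)|,\\s*', perl = TRUE)) %>%
    unnest(work_tools)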
# View the first 6 rows of tools
kable_styling(kable(head(tools) , caption= "Print the first 6 rows"))
Print the first 6 rows
responses.WorkToolsSelect work_tools
Amazon Web services,Oracle Data Mining/ Oracle R Enterprise,Perl Amazon Web services
Amazon Web services,Oracle Data Mining/ Oracle R Enterprise,Perl Oracle Data Mining/ Oracle R Enterprise
Amazon Web services,Oracle Data Mining/ Oracle R Enterprise,Perl Perl
Amazon Machine Learning,Amazon Web services,Cloudera,Hadoop/Hive/Pig,Impala,Java,Mathematica,MATLAB/Octave,Microsoft Excel Data Mining,Microsoft SQL Server Data Mining,NoSQL,Python,R,SAS Base,SAS JMP,SQL,Tableau Amazon Machine Learning
Amazon Machine Learning,Amazon Web services,Cloudera,Hadoop/Hive/Pig,Impala,Java,Mathematica,MATLAB/Octave,Microsoft Excel Data Mining,Microsoft SQL Server Data Mining,NoSQL,Python,R,SAS Base,SAS JMP,SQL,Tableau Amazon Web services
Amazon Machine Learning,Amazon Web services,Cloudera,Hadoop/Hive/Pig,Impala,Java,Mathematica,MATLAB/Octave,Microsoft Excel Data Mining,Microsoft SQL Server Data Mining,NoSQL,Python,R,SAS Base,SAS JMP,SQL,Tableau Cloudera
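The split-and-unnest step can be illustrated on a small, hypothetical data frame. This is a minimal sketch that assumes plain comma separators; `toy` and its values are made up for illustration and are not taken from the survey.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for the survey's WorkToolsSelect column
toy <- data.frame(
  Respondent = c(1, 2),
  WorkToolsSelect = c("Python,R,SQL", "SQL"),
  stringsAsFactors = FALSE
)

# strsplit() returns one character vector per respondent (a list column);
# unnest() then expands that list column to one row per tool
toy_tools <- toy %>%
  mutate(work_tools = strsplit(WorkToolsSelect, ",", fixed = TRUE)) %>%
  unnest(work_tools)

nrow(toy_tools)  # 4: three tools for respondent 1, one for respondent 2
```

Each respondent's other columns are repeated on every expanded row, which is why the original selection string appears next to each individual tool in the output above.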

3. Counting users of each tool

Now that we’ve split apart all of the tools used by each respondent, we can figure out which tools are the most popular.

# Create a new data frame
tool_count <-data.frame(tools)

# Group the data by work_tools, summarise the counts, and arrange in descending order
tool_count <- tool_count  %>% 
    group_by(work_tools)  %>% 
    summarise(count=n()) %>%
    arrange(desc(count))
    
# Print the first 6 results
kable_styling(kable(head(tool_count) , caption= "Print the first 6 rows"))
Print the first 6 rows
work_tools count
Python 6073
R 4708
SQL 4261
Jupyter notebooks 3206
TensorFlow 2256
Amazon Web services 1868
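The group, summarise, and arrange pattern can also be written with `dplyr::count()`. The sketch below uses a hypothetical `toy_tools` data frame rather than the survey data, and is offered as an equivalent shorthand, not as the approach used above.

```r
library(dplyr)

# Hypothetical one-tool-per-row data, shaped like the tools data frame
toy_tools <- data.frame(
  work_tools = c("Python", "R", "Python", "SQL", "Python", "R"),
  stringsAsFactors = FALSE
)

# count() groups by work_tools, tallies the rows, and sorts in descending order;
# name = "count" matches the column name used in the tutorial
toy_count <- toy_tools %>%
  count(work_tools, sort = TRUE, name = "count")

toy_count
```

On this toy input, Python appears three times, R twice, and SQL once, so the rows come back in that order.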

5. The R vs Python debate

Within the field of data science, there is a lot of debate among professionals about whether R or Python should reign supreme. You can see from the tool counts above that R and Python are the two most commonly used languages, but it’s possible that many respondents use both R and Python. Let’s take a look at how many people use R, Python, and both tools.

# Create a new data frame called debate_tools
debate_tools <- responses

# Create a new column called language_preference.
# R is matched as a whole comma-separated entry so that tool names that merely
# contain the letter (such as "Oracle R Enterprise") are not counted as R.
r_pattern <- "(^|,) *R *(,|$)"
debate_tools <- debate_tools %>%
    mutate(language_preference = case_when(
        grepl(r_pattern, WorkToolsSelect) & !grepl("Python", WorkToolsSelect) ~ "R",
        grepl("Python", WorkToolsSelect) & !grepl(r_pattern, WorkToolsSelect) ~ "Python",
        grepl("Python", WorkToolsSelect) & grepl(r_pattern, WorkToolsSelect) ~ "both",
        !grepl("Python", WorkToolsSelect) & !grepl(r_pattern, WorkToolsSelect) ~ "neither"
    ))

# Print the first 6 rows
kable_styling(kable(head(debate_tools), caption = "Print the first 6 rows"))
Print the first 6 rows
Respondent WorkToolsSelect LanguageRecommendationSelect EmployerIndustry WorkAlgorithmsSelect language_preference
1 Amazon Web services,Oracle Data Mining/ Oracle R Enterprise,Perl F# Internet-based Neural Networks,Random Forests,RNNs neither
2 Amazon Machine Learning,Amazon Web services,Cloudera,Hadoop/Hive/Pig,Impala,Java,Mathematica,MATLAB/Octave,Microsoft Excel Data Mining,Microsoft SQL Server Data Mining,NoSQL,Python,R,SAS Base,SAS JMP,SQL,Tableau Python Mix of fields Bayesian Techniques,Decision Trees,Random Forests,Regression/Logistic Regression both
3 C/C++,Jupyter notebooks,MATLAB/Octave,Python,R,TensorFlow Python Technology Bayesian Techniques,CNNs,Ensemble Methods,Neural Networks,Regression/Logistic Regression,SVMs both
4 Jupyter notebooks,Python,SQL,TensorFlow Python Academic Bayesian Techniques,CNNs,Decision Trees,Gradient Boosted Machines,Neural Networks,Random Forests,Regression/Logistic Regression Python
5 C/C++,Cloudera,Hadoop/Hive/Pig,Java,NoSQL,R,Unix shell / awk R Government R
6 SQL Python Non-profit neither
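Because `WorkToolsSelect` is a comma-separated string, a plain substring search for "R" also matches tool names that merely contain that letter. The sketch below compares substring search with exact matching on split entries; the helper `uses_tool()` and the `toy` strings are hypothetical and exist only for this illustration.

```r
# Hypothetical helper: TRUE when `tool` is one of the comma-separated entries
uses_tool <- function(selection, tool) {
  vapply(
    strsplit(selection, ",", fixed = TRUE),
    function(entries) tool %in% trimws(entries),
    logical(1)
  )
}

# Hypothetical selections, not taken from the survey
toy <- c("Oracle R Enterprise,Perl", "Python,R", "SQL")

substring_match <- grepl("R", toy, fixed = TRUE)  # TRUE TRUE FALSE
exact_match <- uses_tool(toy, "R")                # FALSE TRUE FALSE
```

The first string is flagged by the substring search only because "Oracle R Enterprise" contains a capital R, while the exact match counts only the respondent who lists R as a tool.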

6. Plotting R vs Python users

Now we just need to take a closer look at how many respondents use R, Python, and both!

# Create a new data frame
debate_plot <- data.frame(debate_tools)

# Group by language preference, calculate number of responses, and remove "neither"
debate_plot <- debate_plot  %>% 
   group_by(language_preference)  %>% 
   summarise(count=n()) %>%
    filter(language_preference !="neither")

# Create a bar chart
ggplot(debate_plot, aes(x = language_preference, y = count, color = language_preference)) +
    geom_bar(stat = "identity", fill = "white")

7. Language recommendations

It looks like the largest group of professionals program in both Python and R. But what happens when they are asked which language they recommend to new learners? Do R lovers always recommend R?

# Create a new data frame
recommendations <- debate_tools

# Drop missing or blank recommendations, count recommendations within each
# language_preference group, and keep the four most common per group
recommendations <- recommendations %>%
    filter(!is.na(LanguageRecommendationSelect), LanguageRecommendationSelect != "") %>%
    group_by(language_preference, LanguageRecommendationSelect) %>%
    summarise(n = n()) %>%
    arrange(language_preference, desc(n)) %>%
    mutate(rank = row_number()) %>%
    filter(rank <= 4)
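Ranking recommendations within each group can be checked on a small, hypothetical data frame. This sketch keeps only the single most common recommendation per group so the result is easy to verify by eye; the `toy` values are invented and say nothing about the actual survey results.

```r
library(dplyr)

# Hypothetical respondents: their language preference and their recommendation
toy <- data.frame(
  language_preference = c("R", "R", "R", "Python", "Python", "Python"),
  LanguageRecommendationSelect = c("R", "R", "Python", "Python", "Python", "SQL"),
  stringsAsFactors = FALSE
)

top_rec <- toy %>%
  count(language_preference, LanguageRecommendationSelect, name = "n") %>%
  group_by(language_preference) %>%
  arrange(language_preference, desc(n)) %>%  # most common first within each group
  mutate(rank = row_number()) %>%            # rank restarts at 1 in every group
  filter(rank <= 1) %>%
  ungroup()

top_rec
```

Sorting by the count before calling `row_number()` is what makes the rank reflect popularity; sorting by the language name instead would rank the recommendations alphabetically.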

9. The moral of the story

So we’ve made it to the end. We’ve found that Python is the most popular language used among Kaggle data scientists, but R users aren’t far behind. And while Python users may highly recommend that new learners learn Python, would R users find the following statement TRUE or FALSE?

# Would R users find this statement TRUE or FALSE?
R_is_number_one <- TRUE