1. Welcome to the world of data science

Data science offers many languages and tools for completing a given task. Although you can often use whichever tool you prefer, analysts usually benefit from working on similar platforms so that they can share code with one another. Learning what professionals in the data science industry use at work can give you a better sense of what you may be asked to do in the future.

In this project, we are going to find out what tools and languages professionals use in their day-to-day work. Our data comes from the Kaggle Data Science Survey, which includes responses from over 10,000 people who write code to analyze data in their daily work.

Setup

## # A tibble: 10 x 5
##    Respondent WorkToolsSelect LanguageRecomme… EmployerIndustry
##         <int> <chr>           <chr>            <chr>           
##  1          1 Amazon Web ser… F#               Internet-based  
##  2          2 Amazon Machine… Python           Mix of fields   
##  3          3 C/C++,Jupyter … Python           Technology      
##  4          4 Jupyter notebo… Python           Academic        
##  5          5 C/C++,Cloudera… R                Government      
##  6          6 SQL             Python           Non-profit      
##  7          7 Jupyter notebo… Python           Internet-based  
##  8          8 Python,Spark /… Python           Mix of fields   
##  9          9 Jupyter notebo… Python           Financial       
## 10         10 C/C++,IBM Cogn… R                Technology      
## # ... with 1 more variable: WorkAlgorithmsSelect <chr>

2. Using multiple tools

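Now that we’ve loaded the survey results, we want to focus on the tools and languages that respondents use at work. Each respondent's answer in WorkToolsSelect is stored as a single comma-separated string, so we split it into one row per tool in a new work_tools column.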
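The code for this step is not shown, so the following is only a minimal sketch of one way to split a comma-separated column into one row per tool. It assumes the dplyr and tidyr packages are installed. The small `responses` table is a toy stand-in for the survey data: its first value is the string printed in this section, and the other values are made up.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the survey data (only the first string comes from the survey)
responses <- tibble(
  Respondent = 1:3,
  WorkToolsSelect = c(
    "Amazon Web services,Oracle Data Mining/ Oracle R Enterprise,Perl",
    "Python,R",
    NA
  )
)

# Keep the original column and add one row per tool in `work_tools`.
# strsplit() turns each string into a character vector; unnest() then
# expands each vector into separate rows. A missing answer stays as one NA row.
tools <- responses %>%
  mutate(work_tools = strsplit(WorkToolsSelect, ",")) %>%
  unnest(work_tools)
```

Keeping WorkToolsSelect alongside work_tools repeats each respondent's full answer on every row, which matches the shape of the output in this section.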

## [1] "Amazon Web services,Oracle Data Mining/ Oracle R Enterprise,Perl"
## # A tibble: 6 x 6
##   Respondent WorkToolsSelect LanguageRecomme… EmployerIndustry
##        <int> <chr>           <chr>            <chr>           
## 1          1 Amazon Web ser… F#               Internet-based  
## 2          1 Amazon Web ser… F#               Internet-based  
## 3          1 Amazon Web ser… F#               Internet-based  
## 4          2 Amazon Machine… Python           Mix of fields   
## 5          2 Amazon Machine… Python           Mix of fields   
## 6          2 Amazon Machine… Python           Mix of fields   
## # ... with 2 more variables: WorkAlgorithmsSelect <chr>, work_tools <chr>

3. Counting users of each tool

Now that we’ve split apart all of the tools used by each respondent, we can figure out which tools are the most popular.
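One possible way to count users per tool is sketched here, assuming dplyr is installed. The `tools` table is toy data in the long format described above (one row per respondent and tool), and the name `tool_count` is a hypothetical choice.

```r
library(dplyr)

# Toy long-format data: one row per respondent-tool pair (illustrative values)
tools <- tibble(
  Respondent = c(1, 1, 2, 2, 3, 4),
  work_tools = c("Python", "SQL", "Python", "R", "Python", NA)
)

# Count rows per tool and sort from most to least common.
# group_by() keeps NA as its own group, so respondents who gave
# no answer appear as an <NA> row unless they are filtered out first.
tool_count <- tools %>%
  group_by(work_tools) %>%
  summarise(count = n()) %>%
  arrange(desc(count))
```

Adding `filter(!is.na(work_tools))` before `group_by()` would drop the missing answers if they are not of interest.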

## # A tibble: 6 x 2
##   work_tools        count
##   <chr>             <int>
## 1 Python             6073
## 2 R                  4708
## 3 SQL                4261
## 4 Jupyter notebooks  3206
## 5 TensorFlow         2256
## 6 <NA>               2198

5. The R vs Python debate

Within the field of data science, there is a lot of debate among professionals about whether R or Python should reign supreme. You can see from our last figure that R and Python are the two most commonly used languages, but it’s possible that many respondents use both. Let’s take a look at how many people use only R, only Python, or both.
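The labelling step could be sketched as follows, assuming dplyr is installed. The helper `uses_tool()` and the toy `responses` table are hypothetical, and this is not necessarily how the original analysis did it. The sketch matches whole comma-separated entries, so that a tool such as "Oracle Data Mining/ Oracle R Enterprise" is not mistaken for R, and it labels respondents with no answer as "neither", which is an assumption.

```r
library(dplyr)

# Toy stand-in for the survey data (illustrative values)
responses <- tibble(
  Respondent = 1:5,
  WorkToolsSelect = c(
    "Amazon Web services,Oracle Data Mining/ Oracle R Enterprise,Perl",
    "Python,R,SQL",
    "C/C++,Jupyter notebooks,Python",
    "R,SQL",
    NA
  )
)

# Hypothetical helper: TRUE when `tool` is exactly one of the
# comma-separated entries in each element of `selected`
uses_tool <- function(selected, tool) {
  vapply(strsplit(selected, ","), function(x) tool %in% x, logical(1))
}

debate_tools <- responses %>%
  mutate(
    uses_r = uses_tool(WorkToolsSelect, "R"),
    uses_python = uses_tool(WorkToolsSelect, "Python"),
    # Conditions are checked in order; the first match wins
    language_preference = case_when(
      uses_r & uses_python ~ "both",
      uses_r ~ "R",
      uses_python ~ "Python",
      TRUE ~ "neither"
    )
  ) %>%
  select(-uses_r, -uses_python)
```

A plain substring search for "R" would be shorter, but it would also match any tool name that happens to contain a capital R.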

## # A tibble: 6 x 6
##   Respondent WorkToolsSelect LanguageRecomme… EmployerIndustry
##        <int> <chr>           <chr>            <chr>           
## 1          1 Amazon Web ser… F#               Internet-based  
## 2          2 Amazon Machine… Python           Mix of fields   
## 3          3 C/C++,Jupyter … Python           Technology      
## 4          4 Jupyter notebo… Python           Academic        
## 5          5 C/C++,Cloudera… R                Government      
## 6          6 SQL             Python           Non-profit      
## # ... with 2 more variables: WorkAlgorithmsSelect <chr>,
## #   language_preference <chr>

6. Plotting R vs Python users

Now we just need to take a closer look at how many respondents use R, Python, and both!
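A bar chart of the group sizes might be built along these lines. This is a sketch that assumes dplyr and ggplot2 are installed and uses toy data; the names `debate_plot` and `p` and the axis labels are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Toy data: one row per respondent with a precomputed label (illustrative values)
debate_tools <- tibble(
  language_preference = c("both", "both", "Python", "R", "neither", "both")
)

# Number of respondents in each language_preference group
debate_plot <- debate_tools %>%
  group_by(language_preference) %>%
  summarise(count = n())

# geom_col() draws bars whose heights are the precomputed counts;
# call print(p) to render the chart
p <- ggplot(debate_plot, aes(x = language_preference, y = count)) +
  geom_col() +
  labs(x = "Language preference", y = "Respondents")
```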

7. Language recommendations

It looks like the largest group of professionals programs in both Python and R. But what happens when they are asked which language they recommend to new learners? Do R lovers always recommend R?
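One way to rank recommendations within each group is sketched below, assuming dplyr 1.0 or later is installed. The data is a toy example and the name `recommendations` is hypothetical. The cutoff of four rows per group is read off the output in this section, not from the original code.

```r
library(dplyr)

# Toy data: each respondent's group and recommended language (illustrative values)
debate_tools <- tibble(
  language_preference = c("R", "R", "R", "Python", "Python", "Python"),
  LanguageRecommendationSelect = c("R", "R", "Python", "Python", "Python", "SQL")
)

recommendations <- debate_tools %>%
  group_by(language_preference, LanguageRecommendationSelect) %>%
  # "drop_last" leaves the result grouped by language_preference only
  summarise(count = n(), .groups = "drop_last") %>%
  # arrange() ignores grouping by default, so sort by group first
  arrange(language_preference, desc(count)) %>%
  # row_number() restarts at 1 within each language_preference group
  mutate(row_no = row_number()) %>%
  filter(row_no <= 4)
```

Because the result stays grouped by language_preference, `row_no` ranks each recommendation within its own group and does not run across the whole table.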

## # A tibble: 16 x 4
## # Groups:   language_preference [4]
##    language_preference LanguageRecommendationSelect count row_no
##    <chr>               <chr>                        <int>  <int>
##  1 both                Python                        1917      1
##  2 both                R                              912      2
##  3 both                SQL                            108      3
##  4 both                Scala                           28      4
##  5 neither             Python                         196      1
##  6 neither             R                               94      2
##  7 neither             SQL                             53      3
##  8 neither             Matlab                          47      4
##  9 Python              Python                        1742      1
## 10 Python              C/C++/C#                        48      2
## 11 Python              Matlab                          43      3
## 12 Python              SQL                             36      4
## 13 R                   R                              632      1
## 14 R                   Python                         194      2
## 15 R                   SQL                             75      3
## 16 R                   C/C++/C#                        27      4

9. The moral of the story

So we’ve made it to the end. We’ve found that Python is the most popular language used among Kaggle data scientists, but R users aren’t far behind. And while Python users may highly recommend that new learners learn Python, would R users find the following statement TRUE or FALSE?