Overview and Objective

In 2020 Kaggle surveyed its users. The questions and their responses are available as a CSV file (kaggle_survey_2020_responses.csv). The objective is to explore and summarize the first 20 questions. As we’ll see, some questions are simple to read in because their responses occupy a single column. Others extend across multiple columns because users can select more than one choice. Note that the first row of data holds the full question text rather than a response, so it is dropped before plotting.

# load the tidyverse (ggplot2, dplyr, tidyr and the %>% pipe are used below)
library(tidyverse)
# read in data from local downloads folder
f <- read.csv('~/Downloads/kaggle_survey_2020_responses.csv', na.strings = "")
f[1:4, c(2, 4, 8, 9)]
##                            Q1                                        Q3
## 1 What is your age (# years)? In which country do you currently reside?
## 2                       35-39                                  Colombia
## 3                       30-34                  United States of America
## 4                       35-39                                 Argentina
##                                                                                                      Q7_Part_1
## 1 What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Python
## 2                                                                                                       Python
## 3                                                                                                       Python
## 4                                                                                                         <NA>
##                                                                                                 Q7_Part_2
## 1 What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - R
## 2                                                                                                       R
## 3                                                                                                       R
## 4                                                                                                    <NA>
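
The printout shows that the first row of the file holds the full question text rather than a response. A minimal sketch of separating the two is given below. It uses a small made-up data frame, and the names `toy`, `questions`, and `responses` are illustrative assumptions rather than objects from the analysis:

```r
# Toy stand-in for the survey layout: row 1 is the question text,
# the remaining rows are responses (values here are made up).
toy <- data.frame(
  Q1 = c("What is your age (# years)?", "35-39", "30-34"),
  Q3 = c("In which country do you currently reside?", "Colombia", "Japan"),
  stringsAsFactors = FALSE
)

# Keep the question text as a named character vector for later labelling
questions <- unlist(toy[1, ])

# Drop the question row so only real responses remain
responses <- toy[-1, , drop = FALSE]
rownames(responses) <- NULL

questions[["Q3"]]
nrow(responses)
```

Keeping the question text in its own vector means it can still be looked up by column name after the first row has been removed.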

Handling the One-Column Questions

First, the simple questions, which account for 11 of the 20:

# Column names for the simple questions have the form "Q#".
# Regex: "Q" begins the name and the question number (1-20) ends it.
simple_questions <- f[, grep(pattern = '^Q(20|1[0-9]|[1-9])$', x = colnames(f))]
n <- ncol(simple_questions)

head(simple_questions[, 1:4])
##                            Q1                                     Q2
## 1 What is your age (# years)? What is your gender? - Selected Choice
## 2                       35-39                                    Man
## 3                       30-34                                    Man
## 4                       35-39                                    Man
## 5                       30-34                                    Man
## 6                       30-34                                    Man
##                                          Q3
## 1 In which country do you currently reside?
## 2                                  Colombia
## 3                  United States of America
## 4                                 Argentina
## 5                  United States of America
## 6                                     Japan
##                                                                                                                Q4
## 1 What is the highest level of formal education that you have attained or plan to attain within the next 2 years?
## 2                                                                                                 Doctoral degree
## 3                                                                                                 Master’s degree
## 4                                                                                               Bachelor’s degree
## 5                                                                                                 Master’s degree
## 6                                                                                                 Master’s degree
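
A quick way to sanity-check a column-name pattern is to run it against a handful of names first. The sketch below does that with hypothetical names written in the style of the survey file. The pattern anchors the whole alternation with a single `$`, so names with a suffix or a number above 20 are excluded:

```r
# Hypothetical column names in the style of the survey file
cols <- c("Time", "Q1", "Q2", "Q7_Part_1", "Q7_OTHER", "Q10_Part_1",
          "Q15", "Q20", "Q21", "Q26_A_Part_1")

# Single-column questions 1-20: "Q", a number from 1 to 20, then end of string
single <- grep("^Q(20|1[0-9]|[1-9])$", cols, value = TRUE)

single
```

With `value = TRUE`, `grep()` returns the matching names themselves rather than their positions, which makes the result easier to inspect.
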
# for each column, create a bar chart and assign it to a variable named "Q#"
for (i in seq_len(n)) {
  # drop row 1, which holds the question text rather than a response
  responses <- data.frame(response = simple_questions[-1, i])
  q_num <- colnames(simple_questions)[i]  # "Q#"

  p <- ggplot(responses, aes(x = response)) +
        geom_bar(width = 0.5, fill = "#21BEFF") +
        labs(subtitle = q_num, x = "") +
        theme(plot.subtitle = element_text(hjust = 0.5),
              axis.text.x = element_text(angle = 45, hjust = 1))

  assign(q_num, p) # assign ggplot to Q# for each question
}
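
The loop uses `assign()` to create one variable per question. An alternative worth considering is to collect the per-question results in a single named list. The sketch below illustrates the idea with frequency tables on made-up data, where `toy` and `counts` are illustrative names. A list of ggplot objects could be built the same way:

```r
# Toy responses (made up) with the question row already removed
toy <- data.frame(
  Q1 = c("35-39", "30-34", "30-34"),
  Q2 = c("Man", "Woman", "Man"),
  stringsAsFactors = FALSE
)

# One frequency table per question, collected in a list named by question
counts <- lapply(toy, table)

names(counts)
counts$Q1["30-34"]
```

A list keeps the workspace tidy and lets later code loop over every question without knowing the variable names in advance.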

Handling the Multiple-Choice Questions

Some questions have answers spread across many columns. Nine of the first 20 questions have multiple parts:

multi_q <- c(7, 9, 10, 12, 14, 16, 17, 18, 19)
# Select the Questions with multiple parts: "Q#_..."
multi_part_questions <- f[, grep(pattern = '^Q(1[0-9]_|[1-9]_)', x = colnames(f))]

head(multi_part_questions[, 1:3])
##                                                                                                      Q7_Part_1
## 1 What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Python
## 2                                                                                                       Python
## 3                                                                                                       Python
## 4                                                                                                         <NA>
## 5                                                                                                       Python
## 6                                                                                                       Python
##                                                                                                 Q7_Part_2
## 1 What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - R
## 2                                                                                                       R
## 3                                                                                                       R
## 4                                                                                                    <NA>
## 5                                                                                                    <NA>
## 6                                                                                                    <NA>
##                                                                                                   Q7_Part_3
## 1 What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - SQL
## 2                                                                                                       SQL
## 3                                                                                                       SQL
## 4                                                                                                      <NA>
## 5                                                                                                       SQL
## 6                                                                                                      <NA>
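
Each column’s first row ends with the choice it represents, after the text "Selected Choice -". A minimal sketch of extracting that label with base R’s `sub()` follows. The two strings are copied from the printout above, and the names `header` and `labels` are illustrative:

```r
# Question text as it appears in the first row of two Q7 columns
header <- c(
  "What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Python",
  "What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - R"
)

# Drop everything up to and including "Selected Choice -" plus any spaces
labels <- sub("^.*Selected Choice -[[:space:]]*", "", header)

labels
```

Removing the trailing spaces in the same pattern avoids labels that start with a blank, which would otherwise show up in axis text.
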
for (d in multi_q){
  q_num <- paste0("Q", d) # "Q#"
  
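  # the trailing "_" keeps e.g. "Q1" from also matching "Q10_..." columns
  k <- multi_part_questions %>% select(starts_with(paste0(q_num, "_")))
  
  # Replace each column name with its "Selected Choice" option (taken from row 1)
  for(j in seq_len(ncol(k))){
    names(k)[j] <- trimws(gsub(".*(Selected Choice -)", "", k[1, j]))
  }

  # Drop the question-text row and reshape to long format: one row per selection
  k_long <- gather(k[-1,], key = choice, na.rm = TRUE)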
  
  p <- ggplot(k_long, aes(x = choice)) + 
    geom_bar(width = 0.5, fill = "#21BEFF") + 
    labs(subtitle = q_num, x = "") + 
    theme(plot.subtitle = element_text(hjust = 0.5),
          axis.text.x = element_text(angle = 45, hjust = 1))

  assign(q_num, p) # Assign ggplot to Q# for each question
}
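
Because respondents can tick several options, the bars for these questions count selections and do not sum to the number of respondents. A minimal sketch of computing both the counts and the share of respondents per choice is shown below. The data frame `toy` and its values are made up for illustration:

```r
# Toy multi-select block (made up): one column per choice, NA = not selected
toy <- data.frame(
  Python = c("Python", "Python", NA, "Python"),
  R      = c("R", NA, NA, NA),
  SQL    = c("SQL", "SQL", NA, NA),
  stringsAsFactors = FALSE
)

# Number of respondents who ticked each choice
n_selected <- colSums(!is.na(toy))

# Share of all respondents (rows) who ticked each choice
share <- n_selected / nrow(toy)

n_selected
share
```

Dividing by the number of rows rather than the number of selections keeps the shares comparable across questions with different numbers of options.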

Each question now has its own plot object, named after the question (Q1, Q2, and so on), so the plots can be displayed and annotated by theme.

Demographics

Age

  • Most of the respondents are in their late teens, twenties, and early thirties.
Q1 + labs(title = "Age of Respondents", x = "Age") + theme(plot.title = element_text(hjust = 0.5))

Gender

  • The majority of respondents self-identified as men.
Q2 + labs(title = "Gender of Respondents") + theme(plot.title = element_text(hjust = 0.5))

Residence

  • Most respondents are from India. A relatively small number are from the United States.
Q3 + labs(title = "Country of Residence") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
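
With many countries on one axis, ordering the bars by frequency can make the chart easier to read. A minimal sketch of re-levelling a factor by count in base R follows, using made-up responses. The names `country` and `country_ordered` are illustrative:

```r
# Toy country responses (made up)
country <- c("India", "Japan", "India", "Brazil", "India", "Japan")

# Re-level the factor so categories run from least to most frequent;
# ggplot2 draws discrete axes in factor-level order
country_ordered <- factor(country, levels = names(sort(table(country))))

levels(country_ordered)
```

On a flipped axis this ordering should place the most frequent category at the top of the chart.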

Experience

Education

Question 4: What is the highest level of formal education that you have attained or plan to attain within the next 2 years?

  • The majority of respondents have or plan to attain a bachelor’s or master’s degree.
Q4 + labs(title = "Formal education") + theme(plot.title = element_text(hjust = 0.5))

Question 5: Select the title most similar to your current role:
Question 6: How many years have you been writing code and/or programming?

  • Most respondents are students. A large number are data scientists and software engineers.
  • Most are either new to programming (up to 2 years) or have moderate experience (3-5 years). Relatively few have more than 5 years of experience.
Q5 + labs(title = "Position Title") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()

Q6 + labs(title = "Years of Programming Experience", x = "Years") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
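
By default R sorts character categories alphabetically, so an option such as "10-20 years" is placed before "3-5 years". A minimal sketch of fixing the order with explicit factor levels is given below. The level wording is an assumption about the Q6 answer options and should be checked against `unique()` on the real column, and the responses are made up:

```r
# Assumed answer options for Q6, listed in their natural order
# (the exact wording is an assumption; verify against the real data)
experience_levels <- c("I have never written any code", "< 1 years", "1-2 years",
                       "3-5 years", "5-10 years", "10-20 years", "20+ years")

# Toy responses (made up)
experience <- c("3-5 years", "10-20 years", "< 1 years", "3-5 years")

# Fixing the level order makes tables and bar charts follow it
experience <- factor(experience, levels = experience_levels)

table(experience)
```

Any response that does not match a listed level becomes `NA`, so a quick `anyNA()` check after the conversion helps catch wording mismatches.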

Question 20: What is the size of the company where you are employed?

  • Most respondents work at small companies with fewer than 50 employees. Many chose not to answer the question.
Q20 + labs(title = "Size of Company Where Working", x = "# of Employees") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
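
Since the file was read with `na.strings = ""`, skipped questions appear as `NA`. A minimal sketch of counting them alongside the real answers is shown below, using made-up responses. The names `company_size`, `size_counts`, and `skipped_share` are illustrative:

```r
# Toy company-size responses (made up); NA marks a skipped question
company_size <- c("0-49 employees", NA, "0-49 employees",
                  "10,000 or more employees", NA)

# useNA = "ifany" adds an <NA> entry so skipped answers are counted too
size_counts <- table(company_size, useNA = "ifany")

# Share of respondents who skipped the question
skipped_share <- mean(is.na(company_size))

size_counts
skipped_share
```

Reporting the skipped share next to the chart makes it clear how much of the sample the visible bars actually describe.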

Language and Library Preferences

Question 7: What programming languages do you use on a regular basis?
Question 8: What programming language would you recommend an aspiring data scientist to learn first?
Question 14: What data visualization libraries or tools do you use on a regular basis?

Q7 + labs(title = "Most Used Languages", x = "Programming Languages") + theme(plot.title = element_text(hjust = 0.5))

Q8 + labs(title = "Recommended Languages to Learn", x = "Programming Languages") + theme(plot.title = element_text(hjust = 0.5))

Q14 + labs(title = "Most Used Data Visualization Libraries", x = "Visualization Libraries/Tools") + theme(plot.title = element_text(hjust = 0.5))

Machine Learning

Question 15: For how many years have you used machine learning methods?
Question 16: Which of the following machine learning frameworks do you use on a regular basis?
Question 17: Which of the following machine learning algorithms do you use on a regular basis?

Q15 + labs(title = "Years Using Machine Learning Methods", x = "# Years") + theme(plot.title = element_text(hjust = 0.5))

Q16 + labs(title = "Machine Learning Frameworks", x = "Frameworks") + theme(plot.title = element_text(hjust = 0.5))

Q17 + labs(title = "Machine Learning Algorithms", x = "Algorithms") + theme(plot.title = element_text(hjust = 0.5))

Other Questions

Question 9: Which integrated development environments (IDE’s) do you use on a regular basis?

Q9 + labs(title = "Most Used IDEs", x = "Integrated Development Environments") + theme(plot.title = element_text(hjust = 0.5))

Question 10: Which of the following hosted notebook products do you use on a regular basis?

Q10 + labs(title = "Most Used Hosted Notebooks", x = "Hosted Notebook Products") + theme(plot.title = element_text(hjust = 0.5))

Question 11: What type of computing platform do you use most often for your data science projects?
Question 12: Which types of specialized hardware do you use on a regular basis?
Question 13: Approximately how many times have you used a TPU (Tensor Processing Unit)?

Q11 + labs(title = "Type of Computing Platform Used", x = "Computing Platform") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()

Q12 + labs(title = "Specialized Hardware Use", x = "Hardware Type") + theme(plot.title = element_text(hjust = 0.5))

Q13 + labs(title = "Use of TPUs", x = "# of Times") + theme(plot.title = element_text(hjust = 0.5))

Question 19: Which of the following natural language processing (NLP) methods do you use on a regular basis?

Q19 + labs(title = "Natural Language Processing Methods", x = "NLP") + theme(plot.title = element_text(hjust = 0.5))

Summary

In this project, I used R’s tidyverse package to explore the first 20 questions of Kaggle’s 2020 survey. I created simple bar charts to compare the different responses. This descriptive analysis suggests that a lot of Kaggle’s users are young, male students, many of whom are from India. Many program with Python and use the Matplotlib library, but have little machine learning experience. Many already have, or plan to attain, a bachelor’s or master’s degree.