Kaggle 2020 Survey Exploration (Project-1)

Overview and Objective
Handling the One-Column Questions
Handling the Multiple-Choice Questions
Demographics
- Age
- Gender
- Residence
Experience
- Education
Language and Library Preferences
Machine Learning
Other Questions
Summary

Overview and Objective

In 2020, Kaggle surveyed its users. The questions and their responses are available as a CSV file. The objective is to explore and summarize the first 20 questions. As we’ll see, some questions are simple to read in because their responses occupy a single column. Other questions extend across multiple columns because respondents could select more than one choice.

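# load the tidyverse (provides ggplot2, dplyr, and tidyr, all used below)
library(tidyverse)

# read in data from local downloads folder; empty strings become NA
f <- read.csv('~/Downloads/kaggle_survey_2020_responses.csv', na.strings = "")
f[1:4, c(2, 4, 8, 9)]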

##                            Q1                                        Q3
## 1 What is your age (# years)? In which country do you currently reside?
## 2                       35-39                                  Colombia
## 3                       30-34                  United States of America
## 4                       35-39                                 Argentina
##                                                                                                      Q7_Part_1
## 1 What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Python
## 2                                                                                                       Python
## 3                                                                                                       Python
## 4                                                                                                         <NA>
##                                                                                                 Q7_Part_2
## 1 What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - R
## 2                                                                                                       R
## 3                                                                                                       R
## 4                                                                                                    <NA>

Handling the One-Column Questions

First, the simple questions (11 of the 20), each stored in a single column:

# Simple questions have column names of the form "Q#", with no "_Part" suffix.
# Regex: "Q" begins the name and a number from 1 to 20 ends it.
simple_questions <- f[, grep(pattern = '^Q([1-9]|1[0-9]|20)$', x = colnames(f))]
n <- ncol(simple_questions)

head(simple_questions[, 1:4])

##                            Q1                                     Q2
## 1 What is your age (# years)? What is your gender? - Selected Choice
## 2                       35-39                                    Man
## 3                       30-34                                    Man
## 4                       35-39                                    Man
## 5                       30-34                                    Man
## 6                       30-34                                    Man
##                                          Q3
## 1 In which country do you currently reside?
## 2                                  Colombia
## 3                  United States of America
## 4                                 Argentina
## 5                  United States of America
## 6                                     Japan
##                                                                                                                Q4
## 1 What is the highest level of formal education that you have attained or plan to attain within the next 2 years?
## 2                                                                                                 Doctoral degree
## 3                                                                                                 Master’s degree
## 4                                                                                               Bachelor’s degree
## 5                                                                                                 Master’s degree
## 6                                                                                                 Master’s degree
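As a quick check of the column-name pattern, the sketch below runs an anchored regex against a handful of hypothetical column names (they imitate the survey's naming scheme but are not read from the file). It also shows why each alternative needs to be followed by `$`.

```r
# Hypothetical column names that mimic the survey's naming scheme
cols <- c("Q1", "Q2", "Q7_Part_1", "Q7_OTHER", "Q10_Part_3",
          "Q15", "Q20", "Q21", "Q26_A_Part_1")

# "Q", then a number from 1 to 20, then end of string
single <- grep(pattern = "^Q([1-9]|1[0-9]|20)$", x = cols, value = TRUE)
print(single)

# If an alternative is not followed by "$", a multi-part name can slip through
loose <- grepl("^Q(20|1[0-9]$|[1-9]$)", "Q20_Part_1")
print(loose)  # TRUE, because "20" is not anchored to the end of the name
```

Only the names with no suffix and a number between 1 and 20 are kept; "Q21" and every "_Part" column are excluded.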

# for each column, create a bar chart and assign it to a variable named "Q#"
for (i in seq_len(n)) {
  # drop row 1, which holds the question text rather than a response
  responses <- data.frame(response = simple_questions[-1, i])
  q_num <- colnames(simple_questions)[i]  # "Q#"

  p <- ggplot(responses, aes(x = response)) +
        geom_bar(width = 0.5, fill = "#21BEFF") +
        labs(subtitle = q_num, x = "") +
        theme(plot.subtitle = element_text(hjust = 0.5),
              axis.text.x = element_text(angle = 45, hjust = 1))

  assign(q_num, p) # assign ggplot to Q# for each question
}
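The loop above relies on `assign()` to create one variable per question. The sketch below illustrates that mechanism with plain strings standing in for ggplot objects, and contrasts it with a named list, which is a common alternative and not what this analysis uses. All names here are illustrative.

```r
# Stand-in values instead of ggplot objects, to keep the sketch self-contained
q_names <- paste0("Q", 1:3)

# Approach used in the loop above: one variable per question
for (q in q_names) {
  assign(q, paste("plot for", q))
}
print(Q2)

# Alternative (an assumption, not the author's method): a single named list
plots <- setNames(lapply(q_names, function(q) paste("plot for", q)), q_names)
print(plots[["Q2"]])
```

Separate variables make the later sections short to write (`Q1 + labs(...)`), while a list keeps the workspace tidier and is easier to iterate over.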

Handling the Multiple-Choice Questions

Some questions have answers spread across many columns. Nine of the first 20 questions have multiple parts:

multi_q <- c(7, 9, 10, 12, 14, 16, 17, 18, 19)
# Select the Questions with multiple parts: "Q#_..."
multi_part_questions <- f[, grep(pattern = '^Q(1[0-9]_|[1-9]_)', x = colnames(f))]

head(multi_part_questions[, 1:3])

##                                                                                                      Q7_Part_1
## 1 What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Python
## 2                                                                                                       Python
## 3                                                                                                       Python
## 4                                                                                                         <NA>
## 5                                                                                                       Python
## 6                                                                                                       Python
##                                                                                                 Q7_Part_2
## 1 What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - R
## 2                                                                                                       R
## 3                                                                                                       R
## 4                                                                                                    <NA>
## 5                                                                                                    <NA>
## 6                                                                                                    <NA>
##                                                                                                   Q7_Part_3
## 1 What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - SQL
## 2                                                                                                       SQL
## 3                                                                                                       SQL
## 4                                                                                                      <NA>
## 5                                                                                                       SQL
## 6                                                                                                      <NA>

for (d in multi_q){
  q_num <- paste0("Q", d) # "Q#"

  # Columns for this question only; the trailing "_" keeps e.g. "Q1" from matching "Q10"
  k <- multi_part_questions %>% select(starts_with(paste0(q_num, "_")))

  # Rename each column to its choice label, taken from the question text in row 1
  for(j in seq_len(ncol(k))){
    names(k)[j] <- trimws(sub(".*Selected Choice -", "", k[1,j]))
  }

  # Reshape to long format: one row per selected choice (row 1 and NAs dropped)
  k_long <- gather(k[-1,], key = choice, na.rm = TRUE)

  p <- ggplot(k_long, aes(x = choice)) +
    geom_bar(width = 0.5, fill = "#21BEFF") +
    labs(subtitle = q_num, x = "") +
    theme(plot.subtitle = element_text(hjust = 0.5),
          axis.text.x = element_text(angle = 45, hjust = 1))

  assign(q_num, p) # Assign ggplot to Q# for each question
}
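To make the wide-to-long step in the loop above concrete, here is a minimal sketch on a toy data frame. The data are invented for illustration: each column is one choice, and a cell is NA when the respondent did not select it.

```r
library(tidyr)

# Toy stand-in for one multi-part question (hypothetical data)
k <- data.frame(
  Python = c("Python", "Python", NA, "Python"),
  R      = c("R", NA, NA, "R"),
  SQL    = c(NA, "SQL", NA, "SQL"),
  stringsAsFactors = FALSE
)

# One row per selected choice; unselected (NA) cells are dropped
k_long <- gather(k, key = "choice", value = "value", na.rm = TRUE)

# Counting rows per choice gives the bar heights that geom_bar would draw
counts <- table(k_long$choice)
print(counts)
```

Because a respondent can appear in several rows of the long table, the bar heights count selections, not respondents.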

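Now that each question has been assigned to its own plot object (Q1, Q2, and so on), the sections below add titles and axis labels to the plots and summarize what they show.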

Demographics

Age

Most of the respondents are in their late teens, twenties, and early thirties.

Q1 + labs(title = "Age of Respondents", x = "Age") + theme(plot.title = element_text(hjust = 0.5))

Gender

The majority of respondents identified as men.

Q2 + labs(title = "Gender of Respondents") + theme(plot.title = element_text(hjust = 0.5))

Residence

Most respondents are from India. A relatively small number are from the United States.

Q3 + labs(title = "Country of Residence") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
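Statements such as "the majority" can be backed with percentages as well as bar heights. The sketch below computes response shares with base R on a small invented vector; in the analysis itself the input would be a survey column such as `simple_questions[-1, "Q2"]`.

```r
# Hypothetical responses, including one non-response
gender <- c("Man", "Man", "Woman", "Man", NA, "Man", "Woman", "Man")

# table() drops NA by default, so shares are among those who answered
shares <- round(100 * prop.table(table(gender)), 1)
print(shares)
```

Whether non-responses belong in the denominator is a judgment call; passing `useNA = "ifany"` to `table()` would include them.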

Experience

Education

Question 4: What is the highest level of formal education that you have attained or plan to attain within the next 2 years?

The majority of respondents have, or plan to attain, a bachelor’s or master’s degree.

Q4 + labs(title = "Formal education") + theme(plot.title = element_text(hjust = 0.5))

Question 5: Select the title most similar to your current role:
Question 6: How many years have you been writing code and/or programming?

Most respondents are students. A large number are data scientists and software engineers.
Most are either new to programming (2 years or less) or have moderate experience (3-5 years). Relatively few have more than 5 years of experience.

Q5 + labs(title = "Position Title") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()

Q6 + labs(title = "Years of Programming Experience", x = "Years") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()

Question 20: What is the size of the company where you are employed?

Most respondents work at small companies with fewer than 50 employees. Many chose not to answer the question.

Q20 + labs(title = "Size of Company Where Working", x = "# of Employees") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
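ggplot2 places character categories on a discrete axis in alphabetical order, so ordinal answers such as experience ranges can appear out of sequence. One possible remedy is to convert the column to a factor with explicit levels before plotting. The sketch below uses invented responses, and the level order is an assumption that should be checked against the survey's actual answer options.

```r
# Hypothetical answers to an ordinal question such as Q6
years <- c("3-5 years", "< 1 years", "1-2 years", "10-20 years",
           "< 1 years", "5-10 years", "1-2 years")

# Assumed natural order of the answer options
lvls <- c("< 1 years", "1-2 years", "3-5 years", "5-10 years", "10-20 years")

# A factor keeps this order in tables and on a ggplot axis
years_ordered <- factor(years, levels = lvls)
print(table(years_ordered))
```

Any response that is not listed in `lvls` becomes NA, so the level vector should cover every option.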

Language and Library Preferences

Question 7: What programming languages do you use on a regular basis?
Question 8: What programming language would you recommend an aspiring data scientist to learn first?
Question 14: What data visualization libraries or tools do you use on a regular basis?

Most of the respondents use Python. Some use SQL and R.
Most would recommend that aspiring data scientists learn Python first.
Many use the Matplotlib or Seaborn libraries to visualize their data.

Q7 + labs(title = "Most Used Languages", x = "Programming Languages") + theme(plot.title = element_text(hjust = 0.5))

Q8 + labs(title = "Recommended Languages to Learn", x = "Programming Languages") + theme(plot.title = element_text(hjust = 0.5))

Q14 + labs(title = "Most Used Data Visualization Libraries", x = "Visualization Libraries/Tools") + theme(plot.title = element_text(hjust = 0.5))
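For nominal categories such as languages or libraries, sorting the bars by frequency can make the "most used" comparison easier to read. A minimal base R sketch on invented data:

```r
# Hypothetical selections from a multi-choice question
langs <- c("Python", "SQL", "Python", "R", "Python", "SQL")

# Order the factor levels from most to least frequent
freq <- sort(table(langs), decreasing = TRUE)
langs_by_freq <- factor(langs, levels = names(freq))
print(levels(langs_by_freq))
```

Plotting `langs_by_freq` instead of the raw character vector would then draw the bars in descending order of count.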

Machine Learning

Question 15: For how many years have you used machine learning methods?
Question 16: Which of the following machine learning frameworks do you use on a regular basis?
Question 17: Which of the following machine learning algorithms do you use on a regular basis?

Most of the respondents have little to no experience using machine learning methods.
Those who do most often use the Scikit-learn, TensorFlow, and Keras frameworks.
The most commonly used algorithms are linear/logistic regression and decision trees.

Q15 + labs(title = "Years Using Machine Learning Methods", x = "# Years") + theme(plot.title = element_text(hjust = 0.5))

Q16 + labs(title = "Machine Learning Frameworks", x = "Frameworks") + theme(plot.title = element_text(hjust = 0.5))

Q17 + labs(title = "Machine Learning Algorithms", x = "Algorithms") + theme(plot.title = element_text(hjust = 0.5))

Summary

In this project, I used R’s tidyverse package to explore the first 20 questions of Kaggle’s 2020 survey. I created simple bar charts to compare the responses. This descriptive analysis suggests that many of Kaggle’s users are young, male students, and that many of them are from India. Many program in Python and use the Matplotlib library but have little machine learning experience. Many already have, or plan to attain, a bachelor’s or master’s degree.