In 2020, Kaggle surveyed its users. The questions and their responses can be found in this CSV file (found here). The objective is to explore and summarize the first 20 questions. As we’ll see, some questions are simple to read in because their responses occupy a single column. Other questions extend across multiple columns because users could select more than one choice.
library(tidyverse) # provides %>%, ggplot(), select(), and gather() used below
# read in data from local downloads folder
f <- read.csv('~/Downloads/kaggle_survey_2020_responses.csv', na.strings = "")
f[1:4, c(2, 4, 8, 9)]
## Q1 Q3
## 1 What is your age (# years)? In which country do you currently reside?
## 2 35-39 Colombia
## 3 30-34 United States of America
## 4 35-39 Argentina
## Q7_Part_1
## 1 What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Python
## 2 Python
## 3 Python
## 4 <NA>
## Q7_Part_2
## 1 What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - R
## 2 R
## 3 R
## 4 <NA>
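The printout above shows that the question text is read in as the first row of data, so that row has to be set aside before any responses are counted. Here is a minimal sketch of the layout on a toy data frame. The values and the names `toy`, `questions`, and `answers` are invented for illustration and are not taken from the survey file.

```r
# Toy stand-in for the survey layout: row 1 holds the question text,
# the remaining rows hold responses (all values are invented).
toy <- data.frame(
  Q1        = c("What is your age (# years)?", "35-39", "30-34", "35-39"),
  Q7_Part_1 = c("Languages used - Selected Choice - Python", "Python", "Python", NA),
  Q7_Part_2 = c("Languages used - Selected Choice - R", "R", NA, NA),
  stringsAsFactors = FALSE
)

questions <- unlist(toy[1, ])  # named character vector of question text
answers   <- toy[-1, ]         # responses only

# Single-column question: one answer per respondent
table(answers$Q1)

# Multi-column question: one column per choice, NA when the choice was not selected
colSums(!is.na(answers[, c("Q7_Part_1", "Q7_Part_2")]))
```

The single-column question can be tabulated directly, while the multi-column question needs one count per column.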
First, let’s deal with the simple questions (11 of the 20):
# Column names for the simple questions have the form "Q#".
# Regex: "Q" begins the name and a number from 1 to 20 ends it.
simple_questions <- f[, grep(pattern = '^Q(20|1[0-9]|[1-9])$', x = colnames(f))]
n <- ncol(simple_questions)
head(simple_questions[, 1:4])
## Q1 Q2
## 1 What is your age (# years)? What is your gender? - Selected Choice
## 2 35-39 Man
## 3 30-34 Man
## 4 35-39 Man
## 5 30-34 Man
## 6 30-34 Man
## Q3
## 1 In which country do you currently reside?
## 2 Colombia
## 3 United States of America
## 4 Argentina
## 5 United States of America
## 6 Japan
## Q4
## 1 What is the highest level of formal education that you have attained or plan to attain within the next 2 years?
## 2 Doctoral degree
## 3 Master’s degree
## 4 Bachelor’s degree
## 5 Master’s degree
## 6 Master’s degree
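To see how a column-name pattern separates the two kinds of questions, it helps to try it on a handful of names first. The sketch below uses hypothetical column names in the style of the survey file, and a variant of the pattern with a single trailing `$`. Both are assumptions made for illustration.

```r
# Hypothetical column names in the style of the survey file
cols <- c("Q1", "Q2", "Q7_Part_1", "Q7_OTHER", "Q10_Part_1",
          "Q15", "Q20", "Q21", "Q26_A_Part_1")

# Single-column questions 1-20: "Q", a number from 1 to 20, then end of string
simple_pat <- "^Q(20|1[0-9]|[1-9])$"
simple_cols <- cols[grepl(simple_pat, cols)]
simple_cols
# "Q1" "Q2" "Q15" "Q20"

# Multi-column questions 1-19: "Q", a number from 1 to 19, then an underscore
multi_pat <- "^Q(1[0-9]|[1-9])_"
multi_cols <- cols[grepl(multi_pat, cols)]
multi_cols
# "Q7_Part_1" "Q7_OTHER" "Q10_Part_1"
```

Anchoring the whole alternation with one `$` means every branch must reach the end of the name, so a name such as `Q21` cannot slip through on a partial match.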
# for each column, create a ggplot and assign it to an object named after the question
for (i in seq_len(n)) {
  q_num <- colnames(simple_questions)[i] # "Q#"
  # drop row 1, which holds the question text rather than a response
  responses <- data.frame(response = simple_questions[-1, i])
  p <- ggplot(responses, aes(x = response)) +
    geom_bar(width = 0.5, fill = "#21BEFF") +
    labs(subtitle = q_num, x = "") +
    theme(plot.subtitle = element_text(hjust = 0.5),
          axis.text.x = element_text(angle = 45, hjust = 1))
  assign(q_num, p) # assign ggplot to Q# for each question
}
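Each bar in these charts is simply a count of how often a response occurs, so the numbers behind a plot can be checked with `table()`. This is a minimal sketch on invented responses, not survey data. `useNA = "ifany"` also reports missing responses, which is worth checking because unanswered questions are stored as `NA`.

```r
# Invented responses for one single-column question
responses <- c("Man", "Woman", "Man", NA, "Man", "Woman")

# Counts per response, including missing values if there are any
counts <- table(responses, useNA = "ifany")
counts

# Shares of all respondents, rounded for readability
shares <- round(prop.table(counts), 2)
shares
```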
Some questions have answers spread across many columns, one column per choice. Nine of the first 20 questions have multiple parts:
multi_q <- c(7, 9, 10, 12, 14, 16, 17, 18, 19)
# Select the Questions with multiple parts: "Q#_..."
multi_part_questions <- f[, grep(pattern = '^Q(1[0-9]_|[1-9]_)', x = colnames(f))]
head(multi_part_questions[, 1:3])
## Q7_Part_1
## 1 What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - Python
## 2 Python
## 3 Python
## 4 <NA>
## 5 Python
## 6 Python
## Q7_Part_2
## 1 What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - R
## 2 R
## 3 R
## 4 <NA>
## 5 <NA>
## 6 <NA>
## Q7_Part_3
## 1 What programming languages do you use on a regular basis? (Select all that apply) - Selected Choice - SQL
## 2 SQL
## 3 SQL
## 4 <NA>
## 5 SQL
## 6 <NA>
for (d in multi_q) {
  q_num <- paste0("Q", d) # "Q#"
  # the trailing underscore keeps a prefix such as "Q1" from also matching "Q10"
  k <- multi_part_questions %>% select(starts_with(paste0(q_num, "_")))
  # Replace each column name with the choice named in row 1 (the question text)
  for (j in seq_len(ncol(k))) {
    names(k)[j] <- trimws(gsub(".*(Selected Choice -)", "", k[1, j]))
  }
  # Drop the question-text row, then stack all columns into (choice, value) pairs
  k_long <- gather(k[-1, ], key = choice, na.rm = TRUE)
  p <- ggplot(k_long, aes(x = choice)) +
    geom_bar(width = 0.5, fill = "#21BEFF") +
    labs(subtitle = q_num, x = "") +
    theme(plot.subtitle = element_text(hjust = 0.5),
          axis.text.x = element_text(angle = 45, hjust = 1))
  assign(q_num, p) # Assign ggplot to Q# for each question
}
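For a multi-part question, the bar for each choice is the number of respondents whose cell in that choice’s column is not `NA`. The sketch below reproduces that count in base R on a toy question. The data frame `k` and its values are invented for illustration, and the label extraction assumes the question text ends in “Selected Choice - <label>”.

```r
# Toy multi-select question: row 1 is the question text, NA means "not selected"
k <- data.frame(
  Q7_Part_1 = c("Languages used - Selected Choice - Python", "Python", "Python", NA),
  Q7_Part_2 = c("Languages used - Selected Choice - R", "R", NA, NA),
  Q7_Part_3 = c("Languages used - Selected Choice - SQL", "SQL", "SQL", NA),
  stringsAsFactors = FALSE
)

# Pull the choice label out of the question text in row 1
labels <- trimws(sub(".*Selected Choice -", "", unlist(k[1, ])))

# Count the non-missing cells in each column: one count per choice
selected <- colSums(!is.na(k[-1, ]))
names(selected) <- unname(labels)
selected
```

Because a respondent can select several choices, these counts can add up to more than the number of respondents.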
Now that each question (a categorical variable) has been assigned to its own plot, we can add titles and axis labels and display the plots one at a time:
Q1 + labs(title = "Age of Respondents", x = "Age") + theme(plot.title = element_text(hjust = 0.5))
Q2 + labs(title = "Gender of Respondents") + theme(plot.title = element_text(hjust = 0.5))
Q3 + labs(title = "Country of Residence") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
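Every plot call repeats the same title-centering `theme()` code. One way to cut the repetition is a small wrapper function. The helper `add_title()` below is hypothetical (it is not part of ggplot2 or of the original analysis), and it is shown on a toy plot standing in for one of the `Q#` objects.

```r
library(ggplot2)

# Hypothetical helper: add a centered title and, optionally, an x-axis label
add_title <- function(p, title, x = NULL) {
  p <- p + labs(title = title) + theme(plot.title = element_text(hjust = 0.5))
  if (!is.null(x)) {
    p <- p + labs(x = x)  # only override the axis label when one is supplied
  }
  p
}

# Toy plot built from invented responses
toy_plot <- ggplot(data.frame(response = c("a", "b", "a")), aes(x = response)) +
  geom_bar(width = 0.5, fill = "#21BEFF")

titled <- add_title(toy_plot, "Toy Title", x = "Response")
```

With such a helper, a call like `add_title(Q1, "Age of Respondents", x = "Age")` would stand in for the longer `labs()` and `theme()` chain, and `coord_flip()` could still be added afterwards where needed.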
Question 4: What is the highest level of formal education that you have attained or plan to attain within the next 2 years?
Q4 + labs(title = "Formal education") + theme(plot.title = element_text(hjust = 0.5))
Question 5: Select the title most similar to your current role:
Question 6: How many years have you been writing code and/or programming?
Q5 + labs(title = "Position Title") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
Q6 + labs(title = "Years of Programming Experience", x = "Years") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
Question 20: What is the size of the company where you are employed?
Q20 + labs(title = "Size of Company Where Working", x = "# of Employees") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
Question 7: What programming languages do you use on a regular basis?
Question 8: What programming language would you recommend an aspiring data scientist to learn first?
Question 14: What data visualization libraries or tools do you use on a regular basis?
Q7 + labs(title = "Most Used Languages", x = "Programming Languages") + theme(plot.title = element_text(hjust = 0.5))
Q8 + labs(title = "Recommended Languages to Learn", x = "Programming Languages") + theme(plot.title = element_text(hjust = 0.5))
Q14 + labs(title = "Most Used Data Visualization Libraries", x = "Visualization Libraries/Tools") + theme(plot.title = element_text(hjust = 0.5))
Question 15: For how many years have you used machine learning methods?
Question 16: Which of the following machine learning frameworks do you use on a regular basis?
Question 17: Which of the following machine learning algorithms do you use on a regular basis?
Q15 + labs(title = "Years Using Machine Learning Methods", x = "# Years") + theme(plot.title = element_text(hjust = 0.5))
Q16 + labs(title = "Machine Learning Frameworks", x = "Frameworks") + theme(plot.title = element_text(hjust = 0.5))
Q17 + labs(title = "Machine Learning Algorithms", x = "Algorithms") + theme(plot.title = element_text(hjust = 0.5))
Question 9: Which integrated development environments (IDE’s) do you use on a regular basis?
Q9 + labs(title = "Most Used IDE's", x = "Integrated Development Environments") + theme(plot.title = element_text(hjust = 0.5))
Question 10: Which of the following hosted notebook products do you use on a regular basis?
Q10 + labs(title = "Most Used Hosted Notebooks", x = "Hosted Notebook Products") + theme(plot.title = element_text(hjust = 0.5))
Question 11: What type of computing platform do you use most often for your data science projects?
Question 12: Which types of specialized hardware do you use on a regular basis?
Question 13: Approximately how many times have you used a TPU (Tensor Processing Unit)?
Q11 + labs(title = "Type of Computing Platform Used", x = "Computing Platform") + theme(plot.title = element_text(hjust = 0.5)) + coord_flip()
Q12 + labs(title = "Specialized Hardware Use", x = "Hardware Type") + theme(plot.title = element_text(hjust = 0.5))
Q13 + labs(title = "Use of TPUs", x = "# of Times") + theme(plot.title = element_text(hjust = 0.5))
Question 19: Which of the following natural language processing (NLP) methods do you use on a regular basis?
Q19 + labs(title = "Natural Language Processing Methods", x = "NLP") + theme(plot.title = element_text(hjust = 0.5))
In this project, I used R’s tidyverse package to explore the first 20 questions of Kaggle’s 2020 survey, drawing simple bar charts to compare the responses. This descriptive analysis suggests that many of Kaggle’s users are young, male students, and that many of them are from India. Many program in Python and use the Matplotlib library but have little machine learning experience. Many hold, or plan to earn, a bachelor’s or master’s degree.