Beware the self-selection bias!

Analysis

Data

The data we will be using can be found here: 2018 Kaggle ML & DS Survey Challenge. For the purpose of our analysis it has been slightly preprocessed, by which I mean that we cut it down to just a few variables as a beginning.

The loaded data contains 23860 rows; the first row holds the full text of each question.
We will make a small table so that we can easily refer to the full text of the questions in focus and, to keep things simple and make writing code easier, we will rename the first variable to “Q0”.

# Full text of the questions in focus (stored in the first row)
dat %>% 
  filter(row_number() == 1L) %>% 
  select(1:13, -3, -9, -11) %>% 
  gather(columns,"Q") %>% 
  knitr::kable() %>% 
  kable_styling(bootstrap_options = "striped", full_width = F)

columns	Q
Time from Start to Finish (seconds)	Duration (in seconds)
Q1	What is your gender? - Selected Choice
Q2	What is your age (# years)?
Q3	In which country do you currently reside?
Q4	What is the highest level of formal education that you have attained or plan to attain within the next 2 years?
Q5	Which best describes your undergraduate major? - Selected Choice
Q6	Select the title most similar to your current role (or most recent title if retired): - Selected Choice
Q7	In what industry is your current employer/contract (or your most recent employer if retired)? - Selected Choice
Q8	How many years of experience do you have in your current role?
Q9	What is your current yearly compensation (approximate $USD)?

As can be seen from the table above, we will be dealing mainly with the demographics of the participants and the time they spent on the survey.

Preprocessing

The data is now filtered so that it contains only the first few variables. Variables with free-text input were dropped, and factors with a natural order were ordered accordingly.

dat_small <- dat[-1,1:13] %>% 
  select(-3,-9,-11)

dat_small <- rename(dat_small,Q0 = `Time from Start to Finish (seconds)`)

dat_small$Q0 <- as.numeric(dat_small$Q0)

dat_small <- dat_small %>% 
  mutate_if(is.character, as.factor)
dat_small$CountNA <- rowSums(is.na(dat_small)) # number of missing answers per respondent

# Order the factor levels -------------------------------------------------
dat_small <- dat_small %>% 
  mutate(Q9 = fct_relevel(Q9,"0-10,000","10-20,000","20-30,000","30-40,000","40-50,000",
                          "50-60,000","60-70,000","70-80,000","80-90,000","90-100,000",
                          "100-125,000","125-150,000","150-200,000","200-250,000",
                          "250-300,000","300-400,000","400-500,000","500,000+",
                          "I do not wish to disclose my approximate yearly compensation"))
dat_small <- dat_small %>% 
  mutate(Q9 = fct_recode(Q9,
                         "Undisclosed" = "I do not wish to disclose my approximate yearly compensation")) 

dat_small <- dat_small %>%
  mutate(Q4 = fct_relevel(Q4,
                          "No formal education past high school", 
                          "Professional degree",
                          "Some college/university study without earning a bachelors degree",
                          "Bachelors degree",
                          "Masters degree",
                          "Doctoral degree",
                          "I prefer not to answer" ))
dat_small <- dat_small %>% 
  mutate(Q4 = fct_recode(Q4,
                        "No formal education\npast high school" =
                        "No formal education past high school",
                        "Some college/university study\nwithout earning a bachelors degree" =
                        "Some college/university study without earning a bachelors degree"))



# Q8 levels start out in alphabetical order, so move the longer ranges to the end
dat_small <- dat_small %>% 
  mutate(Q8 = fct_relevel(Q8,"10-15", "15-20", "30 +", after = Inf))  
# then slot "20-25" and "25-30" in just before "30 +"
dat_small <- dat_small %>% 
  mutate(Q8 = fct_relevel(Q8,"20-25", "25-30", after = 8))
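
The releveling is needed because a factor built from text sorts its levels alphabetically, so a range label such as "10-15" lands before "2-3". Here is a minimal sketch of that behaviour on a made-up vector (the values are hypothetical and not taken from the survey), using the same `fct_relevel()` call with `after = Inf`:

```r
library(forcats)

# Made-up range labels, for illustration only
x <- factor(c("10-15", "2-3", "0-1", "5-10", "2-3"))

# Default level order is alphabetical, not numeric
levels(x)

# Move "10-15" behind all other levels, as done for Q8
x_ordered <- fct_relevel(x, "10-15", after = Inf)
levels(x_ordered)
```

Only the level order changes; the values stored in the factor stay the same.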

dat_small  %>% 
  head(5) %>% 
  knitr::kable() %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

Q0	Q1	Q2	Q3	Q4	Q5	Q6	Q7	Q8	Q9	CountNA
710	Female	45-49	United States of America	Doctoral degree	Other	Consultant	Other	NA	NA	2
434	Male	30-34	Indonesia	Bachelors degree	Engineering (non-computer focused)	Other	Manufacturing/Fabrication	5-10	10-20,000	0
718	Female	30-34	United States of America	Masters degree	Computer science (software engineering, etc.)	Data Scientist	I am a student	0-1	0-10,000	0
621	Male	35-39	United States of America	Masters degree	Social sciences (anthropology, psychology, sociology, etc.)	Not employed	NA	NA	NA	3
731	Male	22-24	India	Masters degree	Mathematics or statistics	Data Analyst	I am a student	0-1	0-10,000	0

A summary of the number of levels for each question, since all of them were converted to factors.

dat_small %>% 
  summarise_all(nlevels) %>%
  gather(variable, num_levels) %>% 
  filter(!row_number() %in% c(1,11)) %>% 
  knitr::kable() %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

variable	num_levels
Q1	4
Q2	12
Q3	58
Q4	7
Q5	13
Q6	21
Q7	19
Q8	11
Q9	19

Results

“This year, as last year, we set out to conduct an industry-wide survey that presents a truly comprehensive view of the state of data science and machine learning. The survey was live for one week in October, and after cleaning the data we finished with 23,859 responses, a 49% increase over last year!” - this is what the competition page says, said Mr Right. It might be a 49% increase over last year, but in 2017 Kaggle claimed to have surpassed one million accounts.

(23859-16716)/16716*100

Not to mention that the percentage increase comes out as a different number when I do the math: 42.7315147%.

So isn’t that a response rate of a bit over 2%? - I asked, confounded. This is a negligible result by any standard, isn’t it?
- I don’t know! I am a bit puzzled. I thought that the average response rate for an online survey was above 30%.
- I think we have a nonresponse rate greater than 90%.
- You are damn right!
- No, you are Dem Right.
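
The numbers in this exchange can be checked with a few lines of arithmetic. This is only a rough sketch: it assumes exactly one million accounts as the denominator, which is a lower bound taken from the 2017 claim, so the true response rate would be no higher than the figure computed here.

```r
responses_2018 <- 23859   # from the competition page quote
responses_2017 <- 16716   # last year's figure used in the calculation above
accounts       <- 1e6     # assumption: "surpassed one million accounts" in 2017

increase_pct      <- (responses_2018 - responses_2017) / responses_2017 * 100
response_rate_pct <- responses_2018 / accounts * 100
nonresponse_pct   <- 100 - response_rate_pct

round(c(increase      = increase_pct,
        response_rate = response_rate_pct,
        nonresponse   = nonresponse_pct), 2)
```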

But let’s go a bit deeper, requested Mr Right.

As you can see, some people were either very busy or took their time to think carefully about their answers. I would suggest that those who fall outside some reasonable time limits did not take it seriously - said Mr Right.

dat_small %>% 
  filter(CountNA == 0) %>% 
  mutate(minutes = Q0/60) %>%
  ggplot(aes(x = Q1, y = minutes)) +
  geom_boxplot( varwidth = TRUE) +
  theme_classic() +
  coord_flip(ylim = c(0,60)) + # zoom in on the most common length of the survey
  labs(title = "Let's zoom in on the main group",
       subtitle = "Actually, most of the participants spent\naround 20 minutes on the survey",
       y = "Minutes",
       x = "What is your gender? - Selected Choice",
       caption = "Source : 2018 Kaggle ML & DS Survey") +
  theme(plot.title = element_text(size = 22)) +
  theme(plot.subtitle = element_text(size = 14))

- I can see that some rushed through the survey, I believe just to get the promised dataset. On the other hand, it was a very good estimate by the creators of the survey, because the main group of participants completed the questionnaire in between 15 and 35 minutes.
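
The "between 15 and 35 minutes" reading comes from eyeballing the boxplot. A small helper could put a number on it. This is only a sketch: `share_within` is a hypothetical name, and the durations below are made up for illustration.

```r
# Share of respondents whose completion time falls inside a window (in minutes)
share_within <- function(seconds, lo_min, hi_min) {
  minutes <- seconds / 60
  mean(minutes >= lo_min & minutes <= hi_min, na.rm = TRUE)
}

# Made-up durations in seconds, not survey data
toy_seconds <- c(120, 900, 1200, 1500, 2100, 7200)
toy_share <- share_within(toy_seconds, 15, 35)
toy_share
```

On the survey data the call would be `share_within(dat_small$Q0, 15, 35)`, optionally after the same `CountNA == 0` filter that the plot uses.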

But question 4 bothers me a bit, because when you use “and” or “or” you actually ask about several things but allow only one answer.

dat_small %>% 
  group_by(Q4) %>% 
  summarise(n = n()) %>% 
  mutate(prop = n/sum(n)) %>% 
  ggplot(aes(reorder(Q4,n), n)) +
  geom_bar(stat = "identity", fill = "dark blue") +
  coord_flip(ylim = c(0, 12500)) +
  geom_text(aes(label=n), hjust=-0.1) +
  theme_classic() +
  labs(title = "Which question did you answer?",
       subtitle = "What is the highest level attained? or\nWhat are you planning to attain?",
       x = "What is the highest level of formal education\nthat you have attained or plan to attain within 2 years?",
       y = "# of respondents",
       caption= "Source : 2018 Kaggle ML & DS Survey")+
     theme(plot.title = element_text(size = 22)) +
  theme(plot.subtitle = element_text(size = 14))

We can see what’s left after we filter out the students,

dat_small %>% 
  filter(!(Q6 %in% "Student" | Q7 %in% "I am a student")) %>% # drop anyone who is a student by Q6 or Q7
  group_by(Q4) %>% 
  summarise(n = n()) %>% 
  mutate(prop = n/sum(n)) %>% 
  ggplot(aes(reorder(Q4,n), n)) +
  geom_bar(stat = "identity", fill = "dark blue") +
  coord_flip(ylim = c(0, 12500)) +
  geom_text(aes(label=n), hjust=-0.1) +
  theme_classic() +
  theme(axis.text.x = element_blank()) +
  labs(title = "Without students!",
       subtitle = "Presumably those answered what\nis the highest level attained?",
       x = "What is the highest level of formal education\nthat you have attained or plan to attain within 2 years?",
       y = "# of respondents",
       caption= "Source : 2018 Kaggle ML & DS Survey")+
     theme(plot.title = element_text(size = 22)) +
      theme(plot.subtitle = element_text(size = 14))

or colour them differently to visualise the proportions.

dat_student <-dat_small %>% 
    mutate(student = case_when(
        Q6 == "Student" | Q7 == "I am a student" ~ "Yes",
        Q6 != "Student"        ~ "No",
        Q7 != "I am a student" ~  "No"))

dat_student %>% 
    filter(!is.na(student)) %>% 
        group_by(Q4,student) %>% 
    summarise(n = n()) %>% 
    mutate(prop = n/sum(n)) %>% 
    ggplot(aes(reorder(Q4,n), n, fill = student)) +
    geom_bar(stat = "identity") +
    coord_flip(ylim = c(0, 12500)) +
    scale_fill_manual(values = c("dark blue", "grey35")) +
    theme_classic() +
  theme(legend.position = 'bottom') +
  labs(title = "Still a student?",
       subtitle = "This might be a bit helpful",
       x = "What is the highest level of formal education\nthat you have attained or plan to attain within 2 years?",
       y = "# of respondents",
       caption= "Source : 2018 Kaggle ML & DS Survey")+
     theme(plot.title = element_text(size = 22)) +
      theme(plot.subtitle = element_text(size = 14))
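
Whether someone counts as a student depends on two questions, and either answer can be missing, so how the two conditions are combined matters. Here is a minimal sketch on made-up rows (the column names mirror the survey, the values are hypothetical) comparing an `%in%`-based flag with `!=` joined by `|`:

```r
# Made-up rows for illustration only
toy <- data.frame(
  Q6 = c("Student", "Data Analyst", "Consultant", "Not employed", NA),
  Q7 = c("I am a student", "I am a student", "Other", NA, NA),
  stringsAsFactors = FALSE
)

# %in% returns FALSE (never NA) when the answer is missing
is_student <- toy$Q6 %in% "Student" | toy$Q7 %in% "I am a student"

# "!=" joined with "|" is FALSE only when BOTH answers say student,
# and it is NA when both answers are missing
keep_or <- toy$Q6 != "Student" | toy$Q7 != "I am a student"

data.frame(toy, is_student, keep_or)
```

In the second row the respondent is a student according to Q7 only: `is_student` flags them, while `keep_or` would still keep them in a "without students" subset.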

2018 Kaggle ML & DS Survey Challenge

Georgi Petkov

18 November 2018
