Let me introduce you to Mr Right.
Demokritos Right is a friend of mine. As you can guess from the name, he is Greek and, like every Greek, he is proud of his name and keen to explain its meaning. As I recall, it had something to do with the Greek philosopher Demokritos and means “judge of the people”, but I told him outright that there was no way I would use the whole name, and from then on he became Dem Right.
Mr Right was very excited about getting his hands on the new 2018 Kaggle ML & DS Survey Challenge, but last week he asked me for help. He explained that he was somewhat confused by some of the results. So we decided to start the proper way.
We will mainly be using the tidyverse and forcats packages, plus a few others to add a personal touch.
library(tidyverse)
library(forcats)
library(scales)
library(viridis)
library(RColorBrewer)
library(plotly)
library(knitr)
library(kableExtra)

In statistics, self-selection bias arises in any situation in which individuals select themselves into a group, causing a biased sample. Drawing insights from a biased sample is probably worse than having no clue about what you are dealing with. On top of that, small samples are dangerous because they can give the impression that the data is skewed in a way that is not true in reality. Questions that ask about more than one topic, yet allow only one answer, are known as double-barreled questions. All the answers are lumped into one variable, which makes it very difficult to analyse them and to find out who meant what when answering. It’s a good idea to keep these shortcomings in mind during the analysis and to be very careful when generalising your findings.
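To make the self-selection point concrete, here is a small simulation sketch in base R. Everything in it is made up for illustration: the population, the `experience` variable, and the assumption that more experienced people are more likely to respond. It is not based on the Kaggle data.

```r
# Hypothetical population: years of experience for 100,000 people (made-up numbers).
set.seed(42)
population <- data.frame(experience = rexp(100000, rate = 1 / 5))

# Assumption: the chance of answering the survey grows with experience,
# so respondents select themselves into the sample.
p_respond <- plogis(-4 + 0.3 * population$experience)
responded <- runif(nrow(population)) < p_respond

true_mean   <- mean(population$experience)
sample_mean <- mean(population$experience[responded])

# Compare the population mean with the mean among those who chose to respond.
c(population = true_mean, self_selected = sample_mean)
```

Under these assumptions the self-selected mean sits above the population mean, no matter how many people respond: a bigger biased sample is still biased.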
The data we will be using can be found here: 2018 Kaggle ML & DS Survey Challenge. For the purpose of our analysis it has been slightly preprocessed, by which I mean we cut it down to just a few variables to begin with.
The loaded data contains 23860 rows.
We will make a small table so that we can easily refer to the full text of the questions in focus and, to make writing code easier, we will rename the first variable to “Q0”.
# Lookup table: column name and full question text (taken from the first row)
dat %>%
filter(row_number() == 1L) %>%
select(1:13, -3, -9, -11) %>%
gather(columns,"Q") %>%
knitr::kable() %>%
kable_styling(bootstrap_options = "striped", full_width = F)

| columns | Q |
|---|---|
| Time from Start to Finish (seconds) | Duration (in seconds) |
| Q1 | What is your gender? - Selected Choice |
| Q2 | What is your age (# years)? |
| Q3 | In which country do you currently reside? |
| Q4 | What is the highest level of formal education that you have attained or plan to attain within the next 2 years? |
| Q5 | Which best describes your undergraduate major? - Selected Choice |
| Q6 | Select the title most similar to your current role (or most recent title if retired): - Selected Choice |
| Q7 | In what industry is your current employer/contract (or your most recent employer if retired)? - Selected Choice |
| Q8 | How many years of experience do you have in your current role? |
| Q9 | What is your current yearly compensation (approximate $USD)? |
As can be seen from the table above, we will be dealing mainly with the demographics of the participants and the time they spent on the survey.
The data is now filtered so that it contains only the first few variables; those with free-text input were dropped, and factors with a natural order were ordered accordingly.
dat_small <- dat[-1,1:13] %>%
select(-3,-9,-11)
dat_small <- rename(dat_small,Q0 = `Time from Start to Finish (seconds)`)
dat_small$Q0 <- as.numeric(dat_small$Q0)
dat_small <- dat_small %>%
mutate_if(is.character, as.factor)
dat_small$CountNA <- rowSums(apply(is.na(dat_small), 2, as.numeric))
# Order the factor levels -------------------------------------------------
dat_small <- dat_small %>%
mutate(Q9 = fct_relevel(Q9,"0-10,000","10-20,000","20-30,000","30-40,000","40-50,000",
"50-60,000","60-70,000","70-80,000","80-90,000","90-100,000",
"100-125,000","125-150,000","150-200,000","200-250,000",
"250-300,000","300-400,000","400-500,000","500,000+",
"I do not wish to disclose my approximate yearly compensation"))
dat_small <- dat_small %>%
mutate(Q9 = fct_recode(Q9,
"Undisclosed" = "I do not wish to disclose my approximate yearly compensation"))
dat_small <- dat_small %>%
mutate(Q4 = fct_relevel(Q4,
"No formal education past high school",
"Professional degree",
"Some college/university study without earning a bachelors degree",
"Bachelors degree",
"Masters degree",
"Doctoral degree",
"I prefer not to answer" ))
dat_small <- dat_small %>%
mutate(Q4 = fct_recode(Q4,
"No formal education\npast high school" =
"No formal education past high school",
"Some college/university study\nwithout earning a bachelors degree" =
"Some college/university study without earning a bachelors degree"))
dat_small <- dat_small %>%
mutate(Q8 = fct_relevel(Q8,"10-15","15-20", after = 6))
dat_small <- dat_small %>%
mutate(Q8 = fct_relevel(Q8,"20-25", "25-30", after = 8))
dat_small <- dat_small %>%
mutate(Q8 = fct_relevel(Q8,"10-15", "15-20", "30 +", after = Inf))
dat_small <- dat_small %>%
mutate(Q8 = fct_relevel(Q8,"20-25", "25-30", after = 8))

dat_small %>%
head(5) %>%
knitr::kable() %>%
kable_styling(bootstrap_options = "striped", full_width = F)

| Q0 | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | Q9 | CountNA |
|---|---|---|---|---|---|---|---|---|---|---|
| 710 | Female | 45-49 | United States of America | Doctoral degree | Other | Consultant | Other | NA | NA | 2 |
| 434 | Male | 30-34 | Indonesia | Bachelors degree | Engineering (non-computer focused) | Other | Manufacturing/Fabrication | 5-10 | 10-20,000 | 0 |
| 718 | Female | 30-34 | United States of America | Masters degree | Computer science (software engineering, etc.) | Data Scientist | I am a student | 0-1 | 0-10,000 | 0 |
| 621 | Male | 35-39 | United States of America | Masters degree | Social sciences (anthropology, psychology, sociology, etc.) | Not employed | NA | NA | NA | 3 |
| 731 | Male | 22-24 | India | Masters degree | Mathematics or statistics | Data Analyst | I am a student | 0-1 | 0-10,000 | 0 |
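The chain of `fct_relevel()` calls above reaches the natural order of Q8 in several moves. A more direct sketch is to state the full order once. It assumes the Q8 levels are exactly the eleven labels listed below; the toy vector `q8` is hypothetical and only stands in for the real column.

```r
# Toy stand-in for the Q8 column; the labels are assumed to match the survey's.
q8 <- factor(c("5-10", "0-1", "30 +", "10-15", "2-3", "1-2",
               "20-25", "3-4", "15-20", "4-5", "25-30"))

# Natural order of the experience brackets, written out explicitly.
q8_order <- c("0-1", "1-2", "2-3", "3-4", "4-5", "5-10",
              "10-15", "15-20", "20-25", "25-30", "30 +")

# Base R: rebuild the factor with the levels in the wanted order.
q8 <- factor(q8, levels = q8_order)
levels(q8)
```

With forcats, passing all eleven labels to a single `fct_relevel()` call, as was done for Q9, should give the same order. Note that `factor(x, levels = ...)` turns any label missing from `q8_order` into `NA`, so it is worth checking for new `NA`s afterwards.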
Below is a summary of the number of levels for each question, since all of them have been converted to factors.
dat_small %>%
summarise_all(nlevels) %>%
gather(variable, num_levels) %>%
filter(!row_number() %in% c(1,11)) %>%
knitr::kable() %>%
kable_styling(bootstrap_options = "striped", full_width = F)

| variable | num_levels |
|---|---|
| Q1 | 4 |
| Q2 | 12 |
| Q3 | 58 |
| Q4 | 7 |
| Q5 | 13 |
| Q6 | 21 |
| Q7 | 19 |
| Q8 | 11 |
| Q9 | 19 |
“This year, as last year, we set out to conduct an industry-wide survey that presents a truly comprehensive view of the state of data science and machine learning. The survey was live for one week in October, and after cleaning the data we finished with 23,859 responses, a 49% increase over last year!” - this is on the competition page, said Mr Right. It might be a 49% increase over last year, but in 2017 Kaggle claimed that they had surpassed one million accounts.
(23859-16716)/16716*100

Not to mention that the percentage increase comes out as a different number when I do the math: 42.7315147%.
- So isn’t that a response rate of just over 2%? - I asked, confounded. - That is a negligible result by any standard, isn’t it?
- I don’t know! I am a bit puzzled. I thought that the average response rate for an online survey was above 30%.
- I think we have a nonresponse rate greater than 90%.
- You are damn right!
- No, you are Dem Right.
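The rates in this exchange can be checked with a quick calculation. It is a rough sketch that treats the one million accounts Kaggle reported in 2017 as the number of people who could have answered. That denominator is an assumption: the number of users who were actually invited or active is not given here.

```r
responses_2018 <- 23859   # responses reported on the competition page
accounts       <- 1e6     # assumed denominator: accounts claimed in 2017

response_rate    <- responses_2018 / accounts * 100
nonresponse_rate <- 100 - response_rate

round(c(response = response_rate, nonresponse = nonresponse_rate), 2)
```

On these numbers the response rate is about 2.4%, so nonresponse is well above 90%.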
- But let’s go a bit deeper - requested Mr Right.
As you can see, some people were either very busy or took their time to think carefully about their answers. I would suggest that those who do not fall within some time limits did not take it seriously - said Mr Right.
dat_small %>%
filter(CountNA == 0) %>%
mutate(minutes = Q0/60) %>%
ggplot(aes(x = Q1, y = minutes)) +
geom_boxplot( varwidth = TRUE) +
theme_classic() +
coord_flip(ylim = c(0,60)) + # zoom in on the most common length of the survey
labs(title = "Let's zoom in on the main group",
subtitle= "Actually, most of the participants spent\naround 20 minutes on the survey",
y= "Minutes",
x = "What is your gender? - Selected Choice",
caption= "Source : 2018 Kaggle ML & DS Survey")+
theme(plot.title = element_text(size = 22)) +
theme(plot.subtitle = element_text(size = 14))

- I can see that some rushed through the survey, I believe just to get the promised dataset. On the other hand, it was a very good estimate from the creators of the survey, because the main group of participants completed the questionnaire in between 15 and 35 minutes.
But question 4 bothers me a bit, because when you use “and” or “or” you actually ask about several things but allow only one answer.
dat_small %>%
group_by(Q4) %>%
summarise(n = n()) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(reorder(Q4,n), n)) +
geom_bar(stat = "identity", fill = "dark blue") +
coord_flip(ylim = c(0, 12500)) +
geom_text(aes(label=n), hjust=-0.1) +
theme_classic() +
labs(title = "Which question did you answer?",
subtitle = "What is the highest level attained? or\nWhat are you planning to attain?",
x = "What is the highest level of formal education\nthat you have attained or plan to attain within 2 years?",
y = "# of respondents",
caption= "Source : 2018 Kaggle ML & DS Survey")+
theme(plot.title = element_text(size = 22)) +
theme(plot.subtitle = element_text(size = 14))

We can see what’s left after we filter out the students,
dat_small %>%
filter(!(Q6 %in% "Student" | Q7 %in% "I am a student")) %>% # drop anyone marked as a student in Q6 or Q7; rows with NA are kept
group_by(Q4) %>%
summarise(n = n()) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(reorder(Q4,n), n)) +
geom_bar(stat = "identity", fill = "dark blue") +
coord_flip(ylim = c(0, 12500)) +
geom_text(aes(label=n), hjust=-0.1) +
theme_classic() +
theme(axis.text.x = element_blank()) +
labs(title = "Without students!",
subtitle = "Presumably those answered what\nis the highest level attained?",
x = "What is the highest level of formal education\nthat you have attained or plan to attain within 2 years?",
y = "# of respondents",
caption= "Source : 2018 Kaggle ML & DS Survey")+
theme(plot.title = element_text(size = 22)) +
theme(plot.subtitle = element_text(size = 14))

or colour them differently to visualise the proportions.
dat_student <-dat_small %>%
mutate(student = case_when(
Q6 == "Student" | Q7 == "I am a student" ~ "Yes",
Q6 != "Student" ~ "No",
Q7 != "I am a student" ~ "No"))
dat_student %>%
filter(!is.na(student)) %>%
group_by(Q4,student) %>%
summarise(n = n()) %>%
mutate(prop = n/sum(n)) %>%
ggplot(aes(reorder(Q4,n), n, fill = student)) +
geom_bar(stat = "identity") +
coord_flip(ylim = c(0, 12500)) +
scale_fill_manual(values = c("dark blue", "grey35")) + # fill colours for student = "No" / "Yes"
theme_classic() +
theme(legend.position = 'bottom') +
labs(title = "Still a student?",
subtitle = "This might be a bit helpful",
x = "What is the highest level of formal education\nthat you have attained or plan to attain within 2 years?",
y = "# of respondents",
caption= "Source : 2018 Kaggle ML & DS Survey")+
theme(plot.title = element_text(size = 22)) +
theme(plot.subtitle = element_text(size = 14))