As a data scientist, you are always excited to learn new methods of analyzing data. You’ve asked a range of people what method they want to learn next year. As you do this, you begin to suspect that older data scientists are a little out of touch. You wonder whether the age of a data scientist influences what they will want to learn next year.
To answer this question, this analysis uses survey data provided by Kaggle.com. The survey asked over 10,000 data scientists, analysts, and statisticians for their opinions of their field.
This question will be analyzed with a \(\chi^2\) test. Rather than rely on the theoretical \(\chi^2\) distribution, we are going to run it as a permutation test, with the p-value estimated by simulation.
A frequency table of respondents’ age ranges against the method they are most excited to learn next year (2018) is shown below:
| Analytic Method | 21 - 30 | 31 - 40 | 41 - 50 | 51 - 60 |
|---|---|---|---|---|
| Deep Learning | 2215 | 1195 | 486 | 183 |
| Neural Nets | 688 | 337 | 145 | 61 |
| Time Series | 297 | 195 | 86 | 39 |
| Text Mining | 222 | 138 | 60 | 27 |
| Bayesian Methods | 212 | 172 | 54 | 27 |
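For readers following along in R, the table above can be entered as a matrix. This is only a minimal sketch: the name `MCR` is taken from the code later in this analysis, but how the original object was built from the Kaggle data is not shown, so typing the counts in by hand is an assumption.

```r
# Counts copied from the frequency table; rows are methods, columns are age ranges.
# Assumption: the original MCR was derived from the raw survey data, not typed in.
MCR <- matrix(
  c(2215, 1195, 486, 183,
     688,  337, 145,  61,
     297,  195,  86,  39,
     222,  138,  60,  27,
     212,  172,  54,  27),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("Deep Learning", "Neural Nets", "Time Series", "Text Mining", "Bayesian Methods"),
    c("21 - 30", "31 - 40", "41 - 50", "51 - 60")
  )
)

# Add row and column totals to see how many respondents fall in each group
addmargins(MCR)
```

Keeping the methods as rows and the age ranges as columns matches the layout of the table, so `barplot()` groups the bars by age range.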
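\(H_0:\) Age range and the analytic method a respondent wants to learn are independent; any pattern witnessed in the sampled data is simply due to random chance.
\(H_a:\) Age range and the analytic method a respondent wants to learn are not independent; the pattern witnessed in the sampled data is not due to random chance.
This analysis will be conducted with a significance level of \(\alpha = 0.05\).
The bar plot below gives a visual representation of the table above: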
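For reference, the statistic behind this test is the usual Pearson \(\chi^2\) statistic. The symbols here are introduced for illustration only: \(O_{ij}\) is the observed count for method \(i\) and age range \(j\), \(r_i\) and \(c_j\) are the row and column totals, and \(n\) is the grand total.

\[
E_{ij} = \frac{r_i \, c_j}{n}, \qquad \chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
\]

As a worked example from the table, Bayesian Methods has a row total of 465, the 21 - 30 age range has a column total of 3634, and the grand total is 6839, so the expected count is \(465 \times 3634 / 6839 \approx 247.1\) against an observed count of 212.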
# One color per analytic method
palette(c("skyblue4","firebrick", "skyblue", "sienna1", "sienna4"))
# Grouped bars: one group per age range, one bar per method
barplot(MCR,
beside = TRUE,
legend.text = TRUE,
xlab = "Age Range",
ylab = "Number of Respondents",
main = "Analytical Method Data Scientists are Most Excited to Learn in 2018",
col = palette(),
args.legend = list(cex = 1.0,
bty = "n",
x = "topright",
title = "Analytical Method"))
Generally speaking, the pattern of bars looks roughly the same across the four age ranges. It should be noted that text mining was chosen less often than Bayesian methods within the 31 - 40 age range (138 versus 172). That looks a little out of place. At this point it’s hard to tell whether we should reject or fail to reject the null hypothesis. Let’s look at the quantitative results below and see what we can confirm about the data.
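# Observed counts and their margins
x <- MCR
sr <- rowSums(x)
sc <- colSums(x)
n <- sum(x)
# Expected counts under independence
E <- outer(sr, sc, "*")/n
# Cell variances, as computed inside chisq.test (not used further here)
v <- function(r, c, n) c * r * (n - r) * (n - c)/n^3
V <- outer(sr, sc, v, n)
dimnames(E) <- dimnames(x)
# Simulate B test statistics from random tables with the observed margins,
# using the internal routine behind chisq.test (not a public API)
B <- 2000
tmp <- .Call(stats:::C_chisq_sim, sr, sc, B, E)
# Observed test statistic
STATISTIC <- sum(sort((x - E)^2/E, decreasing = TRUE))
almost.1 <- 1 - 64 * .Machine$double.eps
# Simulated p-value: share of simulated statistics at least as large as the observed one
PVAL <- (1 + sum(tmp >= almost.1 * STATISTIC))/(B + 1)
# Standard chi-square test, presumably the source of the residuals shown later
my.chi <- chisq.test(x)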
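# Histogram of the simulated statistics
# (assumed definition of h, which is not shown in the original code)
h <- hist(tmp, plot = FALSE)
# Color the bins by whether they fall below or above the observed statistic (31.94)
ccat <- cut(h$breaks, c(0, 31.94, Inf))
plot(h, col=c("red","skyblue")[ccat],
xlab = "Permuted Test Statistics",
ylab = "Frequency of Test Statistics",
main = "Non Parametric Chi-Square Distribution",
border = "white")
# Mark the observed statistic with a dotted vertical line
abline(v = 31.94,
lwd = 1.0,
lty = 3)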
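The simulation code above calls an internal routine of the `stats` package. A minimal sketch of the same kind of test through the public interface is shown here, on the assumption that `chisq.test()` with `simulate.p.value = TRUE` is an acceptable stand-in. The matrix is retyped from the frequency table and the seed is arbitrary, chosen only to make the sketch reproducible.

```r
# Counts copied from the frequency table (rows: methods, columns: age ranges)
MCR <- matrix(
  c(2215, 1195, 486, 183,
     688,  337, 145,  61,
     297,  195,  86,  39,
     222,  138,  60,  27,
     212,  172,  54,  27),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("Deep Learning", "Neural Nets", "Time Series", "Text Mining", "Bayesian Methods"),
    c("21 - 30", "31 - 40", "41 - 50", "51 - 60")
  )
)

set.seed(2018)  # arbitrary seed, an assumption made only for reproducibility

# Monte Carlo chi-square test with 2000 simulated tables
sim.chi <- chisq.test(MCR, simulate.p.value = TRUE, B = 2000)

sim.chi$statistic  # observed chi-square statistic
sim.chi$parameter  # NA: a simulated p-value does not use degrees of freedom
sim.chi$p.value    # changes slightly from run to run when no seed is fixed
```

Because the p-value is estimated from random tables, it will not match the reported value exactly unless the same random draws are used.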
Results of the non-parametric \(\chi^2\) test are listed below. Please note that these results are based on 2000 replications; because the p-value is simulated, no degrees of freedom are reported.
| Test statistic | df | P value |
|---|---|---|
| 31.94 | NA | 0.002499 |
Since the p-value of the non-parametric \(\chi^2\) test (0.002499) is less than \(\alpha\) (0.05), we reject the null hypothesis. In other words, we have sufficient evidence to believe that the pattern witnessed in the sampled data is not due to random chance. We can conclude that the age of a data scientist is related to the analytic method they want to learn next year.
It should be noted that the ranking of the analytic methods was almost the same for all age groups. There was a general consensus that deep learning was the hot new topic of 2018. Now that we’ve established the obvious, let’s find out where there were some disruptions in the pattern of the data. To do this we’re going to look at the residuals from a standard \(\chi^2\) test:
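As a check on the arithmetic, the simulated p-value follows the formula used in the code earlier, where \(B\) is the number of replications and \(k\) (a symbol introduced here for illustration) is the number of simulated statistics at least as large as the observed one. Assuming the reported value came from that formula, it corresponds to \(k = 4\):

\[
p = \frac{1 + k}{B + 1} = \frac{1 + 4}{2000 + 1} = \frac{5}{2001} \approx 0.002499
\]

Put differently, only 4 of the 2000 simulated tables produced a statistic as extreme as 31.94.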
| Analytic Method | 21-30 | 31-40 | 41-50 | 51-60 |
|---|---|---|---|---|
| Deep Learning | 1.022 | -0.5719 | -0.4328 | -1.269 |
| Neural Nets | 1.325 | -1.549 | -0.3743 | 0.04378 |
| Time Series | -1.704 | 0.8281 | 1.274 | 1.559 |
| Text Mining | -1.007 | 0.4213 | 0.7715 | 1.06 |
| Bayesian Methods | -2.232 | 2.847 | -0.3328 | 0.8537 |
It appears that text mining in the 31-40 age range wasn’t as unusual as the visualization made it look. That said, opinion on Bayesian methods varied more across age groups than expected. The 21-30 and 31-40 age ranges had residuals of -2.232 and 2.847 respectively, the only two residuals in the table with an absolute value above 2. It appears the youngest data scientists aren’t as excited about Bayesian methods as expected, while slightly older data scientists were more enthusiastic about that method than expected. These responses, along with a few smaller deviations, are what led us to reject our null hypothesis. A future analysis should look into what factors have influenced the opinion of Bayesian methods between those two age groups.
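The residuals discussed above are Pearson residuals, the difference between the observed and expected counts divided by the square root of the expected count. A minimal sketch of that calculation follows. The matrix is retyped from the frequency table, the name `pearson` is hypothetical, and the cutoff of 2 is a common rule of thumb, not something specified in this analysis.

```r
# Counts copied from the frequency table (rows: methods, columns: age ranges)
MCR <- matrix(
  c(2215, 1195, 486, 183,
     688,  337, 145,  61,
     297,  195,  86,  39,
     222,  138,  60,  27,
     212,  172,  54,  27),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("Deep Learning", "Neural Nets", "Time Series", "Text Mining", "Bayesian Methods"),
    c("21 - 30", "31 - 40", "41 - 50", "51 - 60")
  )
)

# Expected counts under independence: (row total * column total) / grand total
E <- outer(rowSums(MCR), colSums(MCR)) / sum(MCR)

# Pearson residuals: (observed - expected) / sqrt(expected)
pearson <- (MCR - E) / sqrt(E)
round(pearson, 3)

# Cells whose residual exceeds the rule-of-thumb cutoff of 2 in absolute value
which(abs(pearson) > 2, arr.ind = TRUE)
```

The same values are available as the `residuals` element of the object returned by `chisq.test(MCR)`.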