library(plotly)

Introduction

While reading through the fun statistics on Tyler Vigen’s website, I found two very interesting entries that spoke to me on a personal level: women receiving STEM master’s degrees and visitors to Disney World’s Animal Kingdom. Let’s see how strongly they are correlated!

year <- c(2007, 2008, 2009, 2010, 2011, 2012)
women_masters <- c(54925, 57594, 61028, 63660, 68202, 73561)
visitors_disney <- c(9.49, 9.54, 9.59, 9.686, 9.783, 9.998)
dat <- data.frame(year, women_masters, visitors_disney)
cor(women_masters, visitors_disney)
## [1] 0.9864481
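
For readers curious about what cor() is doing here: by default it returns Pearson's correlation coefficient. The snippet below is a minimal sketch that recomputes the same quantity by hand from the two vectors above. The names dx, dy and r_manual are introduced only for this illustration.

```r
# Same data as above, repeated so the snippet runs on its own
women_masters <- c(54925, 57594, 61028, 63660, 68202, 73561)
visitors_disney <- c(9.49, 9.54, 9.59, 9.686, 9.783, 9.998)

# Deviations of each series from its own mean
dx <- women_masters - mean(women_masters)
dy <- visitors_disney - mean(visitors_disney)

# Pearson's r: the summed co-variation, scaled by the spread of each series
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual
```

The result should agree with cor(women_masters, visitors_disney) up to floating-point error.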

The number of women receiving STEM master’s degrees and the number of visitors to Disney World’s Animal Kingdom have a surprisingly high correlation of 0.9864. Now let’s plot the two series to visualize them better.

# Settings for the secondary (right-hand) y axis used by the Disney series
ay <- list(
  tickfont = list(color = "#CF2017"),
  overlaying = "y",
  side = "right",
  title = "Visitors to Animal Kingdom")
# Women's STEM master's degrees are drawn against the left axis
fig <- plot_ly(dat, x = ~year, y = ~women_masters, type = 'scatter', mode = 'lines+markers',
    line = list(color = '#17B3CF'), marker = list(color = '#17B3CF', symbol = 'triangle-up'),
    name = "Women receiving STEM master's degrees") %>%
  # Animal Kingdom visitors are drawn against the right axis (y2)
  add_trace(x = ~year, y = ~visitors_disney, type = 'scatter', mode = 'lines+markers',
    line = list(color = '#CF2017'), marker = list(color = '#CF2017', symbol = 'square'),
    name = "Visitors to Disney World's Animal Kingdom", yaxis = "y2") %>%
  layout(
    # plotly titles use <br> for line breaks
    title = "Women receiving STEM master's degrees vs.<br>Visitors to Disney World's Animal Kingdom",
    xaxis = list(title = "Year"),
    yaxis = list(title = "STEM master's degrees awarded to women"),
    yaxis2 = ay)
fig

So this must mean that one of the two data sets causes the other, right? Actually, this is a case of spurious correlation, in which the data, although correlated, have no causal link between them. Such a correlation is often produced by something known as a confounding factor: a third variable that influences both series and can lead you to wrongly estimate the relationship between your independent and dependent variables. Here, both series simply rise over the same six years.
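
As a rough illustration of how a shared driver can manufacture a correlation, the sketch below checks how closely each series tracks the calendar year. Treating time as the candidate confounder is an assumption made for this example, not something the data set itself states, and the names cor_year_women and cor_year_visitors are introduced only here.

```r
# Same data as above, repeated so the snippet runs on its own
year <- c(2007, 2008, 2009, 2010, 2011, 2012)
women_masters <- c(54925, 57594, 61028, 63660, 68202, 73561)
visitors_disney <- c(9.49, 9.54, 9.59, 9.686, 9.783, 9.998)

# Correlation of each series with the year (assumed confounder: time)
cor_year_women <- cor(year, women_masters)
cor_year_visitors <- cor(year, visitors_disney)
c(women = cor_year_women, visitors = cor_year_visitors)
```

If both values are close to 1, each series is mostly a steady upward trend, and any two such trends will correlate strongly with each other whether or not they are related. With only six data points this is suggestive at best, not a proof of what drives either series.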