library(plotly)
While reading through the fun statistics on Tyler Vigen’s website, I found two very interesting entries that spoke to me on a personal level: women receiving STEM master’s degrees and visitors to Disney World’s Animal Kingdom. Let’s see how they correlate!
year <- c(2007, 2008, 2009, 2010, 2011, 2012)
women_masters <- c(54925, 57594, 61028, 63660, 68202, 73561)
visitors_disney <- c(9.49, 9.54, 9.59, 9.686, 9.783, 9.998)
dat <- data.frame(year, women_masters, visitors_disney)
cor(women_masters, visitors_disney)
## [1] 0.9864481
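To make it clearer what `cor()` is measuring, here is a minimal sketch that rebuilds Pearson's correlation coefficient from its definition, using the same two vectors as above. The names `dx`, `dy`, and `r_manual` are introduced only for this illustration.

```r
# Data copied from the vectors defined above
women_masters <- c(54925, 57594, 61028, 63660, 68202, 73561)
visitors_disney <- c(9.49, 9.54, 9.59, 9.686, 9.783, 9.998)

# Deviations of each series from its own mean
dx <- women_masters - mean(women_masters)
dy <- visitors_disney - mean(visitors_disney)

# Pearson's r: how much the series vary together,
# divided by how much each one varies on its own
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Should agree with the built-in (cor() defaults to method = "pearson")
all.equal(r_manual, cor(women_masters, visitors_disney))
```

Because the formula only looks at how the two series move around their means, it says nothing about why they move together.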
The number of women receiving STEM master’s degrees and the number of visitors to Disney World’s Animal Kingdom have a surprisingly high correlation of about 0.9864. Now let’s plot them to visualize the relationship better.
# Settings for the secondary (right-hand) y-axis
ay <- list(
  tickfont = list(color = "red"),
  overlaying = "y",
  side = "right",
  title = "Visitors to Animal Kingdom")
# First trace uses the left axis; the second is drawn against the right axis ("y2")
fig <- plot_ly(dat, x = ~year, y = ~women_masters, type = 'scatter', mode = 'lines+markers', line = list(color = '#17B3CF'), marker = list(symbol = 'triangle-up'), name = "Women receiving STEM master's degrees") %>%
  add_trace(x = ~year, y = ~visitors_disney, type = 'scatter', mode = 'lines+markers', line = list(color = '#CF2017'), marker = list(symbol = 'square'), name = "Visitors to Disney World's Animal Kingdom", yaxis = "y2") %>%
  layout(
    # plotly renders <br> as a line break in titles
    title = "Women receiving STEM master's degrees<br>vs.<br>Visitors to Disney World's Animal Kingdom",
    yaxis = list(title = "STEM master's degrees awarded to women"),
    yaxis2 = ay,
    xaxis = list(title = "Year"))
fig
So this must mean that one of the two data sets causes the other, right? Actually, this is a case of spurious correlation, in which the data, although correlated, have no causal link between them. Such a correlation is often caused by something known as a confounding factor: a third variable that influences both of the others and can lead you to wrongly estimate the relationship between your independent and dependent variables.
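As a rough illustration of how a confounder can produce this effect, the sketch below simulates two series that have nothing to do with each other except that both grow over time. All of the data and names here (`time_idx`, `series_a`, `series_b`) are made up for the example and are not taken from Tyler Vigen's site; time plays the role of the confounding factor.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

time_idx <- 1:50                                 # hypothetical time index (the confounder)
series_a <- 2.0 * time_idx + rnorm(50, sd = 5)   # made-up series that grows over time
series_b <- 0.5 * time_idx + rnorm(50, sd = 2)   # independent made-up series, also growing

# High, because both series follow the same upward trend
raw_cor <- cor(series_a, series_b)

# Remove the linear time trend from each series and correlate what is left
resid_a <- resid(lm(series_a ~ time_idx))
resid_b <- resid(lm(series_b ~ time_idx))
detrended_cor <- cor(resid_a, resid_b)

c(raw = raw_cor, detrended = detrended_cor)
```

In this simulation the raw correlation should come out close to 1, while the correlation of the detrended residuals should sit much nearer to 0, since the only thing the two series share is the trend. With only six real observations per series, the same check on the data above would be far less conclusive.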