In this week's challenge, I would like to discuss the difference between causation and correlation.
Causation means that one event causes another to happen, i.e. the outcome of event B would be different if event A had not happened. Correlation, on the other hand, describes a relationship between two variables and does not by itself imply causation. Event A can be related to event B without event A having caused event B.
I will use a famous case study in statistics: shark attacks and ice cream sales are correlated, but that does not mean that buying or eating ice cream increases your chance of being attacked by a shark.
I will start by setting up a working directory
setwd("~/NYU/R Programming")
Importing relevant packages
library(readr)
library(knitr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Getting to know the data
shark <- read_csv("shark_attacks.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## Year = col_double(),
## Month = col_double(),
## SharkAttacks = col_double(),
## Temperature = col_double(),
## IceCreamSales = col_double()
## )
View(shark)
print(shark)
## # A tibble: 84 x 5
## Year Month SharkAttacks Temperature IceCreamSales
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2008 1 25 11.9 76
## 2 2008 2 28 15.2 79
## 3 2008 3 32 17.2 91
## 4 2008 4 35 18.5 95
## 5 2008 5 38 19.4 103
## 6 2008 6 41 22.1 108
## 7 2008 7 43 25.1 102
## 8 2008 8 40 23.4 98
## 9 2008 9 38 22.6 83
## 10 2008 10 33 18.1 83
## # ... with 74 more rows
Let’s see the relationship between shark attacks and ice cream sales
plot(shark$SharkAttacks, shark$IceCreamSales)
cor(shark$SharkAttacks, shark$IceCreamSales)
## [1] 0.5343576
We can clearly see a moderate positive correlation (about 0.53). Some people might conclude that rising ice cream sales cause more shark attacks, implying that sharks prefer to attack people who eat ice cream. That conclusion is wrong.
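To check that a correlation of this size is unlikely to be chance alone, here is a minimal sketch of the usual t-test for a Pearson correlation. It uses only the two numbers reported above (r from `cor()` and the 84 rows of the tibble) and assumes there are no missing values; the variable names `r`, `n`, `t_stat` and `p_value` are my own.

```r
# Inputs taken from the output above.
r <- 0.5343576   # correlation between SharkAttacks and IceCreamSales
n <- 84          # rows in the tibble (assumption: no missing values)

# Test statistic for H0: true correlation is zero, with n - 2 degrees of freedom.
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

t_stat
p_value
```

On the raw data, `cor.test(shark$SharkAttacks, shark$IceCreamSales)` should report the same test. A small p-value only says the association is unlikely to be noise; it says nothing about what causes what.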
Let’s consider another variable, in this case temperature
cor(select(shark, SharkAttacks, Temperature, IceCreamSales))
## SharkAttacks Temperature IceCreamSales
## SharkAttacks 1.0000000 0.7169660 0.5343576
## Temperature 0.7169660 1.0000000 0.5957694
## IceCreamSales 0.5343576 0.5957694 1.0000000
plot(shark$Temperature, shark$SharkAttacks)
plot(shark$Temperature, shark$IceCreamSales)
Now we can see that there is a third factor: temperature. It correlates with both variables (about 0.72 with shark attacks and 0.60 with ice cream sales), so both are responding to it. As the temperature gets warmer, more people go to the beach, which increases the number of shark attacks. Warmer temperatures also make people more likely to buy ice cream, which increases ice cream sales.
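One way to put a number on this is the partial correlation between shark attacks and ice cream sales after holding temperature fixed. The sketch below applies the standard first-order partial correlation formula to the three pairwise correlations from the matrix above; the variable names are my own, and this step is my addition rather than part of the original analysis.

```r
# Pairwise correlations copied from the correlation matrix above.
r_attacks_icecream <- 0.5343576
r_attacks_temp     <- 0.7169660
r_icecream_temp    <- 0.5957694

# Correlation between attacks and ice cream sales with temperature partialled out.
partial_r <- (r_attacks_icecream - r_attacks_temp * r_icecream_temp) /
  sqrt((1 - r_attacks_temp^2) * (1 - r_icecream_temp^2))

round(partial_r, 2)
```

The result is about 0.19, well below the raw 0.53, which suggests that temperature accounts for much of the apparent link between the two variables.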
In summary, correlation does not imply causation. In this case, ice cream sales do not cause shark attacks. Understanding this concept is crucial, especially in statistical modelling. To account for a third factor in a model, we can include it as an additional predictor and consider adding an interaction term.
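To illustrate what accounting for the third factor looks like in a model, here is a small sketch on simulated data. The data are synthetic and made up for illustration (not the shark_attacks.csv values), and every name, coefficient and noise level is an assumption. Temperature drives both outcomes, and ice cream has no effect on attacks by construction.

```r
# Synthetic data only: hypothetical numbers, not the data set used above.
set.seed(42)
n <- 500
temperature <- rnorm(n, mean = 20, sd = 5)

# Both variables depend on temperature; neither depends on the other.
ice_cream <- 50 + 2.0 * temperature + rnorm(n, sd = 8)
attacks   <- 10 + 1.5 * temperature + rnorm(n, sd = 6)
sim <- data.frame(temperature, ice_cream, attacks)

# Model 1: ignores temperature, so ice cream picks up its effect.
naive <- lm(attacks ~ ice_cream, data = sim)

# Model 2: temperature included as an additional predictor.
adjusted <- lm(attacks ~ ice_cream + temperature, data = sim)

# Model 3: also allows the ice cream slope to vary with temperature.
interact <- lm(attacks ~ ice_cream * temperature, data = sim)

coef(naive)["ice_cream"]      # clearly positive
coef(adjusted)["ice_cream"]   # expected to be close to zero
summary(interact)$coefficients
```

In the naive model the ice cream coefficient looks meaningful; once temperature enters the model, it should shrink towards zero. The same formulas could be tried on the real data, for example `lm(SharkAttacks ~ IceCreamSales + Temperature, data = shark)`.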