Week 2 Challenge

In this week's challenge, I would like to discuss the difference between causation and correlation.

Understanding the definition

Causation means that one event causes another event to happen, i.e. the outcome of event B would be different if event A did not happen. Correlation, on the other hand, describes a relationship between two variables and does not by itself imply causation. Event A can be related to event B without event A having caused event B.
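To make the definition concrete, here is a minimal simulated sketch (it does not use the shark data). The variables `z`, `x` and `y`, the coefficients and the seed are all made up for illustration: `z` drives both `x` and `y`, and `x` never enters the formula for `y`, yet the two still come out strongly correlated.

```r
# Simulated illustration only: z is a hypothetical common cause.
set.seed(42)                 # arbitrary seed, for reproducibility
n <- 1000
z <- rnorm(n)                # the common cause
x <- 2 * z + rnorm(n)        # x depends on z only
y <- 3 * z + rnorm(n)        # y depends on z only; x is not in this formula

r_xy <- cor(x, y)            # positive even though x does not cause y
r_xy
```

Changing `x` by hand here would leave `y` untouched, which is exactly what separates correlation from causation.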

Case study

I will use a famous case study in statistics that shows a correlation between shark attacks and ice cream sales. This correlation does not mean that buying or eating ice cream increases your chance of being attacked by a shark.

I will start by setting up a working directory

setwd("~/NYU/R Programming")

Importing relevant packages

library(readr)
library(knitr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Getting to know the data

shark <- read_csv("shark_attacks.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   Year = col_double(),
##   Month = col_double(),
##   SharkAttacks = col_double(),
##   Temperature = col_double(),
##   IceCreamSales = col_double()
## )
View(shark)
print(shark)
## # A tibble: 84 x 5
##     Year Month SharkAttacks Temperature IceCreamSales
##    <dbl> <dbl>        <dbl>       <dbl>         <dbl>
##  1  2008     1           25        11.9            76
##  2  2008     2           28        15.2            79
##  3  2008     3           32        17.2            91
##  4  2008     4           35        18.5            95
##  5  2008     5           38        19.4           103
##  6  2008     6           41        22.1           108
##  7  2008     7           43        25.1           102
##  8  2008     8           40        23.4            98
##  9  2008     9           38        22.6            83
## 10  2008    10           33        18.1            83
## # ... with 74 more rows

Let’s see the relationship between shark attacks and ice cream sales

plot(shark$SharkAttacks,shark$IceCreamSales)

cor(shark$SharkAttacks, shark$IceCreamSales)
## [1] 0.5343576

The plot and the coefficient (about 0.53) show a moderate positive correlation. Some people might conclude from this that rising ice cream sales cause more shark attacks, as if sharks preferred to attack people who eat ice cream, which is wrong.

Let’s consider another variable, in this case temperature.

cor(select(shark, SharkAttacks, Temperature, IceCreamSales))
##               SharkAttacks Temperature IceCreamSales
## SharkAttacks     1.0000000   0.7169660     0.5343576
## Temperature      0.7169660   1.0000000     0.5957694
## IceCreamSales    0.5343576   0.5957694     1.0000000
plot(shark$Temperature, shark$SharkAttacks)

plot(shark$Temperature, shark$IceCreamSales)

Now we can see that there is a third factor: temperature. Shark attacks and ice cream sales are both responding to it. As the temperature gets warmer, more people go to the beach, which increases the number of shark attacks. Warmer temperatures also make people more likely to buy ice cream, which increases ice cream sales.
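One way to put a number on this is the first-order partial correlation, which estimates how strongly shark attacks and ice cream sales are still related once the linear effect of temperature is removed. The sketch below is an addition to the original analysis: the helper `partial_cor_from_r()` is a hypothetical name, and the three inputs are copied from the correlation matrix printed above.

```r
# Partial correlation of x and y given z, computed from pairwise correlations.
# Helper name is hypothetical; the formula is the standard first-order one.
partial_cor_from_r <- function(r_xy, r_xz, r_yz) {
  (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
}

# Values taken from the correlation matrix reported above
r_attacks_icecream <- 0.5343576
r_attacks_temp     <- 0.7169660
r_icecream_temp    <- 0.5957694

p <- partial_cor_from_r(r_attacks_icecream, r_attacks_temp, r_icecream_temp)
round(p, 2)
```

With these inputs the result is roughly 0.19, well below the raw 0.53. This suggests that much of the association runs through temperature, although a linear adjustment for one variable does not rule out other shared drivers such as season or holidays.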

In summary, correlation does not imply causation. In this case, ice cream sales do not cause shark attacks. Understanding this concept is crucial, especially in statistical modelling. To account for the third factor in a model, we can include it as an additional predictor, and consider an interaction term if its effect is expected to vary.
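A common way to account for a third factor in a regression is to add it as a predictor alongside the variable of interest. The sketch below uses simulated data rather than the real file, and every number in it (coefficients, noise levels, seed) is an assumption chosen for illustration. The simulated attacks depend on temperature only, so any ice cream effect that appears is spurious.

```r
# Simulated stand-in for the shark data; all parameters are made up.
set.seed(2024)
n <- 84
temperature <- runif(n, 10, 30)
ice_cream   <- 50 + 2 * temperature + rnorm(n, sd = 10)
attacks     <- 10 + 1.2 * temperature + rnorm(n, sd = 4)  # no ice cream term

# Model 1 ignores temperature; model 2 includes it as a covariate
naive    <- lm(attacks ~ ice_cream)
adjusted <- lm(attacks ~ ice_cream + temperature)

slope_naive    <- unname(coef(naive)["ice_cream"])
slope_adjusted <- unname(coef(adjusted)["ice_cream"])
c(naive = slope_naive, adjusted = slope_adjusted)
```

In the naive model the ice cream slope is clearly positive, while in the adjusted model it shrinks toward zero because temperature absorbs the shared variation. On the real data the analogous call would be `lm(SharkAttacks ~ IceCreamSales + Temperature, data = shark)`.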