What we’re really interested in when we examine the relationship between two variables is whether or not there is a causal relationship between the two. There is usually a theory telling us that the value of one of the two variables influences the value of the other variable. The first variable is referred to as the explanatory variable and the second variable is referred to as the response. There are other pairs of terms, which may be more familiar
Whatever terms we use, there are four possible types of analysis depending on the types of the two variables.
In this course, we will study only the first three. The fourth one is very important, but too complicated to pursue in an introductory course.
We are about to examine some graphical and numerical techniques for exploring the idea of causal relationships. We will see if there is some correlation between the variables. However we must always repeat the mantra “Correlation does not imply Causation.”
Whenever we examine data and think we should conclude that some variable A causes another variable B, we have to recoginize that there are four possibilities for what we see.
For each of these situations identify the explanatory variable, the response variable and the type (Categorical or Quantitative of each).
Someone suggests that People who live in King County tend to be younger than people in the rest of the State of Washington.
Someone suggests that taller people get better scores on a reading exam than shorter people.
Someone suggests that people who drive red cars are more likely to get speeding tickets on I-5.
Someone tells you that people with higher blood pressure are more likely to have heart atacks.
Write down your own answers to these questions and then click here
For each of the following observed correlations and causality conclusions, comment on why the observed correlation might not imply the suggested causation based on first three of the four possibilities listed above.
Someone observes a strong positive correlation between height and reading scores.
People who drink red wine tend to live longer.
People who live in zip codes where the population has higher levels of education have higher incomes. So, higher education leads to higher income.
People who switch from other auto insurers to GEICO save a lot of money compared to those who don’t switch.
Write down your own answers to these questions and then click here
Let’s start by looking at a video of a case study published by OpenIntro.org. In this case, the explanatory variable is whether or not a stent is used to treat a medical condition and the response variable is whether or not the patient experiences a stroke. Visit https://www.youtube.com/watch?list=PLkIselvEzpM6pZ76FD3NoCvvgkj_p-dE8&v=nEHFF1ADpWE for the video. Right click and select “Open in New Tab”.
Let’s follow up by looking at that data, which I will load into my R workspace as stent365 and examine with str() and summary().
load("/cloud/project/stent365.Rdata")
str(stent365)
## 'data.frame': 451 obs. of 2 variables:
## $ group : Factor w/ 2 levels "control","treatment": 2 2 2 2 2 2 2 2 2 2 ...
## $ outcome: Factor w/ 2 levels "no event","stroke": 2 2 2 2 2 2 2 2 2 2 ...
summary(stent365)
## group outcome
## control :227 no event:378
## treatment:224 stroke : 73
To recreate the table in the video, we can just run the table command on the two vectors in the dataframe.
table(stent365$group,stent365$outcome)
##
## no event stroke
## control 199 28
## treatment 179 45
We can get a better table with all of the arithmetic done by using the CrossTable() command from the package gmodels. First we need to install and load it. You don’t need to understand this code.
package = "gmodels"
if (!require(package, character.only=T, quietly=T)) {
install.packages(package)
library(package, character.only=T)
}
Now run the CrossTable command on our variables. You’ll see the numbers referred to in the video. They are described as N/Row Total.
CrossTable(stent365$group,stent365$outcome)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 451
##
##
## | stent365$outcome
## stent365$group | no event | stroke | Row Total |
## ---------------|-----------|-----------|-----------|
## control | 199 | 28 | 227 |
## | 0.402 | 2.080 | |
## | 0.877 | 0.123 | 0.503 |
## | 0.526 | 0.384 | |
## | 0.441 | 0.062 | |
## ---------------|-----------|-----------|-----------|
## treatment | 179 | 45 | 224 |
## | 0.407 | 2.108 | |
## | 0.799 | 0.201 | 0.497 |
## | 0.474 | 0.616 | |
## | 0.397 | 0.100 | |
## ---------------|-----------|-----------|-----------|
## Column Total | 378 | 73 | 451 |
## | 0.838 | 0.162 | |
## ---------------|-----------|-----------|-----------|
##
##
We can display this graphically by saving the crude table as an object and doing a mosaic plot of the object. I’ll add a title to the plot by using the optional main argument in the call to mosaicplot.
George = table(stent365$group,stent365$outcome)
mosaicplot(George,main = "Plot of Group by Outcome")
Now lets give you something to practice on. We can use the depression dataframe that you downloaded from CMU. Right click on the link and select “Open in new tab”. https://www.youtube.com/watch?v=_gjTgCPkBGU&feature=youtube_gdata
Here is the str() output to remind you of what the data looks like.
str(depression)
## 'data.frame': 109 obs. of 7 variables:
## $ Hospt : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Treat : int 0 1 1 0 0 2 0 2 2 2 ...
## $ Outcome: int 1 0 0 1 0 1 0 1 0 1 ...
## $ Time : num 36.1 105.1 74.6 49.7 14.4 ...
## $ AcuteT : int 211 176 191 206 63 70 55 512 162 306 ...
## $ Age : int 33 49 50 29 29 30 56 48 22 61 ...
## $ Gender : int 1 1 1 2 1 2 1 1 2 2 ...
I want you to consider whether or not the different hospitals vary in their treatment choices. You should create output similar to what is above and think about what it is telling you. It would be good for you to write down your observations. After you complete these tasks, watch the video https://www.youtube.com/watch?v=v4CUh-fRO5E&feature=youtube_gdata to see my output and comments.
For this case, we’ll work with the mtcars dataframe, which is always avaialble in base R. it is included in the datasets package, where you can see some documentation.
Let’s do an str() and summary() to look at the data first.
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
We have a categorical variable (even though it’s coded numerically), am, which is 0 f the vehicle has an automatic transmission and 1 if the vehicle has a manual transmission. For a related quantitative variable we’ll use mpg. What we’re doing is asking if cars with stick shifts get better gas mileage than those with automatic transmissions.
Now let’s use the tapply() and summary() functions together to get the numerical results.
tapply(mtcars$mpg,mtcars$am,summary)
## $`0`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 14.95 17.30 17.15 19.20 24.40
##
## $`1`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.00 21.00 22.80 24.39 30.40 33.90
Judging from these numerical results, it looks like the stick shift cars do get better mileage. Now let’s look at the graphical results.
boxplot(mtcars$mpg~mtcars$am)
The graphical results confirm what we saw with the numerical results.
Now it’s your turn. Use the mtcars dataframe and see if the number of cylinders (cyl) influences mpg. What do you think? After you’re done, click here to see my analysis.
We’ll stick with the mtcars dataset since it has many numerical variables that might be related. I suspect that engines with larger displacemnts get poorer gas mileage. First, we’ll do a scatterplot with the base plot command. The independent variable should be entered first and will be graphed on the horizontal axis. We’ll also get the correlation coefficient. So let’s do this
plot(mtcars$disp,mtcars$mpg)
cor(mtcars$disp,mtcars$mpg)
## [1] -0.8475514
The graphical and numerical results confirm our theory. In the scatterplot, the points are lower as we move to the right. The correlation coefficient is close to 1 in absolute value and it has a negative value. So, it makes sense to construct a linear model in which displacement is the independent variable and mpg is the dependent variable. The general form of this model is given by
\[mpg = m*disp+b\]
Note that I have dropped the prefix mtcars$, since we are not submitting the written equation to R.
This is the general form of what we want. Given values of \(m\) and \(b\), we would have a formula we could use to predict a car’s mpg if we knew its engine displacement. There is a procedure in R that will give us estimates of these two numbers based on the data we have. The function lm() produces linear models. We’ll create a linear model, named Sally and then we’ll use summary() to see the details of Sally. I’m using a familiar name jut to make the point that you can name an R object anything you want, as long as you stay within the rules.
Sally = lm(mtcars$mpg~mtcars$disp)
summary(Sally)
##
## Call:
## lm(formula = mtcars$mpg ~ mtcars$disp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8922 -2.2022 -0.9631 1.6272 7.2305
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.599855 1.229720 24.070 < 2e-16 ***
## mtcars$disp -0.041215 0.004712 -8.747 9.38e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.251 on 30 degrees of freedom
## Multiple R-squared: 0.7183, Adjusted R-squared: 0.709
## F-statistic: 76.51 on 1 and 30 DF, p-value: 9.38e-10
The values of \(m\) and \(b\) are in the column labeled Estimate. \(b=29.599855\) and \(m=-0.041215\).
Let’s store these values under these names in our R workspace for convenience. Then we can use them to calculate predict mpg values for different values of disp.
let’s see what happens when we predict the value of mpg for two different displacement values, 200 and 201. The change in displacement between these two is only one cubic inch.
b=29.599855
m=-0.041215
mpg200 = m * 200 + b
mpg200
## [1] 21.35686
Now predict the mpg if disp is 201 instead of 200.
mpg201 = m * 201 + b
mpg201
## [1] 21.31564
Now see what happened when we increased disp by 1.
delta.mpg = mpg201 - mpg200
delta.mpg
## [1] -0.041215
There are a few points to be made based on these computations.
We can add the actual line to the scatterplot with the abline command.
Note that we need to re-run the plot command right before we run the abline command. Also note that we used a tilde, “~” rather than a comma, “,” between the two variables in the plot command. This alternative puts the explanatory and response variables in the same place in the plot command as in the lm command.
plot(mtcars$mpg~mtcars$disp)
abline(Sally)
Note how the line does seem to fit the points. Another point to make is that the lower left corner in the plot display is not (0,0). It has a y-coordinate of about 10 and an x-coordinate of about 75. That explains why the plotted line does not intercept the apparent vertical axis at the value given by the intercept in the linear model output.
Now it’s your turn again. use the tools explained above to investigate the relationship between the weight of the vehicle and the gas mileage. When you’re done click https://www.youtube.com/watch?v=L0cHLndntCE&feature=youtube_gdata to see how I would do this.