What we’re really interested in when we examine the relationship between two variables is whether or not there is a causal relationship between the two. There is usually a theory telling us that the value of one of the two variables influences the value of the other variable. The first variable is referred to as the explanatory variable and the second variable is referred to as the response. There are other pairs of terms, which may be more familiar

- Cause/Effect
- Independent/Dependent
- Input/Output
- Driver/Outcome

Whatever terms we use, there are four possible types of analysis depending on the types of the two variables.

- Categorical Explanatory and Categorical Response
- Categorical Explanatory and Quantitative Response
- Quantitative Explanatory and Quantitative Response
- Quantitaive Explanatory and Categorical Response

In this course, we will study only the first three. The fourth one is very important, but too complicated to pursue in an introductory course.

We are about to examine some graphical and numerical techniques for exploring the idea of causal relationships. We will see if there is some correlation between the variables. However we must always repeat the mantra “Correlation does not imply Causation.”

Whenever we examine data and think we should conclude that some variable A causes another variable B, we have to recoginize that there are four possibilities for what we see.

- A really does cause B.
- B really does cause A.
- C, which is unobserved, causes both A and B.
- The observed results are simply the result of random fluctuations.

For each of these situations identify the explanatory variable, the response variable and the type (Categorical or Quantitative of each).

Someone suggests that People who live in King County tend to be younger than people in the rest of the State of Washington.

Someone suggests that taller people get better scores on a reading exam than shorter people.

Someone suggests that people who drive red cars are more likely to get speeding tickets on I-5.

Someone tells you that people with higher blood pressure are more likely to have heart atacks.

Write down your own answers to these questions and then click here

For each of the following observed correlations and causality conclusions, comment on why the observed correlation might not imply the suggested causation based on first three of the four possibilities listed above.

Someone observes a strong positive correlation between height and reading scores.

People who drink red wine tend to live longer.

People who live in zip codes where the population has higher levels of education have higher incomes. So, higher education leads to higher income.

People who switch from other auto insurers to GEICO save a lot of money compared to those who don’t switch.

Write down your own answers to these questions and then click here

Let’s start by looking at a video of a case study published by OpenIntro.org. In this case, the explanatory variable is whether or not a stent is used to treat a medical condition and the response variable is whether or not the patient experiences a stroke. Click here for the video.

Let’s follow up by looking at that data, which I will load into my R workspace as stent365 and examine with str() and summary().

```
load("stent365.RData")
str(stent365)
```

```
## 'data.frame': 451 obs. of 2 variables:
## $ group : Factor w/ 2 levels "control","treatment": 2 2 2 2 2 2 2 2 2 2 ...
## $ outcome: Factor w/ 2 levels "no event","stroke": 2 2 2 2 2 2 2 2 2 2 ...
```

`summary(stent365)`

```
## group outcome
## control :227 no event:378
## treatment:224 stroke : 73
```

To recreate the table in the video, we can just run the table command on the two vectors in the dataframe.

`table(stent365$group,stent365$outcome)`

```
##
## no event stroke
## control 199 28
## treatment 179 45
```

We can display this graphically by saving the table as an object and doing a mosaic plot of the object. I’ll add a title to the plot by using the optional main argument in the call to mosaicplot.

```
tbl.obj = table(stent365$group,stent365$outcome)
mosaicplot(tbl.obj,main = "Plot of Group by Outcome")
```

Now lets give you something to practice on. We can use the depression dataframe that you downloaded from CMU. Here is a video showing how to load it

Here is the str() output to remind you of what the data looks like.

`str(depression)`

```
## 'data.frame': 109 obs. of 7 variables:
## $ Hospt : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Treat : int 0 1 1 0 0 2 0 2 2 2 ...
## $ Outcome: int 1 0 0 1 0 1 0 1 0 1 ...
## $ Time : num 36.1 105.1 74.6 49.7 14.4 ...
## $ AcuteT : int 211 176 191 206 63 70 55 512 162 306 ...
## $ Age : int 33 49 50 29 29 30 56 48 22 61 ...
## $ Gender : int 1 1 1 2 1 2 1 1 2 2 ...
```

I want you to consider whether or not the different hospitals vary in their treatment choices. You should create output similar to what is above and think about what it is telling you. It would be good for you to write down your observations. After you complete these tasks, click here to see my output and comments.

For this case, we’ll work with the mtcars dataframe, which is always avaialble in base R. it is included in the datasets package, where you can see some documentation.

Let’s do an str() and summary() to look at the data first.

`str(mtcars)`

```
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
```

`summary(mtcars)`

```
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
```

We have a categorical variable (even though it’s coded numerically), am, which is 0 f the vehicle has an automatic transmission and 1 if the vehicle has a manual transmission. For a related quantitative variable we’ll use mpg. What we’re doing is asking if cars with stick shifts get better gas mileage than those with automatic transmissions.

Now let’s use the tapply() and summary() functions together to get the numerical results.

`tapply(mtcars$mpg,mtcars$am,summary)`

```
## $`0`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 14.95 17.30 17.15 19.20 24.40
##
## $`1`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.00 21.00 22.80 24.39 30.40 33.90
```

Judging from these numerical results, it looks like the stick shift cars do get better mileage. Now let’s look at the graphical results.

`boxplot(mtcars$mpg~mtcars$am)`

The graphical results confirm what we saw with the numerical results.

Now it’s your turn. Use the mtcars dataframe and see if the number of cylinders (cyl) influences mpg. What do you think? After you’re done, click here to see my analysis.

We’ll stick with the mtcars dataset since it has many numerical variables that might be related. I suspect that engines with larger displacemnts get poorer gas mileage. First, we’ll do a scatterplot with the base plot command. The independent variable should be entered first and will be graphed on the horizontal axis. We’ll also get the correlation coefficient. So let’s do this

`plot(mtcars$disp,mtcars$mpg)`

`cor(mtcars$disp,mtcars$mpg)`

`## [1] -0.8475514`

The graphical and numerical results confirm our theory. In the scatterplot, the points are lower as we move to the right. The correlation coefficient is close to 1 in absolute value and it has a negative value. So, it makes sense to construct a linear model in which displacement is the independent variable and mpg is the dependent variable. The general form of this model is given by

\[mpg = m*disp+b\]

Note that I have dropped the prefix mtcars$, since we are not submitting the written equation to R.

This is the general form of what we want. Given values of \(m\) and \(b\), we would have a formula we could use to predict a car’s mpg if we knew its engine displacement. There is a procedure in R that will give us estimates of these two numbers based on the data we have. The function lm() produces linear models. We’ll create a linear model, named Sally and then we’ll use summary() to see the details of Sally. I’m using a familiar name jut to make the point that you can name an R object anything you want, as long as you stay within the rules.

```
Sally = lm(mtcars$mpg~mtcars$disp)
summary(Sally)
```

```
##
## Call:
## lm(formula = mtcars$mpg ~ mtcars$disp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8922 -2.2022 -0.9631 1.6272 7.2305
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.599855 1.229720 24.070 < 2e-16 ***
## mtcars$disp -0.041215 0.004712 -8.747 9.38e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.251 on 30 degrees of freedom
## Multiple R-squared: 0.7183, Adjusted R-squared: 0.709
## F-statistic: 76.51 on 1 and 30 DF, p-value: 9.38e-10
```

The values of \(m\) and \(b\) are in the column labeled Estimate. \(b=29.599855\) and \(m=-0.041215\).

Let’s store these values under these names in our R workspace for convenience. Then we can use them to calculate predict mpg values for different values of disp.

```
b=29.599855
m=-0.041215
mpg1000 = m * 1000 + b
mpg1100 = m * 1100 + b
delta.100 = mpg1100 - mpg1000
mpgzero = m * 0 + b
```

There are a few points to be made based on these computations.

- \(b\) is the intercept, the predicted value of mpg when disp = 0.
- The slope is the value of \(m\). Note that when we increased engine displacement by 100, from 1,000 to 1100, the mpg decreased by \(m * 100\).

We can add the actual line to the scatterplot with the abline command.

```
plot(mtcars$mpg~mtcars$disp)
abline(Sally)
```

Note how the line does seem to fit the points. Another point to make is that the lower left corner in the plot display is not (0,0). It has a y-coordinate of about 10 and an x-coordinate of about 75. That explains why the plotted line does not intercept the apparent vertical axis at the value given by the intercept in the linear model output.

Now it’s your turn again. use the tools explained above to investigate the relationship between the weight of the vehicle and the gas mileage. When you’re done click here to see how I would do this.