Exploring Relationships Between Two Variables

General Points

What we’re really interested in when we examine the relationship between two variables is whether or not there is a causal relationship between the two. There is usually a theory telling us that the value of one of the two variables influences the value of the other variable. The first variable is referred to as the explanatory variable and the second variable is referred to as the response. There are other pairs of terms, which may be more familiar

• Cause/Effect
• Independent/Dependent
• Input/Output
• Driver/Outcome

Whatever terms we use, there are four possible types of analysis depending on the types of the two variables.

1. Categorical Explanatory and Categorical Response
2. Categorical Explanatory and Quantitative Response
3. Quantitative Explanatory and Quantitative Response
4. Quantitaive Explanatory and Categorical Response

In this course, we will study only the first three. The fourth one is very important, but too complicated to pursue in an introductory course.

We are about to examine some graphical and numerical techniques for exploring the idea of causal relationships. We will see if there is some correlation between the variables. However we must always repeat the mantra “Correlation does not imply Causation.”

Whenever we examine data and think we should conclude that some variable A causes another variable B, we have to recoginize that there are four possibilities for what we see.

• A really does cause B.
• B really does cause A.
• C, which is unobserved, causes both A and B.
• The observed results are simply the result of random fluctuations.

Questions on General Points

For each of these situations identify the explanatory variable, the response variable and the type (Categorical or Quantitative of each).

1. Someone suggests that People who live in King County tend to be younger than people in the rest of the State of Washington.

2. Someone suggests that taller people get better scores on a reading exam than shorter people.

3. Someone suggests that people who drive red cars are more likely to get speeding tickets on I-5.

4. Someone tells you that people with higher blood pressure are more likely to have heart atacks.

For each of the following observed correlations and causality conclusions, comment on why the observed correlation might not imply the suggested causation based on first three of the four possibilities listed above.

1. Someone observes a strong positive correlation between height and reading scores.

2. People who drink red wine tend to live longer.

3. People who live in zip codes where the population has higher levels of education have higher incomes. So, higher education leads to higher income.

4. People who switch from other auto insurers to GEICO save a lot of money compared to those who don’t switch.

Categorical Explanatory and Categorical Response

Let’s start by looking at a video of a case study published by OpenIntro.org. In this case, the explanatory variable is whether or not a stent is used to treat a medical condition and the response variable is whether or not the patient experiences a stroke. Click here for the video.

Let’s follow up by looking at that data, which I will load into my R workspace as stent365 and examine with str() and summary().

load("stent365.RData")
str(stent365)
## 'data.frame':    451 obs. of  2 variables:
##  $group : Factor w/ 2 levels "control","treatment": 2 2 2 2 2 2 2 2 2 2 ... ##$ outcome: Factor w/ 2 levels "no event","stroke": 2 2 2 2 2 2 2 2 2 2 ...
summary(stent365)
##        group         outcome
##  control  :227   no event:378
##  treatment:224   stroke  : 73

To recreate the table in the video, we can just run the table command on the two vectors in the dataframe.

table(stent365$group,stent365$outcome)
##
##             no event stroke
##   control        199     28
##   treatment      179     45

We can display this graphically by saving the table as an object and doing a mosaic plot of the object. I’ll add a title to the plot by using the optional main argument in the call to mosaicplot.

tbl.obj = table(stent365$group,stent365$outcome)
mosaicplot(tbl.obj,main = "Plot of Group by Outcome")

Now lets give you something to practice on. We can use the depression dataframe that you downloaded from CMU. Here is a video showing how to load it

Here is the str() output to remind you of what the data looks like.

str(depression)
## 'data.frame':    109 obs. of  7 variables:
##  $Hospt : int 1 1 1 1 1 1 1 1 1 1 ... ##$ Treat  : int  0 1 1 0 0 2 0 2 2 2 ...
##  $Outcome: int 1 0 0 1 0 1 0 1 0 1 ... ##$ Time   : num  36.1 105.1 74.6 49.7 14.4 ...
##  $AcuteT : int 211 176 191 206 63 70 55 512 162 306 ... ##$ Age    : int  33 49 50 29 29 30 56 48 22 61 ...
##  $Gender : int 1 1 1 2 1 2 1 1 2 2 ... I want you to consider whether or not the different hospitals vary in their treatment choices. You should create output similar to what is above and think about what it is telling you. It would be good for you to write down your observations. After you complete these tasks, click here to see my output and comments. Categorical Explanatory and Quantitative Response For this case, we’ll work with the mtcars dataframe, which is always avaialble in base R. it is included in the datasets package, where you can see some documentation. Let’s do an str() and summary() to look at the data first. str(mtcars) ## 'data.frame': 32 obs. of 11 variables: ##$ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $cyl : num 6 6 4 6 8 6 8 4 4 6 ... ##$ disp: num  160 160 108 258 360 ...
##  $hp : num 110 110 93 110 175 105 245 62 95 123 ... ##$ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $wt : num 2.62 2.88 2.32 3.21 3.44 ... ##$ qsec: num  16.5 17 18.6 19.4 17 ...
##  $vs : num 0 0 1 1 0 1 0 1 1 1 ... ##$ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $gear: num 4 4 4 3 3 3 3 4 4 4 ... ##$ carb: num  4 4 1 1 2 1 4 2 2 4 ...
summary(mtcars)
##       mpg             cyl             disp             hp
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0
##       drat             wt             qsec             vs
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000
##        am              gear            carb
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000
##  Median :0.0000   Median :4.000   Median :2.000
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

We have a categorical variable (even though it’s coded numerically), am, which is 0 f the vehicle has an automatic transmission and 1 if the vehicle has a manual transmission. For a related quantitative variable we’ll use mpg. What we’re doing is asking if cars with stick shifts get better gas mileage than those with automatic transmissions.

Now let’s use the tapply() and summary() functions together to get the numerical results.

tapply(mtcars$mpg,mtcars$am,summary)
## $0 ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 10.40 14.95 17.30 17.15 19.20 24.40 ## ##$1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   15.00   21.00   22.80   24.39   30.40   33.90

Judging from these numerical results, it looks like the stick shift cars do get better mileage. Now let’s look at the graphical results.

boxplot(mtcars$mpg~mtcars$am)

The graphical results confirm what we saw with the numerical results.

Now it’s your turn. Use the mtcars dataframe and see if the number of cylinders (cyl) influences mpg. What do you think? After you’re done, click here to see my analysis.

Quantitative Explanatory and Quantitative Response

We’ll stick with the mtcars dataset since it has many numerical variables that might be related. I suspect that engines with larger displacemnts get poorer gas mileage. First, we’ll do a scatterplot with the base plot command. The independent variable should be entered first and will be graphed on the horizontal axis. We’ll also get the correlation coefficient. So let’s do this

plot(mtcars$disp,mtcars$mpg)

cor(mtcars$disp,mtcars$mpg)
## [1] -0.8475514

The graphical and numerical results confirm our theory. In the scatterplot, the points are lower as we move to the right. The correlation coefficient is close to 1 in absolute value and it has a negative value. So, it makes sense to construct a linear model in which displacement is the independent variable and mpg is the dependent variable. The general form of this model is given by

$mpg = m*disp+b$

Note that I have dropped the prefix mtcars$, since we are not submitting the written equation to R. This is the general form of what we want. Given values of $$m$$ and $$b$$, we would have a formula we could use to predict a car’s mpg if we knew its engine displacement. There is a procedure in R that will give us estimates of these two numbers based on the data we have. The function lm() produces linear models. We’ll create a linear model, named Sally and then we’ll use summary() to see the details of Sally. I’m using a familiar name jut to make the point that you can name an R object anything you want, as long as you stay within the rules. Sally = lm(mtcars$mpg~mtcars$disp) summary(Sally) ## ## Call: ## lm(formula = mtcars$mpg ~ mtcars$disp) ## ## Residuals: ## Min 1Q Median 3Q Max ## -4.8922 -2.2022 -0.9631 1.6272 7.2305 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 29.599855 1.229720 24.070 < 2e-16 *** ## mtcars$disp -0.041215   0.004712  -8.747 9.38e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.251 on 30 degrees of freedom
## Multiple R-squared:  0.7183, Adjusted R-squared:  0.709
## F-statistic: 76.51 on 1 and 30 DF,  p-value: 9.38e-10

The values of $$m$$ and $$b$$ are in the column labeled Estimate. $$b=29.599855$$ and $$m=-0.041215$$.

Let’s store these values under these names in our R workspace for convenience. Then we can use them to calculate predict mpg values for different values of disp.

b=29.599855
m=-0.041215

mpg1000 = m * 1000 + b
mpg1100 = m * 1100 + b
delta.100 = mpg1100 - mpg1000
mpgzero = m * 0 + b

There are a few points to be made based on these computations.

1. $$b$$ is the intercept, the predicted value of mpg when disp = 0.
2. The slope is the value of $$m$$. Note that when we increased engine displacement by 100, from 1,000 to 1100, the mpg decreased by $$m * 100$$.

We can add the actual line to the scatterplot with the abline command.

plot(mtcars$mpg~mtcars$disp)
abline(Sally)

Note how the line does seem to fit the points. Another point to make is that the lower left corner in the plot display is not (0,0). It has a y-coordinate of about 10 and an x-coordinate of about 75. That explains why the plotted line does not intercept the apparent vertical axis at the value given by the intercept in the linear model output.

Now it’s your turn again. use the tools explained above to investigate the relationship between the weight of the vehicle and the gas mileage. When you’re done click here to see how I would do this.