Theoretical Concepts

There are two classification problems

Explanatory vs Response

The distinction between explanatory and response variables comes from theory about the specific phenomenon; statistical methods can’t make this distinction. They can reveal the degree of association or correlation between two variables, but not the direction of causation. Whenever two variables, A and B, are related, there are three possibilities.

    1. A causes B.
    2. B causes A.
    3. C causes both A and B.

For an amusing discussion see this.
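The third possibility is the easiest to overlook. The following is a minimal simulation sketch, not real data: the names `hidden`, `a`, and `b`, the coefficients, and the seed are all assumptions chosen only for illustration. It builds two variables that both depend on a common cause and never on each other, then compares their correlation before and after the common cause is accounted for.

```r
set.seed(1)                        # arbitrary seed so the run is repeatable
n <- 1000
hidden <- rnorm(n)                 # C: the common cause (simulated)
a <- 2.0 * hidden + rnorm(n)       # A depends on C only, never on B
b <- 1.5 * hidden + rnorm(n)       # B depends on C only, never on A

# A and B are clearly correlated although neither causes the other
cor_ab <- cor(a, b)

# After removing the part of each variable that C explains,
# what is left over is nearly uncorrelated
cor_ab_given_c <- cor(resid(lm(a ~ hidden)), resid(lm(b ~ hidden)))

print(c(raw = cor_ab, adjusted_for_C = cor_ab_given_c))
```

In real data the common cause is usually not measured, which is why the correlation alone cannot tell the three possibilities apart.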

Why are Waffle House concentration and obesity rates correlated? Note that both of these variables (Waffle Houses per 100,000 residents and the fraction of the population classified as obese) are quantitative.

Give arguments for:

  • Waffle House causes obesity
  • Obesity causes Waffle House
  • Something else causes both obesity and Waffle House.

Essential R Commands

This section reviews the essential R commands and statistical concepts you will need for Module 2. I will use the mtcars dataframe from the datasets package, which is normally included in every R distribution. You may want to look in the lower right pane of RStudio and verify that datasets is there and that the box to the left of it is checked.

Review Single Variable Boxplot and Summary

The first topic is how to read an individual boxplot. It is useful to run both the summary command and the boxplot command on the same variable. Note that since we are using variables within a dataframe, we need to prefix the variable names with mtcars$.

boxplot(mtcars$mpg)

summary(mtcars$mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.42   19.20   20.09   22.80   33.90
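The numbers that a boxplot draws can also be inspected directly. The sketch below uses `boxplot.stats()`, `quantile()`, and `IQR()` from base R. One caveat: the box edges are hinges, which can differ slightly from the quartiles that `summary()` reports.

```r
x <- mtcars$mpg

# The five values boxplot() draws, plus any points shown as outliers
bs <- boxplot.stats(x)
bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out     # values plotted individually as outliers (empty if there are none)

# Quartiles as computed by summary(); may differ slightly from the hinges
quantile(x, c(0.25, 0.5, 0.75))

# Interquartile range based on those quartiles
IQR(x)
```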

Click here for a video explanation

Side-by-Side Boxplots

Now we want to use side-by-side boxplots to see how a categorical explanatory variable influences a numerical response variable. We will stay with mpg as the response variable and use the number of carburetors as the categorical variable.

boxplot(mtcars$mpg~mtcars$carb)
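A numeric companion to the side-by-side boxplots can help when reading them. This sketch uses `table()` and `tapply()` from base R; the names `group_sizes` and `group_medians` are only illustrative. It shows how many cars fall in each carburetor group and the median mpg of each group, which is the center line of each box.

```r
# The same plot can also be written with the data argument:
# boxplot(mpg ~ carb, data = mtcars)

group_sizes   <- table(mtcars$carb)                       # cars per carburetor count
group_medians <- tapply(mtcars$mpg, mtcars$carb, median)  # center line of each box

print(group_sizes)
print(group_medians)
```

Checking the group sizes first is worthwhile, because a box built from only one or two cars says very little about that group.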

Click here for a video explanation

Quantitative Explanatory and Response Variables

Now let’s look at the case where both the explanatory and response variables are quantitative.

First let’s examine the relationship between engine displacement (explanatory) and mpg (response) graphically. We expect greater displacement to be associated with reduced mpg, so a scatterplot should show that points farther to the right are lower. The correlation coefficient should be negative and well away from zero, probably closer to -1 than to 0.

plot(mtcars$disp,mtcars$mpg)

cor(mtcars$disp,mtcars$mpg)
## [1] -0.8475514
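To see where this number comes from, the correlation coefficient can be rebuilt from its definition: the covariance divided by the product of the two standard deviations. This is a small sketch using base R functions only; `r_manual` is just an illustrative name.

```r
x <- mtcars$disp
y <- mtcars$mpg

# Correlation = covariance divided by the product of the standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

r_manual
cor(x, y)   # same value as the manual calculation
cor(y, x)   # unchanged when the two variables are swapped
```

Because swapping the variables changes nothing, the correlation by itself cannot say which variable is explanatory and which is the response.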

Click here for a video explanation

We can take this a step further and create a model of the relationship between engine displacement and gas mileage using linear regression.

The idea is to assume that there is a linear relationship of the form

\[ \text{mpg} = m \cdot \text{disp} + b \]

You should recognize this as the slope-intercept form of a straight line. We can use the function lm() in R to derive estimates of the parameters \(m\) and \(b\) from the existing data. The slope, \(m\), is the more important of these two parameters. It tells us how much gas mileage is expected to change, and in which direction, when engine displacement increases. We expect it to have a negative value in this case.

lm1 <- lm(mpg~disp,data = mtcars)
summary(lm1)
## 
## Call:
## lm(formula = mpg ~ disp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8922 -2.2022 -0.9631  1.6272  7.2305 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.599855   1.229720  24.070  < 2e-16 ***
## disp        -0.041215   0.004712  -8.747 9.38e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.251 on 30 degrees of freedom
## Multiple R-squared:  0.7183, Adjusted R-squared:  0.709 
## F-statistic: 76.51 on 1 and 30 DF,  p-value: 9.38e-10
plot(mtcars$disp,mtcars$mpg)
abline(lm1)

Note that we need to ask for the summary of the model we create since creating a model does not automatically display it.
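The estimates can also be pulled out of the model object without reading them off the printed summary. The sketch below refits the same model so that it runs on its own. It then checks two facts that hold for a regression with a single explanatory variable: the slope equals the correlation times sd(y)/sd(x), and Multiple R-squared is the square of the correlation. The names `m_from_r` and `r_squared` are only illustrative.

```r
lm1 <- lm(mpg ~ disp, data = mtcars)   # same model as above, refit here

coefs <- coef(lm1)                     # named vector of estimates
b <- coefs[["(Intercept)"]]            # intercept
m <- coefs[["disp"]]                   # slope

# With one explanatory variable, slope = r * sd(y) / sd(x)
r <- cor(mtcars$disp, mtcars$mpg)
m_from_r <- r * sd(mtcars$mpg) / sd(mtcars$disp)

# Multiple R-squared is the square of the correlation coefficient
r_squared <- summary(lm1)$r.squared

print(c(m = m, m_from_r = m_from_r))
print(c(r_squared = r_squared, r_times_r = r^2))
```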

The value of \(m\) is given as \(-0.041\) in the output (the Estimate in the disp row). This means that each additional unit of displacement (one cubic inch) is associated with a decrease of about 0.041 mpg. This is easier to grasp by thinking of the impact of an extra 100 cubic inches of engine displacement, which corresponds to a decrease of about 4 mpg in gas mileage.
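The 100 cubic inch comparison can be checked with `predict()`. This is a sketch under stated assumptions: the two displacement values, 200 and 300, are hypothetical cars picked only because they differ by 100 cubic inches, and the model is refit so the block runs on its own.

```r
lm1 <- lm(mpg ~ disp, data = mtcars)

# Two hypothetical cars whose displacements differ by 100 cubic inches
new_cars <- data.frame(disp = c(200, 300))
pred <- predict(lm1, newdata = new_cars)

pred                                  # predicted mpg at 200 and at 300 cubic inches
change <- unname(pred[2] - pred[1])
change                                # equals 100 times the slope
```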

Recreating the scatterplot and adding the regression line to it helps in judging how well the model fits the data.

Two Categorical Variables

Now let’s examine a relationship between two categorical variables. One of these variables, am, is purely categorical: automatic transmissions are coded with a “0” and manual transmissions are coded with a “1.” The other variable, cyl, the number of cylinders, could be considered quantitative, but since it takes only a small number of values, we can treat it as categorical.

We will use a table to describe the relationship numerically and a mosaic plot of the table to describe it graphically.

table(mtcars$am,mtcars$cyl)
##    
##      4  6  8
##   0  3  4 12
##   1  8  3  2
mosaicplot(table(mtcars$am,mtcars$cyl))
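Raw counts are hard to compare when the two transmission groups have different sizes, so it can help to convert the table to proportions. The sketch below uses `addmargins()` and `prop.table()` from base R; the dimension labels `am` and `cyl` and the name `row_props` are added here only for readability.

```r
tab <- table(am = mtcars$am, cyl = mtcars$cyl)

addmargins(tab)                             # counts with row and column totals

# Share of each cylinder count within each transmission type (rows sum to 1)
row_props <- prop.table(tab, margin = 1)
round(row_props, 2)
```

The row proportions correspond to how each band of the mosaic plot is divided.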

Click here for a video explanation