Lab 6

Categorical Predictors

  • Categorical variables must be numerically encoded to be used in regression.
  • Dummy coding is the most common encoding method.
  • In R, this is done automatically when a categorical variable is included in a model.

Automatic Dummy Coding in R

Load data:

data("iris")

Species is a categorical variable in the iris dataset with three levels (setosa, versicolor, and virginica):

levels(iris$Species)
[1] "setosa"     "versicolor" "virginica" 

Here is an example model predicting Sepal.Length from Species. When running a multiple linear regression (MLR) with categorical variables, R automatically chooses one level to be the reference group. In this case, the reference level is setosa, which is represented by the intercept in the model. The other two coefficients estimate each species’ difference in mean Sepal.Length from setosa.

model1 <- lm(Sepal.Length ~ Species, data = iris)
summary(model1)

Call:
lm(formula = Sepal.Length ~ Species, data = iris)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.6880 -0.3285 -0.0060  0.3120  1.3120 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)         5.0060     0.0728  68.762  < 2e-16 ***
Speciesversicolor   0.9300     0.1030   9.033 8.77e-16 ***
Speciesvirginica    1.5820     0.1030  15.366  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5148 on 147 degrees of freedom
Multiple R-squared:  0.6187,    Adjusted R-squared:  0.6135 
F-statistic: 119.3 on 2 and 147 DF,  p-value: < 2.2e-16
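If we want a different reference group, one option (a sketch) is to reorder the factor levels with relevel() before fitting; the variable and model names here are hypothetical:

# Hypothetical example: make virginica the reference level instead
iris$Species2 <- relevel(iris$Species, ref = "virginica")
model_ref <- lm(Sepal.Length ~ Species2, data = iris)

The coefficients change (they now measure differences from virginica), but the fitted values and overall fit stay the same.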

Manual Dummy Coding in R

If we want, we can dummy code the variable manually by assigning numeric values ourselves. The following code creates two new dummy variables, versicolor and virginica. The versicolor variable is 1 for rows where the species is versicolor and 0 for all other rows; the virginica variable is defined the same way for virginica. Setosa is then implied when both variables equal 0.

iris$versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris$virginica  <- ifelse(iris$Species == "virginica", 1, 0)
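A quick sanity check (optional) is to cross-tabulate each dummy variable against Species to confirm the coding:

# Each dummy should be 1 only for its own species
table(iris$Species, iris$versicolor)
table(iris$Species, iris$virginica)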

When we run the MLR using our manually created dummy variables, we get the same output as the automatic model:

model2 <- lm(Sepal.Length ~ versicolor + virginica, data = iris)
summary(model2)

Call:
lm(formula = Sepal.Length ~ versicolor + virginica, data = iris)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.6880 -0.3285 -0.0060  0.3120  1.3120 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.0060     0.0728  68.762  < 2e-16 ***
versicolor    0.9300     0.1030   9.033 8.77e-16 ***
virginica     1.5820     0.1030  15.366  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5148 on 147 degrees of freedom
Multiple R-squared:  0.6187,    Adjusted R-squared:  0.6135 
F-statistic: 119.3 on 2 and 147 DF,  p-value: < 2.2e-16

ANOVA

You may have learned about ANOVA, or analysis of variance, in previous stats classes. This is another statistical test used to compare differences between group means. We can get an ANOVA table by using the aov() function in R.

model_anova <- aov(Sepal.Length ~ Species, data = iris)
summary(model_anova)
             Df Sum Sq Mean Sq F value Pr(>F)    
Species       2  63.21  31.606   119.3 <2e-16 ***
Residuals   147  38.96   0.265                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
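As an optional aside, aov objects also work with TukeyHSD(), which tests all pairwise differences between group means:

# Pairwise comparisons between species (optional extension)
TukeyHSD(model_anova)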

You can also obtain the same output by calling anova() on the linear model we created before.

anova(model1)
Analysis of Variance Table

Response: Sepal.Length
           Df Sum Sq Mean Sq F value    Pr(>F)    
Species     2 63.212  31.606  119.26 < 2.2e-16 ***
Residuals 147 38.956   0.265                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • This is the same anova() function we used earlier to compare models, except this time we only gave it one input! A sketch of the earlier two-model form is shown below.
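As a sketch, comparing model1 against an intercept-only model reproduces the same F-test; the name null_model is hypothetical:

# The two-model form of anova(): same F-statistic as above
null_model <- lm(Sepal.Length ~ 1, data = iris)
anova(null_model, model1)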

Compared to the previous lm() output, what are the similarities? What are the differences?

Visualizing Categorical Data

  • We’ve previously seen how to create scatterplots of continuous data.

  • We just learned how to run statistical tests on data with categorical predictors.

    • Now let’s see how we can visualize this!

ggplot() review

library(ggplot2)

Previously we just used ggplot() combined with geom_point() to visualize our continuous data.

p <- ggplot(iris, aes(Species, Sepal.Length))
p + geom_point()

A scatterplot does work with a categorical predictor, although there are plot types better suited to this kind of data. Let’s consider a few of these.

geom_col()

There are two ways to make bar charts in ggplot2: geom_bar() and geom_col().

  • geom_bar(): categorical predictor with counts as the outcome

  • geom_col(): categorical predictor with a continuous outcome

Because our iris dataset has a continuous outcome variable, we would want to use geom_col().

p + geom_col()
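One caveat: with 50 flowers per species, geom_col() stacks the individual values, so each bar’s height is the sum of Sepal.Length within that species. If you want bars of group means instead, one option (a sketch) is stat_summary():

# A sketch: bars showing the mean Sepal.Length per species
p + stat_summary(fun = mean, geom = "col")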

geom_jitter()

You can think of geom_jitter() as in-between a scatterplot and a barplot.

p + geom_jitter(width = 0.1)

This adds a small amount of random noise to the points to prevent overcrowding/overlapping points (compare with the scatterplot example above).

  • width determines how much random variability is added (see the reproducibility note below)

  • What is this useful for?

    • Visualizing the distribution of points rather than just the mean.
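Because the jitter is random, the plot changes slightly each time you run it. Setting a seed first makes it reproducible; the seed value below is arbitrary:

set.seed(42)  # arbitrary seed for reproducible jitter
p + geom_jitter(width = 0.1)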

Data Cleaning

Often you will work with data that are not quite as nice as the datasets we’ve seen so far in this class. Let’s see what happens when we have data with:

  • Missing data

  • Outliers

  • High leverage points

Data

For this portion of the lab, download the lab6.csv file from Canvas. Read it into a data frame and inspect it to see if you notice anything unusual.

df <- read.csv("lab6.csv")
df
    x  y
1   1  1
2   2  3
3   3  5
4   4  7
5   5 NA
6   6 11
7   7 13
8  NA 15
9   9 17
10 18  9

Missing Data

Now let’s remove any rows with missing data:

df1 <- na.omit(df)
df1
    x  y
1   1  1
2   2  3
3   3  5
4   4  7
6   6 11
7   7 13
9   9 17
10 18  9
  • na.omit() removes rows that have missing data in any column.

Many functions (such as sum(), mean(), and sd()) also accept an na.rm = TRUE argument, which tells them to ignore missing values in that computation.

sum(df) # Sums every value in the data frame; the NAs make the result NA
[1] NA
sum(df, na.rm = TRUE)
[1] 136
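It can also help to see where the missing values are before removing them. One quick check (a sketch) counts the NAs in each column:

# Count missing values per column
colSums(is.na(df))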

Finally, ggplot2 will usually remove missing data for us automatically when plotting:

ggplot(df, aes(x, y)) + geom_point()

You will notice that you get a warning when this happens, so it never hurts to remove missing data ourselves beforehand!

Cook’s Distance

Looking at the graph below, do you notice any points that are out of place?

ggplot(df1, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

Let’s compute Cook’s distance for each point to see if we can better identify which points are “out of place”.

model <- lm(y ~ x, data = df1)
cooks <- cooks.distance(model)
cooks
           1            2            3            4            6            7 
1.988366e-01 6.538402e-02 1.291828e-02 1.715712e-05 2.879438e-02 6.736438e-02 
           9           10 
2.589227e-01 1.131034e+01 
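A common rule of thumb (one of several) flags observations whose Cook’s distance exceeds 4/n:

# Flag points above the 4/n cutoff (a rule of thumb, not a strict test)
n <- nrow(df1)
cooks[cooks > 4 / n]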

For better visualization, we can also plot our results:

plot(cooks)

What do you notice here?

Application

Find a public dataset online.

  • There are many good resources such as here and here.

    • Ideally look for one that has categorical predictors to practice what you’ve learned today!
  • Select a few variables you want to analyze, and remove any missing data.

  • Create a linear model of your data and chosen variables.

  • Create diagnostic plots (residuals, Cook’s distance).

  • Generate an appropriate visualization for your data.