Load data:
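The iris dataset ships with base R, so loading and inspecting it is straightforward:

```r
# iris is built into R; data() makes the loading step explicit
data(iris)
head(iris)
levels(iris$Species)  # the three levels of the Species factor
```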
Species is a categorical variable in the Iris dataset with 3 levels (setosa, versicolor, and virginica).
Here is an example model predicting Sepal.Length using Species. When running an MLR with categorical variables, R automatically chooses one level to be the reference group. In this case, that level is the setosa species, which is represented by the intercept in the model.
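We can fit and summarize the model like so (the formula matches the Call line in the output below):

```r
# Fit a linear model with a categorical predictor; R dummy codes Species for us
model <- lm(Sepal.Length ~ Species, data = iris)
summary(model)
```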
```
Call:
lm(formula = Sepal.Length ~ Species, data = iris)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.6880 -0.3285 -0.0060  0.3120  1.3120 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)         5.0060     0.0728  68.762  < 2e-16 ***
Speciesversicolor   0.9300     0.1030   9.033 8.77e-16 ***
Speciesvirginica    1.5820     0.1030  15.366  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5148 on 147 degrees of freedom
Multiple R-squared:  0.6187,  Adjusted R-squared:  0.6135 
F-statistic: 119.3 on 2 and 147 DF,  p-value: < 2.2e-16
```
If we want, we can manually dummy code the variable by assigning numerical values. The following code creates two new dummy variables labeled “versicolor” and “virginica”. The versicolor variable encodes every row as 0 unless the species is versicolor, in which case it encodes a 1. The equivalent is done for the virginica variable. Setosa is then implied when both variables equal 0.
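A minimal sketch of this dummy coding using ifelse() (one of several ways to do it):

```r
# 1 if the row is that species, 0 otherwise; setosa rows get 0 in both columns
iris$versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris$virginica  <- ifelse(iris$Species == "virginica", 1, 0)
```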
When we run the MLR using our manually created dummy variables, we get the same output as the automatic model:
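Fitting with the dummy variables we just created (the formula matches the Call line below):

```r
# Same model, but with our hand-made dummy variables as predictors
model_dummy <- lm(Sepal.Length ~ versicolor + virginica, data = iris)
summary(model_dummy)
```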
```
Call:
lm(formula = Sepal.Length ~ versicolor + virginica, data = iris)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.6880 -0.3285 -0.0060  0.3120  1.3120 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.0060     0.0728  68.762  < 2e-16 ***
versicolor    0.9300     0.1030   9.033 8.77e-16 ***
virginica     1.5820     0.1030  15.366  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5148 on 147 degrees of freedom
Multiple R-squared:  0.6187,  Adjusted R-squared:  0.6135 
F-statistic: 119.3 on 2 and 147 DF,  p-value: < 2.2e-16
```
You may have learned in previous stats classes about ANOVA, or analysis of variance. This is also a statistical test used to compare the differences between groups. We can get an ANOVA table by using the aov() function in R:
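The table below comes from wrapping the aov() fit in summary():

```r
# One-way ANOVA of Sepal.Length across the three species
summary(aov(Sepal.Length ~ Species, data = iris))
```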
```
             Df Sum Sq Mean Sq F value Pr(>F)    
Species       2  63.21  31.606   119.3 <2e-16 ***
Residuals   147  38.96   0.265                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
You can also obtain the same output by calling anova() on the linear model we created before.
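Using the model object we stored earlier:

```r
# anova() on a single lm fit returns its analysis-of-variance table
anova(model)
```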
```
Analysis of Variance Table

Response: Sepal.Length
           Df Sum Sq Mean Sq F value    Pr(>F)    
Species     2 63.212  31.606  119.26 < 2.2e-16 ***
Residuals 147 38.956   0.265                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
This is the same anova() function we used earlier to compare models, except this time we only gave it one input! Compared to the previous lm() output, what are the similarities? What are the differences?
We’ve previously seen how to create scatterplots of continuous data.
We just learned how to run statistical tests on data with categorical predictors.
ggplot() review
Previously we just used ggplot() combined with geom_point() to visualize our continuous data.
geom_point() does work with categorical predictors, although there are plots better suited for this kind of data. Let’s consider a few of these.
geom_col()
There are two ways to make bar charts in R: geom_bar() and geom_col().
geom_bar(): categorical predictor and counts as the outcome
geom_col(): categorical predictor and continuous outcome
Because our Iris dataset has a continuous outcome variable, we would want to use geom_col(), as in the sketch below.
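A sketch of one way to do this. geom_col() draws bar heights from values already in the data, so we aggregate first; plotting the mean Sepal.Length per species is our assumption about what to show:

```r
library(ggplot2)

# Compute the mean Sepal.Length for each species, then draw one bar per species
species_means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

ggplot(species_means, aes(x = Species, y = Sepal.Length)) +
  geom_col()
```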
geom_jitter()
You can think of this as in-between a scatterplot and a barplot.
This adds a small amount of variability to the points to prevent overcrowding/overlapping of points (compare with the scatterplot example).
The width argument determines how much random horizontal variability is added, as in the sketch below.
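A minimal sketch (width = 0.1 is an arbitrary choice; larger values spread the points more):

```r
library(ggplot2)

# Jittered points of Sepal.Length by Species; jitter is horizontal only
ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_jitter(width = 0.1)
```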
What is this useful for?
Oftentimes you will work with data that is not quite as nice as the datasets we’ve seen so far in this class. Let’s see what happens when we have data with:
Missing data
Outliers
High leverage points
For this portion of the lab, download the lab6.csv file from Canvas. Read it into a data frame and inspect it to see if you notice anything unusual.
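A sketch, assuming lab6.csv sits in your working directory:

```r
lab6 <- read.csv("lab6.csv")

head(lab6)     # peek at the first few rows
summary(lab6)  # NA counts appear here when data are missing
```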
Now let’s remove any rows with missing data.
na.omit() removes rows that have missing data in any column. We can also use the na.rm = TRUE parameter in many functions to achieve the same effect.
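A sketch of both approaches; the column name x is hypothetical, since we haven’t named the columns of lab6.csv here:

```r
mean(lab6$x)               # returns NA if x contains any missing values
mean(lab6$x, na.rm = TRUE) # the same calculation, ignoring the NAs

lab6_clean <- na.omit(lab6)  # drop every row containing at least one NA
nrow(lab6_clean)             # how many complete rows remain
```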
```
[1] NA
[1] 136
```
Finally, ggplot will usually remove missing data for us automatically when plotting:
You will notice that you get a warning when this happens, so it never hurts to remove missing data ourselves beforehand!
Looking at the graph below, do you notice any points that are out of place?
Let’s compute the Cook’s distances to see if we can better identify which points are “out of place”.
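A sketch, assuming a simple model fit to the cleaned data (the formula y ~ x uses hypothetical column names):

```r
lab6_model <- lm(y ~ x, data = lab6_clean)
cooks.distance(lab6_model)  # one distance per observation
```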
```
           1            2            3            4            6            7 
1.988366e-01 6.538402e-02 1.291828e-02 1.715712e-05 2.879438e-02 6.736438e-02 
           9           10 
2.589227e-01 1.131034e+01 
```
For better visualization, we can also plot our results:
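Base R can draw this directly from the model object; which = 4 selects the Cook’s distance plot:

```r
plot(lab6_model, which = 4)  # Cook's distance for each observation
```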
What do you notice here?
Find a public dataset online.
There are many good resources for this online.
Select a few variables you want to analyze, and remove any missing data.
Create a linear model of your data and chosen variables.
Create diagnostic plots (residuals, Cook’s distance).
Generate an appropriate visualization for your data.