This is for practicing along with the textbook Statistical Inference via Data Science: A Modern Dive into R and the Tidyverse, for chapter 5 onward.
First the textbook require us to load certain packages:
library(tidyverse)
library(moderndive)
library(skimr)
library(gapminder)
library(knitr)
opts_chunk$set(fig.align = "center")
For simplicity, the package installing was run in the background.
The textbook requires loading the data from the
moderndive package. This data will be used to model the
relationship between teaching scores and “beauty” scores using
simple linear regression.
First we need to import the data, choose which variable to keep for simplicity, and show first few rows of the data:
evals_ch5 <- evals %>%
select(ID, score, bty_avg, age)
kable(head(evals_ch5, 10))
| ID | score | bty_avg | age |
|---|---|---|---|
| 1 | 4.7 | 5.000 | 36 |
| 2 | 4.1 | 5.000 | 36 |
| 3 | 3.9 | 5.000 | 36 |
| 4 | 4.8 | 5.000 | 36 |
| 5 | 4.6 | 3.000 | 59 |
| 6 | 4.3 | 3.000 | 59 |
| 7 | 2.8 | 3.000 | 59 |
| 8 | 4.1 | 3.333 | 51 |
| 9 | 3.4 | 3.333 | 51 |
| 10 | 4.5 | 3.167 | 40 |
The authors suggest doing these three steps for EDA:
With first step, we can use these functions:
View(evals_ch5)
glimpse(evals_ch5)
summary(evals_ch5)
Here is the result of summary() function:
kable(summary(evals_ch5))
| ID | score | bty_avg | age | |
|---|---|---|---|---|
| Min. : 1.0 | Min. :2.300 | Min. :1.667 | Min. :29.00 | |
| 1st Qu.:116.5 | 1st Qu.:3.800 | 1st Qu.:3.167 | 1st Qu.:42.00 | |
| Median :232.0 | Median :4.300 | Median :4.333 | Median :48.00 | |
| Mean :232.0 | Mean :4.175 | Mean :4.418 | Mean :48.37 | |
| 3rd Qu.:347.5 | 3rd Qu.:4.600 | 3rd Qu.:5.500 | 3rd Qu.:57.00 | |
| Max. :463.0 | Max. :5.000 | Max. :8.167 | Max. :73.00 |
The textbook starts with introducing the glimpse()
function, let’s run it here:
glimpse(evals_ch5)
## Rows: 463
## Columns: 4
## $ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
## $ score <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4.5, 4.…
## $ bty_avg <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000, 3.333, 3.333,…
## $ age <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, 40, 40…
First the authors note about the observation units
importance. In this dataset, observation unit is an individual
course and not an individual instructor. Since instructors teach more
than one course in an academic year, the same instructor will appear
more than once in the data. Hence there are fewer than 463 unique
instructors being represented in evals_ch5.
Next thing to focus about is the description for 4 variables in the dataset:
ID: An identification variable used to distinguish
between the 1 through 463 courses in the dataset.score: A numerical variable of the course instructor’s
average teaching score, where the average is computed from the
evaluation scores from all students in that course. Teaching scores of 1
are lowest and 5 are highest. This is the outcome variable \(y\) of interest.bty_avg: A numerical variable of the course
instructor’s average “beauty” score, where the average is computed from
a separate panel of six students. “Beauty” scores of 1 are lowest and 10
are highest. This is the explanatory variable \(x\) of interest.age: A numerical variable of the course instructor’s
age. This will be another explanatory variable \(x\) that we’ll use later.An alternative way to look at the raw data values is by choosing a
random sample of the rows in evals_ch5 by piping it into
the sample_n() function from the dplyr
package.
evals_ch5 %>%
sample_n(size = 5)
The textbook also introduces a better function than
summary() to returns commonly used summary statistics.
Let’s see how it works:
evals_ch5 %>%
skim()
| Name | Piped data |
| Number of rows | 463 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ID | 0 | 1 | 232.00 | 133.80 | 1.00 | 116.50 | 232.00 | 347.5 | 463.00 | ▇▇▇▇▇ |
| score | 0 | 1 | 4.17 | 0.54 | 2.30 | 3.80 | 4.30 | 4.6 | 5.00 | ▁▁▅▇▇ |
| bty_avg | 0 | 1 | 4.42 | 1.53 | 1.67 | 3.17 | 4.33 | 5.5 | 8.17 | ▃▇▇▃▂ |
| age | 0 | 1 | 48.37 | 9.80 | 29.00 | 42.00 | 48.00 | 57.0 | 73.00 | ▅▆▇▆▁ |
For the numerical variables it returns:
missing: the number of missing valuescomplete_rate: the rate of non-missing or complete
valuesmean: the averagesd: the standard deviationp0: the minimum valuep25: the 1st quantilep50: the 2nd quantile - the medianp75: the 3rd quantilep100: the maximum valueFor further explanation, the authors note:
The
skim()function only returns what are known as univariate summary statistics: functions that take a single variable and return some numerical summary of that variable.
Beside from univariate summary statistics, we also have bivariate summary statistics. In case of two numerical variables, one important index is correlation coefficient, which is is a quantitative expression of the strength of the linear relationship between two numerical variables. Its value ranges between -1 and 1 where:
To compute correlation, we use the cor() summary
function within a summarize():
evals_ch5 %>%
summarize(correlation = cor(score, bty_avg))
In this case, the correlation coefficient of \(0.187\) indicates that the relationship between two variables is “weakly positive.”
After this, we can perform the last of the steps in an exploratory data analysis: creating data visualizations.
ggplot(evals_ch5, aes(x = bty_avg, y = score)) +
geom_point() +
theme_light()
We can add the “best-fitting” line to the plot, which is explained by the authors as follow:
of all possible lines we can draw on this scatterplot, it is the line that “best” fits through the cloud of points.
The line can be added using layer geom_smooth():
ggplot(evals_ch5, aes(x = bty_avg, y = score)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
theme_light()
## `geom_smooth()` using formula = 'y ~ x'
The method = "lm" argument sets the line to be a “linear
model.” The se = FALSE argument suppresses standard error
uncertainty bars, which will be explained further later.
The line in the resulting the figure is called a “regression line.” The regression line is a visual summary of the relationship between two numerical variables.
Now we will use another variable as a explanatory variable,
age. We will focus on this variable along with
score variable.
We will perform a EDA, first step is to look at raw data values:
evals_ch5 %>%
sample_n(size = 5)
Second step is computing summary statistics.
evals_ch5 %>%
skim()
| Name | Piped data |
| Number of rows | 463 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ID | 0 | 1 | 232.00 | 133.80 | 1.00 | 116.50 | 232.00 | 347.5 | 463.00 | ▇▇▇▇▇ |
| score | 0 | 1 | 4.17 | 0.54 | 2.30 | 3.80 | 4.30 | 4.6 | 5.00 | ▁▁▅▇▇ |
| bty_avg | 0 | 1 | 4.42 | 1.53 | 1.67 | 3.17 | 4.33 | 5.5 | 8.17 | ▃▇▇▃▂ |
| age | 0 | 1 | 48.37 | 9.80 | 29.00 | 42.00 | 48.00 | 57.0 | 73.00 | ▅▆▇▆▁ |
We can see that there is no missing value in score and
age.
After investigating the univariate profile, we examine the bivariate summary statistics, in this case is also correlation coefficient.
evals_ch5 %>%
summarise(correlation = cor(score, age))
The minus symbol indicate a negative relationship, and the value close to zero shows a week relationship.
Finally, we can perform data visualization:
ggplot(evals_ch5, aes(x = score, y = age)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
theme_light()
## `geom_smooth()` using formula = 'y ~ x'
When defining a regression line, the equation of the regression line is:
\[ \hat{y} = b_0 + b_1x \]
The intercept coefficient is \(b_0\), which is the value of \(y\) when \(x = 0\). The slope coefficient for \(x\) is \(b_1\), i.e., the increase in \(y\) for every increase of one in \(x\). \(\hat{y}\) is a form of notation commonly used in regression to indicate that we have a “fitted value,” or the value of \(y\) on the regression line for a given \(x\) value. We’ll discuss this more in the upcoming sections.
In the example about correlation between bty_avg and
score, we can obtain the values of the intercept \(b\) and the slope for btg_avg
\(b_1\) by outputting a linear
regression table. This is done in two steps:
lm() function and save it in score_model.get_regression_table() function from the
moderndive package to score_model.score_model <- lm(score ~ bty_avg, data = evals_ch5)
get_regression_table(score_model)
To interpret the table, we look at the estimate column,
we have the intercept \(b_0 = 3.88\)
and the slopr \(b_1 = 0.067\) for
bty_avg. So the regression line is:
\[ \widehat{score} = 3.380 + 0.067 \cdot bty\_avg \]
Explaining the codes, the authors note:
First, we “fit” the linear regression model to the data using the
lm()function. When we say “fit”, we mean “find the best fitting line to this data.”
The function lm() stands for “linear model” and is used
as follows: lm(y ~ x, data = data_frame_name) where:
y is the outcome variable, followed by a tilde
~.x is the explanatory variable.y ~ x is called a model
formula. (Note the order of y and
x.)data_frame_name is the name of the data frame that
contains the variables y and x.The regression table obtained by using the
get_regression_table() show basic parameters of a
regression analysis. Besides from the estimate column, we
have other 5 columns which relatively are standard error,
test statistic, p-value, lower 95% confidence
interval bound, and upper 95% confidence interval bound.
They tell us about both the statistical significance and
practical significance of our results.
Let’s apply the sames function for age and
score:
lc5.2 <- lm(score ~ age, evals_ch5)
get_regression_table(lc5.2)
We can see that the regression line is:
\[ \widehat{score} = 4.462 - 0.006\cdot age \]
To compute fitted value and residual for all observations, we can use
get_regression_point() function.
regression_point <- get_regression_points(score_model)
head(regression_point, 10)
We can do the same for age variable:
lc5.3 <- get_regression_points(lc5.2)
head(lc5.3, 10)
In this section, the gapminder dataset is used as
demonstration. We have
y: a numerical variable of country’s life
expectancy.x: a categorical variable of continent that the country
is a part of.We will conduct the EDA. But first, for simplicity, we only choose a specific period and select variable that needed in this section
library(gapminder)
gapminder2007 <- gapminder %>%
filter(year == 2007) %>%
select(country, lifeExp, continent, gdpPercap)
Let’s perform the first common step in an exploratory data analysis: looking at the raw data values by using these function:
View(gapminder2007)
glimpse(gapminder2007)
summary(gapminder2007)
The results show there are 142 rows/observations in
gapminder2007, where each row corresponds to one country.
In other words, the observational unit is an individual
country.
Let’s look at a random sample of five out of the 142 countries:
gapminder2007 %>%
sample_n(size = 5)
Now that we’ve looked at the raw values in our
gapminder2007 data frame and got a sense of the data, let’s
move on to computing summary statistics.
gapminder2007 %>%
select(lifeExp, continent) %>%
skim()
| Name | Piped data |
| Number of rows | 142 |
| Number of columns | 2 |
| _______________________ | |
| Column type frequency: | |
| factor | 1 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| continent | 0 | 1 | FALSE | 5 | Afr: 52, Asi: 33, Eur: 30, Ame: 25 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| lifeExp | 0 | 1 | 67.01 | 12.07 | 39.61 | 57.16 | 71.94 | 76.41 | 82.6 | ▂▃▃▆▇ |
The `skim() output now reports summaries for categorical
variables separately from the numerical variables. For the categorical
variable, it reports:
missing, complete, and n,
which are the number of missing, complete, and total number of values as
before, respectively.n_unique: The number of unique levels to this
variable.top_counts: In this case, the top four counts:
Africa has 52 countries, Asia has 33,
Europe has 30, and Americas has 25. Not
displayed is Oceania with 2 countries.ordered: This tells us whether the categorical variable
is “ordinal”: whether there is an encoded hierarchy (like low, medium,
high). In this case, continent is not ordered.Then we can perform the last of the three common steps in an exploratory data analysis: creating data visualizations. Let’s visualize the distribution of our outcome variable:
ggplot(gapminder2007, aes(x = lifeExp)) +
geom_histogram(binwidth = 5, color = "white", aes(y = after_stat(density))) +
geom_density(color = "red") +
theme_light()
This is just a univariate profile of a single variable, if we want to
include the explanatory variable to the visuals. we need to use the
layer facet_wrap():
ggplot(gapminder2007, aes(x = lifeExp)) +
geom_histogram(binwidth = 5, color = "white") +
facet_wrap(~continent) +
theme_light()
Another way for numerical and categorical variables to appear in the same graph is using boxplot graph:
ggplot(gapminder2007, aes(x = continent, y = lifeExp)) +
geom_boxplot() +
theme_light()
It is easier comparing the distributions of a numerical variable between different levels of a categorical variable using a boxplot instead of a faceted histogram.
Let’s compute the median and mean life expectancy for each continent with a little more data wrangling:
lifeExp_by_continent <- gapminder2007 %>%
group_by(continent) %>%
summarize(median = median(lifeExp),
mean = mean(lifeExp))
kable(lifeExp_by_continent)
| continent | median | mean |
|---|---|---|
| Africa | 52.9265 | 54.80604 |
| Americas | 72.8990 | 73.60812 |
| Asia | 72.3960 | 70.72848 |
| Europe | 78.6085 | 77.64860 |
| Oceania | 80.7195 | 80.71950 |
We will use another outcome variable for further practice,
gdpPerCap:
First is raw data inspection, we can skip this part since the
examination of gapminder2007 above already cover this part.
We use skim() function again to look at the statistics of
gdpPerCap:
gapminder2007 %>%
select(gdpPercap) %>%
skim()
| Name | Piped data |
| Number of rows | 142 |
| Number of columns | 1 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| gdpPercap | 0 | 1 | 11680.07 | 12859.94 | 277.55 | 1624.84 | 6124.37 | 18008.84 | 49357.19 | ▇▂▁▂▁ |
Finally, we can visualize the bivariate profile of the
gdpPercap and continent variable.
ggplot(gapminder2007, aes(x = continent, y = gdpPercap)) +
geom_boxplot() +
theme_light()
In this example, we now instead have a categorical explanatory
variable continent. Our model will not yield a
“best-fitting” regression line but rather offsets relative to a
baseline for comparison.
Let’s see what happen if we apply the same function and syntax to this new model
lifeExp_model <- lm(lifeExp ~ continent, data = gapminder2007)
get_regression_table(lifeExp_model)
Let’s create another table to compare mean life expectancy between Africa and other continents, we will detect a pattern between two tables:
mean_lifeExp <- gapminder2007 %>%
group_by(continent) %>%
summarize(mean_lifeExp = mean(lifeExp))
mean_lifeExp_Africa <- mean_lifeExp %>%
filter(continent == "Africa") %>%
pull(mean_lifeExp)
mean_diff <- mean_lifeExp %>%
mutate(mean_diff = mean_lifeExp - mean_lifeExp_Africa)
kable(mean_diff)
| continent | mean_lifeExp | mean_diff |
|---|---|---|
| Africa | 54.80604 | 0.00000 |
| Americas | 73.60812 | 18.80208 |
| Asia | 70.72848 | 15.92245 |
| Europe | 77.64860 | 22.84256 |
| Oceania | 80.71950 | 25.91346 |
As we can see, the intercept is the mean life expectancy
of Africa, and values of the rest five rows from the column are
different between the mean of life expectancy of that continent with
Africa.
Why was Africa chosen as the “baseline for comparison” group? This is the case for no other reason than it comes first alphabetically of the five continents; by default R arranges factors/categorical variables in alphanumeric order.
The equation for fitted value \(\widehat{lifeExp}\) would be:
\[ \widehat{y} = b_0 + b_{Amer} \cdot \chi_{amer}(x) + b_{Asia} \cdot \chi_{Asia}(x) + b_{Euro} \cdot \chi_{Euro}(x) + b_{Ocean} \cdot \chi_{Ocean}(x) \]
\[ \widehat{lifeExp} = 54.8 + 18.8 \cdot \chi_{amer}(x) + 15.9 \cdot \chi_{Asia}(x) + 22.8 \cdot \chi_{Euro}(x) + 25.9 \cdot \chi_{Ocean}(x) \] To interpret the equation, we need to note that \(\chi_A(x)\) is what’s known in mathematics as an “indicator function.” It returns only one of two possible values, 0 and 1, where:
\[ \chi_A(x) = \begin{cases} 1 & \text{if } x \text{ is in } A \\ 0 & \text{if otherwise} \end{cases} \]
Greek letter Chi (\(\chi\)) is used as a symbol for the indicator function in this document. However, this is not standard notation, I only use it since it easier to type.
In statistical modeling context this is also known as dummy variable. If we fit a linear regression model using a categorical explanatory variable \(x\) that has \(k\) possible categories, the regression table will return an intercept and \(k - 1\) “offsets.”
We obtained these values and other values using the
get_regression_points() function from the
moderndive package. This time, however, let’s add an
argument setting ID = "country": this is telling the
function to use the variable country in gapminder2007 as an
identification variable in the output. This will help contextualize our
analysis by matching values to countries.
regression_points <- get_regression_points(lifeExp_model, ID = "country")
head(regression_points, 10)
There are only 5 possible values for lifeExp_hat,
correspond to the five mean life expectancy for the 5 continents.
The residual column is simply \(y-\hat{y}
=\) lifeExp - lifeExp_hat. These values
can be interpreted as the deviation of a country’s life expectancy from
its continent’s average life expectancy.
Regression lines are also known as “best-fitting” lines. The name comes from the process of calculating a quantity called sum of squared residuals, usually represent the lack of fit of a model. Larger values of the sum of squared residuals indicate a bigger lack of fit. This corresponds to a worse fitting model.
Of all possible lines we can draw through the cloud of points, the regression line minimize sum of squared residuals. In order words, the regression and its corresponding fitted value \(\hat{y}\) minimizes this value:
\[ \sum_{i=1}^{n} (y_i - \widehat{y}_i)^2 \] Let’s apply the equation and compute the sum of squared residuals of the first example of this document:
score_model <- lm(score ~ bty_avg, data = evals_ch5)
regression_points <- get_regression_points(score_model)
regression_points %>%
mutate(squared_residuals = residual^2) %>%
summarize(sum_of_squared_residuals = sum(squared_residuals))
The next chapter of the book covers the topic on Multiple Regression. I think I will get the basic idea of how is it operated in R first here, then after reviewing the textbook about multivariate data analysis, I will conduct a comprehensive analysis later.
First, we need to load the necessary packages:
library(tidyverse)
library(moderndive)
library(skimr)
library(ISLR)
We will use the data from UT Austin again, but add one more variable to the model. The model will consist:
For simplicity, we only select the columns we need from the original dataset:
evals_ch6 <- evals %>%
select(ID, score, age, gender)
Recall the three common steps in an exploratory data analysis:
Let’s first look at the raw data values by using these functions:
View(evals_ch6)
glimpse(evals_ch6)
summary(evals_ch6)
We can also display a random sample of 5 rows:
evals_ch6 %>% sample_n(size = 5)
Now that we’ve looked at the raw values in our evals_ch6
data frame and got a sense of the data, let’s compute summary
statistics.
evals_ch6 %>%
select(score, age, gender) %>%
skim()
| Name | Piped data |
| Number of rows | 463 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| factor | 1 |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| gender | 0 | 1 | FALSE | 2 | mal: 268, fem: 195 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| score | 0 | 1 | 4.17 | 0.54 | 2.3 | 3.8 | 4.3 | 4.6 | 5 | ▁▁▅▇▇ |
| age | 0 | 1 | 48.37 | 9.80 | 29.0 | 42.0 | 48.0 | 57.0 | 73 | ▅▆▇▆▁ |
Observe that we have no missing data, that there are 268 courses taught by male instructors and 195 courses taught by female instructors, and that the average instructor age is 48.37.
Now we can process to the final step of EDA: creating data visualizations.
ggplot(evals_ch6, aes(x = age, y = score, color = gender)) +
geom_point() +
labs(x = "Age", y = "Teaching Score", color = "Gender") +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
The syntax change in this case to y ~ x1 * x2. In other
words, our two explanatory variables x1 and x2
are separated by a * sign:
score_model_interaction <- lm(score ~ age * gender, data = evals_ch6)
get_regression_table(score_model_interaction)
While it is not immediately apparent, using these four values we can
write out the equations of both lines in the figure above. First, since
the word female comes alphabetically before
male, female instructors are the “baseline for comparison”
group. Thus, intercept is the intercept for only the
female instructors. This holds similarly for age. It
is the slope for age for only the female instructors.
What about the intercept and slope for age of the male instructors?
The value for gendermale of -0.446 is not the intercept for
the male instructors, but rather the offset in intercept for
male instructors relative to female instructors. The intercept for the
male instructors is intercept + gendermale =
4.883 - 0.446 = 4.437.
Similarly, age:gendermale = 0.014 is not the slope for
age for the male instructors, but rather the offset in slope
for the male instructors. Therefore, the slope for age for the male
instructors is age + age:gendermale = −0.018+0.014 =
−0.004.
Let’s now write the equation for our regression lines, which we can use to compute our fitted values \(\hat{y} = \widehat{score}\)
\[ \hat{y} = \widehat{score} = b_0 + b_F \cdot x_1 + b_M \cdot \chi_M(x_2) + b_{AM} \cdot x_1 \cdot \chi_M(x_2) \]
We have the indicator function in the equation, where:
\[ \chi_M(x_2) = \begin{cases} 1 & \text{if } x_2 \text{ is male} \\ 0 & \text{if otherwise} \end{cases} \]
And we also have:
intercept for the female instructorage for female instructorgendermale for the male instructorage:gendermale for male instructorLet’s put this all together and compute the fitted value \(\hat{y} = \widehat{score}\) for female instructors. Since for female instructors \(\chi_M(x_2) = 0\), the previous equation becomes:
\[ \hat{y} = \widehat{score} = 4.883 - 0.018 \cdot x_1 \]
Correspondingly, since for male instruction \(\chi_M(x_2) = 1\), the equation becomes:
\[ \hat{y} = \widehat{score} = 4.437 - 0.004 \cdot x_1 \]
Before we end this section, we explain why we refer to this type of
model as an “interaction model.” The \(b_{AM}\) term in the equation for the
fitted value \(\hat{y} =
\widehat{score}\) is what’s known in statistical modeling as an
“interaction effect.” The interaction term corresponds to the
age: gendermale = 0.014 in the final row of the regression
table.
We say there is an interaction effect if the associated effect of one variable depends on the value of another variable. That is to say, the two variables are “interacting” with each other. Here, the associated effect of the variable age depends on the value of the other variable gender. The difference in slopes for age of +0.014 of male instructors relative to female instructors shows this.
When creating regression models with one numerical and one categorical explanatory variable, we are not just limited to interaction models as we just saw. Another type of model we can use is known as a parallel slopes model. Unlike interaction models where the regression lines can have different intercepts and different slopes, parallel slopes models still allow for different intercepts but force all lines to have the same slope.
To visualize the model, we use a function in moderndive
package:
ggplot(evals_ch6, aes(x = age, y = score, color = gender)) +
geom_point() +
geom_parallel_slopes(se = FALSE)
Observe in the figure that we now have parallel lines corresponding to the female and male instructors, respectively: here they have the same negative slope. This is telling us that instructors who are older will tend to receive lower teaching scores than instructors who are younger.
However, observe also in the figure that these two lines have different intercepts. This is telling us that irrespective of age, female instructors tended to receive lower teaching scores than male instructors.
In order to obtain the precise numerical values of the two intercepts
and the single common slope, we once again “fit” the model using the
lm() “linear model” function and then apply the
get_regression_table() function. However, unlike the
interaction model which had a model formula of the form
y ~ x1 * x2, our model formula is now of the form
y ~ x1 + x2. In other words, our two explanatory variables
x1 and x2 are separated by a +
sign:
score_model_parallel <- lm(score ~ age + gender, data = evals_ch6)
get_regression_table(score_model_parallel)
The only different is now we don’t have the interaction value. So the equation will be as follow:
\[ hat{y} = \widehat{score} = b_0 + b_F \cdot x_1 + b_M \cdot \chi_M(x_2) \]
So the equation for female slope is:
\[ hat{y} = \widehat{score} = 4.484 - 0.009 \cdot x_1 \]
and for male slope
\[ hat{y} = \widehat{score} = 4.675 - 0.009 \cdot x_1 \]
we’ll first compute the observed values, fitted values, and residuals
for the interaction model which we saved in
score_model_interaction.
We can get all the fitted value by using the
get_regression_points() function:
regression_points <- get_regression_points(score_model_interaction)
kable(head(regression_points, 10))
| ID | score | age | gender | score_hat | residual |
|---|---|---|---|---|---|
| 1 | 4.7 | 36 | female | 4.252 | 0.448 |
| 2 | 4.1 | 36 | female | 4.252 | -0.152 |
| 3 | 3.9 | 36 | female | 4.252 | -0.352 |
| 4 | 4.8 | 36 | female | 4.252 | 0.548 |
| 5 | 4.6 | 59 | male | 4.201 | 0.399 |
| 6 | 4.3 | 59 | male | 4.201 | 0.099 |
| 7 | 2.8 | 59 | male | 4.201 | -1.401 |
| 8 | 4.1 | 51 | male | 4.233 | -0.133 |
| 9 | 3.4 | 51 | male | 4.233 | -0.833 |
| 10 | 4.5 | 40 | female | 4.182 | 0.318 |
How about the parallel model. Here is the result:
regression_points_par <- get_regression_points(score_model_parallel)
kable(head(regression_points_par,10))
| ID | score | age | gender | score_hat | residual |
|---|---|---|---|---|---|
| 1 | 4.7 | 36 | female | 4.172 | 0.528 |
| 2 | 4.1 | 36 | female | 4.172 | -0.072 |
| 3 | 3.9 | 36 | female | 4.172 | -0.272 |
| 4 | 4.8 | 36 | female | 4.172 | 0.628 |
| 5 | 4.6 | 59 | male | 4.163 | 0.437 |
| 6 | 4.3 | 59 | male | 4.163 | 0.137 |
| 7 | 2.8 | 59 | male | 4.163 | -1.363 |
| 8 | 4.1 | 51 | male | 4.232 | -0.132 |
| 9 | 3.4 | 51 | male | 4.232 | -0.832 |
| 10 | 4.5 | 40 | female | 4.137 | 0.363 |
We will use another dataset for this section.
In this section, we’ll fit a regression model using
Credit dataset where we have 1. A numerical outcome
variable \(y\), the cardholder’s credit
card debt 2. One numerical explanatory variable \(x_1\), the cardholder’s credit limit 3.
Another numerical explanatory variable \(x_2\), the cardholder’s income (in
thousands of dollars).
Let’s load the Credit dataset. To keep things simple
let’s select() the subset of the variables we’ll consider
in this chapter, and save this data in the new data frame
credit_ch6.
library(ISLR)