This is for practicing along with the textbook Statistical Inference via Data Science: A Modern Dive into R and the Tidyverse, for chapter 5 onward.

Basic Regression

First the textbook require us to load certain packages:

library(tidyverse)
library(moderndive)
library(skimr)
library(gapminder)
library(knitr)
opts_chunk$set(fig.align = "center")

For simplicity, the package installing was run in the background.

The textbook requires loading the data from the moderndive package. This data will be used to model the relationship between teaching scores and “beauty” scores using simple linear regression.

One numerical explanatory variable

Exploratory Data Analysis (EDA)

First we need to import the data, choose which variable to keep for simplicity, and show first few rows of the data:

evals_ch5 <- evals %>%
  select(ID, score, bty_avg, age)
kable(head(evals_ch5, 10))

ID	score	bty_avg	age
1	4.7	5.000	36
2	4.1	5.000	36
3	3.9	5.000	36
4	4.8	5.000	36
5	4.6	3.000	59
6	4.3	3.000	59
7	2.8	3.000	59
8	4.1	3.333	51
9	3.4	3.333	51
10	4.5	3.167	40

The authors suggest doing these three steps for EDA:

Most crucially, looking at the raw data values.
Computing summary statistics, such as means, medians, and interquartile ranges.
Creating data visualizations.

With first step, we can use these functions:

View(evals_ch5)
glimpse(evals_ch5)
summary(evals_ch5)

Here is the result of summary() function:

kable(summary(evals_ch5))

ID	score	bty_avg	age
Min. : 1.0	Min. :2.300	Min. :1.667	Min. :29.00
1st Qu.:116.5	1st Qu.:3.800	1st Qu.:3.167	1st Qu.:42.00
Median :232.0	Median :4.300	Median :4.333	Median :48.00
Mean :232.0	Mean :4.175	Mean :4.418	Mean :48.37
3rd Qu.:347.5	3rd Qu.:4.600	3rd Qu.:5.500	3rd Qu.:57.00
Max. :463.0	Max. :5.000	Max. :8.167	Max. :73.00

The textbook starts with introducing the glimpse() function, let’s run it here:

glimpse(evals_ch5)

## Rows: 463
## Columns: 4
## $ ID      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
## $ score   <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4.5, 4.…
## $ bty_avg <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000, 3.333, 3.333,…
## $ age     <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, 40, 40…

First the authors note about the observation units importance. In this dataset, observation unit is an individual course and not an individual instructor. Since instructors teach more than one course in an academic year, the same instructor will appear more than once in the data. Hence there are fewer than 463 unique instructors being represented in evals_ch5.

Next thing to focus about is the description for 4 variables in the dataset:

ID: An identification variable used to distinguish between the 1 through 463 courses in the dataset.
score: A numerical variable of the course instructor’s average teaching score, where the average is computed from the evaluation scores from all students in that course. Teaching scores of 1 are lowest and 5 are highest. This is the outcome variable \(y\) of interest.
bty_avg: A numerical variable of the course instructor’s average “beauty” score, where the average is computed from a separate panel of six students. “Beauty” scores of 1 are lowest and 10 are highest. This is the explanatory variable \(x\) of interest.
age: A numerical variable of the course instructor’s age. This will be another explanatory variable \(x\) that we’ll use later.

An alternative way to look at the raw data values is by choosing a random sample of the rows in evals_ch5 by piping it into the sample_n() function from the dplyr package.

evals_ch5 %>%
  sample_n(size = 5)

The textbook also introduces a better function than summary() to returns commonly used summary statistics. Let’s see how it works:

evals_ch5 %>%
  skim()

Data summary
Name	Piped data
Number of rows	463
Number of columns	4
_______________________
Column type frequency:
numeric	4
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
ID	1	232.00	133.80	1.00	116.50	232.00	347.5	463.00	▇▇▇▇▇
score	1	4.17	0.54	2.30	3.80	4.30	4.6	5.00	▁▁▅▇▇
bty_avg	1	4.42	1.53	1.67	3.17	4.33	5.5	8.17	▃▇▇▃▂
age	1	48.37	9.80	29.00	42.00	48.00	57.0	73.00	▅▆▇▆▁

For the numerical variables it returns:

missing: the number of missing values
complete_rate: the rate of non-missing or complete values
mean: the average
sd: the standard deviation
p0: the minimum value
p25: the 1st quantile
p50: the 2nd quantile - the median
p75: the 3rd quantile
p100: the maximum value

For further explanation, the authors note:

The skim() function only returns what are known as univariate summary statistics: functions that take a single variable and return some numerical summary of that variable.

Beside from univariate summary statistics, we also have bivariate summary statistics. In case of two numerical variables, one important index is correlation coefficient, which is is a quantitative expression of the strength of the linear relationship between two numerical variables. Its value ranges between -1 and 1 where:

\(-1\) indicates a perfect negative relationship: As one variable increases, the value of the other variable tends to go down, following a straight line.
\(0\) indicates no relationship: The values of both variables go up/down inde- pendently of each other.
\(+1\) indicates a perfect positive relationship: As the value of one variable goes up, the value of the other variable tends to go up as well in a linear fashion.

To compute correlation, we use the cor() summary function within a summarize():

evals_ch5 %>%
  summarize(correlation = cor(score, bty_avg))

In this case, the correlation coefficient of \(0.187\) indicates that the relationship between two variables is “weakly positive.”

After this, we can perform the last of the steps in an exploratory data analysis: creating data visualizations.

ggplot(evals_ch5, aes(x = bty_avg, y = score)) +
  geom_point() +
  theme_light()

We can add the “best-fitting” line to the plot, which is explained by the authors as follow:

of all possible lines we can draw on this scatterplot, it is the line that “best” fits through the cloud of points.

The line can be added using layer geom_smooth():

ggplot(evals_ch5, aes(x = bty_avg, y = score)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_light()

## `geom_smooth()` using formula = 'y ~ x'

The method = "lm" argument sets the line to be a “linear model.” The se = FALSE argument suppresses standard error uncertainty bars, which will be explained further later.

The line in the resulting the figure is called a “regression line.” The regression line is a visual summary of the relationship between two numerical variables.

Now we will use another variable as a explanatory variable, age. We will focus on this variable along with score variable.

We will perform a EDA, first step is to look at raw data values:

evals_ch5 %>%
  sample_n(size = 5)

Second step is computing summary statistics.

evals_ch5 %>%
  skim()

Data summary
Name	Piped data
Number of rows	463
Number of columns	4
_______________________
Column type frequency:
numeric	4
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
ID	1	232.00	133.80	1.00	116.50	232.00	347.5	463.00	▇▇▇▇▇
score	1	4.17	0.54	2.30	3.80	4.30	4.6	5.00	▁▁▅▇▇
bty_avg	1	4.42	1.53	1.67	3.17	4.33	5.5	8.17	▃▇▇▃▂
age	1	48.37	9.80	29.00	42.00	48.00	57.0	73.00	▅▆▇▆▁

We can see that there is no missing value in score and age.

After investigating the univariate profile, we examine the bivariate summary statistics, in this case is also correlation coefficient.

evals_ch5 %>%
  summarise(correlation = cor(score, age))

The minus symbol indicate a negative relationship, and the value close to zero shows a week relationship.

Finally, we can perform data visualization:

ggplot(evals_ch5, aes(x = score, y = age)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_light()

## `geom_smooth()` using formula = 'y ~ x'

Simple linear regresission

When defining a regression line, the equation of the regression line is:

\[ \hat{y} = b_0 + b_1x \]

The intercept coefficient is \(b_0\), which is the value of \(y\) when \(x = 0\). The slope coefficient for \(x\) is \(b_1\), i.e., the increase in \(y\) for every increase of one in \(x\). \(\hat{y}\) is a form of notation commonly used in regression to indicate that we have a “fitted value,” or the value of \(y\) on the regression line for a given \(x\) value. We’ll discuss this more in the upcoming sections.

In the example about correlation between bty_avg and score, we can obtain the values of the intercept \(b\) and the slope for btg_avg \(b_1\) by outputting a linear regression table. This is done in two steps:

We first “fit” the linear regression model using the lm() function and save it in score_model.
We get the regression table by applying the get_regression_table() function from the moderndive package to score_model.

score_model <- lm(score ~ bty_avg, data = evals_ch5)
get_regression_table(score_model)

To interpret the table, we look at the estimate column, we have the intercept \(b_0 = 3.88\) and the slopr \(b_1 = 0.067\) for bty_avg. So the regression line is:

\[ \widehat{score} = 3.380 + 0.067 \cdot bty\_avg \]

Explaining the codes, the authors note:

First, we “fit” the linear regression model to the data using the lm() function. When we say “fit”, we mean “find the best fitting line to this data.”

The function lm() stands for “linear model” and is used as follows: lm(y ~ x, data = data_frame_name) where:

y is the outcome variable, followed by a tilde ~.
x is the explanatory variable.
The combination of y ~ x is called a model formula. (Note the order of y and x.)
data_frame_name is the name of the data frame that contains the variables y and x.

The regression table obtained by using the get_regression_table() show basic parameters of a regression analysis. Besides from the estimate column, we have other 5 columns which relatively are standard error, test statistic, p-value, lower 95% confidence interval bound, and upper 95% confidence interval bound. They tell us about both the statistical significance and practical significance of our results.

Let’s apply the sames function for age and score:

lc5.2 <- lm(score ~ age, evals_ch5)
get_regression_table(lc5.2)

We can see that the regression line is:

\[ \widehat{score} = 4.462 - 0.006\cdot age \]

Observed/fitted values and residuals

The observed value is the actual value in dataset for each individual observation.
The fitted value is the value we have after plug the explanatory value into the regression equation.
The residual is the different between observed value and fitted value.

To compute fitted value and residual for all observations, we can use get_regression_point() function.

regression_point <- get_regression_points(score_model)
head(regression_point, 10)

We can do the same for age variable:

lc5.3 <- get_regression_points(lc5.2)
head(lc5.3, 10)

One categorical explanatory variable

In this section, the gapminder dataset is used as demonstration. We have

y: a numerical variable of country’s life expectancy.
x: a categorical variable of continent that the country is a part of.

Exploratory Data Analysis

We will conduct the EDA. But first, for simplicity, we only choose a specific period and select variable that needed in this section

library(gapminder)
gapminder2007 <- gapminder %>%
  filter(year == 2007) %>%
  select(country, lifeExp, continent, gdpPercap)

Let’s perform the first common step in an exploratory data analysis: looking at the raw data values by using these function:

View(gapminder2007)
glimpse(gapminder2007)
summary(gapminder2007)

The results show there are 142 rows/observations in gapminder2007, where each row corresponds to one country. In other words, the observational unit is an individual country.

Let’s look at a random sample of five out of the 142 countries:

gapminder2007 %>%
  sample_n(size = 5)

Now that we’ve looked at the raw values in our gapminder2007 data frame and got a sense of the data, let’s move on to computing summary statistics.

gapminder2007 %>%
  select(lifeExp, continent) %>%
  skim()

Data summary
Name	Piped data
Number of rows	142
Number of columns	2
_______________________
Column type frequency:
factor	1
numeric	1
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
continent	0	1	FALSE	5	Afr: 52, Asi: 33, Eur: 30, Ame: 25

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
lifeExp	0	1	67.01	12.07	39.61	57.16	71.94	76.41	82.6	▂▃▃▆▇

The `skim() output now reports summaries for categorical variables separately from the numerical variables. For the categorical variable, it reports:

missing, complete, and n, which are the number of missing, complete, and total number of values as before, respectively.
n_unique: The number of unique levels to this variable.
top_counts: In this case, the top four counts: Africa has 52 countries, Asia has 33, Europe has 30, and Americas has 25. Not displayed is Oceania with 2 countries.
ordered: This tells us whether the categorical variable is “ordinal”: whether there is an encoded hierarchy (like low, medium, high). In this case, continent is not ordered.

Then we can perform the last of the three common steps in an exploratory data analysis: creating data visualizations. Let’s visualize the distribution of our outcome variable:

ggplot(gapminder2007, aes(x = lifeExp)) +
  geom_histogram(binwidth = 5, color = "white", aes(y = after_stat(density))) +
  geom_density(color = "red") +
  theme_light()

This is just a univariate profile of a single variable, if we want to include the explanatory variable to the visuals. we need to use the layer facet_wrap():

ggplot(gapminder2007, aes(x = lifeExp)) +
  geom_histogram(binwidth = 5, color = "white") +
  facet_wrap(~continent) +
  theme_light()

Another way for numerical and categorical variables to appear in the same graph is using boxplot graph:

ggplot(gapminder2007, aes(x = continent, y = lifeExp)) +
  geom_boxplot() +
  theme_light()

It is easier comparing the distributions of a numerical variable between different levels of a categorical variable using a boxplot instead of a faceted histogram.

Let’s compute the median and mean life expectancy for each continent with a little more data wrangling:

lifeExp_by_continent <- gapminder2007 %>% 
  group_by(continent) %>% 
  summarize(median = median(lifeExp),
            mean = mean(lifeExp))
kable(lifeExp_by_continent)

continent	median	mean
Africa	52.9265	54.80604
Americas	72.8990	73.60812
Asia	72.3960	70.72848
Europe	78.6085	77.64860
Oceania	80.7195	80.71950

We will use another outcome variable for further practice, gdpPerCap:

First is raw data inspection, we can skip this part since the examination of gapminder2007 above already cover this part. We use skim() function again to look at the statistics of gdpPerCap:

gapminder2007 %>%
  select(gdpPercap) %>%
  skim()

Data summary
Name	Piped data
Number of rows	142
Number of columns	1
_______________________
Column type frequency:
numeric	1
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
gdpPercap	0	1	11680.07	12859.94	277.55	1624.84	6124.37	18008.84	49357.19	▇▂▁▂▁

Finally, we can visualize the bivariate profile of the gdpPercap and continent variable.

ggplot(gapminder2007, aes(x = continent, y = gdpPercap)) +
  geom_boxplot() +
  theme_light()

Linear Regression

In this example, we now instead have a categorical explanatory variable continent. Our model will not yield a “best-fitting” regression line but rather offsets relative to a baseline for comparison.

Let’s see what happen if we apply the same function and syntax to this new model

lifeExp_model <- lm(lifeExp ~ continent, data = gapminder2007)
get_regression_table(lifeExp_model)

Let’s create another table to compare mean life expectancy between Africa and other continents, we will detect a pattern between two tables:

mean_lifeExp <- gapminder2007 %>%
  group_by(continent) %>%
  summarize(mean_lifeExp = mean(lifeExp))

mean_lifeExp_Africa <- mean_lifeExp %>%
  filter(continent == "Africa") %>%
  pull(mean_lifeExp)

mean_diff <- mean_lifeExp %>%
  mutate(mean_diff = mean_lifeExp - mean_lifeExp_Africa)

kable(mean_diff)

continent	mean_lifeExp	mean_diff
Africa	54.80604	0.00000
Americas	73.60812	18.80208
Asia	70.72848	15.92245
Europe	77.64860	22.84256
Oceania	80.71950	25.91346

As we can see, the intercept is the mean life expectancy of Africa, and values of the rest five rows from the column are different between the mean of life expectancy of that continent with Africa.

Why was Africa chosen as the “baseline for comparison” group? This is the case for no other reason than it comes first alphabetically of the five continents; by default R arranges factors/categorical variables in alphanumeric order.

The equation for fitted value \(\widehat{lifeExp}\) would be:

\[ \widehat{y} = b_0 + b_{Amer} \cdot \chi_{amer}(x) + b_{Asia} \cdot \chi_{Asia}(x) + b_{Euro} \cdot \chi_{Euro}(x) + b_{Ocean} \cdot \chi_{Ocean}(x) \]

\[ \widehat{lifeExp} = 54.8 + 18.8 \cdot \chi_{amer}(x) + 15.9 \cdot \chi_{Asia}(x) + 22.8 \cdot \chi_{Euro}(x) + 25.9 \cdot \chi_{Ocean}(x) \] To interpret the equation, we need to note that \(\chi_A(x)\) is what’s known in mathematics as an “indicator function.” It returns only one of two possible values, 0 and 1, where:

\[ \chi_A(x) = \begin{cases} 1 & \text{if } x \text{ is in } A \\ 0 & \text{if otherwise} \end{cases} \]

Greek letter Chi (\(\chi\)) is used as a symbol for the indicator function in this document. However, this is not standard notation, I only use it since it easier to type.

In statistical modeling context this is also known as dummy variable. If we fit a linear regression model using a categorical explanatory variable \(x\) that has \(k\) possible categories, the regression table will return an intercept and \(k - 1\) “offsets.”

Observed/fitted values and residuals

We obtained these values and other values using the get_regression_points() function from the moderndive package. This time, however, let’s add an argument setting ID = "country": this is telling the function to use the variable country in gapminder2007 as an identification variable in the output. This will help contextualize our analysis by matching values to countries.

regression_points <- get_regression_points(lifeExp_model, ID = "country")
head(regression_points, 10)

There are only 5 possible values for lifeExp_hat, correspond to the five mean life expectancy for the 5 continents.

The residual column is simply \(y-\hat{y} =\) lifeExp - lifeExp_hat. These values can be interpreted as the deviation of a country’s life expectancy from its continent’s average life expectancy.

Best-fitting line

Regression lines are also known as “best-fitting” lines. The name comes from the process of calculating a quantity called sum of squared residuals, usually represent the lack of fit of a model. Larger values of the sum of squared residuals indicate a bigger lack of fit. This corresponds to a worse fitting model.

Of all possible lines we can draw through the cloud of points, the regression line minimize sum of squared residuals. In order words, the regression and its corresponding fitted value \(\hat{y}\) minimizes this value:

\[ \sum_{i=1}^{n} (y_i - \widehat{y}_i)^2 \] Let’s apply the equation and compute the sum of squared residuals of the first example of this document:

score_model <- lm(score ~ bty_avg, data = evals_ch5)
regression_points <- get_regression_points(score_model)

regression_points %>%
  mutate(squared_residuals = residual^2) %>%
  summarize(sum_of_squared_residuals = sum(squared_residuals))

Multiple Regression

The next chapter of the book covers the topic on Multiple Regression. I think I will get the basic idea of how is it operated in R first here, then after reviewing the textbook about multivariate data analysis, I will conduct a comprehensive analysis later.

First, we need to load the necessary packages:

library(tidyverse) 
library(moderndive) 
library(skimr) 
library(ISLR)

One numerical and one categorical explanatory variable

We will use the data from UT Austin again, but add one more variable to the model. The model will consist:

A numerical outcome variable \(y\), the instructor’s teaching score
A numerical explanatory variable \(x_1\), the instructor’s age.
A categorical explanatory variable \(x2\), the instructor’s (binary) gender.

Exploratory data analysis

For simplicity, we only select the columns we need from the original dataset:

evals_ch6 <- evals %>%
  select(ID, score, age, gender)

Recall the three common steps in an exploratory data analysis:

Looking at the raw data values.
Computing summary statistics.
Creating data visualizations.

Let’s first look at the raw data values by using these functions:

View(evals_ch6)
glimpse(evals_ch6)
summary(evals_ch6)

We can also display a random sample of 5 rows:

evals_ch6 %>% sample_n(size = 5)

Now that we’ve looked at the raw values in our evals_ch6 data frame and got a sense of the data, let’s compute summary statistics.

evals_ch6 %>%
  select(score, age, gender) %>%
  skim()

Data summary
Name	Piped data
Number of rows	463
Number of columns	3
_______________________
Column type frequency:
factor	1
numeric	2
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
gender	0	1	FALSE	2	mal: 268, fem: 195

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
score	0	1	4.17	0.54	2.3	3.8	4.3	4.6	5	▁▁▅▇▇
age	0	1	48.37	9.80	29.0	42.0	48.0	57.0	73	▅▆▇▆▁

Observe that we have no missing data, that there are 268 courses taught by male instructors and 195 courses taught by female instructors, and that the average instructor age is 48.37.

Now we can process to the final step of EDA: creating data visualizations.

ggplot(evals_ch6, aes(x = age, y = score, color = gender)) + 
  geom_point() +
  labs(x = "Age", y = "Teaching Score", color = "Gender") + 
  geom_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula = 'y ~ x'

Interaction model

The syntax change in this case to y ~ x1 * x2. In other words, our two explanatory variables x1 and x2 are separated by a * sign:

score_model_interaction <- lm(score ~ age * gender, data = evals_ch6)
get_regression_table(score_model_interaction)

While it is not immediately apparent, using these four values we can write out the equations of both lines in the figure above. First, since the word female comes alphabetically before male, female instructors are the “baseline for comparison” group. Thus, intercept is the intercept for only the female instructors. This holds similarly for age. It is the slope for age for only the female instructors.

What about the intercept and slope for age of the male instructors? The value for gendermale of -0.446 is not the intercept for the male instructors, but rather the offset in intercept for male instructors relative to female instructors. The intercept for the male instructors is intercept + gendermale = 4.883 - 0.446 = 4.437.

Similarly, age:gendermale = 0.014 is not the slope for age for the male instructors, but rather the offset in slope for the male instructors. Therefore, the slope for age for the male instructors is age + age:gendermale = −0.018+0.014 = −0.004.

Let’s now write the equation for our regression lines, which we can use to compute our fitted values \(\hat{y} = \widehat{score}\)

\[ \hat{y} = \widehat{score} = b_0 + b_F \cdot x_1 + b_M \cdot \chi_M(x_2) + b_{AM} \cdot x_1 \cdot \chi_M(x_2) \]

We have the indicator function in the equation, where:

\[ \chi_M(x_2) = \begin{cases} 1 & \text{if } x_2 \text{ is male} \\ 0 & \text{if otherwise} \end{cases} \]

And we also have:

\(b_0\) is the intercept for the female instructor
\(b_F\) is the slope for age for female instructor
\(b_M\) is the gendermale for the male instructor
\(b_{AM}\) is the age:gendermale for male instructor
\(x_1\) is the age
\(x_2\) is gender

Let’s put this all together and compute the fitted value \(\hat{y} = \widehat{score}\) for female instructors. Since for female instructors \(\chi_M(x_2) = 0\), the previous equation becomes:

\[ \hat{y} = \widehat{score} = 4.883 - 0.018 \cdot x_1 \]

Correspondingly, since for male instruction \(\chi_M(x_2) = 1\), the equation becomes:

\[ \hat{y} = \widehat{score} = 4.437 - 0.004 \cdot x_1 \]

Before we end this section, we explain why we refer to this type of model as an “interaction model.” The \(b_{AM}\) term in the equation for the fitted value \(\hat{y} = \widehat{score}\) is what’s known in statistical modeling as an “interaction effect.” The interaction term corresponds to the age: gendermale = 0.014 in the final row of the regression table.

We say there is an interaction effect if the associated effect of one variable depends on the value of another variable. That is to say, the two variables are “interacting” with each other. Here, the associated effect of the variable age depends on the value of the other variable gender. The difference in slopes for age of +0.014 of male instructors relative to female instructors shows this.

Parallel slopes model

When creating regression models with one numerical and one categorical explanatory variable, we are not just limited to interaction models as we just saw. Another type of model we can use is known as a parallel slopes model. Unlike interaction models where the regression lines can have different intercepts and different slopes, parallel slopes models still allow for different intercepts but force all lines to have the same slope.

To visualize the model, we use a function in moderndive package:

ggplot(evals_ch6, aes(x = age, y = score, color = gender)) +
  geom_point() +
  geom_parallel_slopes(se = FALSE)

Observe in the figure that we now have parallel lines corresponding to the female and male instructors, respectively: here they have the same negative slope. This is telling us that instructors who are older will tend to receive lower teaching scores than instructors who are younger.

However, observe also in the figure that these two lines have different intercepts. This is telling us that irrespective of age, female instructors tended to receive lower teaching scores than male instructors.

In order to obtain the precise numerical values of the two intercepts and the single common slope, we once again “fit” the model using the lm() “linear model” function and then apply the get_regression_table() function. However, unlike the interaction model which had a model formula of the form y ~ x1 * x2, our model formula is now of the form y ~ x1 + x2. In other words, our two explanatory variables x1 and x2 are separated by a + sign:

score_model_parallel <- lm(score ~ age + gender, data = evals_ch6)
get_regression_table(score_model_parallel)

The only different is now we don’t have the interaction value. So the equation will be as follow:

\[ hat{y} = \widehat{score} = b_0 + b_F \cdot x_1 + b_M \cdot \chi_M(x_2) \]

So the equation for female slope is:

\[ hat{y} = \widehat{score} = 4.484 - 0.009 \cdot x_1 \]

and for male slope

\[ hat{y} = \widehat{score} = 4.675 - 0.009 \cdot x_1 \]

Observed/fitted values and residuals

we’ll first compute the observed values, fitted values, and residuals for the interaction model which we saved in score_model_interaction.

We can get all the fitted value by using the get_regression_points() function:

regression_points <- get_regression_points(score_model_interaction)
kable(head(regression_points, 10))

ID	score	age	gender	score_hat	residual
1	4.7	36	female	4.252	0.448
2	4.1	36	female	4.252	-0.152
3	3.9	36	female	4.252	-0.352
4	4.8	36	female	4.252	0.548
5	4.6	59	male	4.201	0.399
6	4.3	59	male	4.201	0.099
7	2.8	59	male	4.201	-1.401
8	4.1	51	male	4.233	-0.133
9	3.4	51	male	4.233	-0.833
10	4.5	40	female	4.182	0.318

How about the parallel model. Here is the result:

regression_points_par <- get_regression_points(score_model_parallel)
kable(head(regression_points_par,10))

ID	score	age	gender	score_hat	residual
1	4.7	36	female	4.172	0.528
2	4.1	36	female	4.172	-0.072
3	3.9	36	female	4.172	-0.272
4	4.8	36	female	4.172	0.628
5	4.6	59	male	4.163	0.437
6	4.3	59	male	4.163	0.137
7	2.8	59	male	4.163	-1.363
8	4.1	51	male	4.232	-0.132
9	3.4	51	male	4.232	-0.832
10	4.5	40	female	4.137	0.363

Two numerical explanatory variables

We will use another dataset for this section.

In this section, we’ll fit a regression model using Credit dataset where we have 1. A numerical outcome variable \(y\), the cardholder’s credit card debt 2. One numerical explanatory variable \(x_1\), the cardholder’s credit limit 3. Another numerical explanatory variable \(x_2\), the cardholder’s income (in thousands of dollars).

Exploratory data analysis

Let’s load the Credit dataset. To keep things simple let’s select() the subset of the variables we’ll consider in this chapter, and save this data in the new data frame credit_ch6.

library(ISLR)

Notebook: Statistical Inference via Data Science

Basic Regression

One numerical explanatory variable

Exploratory Data Analysis (EDA)

Simple linear regresission

Observed/fitted values and residuals

One categorical explanatory variable

Exploratory Data Analysis

Linear Regression

Observed/fitted values and residuals

Best-fitting line

Multiple Regression

One numerical and one categorical explanatory variable

Exploratory data analysis

Interaction model

Parallel slopes model

Observed/fitted values and residuals

Two numerical explanatory variables

Exploratory data analysis