R reference

https://bookdown.org/speegled/foundations-of-statistics/data-manipulation.html#data-manipulation (Chapters 1,5,6,7, and 10)

Book Page

https://www.stat2.org/

Data manual: https://www.stat2.org/manuals/Stat2DataManual.pdf

Book R Code

https://www.stat2.org/manuals/Stat2RCompanion.pdf

ChatGPT Will Code for Us

Register on https://opai.com. Use the following query:

“I have a data set, with two variables x and y. how can i build a regression model with y as the response variable and x as the explanatory variable. show me the r code with data you made from engineering and explain each line of code. don’t use simulated data or data frame from R.”

Compile the code.

Now, ask ChatGPT more:

“interpret the results”

Next, we can ask”

“can you show me the underlying theory?”

Still, we ask:

“What does the R-square tell us? How to interpret each p-value? how to make prediction using the fitted model?”

Still, we ask:

“use a data with specific meanings for x and y.”

Finally, we ask:

“how to check how good the model for prediction?”

Chapters 1&2 Notes for Simple Linear Regression

  1. Introduction

We’ll explore the relationship between sugar content and calories in cereals using data from https://www.lock5stat.com/datasets3e/Cereal.csv (see the documentation of the data from https://www.lock5stat.com/datasets3e/Lock5DataGuide3e.pdf). Simple linear regression is a statistical technique that helps us understand how changes in one variable (called independent/explanatory/predictor variable) can predict changes in another (called dependent/response/outcome variable).

  1. Data Preparation

Let’s start by reading the raw csv (comma separated values) data to create a data frame in RStudio that R recognizes.

# Load real dataset
cereal_data <- read.csv(file = "https://www.lock5stat.com/datasets3e/Cereal.csv", header = TRUE, sep = ",")

# Print variables
names(cereal_data)
##  [1] "Name"     "Company"  "Serving"  "Calories" "Fat"      "Sodium"  
##  [7] "Carbs"    "Fiber"    "Sugars"   "Protein"
# Check dimmension
dim(cereal_data)
## [1] 30 10
# Print first 10 observations
head(cereal_data, n = 10)
##                     Name Company Serving Calories Fat Sodium Carbs Fiber Sugars
## 1             AppleJacks       K    1.00      117 0.6    143    27   0.5   15.0
## 2              Boo Berry       G    1.00      118 0.8    211    27   0.1   14.0
## 3           Cap'n Crunch       Q    0.75      144 2.1    269    31   1.1   16.0
## 4  Cinnamon Toast Crunch       G    0.75      169 4.4    408    32   1.7   13.3
## 5           Cocoa Blasts       Q    1.00      130 1.2    135    29   0.8   16.0
## 6            Cocoa Puffs       G    1.00      117 1.0    171    26   0.8   14.0
## 7           Cookie Crisp       G    1.00      117 0.9    178    26   0.5   13.0
## 8            Corn Flakes       K    1.00      101 0.1    202    24   0.8    3.0
## 9              Corn Pops       K    1.00      117 0.2    120    28   0.3   15.0
## 10               Crispix       K    1.00      113 0.3    229    26   0.1    3.0
##    Protein
## 1      1.0
## 2      1.0
## 3      1.3
## 4      2.7
## 5      1.0
## 6      1.0
## 7      1.0
## 8      2.0
## 9      1.0
## 10     2.0
  1. Visualizing the Data

We focus on two variables: sugar content and calories. To gain insights into the relationship between them, we’ll create a scatter plot with calories as the response variable.

# Create a scatter plot with calories as the response variable
plot(formula = Calories ~ Sugars, 
     data = cereal_data, 
     main = "Scatter Plot of Sugar vs Calories",
     xlab = "Sugar Content",
     ylab = "Calories",
     col = "blue",
     pch = 19,
     cex = 2
    )  

In the code above, “Calories ~ Sugars” is called a “formula” which specifies the left side as the response and the right side as the explanatory variable. The data parameter tells that the variables “Calories” and “Sugars” are from data frame “cereal_data” which has been created before.

The parameters “main”, “xlab”, and “ylab” set the titles for the plot, the x-axis, and the y-axis. The parameters “col”, “pch”, “cex” set the color, symbol, and size of the plotting characters.

later, we will introduce a more advanced tool for visualizing data. It involves the R package “ggplot2”. R Packages are tool boxes that extend the functionality of the base package “base”. To use a package, you must install it first. This is done by issuing the code: install.packages(“packageName”) in the console (lower-left panel of RStudio). The next step is to load the installed the package by placing the code: library(packageName) in the setup code chunk (somewhere around line 10 in your R markdown file).

ggplot(cereal_data, aes(x = Sugars, y = Calories)) +
  geom_point() +
  labs(title = "Scatter Plot of Sugar vs Calories",
       x = "Sugar Content",
       y = "Calories") +
  theme(plot.title = element_text(hjust=0.5))

The code generates a scatter plot using ggplot2 with sugar content (Sugars) on the x-axis and calories (Calories) on the y-axis, based on the data in the cereal_data dataframe.

Points on the scatter plot represent observations in the dataset.

The plot is given a title (“Scatter Plot of Sugar vs Calories”) and axis labels (“Sugar Content” for x-axis, “Calories” for y-axis).

The theme modification ensures that the plot title is horizontally centered.

What is the use of this scatterplot? It tells us graphically whether there is a straight line relationship between the two variables. In other words, it tells whether the two variables are correlated. A quantity that quantifies the level of correlation is called the coefficient of correlation or just correlation.

R has a function to calculate such a correlation value, which is shown here:

cor(cereal_data$Sugars, cereal_data$Calories)
## [1] 0.6601099
# If there is any missing value
cor(cereal_data$Sugars, cereal_data$Calories, use = "complete.obs")
## [1] 0.6601099

Explanation of code:

  1. Fitting the Simple Linear Regression Model

Now, let’s fit a simple linear regression model to predict calories based on sugar content.

# Fit the model
model <- lm(Calories ~ Sugars, data = cereal_data)

We print the summary of the results:

summary(model)
## 
## Call:
## lm(formula = Calories ~ Sugars, data = cereal_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.574 -25.282  -2.549  17.796  51.805 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  88.9204    10.8120   8.224 5.96e-09 ***
## Sugars        4.3103     0.9269   4.650 7.22e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.61 on 28 degrees of freedom
## Multiple R-squared:  0.4357, Adjusted R-squared:  0.4156 
## F-statistic: 21.62 on 1 and 28 DF,  p-value: 7.217e-05
  1. Interpretation of Results

5.1 Coefficients

Let’s interpret the coefficients of the model.

The intercept \(\beta_0\) is estimated to be 88.92, which represents the estimated mean calories when sugar content is 0. The slope \(\beta\) is estimated to be 4.31, which represents the estimated change in mean calories for a one-unit increase in sugar. A confidence interval can also be calculated using the following R code:

confint(model, level = 0.90)
##                   5 %       95 %
## (Intercept) 70.527745 107.313079
## Sugars       2.733435   5.887087

The confint() function automatically calculates the confidence interval for all coefficients (intercept and slope) in your regression model. The “level = 0.90” tells R to calculate the confidence intervals with the 90% confidence level.

Interpretation of the confidence interval for the slope:

5.2 P-values

Now, let’s assess the significance of our predictor variable - sugar content.

p_value <- summary(model)$coefficients[2, 4]

The P-value for sugar content is 0, which is basically zero and indicates that our predictor variable - sugar content - is very significant as a predictor for calories.

  1. Model Checking

To ensure the validity of our model, let’s perform diagnostic checks.

# Diagnostic plots: Residuals vs. Fitted: Check for linearity and constant variance.
plot(model, which = 1)

Residuals vs. Fitted: This plot helps us check for linearity and constant variance. If the residuals exhibit a random scatter around zero, it suggests linearity. If the residual vs. fitted values plot exhibits a tunnel shape, it indicates a violation of the assumption of constant variance, also known as homoscedasticity. Homoscedasticity implies that the variance of the residuals should be roughly constant across all levels of the fitted values. A tunnel-shaped pattern in the residual plot suggests that the spread of residuals changes systematically as the predicted values increase or decrease.

The plot here shows there is slight concern regarding the constant variance assumption. The linearity is ok, since the residuals exhibit a random scatter around zero.

# Diagnostic plots: Normal Q-Q Plot- Check for normality of residuals.
plot(model, which = 2)  

A Normal Q-Q (Quantile-Quantile) plot is a graphical tool used to assess the normality of residuals in a statistical model. It compares the quantiles of the observed residuals against the quantiles of a theoretical normal distribution. If the points on the plot fall approximately along a straight line, it suggests that the residuals are normally distributed.

The plot here shows there is slight concern regarding the normality assumption on the residuals.

# Diagnostic plots: Scale-Location Plot- Check for constant variance.
plot(model, which = 3)  

Scale-Location Plot: Checks for constant variance of residuals. Ideally, points should be evenly distributed around the horizontal line.

The plot here shows a violation of constant variance in residuals.

# Diagnostic plots: Cook's Distance- Identify influential data points.
plot(model, which = 4)  

Cook’s Distance is a statistical measure used to identify influential data points in a regression analysis. It measures the impact of deleting a particular observation on the overall fit of the model. Large values of Cook’s Distance indicate that removing a specific data point significantly changes the regression coefficients.

The plot shows that observations 1, 15 and 16 are the top 3 influential points, with observation 16 the most influential.

  1. R-squared

Now, let’s evaluate how well our model explains the variance in calories.

R-squared (Coefficient of Determination) is a statistical measure that represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s) in a regression model. It quantifies the goodness of fit of the model and indicates the proportion of variability in the response variable that is explained by the predictor variable(s). R-squared values range from 0 to 1. A higher R-squared value indicates a better fit, meaning that a larger proportion of the variance in the dependent variable is accounted for by the independent variable(s) in the model. Remember, while R-squared provides information about the goodness of fit, it doesn’t necessarily indicate the causation between variables or the appropriateness of the model for making predictions. It should be used in conjunction with other model evaluation techniques.

The \(R^2\) here is 43.57% indicating that 43.57% of the total variation is explained by sugar content.

  1. Prediction

Finally, let’s make predictions based on our model for new data points.

# New data with sugar values
new_data <- data.frame(Sugars = c(5, 10, 15))

# Predict calories for new data
predictions <- predict(model, newdata = new_data)

# Display predictions
predictions
##        1        2        3 
## 110.4717 132.0230 153.5743

The code generates predictions for the calories corresponding to the sugar values specified in new_data using the fitted regression model (model).

The predict() function utilizes the coefficients from the regression model to estimate the dependent variable (calories) based on the new independent variable values (sugar).

The resulting predictions are stored in the variable predictions.

Finally, the predictions are displayed.

  1. Conclusion

In conclusion, this lecture has guided you through the entire process of simple linear regression analysis using real data. We’ve covered data preparation, visualization, model fitting, interpretation of coefficients and p-values, model checking, R-squared evaluation, and making predictions. The interpretation of diagnostic plots in model checking ensures the adequacy of the model by examining linearity, normality, constant variance, and identifying influential points. Adjustments can be made based on your specific dataset and research questions.

  1. Appendix 1: How \(R^2\) Works?

In the realm of regression analysis involving a quantitative response variable and an explanatory variable, understanding the reasons behind the variability in response values becomes crucial. A natural way to quantify the overall variation in these response values is expressed as:

\[SSTotal = (y_1-\bar{y})^2 + (y_2-\bar{y})^2 + \cdots + (y_n-\bar{y})^2. \] Here \(SSTotal\) denotes the total (corrected) sum of squares for convenience. Considering the regression model \(y = \beta_0+\beta_1\cdot x+\epsilon\), the variation in the response values is attributed to two sources: one stemming from the component \(\beta_0+\beta_1\cdot x\), and the other arising from random noise (\(\epsilon\)).

Mathematically, \(SSTotal\) can be split into two components: the total variation in the fitted values, denoted as \(SSModel\), and another is the sum of squared residuals (or errors), denoted as \(SSE\). This division is expressed as:

\[SSTotal = SSModel + SSE\]

In an ideal scenario where the fitted values and the observed response values are the same, \(SSTotal\) and \(SSModel\) would be equivalent. Hence, the ratio \(\frac{SSModel}{SSTotal}\) serves as a meaningful metric to evaluate the model’s effectiveness, is called the coefficient of determination, denoted by \(R^2\), and represents the proportion of total variability in response values explained by the model. \(R^2\) values always fall between 0 and 1, and a model with a higher \(R^2\) is considered more desirable.

  1. Appendix 2: Transformations

Transformations on the response variable in simple linear regression may be necessary or beneficial for several reasons:

The choice of transformation depends on the characteristics of the data and the goals of the analysis. However, it’s essential to interpret the results of the transformed model in the context of the original scale.

It’s advisable to perform diagnostic checks and assess the impact of transformations on the assumptions of the model before deciding on the most appropriate transformation. Additionally, communication of results should consider the interpretation of coefficients on the transformed scale.

It’s also helpful when transformations occur on the explanatory variable. Common transformations include the logarithmic or square transformations.

  1. Appendix 3: How to report results?

Refer to https://peterstatistics.com/CrashCourse/3-TwoVarUnpair/ScaleScale/ScaleScale4.html#google_vignette.

Chapter 3 Multiple Linear Regression Models

3.1 Introduction

Multiple Linear Regression is a statistical method used to model the relationship between a dependent variable and two or more independent variables. The goal is to understand how the independent variables collectively influence the dependent variable. The model assumes a linear relationship, meaning changes in the independent variables are associated with a constant change in the dependent variable. The equation for a multiple linear regression model is:

\[y = \beta_0+\beta_1\cdot x_1+\beta_2\cdot x_2+\cdots +\beta_p\cdot x_p+\epsilon\] where \(\beta\)’s are the coefficients representing the impact of each independent variable, \(x\)’s are the independent variables, and \(\epsilon\) is the error term. The coefficients are estimated using the method of least squares to minimize the sum of squared differences between the observed and predicted values. The model is assessed based on statistical significance, goodness of fit, and adherence to model assumptions. Multiple Linear Regression is widely used in various fields, including economics, finance, and social sciences, for predictive modeling and understanding the relationships between variables.

3.2 Data Preparation

We will again use the cereal data from https://www.lock5stat.com/datasets3e/Cereal.csv. We will study whether other variables have impact on the calorie variable.

Let’s start by reading the raw csv (comma separated values) data to create a data frame in RStudio that R recognizes.

# Print variables
names(cereal_data)
##  [1] "Name"     "Company"  "Serving"  "Calories" "Fat"      "Sodium"  
##  [7] "Carbs"    "Fiber"    "Sugars"   "Protein"
# Check dimmension
dim(cereal_data)
## [1] 30 10
# Print first 10 observations
head(cereal_data, n = 10)
##                     Name Company Serving Calories Fat Sodium Carbs Fiber Sugars
## 1             AppleJacks       K    1.00      117 0.6    143    27   0.5   15.0
## 2              Boo Berry       G    1.00      118 0.8    211    27   0.1   14.0
## 3           Cap'n Crunch       Q    0.75      144 2.1    269    31   1.1   16.0
## 4  Cinnamon Toast Crunch       G    0.75      169 4.4    408    32   1.7   13.3
## 5           Cocoa Blasts       Q    1.00      130 1.2    135    29   0.8   16.0
## 6            Cocoa Puffs       G    1.00      117 1.0    171    26   0.8   14.0
## 7           Cookie Crisp       G    1.00      117 0.9    178    26   0.5   13.0
## 8            Corn Flakes       K    1.00      101 0.1    202    24   0.8    3.0
## 9              Corn Pops       K    1.00      117 0.2    120    28   0.3   15.0
## 10               Crispix       K    1.00      113 0.3    229    26   0.1    3.0
##    Protein
## 1      1.0
## 2      1.0
## 3      1.3
## 4      2.7
## 5      1.0
## 6      1.0
## 7      1.0
## 8      2.0
## 9      1.0
## 10     2.0

3.3 Visualizing the Data

To gain insights into the relationship between calorie and each of the other variables (called explanatory variable or predictor), we’ll create a scatter plot with calorie as the response variable.

# Create a scatterplot matrix
cereal_data_new = cereal_data[, c(4, 3, 5:10)] # I re-arranged the columns so that calorie is the first column
plot( cereal_data_new) 

3.4 Fitting the multiple linear regression model

Now, let’s fit a multiple linear regression model to predict calories based on other numeric variables.

# Fit a multiple linear regression model with calorie as response and all others as explanatory variables
# Since there are 7 explanatory variables, I use the plus sign among them.
model <- lm(Calories ~ Serving + Fat + Sodium + Carbs + Fiber + Sugars + Protein, data = cereal_data_new)

Alternatively, the formula “Calories ~ Serving + Fat + Sodium + Carbs + Fiber + Sugars + Protein” can be replaced by “Calories ~ .”, where “.” indicates that all other variables in the data frame “cereal_data_new” are used as explanatory variables. That is, the code looks like

model <- lm(Calories ~ ., data = cereal_data_new)

We print the summary of the results:

summary(model)
## 
## Call:
## lm(formula = Calories ~ Serving + Fat + Sodium + Carbs + Fiber + 
##     Sugars + Protein, data = cereal_data_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.7792 -0.8488  0.0588  1.3287  4.9312 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.821819   4.903981   0.779   0.4441    
## Serving      2.014173   3.209035   0.628   0.5367    
## Fat          9.978345   0.540534  18.460 7.06e-15 ***
## Sodium      -0.017209   0.006887  -2.499   0.0204 *  
## Carbs        3.787857   0.123745  30.610  < 2e-16 ***
## Fiber       -2.983346   0.334064  -8.930 9.07e-09 ***
## Sugars       0.204899   0.138810   1.476   0.1541    
## Protein      3.990694   0.398598  10.012 1.18e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.441 on 22 degrees of freedom
## Multiple R-squared:  0.9963, Adjusted R-squared:  0.9951 
## F-statistic: 839.9 on 7 and 22 DF,  p-value: < 2.2e-16

3.5 Interpretation of Results

3.5.1 Coefficients

Let’s interpret the coefficients of the model.

The intercept \(\beta_0\) is estimated to be 3.82, which represents the estimated mean calories when the content of each explanatory variable is 0. Usually, the interpretation of such intercept does not make sense. The coefficient corresponding to Sugars is estimated to be 0.2, which represents the estimated change in mean calories for a one-unit increase in sugar, holding all other explanatory variables fixed. Other coefficients are interpreted similarly.

3.5.2 P-values

Now, let’s assess the significance of each of our predictor variables. To take an example, we look at the p-value for the test of \(H_0: \beta_6=0\) versus \(H_a: \beta_6\ne 0\).

The P-value is 0.154, which is relatively large compared to the commonly used significance level such as 0.05, indicating that our predictor variable - sugar content - is not significant as a predictor for calories.

To take another example, we look at the p-value for the test of \(H_0: \beta_5=0\) versus \(H_a: \beta_5\ne 0\).

The P-value is 0, which is basically zero, indicating that our predictor variable - carbohydrate - is extremely significant as a predictor for calories.

3.6 Model Checking

To ensure the validity of our model, let’s perform diagnostic checks.

# Diagnostic plots: Residuals vs. Fitted: Check for linearity and constant variance.
plot(model, which = 1)

Residuals vs. Fitted: This plot helps us check for linearity and constant variance. If the residuals exhibit a random scatter around zero, it suggests linearity. If the residual vs. fitted values plot exhibits a tunnel shape, it indicates a violation of the assumption of constant variance, also known as homoscedasticity. Homoscedasticity implies that the variance of the residuals should be roughly constant across all levels of the fitted values. A tunnel-shaped pattern in the residual plot suggests that the spread of residuals changes systematically as the predicted values increase or decrease.

The plot indicates a minor concern about the assumption of constant variance. This could be attributed to the presence of three outliers identified in the plot. The linearity is ok, since the residuals exhibit a random scatter around zero.

# Diagnostic plots: Normal Q-Q Plot- Check for normality of residuals.
plot(model, which = 2)  

A Normal Q-Q (Quantile-Quantile) plot is a graphical tool used to assess the normality of residuals in a statistical model. It compares the quantiles of the observed residuals against the quantiles of a theoretical normal distribution. If the points on the plot fall approximately along a straight line, it suggests that the residuals are normally distributed.

The plot here shows there is slight concern regarding the normality assumption on the residuals. This could be again attributed to the presence of three outliers identified in the plot

# Diagnostic plots: Scale-Location Plot- Check for constant variance.
plot(model, which = 3)  

Scale-Location Plot: Checks for constant variance of residuals. Ideally, points should be evenly distributed around the horizontal line.

The plot here shows a violation of constant variance in residuals.

# Diagnostic plots: Cook's Distance- Identify influential data points.
plot(model, which = 4)  

Cook’s Distance is a statistical measure used to identify influential data points in a regression analysis. It measures the impact of deleting a particular observation on the overall fit of the model. Large values of Cook’s Distance indicate that removing a specific data point significantly changes the regression coefficients.

The plot shows that observations 4, 11 and 28 are the top 3 influential points, with observation 28 the most influential.

3.7 R-squared

Now, let’s evaluate how well our model explains the variance in calories.

Recall in simple linear regression, we used \(R^\) (Coefficient of Determination) to assess the effectiveness of a model. It is a statistical measure that represents the proportion of the variability in the response variable that is explained by the model. It quantifies the goodness of fit of the model. R-squared can be defined similarly in multiple linear regression. R-squared values range from 0 to 1. A higher R-squared value indicates a better fit, meaning that a larger proportion of the variance in the dependent variable is accounted for by the independent variable(s) in the model. Remember, while R-squared provides information about the goodness of fit, it doesn’t necessarily indicate the causation between variables or the appropriateness of the model for making predictions. It should be used in conjunction with other model evaluation techniques.

The \(R^2\) here is 99.63% indicating that 99.63% of the total variation is explained by the model.

Unlike R-squared, which increases with the addition of more variables (even if they are not significant), adjusted R-squared penalizes the addition of unnecessary predictors.

The two are related as follows:

\[\text{Adjusted} ~R^² = 1 - (1 - R²)\cdot \frac{n - 1}{n - k - 1}\] where \(n\) is the number of observations and \(k+1\) is the number of coefficients including the intercept.

A model with high R-squared but low adjusted R-squared indicates that the model is overfitting the data.

3.8 Prediction Interval for a New Observation

A prediction interval for a new observation in the context of regression analysis provides a range within which a future response value falls with a certain desired probability. This interval takes into account both the uncertainty in estimating the population regression line and the variability of individual observations around that line.

Let’s make predictions based on our model in the previous section for new data points.

# New data with sugar values
new_data <- data.frame(Serving = 0.75, 
                       Fat = 2.0,
                       Sodium = 162, 
                       Carbs = 28, 
                       Fiber = 0.6, 
                       Sugars = 17.2, 
                       Protein = 1.5
                      )

# Predict calories for new data
predictions <- predict(model, newdata = new_data, interval = "prediction", level = 0.95)

# Display predictions
predictions
##        fit      lwr      upr
## 1 136.2816 130.6127 141.9504

Explanation of code:

  • A new data frame new_data is created with specific values for each predictor variable.
  • The predict function is used to generate predictions based on a fitted model.
    • model: This is the regression model that has been previously fitted, in this case, a multiple regression model.
    • newdata = new_data: This specifies the new data for which predictions are to be made. The new_data object should be a data frame containing values for the predictor variables used in the model.
    • interval = “prediction”: This argument specifies that we want to calculate prediction intervals. Prediction intervals provide a range within which the true response values are likely to fall with certain probability (as specified by the level).
    • level = 0.95: It is the desired probability that the interval will contain the true value of the response variable for a new observation.
    • predictions <- …: The results of the predict function, which include the predicted values and prediction intervals, are stored in the predictions object. This object is typically a matrix containing columns for the point predictions, lower bounds of the prediction intervals (lwr), and upper bounds of the prediction intervals (upr).
    • Finally, the predictions are displayed.

3.9 Confidence Interval for the Mean Response

The mean of the response for all the cereals with a particular set of values on the seven predictors considered here can be of an interest. Since it is a parameter, we can calculate the confidence interval for it. The code is

# Confidence interval for the mean response for all the cereals with 
# the particular set of values given above on the seven predictors
CI <- predict(model, newdata = new_data, interval = "confidence", level = 0.95)

The confidence interval is different from the prediction interval. In general, the prediction interval is always longer than the confidence interval, since there is more uncertainty when predicting an individual response.

3.10 Including indicator variables in multiple regression models

A categorical variable takes values that represent different categories. The iris data (the data frame can be used directly in R) has a variable called Species, which indicates whether a plant is a particular species. The data are displayed below:

DT::datatable(iris) # I used the datatable() function from DT package to better print the data

Sometimes, we need to create numeric variables that jointly are equivalent to a categorical variable such as Species here. The number of these numeric variables equals the number of categories of the categorical variable and each of these numeric variables takes only two values either 0 or 1, depending on the category of the categorical variable. The following shows how each of such binary variables is created:

\[ I_1 = \begin{cases} 1 & \text{if setosa} \\ 0 & \text{if not} \end{cases} \]

\[ I_2 = \begin{cases} 1 & \text{if versicolor} \\ 0 & \text{if not} \end{cases} \]

\[ I_3 = \begin{cases} 1 & \text{if virginica} \\ 0 & \text{if not} \end{cases} \]

We now add the 3 binary variables to the original data:

augmented_iris <- cbind(iris, 
                        I1 = as.numeric(iris$Species=="setosa"), 
                        I2 = as.numeric(iris$Species=="versicolor"), 
                        I3 = as.numeric(iris$Species=="virginica")
                       )

These 3 binary variables are called the indicator variables.

When a categorical variable with \(k\) categories is involved in a multiple linear regression as an explanatory variable, a common practice is to use only \(k-1\) of the indicator variables and there is a theory for such practice. The indicator variable that is not used is subject to the disposal of the person who does the analysis.

Next, we demonstrate how indicator variables can be used in multiple linear regression. We will use the Sepal.Width from the “iris” data as the response variable and all other variables (Sepal.Length, Petal.Length, Petal.Width, and Species) as the explanatory variables. For the Species variable, we use the two indicator variables I2 (for versicolor) and I3 (for virginica) instead (we then say that the setosa species is used as a baseline category). The model is

\[Sepal.Width = \beta_0+\beta_1\cdot Sepal.Length + \beta_2\cdot Petal.Length + \beta_3\cdot Petal.Width +\beta_4\cdot I2 +\beta_5\cdot I3+\epsilon \] The following is the R code that does the model fitting:

fit <- lm(Sepal.Width ~ Sepal.Length + Petal.Length + Petal.Width + I2 + I3, 
          data = augmented_iris)

summary(fit)
## 
## Call:
## lm(formula = Sepal.Width ~ Sepal.Length + Petal.Length + Petal.Width + 
##     I2 + I3, data = augmented_iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.00102 -0.14786  0.00441  0.18544  0.69719 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.65716    0.25595   6.475 1.40e-09 ***
## Sepal.Length  0.37777    0.06557   5.761 4.87e-08 ***
## Petal.Length -0.18757    0.08349  -2.246   0.0262 *  
## Petal.Width   0.62571    0.12338   5.072 1.20e-06 ***
## I2           -1.16029    0.19329  -6.003 1.50e-08 ***
## I3           -1.39825    0.27715  -5.045 1.34e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2678 on 144 degrees of freedom
## Multiple R-squared:  0.6352, Adjusted R-squared:  0.6225 
## F-statistic: 50.14 on 5 and 144 DF,  p-value: < 2.2e-16

The R output shows that all explanatory variables are significant at the 5% significance level.

The final regression equation is

\[Sepal.Width = 1.6572+0.3778\cdot Sepal.Length -0.1876\cdot Petal.Length + 0.6257\cdot Petal.Width - 1.1603\cdot I2 - 1.3983\cdot I3 \] What does the two coefficients \(-1.1603\) and \(-1.3983\) suggest? The number \(-1.1603\) means that versicolor flowers with the same sepal length, petal length and petal width is estimated to have a mean sepal width about \(1.1603\) centimeters shorter than that of the setosa flowers. While the number \(-1.3983\) means that virginica flowers with the same sepal length, petal length and petal width is estimated to have a mean sepal width about \(1.3983\) centimeter shorter than that of the setosa flowers.

The regression equation for the setosa species can be reduced to

\[Sepal.Width = 1.6572+0.3778\cdot Sepal.Length -0.1876\cdot Petal.Length + 0.6257\cdot Petal.Width \] For the versicolor species, since I2 takes the value “1” and I3 takes the value “0”, the regression equation reduces to

\[Sepal.Width = 0.2589+0.3778\cdot Sepal.Length -0.1876\cdot Petal.Length + 0.6257\cdot Petal.Width -1.16029\cdot 1 \] or, after combining like terms,

\[Sepal.Width = 0.4969+0.3778\cdot Sepal.Length -0.1876\cdot Petal.Length + 0.6257\cdot Petal.Width \]

For the verginica species, since I2 takes the value “0” and I3 takes the value “1”, the regression equation reduces to

\[Sepal.Width = 1.6572+0.3778\cdot Sepal.Length -0.1876\cdot Petal.Length + 0.6257\cdot Petal.Width -1.39825\cdot 1 \] or, after combining like terms,

\[Sepal.Width = 0.2589+0.3778\cdot Sepal.Length -0.1876\cdot Petal.Length + 0.6257\cdot Petal.Width \] So far so good or it is too complicated? The good news is there is no need to create indicate variables on our own. R software will do it for us automatically!

Here is how R does it:

# Fit a multiple regression model 
fit <- lm(Sepal.Width ~ Sepal.Length + Petal.Length + Petal.Width + Species, 
          data = iris)

# Display the summary of the model
summary(fit)
## 
## Call:
## lm(formula = Sepal.Width ~ Sepal.Length + Petal.Length + Petal.Width + 
##     Species, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.00102 -0.14786  0.00441  0.18544  0.69719 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.65716    0.25595   6.475 1.40e-09 ***
## Sepal.Length       0.37777    0.06557   5.761 4.87e-08 ***
## Petal.Length      -0.18757    0.08349  -2.246   0.0262 *  
## Petal.Width        0.62571    0.12338   5.072 1.20e-06 ***
## Speciesversicolor -1.16029    0.19329  -6.003 1.50e-08 ***
## Speciesvirginica  -1.39825    0.27715  -5.045 1.34e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2678 on 144 degrees of freedom
## Multiple R-squared:  0.6352, Adjusted R-squared:  0.6225 
## F-statistic: 50.14 on 5 and 144 DF,  p-value: < 2.2e-16

R has created two indicator variables automatically and they are Speciesversicolor (the same as I2) and Speciesvirginica (the same as I3). The results based on the above code is the same as we did using I2 and I3.

Some discussions:

  1. If we use sepal length along with species (use I2 and I3) as the explanatory variables, what would be the model and the regression equation you get?

  2. What are the separate regression equations for versicolor flowers and virginica flowers?

  3. If the equations in part (b) are plotted, what special pattern would you observe? What assumptions have you made implicitly? Is such an assumption reasonable? (hint: plot sepal width vs sepal length for the 3 species on the same graph, see below.). If such an assumption is not reasonable, could we gain a significant improvement in the fit by allowing for a different effect for each species?

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    labs(title = "Scatter Plot with Regression Lines for Each Species",
         x = "Sepal Length",
         y = "Sepal Width")

3.11 Including Interaction Terms in Multiple Regression Models

Including interactions in a linear regression model allows for the possibility that the effect of one predictor variable on the response may depend on the level of another predictor variable.

Example 1.

# Fit a linear regression model with interactions
model_with_interactions <- lm(Calories ~ Serving + Fat + Sodium + Carbs + Fiber + Sugars * Protein, data = cereal_data_new)

# Print the summary of the model
summary(model_with_interactions)
## 
## Call:
## lm(formula = Calories ~ Serving + Fat + Sodium + Carbs + Fiber + 
##     Sugars * Protein, data = cereal_data_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4702 -0.6082  0.0204  1.2538  5.1062 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -3.267619  10.573862  -0.309    0.760    
## Serving         4.730155   4.828139   0.980    0.338    
## Fat            10.095856   0.567364  17.794 3.83e-14 ***
## Sodium         -0.013915   0.008198  -1.697    0.104    
## Carbs           3.874877   0.169602  22.847 2.58e-16 ***
## Fiber          -2.886639   0.360603  -8.005 8.15e-08 ***
## Sugars          0.354501   0.241898   1.465    0.158    
## Protein         4.554888   0.845463   5.387 2.42e-05 ***
## Sugars:Protein -0.082924   0.109280  -0.759    0.456    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.464 on 21 degrees of freedom
## Multiple R-squared:  0.9964, Adjusted R-squared:  0.995 
## F-statistic: 720.8 on 8 and 21 DF,  p-value: < 2.2e-16

Explanation of code:

In this example, Sugars * Protein includes both the main effects of Sugars and Protein and their interaction term. The * operator implicitly includes the main effects and the interaction term. If you prefer to include only the interaction term without the main effects, you can use the : operator:

Explanation of results:

Interpreting interactions involves considering how the effect of one variable on the response varies depending on the level of another variable. The summary output will provide information about coefficients, standard errors, t-values, and p-values associated with each term in the model.

Here, the interaction effect of the term Sugars:Protein is not statistically significant, so including this term does not improve the model fitting.

3.12 Multicollinearity Issues

Multicollinearity is a phenomenon in regression analysis where two or more predictor (i.e. explanatory) variables in a multiple regression model are highly correlated, making it difficult to precisely estimate the individual contribution of each variable to the dependent variable. It is a form of collinearity, which generally refers to a high degree of correlation among the predictor variables.

In the context of multicollinearity:

  • Perfect Multicollinearity: This occurs when one predictor variable is an exact linear function of another. For example, if you have two variables \(X_1\) and \(X_2\), and \(X_1\) is equal to a constant times \(X_2\) (or vice versa), then you have perfect multicollinearity.

  • High Multicollinearity: This occurs when there is a strong correlation, but not a perfect linear relationship, between two or more predictor variables. High multicollinearity can still lead to issues in regression analysis.

Multicollinearity can have several implications:

  • Unstable Coefficient Estimates: The coefficients of the correlated variables become unstable, and small changes in the data can lead to large changes in the estimated coefficients.

  • Increased Standard Errors: The standard errors of the coefficient estimates tend to be large, making it difficult to identify statistically significant predictors.

  • Reduced Statistical Power: The ability to detect the true effects of individual predictors may be reduced.

  • Interpretation Challenges: It becomes challenging to interpret the individual contribution of each predictor to the dependent variable.

To detect multicollinearity, researchers often examine variance inflation factors (VIF). High VIF values (typically above 5) suggest a potential multicollinearity issue. The R package “car” provides the function VIF() for calculating VIF for each predictor.

Here is an example using the cereals data. We remove the first two columns that are not numeric.

model <- lm(Calories ~ ., data = cereal_data[, -c(1:2)])
# The lm() function is used to fit a linear regression model.
# "Calories ~ ." means that Calories is the dependent variable, and all other variables in the 'cereal_data' dataset except the first two columns are used as predictors.

# Load the 'car' package
library(car)
# The 'car' package is loaded to access the VIF function.

# Calculate Variance Inflation Factor (VIF) for each predictor in the model
vif(model)
##  Serving      Fat   Sodium    Carbs    Fiber   Sugars  Protein 
## 1.508242 1.710968 1.383636 4.084036 1.919783 2.666816 1.785157
# The vif() function calculates the Variance Inflation Factor for each predictor variable in the linear regression model.
# VIF measures how much the variance of an estimated regression coefficient increases if the predictor is correlated with other predictors.

Explanation of code:

  • lm() function: This is used to fit a linear regression model.

  • Calories ~ .: Here, Calories is the dependent variable (the one you’re trying to predict), and the . means “use all other variables in the dataset as predictors.” This will include all columns in cereal_data except for the first two, which are excluded by the next part.

  • data = cereal_data[, -c(1:2)]: This specifies the dataset to use for the model. The -c(1:2) notation means that the first two columns of cereal_data are being excluded from the analysis. This is useful if those columns are identifiers or not relevant as predictors.

  • This command loads the car package, which provides various functions for regression analysis and diagnostics, including the Variance Inflation Factor (VIF).

  • vif() function: This function calculates the VIF for each predictor in the linear model.

  • Purpose of VIF: VIF measures how much the variance of an estimated regression coefficient increases when your predictors are correlated. A high VIF indicates that a predictor is highly correlated with other predictors, which can lead to multicollinearity issues. Typically, a VIF value above 5 or 10 suggests problematic multicollinearity. Multicollinearity refers to a situation in multiple regression where two or more predictor variables are highly correlated. This can lead to several consequences:

    • Inflated Standard Errors: Multicollinearity increases the standard errors of the coefficients, which can lead to wider confidence intervals and decreased statistical significance. This means that even if a predictor is actually important, it may not appear significant.
    • Reduced Model Predictive Power: While multicollinearity does not affect the overall fit of the model (i.e., \(R^2\) can remain high), it can impair the model’s predictive accuracy when applied to new data.

Dealing with multicollinearity may involve:

  • Removing Correlated Predictors: If two or more predictors are highly correlated, removing one of them might alleviate multicollinearity.

  • Combining Variables: Creating composite variables or using dimensionality reduction techniques (such as the Principal Component Analysis) can help handle multicollinearity.

It’s important to address multicollinearity to ensure the reliability and interpretability of regression results.

3.13 Nested \(F\) Test: Testing Subsets of Predictors

We might want to test if a subset of predictors (or terms) can be dropped from a multiple linear regression model. For example, for a 10-predictor model, we might want to test

\[H_0: \beta_4 = 0~ \text{and}~ \beta_7 = 0~ \text{vs.}~ H_a: \beta_4 \ne 0~ \text{or}~ \beta_7 \ne 0\] The null hypothesis says that the coefficients of the 4th and 7th terms in the model are both 0, while the alternative hypothesis says that at least one of them is non-zero. There is a formal procedure for doing this and it is called the partial \(F\)-test or the nested \(F\)-test. The procedure involves fitting two models: one is the model you are fitting (called the larger model for ease of reference) and the other one is a restricted model (called the smaller model) which is a special case of the model being fitted.

Let’s use the iris data (built-in in R) to show the procedure. For simplicity, we disregard petal length and petal width, and fit a model to study how sepal width is related to both sepal length and species. We are particularly interested in whether the effect of sepal length on sepal width varies between species. That is, we need to test whether there is any interaction between sepal length and species. To do this, we need to fit two models, one nested in the other:

The larger model is

\[Sepal.Width = \beta_0+\beta_1\cdot Sepal.Length + \beta_2\cdot versicolor+\beta_3\cdot virginica+\beta_4\cdot Sepal.Length\cdot versicolor+\beta_5\cdot Sepal.Length\cdot virginica+\epsilon\] where \(versicolor\) and \(virginica\) represent indicator or dummy variables, and the product terms are interaction terms.

The smaller model is

\[Sepal.Width = \beta_0+\beta_1\cdot Sepal.Length + \beta_2\cdot versicolor+\beta_3\cdot virginica+\epsilon\] We need to test:

\[H_0: \beta_4 = 0~ \text{and}~ \beta_5 = 0~~ \text{vs.}~~ H_a: \beta_4 \ne 0~ \text{or}~ \beta_5 \ne 0\] The fitting of each of the two models yields

# Fit the larger model
large_model <- lm(Sepal.Width ~ Sepal.Length*Species, data = iris)
# Print the detailed results
summary(large_model)
## 
## Call:
## lm(formula = Sepal.Width ~ Sepal.Length * Species, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72394 -0.16327 -0.00289  0.16457  0.60954 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -0.5694     0.5539  -1.028 0.305622    
## Sepal.Length                     0.7985     0.1104   7.235 2.55e-11 ***
## Speciesversicolor                1.4416     0.7130   2.022 0.045056 *  
## Speciesvirginica                 2.0157     0.6861   2.938 0.003848 ** 
## Sepal.Length:Speciesversicolor  -0.4788     0.1337  -3.582 0.000465 ***
## Sepal.Length:Speciesvirginica   -0.5666     0.1262  -4.490 1.45e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2723 on 144 degrees of freedom
## Multiple R-squared:  0.6227, Adjusted R-squared:  0.6096 
## F-statistic: 47.53 on 5 and 144 DF,  p-value: < 2.2e-16
# Print the analysis of variance result
anova(large_model)
## Analysis of Variance Table
## 
## Response: Sepal.Width
##                       Df  Sum Sq Mean Sq  F value    Pr(>F)    
## Sepal.Length           1  0.3913  0.3913   5.2757   0.02307 *  
## Species                2 15.7225  7.8613 105.9948 < 2.2e-16 ***
## Sepal.Length:Species   2  1.5132  0.7566  10.2011  7.19e-05 ***
## Residuals            144 10.6800  0.0742                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Fit the smaller model
small_model <- lm(Sepal.Width ~ Sepal.Length + Species, data = iris)
# Print the detailed results
summary(small_model)
## 
## Call:
## lm(formula = Sepal.Width ~ Sepal.Length + Species, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.95096 -0.16522  0.00171  0.18416  0.72918 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.67650    0.23536   7.123 4.46e-11 ***
## Sepal.Length       0.34988    0.04630   7.557 4.19e-12 ***
## Speciesversicolor -0.98339    0.07207 -13.644  < 2e-16 ***
## Speciesvirginica  -1.00751    0.09331 -10.798  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.289 on 146 degrees of freedom
## Multiple R-squared:  0.5693, Adjusted R-squared:  0.5604 
## F-statistic: 64.32 on 3 and 146 DF,  p-value: < 2.2e-16
# Print the analysis of variance result
anova(small_model)
## Analysis of Variance Table
## 
## Response: Sepal.Width
##               Df  Sum Sq Mean Sq F value  Pr(>F)    
## Sepal.Length   1  0.3913  0.3913  4.6851 0.03205 *  
## Species        2 15.7225  7.8613 94.1304 < 2e-16 ***
## Residuals    146 12.1931  0.0835                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output of the larger_model includes the main effect of Sepal.Length, the main effects for each species, and two interaction effects between Sepal.Length and the species. Two ANOVA tables, one for each model, are also produced, showing the contributions of individual terms to the total sum of squares of the (adjusted) response values.

Based on the ANOVA table results, the larger model has an error sum of squares of 10.6800 and a mean squared error (MSE) of 0.0742 (which is obtained by 10.68/144). The small model has an error sum of squares of 12.1931 and a mean squared error (MSE) of 0.0835 (which is obtained by 12.1931/146).

To test the above hypotheses, we calculate the \(F\) test statistic as

\[F = \frac{(12.1931-10.6800)/2}{0.0742}=10.20\] where the number 2 in the numerator equals the number of coefficients we are testing. We know the larger model always fits better, and it indicates dropping the two terms (Speciesversicolor and Speciesvirginica) from the model increases the error sum of squares by (12.1931-10.6800) or 1.5131 (this is called the extra sum of squares). Dividing the extra sum of squares by the number of terms dropped gives the average increase in the error sum of squares. Such average increase is compared with the mean squared error of the larger model and the ratio is the test statistic.

It can be shown that, under the null hypothesis, the \(F\) test statistic has an \(F\) distribution, The \(F\)-distribution (also known as the Fisher-Snedecor distribution) is a probability distribution that arises frequently in statistical hypothesis testing. It is named after Ronald A. Fisher and George W. Snedecor, who independently developed it.

In the context of the \(F\)-distribution, there are two sets of degrees of freedom associated with it: the degrees of freedom for the numerator and the degrees of freedom for the denominator. The following graph shows examples of a few \(F\) distributions:

To find the \(p\)-value of a test based on the \(F\) distribution, we always calculate the right tail area under the \(F\) distribution, since the larger the \(F\) test statistic, the more evidence that the large model and the small model differ. The R code for calculating the \(p\)-value is given by

1 - pf(F, df1, df2)  ## f is test statistic and df1 and df2 are degrees of freedom

For our example, \(F=0.0754, df1 = 2, df2 = 144\) and the \(p\)-value is 7.19e-05. Since the \(p\)-value is basically 0, we do reject the null hypothesis and conclude that there is a significant difference between the large model and the small model.

To avoid the above lengthy procedure, we can use the “car” package to achieve the same result, as shown in the following code:

library(car)

# Fit the multiple regression model
model <- lm(Sepal.Width ~ Sepal.Length + Species + Sepal.Length:Species, data = iris)

# Specify the hypothesis to test
hypothesis <- c("Sepal.Length:Speciesversicolor = 0", "Sepal.Length:Speciesvirginica = 0")

# Test the hypothesis
hypothesis_test <- linearHypothesis(model, hypothesis)

# Display the results
hypothesis_test
## 
## Linear hypothesis test:
## Sepal.Length:Speciesversicolor = 0
## Sepal.Length:Speciesvirginica = 0
## 
## Model 1: restricted model
## Model 2: Sepal.Width ~ Sepal.Length + Species + Sepal.Length:Species
## 
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)    
## 1    146 12.193                                 
## 2    144 10.680  2    1.5132 10.201 7.19e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p-value is basically 0, the interaction term is significant. It means that the effect of sepal length on sepal width is significantly different for different species.

A remark:

When fitting a linear regression model, the bottom line of the R output includes an F test statistic, along with two sets of degrees of freedom and a \(p\)-value. This is the \(F\) test for \[H_0: \text{All} ~\beta \text{'s are 0} ~ ~\text{vs.}~~ H_a: \text{At least one} ~\beta \text{ is not 0.}\] It is a special nested \(F\) test with the small model being the one without any explanatory variable (such a model is called the intercept-only model).

Chapter 4 Additional Topics in Multiple Regression

4.1 Added Variable Plots (Optional)

The added variable plot (also known as partial regression plot), introduced by Mosteller and Tukey (1977), provides a visual representation of the regression coefficient of a new variable being considered for inclusion in a model. Here’s a step-by-step guide to construct an added variable plot:

  • Regress Y on all variables other than X: This step involves fitting a regression model where the response variable Y is regressed on all predictor variables except for X. The residuals obtained from this regression represent the part of Y not explained by the other variables (Y residuals).

  • Regress X on all other variables: Similarly, regress the variable X on all other variables included in the model. The residuals obtained from this regression represent the part of X not explained by the other variables (X residuals).

  • Construct a scatter plot: Create a scatter plot with Y residuals on the y-axis and X residuals on the x-axis.

The interpretation of Y and X residuals is as follows:

  • Y residuals: These represent the portion of Y that is not explained by all variables other than X.
  • X residuals: These represent the portion of X that is not explained by other variables.

The slope of the line fitted to the points in the added variable plot must be equal to the regression coefficient when Y is regressed on all variables, including X.

The interpretation of the plot:

  • A strong linear relationship in the added variable plot indicates the increased importance of the contribution of X to the model already containing the other predictors. Essentially, it suggests that X adds significant explanatory power to the model beyond what is already captured by the existing predictors.
  • If the relationship in the added variable plot is non-linear, it suggests that the relationship between the response variable Y and the predictor variable X is not adequately captured by a linear model, even after accounting for the effects of other predictor variables. Transforming the predictor variable X or the response variable Y may help linearize the relationship.

Here’s how you can create an added variable plot in R using the car package:

# Install and load the 'car' package (if not already installed)
# install.packages("car")
library(car)

# Fit a multiple regression model
model <- lm(Sepal.Width ~ Sepal.Length + Petal.Length + Petal.Width, data = iris)

# Create an Added Variable Plot for each of the predictors
avPlots(model)

# Create an Added Variable Plot for the variable 'Sepal.Length'
avPlots(model, "Sepal.Length")

An example with simulated data:

# Set the seed for reproducibility
set.seed(321)

# Generate random data for x1 and x2
x1 <- rnorm(180, 100, 15)
x2 <- rnorm(180, 69, 3.6)

# Generate y using a linear equation with noise
y <- 5 - 2*x1 + rnorm(180, 0, 1)

# Fit a linear model with terms x1 and x2
model <- lm(y ~ x1 + x2)

# Create added variable plots to visualize the effect of each predictor on the response
avPlots(model)

The graph shows that adding \(x_2\) term in the model while \(x_1\) is already in the model won’t improve the model.

4.2 Best Subsets of Predictors

Best Subsets of Predictors is an approach in regression analysis where multiple linear regression models are fitted using different combinations of predictor variables, and the best model is selected based on a specified criterion. The goal is to identify the subset of predictors that results in the most parsimonious and effective model.

Mallow’s \(C_p\) is one of many statistical criteria used in the process of choosing the best subset of predictor variables for a regression model. The goal is to strike a balance between model simplicity and predictive accuracy.

Mallow’s \(C_p\) is defined as:

\[C_p=\frac{SSE_m}{MSE_k}+2(m+1)-n\]

where:

  • \(SSE_m\) is the sum of squared errors for the model with \(m\) predictors
  • \(SSE_k\) is the mean square error (or the estimated variance of the error term) for the full model with all \(k\) predictors.
  • \(n\) is the number of observations.

A model with \(C_p\) less than \(m+1\) or not much greater than \(m+1\) indicates a better balance between goodness of fit (as measured by \(SSE\)) and model complexity, and thus is a preferred model.

Here is R code for identifying the best model based on Mallow’s \(C_p\) criterion using simulated data.:

# Install and load the leaps package
library(leaps)

# Set a seed for reproducibility of random numbers to be generated below
set.seed(123)

# Create a simulated dataset with 100 observations and 6 predictors
simulated_data <- data.frame(
  Predictor1 = rnorm(100),  # Random normal values for Predictor1
  Predictor2 = rnorm(100),  # Random normal values for Predictor2
  Predictor3 = rnorm(100),  # Random normal values for Predictor3
  Predictor4 = rnorm(100),  # Random normal values for Predictor4
  Predictor5 = rnorm(100),  # Random normal values for Predictor5
  Predictor6 = rnorm(100)   # Random normal values for Predictor6
)

# The true model
# Response = 20 + 2 * Predictor2 - 4 * Predictor4 + 5 * Predictor5 - 6 * Predictor6 + random noise

# Generate response data based on the model
simulated_data$Response = 20 + 2 * simulated_data$Predictor2 - 4 * simulated_data$Predictor4 + 
                          5 * simulated_data$Predictor5 - 6 * simulated_data$Predictor6 + 
                          rnorm(100)  # Adding random noise to the response

# Use the leaps package to find the best subset of predictors for the response variable
best_subsets <- regsubsets(Response ~ ., data = simulated_data, nvmax = 4)  # nvmax sets the maximum number of predictors to consider

# Get the summary of the regression subsets, including Mallow's Cp values for each subset
res = summary(best_subsets)
cp_values <- res$cp  # Extract Mallow's Cp values from the summary

# Print the best subset of predictors and its corresponding Mallow's Cp value
cat("The best subset is:\n ", 
    paste("  ", best_subsets$xnames[-1][tail(summary(best_subsets)$which,1)[-1]], collapse = ", "),
    "\nwith Mallow's Cp =", min(cp_values), ".", sep = "")
## The best subset is:
##     Predictor2,    Predictor4,    Predictor5,    Predictor6
## with Mallow's Cp =3.112201.

The simulation shows that the Mallow’s \(C_p\) does finds the best model.

Practice: Using the FirstYearGPA data from https://www.stat2.org/datapage.html to fit and find the best model with GPA being the response variable.

4.3 Other Selection Methods for Best Subsets

In addition to Mallow’s \(C_p\) and \(R^2\) methods, Forward, backward, and stepwise selections are other variable selection methods commonly used in regression analysis. These methods are used to choose a subset of predictor variables to include in a model based on certain criteria, such as Akaike’s information criterion (AIC) or Bayesian information criterion (BIC). Here’s an overview of each method:

  • Forward Selection:

    • Process: Start with an empty model and add variables one by one, selecting the variable that most improves the model fit at each step.

    • Criteria: The criteria for improvement can vary (e.g., minimizing AIC or BIC).

The R code looks like

lm_model <- lm(y ~ 1, data = your_data)
forward_method <- step(lm_model, direction = "forward")
  • Backward Elimination:

    • Process: Start with a model containing all predictor variables and remove one variable at each step, eliminating the variable that contributes the least to the model.

    • Criteria: Similar to forward selection, the criteria for removal can vary.

lm_model_full <- lm(y ~ ., data = your_data)
backward_method <- step(lm_model_full, direction = "backward")
  • Stepwise Selection:

    • Process: Combines elements of both forward selection and backward elimination. It starts with no variables in the model or all variables in the model, adds variables one by one, and removes variables when they no longer contribute significantly.

    • Criteria: Typically uses a combination of adding and removing criteria.

lm_model_full <- lm(y ~ ., data = your_data)
stepwise_method <- step(lm_model_full, direction = "both")

We use the simulated data given before to demonstrate the Forward, backward, and stepwise selection methods:

# Fit an initial model with only the intercept
null_model <- lm(Response ~ 1, data = simulated_data)

# Fit a full model with all predictors
full_model <- lm(Response ~ ., data = simulated_data)

# Perform forward selection
forward_model <- step(null_model, 
                      direction = "forward", 
                      scope = list(lower = null_model, upper = full_model), 
                      trace = 1)
## Start:  AIC=409.03
## Response ~ 1
## 
##              Df Sum of Sq    RSS    AIC
## + Predictor6  1   1649.44 4208.1 377.96
## + Predictor4  1   1373.53 4484.0 384.31
## + Predictor5  1   1285.68 4571.9 386.25
## <none>                    5857.5 409.03
## + Predictor1  1     33.66 5823.9 410.46
## + Predictor2  1     28.99 5828.5 410.54
## + Predictor3  1      0.08 5857.5 411.03
## 
## Step:  AIC=377.96
## Response ~ Predictor6
## 
##              Df Sum of Sq    RSS    AIC
## + Predictor5  1   2045.18 2162.9 313.40
## + Predictor4  1   1671.05 2537.0 329.36
## + Predictor2  1    102.02 4106.1 377.51
## <none>                    4208.1 377.96
## + Predictor1  1     65.76 4142.3 378.38
## + Predictor3  1      1.06 4207.0 379.93
## 
## Step:  AIC=313.4
## Response ~ Predictor6 + Predictor5
## 
##              Df Sum of Sq     RSS    AIC
## + Predictor4  1   1668.46  494.45 167.83
## + Predictor2  1    306.27 1856.64 300.13
## <none>                    2162.91 313.40
## + Predictor3  1      5.53 2157.38 315.15
## + Predictor1  1      0.08 2162.84 315.40
## 
## Step:  AIC=167.83
## Response ~ Predictor6 + Predictor5 + Predictor4
## 
##              Df Sum of Sq    RSS     AIC
## + Predictor2  1    391.63 102.82  12.784
## <none>                    494.45 167.828
## + Predictor1  1      3.19 491.26 169.181
## + Predictor3  1      0.33 494.12 169.760
## 
## Step:  AIC=12.78
## Response ~ Predictor6 + Predictor5 + Predictor4 + Predictor2
## 
##              Df Sum of Sq    RSS    AIC
## <none>                    102.82 12.784
## + Predictor1  1  0.123901 102.70 14.664
## + Predictor3  1  0.002185 102.82 14.782
cat("Forward Selection Model:\n")
## Forward Selection Model:
summary(forward_model)
## 
## Call:
## lm(formula = Response ~ Predictor6 + Predictor5 + Predictor4 + 
##     Predictor2, data = simulated_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.44004 -0.70499  0.00054  0.68642  2.43949 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19.8520     0.1054  188.43   <2e-16 ***
## Predictor6   -6.0951     0.1156  -52.73   <2e-16 ***
## Predictor5    5.0232     0.1094   45.91   <2e-16 ***
## Predictor4   -4.0745     0.1012  -40.25   <2e-16 ***
## Predictor2    2.1002     0.1104   19.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.04 on 95 degrees of freedom
## Multiple R-squared:  0.9824, Adjusted R-squared:  0.9817 
## F-statistic:  1329 on 4 and 95 DF,  p-value: < 2.2e-16
# Backward Elimination
backward_model <- step(full_model, 
                       direction = "backward", 
                       trace = 1)
## Start:  AIC=16.66
## Response ~ Predictor1 + Predictor2 + Predictor3 + Predictor4 + 
##     Predictor5 + Predictor6
## 
##              Df Sum of Sq     RSS    AIC
## - Predictor3  1      0.00  102.70  14.66
## - Predictor1  1      0.12  102.82  14.78
## <none>                     102.70  16.66
## - Predictor2  1    388.45  491.15 171.16
## - Predictor4  1   1746.73 1849.43 303.75
## - Predictor5  1   2182.87 2285.56 324.92
## - Predictor6  1   3008.70 3111.40 355.77
## 
## Step:  AIC=14.66
## Response ~ Predictor1 + Predictor2 + Predictor4 + Predictor5 + 
##     Predictor6
## 
##              Df Sum of Sq     RSS    AIC
## - Predictor1  1      0.12  102.82  12.78
## <none>                     102.70  14.66
## - Predictor2  1    388.56  491.26 169.18
## - Predictor4  1   1751.40 1854.09 302.00
## - Predictor5  1   2188.64 2291.34 323.17
## - Predictor6  1   3009.32 3112.02 353.79
## 
## Step:  AIC=12.78
## Response ~ Predictor2 + Predictor4 + Predictor5 + Predictor6
## 
##              Df Sum of Sq     RSS    AIC
## <none>                     102.82  12.78
## - Predictor2  1    391.63  494.45 167.83
## - Predictor4  1   1753.82 1856.64 300.14
## - Predictor5  1   2281.56 2384.39 325.15
## - Predictor6  1   3009.25 3112.08 351.79
cat("\nBackward Elimination Model:\n")
## 
## Backward Elimination Model:
summary(backward_model)
## 
## Call:
## lm(formula = Response ~ Predictor2 + Predictor4 + Predictor5 + 
##     Predictor6, data = simulated_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.44004 -0.70499  0.00054  0.68642  2.43949 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19.8520     0.1054  188.43   <2e-16 ***
## Predictor2    2.1002     0.1104   19.02   <2e-16 ***
## Predictor4   -4.0745     0.1012  -40.25   <2e-16 ***
## Predictor5    5.0232     0.1094   45.91   <2e-16 ***
## Predictor6   -6.0951     0.1156  -52.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.04 on 95 degrees of freedom
## Multiple R-squared:  0.9824, Adjusted R-squared:  0.9817 
## F-statistic:  1329 on 4 and 95 DF,  p-value: < 2.2e-16
# Stepwise Selection starting from null model
stepwise_model1 <- step(null_model, 
                       direction = "both", 
                       trace = 1)
## Start:  AIC=409.03
## Response ~ 1
cat("\nStepwise Selection Model (starting from null model):\n")
## 
## Stepwise Selection Model (starting from null model):
summary(stepwise_model1)
## 
## Call:
## lm(formula = Response ~ 1, data = simulated_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.1119  -5.4323  -0.2773   5.3825  18.9794 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  20.5632     0.7692   26.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.692 on 99 degrees of freedom
# Stepwise Selection starting from full model
stepwise_model2 <- step(full_model, 
                       direction = "both", 
                       trace = 1)
## Start:  AIC=16.66
## Response ~ Predictor1 + Predictor2 + Predictor3 + Predictor4 + 
##     Predictor5 + Predictor6
## 
##              Df Sum of Sq     RSS    AIC
## - Predictor3  1      0.00  102.70  14.66
## - Predictor1  1      0.12  102.82  14.78
## <none>                     102.70  16.66
## - Predictor2  1    388.45  491.15 171.16
## - Predictor4  1   1746.73 1849.43 303.75
## - Predictor5  1   2182.87 2285.56 324.92
## - Predictor6  1   3008.70 3111.40 355.77
## 
## Step:  AIC=14.66
## Response ~ Predictor1 + Predictor2 + Predictor4 + Predictor5 + 
##     Predictor6
## 
##              Df Sum of Sq     RSS    AIC
## - Predictor1  1      0.12  102.82  12.78
## <none>                     102.70  14.66
## + Predictor3  1      0.00  102.70  16.66
## - Predictor2  1    388.56  491.26 169.18
## - Predictor4  1   1751.40 1854.09 302.00
## - Predictor5  1   2188.64 2291.34 323.17
## - Predictor6  1   3009.32 3112.02 353.79
## 
## Step:  AIC=12.78
## Response ~ Predictor2 + Predictor4 + Predictor5 + Predictor6
## 
##              Df Sum of Sq     RSS    AIC
## <none>                     102.82  12.78
## + Predictor1  1      0.12  102.70  14.66
## + Predictor3  1      0.00  102.82  14.78
## - Predictor2  1    391.63  494.45 167.83
## - Predictor4  1   1753.82 1856.64 300.14
## - Predictor5  1   2281.56 2384.39 325.15
## - Predictor6  1   3009.25 3112.08 351.79
cat("\nStepwise Selection Model(starting from full model:\n")
## 
## Stepwise Selection Model(starting from full model:
summary(stepwise_model2)
## 
## Call:
## lm(formula = Response ~ Predictor2 + Predictor4 + Predictor5 + 
##     Predictor6, data = simulated_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.44004 -0.70499  0.00054  0.68642  2.43949 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19.8520     0.1054  188.43   <2e-16 ***
## Predictor2    2.1002     0.1104   19.02   <2e-16 ***
## Predictor4   -4.0745     0.1012  -40.25   <2e-16 ***
## Predictor5    5.0232     0.1094   45.91   <2e-16 ***
## Predictor6   -6.0951     0.1156  -52.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.04 on 95 degrees of freedom
## Multiple R-squared:  0.9824, Adjusted R-squared:  0.9817 
## F-statistic:  1329 on 4 and 95 DF,  p-value: < 2.2e-16

All three methods end up with the correct model!

4.4 Cross-Validation

Many times, the model we build is for prediction in the future. A natural question would be: will our model perform well? That a model performs well does not guarantee it also perform well on new data. How can we assess the performance of the model for future prediction?

4.4.1 Training and Holdout Samples

Cross-validation is a statistical technique used in machine learning and statistical modeling to assess the performance and generalizability of a predictive model. The main goal of cross-validation is to estimate how well a model will perform on an independent dataset by partitioning the available data into two subsets: one for training the model and another for testing its performance.

  • Training Sample:
    • The training sample is the portion of the dataset used to train the model.
    • During model training, the algorithm learns the patterns and relationships within the data, adjusting its parameters to minimize the difference between predicted and actual outcomes.
    • The larger and more representative the training sample is, the better the model may capture the underlying patterns in the data.
    • The training sample is typically used for fitting the model, estimating coefficients, and adjusting internal parameters.
  • Holdout Sample (Testing Sample):
    • The holdout sample is a separate subset of the data that is not used during the training phase.
    • After the model is trained on the training sample, it is tested on the holdout sample to assess its performance on new, unseen data.
    • The holdout sample serves as an independent evaluation set, providing an estimate of how well the model is expected to generalize to new, unseen data.
    • The performance metrics on the holdout sample can help identify potential issues such as overfitting or underfitting.

It is generally expected that the performance of the model is worse for prediction based on the holdout data than for prediction based on the training data. If this happens, the model might not have found the patterns or relationships that occur or hold in the more general population.

Here is an example using the caret package, which is commonly used for machine learning tasks. We will use the iris data to fit a multiple linear regression model with Sepal length as the response and sepal Width, petal length, petal width as predictors.

# Set a seed for reproducibility
set.seed(123)

# Split the data into a training set (80%) and a testing set (20%)
size_of_data = nrow(iris)
size_of_training_sample = size_of_data * 0.8
train_indices <- sample(1:size_of_data, size_of_training_sample)
train_data <- iris[train_indices, ]
test_data <- iris[-train_indices, ]

# Fit a model using the training data
model <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, 
            data = train_data)

# Make predictions on the test data
predictions <- predict(model, newdata = test_data)

# Evaluate the model's performance on the test data (using mean squared error)
mse <- mean((test_data$Sepal.Length - predictions)^2)

cat("Mean Squared Error:", mse, "\n")
## Mean Squared Error: 0.07225017

Explanation of code:

  • set.seed(123): This line sets a seed for the random number generator. Setting a seed ensures that random processes, such as sampling, will produce the same result each time the code is run. This is useful for reproducibility, especially when sharing or rerunning the code.
  • The next 5 lines of code split the Iris dataset into a training set and a testing set. The dataset size is determined using nrow(iris), and then 80% of the data is allocated for training. The sample() function is used to randomly select indices for the training set (train_indices). The training set (train_data) is created by subsetting the original dataset based on the selected indices, and the testing set (test_data) contains the remaining data.
  • model <- lm(…): fits a linear regression model using the training data. The dependent variable is ‘Sepal.Length,’ and the predictors are ‘Sepal.Width,’ ‘Petal.Length,’ and ‘Petal.Width.’ The lm function is used for linear regression.
  • predictions <- predict(…): uses the trained model to make predictions on the testing set. The predict function is applied to the model using the testing set (test_data), resulting in predicted values for the dependent variable.
  • mse <- mean(…): calculates the mean squared error (MSE) to evaluate the model’s performance on the testing set. The MSE is a measure of the average squared differences between the observed and predicted values. It provides a quantitative assessment of how well the model’s predictions align with the actual data.
  • The final line of code prints the calculated mean squared error to the console, providing a summary measure of the model’s accuracy on the testing set.

The MSE calculated can be compared with the model MSE to see whether there is any overfitting issue. Overfitting occurs when the the former is much larger.

anova(model)
## Analysis of Variance Table
## 
## Response: Sepal.Length
##               Df Sum Sq Mean Sq F value    Pr(>F)    
## Sepal.Width    1  1.062   1.062  10.004 0.0019946 ** 
## Petal.Length   1 70.616  70.616 665.181 < 2.2e-16 ***
## Petal.Width    1  1.547   1.547  14.570 0.0002183 ***
## Residuals    116 12.315   0.106                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA table output, the model MSE is 0.106 (Mean Square for Residuals), indicating no overfitting.

Another way to check for overfitting is to compare cross-validation \(R^2\) with the model \(R^2\). The cross-validation \(R^2\) is defined as the correlation between the predicted values and the actual values, squared.

# Calculate the correlation between the predicted values and the actual values in the test dataset
cross_validationcor_R_square <- cor(predictions, test_data$Sepal.Length)

# Extract the R-squared value from the model summary
model_R_Square <- summary(model)$r.squared

# Round both values to 4 decimal places
cross_validationcor_R_square <- round(cross_validationcor_R_square, 4)
model_R_Square <- round(model_R_Square, 4)

# Print both values together
print(paste("Cross-Validation Correlation R-squared:", cross_validationcor_R_square, 
            "and Model R-squared:", model_R_Square))
## [1] "Cross-Validation Correlation R-squared: 0.9375 and Model R-squared: 0.856"

Note: The model R-squared is also the square of the correlation between the fitted values and the actual values.

The results show that there is no overfitting issue.

4.4.2 The k-fold Cross-Validation (Optional)

Sometimes, cross-validation is employed instead of a single holdout sample. In k-fold cross-validation, the dataset is divided into k subsets, and the model is trained and tested k times, each time using a different subset as the holdout sample.

Here’s a brief overview of k-fold cross-validation:

  • Splitting the Data:
    • The dataset is divided into k roughly equal-sized folds or subsets.
    • Typically, k is set to 5 or 10, but it can vary based on the size of the dataset and computational resources.
  • Iterative Training and Testing:
    • The model is trained k times, each time using k-1 folds for training and the remaining fold for testing.
    • This results in k sets of training and testing datasets.
  • Performance Evaluation:
    • The model’s performance is evaluated on each testing set, and the evaluation metrics (e.g., accuracy, mean squared error) are recorded.
  • Aggregating Results:
    • The overall performance of the model is computed by aggregating the performance metrics across all iterations.

The advantages of cross-validation include: - Provides a more robust estimate of a model’s performance by testing it on multiple subsets of the data. - Reduces the risk of overfitting or underfitting to a specific training-test split.

Common types of cross-validation include:

  • k-Fold Cross-Validation: As described above, where the dataset is divided into k folds.
  • Leave-One-Out Cross-Validation (LOOCV): Special case of k-fold where k is equal to the number of observations, meaning each observation serves as a separate test set.
  • Stratified k-Fold Cross-Validation: Ensures that the distribution of the target variable is similar in each fold, useful for imbalanced datasets.

Cross-validation is an essential tool in machine learning model evaluation, helping practitioners make more informed decisions about model selection and generalization performance.

Another example: using k-fold cross-validation.

# Load package caret
library(caret)

# Set the seed for reproducibility
set.seed(123)

# Define the training control for 5-fold cross-validation
cross_val_control <- trainControl(method = "cv", number = 5)

# Train a linear regression model on Sepal.Length using all available predictors
model <- train(Sepal.Length ~ ., data = iris, method = "lm", trControl = cross_val_control)

# Print the cross-validated results, including metrics like RMSE
print(model)
## Linear Regression 
## 
## 150 samples
##   4 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 120, 120, 120, 120, 120 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.3137574  0.8585577  0.2551525
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
# Extract the RMSE values from the cross-validated results
rmse <- model$results$RMSE

# Calculate and print the mean RMSE
cat("Cross-Validated RMSE:", mean(rmse), "\n")
## Cross-Validated RMSE: 0.3137574

In this example:

Explanation:

  • Seed for Reproducibility:
    • set.seed(123): Sets a seed for the random number generator, ensuring that random processes yield the same results for reproducibility.
  • Training Control for Cross-Validation:
    • cross_val_control <- trainControl(method = “cv”, number = 5): Defines the training control for 5-fold cross-validation using the ‘caret’ package.
  • Linear Regression Model Training: model <- train(Sepal.Length ~ ., data = iris, method = “lm”, trControl = cross_val_control): Trains a linear regression model on the Sepal.Length variable using all other available predictors in the Iris dataset. The model is trained using 5-fold cross-validation.
  • Print Cross-Validated Results:
    • print(model): Prints the cross-validated results, which include performance metrics such as RMSE for each fold.
  • Extract RMSE Values:
    • rmse <- model\(results\)RMSE: Extracts the RMSE values for each fold from the cross-validated results and stores them in the variable rmse.
  • Calculate and Print Mean RMSE: -cat(“Cross-Validated RMSE:”, mean(rmse), “”): Calculates the mean of the RMSE values and prints it to the console. The mean RMSE provides an overall measure of the model’s predictive accuracy across all folds.

This code overall performs cross-validated training of a linear regression model on the Iris dataset, evaluates its performance, and reports the mean RMSE as a summary metric.

The k-fold cross-validation is a valuable tool in model evaluation, particularly when working with limited data or when a robust estimate of performance is desired. It is a widely accepted practice in machine learning research and application. Common choices for k include 5-fold and 10-fold cross-validation, but the appropriate value can vary based on the dataset size and specific considerations.

4.5 Identifying Unusual Points in Regression

Identifying unusual points or outliers in a regression analysis is important for understanding the impact of influential observations on the model. Here are some common methods for identifying unusual points in regression:

  • Residual Plot: Create a residual plot by plotting the residuals (the differences between observed and predicted values) against the predictor variables. Unusual points may exhibit patterns or stand out from the overall distribution of residuals. The \(i\)-th residual is calculated as follows: \[r_i=\text{observed}-\text{predicted}=y_i - \hat{y}_i\]

  • Leverage: Leverage measures how much an observation’s predictor values differ from the mean of the predictor values. Observations with high leverage can have a strong impact on the regression model. Points with leverage greater than \(2(k+1)/n\) (where \(k\) is the number of predictors and \(n\) is the number of observations) may be considered influential.

  • Standardized Residuals: Standardized residuals are computed by taking the residual (the difference between the observed and predicted values) and dividing it by the estimated standard error of the residuals, adjusted for leverage. Observations with standardized residuals that fall outside a specified range (e.g., less than \(-2\) or greater than 2) may be flagged as unusual. The formula for the \(i\)-th standardized residual is given by:

\[\text{stdres}_i = \frac{r_i}{\hat{\sigma}_i\sqrt{1-h_i}}\]

  • Studentized Residuals (or deleted-t residuals): Calculate studentized residuals as residuals divided by the estimated standard error of the residuals (the estimation does not include the observation for which the residual is being evaluated), adjusted for leverage. Points with studentized residuals beyond a certain range (e.g., \(-2\) to 2) may be considered unusual. The formula for the \(i\)-th studentized residual is given by:

\[\text{studres}_i = \frac{r_i}{\hat{\sigma}_{(i)}\sqrt{1-h_i}}\]

  • Cook’s Distance: This metric combines both standardized residuals and leverage into a single measure, allowing for a clearer identification of influential observations. Standardized Residuals, while they can identify unusual observations, they do not account for the leverages of the observations. Leverage indicates how much an observation can influence the regression model. However, high leverage alone doesn’t imply that an observation is influential; it could be a good fit with a small residual. Observations with a Cook’s Distance larger than a certain threshold (e.g., 0.5, 1, or \(4/n\), \(4/(n-k-1\), where \(n\) is the number of observations and \(k\) is the number of predictors) may be considered influential. Cook’s distance is calculated as follows:

\[D_i = \frac{(\text{stdres}_i)^2}{k+1}\cdot \frac{h_i}{1-h_i}\]

  • Influence Plots: Visualize the influence of each observation on the regression coefficients. Observations with a substantial impact on the coefficients may be considered influential.

  • Mahalanobis Distance: This distance metric considers the correlation between variables. Points with a Mahalanobis Distance larger than a certain threshold may be considered outliers.

  • DFFITS: This measures the change in fitted values when an observation is omitted. Points with DFFITS larger than a certain threshold (e.g., \(2\sqrt{(k+1)/n}\), where \(k\) is the number of predictors and \(n\) is the number of observations) may be considered influential.

  • DFBETAS: measure the change in each coefficient due to the deletion of each observation one at a time. Typically, absolute values larger than \(2/\sqrt{2}\) are considered influential, where \(n\) is the number of observations.

  • Shapiro-Wilk test is performed to check the normality of residuals.

  • Grubbs’ test is used to identify outliers in the residuals.

Code examples for each of the above methods:

Load necessary libraries:

# Load necessary libraries
library(MASS)
library(car)
library(dplyr)
library(outliers)

Simulate a hypothetical dataset with outliers and an influential point:

# Generate a hypothetical dataset with outliers and influential points
set.seed(123)
n <- 20  # Number of observations
x <- rnorm(n)  # Predictor variable
y <- 2 * x + rnorm(n)  # Response variable

# Introduce outliers and influential points
y[c(5, 10, 15)] <- y[c(5, 10, 15)] + 8  # Introduce outliers
y[18] <- y[18] - 15  # Introduce an influential point

# Combine into a data frame
data <- data.frame(x, y)

data
##              x           y
## 1  -0.56047565  -2.1887750
## 2  -0.23017749  -0.6783299
## 3   1.55870831   2.0914122
## 4   0.07050839  -0.5878744
## 5   0.12928774   7.6335362
## 6   1.71506499   1.7434367
## 7   0.46091621   1.7596195
## 8  -1.26506123  -2.3767494
## 9  -0.68685285  -2.5118426
## 10 -0.44566197   8.3624910
## 11  1.22408180   2.8746278
## 12  0.35981383   0.4245562
## 13  0.40077145   1.6966686
## 14  0.11068272   1.0994989
## 15 -0.55584113   7.7098988
## 16  1.78691314   4.2624665
## 17  0.49785048   1.5496186
## 18 -1.96661716 -18.9951460
## 19  0.70135590   1.0967491
## 20 -0.47279141  -1.3260538
plot(y~x, main = "A Scatter Plot with an Influential Point")

We performs residual analysis, creating a residual plot and identifying studentized residuals:

# Fit regression model
model <- lm(y ~ x, data = data)

# Residual Analysis with Residual Plot
residuals <- model$residuals

# Standardized Residuals
std_residuals <- rstandard(model)

# Studentized Residuals
studentized_residuals <- studres(model)

# Identify outliers based on studentized residuals
outliers_studentized <- data[abs(studentized_residuals) > 2, ]

# Display identified outliers
outliers_studentized
##             x          y
## 10 -0.4456620   8.362491
## 15 -0.5558411   7.709899
## 18 -1.9666172 -18.995146
# Set up a 1x2 layout for plotting
par(mfrow = c(1, 2)) # the 2 plots will be placed on one row

# Plotting Residuals vs Predictor (x) with outliers
plot(x, residuals, main = "Residual Plot with Outliers", xlab = "Predictor (x)", ylab = "Residuals")
# Add labels for outliers
text(x = x[as.numeric(rownames(outliers_studentized))], 
     y = residuals[rownames(outliers_studentized)], 
     pos = 4, 
     labels = row.names(outliers_studentized), 
     col = "red")

# Plotting Studentized Residuals vs Predictor (x) with outliers
plot(x, studentized_residuals, main = "Studentized Residual Plot with Outliers", xlab = "Predictor (x)", ylab = "Studentized Residuals")
# Add labels for outliers
text(x = x[as.numeric(rownames(outliers_studentized))], 
     y = studentized_residuals[rownames(outliers_studentized)], 
     pos = 4, 
     labels = row.names(outliers_studentized), 
     col = "red")

Here’s how you can examine each of the two plots:

  • Residual Plot

    • Look for any clear patterns in the residuals.
    • Ideally, residuals should be randomly scattered around the horizontal axis (zero line).
    • If there’s a noticeable pattern (e.g., curvature, heteroscedasticity), it suggests that the linear regression assumptions might be violated.
    • Check for the presence of outliers, which appear as data points that are far away from the main cluster of points. There are 4 such points identified in the plot.
  • Studentized Residual Plot:

    • Similar to the residual plot, examine the scatter of studentized residuals.
    • Studentized residuals are standardized by dividing by their estimated standard deviation, providing a measure of how many standard deviations each residual is from the mean.
    • Outliers in this plot are those with absolute studentized residuals greater than 2 or 3. These are data points that are unusually far from the expected values based on the regression model. There are 3 such points identified in the plot.

Cook’s Distance is calculated to identify influential observations:

# Cook's Distance
cook_distance <- cooks.distance(model)
outliers_cooks <- data[cook_distance > 4/n, ]
outliers_cooks
##            x         y
## 18 -1.966617 -18.99515
# Diagnostic plots: Cook's Distance- Identify influential data points.
plot(model, which = 4, main = "Cook's Distance \n(dashed line plotted at cutoff 4/n)", caption = "mmm", sub.caption = "mmm") 


# Identify influential observations (Cook's distance > 4/n)
abline(h = 4/nrow(data), col = "red", lty = 2)

The Cook’s distance plot shows that there is only one influential point (obs. 18).

Calculates leverage values and identifies observations with high leverage:

# Leverage
leverage <- hatvalues(model)

outliers_leverage <- data[leverage > 2 * length(model$coefficients)/n, ]

outliers_leverage
##            x          y
## 16  1.786913   4.262467
## 18 -1.966617 -18.995146

Observations 16 & 18 have high leverages (leverage is greater than \(2(k+1)/n\)), where \(k\) is the number of predictors and \(n\) is the number of observations.

A more conplex influence plot is created to visually assess influential points:

# Calculate studentized residuals
studentized_residuals <- studres(lm(y ~ x, data = data))
m <- min(studentized_residuals)
M <- max(studentized_residuals)

# Calculate values for plotting boundaries
k <- ncol(data) - 1 + 1
h <- seq(min(hatvalues(model)), max(hatvalues(model)) * 1.1, length.out = 1000)
std_residuals0.5D <- sqrt(k * (1 - h) / h * 0.5)
std_residuals1.0D <- sqrt(k * (1 - h) / h)

r.m <- min(std_residuals0.5D, -std_residuals0.5D, std_residuals1.0D, -std_residuals1.0D)
r.M <- max(std_residuals0.5D, -std_residuals0.5D, std_residuals1.0D, -std_residuals1.0D)

# Influence Plots with overlaid lines
influencePlot(model,
              xlim = c(0, max(hatvalues(model)) * 1.1),
              id = TRUE,
              main = "Influence Plot with Outliers",
              xlab = "Leverage (hat-value, h)",
              ylim = range(r.m, r.M))
##       StudRes        Hat      CookD
## 10  2.2252550 0.06918750 0.15090311
## 15  2.1441674 0.07706227 0.15996566
## 16 -0.3404791 0.20059276 0.01529573
## 18 -4.7536343 0.29726351 2.17260329
# Add lines to the plot indicating different boundaries for standardized residuals
lines(std_residuals0.5D ~ h, lty = "dotted", col = "red")
lines(std_residuals1.0D ~ h, lty = "dashed", col = "darkred")
lines(-std_residuals0.5D ~ h, lty = "dotted", col = "red")
lines(-std_residuals1.0D ~ h, lty = "dashed", col = "darkred")

Explanation of plot:

  • horizontal lines: show the “moderately unusual (standardized residual \(>2\) or \(<-2\))” and “very unusual (standardized residual \(>3\) or \(<-3\))” boundaries for the leverage. Observations 10, 15 & 18 are very unusual c.
  • vertical lines: show the “moderately unusual (\(h>2(k+1)/n\))” and “very unusual (\(h>3(k+1)/n\))” boundaries for the leverage, where \(n = 20\) and \(k = 1\). Observations 16 & 18 are very unusual according to the leverage value.
  • curve lines: show the boundaries where Cook’s D becomes unusual, beyond 0.5 (dotted curve, moderately influential) or 1.0 (dashed curve, very influential). Here observation 18 is very influential according to Cook’s D.

Mahalanobis Distance is used to detect outliers based on multivariate distances:

# Mahalanobis Distance
mahalanobis_dist <- mahalanobis(data, colMeans(data), cov(data))
outliers_mahalanobis <- data[mahalanobis_dist > qchisq(0.95, df = 2), ]
outliers_mahalanobis
##            x         y
## 18 -1.966617 -18.99515

DFFITS values are calculated to identify influential points in the regression analysis:

# DFFITS
dffits <- dffits(model)
outliers_dffits <- data[abs(dffits) > 2 * sqrt(length(model$coefficients)/n), ]
outliers_dffits
##            x         y
## 18 -1.966617 -18.99515

The DFBETAS measure the change in each coefficient due to the deletion of each observation one at a time. Here’s how you can calculate DFBETAS for a linear regression model in R:

# Calculate DFBETAS
dfbetas <- dfbetas(model)
dfbetas
##     (Intercept)            x
## 1  -0.036629255  0.024695827
## 2  -0.010192428  0.003817945
## 3  -0.123291092 -0.239905104
## 4  -0.049179528  0.003688721
## 5   0.345431950 -0.004535898
## 6  -0.158946012 -0.354659561
## 7   0.003830504  0.001373508
## 8   0.083794242 -0.102903216
## 9  -0.033949234  0.026533170
## 10  0.557286954 -0.319490486
## 11 -0.048184220 -0.067065993
## 12 -0.041929854 -0.010104619
## 13  0.009375059  0.002701440
## 14  0.023891763 -0.000784583
## 15  0.547834962 -0.367159394
## 16 -0.062383132 -0.147777550
## 17 -0.010475154 -0.004216413
## 18 -1.670678571  2.819743025
## 19 -0.056974695 -0.037301800
## 20 -0.005246180  0.003134287

The Shapiro-Wilk test is conducted to check the normality of residuals:

# Shapiro-Wilk Test for Residuals
shapiro_test <- shapiro.test(residuals)
if (shapiro_test$p.value < 0.05) {
  cat("Residuals are not normally distributed (Shapiro-Wilk p-value:", shapiro_test$p.value, ")\n")
} else {
  cat("Residuals are normally distributed (Shapiro-Wilk p-value:", shapiro_test$p.value, ")\n")
}
## Residuals are not normally distributed (Shapiro-Wilk p-value: 0.001531177 )

Grubbs’ test is performed to identify outliers, and a message is printed based on the test results:

# Grubbs' Test
grubbs_test <- grubbs.test(residuals)
if (grubbs_test$p.value < 0.05) {
  obs = names(which(grubbs_test$p.value < 0.05))
  cat("Outlier detected using Grubbs' test (p-value:", grubbs_test$p.value, ")\n", 
      paste0(ifelse(length(obs) > 1, 
                    paste0("The outliers are the ", paste0(obs[1:(length(obs)-1)], "th, ", collapse = ""), "and ", tail(obs,1), "th observations"),
                    paste0("The outlier is the ", obs, "th observation."))))
} else {
  cat("No Outlier detected using Grubbs' test.\n")
}
## Outlier detected using Grubbs' test (p-value: 0.01931184 )
##  The outlier is the 18th observation.

Chapter 5 One-way ANOVA and Randomized Experiments

One-way analysis of variance (ANOVA) is a statistical technique used to compare the means of three or more groups to determine whether there are statistically significant differences among them. It is commonly used in experimental and observational studies to analyze the impact of one categorical variable (often referred to as a factor) on a continuous outcome variable.

5.1 Overview of One-Way ANOVA:

In a one-way ANOVA, the categorical variable divides the dataset into groups, and the continuous outcome variable is measured for each group. The goal is to assess whether there are statistically significant differences in the means of the outcome variable across the groups.

The ANOVA method has two major applications:

  • Experiments: In experiments, researchers actively manipulate the categorical variable (factor) to create different groups. For example, in a controlled clinical trial, participants may be randomly assigned to different treatment groups. The groups are typically well-defined and controlled, which allows for causal inference and interpretation of any observed differences as potentially caused by the manipulation of the factor. This is called a completely randomized design.

  • Observational Studies: In observational studies, researchers observe pre-existing groups based on some characteristic, and the assignment of individuals to groups is not controlled by the researcher. For example, comparing the average income across different education levels. Since the groups may not be as well-defined or controlled, confounding variables may influence the observed differences. Therefore, causal inference is more challenging, and observed differences may be due to factors other than the categorical variable of interest.

How can we interpret Results? Regardless of whether it’s an experiment or observational study, the interpretation of the results from a one-way ANOVA typically involves:

  • Assessing overall significance: The ANOVA test provides an F-statistic and associated p-value. A small p-value (< 0.05) indicates that there are statistically significant differences in the means of the groups.

  • Post-hoc Tests (if needed): If the overall ANOVA test is significant, post-hoc tests (e.g., Fisher’s Least Significant Difference (LSD), Tukey’s HSD (Honestly Significant Difference), Bonferroni correction) may be conducted to determine which specific group means are significantly different from each other.

5.2 Concepts in Experiments

In experiments, the following terms are fundamental:

  • Factor: In the context of one-way ANOVA, the factor refers to the categorical independent variable with two or more categories.

  • Levels: Levels are the different categories or groups within the factor. Each level represents a unique experimental condition applied to the experimental units.

  • Treatment: In one-way ANOVA, each level of the factor represents a treatment condition. The treatments correspond to the different groups or categories being compared to assess their effect on the response variable.

  • Null Hypothesis (H0): The null hypothesis in one-way ANOVA states that there are no differences in population means among the different treatment groups. It assumes that any observed differences in sample means are due to random sampling variability.

  • Alternative Hypothesis (Ha): The alternative hypothesis in one-way ANOVA states that at least one treatment group mean is significantly different from the others. It suggests that there are genuine differences in population means among the treatment groups.

  • Between-Group Variability (\(SSBetween\) or \(SSGroup\) or \(SSTreatment\)): The between-group variability, also known as the sum of squares between groups, measures the variation in the sample means of the treatment groups. It quantifies the differences among the group means.

  • Within-Group Variability (\(SSWithin\) or \(SSError\) or \(SSResidual\)): The within-group variability, also known as the sum of squares within groups, measures the variation of individual data points around their respective group means. It represents the random variability within each treatment group.

  • Total Variability (\(SSTotal\)): The total variability, also known as the total sum of squares, measures the overall variability in the response variable across all observations. It is the sum of the between-group and within-group variabilities.

  • F-statistic: The F-statistic is a test statistic used in one-way ANOVA to assess whether there are significant differences among the treatment group means. It is calculated as the ratio of the between-group variability (after accounting for its degrees of freedom) to the within-group variability (after accounting for its degrees of freedom).

  • P-Value: In hypothesis testing, the p-value is a measure that helps us determine the strength of evidence against the null hypothesis (\(H_0\)). Specifically, it quantifies the probability of observing the test statistic (or a more extreme value) under the assumption that the null hypothesis is true. If the p-value is smaller than the chosen significance level (\(\alpha\), typically 0.05), it suggests that the observed data is unlikely to have occurred if the null hypothesis is true. In other words, there is strong evidence against the null hypothesis, and we reject it in favor of the alternative hypothesis (\(H_a\)). If the p-value is greater than or equal to the significance level, it suggests that the observed data is reasonably likely to occur even if the null hypothesis is true. In this case, we do not have enough evidence to reject the null hypothesis, and we fail to reject it.

Understanding these concepts is crucial for interpreting the results of a one-way ANOVA correctly and drawing valid conclusions about the differences among treatment groups.

5.3 Fitting the ANOVA Model

In one-way ANOVA (analysis of variance), the statistical model is designed to assess whether there are significant differences in the means of three or more independent groups or levels of a categorical variable. The model can be expressed as follows:

\[Y_{ij}=\mu_i+\epsilon_{ij}\] where \(y_{ij}\) is the observed value of the response variable for the \(j\)th observation in the \(i\)th group, \(\mu_i\) is the mean of the \(i\)th group, and \(\epsilon_{ij}\) is the random error.

The model can also be written as

\[Y_{ij}=\mu+\alpha_i+\epsilon_{ij}\] Where \(\mu\) is the overall population mean (called the grand mean) of the response variable, and \(\alpha_i\) is the effect of the \(i\)th treatment or group level (difference between the group mean and the grand mean).

For given data, we can estimate the parameters \(\mu\) and \(\alpha\)’s by simple hand calculation:

  • The grand mean \(\mu\) is estimated by the mean of all data.
  • Each group mean \(\mu_i\) is estimated by the mean of all data in that group.
  • Each \(\alpha_i\) is estimated by the estimated group mean minus the estimated grand mean.

Example:

  • Group 1 data: 3, 6, 4, 4, 7
  • Group 2 data: 5, 7, 8, 6, 5
  • Group 3 data: 8, 8, 9, 10, 9
  • Group 4 data: 13, 15, 12, 13, 17, 20

The grand mean \(\mu\) is estimated to be 9. The group means are estimated as 4.8, 6.2, 8.8, and 15.0, respectively. Thus, the effects (\(\alpha\)’s) are estimated by subtracting the grand mean from the group means, resulting in estimates of \(-4.2\), \(-2.8\), \(-0.2\), and \(6.0\). Each random error is estimated by subtracting the corresponding group mean from the observation. For example, the random error corresponding to the observation “3” in group 1 can be estimated as \(3 - 4.8\), yielding \(-1.8\).

Each observation can be decomposed into three components: the grand mean, the corresponding group effect, and the corresponding residual. For instance, the observation “3” in group 1 can be expressed as: \[3 = 9 + (-4.2) + (-1.8)\] The same principle applies to each of the remaining observations.

Upon squaring each number in these decomposed identities and summing up the corresponding terms, we find that the sum of squared observations equals the sum of squared grand means, plus the sum of squared effects, plus the sum of squared residuals. This holds as a general rule. In mathematical terms:

\[\sum \text{observation}^2 = \sum \text{grand mean}^2 + \sum \text{effect}^2 + \sum \text{residual}^2 \]

where $^2 $ has \(n\) degrees of freedom, $^2 $ has only 1 degree of freedom, $^2 $ has only \(I-1\) degrees of freedom, and $^2 $ has only \(n-I\) degree of freedom. Here, \(n\) represents the total number of observations, and \(I\) denotes the total number of groups. It’s important to note the following decomposition:

\[n=1+(I-1)+(n-I)\] Here, each number of degrees of freedom tells how many chances the corresponding quantity can change. Specifically,

  • since each of the observation can change freely, \(\sum \text{observation}^2\) has \(n\) degrees of freedom.
  • since there is only one grand mean, \(\sum \text{grand mean}^2\) has \(1\) degree of freedom.
  • There are \(n\) effects, but only \(I\) of them are different and the sum of all \(n\) effects equals 0. Therefore, \(\sum \text{effect}^2\) has \(I-1\) degrees of freedom.
  • There are \(n\) residuals, but the sum of all residuals for each group equals 0. Therefore, \(\sum \text{residual}^2\) has \(n-I\) degrees of freedom.

The above two identities can also be written as:

\[\sum \text{observation}^2 - \sum \text{grand mean}^2 = \sum \text{effect}^2 + \sum \text{residual}^2 \]

\[n-1=(I-1)+(n-I)\] Since \(\sum \text{observation}^2 - \sum \text{grand mean}^2 = \sum (\text{observation} - \text{grand mean})^2\), it is called the total variation of data, or the total sum of squares (denoted SSTotal). In addition, \(\sum \text{effect}^2\) is called the sum of squares for groups or treatments (denoted SSGroup or SSTreatment), and \(\sum \text{residual}^2\) is called the sum of squares for errors or residuals (denoted SSError or SSResidual). Therefore, we have:

\[\text{SSTotal}=\text{SSGroup}+\text{SSError}\] For the data, \(n=21\), \(I=4\), \(SSTotal = 410\), \(SSGroup = 343.6\), and \(SSError = 66.4\).

In the following, we provide R code for calculating group means:

# Create the data frame D
D <- data.frame(
  group = rep(c("A", "B", "C", "D"), c(5, 5, 5, 6)), 
  y = c(3, 6, 4, 4, 7, 5, 7, 8, 6, 5, 8, 8, 9, 10, 9, 13, 15, 12, 13, 17, 20) 
)

# Aggregate the data by group and calculate the mean of y for each group
aggregate(y ~ group, data = D, FUN = mean)
##   group    y
## 1     A  4.8
## 2     B  6.2
## 3     C  8.8
## 4     D 15.0

Explanation of the above code:

  • Creating a Data Frame D:

    • The data frame D is created with two columns: group and y.
    • The group column is created by repeating the labels “A”, “B”, “C”, and “D” according to the specified frequencies (5, 5, 5, 6).
    • The y column contains numeric values representing the variable y.
  • Aggregating Data by Group:

    • The aggregate() function is used to aggregate the data by the group variable.
    • It calculates the mean of the y variable for each group.
    • The result is a data frame with two columns: group and y.
    • For each unique group label (“A”, “B”, “C”, “D”), the mean value of y is calculated based on the corresponding observations in the original data frame D.

5.4 Statistical Inference Based on the ANOVA Model

To make formal inference about the parameters, we must make some assumptions on the model:

  • Independence: Observations within each group are independent of each other.
  • Normality: The residuals (error terms \(\epsilon_{ij}\)’s) are normally distributed with mean zero within each group.
  • Homogeneity of variances: The variance of the residuals is the same for all groups, denoted by \(\sigma^2\).

The main goal of one-way ANOVA is to test the null hypothesis that all group means are equal against the alternative hypothesis that at least one group mean is different. This is typically done using an F-test, where the F-statistic is calculated as the ratio of the between-group variability (representing signal) to the within-group variability (representing noise).

If the F-test yields a significant result (i.e., the p-value is less than a chosen significance level, often 0.05), we reject the null hypothesis and conclude that there are significant differences among the group means.

The following so-called ANOVA table shows the detail of calculation:

where \(I\) is the number of groups and \(n\) is the total sample size. The \(F\) statistic, under the null hypothesis of equal group means, has an \(F\) distribution with \(I-1\) and \(n-I\) numbers of degrees of freedom.

The R software can generate ANOVA tables, different treatments are the groups, residuals are the errors, “Sum Sq” is the same as SS, “Mean Sq” is the same as MS, and \(MS_{Error}\) is often written as \(MSE\). The \(MSE\) is an estimate of the error variance \(\sigma^2\).

Why is F used as the test statistic? The reason is as follows:

  • Under the null hypothesis that all group means are equal, it can be shown that the mean square for groups is expected to equal the error variance \(\sigma^2\).
  • Under the alternative hypothesis that not all group means are equal, it can be shown that the mean square for groups is expected to exceed the error variance \(\sigma^2\).
  • It can also be shown that the mean square for errors is always expected to equal the error variance \(\sigma^2\).
  • Thus, the ratio (\(F\)) of the mean square for groups to the mean square for errors is an indicator whether the null hypothesis is rejected or not. The larger the ratio, the more evidence to reject the null hypothesis. That is, the p-value is the area of the right tail under the \(F\)-distribution curve, separated by the \(F\) ratio. We can use the online calculator https://www.lock5stat.com/StatKey/theoretical_distribution/theoretical_distribution.html#F to calculate p-values.

In R, one-way ANOVA can be performed using the aov() function.

We demonstrate how to perform a one-way ANOVA analysis using R for an experiment.

Example 1: We use the data given before to perform a one-way ANOVA analysis.

# Create the data frame D
D <- data.frame(
  group = rep(c("A", "B", "C", "D"), c(5, 5, 5, 6)), 
  y = c(3, 6, 4, 4, 7, 5, 7, 8, 6, 5, 8, 8, 9, 10, 9, 13, 15, 12, 13, 17, 20) 
)

# Perform one-way ANOVA
anova_result <- aov(y ~ group, data = D)
summary(anova_result)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## group        3  343.6  114.53   29.32 6.04e-07 ***
## Residuals   17   66.4    3.91                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA table produces the same sums of squares as we did by hand. The p-value (6.04e-07 or 0.000000604) shows that differences in the mean of \(y\) for different groups are statistically significant at any commonly used significance level.

Example 2: Blood Pressure Medication Study

We visualize the data:

# Visualize the data
boxplot(blood_pressure ~ medication, data = med_data, 
        main = "Blood Pressure by Medication Group",
        xlab = "Medication", ylab = "Blood Pressure")

It seems there are differences in blood pressure between treatments. Are the differences significant?

We perform one-way ANOVA:

# Perform one-way ANOVA
anova_result <- aov(blood_pressure ~ medication, data = med_data)
summary(anova_result)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## medication   2  871.3   435.6   4.078 0.0284 *
## Residuals   27 2884.6   106.8                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The result from the ANOVA table shows that the differences are statistically significant at level 0.05.

How can we check whether the assumptions of the ANOVA are met?

  • We can use the graphical method.
  • We can use formal test procedures.
# graphical check of ANOVA assumptions
plot(anova_result, 1) # Identify outliers and graphically check equal variances

plot(anova_result, 2) # Graphically check normality

Here’s how you can interpret the graphs:

  • Graphically Check Equal Variances (Plot Type 1): If the groups have equal variances, you should see relatively equal range of residuals across the groups in the plot. If not, it indicates that the assumption of equal variances may be violated.

  • Graphically Check Normality (Plot Type 2): If the data points fall approximately along a straight line, it suggests that the data follows a normal distribution. Deviations from the expected pattern, such as significant skewness or outliers, may indicate departures from normality.

A formal test, called the Levene’ test, available from the “car” package for testing equal variances is shown below:

library(car)
leveneTest(blood_pressure ~ medication, data = med_data)
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  2   0.295 0.7469
##       27

From the output above we can see that the p-value (0.7469) is not less than any commonly used significance level (such as 0.05). This means that there is no evidence to suggest that the variance across groups is statistically significantly different. Therefore, we can assume the homogeneity of variances in the different treatment groups.

We can also use the following R code to calculate group standard deviations:

aggregate(blood_pressure ~ medication, data = med_data, sd)
##   medication blood_pressure
## 1          A       9.658617
## 2          B      12.543701
## 3          C       8.359293

If the largest standard deviation is no greater than twice the smallest standard deviation, there is no violation of the equal variances assumption. As the result shows, the largest standard deviation is 12.54 and the smallest is 8.36, so there is no violation of the equal variances assumption.

The Shapiro-Wilk test can be used to check normality:

shapiro.test(anova_result$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  anova_result$residuals
## W = 0.96657, p-value = 0.4501

The Shapiro-Wilk test on the ANOVA residuals (W = 0.9467, p = 0.4501) finds no indication that normality is violated.

5.5 How Big is the Effect: Confidence Intervals and Effect Sizes

In one-way ANOVA, assessing the magnitude of the effect involves examining

  • confidence intervals for group means: \[\bar{y_i}\pm t^*\cdot SD\cdot \sqrt{1/n_i}\]
  • confidence intervals for differences between group means, \[(\bar{y}_i-\bar{y}_j)\pm t^*\cdot SD\cdot \sqrt{1/n_i+1/n_j}\]

In the above formulas, \(\bar{y_i}\) and \(\bar{y_j}\) are the group means, \(n_i\) and \(n_j\) are the group sizes, \(SD=\sqrt{MSE}=\sqrt{SSError/(n-I)}\) is the estimate of the error standard deviation \(\sigma\), and \(t^*\) is the critical value based on the \(t\)-distribution with \(n-I\) degrees of freedom (\(n\) is the total sample size and \(I\) is the number of groups). You can use the app https://www.lock5stat.com/StatKey/theoretical_distribution/theoretical_distribution.html#t to calculate this critical value (set the number of degrees of freedom, check the two-tail box, and change the middle area to the confidence level).

Confidence intervals (CIs) for group means help determine the precision of the estimated mean for each group. A wider CI indicates greater uncertainty in the estimated mean, while a narrower CI suggests higher precision.

Comparing CIs between groups can reveal whether the differences in means are statistically significant. If the confidence interval for the difference between two group means does not include zero, it suggests a significant difference between those groups.

Example: Use the following data

  • Group 1 data: 3, 6, 4, 4, 7
  • Group 2 data: 5, 7, 8, 6, 5
  • Group 3 data: 8, 8, 9, 10, 9
  • Group 4 data: 13, 15, 12, 13, 17, 20

to calculate 95% confidence intervals for

  • \(\mu_1\)
  • \(\mu_2-\mu_1\)

Solution:

For, \(\mu_1\), we use the formula

\[\bar{y}_1\pm t^*\cdot SD\cdot \sqrt{1/n_1}\] where

Therefore, the \(95\%\) confidence interval is

\[4.8\pm2.1098\cdot 1.9763\cdot \sqrt{1/5}\] or, \((2.9353, 6.6647)\).

For, \(\mu_2-\mu_1\), we use the formula

\[(\bar{y}_2-\bar{y}_1)\pm t^*\cdot SD\cdot \sqrt{1/n_2+1/n_1}\]

to get the 95% confidence interval to be \((3.5629, 8.8371)\).

How much do the differences matter?

To answer this question, we need to calculate effect sizes. The p-values tell whether results are significant at certain levels, but significance does not imply importance.

An effect size provides valuable insights into the practical significance of the observed differences among groups.

In the one-way ANOVA setting, we can measure the effect size for a single group by

\[D_i = \frac{\bar{y_i}-\bar{y}}{SD}\] and for a difference between any pair of two groups by

\[D_{ij} = \frac{\bar{y_i}-\bar{y_j}}{SD}\]

where \(SD=\sqrt{MSE}\). These effect sizes are called Cohen’s \(D\).

As the formulas show, effect sizes quantify the magnitude of differences between groups in standard deviation units.

These effect size measures help to interpret the practical significance of group differences, irrespective of sample size.

Effect sizes can be categorized as small, medium, or large based on standard thresholds.

  • Small effect size: the magnitude of Cohen’s D around 0.2
  • Medium effect size: the magnitude of Cohen’s D around 0.5
  • Large effect size: the magnitude of Cohen’s D around 0.8 or higher

Here’s a more detailed interpretation of Cohen’s D effect sizes:

  • Small effect size: The difference between the means of the two groups is small and may have minimal practical significance. It may be statistically significant with a large sample size.
  • Medium effect size: The difference between the means of the two groups is moderate and may have practical significance. It’s typically considered meaningful in many research contexts.
  • Large effect size: The difference between the means of the two groups is large and likely to have practical significance. It’s often considered substantial and noteworthy in research findings.

For our previous example, the effect size that quantifies the magnitude of difference between - group 1 and group 2 is \(D_{12}=(4.8-6.2)/1.9763=-0.7084\), which is medium. - group 1 and group 3 is \(D_{12}=(4.8-8.8)/1.9763=-2.0240\), which is large.

5.6 Transformation on the Response Variable

The one-way ANOVA assumes that the response variable is normally distributed within each group and exhibits equal variance across groups. However, in real-world data, these assumptions may not always hold true. When these conditions are violated, one approach to address the issue is to apply transformations to the response variable to stabilize variation.

Several commonly used transformations for the response variable \(y\) include:

  • the natural logarithmic transformation: \(log(y)\)
  • the square-root transformation: \(\sqrt{y}\)
  • the reciprocal transformation: \(1/y\)

These transformations can help to achieve approximate normality and homogeneity of variances, thereby improving the validity of the ANOVA analysis.

For each transformation, the standard deviation of each group can be calculated. If the maximum standard deviation is no greater than twice the smallest standard deviation, the homogeneity in variance assumption is satisfied.

Example: Use the following data to find a transformation.

  • Group A: 0.9, 1, 1.1
  • Group B: 9, 10, 11
  • Group C: 90, 100, 110

Solution.

Plot without log-transformation:

# Create a data frame with two variables: 'group' and 'y'
D <- data.frame(
  group = rep(c("A", "B", "C"), each = 3),  # 'group' variable contains repetitions of "A", "B", and "C", each 3 times
  y = c(0.9, 1, 1.1, 9, 10, 11, 90, 100, 110)  
)

# Convert the 'group' variable to a factor
D$group = factor(D$group)

# Plot the data using ggplot2
ggplot(D, aes(x=group, y=y)) +
  geom_point() 

The data in different groups have different variations.

Plot after log-transformation:

ggplot(D, aes(x=group, y=log(y))) +
  geom_point() 

After log-transformation, the data in different groups have the same variation. Now, the 1-way ANOVA can be done with the transformed data.

aov_result = aov(log(y)~group, data = D)
summary(aov_result)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## group        2  31.81   15.91    1579 6.82e-09 ***
## Residuals    6   0.06    0.01                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

5.7 Multiple Comparisons and Fisher’s Least Significant Difference

When an analysis of variance (ANOVA) indicates a significant difference among groups that are compared, multiple comparisons are often conducted in for pairwise comparisons.

Fisher’s Least Significant Difference (LSD), Tukey’s Honestly Significant Difference (HSD), and Bonferroni correction are 3 methods commonly used for conducting multiple pairwise comparisons in the context of ANOVA. However, they differ in their underlying assumptions and the criteria they use to determine significant differences between group means.

We focus on Fisher’s LSD (Least Significant Difference) and introduce the other two methods in Chapter 8.

  • Fisher’s LSD is one of the oldest methods for pairwise comparisons following ANOVA.
  • It is based on the t-distribution and is used when the assumption of equal variances between groups is met.
  • Fisher’s LSD compares the difference between means of all possible pairs of groups and determines whether these differences are statistically significant.
  • It tends to be more liberal (i.e., it may find more significant differences) compared to other methods.

Steps of the Fisher’s LSD process:

  • Conduct 1-way ANOVA. If the result is not significant, stop. Otherwise, proceed to the next step.

  • Compute Fisher’s LSD.

\[LSD = t^*\cdot SE\] where \(SE=\sqrt{MSE}\cdot \sqrt{1/n_1+1/n_2}\).

  • Compare groups. Which pairs of differences are greater \(\ge LSD\)?

Fisher’s LSD test is relatively straightforward and easy to implement, but it can be less powerful than other post-hoc tests, especially when there are unequal sample sizes or unequal variances between groups. Therefore, it’s essential to consider the specific characteristics of your data and research question when choosing a multiple comparison method.

There is an R package “agricolae” that does multiple comparison using Fisher’s LSD. Here is an example.

Example 1.: Suppose a professor wants to know whether or not three different studying techniques lead to different exam scores among students. This example is taken from https://www.statology.org/fishers-lsd-in-r/.

To test this, she randomly assigns 10 students to use each studying technique and records their exam scores.

The following table shows the exam scores for each student based on the studying technique they used:

##    technique score
## 1      tech1    72
## 2      tech1    73
## 3      tech1    73
## 4      tech1    77
## 5      tech1    82
## 6      tech1    82
## 7      tech1    83
## 8      tech1    84
## 9      tech1    85
## 10     tech1    89
## 11     tech2    81
## 12     tech2    82
## 13     tech2    83
## 14     tech2    83
## 15     tech2    83
## 16     tech2    84
## 17     tech2    87
## 18     tech2    90
## 19     tech2    92
## 20     tech2    93
## 21     tech3    77
## 22     tech3    78
## 23     tech3    79
## 24     tech3    88
## 25     tech3    89
## 26     tech3    90
## 27     tech3    91
## 28     tech3    95
## 29     tech3    95
## 30     tech3    98

Let’s do ANOVA:

model<-aov(score~technique, data=df)
summary(model)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## technique    2  341.6  170.80   4.623 0.0188 *
## Residuals   27  997.6   36.95                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The result shows that the mean scores differs significantly at level 0.05 for different techniques.

The following is R code for multiple comparison using the Fisher’s LSD process.

library(agricolae) # You must install the package first

model<-aov(score~technique, data=df)
out <- LSD.test(model,"technique")
out
## $statistics
##    MSerror Df Mean       CV  t.value     LSD
##   36.94815 27 84.6 7.184987 2.051831 5.57767
## 
## $parameters
##         test p.ajusted    name.t ntr alpha
##   Fisher-LSD      none technique   3  0.05
## 
## $means
##       score      std  r      se      LCL      UCL Min Max   Q25  Q50   Q75
## tech1  80.0 5.868939 10 1.92219 76.05599 83.94401  72  89 74.00 82.0 83.75
## tech2  85.8 4.391912 10 1.92219 81.85599 89.74401  81  93 83.00 83.5 89.25
## tech3  88.0 7.557189 10 1.92219 84.05599 91.94401  77  98 81.25 89.5 94.00
## 
## $comparison
## NULL
## 
## $groups
##       score groups
## tech3  88.0      a
## tech2  85.8      a
## tech1  80.0      b
## 
## attr(,"class")
## [1] "group"
plot(out)

The result shows that

  • Technique 1 and Technique 3 have significantly different mean exam scores (since tech1 has a value of “b” and tech3 has a value of “a”)
  • Technique 1 and Technique 2 have significantly different mean exam scores (since tech1 has a value of “b” and tech2 has a value of “a”)
  • Technique 2 and Technique 3 do not have significantly different mean exam scores (since they both have a value of “a”)

Chapter 6. Blocking and Two-Way ANOVA

Blocking and Two-Way ANOVA are statistical techniques used to analyze experimental data, particularly when there are multiple factors at play. Let’s introduce each of them:

Blocking:

6.1 The Randomized Complete Block (RCB) Design

The Randomized Complete Block (RCB) Design is a common experimental design used in agricultural, industrial, and scientific research to reduce variability and increase the precision of treatment comparisons. It is particularly useful when there are known sources of variability that may confound the treatment effects.

The Randomized Complete Block (RCB) Design can be implemented in various ways to suit different experimental setups and research goals. Three common strategies for implementing RCB designs are subdividing blocks, sorting/matching blocks, and reusing blocks. Let’s explore each:

  • Subdivide a large plot of land (the block) into subplots (the units) so that each gets one of a few different treatments at random one at a time. There are a few blocks.

  • Sort or match individuals with similar baseline characteristics into blocks so that each block has the number of individuals equal to the number of treatments. Randomly assign one treatment to one individual.

  • Reuse each subject (the block) the number of times equal to the number of treatments in random order.

Each strategy results in a data table of \(n\) rows and \(k\) columns, with \(n\) equal to the number of blocks and \(k\) equal to the number of treatments. The total number of observations equals \(n\cdot k\).

Here is a possible data table when \(n=5\) and \(k=3\).

Block Treatment1 Treatment2 Treatment3
Block1 y11 y12 y13
Block2 y21 y22 y23
Block3 y31 y32 y33
Block4 y41 y42 y43
Block5 y51 y52 y53

Let’s introduce some terminology based on the above table.

  • Factors: Block and Treatment are called two factors.
  • Levels: Block has 5 levels and Treatment has 3 levels.
  • Crossing: Considering all possible combinations of the levels of the two factors, we describe them as crossed.
  • Cells: Each unique combination of levels from the two factors forms a cell, resulting in a total of 15 cells.
  • Factor of interest: In this study, the factor of interest is Treatment, directly associated with the study’s objective.
  • Nuisance factor: Block is included solely as it contributes to variation in the data, making it a nuisance factor.
  • Main effect: The impact of a factor on the response variable. In RCBD, there are two main effects.
  • Additive main effects: If the effects of the two factors do not interact, they independently contribute to the response variable, and their effects are additive.

Read textbook page 260 for examples.

6.2 Exploring Data from Block Designs

What graphical insights can we glean from the data prior to model fitting? Constructing a plot for Randomized Complete Block Design (RCBD) data involves creating side-by-side dot plots for each treatment group and connecting observations from the same block.

What does such a plot reveal? It can demonstrate:

  • Variability due to block within a treatment group.
  • Variability due to treatment within a block.
  • The parallelism of lines corresponding to blocks.

As our aim is to discern differences in treatment effects, it’s imperative to eliminate block effects to derive adjusted responses before comparing treatments.

Here are the steps detailing how adjusted values are calculated:

  • Determine the grand mean.
  • Compute the mean of each block.
  • Determine the effect of each block: block mean minus grand mean.
  • Calculate adjusted values: subtract the corresponding block effect from each observation.

Plotting the adjusted values against treatments offers a clearer depiction of each treatment’s effect, as the influence of blocks has been mitigated.

Example 1. Let’s consider an example where we want to compare the effects of caffeine and theobromine with a placebo (a sham treatment). Four subjects participate in the study. On the first day, each subject randomly selects a treatment to ingest. Two hours later, the subjects are timed while they type a sequence of 1000 letters on a computer, and their finger typing rate is recorded. This process is repeated on the third and fifth days for the other two treatments in a random order. The data is illustrated below:

Subject Placebo Caffeine Theobromine
Alice 11 26 20
Bob 56 83 71
Cathy 15 34 41
David 6 13 32

The above data is in a wide format. To plot the original data and the block effect-adjusted data, R software requires us to convert the data from the wide format to tall or long format in 3 columns (subject, treatment, and response).

D=data.frame(Subject = factor(rep(1:4, each=3)),
             Treatment = rep(c("Placebo", "Caffeine", "Theobromine"), 4),
             Y = c(11, 26, 20, 56, 83, 71, 15, 34, 41, 6, 13, 32)
            )

D$Treatment = factor(D$Treatment, levels = c("Placebo", "Caffeine", "Theobromine"))

# Calculate block means
block_means <- aggregate(Y ~ Subject, data = D, FUN = mean)

# Calculate grand mean
grand_mean <- mean(D$Y)

# Calculate block effects
block_effects <- block_means$Y - grand_mean

# Adjusted data
D$Adjusted_Y <- D$Y - block_effects[match(D$Subject, block_means$Subject)]

# Original data plot
with(D, interaction.plot(Treatment, Subject, Y, xlab = "Treatment", ylab = "Rate", col = 1:4))

# Block effect-adjusted data plot
ggplot(D, aes(x = Treatment, y = Adjusted_Y)) +
  geom_point() +
  geom_rect(aes(xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf), fill = "transparent", color = "black") +
  labs(title = "Block Effect-Adjusted Data",
       x = "Treatment",
       y = "Rate") +
  theme(panel.background = element_rect(fill = "white"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())

The original data show that Variability observed within treatment groups may stem from both treatment effects and block effects, making it challenging to discern the true treatment effect.

The block effect-adjusted data plot illustrates the adjusted responses for each treatment group after accounting for block effects.

Reduced variability within treatment groups compared to the original data plot suggests a clearer delineation of treatment effects from block effects.

The impacts of caffeine and theobromine exhibit similarities, with both demonstrating higher effects compared to the placebo.

There is another plot called the Anscombe’s block plot introduced in the textbook. We skip it here.

6.3 The Two-Way ANOVA model for RCBD design

The two-way ANOVA model is

\[Y=\mu+\tau_i + \beta_j+ \epsilon\] where \(\mu\) is the grand or population mean, \(\tau_i\) is the \(i\)th treatment effect, \(\beta_j\) is the \(j\)th block effect, and \(\epsilon\) is the random error term.

Assumptions of the model:

  • The observations are all independent.
  • The effects are fixed and additive.
  • The error term has a normal distribution with mean 0 and constant variance \(\sigma^2\).

As in one-way ANOVA, an observation can be decomposed as

\[\text{Observed Value} = \text{Grand Mean} + \text{Block Effect} + \text{Treatment Effect} + \text{Residual}\] For the finger data we examined before,

  • the grand mean (mean of all data) is 34.
  • the block means are 19, 70, 30, and 17, respectively.
  • the treatment means are 22, 39, and 41, respectively.
  • the effects of blocks are the difference between block means and the grand mean, or \(-15, 36, -4\), and \(-17\), respectively.
  • the effects of treatments are the difference between treatment means and the grand mean, or \(-12, 5\), and 7, respectively.

For the observed value 11, it can be decomposed as

\[11 = 34+(-15)+(-12)+4\] which is equivalent to

\[11 - 34 = (-15)+(-12)+4\] It turns out that

\[\sum(y - grand~~mean)^2 = \sum(block~~effect)^2+\sum(treatment~~effect)^2+\sum(residual)^2\] where the summations are taken over all cells of the data in wide format.

The above sums of squares are called

  • the total sum of squares,
  • the block sum of squares,
  • the treatment sum of squares, and
  • the residual or error sum of squares.

Their numbers of degrees of freedom are

  • \(n-1\)
  • \(b-1\)
  • \(k-1\)
  • \(n-b-k+1\) or \((b-1)\cdot (k-1)\), since \(n = b\cdot k\).

Note that

\[(n-1)=(b-1)+(k-1)+(n-b-k+1)\] i.e., the total degrees of freedom equals the sum of the individual degrees of freedom.

The R code for carrying out the two-way ANOVA is shown in the following example.

Example 0: (Carrying out the analysis in Excel) Watch this video https://www.youtube.com/watch?v=7KVuq8tZeWA. The details are here: https://www.youtube.com/watch?v=jH6tkTdABHs

Example 1: Use the finger data to fit a two-way ANOVA model. The R is

model = aov(Y~Subject+Treatment, data = D)
summary(model)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Subject      3   5478  1826.0   33.00 0.000399 ***
## Treatment    2    872   436.0    7.88 0.020967 *  
## Residuals    6    332    55.3                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results show that:

  • the sum of squares for subjects (blocks) is 5478, with \((4-1)\) or 3 degrees of freedom.
  • the sum of squares for treatments is 872, with \((3-1)\) or 2 degrees of freedom.
  • the sum of squares for residuals is 332, with \(3\cdot 2\) or 6 degrees of freedom.
  • the mean sum of squares is (5478/3) or 1826, (872/2) or 436, and (332/6) or 55.33, respectively.
  • the two \(F\) values are (1826/55.33) or 33 and (436/55.33) or 7.8795, respectively.
  • the two p-values can be obtained by R code: pf(33, 3, 6) and pf(7.8795, 2, 6). Since both p-values are below 5%, both Block and Treatment are significant at level 5%.

If we blindly fit a one-way ANOVA model to study the effect of treatments without considering blocks, the result would be not significant, as shown by the following:

model2 = aov(Y~Treatment, data = D)
summary(model2)
##             Df Sum Sq Mean Sq F value Pr(>F)
## Treatment    2    872   436.0   0.675  0.533
## Residuals    9   5810   645.6

This example shows that it is important to use the correct model.

6.4 Assessing the Two-Way ANOVA Model for RDBC Designs

The assumptions for the two-way ANOVA model for RDBC designs can be checked with 3 kinds of residual plots:

  • Q-Q plot to check for normality
  • Plot of residuals versus fitted values to check constant variances.
  • Tukey plot for non-additivity. It shows whether a change of scale can lead to a better fit. The plot is constructed by first calculating the comparison values (Block effect times Treatment effect divided by Grand mean), plotting residuals versus comparison values, and adding a fitted line with 0 intercepts. If the estimated slope is not 0, do a power transformation over the response with power equal to 1 minus the slope.

Example 2.: We use the finger data to show how R can be used for constructing these plots.

  1. Obtain a Q-Q plot of residuals. Is there any indication from this plot that the normality condition is violated?

  2. Obtain a plot of residuals versus fitted values. Is any indication from the shape of this plot that the variation is not constant?

  3. Make the Tukey additivity plot. Is a power transformation needed? If yes, what power transformation is appropriate?

  4. Fit a 2-way ANOVA model for the log-iron data. Answer questions (1) and (2).

Solution: Fit the model.

# Provided data
D <- data.frame(Subject = factor(rep(1:4, each=3)),
                Treatment = rep(c("Placebo", "Caffeine", "Theobromine"), 4),
                Y = c(11, 26, 20, 56, 83, 71, 15, 34, 41, 6, 13, 32))

model = aov(Y~Subject+Treatment, data = D)
summary(model)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Subject      3   5478  1826.0   33.00 0.000399 ***
## Treatment    2    872   436.0    7.88 0.020967 *  
## Residuals    6    332    55.3                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  1. Q-Q plot:
plot(model, 2)

The points tend to be on the line, so the normality assumption is satisfied.

  1. Residuals versus fitted values:
plot(model, 1)

Interpretation of the plot:

  • If the points are spread out evenly along the horizontal axis, it suggests that the variance of the residuals is constant across all levels of the predictor variables.
  • If the residuals form a curve or have a systematic pattern, it suggests that the relationship between the predictor variables and the response variable may not be adequately captured by the linear model. You might need to consider a more complex model or transformation of variables.
  • If the spread of residuals widens or narrows systematically across the range of fitted values, it indicates violation of the constant variance assumption.
  1. Tukey plot for additivity:
# Calculate block effects
block_effects <- tapply(D$Y, D$Subject, mean)

# Calculate treatment effects
treatment_effects <- tapply(D$Y, D$Treatment, mean)

# Calculate grand mean
grand_mean <- mean(D$Y)

# Calculate comparison values
comparison_values <- outer(block_effects, treatment_effects, FUN = "*") / grand_mean

# Flatten comparison values matrix to vector
comparison_values <- as.vector(comparison_values)

# Calculate residuals
fit <- lm(D$Y ~ comparison_values + 0)
residuals <- resid(fit)

# Extract slope coefficient
slope <- coef(fit)[1]

# Plot Tukey additivity plot with the fitted line
plot(comparison_values, residuals, xlab = "Comparison Values", ylab = "Residuals",
     main = "Tukey Additivity Plot", pch = 16)
abline(a = 0, b = slope, col = "red")  # Fitted line with intercept 0

# Add legend showing slope value
legend("topright", legend = paste("Slope:", round(slope, 4)), col = "red", lwd = 1, bty = "n")

The slope is 0.5631, so the power transformation suggested by the tukey plot has a power of \(1-0.5631\) or 0.437 which is close to 0.5, indicating that a square-root transformation might be appropriate.

  1. You do it.

6.5 Using the Model for RCBD

We can test hypotheses

  • whether there is a significant difference among treatments (major interest)
  • whether there is a significant difference among blocks (less interest)
  • whether there is a significant difference between two treatments

We can also find a confidence interval

  • for a treatment mean
  • for the difference between two treatment means
library(lsmeans)

# Perform the 2-way ANOVA
model <- aov(Y ~ Treatment + Subject, data = D)

# Compute least square means and confidence intervals
lsmeans_model <- lsmeans(model, "Treatment")
lsmeans_model
##  Treatment   lsmean   SE df lower.CL upper.CL
##  Caffeine        39 3.72  6     29.9     48.1
##  Placebo         22 3.72  6     12.9     31.1
##  Theobromine     41 3.72  6     31.9     50.1
## 
## Results are averaged over the levels of: Subject 
## Confidence level used: 0.95
# Calculate confidence intervals for pairwise comparisons
pairwise_CI <- pairs(lsmeans_model, adjust = "none", infer = TRUE)
pairwise_CI
##  contrast               estimate   SE df lower.CL upper.CL t.ratio p.value
##  Caffeine - Placebo           17 5.26  6     4.13    29.87   3.232  0.0179
##  Caffeine - Theobromine       -2 5.26  6   -14.87    10.87  -0.380  0.7169
##  Placebo - Theobromine       -19 5.26  6   -31.87    -6.13  -3.612  0.0112
## 
## Results are averaged over the levels of: Subject 
## Confidence level used: 0.95

The result shows that

  • The 95% confidence interval for Caffeine group is 29.9 to 48.1.
  • The 95% confidence interval for the mean difference between Caffeine group and Placebo group is 4.13 to 29.9.
  • The p-value for testing whether there is mean difference between Caffeine group and Placebo group is 0.0179, indicating a significant difference at level 5%.
  • The p-value for testing whether there is mean difference between Caffeine group and Theobromine group is 0.7169, indicating no significant difference at any commonly used significance level.

Note: We can also calculate effect sizes for a single treatment mean and for the mean difference between two treatments. The formulae are the same as in the 1-way ANOVA case.

6.6 Observational Relatives of the RCBD

The two-way ANOVA model can be applied to a situation where there is a major factor of interest (not treatment) and a nuisance factor (the block) and there is only one observation for each combination of the levels of the two factors.

Here is an example.

Example 3: Use the Rivlron data from https://stat2.org/datapage.html

The data were collected from four rivers in upstate New York. Some geologists were interested in the water chemistry. They took water samples at three different locations in each river (Grasse, Oswegatchie, Raquette, and St. Regis). The sampling sites were chosen to investigate how the composition of the water changes as it flows from the source to the mouth of each river. The sampling sites were labeled as upstream, midstream, and downstream. This dataset contains the concentrations of iron in the samples.

Fit a 2-way ANOVA model for the iron data.

Chapter 7. ANOVA with Interaction and Factorial Designs

7.1 Interaction

In ANOVA, interaction refers to the combined effect of two or more factors on the response variable. When an interaction exists between factors, the effect of one factor on the response variable depends on the level of another factor. Understanding interaction is crucial as it can provide deeper insights into the relationships between variables and help us make more informed decisions.

7.2 The Two-Way Factorial Design

The two-way factorial design is a powerful experimental design used to study the effects of two independent variables (factors) on a single dependent variable (response). In this design, each factor has two or more levels, and all possible combinations of the levels of the two factors are studied.

Key Concepts:

  • Factors: In a two-way factorial design, there are two independent variables, often referred to as factors. These factors can represent different treatments, conditions, or levels being studied in the experiment.

  • Levels: Each factor consists of two or more levels, which represent the different values or categories of the factor being manipulated or observed.

  • Cells: The combination of levels from the two factors creates cells in the experimental design. Each cell represents a unique combination of factor levels.

  • Main Effects: The main effects of each factor represent the average effect of that factor across all levels of the other factor. It describes how the response variable changes as the levels of one factor change, while holding the other factor constant.

  • Interaction Effects: Interaction effects occur when the effect of one factor on the response variable depends on the level of another factor. It signifies that the relationship between the response variable and one factor is not constant across the levels of the other factor.

Example 1:

Consider a study examining the effects of two factors,

  • Type of Diet: High-Protein Diet versus Low-Protein Diet
  • Exercise Regimen: Aerobic Exercise versus Strength Training

on weight loss. The two-way factorial design would involve all possible combinations of these levels, resulting in four treatment groups:

  • High-Protein Diet + Aerobic Exercise,
  • High-Protein Diet + Strength Training,
  • Low-Protein Diet + Aerobic Exercise, and
  • Low-Protein Diet + Strength Training.

7.3 Exploring Two-Way Data

Before diving into a two-way ANOVA analysis, it’s crucial to thoroughly explore the dataset to gain insights into its characteristics and uncover any potential patterns or trends.

We will construct the following plots:

  • Side-by-side dot plot for two-way data (book page 309): Holding one factor at a fixed level, the plot shows the dot plot for the other factor.
  • (optional) Plot of log(Standard Deviation) versus log(Mean): If a line fits well, find the slope. If the slope is not 0, it suggests to make a power transformation to the data. The power equals one minus the slope.
  • Interaction plot: The plot shows the mean for each combination of factor levels. One factor is the x-axis. The points are joined for each level of the other factor. If lines are parallel, there is no interaction. See an example plot below.

7.4 Fitting a Two-Way Balanced ANOVA Model

This involves the decomposition of each observed value:

\[y = \text{grand mean}+\text{effect of factor one}+\text{effect of factor two}+\text{interaction effect}+\text{residual}.\]

The sums of squares are defined similarly as in two-way ANOVA for the RCBD. They are

  • the total sum of squares,
  • the sum of squares for factor one,
  • the sum of squares for factor two,
  • the sum of squares for interaction, and
  • the sum of squares for residuals.

The total sum of squares can be decomposed as

\[SSTo = SSA + SSB + SSAB + SSE\]

The corresponding degrees of freedom are

  • The total number of observations minus one,
  • The number of levels minus one for factor one,
  • The number of levels minus one for factor two,
  • The product of the numbers of degrees of freedom of the two factors, and
  • The total number of observations minus the product of the numbers of the levels of the two factors.

The number of degrees of freedom for the total sum of squares equals the sum of the other numbers of degrees of freedom.

As before, we can make an ANOVA table, which will show 3 p-values, one for each factor and one for interaction. We interpret the p-value for interaction first.

  • If there is no interaction as indicated by a large p-value, then we interpret the main effect of each factor. Marginal means can be compared by finding confidence intervals.
  • If interaction is present, we do not interpret the p-value for each factor, since main effect is not meaningfully defined in this situation. Conditional means for a factor can be compared by finding confidence intervals holding the other factor at each of its levels.
  • Examples are shown in Section 7.6.

A very good reference is https://courses.lumenlearning.com/suny-natural-resources-biometrics/chapter/chapter-6-two-way-analysis-of-variance/.

7.5 Assessing Fit: Do We Need a Transformation? (Optional)

In some cases, the assumptions of ANOVA, such as normality and homogeneity of variances, may not be met. When this occurs, it may be necessary to transform the data to meet these assumptions before conducting the analysis. Assessing the fit of the ANOVA model and considering data transformations are essential steps in ensuring the validity and reliability of the results.

We can again use the Tukey additivity plot to suggest a transformation on the response variable, but since we have more than one observation in each cell, we use cell means instead. The R code in previous chapter can be used to the cell means. The plot can answer two related questions:

  • Is there a scale for which the main effects will be additive with no interaction?
    • The answer is yes, if the points lie near a line.
    • The answer is no, if the points do not lie near a line.
  • If there is such a scale, what is the right transformation? Answer: Power of transformation equals 1 minus slope.

Another plot we can construct is to plot log(standard deviation) versus log(mean). It’s similar to Tukey’s plot. The plot can answer two related questions:

  • Is there a scale for which the standard deviations will be roughly equal?
    • The answer is yes, if the points lie near a line.
    • The answer is no, if the points do not lie near a line.
  • If there is such a scale, what is the right transformation? Answer: Power of transformation equals 1 minus slope.

7.6 Using a Two-Way ANOVA Model

After fitting a two-way ANOVA model and assessing its fit, several procedures can be performed:

  • Test interaction between two factors: Use the ANOVA table to test the interaction term in the model. Look at the corresponding p-value based on the \(F\) test.
  • Test main (or marginal) effects: If there is NO interaction, test whether each factor has any effect. Look at the corresponding p-value based on the \(F\) test.
  • Calculate confidence interval for the mean difference between two levels of a factor.
  • Calculate effect sizes

Example 1: Do male and female birds respond in the same way to a hormone supplement added to their food? The hormone in question was meant to raise the concentration of calcium in the blood. To investigate this question, 10 male robins were divided at random into two groups of 5, one group being treated with the hormone and the other not. A similar procedure was applied to 10 female robine, 5 getting the hormone and 5 not. The response variable of interest is the blood calcium level (measured in mg per 100 ml). The data are

  • Males, no hormone: 14.5, 11, 10.8, 14.3, 10
  • Males, hormone: 32, 23.8, 28.8, 25, 29.3
  • Females, no hormone: 16.5, 18.4, 12.7, 14, 12.8
  • Females, hormone: 31.9, 26.2, 21.3, 35.8, 40.2

We can display the data in a table below:

No Hormone Hormone
Male 14.5, 11, 10.8, 14.3, 10 32, 23.8, 28.8, 25, 29.3
Femalee 16.5, 18.4, 12.7, 14, 12.8 31.9, 26.2, 21.3, 35.8, 40.2
  1. Make an interaction plot for the data.

  2. Is it clear that the hormone supplement raises the calcium level?

  3. Is the size of the effect much different for male and female birds? That is, is there any interaction between the two factors (gender and hormone)?

  4. Fit a two-way ANOVA model with interaction. Test the interaction effect. If there is no interaction, test the main effects and construct confidence intervals for the mean difference between male and females and the mean difference with and without hormone. If there is an interaction, construct confidence intervals for the conditional mean difference between male and females given hormone occurrence or not and the conditional mean difference with and without hormone given male or female.

  5. Calculate the effect size for hormone versus no hormone and the effect size for female versus male.

Solution.

  1. The interaction plot is
# Create data frame
bird <- data.frame(gender = rep(c("Male", "Female"), each = 10),
                   hormone = rep(rep(c("No", "Yes"), each = 5), times = 2),
                   blood_calcium = c(14.5, 11, 10.8, 14.3, 10, 32, 23.8, 28.8, 25, 29.3, 16.5, 18.4, 12.7, 14, 12.8, 31.9, 26.2, 21.3, 35.8, 40.2) 
)

# Create interaction plot
interaction.plot(x.factor = bird$hormone, 
                  trace.factor = bird$gender, 
                  response = bird$blood_calcium, 
                  fun = mean, 
                  type = "l",
                  legend = TRUE,
                  xlab = "Hormone Supplement?",
                  ylab = "Mean Blood Calcium Level (mg/100 ml)",
                  trace.label = "Gender",
                 col = 1:2)

  1. The interaction plot clearly shows that the hormone supplement raises the calcium level.

  2. The two lines are parallel, so there is no interaction between the two factors (gender and hormone).

  3. Fit the two-way ANOVA model using R code

# Fit the two-way ANOVA model
model <- aov(blood_calcium ~ gender * hormone, data = bird)
summary(model)
##                Df Sum Sq Mean Sq F value   Pr(>F)    
## gender          1   45.9    45.9   2.346    0.145    
## hormone         1 1268.8  1268.8  64.841 5.09e-07 ***
## gender:hormone  1    0.4     0.4   0.019    0.893    
## Residuals      16  313.1    19.6                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results show that

  • The p-value for testing whether interaction is present is 0.893. The interaction is not statistically significant at any commonly used significance level such as 5%. Thus, we can further test the main effects.
  • The p-value for testing whether there is a sex difference is 0.145, so there is no significant difference in sex at commonly used significance level such as 5%.
  • The p-value for testing whether there is a hormone difference is almost 0, so there is a significant difference in hormone level at any commonly used significance level such as 5%.
  • Since interaction is not present, we can make the following marginal comparisons using 95% confidence intervals:
    • Hormone versus no hormone: \((mean ~with~ Hormone - mean ~without~ Hormone)\pm t\cdot SE\) or \((29.43-13.50)\pm 2.1199\cdot 1.98\) or 11.73 to 20.13.
    • where \(t=2.1199\) based on 16 degrees of freedom,
    • \(SE=SD\cdot \sqrt{1/n_1+1/n_2}=4.4272\cdot \sqrt{1/10+1/10}=1.98\) and
    • \(SD = \sqrt{MSE}=\sqrt{19.6}=4.4272\)
    • similarly, Female versus male: \((22.98-19.95)\pm 2.1199\cdot 1.98\) or \(-1.17\) to 7.23.

To interpret the result, we say

  • “We can be 95% confident that the average hormone/no hormone difference in blood calcium concentration lies somewhere between 11.73 and 20.13.”
  • “We can be 95% confident that the average female/male difference in blood calcium concentration lies somewhere between \(-1.17\) to 7.23. At one extreme (\(-1.17\)), there is a negative effect; at the other extreme (7.23), there is a positive effect.”
  1. The effect size for
  • hormone versus no hormone: \((\text{mean difference})/SD = (29.43-13.50)/4.4272 = 3.60\), a very large effect
  • for female versus male: \((\text{mean difference})/SD = (22.98-19.95)/4.4272 = 0.68\), a moderate effect; that is, there is a evidence of small sex difference, with calcium level for female birds about 0.7 standard deviation higher than for males.

Example 2: In theory, adding vitamin B12 to the diet of baby pigs should cause them to gain weight faster. In practice, however, the presence of bacteria in a pig’s gut can prevent the use of vitamin B12. Would taking antibiotics make the use of vitamin B12 more effective?

You want to compare 4 diets based on crossing two factors, B12 (0 mg or 5 mg) added to the diet, and Antibiotics (0 mg or 40 mg): no/no, no/yes, yes/no, and yes/yes. With 12 pigs and 4 diets, you set up a lottery to assign diets to pigs, with 3 tickets for each diet. The tickets tell pigs what they get to eat. The data are shown below:

No Antibiotics Antibiotics
No B12 30, 19, 8 5, 0, 4
B12 26, 21, 19 52, 56, 54

In the table, the numbers are daily weight gain (hundredths of a pound over 1.00). For example, 56 represents 0.56 pounds gained per day.

  1. Make an interaction plot for the data.

  2. Fit a two-way ANOVA model with interaction. Test the interaction effect. If there is no interaction, test the main effects and construct confidence intervals for the mean difference between B12 and no B12 and the mean difference between Antibiotics and no antibiotics. If there is an interaction, construct confidence intervals for the conditional mean difference between B12 and no B12 given Antibiotics and no antibiotics and the conditional mean difference between Antibiotics and no antibiotics given B12 and no B12.

  3. Calculate the effect size for B12 versus with Antibiotics and no Antibiotics. Calculate the effect size for Antibiotics versus no antibiotics given B12 and no B12.

Solution.

  1. The interaction plot is
# Create data frame
pig <- data.frame(B12 = rep(c("No", "Yes"), each = 6),
                   Antibiotics = rep(rep(c("No", "Yes"), each = 3), times = 2),
                   Weight_gain = c(30, 19, 8, 5, 0, 4, 26, 21, 19, 52, 56, 54) 
)

# Create interaction plot
interaction.plot(x.factor = pig$Antibiotics, 
                  trace.factor = pig$B12, 
                  response = pig$Weight_gain, 
                  fun = mean, 
                  type = "l",
                  legend = TRUE,
                  xlab = "Antibiotics (mg)",
                  ylab = "Mean Weight Gain (hundredths of a pound over 1.00)",
                  trace.label = "B12",
                 col = 1:2)

The plot shows that - When B12 is present (5 mg), the mean difference between the levels of Antibiotics (40 mg and 0 mg) is 32. - When B12 is absent (0 mg), the mean difference between the levels of Antibiotics (40 mg and 0 mg) is \(-16\). - So, clearly, there is an interaction between the two factors.

  1. Fit the two-way ANOVA model using R code
# Fit the two-way ANOVA model
model <- aov(Weight_gain ~ B12 * Antibiotics, data = pig)
summary(model)
##                 Df Sum Sq Mean Sq F value   Pr(>F)    
## B12              1   2187  2187.0  60.331  5.4e-05 ***
## Antibiotics      1    192   192.0   5.297 0.050359 .  
## B12:Antibiotics  1   1728  1728.0  47.669 0.000124 ***
## Residuals        8    290    36.3                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results show that

  • The p-value for testing whether interaction is present is 0.000124. The interaction is statistically significant at any commonly used significance level such as 5%. Thus, we CANNOT further test the main effects. That is, the two p-values for B12 and Antibiotics are not interpretable.
  • Since interaction is present, we can make the following conditional comparisons using 95% confidence intervals:
    • B12 (present versus absent) with Antibiotics present: \((mean ~with~ B12 present - mean ~with~ B12 absent)\pm t\cdot SE\) or \((54-3)\pm 2.306\cdot 4.92\) or roughly 40 to 62.
    • where \(t=2.306\) based on 8 degrees of freedom,
    • \(SE=SD\cdot \sqrt{1/n_1+1/n_2}=6.02\cdot \sqrt{1/3+1/3}=4.92\) and
    • \(SD = \sqrt{MSE}=\sqrt{36.25}=6.02\)
    • similarly, B12 (present versus absent) with Antibiotics absent: \((22-19)\pm 2.306\cdot 4.92\) or roughly \(-8\) to 14, - Antibiotics (present versus absent) with B12 present: \((54-22)\pm 2.306\cdot 4.92\) or roughly 21 to 43,
    • Antibiotics (present versus absent) with B12 absent: \((3-19)\pm 2.306\cdot 4.92\) or roughly \(-27\) to \(-5\).

To interpret the result, we say

  • “With antibiotics present, we can be 95% confident that the effect of B12 is to add 0.40 to 0.62 pound (on average) to the weekly weight gain. In the absence of antibiotics, adding B12 might lower the average weight gain by as much as 0.08 pound per week, or raise the gain by as much as 0.14 pound, with 95% confidence.”
  • “With B12 present, we can be 95% confident that the effect of antibiotics is to add 0.21 to 0.43 pound (on average) to the weekly weight gain. In the absence of B12, adding antibiotics might lower the average weight gain by 0.27 to 0.50 pound per week, with 95% confidence.”
  1. The effect sizes are reported as follows:
  • B12 (present versus absent) with Antibiotics present: \((54-3)/6.02\) or 8.47, huge benefit
  • B12 (present versus absent) with Antibiotics absent: \((22-19)/6.02\) or 0.5, small effect
  • Antibiotics (present versus absent) with B12 present: \((54-22)/6.02\) or 5.32, large effect
  • Antibiotics (present versus absent) with B12 absent: \((3-19)/6.02\) or \(-2.66\), indicating that giving antibiotics alone cuts the weekly weight gain by 2.66 times the typical size of pig-to-pig variation.

You are all set!!!

A video regarding : https://datatab.net/tutorial/two-factorial-anova-without-repeated-measures

More Examples

Example 1.

A data example for a two-way factorial design could be a study examining the effect of different types of fertilizer (“Fertilizer Type: A, B”) and watering frequency (“Watering Frequency: Low, High”) on plant growth (measured in height), where each combination of fertilizer and watering frequency is tested on multiple plants, allowing analysis of both main effects (fertilizer type and watering frequency) and their interaction on plant height.

Example Data Table:

Fertilizer Type Watering Frequency Plant Height (cm)
A Low 10
A Low 12
A Low 11
A High 15
A High 14
A High 16
B Low 8
B Low 9
B Low 10
B High 18
B High 17
B High 19

Key points about this example:

  • Two factors: “Fertilizer Type” with two levels (A, B) and “Watering Frequency” with two levels (Low, High).
  • Multiple observations per combination: Each combination of fertilizer type and watering frequency is replicated multiple times to account for variability within each condition.
  • Interaction potential: This design allows you to test if the effect of fertilizer type depends on the watering frequency, and vice versa.

Example 2.

Refer to the page: https://stats.libretexts.org/Bookshelves/Introductory_Statistics/Mostly_Harmless_Statistics_(Webb)/11%3A_Analysis_of_Variance/11.03%3A_Two-Way_ANOVA_(Factorial_Design)

A template for writing up

https://peterstatistics.com/CrashCourse/3-TwoVarUnpair/ScaleScale/ScaleScale4.html

References

  1. Julian J. Faraway’s book: https://julianfaraway.github.io/faraway/PRA/pra.pdf

  2. Panel Data Analysis: https://dss.princeton.edu/training/Panel101R.pdf

Research Questions

  1. Does the age of becoming GM have a significant effect on the most recent rating of a chess GM player? How can we collect data? Any other factor affects the relationship?

  2. Does the age of becoming GM have a significant effect on the most recent rating of a chess GM player adjusting for the effect of nationality and sex? How can we collect data?