https://bookdown.org/speegled/foundations-of-statistics/data-manipulation.html#data-manipulation (Chapters 1,5,6,7, and 10)
Register on https://opai.com. Use the following query:
“I have a data set, with two variables x and y. how can i build a regression model with y as the response variable and x as the explanatory variable. show me the r code with data you made from engineering and explain each line of code. don’t use simulated data or data frame from R.”
Compile the code.
Now, ask ChatGPT more:
“interpret the results”
Next, we can ask”
“can you show me the underlying theory?”
Still, we ask:
“What does the R-square tell us? How to interpret each p-value? how to make prediction using the fitted model?”
Still, we ask:
“use a data with specific meanings for x and y.”
Finally, we ask:
“how to check how good the model for prediction?”
We’ll explore the relationship between sugar content and calories in cereals using data from https://www.lock5stat.com/datasets3e/Cereal.csv (see the documentation of the data from https://www.lock5stat.com/datasets3e/Lock5DataGuide3e.pdf). Simple linear regression is a statistical technique that helps us understand how changes in one variable (called independent/explanatory/predictor variable) can predict changes in another (called dependent/response/outcome variable).
Let’s start by reading the raw csv (comma separated values) data to create a data frame in RStudio that R recognizes.
# Load real dataset
cereal_data <- read.csv(file = "https://www.lock5stat.com/datasets3e/Cereal.csv", header = TRUE, sep = ",")
# Print variables
names(cereal_data)
## [1] "Name" "Company" "Serving" "Calories" "Fat" "Sodium"
## [7] "Carbs" "Fiber" "Sugars" "Protein"
# Check dimmension
dim(cereal_data)
## [1] 30 10
# Print first 10 observations
head(cereal_data, n = 10)
## Name Company Serving Calories Fat Sodium Carbs Fiber Sugars
## 1 AppleJacks K 1.00 117 0.6 143 27 0.5 15.0
## 2 Boo Berry G 1.00 118 0.8 211 27 0.1 14.0
## 3 Cap'n Crunch Q 0.75 144 2.1 269 31 1.1 16.0
## 4 Cinnamon Toast Crunch G 0.75 169 4.4 408 32 1.7 13.3
## 5 Cocoa Blasts Q 1.00 130 1.2 135 29 0.8 16.0
## 6 Cocoa Puffs G 1.00 117 1.0 171 26 0.8 14.0
## 7 Cookie Crisp G 1.00 117 0.9 178 26 0.5 13.0
## 8 Corn Flakes K 1.00 101 0.1 202 24 0.8 3.0
## 9 Corn Pops K 1.00 117 0.2 120 28 0.3 15.0
## 10 Crispix K 1.00 113 0.3 229 26 0.1 3.0
## Protein
## 1 1.0
## 2 1.0
## 3 1.3
## 4 2.7
## 5 1.0
## 6 1.0
## 7 1.0
## 8 2.0
## 9 1.0
## 10 2.0
We focus on two variables: sugar content and calories. To gain insights into the relationship between them, we’ll create a scatter plot with calories as the response variable.
# Create a scatter plot with calories as the response variable
plot(formula = Calories ~ Sugars,
data = cereal_data,
main = "Scatter Plot of Sugar vs Calories",
xlab = "Sugar Content",
ylab = "Calories",
col = "blue",
pch = 19,
cex = 2
)
In the code above, “Calories ~ Sugars” is called a “formula” which specifies the left side as the response and the right side as the explanatory variable. The data parameter tells that the variables “Calories” and “Sugars” are from data frame “cereal_data” which has been created before.
The parameters “main”, “xlab”, and “ylab” set the titles for the plot, the x-axis, and the y-axis. The parameters “col”, “pch”, “cex” set the color, symbol, and size of the plotting characters.
later, we will introduce a more advanced tool for visualizing data. It involves the R package “ggplot2”. R Packages are tool boxes that extend the functionality of the base package “base”. To use a package, you must install it first. This is done by issuing the code: install.packages(“packageName”) in the console (lower-left panel of RStudio). The next step is to load the installed the package by placing the code: library(packageName) in the setup code chunk (somewhere around line 10 in your R markdown file).
ggplot(cereal_data, aes(x = Sugars, y = Calories)) +
geom_point() +
labs(title = "Scatter Plot of Sugar vs Calories",
x = "Sugar Content",
y = "Calories") +
theme(plot.title = element_text(hjust=0.5))
The code generates a scatter plot using ggplot2 with sugar content (Sugars) on the x-axis and calories (Calories) on the y-axis, based on the data in the cereal_data dataframe.
Points on the scatter plot represent observations in the dataset.
The plot is given a title (“Scatter Plot of Sugar vs Calories”) and axis labels (“Sugar Content” for x-axis, “Calories” for y-axis).
The theme modification ensures that the plot title is horizontally centered.
What is the use of this scatterplot? It tells us graphically whether there is a straight line relationship between the two variables. In other words, it tells whether the two variables are correlated. A quantity that quantifies the level of correlation is called the coefficient of correlation or just correlation.
R has a function to calculate such a correlation value, which is shown here:
cor(cereal_data$Sugars, cereal_data$Calories)
## [1] 0.6601099
# If there is any missing value
cor(cereal_data$Sugars, cereal_data$Calories, use = "complete.obs")
## [1] 0.6601099
Explanation of code:
Now, let’s fit a simple linear regression model to predict calories based on sugar content.
# Fit the model
model <- lm(Calories ~ Sugars, data = cereal_data)
We print the summary of the results:
summary(model)
##
## Call:
## lm(formula = Calories ~ Sugars, data = cereal_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.574 -25.282 -2.549 17.796 51.805
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 88.9204 10.8120 8.224 5.96e-09 ***
## Sugars 4.3103 0.9269 4.650 7.22e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.61 on 28 degrees of freedom
## Multiple R-squared: 0.4357, Adjusted R-squared: 0.4156
## F-statistic: 21.62 on 1 and 28 DF, p-value: 7.217e-05
5.1 Coefficients
Let’s interpret the coefficients of the model.
The intercept \(\beta_0\) is estimated to be 88.92, which represents the estimated mean calories when sugar content is 0. The slope \(\beta\) is estimated to be 4.31, which represents the estimated change in mean calories for a one-unit increase in sugar. A confidence interval can also be calculated using the following R code:
confint(model, level = 0.90)
## 5 % 95 %
## (Intercept) 70.527745 107.313079
## Sugars 2.733435 5.887087
The confint() function automatically calculates the confidence interval for all coefficients (intercept and slope) in your regression model. The “level = 0.90” tells R to calculate the confidence intervals with the 90% confidence level.
Interpretation of the confidence interval for the slope:
5.2 P-values
Now, let’s assess the significance of our predictor variable - sugar content.
p_value <- summary(model)$coefficients[2, 4]
The P-value for sugar content is 0, which is basically zero and indicates that our predictor variable - sugar content - is very significant as a predictor for calories.
To ensure the validity of our model, let’s perform diagnostic checks.
# Diagnostic plots: Residuals vs. Fitted: Check for linearity and constant variance.
plot(model, which = 1)
Residuals vs. Fitted: This plot helps us check for linearity and constant variance. If the residuals exhibit a random scatter around zero, it suggests linearity. If the residual vs. fitted values plot exhibits a tunnel shape, it indicates a violation of the assumption of constant variance, also known as homoscedasticity. Homoscedasticity implies that the variance of the residuals should be roughly constant across all levels of the fitted values. A tunnel-shaped pattern in the residual plot suggests that the spread of residuals changes systematically as the predicted values increase or decrease.
The plot here shows there is slight concern regarding the constant variance assumption. The linearity is ok, since the residuals exhibit a random scatter around zero.
# Diagnostic plots: Normal Q-Q Plot- Check for normality of residuals.
plot(model, which = 2)
A Normal Q-Q (Quantile-Quantile) plot is a graphical tool used to assess the normality of residuals in a statistical model. It compares the quantiles of the observed residuals against the quantiles of a theoretical normal distribution. If the points on the plot fall approximately along a straight line, it suggests that the residuals are normally distributed.
The plot here shows there is slight concern regarding the normality assumption on the residuals.
# Diagnostic plots: Scale-Location Plot- Check for constant variance.
plot(model, which = 3)
Scale-Location Plot: Checks for constant variance of residuals. Ideally, points should be evenly distributed around the horizontal line.
The plot here shows a violation of constant variance in residuals.
# Diagnostic plots: Cook's Distance- Identify influential data points.
plot(model, which = 4)
Cook’s Distance is a statistical measure used to identify influential data points in a regression analysis. It measures the impact of deleting a particular observation on the overall fit of the model. Large values of Cook’s Distance indicate that removing a specific data point significantly changes the regression coefficients.
The plot shows that observations 1, 15 and 16 are the top 3 influential points, with observation 16 the most influential.
Now, let’s evaluate how well our model explains the variance in calories.
R-squared (Coefficient of Determination) is a statistical measure that represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s) in a regression model. It quantifies the goodness of fit of the model and indicates the proportion of variability in the response variable that is explained by the predictor variable(s). R-squared values range from 0 to 1. A higher R-squared value indicates a better fit, meaning that a larger proportion of the variance in the dependent variable is accounted for by the independent variable(s) in the model. Remember, while R-squared provides information about the goodness of fit, it doesn’t necessarily indicate the causation between variables or the appropriateness of the model for making predictions. It should be used in conjunction with other model evaluation techniques.
The \(R^2\) here is 43.57% indicating that 43.57% of the total variation is explained by sugar content.
Finally, let’s make predictions based on our model for new data points.
# New data with sugar values
new_data <- data.frame(Sugars = c(5, 10, 15))
# Predict calories for new data
predictions <- predict(model, newdata = new_data)
# Display predictions
predictions
## 1 2 3
## 110.4717 132.0230 153.5743
The code generates predictions for the calories corresponding to the sugar values specified in new_data using the fitted regression model (model).
The predict() function utilizes the coefficients from the regression model to estimate the dependent variable (calories) based on the new independent variable values (sugar).
The resulting predictions are stored in the variable predictions.
Finally, the predictions are displayed.
In conclusion, this lecture has guided you through the entire process of simple linear regression analysis using real data. We’ve covered data preparation, visualization, model fitting, interpretation of coefficients and p-values, model checking, R-squared evaluation, and making predictions. The interpretation of diagnostic plots in model checking ensures the adequacy of the model by examining linearity, normality, constant variance, and identifying influential points. Adjustments can be made based on your specific dataset and research questions.
In the realm of regression analysis involving a quantitative response variable and an explanatory variable, understanding the reasons behind the variability in response values becomes crucial. A natural way to quantify the overall variation in these response values is expressed as:
\[SSTotal = (y_1-\bar{y})^2 + (y_2-\bar{y})^2 + \cdots + (y_n-\bar{y})^2. \] Here \(SSTotal\) denotes the total (corrected) sum of squares for convenience. Considering the regression model \(y = \beta_0+\beta_1\cdot x+\epsilon\), the variation in the response values is attributed to two sources: one stemming from the component \(\beta_0+\beta_1\cdot x\), and the other arising from random noise (\(\epsilon\)).
Mathematically, \(SSTotal\) can be split into two components: the total variation in the fitted values, denoted as \(SSModel\), and another is the sum of squared residuals (or errors), denoted as \(SSE\). This division is expressed as:
\[SSTotal = SSModel + SSE\]
In an ideal scenario where the fitted values and the observed response values are the same, \(SSTotal\) and \(SSModel\) would be equivalent. Hence, the ratio \(\frac{SSModel}{SSTotal}\) serves as a meaningful metric to evaluate the model’s effectiveness, is called the coefficient of determination, denoted by \(R^2\), and represents the proportion of total variability in response values explained by the model. \(R^2\) values always fall between 0 and 1, and a model with a higher \(R^2\) is considered more desirable.
Transformations on the response variable in simple linear regression may be necessary or beneficial for several reasons:
Non-Linearity: If the relationship between the response variable and the predictor variable is not linear, transforming the response variable might help make the relationship more linear. Common transformations include the logarithmic, square root, or reciprocal transformations.
Homoscedasticity: Simple linear regression assumes constant variance of errors (homoscedasticity). If the variance of the residuals increases or decreases with the level of the response variable, transforming the response variable can sometimes stabilize the variance.
Outliers and Skewness: Transformations can be useful for mitigating the impact of outliers or skewed distributions in the response variable, making the model more robust.
Normality of Residuals: Transforming the response variable may help in achieving the assumption of normally distributed residuals, which is important for valid statistical inferences.
Equalizing Spread: In cases where the spread of the response variable is not constant across levels of the predictor variable, a transformation can sometimes help equalize the spread.
The choice of transformation depends on the characteristics of the data and the goals of the analysis. However, it’s essential to interpret the results of the transformed model in the context of the original scale.
It’s advisable to perform diagnostic checks and assess the impact of transformations on the assumptions of the model before deciding on the most appropriate transformation. Additionally, communication of results should consider the interpretation of coefficients on the transformed scale.
It’s also helpful when transformations occur on the explanatory variable. Common transformations include the logarithmic or square transformations.
Refer to https://peterstatistics.com/CrashCourse/3-TwoVarUnpair/ScaleScale/ScaleScale4.html#google_vignette.
Multiple Linear Regression is a statistical method used to model the relationship between a dependent variable and two or more independent variables. The goal is to understand how the independent variables collectively influence the dependent variable. The model assumes a linear relationship, meaning changes in the independent variables are associated with a constant change in the dependent variable. The equation for a multiple linear regression model is:
\[y = \beta_0+\beta_1\cdot x_1+\beta_2\cdot x_2+\cdots +\beta_p\cdot x_p+\epsilon\] where \(\beta\)’s are the coefficients representing the impact of each independent variable, \(x\)’s are the independent variables, and \(\epsilon\) is the error term. The coefficients are estimated using the method of least squares to minimize the sum of squared differences between the observed and predicted values. The model is assessed based on statistical significance, goodness of fit, and adherence to model assumptions. Multiple Linear Regression is widely used in various fields, including economics, finance, and social sciences, for predictive modeling and understanding the relationships between variables.
We will again use the cereal data from https://www.lock5stat.com/datasets3e/Cereal.csv. We will study whether other variables have impact on the calorie variable.
Let’s start by reading the raw csv (comma separated values) data to create a data frame in RStudio that R recognizes.
# Print variables
names(cereal_data)
## [1] "Name" "Company" "Serving" "Calories" "Fat" "Sodium"
## [7] "Carbs" "Fiber" "Sugars" "Protein"
# Check dimmension
dim(cereal_data)
## [1] 30 10
# Print first 10 observations
head(cereal_data, n = 10)
## Name Company Serving Calories Fat Sodium Carbs Fiber Sugars
## 1 AppleJacks K 1.00 117 0.6 143 27 0.5 15.0
## 2 Boo Berry G 1.00 118 0.8 211 27 0.1 14.0
## 3 Cap'n Crunch Q 0.75 144 2.1 269 31 1.1 16.0
## 4 Cinnamon Toast Crunch G 0.75 169 4.4 408 32 1.7 13.3
## 5 Cocoa Blasts Q 1.00 130 1.2 135 29 0.8 16.0
## 6 Cocoa Puffs G 1.00 117 1.0 171 26 0.8 14.0
## 7 Cookie Crisp G 1.00 117 0.9 178 26 0.5 13.0
## 8 Corn Flakes K 1.00 101 0.1 202 24 0.8 3.0
## 9 Corn Pops K 1.00 117 0.2 120 28 0.3 15.0
## 10 Crispix K 1.00 113 0.3 229 26 0.1 3.0
## Protein
## 1 1.0
## 2 1.0
## 3 1.3
## 4 2.7
## 5 1.0
## 6 1.0
## 7 1.0
## 8 2.0
## 9 1.0
## 10 2.0
To gain insights into the relationship between calorie and each of the other variables (called explanatory variable or predictor), we’ll create a scatter plot with calorie as the response variable.
# Create a scatterplot matrix
cereal_data_new = cereal_data[, c(4, 3, 5:10)] # I re-arranged the columns so that calorie is the first column
plot( cereal_data_new)
Now, let’s fit a multiple linear regression model to predict calories based on other numeric variables.
# Fit a multiple linear regression model with calorie as response and all others as explanatory variables
# Since there are 7 explanatory variables, I use the plus sign among them.
model <- lm(Calories ~ Serving + Fat + Sodium + Carbs + Fiber + Sugars + Protein, data = cereal_data_new)
Alternatively, the formula “Calories ~ Serving + Fat + Sodium + Carbs + Fiber + Sugars + Protein” can be replaced by “Calories ~ .”, where “.” indicates that all other variables in the data frame “cereal_data_new” are used as explanatory variables. That is, the code looks like
model <- lm(Calories ~ ., data = cereal_data_new)
We print the summary of the results:
summary(model)
##
## Call:
## lm(formula = Calories ~ Serving + Fat + Sodium + Carbs + Fiber +
## Sugars + Protein, data = cereal_data_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.7792 -0.8488 0.0588 1.3287 4.9312
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.821819 4.903981 0.779 0.4441
## Serving 2.014173 3.209035 0.628 0.5367
## Fat 9.978345 0.540534 18.460 7.06e-15 ***
## Sodium -0.017209 0.006887 -2.499 0.0204 *
## Carbs 3.787857 0.123745 30.610 < 2e-16 ***
## Fiber -2.983346 0.334064 -8.930 9.07e-09 ***
## Sugars 0.204899 0.138810 1.476 0.1541
## Protein 3.990694 0.398598 10.012 1.18e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.441 on 22 degrees of freedom
## Multiple R-squared: 0.9963, Adjusted R-squared: 0.9951
## F-statistic: 839.9 on 7 and 22 DF, p-value: < 2.2e-16
Let’s interpret the coefficients of the model.
The intercept \(\beta_0\) is estimated to be 3.82, which represents the estimated mean calories when the content of each explanatory variable is 0. Usually, the interpretation of such intercept does not make sense. The coefficient corresponding to Sugars is estimated to be 0.2, which represents the estimated change in mean calories for a one-unit increase in sugar, holding all other explanatory variables fixed. Other coefficients are interpreted similarly.
Now, let’s assess the significance of each of our predictor variables. To take an example, we look at the p-value for the test of \(H_0: \beta_6=0\) versus \(H_a: \beta_6\ne 0\).
The P-value is 0.154, which is relatively large compared to the commonly used significance level such as 0.05, indicating that our predictor variable - sugar content - is not significant as a predictor for calories.
To take another example, we look at the p-value for the test of \(H_0: \beta_5=0\) versus \(H_a: \beta_5\ne 0\).
The P-value is 0, which is basically zero, indicating that our predictor variable - carbohydrate - is extremely significant as a predictor for calories.
To ensure the validity of our model, let’s perform diagnostic checks.
# Diagnostic plots: Residuals vs. Fitted: Check for linearity and constant variance.
plot(model, which = 1)
Residuals vs. Fitted: This plot helps us check for linearity and constant variance. If the residuals exhibit a random scatter around zero, it suggests linearity. If the residual vs. fitted values plot exhibits a tunnel shape, it indicates a violation of the assumption of constant variance, also known as homoscedasticity. Homoscedasticity implies that the variance of the residuals should be roughly constant across all levels of the fitted values. A tunnel-shaped pattern in the residual plot suggests that the spread of residuals changes systematically as the predicted values increase or decrease.
The plot indicates a minor concern about the assumption of constant variance. This could be attributed to the presence of three outliers identified in the plot. The linearity is ok, since the residuals exhibit a random scatter around zero.
# Diagnostic plots: Normal Q-Q Plot- Check for normality of residuals.
plot(model, which = 2)
A Normal Q-Q (Quantile-Quantile) plot is a graphical tool used to assess the normality of residuals in a statistical model. It compares the quantiles of the observed residuals against the quantiles of a theoretical normal distribution. If the points on the plot fall approximately along a straight line, it suggests that the residuals are normally distributed.
The plot here shows there is slight concern regarding the normality assumption on the residuals. This could be again attributed to the presence of three outliers identified in the plot
# Diagnostic plots: Scale-Location Plot- Check for constant variance.
plot(model, which = 3)
Scale-Location Plot: Checks for constant variance of residuals. Ideally, points should be evenly distributed around the horizontal line.
The plot here shows a violation of constant variance in residuals.
# Diagnostic plots: Cook's Distance- Identify influential data points.
plot(model, which = 4)
Cook’s Distance is a statistical measure used to identify influential data points in a regression analysis. It measures the impact of deleting a particular observation on the overall fit of the model. Large values of Cook’s Distance indicate that removing a specific data point significantly changes the regression coefficients.
The plot shows that observations 4, 11 and 28 are the top 3 influential points, with observation 28 the most influential.
Now, let’s evaluate how well our model explains the variance in calories.
Recall in simple linear regression, we used \(R^\) (Coefficient of Determination) to assess the effectiveness of a model. It is a statistical measure that represents the proportion of the variability in the response variable that is explained by the model. It quantifies the goodness of fit of the model. R-squared can be defined similarly in multiple linear regression. R-squared values range from 0 to 1. A higher R-squared value indicates a better fit, meaning that a larger proportion of the variance in the dependent variable is accounted for by the independent variable(s) in the model. Remember, while R-squared provides information about the goodness of fit, it doesn’t necessarily indicate the causation between variables or the appropriateness of the model for making predictions. It should be used in conjunction with other model evaluation techniques.
The \(R^2\) here is 99.63% indicating that 99.63% of the total variation is explained by the model.
Unlike R-squared, which increases with the addition of more variables (even if they are not significant), adjusted R-squared penalizes the addition of unnecessary predictors.
The two are related as follows:
\[\text{Adjusted} ~R^² = 1 - (1 - R²)\cdot \frac{n - 1}{n - k - 1}\] where \(n\) is the number of observations and \(k+1\) is the number of coefficients including the intercept.
A model with high R-squared but low adjusted R-squared indicates that the model is overfitting the data.
A prediction interval for a new observation in the context of regression analysis provides a range within which a future response value falls with a certain desired probability. This interval takes into account both the uncertainty in estimating the population regression line and the variability of individual observations around that line.
Let’s make predictions based on our model in the previous section for new data points.
# New data with sugar values
new_data <- data.frame(Serving = 0.75,
Fat = 2.0,
Sodium = 162,
Carbs = 28,
Fiber = 0.6,
Sugars = 17.2,
Protein = 1.5
)
# Predict calories for new data
predictions <- predict(model, newdata = new_data, interval = "prediction", level = 0.95)
# Display predictions
predictions
## fit lwr upr
## 1 136.2816 130.6127 141.9504
Explanation of code:
The mean of the response for all the cereals with a particular set of values on the seven predictors considered here can be of an interest. Since it is a parameter, we can calculate the confidence interval for it. The code is
# Confidence interval for the mean response for all the cereals with
# the particular set of values given above on the seven predictors
CI <- predict(model, newdata = new_data, interval = "confidence", level = 0.95)
The confidence interval is different from the prediction interval. In general, the prediction interval is always longer than the confidence interval, since there is more uncertainty when predicting an individual response.
A categorical variable takes values that represent different categories. The iris data (the data frame can be used directly in R) has a variable called Species, which indicates whether a plant is a particular species. The data are displayed below:
DT::datatable(iris) # I used the datatable() function from DT package to better print the data
Sometimes, we need to create numeric variables that jointly are equivalent to a categorical variable such as Species here. The number of these numeric variables equals the number of categories of the categorical variable and each of these numeric variables takes only two values either 0 or 1, depending on the category of the categorical variable. The following shows how each of such binary variables is created:
\[ I_1 = \begin{cases} 1 & \text{if setosa} \\ 0 & \text{if not} \end{cases} \]
\[ I_2 = \begin{cases} 1 & \text{if versicolor} \\ 0 & \text{if not} \end{cases} \]
\[ I_3 = \begin{cases} 1 & \text{if virginica} \\ 0 & \text{if not} \end{cases} \]
We now add the 3 binary variables to the original data:
augmented_iris <- cbind(iris,
I1 = as.numeric(iris$Species=="setosa"),
I2 = as.numeric(iris$Species=="versicolor"),
I3 = as.numeric(iris$Species=="virginica")
)
These 3 binary variables are called the indicator variables.
When a categorical variable with \(k\) categories is involved in a multiple linear regression as an explanatory variable, a common practice is to use only \(k-1\) of the indicator variables and there is a theory for such practice. The indicator variable that is not used is subject to the disposal of the person who does the analysis.
Next, we demonstrate how indicator variables can be used in multiple linear regression. We will use the Sepal.Width from the “iris” data as the response variable and all other variables (Sepal.Length, Petal.Length, Petal.Width, and Species) as the explanatory variables. For the Species variable, we use the two indicator variables I2 (for versicolor) and I3 (for virginica) instead (we then say that the setosa species is used as a baseline category). The model is
\[Sepal.Width = \beta_0+\beta_1\cdot Sepal.Length + \beta_2\cdot Petal.Length + \beta_3\cdot Petal.Width +\beta_4\cdot I2 +\beta_5\cdot I3+\epsilon \] The following is the R code that does the model fitting:
fit <- lm(Sepal.Width ~ Sepal.Length + Petal.Length + Petal.Width + I2 + I3,
data = augmented_iris)
summary(fit)
##
## Call:
## lm(formula = Sepal.Width ~ Sepal.Length + Petal.Length + Petal.Width +
## I2 + I3, data = augmented_iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.00102 -0.14786 0.00441 0.18544 0.69719
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.65716 0.25595 6.475 1.40e-09 ***
## Sepal.Length 0.37777 0.06557 5.761 4.87e-08 ***
## Petal.Length -0.18757 0.08349 -2.246 0.0262 *
## Petal.Width 0.62571 0.12338 5.072 1.20e-06 ***
## I2 -1.16029 0.19329 -6.003 1.50e-08 ***
## I3 -1.39825 0.27715 -5.045 1.34e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2678 on 144 degrees of freedom
## Multiple R-squared: 0.6352, Adjusted R-squared: 0.6225
## F-statistic: 50.14 on 5 and 144 DF, p-value: < 2.2e-16
The R output shows that all explanatory variables are significant at the 5% significance level.
The final regression equation is
\[Sepal.Width = 1.6572+0.3778\cdot Sepal.Length -0.1876\cdot Petal.Length + 0.6257\cdot Petal.Width - 1.1603\cdot I2 - 1.3983\cdot I3 \] What does the two coefficients \(-1.1603\) and \(-1.3983\) suggest? The number \(-1.1603\) means that versicolor flowers with the same sepal length, petal length and petal width is estimated to have a mean sepal width about \(1.1603\) centimeters shorter than that of the setosa flowers. While the number \(-1.3983\) means that virginica flowers with the same sepal length, petal length and petal width is estimated to have a mean sepal width about \(1.3983\) centimeter shorter than that of the setosa flowers.
The regression equation for the setosa species can be reduced to
\[Sepal.Width = 1.6572+0.3778\cdot Sepal.Length -0.1876\cdot Petal.Length + 0.6257\cdot Petal.Width \] For the versicolor species, since I2 takes the value “1” and I3 takes the value “0”, the regression equation reduces to
\[Sepal.Width = 0.2589+0.3778\cdot Sepal.Length -0.1876\cdot Petal.Length + 0.6257\cdot Petal.Width -1.16029\cdot 1 \] or, after combining like terms,
\[Sepal.Width = 0.4969+0.3778\cdot Sepal.Length -0.1876\cdot Petal.Length + 0.6257\cdot Petal.Width \]
For the verginica species, since I2 takes the value “0” and I3 takes the value “1”, the regression equation reduces to
\[Sepal.Width = 1.6572+0.3778\cdot Sepal.Length -0.1876\cdot Petal.Length + 0.6257\cdot Petal.Width -1.39825\cdot 1 \] or, after combining like terms,
\[Sepal.Width = 0.2589+0.3778\cdot Sepal.Length -0.1876\cdot Petal.Length + 0.6257\cdot Petal.Width \] So far so good or it is too complicated? The good news is there is no need to create indicate variables on our own. R software will do it for us automatically!
Here is how R does it:
# Fit a multiple regression model
fit <- lm(Sepal.Width ~ Sepal.Length + Petal.Length + Petal.Width + Species,
data = iris)
# Display the summary of the model
summary(fit)
##
## Call:
## lm(formula = Sepal.Width ~ Sepal.Length + Petal.Length + Petal.Width +
## Species, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.00102 -0.14786 0.00441 0.18544 0.69719
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.65716 0.25595 6.475 1.40e-09 ***
## Sepal.Length 0.37777 0.06557 5.761 4.87e-08 ***
## Petal.Length -0.18757 0.08349 -2.246 0.0262 *
## Petal.Width 0.62571 0.12338 5.072 1.20e-06 ***
## Speciesversicolor -1.16029 0.19329 -6.003 1.50e-08 ***
## Speciesvirginica -1.39825 0.27715 -5.045 1.34e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2678 on 144 degrees of freedom
## Multiple R-squared: 0.6352, Adjusted R-squared: 0.6225
## F-statistic: 50.14 on 5 and 144 DF, p-value: < 2.2e-16
R has created two indicator variables automatically and they are Speciesversicolor (the same as I2) and Speciesvirginica (the same as I3). The results based on the above code is the same as we did using I2 and I3.
Some discussions:
If we use sepal length along with species (use I2 and I3) as the explanatory variables, what would be the model and the regression equation you get?
What are the separate regression equations for versicolor flowers and virginica flowers?
If the equations in part (b) are plotted, what special pattern would you observe? What assumptions have you made implicitly? Is such an assumption reasonable? (hint: plot sepal width vs sepal length for the 3 species on the same graph, see below.). If such an assumption is not reasonable, could we gain a significant improvement in the fit by allowing for a different effect for each species?
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Scatter Plot with Regression Lines for Each Species",
x = "Sepal Length",
y = "Sepal Width")
Including interactions in a linear regression model allows for the possibility that the effect of one predictor variable on the response may depend on the level of another predictor variable.
Example 1.
# Fit a linear regression model with interactions
model_with_interactions <- lm(Calories ~ Serving + Fat + Sodium + Carbs + Fiber + Sugars * Protein, data = cereal_data_new)
# Print the summary of the model
summary(model_with_interactions)
##
## Call:
## lm(formula = Calories ~ Serving + Fat + Sodium + Carbs + Fiber +
## Sugars * Protein, data = cereal_data_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.4702 -0.6082 0.0204 1.2538 5.1062
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.267619 10.573862 -0.309 0.760
## Serving 4.730155 4.828139 0.980 0.338
## Fat 10.095856 0.567364 17.794 3.83e-14 ***
## Sodium -0.013915 0.008198 -1.697 0.104
## Carbs 3.874877 0.169602 22.847 2.58e-16 ***
## Fiber -2.886639 0.360603 -8.005 8.15e-08 ***
## Sugars 0.354501 0.241898 1.465 0.158
## Protein 4.554888 0.845463 5.387 2.42e-05 ***
## Sugars:Protein -0.082924 0.109280 -0.759 0.456
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.464 on 21 degrees of freedom
## Multiple R-squared: 0.9964, Adjusted R-squared: 0.995
## F-statistic: 720.8 on 8 and 21 DF, p-value: < 2.2e-16
Explanation of code:
In this example, Sugars * Protein includes both the main effects of Sugars and Protein and their interaction term. The * operator implicitly includes the main effects and the interaction term. If you prefer to include only the interaction term without the main effects, you can use the : operator:
Explanation of results:
Interpreting interactions involves considering how the effect of one variable on the response varies depending on the level of another variable. The summary output will provide information about coefficients, standard errors, t-values, and p-values associated with each term in the model.
Here, the interaction effect of the term Sugars:Protein is not statistically significant, so including this term does not improve the model fitting.
Multicollinearity is a phenomenon in regression analysis where two or more predictor (i.e. explanatory) variables in a multiple regression model are highly correlated, making it difficult to precisely estimate the individual contribution of each variable to the dependent variable. It is a form of collinearity, which generally refers to a high degree of correlation among the predictor variables.
In the context of multicollinearity:
Perfect Multicollinearity: This occurs when one predictor variable is an exact linear function of another. For example, if you have two variables \(X_1\) and \(X_2\), and \(X_1\) is equal to a constant times \(X_2\) (or vice versa), then you have perfect multicollinearity.
High Multicollinearity: This occurs when there is a strong correlation, but not a perfect linear relationship, between two or more predictor variables. High multicollinearity can still lead to issues in regression analysis.
Multicollinearity can have several implications:
Unstable Coefficient Estimates: The coefficients of the correlated variables become unstable, and small changes in the data can lead to large changes in the estimated coefficients.
Increased Standard Errors: The standard errors of the coefficient estimates tend to be large, making it difficult to identify statistically significant predictors.
Reduced Statistical Power: The ability to detect the true effects of individual predictors may be reduced.
Interpretation Challenges: It becomes challenging to interpret the individual contribution of each predictor to the dependent variable.
To detect multicollinearity, researchers often examine variance inflation factors (VIF). High VIF values (typically above 5) suggest a potential multicollinearity issue. The R package “car” provides the function VIF() for calculating VIF for each predictor.
Here is an example using the cereals data. We remove the first two columns that are not numeric.
model <- lm(Calories ~ ., data = cereal_data[, -c(1:2)])
# The lm() function is used to fit a linear regression model.
# "Calories ~ ." means that Calories is the dependent variable, and all other variables in the 'cereal_data' dataset except the first two columns are used as predictors.
# Load the 'car' package
library(car)
# The 'car' package is loaded to access the VIF function.
# Calculate Variance Inflation Factor (VIF) for each predictor in the model
vif(model)
## Serving Fat Sodium Carbs Fiber Sugars Protein
## 1.508242 1.710968 1.383636 4.084036 1.919783 2.666816 1.785157
# The vif() function calculates the Variance Inflation Factor for each predictor variable in the linear regression model.
# VIF measures how much the variance of an estimated regression coefficient increases if the predictor is correlated with other predictors.
Explanation of code:
lm() function: This is used to fit a linear regression model.
Calories ~ .: Here, Calories is the dependent variable (the one you’re trying to predict), and the . means “use all other variables in the dataset as predictors.” This will include all columns in cereal_data except for the first two, which are excluded by the next part.
data = cereal_data[, -c(1:2)]: This specifies the dataset to use for the model. The -c(1:2) notation means that the first two columns of cereal_data are being excluded from the analysis. This is useful if those columns are identifiers or not relevant as predictors.
This command loads the car package, which provides various functions for regression analysis and diagnostics, including the Variance Inflation Factor (VIF).
vif() function: This function calculates the VIF for each predictor in the linear model.
Purpose of VIF: VIF measures how much the variance of an estimated regression coefficient increases when your predictors are correlated. A high VIF indicates that a predictor is highly correlated with other predictors, which can lead to multicollinearity issues. Typically, a VIF value above 5 or 10 suggests problematic multicollinearity. Multicollinearity refers to a situation in multiple regression where two or more predictor variables are highly correlated. This can lead to several consequences:
Dealing with multicollinearity may involve:
Removing Correlated Predictors: If two or more predictors are highly correlated, removing one of them might alleviate multicollinearity.
Combining Variables: Creating composite variables or using dimensionality reduction techniques (such as the Principal Component Analysis) can help handle multicollinearity.
It’s important to address multicollinearity to ensure the reliability and interpretability of regression results.
We might want to test if a subset of predictors (or terms) can be dropped from a multiple linear regression model. For example, for a 10-predictor model, we might want to test
\[H_0: \beta_4 = 0~ \text{and}~ \beta_7 = 0~ \text{vs.}~ H_a: \beta_4 \ne 0~ \text{or}~ \beta_7 \ne 0\] The null hypothesis says that the coefficients of the 4th and 7th terms in the model are both 0, while the alternative hypothesis says that at least one of them is non-zero. There is a formal procedure for doing this and it is called the partial \(F\)-test or the nested \(F\)-test. The procedure involves fitting two models: one is the model you are fitting (called the larger model for ease of reference) and the other one is a restricted model (called the smaller model) which is a special case of the model being fitted.
Let’s use the iris data (built-in in R) to show the procedure. For simplicity, we disregard petal length and petal width, and fit a model to study how sepal width is related to both sepal length and species. We are particularly interested in whether the effect of sepal length on sepal width varies between species. That is, we need to test whether there is any interaction between sepal length and species. To do this, we need to fit two models, one nested in the other:
The larger model is
\[Sepal.Width = \beta_0+\beta_1\cdot Sepal.Length + \beta_2\cdot versicolor+\beta_3\cdot virginica+\beta_4\cdot Sepal.Length\cdot versicolor+\beta_5\cdot Sepal.Length\cdot virginica+\epsilon\] where \(versicolor\) and \(virginica\) represent indicator or dummy variables, and the product terms are interaction terms.
The smaller model is
\[Sepal.Width = \beta_0+\beta_1\cdot Sepal.Length + \beta_2\cdot versicolor+\beta_3\cdot virginica+\epsilon\] We need to test:
\[H_0: \beta_4 = 0~ \text{and}~ \beta_5 = 0~~ \text{vs.}~~ H_a: \beta_4 \ne 0~ \text{or}~ \beta_5 \ne 0\] The fitting of each of the two models yields
# Fit the larger model
large_model <- lm(Sepal.Width ~ Sepal.Length*Species, data = iris)
# Print the detailed results
summary(large_model)
##
## Call:
## lm(formula = Sepal.Width ~ Sepal.Length * Species, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.72394 -0.16327 -0.00289 0.16457 0.60954
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.5694 0.5539 -1.028 0.305622
## Sepal.Length 0.7985 0.1104 7.235 2.55e-11 ***
## Speciesversicolor 1.4416 0.7130 2.022 0.045056 *
## Speciesvirginica 2.0157 0.6861 2.938 0.003848 **
## Sepal.Length:Speciesversicolor -0.4788 0.1337 -3.582 0.000465 ***
## Sepal.Length:Speciesvirginica -0.5666 0.1262 -4.490 1.45e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2723 on 144 degrees of freedom
## Multiple R-squared: 0.6227, Adjusted R-squared: 0.6096
## F-statistic: 47.53 on 5 and 144 DF, p-value: < 2.2e-16
# Print the analysis of variance result
anova(large_model)
## Analysis of Variance Table
##
## Response: Sepal.Width
## Df Sum Sq Mean Sq F value Pr(>F)
## Sepal.Length 1 0.3913 0.3913 5.2757 0.02307 *
## Species 2 15.7225 7.8613 105.9948 < 2.2e-16 ***
## Sepal.Length:Species 2 1.5132 0.7566 10.2011 7.19e-05 ***
## Residuals 144 10.6800 0.0742
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Fit the smaller model
small_model <- lm(Sepal.Width ~ Sepal.Length + Species, data = iris)
# Print the detailed results
summary(small_model)
##
## Call:
## lm(formula = Sepal.Width ~ Sepal.Length + Species, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.95096 -0.16522 0.00171 0.18416 0.72918
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.67650 0.23536 7.123 4.46e-11 ***
## Sepal.Length 0.34988 0.04630 7.557 4.19e-12 ***
## Speciesversicolor -0.98339 0.07207 -13.644 < 2e-16 ***
## Speciesvirginica -1.00751 0.09331 -10.798 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.289 on 146 degrees of freedom
## Multiple R-squared: 0.5693, Adjusted R-squared: 0.5604
## F-statistic: 64.32 on 3 and 146 DF, p-value: < 2.2e-16
# Print the analysis of variance result
anova(small_model)
## Analysis of Variance Table
##
## Response: Sepal.Width
## Df Sum Sq Mean Sq F value Pr(>F)
## Sepal.Length 1 0.3913 0.3913 4.6851 0.03205 *
## Species 2 15.7225 7.8613 94.1304 < 2e-16 ***
## Residuals 146 12.1931 0.0835
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The output of the larger_model includes the main effect of Sepal.Length, the main effects for each species, and two interaction effects between Sepal.Length and the species. Two ANOVA tables, one for each model, are also produced, showing the contributions of individual terms to the total sum of squares of the (adjusted) response values.
Based on the ANOVA table results, the larger model has an error sum of squares of 10.6800 and a mean squared error (MSE) of 0.0742 (which is obtained by 10.68/144). The small model has an error sum of squares of 12.1931 and a mean squared error (MSE) of 0.0835 (which is obtained by 12.1931/146).
To test the above hypotheses, we calculate the \(F\) test statistic as
\[F = \frac{(12.1931-10.6800)/2}{0.0742}=10.20\] where the number 2 in the numerator equals the number of coefficients we are testing. We know the larger model always fits better, and it indicates dropping the two terms (Speciesversicolor and Speciesvirginica) from the model increases the error sum of squares by (12.1931-10.6800) or 1.5131 (this is called the extra sum of squares). Dividing the extra sum of squares by the number of terms dropped gives the average increase in the error sum of squares. Such average increase is compared with the mean squared error of the larger model and the ratio is the test statistic.
It can be shown that, under the null hypothesis, the \(F\) test statistic has an \(F\) distribution, The \(F\)-distribution (also known as the Fisher-Snedecor distribution) is a probability distribution that arises frequently in statistical hypothesis testing. It is named after Ronald A. Fisher and George W. Snedecor, who independently developed it.
In the context of the \(F\)-distribution, there are two sets of degrees of freedom associated with it: the degrees of freedom for the numerator and the degrees of freedom for the denominator. The following graph shows examples of a few \(F\) distributions:
To find the \(p\)-value of a test based on the \(F\) distribution, we always calculate the right tail area under the \(F\) distribution, since the larger the \(F\) test statistic, the more evidence that the large model and the small model differ. The R code for calculating the \(p\)-value is given by
1 - pf(F, df1, df2) ## f is test statistic and df1 and df2 are degrees of freedom
For our example, \(F=0.0754, df1 = 2, df2 = 144\) and the \(p\)-value is 7.19e-05. Since the \(p\)-value is basically 0, we do reject the null hypothesis and conclude that there is a significant difference between the large model and the small model.
To avoid the above lengthy procedure, we can use the “car” package to achieve the same result, as shown in the following code:
library(car)
# Fit the multiple regression model
model <- lm(Sepal.Width ~ Sepal.Length + Species + Sepal.Length:Species, data = iris)
# Specify the hypothesis to test
hypothesis <- c("Sepal.Length:Speciesversicolor = 0", "Sepal.Length:Speciesvirginica = 0")
# Test the hypothesis
hypothesis_test <- linearHypothesis(model, hypothesis)
# Display the results
hypothesis_test
##
## Linear hypothesis test:
## Sepal.Length:Speciesversicolor = 0
## Sepal.Length:Speciesvirginica = 0
##
## Model 1: restricted model
## Model 2: Sepal.Width ~ Sepal.Length + Species + Sepal.Length:Species
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 146 12.193
## 2 144 10.680 2 1.5132 10.201 7.19e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value is basically 0, the interaction term is significant. It means that the effect of sepal length on sepal width is significantly different for different species.
A remark:
When fitting a linear regression model, the bottom line of the R output includes an F test statistic, along with two sets of degrees of freedom and a \(p\)-value. This is the \(F\) test for \[H_0: \text{All} ~\beta \text{'s are 0} ~ ~\text{vs.}~~ H_a: \text{At least one} ~\beta \text{ is not 0.}\] It is a special nested \(F\) test with the small model being the one without any explanatory variable (such a model is called the intercept-only model).
The added variable plot (also known as partial regression plot), introduced by Mosteller and Tukey (1977), provides a visual representation of the regression coefficient of a new variable being considered for inclusion in a model. Here’s a step-by-step guide to construct an added variable plot:
Regress Y on all variables other than X: This step involves fitting a regression model where the response variable Y is regressed on all predictor variables except for X. The residuals obtained from this regression represent the part of Y not explained by the other variables (Y residuals).
Regress X on all other variables: Similarly, regress the variable X on all other variables included in the model. The residuals obtained from this regression represent the part of X not explained by the other variables (X residuals).
Construct a scatter plot: Create a scatter plot with Y residuals on the y-axis and X residuals on the x-axis.
The interpretation of Y and X residuals is as follows:
The slope of the line fitted to the points in the added variable plot must be equal to the regression coefficient when Y is regressed on all variables, including X.
The interpretation of the plot:
Here’s how you can create an added variable plot in R using the car package:
# Install and load the 'car' package (if not already installed)
# install.packages("car")
library(car)
# Fit a multiple regression model
model <- lm(Sepal.Width ~ Sepal.Length + Petal.Length + Petal.Width, data = iris)
# Create an Added Variable Plot for each of the predictors
avPlots(model)
# Create an Added Variable Plot for the variable 'Sepal.Length'
avPlots(model, "Sepal.Length")
An example with simulated data:
# Set the seed for reproducibility
set.seed(321)
# Generate random data for x1 and x2
x1 <- rnorm(180, 100, 15)
x2 <- rnorm(180, 69, 3.6)
# Generate y using a linear equation with noise
y <- 5 - 2*x1 + rnorm(180, 0, 1)
# Fit a linear model with terms x1 and x2
model <- lm(y ~ x1 + x2)
# Create added variable plots to visualize the effect of each predictor on the response
avPlots(model)
The graph shows that adding \(x_2\) term in the model while \(x_1\) is already in the model won’t improve the model.
Best Subsets of Predictors is an approach in regression analysis where multiple linear regression models are fitted using different combinations of predictor variables, and the best model is selected based on a specified criterion. The goal is to identify the subset of predictors that results in the most parsimonious and effective model.
Mallow’s \(C_p\) is one of many statistical criteria used in the process of choosing the best subset of predictor variables for a regression model. The goal is to strike a balance between model simplicity and predictive accuracy.
Mallow’s \(C_p\) is defined as:
\[C_p=\frac{SSE_m}{MSE_k}+2(m+1)-n\]
where:
A model with \(C_p\) less than \(m+1\) or not much greater than \(m+1\) indicates a better balance between goodness of fit (as measured by \(SSE\)) and model complexity, and thus is a preferred model.
Here is R code for identifying the best model based on Mallow’s \(C_p\) criterion using simulated data.:
# Install and load the leaps package
library(leaps)
# Set a seed for reproducibility of random numbers to be generated below
set.seed(123)
# Create a simulated dataset with 100 observations and 6 predictors
simulated_data <- data.frame(
Predictor1 = rnorm(100), # Random normal values for Predictor1
Predictor2 = rnorm(100), # Random normal values for Predictor2
Predictor3 = rnorm(100), # Random normal values for Predictor3
Predictor4 = rnorm(100), # Random normal values for Predictor4
Predictor5 = rnorm(100), # Random normal values for Predictor5
Predictor6 = rnorm(100) # Random normal values for Predictor6
)
# The true model
# Response = 20 + 2 * Predictor2 - 4 * Predictor4 + 5 * Predictor5 - 6 * Predictor6 + random noise
# Generate response data based on the model
simulated_data$Response = 20 + 2 * simulated_data$Predictor2 - 4 * simulated_data$Predictor4 +
5 * simulated_data$Predictor5 - 6 * simulated_data$Predictor6 +
rnorm(100) # Adding random noise to the response
# Use the leaps package to find the best subset of predictors for the response variable
best_subsets <- regsubsets(Response ~ ., data = simulated_data, nvmax = 4) # nvmax sets the maximum number of predictors to consider
# Get the summary of the regression subsets, including Mallow's Cp values for each subset
res = summary(best_subsets)
cp_values <- res$cp # Extract Mallow's Cp values from the summary
# Print the best subset of predictors and its corresponding Mallow's Cp value
cat("The best subset is:\n ",
paste(" ", best_subsets$xnames[-1][tail(summary(best_subsets)$which,1)[-1]], collapse = ", "),
"\nwith Mallow's Cp =", min(cp_values), ".", sep = "")
## The best subset is:
## Predictor2, Predictor4, Predictor5, Predictor6
## with Mallow's Cp =3.112201.
The simulation shows that the Mallow’s \(C_p\) does finds the best model.
Practice: Using the FirstYearGPA data from https://www.stat2.org/datapage.html to fit and find the best model with GPA being the response variable.
In addition to Mallow’s \(C_p\) and \(R^2\) methods, Forward, backward, and stepwise selections are other variable selection methods commonly used in regression analysis. These methods are used to choose a subset of predictor variables to include in a model based on certain criteria, such as Akaike’s information criterion (AIC) or Bayesian information criterion (BIC). Here’s an overview of each method:
Forward Selection:
Process: Start with an empty model and add variables one by one, selecting the variable that most improves the model fit at each step.
Criteria: The criteria for improvement can vary (e.g., minimizing AIC or BIC).
The R code looks like
lm_model <- lm(y ~ 1, data = your_data)
forward_method <- step(lm_model, direction = "forward")
Backward Elimination:
Process: Start with a model containing all predictor variables and remove one variable at each step, eliminating the variable that contributes the least to the model.
Criteria: Similar to forward selection, the criteria for removal can vary.
lm_model_full <- lm(y ~ ., data = your_data)
backward_method <- step(lm_model_full, direction = "backward")
Stepwise Selection:
Process: Combines elements of both forward selection and backward elimination. It starts with no variables in the model or all variables in the model, adds variables one by one, and removes variables when they no longer contribute significantly.
Criteria: Typically uses a combination of adding and removing criteria.
lm_model_full <- lm(y ~ ., data = your_data)
stepwise_method <- step(lm_model_full, direction = "both")
We use the simulated data given before to demonstrate the Forward, backward, and stepwise selection methods:
# Fit an initial model with only the intercept
null_model <- lm(Response ~ 1, data = simulated_data)
# Fit a full model with all predictors
full_model <- lm(Response ~ ., data = simulated_data)
# Perform forward selection
forward_model <- step(null_model,
direction = "forward",
scope = list(lower = null_model, upper = full_model),
trace = 1)
## Start: AIC=409.03
## Response ~ 1
##
## Df Sum of Sq RSS AIC
## + Predictor6 1 1649.44 4208.1 377.96
## + Predictor4 1 1373.53 4484.0 384.31
## + Predictor5 1 1285.68 4571.9 386.25
## <none> 5857.5 409.03
## + Predictor1 1 33.66 5823.9 410.46
## + Predictor2 1 28.99 5828.5 410.54
## + Predictor3 1 0.08 5857.5 411.03
##
## Step: AIC=377.96
## Response ~ Predictor6
##
## Df Sum of Sq RSS AIC
## + Predictor5 1 2045.18 2162.9 313.40
## + Predictor4 1 1671.05 2537.0 329.36
## + Predictor2 1 102.02 4106.1 377.51
## <none> 4208.1 377.96
## + Predictor1 1 65.76 4142.3 378.38
## + Predictor3 1 1.06 4207.0 379.93
##
## Step: AIC=313.4
## Response ~ Predictor6 + Predictor5
##
## Df Sum of Sq RSS AIC
## + Predictor4 1 1668.46 494.45 167.83
## + Predictor2 1 306.27 1856.64 300.13
## <none> 2162.91 313.40
## + Predictor3 1 5.53 2157.38 315.15
## + Predictor1 1 0.08 2162.84 315.40
##
## Step: AIC=167.83
## Response ~ Predictor6 + Predictor5 + Predictor4
##
## Df Sum of Sq RSS AIC
## + Predictor2 1 391.63 102.82 12.784
## <none> 494.45 167.828
## + Predictor1 1 3.19 491.26 169.181
## + Predictor3 1 0.33 494.12 169.760
##
## Step: AIC=12.78
## Response ~ Predictor6 + Predictor5 + Predictor4 + Predictor2
##
## Df Sum of Sq RSS AIC
## <none> 102.82 12.784
## + Predictor1 1 0.123901 102.70 14.664
## + Predictor3 1 0.002185 102.82 14.782
cat("Forward Selection Model:\n")
## Forward Selection Model:
summary(forward_model)
##
## Call:
## lm(formula = Response ~ Predictor6 + Predictor5 + Predictor4 +
## Predictor2, data = simulated_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.44004 -0.70499 0.00054 0.68642 2.43949
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.8520 0.1054 188.43 <2e-16 ***
## Predictor6 -6.0951 0.1156 -52.73 <2e-16 ***
## Predictor5 5.0232 0.1094 45.91 <2e-16 ***
## Predictor4 -4.0745 0.1012 -40.25 <2e-16 ***
## Predictor2 2.1002 0.1104 19.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.04 on 95 degrees of freedom
## Multiple R-squared: 0.9824, Adjusted R-squared: 0.9817
## F-statistic: 1329 on 4 and 95 DF, p-value: < 2.2e-16
# Backward Elimination
backward_model <- step(full_model,
direction = "backward",
trace = 1)
## Start: AIC=16.66
## Response ~ Predictor1 + Predictor2 + Predictor3 + Predictor4 +
## Predictor5 + Predictor6
##
## Df Sum of Sq RSS AIC
## - Predictor3 1 0.00 102.70 14.66
## - Predictor1 1 0.12 102.82 14.78
## <none> 102.70 16.66
## - Predictor2 1 388.45 491.15 171.16
## - Predictor4 1 1746.73 1849.43 303.75
## - Predictor5 1 2182.87 2285.56 324.92
## - Predictor6 1 3008.70 3111.40 355.77
##
## Step: AIC=14.66
## Response ~ Predictor1 + Predictor2 + Predictor4 + Predictor5 +
## Predictor6
##
## Df Sum of Sq RSS AIC
## - Predictor1 1 0.12 102.82 12.78
## <none> 102.70 14.66
## - Predictor2 1 388.56 491.26 169.18
## - Predictor4 1 1751.40 1854.09 302.00
## - Predictor5 1 2188.64 2291.34 323.17
## - Predictor6 1 3009.32 3112.02 353.79
##
## Step: AIC=12.78
## Response ~ Predictor2 + Predictor4 + Predictor5 + Predictor6
##
## Df Sum of Sq RSS AIC
## <none> 102.82 12.78
## - Predictor2 1 391.63 494.45 167.83
## - Predictor4 1 1753.82 1856.64 300.14
## - Predictor5 1 2281.56 2384.39 325.15
## - Predictor6 1 3009.25 3112.08 351.79
cat("\nBackward Elimination Model:\n")
##
## Backward Elimination Model:
summary(backward_model)
##
## Call:
## lm(formula = Response ~ Predictor2 + Predictor4 + Predictor5 +
## Predictor6, data = simulated_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.44004 -0.70499 0.00054 0.68642 2.43949
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.8520 0.1054 188.43 <2e-16 ***
## Predictor2 2.1002 0.1104 19.02 <2e-16 ***
## Predictor4 -4.0745 0.1012 -40.25 <2e-16 ***
## Predictor5 5.0232 0.1094 45.91 <2e-16 ***
## Predictor6 -6.0951 0.1156 -52.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.04 on 95 degrees of freedom
## Multiple R-squared: 0.9824, Adjusted R-squared: 0.9817
## F-statistic: 1329 on 4 and 95 DF, p-value: < 2.2e-16
# Stepwise Selection starting from null model
stepwise_model1 <- step(null_model,
direction = "both",
trace = 1)
## Start: AIC=409.03
## Response ~ 1
cat("\nStepwise Selection Model (starting from null model):\n")
##
## Stepwise Selection Model (starting from null model):
summary(stepwise_model1)
##
## Call:
## lm(formula = Response ~ 1, data = simulated_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.1119 -5.4323 -0.2773 5.3825 18.9794
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.5632 0.7692 26.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.692 on 99 degrees of freedom
# Stepwise Selection starting from full model
stepwise_model2 <- step(full_model,
direction = "both",
trace = 1)
## Start: AIC=16.66
## Response ~ Predictor1 + Predictor2 + Predictor3 + Predictor4 +
## Predictor5 + Predictor6
##
## Df Sum of Sq RSS AIC
## - Predictor3 1 0.00 102.70 14.66
## - Predictor1 1 0.12 102.82 14.78
## <none> 102.70 16.66
## - Predictor2 1 388.45 491.15 171.16
## - Predictor4 1 1746.73 1849.43 303.75
## - Predictor5 1 2182.87 2285.56 324.92
## - Predictor6 1 3008.70 3111.40 355.77
##
## Step: AIC=14.66
## Response ~ Predictor1 + Predictor2 + Predictor4 + Predictor5 +
## Predictor6
##
## Df Sum of Sq RSS AIC
## - Predictor1 1 0.12 102.82 12.78
## <none> 102.70 14.66
## + Predictor3 1 0.00 102.70 16.66
## - Predictor2 1 388.56 491.26 169.18
## - Predictor4 1 1751.40 1854.09 302.00
## - Predictor5 1 2188.64 2291.34 323.17
## - Predictor6 1 3009.32 3112.02 353.79
##
## Step: AIC=12.78
## Response ~ Predictor2 + Predictor4 + Predictor5 + Predictor6
##
## Df Sum of Sq RSS AIC
## <none> 102.82 12.78
## + Predictor1 1 0.12 102.70 14.66
## + Predictor3 1 0.00 102.82 14.78
## - Predictor2 1 391.63 494.45 167.83
## - Predictor4 1 1753.82 1856.64 300.14
## - Predictor5 1 2281.56 2384.39 325.15
## - Predictor6 1 3009.25 3112.08 351.79
cat("\nStepwise Selection Model(starting from full model:\n")
##
## Stepwise Selection Model(starting from full model:
summary(stepwise_model2)
##
## Call:
## lm(formula = Response ~ Predictor2 + Predictor4 + Predictor5 +
## Predictor6, data = simulated_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.44004 -0.70499 0.00054 0.68642 2.43949
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.8520 0.1054 188.43 <2e-16 ***
## Predictor2 2.1002 0.1104 19.02 <2e-16 ***
## Predictor4 -4.0745 0.1012 -40.25 <2e-16 ***
## Predictor5 5.0232 0.1094 45.91 <2e-16 ***
## Predictor6 -6.0951 0.1156 -52.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.04 on 95 degrees of freedom
## Multiple R-squared: 0.9824, Adjusted R-squared: 0.9817
## F-statistic: 1329 on 4 and 95 DF, p-value: < 2.2e-16
All three methods end up with the correct model!
Many times, the model we build is for prediction in the future. A natural question would be: will our model perform well? That a model performs well does not guarantee it also perform well on new data. How can we assess the performance of the model for future prediction?
Cross-validation is a statistical technique used in machine learning and statistical modeling to assess the performance and generalizability of a predictive model. The main goal of cross-validation is to estimate how well a model will perform on an independent dataset by partitioning the available data into two subsets: one for training the model and another for testing its performance.
It is generally expected that the performance of the model is worse for prediction based on the holdout data than for prediction based on the training data. If this happens, the model might not have found the patterns or relationships that occur or hold in the more general population.
Here is an example using the caret package, which is commonly used for machine learning tasks. We will use the iris data to fit a multiple linear regression model with Sepal length as the response and sepal Width, petal length, petal width as predictors.
# Set a seed for reproducibility
set.seed(123)
# Split the data into a training set (80%) and a testing set (20%)
size_of_data = nrow(iris)
size_of_training_sample = size_of_data * 0.8
train_indices <- sample(1:size_of_data, size_of_training_sample)
train_data <- iris[train_indices, ]
test_data <- iris[-train_indices, ]
# Fit a model using the training data
model <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
data = train_data)
# Make predictions on the test data
predictions <- predict(model, newdata = test_data)
# Evaluate the model's performance on the test data (using mean squared error)
mse <- mean((test_data$Sepal.Length - predictions)^2)
cat("Mean Squared Error:", mse, "\n")
## Mean Squared Error: 0.07225017
Explanation of code:
The MSE calculated can be compared with the model MSE to see whether there is any overfitting issue. Overfitting occurs when the the former is much larger.
anova(model)
## Analysis of Variance Table
##
## Response: Sepal.Length
## Df Sum Sq Mean Sq F value Pr(>F)
## Sepal.Width 1 1.062 1.062 10.004 0.0019946 **
## Petal.Length 1 70.616 70.616 665.181 < 2.2e-16 ***
## Petal.Width 1 1.547 1.547 14.570 0.0002183 ***
## Residuals 116 12.315 0.106
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the ANOVA table output, the model MSE is 0.106 (Mean Square for Residuals), indicating no overfitting.
Another way to check for overfitting is to compare cross-validation \(R^2\) with the model \(R^2\). The cross-validation \(R^2\) is defined as the correlation between the predicted values and the actual values, squared.
# Calculate the correlation between the predicted values and the actual values in the test dataset
cross_validationcor_R_square <- cor(predictions, test_data$Sepal.Length)
# Extract the R-squared value from the model summary
model_R_Square <- summary(model)$r.squared
# Round both values to 4 decimal places
cross_validationcor_R_square <- round(cross_validationcor_R_square, 4)
model_R_Square <- round(model_R_Square, 4)
# Print both values together
print(paste("Cross-Validation Correlation R-squared:", cross_validationcor_R_square,
"and Model R-squared:", model_R_Square))
## [1] "Cross-Validation Correlation R-squared: 0.9375 and Model R-squared: 0.856"
Note: The model R-squared is also the square of the correlation between the fitted values and the actual values.
The results show that there is no overfitting issue.
Sometimes, cross-validation is employed instead of a single holdout sample. In k-fold cross-validation, the dataset is divided into k subsets, and the model is trained and tested k times, each time using a different subset as the holdout sample.
Here’s a brief overview of k-fold cross-validation:
The advantages of cross-validation include: - Provides a more robust estimate of a model’s performance by testing it on multiple subsets of the data. - Reduces the risk of overfitting or underfitting to a specific training-test split.
Common types of cross-validation include:
Cross-validation is an essential tool in machine learning model evaluation, helping practitioners make more informed decisions about model selection and generalization performance.
Another example: using k-fold cross-validation.
# Load package caret
library(caret)
# Set the seed for reproducibility
set.seed(123)
# Define the training control for 5-fold cross-validation
cross_val_control <- trainControl(method = "cv", number = 5)
# Train a linear regression model on Sepal.Length using all available predictors
model <- train(Sepal.Length ~ ., data = iris, method = "lm", trControl = cross_val_control)
# Print the cross-validated results, including metrics like RMSE
print(model)
## Linear Regression
##
## 150 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 120, 120, 120, 120, 120
## Resampling results:
##
## RMSE Rsquared MAE
## 0.3137574 0.8585577 0.2551525
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
# Extract the RMSE values from the cross-validated results
rmse <- model$results$RMSE
# Calculate and print the mean RMSE
cat("Cross-Validated RMSE:", mean(rmse), "\n")
## Cross-Validated RMSE: 0.3137574
In this example:
Explanation:
This code overall performs cross-validated training of a linear regression model on the Iris dataset, evaluates its performance, and reports the mean RMSE as a summary metric.
The k-fold cross-validation is a valuable tool in model evaluation, particularly when working with limited data or when a robust estimate of performance is desired. It is a widely accepted practice in machine learning research and application. Common choices for k include 5-fold and 10-fold cross-validation, but the appropriate value can vary based on the dataset size and specific considerations.
Identifying unusual points or outliers in a regression analysis is important for understanding the impact of influential observations on the model. Here are some common methods for identifying unusual points in regression:
Residual Plot: Create a residual plot by plotting the residuals (the differences between observed and predicted values) against the predictor variables. Unusual points may exhibit patterns or stand out from the overall distribution of residuals. The \(i\)-th residual is calculated as follows: \[r_i=\text{observed}-\text{predicted}=y_i - \hat{y}_i\]
Leverage: Leverage measures how much an observation’s predictor values differ from the mean of the predictor values. Observations with high leverage can have a strong impact on the regression model. Points with leverage greater than \(2(k+1)/n\) (where \(k\) is the number of predictors and \(n\) is the number of observations) may be considered influential.
Standardized Residuals: Standardized residuals are computed by taking the residual (the difference between the observed and predicted values) and dividing it by the estimated standard error of the residuals, adjusted for leverage. Observations with standardized residuals that fall outside a specified range (e.g., less than \(-2\) or greater than 2) may be flagged as unusual. The formula for the \(i\)-th standardized residual is given by:
\[\text{stdres}_i = \frac{r_i}{\hat{\sigma}_i\sqrt{1-h_i}}\]
\[\text{studres}_i = \frac{r_i}{\hat{\sigma}_{(i)}\sqrt{1-h_i}}\]
\[D_i = \frac{(\text{stdres}_i)^2}{k+1}\cdot \frac{h_i}{1-h_i}\]
Influence Plots: Visualize the influence of each observation on the regression coefficients. Observations with a substantial impact on the coefficients may be considered influential.
Mahalanobis Distance: This distance metric considers the correlation between variables. Points with a Mahalanobis Distance larger than a certain threshold may be considered outliers.
DFFITS: This measures the change in fitted values when an observation is omitted. Points with DFFITS larger than a certain threshold (e.g., \(2\sqrt{(k+1)/n}\), where \(k\) is the number of predictors and \(n\) is the number of observations) may be considered influential.
DFBETAS: measure the change in each coefficient due to the deletion of each observation one at a time. Typically, absolute values larger than \(2/\sqrt{2}\) are considered influential, where \(n\) is the number of observations.
Shapiro-Wilk test is performed to check the normality of residuals.
Grubbs’ test is used to identify outliers in the residuals.
Code examples for each of the above methods:
Load necessary libraries:
# Load necessary libraries
library(MASS)
library(car)
library(dplyr)
library(outliers)
Simulate a hypothetical dataset with outliers and an influential point:
# Generate a hypothetical dataset with outliers and influential points
set.seed(123)
n <- 20 # Number of observations
x <- rnorm(n) # Predictor variable
y <- 2 * x + rnorm(n) # Response variable
# Introduce outliers and influential points
y[c(5, 10, 15)] <- y[c(5, 10, 15)] + 8 # Introduce outliers
y[18] <- y[18] - 15 # Introduce an influential point
# Combine into a data frame
data <- data.frame(x, y)
data
## x y
## 1 -0.56047565 -2.1887750
## 2 -0.23017749 -0.6783299
## 3 1.55870831 2.0914122
## 4 0.07050839 -0.5878744
## 5 0.12928774 7.6335362
## 6 1.71506499 1.7434367
## 7 0.46091621 1.7596195
## 8 -1.26506123 -2.3767494
## 9 -0.68685285 -2.5118426
## 10 -0.44566197 8.3624910
## 11 1.22408180 2.8746278
## 12 0.35981383 0.4245562
## 13 0.40077145 1.6966686
## 14 0.11068272 1.0994989
## 15 -0.55584113 7.7098988
## 16 1.78691314 4.2624665
## 17 0.49785048 1.5496186
## 18 -1.96661716 -18.9951460
## 19 0.70135590 1.0967491
## 20 -0.47279141 -1.3260538
plot(y~x, main = "A Scatter Plot with an Influential Point")
We performs residual analysis, creating a residual plot and identifying studentized residuals:
# Fit regression model
model <- lm(y ~ x, data = data)
# Residual Analysis with Residual Plot
residuals <- model$residuals
# Standardized Residuals
std_residuals <- rstandard(model)
# Studentized Residuals
studentized_residuals <- studres(model)
# Identify outliers based on studentized residuals
outliers_studentized <- data[abs(studentized_residuals) > 2, ]
# Display identified outliers
outliers_studentized
## x y
## 10 -0.4456620 8.362491
## 15 -0.5558411 7.709899
## 18 -1.9666172 -18.995146
# Set up a 1x2 layout for plotting
par(mfrow = c(1, 2)) # the 2 plots will be placed on one row
# Plotting Residuals vs Predictor (x) with outliers
plot(x, residuals, main = "Residual Plot with Outliers", xlab = "Predictor (x)", ylab = "Residuals")
# Add labels for outliers
text(x = x[as.numeric(rownames(outliers_studentized))],
y = residuals[rownames(outliers_studentized)],
pos = 4,
labels = row.names(outliers_studentized),
col = "red")
# Plotting Studentized Residuals vs Predictor (x) with outliers
plot(x, studentized_residuals, main = "Studentized Residual Plot with Outliers", xlab = "Predictor (x)", ylab = "Studentized Residuals")
# Add labels for outliers
text(x = x[as.numeric(rownames(outliers_studentized))],
y = studentized_residuals[rownames(outliers_studentized)],
pos = 4,
labels = row.names(outliers_studentized),
col = "red")
Here’s how you can examine each of the two plots:
Residual Plot
Studentized Residual Plot:
Cook’s Distance is calculated to identify influential observations:
# Cook's Distance
cook_distance <- cooks.distance(model)
outliers_cooks <- data[cook_distance > 4/n, ]
outliers_cooks
## x y
## 18 -1.966617 -18.99515
# Diagnostic plots: Cook's Distance- Identify influential data points.
plot(model, which = 4, main = "Cook's Distance \n(dashed line plotted at cutoff 4/n)", caption = "mmm", sub.caption = "mmm")
# Identify influential observations (Cook's distance > 4/n)
abline(h = 4/nrow(data), col = "red", lty = 2)
The Cook’s distance plot shows that there is only one influential point (obs. 18).
Calculates leverage values and identifies observations with high leverage:
# Leverage
leverage <- hatvalues(model)
outliers_leverage <- data[leverage > 2 * length(model$coefficients)/n, ]
outliers_leverage
## x y
## 16 1.786913 4.262467
## 18 -1.966617 -18.995146
Observations 16 & 18 have high leverages (leverage is greater than \(2(k+1)/n\)), where \(k\) is the number of predictors and \(n\) is the number of observations.
A more conplex influence plot is created to visually assess influential points:
# Calculate studentized residuals
studentized_residuals <- studres(lm(y ~ x, data = data))
m <- min(studentized_residuals)
M <- max(studentized_residuals)
# Calculate values for plotting boundaries
k <- ncol(data) - 1 + 1
h <- seq(min(hatvalues(model)), max(hatvalues(model)) * 1.1, length.out = 1000)
std_residuals0.5D <- sqrt(k * (1 - h) / h * 0.5)
std_residuals1.0D <- sqrt(k * (1 - h) / h)
r.m <- min(std_residuals0.5D, -std_residuals0.5D, std_residuals1.0D, -std_residuals1.0D)
r.M <- max(std_residuals0.5D, -std_residuals0.5D, std_residuals1.0D, -std_residuals1.0D)
# Influence Plots with overlaid lines
influencePlot(model,
xlim = c(0, max(hatvalues(model)) * 1.1),
id = TRUE,
main = "Influence Plot with Outliers",
xlab = "Leverage (hat-value, h)",
ylim = range(r.m, r.M))
## StudRes Hat CookD
## 10 2.2252550 0.06918750 0.15090311
## 15 2.1441674 0.07706227 0.15996566
## 16 -0.3404791 0.20059276 0.01529573
## 18 -4.7536343 0.29726351 2.17260329
# Add lines to the plot indicating different boundaries for standardized residuals
lines(std_residuals0.5D ~ h, lty = "dotted", col = "red")
lines(std_residuals1.0D ~ h, lty = "dashed", col = "darkred")
lines(-std_residuals0.5D ~ h, lty = "dotted", col = "red")
lines(-std_residuals1.0D ~ h, lty = "dashed", col = "darkred")
Explanation of plot:
Mahalanobis Distance is used to detect outliers based on multivariate distances:
# Mahalanobis Distance
mahalanobis_dist <- mahalanobis(data, colMeans(data), cov(data))
outliers_mahalanobis <- data[mahalanobis_dist > qchisq(0.95, df = 2), ]
outliers_mahalanobis
## x y
## 18 -1.966617 -18.99515
DFFITS values are calculated to identify influential points in the regression analysis:
# DFFITS
dffits <- dffits(model)
outliers_dffits <- data[abs(dffits) > 2 * sqrt(length(model$coefficients)/n), ]
outliers_dffits
## x y
## 18 -1.966617 -18.99515
The DFBETAS measure the change in each coefficient due to the deletion of each observation one at a time. Here’s how you can calculate DFBETAS for a linear regression model in R:
# Calculate DFBETAS
dfbetas <- dfbetas(model)
dfbetas
## (Intercept) x
## 1 -0.036629255 0.024695827
## 2 -0.010192428 0.003817945
## 3 -0.123291092 -0.239905104
## 4 -0.049179528 0.003688721
## 5 0.345431950 -0.004535898
## 6 -0.158946012 -0.354659561
## 7 0.003830504 0.001373508
## 8 0.083794242 -0.102903216
## 9 -0.033949234 0.026533170
## 10 0.557286954 -0.319490486
## 11 -0.048184220 -0.067065993
## 12 -0.041929854 -0.010104619
## 13 0.009375059 0.002701440
## 14 0.023891763 -0.000784583
## 15 0.547834962 -0.367159394
## 16 -0.062383132 -0.147777550
## 17 -0.010475154 -0.004216413
## 18 -1.670678571 2.819743025
## 19 -0.056974695 -0.037301800
## 20 -0.005246180 0.003134287
The Shapiro-Wilk test is conducted to check the normality of residuals:
# Shapiro-Wilk Test for Residuals
shapiro_test <- shapiro.test(residuals)
if (shapiro_test$p.value < 0.05) {
cat("Residuals are not normally distributed (Shapiro-Wilk p-value:", shapiro_test$p.value, ")\n")
} else {
cat("Residuals are normally distributed (Shapiro-Wilk p-value:", shapiro_test$p.value, ")\n")
}
## Residuals are not normally distributed (Shapiro-Wilk p-value: 0.001531177 )
Grubbs’ test is performed to identify outliers, and a message is printed based on the test results:
# Grubbs' Test
grubbs_test <- grubbs.test(residuals)
if (grubbs_test$p.value < 0.05) {
obs = names(which(grubbs_test$p.value < 0.05))
cat("Outlier detected using Grubbs' test (p-value:", grubbs_test$p.value, ")\n",
paste0(ifelse(length(obs) > 1,
paste0("The outliers are the ", paste0(obs[1:(length(obs)-1)], "th, ", collapse = ""), "and ", tail(obs,1), "th observations"),
paste0("The outlier is the ", obs, "th observation."))))
} else {
cat("No Outlier detected using Grubbs' test.\n")
}
## Outlier detected using Grubbs' test (p-value: 0.01931184 )
## The outlier is the 18th observation.
One-way analysis of variance (ANOVA) is a statistical technique used to compare the means of three or more groups to determine whether there are statistically significant differences among them. It is commonly used in experimental and observational studies to analyze the impact of one categorical variable (often referred to as a factor) on a continuous outcome variable.
In a one-way ANOVA, the categorical variable divides the dataset into groups, and the continuous outcome variable is measured for each group. The goal is to assess whether there are statistically significant differences in the means of the outcome variable across the groups.
The ANOVA method has two major applications:
Experiments: In experiments, researchers actively manipulate the categorical variable (factor) to create different groups. For example, in a controlled clinical trial, participants may be randomly assigned to different treatment groups. The groups are typically well-defined and controlled, which allows for causal inference and interpretation of any observed differences as potentially caused by the manipulation of the factor. This is called a completely randomized design.
Observational Studies: In observational studies, researchers observe pre-existing groups based on some characteristic, and the assignment of individuals to groups is not controlled by the researcher. For example, comparing the average income across different education levels. Since the groups may not be as well-defined or controlled, confounding variables may influence the observed differences. Therefore, causal inference is more challenging, and observed differences may be due to factors other than the categorical variable of interest.
How can we interpret Results? Regardless of whether it’s an experiment or observational study, the interpretation of the results from a one-way ANOVA typically involves:
Assessing overall significance: The ANOVA test provides an F-statistic and associated p-value. A small p-value (< 0.05) indicates that there are statistically significant differences in the means of the groups.
Post-hoc Tests (if needed): If the overall ANOVA test is significant, post-hoc tests (e.g., Fisher’s Least Significant Difference (LSD), Tukey’s HSD (Honestly Significant Difference), Bonferroni correction) may be conducted to determine which specific group means are significantly different from each other.
In experiments, the following terms are fundamental:
Factor: In the context of one-way ANOVA, the factor refers to the categorical independent variable with two or more categories.
Levels: Levels are the different categories or groups within the factor. Each level represents a unique experimental condition applied to the experimental units.
Treatment: In one-way ANOVA, each level of the factor represents a treatment condition. The treatments correspond to the different groups or categories being compared to assess their effect on the response variable.
Null Hypothesis (H0): The null hypothesis in one-way ANOVA states that there are no differences in population means among the different treatment groups. It assumes that any observed differences in sample means are due to random sampling variability.
Alternative Hypothesis (Ha): The alternative hypothesis in one-way ANOVA states that at least one treatment group mean is significantly different from the others. It suggests that there are genuine differences in population means among the treatment groups.
Between-Group Variability (\(SSBetween\) or \(SSGroup\) or \(SSTreatment\)): The between-group variability, also known as the sum of squares between groups, measures the variation in the sample means of the treatment groups. It quantifies the differences among the group means.
Within-Group Variability (\(SSWithin\) or \(SSError\) or \(SSResidual\)): The within-group variability, also known as the sum of squares within groups, measures the variation of individual data points around their respective group means. It represents the random variability within each treatment group.
Total Variability (\(SSTotal\)): The total variability, also known as the total sum of squares, measures the overall variability in the response variable across all observations. It is the sum of the between-group and within-group variabilities.
F-statistic: The F-statistic is a test statistic used in one-way ANOVA to assess whether there are significant differences among the treatment group means. It is calculated as the ratio of the between-group variability (after accounting for its degrees of freedom) to the within-group variability (after accounting for its degrees of freedom).
P-Value: In hypothesis testing, the p-value is a measure that helps us determine the strength of evidence against the null hypothesis (\(H_0\)). Specifically, it quantifies the probability of observing the test statistic (or a more extreme value) under the assumption that the null hypothesis is true. If the p-value is smaller than the chosen significance level (\(\alpha\), typically 0.05), it suggests that the observed data is unlikely to have occurred if the null hypothesis is true. In other words, there is strong evidence against the null hypothesis, and we reject it in favor of the alternative hypothesis (\(H_a\)). If the p-value is greater than or equal to the significance level, it suggests that the observed data is reasonably likely to occur even if the null hypothesis is true. In this case, we do not have enough evidence to reject the null hypothesis, and we fail to reject it.
Understanding these concepts is crucial for interpreting the results of a one-way ANOVA correctly and drawing valid conclusions about the differences among treatment groups.
In one-way ANOVA (analysis of variance), the statistical model is designed to assess whether there are significant differences in the means of three or more independent groups or levels of a categorical variable. The model can be expressed as follows:
\[Y_{ij}=\mu_i+\epsilon_{ij}\] where \(y_{ij}\) is the observed value of the response variable for the \(j\)th observation in the \(i\)th group, \(\mu_i\) is the mean of the \(i\)th group, and \(\epsilon_{ij}\) is the random error.
The model can also be written as
\[Y_{ij}=\mu+\alpha_i+\epsilon_{ij}\] Where \(\mu\) is the overall population mean (called the grand mean) of the response variable, and \(\alpha_i\) is the effect of the \(i\)th treatment or group level (difference between the group mean and the grand mean).
For given data, we can estimate the parameters \(\mu\) and \(\alpha\)’s by simple hand calculation:
Example:
The grand mean \(\mu\) is estimated to be 9. The group means are estimated as 4.8, 6.2, 8.8, and 15.0, respectively. Thus, the effects (\(\alpha\)’s) are estimated by subtracting the grand mean from the group means, resulting in estimates of \(-4.2\), \(-2.8\), \(-0.2\), and \(6.0\). Each random error is estimated by subtracting the corresponding group mean from the observation. For example, the random error corresponding to the observation “3” in group 1 can be estimated as \(3 - 4.8\), yielding \(-1.8\).
Each observation can be decomposed into three components: the grand mean, the corresponding group effect, and the corresponding residual. For instance, the observation “3” in group 1 can be expressed as: \[3 = 9 + (-4.2) + (-1.8)\] The same principle applies to each of the remaining observations.
Upon squaring each number in these decomposed identities and summing up the corresponding terms, we find that the sum of squared observations equals the sum of squared grand means, plus the sum of squared effects, plus the sum of squared residuals. This holds as a general rule. In mathematical terms:
\[\sum \text{observation}^2 = \sum \text{grand mean}^2 + \sum \text{effect}^2 + \sum \text{residual}^2 \]
where $^2 $ has \(n\) degrees of freedom, $^2 $ has only 1 degree of freedom, $^2 $ has only \(I-1\) degrees of freedom, and $^2 $ has only \(n-I\) degree of freedom. Here, \(n\) represents the total number of observations, and \(I\) denotes the total number of groups. It’s important to note the following decomposition:
\[n=1+(I-1)+(n-I)\] Here, each number of degrees of freedom tells how many chances the corresponding quantity can change. Specifically,
The above two identities can also be written as:
\[\sum \text{observation}^2 - \sum \text{grand mean}^2 = \sum \text{effect}^2 + \sum \text{residual}^2 \]
\[n-1=(I-1)+(n-I)\] Since \(\sum \text{observation}^2 - \sum \text{grand mean}^2 = \sum (\text{observation} - \text{grand mean})^2\), it is called the total variation of data, or the total sum of squares (denoted SSTotal). In addition, \(\sum \text{effect}^2\) is called the sum of squares for groups or treatments (denoted SSGroup or SSTreatment), and \(\sum \text{residual}^2\) is called the sum of squares for errors or residuals (denoted SSError or SSResidual). Therefore, we have:
\[\text{SSTotal}=\text{SSGroup}+\text{SSError}\] For the data, \(n=21\), \(I=4\), \(SSTotal = 410\), \(SSGroup = 343.6\), and \(SSError = 66.4\).
In the following, we provide R code for calculating group means:
# Create the data frame D
D <- data.frame(
group = rep(c("A", "B", "C", "D"), c(5, 5, 5, 6)),
y = c(3, 6, 4, 4, 7, 5, 7, 8, 6, 5, 8, 8, 9, 10, 9, 13, 15, 12, 13, 17, 20)
)
# Aggregate the data by group and calculate the mean of y for each group
aggregate(y ~ group, data = D, FUN = mean)
## group y
## 1 A 4.8
## 2 B 6.2
## 3 C 8.8
## 4 D 15.0
Explanation of the above code:
Creating a Data Frame D:
Aggregating Data by Group:
To make formal inference about the parameters, we must make some assumptions on the model:
The main goal of one-way ANOVA is to test the null hypothesis that all group means are equal against the alternative hypothesis that at least one group mean is different. This is typically done using an F-test, where the F-statistic is calculated as the ratio of the between-group variability (representing signal) to the within-group variability (representing noise).
If the F-test yields a significant result (i.e., the p-value is less than a chosen significance level, often 0.05), we reject the null hypothesis and conclude that there are significant differences among the group means.
The following so-called ANOVA table shows the detail of calculation:
where \(I\) is the number of groups and \(n\) is the total sample size. The \(F\) statistic, under the null hypothesis of equal group means, has an \(F\) distribution with \(I-1\) and \(n-I\) numbers of degrees of freedom.
The R software can generate ANOVA tables, different treatments are the groups, residuals are the errors, “Sum Sq” is the same as SS, “Mean Sq” is the same as MS, and \(MS_{Error}\) is often written as \(MSE\). The \(MSE\) is an estimate of the error variance \(\sigma^2\).
Why is F used as the test statistic? The reason is as follows:
In R, one-way ANOVA can be performed using the aov() function.
We demonstrate how to perform a one-way ANOVA analysis using R for an experiment.
Example 1: We use the data given before to perform a one-way ANOVA analysis.
# Create the data frame D
D <- data.frame(
group = rep(c("A", "B", "C", "D"), c(5, 5, 5, 6)),
y = c(3, 6, 4, 4, 7, 5, 7, 8, 6, 5, 8, 8, 9, 10, 9, 13, 15, 12, 13, 17, 20)
)
# Perform one-way ANOVA
anova_result <- aov(y ~ group, data = D)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## group 3 343.6 114.53 29.32 6.04e-07 ***
## Residuals 17 66.4 3.91
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table produces the same sums of squares as we did by hand. The p-value (6.04e-07 or 0.000000604) shows that differences in the mean of \(y\) for different groups are statistically significant at any commonly used significance level.
Example 2: Blood Pressure Medication Study
We visualize the data:
# Visualize the data
boxplot(blood_pressure ~ medication, data = med_data,
main = "Blood Pressure by Medication Group",
xlab = "Medication", ylab = "Blood Pressure")
It seems there are differences in blood pressure between treatments. Are the differences significant?
We perform one-way ANOVA:
# Perform one-way ANOVA
anova_result <- aov(blood_pressure ~ medication, data = med_data)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## medication 2 871.3 435.6 4.078 0.0284 *
## Residuals 27 2884.6 106.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The result from the ANOVA table shows that the differences are statistically significant at level 0.05.
How can we check whether the assumptions of the ANOVA are met?
# graphical check of ANOVA assumptions
plot(anova_result, 1) # Identify outliers and graphically check equal variances
plot(anova_result, 2) # Graphically check normality
Here’s how you can interpret the graphs:
Graphically Check Equal Variances (Plot Type 1): If the groups have equal variances, you should see relatively equal range of residuals across the groups in the plot. If not, it indicates that the assumption of equal variances may be violated.
Graphically Check Normality (Plot Type 2): If the data points fall approximately along a straight line, it suggests that the data follows a normal distribution. Deviations from the expected pattern, such as significant skewness or outliers, may indicate departures from normality.
A formal test, called the Levene’ test, available from the “car” package for testing equal variances is shown below:
library(car)
leveneTest(blood_pressure ~ medication, data = med_data)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 0.295 0.7469
## 27
From the output above we can see that the p-value (0.7469) is not less than any commonly used significance level (such as 0.05). This means that there is no evidence to suggest that the variance across groups is statistically significantly different. Therefore, we can assume the homogeneity of variances in the different treatment groups.
We can also use the following R code to calculate group standard deviations:
aggregate(blood_pressure ~ medication, data = med_data, sd)
## medication blood_pressure
## 1 A 9.658617
## 2 B 12.543701
## 3 C 8.359293
If the largest standard deviation is no greater than twice the smallest standard deviation, there is no violation of the equal variances assumption. As the result shows, the largest standard deviation is 12.54 and the smallest is 8.36, so there is no violation of the equal variances assumption.
The Shapiro-Wilk test can be used to check normality:
shapiro.test(anova_result$residuals)
##
## Shapiro-Wilk normality test
##
## data: anova_result$residuals
## W = 0.96657, p-value = 0.4501
The Shapiro-Wilk test on the ANOVA residuals (W = 0.9467, p = 0.4501) finds no indication that normality is violated.
In one-way ANOVA, assessing the magnitude of the effect involves examining
In the above formulas, \(\bar{y_i}\) and \(\bar{y_j}\) are the group means, \(n_i\) and \(n_j\) are the group sizes, \(SD=\sqrt{MSE}=\sqrt{SSError/(n-I)}\) is the estimate of the error standard deviation \(\sigma\), and \(t^*\) is the critical value based on the \(t\)-distribution with \(n-I\) degrees of freedom (\(n\) is the total sample size and \(I\) is the number of groups). You can use the app https://www.lock5stat.com/StatKey/theoretical_distribution/theoretical_distribution.html#t to calculate this critical value (set the number of degrees of freedom, check the two-tail box, and change the middle area to the confidence level).
Confidence intervals (CIs) for group means help determine the precision of the estimated mean for each group. A wider CI indicates greater uncertainty in the estimated mean, while a narrower CI suggests higher precision.
Comparing CIs between groups can reveal whether the differences in means are statistically significant. If the confidence interval for the difference between two group means does not include zero, it suggests a significant difference between those groups.
Example: Use the following data
to calculate 95% confidence intervals for
Solution:
For, \(\mu_1\), we use the formula
\[\bar{y}_1\pm t^*\cdot SD\cdot \sqrt{1/n_1}\] where
Therefore, the \(95\%\) confidence interval is
\[4.8\pm2.1098\cdot 1.9763\cdot \sqrt{1/5}\] or, \((2.9353, 6.6647)\).
For, \(\mu_2-\mu_1\), we use the formula
\[(\bar{y}_2-\bar{y}_1)\pm t^*\cdot SD\cdot \sqrt{1/n_2+1/n_1}\]
to get the 95% confidence interval to be \((3.5629, 8.8371)\).
How much do the differences matter?
To answer this question, we need to calculate effect sizes. The p-values tell whether results are significant at certain levels, but significance does not imply importance.
An effect size provides valuable insights into the practical significance of the observed differences among groups.
In the one-way ANOVA setting, we can measure the effect size for a single group by
\[D_i = \frac{\bar{y_i}-\bar{y}}{SD}\] and for a difference between any pair of two groups by
\[D_{ij} = \frac{\bar{y_i}-\bar{y_j}}{SD}\]
where \(SD=\sqrt{MSE}\). These effect sizes are called Cohen’s \(D\).
As the formulas show, effect sizes quantify the magnitude of differences between groups in standard deviation units.
These effect size measures help to interpret the practical significance of group differences, irrespective of sample size.
Effect sizes can be categorized as small, medium, or large based on standard thresholds.
Here’s a more detailed interpretation of Cohen’s D effect sizes:
For our previous example, the effect size that quantifies the magnitude of difference between - group 1 and group 2 is \(D_{12}=(4.8-6.2)/1.9763=-0.7084\), which is medium. - group 1 and group 3 is \(D_{12}=(4.8-8.8)/1.9763=-2.0240\), which is large.
The one-way ANOVA assumes that the response variable is normally distributed within each group and exhibits equal variance across groups. However, in real-world data, these assumptions may not always hold true. When these conditions are violated, one approach to address the issue is to apply transformations to the response variable to stabilize variation.
Several commonly used transformations for the response variable \(y\) include:
These transformations can help to achieve approximate normality and homogeneity of variances, thereby improving the validity of the ANOVA analysis.
For each transformation, the standard deviation of each group can be calculated. If the maximum standard deviation is no greater than twice the smallest standard deviation, the homogeneity in variance assumption is satisfied.
Example: Use the following data to find a transformation.
Solution.
Plot without log-transformation:
# Create a data frame with two variables: 'group' and 'y'
D <- data.frame(
group = rep(c("A", "B", "C"), each = 3), # 'group' variable contains repetitions of "A", "B", and "C", each 3 times
y = c(0.9, 1, 1.1, 9, 10, 11, 90, 100, 110)
)
# Convert the 'group' variable to a factor
D$group = factor(D$group)
# Plot the data using ggplot2
ggplot(D, aes(x=group, y=y)) +
geom_point()
The data in different groups have different variations.
Plot after log-transformation:
ggplot(D, aes(x=group, y=log(y))) +
geom_point()
After log-transformation, the data in different groups have the same variation. Now, the 1-way ANOVA can be done with the transformed data.
aov_result = aov(log(y)~group, data = D)
summary(aov_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## group 2 31.81 15.91 1579 6.82e-09 ***
## Residuals 6 0.06 0.01
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
When an analysis of variance (ANOVA) indicates a significant difference among groups that are compared, multiple comparisons are often conducted in for pairwise comparisons.
Fisher’s Least Significant Difference (LSD), Tukey’s Honestly Significant Difference (HSD), and Bonferroni correction are 3 methods commonly used for conducting multiple pairwise comparisons in the context of ANOVA. However, they differ in their underlying assumptions and the criteria they use to determine significant differences between group means.
We focus on Fisher’s LSD (Least Significant Difference) and introduce the other two methods in Chapter 8.
Steps of the Fisher’s LSD process:
Conduct 1-way ANOVA. If the result is not significant, stop. Otherwise, proceed to the next step.
Compute Fisher’s LSD.
\[LSD = t^*\cdot SE\] where \(SE=\sqrt{MSE}\cdot \sqrt{1/n_1+1/n_2}\).
Fisher’s LSD test is relatively straightforward and easy to implement, but it can be less powerful than other post-hoc tests, especially when there are unequal sample sizes or unequal variances between groups. Therefore, it’s essential to consider the specific characteristics of your data and research question when choosing a multiple comparison method.
There is an R package “agricolae” that does multiple comparison using Fisher’s LSD. Here is an example.
Example 1.: Suppose a professor wants to know whether or not three different studying techniques lead to different exam scores among students. This example is taken from https://www.statology.org/fishers-lsd-in-r/.
To test this, she randomly assigns 10 students to use each studying technique and records their exam scores.
The following table shows the exam scores for each student based on the studying technique they used:
## technique score
## 1 tech1 72
## 2 tech1 73
## 3 tech1 73
## 4 tech1 77
## 5 tech1 82
## 6 tech1 82
## 7 tech1 83
## 8 tech1 84
## 9 tech1 85
## 10 tech1 89
## 11 tech2 81
## 12 tech2 82
## 13 tech2 83
## 14 tech2 83
## 15 tech2 83
## 16 tech2 84
## 17 tech2 87
## 18 tech2 90
## 19 tech2 92
## 20 tech2 93
## 21 tech3 77
## 22 tech3 78
## 23 tech3 79
## 24 tech3 88
## 25 tech3 89
## 26 tech3 90
## 27 tech3 91
## 28 tech3 95
## 29 tech3 95
## 30 tech3 98
Let’s do ANOVA:
model<-aov(score~technique, data=df)
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## technique 2 341.6 170.80 4.623 0.0188 *
## Residuals 27 997.6 36.95
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The result shows that the mean scores differs significantly at level 0.05 for different techniques.
The following is R code for multiple comparison using the Fisher’s LSD process.
library(agricolae) # You must install the package first
model<-aov(score~technique, data=df)
out <- LSD.test(model,"technique")
out
## $statistics
## MSerror Df Mean CV t.value LSD
## 36.94815 27 84.6 7.184987 2.051831 5.57767
##
## $parameters
## test p.ajusted name.t ntr alpha
## Fisher-LSD none technique 3 0.05
##
## $means
## score std r se LCL UCL Min Max Q25 Q50 Q75
## tech1 80.0 5.868939 10 1.92219 76.05599 83.94401 72 89 74.00 82.0 83.75
## tech2 85.8 4.391912 10 1.92219 81.85599 89.74401 81 93 83.00 83.5 89.25
## tech3 88.0 7.557189 10 1.92219 84.05599 91.94401 77 98 81.25 89.5 94.00
##
## $comparison
## NULL
##
## $groups
## score groups
## tech3 88.0 a
## tech2 85.8 a
## tech1 80.0 b
##
## attr(,"class")
## [1] "group"
plot(out)
The result shows that
Blocking and Two-Way ANOVA are statistical techniques used to analyze experimental data, particularly when there are multiple factors at play. Let’s introduce each of them:
Blocking:
The Randomized Complete Block (RCB) Design is a common experimental design used in agricultural, industrial, and scientific research to reduce variability and increase the precision of treatment comparisons. It is particularly useful when there are known sources of variability that may confound the treatment effects.
The Randomized Complete Block (RCB) Design can be implemented in various ways to suit different experimental setups and research goals. Three common strategies for implementing RCB designs are subdividing blocks, sorting/matching blocks, and reusing blocks. Let’s explore each:
Subdivide a large plot of land (the block) into subplots (the units) so that each gets one of a few different treatments at random one at a time. There are a few blocks.
Sort or match individuals with similar baseline characteristics into blocks so that each block has the number of individuals equal to the number of treatments. Randomly assign one treatment to one individual.
Reuse each subject (the block) the number of times equal to the number of treatments in random order.
Each strategy results in a data table of \(n\) rows and \(k\) columns, with \(n\) equal to the number of blocks and \(k\) equal to the number of treatments. The total number of observations equals \(n\cdot k\).
Here is a possible data table when \(n=5\) and \(k=3\).
Block | Treatment1 | Treatment2 | Treatment3 |
---|---|---|---|
Block1 | y11 | y12 | y13 |
Block2 | y21 | y22 | y23 |
Block3 | y31 | y32 | y33 |
Block4 | y41 | y42 | y43 |
Block5 | y51 | y52 | y53 |
Let’s introduce some terminology based on the above table.
Read textbook page 260 for examples.
What graphical insights can we glean from the data prior to model fitting? Constructing a plot for Randomized Complete Block Design (RCBD) data involves creating side-by-side dot plots for each treatment group and connecting observations from the same block.
What does such a plot reveal? It can demonstrate:
As our aim is to discern differences in treatment effects, it’s imperative to eliminate block effects to derive adjusted responses before comparing treatments.
Here are the steps detailing how adjusted values are calculated:
Plotting the adjusted values against treatments offers a clearer depiction of each treatment’s effect, as the influence of blocks has been mitigated.
Example 1. Let’s consider an example where we want to compare the effects of caffeine and theobromine with a placebo (a sham treatment). Four subjects participate in the study. On the first day, each subject randomly selects a treatment to ingest. Two hours later, the subjects are timed while they type a sequence of 1000 letters on a computer, and their finger typing rate is recorded. This process is repeated on the third and fifth days for the other two treatments in a random order. The data is illustrated below:
Subject | Placebo | Caffeine | Theobromine |
---|---|---|---|
Alice | 11 | 26 | 20 |
Bob | 56 | 83 | 71 |
Cathy | 15 | 34 | 41 |
David | 6 | 13 | 32 |
The above data is in a wide format. To plot the original data and the block effect-adjusted data, R software requires us to convert the data from the wide format to tall or long format in 3 columns (subject, treatment, and response).
D=data.frame(Subject = factor(rep(1:4, each=3)),
Treatment = rep(c("Placebo", "Caffeine", "Theobromine"), 4),
Y = c(11, 26, 20, 56, 83, 71, 15, 34, 41, 6, 13, 32)
)
D$Treatment = factor(D$Treatment, levels = c("Placebo", "Caffeine", "Theobromine"))
# Calculate block means
block_means <- aggregate(Y ~ Subject, data = D, FUN = mean)
# Calculate grand mean
grand_mean <- mean(D$Y)
# Calculate block effects
block_effects <- block_means$Y - grand_mean
# Adjusted data
D$Adjusted_Y <- D$Y - block_effects[match(D$Subject, block_means$Subject)]
# Original data plot
with(D, interaction.plot(Treatment, Subject, Y, xlab = "Treatment", ylab = "Rate", col = 1:4))
# Block effect-adjusted data plot
ggplot(D, aes(x = Treatment, y = Adjusted_Y)) +
geom_point() +
geom_rect(aes(xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf), fill = "transparent", color = "black") +
labs(title = "Block Effect-Adjusted Data",
x = "Treatment",
y = "Rate") +
theme(panel.background = element_rect(fill = "white"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
The original data show that Variability observed within treatment groups may stem from both treatment effects and block effects, making it challenging to discern the true treatment effect.
The block effect-adjusted data plot illustrates the adjusted responses for each treatment group after accounting for block effects.
Reduced variability within treatment groups compared to the original data plot suggests a clearer delineation of treatment effects from block effects.
The impacts of caffeine and theobromine exhibit similarities, with both demonstrating higher effects compared to the placebo.
There is another plot called the Anscombe’s block plot introduced in the textbook. We skip it here.
The two-way ANOVA model is
\[Y=\mu+\tau_i + \beta_j+ \epsilon\] where \(\mu\) is the grand or population mean, \(\tau_i\) is the \(i\)th treatment effect, \(\beta_j\) is the \(j\)th block effect, and \(\epsilon\) is the random error term.
Assumptions of the model:
As in one-way ANOVA, an observation can be decomposed as
\[\text{Observed Value} = \text{Grand Mean} + \text{Block Effect} + \text{Treatment Effect} + \text{Residual}\] For the finger data we examined before,
For the observed value 11, it can be decomposed as
\[11 = 34+(-15)+(-12)+4\] which is equivalent to
\[11 - 34 = (-15)+(-12)+4\] It turns out that
\[\sum(y - grand~~mean)^2 = \sum(block~~effect)^2+\sum(treatment~~effect)^2+\sum(residual)^2\] where the summations are taken over all cells of the data in wide format.
The above sums of squares are called
Their numbers of degrees of freedom are
Note that
\[(n-1)=(b-1)+(k-1)+(n-b-k+1)\] i.e., the total degrees of freedom equals the sum of the individual degrees of freedom.
The R code for carrying out the two-way ANOVA is shown in the following example.
Example 0: (Carrying out the analysis in Excel) Watch this video https://www.youtube.com/watch?v=7KVuq8tZeWA. The details are here: https://www.youtube.com/watch?v=jH6tkTdABHs
Example 1: Use the finger data to fit a two-way ANOVA model. The R is
model = aov(Y~Subject+Treatment, data = D)
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Subject 3 5478 1826.0 33.00 0.000399 ***
## Treatment 2 872 436.0 7.88 0.020967 *
## Residuals 6 332 55.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results show that:
If we blindly fit a one-way ANOVA model to study the effect of treatments without considering blocks, the result would be not significant, as shown by the following:
model2 = aov(Y~Treatment, data = D)
summary(model2)
## Df Sum Sq Mean Sq F value Pr(>F)
## Treatment 2 872 436.0 0.675 0.533
## Residuals 9 5810 645.6
This example shows that it is important to use the correct model.
The assumptions for the two-way ANOVA model for RDBC designs can be checked with 3 kinds of residual plots:
Example 2.: We use the finger data to show how R can be used for constructing these plots.
Obtain a Q-Q plot of residuals. Is there any indication from this plot that the normality condition is violated?
Obtain a plot of residuals versus fitted values. Is any indication from the shape of this plot that the variation is not constant?
Make the Tukey additivity plot. Is a power transformation needed? If yes, what power transformation is appropriate?
Fit a 2-way ANOVA model for the log-iron data. Answer questions (1) and (2).
Solution: Fit the model.
# Provided data
D <- data.frame(Subject = factor(rep(1:4, each=3)),
Treatment = rep(c("Placebo", "Caffeine", "Theobromine"), 4),
Y = c(11, 26, 20, 56, 83, 71, 15, 34, 41, 6, 13, 32))
model = aov(Y~Subject+Treatment, data = D)
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Subject 3 5478 1826.0 33.00 0.000399 ***
## Treatment 2 872 436.0 7.88 0.020967 *
## Residuals 6 332 55.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(model, 2)
The points tend to be on the line, so the normality assumption is satisfied.
plot(model, 1)
Interpretation of the plot:
# Calculate block effects
block_effects <- tapply(D$Y, D$Subject, mean)
# Calculate treatment effects
treatment_effects <- tapply(D$Y, D$Treatment, mean)
# Calculate grand mean
grand_mean <- mean(D$Y)
# Calculate comparison values
comparison_values <- outer(block_effects, treatment_effects, FUN = "*") / grand_mean
# Flatten comparison values matrix to vector
comparison_values <- as.vector(comparison_values)
# Calculate residuals
fit <- lm(D$Y ~ comparison_values + 0)
residuals <- resid(fit)
# Extract slope coefficient
slope <- coef(fit)[1]
# Plot Tukey additivity plot with the fitted line
plot(comparison_values, residuals, xlab = "Comparison Values", ylab = "Residuals",
main = "Tukey Additivity Plot", pch = 16)
abline(a = 0, b = slope, col = "red") # Fitted line with intercept 0
# Add legend showing slope value
legend("topright", legend = paste("Slope:", round(slope, 4)), col = "red", lwd = 1, bty = "n")
The slope is 0.5631, so the power transformation suggested by the tukey plot has a power of \(1-0.5631\) or 0.437 which is close to 0.5, indicating that a square-root transformation might be appropriate.
We can test hypotheses
We can also find a confidence interval
library(lsmeans)
# Perform the 2-way ANOVA
model <- aov(Y ~ Treatment + Subject, data = D)
# Compute least square means and confidence intervals
lsmeans_model <- lsmeans(model, "Treatment")
lsmeans_model
## Treatment lsmean SE df lower.CL upper.CL
## Caffeine 39 3.72 6 29.9 48.1
## Placebo 22 3.72 6 12.9 31.1
## Theobromine 41 3.72 6 31.9 50.1
##
## Results are averaged over the levels of: Subject
## Confidence level used: 0.95
# Calculate confidence intervals for pairwise comparisons
pairwise_CI <- pairs(lsmeans_model, adjust = "none", infer = TRUE)
pairwise_CI
## contrast estimate SE df lower.CL upper.CL t.ratio p.value
## Caffeine - Placebo 17 5.26 6 4.13 29.87 3.232 0.0179
## Caffeine - Theobromine -2 5.26 6 -14.87 10.87 -0.380 0.7169
## Placebo - Theobromine -19 5.26 6 -31.87 -6.13 -3.612 0.0112
##
## Results are averaged over the levels of: Subject
## Confidence level used: 0.95
The result shows that
Note: We can also calculate effect sizes for a single treatment mean and for the mean difference between two treatments. The formulae are the same as in the 1-way ANOVA case.
The two-way ANOVA model can be applied to a situation where there is a major factor of interest (not treatment) and a nuisance factor (the block) and there is only one observation for each combination of the levels of the two factors.
Here is an example.
Example 3: Use the Rivlron data from https://stat2.org/datapage.html
The data were collected from four rivers in upstate New York. Some geologists were interested in the water chemistry. They took water samples at three different locations in each river (Grasse, Oswegatchie, Raquette, and St. Regis). The sampling sites were chosen to investigate how the composition of the water changes as it flows from the source to the mouth of each river. The sampling sites were labeled as upstream, midstream, and downstream. This dataset contains the concentrations of iron in the samples.
Fit a 2-way ANOVA model for the iron data.
In ANOVA, interaction refers to the combined effect of two or more factors on the response variable. When an interaction exists between factors, the effect of one factor on the response variable depends on the level of another factor. Understanding interaction is crucial as it can provide deeper insights into the relationships between variables and help us make more informed decisions.
The two-way factorial design is a powerful experimental design used to study the effects of two independent variables (factors) on a single dependent variable (response). In this design, each factor has two or more levels, and all possible combinations of the levels of the two factors are studied.
Key Concepts:
Factors: In a two-way factorial design, there are two independent variables, often referred to as factors. These factors can represent different treatments, conditions, or levels being studied in the experiment.
Levels: Each factor consists of two or more levels, which represent the different values or categories of the factor being manipulated or observed.
Cells: The combination of levels from the two factors creates cells in the experimental design. Each cell represents a unique combination of factor levels.
Main Effects: The main effects of each factor represent the average effect of that factor across all levels of the other factor. It describes how the response variable changes as the levels of one factor change, while holding the other factor constant.
Interaction Effects: Interaction effects occur when the effect of one factor on the response variable depends on the level of another factor. It signifies that the relationship between the response variable and one factor is not constant across the levels of the other factor.
Example 1:
Consider a study examining the effects of two factors,
on weight loss. The two-way factorial design would involve all possible combinations of these levels, resulting in four treatment groups:
Before diving into a two-way ANOVA analysis, it’s crucial to thoroughly explore the dataset to gain insights into its characteristics and uncover any potential patterns or trends.
We will construct the following plots:
This involves the decomposition of each observed value:
\[y = \text{grand mean}+\text{effect of factor one}+\text{effect of factor two}+\text{interaction effect}+\text{residual}.\]
The sums of squares are defined similarly as in two-way ANOVA for the RCBD. They are
The total sum of squares can be decomposed as
\[SSTo = SSA + SSB + SSAB + SSE\]
The corresponding degrees of freedom are
The number of degrees of freedom for the total sum of squares equals the sum of the other numbers of degrees of freedom.
As before, we can make an ANOVA table, which will show 3 p-values, one for each factor and one for interaction. We interpret the p-value for interaction first.
A very good reference is https://courses.lumenlearning.com/suny-natural-resources-biometrics/chapter/chapter-6-two-way-analysis-of-variance/.
In some cases, the assumptions of ANOVA, such as normality and homogeneity of variances, may not be met. When this occurs, it may be necessary to transform the data to meet these assumptions before conducting the analysis. Assessing the fit of the ANOVA model and considering data transformations are essential steps in ensuring the validity and reliability of the results.
We can again use the Tukey additivity plot to suggest a transformation on the response variable, but since we have more than one observation in each cell, we use cell means instead. The R code in previous chapter can be used to the cell means. The plot can answer two related questions:
Another plot we can construct is to plot log(standard deviation) versus log(mean). It’s similar to Tukey’s plot. The plot can answer two related questions:
After fitting a two-way ANOVA model and assessing its fit, several procedures can be performed:
Example 1: Do male and female birds respond in the same way to a hormone supplement added to their food? The hormone in question was meant to raise the concentration of calcium in the blood. To investigate this question, 10 male robins were divided at random into two groups of 5, one group being treated with the hormone and the other not. A similar procedure was applied to 10 female robine, 5 getting the hormone and 5 not. The response variable of interest is the blood calcium level (measured in mg per 100 ml). The data are
We can display the data in a table below:
No Hormone | Hormone | |
Male | 14.5, 11, 10.8, 14.3, 10 | 32, 23.8, 28.8, 25, 29.3 |
Femalee | 16.5, 18.4, 12.7, 14, 12.8 | 31.9, 26.2, 21.3, 35.8, 40.2 |
Make an interaction plot for the data.
Is it clear that the hormone supplement raises the calcium level?
Is the size of the effect much different for male and female birds? That is, is there any interaction between the two factors (gender and hormone)?
Fit a two-way ANOVA model with interaction. Test the interaction effect. If there is no interaction, test the main effects and construct confidence intervals for the mean difference between male and females and the mean difference with and without hormone. If there is an interaction, construct confidence intervals for the conditional mean difference between male and females given hormone occurrence or not and the conditional mean difference with and without hormone given male or female.
Calculate the effect size for hormone versus no hormone and the effect size for female versus male.
Solution.
# Create data frame
bird <- data.frame(gender = rep(c("Male", "Female"), each = 10),
hormone = rep(rep(c("No", "Yes"), each = 5), times = 2),
blood_calcium = c(14.5, 11, 10.8, 14.3, 10, 32, 23.8, 28.8, 25, 29.3, 16.5, 18.4, 12.7, 14, 12.8, 31.9, 26.2, 21.3, 35.8, 40.2)
)
# Create interaction plot
interaction.plot(x.factor = bird$hormone,
trace.factor = bird$gender,
response = bird$blood_calcium,
fun = mean,
type = "l",
legend = TRUE,
xlab = "Hormone Supplement?",
ylab = "Mean Blood Calcium Level (mg/100 ml)",
trace.label = "Gender",
col = 1:2)
The interaction plot clearly shows that the hormone supplement raises the calcium level.
The two lines are parallel, so there is no interaction between the two factors (gender and hormone).
Fit the two-way ANOVA model using R code
# Fit the two-way ANOVA model
model <- aov(blood_calcium ~ gender * hormone, data = bird)
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## gender 1 45.9 45.9 2.346 0.145
## hormone 1 1268.8 1268.8 64.841 5.09e-07 ***
## gender:hormone 1 0.4 0.4 0.019 0.893
## Residuals 16 313.1 19.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results show that
To interpret the result, we say
Example 2: In theory, adding vitamin B12 to the diet of baby pigs should cause them to gain weight faster. In practice, however, the presence of bacteria in a pig’s gut can prevent the use of vitamin B12. Would taking antibiotics make the use of vitamin B12 more effective?
You want to compare 4 diets based on crossing two factors, B12 (0 mg or 5 mg) added to the diet, and Antibiotics (0 mg or 40 mg): no/no, no/yes, yes/no, and yes/yes. With 12 pigs and 4 diets, you set up a lottery to assign diets to pigs, with 3 tickets for each diet. The tickets tell pigs what they get to eat. The data are shown below:
No Antibiotics | Antibiotics | |
No B12 | 30, 19, 8 | 5, 0, 4 |
B12 | 26, 21, 19 | 52, 56, 54 |
In the table, the numbers are daily weight gain (hundredths of a pound over 1.00). For example, 56 represents 0.56 pounds gained per day.
Make an interaction plot for the data.
Fit a two-way ANOVA model with interaction. Test the interaction effect. If there is no interaction, test the main effects and construct confidence intervals for the mean difference between B12 and no B12 and the mean difference between Antibiotics and no antibiotics. If there is an interaction, construct confidence intervals for the conditional mean difference between B12 and no B12 given Antibiotics and no antibiotics and the conditional mean difference between Antibiotics and no antibiotics given B12 and no B12.
Calculate the effect size for B12 versus with Antibiotics and no Antibiotics. Calculate the effect size for Antibiotics versus no antibiotics given B12 and no B12.
Solution.
# Create data frame
pig <- data.frame(B12 = rep(c("No", "Yes"), each = 6),
Antibiotics = rep(rep(c("No", "Yes"), each = 3), times = 2),
Weight_gain = c(30, 19, 8, 5, 0, 4, 26, 21, 19, 52, 56, 54)
)
# Create interaction plot
interaction.plot(x.factor = pig$Antibiotics,
trace.factor = pig$B12,
response = pig$Weight_gain,
fun = mean,
type = "l",
legend = TRUE,
xlab = "Antibiotics (mg)",
ylab = "Mean Weight Gain (hundredths of a pound over 1.00)",
trace.label = "B12",
col = 1:2)
The plot shows that - When B12 is present (5 mg), the mean difference between the levels of Antibiotics (40 mg and 0 mg) is 32. - When B12 is absent (0 mg), the mean difference between the levels of Antibiotics (40 mg and 0 mg) is \(-16\). - So, clearly, there is an interaction between the two factors.
# Fit the two-way ANOVA model
model <- aov(Weight_gain ~ B12 * Antibiotics, data = pig)
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## B12 1 2187 2187.0 60.331 5.4e-05 ***
## Antibiotics 1 192 192.0 5.297 0.050359 .
## B12:Antibiotics 1 1728 1728.0 47.669 0.000124 ***
## Residuals 8 290 36.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results show that
To interpret the result, we say
You are all set!!!
A video regarding : https://datatab.net/tutorial/two-factorial-anova-without-repeated-measures
Example 1.
A data example for a two-way factorial design could be a study examining the effect of different types of fertilizer (“Fertilizer Type: A, B”) and watering frequency (“Watering Frequency: Low, High”) on plant growth (measured in height), where each combination of fertilizer and watering frequency is tested on multiple plants, allowing analysis of both main effects (fertilizer type and watering frequency) and their interaction on plant height.
Example Data Table:
Fertilizer Type | Watering Frequency | Plant Height (cm) |
---|---|---|
A | Low | 10 |
A | Low | 12 |
A | Low | 11 |
A | High | 15 |
A | High | 14 |
A | High | 16 |
B | Low | 8 |
B | Low | 9 |
B | Low | 10 |
B | High | 18 |
B | High | 17 |
B | High | 19 |
Key points about this example:
Example 2.
Refer to the page: https://stats.libretexts.org/Bookshelves/Introductory_Statistics/Mostly_Harmless_Statistics_(Webb)/11%3A_Analysis_of_Variance/11.03%3A_Two-Way_ANOVA_(Factorial_Design)
https://peterstatistics.com/CrashCourse/3-TwoVarUnpair/ScaleScale/ScaleScale4.html
Julian J. Faraway’s book: https://julianfaraway.github.io/faraway/PRA/pra.pdf
Panel Data Analysis: https://dss.princeton.edu/training/Panel101R.pdf
Does the age of becoming GM have a significant effect on the most recent rating of a chess GM player? How can we collect data? Any other factor affects the relationship?
Does the age of becoming GM have a significant effect on the most recent rating of a chess GM player adjusting for the effect of nationality and sex? How can we collect data?