library(tidyverse)
The dataset was retrieved from Kaggle, 80 Cereals.
This dataset contains the nutritional content of 77 cereals and their ratings from Consumer Reports (unverified).
The dataset contains 16 columns (variables) and 77 rows (observations).
cereal <- data.frame(df)
#retrieve the number of rows and columns in an object
dim(cereal)
## [1] 77 16
View the structure (str() function) of the data frame. The types of variables shown in the dataset are: numerical and categorical.
str(cereal)
## 'data.frame': 77 obs. of 16 variables:
## $ name : chr "100% Bran" "100% Natural Bran" "All-Bran" "All-Bran with Extra Fiber" ...
## $ mfr : chr "N" "Q" "K" "K" ...
## $ type : chr "C" "C" "C" "C" ...
## $ calories: num 70 120 70 50 110 110 110 130 90 90 ...
## $ protein : num 4 3 4 4 2 2 2 3 2 3 ...
## $ fat : num 1 5 1 0 2 2 0 2 1 0 ...
## $ sodium : num 130 15 260 140 200 180 125 210 200 210 ...
## $ fiber : num 10 2 9 14 1 1.5 1 2 4 5 ...
## $ carbo : num 5 8 7 8 14 10.5 11 18 15 13 ...
## $ sugars : num 6 8 5 0 8 10 14 8 6 5 ...
## $ potass : num 280 135 320 330 -1 70 30 100 125 190 ...
## $ vitamins: num 25 0 25 25 25 25 25 25 25 25 ...
## $ shelf : num 3 3 3 3 3 1 2 3 1 3 ...
## $ weight : num 1 1 1 1 1 1 1 1.33 1 1 ...
## $ cups : num 0.33 1 0.33 0.5 0.75 0.75 1 0.75 0.67 0.67 ...
## $ rating : num 68.4 34 59.4 93.7 34.4 ...
Retrieve the column names of the dataset
colnames(cereal)
## [1] "name" "mfr" "type" "calories" "protein" "fat"
## [7] "sodium" "fiber" "carbo" "sugars" "potass" "vitamins"
## [13] "shelf" "weight" "cups" "rating"
#view the header and first few rows of the data frame
head(cereal)
## name mfr type calories protein fat sodium fiber carbo
## 1 100% Bran N C 70 4 1 130 10.0 5.0
## 2 100% Natural Bran Q C 120 3 5 15 2.0 8.0
## 3 All-Bran K C 70 4 1 260 9.0 7.0
## 4 All-Bran with Extra Fiber K C 50 4 0 140 14.0 8.0
## 5 Almond Delight R C 110 2 2 200 1.0 14.0
## 6 Apple Cinnamon Cheerios G C 110 2 2 180 1.5 10.5
## sugars potass vitamins shelf weight cups rating
## 1 6 280 25 3 1 0.33 68.40297
## 2 8 135 0 3 1 1.00 33.98368
## 3 5 320 25 3 1 0.33 59.42551
## 4 0 330 25 3 1 0.50 93.70491
## 5 8 -1 25 3 1 0.75 34.38484
## 6 10 70 25 1 1 0.75 29.50954
Check data frame for NULL and NA values
#check NULL value of vector
is.null(cereal)
## [1] FALSE
any(is.na(cereal))
## [1] FALSE
Check the variables numeric summary of the data (minimum, median, mean, and maximum) values of the independent and dependent variables:
summary(cereal)
## name mfr type calories
## Length:77 Length:77 Length:77 Min. : 50.0
## Class :character Class :character Class :character 1st Qu.:100.0
## Mode :character Mode :character Mode :character Median :110.0
## Mean :106.9
## 3rd Qu.:110.0
## Max. :160.0
## protein fat sodium fiber
## Min. :1.000 Min. :0.000 Min. : 0.0 Min. : 0.000
## 1st Qu.:2.000 1st Qu.:0.000 1st Qu.:130.0 1st Qu.: 1.000
## Median :3.000 Median :1.000 Median :180.0 Median : 2.000
## Mean :2.545 Mean :1.013 Mean :159.7 Mean : 2.152
## 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:210.0 3rd Qu.: 3.000
## Max. :6.000 Max. :5.000 Max. :320.0 Max. :14.000
## carbo sugars potass vitamins
## Min. :-1.0 Min. :-1.000 Min. : -1.00 Min. : 0.00
## 1st Qu.:12.0 1st Qu.: 3.000 1st Qu.: 40.00 1st Qu.: 25.00
## Median :14.0 Median : 7.000 Median : 90.00 Median : 25.00
## Mean :14.6 Mean : 6.922 Mean : 96.08 Mean : 28.25
## 3rd Qu.:17.0 3rd Qu.:11.000 3rd Qu.:120.00 3rd Qu.: 25.00
## Max. :23.0 Max. :15.000 Max. :330.00 Max. :100.00
## shelf weight cups rating
## Min. :1.000 Min. :0.50 Min. :0.250 Min. :18.04
## 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:0.670 1st Qu.:33.17
## Median :2.000 Median :1.00 Median :0.750 Median :40.40
## Mean :2.208 Mean :1.03 Mean :0.821 Mean :42.67
## 3rd Qu.:3.000 3rd Qu.:1.00 3rd Qu.:1.000 3rd Qu.:50.83
## Max. :3.000 Max. :1.50 Max. :1.500 Max. :93.70
#Change variable to data type "factor"
cereal %>% mutate(name = factor(name),
mfr = factor(mfr),
type = factor(type)) -> cereal.f
Check the data type conversion: character changed to factor for statistical operations
#view the header and first few rows of the data frame
head(cereal.f, 6)
## name mfr type calories protein fat sodium fiber carbo
## 1 100% Bran N C 70 4 1 130 10.0 5.0
## 2 100% Natural Bran Q C 120 3 5 15 2.0 8.0
## 3 All-Bran K C 70 4 1 260 9.0 7.0
## 4 All-Bran with Extra Fiber K C 50 4 0 140 14.0 8.0
## 5 Almond Delight R C 110 2 2 200 1.0 14.0
## 6 Apple Cinnamon Cheerios G C 110 2 2 180 1.5 10.5
## sugars potass vitamins shelf weight cups rating
## 1 6 280 25 3 1 0.33 68.40297
## 2 8 135 0 3 1 1.00 33.98368
## 3 5 320 25 3 1 0.33 59.42551
## 4 0 330 25 3 1 0.50 93.70491
## 5 8 -1 25 3 1 0.75 34.38484
## 6 10 70 25 1 1 0.75 29.50954
# view individual variable attribute
print(attributes(cereal.f$mfr))
## $levels
## [1] "A" "G" "K" "N" "P" "Q" "R"
##
## $class
## [1] "factor"
summary(cereal.f)
## name mfr type calories protein
## 100% Bran : 1 A: 1 C:74 Min. : 50.0 Min. :1.000
## 100% Natural Bran : 1 G:22 H: 3 1st Qu.:100.0 1st Qu.:2.000
## All-Bran : 1 K:23 Median :110.0 Median :3.000
## All-Bran with Extra Fiber: 1 N: 6 Mean :106.9 Mean :2.545
## Almond Delight : 1 P: 9 3rd Qu.:110.0 3rd Qu.:3.000
## Apple Cinnamon Cheerios : 1 Q: 8 Max. :160.0 Max. :6.000
## (Other) :71 R: 8
## fat sodium fiber carbo
## Min. :0.000 Min. : 0.0 Min. : 0.000 Min. :-1.0
## 1st Qu.:0.000 1st Qu.:130.0 1st Qu.: 1.000 1st Qu.:12.0
## Median :1.000 Median :180.0 Median : 2.000 Median :14.0
## Mean :1.013 Mean :159.7 Mean : 2.152 Mean :14.6
## 3rd Qu.:2.000 3rd Qu.:210.0 3rd Qu.: 3.000 3rd Qu.:17.0
## Max. :5.000 Max. :320.0 Max. :14.000 Max. :23.0
##
## sugars potass vitamins shelf
## Min. :-1.000 Min. : -1.00 Min. : 0.00 Min. :1.000
## 1st Qu.: 3.000 1st Qu.: 40.00 1st Qu.: 25.00 1st Qu.:1.000
## Median : 7.000 Median : 90.00 Median : 25.00 Median :2.000
## Mean : 6.922 Mean : 96.08 Mean : 28.25 Mean :2.208
## 3rd Qu.:11.000 3rd Qu.:120.00 3rd Qu.: 25.00 3rd Qu.:3.000
## Max. :15.000 Max. :330.00 Max. :100.00 Max. :3.000
##
## weight cups rating
## Min. :0.50 Min. :0.250 Min. :18.04
## 1st Qu.:1.00 1st Qu.:0.670 1st Qu.:33.17
## Median :1.00 Median :0.750 Median :40.40
## Mean :1.03 Mean :0.821 Mean :42.67
## 3rd Qu.:1.00 3rd Qu.:1.000 3rd Qu.:50.83
## Max. :1.50 Max. :1.500 Max. :93.70
##
Check dataframe, cereal_num to identify the average score, dispersion of the data, signs of skewness (normal distribution), and show outliers. The box plots indicates there are differences between groups, some outliers in a group, and most likely negative skewed distrutions.
boxplot(cereal.f, col = rainbow(ncol(cereal.f)), las = 2)
The variable outliners for this dataset is dropped because the outliers will not change the linear model results but may affect the assumptions.
cereal.f %>% mutate_if(is.numeric, scale) %>% filter_if(is.numeric, any_vars(. > -3.29 | . < 3.29 )) -> cereal.outlierremoved
summary(cereal.outlierremoved)
## name mfr type calories.V1
## 100% Bran : 1 A: 1 C:74 Min. :-2.9194605
## 100% Natural Bran : 1 G:22 H: 3 1st Qu.:-0.3532681
## All-Bran : 1 K:23 Median : 0.1599704
## All-Bran with Extra Fiber: 1 N: 6 Mean : 0.0000000
## Almond Delight : 1 P: 9 3rd Qu.: 0.1599704
## Apple Cinnamon Cheerios : 1 Q: 8 Max. : 2.7261629
## (Other) :71 R: 8
## protein.V1 fat.V1 sodium.V1
## Min. :-1.4116451 Min. :-1.006473 Min. :-1.9046994
## 1st Qu.:-0.4982277 1st Qu.:-1.006473 1st Qu.:-0.3539844
## Median : 0.4151897 Median :-0.012903 Median : 0.2424445
## Mean : 0.0000000 Mean : 0.000000 Mean : 0.0000000
## 3rd Qu.: 0.4151897 3rd Qu.: 0.980666 3rd Qu.: 0.6003018
## Max. : 3.1554419 Max. : 3.961373 Max. : 1.9124453
##
## fiber.V1 carbo.V1 sugars.V1
## Min. :-0.902904 Min. :-3.645142 Min. :-1.7822907
## 1st Qu.:-0.483329 1st Qu.:-0.607018 1st Qu.:-0.8823800
## Median :-0.063754 Median :-0.139614 Median : 0.0175307
## Mean : 0.000000 Mean : 0.000000 Mean : 0.0000000
## 3rd Qu.: 0.355821 3rd Qu.: 0.561491 3rd Qu.: 0.9174414
## Max. : 4.971147 Max. : 1.963703 Max. : 1.8173522
##
## potass.V1 vitamins.V1 shelf.V1
## Min. :-1.361794 Min. :-1.264260 Min. :-1.4507595
## 1st Qu.:-0.786652 1st Qu.:-0.145317 1st Qu.:-1.4507595
## Median :-0.085260 Median :-0.145317 Median :-0.2495930
## Mean : 0.000000 Mean : 0.000000 Mean : 0.0000000
## 3rd Qu.: 0.335575 3rd Qu.:-0.145317 3rd Qu.: 0.9515734
## Max. : 3.281421 Max. : 3.211511 Max. : 0.9515734
##
## weight.V1 cups.V1 rating.V1
## Min. :-3.519548 Min. :-2.4538004 Min. :-1.752855
## 1st Qu.:-0.196777 1st Qu.:-0.6490266 1st Qu.:-0.675690
## Median :-0.196777 Median :-0.3052601 Median :-0.161276
## Mean : 0.000000 Mean : 0.0000000 Mean : 0.000000
## 3rd Qu.:-0.196777 3rd Qu.: 0.7690100 3rd Qu.: 0.581086
## Max. : 3.125994 Max. : 2.9175503 Max. : 3.633385
##
boxplot(select_if(cereal.outlierremoved, is.numeric), col = rainbow(ncol(cereal.f)), las = 2)
The pairs() function provides a plot matrix, consisting of scatterplots for each variable-combination of all the data in the cereal dataframe.
The pairwise combination plot shows: * the data frame names of the numeric variables diagonally * the other cells of the plot matrix show a scatterplot (i.e. correlation plot) of each variable combination * the left figure in second row illustrates the correlation between calories and protein and so on …
#use select_if() function to select only numeric variables
pairs(cereal.outlierremoved, gap = 0.5, main = "Pairs matrix", pch = 21,
bg = c("red", "green3", "blue", "yellow"), upper.panel = NULL)
View the linear model and plot, then identify all possible numeric variable predicators that can be included in the multi-factor regression model.
Compute linear model
cereallm <- lm(rating ~ calories + protein + fat + sodium + fiber + carbo + sugars + potass + vitamins + shelf + weight + cups, data = cereal.outlierremoved)
View the information about the linear model
summary(cereallm)
##
## Call:
## lm(formula = rating ~ calories + protein + fat + sodium + fiber +
## carbo + sugars + potass + vitamins + shelf + weight + cups,
## data = cereal.outlierremoved)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.732e-08 -1.835e-08 3.305e-09 1.611e-08 4.027e-08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.849e-17 2.470e-09 0.000e+00 1.000
## calories -3.089e-01 7.855e-09 -3.933e+07 <2e-16 ***
## protein 2.551e-01 3.969e-09 6.428e+07 <2e-16 ***
## fat -1.212e-01 4.461e-09 -2.717e+07 <2e-16 ***
## sodium -3.252e-01 2.961e-09 -1.098e+08 <2e-16 ***
## fiber 5.842e-01 7.310e-09 7.992e+07 <2e-16 ***
## carbo 3.328e-01 5.309e-09 6.268e+07 <2e-16 ***
## sugars -2.294e-01 5.754e-09 -3.986e+07 <2e-16 ***
## potass -1.725e-01 7.477e-09 -2.307e+07 <2e-16 ***
## vitamins -8.145e-02 3.066e-09 -2.657e+07 <2e-16 ***
## shelf -2.205e-09 3.132e-09 -7.040e-01 0.484
## weight -4.604e-09 5.577e-09 -8.260e-01 0.412
## cups 2.284e-09 3.187e-09 7.170e-01 0.476
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.167e-08 on 64 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.349e+16 on 12 and 64 DF, p-value: < 2.2e-16
qqnorm(resid(cereallm), col = 'green4')
qqline(resid(cereallm), col = 'red', lwd = 2, lty = 2)
Backward Elimination Process A process to help decide which predictors to keep in the model and which to exclude. Elimination Process: (See earlier/above process) 1. Start with all possible predictors 2. Use the lm() to compute the model (Process continued below) 3. Use the summary() function to find each predictor’s significance level with the largest p-value 4. Remove that predictor with a value larger than the predetermined significance threshold, p = 0.05 5. Repeat the process until the significance levels of all the predictors remaining in the model are below the threshold, p = 0.05
Identify the the predictor with the largest p-value that exceeds our predetermined threshold of p = 0.05.
Use the update( ) function to eliminate a given predicator and recompute the model in one step.
Removing predictor, shelf from the model
cereallm <- update(cereallm, .~. - shelf,
data = cereal.outlierremoved)
summary(cereallm)
##
## Call:
## lm(formula = rating ~ calories + protein + fat + sodium + fiber +
## carbo + sugars + potass + vitamins + weight + cups, data = cereal.outlierremoved)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.602e-08 -1.821e-08 3.411e-09 1.605e-08 4.033e-08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.849e-17 2.460e-09 0.000e+00 1.000
## calories -3.089e-01 7.819e-09 -3.951e+07 <2e-16 ***
## protein 2.551e-01 3.949e-09 6.460e+07 <2e-16 ***
## fat -1.212e-01 4.335e-09 -2.795e+07 <2e-16 ***
## sodium -3.252e-01 2.875e-09 -1.131e+08 <2e-16 ***
## fiber 5.842e-01 7.279e-09 8.027e+07 <2e-16 ***
## carbo 3.328e-01 5.203e-09 6.395e+07 <2e-16 ***
## sugars -2.294e-01 5.698e-09 -4.025e+07 <2e-16 ***
## potass -1.725e-01 7.393e-09 -2.333e+07 <2e-16 ***
## vitamins -8.145e-02 2.791e-09 -2.919e+07 <2e-16 ***
## weight -4.206e-09 5.526e-09 -7.610e-01 0.449
## cups 2.885e-09 3.058e-09 9.430e-01 0.349
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.159e-08 on 65 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.483e+16 on 11 and 65 DF, p-value: < 2.2e-16
Removing predictor, weight from the model
cereallm <- update(cereallm, .~. - weight,
data = cereal.outlierremoved)
summary(cereallm)
##
## Call:
## lm(formula = rating ~ calories + protein + fat + sodium + fiber +
## carbo + sugars + potass + vitamins + cups, data = cereal.outlierremoved)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.689e-08 -1.763e-08 2.799e-09 1.490e-08 4.047e-08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.849e-17 2.452e-09 0.000e+00 1.000
## calories -3.089e-01 6.120e-09 -5.048e+07 <2e-16 ***
## protein 2.551e-01 3.935e-09 6.483e+07 <2e-16 ***
## fat -1.212e-01 3.974e-09 -3.049e+07 <2e-16 ***
## sodium -3.252e-01 2.855e-09 -1.139e+08 <2e-16 ***
## fiber 5.842e-01 7.186e-09 8.131e+07 <2e-16 ***
## carbo 3.328e-01 5.186e-09 6.417e+07 <2e-16 ***
## sugars -2.294e-01 5.675e-09 -4.042e+07 <2e-16 ***
## potass -1.725e-01 7.130e-09 -2.420e+07 <2e-16 ***
## vitamins -8.145e-02 2.760e-09 -2.952e+07 <2e-16 ***
## cups 3.265e-09 3.008e-09 1.086e+00 0.282
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.152e-08 on 66 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.642e+16 on 10 and 66 DF, p-value: < 2.2e-16
Removing predictor, cups from the model
cereallm <- update(cereallm, .~. - cups,
data = cereal.outlierremoved)
summary(cereallm)
##
## Call:
## lm(formula = rating ~ calories + protein + fat + sodium + fiber +
## carbo + sugars + potass + vitamins, data = cereal.outlierremoved)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.638e-08 -1.813e-08 2.278e-09 1.737e-08 4.040e-08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.849e-17 2.455e-09 0 1
## calories -3.089e-01 6.098e-09 -50659136 <2e-16 ***
## protein 2.551e-01 3.927e-09 64965144 <2e-16 ***
## fat -1.212e-01 3.978e-09 -30461632 <2e-16 ***
## sodium -3.252e-01 2.858e-09 -113771893 <2e-16 ***
## fiber 5.842e-01 7.072e-09 82611577 <2e-16 ***
## carbo 3.328e-01 5.138e-09 64770284 <2e-16 ***
## sugars -2.294e-01 5.668e-09 -40465741 <2e-16 ***
## potass -1.725e-01 7.135e-09 -24177294 <2e-16 ***
## vitamins -8.145e-02 2.753e-09 -29585288 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.155e-08 on 67 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.819e+16 on 9 and 67 DF, p-value: < 2.2e-16
Now, the p-values for all the predictors are less than 0.02, which is less than our predetermined threshold of 0.05. Now, the backward elimination process is complete.
The following predictors belong in the final model for cereallm: calories / protein / fat / sodium / fiber / carbo / sugars/ potass / vitamins
*The adjusted R-square remained at 1 during the elimination process indicating those removed predictors had no predictive value for the model.
Check the valadity for the assumptions used to develop the model with residual analysis techniques.
Examine the residual values to see the model’s quality. The residual values plot will show the data distribution and we would expect to see a uniform distribution around zero for a well-fitted model.
#residual test for a well-fitted model
plot(fitted(cereallm), resid(cereallm), col = "blue4") +
abline(0,0, col='red') #add a horizontal line at 0
## integer(0)
The x-axis displays the fitted values and the y-axis displays the residuals. The plot displays the spread of the residuals tends to increase as we move to the right. The residuals are relatively uniformly scattered in the middle but shows heavy tails. Overall, this residual plot would indicate the variables are a good predictor for the rating.
qqnorm(resid(cereallm), col = 'green4')
qqline(resid(cereallm), col = 'red', lwd = 2, lty = 2)
View the components with residuals for linearity. The plot shows linearity for all components for the dataset, except vitamins.
library(car)
crPlots(cereallm)
Fit a simple linear regression model
The Component + Residual Plots shows variables: protein, fiber, and carbo, will be a potential good fit to perform a quadratic regression model.
*The linear regression model will show the relationship between ratings of the cereal and protein content of the cereal for 77 observations.
#fit linear model
linear.Model <- lm(rating ~ protein, data = cereal.outlierremoved)
#view model summary
summary(linear.Model)
##
## Call:
## lm(formula = rating ~ protein, data = cereal.outlierremoved)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2322 -0.5960 -0.2781 0.4621 3.0081
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.870e-17 1.012e-01 0.000 1
## protein 4.706e-01 1.019e-01 4.619 1.57e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8882 on 75 degrees of freedom
## Multiple R-squared: 0.2215, Adjusted R-squared: 0.2111
## F-statistic: 21.34 on 1 and 75 DF, p-value: 1.566e-05
The total variance in protein on rating a cereal is: 22.15%, as shown by the value for Multiple R-squared.
Relationship between predictor and response as a mathematical formula:
\[ rating = (3.870e-17) + (4.706e-01) * protein\]
The quadratic regresssion model (linear model), where one of which is the square of the the other.
Create a variable Protein2 which is the square of the variable Protein.
\[ Protein2 \longleftarrow Protein^2\]
#create a new variable for protein
cereal.outlierremoved$protein2 <- cereal.outlierremoved$protein^2
#fit quadratic regression model
quadratic.Model <- lm(rating ~ protein + protein2, data = cereal.outlierremoved)
#view model summary
summary(quadratic.Model)
##
## Call:
## lm(formula = rating ~ protein + protein2, data = cereal.outlierremoved)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3233 -0.5491 -0.1590 0.4038 2.9870
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1115 0.1181 0.944 0.3483
## protein 0.5527 0.1107 4.992 3.85e-06 ***
## protein2 -0.1130 0.0640 -1.765 0.0816 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8759 on 74 degrees of freedom
## Multiple R-squared: 0.2529, Adjusted R-squared: 0.2328
## F-statistic: 12.53 on 2 and 74 DF, p-value: 2.061e-05
The total variance in protein on rating a cereal increased to 25.29%, explaining an additional 3.19% of the variance.
Based on the coefficients shown, the fitted quadratic regression would be:
\[rating = 0.1115 - 0.1130(protein)^2 + 0.5527(protein)\]
Multi-Linear Regression Model
linear.Model2 <- lm(rating ~ calories + protein + protein2 + fat + sodium + fiber + carbo + sugars + potass + vitamins, data = cereal.outlierremoved)
summary(linear.Model2)
##
## Call:
## lm(formula = rating ~ calories + protein + protein2 + fat + sodium +
## fiber + carbo + sugars + potass + vitamins, data = cereal.outlierremoved)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.436e-08 -1.523e-08 2.529e-09 1.364e-08 4.127e-08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.282e-09 2.993e-09 -1.431e+00 0.1573
## calories -3.089e-01 5.904e-09 -5.233e+07 <2e-16 ***
## protein 2.551e-01 4.363e-09 5.847e+07 <2e-16 ***
## protein2 4.338e-09 1.844e-09 2.353e+00 0.0216 *
## fat -1.212e-01 3.887e-09 -3.118e+07 <2e-16 ***
## sodium -3.252e-01 2.949e-09 -1.103e+08 <2e-16 ***
## fiber 5.842e-01 6.903e-09 8.464e+07 <2e-16 ***
## carbo 3.328e-01 5.173e-09 6.433e+07 <2e-16 ***
## sugars -2.294e-01 5.494e-09 -4.175e+07 <2e-16 ***
## potass -1.725e-01 6.959e-09 -2.479e+07 <2e-16 ***
## vitamins -8.145e-02 2.701e-09 -3.015e+07 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.085e-08 on 66 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.748e+16 on 10 and 66 DF, p-value: < 2.2e-16
The total variance in the model to explain cereal rating is: 1.
Create a dichotomous variable where each level of the categorical variable is contrasted to a specified reference level.
cereallm.mfr <- lm(rating ~ mfr, data = cereal.outlierremoved)
summary(cereallm.mfr)
##
## Call:
## lm(formula = rating ~ mfr, data = cereal.outlierremoved)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.771 -0.535 -0.007 0.421 3.536
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.8674 0.8281 1.047 0.2985
## mfrG -1.4498 0.8467 -1.712 0.0913 .
## mfrK -0.7697 0.8459 -0.910 0.3660
## mfrN 0.9338 0.8945 1.044 0.3001
## mfrP -0.9358 0.8729 -1.072 0.2874
## mfrQ -0.8496 0.8784 -0.967 0.3367
## mfrR -0.9474 0.8784 -1.079 0.2845
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8281 on 70 degrees of freedom
## Multiple R-squared: 0.3684, Adjusted R-squared: 0.3142
## F-statistic: 6.804 on 6 and 70 DF, p-value: 1.032e-05
The ANOVA test indicates the groups are different, F-value, because F values are above 1.
# Compute the model
model.dv <- lm(rating ~ mfr + calories + protein2, data = cereal.outlierremoved)
Anova(model.dv)
## Anova Table (Type II tests)
##
## Response: rating
## Sum Sq Df F value Pr(>F)
## mfr 14.6427 6 6.5823 1.624e-05 ***
## calories 22.2962 1 60.1368 6.109e-11 ***
## protein2 0.4217 1 1.1373 0.29
## Residuals 25.2115 68
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
dictcont <- lm(rating ~ mfr*calories, data = cereal.outlierremoved)
summary(dictcont)
##
## Call:
## lm(formula = rating ~ mfr * calories, data = cereal.outlierremoved)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9492 -0.4941 0.0000 0.3691 1.4188
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.7675 0.6133 1.251 0.2153
## mfrG -1.2409 0.6295 -1.971 0.0530 .
## mfrK -0.6044 0.6264 -0.965 0.3382
## mfrN 0.9104 0.8500 1.071 0.2882
## mfrP -0.7914 0.6474 -1.222 0.2260
## mfrQ -1.1717 0.6567 -1.784 0.0791 .
## mfrR -0.7296 0.6640 -1.099 0.2760
## calories -0.2829 0.1979 -1.430 0.1577
## mfrG:calories -0.1906 0.3187 -0.598 0.5519
## mfrK:calories -0.4198 0.2283 -1.839 0.0706 .
## mfrN:calories 0.1640 0.5508 0.298 0.7668
## mfrP:calories -0.1491 0.4446 -0.335 0.7385
## mfrQ:calories -0.4091 0.2503 -1.635 0.1070
## mfrR:calories NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6093 on 64 degrees of freedom
## Multiple R-squared: 0.6874, Adjusted R-squared: 0.6287
## F-statistic: 11.73 on 12 and 64 DF, p-value: 4.979e-12
Plot the regression lines to understanding the interaction effects using interaction_plot function.
The plot indicates the cereal rating decrease when cereal calories increase, and vice versa.**
library(interactions)
interact_plot(dictcont, pred = calories, modx = mfr)
The plot check
linearity between the predicted lines (black) and a loess line (red) within each group. The red loess line should be straight like the predicted line to support linearity.
The plot show only one group, mfr = N, that is linear, therefore the assumption of linear is not satisfied because the other groups are nonlinear.
interact_plot(dictcont, pred = calories, modx = mfr, linearity.check = TRUE, plot.points = TRUE)
allModels <- lm(rating ~ mfr*calories + protein + protein2 + fat + sodium + fiber + carbo + sugars + potass + vitamins, data = cereal.outlierremoved)
summary(allModels)
##
## Call:
## lm(formula = rating ~ mfr * calories + protein + protein2 + fat +
## sodium + fiber + carbo + sugars + potass + vitamins, data = cereal.outlierremoved)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.379e-08 -1.468e-08 1.008e-09 1.355e-08 4.436e-08
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.329e-08 2.559e-08 -9.100e-01 0.3667
## mfrG 1.652e-08 2.633e-08 6.270e-01 0.5330
## mfrK 1.745e-08 2.677e-08 6.520e-01 0.5171
## mfrN 1.263e-08 3.431e-08 3.680e-01 0.7142
## mfrP 3.412e-08 2.695e-08 1.266e+00 0.2109
## mfrQ 9.677e-09 2.697e-08 3.590e-01 0.7211
## mfrR 2.040e-08 2.789e-08 7.310e-01 0.4676
## calories -3.089e-01 9.905e-09 -3.119e+07 <2e-16 ***
## protein 2.551e-01 4.746e-09 5.375e+07 <2e-16 ***
## protein2 4.357e-09 1.970e-09 2.212e+00 0.0312 *
## fat -1.212e-01 4.607e-09 -2.630e+07 <2e-16 ***
## sodium -3.252e-01 4.135e-09 -7.865e+07 <2e-16 ***
## fiber 5.842e-01 9.363e-09 6.240e+07 <2e-16 ***
## carbo 3.328e-01 6.003e-09 5.544e+07 <2e-16 ***
## sugars -2.294e-01 5.898e-09 -3.889e+07 <2e-16 ***
## potass -1.725e-01 8.287e-09 -2.082e+07 <2e-16 ***
## vitamins -8.145e-02 2.985e-09 -2.729e+07 <2e-16 ***
## mfrG:calories -1.955e-09 1.166e-08 -1.680e-01 0.8675
## mfrK:calories -4.820e-09 8.816e-09 -5.470e-01 0.5868
## mfrN:calories -1.518e-08 2.242e-08 -6.770e-01 0.5014
## mfrP:calories -1.316e-08 1.626e-08 -8.090e-01 0.4218
## mfrQ:calories -2.437e-09 9.879e-09 -2.470e-01 0.8061
## mfrR:calories NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.155e-08 on 55 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 7.79e+15 on 21 and 55 DF, p-value: < 2.2e-16
View the dataset residuals
summary(allModels$residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.379e-08 -1.468e-08 1.008e-09 0.000e+00 1.355e-08 4.435e-08
plot(predict(allModels), residuals(allModels), col=c("blue", "red"))
abline(h=0, lty=2, col="red")
qqnorm(resid(allModels), col = 'green4')
qqline(resid(allModels), col = 'red', lwd = 2, lty = 2)
The dataset was modeled three different ways: quadratic term, dichotomous term, and dichotomous vs. quantitative interaction term.
Each regression model indicated a significant positive relationship between predictor variables and reponse (rating) variable. The p-value for all models were p-value: <2.2e-16, and significant relationships among all models are the same.
In the backward selection, potential predictors continued to show significance among the models. These predictors (calories, protein, fat, sodium, fiber, carbo, sugars, potass, and vitamins) including the quadratic term (protein2) have an impact on ratings for cereal selection.
References:
Lilja, D. J. (2017, August 8). Linear Regression Using R: An Introduction to Data Modeling, 2nd Edition. University Digital Conservancy Home. https://conservancy.umn.edu/handle/11299/189222