Load Library

library(tidyverse)

Dataset:

The dataset was retrieved from Kaggle, 80 Cereals.

This dataset contains the nutritional content of 77 cereals and their ratings from Consumer Reports (unverified).

Understanding the dataset

The dataset contains 16 columns (variables) and 77 rows (observations).

cereal <- data.frame(df)

#retrieve the number of rows and columns in an object
dim(cereal)
## [1] 77 16

View the structure (str() function) of the data frame. The types of variables shown in the dataset are: numerical and categorical.

str(cereal)
## 'data.frame':    77 obs. of  16 variables:
##  $ name    : chr  "100% Bran" "100% Natural Bran" "All-Bran" "All-Bran with Extra Fiber" ...
##  $ mfr     : chr  "N" "Q" "K" "K" ...
##  $ type    : chr  "C" "C" "C" "C" ...
##  $ calories: num  70 120 70 50 110 110 110 130 90 90 ...
##  $ protein : num  4 3 4 4 2 2 2 3 2 3 ...
##  $ fat     : num  1 5 1 0 2 2 0 2 1 0 ...
##  $ sodium  : num  130 15 260 140 200 180 125 210 200 210 ...
##  $ fiber   : num  10 2 9 14 1 1.5 1 2 4 5 ...
##  $ carbo   : num  5 8 7 8 14 10.5 11 18 15 13 ...
##  $ sugars  : num  6 8 5 0 8 10 14 8 6 5 ...
##  $ potass  : num  280 135 320 330 -1 70 30 100 125 190 ...
##  $ vitamins: num  25 0 25 25 25 25 25 25 25 25 ...
##  $ shelf   : num  3 3 3 3 3 1 2 3 1 3 ...
##  $ weight  : num  1 1 1 1 1 1 1 1.33 1 1 ...
##  $ cups    : num  0.33 1 0.33 0.5 0.75 0.75 1 0.75 0.67 0.67 ...
##  $ rating  : num  68.4 34 59.4 93.7 34.4 ...

Retrieve the column names of the dataset

colnames(cereal)
##  [1] "name"     "mfr"      "type"     "calories" "protein"  "fat"     
##  [7] "sodium"   "fiber"    "carbo"    "sugars"   "potass"   "vitamins"
## [13] "shelf"    "weight"   "cups"     "rating"
#view the header and first few rows of the data frame
head(cereal)
##                        name mfr type calories protein fat sodium fiber carbo
## 1                 100% Bran   N    C       70       4   1    130  10.0   5.0
## 2         100% Natural Bran   Q    C      120       3   5     15   2.0   8.0
## 3                  All-Bran   K    C       70       4   1    260   9.0   7.0
## 4 All-Bran with Extra Fiber   K    C       50       4   0    140  14.0   8.0
## 5            Almond Delight   R    C      110       2   2    200   1.0  14.0
## 6   Apple Cinnamon Cheerios   G    C      110       2   2    180   1.5  10.5
##   sugars potass vitamins shelf weight cups   rating
## 1      6    280       25     3      1 0.33 68.40297
## 2      8    135        0     3      1 1.00 33.98368
## 3      5    320       25     3      1 0.33 59.42551
## 4      0    330       25     3      1 0.50 93.70491
## 5      8     -1       25     3      1 0.75 34.38484
## 6     10     70       25     1      1 0.75 29.50954

Check data frame for NULL and NA values

#check NULL value of vector
is.null(cereal)
## [1] FALSE
any(is.na(cereal))
## [1] FALSE

Check the variables numeric summary of the data (minimum, median, mean, and maximum) values of the independent and dependent variables:

summary(cereal)
##      name               mfr                type              calories    
##  Length:77          Length:77          Length:77          Min.   : 50.0  
##  Class :character   Class :character   Class :character   1st Qu.:100.0  
##  Mode  :character   Mode  :character   Mode  :character   Median :110.0  
##                                                           Mean   :106.9  
##                                                           3rd Qu.:110.0  
##                                                           Max.   :160.0  
##     protein           fat            sodium          fiber       
##  Min.   :1.000   Min.   :0.000   Min.   :  0.0   Min.   : 0.000  
##  1st Qu.:2.000   1st Qu.:0.000   1st Qu.:130.0   1st Qu.: 1.000  
##  Median :3.000   Median :1.000   Median :180.0   Median : 2.000  
##  Mean   :2.545   Mean   :1.013   Mean   :159.7   Mean   : 2.152  
##  3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:210.0   3rd Qu.: 3.000  
##  Max.   :6.000   Max.   :5.000   Max.   :320.0   Max.   :14.000  
##      carbo          sugars           potass          vitamins     
##  Min.   :-1.0   Min.   :-1.000   Min.   : -1.00   Min.   :  0.00  
##  1st Qu.:12.0   1st Qu.: 3.000   1st Qu.: 40.00   1st Qu.: 25.00  
##  Median :14.0   Median : 7.000   Median : 90.00   Median : 25.00  
##  Mean   :14.6   Mean   : 6.922   Mean   : 96.08   Mean   : 28.25  
##  3rd Qu.:17.0   3rd Qu.:11.000   3rd Qu.:120.00   3rd Qu.: 25.00  
##  Max.   :23.0   Max.   :15.000   Max.   :330.00   Max.   :100.00  
##      shelf           weight          cups           rating     
##  Min.   :1.000   Min.   :0.50   Min.   :0.250   Min.   :18.04  
##  1st Qu.:1.000   1st Qu.:1.00   1st Qu.:0.670   1st Qu.:33.17  
##  Median :2.000   Median :1.00   Median :0.750   Median :40.40  
##  Mean   :2.208   Mean   :1.03   Mean   :0.821   Mean   :42.67  
##  3rd Qu.:3.000   3rd Qu.:1.00   3rd Qu.:1.000   3rd Qu.:50.83  
##  Max.   :3.000   Max.   :1.50   Max.   :1.500   Max.   :93.70

Variable Identification / Restructure data set

#Change variable to data type "factor"
cereal %>% mutate(name = factor(name), 
                 mfr = factor(mfr),
                 type = factor(type)) -> cereal.f

Check the data type conversion: character changed to factor for statistical operations

#view the header and first few rows of the data frame
head(cereal.f, 6)
##                        name mfr type calories protein fat sodium fiber carbo
## 1                 100% Bran   N    C       70       4   1    130  10.0   5.0
## 2         100% Natural Bran   Q    C      120       3   5     15   2.0   8.0
## 3                  All-Bran   K    C       70       4   1    260   9.0   7.0
## 4 All-Bran with Extra Fiber   K    C       50       4   0    140  14.0   8.0
## 5            Almond Delight   R    C      110       2   2    200   1.0  14.0
## 6   Apple Cinnamon Cheerios   G    C      110       2   2    180   1.5  10.5
##   sugars potass vitamins shelf weight cups   rating
## 1      6    280       25     3      1 0.33 68.40297
## 2      8    135        0     3      1 1.00 33.98368
## 3      5    320       25     3      1 0.33 59.42551
## 4      0    330       25     3      1 0.50 93.70491
## 5      8     -1       25     3      1 0.75 34.38484
## 6     10     70       25     1      1 0.75 29.50954
# view individual variable attribute
print(attributes(cereal.f$mfr))
## $levels
## [1] "A" "G" "K" "N" "P" "Q" "R"
## 
## $class
## [1] "factor"
summary(cereal.f)
##                         name    mfr    type      calories        protein     
##  100% Bran                : 1   A: 1   C:74   Min.   : 50.0   Min.   :1.000  
##  100% Natural Bran        : 1   G:22   H: 3   1st Qu.:100.0   1st Qu.:2.000  
##  All-Bran                 : 1   K:23          Median :110.0   Median :3.000  
##  All-Bran with Extra Fiber: 1   N: 6          Mean   :106.9   Mean   :2.545  
##  Almond Delight           : 1   P: 9          3rd Qu.:110.0   3rd Qu.:3.000  
##  Apple Cinnamon Cheerios  : 1   Q: 8          Max.   :160.0   Max.   :6.000  
##  (Other)                  :71   R: 8                                         
##       fat            sodium          fiber            carbo     
##  Min.   :0.000   Min.   :  0.0   Min.   : 0.000   Min.   :-1.0  
##  1st Qu.:0.000   1st Qu.:130.0   1st Qu.: 1.000   1st Qu.:12.0  
##  Median :1.000   Median :180.0   Median : 2.000   Median :14.0  
##  Mean   :1.013   Mean   :159.7   Mean   : 2.152   Mean   :14.6  
##  3rd Qu.:2.000   3rd Qu.:210.0   3rd Qu.: 3.000   3rd Qu.:17.0  
##  Max.   :5.000   Max.   :320.0   Max.   :14.000   Max.   :23.0  
##                                                                 
##      sugars           potass          vitamins          shelf      
##  Min.   :-1.000   Min.   : -1.00   Min.   :  0.00   Min.   :1.000  
##  1st Qu.: 3.000   1st Qu.: 40.00   1st Qu.: 25.00   1st Qu.:1.000  
##  Median : 7.000   Median : 90.00   Median : 25.00   Median :2.000  
##  Mean   : 6.922   Mean   : 96.08   Mean   : 28.25   Mean   :2.208  
##  3rd Qu.:11.000   3rd Qu.:120.00   3rd Qu.: 25.00   3rd Qu.:3.000  
##  Max.   :15.000   Max.   :330.00   Max.   :100.00   Max.   :3.000  
##                                                                    
##      weight          cups           rating     
##  Min.   :0.50   Min.   :0.250   Min.   :18.04  
##  1st Qu.:1.00   1st Qu.:0.670   1st Qu.:33.17  
##  Median :1.00   Median :0.750   Median :40.40  
##  Mean   :1.03   Mean   :0.821   Mean   :42.67  
##  3rd Qu.:1.00   3rd Qu.:1.000   3rd Qu.:50.83  
##  Max.   :1.50   Max.   :1.500   Max.   :93.70  
## 

Check dataframe, cereal_num to identify the average score, dispersion of the data, signs of skewness (normal distribution), and show outliers. The box plots indicates there are differences between groups, some outliers in a group, and most likely negative skewed distrutions.

boxplot(cereal.f, col = rainbow(ncol(cereal.f)), las = 2)

The variable outliners for this dataset is dropped because the outliers will not change the linear model results but may affect the assumptions.

cereal.f %>% mutate_if(is.numeric, scale) %>% filter_if(is.numeric, any_vars(. > -3.29 | . < 3.29 )) -> cereal.outlierremoved
summary(cereal.outlierremoved)
##                         name    mfr    type       calories.V1     
##  100% Bran                : 1   A: 1   C:74   Min.   :-2.9194605  
##  100% Natural Bran        : 1   G:22   H: 3   1st Qu.:-0.3532681  
##  All-Bran                 : 1   K:23          Median : 0.1599704  
##  All-Bran with Extra Fiber: 1   N: 6          Mean   : 0.0000000  
##  Almond Delight           : 1   P: 9          3rd Qu.: 0.1599704  
##  Apple Cinnamon Cheerios  : 1   Q: 8          Max.   : 2.7261629  
##  (Other)                  :71   R: 8                              
##       protein.V1            fat.V1             sodium.V1      
##  Min.   :-1.4116451   Min.   :-1.006473   Min.   :-1.9046994  
##  1st Qu.:-0.4982277   1st Qu.:-1.006473   1st Qu.:-0.3539844  
##  Median : 0.4151897   Median :-0.012903   Median : 0.2424445  
##  Mean   : 0.0000000   Mean   : 0.000000   Mean   : 0.0000000  
##  3rd Qu.: 0.4151897   3rd Qu.: 0.980666   3rd Qu.: 0.6003018  
##  Max.   : 3.1554419   Max.   : 3.961373   Max.   : 1.9124453  
##                                                               
##       fiber.V1            carbo.V1            sugars.V1      
##  Min.   :-0.902904   Min.   :-3.645142   Min.   :-1.7822907  
##  1st Qu.:-0.483329   1st Qu.:-0.607018   1st Qu.:-0.8823800  
##  Median :-0.063754   Median :-0.139614   Median : 0.0175307  
##  Mean   : 0.000000   Mean   : 0.000000   Mean   : 0.0000000  
##  3rd Qu.: 0.355821   3rd Qu.: 0.561491   3rd Qu.: 0.9174414  
##  Max.   : 4.971147   Max.   : 1.963703   Max.   : 1.8173522  
##                                                              
##       potass.V1          vitamins.V1           shelf.V1      
##  Min.   :-1.361794   Min.   :-1.264260   Min.   :-1.4507595  
##  1st Qu.:-0.786652   1st Qu.:-0.145317   1st Qu.:-1.4507595  
##  Median :-0.085260   Median :-0.145317   Median :-0.2495930  
##  Mean   : 0.000000   Mean   : 0.000000   Mean   : 0.0000000  
##  3rd Qu.: 0.335575   3rd Qu.:-0.145317   3rd Qu.: 0.9515734  
##  Max.   : 3.281421   Max.   : 3.211511   Max.   : 0.9515734  
##                                                              
##       weight.V1            cups.V1             rating.V1     
##  Min.   :-3.519548   Min.   :-2.4538004   Min.   :-1.752855  
##  1st Qu.:-0.196777   1st Qu.:-0.6490266   1st Qu.:-0.675690  
##  Median :-0.196777   Median :-0.3052601   Median :-0.161276  
##  Mean   : 0.000000   Mean   : 0.0000000   Mean   : 0.000000  
##  3rd Qu.:-0.196777   3rd Qu.: 0.7690100   3rd Qu.: 0.581086  
##  Max.   : 3.125994   Max.   : 2.9175503   Max.   : 3.633385  
## 
boxplot(select_if(cereal.outlierremoved, is.numeric), col = rainbow(ncol(cereal.f)), las = 2)

Visualizing the Relationship in the Data

The pairs() function provides a plot matrix, consisting of scatterplots for each variable-combination of all the data in the cereal dataframe.

The pairwise combination plot shows: * the data frame names of the numeric variables diagonally * the other cells of the plot matrix show a scatterplot (i.e. correlation plot) of each variable combination * the left figure in second row illustrates the correlation between calories and protein and so on …

#use select_if() function to select only numeric variables
pairs(cereal.outlierremoved, gap = 0.5, main = "Pairs matrix", pch = 21, 
      bg = c("red", "green3", "blue", "yellow"), upper.panel = NULL) 

Identifying Potential Predictors in a Mult-Factor Regression Model

View the linear model and plot, then identify all possible numeric variable predicators that can be included in the multi-factor regression model.

Using Mult-Factor Regression Model

Compute linear model

cereallm <- lm(rating ~ calories + protein + fat + sodium + fiber + carbo + sugars + potass + vitamins + shelf + weight + cups, data = cereal.outlierremoved)

View the information about the linear model

summary(cereallm)
## 
## Call:
## lm(formula = rating ~ calories + protein + fat + sodium + fiber + 
##     carbo + sugars + potass + vitamins + shelf + weight + cups, 
##     data = cereal.outlierremoved)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -3.732e-08 -1.835e-08  3.305e-09  1.611e-08  4.027e-08 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept) -5.849e-17  2.470e-09  0.000e+00    1.000    
## calories    -3.089e-01  7.855e-09 -3.933e+07   <2e-16 ***
## protein      2.551e-01  3.969e-09  6.428e+07   <2e-16 ***
## fat         -1.212e-01  4.461e-09 -2.717e+07   <2e-16 ***
## sodium      -3.252e-01  2.961e-09 -1.098e+08   <2e-16 ***
## fiber        5.842e-01  7.310e-09  7.992e+07   <2e-16 ***
## carbo        3.328e-01  5.309e-09  6.268e+07   <2e-16 ***
## sugars      -2.294e-01  5.754e-09 -3.986e+07   <2e-16 ***
## potass      -1.725e-01  7.477e-09 -2.307e+07   <2e-16 ***
## vitamins    -8.145e-02  3.066e-09 -2.657e+07   <2e-16 ***
## shelf       -2.205e-09  3.132e-09 -7.040e-01    0.484    
## weight      -4.604e-09  5.577e-09 -8.260e-01    0.412    
## cups         2.284e-09  3.187e-09  7.170e-01    0.476    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.167e-08 on 64 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.349e+16 on 12 and 64 DF,  p-value: < 2.2e-16
qqnorm(resid(cereallm), col = 'green4')
qqline(resid(cereallm), col = 'red', lwd = 2, lty = 2)

Assigning Potential Predictors using Backward Elimination Process

Backward Elimination Process A process to help decide which predictors to keep in the model and which to exclude. Elimination Process: (See earlier/above process) 1. Start with all possible predictors 2. Use the lm() to compute the model (Process continued below) 3. Use the summary() function to find each predictor’s significance level with the largest p-value 4. Remove that predictor with a value larger than the predetermined significance threshold, p = 0.05 5. Repeat the process until the significance levels of all the predictors remaining in the model are below the threshold, p = 0.05

Identify the the predictor with the largest p-value that exceeds our predetermined threshold of p = 0.05.

Use the update( ) function to eliminate a given predicator and recompute the model in one step.

Removing predictor, shelf from the model

cereallm <- update(cereallm, .~. - shelf, 
                data = cereal.outlierremoved)
summary(cereallm)
## 
## Call:
## lm(formula = rating ~ calories + protein + fat + sodium + fiber + 
##     carbo + sugars + potass + vitamins + weight + cups, data = cereal.outlierremoved)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -3.602e-08 -1.821e-08  3.411e-09  1.605e-08  4.033e-08 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept) -5.849e-17  2.460e-09  0.000e+00    1.000    
## calories    -3.089e-01  7.819e-09 -3.951e+07   <2e-16 ***
## protein      2.551e-01  3.949e-09  6.460e+07   <2e-16 ***
## fat         -1.212e-01  4.335e-09 -2.795e+07   <2e-16 ***
## sodium      -3.252e-01  2.875e-09 -1.131e+08   <2e-16 ***
## fiber        5.842e-01  7.279e-09  8.027e+07   <2e-16 ***
## carbo        3.328e-01  5.203e-09  6.395e+07   <2e-16 ***
## sugars      -2.294e-01  5.698e-09 -4.025e+07   <2e-16 ***
## potass      -1.725e-01  7.393e-09 -2.333e+07   <2e-16 ***
## vitamins    -8.145e-02  2.791e-09 -2.919e+07   <2e-16 ***
## weight      -4.206e-09  5.526e-09 -7.610e-01    0.449    
## cups         2.885e-09  3.058e-09  9.430e-01    0.349    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.159e-08 on 65 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.483e+16 on 11 and 65 DF,  p-value: < 2.2e-16

Removing predictor, weight from the model

cereallm <- update(cereallm, .~. - weight, 
                data = cereal.outlierremoved)
summary(cereallm)
## 
## Call:
## lm(formula = rating ~ calories + protein + fat + sodium + fiber + 
##     carbo + sugars + potass + vitamins + cups, data = cereal.outlierremoved)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -3.689e-08 -1.763e-08  2.799e-09  1.490e-08  4.047e-08 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept) -5.849e-17  2.452e-09  0.000e+00    1.000    
## calories    -3.089e-01  6.120e-09 -5.048e+07   <2e-16 ***
## protein      2.551e-01  3.935e-09  6.483e+07   <2e-16 ***
## fat         -1.212e-01  3.974e-09 -3.049e+07   <2e-16 ***
## sodium      -3.252e-01  2.855e-09 -1.139e+08   <2e-16 ***
## fiber        5.842e-01  7.186e-09  8.131e+07   <2e-16 ***
## carbo        3.328e-01  5.186e-09  6.417e+07   <2e-16 ***
## sugars      -2.294e-01  5.675e-09 -4.042e+07   <2e-16 ***
## potass      -1.725e-01  7.130e-09 -2.420e+07   <2e-16 ***
## vitamins    -8.145e-02  2.760e-09 -2.952e+07   <2e-16 ***
## cups         3.265e-09  3.008e-09  1.086e+00    0.282    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.152e-08 on 66 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.642e+16 on 10 and 66 DF,  p-value: < 2.2e-16

Removing predictor, cups from the model

cereallm <- update(cereallm, .~. - cups, 
                data = cereal.outlierremoved)
summary(cereallm)
## 
## Call:
## lm(formula = rating ~ calories + protein + fat + sodium + fiber + 
##     carbo + sugars + potass + vitamins, data = cereal.outlierremoved)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -3.638e-08 -1.813e-08  2.278e-09  1.737e-08  4.040e-08 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept) -5.849e-17  2.455e-09          0        1    
## calories    -3.089e-01  6.098e-09  -50659136   <2e-16 ***
## protein      2.551e-01  3.927e-09   64965144   <2e-16 ***
## fat         -1.212e-01  3.978e-09  -30461632   <2e-16 ***
## sodium      -3.252e-01  2.858e-09 -113771893   <2e-16 ***
## fiber        5.842e-01  7.072e-09   82611577   <2e-16 ***
## carbo        3.328e-01  5.138e-09   64770284   <2e-16 ***
## sugars      -2.294e-01  5.668e-09  -40465741   <2e-16 ***
## potass      -1.725e-01  7.135e-09  -24177294   <2e-16 ***
## vitamins    -8.145e-02  2.753e-09  -29585288   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.155e-08 on 67 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.819e+16 on 9 and 67 DF,  p-value: < 2.2e-16

Now, the p-values for all the predictors are less than 0.02, which is less than our predetermined threshold of 0.05. Now, the backward elimination process is complete.

The following predictors belong in the final model for cereallm: calories / protein / fat / sodium / fiber / carbo / sugars/ potass / vitamins

*The adjusted R-square remained at 1 during the elimination process indicating those removed predictors had no predictive value for the model.

Residual Analysis

Check the valadity for the assumptions used to develop the model with residual analysis techniques.

Residual Plot

Examine the residual values to see the model’s quality. The residual values plot will show the data distribution and we would expect to see a uniform distribution around zero for a well-fitted model.

#residual test for a well-fitted model

plot(fitted(cereallm), resid(cereallm), col = "blue4") +
  abline(0,0, col='red') #add a horizontal line at 0

## integer(0)

Q-Q Plot

The x-axis displays the fitted values and the y-axis displays the residuals. The plot displays the spread of the residuals tends to increase as we move to the right. The residuals are relatively uniformly scattered in the middle but shows heavy tails. Overall, this residual plot would indicate the variables are a good predictor for the rating.

qqnorm(resid(cereallm), col = 'green4')
qqline(resid(cereallm), col = 'red', lwd = 2, lty = 2)

Quadratic Regression Term

Create a scatterplot to visualize the data.

View the components with residuals for linearity. The plot shows linearity for all components for the dataset, except vitamins.

library(car)

crPlots(cereallm)

Fit a simple linear regression model

The Component + Residual Plots shows variables: protein, fiber, and carbo, will be a potential good fit to perform a quadratic regression model.

*The linear regression model will show the relationship between ratings of the cereal and protein content of the cereal for 77 observations.

#fit linear model
linear.Model <- lm(rating ~ protein, data = cereal.outlierremoved)

#view model summary
summary(linear.Model)
## 
## Call:
## lm(formula = rating ~ protein, data = cereal.outlierremoved)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2322 -0.5960 -0.2781  0.4621  3.0081 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.870e-17  1.012e-01   0.000        1    
## protein     4.706e-01  1.019e-01   4.619 1.57e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8882 on 75 degrees of freedom
## Multiple R-squared:  0.2215, Adjusted R-squared:  0.2111 
## F-statistic: 21.34 on 1 and 75 DF,  p-value: 1.566e-05

The total variance in protein on rating a cereal is: 22.15%, as shown by the value for Multiple R-squared.

Relationship between predictor and response as a mathematical formula:

\[ rating = (3.870e-17) + (4.706e-01) * protein\]

Fit a Quadratic Regression Model

The quadratic regresssion model (linear model), where one of which is the square of the the other.

Create a variable Protein2 which is the square of the variable Protein.

\[ Protein2 \longleftarrow Protein^2\]

#create a new variable for protein
cereal.outlierremoved$protein2 <- cereal.outlierremoved$protein^2

#fit quadratic regression model
quadratic.Model <- lm(rating ~ protein + protein2, data = cereal.outlierremoved)

#view model summary
summary(quadratic.Model)
## 
## Call:
## lm(formula = rating ~ protein + protein2, data = cereal.outlierremoved)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3233 -0.5491 -0.1590  0.4038  2.9870 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.1115     0.1181   0.944   0.3483    
## protein       0.5527     0.1107   4.992 3.85e-06 ***
## protein2     -0.1130     0.0640  -1.765   0.0816 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8759 on 74 degrees of freedom
## Multiple R-squared:  0.2529, Adjusted R-squared:  0.2328 
## F-statistic: 12.53 on 2 and 74 DF,  p-value: 2.061e-05

The total variance in protein on rating a cereal increased to 25.29%, explaining an additional 3.19% of the variance.

Based on the coefficients shown, the fitted quadratic regression would be:

\[rating = 0.1115 - 0.1130(protein)^2 + 0.5527(protein)\]

Multi-Linear Regression Model

linear.Model2 <- lm(rating ~ calories + protein + protein2 + fat + sodium + fiber + carbo + sugars + potass + vitamins, data = cereal.outlierremoved)

summary(linear.Model2)
## 
## Call:
## lm(formula = rating ~ calories + protein + protein2 + fat + sodium + 
##     fiber + carbo + sugars + potass + vitamins, data = cereal.outlierremoved)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -3.436e-08 -1.523e-08  2.529e-09  1.364e-08  4.127e-08 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept) -4.282e-09  2.993e-09 -1.431e+00   0.1573    
## calories    -3.089e-01  5.904e-09 -5.233e+07   <2e-16 ***
## protein      2.551e-01  4.363e-09  5.847e+07   <2e-16 ***
## protein2     4.338e-09  1.844e-09  2.353e+00   0.0216 *  
## fat         -1.212e-01  3.887e-09 -3.118e+07   <2e-16 ***
## sodium      -3.252e-01  2.949e-09 -1.103e+08   <2e-16 ***
## fiber        5.842e-01  6.903e-09  8.464e+07   <2e-16 ***
## carbo        3.328e-01  5.173e-09  6.433e+07   <2e-16 ***
## sugars      -2.294e-01  5.494e-09 -4.175e+07   <2e-16 ***
## potass      -1.725e-01  6.959e-09 -2.479e+07   <2e-16 ***
## vitamins    -8.145e-02  2.701e-09 -3.015e+07   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.085e-08 on 66 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.748e+16 on 10 and 66 DF,  p-value: < 2.2e-16

The total variance in the model to explain cereal rating is: 1.

Dichotomous Term

Create a dichotomous variable where each level of the categorical variable is contrasted to a specified reference level.

cereallm.mfr <- lm(rating ~ mfr, data = cereal.outlierremoved)
summary(cereallm.mfr)
## 
## Call:
## lm(formula = rating ~ mfr, data = cereal.outlierremoved)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.771 -0.535 -0.007  0.421  3.536 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   0.8674     0.8281   1.047   0.2985  
## mfrG         -1.4498     0.8467  -1.712   0.0913 .
## mfrK         -0.7697     0.8459  -0.910   0.3660  
## mfrN          0.9338     0.8945   1.044   0.3001  
## mfrP         -0.9358     0.8729  -1.072   0.2874  
## mfrQ         -0.8496     0.8784  -0.967   0.3367  
## mfrR         -0.9474     0.8784  -1.079   0.2845  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8281 on 70 degrees of freedom
## Multiple R-squared:  0.3684, Adjusted R-squared:  0.3142 
## F-statistic: 6.804 on 6 and 70 DF,  p-value: 1.032e-05

ANOVA test (Dichotomous Term)

The ANOVA test indicates the groups are different, F-value, because F values are above 1.

# Compute the model
model.dv <- lm(rating ~ mfr + calories + protein2, data = cereal.outlierremoved)
Anova(model.dv)
## Anova Table (Type II tests)
## 
## Response: rating
##            Sum Sq Df F value    Pr(>F)    
## mfr       14.6427  6  6.5823 1.624e-05 ***
## calories  22.2962  1 60.1368 6.109e-11 ***
## protein2   0.4217  1  1.1373      0.29    
## Residuals 25.2115 68                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interactions Term

dictcont <- lm(rating ~ mfr*calories, data = cereal.outlierremoved)
summary(dictcont)
## 
## Call:
## lm(formula = rating ~ mfr * calories, data = cereal.outlierremoved)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9492 -0.4941  0.0000  0.3691  1.4188 
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)     0.7675     0.6133   1.251   0.2153  
## mfrG           -1.2409     0.6295  -1.971   0.0530 .
## mfrK           -0.6044     0.6264  -0.965   0.3382  
## mfrN            0.9104     0.8500   1.071   0.2882  
## mfrP           -0.7914     0.6474  -1.222   0.2260  
## mfrQ           -1.1717     0.6567  -1.784   0.0791 .
## mfrR           -0.7296     0.6640  -1.099   0.2760  
## calories       -0.2829     0.1979  -1.430   0.1577  
## mfrG:calories  -0.1906     0.3187  -0.598   0.5519  
## mfrK:calories  -0.4198     0.2283  -1.839   0.0706 .
## mfrN:calories   0.1640     0.5508   0.298   0.7668  
## mfrP:calories  -0.1491     0.4446  -0.335   0.7385  
## mfrQ:calories  -0.4091     0.2503  -1.635   0.1070  
## mfrR:calories       NA         NA      NA       NA  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6093 on 64 degrees of freedom
## Multiple R-squared:  0.6874, Adjusted R-squared:  0.6287 
## F-statistic: 11.73 on 12 and 64 DF,  p-value: 4.979e-12

Visualizations - Plot the data (Interactions)

Plot the regression lines to understanding the interaction effects using interaction_plot function.

The plot indicates the cereal rating decrease when cereal calories increase, and vice versa.**

library(interactions)

interact_plot(dictcont, pred = calories, modx = mfr)

The plot check linearity between the predicted lines (black) and a loess line (red) within each group. The red loess line should be straight like the predicted line to support linearity.

The plot show only one group, mfr = N, that is linear, therefore the assumption of linear is not satisfied because the other groups are nonlinear.

interact_plot(dictcont, pred = calories, modx = mfr, linearity.check = TRUE, plot.points = TRUE)

Combine all terms: Quadratic, Dichotomous, and one Dichotomous vs. quantitative interaction.

allModels <- lm(rating ~ mfr*calories + protein + protein2 + fat + sodium + fiber + carbo + sugars + potass + vitamins, data = cereal.outlierremoved)

summary(allModels)
## 
## Call:
## lm(formula = rating ~ mfr * calories + protein + protein2 + fat + 
##     sodium + fiber + carbo + sugars + potass + vitamins, data = cereal.outlierremoved)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -3.379e-08 -1.468e-08  1.008e-09  1.355e-08  4.436e-08 
## 
## Coefficients: (1 not defined because of singularities)
##                 Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)   -2.329e-08  2.559e-08 -9.100e-01   0.3667    
## mfrG           1.652e-08  2.633e-08  6.270e-01   0.5330    
## mfrK           1.745e-08  2.677e-08  6.520e-01   0.5171    
## mfrN           1.263e-08  3.431e-08  3.680e-01   0.7142    
## mfrP           3.412e-08  2.695e-08  1.266e+00   0.2109    
## mfrQ           9.677e-09  2.697e-08  3.590e-01   0.7211    
## mfrR           2.040e-08  2.789e-08  7.310e-01   0.4676    
## calories      -3.089e-01  9.905e-09 -3.119e+07   <2e-16 ***
## protein        2.551e-01  4.746e-09  5.375e+07   <2e-16 ***
## protein2       4.357e-09  1.970e-09  2.212e+00   0.0312 *  
## fat           -1.212e-01  4.607e-09 -2.630e+07   <2e-16 ***
## sodium        -3.252e-01  4.135e-09 -7.865e+07   <2e-16 ***
## fiber          5.842e-01  9.363e-09  6.240e+07   <2e-16 ***
## carbo          3.328e-01  6.003e-09  5.544e+07   <2e-16 ***
## sugars        -2.294e-01  5.898e-09 -3.889e+07   <2e-16 ***
## potass        -1.725e-01  8.287e-09 -2.082e+07   <2e-16 ***
## vitamins      -8.145e-02  2.985e-09 -2.729e+07   <2e-16 ***
## mfrG:calories -1.955e-09  1.166e-08 -1.680e-01   0.8675    
## mfrK:calories -4.820e-09  8.816e-09 -5.470e-01   0.5868    
## mfrN:calories -1.518e-08  2.242e-08 -6.770e-01   0.5014    
## mfrP:calories -1.316e-08  1.626e-08 -8.090e-01   0.4218    
## mfrQ:calories -2.437e-09  9.879e-09 -2.470e-01   0.8061    
## mfrR:calories         NA         NA         NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.155e-08 on 55 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 7.79e+15 on 21 and 55 DF,  p-value: < 2.2e-16

Visualizations

View the dataset residuals

summary(allModels$residuals)
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -3.379e-08 -1.468e-08  1.008e-09  0.000e+00  1.355e-08  4.435e-08
plot(predict(allModels), residuals(allModels), col=c("blue", "red"))
abline(h=0, lty=2, col="red")

qqnorm(resid(allModels), col = 'green4')
qqline(resid(allModels), col = 'red', lwd = 2, lty = 2)

Conclusion

The dataset was modeled three different ways: quadratic term, dichotomous term, and dichotomous vs. quantitative interaction term.

Each regression model indicated a significant positive relationship between predictor variables and reponse (rating) variable. The p-value for all models were p-value: <2.2e-16, and significant relationships among all models are the same.

In the backward selection, potential predictors continued to show significance among the models. These predictors (calories, protein, fat, sodium, fiber, carbo, sugars, potass, and vitamins) including the quadratic term (protein2) have an impact on ratings for cereal selection.

References:

Lilja, D. J. (2017, August 8). Linear Regression Using R: An Introduction to Data Modeling, 2nd Edition. University Digital Conservancy Home. https://conservancy.umn.edu/handle/11299/189222

#Pairs function short tutorial