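Read in Data (Introduction to the Cereals Dataset)
# stringsAsFactors = TRUE keeps text columns as factors (no longer the default since R 4.0)
data_cereals <- read.table("C:\\Users\\oghen\\Desktop\\cereals.dat", header = TRUE, stringsAsFactors = TRUE)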
Quick check on dataset to understand the structure
str(data_cereals)
## 'data.frame': 77 obs. of 16 variables:
## $ NAME : Factor w/ 77 levels "100%_Bran","100%_Natural_Bran",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ MANUF : Factor w/ 7 levels "A","G","K","N",..: 4 6 3 3 7 2 3 2 7 5 ...
## $ TYPE : Factor w/ 2 levels "C","H": 1 1 1 1 1 1 1 1 1 1 ...
## $ CALORIES: int 70 120 70 50 110 110 110 130 90 90 ...
## $ PROTEIN : int 4 3 4 4 2 2 2 3 2 3 ...
## $ FAT : int 1 5 1 0 2 2 0 2 1 0 ...
## $ SODIUM : int 130 15 260 140 200 180 125 210 200 210 ...
## $ FIBER : num 10 2 9 14 1 1.5 1 2 4 5 ...
## $ CARBO : num 5 8 7 8 14 10.5 11 18 15 13 ...
## $ SUGARS : int 6 8 5 0 8 10 14 8 6 5 ...
## $ POTASS : int 280 135 320 330 -1 70 30 100 125 190 ...
## $ VITAMINS: int 25 0 25 25 25 25 25 25 25 25 ...
## $ SHELF : int 3 3 3 3 3 1 2 3 1 3 ...
## $ WEIGHT : num 1 1 1 1 1 1 1 1.33 1 1 ...
## $ CUPS : num 0.33 1 0.33 0.5 0.75 0.75 1 0.75 0.67 0.67 ...
## $ RATING : num 68.4 34 59.4 93.7 34.4 ...
head(data_cereals)
Which variables are categorical and which are continuous?
# NAME : Categorical
# MANUF : Categorical
# TYPE : Categorical
# CALORIES: Continuous
# PROTEIN : Continuous
# FAT : Continuous
# SODIUM : Continuous
# FIBER : Continuous
# CARBO : Continuous
# SUGARS : Continuous
# POTASS : Continuous
# VITAMINS: Continuous
# SHELF : Continuous (integer-coded; with only a few distinct values it could also be treated as ordinal)
# WEIGHT : Continuous
# CUPS : Continuous
# RATING : Continuous
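The same classification can be checked programmatically. The sketch below is only an illustration: it rebuilds a small excerpt from the first ten values printed by str() above (the name cereal_excerpt is hypothetical), then separates factor columns from numeric ones and counts the distinct values in each numeric column.

```r
# Excerpt assembled from the first ten values printed by str(data_cereals);
# cereal_excerpt is a hypothetical name used only for this sketch.
cereal_excerpt <- data.frame(
  TYPE     = factor(rep("C", 10), levels = c("C", "H")),
  CALORIES = c(70L, 120L, 70L, 50L, 110L, 110L, 110L, 130L, 90L, 90L),
  FIBER    = c(10, 2, 9, 14, 1, 1.5, 1, 2, 4, 5),
  SHELF    = c(3L, 3L, 3L, 3L, 3L, 1L, 2L, 3L, 1L, 3L)
)

# Factor columns are categorical; the rest are stored as numbers
is_categorical   <- sapply(cereal_excerpt, is.factor)
categorical_vars <- names(cereal_excerpt)[is_categorical]
numeric_vars     <- names(cereal_excerpt)[!is_categorical]

# Number of distinct values per numeric column; a very small count
# suggests a discrete scale rather than a truly continuous one
distinct_counts <- sapply(cereal_excerpt[numeric_vars],
                          function(x) length(unique(x)))

print(categorical_vars)
print(distinct_counts)
```

On the full data frame the same two sapply() calls would apply unchanged; the storage type alone does not settle whether a numeric column is best treated as continuous.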
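Find out which variables are correlated:
library(corrplot)
library(knitr)       # kable()
library(kableExtra)  # kable_styling() and the %>% pipe
# Correlation matrix of the numeric columns only
cor_cereals <- cor(data_cereals[, !names(data_cereals) %in% c("NAME", "MANUF", "TYPE")])
cor_cereals %>% kable(escape = FALSE) %>%
  kable_styling("striped", full_width = FALSE, position = "left")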
| | CALORIES | PROTEIN | FAT | SODIUM | FIBER | CARBO | SUGARS | POTASS | VITAMINS | SHELF | WEIGHT | CUPS | RATING |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CALORIES | 1.0000000 | 0.0190661 | 0.4986098 | 0.3006492 | -0.2934128 | 0.2506809 | 0.5623403 | -0.0666089 | 0.2653563 | 0.0972344 | 0.6960911 | 0.0871995 | -0.6893760 |
| PROTEIN | 0.0190661 | 1.0000000 | 0.2084310 | -0.0546743 | 0.5003300 | -0.1308636 | -0.3291418 | 0.5494074 | 0.0073354 | 0.1338648 | 0.2161585 | -0.2444692 | 0.4706185 |
| FAT | 0.4986098 | 0.2084310 | 1.0000000 | -0.0054075 | 0.0167192 | -0.3180435 | 0.2708192 | 0.1932786 | -0.0311563 | 0.2636911 | 0.2146250 | -0.1758921 | -0.4092837 |
| SODIUM | 0.3006492 | -0.0546743 | -0.0054075 | 1.0000000 | -0.0706750 | 0.3559835 | 0.1014514 | -0.0326035 | 0.3614767 | -0.0697190 | 0.3085765 | 0.1196646 | -0.4012952 |
| FIBER | -0.2934128 | 0.5003300 | 0.0167192 | -0.0706750 | 1.0000000 | -0.3560827 | -0.1412054 | 0.9033737 | -0.0322427 | 0.2975391 | 0.2472256 | -0.5130609 | 0.5841604 |
| CARBO | 0.2506809 | -0.1308636 | -0.3180435 | 0.3559835 | -0.3560827 | 1.0000000 | -0.3316654 | -0.3496852 | 0.2581475 | -0.1017903 | 0.1351364 | 0.3639325 | 0.0520547 |
| SUGARS | 0.5623403 | -0.3291418 | 0.2708192 | 0.1014514 | -0.1412054 | -0.3316654 | 1.0000000 | 0.0216958 | 0.1251373 | 0.1004379 | 0.4506476 | -0.0323576 | -0.7596747 |
| POTASS | -0.0666089 | 0.5494074 | 0.1932786 | -0.0326035 | 0.9033737 | -0.3496852 | 0.0216958 | 1.0000000 | 0.0206987 | 0.3606634 | 0.4163032 | -0.4951949 | 0.3801654 |
| VITAMINS | 0.2653563 | 0.0073354 | -0.0311563 | 0.3614767 | -0.0322427 | 0.2581475 | 0.1251373 | 0.0206987 | 1.0000000 | 0.2992617 | 0.3203241 | 0.1284045 | -0.2405436 |
| SHELF | 0.0972344 | 0.1338648 | 0.2636911 | -0.0697190 | 0.2975391 | -0.1017903 | 0.1004379 | 0.3606634 | 0.2992617 | 1.0000000 | 0.1907620 | -0.3352688 | 0.0251588 |
| WEIGHT | 0.6960911 | 0.2161585 | 0.2146250 | 0.3085765 | 0.2472256 | 0.1351364 | 0.4506476 | 0.4163032 | 0.3203241 | 0.1907620 | 1.0000000 | -0.1995827 | -0.2981240 |
| CUPS | 0.0871995 | -0.2444692 | -0.1758921 | 0.1196646 | -0.5130609 | 0.3639325 | -0.0323576 | -0.4951949 | 0.1284045 | -0.3352688 | -0.1995827 | 1.0000000 | -0.2031601 |
| RATING | -0.6893760 | 0.4706185 | -0.4092837 | -0.4012952 | 0.5841604 | 0.0520547 | -0.7596747 | 0.3801654 | -0.2405436 | 0.0251588 | -0.2981240 | -0.2031601 | 1.0000000 |
corrplot(cor_cereals)
How to interpret the correlation table:
Scale used to interpret the correlation coefficients:
0.00 to 0.19 : Very weak (positive)
0.20 to 0.39 : Weak (positive)
0.40 to 0.59 : Moderate (positive)
0.60 to 0.79 : Strong (positive)
0.80 to 1.00 : Very strong (positive)
0.00 to -0.19 : Very weak (negative)
-0.20 to -0.39 : Weak (negative)
-0.40 to -0.59 : Moderate (negative)
-0.60 to -0.79 : Strong (negative)
-0.80 to -1.00 : Very strong (negative)
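The scale above can be applied mechanically with cut(). This is a minimal sketch, not part of the original analysis: describe_r() is a hypothetical helper, and it assumes each band is closed on the left (so 0.20 counts as weak and 0.40 as moderate), which fills the small gaps between bands such as 0.19 to 0.20.

```r
# describe_r() is a hypothetical helper that labels a correlation
# coefficient using the interpretation scale listed above.
describe_r <- function(r) {
  strength <- cut(abs(r),
                  breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1),
                  labels = c("Very weak", "Weak", "Moderate",
                             "Strong", "Very strong"),
                  right = FALSE,          # bands are [lower, upper)
                  include.lowest = TRUE)  # the top band also includes 1
  direction <- ifelse(r < 0, "negative", "positive")
  paste0(strength, " (", direction, ")")
}

# Three coefficients taken from the correlation table:
# FIBER-POTASS, SUGARS-RATING, CALORIES-FAT
labels <- describe_r(c(0.9033737, -0.7596747, 0.4986098))
print(labels)
```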
How to use the corrplot:
Red indicates a negative correlation, while blue indicates a positive correlation. The larger the circle and the darker its color, the stronger the association.
Applying these rules to identify correlated variables (I'll only list moderate to very strong correlations, i.e. an absolute value of r of at least 0.40):
Results:
Calories is positively correlated with fat (r = 0.50), sugars (r = 0.56), and weight (r = 0.70), and negatively correlated with rating (r = -0.69).
Protein is positively correlated with fiber (r = 0.50), potassium (r = 0.55), and rating (r = 0.47).
Fat is positively correlated with calories (r = 0.50) and negatively correlated with rating (r = -0.41).
Sodium is negatively correlated with rating (r = -0.40).
Fiber is positively correlated with protein (r = 0.50), potassium (r = 0.90), and rating (r = 0.58), and negatively correlated with cups (r = -0.51).
Sugars is positively correlated with calories (r = 0.56) and weight (r = 0.45), and negatively correlated with rating (r = -0.76).
Potassium is positively correlated with protein (r = 0.55), fiber (r = 0.90), and weight (r = 0.42), and negatively correlated with cups (r = -0.49).
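Reading the pairs off the table by eye is error-prone, so the filtering can also be done in code. The sketch below is an illustration under stated assumptions: it uses a five-variable sub-matrix copied from the correlation table (cor_sub and strong_pairs are hypothetical names) and keeps each pair once by looking only at the upper triangle.

```r
# Sub-matrix copied from the correlation table above; cor_sub is a
# hypothetical name used only in this sketch.
vars <- c("CALORIES", "SUGARS", "FIBER", "POTASS", "RATING")
cor_sub <- matrix(c(
   1.0000000,  0.5623403, -0.2934128, -0.0666089, -0.6893760,
   0.5623403,  1.0000000, -0.1412054,  0.0216958, -0.7596747,
  -0.2934128, -0.1412054,  1.0000000,  0.9033737,  0.5841604,
  -0.0666089,  0.0216958,  0.9033737,  1.0000000,  0.3801654,
  -0.6893760, -0.7596747,  0.5841604,  0.3801654,  1.0000000),
  nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

# Keep pairs with |r| >= 0.40, using the upper triangle so that
# each pair appears once and the diagonal is skipped
idx <- which(abs(cor_sub) >= 0.40 & upper.tri(cor_sub), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cor_sub)[idx[, "row"]],
  var2 = colnames(cor_sub)[idx[, "col"]],
  r    = cor_sub[idx]
)

# Strongest association first
strong_pairs <- strong_pairs[order(-abs(strong_pairs$r)), ]
print(strong_pairs)
```

Applied to the full cor_cereals matrix, the same which() call would list every pair at or above the 0.40 threshold.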
Choose a variable as independent and run simple linear regression with Rating as the target variable. How does your independent variable predict the Rating (show equation)?
Independent variable: SUGARS
Predicted (target) variable: RATING
attach(data_cereals)
model= lm(RATING~SUGARS)
summary(model)
##
## Call:
## lm(formula = RATING ~ SUGARS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.853 -5.677 -1.439 5.160 34.421
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.2844 1.9485 30.43 < 2e-16 ***
## SUGARS -2.4008 0.2373 -10.12 1.15e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.196 on 75 degrees of freedom
## Multiple R-squared: 0.5771, Adjusted R-squared: 0.5715
## F-statistic: 102.3 on 1 and 75 DF, p-value: 1.153e-15
plot(RATING ~ SUGARS, data = data_cereals,
     main = "Figure: Sugars vs. Rating (Is Sugars a predictor for rating?)",
     xlab = "SUGARS", ylab = "RATING",
     pch = 19, frame.plot = FALSE)
# Overlay the fitted least-squares line
abline(lm(RATING ~ SUGARS, data = data_cereals), col = "blue")
RATING = beta_0 + beta_1 * SUGARS
RATING = 59.2844 - 2.4008 * SUGARS
Meaning: For every one-unit increase in sugars, the predicted rating decreases by 2.4008 points (the slope, beta_1). The intercept, beta_0 = 59.2844, is the predicted rating for a cereal with zero sugars.
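To make the equation concrete, it can be evaluated for a few sugar values. This is a small sketch using the coefficients copied from the summary(model) output; predict_rating() is a hypothetical helper, not a function from the original analysis.

```r
# Coefficients copied from the summary(model) output
beta_0 <- 59.2844   # intercept
beta_1 <- -2.4008   # slope for SUGARS

# predict_rating() is a hypothetical helper for this sketch
predict_rating <- function(sugars) beta_0 + beta_1 * sugars

# Predicted ratings for cereals with 0, 5 and 10 units of sugars
predicted <- predict_rating(c(0, 5, 10))
print(predicted)
```

With the fitted model object available, predict(model, newdata = data.frame(SUGARS = c(0, 5, 10))) should give the same values up to rounding of the coefficients.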
Statistical inference: H0: beta_1 = 0 versus Ha: beta_1 != 0
Since the p-value for the slope (1.15e-15) is far below alpha = 0.05, we reject the null hypothesis and conclude that sugars is a statistically significant predictor of rating.
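The test statistic behind this decision can be reproduced by hand from the coefficient table. The sketch below assumes only the rounded values printed by summary(model) (estimate, standard error, and 75 residual degrees of freedom), so its results match the output up to rounding; the confidence interval is an extra illustration, not something reported in the original output.

```r
# Values copied from the SUGARS row of the coefficient table
estimate  <- -2.4008
std_error <- 0.2373
df_resid  <- 75      # residual degrees of freedom

# t statistic for H0: beta_1 = 0 and its two-sided p-value
t_stat  <- estimate / std_error
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

# 95% confidence interval for the slope
ci_95 <- estimate + c(-1, 1) * qt(0.975, df = df_resid) * std_error

print(t_stat)
print(p_value)
print(ci_95)
```

An interval that lies entirely below zero points to the same conclusion as the hypothesis test.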
The R-squared value of 0.577 indicates that sugars explains about 58% of the variability in rating. Including other predictors in a multiple regression could increase the R-squared value (the explained variance in rating). That is beyond the scope of this example, but the correlation table suggests that other predictors, such as calories and fiber, could raise the percentage of explained variability.
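In a simple linear regression, R-squared equals the square of the correlation between the predictor and the response, which ties this result back to the correlation table. The sketch below checks that relationship and recomputes the adjusted R-squared, assuming n = 77 observations and one predictor as shown in the output above.

```r
# Correlation between SUGARS and RATING, copied from the correlation table
r_sugars_rating <- -0.7596747
n <- 77   # observations
p <- 1    # predictors in the model

# In simple linear regression, R-squared is the squared correlation
r_squared <- r_sugars_rating^2

# Adjusted R-squared penalizes for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

print(round(c(r_squared, adj_r_squared), 4))
```

Because plain R-squared never decreases when a predictor is added, the adjusted value is the more useful one for comparing this model with a multiple regression.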