R Notebook

Read in Data: (Introduction of the Cereals dataset)

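# stringsAsFactors = TRUE reproduces the Factor columns shown below on R >= 4.0
data_cereals = read.table("C:\\Users\\oghen\\Desktop\\cereals.dat", header = TRUE, stringsAsFactors = TRUE)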

Quick check on dataset to understand the structure

str(data_cereals)

## 'data.frame':    77 obs. of  16 variables:
##  $ NAME    : Factor w/ 77 levels "100%_Bran","100%_Natural_Bran",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ MANUF   : Factor w/ 7 levels "A","G","K","N",..: 4 6 3 3 7 2 3 2 7 5 ...
##  $ TYPE    : Factor w/ 2 levels "C","H": 1 1 1 1 1 1 1 1 1 1 ...
##  $ CALORIES: int  70 120 70 50 110 110 110 130 90 90 ...
##  $ PROTEIN : int  4 3 4 4 2 2 2 3 2 3 ...
##  $ FAT     : int  1 5 1 0 2 2 0 2 1 0 ...
##  $ SODIUM  : int  130 15 260 140 200 180 125 210 200 210 ...
##  $ FIBER   : num  10 2 9 14 1 1.5 1 2 4 5 ...
##  $ CARBO   : num  5 8 7 8 14 10.5 11 18 15 13 ...
##  $ SUGARS  : int  6 8 5 0 8 10 14 8 6 5 ...
##  $ POTASS  : int  280 135 320 330 -1 70 30 100 125 190 ...
##  $ VITAMINS: int  25 0 25 25 25 25 25 25 25 25 ...
##  $ SHELF   : int  3 3 3 3 3 1 2 3 1 3 ...
##  $ WEIGHT  : num  1 1 1 1 1 1 1 1.33 1 1 ...
##  $ CUPS    : num  0.33 1 0.33 0.5 0.75 0.75 1 0.75 0.67 0.67 ...
##  $ RATING  : num  68.4 34 59.4 93.7 34.4 ...

head(data_cereals)
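
One thing the str() output hints at: POTASS contains a -1, which cannot be a real nutrient amount. A minimal sketch of how such values could be counted and recoded follows. It uses only the first ten CARBO and POTASS values copied from the str() output, and it assumes (the source does not say so) that -1 is a missing-value code. The names cereals_head, neg_counts and cereals_clean are made up for this illustration.

```r
# First ten values of two columns, copied from the str() output above
cereals_head <- data.frame(
  CARBO  = c(5, 8, 7, 8, 14, 10.5, 11, 18, 15, 13),
  POTASS = c(280, 135, 320, 330, -1, 70, 30, 100, 125, 190)
)

# Count negative entries in each column
neg_counts <- colSums(cereals_head < 0)
print(neg_counts)

# Assumption: -1 marks a missing value, so recode negatives as NA
cereals_clean <- cereals_head
cereals_clean[cereals_clean < 0] <- NA
print(cereals_clean$POTASS)
```

If the assumption holds for the full data set, cor() would then need an argument such as use = "pairwise.complete.obs" to skip the NA entries.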

Which variables are categorical and which are continuous?

# NAME    : Categorical
# MANUF   : Categorical
# TYPE    : Categorical
# CALORIES: Continuous
# PROTEIN : Continuous
# FAT     : Continuous
# SODIUM  : Continuous
# FIBER   : Continuous
# CARBO   : Continuous
# SUGARS  : Continuous
# POTASS  : Continuous
# VITAMINS: Continuous
# SHELF   : Categorical (ordinal: integer codes 1 to 3; still treated as numeric in the correlation table below)
# WEIGHT  : Continuous
# CUPS    : Continuous
# RATING  : Continuous
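
The same split can be checked in code rather than by eye. The sketch below rebuilds a few columns from the first ten values in the str() output (cereals_head, is_numeric and n_distinct are illustrative names, not part of the original notebook) and reports which columns are numeric and how many distinct values each one takes. A numeric column with only a handful of distinct values is a hint that it may really be a coded category.

```r
# A few columns rebuilt from the first ten values in the str() output
cereals_head <- data.frame(
  TYPE     = factor(rep("C", 10), levels = c("C", "H")),
  CALORIES = c(70, 120, 70, 50, 110, 110, 110, 130, 90, 90),
  FIBER    = c(10, 2, 9, 14, 1, 1.5, 1, 2, 4, 5),
  SHELF    = c(3L, 3L, 3L, 3L, 3L, 1L, 2L, 3L, 1L, 3L)
)

# TRUE for numeric (int or num) columns, FALSE for factors
is_numeric <- sapply(cereals_head, is.numeric)

# Number of distinct values per column
n_distinct <- sapply(cereals_head, function(col) length(unique(col)))

print(data.frame(is_numeric, n_distinct))
```

On the full data frame the same two sapply() calls can be run on data_cereals directly.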

Find out which variables are correlated:

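library(corrplot)    # corrplot()
library(magrittr)    # %>% pipe
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
# Correlate the numeric columns only (drop the three factor columns)
cor_cereals = cor(data_cereals[, !names(data_cereals) %in% c("NAME","MANUF","TYPE")])
cor_cereals %>% kable(escape = FALSE) %>%
  kable_styling("striped", full_width = FALSE, position = "left")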

	CALORIES	PROTEIN	FAT	SODIUM	FIBER	CARBO	SUGARS	POTASS	VITAMINS	SHELF	WEIGHT	CUPS	RATING
CALORIES	1.0000000	0.0190661	0.4986098	0.3006492	-0.2934128	0.2506809	0.5623403	-0.0666089	0.2653563	0.0972344	0.6960911	0.0871995	-0.6893760
PROTEIN	0.0190661	1.0000000	0.2084310	-0.0546743	0.5003300	-0.1308636	-0.3291418	0.5494074	0.0073354	0.1338648	0.2161585	-0.2444692	0.4706185
FAT	0.4986098	0.2084310	1.0000000	-0.0054075	0.0167192	-0.3180435	0.2708192	0.1932786	-0.0311563	0.2636911	0.2146250	-0.1758921	-0.4092837
SODIUM	0.3006492	-0.0546743	-0.0054075	1.0000000	-0.0706750	0.3559835	0.1014514	-0.0326035	0.3614767	-0.0697190	0.3085765	0.1196646	-0.4012952
FIBER	-0.2934128	0.5003300	0.0167192	-0.0706750	1.0000000	-0.3560827	-0.1412054	0.9033737	-0.0322427	0.2975391	0.2472256	-0.5130609	0.5841604
CARBO	0.2506809	-0.1308636	-0.3180435	0.3559835	-0.3560827	1.0000000	-0.3316654	-0.3496852	0.2581475	-0.1017903	0.1351364	0.3639325	0.0520547
SUGARS	0.5623403	-0.3291418	0.2708192	0.1014514	-0.1412054	-0.3316654	1.0000000	0.0216958	0.1251373	0.1004379	0.4506476	-0.0323576	-0.7596747
POTASS	-0.0666089	0.5494074	0.1932786	-0.0326035	0.9033737	-0.3496852	0.0216958	1.0000000	0.0206987	0.3606634	0.4163032	-0.4951949	0.3801654
VITAMINS	0.2653563	0.0073354	-0.0311563	0.3614767	-0.0322427	0.2581475	0.1251373	0.0206987	1.0000000	0.2992617	0.3203241	0.1284045	-0.2405436
SHELF	0.0972344	0.1338648	0.2636911	-0.0697190	0.2975391	-0.1017903	0.1004379	0.3606634	0.2992617	1.0000000	0.1907620	-0.3352688	0.0251588
WEIGHT	0.6960911	0.2161585	0.2146250	0.3085765	0.2472256	0.1351364	0.4506476	0.4163032	0.3203241	0.1907620	1.0000000	-0.1995827	-0.2981240
CUPS	0.0871995	-0.2444692	-0.1758921	0.1196646	-0.5130609	0.3639325	-0.0323576	-0.4951949	0.1284045	-0.3352688	-0.1995827	1.0000000	-0.2031601
RATING	-0.6893760	0.4706185	-0.4092837	-0.4012952	0.5841604	0.0520547	-0.7596747	0.3801654	-0.2405436	0.0251588	-0.2981240	-0.2031601	1.0000000

corrplot(cor_cereals)

How to interpret the correlation table:

Scale used to interpret the correlation coefficients:

 0.00 to 0.19 : Very weak (positive)

 0.20 to 0.39 : Weak (positive)

 0.40 to 0.59 : Moderate (positive)

 0.60 to 0.79 : Strong (positive)

 0.80 to 1.00 : Very strong (positive)

 0.00 to -0.19 : Very weak (negative)

 -0.20 to -0.39 : Weak (negative)

 -0.40 to -0.59 : Moderate (negative)

 -0.60 to -0.79 : Strong (negative)

 -0.80 to -1.00 : Very strong (negative)
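
The scale above can be turned into a small helper so that labels are assigned consistently. This is only a sketch: strength_label is a hypothetical function name, and it assumes the band boundaries sit at exactly 0.2, 0.4, 0.6 and 0.8 (the scale leaves small gaps such as 0.19 to 0.20 undefined).

```r
# Label a correlation coefficient using the scale above.
# Assumption: bands are [0, 0.2), [0.2, 0.4), [0.4, 0.6), [0.6, 0.8), [0.8, 1].
strength_label <- function(r) {
  labels <- c("very weak", "weak", "moderate", "strong", "very strong")
  band <- findInterval(abs(r), c(0, 0.2, 0.4, 0.6, 0.8))
  direction <- ifelse(r < 0, "negative", "positive")
  paste0(labels[band], " (", direction, ")")
}

# Three coefficients taken from the correlation table:
# FIBER-POTASS, SUGARS-RATING, CALORIES-PROTEIN
example_labels <- strength_label(c(0.9033737, -0.7596747, 0.0190661))
print(example_labels)
```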

How to use the corrplot:

Red indicates a negative correlation and blue a positive one. The larger the circle and the darker its color in the corrplot, the stronger the association.

Applying these rules to identify correlated variables (I’ll only list moderate to very strong correlations, i.e. |r| of about 0.40 or more):

Results:

Calories is positively correlated with fat (r = 0.50), sugars (r = 0.56), and weight (r = 0.70), and negatively correlated with rating (r = -0.69).

Protein is positively correlated with fiber (r = 0.50), potassium (r = 0.55), and rating (r = 0.47).

Fat is positively correlated with calories (r = 0.50), and negatively correlated with rating (r = -0.41).

Sodium is negatively correlated with rating (r = -0.40).

Fiber is positively correlated with protein (r = 0.50), potassium (r = 0.90), and rating (r = 0.58), and negatively correlated with cups (r = -0.51).

Sugars is positively correlated with calories (r = 0.56) and weight (r = 0.45), and negatively correlated with rating (r = -0.76).

Potassium is positively correlated with weight (r = 0.42), and negatively correlated with cups (r = -0.49). Its correlations with protein and fiber are listed above.
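
Reading pairs off a 13 x 13 table by hand is error-prone, so the filtering can also be done in code. The sketch below is one possible way, not the notebook's original method: it copies a 4 x 4 slice of the correlation table (CALORIES, FIBER, SUGARS, RATING) so that it runs on its own, keeps each pair once via the upper triangle, and lists pairs with |r| >= 0.40. The names cor_sub, idx and strong_pairs are illustrative.

```r
# A 4 x 4 slice of the correlation table above
vars <- c("CALORIES", "FIBER", "SUGARS", "RATING")
cor_sub <- matrix(c(
   1.0000000, -0.2934128,  0.5623403, -0.6893760,
  -0.2934128,  1.0000000, -0.1412054,  0.5841604,
   0.5623403, -0.1412054,  1.0000000, -0.7596747,
  -0.6893760,  0.5841604, -0.7596747,  1.0000000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Upper triangle only, so each pair appears once and the diagonal is skipped
idx <- which(upper.tri(cor_sub) & abs(cor_sub) >= 0.40, arr.ind = TRUE)

strong_pairs <- data.frame(
  var1 = rownames(cor_sub)[idx[, "row"]],
  var2 = colnames(cor_sub)[idx[, "col"]],
  r    = cor_sub[idx]
)

# Strongest association first
strong_pairs <- strong_pairs[order(-abs(strong_pairs$r)), ]
print(strong_pairs)
```

Replacing cor_sub with cor_cereals should give the full list for all thirteen numeric variables.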

Choose a variable as independent and run simple linear regression with Rating as the target variable. How does your independent variable predict the Rating (show equation)?

Independent variable: sugars

Predicted (target) variable: rating

attach(data_cereals)
model= lm(RATING~SUGARS)
summary(model)

## 
## Call:
## lm(formula = RATING ~ SUGARS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.853  -5.677  -1.439   5.160  34.421 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  59.2844     1.9485   30.43  < 2e-16 ***
## SUGARS       -2.4008     0.2373  -10.12 1.15e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.196 on 75 degrees of freedom
## Multiple R-squared:  0.5771, Adjusted R-squared:  0.5715 
## F-statistic: 102.3 on 1 and 75 DF,  p-value: 1.153e-15

# Scatter plot of the raw data (SUGARS and RATING are visible via attach())
plot(SUGARS, RATING, main = "Figure: Sugars vs. Rating (Is Sugars a predictor for rating?)",
     xlab = "SUGARS", ylab = "RATING",
     pch = 19, frame = FALSE)
# Overlay the fitted regression line from the model above
abline(model, col = "blue")

rating = beta_0 + beta_1 * sugars

predicted rating = 59.2844 - 2.4008 * sugars

Meaning: For every one-unit increase in sugars, the predicted rating decreases by 2.4008 points; this is the slope, beta_1. The intercept, beta_0 = 59.2844, is the predicted rating for a cereal with zero sugars.
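
To make the equation concrete, here is a small sketch that plugs a few sugar values into it. The coefficients are the rounded estimates from the summary() output; predict_rating is a hypothetical helper written for this illustration, not part of the notebook.

```r
# Fitted line from the summary() output: rating = 59.2844 - 2.4008 * sugars
predict_rating <- function(sugars, beta_0 = 59.2844, beta_1 = -2.4008) {
  beta_0 + beta_1 * sugars
}

# Predicted ratings at 0, 5 and 10 units of sugar:
# 59.2844, 47.2804 and 35.2764
predicted <- predict_rating(c(0, 5, 10))
print(predicted)
```

With the fitted model object available, predict(model, newdata = data.frame(SUGARS = c(0, 5, 10))) should give the same values up to rounding of the coefficients.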

Statistical inference: H0: beta_1 = 0 versus Ha: beta_1 != 0

Since the p-value for the slope (1.15e-15) is far below alpha = 0.05, we reject the null hypothesis and conclude that sugars is a significant predictor of rating.
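
The test can be reproduced by hand from the coefficient table, which shows where the t value and p-value come from. This is a sketch using the rounded estimate, standard error and residual degrees of freedom printed by summary(), so the results match the output only up to rounding.

```r
# Values read from the summary() output above
beta_1    <- -2.4008   # slope estimate for SUGARS
se_beta_1 <- 0.2373    # its standard error
df_resid  <- 75        # residual degrees of freedom (77 - 2)
alpha     <- 0.05

# t statistic for H0: beta_1 = 0
t_stat <- beta_1 / se_beta_1

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

# Equivalent decision rule using the critical value
t_crit    <- qt(1 - alpha / 2, df = df_resid)
reject_h0 <- abs(t_stat) > t_crit

print(c(t_stat = t_stat, p_value = p_value, t_crit = t_crit))
print(reject_h0)
```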

The R-squared value of 0.577 indicates that sugars explains about 58% of the variability in rating. Adding other predictors in a multiple regression would likely raise this further, and the correlation table suggests several candidates. Because R^2 never decreases when a predictor is added, adjusted R^2 is the fairer measure for such comparisons. That is beyond the scope of this example.
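
In simple linear regression, R^2 is just the squared correlation between the predictor and the response, which ties this result back to the correlation table. The sketch below checks that, and also recomputes the adjusted R^2 from n = 77 observations and one predictor. The variable names are illustrative.

```r
# Correlation between SUGARS and RATING, from the correlation table
r_sugars_rating <- -0.7596747

# In simple linear regression, R^2 equals the squared correlation
r_squared <- r_sugars_rating^2

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
n <- 77  # observations
p <- 1   # predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

print(round(c(r_squared = r_squared, adj_r_squared = adj_r_squared), 4))
```

Both values agree with the Multiple R-squared (0.5771) and Adjusted R-squared (0.5715) lines in the summary() output.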