========================================================
The data used in this analysis was found in the Data and Story Library (DASL), which was listed as one of the 100 Interesting Datasets. The dataset contains observations of 77 different cereals from 7 different manufacturers. Besides the cereal name and manufacturer, each observation lists a rating, as well as a handful of nutritional facts about the cereal.
The code below reads in the dataset, sorts it by sugar content, attaches the variable names, and displays the first six rows:
x <- read.delim("~/1.RENSSELAER POLYTECHNIC INSTITUTE/a- Senior Spring/Applied Regression Analysis/Cerealdata.txt")
data <-x[order(x$sugars),]
attach(data)
head(data)
## name mfr type calories protein fat sodium fiber
## 58 Quaker_Oatmeal Q H 100 5 2 0 2.7
## 4 All-Bran_with_Extra_Fiber K C 50 4 0 140 14.0
## 21 Cream_of_Wheat_(Quick) N H 100 3 0 80 1.0
## 55 Puffed_Rice Q C 50 1 0 0 0.0
## 56 Puffed_Wheat Q C 50 2 0 0 1.0
## 64 Shredded_Wheat N C 80 2 0 0 3.0
## carbo sugars potass vitamins shelf weight cups rating
## 58 -1 -1 110 0 1 1.00 0.67 50.83
## 4 8 0 330 25 3 1.00 0.50 93.70
## 21 21 0 -1 0 2 1.00 1.00 64.53
## 55 13 0 15 0 3 0.50 1.00 60.76
## 56 10 0 50 0 3 0.50 1.00 63.01
## 64 16 0 95 0 1 0.83 1.00 68.24
View(data)
For this project, the dependent (response) variable is the rating and the independent variable is the grams of sugar per serving (sugars). Both are treated as continuous variables, and the rating is on a scale of 0-100. Most of the other variables in the dataset are continuous and a few are categorical factors, but they are not used in this analysis.
The goal is to see if the rating of a cereal can be explained at all by the amount of sugar in the cereal.
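The head(data) output above shows -1 for sugars and carbo (Quaker Oatmeal) and for potass (Cream of Wheat), which cannot be real nutritional amounts. The sketch below assumes that -1 is the dataset's missing-value code (an assumption, not something the source states) and recodes it to NA. The names `cereal_head` and `recode_missing` are hypothetical, and the small data frame simply retypes three of the displayed rows.

```r
# Three rows retyped from the head(data) output above (subset of columns).
cereal_head <- data.frame(
  name   = c("Quaker_Oatmeal", "All-Bran_with_Extra_Fiber", "Cream_of_Wheat_(Quick)"),
  carbo  = c(-1, 8, 21),
  sugars = c(-1, 0, 0),
  potass = c(110, 330, -1),
  rating = c(50.83, 93.70, 64.53)
)

# Assumption: -1 marks a missing measurement in the nutritional columns.
recode_missing <- function(df, cols) {
  for (col in cols) {
    df[[col]][df[[col]] %in% -1] <- NA
  }
  df
}

cleaned <- recode_missing(cereal_head, c("carbo", "sugars", "potass"))
print(cleaned)
```

If the assumption holds, a model fitted on the recoded data would leave out the Quaker Oatmeal row under R's default na.action, instead of treating -1 grams of sugar as a real value.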
Below is a summary of the data as well as a boxplot of the ratings for each of the sugar content levels:
summary(data)
## name mfr type calories protein
## 100%_Bran : 1 A: 1 C:74 Min. : 50 Min. :1.00
## 100%_Natural_Bran : 1 G:22 H: 3 1st Qu.:100 1st Qu.:2.00
## All-Bran : 1 K:23 Median :110 Median :3.00
## All-Bran_with_Extra_Fiber: 1 N: 6 Mean :107 Mean :2.54
## Almond_Delight : 1 P: 9 3rd Qu.:110 3rd Qu.:3.00
## Apple_Cinnamon_Cheerios : 1 Q: 8 Max. :160 Max. :6.00
## (Other) :71 R: 8
## fat sodium fiber carbo
## Min. :0.00 Min. : 0 Min. : 0.00 Min. :-1.0
## 1st Qu.:0.00 1st Qu.:130 1st Qu.: 1.00 1st Qu.:12.0
## Median :1.00 Median :180 Median : 2.00 Median :14.0
## Mean :1.01 Mean :160 Mean : 2.15 Mean :14.6
## 3rd Qu.:2.00 3rd Qu.:210 3rd Qu.: 3.00 3rd Qu.:17.0
## Max. :5.00 Max. :320 Max. :14.00 Max. :23.0
##
## sugars potass vitamins shelf
## Min. :-1.00 Min. : -1.0 Min. : 0.0 Min. :1.00
## 1st Qu.: 3.00 1st Qu.: 40.0 1st Qu.: 25.0 1st Qu.:1.00
## Median : 7.00 Median : 90.0 Median : 25.0 Median :2.00
## Mean : 6.92 Mean : 96.1 Mean : 28.2 Mean :2.21
## 3rd Qu.:11.00 3rd Qu.:120.0 3rd Qu.: 25.0 3rd Qu.:3.00
## Max. :15.00 Max. :330.0 Max. :100.0 Max. :3.00
##
## weight cups rating
## Min. :0.50 Min. :0.250 Min. :18.0
## 1st Qu.:1.00 1st Qu.:0.670 1st Qu.:33.2
## Median :1.00 Median :0.750 Median :40.4
## Mean :1.03 Mean :0.821 Mean :42.7
## 3rd Qu.:1.00 3rd Qu.:1.000 3rd Qu.:50.8
## Max. :1.50 Max. :1.500 Max. :93.7
##
boxplot(rating~sugars, main="Boxplot of Rating vs Sugars", xlab="sugars", ylab="rating")
The boxplot shows that some sugar levels have only one observation. Ratings appear to decrease as sugar content increases.
The null hypothesis is that the variation in the dependent variable, rating, cannot be explained by anything other than random variation. In other words, for the model described below, sugar content has no linear relationship with rating (the slope on sugars is zero).
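To see how many observations sit behind each box, the sugar levels can be tabulated. This is a small sketch using only the six sugar values visible in the head(data) output above; `sugars_head` is a hypothetical name, and on the full data the same call would be `table(data$sugars)`.

```r
# Sugar values of the six rows shown in the head(data) output above.
sugars_head <- c(-1, 0, 0, 0, 0, 0)

# Number of observations at each sugar level.
counts <- table(sugars_head)
print(counts)

# Levels backed by a single observation (their "box" is just a line).
single_levels <- names(counts)[as.vector(counts) == 1]
print(single_levels)
```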
The model below explores whether the variation in sugar content can explain any of the variation in cereal rating.
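Written out, a simple linear regression of rating on sugars and its hypothesis test take the form below. The symbols \beta_0, \beta_1, and \varepsilon_i are assumed notation, not taken from the original text.

```latex
\text{rating}_i = \beta_0 + \beta_1\,\text{sugars}_i + \varepsilon_i,
\qquad
H_0:\ \beta_1 = 0 \quad \text{vs.} \quad H_1:\ \beta_1 \neq 0
```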
The code below creates and saves this model for future interpretation:
model<-lm(rating~sugars)
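As a sketch of what lm() estimates, the slope and intercept can be reproduced by hand from the least-squares formulas. R's built-in `cars` dataset stands in for the cereal data here (an assumption made so the example runs without the cereal file), and the call passes `data =` explicitly, which avoids relying on attach().

```r
# Stand-in data: R's built-in `cars` (speed, dist), not the cereal data.
speed <- cars$speed
dist  <- cars$dist

# Least-squares estimates for dist = b0 + b1 * speed.
b1 <- sum((speed - mean(speed)) * (dist - mean(dist))) / sum((speed - mean(speed))^2)
b0 <- mean(dist) - b1 * mean(speed)

# Same fit via lm(), passing the data frame explicitly.
fit <- lm(dist ~ speed, data = cars)

print(c(intercept = b0, slope = b1))
print(coef(fit))
```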
Scattergram of the cereal rating vs the sugar content of the cereal:
plot(sugars,rating, main="Rating vs Sugar Content (with regression line)", xlab= "Sugar Content", ylab="Cereal Rating", col="blue", pch=18)
abline(model$coef, lwd=2, col="dark blue")
The points fall along vertical lines because sugar content is recorded in whole grams, so many cereals share the same sugar value. The points follow a negative trend: as sugar content increases, rating decreases. The regression line represents the equation fitted by the model above. It appears to be a reasonable fit, passing through the center of the data with an intercept of approximately 60. The model is analyzed further in the results below.
Next we plot the 95% confidence interval for the model:
model_conf <- predict(model, interval="confidence")
plot(sugars,rating, main="95% Confidence Interval for Rating vs Sugar Content", xlab= "Sugar Content", ylab="Cereal Rating", col="blue", pch=18)
abline(model$coef, lwd=2, col="dark blue")
lines(sugars, model_conf[,2], lty=2, lwd=2, col="blueviolet")
lines(sugars, model_conf[,3], lty=2, lwd=2, col="blueviolet")
The dashed lines form a 95% confidence band for the mean rating at each sugar level: if the study were repeated on many samples, about 95% of the bands constructed this way would contain the true mean rating. The band does not describe where individual cereals fall. It is tightest near the mean sugar content and widens toward the lowest and highest sugar values, because the standard error of the fitted mean grows with distance from the center of the data, where most observations lie.
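The difference between a confidence band (for the mean response) and a prediction band (for a single new observation) can be sketched with predict(). The built-in `cars` dataset again stands in for the cereal data, which is an assumption made so the example is self-contained.

```r
# Stand-in data: R's built-in `cars`, not the cereal data.
fit <- lm(dist ~ speed, data = cars)

# Evaluate at the smallest, mean, and largest predictor values.
new_x <- data.frame(speed = c(min(cars$speed), mean(cars$speed), max(cars$speed)))

conf_band <- predict(fit, newdata = new_x, interval = "confidence", level = 0.95)
pred_band <- predict(fit, newdata = new_x, interval = "prediction", level = 0.95)

# Width of each interval (upper minus lower).
conf_width <- conf_band[, "upr"] - conf_band[, "lwr"]
pred_width <- pred_band[, "upr"] - pred_band[, "lwr"]

print(cbind(speed = new_x$speed, conf_width, pred_width))
```

The confidence interval is narrowest at the mean of the predictor, and the prediction interval is wider everywhere because it also accounts for the scatter of individual observations.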
Below is the output of the model summary:
summary(model)
##
## Call:
## lm(formula = rating ~ sugars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.85 -5.68 -1.44 5.16 34.42
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.284 1.948 30.4 < 2e-16 ***
## sugars -2.401 0.237 -10.1 1.2e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.2 on 75 degrees of freedom
## Multiple R-squared: 0.577, Adjusted R-squared: 0.571
## F-statistic: 102 on 1 and 75 DF, p-value: 1.15e-15
From the model summary output, the p-value is 1.153e-15, so we reject the null hypothesis: something other than random variation may explain the variation in cereal rating. Since sugar is the independent variable in this model, the variation in sugar may be an explanatory factor. To see how much of the variation in the dependent variable is explained by the independent variable, we look at the adjusted R^2 value. The output returned a value of 0.5715, meaning about 57.15% of the variation in rating can be explained by the variation in sugar content.
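The figures in the summary are tied together by simple arithmetic, which the sketch below reproduces from the printed values. Because the inputs are rounded as displayed, the results agree with the output only to about that precision; all variable names are hypothetical.

```r
# Values copied from the summary(model) output above.
estimate <- -2.401   # slope for sugars
std_err  <- 0.237    # its standard error
df_resid <- 75       # residual degrees of freedom
r2       <- 0.577    # multiple R-squared
n        <- 77       # number of cereals

# t statistic and two-sided p-value for H0: slope = 0.
t_value <- estimate / std_err
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# With a single predictor, the F statistic equals t squared.
f_value <- t_value^2

# Adjusted R-squared penalizes R-squared for the number of predictors (here 1).
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - 1 - 1)

print(round(c(t = t_value, F = f_value, adj_r2 = adj_r2), 3))
print(p_value)
```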
The model can be written as the following equation: Rating= -2.4(sugars) + 59.28
The parameter estimate for sugars is negative, as originally predicted from the negative trend in the initial data and boxplot. The intercept is about 60 (59.28), which agrees with the regression line in the model plot.
Overall, this analysis suggests that cereals with lower sugar content tend to receive higher ratings, and that sugar content explains about 57% of the variation in cereal rating.
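The fitted equation can be used directly to compute the expected rating at a given sugar content. This sketch uses the coefficients as printed in the summary output; `predict_rating` is a hypothetical helper, and the sugar values 0, 6.92, and 15 are the zero point, mean, and maximum taken from summary(data).

```r
# Coefficients copied from the summary(model) output above.
intercept <- 59.284
slope     <- -2.401

# Hypothetical helper: fitted rating for a given sugar content (grams per serving).
predict_rating <- function(sugars) intercept + slope * sugars

# Zero sugar, the mean sugar content, and the maximum sugar content.
sugar_values <- c(0, 6.92, 15)
print(data.frame(sugars = sugar_values,
                 predicted_rating = predict_rating(sugar_values)))
```

At the mean sugar content of 6.92 g the fitted rating is about 42.67, which agrees with the mean rating of 42.7 in the data summary, as expected for a least-squares line.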
Finally, to check the fit of the model, we will examine the model residuals:
model.resid<-model$residuals
plot(model.resid, main="Residuals",pch=20)
abline(0,0, lwd=2, col="blueviolet")
The residuals show the difference between the actual and fitted values of the model. They are spread across the range of the data and scattered roughly evenly about 0, suggesting the model is a reasonable fit.
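As a worked check, individual residuals can be recomputed from the printed coefficients and two rows of the head(data) output. The data frame `obs` is a hypothetical name; its values are retyped from the output above.

```r
# Coefficients copied from the summary(model) output above.
intercept <- 59.284
slope     <- -2.401

# Two rows retyped from the head(data) output above.
obs <- data.frame(
  name   = c("Quaker_Oatmeal", "All-Bran_with_Extra_Fiber"),
  sugars = c(-1, 0),
  rating = c(50.83, 93.70)
)

# Residual = actual rating minus fitted rating.
obs$fitted   <- intercept + slope * obs$sugars
obs$residual <- obs$rating - obs$fitted
print(obs)
```

The All-Bran residual of about 34.42 matches the maximum residual in the summary output. That largest positive residual is roughly twice the size of the largest negative one (-17.85), which may hint at some right skew. Plotting residuals against fitted values and a normal Q-Q plot, for example with `plot(model, which = 1:2)`, would be one way to look into this.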