Disclaimer: This report was written for an assignment in my introductory data mining class. Some of the analysis may contain errors, which will be corrected over time. Hence, the findings should not be generalised.
## name mfr type calories protein fat sodium fiber carbo
## 1 100% Bran N C 70 4 1 130 10.0 5.0
## 2 100% Natural Bran Q C 120 3 5 15 2.0 8.0
## 3 All-Bran K C 70 4 1 260 9.0 7.0
## 4 All-Bran with Extra Fiber K C 50 4 0 140 14.0 8.0
## 5 Almond Delight R C 110 2 2 200 1.0 14.0
## 6 Apple Cinnamon Cheerios G C 110 2 2 180 1.5 10.5
## sugars potass vitamins shelf weight cups rating
## 1 6 280 25 3 1 0.33 68.40297
## 2 8 135 0 3 1 1.00 33.98368
## 3 5 320 25 3 1 0.33 59.42551
## 4 0 330 25 3 1 0.50 93.70491
## 5 8 -1 25 3 1 0.75 34.38484
## 6 10 70 25 1 1 0.75 29.50954
Question 1: What is the general statistical description of this dataset?
## name mfr type calories
## Length:77 Length:77 Length:77 Min. : 50.0
## Class :character Class :character Class :character 1st Qu.:100.0
## Mode :character Mode :character Mode :character Median :110.0
## Mean :106.9
## 3rd Qu.:110.0
## Max. :160.0
## protein fat sodium fiber
## Min. :1.000 Min. :0.000 Min. : 0.0 Min. : 0.000
## 1st Qu.:2.000 1st Qu.:0.000 1st Qu.:130.0 1st Qu.: 1.000
## Median :3.000 Median :1.000 Median :180.0 Median : 2.000
## Mean :2.545 Mean :1.013 Mean :159.7 Mean : 2.152
## 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:210.0 3rd Qu.: 3.000
## Max. :6.000 Max. :5.000 Max. :320.0 Max. :14.000
## carbo sugars potass vitamins
## Min. :-1.0 Min. :-1.000 Min. : -1.00 Min. : 0.00
## 1st Qu.:12.0 1st Qu.: 3.000 1st Qu.: 40.00 1st Qu.: 25.00
## Median :14.0 Median : 7.000 Median : 90.00 Median : 25.00
## Mean :14.6 Mean : 6.922 Mean : 96.08 Mean : 28.25
## 3rd Qu.:17.0 3rd Qu.:11.000 3rd Qu.:120.00 3rd Qu.: 25.00
## Max. :23.0 Max. :15.000 Max. :330.00 Max. :100.00
## shelf weight cups rating
## Min. :1.000 Min. :0.50 Min. :0.250 Min. :18.04
## 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:0.670 1st Qu.:33.17
## Median :2.000 Median :1.00 Median :0.750 Median :40.40
## Mean :2.208 Mean :1.03 Mean :0.821 Mean :42.67
## 3rd Qu.:3.000 3rd Qu.:1.00 3rd Qu.:1.000 3rd Qu.:50.83
## Max. :3.000 Max. :1.50 Max. :1.500 Max. :93.70
The variables mfr (manufacturer) and type are categorical but stored as characters, while shelf and vitamins are categorical but stored as numbers. These four variables will be converted to factors. The remaining variables are numerical. The variables carbo (carbohydrates), sugars and potass (potassium) contain negative values (a minimum of -1), which are not possible; they most likely stand in for missing measurements. These negative values will be replaced with 0. The type and mfr variables are coded as single letters, which makes the output hard to read, so they are relabelled with full names in the tables that follow. The highest rating, 93.7, belongs to All-Bran with Extra Fiber, a Kellogg's (K) cereal. It lies far above the third quartile of 50.83, which makes it a potential outlier. However, the point is retained, because this cereal may genuinely be rated that highly.
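The cleaning steps described above can be sketched in base R as follows. This is a minimal illustration on a three-row toy data frame (values copied from rows 1, 2 and 5 of the preview), not the report's actual code. The names `cereal_demo`, `mfr_labels` and `type_labels` are assumptions introduced for the example; the label spellings follow the tables later in this report.

```r
# Toy stand-in for the cereal data (rows 1, 2 and 5 of the preview above)
cereal_demo <- data.frame(
  name     = c("100% Bran", "100% Natural Bran", "Almond Delight"),
  mfr      = c("N", "Q", "R"),
  type     = c("C", "C", "C"),
  carbo    = c(5, 8, 14),
  sugars   = c(6, 8, 8),
  potass   = c(280, 135, -1),
  shelf    = c(3, 3, 3),
  vitamins = c(25, 0, 25),
  stringsAsFactors = FALSE
)

# Assumed lookup tables that map single-letter codes to readable labels
mfr_labels <- c(A = "American Home Food Prod", G = "General Mills",
                K = "Kellogs", N = "Nabisco", P = "Post",
                Q = "Quaker Oats", R = "Ralston Purina")
type_labels <- c(C = "Cold", H = "Hot")

# Convert the categorical variables to factors
cereal_demo$mfr <- factor(cereal_demo$mfr,
                          levels = names(mfr_labels),
                          labels = unname(mfr_labels))
cereal_demo$type <- factor(cereal_demo$type,
                           levels = names(type_labels),
                           labels = unname(type_labels))
cereal_demo$shelf    <- factor(cereal_demo$shelf)
cereal_demo$vitamins <- factor(cereal_demo$vitamins)

# Replace impossible negative values with 0, as the report does
for (col in c("carbo", "sugars", "potass")) {
  cereal_demo[[col]][cereal_demo[[col]] < 0] <- 0
}

str(cereal_demo)
```

Replacing the negative values with `NA` instead of 0 would be an alternative if they are treated as missing data; the report uses 0, so the sketch does too.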
Question 2: Find the count of hot and cold cereal according to manufacturer. Based on your interpretation, can a cereal type (hot or cold) influence its popularity on the market?
Two-way contingency table to find the count of hot and cold cereals by manufacturer.
type_cnt<-table(cereal$mfr,cereal$type)
type_cnt
##
## Cold Hot
## American Home Food Prod 0 1
## General Mills 22 0
## Kellogs 23 0
## Nabisco 5 1
## Post 9 0
## Quaker Oats 7 1
## Ralston Purina 8 0
Bar Plot of Cereal Type according to Manufacturer
Only three manufacturers (American Home Food Prod, Nabisco and Quaker Oats) produce hot cereals, one each. Cold cereals are far more abundant: 74 of the 77 cereals are cold, so a cereal picked at random is much more likely to be cold than hot. The three companies with the widest variety of cereals, Kellogg's (23), General Mills (22) and Post (9), produce only cold cereals. Hot cereals require preparation and are typically eaten at home, while cold cereals can be consumed anywhere, which suits people who are always on the go, whether commuting to work or travelling. This may explain the large number of cold cereals on the market.
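The counts behind this interpretation can be checked with a few lines of base R. The sketch below rebuilds the contingency table by hand from the output above (the object name `type_cnt_demo` is an assumption, chosen so that it does not overwrite `type_cnt`) and derives the share of cold cereals and the manufacturers that make hot cereals.

```r
# Counts copied from the contingency table above (filled column by column)
type_cnt_demo <- matrix(
  c(0, 22, 23, 5, 9, 7, 8,   # Cold
    1,  0,  0, 1, 0, 1, 0),  # Hot
  ncol = 2,
  dimnames = list(
    c("American Home Food Prod", "General Mills", "Kellogs", "Nabisco",
      "Post", "Quaker Oats", "Ralston Purina"),
    c("Cold", "Hot")
  )
)

# Totals per cereal type and the share of cold cereals
totals     <- colSums(type_cnt_demo)
cold_share <- totals[["Cold"]] / sum(totals)

# Manufacturers with at least one hot cereal
hot_makers <- rownames(type_cnt_demo)[type_cnt_demo[, "Hot"] > 0]

totals
round(cold_share, 3)
hot_makers
```

With the real data, `prop.table(type_cnt, margin = 1)` would give the same kind of breakdown per manufacturer.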
Question 3: What is the distribution of weight in cereals with cup sizes greater than 0.7? What can you infer about the weight of cereals where cup size is 0.75 in comparison with those where cup size is 1?
The dataset is filtered to extract observations where cup size is greater than 0.7.
## name mfr type vitamins
## 1 100% Natural Bran Quaker Oats Cold 0
## 2 Almond Delight Ralston Purina Cold 25
## 3 Apple Cinnamon Cheerios General Mills Cold 25
## 4 Apple Jacks Kellogs Cold 25
## 5 Basic 4 General Mills Cold 25
## 6 Cap'n'Crunch Quaker Oats Cold 25
## 7 Cheerios General Mills Cold 25
## 8 Cinnamon Toast Crunch General Mills Cold 25
## 9 Cocoa Puffs General Mills Cold 25
## 10 Corn Chex Ralston Purina Cold 25
## 11 Corn Flakes Kellogs Cold 25
## 12 Corn Pops Kellogs Cold 25
## 13 Count Chocula General Mills Cold 25
## 14 Cream of Wheat (Quick) Nabisco Hot 0
## 15 Crispix Kellogs Cold 25
## 16 Crispy Wheat & Raisins General Mills Cold 25
## 17 Double Chex Ralston Purina Cold 25
## 18 Froot Loops Kellogs Cold 25
## 19 Frosted Flakes Kellogs Cold 25
## 20 Frosted Mini-Wheats Kellogs Cold 25
## 21 Fruity Pebbles Post Cold 25
## 22 Golden Crisp Post Cold 25
## 23 Golden Grahams General Mills Cold 25
## 24 Grape Nuts Flakes Post Cold 25
## 25 Honey Graham Ohs Quaker Oats Cold 25
## 26 Honey Nut Cheerios General Mills Cold 25
## 27 Honey-comb Post Cold 25
## 28 Just Right Crunchy Nuggets Kellogs Cold 100
## 29 Just Right Fruit & Nut Kellogs Cold 100
## 30 Kix General Mills Cold 25
## 31 Lucky Charms General Mills Cold 25
## 32 Maypo American Home Food Prod Hot 25
## 33 Muesli Raisins; Dates; & Almonds Ralston Purina Cold 25
## 34 Muesli Raisins; Peaches; & Pecans Ralston Purina Cold 25
## 35 Multi-Grain Cheerios General Mills Cold 25
## 36 Nutri-grain Wheat Kellogs Cold 25
## 37 Product 19 Kellogs Cold 100
## 38 Puffed Rice Quaker Oats Cold 0
## 39 Puffed Wheat Quaker Oats Cold 0
## 40 Raisin Bran Kellogs Cold 25
## 41 Rice Chex Ralston Purina Cold 25
## 42 Rice Krispies Kellogs Cold 25
## 43 Shredded Wheat Nabisco Cold 0
## 44 Smacks Kellogs Cold 25
## 45 Special K Kellogs Cold 25
## 46 Strawberry Fruit Wheats Nabisco Cold 25
## 47 Total Corn Flakes General Mills Cold 100
## 48 Total Raisin Bran General Mills Cold 100
## 49 Total Whole Grain General Mills Cold 100
## 50 Triples General Mills Cold 25
## 51 Trix General Mills Cold 25
## 52 Wheaties General Mills Cold 25
## 53 Wheaties Honey Gold General Mills Cold 25
## shelf carbo sugars potass protein fat sodium weight cups calories fiber
## 1 3 8.0 8 135 3 5 15 1.00 1.00 120 2.0
## 2 3 14.0 8 0 2 2 200 1.00 0.75 110 1.0
## 3 1 10.5 10 70 2 2 180 1.00 0.75 110 1.5
## 4 2 11.0 14 30 2 0 125 1.00 1.00 110 1.0
## 5 3 18.0 8 100 3 2 210 1.33 0.75 130 2.0
## 6 2 12.0 12 35 1 2 220 1.00 0.75 120 0.0
## 7 1 17.0 1 105 6 2 290 1.00 1.25 110 2.0
## 8 2 13.0 9 45 1 3 210 1.00 0.75 120 0.0
## 9 2 12.0 13 55 1 1 180 1.00 1.00 110 0.0
## 10 1 22.0 3 25 2 0 280 1.00 1.00 110 0.0
## 11 1 21.0 2 35 2 0 290 1.00 1.00 100 1.0
## 12 2 13.0 12 20 1 0 90 1.00 1.00 110 1.0
## 13 2 12.0 13 65 1 1 180 1.00 1.00 110 0.0
## 14 2 21.0 0 0 3 0 80 1.00 1.00 100 1.0
## 15 3 21.0 3 30 2 0 220 1.00 1.00 110 1.0
## 16 3 11.0 10 120 2 1 140 1.00 0.75 100 2.0
## 17 3 18.0 5 80 2 0 190 1.00 0.75 100 1.0
## 18 2 11.0 13 30 2 1 125 1.00 1.00 110 1.0
## 19 1 14.0 11 25 1 0 200 1.00 0.75 110 1.0
## 20 2 14.0 7 100 3 0 0 1.00 0.80 100 3.0
## 21 2 13.0 12 25 1 1 135 1.00 0.75 110 0.0
## 22 1 11.0 15 40 2 0 45 1.00 0.88 100 0.0
## 23 2 15.0 9 45 1 1 280 1.00 0.75 110 0.0
## 24 3 15.0 5 85 3 1 140 1.00 0.88 100 3.0
## 25 2 12.0 11 45 1 2 220 1.00 1.00 120 1.0
## 26 1 11.5 10 90 3 1 250 1.00 0.75 110 1.5
## 27 1 14.0 11 35 1 0 180 1.00 1.33 110 0.0
## 28 3 17.0 6 60 2 1 170 1.00 1.00 110 1.0
## 29 3 20.0 9 95 3 1 170 1.30 0.75 140 2.0
## 30 2 21.0 3 40 2 1 260 1.00 1.50 110 0.0
## 31 2 12.0 12 55 2 1 180 1.00 1.00 110 0.0
## 32 2 16.0 3 95 4 1 0 1.00 1.00 100 0.0
## 33 3 16.0 11 170 4 3 95 1.00 1.00 150 3.0
## 34 3 16.0 11 170 4 3 150 1.00 1.00 150 3.0
## 35 1 15.0 6 90 2 1 220 1.00 1.00 100 2.0
## 36 3 18.0 2 90 3 0 170 1.00 1.00 90 3.0
## 37 3 20.0 3 45 3 0 320 1.00 1.00 100 1.0
## 38 3 13.0 0 15 1 0 0 0.50 1.00 50 0.0
## 39 3 10.0 0 50 2 0 0 0.50 1.00 50 1.0
## 40 2 14.0 12 240 3 1 210 1.33 0.75 120 5.0
## 41 1 23.0 2 30 1 0 240 1.00 1.13 110 0.0
## 42 1 22.0 3 35 2 0 290 1.00 1.00 110 0.0
## 43 1 16.0 0 95 2 0 0 0.83 1.00 80 3.0
## 44 2 9.0 15 40 2 1 70 1.00 0.75 110 1.0
## 45 1 16.0 3 55 6 0 230 1.00 1.00 110 1.0
## 46 2 15.0 5 90 2 0 15 1.00 1.00 90 3.0
## 47 3 21.0 3 35 2 1 200 1.00 1.00 110 0.0
## 48 3 15.0 14 230 3 1 190 1.50 1.00 140 4.0
## 49 3 16.0 3 110 3 1 200 1.00 1.00 100 3.0
## 50 3 21.0 3 60 2 1 250 1.00 0.75 110 0.0
## 51 2 13.0 12 25 1 1 140 1.00 1.00 110 0.0
## 52 1 17.0 3 110 3 1 200 1.00 1.00 100 3.0
## 53 1 16.0 8 60 2 1 200 1.00 0.75 110 1.0
## rating
## 1 33.98368
## 2 34.38484
## 3 29.50954
## 4 33.17409
## 5 37.03856
## 6 18.04285
## 7 50.76500
## 8 19.82357
## 9 22.73645
## 10 41.44502
## 11 45.86332
## 12 35.78279
## 13 22.39651
## 14 64.53382
## 15 46.89564
## 16 36.17620
## 17 44.33086
## 18 32.20758
## 19 31.43597
## 20 58.34514
## 21 28.02576
## 22 35.25244
## 23 23.80404
## 24 52.07690
## 25 21.87129
## 26 31.07222
## 27 28.74241
## 28 36.52368
## 29 36.47151
## 30 39.24111
## 31 26.73451
## 32 54.85092
## 33 37.13686
## 34 34.13976
## 35 40.10596
## 36 59.64284
## 37 41.50354
## 38 60.75611
## 39 63.00565
## 40 39.25920
## 41 41.99893
## 42 40.56016
## 43 68.23588
## 44 31.23005
## 45 53.13132
## 46 59.36399
## 47 38.83975
## 48 28.59278
## 49 46.65884
## 50 39.10617
## 51 27.75330
## 52 51.59219
## 53 36.18756
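The filtering step that produced the table above is not shown in the report. A minimal base R sketch of such a filter, on a toy data frame built from rows 1, 2, 3 and 5 of the preview (the names `cups_demo` and `big_cups` are assumptions), might look like this:

```r
# Toy stand-in: name, cups and weight for four cereals from the preview
cups_demo <- data.frame(
  name   = c("100% Bran", "100% Natural Bran", "All-Bran", "Almond Delight"),
  cups   = c(0.33, 1.00, 0.33, 0.75),
  weight = c(1, 1, 1, 1)
)

# Keep only the observations whose cup size is greater than 0.7
big_cups <- subset(cups_demo, cups > 0.7)
big_cups
```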
Density Plot for Comparison between Cereals of Cup Size 0.75 and Cereals of Cup Size 1
## Warning: Groups with fewer than two data points have been dropped.
For cereals with a cup size of 1, weight is approximately normally distributed. Weight is also approximately normally distributed where cup size is 0.75. However, the peak of that distribution is lower than the peak for cup size 1. This implies that the weights of the size 1 cereals are more tightly concentrated around their central value than the weights of the size 0.75 cereals.
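The comparison of peak heights can be reproduced numerically. The sketch below uses the weights listed in the filtered table for cup sizes 0.75 (16 cereals) and 1 (30 cereals) and estimates both densities with base R's `density()`. The shared bandwidth of 0.05 is an assumption made so that the two curves are comparable; the report's plot may have used different settings.

```r
# Weights copied from the filtered table: 16 cereals with cups == 0.75
w_075 <- c(rep(1.00, 13), 1.33, 1.30, 1.33)
# ... and 30 cereals with cups == 1
w_1 <- c(rep(1.00, 26), 0.50, 0.50, 0.83, 1.50)

# Kernel density estimates with a common (assumed) bandwidth
d_075 <- density(w_075, bw = 0.05)
d_1   <- density(w_1,   bw = 0.05)

# Height of each curve's peak
peak_075 <- max(d_075$y)
peak_1   <- max(d_1$y)
c(cups_0.75 = peak_075, cups_1 = peak_1)
```

A larger share of the cup size 1 cereals weigh exactly 1, which is why that curve has the taller peak.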
Question 4: Describe the shape and center of the fat distribution and the carbohydrates distribution in cereals. Are there similarities between fat and carbohydrates?
Histogram Showing the Distribution of Fat per Serving
Histogram Showing the Distribution of Carbohydrates per Serving
The distribution of fat in cereals is skewed to the right, that is, positively skewed. This kind of distribution has a large number of occurrences on the left side and few on the right side. The mean fat per serving is about 1 g (1.013), and at least half of the cereals contain 1 g or less.
Carbohydrates, on the other hand, have a left-skewed distribution. The mean carbohydrates per serving falls between 10 and 15 g (about 14.6 g), with very few occurrences on the left. The only similarity between fat and carbohydrates observed here is that neither is normally distributed.
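The direction of skew can also be checked numerically rather than read off a histogram. The sketch below defines a moment-based sample skewness (positive for a right tail, negative for a left tail). The function name `skewness` and both example vectors are hypothetical and are not taken from the cereal data.

```r
# Moment-based sample skewness: third central moment over variance^(3/2)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Hypothetical values shaped like the two histograms described above
fat_demo   <- c(0, 0, 0, 1, 1, 1, 2, 2, 3, 5)           # long right tail
carbo_demo <- c(5, 12, 14, 15, 16, 17, 18, 20, 21, 22)  # long left tail

skewness(fat_demo)    # positive: right-skewed
skewness(carbo_demo)  # negative: left-skewed
```

On the real data, the same function could be applied to `cereal$fat` and `cereal$carbo`.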
Question 5: Can sugar content in cereals predict shelf placement? Provide your observation of the sugar distribution in cereals according to their location on shelves.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The mean amount of sugar on both shelf 1 and shelf 3 falls below 5 g per serving. In contrast, the mean amount of sugar on shelf 2 is about 10 to 13 g per serving. Products in a supermarket are usually placed according to customers' reach and eye level. Given that consumers range from children to adults and older people, we can suggest a reason for sugar-rich cereals being placed on the middle shelf (shelf 2): a child of around 5 to 10 years would easily spot the cereals there. Manufacturers may therefore be placing their sugar-rich cereals on the middle shelf to attract children. In that case, we can say that sugar content in cereals can affect their shelf placement.
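Group means like the ones quoted above can be computed directly with `tapply()`. The sketch below uses only the six preview rows (shelf and sugars columns), so its numbers do not reproduce the full-data means; the name `shelf_demo` is an assumption.

```r
# Shelf and sugar values from the six preview rows
shelf_demo <- data.frame(
  shelf  = factor(c(3, 3, 3, 3, 3, 1)),
  sugars = c(6, 8, 5, 0, 8, 10)
)

# Mean sugar per serving for each shelf present in the toy data
mean_sugar <- tapply(shelf_demo$sugars, shelf_demo$shelf, mean)
mean_sugar
```

With the full dataset, `tapply(cereal$sugars, cereal$shelf, mean)` would return one mean per shelf.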
Question 6: Do calories per serving and sugar per serving affect carbohydrates? Describe the relationship between carbohydrates, calories and sugar.
From the scatterplot above, we can see that as calories per serving increase, carbohydrates also increase (in other words, the higher the calories, the higher the carbohydrates). We can therefore assume that a strong positive linear relationship exists between carbohydrates and calories. The relationship between sugar and carbohydrates, however, seems to be very weak and non-linear.
Question 7: Compare the relationship of calorie content and ratings to the relationship of fiber content and ratings.
Correlation between calories and rating
cor.test(cereal$calories,cereal$rating, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: cereal$calories and cereal$rating
## t = -8.2415, df = 75, p-value = 4.14e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7911906 -0.5503788
## sample estimates:
## cor
## -0.689376
Correlation between fiber and rating
cor.test(cereal$fiber,cereal$rating, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: cereal$fiber and cereal$rating
## t = 6.233, df = 75, p-value = 2.445e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4144017 0.7146365
## sample estimates:
## cor
## 0.5841604
Correlogram Matrix to visualize the strength of the relationships between calories vs rating and fiber vs rating.
The correlation between calories and rating is -0.689. This value is closer to -1 than to 0, which indicates a fairly strong negative linear relationship: as the calories in a cereal increase, the rating tends to decrease. For fiber and rating, the correlation is 0.584, which indicates a moderate-to-strong positive linear relationship: as the fiber content in a cereal increases, the rating tends to increase. Both correlations are statistically significant (p < 0.001), and the relationship with calories is the stronger of the two. This suggests that cereals with healthier profiles, such as high fiber, are rated higher, and those with excessive calories are rated lower.
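The t statistics reported by `cor.test()` follow directly from the correlation coefficient and the sample size, via t = r * sqrt(n - 2) / sqrt(1 - r^2). The sketch below recomputes them from the printed correlations (n = 77, so 75 degrees of freedom) and also shows r squared, the share of variance the two variables have in common. The helper name `t_from_r` is an assumption.

```r
# t statistic for a Pearson correlation r based on n observations
t_from_r <- function(r, n) r * sqrt(n - 2) / sqrt(1 - r^2)

n       <- 77
r_cal   <- -0.689376   # calories vs rating, from the output above
r_fiber <- 0.5841604   # fiber vs rating, from the output above

t_cal   <- t_from_r(r_cal, n)
t_fiber <- t_from_r(r_fiber, n)

# Two-sided p-value for the calories correlation
p_cal <- 2 * pt(-abs(t_cal), df = n - 2)

# Shared variance (r squared) for each pair
r2 <- c(calories = r_cal^2, fiber = r_fiber^2)

c(t_cal = t_cal, t_fiber = t_fiber)
p_cal
round(r2, 3)
```

The recomputed values agree with the `cor.test()` output above up to rounding of the printed correlations.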
Question 8: What do you observe with respect to the calories of cereals by their shelf placement?
The boxplot above represents the distribution of rating according to shelf. Separating the data by shelf placement shows that there are outliers in the ratings on each shelf. The top shelf (shelf 3) has the highest median rating as well as the cereal with the highest rating overall, and its distribution is right-skewed. The cereals with the lowest ratings are found on shelf 2, whose boxplot is roughly symmetric, consistent with a normal distribution. When outliers are excluded, the highest-rated cereal is found on shelf 1.
Question 9: Describe the relationship between non-nutrition variables and ratings. Can a cereal that is manufactured by Kellogg's and placed on the top shelf (shelf 3) positively affect ratings?
Construction of a new data frame that gives the count, mean and standard deviation of ratings
## # A tibble: 13 x 5
## # Groups: mfr [7]
## mfr `shelf == 3` count average_rating stdev_rating
## <fct> <lgl> <int> <dbl> <dbl>
## 1 American Home Food Prod FALSE 1 54.9 NA
## 2 General Mills FALSE 13 32.4 10.5
## 3 General Mills TRUE 9 37.4 5.39
## 4 Kellogs FALSE 11 39.2 9.56
## 5 Kellogs TRUE 12 48.5 17.0
## 6 Nabisco FALSE 5 67.9 6.16
## 7 Nabisco TRUE 1 68.4 NA
## 8 Post FALSE 3 30.7 3.98
## 9 Post TRUE 6 47.2 6.76
## 10 Quaker Oats FALSE 4 34.0 16.5
## 11 Quaker Oats TRUE 4 51.8 13.3
## 12 Ralston Purina FALSE 4 45.6 4.48
## 13 Ralston Purina TRUE 4 37.5 4.75
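The code that built this summary is not shown. The tibble header suggests a grouped summary by `mfr` and `shelf == 3`; the sketch below produces the same three statistics with base R's `aggregate()` instead, which is a swapped-in technique rather than the report's method. It runs on the six preview rows only, and the names `rating_demo`, `top_shelf` and `rating_stats` are assumptions.

```r
# Manufacturer code, shelf and rating from the six preview rows
rating_demo <- data.frame(
  mfr    = c("N", "Q", "K", "K", "R", "G"),
  shelf  = c(3, 3, 3, 3, 3, 1),
  rating = c(68.40297, 33.98368, 59.42551, 93.70491, 34.38484, 29.50954)
)

# Flag cereals on the top shelf, mirroring the `shelf == 3` grouping
rating_demo$top_shelf <- rating_demo$shelf == 3

# Count, mean and standard deviation of rating per manufacturer and flag
rating_stats <- aggregate(
  rating ~ mfr + top_shelf,
  data = rating_demo,
  FUN = function(x) c(count = length(x),
                      average_rating = mean(x),
                      stdev_rating = sd(x))
)

# aggregate() returns the three statistics as a matrix column; flatten it
rating_stats <- do.call(data.frame, rating_stats)
rating_stats
```

As in the table above, groups with a single cereal get `NA` for the standard deviation, because `sd()` needs at least two values.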
Point Plot to visualize the association of manufacturer,shelf and rating.
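library(ggplot2)  # provides ggplot(), aes(), geom_point() and the scale functions
# shelf must be a factor so that colour and shape are mapped as discrete values
ggplot(cereal, aes(x = mfr, y = rating, color = shelf, shape = shelf)) +
  geom_point(size = 6) +
  scale_shape_manual(values = 17:25)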
Of the 77 cereals, 12 are Kellogg's cereals placed on shelf 3. Their average rating is 48.5, compared with 39.2 for Kellogg's cereals on the other shelves, but the standard deviation of 17.0 is very high. With that much spread, there is no clear evidence of a relationship between manufacturer and rating. However, in the point plot above, most cereals found on shelf 3 (the top shelf), indicated by blue points, have relatively high ratings compared with those on the other shelves.
Question 10: Is there an association between vitamins and manufacturer?
From the bar charts, it can clearly be observed that a great number of cereal manufacturers produce cereals with a vitamins value of 25. In this dataset the vitamins variable is a percentage of the recommended daily allowance per serving (0, 25 or 100), not a weight in grams. The prevalence of the 25% level among the top three manufacturers may reflect national health standards for vitamins in cereals.
Modeling
Target variable : Rating
Simple Linear Regression
Model1
#Simple linear model
linear_mod1<-lm(rating~fiber, data = cereal)
linear_mod1
##
## Call:
## lm(formula = rating ~ fiber, data = cereal)
##
## Coefficients:
## (Intercept) fiber
## 35.257 3.443
summary(linear_mod1)#model summary
##
## Call:
## lm(formula = rating ~ fiber, data = cereal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.436 -8.159 -2.037 6.491 27.216
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.2566 1.7674 19.948 < 2e-16 ***
## fiber 3.4430 0.5524 6.233 2.45e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.48 on 75 degrees of freedom
## Multiple R-squared: 0.3412, Adjusted R-squared: 0.3325
## F-statistic: 38.85 on 1 and 75 DF, p-value: 2.445e-08
The simple linear model (linear_mod1) uses the cereal dataset to model rating as a linear function of fiber, the explanatory variable. linear_mod1 is statistically significant because the p-values of the regression model (2.445e-08, that is, 0.00000002445) and of the predictor fiber (2.45e-08) are far below the standard significance level of 0.05. The R-squared of 0.3412 means that fiber alone explains about 34% of the variance in rating.
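Two details of this summary can be verified by hand. In a simple linear regression, R-squared equals the square of the Pearson correlation between predictor and response, and scientific notation such as 2.445e-08 can be expanded with `format()`. The sketch below uses only numbers printed in this report; the variable names and the helper `predict_mod1` are assumptions.

```r
# Pearson correlation between fiber and rating, from the cor.test() output
r_fiber <- 0.5841604

# In simple linear regression, R-squared is the squared correlation
r_squared <- r_fiber^2
round(r_squared, 4)

# Expand the model p-value from scientific to fixed notation
p_value <- 2.445e-08
format(p_value, scientific = FALSE)
p_value < 0.05

# Fitted line of linear_mod1, using the printed coefficients
predict_mod1 <- function(fiber) 35.2566 + 3.4430 * fiber
predict_mod1(10)  # predicted rating for a cereal with 10 g of fiber
```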
Model2
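Model2 is fitted on the train dataset, which contains 57 of the 77 cereals (55 residual degrees of freedom plus 2 estimated coefficients).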
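The code that created the train and test datasets is not shown in this report. A minimal sketch of one way to make such a split in base R follows. The seed, the use of `sample()` and the name `cereal_demo` are assumptions; only the sizes (57 training rows out of 77) come from the model output.

```r
set.seed(123)  # assumed seed, for reproducibility only

# Stand-in for the cereal data: one row per cereal
cereal_demo <- data.frame(id = seq_len(77))

# Randomly pick 57 rows for training; the remaining 20 form the test set
train_idx <- sample(seq_len(nrow(cereal_demo)), size = 57)
train <- cereal_demo[train_idx, , drop = FALSE]
test  <- cereal_demo[-train_idx, , drop = FALSE]

c(train = nrow(train), test = nrow(test))
```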
#Simple linear model
linear_mod2<-lm(rating~fiber, data = train)
linear_mod2
##
## Call:
## lm(formula = rating ~ fiber, data = train)
##
## Coefficients:
## (Intercept) fiber
## 35.772 3.166
summary(linear_mod2)
##
## Call:
## lm(formula = rating ~ fiber, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.843 -7.987 -1.998 6.227 26.037
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.7719 2.0945 17.079 < 2e-16 ***
## fiber 3.1659 0.7902 4.007 0.000187 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.57 on 55 degrees of freedom
## Multiple R-squared: 0.2259, Adjusted R-squared: 0.2119
## F-statistic: 16.05 on 1 and 55 DF, p-value: 0.0001867
The simple linear model (linear_mod2) uses the train dataset to predict the response variable rating from fiber. Model2 is statistically significant, as the p-values of the regression model (0.0001867) and of the predictor fiber (0.000187) are below the standard significance level of 0.05. One limitation of simple linear regression is that it handles only one predictor variable at a time, which is not ideal because a response in a real-world dataset rarely depends on a single variable. Often, variables are correlated with one or more other variables.
Multiple Linear Regression
Model3
##Multiple Linear Regression
linear_mod3<-lm(rating~sugars+calories, data =cereal)
linear_mod3
##
## Call:
## lm(formula = rating ~ sugars + calories, data = cereal)
##
## Coefficients:
## (Intercept) sugars calories
## 84.0620 -1.7369 -0.2746
summary(linear_mod3)
##
## Call:
## lm(formula = rating ~ sugars + calories, data = cereal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.653 -6.074 -1.248 4.726 23.373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 84.06199 5.43272 15.473 < 2e-16 ***
## sugars -1.73692 0.25328 -6.858 1.81e-09 ***
## calories -0.27460 0.05749 -4.776 8.81e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.064 on 74 degrees of freedom
## Multiple R-squared: 0.6791, Adjusted R-squared: 0.6705
## F-statistic: 78.32 on 2 and 74 DF, p-value: < 2.2e-16
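The multiple linear regression model (linear_mod3) uses the cereal dataset to model the response variable rating as a linear function of two explanatory variables, sugars and calories. It is a multiple regression because it takes more than one predictor variable. Holding the other predictor fixed, each additional gram of sugar is associated with a rating about 1.74 points lower, and each additional calorie with a rating about 0.27 points lower. The model is statistically significant (regression model p-value < 2.2e-16; sugars 1.81e-09; calories 8.81e-06) and explains about 68% of the variance in rating (R-squared 0.6791).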
Model4
##Multiple Linear Regression
linear_mod4<-lm(rating~sugars+calories+protein, data =train)
linear_mod4
##
## Call:
## lm(formula = rating ~ sugars + calories + protein, data = train)
##
## Coefficients:
## (Intercept) sugars calories protein
## 75.6768 -1.0835 -0.3637 5.2131
summary(linear_mod4)
##
## Call:
## lm(formula = rating ~ sugars + calories + protein, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.102 -3.459 -1.403 3.280 15.888
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 75.67683 5.27834 14.337 < 2e-16 ***
## sugars -1.08355 0.22609 -4.793 1.37e-05 ***
## calories -0.36368 0.05656 -6.430 3.72e-08 ***
## protein 5.21305 0.84779 6.149 1.05e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.98 on 53 degrees of freedom
## Multiple R-squared: 0.8007, Adjusted R-squared: 0.7894
## F-statistic: 70.99 on 3 and 53 DF, p-value: < 2.2e-16
Model Selection
For model comparison, the model with the lowest AIC and BIC score is preferred.
AIC(linear_mod1)
## [1] 598.3042
AIC(linear_mod2)
## [1] 444.8366
AIC(linear_mod3)
## [1] 544.9123
AIC(linear_mod4)
## [1] 371.4909
BIC(linear_mod1)
## [1] 605.3356
BIC(linear_mod2)
## [1] 450.9658
BIC(linear_mod3)
## [1] 554.2876
BIC(linear_mod4)
## [1] 381.7061
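Selected Model: Model 4
Model selection: the model with the lowest AIC and BIC is linear_mod4 (Model 4), with AIC 371.4909 and BIC 381.7061. Note that AIC and BIC are only comparable between models fitted to the same observations. linear_mod2 and linear_mod4 were both fitted on train, and linear_mod1 and linear_mod3 on the full cereal data, so the fair comparisons are Model 4 against Model 2 (AIC 371.49 vs 444.84) and Model 3 against Model 1 (AIC 544.91 vs 598.30). In both pairs the multiple regression is preferred. Model 4 also has the highest adjusted R-squared (0.7894), because it draws on three predictors rather than a single one.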
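AIC depends on the number of observations, so its values should be compared only between models fitted to the same rows. The sketch below illustrates this on simulated data (not the cereal data): the same formula fitted on half of the rows gets a much lower AIC simply because there are fewer observations, while a comparison on identical rows correctly favours the model with the extra true predictor. All names and simulation settings are assumptions.

```r
set.seed(1)  # assumed seed for the simulation

# Simulated data: rating depends on sugars and protein plus noise
n <- 60
sim <- data.frame(sugars = runif(n, 0, 15), protein = runif(n, 1, 6))
sim$rating <- 60 - 2 * sim$sugars + 4 * sim$protein + rnorm(n, sd = 3)

# Valid comparison: two models fitted on the same 60 rows
m_small <- lm(rating ~ sugars, data = sim)
m_big   <- lm(rating ~ sugars + protein, data = sim)
AIC(m_small, m_big)

# Misleading comparison: the same formula fitted on only 30 rows
m_small_half <- lm(rating ~ sugars, data = sim[1:30, ])
c(full = AIC(m_small), half = AIC(m_small_half))
```

The drop in AIC for `m_small_half` says nothing about model quality, which is why models fitted on cereal and on train should not be ranked against each other.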
Prediction
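# Predict test-set ratings with the selected model (linear_mod4)
pred_rating <- predict(linear_mod4, newdata = test)
actuals_preds <- data.frame(cbind(actuals=test$rating, predicteds=pred_rating))
head(actuals_preds)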
## actuals predicteds
## 2 33.98368 39.00660
## 3 59.42551 65.65405
## 4 93.70491 78.34528
## 11 18.04285 24.24632
## 22 46.89564 42.84803
## 24 44.33086 44.31769
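The table above shows only the first six rows of the test-set predictions. Most of these predictions fall within roughly 6 rating points of the actual value; the largest gap is for row 4 (actual 93.70, predicted 78.35), the potential outlier noted in Question 1.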
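The predicted values can be reproduced by hand from the printed coefficients of linear_mod4. The sketch below does this for rows 2 and 3 of the dataset (100% Natural Bran and All-Bran), using their sugars, calories and protein values from the preview. The helper name `predict_rating` is an assumption.

```r
# Coefficients of linear_mod4, copied from the summary output
coefs <- c(intercept = 75.67683, sugars = -1.08355,
           calories = -0.36368, protein = 5.21305)

# Linear prediction: intercept plus each coefficient times its predictor
predict_rating <- function(sugars, calories, protein) {
  coefs[["intercept"]] +
    coefs[["sugars"]]   * sugars +
    coefs[["calories"]] * calories +
    coefs[["protein"]]  * protein
}

# Row 2: 100% Natural Bran (sugars 8, calories 120, protein 3)
natural_bran <- predict_rating(sugars = 8, calories = 120, protein = 3)
# Row 3: All-Bran (sugars 5, calories 70, protein 4)
all_bran <- predict_rating(sugars = 5, calories = 70, protein = 4)

round(c(natural_bran = natural_bran, all_bran = all_bran), 2)
```

Both values agree with the predicteds column above (39.01 and 65.65) up to rounding of the printed coefficients.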
correlation_accuracy <- cor(actuals_preds)
correlation_accuracy
## actuals predicteds
## actuals 1.0000000 0.8648255
## predicteds 0.8648255 1.0000000
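Prediction: the correlation between actual and predicted ratings is about 0.86. This measures how closely the predictions track the actual ratings; it is not the percentage of correct predictions.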
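Error measures in the units of the response complement the correlation. The sketch below computes the root mean squared error (RMSE) and mean absolute error (MAE) for the six rows printed by `head(actuals_preds)`. Because it covers only those six rows, it is an illustration and not the test-set error; the name `shown` is an assumption.

```r
# The six rows printed by head(actuals_preds)
shown <- data.frame(
  actuals    = c(33.98368, 59.42551, 93.70491, 18.04285, 46.89564, 44.33086),
  predicteds = c(39.00660, 65.65405, 78.34528, 24.24632, 42.84803, 44.31769)
)

# Prediction errors in rating points
errors <- shown$actuals - shown$predicteds

rmse <- sqrt(mean(errors^2))  # penalises large misses more heavily
mae  <- mean(abs(errors))     # average size of a miss

round(c(rmse = rmse, mae = mae), 2)
```

Applying the same two lines to the full `actuals_preds` data frame would give the error over the whole test set.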