Disclaimer: This report was prepared as an assignment for my introductory data mining class. Some of the analysis may contain errors, which will be corrected over time. Hence, the findings should not be generalised.

##                        name mfr type calories protein fat sodium fiber carbo
## 1                 100% Bran   N    C       70       4   1    130  10.0   5.0
## 2         100% Natural Bran   Q    C      120       3   5     15   2.0   8.0
## 3                  All-Bran   K    C       70       4   1    260   9.0   7.0
## 4 All-Bran with Extra Fiber   K    C       50       4   0    140  14.0   8.0
## 5            Almond Delight   R    C      110       2   2    200   1.0  14.0
## 6   Apple Cinnamon Cheerios   G    C      110       2   2    180   1.5  10.5
##   sugars potass vitamins shelf weight cups   rating
## 1      6    280       25     3      1 0.33 68.40297
## 2      8    135        0     3      1 1.00 33.98368
## 3      5    320       25     3      1 0.33 59.42551
## 4      0    330       25     3      1 0.50 93.70491
## 5      8     -1       25     3      1 0.75 34.38484
## 6     10     70       25     1      1 0.75 29.50954

Question 1 : What is the general statistical description of this dataset?

##      name               mfr                type              calories    
##  Length:77          Length:77          Length:77          Min.   : 50.0  
##  Class :character   Class :character   Class :character   1st Qu.:100.0  
##  Mode  :character   Mode  :character   Mode  :character   Median :110.0  
##                                                           Mean   :106.9  
##                                                           3rd Qu.:110.0  
##                                                           Max.   :160.0  
##     protein           fat            sodium          fiber       
##  Min.   :1.000   Min.   :0.000   Min.   :  0.0   Min.   : 0.000  
##  1st Qu.:2.000   1st Qu.:0.000   1st Qu.:130.0   1st Qu.: 1.000  
##  Median :3.000   Median :1.000   Median :180.0   Median : 2.000  
##  Mean   :2.545   Mean   :1.013   Mean   :159.7   Mean   : 2.152  
##  3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:210.0   3rd Qu.: 3.000  
##  Max.   :6.000   Max.   :5.000   Max.   :320.0   Max.   :14.000  
##      carbo          sugars           potass          vitamins     
##  Min.   :-1.0   Min.   :-1.000   Min.   : -1.00   Min.   :  0.00  
##  1st Qu.:12.0   1st Qu.: 3.000   1st Qu.: 40.00   1st Qu.: 25.00  
##  Median :14.0   Median : 7.000   Median : 90.00   Median : 25.00  
##  Mean   :14.6   Mean   : 6.922   Mean   : 96.08   Mean   : 28.25  
##  3rd Qu.:17.0   3rd Qu.:11.000   3rd Qu.:120.00   3rd Qu.: 25.00  
##  Max.   :23.0   Max.   :15.000   Max.   :330.00   Max.   :100.00  
##      shelf           weight          cups           rating     
##  Min.   :1.000   Min.   :0.50   Min.   :0.250   Min.   :18.04  
##  1st Qu.:1.000   1st Qu.:1.00   1st Qu.:0.670   1st Qu.:33.17  
##  Median :2.000   Median :1.00   Median :0.750   Median :40.40  
##  Mean   :2.208   Mean   :1.03   Mean   :0.821   Mean   :42.67  
##  3rd Qu.:3.000   3rd Qu.:1.00   3rd Qu.:1.000   3rd Qu.:50.83  
##  Max.   :3.000   Max.   :1.50   Max.   :1.500   Max.   :93.70

The variables mfr (manufacturer) and type are categorical but stored as character, while shelf and vitamins are categorical but stored as numbers. These four variables will be converted to factors; the remaining variables are treated as numerical. The variables carbo (carbohydrates), sugars and potass (potassium) contain negative values, which are not possible, so the negative values will be replaced with 0. The type and mfr variables hold single-letter codes, which hurts readability, so they are relabelled with full names. The highest rating, 93.7, belongs to a Kellogg's (K) cereal and is a potential outlier when compared with the other cereals in the table. However, this point is not removed, because the cereal may genuinely be rated that highly.
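The cleaning steps described above can be sketched as follows. This is a minimal illustration on a hypothetical three-row data frame named `toy`; the full manufacturer names are taken from the tables later in this report, while the use of `factor()` with `labels` and of `pmax()` is an assumption about how the recoding could be done, not necessarily the code that was run.

```r
# Hypothetical toy rows that mimic the raw cereal columns
toy <- data.frame(
  mfr    = c("N", "R", "A"),
  type   = c("C", "C", "H"),
  shelf  = c(3, 3, 2),
  potass = c(280, -1, 95),
  stringsAsFactors = FALSE
)

# Recode the single-letter codes as labelled factors
toy$mfr <- factor(
  toy$mfr,
  levels = c("A", "G", "K", "N", "P", "Q", "R"),
  labels = c("American Home Food Prod", "General Mills", "Kellogs",
             "Nabisco", "Post", "Quaker Oats", "Ralston Purina")
)
toy$type  <- factor(toy$type, levels = c("C", "H"), labels = c("Cold", "Hot"))
toy$shelf <- factor(toy$shelf)

# Replace impossible negative values with 0, as the report describes
toy$potass <- pmax(toy$potass, 0)

str(toy)
```

The same `pmax()` call would apply to carbo and sugars.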

Question 2 : Find the count of hot and cold cereals by manufacturer. Based on your interpretation, can a cereal's type (hot or cold) influence its popularity on the market?

Two-way contingency table to find the count of hot and cold cereals by manufacturer.

type_cnt<-table(cereal$mfr,cereal$type)
type_cnt
##                          
##                           Cold Hot
##   American Home Food Prod    0   1
##   General Mills             22   0
##   Kellogs                   23   0
##   Nabisco                    5   1
##   Post                       9   0
##   Quaker Oats                7   1
##   Ralston Purina             8   0

BarPlot of Cereal Type according to Manufacturer

Only three manufacturers produce hot cereals, and each of them makes just one. Cold cereals are far more abundant: 74 of the 77 cereals are cold, so a cereal picked at random is much more likely to be cold than hot. The three companies with the largest variety of cereals (General Mills, Kellogs and Post) produce only cold cereals. Hot cereals require preparation and are mostly eaten at home, while cold cereals can be consumed anywhere, which suits people who are always on the go. The average person is almost always on the go, whether going to work or travelling. This may explain the large number of cold cereals on the market.
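To put a number on how dominant cold cereals are, the counts from the contingency table above can be turned into proportions. The sketch below rebuilds the table by hand from the printed counts (the object name `type_cnt_manual` is hypothetical); with the real data, `prop.table(table(cereal$type))` would give the same shares.

```r
# Counts copied from the two-way table printed above
type_cnt_manual <- matrix(
  c(0, 22, 23, 5, 9, 7, 8,   # Cold
    1,  0,  0, 1, 0, 1, 0),  # Hot
  ncol = 2,
  dimnames = list(
    mfr  = c("American Home Food Prod", "General Mills", "Kellogs",
             "Nabisco", "Post", "Quaker Oats", "Ralston Purina"),
    type = c("Cold", "Hot")
  )
)

# Totals per type, then the share of each type among all cereals
type_totals <- colSums(type_cnt_manual)
type_share  <- prop.table(type_totals)

type_totals
round(type_share, 3)
```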

Question 3 : What is the distribution of weight in cereals with cup sizes greater than 0.7? What can you infer about the weight of cereals where the cup size is 0.75 compared with those where the cup size is 1?

The dataset is filtered to extract observations where the cup size is greater than 0.7.

##                                 name                     mfr type vitamins
## 1                  100% Natural Bran             Quaker Oats Cold        0
## 2                     Almond Delight          Ralston Purina Cold       25
## 3            Apple Cinnamon Cheerios           General Mills Cold       25
## 4                        Apple Jacks                 Kellogs Cold       25
## 5                            Basic 4           General Mills Cold       25
## 6                       Cap'n'Crunch             Quaker Oats Cold       25
## 7                           Cheerios           General Mills Cold       25
## 8              Cinnamon Toast Crunch           General Mills Cold       25
## 9                        Cocoa Puffs           General Mills Cold       25
## 10                         Corn Chex          Ralston Purina Cold       25
## 11                       Corn Flakes                 Kellogs Cold       25
## 12                         Corn Pops                 Kellogs Cold       25
## 13                     Count Chocula           General Mills Cold       25
## 14            Cream of Wheat (Quick)                 Nabisco  Hot        0
## 15                           Crispix                 Kellogs Cold       25
## 16            Crispy Wheat & Raisins           General Mills Cold       25
## 17                       Double Chex          Ralston Purina Cold       25
## 18                       Froot Loops                 Kellogs Cold       25
## 19                    Frosted Flakes                 Kellogs Cold       25
## 20               Frosted Mini-Wheats                 Kellogs Cold       25
## 21                    Fruity Pebbles                    Post Cold       25
## 22                      Golden Crisp                    Post Cold       25
## 23                    Golden Grahams           General Mills Cold       25
## 24                 Grape Nuts Flakes                    Post Cold       25
## 25                  Honey Graham Ohs             Quaker Oats Cold       25
## 26                Honey Nut Cheerios           General Mills Cold       25
## 27                        Honey-comb                    Post Cold       25
## 28       Just Right Crunchy  Nuggets                 Kellogs Cold      100
## 29            Just Right Fruit & Nut                 Kellogs Cold      100
## 30                               Kix           General Mills Cold       25
## 31                      Lucky Charms           General Mills Cold       25
## 32                             Maypo American Home Food Prod  Hot       25
## 33  Muesli Raisins; Dates; & Almonds          Ralston Purina Cold       25
## 34 Muesli Raisins; Peaches; & Pecans          Ralston Purina Cold       25
## 35              Multi-Grain Cheerios           General Mills Cold       25
## 36                 Nutri-grain Wheat                 Kellogs Cold       25
## 37                        Product 19                 Kellogs Cold      100
## 38                       Puffed Rice             Quaker Oats Cold        0
## 39                      Puffed Wheat             Quaker Oats Cold        0
## 40                       Raisin Bran                 Kellogs Cold       25
## 41                         Rice Chex          Ralston Purina Cold       25
## 42                     Rice Krispies                 Kellogs Cold       25
## 43                    Shredded Wheat                 Nabisco Cold        0
## 44                            Smacks                 Kellogs Cold       25
## 45                         Special K                 Kellogs Cold       25
## 46           Strawberry Fruit Wheats                 Nabisco Cold       25
## 47                 Total Corn Flakes           General Mills Cold      100
## 48                 Total Raisin Bran           General Mills Cold      100
## 49                 Total Whole Grain           General Mills Cold      100
## 50                           Triples           General Mills Cold       25
## 51                              Trix           General Mills Cold       25
## 52                          Wheaties           General Mills Cold       25
## 53               Wheaties Honey Gold           General Mills Cold       25
##    shelf carbo sugars potass protein fat sodium weight cups calories fiber
## 1      3   8.0      8    135       3   5     15   1.00 1.00      120   2.0
## 2      3  14.0      8      0       2   2    200   1.00 0.75      110   1.0
## 3      1  10.5     10     70       2   2    180   1.00 0.75      110   1.5
## 4      2  11.0     14     30       2   0    125   1.00 1.00      110   1.0
## 5      3  18.0      8    100       3   2    210   1.33 0.75      130   2.0
## 6      2  12.0     12     35       1   2    220   1.00 0.75      120   0.0
## 7      1  17.0      1    105       6   2    290   1.00 1.25      110   2.0
## 8      2  13.0      9     45       1   3    210   1.00 0.75      120   0.0
## 9      2  12.0     13     55       1   1    180   1.00 1.00      110   0.0
## 10     1  22.0      3     25       2   0    280   1.00 1.00      110   0.0
## 11     1  21.0      2     35       2   0    290   1.00 1.00      100   1.0
## 12     2  13.0     12     20       1   0     90   1.00 1.00      110   1.0
## 13     2  12.0     13     65       1   1    180   1.00 1.00      110   0.0
## 14     2  21.0      0      0       3   0     80   1.00 1.00      100   1.0
## 15     3  21.0      3     30       2   0    220   1.00 1.00      110   1.0
## 16     3  11.0     10    120       2   1    140   1.00 0.75      100   2.0
## 17     3  18.0      5     80       2   0    190   1.00 0.75      100   1.0
## 18     2  11.0     13     30       2   1    125   1.00 1.00      110   1.0
## 19     1  14.0     11     25       1   0    200   1.00 0.75      110   1.0
## 20     2  14.0      7    100       3   0      0   1.00 0.80      100   3.0
## 21     2  13.0     12     25       1   1    135   1.00 0.75      110   0.0
## 22     1  11.0     15     40       2   0     45   1.00 0.88      100   0.0
## 23     2  15.0      9     45       1   1    280   1.00 0.75      110   0.0
## 24     3  15.0      5     85       3   1    140   1.00 0.88      100   3.0
## 25     2  12.0     11     45       1   2    220   1.00 1.00      120   1.0
## 26     1  11.5     10     90       3   1    250   1.00 0.75      110   1.5
## 27     1  14.0     11     35       1   0    180   1.00 1.33      110   0.0
## 28     3  17.0      6     60       2   1    170   1.00 1.00      110   1.0
## 29     3  20.0      9     95       3   1    170   1.30 0.75      140   2.0
## 30     2  21.0      3     40       2   1    260   1.00 1.50      110   0.0
## 31     2  12.0     12     55       2   1    180   1.00 1.00      110   0.0
## 32     2  16.0      3     95       4   1      0   1.00 1.00      100   0.0
## 33     3  16.0     11    170       4   3     95   1.00 1.00      150   3.0
## 34     3  16.0     11    170       4   3    150   1.00 1.00      150   3.0
## 35     1  15.0      6     90       2   1    220   1.00 1.00      100   2.0
## 36     3  18.0      2     90       3   0    170   1.00 1.00       90   3.0
## 37     3  20.0      3     45       3   0    320   1.00 1.00      100   1.0
## 38     3  13.0      0     15       1   0      0   0.50 1.00       50   0.0
## 39     3  10.0      0     50       2   0      0   0.50 1.00       50   1.0
## 40     2  14.0     12    240       3   1    210   1.33 0.75      120   5.0
## 41     1  23.0      2     30       1   0    240   1.00 1.13      110   0.0
## 42     1  22.0      3     35       2   0    290   1.00 1.00      110   0.0
## 43     1  16.0      0     95       2   0      0   0.83 1.00       80   3.0
## 44     2   9.0     15     40       2   1     70   1.00 0.75      110   1.0
## 45     1  16.0      3     55       6   0    230   1.00 1.00      110   1.0
## 46     2  15.0      5     90       2   0     15   1.00 1.00       90   3.0
## 47     3  21.0      3     35       2   1    200   1.00 1.00      110   0.0
## 48     3  15.0     14    230       3   1    190   1.50 1.00      140   4.0
## 49     3  16.0      3    110       3   1    200   1.00 1.00      100   3.0
## 50     3  21.0      3     60       2   1    250   1.00 0.75      110   0.0
## 51     2  13.0     12     25       1   1    140   1.00 1.00      110   0.0
## 52     1  17.0      3    110       3   1    200   1.00 1.00      100   3.0
## 53     1  16.0      8     60       2   1    200   1.00 0.75      110   1.0
##      rating
## 1  33.98368
## 2  34.38484
## 3  29.50954
## 4  33.17409
## 5  37.03856
## 6  18.04285
## 7  50.76500
## 8  19.82357
## 9  22.73645
## 10 41.44502
## 11 45.86332
## 12 35.78279
## 13 22.39651
## 14 64.53382
## 15 46.89564
## 16 36.17620
## 17 44.33086
## 18 32.20758
## 19 31.43597
## 20 58.34514
## 21 28.02576
## 22 35.25244
## 23 23.80404
## 24 52.07690
## 25 21.87129
## 26 31.07222
## 27 28.74241
## 28 36.52368
## 29 36.47151
## 30 39.24111
## 31 26.73451
## 32 54.85092
## 33 37.13686
## 34 34.13976
## 35 40.10596
## 36 59.64284
## 37 41.50354
## 38 60.75611
## 39 63.00565
## 40 39.25920
## 41 41.99893
## 42 40.56016
## 43 68.23588
## 44 31.23005
## 45 53.13132
## 46 59.36399
## 47 38.83975
## 48 28.59278
## 49 46.65884
## 50 39.10617
## 51 27.75330
## 52 51.59219
## 53 36.18756

Density Plot for Comparison between Cereals of Cup Size 0.75 and Cereals of Cup Size 1

## Warning: Groups with fewer than two data points have been dropped.

For cereals with a cup size of 1, weight appears approximately normally distributed. Weight also appears approximately normally distributed where the cup size is 0.75. However, the peak of that density curve is lower than the peak for cup size 1. This implies that the weights of cereals with a cup size of 1 are more tightly concentrated around their mean than the weights of cereals with a cup size of 0.75.
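The filtering step and the comparison of the two cup sizes can be sketched in base R as below. The data frame `toy` is a hypothetical handful of (cups, weight) pairs in the same range as the filtered table, and summarising with `aggregate()` is an illustrative choice; the density plot itself is not reproduced here.

```r
# Hypothetical sample of cup sizes and weights
toy <- data.frame(
  cups   = c(0.33, 1.00, 0.75, 0.75, 1.25, 1.00, 1.00, 0.75, 1.00),
  weight = c(1.00, 1.00, 1.00, 1.33, 1.00, 0.50, 1.50, 1.30, 0.83)
)

# Keep only cereals whose cup size exceeds 0.7
big_cups <- subset(toy, cups > 0.7)

# Restrict to the two cup sizes being compared
two_sizes <- subset(big_cups, cups %in% c(0.75, 1))

# Count, mean and spread of weight for each cup size
weight_by_cups <- aggregate(
  weight ~ cups, data = two_sizes,
  FUN = function(x) c(n = length(x), mean = mean(x), sd = sd(x))
)
weight_by_cups
```

A smaller standard deviation for one group corresponds to a taller, narrower density peak for that group.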

Question 4 : Describe the shape and center of the fat distribution and the carbohydrates distribution in cereals. Are there similarities between fat and carbohydrates?

Histogram Showing the Distribution of Fat per Serving

Histogram Showing the Distribution of Carbohydrates per Serving

The distribution of fat in cereals is skewed to the right, that is, positively skewed. This kind of distribution has a large number of occurrences on the left side and few on the right side. The mean fat per serving is about 1 g, near the lower end of the 0 to 5 g range.

Carbohydrates, on the other hand, have a left-skewed distribution. The mean carbohydrates per serving falls between 10 and 15 g, with very few occurrences on the left. The only similarity between fat and carbohydrates observed here is that neither is normally distributed.
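The direction of skew read off the histograms can also be checked numerically with the moment-based skewness coefficient, which is positive for right-skewed data and negative for left-skewed data. The helper `skewness()` below is a hand-written sketch (base R has no built-in skewness function), and the two vectors are hypothetical values chosen only to show the sign convention.

```r
# Moment-based sample skewness: third central moment over variance^(3/2)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Hypothetical right-skewed values (many small, few large), like fat
right_skewed <- c(0, 0, 0, 1, 1, 1, 2, 2, 3, 5)

# Hypothetical left-skewed values (few small, many large), like carbo
left_skewed <- c(5, 12, 13, 14, 14, 15, 15, 16, 17)

skewness(right_skewed)  # positive
skewness(left_skewed)   # negative
```

With the real data, `skewness(cereal$fat)` and `skewness(cereal$carbo)` would be the corresponding calls.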

Question 5 : Can sugar content in cereals predict shelf placement? Provide your observation of the sugar distribution in cereals according to their location on shelves.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The mean amount of sugar for both shelf 1 and shelf 3 falls below 5 g per serving. In contrast, the mean amount of sugar on shelf 2 is about 10 to 13 g per serving. Products in a supermarket are usually placed according to customers' hand reach and eye level. Given that consumers range from children to adults and the elderly, we can suggest a reason for sugar-rich cereals being placed on the middle shelf (shelf 2). The average child of around 5 to 10 years would easily spot the cereals on the middle shelf. It is possible that manufacturers place their sugar-rich cereals on the middle shelf to attract children. In that case, we can say that the sugar content of a cereal is associated with its shelf placement.
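The per-shelf means quoted above come from the plot; they could be computed directly with a grouped mean. The sketch below uses `tapply()` on a hypothetical six-row data frame, so the numbers it prints are illustrative only and are not the report's results.

```r
# Hypothetical sugar values for two cereals on each shelf
toy <- data.frame(
  shelf  = factor(c(1, 1, 2, 2, 3, 3)),
  sugars = c(3, 6, 12, 13, 5, 8)
)

# Mean sugar per serving for each shelf
mean_sugar <- tapply(toy$sugars, toy$shelf, mean)
mean_sugar

# Shelf with the highest mean sugar content
names(which.max(mean_sugar))
```

With the real data, `tapply(cereal$sugars, cereal$shelf, mean)` would be the equivalent call.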

Question 6 : Do calories per serving and sugar per serving affect carbohydrates? Describe the relationship between carbohydrates, calories and sugar.

From the scatterplot above, we can see that as calories per serving increase, carbohydrates also increase (in other words, the higher the calories, the higher the carbohydrates). We can therefore assume that a positive linear relationship exists between carbohydrates and calories. The relationship between sugar and carbohydrates, however, seems to be very weak and non-linear.

Question 7 : Compare the relationship of calorie content and ratings to the relationship of fiber content and ratings.

Correlation between calories and rating

cor.test(cereal$calories,cereal$rating, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  cereal$calories and cereal$rating
## t = -8.2415, df = 75, p-value = 4.14e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7911906 -0.5503788
## sample estimates:
##       cor 
## -0.689376

Correlation between fiber and rating

cor.test(cereal$fiber,cereal$rating, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  cereal$fiber and cereal$rating
## t = 6.233, df = 75, p-value = 2.445e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4144017 0.7146365
## sample estimates:
##       cor 
## 0.5841604

Correlogram Matrix to visualize the strength of the relationships between calories vs rating and fiber vs rating.

The correlation between calories and rating is -0.689. This value is closer to -1 than to 0, which indicates a fairly strong negative linear relationship: as the calories in a cereal increase, the rating tends to decrease. For fiber and rating, the correlation is 0.584. This value is closer to 1 than to 0, which indicates a moderately strong positive linear relationship: as the fiber content of a cereal increases, the rating tends to increase. We can therefore infer that cereals with better health values, such as high fiber, are rated higher and those with excessive calories are rated lower.

Question 8 : What do you observe with respect to the calories of cereals by their shelf placement?

The boxplot above shows the distribution of rating by shelf. Separating the data by shelf placement, we can quickly see that there are outliers in rating on each shelf. The top shelf (shelf 3) has the highest median rating as well as the cereal with the highest rating overall, and its distribution is right skewed. The cereals with the lowest ratings are found on shelf 2, whose boxplot is roughly symmetric, suggesting an approximately normal distribution. When outliers are excluded, the highest-rated cereal is found on shelf 1.

Question 9 : Describe the relationship between non-nutrition variables and ratings. Can a cereal that is manufactured by Kellogs and placed on top shelf (shelf 3) positively affect ratings?

Construction of a new dataframe that gives counts and mean for ratings

## # A tibble: 13 x 5
## # Groups:   mfr [7]
##    mfr                     `shelf == 3` count average_rating stdev_rating
##    <fct>                   <lgl>        <int>          <dbl>        <dbl>
##  1 American Home Food Prod FALSE            1           54.9        NA   
##  2 General Mills           FALSE           13           32.4        10.5 
##  3 General Mills           TRUE             9           37.4         5.39
##  4 Kellogs                 FALSE           11           39.2         9.56
##  5 Kellogs                 TRUE            12           48.5        17.0 
##  6 Nabisco                 FALSE            5           67.9         6.16
##  7 Nabisco                 TRUE             1           68.4        NA   
##  8 Post                    FALSE            3           30.7         3.98
##  9 Post                    TRUE             6           47.2         6.76
## 10 Quaker Oats             FALSE            4           34.0        16.5 
## 11 Quaker Oats             TRUE             4           51.8        13.3 
## 12 Ralston Purina          FALSE            4           45.6         4.48
## 13 Ralston Purina          TRUE             4           37.5         4.75
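The code that built this summary is not shown. The tibble output suggests it was produced with dplyr, but an equivalent table can be sketched in base R with `aggregate()`, as below. The data frame `toy` and its values are hypothetical, and the column names simply mirror those in the output above.

```r
# Hypothetical ratings for two manufacturers across shelves
toy <- data.frame(
  mfr    = c("Post", "Post", "Post", "Kellogs", "Kellogs", "Kellogs"),
  shelf  = c(1, 3, 3, 2, 3, 3),
  rating = c(30, 45, 49, 40, 60, 36)
)

# Flag cereals placed on the top shelf
toy$top_shelf <- toy$shelf == 3

# Count, mean and standard deviation of rating per manufacturer and flag
stats <- aggregate(
  rating ~ mfr + top_shelf, data = toy,
  FUN = function(x) c(count = length(x),
                      average_rating = mean(x),
                      stdev_rating = sd(x))
)

# aggregate() returns the three statistics as one matrix column; flatten it
stats <- do.call(data.frame, stats)
stats
```

Groups with a single cereal get `NA` for the standard deviation, which matches the `NA` entries in the table above.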

Point Plot to visualize the association of manufacturer,shelf and rating.

ggplot(cereal, 
       aes(x=mfr, y=rating, color=shelf, shape=shelf)) + 
  geom_point(size=6) +
  scale_shape_manual(values=c(17:25)) 

Of the 77 cereals, 12 are made by a single manufacturer (Kellogs) and placed on shelf 3. Their average rating is only 48.5, with a very high standard deviation of 17, so there is no clear evidence of a relationship between manufacturer and rating. However, in the point plot above, most cereals on shelf 3 (the top shelf), shown as blue points, have relatively high ratings compared with those on the other shelves.

Question 10 : Is there an association between vitamins and manufacturer?

From the bar charts it can clearly be observed that most manufacturers produce cereals with a vitamins value of 25 (in this dataset the value is the typical percentage of the recommended daily amount per serving, not grams). The concentration of the top three manufacturers' cereals at this level may be attributable to national health standards for vitamins in cereals.

Modeling

Target variable : Rating

Simple Linear Regression

Model1

#Simple linear model
linear_mod1<-lm(rating~fiber, data = cereal)
linear_mod1
## 
## Call:
## lm(formula = rating ~ fiber, data = cereal)
## 
## Coefficients:
## (Intercept)        fiber  
##      35.257        3.443
summary(linear_mod1)#model summary
## 
## Call:
## lm(formula = rating ~ fiber, data = cereal)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.436  -8.159  -2.037   6.491  27.216 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  35.2566     1.7674  19.948  < 2e-16 ***
## fiber         3.4430     0.5524   6.233 2.45e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.48 on 75 degrees of freedom
## Multiple R-squared:  0.3412, Adjusted R-squared:  0.3325 
## F-statistic: 38.85 on 1 and 75 DF,  p-value: 2.445e-08

The simple linear model (linear_mod1) is fitted on the full cereal dataset and models rating as a linear function of fiber, the explanatory variable. linear_mod1 is statistically significant because the p-values of the overall regression (2.445e-08, i.e. about 0.00000002) and of the predictor fiber (2.45e-08) are both far below the standard significance level of 0.05. The R-squared of 0.34 means that fiber alone explains about a third of the variance in rating.

Model2

Uses the train data set
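The creation of the train and test sets is not shown in this report. Since Model 2 reports 55 residual degrees of freedom, the train set has 57 rows, leaving 20 of the 77 cereals for testing. A random split of that size could be sketched as below; the seed, the use of `sample()`, and the simulated data frame `sim` are all assumptions for illustration.

```r
set.seed(123)  # arbitrary seed, assumed here only for reproducibility

# Simulated stand-in for the cereal data (77 rows)
sim <- data.frame(fiber = runif(77, min = 0, max = 14))
sim$rating <- 35 + 3 * sim$fiber + rnorm(77, sd = 10)

# Draw 57 row indices at random for training; the rest form the test set
train_idx <- sample(seq_len(nrow(sim)), size = 57)
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Fit on the training rows only
fit <- lm(rating ~ fiber, data = train)
df.residual(fit)
```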

#Simple linear model
linear_mod2<-lm(rating~fiber, data = train)
linear_mod2
## 
## Call:
## lm(formula = rating ~ fiber, data = train)
## 
## Coefficients:
## (Intercept)        fiber  
##      35.772        3.166
summary(linear_mod2)
## 
## Call:
## lm(formula = rating ~ fiber, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.843  -7.987  -1.998   6.227  26.037 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  35.7719     2.0945  17.079  < 2e-16 ***
## fiber         3.1659     0.7902   4.007 0.000187 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.57 on 55 degrees of freedom
## Multiple R-squared:  0.2259, Adjusted R-squared:  0.2119 
## F-statistic: 16.05 on 1 and 55 DF,  p-value: 0.0001867

The simple linear model (linear_mod2) uses the train dataset to predict the response variable rating from fiber. Model 2 is statistically significant, as the p-values of the regression model and of the predictor (fiber) are below the standard significance level of 0.05. One difficulty with simple linear regression is that it only caters for one predictor variable at a time, which is not ideal because real-world datasets rarely contain only individual relationships. Often, variables are correlated with one or more other variables.

Multiple Linear Regression

Model3

##Multiple Linear Regression
linear_mod3<-lm(rating~sugars+calories, data =cereal)
linear_mod3
## 
## Call:
## lm(formula = rating ~ sugars + calories, data = cereal)
## 
## Coefficients:
## (Intercept)       sugars     calories  
##     84.0620      -1.7369      -0.2746
summary(linear_mod3)
## 
## Call:
## lm(formula = rating ~ sugars + calories, data = cereal)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.653  -6.074  -1.248   4.726  23.373 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 84.06199    5.43272  15.473  < 2e-16 ***
## sugars      -1.73692    0.25328  -6.858 1.81e-09 ***
## calories    -0.27460    0.05749  -4.776 8.81e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.064 on 74 degrees of freedom
## Multiple R-squared:  0.6791, Adjusted R-squared:  0.6705 
## F-statistic: 78.32 on 2 and 74 DF,  p-value: < 2.2e-16

The multiple linear regression model (linear_mod3) uses the cereal dataset to model the response variable rating using two explanatory variables, sugars and calories. It is a multiple linear regression because it takes more than one predictor variable. It is also statistically significant (model p-value < 2.2e-16; sugars 1.81e-09; calories 8.81e-06).

Model4

##Multiple Linear Regression
linear_mod4<-lm(rating~sugars+calories+protein, data =train)
linear_mod4
## 
## Call:
## lm(formula = rating ~ sugars + calories + protein, data = train)
## 
## Coefficients:
## (Intercept)       sugars     calories      protein  
##     75.6768      -1.0835      -0.3637       5.2131
summary(linear_mod4)
## 
## Call:
## lm(formula = rating ~ sugars + calories + protein, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.102  -3.459  -1.403   3.280  15.888 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 75.67683    5.27834  14.337  < 2e-16 ***
## sugars      -1.08355    0.22609  -4.793 1.37e-05 ***
## calories    -0.36368    0.05656  -6.430 3.72e-08 ***
## protein      5.21305    0.84779   6.149 1.05e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.98 on 53 degrees of freedom
## Multiple R-squared:  0.8007, Adjusted R-squared:  0.7894 
## F-statistic: 70.99 on 3 and 53 DF,  p-value: < 2.2e-16

Model Selection

For model comparison, the model with the lowest AIC and BIC score is preferred.

AIC(linear_mod1)
## [1] 598.3042
AIC(linear_mod2)
## [1] 444.8366
AIC(linear_mod3)
## [1] 544.9123
AIC(linear_mod4)
## [1] 371.4909
BIC(linear_mod1)
## [1] 605.3356
BIC(linear_mod2)
## [1] 450.9658
BIC(linear_mod3)
## [1] 554.2876
BIC(linear_mod4)
## [1] 381.7061

Selected Model : Model 4

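Model Selection - The model with the lowest AIC and BIC scores is linear_mod4 (Model 4), with AIC 371.4909 and BIC 381.7061. It also has the highest adjusted R-squared (0.7894), as it combines several predictors rather than relying on a single one. Note, however, that AIC and BIC are only directly comparable between models fitted to the same observations: Models 1 and 3 use the full cereal dataset (77 rows) while Models 2 and 4 use the train set (57 rows), so part of the gap reflects the smaller sample.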
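For reference, the scores returned by `AIC()` and `BIC()` for an `lm` fit are $2k - 2\ln L$ and $k\ln n - 2\ln L$, where $L$ is the maximised likelihood, $n$ the number of observations and $k$ the number of estimated parameters (the coefficients plus the residual variance). The sketch below checks this by hand and compares two candidate models on the same rows. The data frame `sim` is simulated and the model names are hypothetical.

```r
set.seed(1)  # arbitrary seed for the simulated data
n <- 57
sim <- data.frame(
  sugars  = runif(n, min = 0, max = 15),
  protein = sample(1:6, n, replace = TRUE)
)
sim$rating <- 60 - 1.5 * sim$sugars + 4 * sim$protein + rnorm(n, sd = 6)

# Two candidate models fitted to the same observations
m_small <- lm(rating ~ sugars, data = sim)
m_large <- lm(rating ~ sugars + protein, data = sim)

# Manual AIC and BIC for the larger model
k  <- length(coef(m_large)) + 1          # coefficients + residual variance
ll <- as.numeric(logLik(m_large))        # maximised log-likelihood
manual_aic <- 2 * k - 2 * ll
manual_bic <- log(n) * k - 2 * ll

all.equal(manual_aic, AIC(m_large))
all.equal(manual_bic, BIC(m_large))

# Side-by-side comparison; valid because both models share the same rows
AIC(m_small, m_large)
```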

Prediction

actuals_preds <- data.frame(cbind(actuals=test$rating, predicteds=pred_rating))  
head(actuals_preds)
##     actuals predicteds
## 2  33.98368   39.00660
## 3  59.42551   65.65405
## 4  93.70491   78.34528
## 11 18.04285   24.24632
## 22 46.89564   42.84803
## 24 44.33086   44.31769

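The table shows the first six observations of the test set. Five of these predictions fall within about 6.3 rating points of the actual value; the largest miss is the cereal rated 93.7, which the model predicts at 78.3.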

correlation_accuracy <- cor(actuals_preds) 
correlation_accuracy
##              actuals predicteds
## actuals    1.0000000  0.8648255
## predicteds 0.8648255  1.0000000

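Prediction : the correlation between actual and predicted ratings is about 0.86. This measures how closely predictions track the actual values; it is not a percentage of correct predictions.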
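The step that creates `pred_rating` is not shown above; presumably it is a `predict()` call on the selected model with the test set as new data. The sketch below illustrates that step on simulated data and adds two error measures, RMSE and MAE, which express the typical miss in rating points. The data frame `sim`, the seed, the split, and the coefficients used to simulate ratings are all assumptions.

```r
set.seed(42)  # arbitrary seed for the simulated data
n <- 77
sim <- data.frame(
  sugars   = runif(n, min = 0, max = 15),
  calories = runif(n, min = 50, max = 160),
  protein  = sample(1:6, n, replace = TRUE)
)
sim$rating <- 75 - 1.1 * sim$sugars - 0.36 * sim$calories +
  5.2 * sim$protein + rnorm(n, sd = 6)

# Hypothetical 57/20 split
train_idx <- sample(seq_len(n), size = 57)
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Fit on train, predict on unseen test rows
fit <- lm(rating ~ sugars + calories + protein, data = train)
pred_rating <- predict(fit, newdata = test)

actuals_preds <- data.frame(actuals = test$rating, predicteds = pred_rating)

# Error measures in rating points, plus the correlation used in the report
errors <- actuals_preds$actuals - actuals_preds$predicteds
rmse <- sqrt(mean(errors^2))
mae  <- mean(abs(errors))
r    <- cor(actuals_preds$actuals, actuals_preds$predicteds)

c(rmse = rmse, mae = mae, correlation = r)
```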
