About Document

The purpose of this document is to provide an overview of data analysis and visualization for the different types of cereals.

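Things to know:

Type: C = Cold; H = Hot

Manufacturer: A = American Home Food Products; G = General Mills; K = Kelloggs; N = Nabisco; P = Post; Q = Quaker Oats; R = Ralston Purina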

The data set used in this overview was taken from: https://www.kaggle.com/crawford/80-cereals/data

Data set (Cereal)

Name                       Manufacturer  Type  Calories  Protein  Fat  Sodium  Fibre  Carbohydrates  Sugar  Potassium  Vitamins  Shelf  Weight  Cups  Rating
100% Bran                  N             C     70        4        1    130     10.0   5.0            6      280        25        3      1       0.33  68.40297
100% Natural Bran          Q             C     120       3        5    15      2.0    8.0            8      135        0         3      1       1.00  33.98368
All-Bran                   K             C     70        4        1    260     9.0    7.0            5      320        25        3      1       0.33  59.42551
All-Bran with Extra Fiber  K             C     50        4        0    140     14.0   8.0            0      330        25        3      1       0.50  93.70491
Almond Delight             R             C     110       2        2    200     1.0    14.0           8      -1         25        3      1       0.75  34.38484
Apple Cinnamon Cheerios    G             C     110       2        2    180     1.5    10.5           10     70         25        1      1       0.75  29.50954

Summary of Data set (Cereal)

##                         Name    Manufacturer Type      Calories    
##  100% Bran                : 1   A: 1         C:74   Min.   : 50.0  
##  100% Natural Bran        : 1   G:22         H: 3   1st Qu.:100.0  
##  All-Bran                 : 1   K:23                Median :110.0  
##  All-Bran with Extra Fiber: 1   N: 6                Mean   :106.9  
##  Almond Delight           : 1   P: 9                3rd Qu.:110.0  
##  Apple Cinnamon Cheerios  : 1   Q: 8                Max.   :160.0  
##  (Other)                  :71   R: 8                               
##     Protein           Fat            Sodium          Fibre       
##  Min.   :1.000   Min.   :0.000   Min.   :  0.0   Min.   : 0.000  
##  1st Qu.:2.000   1st Qu.:0.000   1st Qu.:130.0   1st Qu.: 1.000  
##  Median :3.000   Median :1.000   Median :180.0   Median : 2.000  
##  Mean   :2.545   Mean   :1.013   Mean   :159.7   Mean   : 2.152  
##  3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:210.0   3rd Qu.: 3.000  
##  Max.   :6.000   Max.   :5.000   Max.   :320.0   Max.   :14.000  
##                                                                  
##  Carbohydrates      Sugar          Potassium         Vitamins     
##  Min.   :-1.0   Min.   :-1.000   Min.   : -1.00   Min.   :  0.00  
##  1st Qu.:12.0   1st Qu.: 3.000   1st Qu.: 40.00   1st Qu.: 25.00  
##  Median :14.0   Median : 7.000   Median : 90.00   Median : 25.00  
##  Mean   :14.6   Mean   : 6.922   Mean   : 96.08   Mean   : 28.25  
##  3rd Qu.:17.0   3rd Qu.:11.000   3rd Qu.:120.00   3rd Qu.: 25.00  
##  Max.   :23.0   Max.   :15.000   Max.   :330.00   Max.   :100.00  
##                                                                   
##      Shelf           Weight          Cups           Rating     
##  Min.   :1.000   Min.   :0.50   Min.   :0.250   Min.   :18.04  
##  1st Qu.:1.000   1st Qu.:1.00   1st Qu.:0.670   1st Qu.:33.17  
##  Median :2.000   Median :1.00   Median :0.750   Median :40.40  
##  Mean   :2.208   Mean   :1.03   Mean   :0.821   Mean   :42.67  
##  3rd Qu.:3.000   3rd Qu.:1.00   3rd Qu.:1.000   3rd Qu.:50.83  
##  Max.   :3.000   Max.   :1.50   Max.   :1.500   Max.   :93.70  
## 

Question 1: Which manufacturer has cereals with the most fat?

The histogram shows that cereals from Manufacturer K (Kelloggs) have the highest fat content.

Question 2: What types of cereal do the different manufacturers produce?

The histogram shows that Type C (Cold) cereals are manufactured the most. It also shows that Manufacturers N (Nabisco) and Q (Quaker Oats) produce both hot and cold cereals.

Question 3: Which type of cereal do people prefer?

The box plot compares the ratings of cereals by type. Hot cereals have a minimum rating of 51, Q1 of 53, median of 55, Q3 of 60 and maximum of 65, with a right skew. Cold cereals have a minimum rating of 18, Q1 of 33, median of 40, Q3 of 50 and maximum of 95 (including the one outlier), with a right skew. On these figures hot cereals are rated higher, with a median of 55 against 40 for cold cereals, although the data set contains only 3 hot cereals.
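The single outlier among the cold cereals can be checked against the usual 1.5 × IQR rule. The following Python sketch is not part of the original analysis, whose output appears to come from R. It assumes the box plot used the default Tukey fences and plugs in the cold-cereal quartiles quoted above. The function name `tukey_fences` is hypothetical.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) limits beyond which a box plot flags outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles and maximum for cold cereals, as read off the box plot in the text.
cold_q1, cold_q3, cold_max = 33, 50, 95

lower, upper = tukey_fences(cold_q1, cold_q3)
print(f"fences: {lower} to {upper}")
print("maximum lies beyond the upper fence:", cold_max > upper)
```

Under that assumption, any cold cereal rated above the upper fence is drawn as a separate point, which is consistent with the maximum being reported as an outlier.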

Question 4: How many calories can one get per serving?

The scatter plot shows the number of calories in one cup of cereal, by manufacturer.

Question 5: Which cereal is the unhealthiest?

The scatter plot shows no relationship between hot and cold cereals. It also shows that cold cereals have the highest fat and sugar content.

Question 6: Which type of cereal will give you more energy (protein)?

The histograms show that cold cereals from Manufacturer K (Kelloggs) provide the most energy (protein).

Question 7: Which manufacturer produces cereals with the most sodium?

The box plot compares the amount of potassium in the different types of cereal. Hot cereals have a minimum of 0, Q1 of 49, median of 98, Q3 of 101 and maximum of 110 potassium, with a left skew. Cold cereals have a minimum of 0, Q1 of 30, median of 80, Q3 of 110 and maximum of 330 potassium (including the 4 outliers), with a right skew.

Question 8: What is the average amount of carbohydrates?

The histogram shows that the average amount of carbohydrates one can get from eating cereal, hot or cold, is 14.5974026, or about 14.6.
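The summary reports a minimum of -1 for Carbohydrates, which suggests that -1 stands in for a missing measurement. That reading is an assumption and is not stated in this document. The Python sketch below is not from the original analysis. It shows how such a placeholder pulls a mean down, using only the Carbohydrates values of the twelve rows printed in this document, so its result differs from the full-data mean.

```python
from statistics import mean

# Carbohydrates values from the twelve rows printed in this document
# (six from the head of the data set, six from the prediction table).
carbs = [5.0, 8.0, 7.0, 8.0, 14.0, 10.5, 13, 15, 21, 14, 21, -1]

# Assumption: -1 marks a missing measurement rather than a real amount.
MISSING = -1
valid = [c for c in carbs if c != MISSING]

print("mean including -1:", round(mean(carbs), 3))
print("mean excluding -1:", round(mean(valid), 3))
```

If the placeholder reading is right, filtering such rows before averaging would give a slightly higher figure for the full data set as well.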

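Question 9: What is the total amount of fibre you can get from eating your cereal cold or hot?

According to the histogram, the total amount of fibre you can get from eating your cereal cold is 74 and hot is 3. These values equal the number of cold and hot cereals in the summary (C: 74, H: 3), so the histogram may be counting cereals of each type rather than summing their fibre.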

Question 10: Which manufacturer has cereals with the most vitamins?

The histogram shows that cereals from Manufacturer G (General Mills) are the richest in vitamins.

Train and Test
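The regression outputs that follow report 48 residual degrees of freedom for models with two coefficients, which implies 50 training rows out of the 77 cereals. How the rows were chosen is not shown. The Python sketch below is one way to produce a split of that size, not the original code. The random sampling, the seed and the function name `train_test_split` are all assumptions.

```python
import random

N_ROWS = 77    # cereals in the data set (74 cold + 3 hot in the summary)
N_TRAIN = 50   # implied by 48 residual degrees of freedom + 2 coefficients

def train_test_split(n_rows, n_train, seed=0):
    """Randomly split row indices 0..n_rows-1 into train and test index lists."""
    rng = random.Random(seed)  # arbitrary seed, used only for repeatability
    indices = list(range(n_rows))
    rng.shuffle(indices)
    return sorted(indices[:n_train]), sorted(indices[n_train:])

train_idx, test_idx = train_test_split(N_ROWS, N_TRAIN)
print(len(train_idx), "training rows,", len(test_idx), "test rows")
```

The models are then fitted on the training rows only, and the held-out rows are kept for checking predictions.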

Simple Linear Regression 1

## 
## Call:
## lm(formula = Rating ~ Fat, data = train)
## 
## Coefficients:
## (Intercept)          Fat  
##      47.725       -5.248

Summary for first Simple Linear Regression

## 
## Call:
## lm(formula = Rating ~ Fat, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.081  -7.102  -2.116   7.976  25.926 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   47.725      2.560  18.643   <2e-16 ***
## Fat           -5.248      1.963  -2.673   0.0102 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.94 on 48 degrees of freedom
## Multiple R-squared:  0.1295, Adjusted R-squared:  0.1114 
## F-statistic: 7.143 on 1 and 48 DF,  p-value: 0.01025

Correlation

## [1] -0.3599192

Anova

## Analysis of Variance Table
## 
## Response: Rating
##           Df Sum Sq Mean Sq F value  Pr(>F)  
## Fat        1 1018.3 1018.35  7.1434 0.01025 *
## Residuals 48 6842.8  142.56                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
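The figures reported for this model are tied together by simple identities. R-squared is the model sum of squares divided by the total sum of squares, and F is the ratio of the two mean squares. With a single predictor, the correlation is the square root of R-squared with the sign of the slope. The Python sketch below is not from the original analysis. It recomputes all three from the ANOVA table for Rating ~ Fat, and small differences from the printed values come from rounding in the table.

```python
import math

# Values copied from the ANOVA table and coefficient table for Rating ~ Fat.
ss_model, ss_resid = 1018.3, 6842.8
df_model, df_resid = 1, 48
slope = -5.248

r_squared = ss_model / (ss_model + ss_resid)
f_stat = (ss_model / df_model) / (ss_resid / df_resid)
# With one predictor, the correlation has the sign of the slope and a
# magnitude equal to the square root of R-squared.
r = math.copysign(math.sqrt(r_squared), slope)

print(f"R-squared = {r_squared:.4f}")
print(f"F         = {f_stat:.3f}")
print(f"r         = {r:.4f}")
```

The same check applies to the Sugar and Calories models by substituting their sums of squares and slopes.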

AIC (Akaike’s information criterion)

## [1] 393.8402

BIC (Bayesian information criterion)

## [1] 399.5763
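Both criteria can be recomputed from the residual sum of squares and the number of training rows. The Python sketch below is not the original code. It assumes the Gaussian log-likelihood of a least-squares fit and counts the residual variance as one extra estimated parameter, which is the convention that matches the values printed above. The function name `gaussian_aic_bic` is hypothetical.

```python
import math

def gaussian_aic_bic(rss, n, n_coef):
    """AIC and BIC of a least-squares fit from its residual sum of squares.

    The residual variance is counted as one extra estimated parameter.
    """
    k = n_coef + 1
    log_lik = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    aic = -2 * log_lik + 2 * k
    bic = -2 * log_lik + k * math.log(n)
    return aic, bic

# Rating ~ Fat: residual sum of squares 6842.8 on 48 df, so n = 50 rows.
aic, bic = gaussian_aic_bic(rss=6842.8, n=50, n_coef=2)
print(f"AIC = {aic:.2f}, BIC = {bic:.2f}")
```

BIC applies a larger penalty per parameter than AIC once the sample has more than about seven rows, which is why it is the larger of the two here.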

Simple Linear Regression 2

## 
## Call:
## lm(formula = Rating ~ Sugar, data = train)
## 
## Coefficients:
## (Intercept)        Sugar  
##      58.616       -2.324

Summary for second Simple Linear Regression

## 
## Call:
## lm(formula = Rating ~ Sugar, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.8051  -5.3921  -0.7764   4.7406  23.7296 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  58.6163     2.1726  26.980  < 2e-16 ***
## Sugar        -2.3238     0.2688  -8.646 2.37e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.002 on 48 degrees of freedom
## Multiple R-squared:  0.609,  Adjusted R-squared:  0.6008 
## F-statistic: 74.75 on 1 and 48 DF,  p-value: 2.367e-11

Correlation

## [1] -0.7803697

Anova

## Analysis of Variance Table
## 
## Response: Rating
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## Sugar      1 4787.2  4787.2  74.755 2.367e-11 ***
## Residuals 48 3073.9    64.0                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

AIC (Akaike’s information criterion)

## [1] 353.8275

BIC (Bayesian information criterion)

## [1] 359.5636

Simple Linear Regression 3

## 
## Call:
## lm(formula = Rating ~ Calories, data = train)
## 
## Coefficients:
## (Intercept)     Calories  
##     86.5206      -0.4161

Summary for third Simple Linear Regression

## 
## Call:
## lm(formula = Rating ~ Calories, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.3546  -5.1485  -0.0718   6.5752  23.7289 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 86.52055    6.95497  12.440  < 2e-16 ***
## Calories    -0.41609    0.06465  -6.436  5.4e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.376 on 48 degrees of freedom
## Multiple R-squared:  0.4632, Adjusted R-squared:  0.452 
## F-statistic: 41.42 on 1 and 48 DF,  p-value: 5.399e-08

Correlation

## [1] -0.6805819

Anova

## Analysis of Variance Table
## 
## Response: Rating
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## Calories   1 3641.2  3641.2  41.417 5.399e-08 ***
## Residuals 48 4219.9    87.9                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

AIC (Akaike’s information criterion)

## [1] 369.6713

BIC (Bayesian information criterion)

## [1] 375.4073

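Comparison

The models have the following ANOVA p-values (the lower the value, the stronger the evidence of a relationship with Rating): Model 1 (Fat) = 0.01025; Model 2 (Sugar) = 2.367e-11; Model 3 (Calories) = 5.399e-08.

Their R-squared values are (the higher the R-squared value, the better the fit of the model): Model 1 (Fat) = 0.1295; Model 2 (Sugar) = 0.609; Model 3 (Calories) = 0.4632.

Correlation (the closer to 1 or -1, the stronger the correlation): Model 1 (Fat) = -0.3599192; Model 2 (Sugar) = -0.7803697; Model 3 (Calories) = -0.6805819.

AIC (the model with the lowest AIC score is preferred): Model 1 (Fat) = 393.8402; Model 2 (Sugar) = 353.8275; Model 3 (Calories) = 369.6713.

BIC (the model with the lowest BIC score is preferred): Model 1 (Fat) = 399.5763; Model 2 (Sugar) = 359.5636; Model 3 (Calories) = 375.4073.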
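These criteria can be applied mechanically to the three fitted models. The Python sketch below is not from the original analysis. It collects the R-squared, AIC and BIC values printed in the model outputs earlier in this document and picks the preferred model under each criterion. The labels such as "Model 2 (Sugar)" are naming choices made for this example.

```python
# Statistics copied from the three model outputs in this document.
models = {
    "Model 1 (Fat)":      {"r2": 0.1295, "aic": 393.8402, "bic": 399.5763},
    "Model 2 (Sugar)":    {"r2": 0.6090, "aic": 353.8275, "bic": 359.5636},
    "Model 3 (Calories)": {"r2": 0.4632, "aic": 369.6713, "bic": 375.4073},
}

# Higher R-squared is better; lower AIC and BIC are better.
best_by_r2 = max(models, key=lambda name: models[name]["r2"])
best_by_aic = min(models, key=lambda name: models[name]["aic"])
best_by_bic = min(models, key=lambda name: models[name]["bic"])

print("highest R-squared:", best_by_r2)
print("lowest AIC:       ", best_by_aic)
print("lowest BIC:       ", best_by_bic)
```

Comparing AIC and BIC across these models is reasonable because all three were fitted to the same response on the same training rows.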

Prediction

Row  actuals.Name           actuals.Manufacturer  actuals.Type  actuals.Calories  actuals.Protein  actuals.Fat  actuals.Sodium  actuals.Fibre  actuals.Carbohydrates  actuals.Sugar  actuals.Potassium  actuals.Vitamins  actuals.Shelf  actuals.Weight  actuals.Cups  actuals.Rating  predicteds
13   Cinnamon Toast Crunch  G                     C             120               1                3            210             0.0            13                     9              45                 25                2              1               0.75          19.82357        37.70189
33   Grape Nuts Flakes      P                     C             100               3                1            140             3.0            15                     5              85                 25                3              1               0.88          52.07690        46.99720
22   Crispix                K                     C             110               2                0            220             1.0            21                     3              30                 25                3              1               1.00          46.89564        51.64485
26   Frosted Flakes         K                     C             110               1                0            200             1.0            14                     11             25                 25                1              1               0.75          31.43597        33.05424
73   Triples                G                     C             110               2                1            250             0.0            21                     3              60                 25                3              1               0.75          39.10617        51.64485
58   Quaker Oatmeal         Q                     H             100               5                2            0               2.7            -1                     -1             110                0                 1              1               0.67          50.82839        60.94015
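The predicteds column is consistent with the second regression (Rating ~ Sugar) applied to each row's Sugar value. The Python sketch below is not the original code. It reproduces those predictions from the printed coefficients and adds a root-mean-square error over the six rows shown, a metric that does not appear in the original analysis. The Sugar value of -1 for Quaker Oatmeal is presumably a missing-value code, so its prediction should be read with caution.

```python
import math

INTERCEPT, SLOPE = 58.6163, -2.3238  # printed coefficients of Rating ~ Sugar

# (name, sugar, actual rating) for the six rows shown in the table above.
rows = [
    ("Cinnamon Toast Crunch", 9, 19.82357),
    ("Grape Nuts Flakes", 5, 52.07690),
    ("Crispix", 3, 46.89564),
    ("Frosted Flakes", 11, 31.43597),
    ("Triples", 3, 39.10617),
    ("Quaker Oatmeal", -1, 50.82839),  # -1 is presumably a missing-value code
]

def predict_rating(sugar):
    """Predicted Rating from the fitted line of the second regression."""
    return INTERCEPT + SLOPE * sugar

predictions = [predict_rating(sugar) for _, sugar, _ in rows]
errors = [pred - actual for pred, (_, _, actual) in zip(predictions, rows)]
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

for (name, _, actual), pred in zip(rows, predictions):
    print(f"{name:<22} actual {actual:8.3f}  predicted {pred:8.3f}")
print(f"RMSE over these six rows: {rmse:.3f}")
```

Six rows are too few to judge the model, so the error figure only illustrates how predictions and actual ratings can be compared on held-out data.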

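From the comparisons, Model 2 (Rating ~ Sugar) is the best fit and the most accurate model of the data set: it has the strongest correlation (-0.78), the highest R-squared value (0.609) and the lowest AIC and BIC. Because the Sugar coefficient is negative (-2.324), the predicted Rating goes down as Sugar increases. The Calories coefficient in Model 3 is also negative (-0.4161), so the Rating also goes down when there are more calories.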