Homework Four: Univariate Statistics using the Cereal Dataset

Introduction

The cereal dataset is used for this assignment. First, the data must be in tidy format. This means each row corresponds to one observation (here, one cereal) and each column corresponds to one variable.

Importing and Tidying the data

First, the cereal dataset is imported for cleaning and future exploration.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(readr)
library(here)

## here() starts at /Users/chester/Desktop

# here() already starts at ~/Desktop, so a path relative to that root is enough
cereal <- read_csv(here("cereal.csv"))

## Rows: 20 Columns: 4

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Cereal, Type
## dbl (2): Sodium, Sugar

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Descriptions of the variables

The data come from the course data set cereal. To gain some insight into how the data are structured, the head(), colnames(), and summary() functions are used. The cereal data frame shows that each row represents a cereal brand, and two quantitative variables (meaning they contain numerical data), Sodium and Sugar, give the amount of each ingredient. There is also a mystery column called Type, which is a character variable. At this point I have no idea what it represents. I thought it might categorize the cereals by manufacturer; however, when I researched the cereal brand manufacturers, this was not the case.

head(cereal)

## # A tibble: 6 × 4
##   Cereal              Sodium Sugar Type 
##   <chr>                <dbl> <dbl> <chr>
## 1 Frosted Mini Wheats      0    11 A    
## 2 Raisin Bran            340    18 A    
## 3 All Bran                70     5 A    
## 4 Apple Jacks            140    14 C    
## 5 Captain Crunch         200    12 C    
## 6 Cheerios               180     1 C

colnames(cereal)

## [1] "Cereal" "Sodium" "Sugar"  "Type"

summary(cereal)

##     Cereal              Sodium          Sugar           Type          
##  Length:20          Min.   :  0.0   Min.   : 0.00   Length:20         
##  Class :character   1st Qu.:137.5   1st Qu.: 4.00   Class :character  
##  Mode  :character   Median :180.0   Median : 9.50   Mode  :character  
##                     Mean   :167.0   Mean   : 8.75                     
##                     3rd Qu.:202.5   3rd Qu.:12.50                     
##                     Max.   :340.0   Max.   :18.00
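One quick way to probe the unexplained Type column is to count how many cereals carry each code and to compare the numeric columns across codes. The sketch below is only an illustration: it rebuilds the six rows printed by head() above as a stand-in data frame (`cereal_head` is a hypothetical name), so its results cover those six rows, not the full 20.

```r
# Stand-in for the first six rows shown by head(cereal) above.
cereal_head <- data.frame(
  Cereal = c("Frosted Mini Wheats", "Raisin Bran", "All Bran",
             "Apple Jacks", "Captain Crunch", "Cheerios"),
  Sodium = c(0, 340, 70, 140, 200, 180),
  Sugar  = c(11, 18, 5, 14, 12, 1),
  Type   = c("A", "A", "A", "C", "C", "C"),
  stringsAsFactors = FALSE
)

# Count the cereals that carry each Type code.
type_counts <- table(cereal_head$Type)
print(type_counts)

# Mean sodium and sugar within each Type code.
type_means <- aggregate(cbind(Sodium, Sugar) ~ Type,
                        data = cereal_head, FUN = mean)
print(type_means)
```

On the full data, the same calls would be table(cereal$Type) and aggregate(cbind(Sodium, Sugar) ~ Type, data = cereal, FUN = mean).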

For easier readability I display the data frame as an interactive table, using the datatable() function from the DT package.

library(DT)

datatable(data = cereal,
          rownames = FALSE,
          filter ="top",
          options = list(autoWidth = TRUE))

Recoding of variables of interest

Here, the rename() function from dplyr (loaded as part of the tidyverse) is used to change the variables Sodium and Sugar to Sodium_milligrams and Sugar_grams, respectively, to better represent the quantity of each ingredient per serving.

library(tidyverse)

cereal_data <- cereal %>% 
  select(Cereal, Sodium, Sugar, Type) %>% 
  rename(Sodium_milligrams = Sodium,
         Sugar_grams = Sugar)
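For comparison, the same renaming can be sketched in base R without dplyr. This is an illustrative alternative, not the approach used in the assignment; `cereal_small` and `new_names` are hypothetical names, and the data frame holds only the first three rows shown by head() earlier.

```r
# Stand-in for the first three rows shown by head(cereal).
cereal_small <- data.frame(
  Cereal = c("Frosted Mini Wheats", "Raisin Bran", "All Bran"),
  Sodium = c(0, 340, 70),
  Sugar  = c(11, 18, 5),
  Type   = c("A", "A", "A"),
  stringsAsFactors = FALSE
)

# Map each old column name to its new, unit-bearing name.
new_names <- c(Sodium = "Sodium_milligrams", Sugar = "Sugar_grams")

# Replace only the names that appear in the mapping; leave the rest as-is.
to_rename <- names(cereal_small) %in% names(new_names)
names(cereal_small)[to_rename] <-
  unname(new_names[names(cereal_small)[to_rename]])

print(names(cereal_small))
```

Keeping the old-to-new mapping in one named vector makes it easy to check that only the intended columns change.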

Once the sodium and sugar variables have been renamed, I take a more granular look at the data set using the dim(), names(), str(), and attributes() functions.

dim(cereal_data)

## [1] 20  4

names(cereal_data)

## [1] "Cereal"            "Sodium_milligrams" "Sugar_grams"      
## [4] "Type"

str(cereal_data)

## tibble [20 × 4] (S3: tbl_df/tbl/data.frame)
##  $ Cereal           : chr [1:20] "Frosted Mini Wheats" "Raisin Bran" "All Bran" "Apple Jacks" ...
##  $ Sodium_milligrams: num [1:20] 0 340 70 140 200 180 210 150 100 130 ...
##  $ Sugar_grams      : num [1:20] 11 18 5 14 12 1 10 16 0 12 ...
##  $ Type             : chr [1:20] "A" "A" "A" "C" ...

attributes(cereal_data)

## $names
## [1] "Cereal"            "Sodium_milligrams" "Sugar_grams"      
## [4] "Type"             
## 
## $row.names
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
## 
## $class
## [1] "tbl_df"     "tbl"        "data.frame"

Renamed Data Frame cereal_data in table format for easier readability

library(DT)

datatable(data = cereal_data,
          rownames = FALSE,
          filter ="top",
          options = list(autoWidth = TRUE))

Variables

The variables used for this assignment are the renamed, more descriptive variables Sodium_milligrams and Sugar_grams.

Descriptive Statistics

First, I start with the skim() function from the skimr package to gain a comprehensive summary of the data.

skim() provides a summary of the data set and separates the variables by type, which in this instance are character and numeric. The analysis here focuses on the numeric variables.

library(skimr)
skim(cereal_data)

Data summary
Name	cereal_data
Number of rows	20
Number of columns	4
_______________________
Column type frequency:
character	2
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
Cereal	0	1	4	21	0	20	0
Type	0	1	1	1	0	2	0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Sodium_milligrams	0	1	167.00	77.26	0	137.5	180.0	202.5	340	▂▂▇▂▂
Sugar_grams	0	1	8.75	5.32	0	4.0	9.5	12.5	18	▅▇▂▇▃

Let’s first take a look at the Sodium_milligrams variable.

It has a mean of 167 mg per serving across all cereal brands and a standard deviation of 77.26 mg.

Next is the Sugar_grams variable.

It has a mean of 8.75 grams of sugar per serving across all cereal brands and a standard deviation of 5.32 grams.
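The mean and standard deviation reported by skim() can be reproduced with base R. As a sketch, the block below uses only the first ten values of each column as printed by str() earlier (the remaining ten are elided there), so its results will differ from the full-data figures. The vector names are hypothetical.

```r
# First ten values of each numeric column, copied from the str() output.
sodium_first10 <- c(0, 340, 70, 140, 200, 180, 210, 150, 100, 130)
sugar_first10  <- c(11, 18, 5, 14, 12, 1, 10, 16, 0, 12)

# Sample mean: the sum of the values divided by their count.
sodium_mean <- sum(sodium_first10) / length(sodium_first10)

# Sample standard deviation: square root of the sum of squared
# deviations divided by n - 1 (the same definition sd() uses).
sodium_sd <- sqrt(sum((sodium_first10 - sodium_mean)^2) /
                    (length(sodium_first10) - 1))

# The built-in functions agree with the hand computation.
stopifnot(isTRUE(all.equal(sodium_mean, mean(sodium_first10))))
stopifnot(isTRUE(all.equal(sodium_sd, sd(sodium_first10))))

sugar_mean <- mean(sugar_first10)
sugar_sd   <- sd(sugar_first10)

print(c(sodium_mean = sodium_mean, sodium_sd = sodium_sd,
        sugar_mean = sugar_mean, sugar_sd = sugar_sd))
```

On the full data, mean(cereal_data$Sodium_milligrams) and sd(cereal_data$Sodium_milligrams) should match the 167.00 and 77.26 shown in the skim() output.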

Here I load the psych package to gain insight into some of the statistics of the dataset. Note that the summary() call below is a base R function; the psych package's own descriptive function is describe(). What I like about this output compared to other tools used for descriptive statistics is its readability. Each variable is clearly depicted. For example, the Sodium_milligrams column clearly shows the quartiles, the median, and the mean. I am not sure how the output will show up on RPubs. After knitting it looked very readable; however, RPubs may be another story.

library(psych)

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

summary(cereal_data)

##     Cereal          Sodium_milligrams  Sugar_grams        Type          
##  Length:20          Min.   :  0.0     Min.   : 0.00   Length:20         
##  Class :character   1st Qu.:137.5     1st Qu.: 4.00   Class :character  
##  Mode  :character   Median :180.0     Median : 9.50   Mode  :character  
##                     Mean   :167.0     Mean   : 8.75                     
##                     3rd Qu.:202.5     3rd Qu.:12.50                     
##                     Max.   :340.0     Max.   :18.00
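The output above comes from base R's summary(). For comparison, the psych package's own descriptive function is describe(). The sketch below applies it to a stand-in data frame built from the first ten values shown by str() earlier (`cereal_first10` is a hypothetical name), and it is guarded so that it is skipped when psych is not installed.

```r
# Stand-in for the numeric columns: first ten values from the str() output.
cereal_first10 <- data.frame(
  Sodium_milligrams = c(0, 340, 70, 140, 200, 180, 210, 150, 100, 130),
  Sugar_grams       = c(11, 18, 5, 14, 12, 1, 10, 16, 0, 12)
)

if (requireNamespace("psych", quietly = TRUE)) {
  # describe() returns one row per variable, with columns such as
  # n, mean, sd, median, min, max, skew, kurtosis and se.
  psych_stats <- as.data.frame(psych::describe(cereal_first10))
  print(psych_stats[, c("n", "mean", "sd", "median", "min", "max")])
} else {
  message("psych is not installed; skipping describe()")
}
```

On the full data the call would be psych::describe(cereal_data[, c("Sodium_milligrams", "Sugar_grams")]), which restricts the input to the numeric columns.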

Summary table of Sugar_grams content of each cereal brand represented in grams per serving.

library(tidyverse)

cereal_data %>% 
  select(Cereal, Sugar_grams) %>% 
  datatable(rownames = FALSE, filter = "top",
            colnames = c("Cereal", "Sugar_grams (per serving)"))

Summary table of sodium content of each cereal brand represented in milligrams per serving.

library(tidyverse)

cereal_data %>% 
  select(Cereal, Sodium_milligrams) %>% 
  datatable(rownames = FALSE, filter = "top",
            colnames = c("Cereal", "Sodium_milligrams (per serving)"))

Visualization of the Cereal Dataset using a scatterplot

I use a scatterplot, built with the ggplot2 package and colored by cereal, to visually represent the levels of sodium and sugar in each brand of cereal. Visually, I was able to quickly gain insight into the cereals that had the highest levels of sodium and sugar.

The scatterplot was particularly useful in identifying outliers in the data. For example, Raisin Bran has the highest sodium (340 mg) and sugar (18 g) values in the data set, whereas Special K is lower in sodium and sugar.

library("ggplot2")

ggplot(data = cereal_data) +
  geom_point(mapping = aes(x = Sugar_grams,
                           y = Sodium_milligrams,
                           color = Cereal)) +
  xlab("Sugar (g per serving)") +
  ylab("Sodium (mg per serving)") +
  ggtitle("Scatterplot of Cereal Dataset - Sodium vs. Sugar")
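To complement the visual check for outliers, a numeric rule can be applied. The sketch below uses the common 1.5 × IQR rule, which is an assumption of this example and not a method used in the assignment, together with the quartiles and extremes reported by summary() earlier. The helper `iqr_fences` is a hypothetical name.

```r
# Return the lower and upper fences for the k * IQR outlier rule.
iqr_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

# Quartiles copied from the summary(cereal_data) output.
sodium_fences <- iqr_fences(q1 = 137.5, q3 = 202.5)
sugar_fences  <- iqr_fences(q1 = 4.0, q3 = 12.5)

# Reported minimum and maximum of each variable.
sodium_range <- c(min = 0, max = 340)
sugar_range  <- c(min = 0, max = 18)

# TRUE where an extreme value falls outside the fences.
sodium_flags <- c(sodium_range["min"] < sodium_fences["lower"],
                  sodium_range["max"] > sodium_fences["upper"])
sugar_flags  <- c(sugar_range["min"] < sugar_fences["lower"],
                  sugar_range["max"] > sugar_fences["upper"])

print(sodium_fences)
print(sodium_flags)
print(sugar_fences)
print(sugar_flags)
```

Under this rule the sodium fences are 40 and 300, so both sodium extremes (0 mg and 340 mg) are flagged, while the sugar fences are -8.75 and 25.25, so no sugar value is flagged.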