The cereal dataset is used for this assignment. Before any analysis, the data must be in tidy format: each row corresponds to one observation, and each column corresponds to one variable.
First, the cereal dataset is imported for cleaning and future exploration.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(readr)
library(here)
## here() starts at /Users/chester/Desktop
cereal <- read_csv(here("cereal.csv"))
## Rows: 20 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Cereal, Type
## dbl (2): Sodium, Sugar
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
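The message above suggests specifying the column types explicitly. A minimal sketch of doing so follows, assuming the column specification that readr printed (Cereal and Type as character, Sodium and Sugar as double). To stay self-contained it reads literal CSV text wrapped in I() rather than the real file, and that text holds only the six rows that head(cereal) prints; csv_text and cereal_head are hypothetical names used for illustration.

```r
library(readr)

# Literal CSV text standing in for cereal.csv.
# Assumption: only the six rows printed by head(cereal) are included here.
csv_text <- I("Cereal,Sodium,Sugar,Type
Frosted Mini Wheats,0,11,A
Raisin Bran,340,18,A
All Bran,70,5,A
Apple Jacks,140,14,C
Captain Crunch,200,12,C
Cheerios,180,1,C
")

# Declaring the column types up front removes the need for readr to guess them
# and quiets the column specification message.
cereal_head <- read_csv(
  csv_text,
  col_types = cols(
    Cereal = col_character(),
    Sodium = col_double(),
    Sugar  = col_double(),
    Type   = col_character()
  )
)
```

With the real file, the same col_types argument could be passed alongside the file path.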
The data come from the course data set cereal. To gain some insight into its structure, the head(), colnames(), and summary() functions are used. The cereal data frame shows that each row represents a cereal brand, and the two quantitative variables (meaning they contain numerical data) specify two of the cereal’s ingredients, sodium and sugar. There is also a mystery column called Type, which is a character variable. At this point I have no idea what it represents. I thought it might categorize the cereals by manufacturer; however, when I researched the cereal brand manufacturers, this was not the case.
head(cereal)
## # A tibble: 6 × 4
## Cereal Sodium Sugar Type
## <chr> <dbl> <dbl> <chr>
## 1 Frosted Mini Wheats 0 11 A
## 2 Raisin Bran 340 18 A
## 3 All Bran 70 5 A
## 4 Apple Jacks 140 14 C
## 5 Captain Crunch 200 12 C
## 6 Cheerios 180 1 C
colnames(cereal)
## [1] "Cereal" "Sodium" "Sugar" "Type"
summary(cereal)
## Cereal Sodium Sugar Type
## Length:20 Min. : 0.0 Min. : 0.00 Length:20
## Class :character 1st Qu.:137.5 1st Qu.: 4.00 Class :character
## Mode :character Median :180.0 Median : 9.50 Mode :character
## Mean :167.0 Mean : 8.75
## 3rd Qu.:202.5 3rd Qu.:12.50
## Max. :340.0 Max. :18.00
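One way to start investigating the Type column is to count the cereals in each category and compare the group averages. The sketch below is an assumption about how that could be done, not part of the original analysis. It uses only the six rows printed by head(cereal), so its numbers will differ from the full 20-row data; cereal_head and type_summary are hypothetical names.

```r
library(dplyr)

# Assumption: only the six rows shown by head(cereal), typed in by hand.
cereal_head <- tibble(
  Cereal = c("Frosted Mini Wheats", "Raisin Bran", "All Bran",
             "Apple Jacks", "Captain Crunch", "Cheerios"),
  Sodium = c(0, 340, 70, 140, 200, 180),
  Sugar  = c(11, 18, 5, 14, 12, 1),
  Type   = c("A", "A", "A", "C", "C", "C")
)

# Count the cereals in each Type and average the two numeric variables per group.
type_summary <- cereal_head %>%
  group_by(Type) %>%
  summarise(
    n           = n(),
    mean_sodium = mean(Sodium),
    mean_sugar  = mean(Sugar),
    .groups     = "drop"
  )

type_summary
```

Running the same pipeline on the full cereal data frame would show whether the two categories differ systematically in sodium or sugar.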
For easier readability, I display the data frame as an interactive table, using the datatable() function from the DT package.
library(DT)
datatable(data = cereal,
          rownames = FALSE,
          filter = "top",
          options = list(autoWidth = TRUE))
Here, the rename() function from dplyr (part of the tidyverse) is used to change the variables Sodium and Sugar to Sodium_milligrams and Sugar_grams, respectively, to better represent the units of each ingredient per serving.
library(tidyverse)
cereal_data <- cereal %>%
  rename(Sodium_milligrams = Sodium,
         Sugar_grams = Sugar)
Once the sodium and sugar variables have been renamed, I take a more granular look at the data set using the dim(), names(), str(), and attributes() functions.
dim(cereal_data)
## [1] 20 4
names(cereal_data)
## [1] "Cereal" "Sodium_milligrams" "Sugar_grams"
## [4] "Type"
str(cereal_data)
## tibble [20 × 4] (S3: tbl_df/tbl/data.frame)
## $ Cereal : chr [1:20] "Frosted Mini Wheats" "Raisin Bran" "All Bran" "Apple Jacks" ...
## $ Sodium_milligrams: num [1:20] 0 340 70 140 200 180 210 150 100 130 ...
## $ Sugar_grams : num [1:20] 11 18 5 14 12 1 10 16 0 12 ...
## $ Type : chr [1:20] "A" "A" "A" "C" ...
attributes(cereal_data)
## $names
## [1] "Cereal" "Sodium_milligrams" "Sugar_grams"
## [4] "Type"
##
## $row.names
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
##
## $class
## [1] "tbl_df" "tbl" "data.frame"
The renamed data frame cereal_data, shown in table format for easier readability:
library(DT)
datatable(data = cereal_data,
          rownames = FALSE,
          filter = "top",
          options = list(autoWidth = TRUE))
The variables used for this assignment are the renamed variables Sodium_milligrams and Sugar_grams, whose names are more descriptive.
First, I start with the skim() function from the skimr package to gain a comprehensive summary of the data.
skim() provides a summary of the data set and separates the variables by type, which in this instance are character and numeric. The variable types we are focusing on are numeric.
library(skimr)
skim(cereal_data)
| Name | cereal_data |
| Number of rows | 20 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Cereal | 0 | 1 | 4 | 21 | 0 | 20 | 0 |
| Type | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Sodium_milligrams | 0 | 1 | 167.00 | 77.26 | 0 | 137.5 | 180.0 | 202.5 | 340 | ▂▂▇▂▂ |
| Sugar_grams | 0 | 1 | 8.75 | 5.32 | 0 | 4.0 | 9.5 | 12.5 | 18 | ▅▇▂▇▃ |
Let’s first take a look at the Sodium_milligrams variable.
It has a mean of 167 mg per serving across all cereal brands and a standard deviation of 77.26 mg.
Next is the Sugar_grams variable.
It has a mean of 8.75 grams of sugar per serving across all cereal brands and a standard deviation of 5.32 grams.
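The mean and standard deviation reported by skim() can also be computed directly. The following is a minimal sketch, assuming the renamed column names above and using only the six rows printed by head(cereal), so the values it produces will not match the full-data figures of 167 and 8.75; cereal_head and numeric_summary are hypothetical names.

```r
library(dplyr)

# Assumption: only the six rows shown by head(cereal), with the renamed columns.
cereal_head <- tibble(
  Cereal = c("Frosted Mini Wheats", "Raisin Bran", "All Bran",
             "Apple Jacks", "Captain Crunch", "Cheerios"),
  Sodium_milligrams = c(0, 340, 70, 140, 200, 180),
  Sugar_grams       = c(11, 18, 5, 14, 12, 1)
)

# Apply mean() and sd() to every numeric column.
# across() names the results <column>_<function>, e.g. Sugar_grams_mean.
numeric_summary <- cereal_head %>%
  summarise(across(where(is.numeric), list(mean = mean, sd = sd)))

numeric_summary
```

Applied to cereal_data instead of cereal_head, the same call should reproduce the mean and sd columns of the skim() table.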
Here I load the psych package to gain insight into some of the statistics of the dataset. What I like about this output compared to other tools used for descriptive statistics is the readability: each variable is clearly depicted. For example, the Sodium_milligrams variable clearly shows the quartiles, the median, and the mean. Note that the summary() call below is a base R function; the psych package’s own descriptive function is describe(). I am not sure how the output will show up on RPubs. After knitting it looked very readable; however, RPubs may be another story.
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
summary(cereal_data)
## Cereal Sodium_milligrams Sugar_grams Type
## Length:20 Min. : 0.0 Min. : 0.00 Length:20
## Class :character 1st Qu.:137.5 1st Qu.: 4.00 Class :character
## Mode :character Median :180.0 Median : 9.50 Mode :character
## Mean :167.0 Mean : 8.75
## 3rd Qu.:202.5 3rd Qu.:12.50
## Max. :340.0 Max. :18.00
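For comparison, the psych package’s own descriptive function is describe(), which returns one row per variable with columns such as n, mean, sd, median, min, and max. A minimal sketch follows, assuming psych is installed and using only the numeric columns of the six rows printed by head(cereal); cereal_head and desc are hypothetical names.

```r
library(psych)

# Assumption: only the six rows shown by head(cereal), numeric columns only.
cereal_head <- data.frame(
  Sodium_milligrams = c(0, 340, 70, 140, 200, 180),
  Sugar_grams       = c(11, 18, 5, 14, 12, 1)
)

# describe() returns a data frame with one row per variable.
desc <- describe(cereal_head)

# Keep a few of the columns for a compact view.
desc[, c("n", "mean", "sd", "median", "min", "max")]
```

Unlike summary(), describe() also reports the standard deviation, so a single call covers the figures quoted from skim() earlier.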
Summary table of Sugar_grams content of each cereal brand represented in grams per serving.
library(tidyverse)
cereal_data %>%
  select(Cereal, Sugar_grams) %>%
  datatable(rownames = FALSE, filter = "top",
            colnames = c("Cereal", "Sugar_grams (per serving)"))
Summary table of sodium content of each cereal brand represented in milligrams per serving.
library(tidyverse)
cereal_data %>%
  select(Cereal, Sodium_milligrams) %>%
  datatable(rownames = FALSE, filter = "top",
            colnames = c("Cereal", "Sodium_milligrams (per serving)"))
I use a scatterplot, built with the ggplot2 package and colored by cereal, to visually represent the levels of sodium and sugar in each brand of cereal. Visually, I was able to quickly gain insight into the cereals that had the highest levels of sodium and sugar.
The scatterplot was particularly useful in identifying outliers in the data. For example, Raisin Bran has the highest sodium (340 mg) and sugar (18 g) in the data set, and Apple Jacks is also high in sugar (14 g), whereas Special K is lower in both sodium and sugar.
library(ggplot2)
ggplot(data = cereal_data) +
  geom_point(mapping = aes(x = Sugar_grams,
                           y = Sodium_milligrams,
                           color = Cereal)) +
  xlab("Sugar (grams per serving)") +
  ylab("Sodium (milligrams per serving)") +
  ggtitle("Scatterplot of Cereal Dataset - Sodium vs. Sugar")
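Outliers spotted by eye can be cross-checked numerically. The sketch below applies the common 1.5 × IQR rule to sodium; this rule is not part of the original analysis and is swapped in here only as an illustration. It takes the quartiles from the summary() output for all 20 cereals (1st Qu. 137.5, 3rd Qu. 202.5) and tests only the six rows printed by head(cereal); cereal_head and flagged are hypothetical names.

```r
# Assumption: only the six rows shown by head(cereal), with the renamed columns.
cereal_head <- data.frame(
  Cereal = c("Frosted Mini Wheats", "Raisin Bran", "All Bran",
             "Apple Jacks", "Captain Crunch", "Cheerios"),
  Sodium_milligrams = c(0, 340, 70, 140, 200, 180),
  Sugar_grams       = c(11, 18, 5, 14, 12, 1),
  stringsAsFactors  = FALSE
)

# Sodium quartiles copied from the summary() output for the full data set.
q1  <- 137.5
q3  <- 202.5
iqr <- q3 - q1

# 1.5 * IQR fences: values outside [lower, upper] are flagged.
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

flagged <- cereal_head[cereal_head$Sodium_milligrams < lower |
                         cereal_head$Sodium_milligrams > upper, ]

flagged$Cereal
```

On the full data, quantile(cereal_data$Sodium_milligrams, c(0.25, 0.75)) could replace the hard-coded quartiles, and the same check could be repeated for Sugar_grams.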