Assignment

Create an Example. Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset. (25 points)

Later, you’ll be asked to extend an existing vignette. Using one of your classmates’ examples (as created above), you’ll then extend his or her example with additional annotated code. (15 points)

You should clone the provided repository. Once you have code to submit, you should make a pull request on the shared repository. You should also update the README.md file with your example.

After you’ve created your vignette, please submit your GitHub handle name in the submission link provided below. This will let your instructor know that your work is ready to be peer-graded.

You should complete your submission on the schedule stated in the course syllabus.

Kaggle Dataset

The Kaggle dataset I selected is NutriKit: Your Ultimate Food Database, which provides nutrition information about different food items categorized into fruits, grains, beverages, vegetables, and more.

Vignette for dplyr

The dplyr package lets users manipulate data easily and efficiently. Combining dplyr with the pipe operator makes the code easy to read and makes each change to the dataset easy to follow.

The dplyr package is part of the tidyverse, so loading the tidyverse package gives me access to dplyr along with ggplot2, tidyr, readr, purrr, tibble, stringr, and forcats.

Load tidyverse library

library(tidyverse)

Load in the csv file from Kaggle

To load the data, we use the readr package, which is part of the tidyverse, to read the CSV file from GitHub.

We use the function read_csv("file_name.csv") to read the CSV file.

read_csv() - reads a CSV file into a tibble

calories <- read_csv("https://raw.githubusercontent.com/AnnaMoy/Data-607/main/Calorie_value.csv")
## Rows: 717 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): food items, Category
## dbl (2): Avg Serving Size, Calories
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
calories
## # A tibble: 717 × 4
##    `food items`     `Avg Serving Size` Calories Category
##    <chr>                         <dbl>    <dbl> <chr>   
##  1 Apple                           182     94.6 Fruits  
##  2 Apricot                          55     26.4 Fruits  
##  3 Avocado                          30     48   Fruits  
##  4 Banana                          118    105.  Fruits  
##  5 Black Chokeberry                143     71.5 Fruits  
##  6 Blackberries                    144     89.3 Fruits  
##  7 Blueberries                     148     84.4 Fruits  
##  8 Cantaloupe                      134     64.3 Fruits  
##  9 Cherry                          140     70   Fruits  
## 10 Cranberry                       100     46   Fruits  
## # ℹ 707 more rows
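
The message printed by read_csv() above suggests specifying the column types. The following is a minimal sketch of that idea, assuming a small inline CSV (the `csv_text` and `sample_foods` names are hypothetical) that mimics the layout of the file. Declaring `col_types` makes the types explicit and, as a side effect, quiets the column specification message.

```r
library(readr)

# Hypothetical inline CSV that mimics the layout of Calorie_value.csv.
# I() tells read_csv() to treat the string as literal data, not a file path.
csv_text <- I("food items,Avg Serving Size,Calories,Category
Apple,182,94.6,Fruits
Apricot,55,26.4,Fruits
Avocado,30,48,Fruits")

# Compact column specification: c = character, d = double.
sample_foods <- read_csv(csv_text, col_types = "cddc")

sample_foods
```

If guessing the types is acceptable, passing `show_col_types = FALSE` instead should silence the message without fixing the types.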

Group_by and Summarize Function

group_by() groups rows that share the same value of a variable, and summarize() collapses each group into summary statistics such as the mean, median, min, or max.

group_by() - groups rows that share the same values

# Find the average Calories for each Category
avg_calories <- calories %>%
  group_by(Category) %>%
  summarize(mean_calories = mean(Calories))

avg_calories
## # A tibble: 21 × 2
##    Category         mean_calories
##    <chr>                    <dbl>
##  1 Beverages                370. 
##  2 Breads                    98.8
##  3 Breakfast grains         245. 
##  4 Dairy                    112. 
##  5 Fruits                    93.5
##  6 Grains                   200. 
##  7 Healthy Fats             148. 
##  8 Indian bread             186. 
##  9 Juice                    149. 
## 10 Meat                     155. 
## # ℹ 11 more rows
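
summarize() can compute several statistics in one call. The sketch below assumes a small toy tibble (`toy_foods` and its values are illustrative, not taken from the dataset) and shows the mean, median, min, and max side by side, together with n() for the group size.

```r
library(dplyr)

# Hypothetical toy data; the values are illustrative only.
toy_foods <- tibble(
  Category = c("Fruits", "Fruits", "Fruits", "Dairy", "Dairy"),
  Calories = c(90, 30, 60, 120, 100)
)

toy_summary <- toy_foods %>%
  group_by(Category) %>%
  summarize(
    n_items = n(),                      # number of rows in each group
    mean_calories = mean(Calories),
    median_calories = median(Calories),
    min_calories = min(Calories),
    max_calories = max(Calories)
  )

toy_summary
```

The result has one row per Category, which is the same shape as avg_calories above but with more columns.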

Manipulate

dplyr allows us to manipulate the data and extract specific information for further analysis.

filter() - keeps only the rows (observations) that meet a condition

distinct() - keeps only the unique values, dropping duplicates

slice() - selects rows by their position in the data

slice_sample() - takes a random sample of n rows from the data

slice_min() and slice_max() - return the rows with the lowest and the highest values in a column

arrange() - sorts the rows from lowest to highest value by default. Wrapping a column in desc() sorts it from highest to lowest

pull() - extracts a single column as a vector, which can be character or numeric

summarize() - computes summaries of the data such as the mean, median, min, and max

# Filter for food items that are Avocado
avocado_calories <- calories %>%
  filter(`food items` == "Avocado")

avocado_calories
## # A tibble: 1 × 4
##   `food items` `Avg Serving Size` Calories Category
##   <chr>                     <dbl>    <dbl> <chr>   
## 1 Avocado                      30       48 Fruits
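
filter() also accepts several conditions at once. The following sketch assumes a toy tibble (`toy_foods` and its values are illustrative only) and shows two common patterns: comma-separated conditions, and matching against a set of values with %in%.

```r
library(dplyr)

# Hypothetical toy data; the values are illustrative only.
toy_foods <- tibble(
  `food items` = c("Apple", "Avocado", "Ghee", "Green Tea"),
  Calories = c(95, 48, 126, 2),
  Category = c("Fruits", "Fruits", "Dairy", "Tea & Coffee")
)

# Conditions separated by commas are combined with AND.
low_cal_fruits <- toy_foods %>%
  filter(Category == "Fruits", Calories < 50)

# %in% keeps rows whose Category matches any element of the vector.
fruit_or_dairy <- toy_foods %>%
  filter(Category %in% c("Fruits", "Dairy"))

low_cal_fruits
fruit_or_dairy
```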
# Find the distinct average serving sizes (no duplicates)
# and arrange them from highest to lowest
distinct_serving <- calories %>%
  distinct(`Avg Serving Size`) %>%
  arrange(desc(`Avg Serving Size`))

distinct_serving
## # A tibble: 115 × 1
##    `Avg Serving Size`
##                 <dbl>
##  1                500
##  2                450
##  3                400
##  4                350
##  5                333
##  6                329
##  7                301
##  8                300
##  9                286
## 10                282
## # ℹ 105 more rows
# Use slice() to view rows 5 through 10 only.
# The result is stored under a name that does not shadow the slice() function.
rows_5_to_10 <- calories %>%
  slice(5:10)

rows_5_to_10
## # A tibble: 6 × 4
##   `food items`     `Avg Serving Size` Calories Category
##   <chr>                         <dbl>    <dbl> <chr>   
## 1 Black Chokeberry                143     71.5 Fruits  
## 2 Blackberries                    144     89.3 Fruits  
## 3 Blueberries                     148     84.4 Fruits  
## 4 Cantaloupe                      134     64.3 Fruits  
## 5 Cherry                          140     70   Fruits  
## 6 Cranberry                       100     46   Fruits
# Use slice_sample() to take a random sample of 5 rows
calories %>%
  slice_sample(n = 5)
## # A tibble: 5 × 4
##   `food items`    `Avg Serving Size` Calories Category    
##   <chr>                        <dbl>    <dbl> <chr>       
## 1 Cashew                          28     155. Healthy Fats
## 2 Cacao Seeds                     28     148. Nuts & Seeds
## 3 Methi                          100      49  Vegetables  
## 4 Green Smoothies                400     200  Beverages   
## 5 Lentil salad                   198     426. Salads
# Find the rows with the lowest Calories (all tied rows are returned)
calories %>%
  slice_min(Calories)
## # A tibble: 5 × 4
##   `food items`    `Avg Serving Size` Calories Category    
##   <chr>                        <dbl>    <dbl> <chr>       
## 1 Sparkling Water                350        0 Beverages   
## 2 Flavored Water                 350        0 Beverages   
## 3 Cucumber Water                 400        0 Beverages   
## 4 Rooibos tea                    200        0 Tea & Coffee
## 5 Peppermint tea                 200        0 Tea & Coffee
# Find the rows with the highest Calories and pull out only the Calories values
calories %>%
  slice_max(Calories) %>%
  pull(Calories)
## [1] 2035 2035
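
Both calls above return more than one row because several rows are tied for the lowest and the highest value. The sketch below, which assumes a toy tibble (`toy_drinks` is hypothetical), shows how the `n` and `with_ties` arguments of slice_min() and slice_max() control this.

```r
library(dplyr)

# Hypothetical toy data with a tie for the lowest value.
toy_drinks <- tibble(
  item = c("Water", "Herbal tea", "Black coffee", "Orange juice"),
  Calories = c(0, 0, 2, 110)
)

# Default behavior: every row tied for the minimum is kept.
all_ties <- toy_drinks %>%
  slice_min(Calories)

# with_ties = FALSE returns exactly n rows, even when values are tied.
one_row <- toy_drinks %>%
  slice_min(Calories, n = 1, with_ties = FALSE)

# n asks for the two highest values instead of only the top one.
top_two <- toy_drinks %>%
  slice_max(Calories, n = 2)

all_ties
one_row
top_two
```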

Vignette for ggplot2

The ggplot2 package lets users graph their data and present it in a visually appealing way.

ggplot() - initializes a plot from a dataset; aes() maps columns to the axes and other aesthetics. fill colors the bars by the chosen column

geom_bar() - plots a bar chart

coord_flip() - swaps the x-axis and the y-axis

ylab() - Adds a label to the y-axis

xlab() - Adds a label to the x-axis

ggtitle() - Adds a title to the plot

# Bar chart of the average calories for each Category
ggplot(avg_calories, aes(Category, mean_calories, fill = Category)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  ylab("Average Calories") +
  xlab("Each Category") +
  ggtitle("Average Calories For Each Category")
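
Bars are drawn in alphabetical order by default, which can make categories hard to compare. As one possible variation, the sketch below sorts the bars by value with reorder() and uses geom_col(), which plots the values as given, like geom_bar(stat = "identity"). The `toy_avg` data frame is a hypothetical three-row subset with values rounded from the avg_calories output earlier.

```r
library(ggplot2)

# Hypothetical subset; values rounded from the avg_calories output.
toy_avg <- data.frame(
  Category = c("Fruits", "Dairy", "Grains"),
  mean_calories = c(93.5, 112, 200)
)

# reorder() turns Category into a factor whose levels are sorted by mean_calories.
toy_avg$Category <- reorder(toy_avg$Category, toy_avg$mean_calories)

p <- ggplot(toy_avg, aes(Category, mean_calories)) +
  geom_col() +
  coord_flip() +
  labs(
    x = "Category",
    y = "Average Calories",
    title = "Average Calories by Category (sorted)"
  )

p
```

labs() sets both axis labels and the title in one call, as an alternative to xlab(), ylab(), and ggtitle().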

Extend - Bishoy Sokkar

Using the mutate() function from the dplyr package, we can create a new variable named Avg_Cal_Per_Serving, calculated as Calories divided by the average serving size.

calories2 <- calories %>%
  mutate(Avg_Cal_Per_Serving = Calories/`Avg Serving Size`)
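
mutate() can add more than one column in a single call, and a later column can build on one created just before it. The sketch below assumes a two-row toy tibble with values copied from the dataset output above; the `Cal_Per_100` column is a hypothetical addition for illustration.

```r
library(dplyr)

# Two rows copied from the dataset output shown earlier.
toy_foods <- tibble(
  `food items` = c("Apple", "Ghee"),
  `Avg Serving Size` = c(182, 14),
  Calories = c(94.6, 126)
)

toy_density <- toy_foods %>%
  mutate(
    Avg_Cal_Per_Serving = Calories / `Avg Serving Size`,
    # A column created earlier in the same mutate() call can be reused.
    # Cal_Per_100 is a hypothetical name: calories per 100 units of serving size.
    Cal_Per_100 = Avg_Cal_Per_Serving * 100
  )

toy_density
```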

From there, we can argue that the foods with the highest Avg_Cal_Per_Serving values are the least healthy options, while those with the lowest values are healthier options.

highest_calorie_density <- calories2 %>%
  slice_max(Avg_Cal_Per_Serving)
highest_calorie_density
## # A tibble: 8 × 5
##   `food items`  `Avg Serving Size` Calories Category     Avg_Cal_Per_Serving
##   <chr>                      <dbl>    <dbl> <chr>                      <dbl>
## 1 Ghee                          14      126 Dairy                          9
## 2 Cow Ghee                      14      126 Dairy                          9
## 3 Buffalo Ghee                  14      126 Dairy                          9
## 4 Desi Ghee                     14      126 Dairy                          9
## 5 Goat Ghee                     14      126 Dairy                          9
## 6 Flavored ghee                 15      135 Dairy                          9
## 7 Pure Ghee                     14      126 Dairy                          9
## 8 ghee                          14      126 Healthy Fats                   9
lowest_calorie_density <- calories2 %>%
  slice_min(Avg_Cal_Per_Serving)
lowest_calorie_density
## # A tibble: 5 × 5
##   `food items`    `Avg Serving Size` Calories Category     Avg_Cal_Per_Serving
##   <chr>                        <dbl>    <dbl> <chr>                      <dbl>
## 1 Sparkling Water                350        0 Beverages                      0
## 2 Flavored Water                 350        0 Beverages                      0
## 3 Cucumber Water                 400        0 Beverages                      0
## 4 Rooibos tea                    200        0 Tea & Coffee                   0
## 5 Peppermint tea                 200        0 Tea & Coffee                   0

Because the dataset contains many foods and drinks, the highest calorie density results were only oils and fats, and the lowest were only drinks. To get a broader view, we can repeat the same calculation at the category level by grouping by Category first.

avg_calories2 <- calories2 %>%
  group_by(Category) %>%
  summarize(mean_calories = mean(Avg_Cal_Per_Serving))

avg_calories2 %>%
  arrange(mean_calories)
## # A tibble: 21 × 2
##    Category       mean_calories
##    <chr>                  <dbl>
##  1 Juice                  0.47 
##  2 Vegetables             0.557
##  3 Fruits                 0.650
##  4 Tea & Coffee           0.717
##  5 Beverages              1.05 
##  6 Soup                   1.11 
##  7 Salads                 1.25 
##  8 Non-veg Soup           1.28 
##  9 Meat                   1.55 
## 10 Non-veg Salads         1.65 
## # ℹ 11 more rows
avg_calories2 %>%
  arrange(desc(mean_calories))
## # A tibble: 21 × 2
##    Category         mean_calories
##    <chr>                    <dbl>
##  1 Dairy                     5.08
##  2 Nuts & Seeds              4.85
##  3 Healthy Fats              4.01
##  4 Indian bread              2.97
##  5 Protien                   2.83
##  6 Sandwich                  2.74
##  7 Non-veg Sandwich          2.68
##  8 Breads                    2.59
##  9 Breakfast grains          2.44
## 10 Non-veg Protien           2.37
## # ℹ 11 more rows

From the above we can see the categories of food with the lowest and highest average calories per serving.
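
A possible next step is to combine both views and find the most calorie-dense item within each category. The sketch below assumes a toy tibble (`toy_density` here is hypothetical, and the Milk value is made up for illustration) and applies slice_max() after group_by(), so the slice is taken separately within each group.

```r
library(dplyr)

# Hypothetical toy data; the Milk value is illustrative only.
toy_density <- tibble(
  `food items` = c("Ghee", "Milk", "Apple", "Avocado"),
  Category = c("Dairy", "Dairy", "Fruits", "Fruits"),
  Avg_Cal_Per_Serving = c(9, 0.6, 0.52, 1.6)
)

# On a grouped tibble, slice_max() picks the top row within each group.
top_per_category <- toy_density %>%
  group_by(Category) %>%
  slice_max(Avg_Cal_Per_Serving, n = 1) %>%
  ungroup()

top_per_category
```

Calling ungroup() at the end returns a plain tibble, so later steps are not grouped by accident.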