Assignment Create an Example. Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset. (25 points)
Later, you’ll be asked to extend an existing vignette. Using one of your classmate’s examples (as created above), you’ll then extend his or her example with additional annotated code. (15 points)
You should clone the provided repository. Once you have code to submit, you should make a pull request on the shared repository. You should also update the README.md file with your example.
After you’ve created your vignette, please submit your GitHub handle name in the submission link provided below. This will let your instructor know that your work is ready to be peer-graded.
You should complete your submission on the schedule stated in the course syllabus.
The Kaggle dataset I selected is NutriKit: Your Ultimate Food Database, which provides nutrition information about different food items grouped into categories such as fruits, grains, beverages, and vegetables.
The dplyr package lets users manipulate data easily and efficiently. Combining dplyr with the pipe operator (%>%) makes the code easy to read and makes each change to the dataset easy to follow.
The dplyr package is part of the tidyverse, so loading the tidyverse package gives me access to dplyr along with other core packages such as ggplot2, tidyr, readr, purrr, tibble, stringr, and forcats.
library(tidyverse)
To load the data, we use the readr package to read the CSV file from GitHub.
We use the function read_csv("file_name.csv") to read the CSV file.
read_csv() - reads a CSV file into a tibble
calories <- read_csv("https://raw.githubusercontent.com/AnnaMoy/Data-607/main/Calorie_value.csv")
## Rows: 717 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): food items, Category
## dbl (2): Avg Serving Size, Calories
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
calories
## # A tibble: 717 × 4
## `food items` `Avg Serving Size` Calories Category
## <chr> <dbl> <dbl> <chr>
## 1 Apple 182 94.6 Fruits
## 2 Apricot 55 26.4 Fruits
## 3 Avocado 30 48 Fruits
## 4 Banana 118 105. Fruits
## 5 Black Chokeberry 143 71.5 Fruits
## 6 Blackberries 144 89.3 Fruits
## 7 Blueberries 148 84.4 Fruits
## 8 Cantaloupe 134 64.3 Fruits
## 9 Cherry 140 70 Fruits
## 10 Cranberry 100 46 Fruits
## # ℹ 707 more rows
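The column specification message printed above can be avoided by declaring the column types up front. The following is a minimal sketch, not the original workflow: the inline CSV text and the name toy_calories are hypothetical stand-ins for the real file, and wrapping the text in I() assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical inline CSV standing in for the real file. I() tells readr
# (2.0 or later) to treat the string as literal data rather than a path.
csv_text <- I("food items,Avg Serving Size,Calories,Category
Apple,182,94.6,Fruits
Avocado,30,48,Fruits
Sparkling Water,350,0,Beverages")

# Declaring col_types makes the parsing explicit and silences the
# column specification message.
toy_calories <- read_csv(
  csv_text,
  col_types = cols(
    `food items` = col_character(),
    `Avg Serving Size` = col_double(),
    Calories = col_double(),
    Category = col_character()
  )
)

toy_calories
```

If the types do not need to be fixed, passing show_col_types = FALSE to read_csv() should quiet the message as well, as the message itself suggests.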
The group_by() function groups rows that share the same category, and the summarize() function then computes summary statistics such as the mean, median, minimum, and maximum for each group.
group_by() - groups rows that share the same value in one or more columns
# Find the average Calories for each Category
avg_calories <- calories %>%
  group_by(Category) %>%
  summarize(mean_calories = mean(Calories))
avg_calories
## # A tibble: 21 × 2
## Category mean_calories
## <chr> <dbl>
## 1 Beverages 370.
## 2 Breads 98.8
## 3 Breakfast grains 245.
## 4 Dairy 112.
## 5 Fruits 93.5
## 6 Grains 200.
## 7 Healthy Fats 148.
## 8 Indian bread 186.
## 9 Juice 149.
## 10 Meat 155.
## # ℹ 11 more rows
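summarize() can compute several statistics in one call. The sketch below uses a small hypothetical tibble named toy (its values are loosely based on the rows printed earlier and are assumptions for illustration) rather than the full dataset.

```r
library(dplyr)

# Hypothetical toy data; not the real dataset.
toy <- tibble(
  Category = c("Fruits", "Fruits", "Fruits", "Beverages", "Beverages"),
  Calories = c(94.6, 26.4, 48, 200, 0)
)

# One row per Category, with several summary columns at once.
toy_summary <- toy %>%
  group_by(Category) %>%
  summarize(
    n_items = n(),                      # number of rows in the group
    mean_calories = mean(Calories),
    median_calories = median(Calories),
    min_calories = min(Calories),
    max_calories = max(Calories)
  )

toy_summary
```

The same pattern should carry over to the calories data by replacing toy with calories.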
dplyr allows us to manipulate the data and extract specific information for further analysis.
filter() - keeps only the rows (observations) that meet a condition
distinct() - returns the unique values, dropping duplicates
slice() - selects rows by position
slice_sample() - takes a random sample of n rows
slice_min() and slice_max() - return the rows with the lowest and highest values in a column; tied rows are all returned by default
arrange() - sorts rows by a column, from lowest to highest by default; wrap the column in desc() to sort from highest to lowest
pull() - extracts a single column as a vector, which can be character or numeric
summarize() - collapses the data into summary values such as the mean, median, minimum, and maximum
# Filter for the food item named Avocado
avocado_calories <- calories %>%
  filter(`food items` == "Avocado")
avocado_calories
## # A tibble: 1 × 4
## `food items` `Avg Serving Size` Calories Category
## <chr> <dbl> <dbl> <chr>
## 1 Avocado 30 48 Fruits
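filter() also accepts several conditions at once. As a rough sketch on a hypothetical toy tibble (the values are assumptions for illustration), conditions separated by commas must all hold, and %in% matches against a set of values.

```r
library(dplyr)

# Hypothetical toy data; not the real dataset.
toy <- tibble(
  `food items` = c("Apple", "Apricot", "Avocado", "Green Smoothies"),
  Calories = c(94.6, 26.4, 48, 200),
  Category = c("Fruits", "Fruits", "Fruits", "Beverages")
)

# Comma-separated conditions are combined with AND.
low_cal_fruits <- toy %>%
  filter(Category == "Fruits", Calories < 50)

# %in% keeps rows whose value matches any element of the vector.
selected_items <- toy %>%
  filter(`food items` %in% c("Apple", "Avocado"))

low_cal_fruits
selected_items
```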
# Find the distinct average serving sizes, so that repeated serving sizes appear only once, and arrange them from highest to lowest
distinct_serving <- calories %>%
  distinct(`Avg Serving Size`) %>%
  arrange(desc(`Avg Serving Size`))
distinct_serving
## # A tibble: 115 × 1
## `Avg Serving Size`
## <dbl>
## 1 500
## 2 450
## 3 400
## 4 350
## 5 333
## 6 329
## 7 301
## 8 300
## 9 286
## 10 282
## # ℹ 105 more rows
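To make the sort direction concrete, here is a small sketch on a hypothetical vector of serving sizes (assumed values): arrange() on its own sorts from lowest to highest, and desc() reverses that.

```r
library(dplyr)

# Hypothetical serving sizes with duplicates.
sizes <- tibble(`Avg Serving Size` = c(100, 350, 100, 28, 350))

# Default order: lowest to highest.
ascending <- sizes %>%
  distinct(`Avg Serving Size`) %>%
  arrange(`Avg Serving Size`)

# desc() flips the order: highest to lowest.
descending <- sizes %>%
  distinct(`Avg Serving Size`) %>%
  arrange(desc(`Avg Serving Size`))

ascending$`Avg Serving Size`   # 28 100 350
descending$`Avg Serving Size`  # 350 100 28
```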
# Using the slice function, we can view rows 5 to 10 only.
# The result is stored under a name other than slice to avoid shadowing the function name.
rows_5_to_10 <- calories %>%
  slice(5:10)
rows_5_to_10
## # A tibble: 6 × 4
## `food items` `Avg Serving Size` Calories Category
## <chr> <dbl> <dbl> <chr>
## 1 Black Chokeberry 143 71.5 Fruits
## 2 Blackberries 144 89.3 Fruits
## 3 Blueberries 148 84.4 Fruits
## 4 Cantaloupe 134 64.3 Fruits
## 5 Cherry 140 70 Fruits
## 6 Cranberry 100 46 Fruits
# slice_sample() takes a random sample of 5 rows from the data
calories %>%
  slice_sample(n = 5)
## # A tibble: 5 × 4
## `food items` `Avg Serving Size` Calories Category
## <chr> <dbl> <dbl> <chr>
## 1 Cashew 28 155. Healthy Fats
## 2 Cacao Seeds 28 148. Nuts & Seeds
## 3 Methi 100 49 Vegetables
## 4 Green Smoothies 400 200 Beverages
## 5 Lentil salad 198 426. Salads
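Because slice_sample() draws rows at random, the rows shown will differ on each run. One common way to make the draw repeatable is to call set.seed() first. The sketch below assumes a hypothetical 20-row tibble and an arbitrary seed value.

```r
library(dplyr)

# Hypothetical toy data: 20 rows identified by id.
toy <- tibble(id = 1:20)

# Setting the same seed before each draw reproduces the same sample.
set.seed(607)  # arbitrary seed chosen for illustration
first_draw <- toy %>% slice_sample(n = 5)

set.seed(607)
second_draw <- toy %>% slice_sample(n = 5)

identical(first_draw, second_draw)  # TRUE
```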
# Find the rows with the lowest Calories (all tied rows are returned)
calories %>%
  slice_min(Calories)
## # A tibble: 5 × 4
## `food items` `Avg Serving Size` Calories Category
## <chr> <dbl> <dbl> <chr>
## 1 Sparkling Water 350 0 Beverages
## 2 Flavored Water 350 0 Beverages
## 3 Cucumber Water 400 0 Beverages
## 4 Rooibos tea 200 0 Tea & Coffee
## 5 Peppermint tea 200 0 Tea & Coffee
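Five rows come back here because they all tie at 0 Calories. A minimal sketch of how the n and with_ties arguments of slice_min() change this, using a hypothetical toy tibble (assumed values):

```r
library(dplyr)

# Hypothetical toy data with three items tied at 0 Calories.
toy <- tibble(
  `food items` = c("Sparkling Water", "Flavored Water", "Rooibos tea",
                   "Apricot", "Avocado"),
  Calories = c(0, 0, 0, 26.4, 48)
)

# Default behaviour: n = 1, but every row tied for the minimum is kept.
with_ties <- toy %>% slice_min(Calories, n = 1)

# with_ties = FALSE returns exactly n rows, taking the first tied row.
without_ties <- toy %>% slice_min(Calories, n = 1, with_ties = FALSE)

# A larger n returns the n lowest rows.
lowest_four <- toy %>% slice_min(Calories, n = 4)

nrow(with_ties)     # 3
nrow(without_ties)  # 1
nrow(lowest_four)   # 4
```

The same arguments are available on slice_max().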
# Find the rows with the maximum Calories and pull out only the Calories values
calories %>%
  slice_max(Calories) %>%
  pull(Calories)
## [1] 2035 2035
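The bare numbers above do not say which foods they belong to. As a small sketch, pull() can attach names from another column through its name argument. The toy tibble and its values below are assumptions for illustration.

```r
library(dplyr)

# Hypothetical toy data; not the real dataset.
toy <- tibble(
  `food items` = c("Cashew", "Lentil salad", "Green Smoothies"),
  Calories = c(155, 426, 200)
)

# Keep the two highest-calorie rows, then pull Calories as a vector
# named by the food item.
top_two <- toy %>%
  slice_max(Calories, n = 2) %>%
  pull(Calories, name = `food items`)

top_two
```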
The ggplot2 package lets users graph their data and present it in a visually appealing way.
ggplot() - initializes a plot from the dataset; aes() maps columns to the axes, and fill maps a column to the bar color
geom_bar() - draws a bar chart; stat = "identity" uses the y values as the bar heights instead of counting rows
coord_flip() - swaps the x-axis and the y-axis
ylab() - adds a label to the y-axis
xlab() - adds a label to the x-axis
ggtitle() - adds a title to the plot
# Bar chart of the average calories for each Category
ggplot(avg_calories, aes(Category, mean_calories, fill = Category)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  ylab("Average Calories") +
  xlab("Each Category") +
  ggtitle("Average Calories For Each Category")
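A possible variation on this chart, offered as a sketch rather than a replacement: geom_col() is shorthand for geom_bar(stat = "identity"), reorder() sorts the bars by value, and labs() sets both axis labels and the title in one call. The data frame toy_avg is hypothetical, with values rounded from the category averages printed earlier.

```r
library(ggplot2)

# Hypothetical toy data, rounded from the printed category averages.
toy_avg <- data.frame(
  Category = c("Beverages", "Breads", "Dairy", "Fruits"),
  mean_calories = c(370, 98.8, 112, 93.5)
)

# reorder() sorts the categories by mean_calories so the bars are ranked.
p <- ggplot(toy_avg, aes(x = reorder(Category, mean_calories), y = mean_calories)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    x = "Category",
    y = "Average Calories",
    title = "Average Calories by Category"
  )

p
```

Sorting the bars makes it easier to compare categories at a glance, and a single fill color avoids a legend that only repeats the axis labels.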
##Extend - Bishoy Sokkar
Using the mutate() function from the dplyr package, we can create a new variable named Avg_Cal_Per_Serving, calculated as Calories divided by Avg Serving Size. Despite its name, the result is the calories per unit of serving size, that is, a measure of calorie density.
calories2 <- calories %>%
  mutate(Avg_Cal_Per_Serving = Calories / `Avg Serving Size`)
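To check the arithmetic, the sketch below applies the same division to a hypothetical toy tibble (assumed values, including a made-up row with a serving size of 0). Wrapping the divisor in na_if() is an optional guard, not part of the original code, that turns a division by zero into NA instead of Inf.

```r
library(dplyr)

# Hypothetical toy data; the last row is invented to show the zero guard.
toy <- tibble(
  `food items` = c("Ghee", "Avocado", "Sparkling Water", "Placeholder item"),
  `Avg Serving Size` = c(14, 30, 350, 0),
  Calories = c(126, 48, 0, 10)
)

# na_if() replaces a serving size of 0 with NA before dividing.
toy_density <- toy %>%
  mutate(Avg_Cal_Per_Serving = Calories / na_if(`Avg Serving Size`, 0))

toy_density
```

For Ghee the result is 126 / 14 = 9, which matches the value in the full dataset's output.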
From there, we can argue that the foods with the highest calorie density are the less healthy options, while those with the lowest calorie density are the healthier options.
highest_calorie_density <- calories2 %>%
  slice_max(Avg_Cal_Per_Serving)
highest_calorie_density
## # A tibble: 8 × 5
## `food items` `Avg Serving Size` Calories Category Avg_Cal_Per_Serving
## <chr> <dbl> <dbl> <chr> <dbl>
## 1 Ghee 14 126 Dairy 9
## 2 Cow Ghee 14 126 Dairy 9
## 3 Buffalo Ghee 14 126 Dairy 9
## 4 Desi Ghee 14 126 Dairy 9
## 5 Goat Ghee 14 126 Dairy 9
## 6 Flavored ghee 15 135 Dairy 9
## 7 Pure Ghee 14 126 Dairy 9
## 8 ghee 14 126 Healthy Fats 9
lowest_calorie_density <- calories2 %>%
  slice_min(Avg_Cal_Per_Serving)
lowest_calorie_density
## # A tibble: 5 × 5
## `food items` `Avg Serving Size` Calories Category Avg_Cal_Per_Serving
## <chr> <dbl> <dbl> <chr> <dbl>
## 1 Sparkling Water 350 0 Beverages 0
## 2 Flavored Water 350 0 Beverages 0
## 3 Cucumber Water 400 0 Beverages 0
## 4 Rooibos tea 200 0 Tea & Coffee 0
## 5 Peppermint tea 200 0 Tea & Coffee 0
Because the dataset contains many foods and drinks, the highest calorie density results were all oils and fats (ghee), and the lowest were all drinks. To get a broader view, we can repeat the analysis at the category level by grouping by Category first and averaging the calorie density.
avg_calories2 <- calories2 %>%
  group_by(Category) %>%
  summarize(mean_calories = mean(Avg_Cal_Per_Serving))
avg_calories2 %>%
  arrange(mean_calories)
## # A tibble: 21 × 2
## Category mean_calories
## <chr> <dbl>
## 1 Juice 0.47
## 2 Vegetables 0.557
## 3 Fruits 0.650
## 4 Tea & Coffee 0.717
## 5 Beverages 1.05
## 6 Soup 1.11
## 7 Salads 1.25
## 8 Non-veg Soup 1.28
## 9 Meat 1.55
## 10 Non-veg Salads 1.65
## # ℹ 11 more rows
avg_calories2 %>%
  arrange(desc(mean_calories))
## # A tibble: 21 × 2
## Category mean_calories
## <chr> <dbl>
## 1 Dairy 5.08
## 2 Nuts & Seeds 4.85
## 3 Healthy Fats 4.01
## 4 Indian bread 2.97
## 5 Protien 2.83
## 6 Sandwich 2.74
## 7 Non-veg Sandwich 2.68
## 8 Breads 2.59
## 9 Breakfast grains 2.44
## 10 Non-veg Protien 2.37
## # ℹ 11 more rows
From the above, we can see the food categories with the lowest and highest average calorie density (calories divided by average serving size).
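A possible next step, sketched here on a hypothetical toy tibble (assumed values and an assumed column name, density), is to combine group_by() with slice_max() to find the single most calorie-dense item within each category rather than only the category averages.

```r
library(dplyr)

# Hypothetical toy data; not the real dataset.
toy <- tibble(
  `food items` = c("Apple", "Avocado", "Green Smoothies", "Sparkling Water"),
  `Avg Serving Size` = c(182, 30, 400, 350),
  Calories = c(94.6, 48, 200, 0),
  Category = c("Fruits", "Fruits", "Beverages", "Beverages")
)

# Within each Category, keep the row with the highest calorie density.
top_per_category <- toy %>%
  mutate(density = Calories / `Avg Serving Size`) %>%
  group_by(Category) %>%
  slice_max(density, n = 1) %>%
  ungroup()

top_per_category
```

On the toy data this keeps Green Smoothies for Beverages and Avocado for Fruits. Swapping slice_max() for slice_min() would give the least dense item per category instead.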