knitr::opts_chunk$set(echo = TRUE)
The goal of this assignment is to practice collaborating around a code project with GitHub. You could consider our collective work as building out a book of examples on how to use tidyverse functions.
“The tidyverse is a powerful collection of R packages that you can use for data science. They are designed to help you to transform and visualize data. All packages within this collection share an underlying philosophy and common APIs.” (datacamp)
Tidyverse packages include the following libraries:
ggplot2: it is used to visualize data
dplyr: it is used to manipulate data
tidyr: it is used to tidy the data
readr: it is used to read rectangular data like csv files
purrr: it is used to work with functions and vectors
tibble: it is used to re-imagine the data frame
stringr: it is used to work with strings easily
forcats: it is used to work with factors.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The dataset I chose for this assignment is ‘Fruits and Vegetables Prices In USA In The Year 2020’. Once loaded, it contains 8 columns and 155 rows.
The column description of the dataset is as follows:
-Item: Name of the fruit or the vegetable.
-Form: The form of the item, i.e., canned, fresh, juice, dried or frozen.
-Retail Price: Average retail price of the item in the year.
-Retail Price Unit: Average retail price’s measurement unit.
-Yield: Average yield of the item in the year.
-Cup Equivalent Size: Comparison done with one edible cup of food.
-Cup Equivalent Unit: Comparison’s measurement unit.
-Cup Equivalent Price: Price per edible cup equivalent (The Unit of Measurement for Federal Recommendations for Fruit and Vegetable Consumption)
The dataset is taken from the Kaggle website.
url <- "https://raw.githubusercontent.com/SalouaDaouki/Data607/main/Prices.csv"
dataset <- read.csv(url)
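Since readr is the tidyverse package for rectangular data, the same kind of file could also be read with read_csv(), which returns a tibble instead of a base data frame. The sketch below is only an illustration: `csv_text` and `sample_prices` are hypothetical names, and the three rows are copied from the output shown in this document rather than downloaded.

```r
library(readr)

# Three rows copied from the output in this document, written as inline CSV text.
# The third item contains a comma, so it is wrapped in double quotes.
csv_text <- 'Item,Form,RetailPrice,RetailPriceUnit
Acorn squash,Fresh,1.1804,per pound
Apples,Fresh,1.5193,per pound
"Apples, applesauce",Canned,1.0660,per pound'

# I() marks the string as literal data rather than a file path or URL.
sample_prices <- read_csv(I(csv_text), show_col_types = FALSE)
sample_prices
```

Passing the URL instead of `I(csv_text)` should read the full file in the same way, assuming the file is still available at that address.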
“dplyr is a grammar of data manipulation. You can use it to solve the most common data manipulation challenges.” (Datacamp). Some of its functions are:
filter() allows you to select a subset of rows in a data frame.
arrange() sorts the observations in a dataset in ascending or descending order based on one of its variables.
mutate() allows you to update or create new columns of a data frame.
summarize() allows you to turn many observations into a single data point.
Now, let’s first take a look at the data:
glimpse(dataset)
## Rows: 155
## Columns: 8
## $ Item <chr> "Acorn squash", "Apples", "Apples, applesauce", "Ap…
## $ Form <chr> "Fresh", "Fresh", "Canned", "Juice", "Juice", "Fres…
## $ RetailPrice <dbl> 1.1804, 1.5193, 1.0660, 0.5853, 0.7804, 2.9665, 6.6…
## $ RetailPriceUnit <chr> "per pound", "per pound", "per pound", "per pint", …
## $ Yield <dbl> 0.4586, 0.9000, 1.0000, 1.0000, 1.0000, 0.9300, 1.0…
## $ CupEquivalentSize <dbl> 0.4519, 0.2425, 0.5401, 8.0000, 8.0000, 0.3638, 0.1…
## $ CupEquivalentUnit <chr> "pounds", "pounds", "pounds", "fluid ounces", "flui…
## $ CupEquivalentPrice <dbl> 1.1633, 0.4094, 0.5758, 0.2926, 0.3902, 1.1603, 0.9…
summary() gives the minimum, quartiles, median, mean, and maximum of each numeric column (the five-number summary plus the mean), and the length, class, and mode of each character column:
summary(dataset)
## Item Form RetailPrice RetailPriceUnit
## Length:155 Length:155 Min. : 0.3604 Length:155
## Class :character Class :character 1st Qu.: 1.1565 Class :character
## Mode :character Mode :character Median : 1.7218 Mode :character
## Mean : 2.1796
## 3rd Qu.: 2.5783
## Max. :10.5527
## Yield CupEquivalentSize CupEquivalentUnit CupEquivalentPrice
## Min. :0.3750 Min. :0.1232 Length:155 Min. :0.2021
## 1st Qu.:0.6650 1st Qu.:0.3197 Class :character 1st Qu.:0.5199
## Median :0.9100 Median :0.3527 Mode :character Median :0.6769
## Mean :0.9264 Mean :0.8852 Mean :0.8141
## 3rd Qu.:1.0000 3rd Qu.:0.3858 3rd Qu.:1.0223
## Max. :2.5397 Max. :8.0000 Max. :3.0700
Let’s filter the data by form of the fruits and vegetables:
head(filter(dataset, Form=="Fresh"))
## Item Form RetailPrice RetailPriceUnit Yield CupEquivalentSize
## 1 Acorn squash Fresh 1.1804 per pound 0.4586 0.4519
## 2 Apples Fresh 1.5193 per pound 0.9000 0.2425
## 3 Apricots Fresh 2.9665 per pound 0.9300 0.3638
## 4 Artichoke Fresh 2.1913 per pound 0.3750 0.3858
## 5 Asparagus Fresh 2.7576 per pound 0.4938 0.3968
## 6 Avocados Fresh 2.2368 per pound 0.7408 0.3197
## CupEquivalentUnit CupEquivalentPrice
## 1 pounds 1.1633
## 2 pounds 0.4094
## 3 pounds 1.1603
## 4 pounds 2.2545
## 5 pounds 2.2159
## 6 pounds 0.9653
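filter() also accepts several conditions at once. The following is a minimal sketch on a small hand-built data frame (the name `prices` and the result names are hypothetical; the values are copied from the output above), not on the full dataset.

```r
library(dplyr)

# A few rows copied from the dataset output shown in this document.
prices <- data.frame(
  Item = c("Acorn squash", "Apples", "Apples, applesauce", "Apricots"),
  Form = c("Fresh", "Fresh", "Canned", "Fresh"),
  RetailPrice = c(1.1804, 1.5193, 1.0660, 2.9665)
)

# Conditions separated by commas are combined with AND.
cheap_fresh <- filter(prices, Form == "Fresh", RetailPrice < 2)

# %in% keeps rows whose Form matches any of the listed values.
fresh_or_canned <- filter(prices, Form %in% c("Fresh", "Canned"))

cheap_fresh
fresh_or_canned
```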
Let’s arrange the data in ascending order based on the retail price:
head(dataset %>% arrange(RetailPrice))
## Item Form RetailPrice RetailPriceUnit Yield
## 1 Watermelon Fresh 0.3604 per pound 0.5200
## 2 Bananas Fresh 0.5249 per pound 0.6400
## 3 Pineapple Fresh 0.5685 per pound 0.5100
## 4 Cantaloupe Fresh 0.5767 per pound 0.5100
## 5 Apples, frozen concentrate Juice 0.5853 per pint 1.0000
## 6 Potatoes Fresh 0.6682 per pound 0.8113
## CupEquivalentSize CupEquivalentUnit CupEquivalentPrice
## 1 0.3307 pounds 0.2292
## 2 0.3307 pounds 0.2712
## 3 0.3638 pounds 0.4055
## 4 0.3748 pounds 0.4238
## 5 8.0000 fluid ounces 0.2926
## 6 0.2646 pounds 0.2179
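arrange() can also sort in descending order with desc() and break ties with a second column. This is a small sketch on hand-copied rows (the data frame `prices` and the result `sorted` are hypothetical names); the two carrot rows share the same retail price, so Yield decides their order.

```r
library(dplyr)

# Four rows copied from the output in this document; the carrot rows tie on price.
prices <- data.frame(
  Item = c("Watermelon", "Carrots, raw whole", "Bananas", "Carrots, cooked whole"),
  RetailPrice = c(0.3604, 0.8703, 0.5249, 0.8703),
  Yield = c(0.5200, 0.8900, 0.6400, 0.8158)
)

# Highest price first; among equal prices, lowest yield first.
sorted <- arrange(prices, desc(RetailPrice), Yield)
sorted
```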
We can also combine the filter() and arrange() functions. For example, below I filter the data to keep only the fresh fruits and vegetables, then arrange the result in ascending order based on the retail price.
dataset %>%
  filter(Form == "Fresh") %>%
  arrange(RetailPrice)
## Item Form RetailPrice RetailPriceUnit Yield
## 1 Watermelon Fresh 0.3604 per pound 0.5200
## 2 Bananas Fresh 0.5249 per pound 0.6400
## 3 Pineapple Fresh 0.5685 per pound 0.5100
## 4 Cantaloupe Fresh 0.5767 per pound 0.5100
## 5 Potatoes Fresh 0.6682 per pound 0.8113
## 6 Cabbage, green Fresh 0.7025 per pound 0.7788
## 7 Carrots, cooked whole Fresh 0.8703 per pound 0.8158
## 8 Carrots, raw whole Fresh 0.8703 per pound 0.8900
## 9 Honeydew Fresh 0.9056 per pound 0.4600
## 10 Onions Fresh 0.9751 per pound 0.9000
## 11 Celery, trimmed bunches Fresh 0.9842 per pound 0.7300
## 12 Lettuce, iceberg Fresh 0.9952 per pound 0.9500
## 13 Cabbage, red Fresh 1.0985 per pound 0.7791
## 14 Sweet potatoes Fresh 1.1198 per pound 0.8818
## 15 Mangoes Fresh 1.1513 per pound 0.7100
## 16 Tomatoes, roma & plum Fresh 1.1618 per pound 0.9100
## 17 Grapefruit Fresh 1.1695 per pound 0.4900
## 18 Acorn squash Fresh 1.1804 per pound 0.4586
## 19 Cucumbers with peel Fresh 1.1933 per pound 0.9700
## 20 Cucumbers without peel Fresh 1.1933 per pound 0.7300
## 21 Oranges Fresh 1.2131 per pound 0.6800
## 22 Butternut squash Fresh 1.2325 per pound 0.7140
## 23 Carrots, baby Fresh 1.2716 per pound 1.0000
## 24 Green peppers Fresh 1.2772 per pound 0.8200
## 25 Papaya Fresh 1.2904 per pound 0.6200
## 26 Clementines Fresh 1.3847 per pound 0.7700
## 27 Apples Fresh 1.5193 per pound 0.9000
## 28 Zucchini Fresh 1.5489 per pound 0.7695
## 29 Radish Fresh 1.5770 per pound 0.9000
## 30 Pears Fresh 1.5865 per pound 0.9000
## 31 Peaches Fresh 1.7167 per pound 0.9600
## 32 Lettuce, romaine, heads Fresh 1.8299 per pound 0.9400
## 33 Grapes Fresh 1.8398 per pound 0.9600
## 34 Corn Fresh 1.8908 per pound 0.5400
## 35 Nectarines Fresh 1.9062 per pound 0.9100
## 36 Cauliflower heads Fresh 1.9769 per pound 0.8926
## 37 Plum Fresh 2.0292 per pound 0.9400
## 38 Green beans Fresh 2.1463 per pound 0.8466
## 39 Red peppers Fresh 2.1614 per pound 0.8200
## 40 Kiwi Fresh 2.1849 per pound 0.7600
## 41 Artichoke Fresh 2.1913 per pound 0.3750
## 42 Pomegranate Fresh 2.2350 per pound 0.5600
## 43 Avocados Fresh 2.2368 per pound 0.7408
## 44 Broccoli heads Fresh 2.3065 per pound 0.7800
## 45 Tomatoes, large round Fresh 2.3347 per pound 0.9100
## 46 Celery sticks Fresh 2.4041 per pound 1.0000
## 47 Turnip greens Fresh 2.4176 per pound 0.7500
## 48 Kale Fresh 2.5018 per pound 1.0500
## 49 Lettuce, romaine, hearts Fresh 2.5766 per pound 0.8500
## 50 Strawberries Fresh 2.5800 per pound 0.9400
## 51 Collard greens Fresh 2.6820 per pound 1.1600
## 52 Brussels sprouts Fresh 2.6895 per pound 1.0600
## 53 Broccoli florets Fresh 2.7486 per pound 1.0000
## 54 Asparagus Fresh 2.7576 per pound 0.4938
## 55 Apricots Fresh 2.9665 per pound 0.9300
## 56 Spinach, boiled Fresh 2.9940 per pound 0.7700
## 57 Spinach, eaten raw Fresh 2.9940 per pound 1.0000
## 58 Cherries Fresh 3.4269 per pound 0.9200
## 59 Mushrooms, whole Fresh 3.4464 per pound 0.9700
## 60 Cauliflower florets Fresh 3.5859 per pound 0.9702
## 61 Mushrooms, sliced Fresh 3.6417 per pound 1.0000
## 62 Okra Fresh 3.9803 per pound 0.7695
## 63 Tomatoes, grape & cherry Fresh 4.1458 per pound 0.9100
## 64 Blueberries Fresh 4.1739 per pound 0.9500
## 65 Blackberries Fresh 6.0172 per pound 0.9600
## 66 Raspberries Fresh 6.6391 per pound 0.9600
## CupEquivalentSize CupEquivalentUnit CupEquivalentPrice
## 1 0.3307 pounds 0.2292
## 2 0.3307 pounds 0.2712
## 3 0.3638 pounds 0.4055
## 4 0.3748 pounds 0.4238
## 5 0.2646 pounds 0.2179
## 6 0.3307 pounds 0.2983
## 7 0.3197 pounds 0.3410
## 8 0.2756 pounds 0.2695
## 9 0.3748 pounds 0.7378
## 10 0.3527 pounds 0.3822
## 11 0.2646 pounds 0.3567
## 12 0.2425 pounds 0.2540
## 13 0.3307 pounds 0.4663
## 14 0.4409 pounds 0.5599
## 15 0.3638 pounds 0.5898
## 16 0.3748 pounds 0.4785
## 17 0.4630 pounds 1.1050
## 18 0.4519 pounds 1.1633
## 19 0.2646 pounds 0.3255
## 20 0.2646 pounds 0.4325
## 21 0.4079 pounds 0.7276
## 22 0.4519 pounds 0.7802
## 23 0.2756 pounds 0.3504
## 24 0.2646 pounds 0.4121
## 25 0.3086 pounds 0.6424
## 26 0.4630 pounds 0.8326
## 27 0.2425 pounds 0.4094
## 28 0.3968 pounds 0.7987
## 29 0.2756 pounds 0.4829
## 30 0.3638 pounds 0.6412
## 31 0.3417 pounds 0.6111
## 32 0.2094 pounds 0.4077
## 33 0.3307 pounds 0.6338
## 34 0.3638 pounds 1.2737
## 35 0.3197 pounds 0.6696
## 36 0.2756 pounds 0.6103
## 37 0.3638 pounds 0.7852
## 38 0.2756 pounds 0.6987
## 39 0.2646 pounds 0.6973
## 40 0.3858 pounds 1.1091
## 41 0.3858 pounds 2.2545
## 42 0.3417 pounds 1.3638
## 43 0.3197 pounds 0.9653
## 44 0.3417 pounds 1.0105
## 45 0.3748 pounds 0.9616
## 46 0.2646 pounds 0.6360
## 47 0.3197 pounds 1.0304
## 48 0.2866 pounds 0.6829
## 49 0.2094 pounds 0.6349
## 50 0.3197 pounds 0.8774
## 51 0.2866 pounds 0.6626
## 52 0.3417 pounds 0.8670
## 53 0.3417 pounds 0.9393
## 54 0.3968 pounds 2.2159
## 55 0.3638 pounds 1.1603
## 56 0.3307 pounds 1.2859
## 57 0.1543 pounds 0.4621
## 58 0.3417 pounds 1.2729
## 59 0.1543 pounds 0.5483
## 60 0.2756 pounds 1.0185
## 61 0.1543 pounds 0.5620
## 62 0.3527 pounds 1.8246
## 63 0.3748 pounds 1.7075
## 64 0.3197 pounds 1.4045
## 65 0.3197 pounds 2.0037
## 66 0.3197 pounds 2.2107
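A pipeline like the one above prints every matching row. If only the cheapest few items are of interest, one option is to end the pipe with slice_head(). The sketch below uses a hypothetical six-row data frame `prices` copied from the output in this document.

```r
library(dplyr)

# Rows copied from the output in this document, deliberately out of order.
prices <- data.frame(
  Item = c("Potatoes", "Watermelon", "Apples, frozen concentrate",
           "Pineapple", "Bananas", "Cantaloupe"),
  Form = c("Fresh", "Fresh", "Juice", "Fresh", "Fresh", "Fresh"),
  RetailPrice = c(0.6682, 0.3604, 0.5853, 0.5685, 0.5249, 0.5767)
)

# Keep fresh items, sort by price, then keep only the first three rows.
cheapest_fresh <- prices %>%
  filter(Form == "Fresh") %>%
  arrange(RetailPrice) %>%
  slice_head(n = 3)

cheapest_fresh
```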
Using mutate(), we can update a column or add a new one. Here I add a column, “EquivalentPounds”, defined as CupEquivalentPrice divided by RetailPrice. For items priced per pound, this ratio is the number of pounds that must be purchased to obtain one cup equivalent. To keep the units consistent, I first filter the rows whose CupEquivalentUnit is pounds, and then arrange the result by the new column in descending order. The new dataset is called “newDataset” for further use.
newDataset <- dataset %>%
  filter(CupEquivalentUnit == "pounds") %>%
  mutate(EquivalentPounds = CupEquivalentPrice / RetailPrice) %>%
  arrange(desc(EquivalentPounds))
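As a quick sanity check on what the new column measures, the sketch below compares it with CupEquivalentSize divided by Yield for three rows copied from the output above. The data frame `check` and the column `SizeOverYield` are hypothetical names, and the agreement is only verified for these three rows, not for the whole dataset.

```r
library(dplyr)

# Three fresh items copied from the output in this document.
check <- data.frame(
  Item = c("Acorn squash", "Apples", "Apricots"),
  RetailPrice = c(1.1804, 1.5193, 2.9665),
  Yield = c(0.4586, 0.9000, 0.9300),
  CupEquivalentSize = c(0.4519, 0.2425, 0.3638),
  CupEquivalentPrice = c(1.1633, 0.4094, 1.1603)
) %>%
  mutate(
    EquivalentPounds = CupEquivalentPrice / RetailPrice,  # same formula as above
    SizeOverYield = CupEquivalentSize / Yield             # comparison column
  )

check
```

For these rows the two columns agree to about three decimal places, which suggests that EquivalentPounds can be read as the purchased weight needed for one edible cup.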
We can use the summarize() function (also spelled summarise()) to collapse many observations into summary values. To illustrate, I am going to find the median, the minimum, and the maximum of the “EquivalentPounds” column created in the previous subsection.
newDataset %>%
  summarise(MedianEP = median(EquivalentPounds),
            MinEP = min(EquivalentPounds),
            MaxEP = max(EquivalentPounds))
## MedianEP MinEP MaxEP
## 1 0.399709 0.1231699 1.028841
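To summarize within groups, we can add group_by() before summarise(). To illustrate, I am going to compute the same summaries as before, but separately for each form of the fruits and vegetables. Juice does not appear in the result because its rows do not use pounds as the cup equivalent unit and were filtered out when newDataset was built: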
newDataset %>%
  group_by(Form) %>%
  summarise(MedianEP = median(EquivalentPounds),
            MinEP = min(EquivalentPounds),
            MaxEP = max(EquivalentPounds))
## # A tibble: 4 × 4
## Form MedianEP MinEP MaxEP
## <chr> <dbl> <dbl> <dbl>
## 1 Canned 0.560 0.298 0.678
## 2 Dried 0.156 0.123 0.187
## 3 Fresh 0.389 0.154 1.03
## 4 Frozen 0.344 0.296 0.483
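When reading grouped summaries, it helps to know how many rows each group contains. The sketch below adds n() to a grouped summarise() on a hypothetical four-row data frame `prices` copied from the output in this document; the counts are for this small sample only.

```r
library(dplyr)

# Three fresh rows and one canned row copied from the output in this document.
prices <- data.frame(
  Item = c("Acorn squash", "Apples", "Apricots", "Apples, applesauce"),
  Form = c("Fresh", "Fresh", "Fresh", "Canned"),
  RetailPrice = c(1.1804, 1.5193, 2.9665, 1.0660),
  CupEquivalentPrice = c(1.1633, 0.4094, 1.1603, 0.5758)
)

by_form <- prices %>%
  mutate(EquivalentPounds = CupEquivalentPrice / RetailPrice) %>%
  group_by(Form) %>%
  summarise(
    n = n(),                              # number of rows in each group
    MedianEP = median(EquivalentPounds)
  )

by_form
```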
ggplot2 is a tidyverse package for visualizing data with different kinds of plots, such as scatter plots, which are used to compare two numeric variables. To illustrate, I am going to plot the relationship between the retail price and the cup equivalent price of the fruits and vegetables, with one panel per form:
ggplot(dataset, aes(x = RetailPrice,
                    y = CupEquivalentPrice)) +
  geom_point() +
  facet_wrap(~ Form)
Based on the graphs, the cup equivalent price tends to increase with the retail price in each of the five forms. The plots alone do not quantify how strong this relationship is.
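One way to put a number on the relationship seen in the plots is to compute the Pearson correlation within each form using group_by() and summarise(). The sketch below runs on a hypothetical six-row sample, `fresh_sample`, copied from the fresh rows shown earlier, so its result says little about the full dataset; applying the same pipeline to `dataset` would give one correlation per form.

```r
library(dplyr)

# Six fresh rows copied from the output in this document (a sample, not the full data).
fresh_sample <- data.frame(
  Form = rep("Fresh", 6),
  RetailPrice = c(1.1804, 1.5193, 2.9665, 2.1913, 2.7576, 2.2368),
  CupEquivalentPrice = c(1.1633, 0.4094, 1.1603, 2.2545, 2.2159, 0.9653)
)

cor_by_form <- fresh_sample %>%
  group_by(Form) %>%
  summarise(
    n = n(),                                   # rows used for the estimate
    r = cor(RetailPrice, CupEquivalentPrice)   # Pearson correlation by default
  )

cor_by_form
```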
There are many packages within the tidyverse that we can use to manipulate, tidy, and visualize data, but for starters, dplyr and ggplot2 are enough to acquire and practice basic skills in R.