In this challenge we read in a dataset and describe it, then tidy it by pivoting it into a longer format. We will be using the functions taught in the lectures and tutorials.
First we load the essential libraries required for the operations below.
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(knitr)
Now it is time to load the dataset.
eggs <- read_csv("eggs_tidy.csv", show_col_types = FALSE)
We can view the top rows of the dataset.
head(eggs)
## # A tibble: 6 × 6
## month year large_half_dozen large_dozen extra_large_half_dozen
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 January 2004 126 230 132
## 2 February 2004 128. 226. 134.
## 3 March 2004 131 225 137
## 4 April 2004 131 225 137
## 5 May 2004 131 225 137
## 6 June 2004 134. 231. 137
## # ℹ 1 more variable: extra_large_dozen <dbl>
This appears to be the prices of different quantities of eggs for each month over the 10-year span from 2004 to 2013.
The attributes are as follows:
month - the month of the year, stored as text
year - the year, from 2004 to 2013
large_half_dozen - price of a half dozen large eggs
large_dozen - price of a dozen large eggs
extra_large_half_dozen - price of a half dozen extra large eggs
extra_large_dozen - price of a dozen extra large eggs
Apart from month, every column is numeric.
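A quick way to confirm the column types is to test each column with is.numeric. The sketch below uses a one-row toy tibble with made-up prices (an assumption for illustration, not the real data); with the dataset loaded, the same call could be applied to eggs instead.

```r
library(tibble)

# Toy stand-in for eggs: made-up prices, same column names.
toy <- tibble(
  month = "January",
  year = 2004,
  large_half_dozen = 100,
  large_dozen = 200,
  extra_large_half_dozen = 110,
  extra_large_dozen = 210
)

# One logical value per column: TRUE if the column is numeric.
numeric_cols <- vapply(toy, is.numeric, logical(1))
numeric_cols
```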
The dimensions of the data are:
dim(eggs)
## [1] 120 6
As seen above, the 120 rows correspond to the 12 months of each of the 10 years. There are 6 columns: month and year, plus 4 columns holding the prices of the different egg sizes and quantities.
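The row count can be sanity-checked by building every month and year combination explicitly. This is a minimal sketch; month_year_grid is a name introduced here, and it assumes the months are spelled as in base R's built-in month.name constant.

```r
library(tidyr)

# Every combination of the 12 month names and the years 2004 to 2013.
month_year_grid <- crossing(month = month.name, year = 2004:2013)

# 12 months * 10 years
nrow(month_year_grid)
```

With eggs loaded, dplyr's anti_join(month_year_grid, eggs, by = c("month", "year")) would list any combination missing from the data.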
We can list all the column names of the dataset.
colnames(eggs)
## [1] "month" "year" "large_half_dozen"
## [4] "large_dozen" "extra_large_half_dozen" "extra_large_dozen"
We can have a look at specific columns of the data using the select function.
select(eggs, "month", "year", "extra_large_half_dozen")
## # A tibble: 120 × 3
## month year extra_large_half_dozen
## <chr> <dbl> <dbl>
## 1 January 2004 132
## 2 February 2004 134.
## 3 March 2004 137
## 4 April 2004 137
## 5 May 2004 137
## 6 June 2004 137
## 7 July 2004 137
## 8 August 2004 137
## 9 September 2004 136.
## 10 October 2004 136.
## # ℹ 110 more rows
This shows the price of that particular size and quantity across the timeline (only the first 10 of the 120 rows are printed).
In the above dataset, each egg size and quantity has its own column holding its price. Tidying changes this: we want a single column that specifies the size and quantity, and a price column that holds the associated price.
Each month of each year has 4 prices, one for each size and quantity, so the new dataframe will have 4 rows per month. Because there are 120 months in total, we expect the new dataframe to have 480 rows and 4 columns (month, year, the quantity name and the price).
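The expected dimensions can be worked out from the output of dim(eggs). The variable names below are introduced here only for illustration.

```r
# Dimensions of the wide data, as reported by dim(eggs).
n_rows <- 120
n_cols <- 6

# month and year stay as identifier columns; the rest are pivoted.
n_id_cols <- 2
n_price_cols <- n_cols - n_id_cols

# One row per month-year and price column.
expected_rows <- n_rows * n_price_cols
# Identifier columns plus one name column and one value column.
expected_cols <- n_id_cols + 2

c(expected_rows, expected_cols)
```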
Because we are reducing the width of the data and making it longer, we use the pivot_longer function from the tidyr package.
new_eggs <- pivot_longer(eggs,
                         cols = -c(month, year),
                         names_to = "egg_quantity",
                         values_to = "price")
We can print the new tidy data.
new_eggs
## # A tibble: 480 × 4
## month year egg_quantity price
## <chr> <dbl> <chr> <dbl>
## 1 January 2004 large_half_dozen 126
## 2 January 2004 large_dozen 230
## 3 January 2004 extra_large_half_dozen 132
## 4 January 2004 extra_large_dozen 230
## 5 February 2004 large_half_dozen 128.
## 6 February 2004 large_dozen 226.
## 7 February 2004 extra_large_half_dozen 134.
## 8 February 2004 extra_large_dozen 230
## 9 March 2004 large_half_dozen 131
## 10 March 2004 large_dozen 225
## # ℹ 470 more rows
As anticipated, each month now has 4 rows, one for each egg size and quantity together with its price.
The pivot was successful: the data went from wide (120 rows, 6 columns) to long (480 rows, 4 columns).
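Rather than checking the result by eye, the pivot can be verified in code. The sketch below is one possible check, not part of the original analysis: it uses a two-row toy tibble with made-up prices (only the column names match eggs), asserts the expected dimensions, and pivots back with pivot_wider to confirm that nothing was lost.

```r
library(tibble)
library(tidyr)

# Toy stand-in for eggs: made-up prices, same column layout.
toy <- tibble(
  month = c("January", "February"),
  year = c(2004, 2004),
  large_half_dozen = c(100, 101),
  large_dozen = c(200, 202),
  extra_large_half_dozen = c(110, 111),
  extra_large_dozen = c(210, 212)
)

toy_long <- pivot_longer(
  toy,
  cols = -c(month, year),
  names_to = "egg_quantity",
  values_to = "price"
)

# One row per original row and price column; ids + name + value columns.
stopifnot(nrow(toy_long) == nrow(toy) * (ncol(toy) - 2))
stopifnot(ncol(toy_long) == 4)

# Pivoting back to wide should recover the original table.
toy_wide <- pivot_wider(toy_long,
                        names_from = egg_quantity,
                        values_from = price)
stopifnot(isTRUE(all.equal(as.data.frame(toy_wide), as.data.frame(toy))))
```

The same assertions could be run with eggs and new_eggs in place of toy and toy_long.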
This new data will be easier to use in visualizations and statistical operations, because the egg size and quantity is now a single categorical column that can be used as a grouping variable.
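As an illustration of what the long format makes easy, a grouped summary needs only one group_by call on the categorical column. This is a sketch on made-up toy prices (not the real data); price_summary, mean_price and max_price are names chosen here.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for eggs: made-up prices, same column layout.
toy <- tibble(
  month = c("January", "February"),
  year = c(2004, 2004),
  large_half_dozen = c(100, 101),
  large_dozen = c(200, 202),
  extra_large_half_dozen = c(110, 111),
  extra_large_dozen = c(210, 212)
)

toy_long <- pivot_longer(
  toy,
  cols = -c(month, year),
  names_to = "egg_quantity",
  values_to = "price"
)

# One summary row per egg size and quantity.
price_summary <- toy_long %>%
  group_by(egg_quantity) %>%
  summarise(mean_price = mean(price), max_price = max(price))

price_summary
```

With the wide format, the same summary would require naming each of the four price columns separately.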
We first gave a brief description of the data and then tidied it using pivot_longer, which we learnt in the lectures and tutorials. The pivot worked as expected and produced a 480-row, 4-column dataframe that is easier to work with.