Challenge 3

Introduction

In this challenge we read in a dataset and describe its contents, then tidy it by pivoting it into a longer format. We use the functions taught in the lectures and tutorials.

Dataset

First we load the essential libraries required for the operations below.

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(knitr)

Now it is time to load the dataset.

eggs <- read_csv("eggs_tidy.csv", show_col_types = FALSE)

Reading the Data

We can view the top rows of the dataset.

head(eggs)

## # A tibble: 6 × 6
##   month     year large_half_dozen large_dozen extra_large_half_dozen
##   <chr>    <dbl>            <dbl>       <dbl>                  <dbl>
## 1 January   2004             126         230                    132 
## 2 February  2004             128.        226.                   134.
## 3 March     2004             131         225                    137 
## 4 April     2004             131         225                    137 
## 5 May       2004             131         225                    137 
## 6 June      2004             134.        231.                   137 
## # ℹ 1 more variable: extra_large_dozen <dbl>

This appears to be the prices of different sizes and quantities of eggs for each month over a 10-year span from 2004 to 2013.

The attributes are as follows:

month - the month of the year

year - the year, from 2004 to 2013

large_half_dozen - price of a half dozen large eggs

large_dozen - price of a dozen large eggs

extra_large_half_dozen - price of a half dozen extra large eggs

extra_large_dozen - price of a dozen extra large eggs

Apart from month, every column is numeric.

The dimensions of the data:

dim(eggs)

## [1] 120   6

There are 120 rows, one for each month over the 10-year span (12 months × 10 years), and 6 columns: month and year, plus 4 columns holding the prices of the different egg sizes and quantities.
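The row count can also be checked against the number of distinct months and years. The sketch below is only an illustration: so that it runs without the CSV file, it builds a hypothetical stand-in, `eggs_demo`, with the same column names and shape as `eggs`. Its prices are placeholders, not values from eggs_tidy.csv. With the real data, the same checks would be applied to `eggs`.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for `eggs`: same columns and shape, placeholder prices.
eggs_demo <- expand_grid(year = 2004:2013, month = month.name) %>%
  mutate(large_half_dozen = 100,
         large_dozen = 200,
         extra_large_half_dozen = 110,
         extra_large_dozen = 210) %>%
  select(month, year, everything())

n_months <- n_distinct(eggs_demo$month)  # number of distinct month names
n_years  <- n_distinct(eggs_demo$year)   # number of distinct years

# One row per month-year combination means rows = months x years.
nrow(eggs_demo) == n_months * n_years

# No month-year pair should appear more than once.
anyDuplicated(eggs_demo[c("month", "year")]) == 0
```

If either check returned FALSE on the real data, some months would be missing or duplicated, and the row count of 120 would be a coincidence rather than full coverage.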

We can have a look at all the attributes of the dataset.

colnames(eggs)

## [1] "month"                  "year"                   "large_half_dozen"      
## [4] "large_dozen"            "extra_large_half_dozen" "extra_large_dozen"

We can have a look at specific columns of the data using the select function.

select(eggs, month, year, extra_large_half_dozen)

## # A tibble: 120 × 3
##    month      year extra_large_half_dozen
##    <chr>     <dbl>                  <dbl>
##  1 January    2004                   132 
##  2 February   2004                   134.
##  3 March      2004                   137 
##  4 April      2004                   137 
##  5 May        2004                   137 
##  6 June       2004                   137 
##  7 July       2004                   137 
##  8 August     2004                   137 
##  9 September  2004                   136.
## 10 October    2004                   136.
## # ℹ 110 more rows

This shows the price of a half dozen extra large eggs for each month in the timeline.

Tidying the data

In the above dataset, each size and quantity of eggs has its own column holding its price. Tidying changes this: we will have a single column that specifies the egg size and quantity, and a price column that holds the associated price.

Each month of each year has 4 prices, one for each egg size and quantity, so the new dataframe will have 4 rows for each month. Because there are 120 months in total, we expect the new dataframe to have 480 rows (120 × 4) and 4 columns: month, year, the egg size and quantity, and the price.
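The expected dimensions can be worked out in a few lines of arithmetic. This is a minimal sketch; the variable names are illustrative, and the only inputs are the dimensions reported by dim(eggs).

```r
# Dimensions of the wide data, as reported by dim(eggs).
n_rows <- 120
n_cols <- 6

# month and year identify each row; the remaining columns hold prices.
id_cols <- 2
price_cols <- n_cols - id_cols

# Each wide row becomes one long row per price column.
expected_rows <- n_rows * price_cols

# The id columns are kept, and the price columns collapse into
# one name column and one value column.
expected_cols <- id_cols + 2

c(rows = expected_rows, cols = expected_cols)
```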

Because we are reducing the width of the data and making it longer, we use the pivot_longer function from the tidyr package.

new_eggs <- pivot_longer(eggs,
                         cols = -c(month, year),
                         names_to = "egg_quantity",
                         values_to = "price")

We can print the new tidy data.

new_eggs

## # A tibble: 480 × 4
##    month     year egg_quantity           price
##    <chr>    <dbl> <chr>                  <dbl>
##  1 January   2004 large_half_dozen        126 
##  2 January   2004 large_dozen             230 
##  3 January   2004 extra_large_half_dozen  132 
##  4 January   2004 extra_large_dozen       230 
##  5 February  2004 large_half_dozen        128.
##  6 February  2004 large_dozen             226.
##  7 February  2004 extra_large_half_dozen  134.
##  8 February  2004 extra_large_dozen       230 
##  9 March     2004 large_half_dozen        131 
## 10 March     2004 large_dozen             225 
## # ℹ 470 more rows

As we anticipated, each month has 4 rows associated with it, where each row holds one egg size and quantity and its price.

Here, pivot_longer was successful: the data went from 120 rows and 6 columns to 480 rows and 4 columns.
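Beyond reading the printed output, the pivot can be checked programmatically. The sketch below is one possible way to do this, not part of the original analysis. It uses a hypothetical stand-in, `eggs_demo`, with placeholder prices so that it runs without the CSV file; with the real data, `eggs` and `new_eggs` would take the place of `eggs_demo` and `long_demo`.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for `eggs`; prices are placeholders that vary by row.
eggs_demo <- expand_grid(year = 2004:2013, month = month.name) %>%
  mutate(large_half_dozen = 100 + row_number(),
         large_dozen = 200 + row_number(),
         extra_large_half_dozen = 110 + row_number(),
         extra_large_dozen = 210 + row_number()) %>%
  select(month, year, everything())

long_demo <- pivot_longer(eggs_demo,
                          cols = -c(month, year),
                          names_to = "egg_quantity",
                          values_to = "price")

# Check 1: four times as many rows, and exactly four columns.
dims_ok <- nrow(long_demo) == 4 * nrow(eggs_demo) && ncol(long_demo) == 4

# Check 2: no price was lost or turned into NA.
no_missing <- !anyNA(long_demo$price)

# Check 3: pivoting back to wide recovers the original table.
wide_again <- pivot_wider(long_demo,
                          names_from = egg_quantity,
                          values_from = price)
round_trip_ok <- isTRUE(all.equal(as.data.frame(wide_again),
                                  as.data.frame(eggs_demo)))

c(dims_ok = dims_ok, no_missing = no_missing, round_trip_ok = round_trip_ok)
```

The round trip through pivot_wider is the strongest of the three checks, since it confirms that no information changed while the shape did.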

This new data will be easier to use in visualizations and grouped summaries, because the egg size and quantity is now a single categorical column and all prices sit in one price column.
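As one illustration of what the long format makes easy, a summary per egg size and quantity needs only a single group_by. This is a sketch under assumptions: `long_demo` is a hypothetical stand-in for `new_eggs` with the same four columns, and its prices are placeholders rather than values from the dataset, so the resulting numbers carry no meaning.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for `new_eggs`: one row per month, year, and egg type.
long_demo <- expand_grid(year = 2004:2013,
                         month = month.name,
                         egg_quantity = c("large_half_dozen",
                                          "large_dozen",
                                          "extra_large_half_dozen",
                                          "extra_large_dozen")) %>%
  mutate(price = 100 + row_number())  # placeholder prices

# One summary row per egg type, computed from the single price column.
price_summary <- long_demo %>%
  group_by(egg_quantity) %>%
  summarise(mean_price = mean(price),
            min_price = min(price),
            max_price = max(price),
            .groups = "drop")

price_summary
```

With the wide format, the same summary would require naming each of the four price columns separately.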

Conclusion

We first gave a brief description of the data and then tidied it using pivot_longer, which we learnt in the lectures and tutorials. The pivot worked as expected and produced a longer dataframe with one price per row, which is easier to work with.