Challenge 1

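Loading the libraries

The packages must be loaded with library() before their functions can be used (they need to be installed first, for example with install.packages()). readr reads the CSV file, here locates the project directory, and dplyr provides the data manipulation functions used below.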

library(readr)
library(here)

## here() starts at C:/Users/SHAURYA/Desktop/Studies/Winter 2024 601/Challenges/challenge 1

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Dataset

The dataset is the Birds CSV file (birds.csv). It has 30,977 rows and 14 columns, including Domain Code, Domain, Area, Year and Value. This document gives a high-level overview of the data in the file.

Let’s read the dataset to start the analysis. Setting show_col_types = FALSE suppresses the column specification message that read_csv prints by default.

bird_from_csv <- read_csv("birds.csv", show_col_types = FALSE)
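The here package is loaded above, but the read_csv call uses a plain relative path, which depends on the current working directory. A minimal sketch of building the path with here() instead, assuming birds.csv sits in the project root that here() reports (that location is an assumption, not something stated in this document):

```r
library(readr)
library(here)

# Build an absolute path from the project root; the file location is assumed.
birds_path <- here("birds.csv")

# Guard the read so the chunk still runs when the file is not present.
if (file.exists(birds_path)) {
  bird_from_csv <- read_csv(birds_path, show_col_types = FALSE)
}
```

With this form the chunk does not depend on where the R session was started, as long as the project root is detected correctly.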

Time to take a look at the data. Printing the tibble shows the first ten rows and only the columns that fit on screen.

bird_from_csv

## # A tibble: 30,977 × 14
##    `Domain Code` Domain     `Area Code` Area  `Element Code` Element `Item Code`
##    <chr>         <chr>            <dbl> <chr>          <dbl> <chr>         <dbl>
##  1 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
##  2 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
##  3 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
##  4 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
##  5 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
##  6 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
##  7 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
##  8 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
##  9 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
## 10 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
## # ℹ 30,967 more rows
## # ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
## #   Value <dbl>, Flag <chr>, `Flag Description` <chr>

To show only the first few rows, we can use the head function, which returns the first six rows by default.

head(bird_from_csv)

## # A tibble: 6 × 14
##   `Domain Code` Domain      `Area Code` Area  `Element Code` Element `Item Code`
##   <chr>         <chr>             <dbl> <chr>          <dbl> <chr>         <dbl>
## 1 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
## 2 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
## 3 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
## 4 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
## 5 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
## 6 QA            Live Anima…           2 Afgh…           5112 Stocks         1057
## # ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
## #   Value <dbl>, Flag <chr>, `Flag Description` <chr>
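head also takes an n argument, and dplyr's glimpse gives a transposed view with one line per column, which helps when a wide tibble is truncated as above. A small sketch on a made-up tibble (toy_birds and its values are illustrative, not taken from birds.csv):

```r
library(dplyr)

# Hypothetical stand-in for bird_from_csv, with invented values.
toy_birds <- tibble(
  Area  = c("A", "A", "B", "B"),
  Item  = c("Chickens", "Ducks", "Chickens", "Ducks"),
  Year  = c(2000, 2000, 2001, 2001),
  Value = c(100, 5, 300, 15)
)

# Keep only the first two rows.
first_two <- head(toy_birds, n = 2)

# One line per column: name, type, and the first few values.
glimpse(toy_birds)
```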

Column Data Types

The spec function from readr shows the column specification, that is, the data type readr chose for each column when it read the file.

spec(bird_from_csv)

## cols(
##   `Domain Code` = col_character(),
##   Domain = col_character(),
##   `Area Code` = col_double(),
##   Area = col_character(),
##   `Element Code` = col_double(),
##   Element = col_character(),
##   `Item Code` = col_double(),
##   Item = col_character(),
##   `Year Code` = col_double(),
##   Year = col_double(),
##   Unit = col_character(),
##   Value = col_double(),
##   Flag = col_character(),
##   `Flag Description` = col_character()
## )

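We see that every column is parsed as either double or character. The code columns (Area Code, Element Code, Item Code and Year Code) are stored as doubles even though they act as identifiers.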
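If the guessed types are not the ones wanted, read_csv accepts a col_types argument. A sketch that reads a small inline CSV (the text is invented for illustration, not taken from birds.csv) and forces Year to integer:

```r
library(readr)

# Invented CSV text; I() tells readr to treat the string as literal data.
csv_text <- I("Area,Year,Value\nA,2000,100\nB,2001,\n")

toy <- read_csv(
  csv_text,
  col_types = cols(
    Area  = col_character(),
    Year  = col_integer(),   # override the default guess of double
    Value = col_double()     # the empty field in row 2 becomes NA
  )
)

spec(toy)
```

Supplying col_types also silences the column specification message, since nothing is left for readr to guess.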

Mean and Standard Deviation using summarize

As seen in the tutorials, the summarize function collapses a data frame into a single row of summary statistics. Value is the main measurement in the dataset, so we compute its mean and standard deviation, using na.rm to ignore missing values.

summarize(bird_from_csv,
          mean_value = mean(Value, na.rm = TRUE),
          sd_value = sd(Value, na.rm = TRUE))

## # A tibble: 1 × 2
##   mean_value sd_value
##        <dbl>    <dbl>
## 1     99411.  720611.

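Further statistics, such as the median, can be added in the same call. The resulting median (1800) is far below the mean (about 99,411), which indicates that Value is strongly right-skewed.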

summarize(bird_from_csv,
          mean_value = mean(Value, na.rm = TRUE),
          sd_value = sd(Value, na.rm = TRUE),
          median_value = median(Value, na.rm = TRUE))

## # A tibble: 1 × 3
##   mean_value sd_value median_value
##        <dbl>    <dbl>        <dbl>
## 1     99411.  720611.         1800
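The na.rm argument matters because mean, sd and median return NA as soon as any input is missing. A sketch on made-up numbers that also counts the missing values, so it is clear how many observations the statistics ignore (toy_values is illustrative only):

```r
library(dplyr)

# Invented values with one missing entry.
toy_values <- tibble(Value = c(10, NA, 30, 1000))

res <- summarize(
  toy_values,
  mean_default = mean(Value),                  # NA, because one value is missing
  mean_value   = mean(Value, na.rm = TRUE),    # mean of 10, 30 and 1000
  median_value = median(Value, na.rm = TRUE),  # 30
  n_missing    = sum(is.na(Value))             # 1
)
res
```

Reporting n_missing next to the statistics is a design choice rather than a requirement; it simply makes the effect of na.rm visible.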

Mutate function

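The mutate function adds a new column computed from existing ones. Here we create value_cents by multiplying Value by 100. Note that the rows shown have Element equal to Stocks, which suggests that Value is a stock count rather than a price, so the conversion to cents serves only to illustrate the syntax.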

mutate(bird_from_csv, value_cents = Value * 100)

## # A tibble: 30,977 × 15
##    `Domain Code` Domain     `Area Code` Area  `Element Code` Element `Item Code`
##    <chr>         <chr>            <dbl> <chr>          <dbl> <chr>         <dbl>
##  1 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
##  2 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
##  3 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
##  4 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
##  5 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
##  6 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
##  7 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
##  8 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
##  9 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
## 10 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
## # ℹ 30,967 more rows
## # ℹ 8 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
## #   Value <dbl>, Flag <chr>, `Flag Description` <chr>, value_cents <dbl>

The call above does not store the new column: mutate returns a new tibble and leaves bird_from_csv unchanged. To keep the column, we need to assign the result to a variable.

bird_1 <- mutate(bird_from_csv, value_cents = Value * 100)
bird_1

## # A tibble: 30,977 × 15
##    `Domain Code` Domain     `Area Code` Area  `Element Code` Element `Item Code`
##    <chr>         <chr>            <dbl> <chr>          <dbl> <chr>         <dbl>
##  1 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
##  2 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
##  3 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
##  4 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
##  5 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
##  6 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
##  7 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
##  8 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
##  9 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
## 10 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
## # ℹ 30,967 more rows
## # ℹ 8 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
## #   Value <dbl>, Flag <chr>, `Flag Description` <chr>, value_cents <dbl>
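The difference between calling mutate and assigning its result can be checked directly. A minimal sketch on an invented tibble (toy, toy_2 and value_x100 are hypothetical names):

```r
library(dplyr)

toy <- tibble(Value = c(1, 2, 3))

# The result is returned but not stored, so toy itself is unchanged.
mutate(toy, value_x100 = Value * 100)
before <- "value_x100" %in% names(toy)    # FALSE

# Assigning the result keeps the new column.
toy_2 <- mutate(toy, value_x100 = Value * 100)
after <- "value_x100" %in% names(toy_2)   # TRUE
```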

Group By Function

Calculations can also be performed per category. Here we group the rows by Item and compute the median of Value for each type of bird.

summarize(group_by(bird_from_csv, Item),
          median_price = median(Value, na.rm = TRUE))

## # A tibble: 5 × 2
##   Item                   median_price
##   <chr>                         <dbl>
## 1 Chickens                     10784.
## 2 Ducks                          510 
## 3 Geese and guinea fowls         258 
## 4 Pigeons, other birds          2800 
## 5 Turkeys                        528

The same operation can be written with the pipe operator (%>%), which passes the result of each step to the next. The chunk below reads top to bottom and avoids nested function calls.

bird_from_csv %>%
  group_by(Item) %>%
  summarize(median_price = median(Value, na.rm = TRUE))

## # A tibble: 5 × 2
##   Item                   median_price
##   <chr>                         <dbl>
## 1 Chickens                     10784.
## 2 Ducks                          510 
## 3 Geese and guinea fowls         258 
## 4 Pigeons, other birds          2800 
## 5 Turkeys                        528
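The piped form extends naturally to several statistics per group and to sorting the result. A sketch on invented data (toy_birds and its values are made up; n_rows and median_value are illustrative column names):

```r
library(dplyr)

# Invented values, including one missing entry for Ducks.
toy_birds <- tibble(
  Item  = c("Chickens", "Chickens", "Ducks", "Ducks", "Ducks"),
  Value = c(100, 300, 5, NA, 15)
)

by_item <- toy_birds %>%
  group_by(Item) %>%
  summarize(
    n_rows       = n(),                          # rows per group, NA included
    median_value = median(Value, na.rm = TRUE)   # median of non-missing values
  ) %>%
  arrange(desc(median_value))                    # largest median first
by_item
```

Including the row count beside each median shows how much data each group's statistic rests on.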

Conclusion

This document walked through the operations taught in the lectures and tutorials: reading a CSV file, inspecting its columns, and using summarize, mutate and group_by. It gives only a high-level overview of the data in this CSV file; later challenges can explore it in more depth with other functions and plots.