Loading the required packages with library() makes their functions available. The packages must already be installed.
library(readr)
library(here)
## here() starts at C:/Users/SHAURYA/Desktop/Studies/Winter 2024 601/Challenges/challenge 1
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
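Since library() fails when a package has not been installed, it can help to check first. The following is a minimal sketch, not part of the original analysis; the variable names `needed` and `not_installed` are hypothetical, and it only reports missing packages rather than installing them.

```r
# Packages this document attaches with library().
needed <- c("readr", "here", "dplyr")

# requireNamespace() returns FALSE (without attaching anything) when a
# package is not installed, so this keeps only the missing ones.
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
not_installed <- needed[!is_installed]

if (length(not_installed) > 0) {
  message("Install these first with install.packages(): ",
          paste(not_installed, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

Running install.packages() is a one-time step per machine, whereas library() is needed in every new R session.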
The dataset is the Birds CSV file. It contains 14 columns, including Domain Code, Domain, Area, Year, and Value. This document gives a high-level overview of the data in the file.
Let’s read the dataset to start the analysis.
bird_from_csv <- read_csv("birds.csv", show_col_types = FALSE)
Time to take a look at the data.
bird_from_csv
## # A tibble: 30,977 × 14
## `Domain Code` Domain `Area Code` Area `Element Code` Element `Item Code`
## <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
## 1 QA Live Anim… 2 Afgh… 5112 Stocks 1057
## 2 QA Live Anim… 2 Afgh… 5112 Stocks 1057
## 3 QA Live Anim… 2 Afgh… 5112 Stocks 1057
## 4 QA Live Anim… 2 Afgh… 5112 Stocks 1057
## 5 QA Live Anim… 2 Afgh… 5112 Stocks 1057
## 6 QA Live Anim… 2 Afgh… 5112 Stocks 1057
## 7 QA Live Anim… 2 Afgh… 5112 Stocks 1057
## 8 QA Live Anim… 2 Afgh… 5112 Stocks 1057
## 9 QA Live Anim… 2 Afgh… 5112 Stocks 1057
## 10 QA Live Anim… 2 Afgh… 5112 Stocks 1057
## # ℹ 30,967 more rows
## # ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
## # Value <dbl>, Flag <chr>, `Flag Description` <chr>
To display only the first few rows, we can use the head function, which returns six rows by default.
head(bird_from_csv)
## # A tibble: 6 × 14
## `Domain Code` Domain `Area Code` Area `Element Code` Element `Item Code`
## <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
## 1 QA Live Anima… 2 Afgh… 5112 Stocks 1057
## 2 QA Live Anima… 2 Afgh… 5112 Stocks 1057
## 3 QA Live Anima… 2 Afgh… 5112 Stocks 1057
## 4 QA Live Anima… 2 Afgh… 5112 Stocks 1057
## 5 QA Live Anima… 2 Afgh… 5112 Stocks 1057
## 6 QA Live Anima… 2 Afgh… 5112 Stocks 1057
## # ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
## # Value <dbl>, Flag <chr>, `Flag Description` <chr>
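The printed tibble hides seven of the columns because they do not fit the console width. One way to see every column at once is dplyr's glimpse function, which prints one column per line. The sketch below uses a small made-up tibble named `toy` as a stand-in for bird_from_csv; its values are invented for illustration only.

```r
library(dplyr)

# Toy stand-in for bird_from_csv; the values are made up.
toy <- tibble(
  `Domain Code` = c("QA", "QA"),
  Area = c("A", "B"),
  Item = c("Chickens", "Ducks"),
  Year = c(1961, 1962),
  Value = c(10, 20)
)

# One line per column: name, type, and the first few values.
glimpse(toy)

# Column names only, as a character vector.
names(toy)
```

With the real data, glimpse(bird_from_csv) should list all 14 columns, including those truncated in the output above.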
The spec function shows the column specification that read_csv used, including the data type it guessed for each column.
spec(bird_from_csv)
## cols(
## `Domain Code` = col_character(),
## Domain = col_character(),
## `Area Code` = col_double(),
## Area = col_character(),
## `Element Code` = col_double(),
## Element = col_character(),
## `Item Code` = col_double(),
## Item = col_character(),
## `Year Code` = col_double(),
## Year = col_double(),
## Unit = col_character(),
## Value = col_double(),
## Flag = col_character(),
## `Flag Description` = col_character()
## )
We see that columns are either double or character type.
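The types above were guessed by read_csv. If a different type is wanted, for example integers for codes and years, the col_types argument can state it explicitly. This is a minimal sketch under some assumptions: the inline CSV text is made up, the names `csv_text` and `typed` are hypothetical, and wrapping a string in I() to read it as literal data assumes readr 2.0 or later.

```r
library(readr)

# Made-up inline CSV; I() marks the string as literal data, not a file path.
csv_text <- I("Area Code,Area,Year,Value\n1,A,1961,10\n2,B,1962,20\n")

typed <- read_csv(
  csv_text,
  col_types = cols(
    `Area Code` = col_integer(),  # whole-number code
    Area = col_character(),
    Year = col_integer(),         # whole-number year
    Value = col_double()
  )
)

# The specification now reflects the declared types instead of guesses.
spec(typed)
```

Declaring types up front also makes read_csv warn when a value does not fit the declared type, which can surface data problems early.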
As seen in the tutorials, the summarize function collapses a data frame into a single row of summary statistics. Value is the main numeric measurement in this dataset, so we summarize it here. The na.rm argument drops missing values before each calculation.
summarize(bird_from_csv,
          mean_value = mean(Value, na.rm = TRUE),
          sd_value = sd(Value, na.rm = TRUE))
## # A tibble: 1 × 2
## mean_value sd_value
## <dbl> <dbl>
## 1 99411. 720611.
This can be extended to add other calculations too.
summarize(bird_from_csv,
          mean_value = mean(Value, na.rm = TRUE),
          sd_value = sd(Value, na.rm = TRUE),
          median_value = median(Value, na.rm = TRUE))
## # A tibble: 1 × 3
## mean_value sd_value median_value
## <dbl> <dbl> <dbl>
## 1 99411. 720611. 1800
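Because the na.rm argument silently drops missing values, it can be useful to report how many were dropped alongside the statistics. The sketch below shows one way to do that on a made-up tibble; `toy`, `toy_summary`, `n_rows`, and `n_missing` are hypothetical names, not part of the original analysis.

```r
library(dplyr)

# Made-up data with one missing Value.
toy <- tibble(Value = c(10, 20, NA, 40))

toy_summary <- summarize(
  toy,
  n_rows = n(),                        # total rows, including missing
  n_missing = sum(is.na(Value)),       # rows that na.rm = TRUE will drop
  mean_value = mean(Value, na.rm = TRUE),
  median_value = median(Value, na.rm = TRUE)
)

toy_summary
```

The same n() and sum(is.na(Value)) columns could be added to the summarize call on bird_from_csv.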
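We can use the mutate function to create a new column. As an illustration, we multiply Value by 100 and store the result in value_cents. Note that the Element column reads Stocks, so Value appears to be a count of birds in the units given by the Unit column rather than a price; the name value_cents is used here only to demonstrate the syntax.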
mutate(bird_from_csv, value_cents = Value * 100)
## # A tibble: 30,977 × 15
## `Domain Code` Domain `Area Code` Area `Element Code` Element `Item Code`
## <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
## 1 QA Live Anim… 2 Afgh… 5112 Stocks 1057
## 2 QA Live Anim… 2 Afgh… 5112 Stocks 1057
## 3 QA Live Anim… 2 Afgh… 5112 Stocks 1057
## 4 QA Live Anim… 2 Afgh… 5112 Stocks 1057
## 5 QA Live Anim… 2 Afgh… 5112 Stocks 1057
## 6 QA Live Anim… 2 Afgh… 5112 Stocks 1057
## 7 QA Live Anim… 2 Afgh… 5112 Stocks 1057
## 8 QA Live Anim… 2 Afgh… 5112 Stocks 1057
## 9 QA Live Anim… 2 Afgh… 5112 Stocks 1057
## 10 QA Live Anim… 2 Afgh… 5112 Stocks 1057
## # ℹ 30,967 more rows
## # ℹ 8 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
## # Value <dbl>, Flag <chr>, `Flag Description` <chr>, value_cents <dbl>
The call above does not modify bird_from_csv; the new column exists only in the printed result. To keep it, we assign the result to a name with the assignment operator.
bird_1 <- mutate(bird_from_csv, value_cents = Value * 100)
bird_1
## # A tibble: 30,977 × 15
## `Domain Code` Domain `Area Code` Area `Element Code` Element `Item Code`
## <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
## 1 QA Live Anim… 2 Afgh… 5112 Stocks 1057
## 2 QA Live Anim… 2 Afgh… 5112 Stocks 1057
## 3 QA Live Anim… 2 Afgh… 5112 Stocks 1057
## 4 QA Live Anim… 2 Afgh… 5112 Stocks 1057
## 5 QA Live Anim… 2 Afgh… 5112 Stocks 1057
## 6 QA Live Anim… 2 Afgh… 5112 Stocks 1057
## 7 QA Live Anim… 2 Afgh… 5112 Stocks 1057
## 8 QA Live Anim… 2 Afgh… 5112 Stocks 1057
## 9 QA Live Anim… 2 Afgh… 5112 Stocks 1057
## 10 QA Live Anim… 2 Afgh… 5112 Stocks 1057
## # ℹ 30,967 more rows
## # ℹ 8 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
## # Value <dbl>, Flag <chr>, `Flag Description` <chr>, value_cents <dbl>
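The difference between printing and assigning can be checked directly by counting columns. This is a minimal sketch on a made-up tibble; the names `toy`, `toy_1`, and `value_x100` are hypothetical.

```r
library(dplyr)

toy <- tibble(Value = c(1, 2, 3))

# Without assignment the result is printed and then discarded.
mutate(toy, value_x100 = Value * 100)
ncol(toy)    # toy itself is unchanged

# With assignment the result is kept under a new name.
toy_1 <- mutate(toy, value_x100 = Value * 100)
ncol(toy_1)  # one more column than toy
```

Assigning to a new name such as bird_1, rather than overwriting bird_from_csv, keeps the data exactly as it was read available for later steps.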
We can also compute statistics by group. Here we group the data by Item and find the median Value for each type of bird.
summarize(group_by(bird_from_csv, Item),
          median_price = median(Value, na.rm = TRUE))
## # A tibble: 5 × 2
## Item median_price
## <chr> <dbl>
## 1 Chickens 10784.
## 2 Ducks 510
## 3 Geese and guinea fowls 258
## 4 Pigeons, other birds 2800
## 5 Turkeys 528
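The same operation can be written with pipes. The %>% operator passes the result on its left as the first argument of the function on its right, so the steps read from top to bottom instead of from the inside out.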
bird_from_csv %>%
  group_by(Item) %>%
  summarize(median_price = median(Value, na.rm = TRUE))
## # A tibble: 5 × 2
## Item median_price
## <chr> <dbl>
## 1 Chickens 10784.
## 2 Ducks 510
## 3 Geese and guinea fowls 258
## 4 Pigeons, other birds 2800
## 5 Turkeys 528
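A piped chain is easy to extend with further steps. The sketch below adds a count of non-missing observations per group and sorts the groups by their median. It assumes R 4.1 or later for the native pipe `|>`, which behaves like %>% in simple chains such as this one; the data and the names `toy`, `toy_medians`, and `n_obs` are made up for illustration.

```r
library(dplyr)

# Made-up data: two items, one missing Value.
toy <- tibble(
  Item = c("Chickens", "Chickens", "Ducks", "Ducks", "Ducks"),
  Value = c(100, 300, 10, NA, 30)
)

toy_medians <- toy |>
  group_by(Item) |>
  summarize(
    median_value = median(Value, na.rm = TRUE),
    n_obs = sum(!is.na(Value))          # observations behind each median
  ) |>
  arrange(desc(median_value))           # largest median first

toy_medians
```

Reporting the number of observations next to each median makes it easier to judge how much data each group's figure rests on.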
We went through the operations taught in the lectures and tutorials, which gave a high-level overview of the data in this CSV file. Later challenges can explore the data in more depth with additional functions and plots.