Introduction

This challenge will focus on joining multiple datasets and analyzes aspects of the newly formed joint data.

Dataset

We are using two datasets. Both are FAO stats with one dataset related to livestock and the other about chicken and eggs.

First, the libraries are loaded.

library(readr)
library(here)
## here() starts at C:/Users/SHAURYA/Desktop/Studies/Winter 2024 601/Challenges/challenge 8
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

We then load the datasets.

livestock <- read_csv("FAOSTAT_livestock.csv", show_col_types = FALSE)
egg <- read_csv("FAOSTAT_egg_chicken.csv", show_col_types = FALSE)

Reading the Data

We can have a look at the data.

head(livestock)
## # A tibble: 6 × 14
##   `Domain Code` Domain      `Area Code` Area  `Element Code` Element `Item Code`
##   <chr>         <chr>             <dbl> <chr>          <dbl> <chr>         <dbl>
## 1 QA            Live Anima…           2 Afgh…           5111 Stocks         1107
## 2 QA            Live Anima…           2 Afgh…           5111 Stocks         1107
## 3 QA            Live Anima…           2 Afgh…           5111 Stocks         1107
## 4 QA            Live Anima…           2 Afgh…           5111 Stocks         1107
## 5 QA            Live Anima…           2 Afgh…           5111 Stocks         1107
## 6 QA            Live Anima…           2 Afgh…           5111 Stocks         1107
## # ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
## #   Value <dbl>, Flag <chr>, `Flag Description` <chr>

We see a variety of columns such as Domain, Area, Year, Item, and Value. It has a mix of both numerical and categorical data which is useful for further analysis.

An explanation of data in each column can be given-

Domain Code - Represents the national website domain of a country. In this case, it is Qatar.

Domain - Describes the relvant domain of the livestock. Here, all the animals are in the Live Animals domain.

Area Code - Represents the code assigned to each country.

Area - Represents the country.

Element Code - Numerical value assigned to the element.

Element - Shows the element for each animal. Here, it is common for all being stocks.

Item Code - Code assigned to every item.

Item - Represents the category of the livestock such as Asses, Buffaloes, Pigs, etc.

Year Code and Year - Redundant columns representing the year.

Unit - Represents the quantity. Here it is common being Head.

Value - Assigns the numerical value of the livestock. It varies across the data.

Flag and Flag Description - It is a symbol and category combination. Such as blank for official data, F for FAO estimate, * for Unofficial Figure, etc.

head(egg)
## # A tibble: 6 × 14
##   `Domain Code` Domain      `Area Code` Area  `Element Code` Element `Item Code`
##   <chr>         <chr>             <dbl> <chr>          <dbl> <chr>         <dbl>
## 1 QL            Livestock …           2 Afgh…           5313 Laying         1062
## 2 QL            Livestock …           2 Afgh…           5410 Yield          1062
## 3 QL            Livestock …           2 Afgh…           5510 Produc…        1062
## 4 QL            Livestock …           2 Afgh…           5313 Laying         1062
## 5 QL            Livestock …           2 Afgh…           5410 Yield          1062
## 6 QL            Livestock …           2 Afgh…           5510 Produc…        1062
## # ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
## #   Value <dbl>, Flag <chr>, `Flag Description` <chr>

While most attributes are same as the livestock data, the description slightly differs.

  1. Domain - Describes the relvant domain of the chicken. Here, it is common for all rows.

  2. Element - Shows the element for each chicken and egg.

  3. Item - Represents the category of the chicken and eggs. It is common for all.

  4. Unit - Represents the quantity. It varies among the rows.

  5. Value - Assigns the numerical value of the product. It varies across the data.

The dimensions of the data can be obtained by-

dim(livestock)
## [1] 82116    14
dim(egg)
## [1] 38170    14

We see that the livestock dataset is twice was big as the egg and chicken data.

We can look at the different columns -

colnames(livestock)
##  [1] "Domain Code"      "Domain"           "Area Code"        "Area"            
##  [5] "Element Code"     "Element"          "Item Code"        "Item"            
##  [9] "Year Code"        "Year"             "Unit"             "Value"           
## [13] "Flag"             "Flag Description"
colnames(egg)
##  [1] "Domain Code"      "Domain"           "Area Code"        "Area"            
##  [5] "Element Code"     "Element"          "Item Code"        "Item"            
##  [9] "Year Code"        "Year"             "Unit"             "Value"           
## [13] "Flag"             "Flag Description"

We can have a look at specific columns of the data using the select function.

select(livestock, "Area")
## # A tibble: 82,116 × 1
##    Area       
##    <chr>      
##  1 Afghanistan
##  2 Afghanistan
##  3 Afghanistan
##  4 Afghanistan
##  5 Afghanistan
##  6 Afghanistan
##  7 Afghanistan
##  8 Afghanistan
##  9 Afghanistan
## 10 Afghanistan
## # ℹ 82,106 more rows
select(egg, "Flag")
## # A tibble: 38,170 × 1
##    Flag 
##    <chr>
##  1 F    
##  2 Fc   
##  3 F    
##  4 F    
##  5 Fc   
##  6 F    
##  7 F    
##  8 Fc   
##  9 F    
## 10 F    
## # ℹ 38,160 more rows

Multiple columns can also be selected-

select(livestock, "Area", "Item", "Value")
## # A tibble: 82,116 × 3
##    Area        Item    Value
##    <chr>       <chr>   <dbl>
##  1 Afghanistan Asses 1300000
##  2 Afghanistan Asses  851850
##  3 Afghanistan Asses 1001112
##  4 Afghanistan Asses 1150000
##  5 Afghanistan Asses 1300000
##  6 Afghanistan Asses 1200000
##  7 Afghanistan Asses 1200000
##  8 Afghanistan Asses 1328000
##  9 Afghanistan Asses 1250000
## 10 Afghanistan Asses 1300000
## # ℹ 82,106 more rows
select(egg, "Element", "Item", "Year")
## # A tibble: 38,170 × 3
##    Element    Item                 Year
##    <chr>      <chr>               <dbl>
##  1 Laying     Eggs, hen, in shell  1961
##  2 Yield      Eggs, hen, in shell  1961
##  3 Production Eggs, hen, in shell  1961
##  4 Laying     Eggs, hen, in shell  1962
##  5 Yield      Eggs, hen, in shell  1962
##  6 Production Eggs, hen, in shell  1962
##  7 Laying     Eggs, hen, in shell  1963
##  8 Yield      Eggs, hen, in shell  1963
##  9 Production Eggs, hen, in shell  1963
## 10 Laying     Eggs, hen, in shell  1964
## # ℹ 38,160 more rows

We see that elements can be either of Laying, Yield, or Production. This likely shows the purpose of raising chickens and eggs.

Tables can be used for checking the distribution of values within a column. For this, we can first retrieve the particular column.

Livestock_item <- select(livestock, Item)
head(Livestock_item)
## # A tibble: 6 × 1
##   Item 
##   <chr>
## 1 Asses
## 2 Asses
## 3 Asses
## 4 Asses
## 5 Asses
## 6 Asses
table(Livestock_item)
## Item
##     Asses Buffaloes    Camels    Cattle     Goats    Horses     Mules      Pigs 
##      8571      3505      3265     13086     12498     11104      6153     12015 
##     Sheep 
##     11919

From above we get the distribution of the livestock type from the data. We see that cattle appears the highest and Camels the lowest.

We can also get the proportion of each value with the help of proportion tables.

prop.table(table(Livestock_item))
## Item
##      Asses  Buffaloes     Camels     Cattle      Goats     Horses      Mules 
## 0.10437674 0.04268352 0.03976083 0.15935993 0.15219933 0.13522334 0.07493059 
##       Pigs      Sheep 
## 0.14631740 0.14514833

Transforming Data

If we want to look at specific portions of data then we can use the filter function. For instance, if we want to see the data from Cuba then we can do the following-

filter(egg, `Area` == "Cuba")
## # A tibble: 174 × 14
##    `Domain Code` Domain     `Area Code` Area  `Element Code` Element `Item Code`
##    <chr>         <chr>            <dbl> <chr>          <dbl> <chr>         <dbl>
##  1 QL            Livestock…          49 Cuba            5313 Laying         1062
##  2 QL            Livestock…          49 Cuba            5410 Yield          1062
##  3 QL            Livestock…          49 Cuba            5510 Produc…        1062
##  4 QL            Livestock…          49 Cuba            5313 Laying         1062
##  5 QL            Livestock…          49 Cuba            5410 Yield          1062
##  6 QL            Livestock…          49 Cuba            5510 Produc…        1062
##  7 QL            Livestock…          49 Cuba            5313 Laying         1062
##  8 QL            Livestock…          49 Cuba            5410 Yield          1062
##  9 QL            Livestock…          49 Cuba            5510 Produc…        1062
## 10 QL            Livestock…          49 Cuba            5313 Laying         1062
## # ℹ 164 more rows
## # ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
## #   Value <dbl>, Flag <chr>, `Flag Description` <chr>

If we want data from 2000 and recent then we can use the filter function for that as well.

filter(egg, `Year` >= 2000)
## # A tibble: 13,406 × 14
##    `Domain Code` Domain     `Area Code` Area  `Element Code` Element `Item Code`
##    <chr>         <chr>            <dbl> <chr>          <dbl> <chr>         <dbl>
##  1 QL            Livestock…           2 Afgh…           5313 Laying         1062
##  2 QL            Livestock…           2 Afgh…           5410 Yield          1062
##  3 QL            Livestock…           2 Afgh…           5510 Produc…        1062
##  4 QL            Livestock…           2 Afgh…           5313 Laying         1062
##  5 QL            Livestock…           2 Afgh…           5410 Yield          1062
##  6 QL            Livestock…           2 Afgh…           5510 Produc…        1062
##  7 QL            Livestock…           2 Afgh…           5313 Laying         1062
##  8 QL            Livestock…           2 Afgh…           5410 Yield          1062
##  9 QL            Livestock…           2 Afgh…           5510 Produc…        1062
## 10 QL            Livestock…           2 Afgh…           5313 Laying         1062
## # ℹ 13,396 more rows
## # ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
## #   Value <dbl>, Flag <chr>, `Flag Description` <chr>

We can use the group_by function with summarise to get the total value from every area in the data. It will require two steps. First step will be to group the data by the area. Second step involves calculating the aggregate value for each area.

area_data <- egg %>% group_by(Area)
area_data
## # A tibble: 38,170 × 14
## # Groups:   Area [245]
##    `Domain Code` Domain     `Area Code` Area  `Element Code` Element `Item Code`
##    <chr>         <chr>            <dbl> <chr>          <dbl> <chr>         <dbl>
##  1 QL            Livestock…           2 Afgh…           5313 Laying         1062
##  2 QL            Livestock…           2 Afgh…           5410 Yield          1062
##  3 QL            Livestock…           2 Afgh…           5510 Produc…        1062
##  4 QL            Livestock…           2 Afgh…           5313 Laying         1062
##  5 QL            Livestock…           2 Afgh…           5410 Yield          1062
##  6 QL            Livestock…           2 Afgh…           5510 Produc…        1062
##  7 QL            Livestock…           2 Afgh…           5313 Laying         1062
##  8 QL            Livestock…           2 Afgh…           5410 Yield          1062
##  9 QL            Livestock…           2 Afgh…           5510 Produc…        1062
## 10 QL            Livestock…           2 Afgh…           5313 Laying         1062
## # ℹ 38,160 more rows
## # ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
## #   Value <dbl>, Flag <chr>, `Flag Description` <chr>
agg_val <- area_data %>% summarize(total_value = sum(Value, na.rm = TRUE))
agg_val
## # A tibble: 245 × 2
##    Area                total_value
##    <chr>                     <dbl>
##  1 Afghanistan             2645330
##  2 Africa                112454830
##  3 Albania                 5363511
##  4 Algeria                11134303
##  5 American Samoa          2300997
##  6 Americas              569598764
##  7 Angola                  2488490
##  8 Antigua and Barbuda     2880269
##  9 Argentina              27251130
## 10 Armenia                 3546073
## # ℹ 235 more rows

From above, we see the total sum of values for each area from the data. We can arrange this data in increasing order to see the areas with the highest values.

desc_agg_val <- agg_val %>% arrange(desc(total_value))
desc_agg_val
## # A tibble: 245 × 2
##    Area                     total_value
##    <chr>                          <dbl>
##  1 World                     2532519493
##  2 Asia                      1227360081
##  3 Eastern Asia               876570254
##  4 China, mainland            700621017
##  5 Europe                     623448480
##  6 Americas                   569598764
##  7 Northern America           313640726
##  8 United States of America   291302872
##  9 Eastern Europe             284453965
## 10 Western Europe             172362145
## # ℹ 235 more rows

This gives the obvious result that World has highest value followed by the Continents and some highest populated countries like China and India.

Same can be done to get the areas with least value.

inc_agg_data <- agg_val %>% arrange(total_value)
inc_agg_data
## # A tibble: 245 × 2
##    Area                             total_value
##    <chr>                                  <dbl>
##  1 South Sudan                                0
##  2 United States Virgin Islands           87585
##  3 Micronesia (Federated States of)      640138
##  4 Pacific Islands Trust Territory       695616
##  5 Barbados                              734511
##  6 Sudan                                 853586
##  7 Saint Pierre and Miquelon             953768
##  8 Liechtenstein                        1103525
##  9 Montenegro                           1109816
## 10 Eritrea                              1123776
## # ℹ 235 more rows

Tidying the Data

We can check for missing values.

colSums(is.na(livestock))
##      Domain Code           Domain        Area Code             Area 
##                0                0                0                0 
##     Element Code          Element        Item Code             Item 
##                0                0                0                0 
##        Year Code             Year             Unit            Value 
##                0                0                0             1301 
##             Flag Flag Description 
##            38270                0

From the data, we see that the empty cell in Flag corresponds to “official data” in Flag Description so it cannot be treated as missing value.

We can fix the missing data in Value by replacing with the mean.

livestock$Value[is.na(livestock$Value)] <- mean(livestock$Value, na.rm = TRUE)

Now we check in the egg dataset.

colSums(is.na(egg))
##      Domain Code           Domain        Area Code             Area 
##                0                0                0                0 
##     Element Code          Element        Item Code             Item 
##                0                0                0                0 
##        Year Code             Year             Unit            Value 
##                0                0                0               40 
##             Flag Flag Description 
##             7548                0

Like before, Flag column is an exception but we can fix the missing data in Value.

egg$Value[is.na(egg$Value)] <- mean(egg$Value, na.rm = TRUE)

We now check if it fixed the problem.

colSums(is.na(livestock))
##      Domain Code           Domain        Area Code             Area 
##                0                0                0                0 
##     Element Code          Element        Item Code             Item 
##                0                0                0                0 
##        Year Code             Year             Unit            Value 
##                0                0                0                0 
##             Flag Flag Description 
##            38270                0
colSums(is.na(egg))
##      Domain Code           Domain        Area Code             Area 
##                0                0                0                0 
##     Element Code          Element        Item Code             Item 
##                0                0                0                0 
##        Year Code             Year             Unit            Value 
##                0                0                0                0 
##             Flag Flag Description 
##             7548                0

We see that it fixed all the missing values.

Mutating the Data

While the values present clear numerical data, it can be hard to compare such large numbers. We can scale them accordingly that can be helped in comparison analysis across various years and quantity.

livestock <- mutate(livestock, Value_Comparison = scale(Value))
head(livestock)
## # A tibble: 6 × 15
##   `Domain Code` Domain      `Area Code` Area  `Element Code` Element `Item Code`
##   <chr>         <chr>             <dbl> <chr>          <dbl> <chr>         <dbl>
## 1 QA            Live Anima…           2 Afgh…           5111 Stocks         1107
## 2 QA            Live Anima…           2 Afgh…           5111 Stocks         1107
## 3 QA            Live Anima…           2 Afgh…           5111 Stocks         1107
## 4 QA            Live Anima…           2 Afgh…           5111 Stocks         1107
## 5 QA            Live Anima…           2 Afgh…           5111 Stocks         1107
## 6 QA            Live Anima…           2 Afgh…           5111 Stocks         1107
## # ℹ 8 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
## #   Value <dbl>, Flag <chr>, `Flag Description` <chr>,
## #   Value_Comparison <dbl[,1]>
egg <- mutate(egg, Value_Comparison = scale(Value))
head(egg)
## # A tibble: 6 × 15
##   `Domain Code` Domain      `Area Code` Area  `Element Code` Element `Item Code`
##   <chr>         <chr>             <dbl> <chr>          <dbl> <chr>         <dbl>
## 1 QL            Livestock …           2 Afgh…           5313 Laying         1062
## 2 QL            Livestock …           2 Afgh…           5410 Yield          1062
## 3 QL            Livestock …           2 Afgh…           5510 Produc…        1062
## 4 QL            Livestock …           2 Afgh…           5313 Laying         1062
## 5 QL            Livestock …           2 Afgh…           5410 Yield          1062
## 6 QL            Livestock …           2 Afgh…           5510 Produc…        1062
## # ℹ 8 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
## #   Value <dbl>, Flag <chr>, `Flag Description` <chr>,
## #   Value_Comparison <dbl[,1]>

We see that adding the new column was successful.

Joining the datasets

We can perform an inner join by mutual columns for which we have plenty of options in these datasets.

joined_data <- inner_join(livestock, egg, by = c("Area", "Year"))
## Warning in inner_join(livestock, egg, by = c("Area", "Year")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1 of `x` matches multiple rows in `y`.
## ℹ Row 1 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
head(joined_data)
## # A tibble: 6 × 28
##   `Domain Code.x` Domain.x     `Area Code.x` Area     `Element Code.x` Element.x
##   <chr>           <chr>                <dbl> <chr>               <dbl> <chr>    
## 1 QA              Live Animals             2 Afghani…             5111 Stocks   
## 2 QA              Live Animals             2 Afghani…             5111 Stocks   
## 3 QA              Live Animals             2 Afghani…             5111 Stocks   
## 4 QA              Live Animals             2 Afghani…             5111 Stocks   
## 5 QA              Live Animals             2 Afghani…             5111 Stocks   
## 6 QA              Live Animals             2 Afghani…             5111 Stocks   
## # ℹ 22 more variables: `Item Code.x` <dbl>, Item.x <chr>, `Year Code.x` <dbl>,
## #   Year <dbl>, Unit.x <chr>, Value.x <dbl>, Flag.x <chr>,
## #   `Flag Description.x` <chr>, Value_Comparison.x <dbl[,1]>,
## #   `Domain Code.y` <chr>, Domain.y <chr>, `Area Code.y` <dbl>,
## #   `Element Code.y` <dbl>, Element.y <chr>, `Item Code.y` <dbl>, Item.y <chr>,
## #   `Year Code.y` <dbl>, Unit.y <chr>, Value.y <dbl>, Flag.y <chr>,
## #   `Flag Description.y` <chr>, Value_Comparison.y <dbl[,1]>

We see that the join was successful. We can now analyze certain aspects of this new dataset.

We first get the summary.

summary(joined_data)
##  Domain Code.x        Domain.x          Area Code.x         Area          
##  Length:236827      Length:236827      Min.   :   1.0   Length:236827     
##  Class :character   Class :character   1st Qu.:  74.0   Class :character  
##  Mode  :character   Mode  :character   Median : 146.0   Mode  :character  
##                                        Mean   : 943.5                     
##                                        3rd Qu.: 221.0                     
##                                        Max.   :5504.0                     
##  Element Code.x  Element.x          Item Code.x      Item.x         
##  Min.   :5111   Length:236827      Min.   : 866   Length:236827     
##  1st Qu.:5111   Class :character   1st Qu.: 976   Class :character  
##  Median :5111   Mode  :character   Median :1034   Mode  :character  
##  Mean   :5111                      Mean   :1018                     
##  3rd Qu.:5111                      3rd Qu.:1096                     
##  Max.   :5111                      Max.   :1126                     
##   Year Code.x        Year         Unit.x             Value.x         
##  Min.   :1961   Min.   :1961   Length:236827      Min.   :0.000e+00  
##  1st Qu.:1976   1st Qu.:1976   Class :character   1st Qu.:1.500e+04  
##  Median :1991   Median :1991   Mode  :character   Median :2.670e+05  
##  Mean   :1991   Mean   :1991                      Mean   :1.201e+07  
##  3rd Qu.:2005   3rd Qu.:2005                      3rd Qu.:2.760e+06  
##  Max.   :2018   Max.   :2018                      Max.   :1.490e+09  
##     Flag.x          Flag Description.x Value_Comparison.x.V1 Domain Code.y     
##  Length:236827      Length:236827      Min.   :-0.180902     Length:236827     
##  Class :character   Class :character   1st Qu.:-0.180668     Class :character  
##  Mode  :character   Mode  :character   Median :-0.176747     Mode  :character  
##                                        Mean   : 0.005905                       
##                                        3rd Qu.:-0.137957                       
##                                        Max.   :23.000527                       
##    Domain.y          Area Code.y     Element Code.y  Element.y        
##  Length:236827      Min.   :   1.0   Min.   :5313   Length:236827     
##  Class :character   1st Qu.:  74.0   1st Qu.:5313   Class :character  
##  Mode  :character   Median : 146.0   Median :5410   Mode  :character  
##                     Mean   : 943.5   Mean   :5411                     
##                     3rd Qu.: 221.0   3rd Qu.:5510                     
##                     Max.   :5504.0   Max.   :5510                     
##   Item Code.y      Item.y           Year Code.y      Unit.y         
##  Min.   :1062   Length:236827      Min.   :1961   Length:236827     
##  1st Qu.:1062   Class :character   1st Qu.:1976   Class :character  
##  Median :1062   Mode  :character   Median :1991   Mode  :character  
##  Mean   :1062                      Mean   :1991                     
##  3rd Qu.:1062                      3rd Qu.:2005                     
##  Max.   :1062                      Max.   :2018                     
##     Value.y            Flag.y          Flag Description.y Value_Comparison.y.V1
##  Min.   :       1   Length:236827      Length:236827      Min.   :-0.13055     
##  1st Qu.:    4000   Class :character   Class :character   1st Qu.:-0.12876     
##  Median :   39582   Mode  :character   Mode  :character   Median :-0.11282     
##  Mean   :  386588                                         Mean   : 0.04268     
##  3rd Qu.:  106390                                         3rd Qu.:-0.08288     
##  Max.   :76769955                                         Max.   :34.27090

We now compare the means.

group_by(joined_data, Area) %>%
  summarize(mean_val_livestock = mean(Value.x), mean_val_egg = mean(Value.y))
## # A tibble: 244 × 3
##    Area                mean_val_livestock mean_val_egg
##    <chr>                            <dbl>        <dbl>
##  1 Afghanistan                   3597216.       15203.
##  2 Africa                       78159910.      646292.
##  3 Albania                        532013.       30825.
##  4 Algeria                       2575444.       63990.
##  5 American Samoa                   5399.       13224.
##  6 Americas                     95795716.     3273556.
##  7 Angola                        1123117.       14302.
##  8 Antigua and Barbuda              7877.       16553.
##  9 Argentina                    12879607.      156616.
## 10 Armenia                        167587.       43779.
## # ℹ 234 more rows

We can comparison the scaled value columns that we created.

summary(joined_data$Value_Comparison.x)
##        V1           
##  Min.   :-0.180902  
##  1st Qu.:-0.180668  
##  Median :-0.176747  
##  Mean   : 0.005905  
##  3rd Qu.:-0.137957  
##  Max.   :23.000527
summary(joined_data$Value_Comparison.y)
##        V1          
##  Min.   :-0.13055  
##  1st Qu.:-0.12876  
##  Median :-0.11282  
##  Mean   : 0.04268  
##  3rd Qu.:-0.08288  
##  Max.   :34.27090

Conclusion

We first did the descriptive analysis, tidying, and mutating the data. Then we joined these similar datasets and performed some basic operations on the newly joined data.