Challenge 1 - Read in data

Author

Jingyi Yang

Published

December 22, 2023

1. Start Up

knitr::opts_chunk$set(echo = TRUE)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(here)
here() starts at C:/8-601
library(readr)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

2. Importing the data

2.1 railroad data importing

setwd("C:\\8-601\\challenge_datasets")
railroad <- read_csv("railroad_2012_clean_county.csv")
Rows: 2930 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): state, county
dbl (1): total_employees

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#read_csv(here())is not working as the document is not saved in the default working directory. So, the setwd() function is needed to redirect the working directory.
railroad
# A tibble: 2,930 × 3
   state county               total_employees
   <chr> <chr>                          <dbl>
 1 AE    APO                                2
 2 AK    ANCHORAGE                          7
 3 AK    FAIRBANKS NORTH STAR               2
 4 AK    JUNEAU                             3
 5 AK    MATANUSKA-SUSITNA                  2
 6 AK    SITKA                              1
 7 AK    SKAGWAY MUNICIPALITY              88
 8 AL    AUTAUGA                          102
 9 AL    BALDWIN                          143
10 AL    BARBOUR                            1
# ℹ 2,920 more rows

2.2 birds data importing

setwd("C:\\8-601\\challenge_datasets")
birds <- read_csv("birds.csv")
Rows: 30977 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Domain Code, Domain, Area, Element, Item, Unit, Flag, Flag Description
dbl (6): Area Code, Element Code, Item Code, Year Code, Year, Value

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#read_csv(here())is not working as the document is not saved in the default working directory. So, the setwd() function is needed to redirect the working directory.
as_tibble(birds)
# A tibble: 30,977 × 14
   `Domain Code` Domain     `Area Code` Area  `Element Code` Element `Item Code`
   <chr>         <chr>            <dbl> <chr>          <dbl> <chr>         <dbl>
 1 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 2 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 3 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 4 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 5 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 6 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 7 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 8 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
 9 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
10 QA            Live Anim…           2 Afgh…           5112 Stocks         1057
# ℹ 30,967 more rows
# ℹ 7 more variables: Item <chr>, `Year Code` <dbl>, Year <dbl>, Unit <chr>,
#   Value <dbl>, Flag <chr>, `Flag Description` <chr>

3.Summarize & Description

3.1 railroad data summarize

The summary is mainly focused on the total number of employees. The table “sum_railroad” below shows the sum, median, mean, variability, and standard deviation of the employees in each state group to fully describe and understand the data and show the differences between states. As states have different numbers of counties, another table “summary_railroad_table” contains this information to show these differences better and make the differences between states more clear.

sum_railroad <- railroad %>%
  group_by(state) %>%
  summarise(total_number = sum(total_employees,na.rm = T), median_number = median(total_employees,na.rm = T), mean_number = mean(total_employees, na.rm = T), var_number = var(total_employees,na.rm = T), sd_number = sd(total_employees,na.rm = T))
sum_railroad
# A tibble: 53 × 6
   state total_number median_number mean_number var_number sd_number
   <chr>        <dbl>         <dbl>       <dbl>      <dbl>     <dbl>
 1 AE               2           2           2          NA       NA  
 2 AK             103           2.5        17.2      1209.      34.8
 3 AL            4257          26          63.5     16943.     130. 
 4 AP               1           1           1          NA       NA  
 5 AR            3871          16.5        53.8     17197.     131. 
 6 AZ            3153          94         210.      51885.     228. 
 7 CA           13137          61         239.     301916.     549. 
 8 CO            3650          10          64.0     16320.     128. 
 9 CT            2592         125         324      270606.     520. 
10 DC             279         279         279          NA       NA  
# ℹ 43 more rows
railroad %>%
dplyr::distinct(railroad$state)
# A tibble: 53 × 1
   `railroad$state`
   <chr>           
 1 AE              
 2 AK              
 3 AL              
 4 AP              
 5 AR              
 6 AZ              
 7 CA              
 8 CO              
 9 CT              
10 DC              
# ℹ 43 more rows
total_county <- railroad %>% count(state)

summary_railroad_table <- mutate(total_county, sum_railroad)
summary_railroad_table
# A tibble: 53 × 7
   state     n total_number median_number mean_number var_number sd_number
   <chr> <int>        <dbl>         <dbl>       <dbl>      <dbl>     <dbl>
 1 AE        1            2           2           2          NA       NA  
 2 AK        6          103           2.5        17.2      1209.      34.8
 3 AL       67         4257          26          63.5     16943.     130. 
 4 AP        1            1           1           1          NA       NA  
 5 AR       72         3871          16.5        53.8     17197.     131. 
 6 AZ       15         3153          94         210.      51885.     228. 
 7 CA       55        13137          61         239.     301916.     549. 
 8 CO       57         3650          10          64.0     16320.     128. 
 9 CT        8         2592         125         324      270606.     520. 
10 DC        1          279         279         279          NA       NA  
# ℹ 43 more rows

3.2 birds data summarize

The summary is divided into four parts- area, year, item, and flag description. After using the filter functions in Excel to analyze columns in the data sets, these are the three variables divided into multiple categories (besides the variable “value.” Other variables basically only have one category or provide numerical code for the categories of these three variables.

The summary connects these four qualitative variables to a quantitative variable, which is “value.” Based on the data set, it is unsure what the “value” refers to, but I treat it as the number of animals/birds.

All the tables below contain the sum, median, mean, variability, and standard deviation of the “value,” which we hope will provide some description and an understanding of the data based on different categorical variables. Besides, the “summary_items_birds_table” table includes the numbers of different kinds of animals (or birds), which provides more information about the data sets and shows the differences between different types of animals(or birds). The table “summary_flag_description_birds_table” also contains the numbers of various flag descriptions for the same purpose as the “summary_items_birds_table” table.

area_birds_summary <- birds %>%
  group_by(Area) %>%
  summarise(area_total_number = sum(Value,na.rm = T), area_median_number = median(Value,na.rm = T), area_mean_number = mean(Value, na.rm = T), area_var_number = var(Value,na.rm = T), area_sd_number = sd(Value,na.rm = T))
area_birds_summary
# A tibble: 248 × 6
   Area    area_total_number area_median_number area_mean_number area_var_number
   <chr>               <dbl>              <dbl>            <dbl>           <dbl>
 1 Afghan…            469727             6700             8099.          7.95e 6
 2 Africa           57002821            12910.          196561.          1.90e11
 3 Albania            271133             1300             2278.          5.14e 6
 4 Algeria           4088109               42.5          17621.          1.51e 9
 5 Americ…              2402               38               41.4         1.97e 2
 6 Americ…         198674579            66924.          856356.          2.37e12
 7 Angola             548260             6075             9453.          7.97e 7
 8 Antigu…              5431               85               93.6         1.57e 3
 9 Argent…           4371886             2355            18844.          1.13e 9
10 Armenia            111324             1528.            2062.          4.18e 6
# ℹ 238 more rows
# ℹ 1 more variable: area_sd_number <dbl>
year_birds_sumarry <- birds %>%
  group_by(Year) %>%
  summarise(year_total_number = sum(Value,na.rm = T), year_median_number = median(Value,na.rm = T), year_mean_number = mean(Value, na.rm = T), year_var_number = var(Value,na.rm = T), year_sd_number = sd(Value,na.rm = T))
year_birds_sumarry
# A tibble: 58 × 6
    Year year_total_number year_median_number year_mean_number year_var_number
   <dbl>             <dbl>              <dbl>            <dbl>           <dbl>
 1  1961          17420323              1033            36752.    47059099105.
 2  1962          17910944              1014            37787.    50595809286.
 3  1963          18360699              1106            38736.    53353928347.
 4  1964          18640211              1103            39325.    54806462166.
 5  1965          19158619              1104            40334.    57858203063.
 6  1966          19625156              1088.           41229.    60307571742.
 7  1967          20582357              1193            43240.    66353492122.
 8  1968          21143795              1252.           44420.    70623262796.
 9  1969          21845768              1267            45607.    75005569765.
10  1970          22851291              1259            47706.    81653504019.
# ℹ 48 more rows
# ℹ 1 more variable: year_sd_number <dbl>
items_birds_sumarry <- birds %>%
  group_by(Item) %>%
  summarise(items_total_number = sum(Value,na.rm = T), items_median_number = median(Value,na.rm = T), items_mean_number = mean(Value, na.rm = T), items_var_number = var(Value,na.rm = T), items_sd_number = sd(Value,na.rm = T))
items_birds_sumarry
# A tibble: 5 × 6
  Item                  items_total_number items_median_number items_mean_number
  <chr>                              <dbl>               <dbl>             <dbl>
1 Chickens                      2696862583              10784.           207931.
2 Ducks                          149781301                510             23072.
3 Geese and guinea fow…           41136874                258             10292.
4 Pigeons, other birds             6822856               2800              6163.
5 Turkeys                         81850064                528             15228.
# ℹ 2 more variables: items_var_number <dbl>, items_sd_number <dbl>
birds %>%
dplyr::distinct(birds$Item)
# A tibble: 5 × 1
  `birds$Item`          
  <chr>                 
1 Chickens              
2 Ducks                 
3 Geese and guinea fowls
4 Turkeys               
5 Pigeons, other birds  
total_bird <- birds %>% count(Item)

summary_items_birds_table <- mutate(total_bird, items_birds_sumarry)
summary_items_birds_table
# A tibble: 5 × 7
  Item                n items_total_number items_median_number items_mean_number
  <chr>           <int>              <dbl>               <dbl>             <dbl>
1 Chickens        13074         2696862583              10784.           207931.
2 Ducks            6909          149781301                510             23072.
3 Geese and guin…  4136           41136874                258             10292.
4 Pigeons, other…  1165            6822856               2800              6163.
5 Turkeys          5693           81850064                528             15228.
# ℹ 2 more variables: items_var_number <dbl>, items_sd_number <dbl>
flag_birds_sumarry <- birds %>%
  group_by(`Flag Description`) %>%
  summarise(flag_total_number = sum(Value,na.rm = T), flag_median_number = median(Value,na.rm = T), flag_mean_number = mean(Value, na.rm = T), flag_var_number = var(Value,na.rm = T), flag_sd_number = sd(Value,na.rm = T))
flag_birds_sumarry
# A tibble: 6 × 6
  `Flag Description`       flag_total_number flag_median_number flag_mean_number
  <chr>                                <dbl>              <dbl>            <dbl>
1 Aggregate, may include …        2232340190               8003          345885.
2 Data not available                       0                 NA             NaN 
3 FAO data based on imput…          57689560                501           47559.
4 FAO estimate                     165898597                465           16578.
5 Official data                    346111032               2500           32128.
6 Unofficial figure                174414299               1952          116743.
# ℹ 2 more variables: flag_var_number <dbl>, flag_sd_number <dbl>
birds %>%
dplyr::distinct(birds$`Flag Description`)
# A tibble: 6 × 1
  `birds$\`Flag Description\``                                                
  <chr>                                                                       
1 FAO estimate                                                                
2 Official data                                                               
3 FAO data based on imputation methodology                                    
4 Data not available                                                          
5 Unofficial figure                                                           
6 Aggregate, may include official, semi-official, estimated or calculated data
total_flag_description <- birds %>% count(`Flag Description`)

summary_flag_description_birds_table <- mutate(total_flag_description, flag_birds_sumarry)
summary_flag_description_birds_table
# A tibble: 6 × 7
  `Flag Description`     n flag_total_number flag_median_number flag_mean_number
  <chr>              <int>             <dbl>              <dbl>            <dbl>
1 Aggregate, may in…  6488        2232340190               8003          345885.
2 Data not available  1002                 0                 NA             NaN 
3 FAO data based on…  1213          57689560                501           47559.
4 FAO estimate       10007         165898597                465           16578.
5 Official data      10773         346111032               2500           32128.
6 Unofficial figure   1494         174414299               1952          116743.
# ℹ 2 more variables: flag_var_number <dbl>, flag_sd_number <dbl>