Homework week 6

Author

Olga Zubareva

Data cleaning

1. Upload data from the website and read it in R. Skip first seven rows.

Rows: 85 Columns: 25                                                                                                                                                        
── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): Time
num (24): 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 201...

2. We need to arrange the data in R format. To do that we correct column names and rename our dataset with the correct column names.

> correct_column_names <- oecd |>
+ select(!contains("...")) |> 
+ names()
> correct_column_names
[1] "Time" "1995" "1996" "1997" "1998" "1999" "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009" "2010" "2011" "2012"
[20] "2013" "2014" "2015" "2016" "2017" "2018"

> oecd_step_2 <- oecd |>
+ rename_with(.f = function(x) correct_column_names) |>
+ rename(geography = Time)
> oecd_step_2
# A tibble: 85 × 25
   geogra…¹ `1995` `1996` `1997` `1998` `1999` `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007` `2008` `2009` `2010` `2011` `2012`
   <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 WLD: Wo… 2.14e4 2.18e4 2.22e4 2.24e4 2.25e4 2.32e4 2.36e4 2.39e4 2.50e4 2.61e4 2.71e4 2.80e4 2.90e4 2.92e4 2.88e4 3.06e4 3.15e4 3.18e4
 2 OECD: O… 1.30e4 1.33e4 1.35e4 1.37e4 1.40e4 1.46e4 1.44e4 1.44e4 1.49e4 1.54e4 1.57e4 1.58e4 1.59e4 1.54e4 1.41e4 1.48e4 1.46e4 1.43e4
 3 AUS: Au… 2.78e2 2.84e2 2.98e2 3.22e2 3.20e2 3.24e2 3.22e2 3.40e2 3.50e2 3.81e2 4.04e2 4.12e2 4.2 e2 4.22e2 4.29e2 4.34e2 4.40e2 4.52e2
 4 AUT: Au… 8.51e1 8.94e1 8.36e1 8.57e1 8.37e1 8.39e1 8.57e1 8.44e1 9.33e1 9.78e1 1.03e2 1.02e2 9.8 e1 9.78e1 8.93e1 9.32e1 9.42e1 8.96e1
 5 BEL: Be… 1.30e2 1.32e2 1.27e2 1.31e2 1.27e2 1.27e2 1.25e2 1.20e2 1.29e2 1.36e2 1.39e2 1.37e2 1.39e2 1.45e2 1.30e2 1.40e2 1.37e2 1.33e2
 6 CAN: Ca… 4.09e2 4.17e2 4.36e2 4.39e2 4.46e2 4.57e2 4.50e2 4.71e2 5.08e2 5.17e2 5.46e2 5.52e2 5.82e2 5.66e2 5.30e2 5.76e2 5.82e2 5.74e2
 7 CHL: Ch… 3.99e1 4.62e1 5.31e1 5.66e1 5.59e1 5.39e1 5.1 e1 5.37e1 5.44e1 5.82e1 6.17e1 6.42e1 6.95e1 7.75e1 7   e1 7.99e1 8.89e1 9.3 e1
 8 COL: Co… 6.55e1 6.56e1 7.22e1 7.23e1 5.97e1 5.99e1 6.11e1 6.09e1 5.91e1 6.19e1 6.61e1 6.78e1 7.12e1 7.34e1 7.06e1 7.55e1 8.38e1 8.6 e1
 9 CRI: Co… 8.1 e0 7.8 e0 8.2 e0 9.1 e0 9   e0 8.7 e0 8.8 e0 9.3 e0 9.3 e0 9.3 e0 9.7 e0 1.04e1 1.17e1 1.24e1 1.03e1 1.15e1 1.23e1 1.28e1
10 CZE: Cz… 1.05e2 1.09e2 1.06e2 9.98e1 9.61e1 1.02e2 1.03e2 1.02e2 1.07e2 1.04e2 1.03e2 1.03e2 1.07e2 1.08e2 9.79e1 1.03e2 1.01e2 9.41e1

3. Let’s separate columns with the geography codes and the geography names.

> geography_tbl <- oecd_step_2 |>
+ select(geography)
> geography_tbl
# A tibble: 85 × 1
   geography                  
   <chr>                      
 1 WLD: World                 
 2 OECD: OECD member countries
 3 AUS: Australia             
 4 AUT: Austria               
 5 BEL: Belgium               
 6 CAN: Canada                
 7 CHL: Chile                 
 8 COL: Colombia              
 9 CRI: Costa Rica            
10 CZE: Czech Republic

> geography_tbl_2 <- geography_tbl |>
+ separate(col = geography, into = c("geography_code", "geography_name"), 
+ sep = ":", remove = FALSE) |>
+ mutate(across(c("geography_code", "geography_name"), str_trim))
> geography_tbl_2
# A tibble: 85 × 3
   geography                   geography_code geography_name       
   <chr>                       <chr>          <chr>                
 1 WLD: World                  WLD            World                
 2 OECD: OECD member countries OECD           OECD member countries
 3 AUS: Australia              AUS            Australia            
 4 AUT: Austria                AUT            Austria              
 5 BEL: Belgium                BEL            Belgium              
 6 CAN: Canada                 CAN            Canada               
 7 CHL: Chile                  CHL            Chile                
 8 COL: Colombia               COL            Colombia             
 9 CRI: Costa Rica             CRI            Costa Rica           
10 CZE: Czech Republic         CZE            Czech Republic

4. We need to standardize country names and get rid of a column which include both country code and country name.

geography_tbl_2 |> 
+ mutate(country_name = countrycode(sourcevar = geography_code, 
+ origin = "iso3c", 
+ destination = "country.name")) |> 
+ filter(is.na(country_name))
# A tibble: 19 × 4
   geography                                                    geography_code                   geography_name                   count…¹
   <chr>                                                        <chr>                            <chr>                            <chr>  
 1 WLD: World                                                   WLD                              World                            NA     
 2 OECD: OECD member countries                                  OECD                             OECD member countries            NA     
 3 NONOECD: Non-OECD economies and aggregates                   NONOECD                          Non-OECD economies and aggregat… NA     
 4 ROW: Rest of the World                                       ROW                              Rest of the World                NA     
 5 APEC: Asia-Pacific Economic Cooperation                      APEC                             Asia-Pacific Economic Cooperati… NA     
 6 ASEAN: Association of South East Asian Nations               ASEAN                            Association of South East Asian… NA     
 7 EASIA: Eastern Asia                                          EASIA                            Eastern Asia                     NA     
 8 EU27_2020: European Union (27 countries)                     EU27_2020                        European Union (27 countries)    NA     
 9 EU28: European Union (28 countries)                          EU28                             European Union (28 countries)    NA     
10 EU15: European Union (15 countries)                          EU15                             European Union (15 countries)    NA     
11 EU13: EU28 excluding EU15                                    EU13                             EU28 excluding EU15              NA     
12 EA19: Euro area (19 countries)                               EA19                             Euro area (19 countries)         NA     
13 G20: Group of Twenty                                         G20                              Group of Twenty                  NA     
14 ZEUR: Europe                                                 ZEUR                             Europe                           NA     
15 ZASI: East and Southeastern Asia                             ZASI                             East and Southeastern Asia       NA     
16 ZNAM: North America                                          ZNAM                             North America                    NA     
17 ZSCA: South and Central America                              ZSCA                             South and Central America        NA     
18 ZOTH: Other regions                                          ZOTH                             Other regions                    NA     

> geography_processed <- geography_tbl_2 |> 
+ mutate(country_name = countrycode(sourcevar = geography_code, 
+ origin = "iso3c", 
+ destination = "country.name")) |>
+ mutate(country_or_region = if_else(is.na(country_name), 
+ true = "region",
+ false = "country")) |> 
+ mutate(country_name = if_else(is.na(country_name),
+ true = geography_name,
+ false = country_name)) |> 
+ select(geography, country_name, iso3c = geography_code, country_or_region)
Warning message:
There was 1 warning in `mutate()`.
ℹ In argument: `country_name = countrycode(sourcevar = geography_code, origin = "iso3c", destination = "country.name")`.
Caused by warning in `countrycode_convert()`:
! Some values were not matched unambiguously: APEC, ASEAN, DATA EXTRACTED ON 09 MAR 2023 13, EA19, EASIA, EU13, EU15, EU27_2020, EU28, G20, NONOECD, OECD, ROW, WLD, ZASI, ZEUR, ZNAM, ZOTH, ZSCA 

> step3 <- geography_processed |> 
+ left_join(oecd_step_2, by = "geography") |>
+ select(-geography)
> step3
# A tibble: 85 × 27
   country…¹ iso3c count…² `1995` `1996` `1997` `1998` `1999` `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007` `2008` `2009` `2010` `2011` `2012` `2013` `2014` `2015` `2016`
   <chr>     <chr> <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 World     WLD   region  2.14e4 2.18e4 2.22e4 2.24e4 2.25e4 2.32e4 2.36e4 2.39e4 2.50e4 2.61e4 2.71e4 2.80e4 2.90e4 2.92e4 2.88e4 3.06e4 3.15e4 3.18e4 3.24e4 3.25e4 3.24e4 3.24e4
 2 OECD mem… OECD  region  1.30e4 1.33e4 1.35e4 1.37e4 1.40e4 1.46e4 1.44e4 1.44e4 1.49e4 1.54e4 1.57e4 1.58e4 1.59e4 1.54e4 1.41e4 1.48e4 1.46e4 1.43e4 1.42e4 1.39e4 1.37e4 1.36e4
 3 Australia AUS   country 2.78e2 2.84e2 2.98e2 3.22e2 3.20e2 3.24e2 3.22e2 3.40e2 3.50e2 3.81e2 4.04e2 4.12e2 4.2 e2 4.22e2 4.29e2 4.34e2 4.40e2 4.52e2 4.48e2 4.13e2 4.16e2 4.14e2
 4 Austria   AUT   country 8.51e1 8.94e1 8.36e1 8.57e1 8.37e1 8.39e1 8.57e1 8.44e1 9.33e1 9.78e1 1.03e2 1.02e2 9.8 e1 9.78e1 8.93e1 9.32e1 9.42e1 8.96e1 8.88e1 8.38e1 8.09e1 8.19e1
 5 Belgium   BEL   country 1.30e2 1.32e2 1.27e2 1.31e2 1.27e2 1.27e2 1.25e2 1.20e2 1.29e2 1.36e2 1.39e2 1.37e2 1.39e2 1.45e2 1.30e2 1.40e2 1.37e2 1.33e2 1.30e2 1.27e2 1.25e2 1.25e2
 6 Canada    CAN   country 4.09e2 4.17e2 4.36e2 4.39e2 4.46e2 4.57e2 4.50e2 4.71e2 5.08e2 5.17e2 5.46e2 5.52e2 5.82e2 5.66e2 5.30e2 5.76e2 5.82e2 5.74e2 5.70e2 5.56e2 5.29e2 5.23e2
 7 Chile     CHL   country 3.99e1 4.62e1 5.31e1 5.66e1 5.59e1 5.39e1 5.1 e1 5.37e1 5.44e1 5.82e1 6.17e1 6.42e1 6.95e1 7.75e1 7   e1 7.99e1 8.89e1 9.3 e1 9.6 e1 8.8 e1 8.87e1 8.99e1
 8 Colombia  COL   country 6.55e1 6.56e1 7.22e1 7.23e1 5.97e1 5.99e1 6.11e1 6.09e1 5.91e1 6.19e1 6.61e1 6.78e1 7.12e1 7.34e1 7.06e1 7.55e1 8.38e1 8.6 e1 9.15e1 9.41e1 9.04e1 8.97e1
 9 Costa Ri… CRI   country 8.1 e0 7.8 e0 8.2 e0 9.1 e0 9   e0 8.7 e0 8.8 e0 9.3 e0 9.3 e0 9.3 e0 9.7 e0 1.04e1 1.17e1 1.24e1 1.03e1 1.15e1 1.23e1 1.28e1 1.32e1 1.34e1 1.35e1 1.44e1
10 Czechia   CZE   country 1.05e2 1.09e2 1.06e2 9.98e1 9.61e1 1.02e2 1.03e2 1.02e2 1.07e2 1.04e2 1.03e2 1.03e2 1.07e2 1.08e2 9.79e1 1.03e2 1.01e2 9.41e1 9.05e1 8.62e1 8.79e1 9   e1

5. Let’s pivot our data to make it in line with R format

> step4 <- step3 |> 
+ pivot_longer(cols = matches("\\d{4}"), 
+ names_to = "year", 
+ names_transform = as.integer)
> step4
# A tibble: 2,040 × 5
   country_name iso3c country_or_region  year  value
   <chr>        <chr> <chr>             <int>  <dbl>
 1 World        WLD   region             1995 21367.
 2 World        WLD   region             1996 21815.
 3 World        WLD   region             1997 22214.
 4 World        WLD   region             1998 22391 
 5 World        WLD   region             1999 22526.
 6 World        WLD   region             2000 23240.
 7 World        WLD   region             2001 23587.
 8 World        WLD   region             2002 23924.
 9 World        WLD   region             2003 24956.
10 World        WLD   region             2004 26130.
# … with 2,030 more rows