Homework week 6
Data cleaning
1. Upload data from the website and read it in R. Skip first seven rows.
Rows: 85 Columns: 25
── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Time
num (24): 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 201...
2. We need to arrange the data in R format. To do that we correct column names and rename our dataset with the correct column names.
> correct_column_names <- oecd |>
+ select(!contains("...")) |>
+ names()
> correct_column_names
[1] "Time" "1995" "1996" "1997" "1998" "1999" "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009" "2010" "2011" "2012"
[20] "2013" "2014" "2015" "2016" "2017" "2018"
> oecd_step_2 <- oecd |>
+ rename_with(.f = function(x) correct_column_names) |>
+ rename(geography = Time)
> oecd_step_2
# A tibble: 85 × 25
geogra…¹ `1995` `1996` `1997` `1998` `1999` `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007` `2008` `2009` `2010` `2011` `2012`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 WLD: Wo… 2.14e4 2.18e4 2.22e4 2.24e4 2.25e4 2.32e4 2.36e4 2.39e4 2.50e4 2.61e4 2.71e4 2.80e4 2.90e4 2.92e4 2.88e4 3.06e4 3.15e4 3.18e4
2 OECD: O… 1.30e4 1.33e4 1.35e4 1.37e4 1.40e4 1.46e4 1.44e4 1.44e4 1.49e4 1.54e4 1.57e4 1.58e4 1.59e4 1.54e4 1.41e4 1.48e4 1.46e4 1.43e4
3 AUS: Au… 2.78e2 2.84e2 2.98e2 3.22e2 3.20e2 3.24e2 3.22e2 3.40e2 3.50e2 3.81e2 4.04e2 4.12e2 4.2 e2 4.22e2 4.29e2 4.34e2 4.40e2 4.52e2
4 AUT: Au… 8.51e1 8.94e1 8.36e1 8.57e1 8.37e1 8.39e1 8.57e1 8.44e1 9.33e1 9.78e1 1.03e2 1.02e2 9.8 e1 9.78e1 8.93e1 9.32e1 9.42e1 8.96e1
5 BEL: Be… 1.30e2 1.32e2 1.27e2 1.31e2 1.27e2 1.27e2 1.25e2 1.20e2 1.29e2 1.36e2 1.39e2 1.37e2 1.39e2 1.45e2 1.30e2 1.40e2 1.37e2 1.33e2
6 CAN: Ca… 4.09e2 4.17e2 4.36e2 4.39e2 4.46e2 4.57e2 4.50e2 4.71e2 5.08e2 5.17e2 5.46e2 5.52e2 5.82e2 5.66e2 5.30e2 5.76e2 5.82e2 5.74e2
7 CHL: Ch… 3.99e1 4.62e1 5.31e1 5.66e1 5.59e1 5.39e1 5.1 e1 5.37e1 5.44e1 5.82e1 6.17e1 6.42e1 6.95e1 7.75e1 7 e1 7.99e1 8.89e1 9.3 e1
8 COL: Co… 6.55e1 6.56e1 7.22e1 7.23e1 5.97e1 5.99e1 6.11e1 6.09e1 5.91e1 6.19e1 6.61e1 6.78e1 7.12e1 7.34e1 7.06e1 7.55e1 8.38e1 8.6 e1
9 CRI: Co… 8.1 e0 7.8 e0 8.2 e0 9.1 e0 9 e0 8.7 e0 8.8 e0 9.3 e0 9.3 e0 9.3 e0 9.7 e0 1.04e1 1.17e1 1.24e1 1.03e1 1.15e1 1.23e1 1.28e1
10 CZE: Cz… 1.05e2 1.09e2 1.06e2 9.98e1 9.61e1 1.02e2 1.03e2 1.02e2 1.07e2 1.04e2 1.03e2 1.03e2 1.07e2 1.08e2 9.79e1 1.03e2 1.01e2 9.41e1
3. Let’s separate columns with the geography codes and the geography names.
> geography_tbl <- oecd_step_2 |>
+ select(geography)
> geography_tbl
# A tibble: 85 × 1
geography
<chr>
1 WLD: World
2 OECD: OECD member countries
3 AUS: Australia
4 AUT: Austria
5 BEL: Belgium
6 CAN: Canada
7 CHL: Chile
8 COL: Colombia
9 CRI: Costa Rica
10 CZE: Czech Republic
> geography_tbl_2 <- geography_tbl |>
+ separate(col = geography, into = c("geography_code", "geography_name"),
+ sep = ":", remove = FALSE) |>
+ mutate(across(c("geography_code", "geography_name"), str_trim))
> geography_tbl_2
# A tibble: 85 × 3
geography geography_code geography_name
<chr> <chr> <chr>
1 WLD: World WLD World
2 OECD: OECD member countries OECD OECD member countries
3 AUS: Australia AUS Australia
4 AUT: Austria AUT Austria
5 BEL: Belgium BEL Belgium
6 CAN: Canada CAN Canada
7 CHL: Chile CHL Chile
8 COL: Colombia COL Colombia
9 CRI: Costa Rica CRI Costa Rica
10 CZE: Czech Republic CZE Czech Republic
4. We need to standardize country names and get rid of a column which include both country code and country name.
geography_tbl_2 |>
+ mutate(country_name = countrycode(sourcevar = geography_code,
+ origin = "iso3c",
+ destination = "country.name")) |>
+ filter(is.na(country_name))
# A tibble: 19 × 4
geography geography_code geography_name count…¹
<chr> <chr> <chr> <chr>
1 WLD: World WLD World NA
2 OECD: OECD member countries OECD OECD member countries NA
3 NONOECD: Non-OECD economies and aggregates NONOECD Non-OECD economies and aggregat… NA
4 ROW: Rest of the World ROW Rest of the World NA
5 APEC: Asia-Pacific Economic Cooperation APEC Asia-Pacific Economic Cooperati… NA
6 ASEAN: Association of South East Asian Nations ASEAN Association of South East Asian… NA
7 EASIA: Eastern Asia EASIA Eastern Asia NA
8 EU27_2020: European Union (27 countries) EU27_2020 European Union (27 countries) NA
9 EU28: European Union (28 countries) EU28 European Union (28 countries) NA
10 EU15: European Union (15 countries) EU15 European Union (15 countries) NA
11 EU13: EU28 excluding EU15 EU13 EU28 excluding EU15 NA
12 EA19: Euro area (19 countries) EA19 Euro area (19 countries) NA
13 G20: Group of Twenty G20 Group of Twenty NA
14 ZEUR: Europe ZEUR Europe NA
15 ZASI: East and Southeastern Asia ZASI East and Southeastern Asia NA
16 ZNAM: North America ZNAM North America NA
17 ZSCA: South and Central America ZSCA South and Central America NA
18 ZOTH: Other regions ZOTH Other regions NA
> geography_processed <- geography_tbl_2 |>
+ mutate(country_name = countrycode(sourcevar = geography_code,
+ origin = "iso3c",
+ destination = "country.name")) |>
+ mutate(country_or_region = if_else(is.na(country_name),
+ true = "region",
+ false = "country")) |>
+ mutate(country_name = if_else(is.na(country_name),
+ true = geography_name,
+ false = country_name)) |>
+ select(geography, country_name, iso3c = geography_code, country_or_region)
Warning message:
There was 1 warning in `mutate()`.
ℹ In argument: `country_name = countrycode(sourcevar = geography_code, origin = "iso3c", destination = "country.name")`.
Caused by warning in `countrycode_convert()`:
! Some values were not matched unambiguously: APEC, ASEAN, DATA EXTRACTED ON 09 MAR 2023 13, EA19, EASIA, EU13, EU15, EU27_2020, EU28, G20, NONOECD, OECD, ROW, WLD, ZASI, ZEUR, ZNAM, ZOTH, ZSCA
> step3 <- geography_processed |>
+ left_join(oecd_step_2, by = "geography") |>
+ select(-geography)
> step3
# A tibble: 85 × 27
country…¹ iso3c count…² `1995` `1996` `1997` `1998` `1999` `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007` `2008` `2009` `2010` `2011` `2012` `2013` `2014` `2015` `2016`
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 World WLD region 2.14e4 2.18e4 2.22e4 2.24e4 2.25e4 2.32e4 2.36e4 2.39e4 2.50e4 2.61e4 2.71e4 2.80e4 2.90e4 2.92e4 2.88e4 3.06e4 3.15e4 3.18e4 3.24e4 3.25e4 3.24e4 3.24e4
2 OECD mem… OECD region 1.30e4 1.33e4 1.35e4 1.37e4 1.40e4 1.46e4 1.44e4 1.44e4 1.49e4 1.54e4 1.57e4 1.58e4 1.59e4 1.54e4 1.41e4 1.48e4 1.46e4 1.43e4 1.42e4 1.39e4 1.37e4 1.36e4
3 Australia AUS country 2.78e2 2.84e2 2.98e2 3.22e2 3.20e2 3.24e2 3.22e2 3.40e2 3.50e2 3.81e2 4.04e2 4.12e2 4.2 e2 4.22e2 4.29e2 4.34e2 4.40e2 4.52e2 4.48e2 4.13e2 4.16e2 4.14e2
4 Austria AUT country 8.51e1 8.94e1 8.36e1 8.57e1 8.37e1 8.39e1 8.57e1 8.44e1 9.33e1 9.78e1 1.03e2 1.02e2 9.8 e1 9.78e1 8.93e1 9.32e1 9.42e1 8.96e1 8.88e1 8.38e1 8.09e1 8.19e1
5 Belgium BEL country 1.30e2 1.32e2 1.27e2 1.31e2 1.27e2 1.27e2 1.25e2 1.20e2 1.29e2 1.36e2 1.39e2 1.37e2 1.39e2 1.45e2 1.30e2 1.40e2 1.37e2 1.33e2 1.30e2 1.27e2 1.25e2 1.25e2
6 Canada CAN country 4.09e2 4.17e2 4.36e2 4.39e2 4.46e2 4.57e2 4.50e2 4.71e2 5.08e2 5.17e2 5.46e2 5.52e2 5.82e2 5.66e2 5.30e2 5.76e2 5.82e2 5.74e2 5.70e2 5.56e2 5.29e2 5.23e2
7 Chile CHL country 3.99e1 4.62e1 5.31e1 5.66e1 5.59e1 5.39e1 5.1 e1 5.37e1 5.44e1 5.82e1 6.17e1 6.42e1 6.95e1 7.75e1 7 e1 7.99e1 8.89e1 9.3 e1 9.6 e1 8.8 e1 8.87e1 8.99e1
8 Colombia COL country 6.55e1 6.56e1 7.22e1 7.23e1 5.97e1 5.99e1 6.11e1 6.09e1 5.91e1 6.19e1 6.61e1 6.78e1 7.12e1 7.34e1 7.06e1 7.55e1 8.38e1 8.6 e1 9.15e1 9.41e1 9.04e1 8.97e1
9 Costa Ri… CRI country 8.1 e0 7.8 e0 8.2 e0 9.1 e0 9 e0 8.7 e0 8.8 e0 9.3 e0 9.3 e0 9.3 e0 9.7 e0 1.04e1 1.17e1 1.24e1 1.03e1 1.15e1 1.23e1 1.28e1 1.32e1 1.34e1 1.35e1 1.44e1
10 Czechia CZE country 1.05e2 1.09e2 1.06e2 9.98e1 9.61e1 1.02e2 1.03e2 1.02e2 1.07e2 1.04e2 1.03e2 1.03e2 1.07e2 1.08e2 9.79e1 1.03e2 1.01e2 9.41e1 9.05e1 8.62e1 8.79e1 9 e1
5. Let’s pivot our data to make it in line with R format
> step4 <- step3 |>
+ pivot_longer(cols = matches("\\d{4}"),
+ names_to = "year",
+ names_transform = as.integer)
> step4
# A tibble: 2,040 × 5
country_name iso3c country_or_region year value
<chr> <chr> <chr> <int> <dbl>
1 World WLD region 1995 21367.
2 World WLD region 1996 21815.
3 World WLD region 1997 22214.
4 World WLD region 1998 22391
5 World WLD region 1999 22526.
6 World WLD region 2000 23240.
7 World WLD region 2001 23587.
8 World WLD region 2002 23924.
9 World WLD region 2003 24956.
10 World WLD region 2004 26130.
# … with 2,030 more rows