LinkedIn Course - Data Wrangling in R
Taught by - Mike Chapple

_____________________________________________

Introduction

This case study aims at converting a “coal consumption” data set from the one that is there to one that is tidy. The first step would to to import the data set into R.

The data used in this case study can be found on the following link: http://594442.youcanlearnit.net/coal.csv

Setting up the ground

The first step would be to import data.

coal <- read_csv("C:\\Users\\duelr\\Downloads\\coal.csv")
## Parsed with column specification:
## cols(
##   .default = col_character()
## )
## See spec(...) for full column specifications.

A look at the actual data in Excel or an equivalent file will show thatthere are two rows that can be skipped beofre we come to the dates in terms of years. The first column consists a list of places in terms of countries.

Now to take a glimpse at the same:

glimpse(coal)
## Rows: 234
## Columns: 31
## $ `Table: Total Coal Consumption (Quadrillion Btu)` <chr> NA, NA, "North Am...
## $ X2                                                <chr> NA, "1980", "16.4...
## $ X3                                                <chr> NA, "1981", "16.9...
## $ X4                                                <chr> NA, "1982", "16.4...
## $ X5                                                <chr> NA, "1983", "17.1...
## $ X6                                                <chr> NA, "1984", "18.4...
## $ X7                                                <chr> NA, "1985", "18.8...
## $ X8                                                <chr> NA, "1986", "18.5...
## $ X9                                                <chr> NA, "1987", "19.4...
## $ X10                                               <chr> NA, "1988", "20.4...
## $ X11                                               <chr> NA, "1989", "20.6...
## $ X12                                               <chr> NA, "1990", "20.5...
## $ X13                                               <chr> NA, "1991", "20.4...
## $ X14                                               <chr> NA, "1992", "20.6...
## $ X15                                               <chr> NA, "1993", "21.2...
## $ X16                                               <chr> NA, "1994", "21.3...
## $ X17                                               <chr> NA, "1995", "21.6...
## $ X18                                               <chr> NA, "1996", "22.5...
## $ X19                                               <chr> NA, "1997", "23.2...
## $ X20                                               <chr> NA, "1998", "23.5...
## $ X21                                               <chr> NA, "1999", "23.4...
## $ X22                                               <chr> NA, "2000", "24.5...
## $ X23                                               <chr> NA, "2001", "23.6...
## $ X24                                               <chr> NA, "2002", "23.6...
## $ X25                                               <chr> NA, "2003", "24.1...
## $ X26                                               <chr> NA, "2004", "24.3...
## $ X27                                               <chr> NA, "2005", "24.6...
## $ X28                                               <chr> NA, "2006", "24.3...
## $ X29                                               <chr> NA, "2007", "24.5...
## $ X30                                               <chr> NA, "2008", "24.1...
## $ X31                                               <chr> NA, "2009", "21.1...

The output reveals that for some reason R did not read the years as column headers. This is already known to us given we had an initial look at it in a software that is either Excel of equivalent to excel.

It is also important to notice that there are NAs consistently in the first cell of every row. The above problems could be mostly due to the empty rows that the data set has.

To correct the errors we can skip the first two rows of the data set.

coal <- read_csv("C:\\Users\\duelr\\Downloads\\coal.csv", skip = 2)
## Parsed with column specification:
## cols(
##   .default = col_character()
## )
## See spec(...) for full column specifications.
glimpse(coal)
## Rows: 232
## Columns: 31
## $ X1     <chr> "North America", "Bermuda", "Canada", "Greenland", "Mexico",...
## $ `1980` <chr> "16.45179", "0", "0.96156", "0.00005", "0.10239", "0", "15.3...
## $ `1981` <chr> "16.98772", "0", "0.99047", "0.00005", "0.10562", "0", "15.8...
## $ `1982` <chr> "16.47546", "0", "1.05584", "0.00003", "0.11967", "0", "15.2...
## $ `1983` <chr> "17.12407", "0", "1.11653", "0.00003", "0.12869", "0", "15.8...
## $ `1984` <chr> "18.4267", "0", "1.23682", "0.00003", "0.13071", "0", "17.05...
## $ `1985` <chr> "18.81819", "0", "1.20679", "0", "0.14646", "0", "17.46494",...
## $ `1986` <chr> "18.52559", "0", "1.12583", "0", "0.15609", "0", "17.24367",...
## $ `1987` <chr> "19.43781", "0", "1.25072", "0", "0.17001", "0", "18.01708",...
## $ `1988` <chr> "20.40363", "0", "1.35809", "0", "0.15967", "0", "18.88587",...
## $ `1989` <chr> "20.62571", "0", "1.35196", "0", "0.17359", "0", "19.10017",...
## $ `1990` <chr> "20.5602", "0", "1.21338", "0", "0.1694", "0", "19.17742", "...
## $ `1991` <chr> "20.4251", "0", "1.26457", "0", "0.15916", "0", "19.00137", ...
## $ `1992` <chr> "20.64672", "0", "1.32379", "0", "0.16584", "0", "19.15709",...
## $ `1993` <chr> "21.28219", "0", "1.22875", "0", "0.19118", "0", "19.86225",...
## $ `1994` <chr> "21.39631", "0", "1.24492", "0", "0.1836", "0", "19.96779", ...
## $ `1995` <chr> "21.64225", "0", "1.28479", "0", "0.20768", "0", "20.14978",...
## $ `1996` <chr> "22.57572", "0", "1.30032", "0", "0.25067", "0", "21.02473",...
## $ `1997` <chr> "23.20491", "0", "1.44933", "0", "0.26373", "0", "21.49186",...
## $ `1998` <chr> "23.5002", "0", "1.50985", "0", "0.26753", "0", "21.72283", ...
## $ `1999` <chr> "23.4747", "0", "1.505", "0", "0.28947", "0", "21.68023", "0...
## $ `2000` <chr> "24.55583", "0", "1.61651", "0", "0.29444", "0", "22.64488",...
## $ `2001` <chr> "23.62705", "0", "1.35444", "0", "0.32908", "0", "21.94353",...
## $ `2002` <chr> "23.69876", "0", "1.36876", "0", "0.36525", "0", "21.96475",...
## $ `2003` <chr> "24.17788", "0", "1.38766", "0", "0.41878", "0", "22.37144",...
## $ `2004` <chr> "24.36024", "0", "1.43684", "0", "0.31944", "0", "22.60396",...
## $ `2005` <chr> "24.6876", "0", "1.44948", "0", "0.39739", "0", "22.84073", ...
## $ `2006` <chr> "24.32174", "0", "1.42135", "0", "0.39244", "0", "22.50795",...
## $ `2007` <chr> "24.54746", "0", "1.38369", "0", "0.38911", "0", "22.77466",...
## $ `2008` <chr> "24.11993", "0", "1.37388", "0", "0.32008", "0", "22.42597",...
## $ `2009` <chr> "21.14803", "0", "1.14314", "0", "0.3365", "0", "19.66839", ...

Now, the data set has the column heads as years. However, the first column does not have a name but from looking at the data we know that it is related to places. This could be named anything but let us go with region.

colnames(coal)[1] <- "Region"

The new column name is now set to Region.

Another issue that comes up here is that the data is all in chr format. Given that this data is about consumption it would make more sense for it to be in numeric format.

as.numeric(as.character(coal$"1980"))
## Warning: NAs introduced by coercion
##   [1] 16.45179  0.00000  0.96156  0.00005  0.10239  0.00000 15.38779  0.42011
##   [9]  0.00000  0.00000  0.03476       NA  0.00000  0.00000  0.00000  0.00003
##  [17]  0.21533  0.00000  0.04761  0.10287  0.00003  0.00448  0.00000  0.00000
##  [25]  0.00000  0.00003  0.00000  0.00000  0.00000  0.00000  0.00061  0.00000
##  [33]  0.00000  0.00000  0.00003  0.00000  0.00000  0.00000  0.00000  0.00005
##  [41]  0.00000  0.00595  0.00000  0.00000  0.00000  0.00000  0.00003  0.00000
##  [49]  0.00000  0.00012  0.00818  0.00000  0.00000 19.61333  0.01721  0.13129
##  [57]  0.43307       NA  0.37186       NA  0.00000       NA  0.24766  0.00000
##  [65]  0.17864  1.52402       NA  0.58895  1.28681       NA  2.44026  2.98956
##  [73]  0.00000  0.15001  0.31629  0.00077  0.02852  0.45444  0.06929       NA
##  [81]  0.00003       NA  0.15234  0.03865  3.85271  0.01701  0.48174       NA
##  [89]       NA       NA  0.52915  0.06552  0.01251  0.25161  2.98343 11.45935
##  [97]       NA       NA       NA       NA 11.45935       NA       NA       NA
## [105]       NA       NA       NA       NA       NA       NA       NA       NA
## [113]  0.02787  0.00000  0.02743  0.00003  0.00014  0.00000  0.00000  0.00017
## [121]  0.00000       NA  0.00000  0.00000  0.00011  0.00000  0.00000  2.25210
## [129]  0.01703  0.00000  0.00000  0.00900  0.00000  0.00000  0.00000  0.00000
## [137]  0.00000  0.00000  0.00000  0.00000  0.00688  0.00000  0.00000  0.02266
## [145]  0.00000       NA  0.00000  0.00000  0.00000  0.00005  0.00000  0.00000
## [153]  0.00051  0.00000  0.00000  0.00003  0.00026  0.00172  0.00000  0.00016
## [161]  0.00003  0.02115  0.00795       NA  0.00056  0.00484  0.00000  0.00000
## [169]  0.00000  0.00000  0.00000  0.00000  0.00000  0.00000  2.05903  0.00000
## [177]  0.00454  0.00005  0.00000  0.00240  0.00000  0.00000  0.01539  0.07787
## [185] 19.67405  0.00289  0.00000  1.05036  0.00658  0.00003  0.00005  0.00156
## [193]  0.00000 12.29851  0.00000  0.00048  0.00000  0.00000  0.00000  0.00070
## [201]  2.09389  0.01560  2.09960  0.00000  1.11857  0.53252  0.00000  0.00011
## [209]  0.00205  0.00000  0.05696  0.00000  0.00214  0.00272  0.05685  0.00000
## [217]  0.04407  0.00000  0.01598  0.00000  0.00012  0.00000  0.00005  0.14675
## [225]  0.01682       NA  0.00000  0.00000  0.00000  0.10809  0.00000 69.89860

Although this gives the output for the 1980 year values as expected, this is not feasible to do when there are so many columns. This can be kept aside for now and can be dealt with after “gathering” the data into a tidy data set.

Tidy Data Set

With regard to the principles of a tidy set:

  • Column headers are values and not variable names
  • Multiple variables are stored in a single column
  • Variables are stored in both rows and columns
  • Multiple types of observational units are in the same table
  • A single observational unit is in multiple tables

With regard to the above criteria, the “coal” data set clearly violates requirement 1, and 3.

This would now require us to make changes to the above data set in such a manner that the coal data set would comply to the tidy data norms.

To tidy the data, we could use the gather() function. However, according to the documentation this has been retired and it is now recommended to replace this function usage with pivot_longer().1

For pivot_longer, we need the data, cols i.e. the columns that need to be gathered, and names_to to determine what the name of the new column would be.

coal_tidy <- pivot_longer(coal, cols = -Region, names_to = "Years")
head(coal_tidy)
## # A tibble: 6 x 3
##   Region        Years value   
##   <chr>         <chr> <chr>   
## 1 North America 1980  16.45179
## 2 North America 1981  16.98772
## 3 North America 1982  16.47546
## 4 North America 1983  17.12407
## 5 North America 1984  18.4267 
## 6 North America 1985  18.81819

The above table is in the format that we would like it to be. All the year dated columns are now clubbed under one variable. The values for each seem to show correctly. The places have a header of “Region” as determined previously.

The data has now been converted from one that is WIDE to LONG.

Now that we have all the values in one column, we can change the type of the data that is present in it - from chr to one that is numeric.

coal_tidy$value <- as.numeric(as.character(coal_tidy$value))
head(coal_tidy)
## # A tibble: 6 x 3
##   Region        Years value
##   <chr>         <chr> <dbl>
## 1 North America 1980   16.5
## 2 North America 1981   17.0
## 3 North America 1982   16.5
## 4 North America 1983   17.1
## 5 North America 1984   18.4
## 6 North America 1985   18.8

We now have a type of dblt for value which is numeric in nature. We can perform a similar function on the Years column to change it to numeric.

coal_tidy$Years <- as.integer(as.character(coal_tidy$Years))
head(coal_tidy)
## # A tibble: 6 x 3
##   Region        Years value
##   <chr>         <int> <dbl>
## 1 North America  1980  16.5
## 2 North America  1981  17.0
## 3 North America  1982  16.5
## 4 North America  1983  17.1
## 5 North America  1984  18.4
## 6 North America  1985  18.8

There is another issue that is stopping this data set from being “tidy”. The Regions column has counties and continents placed together. This violates the tidy data requirements assumption number 4 - Multiple types of observational units are in the same table.

Why is this an issue?

Look at the first two to three values in the Region column in the original data set coal. There is North America, Bermuda, and then Canada. Notice that there are two categories here - geographic regions, and countries. This specific thing is what violates requirement number 4.

This would lead to double counting while say performing a summarizing operation on the data. Thus it is necessary to clean this column.

By going through the data we know that the following values are not countries:

  • North America
  • Central & South America
  • Antarctica
  • Europe
  • Eurasia
  • Middle East
  • Africa
  • Asia & Oceania
  • World
noncountries <- c("North America", "Central & South America", "Antarctica", "Europe", 
                  "Eurasia", "Middle East", "Africa", "Asia & Oceania", "World")

Now this vector can be used to identify all the rows in the Region column which have these noncountries in them. This can be identified using match. Match will identify all the rows in which an instance of the noncountries exists and provides a vector which indicates the position of the matching element. To go one step further, we need a vector in terms of logicals so that we can easily subset the elements. This can be done through using the is.na() function. However, this will give it in a format where all the places in which the noncountries exist will be marked as false. To turn this to true we just have to add an “!” mark to “s.na()” function. Once this is done, we can just extract the row numbers of all the places where we have TRUE by using which().

matches_vector<- which(!is.na(match(coal_tidy$Region, noncountries)))
matches_vector

[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 [16] 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 [31] 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 [46] 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 [61] 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 [76] 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 [91] 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 [106] 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 [121] 2851 2852 2853 2854 2855 2856 2857 2858 2859 2860 2861 2862 2863 2864 2865 [136] 2866 2867 2868 2869 2870 2871 2872 2873 2874 2875 2876 2877 2878 2879 2880 [151] 3361 3362 3363 3364 3365 3366 3367 3368 3369 3370 3371 3372 3373 3374 3375 [166] 3376 3377 3378 3379 3380 3381 3382 3383 3384 3385 3386 3387 3388 3389 3390 [181] 3811 3812 3813 3814 3815 3816 3817 3818 3819 3820 3821 3822 3823 3824 3825 [196] 3826 3827 3828 3829 3830 3831 3832 3833 3834 3835 3836 3837 3838 3839 3840 [211] 5521 5522 5523 5524 5525 5526 5527 5528 5529 5530 5531 5532 5533 5534 5535 [226] 5536 5537 5538 5539 5540 5541 5542 5543 5544 5545 5546 5547 5548 5549 5550 [241] 6931 6932 6933 6934 6935 6936 6937 6938 6939 6940 6941 6942 6943 6944 6945 [256] 6946 6947 6948 6949 6950 6951 6952 6953 6954 6955 6956 6957 6958 6959 6960

Now to just pull out all the rows that are countries:

coal_tidy_countriesOnly <- coal_tidy[-matches_vector,]
head(coal_tidy_countriesOnly)
## # A tibble: 6 x 3
##   Region  Years value
##   <chr>   <int> <dbl>
## 1 Bermuda  1980     0
## 2 Bermuda  1981     0
## 3 Bermuda  1982     0
## 4 Bermuda  1983     0
## 5 Bermuda  1984     0
## 6 Bermuda  1985     0

Now we have “Bermuda” as the first value in Region indicating that we have removed all the noncountries from the data set.

Now to setup a similar table for the geographic regions:

coal_tidy_geographicRegionsOnly <- coal_tidy[matches_vector,]
head(coal_tidy_geographicRegionsOnly)
## # A tibble: 6 x 3
##   Region        Years value
##   <chr>         <int> <dbl>
## 1 North America  1980  16.5
## 2 North America  1981  17.0
## 3 North America  1982  16.5
## 4 North America  1983  17.1
## 5 North America  1984  18.4
## 6 North America  1985  18.8

The first value now is North America which indicates that this transformation is successful.

We have to note here that the matches_vector has “World” as one of the values. However, this again leads to double counting as mentioned before as all the geographic regions will count towards “World”.

coal_tidy_withoutWorld_geographicRegionsOnly <- coal_tidy_geographicRegionsOnly[-((which(coal_tidy_geographicRegionsOnly == "World"))),]
head(coal_tidy_withoutWorld_geographicRegionsOnly)
## # A tibble: 6 x 3
##   Region        Years value
##   <chr>         <int> <dbl>
## 1 North America  1980  16.5
## 2 North America  1981  17.0
## 3 North America  1982  16.5
## 4 North America  1983  17.1
## 5 North America  1984  18.4
## 6 North America  1985  18.8

This data set now has only geographic regions and no “World”.

With this we now have a tidy data set. This can be used to perform other types of data analysis.

Basic Exploratory Data Analysis

On a regional level:

 ggplot(coal_tidy_withoutWorld_geographicRegionsOnly) +
   geom_line(mapping = aes(x = Years, y = value, color = Region)) +
   labs(
     title = "Regional Coal Consumption",
     y = "Coal consumption values",
     x = "Years"
   ) +
   theme_minimal() +
   theme(legend.position = "bottom") +
   theme(legend.title = element_blank())

We can see that Asia & Oceania together have has an exponential growth in the consumption of coal while the other regions in general have shown a decrease in the consumption of the same.

A similar analysis can be done on a country level but the chart would be too messy and so will not be going into it.


  1. The course instruction uses gather().↩︎