An R Learner’s Diary: Exploratory Graphics with ggplot2

In this article, I record the graph creation aspects of a recent project, essentially as a memory aid, so that my future self will find it easier to relearn what my present self knows how to do.

Data Source

I downloaded the Excel files available at the Reserve Bank of India’s Handbook of Statistics on Indian States 2017-18 (HSIS_RBI) and collected the data therein into a single tidy data file. This file – called HSIS_RBI_2018_data.csv – is available in the following GitHub repository: https://github.com/UdayanRoy62/HSIS_RBI_2018. For the purposes of the R code below, I assume that the file “HSIS_RBI_2018_data.csv” is in the working directory.

I load the crucial tidyverse package and read my data into R:

library(tidyverse)
HSIS_RBI_2018 <- read_csv("HSIS_RBI_2018_data.csv", na = "--")

Use of the `dplyr` package for data preparation

In this exercise, I will focus on a particular variable – the Per Capita Net Domestic Product at Factor Cost in Constant Prices (NDPPC_cons) – and see how it varies across India’s various regions and across time (1993 through 2017). This variable is the inflation-adjusted per-person income, and as such it is a crucial measure of the economic vigor of a region. The data are available in Table 14 of HSIS_RBI. But a great deal of “massaging” will be necessary before the exploratory data analysis can begin.

First, I use the dplyr::select() function to select only the five NDPPC_cons variables plus three basic ones from the massive HSIS_RBI_2018 data frame:

(NDPPC_cons <- select(HSIS_RBI_2018, Region, Region_Ab, year, NDPPC_cons_0405:NDPPC_cons_9900))

## # A tibble: 1,665 x 8
##    Region Region_Ab  year NDPPC_cons_0405 NDPPC_cons_1112 NDPPC_cons_8081
##    <chr>  <chr>     <int>           <int>           <int>           <int>
##  1 Andam~ AN         1990              NA              NA            2580
##  2 Andhr~ AP         1990              NA              NA            2060
##  3 Aruna~ AR         1990              NA              NA            2709
##  4 Assam  AS         1990              NA              NA            1544
##  5 Bihar  BR         1990              NA              NA            1197
##  6 Chand~ CH         1990              NA              NA              NA
##  7 Chhat~ CT         1990              NA              NA              NA
##  8 Dadra~ DN         1990              NA              NA              NA
##  9 Daman~ DD         1990              NA              NA              NA
## 10 Delhi  DL         1990              NA              NA            5447
## # ... with 1,655 more rows, and 2 more variables: NDPPC_cons_9394 <int>,
## #   NDPPC_cons_9900 <int>

Region is the name of an Indian state or union territory.
Region_Ab is the two-letter abbreviated name. I have used the abbreviations listed here.
year denotes India’s fiscal year. So, 1994 is April 1, 1994 through March 31, 1995.
NDPPC_cons_8081 is the per-person income series measured in constant 1980 (fiscal year or FY) prices. To take another example …
NDPPC_cons_1112 is the per-person income series measured in constant 2011 prices.

The data massaging mentioned above will consist of using the five NDPPC_cons variables to build as long a per capita income series as possible in constant 2011 prices. Data for years 2011 through 2017 are available in 2011 prices as NDPPC_cons_1112; so for these years NDPPC_cons_1112 can be used as is, with no modification.

But data for 2004-10 are in the NDPPC_cons_0405 variable, in 2004 prices. We need to convert them to 2011 prices. Luckily, NDPPC_cons data for 2011 are available from two variables: NDPPC_cons_1112 and NDPPC_cons_0405. One can compare the two and calculate, for each Region, the multiple by which NDPPC_cons_1112 exceeds NDPPC_cons_0405 in the year 2011. Then, one can multiply each Region’s NDPPC_cons_0405 data for the years 2004 through 2010 by that Region’s 2011 multiple to convert per capita income in 2004 prices to per capita income in 2011 prices. (For example, if NDPPC_cons_1112 for some region A is 3 times NDPPC_cons_0405 for region A in 2011, then one can obtain per capita income for 2004-10 for region A in 2011 prices by simply multiplying the NDPPC_cons_0405 data by 3.)

Similarly, by exploiting the availability of NDPPC_cons data for 2004 in the variables NDPPC_cons_0405 and NDPPC_cons_9900, I scaled the NDPPC_cons data for the years 1999 through 2003, first from 1999 prices to 2004 prices, and then, as in the previous paragraph, from 2004 prices to 2011 prices.

Finally, by exploiting the availability of NDPPC_cons data for 1999 in the variables NDPPC_cons_9900 and NDPPC_cons_9394, I scaled the NDPPC_cons data for the years 1993 through 1998, first from 1993 prices to 1999 prices, and then, as in the previous paragraphs, from 1999 prices to 2004 prices, and finally from 2004 prices to 2011 prices.
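
The chained rescaling described above can be sketched with made-up numbers for a single region. Everything in this sketch (the values and the variable names) is hypothetical and only illustrates the arithmetic; it is not the RBI data.

```r
# Hypothetical per capita income for one region, in different price bases.
# Overlap years: 2011 (2004 and 2011 prices) and 2004 (1999 and 2004 prices).
inc_1112_in_2011 <- 360   # 2011 value in 2011 prices
inc_0405_in_2011 <- 120   # 2011 value in 2004 prices
inc_0405_in_2004 <- 75    # 2004 value in 2004 prices
inc_9900_in_2004 <- 50    # 2004 value in 1999 prices

# Multiples computed from the overlap years
multiple_04_11 <- inc_1112_in_2011 / inc_0405_in_2011
multiple_99_04 <- inc_0405_in_2004 / inc_9900_in_2004

# Cumulative scale factors that carry each series to 2011 prices
scale_0410 <- multiple_04_11
scale_9903 <- scale_0410 * multiple_99_04

# A 2004-price value of 100 and a 1999-price value of 40, both re-expressed in 2011 prices
value_2009_in_2011_prices <- 100 * scale_0410
value_2001_in_2011_prices <- 40 * scale_9903
```

The point of the chaining is that a 1999-price value has to pass through the 2004 overlap before it can use the 2011 overlap, so its scale factor is the product of the two multiples.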

The stitching below uses the dplyr package, which is part of the tidyverse package that was loaded above.

My exploitation of overlaps in the available data series is, however, not possible for the years 1990 through 1992. So, for my purposes, the observations for those years are of no use.

Moreover, there is no NDPPC_cons data for several regions. Consequently, the observations (i.e., rows) for those regions can be removed.

I use the dplyr::filter() function to keep only the observations I need.

## Keep only the observations in `NDPPC_cons` for which the condition `year > 1992` is TRUE:
NDPPC_cons <- filter(NDPPC_cons, year > 1992)
## Drop the rows for the regions that have no NDPPC_cons data:
NDPPC_cons <- filter(NDPPC_cons, !(Region_Ab %in% c("RC", "RE", "RNE", "RN", "RS", "RW", "IN")))

Next, I calculate the “multiples” and scaling factors discussed above. Here we see use of the %>% or pipe operator of the magrittr package and the mutate function of the dplyr package.

NDPPC0410 <- NDPPC_cons %>% filter(year == 2011) %>% mutate(multiple.04.11 = NDPPC_cons_1112/NDPPC_cons_0405) %>% select(Region, Region_Ab, multiple.04.11)
NDPPC9903 <- NDPPC_cons %>% filter(year == 2004) %>% mutate(multiple.99.04 = NDPPC_cons_0405/NDPPC_cons_9900) %>% select(Region, Region_Ab, multiple.99.04)
NDPPC9398 <- NDPPC_cons %>% filter(year == 1999) %>% mutate(multiple.93.99 = NDPPC_cons_9900/NDPPC_cons_9394) %>% select(Region, Region_Ab, multiple.93.99)

multiples <- list(NDPPC0410,NDPPC9903,NDPPC9398) %>% reduce(full_join)

## Joining, by = c("Region", "Region_Ab")
## Joining, by = c("Region", "Region_Ab")

multiples <- multiples %>% mutate(
  scale_1117 = 1,
  scale_0410 = scale_1117*multiple.04.11,
  scale_9903 = scale_0410*multiple.99.04,
  scale_9398 = scale_9903*multiple.93.99
)
NDPPC_cons <- NDPPC_cons %>% left_join(multiples)

## Joining, by = c("Region", "Region_Ab")

Finally, I build per capita income for the years 1993 through 2017, all in constant 2011 prices:

NDPPC_cons$NDPPC_cons <- ifelse (NDPPC_cons$year %in% 1993:1998, NDPPC_cons$NDPPC_cons_9394*NDPPC_cons$scale_9398, NA)
NDPPC_cons$NDPPC_cons <- ifelse (NDPPC_cons$year %in% 1999:2003, NDPPC_cons$NDPPC_cons_9900*NDPPC_cons$scale_9903, NDPPC_cons$NDPPC_cons)
NDPPC_cons$NDPPC_cons <- ifelse (NDPPC_cons$year %in% 2004:2010, NDPPC_cons$NDPPC_cons_0405*NDPPC_cons$scale_0410, NDPPC_cons$NDPPC_cons)
NDPPC_cons$NDPPC_cons <- ifelse (NDPPC_cons$year %in% 2011:2017, NDPPC_cons$NDPPC_cons_1112*NDPPC_cons$scale_1117, NDPPC_cons$NDPPC_cons)
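
The same splice could perhaps be written in one step with dplyr::case_when(). This is only a sketch on a tiny made-up data frame: the tibble `toy` and all of its values are hypothetical, and only the column names mirror those used above.

```r
library(dplyr)

# Hypothetical data: one row per price-base period, same column names as above
toy <- tibble(
  year            = c(1995, 2001, 2007, 2014),
  NDPPC_cons_9394 = c(10, NA, NA, NA),
  NDPPC_cons_9900 = c(NA, 20, NA, NA),
  NDPPC_cons_0405 = c(NA, NA, 30, NA),
  NDPPC_cons_1112 = c(NA, NA, NA, 40),
  scale_9398 = 8, scale_9903 = 4, scale_0410 = 2, scale_1117 = 1
)

# Pick the series and scale factor that match each year's price base;
# years outside all four ranges would be left as NA
spliced <- toy %>%
  mutate(NDPPC_cons = case_when(
    year %in% 1993:1998 ~ NDPPC_cons_9394 * scale_9398,
    year %in% 1999:2003 ~ NDPPC_cons_9900 * scale_9903,
    year %in% 2004:2010 ~ NDPPC_cons_0405 * scale_0410,
    year %in% 2011:2017 ~ NDPPC_cons_1112 * scale_1117
  ))
```

Compared with four successive ifelse() calls, this keeps the whole rule in one place, at the cost of relying on case_when() returning NA for unmatched rows.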

Next, I keep only the variables necessary for subsequent work:

(NDPPC.cons.small <- NDPPC_cons %>% select(Region, Region_Ab, year, NDPPC_cons))

## # A tibble: 900 x 4
##    Region                      Region_Ab  year NDPPC_cons
##    <chr>                       <chr>     <int>      <dbl>
##  1 Andaman and Nicobar Islands AN         1993     46168.
##  2 Andhra Pradesh              AP         1993     28188.
##  3 Arunachal Pradesh           AR         1993     39054.
##  4 Assam                       AS         1993     27630.
##  5 Bihar                       BR         1993     10350.
##  6 Chandigarh                  CH         1993     74928.
##  7 Chhattisgarh                CT         1993     30447.
##  8 Dadra and Nagar Haveli      DN         1993        NA 
##  9 Daman and Diu               DD         1993        NA 
## 10 Delhi                       DL         1993     72386.
## # ... with 890 more rows

Graphs

Graph 1

ggplot(data = NDPPC.cons.small, mapping = aes(x = year, y = NDPPC_cons)) + 
  geom_line(aes(color = Region_Ab)) +
  labs(
    y = "Rupees",
    title = "Per capita net domestic product of India's states and union territories, 1993 -- 2017",
    subtitle = "in constant 2011-12 rupees",
    caption = "Source: Reserve Bank of India"
  )

## Warning: Removed 148 rows containing missing values (geom_path).

Graph 1 Frustrations

When there are so many lines – one for each of those regions – the legend is not very effective unless you have eyes that are great at distinguishing between ten different shades of green! I wish I knew how to label each line graph.

I don’t know how to control the size of the graph.

I don’t know how to compel ggplot2 to show all years on the horizontal axis (rather than just two).

I don’t know how to compel ggplot2 to show the Rupees values in normal notation. The chart uses scientific notation. (Just multiply the number before the “e” by 10 raised to the power of the number after the “e”. For example, 1e+05 means \(1\times 10^5=100,000\) rupees at 2011 purchasing power.)
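
Two of these frustrations might be eased with ggplot2's scale functions. The sketch below is a guess at one workable approach, not something from the original project: the data frame `toy` and its two regions are made up, and the choice of a tick every second year is arbitrary.

```r
library(ggplot2)

# Hypothetical stand-in for NDPPC.cons.small, with two made-up regions
toy <- data.frame(
  year       = rep(1993:2017, times = 2),
  Region_Ab  = rep(c("AA", "BB"), each = 25),
  NDPPC_cons = c(seq(20000, 140000, length.out = 25),
                 seq(10000, 60000, length.out = 25))
)

p <- ggplot(toy, aes(x = year, y = NDPPC_cons, color = Region_Ab)) +
  geom_line() +
  # ask for a tick every second year instead of the default breaks
  scale_x_continuous(breaks = seq(1993, 2017, by = 2)) +
  # scales::comma writes 1e+05 as "100,000"
  scale_y_continuous(labels = scales::comma) +
  labs(y = "Rupees")
```

Using breaks = 1993:2017 would show every year, though the labels may then crowd each other. As for the size of the graph, the fig.width and fig.height chunk options in R Markdown and the width and height arguments of ggsave() are the usual places to look.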

Graph 2: Same as Graph 1 but using a log scale and fancy labels from the `gghighlight` package

library(gghighlight)
ggplot(data = NDPPC.cons.small, mapping = aes(x = year, y = log(NDPPC_cons)) ) + 
  geom_line(aes(color = Region_Ab)) + gghighlight(label_key = Region_Ab)

## Warning: Removed 148 rows containing missing values (geom_path).

## Warning: Removed 148 rows containing missing values (geom_path).

## Warning: Removed 27 rows containing missing values (geom_label_repel).

Graph 2 Frustrations

I couldn’t get gghighlight to show all Region_Ab labels. The package uses its own judgement to throw away labels when things get crowded.
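
One possible workaround, sketched here under my own assumptions, is to skip gghighlight and place a plain geom_text() label at the last observation of every line. No labels are dropped this way, although they can overlap. The data frame `toy` is made up, and scale_y_log10() (a base-10 log axis that keeps rupee values on the ticks) stands in for the log() transformation used above.

```r
library(ggplot2)

# Hypothetical data: three made-up regions, three years each
toy <- data.frame(
  year       = rep(2015:2017, times = 3),
  Region_Ab  = rep(c("AA", "BB", "CC"), each = 3),
  NDPPC_cons = c(100, 110, 121, 50, 60, 72, 200, 210, 220)
)

# Keep the last available year for each region (base R)
last_obs <- do.call(rbind, lapply(split(toy, toy$Region_Ab),
                                  function(d) d[which.max(d$year), ]))

p <- ggplot(toy, aes(x = year, y = NDPPC_cons, group = Region_Ab)) +
  geom_line() +
  # one label per region, nudged to the right of the line's end point
  geom_text(data = last_obs, aes(label = Region_Ab), hjust = -0.3) +
  scale_y_log10() +
  scale_x_continuous(breaks = 2015:2017) +
  # leave some room on the right for the labels
  expand_limits(x = 2017.5)
```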

Graph 3 Showing the lowest, the highest, and the average per capita income for every year

I want to add lines for the lowest per capita income, the highest per capita income, and the (unweighted) average per capita income for every year. This requires aggregating the observations for all regions, but one year at a time. This uses the dplyr::summarise() command in combination with the dplyr::group_by() command.

mean <- NDPPC.cons.small %>% group_by(year) %>% summarise(mean = mean(NDPPC_cons, na.rm = TRUE))
max  <- NDPPC.cons.small %>% group_by(year) %>% summarise(max = max(NDPPC_cons, na.rm = TRUE))
min  <- NDPPC.cons.small %>% group_by(year) %>% summarise(min = min(NDPPC_cons, na.rm = TRUE))
NDPPC_cons_year <- list(mean,max,min) %>% reduce(full_join)

## Joining, by = "year"
## Joining, by = "year"
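
As an aside, the three yearly summaries could probably be computed in a single summarise() call, which would make the joins unnecessary. This is a sketch on made-up data; the tibble `toy` and the column names avg_income, max_income and min_income are hypothetical.

```r
library(dplyr)

# Hypothetical data: three regions observed in two years, with one missing value
toy <- tibble(
  year       = c(1993, 1993, 1993, 1994, 1994, 1994),
  NDPPC_cons = c(10, 20, NA, 30, 50, 70)
)

# One row per year, one column per summary statistic
by_year <- toy %>%
  group_by(year) %>%
  summarise(
    avg_income = mean(NDPPC_cons, na.rm = TRUE),
    max_income = max(NDPPC_cons, na.rm = TRUE),
    min_income = min(NDPPC_cons, na.rm = TRUE)
  )
```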

Now, the data prepared above can be graphed:

ggplot(data = NDPPC_cons_year, mapping = aes(x = year) ) + 
  geom_line(mapping = aes(y = mean)) +
  geom_line(mapping = aes(y = max)) +
  geom_line(mapping = aes(y = min))

Graph 4 Combining Graphs 1 and 3

ggplot(data = NDPPC.cons.small, mapping = aes(x = year, y = NDPPC_cons)) + geom_line(aes(color = Region_Ab)) +
  geom_line(data = NDPPC_cons_year, mapping = aes(x = year, y = mean)) +
  geom_line(data = NDPPC_cons_year, mapping = aes(x = year, y = max)) +
  geom_line(data = NDPPC_cons_year, mapping = aes(x = year, y = min))

## Warning: Removed 148 rows containing missing values (geom_path).

Graph 5: Create a graph showing percent change from 1993

First, more data massaging! I need to calculate the NDPPC_cons for every region and year but as a percentage of that region’s NDPPC_cons in 1993.

NDPPC_cons_1993 <- NDPPC.cons.small %>% group_by(Region_Ab) %>% summarise(NDPPC_cons_1993 = NDPPC_cons[year == 1993])
NDPPC.cons.small <- NDPPC.cons.small %>% left_join(NDPPC_cons_1993)

## Joining, by = "Region_Ab"

NDPPC.cons.small <- NDPPC.cons.small %>% mutate(NDPPC_scaled = 100*NDPPC_cons/NDPPC_cons_1993)
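
A grouped mutate() might do the same rescaling without the intermediate join. The sketch below uses a made-up tibble `toy` and assumes that every region has exactly one row for 1993; with a missing or duplicated 1993 row it would not behave as intended.

```r
library(dplyr)

# Hypothetical data: two made-up regions, three years each
toy <- tibble(
  Region_Ab  = c("AA", "AA", "AA", "BB", "BB", "BB"),
  year       = c(1993, 1994, 1995, 1993, 1994, 1995),
  NDPPC_cons = c(100, 110, 150, 40, 50, 60)
)

# Within each region, divide every year's value by that region's 1993 value
scaled <- toy %>%
  group_by(Region_Ab) %>%
  mutate(NDPPC_scaled = 100 * NDPPC_cons / NDPPC_cons[year == 1993]) %>%
  ungroup()
```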

Now, the rescaled data can be graphed.

ggplot(data = NDPPC.cons.small, mapping = aes(x = year, y = NDPPC_scaled)) + geom_line(aes(color = Region_Ab)) +
  labs(
    y = "Per capita income as percent of 1993 level",
    title = "Per capita net domestic product of India's states and union territories",
    subtitle = "1993 -- 2017"
  )

## Warning: Removed 178 rows containing missing values (geom_path).

Calculate growth rates of per capita income

Here, the tricky part is that not every region’s per capita income data begins in 1993 and ends in 2017. I needed to calculate the annual compound growth rate for each region over whatever years its data cover.

NDPPC.cons.small <- select(NDPPC_cons, Region, Region_Ab, year, NDPPC_cons)
NDPPC.cons.small <- na.omit(NDPPC.cons.small)
NDPPC.cons.growth <- NDPPC.cons.small %>% group_by(Region_Ab) %>% summarise(
  year.first = min(year),
  year.last = max(year),
  NDPPC.cons.first = NDPPC_cons[year == min(year)],
  NDPPC.cons.last = NDPPC_cons[year == max(year)],
  growth = ((NDPPC_cons[year == max(year)]/NDPPC_cons[year == min(year)])^(1/(max(year) - min(year)))) - 1)
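
The growth expression above is the usual compound annual growth rate: the ratio of the last value to the first, raised to one over the number of years elapsed, minus one. A minimal check with made-up numbers follows; the helper `cagr` is a hypothetical name introduced only for this sketch.

```r
# Compound annual growth rate between a first and a last value
# that are n_years apart (hypothetical helper, for illustration only)
cagr <- function(first, last, n_years) {
  (last / first)^(1 / n_years) - 1
}

# Income that rises from 100 to 121 over two years grows 10% a year,
# because 100 * 1.1 * 1.1 = 121
g <- cagr(first = 100, last = 121, n_years = 2)
```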

Graph 6 Plot growth and initial income to investigate convergence

Economic theories of growth in national income often debate a phenomenon called absolute convergence: For any extended period of time, the regions that start poorer will grow faster, thereby closing the gap over time. This absolute convergence has been observed for regions within economically integrated areas, such as the states of the U.S., the prefectures of Japan, and the members of the O.E.C.D.

Sadly, absolute convergence has not been observed for the countries of the world or even for the states of India. I confirm this below for my India data.

ggplot(data = NDPPC.cons.growth, mapping = aes(x = NDPPC.cons.first, y = growth)) +
  geom_point() + geom_smooth(method = "lm", se = FALSE) +
  labs(
    y = "Annual growth rate of per capita income",
    x = "Per capita income at beginning of data period"
  )

No convergence is apparent for the post-1993 period! Had the scatter points been arrayed from the top-left down to the bottom-right in a roughly inverse relation, we could have claimed absolute convergence. For India, if anything, the regions that were richer to begin with have grown faster in the post-1993 period, thereby widening the inter-region gaps.
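
The visual impression can be backed by the sign of the fitted slope, which is what geom_smooth(method = "lm") draws. The sketch below uses four made-up regions (the data frame `toy` and its values are hypothetical, not the Indian data) in which the richer regions grew faster, so the slope comes out positive; absolute convergence would require a negative slope.

```r
# Hypothetical data: initial income and subsequent annual growth rate
toy <- data.frame(
  initial_income = c(10000, 20000, 30000, 40000),
  growth         = c(0.03, 0.04, 0.045, 0.06)
)

# Straight-line fit of growth on initial income
fit <- lm(growth ~ initial_income, data = toy)

# A negative slope would point to convergence, a positive one to divergence
slope <- unname(coef(fit)["initial_income"])
```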

Graph 7 Same as Graph 6 but with labels for each plot point’s region

For the labeling, I used the suggestions here: http://www.sthda.com/english/wiki/ggplot2-texts-add-text-annotations-to-a-graph-in-r-software.

library(ggrepel)
ggplot(data = NDPPC.cons.growth, mapping = aes(x = NDPPC.cons.first, y = growth, label = Region_Ab)) +
  geom_point() + geom_smooth(method = "lm", se = FALSE) + geom_text_repel() +
  labs(
    y = "Annual growth rate of per capita income",
    x = "Per capita income at beginning of data period"
  )

Graph 8 Maybe there’s convergence in geographical zones?

I use the classification of India’s states and union territories into zones given in Table 107 to see if there has been convergence among states that are geographically close to each other.

(region.zone <- read_csv("region_zone.csv"))

## Parsed with column specification:
## cols(
##   Region = col_character(),
##   Region_Ab = col_character(),
##   Zone = col_character()
## )

## # A tibble: 36 x 3
##    Region                      Region_Ab Zone      
##    <chr>                       <chr>     <chr>     
##  1 Andaman and Nicobar Islands AN        East      
##  2 Andhra Pradesh              AP        South     
##  3 Arunachal Pradesh           AR        North East
##  4 Assam                       AS        North East
##  5 Bihar                       BR        East      
##  6 Chandigarh                  CH        North     
##  7 Chhattisgarh                CT        Central   
##  8 Dadra and Nagar Haveli      DN        West      
##  9 Daman and Diu               DD        West      
## 10 Delhi                       DL        North     
## # ... with 26 more rows

Now each dot in Graphs 6 and 7 can be given a zone-specific color.

NDPPC.cons.growth <- left_join(NDPPC.cons.growth, region.zone)

## Joining, by = "Region_Ab"

ggplot(data = NDPPC.cons.growth, mapping = aes(x = NDPPC.cons.first, y = growth)) +
  geom_point(mapping = aes(color = Zone)) +
  labs(
    y = "Annual growth rate of per capita income",
    x = "Per capita income at beginning of data period"
  )

This helps a bit in the investigation of the issue of absolute convergence within zones, but it is still somewhat confusing. There may be a better way.

Graph 9: Use of facets

ggplot(data = NDPPC.cons.growth, mapping = aes(x = NDPPC.cons.first, y = growth, label = Region_Ab)) +
  geom_point() + geom_text_repel() + 
  geom_smooth(method = "lm", se = FALSE) + facet_wrap(~ Zone) +
  labs(
    y = "Annual growth rate of per capita income",
    x = "Per capita income at beginning of data period"
  )

Nope, no convergence within geographical zones either!

Graph 10: Growth Bar Chart, states arranged in descending order of growth rate

First, let’s express growth rates in percent rather than in decimal form:

NDPPC.cons.growth <- mutate(NDPPC.cons.growth, growth = 100*growth)

Now comes the graph:

NDPPC.cons.growth$Region <- factor(NDPPC.cons.growth$Region, levels = NDPPC.cons.growth$Region[order(NDPPC.cons.growth$growth)])
ggplot(data = NDPPC.cons.growth, mapping = aes(x = Region, y = growth)) +
  geom_bar(stat = "identity") + 
  coord_flip() +
  labs(x = NULL,
    y = "Annual growth rate of per capita income")
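
The bar order could probably also be set inside aes() with base R's reorder(), without rewriting the Region column as a factor first. This is a sketch on made-up data (the data frame `toy` and its region names are hypothetical); geom_col() is used as shorthand for geom_bar(stat = "identity").

```r
library(ggplot2)

# Hypothetical data: three made-up regions and their growth rates
toy <- data.frame(
  Region = c("B-land", "A-land", "C-land"),
  growth = c(2, 3, 1)
)

# reorder() sorts the levels of Region by growth, lowest first;
# after coord_flip() the fastest grower ends up at the top
p <- ggplot(toy, aes(x = reorder(Region, growth), y = growth)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Annual growth rate of per capita income")

# The level order that the plot will use
ordered_levels <- levels(reorder(toy$Region, toy$growth))
```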

Graph 11: Same as Graph 10 but with zone-specific information

Different colored bars for different zones:

ggplot(data = NDPPC.cons.growth, mapping = aes(x = Region, y = growth)) +
  geom_bar(stat = "identity", mapping = aes(fill = Zone)) + 
  labs(x = NULL,
       y = "Annual growth rate of per capita income") +
  coord_flip()

Done, for now!

Udayan Roy

September 4, 2018
