In this article, I record the graph creation aspects of a recent project, essentially as a memory aid, so my future self would find it easier to relearn what my present self knows how to do.
I downloaded the Excel files available at the Reserve Bank of India’s Handbook of Statistics on Indian States 2017-18 (HSIS_RBI) and collected the data therein into a single tidy data file. This file – called HSIS_RBI_2018_data.csv – is available in the following GitHub repository: https://github.com/UdayanRoy62/HSIS_RBI_2018. For the purposes of the R code below, I assume that the file “HSIS_RBI_2018_data.csv” is in the working directory.
I load the crucial tidyverse package and read my data into R:
library(tidyverse)
HSIS_RBI_2018 <- read_csv("HSIS_RBI_2018_data.csv", na = "--")
dplyr package for data preparation
In this exercise, I will focus on a particular variable – the Per Capita Net Domestic Product at Factor Cost in Constant Prices (NDPPC_cons) – and see how it varies across India’s various regions and across time (1993 through 2017). This variable is the inflation-adjusted per-person income, and as such it is a crucial measure of the economic vigor of a region. The data are available in Table 14 of HSIS_RBI. But a great deal of “massaging” will be necessary before the exploratory data analysis can begin.
First, I use the dplyr::select() function to select only the five NDPPC_cons variables plus three basic ones from the massive HSIS_RBI_2018 data frame:
(NDPPC_cons <- select(HSIS_RBI_2018, Region, Region_Ab, year, NDPPC_cons_0405:NDPPC_cons_9900))
## # A tibble: 1,665 x 8
## Region Region_Ab year NDPPC_cons_0405 NDPPC_cons_1112 NDPPC_cons_8081
## <chr> <chr> <int> <int> <int> <int>
## 1 Andam~ AN 1990 NA NA 2580
## 2 Andhr~ AP 1990 NA NA 2060
## 3 Aruna~ AR 1990 NA NA 2709
## 4 Assam AS 1990 NA NA 1544
## 5 Bihar BR 1990 NA NA 1197
## 6 Chand~ CH 1990 NA NA NA
## 7 Chhat~ CT 1990 NA NA NA
## 8 Dadra~ DN 1990 NA NA NA
## 9 Daman~ DD 1990 NA NA NA
## 10 Delhi DL 1990 NA NA 5447
## # ... with 1,655 more rows, and 2 more variables: NDPPC_cons_9394 <int>,
## # NDPPC_cons_9900 <int>
The data massaging mentioned above will consist of using the five NDPPC_cons variables to build as long a per capita income series as possible in constant 2011 prices. Data for years 2011 through 2017 are available in 2011 prices as NDPPC_cons_1112; so for these years NDPPC_cons_1112 can be used as is, with no modification.
But data for 2004-10 are in the NDPPC_cons_0405 variable, in 2004 prices. We need to convert it to 2011 prices. Luckily, NDPPC_cons data for 2011 are available from two variables: NDPPC_cons_1112 and NDPPC_cons_0405. One can compare the two and calculate, for each Region, the multiple by which NDPPC_cons_1112 exceeds NDPPC_cons_0405 in the year 2011. Then, one can multiply each Region’s NDPPC_cons_0405 data for the years 2004 through 2010 by that Region’s multiple to convert the per capita income from 2004 prices to 2011 prices. (For example, if NDPPC_cons_1112 for some region A is 3 times NDPPC_cons_0405 for region A in 2011, then one can obtain per capita income for 2004-10 for region A in 2011 prices by simply multiplying the NDPPC_cons_0405 data by 3.)
Similarly, by exploiting the availability of NDPPC_cons data for 2004 in the variables NDPPC_cons_0405 and NDPPC_cons_9900, I scaled the NDPPC_cons data for the years 1999 through 2003, first from 1999 prices to 2004 prices, and then, as in the previous paragraph, from 2004 prices to 2011 prices.
Finally, by exploiting the availability of NDPPC_cons data for 1999 in the variables NDPPC_cons_9900 and NDPPC_cons_9394, I scaled the NDPPC_cons data for the years 1993 through 1998, first from 1993 prices to 1999 prices, and then, as in the previous paragraphs, from 1999 prices to 2004 prices, and finally from 2004 prices to 2011 prices.
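The chain-linking described in the last three paragraphs can be sketched on a toy series. The numbers below are made up purely for illustration (they are not taken from HSIS_RBI), and the names `toy`, `old_base`, `new_base`, `multiple` and `spliced` are hypothetical. The sketch uses base R only and handles a single region and a single splice.

```r
# Hypothetical toy data: one region, two overlapping constant-price series.
# `old_base` is in 2004 prices and ends in 2011;
# `new_base` is in 2011 prices and starts in 2011.
toy <- data.frame(
  year     = 2009:2013,
  old_base = c(100, 110, 120, NA, NA),
  new_base = c(NA, NA, 180, 190, 200)
)

# The multiple is the ratio of the two series in the overlap year (2011).
multiple <- with(toy, new_base[year == 2011] / old_base[year == 2011])

# Use the new-base series where it exists; otherwise rescale the old-base series.
toy$spliced <- ifelse(is.na(toy$new_base), toy$old_base * multiple, toy$new_base)

print(toy)
```

Here the multiple is 180/120 = 1.5, so the 2009 and 2010 values become 150 and 165 in 2011 prices. Splicing further back simply repeats the step, multiplying the successive multiples together.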
The stitching work below uses the dplyr package, which is part of the tidyverse package that was loaded above.
My exploitation of overlaps in the available data series is, however, not possible for the years 1990 through 1992. So, for my purposes, the observations for those years are of no use.
Moreover, there is no NDPPC_cons data for several regions. Consequently, the observations (i.e., rows) for those regions can be removed.
I use the dplyr::filter() function to keep only the observations I need.
## To keep only those observations in `NDPPC_cons` for which the condition `year > 1992` is TRUE I use:
NDPPC_cons <- filter(NDPPC_cons, year > 1992)
NDPPC_cons <- filter(NDPPC_cons, !(Region_Ab %in% c("RC", "RE", "RNE", "RN", "RS", "RW", "IN")))
Next, I calculate the “multiples” and scaling factors discussed above. Here we see use of the %>% or pipe operator of the magrittr package and the mutate function of the dplyr package.
NDPPC0410 <- NDPPC_cons %>% filter(year == 2011) %>% mutate(multiple.04.11 = NDPPC_cons_1112/NDPPC_cons_0405) %>% select(Region, Region_Ab, multiple.04.11)
NDPPC9903 <- NDPPC_cons %>% filter(year == 2004) %>% mutate(multiple.99.04 = NDPPC_cons_0405/NDPPC_cons_9900) %>% select(Region, Region_Ab, multiple.99.04)
NDPPC9398 <- NDPPC_cons %>% filter(year == 1999) %>% mutate(multiple.93.99 = NDPPC_cons_9900/NDPPC_cons_9394) %>% select(Region, Region_Ab, multiple.93.99)
multiples <- list(NDPPC0410,NDPPC9903,NDPPC9398) %>% reduce(full_join)
## Joining, by = c("Region", "Region_Ab")
## Joining, by = c("Region", "Region_Ab")
multiples <- multiples %>% mutate(
scale_1117 = 1,
scale_0410 = scale_1117*multiple.04.11,
scale_9903 = scale_0410*multiple.99.04,
scale_9398 = scale_9903*multiple.93.99
)
NDPPC_cons <- NDPPC_cons %>% left_join(multiples)
## Joining, by = c("Region", "Region_Ab")
Finally, I build per capita income for the years 1993 through 2017, all in constant 2011 prices:
NDPPC_cons$NDPPC_cons <- ifelse (NDPPC_cons$year %in% 1993:1998, NDPPC_cons$NDPPC_cons_9394*NDPPC_cons$scale_9398, NA)
NDPPC_cons$NDPPC_cons <- ifelse (NDPPC_cons$year %in% 1999:2003, NDPPC_cons$NDPPC_cons_9900*NDPPC_cons$scale_9903, NDPPC_cons$NDPPC_cons)
NDPPC_cons$NDPPC_cons <- ifelse (NDPPC_cons$year %in% 2004:2010, NDPPC_cons$NDPPC_cons_0405*NDPPC_cons$scale_0410, NDPPC_cons$NDPPC_cons)
NDPPC_cons$NDPPC_cons <- ifelse (NDPPC_cons$year %in% 2011:2017, NDPPC_cons$NDPPC_cons_1112*NDPPC_cons$scale_1117, NDPPC_cons$NDPPC_cons)
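The four `ifelse()` calls above could probably be collapsed into a single `dplyr::case_when()` call inside `mutate()`. The following is only a sketch on made-up toy data, with two base-year series instead of four; the column names `cons_0405`, `cons_1112` and `cons` are hypothetical stand-ins for the real ones.

```r
library(dplyr)

# Hypothetical toy rows: two base-year series plus their scaling factors.
toy <- tibble(
  year       = c(2009L, 2010L, 2011L, 2012L),
  cons_0405  = c(100, 110, 120, NA),
  cons_1112  = c(NA, NA, 180, 190),
  scale_0410 = 1.5,
  scale_1117 = 1
)

# Pick the series and scaling factor that match each year's range.
toy <- toy %>%
  mutate(cons = case_when(
    year %in% 2004:2010 ~ cons_0405 * scale_0410,
    year %in% 2011:2017 ~ cons_1112 * scale_1117,
    TRUE                ~ NA_real_   # years outside every range stay missing
  ))

print(toy)
```

The conditions are checked in order, so each year picks up the first range it falls into, which mirrors the logic of the chained `ifelse()` calls.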
Next, I keep only the variables necessary for subsequent work:
(NDPPC.cons.small <- NDPPC_cons %>% select(Region, Region_Ab, year, NDPPC_cons))
## # A tibble: 900 x 4
## Region Region_Ab year NDPPC_cons
## <chr> <chr> <int> <dbl>
## 1 Andaman and Nicobar Islands AN 1993 46168.
## 2 Andhra Pradesh AP 1993 28188.
## 3 Arunachal Pradesh AR 1993 39054.
## 4 Assam AS 1993 27630.
## 5 Bihar BR 1993 10350.
## 6 Chandigarh CH 1993 74928.
## 7 Chhattisgarh CT 1993 30447.
## 8 Dadra and Nagar Haveli DN 1993 NA
## 9 Daman and Diu DD 1993 NA
## 10 Delhi DL 1993 72386.
## # ... with 890 more rows
ggplot(data = NDPPC.cons.small, mapping = aes(x = year, y = NDPPC_cons)) +
geom_line(aes(color = Region_Ab)) +
labs(
y = "Rupees",
title = "Per capita net domestic product of India's states and union territories, 1993 -- 2017",
subtitle = "in constant 2011-12 rupees",
caption = "Source: Reserve Bank of India"
)
## Warning: Removed 148 rows containing missing values (geom_path).
When there are so many lines – one for each of those regions – the legend is not very effective unless you have eyes that are great at distinguishing between ten different shades of green! I wish I knew how to label each line graph.
I don’t know how to control the size of the graph.
I don’t know how to compel ggplot2 to show all years on the horizontal axis (rather than just two).
I don’t know how to compel ggplot2 to show the Rupees values in normal notation. The chart uses scientific notation. (Just multiply the number before the “e” by 10 raised to the power of the number after the “e”. For example, 1e+05 means \(1\times 10^5=100,000\) rupees at 2011 purchasing power.)
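One possible way to address the axis questions, sketched on made-up toy data rather than on NDPPC.cons.small: `scale_x_continuous()` accepts a `breaks` argument for the tick positions, and `scale_y_continuous()` accepts a `labels` function such as `scales::comma`. The data frame `toy` and the two-year tick spacing are assumptions chosen only for illustration.

```r
library(ggplot2)

# Hypothetical toy data standing in for NDPPC.cons.small.
toy <- data.frame(
  year       = rep(1993:2017, times = 2),
  Region_Ab  = rep(c("AA", "BB"), each = 25),
  NDPPC_cons = c(seq(20000, 140000, length.out = 25),
                 seq(10000, 60000, length.out = 25))
)

p <- ggplot(toy, aes(x = year, y = NDPPC_cons, color = Region_Ab)) +
  geom_line() +
  # Ask for a tick every two years instead of letting ggplot2 choose.
  scale_x_continuous(breaks = seq(1993, 2017, by = 2)) +
  # Print 100,000 rather than 1e+05.
  scale_y_continuous(labels = scales::comma) +
  labs(y = "Rupees")

p
```

As for the size of the graph, if the document is knitted from R Markdown, the chunk options `fig.width` and `fig.height` (in inches) should control it; when saving to a file, `ggsave()` takes `width` and `height` arguments.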
gghighlight package
library(gghighlight)
ggplot(data = NDPPC.cons.small, mapping = aes(x = year, y = log(NDPPC_cons)) ) +
geom_line(aes(color = Region_Ab)) + gghighlight(label_key = Region_Ab)
## Warning: Removed 148 rows containing missing values (geom_path).
## Warning: Removed 148 rows containing missing values (geom_path).
## Warning: Removed 27 rows containing missing values (geom_label_repel).
I couldn’t get gghighlight to show all Region_Ab labels. The package uses its own judgement to throw away labels when things get crowded.
I want to add lines for the lowest per capita income, the highest per capita income, and the (unweighted) average per capita income for every year. This requires aggregating the observations across all regions, one year at a time. This uses the dplyr::summarise() command in combination with the dplyr::group_by() command.
## Compute all three summaries in one pass. This also avoids creating objects named mean, max and min, which share their names with base R functions.
NDPPC_cons_year <- NDPPC.cons.small %>% group_by(year) %>% summarise(
mean = mean(NDPPC_cons, na.rm = TRUE),
max = max(NDPPC_cons, na.rm = TRUE),
min = min(NDPPC_cons, na.rm = TRUE)
)
Now, the data prepared above can be graphed:
ggplot(data = NDPPC_cons_year, mapping = aes(x = year) ) +
geom_line(mapping = aes(y = mean)) +
geom_line(mapping = aes(y = max)) +
geom_line(mapping = aes(y = min))
ggplot(data = NDPPC.cons.small, mapping = aes(x = year, y = NDPPC_cons)) + geom_line(aes(color = Region_Ab)) +
geom_line(data = NDPPC_cons_year, mapping = aes(x = year, y = mean)) +
geom_line(data = NDPPC_cons_year, mapping = aes(x = year, y = max)) +
geom_line(data = NDPPC_cons_year, mapping = aes(x = year, y = min))
## Warning: Removed 148 rows containing missing values (geom_path).
First, more data massaging! I need to calculate the NDPPC_cons for every region and year but as a percentage of that region’s NDPPC_cons in 1993.
NDPPC_cons_1993 <- NDPPC.cons.small %>% group_by(Region_Ab) %>% summarise(NDPPC_cons_1993 = NDPPC_cons[year == 1993])
NDPPC.cons.small <- NDPPC.cons.small %>% left_join(NDPPC_cons_1993)
## Joining, by = "Region_Ab"
NDPPC.cons.small <- NDPPC.cons.small %>% mutate(NDPPC_scaled = 100*NDPPC_cons/NDPPC_cons_1993)
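The same rescaling can probably be done without the intermediate join, by using a grouped `mutate()`. This is a sketch on made-up toy data; it assumes every region has exactly one row for 1993 (possibly with a missing value), as is the case for the data frame above.

```r
library(dplyr)

# Hypothetical toy data: two made-up regions, three years each.
toy <- tibble(
  Region_Ab  = rep(c("AA", "BB"), each = 3),
  year       = rep(1993:1995, times = 2),
  NDPPC_cons = c(100, 110, 121, 200, 210, 230)
)

# Within each region, divide every year's value by that region's 1993 value.
toy <- toy %>%
  group_by(Region_Ab) %>%
  mutate(NDPPC_scaled = 100 * NDPPC_cons / NDPPC_cons[year == 1993]) %>%
  ungroup()

print(toy)
```

Each region's series starts at 100 in 1993, so the lines in the subsequent graph are directly comparable as cumulative growth.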
Now, the rescaled data can be graphed.
ggplot(data = NDPPC.cons.small, mapping = aes(x = year, y = NDPPC_scaled)) + geom_line(aes(color = Region_Ab)) +
labs(
y = "Per capita income as percent of 1993 level",
title = "Per capita net domestic product of India's states and union territories",
subtitle = "1993 -- 2017"
)
## Warning: Removed 178 rows containing missing values (geom_path).
Here, the tricky part is that not every region’s per capita income data begins in 1993 and ends in 2017. I needed to calculate the compound annual growth rate for each region over whatever span of years its data cover.
NDPPC.cons.small <- select(NDPPC_cons, Region, Region_Ab, year, NDPPC_cons)
NDPPC.cons.small <- na.omit(NDPPC.cons.small)
NDPPC.cons.growth <- NDPPC.cons.small %>% group_by(Region_Ab) %>% summarise(
year.first = min(year),
year.last = max(year),
NDPPC.cons.first = NDPPC_cons[year == min(year)],
NDPPC.cons.last = NDPPC_cons[year == max(year)],
growth = ((NDPPC_cons[year == max(year)]/NDPPC_cons[year == min(year)])^(1/(max(year) - min(year)))) - 1)
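The growth formula in the last line of the code above is the usual compound annual growth rate: the ratio of the last to the first value, raised to the power of one over the number of years elapsed, minus one. A small base R sketch with made-up numbers follows; `cagr` is a hypothetical helper name, not a function from any package.

```r
# Compound annual growth rate between a first and a last observation.
# `cagr` is a hypothetical helper, written here only for illustration.
cagr <- function(first, last, n_years) {
  (last / first)^(1 / n_years) - 1
}

# Made-up example: income doubles over the 24 years from 1993 to 2017.
g <- cagr(first = 100, last = 200, n_years = 2017 - 1993)

round(100 * g, 2)   # about 2.93 percent per year
```

Compounding 100 at rate `g` for 24 years gives back 200, which is a quick way to check the formula.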
Economic theories of growth in national income often debate a phenomenon called absolute convergence: For any extended period of time, the regions that start poorer will grow faster, thereby closing the gap over time. This absolute convergence has been observed for regions within economically integrated areas, such as the states of the U.S., the prefectures of Japan, and the members of the O.E.C.D.
Sadly, absolute convergence has not been observed for the countries of the world or even for the states of India. I confirm this below for my India data.
ggplot(data = NDPPC.cons.growth, mapping = aes(x = NDPPC.cons.first, y = growth)) +
geom_point() + geom_smooth(method = "lm", se = FALSE) +
labs(
y = "Annual growth rate of per capita income",
x = "Per capita income at beginning of data period"
)
No convergence is apparent for the post-1993 period! Had the scatter points been arrayed from the top-left down to the bottom-right in a roughly inverse relation, we could have claimed absolute convergence. For India, if anything, the regions that were richer to begin with have grown faster in the post-1993 period, thereby widening the inter-region gaps.
For the labeling, I used the suggestions here: http://www.sthda.com/english/wiki/ggplot2-texts-add-text-annotations-to-a-graph-in-r-software.
library(ggrepel)
ggplot(data = NDPPC.cons.growth, mapping = aes(x = NDPPC.cons.first, y = growth, label = Region_Ab)) +
geom_point() + geom_smooth(method = "lm", se = FALSE) + geom_text_repel() +
labs(
y = "Annual growth rate of per capita income",
x = "Per capita income at beginning of data period"
)
I use the classification in Table 107 of India’s states and union territories into zones to see if there has been convergence in states that are close to each other in geographical terms.
(region.zone <- read_csv("region_zone.csv"))
## Parsed with column specification:
## cols(
## Region = col_character(),
## Region_Ab = col_character(),
## Zone = col_character()
## )
## # A tibble: 36 x 3
## Region Region_Ab Zone
## <chr> <chr> <chr>
## 1 Andaman and Nicobar Islands AN East
## 2 Andhra Pradesh AP South
## 3 Arunachal Pradesh AR North East
## 4 Assam AS North East
## 5 Bihar BR East
## 6 Chandigarh CH North
## 7 Chhattisgarh CT Central
## 8 Dadra and Nagar Haveli DN West
## 9 Daman and Diu DD West
## 10 Delhi DL North
## # ... with 26 more rows
Now each dot in Graphs 6 and 7 can be given a zone-specific color.
NDPPC.cons.growth <- left_join(NDPPC.cons.growth, region.zone)
## Joining, by = "Region_Ab"
ggplot(data = NDPPC.cons.growth, mapping = aes(x = NDPPC.cons.first, y = growth)) +
geom_point(mapping = aes(color = Zone)) +
labs(
y = "Annual growth rate of per capita income",
x = "Per capita income at beginning of data period"
)
This helps a bit in the investigation of the issue of absolute convergence within zones, but it is still somewhat confusing. There may be a better way.
ggplot(data = NDPPC.cons.growth, mapping = aes(x = NDPPC.cons.first, y = growth, label = Region_Ab)) +
geom_point() + geom_text_repel() +
geom_smooth(method = "lm", se = FALSE) + facet_wrap(~ Zone) +
labs(
y = "Annual growth rate of per capita income",
x = "Per capita income at beginning of data period"
)
Nope, no convergence within geographical zones either!
First, let’s express growth rates in percentage points rather than in decimal form:
NDPPC.cons.growth <- mutate(NDPPC.cons.growth, growth = 100*growth)
Now comes the graph:
NDPPC.cons.growth$Region <- factor(NDPPC.cons.growth$Region, levels = NDPPC.cons.growth$Region[order(NDPPC.cons.growth$growth)])
ggplot(data = NDPPC.cons.growth, mapping = aes(x = Region, y = growth)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(x = NULL,
y = "Annual growth rate of per capita income")
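An alternative to rebuilding the factor by hand might be base R's `reorder()`, which returns a factor whose levels are sorted by a second variable and can be used directly inside `aes()`. The sketch below uses made-up growth rates for three fictional regions.

```r
library(ggplot2)

# Hypothetical toy growth rates (percent per year) for three made-up regions.
toy <- data.frame(
  Region = c("Region A", "Region B", "Region C"),
  growth = c(5.1, 3.2, 4.4)
)

# reorder() sorts the factor levels by `growth`, lowest first.
sorted_levels <- levels(reorder(toy$Region, toy$growth))
print(sorted_levels)

p <- ggplot(toy, aes(x = reorder(Region, growth), y = growth)) +
  geom_col() +          # equivalent to geom_bar(stat = "identity")
  coord_flip() +
  labs(x = NULL, y = "Annual growth rate of per capita income")

p
```

With `coord_flip()`, the region with the highest growth rate ends up at the top of the chart, as in the graph above.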
Different colored bars for different zones:
ggplot(data = NDPPC.cons.growth, mapping = aes(x = Region, y = growth)) +
geom_bar(stat = "identity", mapping = aes(fill = Zone)) +
labs(x = NULL,
y = "Annual growth rate of per capita income") +
coord_flip()
Done, for now!