Inequality and the Fallacy of Composition

Demystifying Inequality

Income inequality has become a topical discussion in recent years as growth and social welfare matters for countries receive greater scrutiny. Globalization and its relationship to wealth and poverty underlie much of this heightened interest. Before drawing inferences regarding inequality, it is helpful to first provide context by looking at research in the area and analyzing what this phenomenon is telling us.

Let's start with how income inequality is commonly measured by economists. The GINI coefficient is an internationally recognized measure of income dispersion which is widely used by economists and social scientists. It compares the observed income distribution with an idealized, perfectly equal one: graphically, it is the ratio of the area between the line of perfect equality and the observed curve to the total area under the line of perfect equality. It ranges from 0 (perfect equality) toward 1 (maximal inequality).
The GINI ratio is derived from the Lorenz curve, developed by Max O. Lorenz in 1905 for representing inequality of the wealth distribution.

This fundamental model seems to be a good fit for analysis of variance modeling, but visualizing the distribution is perhaps a more compelling way to convey what is happening between and within countries.

https://en.wikipedia.org/wiki/Lorenz_curve

The Lorenz Curve

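# a simple example of the Lorenz curve model
library(ineq)  # provides Lc()
# six example values, labelled A through F
income <- c(A=137, B=499, C=311, D=173, E=219, F=81)
# compute the Lorenz curve without plotting it yet (each value has weight 1)
bndry <- Lc(income, n = rep(1, length(income)), plot = FALSE)
plot(bndry, col="black", lty=1, lwd=3, main="Lorenz Curve", xlab="Cum % of population", ylab="Cum % of income")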
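
To connect the curve to the GINI coefficient, the following is a minimal base-R sketch (not part of the original analysis) that rebuilds the Lorenz points for the same six values and computes the coefficient as one minus twice the area under the curve, using the trapezoid rule. The helper names `lorenz_points` and `gini_from_lorenz` are illustrative assumptions.

```r
# Illustrative helpers; the names are assumptions, not from the original study.
lorenz_points <- function(x) {
  x <- sort(unname(x))                     # order values from smallest to largest
  data.frame(
    p = c(0, seq_along(x) / length(x)),    # cumulative share of population
    L = c(0, cumsum(x) / sum(x))           # cumulative share of income
  )
}

gini_from_lorenz <- function(x) {
  lc <- lorenz_points(x)
  # trapezoid rule for the area under the Lorenz curve
  area <- sum(diff(lc$p) * (head(lc$L, -1) + tail(lc$L, -1)) / 2)
  # the area under the line of perfect equality is 1/2, so the ratio is 1 - 2 * area
  1 - 2 * area
}

income <- c(A = 137, B = 499, C = 311, D = 173, E = 219, F = 81)
gini_from_lorenz(income)                   # roughly 0.31 for these six values
```

A perfectly equal input, such as `rep(5, 4)`, gives a coefficient of zero because the Lorenz curve then coincides with the line of perfect equality.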

Fallacy of Composition

A misperception that what is true for the part is necessarily true for the whole.

In the context of inequality what is true within countries over time is not necessarily true between countries. By analyzing representative countries this study provides a visualization of this distinction.

Inferences can then be drawn from this analysis to give us a narrative as to what is happening with inequality.

Caveats regarding Data and Reusable Code

This study references "Parametric Estimations of the World Distribution of Income" by Maxim Pinkovskiy (Massachusetts Institute of Technology) and Xavier Sala-i-Martin (Columbia University and NBER), Oct 11, 2009.

Global and India data sets and shape files were sourced from the World Bank. Many data sets that would be needed to make this a more complete analysis were not available, at least publicly. Missing data is therefore noted. Because of the processing and time constraints involved in rendering shapefiles, we show pre-rendered maps as png files, so the code is not completely reproducible.

US data sets and shape files were sourced from the US Census Bureau.

Github source files here!

Decreasing Inequality between Countries

Global trends actually show a decrease in inequality between countries over the past 40 years as non-OECD countries develop economically toward parity with OECD countries. Inequality is increasing, however, within many developing countries, as concentrations of wealth grow where there had been widespread poverty. (6 myths - Sala-i-Martin)

THE GLOBAL OUTLOOK

Global income inequality is Diminishing

[Maxim Pinkovskiy & Xavier Sala-i-Martin 2009]

Global Analysis from the World Bank

We download World Bank data and subset it to include GINI coefficients only for 1981 and 2013. Because India's GINI coefficient for 2005 is missing, we scrape the urban and rural GINI coefficients for India, compute a population-weighted GINI coefficient from them, and use that value to replace the NA.

GINI Between Countries - World Map 1981

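library(rworldmap)  # provides mapCountryData()
# n: the World Bank GINI data already joined to the world map (built in a chunk not shown here)
mapCountryData(n, nameColumnToPlot='1981GINI', mapTitle = "1981")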

GINI Between Countries - World Map 2013

mapCountryData(n, nameColumnToPlot= '2013GINI', mapTitle = "2013")

INDIA PROVINCES

India

The provincial GINI data for India is scraped as follows and then spread into one column per indicator. The spellings of some province names have to be changed so that they match the names listed in the shape file. worldbankdata stackexchange-ref

library(jsonlite)  # fromJSON()
library(tidyr)     # spread()
# download the 2005 provincial indicators and drop the columns we do not need
dat.india.province = subset(fromJSON("https://knoema.com/api/1.0/data/wiwuiff?Time=2005-2005&region=1000130,1000020,1000040,1000050,1000060,1000080,1000090,1000100,1000110,1000120,1000140,1000150,1000160,1000220,1000210,1000230,1000290,1000280,1000270,1000250&variable=1000130,1000140,1000070,1000080&Frequencies=A")$data, select = -c(Unit, Time, RegionId, Frequency, Scale))
# one row per province, one column per indicator
dat.india.province = data.frame(spread(dat.india.province, variable, Value))
colnames(dat.india.province)[2:5] = c("Ruralization (Percentage)", "Urbanization (Percentage)", "RuralGini", "UrbanGini")
# respell two province names so that they match the shape file
dat.india.province$region[8] = "Jammu and Kashmir"
dat.india.province$region[14] = "Odisha"

Because this dataset does not include any information for Telangana, which was split from Andhra Pradesh only in 2014, we copy the row for Andhra Pradesh, relabel it as Telangana, and then sort the states in alphabetical order.

dat.telangana = data.frame(region = 'Telangana', subset(dat.india.province, region == "Andhra Pradesh", select = -c(region)))
colnames(dat.telangana) = colnames(dat.india.province)
dat.india.province = rbind(dat.india.province, dat.telangana)
dat.india.province = dat.india.province[order(dat.india.province$region),]

Now, for each province, we compute the GINI coefficient by weighting the rural GINI coefficient with the percentage of the rural population and the urban GINI coefficient with the percentage of the urban population.

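# use the full column names; the short forms would rely on partial matching by `$`
dat.india.province$GINI = (dat.india.province$RuralGini * dat.india.province$`Ruralization (Percentage)` + dat.india.province$UrbanGini * dat.india.province$`Urbanization (Percentage)`)/100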
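
As a worked illustration of this weighting, here is a minimal sketch with made-up numbers (the province values below are hypothetical, not taken from the data set, and `weighted_gini` is a name chosen for this example). Note that a population-weighted average of the rural and urban coefficients reflects only the inequality within each group. It leaves out any gap between average rural and urban incomes, so it is best read as an approximation.

```r
# Hypothetical helper and values, for illustration only.
weighted_gini <- function(rural_gini, urban_gini, rural_pct, urban_pct) {
  # the two percentages are expected to sum to 100
  (rural_gini * rural_pct + urban_gini * urban_pct) / 100
}

# made-up province: 70% rural with GINI 0.30, 30% urban with GINI 0.40
weighted_gini(rural_gini = 0.30, urban_gini = 0.40, rural_pct = 70, urban_pct = 30)
# about 0.33
```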

Now, we read in the shapefile for India and plot the map of India and color each province according to its GINI coefficient.

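library(maptools)  # gpclibPermit(), readShapePoly(); maptools has since been retired, sf::st_read() is the usual replacement
library(ggplot2)   # fortify()
library(dplyr)     # rename()
gpclibPermit()
gpclibPermitStatus()
# read the level-1 (province) boundaries
map.ind.regions1 = readShapePoly("/Users/chittampalliyashaswini/Desktop/Yadu/IND_adm_shp/IND_adm1.shp", proj4string=CRS("+proj=longlat +datum=NAD27"))
# flatten the polygons into a data frame keyed by province name
map.ind.regions1 = fortify(map.ind.regions1, region = "NAME_1")
map.ind.regions1 = rename(map.ind.regions1,x=long,y=lat)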

library(RColorBrewer)  # brewer.pal()
library(gridExtra)     # grid.arrange()
library(grid)          # textGrob(), gpar()
# one palette per map
mycolors = brewer.pal(9,"BrBG")
plot1 = ggplot(data=dat.india.province) + geom_map(aes(fill=GINI, map_id=region),map=map.ind.regions1) + expand_limits(map.ind.regions1) + coord_map("polyconic") + theme_bw() + scale_fill_gradientn(name="GINI", colours = mycolors) + theme(legend.justification=c(1,0),legend.position=c(1,0),legend.background=element_rect(colour="black"))

mycolors2 = brewer.pal(9,"OrRd")
plot2 = ggplot(data=dat.india.provincecomp) + geom_map(aes(fill=`Rural Gini Coefficient Percent Change`, map_id=region),map=map.ind.regions1) + expand_limits(map.ind.regions1) + coord_map("polyconic") + theme_bw() + scale_fill_gradientn(name="Rural GINI % Change", colours = mycolors2) + theme(legend.justification=c(1,0),legend.position=c(1,0),legend.background=element_rect(colour="black"))

mycolors3 = brewer.pal(9,"Blues")
plot3 = ggplot(data=dat.india.provincecomp) + geom_map(aes(fill=`Urban Gini Coefficient Percent Change`, map_id=region),map=map.ind.regions1) + expand_limits(map.ind.regions1) + coord_map("polyconic") + theme_bw() + scale_fill_gradientn(name="Urban GINI % Change", colours = mycolors3) + theme(legend.justification=c(1,0),legend.position=c(1,0),legend.background=element_rect(colour="black"))

# stack the three maps in a single column
grid.arrange(plot1, plot2, plot3, top = textGrob("Maps of India", gp = gpar(fontface = "bold")), ncol = 1, nrow = 3)

GGplot mapping GINI data to India's Provinces

gi <- ggplot(data=dat.india.province) +
      scale_fill_gradientn(name="Coverage", colours = mycolors)
gi <- gi + geom_map(
                map=map.ind.regions1,
                aes(fill=GINI, map_id=region)
  ) +
  expand_limits(map.ind.regions1) +
  theme_bw()
gi <- gi + coord_map("polyconic") +
      labs(title="India 2005 GINI Index by Province range (0 -> 1) ",x="",y="") +
      theme_bw() +
      theme(legend.justification=c(1,0),legend.position=c(1,0),
            legend.background=element_rect(colour="black"))
gi <- gi + guides(fill=guide_legend(title="GINI",nrow=3,title.position="top",
                title.hjust=0.5,title.theme=element_text(face="bold",angle=0)))
gi <- gi + scale_x_continuous("") + scale_y_continuous("")
gi

India GINI 2005 map by Province

Image

India GINI% Change 1974 - 2005 Urbanization

Image

India GINI% Change 1974 - 2005 Ruralization

Image

India Urbanization Regression Analysis

ggplot(dat.india.province, aes(x = `Urbanization (Percentage)`, y = GINI)) + geom_point(color = "red") + geom_smooth(method = "lm")

lm(GINI ~ `Urbanization (Percentage)`, data = dat.india.province)

## 
## Call:
## lm(formula = GINI ~ `Urbanization (Percentage)`, data = dat.india.province)
## 
## Coefficients:
##                 (Intercept)  `Urbanization (Percentage)`  
##                    0.037315                     0.006261

summary(lm(GINI ~ `Urbanization (Percentage)`, data = dat.india.province))

## 
## Call:
## lm(formula = GINI ~ `Urbanization (Percentage)`, data = dat.india.province)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.055301 -0.019227 -0.003567  0.023605  0.066789 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 0.0373150  0.0204036   1.829   0.0832 .  
## `Urbanization (Percentage)` 0.0062608  0.0007829   7.997 1.69e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03048 on 19 degrees of freedom
## Multiple R-squared:  0.7709, Adjusted R-squared:  0.7589 
## F-statistic: 63.95 on 1 and 19 DF,  p-value: 1.686e-07
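
One way to read this output is to check how its pieces fit together. The sketch below is not part of the original analysis; it only reuses the numbers printed above to confirm that the t value is the estimate divided by its standard error and that, with a single predictor, the F-statistic is the square of that t value.

```r
# values copied from the regression output above
estimate  <- 0.0062608
std_error <- 0.0007829

t_value <- estimate / std_error   # close to the reported 7.997
f_value <- t_value^2              # close to the reported 63.95 (one predictor)

# two-sided p-value on the 19 residual degrees of freedom
p_value <- 2 * pt(-abs(t_value), df = 19)

c(t = t_value, F = f_value, p = p_value)
```

The slope itself says that, across these provinces, each additional percentage point of urbanization is associated with a GINI about 0.006 higher.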

India Ruralization Regression Analysis

ggplot(dat.india.province, aes(x=`Ruralization (Percentage)`,y=GINI)) + geom_point(color="red") + geom_smooth(method= "lm")

lm(GINI ~ `Ruralization (Percentage)`, data = dat.india.province)

## 
## Call:
## lm(formula = GINI ~ `Ruralization (Percentage)`, data = dat.india.province)
## 
## Coefficients:
##                 (Intercept)  `Ruralization (Percentage)`  
##                    0.023372                     0.004427

summary(lm(GINI ~ `Ruralization (Percentage)`, data = dat.india.province))

## 
## Call:
## lm(formula = GINI ~ `Ruralization (Percentage)`, data = dat.india.province)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.055002 -0.014365  0.007769  0.014592  0.037581 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 0.0233720  0.0187340   1.248    0.227    
## `Ruralization (Percentage)` 0.0044273  0.0004687   9.446 1.31e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02668 on 19 degrees of freedom
## Multiple R-squared:  0.8244, Adjusted R-squared:  0.8152 
## F-statistic: 89.23 on 1 and 19 DF,  p-value: 1.308e-08

THE UNITED STATES

The United States has comparatively high GINI Indices.

The Census Bureau publishes data sets which track the GINI index at different levels of geographic granularity, including region, state, congressional district and metropolitan statistical area. This study analyzes the income dispersion within the United States using Census data and, in particular, income data collected for the American Community Survey. The Census Bureau provides the following tool for acquiring data sets:

http://factfinder.census.gov/faces/nav/jsf/pages/guided_search.xhtml

Boundary Conditions

This is an observational study of data collected by US Census Bureau surveyors. The presumption is that each observation is an independent event of objective fact. The Census Bureau's survey techniques rely on sampling, so the initial data set is based to a degree on statistical inference and imputed data.

All data used in this study was sourced from the American Community Survey published by the US Census Bureau. Four distinct data sets were generated using the Census Bureau's utility. Except for the Regional data set, each data set has more than 30 independent observations. It is therefore expected that a near-normal sampling distribution applies to the data collected.

*   50 States
*  436 Congressional Districts
*  916 Gini Indices by Metropolitan Statistical Area
* 3143 Counties

US STATES, COUNTIES & CITIES

US GINI Between States

Image

US GINI Between Counties

Image

No Index for Urbanization? Make One.

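library(maps)     # world.cities; also needed by map_data()
library(ggplot2)  # map_data()
library(dplyr)
# US cities with more than 100,000 inhabitants
US.cities <- world.cities[world.cities$country.etc == "USA",]
US.cities <- US.cities %>%
    filter(pop > 100000)
# approximate each county's centre by averaging its boundary points
# (county names repeat across states, so grouping by subregion alone pools same-named counties)
US.counties <- map_data('county')
US.counties <-  US.counties %>%
        group_by(subregion) %>%
        summarise(avglat = mean(lat), avglong = mean(long))
# pair every county with every city (cross join)
xprod <- merge(x = US.counties, y = US.cities, by = NULL)
# distance, in degrees, from each county to its nearest large city
US.counties <- xprod %>%
          mutate(dist = sqrt((avglat - lat)^2 + (avglong - long)^2)) %>%
          select (subregion, name, dist) %>%
          group_by(subregion) %>%
          summarize(disttocity=min(dist))
colnames(US.counties)[1] <- "GEOLABEL"
# reduce the census labels to lower-case county names without " county" or the state
gini.data$GEOLABEL <- gsub("(.*),.*", "\\1",x = gsub(pattern = " county", replacement = "", x = tolower(gini.data$GEOLABEL)))
US.counties <- inner_join(US.counties, gini.data)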

## Joining by: "GEOLABEL"
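
The distance used for this index is a straight-line distance in degrees of latitude and longitude, which treats a degree of longitude as the same length everywhere. As a possible refinement (an assumption on our part, not what the study used), a great-circle distance in kilometres can be computed with the haversine formula. The function name `haversine_km` and the mean Earth radius of 6371 km are choices made for this sketch.

```r
# Great-circle distance in km between points given in decimal degrees.
# Illustrative alternative to the degree-based distance; not the study's method.
haversine_km <- function(lat1, lon1, lat2, lon2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * r * asin(pmin(1, sqrt(a)))
}

# one degree of longitude is shorter at 60 degrees north than at the equator
haversine_km(0, 0, 0, 1)
haversine_km(60, 0, 60, 1)

# The function is vectorised, so it could stand in for the distance in the
# pipeline, e.g. mutate(dist = haversine_km(avglat, avglong, lat, long))
```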

Urbanization Correlation

# refer to the columns explicitly instead of attaching the data frame
plot(US.counties$disttocity, US.counties$GINI, main="Scatterplot GINI - Urbanization", xlab="DIST ", ylab="GINI ", pch=19)
abline(lm(GINI ~ disttocity, data = US.counties), col="red") # regression line (y~x)

summary(lm(GINI ~ `disttocity`, data = US.counties))

## 
## Call:
## lm(formula = GINI ~ disttocity, data = US.counties)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.104423 -0.022168 -0.002624  0.019762  0.212409 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.4410044  0.0010441 422.381   <2e-16 ***
## disttocity  -0.0011277  0.0006838  -1.649   0.0992 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.034 on 2968 degrees of freedom
## Multiple R-squared:  0.0009155,  Adjusted R-squared:  0.0005789 
## F-statistic:  2.72 on 1 and 2968 DF,  p-value: 0.09921

US CONGRESSIONAL DISTRICTS & MSAs

Congressional Districts (Political Boundaries)

Image

MSA (Economic Boundaries)

Metropolitan Statistical Area

CLEAN-UP

Connect to Postgres GINI db

This will be the data store for data frames that need to persist.

library(RPostgreSQL)  # dbDriver(), dbConnect(), dbWriteTable()
# assign connection parameters and connect to the gini db in Postgres
dbname <- "gini"
dbuser <- "postgres"
dbpass <- "postgres"
dbhost <- "localhost"
dbport <- 5432
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, host=dbhost, port=dbport, dbname=dbname,user=dbuser, password=dbpass)
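
The connection above hard-codes the user name and password. A minimal sketch of an alternative is to read them from environment variables, falling back to the values used here. The variable names (`GINI_DB_NAME` and so on) and the helper `get_db_params` are hypothetical.

```r
# Hypothetical helper: the environment variable names are assumptions.
get_db_params <- function() {
  list(
    dbname   = Sys.getenv("GINI_DB_NAME", unset = "gini"),
    user     = Sys.getenv("GINI_DB_USER", unset = "postgres"),
    password = Sys.getenv("GINI_DB_PASS", unset = "postgres"),
    host     = Sys.getenv("GINI_DB_HOST", unset = "localhost"),
    port     = as.integer(Sys.getenv("GINI_DB_PORT", unset = "5432"))
  )
}

params <- get_db_params()
# The result could then be passed to the existing call, for example:
# con <- dbConnect(drv, host = params$host, port = params$port,
#                  dbname = params$dbname, user = params$user,
#                  password = params$password)
```

Keeping credentials out of the script makes it easier to share the source files without exposing a password.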

ETL-store Map data for Persistence

# Delete any existing table with the same name:
if(dbExistsTable(con,"finalworlddata")) {dbRemoveTable(con,"finalworlddata")}

## [1] TRUE

# Finally write a new table:
dbWriteTable(con,"finalworlddata", finalworlddata)

## [1] TRUE

# Delete any existing table with the same name:
if(dbExistsTable(con,"dat.india.province")) {dbRemoveTable(con,"dat.india.province")}

## [1] TRUE

# Finally write a new table:
dbWriteTable(con,"dat.india.province", dat.india.province)

## [1] TRUE

# Delete any existing table with the same name:
if(dbExistsTable(con,"US.counties")) {dbRemoveTable(con,"US.counties")}

## [1] TRUE

# Finally write a new table:
dbWriteTable(con,"US.counties", county.data)

## [1] TRUE

if(dbExistsTable(con,"US.cities")) {dbRemoveTable(con,"US.cities")}

## [1] TRUE

# Finally write a new table:
dbWriteTable(con,"US.cities", US.cities)

## [1] TRUE

Conclusions

The main narrative with GINI is that inequality is falling between countries while increasing within particular countries. The reason for this is that concentrations of wealth are forming within growing countries, which globally means non-OECD countries are catching up to their OECD counterparts in relative wealth. Persistent inequality within a country may be a strong indicator that significant potential for growth is being forgone. In the near term, however, high rates of inequality may simply mean that economic growth is uneven across sectors of the economy as developing countries experience rapid growth.
Analysis with non-crowd-sourced data (unlike, for example, sentiment analysis studies) requires an authoritative data source with complete records, which can be difficult to find. In this study, World Bank data of historic GINI surveys presented us with many missing data challenges. India alone hasn't surveyed GINI since 2005, partly because of data collection issues in many provinces.
The US Census Bureau provides a very complete set of data across a wide spectrum of topics, including those of economic, social, and political interest, and provides this data at varying levels of granularity.
In addition to standard data, the Census Bureau provides geospatial data corresponding to most of its data sets.
The relationship between inequality and urbanization is inconclusive based upon this study. In India, inequality appears to be highly correlated with both the urban and the rural population percentages (R-squared of 0.77 and 0.82). In the US, distance to the nearest large city appears to be uncorrelated with inequality (R-squared below 0.001).

THE END