Is the dispersion of income within the United States different across regional, political and economic geographic boundaries? Answering this question has growth and governance implications for our country, particularly to the degree that income concentrated within an economy implies a corresponding deficit in growth potential. To address the latter question, the former must be analyzed first.
The Gini index is an internationally recognized measure of income dispersion within a specified geographic area. Income inequality, which the index quantifies, has also been a topical discussion in recent years.
The Census Bureau publishes data sets which track the Gini index at different levels of geographic granularity, including region, state, congressional district and metropolitan statistical area. This study analyzes income dispersion within the United States using Census data, in particular income data collected for the American Community Survey. The Census Bureau provides the following tool for acquiring data sets:
http://factfinder.census.gov/faces/nav/jsf/pages/guided_search.xhtml
This is an observational study of data collected by surveyors from the US Census Bureau. The presumption is that each observation is an independent event of objective fact. The Census Bureau’s survey techniques rely on sampling, so the initial data set is based to a degree on statistical inference and imputed data.
All data used in this study were sourced from the American Community Survey published by the US Census Bureau. Four distinct data sets were generated using the Census Bureau’s utility. Except for the Regional data set, all data sets have more than 30 independent observations. It is therefore expected that a near-normal sampling distribution applies to the data collected.
* Gini Indices by Region
* Gini Indices by State
* Gini Indices by Congressional District
* Gini Indices by Metropolitan Statistical Area
The following cases correspond to each geographic level data set above:
* 4 Regions: Northeast, Midwest, South and West + US Overall
* 50 States
* 436 Congressional Districts
* 916 Metropolitan Statistical Areas
The following Variables will be collected for analysis within each data set:
* Categorical: Geographic Level
* Numeric: Gini Index
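# packages used below: dplyr/tidyr for tidying, knitr for kable(),
# psych for describe(), ggplot2 for plotting
library(dplyr)
library(tidyr)
library(knitr)
library(psych)
library(ggplot2)
# REGION
lr <- read.csv(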
"/Users/scottkarr/IS607Spring2016/project2/more/GINI-2014-Region-untidy.csv",
sep=",",
na.strings = "",
blank.lines.skip = TRUE,
col.names = c("Quintile", "West", "South", "Midwest","Northeast", "US Overall"),
stringsAsFactors=FALSE
)
dfr <- data.frame(lr)
# STATE
ls <- read.csv(
"/Users/scottkarr/IS607Spring2016/project2/more/GINI-2014-State.csv",
sep=",",
na.strings = "",
blank.lines.skip = TRUE,
col.names = c("GeoID", "ID", "State", "Gini","MOE"),
stringsAsFactors=FALSE
)
dfs <- data.frame(ls)
# CONGRESSIONAL DISTRICT
lc <- read.csv(
"/Users/scottkarr/IS607Spring2016/project2/more/GINI-2014-CongDistrict.csv",
sep=",",
na.strings = "",
blank.lines.skip = TRUE,
col.names = c("GeoID", "ID", "CongDistrict", "Gini","MOE"),
stringsAsFactors=FALSE
)
dfc <- data.frame(lc)
# METROPOLITAN STATISTICAL AREA
lm <- read.csv(
"/Users/scottkarr/IS607Spring2016/project2/more/GINI-2014-MSA.csv",
sep=",",
na.strings = "",
blank.lines.skip = TRUE,
col.names = c("GeoID", "ID", "MSA", "Gini","MOE"),
stringsAsFactors=FALSE
)
dfm <- data.frame(lm)
Exploratory data analysis: tidy the data, then compute summary statistics and visualize each data set to see what they suggest about the research question.
#REGION
# remove extraneous rows
# derived fields can be calculated from raw data
dfr <- dfr[-c(1,7),]
# gather morphs data from wide to long format
df_tidyr <- dfr %>%
gather(Region, Gini, -Quintile) %>%
arrange(Quintile, Region, Gini)
# organize the final data sets
df_tidyr <- df_tidyr %>%
select(Region, Quintile, Gini) %>%
arrange(Region, Quintile, Gini)
#STATE
# remove extraneous rows
# derived fields can be calculated from raw data
dfs <- dfs[-c(1,5),]
# organize the final data sets
df_tidys <- dfs %>%
select(State, Gini) %>%
arrange(State, Gini)
df_tidys$Gini <- as.numeric(df_tidys$Gini)
#CONGRESSIONAL DISTRICT
# remove extraneous rows
# derived fields can be calculated from raw data
dfc <- dfc[-c(1,5),]
# organize the final data sets
df_tidyc <- dfc %>%
select(CongDistrict, Gini) %>%
arrange(CongDistrict, Gini)
df_tidyc$Gini <- as.numeric(df_tidyc$Gini)
#METROPOLITAN STATISTICAL AREA
# remove extraneous rows
# derived fields can be calculated from raw data
dfm <- dfm[-c(1,5),]
# organize the final data sets
df_tidym <- dfm %>%
select(MSA, Gini) %>%
arrange(MSA, Gini)
df_tidym$Gini <- as.numeric(df_tidym$Gini)
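df_tidyr_grouped <- group_by(df_tidyr, Region)
# mean and standard deviation by region; the rows appended below pass a label
# and numbers through c(), which stores the statistics as character strings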
df_stats <- summarise(df_tidyr_grouped, mean_gini = mean(Gini), std_gini = sd(Gini))
df_stats[6,] <- c("State", mean(df_tidys$Gini), sd(df_tidys$Gini))
df_stats[7,] <- c("CongDistrict", mean(df_tidyc$Gini), sd(df_tidyc$Gini))
df_stats[8,] <- c("MSA", mean(df_tidym$Gini), sd(df_tidym$Gini))
#
kable(head(df_stats), align = 'l')
Region | mean_gini | std_gini |
---|---|---|
Midwest | 0.2 | 0.099498743710662 |
Northeast | 0.198 | 0.0831865373723417 |
South | 0.2 | 0.085146931829632 |
US.Overall | 0.2 | 0 |
West | 0.2 | 0.051478150704935 |
State | 0.465135294117647 | 0.024102355266898 |
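
The summary columns above print at full precision because the appended rows were stored as text. As a minimal sketch of one alternative, assuming base R only, rows can be appended as one-row data frames so the statistics stay numeric. The helper name `summary_row` and all values here are hypothetical, chosen only for illustration; this is not the study's code or data.

```r
# Hypothetical starting table; values are made up for illustration.
stats <- data.frame(
  Region    = c("Midwest", "West"),
  mean_gini = c(0.20, 0.20),
  std_gini  = c(0.10, 0.05),
  stringsAsFactors = FALSE
)

# Hypothetical helper: summarise one numeric vector as a one-row data frame.
summary_row <- function(label, x) {
  data.frame(
    Region    = label,
    mean_gini = mean(x),
    std_gini  = sd(x),
    stringsAsFactors = FALSE
  )
}

# Made-up Gini values standing in for a real data set.
toy_state_gini <- c(0.44, 0.46, 0.48)

# rbind() on data frames keeps each column's type.
stats <- rbind(stats, summary_row("State", toy_state_gini))

# The statistic columns remain numeric, so rounding still works.
str(stats)
print(round(stats$mean_gini, 3))
```

With numeric columns, the table can be rounded for display instead of showing fifteen significant digits.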
To test whether our samples differ from a theoretical randomly distributed population of Gini indexes, we use a population mean of 0.5, the midpoint of the Gini index’s range of 0 to 1. The sample standard deviation serves as an estimate of the population standard deviation.
First, the standard error of each data set is calculated from the sample standard deviation and the number of observations. The margin of error at 95% confidence is then 1.96 times the standard error, which gives a confidence interval centered on the hypothesized mean of 0.5. The test checks whether the sample mean falls within this interval.
H0: μ = 0.5, the mean of our sample matches our theoretical population mean within the C.I.
HA: μ ≠ 0.5, we can reject the null hypothesis and can infer that the distribution of Gini isn’t randomly generated for the observed sample.
Note: this test depends on the assumed population mean of 0.5 being valid.
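The procedure described above can be sketched as a small R function. This is an illustrative sketch, not the code used in the study: the function name `mean_interval_test` and its argument names are assumptions, and it uses `qnorm()` for the critical value (about 1.96 at 95% confidence).

```r
# Build a confidence interval around the hypothesised mean mu0 and check
# whether the observed sample mean falls outside it.
mean_interval_test <- function(x_bar, s, n, mu0 = 0.5, conf = 0.95) {
  se     <- s / sqrt(n)                    # standard error of the mean
  z_crit <- qnorm(1 - (1 - conf) / 2)      # two-sided critical value
  margin <- z_crit * se                    # margin of error
  lower  <- mu0 - margin
  upper  <- mu0 + margin
  list(
    se        = se,
    margin    = margin,
    lower     = lower,
    upper     = upper,
    reject_h0 = (x_bar < lower) || (x_bar > upper)
  )
}

# State-level mean and standard deviation as reported in the summary table.
state <- mean_interval_test(
  x_bar = 0.465135294117647,
  s     = 0.024102355266898,
  n     = 51
)
print(state)
```

The critical value is applied once to the standard error, so the interval half-width equals the margin of error.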
There are two issues with fitting a normal distribution to our data set:
First, the Gini index is bounded between 0 and 1. While the data sets in this study do appear to follow a normal distribution, a true normal distribution is unbounded. Since the probability density of a normal distribution falls off rapidly with distance from the mean, we treat simulated values falling outside the range [0,1] as outliers and focus on the distribution statistics of the sample. See the simulated normal distribution below.
Second, the Census Bureau used sampling and data imputation methods to generate the recorded Gini observations. The study is therefore observational, and the observations themselves rest on inferences about the actual population. Despite this, the aggregated data used in this study will be treated as actual population data to limit the scope of the study.
sim_norm <- rnorm(n = 1000, mean = .5, sd = .3)
g <- sim_norm
m<-mean(g)
std<-sqrt(var(g))
hist(g, density=50, breaks=20, prob=TRUE, xlab="Simulated Normal Income Distribution", ylim=c(0, 2), main="normal curve over histogram")
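To gauge how much of the simulated distribution above lands outside the Gini range, the tail mass can be computed directly. This is a minimal sketch using the same mean (0.5) and standard deviation (0.3) as the simulation; the variable names and the seed are assumptions for illustration.

```r
mu    <- 0.5
sigma <- 0.3

# Theoretical probability of a draw below 0 or above 1.
p_below   <- pnorm(0, mean = mu, sd = sigma)
p_above   <- pnorm(1, mean = mu, sd = sigma, lower.tail = FALSE)
p_outside <- p_below + p_above

# Empirical check on a large simulated sample (seed fixed for repeatability).
set.seed(42)
draws         <- rnorm(n = 100000, mean = mu, sd = sigma)
share_outside <- mean(draws < 0 | draws > 1)

print(c(theoretical = p_outside, simulated = share_outside))
```

With these parameters a little under 10% of draws fall outside [0,1], so the share treated as outliers is not negligible at this standard deviation. The observed data sets have far smaller standard deviations (about 0.02 to 0.03), for which the bound matters much less.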
The regional data set has only five data points per region (one per quintile), illustrated below with a faceted plot.
# get descriptive statistics
describe(df_tidyr$Gini)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## 1 1 25 0.2 0.07 0.2 0.2 0.07 0.08 0.32 0.24 0.22 -0.85 0.01
df_stats[1:5,]
## Source: local data frame [5 x 3]
##
## Region mean_gini std_gini
## (chr) (chr) (chr)
## 1 Midwest 0.2 0.099498743710662
## 2 Northeast 0.198 0.0831865373723417
## 3 South 0.2 0.085146931829632
## 4 US.Overall 0.2 0
## 5 West 0.2 0.051478150704935
# regional scatterplot of population by regions
ggplot(data = df_tidyr, aes(x = Quintile, y = Gini)) +
geom_point() + facet_wrap( ~ Region )
The State distribution is somewhat right-skewed and multi-modal compared with the overlaid normal distribution.
# get descriptive statistics
describe(df_tidys$Gini)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## 1 1 51 0.47 0.02 0.46 0.46 0.02 0.42 0.55 0.13 0.81 1.45 0
df_stats[6,]
## Source: local data frame [1 x 3]
##
## Region mean_gini std_gini
## (chr) (chr) (chr)
## 1 State 0.465135294117647 0.024102355266898
#hist(df_tidys$Gini, breaks=20)
g <- df_tidys$Gini
m<-mean(g)
std<-sqrt(var(g))
hist(g, density=50, breaks=20, prob=TRUE, xlab="State Gini Index", ylim=c(0, 25), main="normal curve over histogram")
curve(dnorm(x, mean=m, sd=std), col="darkblue", lwd=2, add=TRUE, yaxt="n")
\[SE_{\bar{x}} = \sigma_{\bar{x}} = \frac{s_{x}}{\sqrt{n}}\] \[\frac{0.02410}{\sqrt{51}} = 0.003375\] \[0.5 \pm 1.96 \times 0.003375 = 0.5 \pm 0.006615\] \[C.I._{95\%} = [0.493385, 0.506615]\]
`0.465135 < 0.493385 so based on our hypothesis criteria:`
HA: μ ≠ 0.5, we can reject the null hypothesis and can infer that the distribution of Gini
isn’t randomly generated for the observed sample.
The Congressional District distribution is also somewhat right-skewed compared with the overlaid normal distribution, but the normal curve is a smoother fit.
# get descriptive statistics
describe(df_tidyc$Gini)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## 1 1 436 0.46 0.03 0.45 0.45 0.03 0.39 0.58 0.2 1.03 2.01 0
df_stats[7,]
## Source: local data frame [1 x 3]
##
## Region mean_gini std_gini
## (chr) (chr) (chr)
## 1 CongDistrict 0.456470642201835 0.030584541366859
#hist(df_tidyc$Gini, breaks=20)
g <- df_tidyc$Gini
m<-mean(g)
std<-sqrt(var(g))
hist(g, density=50, breaks=20, prob=TRUE,
xlab="Congressional District Income Dispersion", ylim=c(0, 15), main="normal curve over histogram")
curve(dnorm(x, mean=m, sd=std), col="darkblue", lwd=2, add=TRUE, yaxt="n")
\[SE_{\bar{x}} = \sigma_{\bar{x}} = \frac{s_{x}}{\sqrt{n}}\] \[\frac{0.03058}{\sqrt{436}} = 0.001465\] \[0.5 \pm 1.96 \times 0.001465 = 0.5 \pm 0.002871\] \[C.I._{95\%} = [0.497129, 0.502871]\]
`0.456471 < 0.497129 so based on our hypothesis criteria:`
HA: μ ≠ 0.5, we can reject the null hypothesis and can infer that the distribution of Gini
isn’t randomly generated for the observed sample.
The MSA distribution is near normal.
# get descriptive statistics
describe(df_tidym$Gini)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## 1 1 916 0.45 0.03 0.45 0.45 0.03 0.37 0.54 0.17 0.34 0.26 0
df_stats[8,]
## Source: local data frame [1 x 3]
##
## Region mean_gini std_gini
## (chr) (chr) (chr)
## 1 MSA 0.446776091703057 0.0285726948720191
#hist(df_tidym$Gini, breaks=20)
g <- df_tidym$Gini
m<-mean(g)
std<-sqrt(var(g))
hist(g, density=50, breaks=20, prob=TRUE,
xlab="Metro-Area Income Dispersion", ylim=c(0, 15), main="normal curve over histogram")
curve(dnorm(x, mean=m, sd=std), col="darkblue", lwd=2, add=TRUE, yaxt="n")
\[SE_{\bar{x}} = \sigma_{\bar{x}} = \frac{s_{x}}{\sqrt{n}}\] \[\frac{0.02857}{\sqrt{916}} = 0.000944\] \[0.5 \pm 1.96 \times 0.000944 = 0.5 \pm 0.001850\] \[C.I._{95\%} = [0.498150, 0.501850]\]
`0.446776 < 0.498150 so based on our hypothesis criteria:`
HA: μ ≠ 0.5, we can reject the null hypothesis and can infer that the distribution of Gini
isn’t randomly generated for the observed sample.
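The same comparison can be restated as a z-score, which measures how many standard errors each sample mean lies from the hypothesized 0.5. This is a restatement under the same assumptions (a hypothesized mean of 0.5, with the sample standard deviation as the estimate of the population standard deviation), not an additional test from the source data. Inputs are rounded from the summary statistics reported above.

\[z = \frac{\bar{x} - \mu_0}{s_{x} / \sqrt{n}}\]
\[z_{State} = \frac{0.46514 - 0.5}{0.02410 / \sqrt{51}} \approx -10.3\]
\[z_{CongDistrict} = \frac{0.45647 - 0.5}{0.03058 / \sqrt{436}} \approx -29.7\]
\[z_{MSA} = \frac{0.44678 - 0.5}{0.02857 / \sqrt{916}} \approx -56.4\]

Each |z| is far beyond the 1.96 cutoff for a 95% two-sided test, which agrees with rejecting the null hypothesis in all three cases.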
Inference:
The United States has an overall Gini coefficient of 0.486, which is close to our theoretical assumption of 0.5 based upon a random uniform distribution of numbers between 0 and 1. The hypothesis test results would not have changed if 0.486 had been used.
Rejecting the randomness of the Gini coefficient in this analysis implies that there are other factors that determine the dispersion of income by geographic region. In addition, different geographic areas vary significantly in their dispersion of income. Note that the skew and variance by Congressional District, which is a politically drawn boundary, are greater than those by Metropolitan Statistical Area, which is principally an economic boundary.
The Metropolitan Statistical Area is a construct of the Office of Management and Budget and is defined as a dense cluster of population at its core with economic ties further out. It is intended to be an apolitical boundary where anomalies such as gerrymandering do not occur. In addition, MSAs tend not to have the same concentration of income disparity as occurs within the boundary of cities, since they encompass the city center and radiate outward to the surrounding suburban and exurban areas.
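The skew comparison between Congressional Districts and MSAs rests on the skew column reported by `describe()`. As a rough illustration of what that statistic captures, here is a moment-based skewness function in base R. The function name `sample_skewness` and the toy vectors are assumptions for illustration, and `describe()` may apply a small-sample adjustment, so its values can differ slightly from this formula.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2.
sample_skewness <- function(x) {
  x  <- x[!is.na(x)]
  d  <- x - mean(x)
  m2 <- mean(d^2)
  m3 <- mean(d^3)
  m3 / m2^(3 / 2)
}

# Made-up vectors: one symmetric, one with a long right tail.
symmetric    <- c(-2, -1, 0, 1, 2)
right_tailed <- c(1, 1, 1, 2, 10)

skew_symmetric    <- sample_skewness(symmetric)
skew_right_tailed <- sample_skewness(right_tailed)

print(c(symmetric = skew_symmetric, right_tailed = skew_right_tailed))
```

A value near zero indicates a symmetric distribution, while a positive value indicates a longer right tail, which is the pattern reported for the Congressional District data.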
Conclusion:
Income inequality has been a topical discussion in recent years both within the United States and on a global basis. The debate over the dispersion of income highlights the notion that a growing economy may obscure an uneven concentration of participants in this growth. It also implies that an economy may not be producing up to its potential.
This study looked at this phenomenon using the Gini index which rates a geographic region’s income distribution at various scales. The four levels that were examined in this study were Region, State, Congressional District and Metropolitan Statistical Area.
Regional data was anecdotal since the data set was small. The other three data sets showed significant variance from a normal distribution of the same mean and standard deviation. The MSAs most closely followed a normal distribution, while Congressional Districts had the most variance and skew.
One can infer that there are non-random factors that affect variance from the normal distribution of Gini, despite the assumptions that were made about a normal distribution as the basis for comparison. Identifying these factors is a policy matter that may signify a region’s political and economic viability.