scottkarr-proposal

# load data

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.
The U.S. Census Bureau has collected Gini indices within the United States at several levels of geographic granularity, including the state, congressional district, metropolitan statistical area, and city levels.

The Gini index at each of these levels can be tested as a proxy for a particular type of bias:

1) State -> geographic 
2) Congressional District -> political
3) Metropolitan Statistical Area -> economic

If so, the distribution of Gini indices at each level will depart from a normal distribution, reflecting that bias. My hypothesis is therefore that the Gini distribution at each of these geographic levels will differ from a randomly sampled normal distribution with the same number of observations.
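
One way to make this hypothesis testable is a Shapiro-Wilk normality test applied to each level. The sketch below is an assumption, not a method this proposal has committed to; `gini_values` is placeholder data simulated purely for illustration and is not Census data.

```r
# Placeholder values standing in for one geographic level (NOT Census data).
set.seed(1)
gini_values <- rnorm(50, mean = 0.45, sd = 0.02)

# H0: the values come from a normal distribution.
sw <- shapiro.test(gini_values)

# Reject H0 at the 5% significance level when the p-value is below 0.05.
reject_normality <- sw$p.value < 0.05
print(sw)
```

With the real data, `gini_values` would be replaced by the column for the level being tested, for example `gini$state`.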

Cases

What are the cases, and how many are there?
The cases are Gini indices at three geographic levels, 699 in total:

50 state Gini indices
435 congressional district Gini indices
214 metropolitan statistical area Gini indices

Data collection

Describe the method of data collection.
The data were collected by the U.S. Census Bureau through the American Community Survey. The Census Bureau summarizes its income inequality findings as follows:

Household Income: 2013 also examined the Gini index for states and large metro areas. The Gini index is a summary measure of income inequality, ranging from 0 — complete equality — to 1 — complete inequality. Among the findings:

Five states and the District of Columbia had Gini indexes higher than the U.S. index of .481.
Thirty-six states had lower Gini indexes than the U.S. index of .481.

The Gini index of 15 states increased from 2012 to 2013. Alaska was the only state to have a decrease. 
All other states saw no significant change.

The highest Gini index was in the District of Columbia (0.532). Alaska’s (0.408) was among the lowest.

Additional Gini index data on the Census Bureau’s American FactFinder data search engine is available for metropolitan statistical areas and other areas with populations of 65,000 or more. Of the 25 most populous metro areas, Gini indexes ranged from 0.442 (for the Washington, D.C., metro area, although not statistically different from Portland, Ore., Riverside, Calif., and Minneapolis) to 0.512 (for the New York metro area, which was not statistically different from the Miami metro area).
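
The Gini index described above (0 for complete equality, 1 for complete inequality) can be computed from a vector of incomes as half the relative mean absolute difference. The function below is a minimal sketch of that textbook formula; `gini_index` is a hypothetical helper, not necessarily the estimator the Census Bureau uses, and the incomes are made-up numbers.

```r
gini_index <- function(income) {
  n <- length(income)
  # Sum of absolute differences over all ordered pairs of households.
  total_diff <- sum(abs(outer(income, income, "-")))
  total_diff / (2 * n^2 * mean(income))
}

equal_incomes <- gini_index(c(30, 30, 30, 30))  # everyone earns the same
one_earner    <- gini_index(c(0, 0, 0, 100))    # one household earns everything
print(c(equal = equal_incomes, one_earner = one_earner))
```

With only four households the largest possible value under this formula is (n - 1) / n = 0.75; the upper bound approaches 1 as the number of households grows.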

Type of study

What type of study is this (observational/experiment)?
This is an observational study using data gathered by the US Census Bureau.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.
https://www.census.gov/prod/2011pubs/acs-16.pdf
http://www.bloomberg.com/visual-data/best-and-worst//most-income-inequality-congressional-districts

Response

What is the response variable, and what type is it (numerical/categorical)?
The response variable is the Gini index, a measure of income dispersion. It is a numerical variable. I will also compare the variance of the Gini index at each level of geographic granularity with that of a randomly sampled normal distribution.

Explanatory

What is the explanatory variable, and what type is it (numerical/categorical)?

The explanatory variable is the geographic level (state, congressional district, or metropolitan statistical area), which is a categorical variable.

Relevant summary statistics

Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

I am comparing the mean and standard deviation of each group with those of a simulated normal sample of the same size. I will also plot a histogram of each against the normal curve and compare skewness. Each of the 3 geographic levels tests my hypothesis that the actual data will differ from the randomly generated simulation. I will test the hypothesis for each level at a 95% confidence level.
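
The comparison of means, standard deviations, sample sizes, and skewness could be tabulated along the following lines. This is a sketch under stated assumptions: `skewness` and `summarise_sample` are hypothetical helpers (a simple moment-based skewness, so no extra package is needed), and `observed` is simulated placeholder data, not the Census values.

```r
# Moment-based sample skewness (0 for a perfectly symmetric sample).
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

summarise_sample <- function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x), skewness = skewness(x))
}

# Placeholder data standing in for the 50 state values (NOT Census data).
set.seed(42)
observed  <- rnorm(50, mean = 0.45, sd = 0.02)
# Simulated normal sample with the same size, mean, and SD as the observed one.
simulated <- rnorm(length(observed), mean = mean(observed), sd = sd(observed))

summary_table <- rbind(observed  = summarise_sample(observed),
                       simulated = summarise_sample(simulated))
print(round(summary_table, 4))
```

The same two helper calls can be repeated for the congressional district and metropolitan statistical area samples.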

Finally, I will compare how far each level departs from the normal distribution to make inferences about the strength of each level’s bias, if any.
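
To compare departures from normality across levels, one option is a single distance measure computed for every level. The sketch below uses the Kolmogorov-Smirnov statistic D against a normal distribution with the sample's own mean and standard deviation, purely as a descriptive distance. Because those parameters are estimated from the same data, the accompanying p-value is not exact, so only D is kept. The choice of statistic and the `levels_data` list are assumptions; the values are simulated placeholders that reuse the sample sizes listed under Cases, not Census data.

```r
ks_distance <- function(x) {
  # D = largest gap between the empirical CDF and the fitted normal CDF.
  unname(ks.test(x, "pnorm", mean = mean(x), sd = sd(x))$statistic)
}

# Placeholder samples (NOT Census data), sized like the three levels.
set.seed(7)
levels_data <- list(
  state    = rnorm(50,  mean = 0.45, sd = 0.02),
  district = rnorm(435, mean = 0.45, sd = 0.03),
  metro    = rnorm(214, mean = 0.45, sd = 0.03)
)

deviation <- sapply(levels_data, ks_distance)
print(sort(deviation, decreasing = TRUE))
```

A larger D means a larger gap from the fitted normal curve, although D also tends to shrink as the sample size grows, so levels with very different numbers of cases are not directly comparable on D alone.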

Simulating data from a normal distribution using rnorm.

Generate the mean and standard deviation for the 50 states, then plot a histogram against a normal distribution.

```{r}
# State-level Gini values
g <- gini$state
m <- mean(g)
std <- sd(g)  # equivalent to sqrt(var(g))

# Density-scaled histogram with the fitted normal curve overlaid
hist(g, density = 50, breaks = 20, prob = TRUE,
     xlab = "distribution of income", ylim = c(0, .06),
     main = "normal curve over histogram")
curve(dnorm(x, mean = m, sd = std), col = "darkblue", lwd = 2,
      add = TRUE, yaxt = "n")
```

Generate the same for a simulation: simulate data from a normal distribution using rnorm.

```{r sim-norm, eval=TRUE}
# Draw as many values as there are states, matching the observed mean and SD
sim_norm <- rnorm(n = length(gini$state), mean = mean(gini$state), sd = sd(gini$state))
```

```{r sim-mean-sd, eval=TRUE}
simmean <- mean(sim_norm)
simmean
simsd <- sd(sim_norm)
simsd
```

```{r sim-hist_states, eval=TRUE}
hist(sim_norm, density = 50, breaks = 20, xlab = "distribution of income",
     prob = TRUE, ylim = c(0, .06), main = "normal curve over sim histogram")
# Draw the normal density over the plotted x-range instead of the fixed grid 0:100
curve(dnorm(x, mean = simmean, sd = simsd), col = "blue", add = TRUE)
```