DATA 205 Dataset Feedback Form


Dataset: Household Income and Incarceration for Children from Low-Income Households by County, Race, and Gender

From Opportunity Insights. See readme for full descriptions of all columns.


The table below omits the standard-error and gender-specific columns. For 3219 localities, the dataset reports the household income rank (2014-2015) and the incarceration fraction (as of April 10, 2010) of adults aged 31 to 37 whose parents' household income was at the 25th national percentile.

  Mean Min. Median Max. NA’s
kfr_pooled_pooled_p25 0.43 0.12 0.42 0.69 11
jail_pooled_pooled_p25 0.021 -0.075 0.02 0.23 35
pooled_pooled_count 10455 52 3308 1576718 85
kfr_black_pooled_p25 0.34 0.19 0.33 0.64 1304
kfr_hisp_pooled_p25 0.42 0.21 0.42 0.68 836
kfr_white_pooled_p25 0.46 0.31 0.45 0.7 88
jail_black_pooled_p25 0.059 -0.036 0.053 0.34 1386
jail_hisp_pooled_p25 0.016 -0.088 0.014 0.26 941
jail_white_pooled_p25 0.017 -0.086 0.016 0.28 91
white_pooled_count 4719 17 2238 165085 91
black_pooled_count 4029 5.2 656 318866 1304
hisp_pooled_count 3586 6.6 225 1112381 883

Data Quality

  • NA’s are far more frequent for Hispanics and African Americans than for whites: kfr_black_pooled_p25 has 1304 and kfr_hisp_pooled_p25 has 836, versus 88 for kfr_white_pooled_p25.
  • Locality population sizes vary widely: pooled_pooled_count ranges from 52 to 1576718, with a median of 3308.
  • Noise was added to the data to protect individual privacy. That may account for the negative minimum incarceration fractions.
  • The subject counts in many localities are very small, which can distort interpretation.
  • It is unclear when or how the parents’ membership in the 25th household income percentile was determined.
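To put the first point above in numbers, the NA counts from the summary table can be expressed as fractions of the 3219 localities. This is only a small arithmetic sketch; it assumes one row per locality, and the dictionary simply copies values from the table.

```python
# NA counts copied from the summary table above.
# Assumption: the file has one row per locality, 3219 rows in total.
N_LOCALITIES = 3219

na_counts = {
    "kfr_black_pooled_p25": 1304,
    "kfr_hisp_pooled_p25": 836,
    "kfr_white_pooled_p25": 88,
    "jail_black_pooled_p25": 1386,
    "jail_hisp_pooled_p25": 941,
    "jail_white_pooled_p25": 91,
}

# Fraction of localities with no estimate, per column.
na_share = {col: n / N_LOCALITIES for col, n in na_counts.items()}

for col, share in na_share.items():
    print(f"{col}: {share:.1%}")
```

Under that assumption, roughly 40% of localities lack an income estimate for African Americans and about 26% lack one for Hispanics, compared with under 3% for whites.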

Primary Data Columns

  • All data columns appear interesting.
  • Income and incarceration should be studied independently.
  • It would be interesting to drill down on the gender differences in income.

Relevant Filters

  • Select localities where pooled_pooled_count exceeds its median.
  • Select by gender (male/female).
  • Select localities where the pooled income rank (kfr_pooled_pooled_p25) exceeds 0.5.
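The count and income filters above could be sketched as follows. The rows are made-up illustrative values, not records from the dataset; `None` stands in for NA, and the helper names are hypothetical. Only the two column names come from the table.

```python
from statistics import median

# Hypothetical rows for illustration only (not real data).
rows = [
    {"locality": "A", "pooled_pooled_count": 120, "kfr_pooled_pooled_p25": 0.55},
    {"locality": "B", "pooled_pooled_count": 5000, "kfr_pooled_pooled_p25": 0.41},
    {"locality": "C", "pooled_pooled_count": 80000, "kfr_pooled_pooled_p25": 0.52},
    {"locality": "D", "pooled_pooled_count": None, "kfr_pooled_pooled_p25": 0.47},
    {"locality": "E", "pooled_pooled_count": 9000, "kfr_pooled_pooled_p25": None},
]


def above_median(rows, col="pooled_pooled_count"):
    """Keep rows whose value in `col` exceeds the median of the non-missing values."""
    known = [r[col] for r in rows if r[col] is not None]
    cutoff = median(known)
    return [r for r in rows if r[col] is not None and r[col] > cutoff]


def above_threshold(rows, threshold=0.5, col="kfr_pooled_pooled_p25"):
    """Keep rows whose value in `col` exceeds `threshold`; rows with NA are dropped."""
    return [r for r in rows if r[col] is not None and r[col] > threshold]


large = above_median(rows)
large_and_high_rank = above_threshold(large)
print([r["locality"] for r in large_and_high_rank])
```

Applying the count filter first removes the small localities whose estimates are most affected by noise, which is why it comes before the income filter here.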

Possible visualizations

  • Scatter plot of count vs incarceration rate, with point shapes for race/gender
  • Scatter plot of count vs income rank, with point shapes for race/gender
  • Bar graphs of income/incarceration aggregated by state, stacked by race and gender

Limitations of the Dataset

  • Use of the national income distribution may distort local results.
  • As above, NAs occur more frequently for Hispanics and African Americans.
  • It is unclear whether subjects still reside in their childhood locality.
  • Individuals are often incarcerated far from home, so local population and income figures are distorted in rural localities with a high density of prisons.

Improvement to the Dataset

  • Categorize each locality as urban, suburban, or rural.
  • Add an adjustment for the locality's income distribution.
  • Add the locality's incarceration rate.

Always interesting to see how Montgomery County stacks up nationally:

  moco all.Mean all.Min. all.Median all.Max.
kfr_pooled_pooled_p25 0.48 0.43 0.12 0.42 0.69
jail_pooled_pooled_p25 0.014 0.021 -0.075 0.02 0.23
pooled_pooled_count 69344 10455 52 3308 1576718
kfr_black_pooled_p25 0.42 0.34 0.19 0.33 0.64
kfr_hisp_pooled_p25 0.5 0.42 0.21 0.42 0.68
kfr_white_pooled_p25 0.51 0.46 0.31 0.45 0.7
jail_black_pooled_p25 0.03 0.059 -0.036 0.053 0.34
jail_hisp_pooled_p25 0.0067 0.016 -0.088 0.014 0.26
jail_white_pooled_p25 0.013 0.017 -0.086 0.016 0.28
white_pooled_count 15973 4719 17 2238 165085
black_pooled_count 21475 4029 5.2 656 318866
hisp_pooled_count 18871 3586 6.6 225 1112381

and in the DC commuting zone

  moco dcCz.Mean dcCz.Min. dcCz.Median dcCz.Max.
kfr_pooled_pooled_p25 0.48 0.44 0.35 0.43 0.51
jail_pooled_pooled_p25 0.014 0.026 0.011 0.027 0.045
pooled_pooled_count 69344 21998 510 10031 98422
kfr_black_pooled_p25 0.42 0.38 0.34 0.38 0.48
kfr_hisp_pooled_p25 0.5 0.47 0.34 0.47 0.52
kfr_white_pooled_p25 0.51 0.48 0.4 0.48 0.54
jail_black_pooled_p25 0.03 0.057 0.0074 0.055 0.1
jail_hisp_pooled_p25 0.0067 0.013 -0.0042 0.0087 0.066
jail_white_pooled_p25 0.013 0.018 0.0049 0.017 0.033
white_pooled_count 15973 4712 235 2888 15973
black_pooled_count 21475 11127 31 2194 73435
hisp_pooled_count 18871 4358 66 875 18871

Dataset: Baseline Cross-Sectional Estimates by College

From Opportunity Insights. See readme for full descriptions of all columns.

The dataset consists of 2199 rows representing 2197 post-secondary academic institutions identified with the U.S. Dept of Education’s Office of Postsecondary Education Identification (OPE ID) number.

Each row has 85 columns describing the school demographics (location, name, type, tier {prestige 1-14}, etc.) and data collected or imputed for students born 1980-1982 (count, fraction female, fraction married, measures of income, parental income, expected income rankings).

Column prefixes: k for student (kid), par for parent, mr_k for student mobility rate.
The table below includes the following columns:

column name        description
type               Institution type: 1 = public, 2 = private non-profit, 3 = for-profit
tier               Selectivity and type combination rank, 1-14
iclevel            Four-year or two-year college
count              Average number of kids per cohort
female             Fraction female among kids
k_married          Fraction of kids married in 2014
mr_kq5_pq1         Mobility rate (joint probability of parents in bottom quintile and child in top quintile of the income distribution)
par_rank           Mean parental income rank
par_top1pc         Fraction of parents in the top 1% of the income distribution
k_rank             Mean kid earnings rank
k_0inc             Fraction of kids with zero labor earnings
k_q5               Fraction of kids in the top income quintile
k_rank_cond_parq1  Mean kid earnings rank conditional on parent in 1st quintile
k_rank_cond_parq2  Mean kid earnings rank conditional on parent in 2nd quintile
k_rank_cond_parq3  Mean kid earnings rank conditional on parent in 3rd quintile
k_rank_cond_parq4  Mean kid earnings rank conditional on parent in 4th quintile
k_rank_cond_parq5  Mean kid earnings rank conditional on parent in 5th quintile

  Mean Min. Median Max. NA’s
type 1.5 1 1 3 3
tier 7.1 1 6 14 NA
iclevel 1.4 1 1 3 3
count 1714 50 468 955065 NA
female 0.56 0.0033 0.55 1 19
k_married 0.53 0.12 0.54 0.94 NA
mr_kq5_pq1 0.018 0 0.015 0.16 NA
par_rank 0.57 0.25 0.57 0.89 NA
par_top1pc 0.015 0 0.0052 0.22 NA
k_rank 0.57 0.34 0.55 0.91 NA
k_0inc 0.12 0.024 0.11 0.34 NA
k_q5 0.26 0.017 0.22 0.83 NA
k_rank_cond_parq1 0.52 0.25 0.51 0.92 NA
k_rank_cond_parq2 0.55 0.29 0.53 0.91 NA
k_rank_cond_parq3 0.56 0.31 0.55 0.91 NA
k_rank_cond_parq4 0.58 0.23 0.57 0.9 NA
k_rank_cond_parq5 0.59 0.18 0.58 0.9 NA
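
Because mr_kq5_pq1 is described as a joint probability, it can be read as the product of two simpler quantities: the fraction of students whose parents are in the bottom quintile, and the fraction of those students who reach the top quintile. The sketch below only illustrates that product rule. The function and parameter names are hypothetical, and the input values are made up rather than taken from the dataset.

```python
def mobility_rate(frac_parents_q1, frac_kids_q5_given_parents_q1):
    """Joint probability P(parent in Q1 and kid in Q5).

    Computed with the product rule:
    P(parent in Q1) * P(kid in Q5 | parent in Q1).
    Both argument names are hypothetical, chosen for this illustration.
    """
    for p in (frac_parents_q1, frac_kids_q5_given_parents_q1):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie between 0 and 1")
    return frac_parents_q1 * frac_kids_q5_given_parents_q1


# Made-up example: 10% of students have bottom-quintile parents,
# and 30% of those students reach the top quintile.
example_rate = mobility_rate(0.10, 0.30)
print(round(example_rate, 4))
```

Read this way, a school can have a low mobility rate either because it enrolls few low-income students or because few of them reach the top quintile, and the single column cannot distinguish the two.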

Data Quality

  • It is not clear when the income ranks of students and parents were recorded. The cohorts presumably graduated from high school between 1998 and 2001. Some columns state values as of 2014, when the students were in their thirties; others do not.
  • Data do not distinguish when or how students began or left college.
  • The cohort began their careers at the onset of the Great Recession. Presumably this disproportionately affected their incomes.

Primary Data Columns

The dataset contains a large variety of columns and offers many opportunities for studying how they interact:

  • Effect of parents’ income rank on the prestige (column tier) of the institution their children attend.
  • Influence of parents’ income rank on their child's income rank, conditioned on the prestige of the institution.
  • Influence of the institution’s type (public/private non-profit/for-profit) and iclevel (4year/2year/<2year) on the student’s future (?) income ranking and/or unemployment ($0 income).
  • Other effects such as those related to gender and marital status appear difficult to tease out.

Relevant Filters

  • Institution prestige
  • Institution type and iclevel (see above)

Possible visualizations

  • Scatter plot of parent vs child income rank colored by institution prestige
  • Scatter plot of cohort count vs child income rank colored by institution prestige
  • Stacked bar chart of institution prestige (horizontal) vs cohort counts by type and iclevel

Limitations of the Dataset

  • Dollar amount columns are not corrected for inflation, which prevents a valid comparison of child and parent incomes.
  • Dataset fails to adequately account for cohort members who attended post-secondary institutions when they were over 22 years old.
  • Schools of art and design (e.g., Rhode Island School of Design) are misclassified as non-selective, low-prestige institutions.
  • U.S. military academies omitted.

Improvement to the Dataset

  • Include institutions’ distribution STEM/arts/business/trades/military
  • Include fraction of cohort from single parent households
  • Include cohort’s minority distribution
  • Include fraction of students who did not graduate

Always interesting to see how Montgomery College stacks up nationally:

  MC all.Mean all.Min. all.Median all.Max.
type 1 1.5 1 1 3
tier 9 7.1 1 6 14
iclevel 2 1.4 1 1 3
count 2941 1714 50 468 955065
female 0.47 0.56 0.0033 0.55 1
k_married 0.39 0.53 0.12 0.54 0.94
mr_kq5_pq1 0.03 0.018 0 0.015 0.16
par_rank 0.56 0.57 0.25 0.57 0.89
par_top1pc 0.0082 0.015 0 0.0052 0.22
k_rank 0.56 0.57 0.34 0.55 0.91
k_0inc 0.14 0.12 0.024 0.11 0.34
k_q5 0.26 0.26 0.017 0.22 0.83
k_rank_cond_parq1 0.53 0.52 0.25 0.51 0.92
k_rank_cond_parq2 0.54 0.55 0.29 0.53 0.91
k_rank_cond_parq3 0.57 0.56 0.31 0.55 0.91
k_rank_cond_parq4 0.56 0.58 0.23 0.57 0.9
k_rank_cond_parq5 0.58 0.59 0.18 0.58 0.9

and in the DC commuting zone

  MC dcCz.Mean dcCz.Min. dcCz.Median dcCz.Max.
type 1 1.7 1 2 3
tier 9 6.7 2 6 11
iclevel 2 1.3 1 1 2
count 2941 1783 125 892 11705
female 0.47 0.6 0.34 0.57 0.95
k_married 0.39 0.44 0.15 0.51 0.62
mr_kq5_pq1 0.03 0.022 0.0057 0.022 0.04
par_rank 0.56 0.61 0.34 0.63 0.83
par_top1pc 0.0082 0.024 0.00024 0.0065 0.18
k_rank 0.56 0.6 0.42 0.59 0.78
k_0inc 0.14 0.12 0.075 0.12 0.21
k_q5 0.26 0.34 0.05 0.29 0.69
k_rank_cond_parq1 0.53 0.55 0.39 0.53 0.74
k_rank_cond_parq2 0.54 0.59 0.38 0.58 0.77
k_rank_cond_parq3 0.57 0.6 0.38 0.58 0.74
k_rank_cond_parq4 0.56 0.61 0.43 0.6 0.77
k_rank_cond_parq5 0.58 0.63 0.47 0.61 0.79

Dataset: Neighborhood Characteristics by County

From Opportunity Insights. See readme for full descriptions of all columns.


The table below shows a subset of the columns, limited to data from 2010 and later. All units are fractions of the totals, with the following exceptions:

  • med_hhinc2016: median household income in 2016 in dollars
  • gsmn_math_g3_2013: mean score on 3rd grade standardized math test
  • rent_twobed2015: median rent of two bedroom dwelling in dollars
  • mail_return_rate2010: percentage of completed census forms returned
  • popdensity2010: residents per square mile
  • job_density_2013: jobs per square mile

  Mean Min. Median Max. NA’s
state 31 1 30 72 NA
county 103 1 79 840 NA
cz 20397 100 21701 39400 81
czname NA NA NA NA NA
frac_coll_plus2010 0.19 0.037 0.17 0.71 1
foreign_share2010 0.044 0 0.024 0.72 79
med_hhinc2016 48260 11640 46718 129150 2
poor_share2010 0.16 0 0.15 0.65 1
share_white2010 0.76 0.002 0.85 0.99 1
share_black2010 0.09 0 0.022 0.86 1
share_hisp2010 0.1 0 0.034 1 1
share_asian2010 0.0095 0 0.0039 0.43 22
gsmn_math_g3_2013 3.2 -0.66 3.2 6.6 152
rent_twobed2015 685 172 638 2085 78
singleparent_share2010 0.31 0 0.3 0.81 2
traveltime15_2010 0.4 0.087 0.38 0.99 1
mail_return_rate2010 80 0 81 100 28
popdensity2010 287 0.04 47 70584 2
ann_avg_job_growth_2004_2013 -0.0029 -0.083 -0.0033 0.12 7
job_density_2013 129 0.02 19 36663 4

Data Quality

  • Across all numeric columns there is a median NA frequency of 2.4% of the number of rows.
  • The most recent columns date from 2015-2016 (rent_twobed2015, med_hhinc2016). They may be of limited relevance unless combined with more recent data.
  • Data span the Great Recession, possibly affecting dollar amount columns.
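A per-column NA frequency and its median, as quoted in the first point above, could be computed along these lines. The rows are made-up placeholders with `None` standing in for NA, and the helper name is hypothetical. Only the column names come from the table.

```python
from statistics import median


def na_fraction_by_column(rows, columns):
    """Return, for each column, the fraction of rows whose value is missing (None)."""
    n = len(rows)
    if n == 0:
        raise ValueError("no rows to summarize")
    return {c: sum(1 for r in rows if r.get(c) is None) / n for c in columns}


# Hypothetical rows for illustration only (not real data).
rows = [
    {"poor_share2010": 0.16, "rent_twobed2015": 685, "foreign_share2010": None},
    {"poor_share2010": 0.10, "rent_twobed2015": None, "foreign_share2010": None},
    {"poor_share2010": None, "rent_twobed2015": 900, "foreign_share2010": 0.02},
    {"poor_share2010": 0.21, "rent_twobed2015": 700, "foreign_share2010": 0.05},
]
columns = ["poor_share2010", "rent_twobed2015", "foreign_share2010"]

fractions = na_fraction_by_column(rows, columns)
median_na_fraction = median(fractions.values())
print(fractions, median_na_fraction)
```

The median is less sensitive than the mean to a handful of sparsely populated columns, which is presumably why it was used for the summary figure.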

Primary Data Columns

  • Columns are measured in different years, which makes correlating them difficult.
  • county and commuting zone: interesting to see the variations across zones comprising multiple counties, particularly those crossing state boundaries (requires additional dataset)
  • poor_share[1990,2000,2010], singleparent_share[same range], and popdensity[2000,2010]

Relevant Filters

  • Select counties where the sum over all share columns S of abs(share_S_Y1 - share_S_Y2) exceeds a threshold, for a chosen pair of years Y1 and Y2.
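One reading of the filter above is: for each county, add up the absolute change in every share column between two years, and keep the county if that total exceeds a threshold. The sketch below follows that reading. The 2000 columns (for example share_white2000) are assumed to exist alongside the 2010 ones, the rows are made-up values, and the function names are hypothetical.

```python
def total_share_change(row, share_prefixes, year1, year2):
    """Sum of absolute differences between year1 and year2 over all share columns.

    Returns None if any of the needed values is missing.
    """
    total = 0.0
    for prefix in share_prefixes:
        first = row.get(f"{prefix}{year1}")
        second = row.get(f"{prefix}{year2}")
        if first is None or second is None:
            return None
        total += abs(first - second)
    return total


def filter_by_share_change(rows, share_prefixes, year1, year2, threshold):
    """Keep rows whose total absolute share change exceeds the threshold."""
    kept = []
    for row in rows:
        change = total_share_change(row, share_prefixes, year1, year2)
        if change is not None and change > threshold:
            kept.append(row)
    return kept


# Hypothetical counties for illustration only (not real data).
rows = [
    {"name": "X",
     "share_white2000": 0.80, "share_white2010": 0.70,
     "share_black2000": 0.10, "share_black2010": 0.12,
     "share_hisp2000": 0.05, "share_hisp2010": 0.12,
     "share_asian2000": 0.05, "share_asian2010": 0.06},
    {"name": "Y",
     "share_white2000": 0.90, "share_white2010": 0.89,
     "share_black2000": 0.05, "share_black2010": 0.05,
     "share_hisp2000": 0.03, "share_hisp2010": 0.04,
     "share_asian2000": 0.02, "share_asian2010": 0.02},
]
prefixes = ["share_white", "share_black", "share_hisp", "share_asian"]

changed = filter_by_share_change(rows, prefixes, 2000, 2010, threshold=0.1)
print([r["name"] for r in changed])
```

Sorting the rows by the same total, rather than thresholding it, would give the ordering suggested for the paired stacked bar chart.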

Possible visualizations

  • Stacked bar chart of counties in commuting zones sized by share column of selected year and ordered by mean population density across commuting zone. Many interesting columns.
  • Paired stacked bar chart of all shares of paired years y1,y2 in same county ordered by sum of absolute share difference y2-y1.

Limitations of the Dataset

  • Data are not up to date
  • Data columns span different years
  • Dollar amount data columns do not use constant dollars
  • Do commuting zones change over time?

Improvement to the Dataset

  • Convert dollar amounts to constant dollars
  • Join county population obtained from U.S. census
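Converting to constant dollars usually means rescaling each nominal amount by the ratio of a price index, such as the CPI, between the target year and the year the amount was recorded. A minimal sketch follows. The index values are placeholders rather than real CPI figures, and the function name is hypothetical.

```python
def to_constant_dollars(nominal_amount, index_source_year, index_target_year):
    """Rescale a nominal dollar amount to the price level of the target year.

    real = nominal * (index in target year / index in source year)
    """
    if index_source_year <= 0 or index_target_year <= 0:
        raise ValueError("price index values must be positive")
    return nominal_amount * index_target_year / index_source_year


# Placeholder index values for illustration only (NOT real CPI figures).
INDEX_2015 = 100.0
INDEX_2016 = 110.0

# Express a 2015 rent of 685 dollars at the 2016 price level.
rent_in_2016_dollars = to_constant_dollars(685, INDEX_2015, INDEX_2016)
print(rent_in_2016_dollars)
```

With real index values, rent_twobed2015 and med_hhinc2016 could be put on the same price basis before being compared.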

Always interesting to see how Montgomery County stacks up nationally:

  moco all.Mean all.Min. all.Median all.Max.
state 24 31 1 30 72
county 31 103 1 79 840
cz 11304 20397 100 21701 39400
czname Washington DC NA NA NA NA
frac_coll_plus2010 0.5617628 0.19 0.037 0.17 0.71
foreign_share2010 0.3100337 0.044 0 0.024 0.72
med_hhinc2016 112746.9 48260 11640 46718 129150
poor_share2010 0.06015318 0.16 0 0.15 0.65
share_white2010 0.4926696 0.76 0.002 0.85 0.99
share_black2010 0.1775243 0.09 0 0.022 0.86
share_hisp2010 0.1702016 0.1 0 0.034 1
share_asian2010 0.1180755 0.0095 0 0.0039 0.43
gsmn_math_g3_2013 3.299804 3.2 -0.66 3.2 6.6
rent_twobed2015 1666.377 685 172 638 2085
singleparent_share2010 0.234801 0.31 0 0.3 0.81
traveltime15_2010 0.1349907 0.4 0.087 0.38 0.99
mail_return_rate2010 82.40179 80 0 81 100
popdensity2010 1978.14 287 0.04 47 70584
ann_avg_job_growth_2004_2013 0.007636189 -0.0029 -0.083 -0.0033 0.12
job_density_2013 1050.902 129 0.02 19 36663

and in the DC commuting zone

  moco dcCz.Mean dcCz.Min. dcCz.Median dcCz.Max.
state 24 40 11 51 51
county 31 209 1 61 685
cz 11304 11304 11304 11304 11304
czname Washington DC NA NA NA NA
frac_coll_plus2010 0.5617628 0.42 0.21 0.37 0.71
foreign_share2010 0.3100337 0.17 0.026 0.19 0.31
med_hhinc2016 112746.9 95058 57429 93129 129150
poor_share2010 0.06015318 0.074 0.032 0.06 0.19
share_white2010 0.4926696 0.61 0.15 0.61 0.9
share_black2010 0.1775243 0.18 0.05 0.14 0.65
share_hisp2010 0.1702016 0.13 0.027 0.12 0.33
share_asian2010 0.1180755 0.055 0.005 0.038 0.15
gsmn_math_g3_2013 3.299804 3.7 1.8 3.7 5
rent_twobed2015 1666.377 1447 876 1440 2085
singleparent_share2010 0.234801 0.26 0.15 0.23 0.51
traveltime15_2010 0.1349907 0.18 0.11 0.16 0.28
mail_return_rate2010 82.40179 80 75 81 83
popdensity2010 1978.14 2978 28 1789 9856
ann_avg_job_growth_2004_2013 0.007636189 0.013 -0.0065 0.01 0.037
job_density_2013 1050.902 1766 14 937 5981