Math 215 Final Project

In economics, one of the major fields of study is socioeconomic mobility. Socioeconomic mobility refers to the movement of one’s social class or economic level, which can be measured through job changes, income, marriage, and many other factors. This field is extremely important as understanding why some individuals can climb through income levels and others become stuck is crucial in creating effective policies to help reduce poverty. However, socioeconomic mobility can be quite difficult to measure as there tend to be many factors impacting one’s mobility, such as education, location, gender, and much more.

Until this summer, it has been assumed that the following four statements are true even without mathematical and empirical proof due to the difficulty and cost of finding such conclusions. These are:

1: Friendship networks are strongly class-based

2: Economic connectedness is the only form of social capital that boosts mobility

3: Economic connectedness boosts mobility more than anything else

4: Friending bias and economic segregation contribute equally to a lack of connectedness

While one might assume that these conclusions are fairly obvious, it costs a lot to get cross-country data showing these results. In August 2022, in Raj Chetty’s paper Social capital I: measurement and associations with economic mobility, that data was collected to confirm these conclusions.

In Chetty’s paper, they use data on 21 billion friendships from Facebook to study social capital. From the Facebook profiles, they use the ZIP code in the United States to examine the connectedness between different types of socioeconomic status; social cohesion, such as the extent of cliques in friendship networks; and civic engagement, such as rates of volunteering.

This analysis concluded that the share of high-SES friends among individuals with low SES is among the strongest predictors of upward income mobility identified. For example, if children who grew up in a low economic status household grew up in counties with economic connectedness where their peers have high SES parents, their income would increase by 20% on average.

This paper will use the same data to attempt to replicate some of Chetty’s findings. To do this, we’ll first define economic connectedness by two times the share of high SES friends among low SES averaged, overall low SES individuals in the country. Mathematically it is represented as

\[\text{IEC}_i(g)\equiv \biggl\{\frac{H_i(g)}{d_i(g)}\biggl\}/0.5\] \[\text{EC}_c(g)= \frac{\sum_{i\in L \cap c}\text{IEC}_i(g)}{N_{Lc}}\]

Using this definition we will measure and compare the effect of upward income mobility explained by economic connectedness through country and zip. Finally we will try to identify friending bias and exposure by college by comparing Share of high parental SES students to friending bias among low parental SES students .

Exploratory Data:

The creation of the country dataset:

#Creates County Data Set
Data <- read_csv("/Users/kenfritzell/Downloads/social_capital_county.csv")

## Rows: 3089 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): county_name
## dbl (25): county, num_below_p50, pop2018, ec_county, ec_se_county, child_ec_...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data1 <-  read_dta("/Users/kenfritzell/Downloads/social_capital_replication/Other public covariate data/county_covariates.dta")
County_data <- merge(Data,Data1,by="county")
County_data$kfr_pooled_pooled_p25 <- County_data$kfr_pooled_pooled_p25*100
head(County_data)

The first relationship we will explore is if there’s association between upward income mobility which we’ll measure by household income rank for children parents at the 25th income percentile and economic connectedness across countries.

#Plots County graphs 
ggplot(County_data, aes(x=ec_county,y=kfr_pooled_pooled_p25))+
  geom_point(col = ifelse(1:nrow(County_data) == 2936
, "red", "black"))+
   labs(x = "Economic connectedness", y = "Predicted Household Income Rank for \nChildren with Parents at 25th Income Percentile") +
  theme_classic()+
  ggtitle("Associations between Upward Income Mobility, Economic Connectedness\nacross Counties")

## Warning: Removed 71 rows containing missing values (`geom_point()`).

County_data$Group <- ifelse(County_data$kfr_pooled_pooled_p25<39, "<39",
                     ifelse(39 <= County_data$kfr_pooled_pooled_p25 & County_data$kfr_pooled_pooled_p25 < 42, "39-42",
                     ifelse(42 <= County_data$kfr_pooled_pooled_p25 & County_data$kfr_pooled_pooled_p25 < 45, "42-45",
                     ifelse(45 <= County_data$kfr_pooled_pooled_p25 & County_data$kfr_pooled_pooled_p25 < 49, "45-49",
                     ifelse(County_data$kfr_pooled_pooled_p25 > 49, ">49",
                                                                    NA ))))) # all other values map to NA
ggplot(County_data, aes(x=med_hhinc2017,y=ec_county, colour= Group))+
  geom_point()+
  labs(x = "Median household income by county (US$)", y = "Economic connectedness",colour = "Upward Mobility (child's \nincome rank in adulthood \ngiven parents at 25th \nincome percentile") +
  theme_classic()+
  ggtitle("Associations between Upward Income Mobility, Economic Connectedness,\nand Median Household Income by County")

## Warning: Removed 72 rows containing missing values (`geom_point()`).

There seems to be an obvious positive relationship between economic connectedness and upward income mobility and perhaps a quadratic relationship between economic connectedness and median household income. Now that we’ve explored connectedness by country, we will do the same for ZIP code data. In the Associations between Upward Income Mobility, Economic Connectedness across Counties graph the red dot show Walla Walla county which shows we’re kinda below the middle of economic connectedness.

Creating the Zip Data Set:

#Creates Zip Data
DataZip <- read_csv("/Users/kenfritzell/Downloads/social_capital_zip.csv")

## Rows: 23028 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (23): zip, county, num_below_p50, pop2018, ec_zip, ec_se_zip, nbhd_ec_zi...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data1Zip <-  read_dta("/Users/kenfritzell/Downloads/social_capital_replication/Other public covariate data/zip_covariates.dta")
Zip_data <- merge(DataZip,Data1Zip,by="zip")
Zip_data$kfr_pooled_pooled_p25 <- Zip_data$kfr_pooled_pooled_p25*100
head(Zip_data)

#Creates Zip Graphs
ggplot(Zip_data, aes(x=ec_zip,y=kfr_pooled_pooled_p25))+
  geom_point(col = ifelse(1:nrow(Zip_data) == 22949, "red", "black"))+
  labs(x = "Economic connectedness", y = "Predicted Household Income Rank for \nChildren with Parents at 25th Income Percentile") +
  theme_classic()+
  ggtitle("Associations between Upward Income Mobility, Economic Connectedness\nacross ZIP codes")

## Warning: Removed 4050 rows containing missing values (`geom_point()`).

Zip_data$Group <- ifelse(Zip_data$kfr_pooled_pooled_p25<39, "<39",
                     ifelse(39 <= Zip_data$kfr_pooled_pooled_p25 & Zip_data$kfr_pooled_pooled_p25 < 42, "39-42",
                     ifelse(42 <= Zip_data$kfr_pooled_pooled_p25 & Zip_data$kfr_pooled_pooled_p25 < 45, "42-45",
                     ifelse(45 <= Zip_data$kfr_pooled_pooled_p25 & Zip_data$kfr_pooled_pooled_p25 < 49, "45-49",
                     ifelse(Zip_data$kfr_pooled_pooled_p25 > 49, ">49",
                                                                    NA ))))) # all other values map to NA
ggplot(Zip_data, aes(x=med_inc_2018,y=ec_zip, colour= Group))+
  geom_point()+
  labs(x = "Median household income in ZIP Code (US$)", y = "Economic connectedness",colour = "Upward Mobility (child's \nincome rank in adulthood \ngiven parents at 25th \nincome percentile") +
  theme_classic()+
  ggtitle("Associations between Upward Income Mobility, Economic Connectedness,\nand Median Household Income by ZIP code")

## Warning: Removed 4105 rows containing missing values (`geom_point()`).

There seems to be an obvious positive relationship between economic connectedness and upward income mobility for zip code data within the United States. Also, perhaps a quadratic relationship between economic connectedness and median household income. However, we must note that there are 3089 observations from the country data and 32817 observations from the zip code data. Thus they might have very similar relationships, but we’ll determine that through statistical analysis. In the Associations between Upward Income Mobility, Economic Connectedness across ZIP codes graph the red dot show Walla Walla county which shows we’re kinda below the middle of economic connectedness.

The last item we’ll be exploring is the friending bias and college exposure. To measure this, we’ll use data from colleges telling us the share of high parental SES students to explain friending bias among low parental SES students. We measured the share of high parental SES students by taking two times the average share of high parental SES individuals within three class years, averaged over low parental SES users. The friending bias among low parental SES students was calculated by taking Economic connectedness with parental SES divided by the mean exposure to high parental SES individuals by the college for low parental SES individuals; all subtracted from one.

Creating the data set:

#Creates College DataSet
DataCollege <- read_csv("/Users/kenfritzell/Downloads/social_capital_college (1).csv")

## Rows: 2586 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): college_name
## dbl (21): college, zip, county, mean_students_per_cohort, ec_own_ses_college...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data1College <-  read_dta("/Users/kenfritzell/Downloads/social_capital_replication/Other public covariate data/college_characteristics.dta")
College_data <- merge(DataCollege,Data1College,by="college")

#Creates College Plots
x_lab <- "Share of high-parental-SES students (%)"
y_lab <- "Friending bias among low-parental-SES students (%)"
aes_specs <- aes_string(x = "exposure_parent_ses_college", y = "bias_parent_ses_college")

## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation ideoms with `aes()`

College_data<-College_data %>% mutate(exposure_parent_ses_college = 100 * exposure_parent_ses_college / 2, 
                        bias_parent_ses_college =bias_parent_ses_college * 100) 
ggplot(College_data, aes_specs) +
          # plot all schools (higher (negative) values on the y axis means less friending bias)
          geom_point(alpha = 0.15,col = ifelse(1:nrow(College_data) == 1583, "red", "black")) +
          scale_y_continuous(limits = c(30, -15), breaks = c(30, 20, 10, 0, -10), trans = "reverse") +
          scale_x_continuous(limits = c(0, 90), breaks = c(0, 20, 40, 60, 80)) +
          labs(x = x_lab, y = y_lab) +
          ggtitle("Frending Bias and exposure by Colleges")+
          theme_classic() + 
          theme(
            axis.text.x = element_text(size = 12),
            axis.text.y = element_text(size = 12),
            axis.title = element_text(size = 12)
          ) +
    
          # Arrows on axis to indicate direction of exposure/bias
          annotate("segment", 
             x = 30, xend = 60, y = 30, yend = 30,
             arrow = arrow(angle = unit(10, "cm"), length = unit(0.4, "cm"), type = "closed")
          ) + 
          
          annotate("text", 
             x = 45, y = 28, yend = 28, label = "More exposure") + 
          
          annotate("segment", 
             x = 0, xend = 0, y = 20, yend = 0,
             arrow = arrow(angle = unit(10, "cm"), length = unit(0.4, "cm"), type = "closed")
          ) + 
    
          annotate("text", 
             x = 3, xend = 3, y = 10, label = "Less friending bias", angle = 90)

## Warning in annotate("text", x = 45, y = 28, yend = 28, label = "More exposure"):
## Ignoring unknown aesthetics: yend

## Warning in annotate("text", x = 3, xend = 3, y = 10, label = "Less friending
## bias", : Ignoring unknown aesthetics: xend

## Warning: Removed 149 rows containing missing values (`geom_point()`).

From this we note that as share of high parental SES students increases the friending bias starts to center around zero. But its where we have low share of high parental SES students where we get more friending bias. Thus we wont do a statistical test on this graph we will make the interesting assumption that as the share of high parental SES students increase in a college the frending bias will go to zero.

Data Analysis

Model 1

To start this this analysis we’ll first show a distribution plot of the household income rank for children with parents at 25th income percentile:

ggplot(data = County_data,aes(kfr_pooled_pooled_p25))+
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Looking at this histogram we see our response variable seems to be normally distributed thus we wont need to apply any transformation or other glm models.

Starting with the county data set we’ll use a workflow to create a cross validation system using a 80% training data set and 20% testing data set.

set.seed(2564)
county_split <- initial_split(County_data, prop=0.8) # prop = 3/4 by default
#Training
county_train <- training(county_split)
dim(county_train)

## [1] 2471   63

#Testing
county_test <- testing(county_split)
dim(county_test)

## [1] 618  63

To find these relationships we’ll be using a linear regression, thus setting a regression which we can call later in our workflow.

Linear_mod <-linear_reg() %>%
set_engine("lm")
Linear_mod

## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

Now creating our recipe for our county data which we’ll use to predict the household income rank for children with parents at 25th income percentile based on economic connectedness on the training data. Now for ease we’ll scale economic connectedness between 0 and 1 :

County_rec <- recipe(kfr_pooled_pooled_p25 ~ ec_county, data = county_train) %>%
step_zv(all_predictors()) %>%
step_nzv(all_predictors()) %>%
step_range(ec_county,min = 0,max = 1)
County_rec

## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          1
## 
## Operations:
## 
## Zero variance filter on all_predictors()
## Sparse, unbalanced variable filter on all_predictors()
## Range scaling to [0,1] for ec_county

Now creating the workflow with our linear model and recipe:

County_wflow <- workflow() %>%
add_model(Linear_mod) %>%
add_recipe(County_rec)
County_wflow

## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
## 
## • step_zv()
## • step_nzv()
## • step_range()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

Now using the workflow to fit the data on our training data:

County_fit <- County_wflow %>%
fit(data = county_train)
tidy(County_fit)

We note that from this regression we get an intercept value of 30.34 and the coefficient for economic connectedness is 25.67. Both these values are extremely significant as both test statistics are greater than 50 and have a pvalue less than our .05 alpha. Thus From this regression we get the following fitted model:

\[\widehat{kfrPooledPooledp25} = 30.34 + 25.67(ecCounty)\]

This model tells us that as we increase the economic connectedness by one, then the household income rank for children with parents at 25th income percentile increases by 25.67.

We will now use perform 5-fold cross validation using the sample code below. V-fold cross-validation randomly splits the data into V groups of roughly equal size (called “folds”). A resample of the analysis data consisted of V-1 of the folds while the assessment set contains the final fold. In basic V-fold cross-validation (i.e. no repeats), the number of resamples is equal to V.

set.seed(345)
folds <- vfold_cv(county_train, v = 5)
folds

Now applying our workflow to the folds:

set.seed(345)
County_fit_rs <- County_wflow %>%
fit_resamples(folds)
County_fit_rs

Now we’ll check the r squared and rmse to determine if our model has an acceptable fit:

collect_metrics(County_fit_rs)

Since we have a fairly low RMSE and rquared above .5 we’ll consider this to be a good fit model.Now we also are interested in knowing how well this model fits the training data thus we’ll collect the $R^2$. Since we’re happy with this we’ll now test the model with the testing data:

county_test_pred <-predict(County_fit, county_test) %>%
bind_cols(county_test %>% select(kfr_pooled_pooled_p25))
county_test_pred

From these predictions we can calculate the RMSE and R squared value to see how well the model preformed on the testing data:

rmse1 <- rmse(county_test_pred, truth = kfr_pooled_pooled_p25, estimate = .pred)
rmse1

rsq1 <- rsq(county_test_pred, truth = kfr_pooled_pooled_p25, estimate = .pred)
rsq1

As we see the rmse, standard, is quite low thus meaning we have a good fitting model. We also have a high r squared at rsq, above .5, which means our model does a good job at explaining the variability observed in the kfr_pooled_pooled_p25 variable is explained by our model.

Model 2

We will now run a regression using the zip code data to see how median household income in ZIP Code (US$) and upward mobility (child’s income rank in adulthood given parents at 25th income percentile) effects economic contentedness.

First to create a histogram of the response variable to make sure the economic contentedness is normally distributed.

ggplot(data = Zip_data,aes(ec_zip))+
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 4041 rows containing non-finite values (`stat_bin()`).

Looking at this histogram we see our response variable seems to be normally distributed thus we wont need to apply any transformation or other glm models.

First creating a cross validation system using a 80% training data set and 20% testing data set.

set.seed(2564)
Zip_split <- initial_split(Zip_data, prop=0.8) # prop = 3/4 by default
#Training
Zip_train <- training(Zip_split)
dim(Zip_train)

## [1] 18415    59

#Testing
Zip_test <- testing(Zip_split)
dim(Zip_test)

## [1] 4604   59

Now creating a recipe for our model. As we said we’re predicting economic contentedness given mean income and social mobility. Thus our theoretical model will be:

\[EC = \beta_0 + \beta_1(Mean \ Income) + + \beta_2(Mean \ Income)^2+ \beta_3(Social \ Mobility)+\epsilon \text{ Where } \epsilon \sim NI(0,\sigma^2)\]

Zip_train$med_inc_2018_dup <- Zip_train$med_inc_2018
Zip_test$med_inc_2018_dup <- Zip_test$med_inc_2018
Zip_rec <- recipe(ec_zip ~ med_inc_2018+ med_inc_2018_dup + Group, data =Zip_train) %>%
step_mutate(med_inc_2018_dup*med_inc_2018_dup)%>%
step_zv(all_predictors()) %>%
step_nzv(all_predictors()) 
Zip_rec

## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          3
## 
## Operations:
## 
## Variable mutation for med_inc_2018_dup * med_inc_2018_dup
## Zero variance filter on all_predictors()
## Sparse, unbalanced variable filter on all_predictors()

Now creating the workflow with our linear model and recipe:

Zip_wflow <- workflow() %>%
add_model(Linear_mod) %>%
add_recipe(Zip_rec)
Zip_wflow

## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
## 
## • step_mutate()
## • step_zv()
## • step_nzv()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

Now using the workflow to fit the data on our training data:

Zip_fit <-Zip_wflow %>%
fit(data = Zip_train)
tidy(Zip_fit)

Since all values are significant with a p-value less than .05, the fitted model this regression gives us is:

$\widehat{Economic \ Connectedness} = .326 + 9.16*10^{-6}(median \ income) + .24(Group>49)$ $+ .0715(39-42)+ .01197(42-45) + .171(45-49)-2.76*10^{-11}(median \ income)^2$

While there’s a lot of terms in this model and wont go through each but first depending on the social mobility group the economic connectedness increases by the given coefficient as they are all positive. We also note that as the median income squared term increases by one then economic connected increases by $-2.76*10^{-11}$.

Like in model 1 we’ll now want to see how well this model fits the data, thus we’ll create a perform 5-fold cross validation:

set.seed(345)
folds2 <- vfold_cv(Zip_train, v = 5)
folds2

Now applying our Zip work flow on this data:

set.seed(345)
Zip_fit_rs <- Zip_wflow %>%
fit_resamples(folds2)

## ! Fold1: preprocessor 1/1, model 1/1 (predictions): prediction from a rank-deficient fit may be misleading

## ! Fold2: preprocessor 1/1, model 1/1 (predictions): prediction from a rank-deficient fit may be misleading

## ! Fold3: preprocessor 1/1, model 1/1 (predictions): prediction from a rank-deficient fit may be misleading

## ! Fold4: preprocessor 1/1, model 1/1 (predictions): prediction from a rank-deficient fit may be misleading

## ! Fold5: preprocessor 1/1, model 1/1 (predictions): prediction from a rank-deficient fit may be misleading

Zip_fit_rs

Now showing the RMSE and R2 for all cross validation folds:

collect_metrics(Zip_fit_rs)

Zip_test_pred <-predict(Zip_fit, Zip_test) %>%
bind_cols(Zip_test %>% select(ec_zip))

## Warning in predict.lm(object = object$fit, newdata = new_data, type =
## "response"): prediction from a rank-deficient fit may be misleading

Zip_test_pred

From these predictions we can calculate the RMSE and R squared value to see how well the model preformed on the testing data:

rmse2 <- rmse(Zip_test_pred, truth = ec_zip, estimate = .pred)
rmse2

rsq2 <- rsq(Zip_test_pred, truth = ec_zip, estimate = .pred)
rsq2

As we see the rmse, is quite low thus meaning we have a good fitting model. We also have a high r squared at rsq, above the .6 level, which means our model does a good job at explaining the variability observed in the ec_zip variable is explained by our model.

Conclusion:

Overall, both our models do a pretty good job at estimating our predictors as we have both our r-squared values above the .50 level which means that >50% of the variability observed in the two response variables is explained by our models.

In this data analysis, within county data, there is a strong positive relationship between economic connectedness and predicted household income. This is important as it shows how important economic connectedness is to increase one’s income level. From the zip code data, we also find that higher upward mobility also increases economic connectedness and median household income, which makes sense as more mobility makes it easier to connect with others and grow their network.

Since we have found connections between economic connectedness and income, further studies can be done to examine what these relationships mean for high-income exposure and friending bias in the United States. As we know, one’s degree goes to increase their future income; however, not all college experiences are the same. For example, students at Whitman do not have the same connection opportunities as those at Harvard or Princeton.

Overall the models we’ve created have done a good job at modeling our response variables and leave the door open for many more studies to better explain the important ideas in economic mobility.