The Correlation between Poverty and Various Internet Infrastructure Indicators

Team 2, School of Professional Studies, CUNY
5/24/2019

Question 1: What economic indicators (race, occupation, community poverty rate) are most strongly correlated with internet access rates? Can we build a model that accurately predicts said rates?
Question 2: Are internet access rates a stronger predictor of poverty rates than other forms of social investment (ie roads, schools, hospitals)?
Question 3: Do these effects extend across internet technologies (cell phone sand broadband internet)? If not, which type of infrastructure investment is better.

Team Members: Charlie, Olga, Julian, Hantz, and Mike

Overview

  • Motivation, Literature Review, Methodology, and Hypothesis

  • Internet Inequality- A Global Perspective

  • Correlation between Various Technology Indicators and Poverty Rates

  • Data Initialization, Plots, and Models (SVM, Neural Networks, and GLM)

  • Finance Analysis

  • Conclusion and Next Steps

Abstract

  • Through this project, we examing the various econmic, social, and technological indicators in detail using American Census Data and measure it with respect to poverty rates
  • Using state by state data from 2017, we built several exploratory models before examing the regression coe???cients for their correlary information.
  • We repeated this process using state expenditures data in addition to the technological indicators from above.
  • Using generalized linear models and random forest regressors, we were able to build descriptive models that all had R2 values of 0.80,meaning the chosen indicators explained 80% of the variance in poverty.
  • Additionally, all models had a root mean square error of 1-2% of the actual poverty rate. We found that technological indicators like a STEM education, home computer access, broadband internet, and smarthphones were most highly correlated with poverty rates.

Motivation

  • Internet access is a mix bag
  • Internet can benefit politicians, used to spread hate, and intrude on our privacy
  • On other hand, Internet can advance economies, and these who do not utilize it will be left behind
  • To speak about this in particular, we need to investigate the ways in which internet access infuences occupational outlook while controlling for other confounding factors like geography, race, and infrastructure investment more generally.

Literature Review

  • Almost every study, despite the methodology and whether it was cross-country or single country, found a positive economic impact from fixed broadband
  • Using Ordinary Least Square model and data from 1998-2002 the study shows that the availability of broadband services adds as much as 1.4% to the employment growth regardless of actual adoption" (H. Galperin, 2006)
  • One study found that of all ICTs (i.e., fixed telephones, mobile, Internet use and broadband), broadband has the biggest economic impact (Qiang et al., 2009)
  • However another study found that in a low-income economy, mobile has a bigger impact, both in terms of basic subscriptions and mobile broadband (Katz and Koutroumpis, 2012)

Sources

Methodology

  • We will use chloropleth maps to show us effect of different variables
  • We will build a support vector machine, a neural net, a random forest regressor, and the generalized linear models to see how internet access and infrastructure investment influence poverty rates
  • The ACS includes internet access rates, poverty, race, industry, language, occupation, place of birth, and familial origin which should help us to determine if race or occupation is a better indicator of aggregate poverty than internet access rates
  • Financial data will help us to investigate if state spending on infrastructure affects poverty more or less than technological infrastructure

Hypothesis

  • Pew Research says that 20% of teens are unable to finish their homework due to the digital divide. The end result of this is likely low-skill careers and lower incomes.
  • In fact, the internet tends to raise the tide for all, as a breadth study (also by Pew) showed that per capita income and access rates are highly correlated.
  • We'd like to investigate the relationship between technology and the economy and see if we can build models resilient to the particle type of device.
  • Due to the multiplicative of effects in education, business opportunities, and spending opportunities available on the Internet, we suspect that governmental investment in digital infrastructure will have at least as much affect as road or school spending.
  • This research will reveal to governments (both local and national) what kinds of infrastructure investment yields the most economic gains in the digital age. To our knowledge, this particular question has not been answered.

Correlation between Various Technology Indicators and Poverty Rates: Data Initialization and Preprocessing

  1. Fixing Missing Data: To fill in any missing values, the empty fields were replaced with the mean of the column in which they come from.
  2. Evaluation of metrics:
    a. Data was scaled and centered
    b. Missing data was imputed
  3. Correlation of data: We addressed issue of multicollinearity, which improved a fit of our models.
  4. Data Pre-processing: Near-zero predictors were removed; Missing values were imputed; We applied Yeo-Johnson transformation method to our predictor variables

Correlation between Various Technology Indicators and Poverty Rates: Support Vector Machine

  • Comparing SVM with Neural Networks (NN), both are non-linear algorithms
  • A Support Vector Machine with different kernels is comparable to a Neural Network with different layers
  • The support vector machine performed decently well, but since it optimizes for two introduced variables, it does little to help us explain poverty rates.
  • Support vector machines are used for reducing dimensionality, not for examining the data in place.
  • So, it is inappropriate for our purposes, even if it is relatively accurate
Model RMSE.train RSquared.train RMSE.test RSquared.test
Support Vector Machine 2.8493 0.8711891 1.5309 0.5936714

Correlation between Various Technology Indicators and Poverty Rates: Random Forest

  • The variable importance plot is composed of two subplots.
  • The left one indicates the mean square error introduced by excluding a given variable and is a measure of an indicator's relevance model.
  • The right one indicates the nodes that allow for the fewest number of decision splits in our model and are very similar to the linear model below, likely due to our minimization of model complexity (as measured by AIC).
  • The random forest regressor works by using automation and sampling to make many regression models before investigating the resulting output relative to the sensitivty and precision of the input. plot of chunk unnamed-chunk-2

Correlation between Various Technology Indicators and Poverty Rates: Generalized Linear Model

  • Because our output value is continuous and our data is numeric, we can use a generalized linear model to compute the pH.
  • These models are generic and assume linearity in response. We will use the “Gaussian” type which assumes normally distributed variables.
  • Mwabu et al. (2000) used regression analysis and identi???ed the following variables as the key determinants of poverty: size of household, places of residence(urban or rural), level of schooling and livestock.
  • We tried to pick indicators specific to the culture in which they were measured- that of America in 2017. So, we looked at race, education, and occupational indicators.
Model RMSE.train RSquared.train RMSE.test RSquared.test
GLM 0.722634 0.9854818 2.196015 0.8323232

Correlation between Various Technology Indicators and Poverty Rates: Generalized Linear Model (Continued)

  • This model shows a somewhat even distribution of the value of poverty indicators other than race indicators and percent foreign born. However these appear to encode geography which may be a confounding factor.
  • In particular, New York and California have large and diverse populations and the strongest economies.
  • More generally, the model shows that all that all racial indicators have positive effects on poverty rates.
  • Interestingly, immigration appears to be correlated with a decrease in poverty rates, suggesting again that this model appears to be encoding the more diverse and urban states.
  • Despite that the model is able to predict poverty rates to within 1% for the training set and 2% for the training set, which is an acceptable deviation.
  • Additionally,the test set has an R2 value of 83%, meaning that 83% of the variance in the poverty

Correlation between Various Technology Indicators and Poverty Rates: Conclusion - Social Indicators

  • We were successfully able to evaluate the effency of serveral different models.
  • The non-parametric models did not perform as well as the parametric one, perhaps because we had reduced our dimensionality during pre-processing.
  • Additionally, they both suffer from large test/train splits.
  • The linear model performed very well, but the strongest indicators were raced based - seeming to an indicate geography as a confounding factor. -
  • The random forest method performed only slightly worse than the generalized linear model, but due to the iterative nature of the algorithm,it had a superior test/train split indicating more generalizability
  • Because the glm and random forest have human-readable coeffcients/factors and performed the best, we will continue using them moving forward.

Finance Analysis

  • The American Community Survey also collects data from state governments about their spending habits.
  • First, we use the random forest model because the variable importance that it measures will inform our linear model below.
  • The most important variables on the financial data set are computer, smartphone, and state debt. That is true for both precision and accuracy.
  • Notably, the computer indicator is a better indicator of poverty status than broadband access, suggesting that the ability to learn and create technology is far more valuable than mere access.
  • As we can see above, both models are good enough to be relied on for describing the relationships between our data vectors.
  • However, the test/train gap present in both indicates a lack of generalizability in the models.

Conclusion

  • Across the board, our models seem to indicate that having a laptop or desktop computer is a far more valuable indicator of poverty rate than smart phone or broadband subscription rates.
  • While the social models are dominated by racial considerations, the positive value of all racial coeffcients seems to favor diversity over a given race.
  • When it comes to the economic models, technological access is a much strong indicator of poverty rates than state spending on education, schools or police.
  • Since our models are not totally consistent, it's impossible to draw a strong conclusion about correlation or causality.
  • However, it's clear that technological access is important to a healthy economy and a reduction of poverty rates.
  • Luckily, there are ways to measure these things more directly. Unfortunately, they are outside the scope of this investigation.

Next Steps

  • There are many ways we can continue improving the model performance, one method could be runnin gmore times of cross validation on more folds than 3 times
  • 3-fold we have now for SVM and GBM models.
  • It would take a longer time to compute, but the results would likely be better.
  • Finally, more data would help us build a better model, in particular because the gap between the test and train sets tends to be relatively largeacross all of the models, with the exception of the Random Forest regressor because it builds its model over 1000 iterations.
  • We could also use time-series data to examine causal relationships between the data using either Bayseian networks or Granger inference.
  • The former requires a much larger aomount of data (perhaps all county level census data since 2010) and the Granger analysis requires continuous time-series data.