ABSTRACT

As Data Scientists, we recognize that “big data” can be used to influence elections, spread hateful propaganda, and track every purchase and decision we make. However, we believe that the Internet, as a whole, provides many positive economic outlets.

We seek to quantify the positive effects of internet access on a global scale through our analysis of internet infrastructure and select social and economic indicators.

Our research process is outlined below for those interested in replicating our work.

R DEPENDENCIES

Replication requires the use of the following dependencies:

library(curl) 
library(XML) 
library(wbstats)
library(data.table)
library(tidyr)
library(dplyr)
suppressWarnings(source("indicators.R"))

RESEARCH QUESTION

Does internet access correlate with the chosen indicators of inequality?

H0: Internet penetration rates are not correlated with the chosen equity indicators.
HA: Internet penetration rates are correlated with the chosen equity indicators.
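
One way to test this pair of hypotheses is a Pearson correlation test. The sketch below uses base R's cor.test() on made-up annual means; the vector names and values are illustrative assumptions, not the study's data.

```r
# Illustrative values only: hypothetical annual means, not the study's data.
internet_mean <- c(8, 11, 13, 16, 19, 21, 24, 27, 29, 32)
gini_mean <- c(42.0, 41.5, 40.2, 40.8, 39.9, 39.1, 39.5, 38.2, 38.6, 37.4)

# Pearson correlation test: H0 states that the true correlation is zero.
test <- cor.test(internet_mean, gini_mean, method = "pearson")

r <- unname(test$estimate)        # sample correlation coefficient
reject_h0 <- test$p.value < 0.05  # TRUE means H0 is rejected at the 5% level

print(round(r, 3))
print(reject_h0)
```

The 5% significance level is a conventional choice and is assumed here; the source does not state which threshold was used.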

Does the selected economic indicator have a stronger effect on internet access than the social equity indicators?

H0: Economic indicators have a stronger effect than social measurements on Internet access.
HA: Economic indicators do not have a stronger effect than social measurements on Internet access.

INDICATORS

Aggregated indicators were selected from the World Bank Indicator API Queries. We used the curl and XML packages to download and parse the socio-economic indicators. In addition, we used the wbstats package for the internet data. This package communicates directly with the API to download World Bank data.

Documentation for this process can be found in the internet.Rmd and indicators.R files within this repository.

We have outlined the selected indicators and their corresponding World Bank definitions below:

  1. Socio-Economic Indicators:
    • SI.POV.GINI: The Gini index measures the distribution of income or consumption. It captures the extent to which the distribution of income or consumption among individuals or households within an economy deviates from a perfectly equal distribution.
    • NY.GDP.MKTP.KD.ZG: Annual percentage growth rate of GDP at market prices based on constant local currency. Aggregates are based on constant 2010 U.S. dollars.
    • SE.ADT.LITR.ZS: Literacy rate, adult total (% of people ages 15 and above).
  2. Internet Indicators
    • IT.NET.USER.ZS: Internet users are individuals who have used the Internet (from any location) in the last 3 months. The Internet can be used via a computer, mobile phone, personal digital assistant, games machine, digital TV etc.

Collection

All socio-economic indicators were collected via the World Bank web API; see the indicators.R file in the repository. All internet indicators were collected with the wb() function from the wbstats package, which presumably queries the same API. After cleaning and analysis, the internet data was saved as a CSV and re-imported to fulfill the requirements of the assignment.

Socio-Economic Indicators

GINI Index

We used the GINI index as a socio-economic proxy measure for equity. A Gini index value of 0 represents perfect equality, whereas a value of 100 represents perfect inequality.
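
To make the 0-to-100 scale concrete, here is a minimal sketch of one common discrete formula for the Gini coefficient, scaled to 0-100. The function name gini_index and the toy incomes are assumptions for illustration; the World Bank's published estimates come from household survey data and are not reproduced by this snippet.

```r
# Gini index (0-100) from a vector of non-negative incomes, using the
# sorted-rank formula G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n.
gini_index <- function(income) {
  x <- sort(income)
  n <- length(x)
  g <- 2 * sum(seq_len(n) * x) / (n * sum(x)) - (n + 1) / n
  100 * g
}

equal_society <- rep(10, 5)           # everyone earns the same
unequal_society <- c(0, 0, 0, 0, 50)  # one person earns everything

gini_equal <- gini_index(equal_society)      # 0: perfect equality
gini_unequal <- gini_index(unequal_society)  # 80: the cap for n = 5 is 100 * (n - 1) / n

print(gini_equal)
print(gini_unequal)
```

With this formula the maximum for a finite group is 100 * (n - 1) / n, which approaches 100 as the population grows.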

As shown below, GINI index data is unavailable for many countries and years. The World Bank chooses a different reference year in its analysis of each country, as shown in Table 1.3 of the World Development Indicators. A selected portion of the data we parsed from this table can be seen below:

Selected GINI Values, 2000-2016
country              iso3code  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016
Brazil               BRA         NA  58.4  58.1  57.6  56.5  56.3  55.6  54.9  54.0  53.7    NA  52.9  52.6  52.8  51.5  51.3    NA
Barbados             BRB         NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
Malaysia             MYS         NA    NA    NA    NA  46.1    NA    NA  46.1  45.5    NA    NA  43.9    NA  41.3    NA  41.0    NA
St. Kitts and Nevis  KNA         NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
United Kingdom       GBR         NA    NA    NA    NA  36.0  34.3  34.6  35.7  34.1  34.3  34.4  33.2  32.3  33.2  34.0  33.2    NA

We chose to analyze the Gini data through annual aggregates. However, we note that the absence of data for certain countries and time periods could affect our results.
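
The annual aggregation can be sketched in base R as a column-wise mean that skips missing values. The data frame below holds only a few cells copied from the sample table above, and the object names are placeholders, so this illustrates the idea rather than reproducing the pipeline in indicators.R.

```r
# A few cells from the sample GINI table (wide format: one column per year).
gini_wide <- data.frame(
  country = c("Brazil", "Barbados", "Malaysia", "United Kingdom"),
  `2000` = rep(NA_real_, 4),
  `2004` = c(56.5, NA, 46.1, 36.0),
  `2005` = c(56.3, NA, NA, 34.3),
  check.names = FALSE
)

year_cols <- setdiff(names(gini_wide), "country")

# Mean across countries for each year, ignoring NA cells.
annual_mean <- colMeans(gini_wide[, year_cols], na.rm = TRUE)

# Number of countries that actually report a value in each year.
n_reporting <- colSums(!is.na(gini_wide[, year_cols]))

print(round(annual_mean, 2))
print(n_reporting)
```

A year in which no country reports a value yields NaN, and a year with only two or three reporting countries produces a mean that depends heavily on which countries happened to be surveyed. Tracking n_reporting alongside the mean makes that visible.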

Our initial review shows that almost three quarters of the values in this dataset (72%) are missing. Less than 5% of values are above 50, the range that indicates higher inequality. The median index value for the Gini dataset is 35.7, while the mean value is 37.88.

The summary statistics and visual inspection of the histogram and boxplot show that data for this indicator is unimodal and skewed to the right.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   16.20   30.80   35.70   37.88   44.10   64.80    2641

GINI Index Proportion Table
index          freq   prop
more inequal    142   0.04
more equal      904   0.25
n.a. value     2641   0.72
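
A proportion table like this one can be sketched with base R's table() and prop.table(). The cutoff of 50 follows the text above; the input vector is a made-up stand-in for the full set of country-year values, and the category labels are assumptions.

```r
# Hypothetical country-year Gini values (NA = not reported).
gini_values <- c(58.4, 33.2, NA, 46.1, NA, NA, 51.3, 36.0, NA, NA)

# Classify each value: missing, above 50, or at most 50.
index <- ifelse(is.na(gini_values), "n.a. value",
                ifelse(gini_values > 50, "more unequal", "more equal"))

freq <- table(index)
prop <- round(prop.table(freq), 2)

prop_table <- data.frame(index = names(freq),
                         freq = as.vector(freq),
                         prop = as.vector(prop))
print(prop_table)
```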

Selected GDP growth (annual %) Indicator

We chose annual GDP growth to see whether internet infrastructure contributes more to social mobility than general economic growth does. Below is an output of the data parsed and tidied from the World Bank:

Selected GDP growth (annual %) Values, 2000-2016
country iso3code 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
Vietnam VNM 6.787316 6.1928933 6.3208210 6.8990635 7.536411 7.547248 6.977955 7.129505 5.6617712 5.397898 6.423238 6.240303 5.247367 5.4218830 5.9836546 6.6792888 6.210812
China CHN 8.491509 8.3399105 9.1306459 10.0356030 10.111223 11.395776 12.719479 14.231388 9.6542894 9.399813 10.636140 9.536443 7.856262 7.7576351 7.2976660 6.9002048 6.700000
Finland FIN 5.634848 2.5807921 1.6803251 1.9939841 3.926057 2.779955 4.055197 5.184801 0.7206685 -8.269037 2.992338 2.570818 -1.426189 -0.7580363 -0.6317281 0.1350823 2.135382
Israel ISR 8.169052 0.0252953 0.1618297 0.7675661 4.569715 4.133446 5.210821 5.773553 2.9883954 1.381607 5.223904 4.657831 1.942508 4.1111552 3.4101365 3.0379439 4.094336
Afghanistan AFG NA NA NA 8.4441632 1.055556 11.175270 5.554138 13.740205 3.6113684 21.020649 8.433291 6.113685 14.434741 3.9005749 2.6905219 1.3100404 2.366712

The annual GDP growth measurement was much more complete than the GINI index. This indicator contained only 310 missing values, accounting for 8.4% of all values. The summary statistics and visual inspection of the histogram and boxplot show that data for this indicator is unimodal and slightly skewed to the right.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -62.076   1.581   3.786   3.855   6.113 123.140     310

Literacy

We used the adult literacy indicator as a socio-economic proxy measure for poverty. We are interested in examining the relationship between internet usage and literacy as well as between literacy and GDP growth. Below is an output of the data parsed and tidied from the World Bank:

Selected Literacy Values, 2000-2016
country                    iso3code      2000      2001      2002      2003      2004      2005      2006      2007      2008      2009      2010      2011      2012      2013      2014      2015      2016
Slovenia                   SVN             NA        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA
Korea, Dem. People’s Rep.  PRK             NA        NA        NA        NA        NA        NA        NA        NA  99.99819        NA        NA        NA        NA        NA        NA        NA        NA
Jamaica                    JAM             NA        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA
Hungary                    HUN             NA        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA        NA
Nepal                      NPL             NA  48.60897        NA        NA        NA        NA        NA        NA        NA        NA        NA  59.62725        NA        NA        NA        NA        NA

Like the GINI index, this measurement also contained a significant number of missing values. Approximately 86% of values in this dataset were missing. Among the values reported, we found countries to have high literacy rates on average. The histogram and boxplot show that this data is unimodal and skewed to the left.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   14.38   72.66   91.17   81.77   95.82  100.00    3163

Internet Indicators

Internet Usage

This indicator measures the number of individuals using the Internet as a percentage of country population.

While other internet indicators are available through the World Bank, this is the only dataset we had that was complete enough to be useful.

internet <- read.csv("internet/internet_over_time.csv")
plot(internet$annual.mean~internet$years)
model <- lm(internet$annual.mean~internet$years)
abline(model)

summary(model)
## 
## Call:
## lm(formula = internet$annual.mean ~ internet$years)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.76034 -0.53410  0.02895  0.28827  1.04662 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -5.296e+03  6.009e+01  -88.14   <2e-16 ***
## internet$years  2.652e+00  2.992e-02   88.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6044 on 15 degrees of freedom
## Multiple R-squared:  0.9981, Adjusted R-squared:  0.998 
## F-statistic:  7852 on 1 and 15 DF,  p-value: < 2.2e-16

The annual mean internet usage rate grows by about 2.65 percentage points every year. Furthermore, this model has an \(R^2\) value of .998, meaning that a linear time trend explains almost all of the variation in the annual mean across the 17 years in the dataset.
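
A regression slope is an additive change per unit of the predictor, not a multiplicative factor. The sketch below fits the same kind of model on synthetic data built to rise by exactly 2.652 percentage points per year, then compares predictions for two consecutive years. The data frame and its starting value of 8 are assumptions for illustration, not the contents of internet_over_time.csv.

```r
# Synthetic series: starts at 8% in 2000 and rises 2.652 points per year.
toy <- data.frame(years = 2000:2016)
toy$annual.mean <- 8 + 2.652 * (toy$years - 2000)

fit <- lm(annual.mean ~ years, data = toy)
slope <- unname(coef(fit)["years"])

# Predictions for consecutive years differ by the slope (an additive change).
pred <- predict(fit, newdata = data.frame(years = c(2015, 2016)))
yearly_change <- unname(pred[2] - pred[1])

print(round(slope, 3))
print(round(yearly_change, 3))
```

Testing for multiplicative growth would instead call for a model of log(annual.mean) against years, where the exponentiated slope gives the yearly growth factor.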

ANALYSIS

First, we examined the relationship between our three socio-economic indicators and found weak correlations between the selected variables.
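
A pairwise correlation matrix is one quick way to run this kind of comparison. The sketch below uses a small made-up annual.means data frame with columns gini, gdp, and lit (the column names match the models in this section; the values are invented) and tolerates missing entries.

```r
# Invented annual means for illustration; NA marks a year with no data.
annual.means <- data.frame(
  gini = c(41.2, 40.8, NA, 39.9, 40.1, 38.7, 38.9, 37.5),
  gdp  = c(4.1, 2.3, 3.0, 4.6, 5.2, 1.9, -0.4, 4.3),
  lit  = c(78.5, 80.1, 79.4, NA, 82.0, 83.3, 81.9, 84.6)
)

# Pearson correlations, using every pair of years where both values exist.
cor_matrix <- cor(annual.means, use = "pairwise.complete.obs")
print(round(cor_matrix, 2))
```

With use = "pairwise.complete.obs", each cell may be computed from a different subset of years, which is worth keeping in mind when the indicators have very different coverage.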

We also compared these values pairwise using linear models:

lit.gdp.lm <- lm(annual.means$lit~annual.means$gdp)
summary(lit.gdp.lm)
## 
## Call:
## lm(formula = annual.means$lit ~ annual.means$gdp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.4804  -2.2881   0.7013   3.3223   6.6514 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       88.4953     4.0640  21.775 9.17e-13 ***
## annual.means$gdp  -2.0868     0.9881  -2.112   0.0519 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.932 on 15 degrees of freedom
## Multiple R-squared:  0.2292, Adjusted R-squared:  0.1778 
## F-statistic:  4.46 on 1 and 15 DF,  p-value: 0.05188

gdp.gini.lm <- lm(annual.means$gdp~annual.means$gini)
summary(gdp.gini.lm)
## 
## Call:
## lm(formula = annual.means$gdp ~ annual.means$gini)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0060 -0.6210 -0.1796  0.4822  2.3418 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)        4.36818    5.33915   0.818    0.426
## annual.means$gini -0.01355    0.13836  -0.098    0.923
## 
## Residual standard error: 1.549 on 15 degrees of freedom
## Multiple R-squared:  0.0006391,  Adjusted R-squared:  -0.06598 
## F-statistic: 0.009592 on 1 and 15 DF,  p-value: 0.9233

gini.lit.lm <- lm(annual.means$gini~annual.means$lit)
summary(gini.lit.lm)
## 
## Call:
## lm(formula = annual.means$gini ~ annual.means$lit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2142 -1.5687 -1.1067 -0.2172  5.9544 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       49.1521     8.4815   5.795 3.53e-05 ***
## annual.means$lit  -0.1325     0.1051  -1.261    0.227    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.749 on 15 degrees of freedom
## Multiple R-squared:  0.09581,    Adjusted R-squared:  0.03553 
## F-statistic: 1.589 on 1 and 15 DF,  p-value: 0.2267

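There is a moderate negative correlation between the annual mean Gini index and annual mean internet usage (R^2 = 0.53, slope = -0.15), which suggests that inequality tends to be lower when internet usage is higher. Because the model compares global annual means rather than individual countries, it does not by itself show that more unequal nations use the internet less.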

plot(gini.annual$annual.mean~internet$annual.mean)

model <- lm(gini.annual$annual.mean~internet$annual.mean)
summary(model)
## 
## Call:
## lm(formula = gini.annual$annual.mean ~ internet$annual.mean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3747 -1.1959 -0.7847  0.9313  3.8762 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          42.79073    1.16037  36.877 3.91e-16 ***
## internet$annual.mean -0.15140    0.03716  -4.074 0.000997 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.992 on 15 degrees of freedom
## Multiple R-squared:  0.5253, Adjusted R-squared:  0.4937 
## F-statistic:  16.6 on 1 and 15 DF,  p-value: 0.000997

We repeated the analysis for literacy versus internet usage:

plot(lit.annual$annual.mean~internet$annual.mean)

model <- lm(lit.annual$annual.mean~internet$annual.mean)
summary(model)
## 
## Call:
## lm(formula = lit.annual$annual.mean ~ internet$annual.mean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.3002  -1.9428   0.2232   3.0566   7.3965 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          70.94858    2.85963  24.810 1.36e-13 ***
## internet$annual.mean  0.33531    0.09158   3.661  0.00231 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.909 on 15 degrees of freedom
## Multiple R-squared:  0.472,  Adjusted R-squared:  0.4367 
## F-statistic: 13.41 on 1 and 15 DF,  p-value: 0.002315

Internet usage is moderately correlated with both the Gini index and literacy. The relationship is slightly weaker for literacy, given the respective R^2 values (0.47 versus 0.53).
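
For a one-predictor model, the Pearson correlation is the square root of R-squared, signed by the slope. The sketch below applies that identity to the R-squared values and slope signs reported in the two summaries above; the object names are placeholders.

```r
# Reported R-squared values and slope signs from the two models above.
models <- data.frame(
  response   = c("gini", "literacy"),
  r_squared  = c(0.5253, 0.4720),
  slope_sign = c(-1, 1)
)

# For simple linear regression, r = sign(slope) * sqrt(R^2).
models$r <- models$slope_sign * sqrt(models$r_squared)

print(models[, c("response", "r_squared", "r")], digits = 3)
```

Both correlations come out at roughly 0.7 in magnitude, with the literacy model slightly lower.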

plot(internet$annual.mean ~ gdp.annual$annual.mean, xlab = "Mean GDP growth (annual %)", ylab = "Mean internet users (% of population)")

model <- lm(internet$annual.mean ~ gdp.annual$annual.mean)
abline(model)

summary(model)
## 
## Call:
## lm(formula = internet$annual.mean ~ gdp.annual$annual.mean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.7329 -11.0752   0.8362   9.6916  18.6565 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              40.901      8.833   4.631 0.000327 ***
## gdp.annual$annual.mean   -3.252      2.148  -1.514 0.150701    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.89 on 15 degrees of freedom
## Multiple R-squared:  0.1326, Adjusted R-squared:  0.0748 
## F-statistic: 2.294 on 1 and 15 DF,  p-value: 0.1507

Additionally, we find only a weak, statistically insignificant correlation between annual mean GDP growth and annual mean internet usage (R^2 = 0.13, p = 0.15).

CONCLUSION

In our regressions on the 17 annual means, internet usage was more strongly associated with literacy (R^2 = 0.47) and with the Gini index (R^2 = 0.53) than GDP growth was (R^2 = 0.23 and R^2 < 0.01, respectively). The association is positive for literacy and negative for the Gini index, so years with higher mean internet usage also had higher mean literacy and lower mean inequality. Sparsity in the data prevented further analysis by country and would have inherently selected for OECD countries (which maintain a wonderful dataset). However, we were more interested in investigating developing nations. Further research will require a more comprehensive dataset. Additionally, we took time in this project to work on styling. Jemceach taught simplymathematics how to use knitr and we both figured out how to use ggExtra to make a gorgeous multi-plot graphic.

Challenges:

Time-series data within the selected indicators contained many missing values across country and time variables.

  1. Finding an internet indicator that had enough data points.
  2. Finding a social indicator that was a good comparison across countries.
  3. Finding a measure of social equity was hard for similar reasons.

Further:

With a more complete dataset we could compare these data points by country rather than by year. Furthermore, we could include other indicators (both economic and technical) and do a covariance analysis. However, without complete data, neither of these tasks would be meaningful.