As data scientists, we recognize that “big data” can be used to influence elections, spread hateful propaganda, and track every purchase and decision we make. However, we believe that the Internet as a whole provides many positive economic outlets.
We seek to quantify the positive effects of internet access on a global scale through our analysis of internet infrastructure and select social and economic indicators.
Our research process is outlined below for those interested in replicating our work.
Replication requires the use of the following dependencies:
library(curl)
library(XML)
library(wbstats)
library(data.table)
library(tidyr)
library(dplyr)
suppressWarnings(source("indicators.R"))
Does internet access correlate with the chosen indicators of inequality?
H0: Internet penetration rates are not correlated with the chosen equity indicators.
HA: Internet penetration rates are correlated with the chosen equity indicators.
Does the selected economic indicator have a stronger effect on internet access than the social equity indicators?
H0: Economic indicators have a stronger effect than social measurements on Internet access.
HA: Economic indicators do not have a stronger effect than social measurements on Internet access.
Aggregated indicators were selected from the World Bank Indicator API Queries. We used the curl and XML packages to download and parse the socio-economic indicators. In addition, we used the wbstats package for the internet data. This package communicates directly with the API to download World Bank data.
Documentation for this process can be found in the internet.Rmd and indicators.R files within this repository.
We have outlined the selected indicators and their corresponding World Bank definitions below:
All socio-economic indicators were collected via the World Bank web API; see the indicators.R file in the repository. All internet indicators were collected with the wb() function from the wbstats package, which queries the same API. After cleaning and analysis, the internet data was saved as a CSV and re-imported to fulfill the requirements of the assignment.
We used the Gini index as a socio-economic proxy measure for equity. A Gini index value of 0 represents perfect equality, whereas a value of 100 constitutes perfect inequality.
As shown below, Gini index data is unavailable for many countries and years. The World Bank chooses a different reference year in its analysis of each country, as shown in Table 1.3 of the World Development Indicators. A selected portion of the data we parsed from this table can be seen below:
| country | iso3code | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Brazil | BRA | NA | 58.4 | 58.1 | 57.6 | 56.5 | 56.3 | 55.6 | 54.9 | 54.0 | 53.7 | NA | 52.9 | 52.6 | 52.8 | 51.5 | 51.3 | NA |
| Barbados | BRB | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Malaysia | MYS | NA | NA | NA | NA | 46.1 | NA | NA | 46.1 | 45.5 | NA | NA | 43.9 | NA | 41.3 | NA | 41.0 | NA |
| St. Kitts and Nevis | KNA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| United Kingdom | GBR | NA | NA | NA | NA | 36.0 | 34.3 | 34.6 | 35.7 | 34.1 | 34.3 | 34.4 | 33.2 | 32.3 | 33.2 | 34.0 | 33.2 | NA |
We chose to analyze the Gini data through annual aggregates. However, we note that the absence of data for certain countries and time periods could potentially affect our results.
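As a minimal sketch of the annual aggregation, assuming the parsed table is held in a wide data frame with one column per year, the per-year mean can be taken over whichever countries report a value. The name `gini_wide` and the three-year excerpt are illustrative only (the values are copied from the rows shown above); the actual processing lives in indicators.R and may differ.

```r
# Illustrative excerpt of the wide Gini table (values from the rows above).
# `gini_wide` is a hypothetical name, not necessarily the object used in indicators.R.
gini_wide <- data.frame(
  country  = c("Brazil", "Barbados", "Malaysia", "United Kingdom"),
  iso3code = c("BRA", "BRB", "MYS", "GBR"),
  `2004`   = c(56.5, NA, 46.1, 36.0),
  `2005`   = c(56.3, NA, NA, 34.3),
  `2006`   = c(55.6, NA, NA, 34.6),
  check.names = FALSE
)

year_cols <- setdiff(names(gini_wide), c("country", "iso3code"))

# Mean per year over the countries that reported a value in that year.
annual_mean <- colMeans(gini_wide[year_cols], na.rm = TRUE)

# Number of countries contributing to each annual mean.
n_reporting <- colSums(!is.na(gini_wide[year_cols]))

gini_annual <- data.frame(
  years       = as.integer(year_cols),
  annual.mean = unname(annual_mean),
  n           = unname(n_reporting)
)
print(gini_annual)
```

Keeping the count of reporting countries next to each mean makes it visible how few observations may sit behind a given year.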
Our initial review shows that almost three-quarters (72%) of the values in this dataset are missing. Less than 5% of all values indicate a higher level of inequality, with an index above 50. The median index value for the Gini dataset is 35.7, while the mean value is 37.88.
The summary statistics and visual inspection of the histogram and boxplot show that data for this indicator is unimodal and skewed to the right.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 16.20 30.80 35.70 37.88 44.10 64.80 2641
| index | freq | prop |
|---|---|---|
| more inequal | 142 | 0.04 |
| more equal | 904 | 0.25 |
| n.a. value | 2641 | 0.72 |
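A frequency table of this shape can be sketched as follows. This assumes the cut-off is "strictly above 50", which is our reading of the text rather than something confirmed by the source code; the input vector reuses the Brazil and Malaysia rows from the excerpt above, so the counts differ from the full-dataset table.

```r
# Gini values for Brazil and Malaysia, 2000-2016, copied from the excerpt above.
brazil   <- c(NA, 58.4, 58.1, 57.6, 56.5, 56.3, 55.6, 54.9, 54.0,
              53.7, NA, 52.9, 52.6, 52.8, 51.5, 51.3, NA)
malaysia <- c(NA, NA, NA, NA, 46.1, NA, NA, 46.1, 45.5,
              NA, NA, 43.9, NA, 41.3, NA, 41.0, NA)
gini_values <- c(brazil, malaysia)

# Assumed rule: above 50 is "more inequal", otherwise "more equal".
index <- ifelse(is.na(gini_values), "n.a. value",
                ifelse(gini_values > 50, "more inequal", "more equal"))

# Fix the level order so the rows match the table in the text.
freq <- table(factor(index, levels = c("more inequal", "more equal", "n.a. value")))
prop <- round(prop.table(freq), 2)

gini_freq <- data.frame(index = names(freq),
                        freq  = as.vector(freq),
                        prop  = as.vector(prop))
print(gini_freq)
```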
We chose raw GDP to see if internet infrastructure contributes more to social mobility than a generalized economy. Below is an output of the data parsed and tidied from the World Bank:
| country | iso3code | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Vietnam | VNM | 6.787316 | 6.1928933 | 6.3208210 | 6.8990635 | 7.536411 | 7.547248 | 6.977955 | 7.129505 | 5.6617712 | 5.397898 | 6.423238 | 6.240303 | 5.247367 | 5.4218830 | 5.9836546 | 6.6792888 | 6.210812 |
| China | CHN | 8.491509 | 8.3399105 | 9.1306459 | 10.0356030 | 10.111223 | 11.395776 | 12.719479 | 14.231388 | 9.6542894 | 9.399813 | 10.636140 | 9.536443 | 7.856262 | 7.7576351 | 7.2976660 | 6.9002048 | 6.700000 |
| Finland | FIN | 5.634848 | 2.5807921 | 1.6803251 | 1.9939841 | 3.926057 | 2.779955 | 4.055197 | 5.184801 | 0.7206685 | -8.269037 | 2.992338 | 2.570818 | -1.426189 | -0.7580363 | -0.6317281 | 0.1350823 | 2.135382 |
| Israel | ISR | 8.169052 | 0.0252953 | 0.1618297 | 0.7675661 | 4.569715 | 4.133446 | 5.210821 | 5.773553 | 2.9883954 | 1.381607 | 5.223904 | 4.657831 | 1.942508 | 4.1111552 | 3.4101365 | 3.0379439 | 4.094336 |
| Afghanistan | AFG | NA | NA | NA | 8.4441632 | 1.055556 | 11.175270 | 5.554138 | 13.740205 | 3.6113684 | 21.020649 | 8.433291 | 6.113685 | 14.434741 | 3.9005749 | 2.6905219 | 1.3100404 | 2.366712 |
The annual GDP measurement was much more complete than the Gini index. This indicator contained only 310 missing values, accounting for 8.4% of all values. The summary statistics and visual inspection of the histogram and boxplot show that data for this indicator is unimodal and slightly skewed to the right.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -62.076 1.581 3.786 3.855 6.113 123.140 310
We used the adult literacy indicator as a socio-economic proxy measure for poverty. We are interested in examining the relationship between internet usage and literacy as well as literacy and GDP. Below is an output of the data parsed and tidied from the World Bank:
| country | iso3code | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Slovenia | SVN | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Korea, Dem. People’s Rep. | PRK | NA | NA | NA | NA | NA | NA | NA | NA | 99.99819 | NA | NA | NA | NA | NA | NA | NA | NA |
| Jamaica | JAM | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Hungary | HUN | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Nepal | NPL | NA | 48.60897 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 59.62725 | NA | NA | NA | NA | NA |
Like the Gini index, this measurement also contained a significant amount of missing values. Approximately 86% of values in this dataset were missing. Among the values reported, we found countries to have high literacy rates on average. You can see in the histogram and boxplot that this data is unimodal and skewed to the left.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 14.38 72.66 91.17 81.77 95.82 100.00 3163
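The skew direction reported for each indicator can be cross-checked against the summary statistics with a common rule of thumb: a mean above the median suggests right skew, a mean below it left skew. The sketch below uses only the medians and means printed in the three summaries; it is a heuristic, not a formal skewness test, and the data frame name `stats` is ours.

```r
# Medians and means copied from the three summary() outputs in this report.
stats <- data.frame(
  indicator = c("gini", "gdp", "literacy"),
  median    = c(35.70, 3.786, 91.17),
  mean      = c(37.88, 3.855, 81.77)
)

# Rule of thumb: mean > median suggests right skew, mean < median left skew.
stats$skew <- ifelse(stats$mean > stats$median, "right", "left")

# Gap between mean and median, as a rough sense of how pronounced the skew is.
stats$gap <- stats$mean - stats$median
print(stats)
```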
This indicator measures the number of individuals using the Internet as a percentage of country population.
While other internet indicators are available through the World Bank, this is the only dataset we had that was complete enough to be useful.
internet <- read.csv("internet/internet_over_time.csv")
plot(internet$annual.mean~internet$years)
model <- lm(internet$annual.mean~internet$years)
abline(model)
summary(model)
##
## Call:
## lm(formula = internet$annual.mean ~ internet$years)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.76034 -0.53410 0.02895 0.28827 1.04662
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.296e+03 6.009e+01 -88.14 <2e-16 ***
## internet$years 2.652e+00 2.992e-02 88.61 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6044 on 15 degrees of freedom
## Multiple R-squared: 0.9981, Adjusted R-squared: 0.998
## F-statistic: 7852 on 1 and 15 DF, p-value: < 2.2e-16
The share of the population using the internet around the world grows by about 2.65 percentage points every year. Furthermore, this model has an \(R^2\) value of .998, meaning that a linear time trend explains almost all of the variation in annual mean internet access rates over this period.
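Because the model is linear, the slope is an additive change per year (in percentage points), not a multiplicative factor. A short sketch using the printed coefficients illustrates this; the helper `predict_internet` is a hypothetical name, and since the printed coefficients are rounded to four significant figures the fitted values are approximate.

```r
# Coefficients as printed in the summary above (rounded, so predictions
# are only approximate).
intercept <- -5.296e+03
slope     <-  2.652e+00   # percentage points per year

# Hypothetical helper: fitted internet usage (% of population) for a year.
predict_internet <- function(year) intercept + slope * year

# The change between any two consecutive years equals the slope.
one_year_change <- predict_internet(2016) - predict_internet(2015)

# Approximate fitted values at the two ends of the sample.
fitted_ends <- predict_internet(c(2000, 2016))

print(one_year_change)
print(fitted_ends)
```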
First, we examined the relationship between our three socio-economic indicators and found weak correlations between the selected variables.
We also compared these values using the linear model:
lit.gdp.lm <- lm(annual.means$lit~annual.means$gdp)
summary(lit.gdp.lm)
##
## Call:
## lm(formula = annual.means$lit ~ annual.means$gdp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.4804 -2.2881 0.7013 3.3223 6.6514
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 88.4953 4.0640 21.775 9.17e-13 ***
## annual.means$gdp -2.0868 0.9881 -2.112 0.0519 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.932 on 15 degrees of freedom
## Multiple R-squared: 0.2292, Adjusted R-squared: 0.1778
## F-statistic: 4.46 on 1 and 15 DF, p-value: 0.05188
gdp.gini.lm <- lm(annual.means$gdp~annual.means$gini)
summary(gdp.gini.lm)
##
## Call:
## lm(formula = annual.means$gdp ~ annual.means$gini)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0060 -0.6210 -0.1796 0.4822 2.3418
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.36818 5.33915 0.818 0.426
## annual.means$gini -0.01355 0.13836 -0.098 0.923
##
## Residual standard error: 1.549 on 15 degrees of freedom
## Multiple R-squared: 0.0006391, Adjusted R-squared: -0.06598
## F-statistic: 0.009592 on 1 and 15 DF, p-value: 0.9233
gini.lit.lm <- lm(annual.means$gini~annual.means$lit)
summary(gini.lit.lm)
##
## Call:
## lm(formula = annual.means$gini ~ annual.means$lit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2142 -1.5687 -1.1067 -0.2172 5.9544
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.1521 8.4815 5.795 3.53e-05 ***
## annual.means$lit -0.1325 0.1051 -1.261 0.227
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.749 on 15 degrees of freedom
## Multiple R-squared: 0.09581, Adjusted R-squared: 0.03553
## F-statistic: 1.589 on 1 and 15 DF, p-value: 0.2267
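For a simple linear regression with one predictor, the Pearson correlation can be recovered from the output as the square root of the multiple R-squared, carrying the sign of the slope. The sketch below applies this to the three models above, using only the printed (rounded) values; the data frame name `models` is ours.

```r
# Slope estimates and Multiple R-squared values copied from the three summaries above.
models <- data.frame(
  model     = c("lit ~ gdp", "gdp ~ gini", "gini ~ lit"),
  slope     = c(-2.0868, -0.01355, -0.1325),
  r_squared = c(0.2292, 0.0006391, 0.09581)
)

# In simple linear regression, r = sign(slope) * sqrt(R^2).
models$r <- sign(models$slope) * sqrt(models$r_squared)
print(models)
```

All three correlations come out negative, and the largest in magnitude (literacy against GDP) is just under 0.5.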
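There is a moderate, downward-sloping relationship between the Gini and internet indicators (R^2 = 0.53, p < 0.001 in the model below): years with higher mean internet usage have a lower mean Gini index. This is consistent with the idea that nations with higher levels of inequality use the internet to a lesser extent, although a comparison of annual means cannot establish that at the country level.
plot(gini.annual$annual.mean~internet$annual.mean)
model <- lm(gini.annual$annual.mean~internet$annual.mean)
summary(model)
##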
## Call:
## lm(formula = gini.annual$annual.mean ~ internet$annual.mean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3747 -1.1959 -0.7847 0.9313 3.8762
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.79073 1.16037 36.877 3.91e-16 ***
## internet$annual.mean -0.15140 0.03716 -4.074 0.000997 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.992 on 15 degrees of freedom
## Multiple R-squared: 0.5253, Adjusted R-squared: 0.4937
## F-statistic: 16.6 on 1 and 15 DF, p-value: 0.000997
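The slope row of this output is the test of the first null hypothesis for the Gini index (H0: slope = 0). As a sketch of how the printed figures fit together, the t value is the estimate divided by its standard error, the two-sided p-value comes from the t distribution with the residual degrees of freedom, and with a single predictor the F-statistic is the square of the t value. Inputs are the rounded values printed above, so results agree only to the printed precision.

```r
# Slope estimate, standard error and residual df copied from the output above.
estimate  <- -0.15140
std_error <-  0.03716
df        <- 15

t_value <- estimate / std_error

# Two-sided p-value for H0: slope = 0.
p_value <- 2 * pt(-abs(t_value), df = df)

# With one predictor, the overall F-statistic equals t^2.
f_value <- t_value^2

print(c(t = t_value, p = p_value, F = f_value))
```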
We repeated the analysis for literacy versus internet usage:
plot(lit.annual$annual.mean~internet$annual.mean)
model <- lm(lit.annual$annual.mean~internet$annual.mean)
summary(model)
##
## Call:
## lm(formula = lit.annual$annual.mean ~ internet$annual.mean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.3002 -1.9428 0.2232 3.0566 7.3965
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.94858 2.85963 24.810 1.36e-13 ***
## internet$annual.mean 0.33531 0.09158 3.661 0.00231 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.909 on 15 degrees of freedom
## Multiple R-squared: 0.472, Adjusted R-squared: 0.4367
## F-statistic: 13.41 on 1 and 15 DF, p-value: 0.002315
Internet usage is moderately correlated with the Gini index as well as with literacy. The relationship is slightly weaker in the case of literacy, given the respective R^2 values (0.47 versus 0.53).
plot(internet$annual.mean ~ gdp.annual$annual.mean, xlab = "GDP per year", ylab='Internet Users per Year')
model <- lm(internet$annual.mean ~ gdp.annual$annual.mean)
abline(model)
summary(model)
##
## Call:
## lm(formula = internet$annual.mean ~ gdp.annual$annual.mean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.7329 -11.0752 0.8362 9.6916 18.6565
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.901 8.833 4.631 0.000327 ***
## gdp.annual$annual.mean -3.252 2.148 -1.514 0.150701
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.89 on 15 degrees of freedom
## Multiple R-squared: 0.1326, Adjusted R-squared: 0.0748
## F-statistic: 2.294 on 1 and 15 DF, p-value: 0.1507
Additionally, we find no statistically significant relationship between annual mean GDP and internet usage (R^2 = 0.13, p = 0.15).
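To put the three internet models side by side, the sketch below collects the R-squared and p-values printed in the outputs above and flags which slopes are significant. The 0.05 threshold is the conventional choice and an assumption on our part, and the data frame name `internet_models` is ours. For a simple regression, R-squared does not depend on which variable is treated as the response, so the three values are directly comparable.

```r
# R-squared and p-values copied from the three internet models above.
internet_models <- data.frame(
  model     = c("gini ~ internet", "lit ~ internet", "internet ~ gdp"),
  r_squared = c(0.5253, 0.472, 0.1326),
  p_value   = c(0.000997, 0.002315, 0.1507)
)

# Assumed significance threshold of 0.05.
internet_models$significant <- internet_models$p_value < 0.05

# Strongest association first.
ranked <- internet_models[order(-internet_models$r_squared), ]
print(ranked)
```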
In our annual-mean models, internet usage was more strongly associated with both literacy (R^2 = 0.47) and the Gini index (R^2 = 0.53) than GDP was (R^2 = 0.23 and less than 0.01, respectively). The association with internet usage is positive for literacy and negative for the Gini index. Sparsity in the data prevented further analysis by country, and a country-level analysis would have inherently selected for OECD countries (which maintain a wonderful dataset), whereas we were more interested in investigating developing nations. Further research will require a more comprehensive dataset. Additionally, we took time in this project to work on styling. Jemceach taught simplymathematics how to use knitr, and we both figured out how to use ggExtra to make a gorgeous multi-plot graphic.
Time-series data within the selected indicators contained many missing values across country and time variables.
With a more complete dataset we could compare these data points by country rather than by year. Furthermore, we could include other indicators (both economic and technical) and do a covariance analysis. However, without complete data neither task is feasible.