World Bank Service Indicators Project

Object

Research question

Using observational study methods, this report attempts to identify whether a country’s “Computer, communications and other technical services” (import and export) are affected by geographical area and/or prosperity, using data from the World Bank DataBank World Development Indicators for 2008 to 2017. Data for 2018 was incomplete, so the analysis uses the most recent complete range of years. The data contains both pre-defined groups (aggregates based on geographical area, political situation, and monetary status) and individual countries.

World Development Indicators (WDI) is the primary World Bank collection of development indicators, compiled from officially recognized international sources. It presents the most current and accurate global development data available, and includes national, regional and global estimates. [Note: Even though Global Development Finance (GDF) is no longer listed in the WDI database name, all external debt and financial flows data continue to be included in WDI. The GDF publication has been renamed International Debt Statistics (IDS), and has its own separate database, as well.]

The analysis focuses on two indicators: Computer, communications and other services (% of commercial service exports) and Computer, communications and other services (% of commercial service imports) [Dependent variable]. We will use WDI metadata to define geographical position (Qualitative) and a country’s financial status (Quantitative) [Independent variable], as well as commercial service import/export numbers and each country’s population. Countries with blank values for all 10 years have been removed from the dataset.

Data Source

Data can be found at: (https://databank.worldbank.org/source/world-development-indicators).

Defining Requirements

For inference or linear regression to be valid, specific assumptions must be met:

Inference

Sample must be random
Distribution must be nearly normal
Observations must be independent

Linear Regression

Relationship between independent and dependent variables must be linear
All points must be independent
The residuals must follow a normal distribution
Variances must be equal
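
The regression conditions listed above can be checked numerically as well as visually. The following is a minimal sketch on simulated data; the variables `x` and `y` are invented for illustration and are not taken from the WDI files. It fits a line with `lm()` and then inspects the residuals for normality and roughly constant variance.

```r
# Simulated data: hypothetical values, not WDI figures
set.seed(42)
x <- runif(100, min = 0, max = 80)       # stands in for an import percentage
y <- 5 + 0.6 * x + rnorm(100, sd = 4)    # linear signal plus normal noise

fit <- lm(y ~ x)
res <- residuals(fit)

# Normality of residuals: Shapiro-Wilk test
# (a large p-value gives no evidence against normality)
normality <- shapiro.test(res)

# Constant variance: compare residual spread in the lower and upper half of x
lower <- res[x <= median(x)]
upper <- res[x >  median(x)]
variance_ratio <- var(upper) / var(lower)

print(coef(fit))
print(normality$p.value)
print(variance_ratio)
```

A variance ratio far from 1, or a very small Shapiro-Wilk p-value, would suggest that the corresponding condition needs a closer look before relying on the fitted line.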

Objective Analysis

Data Preparation

library(dplyr)
library(tidyr)
library(ggplot2)
library(stringr)
library(wbstats)
library(RColorBrewer)
library(statsr)

# load data
# download the files from GitHub and save them in the working directory
# (note: method = "wininet" is only available on Windows)
download <- download.file('https://raw.githubusercontent.com/kelloggjohnd/Data606/master/Aggregates.csv', destfile = "Aggregates.csv", method = "wininet") 
download <- download.file('https://raw.githubusercontent.com/kelloggjohnd/Data606/master/Country.csv', destfile = "Country.csv", method = "wininet")
##download <- download.file('https://raw.githubusercontent.com/kelloggjohnd/DATA607/master/overview.csv', destfile = "metadata.csv", method = "wininet")

# manipulate the data into a data frame
agg_raw <- data.frame(read.csv(file = "Aggregates.csv", header = TRUE, sep = ","))
ctry_raw <- data.frame(read.csv(file = "Country.csv", header = TRUE, sep = ","))

names(agg_raw)<- c("Name","Country.Code","Series.Name","Series.Code","YR2008","YR2009","YR2010","YR2011","YR2012","YR2013","YR2014","YR2015","YR2016","YR2017")
names(ctry_raw)<- c("Name","Country.Code","Series.Name","Series.Code","YR2008","YR2009","YR2010","YR2011","YR2012","YR2013","YR2014","YR2015","YR2016","YR2017")

agg_raw <- agg_raw[,c('Name','Country.Code','Series.Code','YR2008','YR2009','YR2010','YR2011','YR2012','YR2013','YR2014','YR2015','YR2016','YR2017')]
ctry_raw <- ctry_raw[,c('Name','Country.Code','Series.Code','YR2008','YR2009','YR2010','YR2011','YR2012','YR2013','YR2014','YR2015','YR2016','YR2017')]

# Pull from WB API 
wbctry_raw <- wbcountries(lang = "en")

Since the data for this report came from multiple locations and methods, an extensive scrubbing process was required. The steps, which mostly use the tidyr and dplyr packages, are shown below. Each section has comments describing the process that took place.
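
As a small illustration of the kind of scrubbing involved, the sketch below uses base R on an invented two-row extract (the country names and values are hypothetical). It shows the two core steps applied in the preparation code: turning the ".." placeholder into `NA`, and averaging the yearly columns while ignoring missing years.

```r
# Hypothetical two-row extract in the shape of the downloaded CSV files
raw <- data.frame(
  Name   = c("Country A", "Country B"),
  YR2008 = c("12.5", ".."),
  YR2009 = c("..", "30.1"),
  stringsAsFactors = FALSE
)

year_cols <- grep("^YR", names(raw), value = TRUE)

# Replace the ".." placeholder with NA, then convert the year columns to numeric
raw[year_cols] <- lapply(raw[year_cols], function(col) {
  col[col == ".."] <- NA
  as.numeric(col)
})

# Row means that ignore missing years, in the spirit of the xMEAN column
raw$xMEAN <- rowMeans(raw[year_cols], na.rm = TRUE)
print(raw)
```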

# Mutating the Aggregate dataframe for later processing
agg_raw <- agg_raw %>% 
  mutate_all(na_if,"..")%>%
  mutate(series = ifelse (Series.Code == "BX.GSR.CMCP.ZS", "Export", "Import"))%>%
  select(-Series.Code)%>%
  mutate_if(is.factor, as.character)%>%
  mutate(Name = as.factor(Name))%>%
  mutate(series = as.factor(series))%>%
  mutate(Code = as.factor(Country.Code))%>%
  select(-Country.Code)%>%
  mutate_if(is.character,as.numeric)

# Mutating the Country dataframe for later processing
ctry_raw <- ctry_raw %>% 
  mutate_all(na_if,"..")%>%
  mutate(series = ifelse (Series.Code == "BX.GSR.CMCP.ZS", "Export", "Import"))%>%
  select(-Series.Code)%>%
   mutate_if(is.factor, as.character)%>%
    mutate(Name = as.factor(Name))%>%
    mutate(series = as.factor(series))%>%
    mutate(Code = as.factor(Country.Code))%>%
    select(-Country.Code)%>%
    mutate_if(is.character,as.numeric)

# Preparing the Metadata file from the API call in the Setup chunk

meta_data <-
wbctry_raw %>%
select (country, iso3c, region, incomeID, income)%>%
  rename (Code = iso3c)

# Separating the Aggregates from the Countries
Agg_data <- meta_data %>%
  filter (income == "Aggregates") %>%
  select (country, Code)
  
country_data <- meta_data %>%
  filter (income != "Aggregates")

# removing metadata from bottom of DF
Agg_data_Import_export <- agg_raw [1:94,]

# Adding the MEAN of each datapoint
Agg_data_Import_export<-mutate(Agg_data_Import_export, xMEAN = rowMeans(select(Agg_data_Import_export, starts_with("YR")), na.rm = TRUE))

# Separating the export from the import and getting DF ready for Tidy process
Agg_export<- Agg_data_Import_export %>% 
  filter (series == "Export")%>%
  select(Name, Code, series, everything())%>%
  select(-series)%>% #ease of code than writing all the columns in the select statement
  select(-xMEAN)%>%
  filter (Name != "Not classified") 

Agg_import <- Agg_data_Import_export %>% 
  filter (series == "Import")%>%
  select(Name, Code, series, everything())%>%
  select(-series) %>% 
  select(-xMEAN) %>%
  filter (Name != "Not classified")

# Separating the Mean values into their own dataframe
Agg_export_mean <-
  Agg_data_Import_export %>% 
  filter (series == "Export")%>%
  select(Name, Code,xMEAN)%>%
  filter (Name != "Not classified")

Agg_import_mean<-
  Agg_data_Import_export %>% 
  filter (series == "Import")%>%
  select(Name, Code,xMEAN)%>%
  filter (Name != "Not classified")

# Tidy process on the Aggregate dataframes 
Agg_export <-
  Agg_export%>%
  gather("Year", "Totals", YR2008:YR2017)

Agg_import <-
  Agg_import%>%
  gather("Year", "Totals", YR2008:YR2017)

# removing metadata from bottom of DF
ctry_raw <- ctry_raw [1:434,]
# Adding the MEAN of each datapoint
ctry_raw<-mutate(ctry_raw, xMEAN = rowMeans(select(ctry_raw, starts_with("YR")), na.rm = TRUE))
# Removing the blank name values
ctry_data<- ctry_raw %>% filter(xMEAN != "NaN")

# Separating the Export from Import and clearing out Null values from MEAN
ctry_export <- ctry_raw %>% 
  filter (series == "Export")%>%
  filter(xMEAN != "NaN")

ctry_import <- ctry_raw %>% 
  filter (series == "Import")%>%
  filter(xMEAN != "NaN")

# Joining the DF with the Metadata DF
ctry_export <- full_join(ctry_export,meta_data, by = "Code") %>% 
    filter (!is.na(Name))%>%
    select (-country, -incomeID)

ctry_import <- full_join(ctry_import,meta_data, by = "Code") %>% 
    filter (!is.na(Name))%>%
    select (-country, -incomeID)

# Separating the Country dataframe into the individual sections of Income (Low, Lower middle, Upper middle, High)

ctry_ex_low <- ctry_export %>%
  filter(income == "Low income")%>%
  select('Name','region','YR2008','YR2009','YR2010','YR2011','YR2012','YR2013','YR2014','YR2015','YR2016','YR2017')%>%
  gather("Year", "Totals", YR2008:YR2017)%>%
  filter(!is.na (Totals))

ctry_ex_low_mean <- ctry_export %>%
  filter(income == "Low income")%>%
  select('Name','region','xMEAN')%>%
  filter(!is.na (xMEAN))

ctry_ex_lmid <- ctry_export %>%
  filter(income == "Lower middle income")%>%
  select('Name','region','YR2008','YR2009','YR2010','YR2011','YR2012','YR2013','YR2014','YR2015','YR2016','YR2017')%>%
  gather("Year", "Totals", YR2008:YR2017)%>%
  filter(!is.na (Totals))

ctry_ex_lmid_mean <- ctry_export %>%
  filter(income == "Lower middle income")%>%
  select('Name','region','xMEAN')%>%
  filter(!is.na (xMEAN))

ctry_ex_hmid <- ctry_export %>%
  filter(income == "Upper middle income")%>%
  select('Name','region','YR2008','YR2009','YR2010','YR2011','YR2012','YR2013','YR2014','YR2015','YR2016','YR2017')%>%
  gather("Year", "Totals", YR2008:YR2017)%>%
  filter(!is.na (Totals))

ctry_ex_hmid_mean <- ctry_export %>%
  filter(income == "Upper middle income")%>%
  select('Name','region','xMEAN')%>%
  filter(!is.na (xMEAN))

ctry_ex_high <- ctry_export %>%
  filter(income == "High income")%>%
  select('Name','region','YR2008','YR2009','YR2010','YR2011','YR2012','YR2013','YR2014','YR2015','YR2016','YR2017')%>%
  gather("Year", "Totals", YR2008:YR2017)%>%
  filter(!is.na (Totals))

ctry_ex_high_mean <- ctry_export %>%
  filter(income == "High income")%>%
  select('Name','region','xMEAN')%>%
  filter(!is.na (xMEAN))

ctry_imp_low <- ctry_import %>%
  filter(income == "Low income")%>%
  select('Name','region','YR2008','YR2009','YR2010','YR2011','YR2012','YR2013','YR2014','YR2015','YR2016','YR2017')%>%
  gather("Year", "Totals", YR2008:YR2017)%>%
  filter(!is.na (Totals))

ctry_imp_low_mean <- ctry_import %>%
  filter(income == "Low income")%>%
  select('Name','region','xMEAN')%>%
  filter(!is.na (xMEAN))

ctry_imp_lmid <- ctry_import %>%
  filter(income == "Lower middle income")%>%
  select('Name','region','YR2008','YR2009','YR2010','YR2011','YR2012','YR2013','YR2014','YR2015','YR2016','YR2017')%>%
  gather("Year", "Totals", YR2008:YR2017)%>%
  filter(!is.na (Totals))

ctry_imp_lmid_mean <- ctry_import %>%
  filter(income == "Lower middle income")%>%
  select('Name','region','xMEAN')%>%
  filter(!is.na (xMEAN))

ctry_imp_hmid <- ctry_import %>%
  filter(income == "Upper middle income")%>%
  select('Name','region','YR2008','YR2009','YR2010','YR2011','YR2012','YR2013','YR2014','YR2015','YR2016','YR2017')%>%
  gather("Year", "Totals", YR2008:YR2017)%>%
  filter(!is.na (Totals))

ctry_imp_hmid_mean <- ctry_import %>%
  filter(income == "Upper middle income")%>%
  select('Name','region','xMEAN')%>%
  filter(!is.na (xMEAN))

ctry_imp_high <- ctry_import %>%
  filter(income == "High income")%>%
  select('Name','region','YR2008','YR2009','YR2010','YR2011','YR2012','YR2013','YR2014','YR2015','YR2016','YR2017')%>%
  gather("Year", "Totals", YR2008:YR2017)%>%
  filter(!is.na (Totals))

ctry_imp_high_mean <- ctry_import %>%
  filter(income == "High income")%>%
  select('Name','region','xMEAN')%>%
  filter(!is.na (xMEAN))

# Country Export numbers gathered into Tidy format
ctry_ex_all <- ctry_export %>%
  select('Name','region', 'Code','income','YR2008','YR2009','YR2010','YR2011','YR2012','YR2013','YR2014','YR2015','YR2016','YR2017')%>%
  gather('Year', 'Export_Totals', YR2008:YR2017)

ctry_imp_all <- ctry_import %>%
  select('Name','region', 'Code','income','YR2008','YR2009','YR2010','YR2011','YR2012','YR2013','YR2014','YR2015','YR2016','YR2017')%>%
  gather('Year', 'Import_Totals', YR2008:YR2017)

# Country Import numbers gathered into Tidy format
ctry_all <- ctry_import %>%
  select('Name','region', 'Code','income','YR2008','YR2009','YR2010','YR2011','YR2012','YR2013','YR2014','YR2015','YR2016','YR2017')%>%
  gather('Year', 'Import_Totals', YR2008:YR2017)

# Country numbers bound together into one Dataframe
ctry_all <- cbind(ctry_all, Export_Totals = ctry_ex_all$Export_Totals)

# Clean up to remove the columns with NA in both Export/Import 
ctry_all <- ctry_all %>%
  filter (!is.na(Export_Totals) & !is.na(Import_Totals))%>%
  arrange(Name)%>%
  mutate_if(is.character, str_trim)%>%
  arrange(Code, Year)%>%
  filter(Name != "Uzbekistan")

# Separating out the MEANs
ctry_export_mean <- 
  ctry_export %>%
    select('Name','region', 'income', 'xMEAN')%>%
      rename(export_mean = xMEAN)

ctry_import_mean <- 
  ctry_import %>%
    select('Name','region', 'income', 'xMEAN')%>%
      rename(import_mean = xMEAN)

ctry_all_mean <- cbind(ctry_import_mean, export_mean = ctry_export_mean$export_mean)
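
The `cbind()` calls above rely on the import and export data frames having identical row order. A more defensive option, sketched here on invented values rather than as the method used in this report, is to pair rows on a key column with base R's `merge()`. The data frames and numbers below are hypothetical.

```r
# Hypothetical means; the rows are deliberately in a different order
import_mean <- data.frame(Name = c("A", "B", "C"), import_mean = c(10, 20, 30))
export_mean <- data.frame(Name = c("C", "A", "B"), export_mean = c(33, 11, 22))

# cbind() pairs rows by position, so values can land on the wrong country
by_position <- cbind(import_mean, export_mean = export_mean$export_mean)

# merge() pairs rows by the shared key, regardless of row order
by_key <- merge(import_mean, export_mean, by = "Name")

print(by_position)
print(by_key)
```

If both inputs are already sorted identically, the two approaches give the same result; the keyed version simply does not depend on that being true.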

Relevant summary statistics

Within this report, we will be analyzing four sets of data: the percentage of commercial service exports and imports accounted for by Computer, communications and other services by country, the same percentages by aggregate, commercial service import and export by country (in USD), and metadata for both country and aggregate.

The tables below give insight into the main focus of the report: the export/import dataframes holding the percentage of commercial services accounted for by Computer, communications and other services.

Country Import (General)

summary(ctry_import_mean)

##                   Name        region             income         
##  Afghanistan        :  1   Length:188         Length:188        
##  Albania            :  1   Class :character   Class :character  
##  Algeria            :  1   Mode  :character   Mode  :character  
##  Angola             :  1                                        
##  Antigua and Barbuda:  1                                        
##  Argentina          :  1                                        
##  (Other)            :182                                        
##   import_mean    
##  Min.   : 2.913  
##  1st Qu.:23.692  
##  Median :35.093  
##  Mean   :35.208  
##  3rd Qu.:45.289  
##  Max.   :77.343  
##

ggplot(ctry_import_mean, aes(x=import_mean))+ geom_density() +
  geom_histogram(aes(x=import_mean, y= ..density..),
       binwidth = 3, fill = "gray", color = "black")+
        geom_density(alpha=.2, fill="Red")

The general import percentage figures by country, and the resulting graph, appear to follow a normal distribution, and the observations are independent.

Country export (General)

summary(ctry_export_mean)

##                   Name        region             income         
##  Afghanistan        :  1   Length:188         Length:188        
##  Albania            :  1   Class :character   Class :character  
##  Algeria            :  1   Mode  :character   Mode  :character  
##  Angola             :  1                                        
##  Antigua and Barbuda:  1                                        
##  Argentina          :  1                                        
##  (Other)            :182                                        
##   export_mean    
##  Min.   : 1.924  
##  1st Qu.:18.958  
##  Median :30.661  
##  Mean   :33.519  
##  3rd Qu.:45.554  
##  Max.   :91.718  
##

ggplot(ctry_export_mean, aes(x=export_mean))+ geom_density() +
  geom_histogram(aes(x=export_mean, y= ..density..),
       binwidth = 3, fill = "gray", color = "black")+
        geom_density(alpha=.2, fill="Red")

The general export percentage figures by country and the resulting graph show that the observations are independent; however, the distribution is right-skewed. This may be overcome later by pulling a random sample.

ggplot(ctry_import_mean, aes(sample = import_mean)) +
  stat_qq() +
  stat_qq_line()+
  labs(title = "Country Import Mean")

ggplot(ctry_export_mean, aes(sample = export_mean)) +
  stat_qq() +
  stat_qq_line()+
  labs(title = "Country Export Mean")

Q/Q (quantile-quantile) plots show a definite curve in both means toward the lower end. We will continue to investigate, as the lower end curves toward 0; a percentage in this data cannot go below 0%.
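
A Q/Q plot can be supplemented with a numeric check. The sketch below uses simulated percentages bounded at 0 (invented data, not the country means) to show two base R checks: a sample skewness statistic computed by hand and the Shapiro-Wilk test from `shapiro.test()`.

```r
set.seed(1)
# Simulated right-skewed percentages that cannot fall below 0
pct <- rgamma(188, shape = 2, scale = 15)

# Sample skewness: positive values indicate a longer right tail
skewness <- mean((pct - mean(pct))^3) / sd(pct)^3

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(pct)

print(skewness)
print(sw$p.value)
```

Applied to `import_mean` or `export_mean`, the same two numbers would give a quick indication of whether the curvature seen in the plots is large enough to matter.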

Aggregate import (General)

The Aggregates are defined as generalized geographical areas as well as groupings by income status. A single country will be in at least 2 aggregates and possibly more as some geographical areas are also broken up by income.

Example:

East Asia & Pacific
East Asia & Pacific (excluding high income)
East Asia & Pacific (IDA & IBRD countries)

summary(Agg_import_mean)

##                                           Name         Code   
##  Arab World                                 : 1   ARB    : 1  
##  Caribbean small states                     : 1   CEB    : 1  
##  Central Europe and the Baltics             : 1   CSS    : 1  
##  Early-demographic dividend                 : 1   EAP    : 1  
##  East Asia & Pacific                        : 1   EAR    : 1  
##  East Asia & Pacific (excluding high income): 1   EAS    : 1  
##  (Other)                                    :40   (Other):40  
##      xMEAN      
##  Min.   :26.04  
##  1st Qu.:34.33  
##  Median :37.67  
##  Mean   :38.34  
##  3rd Qu.:42.35  
##  Max.   :50.95  
##

ggplot(Agg_import_mean, aes(x=xMEAN))+ geom_density() +
  geom_histogram(aes(x=xMEAN, y= ..density..),
       binwidth = 3, fill = "gray", color = "black")+
        geom_density(alpha=.2, fill="Red")

The aggregate import means follow a roughly normal distribution with a slight left skew. Since the aggregates are a full mix of geographical and income groups, we cannot rely on the histogram.

Aggregate export (General)

summary(Agg_export_mean)

##                                           Name         Code   
##  Arab World                                 : 1   ARB    : 1  
##  Caribbean small states                     : 1   CEB    : 1  
##  Central Europe and the Baltics             : 1   CSS    : 1  
##  Early-demographic dividend                 : 1   EAP    : 1  
##  East Asia & Pacific                        : 1   EAR    : 1  
##  East Asia & Pacific (excluding high income): 1   EAS    : 1  
##  (Other)                                    :40   (Other):40  
##      xMEAN      
##  Min.   :13.20  
##  1st Qu.:32.17  
##  Median :38.82  
##  Mean   :38.32  
##  3rd Qu.:44.55  
##  Max.   :66.01  
##

ggplot(Agg_export_mean, aes(x=xMEAN))+ geom_density() +
  geom_histogram(aes(x=xMEAN, y= ..density..),
       binwidth = 3, fill = "gray", color = "black")+
        geom_density(alpha=.2, fill="Red")

The aggregate export means do not follow a normal distribution. The same problem as above holds: this dataframe is a full mix of geographical and income groups, so we cannot rely on the histogram. We will need to evaluate whether it still meets the requirements for inference or regression after we break the dataframe down further.

Aggregate Mean

Both the import and export means are filtered into separate dataframes. The Q/Q graphs will show whether the means follow a normal distribution.

ggplot(Agg_import_mean, aes(sample = xMEAN)) +
  stat_qq() +
  stat_qq_line()+
  labs(title = "Aggregate Import Mean")

ggplot(Agg_export_mean, aes(sample = xMEAN)) +
  stat_qq() +
  stat_qq_line()+
  labs(title = "Aggregate Export Mean")

The data stays fairly close to the Q/Q reference line. There appear to be heavy outliers on the export graph.

ggplot(data=Agg_import_mean, 
       aes(x = Name, y=xMEAN, fill=Name))+
        geom_bar(stat = "identity")+
        scale_fill_hue(l=50)+
        ggtitle(label = "Import Rate across Aggregates")+
        theme_minimal()+
        theme(legend.position = "none")+
        theme(axis.text.x = element_text(angle = 90, hjust = 1))+
        xlab("Aggregate Name")+ylab("Average Rate")

ggplot(data=Agg_export_mean, 
       aes(x = Name, y=xMEAN, fill=Name))+
        geom_bar(stat = "identity")+
        scale_fill_hue(l=50)+
        ggtitle(label = "Export Rate across Aggregates")+
        theme_minimal()+
        theme(legend.position = "none")+
        theme(axis.text.x = element_text(angle = 90, hjust = 1))+
        xlab("Aggregate Name")+ylab("Average Rate")

South Asia immediately presents as an outlier in the export data; we will need to investigate further. Additionally, lower middle income countries have a larger percentage of exports than the other income aggregates.

Geographical analysis

Location Analysis

The first section of the research question we will tackle is location: does a country’s geographical location indicate higher or lower Computer, communications and other services (% of commercial service exports) and/or Computer, communications and other services (% of commercial service imports)?

To test this, we should be able to take any country and compare its import and export numbers to those of the other countries in the same aggregate area. If they are similar, the theory is supported.

\[H_0: \beta_1 - A_1 = 0\] \[\text{No difference between countries of the same aggregate}\]

\[H_A: \beta_1 - A_1 \neq 0\] \[\text{A difference exists between countries of the same aggregate}\]

The Regions are:
* South Asia
* Europe & Central Asia
* Middle East & North Africa
* Sub-Saharan Africa
* Latin America & Caribbean
* East Asia & Pacific
* North America

hist(ctry_all$Import_Totals)

hist(ctry_all$Export_Totals)

Looking at the data for all the countries broken down by import and export, there is a definite right skew in both, with export being far more skewed. We don’t want to rely only on this simple histogram to reject or fail to reject the hypothesis.

# Scatterplot with Color, Regression line and Confidence interval
ggplot(ctry_all, aes(x=Import_Totals, y=Export_Totals, color=region)) +
    geom_point(size=1.5) +
  geom_smooth(method=lm , color="red", se=TRUE)+
        ggtitle(label = "Import/Export by Region")

# Scatterplot Grid comparing Import/Export Totals by Year
ggplot(ctry_all, aes(x=Import_Totals, y=Export_Totals))+
  geom_point() +
  facet_grid(Year ~ region)+
  theme(text = element_text(size=12),
        axis.text.x = element_text(angle=90))+ 
  geom_smooth(method=lm , color="red", se=FALSE)+
        ggtitle(label = "Import/Export by Year")

# Boxplot of Import/Export by Region
ggplot(ctry_all, aes(x=Import_Totals, y=Export_Totals, fill=region)) + 
    geom_boxplot()+
        ggtitle(label = "Import/Export by Region")+
  theme(legend.position="bottom")

Plotting import against export by region, we do see a linear progression with many outliers. Breaking each geographical area down individually, we start to see an issue with South Asia. We will need to break each geographical area down further to analyze it.

South Asia

south_asia <- ctry_all %>%
  filter(region == "South Asia")

hist(south_asia$Import_Totals)

hist(south_asia$Export_Totals)

ggplot(south_asia, aes(x=Import_Totals, y=Export_Totals, color=Name, shape=income)) +
    geom_point(size=3) +
  theme(legend.position="bottom")+
  scale_fill_brewer(palette="Set3")

##+geom_smooth(method=lm , color="red", se=FALSE)

ggplot(south_asia, aes(x=Import_Totals, y=Export_Totals, fill=Name)) + 
    geom_boxplot()+
  theme(legend.position="bottom")

Middle East & North Africa

ME_NAFR <- ctry_all %>%
  filter(region == "Middle East & North Africa")

hist(ME_NAFR$Import_Totals)

hist(ME_NAFR$Export_Totals)

ggplot(ME_NAFR, aes(x=Import_Totals, y=Export_Totals, color=Name, shape=income)) +
    geom_point(size=3) +
  theme(legend.position="bottom")+
  scale_fill_brewer(palette="Set3")

ggplot(ME_NAFR, aes(x=Import_Totals, y=Export_Totals, fill=Name)) + 
    geom_boxplot()+
  theme(legend.position="bottom")

Europe & Central Asia

Eur_CentAsia <- ctry_all %>%
  filter(region == "Europe & Central Asia")

hist(Eur_CentAsia$Import_Totals)

hist(Eur_CentAsia$Export_Totals)

ggplot(Eur_CentAsia, aes(x=Import_Totals, y=Export_Totals, color=Name, shape=income)) +
    geom_point(size=3) +
theme(legend.position="bottom")+
  scale_fill_brewer(palette="Set3")

ggplot(Eur_CentAsia, aes(x=Import_Totals, y=Export_Totals, fill=Name)) + 
    geom_boxplot()+
  theme(legend.position="bottom")

Latin America & Caribbean

LatAm_Carib <- ctry_all %>%
  filter(region == "Latin America & Caribbean")

hist(LatAm_Carib$Import_Totals)

hist(LatAm_Carib$Export_Totals)

ggplot(LatAm_Carib, aes(x=Import_Totals, y=Export_Totals, color=Name, shape=income)) +
    geom_point(size=3) +
  scale_fill_brewer(palette="Set3")

ggplot(LatAm_Carib, aes(x=Import_Totals, y=Export_Totals, fill=Name)) + 
    geom_boxplot()+
  theme(legend.position="bottom")

Sub-Saharan Africa

Sub_Africa <- ctry_all %>%
  filter(region == "Sub-Saharan Africa")%>%
  filter (Name != "Comoros")

hist(Sub_Africa$Import_Totals)

hist(Sub_Africa$Export_Totals)

ggplot(Sub_Africa, aes(x=Import_Totals, y=Export_Totals, color=Name, shape=income)) +
    geom_point(size=3) +
  scale_fill_brewer(palette="Set3")

ggplot(Sub_Africa, aes(x=Import_Totals, y=Export_Totals, fill=Name)) + 
    geom_boxplot()+
  theme(legend.position="bottom")

East Asia & Pacific

eAsia_Pacific <- ctry_all %>%
  filter(region == "East Asia & Pacific")

hist(eAsia_Pacific$Import_Totals)

hist(eAsia_Pacific$Export_Totals)

ggplot(eAsia_Pacific, aes(x=Import_Totals, y=Export_Totals, color=Name, shape=income)) +
    geom_point(size=3) +
  scale_fill_brewer(palette="Set3")

ggplot(eAsia_Pacific, aes(x=Import_Totals, y=Export_Totals, fill=Name)) + 
    geom_boxplot()+
  theme(legend.position="bottom")

North America

north_america <- ctry_all %>%
  filter(region == "North America")

hist(north_america$Import_Totals, breaks = 25)

hist(north_america$Export_Totals, breaks = 25)

ggplot(north_america, aes(x=Import_Totals, y=Export_Totals, color=Name, shape=income)) +
    geom_point(size=3) +
  scale_fill_brewer(palette="Set3")

ggplot(north_america, aes(x=Import_Totals, y=Export_Totals, fill=Name)) + 
    geom_boxplot()+
  theme(legend.position="bottom")

With this data in hand, there is no need for further analysis. We can reject the null hypothesis: the aggregate area a country is in does not have an effect on its import and export percentages.

Proofs:

Import figures seem to follow a nearly normal curve in every region except South Asia. If the null hypothesis were correct, all aggregates would have similar import graphs.
The export figures are always skewed to the right. Once again, South Asia breaks from the similar curve of the other aggregates.
The scatter plots and box plots for each aggregate show clearly separate groupings of the countries.
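
The conclusion above rests on visual inspection. A formal check is also possible; the sketch below is one option, not the method used in this report. It applies the Kruskal-Wallis rank-sum test (`kruskal.test()` in base R), which does not assume normality, to an invented region of three countries with ten yearly percentages each. The country names and values are hypothetical.

```r
set.seed(7)
# Hypothetical region with three countries and ten yearly percentages each
region <- data.frame(
  Name   = factor(rep(c("Country A", "Country B", "Country C"), each = 10)),
  Totals = c(rnorm(10, mean = 20, sd = 3),
             rnorm(10, mean = 45, sd = 3),
             rnorm(10, mean = 70, sd = 3))
)

# Kruskal-Wallis rank-sum test:
# H0 is that all countries share the same distribution of percentages
kw <- kruskal.test(Totals ~ Name, data = region)
print(kw$p.value)
```

A small p-value would support rejecting the null hypothesis of no difference between countries in the same aggregate. Note that yearly values from one country are not independent of each other, so the result of such a test should be read with caution.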

Income

Income Analysis

The second section of the research question we will tackle is income-based analysis: does a country’s general income indicate higher or lower Computer, communications and other services (% of commercial service exports) and/or Computer, communications and other services (% of commercial service imports)?

The same Null hypothesis is used.

\[H_0: \beta_1 - A_1 = 0\] \[\text{No difference between countries of the same aggregate}\]

\[H_A: \beta_1 - A_1 \neq 0\] \[\text{A difference exists between countries of the same aggregate}\]

The income groups are:
* Low income
* Lower middle income
* Upper middle income
* High income

Using the Metadata, we first need to see the comparison between the aggregates.

meta_data %>%
  filter (income != "Aggregates")%>%
  ggplot() +
    geom_bar(aes(x = income, fill = income), 
             position = "dodge", stat = "count")+
        ggtitle(label = "Income groups")

High income countries make up a larger share of the dataset than any other income group.

Breaking apart and tracking the import and export by year and income shows steady growth between the income groups.

ctry_ex_low %>%
    ggplot() + geom_bar(aes(y = Totals, x = Year, fill = Year),
      stat="identity")+
    ylim(.5, 2700)+
      theme(legend.position = "none")+
        ggtitle(label = "Low income export figures")

## Warning: Removed 10 rows containing missing values (geom_bar).

ctry_ex_lmid %>%
    ggplot() + geom_bar(aes(y = Totals, x = Year, fill = Year),
      stat="identity")+
    ylim(.5, 2700)+
      theme(legend.position = "none") +
        ggtitle(label = "Lower middle income export figures")

## Warning: Removed 10 rows containing missing values (geom_bar).

ctry_ex_hmid  %>%
    ggplot() + geom_bar(aes(y = Totals, x = Year, fill = Year),
      stat="identity")+
    ylim(.5, 2700)+
      theme(legend.position = "none") +
        ggtitle(label = "Upper middle income export figures")

## Warning: Removed 1 rows containing missing values (position_stack).

## Warning: Removed 10 rows containing missing values (geom_bar).

ctry_ex_high  %>%
    ggplot() + geom_bar(aes(y = Totals, x = Year, fill = Year),
      stat="identity")+
    ylim(.5, 2700)+
      theme(legend.position = "none") +
        ggtitle(label = "High income export figures")

## Warning: Removed 10 rows containing missing values (geom_bar).

ctry_imp_low %>%
    ggplot() + geom_bar(aes(y = Totals, x = Year, fill = Year),
      stat="identity")+
      ylim(.5, 2700)+
      theme(legend.position = "none")+
        ggtitle(label = "Low income Import figures")+ 
        theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))

## Warning: Removed 10 rows containing missing values (geom_bar).

ctry_imp_lmid %>%
  ggplot() + geom_bar(aes(y = Totals, x = Year, fill = Year),
      stat="identity")+
        ylim(.5, 2700)+
        theme(legend.position = "none")+
        ggtitle(label = "Lower middle income Import figures")+ 
        theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))

## Warning: Removed 10 rows containing missing values (geom_bar).

ctry_imp_hmid  %>%
  ggplot() + geom_bar(aes(y = Totals, x = Year, fill = Year),
      stat="identity")+
        ylim(.5, 2700)+    
        theme(legend.position = "none")+
        ggtitle(label = "Upper middle income Import figures")+ 
        theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))

## Warning: Removed 10 rows containing missing values (geom_bar).

ctry_imp_high  %>%
  ggplot() + geom_bar(aes(y = Totals, x = Year, fill = Year),
      stat="identity")+
        ylim(.5, 2700)+    
        theme(legend.position = "none")+
        ggtitle(label = "High income Import figures")+ 
        theme(axis.text.x = element_text(angle = 60, hjust = 1, vjust = 1))

## Warning: Removed 10 rows containing missing values (geom_bar).

These charts show the marked difference in the rates by each aggregate. As income grows, so does the amount of activity the aggregate is performing. Logically this makes sense: the more money a country has, the more it will be able to afford in purchases and sales.
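
Because `geom_bar(stat = "identity")` stacks one bar segment per country, the bar heights in the charts above are sums of country percentages, so a group with more countries will show taller bars. One way to separate the two effects, sketched here with invented numbers, is to compare the per-group sum with the per-group mean using `aggregate()`.

```r
# Hypothetical percentages: the "High income" group simply has more countries
toy <- data.frame(
  income = c(rep("High income", 4), rep("Low income", 2)),
  Year   = "YR2008",
  Totals = c(30, 32, 28, 30, 31, 29)
)

group_sum  <- aggregate(Totals ~ income + Year, data = toy, FUN = sum)
group_mean <- aggregate(Totals ~ income + Year, data = toy, FUN = mean)

print(group_sum)   # sums differ because the group sizes differ
print(group_mean)  # means are identical in this toy example
```

Plotting the group mean instead of the stacked sum would show whether the typical country in a richer group really has a higher percentage.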

Combining Location and population

The logic of comparing, for example, all “High” income countries to each other has an inherent flaw. We would be comparing Liechtenstein (population ~ 35,727) with Germany (population ~ 82,657,002). Both are in the same aggregate for both location and income. To even out the comparison, the population figures for each country were pulled and the amount (in USD) per person was calculated. This per-person amount is then divided by the known percentage from our research question. The resulting number should, in theory, level the analysis and even out the extreme population variance.
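
The arithmetic can be illustrated with a worked example. In this sketch the populations are the figures quoted above, while the service import values and the indicator percentages are hypothetical placeholders, not World Bank data. The column names mirror those used in the code that follows.

```r
# Populations are the figures quoted above; the service import values (USD)
# and percentages are hypothetical placeholders, not World Bank data
toy <- data.frame(
  Name          = c("Liechtenstein", "Germany"),
  population    = c(35727, 82657002),
  import        = c(1.0e9, 3.0e11),  # hypothetical commercial service imports, USD
  Import_Totals = c(40, 40)          # hypothetical indicator percentage
)

# USD of service imports per person
toy$pop_import <- toy$import / toy$population

# Per-person amount divided by the indicator percentage
toy$pop_import_per <- toy$pop_import / toy$Import_Totals

print(toy)
```

With these placeholder inputs the smaller country ends up with the larger per-person figure, which is the kind of effect the normalization is meant to expose.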

population <- wb(indicator = "SP.POP.TOTL", startdate = 2008, enddate = 2017)%>%
  select (date, country, iso3c, value)%>%
  rename (Code = iso3c)%>%
  rename (Year = date)%>%
  rename (population = value)

population <- 
country_data%>%
  select (Code)%>%
  left_join(population, by = "Code")%>%
  arrange(country)%>%
  rename (Name = country)

# Service export numbers pulled from the API
# Service import/export values (US$) are later divided by population
Service_export <- wb(indicator = "TX.VAL.SERV.CD.WT", startdate = 2008, enddate = 2017)%>%
  rename (Code = iso3c)%>%
  rename (Year = date) %>%
  rename (export = value) %>%
  select (Year, Code, export)

Service_numbers <- wb(indicator = "TM.VAL.SERV.CD.WT", startdate = 2008, enddate = 2017)%>%
  rename (Code = iso3c)%>%
  rename (Year = date) %>%
  rename (import = value)%>%
  full_join(Service_export, by = "Code")%>%
  right_join(population,by = "Code")%>%
  right_join(meta_data, by = "Code")%>%
  filter(Year.x == Year.y & Year.x == Year)%>%
  rename (country = country.x) %>%
  select(Year, Name, Code, country, income, region, population, import, export)%>%
  mutate(pop_import = import/population)%>%
  mutate(pop_export = export/population)%>%
  arrange(Code, Year)

# joining this data with the Country all data
ctry_all_pop <- ctry_all %>%
  left_join(Service_numbers, by = "Code")%>%
    select(country, region.x, Code, income.x, Year.x, Import_Totals, Export_Totals,population, import, export, pop_import, pop_export)%>%
    rename(name = country)%>%
    rename(region = region.x)%>%
    rename(income = income.x)%>%
    rename(Year = Year.x)%>%
    mutate(pop_import_per = pop_import/Import_Totals)%>%
    mutate(pop_export_per= pop_export/Export_Totals)
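The join above matches `ctry_all` to `Service_numbers` on `Code` only. The `Year.x` suffix in the `select()` call suggests that both tables carry a `Year` column, in which case each year on one side is paired with every year on the other. The toy sketch below (base R `merge()` on made-up data, so it runs without extra packages) shows the difference between the two join keys. Whether it applies here depends on the structure of `ctry_all`, which this section does not show.

```r
# Toy data: two countries, three years each (all values are placeholders).
left <- data.frame(
  Code = rep(c("AAA", "BBB"), each = 3),
  Year = rep(2008:2010, times = 2),
  Import_Totals = 1:6
)
right <- data.frame(
  Code = rep(c("AAA", "BBB"), each = 3),
  Year = rep(2008:2010, times = 2),
  population = c(10, 11, 12, 20, 21, 22)
)

# Joining on Code alone pairs every left-hand year with every right-hand year.
by_code <- merge(left, right, by = "Code")

# Joining on Code and Year keeps one row per country-year.
by_code_year <- merge(left, right, by = c("Code", "Year"))

nrow(by_code)       # 18 rows: 2 countries x 3 years x 3 years
nrow(by_code_year)  # 6 rows: 2 countries x 3 years
```

In dplyr the equivalent key would be `by = c("Code", "Year")` in `left_join()`. Comparing `nrow()` before and after a join is a quick way to spot unintended row multiplication.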

Graphing the new numbers

ctry_all_pop %>%
  filter(income == "Low income")%>%
    ggplot() + geom_bar(aes(y = pop_import_per, x = Year, fill = Year),
      stat="identity")+
          ylim(0, 30000)+  
        theme(legend.position = "none")+
        ggtitle(label = "Low income Import figures")+ 
        theme(axis.text.x = element_text(angle = 60, hjust = 1, vjust = 1))

ctry_all_pop %>%
  filter(income == "Lower middle income")%>%
    ggplot() + geom_bar(aes(y = pop_import_per, x = Year, fill = Year),
      stat="identity")+
            ylim(0, 30000)+  
        theme(legend.position = "none")+
        ggtitle(label = "Lower middle income Import figures")+ 
        theme(axis.text.x = element_text(angle = 60, hjust = 1, vjust = 1))

ctry_all_pop %>%
  filter(income == "Upper middle income")%>%
    ggplot() + geom_bar(aes(y = pop_import_per, x = Year, fill = Year),
      stat="identity")+
            ylim(0, 30000)+  
        theme(legend.position = "none")+
        ggtitle(label = "Upper Middle income Import figures")+ 
        theme(axis.text.x = element_text(angle = 60, hjust = 1, vjust = 1))

ctry_all_pop %>%
  filter(income == "High income")%>%
    ggplot() + geom_bar(aes(y = pop_import_per, x = Year, fill = Year),
      stat="identity")+
          ylim(0, 130000)+  
        theme(legend.position = "none")+
        ggtitle(label = "High income Import figures")+ 
        theme(axis.text.x = element_text(angle = 60, hjust = 1, vjust = 1))

Note: Pay attention to the scale of the High income chart. To show all the data, the y limit had to change from 30,000 to 130,000.

This still follows the pattern presented earlier: the more income a country has, the higher its import expenditure.

# breaking down the data again with the new data pulled from the population dataframe
service_income_high <- 
  ctry_all_pop%>%
    filter(income == "High income")

service_income_low <- 
  ctry_all_pop%>%
    filter(income == "Low income")

service_income_low_mid <- 
  ctry_all_pop%>%
    filter(income == "Lower middle income")

service_income_upper_mid <- 
  ctry_all_pop%>%
    filter(income == "Upper middle income")

service_region_latin_Am <- 
  ctry_all_pop%>%
    filter(region == "Latin America & Caribbean")

service_region_south_asia <- 
  ctry_all_pop%>%
    filter(region == "South Asia")

service_region_sub_africa <- 
  ctry_all_pop%>%
    filter(region == "Sub-Saharan Africa")

service_region_europe <- 
  ctry_all_pop%>%
    filter(region == "Europe & Central Asia")

service_region_east_asia <- 
  ctry_all_pop%>%
    filter(region == "East Asia & Pacific")

service_region_mid_east_nafrica <- 
  ctry_all_pop%>%
    filter(region == "Middle East & North Africa")

service_region_north_america <- 
  ctry_all_pop%>%
    filter(region == "North America")
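The per-group data frames above could also be produced in one step. The sketch below uses base R `split()` on a small made-up data frame; the rows are placeholders, and only the `income` and `region` column names are taken from the report.

```r
# Toy stand-in for ctry_all_pop (rows are placeholders).
toy <- data.frame(
  name   = c("A", "B", "C", "D"),
  income = c("High income", "Low income", "High income", "Upper middle income"),
  region = c("North America", "South Asia", "Europe & Central Asia", "South Asia"),
  stringsAsFactors = FALSE
)

# split() returns a named list with one data frame per group value.
by_income <- split(toy, toy$income)
by_region <- split(toy, toy$region)

names(by_income)            # one entry per income group present in the data
by_income[["High income"]]  # same rows as filter(income == "High income")
```

A named list keeps the groups together, so the same plot or model can be applied to each one with `lapply()` instead of repeating the code per group.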

High Income Analysis

We know High income countries make up a large share of our dataset. Modeling the High income countries on their own should show whether the hypothesis still holds.

ggplot(service_income_high, aes(region, pop_import_per)) + geom_boxplot()+ 
   theme(axis.text.x = element_text(angle = 90, hjust = 1))+
     ggtitle(label = "High income Import figures by region")

ggplot(service_income_high, aes(name, pop_import_per)) + geom_boxplot()+ 
   theme(axis.text.x = element_text(angle = 90, hjust = 1))+
     ggtitle(label = "High income Import figures by country")

ggplot(service_income_high, aes(region, pop_export_per)) + 
  geom_boxplot()+ theme(axis.text.x = element_text(angle = 90, hjust = 1))+
     ggtitle(label = "High income Export figures by region")

ggplot(service_income_high, aes(name, pop_export_per)) + 
  geom_boxplot()+ theme(axis.text.x = element_text(angle = 90, hjust = 1))+
     ggtitle(label = "High income Export figures by country")

We have heavy statistical outliers in both the Import and Export figures. Important factors for later consideration:

Heavy outliers in import figures for Europe & Central Asia
Lithuania has import figures far above the rest of the aggregate
Heavy outliers in export figures for East Asia & Pacific
Luxembourg has export figures far above the rest of the aggregate
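The points a boxplot draws as outliers can also be listed directly. The sketch below is a minimal illustration on made-up values, using base R `boxplot.stats()`, which applies the same 1.5 x IQR rule as the default boxplot whiskers; the names and numbers are placeholders, not figures from the dataset.

```r
# Toy data: five ordinary values and one extreme value (placeholders).
toy <- data.frame(
  name  = c("A", "B", "C", "D", "E", "F"),
  value = c(1, 2, 3, 4, 5, 100)
)

# boxplot.stats()$out holds the values lying beyond the whiskers.
flagged <- boxplot.stats(toy$value)$out

# Keep the rows whose value was flagged, so the names can be read off.
outlier_rows <- toy[toy$value %in% flagged, ]
print(outlier_rows)
```

Applied per region or per income group, this gives a table of the countries behind the outlier points rather than reading them off the chart.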

m_income_high = lm(service_income_high$pop_export_per~ service_income_high$pop_import_per)
summary(m_income_high)

## 
## Call:
## lm(formula = service_income_high$pop_export_per ~ service_income_high$pop_import_per)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -1324   -702   -651   -579  58936 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)                        645.1794    63.9644   10.09   <2e-16
## service_income_high$pop_import_per   1.5044     0.1255   11.99   <2e-16
##                                       
## (Intercept)                        ***
## service_income_high$pop_import_per ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4486 on 5700 degrees of freedom
## Multiple R-squared:  0.0246, Adjusted R-squared:  0.02443 
## F-statistic: 143.7 on 1 and 5700 DF,  p-value: < 2.2e-16

plot_ss(x= service_income_high$pop_export_per, y= jitter(service_income_high$pop_import_per), data = service_income_high, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##   173.70626      0.01635  
## 
## Sum of Squares:  1246754433

The linear model is thrown off so heavily by the outliers that it is no longer reliable; the R-squared value is only about 0.025.
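To illustrate how strongly a single extreme value can degrade a linear fit, the sketch below builds a nearly perfect straight-line relationship from made-up numbers and then adds one outlier. It is a toy demonstration of the effect, not a re-analysis of the High income data.

```r
# A nearly perfect line: y is 2 * x plus a small alternating offset.
x <- 1:20
y <- 2 * x + rep(c(-1, 1), times = 10)
r2_clean <- summary(lm(y ~ x))$r.squared

# The same data with one extreme observation appended.
x_out <- c(x, 10)
y_out <- c(y, 1000)
r2_outlier <- summary(lm(y_out ~ x_out))$r.squared

# Compare the two R-squared values side by side.
c(clean = r2_clean, with_outlier = r2_outlier)
```

The fit on the clean data explains almost all of the variance, while the single added point pulls R-squared close to zero.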

Low Income Analysis

We now analyze the Low income aggregate using the same methods used for the High income aggregate.

ggplot(service_income_low, aes(region, pop_import_per)) + geom_boxplot()+ 
   theme(axis.text.x = element_text(angle = 90, hjust = 1))+
     ggtitle(label = "Low income Import figures by region")

ggplot(service_income_low, aes(name, pop_import_per)) + geom_boxplot()+ 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
     ggtitle(label = "Low income import figures by country")

ggplot(service_income_low, aes(region, pop_export_per)) + geom_boxplot()+ 
   theme(axis.text.x = element_text(angle = 90, hjust = 1))+
     ggtitle(label = "Low income Export figures by region")

ggplot(service_income_low, aes(name, pop_export_per)) + geom_boxplot()+ 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
     ggtitle(label = "Low income export figures by country")

m_income_low = lm(service_income_low$pop_export_per~ service_income_low$pop_import_per)
summary.lm(m_income_low)

## 
## Call:
## lm(formula = service_income_low$pop_export_per ~ service_income_low$pop_import_per)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.060  -1.084  -0.727   0.275  38.516 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       0.827310   0.073912   11.19   <2e-16 ***
## service_income_low$pop_import_per 0.304024   0.008984   33.84   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.255 on 2368 degrees of freedom
## Multiple R-squared:  0.326,  Adjusted R-squared:  0.3257 
## F-statistic:  1145 on 1 and 2368 DF,  p-value: < 2.2e-16

plot_ss(x= service_income_low$pop_export_per, y= jitter(service_income_low$pop_import_per), data = service_income_low, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##       1.477        1.072  
## 
## Sum of Squares:  88476.23

We have heavy statistical outliers in both the Import and Export figures. Important factors for later consideration:

Heavy outliers in import and export figures for Sub-Saharan Africa
Guinea and Sierra Leone have import figures far above the rest of the aggregate
Gambia and Syria have export figures far above the rest of the aggregate

Regression

Linear Regression model

Even though not all of the conditions for a linear regression model are met, we still wanted to run the data through the model.

# changing the region to numbers
ctry_lm_model <-
ctry_all_pop%>%
  mutate_if(is.character, str_trim)%>%
    select(name, region,income, pop_import_per, pop_export_per)%>%
    mutate(region = ifelse(region == "East Asia & Pacific", 1,
                                     ifelse(region == "Europe & Central Asia", 2,
                                            ifelse(region == "Latin America & Caribbean", 3,
                                                   ifelse(region == "Middle East & North Africa", 4,
                                                         ifelse(region == "North America", 5, 
                                                                ifelse(region == "South Asia", 6,7)))))))

# Changing the income column to numbers using a different method  
ctry_lm_model <-
ctry_lm_model%>%  
    mutate(income = replace(income, income =="Low income", 1))%>%
    mutate(income = replace(income, income =="Lower middle income", 2))%>%
    mutate(income = replace(income, income =="Upper middle income", 3))%>%
    mutate(income = replace(income, income =="High income", 4))
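The two recodings above behave differently inside `lm()`. The nested `ifelse()` produces numeric codes, so region is treated as a single quantity with equally spaced steps, while `replace()` on a character column leaves income as character, which `lm()` dummy-codes (hence the `income2`, `income3`, `income4` rows in the output further down). The sketch below shows the difference on made-up data; the `value` column is a placeholder.

```r
# Toy data: three regions with clearly different levels (placeholders).
toy <- data.frame(
  region = rep(c("East Asia & Pacific", "Europe & Central Asia", "South Asia"),
               each = 4),
  value  = c(5, 6, 7, 6, 50, 55, 60, 52, 9, 8, 10, 9)
)

# Numeric codes: lm() fits one slope across the codes 1, 2, 3.
toy$region_code <- as.integer(factor(toy$region))
m_numeric <- lm(value ~ region_code, data = toy)

# Factor: lm() fits a separate coefficient for each non-reference level.
m_factor <- lm(value ~ factor(region), data = toy)

length(coef(m_numeric))  # 2: intercept and one slope
length(coef(m_factor))   # 3: intercept and two level effects
```

Because the region codes have no natural order, the factor form is usually the safer choice when region is used as a predictor.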

Linear regression model per Region based on Import numbers

plot(ctry_lm_model$pop_import_per ~ ctry_lm_model$region)
abline(h=0, lty = 3)

cor(ctry_lm_model$pop_import_per, ctry_lm_model$region)

## [1] -0.1179268

m_ctry_lm <- lm(region ~ pop_import_per, data = ctry_lm_model)
summary(m_ctry_lm)

## 
## Call:
## lm(formula = region ~ pop_import_per, data = ctry_lm_model)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6527 -1.6369 -0.6392  2.3495  4.3672 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.653e+00  1.690e-02  216.14   <2e-16 ***
## pop_import_per -9.075e-04  5.774e-05  -15.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.168 on 17517 degrees of freedom
## Multiple R-squared:  0.01391,    Adjusted R-squared:  0.01385 
## F-statistic:   247 on 1 and 17517 DF,  p-value: < 2.2e-16

The linear model shows the heavy outliers in region 2 (Europe & Central Asia) seen earlier. The R-squared value is near zero (about 0.014), so the fit is poor.
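For a simple linear regression with one predictor, R-squared equals the square of the Pearson correlation, so the two figures reported above can be checked against each other. The snippet below does that arithmetic with the correlation value printed by `cor()`.

```r
# Correlation between pop_import_per and region, as printed above.
r <- -0.1179268

# Squaring it gives the share of variance explained by the one-predictor model.
r_squared <- r^2
round(r_squared, 5)  # 0.01391, matching the Multiple R-squared in the summary
```

In other words, the model accounts for about 1.4% of the variation.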

Linear regression model per income based on Import numbers

plot(ctry_lm_model$income ~ ctry_lm_model$pop_import_per)
abline(h=0, lty = 3)

cor(ctry_lm_model$pop_export_per, ctry_lm_model$pop_import_per)

## [1] 0.1934991

m_ctry_lm <- lm(pop_import_per ~ income, data = ctry_lm_model)
summary(m_ctry_lm)

## 
## Call:
## lm(formula = pop_import_per ~ income, data = ctry_lm_model)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -185.7  -22.7   -5.4    0.5 5382.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.508      5.582   0.628 0.529759    
## income2        4.805      6.974   0.689 0.490856    
## income3       23.370      6.730   3.473 0.000517 ***
## income4      185.392      6.641  27.915  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 271.7 on 17515 degrees of freedom
## Multiple R-squared:  0.08283,    Adjusted R-squared:  0.08267 
## F-statistic: 527.3 on 3 and 17515 DF,  p-value: < 2.2e-16

The linear model shows the heavy outliers in income 4 (High income) seen earlier. Again, the R-squared value is near zero (about 0.083).
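Because income enters this model as a categorical predictor, each coefficient is a difference from the baseline group (income 1, Low income), and the fitted value for each group is the intercept plus that group's coefficient. The sketch below reproduces that arithmetic from the printed estimates, assuming R's default treatment contrasts.

```r
# Estimates copied from the summary output above.
coefs <- c(intercept = 3.508, income2 = 4.805, income3 = 23.370, income4 = 185.392)

# Fitted pop_import_per for each income group: baseline plus group effect.
group_means <- c(
  low          = unname(coefs["intercept"]),
  lower_middle = unname(coefs["intercept"] + coefs["income2"]),
  upper_middle = unname(coefs["intercept"] + coefs["income3"]),
  high         = unname(coefs["intercept"] + coefs["income4"])
)
print(group_means)  # 3.508, 8.313, 26.878, 188.900
```

The fitted values rise with each income group, with by far the largest step between Upper middle and High income.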

With this data in hand, there is no need for further analysis. The models give no support for the hypothesis that the income aggregate of a country has an effect on its import and export percentage.

Supporting observations:

Heavy outliers in Low income import and export figures for Sub-Saharan Africa
Guinea and Sierra Leone have import figures far above the rest of the Low income aggregate
Gambia and Syria have export figures far above the rest of the Low income aggregate
Heavy High income outliers in import figures for Europe & Central Asia
Lithuania has import figures far above the rest of the High income aggregate
Heavy High income outliers in export figures for East Asia & Pacific
Luxembourg has export figures far above the rest of the High income aggregate
Both models have R-squared values near zero, indicating a poor fit

Reexamine geographical

Reexamine geographical analysis with new data

hist(service_region_europe$pop_import_per, 
     main="Histogram of Service import for Europe", 
     xlab="Service Import figures in $USD")

hist(service_region_europe$pop_export_per, 
     main="Histogram of Service export for Europe", 
     xlab="Service Export figures in $USD")

ggplot(service_region_europe, aes(name, pop_import_per)) + geom_boxplot()+ 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
     ggtitle(label = "Europe import figures by country")

ggplot(service_region_europe, aes(name, pop_export_per)) + geom_boxplot()+ 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
     ggtitle(label = "Europe export figures by country")

m_income_Europe = lm(service_region_europe$pop_export_per~ service_region_europe$pop_import_per)
summary.lm(m_income_Europe)

## 
## Call:
## lm(formula = service_region_europe$pop_export_per ~ service_region_europe$pop_import_per)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -611.29  -33.93   -8.51   10.06 2277.49 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)                          1.960737   2.390853    0.82    0.412
## service_region_europe$pop_import_per 1.647478   0.004423  372.44   <2e-16
##                                         
## (Intercept)                             
## service_region_europe$pop_import_per ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 156.3 on 4614 degrees of freedom
## Multiple R-squared:  0.9678, Adjusted R-squared:  0.9678 
## F-statistic: 1.387e+05 on 1 and 4614 DF,  p-value: < 2.2e-16

plot_ss(x= service_region_europe$pop_export_per, y= jitter(service_region_europe$pop_import_per), data = service_region_europe, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##      3.5676       0.5875  
## 
## Sum of Squares:  40215222

hist(service_region_south_asia$pop_import_per, 
     main="Histogram of Service import for South Asia", 
     xlab="Service Import figures in $USD")

hist(service_region_south_asia$pop_export_per, 
     main="Histogram of Service export for South Asia", 
     xlab="Service Export figures in $USD")

ggplot(service_region_south_asia, aes(name, pop_import_per)) + 
  geom_boxplot()+ theme(axis.text.x = element_text(angle = 90, hjust = 1))+
     ggtitle(label = "South Asia import figures by country")

ggplot(service_region_south_asia, aes(name, pop_export_per)) + 
  geom_boxplot()+ theme(axis.text.x = element_text(angle = 90, hjust = 1))+
     ggtitle(label = "South Asia export figures by country")

m_income_South_asia = lm(service_region_south_asia$pop_export_per~ service_region_south_asia$pop_import_per)
summary.lm(m_income_South_asia)

## 
## Call:
## lm(formula = service_region_south_asia$pop_export_per ~ service_region_south_asia$pop_import_per)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2360.54   -53.65    -4.58    16.00  1398.93 
## 
## Coefficients:
##                                          Estimate Std. Error t value
## (Intercept)                              -49.1827    11.5931  -4.242
## service_region_south_asia$pop_import_per  28.1741     0.4743  59.398
##                                          Pr(>|t|)    
## (Intercept)                              2.47e-05 ***
## service_region_south_asia$pop_import_per  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 293.5 on 798 degrees of freedom
## Multiple R-squared:  0.8155, Adjusted R-squared:  0.8153 
## F-statistic:  3528 on 1 and 798 DF,  p-value: < 2.2e-16

plot_ss(x= service_region_south_asia$pop_export_per, y= jitter(service_region_south_asia$pop_import_per), data = service_region_south_asia, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##     3.43312      0.02895  
## 
## Sum of Squares:  70638.23

The rejection of the hypothesis still holds for the by-region analysis.

Conclusion

Using this data, I can comfortably conclude that there is no correlation between a country’s geographical area and/or prosperity and that same country’s “Computer, communications and other technical services” (import and export) percentage.

John Kellogg
