About the Data

This data set about the Countries of the World presents some geographical and economical statistics about the 197 countries in today’s world. The data set is available at [Countries of the World] (http://www.cs.stonybrook.edu/~skiena/519/hw1/country-data.csv) This is a table of information about each country, with approximately 20 fields including: name, countrycode, type of government, longitude and latitude of capital city, population, life expectancy, GDP, area, literacy rate, and more. Along with this data, I also sourced additional information on the list of countries in each continent, to perfom a continent-level analysis. This table was sourced from [here] (http://www.sport-histoire.fr/en/Geography/By_continent.php) We will explore this data and uncover interesting observations about the success and fate of nations.

An Overview of the Data and the variables

countries_data <- read.csv("http://www3.cs.stonybrook.edu/~skiena/519/hw1/country-data.csv")
kable(names(countries_data))
Country.Name
Country.Code
Government.Type
Capital.City
Date.of.Founding.Independence
Latitude.of.Capital
Longitude.of.Capital
Population
Life.Expectancy
GDP..PPP..in.US..
Area..sq.km.
Land.Boundaries..km.
Coastline..km.
Literacy.Rate….
Health.Expenditure.GDP
Military.Expenditure.GDP
Renewable.Water.Resources..cu.km.
Net.Migration.Rate.1000.Population
Labor.Force
Internet.Users..circa.2009.
Commercial.Prime.Lending.Rate
kable(str(countries_data))
## 'data.frame':    197 obs. of  21 variables:
##  $ Country.Name                      : Factor w/ 197 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Country.Code                      : Factor w/ 196 levels "AD","AE","AF",..: 3 5 49 1 7 4 8 6 10 9 ...
##  $ Government.Type                   : Factor w/ 52 levels "a parliamentary democracy, a federation, and a constitutional monarchy",..: 28 37 43 37 43 10 43 43 19 21 ...
##  $ Capital.City                      : Factor w/ 197 levels "Abu Dhabi","Abuja",..: 74 179 5 8 94 156 41 196 44 187 ...
##  $ Date.of.Founding.Independence     : Factor w/ 187 levels "1/1/1804","1/1/1901",..: 148 35 129 53 30 28 134 173 2 31 ...
##  $ Latitude.of.Capital               : Factor w/ 194 levels "0 13 S","0 19 N",..: 96 126 104 131 187 44 97 122 100 149 ...
##  $ Longitude.of.Capital              : Factor w/ 196 levels "0 05 W","0 13 W",..: 158 67 91 4 30 153 142 120 40 49 ...
##  $ Population                        : int  31822848 3020209 38813722 85458 19088106 91295 43024374 3060631 22507617 8223062 ...
##  $ Life.Expectancy                   : num  50.5 78 76.4 82.7 55.3 ...
##  $ GDP..PPP..in.US..                 : num  4.53e+10 2.83e+10 2.85e+11 3.16e+09 1.32e+11 ...
##  $ Area..sq.km.                      : num  652230 28748 2381741 468 1246700 ...
##  $ Land.Boundaries..km.              : num  5987 691 6734 118 5369 ...
##  $ Coastline..km.                    : num  0 362 998 0 1600 ...
##  $ Literacy.Rate....                 : Factor w/ 125 levels "100","27","28.1",..: 3 104 45 1 40 119 110 122 119 111 ...
##  $ Health.Expenditure.GDP            : Factor w/ 87 levels "1.9","10","10.1",..: 84 55 34 62 30 51 71 37 79 6 ...
##  $ Military.Expenditure.GDP          : Factor w/ 113 levels "0","0.13","0.19",..: 108 55 105 113 98 113 24 100 63 19 ...
##  $ Renewable.Water.Resources..cu.km. : Factor w/ 166 levels "0.02","0.03",..: 141 113 23 166 45 3 155 146 124 149 ...
##  $ Net.Migration.Rate.1000.Population: Factor w/ 154 levels "-0.02","-0.04",..: 44 68 30 93 101 127 93 79 146 117 ...
##  $ Labor.Force                       : Factor w/ 195 levels "1022000","1037000",..: 169 6 7 110 187 94 43 23 16 113 ...
##  $ Internet.Users..circa.2009.       : Factor w/ 181 levels "10000","1000000",..: 2 17 126 146 136 143 21 57 36 139 ...
##  $ Commercial.Prime.Lending.Rate     : Factor w/ 116 levels "0.25","1.5","10",..: 31 113 102 116 31 6 36 37 90 46 ...
kable(head(countries_data))
Country.Name Country.Code Government.Type Capital.City Date.of.Founding.Independence Latitude.of.Capital Longitude.of.Capital Population Life.Expectancy GDP..PPP..in.US.. Area..sq.km. Land.Boundaries..km. Coastline..km. Literacy.Rate…. Health.Expenditure.GDP Military.Expenditure.GDP Renewable.Water.Resources..cu.km. Net.Migration.Rate.1000.Population Labor.Force Internet.Users..circa.2009. Commercial.Prime.Lending.Rate
Afghanistan AF Islamic republic Kabul 8/19/1919 34 31 N 69 11 E 31822848 50.49 4.530e+10 652230.0 5987 0 28.1 9.6 4.74 65.33 -1.83 7512000 1000000 15
Albania AL parliamentary democracy Tirana 11/28/1912 41 19 N 19 49 E 3020209 77.96 2.834e+10 28748.0 691 362 96.8 6.3 1.47 41.7 -3.31 1098000 1300000 9.52
Algeria DZ republic Algiers 7/5/1962 36 45 N 3 03 E 38813722 76.39 2.847e+11 2381741.0 6734 998 72.6 3.9 4.48 11.67 -0.93 11150000 4700000 8
Andorra AD parliamentary democracy Andorra la Vella 1278 42 30 N 1 31 E 85458 82.65 3.163e+09 468.0 118 0 100 7.2 unknown unknown 0 36060 67100 unknown
Angola AO republic Luanda 11/11/1975 8 50 S 13 13 E 19088106 55.29 1.318e+11 1246700.0 5369 1600 70.4 3.5 3.63 148 0.47 9018000 606700 15
Antigua and Barbuda AG constitutional monarchy Saint John’s 11/1/1981 17 07 N 61 51 W 91295 76.12 1.610e+09 442.6 0 153 99 5.9 unknown 0.05 2.23 30000 65000 10.3

Normalization of the Data

As we can see, the data is presented to us in one big chunk. Looking at the attributes of the data set, it becomes apparent that we can divide this big table into smaller, logical, modular pieces. I have tried to partition the parent table logically into 3 tables.

In summary, our data set is now composed of 4 tables as follows:

  1. Countries_Geography: This table stores information about the geographical attributes of countries.
  2. Countries_Metrics: This table stores information about economic metrics of the countries.
  3. Contries: This table stores general(miscellaneous) information about the Countries
  4. Continent: This is the Continent-Country mapping table.

A glimpse of our tables-

Local <- odbcConnect("R Connect" ,uid = "", pwd = "")
Countries_Metrics <- sqlQuery(Local, "select * from Countries_Metrics")
Countries <- sqlQuery(Local, "select * from Countries")
Continents <- sqlQuery(Local, "select * from Continents")
Countries_Geography <- sqlQuery(Local, "select * from Countries_Geography")
kable(names(Countries))
Country_Code
Country_Name
Government_Type
Founding_Independence_Date
Capital_City
kable(str(Countries))
## 'data.frame':    197 obs. of  5 variables:
##  $ Country_Code              : Factor w/ 196 levels "AD","AE","AF",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Country_Name              : Factor w/ 197 levels "Afghanistan",..: 4 186 1 6 2 8 5 7 10 9 ...
##  $ Government_Type           : Factor w/ 52 levels "a parliamentary democracy, a federation, and a constitutional monarchy",..: 37 24 28 10 37 43 43 43 21 19 ...
##  $ Founding_Independence_Date: Date, format: "1278-01-01" "1971-12-02" ...
##  $ Capital_City              : Factor w/ 197 levels "Abu Dhabi","Abuja",..: 8 1 74 156 179 196 94 41 187 44 ...
kable(head(Countries))
Country_Code Country_Name Government_Type Founding_Independence_Date Capital_City
AD Andorra parliamentary democracy 1278-01-01 Andorra la Vella
AE United Arab Emirates federation with specified powers delegated to the UAE federal government and other powers reserved to member emirates 1971-12-02 Abu Dhabi
AF Afghanistan Islamic republic 1919-08-19 Kabul
AG Antigua and Barbuda constitutional monarchy 1981-11-01 Saint John’s
AL Albania parliamentary democracy 1912-11-28 Tirana
AM Armenia republic 1991-09-21 Yerevan
kable(names(Countries_Geography))
Country_Code
Capital_Latitude
Capital_Longitude
Area
Land_Boundaries
Coastline
kable(str(Countries_Geography))
## 'data.frame':    197 obs. of  6 variables:
##  $ Country_Code     : Factor w/ 196 levels "AD","AE","AF",..: 3 5 49 1 7 4 8 6 10 9 ...
##  $ Capital_Latitude : Factor w/ 194 levels "0 13 S","0 19 N",..: 96 126 104 131 187 44 97 122 100 149 ...
##  $ Capital_Longitude: Factor w/ 196 levels "0 05 W","0 13 W",..: 158 67 91 4 30 153 142 120 40 49 ...
##  $ Area             : num  652230 28748 2381741 468 1246700 ...
##  $ Land_Boundaries  : num  5987 691 6734 118 5369 ...
##  $ Coastline        : num  0 362 998 0 1600 ...
kable(head(Countries_Geography))
Country_Code Capital_Latitude Capital_Longitude Area Land_Boundaries Coastline
AF 34 31 N 69 11 E 652230.0 5987 0
AL 41 19 N 19 49 E 28748.0 691 362
DZ 36 45 N 3 03 E 2381741.0 6734 998
AD 42 30 N 1 31 E 468.0 118 0
AO 8 50 S 13 13 E 1246700.0 5369 1600
AG 17 07 N 61 51 W 442.6 0 153
kable(names(Countries_Metrics))
Country_Code
Population
Life_Expectancy
GDP
Literacy_Rate
Health_Expenditure_per_GDP
Military_Expenditure_per_GDP
Renewable_Water_Resources
Net_Migration_Rate_per_1000_population
Labor_Force
Internet_Users
Commercial_Prime_Lending_Rate
GDP_per_Capita
Economy_Status
Labor_force_participation_rate
Technology_Penetration_rate
Population_Density
kable(str(Countries_Metrics))
## 'data.frame':    197 obs. of  17 variables:
##  $ Country_Code                          : Factor w/ 196 levels "AD","AE","AF",..: 3 5 49 1 7 4 8 6 10 9 ...
##  $ Population                            : num  31822848 3020209 38813722 85458 19088106 ...
##  $ Life_Expectancy                       : num  50.5 78 76.4 82.7 55.3 ...
##  $ GDP                                   : num  4.53e+10 2.83e+10 2.85e+11 3.16e+09 1.32e+11 ...
##  $ Literacy_Rate                         : num  28.1 96.8 72.6 100 70.4 99 97.9 99.6 99 98 ...
##  $ Health_Expenditure_per_GDP            : num  9.6 6.3 3.9 7.2 3.5 5.9 8.1 4.3 9 10.6 ...
##  $ Military_Expenditure_per_GDP          : num  4.74 1.47 4.48 NA 3.63 NA 0.91 3.92 1.71 0.81 ...
##  $ Renewable_Water_Resources             : num  65.3 41.7 11.7 NA 148 ...
##  $ Net_Migration_Rate_per_1000_population: num  -1.83 -3.31 -0.93 0 0.47 2.23 0 -5.88 5.74 1.76 ...
##  $ Labor_Force                           : num  7512000 1098000 11150000 36060 9018000 ...
##  $ Internet_Users                        : num  1000000 1300000 4700000 67100 606700 ...
##  $ Commercial_Prime_Lending_Rate         : num  15 9.52 8 NA 15 10.3 16.4 16.5 6.2 2.2 ...
##  $ GDP_per_Capita                        : num  1424 9383 7335 37012 6905 ...
##  $ Economy_Status                        : Factor w/ 3 levels "Developed","Developing",..: 3 2 2 1 2 1 1 2 1 1 ...
##  $ Labor_force_participation_rate        : num  0.236 0.364 0.287 0.422 0.472 ...
##  $ Technology_Penetration_rate           : num  0.0314 0.4304 0.1211 0.7852 0.0318 ...
##  $ Population_Density                    : num  48.8 105.1 16.3 182.6 15.3 ...
kable(head(Countries_Metrics))
Country_Code Population Life_Expectancy GDP Literacy_Rate Health_Expenditure_per_GDP Military_Expenditure_per_GDP Renewable_Water_Resources Net_Migration_Rate_per_1000_population Labor_Force Internet_Users Commercial_Prime_Lending_Rate GDP_per_Capita Economy_Status Labor_force_participation_rate Technology_Penetration_rate Population_Density
AF 31822848 50.49 4.530e+10 28.1 9.6 4.74 65.33 -1.83 7512000 1000000 15.00 1423.506 Least Developed Country (LDC) 0.2360568 0.0314240 48.79084
AL 3020209 77.96 2.834e+10 96.8 6.3 1.47 41.70 -3.31 1098000 1300000 9.52 9383.457 Developing 0.3635510 0.4304338 105.05806
DZ 38813722 76.39 2.847e+11 72.6 3.9 4.48 11.67 -0.93 11150000 4700000 8.00 7335.035 Developing 0.2872695 0.1210912 16.29637
AD 85458 82.65 3.163e+09 100.0 7.2 NA NA 0.00 36060 67100 NA 37012.334 Developed 0.4219617 0.7851810 182.60256
AO 19088106 55.29 1.318e+11 70.4 3.5 3.63 148.00 0.47 9018000 606700 15.00 6904.823 Developing 0.4724408 0.0317842 15.31091
AG 91295 76.12 1.610e+09 99.0 5.9 NA 0.05 2.23 30000 65000 10.30 17635.139 Developed 0.3286051 0.7119777 206.26977
kable(names(Continents))
Country Name
Continent
Country_Code
kable(str(Continents))
## 'data.frame':    197 obs. of  3 variables:
##  $ Country Name: Factor w/ 197 levels "Afghanistan",..: 3 5 19 23 27 28 30 32 33 34 ...
##  $ Continent   : Factor w/ 5 levels "Africa","America",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Country_Code: Factor w/ 196 levels "AD","AE","AF",..: 49 7 20 26 16 19 36 41 31 169 ...
kable(head(Continents))
Country Name Continent Country_Code
Algeria Africa DZ
Angola Africa AO
Benin Africa BJ
Botswana Africa BW
Burkina Faso Africa BF
Burundi Africa BI

Problems in the Data Set

Below were the problems in the data set that had to be fixed- 1. The data indicated missing values as ‘unknown’. We used SQL to set these values to NULL Example Code - Update Countries_Data Set [Government#Type]=Nullif([Government#Type],‘unknown’) Update Countries_Data Set[Capital#City]=Nullif([Capital#City],‘unknown’) Update Countries_Data Set[Date#of#Founding#Independence]=Nullif([Date#of#Founding#Independence],‘unknown’);

  1. The Country names in the Continents table and the Country names in the Countries table are not consistent in some cases. We standardized these in SQL. Example Code - update Countries set Country_Name = ‘Ivory Coast’ where Country_Name like ‘%Ivoire%’

  2. I also added a couple of calculated fields in our Countries_Metrics table which would come in handy during our analysis. The below columns were added-
  1. Population Density: This was calculated from the Population field in Countries_Metrics and the Area field in Countries_Geography as follows-

    sqlQuery(Local, “Alter table Countries_Metrics add Population_Density float”) sqlQuery(Local, “update Countries_Metrics set Population_Density = (Population)/(Area) from Countries_Metrics, Countries_Geography where Countries_Metrics.Country_Code=Countries_Geography.Country_Code”)

  2. Technology Penetration Rate: This was calculated from the Internet_Users and Population fields in Countries_Metrics table as follows-

    sqlQuery(Local, “Alter table Countries_Metrics add Technology_Penetration_rate float”) sqlQuery(Local, “update Countries_Metrics set Technology_Penetration_rate = (Internet_Users)/(Population)”)

  3. Labor Force Participation Rate: This was calculated from the Labor Force and Population fields in Countries_Metrics table as follows-

    sqlQuery(Local, “Alter table Countries_Metrics add Labor_force_participation_rate float”) sqlQuery(Local, “update Countries_Metrics set Labor_force_participation_rate = (Labor_Force)/(Population)”)

Exploratory Data Analysis

Analysis on Literacy rates across different Continents and Countries

Literacy_Rates_Continents <- sqlQuery(Local,"select continent, max(Literacy_Rate) as Highest_Literacy_Rate, min(Literacy_Rate) as Lowest_Literacy_Rate, Avg(Literacy_Rate) as Average_Literacy_Rate
from Countries_Metrics, Continents where Countries_Metrics.Country_Code=Continents.Country_Code
group by continent order by continent")

Highest_literacy_Rate_Country <- sqlQuery(Local, "select Country_Name from Countries,Countries_Metrics
where Literacy_Rate in (select max(Literacy_Rate) from Countries_Metrics) 
and Countries.Country_Code=Countries_Metrics.Country_Code")

Lowest_literacy_Rate_Country <- sqlQuery(Local, "select Country_Name from Countries,Countries_Metrics
where Literacy_Rate in (select min(Literacy_Rate) from Countries_Metrics) 
and Countries.Country_Code=Countries_Metrics.Country_Code")

Literacy_Rates_Max_Min <- sqlQuery(Local,"select max(Literacy_Rate) as Highest_Literacy_Rate, min(Literacy_Rate) as Lowest_Literacy_Rate, Avg(Literacy_Rate) as Average_Literacy_Rate
from Countries_Metrics")

Literacy_Highest_lowest <- list()
Literacy_Highest_lowest[[ "Highest_Country" ]] <- Highest_literacy_Rate_Country$Country_Name
Literacy_Highest_lowest[[ "Lowest_Country" ]] <- Lowest_literacy_Rate_Country$Country_Name
Literacy_Highest_lowest[["Highest_Value"]] <- Literacy_Rates_Max_Min$Highest_Literacy_Rate
Literacy_Highest_lowest[["Lowest_Value"]] <- Literacy_Rates_Max_Min$Lowest_Literacy_Rate
Literacy_Highest_lowest[["Average Value"]] <- Literacy_Rates_Max_Min$Average_Literacy_Rate

kable(Literacy_Rates_Continents)
continent Highest_Literacy_Rate Lowest_Literacy_Rate Average_Literacy_Rate
Africa 94.2 27.0 66.05926
America 100.0 48.7 91.71667
Asia 100.0 28.1 87.07917
Europe 100.0 91.9 98.51591
Oceania 99.0 62.4 90.35455
Literacy_Highest_lowest
## $Highest_Country
## [1] Andorra       Finland       Greenland     Korea, North  Liechtenstein
## [6] Luxembourg    Norway       
## 7 Levels: Andorra Finland Greenland Korea, North ... Norway
## 
## $Lowest_Country
## [1] South Sudan
## Levels: South Sudan
## 
## $Highest_Value
## [1] 100
## 
## $Lowest_Value
## [1] 27
## 
## $`Average Value`
## [1] 84.85699

Analysis on Life Expectancy rates across different Continents and Countries

Life_Expectancy_Rates_Continents <- sqlQuery(Local,"select continent, max(Life_Expectancy) as Highest_Life_Expectancy, min(Life_Expectancy) as Lowest_Life_Expectancy, Avg(Life_Expectancy) as Average_Life_Expectancy
from Countries_Metrics, Continents where Countries_Metrics.Country_Code=Continents.Country_Code
                                             group by continent order by continent")

Highest_Life_Expectancy_Rate_Country <- sqlQuery(Local, "select Country_Name from Countries,Countries_Metrics
where Life_Expectancy in (select max(Life_Expectancy) from Countries_Metrics) 
                                                 and Countries.Country_Code=Countries_Metrics.Country_Code")

Lowest_Life_Expectancy_Rate_Country <- sqlQuery(Local, "select Country_Name from Countries,Countries_Metrics
where Life_Expectancy in (select min(Life_Expectancy) from Countries_Metrics) 
                                                and Countries.Country_Code=Countries_Metrics.Country_Code
                                                ")

Life_Expectancy_Rates_Max_Min <- sqlQuery(Local,"select max(Life_Expectancy) as Highest_Life_Expectancy, min(Life_Expectancy) as Lowest_Life_Expectancy, Avg(Life_Expectancy) as Average_Life_Expectancy
from Countries_Metrics")

Life_Expectancy_Highest_lowest <- list()
Life_Expectancy_Highest_lowest[[ "Highest_Country" ]] <- Highest_Life_Expectancy_Rate_Country$Country_Name
Life_Expectancy_Highest_lowest[[ "Lowest_Country" ]] <- Lowest_Life_Expectancy_Rate_Country$Country_Name
Life_Expectancy_Highest_lowest[["Highest_Value"]] <- Life_Expectancy_Rates_Max_Min$Highest_Life_Expectancy
Life_Expectancy_Highest_lowest[["Lowest_Value"]] <- Life_Expectancy_Rates_Max_Min$Lowest_Life_Expectancy
Life_Expectancy_Highest_lowest[["Average Value"]] <- Life_Expectancy_Rates_Max_Min$Average_Life_Expectancy

kable(Life_Expectancy_Rates_Continents)
continent Highest_Life_Expectancy Lowest_Life_Expectancy Average_Life_Expectancy
Africa 76.51 49.44 60.18000
America 81.67 63.18 74.42028
Asia 84.46 50.49 72.55958
Europe 89.57 69.14 78.08795
Oceania 82.07 65.47 72.41786
kable(as.data.frame(Life_Expectancy_Highest_lowest))
Highest_Country Lowest_Country Highest_Value Lowest_Value Average.Value
Monaco Chad 89.57 49.44 70.66807

Analysis on Government Types in Countries

Type_of_Government <- sqlQuery(Local,"select Government_Type, count(*) as Total_Number_of_Countries 
from Countries group by Government_Type order by Government_Type order by Total_number_of_Countries")

kable(Type_of_Government)
42000 156 [Microsoft][ODBC Driver 11 for SQL Server][SQL Server]Incorrect syntax near the keyword ‘order’.
[RODBC] ERROR: Could not SQLExecDirect ’select Government_Type, count(*) as Total_Number_of_Countries
from Countries group by Government_Type order by Government_Type order by Total_number_of_Countries’

Analysis on the GDP per Capita (World’s Richest and Poorest Countries)

Richest_Country <- sqlQuery(Local,"select Country_Name from Countries,Countries_Metrics
where GDP_per_Capita in (select max(GDP_per_Capita) from Countries_Metrics )
and Countries.Country_Code=Countries_Metrics.Country_Code")

Poorest_Country <- sqlQuery(Local,"select Country_Name from Countries,Countries_Metrics
where GDP_per_Capita in (select min(GDP_per_Capita) from Countries_Metrics )
                            and Countries.Country_Code=Countries_Metrics.Country_Code")

GDP_per_Capita_Max_Min <- sqlQuery(Local,"select max(GDP_per_Capita) as Highest_GDP_per_capita, min(GDP_per_capita) as Lowest_GDP_per_capita, Avg(GDP_per_capita) as Average_GDP_per_capita
from Countries_Metrics")

GDP_per_Capita_Highest_Lowest <- list()
GDP_per_Capita_Highest_Lowest[[ "Richest_Country" ]] <- Richest_Country$Country_Name
GDP_per_Capita_Highest_Lowest[[ "Poorest_Country" ]] <- Poorest_Country$Country_Name
GDP_per_Capita_Highest_Lowest[["Highest_Value"]] <- GDP_per_Capita_Max_Min$Highest_GDP_per_capita
GDP_per_Capita_Highest_Lowest[["Lowest_Value"]] <- GDP_per_Capita_Max_Min$Lowest_GDP_per_capita
GDP_per_Capita_Highest_Lowest[["Average Value"]] <- GDP_per_Capita_Max_Min$Average_GDP_per_capita

kable(as.data.frame(GDP_per_Capita_Highest_Lowest))
Richest_Country Poorest_Country Highest_Value Lowest_Value Average.Value
Monaco Democratic Republic of Congo 203651.5 379.5503 16245.79

Analysis on the Economic Development Status of the countries - Proportion of Developed/Developing and Least Developed Countries within Continents

Continents_by_Economic_Status <- sqlQuery(Local, "select Continents.continent, Economy_Status, round((cast(count(*) as float))/cast(Continent_Totals as float)*100,2) as Proportion  from Continents 
inner join Countries on [Country Name]=Country_Name
                inner join Countries_Metrics on Countries.Country_Code=Countries_Metrics.Country_Code
                inner join (select continent,count(*) as Continent_Totals from Continents group by continent) sub on sub.continent=Continents.continent
                group by Continents.continent, Economy_Status, Continent_Totals
                order by continent")

kable(Continents_by_Economic_Status)
continent Economy_Status Proportion
Africa Developed 9.09
Africa Developing 23.64
Africa Least Developed Country (LDC) 67.27
America Developed 36.11
America Developing 61.11
America Least Developed Country (LDC) 2.78
Asia Developed 31.25
Asia Developing 47.92
Asia Least Developed Country (LDC) 20.83
Europe Developed 79.55
Europe Developing 20.45
Oceania Developed 14.29
Oceania Developing 85.71
ggplot(Continents_by_Economic_Status, aes(x = continent, y = Proportion,fill=Economy_Status)) +
geom_bar(stat='identity')

Inferential Data Analysis

Does Health Expenditure and Military Expenditure affect the Life Expectancy Rate?

lm_health_life_expectancy <- lm(Life_Expectancy~Health_Expenditure_per_GDP + Military_Expenditure_per_GDP, data = Countries_Metrics)
summary(lm_health_life_expectancy)
## 
## Call:
## lm(formula = Life_Expectancy ~ Health_Expenditure_per_GDP + Military_Expenditure_per_GDP, 
##     data = Countries_Metrics)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.269  -3.148   3.302   6.578  13.880 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   67.7977     2.4735  27.409   <2e-16 ***
## Health_Expenditure_per_GDP     0.3881     0.2694   1.441    0.152    
## Military_Expenditure_per_GDP   0.2606     0.5757   0.453    0.652    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.207 on 134 degrees of freedom
##   (60 observations deleted due to missingness)
## Multiple R-squared:  0.01555,    Adjusted R-squared:  0.0008615 
## F-statistic: 1.059 on 2 and 134 DF,  p-value: 0.3498

The linear regression results tell us that neither Health Expenditure nor Military Expenditure affect the Life Expectancy rate significantly. The below plot tells us the same story.

#### Does Literacy Rate affect Technology Penetration Rate??

lm_literacy_tpr <- lm(Technology_Penetration_rate~Literacy_Rate, data=Countries_Metrics)
summary(lm_literacy_tpr)
## 
## Call:
## lm(formula = Technology_Penetration_rate ~ Literacy_Rate, data = Countries_Metrics)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.42471 -0.15224 -0.03571  0.16559  0.52310 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.4887271  0.0760268  -6.428 1.06e-09 ***
## Literacy_Rate  0.0093271  0.0008741  10.671  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2169 on 186 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.3797, Adjusted R-squared:  0.3764 
## F-statistic: 113.9 on 1 and 186 DF,  p-value: < 2.2e-16

The linear regression results tell us that literacy rate does affect the technology penetration rate. However, the coefficient is extremely small. This tells us that there might be other factors affecting the Technology penetration rate that we might be missing.

Below is a plot of how the Technology Penetration Rate and the Literacy Rate affect the GDP per Capita-

#### Do geographical conditions like the coastline/land boundaries affect the GDP_per_Capita?

geography_analysis_data <- sqlQuery(Local, "select Countries_Metrics.Country_Code,Land_Boundaries, Coastline, GDP_per_Capita from Countries_Metrics,Countries_Geography
                                     where Countries_Metrics.Country_Code=Countries_Geography.Country_Code")
lm_geography_gdp <- lm(GDP_per_Capita ~ Land_Boundaries+Coastline, data=geography_analysis_data)
summary(lm_geography_gdp)
## 
## Call:
## lm(formula = GDP_per_Capita ~ Land_Boundaries + Coastline, data = geography_analysis_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -23136 -12090  -6052   5912 184808 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.885e+04  1.896e+03   9.942  < 2e-16 ***
## Land_Boundaries -1.278e+00  4.340e-01  -2.946  0.00361 ** 
## Coastline        2.344e-01  9.533e-02   2.459  0.01482 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20870 on 194 degrees of freedom
## Multiple R-squared:  0.0599, Adjusted R-squared:  0.0502 
## F-statistic:  6.18 on 2 and 194 DF,  p-value: 0.002501

The linear regression results find these factors significant, however the coefficients are terribly small to have a significant effect.

Let’s take a look at the Correlation Matrix between some key variables affecting the GDP per Capita of a country.

What are the factors that affect the GDP_per_Capita of a country significantly?

We will try to answer this question by fitting a multiple linear regression on the attributes and a regression decision tree and compare the factors from both the techniques.

Multiple Linear Regression

lm.fit <- lm(GDP_per_Capita ~. -Country_Code -Population -GDP -Labor_Force -Internet_Users -Economy_Status, data=Countries_Metrics)
summary(lm.fit)
## 
## Call:
## lm(formula = GDP_per_Capita ~ . - Country_Code - Population - 
##     GDP - Labor_Force - Internet_Users - Economy_Status, data = Countries_Metrics)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14435.9  -4006.1    -50.9   2298.3  27017.7 
## 
## Coefficients:
##                                          Estimate Std. Error t value
## (Intercept)                            -1.256e+04  7.965e+03  -1.577
## Life_Expectancy                        -1.760e+01  1.197e+02  -0.147
## Literacy_Rate                          -3.201e+01  5.407e+01  -0.592
## Health_Expenditure_per_GDP              2.378e+02  2.249e+02   1.057
## Military_Expenditure_per_GDP            1.453e+03  4.917e+02   2.956
## Renewable_Water_Resources              -5.575e-01  6.613e-01  -0.843
## Net_Migration_Rate_per_1000_population  1.199e+02  7.855e+01   1.526
## Commercial_Prime_Lending_Rate          -2.809e+02  1.102e+02  -2.549
## Labor_force_participation_rate          4.180e+04  7.843e+03   5.329
## Technology_Penetration_rate             3.626e+04  4.101e+03   8.842
## Population_Density                      2.750e+00  8.884e-01   3.096
##                                        Pr(>|t|)    
## (Intercept)                             0.11764    
## Life_Expectancy                         0.88336    
## Literacy_Rate                           0.55503    
## Health_Expenditure_per_GDP              0.29265    
## Military_Expenditure_per_GDP            0.00381 ** 
## Renewable_Water_Resources               0.40104    
## Net_Migration_Rate_per_1000_population  0.12974    
## Commercial_Prime_Lending_Rate           0.01218 *  
## Labor_force_participation_rate         5.24e-07 ***
## Technology_Penetration_rate            1.64e-14 ***
## Population_Density                      0.00249 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6770 on 111 degrees of freedom
##   (75 observations deleted due to missingness)
## Multiple R-squared:  0.8151, Adjusted R-squared:  0.7985 
## F-statistic: 48.94 on 10 and 111 DF,  p-value: < 2.2e-16

The results show us that only the variables Military_Expenditure_per_GDP, Labor_force_participation_rate, Population_Density, Commercial_Prime_Lending_Rate and Technology_Penetration_rate are statistically significant. So we try to fit a linear regression model with only these parameters. By doing this, the R-squared increases from

lm.fit.significant <- lm(GDP_per_Capita ~ Military_Expenditure_per_GDP+Labor_force_participation_rate+Technology_Penetration_rate+Population_Density+Commercial_Prime_Lending_Rate, data=Countries_Metrics)
summary(lm.fit.significant)
## 
## Call:
## lm(formula = GDP_per_Capita ~ Military_Expenditure_per_GDP + 
##     Labor_force_participation_rate + Technology_Penetration_rate + 
##     Population_Density + Commercial_Prime_Lending_Rate, data = Countries_Metrics)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14641.1  -3510.5   -587.8   2718.1  28012.3 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -10596.776   3756.073  -2.821 0.005586 ** 
## Military_Expenditure_per_GDP     1324.081    471.889   2.806 0.005842 ** 
## Labor_force_participation_rate  34662.375   7171.069   4.834 3.95e-06 ***
## Technology_Penetration_rate     34991.865   2851.159  12.273  < 2e-16 ***
## Population_Density                  3.088      0.872   3.541 0.000564 ***
## Commercial_Prime_Lending_Rate    -316.769    106.601  -2.972 0.003570 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6947 on 122 degrees of freedom
##   (69 observations deleted due to missingness)
## Multiple R-squared:  0.7902, Adjusted R-squared:  0.7816 
## F-statistic: 91.88 on 5 and 122 DF,  p-value: < 2.2e-16
plot(lm.fit.significant)

We see that the R-squared has decreased only slightly even after excluding half of the variables. This is a small cost to pay for greater interpretability of the new model.

Regression Decision Tree

tree_model_gdp <- tree(GDP_per_Capita~. -Country_Code -Population -GDP -Labor_Force -Internet_Users -Economy_Status, data=Countries_Metrics)
summary(tree_model_gdp)
## 
## Regression tree:
## tree(formula = GDP_per_Capita ~ . - Country_Code - Population - 
##     GDP - Labor_Force - Internet_Users - Economy_Status, data = Countries_Metrics)
## Variables actually used in tree construction:
## [1] "Technology_Penetration_rate"           
## [2] "Literacy_Rate"                         
## [3] "Renewable_Water_Resources"             
## [4] "Net_Migration_Rate_per_1000_population"
## [5] "Life_Expectancy"                       
## Number of terminal nodes:  8 
## Residual mean deviance:  29300000 = 3.34e+09 / 114 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -15540.0  -2543.0   -646.5      0.0   2170.0  26670.0

The decision tree model has used the variables Technology_Penetration_rate, Literacy_Rate, Renewable_Water_Resources, Net_Migration_Rate_per_1000_population and Life_Expectancy. This differs from our Linear Regression Model quite a bit in the variables.

What are the factors that affect the Economic Development Status of a country?

We will again answer this question by fitting two different models - the Linear Discriminant Model and the Classification Decision tree Model.

Linear Discriminant Model -

lda.fit <- lda(Economy_Status~. -Country_Code -Population -GDP -Labor_Force -Internet_Users , data=Countries_Metrics)
lda.fit
## Call:
## lda(Economy_Status ~ . - Country_Code - Population - GDP - Labor_Force - 
##     Internet_Users, data = Countries_Metrics)
## 
## Prior probabilities of groups:
##                     Developed                    Developing 
##                     0.4016393                     0.3770492 
## Least Developed Country (LDC) 
##                     0.2213115 
## 
## Group means:
##                               Life_Expectancy Literacy_Rate
## Developed                            77.51143      97.24898
## Developing                           71.48891      88.07609
## Least Developed Country (LDC)        58.94889      60.95185
##                               Health_Expenditure_per_GDP
## Developed                                       7.926531
## Developing                                      6.219565
## Least Developed Country (LDC)                   7.492593
##                               Military_Expenditure_per_GDP
## Developed                                         1.926327
## Developing                                        1.939565
## Least Developed Country (LDC)                     1.768889
##                               Renewable_Water_Resources
## Developed                                      348.2045
## Developing                                     575.4443
## Least Developed Country (LDC)                  206.8944
##                               Net_Migration_Rate_per_1000_population
## Developed                                                  1.9793878
## Developing                                                 0.5406522
## Least Developed Country (LDC)                             -0.8496296
##                               Commercial_Prime_Lending_Rate GDP_per_Capita
## Developed                                          5.806735      31000.451
## Developing                                        12.865000       7586.505
## Least Developed Country (LDC)                     15.888889       1631.816
##                               Labor_force_participation_rate
## Developed                                          0.4869111
## Developing                                         0.4198059
## Least Developed Country (LDC)                      0.3858228
##                               Technology_Penetration_rate
## Developed                                      0.59061088
## Developing                                     0.21598043
## Least Developed Country (LDC)                  0.05317401
##                               Population_Density
## Developed                               327.2058
## Developing                              107.5172
## Least Developed Country (LDC)           142.4166
## 
## Coefficients of linear discriminants:
##                                                  LD1           LD2
## Life_Expectancy                        -2.501811e-02 -1.025452e-01
## Literacy_Rate                          -5.446456e-02 -5.399292e-02
## Health_Expenditure_per_GDP              9.362533e-02  8.551849e-02
## Military_Expenditure_per_GDP            6.116670e-02  5.550107e-02
## Renewable_Water_Resources              -4.687629e-05 -2.550731e-04
## Net_Migration_Rate_per_1000_population  4.352336e-03  8.784090e-03
## Commercial_Prime_Lending_Rate           5.740350e-02  1.809983e-02
## GDP_per_Capita                         -9.208882e-05  5.577265e-05
## Labor_force_participation_rate          1.582394e+00  2.219273e+00
## Technology_Penetration_rate             2.043357e-02  3.075307e+00
## Population_Density                      3.097789e-04  2.648935e-05
## 
## Proportion of trace:
##    LD1    LD2 
## 0.8527 0.1473

Classification Tree Model

tree_model_economy_status <- tree(Economy_Status~. -Country_Code -Population -GDP_per_Capita -Labor_Force -Internet_Users , data=Countries_Metrics)
summary(tree_model_economy_status)
## 
## Classification tree:
## tree(formula = Economy_Status ~ . - Country_Code - Population - 
##     GDP_per_Capita - Labor_Force - Internet_Users, data = Countries_Metrics)
## Variables actually used in tree construction:
## [1] "Technology_Penetration_rate"           
## [2] "Life_Expectancy"                       
## [3] "Literacy_Rate"                         
## [4] "Population_Density"                    
## [5] "Net_Migration_Rate_per_1000_population"
## [6] "Commercial_Prime_Lending_Rate"         
## [7] "GDP"                                   
## Number of terminal nodes:  12 
## Residual mean deviance:  0.5317 = 58.48 / 110 
## Misclassification error rate: 0.1066 = 13 / 122

The Classification Tree model has deemed the below variables important to determine the Economy Status of a country. [1] “Technology_Penetration_rate” “Life_Expectancy”
[3] “Literacy_Rate” “Population_Density”
[5] “Net_Migration_Rate_per_1000_population” “Commercial_Prime_Lending_Rate”
[7] “GDP”

Challenges Faced-

As with any Data Analysis project, this one was also not without its fair share of challenges. I had to struggle with the below difficulties during the course of the project - 1. Finding the ideal data-set! 2. Cleaning and preparing the data. 3. Thinking about Normalization. 4. Translating a few thoughts and ideas into the right SQL queries. 5. Reducing the scope of the analysis to align with the purpose of the project.