About the Data
This data set about the Countries of the World presents some geographical and economical statistics about the 197 countries in today’s world. The data set is available at [Countries of the World] (http://www.cs.stonybrook.edu/~skiena/519/hw1/country-data.csv) This is a table of information about each country, with approximately 20 fields including: name, countrycode, type of government, longitude and latitude of capital city, population, life expectancy, GDP, area, literacy rate, and more. Along with this data, I also sourced additional information on the list of countries in each continent, to perfom a continent-level analysis. This table was sourced from [here] (http://www.sport-histoire.fr/en/Geography/By_continent.php) We will explore this data and uncover interesting observations about the success and fate of nations.
An Overview of the Data and the variables
countries_data <- read.csv("http://www3.cs.stonybrook.edu/~skiena/519/hw1/country-data.csv")
kable(names(countries_data))
Country.Name |
Country.Code |
Government.Type |
Capital.City |
Date.of.Founding.Independence |
Latitude.of.Capital |
Longitude.of.Capital |
Population |
Life.Expectancy |
GDP..PPP..in.US.. |
Area..sq.km. |
Land.Boundaries..km. |
Coastline..km. |
Literacy.Rate…. |
Health.Expenditure.GDP |
Military.Expenditure.GDP |
Renewable.Water.Resources..cu.km. |
Net.Migration.Rate.1000.Population |
Labor.Force |
Internet.Users..circa.2009. |
Commercial.Prime.Lending.Rate |
kable(str(countries_data))
## 'data.frame': 197 obs. of 21 variables:
## $ Country.Name : Factor w/ 197 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Country.Code : Factor w/ 196 levels "AD","AE","AF",..: 3 5 49 1 7 4 8 6 10 9 ...
## $ Government.Type : Factor w/ 52 levels "a parliamentary democracy, a federation, and a constitutional monarchy",..: 28 37 43 37 43 10 43 43 19 21 ...
## $ Capital.City : Factor w/ 197 levels "Abu Dhabi","Abuja",..: 74 179 5 8 94 156 41 196 44 187 ...
## $ Date.of.Founding.Independence : Factor w/ 187 levels "1/1/1804","1/1/1901",..: 148 35 129 53 30 28 134 173 2 31 ...
## $ Latitude.of.Capital : Factor w/ 194 levels "0 13 S","0 19 N",..: 96 126 104 131 187 44 97 122 100 149 ...
## $ Longitude.of.Capital : Factor w/ 196 levels "0 05 W","0 13 W",..: 158 67 91 4 30 153 142 120 40 49 ...
## $ Population : int 31822848 3020209 38813722 85458 19088106 91295 43024374 3060631 22507617 8223062 ...
## $ Life.Expectancy : num 50.5 78 76.4 82.7 55.3 ...
## $ GDP..PPP..in.US.. : num 4.53e+10 2.83e+10 2.85e+11 3.16e+09 1.32e+11 ...
## $ Area..sq.km. : num 652230 28748 2381741 468 1246700 ...
## $ Land.Boundaries..km. : num 5987 691 6734 118 5369 ...
## $ Coastline..km. : num 0 362 998 0 1600 ...
## $ Literacy.Rate.... : Factor w/ 125 levels "100","27","28.1",..: 3 104 45 1 40 119 110 122 119 111 ...
## $ Health.Expenditure.GDP : Factor w/ 87 levels "1.9","10","10.1",..: 84 55 34 62 30 51 71 37 79 6 ...
## $ Military.Expenditure.GDP : Factor w/ 113 levels "0","0.13","0.19",..: 108 55 105 113 98 113 24 100 63 19 ...
## $ Renewable.Water.Resources..cu.km. : Factor w/ 166 levels "0.02","0.03",..: 141 113 23 166 45 3 155 146 124 149 ...
## $ Net.Migration.Rate.1000.Population: Factor w/ 154 levels "-0.02","-0.04",..: 44 68 30 93 101 127 93 79 146 117 ...
## $ Labor.Force : Factor w/ 195 levels "1022000","1037000",..: 169 6 7 110 187 94 43 23 16 113 ...
## $ Internet.Users..circa.2009. : Factor w/ 181 levels "10000","1000000",..: 2 17 126 146 136 143 21 57 36 139 ...
## $ Commercial.Prime.Lending.Rate : Factor w/ 116 levels "0.25","1.5","10",..: 31 113 102 116 31 6 36 37 90 46 ...
kable(head(countries_data))
Country.Name | Country.Code | Government.Type | Capital.City | Date.of.Founding.Independence | Latitude.of.Capital | Longitude.of.Capital | Population | Life.Expectancy | GDP..PPP..in.US.. | Area..sq.km. | Land.Boundaries..km. | Coastline..km. | Literacy.Rate…. | Health.Expenditure.GDP | Military.Expenditure.GDP | Renewable.Water.Resources..cu.km. | Net.Migration.Rate.1000.Population | Labor.Force | Internet.Users..circa.2009. | Commercial.Prime.Lending.Rate |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Afghanistan | AF | Islamic republic | Kabul | 8/19/1919 | 34 31 N | 69 11 E | 31822848 | 50.49 | 4.530e+10 | 652230.0 | 5987 | 0 | 28.1 | 9.6 | 4.74 | 65.33 | -1.83 | 7512000 | 1000000 | 15 |
Albania | AL | parliamentary democracy | Tirana | 11/28/1912 | 41 19 N | 19 49 E | 3020209 | 77.96 | 2.834e+10 | 28748.0 | 691 | 362 | 96.8 | 6.3 | 1.47 | 41.7 | -3.31 | 1098000 | 1300000 | 9.52 |
Algeria | DZ | republic | Algiers | 7/5/1962 | 36 45 N | 3 03 E | 38813722 | 76.39 | 2.847e+11 | 2381741.0 | 6734 | 998 | 72.6 | 3.9 | 4.48 | 11.67 | -0.93 | 11150000 | 4700000 | 8 |
Andorra | AD | parliamentary democracy | Andorra la Vella | 1278 | 42 30 N | 1 31 E | 85458 | 82.65 | 3.163e+09 | 468.0 | 118 | 0 | 100 | 7.2 | unknown | unknown | 0 | 36060 | 67100 | unknown |
Angola | AO | republic | Luanda | 11/11/1975 | 8 50 S | 13 13 E | 19088106 | 55.29 | 1.318e+11 | 1246700.0 | 5369 | 1600 | 70.4 | 3.5 | 3.63 | 148 | 0.47 | 9018000 | 606700 | 15 |
Antigua and Barbuda | AG | constitutional monarchy | Saint John’s | 11/1/1981 | 17 07 N | 61 51 W | 91295 | 76.12 | 1.610e+09 | 442.6 | 0 | 153 | 99 | 5.9 | unknown | 0.05 | 2.23 | 30000 | 65000 | 10.3 |
Normalization of the Data
As we can see, the data is presented to us in one big chunk. Looking at the attributes of the data set, it becomes apparent that we can divide this big table into smaller, logical, modular pieces. I have tried to partition the parent table logically into 3 tables.
In summary, our data set is now composed of 4 tables as follows:
- Countries_Geography: This table stores information about the geographical attributes of countries.
- Countries_Metrics: This table stores information about economic metrics of the countries.
- Contries: This table stores general(miscellaneous) information about the Countries
- Continent: This is the Continent-Country mapping table.
A glimpse of our tables-
Local <- odbcConnect("R Connect" ,uid = "", pwd = "")
Countries_Metrics <- sqlQuery(Local, "select * from Countries_Metrics")
Countries <- sqlQuery(Local, "select * from Countries")
Continents <- sqlQuery(Local, "select * from Continents")
Countries_Geography <- sqlQuery(Local, "select * from Countries_Geography")
kable(names(Countries))
Country_Code |
Country_Name |
Government_Type |
Founding_Independence_Date |
Capital_City |
kable(str(Countries))
## 'data.frame': 197 obs. of 5 variables:
## $ Country_Code : Factor w/ 196 levels "AD","AE","AF",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Country_Name : Factor w/ 197 levels "Afghanistan",..: 4 186 1 6 2 8 5 7 10 9 ...
## $ Government_Type : Factor w/ 52 levels "a parliamentary democracy, a federation, and a constitutional monarchy",..: 37 24 28 10 37 43 43 43 21 19 ...
## $ Founding_Independence_Date: Date, format: "1278-01-01" "1971-12-02" ...
## $ Capital_City : Factor w/ 197 levels "Abu Dhabi","Abuja",..: 8 1 74 156 179 196 94 41 187 44 ...
kable(head(Countries))
Country_Code | Country_Name | Government_Type | Founding_Independence_Date | Capital_City |
---|---|---|---|---|
AD | Andorra | parliamentary democracy | 1278-01-01 | Andorra la Vella |
AE | United Arab Emirates | federation with specified powers delegated to the UAE federal government and other powers reserved to member emirates | 1971-12-02 | Abu Dhabi |
AF | Afghanistan | Islamic republic | 1919-08-19 | Kabul |
AG | Antigua and Barbuda | constitutional monarchy | 1981-11-01 | Saint John’s |
AL | Albania | parliamentary democracy | 1912-11-28 | Tirana |
AM | Armenia | republic | 1991-09-21 | Yerevan |
kable(names(Countries_Geography))
Country_Code |
Capital_Latitude |
Capital_Longitude |
Area |
Land_Boundaries |
Coastline |
kable(str(Countries_Geography))
## 'data.frame': 197 obs. of 6 variables:
## $ Country_Code : Factor w/ 196 levels "AD","AE","AF",..: 3 5 49 1 7 4 8 6 10 9 ...
## $ Capital_Latitude : Factor w/ 194 levels "0 13 S","0 19 N",..: 96 126 104 131 187 44 97 122 100 149 ...
## $ Capital_Longitude: Factor w/ 196 levels "0 05 W","0 13 W",..: 158 67 91 4 30 153 142 120 40 49 ...
## $ Area : num 652230 28748 2381741 468 1246700 ...
## $ Land_Boundaries : num 5987 691 6734 118 5369 ...
## $ Coastline : num 0 362 998 0 1600 ...
kable(head(Countries_Geography))
Country_Code | Capital_Latitude | Capital_Longitude | Area | Land_Boundaries | Coastline |
---|---|---|---|---|---|
AF | 34 31 N | 69 11 E | 652230.0 | 5987 | 0 |
AL | 41 19 N | 19 49 E | 28748.0 | 691 | 362 |
DZ | 36 45 N | 3 03 E | 2381741.0 | 6734 | 998 |
AD | 42 30 N | 1 31 E | 468.0 | 118 | 0 |
AO | 8 50 S | 13 13 E | 1246700.0 | 5369 | 1600 |
AG | 17 07 N | 61 51 W | 442.6 | 0 | 153 |
kable(names(Countries_Metrics))
Country_Code |
Population |
Life_Expectancy |
GDP |
Literacy_Rate |
Health_Expenditure_per_GDP |
Military_Expenditure_per_GDP |
Renewable_Water_Resources |
Net_Migration_Rate_per_1000_population |
Labor_Force |
Internet_Users |
Commercial_Prime_Lending_Rate |
GDP_per_Capita |
Economy_Status |
Labor_force_participation_rate |
Technology_Penetration_rate |
Population_Density |
kable(str(Countries_Metrics))
## 'data.frame': 197 obs. of 17 variables:
## $ Country_Code : Factor w/ 196 levels "AD","AE","AF",..: 3 5 49 1 7 4 8 6 10 9 ...
## $ Population : num 31822848 3020209 38813722 85458 19088106 ...
## $ Life_Expectancy : num 50.5 78 76.4 82.7 55.3 ...
## $ GDP : num 4.53e+10 2.83e+10 2.85e+11 3.16e+09 1.32e+11 ...
## $ Literacy_Rate : num 28.1 96.8 72.6 100 70.4 99 97.9 99.6 99 98 ...
## $ Health_Expenditure_per_GDP : num 9.6 6.3 3.9 7.2 3.5 5.9 8.1 4.3 9 10.6 ...
## $ Military_Expenditure_per_GDP : num 4.74 1.47 4.48 NA 3.63 NA 0.91 3.92 1.71 0.81 ...
## $ Renewable_Water_Resources : num 65.3 41.7 11.7 NA 148 ...
## $ Net_Migration_Rate_per_1000_population: num -1.83 -3.31 -0.93 0 0.47 2.23 0 -5.88 5.74 1.76 ...
## $ Labor_Force : num 7512000 1098000 11150000 36060 9018000 ...
## $ Internet_Users : num 1000000 1300000 4700000 67100 606700 ...
## $ Commercial_Prime_Lending_Rate : num 15 9.52 8 NA 15 10.3 16.4 16.5 6.2 2.2 ...
## $ GDP_per_Capita : num 1424 9383 7335 37012 6905 ...
## $ Economy_Status : Factor w/ 3 levels "Developed","Developing",..: 3 2 2 1 2 1 1 2 1 1 ...
## $ Labor_force_participation_rate : num 0.236 0.364 0.287 0.422 0.472 ...
## $ Technology_Penetration_rate : num 0.0314 0.4304 0.1211 0.7852 0.0318 ...
## $ Population_Density : num 48.8 105.1 16.3 182.6 15.3 ...
kable(head(Countries_Metrics))
Country_Code | Population | Life_Expectancy | GDP | Literacy_Rate | Health_Expenditure_per_GDP | Military_Expenditure_per_GDP | Renewable_Water_Resources | Net_Migration_Rate_per_1000_population | Labor_Force | Internet_Users | Commercial_Prime_Lending_Rate | GDP_per_Capita | Economy_Status | Labor_force_participation_rate | Technology_Penetration_rate | Population_Density |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AF | 31822848 | 50.49 | 4.530e+10 | 28.1 | 9.6 | 4.74 | 65.33 | -1.83 | 7512000 | 1000000 | 15.00 | 1423.506 | Least Developed Country (LDC) | 0.2360568 | 0.0314240 | 48.79084 |
AL | 3020209 | 77.96 | 2.834e+10 | 96.8 | 6.3 | 1.47 | 41.70 | -3.31 | 1098000 | 1300000 | 9.52 | 9383.457 | Developing | 0.3635510 | 0.4304338 | 105.05806 |
DZ | 38813722 | 76.39 | 2.847e+11 | 72.6 | 3.9 | 4.48 | 11.67 | -0.93 | 11150000 | 4700000 | 8.00 | 7335.035 | Developing | 0.2872695 | 0.1210912 | 16.29637 |
AD | 85458 | 82.65 | 3.163e+09 | 100.0 | 7.2 | NA | NA | 0.00 | 36060 | 67100 | NA | 37012.334 | Developed | 0.4219617 | 0.7851810 | 182.60256 |
AO | 19088106 | 55.29 | 1.318e+11 | 70.4 | 3.5 | 3.63 | 148.00 | 0.47 | 9018000 | 606700 | 15.00 | 6904.823 | Developing | 0.4724408 | 0.0317842 | 15.31091 |
AG | 91295 | 76.12 | 1.610e+09 | 99.0 | 5.9 | NA | 0.05 | 2.23 | 30000 | 65000 | 10.30 | 17635.139 | Developed | 0.3286051 | 0.7119777 | 206.26977 |
kable(names(Continents))
Country Name |
Continent |
Country_Code |
kable(str(Continents))
## 'data.frame': 197 obs. of 3 variables:
## $ Country Name: Factor w/ 197 levels "Afghanistan",..: 3 5 19 23 27 28 30 32 33 34 ...
## $ Continent : Factor w/ 5 levels "Africa","America",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Country_Code: Factor w/ 196 levels "AD","AE","AF",..: 49 7 20 26 16 19 36 41 31 169 ...
kable(head(Continents))
Country Name | Continent | Country_Code |
---|---|---|
Algeria | Africa | DZ |
Angola | Africa | AO |
Benin | Africa | BJ |
Botswana | Africa | BW |
Burkina Faso | Africa | BF |
Burundi | Africa | BI |
Problems in the Data Set
Below were the problems in the data set that had to be fixed- 1. The data indicated missing values as ‘unknown’. We used SQL to set these values to NULL Example Code - Update Countries_Data Set [Government#Type]=Nullif([Government#Type],‘unknown’) Update Countries_Data Set[Capital#City]=Nullif([Capital#City],‘unknown’) Update Countries_Data Set[Date#of#Founding#Independence]=Nullif([Date#of#Founding#Independence],‘unknown’);
The Country names in the Continents table and the Country names in the Countries table are not consistent in some cases. We standardized these in SQL. Example Code - update Countries set Country_Name = ‘Ivory Coast’ where Country_Name like ‘%Ivoire%’
- I also added a couple of calculated fields in our Countries_Metrics table which would come in handy during our analysis. The below columns were added-
Population Density: This was calculated from the Population field in Countries_Metrics and the Area field in Countries_Geography as follows-
sqlQuery(Local, “Alter table Countries_Metrics add Population_Density float”) sqlQuery(Local, “update Countries_Metrics set Population_Density = (Population)/(Area) from Countries_Metrics, Countries_Geography where Countries_Metrics.Country_Code=Countries_Geography.Country_Code”)
Technology Penetration Rate: This was calculated from the Internet_Users and Population fields in Countries_Metrics table as follows-
sqlQuery(Local, “Alter table Countries_Metrics add Technology_Penetration_rate float”) sqlQuery(Local, “update Countries_Metrics set Technology_Penetration_rate = (Internet_Users)/(Population)”)
Labor Force Participation Rate: This was calculated from the Labor Force and Population fields in Countries_Metrics table as follows-
sqlQuery(Local, “Alter table Countries_Metrics add Labor_force_participation_rate float”) sqlQuery(Local, “update Countries_Metrics set Labor_force_participation_rate = (Labor_Force)/(Population)”)
Exploratory Data Analysis
Analysis on Literacy rates across different Continents and Countries
Literacy_Rates_Continents <- sqlQuery(Local,"select continent, max(Literacy_Rate) as Highest_Literacy_Rate, min(Literacy_Rate) as Lowest_Literacy_Rate, Avg(Literacy_Rate) as Average_Literacy_Rate
from Countries_Metrics, Continents where Countries_Metrics.Country_Code=Continents.Country_Code
group by continent order by continent")
Highest_literacy_Rate_Country <- sqlQuery(Local, "select Country_Name from Countries,Countries_Metrics
where Literacy_Rate in (select max(Literacy_Rate) from Countries_Metrics)
and Countries.Country_Code=Countries_Metrics.Country_Code")
Lowest_literacy_Rate_Country <- sqlQuery(Local, "select Country_Name from Countries,Countries_Metrics
where Literacy_Rate in (select min(Literacy_Rate) from Countries_Metrics)
and Countries.Country_Code=Countries_Metrics.Country_Code")
Literacy_Rates_Max_Min <- sqlQuery(Local,"select max(Literacy_Rate) as Highest_Literacy_Rate, min(Literacy_Rate) as Lowest_Literacy_Rate, Avg(Literacy_Rate) as Average_Literacy_Rate
from Countries_Metrics")
Literacy_Highest_lowest <- list()
Literacy_Highest_lowest[[ "Highest_Country" ]] <- Highest_literacy_Rate_Country$Country_Name
Literacy_Highest_lowest[[ "Lowest_Country" ]] <- Lowest_literacy_Rate_Country$Country_Name
Literacy_Highest_lowest[["Highest_Value"]] <- Literacy_Rates_Max_Min$Highest_Literacy_Rate
Literacy_Highest_lowest[["Lowest_Value"]] <- Literacy_Rates_Max_Min$Lowest_Literacy_Rate
Literacy_Highest_lowest[["Average Value"]] <- Literacy_Rates_Max_Min$Average_Literacy_Rate
kable(Literacy_Rates_Continents)
continent | Highest_Literacy_Rate | Lowest_Literacy_Rate | Average_Literacy_Rate |
---|---|---|---|
Africa | 94.2 | 27.0 | 66.05926 |
America | 100.0 | 48.7 | 91.71667 |
Asia | 100.0 | 28.1 | 87.07917 |
Europe | 100.0 | 91.9 | 98.51591 |
Oceania | 99.0 | 62.4 | 90.35455 |
Literacy_Highest_lowest
## $Highest_Country
## [1] Andorra Finland Greenland Korea, North Liechtenstein
## [6] Luxembourg Norway
## 7 Levels: Andorra Finland Greenland Korea, North ... Norway
##
## $Lowest_Country
## [1] South Sudan
## Levels: South Sudan
##
## $Highest_Value
## [1] 100
##
## $Lowest_Value
## [1] 27
##
## $`Average Value`
## [1] 84.85699
Analysis on Life Expectancy rates across different Continents and Countries
Life_Expectancy_Rates_Continents <- sqlQuery(Local,"select continent, max(Life_Expectancy) as Highest_Life_Expectancy, min(Life_Expectancy) as Lowest_Life_Expectancy, Avg(Life_Expectancy) as Average_Life_Expectancy
from Countries_Metrics, Continents where Countries_Metrics.Country_Code=Continents.Country_Code
group by continent order by continent")
Highest_Life_Expectancy_Rate_Country <- sqlQuery(Local, "select Country_Name from Countries,Countries_Metrics
where Life_Expectancy in (select max(Life_Expectancy) from Countries_Metrics)
and Countries.Country_Code=Countries_Metrics.Country_Code")
Lowest_Life_Expectancy_Rate_Country <- sqlQuery(Local, "select Country_Name from Countries,Countries_Metrics
where Life_Expectancy in (select min(Life_Expectancy) from Countries_Metrics)
and Countries.Country_Code=Countries_Metrics.Country_Code
")
Life_Expectancy_Rates_Max_Min <- sqlQuery(Local,"select max(Life_Expectancy) as Highest_Life_Expectancy, min(Life_Expectancy) as Lowest_Life_Expectancy, Avg(Life_Expectancy) as Average_Life_Expectancy
from Countries_Metrics")
Life_Expectancy_Highest_lowest <- list()
Life_Expectancy_Highest_lowest[[ "Highest_Country" ]] <- Highest_Life_Expectancy_Rate_Country$Country_Name
Life_Expectancy_Highest_lowest[[ "Lowest_Country" ]] <- Lowest_Life_Expectancy_Rate_Country$Country_Name
Life_Expectancy_Highest_lowest[["Highest_Value"]] <- Life_Expectancy_Rates_Max_Min$Highest_Life_Expectancy
Life_Expectancy_Highest_lowest[["Lowest_Value"]] <- Life_Expectancy_Rates_Max_Min$Lowest_Life_Expectancy
Life_Expectancy_Highest_lowest[["Average Value"]] <- Life_Expectancy_Rates_Max_Min$Average_Life_Expectancy
kable(Life_Expectancy_Rates_Continents)
continent | Highest_Life_Expectancy | Lowest_Life_Expectancy | Average_Life_Expectancy |
---|---|---|---|
Africa | 76.51 | 49.44 | 60.18000 |
America | 81.67 | 63.18 | 74.42028 |
Asia | 84.46 | 50.49 | 72.55958 |
Europe | 89.57 | 69.14 | 78.08795 |
Oceania | 82.07 | 65.47 | 72.41786 |
kable(as.data.frame(Life_Expectancy_Highest_lowest))
Highest_Country | Lowest_Country | Highest_Value | Lowest_Value | Average.Value |
---|---|---|---|---|
Monaco | Chad | 89.57 | 49.44 | 70.66807 |
Analysis on Government Types in Countries
Type_of_Government <- sqlQuery(Local,"select Government_Type, count(*) as Total_Number_of_Countries
from Countries group by Government_Type order by Government_Type order by Total_number_of_Countries")
kable(Type_of_Government)
42000 156 [Microsoft][ODBC Driver 11 for SQL Server][SQL Server]Incorrect syntax near the keyword ‘order’. |
[RODBC] ERROR: Could not SQLExecDirect ’select Government_Type, count(*) as Total_Number_of_Countries |
from Countries group by Government_Type order by Government_Type order by Total_number_of_Countries’ |
Analysis on the GDP per Capita (World’s Richest and Poorest Countries)
Richest_Country <- sqlQuery(Local,"select Country_Name from Countries,Countries_Metrics
where GDP_per_Capita in (select max(GDP_per_Capita) from Countries_Metrics )
and Countries.Country_Code=Countries_Metrics.Country_Code")
Poorest_Country <- sqlQuery(Local,"select Country_Name from Countries,Countries_Metrics
where GDP_per_Capita in (select min(GDP_per_Capita) from Countries_Metrics )
and Countries.Country_Code=Countries_Metrics.Country_Code")
GDP_per_Capita_Max_Min <- sqlQuery(Local,"select max(GDP_per_Capita) as Highest_GDP_per_capita, min(GDP_per_capita) as Lowest_GDP_per_capita, Avg(GDP_per_capita) as Average_GDP_per_capita
from Countries_Metrics")
GDP_per_Capita_Highest_Lowest <- list()
GDP_per_Capita_Highest_Lowest[[ "Richest_Country" ]] <- Richest_Country$Country_Name
GDP_per_Capita_Highest_Lowest[[ "Poorest_Country" ]] <- Poorest_Country$Country_Name
GDP_per_Capita_Highest_Lowest[["Highest_Value"]] <- GDP_per_Capita_Max_Min$Highest_GDP_per_capita
GDP_per_Capita_Highest_Lowest[["Lowest_Value"]] <- GDP_per_Capita_Max_Min$Lowest_GDP_per_capita
GDP_per_Capita_Highest_Lowest[["Average Value"]] <- GDP_per_Capita_Max_Min$Average_GDP_per_capita
kable(as.data.frame(GDP_per_Capita_Highest_Lowest))
Richest_Country | Poorest_Country | Highest_Value | Lowest_Value | Average.Value |
---|---|---|---|---|
Monaco | Democratic Republic of Congo | 203651.5 | 379.5503 | 16245.79 |
Analysis on the Economic Development Status of the countries - Proportion of Developed/Developing and Least Developed Countries within Continents
Continents_by_Economic_Status <- sqlQuery(Local, "select Continents.continent, Economy_Status, round((cast(count(*) as float))/cast(Continent_Totals as float)*100,2) as Proportion from Continents
inner join Countries on [Country Name]=Country_Name
inner join Countries_Metrics on Countries.Country_Code=Countries_Metrics.Country_Code
inner join (select continent,count(*) as Continent_Totals from Continents group by continent) sub on sub.continent=Continents.continent
group by Continents.continent, Economy_Status, Continent_Totals
order by continent")
kable(Continents_by_Economic_Status)
continent | Economy_Status | Proportion |
---|---|---|
Africa | Developed | 9.09 |
Africa | Developing | 23.64 |
Africa | Least Developed Country (LDC) | 67.27 |
America | Developed | 36.11 |
America | Developing | 61.11 |
America | Least Developed Country (LDC) | 2.78 |
Asia | Developed | 31.25 |
Asia | Developing | 47.92 |
Asia | Least Developed Country (LDC) | 20.83 |
Europe | Developed | 79.55 |
Europe | Developing | 20.45 |
Oceania | Developed | 14.29 |
Oceania | Developing | 85.71 |
ggplot(Continents_by_Economic_Status, aes(x = continent, y = Proportion,fill=Economy_Status)) +
geom_bar(stat='identity')
Inferential Data Analysis
Does Health Expenditure and Military Expenditure affect the Life Expectancy Rate?
lm_health_life_expectancy <- lm(Life_Expectancy~Health_Expenditure_per_GDP + Military_Expenditure_per_GDP, data = Countries_Metrics)
summary(lm_health_life_expectancy)
##
## Call:
## lm(formula = Life_Expectancy ~ Health_Expenditure_per_GDP + Military_Expenditure_per_GDP,
## data = Countries_Metrics)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.269 -3.148 3.302 6.578 13.880
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 67.7977 2.4735 27.409 <2e-16 ***
## Health_Expenditure_per_GDP 0.3881 0.2694 1.441 0.152
## Military_Expenditure_per_GDP 0.2606 0.5757 0.453 0.652
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.207 on 134 degrees of freedom
## (60 observations deleted due to missingness)
## Multiple R-squared: 0.01555, Adjusted R-squared: 0.0008615
## F-statistic: 1.059 on 2 and 134 DF, p-value: 0.3498
The linear regression results tell us that neither Health Expenditure nor Military Expenditure affect the Life Expectancy rate significantly. The below plot tells us the same story.
#### Does Literacy Rate affect Technology Penetration Rate??
lm_literacy_tpr <- lm(Technology_Penetration_rate~Literacy_Rate, data=Countries_Metrics)
summary(lm_literacy_tpr)
##
## Call:
## lm(formula = Technology_Penetration_rate ~ Literacy_Rate, data = Countries_Metrics)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42471 -0.15224 -0.03571 0.16559 0.52310
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.4887271 0.0760268 -6.428 1.06e-09 ***
## Literacy_Rate 0.0093271 0.0008741 10.671 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2169 on 186 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.3797, Adjusted R-squared: 0.3764
## F-statistic: 113.9 on 1 and 186 DF, p-value: < 2.2e-16
The linear regression results tell us that literacy rate does affect the technology penetration rate. However, the coefficient is extremely small. This tells us that there might be other factors affecting the Technology penetration rate that we might be missing.
Below is a plot of how the Technology Penetration Rate and the Literacy Rate affect the GDP per Capita-
#### Do geographical conditions like the coastline/land boundaries affect the GDP_per_Capita?
geography_analysis_data <- sqlQuery(Local, "select Countries_Metrics.Country_Code,Land_Boundaries, Coastline, GDP_per_Capita from Countries_Metrics,Countries_Geography
where Countries_Metrics.Country_Code=Countries_Geography.Country_Code")
lm_geography_gdp <- lm(GDP_per_Capita ~ Land_Boundaries+Coastline, data=geography_analysis_data)
summary(lm_geography_gdp)
##
## Call:
## lm(formula = GDP_per_Capita ~ Land_Boundaries + Coastline, data = geography_analysis_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23136 -12090 -6052 5912 184808
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.885e+04 1.896e+03 9.942 < 2e-16 ***
## Land_Boundaries -1.278e+00 4.340e-01 -2.946 0.00361 **
## Coastline 2.344e-01 9.533e-02 2.459 0.01482 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20870 on 194 degrees of freedom
## Multiple R-squared: 0.0599, Adjusted R-squared: 0.0502
## F-statistic: 6.18 on 2 and 194 DF, p-value: 0.002501
The linear regression results find these factors significant, however the coefficients are terribly small to have a significant effect.
Let’s take a look at the Correlation Matrix between some key variables affecting the GDP per Capita of a country.
What are the factors that affect the GDP_per_Capita of a country significantly?
We will try to answer this question by fitting a multiple linear regression on the attributes and a regression decision tree and compare the factors from both the techniques.
Multiple Linear Regression
lm.fit <- lm(GDP_per_Capita ~. -Country_Code -Population -GDP -Labor_Force -Internet_Users -Economy_Status, data=Countries_Metrics)
summary(lm.fit)
##
## Call:
## lm(formula = GDP_per_Capita ~ . - Country_Code - Population -
## GDP - Labor_Force - Internet_Users - Economy_Status, data = Countries_Metrics)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14435.9 -4006.1 -50.9 2298.3 27017.7
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -1.256e+04 7.965e+03 -1.577
## Life_Expectancy -1.760e+01 1.197e+02 -0.147
## Literacy_Rate -3.201e+01 5.407e+01 -0.592
## Health_Expenditure_per_GDP 2.378e+02 2.249e+02 1.057
## Military_Expenditure_per_GDP 1.453e+03 4.917e+02 2.956
## Renewable_Water_Resources -5.575e-01 6.613e-01 -0.843
## Net_Migration_Rate_per_1000_population 1.199e+02 7.855e+01 1.526
## Commercial_Prime_Lending_Rate -2.809e+02 1.102e+02 -2.549
## Labor_force_participation_rate 4.180e+04 7.843e+03 5.329
## Technology_Penetration_rate 3.626e+04 4.101e+03 8.842
## Population_Density 2.750e+00 8.884e-01 3.096
## Pr(>|t|)
## (Intercept) 0.11764
## Life_Expectancy 0.88336
## Literacy_Rate 0.55503
## Health_Expenditure_per_GDP 0.29265
## Military_Expenditure_per_GDP 0.00381 **
## Renewable_Water_Resources 0.40104
## Net_Migration_Rate_per_1000_population 0.12974
## Commercial_Prime_Lending_Rate 0.01218 *
## Labor_force_participation_rate 5.24e-07 ***
## Technology_Penetration_rate 1.64e-14 ***
## Population_Density 0.00249 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6770 on 111 degrees of freedom
## (75 observations deleted due to missingness)
## Multiple R-squared: 0.8151, Adjusted R-squared: 0.7985
## F-statistic: 48.94 on 10 and 111 DF, p-value: < 2.2e-16
The results show us that only the variables Military_Expenditure_per_GDP, Labor_force_participation_rate, Population_Density, Commercial_Prime_Lending_Rate and Technology_Penetration_rate are statistically significant. So we try to fit a linear regression model with only these parameters. By doing this, the R-squared increases from
lm.fit.significant <- lm(GDP_per_Capita ~ Military_Expenditure_per_GDP+Labor_force_participation_rate+Technology_Penetration_rate+Population_Density+Commercial_Prime_Lending_Rate, data=Countries_Metrics)
summary(lm.fit.significant)
##
## Call:
## lm(formula = GDP_per_Capita ~ Military_Expenditure_per_GDP +
## Labor_force_participation_rate + Technology_Penetration_rate +
## Population_Density + Commercial_Prime_Lending_Rate, data = Countries_Metrics)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14641.1 -3510.5 -587.8 2718.1 28012.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10596.776 3756.073 -2.821 0.005586 **
## Military_Expenditure_per_GDP 1324.081 471.889 2.806 0.005842 **
## Labor_force_participation_rate 34662.375 7171.069 4.834 3.95e-06 ***
## Technology_Penetration_rate 34991.865 2851.159 12.273 < 2e-16 ***
## Population_Density 3.088 0.872 3.541 0.000564 ***
## Commercial_Prime_Lending_Rate -316.769 106.601 -2.972 0.003570 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6947 on 122 degrees of freedom
## (69 observations deleted due to missingness)
## Multiple R-squared: 0.7902, Adjusted R-squared: 0.7816
## F-statistic: 91.88 on 5 and 122 DF, p-value: < 2.2e-16
plot(lm.fit.significant)
We see that the R-squared has decreased only slightly even after excluding half of the variables. This is a small cost to pay for greater interpretability of the new model.
Regression Decision Tree
tree_model_gdp <- tree(GDP_per_Capita~. -Country_Code -Population -GDP -Labor_Force -Internet_Users -Economy_Status, data=Countries_Metrics)
summary(tree_model_gdp)
##
## Regression tree:
## tree(formula = GDP_per_Capita ~ . - Country_Code - Population -
## GDP - Labor_Force - Internet_Users - Economy_Status, data = Countries_Metrics)
## Variables actually used in tree construction:
## [1] "Technology_Penetration_rate"
## [2] "Literacy_Rate"
## [3] "Renewable_Water_Resources"
## [4] "Net_Migration_Rate_per_1000_population"
## [5] "Life_Expectancy"
## Number of terminal nodes: 8
## Residual mean deviance: 29300000 = 3.34e+09 / 114
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -15540.0 -2543.0 -646.5 0.0 2170.0 26670.0
The decision tree model has used the variables Technology_Penetration_rate, Literacy_Rate, Renewable_Water_Resources, Net_Migration_Rate_per_1000_population and Life_Expectancy. This differs from our Linear Regression Model quite a bit in the variables.
What are the factors that affect the Economic Development Status of a country?
We will again answer this question by fitting two different models - the Linear Discriminant Model and the Classification Decision tree Model.
Linear Discriminant Model -
lda.fit <- lda(Economy_Status~. -Country_Code -Population -GDP -Labor_Force -Internet_Users , data=Countries_Metrics)
lda.fit
## Call:
## lda(Economy_Status ~ . - Country_Code - Population - GDP - Labor_Force -
## Internet_Users, data = Countries_Metrics)
##
## Prior probabilities of groups:
## Developed Developing
## 0.4016393 0.3770492
## Least Developed Country (LDC)
## 0.2213115
##
## Group means:
## Life_Expectancy Literacy_Rate
## Developed 77.51143 97.24898
## Developing 71.48891 88.07609
## Least Developed Country (LDC) 58.94889 60.95185
## Health_Expenditure_per_GDP
## Developed 7.926531
## Developing 6.219565
## Least Developed Country (LDC) 7.492593
## Military_Expenditure_per_GDP
## Developed 1.926327
## Developing 1.939565
## Least Developed Country (LDC) 1.768889
## Renewable_Water_Resources
## Developed 348.2045
## Developing 575.4443
## Least Developed Country (LDC) 206.8944
## Net_Migration_Rate_per_1000_population
## Developed 1.9793878
## Developing 0.5406522
## Least Developed Country (LDC) -0.8496296
## Commercial_Prime_Lending_Rate GDP_per_Capita
## Developed 5.806735 31000.451
## Developing 12.865000 7586.505
## Least Developed Country (LDC) 15.888889 1631.816
## Labor_force_participation_rate
## Developed 0.4869111
## Developing 0.4198059
## Least Developed Country (LDC) 0.3858228
## Technology_Penetration_rate
## Developed 0.59061088
## Developing 0.21598043
## Least Developed Country (LDC) 0.05317401
## Population_Density
## Developed 327.2058
## Developing 107.5172
## Least Developed Country (LDC) 142.4166
##
## Coefficients of linear discriminants:
## LD1 LD2
## Life_Expectancy -2.501811e-02 -1.025452e-01
## Literacy_Rate -5.446456e-02 -5.399292e-02
## Health_Expenditure_per_GDP 9.362533e-02 8.551849e-02
## Military_Expenditure_per_GDP 6.116670e-02 5.550107e-02
## Renewable_Water_Resources -4.687629e-05 -2.550731e-04
## Net_Migration_Rate_per_1000_population 4.352336e-03 8.784090e-03
## Commercial_Prime_Lending_Rate 5.740350e-02 1.809983e-02
## GDP_per_Capita -9.208882e-05 5.577265e-05
## Labor_force_participation_rate 1.582394e+00 2.219273e+00
## Technology_Penetration_rate 2.043357e-02 3.075307e+00
## Population_Density 3.097789e-04 2.648935e-05
##
## Proportion of trace:
## LD1 LD2
## 0.8527 0.1473
Classification Tree Model
tree_model_economy_status <- tree(Economy_Status~. -Country_Code -Population -GDP_per_Capita -Labor_Force -Internet_Users , data=Countries_Metrics)
summary(tree_model_economy_status)
##
## Classification tree:
## tree(formula = Economy_Status ~ . - Country_Code - Population -
## GDP_per_Capita - Labor_Force - Internet_Users, data = Countries_Metrics)
## Variables actually used in tree construction:
## [1] "Technology_Penetration_rate"
## [2] "Life_Expectancy"
## [3] "Literacy_Rate"
## [4] "Population_Density"
## [5] "Net_Migration_Rate_per_1000_population"
## [6] "Commercial_Prime_Lending_Rate"
## [7] "GDP"
## Number of terminal nodes: 12
## Residual mean deviance: 0.5317 = 58.48 / 110
## Misclassification error rate: 0.1066 = 13 / 122
The Classification Tree model has deemed the below variables important to determine the Economy Status of a country. [1] “Technology_Penetration_rate” “Life_Expectancy”
[3] “Literacy_Rate” “Population_Density”
[5] “Net_Migration_Rate_per_1000_population” “Commercial_Prime_Lending_Rate”
[7] “GDP”
Challenges Faced-
As with any Data Analysis project, this one was also not without its fair share of challenges. I had to struggle with the below difficulties during the course of the project - 1. Finding the ideal data-set! 2. Cleaning and preparing the data. 3. Thinking about Normalization. 4. Translating a few thoughts and ideas into the right SQL queries. 5. Reducing the scope of the analysis to align with the purpose of the project.