1 MAsters in Big Data Analitics

2 Introduction

This homework demonstrates important concepts in R programming and Data Science including:

  1. Importing data from different sources
  2. Merging datasets
  3. Using %>% pipes and group_by()
  4. Creating visualizations using ggplot2 geoms
  5. Investigating trace() and recover() debugging functions
  6. Using functional programming tools such as sapply(), lapply(), map(), vapply(), split(), and tapply()

3 Setup and loading Libraries

library(tidyverse)
library(ggplot2)
library(dplyr)
library(readr)
library(corrplot)
library(reshape2)
library(skimr)
library(janitor)
library(plotly)
library(purrr)
library(DBI)
library(RPostgres)

4 1. Importing Data from Two Different Sources(CSV and Postgres)

4.1 1.1 Loading Data from my local PostgreSQL Database

# Create PostgreSQL connection
connection <- dbConnect(
  RPostgres::Postgres(),
  dbname = "r_conn_test_db",
  host = "localhost",
  port = 5432,
  user = "donatien",
  password = "MYSQLdon2020."
)

# Display available tables
 dbListTables(connection)
## [1] "student"

4.2 Reading from Student Table of r_conn_test_db database

# Read the student table
students <- dbReadTable(connection, "student")

# Alternative using SQL query
students <- dbGetQuery(connection, "SELECT * FROM student")

# Display first rows
head(students)
##   id reg_number      name age gender height
## 1  1    REG0001 Student_1  18   Male 168.63
## 2  2    REG0002 Student_2  16 Female 194.89
## 3  3    REG0003 Student_3  16 Female 160.26
## 4  4    REG0004 Student_4  17 Female 165.74
## 5  5    REG0005 Student_5  23 Female 157.51
## 6  6    REG0006 Student_6  19 Female 162.85

4.3 Closing Database Connection

dbDisconnect(connection)
dbDisconnect(connection)

4.4 1.2 Importing date from CSV Datasets (files on my computer : /home/donatient/Documents/r_assignment/data_sets/)

4.5 I am loading two files because will need them for merging purpose

4.6 Population and C02 emission files will be used till the end of this homwork

# Load population dataset
population <- read.csv(
  "/home/donatient/Documents/r_assignment/data_sets/world_population.csv"
)

# Load CO2 emission dataset
co2Emission <- read.csv(
  "/home/donatient/Documents/r_assignment/data_sets/CO2_emission.csv"
)

5 Display first rows from each data set Population and Co2 emmision

head(population)
##   Rank CCA3 Country.Territory          Capital Continent X2022.Population
## 1   36  AFG       Afghanistan            Kabul      Asia         41128771
## 2  138  ALB           Albania           Tirana    Europe          2842321
## 3   34  DZA           Algeria          Algiers    Africa         44903225
## 4  213  ASM    American Samoa        Pago Pago   Oceania            44273
## 5  203  AND           Andorra Andorra la Vella    Europe            79824
## 6   42  AGO            Angola           Luanda    Africa         35588987
##   X2020.Population X2015.Population X2010.Population X2000.Population
## 1         38972230         33753499         28189672         19542982
## 2          2866849          2882481          2913399          3182021
## 3         43451666         39543154         35856344         30774621
## 4            46189            51368            54849            58230
## 5            77700            71746            71519            66097
## 6         33428485         28127721         23364185         16394062
##   X1990.Population X1980.Population X1970.Population Area..km..
## 1         10694796         12486631         10752971     652230
## 2          3295066          2941651          2324731      28748
## 3         25518074         18739378         13795915    2381741
## 4            47818            32886            27075        199
## 5            53569            35611            19860        468
## 6         11828638          8330047          6029700    1246700
##   Density..per.km.. Growth.Rate World.Population.Percentage
## 1           63.0587      1.0257                        0.52
## 2           98.8702      0.9957                        0.04
## 3           18.8531      1.0164                        0.56
## 4          222.4774      0.9831                        0.00
## 5          170.5641      1.0100                        0.00
## 6           28.5466      1.0315                        0.45
head(co2Emission)
##           Country.Name country_code                     Region
## 1                Aruba          ABW  Latin America & Caribbean
## 2          Afghanistan          AFG                 South Asia
## 3               Angola          AGO         Sub-Saharan Africa
## 4              Albania          ALB      Europe & Central Asia
## 5              Andorra          AND      Europe & Central Asia
## 6 United Arab Emirates          ARE Middle East & North Africa
##                           Indicator.Name      X1990      X1991       X1992
## 1 CO2 emissions (metric tons per capita)         NA         NA          NA
## 2 CO2 emissions (metric tons per capita)  0.1917451  0.1676816  0.09595774
## 3 CO2 emissions (metric tons per capita)  0.5536620  0.5445386  0.54355722
## 4 CO2 emissions (metric tons per capita)  1.8195416  1.2428102  0.68369983
## 5 CO2 emissions (metric tons per capita)  7.5218317  7.2353792  6.96307870
## 6 CO2 emissions (metric tons per capita) 30.1951886 31.7784962 29.08092584
##         X1993       X1994       X1995       X1996       X1997       X1998
## 1          NA          NA          NA          NA          NA          NA
## 2  0.08472111  0.07554583  0.06846796  0.06258803  0.05682662  0.05269086
## 3  0.70898423  0.83680440  0.91214149  1.07216847  1.08663697  1.09182531
## 4  0.63830704  0.64535519  0.60543625  0.61236736  0.46692147  0.57215370
## 5  6.72417752  6.54157891  6.73347949  6.99159455  7.30744115  7.63953851
## 6 29.27567777 30.84933296 31.12501806 30.92802588 30.48633262 29.66358052
##         X1999      X2000       X2001       X2002       X2003       X2004
## 1          NA         NA          NA          NA          NA          NA
## 2  0.04015697  0.0365737  0.03378536  0.04557366  0.05151838  0.04165539
## 3  1.10985966  0.9880774  0.94182891  0.89557767  0.92486944  0.93026295
## 4  0.95535931  1.0262131  1.05549588  1.23237878  1.33898498  1.40405869
## 5  7.92319165  7.9522863  7.72154906  7.56623988  7.24241557  7.34426233
## 6 28.88710798 27.0351591 29.43026994 28.50146173 27.96926982 27.03893822
##         X2005       X2006       X2007      X2008      X2009      X2010
## 1          NA          NA          NA         NA         NA         NA
## 2  0.06041878  0.06658329  0.06531235  0.1284166  0.1718624  0.2436140
## 3  0.81353929  0.82184008  0.81175351  0.8886580  0.9394040  0.9761842
## 4  1.33820940  1.33999574  1.39393137  1.3843112  1.4414936  1.5276237
## 5  7.35378001  6.79054277  6.53104692  6.4393039  6.1566875  6.1571978
## 6 25.38238104 22.93510429 21.37028576 22.0114692 19.8323489 19.0397698
##        X2011      X2012      X2013      X2014      X2015      X2016      X2017
## 1         NA         NA         NA         NA         NA         NA         NA
## 2  0.2965062  0.2592953  0.1856237  0.1462356  0.1728967  0.1497893  0.1316946
## 3  0.9855223  0.9506959  1.0362939  1.0997791  1.1350441  1.0318113  0.8133007
## 4  1.6694232  1.5032405  1.5336300  1.6683374  1.6037751  1.5576644  1.7887861
## 5  5.8508861  5.9446542  5.9428004  5.8071277  6.0261818  6.0806003  6.1041339
## 6 18.5094574 19.2078011 20.0556476 20.0516980 21.0776420 21.4806686 20.7690223
##        X2018      X2019    X2019.1
## 1         NA         NA         NA
## 2  0.1632953  0.1598244  0.1598244
## 3  0.7776749  0.7921371  0.7921371
## 4  1.7827389  1.6922483  1.6922483
## 5  6.3629754  6.4812174  6.4812174
## 6 18.3906781 19.3295633 19.3295633

6 2. Data Cleaning before merging 2 Datasets(population and C02 Emission)

6.1 Cleaning Population Dataset from misisng values prior to merging

# Check missing values
colSums(is.na(population))
##                        Rank                        CCA3 
##                           0                           0 
##           Country.Territory                     Capital 
##                           0                           0 
##                   Continent            X2022.Population 
##                           0                           0 
##            X2020.Population            X2015.Population 
##                           0                           0 
##            X2010.Population            X2000.Population 
##                           0                           0 
##            X1990.Population            X1980.Population 
##                           0                           0 
##            X1970.Population                  Area..km.. 
##                           0                           0 
##           Density..per.km..                 Growth.Rate 
##                           0                           0 
## World.Population.Percentage 
##                           0
# Remove missing values
population <- na.omit(population)

# Clean column names
population <- clean_names(population)
co2Emission <- clean_names(co2Emission)

# Display cleaned column names
colnames(population)
##  [1] "rank"                        "cca3"                       
##  [3] "country_territory"           "capital"                    
##  [5] "continent"                   "x2022_population"           
##  [7] "x2020_population"            "x2015_population"           
##  [9] "x2010_population"            "x2000_population"           
## [11] "x1990_population"            "x1980_population"           
## [13] "x1970_population"            "area_km"                    
## [15] "density_per_km"              "growth_rate"                
## [17] "world_population_percentage"

6.2 Merging Population and CO2 emission Datasets

merged <- merge(
  population,
  co2Emission,
  by.x = "country_territory",
  by.y = "country_name"
)

# Display merged dataset information
head(merged)
##   country_territory rank cca3          capital continent x2022_population
## 1       Afghanistan   36  AFG            Kabul      Asia         41128771
## 2           Albania  138  ALB           Tirana    Europe          2842321
## 3           Algeria   34  DZA          Algiers    Africa         44903225
## 4    American Samoa  213  ASM        Pago Pago   Oceania            44273
## 5           Andorra  203  AND Andorra la Vella    Europe            79824
## 6            Angola   42  AGO           Luanda    Africa         35588987
##   x2020_population x2015_population x2010_population x2000_population
## 1         38972230         33753499         28189672         19542982
## 2          2866849          2882481          2913399          3182021
## 3         43451666         39543154         35856344         30774621
## 4            46189            51368            54849            58230
## 5            77700            71746            71519            66097
## 6         33428485         28127721         23364185         16394062
##   x1990_population x1980_population x1970_population area_km density_per_km
## 1         10694796         12486631         10752971  652230        63.0587
## 2          3295066          2941651          2324731   28748        98.8702
## 3         25518074         18739378         13795915 2381741        18.8531
## 4            47818            32886            27075     199       222.4774
## 5            53569            35611            19860     468       170.5641
## 6         11828638          8330047          6029700 1246700        28.5466
##   growth_rate world_population_percentage country_code
## 1      1.0257                        0.52          AFG
## 2      0.9957                        0.04          ALB
## 3      1.0164                        0.56          DZA
## 4      0.9831                        0.00          ASM
## 5      1.0100                        0.00          AND
## 6      1.0315                        0.45          AGO
##                       region                         indicator_name     x1990
## 1                 South Asia CO2 emissions (metric tons per capita) 0.1917451
## 2      Europe & Central Asia CO2 emissions (metric tons per capita) 1.8195416
## 3 Middle East & North Africa CO2 emissions (metric tons per capita) 2.4434300
## 4        East Asia & Pacific CO2 emissions (metric tons per capita)        NA
## 5      Europe & Central Asia CO2 emissions (metric tons per capita) 7.5218317
## 6         Sub-Saharan Africa CO2 emissions (metric tons per capita) 0.5536620
##       x1991      x1992      x1993      x1994      x1995      x1996      x1997
## 1 0.1676816 0.09595774 0.08472111 0.07554583 0.06846796 0.06258803 0.05682662
## 2 1.2428102 0.68369983 0.63830704 0.64535519 0.60543625 0.61236736 0.46692147
## 3 2.5162433 2.47296078 2.61330374 2.60900907 2.65806257 2.60093353 2.50243923
## 4        NA         NA         NA         NA         NA         NA         NA
## 5 7.2353792 6.96307870 6.72417752 6.54157891 6.73347949 6.99159455 7.30744115
## 6 0.5445386 0.54355722 0.70898423 0.83680440 0.91214149 1.07216847 1.08663697
##        x1998      x1999     x2000      x2001      x2002      x2003      x2004
## 1 0.05269086 0.04015697 0.0365737 0.03378536 0.04557366 0.05151838 0.04165539
## 2 0.57215370 0.95535931 1.0262131 1.05549588 1.23237878 1.33898498 1.40405869
## 3 2.47244786 2.53107052 2.5787445 2.50067461 2.58671220 2.73337366 2.73735406
## 4         NA         NA        NA         NA         NA         NA         NA
## 5 7.63953851 7.92319165 7.9522863 7.72154906 7.56623988 7.24241557 7.34426233
## 6 1.09182531 1.10985966 0.9880774 0.94182891 0.89557767 0.92486944 0.93026295
##        x2005      x2006      x2007     x2008     x2009     x2010     x2011
## 1 0.06041878 0.06658329 0.06531235 0.1284166 0.1718624 0.2436140 0.2965062
## 2 1.33820940 1.33999574 1.39393137 1.3843112 1.4414936 1.5276237 1.6694232
## 3 2.84135137 2.96691468 3.00728985 3.1024511 3.1745733 3.1736545 3.2947426
## 4         NA         NA         NA        NA        NA        NA        NA
## 5 7.35378001 6.79054277 6.53104692 6.4393039 6.1566875 6.1571978 5.8508861
## 6 0.81353929 0.82184008 0.81175351 0.8886580 0.9394040 0.9761842 0.9855223
##       x2012     x2013     x2014     x2015     x2016     x2017     x2018
## 1 0.2592953 0.1856237 0.1462356 0.1728967 0.1497893 0.1316946 0.1632953
## 2 1.5032405 1.5336300 1.6683374 1.6037751 1.5576644 1.7887861 1.7827389
## 3 3.6093077 3.6449793 3.7956323 3.9334959 3.8200903 3.8256380 3.9201091
## 4        NA        NA        NA        NA        NA        NA        NA
## 5 5.9446542 5.9428004 5.8071277 6.0261818 6.0806003 6.1041339 6.3629754
## 6 0.9506959 1.0362939 1.0997791 1.1350441 1.0318113 0.8133007 0.7776749
##       x2019   x2019_1
## 1 0.1598244 0.1598244
## 2 1.6922483 1.6922483
## 3 3.9776505 3.9776505
## 4        NA        NA
## 5 6.4812174 6.4812174
## 6 0.7921371 0.7921371
colnames(merged)
##  [1] "country_territory"           "rank"                       
##  [3] "cca3"                        "capital"                    
##  [5] "continent"                   "x2022_population"           
##  [7] "x2020_population"            "x2015_population"           
##  [9] "x2010_population"            "x2000_population"           
## [11] "x1990_population"            "x1980_population"           
## [13] "x1970_population"            "area_km"                    
## [15] "density_per_km"              "growth_rate"                
## [17] "world_population_percentage" "country_code"               
## [19] "region"                      "indicator_name"             
## [21] "x1990"                       "x1991"                      
## [23] "x1992"                       "x1993"                      
## [25] "x1994"                       "x1995"                      
## [27] "x1996"                       "x1997"                      
## [29] "x1998"                       "x1999"                      
## [31] "x2000"                       "x2001"                      
## [33] "x2002"                       "x2003"                      
## [35] "x2004"                       "x2005"                      
## [37] "x2006"                       "x2007"                      
## [39] "x2008"                       "x2009"                      
## [41] "x2010"                       "x2011"                      
## [43] "x2012"                       "x2013"                      
## [45] "x2014"                       "x2015"                      
## [47] "x2016"                       "x2017"                      
## [49] "x2018"                       "x2019"                      
## [51] "x2019_1"

7 3. Using Pipes (%>%) and Grouping

7.1 Boxplot Visualization

population %>%
  select(density_per_km, growth_rate, area_km) %>%
  boxplot(
    main = "Population Visualization",
    col = c("lightblue", "lightgreen", "lightpink")
  )

7.2 Top 10 Countries by Population in 2022

top10 <- population %>%
  arrange(desc(x2022_population)) %>%
  head(10)

# Display top 10 countries
 top10
##    rank cca3 country_territory          capital     continent x2022_population
## 1     1  CHN             China          Beijing          Asia       1425887337
## 2     2  IND             India        New Delhi          Asia       1417173173
## 3     3  USA     United States Washington, D.C. North America        338289857
## 4     4  IDN         Indonesia          Jakarta          Asia        275501339
## 5     5  PAK          Pakistan        Islamabad          Asia        235824862
## 6     6  NGA           Nigeria            Abuja        Africa        218541212
## 7     7  BRA            Brazil         Brasilia South America        215313498
## 8     8  BGD        Bangladesh            Dhaka          Asia        171186372
## 9     9  RUS            Russia           Moscow        Europe        144713314
## 10   10  MEX            Mexico      Mexico City North America        127504125
##    x2020_population x2015_population x2010_population x2000_population
## 1        1424929781       1393715448       1348191368       1264099069
## 2        1396387127       1322866505       1240613620       1059633675
## 3         335942003        324607776        311182845        282398554
## 4         271857970        259091970        244016173        214072421
## 5         227196741        210969298        194454498        154369924
## 6         208327405        183995785        160952853        122851984
## 7         213196304        205188205        196353492        175873720
## 8         167420951        157830000        148391139        129193327
## 9         145617329        144668389        143242599        146844839
## 10        125998302        120149897        112532401         97873442
##    x1990_population x1980_population x1970_population  area_km density_per_km
## 1        1153704252        982372466        822534450  9706961       146.8933
## 2         870452165        696828385        557501301  3287590       431.0675
## 3         248083732        223140018        200328340  9372610        36.0935
## 4         182159874        148177096        115228394  1904569       144.6529
## 5         115414069         80624057         59290872   881912       267.4018
## 6          95214257         72951439         55569264   923768       236.5759
## 7         150706446        122288383         96369875  8515767        25.2841
## 8         107147651         83929765         67541860   147570      1160.0350
## 9         148005704        138257420        130093010 17098242         8.4636
## 10         81720428         67705186         50289306  1964375        64.9082
##    growth_rate world_population_percentage
## 1       1.0000                       17.88
## 2       1.0068                       17.77
## 3       1.0038                        4.24
## 4       1.0064                        3.45
## 5       1.0191                        2.96
## 6       1.0241                        2.74
## 7       1.0046                        2.70
## 8       1.0108                        2.15
## 9       0.9973                        1.81
## 10      1.0063                        1.60

7.3 Grouping by Continent

population %>%
  group_by(continent) %>%
  summarise(
    average_density = mean(density_per_km),
    average_growth = mean(growth_rate),
    total_population = sum(x2022_population)
  )
## # A tibble: 6 × 4
##   continent     average_density average_growth total_population
##   <chr>                   <dbl>          <dbl>            <dbl>
## 1 Africa                  125.            1.02       1426730932
## 2 Asia                   1025.            1.01       4721383274
## 3 Europe                  663.            1.00        743147538
## 4 North America           273.            1.00        600296136
## 5 Oceania                 133.            1.01         45038554
## 6 South America            21.0           1.01        436816608

8 4. Data Visualization Using Geoms

8.1 Preparing CO2 Dataset for Plotting

co2_long <- co2Emission %>%
  pivot_longer(
    cols = starts_with("x"),
    names_to = "year",
    values_to = "co2_emission"
  )

head(co2_long)
## # A tibble: 6 × 6
##   country_name country_code region             indicator_name year  co2_emission
##   <chr>        <chr>        <chr>              <chr>          <chr>        <dbl>
## 1 Aruba        ABW          Latin America & C… CO2 emissions… x1990           NA
## 2 Aruba        ABW          Latin America & C… CO2 emissions… x1991           NA
## 3 Aruba        ABW          Latin America & C… CO2 emissions… x1992           NA
## 4 Aruba        ABW          Latin America & C… CO2 emissions… x1993           NA
## 5 Aruba        ABW          Latin America & C… CO2 emissions… x1994           NA
## 6 Aruba        ABW          Latin America & C… CO2 emissions… x1995           NA

8.2 CO2 Emission Trend Visualization

ggplot(
  co2_long,
  aes(
    x = year,
    y = co2_emission,
    color = country_name,
    group = country_name
  )
) +
  geom_line(linewidth = 1) +
  geom_point() +
  labs(
    title = "CO2 Emission Trends by Country",
    x = "Year",
    y = "CO2 Emission",
    color = "Country"
  ) +
  theme_minimal()


9 5. Investigating trace() and recover()

9.1 Creating a Buggy Function

process_population_df <- function(data) {

  data$x2026_population <- data$x2022_population * 2

  result <- mean(data$x2027_population)

  return(result)
}

9.2 Using trace() function for debugging

trace(
  "process_population_df",
  tracer = quote({
    print("TRACE ACTIVE")
    print(head(data))
  })
)
## [1] "process_population_df"
process_population_df(population)
## Tracing process_population_df(population) on entry 
## [1] "TRACE ACTIVE"
##   rank cca3 country_territory          capital continent x2022_population
## 1   36  AFG       Afghanistan            Kabul      Asia         41128771
## 2  138  ALB           Albania           Tirana    Europe          2842321
## 3   34  DZA           Algeria          Algiers    Africa         44903225
## 4  213  ASM    American Samoa        Pago Pago   Oceania            44273
## 5  203  AND           Andorra Andorra la Vella    Europe            79824
## 6   42  AGO            Angola           Luanda    Africa         35588987
##   x2020_population x2015_population x2010_population x2000_population
## 1         38972230         33753499         28189672         19542982
## 2          2866849          2882481          2913399          3182021
## 3         43451666         39543154         35856344         30774621
## 4            46189            51368            54849            58230
## 5            77700            71746            71519            66097
## 6         33428485         28127721         23364185         16394062
##   x1990_population x1980_population x1970_population area_km density_per_km
## 1         10694796         12486631         10752971  652230        63.0587
## 2          3295066          2941651          2324731   28748        98.8702
## 3         25518074         18739378         13795915 2381741        18.8531
## 4            47818            32886            27075     199       222.4774
## 5            53569            35611            19860     468       170.5641
## 6         11828638          8330047          6029700 1246700        28.5466
##   growth_rate world_population_percentage
## 1      1.0257                        0.52
## 2      0.9957                        0.04
## 3      1.0164                        0.56
## 4      0.9831                        0.00
## 5      1.0100                        0.00
## 6      1.0315                        0.45
## [1] NA

9.3 Using recover() function for debugging

options(error = recover)

process_population_df(population)
## Tracing process_population_df(population) on entry 
## [1] "TRACE ACTIVE"
##   rank cca3 country_territory          capital continent x2022_population
## 1   36  AFG       Afghanistan            Kabul      Asia         41128771
## 2  138  ALB           Albania           Tirana    Europe          2842321
## 3   34  DZA           Algeria          Algiers    Africa         44903225
## 4  213  ASM    American Samoa        Pago Pago   Oceania            44273
## 5  203  AND           Andorra Andorra la Vella    Europe            79824
## 6   42  AGO            Angola           Luanda    Africa         35588987
##   x2020_population x2015_population x2010_population x2000_population
## 1         38972230         33753499         28189672         19542982
## 2          2866849          2882481          2913399          3182021
## 3         43451666         39543154         35856344         30774621
## 4            46189            51368            54849            58230
## 5            77700            71746            71519            66097
## 6         33428485         28127721         23364185         16394062
##   x1990_population x1980_population x1970_population area_km density_per_km
## 1         10694796         12486631         10752971  652230        63.0587
## 2          3295066          2941651          2324731   28748        98.8702
## 3         25518074         18739378         13795915 2381741        18.8531
## 4            47818            32886            27075     199       222.4774
## 5            53569            35611            19860     468       170.5641
## 6         11828638          8330047          6029700 1246700        28.5466
##   growth_rate world_population_percentage
## 1      1.0257                        0.52
## 2      0.9957                        0.04
## 3      1.0164                        0.56
## 4      0.9831                        0.00
## 5      1.0100                        0.00
## 6      1.0315                        0.45
## [1] NA
options(error = NULL)

10 6. Applying different functions to data sets/dataframe in R

10.1 6.1 Using sapply()

sapply(
  population[, c("density_per_km", "growth_rate", "area_km")],
  mean
)
## density_per_km    growth_rate        area_km 
##   4.521270e+02   1.009577e+00   5.814494e+05

10.2 6.2 Using lapply()

lapply(
  population[, c("density_per_km", "growth_rate", "area_km")],
  summary
)
## $density_per_km
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 2.610e-02 3.842e+01 9.535e+01 4.521e+02 2.389e+02 2.317e+04 
## 
## $growth_rate
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.912   1.002   1.008   1.010   1.017   1.069 
## 
## $area_km
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##        1     2650    81200   581449   430426 17098242

10.3 6.3 Using vapply()

vapply(
  population[, c("density_per_km", "growth_rate", "area_km")],
  mean,
  numeric(1)
)
## density_per_km    growth_rate        area_km 
##   4.521270e+02   1.009577e+00   5.814494e+05

10.4 6.4 Using split()

population_split <- split(population, population$continent)

str(population_split)
## List of 6
##  $ Africa       :'data.frame':   57 obs. of  17 variables:
##   ..$ rank                       : int [1:57] 34 42 77 144 58 78 53 171 117 69 ...
##   ..$ cca3                       : chr [1:57] "DZA" "AGO" "BEN" "BWA" ...
##   ..$ country_territory          : chr [1:57] "Algeria" "Angola" "Benin" "Botswana" ...
##   ..$ capital                    : chr [1:57] "Algiers" "Luanda" "Porto-Novo" "Gaborone" ...
##   ..$ continent                  : chr [1:57] "Africa" "Africa" "Africa" "Africa" ...
##   ..$ x2022_population           : int [1:57] 44903225 35588987 13352864 2630296 22673762 12889576 27914536 593149 5579144 17723315 ...
##   ..$ x2020_population           : int [1:57] 43451666 33428485 12643123 2546402 21522626 12220227 26491087 582640 5343020 16644701 ...
##   ..$ x2015_population           : int [1:57] 39543154 28127721 10932783 2305171 18718019 10727148 23012646 552166 4819333 14140274 ...
##   ..$ x2010_population           : int [1:57] 35856344 23364185 9445710 2091664 16116845 9126605 19878036 521212 4660067 11894727 ...
##   ..$ x2000_population           : int [1:57] 30774621 16394062 6998023 1726985 11882888 6307659 15091594 458251 3759170 8259137 ...
##   ..$ x1990_population           : int [1:57] 25518074 11828638 5133419 1341474 9131361 5483793 11430520 364563 2809221 5827069 ...
##   ..$ x1980_population           : int [1:57] 18739378 8330047 3833939 938578 6932967 4312834 8519891 317234 2415276 4408230 ...
##   ..$ x1970_population           : int [1:57] 13795915 6029700 3023443 592244 5611666 3497834 6452787 287262 2067356 3667394 ...
##   ..$ area_km                    : int [1:57] 2381741 1246700 112622 582000 272967 27834 475442 4033 622984 1284000 ...
##   ..$ density_per_km             : num [1:57] 18.85 28.55 118.56 4.52 83.06 ...
##   ..$ growth_rate                : num [1:57] 1.02 1.03 1.03 1.02 1.03 ...
##   ..$ world_population_percentage: num [1:57] 0.56 0.45 0.17 0.03 0.28 0.16 0.35 0.01 0.07 0.22 ...
##  $ Asia         :'data.frame':   50 obs. of  17 variables:
##   ..$ rank                       : int [1:50] 36 140 91 154 8 165 175 73 1 131 ...
##   ..$ cca3                       : chr [1:50] "AFG" "ARM" "AZE" "BHR" ...
##   ..$ country_territory          : chr [1:50] "Afghanistan" "Armenia" "Azerbaijan" "Bahrain" ...
##   ..$ capital                    : chr [1:50] "Kabul" "Yerevan" "Baku" "Manama" ...
##   ..$ continent                  : chr [1:50] "Asia" "Asia" "Asia" "Asia" ...
##   ..$ x2022_population           : int [1:50] 41128771 2780469 10358074 1472233 171186372 782455 449002 16767842 1425887337 3744385 ...
##   ..$ x2020_population           : int [1:50] 38972230 2805608 10284951 1477469 167420951 772506 441725 16396860 1424929781 3765912 ...
##   ..$ x2015_population           : int [1:50] 33753499 2878595 9863480 1362142 157830000 743274 421437 15417523 1393715448 3771132 ...
##   ..$ x2010_population           : int [1:50] 28189672 2946293 9237202 1213645 148391139 705516 396053 14363532 1348191368 3836831 ...
##   ..$ x2000_population           : int [1:50] 19542982 3168523 8190337 711442 129193327 587207 333926 12118841 1264099069 4265172 ...
##   ..$ x1990_population           : int [1:50] 10694796 3556539 7427836 517418 107147651 558442 261928 8910808 1153704252 5391636 ...
##   ..$ x1980_population           : int [1:50] 12486631 3135123 6383060 362595 83929765 415257 187921 6198959 982372466 5145843 ...
##   ..$ x1970_population           : int [1:50] 10752971 2534377 5425317 222555 67541860 298894 133343 6708525 822534450 4800426 ...
##   ..$ area_km                    : int [1:50] 652230 29743 86600 765 147570 38394 5765 181035 9706961 69700 ...
##   ..$ density_per_km             : num [1:50] 63.1 93.5 119.6 1924.5 1160 ...
##   ..$ growth_rate                : num [1:50] 1.026 0.996 1.004 1.006 1.011 ...
##   ..$ world_population_percentage: num [1:50] 0.52 0.03 0.13 0.02 2.15 ...
##  $ Europe       :'data.frame':   50 obs. of  17 variables:
##   ..$ rank                       : int [1:50] 138 203 99 96 81 137 108 130 158 88 ...
##   ..$ cca3                       : chr [1:50] "ALB" "AND" "AUT" "BLR" ...
##   ..$ country_territory          : chr [1:50] "Albania" "Andorra" "Austria" "Belarus" ...
##   ..$ capital                    : chr [1:50] "Tirana" "Andorra la Vella" "Vienna" "Minsk" ...
##   ..$ continent                  : chr [1:50] "Europe" "Europe" "Europe" "Europe" ...
##   ..$ x2022_population           : int [1:50] 2842321 79824 8939617 9534954 11655930 3233526 6781953 4030358 1251488 10493986 ...
##   ..$ x2020_population           : int [1:50] 2866849 77700 8907777 9633740 11561717 3318407 6979175 4096868 1237537 10530953 ...
##   ..$ x2015_population           : int [1:50] 2882481 71746 8642421 9700609 11248303 3524324 7309253 4254815 1187280 10523798 ...
##   ..$ x2010_population           : int [1:50] 2913399 71519 8362829 9731427 10877947 3811088 7592273 4368682 1129686 10464749 ...
##   ..$ x2000_population           : int [1:50] 3182021 66097 8010428 10256483 10264343 4179350 8097691 4548434 948237 10234710 ...
##   ..$ x1990_population           : int [1:50] 3295066 53569 7678729 10428525 9959560 4494310 8767778 4873707 788500 10301192 ...
##   ..$ x1980_population           : int [1:50] 2941651 35611 7547561 9817257 9828986 4199820 8980606 4680144 679327 10270060 ...
##   ..$ x1970_population           : int [1:50] 2324731 19860 7465301 9170786 9629376 3815561 8582950 4492638 640804 9795744 ...
##   ..$ area_km                    : int [1:50] 28748 468 83871 207600 30528 51209 110879 56594 9251 78865 ...
##   ..$ density_per_km             : num [1:50] 98.9 170.6 106.6 45.9 381.8 ...
##   ..$ growth_rate                : num [1:50] 0.996 1.01 1.002 0.996 1.004 ...
##   ..$ world_population_percentage: num [1:50] 0.04 0 0.11 0.12 0.15 0.04 0.09 0.05 0.02 0.13 ...
##  $ North America:'data.frame':   40 obs. of  17 variables:
##   ..$ rank                       : int [1:40] 224 201 198 176 186 177 206 221 39 205 ...
##   ..$ cca3                       : chr [1:40] "AIA" "ATG" "ABW" "BHS" ...
##   ..$ country_territory          : chr [1:40] "Anguilla" "Antigua and Barbuda" "Aruba" "Bahamas" ...
##   ..$ capital                    : chr [1:40] "The Valley" "Saint John’s" "Oranjestad" "Nassau" ...
##   ..$ continent                  : chr [1:40] "North America" "North America" "North America" "North America" ...
##   ..$ x2022_population           : int [1:40] 15857 93763 106445 409984 281635 405272 64184 31305 38454327 68706 ...
##   ..$ x2020_population           : int [1:40] 15585 92664 106585 406471 280693 394921 64031 30910 37888705 67311 ...
##   ..$ x2015_population           : int [1:40] 14525 89941 104257 392697 278083 359871 63144 29366 35732126 60911 ...
##   ..$ x2010_population           : int [1:40] 13172 85695 100341 373272 274711 322106 63447 27556 33963412 54074 ...
##   ..$ x2000_population           : int [1:40] 11047 75055 89101 325014 264657 240406 61371 20104 30683313 39658 ...
##   ..$ x1990_population           : int [1:40] 8316 63328 65712 270679 258868 182589 57470 15617 27657204 26027 ...
##   ..$ x1980_population           : int [1:40] 6560 64888 62267 223752 253575 145133 53565 11109 24511510 17100 ...
##   ..$ x1970_population           : int [1:40] 6283 64516 59106 179129 241397 120905 52019 9581 21434577 10533 ...
##   ..$ area_km                    : int [1:40] 91 442 180 13943 430 22966 54 151 9984670 264 ...
##   ..$ density_per_km             : num [1:40] 174.3 212.1 591.4 29.4 655 ...
##   ..$ growth_rate                : num [1:40] 1.007 1.006 0.999 1.005 1.002 ...
##   ..$ world_population_percentage: num [1:40] 0 0 0 0.01 0 0.01 0 0 0.48 0 ...
##  $ Oceania      :'data.frame':   23 obs. of  17 variables:
##   ..$ rank                       : int [1:23] 213 55 223 162 183 191 192 215 194 225 ...
##   ..$ cca3                       : chr [1:23] "ASM" "AUS" "COK" "FJI" ...
##   ..$ country_territory          : chr [1:23] "American Samoa" "Australia" "Cook Islands" "Fiji" ...
##   ..$ capital                    : chr [1:23] "Pago Pago" "Canberra" "Avarua" "Suva" ...
##   ..$ continent                  : chr [1:23] "Oceania" "Oceania" "Oceania" "Oceania" ...
##   ..$ x2022_population           : int [1:23] 44273 26177413 17011 929766 306279 171774 131232 41569 114164 12668 ...
##   ..$ x2020_population           : int [1:23] 46189 25670051 17029 920422 301920 169231 126463 43413 112106 12315 ...
##   ..$ x2015_population           : int [1:23] 51368 23820236 17695 917200 291787 167978 116707 49410 109462 11185 ...
##   ..$ x2010_population           : int [1:23] 54849 22019168 17212 905169 283788 164905 107995 53416 107588 10241 ...
##   ..$ x2000_population           : int [1:23] 58230 19017963 15897 832509 250927 160188 88826 54224 111709 10377 ...
##   ..$ x1990_population           : int [1:23] 47818 17048003 17123 780430 211089 138263 75124 46047 98603 9598 ...
##   ..$ x1980_population           : int [1:23] 32886 14706322 17651 644582 163591 110286 60813 31988 76299 7635 ...
##   ..$ x1970_population           : int [1:23] 27075 12595034 20470 527634 117891 88300 57437 23969 58989 6663 ...
##   ..$ area_km                    : int [1:23] 199 7692024 236 18272 4167 549 811 181 702 21 ...
##   ..$ density_per_km             : num [1:23] 222.5 3.4 72.1 50.9 73.5 ...
##   ..$ growth_rate                : num [1:23] 0.983 1.01 1 1.006 1.007 ...
##   ..$ world_population_percentage: num [1:23] 0 0.33 0 0.01 0 0 0 0 0 0 ...
##  $ South America:'data.frame':   14 obs. of  17 variables:
##   ..$ rank                       : int [1:14] 33 80 7 65 28 67 231 184 164 109 ...
##   ..$ cca3                       : chr [1:14] "ARG" "BOL" "BRA" "CHL" ...
##   ..$ country_territory          : chr [1:14] "Argentina" "Bolivia" "Brazil" "Chile" ...
##   ..$ capital                    : chr [1:14] "Buenos Aires" "Sucre" "Brasilia" "Santiago" ...
##   ..$ continent                  : chr [1:14] "South America" "South America" "South America" "South America" ...
##   ..$ x2022_population           : int [1:14] 45510318 12224110 215313498 19603733 51874024 18001000 3780 304557 808726 6780744 ...
##   ..$ x2020_population           : int [1:14] 45036032 11936162 213196304 19300315 50930662 17588595 3747 290969 797202 6618695 ...
##   ..$ x2015_population           : int [1:14] 43257065 11090085 205188205 17870124 47119728 16195902 3408 257026 755031 6177950 ...
##   ..$ x2010_population           : int [1:14] 41100123 10223270 196353492 17004162 44816108 14989585 3187 228453 747932 5768613 ...
##   ..$ x2000_population           : int [1:14] 37070774 8592656 175873720 15351799 39215135 12626507 3080 164351 759051 5123819 ...
##   ..$ x1990_population           : int [1:14] 32637657 7096194 150706446 13342868 32601393 10449837 2332 113931 747116 4059195 ...
##   ..$ x1980_population           : int [1:14] 28024803 5736088 122288383 11469828 26176195 8135845 2240 66825 778176 3078912 ...
##   ..$ x1970_population           : int [1:14] 23842803 4585693 96369875 9820481 20905254 6172215 2274 46484 705261 2408787 ...
##   ..$ area_km                    : int [1:14] 2780400 1098581 8515767 756102 1141748 276841 12173 83534 214969 406752 ...
##   ..$ density_per_km             : num [1:14] 16.4 11.1 25.3 25.9 45.4 ...
##   ..$ growth_rate                : num [1:14] 1.01 1.01 1 1.01 1.01 ...
##   ..$ world_population_percentage: num [1:14] 0.57 0.15 2.7 0.25 0.65 0.23 0 0 0.01 0.09 ...

10.5 6.5 Using tapply()

tapply(
  population$density_per_km,
  population$continent,
  mean
)
##        Africa          Asia        Europe North America       Oceania 
##     125.04765    1025.02414     663.32474     272.76176     132.54307 
## South America 
##      20.97198

10.6 6.6 Using map()

population[, c("density_per_km", "growth_rate", "area_km")] %>%
  map(mean)
## $density_per_km
## [1] 452.127
## 
## $growth_rate
## [1] 1.009577
## 
## $area_km
## [1] 581449.4

11 Conclusion

In this homework i explored major concepts bellow :

  • Database connectivity with PostgreSQL
  • CSV data importation
  • CSV Data cleaning and merging
  • Using functions p in R:map,tapply,split,vapply,lapply,sapply,merge,….
  • Visualization using geoms for ggplot2
  • Debugging using trace() and recover()
  • Data manipulation with pipes and grouping

These tools are essential in modern Data Science workflows and statistical analysis.