library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
dataset <- read.csv("dataset.csv")

Week 3 Data Dive (Group By and Probabilities)

Introduction

This notebook explores patterns in a country–year dataset using group-by analysis and probability concepts. The goal is to understand how observations are distributed across different regions and income categories, identify rare and common groups, and interpret what these patterns mean in context. By grouping the data in multiple ways, calculating probabilities based on random row selection, and using visualizations, this analysis highlights structural patterns, anomalies, and potential limitations in the dataset.

Group By Analysis

Group by Region

This code groups the dataset by region using group_by() and counts how many rows belong to each region using summarise(). Each row in the resulting table represents a region, and n indicates how frequently each region appears in the dataset.

region_counts <- dataset %>%
  group_by(region) %>%
  summarise(n = n())

Sorting Group

This code sorts the regions from least frequent to most frequent. This makes it easier to identify which regions are rare and therefore have a lower probability of being selected when randomly sampling a row.

region_counts %>%
  arrange(n)
## # A tibble: 7 × 2
##   region                         n
##   <chr>                      <int>
## 1 North America                 60
## 2 South Asia                   160
## 3 Middle East & North Africa   420
## 4 East Asia & Pacific          740
## 5 Latin America & Caribbean    840
## 6 Sub-Saharan Africa           960
## 7 Europe & Central Asia       1160
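
As a side note, the group-count-sort pattern above can be written more compactly with dplyr's count(). The sketch below uses a small made-up data frame (toy is a stand-in, not the real dataset) to show that both forms give the same counts.

```r
library(dplyr)

# Toy stand-in for `dataset`; only the `region` column matters here.
toy <- data.frame(
  region = c("North America", "South Asia", "South Asia",
             "Europe & Central Asia", "Europe & Central Asia",
             "Europe & Central Asia")
)

# Long form, as used above: group, count, then sort ascending.
long_form <- toy %>%
  group_by(region) %>%
  summarise(n = n()) %>%
  arrange(n)

# Short form: count() groups and tallies in one call.
short_form <- toy %>%
  count(region) %>%
  arrange(n)

long_form
short_form
```

Either form works; the longer one is handy when other summaries (such as a median) need to sit next to the count.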

Tagging the rarest region

This code adds a new column called rare using mutate(). The region with the smallest number of rows is labeled TRUE, and all other regions are labeled FALSE. This explicitly marks the lowest-probability group in the dataset.

region_counts <- region_counts %>%
  mutate(rare = n == min(n))
region_counts
## # A tibble: 7 × 3
##   region                         n rare 
##   <chr>                      <int> <lgl>
## 1 East Asia & Pacific          740 FALSE
## 2 Europe & Central Asia       1160 FALSE
## 3 Latin America & Caribbean    840 FALSE
## 4 Middle East & North Africa   420 FALSE
## 5 North America                 60 TRUE 
## 6 South Asia                   160 FALSE
## 7 Sub-Saharan Africa           960 FALSE

Probability Calculation

This code uses mutate() to add two new columns to the region_counts summary table. The first column, probability, is the proportion of all rows that belong to each region. The second column, probability_percent, expresses this value as a percentage, rounded to two decimal places for clarity.

region_counts <- region_counts %>%
  mutate(
    probability = n / sum(n),
    probability_percent = round(probability * 100, 2))
region_counts
## # A tibble: 7 × 5
##   region                         n rare  probability probability_percent
##   <chr>                      <int> <lgl>       <dbl>               <dbl>
## 1 East Asia & Pacific          740 FALSE      0.171                17.0 
## 2 Europe & Central Asia       1160 FALSE      0.267                26.7 
## 3 Latin America & Caribbean    840 FALSE      0.194                19.4 
## 4 Middle East & North Africa   420 FALSE      0.0968                9.68
## 5 North America                 60 TRUE       0.0138                1.38
## 6 South Asia                   160 FALSE      0.0369                3.69
## 7 Sub-Saharan Africa           960 FALSE      0.221                22.1
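
One way to sanity-check these probabilities is to simulate the random row selection they describe. The sketch below copies the region counts from the table above, confirms that the probabilities sum to 1, and compares them with the shares seen in 100,000 simulated draws. The seed and the number of draws are arbitrary choices.

```r
# Region counts copied from the table above.
region_n <- c(
  "East Asia & Pacific"        = 740,
  "Europe & Central Asia"      = 1160,
  "Latin America & Caribbean"  = 840,
  "Middle East & North Africa" = 420,
  "North America"              = 60,
  "South Asia"                 = 160,
  "Sub-Saharan Africa"         = 960
)

# Theoretical probability of drawing each region with one random row.
probability <- region_n / sum(region_n)

# The probabilities of all regions must add up to 1.
stopifnot(isTRUE(all.equal(sum(probability), 1)))

# Simulate many single-row draws and compute the empirical shares.
set.seed(42)                                     # arbitrary, for reproducibility
rows  <- rep(names(region_n), times = region_n)  # one region label per row
draws <- sample(rows, size = 100000, replace = TRUE)
empirical <- table(draws)[names(region_n)] / length(draws)

# Side-by-side comparison of theoretical and simulated probabilities.
round(cbind(theoretical = probability,
            simulated   = as.numeric(empirical)), 4)
```

The simulated shares should land close to the theoretical ones, with small differences due to sampling noise.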

Visualization

This bar chart visualizes the number of rows for each region. Regions with shorter bars represent lower probabilities of occurrence when randomly selecting a row from the dataset.

ggplot(region_counts, aes(x = region, y = n)) +
  geom_col()
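
The default chart orders regions alphabetically, which makes rare and common regions harder to spot. A possible refinement, sketched below with the counts copied from the table above, is to order the bars by n with reorder() and flip the axes so the long region names stay readable. The data frame name region_counts_sketch and the axis labels are illustrative choices.

```r
library(ggplot2)

# Counts copied from the region table above.
region_counts_sketch <- data.frame(
  region = c("East Asia & Pacific", "Europe & Central Asia",
             "Latin America & Caribbean", "Middle East & North Africa",
             "North America", "South Asia", "Sub-Saharan Africa"),
  n = c(740, 1160, 840, 420, 60, 160, 960)
)

# reorder() sorts the region labels by their count;
# coord_flip() turns the chart sideways for readable labels.
p <- ggplot(region_counts_sketch, aes(x = reorder(region, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Region", y = "Number of rows")

p
```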

Testable Hypothesis

Hypothesis: Regions with fewer unique countries will have lower country-year counts and lower probabilities in the dataset.

To test this, I would need to count the number of unique countries per region. If North America has fewer unique countries represented in this dataset compared to other regions, that would explain its low probability.

Alternative hypothesis: Regions with smaller geographic areas or populations contribute fewer observations to the dataset.
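
A minimal sketch of how the first hypothesis could be checked is shown below. It assumes the dataset has a country column, which has not been confirmed in this notebook, and it runs on a small made-up data frame rather than the real data. Every region count above is a multiple of 20, which is consistent with (but does not prove) each country contributing the same number of rows.

```r
library(dplyr)

# Toy data; `country` and `year` are assumed column names.
toy <- data.frame(
  country = c("A", "A", "B", "B", "C", "C"),
  region  = c("R1", "R1", "R2", "R2", "R2", "R2"),
  year    = c(2000, 2001, 2000, 2001, 2000, 2001)
)

# For each region: total rows, distinct countries, and rows per country.
region_structure <- toy %>%
  group_by(region) %>%
  summarise(
    n_rows           = n(),
    n_countries      = n_distinct(country),
    rows_per_country = n_rows / n_countries
  )

region_structure
```

If rows_per_country is the same in every region, differences in row counts come entirely from the number of countries, which would support the hypothesis.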

Interpretation

When grouping the data by region, North America appears least frequently, with only 60 observations out of 4,340 total rows. This means North America has a probability of 60/4,340 = 0.0138, or approximately 1.38%.

In other words, if we randomly selected a single row from this dataset, there is only a 1.38% chance it would represent North America. This is the lowest probability among all regions.

In contrast, Europe & Central Asia has the highest probability at 1,160/4,340 = 26.7%, making it about 19 times more likely to be randomly selected than North America (1,160/60 ≈ 19.3).

Further Questions

  1. Does North America have fewer countries represented in this dataset, or do those countries simply have fewer year observations?

  2. Are there specific time periods where North America data is missing?

  3. Would the probability distribution change if we grouped by a different variable (e.g., income level or development status)?

Group by Income

This analysis groups the dataset by income level to understand the distribution of observations across different economic classifications.

income_counts <- dataset %>%
  group_by(income) %>%
  summarise(n = n())
income_counts
## # A tibble: 5 × 2
##   income                  n
##   <chr>               <int>
## 1 High income          1700
## 2 Low income            520
## 3 Lower middle income  1020
## 4 Not classified         20
## 5 Upper middle income  1080

Sorting Group

The data is sorted from least to most frequent to identify which income groups are underrepresented in the dataset.

income_counts %>%
  arrange(n)
## # A tibble: 5 × 2
##   income                  n
##   <chr>               <int>
## 1 Not classified         20
## 2 Low income            520
## 3 Lower middle income  1020
## 4 Upper middle income  1080
## 5 High income          1700

Tagging the rarest income group

The “Not classified” category is tagged as the rarest group, with only 20 observations. It presumably covers countries that do not fit into the standard World Bank income classifications.

income_counts <- income_counts %>%
  mutate(rare = n == min(n))
income_counts
## # A tibble: 5 × 3
##   income                  n rare 
##   <chr>               <int> <lgl>
## 1 High income          1700 FALSE
## 2 Low income            520 FALSE
## 3 Lower middle income  1020 FALSE
## 4 Not classified         20 TRUE 
## 5 Upper middle income  1080 FALSE

Probability Calculation

The “Not classified” income group has a probability of 0.46%. This means that if a single row were randomly selected from the dataset, there is less than a 1% chance it would belong to the “Not classified” income group. In contrast, High income countries account for 39.17% of all observations, making them far more likely to appear in a random draw.

income_counts <- income_counts %>%
  mutate(
    probability = n / sum(n),
    probability_percent = round(probability * 100, 2))
income_counts %>%
  arrange(n)
## # A tibble: 5 × 5
##   income                  n rare  probability probability_percent
##   <chr>               <int> <lgl>       <dbl>               <dbl>
## 1 Not classified         20 TRUE      0.00461                0.46
## 2 Low income            520 FALSE     0.120                 12.0 
## 3 Lower middle income  1020 FALSE     0.235                 23.5 
## 4 Upper middle income  1080 FALSE     0.249                 24.9 
## 5 High income          1700 FALSE     0.392                 39.2

Visualization

This bar chart clearly shows the stark difference in representation, with “Not classified” barely visible compared to other income groups.

ggplot(income_counts, aes(x = income, y = n)) +
  geom_col()

Testable Hypothesis

Hypothesis: The "Not classified" income category is rare because it includes fewer unique countries, rather than because countries in this category have fewer year observations.

How to test: Count the number of unique countries in each income category. If “Not classified” has only 1-2 countries while other categories have 20-30+ countries, this would support the hypothesis.

Alternative hypothesis: “Not classified” countries may have incomplete data collection, resulting in fewer years of observations even if multiple countries belong to this category.

Interpretation

When grouping the data by income level, “Not classified” appears least frequently, with only 20 observations out of 4,340 total rows. This represents a probability of 0.46%, making it the rarest income category in the dataset. This low probability suggests that very few countries fall into the “Not classified” category, or that countries in this category have limited data availability. The category likely includes small territories, disputed regions, or countries with insufficient economic data for World Bank classification.

High income countries form the largest group in the dataset at 39.17%, followed by Upper middle income (24.88%) and Lower middle income (23.50%). This distribution suggests that the dataset may be biased toward wealthier nations with better data collection infrastructure, or that it focuses on economically developed regions.

Further Questions

  1. Which specific countries are classified as “Not classified,” and why don’t they fit standard income categories?

  2. Is there a correlation between income level and the number of years of data available per country?

  3. Does the overrepresentation of high-income countries affect the conclusions we can draw from this dataset?
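
The first question can be answered directly by filtering the rows for that category. The sketch below is a hedged illustration on made-up data: the country and year column names are assumptions, and toy stands in for the real dataset.

```r
library(dplyr)

# Toy data; `country` and `year` are assumed column names.
toy <- data.frame(
  country = c("A", "A", "B", "B", "C", "C"),
  income  = c("High income", "High income",
              "Not classified", "Not classified",
              "Low income", "Low income"),
  year    = c(2000, 2001, 2000, 2001, 2000, 2001)
)

# List each "Not classified" country with the span and number of its rows.
not_classified <- toy %>%
  filter(income == "Not classified") %>%
  group_by(country) %>%
  summarise(
    first_year = min(year),
    last_year  = max(year),
    n_years    = n()
  )

not_classified
```

On the real data, the number of rows in this table would show whether the 20 observations come from one country or several.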

Group by Income + Population

The goal of this analysis is to examine how population size varies across income groups. By grouping the data by income level and summarizing population using the median, we can identify which income groups tend to have smaller or larger typical country populations and whether any group stands out as unusual.

This code groups the dataset by income level and calculates the median population for each group. The median is used because population data can be highly skewed by very large countries. Note that the median is taken over country–year rows, so each country contributes one value for every year it appears. The number of observations (n) is also included to provide context for how frequently each income group appears in the dataset.

income_population <- dataset %>%
  group_by(income) %>%
  summarise(
    median_population = median(population, na.rm = TRUE),
    n = n()
  )
income_population %>%
  arrange(median_population)
## # A tibble: 5 × 3
##   income              median_population     n
##   <chr>                           <dbl> <int>
## 1 High income                  2226880   1700
## 2 Upper middle income          5092308.  1080
## 3 Lower middle income         10422800   1020
## 4 Low income                  16038824.   520
## 5 Not classified              28776760.    20
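
To illustrate why the median was preferred here, the short sketch below compares the mean and the median on a made-up set of populations containing one very large value. The numbers are hypothetical and chosen only to show the effect of skew and of na.rm = TRUE.

```r
# Hypothetical populations: four small countries and one very large one.
population <- c(1e6, 2e6, 3e6, 4e6, 1.4e9)

pop_mean   <- mean(population)    # pulled far upward by the single large value
pop_median <- median(population)  # stays at the middle value

# A missing value makes median() return NA unless na.rm = TRUE is set.
with_gap       <- c(population, NA)
median_default <- median(with_gap)
median_na_rm   <- median(with_gap, na.rm = TRUE)

c(mean = pop_mean, median = pop_median)
c(default = median_default, na_rm = median_na_rm)
```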

Tagging the rarest income group

This code adds a logical column that tags the income group with the fewest observations as rare. This identifies the lowest-probability income group when randomly selecting a row from the dataset.

income_population <- income_population %>%
  mutate(rare = n == min(n))

income_population
## # A tibble: 5 × 4
##   income              median_population     n rare 
##   <chr>                           <dbl> <int> <lgl>
## 1 High income                  2226880   1700 FALSE
## 2 Low income                  16038824.   520 FALSE
## 3 Lower middle income         10422800   1020 FALSE
## 4 Not classified              28776760.    20 TRUE 
## 5 Upper middle income          5092308.  1080 FALSE

Probability Calculation

This code calculates the probability of each income group based on how many observations it contains. The probability is computed by dividing the number of rows in each income group (n) by the total number of rows in the dataset. The probability is also converted into a percentage to make the results easier to interpret. Finally, the table is sorted by the number of observations to clearly show which income groups are least and most likely to appear when randomly selecting a row.

The probability values are identical to those in the income-only grouping because probability depends solely on how many observations fall into each income category, not on the population statistics summarized in this analysis.

income_population <- income_population %>%
  mutate(
    probability = n / sum(n),
    probability_percent = round(probability * 100, 2)
  )
income_population %>%
  arrange(n)
## # A tibble: 5 × 6
##   income           median_population     n rare  probability probability_percent
##   <chr>                        <dbl> <int> <lgl>       <dbl>               <dbl>
## 1 Not classified           28776760.    20 TRUE      0.00461                0.46
## 2 Low income               16038824.   520 FALSE     0.120                 12.0 
## 3 Lower middle in…         10422800   1020 FALSE     0.235                 23.5 
## 4 Upper middle in…          5092308.  1080 FALSE     0.249                 24.9 
## 5 High income               2226880   1700 FALSE     0.392                 39.2

Visualization

This bar chart visualizes the median population for each income group. Each bar represents an income category, and the height of the bar corresponds to the typical (median) population of countries within that group. Taller bars indicate income groups with larger median populations, allowing for easy comparison across categories.

ggplot(income_population, aes(x = income, y = median_population)) +
  geom_col()

Testable Hypothesis

Hypothesis: High income countries have smaller median populations because many wealthy nations are smaller European or island countries with advanced economies, while lower- and middle-income groups include several highly populous developing nations such as India, China, and Indonesia.

How to test: Examine the relationship between GDP per capita and population size. If a negative correlation exists (higher GDP per capita associated with smaller population sizes), this would support the hypothesis.
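
A rough sketch of such a test is shown below. It assumes the dataset contains gdp_per_capita and population columns (hypothetical names, not confirmed in this notebook) and uses made-up values. Spearman's rank correlation is chosen here because both variables are likely to be heavily skewed; that choice is an assumption, not something the notebook prescribes.

```r
# Toy data with hypothetical column names and made-up values.
toy <- data.frame(
  gdp_per_capita = c(500, 1500, 4000, 12000, 45000, 60000),
  population     = c(9e7, 2e8, 3e7, 1e7, 5e6, 6e5)
)

# Rank-based correlation: insensitive to extreme values such as
# very large populations. A negative value supports the hypothesis.
rho <- cor(toy$gdp_per_capita, toy$population, method = "spearman")

rho
```

On the real data it may be safer to first reduce each country to a single row (for example its median across years), since repeated country–year rows are not independent observations.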

Alternative hypothesis: The “Not classified” income group has the largest median population because it may include unusual cases such as disputed territories or countries with incomplete economic classification despite having very large populations.

Interpretation

When grouping the data by income level and examining median population, the “Not classified” category has an unusually large median population (28.8 million) despite having only 20 observations, corresponding to a very low probability of occurrence. This suggests that the category may include a small number of highly populous countries that do not fit standard World Bank income classifications.

In contrast, high-income countries have the smallest median population (2.2 million) while being the most common income category in the dataset. This indicates that wealthier nations in this dataset tend to be smaller in population size, potentially reflecting the presence of many small European countries, city-states, or island nations.

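Overall, median population falls as income level rises across the four standard income groups, from about 16.0 million for Low income to about 2.2 million for High income. The rarest category, “Not classified”, has the largest median population, while the most common category, High income, has the smallest. This matters because it points to a potential representation bias: the income groups most likely to be drawn at random tend to describe smaller populations.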

Further Questions

  1. Which specific countries are included in the “Not classified” income category, and why do they not fit standard income classifications?

  2. Is there a statistical correlation between income level and population size across all observations?

  3. Do high-income countries with large populations behave differently from smaller high-income countries in the dataset?

Categorical Variable Combinations: Region and Income

The goal of this analysis is to examine how income categories are distributed across different world regions. By analyzing all region–income combinations, we can identify which combinations are common, which are rare, and which do not appear in the dataset at all. This helps reveal structural patterns and potential limitations in the data.

This code counts the number of observations for each unique combination of region and income level. Each row in the resulting table represents one specific region–income pair, and the count (n) shows how frequently that combination appears in the dataset. This table allows us to identify which region–income combinations are most common and which are least common, providing insight into how economic classifications are distributed geographically.

region_income_counts <- dataset %>%
  count(region, income)
region_income_counts
##                        region              income   n
## 1         East Asia & Pacific         High income 300
## 2         East Asia & Pacific          Low income  20
## 3         East Asia & Pacific Lower middle income 240
## 4         East Asia & Pacific Upper middle income 180
## 5       Europe & Central Asia         High income 800
## 6       Europe & Central Asia Lower middle income  60
## 7       Europe & Central Asia Upper middle income 300
## 8   Latin America & Caribbean         High income 360
## 9   Latin America & Caribbean Lower middle income  80
## 10  Latin America & Caribbean      Not classified  20
## 11  Latin America & Caribbean Upper middle income 380
## 12 Middle East & North Africa         High income 160
## 13 Middle East & North Africa          Low income  40
## 14 Middle East & North Africa Lower middle income 140
## 15 Middle East & North Africa Upper middle income  80
## 16              North America         High income  60
## 17                 South Asia          Low income  20
## 18                 South Asia Lower middle income 120
## 19                 South Asia Upper middle income  20
## 20         Sub-Saharan Africa         High income  20
## 21         Sub-Saharan Africa          Low income 440
## 22         Sub-Saharan Africa Lower middle income 380
## 23         Sub-Saharan Africa Upper middle income 120

All possible combinations

This code generates every combination of region and income category that could theoretically exist in the dataset. It takes the distinct category values from the data but ignores which pairs actually occur, pairing every region with every income group to form a complete grid of 7 × 5 = 35 combinations. This reference set lets us compare which combinations exist in the data and which do not.

all_combinations <- expand.grid(
  region = unique(dataset$region),
  income = unique(dataset$income),
  stringsAsFactors = FALSE  # keep character columns so they match region_income_counts in the join
)
all_combinations
##                        region              income
## 1       Europe & Central Asia         High income
## 2               North America         High income
## 3   Latin America & Caribbean         High income
## 4         East Asia & Pacific         High income
## 5  Middle East & North Africa         High income
## 6          Sub-Saharan Africa         High income
## 7                  South Asia         High income
## 8       Europe & Central Asia Upper middle income
## 9               North America Upper middle income
## 10  Latin America & Caribbean Upper middle income
## 11        East Asia & Pacific Upper middle income
## 12 Middle East & North Africa Upper middle income
## 13         Sub-Saharan Africa Upper middle income
## 14                 South Asia Upper middle income
## 15      Europe & Central Asia Lower middle income
## 16              North America Lower middle income
## 17  Latin America & Caribbean Lower middle income
## 18        East Asia & Pacific Lower middle income
## 19 Middle East & North Africa Lower middle income
## 20         Sub-Saharan Africa Lower middle income
## 21                 South Asia Lower middle income
## 22      Europe & Central Asia          Low income
## 23              North America          Low income
## 24  Latin America & Caribbean          Low income
## 25        East Asia & Pacific          Low income
## 26 Middle East & North Africa          Low income
## 27         Sub-Saharan Africa          Low income
## 28                 South Asia          Low income
## 29      Europe & Central Asia      Not classified
## 30              North America      Not classified
## 31  Latin America & Caribbean      Not classified
## 32        East Asia & Pacific      Not classified
## 33 Middle East & North Africa      Not classified
## 34         Sub-Saharan Africa      Not classified
## 35                 South Asia      Not classified

Find missing combinations

This code identifies region–income combinations that are not present in the dataset. The anti_join() function returns only the combinations from the full set that do not appear in the observed data. The resulting table shows combinations of region and income level that do not occur in the dataset at all. These missing combinations may reflect real-world constraints (for example, certain regions may not contain countries in specific income categories) or limitations in data availability.

Missing region–income combinations likely occur because some income categories are not economically plausible in certain regions, or because the dataset lacks observations for specific country–year pairs. For example, South Asia has no High income rows, while Europe & Central Asia, North America, and Latin America & Caribbean have no Low income rows. Half of the 12 missing combinations involve “Not classified”, which appears only in Latin America & Caribbean.

missing_combinations <- all_combinations %>%
  anti_join(region_income_counts, by = c("region", "income"))

missing_combinations
##                        region              income
## 1                  South Asia         High income
## 2               North America Upper middle income
## 3               North America Lower middle income
## 4       Europe & Central Asia          Low income
## 5               North America          Low income
## 6   Latin America & Caribbean          Low income
## 7       Europe & Central Asia      Not classified
## 8               North America      Not classified
## 9         East Asia & Pacific      Not classified
## 10 Middle East & North Africa      Not classified
## 11         Sub-Saharan Africa      Not classified
## 12                 South Asia      Not classified
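
An alternative to building the grid and anti-joining is tidyr's complete(), which fills in the absent combinations with an explicit count of zero. The sketch below is a minimal illustration on made-up data; toy, R1, and R2 are placeholders rather than values from the dataset.

```r
library(dplyr)
library(tidyr)

# Toy data: R2 has no "Low income" rows.
toy <- data.frame(
  region = c("R1", "R1", "R2"),
  income = c("High income", "Low income", "High income")
)

# Count observed pairs, then add every unobserved pair with n = 0.
full_counts <- toy %>%
  count(region, income) %>%
  complete(region, income, fill = list(n = 0L))

# The zero-count rows are the missing combinations.
missing <- full_counts %>%
  filter(n == 0)

full_counts
missing
```

Keeping the zero rows in one table can be convenient for plotting, since empty combinations then appear explicitly instead of being silently dropped.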

Most and Least Common Combinations

The goal of this analysis is to identify which region–income combinations appear most frequently and least frequently in the dataset. This helps reveal dominant economic patterns across regions as well as rare or underrepresented combinations.

This code counts the number of observations for each unique combination of region and income level. The results are then sorted in descending order based on the count (n), so the most frequent combinations appear first. The head() function is used to display the top five most common region–income combinations in the dataset.

region_income_counts <- dataset %>%
  count(region, income) %>%
  arrange(desc(n))

head(region_income_counts, 5)
##                      region              income   n
## 1     Europe & Central Asia         High income 800
## 2        Sub-Saharan Africa          Low income 440
## 3 Latin America & Caribbean Upper middle income 380
## 4        Sub-Saharan Africa Lower middle income 380
## 5 Latin America & Caribbean         High income 360

This code displays the five least frequent region–income combinations. Because the table is sorted in descending order, tail() returns its bottom five rows. Here all five are tied at 20 observations, the smallest count in the table.

tail(region_income_counts, 5)
##                       region              income  n
## 19       East Asia & Pacific          Low income 20
## 20 Latin America & Caribbean      Not classified 20
## 21                South Asia          Low income 20
## 22                South Asia Upper middle income 20
## 23        Sub-Saharan Africa         High income 20
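
Because head() and tail() cut at a fixed number of rows, tied counts at the boundary can be split arbitrarily. A hedged alternative is dplyr's slice_min() and slice_max(), which keep ties by default. The sketch below uses a made-up table; combo_counts and its values are illustrative only.

```r
library(dplyr)

# Toy counts with a three-way tie for the smallest value.
combo_counts <- data.frame(
  combo = c("a", "b", "c", "d", "e"),
  n     = c(20, 20, 20, 40, 800)
)

# Ask for the 2 rarest rows; with_ties = TRUE (the default) returns
# all three rows tied at n = 20 instead of an arbitrary two.
rarest <- combo_counts %>%
  slice_min(order_by = n, n = 2)

# The single most common row.
most_common <- combo_counts %>%
  slice_max(order_by = n, n = 1)

rarest
most_common
```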

Visualization

This stacked bar chart visualizes how income categories are distributed across regions. The height of each bar represents the total number of observations in a region, while the colored segments show the contribution of different income groups. This visualization makes it easier to compare income distributions across regions and identify dominant or missing patterns.

This analysis reveals that income distribution varies substantially across regions, with some region–income combinations appearing frequently and others not appearing at all. Missing combinations suggest structural economic differences between regions or gaps in data collection. Understanding these patterns is important for interpreting results from grouped analyses and recognizing potential biases in the dataset.

ggplot(region_income_counts, aes(x = region, y = n, fill = income)) +
  geom_col()
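
Since the regions differ greatly in total row count, a proportional version of the chart can make the income mix easier to compare. The sketch below computes the share of each income group within a region, which corresponds to the probability of an income group given that the randomly selected row is from that region, and draws the bars with position = "fill". The data are made up and toy_counts is a placeholder name.

```r
library(dplyr)
library(ggplot2)

# Toy region-income counts.
toy_counts <- data.frame(
  region = c("R1", "R1", "R2", "R2"),
  income = c("High income", "Low income", "High income", "Low income"),
  n      = c(60, 20, 10, 30)
)

# Share of each income group within its region (sums to 1 per region).
shares <- toy_counts %>%
  group_by(region) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

# position = "fill" rescales every bar to the same height of 1.
p <- ggplot(toy_counts, aes(x = region, y = n, fill = income)) +
  geom_col(position = "fill") +
  labs(y = "Share of rows within region")

shares
p
```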