library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
dataset <- read.csv("dataset.csv")
This notebook explores patterns in a country–year dataset using group-by analysis and probability concepts. The goal is to understand how observations are distributed across different regions and income categories, identify rare and common groups, and interpret what these patterns mean in context. By grouping the data in multiple ways, calculating probabilities based on random row selection, and using visualizations, this analysis highlights structural patterns, anomalies, and potential limitations in the dataset.
This code groups the dataset by region using group_by()
and counts how many rows belong to each region using
summarise(). Each row in the resulting table represents a
region, and n indicates how frequently each region appears
in the dataset.
region_counts <- dataset %>%
  group_by(region) %>%
  summarise(n = n())
This code sorts the regions from least frequent to most frequent. This makes it easier to identify which regions are rare and therefore have a lower probability of being selected when randomly sampling a row.
region_counts %>%
  arrange(n)
## # A tibble: 7 × 2
## region n
## <chr> <int>
## 1 North America 60
## 2 South Asia 160
## 3 Middle East & North Africa 420
## 4 East Asia & Pacific 740
## 5 Latin America & Caribbean 840
## 6 Sub-Saharan Africa 960
## 7 Europe & Central Asia 1160
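As a possible shortcut, dplyr's count() collapses the group_by() and summarise(n = n()) steps into a single call. The sketch below is an illustration, not part of the original analysis. So that it runs without dataset.csv, it rebuilds one row per observation from the region counts printed above using tidyr's uncount(); region_rows is a hypothetical stand-in for dataset.

```r
library(dplyr)
library(tidyr)

# Region counts copied from the output above. uncount() expands them back
# into one row per observation (region_rows is a stand-in for `dataset`).
region_rows <- tibble(
  region = c("North America", "South Asia", "Middle East & North Africa",
             "East Asia & Pacific", "Latin America & Caribbean",
             "Sub-Saharan Africa", "Europe & Central Asia"),
  n = c(60, 160, 420, 740, 840, 960, 1160)
) %>%
  uncount(n)

# The two-step version used in this notebook
via_summarise <- region_rows %>%
  group_by(region) %>%
  summarise(n = n())

# The one-step equivalent
via_count <- region_rows %>%
  count(region)

# Both approaches give the same counts per region
identical(via_summarise$n, via_count$n)
```

Either form works; count() is simply shorter when the row count is the only summary needed.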
This code adds a new column called rare using
mutate(). The region with the smallest number of rows is
labeled TRUE, and all other regions are labeled
FALSE. This explicitly marks the lowest-probability group
in the dataset.
region_counts <- region_counts %>%
  mutate(rare = n == min(n))
region_counts
## # A tibble: 7 × 3
## region n rare
## <chr> <int> <lgl>
## 1 East Asia & Pacific 740 FALSE
## 2 Europe & Central Asia 1160 FALSE
## 3 Latin America & Caribbean 840 FALSE
## 4 Middle East & North Africa 420 FALSE
## 5 North America 60 TRUE
## 6 South Asia 160 FALSE
## 7 Sub-Saharan Africa 960 FALSE
This code uses mutate() to add two new columns to the
region_counts table. The first column, probability, is the
proportion of all rows that belong to each region. The second column,
probability_percent, expresses this value as a percentage,
rounded to two decimal places for clarity.
region_counts <- region_counts %>%
  mutate(
    probability = n / sum(n),
    probability_percent = round(probability * 100, 2)
  )
region_counts
## # A tibble: 7 × 5
## region n rare probability probability_percent
## <chr> <int> <lgl> <dbl> <dbl>
## 1 East Asia & Pacific 740 FALSE 0.171 17.0
## 2 Europe & Central Asia 1160 FALSE 0.267 26.7
## 3 Latin America & Caribbean 840 FALSE 0.194 19.4
## 4 Middle East & North Africa 420 FALSE 0.0968 9.68
## 5 North America 60 TRUE 0.0138 1.38
## 6 South Asia 160 FALSE 0.0369 3.69
## 7 Sub-Saharan Africa 960 FALSE 0.221 22.1
This bar chart visualizes the number of rows for each region. Regions with shorter bars represent lower probabilities of occurrence when randomly selecting a row from the dataset.
ggplot(region_counts, aes(x = region, y = n)) +
  geom_col()
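The chart above orders regions alphabetically, which can make the rarest and most common regions harder to spot. One possible variation, sketched here with the region counts copied from the earlier output, sorts the bars by n using base R's reorder() and flips the axes so the long region names stay readable. The ordering and the coord_flip() call are suggestions, not part of the original chart.

```r
library(ggplot2)

# Region counts copied from the output above
region_counts <- data.frame(
  region = c("East Asia & Pacific", "Europe & Central Asia",
             "Latin America & Caribbean", "Middle East & North Africa",
             "North America", "South Asia", "Sub-Saharan Africa"),
  n = c(740, 1160, 840, 420, 60, 160, 960)
)

# reorder() turns region into a factor whose levels follow ascending n
region_counts$region <- reorder(region_counts$region, region_counts$n)

# Bars now run from the rarest region to the most common one
p <- ggplot(region_counts, aes(x = region, y = n)) +
  geom_col() +
  coord_flip()
p
```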
Hypothesis: Regions with fewer unique countries will have lower country-year counts and lower probabilities in the dataset.
To test this, I would need to count the number of unique countries per region. If North America has fewer unique countries represented in this dataset compared to other regions, that would explain its low probability.
Alternative hypothesis: Regions with smaller geographic areas or populations contribute fewer observations to the dataset.
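The test described for the first hypothesis could look roughly like the sketch below. It assumes the dataset has country and year columns, which the notebook does not show, so both the column names and the toy rows are hypothetical. n_distinct() counts unique countries per region, and dividing n by that count gives the average number of rows each country contributes.

```r
library(dplyr)

# Hypothetical toy rows; the real dataset's column names may differ.
toy <- tibble(
  country = c("A", "A", "B", "B", "C", "C", "C"),
  region  = c("R1", "R1", "R1", "R1", "R2", "R2", "R2"),
  year    = c(2000, 2001, 2000, 2001, 2000, 2001, 2002)
)

countries_per_region <- toy %>%
  group_by(region) %>%
  summarise(
    n = n(),                            # country-year rows in the region
    n_countries = n_distinct(country),  # unique countries in the region
    rows_per_country = n / n_countries  # average rows per country
  )

countries_per_region
```

A region with few countries but many rows per country would point away from the hypothesis, while a region with few countries and a typical number of rows per country would support it.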
When grouping the data by region, North America appears least frequently, with only 60 observations out of 4,340 total rows. This means North America has a probability of 60/4,340 = 0.0138, or approximately 1.38%.
In other words, if we randomly selected a single row from this dataset, there is only a 1.38% chance it would represent North America. This is the lowest probability among all regions.
In contrast, Europe & Central Asia has the highest probability at 1,160/4,340 ≈ 0.267, or 26.7%, making it about 19 times more likely to be randomly selected than North America (1,160/60 ≈ 19.3).
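The arithmetic in the paragraphs above can be checked directly from the region counts. This short sketch uses only the numbers already printed in the notebook. It confirms that the probabilities sum to 1 and computes the ratio between the most and least common regions.

```r
# Region counts copied from the output above
n <- c(
  "North America" = 60,
  "South Asia" = 160,
  "Middle East & North Africa" = 420,
  "East Asia & Pacific" = 740,
  "Latin America & Caribbean" = 840,
  "Sub-Saharan Africa" = 960,
  "Europe & Central Asia" = 1160
)

# Probability of drawing each region when one row is selected at random
probability <- n / sum(n)

# A valid probability distribution sums to 1
sum(probability)

# Percentages for the rarest and the most common region
round(probability[["North America"]] * 100, 2)
round(probability[["Europe & Central Asia"]] * 100, 2)

# How many times more likely Europe & Central Asia is than North America
ratio <- n[["Europe & Central Asia"]] / n[["North America"]]
ratio
```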
1. Does North America have fewer countries represented in this dataset, or do those countries simply have fewer year observations?
2. Are there specific time periods where North America data is missing?
3. Would the probability distribution change if we grouped by a different variable (e.g., income level or development status)?
This analysis groups the dataset by income level to understand the distribution of observations across different economic classifications.
income_counts <- dataset %>%
  group_by(income) %>%
  summarise(n = n())
income_counts
## # A tibble: 5 × 2
## income n
## <chr> <int>
## 1 High income 1700
## 2 Low income 520
## 3 Lower middle income 1020
## 4 Not classified 20
## 5 Upper middle income 1080
The data is sorted from least to most frequent to identify which income groups are underrepresented in the dataset.
income_counts %>%
  arrange(n)
## # A tibble: 5 × 2
## income n
## <chr> <int>
## 1 Not classified 20
## 2 Low income 520
## 3 Lower middle income 1020
## 4 Upper middle income 1080
## 5 High income 1700
The code below flags the rarest income group. “Not classified” is tagged as rare, with only 20 observations; this category presumably covers countries that do not fit the standard World Bank income classifications.
income_counts <- income_counts %>%
  mutate(rare = n == min(n))
income_counts
## # A tibble: 5 × 3
## income n rare
## <chr> <int> <lgl>
## 1 High income 1700 FALSE
## 2 Low income 520 FALSE
## 3 Lower middle income 1020 FALSE
## 4 Not classified 20 TRUE
## 5 Upper middle income 1080 FALSE
The “Not classified” income group has a probability of
0.46%. This means that if a single row were randomly
selected from the dataset, there is less than a 1% chance it would
belong to the “Not classified” income group. In contrast, High
income countries account for 39.17% of all
observations, making them far more likely to be selected.
income_counts <- income_counts %>%
  mutate(
    probability = n / sum(n),
    probability_percent = round(probability * 100, 2)
  )
income_counts %>%
  arrange(n)
## # A tibble: 5 × 5
## income n rare probability probability_percent
## <chr> <int> <lgl> <dbl> <dbl>
## 1 Not classified 20 TRUE 0.00461 0.46
## 2 Low income 520 FALSE 0.120 12.0
## 3 Lower middle income 1020 FALSE 0.235 23.5
## 4 Upper middle income 1080 FALSE 0.249 24.9
## 5 High income 1700 FALSE 0.392 39.2
This bar chart clearly shows the stark difference in representation, with “Not classified” barely visible compared to other income groups.
ggplot(income_counts, aes(x = income, y = n)) +
  geom_col()
Hypothesis: The "Not classified" income
category is rare because it includes fewer unique countries, rather than
because countries in this category have fewer year observations.
How to test: Count the number of unique countries in each income category. If “Not classified” has only 1-2 countries while other categories have 20-30+ countries, this would support the hypothesis.
Alternative hypothesis: “Not classified” countries may have incomplete data collection, resulting in fewer years of observations even if multiple countries belong to this category.
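One quick, indirect check uses only the counts already shown. If every country contributed the same number of years, each group count would be a multiple of that number. The sketch below tries 20 years per country, which is purely an assumption suggested by the smallest count and is not stated anywhere in the notebook. A passing check would be consistent with the first hypothesis but would not prove it; counting unique countries in the real data remains the actual test.

```r
# Income counts copied from the output above
income_n <- c(
  "High income" = 1700,
  "Low income" = 520,
  "Lower middle income" = 1020,
  "Not classified" = 20,
  "Upper middle income" = 1080
)

# Assumption for illustration only: each country has 20 yearly rows
years_per_country <- 20

# TRUE if every group count is a whole multiple of the assumed value
divisible <- all(income_n %% years_per_country == 0)
divisible

# Number of countries per group that this assumption would imply
implied_countries <- income_n / years_per_country
implied_countries
```

Under this assumption, “Not classified” would correspond to a single country, which is the pattern the first hypothesis predicts.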
When grouping the data by income level, “Not classified” appears least frequently, with only 20 observations out of 4,340 total rows. This represents a probability of 0.46%, making it the rarest income category in the dataset. This low probability suggests that very few countries fall into the “Not classified” category, or that countries in this category have limited data availability. The category likely includes small territories, disputed regions, or countries with insufficient economic data for World Bank classification.
High income countries dominate the dataset at 39.17%, followed by Upper middle income (24.88%) and Lower middle income (23.50%). This distribution suggests the dataset may be biased toward wealthier nations with better data collection infrastructure, or the dataset focuses on economically developed regions.
Which specific countries are classified as “Not classified,” and why don’t they fit standard income categories?
Is there a correlation between income level and the number of years of data available per country?
Does the overrepresentation of high-income countries affect the conclusions we can draw from this dataset?
The goal of this analysis is to examine how population size varies across income groups. By grouping the data by income level and summarizing population using the median, we can identify which income groups tend to have smaller or larger typical country populations and whether any group stands out as unusual.
This code groups the dataset by income level and calculates the
median population for each group. The median is used because population
data can be highly skewed by very large countries. The number of
observations (n) is also included to provide context for
how frequently each income group appears in the dataset.
income_population <- dataset %>%
  group_by(income) %>%
  summarise(
    median_population = median(population, na.rm = TRUE),
    n = n()
  )
income_population %>%
  arrange(median_population)
## # A tibble: 5 × 3
## income median_population n
## <chr> <dbl> <int>
## 1 High income 2226880 1700
## 2 Upper middle income 5092308. 1080
## 3 Lower middle income 10422800 1020
## 4 Low income 16038824. 520
## 5 Not classified 28776760. 20
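The reason for preferring the median over the mean can be seen in a small example. The five population values below are made up for illustration and do not come from the dataset. A single very large value pulls the mean far above what is typical, while the median barely moves.

```r
# Hypothetical populations: four small countries and one very large one
population <- c(1e6, 2e6, 3e6, 4e6, 1.4e9)

# The mean is dominated by the single large value
mean_population <- mean(population)

# The median reflects the typical (middle) country
median_population <- median(population)

mean_population
median_population
```

Here the mean is 282 million even though four of the five values are 4 million or less, whereas the median is 3 million.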
This code adds a logical column that tags the income group with the fewest observations as rare. This identifies the lowest-probability income group when randomly selecting a row from the dataset.
income_population <- income_population %>%
  mutate(rare = n == min(n))
income_population
## # A tibble: 5 × 4
## income median_population n rare
## <chr> <dbl> <int> <lgl>
## 1 High income 2226880 1700 FALSE
## 2 Low income 16038824. 520 FALSE
## 3 Lower middle income 10422800 1020 FALSE
## 4 Not classified 28776760. 20 TRUE
## 5 Upper middle income 5092308. 1080 FALSE
This code calculates the probability of each income group based on
how many observations it contains. The probability is computed by
dividing the number of rows in each income group (n) by the
total number of rows in the dataset. The probability is also converted
into a percentage to make the results easier to interpret. Finally, the
table is sorted by the number of observations to clearly show which
income groups are least and most likely to appear when randomly
selecting a row.
The probability values are identical to those in the income-only grouping because probability depends solely on how many observations fall into each income category, not on the population statistics summarized in this analysis.
income_population <- income_population %>%
  mutate(
    probability = n / sum(n),
    probability_percent = round(probability * 100, 2)
  )
income_population %>%
  arrange(n)
## # A tibble: 5 × 6
## income median_population n rare probability probability_percent
## <chr> <dbl> <int> <lgl> <dbl> <dbl>
## 1 Not classified 28776760. 20 TRUE 0.00461 0.46
## 2 Low income 16038824. 520 FALSE 0.120 12.0
## 3 Lower middle in… 10422800 1020 FALSE 0.235 23.5
## 4 Upper middle in… 5092308. 1080 FALSE 0.249 24.9
## 5 High income 2226880 1700 FALSE 0.392 39.2
This bar chart visualizes the median population for each income group. Each bar represents an income category, and the height of the bar corresponds to the typical (median) population of countries within that group. Taller bars indicate income groups with larger median populations, allowing for easy comparison across categories.
ggplot(income_population, aes(x = income, y = median_population)) +
  geom_col()
Hypothesis: High income countries have smaller median populations because many wealthy nations are smaller European or island countries with advanced economies, while lower- and middle-income groups include several highly populous developing nations such as India, China, and Indonesia.
How to test: Examine the relationship between GDP per capita and population size. If a negative correlation exists (higher GDP per capita associated with smaller population sizes), this would support the hypothesis.
Alternative hypothesis: The “Not classified” income group has the largest median population because it may include unusual cases such as disputed territories or countries with incomplete economic classification despite having very large populations.
When grouping the data by income level and examining median population, the “Not classified” category has an unusually large median population (28.8 million) despite having only 20 observations, corresponding to a very low probability of occurrence. This suggests that the category may include a small number of highly populous countries that do not fit standard World Bank income classifications.
In contrast, high-income countries have the smallest median population (2.2 million) while being the most common income category in the dataset. This indicates that wealthier nations in this dataset tend to be smaller in population size, potentially reflecting the presence of many small European countries, city-states, or island nations.
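Overall, this pattern reveals an inverse relationship between income level and typical population size: median population rises steadily from High income (2.2 million) through Upper middle income (5.1 million) and Lower middle income (10.4 million) to Low income (16.0 million). The rarest category, “Not classified,” has the largest median of all, while the most common category, High income, has the smallest. This is notable because it suggests potential representation bias in the dataset.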
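The inverse relationship described above can be summarized with a rank correlation over the four classified income groups. This is a rough sketch, not a formal test. It uses the group medians as printed in the table (which may be rounded), leaves out “Not classified” because it has no position on the income scale, and assigns ranks 1 to 4 as an illustrative encoding of income level.

```r
# Median populations as printed in the table above (possibly rounded)
medians <- c(
  "Low income" = 16038824,
  "Lower middle income" = 10422800,
  "Upper middle income" = 5092308,
  "High income" = 2226880
)

# Illustrative encoding: 1 = lowest income level, 4 = highest
income_rank <- c(1, 2, 3, 4)

# Spearman correlation measures whether the relationship is monotonic
rho <- cor(income_rank, medians, method = "spearman")
rho
```

A value of -1 means the medians fall at every step up the income scale. With only four group-level points this describes the summary table, not the strength of the relationship across individual countries.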
Which specific countries are included in the “Not classified” income category, and why do they not fit standard income classifications?
Is there a statistical correlation between income level and population size across all observations?
Do high-income countries with large populations behave differently from smaller high-income countries in the dataset?
The goal of this analysis is to examine how income categories are distributed across different world regions. By analyzing all region–income combinations, we can identify which combinations are common, which are rare, and which do not appear in the dataset at all. This helps reveal structural patterns and potential limitations in the data.
This code counts the number of observations for each unique
combination of region and income level. Each row in the resulting table
represents one specific region–income pair, and the count
(n) shows how frequently that combination appears in the
dataset. This table allows us to identify which region–income
combinations are most common and which are least common, providing
insight into how economic classifications are distributed
geographically.
region_income_counts <- dataset %>%
  count(region, income)
region_income_counts
## region income n
## 1 East Asia & Pacific High income 300
## 2 East Asia & Pacific Low income 20
## 3 East Asia & Pacific Lower middle income 240
## 4 East Asia & Pacific Upper middle income 180
## 5 Europe & Central Asia High income 800
## 6 Europe & Central Asia Lower middle income 60
## 7 Europe & Central Asia Upper middle income 300
## 8 Latin America & Caribbean High income 360
## 9 Latin America & Caribbean Lower middle income 80
## 10 Latin America & Caribbean Not classified 20
## 11 Latin America & Caribbean Upper middle income 380
## 12 Middle East & North Africa High income 160
## 13 Middle East & North Africa Low income 40
## 14 Middle East & North Africa Lower middle income 140
## 15 Middle East & North Africa Upper middle income 80
## 16 North America High income 60
## 17 South Asia Low income 20
## 18 South Asia Lower middle income 120
## 19 South Asia Upper middle income 20
## 20 Sub-Saharan Africa High income 20
## 21 Sub-Saharan Africa Low income 440
## 22 Sub-Saharan Africa Lower middle income 380
## 23 Sub-Saharan Africa Upper middle income 120
This code generates every combination of region and income category that could theoretically exist in the dataset. It takes the category values from the data, using unique(), but ignores which pairings were actually observed, so the result is a complete grid of every region paired with every income group (7 regions × 5 income groups = 35 rows). This reference set lets us compare which combinations exist in the data and which do not.
all_combinations <- expand.grid(
  region = unique(dataset$region),
  income = unique(dataset$income)
)
all_combinations
## region income
## 1 Europe & Central Asia High income
## 2 North America High income
## 3 Latin America & Caribbean High income
## 4 East Asia & Pacific High income
## 5 Middle East & North Africa High income
## 6 Sub-Saharan Africa High income
## 7 South Asia High income
## 8 Europe & Central Asia Upper middle income
## 9 North America Upper middle income
## 10 Latin America & Caribbean Upper middle income
## 11 East Asia & Pacific Upper middle income
## 12 Middle East & North Africa Upper middle income
## 13 Sub-Saharan Africa Upper middle income
## 14 South Asia Upper middle income
## 15 Europe & Central Asia Lower middle income
## 16 North America Lower middle income
## 17 Latin America & Caribbean Lower middle income
## 18 East Asia & Pacific Lower middle income
## 19 Middle East & North Africa Lower middle income
## 20 Sub-Saharan Africa Lower middle income
## 21 South Asia Lower middle income
## 22 Europe & Central Asia Low income
## 23 North America Low income
## 24 Latin America & Caribbean Low income
## 25 East Asia & Pacific Low income
## 26 Middle East & North Africa Low income
## 27 Sub-Saharan Africa Low income
## 28 South Asia Low income
## 29 Europe & Central Asia Not classified
## 30 North America Not classified
## 31 Latin America & Caribbean Not classified
## 32 East Asia & Pacific Not classified
## 33 Middle East & North Africa Not classified
## 34 Sub-Saharan Africa Not classified
## 35 South Asia Not classified
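The size of the grid above follows from simple multiplication: 7 regions times 5 income groups gives 35 rows. The sketch below reproduces this using the category names from the output instead of dataset.csv. It also passes stringsAsFactors = FALSE, an optional setting not used in the original code, because expand.grid() converts character vectors to factors by default and character columns are often easier to join on.

```r
# Category values copied from the output above
regions <- c("Europe & Central Asia", "North America",
             "Latin America & Caribbean", "East Asia & Pacific",
             "Middle East & North Africa", "Sub-Saharan Africa",
             "South Asia")
incomes <- c("High income", "Upper middle income", "Lower middle income",
             "Low income", "Not classified")

# Every region paired with every income group, kept as character columns
grid <- expand.grid(
  region = regions,
  income = incomes,
  stringsAsFactors = FALSE
)

# The grid has length(regions) * length(incomes) rows
nrow(grid)
```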
This code identifies region–income combinations that are not present
in the dataset. The anti_join() function returns only the
combinations from the full set that do not appear in the observed data.
The resulting table shows combinations of region and income level that
do not occur in the dataset at all. These missing combinations may
reflect real-world constraints (for example, certain regions may not
contain countries in specific income categories) or limitations in data
availability.
Missing region–income combinations likely occur because some income categories are not economically plausible in certain regions, or because the dataset lacks observations for specific country–year pairs. In this dataset, for example, South Asia has no High income observations; Europe & Central Asia, Latin America & Caribbean, and North America have no Low income observations; and “Not classified” appears only in Latin America & Caribbean.
missing_combinations <- all_combinations %>%
  anti_join(region_income_counts, by = c("region", "income"))
missing_combinations
## region income
## 1 South Asia High income
## 2 North America Upper middle income
## 3 North America Lower middle income
## 4 Europe & Central Asia Low income
## 5 North America Low income
## 6 Latin America & Caribbean Low income
## 7 Europe & Central Asia Not classified
## 8 North America Not classified
## 9 East Asia & Pacific Not classified
## 10 Middle East & North Africa Not classified
## 11 Sub-Saharan Africa Not classified
## 12 South Asia Not classified
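The result above can be cross-checked with a different route to the same answer. With 35 possible pairs and 23 observed ones, 12 should be missing. The sketch below is an alternative to the anti_join() approach, not a replacement for it. It copies the 23 observed counts from the earlier output and uses tidyr's complete() to fill in unobserved pairs with a count of zero.

```r
library(dplyr)
library(tidyr)

# Observed region-income counts copied from the output above
observed <- tribble(
  ~region,                      ~income,               ~n,
  "East Asia & Pacific",        "High income",         300,
  "East Asia & Pacific",        "Low income",           20,
  "East Asia & Pacific",        "Lower middle income", 240,
  "East Asia & Pacific",        "Upper middle income", 180,
  "Europe & Central Asia",      "High income",         800,
  "Europe & Central Asia",      "Lower middle income",  60,
  "Europe & Central Asia",      "Upper middle income", 300,
  "Latin America & Caribbean",  "High income",         360,
  "Latin America & Caribbean",  "Lower middle income",  80,
  "Latin America & Caribbean",  "Not classified",       20,
  "Latin America & Caribbean",  "Upper middle income", 380,
  "Middle East & North Africa", "High income",         160,
  "Middle East & North Africa", "Low income",           40,
  "Middle East & North Africa", "Lower middle income", 140,
  "Middle East & North Africa", "Upper middle income",  80,
  "North America",              "High income",          60,
  "South Asia",                 "Low income",           20,
  "South Asia",                 "Lower middle income", 120,
  "South Asia",                 "Upper middle income",  20,
  "Sub-Saharan Africa",         "High income",          20,
  "Sub-Saharan Africa",         "Low income",          440,
  "Sub-Saharan Africa",         "Lower middle income", 380,
  "Sub-Saharan Africa",         "Upper middle income", 120
)

# complete() adds a row for every unobserved pair and sets its count to 0
zero_filled <- observed %>%
  complete(region, income, fill = list(n = 0))

# Pairs with a zero count are the missing combinations
missing_pairs <- zero_filled %>%
  filter(n == 0)

nrow(zero_filled)
nrow(missing_pairs)
```

Keeping the zero-count rows in the table can also be useful for plotting, since empty combinations then appear explicitly.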
The goal of this analysis is to identify which region–income combinations appear most frequently and least frequently in the dataset. This helps reveal dominant economic patterns across regions as well as rare or underrepresented combinations.
This code counts the number of observations for each unique
combination of region and income level. The results are then sorted in
descending order based on the count (n), so the most
frequent combinations appear first. The head() function is
used to display the top five most common region–income combinations in
the dataset.
region_income_counts <- dataset %>%
  count(region, income) %>%
  arrange(desc(n))
head(region_income_counts, 5)
## region income n
## 1 Europe & Central Asia High income 800
## 2 Sub-Saharan Africa Low income 440
## 3 Latin America & Caribbean Upper middle income 380
## 4 Sub-Saharan Africa Lower middle income 380
## 5 Latin America & Caribbean High income 360
This code displays the five least frequent region–income combinations.
Because the table is sorted in descending order of n, the
tail() function returns its bottom five rows, which are the
rare or minimally represented pairs in the dataset.
tail(region_income_counts, 5)
## region income n
## 19 East Asia & Pacific Low income 20
## 20 Latin America & Caribbean Not classified 20
## 21 South Asia Low income 20
## 22 South Asia Upper middle income 20
## 23 Sub-Saharan Africa High income 20
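All five combinations shown above share the same count of 20, so the cut-off of five rows happens to line up with the ties. With a different cut-off, tail() or head() would silently drop some tied rows. The sketch below, which uses the 23 counts from the earlier output, shows how dplyr's slice_min() keeps ties by default. It is offered as an optional alternative, not as a correction to the original code.

```r
library(dplyr)

# The 23 observed region-income counts, copied from the output above
counts <- tibble(
  n = c(300, 20, 240, 180, 800, 60, 300, 360, 80, 20, 380, 160,
        40, 140, 80, 60, 20, 120, 20, 20, 440, 380, 120)
)

# Number of combinations tied at the minimum count
n_tied <- sum(counts$n == min(counts$n))
n_tied

# Asking for the 3 smallest still returns every row tied at the minimum
with_ties <- counts %>%
  slice_min(order_by = n, n = 3)
nrow(with_ties)

# Turning ties off returns exactly 3 rows and drops the other tied rows
without_ties <- counts %>%
  slice_min(order_by = n, n = 3, with_ties = FALSE)
nrow(without_ties)
```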
This stacked bar chart visualizes how income categories are distributed across regions. The height of each bar represents the total number of observations in a region, while the colored segments show the contribution of different income groups. This visualization makes it easier to compare income distributions across regions and identify dominant or missing patterns.
This analysis reveals that income distribution varies substantially across regions, with some region–income combinations appearing frequently and others not appearing at all. Missing combinations suggest structural economic differences between regions or gaps in data collection. Understanding these patterns is important for interpreting results from grouped analyses and recognizing potential biases in the dataset.
ggplot(region_income_counts, aes(x = region, y = n, fill = income)) +
  geom_col()
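Because regions differ greatly in total size, raw counts can hide differences in income mix. One way to compare regions on equal footing is to compute each income group's share within its region. The sketch below does this for two regions only, using counts copied from the earlier output; the share column is an addition for illustration.

```r
library(dplyr)

# Counts for two regions, copied from the region-income output above
subset_counts <- tribble(
  ~region,                 ~income,               ~n,
  "Europe & Central Asia", "High income",         800,
  "Europe & Central Asia", "Lower middle income",  60,
  "Europe & Central Asia", "Upper middle income", 300,
  "Sub-Saharan Africa",    "High income",          20,
  "Sub-Saharan Africa",    "Low income",          440,
  "Sub-Saharan Africa",    "Lower middle income", 380,
  "Sub-Saharan Africa",    "Upper middle income", 120
)

# Share of each income group within its own region
shares <- subset_counts %>%
  group_by(region) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

shares
```

In this subset, High income accounts for roughly 69% of Europe & Central Asia rows but only about 2% of Sub-Saharan Africa rows. The same view can be drawn for all regions by using geom_col(position = "fill") in the chart above, which rescales every bar to the same height.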