Week 2 Data Dive (Summaries)

Introduction

This data dive explores a dataset containing 4,340 observations and 12 variables, including both numeric and categorical measures related to population and income classifications. The goal of this analysis is to better understand the structure of the data, summarize key variables, and explore how population size varies across income categories. Through summary statistics, aggregation, and visualizations, this notebook identifies basic patterns in the data that can inform further analysis in later stages of the project.

Loading the Data

This section loads the required package and imports the dataset for analysis.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

The data is imported into R and stored for use in the analysis.

data <- read.csv("dataset.csv")

Data Overview

glimpse() gives a quick overview of the dataset by showing the column names, data types, and example values.

glimpse(data)
## Rows: 4,340
## Columns: 12
## $ iso3c                     <chr> "DNK", "FIN", "POL", "SWE", "ESP", "NLD", "S…
## $ country                   <chr> "Denmark", "Finland", "Poland", "Sweden", "S…
## $ region                    <chr> "Europe & Central Asia", "Europe & Central A…
## $ income                    <chr> "High income", "High income", "High income",…
## $ year                      <int> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 20…
## $ population                <int> 5946952, 5584264, 36685849, 10536632, 483733…
## $ overall_score             <dbl> 95.25583, 95.11542, 94.65375, 94.41000, 94.3…
## $ data_use_score            <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100,…
## $ data_services_score       <dbl> 98.46667, 96.43333, 97.30000, 96.00000, 91.2…
## $ data_products_score       <dbl> 90.71250, 90.96875, 84.54375, 90.57500, 92.6…
## $ data_sources_score        <dbl> 87.100, 88.175, 91.425, 85.475, 87.800, 83.6…
## $ data_infrastructure_score <int> 100, 100, 100, 100, 100, 100, 100, 100, 100,…

Summary Statistics

Numerical Summary: Population

This section summarizes the population variable to understand its range, central tendency, and overall distribution.

summary(data$population)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 9.791e+03 7.442e+05 5.941e+06 3.338e+07 2.168e+07 1.429e+09
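
The gap between the mean and the median already hints at the shape of the distribution. As a rough check, the sketch below re-enters the printed summary values by hand (so they carry the rounding of the output above) and compares them; the variable names are illustrative and not part of the original notebook.

```r
# Values copied from the summary() output above (rounded as printed)
pop_summary <- c(
  min    = 9.791e3,
  q1     = 7.442e5,
  median = 5.941e6,
  mean   = 3.338e7,
  q3     = 2.168e7,
  max    = 1.429e9
)

# How many times larger is the mean than the median?
mean_to_median <- pop_summary[["mean"]] / pop_summary[["median"]]

# Is the mean larger than even the third quartile?
mean_above_q3 <- pop_summary[["mean"]] > pop_summary[["q3"]]

round(mean_to_median, 1)
mean_above_q3
```

A mean that sits above the third quartile means that fewer than a quarter of the observations are as large as the "average" one, which is a typical sign of strong right skew.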

Categorical Summary: Income

This section summarizes the income variable to examine the different income categories and their frequencies.

table(data$income)
## 
##         High income          Low income Lower middle income      Not classified 
##                1700                 520                1020                  20 
## Upper middle income 
##                1080
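
Because the dataset has a year column, these counts are most likely rows (one per country per year) rather than distinct countries. The sketch below uses a small made-up data frame, not the real data, to show how rows and distinct countries per income group could be counted side by side; it assumes the real dataset has the same country, income, and year columns shown by glimpse().

```r
library(dplyr)

# Hypothetical toy panel: 3 countries observed in 2 years each
toy <- data.frame(
  country = rep(c("A", "B", "C"), each = 2),
  income  = rep(c("High income", "High income", "Low income"), each = 2),
  year    = rep(c(2022, 2023), times = 3)
)

# Count rows and distinct countries within each income group
per_group <- toy %>%
  group_by(income) %>%
  summarise(
    rows      = n(),
    countries = n_distinct(country),
    .groups   = "drop"
  )

per_group
```

Applied to the real data, the same pattern would show how many countries stand behind each frequency in the table.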

Research Questions

This section lists questions motivated by the summaries and the goals of the project.

  1. How does population vary across different income groups?
  2. Are certain income categories associated with higher or lower population values?
  3. How evenly are observations distributed across income categories?
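
Question 3 can be approached directly from the counts in the income table. The sketch below re-enters those counts by hand and converts them to percentage shares; the object names are illustrative only.

```r
# Counts copied from the table(data$income) output above
income_counts <- c(
  "High income"         = 1700,
  "Low income"          = 520,
  "Lower middle income" = 1020,
  "Not classified"      = 20,
  "Upper middle income" = 1080
)

# Total number of observations (should match the 4,340 rows)
total_obs <- sum(income_counts)

# Percentage share of observations in each income category
income_share <- round(100 * income_counts / total_obs, 1)

total_obs
income_share
```

The shares show that the observations are not evenly spread: high-income records are the largest group, and the not-classified group accounts for well under 1% of the rows.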

Aggregation Analysis

Question 1: Population Variation Across Income Groups

This section addresses the first research question, “How does population vary across different income groups?”, by comparing aggregated population summaries across income categories.

The aggregation results show that lower-middle-income countries have the highest average population (about 52.4 million), followed closely by upper-middle-income countries (about 49.1 million), while high-income countries have the lowest (about 15.6 million). This suggests that economic development and population size are not directly aligned, as the wealthiest countries tend to govern smaller populations. From a policy perspective, this highlights the need for different development strategies: middle-income countries may face greater challenges in providing services to large populations with more limited resources per person, while high-income countries may be better positioned to allocate resources on a per-capita basis. International development programs and aid efforts should therefore account for population size alongside income classification when designing interventions.

data %>%
  group_by(income) %>%
  summarise(ave_population = mean(population, na.rm = TRUE))
## # A tibble: 5 × 2
##   income              ave_population
##   <chr>                        <dbl>
## 1 High income              15607777.
## 2 Low income               21779230.
## 3 Lower middle income      52414301.
## 4 Not classified           28777998.
## 5 Upper middle income      49053126.
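
To make the ranking of the group means explicit, the sketch below re-enters the printed averages by hand and sorts them. It is only a quick check on the output above, and the vector name is illustrative.

```r
# Group means copied from the summarise() output above
ave_population <- c(
  "High income"         = 15607777,
  "Low income"          = 21779230,
  "Lower middle income" = 52414301,
  "Not classified"      = 28777998,
  "Upper middle income" = 49053126
)

# Rank the income groups from largest to smallest average population
sort(ave_population, decreasing = TRUE)

# Groups with the highest and lowest averages
highest_group <- names(which.max(ave_population))
lowest_group  <- names(which.min(ave_population))

highest_group
lowest_group
```

In the notebook itself, adding arrange(desc(ave_population)) to the pipeline would give the same ordering without retyping the values.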

Visual Analysis: Population Distribution

This visualization uses a histogram to examine the distribution of the population variable. Population values are shown on the x-axis, while the y-axis represents the number of observations within each population range. The histogram is divided into 30 bins to show how population values are distributed across the dataset. Axis labels and a title are included to improve readability, and a minimal theme is applied to keep the visualization clean and consistent with the lab style. The plot shows that most observations have relatively small population values, while a few observations have very large populations. This indicates a right-skewed distribution.

The right-skewed population distribution indicates that most observations in the dataset (each row appears to be one country in one year) have relatively small populations, while a small number have extremely large populations. This concentration has important real-world implications: policies or interventions targeting only a few highly populous countries could impact a large proportion of the global population. In contrast, initiatives aimed at smaller countries may require broader coverage to achieve similar reach. This suggests that population size should be a key consideration in global planning efforts such as public health programs, climate initiatives, or infrastructure development.

ggplot(data, aes(x = population)) +
  geom_histogram(
    bins = 30,
    color = 'white'
  ) +
  labs(
    title = "Distribution of Population",
    x = "Population",
    y = "Count"
  ) +
  scale_x_continuous(labels = scales::comma) +
  theme_minimal()
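
With a distribution this skewed, most bars of a linear-scale histogram tend to collapse into the first few bins. One possible variant, sketched below, puts the x-axis on a log10 scale. The data here are simulated purely for illustration (a log-normal sample standing in for the real population column), so the shape of this plot says nothing about the actual dataset.

```r
library(ggplot2)

set.seed(42)

# Hypothetical stand-in for data$population: right-skewed simulated values
toy <- data.frame(
  population = round(rlnorm(500, meanlog = 15, sdlog = 2))
)

# Same histogram as above, but with a log10 x-axis to spread out small values
p_log <- ggplot(toy, aes(x = population)) +
  geom_histogram(bins = 30, color = "white") +
  scale_x_log10(labels = scales::comma) +
  labs(
    title = "Distribution of Population (log scale)",
    x = "Population (log10 scale)",
    y = "Count"
  ) +
  theme_minimal()

p_log
```

On the real data, replacing toy with data should work the same way, since every population value in the summary above is positive.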

Visual Analysis: Population by Income Group

This visualization uses a boxplot to examine how population values vary across different income categories. Population is treated as a numeric variable and placed on the y-axis, while income groups are treated as a categorical variable and placed on the x-axis. Color is mapped to income categories to clearly distinguish between groups. A boxplot is chosen because it summarizes the distribution of population within each income group by showing the median, spread, and potential outliers. Axis labels and a descriptive title are included to improve readability, and population values are formatted with commas to make large numbers easier to interpret.

The boxplots reveal substantial variation in population size within each income group, particularly among upper-middle-income and high-income countries. This indicates that income classification alone is not sufficient to predict population-related needs. For example, an upper-middle-income country with a very large population may face significantly different infrastructure, healthcare, and education challenges than a high-income country with a much smaller population. These findings suggest that policymakers and development organizations should segment countries using both income level and population size when designing scalable and effective interventions.

ggplot(data, aes(x = income, y = population, color = income)) +
  geom_boxplot() +
  labs(
    title = "Population Distribution by Income Group",
    x = "Income Group",
    y = "Population"
  ) +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal()
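
By default, a character column is plotted in alphabetical order, which separates "Low income" from "Lower middle income" and puts "High income" first. A possible refinement, sketched below under the assumption that the five labels in the income table are the only values present, is to convert income to a factor with levels in a logical order before plotting. The level order chosen here is a suggestion, not something defined by the dataset.

```r
# Suggested ordering from lowest to highest income, with the
# unclassified group placed last (an assumption, not from the data)
income_levels <- c(
  "Low income",
  "Lower middle income",
  "Upper middle income",
  "High income",
  "Not classified"
)

# Category labels as they appear (alphabetically) in the income table
income <- c(
  "High income",
  "Low income",
  "Lower middle income",
  "Not classified",
  "Upper middle income"
)

# Convert to a factor so plots and tables follow the chosen order
income_ordered <- factor(income, levels = income_levels)

levels(income_ordered)
as.integer(income_ordered)
```

In the notebook, the same idea would be applied with something like data$income <- factor(data$income, levels = income_levels) before calling ggplot().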

Question 2: Median Population Across Income Categories

This analysis examines whether income categories are associated with higher or lower typical population values by comparing the median population across income groups. The median is used instead of the mean to reduce the influence of extremely large population values and better represent a typical country within each income category.

The results show meaningful differences in typical population size across income categories. The not-classified category has the highest median population (about 28.8 million), followed by low-income countries (about 16.0 million), while high-income countries have the lowest median (approximately 2.2 million). Lower-middle-income (about 10.4 million) and upper-middle-income (about 5.1 million) countries fall in between. Because the not-classified category contains only 20 observations, its median should be interpreted with caution. Overall, this pattern suggests that wealthier countries tend to govern smaller populations, whereas countries with lower income classifications typically serve larger populations. From a policy perspective, this implies that lower-income countries may face greater challenges in providing services and infrastructure to large populations with more limited resources, highlighting the importance of population-aware development and aid strategies.

data %>%
  group_by(income) %>%
  summarise(median_population = median(population, na.rm = TRUE))
## # A tibble: 5 × 2
##   income              median_population
##   <chr>                           <dbl>
## 1 High income                  2226880 
## 2 Low income                  16038824.
## 3 Lower middle income         10422800 
## 4 Not classified              28776760.
## 5 Upper middle income          5092308.
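
Comparing these medians with the group means from Question 1 shows how strongly a few large values pull the averages upward. The sketch below re-enters both sets of printed values by hand and computes a mean-to-median ratio per group; the vector names are illustrative.

```r
# Group means copied from the Question 1 output
mean_population <- c(
  "High income"         = 15607777,
  "Low income"          = 21779230,
  "Lower middle income" = 52414301,
  "Not classified"      = 28777998,
  "Upper middle income" = 49053126
)

# Group medians copied from the output above
median_population <- c(
  "High income"         = 2226880,
  "Low income"          = 16038824,
  "Lower middle income" = 10422800,
  "Not classified"      = 28776760,
  "Upper middle income" = 5092308
)

# Ratios well above 1 indicate a group dominated by a few very large values
mean_to_median <- round(mean_population / median_population, 2)

sort(mean_to_median, decreasing = TRUE)
```

The ratio is largest for the upper-middle-income, high-income, and lower-middle-income groups, which is where the choice between mean and median changes the picture most, and it is essentially 1 for the not-classified group.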

Visual Analysis: Median Population by Income Group

The code below first groups the dataset by income category and then calculates the median population for each group using the summarise() function. The median is used instead of the mean to reduce the influence of the extremely large population values present in this dataset. After computing these summary values, a bar chart is created using ggplot(), where income categories are placed on the x-axis and median population values on the y-axis. The geom_col() function is used because the bar heights have already been calculated prior to plotting. Axis labels, a descriptive title, and comma formatting for large numbers are added to improve readability. This visualization directly addresses Research Question 2 by enabling a clear comparison of typical population size across income categories.

The visualization shows clear differences in median population across income categories. Countries in the low-income and not-classified groups have higher median population values, while high-income countries have substantially smaller median populations. Lower-middle-income and upper-middle-income groups fall between these extremes. This pattern suggests that countries with lower income classifications often govern much larger populations, which may increase the complexity of delivering infrastructure, healthcare, and public services. From a policy and development perspective, these results highlight the importance of considering population size alongside income level when designing scalable and effective interventions, as income classification alone does not fully capture population-related challenges.

data %>%
  group_by(income) %>%
  summarise(median_population = median(population, na.rm = TRUE)) %>%
  ggplot(aes(x = income, y = median_population, fill = income)) +
  geom_col() +
  labs(
    title = "Median Population by Income Category",
    x = "Income Category",
    y = "Median Population"
  ) +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal()

Conclusion

This data dive examined population patterns across income categories using summary statistics, aggregation, and visualizations. The analysis showed that population size is unevenly distributed across countries, with lower-income and not-classified income groups often representing larger typical populations, while high-income countries tend to govern smaller populations. These findings suggest that income classification alone is insufficient for understanding population-related challenges. Future analyses and policy decisions should consider both population size and income level to better inform resource allocation, development planning, and scalable interventions.