Problem Definition

Data Wrangling

We will use the Gapminder dataset from the gapminder R package. Using dplyr, we will filter it to the year 2007, select the columns relevant to our analysis, and add a column for total GDP.

# Load dplyr (provides %>%, filter, select, mutate) and the Gapminder dataset
library(dplyr)
library(gapminder)
data <- gapminder

# Filter data for the year 2007
filtered_data <- data %>%
  filter(year == 2007) %>%
  select(country, continent, year, lifeExp, gdpPercap, pop)

# Create a new column for total GDP in billions
# (GDP per capita times population, divided by 1e9)
filtered_data <- filtered_data %>%
  mutate(gdp_billion = (gdpPercap * pop) / 1e9)

# Display the first few rows of the wrangled data
head(filtered_data)
## # A tibble: 6 × 7
##   country     continent  year lifeExp gdpPercap      pop gdp_billion
##   <fct>       <fct>     <int>   <dbl>     <dbl>    <int>       <dbl>
## 1 Afghanistan Asia       2007    43.8      975. 31889923        31.1
## 2 Albania     Europe     2007    76.4     5937.  3600523        21.4
## 3 Algeria     Africa     2007    72.3     6223. 33333216       207. 
## 4 Angola      Africa     2007    42.7     4797. 12420476        59.6
## 5 Argentina   Americas   2007    75.3    12779. 40301927       515. 
## 6 Australia   Oceania    2007    81.2    34435. 20434176       704.
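
As a quick sanity check on the gdp_billion formula, the calculation can be reproduced by hand for a single country. This is a minimal sketch using the 2007 Australia values printed in this report; the variable names are illustrative only.

```r
# Values copied from the 2007 Australia rows shown in this report
gdp_per_capita <- 34435.367   # gdpPercap
population     <- 20434176    # pop

# Same formula as the mutate() step: total GDP, expressed in billions
gdp_billion <- (gdp_per_capita * population) / 1e9
round(gdp_billion, 1)
```

The result should agree with the gdp_billion value reported for Australia, up to rounding of the inputs.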

Table Output

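We will generate a summary table showing the top 5 countries by life expectancy in 2007 within each continent. Oceania contributes only two rows (Australia and New Zealand), because the table below has no other countries for that continent.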

# Sort by life expectancy (descending), then keep the first 5 rows of each continent
summary_table <- filtered_data %>%
  group_by(continent) %>%
  arrange(desc(lifeExp)) %>%
  slice_head(n = 5) %>%
  ungroup()

# Display the table
knitr::kable(summary_table, caption = "Top 5 Countries by Life Expectancy in 2007 (Grouped by Continent)")
Top 5 Countries by Life Expectancy in 2007 (Grouped by Continent)

country           continent  year  lifeExp  gdpPercap        pop  gdp_billion
Reunion           Africa     2007   76.442   7670.123     798094     6.121479
Libya             Africa     2007   73.952  12057.499    6036914    72.790086
Tunisia           Africa     2007   73.923   7092.923   10276158    72.887998
Mauritius         Africa     2007   72.801  10956.991    1250882    13.705903
Algeria           Africa     2007   72.301   6223.367   33333216   207.444852
Canada            Americas   2007   80.653  36319.235   33390141  1212.704378
Costa Rica        Americas   2007   78.782   9645.061    4133884    39.871565
Puerto Rico       Americas   2007   78.746  19328.709    3942491    76.203261
Chile             Americas   2007   78.553  13171.639   16284741   214.496727
Cuba              Americas   2007   78.273   8948.103   11416987   102.160375
Japan             Asia       2007   82.603  31656.068  127467972  4035.134797
Hong Kong, China  Asia       2007   82.208  39724.979    6980412   277.296718
Israel            Asia       2007   80.745  25523.277    6426679   164.029909
Singapore         Asia       2007   79.972  47143.180    4553009   214.643321
Korea, Rep.       Asia       2007   78.623  23348.140   49044790  1145.104610
Iceland           Europe     2007   81.757  36180.789     301931    10.924102
Switzerland       Europe     2007   81.701  37506.419    7554661   283.348281
Spain             Europe     2007   80.941  28821.064   40448191  1165.759889
Sweden            Europe     2007   80.884  33859.748    9031088   305.790367
France            Europe     2007   80.657  30470.017   61083916  1861.227941
Australia         Oceania    2007   81.235  34435.367   20434176   703.658359
New Zealand       Oceania    2007   80.204  25185.009    4115771   103.655730
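
The same "top n per group" selection can also be written with dplyr's slice_max(), which ranks within each group directly. The sketch below is an alternative to the arrange() and slice_head() combination, not the method used for the table above. It runs on a small toy data frame (the name toy and the choice of n = 2 are assumptions for illustration) whose values are copied from that table.

```r
library(dplyr)

# Toy subset of the summary table (values copied from the 2007 rows)
toy <- data.frame(
  country   = c("Japan", "Iceland", "Switzerland", "Spain",
                "Australia", "New Zealand"),
  continent = c("Asia", "Europe", "Europe", "Europe",
                "Oceania", "Oceania"),
  lifeExp   = c(82.603, 81.757, 81.701, 80.941, 81.235, 80.204)
)

# slice_max() ranks within each group, so no separate arrange() is needed.
# Groups with fewer than n rows (Asia here) simply return what they have.
top2 <- toy %>%
  group_by(continent) %>%
  slice_max(lifeExp, n = 2, with_ties = FALSE) %>%
  ungroup()

top2
```

Setting with_ties = FALSE caps each group at n rows; with the default (TRUE), tied values could return more than n rows per group.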

Data Visualization

Life Expectancy vs GDP Per Capita

We will use ggplot2 to visualize the relationship between GDP per capita and life expectancy.

# Scatterplot of GDP per capita vs Life Expectancy (requires ggplot2)
library(ggplot2)
plot1 <- ggplot(filtered_data, aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) +
  geom_point(alpha = 0.7) +
  scale_x_log10() +
  labs(
    title = "Life Expectancy vs GDP Per Capita (2007)",
    x = "GDP Per Capita (Log Scale)",
    y = "Life Expectancy",
    color = "Continent",
    size = "Population"
  ) +
  theme_minimal()

plot1
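
To put a number on the relationship shown in the scatterplot, one option is the Pearson correlation between log10 GDP per capita and life expectancy, matching the log-scaled x-axis. This is a minimal sketch: log_gdp_cor is a hypothetical helper name, and the example is only applied when the gapminder package is installed.

```r
# Pearson correlation between log10(GDP per capita) and life expectancy.
# `log_gdp_cor` is a hypothetical helper name, not part of any package.
log_gdp_cor <- function(df) {
  cor(log10(df$gdpPercap), df$lifeExp)
}

# Apply it to the 2007 data when the gapminder package is installed
if (requireNamespace("gapminder", quietly = TRUE)) {
  gap_2007 <- subset(gapminder::gapminder, year == 2007)
  print(log_gdp_cor(gap_2007))
}
```

A value close to 1 would indicate a strong positive linear association on the log scale; a correlation alone does not establish causation.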

Geographic Visualization

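Using leaflet, we will map each country's 2007 population as a circle marker whose radius grows with the square root of its population. Country coordinates are read from a separate CSV file and joined to the Gapminder data by country name.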

# Load required libraries
library(dplyr)
library(leaflet)

# Load the gapminder dataset
library(gapminder)

# Filter data for the year 2007, keeping only the columns the map needs.
# Note: this overwrites the filtered_data object created earlier.
filtered_data <- gapminder %>%
  filter(year == 2007) %>%
  select(country, pop, continent)

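# Load the coordinates data (the CSV must provide name, latitude, and longitude
# columns; adjust the path below for your machine)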
coordinates_data <- read.csv("C:\\Users\\vaibhav gupta\\Downloads\\longitude.csv", stringsAsFactors = FALSE)

# Join the filtered data with coordinates data
# (inner_join drops countries whose names have no exact match in the CSV)
map_data <- filtered_data %>%
  inner_join(coordinates_data, by = c("country" = "name"))

# Scale marker radius with the square root of population so that
# circle area, rather than radius, is proportional to population
map_data <- map_data %>%
  mutate(radius = sqrt(pop) / 1000)  # Adjust divisor for better scaling

# Create a Leaflet map to visualize population distribution
leaflet(data = map_data) %>%
  addTiles() %>%
  addCircleMarkers(
    lng = ~longitude, lat = ~latitude,
    radius = ~radius,  # Use the scaled radius
    popup = ~paste0("<b>", country, "</b><br>",
                    "Population: ", formatC(pop, format = "d", big.mark = ","),
                    "<br>Continent: ", continent),
    color = "blue",    # Set circle color (can modify for continents)
    stroke = FALSE, fillOpacity = 0.7
  )
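
Because the join keeps only exact name matches, it is worth checking which countries would be left off the map. The sketch below uses dplyr's anti_join() on toy data; gap_toy and coords_toy are hypothetical stand-ins for filtered_data and coordinates_data, and the spelling "South Korea" and the coordinate values are assumptions for illustration, not the contents of the actual CSV.

```r
library(dplyr)

# Toy stand-ins for filtered_data and coordinates_data
gap_toy <- data.frame(
  country = c("Japan", "Korea, Rep."),
  pop     = c(127467972, 49044790)
)
coords_toy <- data.frame(
  name      = c("Japan", "South Korea"),  # hypothetical spelling in the CSV
  latitude  = c(36.2, 35.9),              # illustrative values only
  longitude = c(138.3, 127.8)
)

# Rows of gap_toy with no matching name in coords_toy;
# these are the countries inner_join() would drop from the map
unmatched <- anti_join(gap_toy, coords_toy, by = c("country" = "name"))
unmatched$country
```

Running the same anti_join() on the real filtered_data and coordinates_data would list any countries whose names need to be harmonized before mapping.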

Summary and Interpretation

  1. Life Expectancy Trends:
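    • Japan (82.6 years) and Hong Kong (82.2) have the highest life expectancy in 2007, followed closely by high-income European countries such as Iceland and Switzerland (both about 81.7). Africa's highest value, 76.4 years in Reunion, is below every top-five entry for Europe, and many African countries are far lower (for example, Angola at 42.7).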
  2. GDP and Life Expectancy:
    • Life expectancy is positively correlated with GDP per capita, with diminishing returns at higher income levels, a pattern the logarithmic x-axis of the scatterplot makes easier to see.
  3. Population Distribution:
    • Population is heavily concentrated in a few countries, notably China and India, which has significant geographic and economic implications.
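
The continent-level comparison in point 1 can be checked with a grouped summary. This is a minimal sketch, assuming a data frame with continent and lifeExp columns; median_life_by_continent is a hypothetical helper name, and the example is only applied when the gapminder package is installed.

```r
library(dplyr)

# Median life expectancy per continent, highest first.
# `median_life_by_continent` is a hypothetical helper name.
median_life_by_continent <- function(df) {
  df %>%
    group_by(continent) %>%
    summarise(
      median_lifeExp = median(lifeExp),
      n_countries    = n(),
      .groups        = "drop"
    ) %>%
    arrange(desc(median_lifeExp))
}

# Apply it to the 2007 data when the gapminder package is installed
if (requireNamespace("gapminder", quietly = TRUE)) {
  gap_2007 <- filter(gapminder::gapminder, year == 2007)
  print(median_life_by_continent(gap_2007))
}
```

The median is used here rather than the mean so that a few extreme countries do not dominate a continent's summary.
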

Together, these results show large disparities in life expectancy and income across continents in 2007 and indicate where targeted interventions could matter most.