library(dslabs)
## Warning: package 'dslabs' was built under R version 4.5.3
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data("gapminder")
# Create an income_per_person variable, because gapminder has no built-in income column.
# Keep only the year 2010 and drop rows with missing values in the plotted variables.
gap_clean <- gapminder %>%
  filter(year == 2010) %>%
  mutate(income_per_person = gdp / population) %>% # GDP divided by population
  drop_na(income_per_person, life_expectancy, region)
# Create the multivariable scatterplot
ggplot(gap_clean, aes(x = income_per_person,
                      y = life_expectancy,
                      color = region)) +
  geom_point(alpha = 0.8, size = 3) +
  scale_x_log10(labels = scales::comma) + # log scale helps spread the data
  labs(
    title = "Income Per Person vs Life Expectancy Across World Regions (2010)",
    x = "Income Per Person (log scale)",
    y = "Life Expectancy (Years)",
    color = "World Region"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", size = 18),
    legend.position = "right",
    panel.grid.major = element_line(color = "gray80")
  )
For this assignment I used the gapminder dataset to complete the task
provided to us. This dataset has demographic and economic indicators for
countries across the globe over several decades. These indicators include
variables such as fertility, life expectancy, GDP, and population. Since
the dataset doesn’t have an income variable, I created one by dividing
GDP by population. From there I filtered the dataset to just the year
2010 and removed rows with missing values in income per person, life
expectancy, or region so that every plotted point is complete. My
visualization puts income per person on the x-axis and life expectancy
on the y-axis, with region as a third variable represented by color. I
also put income on a log scale to reduce the skew. The plot shows clear
regional patterns: European and East Asian countries tend to cluster at
higher life expectancy and higher income per person than African
countries.
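
The regional pattern described above can also be checked numerically. The following is only a minimal sketch, not part of the original analysis: it rebuilds the same 2010 subset and computes medians per region. The name `region_summary` and the choice of the median as the summary statistic are my own assumptions for illustration.

```r
library(dslabs)
library(dplyr)
library(tidyr)

data("gapminder")

# Rebuild the cleaned 2010 subset used for the plot, then summarise by region.
# `region_summary` is a hypothetical name introduced for this sketch.
region_summary <- gapminder %>%
  filter(year == 2010) %>%
  mutate(income_per_person = gdp / population) %>%
  drop_na(income_per_person, life_expectancy, region) %>%
  group_by(continent, region) %>%
  summarise(
    n_countries = n(),                                  # countries left after dropping NAs
    median_income = median(income_per_person),          # typical income per person
    median_life_expectancy = median(life_expectancy),   # typical life expectancy
    .groups = "drop"
  ) %>%
  arrange(desc(median_income))

print(region_summary)
```

Sorting by median income makes it easy to see whether the regions that look high on the plot also rank high in the table. The median is used here because income is skewed, which is the same reason the plot uses a log scale.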