DSLabs HW

Author

Andrew Hart

Welcome to my assignment.

I chose to work with the “countries of the world_cia_kaggle” dataset because I wanted to make a scatter plot, and it has numerous variables that I could compare. I ended up deciding on making a scatter plot of literacy against GDP per capita, with points sized by population and split into panels by region.

Load in the data and libraries

As always, I first load my libraries and set the working directory. I then read in my CSV file and take a look at the first rows.

library(tidyverse)
library(RColorBrewer)

setwd("C:/Users/andre/OneDrive/Documents/School/Data 110")
countries <- read_csv("countries of the world_cia_kaggle.csv")

head(countries)
# A tibble: 6 × 20
  Country        Region       Population `Area (sq. mi.)` Pop. Density (per sq…¹
  <chr>          <chr>             <dbl>            <dbl> <chr>                 
1 Afghanistan    ASIA (EX. N…   31056997           647500 48,0                  
2 Albania        EASTERN EUR…    3581655            28748 124,6                 
3 Algeria        NORTHERN AF…   32930091          2381740 13,8                  
4 American Samoa OCEANIA           57794              199 290,4                 
5 Andorra        WESTERN EUR…      71201              468 152,1                 
6 Angola         SUB-SAHARAN…   12127071          1246700 9,7                   
# ℹ abbreviated name: ¹​`Pop. Density (per sq. mi.)`
# ℹ 15 more variables: `Coastline (coast/area ratio)` <chr>,
#   `Net migration` <chr>, `Infant mortality (per 1000 births)` <dbl>,
#   `GDP ($ per capita)` <dbl>, `Literacy (%)` <dbl>,
#   `Phones (per 1000)` <chr>, `Arable (%)` <chr>, `Crops (%)` <chr>,
#   `Other (%)` <dbl>, Climate <dbl>, Birthrate <dbl>, Deathrate <dbl>,
#   Agriculture <chr>, Industry <chr>, Service <chr>

Clean the data

I first change the column names to lowercase, then keep only the variables I want and drop everything else. After that, I remove the rows with NA values in GDP or literacy.

names(countries) <- tolower(names(countries))
selected_columns <- c('country', 'region', 'population', 'gdp ($ per capita)', "literacy (%)")
countries <- countries[selected_columns]
countries <- countries[!is.na(countries$`gdp ($ per capita)`), ] 
countries <- countries[!is.na(countries$`literacy (%)`), ]

head(countries)
# A tibble: 6 × 5
  country        region           population `gdp ($ per capita)` `literacy (%)`
  <chr>          <chr>                 <dbl>                <dbl>          <dbl>
1 Afghanistan    ASIA (EX. NEAR …   31056997                  700            360
2 Albania        EASTERN EUROPE      3581655                 4500            865
3 Algeria        NORTHERN AFRICA    32930091                 6000            700
4 American Samoa OCEANIA               57794                 8000            970
5 Andorra        WESTERN EUROPE        71201                19000           1000
6 Angola         SUB-SAHARAN AFR…   12127071                 1900            420
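One thing worth checking in the preview above: literacy prints as 360, 865 and 1000, although a percentage cannot exceed 100. The first preview shows values such as 48,0 and 124,6, so the file appears to use a comma as the decimal mark, while readr's default locale treats a comma as a thousands separator. The sketch below reproduces the effect with readr's parse_number() and shows one possible fix. The input strings are illustrative (written in the style of the first preview, not read from the CSV), and it assumes every affected column uses the same convention.

```r
library(readr)

# Illustrative strings in the comma-decimal style seen in the first preview
raw <- c("36,0", "86,5", "100,0")

# Default locale: "," is a grouping mark, so it is dropped and the decimal is lost
default_parse <- parse_number(raw)

# Comma-decimal locale: "," is read as the decimal mark
comma_locale <- locale(decimal_mark = ",")
fixed_parse <- parse_number(raw, locale = comma_locale)

print(default_parse)
print(fixed_parse)
```

The same locale object can be passed to read_csv() through its locale argument, which should also let the <chr> columns in the first preview parse as numbers. The plots in this write-up were made without that change, so their literacy axis runs up to 1000.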

First plot

At first I was thinking about putting the regions on the x axis, with the countries in each region placed along the y axis according to population. I was going to either size by GDP and color by literacy or vice versa. But the two points in Asia threw off the scale, so I moved on.

p1 <- ggplot(countries, aes(x = region, y = population, label = country)) +
  geom_point() +
  labs(x = "Region", y = "Population", title = "Scatterplot of Country Population by Region") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))
p1

Plot two

I then put population on the x axis and GDP on the y axis, but as you can see, a few very populous countries stretch the axis and squeeze the rest into a cluster. Moving on!

p2 <- ggplot(countries, aes(x = population, y = `gdp ($ per capita)`, label = country)) +
  geom_point() +
  labs(x = "Population", y = "GDP ($ per capita)", title = "Scatterplot of GDP vs. Population by Country") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))
p2
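The clustering comes from how skewed population is: in the six rows previewed earlier it already ranges from 57,794 to about 33 million. One common remedy, which is not the route this write-up takes, is a log-scaled axis. A minimal sketch on those six previewed rows (toy is a hypothetical data frame, and gdp is a shortened column name used only here):

```r
library(ggplot2)

# The six rows shown in the cleaned preview; column names shortened for the sketch
toy <- data.frame(
  country = c("Afghanistan", "Albania", "Algeria", "American Samoa", "Andorra", "Angola"),
  population = c(31056997, 3581655, 32930091, 57794, 71201, 12127071),
  gdp = c(700, 4500, 6000, 8000, 19000, 1900)
)

# Same layout as p2, but with population on a log10 x axis
p2_log <- ggplot(toy, aes(x = population, y = gdp)) +
  geom_point() +
  scale_x_log10(labels = scales::comma) +
  labs(x = "Population (log scale)", y = "GDP ($ per capita)") +
  theme_bw()
p2_log
```

On a log axis, equal distances mean equal ratios, so small territories and very large countries can share one panel without the small ones collapsing into a single column of points.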

Plot three

Now I am getting somewhere! This is the relationship that I decided to move forward with and play around with.

p3 <- ggplot(countries, aes(x = `gdp ($ per capita)`, y = `literacy (%)`, label = country)) +
  geom_point() +
  labs(x = "GDP ($ per capita)", y = "Literacy (%)", title = "Scatterplot of Literacy vs. GDP by Country") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))
p3

Give it some color!

I chose to color by region.

p4 <- ggplot(countries, aes(x = `gdp ($ per capita)`, y = `literacy (%)`, label = country, color = region)) +
  geom_point() +
  labs(x = "GDP ($ per capita)", y = "Literacy (%)", title = "Scatterplot of Literacy vs. GDP by Country") +
  scale_color_manual(values = RColorBrewer::brewer.pal(n = n_distinct(countries$region), name = "Paired")) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))
p4
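brewer.pal() can only return as many colors as the chosen palette defines, so it is worth checking that limit before sizing a palette by the number of regions. A small sketch, assuming eleven regions (the ten named in the manual palette further down plus NORTHERN AFRICA from the preview); region_names, max_colors and pal are hypothetical names:

```r
library(RColorBrewer)

# Region labels as they appear elsewhere in this write-up
region_names <- c("ASIA (EX. NEAR EAST)", "BALTICS", "C.W. OF IND. STATES",
                  "EASTERN EUROPE", "LATIN AMER. & CARIB", "NEAR EAST",
                  "NORTHERN AFRICA", "NORTHERN AMERICA", "OCEANIA",
                  "SUB-SAHARAN AFRICA", "WESTERN EUROPE")

# Largest palette size that "Paired" supports
max_colors <- brewer.pal.info["Paired", "maxcolors"]

# Only request the palette if there are enough colors to go around
stopifnot(length(region_names) <= max_colors)
pal <- setNames(brewer.pal(n = length(region_names), name = "Paired"), region_names)

print(max_colors)
print(pal)
```

Naming the vector makes the region-to-color mapping explicit when it is passed on as scale_color_manual(values = pal).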

Now show population

I now show population by mapping it to the size of the points.

p8 <- ggplot(countries, aes(x = `gdp ($ per capita)`, y = `literacy (%)`, label = country, color = region, size = population)) +
  geom_point() +
  labs(x = "GDP ($ per capita)", y = "Literacy (%)", title = "Scatterplot of Literacy vs. GDP by Country") +
  scale_color_manual(values = RColorBrewer::brewer.pal(n = n_distinct(countries$region), name = "Paired")) +
  scale_size_continuous(labels = scales::comma) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))
p8
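The labels argument of a scale takes a function that turns break values into text. A quick sketch of what scales::comma returns for a few populations from the preview (pops and pop_labels are illustrative names):

```r
library(scales)

# Populations of American Samoa, Albania and Afghanistan from the preview
pops <- c(57794, 3581655, 31056997)

# comma() inserts thousands separators and returns character labels
pop_labels <- comma(pops)
print(pop_labels)
```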

Change up the color and X axis

I did not like the light yellow, so I decided to set the colors by hand. I also changed the x axis so that it has a break every 5,000 dollars.

p9 <- ggplot(countries, aes(x = `gdp ($ per capita)`, y = `literacy (%)`, label = country, color = region, size = population)) +
  geom_point() +
  labs(x = "GDP ($ per capita)" , y = "Literacy (%)", title = "Scatterplot of Literacy vs. GDP by Country") +
  scale_color_manual(values = c("ASIA (EX. NEAR EAST)" = "brown",
                                "BALTICS" = "pink",
                                "C.W. OF IND. STATES" = "purple",
                                "EASTERN EUROPE" = "darkblue",
                                "WESTERN EUROPE" = "red",
                                "LATIN AMER. & CARIB" = "green",
                                "OCEANIA" = "blue",
                                "NEAR EAST" = "black",
                                "NORTHERN AMERICA" = "darkgreen",
                                "SUB-SAHARAN AFRICA" = "orange",
                                # NORTHERN AFRICA (e.g. Algeria) had no entry and was drawn in the grey NA color
                                "NORTHERN AFRICA" = "turquoise4")) +
  scale_x_continuous(breaks = seq(0, max(countries$`gdp ($ per capita)`), by = 5000)) +
  scale_size_continuous(labels = scales::comma) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))
p9
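A named vector given to scale_color_manual() only colors the levels it names; any level left out falls back to the scale's na.value (grey by default) rather than stopping with an error. A quick check before plotting, sketched in base R on a hypothetical, deliberately incomplete palette (regions_in_data and my_colors are illustrative names):

```r
# Regions taken from the cleaned preview
regions_in_data <- c("ASIA (EX. NEAR EAST)", "EASTERN EUROPE", "NORTHERN AFRICA",
                     "OCEANIA", "WESTERN EUROPE", "SUB-SAHARAN AFRICA")

# Hypothetical palette that forgets one region on purpose
my_colors <- c("ASIA (EX. NEAR EAST)" = "brown",
               "EASTERN EUROPE" = "darkblue",
               "WESTERN EUROPE" = "red",
               "OCEANIA" = "blue",
               "SUB-SAHARAN AFRICA" = "orange")

# Regions present in the data but absent from the palette
missing_regions <- setdiff(regions_in_data, names(my_colors))
print(missing_regions)
```

On the real data, setdiff(unique(countries$region), names(my_colors)) does the same job; an empty result means every region has a color.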

Separate the graphs by region

To get a better final product, I chose to use facet_wrap() and give each region its own panel.

p10 <- ggplot(countries, aes(x = `gdp ($ per capita)`, y = `literacy (%)`, label = country, color = region, size = population)) +
  geom_point() +
  labs(x = "GDP ($ per capita)", y = "Literacy (%)", title = "Scatterplot of Literacy vs. GDP by Country, Scaled by Population") +
  scale_color_manual(values = c("ASIA (EX. NEAR EAST)" = "brown",
                                "BALTICS" = "pink",
                                "C.W. OF IND. STATES" = "purple",
                                "EASTERN EUROPE" = "darkred",
                                "WESTERN EUROPE" = "red",
                                "LATIN AMER. & CARIB" = "green",
                                "OCEANIA" = "blue",
                                "NEAR EAST" = "black",
                                "NORTHERN AMERICA" = "darkgreen",
                                "SUB-SAHARAN AFRICA" = "orange",
                                # NORTHERN AFRICA (e.g. Algeria) had no entry and was drawn in the grey NA color
                                "NORTHERN AFRICA" = "turquoise4")) +
  scale_x_continuous(breaks = seq(0, max(countries$`gdp ($ per capita)`), by = 5000)) +
  scale_size_continuous(labels = scales::comma) +  # Optional: format the size labels with commas
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") +
  facet_wrap(~ region, nrow = 3)
p10
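facet_wrap() draws one panel per distinct value of the faceting variable. A small sketch on the six previewed rows shows the mechanics; toy and the shortened column names gdp and literacy are assumptions made for the example, and literacy is kept exactly as printed in the preview:

```r
library(ggplot2)

# The six rows from the cleaned preview, with literacy as printed there
toy <- data.frame(
  region = c("ASIA (EX. NEAR EAST)", "EASTERN EUROPE", "NORTHERN AFRICA",
             "OCEANIA", "WESTERN EUROPE", "SUB-SAHARAN AFRICA"),
  gdp = c(700, 4500, 6000, 8000, 19000, 1900),
  literacy = c(360, 865, 700, 970, 1000, 420)
)

# One panel per region, arranged in three rows
p_facets <- ggplot(toy, aes(x = gdp, y = literacy)) +
  geom_point() +
  facet_wrap(~ region, nrow = 3) +
  theme_bw()
p_facets

# Number of panels that were actually built
n_panels <- length(unique(layer_data(p_facets)$PANEL))
print(n_panels)
```

By default all panels share the same axes, which keeps the regions comparable; facet_wrap(~ region, scales = "free_x") is an option when one region's range dwarfs the others.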