I chose to work with the “countries of the world_cia_kaggle” dataset. I chose this set because I wanted to make a scatter plot and figured that it had numerous variables that I could compare. I ended up deciding on making
Load in the data and libraries
As always, I first load my libraries and set the working directory. I then read in the CSV file and take a look at the first few rows.
library(tidyverse)
library(RColorBrewer)
setwd("C:/Users/andre/OneDrive/Documents/School/Data 110")
countries <- read_csv("countries of the world_cia_kaggle.csv")
head(countries)
# A tibble: 6 × 20
  Country        Region       Population `Area (sq. mi.)` Pop. Density (per sq…¹
  <chr>          <chr>             <dbl>            <dbl> <chr>
1 Afghanistan    ASIA (EX. N…   31056997           647500 48,0
2 Albania        EASTERN EUR…    3581655            28748 124,6
3 Algeria        NORTHERN AF…   32930091          2381740 13,8
4 American Samoa OCEANIA           57794              199 290,4
5 Andorra        WESTERN EUR…      71201              468 152,1
6 Angola         SUB-SAHARAN…   12127071          1246700 9,7
# ℹ abbreviated name: ¹`Pop. Density (per sq. mi.)`
# ℹ 15 more variables: `Coastline (coast/area ratio)` <chr>,
# `Net migration` <chr>, `Infant mortality (per 1000 births)` <dbl>,
# `GDP ($ per capita)` <dbl>, `Literacy (%)` <dbl>,
# `Phones (per 1000)` <chr>, `Arable (%)` <chr>, `Crops (%)` <chr>,
# `Other (%)` <dbl>, Climate <dbl>, Birthrate <dbl>, Deathrate <dbl>,
# Agriculture <chr>, Industry <chr>, Service <chr>
Clean the data
I first convert the column names to lowercase, then keep only the variables I want and drop everything else. After that, I remove the rows with NA values in GDP or literacy.
# Lowercase all column names
names(countries) <- tolower(names(countries))
# Keep only the columns used in the plots
selected_columns <- c('country', 'region', 'population', 'gdp ($ per capita)', "literacy (%)")
countries <- countries[selected_columns]
# Drop rows with missing GDP or literacy
countries <- countries[!is.na(countries$`gdp ($ per capita)`), ]
countries <- countries[!is.na(countries$`literacy (%)`), ]
head(countries)
# A tibble: 6 × 5
  country        region           population `gdp ($ per capita)` `literacy (%)`
  <chr>          <chr>                 <dbl>                <dbl>          <dbl>
1 Afghanistan    ASIA (EX. NEAR …   31056997                  700            360
2 Albania        EASTERN EUROPE      3581655                 4500            865
3 Algeria        NORTHERN AFRICA    32930091                 6000            700
4 American Samoa OCEANIA               57794                 8000            970
5 Andorra        WESTERN EUROPE        71201                19000           1000
6 Angola         SUB-SAHARAN AFR…   12127071                 1900            420
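One thing to note about the output above: the literacy column runs up to 1000 even though it is labelled as a percentage. The raw preview shows values written with a decimal comma (for example 48,0), so a likely explanation, and this is an assumption rather than something verified against the file, is that the default locale treated the comma as a grouping mark and read "36,0" as 360. The sketch below uses three made-up strings (not the real file) to show how readr's locale(decimal_mark = ",") changes the parse. The same locale argument can be passed to read_csv().

```r
library(readr)

# Hypothetical literacy strings in the same decimal-comma style as the raw file
raw_literacy <- c("36,0", "86,5", "100,0")

# Default locale: the comma is treated as a grouping mark and dropped
as_default <- parse_number(raw_literacy)

# Decimal-comma locale: the comma is treated as the decimal separator
as_percent <- parse_double(raw_literacy, locale = locale(decimal_mark = ","))

as_default
as_percent
```

If the real file behaves like this sample, reading it with the decimal-comma locale would keep literacy on a 0 to 100 scale and would also change how the other comma-formatted columns are parsed.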
First plot
At first I thought about putting the regions on the x axis, with the countries in each region rising along the y axis according to population. I was going to either size by GDP and color by literacy or vice versa. But the two points in Asia threw off the scale, so I moved on.
p1 <- ggplot(countries, aes(x = region, y = population, label = country)) +
  geom_point() +
  labs(x = "Region", y = "Population",
       title = "Scatterplot of Country Population by Region") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))
p1
Plot two
I then put population on the x axis and GDP on the y axis, but as you can see, a few very populous countries stretch the axis and cause the rest to cluster. Moving on!
p2 <- ggplot(countries, aes(x = population, y = `gdp ($ per capita)`, label = country)) +
  geom_point() +
  labs(x = "Population", y = "GDP ($ per capita)",
       title = "Scatterplot of GDP vs. Population by Country") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))
p2
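The clustering happens because population spans several orders of magnitude. One common way to handle this kind of skew, not the approach taken in this post, is a log-scaled axis. Below is a minimal sketch with made-up values. The data frame sample_pop and its numbers are hypothetical and are not taken from the dataset.

```r
library(ggplot2)

# Made-up values spanning several orders of magnitude, as population does
sample_pop <- data.frame(
  country    = c("A", "B", "C", "D"),
  population = c(6e4, 3.5e6, 3.1e7, 1.3e9),
  gdp        = c(8000, 4500, 700, 5000)
)

p_log <- ggplot(sample_pop, aes(x = population, y = gdp)) +
  geom_point() +
  scale_x_log10(labels = scales::comma) +  # log10 spacing with readable labels
  labs(x = "Population (log scale)", y = "GDP ($ per capita)") +
  theme_bw()

p_log
```

On a log scale the small and large countries are spread more evenly along the axis, at the cost of making absolute differences harder to read.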
Plot three
Now I am getting somewhere! This is the relationship that I decided to move forward with and play around with.
p3 <- ggplot(countries, aes(x = `gdp ($ per capita)`, y = `literacy (%)`, label = country)) +
  geom_point() +
  labs(x = "GDP ($ per capita)", y = "Literacy (%)",
       title = "Scatterplot of Literacy vs. GDP by Country") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))
p3
Give it some color!
I chose to color by region.
p4 <- ggplot(countries, aes(x = `gdp ($ per capita)`, y = `literacy (%)`,
                            label = country, color = region)) +
  geom_point() +
  labs(x = "GDP ($ per capita)", y = "Literacy (%)",
       title = "Scatterplot of Literacy vs. GDP by Country") +
  scale_color_manual(values = RColorBrewer::brewer.pal(n = n_distinct(countries$region), name = "Paired")) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))
p4
Now show population
I now add the population by changing the size of the points.
p8 <- ggplot(countries, aes(x = `gdp ($ per capita)`, y = `literacy (%)`,
                            label = country, color = region, size = population)) +
  geom_point() +
  labs(x = "GDP ($ per capita)", y = "Literacy (%)",
       title = "Scatterplot of Literacy vs. GDP by Country") +
  scale_color_manual(values = RColorBrewer::brewer.pal(n = n_distinct(countries$region), name = "Paired")) +
  scale_size_continuous(labels = scales::comma) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))
p8
Change up the color and X axis
I did not like the light yellow, so I decided to set the colors manually. I also changed the breaks on the x axis to every 5,000.
p9 <- ggplot(countries, aes(x = `gdp ($ per capita)`, y = `literacy (%)`,
                            label = country, color = region, size = population)) +
  geom_point() +
  labs(x = "GDP ($ per capita)", y = "Literacy (%)",
       title = "Scatterplot of Literacy vs. GDP by Country") +
  scale_color_manual(values = c(
    "ASIA (EX. NEAR EAST)" = "brown",
    "BALTICS" = "pink",
    "C.W. OF IND. STATES" = "purple",
    "EASTERN EUROPE" = "darkblue",
    "WESTERN EUROPE" = "red",
    "LATIN AMER. & CARIB" = "green",
    "OCEANIA" = "blue",
    "NEAR EAST" = "black",
    # Not in the original color list; without an entry these points are drawn
    # in the default grey. The color chosen here is an assumption.
    "NORTHERN AFRICA" = "darkcyan",
    "NORTHERN AMERICA" = "darkgreen",
    "SUB-SAHARAN AFRICA" = "orange"
  )) +
  scale_x_continuous(breaks = seq(0, max(countries$`gdp ($ per capita)`), by = 5000)) +
  scale_size_continuous(labels = scales::comma) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))
p9
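A named vector passed to scale_color_manual() only colors the levels it names. Any level without an entry falls back to the scale's na.value, which is grey by default. A quick way to catch that is to compare the region labels in the data against the names of the palette. The sketch below uses a small hypothetical sample. In this post the first vector would be unique(countries$region), and region_colors stands in for the palette.

```r
# Hypothetical sample of region labels found in the data
regions_in_data <- c("NORTHERN AFRICA", "OCEANIA", "WESTERN EUROPE")

# Hypothetical named palette that covers only some of those regions
region_colors <- c("OCEANIA" = "blue", "WESTERN EUROPE" = "red")

# Regions present in the data but absent from the palette
missing_regions <- setdiff(regions_in_data, names(region_colors))
missing_regions
```

An empty result means every region has a color assigned.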
Separate the graphs by region
To get a better final product, I chose to use facet_wrap() and separate the graphs by region.
p10 <- ggplot(countries, aes(x = `gdp ($ per capita)`, y = `literacy (%)`,
                             label = country, color = region, size = population)) +
  geom_point() +
  labs(x = "GDP ($ per capita)", y = "Literacy (%)",
       title = "Scatterplot of Literacy vs. GDP by Country, Scaled by Population") +
  scale_color_manual(values = c(
    "ASIA (EX. NEAR EAST)" = "brown",
    "BALTICS" = "pink",
    "C.W. OF IND. STATES" = "purple",
    "EASTERN EUROPE" = "darkred",
    "WESTERN EUROPE" = "red",
    "LATIN AMER. & CARIB" = "green",
    "OCEANIA" = "blue",
    "NEAR EAST" = "black",
    # Not in the original color list; without an entry these points are drawn
    # in the default grey. The color chosen here is an assumption.
    "NORTHERN AFRICA" = "darkcyan",
    "NORTHERN AMERICA" = "darkgreen",
    "SUB-SAHARAN AFRICA" = "orange"
  )) +
  scale_x_continuous(breaks = seq(0, max(countries$`gdp ($ per capita)`), by = 5000)) +
  scale_size_continuous(labels = scales::comma) +  # Optional: format the size labels with commas
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") +
  facet_wrap(~ region, nrow = 3)
p10