Economic and Social Indicators

This section will use the World Development Indicators data set from the WDI package.

Data Preparation

Q1: Load the tidyverse and WDI packages.

library(tidyverse)
library(WDI)
library(knitr)
theme_set(theme_light())

Q2: Fetch data for GDP per capita (constant 2015 US$), Secondary education completion rate, and Life expectancy at birth for the years 2000-2020 for all available countries.

To answer this question, we would normally start by examining the data set with functions such as glimpse() and View().

However, WDI is not a data set. Instead, we use the WDI() function to pull the indicators, countries, and years we want from the World Bank data repository.

First, we need to identify the indicator code for each variable:

  • GDP per capita (constant 2015 US$): NY.GDP.PCAP.KD
  • Secondary education completion rate: SE.SEC.CMPT.LO.ZS
  • Life expectancy at birth: SP.DYN.LE00.IN

Other requirements for the data can be set in the function. To extract the data as required, we use this code:

data <- WDI(
  indicator = c("NY.GDP.PCAP.KD", "SE.SEC.CMPT.LO.ZS", "SP.DYN.LE00.IN"),
  country = "all",
  start = 2000,
  end = 2020,
  extra = TRUE
)

While running this code chunk, I ran into some problems:

  1. The indicator argument takes a single character vector. The codes cannot be passed as separate comma-separated arguments; they must be combined with c().
  2. The data took too long to load, and R kept warning about a connection problem that I don’t know how to fix yet.
  3. I tried to reduce the amount of data requested to speed up loading, but it did not seem to help.
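
One way to act on problems 2 and 3 is to request the indicators one at a time and catch failures, so that one dropped connection does not lose everything. The sketch below assumes the WDI package is installed. The helper fetch_indicator() and the labels gdp_pc, sec_compl, and life_exp are names made up for illustration, and I have not confirmed that every connection problem surfaces as an R error rather than a warning.

```r
# Indicator codes from the list above; the names on the left are made-up labels.
indicators <- c(
  gdp_pc    = "NY.GDP.PCAP.KD",
  sec_compl = "SE.SEC.CMPT.LO.ZS",
  life_exp  = "SP.DYN.LE00.IN"
)

# Hypothetical helper: request one indicator and return NULL (with a message)
# instead of stopping when the request raises an error.
# `fetch` defaults to WDI::WDI and is only looked up when the helper is called.
fetch_indicator <- function(code, fetch = WDI::WDI) {
  tryCatch(
    fetch(indicator = code, country = "all", start = 2000, end = 2020),
    error = function(e) {
      message("Could not fetch ", code, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Real use (needs an internet connection), one request per indicator:
# results <- lapply(indicators, fetch_indicator)
```

Indicators that failed come back as NULL, so they can be retried later without repeating the ones that succeeded.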

I asked GPT, which suggested breaking the request down and extracting only one indicator at a time, which I did. It also told me to use the worldbank package. I’ll try it below:

library(worldbank)

gdp_data <- wb(indicator = "NY.GDP.PCAP.KD", startdate = 2000, enddate = 2020)
education_data <- wb(indicator = "SE.SEC.CMPT.LO.ZS", startdate = 2000, enddate = 2020)
life_expectancy_data <- wb(indicator = "SP.DYN.LE00.IN", startdate = 2000, enddate = 2020)

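It turns out GPT was hallucinating again (as it often does): R reports that the wb() function does not exist. (A wb() function with startdate and enddate arguments did exist in older versions of the wbstats package, so the answer may have confused the two packages.) I originally used Claude, but I wanted to keep that conversation clean for later requirements, so I used GPT as an alternative. With answers like these, I have lost trust in it again.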

My last resort is to download the data as CSV files and import them into R manually. Alternatively, I can switch to the economics data set from the ggplot2 package. Question 2 then becomes:

Q2: Examine the economics data set

We will use the functions View(), glimpse(), and summary(). Running ?economics also helps us understand the context of the data.

View(economics)
glimpse(economics)
summary(economics)

Q3: Check for any missing values and handle them appropriately.

R provides many functions for checking missing values in a data set. First, we combine is.na() and sum() to count the total number of missing values in the economics data set.

sum(is.na(economics))
## [1] 0

The result shows that there are no NA values in this data set, so we can proceed to the next step.
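
sum(is.na()) gives only one total for the whole table. As a small sketch of what checking and handling could look like when values are missing, the example below uses a made-up data frame called toy (not the economics data) to count NAs per column and then drop incomplete rows. Whether dropping rows is appropriate depends on the analysis.

```r
# Made-up data with two missing values, for illustration only
toy <- data.frame(
  year = c(2001, 2002, 2003, 2004),
  pce  = c(10.5, NA, 11.2, 11.9),
  pop  = c(100, 101, NA, 103)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# One simple way to handle them: keep only rows with no missing values
toy_complete <- toy[complete.cases(toy), ]
print(toy_complete)
```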

Data Transformation

Q4: Calculate year-over-year growth rates for population and personal consumption expenditures.

For this problem, I will use functions from the dplyr package. The first function that comes to mind is mutate().

The year-over-year growth rate of personal consumption expenditures for year \(n\), written \(GR_n\), is:

\[ GR_n = \frac{PCE_n - PCE_{n-1}}{PCE_{n-1}} \]

The same formula applies to the population’s year-over-year growth rate.

To calculate these rates, we need one population value and one PCE value per year, but the data set is recorded monthly.

We assume that the value on 1 December of each year represents that year as a whole. So the first step is to keep only the rows whose date falls on 1 December.

In the second step, we use lag() to get the previous row’s value for each row. Its default offset is 1, which is what we need here. Then we just apply the equation above to calculate the growth rates for population and PCE.
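
To see what lag() does before using it in the pipeline, here is a minimal sketch on the December PCE values for 1967 to 1969 (525.1, 576.5, 622.8). It assumes dplyr is installed; the names pce_dec, previous, and growth are made up.

```r
library(dplyr)

# December PCE values for 1967, 1968, 1969
pce_dec <- c(525.1, 576.5, 622.8)

# lag() shifts the vector down by one position, so each element is paired
# with the value from the previous year; the first element has no predecessor.
previous <- lag(pce_dec)   # NA 525.1 576.5

# Growth relative to the previous year's value, in percent
growth <- round((pce_dec - previous) / previous * 100, 2)
growth                     # NA 9.79 8.03
```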

The kable() function from the knitr package renders nicer-looking tables.

The code is as follows:

economics_GR <- economics %>%
  # keep one row per year: the 1 December observation
  filter(month(date) == 12, day(date) == 1) %>%
  mutate(year = year(date)) %>%
  # growth relative to the previous year's value, in percent
  mutate(P_GR = round((pop - lag(pop)) / lag(pop) * 100, 2)) %>%
  mutate(PCE_GR = round((pce - lag(pce)) / lag(pce) * 100, 2)) %>%
  slice_head(n = 5) %>%
  select(year, pce, pop, psavert, uempmed, unemploy, P_GR, PCE_GR)
kable(economics_GR,
      col.names = c("Year", "PCE", "Population", "PSR", "UP", "Unemployment", "P-GR", "PCE-GR"),
      caption = "Table 1: US Population and PCE Growth Rate (First 5 Years)")

Table 1: US Population and PCE Growth Rate (First 5 Years)
Year  PCE    Population  PSR   UP   Unemployment  P-GR  PCE-GR
1967  525.1  199657      11.8  4.8  3018          NA    NA
1968  576.5  201621      11.1  4.4  2685          0.98  9.79
1969  622.8  203675      11.8  4.6  2884          1.02  8.03
1970  665.6  206238      13.2  5.9  5076          1.26  6.87
1971  728.4  208740      13.0  6.2  5154          1.21  9.44

At first, I tried to use the contains() function for the first step, but it turns out to work only inside select(), so I asked Claude for help. Because the date column holds Date values, filtering it requires date helpers such as month() and day() rather than string matching.

Q5: Create a new variable for the unemployment rate.

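This question is somewhat easier than the previous one. Since the data set has no labor-force column, the rate below is the number of unemployed as a percentage of the total population, which is lower than the official unemployment rate.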

economics_UR <- economics %>%
  filter(month(date) == 12, day(date) == 1) %>%
  mutate(year = year(date)) %>%
  # unemployed as a percentage of the total population
  mutate(UR = round((unemploy/pop)*100,2)) %>%
  slice_head(n = 5) %>%
  select(year, pop, unemploy, UR)
kable(economics_UR,
      col.names = c("Year", "Population", "Unemployment", "UR"),
      caption = "Table 2: US Unemployment Rate (First 5 Years)")

Table 2: US Unemployment Rate (First 5 Years)
Year  Population  Unemployment  UR
1967  199657      3018          1.51
1968  201621      2685          1.33
1969  203675      2884          1.42
1970  206238      5076          2.46
1971  208740      5154          2.47

Data Visualization

Q6: Create a line plot showing the trends of personal consumption expenditures and median duration of unemployment over time.

For simplicity, we will use the filtered data for this question, with each row corresponding to one year.

economics_Q6 <- economics %>%
  filter(month(date) == 12, day(date) == 1) %>%
  mutate(year = year(date)) %>%
  select(year, pce, uempmed)
economics_Q6_show <- economics_Q6 %>%
  slice_head(n = 5)
kable(economics_Q6_show,
      col.names = c("Year", "PCE", "MDU"),
      caption = "Table 3: Data for Question 6")

Table 3: Data for Question 6
Year  PCE    MDU
1967  525.1  4.8
1968  576.5  4.4
1969  622.8  4.6
1970  665.6  5.9
1971  728.4  6.2

ggplot(economics_Q6, aes(x = year, y = pce)) +
  geom_line() +
  labs(title = "Personal Consumption Expenditures Over Time",
       x = "Year",
       y = "PCE")

ggplot(economics_Q6, aes(x = year, y = uempmed)) +
  geom_line() +
  labs(title = "Median Duration of Unemployment Over Time",
       x = "Year",
       y = "Median Duration of Unemployment")

We could also place the two plots next to each other, or draw one plot with two y-axes. Doing so requires more advanced techniques, though, which I believe we can ask AI chatbots for help with.
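
As one possible way to show both series in a single figure without a second y-axis, the sketch below reshapes the yearly data to long form and gives each series its own panel with a free y-scale. It assumes tidyverse 2.0.0 or later is installed, so that lubridate’s month(), day(), and year() are attached; the names economics_Q6_long, series, value, and p_q6 are my own.

```r
library(tidyverse)

# Same yearly subset as above: the 1 December row of each year
economics_Q6 <- economics %>%
  filter(month(date) == 12, day(date) == 1) %>%
  mutate(year = year(date)) %>%
  select(year, pce, uempmed)

# Long form: one row per year and series
economics_Q6_long <- economics_Q6 %>%
  pivot_longer(cols = c(pce, uempmed),
               names_to = "series",
               values_to = "value")

# One panel per series; free y-scales because the units differ
p_q6 <- ggplot(economics_Q6_long, aes(x = year, y = value)) +
  geom_line() +
  facet_wrap(~series, ncol = 1, scales = "free_y") +
  labs(x = "Year", y = NULL)
p_q6
```

Stacking the panels vertically keeps the years aligned, which makes it easier to compare the timing of changes in the two series.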

Q7: Generate a scatter plot of personal savings rate vs unemployment rate, using a log scale for personal savings rate and adding a smoothed trend line.

economics_Q7 <- economics %>%
  filter(month(date) == 12, day(date) == 1) %>%
  mutate(year = year(date)) %>%
  select(year, psavert, unemploy)
economics_q7_show <- economics_Q7 %>%
  slice_head(n = 5)
kable(economics_q7_show,
      col.names = c("Year", "PSR", "Unemployment"),
      caption = "Table 4: Data for Question 7")

Table 4: Data for Question 7
Year  PSR   Unemployment
1967  11.8  3018
1968  11.1  2685
1969  11.8  2884
1970  13.2  5076
1971  13.0  5154

ggplot(economics_Q7, aes(x = psavert, y = unemploy)) +
  geom_point() +
  scale_x_log10() +
  geom_smooth(color = "blue")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

We will discuss the smoothed trend line further in later topics.

Q8: Produce a stacked area chart showing the composition of the population (employed vs unemployed) over time.

The question asks for a stacked area chart. However, our textbook doesn’t cover this type of chart and introduces bar plots instead, so we will answer this question with a stacked bar plot.

economics_Q8 <- economics %>%
  filter(month(date) == 12, day(date) == 1) %>%
  mutate(year = as.numeric(year(date))) %>%
  mutate(employ = pop - unemploy) %>%
  select(year, pop, employ, unemploy)
economics_Q8_show <- economics_Q8 %>%
  slice_head(n = 5)
kable(economics_Q8_show,
      col.names = c("Year", "Population", "Employed", "Unemployed"))

Year  Population  Employed  Unemployed
1967  199657      196639    3018
1968  201621      198936    2685
1969  203675      200791    2884
1970  206238      201162    5076
1971  208740      203586    5154

economics_Q8_long <- economics_Q8 %>%
  pivot_longer(names_to = "status",
               values_to = "amount",
               cols = c(employ, unemploy))

ggplot(economics_Q8_long, aes(x = year, y = amount, fill = status)) +
  geom_col()

For a stacked bar chart like this one, it is crucial to reshape the data into tidy (long) form first. I still struggle with tidying data, but we can always ask GPT or Claude for help, as long as we understand what we are doing and what they are doing as well.
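
For reference, the stacked area chart that the question originally asked for needs the same long-format data; only the geom changes. This is a sketch that assumes tidyverse 2.0.0 or later is installed. As in my bar plot, “employ” is simply population minus unemployed, not an official employment figure, and the name p_q8 is made up.

```r
library(tidyverse)

economics_Q8_long <- economics %>%
  filter(month(date) == 12, day(date) == 1) %>%
  mutate(year = year(date),
         employ = pop - unemploy) %>%   # everyone not counted as unemployed
  select(year, employ, unemploy) %>%
  pivot_longer(cols = c(employ, unemploy),
               names_to = "status",
               values_to = "amount")

# geom_area() stacks the groups on top of each other by default
p_q8 <- ggplot(economics_Q8_long, aes(x = year, y = amount, fill = status)) +
  geom_area() +
  labs(x = "Year", y = "People (thousands)")
p_q8
```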

Advanced Techniques

Next come the advanced questions. Since I am short on time, I will leave them for later. Here is the list:

Q9: Use purrr to apply a custom function to calculate and plot rolling averages for multiple economic indicators.

Q10: Create an animated plot using gganimate to show how the relationship between personal consumption expenditures and unemployment has changed over time.

Exploring Global Health Data

We use the gapminder dataset from the gapminder package for this section.

Data Preparation

library(gapminder)

Q1: Examine the structure of the gapminder dataset.

We use the same code again:

View(gapminder)
glimpse(gapminder)
summary(gapminder)

Data Transformation

Q2: Calculate life expectancy growth rates between consecutive years for each country.

Although this is the first question in the section that requires code, it is, just as in the previous section, unexpectedly (or ridiculously) difficult to carry out.

We can use the lag() function again, but the problem is that the first row of each country would be compared with the last row of the country above it. If we ignored this problem, the answer would be easy to finish.

s2q2 <- gapminder %>%
  group_by(country) %>%
  arrange(country, year) %>%
  mutate(egr = round((lifeExp - lag(lifeExp)) / lag(lifeExp) * 100, 2))
kable(head(s2q2, 10))

country      continent  year  lifeExp  pop       gdpPercap  egr
Afghanistan  Asia       1952  28.801   8425333   779.4453   NA
Afghanistan  Asia       1957  30.332   9240934   820.8530   5.32
Afghanistan  Asia       1962  31.997   10267083  853.1007   5.49
Afghanistan  Asia       1967  34.020   11537966  836.1971   6.32
Afghanistan  Asia       1972  36.088   13079460  739.9811   6.08
Afghanistan  Asia       1977  38.438   14880372  786.1134   6.51
Afghanistan  Asia       1982  39.854   12881816  978.0114   3.68
Afghanistan  Asia       1987  40.822   13867957  852.3959   2.43
Afghanistan  Asia       1992  41.674   16317921  649.3414   2.09
Afghanistan  Asia       1997  41.763   22227415  635.3414   0.21

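It is group_by(country) that solves the problem mentioned earlier: lag() then works within each country, so the first row of every country is NA instead of borrowing a value from the country above. arrange() only makes sure the rows are in year order within each country.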
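
A toy example makes the role of group_by() visible. The data frame toy below is made up (two countries, two years each). With grouping, lag() restarts inside each country; without it, the first row of country B borrows the last value of country A.

```r
library(dplyr)

toy <- tibble(
  country = c("A", "A", "B", "B"),
  year    = c(1952, 1957, 1952, 1957),
  lifeExp = c(30, 33, 60, 63)
)

# Without grouping: row 3 (B, 1952) is paired with row 2 (A, 1957)
ungrouped <- toy %>%
  arrange(country, year) %>%
  mutate(prev = lag(lifeExp))
ungrouped$prev   # NA 30 33 60

# With grouping: lag() works within each country, so B's first row is NA
grouped <- toy %>%
  group_by(country) %>%
  arrange(country, year) %>%
  mutate(prev = lag(lifeExp)) %>%
  ungroup()
grouped$prev     # NA 30 NA 60
```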

Q3: Calculate the average annual growth rate of GDP per capita for each country between 1952 and 2007.

To answer this question, we follow these steps:

  1. Filtering the data for 1952 and 2007.
  2. Calculating the total growth rate between these two years.
  3. Converting this to an average annual growth rate by dividing by the 55 years in between (a simple average that ignores compounding).
  4. Adding these new variables to the dataset.

s2q3 <- gapminder %>%
  filter(year == 1952 | year == 2007) %>%
  group_by(country) %>%
  arrange(country, year) %>%
  mutate(total_gr = round(((gdpPercap - lag(gdpPercap))/lag(gdpPercap) * 100),2)) %>%
  mutate(annual_gr = round((total_gr/(2007-1952)),2))
kable(head(s2q3, 10))

country      continent  year  lifeExp  pop       gdpPercap   total_gr  annual_gr
Afghanistan  Asia       1952  28.801   8425333   779.4453    NA        NA
Afghanistan  Asia       2007  43.828   31889923  974.5803    25.04     0.46
Albania      Europe     1952  55.230   1282697   1601.0561   NA        NA
Albania      Europe     2007  76.423   3600523   5937.0295   270.82    4.92
Algeria      Africa     1952  43.077   9279525   2449.0082   NA        NA
Algeria      Africa     2007  72.301   33333216  6223.3675   154.12    2.80
Angola       Africa     1952  30.015   4232095   3520.6103   NA        NA
Angola       Africa     2007  42.731   12420476  4797.2313   36.26     0.66
Argentina    Americas   1952  62.485   17876956  5911.3151   NA        NA
Argentina    Americas   2007  75.320   40301927  12779.3796  116.19    2.11
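
Dividing the total growth by 55 gives a simple average that ignores compounding. A common alternative is the compound annual growth rate, \((end/start)^{1/years} - 1\). The sketch below computes both for Afghanistan from the two gdpPercap values in the table above; the helper name cagr_pct is made up.

```r
# GDP per capita for Afghanistan, taken from the table above
gdp_1952 <- 779.4453
gdp_2007 <- 974.5803
years    <- 2007 - 1952

# Simple average: total percentage growth divided by the number of years
simple_pct <- (gdp_2007 - gdp_1952) / gdp_1952 * 100 / years

# Hypothetical helper: compound annual growth rate in percent
cagr_pct <- function(start, end, years) {
  ((end / start)^(1 / years) - 1) * 100
}

round(simple_pct, 2)                           # 0.46
round(cagr_pct(gdp_1952, gdp_2007, years), 2)  # 0.41
```

The compound rate is the constant yearly rate that, applied 55 times in a row, turns the 1952 value into the 2007 value, so it is always a little lower than the simple average when growth is positive.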

Q4: Categorize countries into growth rate groups based on their life expectancy change between 1952 and 2007.

This question requires us to use the function case_when(). The function is not covered in our textbook, but after looking at it, I think it’s interesting to include here.

We also use a different way to get the rows limited to 1952 and 2007 values.

s2q4 <- gapminder %>%
  filter(year %in% c(1952, 2007)) %>%
  group_by(country) %>%
  summarize(
    lifeExp_1952 = first(lifeExp),
    lifeExp_2007 = last(lifeExp),
    lifeExp_growth = round(((lifeExp_2007 - lifeExp_1952)/lifeExp_1952 * 100),2)
  ) %>%
  mutate(growth_category = case_when(
    # conditions are checked in order, so each one only needs a lower bound
    lifeExp_growth >= 50 ~ "high",
    lifeExp_growth >= 25 ~ "moderate",
    lifeExp_growth >= 0 ~ "low",
    lifeExp_growth < 0 ~ "negative"
  ))
kable(head(s2q4, 10))

country      lifeExp_1952  lifeExp_2007  lifeExp_growth  growth_category
Afghanistan  28.801        43.828        52.18           high
Albania      55.230        76.423        38.37           moderate
Algeria      43.077        72.301        67.84           high
Angola       30.015        42.731        42.37           moderate
Argentina    62.485        75.320        20.54           low
Australia    69.120        81.235        17.53           low
Austria      66.800        79.829        19.50           low
Bahrain      50.939        75.635        48.48           moderate
Bangladesh   37.484        64.062        70.90           high
Belgium      68.000        79.441        16.83           low

summarize() produces a better-organized table than my earlier attempt with mutate(), because it returns one row per country.

Data Visualization

Q5: Create a scatter plot of GDP per capita vs life expectancy, with point size representing population and color representing continents.

This question is more advanced since it requires customizing the colors and sizes of the points. We use the most recent data (2007) for simplicity.

s2q5 <- gapminder %>%
  filter(year == 2007)

ggplot(s2q5, aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) +
  geom_point() +
  scale_x_log10()

Q6: Generate a faceted line plot showing the trend of life expectancy for each continent over time.

s2q6 <- gapminder %>%
  group_by(continent, year) %>%
  summarize(mean_lifeExp = mean(lifeExp))

ggplot(s2q6, aes(x = year, y = mean_lifeExp)) +
  geom_line() +
  facet_wrap(~continent)

Q7: Produce a box plot of GDP per capita distribution for each continent, using a log scale for GDP per capita.

ggplot(gapminder, aes(x = continent, y = gdpPercap)) +
  scale_y_log10() +
  geom_boxplot()

Advanced Techniques

This section will be saved for later.

Q8: Use tidyr to reshape the data from wide to long format for GDP, life expectancy, and population.

Q9: Create a bubble chart race using gganimate to show how countries have progressed in terms of GDP per capita and life expectancy over time.

Q10: Implement a small multiples plot using facet_wrap() to show the relationship between GDP per capita and life expectancy for each continent over time.

Analyzing Iris Flower Data

We use the iris dataset that comes built-in with R.

Data Preparation

We use the same code again:

View(iris)
glimpse(iris)
summary(iris)

Data Transformation

Q1: Create a new variable that calculates the ratio of sepal length to sepal width.

s3q1 <- iris %>%
  mutate(Sepal.Ratio = round((Sepal.Length/Sepal.Width),2))
kable(head(s3q1, 10))

Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species  Sepal.Ratio
5.1           3.5          1.4           0.2          setosa   1.46
4.9           3.0          1.4           0.2          setosa   1.63
4.7           3.2          1.3           0.2          setosa   1.47
4.6           3.1          1.5           0.2          setosa   1.48
5.0           3.6          1.4           0.2          setosa   1.39
5.4           3.9          1.7           0.4          setosa   1.38
4.6           3.4          1.4           0.3          setosa   1.35
5.0           3.4          1.5           0.2          setosa   1.47
4.4           2.9          1.4           0.2          setosa   1.52
4.9           3.1          1.5           0.1          setosa   1.58

Q2: Use mutate() and across() to scale all numeric variables.

Scaling is a new concept that the textbook does not introduce. From what I have found, it means normalizing or standardizing the data.

The across() function is for applying another function to multiple columns. Let’s see how it works:

s3q2 <- iris %>%
  mutate(across((Sepal.Length:Petal.Width), scale))
kable(head(s3q2, 10))

Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
-0.8976739    1.01560199   -1.335752     -1.311052    setosa
-1.1392005    -0.13153881  -1.335752     -1.311052    setosa
-1.3807271    0.32731751   -1.392399     -1.311052    setosa
-1.5014904    0.09788935   -1.279104     -1.311052    setosa
-1.0184372    1.24503015   -1.335752     -1.311052    setosa
-0.5353840    1.93331463   -1.165809     -1.048667    setosa
-1.5014904    0.78617383   -1.335752     -1.179859    setosa
-1.0184372    0.78617383   -1.279104     -1.311052    setosa
-1.7430170    -0.36096697  -1.335752     -1.311052    setosa
-1.1392005    0.09788935   -1.279104     -1.442245    setosa
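
As far as I understand it, scale() with its default settings subtracts the column mean and divides by the column standard deviation (a z-score). The sketch below checks this by hand on Sepal.Length and also shows a variant wrapped in as.numeric(), because scale() returns a one-column matrix rather than a plain vector. The names manual_z and s3q2_vec are my own, and dplyr is assumed to be installed.

```r
library(dplyr)

x <- iris$Sepal.Length

# z-score by hand: centre on the mean, divide by the standard deviation
manual_z <- (x - mean(x)) / sd(x)

# scale() gives the same numbers, but as a one-column matrix
all.equal(as.numeric(scale(x)), manual_z)   # TRUE

# Variant of the pipeline above that keeps plain numeric columns
s3q2_vec <- iris %>%
  mutate(across(Sepal.Length:Petal.Width, ~ as.numeric(scale(.x))))

head(s3q2_vec$Sepal.Length, 3)
```

After scaling, every numeric column has mean 0 and standard deviation 1, which puts measurements with different ranges on a comparable footing.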

Q3: Convert the Species variable to a factor if it isn’t already.

iris <- iris %>%
  mutate(Species = as.factor(Species))

Data Visualization

Q3: Create a scatter plot matrix of all variables.

library(GGally)
ggpairs(iris,
        columns = 1:4,
        aes(color = Species, alpha = 0.5))

We use the GGally package for this problem.

Q4: Generate a violin plot showing the distribution of each measurement for each species.

The violin plot is not covered in the textbook either. But since the problem is quite simple, let’s do it quickly here.

iris_long <- iris %>%
  pivot_longer(names_to = "measure",
               values_to = "value",
               cols = -Species)

ggplot(iris_long, aes(x = value, y = measure, fill = measure)) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = .1, fill = "white", alpha = .7) +
  facet_wrap(~Species)

Advanced Techniques

Q5: Use tidyr to reshape the data from wide to long format for all measurements.

Q6: Create a parallel coordinates plot to visualize all variables across different species.

Q7: Implement a custom ggplot2 function that creates a standardized plot for comparing any two variables in the dataset.