Economic and Social Indicators

This section will use the World Development Indicators data set from the WDI package.

Data Preparation

Q1: Load the tidyverse and WDI packages.

library(tidyverse)
library(WDI)
library(knitr)
theme_set(theme_light())

Q2: Fetch data for GDP per capita (constant 2015 US$), Secondary education completion rate, and Life expectancy at birth for the years 2000-2020 for all available countries.

To answer this question, my first instinct was to examine the data set with functions such as glimpse() and View().

However, WDI is not a data set. Instead, we use the WDI() function to pull the indicators, countries, and years we want from the World Bank data repository.

First, we need to identify the indicator codes for the three series:

GDP per capita (constant 2015 US$): NY.GDP.PCAP.KD
Secondary education completion rate: SE.SEC.CMPT.LO.ZS
Life expectancy at birth: SP.DYN.LE00.IN

Other requirements for the data can be set in the function. To extract the data as required, we use this code:

data <- WDI(
  indicator = c("NY.GDP.PCAP.KD", "SE.SEC.CMPT.LO.ZS", "SP.DYN.LE00.IN"),
  country = "all",
  start = 2000,
  end = 2020,
  extra = TRUE
)

While running this code chunk, I ran into some problems:

The indicator argument expects a single vector, so we cannot just list the indicators separated by commas; we have to combine them with the c() function.
The data took too long to load and the program kept warning about a connection problem, which I don’t know how to fix yet.
I tried reducing the amount of data requested to speed up the download, but it didn’t seem to help.

I asked GPT, and it suggested breaking the code down to extract only one indicator at a time, which I did. It also told me to use the worldbank package. I’ll try it below:

library(worldbank)

gdp_data <- wb(indicator = "NY.GDP.PCAP.KD", startdate = 2000, enddate = 2020)
education_data <- wb(indicator = "SE.SEC.CMPT.LO.ZS", startdate = 2000, enddate = 2020)
life_expectancy_data <- wb(indicator = "SP.DYN.LE00.IN", startdate = 2000, enddate = 2020)

It turns out GPT was hallucinating again (as it often does). For some reason, R reports that the wb() function doesn’t exist. I normally use Claude, but I wanted to keep it clean for further requirements, so I used GPT as an alternative. With these answers, I lost trust in it again.

My last resort is to download the data as CSV files and import them manually into R. Alternatively, I can switch to the economics data set from the ggplot2 package. Question 2 then becomes:
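For reference, the one-indicator-at-a-time idea can also be sketched with the WDI package itself. This is only a sketch under assumptions: fetch_safely() is a hypothetical helper (not part of WDI), it needs a working connection to return real data, and it only catches errors, so connection warnings would still be printed.

```r
# Indicator codes from the list above
codes <- c("NY.GDP.PCAP.KD", "SE.SEC.CMPT.LO.ZS", "SP.DYN.LE00.IN")

# fetch_safely() is a hypothetical helper, not part of WDI.
# It requests a single indicator and returns NULL instead of stopping on an error.
fetch_safely <- function(code, fetch = WDI::WDI) {
  tryCatch(
    fetch(indicator = code, country = "all", start = 2000, end = 2020),
    error = function(e) {
      message("Failed to fetch ", code, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Not run here because it needs a working connection:
# results <- lapply(codes, fetch_safely)
# names(results) <- codes
```

Fetching the indicators separately at least shows which request fails, even if it does not make the download itself faster.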

Q2: Examine the economics data set

We will use the functions View(), glimpse(), and summary(). Running ?economics also helps us understand the context of the data.

View(economics)
glimpse(economics)
summary(economics)

Q3: Check for any missing values and handle them appropriately.

R provides many functions for checking missing values in a data set. First, we use the is.na() and sum() functions together to count the total number of missing values in the economics data set.

sum(is.na(economics))

## [1] 0

The result shows that there are no NA values in this data set, so we can proceed to the next step.
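When the total is not zero, it helps to know where the missing values are. A minimal sketch on a toy data frame (the values and the NA positions are made up for illustration) counts missing values per column and the rows that are fully complete:

```r
# Toy data frame (made-up values) with a few missing entries
toy <- data.frame(
  pce = c(525.1, NA, 622.8),
  pop = c(199657, 201621, NA),
  unemploy = c(3018, 2685, 2884)
)

na_per_column <- colSums(is.na(toy))        # missing values in each column
complete_rows <- sum(complete.cases(toy))   # rows with no missing value at all

na_per_column
complete_rows
```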

Data Transformation

Q4: Calculate year-over-year growth rates for population and personal consumption expenditures.

For this problem, I will use functions from the dplyr package. The first function that came to mind for this question is mutate().

The equation to calculate the year-over-year growth rate for year $n$ ($GR_n$), in percent, for personal consumption expenditures is:

\[ GR_n = \frac{PCE_n - PCE_{n-1}}{PCE_{n-1}} \times 100 \]

The same applies to the population’s year-over-year growth rate. The first year in the data has no previous year, so its growth rate is NA.

To calculate this, we need one population value and one PCE value per year, but the data set is recorded monthly.

We assume that the value on the first of December represents that year overall. So the first step is to keep only the rows whose date is December 1 (dates ending in “12-01”).

In the second step, we use lag() to pull the previous row’s value into each row. Its default offset is 1, which is what we need here. Then we just apply the equation above to calculate the growth rates for population and PCE.
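As a small sketch of how lag() lines up each value with the one before it, the snippet below uses three December PCE values from the economics data. The variable names are only for illustration.

```r
library(dplyr)

# PCE on December 1 of 1967, 1968, and 1969, taken from the economics data
pce <- c(525.1, 576.5, 622.8)

previous <- lag(pce)   # NA, 525.1, 576.5
growth <- round((pce - previous) / previous * 100, 2)

growth
```

The first element is NA because there is no earlier value to compare with.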

The kable() function from the knitr package produces nicer-looking tables.

The code is as follows:

economics_GR <- economics %>%
  # keep one row per year: the December 1 observation
  filter(month(date) == 12, day(date) == 1) %>%
  mutate(year = year(date)) %>%
  # growth relative to the previous year's value, in percent
  mutate(P_GR = round((pop - lag(pop)) / lag(pop) * 100, 2)) %>%
  mutate(PCE_GR = round((pce - lag(pce)) / lag(pce) * 100, 2)) %>%
  slice_head(n = 5) %>%
  select(year, pce, pop, psavert, uempmed, unemploy, P_GR, PCE_GR)
kable(economics_GR,
      col.names = c("Year", "PCE", "Population", "PSR", "UP", "Unemployment", "P-GR", "PCE-GR"),
      caption = "Table 1: US Population and PCE Growth Rate (First 5 Years)")

Table 1: US Population and PCE Growth Rate (First 5 Years)
Year	PCE	Population	PSR	UP	Unemployment	P-GR	PCE-GR
1967	525.1	199657	11.8	4.8	3018	NA	NA
1968	576.5	201621	11.1	4.4	2685	0.98	9.79
1969	622.8	203675	11.8	4.6	2884	1.02	8.03
1970	665.6	206238	13.2	5.9	5076	1.26	6.87
1971	728.4	208740	13.0	6.2	5154	1.21	9.44

At first, I tried to use the contains() function for the first step. But it turns out that it only works inside select(), so I asked Claude for help. Because the date column holds Date values, it needs date functions such as month() and day() to filter the rows we need.
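If we really want to match the “12-01” text, one possible alternative is to turn each date into text with format() before comparing. This is a sketch on a made-up set of dates, not the approach used above:

```r
library(dplyr)

# Toy monthly dates (made up for illustration)
toy_dates <- data.frame(
  date = as.Date(c("1967-11-01", "1967-12-01", "1968-01-01", "1968-12-01"))
)

# format() converts each Date to text such as "12-01", which can be compared directly
december_rows <- toy_dates %>%
  filter(format(date, "%m-%d") == "12-01")

december_rows
```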

Q5: Create a new variable for the unemployment rate.

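This question is somewhat easier than the previous one. The data set has no labor-force column, so the rate here is the number of unemployed as a percentage of the total population.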

economics_UR <- economics %>%
  filter(month(date) == 12, day(date) == 1) %>%
  mutate(year = year(date)) %>%
  # unemployed as a percentage of the total population
  mutate(UR = round((unemploy/pop)*100,2)) %>%
  slice_head(n = 5) %>%
  select(year, pop, unemploy, UR)
kable(economics_UR,
      col.names = c("Year", "Population", "Unemployment", "UR"),
      caption = "Table 2: US Unemployment Rate (First 5 Years)")

Table 2: US Unemployment Rate (First 5 Years)
Year	Population	Unemployment	UR
1967	199657	3018	1.51
1968	201621	2685	1.33
1969	203675	2884	1.42
1970	206238	5076	2.46
1971	208740	5154	2.47

Data Visualization

Q6: Create a line plot showing the trends of personal consumption expenditures and median duration of unemployment over time.

For simplicity, we will use the filtered data for this question, with one row per year.

economics_Q6 <- economics %>%
  filter(month(date) == 12, day(date) == 1) %>%
  mutate(year = year(date)) %>%
  select(year, pce, uempmed)
economics_Q6_show <- economics_Q6 %>%
  slice_head(n = 5)
kable(economics_Q6_show,
      col.names = c("Year", "PCE", "MDU"),
      caption = "Table 3: Data for Question 6")

Table 3: Data for Question 6
Year	PCE	MDU
1967	525.1	4.8
1968	576.5	4.4
1969	622.8	4.6
1970	665.6	5.9
1971	728.4	6.2

ggplot(economics_Q6, aes(x = year, y = pce)) +
  geom_line() +
  labs(title = "Personal Consumption Expenditures Over Time",
       x = "Year",
       y = "PCE")

ggplot(economics_Q6, aes(x = year, y = uempmed)) +
  geom_line() +
  labs(title = "Median Duration of Unemployment Over Time",
       x = "Year",
       y = "Median Duration of Unemployment")

We could place the two plots next to each other, or draw one plot with two y-axes. However, doing so requires more advanced techniques, which I believe we can easily ask AI chatbots for help with.
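One way to get both series into a single figure with only tidyverse functions is to reshape the data to long form and give each indicator its own panel. This is a sketch, assuming the monthly economics data is used directly. The names economics_panels and p_panels are only for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# One row per date and indicator, so each indicator can get its own panel
economics_panels <- economics %>%
  select(date, pce, uempmed) %>%
  pivot_longer(cols = c(pce, uempmed),
               names_to = "indicator",
               values_to = "value")

p_panels <- ggplot(economics_panels, aes(x = date, y = value)) +
  geom_line() +
  # free_y lets each panel use its own y-axis range
  facet_wrap(~indicator, ncol = 2, scales = "free_y") +
  labs(x = "Date", y = NULL)

p_panels
```

Separate panels with free y-axes avoid the problem that PCE and the median duration of unemployment are on very different scales.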

Q7: Generate a scatter plot of personal savings rate vs unemployment rate, using a log scale for personal savings rate and adding a smoothed trend line.

economics_Q7 <- economics %>%
  filter(month(date) == 12, day(date) == 1) %>%
  mutate(year = year(date)) %>%
  select(year, psavert, unemploy)
economics_q7_show <- economics_Q7 %>%
  slice_head(n = 5)
kable(economics_q7_show,
      col.names = c("Year", "PSR", "Unemployment"),
      caption = "Table 4: Data for Question 7")

Table 4: Data for Question 7
Year	PSR	Unemployment
1967	11.8	3018
1968	11.1	2685
1969	11.8	2884
1970	13.2	5076
1971	13.0	5154

ggplot(economics_Q7, aes(x = psavert, y = unemploy)) +
  geom_point() +
  scale_x_log10() +
  geom_smooth(color = "blue")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

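We will discuss the smoothed line in more detail in later topics. Note that the y-axis here shows the number of unemployed (unemploy) rather than an unemployment rate.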

Q8: Produce a stacked area chart showing the composition of the population (employed vs unemployed) over time.

The question asks for a stacked area chart. However, the textbook we use doesn’t cover this type of chart and introduces the bar plot instead, so we will answer this question with a stacked bar plot.

economics_Q8 <- economics %>%
  filter(month(date) == 12, day(date) == 1) %>%
  mutate(year = as.numeric(year(date))) %>%
  mutate(employ = pop - unemploy) %>%
  select(year, pop, employ, unemploy)
economics_Q8_show <- economics_Q8 %>%
  slice_head(n = 5)
kable(economics_Q8_show,
      col.names = c("Year", "Population", "Employed", "Unemployed"))

Year	Population	Employed	Unemployed
1967	199657	196639	3018
1968	201621	198936	2685
1969	203675	200791	2884
1970	206238	201162	5076
1971	208740	203586	5154

economics_Q8_long <- economics_Q8 %>%
  pivot_longer(names_to = "status",
               values_to = "amount",
               cols = c(employ, unemploy))

ggplot(economics_Q8_long, aes(x = year, y = amount, fill = status)) +
  geom_col()

With this question, which calls for a stacked bar chart, it’s crucial to turn the data into tidy (long) form. I still struggle with tidying data; however, we can always ask GPT or Claude for help, as long as we understand what we are doing and what they are doing as well.
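For comparison, the stacked area chart the question originally asks for could be sketched with geom_area(), which stacks the groups by default. As in the bar plot, the larger group is simply pop minus unemploy, so it is labelled not_unemployed here. The names economics_area and p_area are only for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

economics_area <- economics %>%
  # everyone in pop who is not counted in unemploy
  mutate(not_unemployed = pop - unemploy) %>%
  select(date, not_unemployed, unemploy) %>%
  pivot_longer(cols = c(not_unemployed, unemploy),
               names_to = "status",
               values_to = "amount")

p_area <- ggplot(economics_area, aes(x = date, y = amount, fill = status)) +
  geom_area() +   # areas are stacked by default
  labs(x = "Date", y = "People (thousands)")

p_area
```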

Advanced Techniques

Next are the advanced questions. Since I don’t have time, I will leave them for later. Here is the list:

Q9: Use purrr to apply a custom function to calculate and plot rolling averages for multiple economic indicators.

Q10: Create an animated plot using gganimate to show how the relationship between personal consumption expenditures and unemployment has changed over time.

Exploring Global Health Data

We use the gapminder dataset from the gapminder package for this section.

Data Preparation

library(gapminder)

Q1: Examine the structure of the gapminder dataset.

We use the same code again:

View(gapminder)
glimpse(gapminder)
summary(gapminder)

Data Transformation

Q2: Calculate life expectancy growth rates between consecutive years for each country.

Despite being the first question in this section that requires code, it is, just as in the previous section, unexpectedly (or ridiculously) difficult to execute.

We can use the lag() function again, but the problem is that, without grouping, the first row of each country would be calculated from the last row of the country above it. Apart from this problem, the answer is easy to finish.

s2q2 <- gapminder %>%
  group_by(country) %>%
  arrange(country, year) %>%
  mutate(egr = round((lifeExp - lag(lifeExp)) / lag(lifeExp) * 100, 2))
kable(head(s2q2, 10))

country	continent	year	lifeExp	pop	gdpPercap	egr
Afghanistan	Asia	1952	28.801	8425333	779.4453	NA
Afghanistan	Asia	1957	30.332	9240934	820.8530	5.32
Afghanistan	Asia	1962	31.997	10267083	853.1007	5.49
Afghanistan	Asia	1967	34.020	11537966	836.1971	6.32
Afghanistan	Asia	1972	36.088	13079460	739.9811	6.08
Afghanistan	Asia	1977	38.438	14880372	786.1134	6.51
Afghanistan	Asia	1982	39.854	12881816	978.0114	3.68
Afghanistan	Asia	1987	40.822	13867957	852.3959	2.43
Afghanistan	Asia	1992	41.674	16317921	649.3414	2.09
Afghanistan	Asia	1997	41.763	22227415	635.3414	0.21

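It is group_by(country) that solves the problem mentioned earlier: lag() restarts within each country, so the first row of every country is NA. arrange() only makes sure the rows are in year order within each country.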
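The difference between a grouped and an ungrouped lag() is easy to see on a toy example. The data below is made up for illustration.

```r
library(dplyr)

# Toy data (made-up values): two countries, two years each
toy <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(1952, 1957, 1952, 1957),
  lifeExp = c(30, 33, 60, 63)
)

toy_lagged <- toy %>%
  mutate(prev_ungrouped = lag(lifeExp)) %>%   # B's first row borrows A's last value
  group_by(country) %>%
  mutate(prev_grouped = lag(lifeExp)) %>%     # B's first row is NA
  ungroup()

toy_lagged
```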

Q3: Calculate the average annual growth rate of GDP per capita for each country between 1952 and 2007.

To answer this question, we follow these steps:

Filtering the data for 1952 and 2007.
Calculating the total growth rate between these two years.
Converting this to an average annual growth rate.
Adding this new variable to the dataset.

s2q3 <- gapminder %>%
  filter(year == 1952 | year == 2007) %>%
  group_by(country) %>%
  arrange(country, year) %>%
  mutate(total_gr = round(((gdpPercap - lag(gdpPercap))/lag(gdpPercap) * 100),2)) %>%
  mutate(annual_gr = round((total_gr/(2007-1952)),2))
kable(head(s2q3, 10))

country	continent	year	lifeExp	pop	gdpPercap	total_gr	annual_gr
Afghanistan	Asia	1952	28.801	8425333	779.4453	NA	NA
Afghanistan	Asia	2007	43.828	31889923	974.5803	25.04	0.46
Albania	Europe	1952	55.230	1282697	1601.0561	NA	NA
Albania	Europe	2007	76.423	3600523	5937.0295	270.82	4.92
Algeria	Africa	1952	43.077	9279525	2449.0082	NA	NA
Algeria	Africa	2007	72.301	33333216	6223.3675	154.12	2.80
Angola	Africa	1952	30.015	4232095	3520.6103	NA	NA
Angola	Africa	2007	42.731	12420476	4797.2313	36.26	0.66
Argentina	Americas	1952	62.485	17876956	5911.3151	NA	NA
Argentina	Americas	2007	75.320	40301927	12779.3796	116.19	2.11
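The annual_gr column above spreads the total growth evenly over the 55 years. Another common definition is the compound annual growth rate, which asks what constant yearly rate would turn the 1952 value into the 2007 value. A minimal sketch, where cagr() is a hypothetical helper written for this note:

```r
# cagr() is a hypothetical helper: compound annual growth rate in percent
cagr <- function(start_value, end_value, n_years) {
  ((end_value / start_value)^(1 / n_years) - 1) * 100
}

# Afghanistan's GDP per capita in 1952 and 2007, from the table above
afg_cagr <- cagr(779.4453, 974.5803, 2007 - 1952)
round(afg_cagr, 2)
```

Because growth compounds, this rate comes out lower than the simple average whenever the total growth is positive.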

Q4: Categorize countries into growth rate groups based on their life expectancy change between 1952 and 2007.

This question requires the case_when() function. It is not covered in our textbook, but after looking at it, I think it’s interesting to include here.

We also use a different way to limit the rows to the 1952 and 2007 values.

s2q4 <- gapminder %>%
  filter(year %in% c(1952, 2007)) %>%
  group_by(country) %>%
  summarize(
    lifeExp_1952 = first(lifeExp),
    lifeExp_2007 = last(lifeExp),
    lifeExp_growth = round(((lifeExp_2007 - lifeExp_1952)/lifeExp_1952 * 100),2)
  ) %>%
  mutate(growth_category = case_when(
    lifeExp_growth >= 50 ~ "high",
    lifeExp_growth >= 25 & lifeExp_growth < 50 ~ "moderate",
    lifeExp_growth >= 0 & lifeExp_growth < 25 ~ "low",
    lifeExp_growth < 0 ~ "negative"  # strictly below zero, so 0 belongs only to "low"
  ))
kable(head(s2q4, 10))

country	lifeExp_1952	lifeExp_2007	lifeExp_growth	growth_category
Afghanistan	28.801	43.828	52.18	high
Albania	55.230	76.423	38.37	moderate
Algeria	43.077	72.301	67.84	high
Angola	30.015	42.731	42.37	moderate
Argentina	62.485	75.320	20.54	low
Australia	69.120	81.235	17.53	low
Austria	66.800	79.829	19.50	low
Bahrain	50.939	75.635	48.48	moderate
Bangladesh	37.484	64.062	70.90	high
Belgium	68.000	79.441	16.83	low

summarize() gives a more organized table than my earlier attempt with mutate().
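case_when() checks its conditions from top to bottom and uses the first one that is TRUE, so the upper bounds can be left out when the conditions are listed from highest to lowest. A small sketch on made-up growth values:

```r
library(dplyr)

growth <- c(-5, 0, 10, 30, 60)   # made-up percentage changes

category <- case_when(
  growth >= 50 ~ "high",
  growth >= 25 ~ "moderate",   # only reached when growth < 50
  growth >= 0  ~ "low",        # only reached when growth < 25
  TRUE         ~ "negative"    # everything left over
)

category
```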

Data Visualization

Q5: Create a scatter plot of GDP per capita vs life expectancy, with point size representing population and color representing continents.

This question is more advanced since it requires customizing the colors and sizes of the points. We use the most recent data (2007) for simplicity.

s2q5 <- gapminder %>%
  filter(year == 2007)

ggplot(s2q5, aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) +
  geom_point() +
  scale_x_log10()

Q6: Generate a faceted line plot showing the trend of life expectancy for each continent over time.

s2q6 <- gapminder %>%
  group_by(continent, year) %>%
  summarize(mean_lifeExp = mean(lifeExp))

ggplot(s2q6, aes(x = year, y = mean_lifeExp)) +
  geom_line() +
  facet_wrap(~continent)

Q7: Produce a box plot of GDP per capita distribution for each continent, using a log scale for GDP per capita.

ggplot(gapminder, aes(x = continent, y = gdpPercap)) +
  scale_y_log10() +
  geom_boxplot()

Advanced Techniques

This section will be saved for later.

Q8: Use tidyr to reshape the data from wide to long format for GDP, life expectancy, and population.

Q9: Create a bubble chart race using gganimate to show how countries have progressed in terms of GDP per capita and life expectancy over time.

Q10: Implement a small multiples plot using facet_wrap() to show the relationship between GDP per capita and life expectancy for each continent over time.

Analyzing Iris Flower Data

We use the iris dataset that comes built-in with R.

Data Preparation

We use the same code again:

View(iris)
glimpse(iris)
summary(iris)

Data Transformation

Q1: Create a new variable that calculates the ratio of sepal length to sepal width.

s3q1 <- iris %>%
  mutate(Sepal.Ratio = round((Sepal.Length/Sepal.Width),2))
kable(head(s3q1, 10))

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species	Sepal.Ratio
5.1	3.5	1.4	0.2	setosa	1.46
4.9	3.0	1.4	0.2	setosa	1.63
4.7	3.2	1.3	0.2	setosa	1.47
4.6	3.1	1.5	0.2	setosa	1.48
5.0	3.6	1.4	0.2	setosa	1.39
5.4	3.9	1.7	0.4	setosa	1.38
4.6	3.4	1.4	0.3	setosa	1.35
5.0	3.4	1.5	0.2	setosa	1.47
4.4	2.9	1.4	0.2	setosa	1.52
4.9	3.1	1.5	0.1	setosa	1.58

Q2: Use mutate() and across() to scale all numeric variables.

Scaling is a new concept that is not introduced in the textbook. From what I’ve found, it means normalizing or standardizing the data. By default, scale() subtracts each column’s mean and divides by its standard deviation.

The across() function applies another function to multiple columns. Let’s see how it works:

s3q2 <- iris %>%
  mutate(across((Sepal.Length:Petal.Width), scale))
kable(head(s3q2, 10))

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
-0.8976739	1.01560199	-1.335752	-1.311052	setosa
-1.1392005	-0.13153881	-1.335752	-1.311052	setosa
-1.3807271	0.32731751	-1.392399	-1.311052	setosa
-1.5014904	0.09788935	-1.279104	-1.311052	setosa
-1.0184372	1.24503015	-1.335752	-1.311052	setosa
-0.5353840	1.93331463	-1.165809	-1.048667	setosa
-1.5014904	0.78617383	-1.335752	-1.179859	setosa
-1.0184372	0.78617383	-1.279104	-1.311052	setosa
-1.7430170	-0.36096697	-1.335752	-1.311052	setosa
-1.1392005	0.09788935	-1.279104	-1.442245	setosa
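To check what scale() does with its default settings, we can standardize one column by hand and compare. A small sketch in base R, with variable names chosen only for illustration:

```r
x <- iris$Sepal.Length

by_hand  <- (x - mean(x)) / sd(x)   # centre on the mean, divide by the standard deviation
by_scale <- as.numeric(scale(x))    # scale() returns a one-column matrix

all.equal(by_hand, by_scale)
head(round(by_hand, 4), 3)
```

After scaling, each column has mean 0 and standard deviation 1, which makes measurements with different ranges comparable.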

Q3: Convert the Species variable to a factor if it isn’t already.

iris <- iris %>%
  mutate(Species = as.factor(Species))

Data Visualization

Q3: Create a scatter plot matrix of all variables.

library(GGally)
ggpairs(iris,
        columns = 1:4,
        aes(color = Species, alpha = 0.5))

We use the GGally package for this problem.

Q4: Generate a violin plot showing the distribution of each measurement for each species.

The violin plot is not covered in the textbook either. But since the problem is quite simple, let’s do it quickly here.

iris_long <- iris %>%
  pivot_longer(names_to = "measure",
               values_to = "value",
               cols = -Species)

ggplot(iris_long, aes(x = value, y = measure, fill = measure)) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = .1, fill = "white", alpha = .7) +
  facet_wrap(~Species)

Advanced Techniques

Q5: Use tidyr to reshape the data from wide to long format for all measurements.

Q6: Create a parallel coordinates plot to visualize all variables across different species.

Q7: Implement a custom ggplot2 function that creates a standardized plot for comparing any two variables in the dataset.
