We use the gapminder dataset from the
gapminder package for this section.
library(gapminder)
Q1: Examine the structure of the gapminder dataset.
We use the same code as before:
View(gapminder)
glimpse(gapminder)
summary(gapminder)
Q2: Calculate life expectancy growth rates between consecutive years for each country.
Although this is the first question in the section that requires real code, it is, just as in the previous section, unexpectedly difficult to execute.
We can use the lag() function again, but the problem is that the first row of each country would use data from the last row of the country above it in the calculation. If we ignore this problem, the answer is easy to finish.
s2q2 <- gapminder %>%
group_by(country) %>%
arrange(country, year) %>%
mutate(egr = round((lifeExp - lag(lifeExp)) / lag(lifeExp) * 100, 2))
kable(head(s2q2, 10))
| country | continent | year | lifeExp | pop | gdpPercap | egr |
|---|---|---|---|---|---|---|
| Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 | NA |
| Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 | 5.32 |
| Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 | 5.49 |
| Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971 | 6.32 |
| Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.9811 | 6.08 |
| Afghanistan | Asia | 1977 | 38.438 | 14880372 | 786.1134 | 6.51 |
| Afghanistan | Asia | 1982 | 39.854 | 12881816 | 978.0114 | 3.68 |
| Afghanistan | Asia | 1987 | 40.822 | 13867957 | 852.3959 | 2.43 |
| Afghanistan | Asia | 1992 | 41.674 | 16317921 | 649.3414 | 2.09 |
| Afghanistan | Asia | 1997 | 41.763 | 22227415 | 635.3414 | 0.21 |
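The first row of each country is NA, so the problem mentioned earlier does not occur. It is group_by(country) that makes lag() restart within each country; arrange() only ensures that the rows are in chronological order.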
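A minimal sketch of how lag() behaves with and without grouping, on a made-up two-country data frame. The names toy and prev and all the values are illustrative only and are not part of the gapminder data.

```r
library(dplyr)

# Toy data: two hypothetical countries, two years each
toy <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(2000, 2005, 2000, 2005),
  value   = c(10, 12, 100, 110)
)

# Without grouping, B's first row is compared with A's last row
ungrouped <- toy %>%
  arrange(country, year) %>%
  mutate(prev = lag(value))

# With grouping, lag() restarts for every country
grouped <- toy %>%
  group_by(country) %>%
  arrange(country, year) %>%
  mutate(prev = lag(value)) %>%
  ungroup()

ungrouped$prev  # NA 10 12 100
grouped$prev    # NA 10 NA 100
```

In the grouped version the first row of country B gets NA instead of borrowing the value 12 from country A.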
Q3: Calculate the average annual growth rate of GDP per capita for each country between 1952 and 2007.
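To answer this question, we keep only the 1952 and 2007 rows, compute the total growth of GDP per capita for each country, and divide it by the number of years: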
s2q3 <- gapminder %>%
filter(year == 1952 | year == 2007) %>%
group_by(country) %>%
arrange(country, year) %>%
mutate(total_gr = round(((gdpPercap - lag(gdpPercap))/lag(gdpPercap) * 100),2)) %>%
mutate(annual_gr = round((total_gr/(2007-1952)),2))
kable(head(s2q3, 10))
| country | continent | year | lifeExp | pop | gdpPercap | total_gr | annual_gr |
|---|---|---|---|---|---|---|---|
| Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 | NA | NA |
| Afghanistan | Asia | 2007 | 43.828 | 31889923 | 974.5803 | 25.04 | 0.46 |
| Albania | Europe | 1952 | 55.230 | 1282697 | 1601.0561 | NA | NA |
| Albania | Europe | 2007 | 76.423 | 3600523 | 5937.0295 | 270.82 | 4.92 |
| Algeria | Africa | 1952 | 43.077 | 9279525 | 2449.0082 | NA | NA |
| Algeria | Africa | 2007 | 72.301 | 33333216 | 6223.3675 | 154.12 | 2.80 |
| Angola | Africa | 1952 | 30.015 | 4232095 | 3520.6103 | NA | NA |
| Angola | Africa | 2007 | 42.731 | 12420476 | 4797.2313 | 36.26 | 0.66 |
| Argentina | Americas | 1952 | 62.485 | 17876956 | 5911.3151 | NA | NA |
| Argentina | Americas | 2007 | 75.320 | 40301927 | 12779.3796 | 116.19 | 2.11 |
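The annual_gr column divides total growth by the number of years, which gives a simple arithmetic average. Another common reading of "average annual growth rate" is the compound rate. The sketch below compares the two for Afghanistan, with the two GDP values copied by hand from the table above. The variable names are illustrative, and the compound formula is an assumption about what the question might intend, not necessarily the expected answer.

```r
# GDP per capita for Afghanistan, copied from the table above
gdp_1952 <- 779.4453
gdp_2007 <- 974.5803
years <- 2007 - 1952

# Simple average: total percentage growth divided by the number of years
simple_rate <- (gdp_2007 - gdp_1952) / gdp_1952 * 100 / years

# Compound rate: the constant yearly rate that turns gdp_1952 into gdp_2007
compound_rate <- ((gdp_2007 / gdp_1952)^(1 / years) - 1) * 100

round(simple_rate, 2)    # 0.46, matching annual_gr in the table
round(compound_rate, 2)  # slightly lower than the simple average
```

For positive growth the compound rate is always below the simple average, because each year's growth builds on the previous year's level.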
Q4: Categorize countries into growth rate groups based on their life expectancy change between 1952 and 2007.
This question requires the case_when() function. It is not covered in our textbook, but after looking at it, I think it is interesting to include here.
We also use a different way to limit the rows to the 1952 and 2007 values.
s2q4 <- gapminder %>%
filter(year %in% c(1952, 2007)) %>%
group_by(country) %>%
summarize(
lifeExp_1952 = first(lifeExp),
lifeExp_2007 = last(lifeExp),
lifeExp_growth = round(((lifeExp_2007 - lifeExp_1952)/lifeExp_1952 * 100),2)
) %>%
mutate(growth_category = case_when(
lifeExp_growth >= 50 ~ "high",
lifeExp_growth >= 25 & lifeExp_growth < 50 ~ "moderate",
lifeExp_growth >= 0 & lifeExp_growth < 25 ~ "low",
lifeExp_growth < 0 ~ "negative"
))
kable(head(s2q4, 10))
| country | lifeExp_1952 | lifeExp_2007 | lifeExp_growth | growth_category |
|---|---|---|---|---|
| Afghanistan | 28.801 | 43.828 | 52.18 | high |
| Albania | 55.230 | 76.423 | 38.37 | moderate |
| Algeria | 43.077 | 72.301 | 67.84 | high |
| Angola | 30.015 | 42.731 | 42.37 | moderate |
| Argentina | 62.485 | 75.320 | 20.54 | low |
| Australia | 69.120 | 81.235 | 17.53 | low |
| Austria | 66.800 | 79.829 | 19.50 | low |
| Bahrain | 50.939 | 75.635 | 48.48 | moderate |
| Bangladesh | 37.484 | 64.062 | 70.90 | high |
| Belgium | 68.000 | 79.441 | 16.83 | low |
Using summarize() gives a more organized table than my earlier attempt with the mutate() function.
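case_when() checks its conditions in order and uses the first one that matches, so each condition only needs a lower bound. A minimal sketch on a hand-made vector (the values in growth are illustrative, not taken from the full dataset):

```r
library(dplyr)

# Hypothetical growth percentages, including the boundary value 0
growth <- c(70.9, 38.4, 20.5, 0, -3.2)

category <- case_when(
  growth >= 50 ~ "high",
  growth >= 25 ~ "moderate",  # only reached when growth < 50
  growth >= 0  ~ "low",       # only reached when growth < 25
  TRUE         ~ "negative"   # everything else, i.e. growth < 0
)

category
# "high" "moderate" "low" "low" "negative"
```

A growth of exactly 0 is classified as "low" because the growth >= 0 condition is checked before the fallback.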
Q5: Create a scatter plot of GDP per capita vs life expectancy, with point size representing population and color representing continents.
This question is more advanced since it requires customizing the colors and sizes of the points. We use the most recent data (2007) for simplicity.
s2q5 <- gapminder %>%
filter(year == 2007)
ggplot(s2q5, aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) +
geom_point() +
scale_x_log10()
Q6: Generate a faceted line plot showing the trend of life expectancy for each continent over time.
s2q6 <- gapminder %>%
group_by(continent, year) %>%
summarize(mean_lifeExp = mean(lifeExp))
ggplot(s2q6, aes(x = year, y = mean_lifeExp)) +
geom_line() +
facet_wrap(~continent)
Q7: Produce a box plot of GDP per capita distribution for each continent, using a log scale for GDP per capita.
ggplot(gapminder, aes(x = continent, y = gdpPercap)) +
scale_y_log10() +
geom_boxplot()
The following questions are saved for later.
Q8: Use tidyr to reshape the data from wide to long format for GDP, life expectancy, and population.
Q9: Create a bubble chart race using gganimate to show how countries have progressed in terms of GDP per capita and life expectancy over time.
Q10: Implement a small multiples plot using facet_wrap() to show the relationship between GDP per capita and life expectancy for each continent over time.
We use the iris dataset that comes built-in with R.
We use the same code as before to examine it:
View(iris)
glimpse(iris)
summary(iris)
Q1: Create a new variable that calculates the ratio of sepal length to sepal width.
s3q1 <- iris %>%
mutate(Sepal.Ratio = round((Sepal.Length/Sepal.Width),2))
kable(head(s3q1, 10))
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | Sepal.Ratio |
|---|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa | 1.46 |
| 4.9 | 3.0 | 1.4 | 0.2 | setosa | 1.63 |
| 4.7 | 3.2 | 1.3 | 0.2 | setosa | 1.47 |
| 4.6 | 3.1 | 1.5 | 0.2 | setosa | 1.48 |
| 5.0 | 3.6 | 1.4 | 0.2 | setosa | 1.39 |
| 5.4 | 3.9 | 1.7 | 0.4 | setosa | 1.38 |
| 4.6 | 3.4 | 1.4 | 0.3 | setosa | 1.35 |
| 5.0 | 3.4 | 1.5 | 0.2 | setosa | 1.47 |
| 4.4 | 2.9 | 1.4 | 0.2 | setosa | 1.52 |
| 4.9 | 3.1 | 1.5 | 0.1 | setosa | 1.58 |
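Q2: Use mutate() and across() to scale all numeric variables.
Scaling is a new concept that is not introduced in the textbook. From what I have found, it means normalizing, or standardizing, the data. By default, scale() subtracts the column mean from each value and divides by the column's standard deviation.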
The across() function is for applying another function
to multiple columns. Let’s see how it works:
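s3q2 <- iris %>%
# scale() returns a one-column matrix, so convert each result back to a plain numeric vector
mutate(across(Sepal.Length:Petal.Width, ~ as.numeric(scale(.x))))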
kable(head(s3q2, 10))
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| -0.8976739 | 1.01560199 | -1.335752 | -1.311052 | setosa |
| -1.1392005 | -0.13153881 | -1.335752 | -1.311052 | setosa |
| -1.3807271 | 0.32731751 | -1.392399 | -1.311052 | setosa |
| -1.5014904 | 0.09788935 | -1.279104 | -1.311052 | setosa |
| -1.0184372 | 1.24503015 | -1.335752 | -1.311052 | setosa |
| -0.5353840 | 1.93331463 | -1.165809 | -1.048667 | setosa |
| -1.5014904 | 0.78617383 | -1.335752 | -1.179859 | setosa |
| -1.0184372 | 0.78617383 | -1.279104 | -1.311052 | setosa |
| -1.7430170 | -0.36096697 | -1.335752 | -1.311052 | setosa |
| -1.1392005 | 0.09788935 | -1.279104 | -1.442245 | setosa |
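To see what scale() does with its default arguments, the sketch below recomputes one column by hand and compares it with the built-in result. The names x, manual, and builtin are illustrative.

```r
# One numeric column from the built-in iris data
x <- iris$Sepal.Length

# Standardize by hand: subtract the mean, divide by the standard deviation
manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix, so flatten it for comparison
builtin <- as.numeric(scale(x))

all.equal(manual, builtin)  # TRUE
```

After scaling, each column has mean 0 and standard deviation 1, which is why the values in the table are centered around zero.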
Q3: Convert the Species variable to a factor if it isn’t already.
iris <- iris %>%
mutate(Species = as.factor(Species))
Q3: Create a scatter plot matrix of all variables.
library(GGally)
ggpairs(iris,
columns = 1:4,
aes(color = Species, alpha = 0.5))
We use the GGally package for this problem.
Q4: Generate a violin plot showing the distribution of each measurement for each species.
The violin plot is not covered in the textbook either, but since the problem is quite simple, let’s do it quickly here.
iris_long <- iris %>%
pivot_longer(names_to = "measure",
values_to = "value",
cols = -Species)
ggplot(iris_long, aes(x = value, y = measure, fill = measure)) +
geom_violin(trim = FALSE) +
geom_boxplot(width = .1, fill = "white", alpha = .7) +
facet_wrap(~Species)
Q5: Use tidyr to reshape the data from wide to long format for all measurements.
Q6: Create a parallel coordinates plot to visualize all variables across different species.
Q7: Implement a custom ggplot2 function that creates a standardized plot for comparing any two variables in the dataset.