The first data set that we are going to work on is gapminder.
Description: Excerpt of the Gapminder data on life expectancy, GDP per capita, and population by country.
Format: The main data frame gapminder has 1704 rows and 6 variables:
country: factor with 142 levels
continent: factor with 5 levels
year: ranges from 1952 to 2007 in increments of 5 years
lifeExp: life expectancy at birth, in years
pop: population
gdpPercap: GDP per capita (US$, inflation-adjusted)
Source: http://www.gapminder.org/data/
library(gapminder)
head(gapminder)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
library(ggplot2)
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
Part 1: Based on the graph below, what does the following command do?
p + geom_smooth() + geom_point()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Part 2: What is the difference between the graph above and the following graph?
p + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Part 3: What is the difference between the last graph and the following graph?
Note:
gam stands for Generalized Additive Model
lm stands for Linear Model
glm stands for Generalized Linear Model
p <- p + geom_point() + geom_smooth(method = "lm")
p
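The abbreviations in the note above are values that can be passed to the method argument of geom_smooth(). As a minimal sketch, the calls below request each smoother explicitly. The object names base_plot, smooth_gam, smooth_lm, and smooth_glm are illustrative and not part of the assignment, and the gam formula is copied from the message that geom_smooth() printed in Parts 1 and 2.

```r
library(ggplot2)
library(gapminder)

# Base plot, rebuilt under a new name so the assignment's `p` is left untouched.
base_plot <- ggplot(data = gapminder,
                    mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

# Generalized Additive Model, using the formula reported in the console message.
# Fitting it requires the mgcv package.
smooth_gam <- base_plot +
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"))

# Linear Model.
smooth_lm <- base_plot +
  geom_smooth(method = "lm", formula = y ~ x)

# Generalized Linear Model (glm() uses the Gaussian family unless told otherwise).
smooth_glm <- base_plot +
  geom_smooth(method = "glm", formula = y ~ x)
```

Printing any of these objects, for example smooth_gam, draws the corresponding graph. Supplying the formula explicitly should also suppress the informational message shown in Parts 1 and 2.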
Part 4: Based on the graph below, what does the following command do?
p + scale_x_log10()
Part 5: Based on the graph below, what does the following command do?
p = p + scale_x_log10(labels = scales::dollar)
p
# The scales package provides many other label formats, such as scales::percent.
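As a small illustration of the label formats mentioned above, the sketch below assumes only that the scales package is installed. Each helper is an ordinary function that turns a numeric vector into formatted strings, which is why it can be passed to the labels argument of a scale. The names formatted and pop_plot are illustrative, and plotting pop on the y-axis is a choice made for this example only.

```r
library(ggplot2)
library(gapminder)
library(scales)

# Each helper takes numbers and returns character labels.
formatted <- list(
  dollar  = dollar(c(1000, 25000)),
  percent = percent(c(0.25, 0.5)),
  comma   = comma(c(1234567, 89000))
)
formatted

# The same helpers can label axes: dollars on x, comma-separated counts on y.
pop_plot <- ggplot(data = gapminder,
                   mapping = aes(x = gdpPercap, y = pop)) +
  geom_point(alpha = 0.3) +
  scale_x_log10(labels = scales::dollar) +
  scale_y_log10(labels = scales::comma)
```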
Part 6: Based on the following two graphs, what is the difference between the following two commands?
p + aes(color = "purple")
p + geom_point(color = "purple")
Part 7: The graph below is the result of the following two lines of code. Based on this graph, describe what the second line of code does.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + scale_x_log10(labels = scales::dollar)
p + geom_point(alpha = 0.3) + geom_smooth(color = "orange", se = FALSE, linewidth = 2, method = "lm")
Part 8: Based on the graph below, describe what the second of the following two lines of code does.
p <- p + geom_point() + geom_smooth(method = "lm")
p + aes(color = continent)
Part 9: What is the difference between the following two commands?
p + aes(color = year)
p + aes(color = factor(year))
Part 10: What is the difference between the following two commands?
p + aes(color = continent)
p + aes(color = continent, fill=continent)
Part 11: Based on the graph below, what does the following command do?
p + aes(color = log(pop))
For the next 3 parts, we will work on a small data set named rel_by_region.
dim(rel_by_region)
## [1] 24 5
head(rel_by_region)
## # A tibble: 6 x 5
## # Groups: bigregion [1]
## bigregion religion N freq pct
## <fct> <fct> <int> <dbl> <dbl>
## 1 Northeast Protestant 158 0.324 32
## 2 Northeast Catholic 162 0.332 33
## 3 Northeast Jewish 27 0.0553 6
## 4 Northeast None 112 0.230 23
## 5 Northeast Other 28 0.0574 6
## 6 Northeast <NA> 1 0.00205 0
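The assignment does not show how rel_by_region was built. As a minimal sketch, assuming that freq is each religion's share of N within its region and that pct is that share rounded to a whole percent, the six Northeast rows printed above can be reproduced from their N column alone. The object names northeast and northeast_summary are illustrative.

```r
library(dplyr)

# The N column of the six Northeast rows, copied from the head() output above.
northeast <- data.frame(
  bigregion = "Northeast",
  religion  = c("Protestant", "Catholic", "Jewish", "None", "Other", NA),
  N         = c(158L, 162L, 27L, 112L, 28L, 1L)
)

# Assumed definitions: freq is the within-region share, pct the rounded percent.
northeast_summary <- northeast %>%
  group_by(bigregion) %>%
  mutate(freq = N / sum(N),
         pct  = round(freq * 100, 0))

northeast_summary
```

Under these assumptions, the computed freq and pct values agree with the printed rows, which suggests the percentages in each region sum to roughly 100.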
p <- ggplot(rel_by_region,
aes(x = bigregion, y = N, fill = religion))
Part 12: What are the differences among the following three commands?
p+geom_col(position = "stack")
p+geom_col(position = "fill")
p = p+geom_col(position = "dodge")
p
Part 13: Based on the graph below, what does the following command do?
p + labs(fill = "Religion") + theme(legend.position = "top")
Part 14: Based on the graph below, what does the following command do?
p + guides(fill = "none") + coord_flip() +
facet_grid(~ religion)
For the rest of the assignment, we will work on a new data set named organdata.
Description: A dataset containing data on rates of organ donation for seventeen OECD countries between 1991 and 2002. The variables are as follows:
Format: A (tibble) data frame with 237 rows and 21 variables.
Details:
country. Country name.
year. Year.
donors. Organ Donation rate per million population.
pop. Population in thousands.
pop_dens. Population density per square mile.
gdp. Gross Domestic Product in thousands of PPP dollars.
gdp_lag. Lagged Gross Domestic Product in thousands of PPP dollars.
health. Health spending, thousands of PPP dollars per capita.
health_lag. Lagged health spending, thousands of PPP dollars per capita.
pubhealth. Public health spending as a percentage of total expenditure.
roads. Road accident fatalities per 100,000 population.
cerebvas. Cerebrovascular deaths per 100,000 population (rounded).
assault. Assault deaths per 100,000 population (rounded).
external. Deaths due to external causes per 100,000 population.
txp_pop. Transplant programs per million population.
world. Welfare state world (Esping-Andersen).
opt. Opt-in policy or Opt-out policy.
consent_law. Consent law, informed or presumed.
consent_practice. Consent practice, informed or presumed.
consistent. Law consistent with practice, yes or no.
ccode. Abbreviated country code.
Source: Macro-economic and spending data: OECD. Other data: Kieran Healy.
head(organdata)
## # A tibble: 6 x 21
## country year donors pop pop_dens gdp gdp_lag health health_lag
## <chr> <date> <dbl> <int> <dbl> <int> <int> <dbl> <dbl>
## 1 Austra… NA NA 17065 0.220 16774 16591 1300 1224
## 2 Austra… 1991-01-01 12.1 17284 0.223 17171 16774 1379 1300
## 3 Austra… 1992-01-01 12.4 17495 0.226 17914 17171 1455 1379
## 4 Austra… 1993-01-01 12.5 17667 0.228 18883 17914 1540 1455
## 5 Austra… 1994-01-01 10.2 17855 0.231 19849 18883 1626 1540
## 6 Austra… 1995-01-01 10.2 18072 0.233 21079 19849 1737 1626
## # … with 12 more variables: pubhealth <dbl>, roads <dbl>, cerebvas <int>,
## # assault <int>, external <int>, txp_pop <dbl>, world <chr>, opt <chr>,
## # consent_law <chr>, consent_practice <chr>, consistent <chr>,
## # ccode <chr>
p <- ggplot(data = organdata,
mapping = aes(x = country, y = donors, fill = world))+
coord_flip() + theme(legend.position = "top")
Part 15: What does each of the following two functions show?
p + geom_violin()
p + geom_boxplot()
Part 16: Based on the graph below, what does the following command do?
ggplot(data = organdata, mapping = aes(x = reorder(country, donors, na.rm=TRUE), y = donors, fill = world)) + coord_flip() + theme(legend.position = "top") + geom_boxplot()
Part 17: We used the following code to generate the following two graphs. Describe the benefits of the second graph over the first one.
organdata2 <- organdata[organdata$country %in% c("Spain", "Austria", "Belgium", "United Kingdom", "Germany"),]
p <- ggplot(data = organdata2, mapping = aes(x = reorder(country, donors, na.rm=TRUE), y = donors, color = world)) + labs(x=NULL) + coord_flip() + theme(legend.position = "top")
p + geom_point()
p + geom_jitter(position = position_jitter(width=0.4))
From the organdata data set, we created a new data set called by_country, in which each row gives the mean and the standard deviation of each numeric variable for one country.
dim(by_country)
## [1] 17 27
colnames(by_country)
## [1] "country" "donors_mean" "pop_mean"
## [4] "pop_dens_mean" "gdp_mean" "gdp_lag_mean"
## [7] "health_mean" "health_lag_mean" "pubhealth_mean"
## [10] "roads_mean" "cerebvas_mean" "assault_mean"
## [13] "external_mean" "txp_pop_mean" "donors_sd"
## [16] "pop_sd" "pop_dens_sd" "gdp_sd"
## [19] "gdp_lag_sd" "health_sd" "health_lag_sd"
## [22] "pubhealth_sd" "roads_sd" "cerebvas_sd"
## [25] "assault_sd" "external_sd" "txp_pop_sd"
head(by_country)
## # A tibble: 6 x 27
## country donors_mean pop_mean pop_dens_mean gdp_mean gdp_lag_mean
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Austra… 10.6 18318. 0.237 22179. 21779.
## 2 Austria 23.5 7927. 9.45 23876. 23415.
## 3 Belgium 21.9 10153. 30.7 22500. 22096.
## 4 Canada 14.0 29608. 0.297 23711. 23353.
## 5 Denmark 13.1 5257. 12.2 23722. 23275
## 6 Finland 18.4 5112. 1.51 21019. 20763
## # … with 21 more variables: health_mean <dbl>, health_lag_mean <dbl>,
## # pubhealth_mean <dbl>, roads_mean <dbl>, cerebvas_mean <dbl>,
## # assault_mean <dbl>, external_mean <dbl>, txp_pop_mean <dbl>,
## # donors_sd <dbl>, pop_sd <dbl>, pop_dens_sd <dbl>, gdp_sd <dbl>,
## # gdp_lag_sd <dbl>, health_sd <dbl>, health_lag_sd <dbl>,
## # pubhealth_sd <dbl>, roads_sd <dbl>, cerebvas_sd <dbl>,
## # assault_sd <dbl>, external_sd <dbl>, txp_pop_sd <dbl>
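The assignment does not include the code that produced by_country. One possible way to compute per-country means and standard deviations is sketched below, assuming that organdata is the copy shipped in the socviz package and that dplyr 1.0 or later is installed. The name by_country_sketch is illustrative, and the column order it produces may differ from the listing above.

```r
library(dplyr)
library(socviz)  # assumed source of organdata

# For every numeric column, compute the mean and the standard deviation
# within each country, ignoring missing values.
by_country_sketch <- organdata %>%
  group_by(country) %>%
  summarize(
    across(
      where(is.numeric),
      list(mean = ~ mean(.x, na.rm = TRUE),
           sd   = ~ sd(.x, na.rm = TRUE))
    )
  )

dim(by_country_sketch)
```

By default across() names its results as the column name, an underscore, and the function name, which yields names such as donors_mean and donors_sd.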
p <- ggplot(data = by_country,
mapping = aes(x = reorder(country,donors_mean), y = donors_mean)) + labs(x= "", y= "Donor Procurement Rate") + coord_flip()
mapping = aes(ymin = donors_mean - donors_sd,
ymax = donors_mean + donors_sd)
Part 18: What are the differences among the following 4 graphs?
p + geom_pointrange(mapping)
p + geom_linerange(mapping)
p + geom_crossbar(mapping)
p + geom_errorbar(mapping)
Part 19: Based on the graph below, what does the following command do?
p <- ggplot(data = by_country, mapping = aes(x = gdp_mean, y = health_mean)) + geom_point()
p + geom_text(mapping = aes(label = country))
Part 20: There is a slight difference between the previous graph and the following one. Describe the difference.
p + geom_text(mapping = aes(label = country), hjust = 0)
Part 21: What does the geom_text_repel() function from the ggrepel package do?
library(ggrepel)
p + geom_text_repel(data = by_country, mapping = aes(label = country))
Part 22: Based on the graph below, what does the following command do?
data2 <- by_country[by_country$gdp_mean > 25000 | by_country$health_mean < 1500 | by_country$country %in% c("Belgium","Denmark"),]
p + geom_text_repel(data = data2, mapping = aes(label = country))
Part 23: Based on the graph below, what does the following command do?
p + annotate(geom = "text", x = 28000, y = 3800, label = "A surprisingly \n high gdp mean.") +
annotate(geom = "rect", xmin = 26000, xmax = 30000, ymin = 2000, ymax = 4100, fill = "red", alpha = 0.2)
Part 24: Based on the graph below, what does the following command do?
p + geom_hline(yintercept = 2500, linewidth = 1.4, color = "gray80") +
geom_vline(xintercept = 26000, linewidth = 1.4, color = "gray80")
Part 25: Please ignore this part and leave it blank!
Part 26: Based on the graph below, what does the following command do?
p + scale_y_continuous(breaks = c(2000, 4000), labels = c("2k", "4k"))
Part 27: Based on the graph below, what does the following command do?
p + ylim(0, NA) + labs(x = "x_label", y = "y_label", title = "p_title", subtitle = "p_subtitle", caption = "p_caption")
Note: You can save a graph as a picture or as a pdf file:
ggsave(filename = "my_figure.png")
ggsave(filename = "my_figure.pdf")
ggsave(filename = "my_figure.pdf", plot = p, height = 3, width = 4, units = "in")
Most of the code in this document is collected from https://socviz.co/.