The first data set that we are going to work on is gapminder.

Description: Excerpt of the Gapminder data on life expectancy, GDP per capita, and population by country.

Format: The main data frame gapminder has 1704 rows and 6 variables: country, continent, year, lifeExp, pop, and gdpPercap.

Source: http://www.gapminder.org/data/

library(gapminder)
head(gapminder)
## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.
library(ggplot2)
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))

Part 1: Based on the graph below, what does the following command do?

p + geom_smooth() + geom_point() 
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Part 2: What is the difference between the graph above and the following graph?

p + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Part 3: What is the difference between the last graph and the following graph?

Note:

p <- p + geom_point() + geom_smooth(method = "lm")
p

Part 4: Based on the graph below, what does the following command do?

p + scale_x_log10()

Part 5: Based on the graph below, what does the following command do?

p <- p + scale_x_log10(labels = scales::dollar)
p

# There are many other label formatters in the scales package, such as scales::percent.
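To make the role of these formatters concrete, here is a minimal sketch that calls them directly on plain numbers instead of passing them to a scale. It assumes the scales package is installed (it comes along with ggplot2); the vectors `gdp_values` and `shares` are hypothetical and not part of the assignment data. The exact number of decimals that `percent()` prints can differ between scales versions.

```r
library(scales)

# Hypothetical values, used only to illustrate the formatters
gdp_values <- c(1000, 25000)
shares <- c(0.25, 0.5)

# Each formatter takes a numeric vector and returns a character vector
dollar_labels <- dollar(gdp_values)    # currency labels
comma_labels <- comma(gdp_values)      # thousands separators
percent_labels <- percent(shares)      # proportions shown as percentages

print(dollar_labels)
print(comma_labels)
print(percent_labels)
```

Passing `labels = scales::dollar` to `scale_x_log10()` applies the same function to the axis break values.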

Part 6: Based on the following two graphs, what is the difference between the following two commands?

p + aes(color = "purple")

p + geom_point(color = "purple")

Part 7: The graph below is the result of the following two lines of code. Based on this graph, describe what the second line of code does.

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + scale_x_log10(labels = scales::dollar)
p + geom_point(alpha = 0.3) +  geom_smooth(color = "orange", se = FALSE, size = 2, method = "lm") 

Part 8: Based on the graph below, describe what the second line of the following code does.

p <- p + geom_point() + geom_smooth(method = "lm")
p + aes(color = continent)

Part 9: What is the difference between the following two commands?

p + aes(color = year)

p + aes(color = factor(year))

Part 10: What is the difference between the following two commands?

p + aes(color = continent)

p + aes(color = continent, fill=continent)

Part 11: Based on the graph below, what does the following command do?

p + aes(color = log(pop))

For the next 3 parts, we will work on a small data set named rel_by_region.

dim(rel_by_region)
## [1] 24  5
head(rel_by_region)
## # A tibble: 6 x 5
## # Groups:   bigregion [1]
##   bigregion religion       N    freq   pct
##   <fct>     <fct>      <int>   <dbl> <dbl>
## 1 Northeast Protestant   158 0.324      32
## 2 Northeast Catholic     162 0.332      33
## 3 Northeast Jewish        27 0.0553      6
## 4 Northeast None         112 0.230      23
## 5 Northeast Other         28 0.0574      6
## 6 Northeast <NA>           1 0.00205     0
p <- ggplot(rel_by_region, 
  aes(x = bigregion, y = N, fill = religion)) 
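The freq and pct columns in the output above look like within-region shares of N: the six Northeast counts sum to 488, and 158 / 488 is about 0.324. The base R sketch below reproduces those two columns for the six printed rows. The data frame `northeast` is a hypothetical reconstruction of those rows, and this is only one way the columns could have been computed, not necessarily how rel_by_region was built.

```r
# Hypothetical reconstruction of the six Northeast rows printed above
northeast <- data.frame(
  bigregion = "Northeast",
  religion = c("Protestant", "Catholic", "Jewish", "None", "Other", NA),
  N = c(158L, 162L, 27L, 112L, 28L, 1L)
)

# Total count within each region, repeated on every row of that region
region_total <- ave(northeast$N, northeast$bigregion, FUN = sum)

# Share of each religion within its region, and the rounded percentage
northeast$freq <- northeast$N / region_total
northeast$pct <- round(northeast$freq * 100)

print(northeast)
```

This also explains why `position = "fill"` and the pct column tell the same story: both rescale the counts so that each region adds up to the whole.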

Part 12: What are the differences among the following three commands?

p + geom_col(position = "stack")

p + geom_col(position = "fill")

p <- p + geom_col(position = "dodge")
p

Part 13: Based on the graph below, what does the following command do?

p + labs(fill = "Religion") + theme(legend.position = "top")

Part 14: Based on the graph below, what does the following command do?

p + guides(fill = "none") + coord_flip() +
  facet_grid(~ religion)

For the rest of the assignment, we will work on a new data set named organdata.

Description: A dataset containing data on rates of organ donation for seventeen OECD countries between 1991 and 2002. The variables are as follows:

Format: A (tibble) data frame with 237 rows and 21 variables.

Details:

Source: Macro-economic and spending data: OECD. Other data: Kieran Healy.

head(organdata)
## # A tibble: 6 x 21
##   country year       donors   pop pop_dens   gdp gdp_lag health health_lag
##   <chr>   <date>      <dbl> <int>    <dbl> <int>   <int>  <dbl>      <dbl>
## 1 Austra… NA           NA   17065    0.220 16774   16591   1300       1224
## 2 Austra… 1991-01-01   12.1 17284    0.223 17171   16774   1379       1300
## 3 Austra… 1992-01-01   12.4 17495    0.226 17914   17171   1455       1379
## 4 Austra… 1993-01-01   12.5 17667    0.228 18883   17914   1540       1455
## 5 Austra… 1994-01-01   10.2 17855    0.231 19849   18883   1626       1540
## 6 Austra… 1995-01-01   10.2 18072    0.233 21079   19849   1737       1626
## # … with 12 more variables: pubhealth <dbl>, roads <dbl>, cerebvas <int>,
## #   assault <int>, external <int>, txp_pop <dbl>, world <chr>, opt <chr>,
## #   consent_law <chr>, consent_practice <chr>, consistent <chr>,
## #   ccode <chr>
p <- ggplot(data = organdata, 
  mapping = aes(x = country, y = donors, fill = world))+
 coord_flip() + theme(legend.position = "top")

Part 15: What does each of the following two functions show?

p + geom_violin()

p + geom_boxplot()

Part 16: Based on the graph below, what does the following command do?

ggplot(data = organdata, mapping = aes(x = reorder(country, donors, na.rm=TRUE), y = donors, fill = world)) + coord_flip() + theme(legend.position = "top") + geom_boxplot()

Part 17: We used the following code to generate the following two graphs. Describe the benefits of the second graph over the first one.

organdata2 <- organdata[organdata$country %in% c("Spain", "Austria", "Belgium", "United Kingdom", "Germany"),]
p <- ggplot(data = organdata2, mapping = aes(x = reorder(country, donors, na.rm=TRUE), y = donors, color = world))  + labs(x=NULL) +  coord_flip() + theme(legend.position = "top")

p + geom_point()

p + geom_jitter(position = position_jitter(width=0.4))

From the organdata data set, we created a new data set called by_country, in which each row gives the mean and the standard deviation of each numeric variable for one country.
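The handout does not show how this summary table was built. As a rough sketch of one possible approach, the code below groups a small made-up data frame by country and computes the mean and standard deviation of every numeric column with dplyr. It assumes dplyr 1.0.0 or later is installed; the data frame `toy` and its values are hypothetical stand-ins for organdata. Note that `across()` interleaves the results (donors_mean, donors_sd, gdp_mean, ...), whereas the real by_country lists all means first and then all standard deviations.

```r
library(dplyr)

# Hypothetical stand-in for organdata: two countries, three years each
toy <- data.frame(
  country = c("A", "A", "A", "B", "B", "B"),
  donors = c(10, 12, NA, 20, 22, 24),
  gdp = c(100, 110, 120, 200, 210, 220)
)

# One row per country; every numeric column gets a _mean and an _sd column
by_country_toy <- toy %>%
  group_by(country) %>%
  summarize(across(where(is.numeric),
                   list(mean = ~ mean(.x, na.rm = TRUE),
                        sd = ~ sd(.x, na.rm = TRUE))))

print(by_country_toy)
```

Setting `na.rm = TRUE` matters here because a single missing value would otherwise turn the whole country's mean and standard deviation into NA.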

dim(by_country)
## [1] 17 27
colnames(by_country)
##  [1] "country"         "donors_mean"     "pop_mean"       
##  [4] "pop_dens_mean"   "gdp_mean"        "gdp_lag_mean"   
##  [7] "health_mean"     "health_lag_mean" "pubhealth_mean" 
## [10] "roads_mean"      "cerebvas_mean"   "assault_mean"   
## [13] "external_mean"   "txp_pop_mean"    "donors_sd"      
## [16] "pop_sd"          "pop_dens_sd"     "gdp_sd"         
## [19] "gdp_lag_sd"      "health_sd"       "health_lag_sd"  
## [22] "pubhealth_sd"    "roads_sd"        "cerebvas_sd"    
## [25] "assault_sd"      "external_sd"     "txp_pop_sd"
head(by_country)
## # A tibble: 6 x 27
##   country donors_mean pop_mean pop_dens_mean gdp_mean gdp_lag_mean
##   <chr>         <dbl>    <dbl>         <dbl>    <dbl>        <dbl>
## 1 Austra…        10.6   18318.         0.237   22179.       21779.
## 2 Austria        23.5    7927.         9.45    23876.       23415.
## 3 Belgium        21.9   10153.        30.7     22500.       22096.
## 4 Canada         14.0   29608.         0.297   23711.       23353.
## 5 Denmark        13.1    5257.        12.2     23722.       23275 
## 6 Finland        18.4    5112.         1.51    21019.       20763 
## # … with 21 more variables: health_mean <dbl>, health_lag_mean <dbl>,
## #   pubhealth_mean <dbl>, roads_mean <dbl>, cerebvas_mean <dbl>,
## #   assault_mean <dbl>, external_mean <dbl>, txp_pop_mean <dbl>,
## #   donors_sd <dbl>, pop_sd <dbl>, pop_dens_sd <dbl>, gdp_sd <dbl>,
## #   gdp_lag_sd <dbl>, health_sd <dbl>, health_lag_sd <dbl>,
## #   pubhealth_sd <dbl>, roads_sd <dbl>, cerebvas_sd <dbl>,
## #   assault_sd <dbl>, external_sd <dbl>, txp_pop_sd <dbl>
p <- ggplot(data = by_country, 
       mapping = aes(x = reorder(country,donors_mean), y = donors_mean)) + labs(x= "", y= "Donor Procurement Rate") + coord_flip() 

mapping = aes(ymin = donors_mean - donors_sd,
              ymax = donors_mean + donors_sd)

Part 18: What are the differences among the following 4 graphs?

p + geom_pointrange(mapping) 

p + geom_linerange(mapping) 

p + geom_crossbar(mapping) 

p + geom_errorbar(mapping) 

Part 19: Based on the graph below, what does the following command do?

p <- ggplot(data = by_country, mapping = aes(x = gdp_mean, y = health_mean)) + geom_point()

p + geom_text(mapping = aes(label = country))

Part 20: There is a slight difference between the previous graph and the following one. Describe the difference.

p + geom_text(mapping = aes(label = country), hjust = 0)

Part 21: What does the geom_text_repel() function from the ggrepel package do?

library(ggrepel)
p + geom_text_repel(data = by_country, mapping = aes(label = country))

Part 22: Based on the graph below, what does the following command do?

data2 <- by_country[by_country$gdp_mean > 25000 | by_country$health_mean < 1500 | by_country$country %in% c("Belgium","Denmark"),]

p + geom_text_repel(data = data2, mapping = aes(label = country))

Part 23: Based on the graph below, what does the following command do?

p + annotate(geom = "text", x = 28000, y = 3800, label = "A surprisingly \n high gdp mean.") +
  annotate(geom = "rect", xmin = 26000, xmax = 30000, ymin = 2000, ymax = 4100, fill = "red", alpha = 0.2)

Part 24: Based on the graph below, what does the following command do?

p + geom_hline(yintercept = 2500, size = 1.4, color = "gray80") +
geom_vline(xintercept = 26000, size = 1.4, color = "gray80") 

Part 25: Please ignore this part and leave it blank!

Part 26: Based on the graph below, what does the following command do?

p + scale_y_continuous(breaks = c(2000, 4000), labels = c("2k", "4k"))

Part 27: Based on the graph below, what does the following command do?

p + ylim(0, NA) + labs(x = "x_label", y = "y_label", title = "p_title", subtitle = "p_subtitle", caption = "p_caption")

Note: You can save a graph as a picture or as a pdf file:

ggsave(filename = "my_figure.png")
ggsave(filename = "my_figure.pdf")
ggsave(filename = "my_figure.pdf", plot = p, height = 3, width = 4, units = "in")
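When the plot argument is omitted, as in the first two calls above, ggsave() saves the most recently displayed plot (its default is `plot = last_plot()`), and the file type is inferred from the file extension. The sketch below passes the plot explicitly and writes to a temporary folder, so it can be run without leaving files in the working directory. The object `example_plot`, built from the built-in mtcars data, and the path `out_file` are hypothetical names used only for this illustration.

```r
library(ggplot2)

# Hypothetical example plot built from the built-in mtcars data
example_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# Write to a temporary folder so nothing is left in the working directory
out_file <- file.path(tempdir(), "my_figure.pdf")

# Passing plot = explicitly avoids relying on the last displayed plot
ggsave(filename = out_file, plot = example_plot,
       height = 3, width = 4, units = "in")

# Confirm that the file was written
print(file.exists(out_file))
```

Naming the plot explicitly is the safer habit in scripts, where "the last plot displayed" may not be the one you intended to save.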

Most of the code in this document is collected from https://socviz.co/.