BU.520.650.Su19: Assignment 3

The first data set that we are going to work with is gapminder.

Description: Excerpt of the Gapminder data on life expectancy, GDP per capita, and population by country.

Format: The main data frame gapminder has 1704 rows and 6 variables:

country: factor with 142 levels
continent: factor with 5 levels
year: ranges from 1952 to 2007 in increments of 5 years
lifeExp: life expectancy at birth, in years
pop: population
gdpPercap: GDP per capita (US$, inflation-adjusted)
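The row count follows from the description above: 142 countries, each observed once every 5 years from 1952 to 2007. A quick base R check, using only the numbers stated in the format description:

```r
# Survey years implied by "1952 to 2007 in increments of 5 years"
years <- seq(1952, 2007, by = 5)

n_years <- length(years)   # number of observations per country
n_rows  <- 142 * n_years   # 142 countries, one row per country per year

n_years
n_rows
```

If the data frame is complete, n_rows should match the 1704 rows reported for gapminder.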

library(gapminder)
head(gapminder)

## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.

library(ggplot2)
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))

Part 1: Based on the graph below, what does the following command do?

p + geom_smooth() + geom_point()

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Part 2: What is the difference between the graph above and the following graph?

p + geom_point() + geom_smooth()

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Part 3: What is the difference between the last graph and the following graph?

Note:

gam stands for Generalized Additive Model
lm stands for Linear Model
glm stands for Generalized Linear Model
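To make the three model names concrete, here is a minimal sketch that fits each one outside of ggplot2. The data frame toy and its columns x and y are invented for illustration, and the sketch assumes the mgcv package is installed (it normally ships with R). The gam formula is the one reported in the geom_smooth() message above.

```r
# Toy data (an assumption, not the gapminder data): a curved relationship
set.seed(1)
toy <- data.frame(x = seq(1, 100, length.out = 200))
toy$y <- 40 + 8 * log(toy$x) + rnorm(200, sd = 2)

# Linear model: a straight line
fit_lm <- lm(y ~ x, data = toy)

# Generalized linear model: with a Gaussian family and identity link
# it reproduces the straight line from lm()
fit_glm <- glm(y ~ x, data = toy, family = gaussian())

# Generalized additive model: a smooth curve estimated from the data
fit_gam <- mgcv::gam(y ~ s(x, bs = "cs"), data = toy)

# Compare the fits
same_line <- isTRUE(all.equal(coef(fit_lm), coef(fit_glm)))
rss_lm    <- sum(resid(fit_lm)^2)
rss_gam   <- sum(resid(fit_gam)^2)
```

On curved data like this, the smooth from gam can follow the bend, while lm and a Gaussian glm are restricted to a single straight line.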

p <- p + geom_point() + geom_smooth(method = "lm")
p

Part 4: Based on the graph below, what does the following command do?

p + scale_x_log10()

Part 5: Based on the graph below, what does the following command do?

p <- p + scale_x_log10(labels = scales::dollar)
p

# The scales package provides many other label formats, such as scales::percent.
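The label formatters in scales are ordinary functions that turn numbers into character strings, so they can be tried on their own before being passed to a scale. A small sketch, assuming the scales package is installed (the input values are arbitrary):

```r
library(scales)

# Each formatter takes a numeric vector and returns a character vector
dollar_labels  <- dollar(c(1000, 25000))
percent_labels <- percent(c(0.05, 0.5))
comma_labels   <- comma(1234567)

dollar_labels
percent_labels
comma_labels
```

Any of these functions can be supplied as the labels argument of a scale, in the same way that scales::dollar is used above.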

Part 6: Based on the following two graphs, what is the difference between the following two commands?

p + aes(color = "purple")

p + geom_point(color = "purple")

Part 7: The graph below is the result of the following two lines of code. Based on this graph, describe what the second line of code does.

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + scale_x_log10(labels = scales::dollar)

p + geom_point(alpha = 0.3) +  geom_smooth(color = "orange", se = FALSE, size = 2, method = "lm")

Part 8: Based on the graph below, describe what the second of the following two lines of code does.

p <- p + geom_point() + geom_smooth(method = "lm")

p + aes(color = continent)

Part 9: What is the difference between the following two commands?

p + aes(color = year)

p + aes(color = factor(year))

Part 10: What is the difference between the following two commands?

p + aes(color = continent)

p + aes(color = continent, fill=continent)

Part 11: Based on the graph below, what does the following command do?

p + aes(color = log(pop))

For the next 3 parts, we will work on a small data set named rel_by_region.

dim(rel_by_region)

## [1] 24  5

head(rel_by_region)

## # A tibble: 6 x 5
## # Groups:   bigregion [1]
##   bigregion religion       N    freq   pct
##   <fct>     <fct>      <int>   <dbl> <dbl>
## 1 Northeast Protestant   158 0.324      32
## 2 Northeast Catholic     162 0.332      33
## 3 Northeast Jewish        27 0.0553      6
## 4 Northeast None         112 0.230      23
## 5 Northeast Other         28 0.0574      6
## 6 Northeast <NA>           1 0.00205     0
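This document does not show how rel_by_region was built. The sketch below shows one plausible way to produce a table of the same shape with dplyr, starting from one row per respondent. The data frame survey and its values are hypothetical, and the sketch assumes dplyr 1.0 or later.

```r
library(dplyr)

# Hypothetical raw data: one row per respondent (not the real survey)
survey <- data.frame(
  bigregion = c("Northeast", "Northeast", "Northeast", "South", "South"),
  religion  = c("Catholic", "Catholic", "Protestant", "Protestant", "None")
)

toy_by_region <- survey %>%
  group_by(bigregion, religion) %>%
  # Count respondents per region/religion pair; keep grouping by region
  summarize(N = n(), .groups = "drop_last") %>%
  # Within each region, convert counts to shares and rounded percentages
  mutate(freq = N / sum(N),
         pct  = round(freq * 100, 0))

toy_by_region
```

Because the result is still grouped by bigregion, sum(N) is computed within each region, so freq sums to 1 per region, which is consistent with the Groups line in the output above.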

p <- ggplot(rel_by_region, 
  aes(x = bigregion, y = N, fill = religion))

Part 12: What are the differences among the following three commands?

p+geom_col(position = "stack")

p+geom_col(position = "fill")

p <- p + geom_col(position = "dodge")
p

Part 13: Based on the graph below, what does the following command do?

p + labs(fill = "Religion") + theme(legend.position = "top")

Part 14: Based on the graph below, what does the following command do?

p + guides(fill = "none") + coord_flip() +
  facet_grid(~ religion)

For the rest of the assignment, we will work on a new data set named organdata.

Description: A dataset containing data on rates of organ donation for seventeen OECD countries between 1991 and 2002. The variables are as follows:

Format: A (tibble) data frame with 237 rows and 21 variables.

Details:

country. Country name.
year. Year.
donors. Organ Donation rate per million population.
pop. Population in thousands.
pop_dens. Population density per square mile.
gdp. Gross Domestic Product in thousands of PPP dollars.
gdp_lag. Lagged Gross Domestic Product in thousands of PPP dollars.
health. Health spending, thousands of PPP dollars per capita.
health_lag. Lagged health spending, thousands of PPP dollars per capita.
pubhealth. Public health spending as a percentage of total expenditure.
roads. Road accident fatalities per 100,000 population.
cerebvas. Cerebrovascular deaths per 100,000 population (rounded).
assault. Assault deaths per 100,000 population (rounded).
external. Deaths due to external causes per 100,000 population.
txp_pop. Transplant programs per million population.
world. Welfare state world (Esping-Andersen).
opt. Opt-in policy or Opt-out policy.
consent_law. Consent law, informed or presumed.
consent_practice. Consent practice, informed or presumed.
consistent. Law consistent with practice, yes or no.
ccode. Abbreviated country code.

Source: Macro-economic and spending data: OECD. Other data: Kieran Healy.

head(organdata)

## # A tibble: 6 x 21
##   country year       donors   pop pop_dens   gdp gdp_lag health health_lag
##   <chr>   <date>      <dbl> <int>    <dbl> <int>   <int>  <dbl>      <dbl>
## 1 Austra… NA           NA   17065    0.220 16774   16591   1300       1224
## 2 Austra… 1991-01-01   12.1 17284    0.223 17171   16774   1379       1300
## 3 Austra… 1992-01-01   12.4 17495    0.226 17914   17171   1455       1379
## 4 Austra… 1993-01-01   12.5 17667    0.228 18883   17914   1540       1455
## 5 Austra… 1994-01-01   10.2 17855    0.231 19849   18883   1626       1540
## 6 Austra… 1995-01-01   10.2 18072    0.233 21079   19849   1737       1626
## # … with 12 more variables: pubhealth <dbl>, roads <dbl>, cerebvas <int>,
## #   assault <int>, external <int>, txp_pop <dbl>, world <chr>, opt <chr>,
## #   consent_law <chr>, consent_practice <chr>, consistent <chr>,
## #   ccode <chr>

p <- ggplot(data = organdata, 
  mapping = aes(x = country, y = donors, fill = world))+
 coord_flip() + theme(legend.position = "top")

Part 15: What does each of the following two functions show?

p + geom_violin()

p + geom_boxplot()

Part 16: Based on the graph below, what does the following command do?

ggplot(data = organdata, mapping = aes(x = reorder(country, donors, na.rm=TRUE), y = donors, fill = world)) + coord_flip() + theme(legend.position = "top") + geom_boxplot()

Part 17: We used the following code to generate the following two graphs. Describe the benefits of the second graph over the first one.

organdata2 <- organdata[organdata$country %in% c("Spain", "Austria", "Belgium", "United Kingdom", "Germany"),]
p <- ggplot(data = organdata2, mapping = aes(x = reorder(country, donors, na.rm=TRUE), y = donors, color = world))  + labs(x=NULL) +  coord_flip() + theme(legend.position = "top")

p + geom_point()

p + geom_jitter(position = position_jitter(width=0.4))

From the organdata data set, we created a new data set called by_country, in which each row holds the mean and the standard deviation of each numeric variable for one country.
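The code that created by_country is not shown in this document. A minimal sketch of one way to get a summary of this shape with dplyr is given below. The data frame toy and its values are invented, and only two numeric columns are used to keep the output small.

```r
library(dplyr)

# Hypothetical miniature of organdata (values invented for illustration)
toy <- data.frame(
  country = c("A", "A", "A", "B", "B", "B"),
  donors  = c(10, 12, NA, 20, 22, 24),
  gdp     = c(100, 110, 120, 200, 210, 220)
)

toy_by_country <- toy %>%
  group_by(country) %>%
  # For every numeric column, compute the mean and sd within each country;
  # na.rm = TRUE is passed on to both functions so missing values are skipped
  summarize_if(is.numeric, list(mean = mean, sd = sd), na.rm = TRUE) %>%
  ungroup()

toy_by_country
```

Each numeric column yields one _mean and one _sd column. With the 13 numeric variables in organdata plus the country column, the same approach would give 1 + 2 * 13 = 27 columns, which agrees with the dimensions shown below.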

dim(by_country)

## [1] 17 27

colnames(by_country)

##  [1] "country"         "donors_mean"     "pop_mean"       
##  [4] "pop_dens_mean"   "gdp_mean"        "gdp_lag_mean"   
##  [7] "health_mean"     "health_lag_mean" "pubhealth_mean" 
## [10] "roads_mean"      "cerebvas_mean"   "assault_mean"   
## [13] "external_mean"   "txp_pop_mean"    "donors_sd"      
## [16] "pop_sd"          "pop_dens_sd"     "gdp_sd"         
## [19] "gdp_lag_sd"      "health_sd"       "health_lag_sd"  
## [22] "pubhealth_sd"    "roads_sd"        "cerebvas_sd"    
## [25] "assault_sd"      "external_sd"     "txp_pop_sd"

head(by_country)

## # A tibble: 6 x 27
##   country donors_mean pop_mean pop_dens_mean gdp_mean gdp_lag_mean
##   <chr>         <dbl>    <dbl>         <dbl>    <dbl>        <dbl>
## 1 Austra…        10.6   18318.         0.237   22179.       21779.
## 2 Austria        23.5    7927.         9.45    23876.       23415.
## 3 Belgium        21.9   10153.        30.7     22500.       22096.
## 4 Canada         14.0   29608.         0.297   23711.       23353.
## 5 Denmark        13.1    5257.        12.2     23722.       23275 
## 6 Finland        18.4    5112.         1.51    21019.       20763 
## # … with 21 more variables: health_mean <dbl>, health_lag_mean <dbl>,
## #   pubhealth_mean <dbl>, roads_mean <dbl>, cerebvas_mean <dbl>,
## #   assault_mean <dbl>, external_mean <dbl>, txp_pop_mean <dbl>,
## #   donors_sd <dbl>, pop_sd <dbl>, pop_dens_sd <dbl>, gdp_sd <dbl>,
## #   gdp_lag_sd <dbl>, health_sd <dbl>, health_lag_sd <dbl>,
## #   pubhealth_sd <dbl>, roads_sd <dbl>, cerebvas_sd <dbl>,
## #   assault_sd <dbl>, external_sd <dbl>, txp_pop_sd <dbl>

p <- ggplot(data = by_country, 
       mapping = aes(x = reorder(country,donors_mean), y = donors_mean)) + labs(x= "", y= "Donor Procurement Rate") + coord_flip() 

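# Store the interval aesthetics (mean plus or minus one standard deviation)
# in a variable so that they can be passed to each geom below.
mapping <- aes(ymin = donors_mean - donors_sd,
               ymax = donors_mean + donors_sd)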

Part 18: What are the differences among the following 4 graphs?

p + geom_pointrange(mapping)

p + geom_linerange(mapping)

p + geom_crossbar(mapping)

p + geom_errorbar(mapping)

Part 19: Based on the graph below, what does the following command do?

p <- ggplot(data = by_country, mapping = aes(x = gdp_mean, y = health_mean)) + geom_point()

p + geom_text(mapping = aes(label = country))

Part 20: There is a slight difference between the previous graph and the following one. Describe the difference.

p + geom_text(mapping = aes(label = country), hjust = 0)

Part 21: What does the geom_text_repel() function from the ggrepel package do?

library(ggrepel)
p + geom_text_repel(data = by_country, mapping = aes(label = country))

Part 22: Based on the graph below, what does the following command do?

data2 <- by_country[by_country$gdp_mean > 25000 | by_country$health_mean < 1500 | by_country$country %in% c("Belgium","Denmark"),]

p + geom_text_repel(data = data2, mapping = aes(label = country))

Part 23: Based on the graph below, what does the following command do?

p + annotate(geom = "text", x = 28000, y = 3800, label = "A surprisingly \n high gdp mean.") +
  annotate(geom = "rect", xmin = 26000, xmax = 30000, ymin = 2000, ymax = 4100, fill = "red", alpha = 0.2)

Part 24: Based on the graph below, what does the following command do?

p + geom_hline(yintercept = 2500, size = 1.4, color = "gray80") +
geom_vline(xintercept = 26000, size = 1.4, color = "gray80")

Part 25: Please ignore this part and leave it blank!

Part 26: Based on the graph below, what does the following command do?

p + scale_y_continuous(breaks = c(2000, 4000), labels = c("2k", "4k"))

Part 27: Based on the graph below, what does the following command do?

p + ylim(0, NA) + labs(x = "x_label", y = "y_label", title = "p_title", subtitle = "p_subtitle", caption = "p_caption")

Note: You can save a graph as a picture or as a pdf file:

ggsave(filename = "my_figure.png")
ggsave(filename = "my_figure.pdf")
ggsave(filename = "my_figure.pdf", plot = p, height = 3, width = 4, units = "in")
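When plot is omitted, ggsave() saves the most recently displayed plot, and the file type is chosen from the file extension. The following self-contained sketch passes the plot explicitly and writes to a temporary folder. The plot fig, built from R's built-in mtcars data, and the path out are illustrative assumptions.

```r
library(ggplot2)

# A small stand-in plot (assumption: built-in mtcars data, not the assignment data)
fig <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) + geom_point()

# Write to a temporary folder so that no file is left in the working directory
out <- file.path(tempdir(), "my_figure.png")

# Passing plot = fig makes the call independent of what was displayed last
ggsave(filename = out, plot = fig, height = 3, width = 4, units = "in")

file.exists(out)
```

Stating the plot, size, and units explicitly makes the saved figure reproducible regardless of the current graphics window.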

Most of the code in this document is taken from https://socviz.co/.

6/16/2019