This document was composed from Dr. Snopkowski’s ANTH 504 Week 4 lecture and from Introduction to Data Science: Data analysis and prediction algorithms with R by Rafael A. Irizarry (Irizarry 2019).
Sometimes you will see |> instead of %>%. The native pipe |> (built into R since version 4.1.0) works the same way as %>% in the code used here:
murders |> head()
## state abb region population total
## 1 Alabama AL South 4779736 135
## 2 Alaska AK West 710231 19
## 3 Arizona AZ West 6392017 232
## 4 Arkansas AR South 2915918 93
## 5 California CA West 37253956 1257
## 6 Colorado CO West 5029196 65
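As a small self-contained illustration of the native pipe (it needs R 4.1.0 or later; the vector x is made up for this example), the piped and nested forms below give the same result:

```r
# Toy vector, invented for illustration only
x <- c(1, 4, 9, 16)

# Nested call: read from the inside out
nested <- sum(sqrt(x))

# Native pipe: read from left to right
piped <- x |> sqrt() |> sum()

identical(nested, piped)
```

The piped version lists the steps in the order they happen, which is why the longer ggplot pipelines in these notes are written that way.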
In each pair, which country has the higher child mortality? Sri Lanka or Turkey; Poland or South Korea; Malaysia or Russia; Pakistan or Vietnam; Thailand or South Africa.
data(gapminder)
gapminder %>% as_tibble()
## # A tibble: 10,545 × 9
## country year infan…¹ life_…² ferti…³ popul…⁴ gdp conti…⁵ region
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
## 1 Albania 1960 115. 62.9 6.19 1.64e6 NA Europe South…
## 2 Algeria 1960 148. 47.5 7.65 1.11e7 1.38e10 Africa North…
## 3 Angola 1960 208 36.0 7.32 5.27e6 NA Africa Middl…
## 4 Antigua and Ba… 1960 NA 63.0 4.43 5.47e4 NA Americ… Carib…
## 5 Argentina 1960 59.9 65.4 3.11 2.06e7 1.08e11 Americ… South…
## 6 Armenia 1960 NA 66.9 4.55 1.87e6 NA Asia Weste…
## 7 Aruba 1960 NA 65.7 4.82 5.42e4 NA Americ… Carib…
## 8 Australia 1960 20.3 70.9 3.45 1.03e7 9.67e10 Oceania Austr…
## 9 Austria 1960 37.3 68.8 2.7 7.07e6 5.24e10 Europe Weste…
## 10 Azerbaijan 1960 NA 61.3 5.57 3.90e6 NA Asia Weste…
## # … with 10,535 more rows, and abbreviated variable names ¹infant_mortality,
## # ²life_expectancy, ³fertility, ⁴population, ⁵continent
Let’s examine the infant mortality rates in 2015 of the countries displayed in the video: Sri Lanka vs. Turkey, and Thailand vs. South Africa. The relevant column headings are country and infant_mortality. Write the code to get these values.
Plot a scatterplot of fertility (x-axis) vs. life_expectancy (y-axis) for 1960.
gapminder %>%
filter(year == 1960 & country %in% c("Sri Lanka", "Turkey"))
## country year infant_mortality life_expectancy fertility population
## 1 Sri Lanka 1960 72.7 59.76 5.54 9896172
## 2 Turkey 1960 166.0 46.91 6.30 27553280
## gdp continent region
## 1 2708601390 Asia Southern Asia
## 2 44566050533 Asia Western Asia
Add unique colors by continent
p1 <- gapminder |>
filter(year == 1960) |>
ggplot(aes(fertility, life_expectancy, color = continent)) +
geom_point()
p1
Compare this to 2015
p2 <- gapminder |>
filter(year == 2015) |>
ggplot(aes(fertility, life_expectancy, color = continent)) +
geom_point()
p2
## Warning: Removed 1 rows containing missing values (`geom_point()`).
Can you make side-by-side plots? grid.arrange() is part of the gridExtra package. Notice that the plots were saved as p1 and p2.
grid.arrange(p1, p2, ncol = 2)
## Warning: Removed 1 rows containing missing values (`geom_point()`).
Faceting lets us show multiple plots by stratifying the data by some variable. We add a layer with the function facet_grid(), which separates our plots. You can facet by up to 2 variables, written as rows ~ columns.
gapminder %>%
filter(year %in% c(1962, 2015)) %>%
ggplot(aes(fertility, life_expectancy, col=continent)) +
geom_point() +
facet_grid(continent ~ year)
## Warning: Removed 1 rows containing missing values (`geom_point()`).
year %in% c(1962, 2015) specifies which years to include. In facet_grid(continent ~ year), continent defines the rows and year defines the columns.
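A minimal sketch of the rows ~ columns convention, using the built-in mtcars data rather than gapminder (the choice of cyl and am as faceting variables is only for illustration, and ggplot2 is assumed to be installed):

```r
library(ggplot2)

# cyl defines the rows of the grid, am defines the columns
p_facets <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  facet_grid(cyl ~ am)

# Inspect the panel layout without drawing the plot
panel_layout <- ggplot_build(p_facets)$layout$layout
table(panel_layout$ROW, panel_layout$COL)
```

With three values of cyl and two of am, the layout has three rows and two columns of panels.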
If instead we just want to have 2 side-by-side graphs separated by year (as in the last slide), we can use facet_grid like this:
gapminder %>% filter(year %in% c(1962, 2015)) %>%
ggplot(aes(fertility, life_expectancy, col=continent)) +
geom_point() +
facet_grid(. ~ year)
## Warning: Removed 1 rows containing missing values (`geom_point()`).
. ~ year puts each year in its own column. The dot means that no variable is used for the rows.
Perhaps using the term “developing world” no longer makes sense.
If we want to include several charts (for instance, to see how these values have changed over time), we may not want to use facet_grid because the panels won’t all fit on one row. facet_wrap() lets us wrap the graphs across multiple rows and columns so that each plot is viewable.
years <- c(1960, 1970, 1980, 1990, 2000, 2010, 2015)
continents <- c("Europe", "Asia")
gapminder %>%
filter(year %in% years & continent %in% continents) %>%
ggplot(aes(fertility, life_expectancy, col=continent)) +
geom_point() + facet_wrap(~year)
facet_wrap() simply wraps the panels onto a new row when it runs out of space. Storing the years in a separate variable makes the code easier to reuse and edit later. This is good practice when the same code will be needed again but the specific values will change from one analysis to the next. The names year and years are different but similar, so a more distinctive name for the vector might be better.
By default, R will keep the scales the same (to compare across charts). If you don’t want the scales to be the same you can use:
gapminder %>%
filter(year %in% years & continent %in% continents) %>%
ggplot(aes(fertility, life_expectancy, col=continent)) +
geom_point() +
facet_wrap(~year, scales = "free")
This is not best practice, as people expect the axes to be the same across the panels.
We may be interested in understanding how fertility and life expectancy changed over time. To look at factors that change over time, we can use time series plots (where time is on the x-axis).
For instance, let’s look at fertility rates in the US across time:
gapminder %>%
filter(country == "United States") %>%
ggplot(aes(year, fertility)) +
geom_point()
## Warning: Removed 1 rows containing missing values (`geom_point()`).
OR, if you would like time to run from the present back to the past:
gapminder %>%
filter(country == "United States") %>%
ggplot(aes(year, fertility)) +
geom_point() +
scale_x_continuous(trans="reverse")
## Warning: Removed 1 rows containing missing values (`geom_point()`).
trans="reverse" reverses the x-axis scale, which is not very useful when looking at time. Since the points are close together, we might want to connect them with a line.
gapminder %>%
filter(country == "United States") %>%
ggplot(aes(year, fertility)) +
geom_line()
## Warning: Removed 1 row containing missing values (`geom_line()`).
What interpretations would you draw from this chart? This plot shows that the fertility rate dropped sharply during the 1960s and 1970s. By about 1975 it hit a low. It has increased slightly since then but has remained broadly stable to the present.
We may be interested in comparing 2 countries over time. So how can we create 2 lines?
Let’s start with the points:
countries <- c("South Korea", "Germany")
gapminder %>%
filter(country %in% countries) %>%
ggplot(aes(year, fertility)) +
geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).
Then we can change it to lines:
countries <- c("South Korea", "Germany")
gapminder %>%
filter(country %in% countries) %>%
ggplot(aes(year, fertility)) +
geom_line()
## Warning: Removed 2 rows containing missing values (`geom_line()`).
But . . . this isn’t what we want. We need to tell R that we want separate lines, so we add a group argument.
gapminder %>%
filter(country %in% countries) %>%
ggplot(aes(year, fertility, group = country)) +
geom_line()
## Warning: Removed 2 rows containing missing values (`geom_line()`).
When you tell R that the data are grouped, R knows to draw separate lines. But you still can’t identify which country is which. We can indicate which line is which by using the color argument instead of group.
gapminder %>%
filter(country %in% countries) %>%
ggplot(aes(year, fertility, col = country)) +
geom_line()
## Warning: Removed 2 rows containing missing values (`geom_line()`).
Mapping col = country gives each country its own color and adds a legend.
Labeling may be preferred over legends whenever possible since labels are easier for readers.
To include labels we need to decide where on the chart we want to place them. For South Korea, we may want to put the label at (1970, 5). For Germany, we may want to put the label at (1960, 2.5).
First, we have to create our labels:
labels <- data.frame(country = countries, x=c(1970, 1960), y=c(5, 2.5))
head(labels)
## country x y
## 1 South Korea 1970 5.0
## 2 Germany 1960 2.5
It is best to do this step by step and check your work.
gapminder %>%
filter(country %in% countries) %>%
ggplot(aes(year, fertility, col=country)) +
geom_line() +
geom_text(data=labels, aes(x,y,label=country))
## Warning: Removed 2 rows containing missing values (`geom_line()`).
You can add a theme() layer to control the legend position. Here we use it to remove the legend.
gapminder %>%
filter(country %in% countries) %>%
ggplot(aes(year, fertility, col=country)) +
geom_line() +
geom_text(data=labels, aes(x,y,label=country)) +
theme(legend.position = "none")
## Warning: Removed 2 rows containing missing values (`geom_line()`).
gapminder %>%
filter(country %in% countries) %>%
ggplot(aes(year, fertility, col=country)) +
geom_line() +
geom_text(data=labels, aes(x,y,label=country), size=5) +
theme(legend.position = "none")
## Warning: Removed 2 rows containing missing values (`geom_line()`).
We need theme(legend.position = "none"); otherwise a legend will appear. The x and y columns of labels tell R where to put each label.
gapminder %>%
filter(country %in% countries) %>%
ggplot(aes(year, fertility, col=country)) +
geom_line() +
geom_text(aes(x,y,label=country), labels, size=5) +
theme(legend.position = "none")
## Warning: Removed 2 rows containing missing values (`geom_line()`).
Getting rid of legends can also be done like this, but requires more code:
gapminder %>%
filter(country %in% countries) %>%
ggplot(aes(year, fertility, col=country)) +
geom_line(show.legend=FALSE) +
geom_text(data=labels, aes(x,y,label=country), size = 5, show.legend=FALSE)
## Warning: Removed 2 rows containing missing values (`geom_line()`).
That last method was long-winded. Maybe we can do this a bit more easily. An alternative way to label your lines:
#install.packages("geomtextpath")
library(geomtextpath)
p <- gapminder %>%
filter(country %in% countries) %>%
ggplot(aes(year, fertility, col=country, label=country)) +
geom_textpath() +
theme(legend.position = "none")
p
## Warning: Removed 2 rows containing missing values (`geom_textpath()`).
theme(legend.position = "none") gets rid of the legend.
countries <- c("South Korea", "Germany", "Haiti")
gapminder %>%
filter(country %in% countries) %>%
ggplot(aes(year, fertility, col=country, label=country)) +
geom_textpath() +
theme(legend.position = "none")
## Warning: Removed 3 rows containing missing values (`geom_textpath()`).
How has the wealth distribution changed across the world over time?
We can use GDP (gross domestic product), a measure of the market value of goods and services produced by a country in a year. Dividing GDP by the population gives a rough estimate of a country’s wealth per person. Dividing that by 365 gives the wealth per person per day. People who live on less than $2/day on average are considered to live in absolute poverty.
Let’s create this variable: dollars_per_day (Note: this is measured in current US dollars)
gapminder <- gapminder %>%
mutate(dollars_per_day = gdp/population/365)
head(gapminder)
## country year infant_mortality life_expectancy fertility
## 1 Albania 1960 115.40 62.87 6.19
## 2 Algeria 1960 148.20 47.50 7.65
## 3 Angola 1960 208.00 35.98 7.32
## 4 Antigua and Barbuda 1960 NA 62.97 4.43
## 5 Argentina 1960 59.87 65.39 3.11
## 6 Armenia 1960 NA 66.86 4.55
## population gdp continent region dollars_per_day
## 1 1636054 NA Europe Southern Europe NA
## 2 11124892 13828152297 Africa Northern Africa 3.405458
## 3 5270844 NA Africa Middle Africa NA
## 4 54681 NA Americas Caribbean NA
## 5 20619075 108322326649 Americas South America 14.393153
## 6 1867396 NA Asia Western Asia NA
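The arithmetic can be checked by hand for one row of the output above. This sketch uses the Algeria 1960 values printed there (gdp and population are copied from the table, so gdp is rounded to whole dollars):

```r
# Values copied from the Algeria 1960 row shown above
gdp <- 13828152297
population <- 11124892

# GDP per person per day
dollars_per_day <- gdp / population / 365
round(dollars_per_day, 6)

# Compare against the $2/day threshold for absolute poverty
dollars_per_day < 2
```

The result agrees with the dollars_per_day column in the table, and it sits above the $2/day threshold.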
Let’s get a look at the distribution by plotting a histogram of dollars per day, first across all years.
hist(gapminder$dollars_per_day)
Now look at just one year, 1970:
gapminder |>
filter(year == 1970) %>%
ggplot(aes(dollars_per_day)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 72 rows containing non-finite values (`stat_bin()`).
As expected, the distribution of wealth is non-normal. We may want to examine this on a log scale. If we use log base 2, then each doubling of a value turns into an increase of 1. (This turns multiplicative changes into additive ones.)
gapminder %>%
filter(year == 1970) %>%
ggplot(aes(log2(dollars_per_day))) +
geom_histogram(binwidth = 1, color = "black")
## Warning: Removed 72 rows containing non-finite values (`stat_bin()`).
OR
gapminder %>%
filter(year == 1970 & !is.na(gdp)) %>%
ggplot(aes(log2(dollars_per_day))) +
geom_histogram(binwidth = 1, color = "black")
Common choices for bases include: e, 2, 10.
Recommendations: Don’t use e for data exploration / visualization because it’s hard to do these calculations in our head. What is e^3 or e^4? It’s much easier to compute 2^3 and 2^4, or 10^3 and 10^4.
In the previous example, we used base 2 instead of base 10 because the range was easier to interpret – going from -2 to 6, which are relatively easy for us to calculate. If we chose base 10, our range would include only 0-2, so our range is small, and we need to choose a binwidth other than 1.
With base 2, a binwidth of 1 translates to a bin with range x to 2x.
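A quick base R sketch of why base 2 is convenient here. The values in x are made up (successive doublings from 0.25 to 64) and are not taken from gapminder:

```r
# Made-up values, each double the previous one: 0.25, 0.5, 1, ..., 64
x <- 2^(-2:6)

# On the log2 scale each doubling is a step of 1
log2(x)
steps <- diff(log2(x))

# A bin of width 1 on the log2 scale spans x to 2x on the original scale
bin_edges <- 2^(-2:6)
upper_over_lower <- bin_edges[-1] / bin_edges[-length(bin_edges)]
upper_over_lower

# The same values cover a much narrower range in base 10
range(log10(x))
```

With these values the log2 scale runs from -2 to 6 in whole steps, while the log10 scale covers fewer than three units, so a binwidth of 1 would be too coarse in base 10.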
But . . . Let’s look at population sizes. Create a histogram. Would it be best to use base e, 2, or 10 to display this data? Explain your choice.
Pros & Cons 1. If we transform the variable, then the scale is still interpretable (you can easily read intermediate values off the axis), but we have to do a calculation to get back to the original values (2^axis_value).
Transforming the scale instead can be done with scale_x_continuous:
gapminder %>%
filter(year == 1970) %>%
ggplot(aes(dollars_per_day)) +
geom_histogram(binwidth= 1, color="black") +
scale_x_continuous(trans="log2")
## Warning: Removed 72 rows containing non-finite values (`stat_bin()`).
There are 2 modes, so the distribution is bimodal.
Does this mean that there are 2 groups of countries in terms of wealth?
Let’s examine the data by region. Create a plot of dollars_per_day by region.
p <- gapminder %>%
filter(year == 1962) %>%
ggplot(aes(region, dollars_per_day)) +
geom_point()
p
## Warning: Removed 89 rows containing missing values (`geom_point()`).
It would be nice if we could actually read those labels.
+ theme(): check ?theme for details
+ axis.text.x is the text on the x-axis
+ element_text() sets the appearance of a non-data (text) component of the plot
+ hjust is horizontal justification from 0 to 1, where 1 = right justified
gapminder %>%
filter(year == 1962) %>%
ggplot(aes(region, dollars_per_day)) +
geom_point() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
## Warning: Removed 89 rows containing missing values (`geom_point()`).
Is there a difference between “west” and “not west”?
These are ordered alphabetically. It’d be nice if they were ordered in a meaningful way.
What might be a meaningful way?
p <- gapminder %>%
filter(year == 1962 & !is.na(dollars_per_day)) %>%
mutate(region = reorder(region, dollars_per_day, FUN=median)) %>%
ggplot(aes(region, dollars_per_day)) +
geom_point()
p + theme(axis.text.x = element_text(angle=90, hjust=1)) +
scale_y_continuous(trans="log2")
!is.na(): we need to exclude the rows where dollars_per_day is missing (NA), because a statistic such as the median returns NA if any value is missing.
reorder() reorders region by the median of dollars_per_day. FUN = median sets the summary function (the FUN argument is optional; if it is omitted, reorder() uses the mean).
scale_y_continuous(trans="log2") adds a log scale to spread out the dots a little bit more. The regions are now ordered by dollars_per_day.
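A small base R sketch of what reorder() does, using an invented data frame (the region names r1 to r3 and the values are hypothetical):

```r
# Invented data: three regions with two observations each
toy <- data.frame(
  region = factor(c("r1", "r1", "r2", "r2", "r3", "r3")),
  value  = c(10, 12, 1, 3, 5, 100)
)

# Default factor levels are alphabetical
levels(toy$region)

# Reorder the levels by the median value within each region
toy$region <- reorder(toy$region, toy$value, FUN = median)
levels(toy$region)

# A single NA makes the median NA, which is why missing rows are filtered out first
na_median <- median(c(1, 3, NA))
na_median
```

The medians here are 11, 2 and 52.5, so the reordered levels run r2, r1, r3.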
Interpretation: Within the regions as we have grouped them, countries are similar in wealth, except that Eastern Asia has a wider spread.
p <- gapminder %>%
filter(year == 2010 & !is.na(dollars_per_day)) %>%
mutate(region = reorder(region, dollars_per_day, FUN=median)) %>%
ggplot(aes(region, dollars_per_day)) +
geom_point()
p + theme(axis.text.x = element_text(angle=90, hjust=1)) +
scale_y_continuous(trans="log2")
There are more regions on this second graph. Eastern and Western Africa are still very poor.
Let’s re-categorize our regions so that we don’t have so many. We can use the case_when() function to create this new variable:
gapminder <- gapminder %>%
  mutate(group = case_when(
    region %in% c("Western Europe", "Northern Europe", "Southern Europe",
                  "Northern America", "Australia and New Zealand") ~ "West",
    # As printed, "South- Eastern Asia" has a space after the hyphen, so it never
    # matches the region "South-Eastern Asia" (see the Others column in the table below)
    region %in% c("Eastern Asia", "South- Eastern Asia") ~ "East Asia",
    region %in% c("Caribbean", "Central America", "South America") ~ "Latin America",
    continent == "Africa" & region != "Northern Africa" ~ "Sub-Saharan Africa",
    TRUE ~ "Others"))
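case_when() lets you define a new variable from conditions on existing variables such as region and continent. The conditions are checked in order, and TRUE ~ "Others" catches every row that none of the earlier conditions matched.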
You can check that it worked correctly by:
table(gapminder$region, gapminder$group)
##
## East Asia Latin America Others Sub-Saharan Africa
## Australia and New Zealand 0 0 0 0
## Caribbean 0 741 0 0
## Central America 0 456 0 0
## Central Asia 0 0 285 0
## Eastern Africa 0 0 0 912
## Eastern Asia 342 0 0 0
## Eastern Europe 0 0 570 0
## Melanesia 0 0 285 0
## Micronesia 0 0 114 0
## Middle Africa 0 0 0 456
## Northern Africa 0 0 342 0
## Northern America 0 0 0 0
## Northern Europe 0 0 0 0
## Polynesia 0 0 171 0
## South America 0 684 0 0
## South-Eastern Asia 0 0 570 0
## Southern Africa 0 0 0 285
## Southern Asia 0 0 456 0
## Southern Europe 0 0 0 0
## Western Africa 0 0 0 912
## Western Asia 0 0 1026 0
## Western Europe 0 0 0 0
##
## West
## Australia and New Zealand 114
## Caribbean 0
## Central America 0
## Central Asia 0
## Eastern Africa 0
## Eastern Asia 0
## Eastern Europe 0
## Melanesia 0
## Micronesia 0
## Middle Africa 0
## Northern Africa 0
## Northern America 171
## Northern Europe 570
## Polynesia 0
## South America 0
## South-Eastern Asia 0
## Southern Africa 0
## Southern Asia 0
## Southern Europe 684
## Western Africa 0
## Western Asia 0
## Western Europe 399
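In the table above, South-Eastern Asia is counted under Others rather than East Asia. A likely cause, assuming the code was run as printed, is the space in "South- Eastern Asia": %in% compares strings exactly, as this base R check shows.

```r
# Region name as it appears in the data (see the table rows above)
region <- "South-Eastern Asia"

# Vector as printed in the case_when() call, with a space after the hyphen
as_printed <- c("Eastern Asia", "South- Eastern Asia")

# Same vector with the space removed
corrected <- c("Eastern Asia", "South-Eastern Asia")

region %in% as_printed   # no exact match, so the row falls through to "Others"
region %in% corrected    # exact match
```

Checking a new grouping with table(), as done above, is what makes this kind of mismatch visible.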
Next, we can turn the variable group into a factor so that we can control the order of its levels.
gapminder <- gapminder %>%
mutate(group = factor(group, levels = c("Others", "Latin America", "East Asia", "Sub-Saharan Africa", "West")))
table(gapminder$region, gapminder$group)
##
## Others Latin America East Asia Sub-Saharan Africa
## Australia and New Zealand 0 0 0 0
## Caribbean 0 741 0 0
## Central America 0 456 0 0
## Central Asia 285 0 0 0
## Eastern Africa 0 0 0 912
## Eastern Asia 0 0 342 0
## Eastern Europe 570 0 0 0
## Melanesia 285 0 0 0
## Micronesia 114 0 0 0
## Middle Africa 0 0 0 456
## Northern Africa 342 0 0 0
## Northern America 0 0 0 0
## Northern Europe 0 0 0 0
## Polynesia 171 0 0 0
## South America 0 684 0 0
## South-Eastern Asia 570 0 0 0
## Southern Africa 0 0 0 285
## Southern Asia 456 0 0 0
## Southern Europe 0 0 0 0
## Western Africa 0 0 0 912
## Western Asia 1026 0 0 0
## Western Europe 0 0 0 0
##
## West
## Australia and New Zealand 114
## Caribbean 0
## Central America 0
## Central Asia 0
## Eastern Africa 0
## Eastern Asia 0
## Eastern Europe 0
## Melanesia 0
## Micronesia 0
## Middle Africa 0
## Northern Africa 0
## Northern America 171
## Northern Europe 570
## Polynesia 0
## South America 0
## South-Eastern Asia 0
## Southern Africa 0
## Southern Asia 0
## Southern Europe 684
## Western Africa 0
## Western Asia 0
## Western Europe 399
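A base R sketch of how the levels argument fixes the order, on a short invented vector:

```r
# Invented vector of group labels
g <- c("West", "Others", "East Asia", "West")

# Without levels, the order is alphabetical
default_levels <- levels(factor(g))
default_levels

# With levels, the order is the one we specify
g_ordered <- factor(g, levels = c("Others", "East Asia", "West"))
levels(g_ordered)
table(g_ordered)
```

This is why the columns of the second table above appear in the specified order rather than alphabetically, and plots that use group on an axis follow the same order.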
Now that we have 5 groups, we can compare their distributions (as boxplots)
p <- gapminder %>%
filter(year == 1962 & !is.na(dollars_per_day)) %>%
ggplot(aes(group, dollars_per_day)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle=90, hjust=1)) +
scale_y_continuous(trans="log2") +
xlab("")
p
Now let’s show the data as well!
p + geom_point(alpha=0.5)
Interpretation: The West has a small range, East Asia has a large range, . . .
If we have so much data that there is over-plotting, showing the data can be counterproductive.
Ridge plots are stacked smooth densities or histograms. We will use the package ggridges (load it with library(ggridges) before running the code below).
p2 <- gapminder %>%
filter(year == 1962 & !is.na(dollars_per_day)) %>%
ggplot(aes(dollars_per_day, group)) +
scale_x_continuous(trans="log2")
p2
What are the differences here compared to the last code for this graph?
p2 + geom_density_ridges(scale = 2)
## Picking joint bandwidth of 0.64
scale controls the amount of overlap. scale = 1 means the ridges just touch, with no overlap. scale = 2 lets each ridge extend into the one above it.
We can also show the data using jittered points (left) or rug representation (right).
Note: the jittered points plot can be confusing because the height of the points is not indicative of anything. See the text: section 10.7.2 for code
How might we want to compare income distributions across time to see if there are still 2 groups?
Potentially histograms? Box plots? Density plots?
We could categorize the countries into “West” and “not West”.
What does the following code do?
gapminder %>%
filter(year %in% c(1962, 2010) & !is.na(dollars_per_day)) %>%
mutate(west=ifelse(group == "West", "West", "not West")) %>%
ggplot(aes(dollars_per_day)) +
geom_histogram(binwidth = 1, color="black") +
scale_x_continuous(trans="log2") +
facet_grid(year~ west)
ifelse() creates a new column called west. If the current value of group is "West", the new column gets the value "West"; if not, it gets "not West". The lowercase west is the name of the new column.
ifelse() needs three parts. The first part is the condition to test, the second part is the value to use if the condition is TRUE, and the third part is the value to use if it is FALSE.
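A minimal base R sketch of the three parts, on an invented vector of group labels:

```r
# Invented group labels
group <- c("West", "East Asia", "Others", "West")

# ifelse(test, value if TRUE, value if FALSE), applied element by element
west <- ifelse(group == "West", "West", "not West")
west
```

Because ifelse() is vectorized, it returns one value per element of group, which is what mutate() needs to fill the new column.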
There are more data points in 2010 than in 1962, so we may want to plot only those countries for which we have data in both years (to keep the total number the same).
country_list_1962 <- gapminder %>%
filter(year== 1962 & !is.na(dollars_per_day)) %>%
pull(country)
country_list_2010 <- gapminder %>%
filter(year == 2010 & !is.na(dollars_per_day)) %>%
pull(country)
country_list <- intersect(country_list_1962, country_list_2010)
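What intersect() does can be seen on two short invented lists (the single-letter names are placeholders, not countries from gapminder):

```r
# Hypothetical lists for two years
list_a <- c("A", "B", "C", "D")
list_b <- c("B", "D", "E")

# Keep only the entries present in both lists
both <- intersect(list_a, list_b)
both
```

Only the entries that appear in both years survive, so filtering on this list keeps the same set of countries in each panel.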
How does the code from the previous slide change to incorporate only these countries?
gapminder %>%
filter(year %in% c(1962, 2010) & country %in% country_list) %>%
mutate(west=ifelse(group == "West", "West", "not West")) %>%
ggplot(aes(dollars_per_day)) +
geom_histogram(binwidth = 1, color="black") +
scale_x_continuous(trans="log2") +
facet_grid(year~ west)
Now, let’s create boxplots that compare 1962 to 2010. Here is our code from before. How do we have to modify it?
p <- gapminder %>%
filter(year == 1962 & !is.na(dollars_per_day)) %>%
ggplot(aes(group, dollars_per_day)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle=90, hjust=1)) +
scale_y_continuous(trans="log2") +
xlab("")
p
Up to 3 changes:
1. year %in% c(1962, 2010)
2. facet_grid(. ~ year)
3. If you want to limit to countries that have data in both years, add: country %in% country_list
But this makes it a bit hard to compare dollars_per_day. It would be nicer if the boxes for each group (in the 2 years) were next to each other.
We can color (or fill) the boxes depending on year.
Before we can fill based on year, we need to convert year to a factor so that R can assign color based on factor.
This is the code we had before, but we need to change it to create plots side by side. We need facet_grid and both years. Remember that facet_grid takes the rows, then ~, then the columns.
gapminder %>%
filter(year %in% c(1962, 2010) & country %in% country_list) %>%
ggplot(aes(group, dollars_per_day)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle=90, hjust=1)) +
scale_y_continuous(trans="log2") +
xlab("") +
facet_grid(. ~year)
To group them together, we get rid of facet_grid, and fill = year needs to be in the aes().
gapminder %>%
filter(year %in% c(1962, 2010) & country %in% country_list) %>%
mutate(year = factor(year)) %>%
ggplot(aes(group, dollars_per_day, fill=year)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle=90, hjust=1)) +
scale_y_continuous(trans="log2") +
xlab("")
Let’s examine our histogram code again. What changes would we need to make to create density plots (shown below)?
gapminder %>%
filter(year %in% c(1962, 2010) & country %in% country_list) %>%
mutate(west=ifelse(group == "West", "West", "not West")) %>%
ggplot(aes(dollars_per_day)) +
geom_histogram(binwidth = 1, color="black") +
scale_x_continuous(trans="log2") +
facet_grid(year~ west)
Changes:
1. Remove geom_histogram()
2. Add geom_density(alpha = 0.2)
3. If we want the two groups of countries to be on the same graph, we add fill=west to the ggplot function and use facet_grid(year ~ .), since the panels are no longer separated by “west vs. not west”
But . . . a density is scaled so that the area under each distribution adds to 1, while the number of countries is not the same in these two groups.
From the help file, we can see that geom_density() provides the computed variable “count”.
To access a computed variable, we wrap its name in after_stat(). (Before ggplot2 3.4.0 the name was surrounded with 2 dots, as in ..count.., which is now deprecated.)
aes(x=dollars_per_day, y=after_stat(count))
gapminder %>%
filter(year %in% c(1962, 2010) & country %in% country_list) %>%
mutate(west=ifelse(group == "West", "West", "not West")) %>%
ggplot(aes(dollars_per_day, y=after_stat(count), fill=west)) +
geom_density(alpha = 0.2) +
scale_x_continuous(trans="log2") +
facet_grid(year~ .)
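To see what the count variable changes, here is a base R sketch with simulated data (the two groups and their sizes are invented). A density has area 1 regardless of group size. Multiplying it by the number of observations gives a curve whose area equals the group size, which is how ggplot2 documents its computed count variable (density times the number of points).

```r
set.seed(1)

# Simulated groups of different sizes (invented for illustration)
small_group <- rnorm(20, mean = 5)
large_group <- rnorm(80, mean = 2)

# Approximate the area under an evenly spaced curve by a Riemann sum
area <- function(d, scale = 1) sum(d$y * scale) * diff(d$x[1:2])

d_small <- density(small_group)
d_large <- density(large_group)

# As densities, both curves have area close to 1
c(area(d_small), area(d_large))

# Scaled by group size, the areas are close to 20 and 80
c(area(d_small, 20), area(d_large, 80))
```

On the count scale the larger group visibly takes up more area, which is the comparison we want when the two sets of countries differ in size.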
We may want to know how individual regions influence this shift over time.
Here was our previous ridge code. What needs to change?
p2 <- gapminder %>% filter(year == 1962 & !is.na(dollars_per_day)) %>%
ggplot(aes(dollars_per_day, group)) +
scale_x_continuous(trans="log2")
p2 + geom_density_ridges(scale = 2)
## Picking joint bandwidth of 0.64
Changes:
1) year %in% c(1962, 2010)
2) country %in% country_list
Let’s examine this code:
gapminder %>%
filter(year %in% c(1962,2010) & country %in% country_list) %>%
group_by(year) %>%
mutate(weight = population/sum(population)) %>%
ungroup() %>%
ggplot(aes(dollars_per_day, fill=group, weight = weight)) +
scale_x_continuous(trans="log2", limit = c(0.05, 300)) +
geom_density(alpha = 0.2, position = "stack", bw=0.75) +
facet_grid(year ~.)
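The group_by() and mutate() steps give each country a weight equal to its share of the total population in that year. A base R sketch of the same calculation on invented numbers, using ave() in place of group_by():

```r
# Invented populations for two years
toy <- data.frame(
  year       = c(1962, 1962, 1962, 2010, 2010),
  population = c(10, 30, 60, 50, 150)
)

# Share of each row in its year's total population
toy$weight <- ave(toy$population, toy$year, FUN = function(p) p / sum(p))
toy

# The weights sum to 1 within each year
weight_totals <- tapply(toy$weight, toy$year, sum)
weight_totals
```

Because the weights sum to 1 within each year, the weighted densities describe the distribution of people rather than of countries.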
**Never use pie charts. Barplots and tables are always better.**
> …humans are not good at precisely quantifying angles and are even worse when area is the only available visual cue.
> If for some reason you need to make a pie chart, label each pie slice with its respective percentage
The axis does not start at 0. Judging by the length of the bars, it appears that border apprehensions doubled when they did not. It makes you think the differences are larger than they actually are.
> When using position rather than length, it is then not necessary to include 0.
Here they are comparing the differences between groups relative to the within-group variability.
It is not necessary to include 0
Alphabetical order has nothing to do with the disease. By ordering according to the actual rate, we quickly see the states with the highest and lowest rates. Plots ordered by value (or by a meaningful category) are easier to look at and interpret. Use the reorder() function to plot values in order.
To create this chart use this code:
data(murders)
murders %>% mutate(murder_rate = total / population * 100000) %>%
mutate(state=reorder(state, murder_rate)) %>%
ggplot(aes(state, murder_rate)) + geom_bar(stat="identity") +
coord_flip() +
theme(axis.text.y = element_text(size=6)) +
xlab("")
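The rate and the ordering can be checked on the six states printed by head() at the top of these notes. This base R sketch copies those rows by hand:

```r
# Rows copied from the murders |> head() output earlier in these notes
six_states <- data.frame(
  state      = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  population = c(4779736, 710231, 6392017, 2915918, 37253956, 5029196),
  total      = c(135, 19, 232, 93, 1257, 65)
)

# Murders per 100,000 people
six_states$murder_rate <- six_states$total / six_states$population * 100000

# reorder() sorts the factor levels by rate instead of alphabetically
six_states$state <- reorder(six_states$state, six_states$murder_rate)
levels(six_states$state)
```

Among these six states Colorado has the lowest rate and Arizona the highest, an ordering that the alphabetical default would hide.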
In the past, we may have created plots like this: the mean with the standard error, but plots that show the data are much more informative
heights %>%
ggplot(aes(sex, height)) +
geom_point()
If we want to “jitter” our points, we can update the code to:
heights %>%
ggplot(aes(sex, height)) +
geom_jitter(width = 0.1, alpha = 0.2)
It may be hard to compare these distributions. Why?
1) The axes have different ranges – the “Male” range extends past 80
2) The plots are side by side, but if they were arranged vertically, we could better compare horizontal changes.
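heights %>% ggplot(aes(height, after_stat(density))) +
geom_histogram(binwidth = 1, color="black") +
facet_grid(sex~.)
Note after_stat(density): it plots each histogram as a density rather than a count. Older code writes this as ..density.., which is deprecated since ggplot2 3.4.0.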
A difference of an order of magnitude or more may be a reason to show the comparison on a log scale.
We can encode categorical variables with color and shape. The shapes can be controlled with the shape argument.
gapminder <- gapminder %>%
mutate(OPEC = ifelse(group == "West", "Yes", "NO"))
gapminder %>%
filter(year== 2010 & !is.na(dollars_per_day) &
!is.na(infant_mortality)) %>%
ggplot(aes(dollars_per_day, 1-infant_mortality/1000, col=region, shape=OPEC, size=population)) +
geom_point() +
scale_x_continuous(trans="log2")
logit
We are bad at seeing in three dimensions, particularly on a 2-D screen. Refer to the pie chart of frequency of use. The whole point of a graph is to communicate something.
Avoid too many significant digits
Prior to the vaccination programs, deaths from infectious diseases (like polio and smallpox) were common. Let’s examine rates of disease before & after the initiation of vaccination programs.
library(RColorBrewer)
data(us_contagious_diseases)
names(us_contagious_diseases)
## [1] "disease" "state" "year" "weeks_reporting"
## [5] "count" "population"
Measles rate per 100,000, removing Alaska & Hawaii and adjusting for weeks_reporting:
the_disease <- "Measles"
dat <- us_contagious_diseases %>%
  filter(!state %in% c("Hawaii","Alaska") & disease == the_disease) %>%
  mutate(rate = count / population * 100000 * 52 / weeks_reporting) %>%
  mutate(state = reorder(state, rate))
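A worked example of the rate formula with invented numbers: a state reporting 100 cases in a population of 1,000,000, with only 26 weeks of reports. Multiplying by 52 / weeks_reporting scales the partial-year count up to a full year.

```r
# Invented values for illustration
count <- 100
population <- 1e6
weeks_reporting <- 26

# Cases per 100,000 for the weeks actually reported
raw_rate <- count / population * 100000

# Scaled up to a 52-week year
rate <- raw_rate * 52 / weeks_reporting
c(raw_rate = raw_rate, rate = rate)
```

Half a year of reporting doubles the raw rate, so states with incomplete reporting are not made to look healthier than they were.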
Let’s plot the disease rates per year for California
dat %>% filter(state == "California" & !is.na(rate)) %>%
  ggplot(aes(year, rate)) +
  geom_line() +
  ylab("Cases per 100,000") +
  geom_vline(xintercept=1963, col = "blue")
CDC introduced measles vaccine in 1963
dat %>% ggplot(aes(year, state, fill = rate)) +
  geom_tile(color = "grey50") +
  scale_x_continuous(expand = c(0,0)) +
  scale_fill_gradientn(colors = brewer.pal(9, "Reds"), trans = "sqrt") +
  geom_vline(xintercept = 1963, col = "blue") +
  theme_minimal() +
  theme(panel.grid = element_blank(), legend.position="bottom", text = element_text(size = 8)) +
  labs(title = the_disease, x = "", y = "")
expand = c(0,0) removes the padding so there isn’t a gap between the names of the states and the plot. xintercept sets where the vertical line is drawn (here, 1963). More details on ColorBrewer palettes: https://rdrr.io/cran/RColorBrewer/man/ColorBrewer.html
colnames(dat)
## [1] "disease" "state" "year" "weeks_reporting"
## [5] "count" "population" "rate"
# Note: us_rate below is per 10,000, while dat$rate above is per 100,000.
# Use the same multiplier in both if the lines are to be compared directly.
avg <- us_contagious_diseases |>
  filter(disease==the_disease) |> group_by(year) |>
  summarize(us_rate = sum(count, na.rm = TRUE) /
              sum(population, na.rm = TRUE) * 10000)
dat |>
  filter(!is.na(rate)) |>
  ggplot() +
  # linewidth replaces the size aesthetic for lines (ggplot2 3.4.0 and later)
  geom_line(aes(year, rate, group = state), color = "grey50",
            show.legend = FALSE, alpha = 0.2, linewidth = 1) +
  geom_line(mapping = aes(year, us_rate), data = avg, linewidth = 1) +
  ggtitle("Cases per 10,000 by state") +
  xlab("") + ylab("") +
  geom_text(data = data.frame(x = 1955, y = 50),
            mapping = aes(x, y, label="US average"),
            color="black") +
  geom_vline(xintercept=1963, col = "blue") +
  scale_y_continuous(trans = "sqrt", breaks = c(5, 25, 125, 300))
Reference