Today we will investigate a data set from http://www.gapminder.org, a site that contains a wealth of data and visualizations on the health, wealth, population, and other characteristics of countries around the world. The data set has already been saved as an .Rdata file, available on Google Classroom.
First, save the Rmd file in a folder with the rest of your course work. Put the file gapminder.Rdata in the same folder. Go to Session > Set Working Directory > To Source File Location. Now you can run the code below to load the data.
load("gapminder.Rdata")
ls()
[1] "gapminder"
The function ls returns a vector of character strings giving the names of the objects in the specified environment. After loading the .Rdata file, you should see an object named gapminder.
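To see how saving and loading fit together, here is a minimal sketch of a round trip through an .Rdata file. The object toy and the temporary file path are made up for illustration and are not part of the lab data.

```r
# Toy object standing in for a real data set (hypothetical name and values).
toy <- data.frame(id = 1:3, score = c(90, 85, 77))

# Save it to a temporary .Rdata file, then remove it from the workspace.
path <- tempfile(fileext = ".Rdata")
save(toy, file = path)
rm(toy)

# load() restores the saved objects and invisibly returns their names.
restored <- load(path)
restored        # "toy"
exists("toy")   # TRUE
```

Capturing the return value of load is a handy way to find out what a file contained without scanning the output of ls.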
The str function reports on the structure of an object in R. It’s often useful to use str when working with a new data set.
It is never okay to display large data sets in your .Rmd file. No one wants to wait minutes for the HTML file to generate when the .Rmd file is knit.
str(gapminder)
'data.frame': 1704 obs. of 6 variables:
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ pop : num 8425333 9240934 10267083 11537966 13079460 ...
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ lifeExp : num 28.8 30.3 32 34 36.1 ...
$ gdpPercap: num 779 821 853 836 740 ...
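Besides str, a few other base R functions give a compact view of a data frame without printing all of it. The sketch below uses a small made-up data frame called scores (an assumption for illustration, not part of the lab data).

```r
# Small hypothetical data frame standing in for a larger data set.
scores <- data.frame(student = c("Ana", "Ben", "Cy", "Dee"),
                     exam    = c(88, 92, 79, 85),
                     stringsAsFactors = FALSE)

dim(scores)            # number of rows and columns: 4 2
names(scores)          # column names: "student" "exam"
head(scores, 2)        # first two rows only
summary(scores$exam)   # minimum, quartiles, mean, and maximum
```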
The object gapminder has six variables: country, year, pop, continent, lifeExp, and gdpPercap.
Extract the third row of gapminder.
Extract the first 50 components of the variable year.
What years are in the data set? Hint: unique.
Create a data frame called gapminder2002 using the subset function. Below is an example.
gapminder1952 <- subset(gapminder, subset = (year == 1952))
boxplot(lifeExp ~ continent, data = gapminder1952,
        xlab = "Continent", ylab = "Life expectancy",
        main = "Life expectancy by continent: 1952")
From the boxplots, what do you notice? Which continent has the largest median life expectancy? Which continent has the largest interquartile range of life expectancy?
Compute a few summary statistics for life expectancy and GDP per capita (gdpPercap) in 1952. Compare these with the same statistics from 2002. Should we compare raw GDP numbers across time like this?
Choose a country of interest. Create a data frame that contains only the data from that country. Draw scatter plots of life expectancy and of GDP per capita, each versus year. For customizations such as color, connected points, and point style, see https://www.statmethods.net/advgraphs/parameters.html.
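As a rough illustration of these graphical parameters, the sketch below customizes a scatter plot of made-up values. The vectors yr and val are assumptions for illustration and are not taken from gapminder.

```r
# Hypothetical measurements every five years, not taken from gapminder.
yr  <- seq(1952, 2007, by = 5)
val <- c(40, 42, 45, 47, 50, 53, 55, 58, 60, 61, 63, 65)

# type = "b" draws both points and connecting lines,
# pch = 19 uses solid circles, col sets the color, lwd the line width.
plot(yr, val, type = "b", pch = 19, col = "steelblue", lwd = 2,
     xlab = "Year", ylab = "Value", main = "Customized scatter plot")
```

The same arguments can be passed to plot when the x and y values come from columns of your own data frame.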
Subsetting with logical vectors is an essential skill. When a vector, say x, is subset with a logical vector, the components of x are returned wherever the logical vector is TRUE. Below are some examples. Think about what is happening in each example. Recall that we can combine conditions with the operators & and |, which represent “and” and “or”.
mean(gapminder$pop[gapminder$country == "France"])
[1] 52952564
unique(gapminder$country[gapminder$continent == "Africa"])
[1] Algeria Angola
[3] Benin Botswana
[5] Burkina Faso Burundi
[7] Cameroon Central African Republic
[9] Chad Comoros
[11] Congo, Dem. Rep. Congo, Rep.
[13] Cote d'Ivoire Djibouti
[15] Egypt Equatorial Guinea
[17] Eritrea Ethiopia
[19] Gabon Gambia
[21] Ghana Guinea
[23] Guinea-Bissau Kenya
[25] Lesotho Liberia
[27] Libya Madagascar
[29] Malawi Mali
[31] Mauritania Mauritius
[33] Morocco Mozambique
[35] Namibia Niger
[37] Nigeria Reunion
[39] Rwanda Sao Tome and Principe
[41] Senegal Sierra Leone
[43] Somalia South Africa
[45] Sudan Swaziland
[47] Tanzania Togo
[49] Tunisia Uganda
[51] Zambia Zimbabwe
142 Levels: Afghanistan Albania Algeria Angola Argentina ... Zimbabwe
gapminder$country[(gapminder$pop > 150000000) & (gapminder$year == 1992)]
[1] Brazil China India Indonesia United States
142 Levels: Afghanistan Albania Algeria Angola Argentina ... Zimbabwe
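The same ideas can be seen step by step on a small vector. The sketch below uses made-up values, and the names x, keep, and df are assumptions for illustration. It also shows that subset and bracket indexing pick the same rows of a data frame when the condition has no missing values.

```r
x <- c(12, 7, 25, 3, 18)     # hypothetical values
keep <- x > 10               # logical vector: TRUE FALSE TRUE FALSE TRUE
x[keep]                      # 12 25 18

# Combine conditions with & (and) and | (or).
x[x > 10 & x < 20]           # 12 18
x[x < 5 | x > 20]            # 25 3

# For a data frame, subset() and bracket indexing select the same rows here.
df <- data.frame(id = 1:5, x = x)
a <- subset(df, x > 10)
b <- df[df$x > 10, ]
a$id                         # 1 3 5
b$id                         # 1 3 5
```

The two approaches differ when the condition contains NA: subset drops those rows, while bracket indexing returns rows filled with NA.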
The operator %in% returns a logical vector indicating, for each element of its left operand, whether it has a match in the right operand. Consider the example below.
x <- 5:10
y <- c(3, 5, 6, 9, 12, 15)
x %in% y
[1] TRUE TRUE FALSE FALSE TRUE FALSE
y %in% x
[1] FALSE TRUE TRUE TRUE FALSE FALSE
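Because %in% returns one TRUE or FALSE per element of its left operand, it pairs naturally with logical subsetting. Here is a minimal sketch on a made-up data frame. The names pets and farm are assumptions for illustration.

```r
# Hypothetical data frame of pets.
pets <- data.frame(name = c("Rex", "Tom", "Bo", "Kit", "Max"),
                   kind = c("dog", "cat", "pig", "cat", "cow"),
                   stringsAsFactors = FALSE)

farm <- c("pig", "cow", "hen")

# One TRUE/FALSE per row of pets: is kind among the farm animals?
pets$kind %in% farm              # FALSE FALSE TRUE FALSE TRUE

# Use the logical vector to pick rows.
pets$name[pets$kind %in% farm]   # "Bo" "Max"

# The result always has the length of the left operand.
length(farm %in% pets$kind)      # 3
```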
Subsetting can also be used to change values of existing R objects, as in the following example. Remove the chunk option eval = FALSE to see the example’s result in your knitted HTML file.
dd <- data.frame(x = c("dog", "cat", "oink", "pig", "oink", "cat", "dog"),
                 y = c("dog", "cat", "cat", "pig", "cow", "dog", "dog"),
                 stringsAsFactors = FALSE)
dd
dd$x[dd$x == "oink"] <- "pig"
dd
dd$same <- rep("no", nrow(dd))
dd
dd$same[dd$x == dd$y] <- "yes"
dd
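The two-step pattern above (fill a column with a default, then overwrite where a condition holds) can also be written in one step with ifelse. This is an alternative technique, not the one used in the example, and the sketch below rebuilds dd with "oink" already replaced by "pig".

```r
dd <- data.frame(x = c("dog", "cat", "pig", "pig", "pig", "cat", "dog"),
                 y = c("dog", "cat", "cat", "pig", "cow", "dog", "dog"),
                 stringsAsFactors = FALSE)

# ifelse() checks the condition element by element and picks "yes" or "no".
dd$same <- ifelse(dd$x == dd$y, "yes", "no")
dd$same   # "yes" "yes" "no" "yes" "no" "no" "yes"
```

Assignment through a logical subset is still the better fit when only some values should change and the rest must be left as they are.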
For the year 1967, extract the population values of all countries whose life expectancy was more than 70 years.
For the year 2007, how many countries had a life expectancy of at least 75?
Add a variable called G8 to the gapminder data frame, equal to 1 if the country is in the G8 group of nations (France, Germany, Italy, the United Kingdom, Japan, the United States, Canada, and Russia) and 0 otherwise.
Create a plot of your choice that involves countries of the G8.