1 Introduction

Today we will investigate a data set from http://www.gapminder.org, a site that contains a wealth of data and visualizations on the health, wealth, population, and other characteristics of the world's countries. The data set has already been saved as a .Rdata file available on Google Classroom.

First, save the Rmd file in a folder with the rest of your course work. Put the file gapminder.Rdata in the same folder. In RStudio, go to Session > Set Working Directory > To Source File Location. Now run the code below to load the data.

load("gapminder.Rdata")
ls()
[1] "gapminder"

Function ls returns a vector of character strings giving the names of the objects in the specified environment. After loading the Rdata file you should see an object named gapminder.
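If load fails, the working directory is usually the culprit. The sketch below does not use the course file. It saves a made-up object, demo_obj, to a temporary folder (both the object and the path demo_path are hypothetical stand-ins) to show how load restores an object under its original name and how file.exists can confirm that R can see a file before you try to load it.

```r
# Hypothetical stand-in for gapminder.Rdata: save a small object to a temp file
demo_obj <- data.frame(id = 1:3, value = c(2.5, 4.0, 1.5))
demo_path <- file.path(tempdir(), "demo.Rdata")
save(demo_obj, file = demo_path)

# Remove the object so we can watch load() bring it back
rm(demo_obj)

# file.exists() confirms the file is where R is looking
file.exists(demo_path)

# load() returns (invisibly) the names of the objects it restored
loaded_names <- load(demo_path)
loaded_names

# The object is available again under its original name
exists("demo_obj")
```

For the lab itself, file.exists("gapminder.Rdata") should return TRUE once the working directory is set correctly.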

2 Data frame

The str function reports on the structure of an object in R. It’s often useful to use str when working with a new dataset.

Never print a large data set in full in your .Rmd file. No one wants to wait minutes for the HTML file to generate when the .Rmd file is knit. A call to str gives a compact summary instead.

str(gapminder)
'data.frame':   1704 obs. of  6 variables:
 $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
 $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
 $ gdpPercap: num  779 821 853 836 740 ...

Object gapminder has six variables:

  • country: country of the world, factor
  • year: year, integer
  • pop: population, numeric
  • continent: continent of the world, factor
  • lifeExp: life expectancy, numeric
  • gdpPercap: GDP per capita, numeric
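Besides str, a few other base R functions give a compact view of a data frame without printing all of it. The sketch below rebuilds the first five rows of gapminder from the rounded values shown in the str output above, so it runs without the .Rdata file. The name afg is made up for this illustration, and "Asia" is assumed to be the third continent level.

```r
# First five rows of gapminder, using the rounded values from the str() output.
# Assumption: continent level 3 is "Asia".
afg <- data.frame(
  country   = factor(rep("Afghanistan", 5)),
  year      = c(1952L, 1957L, 1962L, 1967L, 1972L),
  pop       = c(8425333, 9240934, 10267083, 11537966, 13079460),
  continent = factor(rep("Asia", 5)),
  lifeExp   = c(28.8, 30.3, 32, 34, 36.1),
  gdpPercap = c(779, 821, 853, 836, 740)
)

dim(afg)              # number of rows and columns
names(afg)            # variable names
head(afg, n = 3)      # first three rows only
sapply(afg, class)    # class of each variable
summary(afg$lifeExp)  # min, quartiles, median, mean, and max
```

The same calls work on the full gapminder data frame, and none of them prints more than a handful of lines.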

2.1 Exercises

  1. Extract the third row of gapminder.

  2. Extract the first 50 components of the variable year.

  3. What years are in the data set? Hint: unique.

  4. Create a data frame called gapminder2002 using the subset function. Below is an example.

gapminder1952 <- subset(gapminder, subset = (year == 1952))

  5. Create comparison boxplots of life expectancy by continent based on the 2002 data. Below is an example from the 1952 data.

boxplot(lifeExp ~ continent, data = gapminder1952,
        xlab = "Continent", ylab = "Life expectancy",
        main = "Life expectancy by continent: 1952")

  6. From the boxplots, what do you notice? Which continent has the largest median life expectancy? Which continent has the largest interquartile range of life expectancy?

  7. Compute a few summary statistics for life expectancy and GDP per capita in 1952. Compare these with the same statistics from 2002. Is it appropriate to compare raw GDP figures across time in this way?

  8. Choose a country of interest. Create a data frame which only contains data from that country. Draw scatter plots of life expectancy and of GDP per capita, each versus the year. For customizations such as color, connecting the points, and changing the point style, see https://www.statmethods.net/advgraphs/parameters.html.

3 Logical subsetting

Subsetting with logical vectors is an essential skill. When a vector, say x, is subset with a logical vector of the same length, R returns the components of x at the positions where the logical vector is TRUE. Below are some examples. Think about what is happening in each example. Recall that we can combine conditions with the operators & and |, which represent “and” and “or”.

mean(gapminder$pop[gapminder$country == "France"])
[1] 52952564
unique(gapminder$country[gapminder$continent == "Africa"])
 [1] Algeria                  Angola                  
 [3] Benin                    Botswana                
 [5] Burkina Faso             Burundi                 
 [7] Cameroon                 Central African Republic
 [9] Chad                     Comoros                 
[11] Congo, Dem. Rep.         Congo, Rep.             
[13] Cote d'Ivoire            Djibouti                
[15] Egypt                    Equatorial Guinea       
[17] Eritrea                  Ethiopia                
[19] Gabon                    Gambia                  
[21] Ghana                    Guinea                  
[23] Guinea-Bissau            Kenya                   
[25] Lesotho                  Liberia                 
[27] Libya                    Madagascar              
[29] Malawi                   Mali                    
[31] Mauritania               Mauritius               
[33] Morocco                  Mozambique              
[35] Namibia                  Niger                   
[37] Nigeria                  Reunion                 
[39] Rwanda                   Sao Tome and Principe   
[41] Senegal                  Sierra Leone            
[43] Somalia                  South Africa            
[45] Sudan                    Swaziland               
[47] Tanzania                 Togo                    
[49] Tunisia                  Uganda                  
[51] Zambia                   Zimbabwe                
142 Levels: Afghanistan Albania Algeria Angola Argentina ... Zimbabwe
gapminder$country[(gapminder$pop > 150000000) & (gapminder$year == 1992)]
[1] Brazil        China         India         Indonesia     United States
142 Levels: Afghanistan Albania Algeria Angola Argentina ... Zimbabwe
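The mechanics are easier to follow on a short vector where every value is visible. The sketch below uses a made-up vector x. It is not part of the gapminder data.

```r
# A small illustrative vector (values chosen arbitrarily)
x <- c(3, 8, 12, 5, 20)

big <- x > 6          # logical vector: FALSE TRUE TRUE FALSE TRUE
x[big]                # keeps positions where big is TRUE: 8 12 20

x[x > 6 & x < 15]     # "and": both conditions must hold: 8 12
x[x < 4 | x > 15]     # "or": at least one condition holds: 3 20

sum(x > 6)            # TRUE counts as 1, so this counts the matches: 3
which(x > 6)          # positions of the TRUE values: 2 3 5
```

The gapminder examples above follow the same pattern: the comparison builds a logical vector with one entry per row, and the square brackets keep the entries where it is TRUE.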

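The %in% operator returns a logical vector indicating, for each element of its left operand, whether that element matches any element of its right operand. The result always has the same length as the left operand. Consider the example below.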

x <- 5:10
y <- c(3, 5, 6, 9, 12, 15)

x %in% y
[1]  TRUE  TRUE FALSE FALSE  TRUE FALSE
y %in% x
[1] FALSE  TRUE  TRUE  TRUE FALSE FALSE
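Because %in% returns a logical vector, it can be used directly for subsetting. The short sketch below reuses x and y from the example above and contrasts %in% with ==, which compares element by element rather than testing membership.

```r
x <- 5:10
y <- c(3, 5, 6, 9, 12, 15)

x[x %in% y]       # elements of x that also appear in y: 5 6 9
x[!(x %in% y)]    # elements of x that do not appear in y: 7 8 10
sum(y %in% x)     # how many elements of y appear in x: 3

# == compares position by position (5 with 3, 6 with 5, ...),
# so it is all FALSE here even though x and y share three values
x == y
```

When the question is "is this value one of these?", %in% is the right tool. == answers "are these two values in the same position equal?".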

Subsetting can also be used to change the values of existing R objects: a subset expression on the left-hand side of <- replaces only the selected components. Remove the chunk option eval = FALSE to see the example’s result in your knitted HTML file.

dd <-  data.frame(x = c("dog", "cat", "oink", "pig", "oink", "cat", "dog"), 
                y = c("dog", "cat", "cat", "pig", "cow", "dog", "dog"),
                stringsAsFactors = FALSE)
dd

dd$x[dd$x == "oink"] <- "pig"
dd

dd$same <- rep("no", nrow(dd))
dd

dd$same[dd$x == dd$y] <- "yes"
dd
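As an alternative sketch, and not the approach used above, the base R function ifelse can perform the same replacements in a single step each. It takes a logical vector and picks, position by position, from a "yes" value or a "no" value.

```r
dd <- data.frame(x = c("dog", "cat", "oink", "pig", "oink", "cat", "dog"),
                 y = c("dog", "cat", "cat", "pig", "cow", "dog", "dog"),
                 stringsAsFactors = FALSE)

# Replace "oink" by "pig", leaving every other value unchanged
dd$x <- ifelse(dd$x == "oink", "pig", dd$x)

# Build the indicator column in one step instead of fill-then-overwrite
dd$same <- ifelse(dd$x == dd$y, "yes", "no")
dd$same
```

Both styles give the same result here. Assignment into a logical subset is handy when only a few values need changing, while ifelse reads well when every row gets one of two values.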

3.1 Exercises

  1. Extract the population values of all countries whose life expectancy is more than 70 years for the year 1967.

  2. For the year 2007, how many countries had a life expectancy of at least 75?

  3. Add a variable called G8 to the gapminder data frame, which will be equal to 1 or 0 depending on whether the country is in the G8 group of nations: France, Germany, Italy, the United Kingdom, Japan, the United States, Canada, and Russia.

  4. Create a plot of your choice that involves countries of the G8.