Libraries

library(tidyverse)
library(forcats)
library(lubridate)
library(ggthemes)
library(gdata)
library(DBI)
library(data.table)

0. Course Description

Importing data into R should be the easiest step in your analysis. Unfortunately, that is almost never the case. Data can come in many formats, ranging from .csv and text files, to statistical software files, to databases and HTML data. Knowing which approach to use is key to getting started with the actual analysis. In this course, you’ll start by learning how to read .csv and text files in R. You will then cover the readr and data.table packages to easily and efficiently import flat file data. After that, you will learn how to read .xls files in R using readxl and gdata.

1. Importing data from flat files with utils

A lot of data comes in the form of flat files: simple tabular text files. Learn how to import the common formats of flat file data with base R functions.

1.1 read.csv

The utils package, which is automatically loaded in your R session on startup, can import CSV files with the read.csv() function.

# Import swimming_pools.csv: pools
pools <- read.csv("swimming_pools.csv", stringsAsFactors = TRUE)
# Print the structure of pools
str(pools)
## 'data.frame':    20 obs. of  5 variables:
##  $ X        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Name     : Factor w/ 20 levels "Acacia Ridge Leisure Centre",..: 1 2 3 4 5 6 19 7 8 9 ...
##  $ Address  : Factor w/ 20 levels "1 Fairlead Crescent, Manly",..: 5 20 18 10 9 11 6 15 12 17 ...
##  $ Latitude : num  -27.6 -27.6 -27.6 -27.5 -27.4 ...
##  $ Longitude: num  153 153 153 153 153 ...

1.2 stringsAsFactors

With stringsAsFactors, you can tell R whether it should convert strings in the flat file to factors.

For all importing functions in the utils package, this argument is TRUE, which means that you import strings as factors. This only makes sense if the strings you import represent categorical variables in R. If you set stringsAsFactors to FALSE, the data frame columns corresponding to strings in your text file will be character.

# Import swimming_pools.csv correctly: pools
pools<-read.csv("swimming_pools.csv", stringsAsFactors=FALSE)
# Check the structure of pools
str(pools)
## 'data.frame':    20 obs. of  5 variables:
##  $ X        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Name     : chr  "Acacia Ridge Leisure Centre" "Bellbowrie Pool" "Carole Park" "Centenary Pool (inner City)" ...
##  $ Address  : chr  "1391 Beaudesert Road, Acacia Ridge" "Sugarwood Street, Bellbowrie" "Cnr Boundary Road and Waterford Road Wacol" "400 Gregory Terrace, Spring Hill" ...
##  $ Latitude : num  -27.6 -27.6 -27.6 -27.5 -27.4 ...
##  $ Longitude: num  153 153 153 153 153 ...

1.3 read.delim

Aside from .csv files, there are also the .txt files which are basically text files. You can import these functions with read.delim(). By default, it sets the sep argument to “ (fields in a record are delimited by tabs) and the header argument to TRUE (the first row contains the field names).

# Import hotdogs.txt: hotdogs
hotdogs <- read.delim("hotdogs.txt", header = FALSE)

# Summarize hotdogs
summary(hotdogs)
##        V1           V2              V3       
##  Beef   :20   Min.   : 86.0   Min.   :144.0  
##  Meat   :17   1st Qu.:132.0   1st Qu.:362.5  
##  Poultry:17   Median :145.0   Median :405.0  
##               Mean   :145.4   Mean   :424.8  
##               3rd Qu.:172.8   3rd Qu.:503.5  
##               Max.   :195.0   Max.   :645.0

1.4 read.table

If you’re dealing with more exotic flat file formats, you’ll want to use read.table(). It’s the most basic importing function; you can specify tons of different arguments in this function. Unlike read.csv() and read.delim(), the header argument defaults to FALSE and the sep argument is “” by default.

Up to you again! The data is still hotdogs.txt. It has no column names in the first row, and the field separators are tabs. This time, though, the file is in the data folder inside your current working directory. A variable path with the location of this file is already coded for you.

# Import the hotdogs.txt file: hotdogs
hotdogs <- read.table("hotdogs.txt", 
                      sep = "\t", 
                      col.names = c("type", "calories", "sodium"))

# Call head() on hotdogs
head(hotdogs)
##   type calories sodium
## 1 Beef      186    495
## 2 Beef      181    477
## 3 Beef      176    425
## 4 Beef      149    322
## 5 Beef      184    482
## 6 Beef      190    587

Next to column names, you can also specify the column types or column classes of the resulting data frame. You can do this by setting the colClasses argument to a vector of strings representing classes:

# Display structure of hotdogs
str(hotdogs)
## 'data.frame':    54 obs. of  3 variables:
##  $ type    : Factor w/ 3 levels "Beef","Meat",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ calories: int  186 181 176 149 184 190 158 139 175 148 ...
##  $ sodium  : int  495 477 425 322 482 587 370 322 479 375 ...
# Edit the colClasses argument to import the data correctly: hotdogs2
hotdogs2 <- read.delim("hotdogs.txt", header = FALSE, 
                       col.names = c("type", "calories", "sodium"),
                       colClasses = c("factor", "NULL", "numeric"))
# Display structure of hotdogs2
str(hotdogs2)
## 'data.frame':    54 obs. of  2 variables:
##  $ type  : Factor w/ 3 levels "Beef","Meat",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ sodium: num  495 477 425 322 482 587 370 322 479 375 ...

2. readr & data.table

In addition to base R, there are dedicated packages to easily and efficiently import flat file data. We’ll talk about two such packages: readr and data.table.

2.1 read_csv and read.tsv

CSV files can be imported with read_csv(). It’s a wrapper function around read_delim() that handles all the details for you. For example, it will assume that the first row contains the column names.

Where you use read_csv() to easily read in CSV files, you use read_tsv() to easily read in TSV files. TSV is short for tab-separated values.

# Import potatoes.csv with read_csv(): potatoes
potatoes_1<- read.csv("potatoes.csv")
head(potatoes_1)
##   X area temp size storage method texture flavor moistness
## 1 1    1    1    1       1      1     2.9    3.2       3.0
## 2 2    1    1    1       1      2     2.3    2.5       2.6
## 3 3    1    1    1       1      3     2.5    2.8       2.8
## 4 4    1    1    1       1      4     2.1    2.9       2.4
## 5 5    1    1    1       1      5     1.9    2.8       2.2
## 6 6    1    1    1       2      1     1.8    3.0       1.7
# Column names
properties <- c("area", "temp", "size", "storage", "method",
                "texture", "flavor", "moistness")

# Import potatoes.txt: potatoes
potatoes_2 <- read_tsv("potatoes.txt", col_names = properties)
## Rows: 160 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## dbl (8): area, temp, size, storage, method, texture, flavor, moistness
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(potatoes_2)
## # A tibble: 6 × 8
##    area  temp  size storage method texture flavor moistness
##   <dbl> <dbl> <dbl>   <dbl>  <dbl>   <dbl>  <dbl>     <dbl>
## 1     1     1     1       1      1     2.9    3.2       3  
## 2     1     1     1       1      2     2.3    2.5       2.6
## 3     1     1     1       1      3     2.5    2.8       2.8
## 4     1     1     1       1      4     2.1    2.9       2.4
## 5     1     1     1       1      5     1.9    2.8       2.2
## 6     1     1     1       2      1     1.8    3         1.7

2.2 read_delim

Just as read.table() was the main utils function, read_delim() is the main readr function. read_delim() takes two mandatory arguments: * file: the file that contains the data * delim: the character that separates the values in the data file

# Import potatoes.txt using read_delim(): potatoes
potatoes_3 <- read_delim("potatoes.txt", delim = "\t", col_names = properties)
## Rows: 160 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## dbl (8): area, temp, size, storage, method, texture, flavor, moistness
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(potatoes_3)
## # A tibble: 6 × 8
##    area  temp  size storage method texture flavor moistness
##   <dbl> <dbl> <dbl>   <dbl>  <dbl>   <dbl>  <dbl>     <dbl>
## 1     1     1     1       1      1     2.9    3.2       3  
## 2     1     1     1       1      2     2.3    2.5       2.6
## 3     1     1     1       1      3     2.5    2.8       2.8
## 4     1     1     1       1      4     2.1    2.9       2.4
## 5     1     1     1       1      5     1.9    2.8       2.2
## 6     1     1     1       2      1     1.8    3         1.7
# Import 5 observations from potatoes.txt: potatoes_fragment
potatoes_fragment <- read_tsv("potatoes.txt", skip = 6, n_max = 5, col_names = properties)
## Rows: 5 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## dbl (8): area, temp, size, storage, method, texture, flavor, moistness
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
potatoes_fragment
## # A tibble: 5 × 8
##    area  temp  size storage method texture flavor moistness
##   <dbl> <dbl> <dbl>   <dbl>  <dbl>   <dbl>  <dbl>     <dbl>
## 1     1     1     1       2      2     2.6    3.1       2.4
## 2     1     1     1       2      3     3      3         2.9
## 3     1     1     1       2      4     2.2    3.2       2.5
## 4     1     1     1       2      5     2      2.8       1.9
## 5     1     1     1       3      1     1.8    2.6       1.5
# Import all data, but force all columns to be character: potatoes_char
potatoes_char <- read_tsv("potatoes.txt", col_types = "cccccccc", col_names = properties)
# Print out structure of potatoes_char
str(potatoes_char)
## spec_tbl_df [160 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ area     : chr [1:160] "1" "1" "1" "1" ...
##  $ temp     : chr [1:160] "1" "1" "1" "1" ...
##  $ size     : chr [1:160] "1" "1" "1" "1" ...
##  $ storage  : chr [1:160] "1" "1" "1" "1" ...
##  $ method   : chr [1:160] "1" "2" "3" "4" ...
##  $ texture  : chr [1:160] "2.9" "2.3" "2.5" "2.1" ...
##  $ flavor   : chr [1:160] "3.2" "2.5" "2.8" "2.9" ...
##  $ moistness: chr [1:160] "3.0" "2.6" "2.8" "2.4" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   area = col_character(),
##   ..   temp = col_character(),
##   ..   size = col_character(),
##   ..   storage = col_character(),
##   ..   method = col_character(),
##   ..   texture = col_character(),
##   ..   flavor = col_character(),
##   ..   moistness = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

2.3 col_types

You can also specify which types the columns in your imported data frame should have. You can do this with col_types. If set to NULL, the default, functions from the readr package will try to find the correct types themselves. You can manually set the types with a string, where each character denotes the class of the column: character, double, integer and logical. _ skips the column as a whole.

# Import without col_types
hotdogs <- read_tsv("hotdogs.txt", col_names = c("type", "calories", "sodium"))
## Rows: 54 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): type
## dbl (2): calories, sodium
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(hotdogs)
##      type              calories         sodium     
##  Length:54          Min.   : 86.0   Min.   :144.0  
##  Class :character   1st Qu.:132.0   1st Qu.:362.5  
##  Mode  :character   Median :145.0   Median :405.0  
##                     Mean   :145.4   Mean   :424.8  
##                     3rd Qu.:172.8   3rd Qu.:503.5  
##                     Max.   :195.0   Max.   :645.0
# The collectors you will need to import the data
fac <- col_factor(levels = c("Beef", "Meat", "Poultry"))
int <- col_integer()

# Edit the col_types argument to import the data correctly: hotdogs_factor
hotdogs_factor <- read_tsv("hotdogs.txt",
                           col_names = c("type", "calories", "sodium"),
                           col_types = list(fac, int, int))

# Display the summary of hotdogs_factor
summary(hotdogs_factor)
##       type       calories         sodium     
##  Beef   :20   Min.   : 86.0   Min.   :144.0  
##  Meat   :17   1st Qu.:132.0   1st Qu.:362.5  
##  Poultry:17   Median :145.0   Median :405.0  
##               Mean   :145.4   Mean   :424.8  
##               3rd Qu.:172.8   3rd Qu.:503.5  
##               Max.   :195.0   Max.   :645.0

2.4 data.table: fread

You still remember how to use read.table(), right? Well, fread() is a function that does the same job with very similar arguments. It is extremely easy to use and blazingly fast! Often, simply specifying the path to the file is enough to successfully import your data.

# Import potatoes.csv with fread(): potatoes
potatoes_4 <- fread("potatoes.csv")
# Print out potatoes
head(potatoes_4)
##    V1 area temp size storage method texture flavor moistness
## 1:  1    1    1    1       1      1     2.9    3.2       3.0
## 2:  2    1    1    1       1      2     2.3    2.5       2.6
## 3:  3    1    1    1       1      3     2.5    2.8       2.8
## 4:  4    1    1    1       1      4     2.1    2.9       2.4
## 5:  5    1    1    1       1      5     1.9    2.8       2.2
## 6:  6    1    1    1       2      1     1.8    3.0       1.7
# Import columns 6 and 8 of potatoes.csv: potatoes
potatoes_5<-fread("potatoes.csv", select=c(6, 8) )

# Plot texture (x) and moistness (y) of potatoes
plot(potatoes_4$texture, potatoes_4$moistness)

3. Importing Excel data

Excel is a widely used data analysis tool. If you prefer to do your analyses in R, though, you’ll need an understanding of how to import .csv data into R. This chapter will show you how to use readxl and gdata to do so.

3.1 List the sheets and Import an Excel sheet

Before you can start importing from Excel, you should find out which sheets are available in the workbook. You can use the excel_sheets() function for this.

You will find the Excel file urbanpop.xlsx (view) in your working directory (type dir() to see it). This dataset contains urban population metrics for practically all countries in the world throughout time (Source: Gapminder). It contains three sheets for three different time periods. In each sheet, the first row contains the column names.

library(readxl)
# Print the names of all worksheets
excel_sheets("urbanpop.xlsx")
## [1] "1960-1966" "1967-1974" "1975-2011"
# Read the sheets, one by one
pop_1 <- read_excel("urbanpop.xlsx", sheet = 1)
pop_2 <- read_excel("urbanpop.xlsx", sheet = 2)
pop_3 <- read_excel("urbanpop.xlsx", sheet= 3)

# Put pop_1, pop_2 and pop_3 in a list: pop_list
pop_list <- list(pop_1, pop_2, pop_3)

# Display the structure of pop_list
str(pop_list)
## List of 3
##  $ : tibble [209 × 8] (S3: tbl_df/tbl/data.frame)
##   ..$ country: chr [1:209] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##   ..$ 1960   : num [1:209] 769308 494443 3293999 NA NA ...
##   ..$ 1961   : num [1:209] 814923 511803 3515148 13660 8724 ...
##   ..$ 1962   : num [1:209] 858522 529439 3739963 14166 9700 ...
##   ..$ 1963   : num [1:209] 903914 547377 3973289 14759 10748 ...
##   ..$ 1964   : num [1:209] 951226 565572 4220987 15396 11866 ...
##   ..$ 1965   : num [1:209] 1000582 583983 4488176 16045 13053 ...
##   ..$ 1966   : num [1:209] 1058743 602512 4649105 16693 14217 ...
##  $ : tibble [209 × 9] (S3: tbl_df/tbl/data.frame)
##   ..$ country: chr [1:209] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##   ..$ 1967   : num [1:209] 1119067 621180 4826104 17349 15440 ...
##   ..$ 1968   : num [1:209] 1182159 639964 5017299 17996 16727 ...
##   ..$ 1969   : num [1:209] 1248901 658853 5219332 18619 18088 ...
##   ..$ 1970   : num [1:209] 1319849 677839 5429743 19206 19529 ...
##   ..$ 1971   : num [1:209] 1409001 698932 5619042 19752 20929 ...
##   ..$ 1972   : num [1:209] 1502402 720207 5815734 20263 22406 ...
##   ..$ 1973   : num [1:209] 1598835 741681 6020647 20742 23937 ...
##   ..$ 1974   : num [1:209] 1696445 763385 6235114 21194 25482 ...
##  $ : tibble [209 × 38] (S3: tbl_df/tbl/data.frame)
##   ..$ country: chr [1:209] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##   ..$ 1975   : num [1:209] 1793266 785350 6460138 21632 27019 ...
##   ..$ 1976   : num [1:209] 1905033 807990 6774099 22047 28366 ...
##   ..$ 1977   : num [1:209] 2021308 830959 7102902 22452 29677 ...
##   ..$ 1978   : num [1:209] 2142248 854262 7447728 22899 31037 ...
##   ..$ 1979   : num [1:209] 2268015 877898 7810073 23457 32572 ...
##   ..$ 1980   : num [1:209] 2398775 901884 8190772 24177 34366 ...
##   ..$ 1981   : num [1:209] 2493265 927224 8637724 25173 36356 ...
##   ..$ 1982   : num [1:209] 2590846 952447 9105820 26342 38618 ...
##   ..$ 1983   : num [1:209] 2691612 978476 9591900 27655 40983 ...
##   ..$ 1984   : num [1:209] 2795656 1006613 10091289 29062 43207 ...
##   ..$ 1985   : num [1:209] 2903078 1037541 10600112 30524 45119 ...
##   ..$ 1986   : num [1:209] 3006983 1072365 11101757 32014 46254 ...
##   ..$ 1987   : num [1:209] 3113957 1109954 11609104 33548 47019 ...
##   ..$ 1988   : num [1:209] 3224082 1146633 12122941 35095 47669 ...
##   ..$ 1989   : num [1:209] 3337444 1177286 12645263 36618 48577 ...
##   ..$ 1990   : num [1:209] 3454129 1198293 13177079 38088 49982 ...
##   ..$ 1991   : num [1:209] 3617842 1215445 13708813 39600 51972 ...
##   ..$ 1992   : num [1:209] 3788685 1222544 14248297 41049 54469 ...
##   ..$ 1993   : num [1:209] 3966956 1222812 14789176 42443 57079 ...
##   ..$ 1994   : num [1:209] 4152960 1221364 15322651 43798 59243 ...
##   ..$ 1995   : num [1:209] 4347018 1222234 15842442 45129 60598 ...
##   ..$ 1996   : num [1:209] 4531285 1228760 16395553 46343 60927 ...
##   ..$ 1997   : num [1:209] 4722603 1238090 16935451 47527 60462 ...
##   ..$ 1998   : num [1:209] 4921227 1250366 17469200 48705 59685 ...
##   ..$ 1999   : num [1:209] 5127421 1265195 18007937 49906 59281 ...
##   ..$ 2000   : num [1:209] 5341456 1282223 18560597 51151 59719 ...
##   ..$ 2001   : num [1:209] 5564492 1315690 19198872 52341 61062 ...
##   ..$ 2002   : num [1:209] 5795940 1352278 19854835 53583 63212 ...
##   ..$ 2003   : num [1:209] 6036100 1391143 20529356 54864 65802 ...
##   ..$ 2004   : num [1:209] 6285281 1430918 21222198 56166 68301 ...
##   ..$ 2005   : num [1:209] 6543804 1470488 21932978 57474 70329 ...
##   ..$ 2006   : num [1:209] 6812538 1512255 22625052 58679 71726 ...
##   ..$ 2007   : num [1:209] 7091245 1553491 23335543 59894 72684 ...
##   ..$ 2008   : num [1:209] 7380272 1594351 24061749 61118 73335 ...
##   ..$ 2009   : num [1:209] 7679982 1635262 24799591 62357 73897 ...
##   ..$ 2010   : num [1:209] 7990746 1676545 25545622 63616 74525 ...
##   ..$ 2011   : num [1:209] 8316976 1716842 26216968 64817 75207 ...
# Import the first Excel sheet of urbanpop_nonames.xlsx (R gives names): pop_a
pop_a <- read_excel("urbanpop_nonames.xlsx", col_names= FALSE)
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * ...
# Import the first Excel sheet of urbanpop_nonames.xlsx (specify col_names): pop_b
cols <- c("country", paste0("year_", 1960:1966))
pop_b <- read_excel("urbanpop_nonames.xlsx", col_names = cols)

# Print the summary of pop_a
summary(pop_a)
##      ...1                ...2                ...3                ...4          
##  Length:209         Min.   :     3378   Min.   :     1028   Min.   :     1090  
##  Class :character   1st Qu.:    88978   1st Qu.:    70644   1st Qu.:    74974  
##  Mode  :character   Median :   580675   Median :   570159   Median :   593968  
##                     Mean   :  4988124   Mean   :  4991613   Mean   :  5141592  
##                     3rd Qu.:  3077228   3rd Qu.:  2807280   3rd Qu.:  2948396  
##                     Max.   :126469700   Max.   :129268133   Max.   :131974143  
##                     NA's   :11                                                 
##       ...5                ...6                ...7          
##  Min.   :     1154   Min.   :     1218   Min.   :     1281  
##  1st Qu.:    81870   1st Qu.:    84953   1st Qu.:    88633  
##  Median :   619331   Median :   645262   Median :   679109  
##  Mean   :  5303711   Mean   :  5468966   Mean   :  5637394  
##  3rd Qu.:  3148941   3rd Qu.:  3296444   3rd Qu.:  3317422  
##  Max.   :134599886   Max.   :137205240   Max.   :139663053  
##                                                             
##       ...8          
##  Min.   :     1349  
##  1st Qu.:    93638  
##  Median :   735139  
##  Mean   :  5790281  
##  3rd Qu.:  3418036  
##  Max.   :141962708  
## 
# Print the summary of pop_b
summary(pop_b)
##    country            year_1960           year_1961           year_1962        
##  Length:209         Min.   :     3378   Min.   :     1028   Min.   :     1090  
##  Class :character   1st Qu.:    88978   1st Qu.:    70644   1st Qu.:    74974  
##  Mode  :character   Median :   580675   Median :   570159   Median :   593968  
##                     Mean   :  4988124   Mean   :  4991613   Mean   :  5141592  
##                     3rd Qu.:  3077228   3rd Qu.:  2807280   3rd Qu.:  2948396  
##                     Max.   :126469700   Max.   :129268133   Max.   :131974143  
##                     NA's   :11                                                 
##    year_1963           year_1964           year_1965        
##  Min.   :     1154   Min.   :     1218   Min.   :     1281  
##  1st Qu.:    81870   1st Qu.:    84953   1st Qu.:    88633  
##  Median :   619331   Median :   645262   Median :   679109  
##  Mean   :  5303711   Mean   :  5468966   Mean   :  5637394  
##  3rd Qu.:  3148941   3rd Qu.:  3296444   3rd Qu.:  3317422  
##  Max.   :134599886   Max.   :137205240   Max.   :139663053  
##                                                             
##    year_1966        
##  Min.   :     1349  
##  1st Qu.:    93638  
##  Median :   735139  
##  Mean   :  5790281  
##  3rd Qu.:  3418036  
##  Max.   :141962708  
## 
# Import the second sheet of urbanpop.xlsx, skipping the first 21 rows: urbanpop_sel
urbanpop_sel <- read_excel("urbanpop.xlsx", sheet = 2, col_names = FALSE, skip = 21)
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * ...
# Print out the first observation from urbanpop_sel
urbanpop_sel[1,]
## # A tibble: 1 × 9
##   ...1     ...2    ...3    ...4    ...5    ...6    ...7    ...8    ...9
##   <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 Benin 382022. 411859. 443013. 475611. 515820. 557938. 602093. 648410.

3.2 Reading a workbook

In the previous exercise you generated a list of three Excel sheets that you imported. However, loading in every sheet manually and then merging them in a list can be quite tedious. Luckily, you can automate this with lapply(). If you have no experience with lapply(), feel free to take Chapter 4 of the Intermediate R course.

# Read all Excel sheets with lapply(): pop_list
pop_list <- lapply(excel_sheets("urbanpop.xlsx"), read_excel, path = "urbanpop.xlsx")
# Display the structure of pop_list
str(pop_list)
## List of 3
##  $ : tibble [209 × 8] (S3: tbl_df/tbl/data.frame)
##   ..$ country: chr [1:209] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##   ..$ 1960   : num [1:209] 769308 494443 3293999 NA NA ...
##   ..$ 1961   : num [1:209] 814923 511803 3515148 13660 8724 ...
##   ..$ 1962   : num [1:209] 858522 529439 3739963 14166 9700 ...
##   ..$ 1963   : num [1:209] 903914 547377 3973289 14759 10748 ...
##   ..$ 1964   : num [1:209] 951226 565572 4220987 15396 11866 ...
##   ..$ 1965   : num [1:209] 1000582 583983 4488176 16045 13053 ...
##   ..$ 1966   : num [1:209] 1058743 602512 4649105 16693 14217 ...
##  $ : tibble [209 × 9] (S3: tbl_df/tbl/data.frame)
##   ..$ country: chr [1:209] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##   ..$ 1967   : num [1:209] 1119067 621180 4826104 17349 15440 ...
##   ..$ 1968   : num [1:209] 1182159 639964 5017299 17996 16727 ...
##   ..$ 1969   : num [1:209] 1248901 658853 5219332 18619 18088 ...
##   ..$ 1970   : num [1:209] 1319849 677839 5429743 19206 19529 ...
##   ..$ 1971   : num [1:209] 1409001 698932 5619042 19752 20929 ...
##   ..$ 1972   : num [1:209] 1502402 720207 5815734 20263 22406 ...
##   ..$ 1973   : num [1:209] 1598835 741681 6020647 20742 23937 ...
##   ..$ 1974   : num [1:209] 1696445 763385 6235114 21194 25482 ...
##  $ : tibble [209 × 38] (S3: tbl_df/tbl/data.frame)
##   ..$ country: chr [1:209] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##   ..$ 1975   : num [1:209] 1793266 785350 6460138 21632 27019 ...
##   ..$ 1976   : num [1:209] 1905033 807990 6774099 22047 28366 ...
##   ..$ 1977   : num [1:209] 2021308 830959 7102902 22452 29677 ...
##   ..$ 1978   : num [1:209] 2142248 854262 7447728 22899 31037 ...
##   ..$ 1979   : num [1:209] 2268015 877898 7810073 23457 32572 ...
##   ..$ 1980   : num [1:209] 2398775 901884 8190772 24177 34366 ...
##   ..$ 1981   : num [1:209] 2493265 927224 8637724 25173 36356 ...
##   ..$ 1982   : num [1:209] 2590846 952447 9105820 26342 38618 ...
##   ..$ 1983   : num [1:209] 2691612 978476 9591900 27655 40983 ...
##   ..$ 1984   : num [1:209] 2795656 1006613 10091289 29062 43207 ...
##   ..$ 1985   : num [1:209] 2903078 1037541 10600112 30524 45119 ...
##   ..$ 1986   : num [1:209] 3006983 1072365 11101757 32014 46254 ...
##   ..$ 1987   : num [1:209] 3113957 1109954 11609104 33548 47019 ...
##   ..$ 1988   : num [1:209] 3224082 1146633 12122941 35095 47669 ...
##   ..$ 1989   : num [1:209] 3337444 1177286 12645263 36618 48577 ...
##   ..$ 1990   : num [1:209] 3454129 1198293 13177079 38088 49982 ...
##   ..$ 1991   : num [1:209] 3617842 1215445 13708813 39600 51972 ...
##   ..$ 1992   : num [1:209] 3788685 1222544 14248297 41049 54469 ...
##   ..$ 1993   : num [1:209] 3966956 1222812 14789176 42443 57079 ...
##   ..$ 1994   : num [1:209] 4152960 1221364 15322651 43798 59243 ...
##   ..$ 1995   : num [1:209] 4347018 1222234 15842442 45129 60598 ...
##   ..$ 1996   : num [1:209] 4531285 1228760 16395553 46343 60927 ...
##   ..$ 1997   : num [1:209] 4722603 1238090 16935451 47527 60462 ...
##   ..$ 1998   : num [1:209] 4921227 1250366 17469200 48705 59685 ...
##   ..$ 1999   : num [1:209] 5127421 1265195 18007937 49906 59281 ...
##   ..$ 2000   : num [1:209] 5341456 1282223 18560597 51151 59719 ...
##   ..$ 2001   : num [1:209] 5564492 1315690 19198872 52341 61062 ...
##   ..$ 2002   : num [1:209] 5795940 1352278 19854835 53583 63212 ...
##   ..$ 2003   : num [1:209] 6036100 1391143 20529356 54864 65802 ...
##   ..$ 2004   : num [1:209] 6285281 1430918 21222198 56166 68301 ...
##   ..$ 2005   : num [1:209] 6543804 1470488 21932978 57474 70329 ...
##   ..$ 2006   : num [1:209] 6812538 1512255 22625052 58679 71726 ...
##   ..$ 2007   : num [1:209] 7091245 1553491 23335543 59894 72684 ...
##   ..$ 2008   : num [1:209] 7380272 1594351 24061749 61118 73335 ...
##   ..$ 2009   : num [1:209] 7679982 1635262 24799591 62357 73897 ...
##   ..$ 2010   : num [1:209] 7990746 1676545 25545622 63616 74525 ...
##   ..$ 2011   : num [1:209] 8316976 1716842 26216968 64817 75207 ...

3.3 The col_names argument

Apart from path and sheet, there are several other arguments you can specify in read_excel(). One of these arguments is called col_names.

By default it is TRUE, denoting whether the first row in the Excel sheets contains the column names. If this is not the case, you can set col_names to FALSE. In this case, R will choose column names for you. You can also choose to set col_names to a character vector with names for each column. It works exactly the same as in the readr package.

# Import the first Excel sheet of urbanpop_nonames.xlsx (R gives names): pop_a
pop_a <- read_excel("urbanpop_nonames.xlsx", col_names = FALSE)
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * ...
# Import the first Excel sheet of urbanpop_nonames.xlsx (specify col_names): pop_b
cols <- c("country", paste0("year_", 1960:1966))
pop_b <- read_excel("urbanpop_nonames.xlsx", col_names = cols)

# Print the summary of pop_a
summary(pop_a)
##      ...1                ...2                ...3                ...4          
##  Length:209         Min.   :     3378   Min.   :     1028   Min.   :     1090  
##  Class :character   1st Qu.:    88978   1st Qu.:    70644   1st Qu.:    74974  
##  Mode  :character   Median :   580675   Median :   570159   Median :   593968  
##                     Mean   :  4988124   Mean   :  4991613   Mean   :  5141592  
##                     3rd Qu.:  3077228   3rd Qu.:  2807280   3rd Qu.:  2948396  
##                     Max.   :126469700   Max.   :129268133   Max.   :131974143  
##                     NA's   :11                                                 
##       ...5                ...6                ...7          
##  Min.   :     1154   Min.   :     1218   Min.   :     1281  
##  1st Qu.:    81870   1st Qu.:    84953   1st Qu.:    88633  
##  Median :   619331   Median :   645262   Median :   679109  
##  Mean   :  5303711   Mean   :  5468966   Mean   :  5637394  
##  3rd Qu.:  3148941   3rd Qu.:  3296444   3rd Qu.:  3317422  
##  Max.   :134599886   Max.   :137205240   Max.   :139663053  
##                                                             
##       ...8          
##  Min.   :     1349  
##  1st Qu.:    93638  
##  Median :   735139  
##  Mean   :  5790281  
##  3rd Qu.:  3418036  
##  Max.   :141962708  
## 
# Print the summary of pop_b
summary(pop_b)
##    country            year_1960           year_1961           year_1962        
##  Length:209         Min.   :     3378   Min.   :     1028   Min.   :     1090  
##  Class :character   1st Qu.:    88978   1st Qu.:    70644   1st Qu.:    74974  
##  Mode  :character   Median :   580675   Median :   570159   Median :   593968  
##                     Mean   :  4988124   Mean   :  4991613   Mean   :  5141592  
##                     3rd Qu.:  3077228   3rd Qu.:  2807280   3rd Qu.:  2948396  
##                     Max.   :126469700   Max.   :129268133   Max.   :131974143  
##                     NA's   :11                                                 
##    year_1963           year_1964           year_1965        
##  Min.   :     1154   Min.   :     1218   Min.   :     1281  
##  1st Qu.:    81870   1st Qu.:    84953   1st Qu.:    88633  
##  Median :   619331   Median :   645262   Median :   679109  
##  Mean   :  5303711   Mean   :  5468966   Mean   :  5637394  
##  3rd Qu.:  3148941   3rd Qu.:  3296444   3rd Qu.:  3317422  
##  Max.   :134599886   Max.   :137205240   Max.   :139663053  
##                                                             
##    year_1966        
##  Min.   :     1349  
##  1st Qu.:    93638  
##  Median :   735139  
##  Mean   :  5790281  
##  3rd Qu.:  3418036  
##  Max.   :141962708  
## 

3.4 he skip argument

Another argument that can be very useful when reading in Excel files that are less tidy, is skip. With skip, you can tell R to ignore a specified number of rows inside the Excel sheets you’re trying to pull data from. Have a look at this example:

  • read_excel(“data.xlsx”, skip = 15) In this case, the first 15 rows in the first sheet of “data.xlsx” are ignored.
# Import the second sheet of urbanpop.xlsx, skipping the first 21 rows: urbanpop_sel
urbanpop_sel <- read_excel("urbanpop.xlsx", sheet=2, col_names = FALSE, skip=21)
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * ...
# Print out the first observation from urbanpop_sel
urbanpop_sel[1, ]
## # A tibble: 1 × 9
##   ...1     ...2    ...3    ...4    ...5    ...6    ...7    ...8    ...9
##   <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 Benin 382022. 411859. 443013. 475611. 515820. 557938. 602093. 648410.

3.5 Import a local file

In this part of the chapter you’ll learn how to import .xls files using the gdata package. Similar to the readxl package, you can import single Excel sheets from Excel sheets to start your analysis in R.

library(gdata)

# Import the second sheet of urbanpop.xls: urban_pop
urban_pop <- read.xls("urbanpop.xls", sheet = "1967-1974")

# Print the first 11 observations using head()
head(urban_pop, n = 11)
##                country       X1967       X1968       X1969       X1970
## 1          Afghanistan  1119067.20  1182159.06  1248900.79  1319848.78
## 2              Albania   621179.85   639964.46   658853.12   677839.12
## 3              Algeria  4826104.22  5017298.60  5219331.87  5429743.08
## 4       American Samoa    17348.66    17995.51    18618.68    19206.39
## 5              Andorra    15439.62    16726.99    18088.32    19528.96
## 6               Angola   757496.32   798459.26   841261.96   886401.63
## 7  Antigua and Barbuda    22086.25    22149.39    22182.92    22180.87
## 8            Argentina 17753280.98 18124103.64 18510462.30 18918072.79
## 9              Armenia  1337032.09  1392892.13  1449641.49  1507619.77
## 10               Aruba    29414.72    29576.09    29737.87    29901.57
## 11           Australia  9934404.03 10153969.77 10412390.67 10664093.55
##          X1971       X1972       X1973       X1974
## 1   1409001.09  1502401.79  1598835.45  1696444.83
## 2    698932.25   720206.57   741681.04   763385.45
## 3   5619041.53  5815734.49  6020647.35  6235114.38
## 4     19752.02    20262.67    20741.97    21194.38
## 5     20928.73    22405.84    23937.05    25481.98
## 6    955010.09  1027397.35  1103829.78  1184486.23
## 7     22560.87    22907.76    23221.29    23502.92
## 8  19329718.16 19763078.00 20211424.85 20664728.90
## 9   1564367.60  1622103.53  1680497.75  1739063.02
## 10    30081.36    30279.76    30467.42    30602.87
## 11 11047706.39 11269945.50 11461120.68 11772934.25
# Column names for urban_pop
columns <- c("country", paste0("year_", 1967:1974))

# Finish the read.xls call
urban_pop <- read.xls("urbanpop.xls", sheet = 2,
                      skip = 50, header = FALSE, stringsAsFactors = FALSE,
                      col.names = columns)

# Print first 10 observation of urban_pop
head(urban_pop, n = 10)
##               country   year_1967   year_1968   year_1969   year_1970
## 1              Cyprus   231929.74   237831.38   243983.34   250164.52
## 2      Czech Republic  6204409.91  6266304.50  6326368.97  6348794.89
## 3             Denmark  3777552.62  3826785.08  3874313.99  3930042.97
## 4            Djibouti    77788.04    84694.35    92045.77    99845.22
## 5            Dominica    27550.36    29527.32    31475.62    33328.25
## 6  Dominican Republic  1535485.43  1625455.76  1718315.40  1814060.00
## 7             Ecuador  2059355.12  2151395.14  2246890.79  2345864.41
## 8               Egypt 13798171.00 14248342.19 14703858.22 15162858.52
## 9         El Salvador  1345528.98  1387218.33  1429378.98  1472181.26
## 10  Equatorial Guinea    75364.50    77295.03    78445.74    78411.07
##      year_1971   year_1972   year_1973   year_1974
## 1    261213.21   272407.99   283774.90   295379.83
## 2   6437055.17  6572632.32  6718465.53  6873458.18
## 3   3981360.12  4028247.92  4076867.28  4120201.43
## 4    107799.69   116098.23   125391.58   136606.25
## 5     34761.52    36049.99    37260.05    38501.47
## 6   1915590.38  2020157.01  2127714.45  2238203.87
## 7   2453817.78  2565644.81  2681525.25  2801692.62
## 8  15603661.36 16047814.69 16498633.27 16960827.93
## 9   1527985.34  1584758.18  1642098.95  1699470.87
## 10    77055.29    74596.06    71438.96    68179.26
# Import all sheets from urbanpop.xls
path <- "urbanpop.xls"
urban_sheet1 <- read.xls(path, sheet = 1, stringsAsFactors = FALSE)
urban_sheet2 <- read.xls(path, sheet = 2, stringsAsFactors = FALSE)
urban_sheet3 <- read.xls(path, sheet = 3, stringsAsFactors = FALSE)

# Extend the cbind() call to include urban_sheet3: urban_all
urban <- cbind(urban_sheet1, urban_sheet2[-1], urban_sheet3[-1])

# Remove all rows with NAs from urban: urban_clean
urban_clean <- na.omit(urban)

# Print out a summary of urban_clean
summary(urban_clean)
##    country              X1960               X1961               X1962          
##  Length:197         Min.   :     3378   Min.   :     3433   Min.   :     3481  
##  Class :character   1st Qu.:    87735   1st Qu.:    92905   1st Qu.:    98331  
##  Mode  :character   Median :   599714   Median :   630788   Median :   659464  
##                     Mean   :  5012388   Mean   :  5282488   Mean   :  5440972  
##                     3rd Qu.:  3130085   3rd Qu.:  3155370   3rd Qu.:  3250211  
##                     Max.   :126469700   Max.   :129268133   Max.   :131974143  
##      X1963               X1964               X1965          
##  Min.   :     3532   Min.   :     3586   Min.   :     3644  
##  1st Qu.:   104988   1st Qu.:   112084   1st Qu.:   119322  
##  Median :   704989   Median :   740609   Median :   774957  
##  Mean   :  5612312   Mean   :  5786961   Mean   :  5964970  
##  3rd Qu.:  3416490   3rd Qu.:  3585464   3rd Qu.:  3666724  
##  Max.   :134599886   Max.   :137205240   Max.   :139663053  
##      X1966               X1967               X1968          
##  Min.   :     3706   Min.   :     3771   Min.   :     3835  
##  1st Qu.:   128565   1st Qu.:   138024   1st Qu.:   147846  
##  Median :   809768   Median :   838449   Median :   890270  
##  Mean   :  6126413   Mean   :  6288771   Mean   :  6451367  
##  3rd Qu.:  3871757   3rd Qu.:  4019906   3rd Qu.:  4158186  
##  Max.   :141962708   Max.   :144201722   Max.   :146340364  
##      X1969               X1970               X1971          
##  Min.   :     3893   Min.   :     3941   Min.   :     4017  
##  1st Qu.:   158252   1st Qu.:   171063   1st Qu.:   181483  
##  Median :   929450   Median :   976471   Median :  1008630  
##  Mean   :  6624909   Mean   :  6799110   Mean   :  6980895  
##  3rd Qu.:  4300669   3rd Qu.:  4440047   3rd Qu.:  4595966  
##  Max.   :148475901   Max.   :150922373   Max.   :152863831  
##      X1972               X1973               X1974          
##  Min.   :     4084   Min.   :     4146   Min.   :     4206  
##  1st Qu.:   189492   1st Qu.:   197792   1st Qu.:   205410  
##  Median :  1048738   Median :  1097293   Median :  1159402  
##  Mean   :  7165338   Mean   :  7349454   Mean   :  7540446  
##  3rd Qu.:  4766545   3rd Qu.:  4838297   3rd Qu.:  4906384  
##  Max.   :154530473   Max.   :156034106   Max.   :157488074  
##      X1975               X1976               X1977          
##  Min.   :     4267   Min.   :     4334   Min.   :     4402  
##  1st Qu.:   211746   1st Qu.:   216991   1st Qu.:   222209  
##  Median :  1223146   Median :  1249829   Median :  1311276  
##  Mean   :  7731973   Mean   :  7936401   Mean   :  8145945  
##  3rd Qu.:  5003370   3rd Qu.:  5121118   3rd Qu.:  5227677  
##  Max.   :159452730   Max.   :165583752   Max.   :171550310  
##      X1978               X1979               X1980          
##  Min.   :     4470   Min.   :     4539   Min.   :     4607  
##  1st Qu.:   227605   1st Qu.:   233461   1st Qu.:   242583  
##  Median :  1340811   Median :  1448185   Median :  1592397  
##  Mean   :  8361360   Mean   :  8583138   Mean   :  8808772  
##  3rd Qu.:  5352746   3rd Qu.:  5558850   3rd Qu.:  5815772  
##  Max.   :177605736   Max.   :183785364   Max.   :189947471  
##      X1981               X1982               X1983          
##  Min.   :     4645   Min.   :     4681   Min.   :     4716  
##  1st Qu.:   248948   1st Qu.:   257944   1st Qu.:   274139  
##  Median :  1673079   Median :  1713060   Median :  1730626  
##  Mean   :  9049163   Mean   :  9295226   Mean   :  9545035  
##  3rd Qu.:  6070457   3rd Qu.:  6337995   3rd Qu.:  6619987  
##  Max.   :199385258   Max.   :209435968   Max.   :219680098  
##      X1984               X1985               X1986          
##  Min.   :     4750   Min.   :     4782   Min.   :     4809  
##  1st Qu.:   284939   1st Qu.:   300928   1st Qu.:   307699  
##  Median :  1749033   Median :  1786125   Median :  1850910  
##  Mean   :  9798559   Mean   : 10058661   Mean   : 10323839  
##  3rd Qu.:  6918261   3rd Qu.:  6931780   3rd Qu.:  6935763  
##  Max.   :229872397   Max.   :240414890   Max.   :251630158  
##      X1987               X1988               X1989          
##  Min.   :     4835   Min.   :     4859   Min.   :     4883  
##  1st Qu.:   321125   1st Qu.:   334616   1st Qu.:   347348  
##  Median :  1953694   Median :  1997011   Median :  1993544  
##  Mean   : 10595817   Mean   : 10873041   Mean   : 11154458  
##  3rd Qu.:  6939905   3rd Qu.:  6945022   3rd Qu.:  6885378  
##  Max.   :263433513   Max.   :275570541   Max.   :287810747  
##      X1990               X1991               X1992          
##  Min.   :     4907   Min.   :     4946   Min.   :     4985  
##  1st Qu.:   370152   1st Qu.:   394611   1st Qu.:   418788  
##  Median :  2066505   Median :  2150230   Median :  2237405  
##  Mean   : 11438543   Mean   : 11725076   Mean   : 12010922  
##  3rd Qu.:  6830026   3rd Qu.:  6816589   3rd Qu.:  6820099  
##  Max.   :300165618   Max.   :314689997   Max.   :329099365  
##      X1993               X1994               X1995          
##  Min.   :     5024   Min.   :     5062   Min.   :     5100  
##  1st Qu.:   427457   1st Qu.:   435959   1st Qu.:   461993  
##  Median :  2322158   Median :  2410297   Median :  2482393  
##  Mean   : 12296949   Mean   : 12582930   Mean   : 12871480  
##  3rd Qu.:  7139656   3rd Qu.:  7499901   3rd Qu.:  7708571  
##  Max.   :343555327   Max.   :358232230   Max.   :373035157  
##      X1996               X1997               X1998          
##  Min.   :     5079   Min.   :     5055   Min.   :     5029  
##  1st Qu.:   488136   1st Qu.:   494203   1st Qu.:   498002  
##  Median :  2522460   Median :  2606125   Median :  2664983  
##  Mean   : 13165924   Mean   : 13463675   Mean   : 13762861  
##  3rd Qu.:  7686092   3rd Qu.:  7664316   3rd Qu.:  7784056  
##  Max.   :388936607   Max.   :405031716   Max.   :421147610  
##      X1999               X2000               X2001          
##  Min.   :     5001   Min.   :     4971   Min.   :     5003  
##  1st Qu.:   505144   1st Qu.:   525629   1st Qu.:   550638  
##  Median :  2737809   Median :  2826647   Median :  2925851  
##  Mean   : 14063387   Mean   : 14369278   Mean   : 14705743  
##  3rd Qu.:  8083488   3rd Qu.:  8305564   3rd Qu.:  8421967  
##  Max.   :437126845   Max.   :452999147   Max.   :473204511  
##      X2002               X2003               X2004          
##  Min.   :     5034   Min.   :     5064   Min.   :     5090  
##  1st Qu.:   567531   1st Qu.:   572094   1st Qu.:   593900  
##  Median :  2928252   Median :  2944934   Median :  2994356  
##  Mean   : 15043381   Mean   : 15384513   Mean   : 15730299  
##  3rd Qu.:  8448628   3rd Qu.:  8622732   3rd Qu.:  8999112  
##  Max.   :493402140   Max.   :513607776   Max.   :533892175  
##      X2005               X2006               X2007          
##  Min.   :     5111   Min.   :     5135   Min.   :     5155  
##  1st Qu.:   620511   1st Qu.:   632659   1st Qu.:   645172  
##  Median :  3057923   Median :  3269963   Median :  3432024  
##  Mean   : 16080262   Mean   : 16435872   Mean   : 16797484  
##  3rd Qu.:  9394001   3rd Qu.:  9689807   3rd Qu.:  9803381  
##  Max.   :554367818   Max.   :575050081   Max.   :595731464  
##      X2008               X2009               X2010          
##  Min.   :     5172   Min.   :     5189   Min.   :     5206  
##  1st Qu.:   658017   1st Qu.:   671085   1st Qu.:   684302  
##  Median :  3589395   Median :  3652338   Median :  3676309  
##  Mean   : 17164898   Mean   : 17533997   Mean   : 17904811  
##  3rd Qu.: 10210317   3rd Qu.: 10518289   3rd Qu.: 10618596  
##  Max.   :616552722   Max.   :637533976   Max.   :658557734  
##      X2011          
##  Min.   :     5233  
##  1st Qu.:   698009  
##  Median :  3664664  
##  Mean   : 18276297  
##  3rd Qu.: 10731193  
##  Max.   :678796403

4. Importing data from databases (Part 1)

Many companies store their information in relational databases. The R community has also developed R packages to get data from these architectures. You’ll learn how to connect to a database and how to retrieve data from it.

4.1 Establish a connection

The first step to import data from a SQL database is creating a connection to it. As Filip explained, you need different packages depending on the database you want to connect to. All of these packages do this in a uniform way, as specified in the DBI package.

dbConnect() creates a connection between your R session and a SQL database. The first argument has to be a DBIdriver object, that specifies how connections are made and how data is mapped between R and the database. Specifically for MySQL databases, you can build such a driver with RMySQL::MySQL().

If the MySQL database is a remote database hosted on a server, you’ll also have to specify the following arguments in dbConnect(): dbname, host, port, user and password. Most of these details have already been provided.

# Load the RMySQL package
library(DBI)
library(RMySQL)
# Edit dbConnect() call
con <- dbConnect(MySQL(), 
                 dbname = "tweater", 
                 host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com", 
                 port = 3306,
                 user = "student",
                 password = "datacamp")

4.2 List the database tables

After you’ve successfully connected to a remote MySQL database, the next step is to see what tables the database contains. You can do this with the dbListTables() function. As you might remember from the video, this function requires the connection object as an input, and outputs a character vector with the table names.

# Connect to the MySQL database: con
con <- dbConnect(RMySQL::MySQL(), 
                 dbname = "tweater", 
                 host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com", 
                 port = 3306,
                 user = "student",
                 password = "datacamp")

# Build a vector of table names: tables
tables <- dbListTables(con)

# Display structure of tables
str(tables)
##  chr [1:3] "comments" "tweats" "users"

4.3 Import users

As you might have guessed by now, the database contains data on a more tasty version of Twitter, namely Tweater. Users can post tweats with short recipes for delicious snacks. People can comment on these tweats. There are three tables: users, tweats, and comments that have relations among them. Which ones, you ask? You’ll discover in a moment!

Let’s start by importing the data on the users into your R session. You do this with the dbReadTable() function. Simply pass it the connection object (con), followed by the name of the table you want to import. The resulting object is a standard R data frame.

# Connect to the MySQL database: con
con <- dbConnect(RMySQL::MySQL(), 
                 dbname = "tweater", 
                 host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com", 
                 port = 3306,
                 user = "student",
                 password = "datacamp")

# Import the users table from tweater: users
users<- dbReadTable(con, "users")

# Print users
users
##   id      name     login
## 1  1 elisabeth  elismith
## 2  2      mike     mikey
## 3  3      thea   teatime
## 4  4    thomas tomatotom
## 5  5    oliver olivander
## 6  6      kate  katebenn
## 7  7    anjali    lianja

4.4 Import all tables

Next to the users, we’re also interested in the tweats and comments tables. However, separate dbReadTable() calls for each and every one of the tables in your database would mean a lot of code duplication. Remember about the lapply() function? You can use it again here! A connection is already coded for you, as well as a vector table_names, containing the names of all the tables in the database.

# Get table names
table_names <- dbListTables(con)

# Import all tables
tables <- lapply(table_names, dbReadTable, conn = con)

# Print out tables
tables
## [[1]]
##      id tweat_id user_id            message
## 1  1022       87       7              nice!
## 2  1000       77       7             great!
## 3  1011       49       5            love it
## 4  1012       87       1   awesome! thanks!
## 5  1010       88       6              yuck!
## 6  1026       77       4      not my thing!
## 7  1004       49       1  this is fabulous!
## 8  1030       75       6           so easy!
## 9  1025       88       2             oh yes
## 10 1007       49       3           serious?
## 11 1020       77       1 couldn't be better
## 12 1014       77       1       saved my day
## 
## [[2]]
##   id user_id
## 1 75       3
## 2 88       4
## 3 77       6
## 4 87       5
## 5 49       1
## 6 24       7
##                                                                  post
## 1                                       break egg. bake egg. eat egg.
## 2                           wash strawberries. add ice. blend. enjoy.
## 3                       2 slices of bread. add cheese. grill. heaven.
## 4               open and crush avocado. add shrimps. perfect starter.
## 5 nachos. add tomato sauce, minced meat and cheese. oven for 10 mins.
## 6                              just eat an apple. simply and healthy.
##         date
## 1 2015-09-05
## 2 2015-09-14
## 3 2015-09-21
## 4 2015-09-22
## 5 2015-09-22
## 6 2015-09-24
## 
## [[3]]
##   id      name     login
## 1  1 elisabeth  elismith
## 2  2      mike     mikey
## 3  3      thea   teatime
## 4  4    thomas tomatotom
## 5  5    oliver olivander
## 6  6      kate  katebenn
## 7  7    anjali    lianja

5. Importing data from databases (Part 2)

mporting an entire table from a database while you might only need a tiny bit of information seems like a lot of unncessary work. In this chapter, you’ll learn about SQL queries, which will help you make things more efficient by performing some computations on the database side.

5.1 Query tweater (1)

In your life as a data scientist, you’ll often be working with huge databases that contain tables with millions of rows. If you want to do some analyses on this data, it’s possible that you only need a fraction of this data. In this case, it’s a good idea to send SQL queries to your database, and only import the data you actually need into R.

dbGetQuery() is what you need. As usual, you first pass the connection object to it. The second argument is an SQL query in the form of a character string. This example selects the age variable from the people dataset where gender equals “male”:

dbGetQuery(con, “SELECT age FROM people WHERE gender = ‘male’”) A connection to the tweater database has already been coded for you.

con <- dbConnect(RMySQL::MySQL(),
                 dbname = "tweater",
                 host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
                 port = 3306,
                 user = "student",
                 password = "datacamp")

# Import tweat_id column of comments where user_id is 1: elisabeth
elisabeth <- dbGetQuery(con, "SELECT tweat_id FROM comments WHERE user_id = 1")
# Print elisabeth
elisabeth
##   tweat_id
## 1       87
## 2       49
## 3       77
## 4       77

5.2 Query tweater (2)

Apart from checking equality, you can also check for less than and greater than relationships, with < and >, just like in R. con, a connection to the tweater database, is again available.

# Import post column of tweats where date is higher than '2015-09-21': latest
latest <- dbGetQuery(con, "SELECT post FROM tweats WHERE date > '2015-09-21'")

# Print latest
latest
##                                                                  post
## 1               open and crush avocado. add shrimps. perfect starter.
## 2 nachos. add tomato sauce, minced meat and cheese. oven for 10 mins.
## 3                              just eat an apple. simply and healthy.

5.3 Query tweater (3)

Suppose that you have a people table, with a bunch of information. This time, you want to find out the age and country of married males. Provided that there is a married column that’s 1 when the person in question is married, the following query would work.

SELECT age, country FROM people WHERE gender = “male” AND married = 1

# Create data frame specific
specific <- dbGetQuery(con, "SELECT message FROM comments WHERE tweat_id = 77 AND user_id > 4")

# Print specific
specific
##   message
## 1  great!

5.4 Query tweater (4)

There are also dedicated SQL functions that you can use in the WHERE clause of an SQL query. For example, CHAR_LENGTH() returns the number of characters in a string.

# Create data frame short
short <- dbGetQuery(con, "SELECT id,  name FROM users WHERE CHAR_LENGTH(name) <5")

# Print short
short
##   id name
## 1  2 mike
## 2  3 thea
## 3  6 kate

5.5 DBI internals

You’ve used dbGetQuery() multiple times now. This is a virtual function from the DBI package, but is actually implemented by the RMySQL package. Behind the scenes, the following steps are performed:

  • Sending the specified query with dbSendQuery();
  • Fetching the result of executing the query on the database with dbFetch();
  • Clearing the result with dbClearResult().

Let’s not use dbGetQuery() this time and implement the steps above. This is tedious to write, but it gives you the ability to fetch the query’s result in chunks rather than all at once. You can do this by specifying the n argument inside dbFetch().

con <- dbConnect(RMySQL::MySQL(),
                 dbname = "tweater",
                 host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
                 port = 3306,
                 user = "student",
                 password = "datacamp")

# Send query to the database
res <- dbSendQuery(con, "SELECT * FROM comments WHERE user_id > 4")

# Use dbFetch() twice
dbFetch(res, n = 2)
##     id tweat_id user_id message
## 1 1022       87       7   nice!
## 2 1000       77       7  great!
# Clear res
dbClearResult(res)
## [1] TRUE

5.6 Be polite and …

Every time you connect to a database using dbConnect(), you’re creating a new connection to the database you’re referencing. RMySQL automatically specifies a maximum of open connections and closes some of the connections for you, but still: it’s always polite to manually disconnect from the database afterwards. You do this with the dbDisconnect() function.

# Connect to the database
library(DBI)
con <- dbConnect(RMySQL::MySQL(),
                 dbname = "tweater",
                 host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
                 port = 3306,
                 user = "student",
                 password = "datacamp")

# Create the data frame  long_tweats
long_tweats <- dbGetQuery(con, "SELECT post, date FROM tweats WHERE CHAR_LENGTH(post) > 40")

# Print long_tweats
long_tweats
##                                                                  post
## 1                           wash strawberries. add ice. blend. enjoy.
## 2                       2 slices of bread. add cheese. grill. heaven.
## 3               open and crush avocado. add shrimps. perfect starter.
## 4 nachos. add tomato sauce, minced meat and cheese. oven for 10 mins.
##         date
## 1 2015-09-14
## 2 2015-09-21
## 3 2015-09-22
## 4 2015-09-22
# Disconnect from the database
dbDisconnect(con)
## [1] TRUE

6. Importing data from the web (Part 1)

More and more of the information that data scientists are using resides on the web. Importing this data into R requires an understanding of the protocols used on the web. In this chapter, you’ll get a crash course in HTTP and learn to perform your own HTTP requests from inside R.

6.1 Import flat files from the web

In the video, you saw that the utils functions to import flat file data, such as read.csv() and read.delim(), are capable of automatically importing from URLs that point to flat files on the web.

You must be wondering whether Hadley Wickham’s alternative package, readr, is equally potent. Well, figure it out in this exercise! The URLs for both a .csv file as well as a .delim file are already coded for you. It’s up to you to actually import the data. If it works, that is…

# Load the readr package
library(readr)

# Import the csv file: pools
url_csv <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/swimming_pools.csv"
pools <- read_csv(url_csv)
## Rows: 20 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Name, Address
## dbl (2): Latitude, Longitude
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Import the txt file: potatoes
url_delim <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/potatoes.txt"
potatoes <- read_tsv(url_delim)
## Rows: 160 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## dbl (8): area, temp, size, storage, method, texture, flavor, moistness
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Print pools and potatoes
pools
## # A tibble: 20 × 4
##    Name                                      Address          Latitude Longitude
##    <chr>                                     <chr>               <dbl>     <dbl>
##  1 Acacia Ridge Leisure Centre               1391 Beaudesert…    -27.6      153.
##  2 Bellbowrie Pool                           Sugarwood Stree…    -27.6      153.
##  3 Carole Park                               Cnr Boundary Ro…    -27.6      153.
##  4 Centenary Pool (inner City)               400 Gregory Ter…    -27.5      153.
##  5 Chermside Pool                            375 Hamilton Ro…    -27.4      153.
##  6 Colmslie Pool (Morningside)               400 Lytton Road…    -27.5      153.
##  7 Spring Hill Baths (inner City)            14 Torrington S…    -27.5      153.
##  8 Dunlop Park Pool (Corinda)                794 Oxley Road,…    -27.5      153.
##  9 Fortitude Valley Pool                     432 Wickham Str…    -27.5      153.
## 10 Hibiscus Sports Complex (upper MtGravatt) 90 Klumpp Road,…    -27.6      153.
## 11 Ithaca Pool ( Paddington)                 131 Caxton Stre…    -27.5      153.
## 12 Jindalee Pool                             11 Yallambee Ro…    -27.5      153.
## 13 Manly Pool                                1 Fairlead Cres…    -27.5      153.
## 14 Mt Gravatt East Aquatic Centre            Cnr wecker Road…    -27.5      153.
## 15 Musgrave Park Pool (South Brisbane)       100 Edmonstone …    -27.5      153.
## 16 Newmarket Pool                            71 Alderson Str…    -27.4      153.
## 17 Runcorn Pool                              37 Bonemill Roa…    -27.6      153.
## 18 Sandgate Pool                             231 Flinders Pa…    -27.3      153.
## 19 Langlands Parks Pool (Stones Corner)      5 Panitya Stree…    -27.5      153.
## 20 Yeronga Park Pool                         81 School Road,…    -27.5      153.
potatoes
## # A tibble: 160 × 8
##     area  temp  size storage method texture flavor moistness
##    <dbl> <dbl> <dbl>   <dbl>  <dbl>   <dbl>  <dbl>     <dbl>
##  1     1     1     1       1      1     2.9    3.2       3  
##  2     1     1     1       1      2     2.3    2.5       2.6
##  3     1     1     1       1      3     2.5    2.8       2.8
##  4     1     1     1       1      4     2.1    2.9       2.4
##  5     1     1     1       1      5     1.9    2.8       2.2
##  6     1     1     1       2      1     1.8    3         1.7
##  7     1     1     1       2      2     2.6    3.1       2.4
##  8     1     1     1       2      3     3      3         2.9
##  9     1     1     1       2      4     2.2    3.2       2.5
## 10     1     1     1       2      5     2      2.8       1.9
## # … with 150 more rows

6.2 Secure importing

In the previous exercises, you have been working with URLs that all start with http://. There is, however, a safer alternative to HTTP, namely HTTPS, which stands for HyperText Transfer Protocol Secure. Just remember this: HTTPS is relatively safe, HTTP is not.

# https URL to the swimming_pools csv file.
url_csv <- "https://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/swimming_pools.csv"

# Import the file using read.csv(): pools1
pools1 <- read.csv(url_csv)

# Load the readr package
library(readr)

# Import the file using read_csv(): pools2
pools2 <- read_csv(url_csv)
## Rows: 20 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Name, Address
## dbl (2): Latitude, Longitude
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Print the structure of pools1 and pools2
str(pools1)
## 'data.frame':    20 obs. of  4 variables:
##  $ Name     : Factor w/ 20 levels "Acacia Ridge Leisure Centre",..: 1 2 3 4 5 6 19 7 8 9 ...
##  $ Address  : Factor w/ 20 levels "1 Fairlead Crescent, Manly",..: 5 20 18 10 9 11 6 15 12 17 ...
##  $ Latitude : num  -27.6 -27.6 -27.6 -27.5 -27.4 ...
##  $ Longitude: num  153 153 153 153 153 ...
str(pools2)
## spec_tbl_df [20 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Name     : chr [1:20] "Acacia Ridge Leisure Centre" "Bellbowrie Pool" "Carole Park" "Centenary Pool (inner City)" ...
##  $ Address  : chr [1:20] "1391 Beaudesert Road, Acacia Ridge" "Sugarwood Street, Bellbowrie" "Cnr Boundary Road and Waterford Road Wacol" "400 Gregory Terrace, Spring Hill" ...
##  $ Latitude : num [1:20] -27.6 -27.6 -27.6 -27.5 -27.4 ...
##  $ Longitude: num [1:20] 153 153 153 153 153 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Name = col_character(),
##   ..   Address = col_character(),
##   ..   Latitude = col_double(),
##   ..   Longitude = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

6.3 mport Excel files from the web

When you learned about gdata, it was already mentioned that gdata can handle .xls files that are on the internet. readxl can’t, at least not yet. The URL with which you’ll be working is already available in the sample code. You will import it once using gdata and once with the readxl package via a workaround.

# Load the readxl and gdata package
library(readxl)
library(gdata)

# Specification of url: url_xls
url_xls <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/latitude.xls"

# Import the .xls file with gdata: excel_gdata
excel_gdata <- read.xls(url_xls)

# Download file behind URL, name it local_latitude.xls
download.file(url_xls, destfile = "/Users/arifmathiq/R Data/local_latitude.xls", quiet = TRUE, mode = "wb")
getwd()
## [1] "/Users/arifmathiq/R Data"
# Import the local .xls file with readxl: excel_readxl
excel_readxl <- read_excel("local_latitude.xls")

6.4 Downloading any file, secure or not

In the previous exercise you’ve seen how you can read excel files on the web using the read_excel package by first downloading the file with the download.file() function.

There’s more: with download.file() you can download any kind of file from the web, using HTTP and HTTPS: images, executable files, but also .RData files. An RData file is very efficient format to store R data.

You can load data from an RData file using the load() function, but this function does not accept a URL string as an argument. In this exercise, you’ll first download the RData file securely, and then import the local data file.

# https URL to the wine RData file.
url_rdata <- "https://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/wine.RData"

# Download the wine file to your working directory
download.file(url_rdata, destfile = "wine_local.RData")

# Load the wine data into your workspace using load()
load("wine_local.RData")

# Print out the summary of the wine data
summary(wine)
##     Alcohol        Malic acid        Ash        Alcalinity of ash
##  Min.   :11.03   Min.   :0.74   Min.   :1.360   Min.   :10.60    
##  1st Qu.:12.36   1st Qu.:1.60   1st Qu.:2.210   1st Qu.:17.20    
##  Median :13.05   Median :1.87   Median :2.360   Median :19.50    
##  Mean   :12.99   Mean   :2.34   Mean   :2.366   Mean   :19.52    
##  3rd Qu.:13.67   3rd Qu.:3.10   3rd Qu.:2.560   3rd Qu.:21.50    
##  Max.   :14.83   Max.   :5.80   Max.   :3.230   Max.   :30.00    
##    Magnesium      Total phenols     Flavanoids    Nonflavanoid phenols
##  Min.   : 70.00   Min.   :0.980   Min.   :0.340   Min.   :0.1300      
##  1st Qu.: 88.00   1st Qu.:1.740   1st Qu.:1.200   1st Qu.:0.2700      
##  Median : 98.00   Median :2.350   Median :2.130   Median :0.3400      
##  Mean   : 99.59   Mean   :2.292   Mean   :2.023   Mean   :0.3623      
##  3rd Qu.:107.00   3rd Qu.:2.800   3rd Qu.:2.860   3rd Qu.:0.4400      
##  Max.   :162.00   Max.   :3.880   Max.   :5.080   Max.   :0.6600      
##  Proanthocyanins Color intensity       Hue           Proline      
##  Min.   :0.410   Min.   : 1.280   Min.   :1.270   Min.   : 278.0  
##  1st Qu.:1.250   1st Qu.: 3.210   1st Qu.:1.930   1st Qu.: 500.0  
##  Median :1.550   Median : 4.680   Median :2.780   Median : 672.0  
##  Mean   :1.587   Mean   : 5.055   Mean   :2.604   Mean   : 745.1  
##  3rd Qu.:1.950   3rd Qu.: 6.200   3rd Qu.:3.170   3rd Qu.: 985.0  
##  Max.   :3.580   Max.   :13.000   Max.   :4.000   Max.   :1680.0

6.5 HTTP? httr! (1)

Downloading a file from the Internet means sending a GET request and receiving the file you asked for. Internally, all the previously discussed functions use a GET request to download files.

httr provides a convenient function, GET() to execute this GET request. The result is a response object, that provides easy access to the status code, content-type and, of course, the actual content.

You can extract the content from the request using the content() function. At the time of writing, there are three ways to retrieve this content: as a raw object, as a character vector, or an R object, such as a list. If you don’t tell content() how to retrieve the content through the as argument, it’ll try its best to figure out which type is most appropriate based on the content-type.

library(httr)
# Get the url, save response to resp
url <- "http://www.example.com/"
resp <- GET(url)
# Print resp
resp
## Response [http://www.example.com/]
##   Date: 2022-04-03 13:28
##   Status: 200
##   Content-Type: text/html; charset=UTF-8
##   Size: 1.26 kB
## <!doctype html>
## <html>
## <head>
##     <title>Example Domain</title>
## 
##     <meta charset="utf-8" />
##     <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
##     <meta name="viewport" content="width=device-width, initial-scale=1" />
##     <style type="text/css">
##     body {
## ...
# Get the raw content of resp: raw_content
raw_content <- content(resp, as = "raw")
# Print the head of raw_content
head(raw_content)
## [1] 3c 21 64 6f 63 74

Web content does not limit itself to HTML pages and files stored on remote servers such as DataCamp’s Amazon S3 instances. There are many other data formats out there. A very common one is JSON. This format is very often used by so-called Web APIs, interfaces to web servers with which you as a client can communicate to get or store information in more complicated ways.

# Get the url
url <- "http://www.omdbapi.com/?apikey=72bc447a&t=Annie+Hall&y=&plot=short&r=json"
resp <- GET(url)
# Print resp
resp
## Response [http://www.omdbapi.com/?apikey=72bc447a&t=Annie+Hall&y=&plot=short&r=json]
##   Date: 2022-04-03 13:28
##   Status: 200
##   Content-Type: application/json; charset=utf-8
##   Size: 1.02 kB
# Print content of resp as text
content(resp, as = "text")
## [1] "{\"Title\":\"Annie Hall\",\"Year\":\"1977\",\"Rated\":\"PG\",\"Released\":\"20 Apr 1977\",\"Runtime\":\"93 min\",\"Genre\":\"Comedy, Romance\",\"Director\":\"Woody Allen\",\"Writer\":\"Woody Allen, Marshall Brickman\",\"Actors\":\"Woody Allen, Diane Keaton, Tony Roberts\",\"Plot\":\"Alvy Singer, a divorced Jewish comedian, reflects on his relationship with ex-lover Annie Hall, an aspiring nightclub singer, which ended abruptly just like his previous marriages.\",\"Language\":\"English, German\",\"Country\":\"United States\",\"Awards\":\"Won 4 Oscars. 31 wins & 8 nominations total\",\"Poster\":\"https://m.media-amazon.com/images/M/MV5BZDg1OGQ4YzgtM2Y2NS00NjA3LWFjYTctMDRlMDI3NWE1OTUyXkEyXkFqcGdeQXVyMjUzOTY1NTc@._V1_SX300.jpg\",\"Ratings\":[{\"Source\":\"Internet Movie Database\",\"Value\":\"8.0/10\"},{\"Source\":\"Rotten Tomatoes\",\"Value\":\"96%\"},{\"Source\":\"Metacritic\",\"Value\":\"92/100\"}],\"Metascore\":\"92\",\"imdbRating\":\"8.0\",\"imdbVotes\":\"261,852\",\"imdbID\":\"tt0075686\",\"Type\":\"movie\",\"DVD\":\"05 Jul 2000\",\"BoxOffice\":\"$38,251,425\",\"Production\":\"N/A\",\"Website\":\"N/A\",\"Response\":\"True\"}"
# Print content of resp
content(resp)
## $Title
## [1] "Annie Hall"
## 
## $Year
## [1] "1977"
## 
## $Rated
## [1] "PG"
## 
## $Released
## [1] "20 Apr 1977"
## 
## $Runtime
## [1] "93 min"
## 
## $Genre
## [1] "Comedy, Romance"
## 
## $Director
## [1] "Woody Allen"
## 
## $Writer
## [1] "Woody Allen, Marshall Brickman"
## 
## $Actors
## [1] "Woody Allen, Diane Keaton, Tony Roberts"
## 
## $Plot
## [1] "Alvy Singer, a divorced Jewish comedian, reflects on his relationship with ex-lover Annie Hall, an aspiring nightclub singer, which ended abruptly just like his previous marriages."
## 
## $Language
## [1] "English, German"
## 
## $Country
## [1] "United States"
## 
## $Awards
## [1] "Won 4 Oscars. 31 wins & 8 nominations total"
## 
## $Poster
## [1] "https://m.media-amazon.com/images/M/MV5BZDg1OGQ4YzgtM2Y2NS00NjA3LWFjYTctMDRlMDI3NWE1OTUyXkEyXkFqcGdeQXVyMjUzOTY1NTc@._V1_SX300.jpg"
## 
## $Ratings
## $Ratings[[1]]
## $Ratings[[1]]$Source
## [1] "Internet Movie Database"
## 
## $Ratings[[1]]$Value
## [1] "8.0/10"
## 
## 
## $Ratings[[2]]
## $Ratings[[2]]$Source
## [1] "Rotten Tomatoes"
## 
## $Ratings[[2]]$Value
## [1] "96%"
## 
## 
## $Ratings[[3]]
## $Ratings[[3]]$Source
## [1] "Metacritic"
## 
## $Ratings[[3]]$Value
## [1] "92/100"
## 
## 
## 
## $Metascore
## [1] "92"
## 
## $imdbRating
## [1] "8.0"
## 
## $imdbVotes
## [1] "261,852"
## 
## $imdbID
## [1] "tt0075686"
## 
## $Type
## [1] "movie"
## 
## $DVD
## [1] "05 Jul 2000"
## 
## $BoxOffice
## [1] "$38,251,425"
## 
## $Production
## [1] "N/A"
## 
## $Website
## [1] "N/A"
## 
## $Response
## [1] "True"

7. Importing data from the web (Part 2)

Importing data from the web is one thing; actually being able to extract useful information is another. Learn more about the JSON format to get one step closer to web domination.

7.1 From JSON to R

In the simplest setting, fromJSON() can convert character strings that represent JSON data into a nicely structured R list. Give it a try!

# Load the jsonlite package
library(jsonlite)
## 
## Attaching package: 'jsonlite'
## The following object is masked from 'package:purrr':
## 
##     flatten
# wine_json is a JSON
wine_json <- '{"name":"Chateau Migraine", "year":1997, "alcohol_pct":12.4, "color":"red", "awarded":false}'

# Convert wine_json into a list: wine
wine <- fromJSON(wine_json)

# Print structure of wine
str(wine)
## List of 5
##  $ name       : chr "Chateau Migraine"
##  $ year       : int 1997
##  $ alcohol_pct: num 12.4
##  $ color      : chr "red"
##  $ awarded    : logi FALSE

7.2 Quandl API

As Filip showed in the video, fromJSON() also works if you pass a URL as a character string or the path to a local file that contains JSON data. Let’s try this out on the Quandl API, where you can fetch all sorts of financial and economical data.

# Definition of quandl_url
quandl_url <- "https://www.quandl.com/api/v3/datasets/WIKI/FB/data.json?auth_token=i83asDsiWUUyfoypkgMz"

# Import Quandl data: quandl_data
quandl_data <- fromJSON(quandl_url)

# Print structure of quandl_data
str(quandl_data)
## List of 1
##  $ dataset_data:List of 10
##   ..$ limit       : NULL
##   ..$ transform   : NULL
##   ..$ column_index: NULL
##   ..$ column_names: chr [1:13] "Date" "Open" "High" "Low" ...
##   ..$ start_date  : chr "2012-05-18"
##   ..$ end_date    : chr "2018-03-27"
##   ..$ frequency   : chr "daily"
##   ..$ data        : chr [1:1472, 1:13] "2018-03-27" "2018-03-26" "2018-03-23" "2018-03-22" ...
##   ..$ collapse    : NULL
##   ..$ order       : NULL

7.3 OMDb API

In the video, you saw how easy it is to interact with an API once you know how to formulate requests. You also saw how to fetch all information on Rain Man from OMDb. Simply perform a GET() call, and next ask for the contents with the content() function. This content() function, which is part of the httr package, uses jsonlite behind the scenes to import the JSON data into R.

However, by now you also know that jsonlite can handle URLs itself. Simply passing the request URL to fromJSON() will get your data into R. In this exercise, you will be using this technique to compare the release year of two movies in the Open Movie Database.

# Definition of the URLs
url_sw4 <- "http://www.omdbapi.com/?apikey=72bc447a&i=tt0076759&r=json"
url_sw3 <- "http://www.omdbapi.com/?apikey=72bc447a&i=tt0121766&r=json"

# Import two URLs with fromJSON(): sw4 and sw3
sw4 <- fromJSON(url_sw4)
sw3 <- fromJSON(url_sw3)

# Print out the Title element of both lists
sw4
## $Title
## [1] "Star Wars"
## 
## $Year
## [1] "1977"
## 
## $Rated
## [1] "PG"
## 
## $Released
## [1] "25 May 1977"
## 
## $Runtime
## [1] "121 min"
## 
## $Genre
## [1] "Action, Adventure, Fantasy"
## 
## $Director
## [1] "George Lucas"
## 
## $Writer
## [1] "George Lucas"
## 
## $Actors
## [1] "Mark Hamill, Harrison Ford, Carrie Fisher"
## 
## $Plot
## [1] "Luke Skywalker joins forces with a Jedi Knight, a cocky pilot, a Wookiee and two droids to save the galaxy from the Empire's world-destroying battle station, while also attempting to rescue Princess Leia from the mysterious Darth Vad"
## 
## $Language
## [1] "English"
## 
## $Country
## [1] "United States"
## 
## $Awards
## [1] "Won 7 Oscars. 63 wins & 29 nominations total"
## 
## $Poster
## [1] "https://m.media-amazon.com/images/M/MV5BNzVlY2MwMjktM2E4OS00Y2Y3LWE3ZjctYzhkZGM3YzA1ZWM2XkEyXkFqcGdeQXVyNzkwMjQ5NzM@._V1_SX300.jpg"
## 
## $Ratings
##                    Source  Value
## 1 Internet Movie Database 8.6/10
## 2         Rotten Tomatoes    92%
## 3              Metacritic 90/100
## 
## $Metascore
## [1] "90"
## 
## $imdbRating
## [1] "8.6"
## 
## $imdbVotes
## [1] "1,312,386"
## 
## $imdbID
## [1] "tt0076759"
## 
## $Type
## [1] "movie"
## 
## $DVD
## [1] "06 Dec 2005"
## 
## $BoxOffice
## [1] "$460,998,507"
## 
## $Production
## [1] "N/A"
## 
## $Website
## [1] "N/A"
## 
## $Response
## [1] "True"
sw3
## $Title
## [1] "Star Wars: Episode III - Revenge of the Sith"
## 
## $Year
## [1] "2005"
## 
## $Rated
## [1] "PG-13"
## 
## $Released
## [1] "19 May 2005"
## 
## $Runtime
## [1] "140 min"
## 
## $Genre
## [1] "Action, Adventure, Fantasy"
## 
## $Director
## [1] "George Lucas"
## 
## $Writer
## [1] "George Lucas, John Ostrander, Jan Duursema"
## 
## $Actors
## [1] "Hayden Christensen, Natalie Portman, Ewan McGregor"
## 
## $Plot
## [1] "Three years into the Clone Wars, the Jedi rescue Palpatine from Count Dooku. As Obi-Wan pursues a new threat, Anakin acts as a double agent between the Jedi Council and Palpatine and is lured into a sinister plan to rule the galaxy."
## 
## $Language
## [1] "English"
## 
## $Country
## [1] "United States"
## 
## $Awards
## [1] "Nominated for 1 Oscar. 26 wins & 63 nominations total"
## 
## $Poster
## [1] "https://m.media-amazon.com/images/M/MV5BNTc4MTc3NTQ5OF5BMl5BanBnXkFtZTcwOTg0NjI4NA@@._V1_SX300.jpg"
## 
## $Ratings
##                    Source  Value
## 1 Internet Movie Database 7.6/10
## 2         Rotten Tomatoes    80%
## 3              Metacritic 68/100
## 
## $Metascore
## [1] "68"
## 
## $imdbRating
## [1] "7.6"
## 
## $imdbVotes
## [1] "758,447"
## 
## $imdbID
## [1] "tt0121766"
## 
## $Type
## [1] "movie"
## 
## $DVD
## [1] "01 Nov 2005"
## 
## $BoxOffice
## [1] "$380,270,577"
## 
## $Production
## [1] "N/A"
## 
## $Website
## [1] "N/A"
## 
## $Response
## [1] "True"
# Is the release year of sw4 later than sw3?
sw4$Year > sw3$Year
## [1] FALSE

7.4 JSON practice (1)

JSON is built on two structures: objects and arrays. To help you experiment with these, two JSON strings are included in the sample code. It’s up to you to change them appropriately and then call jsonlite’s fromJSON() function on them each time.

# Challenge 1
json1 <- '[1, 2, 3, 4, 5, 6]'
fromJSON(json1)
## [1] 1 2 3 4 5 6
# Challenge 2
json2 <- '{"a": [1, 2, 3], "b": [4, 5, 6]}'
fromJSON(json2)
## $a
## [1] 1 2 3
## 
## $b
## [1] 4 5 6

7.5 JSON practice (2)

We prepared two more JSON strings in the sample code. Can you change them and call jsonlite’s fromJSON() function on them, similar to the previous exercise?

# Challenge 1
json1 <- '[[1, 2], [3, 4], [5, 6]]'

# Challenge 2
json2 <- '[{"a": 1, "b": 2}, {"a": 3, "b": 4}]'

7.6 toJSON()

Apart from converting JSON to R with fromJSON(), you can also use toJSON() to convert R data to a JSON format. In its most basic use, you simply pass this function an R object to convert to a JSON. The result is an R object of the class json, which is basically a character string representing that JSON.

For this exercise, you will be working with a .csv file containing information on the amount of desalinated water that is produced around the world. As you’ll see, it contains a lot of missing values. This data can be found on the URL that is specified in the sample code.

# URL pointing to the .csv file
url_csv <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/water.csv"

# Import the .csv file located at url_csv
water <- read.csv(url_csv, stringsAsFactors = FALSE)

# Convert the data file according to the requirements
water_json <- toJSON(water)

# Print out water_json
water_json
## [{"water":"Algeria","X1992":0.064,"X2002":0.017},{"water":"American Samoa"},{"water":"Angola","X1992":0.0001,"X2002":0.0001},{"water":"Antigua and Barbuda","X1992":0.0033},{"water":"Argentina","X1992":0.0007,"X1997":0.0007,"X2002":0.0007},{"water":"Australia","X1992":0.0298,"X2002":0.0298},{"water":"Austria","X1992":0.0022,"X2002":0.0022},{"water":"Bahamas","X1992":0.0013,"X2002":0.0074},{"water":"Bahrain","X1992":0.0441,"X2002":0.0441,"X2007":0.1024},{"water":"Barbados","X2007":0.0146},{"water":"British Virgin Islands","X2007":0.0042},{"water":"Canada","X1992":0.0027,"X2002":0.0027},{"water":"Cape Verde","X1992":0.002,"X1997":0.0017},{"water":"Cayman Islands","X1992":0.0033},{"water":"Central African Rep."},{"water":"Chile","X1992":0.0048,"X2002":0.0048},{"water":"Colombia","X1992":0.0027,"X2002":0.0027},{"water":"Cuba","X1992":0.0069,"X1997":0.0069,"X2002":0.0069},{"water":"Cyprus","X1992":0.003,"X1997":0.003,"X2002":0.0335},{"water":"Czech Rep.","X1992":0.0002,"X2002":0.0002},{"water":"Denmark","X1992":0.015,"X2002":0.015},{"water":"Djibouti","X1992":0.0001,"X2002":0.0001},{"water":"Ecuador","X1992":0.0022,"X1997":0.0022,"X2002":0.0022},{"water":"Egypt","X1992":0.025,"X1997":0.025,"X2002":0.1},{"water":"El Salvador","X1992":0.0001,"X2002":0.0001},{"water":"Finland","X1992":0.0001,"X2002":0.0001},{"water":"France","X1992":0.0117,"X2002":0.0117},{"water":"Gibraltar","X1992":0.0077},{"water":"Greece","X1992":0.01,"X2002":0.01},{"water":"Honduras","X1992":0.0002,"X2002":0.0002},{"water":"Hungary","X1992":0.0002,"X2002":0.0002},{"water":"India","X1997":0.0005,"X2002":0.0005},{"water":"Indonesia","X1992":0.0187,"X2002":0.0187},{"water":"Iran","X1992":0.003,"X1997":0.003,"X2002":0.003,"X2007":0.2},{"water":"Iraq","X1997":0.0074,"X2002":0.0074},{"water":"Ireland","X1992":0.0002,"X2002":0.0002},{"water":"Israel","X1992":0.0256,"X2002":0.0256,"X2007":0.14},{"water":"Italy","X1992":0.0973,"X2002":0.0973},{"water":"Jamaica","X1992":0.0005,"X1997":0.0005,"X2002":0.0005},{"water":"Japan","X1997":0.04,"X2002":0.04},{"water":"Jordan","X1997":0.002,"X2007":0.0098},{"water":"Kazakhstan","X1997":1.328,"X2002":1.328},{"water":"Kuwait","X1992":0.507,"X1997":0.231,"X2002":0.4202},{"water":"Lebanon","X2007":0.0473},{"water":"Libya","X2002":0.018},{"water":"Malaysia","X1992":0.0043,"X2002":0.0043},{"water":"Maldives","X1992":0.0004},{"water":"Malta","X1992":0.024,"X1997":0.031,"X2002":0.031},{"water":"Marshall Islands","X1992":0.0007},{"water":"Mauritania","X1992":0.002,"X2002":0.002},{"water":"Mexico","X1992":0.0307,"X2002":0.0307},{"water":"Morocco","X1992":0.0034,"X1997":0.0034,"X2002":0.007},{"water":"Namibia","X1992":0.0003,"X2002":0.0003},{"water":"Netherlands Antilles","X1992":0.063},{"water":"Nicaragua","X1992":0.0002,"X2002":0.0002},{"water":"Nigeria","X1992":0.003,"X2002":0.003},{"water":"Norway","X1992":0.0001,"X2002":0.0001},{"water":"Oman","X1997":0.034,"X2002":0.034,"X2007":0.109},{"water":"Peru","X1992":0.0054,"X2002":0.0054},{"water":"Poland","X1992":0.007,"X2002":0.007},{"water":"Portugal","X1992":0.0016,"X2002":0.0016},{"water":"Qatar","X1992":0.065,"X1997":0.099,"X2002":0.099,"X2007":0.18},{"water":"Saudi Arabia","X1992":0.683,"X1997":0.727,"X2002":0.863,"X2007":1.033},{"water":"Senegal","X1992":0,"X2002":0},{"water":"Somalia","X1992":0.0001,"X2002":0.0001},{"water":"South Africa","X1992":0.018,"X2002":0.018},{"water":"Spain","X1992":0.1002,"X2002":0.1002},{"water":"Sudan","X1992":0.0004,"X1997":0.0004,"X2002":0.0004},{"water":"Sweden","X1992":0.0002,"X2002":0.0002},{"water":"Trinidad and Tobago","X2007":0.036},{"water":"Tunisia","X1992":0.008,"X2002":0.013},{"water":"Turkey","X1992":0.0005,"X2002":0.0005,"X2007":0.0005},{"water":"United Arab Emirates","X1992":0.163,"X1997":0.385,"X2007":0.95},{"water":"United Kingdom","X1992":0.0333,"X2002":0.0333},{"water":"United States","X1992":0.58,"X2002":0.58},{"water":"Venezuela","X1992":0.0052,"X2002":0.0052},{"water":"Yemen, Rep.","X1992":0.01,"X2002":0.01}]

8. Importing data from statistical software packages

Next to R, there are also other commonly used statistical software packages: SAS, STATA and SPSS. Each of them has their own file format. Learn how to use the haven and foreign packages to get them into R with remarkable ease!

8.1 Import SAS data with haven

haven is an extremely easy-to-use package to import data from three software packages: SAS, STATA and SPSS. Depending on the software, you use different functions:

  • SAS: read_sas()
  • STATA: read_dta() (or read_stata(), which are identical)
  • SPSS: read_sav() or read_por(), depending on the file type. All these functions take one key argument: the path to your local file. In fact, you can even pass a URL; haven will then automatically download the file for you before importing it.
# Load the haven package
library(haven)

# Import sales.sas7bdat: sales
sales <- read_sas("sales.sas7bdat")

# Display the structure of sales
str(sales)
## tibble [431 × 4] (S3: tbl_df/tbl/data.frame)
##  $ purchase: num [1:431] 0 0 1 1 0 0 0 0 0 0 ...
##  $ age     : num [1:431] 41 47 41 39 32 32 33 45 43 40 ...
##  $ gender  : chr [1:431] "Female" "Female" "Female" "Female" ...
##  $ income  : chr [1:431] "Low" "Low" "Low" "Low" ...

You’ll be working with data on the age, gender, income, and purchase level (0 = low, 1 = high) of 36 individuals (Source: SAS). The information is stored in a SAS file, sales.sas7bdat, which is available in your current working directory. You can also download the data here.

8.2 Import STATA data with haven

Next up are STATA data files; you can use read_dta() for these.

When inspecting the result of the read_dta() call, you will notice that one column will be imported as a labelled vector, an R equivalent for the common data structure in other statistical environments. In order to effectively continue working on the data in R, it’s best to change this data into a standard R class. To convert a variable of the class labelled to a factor, you’ll need haven’s as_factor() function.

In this exercise, you will work with data on yearly import and export numbers of sugar, both in USD and in weight. The data can be found at: http://assets.datacamp.com/production/course_1478/datasets/trade.dta

# Import the data from the URL: sugar
sugar <- read_dta("http://assets.datacamp.com/production/course_1478/datasets/trade.dta")

# Structure of sugar
str(sugar)
## tibble [10 × 5] (S3: tbl_df/tbl/data.frame)
##  $ Date    : dbl+lbl [1:10] 10,  9,  8,  7,  6,  5,  4,  3,  2,  1
##    ..@ label       : chr "Date"
##    ..@ format.stata: chr "%9.0g"
##    ..@ labels      : Named num [1:10] 1 2 3 4 5 6 7 8 9 10
##    .. ..- attr(*, "names")= chr [1:10] "2004-12-31" "2005-12-31" "2006-12-31" "2007-12-31" ...
##  $ Import  : num [1:10] 37664782 16316512 11082246 35677943 9879878 ...
##   ..- attr(*, "label")= chr "Import"
##   ..- attr(*, "format.stata")= chr "%9.0g"
##  $ Weight_I: num [1:10] 54029106 21584365 14526089 55034932 14806865 ...
##   ..- attr(*, "label")= chr "Weight_I"
##   ..- attr(*, "format.stata")= chr "%9.0g"
##  $ Export  : num [1:10] 5.45e+07 1.03e+08 3.79e+07 4.85e+07 7.15e+07 ...
##   ..- attr(*, "label")= chr "Export"
##   ..- attr(*, "format.stata")= chr "%9.0g"
##  $ Weight_E: num [1:10] 9.34e+07 1.58e+08 8.80e+07 1.12e+08 1.32e+08 ...
##   ..- attr(*, "label")= chr "Weight_E"
##   ..- attr(*, "format.stata")= chr "%9.0g"
##  - attr(*, "label")= chr "Written by R."
# Convert values in Date column to dates
sugar$Date <- as.Date(as_factor(sugar$Date))

# Structure of sugar again
str(sugar)
## tibble [10 × 5] (S3: tbl_df/tbl/data.frame)
##  $ Date    : Date[1:10], format: "2013-12-31" "2012-12-31" ...
##  $ Import  : num [1:10] 37664782 16316512 11082246 35677943 9879878 ...
##   ..- attr(*, "label")= chr "Import"
##   ..- attr(*, "format.stata")= chr "%9.0g"
##  $ Weight_I: num [1:10] 54029106 21584365 14526089 55034932 14806865 ...
##   ..- attr(*, "label")= chr "Weight_I"
##   ..- attr(*, "format.stata")= chr "%9.0g"
##  $ Export  : num [1:10] 5.45e+07 1.03e+08 3.79e+07 4.85e+07 7.15e+07 ...
##   ..- attr(*, "label")= chr "Export"
##   ..- attr(*, "format.stata")= chr "%9.0g"
##  $ Weight_E: num [1:10] 9.34e+07 1.58e+08 8.80e+07 1.12e+08 1.32e+08 ...
##   ..- attr(*, "label")= chr "Weight_E"
##   ..- attr(*, "format.stata")= chr "%9.0g"
##  - attr(*, "label")= chr "Written by R."

8.3 Import SPSS data with haven

The haven package can also import data files from SPSS. Again, importing the data is pretty straightforward. Depending on the SPSS data file you’re working with, you’ll need either read_sav() - for .sav files - or read_por() - for .por files.

In this exercise, you will work with data on four of the Big Five personality traits for 434 persons (Source: University of Bath). The Big Five is a psychological concept including, originally, five dimensions of personality to classify human personality. The SPSS dataset is called person.sav and is available in your working directory.

# Import person.sav: traits
traits <- read_sav("person.sav")

# Summarize traits
summary(traits)
##     Neurotic      Extroversion   Agreeableness   Conscientiousness
##  Min.   : 0.00   Min.   : 5.00   Min.   :15.00   Min.   : 7.00    
##  1st Qu.:18.00   1st Qu.:26.00   1st Qu.:39.00   1st Qu.:25.00    
##  Median :24.00   Median :31.00   Median :45.00   Median :30.00    
##  Mean   :23.63   Mean   :30.23   Mean   :44.55   Mean   :30.85    
##  3rd Qu.:29.00   3rd Qu.:34.00   3rd Qu.:50.00   3rd Qu.:36.00    
##  Max.   :44.00   Max.   :65.00   Max.   :73.00   Max.   :58.00    
##  NA's   :14      NA's   :16      NA's   :19      NA's   :14
# Print out a subset
subset(traits, Extroversion > 40 & Agreeableness > 40)
## # A tibble: 8 × 4
##   Neurotic Extroversion Agreeableness Conscientiousness
##      <dbl>        <dbl>         <dbl>             <dbl>
## 1       38           43            49                29
## 2       20           42            46                31
## 3       18           42            49                31
## 4       42           43            44                29
## 5       30           42            51                24
## 6       18           42            50                25
## 7       27           45            55                23
## 8       18           43            57                34

8.4 Factorize, round two

In the last exercise you learned how to import a data file using the command read_sav(). With SPSS data files, it can also happen that some of the variables you import have the labelled class. This is done to keep all the labelling information that was originally present in the .sav and .por files. It’s advised to coerce (or change) these variables to factors or other standard R classes.

The data for this exercise involves information on employees and their demographic and economic attributes (Source: QRiE). The data can be found on the following URL:

http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/employee.sav

# Import SPSS data from the URL: work
work <- read_sav("http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/employee.sav")

# Display summary of work$GENDER
summary(work$GENDER)
##    Length     Class      Mode 
##       474 character character
# Convert work$GENDER to a factor
work$GENDER <- as_factor(work$GENDER)

# Display summary of work$GENDER again
summary(work$GENDER)
## Female   Male 
##    216    258

8.5 Import STATA data with foreign (1)

The foreign package offers a simple function to import and read STATA data: read.dta().

In this exercise, you will import data on the US presidential elections in the year 2000. The data in florida.dta contains the total numbers of votes for each of the four candidates as well as the total number of votes per election area in the state of Florida (Source: Florida Department of State). The file is available in your working directory, you can download it here if you want to experiment some more.

# Load the foreign package
library(foreign)
# Import florida.dta and name the resulting data frame florida
florida <- read.dta("florida.dta")

# Check tail() of florida
tail(florida)
##     gore  bush buchanan nader  total
## 62  2647  4051       27    59   6784
## 63  1399  2326       26    29   3780
## 64 97063 82214      396  2436 182109
## 65  3835  4511       46   149   8541
## 66  5637 12176      120   265  18198
## 67  2796  4983       88    93   7960

8.6 Import STATA data with foreign (2)

Data can be very diverse, going from character vectors to categorical variables, dates and more. It’s in these cases that the additional arguments of read.dta() will come in handy.

The arguments you will use most often are convert.dates, convert.factors, missing.type and convert.underscore. Their meaning is pretty straightforward, as Filip explained in the video. It’s all about correctly converting STATA data to standard R data structures. Type ?read.dta to find out about about the default values.

The dataset for this exercise contains socio-economic measures and access to education for different individuals (Source: World Bank). This data is available as edequality.dta, which is located in the worldbank folder in your working directory.

# Create and print structure of edu_equal_1
library(readstata13)
edu_equal_1 <- read.dta("edequality.dta")
str(edu_equal_1)
## 'data.frame':    12214 obs. of  27 variables:
##  $ hhid              : num  1 1 1 2 2 3 4 4 5 6 ...
##  $ hhweight          : num  627 627 627 627 627 ...
##  $ location          : Factor w/ 2 levels "urban location",..: 1 1 1 1 1 2 2 2 1 1 ...
##  $ region            : Factor w/ 9 levels "Sofia city","Bourgass",..: 8 8 8 9 9 4 4 4 8 8 ...
##  $ ethnicity_head    : Factor w/ 4 levels "Bulgaria","Turks",..: 2 2 2 1 1 1 1 1 1 1 ...
##  $ age               : num  37 11 8 73 70 75 79 80 82 83 ...
##  $ gender            : Factor w/ 2 levels "male","female": 2 2 1 1 2 1 1 2 2 2 ...
##  $ relation          : Factor w/ 9 levels "head                      ",..: 1 3 3 1 2 1 1 2 1 1 ...
##  $ literate          : Factor w/ 2 levels "no","yes": 1 2 2 2 2 2 2 2 2 2 ...
##  $ income_mnt        : num  13.3 13.3 13.3 142.5 142.5 ...
##  $ income            : num  160 160 160 1710 1710 ...
##  $ aggregate         : num  1042 1042 1042 3271 3271 ...
##  $ aggr_ind_annual   : num  347 347 347 1635 1635 ...
##  $ educ_completed    : int  2 4 4 4 3 3 3 3 4 4 ...
##  $ grade_complete    : num  4 3 0 3 4 4 4 4 5 5 ...
##  $ grade_all         : num  4 11 8 11 8 8 8 8 13 13 ...
##  $ unemployed        : int  2 1 1 1 1 1 1 1 1 1 ...
##  $ reason_OLF        : int  NA NA NA 3 3 3 9 9 3 3 ...
##  $ sector            : int  NA NA NA NA NA NA 1 1 NA NA ...
##  $ occupation        : int  NA NA NA NA NA NA 5 5 NA NA ...
##  $ earn_mont         : num  0 0 0 0 0 0 20 20 0 0 ...
##  $ earn_ann          : num  0 0 0 0 0 0 240 240 0 0 ...
##  $ hours_week        : num  NA NA NA NA NA NA 30 35 NA NA ...
##  $ hours_mnt         : num  NA NA NA NA NA ...
##  $ fulltime          : int  NA NA NA NA NA NA 1 1 NA NA ...
##  $ hhexp             : num  100 100 100 343 343 ...
##  $ legacy_pension_amt: num  NA NA NA NA NA NA NA NA NA NA ...
##  - attr(*, "datalabel")= chr ""
##  - attr(*, "time.stamp")= chr ""
##  - attr(*, "formats")= chr  "%9.0g" "%9.0g" "%9.0g" "%9.0g" ...
##  - attr(*, "types")= int  100 100 108 108 108 100 108 108 108 100 ...
##  - attr(*, "val.labels")= chr  "" "" "location" "region" ...
##  - attr(*, "var.labels")= chr  "hhid" "hhweight" "location" "region" ...
##  - attr(*, "expansion.fields")=List of 12
##   ..$ : chr  "_dta" "_svy_su1" "cluster"
##   ..$ : chr  "_dta" "_svy_strata1" "strata"
##   ..$ : chr  "_dta" "_svy_stages" "1"
##   ..$ : chr  "_dta" "_svy_version" "2"
##   ..$ : chr  "_dta" "__XijVarLabcons" "(sum) cons"
##   ..$ : chr  "_dta" "ReS_Xij" "cons"
##   ..$ : chr  "_dta" "ReS_str" "0"
##   ..$ : chr  "_dta" "ReS_j" "group"
##   ..$ : chr  "_dta" "ReS_ver" "v.2"
##   ..$ : chr  "_dta" "ReS_i" "hhid dur"
##   ..$ : chr  "_dta" "note1" "variables g1pc, g2pc, g3pc, g4pc, g5pc, g7pc, g8pc, g9pc, g10pc, g11pc, g12pc,  gall, health, rent, durables we"| __truncated__
##   ..$ : chr  "_dta" "note0" "1"
##  - attr(*, "version")= int 7
##  - attr(*, "label.table")=List of 12
##   ..$ location: Named int  1 2
##   .. ..- attr(*, "names")= chr  "urban location" "rural location"
##   ..$ region  : Named int  1 2 3 4 5 6 7 8 9
##   .. ..- attr(*, "names")= chr  "Sofia city" "Bourgass" "Varna" "Lovetch" ...
##   ..$ ethnic  : Named int  1 2 3 4
##   .. ..- attr(*, "names")= chr  "Bulgaria" "Turks" "Roma" "Other"
##   ..$ s2_q2   : Named int  1 2
##   .. ..- attr(*, "names")= chr  "male" "female"
##   ..$ s2_q3   : Named int  1 2 3 4 5 6 7 8 9
##   .. ..- attr(*, "names")= chr  "head                      " "spouse/partner            " "child                     " "son/daughter-in-law       " ...
##   ..$ lit     : Named int  1 2
##   .. ..- attr(*, "names")= chr  "no" "yes"
##   ..$         : Named int  1 2 3 4
##   .. ..- attr(*, "names")= chr  "never attanded" "primary" "secondary" "postsecondary"
##   ..$         : Named int  1 2
##   .. ..- attr(*, "names")= chr  "Not unemployed" "Unemployed"
##   ..$         : Named int  1 2 3 4 5 6 7 8 9 10
##   .. ..- attr(*, "names")= chr  "student" "housewife/childcare" "in retirement" "illness, disability" ...
##   ..$         : Named int  1 2 3 4 5 6 7 8 9 10
##   .. ..- attr(*, "names")= chr  "agriculture" "mining" "manufacturing" "utilities" ...
##   ..$         : Named int  1 2 3 4 5
##   .. ..- attr(*, "names")= chr  "private company" "public works program" "government,public sector, army" "private individual" ...
##   ..$         : Named int  1 2
##   .. ..- attr(*, "names")= chr  "no" "yes"
# Create and print structure of edu_equal_2
edu_equal_2 <- read.dta("edequality.dta", convert.factors = FALSE)
str(edu_equal_2)
## 'data.frame':    12214 obs. of  27 variables:
##  $ hhid              : num  1 1 1 2 2 3 4 4 5 6 ...
##  $ hhweight          : num  627 627 627 627 627 ...
##  $ location          : int  1 1 1 1 1 2 2 2 1 1 ...
##  $ region            : int  8 8 8 9 9 4 4 4 8 8 ...
##  $ ethnicity_head    : int  2 2 2 1 1 1 1 1 1 1 ...
##  $ age               : num  37 11 8 73 70 75 79 80 82 83 ...
##  $ gender            : int  2 2 1 1 2 1 1 2 2 2 ...
##  $ relation          : int  1 3 3 1 2 1 1 2 1 1 ...
##  $ literate          : int  1 2 2 2 2 2 2 2 2 2 ...
##  $ income_mnt        : num  13.3 13.3 13.3 142.5 142.5 ...
##  $ income            : num  160 160 160 1710 1710 ...
##  $ aggregate         : num  1042 1042 1042 3271 3271 ...
##  $ aggr_ind_annual   : num  347 347 347 1635 1635 ...
##  $ educ_completed    : int  2 4 4 4 3 3 3 3 4 4 ...
##  $ grade_complete    : num  4 3 0 3 4 4 4 4 5 5 ...
##  $ grade_all         : num  4 11 8 11 8 8 8 8 13 13 ...
##  $ unemployed        : int  2 1 1 1 1 1 1 1 1 1 ...
##  $ reason_OLF        : int  NA NA NA 3 3 3 9 9 3 3 ...
##  $ sector            : int  NA NA NA NA NA NA 1 1 NA NA ...
##  $ occupation        : int  NA NA NA NA NA NA 5 5 NA NA ...
##  $ earn_mont         : num  0 0 0 0 0 0 20 20 0 0 ...
##  $ earn_ann          : num  0 0 0 0 0 0 240 240 0 0 ...
##  $ hours_week        : num  NA NA NA NA NA NA 30 35 NA NA ...
##  $ hours_mnt         : num  NA NA NA NA NA ...
##  $ fulltime          : int  NA NA NA NA NA NA 1 1 NA NA ...
##  $ hhexp             : num  100 100 100 343 343 ...
##  $ legacy_pension_amt: num  NA NA NA NA NA NA NA NA NA NA ...
##  - attr(*, "datalabel")= chr ""
##  - attr(*, "time.stamp")= chr ""
##  - attr(*, "formats")= chr  "%9.0g" "%9.0g" "%9.0g" "%9.0g" ...
##  - attr(*, "types")= int  100 100 108 108 108 100 108 108 108 100 ...
##  - attr(*, "val.labels")= chr  "" "" "location" "region" ...
##  - attr(*, "var.labels")= chr  "hhid" "hhweight" "location" "region" ...
##  - attr(*, "expansion.fields")=List of 12
##   ..$ : chr  "_dta" "_svy_su1" "cluster"
##   ..$ : chr  "_dta" "_svy_strata1" "strata"
##   ..$ : chr  "_dta" "_svy_stages" "1"
##   ..$ : chr  "_dta" "_svy_version" "2"
##   ..$ : chr  "_dta" "__XijVarLabcons" "(sum) cons"
##   ..$ : chr  "_dta" "ReS_Xij" "cons"
##   ..$ : chr  "_dta" "ReS_str" "0"
##   ..$ : chr  "_dta" "ReS_j" "group"
##   ..$ : chr  "_dta" "ReS_ver" "v.2"
##   ..$ : chr  "_dta" "ReS_i" "hhid dur"
##   ..$ : chr  "_dta" "note1" "variables g1pc, g2pc, g3pc, g4pc, g5pc, g7pc, g8pc, g9pc, g10pc, g11pc, g12pc,  gall, health, rent, durables we"| __truncated__
##   ..$ : chr  "_dta" "note0" "1"
##  - attr(*, "version")= int 7
##  - attr(*, "label.table")=List of 12
##   ..$ location: Named int  1 2
##   .. ..- attr(*, "names")= chr  "urban location" "rural location"
##   ..$ region  : Named int  1 2 3 4 5 6 7 8 9
##   .. ..- attr(*, "names")= chr  "Sofia city" "Bourgass" "Varna" "Lovetch" ...
##   ..$ ethnic  : Named int  1 2 3 4
##   .. ..- attr(*, "names")= chr  "Bulgaria" "Turks" "Roma" "Other"
##   ..$ s2_q2   : Named int  1 2
##   .. ..- attr(*, "names")= chr  "male" "female"
##   ..$ s2_q3   : Named int  1 2 3 4 5 6 7 8 9
##   .. ..- attr(*, "names")= chr  "head                      " "spouse/partner            " "child                     " "son/daughter-in-law       " ...
##   ..$ lit     : Named int  1 2
##   .. ..- attr(*, "names")= chr  "no" "yes"
##   ..$         : Named int  1 2 3 4
##   .. ..- attr(*, "names")= chr  "never attanded" "primary" "secondary" "postsecondary"
##   ..$         : Named int  1 2
##   .. ..- attr(*, "names")= chr  "Not unemployed" "Unemployed"
##   ..$         : Named int  1 2 3 4 5 6 7 8 9 10
##   .. ..- attr(*, "names")= chr  "student" "housewife/childcare" "in retirement" "illness, disability" ...
##   ..$         : Named int  1 2 3 4 5 6 7 8 9 10
##   .. ..- attr(*, "names")= chr  "agriculture" "mining" "manufacturing" "utilities" ...
##   ..$         : Named int  1 2 3 4 5
##   .. ..- attr(*, "names")= chr  "private company" "public works program" "government,public sector, army" "private individual" ...
##   ..$         : Named int  1 2
##   .. ..- attr(*, "names")= chr  "no" "yes"
# Create and print structure of edu_equal_3
edu_equal_3 <- read.dta("edequality.dta", convert.underscore = TRUE)
str(edu_equal_3)
## 'data.frame':    12214 obs. of  27 variables:
##  $ hhid              : num  1 1 1 2 2 3 4 4 5 6 ...
##  $ hhweight          : num  627 627 627 627 627 ...
##  $ location          : Factor w/ 2 levels "urban location",..: 1 1 1 1 1 2 2 2 1 1 ...
##  $ region            : Factor w/ 9 levels "Sofia city","Bourgass",..: 8 8 8 9 9 4 4 4 8 8 ...
##  $ ethnicity.head    : Factor w/ 4 levels "Bulgaria","Turks",..: 2 2 2 1 1 1 1 1 1 1 ...
##  $ age               : num  37 11 8 73 70 75 79 80 82 83 ...
##  $ gender            : Factor w/ 2 levels "male","female": 2 2 1 1 2 1 1 2 2 2 ...
##  $ relation          : Factor w/ 9 levels "head                      ",..: 1 3 3 1 2 1 1 2 1 1 ...
##  $ literate          : Factor w/ 2 levels "no","yes": 1 2 2 2 2 2 2 2 2 2 ...
##  $ income.mnt        : num  13.3 13.3 13.3 142.5 142.5 ...
##  $ income            : num  160 160 160 1710 1710 ...
##  $ aggregate         : num  1042 1042 1042 3271 3271 ...
##  $ aggr.ind.annual   : num  347 347 347 1635 1635 ...
##  $ educ.completed    : int  2 4 4 4 3 3 3 3 4 4 ...
##  $ grade.complete    : num  4 3 0 3 4 4 4 4 5 5 ...
##  $ grade.all         : num  4 11 8 11 8 8 8 8 13 13 ...
##  $ unemployed        : int  2 1 1 1 1 1 1 1 1 1 ...
##  $ reason.OLF        : int  NA NA NA 3 3 3 9 9 3 3 ...
##  $ sector            : int  NA NA NA NA NA NA 1 1 NA NA ...
##  $ occupation        : int  NA NA NA NA NA NA 5 5 NA NA ...
##  $ earn.mont         : num  0 0 0 0 0 0 20 20 0 0 ...
##  $ earn.ann          : num  0 0 0 0 0 0 240 240 0 0 ...
##  $ hours.week        : num  NA NA NA NA NA NA 30 35 NA NA ...
##  $ hours.mnt         : num  NA NA NA NA NA ...
##  $ fulltime          : int  NA NA NA NA NA NA 1 1 NA NA ...
##  $ hhexp             : num  100 100 100 343 343 ...
##  $ legacy.pension.amt: num  NA NA NA NA NA NA NA NA NA NA ...
##  - attr(*, "datalabel")= chr ""
##  - attr(*, "time.stamp")= chr ""
##  - attr(*, "formats")= chr  "%9.0g" "%9.0g" "%9.0g" "%9.0g" ...
##  - attr(*, "types")= int  100 100 108 108 108 100 108 108 108 100 ...
##  - attr(*, "val.labels")= chr  "" "" "location" "region" ...
##  - attr(*, "var.labels")= chr  "hhid" "hhweight" "location" "region" ...
##  - attr(*, "expansion.fields")=List of 12
##   ..$ : chr  "_dta" "_svy_su1" "cluster"
##   ..$ : chr  "_dta" "_svy_strata1" "strata"
##   ..$ : chr  "_dta" "_svy_stages" "1"
##   ..$ : chr  "_dta" "_svy_version" "2"
##   ..$ : chr  "_dta" "__XijVarLabcons" "(sum) cons"
##   ..$ : chr  "_dta" "ReS_Xij" "cons"
##   ..$ : chr  "_dta" "ReS_str" "0"
##   ..$ : chr  "_dta" "ReS_j" "group"
##   ..$ : chr  "_dta" "ReS_ver" "v.2"
##   ..$ : chr  "_dta" "ReS_i" "hhid dur"
##   ..$ : chr  "_dta" "note1" "variables g1pc, g2pc, g3pc, g4pc, g5pc, g7pc, g8pc, g9pc, g10pc, g11pc, g12pc,  gall, health, rent, durables we"| __truncated__
##   ..$ : chr  "_dta" "note0" "1"
##  - attr(*, "version")= int 7
##  - attr(*, "label.table")=List of 12
##   ..$ location: Named int  1 2
##   .. ..- attr(*, "names")= chr  "urban location" "rural location"
##   ..$ region  : Named int  1 2 3 4 5 6 7 8 9
##   .. ..- attr(*, "names")= chr  "Sofia city" "Bourgass" "Varna" "Lovetch" ...
##   ..$ ethnic  : Named int  1 2 3 4
##   .. ..- attr(*, "names")= chr  "Bulgaria" "Turks" "Roma" "Other"
##   ..$ s2_q2   : Named int  1 2
##   .. ..- attr(*, "names")= chr  "male" "female"
##   ..$ s2_q3   : Named int  1 2 3 4 5 6 7 8 9
##   .. ..- attr(*, "names")= chr  "head                      " "spouse/partner            " "child                     " "son/daughter-in-law       " ...
##   ..$ lit     : Named int  1 2
##   .. ..- attr(*, "names")= chr  "no" "yes"
##   ..$         : Named int  1 2 3 4
##   .. ..- attr(*, "names")= chr  "never attanded" "primary" "secondary" "postsecondary"
##   ..$         : Named int  1 2
##   .. ..- attr(*, "names")= chr  "Not unemployed" "Unemployed"
##   ..$         : Named int  1 2 3 4 5 6 7 8 9 10
##   .. ..- attr(*, "names")= chr  "student" "housewife/childcare" "in retirement" "illness, disability" ...
##   ..$         : Named int  1 2 3 4 5 6 7 8 9 10
##   .. ..- attr(*, "names")= chr  "agriculture" "mining" "manufacturing" "utilities" ...
##   ..$         : Named int  1 2 3 4 5
##   .. ..- attr(*, "names")= chr  "private company" "public works program" "government,public sector, army" "private individual" ...
##   ..$         : Named int  1 2
##   .. ..- attr(*, "names")= chr  "no" "yes"

8.7 Import SPSS data with foreign (1)

All great things come in pairs. Where foreign provided read.dta() to read Stata data, there’s also read.spss() to read SPSS data files. To get a data frame, make sure to set to.data.frame = TRUE inside read.spss().

In this exercise, you’ll be working with socio-economic variables from different countries (Source: Quantative Data Analysis in Education). The SPSS data is in a file called international.sav, which is in your working directory. You can also download it here if you want to play around with it some more.

# Import international.sav as a data frame: demo
demo <- read.spss("international.sav", to.data.frame = TRUE)
## re-encoding from CP1252
# Create boxplot of gdp variable of demo
boxplot(demo$gdp)

8.8 Import SPSS data with foreign (2)

In the previous exercise, you used the to.data.frame argument inside read.spss(). There are many other ways in which to customize the way your SPSS data is imported.

In this exercise you will experiment with another argument, use.value.labels. It specifies whether variables with value labels should be converted into R factors with levels that are named accordingly. The argument is TRUE by default which means that so called labelled variables inside SPSS are converted to factors inside R.

# Import international.sav as demo_1
demo_1 <- read.spss("international.sav", to.data.frame = TRUE)
## re-encoding from CP1252
# Print out the head of demo_1
head(demo_1)
##   id              country  contint m_illit f_illit lifeexpt  gdp
## 1  1 Argentina            Americas     3.0     3.0       16 3375
## 2  2 Benin                  Africa    45.2    74.5        7  521
## 3  3 Burundi                Africa    33.2    48.1        5   86
## 4  4 Chile                Americas     4.2     4.4       14 4523
## 5  5 Dominican Republic   Americas    12.0    12.7       12 2408
## 6  6 El Salvador          Americas    17.6    22.9       11 2302
# Import international.sav as demo_2
demo_2 <- read.spss("international.sav", to.data.frame = TRUE, use.value.labels = FALSE)
## re-encoding from CP1252
# Print out the head of demo_2
head(demo_2)
##   id              country contint m_illit f_illit lifeexpt  gdp
## 1  1 Argentina                  2     3.0     3.0       16 3375
## 2  2 Benin                      1    45.2    74.5        7  521
## 3  3 Burundi                    1    33.2    48.1        5   86
## 4  4 Chile                      2     4.2     4.4       14 4523
## 5  5 Dominican Republic         2    12.0    12.7       12 2408
## 6  6 El Salvador                2    17.6    22.9       11 2302

The End.

Thanks DataCamp

- My Favorite Team - Cim boom