Libraries
library(tidyverse)
library(forcats)
library(lubridate)
library(ggthemes)
library(gdata)
library(DBI)
library(data.table)
Importing data into R should be the easiest step in your analysis. Unfortunately, that is almost never the case. Data can come in many formats, ranging from .csv and text files, to statistical software files, to databases and HTML data. Knowing which approach to use is key to getting started with the actual analysis. In this course, you’ll start by learning how to read .csv and text files in R. You will then cover the readr and data.table packages to easily and efficiently import flat file data. After that, you will learn how to read .xls files in R using readxl and gdata.
A lot of data comes in the form of flat files: simple tabular text files. Learn how to import the common formats of flat file data with base R functions.
The utils package, which is automatically loaded in your R session on startup, can import CSV files with the read.csv() function.
# Import swimming_pools.csv: pools
pools <- read.csv("swimming_pools.csv", stringsAsFactors = TRUE)
# Print the structure of pools
str(pools)
## 'data.frame': 20 obs. of 5 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Name : Factor w/ 20 levels "Acacia Ridge Leisure Centre",..: 1 2 3 4 5 6 19 7 8 9 ...
## $ Address : Factor w/ 20 levels "1 Fairlead Crescent, Manly",..: 5 20 18 10 9 11 6 15 12 17 ...
## $ Latitude : num -27.6 -27.6 -27.6 -27.5 -27.4 ...
## $ Longitude: num 153 153 153 153 153 ...
With stringsAsFactors, you can tell R whether it should convert strings in the flat file to factors.
For all importing functions in the utils package, this argument is TRUE, which means that you import strings as factors. This only makes sense if the strings you import represent categorical variables in R. If you set stringsAsFactors to FALSE, the data frame columns corresponding to strings in your text file will be character.
# Import swimming_pools.csv correctly: pools
pools<-read.csv("swimming_pools.csv", stringsAsFactors=FALSE)
# Check the structure of pools
str(pools)
## 'data.frame': 20 obs. of 5 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Name : chr "Acacia Ridge Leisure Centre" "Bellbowrie Pool" "Carole Park" "Centenary Pool (inner City)" ...
## $ Address : chr "1391 Beaudesert Road, Acacia Ridge" "Sugarwood Street, Bellbowrie" "Cnr Boundary Road and Waterford Road Wacol" "400 Gregory Terrace, Spring Hill" ...
## $ Latitude : num -27.6 -27.6 -27.6 -27.5 -27.4 ...
## $ Longitude: num 153 153 153 153 153 ...
Aside from .csv files, there are also the .txt files which are basically text files. You can import these functions with read.delim(). By default, it sets the sep argument to “ (fields in a record are delimited by tabs) and the header argument to TRUE (the first row contains the field names).
# Import hotdogs.txt: hotdogs
hotdogs <- read.delim("hotdogs.txt", header = FALSE)
# Summarize hotdogs
summary(hotdogs)
## V1 V2 V3
## Beef :20 Min. : 86.0 Min. :144.0
## Meat :17 1st Qu.:132.0 1st Qu.:362.5
## Poultry:17 Median :145.0 Median :405.0
## Mean :145.4 Mean :424.8
## 3rd Qu.:172.8 3rd Qu.:503.5
## Max. :195.0 Max. :645.0
If you’re dealing with more exotic flat file formats, you’ll want to use read.table(). It’s the most basic importing function; you can specify tons of different arguments in this function. Unlike read.csv() and read.delim(), the header argument defaults to FALSE and the sep argument is “” by default.
Up to you again! The data is still hotdogs.txt. It has no column names in the first row, and the field separators are tabs. This time, though, the file is in the data folder inside your current working directory. A variable path with the location of this file is already coded for you.
# Import the hotdogs.txt file: hotdogs
hotdogs <- read.table("hotdogs.txt",
sep = "\t",
col.names = c("type", "calories", "sodium"))
# Call head() on hotdogs
head(hotdogs)
## type calories sodium
## 1 Beef 186 495
## 2 Beef 181 477
## 3 Beef 176 425
## 4 Beef 149 322
## 5 Beef 184 482
## 6 Beef 190 587
Next to column names, you can also specify the column types or column classes of the resulting data frame. You can do this by setting the colClasses argument to a vector of strings representing classes:
# Display structure of hotdogs
str(hotdogs)
## 'data.frame': 54 obs. of 3 variables:
## $ type : Factor w/ 3 levels "Beef","Meat",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ calories: int 186 181 176 149 184 190 158 139 175 148 ...
## $ sodium : int 495 477 425 322 482 587 370 322 479 375 ...
# Edit the colClasses argument to import the data correctly: hotdogs2
hotdogs2 <- read.delim("hotdogs.txt", header = FALSE,
col.names = c("type", "calories", "sodium"),
colClasses = c("factor", "NULL", "numeric"))
# Display structure of hotdogs2
str(hotdogs2)
## 'data.frame': 54 obs. of 2 variables:
## $ type : Factor w/ 3 levels "Beef","Meat",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ sodium: num 495 477 425 322 482 587 370 322 479 375 ...
In addition to base R, there are dedicated packages to easily and efficiently import flat file data. We’ll talk about two such packages: readr and data.table.
CSV files can be imported with read_csv(). It’s a wrapper function around read_delim() that handles all the details for you. For example, it will assume that the first row contains the column names.
Where you use read_csv() to easily read in CSV files, you use read_tsv() to easily read in TSV files. TSV is short for tab-separated values.
# Import potatoes.csv with read_csv(): potatoes
potatoes_1<- read.csv("potatoes.csv")
head(potatoes_1)
## X area temp size storage method texture flavor moistness
## 1 1 1 1 1 1 1 2.9 3.2 3.0
## 2 2 1 1 1 1 2 2.3 2.5 2.6
## 3 3 1 1 1 1 3 2.5 2.8 2.8
## 4 4 1 1 1 1 4 2.1 2.9 2.4
## 5 5 1 1 1 1 5 1.9 2.8 2.2
## 6 6 1 1 1 2 1 1.8 3.0 1.7
# Column names
properties <- c("area", "temp", "size", "storage", "method",
"texture", "flavor", "moistness")
# Import potatoes.txt: potatoes
potatoes_2 <- read_tsv("potatoes.txt", col_names = properties)
## Rows: 160 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## dbl (8): area, temp, size, storage, method, texture, flavor, moistness
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(potatoes_2)
## # A tibble: 6 × 8
## area temp size storage method texture flavor moistness
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 1 1 1 2.9 3.2 3
## 2 1 1 1 1 2 2.3 2.5 2.6
## 3 1 1 1 1 3 2.5 2.8 2.8
## 4 1 1 1 1 4 2.1 2.9 2.4
## 5 1 1 1 1 5 1.9 2.8 2.2
## 6 1 1 1 2 1 1.8 3 1.7
Just as read.table() was the main utils function, read_delim() is the main readr function. read_delim() takes two mandatory arguments: * file: the file that contains the data * delim: the character that separates the values in the data file
# Import potatoes.txt using read_delim(): potatoes
potatoes_3 <- read_delim("potatoes.txt", delim = "\t", col_names = properties)
## Rows: 160 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## dbl (8): area, temp, size, storage, method, texture, flavor, moistness
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(potatoes_3)
## # A tibble: 6 × 8
## area temp size storage method texture flavor moistness
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 1 1 1 2.9 3.2 3
## 2 1 1 1 1 2 2.3 2.5 2.6
## 3 1 1 1 1 3 2.5 2.8 2.8
## 4 1 1 1 1 4 2.1 2.9 2.4
## 5 1 1 1 1 5 1.9 2.8 2.2
## 6 1 1 1 2 1 1.8 3 1.7
# Import 5 observations from potatoes.txt: potatoes_fragment
potatoes_fragment <- read_tsv("potatoes.txt", skip = 6, n_max = 5, col_names = properties)
## Rows: 5 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## dbl (8): area, temp, size, storage, method, texture, flavor, moistness
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
potatoes_fragment
## # A tibble: 5 × 8
## area temp size storage method texture flavor moistness
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 1 2 2 2.6 3.1 2.4
## 2 1 1 1 2 3 3 3 2.9
## 3 1 1 1 2 4 2.2 3.2 2.5
## 4 1 1 1 2 5 2 2.8 1.9
## 5 1 1 1 3 1 1.8 2.6 1.5
# Import all data, but force all columns to be character: potatoes_char
potatoes_char <- read_tsv("potatoes.txt", col_types = "cccccccc", col_names = properties)
# Print out structure of potatoes_char
str(potatoes_char)
## spec_tbl_df [160 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ area : chr [1:160] "1" "1" "1" "1" ...
## $ temp : chr [1:160] "1" "1" "1" "1" ...
## $ size : chr [1:160] "1" "1" "1" "1" ...
## $ storage : chr [1:160] "1" "1" "1" "1" ...
## $ method : chr [1:160] "1" "2" "3" "4" ...
## $ texture : chr [1:160] "2.9" "2.3" "2.5" "2.1" ...
## $ flavor : chr [1:160] "3.2" "2.5" "2.8" "2.9" ...
## $ moistness: chr [1:160] "3.0" "2.6" "2.8" "2.4" ...
## - attr(*, "spec")=
## .. cols(
## .. area = col_character(),
## .. temp = col_character(),
## .. size = col_character(),
## .. storage = col_character(),
## .. method = col_character(),
## .. texture = col_character(),
## .. flavor = col_character(),
## .. moistness = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
You can also specify which types the columns in your imported data frame should have. You can do this with col_types. If set to NULL, the default, functions from the readr package will try to find the correct types themselves. You can manually set the types with a string, where each character denotes the class of the column: character, double, integer and logical. _ skips the column as a whole.
# Import without col_types
hotdogs <- read_tsv("hotdogs.txt", col_names = c("type", "calories", "sodium"))
## Rows: 54 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): type
## dbl (2): calories, sodium
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(hotdogs)
## type calories sodium
## Length:54 Min. : 86.0 Min. :144.0
## Class :character 1st Qu.:132.0 1st Qu.:362.5
## Mode :character Median :145.0 Median :405.0
## Mean :145.4 Mean :424.8
## 3rd Qu.:172.8 3rd Qu.:503.5
## Max. :195.0 Max. :645.0
# The collectors you will need to import the data
fac <- col_factor(levels = c("Beef", "Meat", "Poultry"))
int <- col_integer()
# Edit the col_types argument to import the data correctly: hotdogs_factor
hotdogs_factor <- read_tsv("hotdogs.txt",
col_names = c("type", "calories", "sodium"),
col_types = list(fac, int, int))
# Display the summary of hotdogs_factor
summary(hotdogs_factor)
## type calories sodium
## Beef :20 Min. : 86.0 Min. :144.0
## Meat :17 1st Qu.:132.0 1st Qu.:362.5
## Poultry:17 Median :145.0 Median :405.0
## Mean :145.4 Mean :424.8
## 3rd Qu.:172.8 3rd Qu.:503.5
## Max. :195.0 Max. :645.0
You still remember how to use read.table(), right? Well, fread() is a function that does the same job with very similar arguments. It is extremely easy to use and blazingly fast! Often, simply specifying the path to the file is enough to successfully import your data.
# Import potatoes.csv with fread(): potatoes
potatoes_4 <- fread("potatoes.csv")
# Print out potatoes
head(potatoes_4)
## V1 area temp size storage method texture flavor moistness
## 1: 1 1 1 1 1 1 2.9 3.2 3.0
## 2: 2 1 1 1 1 2 2.3 2.5 2.6
## 3: 3 1 1 1 1 3 2.5 2.8 2.8
## 4: 4 1 1 1 1 4 2.1 2.9 2.4
## 5: 5 1 1 1 1 5 1.9 2.8 2.2
## 6: 6 1 1 1 2 1 1.8 3.0 1.7
# Import columns 6 and 8 of potatoes.csv: potatoes
potatoes_5<-fread("potatoes.csv", select=c(6, 8) )
# Plot texture (x) and moistness (y) of potatoes
plot(potatoes_4$texture, potatoes_4$moistness)
Excel is a widely used data analysis tool. If you prefer to do your analyses in R, though, you’ll need an understanding of how to import .csv data into R. This chapter will show you how to use readxl and gdata to do so.
Before you can start importing from Excel, you should find out which sheets are available in the workbook. You can use the excel_sheets() function for this.
You will find the Excel file urbanpop.xlsx (view) in your working directory (type dir() to see it). This dataset contains urban population metrics for practically all countries in the world throughout time (Source: Gapminder). It contains three sheets for three different time periods. In each sheet, the first row contains the column names.
library(readxl)
# Print the names of all worksheets
excel_sheets("urbanpop.xlsx")
## [1] "1960-1966" "1967-1974" "1975-2011"
# Read the sheets, one by one
pop_1 <- read_excel("urbanpop.xlsx", sheet = 1)
pop_2 <- read_excel("urbanpop.xlsx", sheet = 2)
pop_3 <- read_excel("urbanpop.xlsx", sheet= 3)
# Put pop_1, pop_2 and pop_3 in a list: pop_list
pop_list <- list(pop_1, pop_2, pop_3)
# Display the structure of pop_list
str(pop_list)
## List of 3
## $ : tibble [209 × 8] (S3: tbl_df/tbl/data.frame)
## ..$ country: chr [1:209] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
## ..$ 1960 : num [1:209] 769308 494443 3293999 NA NA ...
## ..$ 1961 : num [1:209] 814923 511803 3515148 13660 8724 ...
## ..$ 1962 : num [1:209] 858522 529439 3739963 14166 9700 ...
## ..$ 1963 : num [1:209] 903914 547377 3973289 14759 10748 ...
## ..$ 1964 : num [1:209] 951226 565572 4220987 15396 11866 ...
## ..$ 1965 : num [1:209] 1000582 583983 4488176 16045 13053 ...
## ..$ 1966 : num [1:209] 1058743 602512 4649105 16693 14217 ...
## $ : tibble [209 × 9] (S3: tbl_df/tbl/data.frame)
## ..$ country: chr [1:209] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
## ..$ 1967 : num [1:209] 1119067 621180 4826104 17349 15440 ...
## ..$ 1968 : num [1:209] 1182159 639964 5017299 17996 16727 ...
## ..$ 1969 : num [1:209] 1248901 658853 5219332 18619 18088 ...
## ..$ 1970 : num [1:209] 1319849 677839 5429743 19206 19529 ...
## ..$ 1971 : num [1:209] 1409001 698932 5619042 19752 20929 ...
## ..$ 1972 : num [1:209] 1502402 720207 5815734 20263 22406 ...
## ..$ 1973 : num [1:209] 1598835 741681 6020647 20742 23937 ...
## ..$ 1974 : num [1:209] 1696445 763385 6235114 21194 25482 ...
## $ : tibble [209 × 38] (S3: tbl_df/tbl/data.frame)
## ..$ country: chr [1:209] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
## ..$ 1975 : num [1:209] 1793266 785350 6460138 21632 27019 ...
## ..$ 1976 : num [1:209] 1905033 807990 6774099 22047 28366 ...
## ..$ 1977 : num [1:209] 2021308 830959 7102902 22452 29677 ...
## ..$ 1978 : num [1:209] 2142248 854262 7447728 22899 31037 ...
## ..$ 1979 : num [1:209] 2268015 877898 7810073 23457 32572 ...
## ..$ 1980 : num [1:209] 2398775 901884 8190772 24177 34366 ...
## ..$ 1981 : num [1:209] 2493265 927224 8637724 25173 36356 ...
## ..$ 1982 : num [1:209] 2590846 952447 9105820 26342 38618 ...
## ..$ 1983 : num [1:209] 2691612 978476 9591900 27655 40983 ...
## ..$ 1984 : num [1:209] 2795656 1006613 10091289 29062 43207 ...
## ..$ 1985 : num [1:209] 2903078 1037541 10600112 30524 45119 ...
## ..$ 1986 : num [1:209] 3006983 1072365 11101757 32014 46254 ...
## ..$ 1987 : num [1:209] 3113957 1109954 11609104 33548 47019 ...
## ..$ 1988 : num [1:209] 3224082 1146633 12122941 35095 47669 ...
## ..$ 1989 : num [1:209] 3337444 1177286 12645263 36618 48577 ...
## ..$ 1990 : num [1:209] 3454129 1198293 13177079 38088 49982 ...
## ..$ 1991 : num [1:209] 3617842 1215445 13708813 39600 51972 ...
## ..$ 1992 : num [1:209] 3788685 1222544 14248297 41049 54469 ...
## ..$ 1993 : num [1:209] 3966956 1222812 14789176 42443 57079 ...
## ..$ 1994 : num [1:209] 4152960 1221364 15322651 43798 59243 ...
## ..$ 1995 : num [1:209] 4347018 1222234 15842442 45129 60598 ...
## ..$ 1996 : num [1:209] 4531285 1228760 16395553 46343 60927 ...
## ..$ 1997 : num [1:209] 4722603 1238090 16935451 47527 60462 ...
## ..$ 1998 : num [1:209] 4921227 1250366 17469200 48705 59685 ...
## ..$ 1999 : num [1:209] 5127421 1265195 18007937 49906 59281 ...
## ..$ 2000 : num [1:209] 5341456 1282223 18560597 51151 59719 ...
## ..$ 2001 : num [1:209] 5564492 1315690 19198872 52341 61062 ...
## ..$ 2002 : num [1:209] 5795940 1352278 19854835 53583 63212 ...
## ..$ 2003 : num [1:209] 6036100 1391143 20529356 54864 65802 ...
## ..$ 2004 : num [1:209] 6285281 1430918 21222198 56166 68301 ...
## ..$ 2005 : num [1:209] 6543804 1470488 21932978 57474 70329 ...
## ..$ 2006 : num [1:209] 6812538 1512255 22625052 58679 71726 ...
## ..$ 2007 : num [1:209] 7091245 1553491 23335543 59894 72684 ...
## ..$ 2008 : num [1:209] 7380272 1594351 24061749 61118 73335 ...
## ..$ 2009 : num [1:209] 7679982 1635262 24799591 62357 73897 ...
## ..$ 2010 : num [1:209] 7990746 1676545 25545622 63616 74525 ...
## ..$ 2011 : num [1:209] 8316976 1716842 26216968 64817 75207 ...
# Import the first Excel sheet of urbanpop_nonames.xlsx (R gives names): pop_a
pop_a <- read_excel("urbanpop_nonames.xlsx", col_names= FALSE)
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * ...
# Import the first Excel sheet of urbanpop_nonames.xlsx (specify col_names): pop_b
cols <- c("country", paste0("year_", 1960:1966))
pop_b <- read_excel("urbanpop_nonames.xlsx", col_names = cols)
# Print the summary of pop_a
summary(pop_a)
## ...1 ...2 ...3 ...4
## Length:209 Min. : 3378 Min. : 1028 Min. : 1090
## Class :character 1st Qu.: 88978 1st Qu.: 70644 1st Qu.: 74974
## Mode :character Median : 580675 Median : 570159 Median : 593968
## Mean : 4988124 Mean : 4991613 Mean : 5141592
## 3rd Qu.: 3077228 3rd Qu.: 2807280 3rd Qu.: 2948396
## Max. :126469700 Max. :129268133 Max. :131974143
## NA's :11
## ...5 ...6 ...7
## Min. : 1154 Min. : 1218 Min. : 1281
## 1st Qu.: 81870 1st Qu.: 84953 1st Qu.: 88633
## Median : 619331 Median : 645262 Median : 679109
## Mean : 5303711 Mean : 5468966 Mean : 5637394
## 3rd Qu.: 3148941 3rd Qu.: 3296444 3rd Qu.: 3317422
## Max. :134599886 Max. :137205240 Max. :139663053
##
## ...8
## Min. : 1349
## 1st Qu.: 93638
## Median : 735139
## Mean : 5790281
## 3rd Qu.: 3418036
## Max. :141962708
##
# Print the summary of pop_b
summary(pop_b)
## country year_1960 year_1961 year_1962
## Length:209 Min. : 3378 Min. : 1028 Min. : 1090
## Class :character 1st Qu.: 88978 1st Qu.: 70644 1st Qu.: 74974
## Mode :character Median : 580675 Median : 570159 Median : 593968
## Mean : 4988124 Mean : 4991613 Mean : 5141592
## 3rd Qu.: 3077228 3rd Qu.: 2807280 3rd Qu.: 2948396
## Max. :126469700 Max. :129268133 Max. :131974143
## NA's :11
## year_1963 year_1964 year_1965
## Min. : 1154 Min. : 1218 Min. : 1281
## 1st Qu.: 81870 1st Qu.: 84953 1st Qu.: 88633
## Median : 619331 Median : 645262 Median : 679109
## Mean : 5303711 Mean : 5468966 Mean : 5637394
## 3rd Qu.: 3148941 3rd Qu.: 3296444 3rd Qu.: 3317422
## Max. :134599886 Max. :137205240 Max. :139663053
##
## year_1966
## Min. : 1349
## 1st Qu.: 93638
## Median : 735139
## Mean : 5790281
## 3rd Qu.: 3418036
## Max. :141962708
##
# Import the second sheet of urbanpop.xlsx, skipping the first 21 rows: urbanpop_sel
urbanpop_sel <- read_excel("urbanpop.xlsx", sheet = 2, col_names = FALSE, skip = 21)
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * ...
# Print out the first observation from urbanpop_sel
urbanpop_sel[1,]
## # A tibble: 1 × 9
## ...1 ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Benin 382022. 411859. 443013. 475611. 515820. 557938. 602093. 648410.
In the previous exercise you generated a list of three Excel sheets that you imported. However, loading in every sheet manually and then merging them in a list can be quite tedious. Luckily, you can automate this with lapply(). If you have no experience with lapply(), feel free to take Chapter 4 of the Intermediate R course.
# Read all Excel sheets with lapply(): pop_list
pop_list <- lapply(excel_sheets("urbanpop.xlsx"), read_excel, path = "urbanpop.xlsx")
# Display the structure of pop_list
str(pop_list)
## List of 3
## $ : tibble [209 × 8] (S3: tbl_df/tbl/data.frame)
## ..$ country: chr [1:209] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
## ..$ 1960 : num [1:209] 769308 494443 3293999 NA NA ...
## ..$ 1961 : num [1:209] 814923 511803 3515148 13660 8724 ...
## ..$ 1962 : num [1:209] 858522 529439 3739963 14166 9700 ...
## ..$ 1963 : num [1:209] 903914 547377 3973289 14759 10748 ...
## ..$ 1964 : num [1:209] 951226 565572 4220987 15396 11866 ...
## ..$ 1965 : num [1:209] 1000582 583983 4488176 16045 13053 ...
## ..$ 1966 : num [1:209] 1058743 602512 4649105 16693 14217 ...
## $ : tibble [209 × 9] (S3: tbl_df/tbl/data.frame)
## ..$ country: chr [1:209] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
## ..$ 1967 : num [1:209] 1119067 621180 4826104 17349 15440 ...
## ..$ 1968 : num [1:209] 1182159 639964 5017299 17996 16727 ...
## ..$ 1969 : num [1:209] 1248901 658853 5219332 18619 18088 ...
## ..$ 1970 : num [1:209] 1319849 677839 5429743 19206 19529 ...
## ..$ 1971 : num [1:209] 1409001 698932 5619042 19752 20929 ...
## ..$ 1972 : num [1:209] 1502402 720207 5815734 20263 22406 ...
## ..$ 1973 : num [1:209] 1598835 741681 6020647 20742 23937 ...
## ..$ 1974 : num [1:209] 1696445 763385 6235114 21194 25482 ...
## $ : tibble [209 × 38] (S3: tbl_df/tbl/data.frame)
## ..$ country: chr [1:209] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
## ..$ 1975 : num [1:209] 1793266 785350 6460138 21632 27019 ...
## ..$ 1976 : num [1:209] 1905033 807990 6774099 22047 28366 ...
## ..$ 1977 : num [1:209] 2021308 830959 7102902 22452 29677 ...
## ..$ 1978 : num [1:209] 2142248 854262 7447728 22899 31037 ...
## ..$ 1979 : num [1:209] 2268015 877898 7810073 23457 32572 ...
## ..$ 1980 : num [1:209] 2398775 901884 8190772 24177 34366 ...
## ..$ 1981 : num [1:209] 2493265 927224 8637724 25173 36356 ...
## ..$ 1982 : num [1:209] 2590846 952447 9105820 26342 38618 ...
## ..$ 1983 : num [1:209] 2691612 978476 9591900 27655 40983 ...
## ..$ 1984 : num [1:209] 2795656 1006613 10091289 29062 43207 ...
## ..$ 1985 : num [1:209] 2903078 1037541 10600112 30524 45119 ...
## ..$ 1986 : num [1:209] 3006983 1072365 11101757 32014 46254 ...
## ..$ 1987 : num [1:209] 3113957 1109954 11609104 33548 47019 ...
## ..$ 1988 : num [1:209] 3224082 1146633 12122941 35095 47669 ...
## ..$ 1989 : num [1:209] 3337444 1177286 12645263 36618 48577 ...
## ..$ 1990 : num [1:209] 3454129 1198293 13177079 38088 49982 ...
## ..$ 1991 : num [1:209] 3617842 1215445 13708813 39600 51972 ...
## ..$ 1992 : num [1:209] 3788685 1222544 14248297 41049 54469 ...
## ..$ 1993 : num [1:209] 3966956 1222812 14789176 42443 57079 ...
## ..$ 1994 : num [1:209] 4152960 1221364 15322651 43798 59243 ...
## ..$ 1995 : num [1:209] 4347018 1222234 15842442 45129 60598 ...
## ..$ 1996 : num [1:209] 4531285 1228760 16395553 46343 60927 ...
## ..$ 1997 : num [1:209] 4722603 1238090 16935451 47527 60462 ...
## ..$ 1998 : num [1:209] 4921227 1250366 17469200 48705 59685 ...
## ..$ 1999 : num [1:209] 5127421 1265195 18007937 49906 59281 ...
## ..$ 2000 : num [1:209] 5341456 1282223 18560597 51151 59719 ...
## ..$ 2001 : num [1:209] 5564492 1315690 19198872 52341 61062 ...
## ..$ 2002 : num [1:209] 5795940 1352278 19854835 53583 63212 ...
## ..$ 2003 : num [1:209] 6036100 1391143 20529356 54864 65802 ...
## ..$ 2004 : num [1:209] 6285281 1430918 21222198 56166 68301 ...
## ..$ 2005 : num [1:209] 6543804 1470488 21932978 57474 70329 ...
## ..$ 2006 : num [1:209] 6812538 1512255 22625052 58679 71726 ...
## ..$ 2007 : num [1:209] 7091245 1553491 23335543 59894 72684 ...
## ..$ 2008 : num [1:209] 7380272 1594351 24061749 61118 73335 ...
## ..$ 2009 : num [1:209] 7679982 1635262 24799591 62357 73897 ...
## ..$ 2010 : num [1:209] 7990746 1676545 25545622 63616 74525 ...
## ..$ 2011 : num [1:209] 8316976 1716842 26216968 64817 75207 ...
Apart from path and sheet, there are several other arguments you can specify in read_excel(). One of these arguments is called col_names.
By default it is TRUE, denoting whether the first row in the Excel sheets contains the column names. If this is not the case, you can set col_names to FALSE. In this case, R will choose column names for you. You can also choose to set col_names to a character vector with names for each column. It works exactly the same as in the readr package.
# Import the first Excel sheet of urbanpop_nonames.xlsx (R gives names): pop_a
pop_a <- read_excel("urbanpop_nonames.xlsx", col_names = FALSE)
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * ...
# Import the first Excel sheet of urbanpop_nonames.xlsx (specify col_names): pop_b
cols <- c("country", paste0("year_", 1960:1966))
pop_b <- read_excel("urbanpop_nonames.xlsx", col_names = cols)
# Print the summary of pop_a
summary(pop_a)
## ...1 ...2 ...3 ...4
## Length:209 Min. : 3378 Min. : 1028 Min. : 1090
## Class :character 1st Qu.: 88978 1st Qu.: 70644 1st Qu.: 74974
## Mode :character Median : 580675 Median : 570159 Median : 593968
## Mean : 4988124 Mean : 4991613 Mean : 5141592
## 3rd Qu.: 3077228 3rd Qu.: 2807280 3rd Qu.: 2948396
## Max. :126469700 Max. :129268133 Max. :131974143
## NA's :11
## ...5 ...6 ...7
## Min. : 1154 Min. : 1218 Min. : 1281
## 1st Qu.: 81870 1st Qu.: 84953 1st Qu.: 88633
## Median : 619331 Median : 645262 Median : 679109
## Mean : 5303711 Mean : 5468966 Mean : 5637394
## 3rd Qu.: 3148941 3rd Qu.: 3296444 3rd Qu.: 3317422
## Max. :134599886 Max. :137205240 Max. :139663053
##
## ...8
## Min. : 1349
## 1st Qu.: 93638
## Median : 735139
## Mean : 5790281
## 3rd Qu.: 3418036
## Max. :141962708
##
# Print the summary of pop_b
summary(pop_b)
## country year_1960 year_1961 year_1962
## Length:209 Min. : 3378 Min. : 1028 Min. : 1090
## Class :character 1st Qu.: 88978 1st Qu.: 70644 1st Qu.: 74974
## Mode :character Median : 580675 Median : 570159 Median : 593968
## Mean : 4988124 Mean : 4991613 Mean : 5141592
## 3rd Qu.: 3077228 3rd Qu.: 2807280 3rd Qu.: 2948396
## Max. :126469700 Max. :129268133 Max. :131974143
## NA's :11
## year_1963 year_1964 year_1965
## Min. : 1154 Min. : 1218 Min. : 1281
## 1st Qu.: 81870 1st Qu.: 84953 1st Qu.: 88633
## Median : 619331 Median : 645262 Median : 679109
## Mean : 5303711 Mean : 5468966 Mean : 5637394
## 3rd Qu.: 3148941 3rd Qu.: 3296444 3rd Qu.: 3317422
## Max. :134599886 Max. :137205240 Max. :139663053
##
## year_1966
## Min. : 1349
## 1st Qu.: 93638
## Median : 735139
## Mean : 5790281
## 3rd Qu.: 3418036
## Max. :141962708
##
Another argument that can be very useful when reading in Excel files that are less tidy, is skip. With skip, you can tell R to ignore a specified number of rows inside the Excel sheets you’re trying to pull data from. Have a look at this example:
# Import the second sheet of urbanpop.xlsx, skipping the first 21 rows: urbanpop_sel
urbanpop_sel <- read_excel("urbanpop.xlsx", sheet=2, col_names = FALSE, skip=21)
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * ...
# Print out the first observation from urbanpop_sel
urbanpop_sel[1, ]
## # A tibble: 1 × 9
## ...1 ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Benin 382022. 411859. 443013. 475611. 515820. 557938. 602093. 648410.
In this part of the chapter you’ll learn how to import .xls files using the gdata package. Similar to the readxl package, you can import single Excel sheets from Excel sheets to start your analysis in R.
library(gdata)
# Import the second sheet of urbanpop.xls: urban_pop
urban_pop <- read.xls("urbanpop.xls", sheet = "1967-1974")
# Print the first 11 observations using head()
head(urban_pop, n = 11)
## country X1967 X1968 X1969 X1970
## 1 Afghanistan 1119067.20 1182159.06 1248900.79 1319848.78
## 2 Albania 621179.85 639964.46 658853.12 677839.12
## 3 Algeria 4826104.22 5017298.60 5219331.87 5429743.08
## 4 American Samoa 17348.66 17995.51 18618.68 19206.39
## 5 Andorra 15439.62 16726.99 18088.32 19528.96
## 6 Angola 757496.32 798459.26 841261.96 886401.63
## 7 Antigua and Barbuda 22086.25 22149.39 22182.92 22180.87
## 8 Argentina 17753280.98 18124103.64 18510462.30 18918072.79
## 9 Armenia 1337032.09 1392892.13 1449641.49 1507619.77
## 10 Aruba 29414.72 29576.09 29737.87 29901.57
## 11 Australia 9934404.03 10153969.77 10412390.67 10664093.55
## X1971 X1972 X1973 X1974
## 1 1409001.09 1502401.79 1598835.45 1696444.83
## 2 698932.25 720206.57 741681.04 763385.45
## 3 5619041.53 5815734.49 6020647.35 6235114.38
## 4 19752.02 20262.67 20741.97 21194.38
## 5 20928.73 22405.84 23937.05 25481.98
## 6 955010.09 1027397.35 1103829.78 1184486.23
## 7 22560.87 22907.76 23221.29 23502.92
## 8 19329718.16 19763078.00 20211424.85 20664728.90
## 9 1564367.60 1622103.53 1680497.75 1739063.02
## 10 30081.36 30279.76 30467.42 30602.87
## 11 11047706.39 11269945.50 11461120.68 11772934.25
# Column names for urban_pop
columns <- c("country", paste0("year_", 1967:1974))
# Finish the read.xls call
urban_pop <- read.xls("urbanpop.xls", sheet = 2,
skip = 50, header = FALSE, stringsAsFactors = FALSE,
col.names = columns)
# Print first 10 observation of urban_pop
head(urban_pop, n = 10)
## country year_1967 year_1968 year_1969 year_1970
## 1 Cyprus 231929.74 237831.38 243983.34 250164.52
## 2 Czech Republic 6204409.91 6266304.50 6326368.97 6348794.89
## 3 Denmark 3777552.62 3826785.08 3874313.99 3930042.97
## 4 Djibouti 77788.04 84694.35 92045.77 99845.22
## 5 Dominica 27550.36 29527.32 31475.62 33328.25
## 6 Dominican Republic 1535485.43 1625455.76 1718315.40 1814060.00
## 7 Ecuador 2059355.12 2151395.14 2246890.79 2345864.41
## 8 Egypt 13798171.00 14248342.19 14703858.22 15162858.52
## 9 El Salvador 1345528.98 1387218.33 1429378.98 1472181.26
## 10 Equatorial Guinea 75364.50 77295.03 78445.74 78411.07
## year_1971 year_1972 year_1973 year_1974
## 1 261213.21 272407.99 283774.90 295379.83
## 2 6437055.17 6572632.32 6718465.53 6873458.18
## 3 3981360.12 4028247.92 4076867.28 4120201.43
## 4 107799.69 116098.23 125391.58 136606.25
## 5 34761.52 36049.99 37260.05 38501.47
## 6 1915590.38 2020157.01 2127714.45 2238203.87
## 7 2453817.78 2565644.81 2681525.25 2801692.62
## 8 15603661.36 16047814.69 16498633.27 16960827.93
## 9 1527985.34 1584758.18 1642098.95 1699470.87
## 10 77055.29 74596.06 71438.96 68179.26
# Import all sheets from urbanpop.xls
path <- "urbanpop.xls"
urban_sheet1 <- read.xls(path, sheet = 1, stringsAsFactors = FALSE)
urban_sheet2 <- read.xls(path, sheet = 2, stringsAsFactors = FALSE)
urban_sheet3 <- read.xls(path, sheet = 3, stringsAsFactors = FALSE)
# Extend the cbind() call to include urban_sheet3: urban_all
urban <- cbind(urban_sheet1, urban_sheet2[-1], urban_sheet3[-1])
# Remove all rows with NAs from urban: urban_clean
urban_clean <- na.omit(urban)
# Print out a summary of urban_clean
summary(urban_clean)
## country X1960 X1961 X1962
## Length:197 Min. : 3378 Min. : 3433 Min. : 3481
## Class :character 1st Qu.: 87735 1st Qu.: 92905 1st Qu.: 98331
## Mode :character Median : 599714 Median : 630788 Median : 659464
## Mean : 5012388 Mean : 5282488 Mean : 5440972
## 3rd Qu.: 3130085 3rd Qu.: 3155370 3rd Qu.: 3250211
## Max. :126469700 Max. :129268133 Max. :131974143
## X1963 X1964 X1965
## Min. : 3532 Min. : 3586 Min. : 3644
## 1st Qu.: 104988 1st Qu.: 112084 1st Qu.: 119322
## Median : 704989 Median : 740609 Median : 774957
## Mean : 5612312 Mean : 5786961 Mean : 5964970
## 3rd Qu.: 3416490 3rd Qu.: 3585464 3rd Qu.: 3666724
## Max. :134599886 Max. :137205240 Max. :139663053
## X1966 X1967 X1968
## Min. : 3706 Min. : 3771 Min. : 3835
## 1st Qu.: 128565 1st Qu.: 138024 1st Qu.: 147846
## Median : 809768 Median : 838449 Median : 890270
## Mean : 6126413 Mean : 6288771 Mean : 6451367
## 3rd Qu.: 3871757 3rd Qu.: 4019906 3rd Qu.: 4158186
## Max. :141962708 Max. :144201722 Max. :146340364
## X1969 X1970 X1971
## Min. : 3893 Min. : 3941 Min. : 4017
## 1st Qu.: 158252 1st Qu.: 171063 1st Qu.: 181483
## Median : 929450 Median : 976471 Median : 1008630
## Mean : 6624909 Mean : 6799110 Mean : 6980895
## 3rd Qu.: 4300669 3rd Qu.: 4440047 3rd Qu.: 4595966
## Max. :148475901 Max. :150922373 Max. :152863831
## X1972 X1973 X1974
## Min. : 4084 Min. : 4146 Min. : 4206
## 1st Qu.: 189492 1st Qu.: 197792 1st Qu.: 205410
## Median : 1048738 Median : 1097293 Median : 1159402
## Mean : 7165338 Mean : 7349454 Mean : 7540446
## 3rd Qu.: 4766545 3rd Qu.: 4838297 3rd Qu.: 4906384
## Max. :154530473 Max. :156034106 Max. :157488074
## X1975 X1976 X1977
## Min. : 4267 Min. : 4334 Min. : 4402
## 1st Qu.: 211746 1st Qu.: 216991 1st Qu.: 222209
## Median : 1223146 Median : 1249829 Median : 1311276
## Mean : 7731973 Mean : 7936401 Mean : 8145945
## 3rd Qu.: 5003370 3rd Qu.: 5121118 3rd Qu.: 5227677
## Max. :159452730 Max. :165583752 Max. :171550310
## X1978 X1979 X1980
## Min. : 4470 Min. : 4539 Min. : 4607
## 1st Qu.: 227605 1st Qu.: 233461 1st Qu.: 242583
## Median : 1340811 Median : 1448185 Median : 1592397
## Mean : 8361360 Mean : 8583138 Mean : 8808772
## 3rd Qu.: 5352746 3rd Qu.: 5558850 3rd Qu.: 5815772
## Max. :177605736 Max. :183785364 Max. :189947471
## X1981 X1982 X1983
## Min. : 4645 Min. : 4681 Min. : 4716
## 1st Qu.: 248948 1st Qu.: 257944 1st Qu.: 274139
## Median : 1673079 Median : 1713060 Median : 1730626
## Mean : 9049163 Mean : 9295226 Mean : 9545035
## 3rd Qu.: 6070457 3rd Qu.: 6337995 3rd Qu.: 6619987
## Max. :199385258 Max. :209435968 Max. :219680098
## X1984 X1985 X1986
## Min. : 4750 Min. : 4782 Min. : 4809
## 1st Qu.: 284939 1st Qu.: 300928 1st Qu.: 307699
## Median : 1749033 Median : 1786125 Median : 1850910
## Mean : 9798559 Mean : 10058661 Mean : 10323839
## 3rd Qu.: 6918261 3rd Qu.: 6931780 3rd Qu.: 6935763
## Max. :229872397 Max. :240414890 Max. :251630158
## X1987 X1988 X1989
## Min. : 4835 Min. : 4859 Min. : 4883
## 1st Qu.: 321125 1st Qu.: 334616 1st Qu.: 347348
## Median : 1953694 Median : 1997011 Median : 1993544
## Mean : 10595817 Mean : 10873041 Mean : 11154458
## 3rd Qu.: 6939905 3rd Qu.: 6945022 3rd Qu.: 6885378
## Max. :263433513 Max. :275570541 Max. :287810747
## X1990 X1991 X1992
## Min. : 4907 Min. : 4946 Min. : 4985
## 1st Qu.: 370152 1st Qu.: 394611 1st Qu.: 418788
## Median : 2066505 Median : 2150230 Median : 2237405
## Mean : 11438543 Mean : 11725076 Mean : 12010922
## 3rd Qu.: 6830026 3rd Qu.: 6816589 3rd Qu.: 6820099
## Max. :300165618 Max. :314689997 Max. :329099365
## X1993 X1994 X1995
## Min. : 5024 Min. : 5062 Min. : 5100
## 1st Qu.: 427457 1st Qu.: 435959 1st Qu.: 461993
## Median : 2322158 Median : 2410297 Median : 2482393
## Mean : 12296949 Mean : 12582930 Mean : 12871480
## 3rd Qu.: 7139656 3rd Qu.: 7499901 3rd Qu.: 7708571
## Max. :343555327 Max. :358232230 Max. :373035157
## X1996 X1997 X1998
## Min. : 5079 Min. : 5055 Min. : 5029
## 1st Qu.: 488136 1st Qu.: 494203 1st Qu.: 498002
## Median : 2522460 Median : 2606125 Median : 2664983
## Mean : 13165924 Mean : 13463675 Mean : 13762861
## 3rd Qu.: 7686092 3rd Qu.: 7664316 3rd Qu.: 7784056
## Max. :388936607 Max. :405031716 Max. :421147610
## X1999 X2000 X2001
## Min. : 5001 Min. : 4971 Min. : 5003
## 1st Qu.: 505144 1st Qu.: 525629 1st Qu.: 550638
## Median : 2737809 Median : 2826647 Median : 2925851
## Mean : 14063387 Mean : 14369278 Mean : 14705743
## 3rd Qu.: 8083488 3rd Qu.: 8305564 3rd Qu.: 8421967
## Max. :437126845 Max. :452999147 Max. :473204511
## X2002 X2003 X2004
## Min. : 5034 Min. : 5064 Min. : 5090
## 1st Qu.: 567531 1st Qu.: 572094 1st Qu.: 593900
## Median : 2928252 Median : 2944934 Median : 2994356
## Mean : 15043381 Mean : 15384513 Mean : 15730299
## 3rd Qu.: 8448628 3rd Qu.: 8622732 3rd Qu.: 8999112
## Max. :493402140 Max. :513607776 Max. :533892175
## X2005 X2006 X2007
## Min. : 5111 Min. : 5135 Min. : 5155
## 1st Qu.: 620511 1st Qu.: 632659 1st Qu.: 645172
## Median : 3057923 Median : 3269963 Median : 3432024
## Mean : 16080262 Mean : 16435872 Mean : 16797484
## 3rd Qu.: 9394001 3rd Qu.: 9689807 3rd Qu.: 9803381
## Max. :554367818 Max. :575050081 Max. :595731464
## X2008 X2009 X2010
## Min. : 5172 Min. : 5189 Min. : 5206
## 1st Qu.: 658017 1st Qu.: 671085 1st Qu.: 684302
## Median : 3589395 Median : 3652338 Median : 3676309
## Mean : 17164898 Mean : 17533997 Mean : 17904811
## 3rd Qu.: 10210317 3rd Qu.: 10518289 3rd Qu.: 10618596
## Max. :616552722 Max. :637533976 Max. :658557734
## X2011
## Min. : 5233
## 1st Qu.: 698009
## Median : 3664664
## Mean : 18276297
## 3rd Qu.: 10731193
## Max. :678796403
Many companies store their information in relational databases. The R community has also developed R packages to get data from these architectures. You’ll learn how to connect to a database and how to retrieve data from it.
The first step to import data from a SQL database is creating a connection to it. As Filip explained, you need different packages depending on the database you want to connect to. All of these packages do this in a uniform way, as specified in the DBI package.
dbConnect() creates a connection between your R session and a SQL database. The first argument has to be a DBIdriver object, that specifies how connections are made and how data is mapped between R and the database. Specifically for MySQL databases, you can build such a driver with RMySQL::MySQL().
If the MySQL database is a remote database hosted on a server, you’ll also have to specify the following arguments in dbConnect(): dbname, host, port, user and password. Most of these details have already been provided.
# Load the RMySQL package
library(DBI)
library(RMySQL)
# Edit dbConnect() call
con <- dbConnect(MySQL(),
dbname = "tweater",
host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
port = 3306,
user = "student",
password = "datacamp")
After you’ve successfully connected to a remote MySQL database, the next step is to see what tables the database contains. You can do this with the dbListTables() function. As you might remember from the video, this function requires the connection object as an input, and outputs a character vector with the table names.
# Connect to the MySQL database: con
con <- dbConnect(RMySQL::MySQL(),
dbname = "tweater",
host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
port = 3306,
user = "student",
password = "datacamp")
# Build a vector of table names: tables
tables <- dbListTables(con)
# Display structure of tables
str(tables)
## chr [1:3] "comments" "tweats" "users"
As you might have guessed by now, the database contains data on a more tasty version of Twitter, namely Tweater. Users can post tweats with short recipes for delicious snacks. People can comment on these tweats. There are three tables: users, tweats, and comments that have relations among them. Which ones, you ask? You’ll discover in a moment!
Let’s start by importing the data on the users into your R session. You do this with the dbReadTable() function. Simply pass it the connection object (con), followed by the name of the table you want to import. The resulting object is a standard R data frame.
# Connect to the MySQL database: con
con <- dbConnect(RMySQL::MySQL(),
dbname = "tweater",
host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
port = 3306,
user = "student",
password = "datacamp")
# Import the users table from tweater: users
users<- dbReadTable(con, "users")
# Print users
users
## id name login
## 1 1 elisabeth elismith
## 2 2 mike mikey
## 3 3 thea teatime
## 4 4 thomas tomatotom
## 5 5 oliver olivander
## 6 6 kate katebenn
## 7 7 anjali lianja
Next to the users, we’re also interested in the tweats and comments tables. However, separate dbReadTable() calls for each and every one of the tables in your database would mean a lot of code duplication. Remember about the lapply() function? You can use it again here! A connection is already coded for you, as well as a vector table_names, containing the names of all the tables in the database.
# Get table names
table_names <- dbListTables(con)
# Import all tables
tables <- lapply(table_names, dbReadTable, conn = con)
# Print out tables
tables
## [[1]]
## id tweat_id user_id message
## 1 1022 87 7 nice!
## 2 1000 77 7 great!
## 3 1011 49 5 love it
## 4 1012 87 1 awesome! thanks!
## 5 1010 88 6 yuck!
## 6 1026 77 4 not my thing!
## 7 1004 49 1 this is fabulous!
## 8 1030 75 6 so easy!
## 9 1025 88 2 oh yes
## 10 1007 49 3 serious?
## 11 1020 77 1 couldn't be better
## 12 1014 77 1 saved my day
##
## [[2]]
## id user_id
## 1 75 3
## 2 88 4
## 3 77 6
## 4 87 5
## 5 49 1
## 6 24 7
## post
## 1 break egg. bake egg. eat egg.
## 2 wash strawberries. add ice. blend. enjoy.
## 3 2 slices of bread. add cheese. grill. heaven.
## 4 open and crush avocado. add shrimps. perfect starter.
## 5 nachos. add tomato sauce, minced meat and cheese. oven for 10 mins.
## 6 just eat an apple. simply and healthy.
## date
## 1 2015-09-05
## 2 2015-09-14
## 3 2015-09-21
## 4 2015-09-22
## 5 2015-09-22
## 6 2015-09-24
##
## [[3]]
## id name login
## 1 1 elisabeth elismith
## 2 2 mike mikey
## 3 3 thea teatime
## 4 4 thomas tomatotom
## 5 5 oliver olivander
## 6 6 kate katebenn
## 7 7 anjali lianja
mporting an entire table from a database while you might only need a tiny bit of information seems like a lot of unncessary work. In this chapter, you’ll learn about SQL queries, which will help you make things more efficient by performing some computations on the database side.
In your life as a data scientist, you’ll often be working with huge databases that contain tables with millions of rows. If you want to do some analyses on this data, it’s possible that you only need a fraction of this data. In this case, it’s a good idea to send SQL queries to your database, and only import the data you actually need into R.
dbGetQuery() is what you need. As usual, you first pass the connection object to it. The second argument is an SQL query in the form of a character string. This example selects the age variable from the people dataset where gender equals “male”:
dbGetQuery(con, “SELECT age FROM people WHERE gender = ‘male’”) A connection to the tweater database has already been coded for you.
con <- dbConnect(RMySQL::MySQL(),
dbname = "tweater",
host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
port = 3306,
user = "student",
password = "datacamp")
# Import tweat_id column of comments where user_id is 1: elisabeth
elisabeth <- dbGetQuery(con, "SELECT tweat_id FROM comments WHERE user_id = 1")
# Print elisabeth
elisabeth
## tweat_id
## 1 87
## 2 49
## 3 77
## 4 77
Apart from checking equality, you can also check for less than and greater than relationships, with < and >, just like in R. con, a connection to the tweater database, is again available.
# Import post column of tweats where date is higher than '2015-09-21': latest
latest <- dbGetQuery(con, "SELECT post FROM tweats WHERE date > '2015-09-21'")
# Print latest
latest
## post
## 1 open and crush avocado. add shrimps. perfect starter.
## 2 nachos. add tomato sauce, minced meat and cheese. oven for 10 mins.
## 3 just eat an apple. simply and healthy.
Suppose that you have a people table, with a bunch of information. This time, you want to find out the age and country of married males. Provided that there is a married column that’s 1 when the person in question is married, the following query would work.
SELECT age, country FROM people WHERE gender = “male” AND married = 1
# Create data frame specific
specific <- dbGetQuery(con, "SELECT message FROM comments WHERE tweat_id = 77 AND user_id > 4")
# Print specific
specific
## message
## 1 great!
There are also dedicated SQL functions that you can use in the WHERE clause of an SQL query. For example, CHAR_LENGTH() returns the number of characters in a string.
# Create data frame short
short <- dbGetQuery(con, "SELECT id, name FROM users WHERE CHAR_LENGTH(name) <5")
# Print short
short
## id name
## 1 2 mike
## 2 3 thea
## 3 6 kate
You’ve used dbGetQuery() multiple times now. This is a virtual function from the DBI package, but is actually implemented by the RMySQL package. Behind the scenes, the following steps are performed:
Let’s not use dbGetQuery() this time and implement the steps above. This is tedious to write, but it gives you the ability to fetch the query’s result in chunks rather than all at once. You can do this by specifying the n argument inside dbFetch().
con <- dbConnect(RMySQL::MySQL(),
dbname = "tweater",
host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
port = 3306,
user = "student",
password = "datacamp")
# Send query to the database
res <- dbSendQuery(con, "SELECT * FROM comments WHERE user_id > 4")
# Use dbFetch() twice
dbFetch(res, n = 2)
## id tweat_id user_id message
## 1 1022 87 7 nice!
## 2 1000 77 7 great!
# Clear res
dbClearResult(res)
## [1] TRUE
Every time you connect to a database using dbConnect(), you’re creating a new connection to the database you’re referencing. RMySQL automatically specifies a maximum of open connections and closes some of the connections for you, but still: it’s always polite to manually disconnect from the database afterwards. You do this with the dbDisconnect() function.
# Connect to the database
library(DBI)
con <- dbConnect(RMySQL::MySQL(),
dbname = "tweater",
host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
port = 3306,
user = "student",
password = "datacamp")
# Create the data frame long_tweats
long_tweats <- dbGetQuery(con, "SELECT post, date FROM tweats WHERE CHAR_LENGTH(post) > 40")
# Print long_tweats
long_tweats
## post
## 1 wash strawberries. add ice. blend. enjoy.
## 2 2 slices of bread. add cheese. grill. heaven.
## 3 open and crush avocado. add shrimps. perfect starter.
## 4 nachos. add tomato sauce, minced meat and cheese. oven for 10 mins.
## date
## 1 2015-09-14
## 2 2015-09-21
## 3 2015-09-22
## 4 2015-09-22
# Disconnect from the database
dbDisconnect(con)
## [1] TRUE
More and more of the information that data scientists are using resides on the web. Importing this data into R requires an understanding of the protocols used on the web. In this chapter, you’ll get a crash course in HTTP and learn to perform your own HTTP requests from inside R.
In the video, you saw that the utils functions to import flat file data, such as read.csv() and read.delim(), are capable of automatically importing from URLs that point to flat files on the web.
You must be wondering whether Hadley Wickham’s alternative package, readr, is equally potent. Well, figure it out in this exercise! The URLs for both a .csv file as well as a .delim file are already coded for you. It’s up to you to actually import the data. If it works, that is…
# Load the readr package
library(readr)
# Import the csv file: pools
url_csv <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/swimming_pools.csv"
pools <- read_csv(url_csv)
## Rows: 20 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Name, Address
## dbl (2): Latitude, Longitude
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Import the txt file: potatoes
url_delim <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/potatoes.txt"
potatoes <- read_tsv(url_delim)
## Rows: 160 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## dbl (8): area, temp, size, storage, method, texture, flavor, moistness
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Print pools and potatoes
pools
## # A tibble: 20 × 4
## Name Address Latitude Longitude
## <chr> <chr> <dbl> <dbl>
## 1 Acacia Ridge Leisure Centre 1391 Beaudesert… -27.6 153.
## 2 Bellbowrie Pool Sugarwood Stree… -27.6 153.
## 3 Carole Park Cnr Boundary Ro… -27.6 153.
## 4 Centenary Pool (inner City) 400 Gregory Ter… -27.5 153.
## 5 Chermside Pool 375 Hamilton Ro… -27.4 153.
## 6 Colmslie Pool (Morningside) 400 Lytton Road… -27.5 153.
## 7 Spring Hill Baths (inner City) 14 Torrington S… -27.5 153.
## 8 Dunlop Park Pool (Corinda) 794 Oxley Road,… -27.5 153.
## 9 Fortitude Valley Pool 432 Wickham Str… -27.5 153.
## 10 Hibiscus Sports Complex (upper MtGravatt) 90 Klumpp Road,… -27.6 153.
## 11 Ithaca Pool ( Paddington) 131 Caxton Stre… -27.5 153.
## 12 Jindalee Pool 11 Yallambee Ro… -27.5 153.
## 13 Manly Pool 1 Fairlead Cres… -27.5 153.
## 14 Mt Gravatt East Aquatic Centre Cnr wecker Road… -27.5 153.
## 15 Musgrave Park Pool (South Brisbane) 100 Edmonstone … -27.5 153.
## 16 Newmarket Pool 71 Alderson Str… -27.4 153.
## 17 Runcorn Pool 37 Bonemill Roa… -27.6 153.
## 18 Sandgate Pool 231 Flinders Pa… -27.3 153.
## 19 Langlands Parks Pool (Stones Corner) 5 Panitya Stree… -27.5 153.
## 20 Yeronga Park Pool 81 School Road,… -27.5 153.
potatoes
## # A tibble: 160 × 8
## area temp size storage method texture flavor moistness
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 1 1 1 2.9 3.2 3
## 2 1 1 1 1 2 2.3 2.5 2.6
## 3 1 1 1 1 3 2.5 2.8 2.8
## 4 1 1 1 1 4 2.1 2.9 2.4
## 5 1 1 1 1 5 1.9 2.8 2.2
## 6 1 1 1 2 1 1.8 3 1.7
## 7 1 1 1 2 2 2.6 3.1 2.4
## 8 1 1 1 2 3 3 3 2.9
## 9 1 1 1 2 4 2.2 3.2 2.5
## 10 1 1 1 2 5 2 2.8 1.9
## # … with 150 more rows
In the previous exercises, you have been working with URLs that all start with http://. There is, however, a safer alternative to HTTP, namely HTTPS, which stands for HyperText Transfer Protocol Secure. Just remember this: HTTPS is relatively safe, HTTP is not.
# https URL to the swimming_pools csv file.
url_csv <- "https://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/swimming_pools.csv"
# Import the file using read.csv(): pools1
pools1 <- read.csv(url_csv)
# Load the readr package
library(readr)
# Import the file using read_csv(): pools2
pools2 <- read_csv(url_csv)
## Rows: 20 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Name, Address
## dbl (2): Latitude, Longitude
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Print the structure of pools1 and pools2
str(pools1)
## 'data.frame': 20 obs. of 4 variables:
## $ Name : Factor w/ 20 levels "Acacia Ridge Leisure Centre",..: 1 2 3 4 5 6 19 7 8 9 ...
## $ Address : Factor w/ 20 levels "1 Fairlead Crescent, Manly",..: 5 20 18 10 9 11 6 15 12 17 ...
## $ Latitude : num -27.6 -27.6 -27.6 -27.5 -27.4 ...
## $ Longitude: num 153 153 153 153 153 ...
str(pools2)
## spec_tbl_df [20 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Name : chr [1:20] "Acacia Ridge Leisure Centre" "Bellbowrie Pool" "Carole Park" "Centenary Pool (inner City)" ...
## $ Address : chr [1:20] "1391 Beaudesert Road, Acacia Ridge" "Sugarwood Street, Bellbowrie" "Cnr Boundary Road and Waterford Road Wacol" "400 Gregory Terrace, Spring Hill" ...
## $ Latitude : num [1:20] -27.6 -27.6 -27.6 -27.5 -27.4 ...
## $ Longitude: num [1:20] 153 153 153 153 153 ...
## - attr(*, "spec")=
## .. cols(
## .. Name = col_character(),
## .. Address = col_character(),
## .. Latitude = col_double(),
## .. Longitude = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
When you learned about gdata, it was already mentioned that gdata can handle .xls files that are on the internet. readxl can’t, at least not yet. The URL with which you’ll be working is already available in the sample code. You will import it once using gdata and once with the readxl package via a workaround.
# Load the readxl and gdata package
library(readxl)
library(gdata)
# Specification of url: url_xls
url_xls <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/latitude.xls"
# Import the .xls file with gdata: excel_gdata
excel_gdata <- read.xls(url_xls)
# Download file behind URL, name it local_latitude.xls
download.file(url_xls, destfile = "/Users/arifmathiq/R Data/local_latitude.xls", quiet = TRUE, mode = "wb")
getwd()
## [1] "/Users/arifmathiq/R Data"
# Import the local .xls file with readxl: excel_readxl
excel_readxl <- read_excel("local_latitude.xls")
In the previous exercise you’ve seen how you can read excel files on the web using the read_excel package by first downloading the file with the download.file() function.
There’s more: with download.file() you can download any kind of file from the web, using HTTP and HTTPS: images, executable files, but also .RData files. An RData file is very efficient format to store R data.
You can load data from an RData file using the load() function, but this function does not accept a URL string as an argument. In this exercise, you’ll first download the RData file securely, and then import the local data file.
# https URL to the wine RData file.
url_rdata <- "https://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/wine.RData"
# Download the wine file to your working directory
download.file(url_rdata, destfile = "wine_local.RData")
# Load the wine data into your workspace using load()
load("wine_local.RData")
# Print out the summary of the wine data
summary(wine)
## Alcohol Malic acid Ash Alcalinity of ash
## Min. :11.03 Min. :0.74 Min. :1.360 Min. :10.60
## 1st Qu.:12.36 1st Qu.:1.60 1st Qu.:2.210 1st Qu.:17.20
## Median :13.05 Median :1.87 Median :2.360 Median :19.50
## Mean :12.99 Mean :2.34 Mean :2.366 Mean :19.52
## 3rd Qu.:13.67 3rd Qu.:3.10 3rd Qu.:2.560 3rd Qu.:21.50
## Max. :14.83 Max. :5.80 Max. :3.230 Max. :30.00
## Magnesium Total phenols Flavanoids Nonflavanoid phenols
## Min. : 70.00 Min. :0.980 Min. :0.340 Min. :0.1300
## 1st Qu.: 88.00 1st Qu.:1.740 1st Qu.:1.200 1st Qu.:0.2700
## Median : 98.00 Median :2.350 Median :2.130 Median :0.3400
## Mean : 99.59 Mean :2.292 Mean :2.023 Mean :0.3623
## 3rd Qu.:107.00 3rd Qu.:2.800 3rd Qu.:2.860 3rd Qu.:0.4400
## Max. :162.00 Max. :3.880 Max. :5.080 Max. :0.6600
## Proanthocyanins Color intensity Hue Proline
## Min. :0.410 Min. : 1.280 Min. :1.270 Min. : 278.0
## 1st Qu.:1.250 1st Qu.: 3.210 1st Qu.:1.930 1st Qu.: 500.0
## Median :1.550 Median : 4.680 Median :2.780 Median : 672.0
## Mean :1.587 Mean : 5.055 Mean :2.604 Mean : 745.1
## 3rd Qu.:1.950 3rd Qu.: 6.200 3rd Qu.:3.170 3rd Qu.: 985.0
## Max. :3.580 Max. :13.000 Max. :4.000 Max. :1680.0
Downloading a file from the Internet means sending a GET request and receiving the file you asked for. Internally, all the previously discussed functions use a GET request to download files.
httr provides a convenient function, GET() to execute this GET request. The result is a response object, that provides easy access to the status code, content-type and, of course, the actual content.
You can extract the content from the request using the content() function. At the time of writing, there are three ways to retrieve this content: as a raw object, as a character vector, or an R object, such as a list. If you don’t tell content() how to retrieve the content through the as argument, it’ll try its best to figure out which type is most appropriate based on the content-type.
library(httr)
# Get the url, save response to resp
url <- "http://www.example.com/"
resp <- GET(url)
# Print resp
resp
## Response [http://www.example.com/]
## Date: 2022-04-03 13:28
## Status: 200
## Content-Type: text/html; charset=UTF-8
## Size: 1.26 kB
## <!doctype html>
## <html>
## <head>
## <title>Example Domain</title>
##
## <meta charset="utf-8" />
## <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
## <meta name="viewport" content="width=device-width, initial-scale=1" />
## <style type="text/css">
## body {
## ...
# Get the raw content of resp: raw_content
raw_content <- content(resp, as = "raw")
# Print the head of raw_content
head(raw_content)
## [1] 3c 21 64 6f 63 74
Web content does not limit itself to HTML pages and files stored on remote servers such as DataCamp’s Amazon S3 instances. There are many other data formats out there. A very common one is JSON. This format is very often used by so-called Web APIs, interfaces to web servers with which you as a client can communicate to get or store information in more complicated ways.
# Get the url
url <- "http://www.omdbapi.com/?apikey=72bc447a&t=Annie+Hall&y=&plot=short&r=json"
resp <- GET(url)
# Print resp
resp
## Response [http://www.omdbapi.com/?apikey=72bc447a&t=Annie+Hall&y=&plot=short&r=json]
## Date: 2022-04-03 13:28
## Status: 200
## Content-Type: application/json; charset=utf-8
## Size: 1.02 kB
# Print content of resp as text
content(resp, as = "text")
## [1] "{\"Title\":\"Annie Hall\",\"Year\":\"1977\",\"Rated\":\"PG\",\"Released\":\"20 Apr 1977\",\"Runtime\":\"93 min\",\"Genre\":\"Comedy, Romance\",\"Director\":\"Woody Allen\",\"Writer\":\"Woody Allen, Marshall Brickman\",\"Actors\":\"Woody Allen, Diane Keaton, Tony Roberts\",\"Plot\":\"Alvy Singer, a divorced Jewish comedian, reflects on his relationship with ex-lover Annie Hall, an aspiring nightclub singer, which ended abruptly just like his previous marriages.\",\"Language\":\"English, German\",\"Country\":\"United States\",\"Awards\":\"Won 4 Oscars. 31 wins & 8 nominations total\",\"Poster\":\"https://m.media-amazon.com/images/M/MV5BZDg1OGQ4YzgtM2Y2NS00NjA3LWFjYTctMDRlMDI3NWE1OTUyXkEyXkFqcGdeQXVyMjUzOTY1NTc@._V1_SX300.jpg\",\"Ratings\":[{\"Source\":\"Internet Movie Database\",\"Value\":\"8.0/10\"},{\"Source\":\"Rotten Tomatoes\",\"Value\":\"96%\"},{\"Source\":\"Metacritic\",\"Value\":\"92/100\"}],\"Metascore\":\"92\",\"imdbRating\":\"8.0\",\"imdbVotes\":\"261,852\",\"imdbID\":\"tt0075686\",\"Type\":\"movie\",\"DVD\":\"05 Jul 2000\",\"BoxOffice\":\"$38,251,425\",\"Production\":\"N/A\",\"Website\":\"N/A\",\"Response\":\"True\"}"
# Print content of resp
content(resp)
## $Title
## [1] "Annie Hall"
##
## $Year
## [1] "1977"
##
## $Rated
## [1] "PG"
##
## $Released
## [1] "20 Apr 1977"
##
## $Runtime
## [1] "93 min"
##
## $Genre
## [1] "Comedy, Romance"
##
## $Director
## [1] "Woody Allen"
##
## $Writer
## [1] "Woody Allen, Marshall Brickman"
##
## $Actors
## [1] "Woody Allen, Diane Keaton, Tony Roberts"
##
## $Plot
## [1] "Alvy Singer, a divorced Jewish comedian, reflects on his relationship with ex-lover Annie Hall, an aspiring nightclub singer, which ended abruptly just like his previous marriages."
##
## $Language
## [1] "English, German"
##
## $Country
## [1] "United States"
##
## $Awards
## [1] "Won 4 Oscars. 31 wins & 8 nominations total"
##
## $Poster
## [1] "https://m.media-amazon.com/images/M/MV5BZDg1OGQ4YzgtM2Y2NS00NjA3LWFjYTctMDRlMDI3NWE1OTUyXkEyXkFqcGdeQXVyMjUzOTY1NTc@._V1_SX300.jpg"
##
## $Ratings
## $Ratings[[1]]
## $Ratings[[1]]$Source
## [1] "Internet Movie Database"
##
## $Ratings[[1]]$Value
## [1] "8.0/10"
##
##
## $Ratings[[2]]
## $Ratings[[2]]$Source
## [1] "Rotten Tomatoes"
##
## $Ratings[[2]]$Value
## [1] "96%"
##
##
## $Ratings[[3]]
## $Ratings[[3]]$Source
## [1] "Metacritic"
##
## $Ratings[[3]]$Value
## [1] "92/100"
##
##
##
## $Metascore
## [1] "92"
##
## $imdbRating
## [1] "8.0"
##
## $imdbVotes
## [1] "261,852"
##
## $imdbID
## [1] "tt0075686"
##
## $Type
## [1] "movie"
##
## $DVD
## [1] "05 Jul 2000"
##
## $BoxOffice
## [1] "$38,251,425"
##
## $Production
## [1] "N/A"
##
## $Website
## [1] "N/A"
##
## $Response
## [1] "True"
Importing data from the web is one thing; actually being able to extract useful information is another. Learn more about the JSON format to get one step closer to web domination.
In the simplest setting, fromJSON() can convert character strings that represent JSON data into a nicely structured R list. Give it a try!
# Load the jsonlite package
library(jsonlite)
##
## Attaching package: 'jsonlite'
## The following object is masked from 'package:purrr':
##
## flatten
# wine_json is a JSON
wine_json <- '{"name":"Chateau Migraine", "year":1997, "alcohol_pct":12.4, "color":"red", "awarded":false}'
# Convert wine_json into a list: wine
wine <- fromJSON(wine_json)
# Print structure of wine
str(wine)
## List of 5
## $ name : chr "Chateau Migraine"
## $ year : int 1997
## $ alcohol_pct: num 12.4
## $ color : chr "red"
## $ awarded : logi FALSE
As Filip showed in the video, fromJSON() also works if you pass a URL as a character string or the path to a local file that contains JSON data. Let’s try this out on the Quandl API, where you can fetch all sorts of financial and economical data.
# Definition of quandl_url
quandl_url <- "https://www.quandl.com/api/v3/datasets/WIKI/FB/data.json?auth_token=i83asDsiWUUyfoypkgMz"
# Import Quandl data: quandl_data
quandl_data <- fromJSON(quandl_url)
# Print structure of quandl_data
str(quandl_data)
## List of 1
## $ dataset_data:List of 10
## ..$ limit : NULL
## ..$ transform : NULL
## ..$ column_index: NULL
## ..$ column_names: chr [1:13] "Date" "Open" "High" "Low" ...
## ..$ start_date : chr "2012-05-18"
## ..$ end_date : chr "2018-03-27"
## ..$ frequency : chr "daily"
## ..$ data : chr [1:1472, 1:13] "2018-03-27" "2018-03-26" "2018-03-23" "2018-03-22" ...
## ..$ collapse : NULL
## ..$ order : NULL
In the video, you saw how easy it is to interact with an API once you know how to formulate requests. You also saw how to fetch all information on Rain Man from OMDb. Simply perform a GET() call, and next ask for the contents with the content() function. This content() function, which is part of the httr package, uses jsonlite behind the scenes to import the JSON data into R.
However, by now you also know that jsonlite can handle URLs itself. Simply passing the request URL to fromJSON() will get your data into R. In this exercise, you will be using this technique to compare the release year of two movies in the Open Movie Database.
# Definition of the URLs
url_sw4 <- "http://www.omdbapi.com/?apikey=72bc447a&i=tt0076759&r=json"
url_sw3 <- "http://www.omdbapi.com/?apikey=72bc447a&i=tt0121766&r=json"
# Import two URLs with fromJSON(): sw4 and sw3
sw4 <- fromJSON(url_sw4)
sw3 <- fromJSON(url_sw3)
# Print out the Title element of both lists
sw4
## $Title
## [1] "Star Wars"
##
## $Year
## [1] "1977"
##
## $Rated
## [1] "PG"
##
## $Released
## [1] "25 May 1977"
##
## $Runtime
## [1] "121 min"
##
## $Genre
## [1] "Action, Adventure, Fantasy"
##
## $Director
## [1] "George Lucas"
##
## $Writer
## [1] "George Lucas"
##
## $Actors
## [1] "Mark Hamill, Harrison Ford, Carrie Fisher"
##
## $Plot
## [1] "Luke Skywalker joins forces with a Jedi Knight, a cocky pilot, a Wookiee and two droids to save the galaxy from the Empire's world-destroying battle station, while also attempting to rescue Princess Leia from the mysterious Darth Vad"
##
## $Language
## [1] "English"
##
## $Country
## [1] "United States"
##
## $Awards
## [1] "Won 7 Oscars. 63 wins & 29 nominations total"
##
## $Poster
## [1] "https://m.media-amazon.com/images/M/MV5BNzVlY2MwMjktM2E4OS00Y2Y3LWE3ZjctYzhkZGM3YzA1ZWM2XkEyXkFqcGdeQXVyNzkwMjQ5NzM@._V1_SX300.jpg"
##
## $Ratings
## Source Value
## 1 Internet Movie Database 8.6/10
## 2 Rotten Tomatoes 92%
## 3 Metacritic 90/100
##
## $Metascore
## [1] "90"
##
## $imdbRating
## [1] "8.6"
##
## $imdbVotes
## [1] "1,312,386"
##
## $imdbID
## [1] "tt0076759"
##
## $Type
## [1] "movie"
##
## $DVD
## [1] "06 Dec 2005"
##
## $BoxOffice
## [1] "$460,998,507"
##
## $Production
## [1] "N/A"
##
## $Website
## [1] "N/A"
##
## $Response
## [1] "True"
sw3
## $Title
## [1] "Star Wars: Episode III - Revenge of the Sith"
##
## $Year
## [1] "2005"
##
## $Rated
## [1] "PG-13"
##
## $Released
## [1] "19 May 2005"
##
## $Runtime
## [1] "140 min"
##
## $Genre
## [1] "Action, Adventure, Fantasy"
##
## $Director
## [1] "George Lucas"
##
## $Writer
## [1] "George Lucas, John Ostrander, Jan Duursema"
##
## $Actors
## [1] "Hayden Christensen, Natalie Portman, Ewan McGregor"
##
## $Plot
## [1] "Three years into the Clone Wars, the Jedi rescue Palpatine from Count Dooku. As Obi-Wan pursues a new threat, Anakin acts as a double agent between the Jedi Council and Palpatine and is lured into a sinister plan to rule the galaxy."
##
## $Language
## [1] "English"
##
## $Country
## [1] "United States"
##
## $Awards
## [1] "Nominated for 1 Oscar. 26 wins & 63 nominations total"
##
## $Poster
## [1] "https://m.media-amazon.com/images/M/MV5BNTc4MTc3NTQ5OF5BMl5BanBnXkFtZTcwOTg0NjI4NA@@._V1_SX300.jpg"
##
## $Ratings
## Source Value
## 1 Internet Movie Database 7.6/10
## 2 Rotten Tomatoes 80%
## 3 Metacritic 68/100
##
## $Metascore
## [1] "68"
##
## $imdbRating
## [1] "7.6"
##
## $imdbVotes
## [1] "758,447"
##
## $imdbID
## [1] "tt0121766"
##
## $Type
## [1] "movie"
##
## $DVD
## [1] "01 Nov 2005"
##
## $BoxOffice
## [1] "$380,270,577"
##
## $Production
## [1] "N/A"
##
## $Website
## [1] "N/A"
##
## $Response
## [1] "True"
# Is the release year of sw4 later than sw3?
sw4$Year > sw3$Year
## [1] FALSE
JSON is built on two structures: objects and arrays. To help you experiment with these, two JSON strings are included in the sample code. It’s up to you to change them appropriately and then call jsonlite’s fromJSON() function on them each time.
# Challenge 1
json1 <- '[1, 2, 3, 4, 5, 6]'
fromJSON(json1)
## [1] 1 2 3 4 5 6
# Challenge 2
json2 <- '{"a": [1, 2, 3], "b": [4, 5, 6]}'
fromJSON(json2)
## $a
## [1] 1 2 3
##
## $b
## [1] 4 5 6
We prepared two more JSON strings in the sample code. Can you change them and call jsonlite’s fromJSON() function on them, similar to the previous exercise?
# Challenge 1
json1 <- '[[1, 2], [3, 4], [5, 6]]'
# Challenge 2
json2 <- '[{"a": 1, "b": 2}, {"a": 3, "b": 4}]'
Apart from converting JSON to R with fromJSON(), you can also use toJSON() to convert R data to a JSON format. In its most basic use, you simply pass this function an R object to convert to a JSON. The result is an R object of the class json, which is basically a character string representing that JSON.
For this exercise, you will be working with a .csv file containing information on the amount of desalinated water that is produced around the world. As you’ll see, it contains a lot of missing values. This data can be found on the URL that is specified in the sample code.
# URL pointing to the .csv file
url_csv <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/water.csv"
# Import the .csv file located at url_csv
water <- read.csv(url_csv, stringsAsFactors = FALSE)
# Convert the data file according to the requirements
water_json <- toJSON(water)
# Print out water_json
water_json
## [{"water":"Algeria","X1992":0.064,"X2002":0.017},{"water":"American Samoa"},{"water":"Angola","X1992":0.0001,"X2002":0.0001},{"water":"Antigua and Barbuda","X1992":0.0033},{"water":"Argentina","X1992":0.0007,"X1997":0.0007,"X2002":0.0007},{"water":"Australia","X1992":0.0298,"X2002":0.0298},{"water":"Austria","X1992":0.0022,"X2002":0.0022},{"water":"Bahamas","X1992":0.0013,"X2002":0.0074},{"water":"Bahrain","X1992":0.0441,"X2002":0.0441,"X2007":0.1024},{"water":"Barbados","X2007":0.0146},{"water":"British Virgin Islands","X2007":0.0042},{"water":"Canada","X1992":0.0027,"X2002":0.0027},{"water":"Cape Verde","X1992":0.002,"X1997":0.0017},{"water":"Cayman Islands","X1992":0.0033},{"water":"Central African Rep."},{"water":"Chile","X1992":0.0048,"X2002":0.0048},{"water":"Colombia","X1992":0.0027,"X2002":0.0027},{"water":"Cuba","X1992":0.0069,"X1997":0.0069,"X2002":0.0069},{"water":"Cyprus","X1992":0.003,"X1997":0.003,"X2002":0.0335},{"water":"Czech Rep.","X1992":0.0002,"X2002":0.0002},{"water":"Denmark","X1992":0.015,"X2002":0.015},{"water":"Djibouti","X1992":0.0001,"X2002":0.0001},{"water":"Ecuador","X1992":0.0022,"X1997":0.0022,"X2002":0.0022},{"water":"Egypt","X1992":0.025,"X1997":0.025,"X2002":0.1},{"water":"El Salvador","X1992":0.0001,"X2002":0.0001},{"water":"Finland","X1992":0.0001,"X2002":0.0001},{"water":"France","X1992":0.0117,"X2002":0.0117},{"water":"Gibraltar","X1992":0.0077},{"water":"Greece","X1992":0.01,"X2002":0.01},{"water":"Honduras","X1992":0.0002,"X2002":0.0002},{"water":"Hungary","X1992":0.0002,"X2002":0.0002},{"water":"India","X1997":0.0005,"X2002":0.0005},{"water":"Indonesia","X1992":0.0187,"X2002":0.0187},{"water":"Iran","X1992":0.003,"X1997":0.003,"X2002":0.003,"X2007":0.2},{"water":"Iraq","X1997":0.0074,"X2002":0.0074},{"water":"Ireland","X1992":0.0002,"X2002":0.0002},{"water":"Israel","X1992":0.0256,"X2002":0.0256,"X2007":0.14},{"water":"Italy","X1992":0.0973,"X2002":0.0973},{"water":"Jamaica","X1992":0.0005,"X1997":0.0005,"X2002":0.0005},{"water":"Japan","X1997":0.04,"X2002":0.04},{"water":"Jordan","X1997":0.002,"X2007":0.0098},{"water":"Kazakhstan","X1997":1.328,"X2002":1.328},{"water":"Kuwait","X1992":0.507,"X1997":0.231,"X2002":0.4202},{"water":"Lebanon","X2007":0.0473},{"water":"Libya","X2002":0.018},{"water":"Malaysia","X1992":0.0043,"X2002":0.0043},{"water":"Maldives","X1992":0.0004},{"water":"Malta","X1992":0.024,"X1997":0.031,"X2002":0.031},{"water":"Marshall Islands","X1992":0.0007},{"water":"Mauritania","X1992":0.002,"X2002":0.002},{"water":"Mexico","X1992":0.0307,"X2002":0.0307},{"water":"Morocco","X1992":0.0034,"X1997":0.0034,"X2002":0.007},{"water":"Namibia","X1992":0.0003,"X2002":0.0003},{"water":"Netherlands Antilles","X1992":0.063},{"water":"Nicaragua","X1992":0.0002,"X2002":0.0002},{"water":"Nigeria","X1992":0.003,"X2002":0.003},{"water":"Norway","X1992":0.0001,"X2002":0.0001},{"water":"Oman","X1997":0.034,"X2002":0.034,"X2007":0.109},{"water":"Peru","X1992":0.0054,"X2002":0.0054},{"water":"Poland","X1992":0.007,"X2002":0.007},{"water":"Portugal","X1992":0.0016,"X2002":0.0016},{"water":"Qatar","X1992":0.065,"X1997":0.099,"X2002":0.099,"X2007":0.18},{"water":"Saudi Arabia","X1992":0.683,"X1997":0.727,"X2002":0.863,"X2007":1.033},{"water":"Senegal","X1992":0,"X2002":0},{"water":"Somalia","X1992":0.0001,"X2002":0.0001},{"water":"South Africa","X1992":0.018,"X2002":0.018},{"water":"Spain","X1992":0.1002,"X2002":0.1002},{"water":"Sudan","X1992":0.0004,"X1997":0.0004,"X2002":0.0004},{"water":"Sweden","X1992":0.0002,"X2002":0.0002},{"water":"Trinidad and Tobago","X2007":0.036},{"water":"Tunisia","X1992":0.008,"X2002":0.013},{"water":"Turkey","X1992":0.0005,"X2002":0.0005,"X2007":0.0005},{"water":"United Arab Emirates","X1992":0.163,"X1997":0.385,"X2007":0.95},{"water":"United Kingdom","X1992":0.0333,"X2002":0.0333},{"water":"United States","X1992":0.58,"X2002":0.58},{"water":"Venezuela","X1992":0.0052,"X2002":0.0052},{"water":"Yemen, Rep.","X1992":0.01,"X2002":0.01}]
Next to R, there are also other commonly used statistical software packages: SAS, STATA and SPSS. Each of them has their own file format. Learn how to use the haven and foreign packages to get them into R with remarkable ease!
haven is an extremely easy-to-use package to import data from three software packages: SAS, STATA and SPSS. Depending on the software, you use different functions:
# Load the haven package
library(haven)
# Import sales.sas7bdat: sales
sales <- read_sas("sales.sas7bdat")
# Display the structure of sales
str(sales)
## tibble [431 × 4] (S3: tbl_df/tbl/data.frame)
## $ purchase: num [1:431] 0 0 1 1 0 0 0 0 0 0 ...
## $ age : num [1:431] 41 47 41 39 32 32 33 45 43 40 ...
## $ gender : chr [1:431] "Female" "Female" "Female" "Female" ...
## $ income : chr [1:431] "Low" "Low" "Low" "Low" ...
You’ll be working with data on the age, gender, income, and purchase level (0 = low, 1 = high) of 36 individuals (Source: SAS). The information is stored in a SAS file, sales.sas7bdat, which is available in your current working directory. You can also download the data here.
Next up are STATA data files; you can use read_dta() for these.
When inspecting the result of the read_dta() call, you will notice that one column will be imported as a labelled vector, an R equivalent for the common data structure in other statistical environments. In order to effectively continue working on the data in R, it’s best to change this data into a standard R class. To convert a variable of the class labelled to a factor, you’ll need haven’s as_factor() function.
In this exercise, you will work with data on yearly import and export numbers of sugar, both in USD and in weight. The data can be found at: http://assets.datacamp.com/production/course_1478/datasets/trade.dta
# Import the data from the URL: sugar
sugar <- read_dta("http://assets.datacamp.com/production/course_1478/datasets/trade.dta")
# Structure of sugar
str(sugar)
## tibble [10 × 5] (S3: tbl_df/tbl/data.frame)
## $ Date : dbl+lbl [1:10] 10, 9, 8, 7, 6, 5, 4, 3, 2, 1
## ..@ label : chr "Date"
## ..@ format.stata: chr "%9.0g"
## ..@ labels : Named num [1:10] 1 2 3 4 5 6 7 8 9 10
## .. ..- attr(*, "names")= chr [1:10] "2004-12-31" "2005-12-31" "2006-12-31" "2007-12-31" ...
## $ Import : num [1:10] 37664782 16316512 11082246 35677943 9879878 ...
## ..- attr(*, "label")= chr "Import"
## ..- attr(*, "format.stata")= chr "%9.0g"
## $ Weight_I: num [1:10] 54029106 21584365 14526089 55034932 14806865 ...
## ..- attr(*, "label")= chr "Weight_I"
## ..- attr(*, "format.stata")= chr "%9.0g"
## $ Export : num [1:10] 5.45e+07 1.03e+08 3.79e+07 4.85e+07 7.15e+07 ...
## ..- attr(*, "label")= chr "Export"
## ..- attr(*, "format.stata")= chr "%9.0g"
## $ Weight_E: num [1:10] 9.34e+07 1.58e+08 8.80e+07 1.12e+08 1.32e+08 ...
## ..- attr(*, "label")= chr "Weight_E"
## ..- attr(*, "format.stata")= chr "%9.0g"
## - attr(*, "label")= chr "Written by R."
# Convert values in Date column to dates
sugar$Date <- as.Date(as_factor(sugar$Date))
# Structure of sugar again
str(sugar)
## tibble [10 × 5] (S3: tbl_df/tbl/data.frame)
## $ Date : Date[1:10], format: "2013-12-31" "2012-12-31" ...
## $ Import : num [1:10] 37664782 16316512 11082246 35677943 9879878 ...
## ..- attr(*, "label")= chr "Import"
## ..- attr(*, "format.stata")= chr "%9.0g"
## $ Weight_I: num [1:10] 54029106 21584365 14526089 55034932 14806865 ...
## ..- attr(*, "label")= chr "Weight_I"
## ..- attr(*, "format.stata")= chr "%9.0g"
## $ Export : num [1:10] 5.45e+07 1.03e+08 3.79e+07 4.85e+07 7.15e+07 ...
## ..- attr(*, "label")= chr "Export"
## ..- attr(*, "format.stata")= chr "%9.0g"
## $ Weight_E: num [1:10] 9.34e+07 1.58e+08 8.80e+07 1.12e+08 1.32e+08 ...
## ..- attr(*, "label")= chr "Weight_E"
## ..- attr(*, "format.stata")= chr "%9.0g"
## - attr(*, "label")= chr "Written by R."
The haven package can also import data files from SPSS. Again, importing the data is pretty straightforward. Depending on the SPSS data file you’re working with, you’ll need either read_sav() - for .sav files - or read_por() - for .por files.
In this exercise, you will work with data on four of the Big Five personality traits for 434 persons (Source: University of Bath). The Big Five is a psychological concept including, originally, five dimensions of personality to classify human personality. The SPSS dataset is called person.sav and is available in your working directory.
# Import person.sav: traits
traits <- read_sav("person.sav")
# Summarize traits
summary(traits)
## Neurotic Extroversion Agreeableness Conscientiousness
## Min. : 0.00 Min. : 5.00 Min. :15.00 Min. : 7.00
## 1st Qu.:18.00 1st Qu.:26.00 1st Qu.:39.00 1st Qu.:25.00
## Median :24.00 Median :31.00 Median :45.00 Median :30.00
## Mean :23.63 Mean :30.23 Mean :44.55 Mean :30.85
## 3rd Qu.:29.00 3rd Qu.:34.00 3rd Qu.:50.00 3rd Qu.:36.00
## Max. :44.00 Max. :65.00 Max. :73.00 Max. :58.00
## NA's :14 NA's :16 NA's :19 NA's :14
# Print out a subset
subset(traits, Extroversion > 40 & Agreeableness > 40)
## # A tibble: 8 × 4
## Neurotic Extroversion Agreeableness Conscientiousness
## <dbl> <dbl> <dbl> <dbl>
## 1 38 43 49 29
## 2 20 42 46 31
## 3 18 42 49 31
## 4 42 43 44 29
## 5 30 42 51 24
## 6 18 42 50 25
## 7 27 45 55 23
## 8 18 43 57 34
In the last exercise you learned how to import a data file using the command read_sav(). With SPSS data files, it can also happen that some of the variables you import have the labelled class. This is done to keep all the labelling information that was originally present in the .sav and .por files. It’s advised to coerce (or change) these variables to factors or other standard R classes.
The data for this exercise involves information on employees and their demographic and economic attributes (Source: QRiE). The data can be found on the following URL:
http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/employee.sav
# Import SPSS data from the URL: work
work <- read_sav("http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/employee.sav")
# Display summary of work$GENDER
summary(work$GENDER)
## Length Class Mode
## 474 character character
# Convert work$GENDER to a factor
work$GENDER <- as_factor(work$GENDER)
# Display summary of work$GENDER again
summary(work$GENDER)
## Female Male
## 216 258
The foreign package offers a simple function to import and read STATA data: read.dta().
In this exercise, you will import data on the US presidential elections in the year 2000. The data in florida.dta contains the total numbers of votes for each of the four candidates as well as the total number of votes per election area in the state of Florida (Source: Florida Department of State). The file is available in your working directory, you can download it here if you want to experiment some more.
# Load the foreign package
library(foreign)
# Import florida.dta and name the resulting data frame florida
florida <- read.dta("florida.dta")
# Check tail() of florida
tail(florida)
## gore bush buchanan nader total
## 62 2647 4051 27 59 6784
## 63 1399 2326 26 29 3780
## 64 97063 82214 396 2436 182109
## 65 3835 4511 46 149 8541
## 66 5637 12176 120 265 18198
## 67 2796 4983 88 93 7960
Data can be very diverse, going from character vectors to categorical variables, dates and more. It’s in these cases that the additional arguments of read.dta() will come in handy.
The arguments you will use most often are convert.dates, convert.factors, missing.type and convert.underscore. Their meaning is pretty straightforward, as Filip explained in the video. It’s all about correctly converting STATA data to standard R data structures. Type ?read.dta to find out about about the default values.
The dataset for this exercise contains socio-economic measures and access to education for different individuals (Source: World Bank). This data is available as edequality.dta, which is located in the worldbank folder in your working directory.
# Create and print structure of edu_equal_1
library(readstata13)
edu_equal_1 <- read.dta("edequality.dta")
str(edu_equal_1)
## 'data.frame': 12214 obs. of 27 variables:
## $ hhid : num 1 1 1 2 2 3 4 4 5 6 ...
## $ hhweight : num 627 627 627 627 627 ...
## $ location : Factor w/ 2 levels "urban location",..: 1 1 1 1 1 2 2 2 1 1 ...
## $ region : Factor w/ 9 levels "Sofia city","Bourgass",..: 8 8 8 9 9 4 4 4 8 8 ...
## $ ethnicity_head : Factor w/ 4 levels "Bulgaria","Turks",..: 2 2 2 1 1 1 1 1 1 1 ...
## $ age : num 37 11 8 73 70 75 79 80 82 83 ...
## $ gender : Factor w/ 2 levels "male","female": 2 2 1 1 2 1 1 2 2 2 ...
## $ relation : Factor w/ 9 levels "head ",..: 1 3 3 1 2 1 1 2 1 1 ...
## $ literate : Factor w/ 2 levels "no","yes": 1 2 2 2 2 2 2 2 2 2 ...
## $ income_mnt : num 13.3 13.3 13.3 142.5 142.5 ...
## $ income : num 160 160 160 1710 1710 ...
## $ aggregate : num 1042 1042 1042 3271 3271 ...
## $ aggr_ind_annual : num 347 347 347 1635 1635 ...
## $ educ_completed : int 2 4 4 4 3 3 3 3 4 4 ...
## $ grade_complete : num 4 3 0 3 4 4 4 4 5 5 ...
## $ grade_all : num 4 11 8 11 8 8 8 8 13 13 ...
## $ unemployed : int 2 1 1 1 1 1 1 1 1 1 ...
## $ reason_OLF : int NA NA NA 3 3 3 9 9 3 3 ...
## $ sector : int NA NA NA NA NA NA 1 1 NA NA ...
## $ occupation : int NA NA NA NA NA NA 5 5 NA NA ...
## $ earn_mont : num 0 0 0 0 0 0 20 20 0 0 ...
## $ earn_ann : num 0 0 0 0 0 0 240 240 0 0 ...
## $ hours_week : num NA NA NA NA NA NA 30 35 NA NA ...
## $ hours_mnt : num NA NA NA NA NA ...
## $ fulltime : int NA NA NA NA NA NA 1 1 NA NA ...
## $ hhexp : num 100 100 100 343 343 ...
## $ legacy_pension_amt: num NA NA NA NA NA NA NA NA NA NA ...
## - attr(*, "datalabel")= chr ""
## - attr(*, "time.stamp")= chr ""
## - attr(*, "formats")= chr "%9.0g" "%9.0g" "%9.0g" "%9.0g" ...
## - attr(*, "types")= int 100 100 108 108 108 100 108 108 108 100 ...
## - attr(*, "val.labels")= chr "" "" "location" "region" ...
## - attr(*, "var.labels")= chr "hhid" "hhweight" "location" "region" ...
## - attr(*, "expansion.fields")=List of 12
## ..$ : chr "_dta" "_svy_su1" "cluster"
## ..$ : chr "_dta" "_svy_strata1" "strata"
## ..$ : chr "_dta" "_svy_stages" "1"
## ..$ : chr "_dta" "_svy_version" "2"
## ..$ : chr "_dta" "__XijVarLabcons" "(sum) cons"
## ..$ : chr "_dta" "ReS_Xij" "cons"
## ..$ : chr "_dta" "ReS_str" "0"
## ..$ : chr "_dta" "ReS_j" "group"
## ..$ : chr "_dta" "ReS_ver" "v.2"
## ..$ : chr "_dta" "ReS_i" "hhid dur"
## ..$ : chr "_dta" "note1" "variables g1pc, g2pc, g3pc, g4pc, g5pc, g7pc, g8pc, g9pc, g10pc, g11pc, g12pc, gall, health, rent, durables we"| __truncated__
## ..$ : chr "_dta" "note0" "1"
## - attr(*, "version")= int 7
## - attr(*, "label.table")=List of 12
## ..$ location: Named int 1 2
## .. ..- attr(*, "names")= chr "urban location" "rural location"
## ..$ region : Named int 1 2 3 4 5 6 7 8 9
## .. ..- attr(*, "names")= chr "Sofia city" "Bourgass" "Varna" "Lovetch" ...
## ..$ ethnic : Named int 1 2 3 4
## .. ..- attr(*, "names")= chr "Bulgaria" "Turks" "Roma" "Other"
## ..$ s2_q2 : Named int 1 2
## .. ..- attr(*, "names")= chr "male" "female"
## ..$ s2_q3 : Named int 1 2 3 4 5 6 7 8 9
## .. ..- attr(*, "names")= chr "head " "spouse/partner " "child " "son/daughter-in-law " ...
## ..$ lit : Named int 1 2
## .. ..- attr(*, "names")= chr "no" "yes"
## ..$ : Named int 1 2 3 4
## .. ..- attr(*, "names")= chr "never attanded" "primary" "secondary" "postsecondary"
## ..$ : Named int 1 2
## .. ..- attr(*, "names")= chr "Not unemployed" "Unemployed"
## ..$ : Named int 1 2 3 4 5 6 7 8 9 10
## .. ..- attr(*, "names")= chr "student" "housewife/childcare" "in retirement" "illness, disability" ...
## ..$ : Named int 1 2 3 4 5 6 7 8 9 10
## .. ..- attr(*, "names")= chr "agriculture" "mining" "manufacturing" "utilities" ...
## ..$ : Named int 1 2 3 4 5
## .. ..- attr(*, "names")= chr "private company" "public works program" "government,public sector, army" "private individual" ...
## ..$ : Named int 1 2
## .. ..- attr(*, "names")= chr "no" "yes"
# Create and print structure of edu_equal_2
edu_equal_2 <- read.dta("edequality.dta", convert.factors = FALSE)
str(edu_equal_2)
## 'data.frame': 12214 obs. of 27 variables:
## $ hhid : num 1 1 1 2 2 3 4 4 5 6 ...
## $ hhweight : num 627 627 627 627 627 ...
## $ location : int 1 1 1 1 1 2 2 2 1 1 ...
## $ region : int 8 8 8 9 9 4 4 4 8 8 ...
## $ ethnicity_head : int 2 2 2 1 1 1 1 1 1 1 ...
## $ age : num 37 11 8 73 70 75 79 80 82 83 ...
## $ gender : int 2 2 1 1 2 1 1 2 2 2 ...
## $ relation : int 1 3 3 1 2 1 1 2 1 1 ...
## $ literate : int 1 2 2 2 2 2 2 2 2 2 ...
## $ income_mnt : num 13.3 13.3 13.3 142.5 142.5 ...
## $ income : num 160 160 160 1710 1710 ...
## $ aggregate : num 1042 1042 1042 3271 3271 ...
## $ aggr_ind_annual : num 347 347 347 1635 1635 ...
## $ educ_completed : int 2 4 4 4 3 3 3 3 4 4 ...
## $ grade_complete : num 4 3 0 3 4 4 4 4 5 5 ...
## $ grade_all : num 4 11 8 11 8 8 8 8 13 13 ...
## $ unemployed : int 2 1 1 1 1 1 1 1 1 1 ...
## $ reason_OLF : int NA NA NA 3 3 3 9 9 3 3 ...
## $ sector : int NA NA NA NA NA NA 1 1 NA NA ...
## $ occupation : int NA NA NA NA NA NA 5 5 NA NA ...
## $ earn_mont : num 0 0 0 0 0 0 20 20 0 0 ...
## $ earn_ann : num 0 0 0 0 0 0 240 240 0 0 ...
## $ hours_week : num NA NA NA NA NA NA 30 35 NA NA ...
## $ hours_mnt : num NA NA NA NA NA ...
## $ fulltime : int NA NA NA NA NA NA 1 1 NA NA ...
## $ hhexp : num 100 100 100 343 343 ...
## $ legacy_pension_amt: num NA NA NA NA NA NA NA NA NA NA ...
## - attr(*, "datalabel")= chr ""
## - attr(*, "time.stamp")= chr ""
## - attr(*, "formats")= chr "%9.0g" "%9.0g" "%9.0g" "%9.0g" ...
## - attr(*, "types")= int 100 100 108 108 108 100 108 108 108 100 ...
## - attr(*, "val.labels")= chr "" "" "location" "region" ...
## - attr(*, "var.labels")= chr "hhid" "hhweight" "location" "region" ...
## - attr(*, "expansion.fields")=List of 12
## ..$ : chr "_dta" "_svy_su1" "cluster"
## ..$ : chr "_dta" "_svy_strata1" "strata"
## ..$ : chr "_dta" "_svy_stages" "1"
## ..$ : chr "_dta" "_svy_version" "2"
## ..$ : chr "_dta" "__XijVarLabcons" "(sum) cons"
## ..$ : chr "_dta" "ReS_Xij" "cons"
## ..$ : chr "_dta" "ReS_str" "0"
## ..$ : chr "_dta" "ReS_j" "group"
## ..$ : chr "_dta" "ReS_ver" "v.2"
## ..$ : chr "_dta" "ReS_i" "hhid dur"
## ..$ : chr "_dta" "note1" "variables g1pc, g2pc, g3pc, g4pc, g5pc, g7pc, g8pc, g9pc, g10pc, g11pc, g12pc, gall, health, rent, durables we"| __truncated__
## ..$ : chr "_dta" "note0" "1"
## - attr(*, "version")= int 7
## - attr(*, "label.table")=List of 12
## ..$ location: Named int 1 2
## .. ..- attr(*, "names")= chr "urban location" "rural location"
## ..$ region : Named int 1 2 3 4 5 6 7 8 9
## .. ..- attr(*, "names")= chr "Sofia city" "Bourgass" "Varna" "Lovetch" ...
## ..$ ethnic : Named int 1 2 3 4
## .. ..- attr(*, "names")= chr "Bulgaria" "Turks" "Roma" "Other"
## ..$ s2_q2 : Named int 1 2
## .. ..- attr(*, "names")= chr "male" "female"
## ..$ s2_q3 : Named int 1 2 3 4 5 6 7 8 9
## .. ..- attr(*, "names")= chr "head " "spouse/partner " "child " "son/daughter-in-law " ...
## ..$ lit : Named int 1 2
## .. ..- attr(*, "names")= chr "no" "yes"
## ..$ : Named int 1 2 3 4
## .. ..- attr(*, "names")= chr "never attanded" "primary" "secondary" "postsecondary"
## ..$ : Named int 1 2
## .. ..- attr(*, "names")= chr "Not unemployed" "Unemployed"
## ..$ : Named int 1 2 3 4 5 6 7 8 9 10
## .. ..- attr(*, "names")= chr "student" "housewife/childcare" "in retirement" "illness, disability" ...
## ..$ : Named int 1 2 3 4 5 6 7 8 9 10
## .. ..- attr(*, "names")= chr "agriculture" "mining" "manufacturing" "utilities" ...
## ..$ : Named int 1 2 3 4 5
## .. ..- attr(*, "names")= chr "private company" "public works program" "government,public sector, army" "private individual" ...
## ..$ : Named int 1 2
## .. ..- attr(*, "names")= chr "no" "yes"
# Create and print structure of edu_equal_3
edu_equal_3 <- read.dta("edequality.dta", convert.underscore = TRUE)
str(edu_equal_3)
## 'data.frame': 12214 obs. of 27 variables:
## $ hhid : num 1 1 1 2 2 3 4 4 5 6 ...
## $ hhweight : num 627 627 627 627 627 ...
## $ location : Factor w/ 2 levels "urban location",..: 1 1 1 1 1 2 2 2 1 1 ...
## $ region : Factor w/ 9 levels "Sofia city","Bourgass",..: 8 8 8 9 9 4 4 4 8 8 ...
## $ ethnicity.head : Factor w/ 4 levels "Bulgaria","Turks",..: 2 2 2 1 1 1 1 1 1 1 ...
## $ age : num 37 11 8 73 70 75 79 80 82 83 ...
## $ gender : Factor w/ 2 levels "male","female": 2 2 1 1 2 1 1 2 2 2 ...
## $ relation : Factor w/ 9 levels "head ",..: 1 3 3 1 2 1 1 2 1 1 ...
## $ literate : Factor w/ 2 levels "no","yes": 1 2 2 2 2 2 2 2 2 2 ...
## $ income.mnt : num 13.3 13.3 13.3 142.5 142.5 ...
## $ income : num 160 160 160 1710 1710 ...
## $ aggregate : num 1042 1042 1042 3271 3271 ...
## $ aggr.ind.annual : num 347 347 347 1635 1635 ...
## $ educ.completed : int 2 4 4 4 3 3 3 3 4 4 ...
## $ grade.complete : num 4 3 0 3 4 4 4 4 5 5 ...
## $ grade.all : num 4 11 8 11 8 8 8 8 13 13 ...
## $ unemployed : int 2 1 1 1 1 1 1 1 1 1 ...
## $ reason.OLF : int NA NA NA 3 3 3 9 9 3 3 ...
## $ sector : int NA NA NA NA NA NA 1 1 NA NA ...
## $ occupation : int NA NA NA NA NA NA 5 5 NA NA ...
## $ earn.mont : num 0 0 0 0 0 0 20 20 0 0 ...
## $ earn.ann : num 0 0 0 0 0 0 240 240 0 0 ...
## $ hours.week : num NA NA NA NA NA NA 30 35 NA NA ...
## $ hours.mnt : num NA NA NA NA NA ...
## $ fulltime : int NA NA NA NA NA NA 1 1 NA NA ...
## $ hhexp : num 100 100 100 343 343 ...
## $ legacy.pension.amt: num NA NA NA NA NA NA NA NA NA NA ...
## - attr(*, "datalabel")= chr ""
## - attr(*, "time.stamp")= chr ""
## - attr(*, "formats")= chr "%9.0g" "%9.0g" "%9.0g" "%9.0g" ...
## - attr(*, "types")= int 100 100 108 108 108 100 108 108 108 100 ...
## - attr(*, "val.labels")= chr "" "" "location" "region" ...
## - attr(*, "var.labels")= chr "hhid" "hhweight" "location" "region" ...
## - attr(*, "expansion.fields")=List of 12
## ..$ : chr "_dta" "_svy_su1" "cluster"
## ..$ : chr "_dta" "_svy_strata1" "strata"
## ..$ : chr "_dta" "_svy_stages" "1"
## ..$ : chr "_dta" "_svy_version" "2"
## ..$ : chr "_dta" "__XijVarLabcons" "(sum) cons"
## ..$ : chr "_dta" "ReS_Xij" "cons"
## ..$ : chr "_dta" "ReS_str" "0"
## ..$ : chr "_dta" "ReS_j" "group"
## ..$ : chr "_dta" "ReS_ver" "v.2"
## ..$ : chr "_dta" "ReS_i" "hhid dur"
## ..$ : chr "_dta" "note1" "variables g1pc, g2pc, g3pc, g4pc, g5pc, g7pc, g8pc, g9pc, g10pc, g11pc, g12pc, gall, health, rent, durables we"| __truncated__
## ..$ : chr "_dta" "note0" "1"
## - attr(*, "version")= int 7
## - attr(*, "label.table")=List of 12
## ..$ location: Named int 1 2
## .. ..- attr(*, "names")= chr "urban location" "rural location"
## ..$ region : Named int 1 2 3 4 5 6 7 8 9
## .. ..- attr(*, "names")= chr "Sofia city" "Bourgass" "Varna" "Lovetch" ...
## ..$ ethnic : Named int 1 2 3 4
## .. ..- attr(*, "names")= chr "Bulgaria" "Turks" "Roma" "Other"
## ..$ s2_q2 : Named int 1 2
## .. ..- attr(*, "names")= chr "male" "female"
## ..$ s2_q3 : Named int 1 2 3 4 5 6 7 8 9
## .. ..- attr(*, "names")= chr "head " "spouse/partner " "child " "son/daughter-in-law " ...
## ..$ lit : Named int 1 2
## .. ..- attr(*, "names")= chr "no" "yes"
## ..$ : Named int 1 2 3 4
## .. ..- attr(*, "names")= chr "never attanded" "primary" "secondary" "postsecondary"
## ..$ : Named int 1 2
## .. ..- attr(*, "names")= chr "Not unemployed" "Unemployed"
## ..$ : Named int 1 2 3 4 5 6 7 8 9 10
## .. ..- attr(*, "names")= chr "student" "housewife/childcare" "in retirement" "illness, disability" ...
## ..$ : Named int 1 2 3 4 5 6 7 8 9 10
## .. ..- attr(*, "names")= chr "agriculture" "mining" "manufacturing" "utilities" ...
## ..$ : Named int 1 2 3 4 5
## .. ..- attr(*, "names")= chr "private company" "public works program" "government,public sector, army" "private individual" ...
## ..$ : Named int 1 2
## .. ..- attr(*, "names")= chr "no" "yes"
All great things come in pairs. Where foreign provided read.dta() to read Stata data, there’s also read.spss() to read SPSS data files. To get a data frame, make sure to set to.data.frame = TRUE inside read.spss().
In this exercise, you’ll be working with socio-economic variables from different countries (Source: Quantative Data Analysis in Education). The SPSS data is in a file called international.sav, which is in your working directory. You can also download it here if you want to play around with it some more.
# Import international.sav as a data frame: demo
demo <- read.spss("international.sav", to.data.frame = TRUE)
## re-encoding from CP1252
# Create boxplot of gdp variable of demo
boxplot(demo$gdp)
In the previous exercise, you used the to.data.frame argument inside read.spss(). There are many other ways in which to customize the way your SPSS data is imported.
In this exercise you will experiment with another argument, use.value.labels. It specifies whether variables with value labels should be converted into R factors with levels that are named accordingly. The argument is TRUE by default which means that so called labelled variables inside SPSS are converted to factors inside R.
# Import international.sav as demo_1
demo_1 <- read.spss("international.sav", to.data.frame = TRUE)
## re-encoding from CP1252
# Print out the head of demo_1
head(demo_1)
## id country contint m_illit f_illit lifeexpt gdp
## 1 1 Argentina Americas 3.0 3.0 16 3375
## 2 2 Benin Africa 45.2 74.5 7 521
## 3 3 Burundi Africa 33.2 48.1 5 86
## 4 4 Chile Americas 4.2 4.4 14 4523
## 5 5 Dominican Republic Americas 12.0 12.7 12 2408
## 6 6 El Salvador Americas 17.6 22.9 11 2302
# Import international.sav as demo_2
demo_2 <- read.spss("international.sav", to.data.frame = TRUE, use.value.labels = FALSE)
## re-encoding from CP1252
# Print out the head of demo_2
head(demo_2)
## id country contint m_illit f_illit lifeexpt gdp
## 1 1 Argentina 2 3.0 3.0 16 3375
## 2 2 Benin 1 45.2 74.5 7 521
## 3 3 Burundi 1 33.2 48.1 5 86
## 4 4 Chile 2 4.2 4.4 14 4523
## 5 5 Dominican Republic 2 12.0 12.7 12 2408
## 6 6 El Salvador 2 17.6 22.9 11 2302
The End.
Thanks DataCamp
- My Favorite Team -