The first step to import data from a SQL database is creating a connection to it. You need different packages depending on the database you want to connect to. All of these packages do this in a uniform way, as specified in the DBI package.
dbConnect() creates a connection between your R session and a SQL database. The first argument has to be a DBIdriver object, which specifies how connections are made and how data is mapped between R and the database. Specifically for MySQL databases, you can build such a driver with RMySQL::MySQL().
If the MySQL database is a remote database hosted on a server, you’ll also have to specify the following arguments in dbConnect(): dbname, host, port, user and password. Most of these details have already been provided.
library(DBI)
Warning message:
package ‘RMySQL’ was built under R version 3.6.3
# Edit dbConnect() call
con <- dbConnect(RMySQL::MySQL(),
                 dbname = "tweater",
                 host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
                 port = 3306,
                 user = "student",
                 password = "datacamp")
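As a quick sanity check (not part of the original exercise), you can ask DBI whether the connection is live with dbIsValid():
# Check that the connection was established (dbIsValid() is part of DBI)
dbIsValid(con)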
After you’ve successfully connected to a remote MySQL database, the next step is to see what tables the database contains. You can do this with the dbListTables() function.
dbListTables(con)
[1] "comments" "tweats" "users"
# Build a vector of table names: tables
tables <- dbListTables(con)
# Display structure of tables
str(tables)
chr [1:3] "comments" "tweats" "users"
To import a table, you use the dbReadTable() function. Simply pass it the connection object (con), followed by the name of the table you want to import. The resulting object is a standard R data frame.
# Import the users table from tweater: users
users <- dbReadTable(con, "users")
# Print users
users
Besides users, we’re also interested in the tweats and comments tables. However, writing a separate dbReadTable() call for each and every table in your database would mean a lot of code duplication. Remember the lapply() function? You can use it again here! A connection is already coded for you, as well as a vector table_names containing the names of all the tables in the database.
# Get table names
table_names <- dbListTables(con)
# Import all tables
tables <- lapply(table_names, dbReadTable, conn = con)
# Print out tables
tables
(tables is a list of three data frames, one per table: comments, tweats and users. The printed output is omitted here.)
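One optional refinement that is not part of the exercise: naming the list elements after the tables they came from, so you can index the list by table name.
# Name the imported data frames after their tables
names(tables) <- table_names
# Now e.g. tables$users is the users data frame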
If you only need a fraction of a table, importing it entirely is wasteful; dbGetQuery() is what you need. As usual, you first pass the connection object to it. The second argument is an SQL query in the form of a character string.
# Import tweat_id column of comments where user_id is 1: elisabeth
elisabeth <- dbGetQuery(con, "SELECT tweat_id FROM comments
                              WHERE user_id = 1")
# Print elisabeth
elisabeth
# Import post column of tweats where date is later than '2015-09-21': latest
latest <- dbGetQuery(con, "SELECT post FROM tweats
                           WHERE date > '2015-09-21'")
# Print latest
latest
# Create data frame specific
specific <- dbGetQuery(con, "SELECT message FROM comments
                             WHERE tweat_id = 77 AND user_id > 4")
# Print specific
specific
There are also dedicated SQL functions that you can use in the WHERE clause of an SQL query. For example, CHAR_LENGTH() returns the number of characters in a string.
# Create data frame short
short <- dbGetQuery(con, "SELECT id, name FROM users
                          WHERE CHAR_LENGTH(name) < 5")
# Print short
short
Another frequently used keyword is JOIN, and more specifically INNER JOIN.
dbGetQuery(con, "SELECT name, post
                 FROM users INNER JOIN tweats ON users.id = user_id
                 WHERE date > '2015-09-19'")
dbGetQuery(con, "SELECT post, message
                 FROM tweats INNER JOIN comments ON tweats.id = tweat_id
                 WHERE tweat_id = 77")
You’ve used dbGetQuery() multiple times now. This is a virtual function from the DBI package, but is actually implemented by the RMySQL package. Behind the scenes, the following steps are performed:
- sending the specified query with dbSendQuery();
- fetching the result of executing the query on the database with dbFetch();
- clearing the result with dbClearResult().
Let’s not use dbGetQuery() this time, and instead implement the steps above ourselves. This is more tedious to write, but it gives you the ability to fetch the query’s result in chunks rather than all at once. You can do this by specifying the n argument inside dbFetch().
# Send query to the database
res <- dbSendQuery(con, "SELECT * FROM comments WHERE user_id > 4")
# Use dbFetch() twice
dbFetch(res, n = 2)
dbFetch(res, n = 2)
# Clear res
dbClearResult(res)
[1] TRUE
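If you don’t know in advance how many rows a query returns, you can keep fetching chunks until DBI’s dbHasCompleted() signals that the result set is exhausted. A minimal sketch, assuming con is still open:
# Fetch the result in chunks of at most 10 rows
res <- dbSendQuery(con, "SELECT * FROM comments")
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 10)
  print(nrow(chunk))
}
dbClearResult(res)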
RMySQL enforces a maximum number of open connections and closes some of them for you automatically. Still, it’s always polite to manually disconnect from the database when you’re done. You do this with the dbDisconnect() function.
# Create the data frame long_tweats
long_tweats <- dbGetQuery(con, "SELECT post, date
                                FROM tweats
                                WHERE CHAR_LENGTH(post) > 40")
# Print long_tweats
print(long_tweats)
# Disconnect from the database
dbDisconnect(con)
[1] TRUE
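If you find yourself opening and closing connections a lot, a defensive pattern is to wrap the work in a function and register the disconnect with on.exit(), so it also runs when the query fails. This sketch simply reuses the connection details from above; it is not part of the course code:
query_tweater <- function(sql) {
  con <- dbConnect(RMySQL::MySQL(),
                   dbname = "tweater",
                   host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
                   port = 3306,
                   user = "student",
                   password = "datacamp")
  on.exit(dbDisconnect(con))  # runs even if dbGetQuery() errors
  dbGetQuery(con, sql)
}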
The standard import functions read.csv() and read.delim() are capable of automatically importing from URLs that point to flat files on the web.
You must be wondering whether Hadley Wickham’s alternative package, readr, is equally potent.
# Load the readr package
library(readr)
# Import the csv file: pools
url_csv <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/swimming_pools.csv"
pools <- read_csv(url_csv)
Parsed with column specification:
cols(
  Name = col_character(),
  Address = col_character(),
  Latitude = col_double(),
  Longitude = col_double()
)
# Import the txt file: potatoes
url_delim <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/potatoes.txt"
potatoes <- read_tsv(url_delim)
Parsed with column specification:
cols(
  area = col_double(),
  temp = col_double(),
  size = col_double(),
  storage = col_double(),
  method = col_double(),
  texture = col_double(),
  flavor = col_double(),
  moistness = col_double()
)
# Print pools and potatoes
pools
potatoes
There is also a safer alternative to HTTP, namely HTTPS, which stands for HyperText Transfer Protocol Secure. Just remember this: HTTPS is relatively safe, HTTP is not.
# https URL to the swimming_pools csv file.
url_csv <- "https://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/swimming_pools.csv"
# Import the file using read.csv(): pools1
pools1 <- read.csv(url_csv)
# Load the readr package
library(readr)
# Import the file using read_csv(): pools2
pools2 <- read_csv(url_csv)
Parsed with column specification:
cols(
  Name = col_character(),
  Address = col_character(),
  Latitude = col_double(),
  Longitude = col_double()
)
# Print the structure of pools1 and pools2
str(pools1)
'data.frame': 20 obs. of 4 variables:
$ Name : Factor w/ 20 levels "Acacia Ridge Leisure Centre",..: 1 2 3 4 5 6 19 7 8 9 ...
$ Address : Factor w/ 20 levels "1 Fairlead Crescent, Manly",..: 5 20 18 10 9 11 6 15 12 17 ...
$ Latitude : num -27.6 -27.6 -27.6 -27.5 -27.4 ...
$ Longitude: num 153 153 153 153 153 ...
str(pools2)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 20 obs. of 4 variables:
$ Name : chr "Acacia Ridge Leisure Centre" "Bellbowrie Pool" "Carole Park" "Centenary Pool (inner City)" ...
$ Address : chr "1391 Beaudesert Road, Acacia Ridge" "Sugarwood Street, Bellbowrie" "Cnr Boundary Road and Waterford Road Wacol" "400 Gregory Terrace, Spring Hill" ...
$ Latitude : num -27.6 -27.6 -27.6 -27.5 -27.4 ...
$ Longitude: num 153 153 153 153 153 ...
- attr(*, "spec")=
.. cols(
.. Name = col_character(),
.. Address = col_character(),
.. Latitude = col_double(),
.. Longitude = col_double()
.. )
When you learned about gdata, it was mentioned that gdata can handle .xls files that are on the internet; readxl can’t, at least not yet. The URL you’ll be working with is already available in the sample code. You’ll import the file once with gdata and once with readxl, via a workaround.
# Load the readxl and gdata package
library(readxl)
library(gdata)
gdata: Unable to locate valid perl interpreter
gdata:
gdata: read.xls() will be unable to read Excel XLS and XLSX
gdata: files unless the 'perl=' argument is used to specify
gdata: the location of a valid perl intrpreter.
gdata:
gdata: (To avoid display of this message in the future, please
gdata: ensure perl is installed and available on the
gdata: executable search path.)
gdata: Unable to load perl libaries needed by read.xls()
gdata: to support 'XLX' (Excel 97-2004) files.
gdata: Unable to load perl libaries needed by read.xls()
gdata: to support 'XLSX' (Excel 2007+) files.
gdata: Run the function 'installXLSXsupport()'
gdata: to automatically download and install the perl
gdata: libaries needed to support Excel XLS and XLSX formats.
Attaching package: ‘gdata’
The following object is masked from ‘package:stats’:
    nobs
The following object is masked from ‘package:utils’:
    object.size
The following object is masked from ‘package:base’:
    startsWith
# Specification of url: url_xls
url_xls <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/latitude.xls"
# Import the .xls file with gdata: excel_gdata (the perl path is machine-specific)
excel_gdata <- read.xls(url_xls, perl = "c:/Program Files/Git/usr/bin/perl.exe")
trying URL 'http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/latitude.xls'
Content type '' length 37888 bytes (37 KB)
downloaded 37 KB
# Download file behind URL, name it local_latitude.xls
download.file(url_xls, "local_latitude.xls", mode="wb")
trying URL 'http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/latitude.xls'
Content type '' length 37888 bytes (37 KB)
downloaded 37 KB
# Import the local .xls file with readxl: excel_readxl
excel_readxl <- read_excel("local_latitude.xls")
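If you’d rather not clutter your working directory, a hypothetical variant of the same workaround downloads to a temporary file instead (the name excel_readxl2 is just illustrative):
# Download to a temporary .xls file and import from there
tmp <- tempfile(fileext = ".xls")
download.file(url_xls, tmp, mode = "wb")
excel_readxl2 <- read_excel(tmp)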
There’s more: with download.file() you can download any kind of file from the web over HTTP or HTTPS: images, executable files, but also .RData files. An .RData file is a very efficient format to store R data.
You can load data from an RData file using the load() function, but this function does not accept a URL string as an argument.
# https URL to the wine RData file.
url_rdata <- "https://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/wine.RData"
# Download the wine file to your working directory
download.file(url_rdata, "wine_local.RData")
trying URL 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/wine.RData'
Content type '' length 4781 bytes
downloaded 4781 bytes
# Load the wine data into your workspace using load()
load("wine_local.RData")
# Print out the summary of the wine data
summary(wine)
Alcohol Malic acid Ash Alcalinity of ash
Min. :11.03 Min. :0.74 Min. :1.360 Min. :10.60
1st Qu.:12.36 1st Qu.:1.60 1st Qu.:2.210 1st Qu.:17.20
Median :13.05 Median :1.87 Median :2.360 Median :19.50
Mean :12.99 Mean :2.34 Mean :2.366 Mean :19.52
3rd Qu.:13.67 3rd Qu.:3.10 3rd Qu.:2.560 3rd Qu.:21.50
Max. :14.83 Max. :5.80 Max. :3.230 Max. :30.00
Magnesium Total phenols Flavanoids
Min. : 70.00 Min. :0.980 Min. :0.340
1st Qu.: 88.00 1st Qu.:1.740 1st Qu.:1.200
Median : 98.00 Median :2.350 Median :2.130
Mean : 99.59 Mean :2.292 Mean :2.023
3rd Qu.:107.00 3rd Qu.:2.800 3rd Qu.:2.860
Max. :162.00 Max. :3.880 Max. :5.080
Nonflavanoid phenols Proanthocyanins Color intensity
Min. :0.1300 Min. :0.410 Min. : 1.280
1st Qu.:0.2700 1st Qu.:1.250 1st Qu.: 3.210
Median :0.3400 Median :1.550 Median : 4.680
Mean :0.3623 Mean :1.587 Mean : 5.055
3rd Qu.:0.4400 3rd Qu.:1.950 3rd Qu.: 6.200
Max. :0.6600 Max. :3.580 Max. :13.000
Hue Proline
Min. :1.270 Min. : 278.0
1st Qu.:1.930 1st Qu.: 500.0
Median :2.780 Median : 672.0
Mean :2.604 Mean : 745.1
3rd Qu.:3.170 3rd Qu.: 985.0
Max. :4.000 Max. :1680.0
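As an aside: although load() doesn’t accept a URL string, it does accept a connection, so you can skip the intermediate file by wrapping the URL in url(). A small sketch of this alternative:
# Load the .RData file straight from the web via a connection
load(url(url_rdata))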
Downloading a file from the Internet means sending a GET request and receiving the file you asked for. Internally, all the previously discussed functions use a GET request to download files.
httr provides a convenient function, GET(), to execute this GET request. The result is a response object that provides easy access to the status code, the content type and, of course, the actual content.
You can extract the content from the request using the content() function. There are three ways to retrieve this content: as a raw object, as a character vector, or as an R object such as a list. If you don’t tell content() how to retrieve the content through the as argument, it’ll try its best to figure out which type is most appropriate based on the content type.
# Load the httr package
library(httr)
# Get the url, save response to resp
url <- "http://www.example.com/"
resp <- GET(url)
# Print resp
resp
Response [http://www.example.com/]
Date: 2020-04-05 13:58
Status: 200
Content-Type: text/html; charset=UTF-8
Size: 1.26 kB
<!doctype html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8" />
<meta http-equiv="Content-type" content="text/html; charset=utf...
<meta name="viewport" content="width=device-width, initial-scal...
<style type="text/css">
body {
...
# Get the raw content of resp: raw_content
raw_content <- content(resp, as = "raw")
# Print the head of raw_content
head(raw_content)
[1] 3c 21 64 6f 63 74
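Before digging into the content, httr also lets you inspect the response itself, for example with status_code() and http_error():
# Inspect the response status
status_code(resp)  # 200 means the request succeeded
http_error(resp)   # TRUE if the status code signals an error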
Web content does not limit itself to HTML pages and files stored on remote servers such as DataCamp’s Amazon S3 instances. There are many other data formats out there. A very common one is JSON. This format is very often used by so-called Web APIs, interfaces to web servers with which you as a client can communicate to get or store information in more complicated ways.
# Get the url
url <- "http://www.omdbapi.com/?apikey=72bc447a&t=Annie+Hall&y=&plot=short&r=json"
resp <- GET(url)
# Print resp
resp
Response [http://www.omdbapi.com/?apikey=72bc447a&t=Annie+Hall&y=&plot=short&r=json]
Date: 2020-04-05 13:58
Status: 200
Content-Type: application/json; charset=utf-8
Size: 929 B
# Print content of resp as text
content(resp, as = "text")
[1] "{\"Title\":\"Annie Hall\",\"Year\":\"1977\",\"Rated\":\"PG\",\"Released\":\"20 Apr 1977\",\"Runtime\":\"93 min\",\"Genre\":\"Comedy, Romance\",\"Director\":\"Woody Allen\",\"Writer\":\"Woody Allen, Marshall Brickman\",\"Actors\":\"Woody Allen, Diane Keaton, Tony Roberts, Carol Kane\",\"Plot\":\"Neurotic New York comedian Alvy Singer falls in love with the ditzy Annie Hall.\",\"Language\":\"English, German\",\"Country\":\"USA\",\"Awards\":\"Won 4 Oscars. Another 26 wins & 8 nominations.\",\"Poster\":\"https://m.media-amazon.com/images/M/MV5BZDg1OGQ4YzgtM2Y2NS00NjA3LWFjYTctMDRlMDI3NWE1OTUyXkEyXkFqcGdeQXVyMjUzOTY1NTc@._V1_SX300.jpg\",\"Ratings\":[{\"Source\":\"Internet Movie Database\",\"Value\":\"8.0/10\"},{\"Source\":\"Rotten Tomatoes\",\"Value\":\"97%\"},{\"Source\":\"Metacritic\",\"Value\":\"92/100\"}],\"Metascore\":\"92\",\"imdbRating\":\"8.0\",\"imdbVotes\":\"243,072\",\"imdbID\":\"tt0075686\",\"Type\":\"movie\",\"DVD\":\"28 Apr 1998\",\"BoxOffice\":\"N/A\",\"Production\":\"United Artists\",\"Website\":\"N/A\",\"Response\":\"True\"}"
# Print content of resp
content(resp)
$Title
[1] "Annie Hall"
$Year
[1] "1977"
$Rated
[1] "PG"
$Released
[1] "20 Apr 1977"
$Runtime
[1] "93 min"
$Genre
[1] "Comedy, Romance"
$Director
[1] "Woody Allen"
$Writer
[1] "Woody Allen, Marshall Brickman"
$Actors
[1] "Woody Allen, Diane Keaton, Tony Roberts, Carol Kane"
$Plot
[1] "Neurotic New York comedian Alvy Singer falls in love with the ditzy Annie Hall."
$Language
[1] "English, German"
$Country
[1] "USA"
$Awards
[1] "Won 4 Oscars. Another 26 wins & 8 nominations."
$Poster
[1] "https://m.media-amazon.com/images/M/MV5BZDg1OGQ4YzgtM2Y2NS00NjA3LWFjYTctMDRlMDI3NWE1OTUyXkEyXkFqcGdeQXVyMjUzOTY1NTc@._V1_SX300.jpg"
$Ratings
$Ratings[[1]]
$Ratings[[1]]$Source
[1] "Internet Movie Database"
$Ratings[[1]]$Value
[1] "8.0/10"
$Ratings[[2]]
$Ratings[[2]]$Source
[1] "Rotten Tomatoes"
$Ratings[[2]]$Value
[1] "97%"
$Ratings[[3]]
$Ratings[[3]]$Source
[1] "Metacritic"
$Ratings[[3]]$Value
[1] "92/100"
$Metascore
[1] "92"
$imdbRating
[1] "8.0"
$imdbVotes
[1] "243,072"
$imdbID
[1] "tt0075686"
$Type
[1] "movie"
$DVD
[1] "28 Apr 1998"
$BoxOffice
[1] "N/A"
$Production
[1] "United Artists"
$Website
[1] "N/A"
$Response
[1] "True"
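To pull all of the rating values out of the nested Ratings list in one go, a quick sapply() sketch (not part of the exercise):
# Extract the Value element from each rating
ratings <- content(resp)$Ratings
sapply(ratings, `[[`, "Value")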
In the simplest setting, fromJSON() can convert character strings that represent JSON data into a nicely structured R list.
library(jsonlite)
# wine_json is a JSON
wine_json <- '{"name":"Chateau Migraine", "year":1997, "alcohol_pct":12.4, "color":"red", "awarded":false}'
# Convert wine_json into a list: wine
wine <- fromJSON(wine_json)
# Print structure of wine
str(wine)
List of 5
$ name : chr "Chateau Migraine"
$ year : int 1997
$ alcohol_pct: num 12.4
$ color : chr "red"
$ awarded : logi FALSE
fromJSON() also works if you pass it a URL as a character string, or the path to a local file that contains JSON data. Let’s try this out on the Quandl API, where you can fetch all sorts of financial and economic data.
# Definition of quandl_url
quandl_url <- "https://www.quandl.com/api/v3/datasets/WIKI/FB/data.json?auth_token=i83asDsiWUUyfoypkgMz"
# Import Quandl data: quandl_data
quandl_data <- fromJSON(quandl_url)
# Print structure of quandl_data
str(quandl_data)
List of 1
$ dataset_data:List of 10
..$ limit : NULL
..$ transform : NULL
..$ column_index: NULL
..$ column_names: chr [1:13] "Date" "Open" "High" "Low" ...
..$ start_date : chr "2012-05-18"
..$ end_date : chr "2018-03-27"
..$ frequency : chr "daily"
..$ data : chr [1:1472, 1:13] "2018-03-27" "2018-03-26" "2018-03-23" "2018-03-22" ...
..$ collapse : NULL
..$ order : NULL
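The data element is a character matrix. To keep working with it comfortably, you could convert it to a data frame yourself. A sketch based on the structure shown above (the name fb is just illustrative):
# Convert the character matrix into a data frame with proper column names
fb <- as.data.frame(quandl_data$dataset_data$data, stringsAsFactors = FALSE)
colnames(fb) <- quandl_data$dataset_data$column_names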
Compare the release year of two movies in the Open Movie Database.
# Definition of the URLs
url_sw4 <- "http://www.omdbapi.com/?apikey=72bc447a&i=tt0076759&r=json"
url_sw3 <- "http://www.omdbapi.com/?apikey=72bc447a&i=tt0121766&r=json"
# Import two URLs with fromJSON(): sw4 and sw3
sw4 <- fromJSON(url_sw4)
sw3 <- fromJSON(url_sw3)
# Print out the Title element of both lists
sw4$Title
[1] "Star Wars: Episode IV - A New Hope"
sw3$Title
[1] "Star Wars: Episode III - Revenge of the Sith"
# Is the release year of sw4 later than that of sw3?
# (Year is a character string here; for four-digit years, string comparison behaves like numeric comparison)
Year <- sw4$Year > sw3$Year
Year
[1] FALSE
JSON is built on two structures: objects and arrays.
# Challenge 1
json1 <- '[1, 2, 3, 4, 5,6]'
fromJSON(json1)
[1] 1 2 3 4 5 6
# Challenge 2
json2 <- '{"a": [1, 2, 3], "b": [4, 5, 6]}'
fromJSON(json2)
$a
[1] 1 2 3
$b
[1] 4 5 6
# Challenge 1
json1 <- '[[1, 2], [3, 4]]'
fromJSON(json1)
[,1] [,2]
[1,] 1 2
[2,] 3 4
# Challenge 2
json2 <- '[{"a": 1, "b": 2}, {"a": 3, "b": 4}, {"a": 5, "b": 6}]'
fromJSON(json2)
Since this is a JSON array of objects that share the same keys, fromJSON() simplifies it to a three-row data frame with columns a and b.
Apart from converting JSON to R with fromJSON(), you can also use toJSON() to convert R data to a JSON format. In its most basic use, you simply pass this function an R object to convert to a JSON. The result is an R object of the class json, which is basically a character string representing that JSON.
# URL pointing to the .csv file
url_csv <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/water.csv"
# Import the .csv file located at url_csv
water <- read.csv(url_csv, stringsAsFactors = FALSE)
# Convert the data file according to the requirements
water_json <- toJSON(water)
# Print out water_json
water_json
[{"water":"Algeria","X1992":0.064,"X2002":0.017},{"water":"American Samoa"},{"water":"Angola","X1992":0.0001,"X2002":0.0001},{"water":"Antigua and Barbuda","X1992":0.0033},{"water":"Argentina","X1992":0.0007,"X1997":0.0007,"X2002":0.0007},{"water":"Australia","X1992":0.0298,"X2002":0.0298},{"water":"Austria","X1992":0.0022,"X2002":0.0022},{"water":"Bahamas","X1992":0.0013,"X2002":0.0074},{"water":"Bahrain","X1992":0.0441,"X2002":0.0441,"X2007":0.1024},{"water":"Barbados","X2007":0.0146},{"water":"British Virgin Islands","X2007":0.0042},{"water":"Canada","X1992":0.0027,"X2002":0.0027},{"water":"Cape Verde","X1992":0.002,"X1997":0.0017},{"water":"Cayman Islands","X1992":0.0033},{"water":"Central African Rep."},{"water":"Chile","X1992":0.0048,"X2002":0.0048},{"water":"Colombia","X1992":0.0027,"X2002":0.0027},{"water":"Cuba","X1992":0.0069,"X1997":0.0069,"X2002":0.0069},{"water":"Cyprus","X1992":0.003,"X1997":0.003,"X2002":0.0335},{"water":"Czech Rep.","X1992":0.0002,"X2002":0.0002},{"water":"Denmark","X1992":0.015,"X2002":0.015},{"water":"Djibouti","X1992":0.0001,"X2002":0.0001},{"water":"Ecuador","X1992":0.0022,"X1997":0.0022,"X2002":0.0022},{"water":"Egypt","X1992":0.025,"X1997":0.025,"X2002":0.1},{"water":"El Salvador","X1992":0.0001,"X2002":0.0001},{"water":"Finland","X1992":0.0001,"X2002":0.0001},{"water":"France","X1992":0.0117,"X2002":0.0117},{"water":"Gibraltar","X1992":0.0077},{"water":"Greece","X1992":0.01,"X2002":0.01},{"water":"Honduras","X1992":0.0002,"X2002":0.0002},{"water":"Hungary","X1992":0.0002,"X2002":0.0002},{"water":"India","X1997":0.0005,"X2002":0.0005},{"water":"Indonesia","X1992":0.0187,"X2002":0.0187},{"water":"Iran","X1992":0.003,"X1997":0.003,"X2002":0.003,"X2007":0.2},{"water":"Iraq","X1997":0.0074,"X2002":0.0074},{"water":"Ireland","X1992":0.0002,"X2002":0.0002},{"water":"Israel","X1992":0.0256,"X2002":0.0256,"X2007":0.14},{"water":"Italy","X1992":0.0973,"X2002":0.0973},{"water":"Jamaica","X1992":0.0005,"X1997":0.0005,"X2002":0.0005},{"water":"Japan","X1997":0.04,"X2002":0.04},{"water":"Jordan","X1997":0.002,"X2007":0.0098},{"water":"Kazakhstan","X1997":1.328,"X2002":1.328},{"water":"Kuwait","X1992":0.507,"X1997":0.231,"X2002":0.4202},{"water":"Lebanon","X2007":0.0473},{"water":"Libya","X2002":0.018},{"water":"Malaysia","X1992":0.0043,"X2002":0.0043},{"water":"Maldives","X1992":0.0004},{"water":"Malta","X1992":0.024,"X1997":0.031,"X2002":0.031},{"water":"Marshall Islands","X1992":0.0007},{"water":"Mauritania","X1992":0.002,"X2002":0.002},{"water":"Mexico","X1992":0.0307,"X2002":0.0307},{"water":"Morocco","X1992":0.0034,"X1997":0.0034,"X2002":0.007},{"water":"Namibia","X1992":0.0003,"X2002":0.0003},{"water":"Netherlands Antilles","X1992":0.063},{"water":"Nicaragua","X1992":0.0002,"X2002":0.0002},{"water":"Nigeria","X1992":0.003,"X2002":0.003},{"water":"Norway","X1992":0.0001,"X2002":0.0001},{"water":"Oman","X1997":0.034,"X2002":0.034,"X2007":0.109},{"water":"Peru","X1992":0.0054,"X2002":0.0054},{"water":"Poland","X1992":0.007,"X2002":0.007},{"water":"Portugal","X1992":0.0016,"X2002":0.0016},{"water":"Qatar","X1992":0.065,"X1997":0.099,"X2002":0.099,"X2007":0.18},{"water":"Saudi Arabia","X1992":0.683,"X1997":0.727,"X2002":0.863,"X2007":1.033},{"water":"Senegal","X1992":0,"X2002":0},{"water":"Somalia","X1992":0.0001,"X2002":0.0001},{"water":"South Africa","X1992":0.018,"X2002":0.018},{"water":"Spain","X1992":0.1002,"X2002":0.1002},{"water":"Sudan","X1992":0.0004,"X1997":0.0004,"X2002":0.0004},{"water":"Sweden","X1992":0.0002,"X2002":0.0002},{"water":"Trinidad and 
Tobago","X2007":0.036},{"water":"Tunisia","X1992":0.008,"X2002":0.013},{"water":"Turkey","X1992":0.0005,"X2002":0.0005,"X2007":0.0005},{"water":"United Arab Emirates","X1992":0.163,"X1997":0.385,"X2007":0.95},{"water":"United Kingdom","X1992":0.0333,"X2002":0.0333},{"water":"United States","X1992":0.58,"X2002":0.58},{"water":"Venezuela","X1992":0.0052,"X2002":0.0052},{"water":"Yemen, Rep.","X1992":0.01,"X2002":0.01}]
JSONs can come in different formats. Take these two JSONs, which are in fact exactly the same: the first one is in a minified format, the second one is in a pretty format with indentation, whitespace and new lines:
# Mini
{"a":1,"b":2,"c":{"x":5,"y":6}}
# Pretty
{
  "a": 1,
  "b": 2,
  "c": {
    "x": 5,
    "y": 6
  }
}
Unless you’re a computer, you surely prefer the second version. However, the standard form that toJSON() returns is the minified version, as it is more concise. You can adapt this behavior by setting the pretty argument inside toJSON() to TRUE. If you already have a JSON string, you can use prettify() or minify() to make the JSON pretty or as concise as possible.
# Convert mtcars to a pretty JSON: pretty_json
pretty_json <- toJSON(mtcars, pretty = TRUE)
# Print pretty_json
pretty_json
[
{
"mpg": 21,
"cyl": 6,
"disp": 160,
"hp": 110,
"drat": 3.9,
"wt": 2.62,
"qsec": 16.46,
"vs": 0,
"am": 1,
"gear": 4,
"carb": 4,
"_row": "Mazda RX4"
},
{
"mpg": 21,
"cyl": 6,
"disp": 160,
"hp": 110,
"drat": 3.9,
"wt": 2.875,
"qsec": 17.02,
"vs": 0,
"am": 1,
"gear": 4,
"carb": 4,
"_row": "Mazda RX4 Wag"
},
{
"mpg": 22.8,
"cyl": 4,
"disp": 108,
"hp": 93,
"drat": 3.85,
"wt": 2.32,
"qsec": 18.61,
"vs": 1,
"am": 1,
"gear": 4,
"carb": 1,
"_row": "Datsun 710"
},
{
"mpg": 21.4,
"cyl": 6,
"disp": 258,
"hp": 110,
"drat": 3.08,
"wt": 3.215,
"qsec": 19.44,
"vs": 1,
"am": 0,
"gear": 3,
"carb": 1,
"_row": "Hornet 4 Drive"
},
{
"mpg": 18.7,
"cyl": 8,
"disp": 360,
"hp": 175,
"drat": 3.15,
"wt": 3.44,
"qsec": 17.02,
"vs": 0,
"am": 0,
"gear": 3,
"carb": 2,
"_row": "Hornet Sportabout"
},
{
"mpg": 18.1,
"cyl": 6,
"disp": 225,
"hp": 105,
"drat": 2.76,
"wt": 3.46,
"qsec": 20.22,
"vs": 1,
"am": 0,
"gear": 3,
"carb": 1,
"_row": "Valiant"
},
{
"mpg": 14.3,
"cyl": 8,
"disp": 360,
"hp": 245,
"drat": 3.21,
"wt": 3.57,
"qsec": 15.84,
"vs": 0,
"am": 0,
"gear": 3,
"carb": 4,
"_row": "Duster 360"
},
{
"mpg": 24.4,
"cyl": 4,
"disp": 146.7,
"hp": 62,
"drat": 3.69,
"wt": 3.19,
"qsec": 20,
"vs": 1,
"am": 0,
"gear": 4,
"carb": 2,
"_row": "Merc 240D"
},
{
"mpg": 22.8,
"cyl": 4,
"disp": 140.8,
"hp": 95,
"drat": 3.92,
"wt": 3.15,
"qsec": 22.9,
"vs": 1,
"am": 0,
"gear": 4,
"carb": 2,
"_row": "Merc 230"
},
{
"mpg": 19.2,
"cyl": 6,
"disp": 167.6,
"hp": 123,
"drat": 3.92,
"wt": 3.44,
"qsec": 18.3,
"vs": 1,
"am": 0,
"gear": 4,
"carb": 4,
"_row": "Merc 280"
},
{
"mpg": 17.8,
"cyl": 6,
"disp": 167.6,
"hp": 123,
"drat": 3.92,
"wt": 3.44,
"qsec": 18.9,
"vs": 1,
"am": 0,
"gear": 4,
"carb": 4,
"_row": "Merc 280C"
},
{
"mpg": 16.4,
"cyl": 8,
"disp": 275.8,
"hp": 180,
"drat": 3.07,
"wt": 4.07,
"qsec": 17.4,
"vs": 0,
"am": 0,
"gear": 3,
"carb": 3,
"_row": "Merc 450SE"
},
{
"mpg": 17.3,
"cyl": 8,
"disp": 275.8,
"hp": 180,
"drat": 3.07,
"wt": 3.73,
"qsec": 17.6,
"vs": 0,
"am": 0,
"gear": 3,
"carb": 3,
"_row": "Merc 450SL"
},
{
"mpg": 15.2,
"cyl": 8,
"disp": 275.8,
"hp": 180,
"drat": 3.07,
"wt": 3.78,
"qsec": 18,
"vs": 0,
"am": 0,
"gear": 3,
"carb": 3,
"_row": "Merc 450SLC"
},
{
"mpg": 10.4,
"cyl": 8,
"disp": 472,
"hp": 205,
"drat": 2.93,
"wt": 5.25,
"qsec": 17.98,
"vs": 0,
"am": 0,
"gear": 3,
"carb": 4,
"_row": "Cadillac Fleetwood"
},
{
"mpg": 10.4,
"cyl": 8,
"disp": 460,
"hp": 215,
"drat": 3,
"wt": 5.424,
"qsec": 17.82,
"vs": 0,
"am": 0,
"gear": 3,
"carb": 4,
"_row": "Lincoln Continental"
},
{
"mpg": 14.7,
"cyl": 8,
"disp": 440,
"hp": 230,
"drat": 3.23,
"wt": 5.345,
"qsec": 17.42,
"vs": 0,
"am": 0,
"gear": 3,
"carb": 4,
"_row": "Chrysler Imperial"
},
{
"mpg": 32.4,
"cyl": 4,
"disp": 78.7,
"hp": 66,
"drat": 4.08,
"wt": 2.2,
"qsec": 19.47,
"vs": 1,
"am": 1,
"gear": 4,
"carb": 1,
"_row": "Fiat 128"
},
{
"mpg": 30.4,
"cyl": 4,
"disp": 75.7,
"hp": 52,
"drat": 4.93,
"wt": 1.615,
"qsec": 18.52,
"vs": 1,
"am": 1,
"gear": 4,
"carb": 2,
"_row": "Honda Civic"
},
{
"mpg": 33.9,
"cyl": 4,
"disp": 71.1,
"hp": 65,
"drat": 4.22,
"wt": 1.835,
"qsec": 19.9,
"vs": 1,
"am": 1,
"gear": 4,
"carb": 1,
"_row": "Toyota Corolla"
},
{
"mpg": 21.5,
"cyl": 4,
"disp": 120.1,
"hp": 97,
"drat": 3.7,
"wt": 2.465,
"qsec": 20.01,
"vs": 1,
"am": 0,
"gear": 3,
"carb": 1,
"_row": "Toyota Corona"
},
{
"mpg": 15.5,
"cyl": 8,
"disp": 318,
"hp": 150,
"drat": 2.76,
"wt": 3.52,
"qsec": 16.87,
"vs": 0,
"am": 0,
"gear": 3,
"carb": 2,
"_row": "Dodge Challenger"
},
{
"mpg": 15.2,
"cyl": 8,
"disp": 304,
"hp": 150,
"drat": 3.15,
"wt": 3.435,
"qsec": 17.3,
"vs": 0,
"am": 0,
"gear": 3,
"carb": 2,
"_row": "AMC Javelin"
},
{
"mpg": 13.3,
"cyl": 8,
"disp": 350,
"hp": 245,
"drat": 3.73,
"wt": 3.84,
"qsec": 15.41,
"vs": 0,
"am": 0,
"gear": 3,
"carb": 4,
"_row": "Camaro Z28"
},
{
"mpg": 19.2,
"cyl": 8,
"disp": 400,
"hp": 175,
"drat": 3.08,
"wt": 3.845,
"qsec": 17.05,
"vs": 0,
"am": 0,
"gear": 3,
"carb": 2,
"_row": "Pontiac Firebird"
},
{
"mpg": 27.3,
"cyl": 4,
"disp": 79,
"hp": 66,
"drat": 4.08,
"wt": 1.935,
"qsec": 18.9,
"vs": 1,
"am": 1,
"gear": 4,
"carb": 1,
"_row": "Fiat X1-9"
},
{
"mpg": 26,
"cyl": 4,
"disp": 120.3,
"hp": 91,
"drat": 4.43,
"wt": 2.14,
"qsec": 16.7,
"vs": 0,
"am": 1,
"gear": 5,
"carb": 2,
"_row": "Porsche 914-2"
},
{
"mpg": 30.4,
"cyl": 4,
"disp": 95.1,
"hp": 113,
"drat": 3.77,
"wt": 1.513,
"qsec": 16.9,
"vs": 1,
"am": 1,
"gear": 5,
"carb": 2,
"_row": "Lotus Europa"
},
{
"mpg": 15.8,
"cyl": 8,
"disp": 351,
"hp": 264,
"drat": 4.22,
"wt": 3.17,
"qsec": 14.5,
"vs": 0,
"am": 1,
"gear": 5,
"carb": 4,
"_row": "Ford Pantera L"
},
{
"mpg": 19.7,
"cyl": 6,
"disp": 145,
"hp": 175,
"drat": 3.62,
"wt": 2.77,
"qsec": 15.5,
"vs": 0,
"am": 1,
"gear": 5,
"carb": 6,
"_row": "Ferrari Dino"
},
{
"mpg": 15,
"cyl": 8,
"disp": 301,
"hp": 335,
"drat": 3.54,
"wt": 3.57,
"qsec": 14.6,
"vs": 0,
"am": 1,
"gear": 5,
"carb": 8,
"_row": "Maserati Bora"
},
{
"mpg": 21.4,
"cyl": 4,
"disp": 121,
"hp": 109,
"drat": 4.11,
"wt": 2.78,
"qsec": 18.6,
"vs": 1,
"am": 1,
"gear": 4,
"carb": 2,
"_row": "Volvo 142E"
}
]
# Minify pretty_json: mini_json
mini_json <- minify(pretty_json)
# Print mini_json
mini_json
[{"mpg":21,"cyl":6,"disp":160,"hp":110,"drat":3.9,"wt":2.62,"qsec":16.46,"vs":0,"am":1,"gear":4,"carb":4,"_row":"Mazda RX4"},{"mpg":21,"cyl":6,"disp":160,"hp":110,"drat":3.9,"wt":2.875,"qsec":17.02,"vs":0,"am":1,"gear":4,"carb":4,"_row":"Mazda RX4 Wag"},{"mpg":22.8,"cyl":4,"disp":108,"hp":93,"drat":3.85,"wt":2.32,"qsec":18.61,"vs":1,"am":1,"gear":4,"carb":1,"_row":"Datsun 710"},{"mpg":21.4,"cyl":6,"disp":258,"hp":110,"drat":3.08,"wt":3.215,"qsec":19.44,"vs":1,"am":0,"gear":3,"carb":1,"_row":"Hornet 4 Drive"},{"mpg":18.7,"cyl":8,"disp":360,"hp":175,"drat":3.15,"wt":3.44,"qsec":17.02,"vs":0,"am":0,"gear":3,"carb":2,"_row":"Hornet Sportabout"},{"mpg":18.1,"cyl":6,"disp":225,"hp":105,"drat":2.76,"wt":3.46,"qsec":20.22,"vs":1,"am":0,"gear":3,"carb":1,"_row":"Valiant"},{"mpg":14.3,"cyl":8,"disp":360,"hp":245,"drat":3.21,"wt":3.57,"qsec":15.84,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Duster 360"},{"mpg":24.4,"cyl":4,"disp":146.7,"hp":62,"drat":3.69,"wt":3.19,"qsec":20,"vs":1,"am":0,"gear":4,"carb":2,"_row":"Merc 240D"},{"mpg":22.8,"cyl":4,"disp":140.8,"hp":95,"drat":3.92,"wt":3.15,"qsec":22.9,"vs":1,"am":0,"gear":4,"carb":2,"_row":"Merc 230"},{"mpg":19.2,"cyl":6,"disp":167.6,"hp":123,"drat":3.92,"wt":3.44,"qsec":18.3,"vs":1,"am":0,"gear":4,"carb":4,"_row":"Merc 280"},{"mpg":17.8,"cyl":6,"disp":167.6,"hp":123,"drat":3.92,"wt":3.44,"qsec":18.9,"vs":1,"am":0,"gear":4,"carb":4,"_row":"Merc 280C"},{"mpg":16.4,"cyl":8,"disp":275.8,"hp":180,"drat":3.07,"wt":4.07,"qsec":17.4,"vs":0,"am":0,"gear":3,"carb":3,"_row":"Merc 450SE"},{"mpg":17.3,"cyl":8,"disp":275.8,"hp":180,"drat":3.07,"wt":3.73,"qsec":17.6,"vs":0,"am":0,"gear":3,"carb":3,"_row":"Merc 450SL"},{"mpg":15.2,"cyl":8,"disp":275.8,"hp":180,"drat":3.07,"wt":3.78,"qsec":18,"vs":0,"am":0,"gear":3,"carb":3,"_row":"Merc 450SLC"},{"mpg":10.4,"cyl":8,"disp":472,"hp":205,"drat":2.93,"wt":5.25,"qsec":17.98,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Cadillac Fleetwood"},{"mpg":10.4,"cyl":8,"disp":460,"hp":215,"drat":3,"wt":5.424,"qsec":17.82,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Lincoln Continental"},{"mpg":14.7,"cyl":8,"disp":440,"hp":230,"drat":3.23,"wt":5.345,"qsec":17.42,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Chrysler Imperial"},{"mpg":32.4,"cyl":4,"disp":78.7,"hp":66,"drat":4.08,"wt":2.2,"qsec":19.47,"vs":1,"am":1,"gear":4,"carb":1,"_row":"Fiat 128"},{"mpg":30.4,"cyl":4,"disp":75.7,"hp":52,"drat":4.93,"wt":1.615,"qsec":18.52,"vs":1,"am":1,"gear":4,"carb":2,"_row":"Honda Civic"},{"mpg":33.9,"cyl":4,"disp":71.1,"hp":65,"drat":4.22,"wt":1.835,"qsec":19.9,"vs":1,"am":1,"gear":4,"carb":1,"_row":"Toyota Corolla"},{"mpg":21.5,"cyl":4,"disp":120.1,"hp":97,"drat":3.7,"wt":2.465,"qsec":20.01,"vs":1,"am":0,"gear":3,"carb":1,"_row":"Toyota Corona"},{"mpg":15.5,"cyl":8,"disp":318,"hp":150,"drat":2.76,"wt":3.52,"qsec":16.87,"vs":0,"am":0,"gear":3,"carb":2,"_row":"Dodge Challenger"},{"mpg":15.2,"cyl":8,"disp":304,"hp":150,"drat":3.15,"wt":3.435,"qsec":17.3,"vs":0,"am":0,"gear":3,"carb":2,"_row":"AMC Javelin"},{"mpg":13.3,"cyl":8,"disp":350,"hp":245,"drat":3.73,"wt":3.84,"qsec":15.41,"vs":0,"am":0,"gear":3,"carb":4,"_row":"Camaro Z28"},{"mpg":19.2,"cyl":8,"disp":400,"hp":175,"drat":3.08,"wt":3.845,"qsec":17.05,"vs":0,"am":0,"gear":3,"carb":2,"_row":"Pontiac Firebird"},{"mpg":27.3,"cyl":4,"disp":79,"hp":66,"drat":4.08,"wt":1.935,"qsec":18.9,"vs":1,"am":1,"gear":4,"carb":1,"_row":"Fiat X1-9"},{"mpg":26,"cyl":4,"disp":120.3,"hp":91,"drat":4.43,"wt":2.14,"qsec":16.7,"vs":0,"am":1,"gear":5,"carb":2,"_row":"Porsche 
914-2"},{"mpg":30.4,"cyl":4,"disp":95.1,"hp":113,"drat":3.77,"wt":1.513,"qsec":16.9,"vs":1,"am":1,"gear":5,"carb":2,"_row":"Lotus Europa"},{"mpg":15.8,"cyl":8,"disp":351,"hp":264,"drat":4.22,"wt":3.17,"qsec":14.5,"vs":0,"am":1,"gear":5,"carb":4,"_row":"Ford Pantera L"},{"mpg":19.7,"cyl":6,"disp":145,"hp":175,"drat":3.62,"wt":2.77,"qsec":15.5,"vs":0,"am":1,"gear":5,"carb":6,"_row":"Ferrari Dino"},{"mpg":15,"cyl":8,"disp":301,"hp":335,"drat":3.54,"wt":3.57,"qsec":14.6,"vs":0,"am":1,"gear":5,"carb":8,"_row":"Maserati Bora"},{"mpg":21.4,"cyl":4,"disp":121,"hp":109,"drat":4.11,"wt":2.78,"qsec":18.6,"vs":1,"am":1,"gear":4,"carb":2,"_row":"Volvo 142E"}]
haven is an extremely easy-to-use package to import data from three statistical software packages: SAS, STATA and SPSS. Depending on the software, you use different functions: read_sas() for SAS, read_dta() for STATA, and read_sav() or read_por() for SPSS.
All these functions take one key argument: the path to your local file. In fact, you can even pass a URL; haven will then automatically download the file for you before importing it.
library(haven)
# Import sales.sas7bdat: sales
sales <- read_sas("sales.sas7bdat")
# Display the structure of sales
str(sales)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 431 obs. of 4 variables:
$ purchase: num 0 0 1 1 0 0 0 0 0 0 ...
$ age : num 41 47 41 39 32 32 33 45 43 40 ...
$ gender : chr "Female" "Female" "Female" "Female" ...
$ income : chr "Low" "Low" "Low" "Low" ...
- attr(*, "label")= chr "SALES"
Next up are STATA data files; you can use read_dta() for these.
When inspecting the result of the read_dta() call, you’ll notice that one column is imported as a labelled vector, the R equivalent of a common data structure in other statistical environments. To effectively continue working on the data in R, it’s best to convert such variables to a standard R class. To convert a variable of class labelled to a factor, you’ll need haven’s as_factor() function.
# Import the data from the URL: sugar
sugar <- read_dta('http://assets.datacamp.com/production/course_1478/datasets/trade.dta')
# Structure of sugar
str(sugar)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 10 obs. of 5 variables:
$ Date : 'haven_labelled' num 10 9 8 7 6 5 4 3 2 1
..- attr(*, "label")= chr "Date"
..- attr(*, "format.stata")= chr "%9.0g"
..- attr(*, "labels")= Named num 1 2 3 4 5 6 7 8 9 10
.. ..- attr(*, "names")= chr "2004-12-31" "2005-12-31" "2006-12-31" "2007-12-31" ...
$ Import : num 37664782 16316512 11082246 35677943 9879878 ...
..- attr(*, "label")= chr "Import"
..- attr(*, "format.stata")= chr "%9.0g"
$ Weight_I: num 54029106 21584365 14526089 55034932 14806865 ...
..- attr(*, "label")= chr "Weight_I"
..- attr(*, "format.stata")= chr "%9.0g"
$ Export : num 5.45e+07 1.03e+08 3.79e+07 4.85e+07 7.15e+07 ...
..- attr(*, "label")= chr "Export"
..- attr(*, "format.stata")= chr "%9.0g"
$ Weight_E: num 9.34e+07 1.58e+08 8.80e+07 1.12e+08 1.32e+08 ...
..- attr(*, "label")= chr "Weight_E"
..- attr(*, "format.stata")= chr "%9.0g"
- attr(*, "label")= chr "Written by R."
# Convert values in Date column to dates
sugar$Date <- as.Date(as_factor(sugar$Date))
# Structure of sugar again
str(sugar)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 10 obs. of 5 variables:
$ Date : Date, format: "2013-12-31" ...
$ Import : num 37664782 16316512 11082246 35677943 9879878 ...
..- attr(*, "label")= chr "Import"
..- attr(*, "format.stata")= chr "%9.0g"
$ Weight_I: num 54029106 21584365 14526089 55034932 14806865 ...
..- attr(*, "label")= chr "Weight_I"
..- attr(*, "format.stata")= chr "%9.0g"
$ Export : num 5.45e+07 1.03e+08 3.79e+07 4.85e+07 7.15e+07 ...
..- attr(*, "label")= chr "Export"
..- attr(*, "format.stata")= chr "%9.0g"
$ Weight_E: num 9.34e+07 1.58e+08 8.80e+07 1.12e+08 1.32e+08 ...
..- attr(*, "label")= chr "Weight_E"
..- attr(*, "format.stata")= chr "%9.0g"
- attr(*, "label")= chr "Written by R."
A plot can be very useful to explore the relationship between two variables. If you pass the plot() function two arguments, the first one is plotted on the x-axis and the second one on the y-axis.
plot(x = sugar$Import, y = sugar$Weight_I)
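Optionally, you can make the same plot more readable by labelling the axes; a small variant of the call above:
plot(x = sugar$Import, y = sugar$Weight_I,
     xlab = "Import", ylab = "Weight_I")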
The haven package can also import data files from SPSS. Again, importing the data is pretty straightforward. Depending on the SPSS data file you’re working with, you’ll need either read_sav() - for .sav files - or read_por() - for .por files.
# Import person.sav: traits
traits <- read_sav("person.sav")
# Summarize traits
summary(traits)
Neurotic Extroversion Agreeableness Conscientiousness
Min. : 0.00 Min. : 5.00 Min. :15.00 Min. : 7.00
1st Qu.:18.00 1st Qu.:26.00 1st Qu.:39.00 1st Qu.:25.00
Median :24.00 Median :31.00 Median :45.00 Median :30.00
Mean :23.63 Mean :30.23 Mean :44.55 Mean :30.85
3rd Qu.:29.00 3rd Qu.:34.00 3rd Qu.:50.00 3rd Qu.:36.00
Max. :44.00 Max. :65.00 Max. :73.00 Max. :58.00
NA's :14 NA's :16 NA's :19 NA's :14
# Print out a subset
subset(traits, Extroversion > 40 & Agreeableness > 40)
With SPSS data files, it can also happen that some of the variables you import have the labelled class. This is done to keep all the labelling information that was originally present in the .sav and .por files. It’s advised to coerce (or change) these variables to factors or other standard R classes.
# Import SPSS data from the URL: work
work <- read_sav("http://s3.amazonaws.com/assets.datacamp.com/production/course_1478/datasets/employee.sav")
# Display summary of work$GENDER
summary(work$GENDER)
Length Class Mode
474 haven_labelled character
# Convert work$GENDER to a factor
work$GENDER <- as_factor(work$GENDER)
# Display summary of work$GENDER again
summary(work$GENDER)
Female Male
216 258
The foreign package offers a simple function to import STATA data: read.dta().
library(foreign)
package ‘foreign’ was built under R version 3.6.3
# Import florida.dta and name the resulting data frame florida
florida <- read.dta("florida.dta")
# Check tail() of florida
tail(florida)
Data can be very diverse, ranging from character vectors to categorical variables, dates and more. It’s in these cases that the additional arguments of read.dta() come in handy. The arguments you’ll use most often are convert.dates, convert.factors, missing.type and convert.underscore. Their meaning is pretty straightforward.
# Specify the file path using file.path(): path
path <- file.path("edequality.dta")
# Create and print structure of edu_equal_1
edu_equal_1 <- read.dta(path)
str(edu_equal_1)
'data.frame': 12214 obs. of 27 variables:
$ hhid : num 1 1 1 2 2 3 4 4 5 6 ...
$ hhweight : num 627 627 627 627 627 ...
$ location : Factor w/ 2 levels "urban location",..: 1 1 1 1 1 2 2 2 1 1 ...
$ region : Factor w/ 9 levels "Sofia city","Bourgass",..: 8 8 8 9 9 4 4 4 8 8 ...
$ ethnicity_head : Factor w/ 4 levels "Bulgaria","Turks",..: 2 2 2 1 1 1 1 1 1 1 ...
$ age : num 37 11 8 73 70 75 79 80 82 83 ...
$ gender : Factor w/ 2 levels "male","female": 2 2 1 1 2 1 1 2 2 2 ...
$ relation : Factor w/ 9 levels "head ",..: 1 3 3 1 2 1 1 2 1 1 ...
$ literate : Factor w/ 2 levels "no","yes": 1 2 2 2 2 2 2 2 2 2 ...
$ income_mnt : num 13.3 13.3 13.3 142.5 142.5 ...
$ income : num 160 160 160 1710 1710 ...
$ aggregate : num 1042 1042 1042 3271 3271 ...
$ aggr_ind_annual : num 347 347 347 1635 1635 ...
$ educ_completed : int 2 4 4 4 3 3 3 3 4 4 ...
$ grade_complete : num 4 3 0 3 4 4 4 4 5 5 ...
$ grade_all : num 4 11 8 11 8 8 8 8 13 13 ...
$ unemployed : int 2 1 1 1 1 1 1 1 1 1 ...
$ reason_OLF : int NA NA NA 3 3 3 9 9 3 3 ...
$ sector : int NA NA NA NA NA NA 1 1 NA NA ...
$ occupation : int NA NA NA NA NA NA 5 5 NA NA ...
$ earn_mont : num 0 0 0 0 0 0 20 20 0 0 ...
$ earn_ann : num 0 0 0 0 0 0 240 240 0 0 ...
$ hours_week : num NA NA NA NA NA NA 30 35 NA NA ...
$ hours_mnt : num NA NA NA NA NA ...
$ fulltime : int NA NA NA NA NA NA 1 1 NA NA ...
$ hhexp : num 100 100 100 343 343 ...
$ legacy_pension_amt: num NA NA NA NA NA NA NA NA NA NA ...
- attr(*, "datalabel")= chr ""
- attr(*, "time.stamp")= chr ""
- attr(*, "formats")= chr "%9.0g" "%9.0g" "%9.0g" "%9.0g" ...
- attr(*, "types")= int 100 100 108 108 108 100 108 108 108 100 ...
- attr(*, "val.labels")= chr "" "" "location" "region" ...
- attr(*, "var.labels")= chr "hhid" "hhweight" "location" "region" ...
- attr(*, "expansion.fields")=List of 12
..$ : chr "_dta" "_svy_su1" "cluster"
..$ : chr "_dta" "_svy_strata1" "strata"
..$ : chr "_dta" "_svy_stages" "1"
..$ : chr "_dta" "_svy_version" "2"
..$ : chr "_dta" "__XijVarLabcons" "(sum) cons"
..$ : chr "_dta" "ReS_Xij" "cons"
..$ : chr "_dta" "ReS_str" "0"
..$ : chr "_dta" "ReS_j" "group"
..$ : chr "_dta" "ReS_ver" "v.2"
..$ : chr "_dta" "ReS_i" "hhid dur"
..$ : chr "_dta" "note1" "variables g1pc, g2pc, g3pc, g4pc, g5pc, g7pc, g8pc, g9pc, g10pc, g11pc, g12pc, gall, health, rent, durables we"| __truncated__
..$ : chr "_dta" "note0" "1"
- attr(*, "version")= int 7
- attr(*, "label.table")=List of 12
..$ location: Named int 1 2
.. ..- attr(*, "names")= chr "urban location" "rural location"
..$ region : Named int 1 2 3 4 5 6 7 8 9
.. ..- attr(*, "names")= chr "Sofia city" "Bourgass" "Varna" "Lovetch" ...
..$ ethnic : Named int 1 2 3 4
.. ..- attr(*, "names")= chr "Bulgaria" "Turks" "Roma" "Other"
..$ s2_q2 : Named int 1 2
.. ..- attr(*, "names")= chr "male" "female"
..$ s2_q3 : Named int 1 2 3 4 5 6 7 8 9
.. ..- attr(*, "names")= chr "head " "spouse/partner " "child " "son/daughter-in-law " ...
..$ lit : Named int 1 2
.. ..- attr(*, "names")= chr "no" "yes"
..$ : Named int 1 2 3 4
.. ..- attr(*, "names")= chr "never attanded" "primary" "secondary" "postsecondary"
..$ : Named int 1 2
.. ..- attr(*, "names")= chr "Not unemployed" "Unemployed"
..$ : Named int 1 2 3 4 5 6 7 8 9 10
.. ..- attr(*, "names")= chr "student" "housewife/childcare" "in retirement" "illness, disability" ...
..$ : Named int 1 2 3 4 5 6 7 8 9 10
.. ..- attr(*, "names")= chr "agriculture" "mining" "manufacturing" "utilities" ...
..$ : Named int 1 2 3 4 5
.. ..- attr(*, "names")= chr "private company" "public works program" "government,public sector, army" "private individual" ...
..$ : Named int 1 2
.. ..- attr(*, "names")= chr "no" "yes"
# Create and print structure of edu_equal_2
edu_equal_2 <- read.dta(path, convert.factors = FALSE)
str(edu_equal_2)
'data.frame': 12214 obs. of 27 variables:
$ hhid : num 1 1 1 2 2 3 4 4 5 6 ...
$ hhweight : num 627 627 627 627 627 ...
$ location : int 1 1 1 1 1 2 2 2 1 1 ...
$ region : int 8 8 8 9 9 4 4 4 8 8 ...
$ ethnicity_head : int 2 2 2 1 1 1 1 1 1 1 ...
$ age : num 37 11 8 73 70 75 79 80 82 83 ...
$ gender : int 2 2 1 1 2 1 1 2 2 2 ...
$ relation : int 1 3 3 1 2 1 1 2 1 1 ...
$ literate : int 1 2 2 2 2 2 2 2 2 2 ...
$ income_mnt : num 13.3 13.3 13.3 142.5 142.5 ...
$ income : num 160 160 160 1710 1710 ...
$ aggregate : num 1042 1042 1042 3271 3271 ...
$ aggr_ind_annual : num 347 347 347 1635 1635 ...
$ educ_completed : int 2 4 4 4 3 3 3 3 4 4 ...
$ grade_complete : num 4 3 0 3 4 4 4 4 5 5 ...
$ grade_all : num 4 11 8 11 8 8 8 8 13 13 ...
$ unemployed : int 2 1 1 1 1 1 1 1 1 1 ...
$ reason_OLF : int NA NA NA 3 3 3 9 9 3 3 ...
$ sector : int NA NA NA NA NA NA 1 1 NA NA ...
$ occupation : int NA NA NA NA NA NA 5 5 NA NA ...
$ earn_mont : num 0 0 0 0 0 0 20 20 0 0 ...
$ earn_ann : num 0 0 0 0 0 0 240 240 0 0 ...
$ hours_week : num NA NA NA NA NA NA 30 35 NA NA ...
$ hours_mnt : num NA NA NA NA NA ...
$ fulltime : int NA NA NA NA NA NA 1 1 NA NA ...
$ hhexp : num 100 100 100 343 343 ...
$ legacy_pension_amt: num NA NA NA NA NA NA NA NA NA NA ...
- attr(*, "datalabel")= chr ""
- attr(*, "time.stamp")= chr ""
- attr(*, "formats")= chr "%9.0g" "%9.0g" "%9.0g" "%9.0g" ...
- attr(*, "types")= int 100 100 108 108 108 100 108 108 108 100 ...
- attr(*, "val.labels")= chr "" "" "location" "region" ...
- attr(*, "var.labels")= chr "hhid" "hhweight" "location" "region" ...
- attr(*, "expansion.fields")=List of 12
..$ : chr "_dta" "_svy_su1" "cluster"
..$ : chr "_dta" "_svy_strata1" "strata"
..$ : chr "_dta" "_svy_stages" "1"
..$ : chr "_dta" "_svy_version" "2"
..$ : chr "_dta" "__XijVarLabcons" "(sum) cons"
..$ : chr "_dta" "ReS_Xij" "cons"
..$ : chr "_dta" "ReS_str" "0"
..$ : chr "_dta" "ReS_j" "group"
..$ : chr "_dta" "ReS_ver" "v.2"
..$ : chr "_dta" "ReS_i" "hhid dur"
..$ : chr "_dta" "note1" "variables g1pc, g2pc, g3pc, g4pc, g5pc, g7pc, g8pc, g9pc, g10pc, g11pc, g12pc, gall, health, rent, durables we"| __truncated__
..$ : chr "_dta" "note0" "1"
- attr(*, "version")= int 7
- attr(*, "label.table")=List of 12
..$ location: Named int 1 2
.. ..- attr(*, "names")= chr "urban location" "rural location"
..$ region : Named int 1 2 3 4 5 6 7 8 9
.. ..- attr(*, "names")= chr "Sofia city" "Bourgass" "Varna" "Lovetch" ...
..$ ethnic : Named int 1 2 3 4
.. ..- attr(*, "names")= chr "Bulgaria" "Turks" "Roma" "Other"
..$ s2_q2 : Named int 1 2
.. ..- attr(*, "names")= chr "male" "female"
..$ s2_q3 : Named int 1 2 3 4 5 6 7 8 9
.. ..- attr(*, "names")= chr "head " "spouse/partner " "child " "son/daughter-in-law " ...
..$ lit : Named int 1 2
.. ..- attr(*, "names")= chr "no" "yes"
..$ : Named int 1 2 3 4
.. ..- attr(*, "names")= chr "never attanded" "primary" "secondary" "postsecondary"
..$ : Named int 1 2
.. ..- attr(*, "names")= chr "Not unemployed" "Unemployed"
..$ : Named int 1 2 3 4 5 6 7 8 9 10
.. ..- attr(*, "names")= chr "student" "housewife/childcare" "in retirement" "illness, disability" ...
..$ : Named int 1 2 3 4 5 6 7 8 9 10
.. ..- attr(*, "names")= chr "agriculture" "mining" "manufacturing" "utilities" ...
..$ : Named int 1 2 3 4 5
.. ..- attr(*, "names")= chr "private company" "public works program" "government,public sector, army" "private individual" ...
..$ : Named int 1 2
.. ..- attr(*, "names")= chr "no" "yes"
# Create and print structure of edu_equal_3
edu_equal_3 <- read.dta(path, convert.underscore = TRUE)
str(edu_equal_3)
'data.frame': 12214 obs. of 27 variables:
$ hhid : num 1 1 1 2 2 3 4 4 5 6 ...
$ hhweight : num 627 627 627 627 627 ...
$ location : Factor w/ 2 levels "urban location",..: 1 1 1 1 1 2 2 2 1 1 ...
$ region : Factor w/ 9 levels "Sofia city","Bourgass",..: 8 8 8 9 9 4 4 4 8 8 ...
$ ethnicity.head : Factor w/ 4 levels "Bulgaria","Turks",..: 2 2 2 1 1 1 1 1 1 1 ...
$ age : num 37 11 8 73 70 75 79 80 82 83 ...
$ gender : Factor w/ 2 levels "male","female": 2 2 1 1 2 1 1 2 2 2 ...
$ relation : Factor w/ 9 levels "head ",..: 1 3 3 1 2 1 1 2 1 1 ...
$ literate : Factor w/ 2 levels "no","yes": 1 2 2 2 2 2 2 2 2 2 ...
$ income.mnt : num 13.3 13.3 13.3 142.5 142.5 ...
$ income : num 160 160 160 1710 1710 ...
$ aggregate : num 1042 1042 1042 3271 3271 ...
$ aggr.ind.annual : num 347 347 347 1635 1635 ...
$ educ.completed : int 2 4 4 4 3 3 3 3 4 4 ...
$ grade.complete : num 4 3 0 3 4 4 4 4 5 5 ...
$ grade.all : num 4 11 8 11 8 8 8 8 13 13 ...
$ unemployed : int 2 1 1 1 1 1 1 1 1 1 ...
$ reason.OLF : int NA NA NA 3 3 3 9 9 3 3 ...
$ sector : int NA NA NA NA NA NA 1 1 NA NA ...
$ occupation : int NA NA NA NA NA NA 5 5 NA NA ...
$ earn.mont : num 0 0 0 0 0 0 20 20 0 0 ...
$ earn.ann : num 0 0 0 0 0 0 240 240 0 0 ...
$ hours.week : num NA NA NA NA NA NA 30 35 NA NA ...
$ hours.mnt : num NA NA NA NA NA ...
$ fulltime : int NA NA NA NA NA NA 1 1 NA NA ...
$ hhexp : num 100 100 100 343 343 ...
$ legacy.pension.amt: num NA NA NA NA NA NA NA NA NA NA ...
- attr(*, "datalabel")= chr ""
- attr(*, "time.stamp")= chr ""
- attr(*, "formats")= chr "%9.0g" "%9.0g" "%9.0g" "%9.0g" ...
- attr(*, "types")= int 100 100 108 108 108 100 108 108 108 100 ...
- attr(*, "val.labels")= chr "" "" "location" "region" ...
- attr(*, "var.labels")= chr "hhid" "hhweight" "location" "region" ...
- attr(*, "expansion.fields")=List of 12
..$ : chr "_dta" "_svy_su1" "cluster"
..$ : chr "_dta" "_svy_strata1" "strata"
..$ : chr "_dta" "_svy_stages" "1"
..$ : chr "_dta" "_svy_version" "2"
..$ : chr "_dta" "__XijVarLabcons" "(sum) cons"
..$ : chr "_dta" "ReS_Xij" "cons"
..$ : chr "_dta" "ReS_str" "0"
..$ : chr "_dta" "ReS_j" "group"
..$ : chr "_dta" "ReS_ver" "v.2"
..$ : chr "_dta" "ReS_i" "hhid dur"
..$ : chr "_dta" "note1" "variables g1pc, g2pc, g3pc, g4pc, g5pc, g7pc, g8pc, g9pc, g10pc, g11pc, g12pc, gall, health, rent, durables we"| __truncated__
..$ : chr "_dta" "note0" "1"
- attr(*, "version")= int 7
- attr(*, "label.table")=List of 12
..$ location: Named int 1 2
.. ..- attr(*, "names")= chr "urban location" "rural location"
..$ region : Named int 1 2 3 4 5 6 7 8 9
.. ..- attr(*, "names")= chr "Sofia city" "Bourgass" "Varna" "Lovetch" ...
..$ ethnic : Named int 1 2 3 4
.. ..- attr(*, "names")= chr "Bulgaria" "Turks" "Roma" "Other"
..$ s2_q2 : Named int 1 2
.. ..- attr(*, "names")= chr "male" "female"
..$ s2_q3 : Named int 1 2 3 4 5 6 7 8 9
.. ..- attr(*, "names")= chr "head " "spouse/partner " "child " "son/daughter-in-law " ...
..$ lit : Named int 1 2
.. ..- attr(*, "names")= chr "no" "yes"
..$ : Named int 1 2 3 4
.. ..- attr(*, "names")= chr "never attanded" "primary" "secondary" "postsecondary"
..$ : Named int 1 2
.. ..- attr(*, "names")= chr "Not unemployed" "Unemployed"
..$ : Named int 1 2 3 4 5 6 7 8 9 10
.. ..- attr(*, "names")= chr "student" "housewife/childcare" "in retirement" "illness, disability" ...
..$ : Named int 1 2 3 4 5 6 7 8 9 10
.. ..- attr(*, "names")= chr "agriculture" "mining" "manufacturing" "utilities" ...
..$ : Named int 1 2 3 4 5
.. ..- attr(*, "names")= chr "private company" "public works program" "government,public sector, army" "private individual" ...
..$ : Named int 1 2
.. ..- attr(*, "names")= chr "no" "yes"
How many observations (i.e. how many people) are older than 40 and literate?
nrow(subset(edu_equal_1, age > 40 & literate == "yes"))
[1] 6506
How many observations/individuals of Bulgarian ethnicity have an income above 1000?
nrow(subset(edu_equal_1, ethnicity_head == "Bulgaria" & income > 1000))
[1] 8997
Where foreign provided read.dta() to read Stata data, there’s also read.spss() to read SPSS data files. To get a data frame, make sure to set to.data.frame = TRUE inside read.spss().
# Import international.sav as a data frame: demo
demo <- read.spss("international.sav", to.data.frame = TRUE)
re-encoding from CP1252
# Create boxplot of gdp variable of demo
boxplot(demo$gdp)
If you’re familiar with statistics, you’ll have heard of Pearson’s correlation. It is a measure of the linear dependence between two variables, say X and Y, and it ranges from -1 to 1. If it’s close to 1, there is a strong positive association between the variables: if X is high, Y tends to be high too. If it’s close to -1, there is a strong negative association: if X is high, Y tends to be low. When the Pearson correlation between two variables is 0, the variables are possibly independent: there is no linear association between X and Y.
You can calculate the correlation between two vectors with the cor() function. Take this code for example, that computes the correlation between the columns height and width of a fictional data frame size:
cor(size$height, size$width)
What is the correlation coefficient for the two numerical variables gdp and f_illit (female illiteracy rate)?
cor(demo$gdp, demo$f_illit)
[1] -0.4476856
There are many other ways to customize how your SPSS data is imported. One useful argument is use.value.labels. It specifies whether variables with value labels should be converted into R factors with correspondingly named levels. The argument is TRUE by default, which means that so-called labelled variables inside SPSS are converted to factors inside R; set it to FALSE to keep the underlying numeric codes instead.
# Import international.sav as demo_1
demo_1 <- read.spss("international.sav", to.data.frame = TRUE)
re-encoding from CP1252
# Print out the head of demo_1
head(demo_1)
# Import international.sav as demo_2
demo_2 <- read.spss("international.sav", to.data.frame = TRUE, use.value.labels = FALSE)
re-encoding from CP1252
# Print out the head of demo_2
head(demo_2)