The utils package, which is automatically loaded in your R session on startup, can import CSV files with the read.csv() function.
# Import test.csv: pools
pools = read.csv("test.csv")
# Print the structure of pools
str(pools)
'data.frame': 205 obs. of 27 variables:
$ X : int 0 1 2 3 4 5 6 7 8 9 ...
$ symboling : int 3 3 1 2 2 2 1 1 1 0 ...
$ normalized.losses: Factor w/ 52 levels "?","101","102",..: 1 1 1 29 29 1 27 1 27 1 ...
$ make : Factor w/ 22 levels "alfa-romero",..: 1 1 1 2 2 2 2 2 2 2 ...
$ fuel.type : Factor w/ 2 levels "diesel","gas": 2 2 2 2 2 2 2 2 2 2 ...
$ aspiration : Factor w/ 2 levels "std","turbo": 1 1 1 1 1 1 1 1 2 2 ...
$ num.of.doors : Factor w/ 3 levels "?","four","two": 3 3 3 2 2 3 2 2 2 3 ...
$ body.style : Factor w/ 5 levels "convertible",..: 1 1 3 4 4 4 4 5 4 3 ...
$ drive.wheels : Factor w/ 3 levels "4wd","fwd","rwd": 3 3 3 2 1 2 2 2 2 1 ...
$ engine.location : Factor w/ 2 levels "front","rear": 1 1 1 1 1 1 1 1 1 1 ...
$ wheel.base : num 88.6 88.6 94.5 99.8 99.4 ...
$ length : num 169 169 171 177 177 ...
$ width : num 64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
$ height : num 48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
$ curb.weight : int 2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
$ engine.type : Factor w/ 7 levels "dohc","dohcv",..: 1 1 6 4 4 4 4 4 4 4 ...
$ num.of.cylinders : Factor w/ 7 levels "eight","five",..: 3 3 4 3 2 2 2 2 2 2 ...
$ engine.size : int 130 130 152 109 136 136 136 136 131 131 ...
$ fuel.system : Factor w/ 8 levels "1bbl","2bbl",..: 6 6 6 6 6 6 6 6 6 6 ...
$ bore : Factor w/ 39 levels "?","2.54","2.68",..: 25 25 3 15 15 15 15 15 12 12 ...
$ stroke : Factor w/ 37 levels "?","2.07","2.19",..: 6 6 29 26 26 26 26 26 26 26 ...
$ compression.ratio: num 9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
$ horsepower : Factor w/ 60 levels "?","100","101",..: 7 7 22 4 10 6 6 6 17 25 ...
$ peak.rpm : Factor w/ 24 levels "?","4150","4200",..: 12 12 12 18 18 18 18 18 18 18 ...
$ city.mpg : int 21 21 19 24 18 19 19 19 17 16 ...
$ highway.mpg : int 27 27 26 30 22 25 25 25 20 22 ...
$ price : Factor w/ 187 levels "?","10198","10245",..: 33 52 52 38 63 43 65 73 83 1 ...
With stringsAsFactors, you can tell R whether it should convert strings in the flat file to factors.
In R versions before 4.0.0, this argument defaults to TRUE for all importing functions in the utils package, which means that strings are imported as factors. Since R 4.0.0 the default is FALSE. Importing strings as factors only makes sense if the strings represent categorical variables. If you set stringsAsFactors to FALSE, the data frame columns corresponding to strings in your text file will be character.
# Import test.csv: pools
pools = read.csv("test.csv", stringsAsFactors = FALSE)
# Print the structure of pools
str(pools)
'data.frame': 205 obs. of 27 variables:
$ X : int 0 1 2 3 4 5 6 7 8 9 ...
$ symboling : int 3 3 1 2 2 2 1 1 1 0 ...
$ normalized.losses: chr "?" "?" "?" "164" ...
$ make : chr "alfa-romero" "alfa-romero" "alfa-romero" "audi" ...
$ fuel.type : chr "gas" "gas" "gas" "gas" ...
$ aspiration : chr "std" "std" "std" "std" ...
$ num.of.doors : chr "two" "two" "two" "four" ...
$ body.style : chr "convertible" "convertible" "hatchback" "sedan" ...
$ drive.wheels : chr "rwd" "rwd" "rwd" "fwd" ...
$ engine.location : chr "front" "front" "front" "front" ...
$ wheel.base : num 88.6 88.6 94.5 99.8 99.4 ...
$ length : num 169 169 171 177 177 ...
$ width : num 64.1 64.1 65.5 66.2 66.4 66.3 71.4 71.4 71.4 67.9 ...
$ height : num 48.8 48.8 52.4 54.3 54.3 53.1 55.7 55.7 55.9 52 ...
$ curb.weight : int 2548 2548 2823 2337 2824 2507 2844 2954 3086 3053 ...
$ engine.type : chr "dohc" "dohc" "ohcv" "ohc" ...
$ num.of.cylinders : chr "four" "four" "six" "four" ...
$ engine.size : int 130 130 152 109 136 136 136 136 131 131 ...
$ fuel.system : chr "mpfi" "mpfi" "mpfi" "mpfi" ...
$ bore : chr "3.47" "3.47" "2.68" "3.19" ...
$ stroke : chr "2.68" "2.68" "3.47" "3.40" ...
$ compression.ratio: num 9 9 9 10 8 8.5 8.5 8.5 8.3 7 ...
$ horsepower : chr "111" "111" "154" "102" ...
$ peak.rpm : chr "5000" "5000" "5000" "5500" ...
$ city.mpg : int 21 21 19 24 18 19 19 19 17 16 ...
$ highway.mpg : int 27 27 26 30 22 25 25 25 20 22 ...
$ price : chr "13495" "16500" "16500" "13950" ...
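The effect of stringsAsFactors can be reproduced on a small scale. The sketch below is only an illustration: the file contents are made up, and the use of na.strings to treat "?" as missing is an addition that the original import does not use. It suggests one way to keep columns such as price numeric when a file marks missing values with "?".

```r
# Write a tiny CSV to a temporary file (made-up data, for illustration only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("make,price", "audi,13950", "bmw,?", "audi,17450"), tmp)

# Ask explicitly for factors or for character vectors
as_factor <- read.csv(tmp, stringsAsFactors = TRUE)
as_char   <- read.csv(tmp, stringsAsFactors = FALSE)
class(as_factor$make)
class(as_char$make)

# Assumption: "?" marks a missing value, so declare it via na.strings;
# price can then be parsed as a number instead of a string
cleaned <- read.csv(tmp, stringsAsFactors = FALSE, na.strings = "?")
str(cleaned)
```

Setting stringsAsFactors explicitly, as done here, makes the result independent of the R version in use.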
Aside from .csv files, there are also .txt files, which are plain text files. You can import these files with read.delim(). By default, it sets the sep argument to "\t" (fields in a record are delimited by tabs) and the header argument to TRUE (the first row contains the field names).
# Import hotdogs.txt: hotdogs
hotdogs = read.delim("hotdogs.txt", header = FALSE)
# Summarize hotdogs
summary(hotdogs)
V1 V2 V3
Beef :20 Min. : 86.0 Min. :144.0
Meat :17 1st Qu.:132.0 1st Qu.:362.5
Poultry:17 Median :145.0 Median :405.0
Mean :145.4 Mean :424.8
3rd Qu.:172.8 3rd Qu.:503.5
Max. :195.0 Max. :645.0
If you’re dealing with more exotic flat file formats, you’ll want to use read.table(). It’s the most basic importing function, and it accepts many arguments. Unlike read.csv() and read.delim(), the header argument defaults to FALSE and the sep argument is "" by default, which means that fields are separated by any whitespace.
# Path to the hotdogs.txt file: path
path <- file.path("hotdogs.txt")
# Import the hotdogs.txt file: hotdogs
hotdogs <- read.table(path,
sep = "\t",
col.names = c("type", "calories", "sodium"))
# Call head() on hotdogs
head(hotdogs)
# Finish the read.delim() call
hotdogs <- read.delim("hotdogs.txt", header = FALSE, col.names = c("type", "calories", "sodium"))
# Select the hot dog with the least calories: lily
lily <- hotdogs[which.min(hotdogs$calories), ]
# Select the observation with the most sodium: tom
tom <- hotdogs[which.max(hotdogs$sodium), ]
# Print lily and tom
lily
tom
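To see how read.table() and read.delim() relate, the following sketch reads the same tab-delimited file both ways. The three rows are invented stand-ins for hotdogs.txt, and the object names are hypothetical.

```r
# Made-up tab-delimited data without a header row
tmp <- tempfile(fileext = ".txt")
writeLines(c("Beef\t186\t495", "Meat\t140\t428", "Poultry\t86\t358"), tmp)

cols <- c("type", "calories", "sodium")

# read.table(): header defaults to FALSE, so only sep has to be set
via_table <- read.table(tmp, sep = "\t", col.names = cols)

# read.delim(): sep defaults to "\t", so only header has to be set
via_delim <- read.delim(tmp, header = FALSE, col.names = cols)

# The two data frames should hold the same values
all.equal(via_table, via_delim)

# Row with the fewest calories
via_table[which.min(via_table$calories), ]
```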
Next to column names, you can also specify the column types or column classes of the resulting data frame. You can do this by setting the colClasses argument to a vector of strings representing classes.
This approach can be useful if you have some columns that should be factors and others that should be characters. You don’t have to bother with stringsAsFactors anymore; just state for each column what the class should be.
If a column is set to “NULL” in the colClasses vector, this column will be skipped and will not be loaded into the data frame.
# Previous call to import hotdogs.txt
hotdogs <- read.delim("hotdogs.txt", header = FALSE, col.names = c("type", "calories", "sodium"))
# Display structure of hotdogs
str(hotdogs)
'data.frame': 54 obs. of 3 variables:
$ type : Factor w/ 3 levels "Beef","Meat",..: 1 1 1 1 1 1 1 1 1 1 ...
$ calories: int 186 181 176 149 184 190 158 139 175 148 ...
$ sodium : int 495 477 425 322 482 587 370 322 479 375 ...
# Edit the colClasses argument to import the data correctly: hotdogs2
hotdogs2 <- read.delim("hotdogs.txt", header = FALSE,
col.names = c("type", "calories", "sodium"),
colClasses = c("factor", "NULL", "numeric"))
# Display structure of hotdogs2
str(hotdogs2)
'data.frame': 54 obs. of 2 variables:
$ type : Factor w/ 3 levels "Beef","Meat",..: 1 1 1 1 1 1 1 1 1 1 ...
$ sodium: num 495 477 425 322 482 587 370 322 479 375 ...
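A compact, self-contained variant of the same idea is sketched below. The two data rows are made up; the point is only that "NULL" in colClasses drops a column while the remaining columns get the requested classes.

```r
# Made-up tab-delimited data: type, calories, sodium
tmp <- tempfile(fileext = ".txt")
writeLines(c("Beef\t186\t495", "Poultry\t86\t358"), tmp)

# Keep type as character, skip calories, read sodium as integer
hd <- read.delim(tmp, header = FALSE,
                 col.names = c("type", "calories", "sodium"),
                 colClasses = c("character", "NULL", "integer"))

# Only the two requested columns remain
str(hd)
```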
CSV files can be imported with read_csv(). It’s a wrapper function around read_delim() that handles all the details for you. For example, it will assume that the first row contains the column names.
# Load the readr package
library(readr)
# Import potatoes.csv with read_csv(): potatoes
potatoes <- read_csv("potatoes.csv")
Parsed with column specification:
cols(
area = col_double(),
temp = col_double(),
size = col_double(),
storage = col_double(),
method = col_double(),
texture = col_double(),
flavor = col_double(),
moistness = col_double()
)
Where you use read_csv() to easily read in CSV files, you use read_tsv() to easily read in TSV files. TSV is short for tab-separated values.
# Column names
properties <- c("area", "temp", "size", "storage", "method",
"texture", "flavor", "moistness")
# Import potatoes.txt: potatoes
potatoes <- read_tsv("potatoes.txt", col_names = properties)
Parsed with column specification:
cols(
area = col_double(),
temp = col_double(),
size = col_double(),
storage = col_double(),
method = col_double(),
texture = col_double(),
flavor = col_double(),
moistness = col_double()
)
# Call head() on potatoes
head(potatoes)
Just as read.table() was the main utils function, read_delim() is the main readr function.
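read_delim() takes two main arguments: file, the file that contains the data, and delim, the character that separates the values in the data file.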
# Column names
properties <- c("area", "temp", "size", "storage", "method",
"texture", "flavor", "moistness")
# Import potatoes.txt using read_delim(): potatoes
potatoes <- read_delim("potatoes.txt", delim = "\t", col_names = properties)
Parsed with column specification:
cols(
area = col_double(),
temp = col_double(),
size = col_double(),
storage = col_double(),
method = col_double(),
texture = col_double(),
flavor = col_double(),
moistness = col_double()
)
# Print out potatoes
potatoes
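The relationship between read_csv() and read_delim() can be checked on a small file. This is a sketch with invented data, assuming the readr package is installed; the object names are hypothetical.

```r
library(readr)

# Made-up comma-separated data written to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("area,temp,size", "1,1,1", "1,2,3"), tmp)

# read_csv() fixes the delimiter to a comma
via_csv <- read_csv(tmp)

# read_delim() lets you state the delimiter explicitly
via_delim <- read_delim(tmp, delim = ",")

# Both calls should yield the same column names and values
names(via_csv)
names(via_delim)
all.equal(via_csv$size, via_delim$size)
```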
Through skip and n_max you can control which part of your flat file you’re actually importing into R.
Once you skip some lines, you also skip the first line that can contain column names!
# Column names
properties <- c("area", "temp", "size", "storage", "method",
"texture", "flavor", "moistness")
# Import 5 observations from potatoes.txt: potatoes_fragment
potatoes_fragment <- read_tsv("potatoes.txt", skip = 6, n_max = 5, col_names = properties)
Parsed with column specification:
cols(
area = col_double(),
temp = col_double(),
size = col_double(),
storage = col_double(),
method = col_double(),
texture = col_double(),
flavor = col_double(),
moistness = col_double()
)
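The interplay of skip, n_max and col_names is easier to follow on a file small enough to count by hand. The sketch below uses invented data and assumes readr is installed. Because skip also discards the header line, the column names are passed in explicitly.

```r
library(readr)

# Made-up tab-separated data: one header line followed by five rows
tmp <- tempfile(fileext = ".tsv")
writeLines(c("id\tvalue", "1\t10", "2\t20", "3\t30", "4\t40", "5\t50"), tmp)

# Skip the header and the first two data rows, then read two rows.
# The header is gone after skipping, so the names are supplied by hand.
frag <- read_tsv(tmp, skip = 3, n_max = 2, col_names = c("id", "value"))

# Rows with id 3 and 4 remain
frag
```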
You can also specify which types the columns in your imported data frame should have. You can do this with col_types. If set to NULL, the default, functions from the readr package will try to find the correct types themselves. You can manually set the types with a string in which each character denotes the class of one column: c for character, d for double, i for integer and l for logical. An underscore (_) skips the column as a whole.
# Column names
properties <- c("area", "temp", "size", "storage", "method",
"texture", "flavor", "moistness")
# Import all data, but force all columns to be character: potatoes_char
potatoes_char <- read_tsv("potatoes.txt", col_types = "cccccccc", col_names = properties)
# Print out structure of potatoes_char
str(potatoes_char)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 160 obs. of 8 variables:
$ area : chr "1" "1" "1" "1" ...
$ temp : chr "1" "1" "1" "1" ...
$ size : chr "1" "1" "1" "1" ...
$ storage : chr "1" "1" "1" "1" ...
$ method : chr "1" "2" "3" "4" ...
$ texture : chr "2.9" "2.3" "2.5" "2.1" ...
$ flavor : chr "3.2" "2.5" "2.8" "2.9" ...
$ moistness: chr "3.0" "2.6" "2.8" "2.4" ...
- attr(*, "spec")=
.. cols(
.. area = col_character(),
.. temp = col_character(),
.. size = col_character(),
.. storage = col_character(),
.. method = col_character(),
.. texture = col_character(),
.. flavor = col_character(),
.. moistness = col_character()
.. )
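The compact string notation, including the underscore for skipping, can be tried on a tiny file. This is a sketch with made-up data, assuming readr is installed; here the file carries its own header row, so col_names is not needed.

```r
library(readr)

# Made-up tab-separated data with a header row
tmp <- tempfile(fileext = ".tsv")
writeLines(c("type\tcalories\tsodium", "Beef\t186\t495", "Poultry\t86\t358"), tmp)

# "c_i": first column character, second column skipped, third column integer
hd <- read_tsv(tmp, col_types = "c_i")

# Only type and sodium are imported
str(hd)
```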
Another way of setting the types of the imported columns is using collectors. Collector functions can be passed in a list() to the col_types argument of read_ functions to tell them how to interpret values in a column.
For a complete list of collector functions, you can take a look at the collector documentation.
For example:
- col_integer(): the column should be interpreted as an integer.
- col_factor(levels, ordered = FALSE): the column should be interpreted as a factor with levels.
# Import without col_types
hotdogs <- read_tsv("hotdogs.txt", col_names = c("type", "calories", "sodium"))
Parsed with column specification:
cols(
type = col_character(),
calories = col_double(),
sodium = col_double()
)
# Display the summary of hotdogs
summary(hotdogs)
type calories sodium
Length:54 Min. : 86.0 Min. :144.0
Class :character 1st Qu.:132.0 1st Qu.:362.5
Mode :character Median :145.0 Median :405.0
Mean :145.4 Mean :424.8
3rd Qu.:172.8 3rd Qu.:503.5
Max. :195.0 Max. :645.0
# The collectors you will need to import the data
fac <- col_factor(levels = c("Beef", "Meat", "Poultry"))
int <- col_integer()
# Edit the col_types argument to import the data correctly: hotdogs_factor
hotdogs_factor <- read_tsv("hotdogs.txt",
col_names = c("type", "calories", "sodium"),
col_types = list(fac, int, int))
# Display the summary of hotdogs_factor
summary(hotdogs_factor)
type calories sodium
Beef :20 Min. : 86.0 Min. :144.0
Meat :17 1st Qu.:132.0 1st Qu.:362.5
Poultry:17 Median :145.0 Median :405.0
Mean :145.4 Mean :424.8
3rd Qu.:172.8 3rd Qu.:503.5
Max. :195.0 Max. :645.0
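Collectors can also be passed by column name through cols(), which avoids relying on their position in a list. The sketch below is an alternative to the list() form used above, not a replacement for it; the data are invented and readr is assumed to be installed.

```r
library(readr)

# Made-up tab-separated data with a header row
tmp <- tempfile(fileext = ".tsv")
writeLines(c("type\tcalories\tsodium",
             "Beef\t186\t495",
             "Meat\t140\t428",
             "Poultry\t86\t358"), tmp)

# Name each collector after the column it applies to
spec <- cols(
  type     = col_factor(levels = c("Beef", "Meat", "Poultry")),
  calories = col_integer(),
  sodium   = col_integer()
)

hd <- read_tsv(tmp, col_types = spec)

# type is now a factor with the stated levels
levels(hd$type)
```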
library(data.table)
package ‘data.table’ was built under R version 3.6.3
Registered S3 method overwritten by 'data.table':
method from
print.data.table
data.table 1.12.8 using 1 threads (see ?getDTthreads). Latest news: r-datatable.com
You still remember how to use read.table(), right? Well, fread() is a function that does the same job with very similar arguments. It is extremely easy to use and blazingly fast! Often, simply specifying the path to the file is enough to successfully import your data.
# Import potatoes.csv with fread(): potatoes
potatoes <- fread("potatoes.csv")
# Print out potatoes
potatoes
fread() also accepts two arguments, drop and select, to drop or select variables of interest.
# Import columns 6 and 8 of potatoes.csv: potatoes
potatoes <- fread("potatoes.csv", select = c(6, 8))
# Plot texture (x) and moistness (y) of potatoes
plot(x = potatoes$texture, y = potatoes$moistness, main = "Potatoes", xlab = "Texture", ylab = "Moistness")
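The select and drop arguments are two routes to the same subset of columns. The following sketch uses a made-up four-column file and assumes data.table is installed; it refers to columns by name, which fread() accepts as well as column numbers.

```r
library(data.table)

# Made-up CSV with four columns
tmp <- tempfile(fileext = ".csv")
writeLines(c("area,temp,texture,moistness",
             "1,1,2.9,3.0",
             "1,2,2.3,2.6"), tmp)

# Keep only the columns of interest ...
by_select <- fread(tmp, select = c("texture", "moistness"))

# ... or drop everything else; both leave the same two columns
by_drop <- fread(tmp, drop = c("area", "temp"))

names(by_select)
names(by_drop)
```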
You might have noticed that the fread() function produces data frames that look slightly different when you print them out. That’s because another class named data.table is assigned to the resulting data frames. The printout of such data.table objects is different. Does something similar happen with the data frames generated by readr?
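One way to explore this question is to compare the class vectors directly. The sketch below assumes both readr and data.table are installed and uses made-up data; the exact class names printed can vary between package versions, so only inheritance is checked.

```r
library(readr)
library(data.table)

# Made-up CSV read by both packages
tmp <- tempfile(fileext = ".csv")
writeLines(c("texture,moistness", "2.9,3.0", "2.3,2.6"), tmp)

from_fread <- fread(tmp)
from_readr <- read_csv(tmp)

# Each package adds its own classes on top of data.frame
class(from_fread)
class(from_readr)

# Both objects still behave as data frames
inherits(from_fread, "data.frame")
inherits(from_readr, "data.frame")
```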
Before you can start importing from Excel, you should find out which sheets are available in the workbook. You can use the excel_sheets() function for this.
# Load the readxl package
library(readxl)
# Print the names of all worksheets
excel_sheets("urbanpop.xlsx")
[1] "1960-1966" "1967-1974" "1975-2011"
Now that you know the names of the sheets in the Excel file you want to import, it is time to import those sheets into R. You can do this with the read_excel() function.
# Read the sheets, one by one
pop_1 <- read_excel("urbanpop.xlsx", sheet = 1)
pop_2 <- read_excel("urbanpop.xlsx", sheet = 2)
pop_3 <- read_excel("urbanpop.xlsx", sheet = "1975-2011")
# Put pop_1, pop_2 and pop_3 in a list: pop_list
pop_list <- list(pop_1, pop_2, pop_3)
# Display the structure of pop_list
str(pop_list)
List of 3
$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 209 obs. of 8 variables:
..$ country: chr [1:209] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
..$ 1960 : num [1:209] 769308 494443 3293999 NA NA ...
..$ 1961 : num [1:209] 814923 511803 3515148 13660 8724 ...
..$ 1962 : num [1:209] 858522 529439 3739963 14166 9700 ...
..$ 1963 : num [1:209] 903914 547377 3973289 14759 10748 ...
..$ 1964 : num [1:209] 951226 565572 4220987 15396 11866 ...
..$ 1965 : num [1:209] 1000582 583983 4488176 16045 13053 ...
..$ 1966 : num [1:209] 1058743 602512 4649105 16693 14217 ...
$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 209 obs. of 9 variables:
..$ country: chr [1:209] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
..$ 1967 : num [1:209] 1119067 621180 4826104 17349 15440 ...
..$ 1968 : num [1:209] 1182159 639964 5017299 17996 16727 ...
..$ 1969 : num [1:209] 1248901 658853 5219332 18619 18088 ...
..$ 1970 : num [1:209] 1319849 677839 5429743 19206 19529 ...
..$ 1971 : num [1:209] 1409001 698932 5619042 19752 20929 ...
..$ 1972 : num [1:209] 1502402 720207 5815734 20263 22406 ...
..$ 1973 : num [1:209] 1598835 741681 6020647 20742 23937 ...
..$ 1974 : num [1:209] 1696445 763385 6235114 21194 25482 ...
$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 209 obs. of 38 variables:
..$ country: chr [1:209] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
..$ 1975 : num [1:209] 1793266 785350 6460138 21632 27019 ...
..$ 1976 : num [1:209] 1905033 807990 6774099 22047 28366 ...
..$ 1977 : num [1:209] 2021308 830959 7102902 22452 29677 ...
..$ 1978 : num [1:209] 2142248 854262 7447728 22899 31037 ...
..$ 1979 : num [1:209] 2268015 877898 7810073 23457 32572 ...
..$ 1980 : num [1:209] 2398775 901884 8190772 24177 34366 ...
..$ 1981 : num [1:209] 2493265 927224 8637724 25173 36356 ...
..$ 1982 : num [1:209] 2590846 952447 9105820 26342 38618 ...
..$ 1983 : num [1:209] 2691612 978476 9591900 27655 40983 ...
..$ 1984 : num [1:209] 2795656 1006613 10091289 29062 43207 ...
..$ 1985 : num [1:209] 2903078 1037541 10600112 30524 45119 ...
..$ 1986 : num [1:209] 3006983 1072365 11101757 32014 46254 ...
..$ 1987 : num [1:209] 3113957 1109954 11609104 33548 47019 ...
..$ 1988 : num [1:209] 3224082 1146633 12122941 35095 47669 ...
..$ 1989 : num [1:209] 3337444 1177286 12645263 36618 48577 ...
..$ 1990 : num [1:209] 3454129 1198293 13177079 38088 49982 ...
..$ 1991 : num [1:209] 3617842 1215445 13708813 39600 51972 ...
..$ 1992 : num [1:209] 3788685 1222544 14248297 41049 54469 ...
..$ 1993 : num [1:209] 3966956 1222812 14789176 42443 57079 ...
..$ 1994 : num [1:209] 4152960 1221364 15322651 43798 59243 ...
..$ 1995 : num [1:209] 4347018 1222234 15842442 45129 60598 ...
..$ 1996 : num [1:209] 4531285 1228760 16395553 46343 60927 ...
..$ 1997 : num [1:209] 4722603 1238090 16935451 47527 60462 ...
..$ 1998 : num [1:209] 4921227 1250366 17469200 48705 59685 ...
..$ 1999 : num [1:209] 5127421 1265195 18007937 49906 59281 ...
..$ 2000 : num [1:209] 5341456 1282223 18560597 51151 59719 ...
..$ 2001 : num [1:209] 5564492 1315690 19198872 52341 61062 ...
..$ 2002 : num [1:209] 5795940 1352278 19854835 53583 63212 ...
..$ 2003 : num [1:209] 6036100 1391143 20529356 54864 65802 ...
..$ 2004 : num [1:209] 6285281 1430918 21222198 56166 68301 ...
..$ 2005 : num [1:209] 6543804 1470488 21932978 57474 70329 ...
..$ 2006 : num [1:209] 6812538 1512255 22625052 58679 71726 ...
..$ 2007 : num [1:209] 7091245 1553491 23335543 59894 72684 ...
..$ 2008 : num [1:209] 7380272 1594351 24061749 61118 73335 ...
..$ 2009 : num [1:209] 7679982 1635262 24799591 62357 73897 ...
..$ 2010 : num [1:209] 7990746 1676545 25545622 63616 74525 ...
..$ 2011 : num [1:209] 8316976 1716842 26216968 64817 75207 ...
Loading in every sheet manually and then merging them in a list can be quite tedious. Luckily, you can automate this with lapply().
# Read all Excel sheets with lapply(): pop_list
pop_list <- lapply(excel_sheets("urbanpop.xlsx"),
read_excel,
path = "urbanpop.xlsx")
# Display the structure of pop_list
str(pop_list)
List of 3
$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 209 obs. of 8 variables:
..$ country: chr [1:209] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
..$ 1960 : num [1:209] 769308 494443 3293999 NA NA ...
..$ 1961 : num [1:209] 814923 511803 3515148 13660 8724 ...
..$ 1962 : num [1:209] 858522 529439 3739963 14166 9700 ...
..$ 1963 : num [1:209] 903914 547377 3973289 14759 10748 ...
..$ 1964 : num [1:209] 951226 565572 4220987 15396 11866 ...
..$ 1965 : num [1:209] 1000582 583983 4488176 16045 13053 ...
..$ 1966 : num [1:209] 1058743 602512 4649105 16693 14217 ...
$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 209 obs. of 9 variables:
..$ country: chr [1:209] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
..$ 1967 : num [1:209] 1119067 621180 4826104 17349 15440 ...
..$ 1968 : num [1:209] 1182159 639964 5017299 17996 16727 ...
..$ 1969 : num [1:209] 1248901 658853 5219332 18619 18088 ...
..$ 1970 : num [1:209] 1319849 677839 5429743 19206 19529 ...
..$ 1971 : num [1:209] 1409001 698932 5619042 19752 20929 ...
..$ 1972 : num [1:209] 1502402 720207 5815734 20263 22406 ...
..$ 1973 : num [1:209] 1598835 741681 6020647 20742 23937 ...
..$ 1974 : num [1:209] 1696445 763385 6235114 21194 25482 ...
$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 209 obs. of 38 variables:
..$ country: chr [1:209] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
..$ 1975 : num [1:209] 1793266 785350 6460138 21632 27019 ...
..$ 1976 : num [1:209] 1905033 807990 6774099 22047 28366 ...
..$ 1977 : num [1:209] 2021308 830959 7102902 22452 29677 ...
..$ 1978 : num [1:209] 2142248 854262 7447728 22899 31037 ...
..$ 1979 : num [1:209] 2268015 877898 7810073 23457 32572 ...
..$ 1980 : num [1:209] 2398775 901884 8190772 24177 34366 ...
..$ 1981 : num [1:209] 2493265 927224 8637724 25173 36356 ...
..$ 1982 : num [1:209] 2590846 952447 9105820 26342 38618 ...
..$ 1983 : num [1:209] 2691612 978476 9591900 27655 40983 ...
..$ 1984 : num [1:209] 2795656 1006613 10091289 29062 43207 ...
..$ 1985 : num [1:209] 2903078 1037541 10600112 30524 45119 ...
..$ 1986 : num [1:209] 3006983 1072365 11101757 32014 46254 ...
..$ 1987 : num [1:209] 3113957 1109954 11609104 33548 47019 ...
..$ 1988 : num [1:209] 3224082 1146633 12122941 35095 47669 ...
..$ 1989 : num [1:209] 3337444 1177286 12645263 36618 48577 ...
..$ 1990 : num [1:209] 3454129 1198293 13177079 38088 49982 ...
..$ 1991 : num [1:209] 3617842 1215445 13708813 39600 51972 ...
..$ 1992 : num [1:209] 3788685 1222544 14248297 41049 54469 ...
..$ 1993 : num [1:209] 3966956 1222812 14789176 42443 57079 ...
..$ 1994 : num [1:209] 4152960 1221364 15322651 43798 59243 ...
..$ 1995 : num [1:209] 4347018 1222234 15842442 45129 60598 ...
..$ 1996 : num [1:209] 4531285 1228760 16395553 46343 60927 ...
..$ 1997 : num [1:209] 4722603 1238090 16935451 47527 60462 ...
..$ 1998 : num [1:209] 4921227 1250366 17469200 48705 59685 ...
..$ 1999 : num [1:209] 5127421 1265195 18007937 49906 59281 ...
..$ 2000 : num [1:209] 5341456 1282223 18560597 51151 59719 ...
..$ 2001 : num [1:209] 5564492 1315690 19198872 52341 61062 ...
..$ 2002 : num [1:209] 5795940 1352278 19854835 53583 63212 ...
..$ 2003 : num [1:209] 6036100 1391143 20529356 54864 65802 ...
..$ 2004 : num [1:209] 6285281 1430918 21222198 56166 68301 ...
..$ 2005 : num [1:209] 6543804 1470488 21932978 57474 70329 ...
..$ 2006 : num [1:209] 6812538 1512255 22625052 58679 71726 ...
..$ 2007 : num [1:209] 7091245 1553491 23335543 59894 72684 ...
..$ 2008 : num [1:209] 7380272 1594351 24061749 61118 73335 ...
..$ 2009 : num [1:209] 7679982 1635262 24799591 62357 73897 ...
..$ 2010 : num [1:209] 7990746 1676545 25545622 63616 74525 ...
..$ 2011 : num [1:209] 8316976 1716842 26216968 64817 75207 ...
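The lapply() pattern can be tried without urbanpop.xlsx by using an example workbook that ships with readxl. In this sketch, readxl_example("datasets.xlsx") stands in for the file from the text, and naming the list elements after the sheets is an optional extra step.

```r
library(readxl)

# Path to an example workbook bundled with readxl (stand-in for urbanpop.xlsx)
path <- readxl_example("datasets.xlsx")

# Sheet names are passed one by one as the sheet argument of read_excel()
sheets <- excel_sheets(path)
all_sheets <- lapply(sheets, read_excel, path = path)

# Optional: name each list element after its sheet for easier access
names(all_sheets) <- sheets

# One data frame per sheet
length(all_sheets)
names(all_sheets)
```

Naming the elements makes it possible to write all_sheets[["some sheet"]] instead of remembering sheet positions.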
Apart from path and sheet, there are several other arguments you can specify in read_excel(). One of these arguments is called col_names.
By default it is TRUE, meaning that the first row of the Excel sheet contains the column names. If this is not the case, you can set col_names to FALSE. In this case, R will choose column names for you. You can also choose to set col_names to a character vector with names for each column. It works exactly the same as in the readr package.
# Import the first Excel sheet of urbanpop_nonames.xlsx (R gives names): pop_a
pop_a <- read_excel("urbanpop_nonames.xlsx",
col_names = FALSE)
New names:
* `` -> ...1
* `` -> ...2
* `` -> ...3
* `` -> ...4
* `` -> ...5
* ... and 3 more problems
# Import the first Excel sheet of urbanpop_nonames.xlsx (specify col_names): pop_b
cols <- c("country", paste0("year_", 1960:1966))
pop_b <- read_excel("urbanpop_nonames.xlsx",
col_names = cols)
# Print the summary of pop_a
summary(pop_a)
...1 ...2 ...3
Length:209 Min. : 3378 Min. : 1028
Class :character 1st Qu.: 88978 1st Qu.: 70644
Mode :character Median : 580675 Median : 570159
Mean : 4988124 Mean : 4991613
3rd Qu.: 3077228 3rd Qu.: 2807280
Max. :126469700 Max. :129268133
NA's :11
...4 ...5 ...6
Min. : 1090 Min. : 1154 Min. : 1218
1st Qu.: 74974 1st Qu.: 81870 1st Qu.: 84953
Median : 593968 Median : 619331 Median : 645262
Mean : 5141592 Mean : 5303711 Mean : 5468966
3rd Qu.: 2948396 3rd Qu.: 3148941 3rd Qu.: 3296444
Max. :131974143 Max. :134599886 Max. :137205240
...7 ...8
Min. : 1281 Min. : 1349
1st Qu.: 88633 1st Qu.: 93638
Median : 679109 Median : 735139
Mean : 5637394 Mean : 5790281
3rd Qu.: 3317422 3rd Qu.: 3418036
Max. :139663053 Max. :141962708
summary(pop_b)
country year_1960 year_1961
Length:209 Min. : 3378 Min. : 1028
Class :character 1st Qu.: 88978 1st Qu.: 70644
Mode :character Median : 580675 Median : 570159
Mean : 4988124 Mean : 4991613
3rd Qu.: 3077228 3rd Qu.: 2807280
Max. :126469700 Max. :129268133
NA's :11
year_1962 year_1963 year_1964
Min. : 1090 Min. : 1154 Min. : 1218
1st Qu.: 74974 1st Qu.: 81870 1st Qu.: 84953
Median : 593968 Median : 619331 Median : 645262
Mean : 5141592 Mean : 5303711 Mean : 5468966
3rd Qu.: 2948396 3rd Qu.: 3148941 3rd Qu.: 3296444
Max. :131974143 Max. :134599886 Max. :137205240
year_1965 year_1966
Min. : 1281 Min. : 1349
1st Qu.: 88633 1st Qu.: 93638
Median : 679109 Median : 735139
Mean : 5637394 Mean : 5790281
3rd Qu.: 3317422 3rd Qu.: 3418036
Max. :139663053 Max. :141962708
Another argument that can be very useful when reading in less tidy Excel files is skip. With skip, you can tell R to ignore a specified number of rows inside the Excel sheets you’re trying to pull data from.
# Import the second sheet of urbanpop.xlsx, skipping the first 21 rows: urbanpop_sel
urbanpop_sel <- read_excel("urbanpop.xlsx", sheet = 2, skip = 21, col_names = FALSE)
New names:
* `` -> ...1
* `` -> ...2
* `` -> ...3
* `` -> ...4
* `` -> ...5
* ... and 4 more problems
# Print out the first observation from urbanpop_sel
head(urbanpop_sel, n=1)
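The effect of skip and col_names can be checked against a normal import of the same sheet. The sketch below again uses the example workbook bundled with readxl as a stand-in for urbanpop.xlsx; the generated column names are hypothetical.

```r
library(readxl)

# Example workbook bundled with readxl (stand-in for urbanpop.xlsx)
path <- readxl_example("datasets.xlsx")

# Default import: the first row supplies the column names
with_header <- read_excel(path, sheet = 1)

# Skip that header row and supply custom names instead
new_names <- paste0("col_", seq_len(ncol(with_header)))
renamed <- read_excel(path, sheet = 1, skip = 1, col_names = new_names)

# Same data rows, different column names
dim(with_header)
dim(renamed)
names(renamed)
```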
In this part of the chapter you’ll learn how to import .xls files using the gdata package. Similar to the readxl package, you can import single sheets from Excel workbooks to start your analysis in R.
library("gdata")
gdata: Unable to locate valid perl interpreter
gdata:
gdata: read.xls() will be unable to read Excel XLS and XLSX
gdata: files unless the 'perl=' argument is used to specify
gdata: the location of a valid perl intrpreter.
gdata:
gdata: (To avoid display of this message in the future, please
gdata: ensure perl is installed and available on the
gdata: executable search path.)
gdata: Unable to load perl libaries needed by read.xls()
gdata: to support 'XLX' (Excel 97-2004) files.
gdata: Unable to load perl libaries needed by read.xls()
gdata: to support 'XLSX' (Excel 2007+) files.
gdata: Run the function 'installXLSXsupport()'
gdata: to automatically download and install the perl
gdata: libaries needed to support Excel XLS and XLSX formats.
Attaching package: ‘gdata’
The following objects are masked from ‘package:data.table’:
first, last
The following object is masked from ‘package:stats’:
nobs
The following object is masked from ‘package:utils’:
object.size
The following object is masked from ‘package:base’:
startsWith
# Import the second sheet of urbanpop.xls: urban_pop
urban_pop <- read.xls("urbanpop.xls", sheet = 2, perl="c:/Program Files/Git/usr/bin/perl.exe")
# Print the first 11 observations using head()
head(urban_pop, 11)
Remember how read.xls() actually works? It basically comes down to two steps: converting the Excel file to a .csv file using a Perl script, and then reading that .csv file with the read.csv() function that is loaded by default in R, through the utils package.
This means that all the options that you can specify in read.csv(), can also be specified in read.xls().
# Column names for urban_pop
columns <- c("country", paste0("year_", 1967:1974))
# Finish the read.xls call
urban_pop <- read.xls("urbanpop.xls", sheet = 2,
skip = 50, header = FALSE, stringsAsFactors = FALSE,
col.names = columns, perl="c:/Program Files/Git/usr/bin/perl.exe")
# Print the first 10 observations of urban_pop
head(urban_pop, 10)
Now that you can read in Excel data, let’s try to clean and merge it. You already used the cbind() function some exercises ago.
# Add code to import data from all three sheets in urbanpop.xls
path <- "urbanpop.xls"
p <- "c:/Program Files/Git/usr/bin/perl.exe"
urban_sheet1 <- read.xls(path, sheet = 1, stringsAsFactors = FALSE, perl=p)
urban_sheet2 <- read.xls(path, sheet = 2, stringsAsFactors = FALSE, perl=p)
urban_sheet3 <- read.xls(path, sheet = 3, stringsAsFactors = FALSE, perl=p)
# Extend the cbind() call to include urban_sheet3: urban
urban <- cbind(urban_sheet1, urban_sheet2[-1], urban_sheet3[-1])
# Remove all rows with NAs from urban: urban_clean
urban_clean <- na.omit(urban)
# Print out a summary of urban_clean
summary(urban_clean)
country X1960 X1961
Length:197 Min. : 3378 Min. : 3433
Class :character 1st Qu.: 87735 1st Qu.: 92905
Mode :character Median : 599714 Median : 630788
Mean : 5012388 Mean : 5282488
3rd Qu.: 3130085 3rd Qu.: 3155370
Max. :126469700 Max. :129268133
X1962 X1963 X1964
Min. : 3481 Min. : 3532 Min. : 3586
1st Qu.: 98331 1st Qu.: 104988 1st Qu.: 112084
Median : 659464 Median : 704989 Median : 740609
Mean : 5440972 Mean : 5612312 Mean : 5786961
3rd Qu.: 3250211 3rd Qu.: 3416490 3rd Qu.: 3585464
Max. :131974143 Max. :134599886 Max. :137205240
X1965 X1966 X1967
Min. : 3644 Min. : 3706 Min. : 3771
1st Qu.: 119322 1st Qu.: 128565 1st Qu.: 138024
Median : 774957 Median : 809768 Median : 838449
Mean : 5964970 Mean : 6126413 Mean : 6288771
3rd Qu.: 3666724 3rd Qu.: 3871757 3rd Qu.: 4019906
Max. :139663053 Max. :141962708 Max. :144201722
X1968 X1969 X1970
Min. : 3835 Min. : 3893 Min. : 3941
1st Qu.: 147846 1st Qu.: 158252 1st Qu.: 171063
Median : 890270 Median : 929450 Median : 976471
Mean : 6451367 Mean : 6624909 Mean : 6799110
3rd Qu.: 4158186 3rd Qu.: 4300669 3rd Qu.: 4440047
Max. :146340364 Max. :148475901 Max. :150922373
X1971 X1972 X1973
Min. : 4017 Min. : 4084 Min. : 4146
1st Qu.: 181483 1st Qu.: 189492 1st Qu.: 197792
Median : 1008630 Median : 1048738 Median : 1097293
Mean : 6980895 Mean : 7165338 Mean : 7349454
3rd Qu.: 4595966 3rd Qu.: 4766545 3rd Qu.: 4838297
Max. :152863831 Max. :154530473 Max. :156034106
X1974 X1975 X1976
Min. : 4206 Min. : 4267 Min. : 4334
1st Qu.: 205410 1st Qu.: 211746 1st Qu.: 216991
Median : 1159402 Median : 1223146 Median : 1249829
Mean : 7540446 Mean : 7731973 Mean : 7936401
3rd Qu.: 4906384 3rd Qu.: 5003370 3rd Qu.: 5121118
Max. :157488074 Max. :159452730 Max. :165583752
X1977 X1978 X1979
Min. : 4402 Min. : 4470 Min. : 4539
1st Qu.: 222209 1st Qu.: 227605 1st Qu.: 233461
Median : 1311276 Median : 1340811 Median : 1448185
Mean : 8145945 Mean : 8361360 Mean : 8583138
3rd Qu.: 5227677 3rd Qu.: 5352746 3rd Qu.: 5558850
Max. :171550310 Max. :177605736 Max. :183785364
X1980 X1981 X1982
Min. : 4607 Min. : 4645 Min. : 4681
1st Qu.: 242583 1st Qu.: 248948 1st Qu.: 257944
Median : 1592397 Median : 1673079 Median : 1713060
Mean : 8808772 Mean : 9049163 Mean : 9295226
3rd Qu.: 5815772 3rd Qu.: 6070457 3rd Qu.: 6337995
Max. :189947471 Max. :199385258 Max. :209435968
X1983 X1984 X1985
Min. : 4716 Min. : 4750 Min. : 4782
1st Qu.: 274139 1st Qu.: 284939 1st Qu.: 300928
Median : 1730626 Median : 1749033 Median : 1786125
Mean : 9545035 Mean : 9798559 Mean : 10058661
3rd Qu.: 6619987 3rd Qu.: 6918261 3rd Qu.: 6931780
Max. :219680098 Max. :229872397 Max. :240414890
X1986 X1987 X1988
Min. : 4809 Min. : 4835 Min. : 4859
1st Qu.: 307699 1st Qu.: 321125 1st Qu.: 334616
Median : 1850910 Median : 1953694 Median : 1997011
Mean : 10323839 Mean : 10595817 Mean : 10873041
3rd Qu.: 6935763 3rd Qu.: 6939905 3rd Qu.: 6945022
Max. :251630158 Max. :263433513 Max. :275570541
X1989 X1990 X1991
Min. : 4883 Min. : 4907 Min. : 4946
1st Qu.: 347348 1st Qu.: 370152 1st Qu.: 394611
Median : 1993544 Median : 2066505 Median : 2150230
Mean : 11154458 Mean : 11438543 Mean : 11725076
3rd Qu.: 6885378 3rd Qu.: 6830026 3rd Qu.: 6816589
Max. :287810747 Max. :300165618 Max. :314689997
X1992 X1993 X1994
Min. : 4985 Min. : 5024 Min. : 5062
1st Qu.: 418788 1st Qu.: 427457 1st Qu.: 435959
Median : 2237405 Median : 2322158 Median : 2410297
Mean : 12010922 Mean : 12296949 Mean : 12582930
3rd Qu.: 6820099 3rd Qu.: 7139656 3rd Qu.: 7499901
Max. :329099365 Max. :343555327 Max. :358232230
X1995 X1996 X1997
Min. : 5100 Min. : 5079 Min. : 5055
1st Qu.: 461993 1st Qu.: 488136 1st Qu.: 494203
Median : 2482393 Median : 2522460 Median : 2606125
Mean : 12871480 Mean : 13165924 Mean : 13463675
3rd Qu.: 7708571 3rd Qu.: 7686092 3rd Qu.: 7664316
Max. :373035157 Max. :388936607 Max. :405031716
X1998 X1999 X2000
Min. : 5029 Min. : 5001 Min. : 4971
1st Qu.: 498002 1st Qu.: 505144 1st Qu.: 525629
Median : 2664983 Median : 2737809 Median : 2826647
Mean : 13762861 Mean : 14063387 Mean : 14369278
3rd Qu.: 7784056 3rd Qu.: 8083488 3rd Qu.: 8305564
Max. :421147610 Max. :437126845 Max. :452999147
X2001 X2002 X2003
Min. : 5003 Min. : 5034 Min. : 5064
1st Qu.: 550638 1st Qu.: 567531 1st Qu.: 572094
Median : 2925851 Median : 2928252 Median : 2944934
Mean : 14705743 Mean : 15043381 Mean : 15384513
3rd Qu.: 8421967 3rd Qu.: 8448628 3rd Qu.: 8622732
Max. :473204511 Max. :493402140 Max. :513607776
X2004 X2005 X2006
Min. : 5090 Min. : 5111 Min. : 5135
1st Qu.: 593900 1st Qu.: 620511 1st Qu.: 632659
Median : 2994356 Median : 3057923 Median : 3269963
Mean : 15730299 Mean : 16080262 Mean : 16435872
3rd Qu.: 8999112 3rd Qu.: 9394001 3rd Qu.: 9689807
Max. :533892175 Max. :554367818 Max. :575050081
X2007 X2008 X2009
Min. : 5155 Min. : 5172 Min. : 5189
1st Qu.: 645172 1st Qu.: 658017 1st Qu.: 671085
Median : 3432024 Median : 3589395 Median : 3652338
Mean : 16797484 Mean : 17164898 Mean : 17533997
3rd Qu.: 9803381 3rd Qu.: 10210317 3rd Qu.: 10518289
Max. :595731464 Max. :616552722 Max. :637533976
X2010 X2011
Min. : 5206 Min. : 5233
1st Qu.: 684302 1st Qu.: 698009
Median : 3676309 Median : 3664664
Mean : 17904811 Mean : 18276297
3rd Qu.: 10618596 3rd Qu.: 10731193
Max. :658557734 Max. :678796403
When working with XLConnect, the first step is to load a workbook into your R session with loadWorkbook(); this function builds a “bridge” between your Excel file and your R session.
library("XLConnect")
# Build connection to urbanpop.xlsx: my_book
my_book <- loadWorkbook("urbanpop.xlsx")
# Print out the class of my_book
class(my_book)
[1] "workbook"
attr(,"package")
[1] "XLConnect"
Just as with readxl and gdata, you can use XLConnect to import data from an Excel file into R.
To list the sheets in an Excel file, use getSheets(). To actually import data from a sheet, you can use readWorksheet(). Both functions require an XLConnect workbook object as the first argument.
# Build connection to urbanpop.xlsx
my_book <- loadWorkbook("urbanpop.xlsx")
# List the sheets in my_book
getSheets(my_book)
[1] "1960-1966" "1967-1974" "1975-2011"
# Import the second sheet in my_book
readWorksheet(my_book, 2)
# Build connection to urbanpop.xlsx
my_book <- loadWorkbook("urbanpop.xlsx")
# Import columns 3, 4, and 5 from second sheet in my_book: urbanpop_sel
urbanpop_sel <- readWorksheet(my_book, sheet = 2, startCol = 3, endCol = 5)
# Import first column from second sheet in my_book: countries
countries <- readWorksheet(my_book, sheet = 2, startCol = 1, endCol = 1)
# cbind() urbanpop_sel and countries together: selection
selection <- cbind(countries, urbanpop_sel)
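Note that cbind() pairs rows purely by position, so the two imports have to cover the same rows of the sheet. A minimal base R sketch of that behaviour, using small made-up data frames as stand-ins for countries and urbanpop_sel (the values are hypothetical, not taken from urbanpop.xlsx):

```r
# Made-up stand-ins for the two imports; the real data comes from urbanpop.xlsx
countries <- data.frame(country = c("A", "B", "C"))
urbanpop_sel <- data.frame(X1968 = c(10, 20, 30),
                           X1969 = c(11, 21, 31),
                           X1970 = c(12, 22, 32))

# cbind() glues the columns side by side: row 1 with row 1, row 2 with row 2, ...
selection <- cbind(countries, urbanpop_sel)

# Both inputs must have the same number of rows for this to be meaningful
stopifnot(nrow(countries) == nrow(urbanpop_sel))
str(selection)
```

If the two readWorksheet() calls used different startRow or endRow values, the rows would no longer line up, so it is probably safest to keep those arguments identical in both calls.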
Whereas readxl and gdata can only import Excel data, XLConnect provides an actual interface to an Excel file, so you can also edit your Excel files from inside R. In this exercise, you’ll create a new sheet. In the next exercise, you’ll populate the sheet with data and save the results in a new Excel file.
# Build connection to urbanpop.xlsx
my_book <- loadWorkbook("urbanpop.xlsx")
# Add a worksheet to my_book, named "data_summary"
createSheet(my_book, name = "data_summary")
# Use getSheets() on my_book
getSheets(my_book)
[1] "1960-1966" "1967-1974" "1975-2011" "data_summary"
The first step of creating a sheet is done; let’s populate it with some data now! summ, a data frame with some summary statistics on the three data sheets in the Excel file, is already coded so you can take it from there.
# Build connection to urbanpop.xlsx
my_book <- loadWorkbook("urbanpop.xlsx")
# Add a worksheet to my_book, named "data_summary"
createSheet(my_book, "data_summary")
# Create data frame: summ
sheets <- getSheets(my_book)[1:3]
dims <- sapply(sheets, function(x) dim(readWorksheet(my_book, sheet = x)), USE.NAMES = FALSE)
summ <- data.frame(sheets = sheets,
nrows = dims[1, ],
ncols = dims[2, ])
# Add data in summ to "data_summary" sheet
writeWorksheet(my_book, summ, sheet = "data_summary")
# Save workbook as summary.xlsx
saveWorkbook(my_book, file = "summary.xlsx")
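The way summ is built can be hard to picture: sapply() collects the dim() of every sheet into a matrix with one column per sheet, so row 1 holds the row counts and row 2 the column counts. A small base R sketch of the same pattern, assuming a hypothetical list toy_sheets of made-up data frames in place of the imported worksheets:

```r
# Made-up stand-ins for the imported sheets (not the urbanpop.xlsx data)
toy_sheets <- list(
  first  = data.frame(country = c("A", "B", "C"), y1960 = 1:3, y1961 = 4:6),
  second = data.frame(country = c("A", "B", "C"), y1967 = 7:9)
)
sheet_names <- names(toy_sheets)

# dim() returns c(nrow, ncol); sapply() binds these vectors as matrix columns
dims <- sapply(sheet_names, function(x) dim(toy_sheets[[x]]), USE.NAMES = FALSE)

# Row 1 of dims = number of rows per sheet, row 2 = number of columns per sheet
summ <- data.frame(sheets = sheet_names,
                   nrows = dims[1, ],
                   ncols = dims[2, ])
summ
```

In the exercise, the only difference is that each data frame comes from readWorksheet(my_book, sheet = x) instead of a list lookup.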
# Rename "data_summary" sheet to "summary"
renameSheet(my_book, "data_summary", "summary")
# Print out sheets of my_book
getSheets(my_book)
[1] "1960-1966" "1967-1974" "1975-2011" "summary"
# Save workbook to "renamed.xlsx"
saveWorkbook(my_book, file = "renamed.xlsx")
# Build connection to renamed.xlsx: my_book
my_book <- loadWorkbook("renamed.xlsx")
# Remove the fourth sheet, "summary"
removeSheet(my_book, sheet = "summary")
# Save workbook to "clean.xlsx"
saveWorkbook(my_book, file = "clean.xlsx")
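To see the whole create, write, rename, save cycle in one place, here is a minimal sketch that runs against a scratch file instead of urbanpop.xlsx. The temporary path, the object names wb, wb2, and check, and the small data frame are all assumptions made for illustration; the sketch also assumes XLConnect and its Java dependency are installed.

```r
library(XLConnect)

# Hypothetical scratch file, so the sketch does not touch urbanpop.xlsx
path <- tempfile(fileext = ".xlsx")

# create = TRUE starts a new, empty workbook when the file does not exist yet
wb <- loadWorkbook(path, create = TRUE)

# Add a sheet, fill it with made-up data, then rename it
createSheet(wb, name = "data_summary")
writeWorksheet(wb,
               data.frame(sheets = c("a", "b"), nrows = c(10, 20)),
               sheet = "data_summary")
renameSheet(wb, "data_summary", "summary")

# Nothing is written to disk until saveWorkbook() is called
saveWorkbook(wb, file = path)

# Reload from disk to check what was actually saved
wb2 <- loadWorkbook(path)
getSheets(wb2)
check <- readWorksheet(wb2, sheet = "summary")
check
```

Reloading the saved file is a cheap way to confirm that edits made on the workbook object really ended up in the Excel file.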