Importing Data in R

Loading data into the R environment is one of the initial steps of any project. R allows importing data from different sources and formats.

Reading a comma-delimited text file (CSV)

Comma-separated values (CSV) files are one of the most common data file types. They look like this:

year,azimuth,distance,plant_height,long.inflo,status (this line is the ‘header’)
2019,35.0,6.22,51,17,E
2019,35.0,6.22,24,10,X
2019,35.0,6.22,3,0,E
2019,25.0,6.67,20,20,X
2019,25.0,4.53,36,9,S
2019,24.0,4.53,8,0,E
2019,5.0,5.22,10,0,S
2019,340.0,6.06,32,0,E

Reading files with read.csv

For a data file, with a structure as above, you can use the read.csv command:

melo.csv.def <- read.csv("melocactus 2019 2020.csv")
head(melo.csv.def)
##   year azimuth distance plant_height long.inflo status
## 1 2019      35     6.22           51         17     E 
## 2 2019      35     6.22           24         10      X
## 3 2019      35     6.22            3          0      E
## 4 2019      25     6.67           20         20      X
## 5 2019      25     4.53           36          9      S
## 6 2019      24     4.53            8          0      E

Note the different types of data that are in the file (integer, double-precision, character)

The command read.csv use the following default values for the header and sep parameters:

  • header = TRUE (the first line is the name of the columns)
  • sep = “,” (comma is the separator)

Using header = FALSE read the first line as data:

melo.csv.headF <- read.csv("melocactus 2019 2020.csv", header = FALSE)
head(melo.csv.headF)
##     V1      V2       V3           V4         V5     V6
## 1 year azimuth distance plant_height long.inflo status
## 2 2019    35.0     6.22           51         17     E 
## 3 2019    35.0     6.22           24         10      X
## 4 2019    35.0     6.22            3          0      E
## 5 2019    25.0     6.67           20         20      X
## 6 2019    25.0     4.53           36          9      S

You can see that not only the column’s names are now part of the data, but also the data types are changed.

Reading a tab-delimited file (tsv) using read.csv

A tab-delimited file (tsv) can also be read using the read.csv command, just changing the sep parameter:

melo.tsv <- read.csv("melocactus 2019 2020.tsv", sep = "\t")
head(melo.tsv)
##   year azimuth distance plant_height long.inflo status
## 1 2019      35     6.22           51         17     E 
## 2 2019      35     6.22           24         10      X
## 3 2019      35     6.22            3          0      E
## 4 2019      25     6.67           20         20      X
## 5 2019      25     4.53           36          9      S
## 6 2019      24     4.53            8          0      E

Reading an ‘Excel’ file

It is a common practice to transcribe field data to an electronic spreadsheet, usually an Excel worksheet. Using the read_excel command, from the readxl package, we can import data from a selected Excel worksheet in an Excel workbook, to a named data.frame into the R environment.

library(readxl)
melo.excel <- read_excel("melocactus 2019 2020.xlsx",
                         sheet = "data")
head(melo.excel)
## # A tibble: 6 × 6
##    year azimuth distance plant_height long.inflo status
##   <dbl>   <dbl>    <dbl>        <dbl>      <dbl> <chr> 
## 1  2019      35     6.22           51         17 E     
## 2  2019      35     6.22           24         10 X     
## 3  2019      35     6.22            3          0 E     
## 4  2019      25     6.67           20         20 X     
## 5  2019      25     4.53           36          9 S     
## 6  2019      24     4.53            8          0 E

Reading from a public Google Sheets file

‘Google Sheets’ is another widely used web-based spreadsheet, that requires a specific command to connect and retrieve data to R, using the spreadsheet URL. The URL for the Google Sheets file must be public for the command to connect to it.

library (gsheet)
url <- 'https://docs.google.com/spreadsheets/d/1pW4LrQ211JicRekJdRZlHjGD2IOowjaspxNwrVLpOjk/edit#gid=0'
melo.gs <- gsheet2tbl(url)
head(melo.gs)
## # A tibble: 6 × 6
##    year azimuth distance plant_height long.inflo status
##   <dbl>   <dbl>    <dbl>        <dbl>      <dbl> <chr> 
## 1  2019      35     6.22           51         17 E     
## 2  2019      35     6.22           24         10 X     
## 3  2019      35     6.22            3          0 E     
## 4  2019      25     6.67           20         20 X     
## 5  2019      25     4.53           36          9 S     
## 6  2019      24     4.53            8          0 E

Introduction to data wrangling

Data wrangling, in R jargon, is the art of getting your data into R in a useful form for visualization, analyzing, and modeling. We just accomplished the initial part of the process by importing data to the R environment. Now we are going to tidy up our data to prepare for descriptive statistics and visualization.

Several commands in the tidyverse package are suitable to complete the process. We will use three of them: mutate, select, and filter.

Adding new variables with mutate

The Melocactus data set contains two variables to locate each individual in the population: azimuth and distance. With these variables, we can locate each individual in a plot using polar coordinates. Usually, it is more common and somewhat easier, if we have cartesian (X - Y) coordinates. Using a trigonometric transformation, we can convert the polar coordinates to cartesian coordinates: \[X\ coordinates = distance *Sin(azimuth)\]

\[Y\ coordinates = distance*Cos(azimuth)\] Also, we do not want negative coordinates (just a matter of aesthetics), so we can shift the origin in a way that negative values disappear, adding 100 to each coordinate value. All the previous calculations can be done easily using the mutate command to create new variables with the results of the transformation formulas.

library(tidyverse)
melo.mutate <- melo.csv.def %>% 
  mutate(x_coord = distance * sin(azimuth * pi / 180),
         y_coord = distance * cos(azimuth * pi / 180))
head(melo.mutate)
##   year azimuth distance plant_height long.inflo status  x_coord  y_coord
## 1 2019      35     6.22           51         17     E  3.567645 5.095126
## 2 2019      35     6.22           24         10      X 3.567645 5.095126
## 3 2019      35     6.22            3          0      E 3.567645 5.095126
## 4 2019      25     6.67           20         20      X 2.818864 6.045073
## 5 2019      25     4.53           36          9      S 1.914461 4.105574
## 6 2019      24     4.53            8          0      E 1.842517 4.138361

Selecting the variables to be used with select

A large data set with many variables could be confusing if we are not going to use all of them. Using select we keep the variables that we need in a new object and preserve all the variables in the source data. Once we use distance and azimuth to calculate the new coordinates, we can get rid of them and create a more slim data object.

melo.select <- melo.mutate %>% 
  select(year, plant_height:y_coord)
head(melo.select)
##   year plant_height long.inflo status  x_coord  y_coord
## 1 2019           51         17     E  3.567645 5.095126
## 2 2019           24         10      X 3.567645 5.095126
## 3 2019            3          0      E 3.567645 5.095126
## 4 2019           20         20      X 2.818864 6.045073
## 5 2019           36          9      S 1.914461 4.105574
## 6 2019            8          0      E 1.842517 4.138361

Filtering to subset data using filter

Our data are for two sampling years, but we may want to analyze them separately; using filter we can produce a data object for each year.

# 2019
melo.2019 <- melo.select %>% 
  filter(year == 2019)
tail(melo.2019)
##     year plant_height long.inflo status   x_coord   y_coord
## 120 2019            8          0      S -17.99897 -16.20634
## 121 2019            3          0      S -19.02088 -15.96042
## 122 2019            4          0      S -19.02088 -15.96042
## 123 2019            7          0      S -17.47204 -16.29294
## 124 2019            1          0      S -17.47204 -16.29294
## 125 2019            5          0      S -17.47204 -16.29294
# 2020
melo.2020 <- melo.select %>% 
  filter(year == 2020)
head(melo.2020)
##   year plant_height long.inflo status       x_coord    y_coord
## 1 2020           37         13      E -1.586973e+00  -4.360174
## 2 2020           14          0      S  7.078458e-16  -5.780000
## 3 2020           30          0      E  2.271014e+00  -6.239559
## 4 2020           45         13      E  2.402693e+00  -6.977927
## 5 2020           11          0      X  5.660268e+00  -9.420269
## 6 2020            6          0      X  6.655786e+00 -10.651484