Loading data into the R environment is one of the initial steps of any project. R allows importing data from different sources and formats.
Comma-separated values (CSV) files are one of the most common data file types. They look like this:
year,azimuth,distance,plant_height,long.inflo,status (this line
is the ‘header’)
2019,35.0,6.22,51,17,E
2019,35.0,6.22,24,10,X
2019,35.0,6.22,3,0,E
2019,25.0,6.67,20,20,X
2019,25.0,4.53,36,9,S
2019,24.0,4.53,8,0,E
2019,5.0,5.22,10,0,S
2019,340.0,6.06,32,0,E
…
For a data file, with a structure as above, you can use the read.csv command:
melo.csv.def <- read.csv("melocactus 2019 2020.csv")
head(melo.csv.def)
## year azimuth distance plant_height long.inflo status
## 1 2019 35 6.22 51 17 E
## 2 2019 35 6.22 24 10 X
## 3 2019 35 6.22 3 0 E
## 4 2019 25 6.67 20 20 X
## 5 2019 25 4.53 36 9 S
## 6 2019 24 4.53 8 0 E
Note the different types of data that are in the file (integer, double-precision, character)
The command read.csv use the following default values for the header and sep parameters:
Using header = FALSE read the first line as data:
melo.csv.headF <- read.csv("melocactus 2019 2020.csv", header = FALSE)
head(melo.csv.headF)
## V1 V2 V3 V4 V5 V6
## 1 year azimuth distance plant_height long.inflo status
## 2 2019 35.0 6.22 51 17 E
## 3 2019 35.0 6.22 24 10 X
## 4 2019 35.0 6.22 3 0 E
## 5 2019 25.0 6.67 20 20 X
## 6 2019 25.0 4.53 36 9 S
You can see that not only the column’s names are now part of the data, but also the data types are changed.
A tab-delimited file (tsv) can also be read using the read.csv command, just changing the sep parameter:
melo.tsv <- read.csv("melocactus 2019 2020.tsv", sep = "\t")
head(melo.tsv)
## year azimuth distance plant_height long.inflo status
## 1 2019 35 6.22 51 17 E
## 2 2019 35 6.22 24 10 X
## 3 2019 35 6.22 3 0 E
## 4 2019 25 6.67 20 20 X
## 5 2019 25 4.53 36 9 S
## 6 2019 24 4.53 8 0 E
It is a common practice to transcribe field data to an electronic spreadsheet, usually an Excel worksheet. Using the read_excel command, from the readxl package, we can import data from a selected Excel worksheet in an Excel workbook, to a named data.frame into the R environment.
library(readxl)
melo.excel <- read_excel("melocactus 2019 2020.xlsx",
sheet = "data")
head(melo.excel)
## # A tibble: 6 × 6
## year azimuth distance plant_height long.inflo status
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 2019 35 6.22 51 17 E
## 2 2019 35 6.22 24 10 X
## 3 2019 35 6.22 3 0 E
## 4 2019 25 6.67 20 20 X
## 5 2019 25 4.53 36 9 S
## 6 2019 24 4.53 8 0 E
‘Google Sheets’ is another widely used web-based spreadsheet, that requires a specific command to connect and retrieve data to R, using the spreadsheet URL. The URL for the Google Sheets file must be public for the command to connect to it.
library (gsheet)
url <- 'https://docs.google.com/spreadsheets/d/1pW4LrQ211JicRekJdRZlHjGD2IOowjaspxNwrVLpOjk/edit#gid=0'
melo.gs <- gsheet2tbl(url)
head(melo.gs)
## # A tibble: 6 × 6
## year azimuth distance plant_height long.inflo status
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 2019 35 6.22 51 17 E
## 2 2019 35 6.22 24 10 X
## 3 2019 35 6.22 3 0 E
## 4 2019 25 6.67 20 20 X
## 5 2019 25 4.53 36 9 S
## 6 2019 24 4.53 8 0 E
Data wrangling, in R jargon, is the art of getting your data into R in a useful form for visualization, analyzing, and modeling. We just accomplished the initial part of the process by importing data to the R environment. Now we are going to tidy up our data to prepare for descriptive statistics and visualization.
Several commands in the tidyverse package are suitable to complete the process. We will use three of them: mutate, select, and filter.
The Melocactus data set contains two variables to locate each individual in the population: azimuth and distance. With these variables, we can locate each individual in a plot using polar coordinates. Usually, it is more common and somewhat easier, if we have cartesian (X - Y) coordinates. Using a trigonometric transformation, we can convert the polar coordinates to cartesian coordinates: \[X\ coordinates = distance *Sin(azimuth)\]
\[Y\ coordinates = distance*Cos(azimuth)\] Also, we do not want negative coordinates (just a matter of aesthetics), so we can shift the origin in a way that negative values disappear, adding 100 to each coordinate value. All the previous calculations can be done easily using the mutate command to create new variables with the results of the transformation formulas.
library(tidyverse)
melo.mutate <- melo.csv.def %>%
mutate(x_coord = distance * sin(azimuth * pi / 180),
y_coord = distance * cos(azimuth * pi / 180))
head(melo.mutate)
## year azimuth distance plant_height long.inflo status x_coord y_coord
## 1 2019 35 6.22 51 17 E 3.567645 5.095126
## 2 2019 35 6.22 24 10 X 3.567645 5.095126
## 3 2019 35 6.22 3 0 E 3.567645 5.095126
## 4 2019 25 6.67 20 20 X 2.818864 6.045073
## 5 2019 25 4.53 36 9 S 1.914461 4.105574
## 6 2019 24 4.53 8 0 E 1.842517 4.138361
A large data set with many variables could be confusing if we are not going to use all of them. Using select we keep the variables that we need in a new object and preserve all the variables in the source data. Once we use distance and azimuth to calculate the new coordinates, we can get rid of them and create a more slim data object.
melo.select <- melo.mutate %>%
select(year, plant_height:y_coord)
head(melo.select)
## year plant_height long.inflo status x_coord y_coord
## 1 2019 51 17 E 3.567645 5.095126
## 2 2019 24 10 X 3.567645 5.095126
## 3 2019 3 0 E 3.567645 5.095126
## 4 2019 20 20 X 2.818864 6.045073
## 5 2019 36 9 S 1.914461 4.105574
## 6 2019 8 0 E 1.842517 4.138361
Our data are for two sampling years, but we may want to analyze them separately; using filter we can produce a data object for each year.
# 2019
melo.2019 <- melo.select %>%
filter(year == 2019)
tail(melo.2019)
## year plant_height long.inflo status x_coord y_coord
## 120 2019 8 0 S -17.99897 -16.20634
## 121 2019 3 0 S -19.02088 -15.96042
## 122 2019 4 0 S -19.02088 -15.96042
## 123 2019 7 0 S -17.47204 -16.29294
## 124 2019 1 0 S -17.47204 -16.29294
## 125 2019 5 0 S -17.47204 -16.29294
# 2020
melo.2020 <- melo.select %>%
filter(year == 2020)
head(melo.2020)
## year plant_height long.inflo status x_coord y_coord
## 1 2020 37 13 E -1.586973e+00 -4.360174
## 2 2020 14 0 S 7.078458e-16 -5.780000
## 3 2020 30 0 E 2.271014e+00 -6.239559
## 4 2020 45 13 E 2.402693e+00 -6.977927
## 5 2020 11 0 X 5.660268e+00 -9.420269
## 6 2020 6 0 X 6.655786e+00 -10.651484