Exploring R data structures & managing data

Recap

26.05.2024: During the previous session, we were introduced to the program by learning how to install it, how to use it, and a few concepts that are fundamental to the R program.

In particular, we spoke about:

Expressions

The basic interaction mode in R is expression evaluation. Expressions typically involve variable references and operators.

Examples include arithmetic, relational, and logical expressions.

Objects

Anything that can be assigned a value. For R, that is just about everything (data, functions, graphs, analytic results, and more). Every object has a class attribute telling R how to handle it. All objects are kept in memory during an interactive session.

<- is used as the assignment operator.
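
As a quick refresher, the sketch below evaluates one arithmetic, one relational, and one logical expression, then stores a result in an object with <-. The object name total is only illustrative.

```r
# Arithmetic expression: multiplication is done before addition
2 + 3 * 4

# Relational and logical expressions return TRUE or FALSE
5 > 3
(5 > 3) & (2 > 4)

# Assign the result of an expression to an object, then inspect it
total <- 2 + 3 * 4
total
class(total)   # every object has a class
```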

Today’s session focuses on the initial stage of data analysis, which involves creating a dataset that contains the relevant information in a format that suits our requirements. We will explore various methods for importing data into the R program, including manual entry from external sources. Additionally, we will learn how to annotate or modify the dataset.

This session covers

  • Introduction to R data structures

  • Importing data frames

  • Managing data

The creation of a dataset, in a format that meets our needs, involves the following:

  • Choice of a data structure to hold our data

  • Entering or importing our data into the data structure

Many methods are available for importing data into R. Data can be entered manually or imported from an external source, which may include text files, spreadsheets, datasets from statistical packages, and database-management systems.

Once a dataset is created, in this workshop we will annotate it by adding descriptive labels for variables and variable codes, and we will look at functions for modifying the dataset according to our needs.

Datasets, Data structures and Data frames: What’s in the Name?

Datasets: A dataset is usually a rectangular array of data with rows representing observations and columns representing variables.

There are, however, different traditions for naming the rows and columns of a dataset. In biology we call them observations and variables, or sites and covariates; database analysts call them records and fields; and those from the data-mining and machine-learning disciplines call them examples and attributes.

Recording ecological data

Importance of biological record keeping

The data we write down should make sense to us at the time of data collection and should also make sense to future scientists looking to repeat or verify our work.

Collecting new data should be part of our initial planning, and should continue without modification of the format.

If we have past data, or data that need reformatting, we may have to spend some time rearranging them into a common format before we can do anything useful.

Elements of biological records

There are some basic elements of biological records that should always be included:
  • Where: the location that the data were collected from.

  • When: the date that the data were recorded.

  • What: the species we are dealing with.

  • Who: the name of the person that recorded the data.

  • Other variables that may be added, depending upon our purpose.

Recording in the field

When we are in the field and using our field notebook, we may well use shortcuts to record the information required. Some items do not need to be repeated: a single date at the top of every sheet is sufficient.

In our field notebook or recording sheet we may keep separate pages for each site and end up with a column of figures for each site.

In general, we should aim to create a column for each item of data that we collect. If we were looking at species abundance at several sites for example, then we would need at least two columns, one for the abundance data and one for the site.

Supporting info/covariates/remarks

The date, location, and the name of the person collecting the data are basic items that we always need, but there may also be additional information that will help us to understand the biological situation as we process the data later. These things include field sketches and site photographs. A field sketch can be very helpful because we can record details that may be hard to represent in any other manner. A sketch can also help us to remember where we placed our quadrats; a grid reference is fine but meaningless without a map! Photographs may also be helpful, and digital photography enables lots of images to be captured with minimum fuss; however, it is also easy to get carried away and forget what we were there for in the first place. Any supporting information should be just that: support for the main event, our data.

Record structure and arrangement in computer

Our data sets prepared for analyses can be formatted in different ways:

Analysis-ready dataset: summarised layout (only the data needed for a particular objective or analysis are set out)

When we enter biological data, we enter each record on a separate line and set out the spreadsheet so that each column represents a factor.

Analysis-ready dataset: data table layout (all data are set out in separate columns)

The table shows a dataset recording the abundance of four butterfly species. In this example, Observer 1 and Observer 2 are trying to ascertain the abundance of various butterfly species at several sites.

If someone tries to repeat the survey, they will know what time of year it was carried out. Likewise, if environmental conditions change, it will be essential to know in which month or season the work was done. If we fail to collect complete biological data, or fail to retain and communicate all the details in full, then our work may be rendered unrepeatable and therefore useless as a contribution to science.

If we wrote down the information separately, we would end up with several smaller tables of data and it would be difficult to carry out any actual analyses.

Having this strict biological recording format allows great flexibility, especially if we end up with a lot of data. Our data are now in the form of a database, so our spreadsheet can extract and summarise them easily and lets us modify the data later using sorting, filters, or pivot tables.

The structure of the data here is a rectangular array

The variables in this dataset are: site (used as the row names); species, the name of the butterfly species recorded; observer, the person recording the data; date, a date variable; the grid reference; the sampling unit reference; and abundance, treated as a continuous variable.

Once our biological data are compiled in this format, we can sort them by the various columns, export the grid references to mapping programs, and convert the data into tables for further calculations using a spreadsheet. They can also be imported into databases and other computer programs for statistical analysis.

We have now met two key terms: data structure and data type.

Data structures hold data falling under various data types. The data types or modes that R can handle include numeric, character, logical (TRUE/FALSE).

R contains a wide variety of structures for holding data, including scalars, vectors, arrays, data frames, and lists.

There is one more term, the data frame. A data frame is a structure in R that holds data and is similar to the datasets found in standard statistical packages (for example, SAS, SPSS, and Stata). The columns are variables, and the rows are observations. We can have variables of different types (for example, numeric or character) in the same data frame. Data frames are the main structures we use to store datasets.

Data Types

The four data types covered here are numeric, character, logical, and complex. (R also has integer and raw types, which we will not need in this session.)

A numeric object, such as N below, contains numeric values.

N<-3
N
## [1] 3

A character object stores a character string.

S<-"hello world"
S
## [1] "hello world"

A logical object contains results of a logical comparison.

For example, if we ask:

3 > 4
## [1] FALSE

This is a logical comparison ("is 3 larger than 4?"), and the answer to a logical comparison is either "yes" (TRUE) or "no" (FALSE).

Mode

Data type is known as “mode” in R. The R function mode can be used to get the data type of an object:

mode("hi")
## [1] "character"
mode(1)
## [1] "numeric"
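
For comparison, the short sketch below applies mode() to a logical and a complex value, and shows that mode(), class(), and typeof() can describe the same object at different levels of detail. The object name x is only illustrative.

```r
mode(TRUE)      # "logical"
mode(2 + 3i)    # "complex"

x <- 5L         # the L suffix creates an integer
mode(x)         # "numeric": integers and doubles share this mode
class(x)        # "integer"
typeof(3.14)    # "double"
```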

Data structures

R has a wide variety of objects for holding data, including scalars, vectors, matrices, arrays, data frames, and lists. They differ in terms of the type of data they can hold, how they’re created, their structural complexity, and the notation used to identify and access individual elements.

Scalars

Scalars are one-element vectors.

f<-3
g <-"Subhasish"
h <-TRUE

f
## [1] 3
g
## [1] "Subhasish"
h
## [1] TRUE

They’re used to hold constants.

Vectors

Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data. The combine function c() is used to form the vector. Here are examples of each type of vector.

a <- c(1, 2, 5, 3, 6, -2, 4)
a
## [1]  1  2  5  3  6 -2  4
b <- c("Subhasish", "Sourav", "Sathish")
b
## [1] "Subhasish" "Sourav"    "Sathish"
c <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
c
## [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE

Note that the data in a vector must be of only one type or mode (numeric, character, or logical). We can't mix modes in the same vector.
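
To see what happens when modes are mixed, and how individual elements are picked out of a vector, here is a minimal sketch. The object names mixed and num_log are illustrative only.

```r
# Mixing modes does not fail: R silently converts everything to one mode
mixed <- c(1, "a", TRUE)
mixed            # "1" "a" "TRUE"
mode(mixed)      # "character"

num_log <- c(1, TRUE, FALSE)
num_log          # 1 1 0 (logical values become numbers)

# Elements are selected by position with square brackets
a <- c(1, 2, 5, 3, 6, -2, 4)
a[3]             # third element: 5
a[c(1, 3, 5)]    # first, third and fifth elements: 1 5 6
a[2:4]           # second to fourth elements: 2 5 3
```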

Matrix

A matrix is a set of elements of the same mode (numeric, character, or logical) arranged in rows and columns. Matrices are created with the matrix() function.

matrix.sample <-matrix(vector, nrow=number_of_rows, ncol=number_of_columns, byrow=logical_value, dimnames=list( char_vector_rownames, char_vector_colnames))

where vector contains the elements for the matrix, nrow and ncol specify the row and column dimensions, and dimnames contain optional row and column labels stored in character vectors. The option byrow indicates whether the matrix should be filled in by row (byrow=TRUE) or by column (byrow=FALSE). The default is by column. The following listing demonstrates the matrix function.

y <- matrix(1:20, nrow=5, ncol=4)
y
##      [,1] [,2] [,3] [,4]
## [1,]    1    6   11   16
## [2,]    2    7   12   17
## [3,]    3    8   13   18
## [4,]    4    9   14   19
## [5,]    5   10   15   20

Matrices are two-dimensional and, like vectors, can contain only one data type. When there are more than two dimensions, we use arrays. When there are multiple modes of data, we use data frames.
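
The byrow and dimnames options, and the subscript notation for matrices, can be sketched as follows. The names cells, rnames, cnames, and m are illustrative.

```r
cells  <- c(1, 26, 24, 68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")

# Fill the matrix row by row and label rows and columns
m <- matrix(cells, nrow = 2, ncol = 2, byrow = TRUE,
            dimnames = list(rnames, cnames))
m

m[2, ]        # second row
m[, 2]        # second column
m[1, 2]       # element in row 1, column 2: 26
m["R2", "C1"] # the same idea using names: 24
```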

Arrays

Arrays are similar to matrices but can have more than two dimensions. They’re created with an array function of the following form:

myarray <- array(vector, dimensions, dimnames)

where vector contains the data for the array, dimensions is a numeric vector giving the maximal index for each dimension, and dimnames is an optional list of dimension labels.

The following gives an example of creating a three-dimensional (2 × 3 × 4) array of numbers.

dim1 <- c("A1", "A2") 
dim2 <- c("B1", "B2", "B3")
dim3 <- c("C1", "C2", "C3", "C4")
  

array1 <- array(1:24, c(2, 3, 4), dimnames=list(dim1, dim2, dim3))
array1
## , , C1
## 
##    B1 B2 B3
## A1  1  3  5
## A2  2  4  6
## 
## , , C2
## 
##    B1 B2 B3
## A1  7  9 11
## A2  8 10 12
## 
## , , C3
## 
##    B1 B2 B3
## A1 13 15 17
## A2 14 16 18
## 
## , , C4
## 
##    B1 B2 B3
## A1 19 21 23
## A2 20 22 24

In the second example below, a character vector and a numeric vector are combined, so every element is converted to character. The nine combined values are recycled to fill both 3 × 3 matrices.

vector1 <- c("A", "B", "C") 
vector2 <- c(10, 11, 12, 13, 14, 15) 
column.names <- c("COL1", "COL2", "COL3") 
row.names <- c("ROW1", "ROW2", "ROW3") 
matrix.names <- c("Matrix1", "Matrix2") 
  
result <- array(c(vector1, vector2), dim = c(3, 3, 2), 
                  dimnames = list(row.names, column.names, 
                  matrix.names)) 
print(result) 
## , , Matrix1
## 
##      COL1 COL2 COL3
## ROW1 "A"  "10" "13"
## ROW2 "B"  "11" "14"
## ROW3 "C"  "12" "15"
## 
## , , Matrix2
## 
##      COL1 COL2 COL3
## ROW1 "A"  "10" "13"
## ROW2 "B"  "11" "14"
## ROW3 "C"  "12" "15"

Arrays are a natural extension of matrices. They can be useful in programming new statistical methods. Like matrices, they must be a single mode. Identifying elements follows what you’ve seen for matrices.
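
As an illustration of identifying elements, the sketch below uses one subscript per dimension on an array with the same shape as array1 (without the dimension labels). The name z is illustrative.

```r
z <- array(1:24, c(2, 3, 4))

z[1, 2, 3]    # row 1, column 2, layer 3: 15
z[, , 1]      # the whole first layer, returned as a 2 x 3 matrix
dim(z)        # 2 3 4
```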

Data frames

The data frame is the fundamental unit for doing data analysis in R. A data frame is rather like a spreadsheet: a table-like form of data that R can read, in which columns represent variables and rows represent observations (or cases).

The survey dataset in Table 1 consists of numeric and character data. Because there are multiple modes of data, we can't hold the data in a matrix. In this case, a data frame is the structure of choice.

A data frame is created with the data.frame() function

mydata <- data.frame(col1, col2, col3,...) 

where col1, col2, col3, and so on are column vectors of any type (such as character, numeric, or logical). Names for each column can be provided with the names function.

One of the easiest ways to understand a data frame is to learn to make one. This way we can see how the basic components of a data frame fit together.

Let's start by making an object called sites composed of the names of 7 sites: s1, s2, and so on up to s7.

sites <- c("s1", "s2", "s3", "s4", "s5", "s6", "s7")
sites
## [1] "s1" "s2" "s3" "s4" "s5" "s6" "s7"

Now, let's start building a data frame by adding some more variables. Let's add the butterfly count at each site, the observer who made the record, and a habitat-quality score. Numeric entries are written without quotation marks; character entries such as the observer codes need them.

butterfly <- c(10, 20, 1, 2, 12, 2, 6) 
observer<-c("A", "B", "B","A", "B","A", "B")
habitatquality<-c(4, 3, 1, 2, 3,4, 4 )

Now we can make our first data frame which we can call my.df by using the data.frame() function:

my.df <- data.frame(sites, butterfly, observer, habitatquality) 
my.df
##   sites butterfly observer habitatquality
## 1    s1        10        A              4
## 2    s2        20        B              3
## 3    s3         1        B              1
## 4    s4         2        A              2
## 5    s5        12        B              3
## 6    s6         2        A              4
## 7    s7         6        B              4

This makes an object called my.df. Data frames are always rectangular in shape.

That means that each vector which makes up the data frame must be of the same length.

If vectors of different lengths are used, the data.frame() function will return an error, unless the shorter vector can be recycled an exact number of times to match the longer one.

To see the entire data frame (or any object), we highlight its name in the script and run it, or type its name in the console.
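
The length requirement can be checked with a small sketch. Here counts_short is a deliberately short, made-up vector, and tryCatch() is used only so that the error message is captured instead of stopping the script.

```r
sites <- c("s1", "s2", "s3", "s4", "s5", "s6", "s7")
counts_short <- c(10, 20, 1)   # only 3 values for 7 sites

# 3 does not divide 7, so data.frame() refuses to build the table
result <- tryCatch(
  data.frame(sites, counts_short),
  error = function(e) conditionMessage(e)
)
result

# A single value is recycled to every row without error
ok <- data.frame(sites, year = 2024)
nrow(ok)   # 7
```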

There are several ways to identify the elements of a data frame. We can use the subscript notation we used before (for example, with matrices), or we can specify column names. Using the data frame created above, the following listing demonstrates these approaches.

Identifying columns

my.df[1]
##   sites
## 1    s1
## 2    s2
## 3    s3
## 4    s4
## 5    s5
## 6    s6
## 7    s7
my.df$butterfly
## [1] 10 20  1  2 12  2  6

The $ notation is used to indicate a particular variable from a given data frame.

For example, if we want to cross-tabulate sites by observer, we can use the following code:

table(my.df$sites, my.df$observer)
##     
##      A B
##   s1 1 0
##   s2 0 1
##   s3 0 1
##   s4 1 0
##   s5 0 1
##   s6 1 0
##   s7 0 1
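
Columns can also be selected by name with square brackets. The sketch below builds a small stand-in data frame (df, with three made-up rows) and contrasts the forms that return a data frame with those that return a plain vector.

```r
df <- data.frame(sites     = c("s1", "s2", "s3"),
                 butterfly = c(10, 20, 1),
                 observer  = c("A", "B", "B"))

df[c("sites", "butterfly")]   # still a data frame, with two columns
df[["butterfly"]]             # a plain numeric vector
df$butterfly                  # the same vector via $

identical(df[["butterfly"]], df$butterfly)   # TRUE
```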

Lists

Lists are the most complex of the R data structures. Basically, a list is an ordered collection of objects (components).

A list allows us to gather a variety of (possibly unrelated) objects under one name.

For example, a list may contain a combination of vectors, matrices, data frames, and even other lists.

title <- "bear rehab list" 
age.month <- c(5, 6, 8, 9) 
x.matrix <- matrix(1:4, nrow=4)
chr <- c("Sagalee", "Den", "Ithan","Papum")


mylist <- list(title=title, agesinmonth=age.month, ID=x.matrix, Names=chr)
mylist
## $title
## [1] "bear rehab list"
## 
## $agesinmonth
## [1] 5 6 8 9
## 
## $ID
##      [,1]
## [1,]    1
## [2,]    2
## [3,]    3
## [4,]    4
## 
## $Names
## [1] "Sagalee" "Den"     "Ithan"   "Papum"
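
Components of a list can be pulled out by position or by name. A minimal sketch, using a shortened copy of the list above (without the matrix component):

```r
mylist <- list(title = "bear rehab list",
               agesinmonth = c(5, 6, 8, 9),
               Names = c("Sagalee", "Den", "Ithan", "Papum"))

mylist[[2]]                # second component, by position
mylist[["agesinmonth"]]    # the same component, by name
mylist$Names[3]            # third element of the Names component
mean(mylist$agesinmonth)   # components can be used like any other object

class(mylist["title"])     # single brackets return a list, not the component
```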

Saving a data frame in our computer

It is simple to save a data frame as a “.csv” file.

To save an object to a file, simply pass its name and the desired filename to the write.csv() function.

To prevent the addition of a column containing row reference numbers, we should usually use the row.names = FALSE argument. Our saved file will then appear identical to the data frame created in R. The file is written to the current working directory, which getwd() reports.

write.csv(my.df, "firstdataframe.csv", row.names = FALSE)

getwd()
## [1] "C:/Users/Subhasish Arandhara/Desktop/Workshop R in ecology"
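
The effect of row.names = FALSE can be seen by writing a small data frame both ways and reading it back. This is only a sketch: demo.df is a made-up data frame, and temporary files are used so that nothing is left behind in the working directory.

```r
demo.df <- data.frame(sites = c("s1", "s2"), butterfly = c(10, 20))

with_rn    <- tempfile(fileext = ".csv")
without_rn <- tempfile(fileext = ".csv")

write.csv(demo.df, with_rn)                        # default keeps row names
write.csv(demo.df, without_rn, row.names = FALSE)  # no extra column

back_with    <- read.csv(with_rn)
back_without <- read.csv(without_rn)

ncol(back_with)      # 3: an extra column of row numbers
ncol(back_without)   # 2: same shape as demo.df

unlink(c(with_rn, without_rn))   # remove the temporary files
```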

Data input: Importing a data frame

For most purposes, we will want to work on our real data, but knowing how to simulate a data frame is useful (e.g., to make a reproducible example, to understand R, and to practice modelling approaches).

Now that we know our data structures, we need to put some data in them! When doing data analyses, we are typically faced with data that come from a variety of sources and in a variety of formats. Our task is to import the data into our tools, analyse the data, and report on the results. R provides a wide range of tools for importing data.

Tips for importing a data frame

Step 1: Build our data in a spreadsheet using the principles of tidy data (i.e. every column is a variable, every row is an observation, each cell consists of a single value)

Step 2: Save the spreadsheet as a '.csv' file (typically found in the 'save as' option of the spreadsheet program). This file type is short for 'comma-separated values'.

Step 3: Copy the path (location) of the csv file (visible in the properties or information associated with the file).

Step 4: Paste the path of the file into the setwd() function. Remember to include quotation marks. The function is short for set working directory. The configuration of the setwd() function differs according to the operating system:

(however, if we are using a ‘project’ in RStudio we can skip steps 3 and 4 by simply putting the file into the project folder)

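Windows: Copying and pasting the path into an R script gives single backslashes, which R treats as escape characters. For the path to be read by R, each backslash must be replaced with a forward slash (or doubled):

# setwd("C:\Users\Me\Documents")   # fails: "\U" is read as an escape sequence
setwd("C:/Users/Me/Documents")     # works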

Mac: Copying and pasting the path into an R script gives forward slashes, so no change is necessary.

setwd("/Users/Me/Documents") 

Linux: Copying and pasting the path into an R script gives forward slashes, so no change is necessary.

setwd("/usr/me/documents")

Step 5: Paste the file’s name into the read.csv() function to read the file. Don’t forget to put ‘.csv’ at the end of the file’s name.

The syntax is
 mydataframe <- read.csv(file, options) 
where file is a delimited ASCII file and the options are parameters controlling how data is processed.

Options for importing a dataframe

Option      Description
header      A logical value indicating whether the file contains the variable names in the first line.
sep         The delimiter separating data values. For read.table() the default is sep="", which denotes one or more spaces, tabs, new lines, or carriage returns; read.csv() uses sep="," by default. Use sep="\t" to read tab-delimited files.
row.names   An optional parameter specifying one or more variables to represent row identifiers.
col.names   If the first row of the data file doesn't contain variable names (header=FALSE), we can use col.names to specify a character vector containing the variable names. If header=FALSE and the col.names option is omitted, variables will be named V1, V2, and so on.
na.strings  Optional character vector indicating missing-value codes. For example, na.strings=c("-9", "?") converts each -9 and ? value to NA as the data are read.

Let's import the data frame from the .csv file:
my.df<- read.csv("firstdataframe.csv") 
my.df
##   sites butterfly observer habitatquality
## 1    s1        10        A              4
## 2    s2        20        B              3
## 3    s3         1        B              1
## 4    s4         2        A              2
## 5    s5        12        B              3
## 6    s6         2        A              4
## 7    s7         6        B              4
We can use the str() function to get an overview of the structure of the object and to see how R is treating each variable:
str(my.df)
## 'data.frame':    7 obs. of  4 variables:
##  $ sites         : chr  "s1" "s2" "s3" "s4" ...
##  $ butterfly     : int  10 20 1 2 12 2 6
##  $ observer      : chr  "A" "B" "B" "A" ...
##  $ habitatquality: int  4 3 1 2 3 4 4
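
The header and na.strings options described earlier can be tried without creating a file, by passing the file contents through the text argument of read.csv(). The data in csv_text are invented for this sketch.

```r
# A tiny comma-separated "file" held in a character string
csv_text <- "site,count,observer
s1,10,A
s2,-9,B
s3,?,A"

# Treat -9 and ? as missing-value codes while reading
d <- read.csv(text = csv_text, header = TRUE, na.strings = c("-9", "?"))
d
str(d)
```

Because both missing-value codes are converted to NA on input, the count column is read as a number rather than as text.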

Investigating a data frame

We can use the class() function to confirm my.df is a data frame.

class(my.df)
## [1] "data.frame"
class(array1)
## [1] "array"

If we want to see the class of a variable within the data frame, we use the '$' operator after the name of the object to reference that variable.

class(my.df$habitatquality)
## [1] "integer"

So we see that R will treat my.df$habitatquality as a numeric (integer) variable.

Categorical data are usually handled best in R by defining them as factors. We can do this by using the as.factor() function and telling R the name of the dataset and the variable we want it to reassign:

my.df$habitatquality <- as.factor(my.df$habitatquality)
class(my.df$habitatquality)
## [1] "factor"
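
Once a variable is a factor, its categories can be listed and counted. The sketch below also shows a common pitfall when converting a factor back to numbers. The vector hq and its values are made up for the illustration.

```r
hq   <- c(40, 30, 10, 20, 30)
hq.f <- as.factor(hq)

levels(hq.f)     # the categories: "10" "20" "30" "40"
table(hq.f)      # number of observations in each category

# as.numeric() on a factor returns the internal level codes ...
as.numeric(hq.f)                  # 4 3 1 2 3
# ... so convert through character to recover the original values
as.numeric(as.character(hq.f))    # 40 30 10 20 30
```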

Managing data

Annotating datasets

Data analysts typically annotate datasets to make the results easier to interpret. Annotating generally includes adding descriptive labels to variable names and value labels to the codes used for categorical variables.

Variables can be labelled in two ways:

  • GUI editor
  • programmatically

We can use the following statement to invoke an interactive editor:

fix(dataframe)
fix(my.df)

For example, for the variable habitatquality, we might want to attach the short label "hab_qual" (a higher score means better habitat for the species).

names(my.df)[4] <- "hab_qual" 

my.df
##   sites butterfly observer hab_qual
## 1    s1        10        A        4
## 2    s2        20        B        3
## 3    s3         1        B        1
## 4    s4         2        A        2
## 5    s5        12        B        3
## 6    s6         2        A        4
## 7    s7         6        B        4

For the variable hab_qual, scored 1 to 5, we might want to replace the numeric codes with descriptive labels running from "worst" to "excellent".

my.df$`hab_qual` <- factor(my.df$`hab_qual`, levels = c(1,2, 3, 4, 5),
labels = c("worst","poor", "good", "better", "excellent"))
my.df
##   sites butterfly observer hab_qual
## 1    s1        10        A   better
## 2    s2        20        B     good
## 3    s3         1        B    worst
## 4    s4         2        A     poor
## 5    s5        12        B     good
## 6    s6         2        A   better
## 7    s7         6        B   better

Subsetting

In R, values in a data frame are referenced by their location in terms of rows and columns using the '[' and ']' operators, taking the form:

my.df[row, column]

This allows us to subset datasets manually. For example, the 4th row of the 1st column would be:

my.df[4, 1]
## [1] "s4"

We can select an entire row of the data frame by giving the row position but leaving the column position empty:

my.df[3, ]
##   sites butterfly observer hab_qual
## 3    s3         1        B    worst

Similarly, we can select a column of the data frame by giving the column position but leaving the row position empty:

my.df[ , 1]
## [1] "s1" "s2" "s3" "s4" "s5" "s6" "s7"

Values can be removed by using a negative sign. For example, we can remove the 1st row of our data frame:

my.df[-1, ]
##   sites butterfly observer hab_qual
## 2    s2        20        B     good
## 3    s3         1        B    worst
## 4    s4         2        A     poor
## 5    s5        12        B     good
## 6    s6         2        A   better
## 7    s7         6        B   better

Multiple rows or columns can be removed or included by using the c() function. By including a ‘-’ sign before the c() function we can remove a number of rows:

my.df[-c(1,3,5,6,7), ]
##   sites butterfly observer hab_qual
## 2    s2        20        B     good
## 4    s4         2        A     poor

To see the first three entries of the sites variable of my.df:

my.df$sites[1:3]
## [1] "s1" "s2" "s3"
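
Rows can also be selected by a condition rather than by position. A minimal sketch, using a small made-up data frame (df) so that it runs on its own:

```r
df <- data.frame(sites     = c("s1", "s2", "s3", "s4"),
                 butterfly = c(10, 20, 1, 2),
                 observer  = c("A", "B", "B", "A"))

# Rows where more than 5 butterflies were counted
df[df$butterfly > 5, ]

# Rows recorded by observer A, keeping only two columns
df[df$observer == "A", c("sites", "butterfly")]

# subset() expresses the same kind of selection more compactly
busy_b <- subset(df, butterfly > 5 & observer == "B")
busy_b
```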

We can also use ‘[’ ‘]’ to subset some built-in constants in R such as:

Letters

LETTERS[1:10]
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"

Months

month.name[1:5]
## [1] "January"  "February" "March"    "April"    "May"
month.abb[1:8]
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug"

Creating variables

An example dataset: a team of biologists is scored on the ways they run their projects.

Scores: 1: poor; 2: good; 3: better; 4: best; 5: excellent

biologist  exp_year  age  team_work  field_work  comms_skill
A          6         37   4          2           4
B          5         31   4          2           5
C          7         00   5          NA          NA
D          2         24   1          1           2

The same data can be entered in R as follows:

biologist<-c("A", "B", "C", "D")
exp_year<-c(6,5,7,2)
age<-c(37,31,00,24)
team_work<-c(4,4,5,1)
field_work<-c(2,2,NA,1)
comms_skill<-c(4,5,NA,2)

bio.scor<-data.frame(biologist, exp_year, age, team_work, field_work,comms_skill)

bio.scor
##   biologist exp_year age team_work field_work comms_skill
## 1         A        6  37         4          2           4
## 2         B        5  31         4          2           5
## 3         C        7   0         5         NA          NA
## 4         D        2  24         1          1           2

Arithmetic operators

Operator   Description
+          Addition.
-          Subtraction.
*          Multiplication.
/          Division.
^ or **    Exponentiation.
x%%y       Modulus (x mod y): for example, 5%%2 is 1.
x%/%y      Integer division: for example, 5%/%2 is 2.
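
Each operator in the table can be tried directly at the console, for example:

```r
7 + 2     # 9
7 - 2     # 5
7 * 2     # 14
7 / 2     # 3.5
7 ^ 2     # 49
7 %% 2    # 1, the remainder after division
7 %/% 2   # 3, the whole-number part of the division
```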

Combining three scores to obtain a mean score

Now we have a data frame named bio.scor, with variables team_work, field_work, and comms_skill, and we want to create a new variable sum.scor that adds these variables and a new variable called mean.scor that averages them.

sum.scor <- bio.scor$team_work + bio.scor$field_work+ bio.scor$comms_skill

mean.scor <- (bio.scor$team_work + bio.scor$field_work+bio.scor$comms_skill)/3

sum.scor
## [1] 10 11 NA  4
mean.scor
## [1] 3.333333 3.666667       NA 1.333333

The statements will succeed, but let's check whether these vectors were added to the data frame.

bio.scor
##   biologist exp_year age team_work field_work comms_skill
## 1         A        6  37         4          2           4
## 2         B        5  31         4          2           5
## 3         C        7   0         5         NA          NA
## 4         D        2  24         1          1           2

This probably isn’t the result we want. Ultimately, we want to incorporate new variables into the original data frame.

bio.scor$sum.scor <- bio.scor$team_work + bio.scor$field_work+ bio.scor$comms_skill

bio.scor$mean.scor <- (bio.scor$team_work + bio.scor$field_work+bio.scor$comms_skill)/3

bio.scor$sum.scor
## [1] 10 11 NA  4
bio.scor$mean.scor
## [1] 3.333333 3.666667       NA 1.333333
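# transform() is an alternative way to create the same variables in one step
bio.scor <- transform(bio.scor, 
                      sum.scor = team_work + field_work + comms_skill,
                      mean.scor = (team_work + field_work + comms_skill)/3)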
bio.scor
##   biologist exp_year age team_work field_work comms_skill sum.scor mean.scor
## 1         A        6  37         4          2           4       10  3.333333
## 2         B        5  31         4          2           5       11  3.666667
## 3         C        7   0         5         NA          NA       NA        NA
## 4         D        2  24         1          1           2        4  1.333333
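
As an alternative sketch, rowSums() and rowMeans() compute the same kind of totals across columns, and their na.rm option lets us decide how missing scores are treated. The data frame scores and the variable mean.avail are illustrative names, not part of the workshop dataset.

```r
scores <- data.frame(team_work   = c(4, 4, 5, 1),
                     field_work  = c(2, 2, NA, 1),
                     comms_skill = c(4, 5, NA, 2))
cols <- c("team_work", "field_work", "comms_skill")

# Row totals: NA whenever any score in the row is missing
scores$sum.scor <- rowSums(scores[cols])

# Row means of the scores that are available, ignoring NAs
scores$mean.avail <- rowMeans(scores[cols], na.rm = TRUE)

scores
```

With na.rm = TRUE the third row gets a mean of 5, based on its single available score, so this option should be used only when such a partial average is meaningful.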

Re-coding (conditional) existing variables

Earlier we learned how to change data labels. Re-coding, in contrast, involves creating new values of a variable based on the existing values of the same and/or other variables. For example, we may want to:

    • Change a continuous variable into a set of categories

    • Replace miscoded values with correct values

    • Create an agree/disagree variable based on a set of cutoff scores.

Statement for conditional recoding

variable[condition] <- expression
The statement will only make the assignment when condition is TRUE.

To recode data, we can use one or more of R’s logical operators

Operator    Description
<           Less than
<=          Less than or equal to
>           Greater than
>=          Greater than or equal to
==          Exactly equal to
!=          Not equal to
!x          Not x
x | y       x or y
x & y       x and y
isTRUE(x)   Tests whether x is TRUE
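
Applied to a vector, these operators return one TRUE or FALSE per element, and comparisons with a missing value give NA. A short sketch with a made-up age vector:

```r
age <- c(37, 31, NA, 24)

age > 30                 # TRUE  TRUE    NA FALSE
age >= 24 & age < 35     # FALSE TRUE    NA  TRUE
age != 31                # TRUE  FALSE   NA  TRUE
!(age > 30)              # FALSE FALSE   NA  TRUE

isTRUE(age[1] > 30)      # TRUE
isTRUE(age[3] > 30)      # FALSE, because the comparison gives NA
```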

Let's say we want to recode the ages of the biologists in the earlier dataset from the continuous variable age to the categorical variable agecat (here with two categories, Young and Elder, split at age 40).

First, we must recode the value 00 for age to indicate that the value is missing using code such as

bio.scor$age[bio.scor$age == 00] <- NA

Once missing values for age have been specified, we can then use the following code to create the agecat variable:

bio.scor$agecat[bio.scor$age >= 40]  <- "Elder"  
bio.scor$agecat[bio.scor$age < 40]   <- "Young"
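
When more than two categories are wanted, the base R function cut() is one alternative to writing a separate statement for each category. In this sketch the cutoffs (30 and 50) and the extra ages are assumptions chosen purely for illustration.

```r
age <- c(37, 31, NA, 24, 45, 62)

# right = FALSE makes each interval include its lower bound:
# [-Inf, 30), [30, 50), [50, Inf)
agecat <- cut(age,
              breaks = c(-Inf, 30, 50, Inf),
              labels = c("Young", "Middle Aged", "Elder"),
              right  = FALSE)

agecat          # a factor; the missing age stays NA
table(agecat)
```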

The plyr package

Finally, the plyr package has a powerful set of functions for modifying datasets.

The plyr package isn't installed by default, so we install it first:
 install.packages("plyr")

The format of the rename() function is

rename(dataframe, c(oldname="newname", oldname="newname",...))

Here's an example with the bio.scor data:

library(plyr) 
## Warning: package 'plyr' was built under R version 4.2.3
bio.scor <- rename(bio.scor, c(biologist="bioID", age="age.bio"))
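
If plyr is not available, the same renaming can be sketched in base R with the names() function. The small data frame df is a stand-in for bio.scor.

```r
df <- data.frame(biologist = c("A", "B"), age = c(37, 31))

# Look the column up by its old name, then assign the new name
names(df)[names(df) == "biologist"] <- "bioID"
names(df)[names(df) == "age"]       <- "age.bio"

names(df)   # "bioID" "age.bio"
```

Matching on the old name rather than on the column position keeps the code correct even if the column order changes.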

Working around missing values

The is.na() function allows us to test for the presence of missing values.
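
For example, on a short made-up vector y:

```r
y <- c(1, 2, 3, NA)

is.na(y)          # FALSE FALSE FALSE  TRUE
sum(is.na(y))     # number of missing values: 1
which(is.na(y))   # position of the missing value: 4
```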

Recoding redundant values to missing

Any value of age.bio that is equal to 00 is changed to NA. Note that the column must now be referred to by its new name, age.bio.

bio.scor$age.bio[bio.scor$age.bio == 0] <- NA

Excluding missing values from analyses

Once we have identified missing values, we need to eliminate them before we can continue analysing the data. The reason is that arithmetic expressions and functions applied to missing values produce missing values.

mean(comms_skill)
## [1] NA

The na.rm=TRUE option removes missing values prior to calculations and applies the function to the remaining values:

mean.comms <- mean(comms_skill, na.rm=TRUE)
mean.comms
## [1] 3.666667

The na.omit() function deletes any rows with missing data:

bio.scorNAomit<-na.omit(bio.scor)
bio.scorNAomit
##   bioID exp_year age.bio team_work field_work comms_skill sum.scor mean.scor
## 1     A        6      37         4          2           4       10  3.333333
## 2     B        5      31         4          2           5       11  3.666667
## 4     D        2      24         1          1           2        4  1.333333
##   agecat
## 1  Young
## 2  Young
## 4  Young

Here, any rows containing missing data are deleted from the dataframe before the results are saved to a new dataframe.
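
Before deleting rows, it can help to see where the missing values are. A minimal sketch with a made-up data frame df, using complete.cases() and is.na():

```r
df <- data.frame(id = c("A", "B", "C"),
                 x  = c(1, NA, 3),
                 y  = c(2, 5, NA))

colSums(is.na(df))        # missing values per column: id 0, x 1, y 1
complete.cases(df)        # TRUE FALSE FALSE

# Keeping only complete rows gives the same rows as na.omit()
df[complete.cases(df), ]
```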

Working with date values

Dates are typically entered into R as character strings and then converted into date variables, which are stored numerically.

The function as.Date() is used to make this translation.

The syntax is as.Date(x, "input_format"), where x is the character data and input_format gives the appropriate format for reading the date.

Symbol   Meaning                    Example
%d       Day as a number            01–31
%a       Abbreviated weekday        Mon
%A       Unabbreviated weekday      Monday
%m       Month as a number          01–12
%b       Abbreviated month          Jan
%B       Unabbreviated month        January
%y       Two-digit year             24
%Y       Four-digit year            2024

mydates <- as.Date(c("2024-05-26", "2024-05-25"))
mydates
## [1] "2024-05-26" "2024-05-25"
strDates <- c("01/05/2024", "08/06/2024") 
dates <- as.Date(strDates, "%d/%m/%Y")

dates
## [1] "2024-05-01" "2024-06-08"
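
The format symbols can be combined to match other date layouts. The sketch below sticks to numeric symbols, because month and weekday names depend on the language settings of the computer. The object names d1, d2, and bad are illustrative.

```r
# Two-digit year with dashes
d1 <- as.Date("26-05-24", "%d-%m-%y")
d1                           # "2024-05-26"

# Year first, with slashes
d2 <- as.Date("2024/05/26", "%Y/%m/%d")
d1 == d2                     # TRUE

# A format that does not match the text gives NA rather than an error
bad <- as.Date("26/05/2024", "%Y-%m-%d")
is.na(bad)                   # TRUE

# format() turns a date back into text in any layout
format(d1, "%d/%m/%Y")       # "26/05/2024"
```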

Time-stamping

The function Sys.Date() returns the current date, while the date() function returns both the current date and time. As I write this, today is 26 May 2024, so these functions produce the output below:

Sys.Date()
## [1] "2024-05-26"
date()
## [1] "Sun May 26 12:11:38 2024"

We can use the format(x, format="output_format") function to output dates in a specified format and to extract portions of dates:

today <- Sys.Date()
format(today, format="%B %d %Y") 
## [1] "May 26 2024"
format(today, format="%A") 
## [1] "Sunday"
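The same approach can pull out numeric parts of a date. The sketch below uses a fixed date (the object names d and yr are illustrative) so that the result does not depend on the day the code is run.

```r
# A fixed date, so the result does not depend on when the code is run
d <- as.Date("2024-05-26")

# Extract parts of the date as character strings
format(d, format = "%Y")   # four-digit year
format(d, format = "%m")   # month number
format(d, format = "%d")   # day of the month

# Convert an extracted part to a number for further calculation
yr <- as.numeric(format(d, format = "%Y"))
yr + 1
```

Weekday and month names produced by %A, %a, %B and %b depend on the locale of the R session, so they may appear in another language on a different computer.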

Arithmetic operations on dates

Because dates are stored numerically, subtracting one date from another gives the number of days between them.

startdate <- as.Date("2023-05-27") 
enddate <- as.Date("2024-05-26") 
days<- enddate - startdate
days
## Time difference of 365 days

The difftime() function can be utilised to calculate time intervals and represent them in various units such as seconds, minutes, hours, days, or weeks.

today <- Sys.Date() 
dob <- as.Date("1994-03-26") 
difftime(today, dob, units="weeks") 
## Time difference of 1574.143 weeks
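As a small sketch of how these pieces fit together, the example below reuses the startdate and enddate values from above; wks and n.days are illustrative names.

```r
startdate <- as.Date("2023-05-27")
enddate <- as.Date("2024-05-26")

# The same interval expressed in weeks
wks <- difftime(enddate, startdate, units = "weeks")
wks

# as.numeric() strips the units and leaves a plain number
n.days <- as.numeric(difftime(enddate, startdate, units = "days"))
n.days

# Adding a number to a date moves it forward by that many days
later <- startdate + 30
later
```

Converting the result with as.numeric() is useful when the interval is needed as an ordinary variable, for example as a column in a data frame.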

Merging dataframes

Horizontal merge

When combining two data frames (datasets) horizontally, the merge() function comes in handy. The two data frames are connected through one or more shared key variables.

sitex<-c("site1","site2","site3","site4")
abundancex<-c("2","1","5","6")

siteabu.df<-data.frame(sitex,abundancex)
siteabu.df
##   sitex abundancex
## 1 site1          2
## 2 site2          1
## 3 site3          5
## 4 site4          6
sitex<-c("site1","site2","site3","site4")
habitatx<-c("prim.forest","Sec.forest","grassland","water")

sitehab.df<-data.frame(sitex,habitatx)
sitehab.df
##   sitex    habitatx
## 1 site1 prim.forest
## 2 site2  Sec.forest
## 3 site3   grassland
## 4 site4       water
site.cov <- merge(siteabu.df, sitehab.df, by="sitex")
site.cov
##   sitex abundancex    habitatx
## 1 site1          2 prim.forest
## 2 site2          1  Sec.forest
## 3 site3          5   grassland
## 4 site4          6       water
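In the example above every site appears in both data frames. The sketch below, using hypothetical data frames abu.df and hab.df, illustrates what merge() does when the keys only partly overlap and how the all argument changes the result.

```r
# Hypothetical data frames that share only some sites
abu.df <- data.frame(sitex = c("site1", "site2", "site3"),
                     abundancex = c(2, 1, 5))
hab.df <- data.frame(sitex = c("site2", "site3", "site4"),
                     habitatx = c("Sec.forest", "grassland", "water"))

# Default: keep only the sites found in both data frames
inner.df <- merge(abu.df, hab.df, by = "sitex")
inner.df

# all=TRUE: keep every site and fill the gaps with NA
outer.df <- merge(abu.df, hab.df, by = "sitex", all = TRUE)
outer.df
```

Comparing the number of rows before and after a merge is a quick way to notice sites that were dropped because they had no match.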

Vertical merge

To join two data frames vertically, use the rbind() function. The two data frames must have the same variables, but they don’t have to be in the same order. Vertical merging is typically used to add observations to a data frame.

sitex<-c("site5")
abundancex<-c("23")

siteabu.df2<-data.frame(sitex,abundancex)
siteabu.df2
##   sitex abundancex
## 1 site5         23
vert.df <- rbind(siteabu.df, siteabu.df2)
vert.df
##   sitex abundancex
## 1 site1          2
## 2 site2          1
## 3 site3          5
## 4 site4          6
## 5 site5         23
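The point about column order can be illustrated with a short sketch. The data frames first.df and second.df are hypothetical and hold the same two variables in a different order.

```r
# Hypothetical data frames with the same variables in a different order
first.df <- data.frame(sitex = c("site1", "site2"),
                       abundancex = c(2, 1))
second.df <- data.frame(abundancex = 23,
                        sitex = "site5")

# rbind() matches data frame columns by name, so the order does not matter
both.df <- rbind(first.df, second.df)
both.df
```

The result takes its column order from the first data frame.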

Transposing data frames

Transposing refers to the process of converting the rows of a data frame into columns, and the columns into rows. Transposing is a valuable technique that can be used for different purposes, such as restructuring data or getting it ready for specific analyses.

The data.table package can be used to transpose a data frame; here we use its transpose() function.

transpose(dataframe)

library(data.table)
## Warning: package 'data.table' was built under R version 4.2.3
# transpose
t_my.df <- transpose(my.df)


colnames(t_my.df) <- rownames(my.df)
rownames(t_my.df) <- colnames(my.df)

t_my.df
##                1    2     3    4    5      6      7
## sites         s1   s2    s3   s4   s5     s6     s7
## butterfly     10   20     1    2   12      2      6
## observer       A    B     B    A    B      A      B
## hab_qual  better good worst poor good better better
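As an alternative that needs no extra package, base R's t() function can also transpose a data frame. This is a sketch of a different technique from the one used above, shown on a hypothetical data frame counts.df that holds numeric columns only.

```r
# Hypothetical data frame with numeric columns only
counts.df <- data.frame(butterfly = c(10, 20, 1),
                        moth = c(3, 0, 7),
                        row.names = c("s1", "s2", "s3"))

# t() returns a matrix; as.data.frame() turns it back into a data frame
t_counts.df <- as.data.frame(t(counts.df))
t_counts.df

# Row and column names are swapped automatically
rownames(t_counts.df)
colnames(t_counts.df)
```

If the data frame mixes numeric and character columns, t() converts every value to character, so numeric columns may need converting back afterwards.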

Getting an overview of our data

Using Pivot Table

Pivot tables combine column values into statistical summaries. In R, this can be done with the dplyr package, a data manipulation package. We will look into pivots in more detail in the next session.

Here, the group_by() function groups data by one or more variables, and the summarise() function (also spelled summarize()) creates a summary of the data for each group.

The general syntax is:

df %>% group_by(grouping_variables) %>% summarize(label = aggregate_fun())

  • df: the data frame in use.

  • grouping_variables: the variable or variables used to group the data.

  • aggregate_fun(): the function used for the summary, for example sum() or mean().

library(dplyr) 
## Warning: package 'dplyr' was built under R version 4.2.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
my.df %>% group_by(observer) %>%  
summarize(sum_values = sum(butterfly))
## # A tibble: 2 × 2
##   observer sum_values
##   <chr>         <dbl>
## 1 A                14
## 2 B                39
my.df %>% group_by(observer) %>%  
summarize(mean.butterfly = mean(butterfly))
## # A tibble: 2 × 2
##   observer mean.butterfly
##   <chr>             <dbl>
## 1 A                  4.67
## 2 B                  9.75
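Several summaries can be requested in a single call. In the sketch below, my.df is rebuilt from the values shown in this session so that the block runs on its own; the column names n.sites, sum.butterfly and mean.butterfly are illustrative labels.

```r
library(dplyr)

# Values copied from the my.df example in this session
my.df <- data.frame(sites = c("s1", "s2", "s3", "s4", "s5", "s6", "s7"),
                    butterfly = c(10, 20, 1, 2, 12, 2, 6),
                    observer = c("A", "B", "B", "A", "B", "A", "B"))

# Several summary columns can be computed in one summarise() call
obs.summary <- my.df %>%
  group_by(observer) %>%
  summarise(n.sites = n(),
            sum.butterfly = sum(butterfly),
            mean.butterfly = mean(butterfly))
obs.summary
```

Here n() counts the rows in each group, which helps when judging how much data each sum or mean is based on.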
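Base R also offers the tapply() and aggregate() functions for summarising data by group. The example below reads a butterfly survey data set from the working directory and totals the counts for each site, species, and date.

getwd()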
## [1] "C:/Users/Subhasish Arandhara/Desktop/Workshop R in ecology"
butterfly.df<-read.csv("butterfly.df.csv", header=TRUE)
butterfly.df
##     site    species observer      date grid_ref counts  activity   habitat
## 1  Site1 butterfly1     Obs1 12/5/2024     2342      1    flying grassland
## 2  Site1 butterfly1     Obs1 12/5/2024     2342      1   resting    forest
## 3  Site1 butterfly1     Obs1 12/5/2024     2342      1    flying riverbank
## 4  Site1 butterfly1     Obs1 12/5/2024     2342      1   basking riverbank
## 5  Site1 butterfly3     Obs1 15/5/2024     2342      1    flying riverbank
## 6  Site1 butterfly3     Obs1 15/5/2024     2342      1   resting grassland
## 7  Site1 butterfly3     Obs1 15/5/2024     2342      1    flying riverbank
## 8  Site1 butterfly3     Obs1 15/5/2024     2342      1   resting riverbank
## 9  Site1 butterfly3     Obs1 15/5/2024     2342      1   basking riverbank
## 10 Site1 butterfly3     Obs1 15/5/2024     2342      1   basking    forest
## 11 Site1 butterfly3     Obs1 15/5/2024     2342      1   basking    forest
## 12 Site1 butterfly3     Obs1 15/5/2024     2342      1    flying riverbank
## 13 Site2 butterfly2     Obs2 12/5/2024     2343      1 nectaring    forest
## 14 Site2 butterfly4     Obs2 17/5/2024     2343      1 nectaring grassland
## 15 Site2 butterfly4     Obs2 17/5/2024     2343      1 nectaring    forest
## 16 Site2 butterfly4     Obs2 17/5/2024     2343      1    flying riverbank
## 17 Site2 butterfly4     Obs2 17/5/2024     2343      1    flying    forest
## 18 Site2 butterfly4     Obs2 17/5/2024     2343      1 nectaring grassland
## 19 Site2 butterfly4     Obs2 17/5/2024     2343      1   resting riverbank
## 20 Site2 butterfly4     Obs2 17/5/2024     2343      1   resting    forest
## 21 Site2 butterfly4     Obs2 17/5/2024     2343      1 nectaring grassland
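# Without a FUN argument, tapply() only returns the group index (here, the species level) of each row
tapply(butterfly.df$site, butterfly.df$species)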
##  [1] 1 1 1 1 3 3 3 3 3 3 3 3 2 4 4 4 4 4 4 4 4
new.bfly.df<-aggregate(counts~site+species+date, data=butterfly.df, FUN=sum)

new.bfly.df
##    site    species      date counts
## 1 Site1 butterfly1 12/5/2024      4
## 2 Site2 butterfly2 12/5/2024      1
## 3 Site1 butterfly3 15/5/2024      8
## 4 Site2 butterfly4 17/5/2024      8
write.csv(new.bfly.df, "new.bfly.df.csv")
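
As a minimal sketch of grouped summaries in base R, the example below applies tapply() with a summary function and compares it with aggregate(). The data frame survey.df is hypothetical and does not reproduce the contents of butterfly.df.csv.

```r
# Hypothetical survey records (not the contents of butterfly.df.csv)
survey.df <- data.frame(site = c("Site1", "Site1", "Site2", "Site2", "Site2"),
                        species = c("sp1", "sp2", "sp1", "sp1", "sp2"),
                        counts = c(1, 3, 2, 2, 5))

# tapply(values, groups, FUN) applies FUN to the values in each group
site.totals <- tapply(survey.df$counts, survey.df$site, FUN = sum)
site.totals

# aggregate() returns group totals as a data frame, here by two variables
site.sp.totals <- aggregate(counts ~ site + species, data = survey.df, FUN = sum)
site.sp.totals
```

tapply() returns a named vector, which is handy for a quick look, while aggregate() returns a data frame that can be saved or merged with other tables.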

Home assignments

  1. Importing: Import the sample.csv file provided in this session as a data frame.

  2. Annotating: The column names in sample.csv are misspelled in obvious ways. Correct them through the GUI data editor using the fix() function.

  3. Investigating: Check which columns in sample.csv could be categorical (or score) variables using the class() function and the ‘$’ operator, and transform them into factor variables using the as.factor() function.

  4. Creating: Create a new column (variable) that recodes the factor variable from Step 3 into a set of categories (5=Excellent; 4=Best; 3=Good; 2=Poor; 1=Worse) using conditional operators. (clue: bio.scor$agecat[bio.scor$age > 40] <- "Elder")