26.05.2024: During the previous session, we were introduced to the program by learning how to install it, how to use it, and a few concepts that are fundamental to the R program.
In particular, we spoke about:
Expressions
The basic interaction mode in R is expression evaluation: expressions typically involve variable references and arithmetic operators.
e.g. arithmetic expression, relational or logical expression
Objects
Anything that can be assigned a value. For R, that is just about everything (data, functions, graphs, analytic results, and more). Every object has a class attribute telling R how to handle it. All objects are kept in memory during an interactive session.
<- is used as assignment operator
Today’s session focuses on the initial stage of data analysis: creating a dataset that contains the relevant information in a format that suits our requirements. We will explore various methods for getting data into R, including manual entry and import from external sources. Additionally, we will learn how to annotate and modify the dataset.
This session covers
Introduction to R data structures
Importing data frames
Managing data
The creation of a dataset, in a format that meets our needs, involves the following:
Choice of a data structure to hold our data
Entering or importing our data into the data structure
Data can be entered manually or imported from an external source; possible sources include text files, spreadsheets, datasets from statistical packages, and database-management systems.
Once a dataset is created, we will annotate it in this workshop by adding descriptive labels for variables and variable codes, and we will use functions to modify the dataset according to our needs.
Datasets: A dataset is usually a rectangular array of data with rows representing observations and columns representing variables.
Different traditions use different names for the rows and columns of a dataset. In biology we call them observations and variables (or sites and covariates, etc.); database analysts call them records and fields; and those from the data-mining and machine-learning disciplines call them examples and attributes.
Importance of biological record keeping
The data we write down should make sense to us at the time of data collection and should also make sense to future scientists looking to repeat or verify our work.
Collecting new data should be part of our initial planning, and should continue without modification of the format.
If we have past data or need reformatting then we may have to spend some time rearranging in a common format before we can do anything useful.
Elements of biological records
Where: the location that the data were collected from.
When: the date that the data were recorded.
What: the species we are dealing with.
Who: the name of the person that recorded the data.
Other variables that may be added, depending upon our purpose.
Recording in the field
When we are in the field and using our field notebook, we may well use shortcuts to record the information required. Once again there will be items that do not need to be repeated, a single date at the top of every sheet would be sufficient.
In our field notebook or recording sheet we may keep separate pages for each site and end up with a column of figures for each site.
In general, we should aim to create a column for each item of data that we collect. If we were looking at species abundance at several sites for example, then we would need at least two columns, one for the abundance data and one for the site.
Supporting info/covariates/remarks
The date, location and the name of the person collecting the data are basic items that we always need but there may also be additional information that will help us to understand the biological situation as we process the data later. These things include field sketches and site photographs. A field sketch can be very helpful because we can record details that may be hard to represent in any other manner. A sketch can also help us to remember where we placed our quadrats; a grid reference is fine but meaningless without a map! Photographs may also be helpful and digital photography enables lots of images to be captured with minimum fuss; however, it is also easy to get carried away and forget what we were there for in the first place. Any supporting information should be just that – support for the main event: our data.
Record structure and arrangement in computer
Our data sets prepared for analyses can be formatted in different ways:
Analyses ready dataset: Summarised layout (Only necessary data for a particular objective or analyses are set out )
When we enter biological data, enter each record on a separate line and set out our spreadsheet so that each column represents a factor.
Analyses ready dataset: Data table layout (All data are set out in separate columns)
The table shows a dataset with records of the abundance of 4 butterfly species. In this example, Observer 1 and Observer 2 are trying to ascertain the abundance of various butterfly species at some sites.
If someone tries to repeat the survey, they will know what time of year it was carried out. Likewise, if environmental conditions change, it will be essential to know in which month or season the work was done. If we fail to collect complete biological data, or fail to retain and communicate all the details in full, then our work may be rendered unrepeatable and therefore useless as a contribution to science.
If we wrote down the information separately, we would end up with several smaller tables of data and it would be difficult to carry out any actual analyses.
Having this strict biological recording format allows great flexibility, especially if we end up with a lot of data. Our data are now in the form of a database: our spreadsheet can extract and summarise them easily, and we can modify them later using sorting, filters or Pivot Tables.
The structure of the data here is a rectangular array.
The variables in this dataset are: site, which identifies the rows; species, the name of the butterfly species; observer, who recorded the data; date, a date variable; the grid reference; the sampling unit reference; and abundance, a continuous variable.
Once our biological data are compiled in this format, we can sort them by the various columns, export the grid references to mapping programs, and convert the data into tables for further calculations using a spreadsheet. They can also be imported into databases and other computer programs for statistical analysis.
We have now met two key terms: data structure and data type.
Data structures hold data falling under various data types. The data types or modes that R can handle include numeric, character, logical (TRUE/FALSE).
R contains a wide variety of structures for holding data, including scalars, vectors, arrays, data frames, and lists.
One more term is the data frame. A data frame is a structure in R that holds data and is similar to the datasets found in standard statistical packages (for example, SAS, SPSS, and Stata). The columns are variables, and the rows are observations. We can have variables of different types (for example, numeric or character) in the same data frame. Data frames are the main structures we use to store datasets.
The four basic data types in R are: numeric, character, logical, and complex number.
A numeric object, such as a, contains numeric values.
## [1] 3
A character object stores a character string.
## [1] "hello world"
A logical object contains results of a logical comparison.
For example, if we ask:
## [1] FALSE
This is a logical comparison (“is 3 larger than 4?”), and the answer to a logical comparison is either “yes” (TRUE) or “no” (FALSE).
Mode
Data type is known as “mode” in R. The R function mode() can be used to get the data type of an object:
## [1] "character"
## [1] "numeric"
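The code that produced the outputs above is not shown in these notes. A minimal sketch that would give the same results, assuming the objects were called a and b (the names b and is.larger are our assumptions), could be:

```r
a <- 3                  # a numeric object
a                       # prints 3
b <- "hello world"      # a character object (the name b is assumed)
b                       # prints "hello world"
is.larger <- 3 > 4      # a logical comparison: is 3 larger than 4?
is.larger               # prints FALSE
mode(b)                 # "character"
mode(a)                 # "numeric"
mode(is.larger)         # "logical"
```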
R has a wide variety of objects for holding data, including scalars, vectors, matrices, arrays, data frames, and lists. They differ in terms of the type of data they can hold, how they’re created, their structural complexity, and the notation used to identify and access individual elements.
Scalars
Scalars are one-element vectors.
## [1] 3
## [1] "Subhasish"
## [1] TRUE
They’re used to hold constants.
Vectors
Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data. The combine function c() is used to form the vector. Here are examples of each type of vector.
## [1] 1 2 5 3 6 -2 4
## [1] "Subhasish" "Sourav" "Sathish"
## [1] TRUE TRUE TRUE FALSE TRUE FALSE
Note that the data in a vector must be of only one type or mode (numeric, character, or logical). We can’t mix modes in the same vector.
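As a sketch of the scalar and vector examples above (the object names are assumptions, since the original code is not shown), together with what happens if we try to mix modes:

```r
# Scalars: one-element vectors
f <- 3
g <- "Subhasish"
h <- TRUE

# Vectors built with the combine function c()
num.vec <- c(1, 2, 5, 3, 6, -2, 4)
chr.vec <- c("Subhasish", "Sourav", "Sathish")
lgl.vec <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)

# Mixing modes does not fail: every element is converted to one common mode
mixed <- c(1, "two", TRUE)
mixed        # "1" "two" "TRUE"
mode(mixed)  # "character"
```

Because R converts silently rather than raising an error, it is worth checking mode() after building a vector from mixed input.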
Matrix
A matrix is a set of elements of the same mode (numeric, character, or logical) arranged in rows and columns. Matrices are created with the matrix() function.
matrix.sample <-matrix(vector, nrow=number_of_rows, ncol=number_of_columns, byrow=logical_value, dimnames=list( char_vector_rownames, char_vector_colnames))
where vector contains the elements for the matrix, nrow and ncol specify the row and column dimensions, and dimnames contain optional row and column labels stored in character vectors. The option byrow indicates whether the matrix should be filled in by row (byrow=TRUE) or by column (byrow=FALSE). The default is by column. The following listing demonstrates the matrix function.
## [,1] [,2] [,3] [,4]
## [1,] 1 6 11 16
## [2,] 2 7 12 17
## [3,] 3 8 13 18
## [4,] 4 9 14 19
## [5,] 5 10 15 20
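The call behind this output is not shown. A sketch that would produce the same 5 × 4 matrix, followed by a small by-row example with labels (the names y, cells, rnames, cnames and by.row are assumptions), might look like this:

```r
# Fill a 5 x 4 matrix with 1..20, column by column (the default)
y <- matrix(1:20, nrow = 5, ncol = 4)
y

# Fill a 2 x 2 matrix row by row and label the rows and columns
cells  <- c(1, 26, 24, 68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
by.row <- matrix(cells, nrow = 2, ncol = 2, byrow = TRUE,
                 dimnames = list(rnames, cnames))
by.row

# Elements are identified as [row, column]
y[2, 3]             # 12
by.row["R1", "C2"]  # 26
```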
Matrices are two-dimensional and, like vectors, can contain only one data type. When there are more than two dimensions, we use arrays. When there are multiple modes of data, we use data frames.
Arrays
Arrays are similar to matrices but can have more than two dimensions. They’re created with an array function of the following form:
myarray <- array(vector, dimensions, dimnames)
where vector contains the data for the array, dimensions is a numeric vector giving the maximal index for each dimension, and dimnames is an optional list of dimension labels.
The following gives an example of creating a three-dimensional (2 × 3 × 4) array of numbers.
dim1 <- c("A1", "A2")
dim2 <- c("B1", "B2", "B3")
dim3 <- c("C1", "C2", "C3", "C4")
array1 <- array(1:24, c(2, 3, 4), dimnames=list(dim1, dim2, dim3))
array1
## , , C1
##
## B1 B2 B3
## A1 1 3 5
## A2 2 4 6
##
## , , C2
##
## B1 B2 B3
## A1 7 9 11
## A2 8 10 12
##
## , , C3
##
## B1 B2 B3
## A1 13 15 17
## A2 14 16 18
##
## , , C4
##
## B1 B2 B3
## A1 19 21 23
## A2 20 22 24
# A second example: combining character and numeric vectors converts every value to character
vector1 <- c("A", "B", "C")
vector2 <- c(10, 11, 12, 13, 14, 15)
column.names <- c("COL1", "COL2", "COL3")
row.names <- c("ROW1", "ROW2", "ROW3")
matrix.names <- c("Matrix1", "Matrix2")
# The 9 combined values are recycled to fill the 3 x 3 x 2 array
result <- array(c(vector1, vector2), dim = c(3, 3, 2),
                dimnames = list(row.names, column.names,
                                matrix.names))
print(result)
## , , Matrix1
##
## COL1 COL2 COL3
## ROW1 "A" "10" "13"
## ROW2 "B" "11" "14"
## ROW3 "C" "12" "15"
##
## , , Matrix2
##
## COL1 COL2 COL3
## ROW1 "A" "10" "13"
## ROW2 "B" "11" "14"
## ROW3 "C" "12" "15"
Arrays are a natural extension of matrices. They can be useful in programming new statistical methods. Like matrices, they must be a single mode. Identifying elements follows what you’ve seen for matrices.
The data frame is the fundamental unit for doing data analysis in R. A data frame is rather like a spreadsheet: a table-like form of data which R can read, in which columns represent variables and rows represent observations (or cases).
The survey dataset in Table 1 consists of numeric and character data. Because there are multiple modes of data, we can’t hold the data in a matrix. In this case, a data frame is the structure of choice.
A data frame is created with the data.frame() function:
mydata <- data.frame(col1, col2, col3,...)
where col1, col2, col3, and so on are column vectors of any type (such as character, numeric, or logical). Names for each column can be provided with the names function.
One of the easiest ways to understand a data frame is to learn to make one. This way we can see how the basic components of a data frame fit together.
Let’s start by making an object called sites composed of the names of 7 sites, s1 to s7:
## [1] "s1" "s2" "s3" "s4" "s5" "s6" "s7"
Now, let’s start building a data frame by adding some more variables. Let’s add the number of butterflies counted at each site, the observer who made the count, and a habitat quality score. Numeric entries don’t need quotation marks; character entries do.
butterfly <- c(10, 20, 1, 2, 12, 2, 6)
observer <- c("A", "B", "B", "A", "B", "A", "B")
habitatquality <- c(4, 3, 1, 2, 3, 4, 4)
Now we can make our first data frame, which we can call my.df, by using the data.frame() function:
## sites butterfly observer habitatquality
## 1 s1 10 A 4
## 2 s2 20 B 3
## 3 s3 1 B 1
## 4 s4 2 A 2
## 5 s5 12 B 3
## 6 s6 2 A 4
## 7 s7 6 B 4
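The calls that create sites and my.df are not shown above. Under the assumption that the four vectors were simply passed to data.frame() in this order, a sketch that reproduces the table is:

```r
# The four column vectors, all of length 7
sites <- c("s1", "s2", "s3", "s4", "s5", "s6", "s7")
butterfly <- c(10, 20, 1, 2, 12, 2, 6)
observer <- c("A", "B", "B", "A", "B", "A", "B")
habitatquality <- c(4, 3, 1, 2, 3, 4, 4)

# Each vector becomes a column; the column names are taken from the object names
my.df <- data.frame(sites, butterfly, observer, habitatquality)
my.df
names(my.df)   # "sites" "butterfly" "observer" "habitatquality"
dim(my.df)     # 7 rows, 4 columns
```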
This will make an object called my.df. Data frames are always rectangular in shape.
That means that each vector which makes up the data frame must be of the same length.
If vectors of different lengths are used, the data.frame() function will usually return an error (a shorter vector is recycled only when its length divides the longest length exactly).
If we want to see the entire data frame (or any object), we just highlight its name and then run it.
There are several ways to identify the elements of a data frame. We can use the subscript notation we used before (for example, with matrices), or we can specify column names. Using the data frame created above, the following listing demonstrates these approaches.
Identifying columns
## sites
## 1 s1
## 2 s2
## 3 s3
## 4 s4
## 5 s5
## 6 s6
## 7 s7
## NULL
The $ notation is used to indicate a particular variable
from a given data frame.
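The listing itself is missing from these notes, so the following is only a sketch of the usual ways to pick out columns. The column name species is deliberately one that does not exist in my.df; asking for a missing column with $ returns NULL, which is one possible explanation for the NULL shown above.

```r
my.df <- data.frame(sites = paste0("s", 1:7),
                    butterfly = c(10, 20, 1, 2, 12, 2, 6),
                    observer = c("A", "B", "B", "A", "B", "A", "B"),
                    habitatquality = c(4, 3, 1, 2, 3, 4, 4))

my.df["sites"]                   # a one-column data frame
my.df[c("sites", "butterfly")]   # a two-column data frame
my.df$butterfly                  # the column as a plain vector
my.df[["observer"]]              # same idea, using the name as a string
my.df$species                    # NULL: there is no column with this name
```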
For example, to cross-tabulate sites by observer using the $ notation:
##
## A B
## s1 1 0
## s2 0 1
## s3 0 1
## s4 1 0
## s5 0 1
## s6 1 0
## s7 0 1
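A cross-tabulation of this shape is what table() returns when it is given two columns, so we assume that was the call used. As a sketch:

```r
my.df <- data.frame(sites = paste0("s", 1:7),
                    butterfly = c(10, 20, 1, 2, 12, 2, 6),
                    observer = c("A", "B", "B", "A", "B", "A", "B"),
                    habitatquality = c(4, 3, 1, 2, 3, 4, 4))

# Rows are the sites, columns are the observers, cells count the records
site.by.obs <- table(my.df$sites, my.df$observer)
site.by.obs
```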
Lists are the most complex of the R data types. Basically, a list is an ordered collection of objects (components).
A list allows us to gather a variety of (possibly unrelated) objects under one name.
For example, a list may contain a combination of vectors, matrices, data frames, and even other lists.
title <- "bear rehab list"
age.month <- c(5, 6, 8, 9)
x.matrix <- matrix(1:4, nrow=4)
chr <- c("Sagalee", "Den", "Ithan","Papum")
mylist <- list(title=title, agesinmonth=age.month, ID=x.matrix, Names=chr)
mylist
## $title
## [1] "bear rehab list"
##
## $agesinmonth
## [1] 5 6 8 9
##
## $ID
## [,1]
## [1,] 1
## [2,] 2
## [3,] 3
## [4,] 4
##
## $Names
## [1] "Sagalee" "Den" "Ithan" "Papum"
It is simple to save a data frame as a “.csv” file.
To save an object to a file, simply pass its name and the desired filename to the write.csv() function.
To prevent the addition of a column containing row numbers, we should usually use the row.names = FALSE argument; our saved file will then appear identical to the data frame created in R.
## [1] "C:/Users/Subhasish Arandhara/Desktop/Workshop R in ecology"
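The path printed above looks like the result of getwd(), which reports the current working directory. A sketch of the save step follows; the file name my_df.csv is hypothetical, and the file is written to a temporary folder here only so that the example does not clutter the working directory.

```r
my.df <- data.frame(sites = paste0("s", 1:7),
                    butterfly = c(10, 20, 1, 2, 12, 2, 6),
                    observer = c("A", "B", "B", "A", "B", "A", "B"),
                    habitatquality = c(4, 3, 1, 2, 3, 4, 4))

# Hypothetical file name; use a plain "my_df.csv" to save into the working directory
out.file <- file.path(tempdir(), "my_df.csv")

# row.names = FALSE leaves out the extra column of row numbers
write.csv(my.df, file = out.file, row.names = FALSE)

getwd()                # where files without a full path would be saved
file.exists(out.file)  # TRUE once the file has been written
```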
For most purposes, we will want to work on our real data, but knowing how to simulate a data frame is useful (e.g., to make a reproducible example, to understand R, and to practice modelling approaches).
Now that we know our data structures, we need to put some data in them! When doing data analyses, we are typically faced with data that come from a variety of sources and in a variety of formats. Our task is to import the data into our tools, analyse the data, and report on the results. R provides a wide range of tools for importing data.
Step 1: Build our data in a spreadsheet using the principles of tidy data (i.e. every column is a variable, every row is an observation, each cell consists of a single value)
Step 2: Save the spreadsheet as a ‘.csv’ file (typically found in the ‘save as’ option of the spreadsheet program). This file type is short for ‘comma-separated values’.
Step 3: Copy the path (location) of the csv file (visible in the properties or information associated with the file).
Step 4: Paste the path of the file into the setwd() function (short for “set working directory”). Remember to include quotation marks. The form of the path passed to setwd() differs according to the operating system:
(However, if we are using a ‘project’ in RStudio, we can skip steps 3 and 4 by simply putting the file into the project folder.)
Windows: Copying and pasting the path into an R script gives single backslashes, which R treats as escape characters. For the path to be read by R, each backslash must be replaced with a forward slash (or doubled):
setwd("C:\Users\Me\Documents")   # fails: single backslashes
setwd("C:/Users/Me/Documents")   # works
Mac: Copying and pasting the path into an R script gives forward slashes, so no change is necessary.
setwd("/Users/Me/Documents")
Linux: Copying and pasting the path into an R script gives forward slashes, so no change is necessary.
setwd("/usr/me/documents")
Step 5: Paste the file’s name into the read.csv() function to read the file. Don’t forget to put ‘.csv’ at the end of the file’s name.
mydataframe <- read.csv(file, options)
where file is a delimited ASCII file and the options are parameters controlling how data is processed.
| Option | Description |
|---|---|
| header | A logical value indicating whether the file contains the variable names in the first line. |
| sep | The delimiter separating data values. For read.csv() the default is sep=",". For the more general read.table() the default is sep="", which denotes one or more spaces, tabs, new lines, or carriage returns; use sep="," to read comma-delimited files and sep="\t" to read tab-delimited files. |
| row.names | An optional parameter specifying one or more variables to represent row identifiers. |
| col.names | If the first row of the data file doesn’t contain variable names (header=FALSE), we can use col.names to specify a character vector containing the variable names. If header=FALSE and the col.names option is omitted, variables will be named V1, V2, and so on. |
| na.strings | Optional character vector indicating missing-values codes. For example, na.strings=c("-9", "?") converts each -9 and ? value to NA as the data is read. |
Reading the .csv file
## sites butterfly observer habitatquality
## 1 s1 10 A 4
## 2 s2 20 B 3
## 3 s3 1 B 1
## 4 s4 2 A 2
## 5 s5 12 B 3
## 6 s6 2 A 4
## 7 s7 6 B 4
## 'data.frame': 7 obs. of 4 variables:
## $ sites : chr "s1" "s2" "s3" "s4" ...
## $ butterfly : int 10 20 1 2 12 2 6
## $ observer : chr "A" "B" "B" "A" ...
## $ habitatquality: int 4 3 1 2 3 4 4
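The two outputs above look like a printed data frame followed by the result of str(). A self-contained sketch of the import step is given below; it first writes a small file so that there is something to read, and the file name my_df.csv is hypothetical.

```r
# Write a small example file first (hypothetical name, temporary folder)
csv.file <- file.path(tempdir(), "my_df.csv")
write.csv(data.frame(sites = paste0("s", 1:7),
                     butterfly = c(10, 20, 1, 2, 12, 2, 6),
                     observer = c("A", "B", "B", "A", "B", "A", "B"),
                     habitatquality = c(4, 3, 1, 2, 3, 4, 4)),
          file = csv.file, row.names = FALSE)

# Read it back: header = TRUE takes the variable names from the first line
my.df <- read.csv(csv.file, header = TRUE, stringsAsFactors = FALSE)
my.df        # print the whole data frame
str(my.df)   # compact summary: number of rows, and the type of each column
```

Whole numbers in a text file are read as integers, which is why butterfly and habitatquality are reported as int.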
We can use the class() function to confirm that my.df is a data frame.
## [1] "data.frame"
## [1] "array"
If we want to see the class of a variable within the data frame, we use the ‘$’ operator after the name of the object to reference that particular variable.
## [1] "integer"
So we see that R stores my.df$habitatquality as an integer, that is, as a numeric variable.
Categorical data are usually handled best in R by defining them as factors. We can do this by using the as.factor() function and telling R the name of the data set and the variable we want it to convert:
## [1] "factor"
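The calls behind these class() outputs are not shown. The sketch below gives calls that return the same four classes; which variable was actually turned into a factor is unknown, so the use of observer here is an assumption.

```r
my.df <- data.frame(sites = paste0("s", 1:7),
                    butterfly = c(10L, 20L, 1L, 2L, 12L, 2L, 6L),
                    observer = c("A", "B", "B", "A", "B", "A", "B"),
                    habitatquality = c(4L, 3L, 1L, 2L, 3L, 4L, 4L))

class(my.df)                        # "data.frame"
class(array(1:24, c(2, 3, 4)))      # "array"
class(my.df$habitatquality)         # "integer" (as after importing from a .csv file)

# Convert a categorical variable to a factor (observer is our assumed choice)
my.df$observer <- as.factor(my.df$observer)
class(my.df$observer)               # "factor"
levels(my.df$observer)              # "A" "B"
```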
Data analysts typically annotate datasets to make the results easier to interpret. Annotating generally includes adding descriptive labels to variable names and value labels to the codes used for categorical variables.
Variables can be relabelled in two ways.
We can use the following statement to invoke an interactive editor:
fix(dataframe)
For example, for the variable habitatquality, we might want to attach the short label “hab_qual” (a higher score means better habitat for the species).
## sites butterfly observer hab_qual
## 1 s1 10 A 4
## 2 s2 20 B 3
## 3 s3 1 B 1
## 4 s4 2 A 2
## 5 s5 12 B 3
## 6 s6 2 A 4
## 7 s7 6 B 4
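Besides the interactive editor, a column can be relabelled in code with the names() function. This is a sketch of that programmatic route, not necessarily the call used to produce the table above:

```r
my.df <- data.frame(sites = paste0("s", 1:7),
                    butterfly = c(10, 20, 1, 2, 12, 2, 6),
                    observer = c("A", "B", "B", "A", "B", "A", "B"),
                    habitatquality = c(4, 3, 1, 2, 3, 4, 4))

# Find the column called "habitatquality" and give it the shorter label
names(my.df)[names(my.df) == "habitatquality"] <- "hab_qual"
names(my.df)   # "sites" "butterfly" "observer" "hab_qual"
```

Selecting the column by its old name rather than by position keeps the code correct even if the column order changes later.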
For the variable hab_qual, scored 1–5, we might want to replace the numeric codes with descriptive value labels, from “worst” to “excellent”.
my.df$`hab_qual` <- factor(my.df$`hab_qual`, levels = c(1,2, 3, 4, 5),
labels = c("worst","poor", "good", "better", "excellent"))
my.df
## sites butterfly observer hab_qual
## 1 s1 10 A better
## 2 s2 20 B good
## 3 s3 1 B worst
## 4 s4 2 A poor
## 5 s5 12 B good
## 6 s6 2 A better
## 7 s7 6 B better
In R, values in a data frame are referenced by their location in terms of rows and columns using the ‘[’ and ‘]’ operators, taking the form:
my.df[row, column]
This allows us to subset data sets manually. For example, the 4th row of the 1st column would be:
## [1] "s4"
We can select an entire row of the data frame by giving the row position but leaving the column position empty:
## sites butterfly observer hab_qual
## 3 s3 1 B worst
Similarly, we can select a column of the data frame by giving the column position but leaving the row position empty:
## [1] "s1" "s2" "s3" "s4" "s5" "s6" "s7"
Values can be removed by using a negative sign. For example, we can remove the 1st row of our data frame:
## sites butterfly observer hab_qual
## 2 s2 20 B good
## 3 s3 1 B worst
## 4 s4 2 A poor
## 5 s5 12 B good
## 6 s6 2 A better
## 7 s7 6 B better
Multiple rows or columns can be removed or included by using the c() function. By including a ‘-’ sign before the c() function we can remove a number of rows:
## sites butterfly observer hab_qual
## 2 s2 20 B good
## 4 s4 2 A poor
To see the first three entries of the sites column of my.df:
## [1] "s1" "s2" "s3"
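The subsetting calls for this part are not shown. The sketch below lists calls that give results like those printed above, using a copy of my.df with hab_qual left as plain numbers to keep the example short:

```r
my.df <- data.frame(sites = paste0("s", 1:7),
                    butterfly = c(10, 20, 1, 2, 12, 2, 6),
                    observer = c("A", "B", "B", "A", "B", "A", "B"),
                    hab_qual = c(4, 3, 1, 2, 3, 4, 4))

my.df[4, 1]                    # row 4, column 1: "s4"
my.df[3, 1]                    # row 3, column 1: "s3"
my.df[3, ]                     # the whole 3rd row
my.df[, 1]                     # the whole 1st column, as a vector
my.df[-1, ]                    # everything except the 1st row
my.df[-c(1, 3, 5, 6, 7), ]     # drop several rows at once, leaving s2 and s4
my.df$sites[1:3]               # first three entries of one column
```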
We can also use ‘[’ ‘]’ to subset some built-in constants in R such as:
Alphabets
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
Months
## [1] "January" "February" "March" "April" "May"
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug"
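These three outputs correspond to R's built-in constants LETTERS, month.name and month.abb. A sketch of the subsetting that yields them:

```r
LETTERS[1:10]     # the first ten upper-case letters, "A" to "J"
month.name[1:5]   # full month names, "January" to "May"
month.abb[1:8]    # abbreviated month names, "Jan" to "Aug"
letters[1:3]      # the lower-case version also exists: "a" "b" "c"
```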
An example dataset: scores given to a team of biologists who differ in the ways they run their projects.
| biologist | exp_year | age | team_work | field_work | comms_skill |
|---|---|---|---|---|---|
| A | 6 | 37 | 4 | 2 | 4 |
| B | 5 | 31 | 4 | 2 | 5 |
| C | 7 | 00 | 5 | | |
| D | 2 | 24 | 1 | 1 | 2 |
biologist<-c("A", "B", "C", "D")
exp_year<-c(6,5,7,2)
age<-c(37,31,00,24)
team_work<-c(4,4,5,1)
field_work<-c(2,2,NA,1)
comms_skill<-c(4,5,NA,2)
bio.scor<-data.frame(biologist, exp_year, age, team_work, field_work,comms_skill)
bio.scor
## biologist exp_year age team_work field_work comms_skill
## 1 A 6 37 4 2 4
## 2 B 5 31 4 2 5
## 3 C 7 0 5 NA NA
## 4 D 2 24 1 1 2
| Operator | Description |
|---|---|
| + | Addition. |
| - | Subtraction. |
| * | Multiplication. |
| / | Division. |
| ^ or ** | Exponent |
| x%%y | Modulus (x mod y): for example, 5%%2 is 1. |
| x%/%y | Integer division: for example, 5%/%2 is 2. |
Combining three scores to obtain a mean score
Now we have a data frame named bio.scor, with variables team_work, field_work and comms_skill, and we want to create a new variable sum.scor that adds these variables and a new variable called mean.scor that averages them.
sum.scor <- bio.scor$team_work + bio.scor$field_work+ bio.scor$comms_skill
mean.scor <- (bio.scor$team_work + bio.scor$field_work+bio.scor$comms_skill)/3
sum.scor
## [1] 10 11 NA 4
mean.scor
## [1] 3.333333 3.666667 NA 1.333333
The statements succeed, but let’s check whether these vectors were added to the data frame:
## biologist exp_year age team_work field_work comms_skill
## 1 A 6 37 4 2 4
## 2 B 5 31 4 2 5
## 3 C 7 0 5 NA NA
## 4 D 2 24 1 1 2
This probably isn’t the result we want. Ultimately, we want to incorporate new variables into the original data frame.
bio.scor$sum.scor <- bio.scor$team_work + bio.scor$field_work+ bio.scor$comms_skill
bio.scor$mean.scor <- (bio.scor$team_work + bio.scor$field_work+bio.scor$comms_skill)/3
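bio.scor$sum.scor
## [1] 10 11 NA 4
bio.scor$mean.scor
## [1] 3.333333 3.666667 NA 1.333333
# transform() does the same in one call; columns can be referred to by name inside it
bio.scor <- transform(bio.scor,
                      sum.scor = team_work + field_work + comms_skill,
                      mean.scor = (team_work + field_work + comms_skill) / 3)
bio.scor
## biologist exp_year age team_work field_work comms_skill sum.scor mean.scor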
## 1 A 6 37 4 2 4 10 3.333333
## 2 B 5 31 4 2 5 11 3.666667
## 3 C 7 0 5 NA NA NA NA
## 4 D 2 24 1 1 2 4 1.333333
Earlier we learned about changing data labels. Recoding, by contrast, involves creating new values of a variable based on the existing values of the same and/or other variables.
For example, we may want to change a continuous variable into a set of categories
Replace miscoded values with correct values
Create an agree/disagree variable based on a set of cutoff scores.
Statement for conditional recoding
variable[condition] <- expression
To recode data, we can use one or more of R’s logical operators
| Operator | Description |
|---|---|
| < | Less than |
| <= | Less than or equal to |
| > | Greater than |
| >= | Greater than or equal to |
| == | Exactly equal to |
| != | Not equal to |
| !x | Not x |
| x \| y | x or y |
| x & y | x and y |
| isTRUE(x) | Tests whether x is TRUE |
Let’s say we want to recode the ages of the biologists in the earlier dataset from the continuous variable age to the categorical variable agecat (Young, Middle Aged, Elder).
First, we must recode the value 00 for age to indicate that the value is missing.
The statement variable[condition] <- expression will only make the assignment when condition is TRUE.
Once missing values for age have been specified, we can create the agecat variable in the same way.
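The two recoding steps described above could be sketched as follows. Only the cutoff for “Elder” (age above 40) appears in these notes, in the exercise hint at the end; the other two cutoffs are assumptions, chosen only so that the result agrees with the agecat values printed later.

```r
bio.scor <- data.frame(biologist = c("A", "B", "C", "D"),
                       age = c(37, 31, 0, 24))

# Step 1: an age of 0 is a placeholder, so mark it as missing
bio.scor$age[bio.scor$age == 0] <- NA

# Step 2: start with an empty category column, then fill it condition by condition
bio.scor$agecat <- NA_character_
bio.scor$agecat[bio.scor$age > 40] <- "Elder"
bio.scor$agecat[bio.scor$age >= 38 & bio.scor$age <= 40] <- "Middle Aged"  # assumed cutoff
bio.scor$agecat[bio.scor$age < 38] <- "Young"                              # assumed cutoff
bio.scor
```

Rows where age is NA match none of the conditions, so their agecat stays NA.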
Finally, the plyr package has a powerful set of functions for modifying datasets.
install.packages("plyr")
The format of the rename() function is
rename(dataframe, c(oldname="newname", oldname="newname",...))
Here’s an example with the bio.scor data:
## Warning: package 'plyr' was built under R version 4.2.3
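The example call itself is missing. Judging from the column names bioID and age.bio that appear in a later output, it probably renamed biologist and age; the sketch below makes that assumption and falls back to base R if plyr is not installed.

```r
bio.scor <- data.frame(biologist = c("A", "B", "C", "D"),
                       exp_year = c(6, 5, 7, 2),
                       age = c(37, 31, NA, 24))

if (requireNamespace("plyr", quietly = TRUE)) {
  # rename(dataframe, c(oldname = "newname", ...))
  bio.scor <- plyr::rename(bio.scor, c(biologist = "bioID", age = "age.bio"))
} else {
  # Base R equivalent when plyr is not available
  names(bio.scor)[names(bio.scor) == "biologist"] <- "bioID"
  names(bio.scor)[names(bio.scor) == "age"] <- "age.bio"
}
names(bio.scor)   # "bioID" "exp_year" "age.bio"
```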
The is.na() function allows us to test for the presence of missing values.
Any value of age that’s equal to 00 is changed to NA.
Once we’ve identified missing values, we need to deal with them before we can continue analysing the data. The reason is that arithmetic expressions and functions applied to missing values produce missing values.
## [1] NA
The na.rm=TRUE option removes missing values prior to calculations and applies the function to the remaining values.
## [1] 3.666667
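The calls behind these two results are not shown. The value 3.666667 is what we get by averaging the three non-missing comms_skill scores, so a plausible sketch is:

```r
comms_skill <- c(4, 5, NA, 2)

is.na(comms_skill)               # FALSE FALSE TRUE FALSE: the third value is missing
mean(comms_skill)                # NA, because one value is missing
mean(comms_skill, na.rm = TRUE)  # 3.666667, the mean of 4, 5 and 2
sum(is.na(comms_skill))          # 1: a quick count of missing values
```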
The na.omit() function deletes any rows with missing data.
## bioID exp_year age.bio team_work field_work comms_skill sum.scor mean.scor
## 1 A 6 37 4 2 4 10 3.333333
## 2 B 5 31 4 2 5 11 3.666667
## 4 D 2 24 1 1 2 4 1.333333
## agecat
## 1 Young
## 2 Young
## 4 Young
Here, any rows containing missing data are deleted from the dataframe before the results are saved to a new dataframe.
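As a sketch of this listwise deletion on a cut-down version of the scores data (the name bio.complete for the new data frame is an assumption):

```r
bio.scor <- data.frame(biologist = c("A", "B", "C", "D"),
                       team_work = c(4, 4, 5, 1),
                       field_work = c(2, 2, NA, 1),
                       comms_skill = c(4, 5, NA, 2))

# Keep only the rows with no missing value in any column
bio.complete <- na.omit(bio.scor)
bio.complete             # biologist C is gone; the original row labels 1, 2, 4 are kept
nrow(bio.scor)           # 4
nrow(bio.complete)       # 3
```

The original bio.scor is left unchanged, so the deleted rows can still be inspected later.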
R typically handles dates by first reading them as character strings, which we can then convert into date variables that are stored numerically.
The function as.Date() is used to make this translation.
The syntax is as.Date(x, "input_format"), where x is the character data and input_format gives the appropriate format for reading the date.
| Symbol | Meaning | Example |
|---|---|---|
| %d | Day as a number (01–31) | 01–31 |
| %a | Abbreviated weekday | Mon |
| %A | Unabbreviated weekday | Monday |
| %m | Month as a number (01–12) | 01–12 |
| %b | Abbreviated month | Jan |
| %B | Unabbreviated month | January |
| %y | Two-digit year | 24 |
| %Y | Four-digit year | 2024 |
## [1] "2024-05-26" "2024-05-25"
## [1] "01/05/2024" "08/06/2024"
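The conversion calls are not shown. The sketch below uses the same strings as the two outputs above; reading the second pair as day/month/year is our assumption, and the object names are hypothetical.

```r
# Strings already in the default yyyy-mm-dd form need no format
mydates <- as.Date(c("2024-05-26", "2024-05-25"))
mydates

# Other layouts need an input format built from the symbols in the table
strDates <- c("01/05/2024", "08/06/2024")
dates <- as.Date(strDates, format = "%d/%m/%Y")   # assumed to be day/month/year
dates
class(dates)   # "Date"
```

If the same strings were month/day/year instead, the format would be "%m/%d/%Y" and the resulting dates would differ, so it pays to check the source of the data.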
The function Sys.Date() can be used to obtain the current date, while the date() function can be used to retrieve both the current date and time. As I write this, today is 26 May 2024, so when those functions are executed, they produce the output below:
## [1] "2024-05-26"
## [1] "Sun May 26 12:11:38 2024"
We can use the format(x, format="output_format") function to output dates in a specified format and to extract portions of dates:
## [1] "May 26 2024"
## [1] "Sunday"
## Time difference of 365 days
The difftime() function can be utilised to calculate time intervals and represent them in various units such as seconds, minutes, hours, days, or weeks.
## Time difference of 1574.143 weeks
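The calls for this part are not shown, and the dates behind the two time differences are unknown. The sketch below therefore uses hypothetical start and end dates; month and weekday names are printed in the language of the computer's locale.

```r
today <- as.Date("2024-05-26")      # fixed here so the example is reproducible
format(today, format = "%B %d %Y")  # month name, day, year
format(today, format = "%A")        # weekday name
format(today, format = "%d")        # "26"

# Subtracting two dates gives a time difference in days
start <- as.Date("2024-01-01")      # hypothetical dates
end   <- as.Date("2024-12-31")
gap.days <- end - start
gap.days

# difftime() lets us choose the units
gap.weeks <- difftime(end, start, units = "weeks")
gap.weeks
```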
Horizontal merge
When combining two data frames (datasets) horizontally, the merge() function comes in handy. The two data frames are connected through one or more shared key variables.
sitex<-c("site1","site2","site3","site4")
abundancex<-c(2,1,5,6)
siteabu.df<-data.frame(sitex,abundancex)
siteabu.df
## sitex abundancex
## 1 site1 2
## 2 site2 1
## 3 site3 5
## 4 site4 6
sitex<-c("site1","site2","site3","site4")
habitatx<-c("prim.forest","Sec.forest","grassland","water")
sitehab.df<-data.frame(sitex,habitatx)
sitehab.df
## sitex habitatx
## 1 site1 prim.forest
## 2 site2 Sec.forest
## 3 site3 grassland
## 4 site4 water
## sitex abundancex habitatx
## 1 site1 2 prim.forest
## 2 site2 1 Sec.forest
## 3 site3 5 grassland
## 4 site4 6 water
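The merge call is not shown. Since sitex is the only variable the two data frames share, a sketch of the horizontal merge is (the name site.df for the result is an assumption):

```r
sitex <- c("site1", "site2", "site3", "site4")
siteabu.df <- data.frame(sitex, abundancex = c(2, 1, 5, 6))
sitehab.df <- data.frame(sitex, habitatx = c("prim.forest", "Sec.forest", "grassland", "water"))

# Join the two data frames on the shared key variable sitex
site.df <- merge(siteabu.df, sitehab.df, by = "sitex")
site.df
```

By default merge() keeps only the sites that occur in both data frames; the all = TRUE argument keeps unmatched sites as well and fills the gaps with NA.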
Vertical merge
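A vertical merge stacks the rows of one data frame under those of another; in base R this is done with rbind(). The two data frames must have the same variables, but they don’t have to be in the same order. Vertical merging is typically used to add observations to a data frame.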
## sitex abundancex
## 1 site5 23
## sitex abundancex
## 1 site1 2
## 2 site2 1
## 3 site3 5
## 4 site4 6
## 5 site5 23
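A sketch of this vertical merge with rbind() follows. The object names new.site and all.sites are assumptions, and the new row is deliberately built with its columns in the opposite order to show that rbind() matches columns by name.

```r
siteabu.df <- data.frame(sitex = c("site1", "site2", "site3", "site4"),
                         abundancex = c(2, 1, 5, 6))

# One new observation, with the columns in a different order
new.site <- data.frame(abundancex = 23, sitex = "site5")

# Stack the new row under the existing ones
all.sites <- rbind(siteabu.df, new.site)
all.sites
```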
Transposing refers to the process of converting the rows of a data frame into columns, and the columns into rows. Transposing is a valuable technique that can be used for different purposes, such as restructuring data or getting it ready for specific analyses.
The data.table package can be used to transpose a data frame with its transpose() function:
transpose(dataframe)
library(data.table)
## Warning: package 'data.table' was built under R version 4.2.3
# transpose, then restore the labels: old row names become column names and vice versa
t_my.df <- transpose(my.df)
colnames(t_my.df) <- rownames(my.df)
rownames(t_my.df) <- colnames(my.df)
t_my.df
## 1 2 3 4 5 6 7
## sites s1 s2 s3 s4 s5 s6 s7
## butterfly 10 20 1 2 12 2 6
## observer A B B A B A B
## hab_qual better good worst poor good better better
Pivot Tables combine column values for statistical summaries. In R, this can be done with the dplyr package, a data manipulation package. We will look into pivots in more detail in the next session.
Here, the group_by() function groups data by one or more variables, and the summarise() function creates a summary of the data for those groups.
df %>% group_by( grouping_variables) %>% summarize( label = aggregate_fun() )
## Warning: package 'dplyr' was built under R version 4.2.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## # A tibble: 2 × 2
## observer sum_values
## <chr> <dbl>
## 1 A 14
## 2 B 39
## # A tibble: 2 × 2
## observer mean.butterfly
## <chr> <dbl>
## 1 A 4.67
## 2 B 9.75
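The two summaries above are not accompanied by their code. The sketch below assumes they were computed from my.df by grouping on observer; the dplyr part runs only if the package is installed, and the base R tapply() lines give the same numbers as a cross-check.

```r
my.df <- data.frame(sites = paste0("s", 1:7),
                    butterfly = c(10, 20, 1, 2, 12, 2, 6),
                    observer = c("A", "B", "B", "A", "B", "A", "B"))

if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)
  # Total and mean butterfly count per observer
  print(my.df %>% group_by(observer) %>% summarise(sum_values = sum(butterfly)))
  print(my.df %>% group_by(observer) %>% summarise(mean.butterfly = mean(butterfly)))
}

# Base R cross-check of the same group summaries
obs.sum  <- tapply(my.df$butterfly, my.df$observer, sum)    # A = 14, B = 39
obs.mean <- tapply(my.df$butterfly, my.df$observer, mean)   # A = 4.67, B = 9.75
obs.sum
obs.mean
```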
## [1] "C:/Users/Subhasish Arandhara/Desktop/Workshop R in ecology"
## site species observer date grid_ref counts activity habitat
## 1 Site1 butterfly1 Obs1 12/5/2024 2342 1 flying grassland
## 2 Site1 butterfly1 Obs1 12/5/2024 2342 1 resting forest
## 3 Site1 butterfly1 Obs1 12/5/2024 2342 1 flying riverbank
## 4 Site1 butterfly1 Obs1 12/5/2024 2342 1 basking riverbank
## 5 Site1 butterfly3 Obs1 15/5/2024 2342 1 flying riverbank
## 6 Site1 butterfly3 Obs1 15/5/2024 2342 1 resting grassland
## 7 Site1 butterfly3 Obs1 15/5/2024 2342 1 flying riverbank
## 8 Site1 butterfly3 Obs1 15/5/2024 2342 1 resting riverbank
## 9 Site1 butterfly3 Obs1 15/5/2024 2342 1 basking riverbank
## 10 Site1 butterfly3 Obs1 15/5/2024 2342 1 basking forest
## 11 Site1 butterfly3 Obs1 15/5/2024 2342 1 basking forest
## 12 Site1 butterfly3 Obs1 15/5/2024 2342 1 flying riverbank
## 13 Site2 butterfly2 Obs2 12/5/2024 2343 1 nectaring forest
## 14 Site2 butterfly4 Obs2 17/5/2024 2343 1 nectaring grassland
## 15 Site2 butterfly4 Obs2 17/5/2024 2343 1 nectaring forest
## 16 Site2 butterfly4 Obs2 17/5/2024 2343 1 flying riverbank
## 17 Site2 butterfly4 Obs2 17/5/2024 2343 1 flying forest
## 18 Site2 butterfly4 Obs2 17/5/2024 2343 1 nectaring grassland
## 19 Site2 butterfly4 Obs2 17/5/2024 2343 1 resting riverbank
## 20 Site2 butterfly4 Obs2 17/5/2024 2343 1 resting forest
## 21 Site2 butterfly4 Obs2 17/5/2024 2343 1 nectaring grassland
## [1] 1 1 1 1 3 3 3 3 3 3 3 3 2 4 4 4 4 4 4 4 4
## site species date counts
## 1 Site1 butterfly1 12/5/2024 4
## 2 Site2 butterfly2 12/5/2024 1
## 3 Site1 butterfly3 15/5/2024 8
## 4 Site2 butterfly4 17/5/2024 8
Importing: Import the sample.csv file provided in this session as a data frame.
Annotating: The column names in sample.csv are misspelled in obvious ways; correct them through the GUI data editor using the fix() function.
Investigating: Check which column in sample.csv could be a categorical (or score) variable using the class() function and the ‘$’ operator, and transform it into a factor variable using the as.factor() function.
Creating: Create a new column (variable) from the factor variable transformed in step 3, as a set of categories (5=Excellent; 4=Best; 3=Good; 2=Poor; 1=Worse), using conditional recoding with the logical operators. (Clue: bio.scor$agecat[bio.scor$age > 40] <- "Elder")