Lecture 1 - Getting started in R

Author

Jan C Greyling

Getting started

Now that you have done the DataCamp courses, it is time for you to apply your new skills to a real dataset. However, before we get there, we have to lay down some basics of using R on your computer.

Getting started in R and R Studio

Download and install

The first step is to download R from the CRAN repository. I suggest you do not install the latest version but opt for a somewhat earlier version, since it is likely to be more stable. You can download it here; ensure that you select the correct version for your operating system.

Now you can download RStudio. While some people prefer to use base R (installed above), most opt to use RStudio as an integrated development environment for base R. Thus RStudio is not R but rather a different way to interact with R. This is also the reason why R has to be installed before RStudio. RStudio is available here. Again, be sure to install the correct version for your operating system.

Project folders and file

Go on the Teams site and download the “z_blank_folder” in the Class Materials folder. Move the downloaded folder to a different folder on your computer for safekeeping since this folder will be the basis of all your future R projects. Move a copy of “z_blank_folder” to your folder for this course and unzip it. Rename the unzipped folder appropriately and open it. Note that it has five folders in it: “Data”, “Figures”, “Output”, “Scripts”, and “Supplementary Materials”. These will help keep your project folder organised when you import data into your model or write results. All of your scripts should be saved to your “Scripts” folder, and supplementary materials could include relevant articles etc., that pertain to the project.

Now you have to create your project file. It is essential that your project file is in your overarching project folder, in other words, your “R_simmulation” folder (or whatever you renamed it to). If not, you will not be able to import your data etc., since your working directory would not be correct (more info). To create your project file, open RStudio, go to File -> New project…, and select “Existing directory”. After that, click on “Browse” and find your “R_simmulation” folder (or whatever you renamed it to), select it and click on “Create project”. If you succeeded, the top right corner of RStudio should show your “R_simmulation” folder (or whatever you renamed it to); see Figure 1.

Figure 1: Check to see if you are working in the right project

In addition, the project folders should show in the bottom right corner if you created the project file in the right place:

Figure 2: Check to see if your project folders are there
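You can also run the same check from the Console. The sketch below is a minimal illustration, assuming the five folder names listed above; the helper name check_project_folders is made up for this example, and the demonstration runs on a throwaway temporary folder so that nothing in your real project is touched.

```r
# Hypothetical helper (not part of any package): reports which of the
# template folders exist under `root` (by default the working directory).
check_project_folders <- function(root = getwd()) {
  expected <- c("Data", "Figures", "Output", "Scripts", "Supplementary Materials")
  found <- dir.exists(file.path(root, expected))
  names(found) <- expected
  found
}

# Demonstration on a temporary folder that only contains Data and Scripts
demo_root <- file.path(tempdir(), "demo_project")
dir.create(file.path(demo_root, "Data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(demo_root, "Scripts"), showWarnings = FALSE)

demo_check <- check_project_folders(demo_root)
print(demo_check)
```

In your own project you would simply call check_project_folders() with no arguments; if any entry is FALSE, your working directory is probably not the project folder.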

Installing and loading packages

Now that you’ve created your project folder and project file (check the top right corner of RStudio to confirm that you’re working in it), we can get started by installing all of the packages required by the analysis using the command install.packages("Package_Name"). In R, we regularly use “packages”, which are collections of precoded functions developed and shared by other R users. For this part of the work, we will use packages such as data.table.

You only have to install them once, but some of them will have to be updated from time to time by re-installing them. After that, we can load the required packages into memory. Basically, we purchased the tools in the previous step and now we have to put them on the table. This can be done using the require("Package_Name") function. However, before we get to this, some coding best practices.

Task: Install the data.table package by typing install.packages("data.table") into the Console and pressing Enter.
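Because packages only need to be installed once, it can be handy to check which ones are already present before calling install.packages(). The sketch below does this with base R's requireNamespace(). The helper name find_missing_packages is made up for this illustration, and the install line is left commented out so that nothing is downloaded unless you choose to.

```r
# Hypothetical helper (not part of any package): returns the names of the
# packages in `pkgs` that are not yet installed on this computer.
find_missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("data.table", "tidyverse")
missing_pkgs <- find_missing_packages(needed)
print(missing_pkgs)

# Uncomment to install whatever is missing:
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```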

Scripts and coding best practice

Now, please create a new script by clicking on the white page with the green plus sign in the top left corner of your screen, and save it in your Scripts folder by clicking the save button with the single disk icon.

Scripts are lines of commands that RStudio essentially pastes into the Console and runs sequentially when you hit the Run button on the right-hand side of the window (highlighted in yellow above). Scripts ensure that all of your operations are replicable and, unlike Excel, they document all of your data manipulation and analysis steps.

I start all of my scripts with the lines of code below since it helps me to keep track of all scripts etc.

Note that there are five colours of text in the code, which indicate different types of commands. R will ignore any line of code that starts with a #; hence it is typically used for script commenting and creating section dividers such as this: #*#*#*#*#*#*#*#*#*#*#*# Also, note that RStudio treats a comment line as a section header when it ends with at least four hashes (or dashes or equals signs), like this: #### some text ####. A shortcut is to hit Ctrl + Shift + R together; this will bring up a window wherein you can name a different kind of section.

Code

# Client: Jan C Greyling
# Project: My first coding project
# Script 1: Working with actual data

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*

remove(list=ls()) #clear all from memory

require("data.table") # loading the package called "data.table"
require("tidyverse")

Note the line of code remove(list=ls()); I use this at the top of all of my scripts. This line of code simply deletes all of the items in your Global Environment, essentially wiping the table clean before you start your analysis.

The next line of code (require("data.table")) loads the data.table package that you need for this analysis, as discussed above.
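One detail worth knowing: require() and library() both load a package, but they behave differently when the package is not installed. require() gives a warning and returns FALSE, so the script carries on and only fails later, whereas library() stops with an error straight away. The sketch below shows the difference using a package name that is deliberately made up.

```r
fake_pkg <- "notARealPackage123"  # deliberately non-existent, for illustration

# require() returns FALSE (with a warning, suppressed here) and carries on
loaded <- suppressWarnings(
  require(fake_pkg, character.only = TRUE, quietly = TRUE)
)
print(loaded)

# library() raises an error, which we catch here so the example keeps running
outcome <- tryCatch(
  {
    library(fake_pkg, character.only = TRUE)
    "loaded"
  },
  error = function(e) "error"
)
print(outcome)
```

Either function works at the top of a script; library() simply makes a missing package obvious sooner.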

Working with real data

Reading data

Go to the course Teams page, download the dataset called wheat_student_data and move it to the Data folder of your project folder. Then load the data and call it dat. If you succeed, it will appear in your Global Environment. We can have a look at the data in several ways. If it is a small dataset, you can simply click on it, after which it will open in a new tab called “dat”. However, this does not work well if you are working with large datasets. Then it is better to use the head() or tail() functions, which display the first and last observations in the dataset. You can also look at the structure of the data using the str() function.

Code

dat <- fread("Data/wheat_student_data.csv")
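If fread() complains that the file does not exist, the usual cause is that the working directory is not your project folder. The short check below is a sketch that assumes the file name and folder used above; it builds the relative path and tells you where R is actually looking.

```r
# Relative path, resolved against the current working directory
data_path <- file.path("Data", "wheat_student_data.csv")

if (file.exists(data_path)) {
  message("Found ", data_path)
} else {
  message("Could not find ", data_path, " from working directory ", getwd())
}
```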

Looking at the data

You can get an idea of the dataset using the following functions:

Code

head(dat)

   year   local rep plot     yld N_plant N_tdress N_spray
1: 2016 Darling   1    1  530.30       0        0       0
2: 2016 Darling   1    2 1157.89      25       25       0
3: 2016 Darling   1    3 1656.37      25      105       0
4: 2016 Darling   1    4  946.80      25       50       0
5: 2016 Darling   1    5 3059.27      25      135       0
6: 2016 Darling   1    6 1630.23      25      165       0

Code

tail(dat)

   year     local rep plot  yld N_plant N_tdress N_spray
1: 2019 Tygerhoek   4   29 3605      25       25       0
2: 2019 Tygerhoek   4   28 3647      25       50       0
3: 2019 Tygerhoek   4   26 3434      25       75       0
4: 2019 Tygerhoek   4   27 2762      25      105       0
5: 2019 Tygerhoek   4   30 3212      25      135       0
6: 2019 Tygerhoek   4   31 3582      25      165       0

Code

str(dat)

Classes 'data.table' and 'data.frame':  856 obs. of  8 variables:
 $ year    : int  2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
 $ local   : chr  "Darling" "Darling" "Darling" "Darling" ...
 $ rep     : int  1 1 1 1 1 1 1 1 2 2 ...
 $ plot    : int  1 2 3 4 5 6 7 8 10 11 ...
 $ yld     : num  530 1158 1656 947 3059 ...
 $ N_plant : int  0 25 25 25 25 25 25 25 25 25 ...
 $ N_tdress: int  0 25 105 50 135 165 0 75 75 135 ...
 $ N_spray : int  0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, ".internal.selfref")=<externalptr>

You can also open the dataset using the View() function.

Code

View(dat)

What does it mean?

Ok, so having the dataset loaded is one thing, but what does it mean? In other words, what are the variables in the dataset? Let’s print the names of all the variables.

Code

names(dat)

[1] "year"     "local"    "rep"      "plot"     "yld"      "N_plant"  "N_tdress"
[8] "N_spray"

These data were collected through a trial designed to establish the relationship between wheat yield and nitrogen applied. The trial was conducted over multiple years (year) and at multiple locations (local). At each location there are multiple replicates, and multiple plots within each replicate; hence the variables rep and plot. The variable yld shows the wheat yield of the plot. N_plant, N_tdress and N_spray show the different nitrogen applications: at planting, as top dressing and sprayed, respectively.

Now that you’ve loaded the data and have a better understanding of its structure, how many unique years, locations, and replicates are in the dataset? Before we answer this, we have to lay a bit of foundation regarding the data.table package.

data.table package

Basics

data.table is an R package that provides an enhanced version of data.frames, which are the standard data structure for storing data in base R. In the section above, we already created a data.table using fread(). We can also create one using the data.table() function. Here is an example:

Code

DT = data.table(
  ID = c("b","b","b","a","a","c"),
  a = 1:6,
  b = 7:12,
  c = 13:18
)
DT

   ID a  b  c
1:  b 1  7 13
2:  b 2  8 14
3:  b 3  9 15
4:  a 4 10 16
5:  a 5 11 17
6:  c 6 12 18

You can also convert existing objects to a data.table using setDT() (for data.frames and lists) and as.data.table() (for other structures).
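As a small illustration of the two conversion routes, the sketch below uses made-up objects: a data.frame converted in place with setDT(), and a matrix converted with as.data.table().

```r
library(data.table)

# A plain data.frame with made-up values
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
setDT(df)                # converts df itself, by reference (no assignment needed)
print(class(df))

# A matrix is neither a data.frame nor a list, so use as.data.table()
m <- matrix(1:4, nrow = 2)
mt <- as.data.table(m)   # returns a new data.table and leaves m untouched
print(mt)
```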

In contrast to a data.frame, you can do a lot more than just subsetting rows and selecting columns within the frame of a data.table, i.e., within [ … ] (NB: we might also refer to writing things inside DT[…] as “querying DT”, in analogy to SQL). To understand it we will have to first look at the general form of data.table syntax, as shown below:

Code

#DT[i, j, by]

##   R:                 i                 j        by
## SQL:  where | order by   select | update  group by

The way to read it (out loud) is:

Take DT, subset/reorder rows using i, then calculate j, grouped by by.
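To make this concrete, here is a minimal sketch that uses all three parts at once on the small DT table from the data.table() example. The choice of filter and summary is arbitrary and only meant as an illustration.

```r
library(data.table)

# Same toy table as in the data.table() example
DT <- data.table(
  ID = c("b", "b", "b", "a", "a", "c"),
  a = 1:6,
  b = 7:12,
  c = 13:18
)

# i:  keep only the rows where a > 2
# j:  calculate the sum of b and call it total_b
# by: do the calculation separately for each ID
res <- DT[a > 2, .(total_b = sum(b)), by = ID]
print(res)
```

Read out loud: take DT, keep the rows where a is greater than 2, then sum b, grouped by ID. The groups appear in the order in which they are first encountered in the data.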

Let’s begin by looking at i and j using our dataset of yield trials.

Advantages

You should use data.table because:

It loads data very quickly. With the fread() function in the data.table package, loading large datasets typically takes just a few seconds.
It is often faster than the popular dplyr and plyr packages for data manipulation, and it covers tasks such as aggregating, filtering, merging and grouping.
Writing files is also fast: the package provides the fwrite() function, which writes in parallel and is much faster than write.csv(). So, next time you have to write 1 million rows, try this function.
Built-in features such as automatic indexing, rolling joins and overlapping range joins further improve the experience of working with large datasets.
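The sketch below shows fwrite() and fread() working together on a tiny made-up table written to a temporary file. It only demonstrates the usage; with so few rows you will not notice any speed difference.

```r
library(data.table)

# Made-up values for illustration only
toy <- data.table(id = 1:3, value = c(1.5, 2.5, 3.5))

tmp <- tempfile(fileext = ".csv")  # temporary file, removed when R closes
fwrite(toy, tmp)                   # write the table to disk as a CSV
toy_back <- fread(tmp)             # read it back in as a data.table
print(toy_back)
```

In a real project you would write to your Output folder instead, for example fwrite(dat, "Output/my_results.csv"), where the file name is just a placeholder.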

Missing values

How many rows have missing yield values?

Code

nrow(dat[is.na(yld),])

[1] 45
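There are other ways to get the same count. The sketch below compares three of them on a small made-up table, so the numbers are not those of the wheat dataset: subsetting and counting rows (as above), data.table's special symbol .N, and base R's sum(is.na()).

```r
library(data.table)

# Made-up yields, two of which are missing
toy <- data.table(yld = c(530.3, NA, 1656.37, NA))

n_subset <- nrow(toy[is.na(yld)])  # subset the rows, then count them
n_dotN   <- toy[is.na(yld), .N]    # .N is the number of rows left after filtering in i
n_base   <- sum(is.na(toy$yld))    # base R: each TRUE counts as 1
print(c(n_subset, n_dotN, n_base))
```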

Using the tidyverse package this can also be written as

Code

dat[is.na(yld),] %>% 
  nrow(.)

[1] 45

The %>% symbol is called a pipe. It makes your code more readable since one can read from left to right and not from inside out as above.

Remove observations

Remove all of the observations with missing yield values from the dataset.

Code

dat <- dat[!is.na(yld),]
nrow(dat)

[1] 811
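An alternative worth knowing is na.omit(), which data.table extends with a cols argument so that you can restrict the check to particular columns. The sketch below uses a made-up table to show how the two approaches compare, and what happens if you forget to restrict the columns.

```r
library(data.table)

# Made-up values: one missing yield and one missing spray value
toy <- data.table(yld = c(530.3, NA, 1656.37), N_spray = c(0, 0, NA))

by_filter <- toy[!is.na(yld)]             # same approach as above
by_naomit <- na.omit(toy, cols = "yld")   # only looks at the yld column
all_cols  <- na.omit(toy)                 # drops rows with an NA in ANY column

print(nrow(by_filter))
print(nrow(by_naomit))
print(nrow(all_cols))
```

The first two keep the same rows; the last one also drops the row where only N_spray is missing, which may or may not be what you want.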

New variables

Create a new variable for the total nitrogen applied and call it N_tot.

Code

dat[, N_tot := N_plant + N_tdress + N_spray]
head(dat)

   year   local rep plot     yld N_plant N_tdress N_spray N_tot
1: 2016 Darling   1    1  530.30       0        0       0     0
2: 2016 Darling   1    2 1157.89      25       25       0    50
3: 2016 Darling   1    3 1656.37      25      105       0   130
4: 2016 Darling   1    4  946.80      25       50       0    75
5: 2016 Darling   1    5 3059.27      25      135       0   160
6: 2016 Darling   1    6 1630.23      25      165       0   190
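Note that the := operator changes the table in place ("by reference"), which is why there is no dat <- in front of the line above. A consequence that catches many people out is that assigning a data.table to a second name does not make a copy. The sketch below illustrates this with made-up values; use copy() when you want an independent table.

```r
library(data.table)

# Made-up nitrogen applications for two plots
toy <- data.table(N_plant = c(0L, 25L), N_tdress = c(0L, 105L), N_spray = c(0L, 0L))

toy[, N_tot := N_plant + N_tdress + N_spray]  # adds the column in place
print(toy)

alias <- toy            # NOT a copy: both names point to the same table
alias[, flag := TRUE]
print(names(toy))       # toy now has the flag column too

safe <- copy(toy)       # an independent copy
safe[, extra := 1]
print(names(toy))       # toy is unaffected by changes to safe
```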

Subsetting

What was the maximum yield per trial per location per year?

Code

dat[, .(max_yld = max(yld)), by = .(year,local)]

    year       local max_yld
 1: 2016     Darling 3495.45
 2: 2016  Langgewens 4339.37
 3: 2016 Porterville 6173.99
 4: 2016  Riversdale 5144.64
 5: 2016   Tygerhoek 5137.83
 6: 2017     Darling 1257.44
 7: 2017  Langgewens 2928.22
 8: 2017 Porterville  884.22
 9: 2017  Riversdale 1699.36
10: 2017   Tygerhoek 4042.33
11: 2018     Darling 4095.51
12: 2018  Langgewens 6033.33
13: 2018 Porterville 5950.75
14: 2018  Riversdale 1947.25
15: 2018   Tygerhoek 6394.42
16: 2019  Langgewens 3722.00
17: 2019 Porterville 3536.00
18: 2019  Riversdale 2964.00
19: 2019   Tygerhoek 3647.00
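You are not limited to one summary per call: j can return several columns at once. The sketch below, on a made-up table that mimics the year, local and yld columns, calculates the number of plots, the average yield and the maximum yield per group in a single step.

```r
library(data.table)

# Made-up values for illustration only; not taken from wheat_student_data
toy <- data.table(
  year  = c(2016, 2016, 2016, 2017, 2017),
  local = c("A", "A", "B", "A", "A"),
  yld   = c(1000, 3000, 2000, 500, 1500)
)

summ <- toy[, .(n_plots = .N,          # number of rows in each group
                av_yld  = mean(yld),   # average yield in each group
                max_yld = max(yld)),   # maximum yield in each group
            by = .(year, local)]
print(summ)
```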

What was the maximum yield per trial per location per year per replicate?

Code

dat[, .(max_yld = max(yld)), by = .(year,local,rep)]

    year       local rep max_yld
 1: 2016     Darling   1 3059.27
 2: 2016     Darling   2 2161.19
 3: 2016     Darling   3 2776.90
 4: 2016     Darling   4 3495.45
 5: 2016  Langgewens   1 4339.37
 6: 2016  Langgewens   2 4285.36
 7: 2016  Langgewens   3 4035.78
 8: 2016  Langgewens   4 4122.39
 9: 2016 Porterville   1 4005.54
10: 2016 Porterville   2 5463.10
11: 2016 Porterville   3 6173.99
12: 2016 Porterville   4 3676.30
13: 2016  Riversdale   1 5144.64
14: 2016   Tygerhoek   1 4783.65
15: 2016   Tygerhoek   2 4027.99
16: 2016   Tygerhoek   3 5137.83
17: 2016   Tygerhoek   4 5120.12
18: 2017     Darling   1 1257.44
19: 2017     Darling   2 1001.55
20: 2017     Darling   3  744.73
21: 2017     Darling   4  936.18
22: 2017  Langgewens   1 2331.78
23: 2017  Langgewens   2 2852.44
24: 2017  Langgewens   3 2928.22
25: 2017  Langgewens   4 2607.11
26: 2017 Porterville   1  747.33
27: 2017 Porterville   2  722.89
28: 2017 Porterville   3  800.00
29: 2017 Porterville   4  884.22
30: 2017  Riversdale   1 1699.36
31: 2017   Tygerhoek   1 3961.47
32: 2017   Tygerhoek   2 4042.33
33: 2017   Tygerhoek   3 3798.94
34: 2017   Tygerhoek   4 3513.23
35: 2018     Darling   1 3998.78
36: 2018     Darling   2 3838.23
37: 2018     Darling   3 3552.24
38: 2018     Darling   4 4095.51
39: 2018  Langgewens   1 6033.33
40: 2018  Langgewens   2 5805.00
41: 2018  Langgewens   3 5462.50
42: 2018  Langgewens   4 4892.50
43: 2018 Porterville   1 5415.37
44: 2018 Porterville   2 5950.75
45: 2018 Porterville   3 5503.27
46: 2018 Porterville   4 5490.20
47: 2018  Riversdale   1 1947.25
48: 2018   Tygerhoek   1 5977.55
49: 2018   Tygerhoek   2 6108.03
50: 2018   Tygerhoek   3 5138.50
51: 2018   Tygerhoek   4 6394.42
52: 2019  Langgewens   1 2775.00
53: 2019  Langgewens   2 3090.00
54: 2019  Langgewens   3 3037.00
55: 2019  Langgewens   4 3722.00
56: 2019 Porterville   1 2895.00
57: 2019 Porterville   2 3280.00
58: 2019 Porterville   3 3536.00
59: 2019 Porterville   4 3216.00
60: 2019  Riversdale   1 2747.00
61: 2019  Riversdale   2 1961.00
62: 2019  Riversdale   3 2964.00
63: 2019   Tygerhoek   1 3021.00
64: 2019   Tygerhoek   2 3139.00
65: 2019   Tygerhoek   3 3643.00
66: 2019   Tygerhoek   4 3647.00
    year       local rep max_yld

What was the average yield per trial per location per year?

Code

dat[, .(av_yld = mean(yld)), by = .(year,local)]

    year       local    av_yld
 1: 2016     Darling 1965.3425
 2: 2016  Langgewens 3514.3025
 3: 2016 Porterville 2659.0466
 4: 2016  Riversdale 3962.3525
 5: 2016   Tygerhoek 3949.7209
 6: 2017     Darling  730.5494
 7: 2017  Langgewens 2147.9587
 8: 2017 Porterville  589.5056
 9: 2017  Riversdale 1385.5400
10: 2017   Tygerhoek 3353.2075
11: 2018     Darling 3222.7600
12: 2018  Langgewens 3538.7680
13: 2018 Porterville 4561.2541
14: 2018  Riversdale 1270.5187
15: 2018   Tygerhoek 4744.1777
16: 2019  Langgewens 2629.7188
17: 2019 Porterville 2419.0625
18: 2019  Riversdale 2038.9130
19: 2019   Tygerhoek 2796.8125
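With the j syntax in hand, we can return to the question posed earlier: how many unique years, locations and replicates are in the dataset? One possible approach is data.table's uniqueN() function, sketched below on a made-up table; on the real data you would replace toy with dat.

```r
library(data.table)

# Made-up values for illustration only; not taken from wheat_student_data
toy <- data.table(
  year  = c(2016L, 2016L, 2017L, 2017L),
  local = c("A", "B", "A", "A"),
  rep   = c(1L, 2L, 1L, 2L)
)

# Count the distinct values in each column
counts <- toy[, .(n_years  = uniqueN(year),
                  n_locals = uniqueN(local),
                  n_reps   = uniqueN(rep))]
print(counts)

# To see the values themselves rather than the count
print(unique(toy$local))
```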

Assignment 1

Create a normalised yield index variable and call it yld_norm. It must be normalised for the maximum yield by year, location and replicate.