Now that you have done the DataCamp courses, it is time for you to apply your new skills to a real dataset. However, before we get there, we have to lay down some basics of using R on your computer.
Getting started in R and RStudio
Download and install
The first step is to download R from the CRAN repository. I suggest you do not install the latest version but opt for a somewhat earlier one, given that it tends to be more stable. You can download it here; ensure that you select the correct version for your operating system.
Now you can download RStudio. While some people prefer to use base R (installed above), most opt to use RStudio as an integrated development environment for base R. Thus, RStudio is not R but rather a different way to interact with R. This is also the reason why R has to be installed before RStudio. RStudio is available here. Again, be sure to install the correct version for your operating system.
Project folders and file
Go to the Teams site and download “z_blank_folder” from the Class Materials folder. Move the downloaded folder to a separate location on your computer for safekeeping, since this folder will be the basis of all your future R projects. Move a copy of “z_blank_folder” to your folder for this course and unzip it. Rename the unzipped folder appropriately and open it. Note that it contains five folders: “Data”, “Figures”, “Output”, “Scripts”, and “Supplementary Materials”. These will help keep your project folder organised when you import data or write results. All of your scripts should be saved to your “Scripts” folder, and supplementary materials could include relevant articles etc. that pertain to the project. The resulting layout is sketched below.
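Assuming you renamed the folder “R_simmulation” as in the next step, the layout looks like this:

R_simmulation/
├── Data/
├── Figures/
├── Output/
├── Scripts/
└── Supplementary Materials/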
Now you have to create your project file. It is essential that your project file is in your overarching project folder, in other words, your “R_simmulation” folder (or whatever you renamed it to). If not, you will not be able to import your data etc., since your working directory would not be correct (more info). To create your project file, open RStudio, go to File -> New project…, and select “Existing directory”. After that, click on “Browse”, find your “R_simmulation” folder (or whatever you renamed it to), select it, and click on “Create project”. If you succeeded, the top right corner of RStudio should show your “R_simmulation” folder (or whatever you renamed it to); see Figure 1.
Figure 1: Check to see if you are working in the right project
In addition, the project folders should show in the bottom right corner if you created the project file in the right place:
Figure 2: Check to see if your project folders are there
Installing and loading packages
Now that you’ve created your project folder and project file (check the top right corner of RStudio to confirm you’re working in it), we can get started by installing the packages required by the analysis using the command install.packages("Package_Name"). In R, we regularly use “packages”, which are precoded functions developed and shared by other R users. For this part of the work, we will use a few packages, such as data.table and tidyverse.
You only have to install them once, but some of them will have to be updated from time to time by re-installing them. After that, we can load the required packages into memory. Basically, we purchased the tools in the previous step and now we have to put them on the table. This can be done using the require("Package_Name") function. However, before we get to this, some coding best practices.
Task: Install the data.table package by typing install.packages("data.table") into the Console and hitting Enter
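Since the script below also loads the tidyverse, it is worth installing it at the same time. A minimal sketch of the one-time installs (run in the Console, not in a script):

Code
install.packages("data.table") # fast data manipulation
install.packages("tidyverse")  # also loaded in the script below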
Scripts and coding best practice
Now, please create a new script by clicking on the white page with the green plus sign in the top right corner of your screen and save it in your Scripts folder by hitting the save button with the single disk.
Figure 3: Create a script
Scripts are lines of commands that RStudio essentially pastes into the Console and runs sequentially when you hit the Run button on the right-hand side of the window (highlighted in yellow above). Scripts ensure that all of your operations are replicable, and unlike Excel, they document all of your data manipulation and analysis steps.
I start all of my scripts with the lines of code below since it helps me to keep track of all scripts etc.
Note that there are five colours of text in the code, which indicate different types of commands. R will ignore any line of code that starts with a #; hence it is typically used for commenting scripts and creating section dividers such as this: #*#*#*#*#*#*#*#*#*#*#*#. Also note that you can create a section using a three-hash sandwich like this: ### some text ###. A shortcut is to hit Ctrl + Shift + R; this will bring up a window wherein you can name a different kind of section.
Code
# Client: Jan C Greyling
# Project: My first coding project
# Script 1: Working with actual data
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*

remove(list=ls())     # clear all from memory
require("data.table") # loading the package called "data.table"
require("tidyverse")  # loading the tidyverse collection of packages
Note the line of code remove(list=ls()); I use this at the top of all of my scripts. It simply deletes all of the items in your Global Environment, essentially wiping the table clean before you start your analysis.
The next lines of code (require("data.table") and require("tidyverse")) load the packages that you need for this analysis, as discussed above.
Working with real data
Reading data
Go to the course Teams page, download the dataset called wheat_student_data and move it to the Data folder of your project folder. Then load the data and call it dat. If you succeed, it will appear in your Global Environment. We can look at the data in several ways. If it is a small dataset, you can simply click on it, after which it will open in a new tab called “dat”. However, this does not work well if you are working with large datasets. Then it is better to use the head() or tail() functions, which display the top and bottom observations in the dataset. You can also look at the structure of the data using the str() function.
Code
dat <- fread("Data/wheat_student_data.csv")
Looking at the data
You can get an idea of the dataset using the following functions:
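For example (the exact output depends on your data, so this is just a sketch):

Code
head(dat) # first six rows
tail(dat) # last six rows
str(dat)  # structure: variable names, types, and a preview of the values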
You can also open the dataset using the View() function.
Code
View(dat)
What does it mean?
Ok, so having the dataset loaded is one thing, but what does it mean? In other words, what are the variables in the dataset? Let’s print the names of all the variables.
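One way to do this is with the base R function names():

Code
names(dat) # prints the column names of the data.table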
This data was collected through a trial designed to establish the relationship between wheat yield and nitrogen applied. The trial was conducted over multiple years (year) and multiple locations (local). At each location there are multiple replicates, and plots within each replicate, hence the variables rep and plot. The variable yld shows the wheat yield of the plot. N_plant, N_tdress and N_spray show different nitrogen applications: at planting, as top dressing and sprayed, respectively.
Now that you’ve loaded the data and have a better understanding of its structure, how many unique years, locals, and replicates are in the dataset? Before we answer this, we have to lay a bit of foundation regarding the data.table package.
data.table package
Basics
data.table is an R package that provides an enhanced version of data.frames, which are the standard data structure for storing data in base R. In the section above, we already created a data.table using fread(). We can also create one using the data.table() function. Here is an example:
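A minimal sketch of a call that would produce the table printed below (the values are taken directly from that output):

Code
DT <- data.table(
  ID = c("b", "b", "b", "a", "a", "c"),
  a  = 1:6,
  b  = 7:12,
  c  = 13:18
)
DT # print the new data.table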
   ID a  b  c
1:  b 1  7 13
2:  b 2  8 14
3:  b 3  9 15
4:  a 4 10 16
5:  a 5 11 17
6:  c 6 12 18
You can also convert existing objects to a data.table using setDT() (for data.frames and lists) and as.data.table() (for other structures).
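A minimal sketch, assuming a small hypothetical data.frame df:

Code
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
setDT(df) # converts df to a data.table in place, by reference (no copy)
class(df) # "data.table" "data.frame"

Note that setDT() modifies its argument by reference, whereas as.data.table() returns a converted copy and leaves the original object untouched.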
In contrast to a data.frame, you can do a lot more than just subsetting rows and selecting columns within the frame of a data.table, i.e., within [ … ] (NB: we might also refer to writing things inside DT[…] as “querying DT”, in analogy to SQL). To understand this, we first have to look at the general form of the data.table syntax, as shown below:
Code
# DT[i, j, by]
##   R:                 i                 j        by
## SQL:  where | order by   select | update   group by
The way to read it (out loud) is:
Take DT, subset/reorder rows using i, then calculate j, grouped by by.
Let’s begin by looking at i and j using our dataset of yield trials.
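For example, using our trial data (the filter value is purely illustrative, and na.rm = TRUE guards against the missing yields we deal with below):

Code
dat[yld > 3]                              # i: subset rows where yield exceeds 3
dat[, mean(yld, na.rm = TRUE)]            # j: calculate mean yield over all rows
dat[, mean(yld, na.rm = TRUE), by = year] # by: mean yield per year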
Advantages
You should use data.table because:
It provides blazing-fast speed when it comes to loading data. With the fread() function in the data.table package, loading large datasets takes just a few seconds.
It is even faster than the popular dplyr and plyr packages used for data manipulation. data.table provides ample room for tasks such as aggregating, filtering, merging, grouping and other related operations.
It is not just about reading files: writing files using data.table is much faster than write.csv(). The package provides the fwrite() function, which has parallelised fast-writing ability. So, next time you have to write 1 million rows, try this function (see the sketch after this list).
Built-in features such as automatic indexing, rolling joins and overlapping range joins further enhance the user experience while working on large datasets.
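A minimal sketch of fwrite() (the file name is an assumption; the Output folder comes from the project template above):

Code
fwrite(dat, "Output/wheat_student_data_copy.csv") # hypothetical output file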
Missing values
How many rows have missing yield values?
Code
nrow(dat[is.na(yld),])
[1] 45
Using the tidyverse package, this can also be written as:
Code
dat[is.na(yld),] %>% nrow(.)
[1] 45
The %>% symbol is called a pipe. It makes your code more readable, since one can read from left to right rather than from the inside out, as above.

Remove observations
Remove all of the observations from the dataset with missing values.
Code
dat <- dat[!is.na(yld),]
nrow(dat)
[1] 811
New variables
Create a new variable for the total nitrogen applied and call it N_tot.
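A minimal sketch using data.table’s := operator, assuming N_tot is simply the sum of the three application variables described above:

Code
dat[, N_tot := N_plant + N_tdress + N_spray]      # add the new column by reference
head(dat[, .(N_plant, N_tdress, N_spray, N_tot)]) # check the result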