Getting started

Now that you’ve done the DataCamp courses, it is time to apply your new skills to a real dataset. However, before we get there, we have to lay down a couple of basics of using R on your computer.

Getting started in R and RStudio

Download and install

The first step is to download R from the CRAN repository. I suggest that you do not install the latest version but opt for a somewhat earlier version, given that it is likely to be more stable. You can download it here; be sure to select the correct version for your operating system.

Now you can download RStudio. While some people prefer to use base R (installed above), most opt to use RStudio as an integrated development environment for R. RStudio is therefore not R itself but rather a different way to interact with R, which is also why R has to be installed before RStudio. RStudio is available here. Again, be sure to install the correct version for your operating system.

Project folders and file

Go to SUNLearn and download the “z_blank_folder” in the production economics section. Move the downloaded folder to a different folder on your computer for safekeeping, since this folder will be the basis of all of your future R projects. Move a copy of “z_blank_folder” to your folder for this course and unzip it. Rename the unzipped folder appropriately and open it. Note that it contains five folders: “Data”, “Figures”, “Output”, “Scripts” and “Supplementary Materials”. These will help you to keep your project folder organised when you import data into your model or write out results. All of your scripts should be saved to your “Scripts” folder, and supplementary materials could include relevant articles etc. that pertain to the project.

Now you have to create your project file. It is essential that your project file is in your overarching project folder, in other words, your “R_production_econ” folder (or whatever you renamed it to). If it is not, you will not be able to import your data etc., since your working directory will not be correct. To create your project file, open RStudio, go to File -> New project… and select “Existing directory”. Thereafter click on “Browse”, find your “R_production_econ” folder (or whatever you renamed it to), select it and click on “Create project”. If you succeeded, the top right corner of RStudio should show your “R_production_econ” folder (or whatever you renamed it to), see Figure 1.

Figure 1: Check to see if you’re working in the right project

In addition, the project folders should show in the bottom right corner if you created the project file in the right place:

Figure 2: Check to see if your project folders are there
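The same check can also be done from the Console. The lines below are only a minimal sketch: they print the working directory and test whether the five folders named earlier are visible from it. If the project file sits in the right place, all five should come back as TRUE.

```r
# Show the folder R is currently working in; inside an RStudio project
# this should be the project folder itself.
print(getwd())

# Folder names as listed in the text above.
expected <- c("Data", "Figures", "Output", "Scripts", "Supplementary Materials")

# TRUE means the folder was found in the working directory.
found <- dir.exists(expected)
names(found) <- expected
print(found)
```

A FALSE next to any folder usually means that RStudio is not working in the project folder, or that the folder was renamed.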

Installing and loading packages

Now that you’ve created your project folder and project file (check the top right corner of RStudio to confirm that you’re working in it), we can get started by installing the packages required by the analysis using the command install.packages("Package_Name"). In R we regularly make use of “packages”. These are collections of precoded functions that were developed and shared by other R users. For this part of the work we will use the data.table package.

You only have to install packages once, but some of them will have to be updated from time to time by re-installing them. Thereafter we can load the required packages into memory. Basically, we purchased the tools in the previous step and now we have to put them on the table. This can be done using the require("Package_Name") function. However, before we get to this, some coding best practice.

Task: Install the data.table package by typing install.packages("data.table") into the Console and pressing Enter.
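If you would rather not re-install the package every time, one possible pattern is to install it only when it is missing. This is a sketch rather than a required step; the CRAN mirror URL is an assumption (any mirror will do), and library() is used here in place of require() purely as a matter of preference.

```r
# Install data.table only when it is not installed yet.
# Assumption: the cloud mirror of CRAN is used; any CRAN mirror works.
if (!requireNamespace("data.table", quietly = TRUE)) {
  install.packages("data.table", repos = "https://cloud.r-project.org")
}

# Load the package for the current session.
library(data.table)
```

The practical difference is that library() stops with an error when the package is missing, whereas require() only gives a warning and returns FALSE, so a script can carry on and fail later in a less obvious place.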

Scripts and coding best practice

Now create a new script by clicking on the white page with the green plus sign in the top left corner of the RStudio window, and save it in your Scripts folder by clicking the save button with the single disk.

Figure 3: Create a script

Scripts are lines of commands that RStudio essentially pastes into the Console and runs sequentially when you hit the Run button on the right-hand side of the window (highlighted in yellow above). Scripts ensure that all of your operations are replicable and, unlike Excel, they document all of your data manipulation and analysis steps.

I start all of my scripts with the lines of code below since it helps me to keep track of all scripts etc.

Note that there are five colours of text in the code, which indicate different types of commands. R will ignore any line of code that starts with a #, hence it is typically used for in-script commenting and for creating section dividers such as #*#*#*#*#*#*#*#*#*#*#*#. Also note that you can create a section using a three-hash sandwich like this: ### some text ###. A shortcut is to press Ctrl + Shift + R, which brings up a window in which you can name a different kind of section.

# Client: AE 775 / 895
# Project: Production economics in R!
# Script 1: Working with real data

#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*

remove(list=ls()) #clear all from memory

require("data.table") # loading the package called "data.table"

## Loading required package: data.table

## Warning: package 'data.table' was built under R version 3.6.2

Note the line of code remove(list=ls()), which I use at the top of all of my scripts. It simply deletes all of the items in your Global Environment, essentially wiping the table clean before you start your analysis.

The next line of code (require("data.table")) loads the data.table package that you need for this analysis, as discussed above.

Working with real data

Go to the 775/895 SUNLearn page, download the dataset called wheat_student_data and move it to the Data folder of your project folder. Then load the data and call it dat. If you succeeded, it will appear in your Global Environment. We can have a look at the data in a number of ways. If it is a small dataset, you can simply click on it, whereafter it will open in a new tab called “dat”. However, this does not work well if you’re working with large datasets. In that case it is better to use the head() and tail() functions, which display the top and bottom observations in the dataset respectively. You can also look at the structure of the data using the str() function.

##    year   local rep plot     yld N_plant N_tdress N_spray
## 1: 2016 Darling   1    1  530.30       0        0       0
## 2: 2016 Darling   1    2 1157.89      25       25       0
## 3: 2016 Darling   1    3 1656.37      25      105       0
## 4: 2016 Darling   1    4  946.80      25       50       0
## 5: 2016 Darling   1    5 3059.27      25      135       0
## 6: 2016 Darling   1    6 1630.23      25      165       0

##    year     local rep plot  yld N_plant N_tdress N_spray
## 1: 2019 Tygerhoek   4   29 3605      25       25       0
## 2: 2019 Tygerhoek   4   28 3647      25       50       0
## 3: 2019 Tygerhoek   4   26 3434      25       75       0
## 4: 2019 Tygerhoek   4   27 2762      25      105       0
## 5: 2019 Tygerhoek   4   30 3212      25      135       0
## 6: 2019 Tygerhoek   4   31 3582      25      165       0

## Classes 'data.table' and 'data.frame':   856 obs. of  8 variables:
##  $ year    : int  2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
##  $ local   : chr  "Darling" "Darling" "Darling" "Darling" ...
##  $ rep     : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ plot    : int  1 2 3 4 5 6 7 8 10 11 ...
##  $ yld     : num  530 1158 1656 947 3059 ...
##  $ N_plant : int  0 25 25 25 25 25 25 25 25 25 ...
##  $ N_tdress: int  0 25 105 50 135 165 0 75 75 135 ...
##  $ N_spray : int  0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, ".internal.selfref")=<externalptr>
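The three outputs above come from head(), tail() and str(). As a minimal sketch of how these calls look, the block below rebuilds the first six rows shown above as a small stand-in table. The commented-out fread() line shows one possible way of reading the real file; the ".csv" extension is an assumption, so adjust it to the file you actually downloaded.

```r
library(data.table)

# With the real file you would read it from the Data folder, for example:
# dat <- fread("Data/wheat_student_data.csv")  # ".csv" is an assumption

# Toy stand-in: the first six rows shown in the output above.
dat <- data.table(
  year     = 2016L,
  local    = "Darling",
  rep      = 1L,
  plot     = 1:6,
  yld      = c(530.30, 1157.89, 1656.37, 946.80, 3059.27, 1630.23),
  N_plant  = c(0L, 25L, 25L, 25L, 25L, 25L),
  N_tdress = c(0L, 25L, 105L, 50L, 135L, 165L),
  N_spray  = 0L
)

head(dat, 3)  # first three rows
tail(dat, 3)  # last three rows
str(dat)      # column names, types and a preview of the values
```

Both head() and tail() show six rows by default; the second argument changes that number.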

Now that you’ve loaded the data and have a better understanding of its structure: how many unique years, locations (local) and replicates (rep) are in the dataset?

How many rows have missing yield values?
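One way to approach a question like this is with unique() and data.table's uniqueN(). The sketch below uses a four-row toy table built from rows in the head() and tail() output, not the full dataset, so the counts it prints are not the answer to the question.

```r
library(data.table)

# Toy stand-in built from rows shown in the head() and tail() output above.
toy <- data.table(
  year  = c(2016L, 2016L, 2019L, 2019L),
  local = c("Darling", "Darling", "Tygerhoek", "Tygerhoek"),
  rep   = c(1L, 1L, 4L, 4L),
  yld   = c(530.30, 1157.89, 3605, 3647)
)

unique(toy$year)   # the distinct years themselves
uniqueN(toy$year)  # how many distinct years there are

# Count the distinct values in several columns at once.
counts <- toy[, lapply(.SD, uniqueN), .SDcols = c("year", "local", "rep")]
print(counts)
```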

## [1] 45

Remove all of the observations from the dataset with missing values.

## [1] 811
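Counting and dropping missing values can be sketched as follows. The toy table and its NA values are invented purely for illustration; with the real data you would apply the same calls to dat.

```r
library(data.table)

# Toy stand-in; the NA values are invented for illustration.
toy <- data.table(
  plot = 1:5,
  yld  = c(530.30, NA, 1656.37, NA, 3059.27)
)

# is.na() returns TRUE/FALSE per row and sum() counts the TRUEs.
n_missing <- sum(is.na(toy$yld))
print(n_missing)

# Keep only the rows where yld is not missing.
toy_clean <- toy[!is.na(yld)]
print(nrow(toy_clean))
```

An alternative is na.omit(toy), which drops every row that has a missing value in any column rather than in yld only.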

Create a new variable for the total nitrogen applied and call it N_tot.

##    year   local rep plot     yld N_plant N_tdress N_spray N_tot
## 1: 2016 Darling   1    1  530.30       0        0       0     0
## 2: 2016 Darling   1    2 1157.89      25       25       0    50
## 3: 2016 Darling   1    3 1656.37      25      105       0   130
## 4: 2016 Darling   1    4  946.80      25       50       0    75
## 5: 2016 Darling   1    5 3059.27      25      135       0   160
## 6: 2016 Darling   1    6 1630.23      25      165       0   190
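In data.table, new columns are typically added with the := operator. The sketch below assumes that total nitrogen is the sum of the three N_ columns, which is consistent with the output above (for example 25 + 25 + 0 = 50), and uses the first three rows as toy data.

```r
library(data.table)

# Toy stand-in: nitrogen columns from the first rows of the output above.
toy <- data.table(
  N_plant  = c(0L, 25L, 25L),
  N_tdress = c(0L, 25L, 105L),
  N_spray  = c(0L, 0L, 0L)
)

# := adds the column to the existing table by reference,
# so there is no need to assign the result back to toy.
toy[, N_tot := N_plant + N_tdress + N_spray]
print(toy)
```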

What was the maximum yield per trial per location per year?

##     year       local max_yld
##  1: 2016     Darling 3495.45
##  2: 2016  Langgewens 4339.37
##  3: 2016 Porterville 6173.99
##  4: 2016  Riversdale 5144.64
##  5: 2016   Tygerhoek 5137.83
##  6: 2017     Darling 1257.44
##  7: 2017  Langgewens 2928.22
##  8: 2017 Porterville  884.22
##  9: 2017  Riversdale 1699.36
## 10: 2017   Tygerhoek 4042.33
## 11: 2018     Darling 4095.51
## 12: 2018  Langgewens 6033.33
## 13: 2018 Porterville 5950.75
## 14: 2018  Riversdale 1947.25
## 15: 2018   Tygerhoek 6394.42
## 16: 2019  Langgewens 3722.00
## 17: 2019 Porterville 3536.00
## 18: 2019  Riversdale 2964.00
## 19: 2019   Tygerhoek 3647.00
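Summaries per group such as the one above can be produced with the by argument of data.table. The following is a sketch on a six-row toy table taken from rows in the earlier outputs, so its results differ from the full table above.

```r
library(data.table)

# Toy stand-in built from rows shown in the head() and tail() output.
toy <- data.table(
  year  = c(2016L, 2016L, 2016L, 2019L, 2019L, 2019L),
  local = c("Darling", "Darling", "Darling", "Tygerhoek", "Tygerhoek", "Tygerhoek"),
  rep   = c(1L, 1L, 1L, 4L, 4L, 4L),
  yld   = c(530.30, 1157.89, 3059.27, 3605, 3647, 3434)
)

# One row per year and location, holding the largest yield in that group.
max_by_site <- toy[, .(max_yld = max(yld)), by = .(year, local)]
print(max_by_site)
```

Adding rep to the by list gives one row per replicate, and swapping max() for mean() gives an average instead of a maximum.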

What was the maximum yield per trial per location per year per replicate?

##     year       local rep max_yld
##  1: 2016     Darling   1 3059.27
##  2: 2016     Darling   2 2161.19
##  3: 2016     Darling   3 2776.90
##  4: 2016     Darling   4 3495.45
##  5: 2016  Langgewens   1 4339.37
##  6: 2016  Langgewens   2 4285.36
##  7: 2016  Langgewens   3 4035.78
##  8: 2016  Langgewens   4 4122.39
##  9: 2016 Porterville   1 4005.54
## 10: 2016 Porterville   2 5463.10
## 11: 2016 Porterville   3 6173.99
## 12: 2016 Porterville   4 3676.30
## 13: 2016  Riversdale   1 5144.64
## 14: 2016   Tygerhoek   1 4783.65
## 15: 2016   Tygerhoek   2 4027.99
## 16: 2016   Tygerhoek   3 5137.83
## 17: 2016   Tygerhoek   4 5120.12
## 18: 2017     Darling   1 1257.44
## 19: 2017     Darling   2 1001.55
## 20: 2017     Darling   3  744.73
## 21: 2017     Darling   4  936.18
## 22: 2017  Langgewens   1 2331.78
## 23: 2017  Langgewens   2 2852.44
## 24: 2017  Langgewens   3 2928.22
## 25: 2017  Langgewens   4 2607.11
## 26: 2017 Porterville   1  747.33
## 27: 2017 Porterville   2  722.89
## 28: 2017 Porterville   3  800.00
## 29: 2017 Porterville   4  884.22
## 30: 2017  Riversdale   1 1699.36
## 31: 2017   Tygerhoek   1 3961.47
## 32: 2017   Tygerhoek   2 4042.33
## 33: 2017   Tygerhoek   3 3798.94
## 34: 2017   Tygerhoek   4 3513.23
## 35: 2018     Darling   1 3998.78
## 36: 2018     Darling   2 3838.23
## 37: 2018     Darling   3 3552.24
## 38: 2018     Darling   4 4095.51
## 39: 2018  Langgewens   1 6033.33
## 40: 2018  Langgewens   2 5805.00
## 41: 2018  Langgewens   3 5462.50
## 42: 2018  Langgewens   4 4892.50
## 43: 2018 Porterville   1 5415.37
## 44: 2018 Porterville   2 5950.75
## 45: 2018 Porterville   3 5503.27
## 46: 2018 Porterville   4 5490.20
## 47: 2018  Riversdale   1 1947.25
## 48: 2018   Tygerhoek   1 5977.55
## 49: 2018   Tygerhoek   2 6108.03
## 50: 2018   Tygerhoek   3 5138.50
## 51: 2018   Tygerhoek   4 6394.42
## 52: 2019  Langgewens   1 2775.00
## 53: 2019  Langgewens   2 3090.00
## 54: 2019  Langgewens   3 3037.00
## 55: 2019  Langgewens   4 3722.00
## 56: 2019 Porterville   1 2895.00
## 57: 2019 Porterville   2 3280.00
## 58: 2019 Porterville   3 3536.00
## 59: 2019 Porterville   4 3216.00
## 60: 2019  Riversdale   1 2747.00
## 61: 2019  Riversdale   2 1961.00
## 62: 2019  Riversdale   3 2964.00
## 63: 2019   Tygerhoek   1 3021.00
## 64: 2019   Tygerhoek   2 3139.00
## 65: 2019   Tygerhoek   3 3643.00
## 66: 2019   Tygerhoek   4 3647.00
##     year       local rep max_yld

What was the average yield per trial per location per year?

##     year       local    av_yld
##  1: 2016     Darling 1965.3425
##  2: 2016  Langgewens 3514.3025
##  3: 2016 Porterville 2659.0466
##  4: 2016  Riversdale 3962.3525
##  5: 2016   Tygerhoek 3949.7209
##  6: 2017     Darling  730.5494
##  7: 2017  Langgewens 2147.9587
##  8: 2017 Porterville  589.5056
##  9: 2017  Riversdale 1385.5400
## 10: 2017   Tygerhoek 3353.2075
## 11: 2018     Darling 3222.7600
## 12: 2018  Langgewens 3538.7680
## 13: 2018 Porterville 4561.2541
## 14: 2018  Riversdale 1270.5187
## 15: 2018   Tygerhoek 4744.1777
## 16: 2019  Langgewens 2629.7188
## 17: 2019 Porterville 2419.0625
## 18: 2019  Riversdale 2038.9130
## 19: 2019   Tygerhoek 2796.8125

Create a normalised yield index variable and call it yld_norm. It must be normalised by the maximum yield per year, location and replicate.
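One possible reading of this task is that each yield is divided by the largest yield in its year, location and replicate group, so that the best plot in each group gets an index of 1. That interpretation is an assumption; the sketch below shows it on toy data, combining := with by.

```r
library(data.table)

# Toy stand-in built from rows shown in the head() and tail() output.
toy <- data.table(
  year  = c(2016L, 2016L, 2016L, 2019L, 2019L, 2019L),
  local = c("Darling", "Darling", "Darling", "Tygerhoek", "Tygerhoek", "Tygerhoek"),
  rep   = c(1L, 1L, 1L, 4L, 4L, 4L),
  yld   = c(530.30, 1157.89, 3059.27, 3605, 3647, 3434)
)

# Assumption: the index is yld divided by the maximum yld within each group.
# max(yld) is computed once per group and recycled over that group's rows.
toy[, yld_norm := yld / max(yld), by = .(year, local, rep)]
print(toy)
```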
