Now that you’ve done the DataCamp courses, it is time to apply your new skills to a real dataset. However, before we get there we have to lay down a couple of basics of using R on your computer.
The first step is to download R from the CRAN repository. I suggest that you do not install the latest version but opt for a somewhat earlier one, since it is likely to be more stable. You can download it here; be sure to select the correct version for your operating system.
Now you can download RStudio. While some people prefer to use base R (installed above), most opt to use RStudio as an integrated development environment for R. RStudio is therefore not R itself but rather a different way to interact with R. This is also the reason why R has to be installed before RStudio. RStudio is available here. Again, be sure to install the correct version for your operating system.
Go to SUNLearn and download the “z_blank_folder” in the production economics section. Move the downloaded folder to a different folder on your computer for safekeeping, since this folder will be the basis of all of your future R projects. Move a copy of “z_blank_folder” to your folder for this course and unzip it. Rename the unzipped folder appropriately and open it. Note that it contains five folders: “Data”, “Figures”, “Output”, “Scripts” and “Supplementary Materials”. These will help you to keep your project folder organised when you import data into your model or write out results. All of your scripts should be saved to your “Scripts” folder, and supplementary materials could include relevant articles etc. that pertain to the project.
Now you have to create your project file. It is essential that your project file is in your overarching project folder, in other words, your “R_production_econ” folder (or whatever you renamed it to). If not, you will not be able to import your data etc., since your working directory would not be correct (more info). To create your project file, open RStudio, go to File -> New project… and select “Existing directory”. Thereafter click on “Browse”, find your “R_production_econ” folder (or whatever you renamed it to), select it and click on “Create project”. If you succeeded, the top right corner of RStudio should show the name of your “R_production_econ” folder (or whatever you renamed it to); see Figure 1.
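If you are unsure whether your working directory is correct, you can also check it from the Console. The lines below are a minimal sketch: they print the current working directory and test whether the five template folders named above are visible from it. The folder names are taken from the text; nothing here is specific to the course files.

```r
# Where is R currently looking for files? In a correctly created project
# this should be your overarching project folder.
wd <- getwd()
print(wd)

# The subfolders of the blank template (names as listed in the text above).
expected <- c("Data", "Figures", "Output", "Scripts", "Supplementary Materials")

# TRUE/FALSE per folder: can it be found inside the working directory?
found <- dir.exists(file.path(wd, expected))
names(found) <- expected
print(found)
```

If any folder shows FALSE, you are probably not working inside your project, or the project file was created in the wrong place.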
Figure 1: Check to see if you’re working in the right project
In addition, the project folders should show in the bottom right corner if you created the project file in the right place:
Figure 2: Check to see if your project folders are there
Now that you’ve created your project folder and project file (check the top right corner of RStudio to confirm that you’re working in it), we can get started by installing the packages required by the analysis using the command install.packages("Package_Name"). In R we regularly make use of “packages”. These are collections of precoded functions that were developed and shared by other R users. For this part of the work we will use the data.table package.
You only have to install packages once, but some of them will have to be updated from time to time by re-installing them. Thereafter we can load the required packages into memory. Basically, we purchased the tools in the previous step and now we have to put them on the table. This can be done using the require("Package_Name") function. However, before we get to this, some coding best practice.
Task: Install the data.table package by typing install.packages("data.table") into the Console and hitting Enter.
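Since a package only has to be installed once, a common pattern is to install it only when it is missing. The sketch below is one way of doing that and is not part of the task itself; the variable name pkg is just an illustration.

```r
pkg <- "data.table"

# requireNamespace() returns TRUE if the package is already installed,
# so install.packages() is only called when it is not.
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}
```

Running these lines a second time does nothing, which makes them safe to leave at the top of a script.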
Now, create a new script by clicking on the white page with the green plus sign in the top left corner of the RStudio window, and save it in your Scripts folder by hitting the save button with the single disk.
Figure 3: Create a script
Scripts are lines of commands that RStudio essentially pastes into the Console and runs sequentially when you hit the Run button on the right hand side of the window (highlighted in yellow above). Scripts ensure that all of your operations are replicable and, unlike Excel, they document all of your data manipulation and analysis steps.
I start all of my scripts with the lines of code below, since this helps me to keep track of my scripts.
Note that there are five colours of text in the code, which indicate different types of commands. R will ignore any line of code that starts with a #, hence it is typically used for in-script commenting and for creating section dividers such as this: #*#*#*#*#*#*#*#*#*#*#*#. Also, note that you can create a section using a three-hash sandwich like this: ### some text ###. A shortcut is to hit Ctrl + Shift + R together; this will bring up a window wherein you can name a different kind of section.
# Client: AE 775 / 895
# Project: Production economics in R!
# Script 1: Working with real data
#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*#*
remove(list=ls()) #clear all from memory
require("data.table") # loading the package called "data.table"
## Loading required package: data.table
## Warning: package 'data.table' was built under R version 3.6.2
Note the line of code remove(list=ls()); I use this at the top of all of my scripts. It simply deletes all of the items in your Global Environment, essentially wiping the table clean before you start your analysis.
The next line of code, require("data.table"), loads the data.table package that you need for this analysis, as discussed above.
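To see what these two lines do, here is a small sketch. It uses a scratch environment as a stand-in for the Global Environment so that running it does not wipe your own objects; the object name old_result and the package name notARealPackage123 are made up for the illustration.

```r
# A scratch environment stands in for the Global Environment here.
scratch <- new.env()
assign("old_result", 42, envir = scratch)
ls(envir = scratch)                      # lists "old_result"

# Same idea as remove(list = ls()) at the top of a script, but applied to
# the scratch environment only.
rm(list = ls(envir = scratch), envir = scratch)
n_left <- length(ls(envir = scratch))    # number of objects left
print(n_left)

# require() returns TRUE or FALSE depending on whether the package loaded.
# "notARealPackage123" is a made-up name, so this returns FALSE.
loaded <- suppressWarnings(
  require("notARealPackage123", character.only = TRUE, quietly = TRUE)
)
print(loaded)
```

Because require() only warns when a package is missing, a script can carry on and fail later with a confusing error. library("data.table") stops immediately with an error instead, which some people prefer for that reason.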
Go to the 775/895 SUNLearn page, download the dataset called wheat_student_data and move it to the Data folder of your project folder. Then load the data and call it dat. If you succeeded, it will appear in your Global Environment. We can have a look at the data in a number of ways. If it is a small dataset, you can simply click on it, whereafter it will open in a new tab called “dat”. However, this does not work well if you’re working with large datasets. Then it is better to use the head() or tail() functions, which display the top and bottom observations in the dataset. You can also look at the structure of the data using the str() function.
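The steps above could look roughly like the sketch below. It assumes that the dataset is a CSV file, which the text does not state, and it uses a few made-up rows written to a temporary file so that the example runs anywhere. In your project you would point fread() at the file in your Data folder instead.

```r
library(data.table)

# Made-up rows in a layout similar to the course data (NOT the real values).
toy <- data.table(
  year  = c(2016L, 2016L, 2017L, 2017L),
  local = c("Darling", "Darling", "Tygerhoek", "Tygerhoek"),
  rep   = c(1L, 2L, 1L, 2L),
  yld   = c(530.3, 1157.9, NA, 3605.0)
)

# Write the toy data to a temporary CSV file. In your project the path
# would be something like file.path("Data", "wheat_student_data.csv")
# (the .csv extension is an assumption).
csv_path <- tempfile(fileext = ".csv")
fwrite(toy, csv_path)

dat <- fread(csv_path)   # read the file into a data.table called dat

head(dat, 2)   # first two rows
tail(dat, 2)   # last two rows
str(dat)       # column names, types and a preview of the values
```

Because the path is relative to the working directory, this is also where a wrongly placed project file would show up as a "file not found" error.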
## year local rep plot yld N_plant N_tdress N_spray
## 1: 2016 Darling 1 1 530.30 0 0 0
## 2: 2016 Darling 1 2 1157.89 25 25 0
## 3: 2016 Darling 1 3 1656.37 25 105 0
## 4: 2016 Darling 1 4 946.80 25 50 0
## 5: 2016 Darling 1 5 3059.27 25 135 0
## 6: 2016 Darling 1 6 1630.23 25 165 0
## year local rep plot yld N_plant N_tdress N_spray
## 1: 2019 Tygerhoek 4 29 3605 25 25 0
## 2: 2019 Tygerhoek 4 28 3647 25 50 0
## 3: 2019 Tygerhoek 4 26 3434 25 75 0
## 4: 2019 Tygerhoek 4 27 2762 25 105 0
## 5: 2019 Tygerhoek 4 30 3212 25 135 0
## 6: 2019 Tygerhoek 4 31 3582 25 165 0
## Classes 'data.table' and 'data.frame': 856 obs. of 8 variables:
## $ year : int 2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...
## $ local : chr "Darling" "Darling" "Darling" "Darling" ...
## $ rep : int 1 1 1 1 1 1 1 1 2 2 ...
## $ plot : int 1 2 3 4 5 6 7 8 10 11 ...
## $ yld : num 530 1158 1656 947 3059 ...
## $ N_plant : int 0 25 25 25 25 25 25 25 25 25 ...
## $ N_tdress: int 0 25 105 50 135 165 0 75 75 135 ...
## $ N_spray : int 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, ".internal.selfref")=<externalptr>
Now that you’ve loaded the data and have a better understanding of its structure, how many unique years, locals and replicates are in the dataset?
How many rows have missing yield values?
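One way to count unique values is data.table’s uniqueN() function. The sketch below uses a made-up table with the same column names, so the counts are not those of the course data.

```r
library(data.table)

# Made-up data with the same column names as the course dataset.
dat <- data.table(
  year  = c(2016L, 2016L, 2017L, 2017L, 2017L),
  local = c("Darling", "Tygerhoek", "Darling", "Darling", "Tygerhoek"),
  rep   = c(1L, 1L, 1L, 2L, 1L)
)

unique(dat$local)    # the distinct values themselves
uniqueN(dat$local)   # how many distinct values there are

# All three counts at once: apply uniqueN() to each selected column.
counts <- dat[, lapply(.SD, uniqueN), .SDcols = c("year", "local", "rep")]
print(counts)
```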
## [1] 45
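A count like this can be obtained with is.na(), as in the sketch below. The yields are made up, so the result differs from the 45 shown above.

```r
library(data.table)

# Made-up yields, two of which are missing.
dat <- data.table(yld = c(530.3, NA, 1656.4, NA, 3059.3))

# is.na() gives TRUE for missing values; summing the TRUEs counts them.
n_missing <- sum(is.na(dat$yld))
print(n_missing)

# The same count in data.table syntax: filter the rows, then count with .N.
n_missing_dt <- dat[is.na(yld), .N]
```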
Remove all of the observations from the dataset with missing values.
## [1] 811
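Two possible ways of dropping incomplete rows are sketched below on made-up data: filtering on a single column, or using na.omit(), which drops a row if any of its columns is missing. Which one you want depends on where the missing values sit in your data.

```r
library(data.table)

# Made-up data with two missing yields.
dat <- data.table(
  plot = 1:5,
  yld  = c(530.3, NA, 1656.4, NA, 3059.3)
)

# Option 1: keep only rows where yld is not missing.
dat_yld <- dat[!is.na(yld)]

# Option 2: drop every row that has a missing value in any column.
dat_complete <- na.omit(dat)

nrow(dat_complete)   # number of observations left
```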
Create a new variable for the total Nitrogen applied and call it N_tot
## year local rep plot yld N_plant N_tdress N_spray N_tot
## 1: 2016 Darling 1 1 530.30 0 0 0 0
## 2: 2016 Darling 1 2 1157.89 25 25 0 50
## 3: 2016 Darling 1 3 1656.37 25 105 0 130
## 4: 2016 Darling 1 4 946.80 25 50 0 75
## 5: 2016 Darling 1 5 3059.27 25 135 0 160
## 6: 2016 Darling 1 6 1630.23 25 165 0 190
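A new column can be added in place with data.table’s := operator. The sketch below assumes that total nitrogen is the sum of the three N columns, which is consistent with the output above; the three rows reuse the nitrogen values from the first rows of that output.

```r
library(data.table)

# Nitrogen values copied from the first three rows shown above.
dat <- data.table(
  N_plant  = c(0L, 25L, 25L),
  N_tdress = c(0L, 25L, 105L),
  N_spray  = c(0L, 0L, 0L)
)

# := adds the column to dat by reference, so no reassignment is needed.
dat[, N_tot := N_plant + N_tdress + N_spray]
print(dat)
```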
What was the maximum yield per trial per location per year?
## year local max_yld
## 1: 2016 Darling 3495.45
## 2: 2016 Langgewens 4339.37
## 3: 2016 Porterville 6173.99
## 4: 2016 Riversdale 5144.64
## 5: 2016 Tygerhoek 5137.83
## 6: 2017 Darling 1257.44
## 7: 2017 Langgewens 2928.22
## 8: 2017 Porterville 884.22
## 9: 2017 Riversdale 1699.36
## 10: 2017 Tygerhoek 4042.33
## 11: 2018 Darling 4095.51
## 12: 2018 Langgewens 6033.33
## 13: 2018 Porterville 5950.75
## 14: 2018 Riversdale 1947.25
## 15: 2018 Tygerhoek 6394.42
## 16: 2019 Langgewens 3722.00
## 17: 2019 Porterville 3536.00
## 18: 2019 Riversdale 2964.00
## 19: 2019 Tygerhoek 3647.00
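Grouped summaries like this one are what the by argument of data.table is for. The sketch below uses made-up yields, so the numbers differ from the table above.

```r
library(data.table)

# Made-up data: two locations, two years.
dat <- data.table(
  year  = c(2016L, 2016L, 2016L, 2017L, 2017L),
  local = c("Darling", "Darling", "Tygerhoek", "Darling", "Darling"),
  yld   = c(500, 900, 1200, 300, 700)
)

# For every year/location combination, keep the largest yield.
max_by_trial <- dat[, .(max_yld = max(yld)), by = .(year, local)]
print(max_by_trial)
```

The general pattern is dat[rows, what to compute, by = groups]; leaving the first slot empty means "use all rows".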
What was the maximum yield per trial per location per year per replicate?
## year local rep max_yld
## 1: 2016 Darling 1 3059.27
## 2: 2016 Darling 2 2161.19
## 3: 2016 Darling 3 2776.90
## 4: 2016 Darling 4 3495.45
## 5: 2016 Langgewens 1 4339.37
## 6: 2016 Langgewens 2 4285.36
## 7: 2016 Langgewens 3 4035.78
## 8: 2016 Langgewens 4 4122.39
## 9: 2016 Porterville 1 4005.54
## 10: 2016 Porterville 2 5463.10
## 11: 2016 Porterville 3 6173.99
## 12: 2016 Porterville 4 3676.30
## 13: 2016 Riversdale 1 5144.64
## 14: 2016 Tygerhoek 1 4783.65
## 15: 2016 Tygerhoek 2 4027.99
## 16: 2016 Tygerhoek 3 5137.83
## 17: 2016 Tygerhoek 4 5120.12
## 18: 2017 Darling 1 1257.44
## 19: 2017 Darling 2 1001.55
## 20: 2017 Darling 3 744.73
## 21: 2017 Darling 4 936.18
## 22: 2017 Langgewens 1 2331.78
## 23: 2017 Langgewens 2 2852.44
## 24: 2017 Langgewens 3 2928.22
## 25: 2017 Langgewens 4 2607.11
## 26: 2017 Porterville 1 747.33
## 27: 2017 Porterville 2 722.89
## 28: 2017 Porterville 3 800.00
## 29: 2017 Porterville 4 884.22
## 30: 2017 Riversdale 1 1699.36
## 31: 2017 Tygerhoek 1 3961.47
## 32: 2017 Tygerhoek 2 4042.33
## 33: 2017 Tygerhoek 3 3798.94
## 34: 2017 Tygerhoek 4 3513.23
## 35: 2018 Darling 1 3998.78
## 36: 2018 Darling 2 3838.23
## 37: 2018 Darling 3 3552.24
## 38: 2018 Darling 4 4095.51
## 39: 2018 Langgewens 1 6033.33
## 40: 2018 Langgewens 2 5805.00
## 41: 2018 Langgewens 3 5462.50
## 42: 2018 Langgewens 4 4892.50
## 43: 2018 Porterville 1 5415.37
## 44: 2018 Porterville 2 5950.75
## 45: 2018 Porterville 3 5503.27
## 46: 2018 Porterville 4 5490.20
## 47: 2018 Riversdale 1 1947.25
## 48: 2018 Tygerhoek 1 5977.55
## 49: 2018 Tygerhoek 2 6108.03
## 50: 2018 Tygerhoek 3 5138.50
## 51: 2018 Tygerhoek 4 6394.42
## 52: 2019 Langgewens 1 2775.00
## 53: 2019 Langgewens 2 3090.00
## 54: 2019 Langgewens 3 3037.00
## 55: 2019 Langgewens 4 3722.00
## 56: 2019 Porterville 1 2895.00
## 57: 2019 Porterville 2 3280.00
## 58: 2019 Porterville 3 3536.00
## 59: 2019 Porterville 4 3216.00
## 60: 2019 Riversdale 1 2747.00
## 61: 2019 Riversdale 2 1961.00
## 62: 2019 Riversdale 3 2964.00
## 63: 2019 Tygerhoek 1 3021.00
## 64: 2019 Tygerhoek 2 3139.00
## 65: 2019 Tygerhoek 3 3643.00
## 66: 2019 Tygerhoek 4 3647.00
## year local rep max_yld
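Adding the replicate only requires one more grouping column. The sketch below also uses keyby instead of by, which sorts the result by the grouping columns; the data are made up and deliberately out of order.

```r
library(data.table)

# Made-up, unsorted data.
dat <- data.table(
  year  = c(2017L, 2016L, 2016L, 2016L),
  local = c("Darling", "Tygerhoek", "Darling", "Darling"),
  rep   = c(1L, 1L, 2L, 2L),
  yld   = c(700, 1200, 900, 400)
)

# keyby groups like by, and additionally orders the output by the groups.
max_by_rep <- dat[, .(max_yld = max(yld)), keyby = .(year, local, rep)]
print(max_by_rep)
```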
What was the average yield per trial per location per year?
## year local av_yld
## 1: 2016 Darling 1965.3425
## 2: 2016 Langgewens 3514.3025
## 3: 2016 Porterville 2659.0466
## 4: 2016 Riversdale 3962.3525
## 5: 2016 Tygerhoek 3949.7209
## 6: 2017 Darling 730.5494
## 7: 2017 Langgewens 2147.9587
## 8: 2017 Porterville 589.5056
## 9: 2017 Riversdale 1385.5400
## 10: 2017 Tygerhoek 3353.2075
## 11: 2018 Darling 3222.7600
## 12: 2018 Langgewens 3538.7680
## 13: 2018 Porterville 4561.2541
## 14: 2018 Riversdale 1270.5187
## 15: 2018 Tygerhoek 4744.1777
## 16: 2019 Langgewens 2629.7188
## 17: 2019 Porterville 2419.0625
## 18: 2019 Riversdale 2038.9130
## 19: 2019 Tygerhoek 2796.8125
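The average follows the same grouping pattern with mean() instead of max(). The sketch below uses made-up yields and includes na.rm = TRUE, which tells mean() to skip missing values; without it a single NA in a group makes that group’s average NA.

```r
library(data.table)

# Made-up data with one missing yield.
dat <- data.table(
  year  = c(2016L, 2016L, 2016L, 2016L),
  local = c("Darling", "Darling", "Tygerhoek", "Tygerhoek"),
  yld   = c(500, 900, 1200, NA)
)

# Average yield per year/location, ignoring missing values.
av_by_trial <- dat[, .(av_yld = mean(yld, na.rm = TRUE)), by = .(year, local)]
print(av_by_trial)
```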
Create a normalised yield index variable and call it yld_norm. It must be normalised by the maximum yield per year, location and replicate.
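One possible reading of this task is to divide each yield by the largest yield in its year/location/replicate group, so that the best plot in each group gets an index of 1. The sketch below follows that assumption on made-up data; check it against what is expected of you before relying on it.

```r
library(data.table)

# Made-up data: one year and location, two replicates.
dat <- data.table(
  year  = 2016L,
  local = "Darling",
  rep   = c(1L, 1L, 2L, 2L, 2L),
  yld   = c(500, 1000, 300, 600, 1200)
)

# Within each group, max(yld) is the group maximum, so yld / max(yld)
# scales every observation relative to the best plot in its group.
dat[, yld_norm := yld / max(yld), by = .(year, local, rep)]
print(dat)
```

Combining := with by computes the group maximum once per group and writes the result back to every row of that group, so the table keeps its original number of rows.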