Data Exploration and Visualization

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.6     ✓ dplyr   1.0.4
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
?oats
oats
##      B           V      N   Y
## 1    I     Victory 0.0cwt 111
## 2    I     Victory 0.2cwt 130
## 3    I     Victory 0.4cwt 157
## 4    I     Victory 0.6cwt 174
## 5    I Golden.rain 0.0cwt 117
## 6    I Golden.rain 0.2cwt 114
## 7    I Golden.rain 0.4cwt 161
## 8    I Golden.rain 0.6cwt 141
## 9    I  Marvellous 0.0cwt 105
## 10   I  Marvellous 0.2cwt 140
## 11   I  Marvellous 0.4cwt 118
## 12   I  Marvellous 0.6cwt 156
## 13  II     Victory 0.0cwt  61
## 14  II     Victory 0.2cwt  91
## 15  II     Victory 0.4cwt  97
## 16  II     Victory 0.6cwt 100
## 17  II Golden.rain 0.0cwt  70
## 18  II Golden.rain 0.2cwt 108
## 19  II Golden.rain 0.4cwt 126
## 20  II Golden.rain 0.6cwt 149
## 21  II  Marvellous 0.0cwt  96
## 22  II  Marvellous 0.2cwt 124
## 23  II  Marvellous 0.4cwt 121
## 24  II  Marvellous 0.6cwt 144
## 25 III     Victory 0.0cwt  68
## 26 III     Victory 0.2cwt  64
## 27 III     Victory 0.4cwt 112
## 28 III     Victory 0.6cwt  86
## 29 III Golden.rain 0.0cwt  60
## 30 III Golden.rain 0.2cwt 102
## 31 III Golden.rain 0.4cwt  89
## 32 III Golden.rain 0.6cwt  96
## 33 III  Marvellous 0.0cwt  89
## 34 III  Marvellous 0.2cwt 129
## 35 III  Marvellous 0.4cwt 132
## 36 III  Marvellous 0.6cwt 124
## 37  IV     Victory 0.0cwt  74
## 38  IV     Victory 0.2cwt  89
## 39  IV     Victory 0.4cwt  81
## 40  IV     Victory 0.6cwt 122
## 41  IV Golden.rain 0.0cwt  64
## 42  IV Golden.rain 0.2cwt 103
## 43  IV Golden.rain 0.4cwt 132
## 44  IV Golden.rain 0.6cwt 133
## 45  IV  Marvellous 0.0cwt  70
## 46  IV  Marvellous 0.2cwt  89
## 47  IV  Marvellous 0.4cwt 104
## 48  IV  Marvellous 0.6cwt 117
## 49   V     Victory 0.0cwt  62
## 50   V     Victory 0.2cwt  90
## 51   V     Victory 0.4cwt 100
## 52   V     Victory 0.6cwt 116
## 53   V Golden.rain 0.0cwt  80
## 54   V Golden.rain 0.2cwt  82
## 55   V Golden.rain 0.4cwt  94
## 56   V Golden.rain 0.6cwt 126
## 57   V  Marvellous 0.0cwt  63
## 58   V  Marvellous 0.2cwt  70
## 59   V  Marvellous 0.4cwt 109
## 60   V  Marvellous 0.6cwt  99
## 61  VI     Victory 0.0cwt  53
## 62  VI     Victory 0.2cwt  74
## 63  VI     Victory 0.4cwt 118
## 64  VI     Victory 0.6cwt 113
## 65  VI Golden.rain 0.0cwt  89
## 66  VI Golden.rain 0.2cwt  82
## 67  VI Golden.rain 0.4cwt  86
## 68  VI Golden.rain 0.6cwt 104
## 69  VI  Marvellous 0.0cwt  97
## 70  VI  Marvellous 0.2cwt  99
## 71  VI  Marvellous 0.4cwt 119
## 72  VI  Marvellous 0.6cwt 121

Data Structure

(1) What does each row of the data set represent?

Each row of this data set represents one subplot of the field trial. There are 72 subplots in total (6 blocks × 3 varieties × 4 nitrogen levels).
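The row count and layout described above can be checked directly. The following is a minimal sketch using base R only; the object names `n_rows` and `design` are illustrative and not part of the assignment.

```r
# Load the oats data from MASS without attaching the whole package
data("oats", package = "MASS")

# One row per subplot
n_rows <- nrow(oats)
n_rows

# Cross-tabulate blocks against varieties: each cell counts the rows
# recorded for one variety within one block
design <- table(Block = oats$B, Variety = oats$V)
design
```

If every cell of `design` holds the same count, the design is balanced, with one row for each nitrogen level in every block-by-variety combination.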

(2) What do the columns of this data set represent? Indicate whether each variable in the study is numerical or categorical. If numerical, identify as continuous or discrete. If categorical, indicate if the variable is ordinal.

Each column is a variable recorded for the subplot in that row. B (blocks) identifies the block in which the subplot lies; it is categorical and not ordinal. V (varieties) is the oat variety planted; it is categorical and not ordinal. N (nitrogen) is the level of the nitrogen treatment (0.0, 0.2, 0.4, or 0.6 cwt); it is categorical and ordinal, although R stores it as an unordered factor. Y (yield) is the yield of the subplot in units of 1/4 lb; it is numerical and recorded as whole numbers, so it is discrete as stored, even though yield is a continuous quantity in principle.

str(oats)
## 'data.frame':    72 obs. of  4 variables:
##  $ B: Factor w/ 6 levels "I","II","III",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ V: Factor w/ 3 levels "Golden.rain",..: 3 3 3 3 1 1 1 1 2 2 ...
##  $ N: Factor w/ 4 levels "0.0cwt","0.2cwt",..: 1 2 3 4 1 2 3 4 1 2 ...
##  $ Y: int  111 130 157 174 117 114 161 141 105 140 ...
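Because `str()` reports N as a plain factor, its ordering is only implicit in the order of its levels. The sketch below shows one way to make the ordering explicit; the columns `N_ord` and `N_cwt` are hypothetical helpers added for illustration, not part of the original data set.

```r
data("oats", package = "MASS")

# Storage class of each original column
col_classes <- sapply(oats, class)
col_classes

# The levels of N are already listed from lowest to highest rate
levels(oats$N)

# Hypothetical helper column: the same levels, stored as an ordered factor
oats$N_ord <- factor(oats$N, levels = levels(oats$N), ordered = TRUE)
is.ordered(oats$N_ord)

# Hypothetical helper column: the rate as a number, parsed from the labels
oats$N_cwt <- as.numeric(sub("cwt", "", as.character(oats$N)))
sort(unique(oats$N_cwt))
```

An ordered factor keeps the boxplot axis in treatment order and allows comparisons such as `oats$N_ord >= "0.4cwt"`, while the numeric column would be the natural choice for a scatterplot or a trend line.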

(3) What are the response and explanatory variables in this study?

The response variable is yield (Y). The explanatory variables are variety (V) and nitrogen level (N). Block (B) is a blocking variable: it groups the plots so that differences between blocks can be accounted for rather than confounded with the treatment effects.

(4) Create a hypothesis about nitrogen fertilizer concentration levels, without first looking at the data.

From my prior knowledge of nitrogen fertilizers, I believe they help crops take up and allocate nutrients more productively. I therefore hypothesize that as nitrogen levels increase, crop yield will also increase.

Graphics and EDA

(5) Use ggplot to create a side-by-side boxplot, which illustrates the yield distribution for each nitrogen fertilizer concentration level and allows for both visual comparison across and within treatments.

ggplot(oats, aes(x = N, y = Y)) +
  geom_boxplot()

(6) Look at Plot 1. What are your observations from this plot? Does your hypothesis from question (4) appear to be supported? Explain.

At lower levels of N, the yield Y tends to be lower; at higher levels of N, it tends to be higher. The median yield rises with each increase in the nitrogen treatment, which suggests a positive relationship between N and Y and supports the hypothesis from question (4).
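The visual impression from the boxplot can be backed up with numbers. As a rough sketch, the medians (the center lines of the boxes) and the means can be computed per nitrogen level with base R; the names `med_by_n` and `mean_by_n` are illustrative.

```r
data("oats", package = "MASS")

# Median and mean yield for each nitrogen level
med_by_n  <- tapply(oats$Y, oats$N, median)
mean_by_n <- tapply(oats$Y, oats$N, mean)

# Show both summaries side by side, one row per nitrogen level
cbind(median = med_by_n, mean = mean_by_n)

# Step-to-step change in the median; all positive means a steady increase
diff(as.vector(med_by_n))
```

This only summarizes the sample. It does not test whether the differences are larger than the variation between blocks would produce on its own.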

(7) Plot 2: Now use ggplot to create a side-by-side boxplot, which illustrates the yield distribution for each oat variety treatment. Let’s add some color! Fill the boxes with a different color for each variety.

ggplot(oats, aes(x = V, y = Y, fill = V)) +
  geom_boxplot()

(8) Look at Plot 2. What are your observations from this plot? Do any of the varieties stand out as being the best producer? Explain.

This plot suggests that, of the three varieties, Marvellous tends to have the highest yield, with a median of around 113 (in units of 1/4 lb per subplot). Its high median and comparatively narrow spread make it stand out from the other varieties. Golden.rain has the second-highest median, a little over 100, and Victory has the lowest, at around 90. These two varieties have similar spreads, both wider than that of Marvellous.
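The values read off the boxplot can be compared against computed summaries. The sketch below uses base R to tabulate center and spread per variety; the name `variety_stats` is illustrative, and the choice of IQR and standard deviation as spread measures is an assumption about what "spread" should mean here.

```r
data("oats", package = "MASS")

# For each variety, compute center (median, mean) and spread (IQR, sd)
variety_stats <- sapply(
  split(oats$Y, oats$V),
  function(y) c(median = median(y), mean = mean(y), IQR = IQR(y), sd = sd(y))
)

# Transpose so that each variety is a row
t(variety_stats)

# Variety with the highest median yield
names(which.max(variety_stats["median", ]))
```

Note that `IQR()` uses R's default quantile method, so its values can differ slightly from the hinges that `geom_boxplot()` draws.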

(9) Plot 3: Add facets to your plot from part 6 to compare yields across nitrogen fertilizer concentration levels and the three oat varieties.

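# Facet by variety only: N is already on the x-axis, so faceting by N as
# well would leave a single box in each panel
ggplot(oats, aes(x = N, y = Y, fill = V)) +
  geom_boxplot() +
  facet_grid(. ~ V)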

Conclusion

(11) What advice would you give the farmer after exploring the data?

Based on this data, the farmer has a few options for increasing yield. For the highest yield, the boxplots suggest planting the Golden.rain variety with the 0.6 cwt nitrogen treatment. Other high-yield options are Marvellous at 0.6, 0.4, or 0.2 cwt, and Golden.rain at 0.4 cwt.
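The ranking of variety and nitrogen combinations can be checked with a two-way table of summaries. This is a minimal sketch in base R; `cell_median` and `cell_mean` are illustrative names. With only six subplots per combination, the ranking may depend on which summary is used, so it seems worth looking at both.

```r
data("oats", package = "MASS")

# Median and mean yield for every variety-by-nitrogen combination
groups      <- list(Variety = oats$V, Nitrogen = oats$N)
cell_median <- tapply(oats$Y, groups, median)
cell_mean   <- tapply(oats$Y, groups, mean)

cell_median
round(cell_mean, 1)

# Row (variety) and column (nitrogen level) of the highest median
best <- which(cell_median == max(cell_median), arr.ind = TRUE)
rownames(cell_median)[best[, "row"]]
colnames(cell_median)[best[, "col"]]
```

If the two tables disagree about the top combination, that is a sign the leading options are too close to separate on exploratory summaries alone.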