Determination of the optimum number of rows of Cacao swollen shoot virus (CSSV) mild strain N1-inoculated cocoa trees required for the control of CSSV severe 1A spread

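library(tidyverse)  # dplyr, tidyr, and ggplot2 for the pipes, pivots, and plots below
library(readxl)     # read_xlsx()

Data_path <- "C:/Users/seanc/OneDrive/Documents/Fall21/STAT 380/Worksheet2/severity score.xlsx"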

severity_score <- read_xlsx(
    Data_path,
    range = "Consolidated!A2:F8666"
)
severity_score_descr <- read_xlsx(
    Data_path,
    range = "Consolidated!L3:L12"
)

The code above reads in the severity scores and the accompanying score descriptions from the Consolidated sheet.

severity_score %>% 
  group_by(OrigT,year) %>% 
  summarise(mean = mean(score),
            median = median(score),
            sd = sd(score))

## `summarise()` has grouped output by 'OrigT'. You can override using the `.groups` argument.

## # A tibble: 8 x 5
## # Groups:   OrigT [4]
##   OrigT  year  mean median    sd
##   <chr> <dbl> <dbl>  <dbl> <dbl>
## 1 T1        1  6.47      9  3.08
## 2 T1        2  6.64      9  3.22
## 3 T2        1  6.81      9  2.98
## 4 T2        2  7.03      9  3.00
## 5 T3        1  6.89      9  2.91
## 6 T3        2  7.40      9  2.66
## 7 T4        1  6.47      9  3.17
## 8 T4        2  6.68      9  3.24

From the table above we can see that the means and standard deviations are similar across groups, although T1 and T4 have slightly lower means. The data are also skewed: every group has a median of 9, while the means lie between 6.47 and 7.40.
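To make the skew argument concrete, the sketch below re-enters the rounded values from the summary table (no raw data is used) and computes the gap between mean and median for each group. The name summary_tbl is introduced here for illustration only.

```r
# Rounded values copied from the summary table above (not recomputed from raw data)
summary_tbl <- data.frame(
  OrigT  = rep(c("T1", "T2", "T3", "T4"), each = 2),
  year   = rep(1:2, times = 4),
  mean   = c(6.47, 6.64, 6.81, 7.03, 6.89, 7.40, 6.47, 6.68),
  median = 9
)

# A mean well below the median points to a tail of low scores (left skew)
summary_tbl$mean_minus_median <- summary_tbl$mean - summary_tbl$median
print(summary_tbl)
```

Every gap is negative, which is consistent with a tail of low scores pulling the mean below the median.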

ggplot(severity_score,aes(x = score))+
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This histogram shows that the data are skewed to the left.

# The argument is bins, not bin; grouping is not needed because facet_grid() splits the panels
ggplot(data = severity_score, aes(x = score))+
  geom_histogram(bins = 30)+
  facet_grid(cols = vars(OrigT),
             rows = vars(year))

The plot above shows the score distribution for each OrigT group (columns) in year 1 and year 2 (rows). Note that T2 and T3 have slightly higher counts on the left side of the graph.

data <- severity_score %>% 
  pivot_wider(names_from = year,names_prefix = "score_",
              values_from = score) %>% 
  group_by(OrigT) %>% 
  summarise(score_1_n = mean(score_1),
            score_2_n = mean(score_2)) %>% 
  ungroup() %>% 
  mutate(score_increase = (score_2_n-score_1_n)/score_1_n)

ggplot(data = data,aes(x = OrigT, y = score_increase))+
  geom_col()+
  scale_y_continuous(labels = scales::percent)+  # show the proportion as a percentage
  labs(title = "Percent increase in average score between years")

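From the plot above we can see how much the average score rose from year 1 to year 2 in each group. Based on the rounded means in the summary table, T3 had by far the greatest increase (about 7%), while T1, T2, and T4 each rose by roughly 3%.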
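As a cross-check on the plot, the same ratio can be approximated from the rounded group means in the summary table. This is only a sketch: the vectors year1 and year2 are introduced here for illustration, and because the inputs are rounded to two decimals the results are approximate.

```r
# Rounded yearly means copied from the summary table above
year1 <- c(T1 = 6.47, T2 = 6.81, T3 = 6.89, T4 = 6.47)
year2 <- c(T1 = 6.64, T2 = 7.03, T3 = 7.40, T4 = 6.68)

# Same formula as score_increase: (year 2 mean - year 1 mean) / year 1 mean
rel_increase <- (year2 - year1) / year1
print(round(100 * rel_increase, 1))  # expressed in percent
```

With these rounded inputs every group increases, and T3 comes out largest.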

severity_score %>% 
  ggplot(aes(x = OrigT, y = score))+
  geom_boxplot()+
  facet_grid(cols = vars(rep),
             rows = vars(year))

From the boxplots above we can see that T1 and T4 have similar spread, while the scores for T2 and T3 are compressed near the top of the scale. To look more closely at these differences, we run an analysis of variance (ANOVA).

ANOVA

ANOVA assumptions:

1. The responses for each factor level have a normal population distribution (assumed).
2. These distributions have the same variance.
3. The data are independent (assumed).

Equal Variance:

Recall from the summary table above that the largest standard deviation is 3.243520 and the smallest is 2.661091. Their ratio is 1.2188685, which lies between 0.5 and 2, so we can assume equal variance.
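The ratio check can be written out in a few lines. This is a minimal sketch that uses only the two standard deviations quoted above; the variable names are illustrative.

```r
# Largest and smallest group standard deviations quoted above
sd_max <- 3.243520
sd_min <- 2.661091

sd_ratio <- sd_max / sd_min

# Rule of thumb used in the text: the ratio should fall between 0.5 and 2
equal_variance_ok <- sd_ratio > 0.5 && sd_ratio < 2
print(sd_ratio)
print(equal_variance_ok)
```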

ANOVA

fit <- aov(score ~ OrigT, data = severity_score)
summary(fit)

##               Df Sum Sq Mean Sq F value   Pr(>F)    
## OrigT          3    531  176.99   19.14 2.31e-12 ***
## Residuals   8660  80104    9.25                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p-value is very small, we reject the null hypothesis. That is, we reject the hypothesis that the different OrigT groups have equal mean scores.
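The F value and p-value in the table can be reproduced approximately from its sums of squares and degrees of freedom. The sketch below uses the rounded figures printed by summary(fit), so the result matches the table only to within rounding; the variable names are illustrative.

```r
# Values copied from the summary(fit) output above
ss_between <- 531;   df_between <- 3
ss_within  <- 80104; df_within  <- 8660

# F is the ratio of the between-group to the within-group mean square
f_stat <- (ss_between / df_between) / (ss_within / df_within)

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)
print(c(F = f_stat, p = p_value))
```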

To find out which trials are better, we will use Tukey's HSD.

TukeyHSD(fit)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = score ~ OrigT, data = severity_score)
## 
## $OrigT
##              diff         lwr        upr     p adj
## T2-T1  0.36795937  0.13048999  0.6054288 0.0004015
## T3-T1  0.59279778  0.35532840  0.8302672 0.0000000
## T4-T1  0.02354571 -0.21392367  0.2610151 0.9942021
## T3-T2  0.22483841 -0.01263097  0.4623078 0.0710957
## T4-T2 -0.34441367 -0.58188304 -0.1069443 0.0011190
## T4-T3 -0.56925208 -0.80672146 -0.3317827 0.0000000

As we can see above, T1 and T4 do not differ significantly (p adj = 0.99), and neither do T2 and T3 at the 5% level (p adj = 0.07), but every comparison between the two pairs is significant.
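All six intervals in the Tukey output have the same width, which is what a balanced design produces. Under that assumption (8664 observations split evenly over the four groups, inferred from the degrees of freedom rather than stated in the data), the half-width of each interval can be sketched from the ANOVA table as the studentized range quantile times sqrt(MSE / n). The variable names are illustrative.

```r
# Values taken from the ANOVA table above
mse       <- 9.25    # residual mean square
df_within <- 8660    # residual degrees of freedom
k         <- 4       # number of OrigT groups

# Assumption: balanced design, 8664 observations split evenly over 4 groups
n_per_group <- (df_within + k) / k

# Tukey HSD half-width: studentized range quantile times sqrt(MSE / n)
half_width <- qtukey(0.95, nmeans = k, df = df_within) * sqrt(mse / n_per_group)
print(half_width)

# A pair differs significantly when |diff| exceeds the half-width
diffs <- c("T4-T1" = 0.02354571, "T3-T2" = 0.22483841, "T2-T1" = 0.36795937)
print(abs(diffs) > half_width)
```

The computed half-width should agree closely with the distance between diff and lwr in the output above, for example 0.36795937 - 0.13048999 for T2-T1.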

From the above, T1 and T4 are the best trials to implement, since they have the lowest mean scores. Because the two do not differ significantly, we can choose whichever is cheaper to replicate. We would expect that to be T1, since it has the fewest inoculated trees.

Sean Cranston

9/3/2021