For this exam you will be using the “cane.csv” file found on Canvas.

The “cane.csv” file contains data from an experiment on the disease risk of different varieties of sugar cane and on how different treatments could affect the level of disease. The file contains four columns of data:

“nStems” = total number of stems
“diseaseStems” = total number of stems with disease
“variety” = the type of sugar cane
“block” = treatment

In the space below, please perform the following tasks/answer questions:

NOTE: for each prompt, I need to see code in order to give you credit!
NOTE: you can perform all of these tasks in one code chunk or in many code chunks; this is up to you.

#1. bring in the dataset and name it "sugar" (1 pt)
sugar <- read.csv('cane.csv')
#2. How many rows does the data set have? (2 pts)
nrow(sugar)
## [1] 180
#3. What are the types of variables contained in the dataset? (2 pts)
str(sugar)
## 'data.frame':    180 obs. of  4 variables:
##  $ nStems      : int  87 119 94 95 134 92 118 70 128 85 ...
##  $ diseaseStems: int  76 8 74 11 0 0 11 32 33 14 ...
##  $ variety     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ block       : chr  "A" "A" "A" "A" ...
#4. Calculate the mean and standard deviation of the "diseaseStems" column (4 pts)
mean(sugar$diseaseStems)
## [1] 20.25556
sd(sugar$diseaseStems)
## [1] 24.50108
#5. Calculate the mean of the "diseaseStems" column for each "block" (3 pts)
tapply(sugar$diseaseStems, sugar$block, mean)
##        A        B        C        D 
## 18.57778 25.48889 17.44444 19.51111
#6. Convert the "variety" column from a number to a factor - YOU MAY NEED TO LOOK THIS UP! (2 pts)
sugar$variety <- as.factor(sugar$variety)
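A minimal sketch of what the conversion in #6 does, using a made-up vector `v` rather than the real "variety" column. The values are assumptions chosen only to show how a factor stores levels and integer codes, and why converting a factor back to numbers should go through `as.character()`.

```r
# Hypothetical numeric labels (NOT taken from cane.csv)
v <- c(10, 2, 10, 45)

# as.factor() treats each distinct value as a category
f <- as.factor(v)

levels(f)                    # the distinct categories, sorted numerically
as.integer(f)                # internal codes (positions in levels), not the labels
as.numeric(as.character(f))  # recovers the original labels
```

Once "variety" is a factor, functions such as `tapply()` and `boxplot()` treat it as a grouping variable instead of a quantity.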
#7. Calculate the mean of the "diseaseStems" column for each "block" AND "variety" - YOU CANNOT DO THIS WITHOUT COMPLETING #6 (4 pts)
tapply(sugar$diseaseStems, list(sugar$block, sugar$variety), mean)
##    1  2  3  4 5  6  7  8  9 10 11 12 13  14 15 16 17 18 19 20 21  22 23 24 25
## A 76  8 74 11 0  0 11 32 33 14  3  3 28  63  3 16 11  2  8 62 14  34  0 13  7
## B 70 21 95 21 6 63  7 22 77 12  0 26 50 105 18 32  9  0 36  9 17 110  0 14  7
## C 54 10 44 15 3 21  8 28 11 13  0  3 36  59  5  4 57  0 22 92 21  57  0 24  0
## D 39 26 38 41 5 47 15 18 11 28  0 39 13  23  1 69 24  1 30 23 15 131  4  2  3
##   26 27 28 29 30 31 32 33 34 35 36 37 38 39  40 41 42 43 44 45
## A 12  0 22  5 17  0 15 20 27  0  0  6 25  2 112  9 10  0  1 27
## B 22  8  7  0 13  1 11 18 25  0 10 43  6 17  48  0 16 11  0 64
## C  8  1  6  4  6  0  7 18 10  8  2 24 22 10   8  0 16  6  0 42
## D 13  3  7  3 13  0  6  2 10  0  2 11 36 16  63  2 12  0  9 24
#8. Plot a histogram of "nStems" (2 pts)
hist(sugar$nStems, main = "Histogram of nStems", xlab = "nStems", col = "red", border = "black")

#9. Does the histogram of “nStems” appear to be normally distributed? (2 pts)

No. Although the histogram has a single high peak, its left and right tails are not symmetrical. The right tail is longer, while the data on the left side are more concentrated. The histogram shows right-skewed data.
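A visual judgment of skew can be backed up numerically. The following is a minimal sketch, not part of the exam's required answer: `x` is a simulated, deliberately right-skewed stand-in for `sugar$nStems` (not the real data), and `skewness()` is a hypothetical helper defined here, since base R does not provide one.

```r
# Simulated right-skewed stand-in for sugar$nStems (NOT the real data)
set.seed(1)
x <- round(rexp(180, rate = 1 / 40)) + 60

# Hypothetical helper: moment-based sample skewness
# (third central moment divided by the 3/2 power of the second)
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^(3 / 2)
}

skewness(x)              # positive values indicate a longer right tail
shapiro.test(x)$p.value  # a small p-value is evidence against normality
```

With the real data, `x` would be replaced by `sugar$nStems`; `qqnorm()` followed by `qqline()` gives a complementary graphical check.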

#10. Create a boxplot for "diseaseStems" by "block" - create the plot so that each block has a different color box (6 pts)
boxplot(diseaseStems ~ block, data = sugar,
        col = c("red", "blue", "yellow", "green"),
        main = "Boxplot of diseaseStems by Block",
        xlab = "Block",
        ylab = "diseaseStems")

#11. Does it appear that “block” influences the number of diseased stems? Why or why not? (2 pts)

It does not appear that “block” influences the number of diseased stems. The medians for blocks A, B, C, and D are close to one another, and the interquartile ranges of diseased stems overlap across all four blocks. Every block shows some outliers, but the overall distributions look broadly similar.
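The comparison of medians and interquartile ranges described above can also be computed rather than read off the plot. This is a minimal sketch on a small made-up data frame `toy` (an assumption for illustration, not the real "sugar" data); the same two `tapply()` calls would apply to `sugar$diseaseStems` and `sugar$block`.

```r
# Hypothetical toy data standing in for the sugar data frame
toy <- data.frame(
  block = rep(c("A", "B", "C", "D"), each = 4),
  diseaseStems = c(0, 8, 11, 76,
                   6, 21, 63, 95,
                   3, 10, 15, 54,
                   5, 26, 39, 41)
)

# Median and interquartile range of diseased stems within each block
block_median <- tapply(toy$diseaseStems, toy$block, median)
block_iqr    <- tapply(toy$diseaseStems, toy$block, IQR)

block_median
block_iqr
```

Similar medians with heavily overlapping interquartile ranges would support the conclusion that block has little visible effect; clearly separated values would argue against it.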

#BONUS: 5 pts total (3 for A, 2 for B)
#A. Create a new column in your dataset that contains the proportion or percentage of diseased stems for each plot

# Each row is one plot, so the percentage of diseased stems in a plot is
# the number of diseased stems divided by the total number of stems, times 100.
# (A frequency table of "diseaseStems" would instead measure how often each
# count occurs across plots, which is not what the prompt asks for.)
sugar$diseaseStems_percent <- sugar$diseaseStems / sugar$nStems * 100

#B. Create a boxplot for the new column by “block”

boxplot(diseaseStems_percent ~ block, data = sugar,
        col = c("red", "blue", "yellow", "green"),
        main = "Boxplot of Disease Stems Percentage by Block",
        xlab = "Block",
        ylab = "Percentage of Diseased Stems")