The “cane.csv” file contains data from an experiment on the disease risk of different varieties of sugar cane and on how different treatments affect the level of disease. The file contains four columns of data:
“nStems” = total number of stems
“diseaseStems” = total number of stems with disease
“variety” = the type of sugar cane
“block” = treatment
NOTE: for each prompt, I need to see code in order to give you credit!
NOTE: you can perform all of these tasks in one code chunk or in many code chunks; this is up to you.
#1. bring in the dataset and name it "sugar" (1 pt)
sugar <- read.csv("cane.csv")
#2. How many rows does the data set have? (2 pts)
nrow(sugar)
## [1] 180
#3. What are the types of variables contained in the dataset? (2 pts)
str(sugar)
## 'data.frame': 180 obs. of 4 variables:
## $ nStems : int 87 119 94 95 134 92 118 70 128 85 ...
## $ diseaseStems: int 76 8 74 11 0 0 11 32 33 14 ...
## $ variety : int 1 2 3 4 5 6 7 8 9 10 ...
## $ block : chr "A" "A" "A" "A" ...
#4. Calculate the mean and standard deviation of the "diseaseStems" column (4 pts)
mean(sugar$diseaseStems)
## [1] 20.25556
sd(sugar$diseaseStems)
## [1] 24.50108
#5. Calculate the mean of the "diseaseStems" column for each "block" (3 pts)
tapply(sugar$diseaseStems, sugar$block, mean)
## A B C D
## 18.57778 25.48889 17.44444 19.51111
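As a cross-check on the grouped mean, the same calculation can be written with `aggregate()`, which returns a data frame instead of a named vector. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical, not taken from cane.csv) so it runs on its own.

```r
# Hypothetical toy data, NOT the values in cane.csv
toy <- data.frame(
  diseaseStems = c(10, 20, 30, 40, 50, 60),
  block        = c("A", "A", "B", "B", "C", "C")
)

# tapply() returns the group means as a named vector
means_by_block <- tapply(toy$diseaseStems, toy$block, mean)
print(means_by_block)

# aggregate() returns the same group means as a data frame
means_df <- aggregate(diseaseStems ~ block, data = toy, FUN = mean)
print(means_df)
```

Either form works for the assignment; the data frame version can be easier to reuse for later plotting or merging.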
#6. Convert the "variety" column from a number to a factor - YOU MAY NEED TO LOOK THIS UP! (2 pts)
sugar$variety <- as.factor(sugar$variety)
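A short sketch of what the conversion does, using made-up variety codes (the vector `variety_num` is hypothetical). After `as.factor()`, the numbers become category labels, and converting back requires going through `as.character()` because `as.integer()` on a factor returns the internal level codes rather than the original numbers.

```r
# Hypothetical variety codes, NOT the values in cane.csv
variety_num <- c(10, 20, 10, 30)

# Convert the numeric codes to a factor (categorical variable)
variety_fac <- as.factor(variety_num)

# The distinct values become the factor levels, stored as text labels
print(levels(variety_fac))

# as.integer() on a factor gives the internal level codes (1, 2, 1, 3)
codes <- as.integer(variety_fac)
print(codes)

# To recover the original numbers, convert to character first
recovered <- as.numeric(as.character(variety_fac))
print(recovered)
```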
#7. Calculate the mean of the "diseaseStems" column for each "block" AND "variety" - YOU CANNOT DO THIS WITHOUT COMPLETING 6 (4 points)
tapply(sugar$diseaseStems, list(sugar$block, sugar$variety), mean)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## A 76 8 74 11 0 0 11 32 33 14 3 3 28 63 3 16 11 2 8 62 14 34 0 13 7
## B 70 21 95 21 6 63 7 22 77 12 0 26 50 105 18 32 9 0 36 9 17 110 0 14 7
## C 54 10 44 15 3 21 8 28 11 13 0 3 36 59 5 4 57 0 22 92 21 57 0 24 0
## D 39 26 38 41 5 47 15 18 11 28 0 39 13 23 1 69 24 1 30 23 15 131 4 2 3
## 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
## A 12 0 22 5 17 0 15 20 27 0 0 6 25 2 112 9 10 0 1 27
## B 22 8 7 0 13 1 11 18 25 0 10 43 6 17 48 0 16 11 0 64
## C 8 1 6 4 6 0 7 18 10 8 2 24 22 10 8 0 16 6 0 42
## D 13 3 7 3 13 0 6 2 10 0 2 11 36 16 63 2 12 0 9 24
#8. Plot a histogram of "nStems" (2 pts)
hist(sugar$nStems, main = "Histogram of nStems", xlab = "nStems", col = "red", border = "black")
#9. Does the histogram of “nStems” appear to be normally distributed? (2 pts)
No. Although the histogram has a single high peak, its right and left tails are not symmetrical. The right tail is longer than the left tail, and the data are more concentrated on the left side. The histogram therefore shows right-skewed data.
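A visual judgment of skew can be backed up with a couple of numbers. The sketch below is one possible check, run on a made-up vector `x` (hypothetical, not the nStems column): a mean larger than the median and a positive skewness coefficient both point to a longer right tail. Base R has no skewness function, so the moment-based formula is written out by hand; `shapiro.test()` from the stats package is included as an optional formal test of normality.

```r
# Hypothetical right-skewed sample, NOT the nStems column
x <- c(1, 2, 2, 3, 3, 3, 4, 10)

# For right-skewed data the mean is pulled above the median
x_mean   <- mean(x)
x_median <- median(x)
print(c(mean = x_mean, median = x_median))

# Moment-based skewness: average cubed deviation divided by
# the cubed (population) standard deviation
dev  <- x - x_mean
skew <- mean(dev^3) / (mean(dev^2))^(3 / 2)
print(skew)   # positive values indicate a longer right tail

# Optional: Shapiro-Wilk test; a small p-value is evidence against normality
print(shapiro.test(x))
```

With the real data, the same steps would be applied to `sugar$nStems` in place of `x`.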
#10. Create a boxplot for "diseaseStems" by "block" - create the plot so that each block has a different color box (6 pts)
boxplot(diseaseStems ~ block, data = sugar,
        col = c("red", "blue", "yellow", "green"),
        main = "Boxplot of diseaseStems by Block",
        xlab = "Block",
        ylab = "diseaseStems")
#11. Does it appear that “block” influences the number of diseased stems? Why or why not? (2 pts)
It does not appear that “block” influences the number of diseased stems. The medians for blocks A, B, C, and D are close to each other, and the interquartile ranges of diseased stems overlap across all blocks. All blocks show some outliers, but the overall distributions look somewhat similar.
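The medians and interquartile ranges read off the boxplot can also be computed directly, which makes the comparison between blocks less dependent on eyeballing the plot. This is a minimal sketch on made-up data (`toy` is hypothetical); with the real data, `sugar$diseaseStems` and `sugar$block` would take its place.

```r
# Hypothetical toy data, NOT the values in cane.csv
toy <- data.frame(
  diseaseStems = c(1, 2, 3, 4, 5, 2, 4, 6, 8, 10),
  block        = rep(c("A", "B"), each = 5)
)

# Median of diseased stems within each block (the line inside each box)
med_by_block <- tapply(toy$diseaseStems, toy$block, median)
print(med_by_block)

# Interquartile range within each block (the height of each box)
iqr_by_block <- tapply(toy$diseaseStems, toy$block, IQR)
print(iqr_by_block)
```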
#BONUS: 5 pts total (3 for A, 2 for B)
#A. Create a new column for your dataset that contains the proportion or percentage of diseased stems for each plot
# Each row is one plot, so the percentage of diseased stems is
# the diseased stems divided by the total stems in that row, times 100
sugar$diseaseStems_percent <- sugar$diseaseStems / sugar$nStems * 100
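For reference, the percentage of diseased stems in a plot is the number of diseased stems divided by the total number of stems in that same plot, times 100. The sketch below works this through on three made-up plots (`toy` and its values are hypothetical) and adds a sanity check that every percentage falls between 0 and 100.

```r
# Hypothetical toy data, NOT the values in cane.csv
toy <- data.frame(
  nStems       = c(100, 50, 80),
  diseaseStems = c(25, 0, 80)
)

# Row-wise percentage: diseased stems / total stems * 100
toy$diseaseStems_percent <- toy$diseaseStems / toy$nStems * 100
print(toy)

# Sanity check: a per-plot percentage can never fall outside 0 to 100
in_range <- all(toy$diseaseStems_percent >= 0 & toy$diseaseStems_percent <= 100)
print(in_range)
```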
#B. Create a boxplot for the new column by “block”
boxplot(diseaseStems_percent ~ block, data = sugar,
        col = c("red", "blue", "yellow", "green"),
        main = "Boxplot of Disease Stems Percentage by Block",
        xlab = "Block",
        ylab = "Percentage of Diseased Stems")