A. Using InsectSprays data set: The counts of insects in agricultural experimental units treated with different insecticides.
Column 1: Insect count
Column 2: The type of spray
\(H_0:\) Mean of insect counts are same corresponding to every spray.
1. How many different types of spray categories are in the data set?
dat <- InsectSprays
levels(dat$spray)
## [1] "A" "B" "C" "D" "E" "F"
Using proper R-command find if every spray category has the same number of rows. For example, if spray A has 10 rows, then other categories should have 10 rows.
wdat <- unstack(dat)
sapply(wdat, length)
## A B C D E F
## 12 12 12 12 12 12
Using the unstack command, seperates the different levels in to columns. Then I used sapply to apply the function “length” to each column. As you can see they are all the same size.
2. Find the mean of insect count for the spray category A.
mean(wdat$A)
## [1] 14.5
3. Show all the means corresponding to every category in a row (as output). You may use the function tapply().
tapply(dat$count, dat$spray, mean)
## A B C D E F
## 14.500000 15.333333 2.083333 4.916667 3.500000 16.666667
4. Find the variance of insect count corresponding to every category. Hint: You can use the tapply() here as well.
tapply(dat$count, dat$spray, var)
## A B C D E F
## 22.272727 18.242424 3.901515 6.265152 3.000000 38.606061
5. Check if all the A-F spray category versus insect counts are normally distributed. Print the graphs as a matrix graph with 2 rows and 3 columns.
par(mfrow=c(2,3))
qqnorm(wdat$A)
qqnorm(wdat$B)
qqnorm(wdat$C)
qqnorm(wdat$D)
qqnorm(wdat$E)
qqnorm(wdat$F)
6. What is your conclusion regarding the homogeneity of variances?
bartlett.test(dat$count, dat$spray, data=dat)
##
## Bartlett test of homogeneity of variances
##
## data: dat$count and dat$spray
## Bartlett's K-squared = 25.96, df = 5, p-value = 9.085e-05
P-value is low, so we reject the claim of homogeneity of the variances.
7. Do the one way ANOVA test using aov().
aov(dat$count~dat$spray, dat)
## Call:
## aov(formula = dat$count ~ dat$spray, data = dat)
##
## Terms:
## dat$spray Residuals
## Sum of Squares 2668.833 1015.167
## Deg. of Freedom 5 66
##
## Residual standard error: 3.921902
## Estimated effects may be unbalanced
8. Print the summary of the anova result. Write the numerical values of the followings:
(a) Degrees of Freedom: \(df_{within}\), \(df_{between}\)
(b) Sum of Squares: \(SS_{within}\), \(SS_{between}\)
(c) Mean of Squares (Variances): \(MS_{within}\), \(MS_{between}\)
(d) F value
(e) p value. How do you interpret the p-value with respect to the null hypothesis?
summary(aov(dat$count~dat$spray, dat))
## Df Sum Sq Mean Sq F value Pr(>F)
## dat$spray 5 2669 533.8 34.7 <2e-16 ***
## Residuals 66 1015 15.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is lower than alpha so the null must got. We reject the claim that the means are the same.
9. Conduct the Post Hoc test to compare the mean insect counts for different spray-pairs.
TukeyHSD(aov(dat$count~dat$spray, dat))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = dat$count ~ dat$spray, data = dat)
##
## $`dat$spray`
## diff lwr upr p adj
## B-A 0.8333333 -3.866075 5.532742 0.9951810
## C-A -12.4166667 -17.116075 -7.717258 0.0000000
## D-A -9.5833333 -14.282742 -4.883925 0.0000014
## E-A -11.0000000 -15.699409 -6.300591 0.0000000
## F-A 2.1666667 -2.532742 6.866075 0.7542147
## C-B -13.2500000 -17.949409 -8.550591 0.0000000
## D-B -10.4166667 -15.116075 -5.717258 0.0000002
## E-B -11.8333333 -16.532742 -7.133925 0.0000000
## F-B 1.3333333 -3.366075 6.032742 0.9603075
## D-C 2.8333333 -1.866075 7.532742 0.4920707
## E-C 1.4166667 -3.282742 6.116075 0.9488669
## F-C 14.5833333 9.883925 19.282742 0.0000000
## E-D -1.4166667 -6.116075 3.282742 0.9488669
## F-D 11.7500000 7.050591 16.449409 0.0000000
## F-E 13.1666667 8.467258 17.866075 0.0000000
Visualize the results with boxplot().
boxplot(dat$count~dat$spray, dat)
10. Use colors in boxplot for different sprays. Also, add ylabel to the figure. Are there any outliers?
boxplot(dat$count~dat$spray, data = dat, ylab="Count",col=c("red","blue","green","yellow","orange","purple"))
Are there any outliers? Yes, C, D, and E seem to be vastly different from the rest.
11. Create the same boxplot using ggplot() and geom boxplot(). Fill the boxplots with colors as well.
library(ggplot2)
ggplot(dat, aes(x = dat$spray, y = dat$count)) +
geom_boxplot(fill=c("red","blue","green","yellow","orange","purple"), colour = "black") + scale_x_discrete() + xlab("Spray") + ylab("Count")
B Consider the default data set mtcars for the next two problems.
1. Print the Pearson’s correlation coefficients up to 4 decimal places.
dat <- mtcars
round(cor(dat),4)
## mpg cyl disp hp drat wt qsec vs
## mpg 1.0000 -0.8522 -0.8476 -0.7762 0.6812 -0.8677 0.4187 0.6640
## cyl -0.8522 1.0000 0.9020 0.8324 -0.6999 0.7825 -0.5912 -0.8108
## disp -0.8476 0.9020 1.0000 0.7909 -0.7102 0.8880 -0.4337 -0.7104
## hp -0.7762 0.8324 0.7909 1.0000 -0.4488 0.6587 -0.7082 -0.7231
## drat 0.6812 -0.6999 -0.7102 -0.4488 1.0000 -0.7124 0.0912 0.4403
## wt -0.8677 0.7825 0.8880 0.6587 -0.7124 1.0000 -0.1747 -0.5549
## qsec 0.4187 -0.5912 -0.4337 -0.7082 0.0912 -0.1747 1.0000 0.7445
## vs 0.6640 -0.8108 -0.7104 -0.7231 0.4403 -0.5549 0.7445 1.0000
## am 0.5998 -0.5226 -0.5912 -0.2432 0.7127 -0.6925 -0.2299 0.1683
## gear 0.4803 -0.4927 -0.5556 -0.1257 0.6996 -0.5833 -0.2127 0.2060
## carb -0.5509 0.5270 0.3950 0.7498 -0.0908 0.4276 -0.6562 -0.5696
## am gear carb
## mpg 0.5998 0.4803 -0.5509
## cyl -0.5226 -0.4927 0.5270
## disp -0.5912 -0.5556 0.3950
## hp -0.2432 -0.1257 0.7498
## drat 0.7127 0.6996 -0.0908
## wt -0.6925 -0.5833 0.4276
## qsec -0.2299 -0.2127 -0.6562
## vs 0.1683 0.2060 -0.5696
## am 1.0000 0.7941 0.0575
## gear 0.7941 1.0000 0.2741
## carb 0.0575 0.2741 1.0000
2. Visualize histogram, scatter plot with fitted curve and Kendall’s correlation coefficient for mtcars in a same matrix- graph. Make sure the data on the graph is readable, especially the correlation coefficient (up to two decimal places are enough to show).
library(psych)
pairs.panels(dat, method = "kendall", density = FALSE, ellipses = FALSE)