For all of the applicable problems, plots must be generated using the functions within the ggplot2 package.
Problem 1. Due to the constraint in size and space, laptops tend to have a slower processing speed than desktops, but, depending on the task that needs to be run, this may or may not be an issue. Data was collected on the data processing time (seconds) for a machine learning model between 10 randomly selected Dell Inspiron laptops, each with an Intel Core i5-1135G7, and 10 randomly selected Dell XPS desktops, each with an Intel Core i7-12700K, recording the data in the attached “processing.csv” file. Against a significance level of α = 0.05, those running the model want to determine if there is a noticeable difference in the processing time between the laptops and desktops, particularly if the laptops are slower than the desktops.
Before running the t-test, we first need to determine whether the two groups, “Desktop” and “Laptop”, have equal variances. To do this we can run an F-test at a significance level of α = 0.05, where the null hypothesis states that the true ratio of the group variances is equal to 1 (that is, the groups have equal variances) and the alternative hypothesis states that the true ratio of variances is not equal to 1.
var.test(Time ~ Computer, data = processing)
##
## F test to compare two variances
##
## data: Time by Computer
## F = 1.1786, num df = 9, denom df = 9, p-value = 0.8107
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.2927405 4.7449217
## sample estimates:
## ratio of variances
## 1.178571
Since the F-test p-value of 0.8107 is above the significance level of 0.05, we fail to reject the null hypothesis. There is no significant evidence that the variances differ, so we proceed under the assumption that the two groups have equal variances.
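To make explicit what var.test() computes, the sketch below reproduces the F statistic and the two-sided p-value by hand. Because the contents of processing.csv are not reproduced in this report, the two vectors are hypothetical placeholder values, not the assignment data, so the numbers will not match the output above.

```r
# Hypothetical processing times (seconds); NOT the values in processing.csv.
desktop <- c(52, 55, 58, 54, 57, 53, 56, 55, 59, 52)
laptop  <- c(68, 72, 70, 69, 73, 71, 67, 70, 74, 70)

# F statistic: ratio of the two sample variances.
f_manual <- var(desktop) / var(laptop)

# Degrees of freedom are n - 1 for each group.
df1 <- length(desktop) - 1
df2 <- length(laptop) - 1

# Two-sided p-value: double the smaller of the two tail probabilities.
lower_tail <- pf(f_manual, df1, df2)
p_manual   <- 2 * min(lower_tail, 1 - lower_tail)

# Compare with the built-in test; the two columns should agree.
vt <- var.test(desktop, laptop)
c(manual_F = f_manual, builtin_F = unname(vt$statistic))
c(manual_p = p_manual, builtin_p = vt$p.value)
```

With the formula interface used above, the ratio is taken in factor-level order, so Desktop is the numerator and Laptop the denominator.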
Now we can continue with the t-test. Those running the model want to determine whether there is a noticeable difference in processing time between the laptops and desktops, particularly whether the laptops are slower than the desktops, so the null and alternative hypotheses are as follows:
Null Hypothesis: μDesktop − μLaptop = 0
Alternative Hypothesis: μDesktop − μLaptop < 0
α = 0.05
t.test(Time ~ Computer, data = processing,
alternative = "less", var.equal = TRUE)
##
## Two Sample t-test
##
## data: Time by Computer
## t = -10.23, df = 18, p-value = 3.142e-09
## alternative hypothesis: true difference in means between group Desktop and group Laptop is less than 0
## 95 percent confidence interval:
## -Inf -12.70662
## sample estimates:
## mean in group Desktop mean in group Laptop
## 55.1 70.4
Since the p-value of 3.142e-09 is far below the significance level of 0.05, we reject the null hypothesis.
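As a rough check on the t-test output, the p-value and the upper confidence bound can be recovered from the printed summary values alone. This is only a sketch: it uses the rounded t statistic as printed, so the results should agree with the output above approximately rather than exactly.

```r
# Values copied from the t.test() output above.
mean_desktop <- 55.1
mean_laptop  <- 70.4
t_stat <- -10.23   # rounded, as printed
df     <- 18       # n1 + n2 - 2 = 10 + 10 - 2

# The standard error of the difference is implied by t = difference / SE.
diff_means <- mean_desktop - mean_laptop
se_diff    <- diff_means / t_stat

# One-sided (lower-tail) p-value for H1: mu_Desktop - mu_Laptop < 0.
p_value <- pt(t_stat, df)

# Upper bound of the one-sided 95% confidence interval (-Inf, upper_bound).
upper_bound <- diff_means + qt(0.95, df) * se_diff

p_value
upper_bound
```

The upper bound lies well below 0, which matches the decision to reject the null hypothesis at α = 0.05.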
Since in part (a) we rejected the null hypothesis, there is strong enough statistical evidence to say that, based on the sample, Laptops on average have larger processing times than Desktops and are thus slower at running the model than Desktops.
As seen in the box plots below, there is no overlap in the sample distributions and the distribution of Laptop processing times is completely higher than the distribution of Desktop times. This supports both the results of the hypothesis test in part (a) and the conclusion drawn in part (b) as, according to the distribution of the sample groups, Laptops have larger processing times than Desktops and are thus slower at running the model.
library(ggplot2)
ggplot(processing, aes(x = Computer, y = Time, fill = Computer)) +
geom_boxplot(alpha = 0.75) +
scale_fill_brewer(palette = "Blues") +
labs(title = "Boxplot Comparing Processing Times of Laptops and Desktops",
x = "Computer Type",
y = "Processing Time (seconds)") +
coord_flip()
Problem 2. First, take a few minutes to look through the components of the built-in R data set “chickwts,” which contains the results of an experiment conducted to measure the effectiveness of 6 different feed supplements on the growth rate (measured by weight) of chickens. Then complete the following:
chickwts<-chickwts
ggplot(chickwts, aes(x = weight, y = feed, fill = feed)) +
geom_boxplot(alpha = 0.75) +
scale_fill_brewer(palette = "Oranges") +
labs(title = "The Effectiveness of Feed Supplements on the Growth Rate of Chickens",
x = "Chick Weight",
y = "Feed Supplement")
The boxplots indicate general differences in chicken weight among the feed types. The casein and sunflower feeds appear to produce higher median weights than the other groups, while the horsebean feed shows the lowest median weight. The remaining feed types (linseed, soybean, and meatmeal) fall within an intermediate range. Considerable overlap is observed among the distributions, suggesting that while some differences in chick growth are visible, the overall spread of weights across feed types is fairly broad.
Check and interpret the following one-way ANOVA assumptions:
• Normality using both the Normality QQ Plot and Shapiro-Wilk Test, remembering to check this assumption for each feed type.
• Equal Variances using Bartlett’s Test or Levene’s Test - choose which is best!
First we will evaluate normality, as this determines which test we use to check the equality of variances among the groups. If no group shows significant evidence of non-normality, we will use Bartlett’s Test, because it is more powerful for normally distributed data. If at least one group is not normally distributed, we will instead use Levene’s Test: although it is less powerful, it is not sensitive to departures from normality and can still evaluate the equality of variances among the feed groups.
ggplot(chickwts, aes(sample = weight, colour = feed, shape = feed)) +
geom_qq() +
geom_qq_line() +
labs(title = "Normal Q–Q Plot of Chicken Weights by Feed Type",
x = "Theoretical Quantiles",
y = "Sample Quantiles") +
scale_colour_brewer(palette = "Set2")
with(chickwts, shapiro.test(weight[feed == "horsebean"]))
##
## Shapiro-Wilk normality test
##
## data: weight[feed == "horsebean"]
## W = 0.93758, p-value = 0.5264
with(chickwts, shapiro.test(weight[feed == "linseed"]))
##
## Shapiro-Wilk normality test
##
## data: weight[feed == "linseed"]
## W = 0.96931, p-value = 0.9035
with(chickwts, shapiro.test(weight[feed == "soybean"]))
##
## Shapiro-Wilk normality test
##
## data: weight[feed == "soybean"]
## W = 0.9464, p-value = 0.5064
with(chickwts, shapiro.test(weight[feed == "sunflower"]))
##
## Shapiro-Wilk normality test
##
## data: weight[feed == "sunflower"]
## W = 0.92809, p-value = 0.3603
with(chickwts, shapiro.test(weight[feed == "meatmeal"]))
##
## Shapiro-Wilk normality test
##
## data: weight[feed == "meatmeal"]
## W = 0.97914, p-value = 0.9612
with(chickwts, shapiro.test(weight[feed == "casein"]))
##
## Shapiro-Wilk normality test
##
## data: weight[feed == "casein"]
## W = 0.91663, p-value = 0.2592
All feed groups produced p-values from the Shapiro-Wilk normality test greater than the significance level of 0.05 (horsebean: 0.5264, linseed: 0.9035, soybean: 0.5064, sunflower: 0.3603, meatmeal: 0.9612, casein: 0.2592), so the null hypothesis of normality cannot be rejected for any group. There is therefore no significant evidence against normality, and it is reasonable to treat the distribution of chicken weights within each feed type as nearly normal. This is further supported by the Normal Q-Q plot, in which the points for each group fall roughly along a straight line.
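The six separate Shapiro-Wilk calls can also be condensed into a single step, and the choice between Bartlett’s and Levene’s Test can be written as an explicit rule. The following is a minimal sketch using the built-in chickwts data; the names sw_p, alpha, and variance_test are illustrative only.

```r
# Shapiro-Wilk p-value for each feed type in the built-in chickwts data.
sw_p <- tapply(chickwts$weight, chickwts$feed,
               function(w) shapiro.test(w)$p.value)
round(sw_p, 4)

# Decision rule described above: Bartlett's Test only if no group
# rejects normality at the chosen significance level.
alpha <- 0.05
if (all(sw_p > alpha)) {
  variance_test <- "Bartlett"
} else {
  variance_test <- "Levene"
}
variance_test
```

If the rule selected Levene’s Test, one option would be leveneTest() from the car package, which is not part of base R and would need to be installed separately.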
Since none of the groups showed significant evidence of non-normality, we can use Bartlett’s Test to assess the equality of the variances among the groups.
bartlett.test(weight ~ feed, data = chickwts)
##
## Bartlett test of homogeneity of variances
##
## data: weight by feed
## Bartlett's K-squared = 3.2597, df = 5, p-value = 0.66
Since the p-value of 0.66 is greater than the significance level of 0.05, we fail to reject the null hypothesis of Bartlett’s Test, which is that the group variances are equal. There is no significant evidence that the variances differ, so they can reasonably be assumed to be equal.
Null hypothesis: The mean chicken weight is the same across all feed types.
Alternative hypothesis: At least one feed type has a different mean weight.
α = 0.05
chick_aov <- aov(weight ~ feed, data = chickwts)
summary(chick_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## feed 5 231129 46226 15.37 5.94e-10 ***
## Residuals 65 195556 3009
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value of 5.94e-10 is much less than the significance level of 0.05 we reject the null hypothesis. There is strong enough statistical evidence to say that at least one feed type has a different mean weight.
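To show where the entries of the ANOVA table come from, the sums of squares and the F statistic can be rebuilt from group sizes and means. This is a sketch on the built-in chickwts data; the variable names are illustrative.

```r
# Group sizes, group means, and the grand mean.
n_i    <- tapply(chickwts$weight, chickwts$feed, length)
mean_i <- tapply(chickwts$weight, chickwts$feed, mean)
grand  <- mean(chickwts$weight)

# Between-group sum of squares (the "feed" row).
ss_between <- sum(n_i * (mean_i - grand)^2)

# Within-group sum of squares (the "Residuals" row);
# ave() returns each observation's own group mean.
ss_within <- sum((chickwts$weight - ave(chickwts$weight, chickwts$feed))^2)

# Degrees of freedom: k - 1 and N - k.
k <- nlevels(chickwts$feed)
N <- nrow(chickwts)
df_between <- k - 1
df_within  <- N - k

# F statistic is the ratio of the two mean squares.
f_stat  <- (ss_between / df_between) / (ss_within / df_within)
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

c(SS_feed = ss_between, SS_resid = ss_within, F = f_stat, p = p_value)
```

The results should agree with the summary(chick_aov) table up to rounding.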
1) Does your plot in part (a) support this result? Why or why not? Explain your reasoning
Yes, the plot in part (a) supports the results of the ANOVA. The plot suggests that the casein and sunflower feeds produce higher median weights than the other groups, while the horsebean feed shows the lowest median weight. This is consistent with the ANOVA finding that at least one group mean differs significantly from the others.
2) Based on your assumption checks in part (b), is it likely the one-way ANOVA results are valid? Why or why not? Explain your reasoning
The Normal Q-Q plots and the Shapiro-Wilk tests gave no significant evidence against normality in any group, and Bartlett’s test of homogeneity of variances gave no significant evidence that the group variances differ, so both assumptions checked in part (b) are satisfied. We can reasonably assume that the data in each group are nearly normally distributed and that the group variances are equal, so the results of the one-way ANOVA are most likely valid. If either assumption had failed, the validity of the ANOVA could be called into question, but since both checks passed, we can trust the results.
TukeyHSD(chick_aov, conf.level = 0.95)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = weight ~ feed, data = chickwts)
##
## $feed
## diff lwr upr p adj
## horsebean-casein -163.383333 -232.346876 -94.41979 0.0000000
## linseed-casein -104.833333 -170.587491 -39.07918 0.0002100
## meatmeal-casein -46.674242 -113.906207 20.55772 0.3324584
## soybean-casein -77.154762 -140.517054 -13.79247 0.0083653
## sunflower-casein 5.333333 -60.420825 71.08749 0.9998902
## linseed-horsebean 58.550000 -10.413543 127.51354 0.1413329
## meatmeal-horsebean 116.709091 46.335105 187.08308 0.0001062
## soybean-horsebean 86.228571 19.541684 152.91546 0.0042167
## sunflower-horsebean 168.716667 99.753124 237.68021 0.0000000
## meatmeal-linseed 58.159091 -9.072873 125.39106 0.1276965
## soybean-linseed 27.678571 -35.683721 91.04086 0.7932853
## sunflower-linseed 110.166667 44.412509 175.92082 0.0000884
## soybean-meatmeal -30.480519 -95.375109 34.41407 0.7391356
## sunflower-meatmeal 52.007576 -15.224388 119.23954 0.2206962
## sunflower-soybean 82.488095 19.125803 145.85039 0.0038845
Based on the Tukey HSD results, 8 of the 15 pairwise comparisons showed statistically significant differences at the 0.05 level. Chickens fed casein had significantly higher mean weights than those fed horsebean (p < 0.0001), linseed (p = 0.0002), and soybean (p = 0.0084). Chickens fed sunflower also had significantly higher mean weights than those fed horsebean (p < 0.0001), linseed (p = 0.0001), and soybean (p = 0.0039). In addition, horsebean had a significantly lower mean weight than meatmeal (p = 0.0001) and soybean (p = 0.0042). None of the comparisons among meatmeal, linseed, and soybean were significant (all p > 0.05), and casein, sunflower, and meatmeal did not differ significantly from one another. Overall, the pairwise comparisons suggest that casein and sunflower feeds are associated with the highest chicken weights, while horsebean is associated with the lowest.
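Rather than reading the significant pairs off the table by eye, they can be filtered programmatically. The sketch below refits the model so that it runs on its own; tk and sig are illustrative names.

```r
# Refit the one-way ANOVA on the built-in chickwts data.
chick_aov <- aov(weight ~ feed, data = chickwts)

# TukeyHSD() returns a list; the $feed element is a matrix with
# columns "diff", "lwr", "upr", and "p adj".
tk <- TukeyHSD(chick_aov, conf.level = 0.95)$feed

# Keep only the pairs whose adjusted p-value is below 0.05,
# sorted from most to least significant.
sig <- tk[tk[, "p adj"] < 0.05, , drop = FALSE]
sig <- sig[order(sig[, "p adj"]), c("diff", "p adj"), drop = FALSE]

nrow(sig)
round(sig, 4)
```

The sign of diff shows the direction of each difference: for a row named A-B, a negative value means group A has the lower mean weight.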