M1 Project Report
ALY6015_71821:Intermediate Analytics
SEC_09_Fall_2023_CPS
Northeastern University
Professor: Vladimir Shapiro

By: Zeeshan Ahmad Ansari

Date of Submission: 13 November, 2023


Library

#The report utilizes a set of libraries for various data processing and visualization tasks.

library(tidyverse)
library(readxl)
library(dplyr)
library(readr)
library(kableExtra)
library(corrplot)
library(car)
library(knitr)
library(DescTools)
library(psych)

Introduction

In this comprehensive statistical analysis report, we delve into various hypothesis tests and analysis of variance (ANOVA) techniques using the R programming language. The assignment is designed to apply our understanding of chi-square and ANOVA testing to real-world scenarios, aligning with key learning outcomes of the course. Through a series of tasks, we explore the goodness of fit, independence of variables, homogeneity of proportions, and differences among means.

The journey begins with investigating the distribution of blood types in a hospital compared to the general population, followed by scrutinizing airline on-time performance and movie admissions based on ethnicity. Additionally, we explore the relationship between military rank and branch, sodium content differences in various foods, and variations in sales among leading companies and per-pupil expenditures across different regions.

The assignment culminates in a detailed examination of a gardening experiment aiming to enhance plant growth through a two-way ANOVA test. Further, a self-directed exploration in R involves analyzing baseball data to determine if the number of wins differs by decade. The analysis concludes with a Two-way ANOVA test investigating the impact of fertilizer and density on crop yield.

Through hypothesis testing and ANOVA techniques, this report aims to draw meaningful conclusions and insights, providing a comprehensive application of statistical methodologies in real-world scenarios.

Analysis

The chi-square test compares observed results to expected results based on a hypothesis. It helps determine if the difference between actual and predicted data is simply due to chance, or if there is a relationship between the variables being studied. As such, the chi-square test is an excellent statistical tool for understanding potential connections between categorical variables (University of Southampton, n.d.). It can be used to analyze frequency distributions, test independence between two variables, and assess homogeneity of proportions (Bluman, 2015).

ANOVA is another statistical technique used to study variability in a continuous response variable under different conditions defined by classification factors. It evaluates equality of means by comparing variation between groups to variation within groups. ANOVA is widely used because it allows analyzing the impact of categorical independent variables on a continuous dependent variable. It helps determine if changes in the independent categorical variables lead to significant differences in the continuous response variable (Larson, 2008).
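
As a compact reference for the two test statistics described above, the standard textbook forms are sketched below. The notation is our own shorthand rather than something taken from the cited sources: \(O_i\) and \(E_i\) are the observed and expected counts in category \(i\), \(k\) is the number of categories or groups, \(N\) is the total sample size, and \(SS_B\), \(SS_W\) are the between-group and within-group sums of squares.

```latex
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i},
\qquad
F = \frac{MS_B}{MS_W} = \frac{SS_B / (k - 1)}{SS_W / (N - k)}
```

For a goodness-of-fit test the chi-square statistic has \(k - 1\) degrees of freedom; for an independence test on a table with \(r\) rows and \(c\) columns it has \((r - 1)(c - 1)\).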


Task 1 (Section 11-1 6. Blood Types)

A medical researcher wishes to see if hospital patients in a large hospital have the same blood type distribution as those in the general population. The distribution for the general population is as follows:

  • Type A = 20%

  • Type B = 28%

  • Type O = 36%

  • Type AB = 16%.

He selects a random sample of 50 patients and finds the following:

  • Type A = 12

  • Type B = 8

  • Type O = 24

  • Type AB = 6

At α = 0.10, can it be concluded that the distribution is the same as that of the general population?

############Distribution
Type_A = 0.20
Type_B = 0.28
Type_O = 0.36 
Type_AB = 0.16

blood_prob = c(Type_A, Type_B, Type_O, Type_AB)

#number of observations
n = 50

##############Expected Count

TypeA_exp = Type_A * n
TypeB_exp = Type_B * n
TypeO_exp = Type_O * n
TypeAB_exp = Type_AB * n

############Observed count
TypeA_obs = 12
TypeB_obs = 8
TypeO_obs = 24
TypeAB_obs = 6

Solution

R Wrapped \(\chi^2\) Test

The vectors of expected counts and observed frequencies are as follows:

exp_blood_ct = c(TypeA_exp, TypeB_exp, TypeO_exp, TypeAB_exp)
obs_blood_ct = c(TypeA_obs, TypeB_obs, TypeO_obs, TypeAB_obs)
  1. State the hypotheses and identify the claim.
  • Null hypothesis \(\it{H_{0}}\): \({ 𝑃(A)=0.20, 𝑃(B)=0.28, 𝑃(O)=0.36, 𝑃(AB)=0.16 }\)
  • Alternative hypothesis \(\it{H_{1}}\): The distribution is not the same as stated in the null hypothesis.
  1. Find the critical value.
alpha_t1 = 0.1
conf_level_t1 = 1- alpha_t1

critical_value_t1 <- qchisq(1 - alpha_t1, df = length(obs_blood_ct) - 1)
critical_value_t1
## [1] 6.251389
  1. Compute the test value.
######################X^2 value

ch_sqrt1 = chisq.test(x = obs_blood_ct, p = blood_prob)
ch_sqrt1
## 
##  Chi-squared test for given probabilities
## 
## data:  obs_blood_ct
## X-squared = 5.4714, df = 3, p-value = 0.1404
  1. Make the decision.
T1_testresult = ifelse(ch_sqrt1$p.value > alpha_t1, "Do not Reject the Null Hypothesis (Pvalue > alpha)","Reject the Null Hypothesis (Pvalue <= alpha)")

T1_testresult
## [1] "Do not Reject the Null Hypothesis (Pvalue > alpha)"

Table showing Chi Square test

##########TABLE FORMAT

blood_count = cbind(exp_blood_ct, obs_blood_ct)
blood_type = c("Blood Type A", "Blood Type B", "Blood Type O", "Blood Type AB")
col_n = c("Expected", "Observed")
dimnames(blood_count) = list(blood_type,col_n)

T1_table = rbind(conf_level_t1*100, alpha_t1,
                 round(ch_sqrt1$statistic,3),     
                 round(ch_sqrt1$p.value,3),
                 T1_testresult)
col_n1 = c("Hypothesis Result")
row_n1 = c("Confidence Level(%):", 
           "Alpha:", 
           "X^2 Value:",
           "p value:",
           "Hypothesis Result:")
dimnames(T1_table) = list(row_n1,col_n1)

kable(list(blood_count,T1_table),
      caption = "<center>Chi-Square Goodness of Fit Test</center>",align = "c",
      booktabs = TRUE)%>%
kable_styling(bootstrap_options = c("bordered"),
                                  font_size = 12) %>%

footnote(general = "H0:P(A)=0.20, P(B)=0.28,P(O)=0.36,P(AB)=0.16 and H1: The distribution is not the same as stated in the null hypothesis \n")
Chi-Square Goodness of Fit Test

                 Expected   Observed
Blood Type A     10         12
Blood Type B     14         8
Blood Type O     18         24
Blood Type AB    8          6

                       Hypothesis Result
Confidence Level(%):   90
Alpha:                 0.1
X^2 Value:             5.471
p value:               0.14
Hypothesis Result:     Do not Reject the Null Hypothesis (Pvalue > alpha)

Note:
H0:P(A)=0.20, P(B)=0.28,P(O)=0.36,P(AB)=0.16 and H1: The distribution is not the same as stated in the null hypothesis

Observations to Task_1:

In this task, we have conducted a test to assess the proportions of different blood groups. We have been provided with a sample of 50 patients and compared the observed blood samples with the expected distribution to investigate whether the proportions of blood types match those in the population. Since this is a proportion test, we opted for the Chi-Square Goodness-of-Fit test.

The expected population proportions are as follows: Blood group A (20%), Blood group B (28%), Blood group O (36%), and Blood group AB (16%). The objective of hypothesis testing was to determine if the expected proportions align with the observed blood group distribution.

The null hypothesis posits that the blood type proportions among the hospital's patients match the general-population proportions, while the alternative hypothesis states that they differ. The Chi-Square Goodness-of-Fit test was conducted using chisq.test(), yielding a chi-square statistic of X^2 = 5.471 and Pvalue = 0.1403575.

As the p-value exceeds the significance level (alpha) of 0.1, I do not reject the null hypothesis. There is not sufficient evidence to conclude that the blood type distribution of the hospital's patients differs from that of the general population.
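
To make the arithmetic behind the reported statistic visible, the goodness-of-fit value can also be rebuilt by hand from the observed counts and hypothesized proportions given in the task. This is a minimal sketch; the variable names (`observed`, `probs`, `chi_sq`, and so on) are illustrative and are not part of the report's own code.

```r
# Observed counts and hypothesized proportions from Task 1
observed <- c(A = 12, B = 8, O = 24, AB = 6)
probs    <- c(A = 0.20, B = 0.28, O = 0.36, AB = 0.16)

# Expected counts under H0: sample size times each hypothesized proportion
expected <- sum(observed) * probs

# Chi-square statistic: sum of (O - E)^2 / E over the four blood types
chi_sq <- sum((observed - expected)^2 / expected)

# Goodness-of-fit degrees of freedom: number of categories minus one
df <- length(observed) - 1

# Upper-tail p-value and the critical value at alpha = 0.10
p_value  <- pchisq(chi_sq, df = df, lower.tail = FALSE)
critical <- qchisq(1 - 0.10, df = df)

round(c(statistic = chi_sq, critical = critical, p_value = p_value), 4)
```

The hand-computed statistic should agree with the chisq.test() result reported above, and it falls below the critical value, which is the same decision reached with the p-value.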


Task 2 (Section 11-1. 8. On-Time Performance by Airlines)

According to the Bureau of Transportation Statistics, on-time performance by the airlines is described as follows:

Action                                             % of Time
On time                                            70.8
National Aviation System delay                     8.2
Aircraft arriving late                             9.0
Other (because of weather and other conditions)    12.0

Records of 200 randomly selected flights for a major airline company showed that 125 planes were on time; 40 were delayed because of weather, 10 because of a National Aviation System delay, and the rest because of arriving late. At α = 0.05, do these results differ from the government’s statistics?

#########Distribution
On_time = 0.708
NA_System_delay = 0.082
arriving_late = 0.09 
other = 0.12

Flight_prob = c(On_time, NA_System_delay, arriving_late, other)
n1 = 200

########Expected Count

On_time_Exp = On_time * n1
NA_System_delay_Exp = NA_System_delay * n1
arriving_late_Exp = arriving_late * n1
othr_Exp = other * n1

#########Observed count
On_time_obs = 125
NA_System_delay_obs = 10
arriving_late_obs = 25
othr_obs = 40

Solution

R Wrapped \(\chi^2\) Test

The vectors of expected flight counts (rounded to whole flights for display) and observed flight counts are as follows:

Exp_Flight_count = round(c(On_time_Exp,
                    NA_System_delay_Exp,
                    arriving_late_Exp,
                    othr_Exp),0)

Obs_flight_count = c(On_time_obs,
                    NA_System_delay_obs,
                    arriving_late_obs,
                    othr_obs)
  1. State the hypotheses and identify the claim.
  • Null hypothesis \(\it{H_{0}}\): \({ P(OT)=0.708, P(SD)=0.082, P(AL)=0.09, P(O)=0.12 }\)
  • Alternative hypothesis \(\it{H_{1}}\): The distribution is not the same as stated in the null hypothesis.
  1. Find the critical value.
alpha_t2 = 0.05
conf_level_t2 = 1- alpha_t2
critical_value_t2 <- qchisq(1 - alpha_t2, df = length(Obs_flight_count) - 1)
critical_value_t2
## [1] 7.814728
  1. Compute the test value.
ch_sqrt_t2 = chisq.test(x=Obs_flight_count, p=Flight_prob)
  1. Make the decision.
T2_testresult_2 = ifelse(ch_sqrt_t2$p.value > alpha_t2, "Do not Reject the Null Hypothesis (Pvalue > alpha)","Reject the Null Hypothesis (Pvalue <= alpha)")

T2_testresult_2
## [1] "Reject the Null Hypothesis (Pvalue <= alpha)"

Table showing Chi Square test

flight_count = cbind(Exp_Flight_count,
                    Obs_flight_count)
record_type = c("On time", 
               "National Aviation System delay",
               "Aircraft arriving late",
               "Other (because of weather and other conditions)")
col_n_1 = c("Expected","Observed")
dimnames(flight_count) = list(record_type,col_n_1)


T2_1hypores_1 = rbind(conf_level_t2*100, alpha_t2, round(ch_sqrt_t2$statistic,3), round(ch_sqrt_t2$p.value,6), T2_testresult_2)
col_n1_1 = c("Hypothesis Result")
row_n1_1 = c("Confidence Level(%):", "Alpha:", "X^2 Value:","p value:","Hypothesis Result:")
dimnames(T2_1hypores_1) = list(row_n1_1,col_n1_1)


kable(
  list(flight_count,T2_1hypores_1),
        caption = "<center>Chi-Square Goodness-of-Fit Test</center>",
        align = "c",
       booktabs = TRUE)%>%
    kable_styling(bootstrap_options = c("hover",
                                        "bordered"),
                font_size = 12) %>%
  footnote(general = "𝐻0:P_OT=0.708,P_SD=0.082,P_AL=0.09,P_O=0.12 and \n H1: The distribution is not the same as stated in the null hypothesis")
Chi-Square Goodness-of-Fit Test

                                                   Expected   Observed
On time                                            142        125
National Aviation System delay                     16         10
Aircraft arriving late                             18         25
Other (because of weather and other conditions)    24         40

                       Hypothesis Result
Confidence Level(%):   95
Alpha:                 0.05
X^2 Value:             17.832
p value:               0.000476
Hypothesis Result:     Reject the Null Hypothesis (Pvalue <= alpha)

Note:
𝐻0:P_OT=0.708,P_SD=0.082,P_AL=0.09,P_O=0.12 and
H1: The distribution is not the same as stated in the null hypothesis

Observations to Task_2:

In this task, we have examined the punctuality of airlines using data from the Bureau of Transportation Statistics. The government's figures give the expected distribution of flights as follows: On time (70.8%), National Aviation System delay (8.2%), Aircraft arriving late (9.0%), and Other (attributed to weather and other conditions, 12.0%). These were compared with the records of 200 randomly selected flights from a major airline.

Upon conducting the chisq.test(), we found sufficient evidence to reject the null hypothesis, given that the Pvalue = 4.7625874 × 10^{-4}, which is ≤ alpha (α = 0.05). This indicates that the airline's on-time performance distribution differs significantly from the government-provided statistics.
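
To see which categories drive the rejection, the statistic can be split into per-category contributions using the `expected` and `residuals` components that chisq.test() returns. This is a sketch under the task's stated counts; the category labels and object names (`observed`, `fit`, `contrib`) are illustrative choices, not names from the report.

```r
# Observed flight counts and government proportions from Task 2
observed <- c(on_time = 125, nas_delay = 10, late_aircraft = 25, other = 40)
probs    <- c(0.708, 0.082, 0.090, 0.120)

fit <- chisq.test(x = observed, p = probs)

# Pearson residuals are (O - E) / sqrt(E); squaring them gives each
# category's contribution to the chi-square statistic
contrib <- fit$residuals^2

# Unrounded expected counts next to each category's contribution
round(rbind(expected = fit$expected, contribution = contrib), 3)

# The contributions add up to the reported test statistic
sum(contrib)
```

Under these counts the "other" (weather and other conditions) category accounts for the largest share of the statistic, and all expected counts are at least 5, which is the usual rule of thumb for the chi-square approximation.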


Task 3 (Section 11-2.8. Ethnicity and Movie Admissions )

Are movie admissions related to ethnicity? A 2014 study indicated the following numbers of admissions (in thousands) for two different years.

Year   Caucasian   Hispanic   African American   Other
2013   724         335        174                107
2014   370         292        152                140

At the 0.05 level of significance, can it be concluded that movie attendance by year was dependent upon ethnicity?

#Data presented in Matrix

r1 = c(724, 335, 174, 107)
r2 = c(370, 292, 152, 140)

#row count
rowno = 2

#matrix 

matrix1_2 = matrix(c(r1, r2), nrow = rowno, byrow = TRUE)

Solution

R Wrapped \(\chi^2\) Test

rownames(matrix1_2) = c("2013","2014")
colnames(matrix1_2) = c("Caucasian", "Hispanic", "African_American", "Other")
  1. State the hypotheses and identify the claim.
  • Null hypothesis \(\it{H_{0}}\): Movie attendance by year was independent of ethnicity.
  • Alternative hypothesis \(\it{H_{1}}\): Movie attendance by year was dependent upon ethnicity.
  1. Find the critical value.
alpha_t3 = 0.05
conf_level_t3 = 1- alpha_t3
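# Independence test: df = (rows - 1) * (columns - 1), not (number of cells - 1)
critical_value_t3 <- qchisq(1 - alpha_t3, df = (nrow(matrix1_2) - 1) * (ncol(matrix1_2) - 1))
critical_value_t3
## [1] 7.814728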
  1. Compute the test value.
ch_sqrt1_3 = chisq.test(matrix1_2)
  1. Make the decision.
T3_testresult_3 = ifelse(ch_sqrt1_3$p.value > alpha_t3,"Do not Reject the Null Hypothesis (Pvalue > alpha)","Reject the Null Hypothesis (Pvalue <= alpha)") 

T3_testresult_3
## [1] "Reject the Null Hypothesis (Pvalue <= alpha)"

Table showing Chi Square test

T3_hypores_3 = rbind(conf_level_t3*100, alpha_t3, round(ch_sqrt1_3$statistic,3), round(ch_sqrt1_3$p.value,14), T3_testresult_3)
col_n1_2 = c("Hypothesis Result")
row_n1_2 = c("Confidence Level(%):", "Alpha:", "X^2 Value:","p value:","Hypothesis Result:")
dimnames(T3_hypores_3) = list(row_n1_2,col_n1_2)

kable(
  list(matrix1_2,T3_hypores_3),
        caption = "<center>Chi-Square Independence Test</center>",
        align = "c",
       booktabs = TRUE)%>%
    kable_styling(bootstrap_options = c("hover",
                                        "bordered"),
                font_size = 11) %>%
  footnote(general = "H0: Movie attendance by year was independent of ethnicity and \n H1: Movie attendance by year was dependent upon ethnicity")
Chi-Square Independence Test

       Caucasian   Hispanic   African_American   Other
2013   724         335        174                107
2014   370         292        152                140

                       Hypothesis Result
Confidence Level(%):   95
Alpha:                 0.05
X^2 Value:             60.144
p value:               5.5e-13
Hypothesis Result:     Reject the Null Hypothesis (Pvalue <= alpha)

Note:
H0: Movie attendance by year was independent of ethnicity and
H1: Movie attendance by year was dependent upon ethnicity

Observations to Task_3:

For this assignment, we analyzed movie attendance data spanning two years, including demographic information about the attendees. The hypothesis under consideration posited that movie attendance by year was contingent on ethnicity. To investigate this, we used the Chi-Square Independence test.

Upon executing the chisq.test(), I obtained Pvalue = 5.4775074 × 10^{-13}, which is below the significance level (α = 0.05). Consequently, we reject the null hypothesis. There is sufficient evidence to support the claim that movie attendance by year is dependent on ethnicity.
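
For a contingency table with r rows and c columns, the independence test uses (r - 1)(c - 1) degrees of freedom, and chisq.test() reports this value in its `parameter` component. The sketch below rebuilds the admissions table from the task and checks the degrees of freedom and the matching critical value; the object names (`admissions`, `df_indep`, `fit`) are illustrative only.

```r
# Admissions (in thousands) by year and ethnicity, from Task 3
admissions <- matrix(c(724, 335, 174, 107,
                       370, 292, 152, 140),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("2013", "2014"),
                                     c("Caucasian", "Hispanic",
                                       "African_American", "Other")))

# Degrees of freedom for an independence test: (rows - 1) * (columns - 1)
df_indep <- (nrow(admissions) - 1) * (ncol(admissions) - 1)

# Critical value at alpha = 0.05 for that df
critical <- qchisq(1 - 0.05, df = df_indep)

fit <- chisq.test(admissions)

# The df computed by hand should match the df used by chisq.test()
c(df_by_hand = df_indep,
  df_from_test = unname(fit$parameter),
  critical = critical,
  statistic = unname(fit$statistic))
```

With a 2 × 4 table this gives 3 degrees of freedom, and the test statistic is far above the corresponding critical value, in line with the p-value decision.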


Task 4 (Section 11-2.10 Women in the Military)

This table lists the numbers of officers and enlisted personnel for women in the military.

               Officers   Enlisted
Army           10,791     62,491
Navy           7,816      42,750
Marine Corps   932        9,525
Air Force      11,819     54,344

At α = 0.05, is there sufficient evidence to conclude that a relationship exists between rank and branch of the Armed Forces?

#Data presented in Matrix

rm1 = c(10791, 62491)
rm2 = c(7816, 42750)
rm3 = c(932, 9525)
rm4 = c(11819, 54344)

#row count
rowno_2 = 4

#matrix 

matrixt4 = matrix(c(rm1, rm2, rm3, rm4), nrow = rowno_2, byrow = TRUE)

Solution

R Wrapped \(\chi^2\) Test

rownames(matrixt4) = c("Army","Navy","Marine_Corps", "Air_Force")
colnames(matrixt4) = c("Officers", "Enlisted")
  1. State the hypotheses and identify the claim.
  • Null hypothesis \(\it{H_{0}}\): Rank achieved is independent of branch of the Armed Forces.
  • Alternative hypothesis \(\it{H_{1}}\): Rank achieved is dependent on branch of the Armed Forces.
  1. Find the critical value.
alpha_t4 = 0.05
conf_level_t4 = 1- alpha_t4
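# Independence test: df = (rows - 1) * (columns - 1), not (number of cells - 1)
critical_value_t4 <- qchisq(1 - alpha_t4, df = (nrow(matrixt4) - 1) * (ncol(matrixt4) - 1))
critical_value_t4
## [1] 7.814728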
  1. Compute the test value.
ch_sqrt1_4 = chisq.test(matrixt4)
  1. Make the decision.
T4_testresult_4 = ifelse(ch_sqrt1_4$p.value > alpha_t4,"Do not Reject the Null Hypothesis (Pvalue > alpha)","Reject the Null Hypothesis (Pvalue <= alpha)") 

T4_testresult_4
## [1] "Reject the Null Hypothesis (Pvalue <= alpha)"

Table showing Chi Square test

T4_hypores_4 = rbind(conf_level_t4*100, alpha_t4, round(ch_sqrt1_4$statistic,3), round(ch_sqrt1_4$p.value,14), T4_testresult_4)
col_n4 = c("Hypothesis Result")
row_n4 = c("Confidence Level(%):", "Alpha:", "X^2 Value:","p value:","Hypothesis Result:")
dimnames(T4_hypores_4) = list(row_n4,col_n4)

kable(
  list(matrixt4,T4_hypores_4),
        caption = "<center>Chi-Square Independence Test</center>",
        align = "c",
       booktabs = TRUE)%>%
    kable_styling(bootstrap_options = c("bordered"),
                font_size = 12) %>%
  footnote(general = "H0: Rank achieved is independent of branch of the Armed Forces and \n H1: Rank achieved is dependent on branch of the Armed Forces.")
Chi-Square Independence Test

               Officers   Enlisted
Army           10791      62491
Navy           7816       42750
Marine_Corps   932        9525
Air_Force      11819      54344

                       Hypothesis Result
Confidence Level(%):   95
Alpha:                 0.05
X^2 Value:             654.272
p value:               0
Hypothesis Result:     Reject the Null Hypothesis (Pvalue <= alpha)

Note:
H0: Rank achieved is independent of branch of the Armed Forces and
H1: Rank achieved is dependent on branch of the Armed Forces.

Observations to Task_4:

For this task, we have explored the potential relationship between rank and branch within the Armed Forces using the Chi-Square Independence test. We organized the data in a matrix by rows to facilitate the Chi-Square test. The chisq.test() yielded a p-value of 1.726418 × 10^{-141}, which is far below the significance level (α = 0.05); it appears as 0 in the table only because it was rounded to 14 decimal places. Based on this result, we reject the null hypothesis: there is sufficient evidence that the attained rank is not independent of the branch of the Armed Forces. In other words, the rank achieved appears to be associated with the branch of the Armed Forces.
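
Because a very small p-value collapses to 0 when rounded to a fixed number of decimal places, it can be clearer to report it with significant digits or in scientific notation instead. The following is a sketch of that alternative using base R's `signif()` and `format()`; the object names (`women`, `fit`) are illustrative and the data are the counts from the task.

```r
# Officers and enlisted women by branch, from Task 4
women <- matrix(c(10791, 62491,
                   7816, 42750,
                    932,  9525,
                  11819, 54344),
                nrow = 4, byrow = TRUE,
                dimnames = list(c("Army", "Navy", "Marine_Corps", "Air_Force"),
                                c("Officers", "Enlisted")))

fit <- chisq.test(women)

# Rounding to a fixed number of decimals hides the value entirely
round(fit$p.value, 14)

# Keeping three significant digits preserves the magnitude
signif(fit$p.value, 3)

# A character version in scientific notation, suitable for a table cell
format(fit$p.value, digits = 3, scientific = TRUE)
```

Either of the last two forms could replace the rounded value in the summary table so that the reported p-value is not shown as exactly zero.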


Task 5 (Section 12-1.8 Sodium Contents of Foods)

The amount of sodium (in milligrams) in one serving for a random sample of three different kinds of foods is listed.

Condiments   Cereals   Desserts
270          260       100
130          220       180
230          290       250
180          290       250
80           200       300
70           320       360
200          140       300
                       160

At the 0.05 level of significance, is there sufficient evidence to conclude that a difference in mean sodium amounts exists among condiments, cereals, and desserts?

Condiments = data.frame("Sodium" = c(270, 130, 230, 180, 80, 70, 200), 
                        "Food" = rep("Condiments", 7), stringsAsFactors = FALSE)

Cereals = data.frame("Sodium" = c(260, 220, 290, 290, 200, 320, 140),
                     "Food" = rep("Cereals", 7), stringsAsFactors = FALSE)

Desserts = data.frame("Sodium" = c(100, 180, 250, 250, 300, 360, 300, 160),
                      "Food" = rep("Desserts", 8), stringsAsFactors = FALSE)

Solution

R Wrapped ANOVA (One Way) Test

Sodium = rbind(Condiments, Cereals, Desserts)
Sodium$Food = as.factor(Sodium$Food)
  1. State the hypotheses and identify the claim.
  • Null hypothesis \(\it{H_{0}}\): 𝜇1=𝜇2=𝜇3
  • Alternative hypothesis \(\it{H_{1}}\): At least one mean is different from others
  1. The alpha value
alpha_t5 = 0.05
conf_level_t5 = 1- alpha_t5
  1. Compute the test value.
##ANOVA Test######

anova_t5 = aov(Sodium ~ Food, data = Sodium)

anovasum = summary(anova_t5)
anovasum
##             Df Sum Sq Mean Sq F value Pr(>F)
## Food         2  27544   13772   2.399  0.118
## Residuals   19 109093    5742
  1. Find the critical value.
anofval = anovasum[[1]][1, "F value"]
anofval
## [1] 2.398538
DofN = anovasum[[1]][1, "Df"] #k-1
DofD = anovasum[[1]][2, "Df"] #N-k

CV2_1 = qf(alpha_t5, DofN, DofD, lower.tail = FALSE) #Critical value 

CV2_1
## [1] 3.521893
  1. Make the decision.
anovhypores_t5 = ifelse(anofval < CV2_1, "Do not reject the Null Hypothesis (Fvalue < CV)", "Reject the Null Hypothesis (Fvalue > CV)" )

anovhypores_t5
## [1] "Do not reject the Null Hypothesis (Fvalue < CV)"
sodfoodstat <- function(y, uplim = max(Sodium$Sodium) * 1.15) {
  return(data.frame(
    y = 0.95 * uplim,
    label = paste(
      "Count =", length(y), "\n",
      "Mean =", round(mean(y), 2), "\n"
    )
  ))
}

ggplot(Sodium, aes(x = Food, y = Sodium, fill = Food)) + 
  geom_boxplot() +
  stat_summary( 
               fun.data = sodfoodstat, 
               geom = "text", hjust = 0.5, vjust = 0.9, size = 2) +
  labs(
    title = "Sodium Contents of Food",
    caption = "Source: The Doctor's Pocket Calorie, Fat, and Carbohydrate Counter",
    x = "Food",
    y = "Sodium/serving (in milligrams)"
  ) +
  theme_classic()+
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.caption = element_text(size = 5)
  )

Table showing ANOVA test

T5_hypores_5 = rbind(conf_level_t5*100, alpha_t5, round(CV2_1,3), round(anofval,3), anovhypores_t5)
col_n5 = c("Hypothesis Result")
row_n5 = c("Confidence Level(%):", "Alpha:", "CV:","F value:","Hypothesis Result:")
dimnames(T5_hypores_5) = list(row_n5, col_n5)

kable(T5_hypores_5,
        caption = "<center>One way ANOVA Test</center>",
        align = "c",
       booktabs = TRUE)%>%
    kable_styling(bootstrap_options = c("bordered"),
                font_size = 12) %>%
  footnote(general = "𝐻0:𝜇1=𝜇2=𝜇3 and \n 𝐻1: At least one mean is different from others")
One way ANOVA Test

                       Hypothesis Result
Confidence Level(%):   95
Alpha:                 0.05
CV:                    3.522
F value:               2.399
Hypothesis Result:     Do not reject the Null Hypothesis (Fvalue < CV)

Note:
𝐻0:𝜇1=𝜇2=𝜇3 and
𝐻1: At least one mean is different from others
if (summary(anova_t5)[[1]]$'Pr(>F)'[1] < 0.05) {
  tukey_test_5 <- TukeyHSD(anova_t5)
  print(tukey_test_5)

  # Plot the Tukey's HSD test results
  plot(tukey_test_5)
} else {
  cat("Since the ANOVA result is not statistically significant, Tukey's HSD test is not conducted.\n")
}
## Since the ANOVA result is not statistically significant, Tukey's HSD test is not conducted.

Observations to Task_5:

This task involves examining whether the sodium content (measured in milligrams) in one serving differs among three categories of food: condiments, cereals, and desserts. Following the One-Way ANOVA, the obtained F-value is 2.399, and the critical value is 3.522. The decision is made not to reject the null hypothesis since the F-value is less than the critical value.

In summary, there is not enough evidence to reject the null hypothesis, so we cannot conclude that the mean sodium amounts differ among condiments, cereals, and desserts.
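
The F-value reported by aov() can be reproduced from the between-group and within-group sums of squares, which makes the degrees of freedom in the summary table easier to follow. The sketch below uses the sodium values from the task; the names (`sodium`, `ss_between`, `ss_within`, `f_value`) are illustrative and do not appear in the report's code.

```r
# Sodium per serving (mg) by food type, from Task 5
sodium <- list(
  Condiments = c(270, 130, 230, 180, 80, 70, 200),
  Cereals    = c(260, 220, 290, 290, 200, 320, 140),
  Desserts   = c(100, 180, 250, 250, 300, 360, 300, 160)
)

k <- length(sodium)              # number of groups
N <- sum(lengths(sodium))        # total number of observations
grand_mean <- mean(unlist(sodium))

# Between-group sum of squares: group size times squared distance
# of each group mean from the grand mean
ss_between <- sum(sapply(sodium, function(g) length(g) * (mean(g) - grand_mean)^2))

# Within-group sum of squares: squared deviations from each group's own mean
ss_within <- sum(sapply(sodium, function(g) sum((g - mean(g))^2)))

# F statistic: ratio of the two mean squares
f_value <- (ss_between / (k - 1)) / (ss_within / (N - k))

# Upper-tail p-value on (k - 1, N - k) degrees of freedom
p_value <- pf(f_value, k - 1, N - k, lower.tail = FALSE)

round(c(SS_between = ss_between, SS_within = ss_within,
        F = f_value, p = p_value), 3)
```

These hand-computed values should line up with the Sum Sq, F value, and Pr(>F) columns of the aov() summary shown earlier.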


Task 6 (Section 12-2.10 Sales for Leading Companies)

Perform a complete one-way ANOVA. If the null hypothesis is rejected, use either the Scheffé or Tukey test to see if there is a significant difference in the pairs of means. Assume all assumptions are met.

The sales in millions of dollars for a year of a sample of leading companies are shown.

Cereal   Chocolate Candy   Coffee
578      311               261
320      106               185
264      109               302
249      125               689
237      173

At α = 0.01, is there a significant difference in the means?

Cereal = data.frame("Sales" = c(578, 320, 264, 249, 237), "Food" = rep("Cereal",5), stringsAsFactors = FALSE)

Choco_candy = data.frame("Sales" = c(311, 106, 109, 125, 173), "Food" = rep("Chocolate_candy", 5), stringsAsFactors = FALSE)

Coffee = data.frame("Sales" = c(261, 185, 302, 689), "Food" = rep("Coffee", 4), stringsAsFactors = FALSE)

Solution

R Wrapped ANOVA (One Way) Test

Sales = rbind(Cereal, Choco_candy, Coffee) 
Sales$Food = as.factor(Sales$Food) 
  1. State the hypotheses and identify the claim.
  • Null hypothesis \(\it{H_{0}}\): 𝜇1=𝜇2=𝜇3
  • Alternative hypothesis \(\it{H_{1}}\): At least one mean is different from the others.
  1. The alpha value
alpha_t6 = 0.01
conf_level_t6 = 1- alpha_t6
  1. Compute the test value.
##ANOVA Test######
anova_t6 = aov(Sales ~ Food, data = Sales)
 
anovasum_t6 = summary(anova_t6)

anovasum_t6
##             Df Sum Sq Mean Sq F value Pr(>F)
## Food         2 103770   51885   2.172   0.16
## Residuals   11 262795   23890
  1. Find the critical value.
anofval_t6 = anovasum_t6[[1]][1, "F value"]
anofval_t6
## [1] 2.171782
DofN = anovasum_t6[[1]][1, "Df"] #k-1
DofD = anovasum_t6[[1]][2, "Df"] #N-k

CV2_t6 = qf(alpha_t6, DofN, DofD, lower.tail = FALSE) #Critical value 

CV2_t6
## [1] 7.205713
  1. Make the decision.
anovhypores_t6 = ifelse(anofval_t6 < CV2_t6, "Do not reject the Null Hypothesis (Fvalue < CV)", "Reject the Null Hypothesis (Fvalue > CV)" )

anovhypores_t6
## [1] "Do not reject the Null Hypothesis (Fvalue < CV)"
salesfoodstat <- function(y, uplim = max(Sales$Sales) * 1.15) {
  return(data.frame(
    y = 0.95 * uplim,
    label = paste(
      "Count =", length(y), "\n",
      "Mean =", round(mean(y), 2), "\n"
    )
  ))
}

ggplot(Sales, aes(x = Food, y = Sales, fill = Food)) + 
  geom_boxplot() +
  stat_summary( 
               fun.data = salesfoodstat, 
               geom = "text", hjust = 0.5, vjust = 0.9, size = 2) +
  labs(
    title = "Sales for Leading Companies",
    caption = "Source: Information Resources, Inc",
    x = "Food",
    y = "Sales (millions of USD)"
  ) +
  theme_classic()+
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.caption = element_text(size = 5)
  )

Table showing ANOVA test

T6_hypores_6 = rbind(conf_level_t6*100, alpha_t6, round(CV2_t6,3), round(anofval_t6,3), anovhypores_t6)
col_n6 = c("Hypothesis Result")
row_n6 = c("Confidence Level(%):", "Alpha:", "CV:","F value:","Hypothesis Result:")
dimnames(T6_hypores_6) = list(row_n6, col_n6)

kable(T6_hypores_6,
        caption = "<center>One way ANOVA Test</center>",
        align = "c",
       booktabs = TRUE)%>%
    kable_styling(bootstrap_options = c("bordered"),
                font_size = 12) %>%
  footnote(general = "H0:𝜇1=𝜇2=𝜇3 and \n H1: At least one mean is different from others.")
One way ANOVA Test

                       Hypothesis Result
Confidence Level(%):   99
Alpha:                 0.01
CV:                    7.206
F value:               2.172
Hypothesis Result:     Do not reject the Null Hypothesis (Fvalue < CV)

Note:
H0:𝜇1=𝜇2=𝜇3 and
H1: At least one mean is different from others.
if (summary(anova_t6)[[1]]$'Pr(>F)'[1] < 0.01) {
  tukey_test_6 <- TukeyHSD(anova_t6)
  print(tukey_test_6)

  # Plot the Tukey's HSD test results
  plot(tukey_test_6)
} else {
  cat("Since the ANOVA result is not statistically significant, Tukey's HSD test is not conducted.\n")
}
## Since the ANOVA result is not statistically significant, Tukey's HSD test is not conducted.

Observations to Task_6:

In this task, we are conducting a One-way ANOVA to assess the variation in sales among three product categories for a sample of leading companies. The sample includes Cereal, Chocolate_candy and Coffee, with sales figures reported in millions of dollars over one year. The objective is to test whether the mean sales of the three categories differ at a 0.01 significance level.

The ANOVA test is performed using the aov() function, and key metrics such as the F-value, degrees of freedom for numerator (DoFN), and degrees of freedom for denominator (DoFD) are extracted from the summary of aov(). Additionally, we have computed the critical value to compare it with the F-value, and since Fvalue < CV, the decision is not to reject the null hypothesis.

To sum up, there is not sufficient evidence to reject the null hypothesis, so we cannot conclude that the mean sales differ among the three product categories.
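
The critical-value comparison used above and the p-value comparison used for the Tukey check are two views of the same decision rule, so they should always agree. The sketch below illustrates this with the sales data from the task; the data frame and object names (`sales`, `fit`, `tab`) are illustrative and separate from the report's own objects.

```r
# Sales (millions of dollars) by product category, from Task 6
sales <- data.frame(
  Sales = c(578, 320, 264, 249, 237,
            311, 106, 109, 125, 173,
            261, 185, 302, 689),
  Food  = factor(rep(c("Cereal", "Chocolate_candy", "Coffee"),
                     times = c(5, 5, 4)))
)

alpha <- 0.01
fit <- aov(Sales ~ Food, data = sales)
tab <- summary(fit)[[1]]          # ANOVA table as a data frame

f_value <- tab[1, "F value"]
p_value <- tab[1, "Pr(>F)"]

# Critical value on (k - 1, N - k) degrees of freedom
critical <- qf(alpha, tab[1, "Df"], tab[2, "Df"], lower.tail = FALSE)

# Both rules give the same reject / do-not-reject decision
c(reject_by_critical_value = f_value > critical,
  reject_by_p_value        = p_value < alpha)
```

Both entries come out FALSE for these data, matching the decision not to reject the null hypothesis and not to run Tukey's HSD test.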


Task 7 (Section 12-2.12 Per-Pupil Expenditures)

Perform a complete one-way ANOVA. If the null hypothesis is rejected, use either the Scheffé or Tukey test to see if there is a significant difference in the pairs of means. Assume all assumptions are met.

The expenditures (in dollars) per pupil for states in three sections of the country are listed.

Eastern third   Middle third   Western third
4946            6149           5282
5953            7451           8605
6202            6000           6528
7243            6479           6911
6113

Using α = 0.05, can you conclude that there is a difference in means?

Eastern = data.frame("Expenditure" = c(4946, 5953, 6202, 7243, 6113), 
                     "States" = rep("Eastern", 5), stringsAsFactors = FALSE)

Middle = data.frame("Expenditure" = c(6149, 7451, 6000, 6479), 
                    "States" = rep("Middle", 4), stringsAsFactors = FALSE)

Western = data.frame("Expenditure" = c(5282, 8605, 6528, 6911),
                     "States" = rep("Western", 4), stringsAsFactors = FALSE)

Solution

R Wrapped ANOVA (One Way) Test

Expenditure = rbind(Eastern, Middle, Western)
Expenditure$States = as.factor(Expenditure$States)
  1. State the hypotheses and identify the claim.
  • Null hypothesis \(\it{H_{0}}\): 𝜇1=𝜇2=𝜇3
  • Alternative hypothesis \(\it{H_{1}}\): At least one mean is different from the others.
  1. The alpha value
alpha_t7 = 0.05
conf_level_t7 = 1- alpha_t7
  1. Compute the test value.
##ANOVA Test######
anova_t7 = aov(Expenditure ~ States, data = Expenditure)
 
anovasum_t7 = summary(anova_t7)

anovasum_t7
##             Df  Sum Sq Mean Sq F value Pr(>F)
## States       2 1244588  622294   0.649  0.543
## Residuals   10 9591145  959114
  1. Find the critical value.
anofval_t7 = anovasum_t7[[1]][1, "F value"]
anofval_t7
## [1] 0.6488214
DofN = anovasum_t7[[1]][1, "Df"] #k-1
DofD = anovasum_t7[[1]][2, "Df"] #N-k

CV2_t7 = qf(alpha_t7, DofN, DofD, lower.tail = FALSE) #Critical value 

CV2_t7
## [1] 4.102821
  1. Make the decision.
anovhypores_t7 = ifelse(anofval_t7 < CV2_t7, "Do not reject the Null Hypothesis (Fvalue < CV)", "Reject the Null Hypothesis (Fvalue > CV)" )

anovhypores_t7
## [1] "Do not reject the Null Hypothesis (Fvalue < CV)"
stateexpenstat <- function(y, uplim = max(Expenditure$Expenditure) * 1.15) {
  return(data.frame(
    y = 0.95 * uplim,
    label = paste(
      "Count =", length(y), "\n",
      "Mean =", round(mean(y), 2), "\n"
    )
  ))
}

ggplot(Expenditure, aes(x = States, y = Expenditure, fill = States)) + 
  geom_boxplot() +
  stat_summary( 
               fun.data = stateexpenstat, 
               geom = "text", hjust = 0.5, vjust = 0.9, size = 2) +
  labs(
    title = "The expenditures per pupil (in dollars)",
    caption = "Source: New York Times Almanac",
    x = "States",
    y = "Expenditure (USD)/ pupil"
  ) +
  theme_classic()+
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.caption = element_text(size = 5)
  )

Table showing ANOVA test

T7_hypores_7 = rbind(conf_level_t7*100, alpha_t7, round(CV2_t7,3), round(anofval_t7,3), anovhypores_t7)
col_n7 = c("Hypothesis Result")
row_n7 = c("Confidence Level(%):", "Alpha:", "CV:","F value:","Hypothesis Result:")
dimnames(T7_hypores_7) = list(row_n7, col_n7)

kable(T7_hypores_7,
        caption = "<center>One way ANOVA Test</center>",
        align = "c",
       booktabs = TRUE)%>%
    kable_styling(bootstrap_options = c("bordered"),
                font_size = 12) %>%
  footnote(general = "H0:𝜇1=𝜇2=𝜇3 and \n H1: At least one mean is different from others.")
One way ANOVA Test

                       Hypothesis Result
Confidence Level(%):   95
Alpha:                 0.05
CV:                    4.103
F value:               0.649
Hypothesis Result:     Do not reject the Null Hypothesis (Fvalue < CV)

Note:
H0:𝜇1=𝜇2=𝜇3 and
H1: At least one mean is different from others.
if (summary(anova_t7)[[1]]$'Pr(>F)'[1] < 0.05) {
  tukey_test_7 <- TukeyHSD(anova_t7)
  print(tukey_test_7)

  # Plot the Tukey's HSD test results
  plot(tukey_test_7)
} else {
  cat("Since the ANOVA result is not statistically significant, Tukey's HSD test is not conducted.\n")
}
## Since the ANOVA result is not statistically significant, Tukey's HSD test is not conducted.

Observations to Task_7:

In this task, per-pupil spending (in USD) is given for states grouped into three regions: Eastern, Middle, and Western. We performed a one-way ANOVA with the aov() function to test whether mean expenditure differs across the three regions, and compared the resulting F value with the critical value. Since the F value (0.649) is less than the critical value (4.103), we do not reject the null hypothesis; the p-value (0.543) is likewise well above α = 0.05.

In summary, there is not enough evidence to support the claim that mean per-pupil expenditure differs among the three regions. Failing to reject the null hypothesis does not prove that the means are equal, particularly with only 13 observations.


Task 8 (Section 12- 3. 10. Increasing Plant Growth)

Assume that all variables are normally or approximately normally distributed, that the samples are independent, and that the population variances are equal. a. State the hypotheses. b. Find the critical value for each F test. c. Complete the summary table and find the test value. d. Make the decision. e. Summarize the results. (Draw a graph of the cell means if necessary.)

A gardening company is testing new ways to improve plant growth. Twelve plants are randomly selected and exposed to a combination of two factors, a “Grow-light” in two different strengths and a plant food supplement with different mineral supplements. After a number of days, the plants are measured for growth, and the results (in inches) are put into the appropriate boxes.

Grow-light 1 Grow-light 2
Plant food A 9.2, 9.4, 8.9 8.5, 9.2, 8.9
Plant food B 7.1, 7.2, 8.5 5.5, 5.8, 7.6

Can an interaction between the two factors be concluded? Is there a difference in mean growth with respect to light? With respect to plant food? Use α = 0.05.

# Creating the data
plant_data <- data.frame(
  Grow_light = rep(c("Light1", "Light2"), each = 6),
  Plant_food = rep(c("Food_A", "Food_B"), times = 6),
  Growth = c(9.2, 9.4, 8.9, 8.5, 9.2, 8.9, 7.1, 7.2, 8.5, 5.5, 5.8, 7.6)
)

ggplot(data = plant_data, aes(x = Plant_food, y = Growth, colour = Grow_light)) + 
  geom_boxplot()
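
The factor labels matter as much as the growth values: each observation has to carry the plant food and grow light of the cell it came from in the table. Note that in the data frame above the first six growth values (the Plant food A row of the table) are labelled Light1, so it is worth checking the labels against the table. The following is a minimal sketch, not the code used for the results in this report, that builds the data cell by cell and prints the cell means for comparison; `plant_check` and the `growth_*` vectors are hypothetical names.

```r
# Growth values transcribed from the table, one vector per cell
growth_a_l1 <- c(9.2, 9.4, 8.9)  # Plant food A, Grow-light 1
growth_a_l2 <- c(8.5, 9.2, 8.9)  # Plant food A, Grow-light 2
growth_b_l1 <- c(7.1, 7.2, 8.5)  # Plant food B, Grow-light 1
growth_b_l2 <- c(5.5, 5.8, 7.6)  # Plant food B, Grow-light 2

# Labels are repeated so that they line up with the four vectors above
plant_check <- data.frame(
  Plant_food = rep(c("Food_A", "Food_B"), each = 6),
  Grow_light = rep(rep(c("Light1", "Light2"), each = 3), times = 2),
  Growth     = c(growth_a_l1, growth_a_l2, growth_b_l1, growth_b_l2)
)

# Cell means laid out like the table (rows = food, columns = light)
cell_means <- tapply(plant_check$Growth,
                     list(plant_check$Plant_food, plant_check$Grow_light),
                     mean)
round(cell_means, 3)

# Two-way ANOVA on the relabelled data
summary(aov(Growth ~ Plant_food * Grow_light, data = plant_check))
```

If the labels in this sketch differ from those in the data frame above, the resulting ANOVA table can differ from the one reported below, so the cell means are a quick way to confirm which coding matches the experiment.
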

Solution

R Wrapped ANOVA (Two Way) Test

  1. State the hypotheses and identify the claim.

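Interaction Effect:

  • Null hypothesis \(\it{H_{0}}\): There is no interaction effect between the type of light and the type of plant food on plant growth.
  • Alternative hypothesis \(\it{H_{1}}\): There is an interaction effect between the type of light and the type of plant food on plant growth.

Effect of Light on Growth:

  • Null hypothesis \(\it{H_{0}}\): There is no difference between the means of plant growth for the two types of light.
  • Alternative hypothesis \(\it{H_{1}}\): There is a difference between the means of plant growth for the two types of light.

Effect of Plant Food on Growth:

  • Null hypothesis \(\it{H_{0}}\): There is no difference between the means of plant growth for the two types of plant food.
  • Alternative hypothesis \(\it{H_{1}}\): There is a difference between the means of plant growth for the two types of plant food.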

  1. The alpha value
alpha_t8 = 0.05
conf_level_t8 = 1- alpha_t8
  1. Compute the test value.
##ANOVA Test######
Anovatwoway = aov(Growth ~ Plant_food * Grow_light, data = plant_data)

Anovatwoway_summary = summary(Anovatwoway)

Anovatwoway_summary
##                       Df Sum Sq Mean Sq F value  Pr(>F)   
## Plant_food             1  0.213   0.213   0.259 0.62482   
## Grow_light             1 12.813  12.813  15.531 0.00429 **
## Plant_food:Grow_light  1  0.030   0.030   0.036 0.85352   
## Residuals              8  6.600   0.825                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(TukeyHSD(Anovatwoway, conf.level=.95), las = 2)

  1. Find the critical value.
ANOVAFvalinteraction = Anovatwoway_summary[[1]][3, "F value"] #Fvalue of interaction
ANOVAFvalinteraction
## [1] 0.03636364
ANOVAFvallight = Anovatwoway_summary[[1]][2, "F value"] #F value of the light main effect
ANOVAFvallight
## [1] 15.53131
DofNlight = Anovatwoway_summary[[1]][2, "Df"] # a-1, where a is the number of light levels
DofDlight = Anovatwoway_summary[[1]][4, "Df"] # ab(n-1): b is the number of food levels, n the number of data values in each group
DofNinteraction = Anovatwoway_summary[[1]][3, "Df"] #(a-1)(b-1) 
DofDinteraction = Anovatwoway_summary[[1]][4, "Df"] # ab(n-1): n is the number of data values in each group

CVinteraction = qf(alpha_t8, DofNinteraction, DofDinteraction, lower.tail = FALSE)
CVlight = qf(alpha_t8, DofNlight, DofDlight, lower.tail = FALSE)
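
The degrees of freedom behind these critical values follow from the design: a = 2 light levels, b = 2 plant food levels, and n = 3 plants per combination. A minimal sketch, with illustrative variable names, that derives them and the critical value directly:

```r
a <- 2   # levels of grow light
b <- 2   # levels of plant food
n <- 3   # plants per light/food combination
alpha <- 0.05

df_light       <- a - 1               # main effect of light
df_food        <- b - 1               # main effect of plant food
df_interaction <- (a - 1) * (b - 1)   # light x food interaction
df_within      <- a * b * (n - 1)     # residual (within-cell) df

# With 1 and 8 degrees of freedom, all three F tests share one critical value
cv <- qf(alpha, df_interaction, df_within, lower.tail = FALSE)
round(cv, 3)
```
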
  1. Make the decision.
hypotwoANOVA = ifelse(ANOVAFvalinteraction < CVinteraction, "Do not reject the null hypothesis (Fvalue < CV)", "Reject the null hypothesis (Fvalue > CV)")
hypotwoANOVA
## [1] "Do not reject the null hypothesis (Fvalue < CV)"
hypotwoANOVAlight = ifelse(ANOVAFvallight < CVlight, "Do not reject the null hypothesis (Fvalue < CV)", "Reject the null hypothesis (Fvalue > CV)")
hypotwoANOVAlight
## [1] "Reject the null hypothesis (Fvalue > CV)"
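
The two ifelse() calls above repeat the same comparison for each effect. One way to cover every row of the ANOVA table at once, including the plant food main effect, is a small helper. This is only a sketch: `f_decisions` is a hypothetical function, and the built-in ToothGrowth data set stands in for the plant data so that the block runs on its own.

```r
# Hypothetical helper: one row per model term with F, critical value, decision
f_decisions <- function(fit, alpha = 0.05) {
  tab <- summary(fit)[[1]]
  n_terms <- nrow(tab) - 1                 # the last row is Residuals
  df_den <- tab[nrow(tab), "Df"]           # residual degrees of freedom
  f_val <- tab[seq_len(n_terms), "F value"]
  cv <- qf(alpha, tab[seq_len(n_terms), "Df"], df_den, lower.tail = FALSE)
  data.frame(
    term = trimws(rownames(tab)[seq_len(n_terms)]),
    F_value = round(f_val, 3),
    CV = round(cv, 3),
    decision = ifelse(f_val > cv, "Reject H0", "Do not reject H0")
  )
}

# Stand-in data: ToothGrowth ships with R (supplement type x dose)
tg <- transform(ToothGrowth, dose = factor(dose))
tg_fit <- aov(len ~ supp * dose, data = tg)
decisions <- f_decisions(tg_fit)
decisions
```

Applied to a two-way fit, the helper returns the two main effects and the interaction in the order they appear in the summary table.
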
plantgrost <- 
  plant_data %>% 
  group_by(Grow_light, Plant_food) %>% # group by the two factors
  summarise(Means = mean(Growth), SEs = sd(Growth)/sqrt(n())) # Mean and Std Er
## `summarise()` has grouped output by 'Grow_light'. You can override using the
## `.groups` argument.
ggplot(plantgrost, 
       aes(x = Grow_light, y = Means, fill = Plant_food,
           ymin = Means - SEs, ymax = Means + SEs)) +
  # this adds the mean
  geom_col(position = position_dodge()) +
  # this adds the error bars
  geom_errorbar(position = position_dodge(0.9), width=.2) +
  # controlling the appearance
  xlab("Grow Light") + ylab("Plant Growth (in inches)")

Table showing ANOVA test

hyporestwoANOVA = rbind(conf_level_t8*100, alpha_t8, round(CVinteraction,3), round(ANOVAFvalinteraction,3), hypotwoANOVA)
col_twoANOVA = c("Hypothesis Result")
row_twoANOVA = c("Confidence Level(%):", "Alpha:", "CV:","F value:","Hypothesis Result:")
dimnames(hyporestwoANOVA) = list(row_twoANOVA, col_twoANOVA)

kable(hyporestwoANOVA,
        caption = "<center>Two way ANOVA interaction Test</center>",
        align = "c",
       booktabs = TRUE) %>%
    kable_styling(bootstrap_options = c("hover",
                                        "bordered"),
                font_size = 11) %>%
  scroll_box(width = "100%", height = "100%") %>%
  footnote(general = "𝐻0: There is no interaction effect between type of light used and type of plant food used on plant growth and \n 𝐻1: There is an interaction effect between type of light used and type of plant food used on plant growth.")
Two way ANOVA interaction Test
Hypothesis Result
Confidence Level(%): 95
Alpha: 0.05
CV: 5.318
F value: 0.036
Hypothesis Result: Do not reject the null hypothesis (Fvalue < CV)
Note: 𝐻0: There is no interaction effect between type of light used and type of plant food used on plant growth and
𝐻1: There is an interaction effect between type of light used and type of plant food used on plant growth.
hyporestwoANOVAlight = rbind(conf_level_t8*100, alpha_t8, round(CVlight,3), round(ANOVAFvallight,3), hypotwoANOVAlight)
col_twoANOVAlight = c("Hypothesis Result")
row_twoANOVAlight = c("Confidence Level(%):", "Alpha:", "CV:","F value:","Hypothesis Result:")
dimnames(hyporestwoANOVAlight) = list(row_twoANOVAlight, col_twoANOVAlight)

kable(hyporestwoANOVAlight,
        caption = "<center>Two way ANOVA mean of growth to light Test</center>",
        align = "c",
       booktabs = TRUE) %>%
    kable_styling(bootstrap_options = c("hover",
                                        "bordered"),
                font_size = 11) %>%
  scroll_box(width = "100%", height = "100%") %>%
  footnote(general = "H0: There is no difference between the means of plant growth for two types of light and \n H1: There is a difference between the means of plant growth for two types of light.")
Two way ANOVA mean of growth to light Test
Hypothesis Result
Confidence Level(%): 95
Alpha: 0.05
CV: 5.318
F value: 15.531
Hypothesis Result: Reject the null hypothesis (Fvalue > CV)
Note: H0: There is no difference between the means of plant growth for two types of light and
H1: There is a difference between the means of plant growth for two types of light.

Observations to Task_8:

In this task, we investigated the impact of different grow lights and plant food supplements on plant growth. Our analysis revealed a significant effect of light type on growth, rejecting the null hypothesis. However, no significant interaction effect between light and plant food was found. The choice of light emerges as a crucial factor in optimizing plant growth. These findings inform the gardening company’s practices, emphasizing the importance of selecting an appropriate grow light for enhanced plant development.


Task 9 (On Your Own)

Task 9.1 Download the file ‘baseball.csv’ from the course resources and import the file into R.

#Dataset_Employed_in_this_Task

M2W2Data = read_csv("D:/Quater_2/Second Part/ALY6015/Week_2/Assignment/Individual Assignment/baseball-3.csv")

Task 9.2 Perform EDA on the imported data set. Write a paragraph or two to describe the data set using descriptive statistics and plots. Are there any trends or anything of interest to discuss?

# Rename Columns
colnames(M2W2Data) <- tolower(gsub("[ ,.]", "_", colnames(M2W2Data)))
#colnames(M2W2Data)


# Structure of the dataset
str(M2W2Data)
## spc_tbl_ [1,232 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ team        : chr [1:1232] "ARI" "ATL" "BAL" "BOS" ...
##  $ league      : chr [1:1232] "NL" "NL" "AL" "AL" ...
##  $ year        : num [1:1232] 2012 2012 2012 2012 2012 ...
##  $ rs          : num [1:1232] 734 700 712 734 613 748 669 667 758 726 ...
##  $ ra          : num [1:1232] 688 600 705 806 759 676 588 845 890 670 ...
##  $ w           : num [1:1232] 81 94 93 69 61 85 97 68 64 88 ...
##  $ obp         : num [1:1232] 0.328 0.32 0.311 0.315 0.302 0.318 0.315 0.324 0.33 0.335 ...
##  $ slg         : num [1:1232] 0.418 0.389 0.417 0.415 0.378 0.422 0.411 0.381 0.436 0.422 ...
##  $ ba          : num [1:1232] 0.259 0.247 0.247 0.26 0.24 0.255 0.251 0.251 0.274 0.268 ...
##  $ playoffs    : num [1:1232] 0 1 1 0 0 0 1 0 0 1 ...
##  $ rankseason  : num [1:1232] NA 4 5 NA NA NA 2 NA NA 6 ...
##  $ rankplayoffs: num [1:1232] NA 5 4 NA NA NA 4 NA NA 2 ...
##  $ g           : num [1:1232] 162 162 162 162 162 162 162 162 162 162 ...
##  $ oobp        : num [1:1232] 0.317 0.306 0.315 0.331 0.335 0.319 0.305 0.336 0.357 0.314 ...
##  $ oslg        : num [1:1232] 0.415 0.378 0.403 0.428 0.424 0.405 0.39 0.43 0.47 0.402 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Team = col_character(),
##   ..   League = col_character(),
##   ..   Year = col_double(),
##   ..   RS = col_double(),
##   ..   RA = col_double(),
##   ..   W = col_double(),
##   ..   OBP = col_double(),
##   ..   SLG = col_double(),
##   ..   BA = col_double(),
##   ..   Playoffs = col_double(),
##   ..   RankSeason = col_double(),
##   ..   RankPlayoffs = col_double(),
##   ..   G = col_double(),
##   ..   OOBP = col_double(),
##   ..   OSLG = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
# Summary statistics of the dataset
data_summary <- summary(M2W2Data)

kable(data_summary,caption = "<center>Summary</center>", format = "html", align = "l") %>%
  column_spec(1, bold = TRUE)%>%
  kable_styling(full_width = TRUE, "striped",font_size = 14) %>%
  row_spec(0, bold = TRUE, background = "slategrey" , color = "white")%>%
  scroll_box(width = "100%", height = "400px")
Summary
team league year rs ra w obp slg ba playoffs rankseason rankplayoffs g oobp oslg
Length:1232 Length:1232 Min. :1962 Min. : 463.0 Min. : 472.0 Min. : 40.0 Min. :0.2770 Min. :0.3010 Min. :0.2140 Min. :0.0000 Min. :1.000 Min. :1.000 Min. :158.0 Min. :0.2940 Min. :0.3460
Class :character Class :character 1st Qu.:1977 1st Qu.: 652.0 1st Qu.: 649.8 1st Qu.: 73.0 1st Qu.:0.3170 1st Qu.:0.3750 1st Qu.:0.2510 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:162.0 1st Qu.:0.3210 1st Qu.:0.4010
Mode :character Mode :character Median :1989 Median : 711.0 Median : 709.0 Median : 81.0 Median :0.3260 Median :0.3960 Median :0.2600 Median :0.0000 Median :3.000 Median :3.000 Median :162.0 Median :0.3310 Median :0.4190
NA NA Mean :1989 Mean : 715.1 Mean : 715.1 Mean : 80.9 Mean :0.3263 Mean :0.3973 Mean :0.2593 Mean :0.1981 Mean :3.123 Mean :2.717 Mean :161.9 Mean :0.3323 Mean :0.4197
NA NA 3rd Qu.:2002 3rd Qu.: 775.0 3rd Qu.: 774.2 3rd Qu.: 89.0 3rd Qu.:0.3370 3rd Qu.:0.4210 3rd Qu.:0.2680 3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:162.0 3rd Qu.:0.3430 3rd Qu.:0.4380
NA NA Max. :2012 Max. :1009.0 Max. :1103.0 Max. :116.0 Max. :0.3730 Max. :0.4910 Max. :0.2940 Max. :1.0000 Max. :8.000 Max. :5.000 Max. :165.0 Max. :0.3840 Max. :0.4990
NA NA NA NA NA NA NA NA NA NA NA’s :988 NA’s :988 NA NA’s :812 NA’s :812
describe(M2W2Data) %>%
  kable(caption = "<center>Descriptive Statistics</center>", format = "html", align = "l") %>%
  kable_styling("bordered", full_width = TRUE, "striped",font_size = 14) %>%
  row_spec(0, bold = TRUE, background = "slategrey" , color = "white")%>%
  scroll_box(width = "100%", height = "400px")
Descriptive Statistics
vars n mean sd median trimmed mad min max range skew kurtosis se
team* 1 1232 18.9285714 10.6140364 20.000 18.7576065 13.3434000 1.000 39.000 38.000 0.0628129 -1.2525294 0.3023954
league* 2 1232 1.5000000 0.5002030 1.500 1.5000000 0.7413000 1.000 2.000 1.000 0.0000000 -2.0016227 0.0142509
year 3 1232 1988.9577922 14.8196251 1989.000 1989.3184584 19.2738000 1962.000 2012.000 50.000 -0.1515595 -1.2077412 0.4222133
rs 4 1232 715.0819805 91.5342940 711.000 713.3387424 90.4386000 463.000 1009.000 546.000 0.1740832 -0.0301863 2.6078252
ra 5 1232 715.0819805 93.0799326 709.000 712.4371197 91.9212000 472.000 1103.000 631.000 0.2978360 -0.0205915 2.6518607
w 6 1232 80.9042208 11.4581390 81.000 81.1206897 11.8608000 40.000 116.000 76.000 -0.1814238 -0.3070528 0.3264440
obp 7 1232 0.3263312 0.0150128 0.326 0.3262586 0.0148260 0.277 0.373 0.096 0.0175923 0.0574876 0.0004277
slg 8 1232 0.3973417 0.0332669 0.396 0.3970903 0.0340998 0.301 0.491 0.190 0.0541978 -0.3250999 0.0009478
ba 9 1232 0.2592727 0.0129072 0.260 0.2593864 0.0133434 0.214 0.294 0.080 -0.1109140 -0.0002138 0.0003677
playoffs 10 1232 0.1980519 0.3986934 0.000 0.1227181 0.0000000 0.000 1.000 1.000 1.5134587 0.2907952 0.0113588
rankseason 11 244 3.1229508 1.7383492 3.000 2.9744898 1.4826000 1.000 8.000 7.000 0.5560350 -0.5778485 0.1112864
rankplayoffs 12 244 2.7172131 1.0952342 3.000 2.7602041 1.4826000 1.000 5.000 4.000 -0.2688894 -1.1199065 0.0701152
g 13 1232 161.9188312 0.6243652 162.000 161.9492901 0.0000000 158.000 165.000 7.000 -1.0420881 6.9746969 0.0177883
oobp 14 420 0.3322643 0.0152953 0.331 0.3319345 0.0163086 0.294 0.384 0.090 0.1943470 -0.3710080 0.0007463
oslg 15 420 0.4197429 0.0265096 0.419 0.4194405 0.0266868 0.346 0.499 0.153 0.1176197 -0.2148956 0.0012935
# Descriptive statistics of numerical columns
num_summary <- summary(select_if(M2W2Data, is.numeric))

kable(num_summary, format = "html", align = "l") %>%
  column_spec(1, bold = TRUE)%>%
  kable_styling(full_width = TRUE, "striped",font_size = 14) %>%
  row_spec(0, bold = TRUE, background = "slategrey" , color = "white")%>%
  scroll_box(width = "100%", height = "400px")
year rs ra w obp slg ba playoffs rankseason rankplayoffs g oobp oslg
Min. :1962 Min. : 463.0 Min. : 472.0 Min. : 40.0 Min. :0.2770 Min. :0.3010 Min. :0.2140 Min. :0.0000 Min. :1.000 Min. :1.000 Min. :158.0 Min. :0.2940 Min. :0.3460
1st Qu.:1977 1st Qu.: 652.0 1st Qu.: 649.8 1st Qu.: 73.0 1st Qu.:0.3170 1st Qu.:0.3750 1st Qu.:0.2510 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:162.0 1st Qu.:0.3210 1st Qu.:0.4010
Median :1989 Median : 711.0 Median : 709.0 Median : 81.0 Median :0.3260 Median :0.3960 Median :0.2600 Median :0.0000 Median :3.000 Median :3.000 Median :162.0 Median :0.3310 Median :0.4190
Mean :1989 Mean : 715.1 Mean : 715.1 Mean : 80.9 Mean :0.3263 Mean :0.3973 Mean :0.2593 Mean :0.1981 Mean :3.123 Mean :2.717 Mean :161.9 Mean :0.3323 Mean :0.4197
3rd Qu.:2002 3rd Qu.: 775.0 3rd Qu.: 774.2 3rd Qu.: 89.0 3rd Qu.:0.3370 3rd Qu.:0.4210 3rd Qu.:0.2680 3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:162.0 3rd Qu.:0.3430 3rd Qu.:0.4380
Max. :2012 Max. :1009.0 Max. :1103.0 Max. :116.0 Max. :0.3730 Max. :0.4910 Max. :0.2940 Max. :1.0000 Max. :8.000 Max. :5.000 Max. :165.0 Max. :0.3840 Max. :0.4990
NA NA NA NA NA NA NA NA NA’s :988 NA’s :988 NA NA’s :812 NA’s :812

The code above computes summary statistics for the numerical columns of the dataset.

# Descriptive statistics of categorical columns
cat_summary <- summary(select_if(M2W2Data, is.character))

kable(cat_summary, format = "html", align = "l") %>%
  column_spec(1, bold = TRUE)%>%
  kable_styling(full_width = TRUE, "striped",font_size = 14) %>%
  row_spec(0, bold = TRUE, background = "slategrey" , color = "white")%>%
  scroll_box(width = "100%", height = "200px")
team league
Length:1232 Length:1232
Class :character Class :character
Mode :character Mode :character
DescTools::Desc(M2W2Data$w)
## ------------------------------------------------------------------------------ 
## M2W2Data$w (numeric)
## 
##   length       n    NAs  unique     0s   mean  meanCI'
##    1'232   1'232      0      63      0  80.90   80.26
##           100.0%   0.0%           0.0%          81.54
##                                                      
##      .05     .10    .25  median    .75    .90     .95
##    62.00   66.00  73.00   81.00  89.00  95.90   98.00
##                                                      
##    range      sd  vcoef     mad    IQR   skew    kurt
##    76.00   11.46   0.14   11.86  16.00  -0.18   -0.31
##                                                      
## lowest : 40.0, 43.0, 50.0, 51.0 (2), 52.0 (2)
## highest: 106.0, 108.0 (3), 109.0, 114.0, 116.0
## 
## ' 95%-CI (classic)

#Decade winnings#####

# Extract decade from year
M2W2Data$Decade <- M2W2Data$year - (M2W2Data$year %% 10)

# Total wins per decade
win_dec = M2W2Data %>%
  group_by(Decade)%>%
  summarize(TotalWins = sum(w))

# Plot total wins (not the number of rows) for each decade
win_dec %>%
  ggplot(aes(x = factor(Decade), y = TotalWins)) + 
    geom_col() +
  ggtitle("Total Wins by Decade")+
    xlab("Decade")+
    ylab("Total wins")+
  theme(plot.title = element_text(hjust = 0.5))
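
Total wins in a decade depend on how many team-seasons the decade contains, so it helps to look at the count and the mean alongside the total. A minimal sketch on made-up records (the `seasons` data frame is hypothetical and not taken from baseball.csv):

```r
# Hypothetical team-season records (year, wins)
seasons <- data.frame(
  year = c(1962, 1969, 1975, 1988, 1999, 2001, 2012),
  w    = c(  96,   70,   81,   90,   65,  102,   88)
)

# Subtracting the remainder after division by 10 gives the decade
seasons$decade <- seasons$year - (seasons$year %% 10)

# Number of team-seasons, total wins and mean wins in each decade
n_seasons  <- tapply(seasons$w, seasons$decade, length)
total_wins <- tapply(seasons$w, seasons$decade, sum)
mean_wins  <- total_wins / n_seasons

data.frame(decade = names(total_wins), n_seasons, total_wins, mean_wins)
```

A decade with more team-seasons has a larger total even when the mean wins per team-season is unchanged.
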

Observations:

The provided dataset contains information related to various baseball teams, including their performance statistics, such as runs scored (rs), runs allowed (ra), wins (w), on-base percentage (obp), slugging percentage (slg), batting average (ba), and more. The dataset spans multiple years, with a range from 1962 to 2012.

Upon conducting exploratory data analysis (EDA), several key observations emerge. The mean number of wins per team-season is approximately 80.9, with a standard deviation of 11.46 and a median of 81. Performance varies widely, from 40 to 116 wins. The distribution of wins is close to symmetric, with only a slight negative skew (-0.18), meaning the tail toward low win totals is marginally longer.

The bar plot by decade shows an increase from the 1960s to the 2000s. Because every game produces exactly one win, total wins in a decade mainly reflect how many team-seasons the data contain for that decade: the league expanded over this period, and the 1960s (from 1962) and the 2010s (through 2012) are only partly covered. The increase should therefore not be read as an improvement in team performance.

Task 9.3 Assuming the expected frequencies are equal, perform a Chi-Square Goodness-of-Fit test to determine if there is a difference in the number of wins by decade. Be sure to include the following: a. State the hypotheses and identify the claim. b. Find the critical value (α = 0.05) (programmatically). c. Compute the test value. d. Make the decision. Clearly state if the null hypothesis should or should not be rejected and why. e. Does comparing the critical value with the test value provide the same result as comparing the p-value from R with the significance level?

  1. State the hypotheses and identify the claim.
  • Null hypothesis \(\it{H_{0}}\): The number of wins is the same in every decade.
  • Alternative hypothesis \(\it{H_{1}}\): The number of wins is not the same in every decade (claim).
  1. Find the critical value (α = 0.05) (programmatically).
CL_t9a = 0.95
alpha_t9a = 0.05

num_decades <- length(unique(M2W2Data$Decade))

df <- num_decades - 1 # k - 1, where k is the number of decades

critical_value <- qchisq(1 - alpha_t9a, df)

# Output the critical value
critical_value
## [1] 11.0705
  1. Compute the test value.
#Chi-Square Wins by Decade####

observed_wins <- M2W2Data %>%
  group_by(Decade) %>%
  summarize(TotalWins = sum(w)) %>%
  ungroup() %>%
  pull(TotalWins)

expected_wins <- rep(mean(observed_wins), length(observed_wins))

# Compute the Chi-Square test statistic
test_value <- sum((observed_wins - expected_wins)^2 / expected_wins)

test_value
## [1] 9989.536
  1. Make the decision. Clearly state if the null hypothesis should or should not be rejected and why.
# Make the decision based on the critical value and the test statistic
if (test_value > critical_value) {
  decision <- "Reject the null hypothesis."
  reason <- "The test statistic is greater than the critical value."
} else {
  decision <- "Fail to reject the null hypothesis."
  reason <- "The test statistic is not greater than the critical value."
}

# Output the decision and the reason
decision
## [1] "Reject the null hypothesis."
reason
## [1] "The test statistic is greater than the critical value."
  1. Does comparing the critical value with the test value provide the same result as comparing the p-value from R with the significance level?
# Perform the Chi-Square Goodness-of-Fit test
chi_square_test <- chisq.test(win_dec$TotalWins, p = rep(1 / nrow(win_dec), nrow(win_dec)))

# Extract the p-value from the test result
p_value <- chi_square_test$p.value

# Output the p-value
p_value
## [1] 0
# Compare the p-value to the significance level
if (p_value < 0.05) {
  p_decision <- "Reject the null hypothesis based on p-value."
} else {
  p_decision <- "Fail to reject the null hypothesis based on p-value."
}

# Output the decision based on p-value
p_decision
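## [1] "Reject the null hypothesis based on p-value."

Yes. The test value (9989.536) is far above the critical value (11.0705), and the p-value reported by R (effectively 0) is below the significance level of 0.05, so both comparisons lead to rejecting the null hypothesis. The two rules are equivalent because the p-value falls below α exactly when the test value exceeds the critical value.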
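
The agreement between the two decision rules can be checked on any set of counts. A minimal sketch with hypothetical observed frequencies (not the baseball totals), computing the statistic by hand and with chisq.test():

```r
# Hypothetical observed counts for six categories
observed <- c(50, 60, 40, 55, 45, 50)
alpha <- 0.05

# Equal expected frequencies
expected <- rep(sum(observed) / length(observed), length(observed))

# Manual chi-square statistic and critical value
manual_stat <- sum((observed - expected)^2 / expected)
crit_value <- qchisq(1 - alpha, df = length(observed) - 1)

# Built-in goodness-of-fit test with equal probabilities
gof <- chisq.test(observed, p = rep(1 / length(observed), length(observed)))

reject_by_cv <- manual_stat > crit_value
reject_by_p  <- gof$p.value < alpha

c(manual = manual_stat, builtin = unname(gof$statistic))
c(by_critical_value = reject_by_cv, by_p_value = reject_by_p)
```

The manual statistic matches the one returned by chisq.test(), and the two logical values agree whichever way the decision goes.
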

Task 9.4 Download the file ‘crop_data.csv’ from the course resources and import the file into R.

crop = read.csv("D:/Quater_2/Second Part/ALY6015/Week_2/Assignment/Individual Assignment/crop_data-3.csv")

Task 9.5 Download the file ‘crop_data.csv’ from the course resources and import the file into R.

summary(crop)
##     density        block        fertilizer     yield      
##  Min.   :1.0   Min.   :1.00   Min.   :1    Min.   :175.4  
##  1st Qu.:1.0   1st Qu.:1.75   1st Qu.:1    1st Qu.:176.5  
##  Median :1.5   Median :2.50   Median :2    Median :177.1  
##  Mean   :1.5   Mean   :2.50   Mean   :2    Mean   :177.0  
##  3rd Qu.:2.0   3rd Qu.:3.25   3rd Qu.:3    3rd Qu.:177.4  
##  Max.   :2.0   Max.   :4.00   Max.   :3    Max.   :179.1

Task 9.6 Perform a Two-way ANOVA test using yield as the dependent variable and fertilizer and density as the independent variables. Explain the results of the test. Is there reason to believe that fertilizer and density have an impact on yield?

For the Two-way ANOVA test with factors “fertilizer” and “density,” along with their interaction, the null and alternative hypotheses are formulated as follows:

Fertilizer:

  • Null hypothesis \(\it{H_{0}}\): There is no significant difference in mean yield among the different types of fertilizer.
  • Alternative hypothesis \(\it{H_{1}}\): There is a significant difference in mean yield among the different types of fertilizer.

Density:

  • Null hypothesis \(\it{H_{0}}\): There is no significant difference in mean yield between different levels of density.
  • Alternative hypothesis \(\it{H_{1}}\): There is a significant difference in mean yield between different levels of density.

Interaction (Fertilizer * Density):

  • Null hypothesis \(\it{H_{0}}\): The effect of fertilizer on yield is independent of the level of density.
  • Alternative hypothesis \(\it{H_{1}}\): There is a significant interaction effect between fertilizer and density on yield.
# Two-way ANOVA with two independent variables
# Convert 'density' and 'fertilizer' to factors so that they are treated as categorical

crop$fertilizer <- as.factor(crop$fertilizer)
crop$density <- as.factor(crop$density)

anova2way <- aov(yield ~ fertilizer + density, data = crop)
summary(anova2way)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## fertilizer   2  6.068   3.034   9.073 0.000253 ***
## density      1  5.122   5.122  15.316 0.000174 ***
## Residuals   92 30.765   0.334                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Fit the model that also includes the fertilizer:density interaction
anova_inter <- aov(yield ~ fertilizer*density, data = crop)
# Display the results
summary(anova_inter)
##                    Df Sum Sq Mean Sq F value   Pr(>F)    
## fertilizer          2  6.068   3.034   9.001 0.000273 ***
## density             1  5.122   5.122  15.195 0.000186 ***
## fertilizer:density  2  0.428   0.214   0.635 0.532500    
## Residuals          90 30.337   0.337                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
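
Whether the interaction term is worth keeping can also be framed as a comparison between the additive model and the interaction model. A sketch on simulated data (the `sim` data frame and its effect sizes are made up; crop_data.csv is not used), showing that anova() on the two fits reproduces the interaction F test:

```r
set.seed(42)

# Simulated stand-in for a balanced 3 x 2 design with 16 plots per cell
sim <- expand.grid(rep = 1:16, fertilizer = factor(1:3), density = factor(1:2))
sim$yield <- 177 + 0.3 * as.integer(sim$fertilizer) +
  0.5 * as.integer(sim$density) + rnorm(nrow(sim), sd = 0.6)

fit_add   <- aov(yield ~ fertilizer + density, data = sim)
fit_inter <- aov(yield ~ fertilizer * density, data = sim)

# F test for the extra interaction term
comparison <- anova(fit_add, fit_inter)
comparison

# The same F value appears in the interaction row of the full model
f_from_comparison <- comparison$F[2]
f_from_summary <- summary(fit_inter)[[1]][3, "F value"]
c(f_from_comparison, f_from_summary)
```
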

Observations:

Based on the interaction model, both fertilizer (F(2, 90) = 9.001, p = 0.000273) and planting density (F(1, 90) = 15.195, p = 0.000186) have a statistically significant effect on crop yield at α = 0.05. The fertilizer:density interaction is not significant (F(2, 90) = 0.635, p = 0.5325), so there is no evidence that the effect of fertilizer depends on planting density. The additive model gives essentially the same results for the two main effects. On this evidence, fertilizer type and planting density can be chosen separately when optimizing yield, although failing to detect an interaction does not prove that none exists.
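
A significant F test for fertilizer says that at least one fertilizer differs, not which one. A possible follow-up is Tukey's HSD on the fertilizer factor. The sketch below runs on simulated data with made-up effects (the `sim` data frame is hypothetical; crop_data.csv is not used), so its numbers say nothing about the real crop results.

```r
set.seed(7)

# Simulated yields for three fertilizer types and two densities
sim <- expand.grid(rep = 1:16, fertilizer = factor(1:3), density = factor(1:2))
sim$yield <- 177 + c(0, 0.2, 0.6)[sim$fertilizer] +
  0.4 * (sim$density == "2") + rnorm(nrow(sim), sd = 0.5)

fit <- aov(yield ~ fertilizer + density, data = sim)

# Pairwise differences between fertilizer levels with adjusted p-values
tukey_fert <- TukeyHSD(fit, which = "fertilizer", conf.level = 0.95)
tukey_fert$fertilizer
```

Each row compares one pair of fertilizer levels; an interval that excludes zero indicates a difference at the chosen confidence level.
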


Conclusion

In summary, this comprehensive report has provided a robust application of key hypothesis testing and ANOVA methodologies using R programming. Through a series of real-world tasks, we have gained valuable insights by analyzing differences in proportions, independence between categorical variables, comparing means, and assessing interactions.

The analysis began by examining the goodness of fit between observed and expected blood type proportions, followed by airline on-time performance and movie admissions data. We then tested the association between military rank and branch, compared mean sodium content across foods, and analyzed differences in sales and per-pupil expenditure. Two-way ANOVA was used to examine main and interaction effects in two settings: grow light and plant food on plant growth, and fertilizer and planting density on crop yield. Finally, a chi-square goodness-of-fit test compared baseball wins across decades.

By leveraging R’s statistical capabilities, this assignment has enabled a broad, practical application of inferential statistics. The insights derived through rigorous hypothesis testing provide a solid foundation to make data-driven decisions in real-world contexts. This comprehensive journey has enhanced our ability to draw meaningful conclusions from data and strengthened our knowledge of R’s extensive tools for statistical analysis.


References

  1. University of Southampton. (n.d.). Chi square. https://www.southampton.ac.uk/passs/full_time_education/bivariate_analysis/chi_square

  2. Bluman, A. (2015). Elementary statistics: A step by step approach. McGraw-Hill Education.

  3. Larson, M. (2008). Analysis of variance. Circulation. https://doi.org/10.1161/CIRCULATIONAHA.107.654335


Appendix
This report contains an R Markdown file named as follows: WEEK_2_Ansari_ALY6015_71821_Intermediate_Analytics_SEC_09_Fall_2023_CPS.Rmd