M1 Project Report
ALY6015_71821:Intermediate Analytics
SEC_09_Fall_2023_CPS
Northeastern University
Professor: Vladimir Shapiro

By: Zeeshan Ahmad Ansari
Date of Submission: 13 November, 2023

Analysis

The chi-square test compares observed results to the results expected under a hypothesis. It helps determine whether the difference between observed and expected counts can be attributed to chance or whether it points to a relationship between the variables being studied. This makes it a useful tool for examining connections between categorical variables (University of Southampton, n.d.). It can be used to test goodness of fit against a hypothesized frequency distribution, independence between two variables, and homogeneity of proportions (Bluman, 2015).
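
For reference, the chi-square tasks below all rely on the standard Pearson statistic. The symbols \(O_i\) and \(E_i\) (observed and expected counts in category \(i\)) and \(k\) (number of categories) are notation assumed here for illustration:

\[
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}
\]

Under the null hypothesis this statistic approximately follows a chi-square distribution with \(k - 1\) degrees of freedom for a goodness-of-fit test, or \((r - 1)(c - 1)\) degrees of freedom for an independence test on an \(r \times c\) table.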

ANOVA is another statistical technique used to study variability in a continuous response variable under different conditions defined by classification factors. It evaluates equality of means by comparing variation between groups to variation within groups. ANOVA is widely used because it allows analyzing the impact of categorical independent variables on a continuous dependent variable. It helps determine if changes in the independent categorical variables lead to significant differences in the continuous response variable (Larson, 2008).
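
In the usual notation (symbols assumed here, not taken from the report), with \(k\) groups and \(N\) observations in total, the one-way ANOVA test statistic is the ratio of the between-group and within-group mean squares:

\[
F = \frac{MS_B}{MS_W} = \frac{SS_B / (k - 1)}{SS_W / (N - k)}
\]

A large \(F\) indicates that the group means vary more than within-group variation alone would explain.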

Task 1 (Section 11-1 6. Blood Types)

A medical researcher wishes to see if hospital patients in a large hospital have the same blood type distribution as those in the general population. The distribution for the general population is as follows:

Type A = 20%

Type B = 28%

Type O = 36%

Type AB = 16%.

He selects a random sample of 50 patients and finds the following:

Type A = 12

Type B = 8

Type O = 24

Type AB = 6

At α = 0.10, can it be concluded that the distribution is the same as that of the general population?

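# Packages used in this report (assumed to be loaded in a hidden setup chunk)
library(knitr)       # kable()
library(kableExtra)  # kable_styling(), footnote()
library(magrittr)    # %>%
library(ggplot2)     # ggplot()

############Distribution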
Type_A = 0.20
Type_B = 0.28
Type_O = 0.36 
Type_AB = 0.16

blood_prob = c(Type_A, Type_B, Type_O, Type_AB)

#number of observations
n = 50

##############Expected Count

TypeA_exp = Type_A * n
TypeB_exp = Type_B * n
TypeO_exp = Type_O * n
TypeAB_exp = Type_AB * n

############Observed count
TypeA_obs = 12
TypeB_obs = 8
TypeO_obs = 24
TypeAB_obs = 6

Solution

R Wrapped \(\chi^2\) Test

The vectors of expected and observed counts are as follows:

exp_blood_ct = c(TypeA_exp, TypeB_exp, TypeO_exp, TypeAB_exp)
obs_blood_ct = c(TypeA_obs, TypeB_obs, TypeO_obs, TypeAB_obs)

State the hypotheses and identify the claim.

Null hypothesis \(\it{H_{0}}\): \({ 𝑃(A)=0.20, 𝑃(B)=0.28, 𝑃(O)=0.36, 𝑃(AB)=0.16 }\)
Alternative hypothesis \(\it{H_{1}}\): The distribution is not the same as stated in the null hypothesis.

Find the critical value.

alpha_t1 = 0.1
conf_level_t1 = 1- alpha_t1

critical_value_t1 <- qchisq(1 - alpha_t1, df = length(obs_blood_ct) - 1)
critical_value_t1

## [1] 6.251389

Compute the test value.

######################X^2 value

ch_sqrt1 = chisq.test(x = obs_blood_ct, p = blood_prob)
ch_sqrt1

## 
##  Chi-squared test for given probabilities
## 
## data:  obs_blood_ct
## X-squared = 5.4714, df = 3, p-value = 0.1404
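
As a cross-check on the wrapped test, the statistic can be recomputed directly from the observed and expected counts. This is a minimal sketch; the names `observed`, `probs`, `expected`, `chi_sq_manual` and `p_manual` are illustrative and do not appear in the report's own code.

```r
# Observed counts and hypothesized proportions from Task 1
observed <- c(A = 12, B = 8, O = 24, AB = 6)
probs    <- c(A = 0.20, B = 0.28, O = 0.36, AB = 0.16)

# Expected counts under H0: n * p
expected <- sum(observed) * probs

# Pearson chi-square statistic: sum of (O - E)^2 / E
chi_sq_manual <- sum((observed - expected)^2 / expected)

# Upper-tail p-value with k - 1 degrees of freedom
p_manual <- pchisq(chi_sq_manual, df = length(observed) - 1, lower.tail = FALSE)

round(chi_sq_manual, 4)
round(p_manual, 4)
```

Both values should agree with the chisq.test() output above (X-squared = 5.4714, p-value = 0.1404).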

Make the decision.

T1_testresult = ifelse(ch_sqrt1$p.value > alpha_t1, "Do not Reject the Null Hypothesis (Pvalue > alpha)","Reject the Null Hypothesis (Pvalue <= alpha)")

T1_testresult

## [1] "Do not Reject the Null Hypothesis (Pvalue > alpha)"

Table showing Chi Square test

##########TABLE FORMAT

blood_count = cbind(exp_blood_ct, obs_blood_ct)
blood_type = c("Blood Type A", "Blood Type B", "Blood Type O", "Blood Type AB")
col_n = c("Expected "," Observed")
dimnames(blood_count) = list(blood_type,col_n)

T1_table = rbind(conf_level_t1*100, alpha_t1,
                 round(ch_sqrt1$statistic,3),     
                 round(ch_sqrt1$p.value,3),
                 T1_testresult)
col_n1 = c("Hypothesis Result")
row_n1 = c("Confidence Level(%):", 
           "Alpha:", 
           "X^2 Value:",
           "p value:",
           "Hypothesis Result:")
dimnames(T1_table) = list(row_n1,col_n1)

kable(list(blood_count,T1_table),
      caption = "<center>Chi-Square Goodness of Fit Test</center>",align = "c",
      booktabs = TRUE)%>%
kable_styling(bootstrap_options = c("bordered"),
                                  font_size = 12) %>%

footnote(general = "H0:P(A)=0.20, P(B)=0.28,P(O)=0.36,P(AB)=0.16 and H1: The distribution is not the same as stated in the null hypothesis \n")

Chi-Square Goodness of Fit Test

	Expected	Observed
Blood Type A	10	12
Blood Type B	14	8
Blood Type O	18	24
Blood Type AB	8	6

	Hypothesis Result
Confidence Level(%):	90
Alpha:	0.1
X^2 Value:	5.471
p value:	0.14
Hypothesis Result:	Do not Reject the Null Hypothesis (Pvalue > alpha)

Note:

H0:P(A)=0.20, P(B)=0.28,P(O)=0.36,P(AB)=0.16 and H1: The distribution is not the same as stated in the null hypothesis

Observations to Task_1:

In this task, we have conducted a test to assess the proportions of different blood groups. We have been provided with a sample of 50 patients and compared the observed blood samples with the expected distribution to investigate whether the proportions of blood types match those in the population. Since this is a proportion test, we opted for the Chi-Square Goodness-of-Fit test.

The expected population proportions are as follows: Blood group A (20%), Blood group B (28%), Blood group O (36%), and Blood group AB (16%). The objective of hypothesis testing was to determine if the expected proportions align with the observed blood group distribution.

The null hypothesis states that the blood type proportions among the hospital's patients match those of the general population, while the alternative hypothesis states that the distribution differs. The Chi-Square Goodness-of-Fit test was conducted using chisq.test(), yielding a test statistic of X^2 = 5.471 (df = 3) and a p-value of 0.1404.

Because the p-value exceeds the significance level (α = 0.10), and equivalently the test statistic 5.471 is below the critical value 6.251, we do not reject the null hypothesis. There is not enough evidence to conclude that the blood type distribution of the hospital's patients differs from that of the general population.

Task 2 (Section 11-1. 8. On-Time Performance by Airlines)

According to the Bureau of Transportation Statistics, on-time performance by the airlines is described as follows:

Action	% of Time
On time	70.8
National Aviation System delay	8.2
Aircraft arriving late	9.0
Other (because of weather and other conditions)	12.0

Records of 200 randomly selected flights for a major airline company showed that 125 planes were on time; 40 were delayed because of weather, 10 because of a National Aviation System delay, and the rest because of arriving late. At α = 0.05, do these results differ from the government’s statistics?

#########Distribution
On_time = 0.708
NA_System_delay = 0.082
arriving_late = 0.09 
other = 0.12

Flight_prob = c(On_time, NA_System_delay, arriving_late, other)
n1 = 200

########Expected Count

On_time_Exp = On_time * n1
NA_System_delay_Exp = NA_System_delay * n1
arriving_late_Exp = arriving_late * n1
othr_Exp = other * n1

#########Observed count
On_time_obs = 125
NA_System_delay_obs = 10
arriving_late_obs = 25
othr_obs = 40

Solution

R Wrapped \(\chi^2\) Test

The vectors of expected and observed flight counts are as follows:

Exp_Flight_count = round(c(On_time_Exp,
                    NA_System_delay_Exp,
                    arriving_late_Exp,
                    othr_Exp),0)

Obs_flight_count = c(On_time_obs,
                    NA_System_delay_obs,
                    arriving_late_obs,
                    othr_obs)

State the hypotheses and identify the claim.

Null hypothesis \(\it{H_{0}}\): \({ P(OT)=0.708, P(SD)=0.082, P(AL)=0.09, P(O)=0.12 }\)
Alternative hypothesis \(\it{H_{1}}\): The distribution is not the same as stated in the null hypothesis.

Find the critical value.

alpha_t2 = 0.05
conf_level_t2 = 1- alpha_t2
critical_value_t2 <- qchisq(1 - alpha_t2, df = length(Obs_flight_count) - 1)
critical_value_t2

## [1] 7.814728

Compute the test value.

ch_sqrt_t2 = chisq.test(x=Obs_flight_count, p=Flight_prob)

Make the decision.

T2_testresult_2 = ifelse(ch_sqrt_t2$p.value > alpha_t2, "Do not Reject the Null Hypothesis (Pvalue > alpha)","Reject the Null Hypothesis (Pvalue <= alpha)")

T2_testresult_2

## [1] "Reject the Null Hypothesis (Pvalue <= alpha)"
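
To see which delay category drives the rejection, the statistic can be broken down by category. The sketch below uses the `expected` and `residuals` components that chisq.test() returns; the object names (`observed`, `probs`, `fit`, `contrib`) are illustrative assumptions rather than names from the report.

```r
# Observed flight counts and government proportions from Task 2
observed <- c(on_time = 125, nas_delay = 10, arriving_late = 25, other = 40)
probs    <- c(on_time = 0.708, nas_delay = 0.082, arriving_late = 0.09, other = 0.12)

fit <- chisq.test(x = observed, p = probs)

# Unrounded expected counts used by the test
fit$expected

# Contribution of each category to the statistic: (O - E)^2 / E
contrib <- fit$residuals^2
round(contrib, 3)

# Position of the category that contributes most to the statistic
which.max(contrib)
```

The contributions sum to the reported X^2 value, and the largest one belongs to the "Other" (weather and other conditions) category, where 40 delays were observed against 24 expected.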

Table showing Chi Square test

flight_count = cbind(Exp_Flight_count,
                    Obs_flight_count)
record_type = c("On time", 
               "National Aviation System delay",
               "Aircraft arriving late",
               "Other (because of weather and other conditions)")
col_n_1 = c("Expected","Observed")
dimnames(flight_count) = list(record_type,col_n_1)


T2_1hypores_1 = rbind(conf_level_t2*100, alpha_t2, round(ch_sqrt_t2$statistic,3), round(ch_sqrt_t2$p.value,6), T2_testresult_2)
col_n1_1 = c("Hypothesis Result")
row_n1_1 = c("Confidence Level(%):", "Alpha:", "X^2 Value:","p value:","Hypothesis Result:")
dimnames(T2_1hypores_1) = list(row_n1_1,col_n1_1)


kable(
  list(flight_count,T2_1hypores_1),
        caption = "<center>Chi-Square Goodness-of-Fit Test</center>",
        align = "c",
       booktabs = TRUE)%>%
    kable_styling(bootstrap_options = c("hover",
                                        "bordered"),
                font_size = 12) %>%
  footnote(general = "𝐻0:P_OT=0.708,P_SD=0.082,P_AL=0.09,P_O=0.12 and \n H1: The distribution is not the same as stated in the null hypothesis")

Chi-Square Goodness-of-Fit Test

	Expected	Observed
On time	142	125
National Aviation System delay	16	10
Aircraft arriving late	18	25
Other (because of weather and other conditions)	24	40

	Hypothesis Result
Confidence Level(%):	95
Alpha:	0.05
X^2 Value:	17.832
p value:	0.000476
Hypothesis Result:	Reject the Null Hypothesis (Pvalue <= alpha)

Note:

𝐻0:P_OT=0.708,P_SD=0.082,P_AL=0.09,P_O=0.12 and
H1: The distribution is not the same as stated in the null hypothesis

Observations to Task_2:

In this task, we examined the on-time performance of a major airline against figures from the Bureau of Transportation Statistics. The government's distribution, which supplies the expected proportions, is: On time (70.8%), National Aviation System delay (8.2%), Aircraft arriving late (9.0%), and Other (weather and other conditions, 12.0%).

The chisq.test() result gives strong evidence against the null hypothesis: the p-value of \(4.7625874 \times 10^{-4}\) is below α = 0.05, and the test statistic X^2 = 17.832 exceeds the critical value 7.815. We therefore reject the null hypothesis and conclude that the airline's performance distribution differs significantly from the government's statistics.

Task 3 (Section 11-2.8. Ethnicity and Movie Admissions )

Are movie admissions related to ethnicity? A 2014 study indicated the following numbers of admissions (in thousands) for two different years.

Year Caucasian Hispanic African American Other

2013 724 335 174 107

2014 370 292 152 140

At the 0.05 level of significance, can it be concluded that movie attendance by year was dependent upon ethnicity?

#Data presented in Matrix

r1 = c(724, 335, 174, 107)
r2 = c(370, 292, 152, 140)

#row count
rowno = 2

#matrix 

matrix1_2 = matrix(c(r1, r2), nrow = rowno, byrow = TRUE)

Solution

R Wrapped \(\chi^2\) Test

rownames(matrix1_2) = c("2013","2014")
colnames(matrix1_2) = c("Caucasian", "Hispanic", "African_American", "Other")

State the hypotheses and identify the claim.

Null hypothesis \(\it{H_{0}}\): Movie attendance by year was independent of ethnicity.
Alternative hypothesis \(\it{H_{1}}\): Movie attendance by year was dependent on ethnicity.

Find the critical value.

alpha_t3 = 0.05
conf_level_t3 = 1- alpha_t3
# Degrees of freedom for an independence test: (rows - 1) * (columns - 1)
critical_value_t3 <- qchisq(1 - alpha_t3, df = (nrow(matrix1_2) - 1) * (ncol(matrix1_2) - 1))
critical_value_t3

## [1] 7.814728

Compute the test value.

ch_sqrt1_3 = chisq.test(matrix1_2)

Make the decision.

T3_testresult_3 = ifelse(ch_sqrt1_3$p.value > alpha_t3,"Do not Reject the Null Hypothesis (Pvalue > alpha)","Reject the Null Hypothesis (Pvalue <= alpha)") 

T3_testresult_3

## [1] "Reject the Null Hypothesis (Pvalue <= alpha)"
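
The degrees of freedom and expected counts behind an independence test can be checked by hand. The following is a sketch under the assumption that the table is entered as in Task 3; `admissions`, `df_indep`, `expected` and `fit` are illustrative names.

```r
# 2 x 4 contingency table from Task 3 (admissions in thousands)
admissions <- matrix(c(724, 335, 174, 107,
                       370, 292, 152, 140),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("2013", "2014"),
                                     c("Caucasian", "Hispanic",
                                       "African_American", "Other")))

# Degrees of freedom for an r x c table: (r - 1) * (c - 1)
df_indep <- (nrow(admissions) - 1) * (ncol(admissions) - 1)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(admissions), colSums(admissions)) / sum(admissions)

fit <- chisq.test(admissions)

df_indep                      # degrees of freedom computed by hand
unname(fit$parameter)         # degrees of freedom reported by chisq.test()
all.equal(expected, fit$expected, check.attributes = FALSE)
qchisq(0.95, df = df_indep)   # critical value at alpha = 0.05
```

For a 2 x 4 table this gives 3 degrees of freedom, so the statistic is compared against the chi-square critical value for df = 3.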

Table showing Chi Square test

T3_hypores_3 = rbind(conf_level_t3*100, alpha_t3, round(ch_sqrt1_3$statistic,3), round(ch_sqrt1_3$p.value,14), T3_testresult_3)
col_n1_2 = c("Hypothesis Result")
row_n1_2 = c("Confidence Level(%):", "Alpha:", "X^2 Value:","p value:","Hypothesis Result:")
dimnames(T3_hypores_3) = list(row_n1_2,col_n1_2)

kable(
  list(matrix1_2,T3_hypores_3),
        caption = "<center>Chi-Square Independence Test</center>",
        align = "c",
       booktabs = TRUE)%>%
    kable_styling(bootstrap_options = c("hover",
                                        "bordered"),
                font_size = 11) %>%
  footnote(general = "H0: Movie attendance by year was independent upon ethnicity and \n H1: Movie attendance by year was dependent upon ethnicity")

Chi-Square Independence Test

	Caucasian	Hispanic	African_American	Other
2013	724	335	174	107
2014	370	292	152	140

	Hypothesis Result
Confidence Level(%):	95
Alpha:	0.05
X^2 Value:	60.144
p value:	5.5e-13
Hypothesis Result:	Reject the Null Hypothesis (Pvalue <= alpha)

Note:

H0: Movie attendance by year was independent upon ethnicity and
H1: Movie attendance by year was dependent upon ethnicity

Observations to Task_3:

For this assignment, we analyzed movie attendance data spanning two years, including demographic information about the attendees. The hypothesis under consideration posited that movie attendance by year was contingent on ethnicity. To investigate this, we used the Chi-Square Independence test.

Running chisq.test() gave a p-value of \(5.4775074 \times 10^{-13}\), which is below the significance level α = 0.05. Consequently, we reject the null hypothesis: there is strong evidence that movie attendance by year is dependent on ethnicity.

Task 4 (Section 11-2.10 Women in the Military)

	Officers	Enlisted
Army	10,791	62,491
Navy	7,816	42,750
Marine Corps	932	9,525
Air Force	11,819	54,344

This table lists the numbers of officers and enlisted personnel for women in the military.

At α = 0.05, is there sufficient evidence to conclude that a relationship exists between rank and branch of the Armed Forces?

#Data presented in Matrix

rm1 = c(10791, 62491)
rm2 = c(7816, 42750)
rm3 = c(932, 9525)
rm4 = c(11819, 54344)

#row count
rowno_2 = 4

#matrix 

matrixt4 = matrix(c(rm1, rm2, rm3, rm4), nrow = rowno_2, byrow = TRUE)

Solution

R Wrapped \(\chi^2\) Test

rownames(matrixt4) = c("Army","Navy","Marine_Corps", "Air_Force")
colnames(matrixt4) = c("Officers", "Enlisted")

State the hypotheses and identify the claim.

Null hypothesis \(\it{H_{0}}\): Rank achieved is independent of the branch of the Armed Forces.
Alternative hypothesis \(\it{H_{1}}\): Rank achieved is dependent on the branch of the Armed Forces.

Find the critical value.

alpha_t4 = 0.05
conf_level_t4 = 1- alpha_t4
# Degrees of freedom for an independence test: (rows - 1) * (columns - 1)
critical_value_t4 <- qchisq(1 - alpha_t4, df = (nrow(matrixt4) - 1) * (ncol(matrixt4) - 1))
critical_value_t4

## [1] 7.814728

Compute the test value.

ch_sqrt1_4 = chisq.test(matrixt4)

Make the decision.

T4_testresult_4 = ifelse(ch_sqrt1_4$p.value > alpha_t4,"Do not Reject the Null Hypothesis (Pvalue > alpha)","Reject the Null Hypothesis (Pvalue <= alpha)") 

T4_testresult_4

## [1] "Reject the Null Hypothesis (Pvalue <= alpha)"

Table showing Chi Square test

T4_hypores_4 = rbind(conf_level_t4*100, alpha_t4, round(ch_sqrt1_4$statistic,3), round(ch_sqrt1_4$p.value,14), T4_testresult_4)
col_n4 = c("Hypothesis Result")
row_n4 = c("Confidence Level(%):", "Alpha:", "X^2 Value:","p value:","Hypothesis Result:")
dimnames(T4_hypores_4) = list(row_n4,col_n4)

kable(
  list(matrixt4,T4_hypores_4),
        caption = "<center>Chi-Square Independence Test</center>",
        align = "c",
       booktabs = TRUE)%>%
    kable_styling(bootstrap_options = c("bordered"),
                font_size = 12) %>%
  footnote(general = "H0: Rank achieved is independent of branch of the Armed Forces and \n H1: Rank achieved is dependent of branch of the Armed Forces.")

Chi-Square Independence Test

	Officers	Enlisted
Army	10791	62491
Navy	7816	42750
Marine_Corps	932	9525
Air_Force	11819	54344

	Hypothesis Result
Confidence Level(%):	95
Alpha:	0.05
X^2 Value:	654.272
p value:	0
Hypothesis Result:	Reject the Null Hypothesis (Pvalue <= alpha)

Note:

H0: Rank achieved is independent of branch of the Armed Forces and
H1: Rank achieved is dependent of branch of the Armed Forces.

Observations to Task_4:

For this task, we explored the potential relationship between rank and branch within the Armed Forces using the Chi-Square Independence test. We entered the data into a matrix row by row and passed it to chisq.test(), which yielded a p-value of \(1.726418 \times 10^{-141}\), far below the significance level (α = 0.05); the results table shows 0 only because the value is rounded to 14 decimal places. We therefore reject the null hypothesis: there is strong evidence that rank is associated with the branch of the Armed Forces.
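
A very small p-value is easier to report if it is formatted rather than rounded. The sketch below shows two common options in base R; the variable name `p_value` and the 0.001 reporting threshold are assumptions made for illustration.

```r
# p-value reported for Task 4 (taken from the text above)
p_value <- 1.726418e-141

# round() to 14 decimals collapses a tiny p-value to 0
round(p_value, 14)

# Scientific notation keeps the magnitude visible
p_sci <- formatC(p_value, format = "e", digits = 2)
p_sci

# A common reporting convention: floor very small values at a threshold
p_label <- ifelse(p_value < 0.001, "< 0.001", format(round(p_value, 3), nsmall = 3))
p_label
```

Either string could replace the rounded value in the results table so that the p-value is not displayed as exactly 0.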

Task 5 (Section 12-1.8 Sodium Contents of Foods)

Condiments	Cereals	Desserts
270	260	100
130	220	180
230	290	250
180	290	250
80	200	300
70	320	360
200	140	300
		160

The amount of sodium (in milligrams) in one serving for a random sample of three different kinds of foods is listed.

At the 0.05 level of significance, is there sufficient evidence to conclude that a difference in mean sodium amounts exists among condiments,cereals, and desserts?

Condiments = data.frame("Sodium" = c(270, 130, 230, 180, 80, 70, 200), 
                        "Food" = rep("Condiments", 7), stringsAsFactors = FALSE)

Cereals = data.frame("Sodium" = c(260, 220, 290, 290, 200, 320, 140),
                     "Food" = rep("Cereals", 7), stringsAsFactors = FALSE)

Desserts = data.frame("Sodium" = c(100, 180, 250, 250, 300, 360, 300, 160),
                      "Food" = rep("Desserts", 8), stringsAsFactors = FALSE)

Solution

R Wrapped ANOVA (One Way) Test

Sodium = rbind(Condiments, Cereals, Desserts)
Sodium$Food = as.factor(Sodium$Food)

State the hypotheses and identify the claim.

Null hypothesis \(\it{H_{0}}\): 𝜇1=𝜇2=𝜇3
Alternative hypothesis \(\it{H_{1}}\): At least one mean is different from others

The alpha value

alpha_t5 = 0.05
conf_level_t5 = 1- alpha_t5

Compute the test value.

##ANOVA Test######

anova_t5 = aov(Sodium ~ Food, data = Sodium)

anovasum = summary(anova_t5)
anovasum

##             Df Sum Sq Mean Sq F value Pr(>F)
## Food         2  27544   13772   2.399  0.118
## Residuals   19 109093    5742

Find the critical value.

anofval = anovasum[[1]][1, "F value"]
anofval

## [1] 2.398538

DofN = anovasum[[1]][1, "Df"] #k-1
DofD = anovasum[[1]][2, "Df"] #N-k

CV2_1 = qf(alpha_t5, DofN, DofD, lower.tail = FALSE) #Critical value 

CV2_1

## [1] 3.521893

Make the decision.

anovhypores_t5 = ifelse(anofval < CV2_1, "Do not reject the Null Hypothesis (Fvalue < CV)", "Reject the Null Hypothesis (Fvalue > CV)" )

anovhypores_t5

## [1] "Do not reject the Null Hypothesis (Fvalue < CV)"
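
The critical-value rule used above and the p-value rule are two views of the same decision. A minimal sketch, assuming the sodium values as entered in the data frames above (the names `sodium`, `fit`, `p_from_f`, `reject_by_cv` and `reject_by_p` are illustrative):

```r
# Sodium data from Task 5 (values as entered in the data frames above)
sodium <- data.frame(
  Sodium = c(270, 130, 230, 180, 80, 70, 200,
             260, 220, 290, 290, 200, 320, 140,
             100, 180, 250, 250, 300, 360, 300, 160),
  Food = factor(rep(c("Condiments", "Cereals", "Desserts"), times = c(7, 7, 8)))
)

fit <- summary(aov(Sodium ~ Food, data = sodium))[[1]]

f_value <- fit[1, "F value"]
df_num  <- fit[1, "Df"]   # k - 1
df_den  <- fit[2, "Df"]   # N - k

# Upper-tail area of the F distribution beyond the observed statistic
p_from_f <- pf(f_value, df_num, df_den, lower.tail = FALSE)

# Both decision rules should lead to the same conclusion
alpha <- 0.05
reject_by_cv <- f_value > qf(alpha, df_num, df_den, lower.tail = FALSE)
reject_by_p  <- p_from_f < alpha
c(reject_by_cv = reject_by_cv, reject_by_p = reject_by_p)
```

Because the critical value is the point that leaves an upper-tail area of alpha, the F-value exceeds it exactly when the p-value falls below alpha.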

sodfoodstat <- function(y, uplim = max(Sodium$Sodium) * 1.15) {
  return(data.frame(
    y = 0.95 * uplim,
    label = paste(
      "Count =", length(y), "\n",
      "Mean =", round(mean(y), 2), "\n"
    )
  ))
}

ggplot(Sodium, aes(x = Food, y = Sodium, fill = Food)) + 
  geom_boxplot() +
  stat_summary( 
               fun.data = sodfoodstat, 
               geom = "text", hjust = 0.5, vjust = 0.9, size = 2) +
  labs(
    title = "Sodium Contents of Food",
    caption = "Source: The Doctor's Pocket Calorie, Fat, and Carbohydrate Counter",
    x = "Food",
    y = "Sodium/serving (in milligrams)"
  ) +
  theme_classic()+
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.caption = element_text(size = 5)
  )

Table showing ANOVA test

T5_hypores_5 = rbind(conf_level_t5*100, alpha_t5, round(CV2_1,3), round(anofval,3), anovhypores_t5)
col_n5 = c("Hypothesis Result")
row_n5 = c("Confidence Level(%):", "Alpha:", "CV:","F value:","Hypothesis Result:")
dimnames(T5_hypores_5) = list(row_n5, col_n5)

kable(T5_hypores_5,
        caption = "<center>One way ANOVA Test</center>",
        align = "c",
       booktabs = TRUE)%>%
    kable_styling(bootstrap_options = c("bordered"),
                font_size = 12) %>%
  footnote(general = "𝐻0:𝜇1=𝜇2=𝜇3 and \n 𝐻1: At least one mean is different from others")

One way ANOVA Test
	Hypothesis Result
Confidence Level(%):	95
Alpha:	0.05
CV:	3.522
F value:	2.399
Hypothesis Result:	Do not reject the Null Hypothesis (Fvalue < CV)
Note:
𝐻0:𝜇1=𝜇2=𝜇3 and 𝐻1: At least one mean is different from others

if (summary(anova_t5)[[1]]$'Pr(>F)'[1] < 0.05) {
  tukey_test_5 <- TukeyHSD(anova_t5)
  print(tukey_test_5)

  # Plot the Tukey's HSD test results
  plot(tukey_test_5)
} else {
  cat("Since the ANOVA result is not statistically significant, Tukey's HSD test is not conducted.\n")
}

## Since the ANOVA result is not statistically significant, Tukey's HSD test is not conducted.

Observations to Task_5:

This task examines whether the mean sodium content (in milligrams per serving) differs among three categories of food: condiments, cereals, and desserts. The One-Way ANOVA gives an F-value of 2.399 against a critical value of 3.522. Because the F-value is less than the critical value (equivalently, the p-value of 0.118 exceeds α = 0.05), we do not reject the null hypothesis.

In summary, there is not enough evidence to conclude that mean sodium amounts differ among condiments, cereals, and desserts.

Task 6 (Section 12-2.10 Sales for Leading Companies)

Cereal	Chocolate Candy	Coffee
578	311	261
320	106	185
264	109	302
249	125	689
237	173

Perform a complete one-way ANOVA. If the null hypothesis is rejected, use either the Scheffé or Tukey test to see if there is a significant difference in the pairs of means. Assume all assumptions are met.

The sales in millions of dollars for a year of a sample of leading companies are shown.

At α = 0.01, is there a significant difference in the means?

Cereal = data.frame("Sales" = c(578, 320, 264, 249, 237), "Food" = rep("Cereal",5), stringsAsFactors = FALSE)

Choco_candy = data.frame("Sales" = c(311, 106, 109, 125, 173), "Food" = rep("Chocolate_candy", 5), stringsAsFactors = FALSE)

Coffee = data.frame("Sales" = c(261, 185, 302, 689), "Food" = rep("Coffee", 4), stringsAsFactors = FALSE)

Solution

R Wrapped ANOVA (One Way) Test

Sales = rbind(Cereal, Choco_candy, Coffee) 
Sales$Food = as.factor(Sales$Food)

State the hypotheses and identify the claim.

Null hypothesis \(\it{H_{0}}\): 𝜇1=𝜇2=𝜇3
Alternative hypothesis \(\it{H_{1}}\): At least one mean is different from the others.

The alpha value

alpha_t6 = 0.01
conf_level_t6 = 1- alpha_t6

Compute the test value.

##ANOVA Test######
anova_t6 = aov(Sales ~ Food, data = Sales)
 
anovasum_t6 = summary(anova_t6)

anovasum_t6

##             Df Sum Sq Mean Sq F value Pr(>F)
## Food         2 103770   51885   2.172   0.16
## Residuals   11 262795   23890

Find the critical value.

anofval_t6 = anovasum_t6[[1]][1, "F value"]
anofval_t6

## [1] 2.171782

DofN = anovasum_t6[[1]][1, "Df"] #k-1
DofD = anovasum_t6[[1]][2, "Df"] #N-k

CV2_t6 = qf(alpha_t6, DofN, DofD, lower.tail = FALSE) #Critical value 

CV2_t6

## [1] 7.205713

Make the decision.

anovhypores_t6 = ifelse(anofval_t6 < CV2_t6, "Do not reject the Null Hypothesis (Fvalue < CV)", "Reject the Null Hypothesis (Fvalue > CV)" )

anovhypores_t6

## [1] "Do not reject the Null Hypothesis (Fvalue < CV)"
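
The F-value reported by aov() can be reproduced from the group means alone. The following sketch computes the between-group and within-group sums of squares by hand; all object names (`sales`, `group`, `ss_between`, `ss_within`, `f_manual`) are illustrative assumptions.

```r
# Sales data from Task 6 (millions of dollars)
sales <- c(578, 320, 264, 249, 237,   # Cereal
           311, 106, 109, 125, 173,   # Chocolate candy
           261, 185, 302, 689)        # Coffee
group <- factor(rep(c("Cereal", "Chocolate_candy", "Coffee"), times = c(5, 5, 4)))

n_i   <- tapply(sales, group, length)  # group sizes
m_i   <- tapply(sales, group, mean)    # group means
grand <- mean(sales)                   # grand mean

# Between-group SS: weighted squared distance of group means from the grand mean
ss_between <- sum(n_i * (m_i - grand)^2)

# Within-group SS: squared distance of each value from its own group mean
ss_within <- sum((sales - ave(sales, group))^2)

k <- nlevels(group)
N <- length(sales)

# F = MS_between / MS_within
f_manual <- (ss_between / (k - 1)) / (ss_within / (N - k))
round(c(SS_between = ss_between, SS_within = ss_within, F = f_manual), 3)
```

The two sums of squares should match the "Sum Sq" column of the ANOVA summary above, and their ratio of mean squares should match the reported F value.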

salesfoodstat <- function(y, uplim = max(Sales$Sales) * 1.15) {
  return(data.frame(
    y = 0.95 * uplim,
    label = paste(
      "Count =", length(y), "\n",
      "Mean =", round(mean(y), 2), "\n"
    )
  ))
}

ggplot(Sales, aes(x = Food, y = Sales, fill = Food)) + 
  geom_boxplot() +
  stat_summary( 
               fun.data = salesfoodstat, 
               geom = "text", hjust = 0.5, vjust = 0.9, size = 2) +
  labs(
    title = "Sales for Leading Companies",
    caption = "Source: Information Resources, Inc",
    x = "Food",
    y = "Sales (millions of USD)"
  ) +
  theme_classic()+
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.caption = element_text(size = 5)
  )

Table showing ANOVA test

T6_hypores_6 = rbind(conf_level_t6*100, alpha_t6, round(CV2_t6,3), round(anofval_t6,3), anovhypores_t6)
col_n6 = c("Hypothesis Result")
row_n6 = c("Confidence Level(%):", "Alpha:", "CV:","F value:","Hypothesis Result:")
dimnames(T6_hypores_6) = list(row_n6, col_n6)

kable(T6_hypores_6,
        caption = "<center>One way ANOVA Test</center>",
        align = "c",
       booktabs = TRUE)%>%
    kable_styling(bootstrap_options = c("bordered"),
                font_size = 12) %>%
  footnote(general = "H0:𝜇1=𝜇2=𝜇3 and \n H1: At least one mean is different from others.")

One way ANOVA Test
	Hypothesis Result
Confidence Level(%):	99
Alpha:	0.01
CV:	7.206
F value:	2.172
Hypothesis Result:	Do not reject the Null Hypothesis (Fvalue < CV)
Note:
H0:𝜇1=𝜇2=𝜇3 and H1: At least one mean is different from others.

if (summary(anova_t6)[[1]]$'Pr(>F)'[1] < 0.01) {
  tukey_test_6 <- TukeyHSD(anova_t6)
  print(tukey_test_6)

  # Plot the Tukey's HSD test results
  plot(tukey_test_6)
} else {
  cat("Since the ANOVA result is not statistically significant, Tukey's HSD test is not conducted.\n")
}

## Since the ANOVA result is not statistically significant, Tukey's HSD test is not conducted.

Observations to Task_6:

In this task, we conducted a One-way ANOVA to compare mean sales across three product categories, using a sample of leading companies. The categories are Cereal, Chocolate_candy and Coffee, with sales reported in millions of dollars for one year. The objective is to test whether the category means differ at the 0.01 significance level.

The ANOVA test is performed using the aov() function, and the F-value, numerator degrees of freedom (DofN), and denominator degrees of freedom (DofD) are extracted from its summary. We then computed the critical value to compare it with the F-value; since Fvalue < CV (2.172 < 7.206), the decision is not to reject the null hypothesis.

To sum up, there is not enough evidence to conclude that mean sales differ among the three product categories, so no Scheffé or Tukey follow-up test is needed.

Task 7 (Section 12-2.12 Per-Pupil Expenditures)

Eastern third	Middle third	Western third
4946	6149	5282
5953	7451	8605
6202	6000	6528
7243	6479	6911
6113

Perform a complete one-way ANOVA. If the null hypothesis is rejected, use either the Scheffé or Tukey test to see if there is a significant difference in the pairs of means. Assume all assumptions are met.

The expenditures (in dollars) per pupil for states in three sections of the country are listed.

Using α = 0.05, can you conclude that there is a difference in means?

Eastern = data.frame("Expenditure" = c(4946, 5953, 6202, 7243, 6113), 
                     "States" = rep("Eastern", 5), stringsAsFactors = FALSE)

Middle = data.frame("Expenditure" = c(6149, 7451, 6000, 6479), 
                    "States" = rep("Middle", 4), stringsAsFactors = FALSE)

Western = data.frame("Expenditure" = c(5282, 8605, 6528, 6911),
                     "States" = rep("Western", 4), stringsAsFactors = FALSE)

Solution

R Wrapped ANOVA (One Way) Test

Expenditure = rbind(Eastern, Middle, Western)
Expenditure$States = as.factor(Expenditure$States)

State the hypotheses and identify the claim.

Null hypothesis \(\it{H_{0}}\): 𝜇1=𝜇2=𝜇3
Alternative hypothesis \(\it{H_{1}}\): At least one mean is different from the others.

The alpha value

alpha_t7 = 0.05
conf_level_t7 = 1- alpha_t7

Compute the test value.

##ANOVA Test######
anova_t7 = aov(Expenditure ~ States, data = Expenditure)
 
anovasum_t7 = summary(anova_t7)

anovasum_t7

##             Df  Sum Sq Mean Sq F value Pr(>F)
## States       2 1244588  622294   0.649  0.543
## Residuals   10 9591145  959114

Find the critical value.

anofval_t7 = anovasum_t7[[1]][1, "F value"]
anofval_t7

## [1] 0.6488214

DofN = anovasum_t7[[1]][1, "Df"] #k-1
DofD = anovasum_t7[[1]][2, "Df"] #N-k

CV2_t7 = qf(alpha_t7, DofN, DofD, lower.tail = FALSE) #Critical value 

CV2_t7

## [1] 4.102821

Make the decision.

anovhypores_t7 = ifelse(anofval_t7 < CV2_t7, "Do not reject the Null Hypothesis (Fvalue < CV)", "Reject the Null Hypothesis (Fvalue > CV)" )

anovhypores_t7

## [1] "Do not reject the Null Hypothesis (Fvalue < CV)"
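
As an independent cross-check, base R's oneway.test() with equal variances assumed fits the same classical one-way ANOVA as aov(). This is a sketch only; `expenditure` and `classic` are illustrative names, and the data are re-entered so the block runs on its own.

```r
# Per-pupil expenditure data from Task 7 (dollars)
expenditure <- data.frame(
  Expenditure = c(4946, 5953, 6202, 7243, 6113,   # Eastern third
                  6149, 7451, 6000, 6479,         # Middle third
                  5282, 8605, 6528, 6911),        # Western third
  States = factor(rep(c("Eastern", "Middle", "Western"), times = c(5, 4, 4)))
)

# Classical one-way ANOVA, assuming equal variances as the task instructs
classic <- oneway.test(Expenditure ~ States, data = expenditure, var.equal = TRUE)

unname(classic$statistic)   # F value, should match aov()
unname(classic$parameter)   # numerator and denominator degrees of freedom
classic$p.value             # should match Pr(>F) from the aov() summary
```

Agreement between the two functions gives some reassurance that the data were entered and grouped as intended.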

stateexpenstat <- function(y, uplim = max(Expenditure$Expenditure) * 1.15) {
  return(data.frame(
    y = 0.95 * uplim,
    label = paste(
      "Count =", length(y), "\n",
      "Mean =", round(mean(y), 2), "\n"
    )
  ))
}

ggplot(Expenditure, aes(x = States, y = Expenditure, fill = States)) + 
  geom_boxplot() +
  stat_summary( 
               fun.data = stateexpenstat, 
               geom = "text", hjust = 0.5, vjust = 0.9, size = 2) +
  labs(
    title = "The expenditures per pupil (in dollars)",
    caption = "Source: New York Times Almanac",
    x = "States",
    y = "Expenditure (USD)/ pupil"
  ) +
  theme_classic()+
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.caption = element_text(size = 5)
  )

Table showing ANOVA test

T7_hypores_7 = rbind(conf_level_t7*100, alpha_t7, round(CV2_t7,3), round(anofval_t7,3), anovhypores_t7)
col_n7 = c("Hypothesis Result")
row_n7 = c("Confidence Level(%):", "Alpha:", "CV:","F value:","Hypothesis Result:")
dimnames(T7_hypores_7) = list(row_n7, col_n7)

kable(T7_hypores_7,
        caption = "<center>One way ANOVA Test</center>",
        align = "c",
       booktabs = TRUE)%>%
    kable_styling(bootstrap_options = c("bordered"),
                font_size = 12) %>%
  footnote(general = "H0:𝜇1=𝜇2=𝜇3 and \n H1: At least one mean is different from others.")

One way ANOVA Test
	Hypothesis Result
Confidence Level(%):	95
Alpha:	0.05
CV:	4.103
F value:	0.649
Hypothesis Result:	Do not reject the Null Hypothesis (Fvalue < CV)
Note:
H0:𝜇1=𝜇2=𝜇3 and H1: At least one mean is different from others.

if (summary(anova_t7)[[1]]$'Pr(>F)'[1] < 0.05) {
  tukey_test_7 <- TukeyHSD(anova_t7)
  print(tukey_test_7)

  # Plot the Tukey's HSD test results
  plot(tukey_test_7)
} else {
  cat("Since the ANOVA result is not statistically significant, Tukey's HSD test is not conducted.\n")
}

## Since the ANOVA result is not statistically significant, Tukey's HSD test is not conducted.

Observations to Task_7:

In this task, the spending per pupil (in USD) is given for states grouped into three regions: Eastern, Middle, and Western. We performed a one-way ANOVA to test whether mean expenditure differs across the three sections, using the aov() function and comparing the resulting F-value with the critical value. Since the F-value is less than the critical value (0.649 < 4.103), we do not reject the null hypothesis.

In summary, there is not enough evidence to conclude that mean per-pupil expenditure differs among the three sections.

Task 8 (Section 12- 3. 10. Increasing Plant Growth)

	Grow-light 1	Grow-light 2
Plant food A	9.2, 9.4, 8.9	8.5, 9.2, 8.9
Plant food B	7.1, 7.2, 8.5	5.5, 5.8, 7.6

Assume that all variables are normally or approximately normally distributed, that the samples are independent, and that the population variances are equal. a. State the hypotheses. b. Find the critical value for each F test. c. Complete the summary table and find the test value. d. Make the decision. e. Summarize the results. (Draw a graph of the cell means if necessary.)

A gardening company is testing new ways to improve plant growth. Twelve plants are randomly selected and exposed to a combination of two factors, a “Grow-light” in two different strengths and a plant food supplement with different mineral supplements. After a number of days, the plants are measured for growth, and the results (in inches) are put into the appropriate boxes.

Can an interaction between the two factors be concluded? Is there a difference in mean growth with respect to light? With respect to plant food? Use α = 0.05.

# Creating the data
plant_data <- data.frame(
  Grow_light = rep(c("Light1", "Light2"), each = 6),
  Plant_food = rep(c("Food_A", "Food_B"), times = 6),
  Growth = c(9.2, 9.4, 8.9, 8.5, 9.2, 8.9, 7.1, 7.2, 8.5, 5.5, 5.8, 7.6)
)

ggplot(data = plant_data, aes(x = Plant_food, y = Growth, colour = Grow_light)) + 
  geom_boxplot()
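
One way to confirm that every growth value sits in its intended cell is to cross-tabulate the two factors and compare the cell means with the table. The sketch below is an illustration, not part of the original analysis: `plant_check` is a hypothetical data frame built directly from the table, assuming its rows are plant foods and its columns are grow-lights. In `plant_data` as constructed above, the first six values (the Plant food A row of the table) all carry the label Light1, so the factor labels attached to the F tests that follow may be worth re-checking against this layout.

```r
# Hypothetical check data frame built cell by cell from the table:
# rows of the table are plant foods, columns are grow-lights.
plant_check <- data.frame(
  Plant_food = rep(c("Food_A", "Food_B"), each = 6),
  Grow_light = rep(rep(c("Light1", "Light2"), each = 3), times = 2),
  Growth = c(9.2, 9.4, 8.9,  8.5, 9.2, 8.9,   # Food A: Light 1, Light 2
             7.1, 7.2, 8.5,  5.5, 5.8, 7.6)   # Food B: Light 1, Light 2
)

# Each food/light combination should contain exactly three plants
cell_counts <- xtabs(~ Plant_food + Grow_light, data = plant_check)
print(cell_counts)

# Cell means, for comparison with the table and with plant_data
cell_means <- tapply(plant_check$Growth,
                     list(plant_check$Plant_food, plant_check$Grow_light),
                     mean)
print(round(cell_means, 3))

# Same model formula as the report, fitted to the check layout
fit_check <- aov(Growth ~ Plant_food * Grow_light, data = plant_check)
print(summary(fit_check))
```

Running the same cross-tabulation on `plant_data` shows whether its labels agree with the table.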

Solution

R Wrapped ANOVA (Two Way) Test

State the hypotheses and identify the claim.

Interaction Effect:

Null hypothesis \(\it{H_{0}}\): There is no interaction effect between the type of light used and the type of plant food used on plant growth.
Alternative hypothesis \(\it{H_{1}}\): There is an interaction effect between the type of light used and the type of plant food used on plant growth.

Effect of Light on Growth:

Null hypothesis \(\it{H_{0}}\): There is no difference between the means of plant growth for the two types of light.
Alternative hypothesis \(\it{H_{1}}\): There is a difference between the means of plant growth for the two types of light.

The alpha value

alpha_t8 = 0.05
conf_level_t8 = 1- alpha_t8

Compute the test value.

##ANOVA Test######
Anovatwoway = aov(Growth ~ Plant_food * Grow_light, data = plant_data)

Anovatwoway_summary = summary(Anovatwoway)

Anovatwoway_summary

##                       Df Sum Sq Mean Sq F value  Pr(>F)   
## Plant_food             1  0.213   0.213   0.259 0.62482   
## Grow_light             1 12.813  12.813  15.531 0.00429 **
## Plant_food:Grow_light  1  0.030   0.030   0.036 0.85352   
## Residuals              8  6.600   0.825                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

plot(TukeyHSD(Anovatwoway, conf.level=.95), las = 2)

Find the critical value.

ANOVAFvalinteraction = Anovatwoway_summary[[1]][3, "F value"] #Fvalue of interaction
ANOVAFvalinteraction

## [1] 0.03636364

ANOVAFvallight = Anovatwoway_summary[[1]][2, "F value"] # F value of the Grow_light main effect
ANOVAFvallight

## [1] 15.53131

DofNlight = Anovatwoway_summary[[1]][2, "Df"] # a-1, where a is the number of light levels
DofDlight = Anovatwoway_summary[[1]][4, "Df"] # ab(n-1), where n is the number of data values in each cell
DofNinteraction = Anovatwoway_summary[[1]][3, "Df"] # (a-1)(b-1), where b is the number of plant food levels
DofDinteraction = Anovatwoway_summary[[1]][4, "Df"] # ab(n-1), the same residual degrees of freedom

CVinteraction = qf(alpha_t8, DofNinteraction, DofDinteraction, lower.tail = FALSE)
CVlight = qf(alpha_t8, DofNlight, DofDlight, lower.tail = FALSE)
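
The degrees of freedom and the critical value can also be worked out directly from the design, which is a useful cross-check on the values pulled from the summary table. This is a minimal sketch assuming the Task 8 design of 2 grow-lights, 2 plant foods and 3 plants per cell; the names `a`, `b`, `n`, `alpha_demo`, `cv_demo` and `p_interaction` are illustrative.

```r
# Design sizes read from the Task 8 table (assumed: 2 lights, 2 foods, 3 plants per cell)
a <- 2; b <- 2; n <- 3
alpha_demo <- 0.05

df_light       <- a - 1              # main effect of grow-light
df_food        <- b - 1              # main effect of plant food
df_interaction <- (a - 1) * (b - 1)  # light x food interaction
df_within      <- a * b * (n - 1)    # residual (within-cell) degrees of freedom

# Upper-tail critical value of the F distribution
cv_demo <- qf(alpha_demo, df_interaction, df_within, lower.tail = FALSE)
round(cv_demo, 3)

# Equivalent p-value route for an observed F statistic,
# here the interaction F value reported above
p_interaction <- pf(0.03636364, df_interaction, df_within, lower.tail = FALSE)
p_interaction
```

Because every effect in a 2 x 2 design has one numerator degree of freedom, all three F tests share the same critical value.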

Make the decision.

hypotwoANOVA = ifelse(ANOVAFvalinteraction < CVinteraction, "Do not reject the null hypothesis (Fvalue < CV)", "Reject the null hypothesis (Fvalue > CV)")
hypotwoANOVA

## [1] "Do not reject the null hypothesis (Fvalue < CV)"

hypotwoANOVAlight = ifelse(ANOVAFvallight < CVlight, "Do not reject the null hypothesis (Fvalue < CV)", "Reject the null hypothesis (Fvalue > CV)")
hypotwoANOVAlight

## [1] "Reject the null hypothesis (Fvalue > CV)"

plantgrost <- 
  plant_data %>% 
  group_by(Grow_light, Plant_food) %>% # group by the two factors
  summarise(Means = mean(Growth), SEs = sd(Growth)/sqrt(n())) # Mean and Std Er

## `summarise()` has grouped output by 'Grow_light'. You can override using the
## `.groups` argument.

ggplot(plantgrost, 
       aes(x = Grow_light, y = Means, fill = Plant_food,
           ymin = Means - SEs, ymax = Means + SEs)) +
  # this adds the mean
  geom_col(position = position_dodge()) +
  # this adds the error bars
  geom_errorbar(position = position_dodge(0.9), width=.2) +
  # controlling the appearance
  xlab("Growth Light") + ylab("Plant Growth (in inches)")
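
The error bars in the chart above span the mean plus and minus one standard error. As a small worked example, the sketch below computes those limits for a single cell, using the three values listed in the table for Plant food A under Grow-light 1; the variable names are illustrative.

```r
# Growth values for one cell of the table (Plant food A under Grow-light 1)
cell_values <- c(9.2, 9.4, 8.9)

cell_mean <- mean(cell_values)
cell_se   <- sd(cell_values) / sqrt(length(cell_values))  # standard error of the mean

# Limits drawn by geom_errorbar(): mean minus/plus one standard error
c(lower = cell_mean - cell_se, mean = cell_mean, upper = cell_mean + cell_se)
```

With only three plants per cell the standard errors are rough, so overlapping bars should not be read as a formal test.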

Table showing ANOVA test

hyporestwoANOVA = rbind(conf_level_t8*100, alpha_t8, round(CVinteraction,3), round(ANOVAFvalinteraction,3), hypotwoANOVA)
col_twoANOVA = c("Hypothesis Result")
row_twoANOVA = c("Confidence Level(%):", "Alpha:", "CV:","F value:","Hypothesis Result:")
dimnames(hyporestwoANOVA) = list(row_twoANOVA, col_twoANOVA)

kable(hyporestwoANOVA,
        caption = "<center>Two way ANOVA interaction Test</center>",
        align = "c",
       booktabs = TRUE) %>%
    kable_styling(bootstrap_options = c("hover",
                                        "bordered"),
                font_size = 11) %>%
  scroll_box(width = "100%", height = "100%") %>%
  footnote(general = "𝐻0: There is no interaction effect between type of light used and type of plant food used on plant growth and \n 𝐻1: There is an interaction effect between type of light used and type of plant food used on plant growth.")

Two way ANOVA interaction Test
	Hypothesis Result
Confidence Level(%):	95
Alpha:	0.05
CV:	5.318
F value:	0.036
Hypothesis Result:	Do not reject the null hypothesis (Fvalue < CV)

Note: 𝐻0: There is no interaction effect between type of light used and type of plant food used on plant growth and
𝐻1: There is an interaction effect between type of light used and type of plant food used on plant growth.

hyporestwoANOVAlight = rbind(conf_level_t8*100, alpha_t8, round(CVlight,3), round(ANOVAFvallight,3), hypotwoANOVAlight)
col_twoANOVAlight = c("Hypothesis Result")
row_twoANOVAlight = c("Confidence Level(%):", "Alpha:", "CV:","F value:","Hypothesis Result:")
dimnames(hyporestwoANOVAlight) = list(row_twoANOVAlight, col_twoANOVAlight)

kable(hyporestwoANOVAlight,
        caption = "<center>Two way ANOVA mean of growth to light Test</center>",
        align = "c",
       booktabs = TRUE) %>%
    kable_styling(bootstrap_options = c("hover",
                                        "bordered"),
                font_size = 11) %>%
  scroll_box(width = "100%", height = "100%") %>%
  footnote(general = "H0: There is no difference between the means of plant growth for two types of light and \n H1: There is a difference between the means of plant growth for two types of light.")

Two way ANOVA mean of growth to light Test
	Hypothesis Result
Confidence Level(%):	95
Alpha:	0.05
CV:	5.318
F value:	15.531
Hypothesis Result:	Reject the null hypothesis (Fvalue > CV)

Note: H0: There is no difference between the means of plant growth for two types of light and
H1: There is a difference between the means of plant growth for two types of light.

Observations to Task_8:

In this task, we investigated the impact of different grow lights and plant food supplements on plant growth. Our analysis revealed a significant effect of light type on growth, rejecting the null hypothesis. However, no significant interaction effect between light and plant food was found. The choice of light emerges as a crucial factor in optimizing plant growth. These findings inform the gardening company’s practices, emphasizing the importance of selecting an appropriate grow light for enhanced plant development.

Task 9 (On Your Own)

Task 9.1 Download the file ‘baseball.csv’ from the course resources and import the file into R.

#Dataset_Employed_in_this_Task

M2W2Data = read_csv("D:/Quater_2/Second Part/ALY6015/Week_2/Assignment/Individual Assignment/baseball-3.csv")

Task 9.2 Perform EDA on the imported data set. Write a paragraph or two to describe the data set using descriptive statistics and plots. Are there any trends or anything of interest to discuss?

# Rename Columns
colnames(M2W2Data) <- tolower(gsub("[ ,.]", "_", colnames(M2W2Data)))
#colnames(M2W2Data)


# Structure of the dataset
str(M2W2Data)

## spc_tbl_ [1,232 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ team        : chr [1:1232] "ARI" "ATL" "BAL" "BOS" ...
##  $ league      : chr [1:1232] "NL" "NL" "AL" "AL" ...
##  $ year        : num [1:1232] 2012 2012 2012 2012 2012 ...
##  $ rs          : num [1:1232] 734 700 712 734 613 748 669 667 758 726 ...
##  $ ra          : num [1:1232] 688 600 705 806 759 676 588 845 890 670 ...
##  $ w           : num [1:1232] 81 94 93 69 61 85 97 68 64 88 ...
##  $ obp         : num [1:1232] 0.328 0.32 0.311 0.315 0.302 0.318 0.315 0.324 0.33 0.335 ...
##  $ slg         : num [1:1232] 0.418 0.389 0.417 0.415 0.378 0.422 0.411 0.381 0.436 0.422 ...
##  $ ba          : num [1:1232] 0.259 0.247 0.247 0.26 0.24 0.255 0.251 0.251 0.274 0.268 ...
##  $ playoffs    : num [1:1232] 0 1 1 0 0 0 1 0 0 1 ...
##  $ rankseason  : num [1:1232] NA 4 5 NA NA NA 2 NA NA 6 ...
##  $ rankplayoffs: num [1:1232] NA 5 4 NA NA NA 4 NA NA 2 ...
##  $ g           : num [1:1232] 162 162 162 162 162 162 162 162 162 162 ...
##  $ oobp        : num [1:1232] 0.317 0.306 0.315 0.331 0.335 0.319 0.305 0.336 0.357 0.314 ...
##  $ oslg        : num [1:1232] 0.415 0.378 0.403 0.428 0.424 0.405 0.39 0.43 0.47 0.402 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Team = col_character(),
##   ..   League = col_character(),
##   ..   Year = col_double(),
##   ..   RS = col_double(),
##   ..   RA = col_double(),
##   ..   W = col_double(),
##   ..   OBP = col_double(),
##   ..   SLG = col_double(),
##   ..   BA = col_double(),
##   ..   Playoffs = col_double(),
##   ..   RankSeason = col_double(),
##   ..   RankPlayoffs = col_double(),
##   ..   G = col_double(),
##   ..   OOBP = col_double(),
##   ..   OSLG = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

# Summary statistics of the dataset
data_summary <- summary(M2W2Data)

kable(data_summary,caption = "<center>Summary</center>", format = "html", align = "l") %>%
  column_spec(1, bold = TRUE)%>%
  kable_styling(full_width = TRUE, "striped",font_size = 14) %>%
  row_spec(0, bold = TRUE, background = "slategrey" , color = "white")%>%
  scroll_box(width = "100%", height = "400px")

Summary
team	league	year	rs	ra	w	obp	slg	ba	playoffs	rankseason	rankplayoffs	g	oobp	oslg
Length:1232	Length:1232	Min. :1962	Min. : 463.0	Min. : 472.0	Min. : 40.0	Min. :0.2770	Min. :0.3010	Min. :0.2140	Min. :0.0000	Min. :1.000	Min. :1.000	Min. :158.0	Min. :0.2940	Min. :0.3460
Class :character	Class :character	1st Qu.:1977	1st Qu.: 652.0	1st Qu.: 649.8	1st Qu.: 73.0	1st Qu.:0.3170	1st Qu.:0.3750	1st Qu.:0.2510	1st Qu.:0.0000	1st Qu.:2.000	1st Qu.:2.000	1st Qu.:162.0	1st Qu.:0.3210	1st Qu.:0.4010
Mode :character	Mode :character	Median :1989	Median : 711.0	Median : 709.0	Median : 81.0	Median :0.3260	Median :0.3960	Median :0.2600	Median :0.0000	Median :3.000	Median :3.000	Median :162.0	Median :0.3310	Median :0.4190
NA	NA	Mean :1989	Mean : 715.1	Mean : 715.1	Mean : 80.9	Mean :0.3263	Mean :0.3973	Mean :0.2593	Mean :0.1981	Mean :3.123	Mean :2.717	Mean :161.9	Mean :0.3323	Mean :0.4197
NA	NA	3rd Qu.:2002	3rd Qu.: 775.0	3rd Qu.: 774.2	3rd Qu.: 89.0	3rd Qu.:0.3370	3rd Qu.:0.4210	3rd Qu.:0.2680	3rd Qu.:0.0000	3rd Qu.:4.000	3rd Qu.:4.000	3rd Qu.:162.0	3rd Qu.:0.3430	3rd Qu.:0.4380
NA	NA	Max. :2012	Max. :1009.0	Max. :1103.0	Max. :116.0	Max. :0.3730	Max. :0.4910	Max. :0.2940	Max. :1.0000	Max. :8.000	Max. :5.000	Max. :165.0	Max. :0.3840	Max. :0.4990
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA’s :988	NA’s :988	NA	NA’s :812	NA’s :812

describe(M2W2Data) %>%
  kable(caption = "<center>Descriptive Statistics</center>", format = "html", align = "l") %>%
  kable_styling("bordered", full_width = TRUE, "striped",font_size = 14) %>%
  row_spec(0, bold = TRUE, background = "slategrey" , color = "white")%>%
  scroll_box(width = "100%", height = "400px")

Descriptive Statistics
	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
team*	1	1232	18.9285714	10.6140364	20.000	18.7576065	13.3434000	1.000	39.000	38.000	0.0628129	-1.2525294	0.3023954
league*	2	1232	1.5000000	0.5002030	1.500	1.5000000	0.7413000	1.000	2.000	1.000	0.0000000	-2.0016227	0.0142509
year	3	1232	1988.9577922	14.8196251	1989.000	1989.3184584	19.2738000	1962.000	2012.000	50.000	-0.1515595	-1.2077412	0.4222133
rs	4	1232	715.0819805	91.5342940	711.000	713.3387424	90.4386000	463.000	1009.000	546.000	0.1740832	-0.0301863	2.6078252
ra	5	1232	715.0819805	93.0799326	709.000	712.4371197	91.9212000	472.000	1103.000	631.000	0.2978360	-0.0205915	2.6518607
w	6	1232	80.9042208	11.4581390	81.000	81.1206897	11.8608000	40.000	116.000	76.000	-0.1814238	-0.3070528	0.3264440
obp	7	1232	0.3263312	0.0150128	0.326	0.3262586	0.0148260	0.277	0.373	0.096	0.0175923	0.0574876	0.0004277
slg	8	1232	0.3973417	0.0332669	0.396	0.3970903	0.0340998	0.301	0.491	0.190	0.0541978	-0.3250999	0.0009478
ba	9	1232	0.2592727	0.0129072	0.260	0.2593864	0.0133434	0.214	0.294	0.080	-0.1109140	-0.0002138	0.0003677
playoffs	10	1232	0.1980519	0.3986934	0.000	0.1227181	0.0000000	0.000	1.000	1.000	1.5134587	0.2907952	0.0113588
rankseason	11	244	3.1229508	1.7383492	3.000	2.9744898	1.4826000	1.000	8.000	7.000	0.5560350	-0.5778485	0.1112864
rankplayoffs	12	244	2.7172131	1.0952342	3.000	2.7602041	1.4826000	1.000	5.000	4.000	-0.2688894	-1.1199065	0.0701152
g	13	1232	161.9188312	0.6243652	162.000	161.9492901	0.0000000	158.000	165.000	7.000	-1.0420881	6.9746969	0.0177883
oobp	14	420	0.3322643	0.0152953	0.331	0.3319345	0.0163086	0.294	0.384	0.090	0.1943470	-0.3710080	0.0007463
oslg	15	420	0.4197429	0.0265096	0.419	0.4194405	0.0266868	0.346	0.499	0.153	0.1176197	-0.2148956	0.0012935

# Descriptive statistics of numerical columns
num_summary <- summary(select_if(M2W2Data, is.numeric))

kable(num_summary, format = "html", align = "l") %>%
  column_spec(1, bold = TRUE)%>%
  kable_styling(full_width = TRUE, "striped",font_size = 14) %>%
  row_spec(0, bold = TRUE, background = "slategrey" , color = "white")%>%
  scroll_box(width = "100%", height = "400px")

year	rs	ra	w	obp	slg	ba	playoffs	rankseason	rankplayoffs	g	oobp	oslg
Min. :1962	Min. : 463.0	Min. : 472.0	Min. : 40.0	Min. :0.2770	Min. :0.3010	Min. :0.2140	Min. :0.0000	Min. :1.000	Min. :1.000	Min. :158.0	Min. :0.2940	Min. :0.3460
1st Qu.:1977	1st Qu.: 652.0	1st Qu.: 649.8	1st Qu.: 73.0	1st Qu.:0.3170	1st Qu.:0.3750	1st Qu.:0.2510	1st Qu.:0.0000	1st Qu.:2.000	1st Qu.:2.000	1st Qu.:162.0	1st Qu.:0.3210	1st Qu.:0.4010
Median :1989	Median : 711.0	Median : 709.0	Median : 81.0	Median :0.3260	Median :0.3960	Median :0.2600	Median :0.0000	Median :3.000	Median :3.000	Median :162.0	Median :0.3310	Median :0.4190
Mean :1989	Mean : 715.1	Mean : 715.1	Mean : 80.9	Mean :0.3263	Mean :0.3973	Mean :0.2593	Mean :0.1981	Mean :3.123	Mean :2.717	Mean :161.9	Mean :0.3323	Mean :0.4197
3rd Qu.:2002	3rd Qu.: 775.0	3rd Qu.: 774.2	3rd Qu.: 89.0	3rd Qu.:0.3370	3rd Qu.:0.4210	3rd Qu.:0.2680	3rd Qu.:0.0000	3rd Qu.:4.000	3rd Qu.:4.000	3rd Qu.:162.0	3rd Qu.:0.3430	3rd Qu.:0.4380
Max. :2012	Max. :1009.0	Max. :1103.0	Max. :116.0	Max. :0.3730	Max. :0.4910	Max. :0.2940	Max. :1.0000	Max. :8.000	Max. :5.000	Max. :165.0	Max. :0.3840	Max. :0.4990
NA	NA	NA	NA	NA	NA	NA	NA	NA’s :988	NA’s :988	NA	NA’s :812	NA’s :812

The code above computes summary statistics for the numeric columns of the dataset.

# Descriptive statistics of categorical columns
cat_summary <- summary(select_if(M2W2Data, is.character))

kable(cat_summary, format = "html", align = "l") %>%
  column_spec(1, bold = TRUE)%>%
  kable_styling(full_width = TRUE, "striped",font_size = 14) %>%
  row_spec(0, bold = TRUE, background = "slategrey" , color = "white")%>%
  scroll_box(width = "100%", height = "200px")

	team	league
	Length:1232	Length:1232
	Class :character	Class :character
	Mode :character	Mode :character

DescTools::Desc(M2W2Data$w)

## ------------------------------------------------------------------------------ 
## M2W2Data$w (numeric)
## 
##   length       n    NAs  unique     0s   mean  meanCI'
##    1'232   1'232      0      63      0  80.90   80.26
##           100.0%   0.0%           0.0%          81.54
##                                                      
##      .05     .10    .25  median    .75    .90     .95
##    62.00   66.00  73.00   81.00  89.00  95.90   98.00
##                                                      
##    range      sd  vcoef     mad    IQR   skew    kurt
##    76.00   11.46   0.14   11.86  16.00  -0.18   -0.31
##                                                      
## lowest : 40.0, 43.0, 50.0, 51.0 (2), 52.0 (2)
## highest: 106.0, 108.0 (3), 109.0, 114.0, 116.0
## 
## ' 95%-CI (classic)

#Decade winnings#####

# Extract decade from year
M2W2Data$Decade <- M2W2Data$year - (M2W2Data$year %% 10)

# Total wins per decade
win_dec = M2W2Data %>%
  group_by(Decade) %>%
  summarize(TotalWins = sum(w))

# geom_col() plots the summed wins; geom_bar() would only count rows per decade
win_dec %>%
  ggplot(aes(x = factor(Decade), y = TotalWins)) +
  geom_col() +
  ggtitle("Total Wins by Decade") +
  xlab("Decade") + ylab("Total wins") +
  theme(plot.title = element_text(hjust = 0.5))

Observations:

The provided dataset contains information related to various baseball teams, including their performance statistics, such as runs scored (rs), runs allowed (ra), wins (w), on-base percentage (obp), slugging percentage (slg), batting average (ba), and more. The dataset spans multiple years, with a range from 1962 to 2012.

Upon conducting exploratory data analysis (EDA), several observations emerge. The mean number of wins per team-season is approximately 80.9, with a standard deviation of 11.46, and wins range from 40 to 116. The distribution of wins is only slightly negatively skewed (skew ≈ -0.18), so it is close to symmetric around the median of 81.

The bar plot of wins by decade shows that total wins rise noticeably after the 1960s, with a peak in the 1990s. Because each game played produces essentially one win, these totals mainly reflect how many team-seasons each decade contributes (the data run from 1962 to 2012, so the 1960s and 2010s are only partially covered) rather than an upward trend in team performance.
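
To separate the effect of how many team-seasons fall in each decade from the wins themselves, the totals can be shown next to the row counts. The sketch below uses a small hypothetical data frame `toy` (not the baseball file) and base R only, so that it runs without extra packages; the column names mirror `year` and `w` from the dataset.

```r
# Hypothetical toy data: one row per team-season (not the baseball file)
toy <- data.frame(
  year = c(1968, 1969, 1971, 1975, 1978, 1979),
  w    = c(80,   82,   79,   81,   85,   78)
)
toy$decade <- toy$year - (toy$year %% 10)

total_wins      <- tapply(toy$w, toy$decade, sum)     # what a totals bar chart shows
team_seasons    <- tapply(toy$w, toy$decade, length)  # rows contributing to each total
wins_per_season <- total_wins / team_seasons          # size-adjusted comparison

data.frame(decade          = names(total_wins),
           total_wins      = as.vector(total_wins),
           team_seasons    = as.vector(team_seasons),
           wins_per_season = as.vector(wins_per_season))
```

In this toy example the 1970s total is about twice the 1960s total only because it contains twice as many rows; the wins per team-season are nearly identical.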

Task 9.3 Assuming the expected frequencies are equal, perform a Chi-Square Goodness-of-Fit test to determine if there is a difference in the number of wins by decade. Be sure to include the following: a. State the hypotheses and identify the claim. b. Find the critical value (α = 0.05) (programmatically). c. Compute the test value. d. Make the decision. Clearly state if the null hypothesis should or should not be rejected and why. e. Does comparing the critical value with the test value provide the same result as comparing the p-value from R with the significance level?

State the hypotheses and identify the claim.

Null hypothesis \(\it{H_{0}}\): The number of wins is the same in every decade.
Alternative hypothesis \(\it{H_{1}}\): The number of wins is not the same in every decade (claim).

Find the critical value (α = 0.05) (programmatically).

CL_t9a = 0.95
alpha_t9a = 0.05

num_decades <- length(unique(M2W2Data$Decade))

# Degrees of freedom: number of categories minus one
df <- num_decades - 1

critical_value <- qchisq(CL_t9a, df)

# Output the critical value
critical_value

## [1] 11.0705

Compute the test value.

#Chi-Square Wins by Decade####

observed_wins <- M2W2Data %>%
  group_by(Decade) %>%
  summarize(TotalWins = sum(w)) %>%
  ungroup() %>%
  pull(TotalWins)

expected_wins <- rep(mean(observed_wins), length(observed_wins))

# Compute the Chi-Square test statistic
test_value <- sum((observed_wins - expected_wins)^2 / expected_wins)

test_value

## [1] 9989.536

Make the decision. Clearly state if the null hypothesis should or should not be rejected and why.

# Make the decision based on the critical value and the test statistic
if (test_value > critical_value) {
  decision <- "Reject the null hypothesis."
  reason <- "The test statistic is greater than the critical value."
} else {
  decision <- "Fail to reject the null hypothesis."
  reason <- "The test statistic is not greater than the critical value."
}

# Output the decision and the reason
decision

## [1] "Reject the null hypothesis."

reason

## [1] "The test statistic is greater than the critical value."

Does comparing the critical value with the test value provide the same result as comparing the p-value from R with the significance level?

# Perform the Chi-Square Goodness-of-Fit test
chi_square_test <- chisq.test(win_dec$TotalWins, p = rep(1 / nrow(win_dec), nrow(win_dec)))

# Extract the p-value from the test result
p_value <- chi_square_test$p.value

# Output the p-value
p_value

## [1] 0

# Compare the p-value to the significance level
if (p_value < 0.05) {
  p_decision <- "Reject the null hypothesis based on p-value."
} else {
  p_decision <- "Fail to reject the null hypothesis based on p-value."
}

# Output the decision based on p-value
p_decision

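## [1] "Reject the null hypothesis based on p-value."

Yes. The critical-value approach (test value 9989.536 > critical value 11.0705) and the p-value approach (p-value ≈ 0 < 0.05) both lead to rejecting the null hypothesis.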
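
The two decision routes agree by construction, because the p-value falls below α exactly when the statistic exceeds the critical value. The sketch below illustrates this on hypothetical counts (`observed_demo` is made up and is not the decade totals), and checks that the manual statistic matches `chisq.test()`.

```r
# Hypothetical category counts (not the decade totals from the report)
observed_demo <- c(50, 60, 40, 55, 45, 50)
k <- length(observed_demo)
alpha_demo <- 0.05

# Equal expected frequencies: total count split evenly over k categories
expected_demo <- rep(sum(observed_demo) / k, k)
stat_demo <- sum((observed_demo - expected_demo)^2 / expected_demo)

# Route 1: compare the statistic with the critical value
cv_demo <- qchisq(1 - alpha_demo, df = k - 1)
reject_by_cv <- stat_demo > cv_demo

# Route 2: compare the upper-tail p-value with alpha
p_demo <- pchisq(stat_demo, df = k - 1, lower.tail = FALSE)
reject_by_p <- p_demo < alpha_demo

# chisq.test() reproduces the same statistic and p-value
builtin <- chisq.test(observed_demo, p = rep(1 / k, k))
c(manual = stat_demo, builtin = unname(builtin$statistic))
c(reject_by_cv = reject_by_cv, reject_by_p = reject_by_p)
```

For these made-up counts neither route rejects, which shows the agreement does not depend on the result being significant.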

Task 9.4 Download the file ‘crop_data.csv’ from the course resources and import the file into R.

crop = read.csv("D:/Quater_2/Second Part/ALY6015/Week_2/Assignment/Individual Assignment/crop_data-3.csv")

Task 9.5 Summarize the imported crop data.

summary(crop)

##     density        block        fertilizer     yield      
##  Min.   :1.0   Min.   :1.00   Min.   :1    Min.   :175.4  
##  1st Qu.:1.0   1st Qu.:1.75   1st Qu.:1    1st Qu.:176.5  
##  Median :1.5   Median :2.50   Median :2    Median :177.1  
##  Mean   :1.5   Mean   :2.50   Mean   :2    Mean   :177.0  
##  3rd Qu.:2.0   3rd Qu.:3.25   3rd Qu.:3    3rd Qu.:177.4  
##  Max.   :2.0   Max.   :4.00   Max.   :3    Max.   :179.1

Task 9.6 Perform a Two-way ANOVA test using yield as the dependent variable and fertilizer and density as the independent variables. Explain the results of the test. Is there reason to believe that fertilizer and density have an impact on yield?

For the Two-way ANOVA test with factors “fertilizer” and “density,” along with their interaction, the null and alternative hypotheses are formulated as follows:

Fertilizer:

Null hypothesis \(\it{H_{0}}\): There is no significant difference in mean yield among the different types of fertilizer.
Alternative hypothesis \(\it{H_{1}}\): There is a significant difference in mean yield among the different types of fertilizer.

Density:

Null hypothesis \(\it{H_{0}}\): There is no significant difference in mean yield between different levels of density.
Alternative hypothesis \(\it{H_{1}}\): There is a significant difference in mean yield between different levels of density.

Interaction (Fertilizer * Density):

Null hypothesis \(\it{H_{0}}\): The effect of fertilizer on yield is independent of the level of density.
Alternative hypothesis \(\it{H_{1}}\): There is a significant interaction effect between fertilizer and density on yield.

# Two-way ANOVA with two independent variables
# Convert 'density' and 'fertilizer' to factors

crop$fertilizer <- as.factor(crop$fertilizer)
crop$density <- as.factor(crop$density)

anova2way <- aov(yield ~ fertilizer + density, data = crop)
summary(anova2way)

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## fertilizer   2  6.068   3.034   9.073 0.000253 ***
## density      1  5.122   5.122  15.316 0.000174 ***
## Residuals   92 30.765   0.334                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#generate the anova for the data
anova_inter <- aov(yield ~ fertilizer*density, data = crop)
#display the results
summary(anova_inter)

##                    Df Sum Sq Mean Sq F value   Pr(>F)    
## fertilizer          2  6.068   3.034   9.001 0.000273 ***
## density             1  5.122   5.122  15.195 0.000186 ***
## fertilizer:density  2  0.428   0.214   0.635 0.532500    
## Residuals          90 30.337   0.337                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Observations:

Based on the results, both fertilizer (p ≈ 0.0003) and density (p ≈ 0.0002) have a statistically significant effect on crop yield. The interaction between fertilizer and density is not significant (F = 0.635, p = 0.5325), so there is no evidence that the effect of fertilizer depends on planting density. The additive model without the interaction term is therefore adequate, and fertilizer type and planting density can be chosen separately when aiming to improve yield.
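
The interaction check can also be framed as a comparison of two nested models: one with main effects only and one that adds the interaction term. The sketch below is illustrative and uses simulated yields, not the crop file; `sim` is a hypothetical data frame with the same layout (3 fertilizers, 2 densities, 16 plots per cell), and the effect sizes in the simulation are arbitrary assumptions.

```r
# Hypothetical data with the same design as the crop file:
# 3 fertilizers x 2 densities x 16 plots per cell. The yields are simulated.
set.seed(1)
sim <- expand.grid(fertilizer = factor(1:3), density = factor(1:2), plot = 1:16)
sim$yield <- 176.5 + 0.3 * as.integer(sim$fertilizer) +
  0.4 * as.integer(sim$density) + rnorm(nrow(sim), sd = 0.6)

fit_add <- aov(yield ~ fertilizer + density, data = sim)  # main effects only
fit_int <- aov(yield ~ fertilizer * density, data = sim)  # adds the interaction

# F test of the interaction as a comparison of the two nested models
nested <- anova(fit_add, fit_int)
print(nested)

# The same F value appears in the interaction row of the full table
f_nested <- nested$F[2]
f_table  <- summary(fit_int)[[1]][3, "F value"]
c(nested = f_nested, table = f_table)
```

When this comparison is not significant, the simpler additive model is usually the one reported.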

Conclusion

In summary, this comprehensive report has provided a robust application of key hypothesis testing and ANOVA methodologies using R programming. Through a series of real-world tasks, we have gained valuable insights by analyzing differences in proportions, independence between categorical variables, comparing means, and assessing interactions.

The analysis began by examining the goodness of fit between observed and expected blood type proportions, followed by scrutinizing airline on-time performance and movie admissions data. Additionally, we tested the potential associations between military rank and branch, compared sodium content variances, and analyzed sales and expenditure differences. The two-way ANOVAs allowed us to examine main and interaction effects of grow-light and plant food on plant growth, and of fertilizer and planting density on crop yield.

By leveraging R’s statistical capabilities, this assignment has enabled a broad, practical application of inferential statistics. The insights derived through rigorous hypothesis testing provide a solid foundation to make data-driven decisions in real-world contexts. This comprehensive journey has enhanced our ability to draw meaningful conclusions from data and strengthened our knowledge of R’s extensive tools for statistical analysis.

References

University of Southampton. (n.d.). Chi square. https://www.southampton.ac.uk/passs/full_time_education/bivariate_analysis/chi_square
Bluman, A. (2015). Elementary statistics: A step by step approach. McGraw-Hill Education.
Larson, M. (2008). Analysis of variance. Circulation. https://doi.org/10.1161/CIRCULATIONAHA.107.654335

Appendix
This report contains an R Markdown file named as follows WEEK_2_Ansari_ALY6015_71821_Intermediate_Analytics_SEC_09_Fall_2023_CPS.Rmd

Week_2_ALY6015 Assignment

2023-11-13

Condiments	Cereals	Deserts
270	260	100
130	220	180
230	290	250
180	290	250
80	200	300
70	320	360
200	140	300
