rm(list=ls())
#Library######
library(readr)
library(tidyverse)
library(dplyr)
library(DT)
library(RColorBrewer)
library(rio)
library(dbplyr)
library(psych)
library(FSA)
library(knitr)
library(plotrix)
library(kableExtra)
library(ISLR)
library(data.table)
library(magrittr)
library(ggplot2)
library(summarytools)
library(hrbrthemes)
library(cowplot)
library(reshape2)
library(scales)
library(zoo)
library(corrplot)
library(lares)
library(leaps)
library(MASS)
library(car)
library(sjPlot)
library(sjmisc)
library(sjlabelled)
library(tidyr)
library(readxl)
library(forcats)
#dataset used######
PlantGrowth <- read_excel("Datasets/Task3_Two_ANOVA.xlsx",
sheet = "Sheet2")
baseball_1 <- read_csv("Datasets/baseball-1.csv")
Introduction
A chi-square test compares observed results with expected results. Its goal is to determine whether a discrepancy between observed and expected data is due to chance or to a relationship between the variables under consideration. The chi-square test is therefore well suited to interpreting the relationship between two categorical variables (University of Southampton, n.d.). It is used to test a frequency distribution (goodness of fit), the independence of two variables, and the homogeneity of proportions (Bluman, 2015).
ANOVA is a statistical approach for analyzing variability in a response variable (a continuous random variable) measured under conditions defined by discrete factors (classification variables, often with nominal levels). It is widely used to test the equality of several means by comparing the variation among groups with the variation within groups (random error) (Larson, 2008).
A medical researcher wishes to see if hospital patients in a large
hospital have the same blood type distribution as those in the general
population. The distribution for the general population is as follows:
type A, 20%; type B, 28%; type O, 36%; and type AB, 16%. He selects a
random sample of 50 patients and finds the following: 12 have type A
blood, 8 have type B, 24 have type O, and 6 have type AB blood.
At α = 0.10, can it be concluded that the distribution is the same as that
of the general population?
#Chi-Square Goodness-of-Fit Test 1.1######
#Hypothesis Formation#####
#𝐻0:P𝐴=0.20,P𝐵=0.28,P𝑂=0.36,P𝐴𝐵=0.16
#𝐻1: The distribution is not the same as stated in the null hypothesis
##Distribution####
Type_A = 0.20
Type_B = 0.28
Type_O = 0.36
Type_AB = 0.16
Blood_prob = c(Type_A,
Type_B,
Type_O,
Type_AB)
n = 50
#Expected Count
TypeA_Exp = Type_A * n
TypeB_Exp = Type_B * n
TypeO_Exp = Type_O * n
TypeAB_Exp = Type_AB * n
Exp_Blood_count = c(TypeA_Exp,
TypeB_Exp,
TypeO_Exp,
TypeAB_Exp)
#Observed count
TypeA_obs = 12
TypeB_obs = 8
TypeO_obs = 24
TypeAB_obs = 6
Obs_Blood_count = c(TypeA_obs,
TypeB_obs,
TypeO_obs,
TypeAB_obs)
##Chi-Square testing#####
CLt1 = 0.90
aphat190 = 1 - CLt1
##Finding X^2 value######
ch_sqrt1 = chisq.test(x=Obs_Blood_count, p=Blood_prob)
T1_1testresult = ifelse(ch_sqrt1$p.value > aphat190, "Do not Reject the Null Hypothesis (Pvalue > alpha)","Reject the Null Hypothesis (Pvalue <= alpha)")
###Information Representation#####
Blood_count = cbind(Exp_Blood_count,
Obs_Blood_count)
blood_type = c("Blood Type A",
"Blood Type B",
"Blood Type O",
"Blood Type AB")
col_n = c("Expected","Observed")
dimnames(Blood_count) = list(blood_type,col_n)
T1_1hypores = rbind(CLt1*100, aphat190, round(ch_sqrt1$statistic,3), round(ch_sqrt1$p.value,3), T1_1testresult)
col_n1 = c("Hypothesis Result")
row_n1 = c("Confidence Level(%):", "Alpha:", "X^2 Value:","p value:","Hypothesis Result:")
dimnames(T1_1hypores) = list(row_n1,col_n1)
kable(
list(Blood_count,T1_1hypores),
caption = "<center>Chi-Square Goodness-of-Fit Test</center>",
align = "c",
booktabs = TRUE)%>%
kable_styling(bootstrap_options = c("hover",
"bordered"),
font_size = 11) %>%
scroll_box(width = "100%", height = "100%") %>%
footnote(general = "𝐻0:P𝐴=0.20,P𝐵=0.28,P𝑂=0.36,P𝐴𝐵=0.16 and \n 𝐻1: The distribution is not the same as stated in the null hypothesis")
In this task, I test the proportions of blood groups. A sample of 50
patients was taken, and the observed blood type counts are compared with
the expected counts to determine whether the sample follows the same
distribution as the general population. Because this compares observed
frequencies with a hypothesized distribution, I use the Chi-Square
Goodness-of-Fit test.
The population proportions are blood group A, 20%; blood group B, 28%;
blood group O, 36%; and blood group AB, 16%. The hypothesis test
determines whether the observed proportions match these expected
proportions.
The null hypothesis is that the blood groups follow the stated
population proportions, and the alternative hypothesis is that the
distribution differs from the one stated in the null hypothesis. The
Chi-Square Goodness-of-Fit test is performed using chisq.test(), which
gives $X^2 = 5.471$ and $Pvalue = 0.14$. As the Pvalue is greater than
alpha (0.10), I do not reject the null hypothesis. There is not enough
evidence to reject the claim that the hospital patients have the same
blood type distribution as the general population; the observed
percentages are not significantly different from those given in the null
hypothesis.
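The statistic reported by chisq.test() can be reproduced by hand, which makes the goodness-of-fit calculation explicit. The following is a minimal sketch using the counts and proportions from the problem statement; the variable names are illustrative and are not part of the analysis code above.

```r
# Observed counts and hypothesized proportions from the blood-type problem
observed <- c(A = 12, B = 8, O = 24, AB = 6)
p_null   <- c(A = 0.20, B = 0.28, O = 0.36, AB = 0.16)

# Expected counts under H0: n * p
expected <- sum(observed) * p_null

# Chi-square statistic: sum over categories of (O - E)^2 / E
x2 <- sum((observed - expected)^2 / expected)

# Degrees of freedom: number of categories minus one
df <- length(observed) - 1

# Right-tail p-value from the chi-square distribution
p_value <- pchisq(x2, df = df, lower.tail = FALSE)

round(x2, 3)       # 5.471
round(p_value, 2)  # 0.14
```

Because the p-value exceeds alpha = 0.10, this hand calculation leads to the same decision as the table above.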
According to the Bureau of Transportation Statistics, on-time performance by the airlines is described as follows:
| Action | % of Time |
|---|---|
| On time | 70.8 |
| National Aviation System delay | 8.2 |
| Aircraft arriving late | 9.0 |
| Other (because of weather and other conditions) | 12.0 |
Records of 200 randomly selected flights for a major airline company showed that 125 planes were on time; 40 were delayed because of weather, 10 because of a National Aviation System delay, and the rest because of arriving late. At α = 0.05, do these results differ from the government’s statistics?
#Chi-Square Goodness-of-Fit Test 1.2#######
#Hypothesis Formation
#𝐻0:P_OT=0.708,P_SD=0.082,P_AL=0.09,P_O=0.12
#𝐻1: The distribution is not the same as stated in the null hypothesis (Claim).
##Distribution####
On_time = 0.708
NA_System_delay = 0.082
arriving_late = 0.09
othr = 0.12
Flight_prob = c(On_time,
NA_System_delay,
arriving_late,
othr)
n1 = 200
#Expected Count
On_time_Exp = On_time * n1
NA_System_delay_Exp = NA_System_delay * n1
arriving_late_Exp = arriving_late * n1
othr_Exp = othr * n1
Exp_Flight_count = round(c(On_time_Exp,
NA_System_delay_Exp,
arriving_late_Exp,
othr_Exp),0)
#Observed count
On_time_obs = 125
NA_System_delay_obs = 10
arriving_late_obs = 25
othr_obs = 40
Obs_flight_count = c(On_time_obs,
NA_System_delay_obs,
arriving_late_obs,
othr_obs)
##Chi-Square testing#####
CLt1_1 = 0.95
aphat1_195 = 1 - CLt1_1
##Finding X^2 value######
ch_sqrt1_1 = chisq.test(x=Obs_flight_count, p=Flight_prob)
T1_1testresult_1 = ifelse(ch_sqrt1_1$p.value > aphat1_195, "Do not Reject the Null Hypothesis (Pvalue > alpha)","Reject the Null Hypothesis (Pvalue <= alpha)")
###Information Representation#####
flight_count = cbind(Exp_Flight_count,
Obs_flight_count)
record_type = c("On time",
"National Aviation System delay",
"Aircraft arriving late",
"Other (because of weather and other conditions)")
col_n_1 = c("Expected","Observed")
dimnames(flight_count) = list(record_type,col_n_1)
T1_1hypores_1 = rbind(CLt1_1*100, aphat1_195, round(ch_sqrt1_1$statistic,3), round(ch_sqrt1_1$p.value,6), T1_1testresult_1)
col_n1_1 = c("Hypothesis Result")
row_n1_1 = c("Confidence Level(%):", "Alpha:", "X^2 Value:","p value:","Hypothesis Result:")
dimnames(T1_1hypores_1) = list(row_n1_1,col_n1_1)
kable(
list(flight_count,T1_1hypores_1),
caption = "<center>Chi-Square Goodness-of-Fit Test</center>",
align = "c",
booktabs = TRUE)%>%
kable_styling(bootstrap_options = c("hover",
"bordered"),
font_size = 11) %>%
scroll_box(width = "100%", height = "100%") %>%
footnote(general = "𝐻0:P_OT=0.708,P_SD=0.082,P_AL=0.09,P_O=0.12 and \n H1: The distribution is not the same as stated in the null hypothesis")
In this task, I test the on-time performance of a major airline against
the data provided by the Bureau of Transportation Statistics.
The expected proportions from the government statistics are: on time,
70.8%; National Aviation System delay, 8.2%; aircraft arriving late,
9.0%; and other (because of weather and other conditions), 12.0%.
After running chisq.test(), I reject the null hypothesis, as
$Pvalue = 0.000476$, which is \(\le\) alpha (0.05).
It is concluded that there is enough evidence to support the claim that
the airline's flight performance differs from the government's
statistics.
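To see which category drives the rejection, the per-category contributions to the chi-square statistic can be inspected. This is a sketch under the same counts and proportions as above; the short category names are assumptions made for readability.

```r
# Observed flight counts and the government proportions
observed <- c(on_time = 125, nas_delay = 10, late_aircraft = 25, other = 40)
p_null   <- c(0.708, 0.082, 0.09, 0.12)

# Expected counts under H0 (unrounded): 141.6, 16.4, 18, 24
expected <- sum(observed) * p_null

# Contribution of each category to the statistic: (O - E)^2 / E
contrib <- (observed - expected)^2 / expected
round(contrib, 3)

# The contributions add up to the X^2 value reported by chisq.test()
fit <- chisq.test(x = observed, p = p_null)
all.equal(sum(contrib), unname(fit$statistic))

# Category with the largest contribution
names(which.max(contrib))
```

In this sample, the "other" category (weather and other conditions) contributes the most, since 40 flights were observed against 24 expected.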
Are movie admissions related to ethnicity? A 2014 study indicated the following numbers of admissions (in thousands) for two different years. At the 0.05 level of significance, can it be concluded that movie attendance by year was dependent upon ethnicity?
| | Caucasian | Hispanic | African_American | Other |
|---|---|---|---|---|
| 2013 | 724 | 335 | 174 | 107 |
| 2014 | 370 | 292 | 152 | 140 |
#Chi-Square Independence Test######
#Hypothesis
#H0: Movie attendance by year was independent of ethnicity.
#H1: movie attendance by year was dependent upon ethnicity (claim).
#Data presented in Matrix
r1 = c(724, 335, 174, 107)
r2 = c(370, 292, 152, 140)
#row count
rowno = 2
#matrix
matrix1_2 = matrix(c(r1, r2), nrow = rowno, byrow = TRUE)
#Modifying Matrix col and row names
rownames(matrix1_2) = c("2013","2014")
colnames(matrix1_2) = c("Caucasian", "Hispanic", "African_American", "Other")
##Chi Square Independence test######
CLt1_2 = 0.95
aphat1_295 = 1 - CLt1_2
ch_sqrt1_2 = chisq.test(matrix1_2)
T1_2testresult_1 = ifelse(ch_sqrt1_2$p.value > aphat1_295,"Do not Reject the Null Hypothesis (Pvalue > alpha)","Reject the Null Hypothesis (Pvalue <= alpha)")
##Information Representation#####
T1_2hypores_1 = rbind(CLt1_2*100, aphat1_295, round(ch_sqrt1_2$statistic,3), round(ch_sqrt1_2$p.value,14), T1_2testresult_1)
col_n1_2 = c("Hypothesis Result")
row_n1_2 = c("Confidence Level(%):", "Alpha:", "X^2 Value:","p value:","Hypothesis Result:")
dimnames(T1_2hypores_1) = list(row_n1_2,col_n1_2)
kable(
list(matrix1_2,T1_2hypores_1),
caption = "<center>Chi-Square Independence Test</center>",
align = "c",
booktabs = TRUE)%>%
kable_styling(bootstrap_options = c("hover",
"bordered"),
font_size = 11) %>%
scroll_box(width = "100%", height = "100%") %>%
footnote(general = "H0: Movie attendance by year was independent of ethnicity and \n H1: Movie attendance by year was dependent upon ethnicity")
When a single sample is selected, the test of independence is used to
evaluate whether two variables are independent of or related to each
other (Bluman, 2015).
In this task, movie attendance data are provided for two years, broken
down by the ethnicity of attendees. As the claim is that movie
attendance by year was dependent upon ethnicity, I use the Chi-Square
Independence test.
After running chisq.test(), $Pvalue = 5.5 \times 10^{-13}$, which is
\(\le\) alpha (0.05). I therefore reject the null hypothesis: there is
enough evidence to support the claim that movie attendance by year is
dependent upon ethnicity.
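For a test of independence, each expected count is the row total times the column total divided by the grand total. The sketch below reproduces the statistic from those expected counts; the object names are illustrative and differ from the ones used in the analysis code.

```r
# Admissions (in thousands) by year and ethnicity
admissions <- matrix(c(724, 335, 174, 107,
                       370, 292, 152, 140),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("2013", "2014"),
                                     c("Caucasian", "Hispanic",
                                       "African_American", "Other")))

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(admissions), colSums(admissions)) / sum(admissions)

# Chi-square statistic and degrees of freedom (rows - 1) * (columns - 1)
x2 <- sum((admissions - expected)^2 / expected)
df <- (nrow(admissions) - 1) * (ncol(admissions) - 1)

# Right-tail p-value
p_value <- pchisq(x2, df = df, lower.tail = FALSE)
```

For a table larger than 2 x 2, chisq.test() applies no continuity correction, so `x2` should agree with `chisq.test(admissions)$statistic`.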
This table lists the numbers of officers and enlisted personnel for women in the military. At α = 0.05, is there sufficient evidence to conclude that a relationship exists between rank and branch of the Armed Forces?
| Branch | Officers | Enlisted |
|---|---|---|
| Army | 10,791 | 62,491 |
| Navy | 7,816 | 42,750 |
| Marine Corps | 932 | 9,525 |
| Air Force | 11,819 | 54,344 |
#Chi-Square Independence Test#####
#Hypothesis
#H0: Rank achieved is independent of branch of the Armed Forces.
#H1: Rank achieved is dependent on branch of the Armed Forces. (claim)
#Data presented in Matrix
rm1 = c(10791, 62491)
rm2 = c(7816, 42750)
rm3 = c(932, 9525)
rm4 = c(11819, 54344)
#row count
rowno_2 = 4
#matrix
matrix1_2_2 = matrix(c(rm1, rm2, rm3, rm4), nrow = rowno_2, byrow = TRUE)
#Modifying Matrix col and row names
rownames(matrix1_2_2) = c("Army","Navy","Marine_Corps", "Air_Force")
colnames(matrix1_2_2) = c("Officers", "Enlisted")
##Chi Square Independence test######
CLt1_2_2 = 0.95
aphat1_2_295 = 1 - CLt1_2_2
ch_sqrt1_2_2 = chisq.test(matrix1_2_2)
T1_2testresult_2 = ifelse(ch_sqrt1_2_2$p.value > aphat1_2_295,"Do not Reject the Null Hypothesis (Pvalue > alpha)","Reject the Null Hypothesis (Pvalue <= alpha)")
##Information Representation#####
T1_2hypores_2 = rbind(CLt1_2_2*100, aphat1_2_295, round(ch_sqrt1_2_2$statistic,3), round(ch_sqrt1_2_2$p.value,141), T1_2testresult_2)
col_n1_2_2 = c("Hypothesis Result")
row_n1_2_2 = c("Confidence Level(%):", "Alpha:", "X^2 Value:","p value:","Hypothesis Result:")
dimnames(T1_2hypores_2) = list(row_n1_2_2,col_n1_2_2)
kable(
list(matrix1_2_2,T1_2hypores_2),
caption = "<center>Chi-Square Independence Test</center>",
align = "c",
booktabs = TRUE)%>%
kable_styling(bootstrap_options = c("hover",
"bordered"),
font_size = 11) %>%
scroll_box(width = "100%", height = "100%") %>%
footnote(general = "H0: Rank achieved is independent of branch of the Armed Forces and \n H1: Rank achieved is dependent on branch of the Armed Forces.")
In this task, I test whether a relationship exists between rank and
branch of the Armed Forces using the Chi-Square Independence test. I
stored the data by rows in a matrix to perform the test. chisq.test()
gives $Pvalue = 2 \times 10^{-141}$, which is \(\le\) alpha (0.05).
Based on this result, I reject the null hypothesis that rank achieved is
independent of branch of the Armed Forces. There is enough evidence to
conclude that rank achieved is dependent on branch of the Armed Forces.
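The same decision can be reached with the critical-value approach instead of the p-value. The sketch below assumes the table from the problem statement and uses qchisq() for the critical value; the object names are illustrative.

```r
# Officers and enlisted personnel by branch
personnel <- matrix(c(10791, 62491,
                       7816, 42750,
                        932,  9525,
                      11819, 54344),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(c("Army", "Navy", "Marine_Corps", "Air_Force"),
                                    c("Officers", "Enlisted")))

alpha <- 0.05
fit   <- chisq.test(personnel)

# Degrees of freedom: (rows - 1) * (columns - 1) = 3
df <- (nrow(personnel) - 1) * (ncol(personnel) - 1)

# Right-tail critical value of the chi-square distribution
critical_value <- qchisq(alpha, df = df, lower.tail = FALSE)
round(critical_value, 3)  # 7.815

# Reject H0 when the statistic exceeds the critical value
decision <- if (fit$statistic > critical_value) "Reject H0" else "Do not reject H0"
decision
```

Comparing the statistic with the critical value and comparing the p-value with alpha are equivalent rules, so both give the same conclusion here.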
The amount of sodium (in milligrams) in one serving for a random sample of three different kinds of foods is listed. At the 0.05 level of significance, is there sufficient evidence to conclude that a difference in mean sodium amounts exists among condiments, cereals, and desserts?
| Condiments | Cereals | Desserts |
|---|---|---|
| 270 | 260 | 100 |
| 130 | 220 | 180 |
| 230 | 290 | 250 |
| 180 | 290 | 250 |
| 80 | 200 | 300 |
| 70 | 320 | 360 |
| 200 | 140 | 300 |
| | | 160 |
#One-way ANOVA test#####
#Hypothesis
#𝐻0:𝜇1=𝜇2=𝜇3
#𝐻1: At least one mean is different from others (claim)
##Data Collection####
Condiments = data.frame("Sodium" = c(270, 130, 230, 180, 80, 70, 200),
"Food" = rep("Condiments", 7), stringsAsFactors = FALSE)
Cereals = data.frame("Sodium" = c(260, 220, 290, 290, 200, 320, 140),
"Food" = rep("Cereals", 7), stringsAsFactors = FALSE)
Desserts = data.frame("Sodium" = c(100, 180, 250, 250, 300, 360, 300, 160),
"Food" = rep("Desserts", 8), stringsAsFactors = FALSE)
Sodium = rbind(Condiments, Cereals, Desserts)
Sodium$Food = as.factor(Sodium$Food)
##ANOVA Test######
anova2_1 = aov(Sodium ~ Food, data = Sodium)
anovasum = summary(anova2_1)
#Hypothesis testing
CLan2_1 = 0.95
alph2_1 = 1 - CLan2_1
anofval = anovasum[[1]][1, "F value"]
DofN = anovasum[[1]][1, "Df"] #k-1
DofD = anovasum[[1]][2, "Df"] #N-k
CV2_1 = qf(alph2_1, DofN, DofD, lower.tail = FALSE) #Critical value
anovhypores2_1 = ifelse(anofval < CV2_1, "Do not reject the Null Hypothesis (Fvalue < CV)", "Reject the Null Hypothesis (Fvalue > CV)" )
##Information representation######
sodfoodstat <- function(y, uplim = max(Sodium$Sodium) * 1.15) {
return(data.frame(
y = 0.95 * uplim,
label = paste(
"Count =", length(y), "\n",
"Mean =", round(mean(y), 2), "\n"
)
))
}
ggplot(Sodium, aes(x = Food, y = Sodium, fill = Food)) +
geom_boxplot() +
stat_summary(
fun.data = sodfoodstat,
geom = "text", hjust = 0.5, vjust = 0.9, size = 2) +
labs(
title = "Sodium Contents of Food",
caption = "Source: The Doctor's Pocket Calorie, Fat, and Carbohydrate Counter",
x = "Food",
y = "Sodium/serving (in milligrams)"
) +
theme_classic()+
theme(
plot.title = element_text(hjust = 0.5),
plot.caption = element_text(size = 5)
)
hyporesinfo = rbind(CLan2_1, alph2_1, round(CV2_1,3), round(anofval,3), anovhypores2_1)
col_n2_1 = c("Hypothesis Result")
row_n2_1 = c("Confidence Level(%):", "Alpha:", "CV:","F value:","Hypothesis Result:")
dimnames(hyporesinfo) = list(row_n2_1, col_n2_1)
kable(hyporesinfo,
caption = "<center>One way ANOVA Test</center>",
align = "c",
booktabs = TRUE) %>%
kable_styling(bootstrap_options = c("hover",
"bordered"),
font_size = 11) %>%
scroll_box(width = "100%", height = "100%") %>%
footnote(general = "𝐻0:𝜇1=𝜇2=𝜇3 and \n 𝐻1: At least one mean is different from others")
| | Hypothesis Result |
|---|---|
| Confidence Level(%): | 0.95 |
| Alpha: | 0.05 |
| CV: | 3.522 |
| F value: | 2.399 |
| Hypothesis Result: | Do not reject the Null Hypothesis (Fvalue < CV) |
The one-way analysis of variance test uses sample variances to test the equality of three or more means. The approach is called analysis of variance (ANOVA) because variances are compared (Bluman, 2015).
To summarize the result, because the F value (2.399) is less than the
critical value (3.522), I do not reject the null hypothesis. There is
not enough evidence to support the claim that a difference in mean
sodium amounts exists among condiments, cereals, and desserts.
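The F value is the ratio of the between-group variance to the within-group variance. As a minimal sketch of what aov() computes, the sums of squares can be built by hand from the sodium data; the names used here are illustrative.

```r
# Sodium (mg per serving) for each food group
sodium <- list(
  Condiments = c(270, 130, 230, 180, 80, 70, 200),
  Cereals    = c(260, 220, 290, 290, 200, 320, 140),
  Desserts   = c(100, 180, 250, 250, 300, 360, 300, 160)
)

values      <- unlist(sodium)
grand_mean  <- mean(values)
group_means <- sapply(sodium, mean)
group_sizes <- lengths(sodium)

# Between-group sum of squares: n_i * (group mean - grand mean)^2
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)

# Within-group sum of squares: squared deviations from each group's own mean
ss_within <- sum(sapply(sodium, function(x) sum((x - mean(x))^2)))

df_between <- length(sodium) - 1               # k - 1 = 2
df_within  <- length(values) - length(sodium)  # N - k = 19

# F = mean square between / mean square within
f_value <- (ss_between / df_between) / (ss_within / df_within)
round(f_value, 3)  # 2.399
```

The result matches the F value in the table above, and the degrees of freedom (2 and 19) are the ones passed to qf() for the critical value.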
The sales in millions of dollars for a year of a sample of leading companies are shown. At α = 0.01, is there a significant difference in the means?
Cereal 578 320 264 249 237
Chocolate_Candy 311 106 109 125 173
Coffee 261 185 302 689
#One-way ANOVA test#####
#Hypothesis
#𝐻0:𝜇1=𝜇2=𝜇3
#𝐻1: At least one mean is different from others (claim)
##Data Collection####
Cereal = data.frame("Sales" = c(578, 320, 264, 249, 237), "Food" = rep("Cereal",5), stringsAsFactors = FALSE)
Choco_candy = data.frame("Sales" = c(311, 106, 109, 125, 173), "Food" = rep("Chocolate_candy", 5), stringsAsFactors = FALSE)
Coffee = data.frame("Sales" = c(261, 185, 302, 689), "Food" = rep("Coffee", 4), stringsAsFactors = FALSE)
Sales = rbind(Cereal, Choco_candy, Coffee)
Sales$Food = as.factor(Sales$Food)
##ANOVA Test#####
ano2_2_1 = aov(Sales ~ Food, data = Sales)
anosumm = summary(ano2_2_1)
#Hypothesis Testing
CLan2_2_1 = 0.99
alph2_2_1 = 1 - CLan2_2_1
anofval2_1 = anosumm[[1]][1, "F value"]
DofN2_1 = anosumm[[1]][1, "Df"] #between the group k - 1
DofD2_1 = anosumm[[1]][2, "Df"] #within the group N-k
CV2_2_1 = qf(alph2_2_1, DofN2_1, DofD2_1, lower.tail = FALSE)
anovhypores2_2_1 = ifelse(anofval2_1 < CV2_2_1, "Do not reject the Null Hypothesis (Fvalue < CV)", "Reject the Null Hypothesis (Fvalue > CV)" )
##Information representation######
salesfoodstat <- function(y, uplim = max(Sales$Sales) * 1.15) {
return(data.frame(
y = 0.95 * uplim,
label = paste(
"Count =", length(y), "\n",
"Mean =", round(mean(y), 2), "\n"
)
))
}
ggplot(Sales, aes(x = Food, y = Sales, fill = Food)) +
geom_boxplot() +
stat_summary(
fun.data = salesfoodstat,
geom = "text", hjust = 0.5, vjust = 0.9, size = 2) +
labs(
title = "Sales for Leading Companies",
caption = "Source: Information Resources, Inc",
x = "Food",
y = "Sales (millions of USD)"
) +
theme_classic()+
theme(
plot.title = element_text(hjust = 0.5),
plot.caption = element_text(size = 5)
)
hyporesinfo_1 = rbind(CLan2_2_1, alph2_2_1, round(CV2_2_1,3), round(anofval2_1,3), anovhypores2_2_1)
col_n2_1_1 = c("Hypothesis Result")
row_n2_1_1 = c("Confidence Level(%):", "Alpha:", "CV:","F value:","Hypothesis Result:")
dimnames(hyporesinfo_1) = list(row_n2_1_1, col_n2_1_1)
kable(hyporesinfo_1,
caption = "<center>One way ANOVA Test</center>",
align = "c",
booktabs = TRUE) %>%
kable_styling(bootstrap_options = c("hover",
"bordered"),
font_size = 11) %>%
scroll_box(width = "100%", height = "100%") %>%
footnote(general = "𝐻0:𝜇1=𝜇2=𝜇3 and \n 𝐻1: At least one mean is different from others")
| | Hypothesis Result |
|---|---|
| Confidence Level(%): | 0.99 |
| Alpha: | 0.01 |
| CV: | 7.206 |
| F value: | 2.172 |
| Hypothesis Result: | Do not reject the Null Hypothesis (Fvalue < CV) |
In this task, I run a one-way ANOVA to test for a difference between the
mean sales of different food categories for a sample of leading
companies. Samples of Cereal, Chocolate_candy, and Coffee companies have
been taken, and the figures are one year of sales in millions of
dollars. I test the means of these samples at the 0.01 significance
level.
I used the aov() function to run the ANOVA test and extracted the F
value, DofN, and DofD from the summary of aov(). I then calculated the
critical value and compared it with the F value to check whether there
is a difference between the means of the food categories. As
$Fvalue < CV$, I do not reject the null hypothesis.
To summarize, there is not enough evidence to support the claim that the
means of the food samples are different.
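The summary of aov() also reports a p-value in its `Pr(>F)` column, which gives an equivalent decision rule to the critical-value comparison. A short sketch with the sales data follows; the object names are illustrative.

```r
# Sales (millions of dollars) by food category
sales <- data.frame(
  Sales = c(578, 320, 264, 249, 237,
            311, 106, 109, 125, 173,
            261, 185, 302, 689),
  Food  = factor(rep(c("Cereal", "Chocolate_candy", "Coffee"),
                     times = c(5, 5, 4)))
)

anova_table <- summary(aov(Sales ~ Food, data = sales))[[1]]

f_value <- anova_table[1, "F value"]
p_value <- anova_table[1, "Pr(>F)"]
alpha   <- 0.01

round(f_value, 3)  # 2.172

# Do not reject H0 when the p-value is greater than alpha
p_value > alpha    # TRUE
```

Checking `p_value > alpha` is equivalent to checking that the F value is below the critical value from qf().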
The expenditures (in dollars) per pupil for states in three sections of the country are listed. Using α = 0.05, can you conclude that there is a difference in means?
Eastern_third 4946, 5953, 6202, 7243, 6113
Middle_third 6149, 7451, 6000, 6479
Western_third 5282, 8605, 6528, 6911
#One Way ANOVA test#####
#Hypothesis
#𝐻0:𝜇1=𝜇2=𝜇3
#𝐻1: At least one mean is different from others (claim)
##Data Collection####
Eastern = data.frame("Expenditure" = c(4946, 5953, 6202, 7243, 6113),
"States" = rep("Eastern", 5), stringsAsFactors = FALSE)
Middle = data.frame("Expenditure" = c(6149, 7451, 6000, 6479),
"States" = rep("Middle", 4), stringsAsFactors = FALSE)
Western = data.frame("Expenditure" = c(5282, 8605, 6528, 6911),
"States" = rep("Western", 4), stringsAsFactors = FALSE)
Expenditure = rbind(Eastern, Middle, Western)
Expenditure$States = as.factor(Expenditure$States)
##ANOVA Test######
anova2_2_2 = aov(Expenditure ~ States, data = Expenditure)
anovasum2_2_2 = summary(anova2_2_2)
#Hypothesis Testing
CLan2_2_2 = 0.95
alph2_2_2 = 1 - CLan2_2_2
anofval2_2 = anovasum2_2_2[[1]][1, "F value"]
DofN2_2 = anovasum2_2_2[[1]][1, "Df"] #between the group k - 1
DofD2_2 = anovasum2_2_2[[1]][2, "Df"] #within the group N-k
CV2_2_2 = qf(alph2_2_2, DofN2_2, DofD2_2, lower.tail = FALSE)
anovhypores2_2_2 = ifelse(anofval2_2 < CV2_2_2, "Do not reject the Null Hypothesis (Fvalue < CV)", "Reject the Null Hypothesis (Fvalue > CV)" )
##Information representation######
stateexpenstat <- function(y, uplim = max(Expenditure$Expenditure) * 1.15) {
return(data.frame(
y = 0.95 * uplim,
label = paste(
"Count =", length(y), "\n",
"Mean =", round(mean(y), 2), "\n"
)
))
}
ggplot(Expenditure, aes(x = States, y = Expenditure, fill = States)) +
geom_boxplot() +
stat_summary(
fun.data = stateexpenstat,
geom = "text", hjust = 0.5, vjust = 0.9, size = 2) +
labs(
title = "The expenditures per pupil (in dollars)",
caption = "Source: New York Times Almanac",
x = "States",
y = "Expenditure (USD)/ pupil"
) +
theme_classic()+
theme(
plot.title = element_text(hjust = 0.5),
plot.caption = element_text(size = 5)
)
hyporesinfo_2 = rbind(CLan2_2_2, alph2_2_2, round(CV2_2_2,3), round(anofval2_2,3), anovhypores2_2_2)
col_n2_1_2 = c("Hypothesis Result")
row_n2_1_2 = c("Confidence Level(%):", "Alpha:", "CV:","F value:","Hypothesis Result:")
dimnames(hyporesinfo_2) = list(row_n2_1_2, col_n2_1_2)
kable(hyporesinfo_2,
caption = "<center>One way ANOVA Test</center>",
align = "c",
booktabs = TRUE) %>%
kable_styling(bootstrap_options = c("hover",
"bordered"),
font_size = 11) %>%
scroll_box(width = "100%", height = "100%") %>%
footnote(general = "𝐻0:𝜇1=𝜇2=𝜇3 and \n 𝐻1: At least one mean is different from others")
| | Hypothesis Result |
|---|---|
| Confidence Level(%): | 0.95 |
| Alpha: | 0.05 |
| CV: | 4.103 |
| F value: | 0.649 |
| Hypothesis Result: | Do not reject the Null Hypothesis (Fvalue < CV) |
In this task, expenditure per pupil (in USD) is provided for states in
three sections: Eastern, Middle, and Western. I run a one-way ANOVA to
check whether there is a difference in the means of the three sections.
I use aov() for the ANOVA test, and the F value from this test is
compared with the critical value.
As \(Fvalue < CV\), I do not reject the null hypothesis.
To summarize, there is not enough evidence to support the claim that
there is a difference in means.
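The three one-way ANOVA tasks repeat the same steps: extract the F value and degrees of freedom, compute the critical value, and compare. One way to avoid the repetition is a small helper. The function `f_decision()` below is a hypothetical name introduced only for this sketch, applied here to the expenditure data.

```r
# Hypothetical helper: critical-value decision for a one-way aov() fit
f_decision <- function(fit, alpha) {
  tab     <- summary(fit)[[1]]
  f_value <- tab[1, "F value"]
  # Critical value from the F distribution with (k - 1, N - k) degrees of freedom
  cv <- qf(alpha, tab[1, "Df"], tab[2, "Df"], lower.tail = FALSE)
  list(f_value = f_value, critical_value = cv, reject = f_value > cv)
}

# Expenditure per pupil (USD) by section of the country
expenditure <- data.frame(
  Expenditure = c(4946, 5953, 6202, 7243, 6113,
                  6149, 7451, 6000, 6479,
                  5282, 8605, 6528, 6911),
  States      = factor(rep(c("Eastern", "Middle", "Western"),
                           times = c(5, 4, 4)))
)

res <- f_decision(aov(Expenditure ~ States, data = expenditure), alpha = 0.05)
round(res$f_value, 3)         # 0.649
round(res$critical_value, 3)  # 4.103
res$reject                    # FALSE
```

The helper assumes a single-factor model, so the first row of the summary is the factor and the second row is the residuals.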
#Two way ANOVA####
#Hypothesis
#H0: There is no interaction effect between type of light used and type of
#plant food used on plant growth
#H1: There is an interaction effect between type of light used and type of
#plant food used on plant growth.
#The hypotheses for the growth light types are
#H0: There is no difference between the means of plant growth for
#two types of light
#H1: There is a difference between the means of plant growth for
#two types of light.
#Data Collection and representation
PlantGrowth = data.frame(PlantGrowth)
PlantGrowth$Plant.Food = factor(PlantGrowth$Plant.Food)
PlantGrowth$light = factor(PlantGrowth$light)
ggplot(data = PlantGrowth, aes(x = Plant.Food, y = Growth, colour = light)) +
geom_boxplot()
##TWO Way ANOVA test#####
Anovatwoway = aov(Growth ~ Plant.Food * light, data = PlantGrowth)
Anovatwoway_summary = summary(Anovatwoway)
plot(TukeyHSD(Anovatwoway, conf.level=.95), las = 2)
##Hypothesis testing#####
CLtwoANOVA = 0.95
alphtwoANOVA = 1 - CLtwoANOVA
ANOVAFvalinteraction = Anovatwoway_summary[[1]][3, "F value"] #Fvalue of interaction
ANOVAFvallight = Anovatwoway_summary[[1]][2, "F value"] #Fvalue of light
DofNlight = Anovatwoway_summary[[1]][2, "Df"] #(a-1) a is levels in light (b is levels in food)
DofDlight = Anovatwoway_summary[[1]][4, "Df"] # (ab)(n-1) n is the number of data values in each group
DofNinteraction = Anovatwoway_summary[[1]][3, "Df"] #(a-1)(b-1)
DofDinteraction = Anovatwoway_summary[[1]][4, "Df"] # (ab)(n-1) n is the number of data values in each group
CVinteraction = qf(alphtwoANOVA, DofNinteraction, DofDinteraction, lower.tail = FALSE)
CVlight = qf(alphtwoANOVA, DofNlight, DofDlight, lower.tail = FALSE)
hypotwoANOVA = ifelse(ANOVAFvalinteraction < CVinteraction, "Do not reject the null hypothesis (Fvalue < CV)", "Reject the null hypothesis (Fvalue > CV)")
hypotwoANOVAlight = ifelse(ANOVAFvallight < CVlight, "Do not reject the null hypothesis (Fvalue < CV)", "Reject the null hypothesis (Fvalue > CV)")
##Information Representation#####
plantgrost <-
PlantGrowth %>%
group_by(light, Plant.Food) %>% # group by the two factors
summarise(Means = mean(Growth), SEs = sd(Growth)/sqrt(n())) # Mean and Std Er
ggplot(plantgrost,
aes(x = light, y = Means, fill = Plant.Food,
ymin = Means - SEs, ymax = Means + SEs)) +
# this adds the mean
geom_col(position = position_dodge()) +
# this adds the error bars
geom_errorbar(position = position_dodge(0.9), width=.2) +
# controlling the appearance
xlab("Growth Light") + ylab("Plant Growth (in inches)")
hyporestwoANOVA = rbind(CLtwoANOVA, alphtwoANOVA, round(CVinteraction,3), round(ANOVAFvalinteraction,3), hypotwoANOVA)
col_twoANOVA = c("Hypothesis Result")
row_twoANOVA = c("Confidence Level(%):", "Alpha:", "CV:","F value:","Hypothesis Result:")
dimnames(hyporestwoANOVA) = list(row_twoANOVA, col_twoANOVA)
kable(hyporestwoANOVA,
caption = "<center>Two way ANOVA interaction Test</center>",
align = "c",
booktabs = TRUE) %>%
kable_styling(bootstrap_options = c("hover",
"bordered"),
font_size = 11) %>%
scroll_box(width = "100%", height = "100%") %>%
footnote(general = "𝐻0: There is no interaction effect between type of light used and type of plant food used on plant growth and \n 𝐻1: There is an interaction effect between type of light used and type of plant food used on plant growth.")
| | Hypothesis Result |
|---|---|
| Confidence Level(%): | 0.95 |
| Alpha: | 0.05 |
| CV: | 5.318 |
| F value: | 1.438 |
| Hypothesis Result: | Do not reject the null hypothesis (Fvalue < CV) |
hyporestwoANOVAlight = rbind(CLtwoANOVA, alphtwoANOVA, round(CVlight,3), round(ANOVAFvallight,3), hypotwoANOVAlight)
col_twoANOVAlight = c("Hypothesis Result")
row_twoANOVAlight = c("Confidence Level(%):", "Alpha:", "CV:","F value:","Hypothesis Result:")
dimnames(hyporestwoANOVAlight) = list(row_twoANOVAlight, col_twoANOVAlight)
kable(hyporestwoANOVAlight,
caption = "<center>Two way ANOVA mean of growth to light Test</center>",
align = "c",
booktabs = TRUE) %>%
kable_styling(bootstrap_options = c("hover",
"bordered"),
font_size = 11) %>%
scroll_box(width = "100%", height = "100%") %>%
footnote(general = "H0: There is no difference between the means of plant growth for two types of light and \n H1: There is a difference between the means of plant growth for two types of light.")
| | Hypothesis Result |
|---|---|
| Confidence Level(%): | 0.95 |
| Alpha: | 0.05 |
| CV: | 5.318 |
| F value: | 3.681 |
| Hypothesis Result: | Do not reject the null hypothesis (Fvalue < CV) |
In this task, I test the interaction between growth lights and plant
foods. Plant foods supply different nutrients, and exposure to light
supports healthy growth. However, it could be the case that growth does
not differ significantly across combinations of light and food. To test
this, the alternative hypothesis is that an interaction between plant
food and growth light exists.
In this case, the two independent variables are growth light and plant food, and plant growth is the dependent variable. I calculate the F value and the critical value to test each hypothesis. aov(Growth ~ Plant.Food * light) provides F values for the food, light, and food*light (interaction) terms. I calculate the critical value with qf(), using the degrees of freedom for the light and interaction terms extracted from the summary of aov(). With \(a\) levels of light, \(b\) levels of plant food, and \(n\) data values in each group, the degrees of freedom are \(a - 1\) for light, \((a-1)(b-1)\) for the interaction, and \(ab(n-1)\) within groups.
To summarize, there is not enough evidence to reject either null hypothesis. It cannot be concluded that the combination of type of growth light and type of plant food affects plant growth, or that mean plant growth differs between the two types of light.
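The degrees of freedom can be checked against a balanced design. The sketch below uses made-up growth values, since the Excel data are not reproduced here; it assumes two levels of light, two levels of plant food, and three plants per group, which is consistent with the critical value of 5.318 in the tables above. The level names and the `toy` data frame are hypothetical.

```r
set.seed(1)

# Hypothetical balanced 2 x 2 design with n = 3 replicates per cell
toy <- expand.grid(rep        = 1:3,
                   Plant.Food = c("A", "B"),
                   light      = c("1", "2"))
toy$Growth <- rnorm(nrow(toy), mean = 10, sd = 2)  # simulated values only

tab <- summary(aov(Growth ~ Plant.Food * light, data = toy))[[1]]

# Rows: Plant.Food, light, Plant.Food:light, Residuals
tab[, "Df"]

a <- 2; b <- 2; n <- 3
c(food = b - 1, light = a - 1,
  interaction = (a - 1) * (b - 1), within = a * b * (n - 1))

# Critical value shared by the light and interaction tests: F(1, 8) at alpha = 0.05
round(qf(0.05, 1, 8, lower.tail = FALSE), 3)  # 5.318
```

Only the degrees of freedom carry over to the real data; the simulated F values have no meaning for the plant growth experiment.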
#EDA#####
Task3_data = data.frame(baseball_1)
describe(Task3_data) %>%
kable(caption = "<center>Descriptive Statistics</center>") %>%
kable_styling("bordered")
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Team* | 1 | 1232 | 18.9285714 | 10.6140364 | 20.000 | 18.7576065 | 13.3434000 | 1.000 | 39.000 | 38.000 | 0.0628129 | -1.2525294 | 0.3023954 |
| League* | 2 | 1232 | 1.5000000 | 0.5002030 | 1.500 | 1.5000000 | 0.7413000 | 1.000 | 2.000 | 1.000 | 0.0000000 | -2.0016227 | 0.0142509 |
| Year | 3 | 1232 | 1988.9577922 | 14.8196251 | 1989.000 | 1989.3184584 | 19.2738000 | 1962.000 | 2012.000 | 50.000 | -0.1515595 | -1.2077412 | 0.4222133 |
| RS | 4 | 1232 | 715.0819805 | 91.5342940 | 711.000 | 713.3387424 | 90.4386000 | 463.000 | 1009.000 | 546.000 | 0.1740832 | -0.0301863 | 2.6078252 |
| RA | 5 | 1232 | 715.0819805 | 93.0799326 | 709.000 | 712.4371197 | 91.9212000 | 472.000 | 1103.000 | 631.000 | 0.2978360 | -0.0205915 | 2.6518607 |
| W | 6 | 1232 | 80.9042208 | 11.4581390 | 81.000 | 81.1206897 | 11.8608000 | 40.000 | 116.000 | 76.000 | -0.1814238 | -0.3070528 | 0.3264440 |
| OBP | 7 | 1232 | 0.3263312 | 0.0150128 | 0.326 | 0.3262586 | 0.0148260 | 0.277 | 0.373 | 0.096 | 0.0175923 | 0.0574876 | 0.0004277 |
| SLG | 8 | 1232 | 0.3973417 | 0.0332669 | 0.396 | 0.3970903 | 0.0340998 | 0.301 | 0.491 | 0.190 | 0.0541978 | -0.3250999 | 0.0009478 |
| BA | 9 | 1232 | 0.2592727 | 0.0129072 | 0.260 | 0.2593864 | 0.0133434 | 0.214 | 0.294 | 0.080 | -0.1109140 | -0.0002138 | 0.0003677 |
| Playoffs | 10 | 1232 | 0.1980519 | 0.3986934 | 0.000 | 0.1227181 | 0.0000000 | 0.000 | 1.000 | 1.000 | 1.5134587 | 0.2907952 | 0.0113588 |
| RankSeason | 11 | 244 | 3.1229508 | 1.7383492 | 3.000 | 2.9744898 | 1.4826000 | 1.000 | 8.000 | 7.000 | 0.5560350 | -0.5778485 | 0.1112864 |
| RankPlayoffs | 12 | 244 | 2.7172131 | 1.0952342 | 3.000 | 2.7602041 | 1.4826000 | 1.000 | 5.000 | 4.000 | -0.2688894 | -1.1199065 | 0.0701152 |
| G | 13 | 1232 | 161.9188312 | 0.6243652 | 162.000 | 161.9492901 | 0.0000000 | 158.000 | 165.000 | 7.000 | -1.0420881 | 6.9746969 | 0.0177883 |
| OOBP | 14 | 420 | 0.3322643 | 0.0152953 | 0.331 | 0.3319345 | 0.0163086 | 0.294 | 0.384 | 0.090 | 0.1943470 | -0.3710080 | 0.0007463 |
| OSLG | 15 | 420 | 0.4197429 | 0.0265096 | 0.419 | 0.4194405 | 0.0266868 | 0.346 | 0.499 | 0.153 | 0.1176197 | -0.2148956 | 0.0012935 |
DescTools::Desc(Task3_data$W)
## ------------------------------------------------------------------------------
## Task3_data$W (numeric)
##
## length n NAs unique 0s mean meanCI'
## 1'232 1'232 0 63 0 80.90 80.26
## 100.0% 0.0% 0.0% 81.54
##
## .05 .10 .25 median .75 .90 .95
## 62.00 66.00 73.00 81.00 89.00 95.90 98.00
##
## range sd vcoef mad IQR skew kurt
## 76.00 11.46 0.14 11.86 16.00 -0.18 -0.31
##
## lowest : 40.0, 43.0, 50.0, 51.0 (2), 52.0 (2)
## highest: 106.0, 108.0 (3), 109.0, 114.0, 116.0
##
## ' 95%-CI (classic)
Task3_data %>%
mutate(Team = fct_reorder(Team, W, .fun='length' )) %>%
ggplot( aes(x=Team, y=W, fill=Team)) +
geom_boxplot(width = 0.4) +
xlab("Team") +
ylab("Wins")+
theme(legend.position="none",
axis.text = element_text(size = 6))
#Decade winnings#####
Task3_data %<>%
mutate(Decade = as.numeric(Year) - as.numeric(Year) %% 10)
win_decade = Task3_data %>%
group_by(Decade)
win_decade %>%
ggplot(aes(x = Decade)) +
geom_bar() +
ggtitle("Win Count by Decade")+
xlab("Year")+
theme(plot.title = element_text(hjust = 0.5))
#Chi-Square Wins by Decade####
winsbydecade = chisq.test(x=win_decade$W)
##Hypothesis testing######
# 𝐻0: Number of wins per decade are same and
#𝐻1: Number of wins per decade are not same (claim)
CLt3 = 0.95
alpht3 = 1 - CLt3
t3_hypores = ifelse(winsbydecade$p.value > alpht3, "Do not Reject the Null Hypothesis (Pvalue > Alpha )", "Reject the Null Hypothesis (Pvalue <= Alpha)")
##Information representation#####
hyporesinfo_t3 = rbind(CLt3, alpht3, round(winsbydecade$p.value,40), t3_hypores)
col_nt3 = c("Hypothesis Result")
row_nt3 = c("Confidence Level(%):", "Alpha:", "P value:","Hypothesis Result:")
dimnames(hyporesinfo_t3) = list(row_nt3, col_nt3)
kable(hyporesinfo_t3,
caption = "<center>Wins per Decade Hypothesis testing(Chi-Square Goodness of Fit)</center>",
align = "c",
booktabs = TRUE) %>%
kable_styling(bootstrap_options = c("hover",
"bordered"),
font_size = 11) %>%
scroll_box(width = "100%", height = "100%") %>%
footnote(general = "𝐻0: Number of wins per decade are same and \n 𝐻1: Number of wins per decade are not same")
| | Hypothesis Result |
|---|---|
| Confidence Level(%): | 0.95 |
| Alpha: | 0.05 |
| P value: | 2.2e-39 |
| Hypothesis Result: | Reject the Null Hypothesis (Pvalue <= Alpha) |
Baseball data are studied in this task. describe() is used to understand
the data: its descriptive statistics show the variable types and the
number of observations per variable. Furthermore, DescTools::Desc()
provides insights into the wins variable (W). From the chart and
descriptive statistics, it is visible that W is approximately normally
distributed. The mean of W is 80.90, with a 95% confidence interval of
80.26 to 81.54.
As the test concerns wins by decade, I plotted a boxplot to give an
overview of wins by team. The x-axis shows teams and the y-axis shows
the number of wins. The boxplot helps to review the distribution and
identify outliers; as the chart shows, there are outliers in the data.
As the hypothesis test is to determine whether wins differ by decade, I
calculated a Decade column from the Year variable. For the Chi-Square
Goodness-of-Fit test, I use chisq.test() on the wins (W) of the
decade-grouped data. The bar chart provides an overview of the count per
decade and suggests that there is a difference between decades. The
Chi-Square test is used to test this hypothesis. As
\(Pvalue \le \alpha\), I reject the null hypothesis. To summarize the
result, there is enough evidence to support the claim that there is a
difference between the number of wins by decade.
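When chisq.test() is given a plain vector, it treats every element as the count of a separate category, so passing one win total per team-season tests each team-season against the others. A goodness-of-fit test of wins per decade would first sum the wins within each decade. The following is a sketch of that aggregation on a small made-up data frame; the values in `toy` are hypothetical and are not taken from baseball-1.csv.

```r
# Hypothetical team-season records (not the real baseball data)
toy <- data.frame(
  Year = c(1962, 1968, 1975, 1979, 1984, 1988),
  W    = c(80, 90, 85, 95, 100, 70)
)

# Decade derived from Year, as in the analysis code
toy$Decade <- toy$Year - toy$Year %% 10

# Total wins per decade: one count per category
wins_per_decade <- sapply(split(toy$W, toy$Decade), sum)
wins_per_decade

# Goodness of fit against equal expected totals (the default when p is omitted)
fit <- chisq.test(x = wins_per_decade)
fit$parameter  # degrees of freedom = number of decades - 1
```

Equal expected totals are themselves an assumption: the data run from 1962 to 2012, so the decades do not all cover the same number of seasons.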
Bluman, A. (2014). Elementary statistics: A step by step approach. McGraw-Hill Education.
Childs, D., Hindle, B., & Warren, P. (2021). Two-way ANOVA in R. https://dzchilds.github.io/stats-for-bio/two-way-anova-in-r.html
Kabacoff, R. (2015). R in Action. Manning Publications Co.
Larson, M. (2008). Analysis of Variance. Aha Journals. https://doi.org/10.1161/CIRCULATIONAHA.107.654335
University of Southampton. (n.d.). Chi Square. https://www.southampton.ac.uk/passs/full_time_education/bivariate_analysis/chi_square
Zach. (2020). How to Find the Chi-Square Critical Value in R. Statology. https://www.statology.org/chi-square-critical-value-r/