Regression Diagnostics with R

rm(list=ls())

#Library######
library(readr)
library(tidyverse) 
library(dplyr) 
library(DT) 
library(RColorBrewer) 
library(rio) 
library(dbplyr) 
library(psych) 
library(FSA) 
library(knitr)
library(plotrix)
library(kableExtra)
library(ISLR)
library(data.table)
library(magrittr)
library(ggplot2)
library(summarytools)
library(hrbrthemes)
library(cowplot)
library(reshape2)
library(scales)
library(zoo)
library(corrplot)
library(lares)
library(leaps)
library(MASS)
library(car)
library(sjPlot)
library(sjmisc)
library(sjlabelled)
library(tidyr)
library(readxl)
library(forcats)

#dataset used######
PlantGrowth <- read_excel("Datasets/Task3_Two_ANOVA.xlsx", 
    sheet = "Sheet2")

baseball_1 <- read_csv("Datasets/baseball-1.csv")

Chi-Square Testing and ANOVA

Introduction

A chi-square test compares observed results with the results expected under a hypothesis. Its goal is to determine whether a discrepancy between observed and expected data is due to chance or to a relationship between the variables under consideration. This makes the chi-square test well suited to studying the association between two categorical variables (University of Southampton, n.d.). Chi-square tests are used to test a frequency distribution (goodness of fit), the independence of two variables, and the homogeneity of proportions (Bluman, 2015).

ANOVA is a statistical approach for analyzing variability in a response variable (a continuous random variable) measured under conditions defined by discrete factors (classification variables, often with nominal levels). ANOVA is widely used to test the equality of several means by comparing the variation among groups with the variation within groups (random error) (Larson, 2008).

Task 1: Chi-Square Tests

Task 1.1_1: Chi-Square Goodness-of-Fit Test

A medical researcher wishes to see if hospital patients in a large hospital have the same blood type distribution as those in the general population. The distribution for the general population is as follows: type A, 20%; type B, 28%; type O, 36%; and type AB, 16%. He selects a random sample of 50 patients and finds the following: 12 have type A blood, 8 have type B, 24 have type O, and 6 have type AB blood.
At α = 0.10, can it be concluded that the distribution is the same as that of the general population?

#Chi-Square Goodness-of-Fit Test 1.1######

#Hypothesis Formation#####

#𝐻0:P𝐴=0.20,P𝐵=0.28,P𝑂=0.36,P𝐴𝐵=0.16
#𝐻1: The distribution is not the same as stated in the null hypothesis

##Distribution####
Type_A =  0.20
Type_B = 0.28
Type_O = 0.36 
Type_AB = 0.16

Blood_prob = c(Type_A,
               Type_B,
               Type_O,
               Type_AB)
n = 50

#Expected Count

TypeA_Exp = Type_A * n
TypeB_Exp = Type_B * n
TypeO_Exp = Type_O * n
TypeAB_Exp = Type_AB * n

Exp_Blood_count = c(TypeA_Exp,
                    TypeB_Exp,
                    TypeO_Exp,
                    TypeAB_Exp)

#Observed count
TypeA_obs = 12
TypeB_obs = 8
TypeO_obs = 24
TypeAB_obs = 6

Obs_Blood_count = c(TypeA_obs,
                    TypeB_obs,
                    TypeO_obs,
                    TypeAB_obs)


##Chi-Square testing#####

CLt1 = 0.90
aphat190 = 1 - CLt1

##Finding X^2 value######

ch_sqrt1 = chisq.test(x=Obs_Blood_count, p=Blood_prob)

T1_1testresult = ifelse(ch_sqrt1$p.value > aphat190, "Do not Reject the Null Hypothesis (Pvalue > alpha)","Reject the Null Hypothesis (Pvalue <= alpha)")

###Information Representation#####

Blood_count = cbind(Exp_Blood_count,
                    Obs_Blood_count)
blood_type = c("Blood Type A", 
               "Blood Type B",
               "Blood Type O",
               "Blood Type AB")
col_n = c("Expected","Observed")
dimnames(Blood_count) = list(blood_type,col_n)


T1_1hypores = rbind(CLt1*100, aphat190, round(ch_sqrt1$statistic,3), round(ch_sqrt1$p.value,3), T1_1testresult)
col_n1 = c("Hypothesis Result")
row_n1 = c("Confidence Level(%):", "Alpha:", "X^2 Value:","p value:","Hypothesis Result:")
dimnames(T1_1hypores) = list(row_n1,col_n1)


kable(
  list(Blood_count,T1_1hypores),
        caption = "<center>Chi-Square Goodness-of-Fit Test</center>",
        align = "c",
       booktabs = TRUE)%>%
    kable_styling(bootstrap_options = c("hover",
                                        "bordered"),
                font_size = 11) %>%
  scroll_box(width = "100%", height = "100%") %>%
  footnote(general = "𝐻0:P𝐴=0.20,P𝐵=0.28,P𝑂=0.36,P𝐴𝐵=0.16 and \n 𝐻1: The distribution is not the same as stated in the null hypothesis")

Chi-Square Goodness-of-Fit Test

	Expected	Observed
Blood Type A	10	12
Blood Type B	14	8
Blood Type O	18	24
Blood Type AB	8	6

	Hypothesis Result
Confidence Level(%):	90
Alpha:	0.1
X^2 Value:	5.471
p value:	0.14
Hypothesis Result:	Do not Reject the Null Hypothesis (Pvalue > alpha)

Note: 𝐻0:P𝐴=0.20,P𝐵=0.28,P𝑂=0.36,P𝐴𝐵=0.16 and
𝐻1: The distribution is not the same as stated in the null hypothesis

Observation:

In this task, I test the distribution of blood types. A random sample of 50 patients was taken, and the observed blood type counts are compared with the counts expected if the hospital's distribution matches that of the general population. Because this compares observed frequencies across categories with a hypothesized distribution, I use the chi-square goodness-of-fit test.
The population proportions are blood type A, 20%; type B, 28%; type O, 36%; and type AB, 16%. The hypothesis test determines whether the observed proportions are consistent with these expected proportions.

The null hypothesis is that the blood types of hospital patients follow the population proportions, and the alternative hypothesis is that the distribution differs from the one stated in the null hypothesis. The chi-square goodness-of-fit test is performed using chisq.test(), which gives $X^2 = $ 5.471 and $Pvalue = $0.14. As the p-value is greater than alpha (0.10), I do not reject the null hypothesis. There is not enough evidence to conclude that the blood type distribution of the hospital patients differs from that of the general population.
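
The statistic reported by chisq.test() can also be reproduced by hand, which makes the decision rule easier to follow. The sketch below recomputes it from the counts in the task statement; the variable names are illustrative and are not part of the original script.

```r
# Observed counts and hypothesized proportions from the task statement
observed <- c(A = 12, B = 8, O = 24, AB = 6)
p_null   <- c(A = 0.20, B = 0.28, O = 0.36, AB = 0.16)
alpha    <- 0.10

# Expected counts under H0: n * p_i
expected <- sum(observed) * p_null

# Chi-square statistic: sum of (O - E)^2 / E over the categories
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: number of categories minus 1
df <- length(observed) - 1

# Right-tail p-value and critical value from the chi-square distribution
p_value  <- pchisq(chi_sq, df = df, lower.tail = FALSE)
crit_val <- qchisq(alpha, df = df, lower.tail = FALSE)

round(c(chi_sq = chi_sq, p_value = p_value, crit_val = crit_val), 3)
```

Comparing chi_sq with crit_val and comparing p_value with alpha are two views of the same rule, so both lead to the same decision about the null hypothesis.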

Task 1.1_2: Chi-Square Goodness-of-Fit Test

According to the Bureau of Transportation Statistics, on-time performance by the airlines is described as follows:

Action	% of Time
On time	70.8
National Aviation System delay	8.2
Aircraft arriving late	9.0
Other (because of weather and other conditions)	12.0

Records of 200 randomly selected flights for a major airline company showed that 125 planes were on time; 40 were delayed because of weather, 10 because of a National Aviation System delay, and the rest because of arriving late. At α = 0.05, do these results differ from the government’s statistics?

#Chi-Square Goodness-of-Fit Test 1.2#######

#Hypothesis Formation

#𝐻0:P_OT=0.708,P_SD=0.082,P_AL=0.09,P_O=0.12
#𝐻1: The distribution is not the same as stated in the null hypothesis (Claim).

##Distribution####
On_time =  0.708
NA_System_delay = 0.082
arriving_late = 0.09 
othr = 0.12

Flight_prob = c(On_time,
               NA_System_delay,
               arriving_late,
               othr)
n1 = 200

#Expected Count

On_time_Exp = On_time * n1
NA_System_delay_Exp = NA_System_delay * n1
arriving_late_Exp = arriving_late * n1
othr_Exp = othr * n1

Exp_Flight_count = round(c(On_time_Exp,
                    NA_System_delay_Exp,
                    arriving_late_Exp,
                    othr_Exp),0)

#Observed count
On_time_obs = 125
NA_System_delay_obs = 10
arriving_late_obs = 25
othr_obs = 40

Obs_flight_count = c(On_time_obs,
                    NA_System_delay_obs,
                    arriving_late_obs,
                    othr_obs)


##Chi-Square testing#####

CLt1_1 = 0.95
aphat1_195 = 1 - CLt1_1

##Finding X^2 value######

ch_sqrt1_1 = chisq.test(x=Obs_flight_count, p=Flight_prob)

T1_1testresult_1 = ifelse(ch_sqrt1_1$p.value > aphat1_195, "Do not Reject the Null Hypothesis (Pvalue > alpha)","Reject the Null Hypothesis (Pvalue <= alpha)")

###Information Representation#####

flight_count = cbind(Exp_Flight_count,
                    Obs_flight_count)
record_type = c("On time", 
               "National Aviation System delay",
               "Aircraft arriving late",
               "Other (because of weather and other conditions)")
col_n_1 = c("Expected","Observed")
dimnames(flight_count) = list(record_type,col_n_1)


T1_1hypores_1 = rbind(CLt1_1*100, aphat1_195, round(ch_sqrt1_1$statistic,3), round(ch_sqrt1_1$p.value,6), T1_1testresult_1)
col_n1_1 = c("Hypothesis Result")
row_n1_1 = c("Confidence Level(%):", "Alpha:", "X^2 Value:","p value:","Hypothesis Result:")
dimnames(T1_1hypores_1) = list(row_n1_1,col_n1_1)


kable(
  list(flight_count,T1_1hypores_1),
        caption = "<center>Chi-Square Goodness-of-Fit Test</center>",
        align = "c",
       booktabs = TRUE)%>%
    kable_styling(bootstrap_options = c("hover",
                                        "bordered"),
                font_size = 11) %>%
  scroll_box(width = "100%", height = "100%") %>%
  footnote(general = "𝐻0:P_OT=0.708,P_SD=0.082,P_AL=0.09,P_O=0.12 and \n H1: The distribution is not the same as stated in the null hypothesis")

Chi-Square Goodness-of-Fit Test

	Expected	Observed
On time	142	125
National Aviation System delay	16	10
Aircraft arriving late	18	25
Other (because of weather and other conditions)	24	40

	Hypothesis Result
Confidence Level(%):	95
Alpha:	0.05
X^2 Value:	17.832
p value:	0.000476
Hypothesis Result:	Reject the Null Hypothesis (Pvalue <= alpha)

Note: 𝐻0:P_OT=0.708,P_SD=0.082,P_AL=0.09,P_O=0.12 and
H1: The distribution is not the same as stated in the null hypothesis

Observation:

In this task, I test the on-time performance of a major airline against the figures published by the Bureau of Transportation Statistics.
The expected distribution from the government statistics is:
On time 70.8%, National Aviation System delay 8.2%, Aircraft arriving late 9.0%, and Other (because of weather and other conditions) 12.0%
After running chisq.test(), I reject the null hypothesis, as $Pvalue = $0.000476, which is $\le$ alpha (0.05).
There is enough evidence to support the claim that the airline's on-time performance differs from the government's statistics.
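
A rejected goodness-of-fit test does not say which categories depart from the expected distribution. As a rough sketch (variable names are illustrative, not from the original script), the contribution of each category to the statistic can be inspected, together with the usual rule of thumb that every expected count should be at least 5 for the chi-square approximation to be reasonable.

```r
# Observed counts and government proportions from the task statement
observed <- c(on_time = 125, nas_delay = 10, arriving_late = 25, other = 40)
p_null   <- c(on_time = 0.708, nas_delay = 0.082, arriving_late = 0.09, other = 0.12)

# Unrounded expected counts under H0
expected <- sum(observed) * p_null

# Rule of thumb for the chi-square approximation: all expected counts >= 5
all(expected >= 5)

# Contribution of each category to the chi-square statistic
contribution <- (observed - expected)^2 / expected
round(contribution, 3)

# Category with the largest contribution
names(which.max(contribution))
```

The contributions sum to the test statistic, so a category with a large share of the total is where the sample departs most from the hypothesized proportions.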

Task 1.2_1: Chi-Square Independence Test

Are movie admissions related to ethnicity? A 2014 study indicated the following numbers of admissions (in thousands) for two different years. At the 0.05 level of significance, can it be concluded that movie attendance by year was dependent upon ethnicity?

	Caucasian	Hispanic	African_American	Other
2013	724	335	174	107
2014	370	292	152	140

#Chi-Square Independence Test######

#Hypothesis

#H0: Movie attendance by year was independent of ethnicity.
#H1: Movie attendance by year was dependent on ethnicity (claim).

#Data presented in Matrix

r1 = c(724, 335, 174, 107)
r2 = c(370, 292, 152, 140)

#row count
rowno = 2

#matrix 

matrix1_2 = matrix(c(r1, r2), nrow = rowno, byrow = TRUE)

#Modifying Matrix col and row names

rownames(matrix1_2) = c("2013","2014")
colnames(matrix1_2) = c("Caucasian", "Hispanic", "African_American", "Other")

##Chi Square Independence test######

CLt1_2 = 0.95
aphat1_295 = 1 - CLt1_2

ch_sqrt1_2 = chisq.test(matrix1_2)

T1_2testresult_1 = ifelse(ch_sqrt1_2$p.value > aphat1_295,"Do not Reject the Null Hypothesis (Pvalue > alpha)","Reject the Null Hypothesis (Pvalue <= alpha)") 


##Information Representation#####

T1_2hypores_1 = rbind(CLt1_2*100, aphat1_295, round(ch_sqrt1_2$statistic,3), round(ch_sqrt1_2$p.value,14), T1_2testresult_1)
col_n1_2 = c("Hypothesis Result")
row_n1_2 = c("Confidence Level(%):", "Alpha:", "X^2 Value:","p value:","Hypothesis Result:")
dimnames(T1_2hypores_1) = list(row_n1_2,col_n1_2)

kable(
  list(matrix1_2,T1_2hypores_1),
        caption = "<center>Chi-Square Independence Test</center>",
        align = "c",
       booktabs = TRUE)%>%
    kable_styling(bootstrap_options = c("hover",
                                        "bordered"),
                font_size = 11) %>%
  scroll_box(width = "100%", height = "100%") %>%
  footnote(general = "H0: Movie attendance by year was independent upon ethnicity and \n H1: Movie attendance by year was dependent upon ethnicity")

Chi-Square Independence Test

	Caucasian	Hispanic	African_American	Other
2013	724	335	174	107
2014	370	292	152	140

	Hypothesis Result
Confidence Level(%):	95
Alpha:	0.05
X^2 Value:	60.144
p value:	5.5e-13
Hypothesis Result:	Reject the Null Hypothesis (Pvalue <= alpha)

Note: H0: Movie attendance by year was independent upon ethnicity and
H1: Movie attendance by year was dependent upon ethnicity

Observation:

When a single sample is chosen, the test of independence is used to evaluate whether two variables are independent of or related to each other (Bluman, 2015).
In this task, movie attendance data is provided for two years, broken down by the ethnicity of attendees. As the claim is that movie attendance by year was dependent on ethnicity, I use the chi-square independence test.
After running chisq.test(), $Pvalue = $5.5e-13, which is $\le$ alpha (0.05). I reject the null hypothesis, which means there is enough evidence to support the claim that movie attendance by year is dependent on ethnicity.
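
For the independence test, the expected count of each cell is (row total × column total) / grand total. The sketch below, with illustrative variable names, rebuilds those expected counts and the statistic from the admissions table and checks them against what chisq.test() stores.

```r
# Admissions (in thousands) from the task statement
movies <- matrix(c(724, 335, 174, 107,
                   370, 292, 152, 140),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("2013", "2014"),
                                 c("Caucasian", "Hispanic",
                                   "African_American", "Other")))

# Expected count per cell under independence
expected <- outer(rowSums(movies), colSums(movies)) / sum(movies)

# Chi-square statistic and degrees of freedom (r - 1)(c - 1)
chi_sq  <- sum((movies - expected)^2 / expected)
df      <- (nrow(movies) - 1) * (ncol(movies) - 1)
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

# chisq.test() keeps the same expected counts in its result
all.equal(expected, chisq.test(movies)$expected, check.attributes = FALSE)
```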

Task 1.2_2: Chi-Square Independence Test

This table lists the numbers of officers and enlisted personnel for women in the military. At α = 0.05, is there sufficient evidence to conclude that a relationship exists between rank and branch of the Armed Forces?

Branch	Officers	Enlisted
Army	10,791	62,491
Navy	7,816	42,750
Marine Corps	932	9,525
Air Force	11,819	54,344

#Chi-Square Independence Test#####

#Hypothesis
#H0: Rank achieved is independent of branch of the Armed Forces.
#H1: Rank achieved is dependent on branch of the Armed Forces. (claim)

#Data presented in Matrix

rm1 = c(10791, 62491)
rm2 = c(7816, 42750)
rm3 = c(932, 9525)
rm4 = c(11819, 54344)

#row count
rowno_2 = 4

#matrix 

matrix1_2_2 = matrix(c(rm1, rm2, rm3, rm4), nrow = rowno_2, byrow = TRUE)

#Modifying Matrix col and row names

rownames(matrix1_2_2) = c("Army","Navy","Marine_Corps", "Air_Force")
colnames(matrix1_2_2) = c("Officers", "Enlisted")

##Chi Square Independence test######

CLt1_2_2 = 0.95
aphat1_2_295 = 1 - CLt1_2_2

ch_sqrt1_2_2 = chisq.test(matrix1_2_2)

T1_2testresult_2 = ifelse(ch_sqrt1_2_2$p.value > aphat1_2_295,"Do not Reject the Null Hypothesis (Pvalue > alpha)","Reject the Null Hypothesis (Pvalue <= alpha)") 


##Information Representation#####

T1_2hypores_2 = rbind(CLt1_2_2*100, aphat1_2_295, round(ch_sqrt1_2_2$statistic,3), signif(ch_sqrt1_2_2$p.value,1), T1_2testresult_2)
col_n1_2_2 = c("Hypothesis Result")
row_n1_2_2 = c("Confidence Level(%):", "Alpha:", "X^2 Value:","p value:","Hypothesis Result:")
dimnames(T1_2hypores_2) = list(row_n1_2_2,col_n1_2_2)

kable(
  list(matrix1_2_2,T1_2hypores_2),
        caption = "<center>Chi-Square Independence Test</center>",
        align = "c",
       booktabs = TRUE)%>%
    kable_styling(bootstrap_options = c("hover",
                                        "bordered"),
                font_size = 11) %>%
  scroll_box(width = "100%", height = "100%") %>%
  footnote(general = "H0: Rank achieved is independent of branch of the Armed Forces and \n H1: Rank achieved is dependent of branch of the Armed Forces.")

Chi-Square Independence Test

	Officers	Enlisted
Army	10791	62491
Navy	7816	42750
Marine_Corps	932	9525
Air_Force	11819	54344

	Hypothesis Result
Confidence Level(%):	95
Alpha:	0.05
X^2 Value:	654.272
p value:	2e-141
Hypothesis Result:	Reject the Null Hypothesis (Pvalue <= alpha)

Note: H0: Rank achieved is independent of branch of the Armed Forces and
H1: Rank achieved is dependent of branch of the Armed Forces.

Observation:

In this task, I test whether a relationship exists between rank and branch of the Armed Forces using the chi-square independence test. I stored the data row by row in a matrix to perform the test. chisq.test() gives $Pvalue = $2e-141, which is $\le$ alpha (0.05), so I reject the null hypothesis that rank achieved is independent of branch of the Armed Forces. There is enough evidence to support the claim that rank achieved is dependent on branch of the Armed Forces.
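
With samples this large, even modest differences give a very small p-value, so it can help to look at the proportions directly. The following sketch (illustrative names, same counts as above) shows the share of officers and enlisted personnel within each branch next to the overall officer share that independence would predict for every branch.

```r
# Counts of women in the military from the task statement
women <- matrix(c(10791, 62491,
                   7816, 42750,
                    932,  9525,
                  11819, 54344),
                nrow = 4, byrow = TRUE,
                dimnames = list(c("Army", "Navy", "Marine_Corps", "Air_Force"),
                                c("Officers", "Enlisted")))

# Share of officers and enlisted within each branch (each row sums to 1)
row_share <- prop.table(women, margin = 1)
round(100 * row_share, 1)

# Overall officer share across all branches
overall_officer_share <- sum(women[, "Officers"]) / sum(women)
round(100 * overall_officer_share, 1)
```

Branches whose officer share sits far from the overall share are the ones driving the rejection of independence.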

Task 2: Analysis of Variance (ANOVA)

Task 2.1: One-way ANOVA

The amount of sodium (in milligrams) in one serving for a random sample of three different kinds of foods is listed. At the 0.05 level of significance, is there sufficient evidence to conclude that a difference in mean sodium amounts exists among condiments, cereals, and desserts?

Condiments	Cereals	Desserts
270	260	100
130	220	180
230	290	250
180	290	250
80	200	300
70	320	360
200	140	300
		160

#One-way ANOVA test#####

#Hypothesis
#𝐻0:𝜇1=𝜇2=𝜇3  
#𝐻1: At least one mean is different from others (claim)

##Data Collection####

Condiments = data.frame("Sodium" = c(270, 130, 230, 180, 80, 70, 200), 
                        "Food" = rep("Condiments", 7), stringsAsFactors = FALSE)


Cereals = data.frame("Sodium" = c(260, 220, 290, 290, 200, 320, 140),
                     "Food" = rep("Cereals", 7), stringsAsFactors = FALSE)

Desserts = data.frame("Sodium" = c(100, 180, 250, 250, 300, 360, 300, 160),
                      "Food" = rep("Desserts", 8), stringsAsFactors = FALSE)

Sodium = rbind(Condiments, Cereals, Desserts)
Sodium$Food = as.factor(Sodium$Food)

##ANOVA Test######

anova2_1 = aov(Sodium ~ Food, data = Sodium)

anovasum = summary(anova2_1)

#Hypothesis testing

CLan2_1 = 0.95
alph2_1 = 1 - CLan2_1

anofval = anovasum[[1]][1, "F value"]

DofN = anovasum[[1]][1, "Df"] #k-1
DofD = anovasum[[1]][2, "Df"] #N-k

CV2_1 = qf(alph2_1, DofN, DofD, lower.tail = FALSE) #Critical value 

anovhypores2_1 = ifelse(anofval < CV2_1, "Do not reject the Null Hypothesis (Fvalue < CV)", "Reject the Null Hypothesis (Fvalue > CV)" )

##Information representation######

sodfoodstat <- function(y, uplim = max(Sodium$Sodium) * 1.15) {
  return(data.frame(
    y = 0.95 * uplim,
    label = paste(
      "Count =", length(y), "\n",
      "Mean =", round(mean(y), 2), "\n"
    )
  ))
}

ggplot(Sodium, aes(x = Food, y = Sodium, fill = Food)) + 
  geom_boxplot() +
  stat_summary( 
               fun.data = sodfoodstat, 
               geom = "text", hjust = 0.5, vjust = 0.9, size = 2) +
  labs(
    title = "Sodium Contents of Food",
    caption = "Source: The Doctor's Pocket Calorie, Fat, and Carbohydrate Counter",
    x = "Food",
    y = "Sodium/serving (in milligrams)"
  ) +
  theme_classic()+
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.caption = element_text(size = 5)
  )

hyporesinfo = rbind(CLan2_1*100, alph2_1, round(CV2_1,3), round(anofval,3), anovhypores2_1)
col_n2_1 = c("Hypothesis Result")
row_n2_1 = c("Confidence Level(%):", "Alpha:", "CV:","F value:","Hypothesis Result:")
dimnames(hyporesinfo) = list(row_n2_1, col_n2_1)

kable(hyporesinfo,
        caption = "<center>One way ANOVA Test</center>",
        align = "c",
       booktabs = TRUE) %>%
    kable_styling(bootstrap_options = c("hover",
                                        "bordered"),
                font_size = 11) %>%
  scroll_box(width = "100%", height = "100%") %>%
  footnote(general = "𝐻0:𝜇1=𝜇2=𝜇3 and \n 𝐻1: At least one mean is different from others")

One way ANOVA Test
	Hypothesis Result
Confidence Level(%):	95
Alpha:	0.05
CV:	3.522
F value:	2.399
Hypothesis Result:	Do not reject the Null Hypothesis (Fvalue < CV)

Note: 𝐻0:𝜇1=𝜇2=𝜇3 and
𝐻1: At least one mean is different from others

Observation:

Using sample variances, the one-way analysis of variance test is used to determine whether three or more means are equal. The approach is called analysis of variance (ANOVA) because variances are compared (Bluman, 2015).

$F = \frac{variance\ between\ groups}{variance\ within\ groups}$
The critical value against which the F value is tested is calculated using qf(alpha, DofN, DofD, lower.tail = FALSE), where $\alpha$ is the significance level, $DofN = k-1$ and $DofD = N-k$ are the numerator and denominator degrees of freedom, $k=$ number of groups, and $N=\sum sample\ sizes\ for\ groups$. Because $F$ is a ratio of variances it cannot be negative, and only large values of $F$ count as evidence against the null hypothesis, so the test is right-tailed; lower.tail = FALSE returns the critical value that cuts off an area of $\alpha$ in the right tail of the distribution.
This task checks whether the mean amount of sodium (in milligrams) per serving differs among condiments, cereals, and desserts. After running the one-way ANOVA, I got $Fvalue = $2.399 and a critical value of $CV= $3.522. As $Fvalue < CV$, I do not reject the null hypothesis.

To summarize, there is not enough evidence to support the claim that the mean sodium amounts differ among condiments, cereals, and desserts.
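
The F ratio above can be worked out directly from the between-group and within-group sums of squares. The sketch below does this for the sodium data as an illustration; the variable names are assumptions made for this example, and the result should agree with the F value that aov() reports.

```r
# Sodium data from the task statement
sodium <- data.frame(
  Sodium = c(270, 130, 230, 180, 80, 70, 200,
             260, 220, 290, 290, 200, 320, 140,
             100, 180, 250, 250, 300, 360, 300, 160),
  Food = factor(rep(c("Condiments", "Cereals", "Desserts"), times = c(7, 7, 8)))
)

k <- nlevels(sodium$Food)  # number of groups
N <- nrow(sodium)          # total sample size

grand_mean  <- mean(sodium$Sodium)
group_means <- tapply(sodium$Sodium, sodium$Food, mean)
group_sizes <- tapply(sodium$Sodium, sodium$Food, length)

# Between-group and within-group sums of squares
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
ss_within  <- sum((sodium$Sodium - group_means[as.character(sodium$Food)])^2)

# F = (SS_between / (k - 1)) / (SS_within / (N - k))
f_value  <- (ss_between / (k - 1)) / (ss_within / (N - k))
crit_val <- qf(0.05, k - 1, N - k, lower.tail = FALSE)

round(c(f_value = f_value, crit_val = crit_val), 3)
```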

Task 2.2: One Way ANOVA

The sales in millions of dollars for a year of a sample of leading companies are shown. At α = 0.01, is there a significant difference in the means?

Cereal 578 320 264 249 237

Chocolate_Candy 311 106 109 125 173

Coffee 261 185 302 689

#One-way ANOVA test#####

#Hypothesis
#𝐻0:𝜇1=𝜇2=𝜇3  
#𝐻1: At least one mean is different from others (claim)

##Data Collection####

Cereal = data.frame("Sales" = c(578, 320, 264, 249, 237), "Food" = rep("Cereal",5), stringsAsFactors = FALSE)

Choco_candy = data.frame("Sales" = c(311, 106, 109, 125, 173), "Food" = rep("Chocolate_candy", 5), stringsAsFactors = FALSE)

Coffee = data.frame("Sales" = c(261, 185, 302, 689), "Food" = rep("Coffee", 4), stringsAsFactors = FALSE)


Sales = rbind(Cereal, Choco_candy, Coffee) 
Sales$Food = as.factor(Sales$Food) 

##ANOVA Test#####

ano2_2_1 = aov(Sales ~ Food, data = Sales)
 
anosumm = summary(ano2_2_1)

#Hypothesis Testing

CLan2_2_1 = 0.99
alph2_2_1 = 1 - CLan2_2_1

anofval2_1 = anosumm[[1]][1, "F value"]

DofN2_1 = anosumm[[1]][1, "Df"] #between the group k - 1
DofD2_1 = anosumm[[1]][2, "Df"] #within the group N-k

CV2_2_1 = qf(alph2_2_1, DofN2_1, DofD2_1, lower.tail = FALSE)

anovhypores2_2_1 = ifelse(anofval2_1 < CV2_2_1, "Do not reject the Null Hypothesis (Fvalue < CV)", "Reject the Null Hypothesis (Fvalue > CV)" )

##Information representation######

salesfoodstat <- function(y, uplim = max(Sales$Sales) * 1.15) {
  return(data.frame(
    y = 0.95 * uplim,
    label = paste(
      "Count =", length(y), "\n",
      "Mean =", round(mean(y), 2), "\n"
    )
  ))
}

ggplot(Sales, aes(x = Food, y = Sales, fill = Food)) + 
  geom_boxplot() +
  stat_summary( 
               fun.data = salesfoodstat, 
               geom = "text", hjust = 0.5, vjust = 0.9, size = 2) +
  labs(
    title = "Sales for Leading Companies",
    caption = "Source: Information Resources, Inc",
    x = "Food",
    y = "Sales (millions of USD)"
  ) +
  theme_classic()+
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.caption = element_text(size = 5)
  )

hyporesinfo_1 = rbind(CLan2_2_1*100, alph2_2_1, round(CV2_2_1,3), round(anofval2_1,3), anovhypores2_2_1)
col_n2_1_1 = c("Hypothesis Result")
row_n2_1_1 = c("Confidence Level(%):", "Alpha:", "CV:","F value:","Hypothesis Result:")
dimnames(hyporesinfo_1) = list(row_n2_1_1, col_n2_1_1)

kable(hyporesinfo_1,
        caption = "<center>One way ANOVA Test</center>",
        align = "c",
       booktabs = TRUE) %>%
    kable_styling(bootstrap_options = c("hover",
                                        "bordered"),
                font_size = 11) %>%
  scroll_box(width = "100%", height = "100%") %>%
  footnote(general = "𝐻0:𝜇1=𝜇2=𝜇3 and \n 𝐻1: At least one mean is different from others")

One way ANOVA Test
	Hypothesis Result
Confidence Level(%):	99
Alpha:	0.01
CV:	7.206
F value:	2.172
Hypothesis Result:	Do not reject the Null Hypothesis (Fvalue < CV)

Note: 𝐻0:𝜇1=𝜇2=𝜇3 and
𝐻1: At least one mean is different from others

Observation:

In this task, I run a one-way ANOVA to test whether mean sales differ among food categories for a sample of leading companies. Samples of Cereal, Chocolate_candy, and Coffee companies were taken, and the figures are annual sales in millions of dollars. I test the means at the 0.01 significance level.
I used the aov() function to run the ANOVA test and extracted the F value, DofN, and DofD from the summary of aov(). I then calculated the critical value and compared it with the F value to check whether the means of the food categories differ. As $Fvalue < CV $, I do not reject the null hypothesis.
To summarize, there is not enough evidence to support the claim that the mean sales of the food categories are different.
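
The critical-value rule used here is equivalent to comparing the p-value with alpha. As a small sketch with illustrative variable names, both rules can be applied to the sales data side by side; the "Pr(>F)" column is the p-value that summary() of an aov() fit already provides.

```r
# Sales (in millions of dollars) from the task statement
sales <- data.frame(
  Sales = c(578, 320, 264, 249, 237,
            311, 106, 109, 125, 173,
            261, 185, 302, 689),
  Food = factor(rep(c("Cereal", "Chocolate_candy", "Coffee"), times = c(5, 5, 4)))
)
alpha <- 0.01

tab     <- summary(aov(Sales ~ Food, data = sales))[[1]]
f_value <- tab[1, "F value"]
p_value <- tab[1, "Pr(>F)"]

# Right-tail critical value with (k - 1, N - k) degrees of freedom
crit_val <- qf(alpha, tab[1, "Df"], tab[2, "Df"], lower.tail = FALSE)

# Decision by critical value and by p-value
reject_by_cv <- f_value > crit_val
reject_by_p  <- p_value < alpha
identical(reject_by_cv, reject_by_p)
```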

Task 2.3: One Way ANOVA

The expenditures (in dollars) per pupil for states in three sections of the country are listed. Using α = 0.05, can you conclude that there is a difference in means?

Eastern_third 4946, 5953, 6202, 7243, 6113

Middle_third 6149, 7451, 6000, 6479

Western_third 5282, 8605, 6528, 6911

#One Way ANOVA test#####

#Hypothesis
#𝐻0:𝜇1=𝜇2=𝜇3  
#𝐻1: At least one mean is different from others (claim)

##Data Collection####

Eastern = data.frame("Expenditure" = c(4946, 5953, 6202, 7243, 6113), 
                     "States" = rep("Eastern", 5), stringsAsFactors = FALSE)

Middle = data.frame("Expenditure" = c(6149, 7451, 6000, 6479), 
                    "States" = rep("Middle", 4), stringsAsFactors = FALSE)

Western = data.frame("Expenditure" = c(5282, 8605, 6528, 6911),
                     "States" = rep("Western", 4), stringsAsFactors = FALSE)

Expenditure = rbind(Eastern, Middle, Western)
Expenditure$States = as.factor(Expenditure$States)

##ANOVA Test######

anova2_2_2 = aov(Expenditure ~ States, data = Expenditure)

anovasum2_2_2 = summary(anova2_2_2)

#Hypothesis Testing

CLan2_2_2 = 0.95
alph2_2_2 = 1 - CLan2_2_2

anofval2_2 = anovasum2_2_2[[1]][1, "F value"]

DofN2_2 = anovasum2_2_2[[1]][1, "Df"] #between the group k - 1
DofD2_2 = anovasum2_2_2[[1]][2, "Df"] #within the group N-k

CV2_2_2 = qf(alph2_2_2, DofN2_2, DofD2_2, lower.tail = FALSE)

anovhypores2_2_2 = ifelse(anofval2_2 < CV2_2_2, "Do not reject the Null Hypothesis (Fvalue < CV)", "Reject the Null Hypothesis (Fvalue > CV)" )

##Information representation######

stateexpenstat <- function(y, uplim = max(Expenditure$Expenditure) * 1.15) {
  return(data.frame(
    y = 0.95 * uplim,
    label = paste(
      "Count =", length(y), "\n",
      "Mean =", round(mean(y), 2), "\n"
    )
  ))
}

ggplot(Expenditure, aes(x = States, y = Expenditure, fill = States)) + 
  geom_boxplot() +
  stat_summary( 
               fun.data = stateexpenstat, 
               geom = "text", hjust = 0.5, vjust = 0.9, size = 2) +
  labs(
    title = "The expenditures per pupil (in dollars)",
    caption = "Source: New York Times Almanac",
    x = "States",
    y = "Expenditure (USD)/ pupil"
  ) +
  theme_classic()+
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.caption = element_text(size = 5)
  )

hyporesinfo_2 = rbind(CLan2_2_2*100, alph2_2_2, round(CV2_2_2,3), round(anofval2_2,3), anovhypores2_2_2)
col_n2_1_2 = c("Hypothesis Result")
row_n2_1_2 = c("Confidence Level(%):", "Alpha:", "CV:","F value:","Hypothesis Result:")
dimnames(hyporesinfo_2) = list(row_n2_1_2, col_n2_1_2)

kable(hyporesinfo_2,
        caption = "<center>One way ANOVA Test</center>",
        align = "c",
       booktabs = TRUE) %>%
    kable_styling(bootstrap_options = c("hover",
                                        "bordered"),
                font_size = 11) %>%
  scroll_box(width = "100%", height = "100%") %>%
  footnote(general = "𝐻0:𝜇1=𝜇2=𝜇3 and \n 𝐻1: At least one mean is different from others")

One way ANOVA Test
	Hypothesis Result
Confidence Level(%):	95
Alpha:	0.05
CV:	4.103
F value:	0.649
Hypothesis Result:	Do not reject the Null Hypothesis (Fvalue < CV)

Note: 𝐻0:𝜇1=𝜇2=𝜇3 and
𝐻1: At least one mean is different from others

Observation:

In this task, expenditure per pupil (in USD) is provided for states in three sections of the country: Eastern, Middle, and Western. I run a one-way ANOVA to check whether the means of the three sections differ. I use aov() for the ANOVA test, and the F value from this test is compared with the critical value.
Since $Fvalue < CV$, I do not reject the null hypothesis.

To summarize, there is not enough evidence to support the claim that there is a difference in means.
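
The three one-way ANOVA tasks repeat the same steps: fit aov(), pull the F value and degrees of freedom from the summary, compute the critical value, and compare. One possible way to avoid the repetition is a small helper; one_way_decision() below is a hypothetical function written for this illustration, not part of the original script or of any package.

```r
# Hypothetical helper: one-way ANOVA plus the critical-value decision rule
one_way_decision <- function(formula, data, alpha = 0.05) {
  tab      <- summary(aov(formula, data = data))[[1]]
  f_value  <- tab[1, "F value"]
  # Row 1 holds the between-group Df (k - 1), row 2 the within-group Df (N - k)
  crit_val <- qf(alpha, tab[1, "Df"], tab[2, "Df"], lower.tail = FALSE)
  list(f_value = f_value, crit_val = crit_val, reject = f_value > crit_val)
}

# Expenditure per pupil from the task statement
expenditure <- data.frame(
  Expenditure = c(4946, 5953, 6202, 7243, 6113,
                  6149, 7451, 6000, 6479,
                  5282, 8605, 6528, 6911),
  States = factor(rep(c("Eastern", "Middle", "Western"), times = c(5, 4, 4)))
)

res <- one_way_decision(Expenditure ~ States, data = expenditure, alpha = 0.05)
str(res)
```

The same call with a different formula, data frame, and alpha would cover the sodium and sales tasks.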

Task 3: Two-way ANOVA

#Two way ANOVA####

#Hypothesis 

#H0: There is no interaction effect between type of light used and type of
#plant food used on plant growth
#H1: There is an interaction effect between type of light used and type of
#plant food used on plant growth.

#The hypotheses for the growth light types are
#H0: There is no difference between the means of plant growth for
#two types of light
#H1: There is a difference between the means of plant growth for
#two types of light.


#Data Collection and representation

PlantGrowth = data.frame(PlantGrowth)               

PlantGrowth$Plant.Food = factor(PlantGrowth$Plant.Food)
PlantGrowth$light = factor(PlantGrowth$light)

ggplot(data = PlantGrowth, aes(x = Plant.Food, y = Growth, colour = light)) + 
  geom_boxplot()

##TWO Way ANOVA test#####

Anovatwoway = aov(Growth ~ Plant.Food * light, data = PlantGrowth)

Anovatwoway_summary = summary(Anovatwoway)

plot(TukeyHSD(Anovatwoway, conf.level=.95), las = 2)

##Hypothesis testing#####

CLtwoANOVA = 0.95
alphtwoANOVA = 1 - CLtwoANOVA

ANOVAFvalinteraction = Anovatwoway_summary[[1]][3, "F value"] #Fvalue of interaction
ANOVAFvallight = Anovatwoway_summary[[1]][2, "F value"] #Fvalue of light

DofNlight = Anovatwoway_summary[[1]][2, "Df"] #a-1, a is levels in light
DofDlight = Anovatwoway_summary[[1]][4, "Df"] #ab(n-1), b is levels in food, n is the number of data values in each group
DofNinteraction = Anovatwoway_summary[[1]][3, "Df"] #(a-1)(b-1)
DofDinteraction = Anovatwoway_summary[[1]][4, "Df"] #ab(n-1)

CVinteraction = qf(alphtwoANOVA, DofNinteraction, DofDinteraction, lower.tail = FALSE)
CVlight = qf(alphtwoANOVA, DofNlight, DofDlight, lower.tail = FALSE)

hypotwoANOVA = ifelse(ANOVAFvalinteraction < CVinteraction, "Do not reject the null hypothesis (Fvalue < CV)", "Reject the null hypothesis (Fvalue > CV)")

hypotwoANOVAlight = ifelse(ANOVAFvallight < CVlight, "Do not reject the null hypothesis (Fvalue < CV)", "Reject the null hypothesis (Fvalue > CV)")


##Information Representation#####
plantgrost <- 
  PlantGrowth %>% 
  group_by(light, Plant.Food) %>% # group by the two factors
  summarise(Means = mean(Growth), SEs = sd(Growth)/sqrt(n())) # Mean and Std Er
ggplot(plantgrost, 
       aes(x = light, y = Means, fill = Plant.Food,
           ymin = Means - SEs, ymax = Means + SEs)) +
  # this adds the mean
  geom_col(position = position_dodge()) +
  # this adds the error bars
  geom_errorbar(position = position_dodge(0.9), width=.2) +
  # controlling the appearance
  xlab("Growth Light") + ylab("Plant Growth (in inches)")

hyporestwoANOVA = rbind(CLtwoANOVA, alphtwoANOVA, round(CVinteraction,3), round(ANOVAFvalinteraction,3), hypotwoANOVA)
col_twoANOVA = c("Hypothesis Result")
row_twoANOVA = c("Confidence Level(%):", "Alpha:", "CV:","F value:","Hypothesis Result:")
dimnames(hyporestwoANOVA) = list(row_twoANOVA, col_twoANOVA)

kable(hyporestwoANOVA,
        caption = "<center>Two way ANOVA interaction Test</center>",
        align = "c",
       booktabs = TRUE) %>%
    kable_styling(bootstrap_options = c("hover",
                                        "bordered"),
                font_size = 11) %>%
  scroll_box(width = "100%", height = "100%") %>%
  footnote(general = "𝐻0: There is no interaction effect between type of light used and type of plat food used on plant growth and \n 𝐻1: There is an interaction effect between type of light used and type of plat food used on plant growth.")

Two way ANOVA interaction Test
	Hypothesis Result
Confidence Level(%):	0.95
Alpha:	0.05
CV:	5.318
F value:	1.438
Hypothesis Result:	Do not reject the null hypothesis (Fvalue < CV)

Note: 𝐻0: There is no interaction effect between type of light used and type of plant food used on plant growth and
𝐻1: There is an interaction effect between type of light used and type of plant food used on plant growth.

hyporestwoANOVAlight = rbind(CLtwoANOVA, alphtwoANOVA, round(CVlight,3), round(ANOVAFvallight,3), hypotwoANOVAlight)
col_twoANOVAlight = c("Hypothesis Result")
row_twoANOVAlight = c("Confidence Level(%):", "Alpha:", "CV:","F value:","Hypothesis Result:")
dimnames(hyporestwoANOVAlight) = list(row_twoANOVAlight, col_twoANOVAlight)

kable(hyporestwoANOVAlight,
        caption = "<center>Two way ANOVA mean of growth to light Test</center>",
        align = "c",
       booktabs = TRUE) %>%
    kable_styling(bootstrap_options = c("hover",
                                        "bordered"),
                font_size = 11) %>%
  scroll_box(width = "100%", height = "100%") %>%
  footnote(general = "H0: There is no difference between the means of plant growth for two types of light and \n H1: There is a difference between the means of plant growth for two types of light.")

Two way ANOVA mean of growth to light Test
	Hypothesis Result
Confidence Level(%):	0.95
Alpha:	0.05
CV:	5.318
F value:	3.681
Hypothesis Result:	Do not reject the null hypothesis (Fvalue < CV)

Note: H0: There is no difference between the means of plant growth for two types of light and
H1: There is a difference between the means of plant growth for two types of light.

Observation:

In this task, I compare the interaction between growth lights and plant foods. Plant foods differ in the nutrition they provide, and exposure to light supports healthy growth. However, it could be the case that the effect on plant growth is not significant for a given combination of light and food. To test this, the alternative hypothesis is that an interaction between plant food and growth light exists.

In this case, the two independent variables are Growth Light and Plant Food, and plant growth is the dependent variable. I calculate the F value and the critical value to test each hypothesis. aov(Growth ~ Plant.Food * light) provides the F values for the Food, Light, and Food*light (interaction) terms. The critical value is calculated with qf(), and the degrees of freedom for the Light and interaction terms are extracted from the summary of aov(). The degrees of freedom are given by the following formulas:

The Light variable has 2 levels (a = 2) and the Food variable has 2 levels (b = 2), with 3 data values in each group (n = 3).
Light:
DoFN = (a-1) = 1 and DoFD = ab(n-1) = 8
Interaction: Food*light
DoFN = (a-1)(b-1) = 1 and DoFD = ab(n-1) = 8
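
As a quick check of these formulas, the following is a minimal sketch that assumes a = 2, b = 2, n = 3 and alpha = 0.05 as stated in this task. The variable names are illustrative and are not the ones used in the analysis code.

```r
# Illustrative names; a, b, n and alpha are taken from the text of this task
a <- 2      # levels of Light
b <- 2      # levels of Plant Food
n <- 3      # data values in each group
alpha <- 0.05

dof_n_light <- a - 1              # numerator DoF for the Light main effect
dof_n_inter <- (a - 1) * (b - 1)  # numerator DoF for the interaction
dof_d       <- a * b * (n - 1)    # denominator (within-group) DoF

# Upper-tail critical values of the F distribution
cv_light <- qf(alpha, dof_n_light, dof_d, lower.tail = FALSE)
cv_inter <- qf(alpha, dof_n_inter, dof_d, lower.tail = FALSE)

round(c(light = cv_light, interaction = cv_inter), 3)
```

Both tests share the same degrees of freedom (1 and 8), which is why the two result tables report the same critical value.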

After calculating the critical values for the Light and interaction terms, each hypothesis is tested individually. As the critical value is greater than the F value for both Light and the interaction, I do not reject either null hypothesis.

To summarize, there is not enough evidence to reject the null hypotheses. It can be concluded that there is no significant interaction between the type of growth light and the type of plant food, and no significant difference in mean plant growth between the two types of light.
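
The extraction of F values and degrees of freedom from the aov() summary can be sketched on a small example. This is only an illustration: the data frame `toy` and its Growth values are made up (a balanced 2 x 2 design with 3 values per cell), and the rows are looked up by name rather than by position.

```r
# Hypothetical balanced 2 x 2 design, n = 3 per cell; Growth values are made up
toy <- data.frame(
  light      = factor(rep(c("Light1", "Light2"), each = 6)),
  Plant.Food = factor(rep(rep(c("FoodA", "FoodB"), each = 3), times = 2)),
  Growth     = c(4.1, 4.5, 4.3, 5.0, 5.2, 4.8, 3.9, 4.2, 4.0, 5.5, 5.1, 5.3)
)

fit <- aov(Growth ~ Plant.Food * light, data = toy)
tab <- summary(fit)[[1]]

# summary() pads the row names with spaces, so trim them before indexing by name
rownames(tab) <- trimws(rownames(tab))

f_inter <- tab["Plant.Food:light", "F value"]  # F value of the interaction
df_n    <- tab["Plant.Food:light", "Df"]       # (a-1)(b-1)
df_d    <- tab["Residuals", "Df"]              # ab(n-1)

cv <- qf(0.05, df_n, df_d, lower.tail = FALSE)
decision <- if (f_inter < cv) "Do not reject H0" else "Reject H0"
decision
```

Indexing by row name avoids depending on the order of terms in the model formula.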

Task 3:

#EDA#####
Task3_data = data.frame(baseball_1)

describe(Task3_data) %>%
  kable(caption = "<center>Descriptive Statistics</center>") %>%
  kable_styling("bordered")

Descriptive Statistics
	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
Team*	1	1232	18.9285714	10.6140364	20.000	18.7576065	13.3434000	1.000	39.000	38.000	0.0628129	-1.2525294	0.3023954
League*	2	1232	1.5000000	0.5002030	1.500	1.5000000	0.7413000	1.000	2.000	1.000	0.0000000	-2.0016227	0.0142509
Year	3	1232	1988.9577922	14.8196251	1989.000	1989.3184584	19.2738000	1962.000	2012.000	50.000	-0.1515595	-1.2077412	0.4222133
RS	4	1232	715.0819805	91.5342940	711.000	713.3387424	90.4386000	463.000	1009.000	546.000	0.1740832	-0.0301863	2.6078252
RA	5	1232	715.0819805	93.0799326	709.000	712.4371197	91.9212000	472.000	1103.000	631.000	0.2978360	-0.0205915	2.6518607
W	6	1232	80.9042208	11.4581390	81.000	81.1206897	11.8608000	40.000	116.000	76.000	-0.1814238	-0.3070528	0.3264440
OBP	7	1232	0.3263312	0.0150128	0.326	0.3262586	0.0148260	0.277	0.373	0.096	0.0175923	0.0574876	0.0004277
SLG	8	1232	0.3973417	0.0332669	0.396	0.3970903	0.0340998	0.301	0.491	0.190	0.0541978	-0.3250999	0.0009478
BA	9	1232	0.2592727	0.0129072	0.260	0.2593864	0.0133434	0.214	0.294	0.080	-0.1109140	-0.0002138	0.0003677
Playoffs	10	1232	0.1980519	0.3986934	0.000	0.1227181	0.0000000	0.000	1.000	1.000	1.5134587	0.2907952	0.0113588
RankSeason	11	244	3.1229508	1.7383492	3.000	2.9744898	1.4826000	1.000	8.000	7.000	0.5560350	-0.5778485	0.1112864
RankPlayoffs	12	244	2.7172131	1.0952342	3.000	2.7602041	1.4826000	1.000	5.000	4.000	-0.2688894	-1.1199065	0.0701152
G	13	1232	161.9188312	0.6243652	162.000	161.9492901	0.0000000	158.000	165.000	7.000	-1.0420881	6.9746969	0.0177883
OOBP	14	420	0.3322643	0.0152953	0.331	0.3319345	0.0163086	0.294	0.384	0.090	0.1943470	-0.3710080	0.0007463
OSLG	15	420	0.4197429	0.0265096	0.419	0.4194405	0.0266868	0.346	0.499	0.153	0.1176197	-0.2148956	0.0012935

DescTools::Desc(Task3_data$W)

## ------------------------------------------------------------------------------ 
## Task3_data$W (numeric)
## 
##   length       n    NAs  unique     0s   mean  meanCI'
##    1'232   1'232      0      63      0  80.90   80.26
##           100.0%   0.0%           0.0%          81.54
##                                                      
##      .05     .10    .25  median    .75    .90     .95
##    62.00   66.00  73.00   81.00  89.00  95.90   98.00
##                                                      
##    range      sd  vcoef     mad    IQR   skew    kurt
##    76.00   11.46   0.14   11.86  16.00  -0.18   -0.31
##                                                      
## lowest : 40.0, 43.0, 50.0, 51.0 (2), 52.0 (2)
## highest: 106.0, 108.0 (3), 109.0, 114.0, 116.0
## 
## ' 95%-CI (classic)

Task3_data %>%
  mutate(Team = fct_reorder(Team, W, .fun='length' )) %>%
  ggplot( aes(x=Team, y=W, fill=Team)) + 
    geom_boxplot(width = 0.4) +
    xlab("Team") +
  ylab("Wins")+
    theme(legend.position="none",
          axis.text = element_text(size = 6))

#Decade winnings#####

Task3_data  %<>% 
      mutate(Decade = as.numeric(Year) - as.numeric(Year) %% 10)

win_decade = Task3_data %>%
  group_by(Decade)

win_decade %>%
  ggplot(aes(x = Decade)) + 
    geom_bar() +
  ggtitle("Win Count by Decade")+
    xlab("Decade")+
  theme(plot.title = element_text(hjust = 0.5))

#Chi-Square Wins by Decade####

winsbydecade = chisq.test(x=win_decade$W)

##Hypothesis testing######
# 𝐻0: The number of wins per decade is the same and
# 𝐻1: The number of wins per decade is not the same (claim)

CLt3 = 0.95
alpht3 = 1 - CLt3

t3_hypores = ifelse(winsbydecade$p.value > alpht3, "Do not Reject the Null Hypothesis (Pvalue > Alpha )", "Reject the Null Hypothesis (Pvalue <= Alpha)")

##Information representation#####

hyporesinfo_t3 = rbind(CLt3, alpht3, round(winsbydecade$p.value,40), t3_hypores)
col_nt3 = c("Hypothesis Result")
row_nt3 = c("Confidence Level(%):", "Alpha:", "P value:","Hypothesis Result:")
dimnames(hyporesinfo_t3) = list(row_nt3, col_nt3)

kable(hyporesinfo_t3,
        caption = "<center>Wins per Decade Hypothesis testing(Chi-Square Goodness of Fit)</center>",
        align = "c",
       booktabs = TRUE) %>%
    kable_styling(bootstrap_options = c("hover",
                                        "bordered"),
                font_size = 11) %>%
  scroll_box(width = "100%", height = "100%") %>%
  footnote(general = "𝐻0: The number of wins per decade is the same and \n 𝐻1: The number of wins per decade is not the same")

Wins per Decade Hypothesis testing(Chi-Square Goodness of Fit)
	Hypothesis Result
Confidence Level(%):	0.95
Alpha:	0.05
P value:	2.2e-39
Hypothesis Result:	Reject the Null Hypothesis (Pvalue <= Alpha)

Note: 𝐻0: The number of wins per decade is the same and
𝐻1: The number of wins per decade is not the same

Observation:

Baseball data is studied in this task. describe() is used to understand the data: its descriptive statistics show the variable types and the number of observations per variable. Furthermore, DescTools::Desc() provides more detail on the Wins variable (W). From the chart and the descriptive statistics, W appears to be approximately normally distributed. Moreover, the mean of W is 80.90, with a 95% confidence interval of 80.26 to 81.54.
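
The reported interval can be reproduced from the summary statistics alone. The sketch below assumes that the "classic" interval shown by Desc() is the usual t-based interval, mean ± t · sd / sqrt(n). The inputs are copied from the W row of the descriptive statistics table, and the variable names are illustrative.

```r
# Values copied from the W row of the descriptive statistics table
m <- 80.9042208   # mean
s <- 11.4581390   # standard deviation
n <- 1232         # number of observations

se     <- s / sqrt(n)              # standard error of the mean
t_crit <- qt(0.975, df = n - 1)    # two-sided 95% critical value

ci <- c(lower = m - t_crit * se, upper = m + t_crit * se)
round(ci, 2)
```

The standard error computed here also matches the se column reported for W by describe().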

Because the test concerns wins by decade, I first plotted a boxplot to get an overview of wins by team. The x-axis shows the teams and the y-axis the number of wins. The boxplot helps to review the distribution and identify outliers; the chart shows that there are outliers in the data.

Because the hypothesis test is about the difference in wins per decade, I derived a Decade column from the Year variable. For the Chi-Square Goodness-of-Fit test, I used chisq.test() on the W column of the decade-grouped data. The bar chart gives an overview of the number of observations per decade and suggests that the decades differ. The Chi-Square test is used to test the hypothesis. As $Pvalue \le \alpha$, I reject the null hypothesis. To summarize, there is enough evidence to support the claim that the number of wins differs by decade.
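
One way to make the decade comparison explicit is to total the wins per decade before running the goodness-of-fit test. This is a minimal sketch, not the analysis run above: the data frame `toy` and its values are made up, and equal expected totals across decades are assumed (the default of chisq.test()).

```r
# Hypothetical team-season records; Year and W values are made up
toy <- data.frame(
  Year = c(1962, 1968, 1975, 1979, 1984, 1989, 1993, 1999, 2004, 2012),
  W    = c(80, 95, 70, 88, 101, 76, 85, 90, 66, 94)
)

# Floor each year to the start of its decade, e.g. 1968 -> 1960
toy$Decade <- toy$Year - toy$Year %% 10

# Total wins per decade: one observed count per category
agg <- aggregate(W ~ Decade, data = toy, FUN = sum)

# Goodness-of-fit test against equal expected totals per decade
gof <- chisq.test(x = agg$W)
gof$parameter  # degrees of freedom = number of decades - 1
gof$p.value
```

When decades contain different numbers of seasons or teams, chisq.test() also accepts a `p` argument so that the expected proportions can reflect the games played in each decade.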

References:

Bluman, A. (2014). Elementary statistics: A step by step approach. McGraw-Hill Education.
Childs, D., Hindle, B., & Warren, P. (2021). Two-way ANOVA in R. https://dzchilds.github.io/stats-for-bio/two-way-anova-in-r.html
Kabacoff, R. (2015). R in Action. Manning Publications Co.
Larson, M. (2008). Analysis of Variance. AHA Journals. https://doi.org/10.1161/CIRCULATIONAHA.107.654335
University of Southampton. (n.d.). Chi Square. https://www.southampton.ac.uk/passs/full_time_education/bivariate_analysis/chi_square
Zach. (2020). How to Find the Chi-Square Critical Value in R. Statology. https://www.statology.org/chi-square-critical-value-r/

Regression Diagnostics with R

Nikhil Deshpande

2023-02-02