Introduction

Chi-square and ANOVA are two widely used statistical methods for analyzing data and testing hypotheses in different scenarios. In this assignment, I apply chi-square and ANOVA techniques to address a series of questions. A total of 10 sets of questions are investigated using these two methods. The first 8 are solved with the chi-square goodness-of-fit test, the chi-square test of independence, one-way ANOVA, and two-way ANOVA. The remaining two explore a team baseball dataset spanning 1962 to 2012 and a crop dataset, where a two-way ANOVA is used to investigate whether fertilizer and planting density affect crop yield.

Analysis

Section 11-1.6 (Blood Types)

# --------- Section 11-1 ---------
# Question 6
alpha_1 <- 0.10
blood_type <- data.frame(blood_type = c('A','B','O', 'AB'),
                         observed = c(12,8,24,6),
                         expected = c(0.2,0.28,0.36,0.16))
blood_type

a. Hypothesis

(\(H_0\)): Hospital patients in a large hospital have the same blood type distribution as the general population: type A, 20%; type B, 28%; type O, 36%; and type AB, 16% (the claim).

(\(H_1\)): The distribution is not the same as stated in the null hypothesis.

b. Critical Value

# degrees of freedom for a goodness-of-fit test = k - 1, where k is the number of categories
df = 4 - 1
cv <- qchisq(p = alpha_1, df = df, lower.tail = FALSE)
cv # critical value is 6.251389
## [1] 6.251389

Given 3 degrees of freedom and \(\alpha\) = 0.10, the critical value is 6.251.

c. Test Statistic

# run the chi-square test
result_1 <- chisq.test(x = blood_type$observed, p=blood_type$expected)
result_1
## 
##  Chi-squared test for given probabilities
## 
## data:  blood_type$observed
## X-squared = 5.4714, df = 3, p-value = 0.1404

The test statistic is 5.4714 and the p-value is 0.1404.

d. Make decision

ifelse(result_1$statistic > cv, 'Reject the null hypothesis',
       'Fail to reject the null hypothesis')
##                            X-squared 
## "Fail to reject the null hypothesis"

Since the test statistic (5.4714) is less than the critical value (6.251), the decision is to not reject the null hypothesis.

e. Summarize

Since we fail to reject the null hypothesis, there is not enough evidence to reject the claim that hospital patients in a large hospital have the same blood type distribution as those in the general population.
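For reference, the test statistic can be reproduced by hand from the observed counts and hypothesized proportions defined above; this short sketch uses only objects already created in this section.

# expected counts are n * p_i; the statistic sums (O - E)^2 / E over the categories
expected_counts <- sum(blood_type$observed) * blood_type$expected
sum((blood_type$observed - expected_counts)^2 / expected_counts) # 5.4714, matches result_1
result_1$expected # chisq.test stores the same expected counts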

Section 11-1.8 (On-Time Performance by Airlines)

airline <- data.frame(Action = c('On time','National Aviation System delay',
                                  'Aircraft arriving late','Other'),
                      Expected = c(0.708, 0.082, 0.09, 0.12),
                      Observed = c(125, 10, 25, 40))
airline

a. Hypothesis

(\(H_0\)): The airline's on-time performance is the same as the government's statistics.

(\(H_1\)): The airline's on-time performance is not the same as the government's statistics (the claim).

b. Critical value

# degrees of freedom for a goodness-of-fit test = k - 1
df = 4 - 1
cv_2 <- qchisq(p = 0.05, df = df, lower.tail = FALSE)
cv_2
## [1] 7.814728

Given 3 degrees of freedom and \(\alpha\) = 0.05, the critical value is 7.815.

c. Test Statistic

# Test Statistics
result_2 <- chisq.test(airline$Observed, p= airline$Expected)
result_2
## 
##  Chi-squared test for given probabilities
## 
## data:  airline$Observed
## X-squared = 17.832, df = 3, p-value = 0.0004763

The results indicate that the test statistic is 17.832 and the p-value is 0.0004763.

d. Decision

ifelse(result_2$statistic > cv_2, 'Reject the null hypothesis',
       'Fail to reject the null hypothesis')
##                    X-squared 
## "Reject the null hypothesis"

Since the test statistic (17.832) is greater than the critical value (7.815), the decision is to reject the null hypothesis.

e. Summarize

There is enough evidence to support the claim that the airline's on-time performance differs from the government's statistics.
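The same decision can also be made from the p-value returned by chisq.test; a minimal sketch using the result_2 object above:

# reject the null hypothesis when the p-value is below alpha (0.05)
ifelse(result_2$p.value < 0.05, 'Reject the null hypothesis',
       'Fail to reject the null hypothesis')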

Section 11-2.8 (Ethnicity and Movie Admissions)

admission <- matrix(c(724, 335, 174, 107, 370, 292, 152, 140), ncol = 4, byrow = TRUE)
colnames(admission) <- c('Caucasian','Hispanic','African American','Other')
rownames(admission) <- c(2013, 2014)
admission <- as.table(admission)
admission
##      Caucasian Hispanic African American Other
## 2013       724      335              174   107
## 2014       370      292              152   140

a. Hypothesis

(\(H_0\)): Movie attendance by year is independent of the ethnicity of moviegoers.

(\(H_1\)): Movie attendance by year depends on the ethnicity of moviegoers (the claim).

b. Critical Value

# find the degree of freedom = (R-1)*(C-1)
df = (2-1)*(4-1)
cv_3 <- qchisq(p = 0.05, df = df, lower.tail = FALSE)
cv_3
## [1] 7.814728

Given \(\alpha\) = 0.05 and 3 degrees of freedom, the critical value is 7.815.

c. Test Statistic

result_3 <- chisq.test(admission)
result_3
## 
##  Pearson's Chi-squared test
## 
## data:  admission
## X-squared = 60.144, df = 3, p-value = 5.478e-13

The test-statistic is 60.144.

d. Decision

# Decision
ifelse(result_3$statistic > cv_3, 'Reject the null hypothesis',
       'Fail to reject the null hypothesis')
##                    X-squared 
## "Reject the null hypothesis"

The decision is to reject the null hypothesis since the test statistic (60.144) is greater than the critical value (7.815).

e. Summarize the result

Since the null hypothesis is rejected, we conclude that there is sufficient evidence to support the claim that movie attendance by year is dependent upon the ethnicity of moviegoers.
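To illustrate how the independence test works, the expected frequencies ((row total × column total) / grand total) can be pulled from the fitted object or computed directly from the table margins; a brief sketch using the objects above:

# expected frequencies under independence
result_3$expected
# the same values computed directly from the margins of the table
outer(rowSums(admission), colSums(admission)) / sum(admission)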

Section 11-2.10 (Women in the Military)

#-- Question 10 
military <- matrix(c(10791, 62491, 7816, 42750, 932, 9525, 11819, 54344), 
                   ncol = 2, byrow = TRUE)
colnames(military) <- c('Officers','Enlisted')
rownames(military) <- c('Army','Navy','Marine Corps','Air Force')
women_in_mil <- as.table(military)
women_in_mil
##              Officers Enlisted
## Army            10791    62491
## Navy             7816    42750
## Marine Corps      932     9525
## Air Force       11819    54344

a. Hypothesis

(\(H_0\)): Military rank of women personnel is independent of the military branch of service.

(\(H_1\)): Military rank of women personnel depends on the military branch of service (the claim).

b. Critical Value

# find the degree of freedom = (R-1)*(C-1)
df = (4-1)*(2-1)
cv_4 <- qchisq(p = 0.05, df = df, lower.tail = FALSE)
cv_4
## [1] 7.814728

The critical value is 7.815 at \(\alpha\) = 0.05 with 3 degrees of freedom.

c. Test Statistic

# test statistic
result_4 <- chisq.test(women_in_mil)
result_4
## 
##  Pearson's Chi-squared test
## 
## data:  women_in_mil
## X-squared = 654.27, df = 3, p-value < 2.2e-16

The test value is 654.27.

d. Decision

# Decision
ifelse(result_4$statistic > cv_4, 'Reject the null hypothesis',
       'Fail to reject the null hypothesis')
##                    X-squared 
## "Reject the null hypothesis"

The decision is to reject the null hypothesis since the test-value > critical value.

e. Summarize the result

There is enough evidence to support the claim that a relationship exists between rank and branch of service for women military personnel.
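As a follow-up, the Pearson residuals stored by chisq.test show which branch/rank cells drive the large chi-square value; a minimal sketch using the result_4 object above:

# (observed - expected) / sqrt(expected); larger absolute values indicate
# a bigger departure from independence
round(result_4$residuals, 2)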

Section 12-1.8 (Sodium Contents of Foods)

#------------------- Section 12-1 ------------------
#create data frame for condiments
condiments <- data.frame(sodium = c(270,130,230,180,80,70,200),food = rep('Condiments', 7))

# create dataframe for cereals
cereals <- data.frame(sodium = c(260,220,290,290,200,320,140),food = rep('Cereals', 7))

# create dataframe for desserts
desserts <- data.frame(sodium = c(100,180,250,250,300,360,300,160), food = rep('Desserts', 8))

# combine the dataframe into one
sodium <- rbind(condiments, cereals, desserts)
sodium$food <- as.factor(sodium$food)
sodium

Hypothesis

(\(H_0\)): \(\mu_1\) = \(\mu_2\) = \(\mu_3\)

(\(H_1\)): At least one of the means differs (the claim).

# Run the anova
result_5 <- aov(sodium ~ food, data = sodium)
result5.summary <- summary(result_5)
result5.summary
##             Df Sum Sq Mean Sq F value Pr(>F)
## food         2  27544   13772   2.399  0.118
## Residuals   19 109093    5742

Critical Value

# get degree of freedom first
df1 <- result5.summary[[1]][1, 'Df']
df2 <- result5.summary[[1]][2, 'Df']
#get the critical value
cv_5 <- qf(p=.05, df1=df1, df2=df2, lower.tail=FALSE)
cv_5
## [1] 3.521893

The critical value is 3.52 at \(\alpha\) = 0.05 with 2 and 19 degrees of freedom.

Test Value

# extract the F statistic
F.value <- result5.summary[[1]][[1, 'F value']]
F.value
## [1] 2.398538

The F-value is 2.3985.

Decision

# make decision
ifelse(F.value > cv_5, 'Reject the null hypothesis', 
       'Fail to reject the null hypothesis')
## [1] "Fail to reject the null hypothesis"

Since the F-value (2.3985) is less than the critical value (3.52), the decision is to not reject the null hypothesis.

Summarize the result

There is not sufficient evidence to support the claim that at least one mean is different from the others.
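The same decision also follows from the p-value stored in the ANOVA table; a short sketch using the result5.summary object above:

# reject the null hypothesis when Pr(>F) is below alpha (0.05)
p.value_5 <- result5.summary[[1]][1, 'Pr(>F)']
ifelse(p.value_5 < 0.05, 'Reject the null hypothesis',
       'Fail to reject the null hypothesis')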

Section 12-2.10 (Sales for Leading Companies)

# ------ Question 10 -------
#data frame for cereal
cereal <- data.frame(sales = c(578,320,264,249,237), company = rep('Cereal',5))

# create dataframe for chocolate
chocolate <- data.frame(sales = c(311,106,109,125,173), company = rep('Chocolate Candy',5))

# create dataframe for coffee
coffee <- data.frame(sales = c(261, 185, 302, 689), company = rep('Coffee',4))

#combine all the dataframe into one
sales <- rbind(cereal, chocolate, coffee)
sales$company <- as.factor(sales$company)
sales

a. Hypothesis

(\(H_0\)): \(\mu_1\) = \(\mu_2\) = \(\mu_3\)

(\(H_1\)): There is a significant difference in the mean sales (the claim).

b. Critical Value

# k(number of groups) = 3, N(sum of the sample sizes of the groups) = 14
# dfN = K-1; dfD = N-k
dfN = (3-1)
dfD = (14-3)
#find critical value
cv_6 <- qf(p = 0.01, df1 =dfN, df2 = dfD, lower.tail = FALSE)
cv_6
## [1] 7.205713

The critical value is 7.21 at \(\alpha\) = 0.01 with 2 and 11 degrees of freedom.

c. Test Value

# test the anova
result_6 <- aov(sales ~ company, data = sales)
result6.summary <- summary(result_6)
result6.summary
##             Df Sum Sq Mean Sq F value Pr(>F)
## company      2 103770   51885   2.172   0.16
## Residuals   11 262795   23890
# F statistic
F.value <- result6.summary[[1]][[1, 'F value']]
F.value
## [1] 2.171782

The test value is 2.172 and the p-value is 0.16.

d. Decision

# make decision
ifelse(F.value > cv_6, 'Reject the null hypothesis', 
       'Fail to reject the null hypothesis')
## [1] "Fail to reject the null hypothesis"

Since the test value (2.172) is less than the critical value (7.21), and equivalently the p-value (0.16) is greater than \(\alpha\) (0.01), the decision is to not reject the null hypothesis.

e. Summarize the result

We conclude that there is not enough evidence to support the claim that there is a significant difference in the means.

Section 12-2.12 (Per-Pupil Expenditures)

# create dataframe for the eastern third
eastern <- data.frame(expenditure = c(4946,5953,6202,7243, 6113), state = rep('Eastern third',5))

#create dataframe for middle third
middle <- data.frame(expenditure = c(6149, 7451,6000,6479), state=rep('Middle third', 4))

# create dataframe for western third
western <- data.frame(expenditure = c(5282,8605,6528,6911), state = rep('Western third',4))

# combine all dataframe into one
expenditures <- rbind(eastern, middle, western)
expenditures

a. Hypothesis

(\(H_0\)): \(\mu_1\) = \(\mu_2\) = \(\mu_3\)

(\(H_1\)): At least one mean is different from the others (claim)

Critical Value

# k = 3, N = 13
# dfN = K-1; dfD = N-k
dfN = (3-1)
dfD = (13-3)
#find critical value
cv_7 <- qf(p = 0.05, df1 =dfN, df2 = dfD, lower.tail = FALSE)
cv_7
## [1] 4.102821

The critical value is 4.10 at \(\alpha\) = 0.05 with 2 and 10 degrees of freedom.

Test value

# run the anova test
result_7 <- aov(expenditure ~ state, data = expenditures)
result7.summary <- summary(result_7)
result7.summary
##             Df  Sum Sq Mean Sq F value Pr(>F)
## state        2 1244588  622294   0.649  0.543
## Residuals   10 9591145  959114
# F statistic
F.value <- result7.summary[[1]][[1, 'F value']]
F.value
## [1] 0.6488214

The F-value is 0.649 and the p-value is 0.543.

Decision

# Decision
ifelse(F.value > cv_7, 'Reject the null hypothesis', 
       'Fail to reject the null hypothesis')
## [1] "Fail to reject the null hypothesis"

Since the F-value (0.649) is less than the critical value (4.10), the decision is to not reject the null hypothesis.

Conclusion

There is not enough evidence to support the claim that at least one mean is different from the others.

Section 12-3.10 (Increasing Plant Growth)

# create the dataframe
alpha_8 <- 0.05
growth_data <- data.frame(
  Light = rep(c("Grow-light 1", "Grow-light 2"), each = 6),
  PlantFood = rep(rep(c("A", "B"), each = 3), times = 2),
  Growth = c(9.2, 9.4, 8.9, 7.1, 7.2, 8.5, 8.5, 9.2, 8.9, 5.5, 5.8, 7.6))

growth_data$Light = as.factor(growth_data$Light)
growth_data$PlantFood = as.factor(growth_data$PlantFood)

Hypothesis

Interaction:

(\(H_0\)) : There is no interaction effect between the strength of the Grow-light and the plant food supplement.

(\(H_1\)): There is an interaction effect between the Grow-light strength and the plant food supplement.

Plant food:

(\(H_0\)): There is no difference in the mean growth with respect to the type of plant food supplement.

(\(H_1\)): There is a difference in the mean growth with respect to the type of plant food supplement.

Grow-light:

(\(H_0\)): There is no difference in the mean growth with respect to the strength of the Grow-light.

(\(H_1\)): There is a difference in the mean growth with respect to the strength of the Grow-light.

Run the anova

result_8 <-aov(Growth ~ Light*PlantFood, data=growth_data)
result8.summary <- summary(result_8)
result8.summary
##                 Df Sum Sq Mean Sq F value  Pr(>F)   
## Light            1  1.920   1.920   3.681 0.09133 . 
## PlantFood        1 12.813  12.813  24.562 0.00111 **
## Light:PlantFood  1  0.750   0.750   1.438 0.26482   
## Residuals        8  4.173   0.522                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Critical Value

# critical value
dfN <- result8.summary[[1]][1, 'Df'] 
dfD <- result8.summary[[1]][4, 'Df']
# Calculate the critical F-value
critical_F <- qf(1 - alpha_8, dfN, dfD)

# Print the critical F-value
print(critical_F)
## [1] 5.317655

The critical value is 5.32 with 1 and 8 degrees of freedom (the same for each factor and for the interaction).

Decision

Since the F-value for Light (3.681) and the F-value for the interaction (1.438) are both less than the critical F-value (5.32), we fail to reject the null hypotheses for Light and for the interaction. For plant food, however, the F-value (24.562) is greater than the critical value (5.32), so we reject the null hypothesis.
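The comparison described above can also be done in code, mirroring the decision chunks in the earlier sections; a minimal sketch using result8.summary and critical_F from above:

# compare each F value in the ANOVA table to the critical value
F.values <- result8.summary[[1]][1:3, 'F value']
names(F.values) <- c('Light', 'PlantFood', 'Light:PlantFood')
ifelse(F.values > critical_F, 'Reject the null hypothesis',
       'Fail to reject the null hypothesis')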

Summarize the result

Since we fail to reject the null hypothesis for the interaction effect, we conclude that the combination of Grow-light strength and plant food supplement has no significant effect on growth. There is also no significant difference in mean growth with respect to the Grow-light strength.

Since we reject the null hypothesis for plant food, there is sufficient evidence to conclude that mean growth differs between the plant food supplements.
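As an optional visual check of the interaction conclusion, an interaction plot of the growth_data defined above can be drawn with base R; roughly parallel lines are consistent with the non-significant Light:PlantFood interaction.

# interaction plot of mean growth by plant food and Grow-light strength
with(growth_data,
     interaction.plot(x.factor = PlantFood, trace.factor = Light,
                      response = Growth, xlab = 'Plant food',
                      ylab = 'Mean growth', trace.label = 'Grow-light'))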

Baseball Dataset

The baseball dataset provides insights into the performance of baseball teams across seasons from 1962 to 2012. It includes performance metrics such as runs scored (RS), runs allowed (RA), and wins (W), along with key indicators like on-base percentage (OBP), slugging percentage (SLG), and batting average (BA), which are pivotal for assessing a team’s offensive and defensive capabilities. The summary statistics show that the median values for runs scored (RS) and runs allowed (RA) are 711 and 709 respectively, suggesting a balanced distribution of offensive and defensive performance. The “Playoffs” variable indicates whether a team made the playoffs (1) or not (0); its mean shows that approximately 19.81% of team-seasons in the dataset ended in a playoff appearance. Wins range from a minimum of 40 to a maximum of 116, with a median of 81. The offensive performance indicators OBP, SLG, and BA exhibit similarly modest variability.
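The playoff share quoted above can be verified directly from the 0/1 Playoffs flag; a one-line sketch, assuming the baseball data frame loaded in the appendix:

# proportion of team-seasons that reached the playoffs, about 19.81%
mean(baseball$Playoffs) * 100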

##      Team              League               Year            RS        
##  Length:1232        Length:1232        Min.   :1962   Min.   : 463.0  
##  Class :character   Class :character   1st Qu.:1977   1st Qu.: 652.0  
##  Mode  :character   Mode  :character   Median :1989   Median : 711.0  
##                                        Mean   :1989   Mean   : 715.1  
##                                        3rd Qu.:2002   3rd Qu.: 775.0  
##                                        Max.   :2012   Max.   :1009.0  
##                                                                       
##        RA               W              OBP              SLG        
##  Min.   : 472.0   Min.   : 40.0   Min.   :0.2770   Min.   :0.3010  
##  1st Qu.: 649.8   1st Qu.: 73.0   1st Qu.:0.3170   1st Qu.:0.3750  
##  Median : 709.0   Median : 81.0   Median :0.3260   Median :0.3960  
##  Mean   : 715.1   Mean   : 80.9   Mean   :0.3263   Mean   :0.3973  
##  3rd Qu.: 774.2   3rd Qu.: 89.0   3rd Qu.:0.3370   3rd Qu.:0.4210  
##  Max.   :1103.0   Max.   :116.0   Max.   :0.3730   Max.   :0.4910  
##                                                                    
##        BA            Playoffs        RankSeason     RankPlayoffs  
##  Min.   :0.2140   Min.   :0.0000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:0.2510   1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:2.000  
##  Median :0.2600   Median :0.0000   Median :3.000   Median :3.000  
##  Mean   :0.2593   Mean   :0.1981   Mean   :3.123   Mean   :2.717  
##  3rd Qu.:0.2680   3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :0.2940   Max.   :1.0000   Max.   :8.000   Max.   :5.000  
##                                    NA's   :988     NA's   :988    
##        G              OOBP             OSLG       
##  Min.   :158.0   Min.   :0.2940   Min.   :0.3460  
##  1st Qu.:162.0   1st Qu.:0.3210   1st Qu.:0.4010  
##  Median :162.0   Median :0.3310   Median :0.4190  
##  Mean   :161.9   Mean   :0.3323   Mean   :0.4197  
##  3rd Qu.:162.0   3rd Qu.:0.3430   3rd Qu.:0.4380  
##  Max.   :165.0   Max.   :0.3840   Max.   :0.4990  
##                  NA's   :812      NA's   :812

EDA

The average on-base percentage (Avg_OBP), slugging percentage (Avg_SLG), and batting average (Avg_BA) all appear to be relatively stable over the years. While there are minor fluctuations, no significant trends are observed in these team performance metrics, suggesting that teams maintained relatively consistent levels of offensive proficiency throughout the period covered by the dataset.

Team Performance Over the Years

Teams with the most wins

For the American League between 1962 and 2012, the New York Yankees (NYY) finished with the most wins in the league in 15 seasons, the highest count of any team. The Baltimore Orioles (BAL) follow with 8 seasons. Other teams, including the Boston Red Sox (BOS), Chicago White Sox (CHW), Detroit Tigers (DET), and Oakland Athletics (OAK), led the league in between 1 and 5 seasons. In the National League, covering the same period (1962 to 2012), the Atlanta Braves (ATL) and the St. Louis Cardinals (STL) stand out with 8 league-leading seasons each.

Number of Wins Per Team

Chi-square Goodness-of-Fit Test

Hypothesis

(\(H_0\)): There is no difference in the number of wins by decade.

(\(H_1\)): There is a difference in the number of wins by decade.

Critical Value

With \(\alpha\) = 0.05 and 5 degrees of freedom, the critical value is 11.0705.
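The critical value can be computed the same way as in the earlier sections (the full chunk appears in the appendix as cv_10); a brief sketch:

# 6 decades (1960s through 2010s), so df = 6 - 1 = 5
qchisq(p = 0.05, df = 5, lower.tail = FALSE) # 11.0705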

Test Value

The test value is 9989.5 and the p-value is less than 2.2e-16.

## 
##  Chi-squared test for given probabilities
## 
## data:  wins$wins
## X-squared = 9989.5, df = 5, p-value < 2.2e-16

Decision & Conclusion

Since the test value (9989.5) is greater than the critical value (11.0705), the decision is to reject the null hypothesis. We therefore conclude that there is enough evidence to support the claim that there is a difference in the number of wins by decade.

Equivalently, since the p-value is less than \(\alpha\) (0.05), the decision is to reject the null hypothesis. Comparing the p-value to \(\alpha\) leads to the same decision as comparing the test value to the critical value.
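The decision step can be expressed in code the same way as in the earlier sections; a short sketch using the result_10 and cv_10 objects computed in the appendix:

# decision by comparing the test statistic to the critical value
ifelse(result_10$statistic > cv_10, 'Reject the null hypothesis',
       'Fail to reject the null hypothesis')
# equivalent decision from the p-value
ifelse(result_10$p.value < 0.05, 'Reject the null hypothesis',
       'Fail to reject the null hypothesis')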

Crop Dataset

Two-way ANOVA Test

Hypothesis

Interaction

(\(H_0\)): There is no interaction effect between type of fertilizer used and density.

(\(H_1\)): There is an interaction effect between type of fertilizer used and density.

Fertilizer Hypothesis:

(\(H_0\)): There is no significant difference in mean crop yield among the different fertilizers.

(\(H_1\)): There is a significant difference in mean crop yield among the different fertilizers.

Density Hypothesis:

(\(H_0\)): There is no significant difference in crop yield among the different planting densities.

(\(H_1\)): There is a significant difference in crop yield among the different planting densities.

Test Value

##                    Df Sum Sq Mean Sq F value   Pr(>F)    
## fertilizer          2  6.068   3.034   9.001 0.000273 ***
## density             1  5.122   5.122  15.195 0.000186 ***
## fertilizer:density  2  0.428   0.214   0.635 0.532500    
## Residuals          90 30.337   0.337                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Decision & Conclusion

Since the p-value for fertilizer is less than \(\alpha\) (0.05), we reject the null hypothesis and conclude that there is significant evidence that the type of fertilizer has an impact on crop yield.

Likewise, the p-value for density is less than \(\alpha\) (0.05), so we reject the null hypothesis and conclude that there is significant evidence that planting density has an impact on crop yield.

However, since the p-value for the interaction is greater than \(\alpha\) (0.05), we fail to reject the null hypothesis and conclude that there is no significant interaction effect between fertilizer and density on crop yield.
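For comparison with the p-value approach used here, the corresponding critical F values at \(\alpha\) = 0.05 can be obtained with qf, using the degrees of freedom shown in the ANOVA table; a brief sketch:

# critical F values for the two-way ANOVA on the crop data
qf(p = 0.05, df1 = 2, df2 = 90, lower.tail = FALSE) # fertilizer and interaction
qf(p = 0.05, df1 = 1, df2 = 90, lower.tail = FALSE) # density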

Conclusion

When conducting a hypothesis test, several steps are involved to ensure a thorough analysis. First, it is important to clearly state the hypotheses being tested, defining both the null and alternative hypotheses. Next, the critical value is determined; this provides a threshold for decision-making based on the chosen significance level. The test value is then calculated from the collected data. With this information, a decision is made to either reject or fail to reject the null hypothesis, depending on whether the test value falls within the critical region defined by the critical value. Finally, the results of the hypothesis test are summarized, providing insights into the significance of the findings and their implications for the research question. Each of these steps is crucial for ensuring the validity and reliability of the hypothesis test and its conclusions.

The decision to reject the null hypothesis can be made by either comparing the test value against the critical value or by comparing the p-value with the significance level.

If the test value is less than the critical value, the decision is to not reject the null hypothesis, since the test value falls within the fail-to-reject (FTR) region.

When using the p-value, the comparison is reversed: we fail to reject the null hypothesis if the p-value is greater than the significance level.
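The two equivalent decision rules can be summarized in a small generic sketch, illustrated here with the chi-square values from the airline example in Section 11-1.8:

# values taken from the airline on-time performance test above
alpha          <- 0.05
test_value     <- 17.832    # chi-square test statistic
critical_value <- 7.815     # critical value at df = 3
p_value        <- 0.0004763

# rule 1: compare the test value to the critical value
ifelse(test_value > critical_value, 'Reject the null hypothesis',
       'Fail to reject the null hypothesis')
# rule 2: compare the p-value to the significance level
ifelse(p_value < alpha, 'Reject the null hypothesis',
       'Fail to reject the null hypothesis')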

Reference

Bluman, A. G. (2009). Elementary Statistics: A Step by Step Approach (7th ed.). McGraw-Hill Education. http://books.google.ie/books?id=hOVdtAEACAAJ&dq=bluman+elementary+statistics&hl=&cd=3&source=gbs_api

Kabacoff, R. I. (2015). R in Action, Second Edition. O’Reilly Online Learning. Retrieved April 17, 2024, from https://learning.oreilly.com/library/view/r-in-action/9781617291388/kindle_split_002.html

Two-way ANOVA in R. (n.d.). Stats and R. https://statsandr.com/blog/two-way-anova-in-r/

Appendix

#----------- Baseball -------------------
#setwd("/Users/user/Northeastern/ALY6015/Chi-square/Chi-square") # set the working directory if needed
baseball <- read.csv('baseball.csv')
#---- explore data
dim(baseball)
summary(baseball) # there seem to be a lot of missing values in some of the columns


#missing values
# percentage of missing values per column
get_percent_missing <- function(df){
  # this function produces the percentage of missing values for each variable
  missing_percentage <- colMeans(is.na(df)) * 100
  missing_percentage <- data.frame(missing_percentage)
  missing_percentage <- missing_percentage %>% 
    arrange(desc(missing_percentage)) %>% round(., 2)
  
  return(missing_percentage)
}
#get the missing percentage per column
get_percent_missing(baseball)

# 4 of the columns have a large share of missing values (RankSeason and RankPlayoffs ~80%; OOBP and OSLG ~66%), so they are dropped

baseball_drop <- baseball %>% select(-c(RankSeason, RankPlayoffs, OOBP, OSLG))
#baseball_drop

# ------ Visualize the distribution of missing variable columns 
# select numeric columns
get_numeric_col <- function(df){
  # this function extracts the numeric columns from a data frame
  numeric_df <- df[, sapply(df, is.numeric)] 
  
  return(numeric_df)
}

bb_numeric <- get_numeric_col(baseball_drop)
# Reshape the dataframe into long format
df_long <- gather(bb_numeric, key = "variable", value = "value")

# Histogram with kernel density
hist_plot <- ggplot(df_long, aes(x = value)) +
  geom_histogram(aes(y = after_stat(density)), bins = 20, fill = "skyblue", color = "black") +
  geom_density(alpha = 0.5, color = "red") +
  facet_wrap(~ variable, scales = "free") +
  theme_minimal() +
  labs(title = "Histogram with kernel density of Variables with Missing Values",
       x = "Value", y = "Density") +
  theme(plot.title = element_text(hjust = .5))
# Print the histogram plot
#print(hist_plot)



# table of descriptive statistics
descriptive_stat <- describe(get_numeric_col(baseball))
descriptive_stat <- descriptive_stat %>% select(c(n,mean,median, min, max))
descriptive_stat



#----- Exploratory Data Analysis-------
# Team Performance Trends Over Time:
performance <- baseball_drop %>% group_by(Year) %>%
  summarise(Avg_RS = mean(RS),
            Avg_RA = mean(RA),
            Avg_W = mean(W),
            Avg_OBP = mean(OBP),
            Avg_SLG = mean(SLG),
            Avg_BA = mean(BA))
performance



performance_plt <- ggplot(performance, aes(x = Year)) +
  geom_line(aes(y = Avg_OBP, color = "OBP"), size = 1) +
  geom_line(aes(y = Avg_SLG, color = "SLG"), size = 1) +
  geom_line(aes(y = Avg_BA, color = "BA"), size = 1) +
  labs(title = "Team Performance Over Time",
       x = "Year",
       y = "Average Value",
       color = "Metric") +
  scale_color_manual(values = c("OBP" = "blue", "SLG" = "red", "BA" = "green")) +
  theme_minimal() + theme(plot.title = element_text(hjust = 0.5))

performance_plt2 <- ggplot(performance, aes(x = Year)) +
  # note: runs (~700) and wins (~80) are plotted on the same y-axis here
  geom_line(aes(y = Avg_RS, color = 'Runs Scored'), size = 1) +
  geom_line(aes(y = Avg_RA, color = 'Runs Allowed'), size = 1) +
  geom_line(aes(y = Avg_W, color = 'Wins'), size = 1)+
  labs(title = "Team Performance Over Time",
       x = "Year",
       y = "Average Value") +
  scale_color_manual(values = c('Runs Scored'='blue','Runs Allowed'='red', 
                                'Wins' = 'green')) +
  theme_minimal()



# National League (NL)
league_nl <- baseball_drop %>% filter(League == 'NL') %>%
  group_by(Year) %>%
  summarise(Avg_RS = mean(RS),
            Avg_RA = mean(RA),
            Avg_W = mean(W),
            Avg_OBP = mean(OBP),
            Avg_SLG = mean(SLG),
            Avg_BA = mean(BA))
league_nl

# American League (AL)
league_al <- baseball_drop %>% filter(League == 'AL') %>%
  group_by(Year) %>%
  summarise(Avg_RS = mean(RS),
            Avg_RA = mean(RA),
            Avg_W = mean(W),
            Avg_OBP = mean(OBP),
            Avg_SLG = mean(SLG),
            Avg_BA = mean(BA))
league_al




#--------- Playoff Performance:------------
# Separate data for playoff and non-playoff teams
playoff_teams <- baseball_drop %>% filter(Playoffs == 1)
non_playoff_teams <- baseball_drop %>% filter(Playoffs == 0)

# Aggregate data for playoff and non-playoff teams and calculate average values of desired variables
playoff_performance <- playoff_teams %>%
  summarize(
    Avg_RS_playoff = mean(RS),
    Avg_RA_playoff = mean(RA),
    Avg_W_playoff = mean(W),
    Avg_OBP_playoff = mean(OBP),
    Avg_SLG_playoff = mean(SLG),
    Avg_BA_playoff = mean(BA)
  )

non_playoff_performance <- non_playoff_teams %>%
  summarize(
    Avg_RS_non_playoff = mean(RS),
    Avg_RA_non_playoff = mean(RA),
    Avg_W_non_playoff = mean(W),
    Avg_OBP_non_playoff = mean(OBP),
    Avg_SLG_non_playoff = mean(SLG),
    Avg_BA_non_playoff = mean(BA)
  )




#-------- Team with the most win ---------
#----- American League ----
team_wins_AL <- baseball_drop %>% filter(League=='AL') %>% group_by(Year) %>%
  summarise(highest_score = max(W),
            best_team = Team[which.max(W)])

team_AL <- team_wins_AL %>% group_by(best_team) %>% 
  summarise(Win_count = n())
# plot
plt_AL <- ggplot(team_AL, aes(x = best_team, y=Win_count, fill = Win_count)) + 
  geom_bar(stat='identity') + 
  labs(title = 'Team with the Most Wins in the American League',
       x = 'Teams',
       y = 'Number of Wins'
  ) + 
  scale_fill_gradient(low = "tan", high = "tomato") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 90, hjust = 1),
        legend.position = 'none')


# ------- National League -----
team_wins_NL <- baseball_drop %>% filter(League=='NL') %>% group_by(Year) %>%
  summarise(highest_score = max(W),
            best_team = Team[which.max(W)])


team_NL <- team_wins_NL %>% group_by(best_team) %>% 
  summarise(Win_count = n())
# plot
plt_NL <- ggplot(team_NL, aes(x = best_team, y=Win_count, fill =Win_count )) + 
  geom_bar(stat='identity') + 
  labs(title = 'National League',
       x = 'Teams',
       y = 'Number of Wins'
  ) + 
  scale_fill_gradient(low = "tan", high = "tomato") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 90, hjust = 1),
        legend.position = 'none')


#------ Chi-square --------
# Extract decade from year
baseball_drop$Decade <- baseball_drop$Year - (baseball_drop$Year %% 10)
# Create a wins table by summing the wins by decade 
wins <- baseball_drop %>% group_by(Decade) %>% 
  summarize(wins = sum(W))

wins$expected <- sum(wins$wins)/6
wins$prob <- wins$expected/sum(wins$wins)
wins

# Hypothesis


# critical value
# goodness-of-fit degrees of freedom = number of decades - 1
df_10 <- 6 - 1
cv_10 <- qchisq(p = 0.05,df = df_10, lower.tail = FALSE)
cv_10

# test statistic
result_10 <- chisq.test(x = wins$wins, p=wins$prob)
result_10
#------ Crop dataset -------
crop <- read.csv('crop_data.csv')
#convert to factors
crop$density <- as.factor(crop$density)
crop$block <- as.factor(crop$block)
crop$fertilizer <- as.factor(crop$fertilizer)

# two way anova
result_11 <- aov(yield ~fertilizer*density, data=crop)
summary(result_11)