Chi-square and ANOVA are two widely used statistical methods for analyzing data and testing hypotheses in different scenarios. In this assignment, I use my knowledge of chi-square and ANOVA to address a series of questions. A total of 10 different sets of questions are investigated using these two methods. The first 8 series of questions, I use chi-square goodness of fit, chi-square independence test, one-way anova and two-way anova to solve. The remaining two questions explore the team baseball dataset spanning from 1962 to 2012 and crop dataset that investigate whether , fertilizer and density have effect on crop yield using two-way anova.
# --------- Section 11-1 ---------
# Question 6
alpha_1 <- 0.10
blood_type <- data.frame(blood_type = c('A','B','O', 'AB'),
observed = c(12,8,24,6),
expected = c(0.2,0.28,0.36,0.16))
blood_type
(\(H_0\)) : Hospital patient in a large hospital have the same blood type distribution as follow; type A, 20%; type B, 28%; type O, 36%; and type AB = 16% (the claim).
(\(H_1\)): The distribution is not the same as stated in the null hypothesis
# find the degree of freedom = (R-1)*(C-1)
df = (4-1)*(2-1)
cv <- qchisq(p = alpha_1, df = df, lower.tail = FALSE)
cv # critical value is 6.251389
## [1] 6.251389
Given a degree of freedom of 3, the critical value is 6.251
# run the chi-square test
result_1 <- chisq.test(x = blood_type$observed, p=blood_type$expected)
result_1
##
## Chi-squared test for given probabilities
##
## data: blood_type$observed
## X-squared = 5.4714, df = 3, p-value = 0.1404
The test statistic is 5.4714
ifelse(result_1$statistic > cv, 'Reject the null hypothesis',
'Fail to reject the null hypothesis')
## X-squared
## "Fail to reject the null hypothesis"
Since the test-statistic(5.4714) is less than the critical value (6.251), the decision is not to reject the null hypothesis
We conclude that there is enough evidence to support the researcher claim that hospital patients in a large hospital have the same blood type distribution as those in the general population as stated in the null hypothesis.
airline <- data.frame(Action = c('One time','National Aviation System delay',
'Aircraft arriving late','Other'),
Expected = c(0.708, 0.082, 0.09, 0.12),
Observed = c(125, 10, 25, 40))
airline
(\(H_0\)): The Airline on-time performance is the same as the government’s statistics.
(\(H_1\)): The Airline on-time performance is not the same as the government’s statistics (the claim).
# find the degree of freedom = (R-1)*(C-1)
df = (4-1)*(2-1)
cv_2 <- qchisq(p = 0.05, df = df, lower.tail = FALSE)
cv_2
## [1] 7.814728
Given a degree of freedom(df) of 3, and alpha at 0.05, the critical value is 7.815.
# Test Statistics
result_2 <- chisq.test(airline$Observed, p= airline$Expected)
result_2
##
## Chi-squared test for given probabilities
##
## data: airline$Observed
## X-squared = 17.832, df = 3, p-value = 0.0004763
The result indicate that the test-statistic is 17.832 and the p-value is 0.0004763.
ifelse(result_2$statistic > cv_2, 'Reject the null hypothesis',
'Fail to reject the null hypothesis')
## X-squared
## "Reject the null hypothesis"
Since the test-statistic(17.832) > 7.815(the critical value), the decision is to reject the null hypothesis.
There is enough evidence to support the claim that the airline one time performance differ from the government statistics.
admission <- matrix(c(724, 335, 174, 107, 370, 292, 152, 140), ncol = 4, byrow = 2)
colnames(admission) <- c('Caucasian','Hispanic','African American','Other')
rownames(admission) <- c(2013, 2014)
admission <- as.table(admission)
admission
## Caucasian Hispanic African American Other
## 2013 724 335 174 107
## 2014 370 292 152 140
(\(H_0\)): Movie attendance by year is independent of the ethnicity of movie goers.
(\(H_1\)): Movie attendance by year is dependent of the ethnicity of movie goers (the claim).
# find the degree of freedom = (R-1)*(C-1)
df = (2-1)*(4-1)
cv_3 <- qchisq(p = 0.05, df = df, lower.tail = FALSE)
cv_3
## [1] 7.814728
Given alpha at 0.05 and the degree of freedom at 3, the critical value is 7.815.
result_3 <- chisq.test(admission)
result_3
##
## Pearson's Chi-squared test
##
## data: admission
## X-squared = 60.144, df = 3, p-value = 5.478e-13
The test-statistic is 60.144.
# Decision
ifelse(result_3$statistic > cv_3, 'Reject the null hypothesis',
'Fail to reject the null hypothesis')
## X-squared
## "Reject the null hypothesis"
The decision is to reject the null hypothesis since the test-statistic(60.144) > 7.815(the critical value).
Since the null hypothesis is rejected, then we can conclude that there is sufficient evidence to support the claim that movie attendance by year is dependent upon the ethnicity of movie goers..
#-- Question 10
military <- matrix(c(10791, 62491, 7816, 42750, 932, 9525, 11819, 54344),
ncol = 2, byrow = 4)
colnames(military) <- c('Officers','Enlisted')
rownames(military) <- c('Army','Navy','Marine Corps','Air Force')
women_in_mil <- as.table(military)
women_in_mil
## Officers Enlisted
## Army 10791 62491
## Navy 7816 42750
## Marine Corps 932 9525
## Air Force 11819 54344
(\(H_0\)): Military rank of women personnel is independent of the military branch of service
(\(H_1\)): Military rank of women personnel is dependent of the military branch of service (the claim).
# find the degree of freedom = (R-1)*(C-1)
df = (4-1)*(2-1)
cv_4 <- qchisq(p = 0.05, df = df, lower.tail = FALSE)
cv_4
## [1] 7.814728
the critical value is 7.815 at \(\alpha\) = 0.05 and degree of freedom of 3.
# test statistic
result_4 <- chisq.test(women_in_mil)
result_4
##
## Pearson's Chi-squared test
##
## data: women_in_mil
## X-squared = 654.27, df = 3, p-value < 2.2e-16
The test value is 654.27.
# Decision
ifelse(result_4$statistic > cv_4, 'Reject the null hypothesis',
'Fail to reject the null hypothesis')
## X-squared
## "Reject the null hypothesis"
The decision is to reject the null hypothesis since the test-value > critical value.
There is enough evidence to support the claim that there exist a relationship between rank and branch of service for military women personnel.
#------------------- Section 12-1 ------------------
#create data frame for condiments
condiments <- data.frame(sodium = c(270,130,230,180,80,70,200),food = rep('Condiments', 7))
# create dataframe for cereals
cereals <- data.frame(sodium = c(260,220,290,290,200,320,140),food = rep('Cereals', 7))
# create dataframe for desserts
desserts <- data.frame(sodium = c(100,180,250,250,300,360,300,160), food = rep('Desserts', 8))
# combine the dataframe into one
sodium <- rbind(condiments, cereals, desserts)
sodium$food <- as.factor(sodium$food)
sodium
(\(H_0\)): \(\mu_1\) = \(\mu_2\) = \(\mu_3\)
(\(H_1\)): at least one of the means differ (the claim)
# Run the anova
result_5 <- aov(sodium ~ food, data = sodium)
result5.summary <- summary(result_5)
result5.summary
## Df Sum Sq Mean Sq F value Pr(>F)
## food 2 27544 13772 2.399 0.118
## Residuals 19 109093 5742
# get degree of freedom first
df1 <- result5.summary[[1]][1, 'Df']
df2 <- result5.summary[[1]][2, 'Df']
#get the critical value
cv_5 <- qf(p=.05, df1=df1, df2=df2, lower.tail=FALSE)
cv_5
## [1] 3.521893
The critical value is 3.52 at \(\alpha\) = 0.05, with degree of freedom of 2 & 19.
# extract the F statistic
F.value <- result5.summary[[1]][[1, 'F value']]
F.value
## [1] 2.398538
The F-value is 2.3985.
# make decision
ifelse(F.value > cv_5, 'Reject the null hypothesis',
'Fail to reject the null hypothesis')
## [1] "Fail to reject the null hypothesis"
Since the F-value(2.3985.) < critical value(3.52), the decision is not to reject the null hypothesis
There is not sufficient evidence to support the claim that at least one mean is different from the others.
# ------ Question 10 -------
#data frame for cereal
cereal <- data.frame(sales = c(578,320,264,249,237), company = rep('Cereal',5))
# create dataframe for chocolate
chocolate <- data.frame(sales = c(311,106,109,125,173), company = rep('Chocolate Candy',5))
# create dataframe for coffee
coffee <- data.frame(sales = c(261, 185, 302, 689), company = rep('Coffee',4))
#combine all the dataframe into one
sales <- rbind(cereal, chocolate, coffee)
sales$company <- as.factor(sales$company)
sales
(\(H_0\)): \(\mu_1\) = \(\mu_2\) = \(\mu_3\)
(\(H_1\)): there is a significant difference in the means sales (the claim)
# k(number of groups) = 3, N(sum of the sample sizes of the groups) = 14
# dfN = K-1; dfD = N-k
dfN = (3-1)
dfD = (14-3)
#find critical value
cv_6 <- qf(p = 0.01, df1 =dfN, df2 = dfD, lower.tail = FALSE)
cv_6
## [1] 7.205713
The critical value is 2.17 at \(\alpha\) = 0.01.
# test the anova
result_6 <- aov(sales ~ company, data = sales)
result6.summary <- summary(result_6)
result6.summary
## Df Sum Sq Mean Sq F value Pr(>F)
## company 2 103770 51885 2.172 0.16
## Residuals 11 262795 23890
# F statistic
F.value <- result6.summary[[1]][[1, 'F value']]
F.value
## [1] 2.171782
The test value is 2.172 and the p_value is 0.16.
# make decision
ifelse(F.value > cv_6, 'Reject the null hypothesis',
'Fail to reject the null hypothesis')
## [1] "Fail to reject the null hypothesis"
Since the p_value(0.16) > \(\alpha\)(0.01), the decision is to not reject the null hypothesis.
We conclude that there is not enough evidence to support the claim that there is a significant difference in the means.
# create dataframe for the eastern third
eastern <- data.frame(expenditure = c(4946,5953,6202,7243, 6113), state = rep('Eastern third',5))
#create dataframe for middle third
middle <- data.frame(expenditure = c(6149, 7451,6000,6479), state=rep('Middle third', 4))
# create dataframe for western third
western <- data.frame(expenditure = c(5282,8605,6528,6911), state = rep('Western third',4))
# combine all dataframe into one
expenditures <- rbind(eastern, middle, western)
expenditures
(\(H_0\)): \(\mu_1\) = \(\mu_2\) = \(\mu_3\)
(\(H_1\)): At least one mean is different from the others (claim)
# k = 3, N = 13
# dfN = K-1; dfD = N-k
dfN = (3-1)
dfD = (13-3)
#find critical value
cv_7 <- qf(p = 0.05, df1 =dfN, df2 = dfD, lower.tail = FALSE)
cv_7
## [1] 4.102821
The critical value is 4.10 at \(\alpha\)=0.05, and degree of freedom of 2 & 10
# run the anova test
result_7 <- aov(expenditure ~ state, data = expenditures)
result7.summary <- summary(result_7)
result7.summary
## Df Sum Sq Mean Sq F value Pr(>F)
## state 2 1244588 622294 0.649 0.543
## Residuals 10 9591145 959114
# F statistic
F.value <- result7.summary[[1]][[1, 'F value']]
F.value
## [1] 0.6488214
The F-value is 0.649. and the p_value is 0.543.
# Decision
ifelse(F.value > cv_7, 'Reject the null hypothesis',
'Fail to reject the null hypothesis')
## [1] "Fail to reject the null hypothesis"
Since the F-value(0.649.) < critical value(4.10), the decision is to not reject the null hypothesis
there is not enough evidence to support the claim that at least one mean is different from the others.
# create the dataframe
alpha_8 <- 0.05
growth_data <- data.frame(
Light = rep(c("Grow-light 1", "Grow-light 2"), each = 6),
PlantFood = rep(c("A", "B"), each =3),
Growth = c(9.2, 9.4,8.9, 7.1, 7.2, 8.5, 8.5, 9.2, 8.9, 5.5, 5.8, 7.6))
growth_data$Light = as.factor(growth_data$Light)
growth_data$PlantFood = as.factor(growth_data$PlantFood)
Interaction:
(\(H_0\)) : There is no interaction effect between the strength of the Grow-light and the plant food supplement.
(\(H_1\)): There is an interaction effect between the Grow-light strength and the plant food supplement.
Plant food:
(\(H_0\)): There is no difference in the mean growth with respect to the type of plant food supplement.
(\(H_1\)): There is a difference in the mean growth with respect to the type of plant food supplement.
Grow-light:
(\(H_0\)): There is no difference in the mean growth with respect to the strength of the Grow-light.
(\(H_1\)): There is a difference in the mean growth with respect to the strength of the Grow- light.
result_8 <-aov(Growth ~ Light*PlantFood, data=growth_data)
result8.summary <- summary(result_8)
result8.summary
## Df Sum Sq Mean Sq F value Pr(>F)
## Light 1 1.920 1.920 3.681 0.09133 .
## PlantFood 1 12.813 12.813 24.562 0.00111 **
## Light:PlantFood 1 0.750 0.750 1.438 0.26482
## Residuals 8 4.173 0.522
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# critical value
dfN <- result8.summary[[1]][1, 'Df']
dfD <- result8.summary[[1]][4, 'Df']
# Calculate the critical F-value
critical_F <- qf(1 - alpha_8, dfN, dfD)
# Print the critical F-value
print(critical_F)
## [1] 5.317655
The critical value is 5.32
Since the F-value of Light(3.681) and F-value of the interaction(1.438) is less than the critical F-value(5.32), the we fail to reject the hypothesis of both but in regards to plant food, the F-value(24.562) > critical value(5.32), and so we reject the null hypothesis.
Since we fail to reject the null hypothesis for interaction effect it can be concluded that the combination of the Grow-light and the plant food supplement has no effect. Also there is no no difference in the mean growth with respect to light.
Since we reject the null hypothesis for plant food, it implies there is sufficient evidence to conclude there is a difference in the mean growth for the plant food.
The baseball dataset provides insights into the performance of baseball teams across various seasons. It includes crucial performance metrics such as runs scored (RS), runs allowed (RA), and wins (W), along with key indicators like on-base percentage (OBP), slugging percentage (SLG), and batting average (BA), which are pivotal for assessing a team’s offensive and defensive capabilities. Covering the period from 1962 to 2012. The summary statistics reveal that the median values for runs scored(RS) and runs allowed(RA) are 711 and 709 respectively, suggesting a balanced distribution of offensive and defensive performance. Additionally, the “Playoffs” variable indicates whether a team made it to the playoffs (1) or not (0). This binary variable suggests that approximately 19.81% of the teams in the dataset made it to the playoffs. The min wins is 40 and the maximum wins is 116 and a median win 0f 81. offensive performance indicators such as on-base percentage (OBP), slugging percentage (SLG), and batting average (BA) exhibit similar variability,
## Team League Year RS
## Length:1232 Length:1232 Min. :1962 Min. : 463.0
## Class :character Class :character 1st Qu.:1977 1st Qu.: 652.0
## Mode :character Mode :character Median :1989 Median : 711.0
## Mean :1989 Mean : 715.1
## 3rd Qu.:2002 3rd Qu.: 775.0
## Max. :2012 Max. :1009.0
##
## RA W OBP SLG
## Min. : 472.0 Min. : 40.0 Min. :0.2770 Min. :0.3010
## 1st Qu.: 649.8 1st Qu.: 73.0 1st Qu.:0.3170 1st Qu.:0.3750
## Median : 709.0 Median : 81.0 Median :0.3260 Median :0.3960
## Mean : 715.1 Mean : 80.9 Mean :0.3263 Mean :0.3973
## 3rd Qu.: 774.2 3rd Qu.: 89.0 3rd Qu.:0.3370 3rd Qu.:0.4210
## Max. :1103.0 Max. :116.0 Max. :0.3730 Max. :0.4910
##
## BA Playoffs RankSeason RankPlayoffs
## Min. :0.2140 Min. :0.0000 Min. :1.000 Min. :1.000
## 1st Qu.:0.2510 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:2.000
## Median :0.2600 Median :0.0000 Median :3.000 Median :3.000
## Mean :0.2593 Mean :0.1981 Mean :3.123 Mean :2.717
## 3rd Qu.:0.2680 3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :0.2940 Max. :1.0000 Max. :8.000 Max. :5.000
## NA's :988 NA's :988
## G OOBP OSLG
## Min. :158.0 Min. :0.2940 Min. :0.3460
## 1st Qu.:162.0 1st Qu.:0.3210 1st Qu.:0.4010
## Median :162.0 Median :0.3310 Median :0.4190
## Mean :161.9 Mean :0.3323 Mean :0.4197
## 3rd Qu.:162.0 3rd Qu.:0.3430 3rd Qu.:0.4380
## Max. :165.0 Max. :0.3840 Max. :0.4990
## NA's :812 NA's :812
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The average on-base percentage, Slugging Percentage (Avg_SLG) and Batting Average (Avg_BA) all appear to be relatively stable overt the years. While there are minor fluctuations, there are no significant trends observed in team performance metrics over the years. This suggests that teams have maintained relatively consistent levels of offensive proficiency throughout the years covered by the dataset.
Team Performance Over the Years
For American League between the year 1960 to 2012, The New York Yankees (NYY) have the highest number of wins, with 15 wins. The Baltimore Orioles (BAL) follow with 8 wins. Other teams, including the Boston Red Sox (BOS), Chicago White Sox (CHW), Detroit Tigers (DET), and Oakland Athletics (OAK), have between 1 and 5 wins. In the National League dataset, covering the same time frame as the American League (1962 to 2012), the Atlanta Braves (ATL) and the St. Louis Cardinals (STL) stand out with 8 wins each.
Number of Wins Per Team
(\(H_0\)): There is no difference in the number of wins by decade
(\(H_1\)): There is a difference in the number of wins by decade.
When alpha is 0.05 and the degree of freedom is 5, the critical value is 11.0705.
The test-value is 9989.5 and the p-value is relatively small.
##
## Chi-squared test for given probabilities
##
## data: wins$wins
## X-squared = 9989.5, df = 5, p-value < 2.2e-16
Since the test-value(9989.5) > critical value(11.0705), then the decision is to reject the null hypothesis. Because of that we can conclude that there is enough evidence to support the claim that there is a difference in the number of wins by decade.
In the same way, as the p-value< alpha(0.05), the decision is to reject the null hypothesis. This provide the same result as to when the making decision vy comparing the critical value to the test value.
Interaction
(\(H_0\)): There is no interaction effect between type of fertilizer used and density.
(\(H_1\)): There is an interaction effect between type of fertilizer used and density.
Fertilizer Hypothesis:
(\(H_0\)): There is no significant difference in the means of crop yield using different fertilizer.
(\(H_1\)): There is a significant difference in the means of crop yield using different fertilizer
Density Hypothesis:
(\(H_0\)): There is no significant difference in crop yield among different density.
(\(H_1\)): There is a significant difference in crop yield among different density.
## Df Sum Sq Mean Sq F value Pr(>F)
## fertilizer 2 6.068 3.034 9.001 0.000273 ***
## density 1 5.122 5.122 15.195 0.000186 ***
## fertilizer:density 2 0.428 0.214 0.635 0.532500
## Residuals 90 30.337 0.337
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p_value for the fertilizer is less than alpha(0.05), we can reject the null hypothesis and conclude that there is significant evidence to support the claim that fertilizer has impact on crop yield.
In the same way the p_value of density is also less than \(\alpha\) at 0.05 which implies we reject the null hypothesis and conclude that there is significant evidence to support the claim that density have an impact on crop yield
However since the p_value of the interaction is > \(\alpha\) at 0.05, we fail to reject the null hypothesis and concluded that the combination of fertilizer and density have no impact on crop yield.
When conducting a hypothesis test, several steps are involved to ensure a thorough analysis. First, it’s importance to clearly state the hypotheses being tested, defining both the null and alternative hypotheses. Next is to determine, the critical value, this provide a threshold for decision-making based on the chosen significance level. After the test value is calculated based on the collected data. With this information at hand, a decision is made regarding whether to reject or fail to reject the null hypothesis, considering whether the test value falls within the critical region determined by the critical value. Finally, the results of the hypothesis test are summarized, providing insights into the significance of the findings and their implications for the research question at hand. Each of these steps is crucial for ensuring the validity and reliability of the hypothesis test and its conclusions.
The decision to reject the null hypothesis can be made by either comparing the test value against the critical value or by comparing the p-value with the significance level.
If the test value is less than the critical value, the decision is not to reject the null hypothesis since the test value falls within the FTR region.
However, when using the p-value, the opposite approach is taken. We fail to reject the null hypothesis if the p-value is greater than the significance level
Bluman, A. G. (2008, November 4). Bluman, Elementary Statistics: A Step by Step Approach, © 2009, 7e, Student Edition (Reinforced Binding) with Formula Card. McGraw-Hill Education. http://books.google.ie/books?id=hOVdtAEACAAJ&dq=bluman+elementary+statistics&hl=&cd=3&source=gbs_api
Kabacoff, R. I. (2015). R in Action, Second Edition. O’Reilly Online Learning. Retrieved April 17, 2024, from https://learning.oreilly.com/library/view/r-in-action/9781617291388/kindle_split_002.html
Two-way ANOVA in R. (n.d.). Stats and R. https://statsandr.com/blog/two-way-anova-in-r/
#----------- Baseball -------------------
#getwd("/Users/user/Northeastern/ALY6015/Chi-square/Chi-square")
baseball <- read.csv('baseball.csv')
#---- explore data
dim(baseball)
summary(baseball) # there seem to be a lot of missing values in some of the columns
#missing values
# percentage of missing values per column
get_percent_missing <- function(df){
#this function produce the missing percentage of each variable
missing_percentage <- colMeans(is.na(df)) * 100
missing_percentage <- data.frame(missing_percentage)
missing_percentage <- missing_percentage %>%
arrange(desc(missing_percentage)) %>% round(., 2)
return(missing_percentage)
}
#get the missing percentage per column
get_percent_missing(baseball)
#there seem to be a lot of missing values in 4 of the columns, about
baseball_drop <- baseball %>% select(-c(RankSeason, RankPlayoffs, OOBP, OSLG))
#baseball_drop
# ------ Visualize the distribution of missing variable columns
# select numeric columns
get_numeric_col <- function(df){
# this function extract numeric columns from a dataset
numeric_df <- df[, sapply(df, is.numeric)]
return(numeric_df)
}
bb_numeric <- get_numeric_col(baseball_drop)
# Reshape the dataframe into long format
df_long <- gather(bb_numeric, key = "variable", value = "value")
# Histogram with kernel density
hist_plot <- ggplot(df_long, aes(x = value)) +
geom_histogram(aes(y = ..density..), bins = 20, fill = "skyblue", color = "black") +
geom_density(alpha = 0.5, color = "red") +
facet_wrap(~ variable, scales = "free") +
theme_minimal() +
labs(title = "Histogram with kernel density of Variables with Missing Values",
x = "Value", y = "Density") +
theme(plot.title = element_text(hjust = .5))
# Print the histogram plot
#print(hist_plot)
# table os descriptive statisitcs
descriptive_stat <- describe(get_numeric_col(baseball))
descriptive_stat <- descriptive_stat %>% select(c(n,mean,median, min, max))
descriptive_stat
#----- Exploratory Data Analysis-------
# Team Performance Trends Over Time:
performance <- baseball_drop %>% group_by(Year) %>%
summarise(Avg_RS = mean(RS),
Avg_RA = mean(RA),
Avg_W = mean(W),
Avg_OBP = mean(OBP),
Avg_SLG = mean(SLG),
Avg_BA = mean(BA))
performance
performance_plt <- ggplot(performance, aes(x = Year)) +
geom_line(aes(y = Avg_OBP, color = "OBP"), size = 1) +
geom_line(aes(y = Avg_SLG, color = "SLG"), size = 1) +
geom_line(aes(y = Avg_BA, color = "BA"), size = 1) +
labs(title = "Team Performance Over Time",
x = "Year",
y = "Average Value",
color = "Metric") +
scale_color_manual(values = c("OBP" = "blue", "SLG" = "red", "BA" = "green")) +
theme_minimal() + theme(plot.title = element_text(hjust = 0.5))
performance_plt2 <- ggplot(performance, aes(x = Year)) +
# Divide wins by 10 for better visualization
geom_line(aes(y = Avg_RS, color = 'Runs Scored'), size = 1) +
geom_line(aes(y = Avg_RA, color = 'Runs Allowed'), size = 1) +
geom_line(aes(y = Avg_W, color = 'Wins'), size = 1)+
labs(title = "Team Performance Over Time",
x = "Year",
y = "Average Value") +
scale_color_manual(values = c('Runs Scored'='blue','Runs Allowed'='red',
'Wins' = 'green')) +
theme_minimal()
# filter for when league separate league
league_nl <- baseball_drop %>% filter(League == 'NL') %>%
group_by(Year) %>%
summarise(Avg_RS = mean(RS),
Avg_RA = mean(RA),
Avg_W = mean(W),
Avg_OBP = mean(OBP),
Avg_SLG = mean(SLG),
Avg_BA = mean(BA))
league_nl
# National league
league_al <- baseball_drop %>% filter(League == 'AL') %>%
group_by(Year) %>%
summarise(Avg_RS = mean(RS),
Avg_RA = mean(RA),
Avg_W = mean(W),
Avg_OBP = mean(OBP),
Avg_SLG = mean(SLG),
Avg_BA = mean(BA))
league_al
#--------- Playoff Performance:------------
# Separate data for playoff and non-playoff teams
playoff_teams <- baseball_drop %>% filter(Playoffs == 1)
non_playoff_teams <- baseball_drop %>% filter(Playoffs == 0)
# Aggregate data for playoff and non-playoff teams and calculate average values of desired variables
playoff_performance <- playoff_teams %>%
summarize(
Avg_RS_playoff = mean(RS),
Avg_RA_playoff = mean(RA),
Avg_W_playoff = mean(W),
Avg_OBP_playoff = mean(OBP),
Avg_SLG_playoff = mean(SLG),
Avg_BA_playoff = mean(BA)
)
non_playoff_performance <- non_playoff_teams %>%
summarize(
Avg_RS_non_playoff = mean(RS),
Avg_RA_non_playoff = mean(RA),
Avg_W_non_playoff = mean(W),
Avg_OBP_non_playoff = mean(OBP),
Avg_SLG_non_playoff = mean(SLG),
Avg_BA_non_playoff = mean(BA)
)
#-------- Team with the most win ---------
#----- America League ----
team_wins_AL <- baseball_drop %>% filter(League=='AL') %>% group_by(Year) %>%
summarise(highest_score = max(W),
best_team = Team[which.max(W)])
team_AL <- team_wins_AL %>% group_by(best_team) %>%
summarise(Win_count = n())
# plot
plt_AL <- ggplot(team_AL, aes(x = best_team, y=Win_count, fill = Win_count)) +
geom_bar(stat='identity') +
labs(title = 'Team with the most win in America League',
x = 'Teams',
y = 'Number of Wins'
) +
scale_fill_gradient(low = "tan", high = "tomato") +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 90, hjust = 1),
legend.position = 'None')
# ------- National League -----
team_wins_NL <- baseball_drop %>% filter(League=='NL') %>% group_by(Year) %>%
summarise(highest_score = max(W),
best_team = Team[which.max(W)])
team_NL <- team_wins_NL %>% group_by(best_team) %>%
summarise(Win_count = n())
# plot
plt_NL <- ggplot(team_NL, aes(x = best_team, y=Win_count, fill =Win_count )) +
geom_bar(stat='identity') +
labs(title = 'National League',
x = 'Teams',
y = 'Number of Wins'
) +
scale_fill_gradient(low = "tan", high = "tomato") +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 90, hjust = 1),
legend.position = 'None')
#------ Chi-square --------
# Extract decade from year
baseball_drop$Decade <- baseball_drop$Year - (baseball_drop$Year %% 10)
# Create a wins table by summing the wins by decade
wins <- baseball_drop %>% group_by(Decade) %>%
summarize(wins = sum(W))
wins$expected <- sum(wins$wins)/6
wins$prob <- wins$expected/sum(wins$wins)
wins
# Hypothesis
# critical value
# first find the degree of freedom value
df_10 <- (6-1)*(2-1)
cv_10 <- qchisq(p = 0.05,df = df_10, lower.tail = FALSE)
cv_10
# test statistic
result_10 <- chisq.test(x = wins$wins, p=wins$prob)
result_10
#------ Crop dataset -------
crop <- read.csv('crop_data.csv')
#convert to factors
crop$density <- as.factor(crop$density)
crop$block <- as.factor(crop$block)
crop$fertilizer <- as.factor(crop$fertilizer)
# two way anova
result_11 <- aov(yield ~fertilizer*density, data=crop)
summary(result_11)