M5 Project Report
ALY6015_71821: Intermediate Analytics
SEC_09_Fall_2023_CPS
Northeastern University
Professor: Vladimir Shapiro

By: Zeeshan Ahmad Ansari

Date of Submission: 11 December 2023


Introduction

This report aims to demonstrate applications of nonparametric statistical methods, sampling techniques, and simulation on various real-world problems. The analysis leverages these concepts to draw data-driven insights on game attendance, lottery sales, prison sentencing, mathematical literacy, transportation usage, and prize contests.

Appropriate nonparametric tests have been selected after assessing data and distributional assumptions. These include binomial tests, Wilcoxon rank sum tests, runs tests, Kruskal-Wallis tests, Spearman rank correlations, and simulations. The analysis follows the standard hypothesis testing framework at a 5% significance level.

The goal is to underscore how statistics aids decision making under uncertainty. Instead of relying on intuition or anecdotal evidence, statistical testing evaluates claims and hypotheses objectively against empirical data samples. The concepts and tests applied can inform decisions across business, policymaking, social sectors and more.


Analysis


Q 13-2.6: Game Attendance An athletic director suggests the median number for the paid attendance at 20 local football games is 3000. The data for a random sample are shown. At α = 0.05, is there enough evidence to reject the claim? If you were printing the programs for the games, would you use this figure as a guide?


6210 3150 2700 3012 4875
3540 6127 2581 2642 2573
2792 2800 2500 3700 6030
5437 2758 3490 2851 2720

Hypotheses:

H0: Median attendance = 3000

H1: Median attendance != 3000

# Given data
attendance <- c(6210, 3150, 2700, 3012, 4875, 
                3540, 6127, 2581, 2642, 2573, 
                2792, 2800, 2500, 3700, 6030, 
                5437, 2758, 3490, 2851, 2720)

# Hypothesized median
hypothesized_median_1 <- 3000
# Significance level
alpha_1 <- 0.05

# Calculate the differences between observations and the hypothesized median
differences <- attendance - hypothesized_median_1

# Count the number of positive and negative differences
positive_count <- sum(differences > 0)
negative_count <- sum(differences < 0)

# Calculate the test statistic (use the smaller of the two counts)
test_statistic <- min(positive_count, negative_count)


# Total number of observations
n <- length(attendance)
# Calculate the critical value for a two-tailed test
critical_value_1 <- qbinom(alpha_1/2, size = n, prob = 0.5, lower.tail = TRUE)

# Calculate the two-tailed p-value using the binomial distribution.
# Doubling the tail probability can exceed 1, so the result is capped at 1.
p_value_1 <- min(1, 2 * pbinom(test_statistic, size = n, prob = 0.5))

# Make the decision
if (p_value_1 < alpha_1) {
  decision_1 <- "Reject the null hypothesis"
} else {
  decision_1 <- "Fail to reject the null hypothesis"
}

# Summarize the results
cat("Test Statistic:", test_statistic, "\n",
    "P-Value:", p_value_1, "\n",
    "Decision:", decision_1, "\n")
## Test Statistic: 10 
##  P-Value: 1 
##  Decision: Fail to reject the null hypothesis

The sample contains 10 observations above 3000 and 10 below it, so the sign test statistic (the smaller of the two counts) is 10. The two-tailed p-value from the binomial distribution is 1, which is greater than α = 0.05, so we fail to reject the null hypothesis.

There is insufficient evidence to reject the claim that the median paid attendance is 3000. The data are consistent with the athletic director’s suggestion, so using 3000 as a guide when printing programs appears reasonable, although attendance at several games in the sample exceeded 6000.
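As a cross-check, the same sign test can be run with R's built-in binom.test(). The following is a minimal sketch, not part of the original analysis; the variable names are illustrative, and it assumes that observations equal to the hypothesized median (none in this sample) are dropped before counting.

```r
# Sign test for the median via the exact binomial test (cross-check sketch)
attendance <- c(6210, 3150, 2700, 3012, 4875,
                3540, 6127, 2581, 2642, 2573,
                2792, 2800, 2500, 3700, 6030,
                5437, 2758, 3490, 2851, 2720)
hypothesized_median <- 3000

# Count observations above the hypothesized median, ignoring exact ties
n_above <- sum(attendance > hypothesized_median)
n_used  <- sum(attendance != hypothesized_median)

# Under H0 each observation lies above the median with probability 0.5
sign_test   <- binom.test(n_above, n_used, p = 0.5, alternative = "two.sided")
sign_test_p <- sign_test$p.value

cat("Above:", n_above, "of", n_used, "| p-value:", sign_test_p, "\n")
```

Because binom.test() computes the two-sided p-value itself, no manual doubling or capping is needed.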


Q13-2.10: Lottery Ticket Sales A lottery outlet owner hypothesizes that she sells 200 lottery tickets a day. She randomly sampled 40 days and found that on 15 days she sold fewer than 200 tickets. At α = 0.05, is there sufficient evidence to conclude that the median is below 200 tickets?

Hypotheses:

H0: Median daily ticket sales = 200

H1: Median daily ticket sales < 200

# Given data
days_sampled <- 40
days_sold_below_200 <- 15
hypothesized_median_2 <- 0.5  # Under H0, a day falls below the median with probability 0.5

# Perform the binomial test
binomial_test <- binom.test(days_sold_below_200, days_sampled, p = hypothesized_median_2, alternative = "less")

# Significance level
alpha_2 <- 0.05

# Make the decision
if (binomial_test$p.value < alpha_2) {
  decision_2 <- "Reject the null hypothesis"
} else {
  decision_2 <- "Fail to reject the null hypothesis"
}

# Summarize the results
cat("Binomial Test Results:\n")
## Binomial Test Results:
print(binomial_test)
## 
##  Exact binomial test
## 
## data:  days_sold_below_200 and days_sampled
## number of successes = 15, number of trials = 40, p-value = 0.07693
## alternative hypothesis: true probability of success is less than 0.5
## 95 percent confidence interval:
##  0.0000000 0.5172483
## sample estimates:
## probability of success 
##                  0.375
cat("Decision:", decision_2, "\n")
## Decision: Fail to reject the null hypothesis

The p-value of 0.077 is greater than the significance level of 0.05, so we fail to reject the null hypothesis. There is insufficient evidence to conclude that the median daily lottery ticket sales are below 200 tickets.

Note also that only 15 of the 40 sampled days (37.5%) had sales below 200, which is fewer than half, so the sample itself does not point toward a median below 200. The owner's hypothesis that she sells 200 lottery tickets a day is not contradicted by the sampled data.
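For samples this large, the sign test is often carried out with a normal approximation instead of the exact binomial distribution. The sketch below is an illustrative alternative, not the method used above; it assumes that no sampled day had exactly 200 sales, and the variable names are hypothetical.

```r
# Large-sample sign test using the normal approximation (sketch)
n_days  <- 40
n_below <- 15   # days with fewer than 200 tickets sold

# Smaller of the two sign counts; assumes no day had exactly 200 sales
x_small <- min(n_below, n_days - n_below)

# z statistic with a continuity correction of 0.5
z_sign <- (x_small + 0.5 - n_days / 2) / (sqrt(n_days) / 2)

# Left-tailed critical value at alpha = 0.05
z_crit <- qnorm(0.05)

cat("z =", round(z_sign, 3), "| critical value =", round(z_crit, 3), "\n")
cat(if (z_sign < z_crit) "Reject H0\n" else "Fail to reject H0\n")
```

The z statistic does not fall below the critical value, which agrees with the decision from the exact binomial test.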

Section 13-3
Use the Wilcoxon rank sum test.

Q 13-3.4: A random sample of men and women in prison was asked to give the length of sentence each received for a certain type of crime. At α = 0.05, test the claim that there is no difference in the sentence received by each gender. The data (in months) are shown here.

MALE    8 12 6 14 22 27 32 24 26 19 15 13
FEMALE  7 5 2 3 21 26 30 9 4 17 23 12 11 16

Hypotheses:

H0: There is no difference in sentence lengths between men and women.

H1: There is a difference in sentence lengths between men and women.

males <- c(8, 12, 6, 14, 22, 27, 32, 24, 26, 19, 15, 13)
females <- c(7, 5, 2, 3, 21, 26, 30, 9, 4, 17, 23, 12, 11, 16)

# Perform Wilcoxon rank sum test
wilcox_test_3 <- wilcox.test(males, females, alternative = "two.sided")

# Significance level
alpha_3 <- 0.05

# Find the critical value
critical_value_3 <- qnorm(1 - alpha_3/2)

# Print the test results
cat("Wilcoxon Rank Sum Test Results:\n")
## Wilcoxon Rank Sum Test Results:
print(wilcox_test_3)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  males and females
## W = 113, p-value = 0.1425
## alternative hypothesis: true location shift is not equal to 0
# Decision
if (wilcox_test_3$p.value < alpha_3) {
  decision_3 <- "Reject the null hypothesis"
} else {
  decision_3 <- "Fail to reject the null hypothesis"
}

# Summarize the results
cat("Decision:", decision_3, "\n",
    "Critical Value:", critical_value_3, "\n")
## Decision: Fail to reject the null hypothesis 
##  Critical Value: 1.959964

The p-value of 0.1425 is greater than the significance level of 0.05, so we fail to reject the null hypothesis. There is insufficient evidence to conclude that sentence lengths differ between men and women in the sampled data.
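The z critical value computed above is meant to be compared with a z test value. The sketch below works through the textbook-style normal approximation by hand: the rank sum of the smaller sample, its mean and standard deviation under H0, and the resulting z statistic. It is an illustration only; it omits the tie and continuity corrections that wilcox.test() applies, so the result will not match that p-value exactly, and the variable names are hypothetical.

```r
# Wilcoxon rank sum test by hand using the normal approximation (sketch)
males   <- c(8, 12, 6, 14, 22, 27, 32, 24, 26, 19, 15, 13)
females <- c(7, 5, 2, 3, 21, 26, 30, 9, 4, 17, 23, 12, 11, 16)
n1 <- length(males)     # smaller sample
n2 <- length(females)

# Rank the pooled data (ties receive average ranks), then sum the ranks
# belonging to the smaller sample
pooled_ranks <- rank(c(males, females))
rank_sum <- sum(pooled_ranks[seq_len(n1)])

# Mean and standard deviation of the rank sum under H0 (no tie correction)
mu_R    <- n1 * (n1 + n2 + 1) / 2
sigma_R <- sqrt(n1 * n2 * (n1 + n2 + 1) / 12)

z_ranksum <- (rank_sum - mu_R) / sigma_R
z_crit    <- qnorm(1 - 0.05 / 2)

cat("Rank sum:", rank_sum, "| z =", round(z_ranksum, 3),
    "| critical value = +/-", round(z_crit, 3), "\n")
```

Since the absolute z value is below the critical value, this hand calculation leads to the same decision as wilcox.test().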


Q 13-3.8: For the years 1970–1993 the National League (NL) and the American League (AL) (major league baseball) were each divided into two divisions: East and West. Below are random samples of the number of games won by each league’s Eastern Division. At α = 0.05, is there sufficient evidence to conclude a difference in the number of wins?

NL 89 96 88 101 90 91 92 96 108 100 95
AL 108 86 91 97 100 102 95 104 95 89 88 101

To test if there is a significant difference in the number of wins between the National League (NL) and the American League (AL) Eastern Divisions, we can use the Wilcoxon rank sum test. The null hypothesis is that there is no difference in the distribution of wins between the NL and AL, while the alternative hypothesis is that there is a difference.

Hypotheses:

H0: There is no difference in the number of wins between the NL and AL Eastern Divisions.

H1: There is a difference in the number of wins between the NL and AL Eastern Divisions.

NL_wins <- c(89, 96, 88, 101, 90, 91, 92, 96, 108, 100, 95)
AL_wins <- c(108, 86, 91, 97, 100, 102, 95, 104, 95, 89, 88, 101)

# Perform Wilcoxon rank sum test with continuity correction
wilcox_test_4 <- wilcox.test(NL_wins, AL_wins, exact = FALSE, correct = TRUE)

# Significance level
alpha_4 <- 0.05

# Find the critical value
critical_value_4 <- qnorm(1 - alpha_4/2)

# Print the test results
cat("Wilcoxon Rank Sum Test Results:\n")
## Wilcoxon Rank Sum Test Results:
print(wilcox_test_4)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  NL_wins and AL_wins
## W = 59, p-value = 0.6883
## alternative hypothesis: true location shift is not equal to 0
# Decision
if (wilcox_test_4$p.value < alpha_4) {
  decision_4 <- "Reject the null hypothesis"
} else {
  decision_4 <- "Fail to reject the null hypothesis"
}

# Summarize the results
cat("Decision:", decision_4, "\n",
    "Critical Value:", critical_value_4, "\n")
## Decision: Fail to reject the null hypothesis 
##  Critical Value: 1.959964

The p-value of 0.6883 is substantially greater than the significance level of 0.05, so we fail to reject the null hypothesis. There is insufficient evidence to conclude that the number of wins differs between the National League and American League Eastern Divisions based on the provided data.

Section 13-4

Q 13-4.A: Use Table K (or relevant software) to determine whether the null hypothesis should be rejected.

• ws = 13, n = 15, α = 0.01, two-tailed

# Given values
ws1 <- 13
n1 <- 15
alpha_5.1 <- 0.01

# Critical value from Table K (Wilcoxon signed-rank test)
# for n = 15, α = 0.01, two-tailed
critical_value_5.1 <- 16

# Make the decision: reject H0 when ws is less than or equal to the critical value
if (ws1 <= critical_value_5.1) {
  decision_5.1 <- "Reject the null hypothesis"
} else {
  decision_5.1 <- "Fail to reject the null hypothesis"
}

# Print results
cat("Decision:", decision_5.1, "\n")
## Decision: Reject the null hypothesis

For the Wilcoxon signed-rank test with ws = 13, n = 15, and α = 0.01 (two-tailed), Table K gives a critical value of 16. The null hypothesis is rejected when the test statistic is less than or equal to the critical value. Since ws = 13 is less than 16, we reject the null hypothesis.

This implies that, at the 0.01 significance level, there is enough evidence to conclude that there is a significant difference in the population medians represented by the paired data.

• ws = 32, n = 28, α = 0.025, one-tailed

# Given values
ws2 <- 32
n2 <- 28
alpha_5.2 <- 0.025

# Critical value from Table K (Wilcoxon signed-rank test)
# for n = 28, α = 0.025, one-tailed
critical_value_5.2 <- 117

# Make the decision: reject H0 when ws is less than or equal to the critical value
if (ws2 <= critical_value_5.2) {
  decision_5.2 <- "Reject the null hypothesis"
} else {
  decision_5.2 <- "Fail to reject the null hypothesis"
}

# Print results
cat("Decision:", decision_5.2, "\n")
## Decision: Reject the null hypothesis

For the Wilcoxon signed-rank test with ws = 32, n = 28, and α = 0.025 (one-tailed), Table K gives a critical value of 117. Since ws = 32 is less than 117, we reject the null hypothesis.

This implies that, at the 0.025 significance level (one-tailed), there is enough evidence to conclude that there is a significant difference in the population medians represented by the paired data in the direction specified by the alternative hypothesis.

• ws = 65, n = 20, α = 0.05, one-tailed

# Given values
ws3 <- 65
n3 <- 20
alpha_5.3 <- 0.05

# Critical value from Table K (Wilcoxon signed-rank test)
# for n = 20, α = 0.05, one-tailed
critical_value_5.3 <- 60

# Make the decision: reject H0 when ws is less than or equal to the critical value
if (ws3 <= critical_value_5.3) {
  decision_5.3 <- "Reject the null hypothesis"
} else {
  decision_5.3 <- "Fail to reject the null hypothesis"
}

# Print results
cat("Decision:", decision_5.3, "\n")
## Decision: Fail to reject the null hypothesis

For the Wilcoxon signed-rank test with ws = 65, n = 20, and α = 0.05 (one-tailed), Table K gives a critical value of 60. Since ws = 65 is greater than 60, we fail to reject the null hypothesis.

This implies that, at the 0.05 significance level (one-tailed), there is not enough evidence to conclude that there is a significant difference in the population medians represented by the paired data in the direction specified by the alternative hypothesis.

• ws = 22, n = 14, α = 0.10, two-tailed

# Given values
ws4 <- 22
n4 <- 14
alpha_5.4 <- 0.10

# Critical value from Table K (Wilcoxon signed-rank test)
# for n = 14, α = 0.10, two-tailed
critical_value_5.4 <- 26

# Make the decision: reject H0 when ws is less than or equal to the critical value
if (ws4 <= critical_value_5.4) {
  decision_5.4 <- "Reject the null hypothesis"
} else {
  decision_5.4 <- "Fail to reject the null hypothesis"
}

# Print results
cat("Decision:", decision_5.4, "\n")
## Decision: Reject the null hypothesis

For the Wilcoxon signed-rank test with ws = 22, n = 14, and α = 0.10 (two-tailed), Table K gives a critical value of 26. Since ws = 22 is less than 26, we reject the null hypothesis.

This implies that, at the 0.10 significance level (two-tailed), there is enough evidence to conclude a significant difference in the population medians represented by the paired data.
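The question allows relevant software in place of Table K. One software route, sketched below as an assumption rather than the method prescribed by the textbook, is to compute the exact lower-tail probability of the signed-rank statistic with R's psignrank() and compare it with α. The data frame and column names are illustrative.

```r
# Exact p-values for the four Wilcoxon signed-rank cases (sketch)
cases <- data.frame(
  ws    = c(13, 32, 65, 22),
  n     = c(15, 28, 20, 14),
  alpha = c(0.01, 0.025, 0.05, 0.10),
  tails = c(2, 1, 1, 2)
)

# psignrank(q, n) gives P(W <= q) under H0; double it for a two-tailed test
cases$p_value  <- pmin(1, cases$tails * psignrank(cases$ws, cases$n))
cases$decision <- ifelse(cases$p_value <= cases$alpha,
                         "Reject H0", "Fail to reject H0")

print(cases)
```

Printed critical-value tables can round a boundary value differently from the exact distribution, so decisions for a statistic sitting right on a critical value should be checked with care. None of the four cases here is on a boundary.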

Section 13-5

Q 13-5.2: Through the Organization for Economic Cooperation and Development (OECD), 15-year-olds are tested in member countries in mathematics, reading, and science literacy. Listed are randomly selected total mathematics literacy scores (i.e., both genders) for selected countries in different parts of the world. Test, using the Kruskal-Wallis test, to see if there is a difference in means at α = 0.05

WESTERN HEMISPHERE   EUROPE   EASTERN ASIA
527                  520      523
406                  510      547
474                  513      547
381                  548      391
411                  496      549


Hypotheses:

H0 : There is no difference in the medians of mathematics literacy scores between the Western Hemisphere, Europe, and Eastern Asia.

H1 : There is a significant difference in the medians.

The critical value is obtained from the chi-square distribution with degrees of freedom equal to the number of groups minus 1.

# Data for mathematics literacy scores
western_hemisphere <- c(527, 406, 474, 381, 411)
europe <- c(520, 510, 513, 548, 496)
eastern_asia <- c(523, 547, 547, 391, 549) 

# Combine the data into a list
data_list <- list(Western_Hemisphere = western_hemisphere, Europe = europe, Eastern_Asia = eastern_asia)

alpha_6 <- 0.05

# Kruskal-Wallis test
kruskal_result <- kruskal.test(data_list)

# Print the test result
print(kruskal_result)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  data_list
## Kruskal-Wallis chi-squared = 4.1674, df = 2, p-value = 0.1245
# Critical value for alpha = 0.05 and three groups
critical_value <- qchisq(0.95, df = length(data_list) - 1)
print(critical_value)
## [1] 5.991465
# Check the p-value
p_value_6 <- kruskal_result$p.value

# Compare p-value with the significance level

if (p_value_6 < alpha_6) {
  cat("Reject the null hypothesis. There is a significant difference in medians.\n")
} else {
  cat("Fail to reject the null hypothesis. There is no significant difference in medians.\n")
}
## Fail to reject the null hypothesis. There is no significant difference in medians.

The test statistic H = 4.167 is below the critical value of 5.991, and the p-value of 0.1245 is greater than the significance level of 0.05. Therefore, we do not have sufficient evidence to reject the null hypothesis. The Kruskal-Wallis test does not show a significant difference in mathematics literacy scores between the Western Hemisphere, Europe, and Eastern Asia based on the provided data.
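To show where the test statistic comes from, the sketch below recomputes H by hand from the pooled ranks and then applies the standard correction for tied values (547 appears twice). It is an illustrative calculation with hypothetical variable names; kruskal.test() remains the reference result.

```r
# Kruskal-Wallis H statistic computed by hand (sketch)
western_hemisphere <- c(527, 406, 474, 381, 411)
europe             <- c(520, 510, 513, 548, 496)
eastern_asia       <- c(523, 547, 547, 391, 549)

scores <- c(western_hemisphere, europe, eastern_asia)
groups <- factor(rep(c("Western_Hemisphere", "Europe", "Eastern_Asia"), each = 5))
N <- length(scores)

# Rank the pooled scores (ties receive average ranks) and total them per group
pooled_ranks <- rank(scores)
rank_sums    <- tapply(pooled_ranks, groups, sum)
group_sizes  <- tapply(pooled_ranks, groups, length)

# Uncorrected H statistic
H <- 12 / (N * (N + 1)) * sum(rank_sums^2 / group_sizes) - 3 * (N + 1)

# Tie correction: 1 - sum(t^3 - t) / (N^3 - N), where t is the size of each tie group
tie_counts     <- table(scores)
tie_correction <- 1 - sum(tie_counts^3 - tie_counts) / (N^3 - N)
H_corrected    <- H / tie_correction

cat("H:", H, "| tie-corrected H:", H_corrected, "\n")
```

The tie-corrected value agrees with the chi-squared statistic reported by kruskal.test().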

Section 13-6
Perform these steps.
a. Find the Spearman rank correlation coefficient.
b. State the hypotheses.
c. Find the critical value. Use α = 0.05.
d. Make the decision.
e. Summarize the results

Q 13-6.6: Six cities are randomly selected, and the number of daily passenger trips (in thousands) for subways and commuter rail service is obtained. At α = 0.05, is there a relationship between the variables? Suggest one reason why the transportation authority might use the results of this study.

City 1 2 3 4 5 6
Subway 845 494 425 313 108 41
Rail 39 291 142 103 33 38

Hypotheses:

H0: There is no correlation between subway and rail passengers.

H1: There is a significant correlation between subway and rail passengers.

# Daily passenger trips (in thousands) for the six sampled cities
subway <- c(845, 494, 425, 313, 108, 41)
rail <- c(39, 291, 142, 103, 33, 38)

# Calculate Spearman rank correlation coefficient
correlation <- cor.test(subway, rail, method = "spearman")
spearman_coefficient <- correlation$estimate
cat("Spearman Rank Correlation Coefficient:", spearman_coefficient, "\n")
## Spearman Rank Correlation Coefficient: 0.6
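# Degrees of freedom for the t approximation to the Spearman test
df <- length(subway) - 2

# Critical t value (two-tailed, α = 0.05). It applies to the t-transformed
# statistic, not to the coefficient itself; the Spearman critical value
# table (Table L) gives ±0.886 for n = 6.
critical_value_7 <- qt(1 - 0.05/2, df)
cat("Critical Value:", critical_value_7, "\n")
## Critical Value: 2.776445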
# Calculate p-value
p_value_7 <- correlation$p.value


if (p_value_7 <= 0.05) {
  cat("Reject the null hypothesis. There is a significant correlation.\n")
} else {
  cat("Fail to reject the null hypothesis. There is no significant correlation.\n")
}
## Fail to reject the null hypothesis. There is no significant correlation.

The Spearman rank correlation coefficient of 0.6 is below the tabled critical value of 0.886 for n = 6, and the p-value of 0.2417 is greater than the significance level of 0.05. Therefore, we do not have sufficient evidence to reject the null hypothesis: the data do not show a significant correlation between daily passenger trips for subways and commuter rail service in the selected cities.

The transportation authority might use the results of this study to judge whether demand for one service tracks demand for the other. Based on the provided data, there is no clear monotonic relationship between subway and rail ridership, which suggests that the two services should be planned and resourced separately. With only six cities, however, the test has little power to detect a moderate relationship.

In summary, according to the Spearman rank correlation test, there is no statistically significant correlation between subway and rail passengers in the given cities.
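The coefficient can also be reproduced from the rank differences, which makes the calculation behind cor.test() visible. The sketch below uses the shortcut formula rs = 1 − 6Σd² / (n(n² − 1)), which is valid only when there are no tied values (true for this sample), and then converts rs to a t statistic for comparison with a t critical value. The variable names are illustrative.

```r
# Spearman rank correlation from rank differences (sketch, assumes no ties)
subway <- c(845, 494, 425, 313, 108, 41)
rail   <- c(39, 291, 142, 103, 33, 38)
n_cities <- length(subway)

# Difference between the two rankings for each city
d <- rank(subway) - rank(rail)

# Shortcut formula for the coefficient
rs <- 1 - 6 * sum(d^2) / (n_cities * (n_cities^2 - 1))

# t transformation of rs with n - 2 degrees of freedom
t_stat <- rs * sqrt((n_cities - 2) / (1 - rs^2))
t_crit <- qt(1 - 0.05 / 2, df = n_cities - 2)

cat("rs =", rs, "| t =", t_stat, "| critical t = +/-", round(t_crit, 3), "\n")
```

The t approximation is rough for a sample of six, so the exact p-value from cor.test() or the tabled critical value is the safer basis for the decision.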

Section 14-3
Use random numbers to simulate the experiments. The number in parentheses is the number of times the experiment should be repeated.

Q 14-3.16: A caramel corn company gives four different prizes, one in each box. They are placed in the boxes at random. Find the average number of boxes a person needs to buy to get all four prizes. (40)

This scenario is an instance of the coupon collector’s problem: we are interested in the average number of boxes a person needs to buy to get all four prizes, assuming each box is equally likely to contain any one of the four. Each simulated trial keeps buying boxes until every prize has appeared at least once.

# Simulate the experiment
set.seed(123)  # Set seed for reproducibility
simulations <- replicate(40, {
  boxes <- c(0, 0, 0, 0)  # 0 represents not obtained, 1 represents obtained
  count <- 0
  
  while (sum(boxes) < 4) {
    count <- count + 1
    prize <- sample(1:4, 1)  # simulate getting a random prize
    boxes[prize] <- 1  # mark the obtained prize
  }
  
  count  # return the count for each simulation
})

# Calculate average and other statistics
average_count <- mean(simulations)
standard_deviation <- sd(simulations)

# Print results
cat("Average number of boxes:", average_count, "\n",
    "Standard deviation:", standard_deviation, "\n")
## Average number of boxes: 7.975 
##  Standard deviation: 2.913166

Based on the 40 simulated trials, a person needs to buy about 8 boxes on average (7.975) to obtain all four prizes. The standard deviation of about 2.9 boxes shows that the number required varies considerably from one buyer to the next.

This information helps the caramel corn company understand expected customer behavior and plan inventory or marketing strategies accordingly. With only 40 repetitions, the estimate is itself subject to sampling error, so it should be read as an approximation.
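The simulated average can be compared with the theoretical expectation for the equal-probability coupon collector's problem, n(1 + 1/2 + ... + 1/n). The short sketch below evaluates that formula for n = 4 prizes; it is added as a sanity check and is not part of the assigned simulation.

```r
# Theoretical expected number of boxes for n equally likely prizes (sketch)
n_prizes <- 4

# E[boxes] = n * (1 + 1/2 + ... + 1/n)
expected_boxes <- n_prizes * sum(1 / seq_len(n_prizes))

cat("Expected number of boxes:", round(expected_boxes, 3), "\n")
```

The theoretical value is 25/3, or about 8.33 boxes, so the simulated mean of 7.975 from 40 trials is reasonably close.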

Q 14-3.18: To win a certain lotto, a person must spell the word big. Sixty percent of the tickets contain the letter b, 30% contain the letter i, and 10% contain the letter g. Find the average number of tickets a person must buy to win the prize. (30)

This scenario is a variation of the coupon collector’s problem with unequal probabilities: instead of collecting equally likely prizes, we collect the letters needed to spell the word “big” from lottery tickets. The simulation below assumes that each ticket carries exactly one letter.

# Simulate the experiment
set.seed(123)  # Set seed for reproducibility
simulations <- replicate(30, {
  letters_needed <- c("b", "i", "g")
  letters_collected <- character(0)
  count <- 0
  
  while (!all(letters_needed %in% letters_collected)) {
    count <- count + 1
    letter <- sample(c("b", "i", "g"), 1, prob = c(0.6, 0.3, 0.1))
    letters_collected <- unique(c(letters_collected, letter))
  }
  
  count  # return the count for each simulation
})

# Calculate average and other statistics
average_count <- mean(simulations)
standard_deviation <- sd(simulations)

# Print results
cat("Average number of tickets:", average_count, "\n",
    "Standard deviation:", standard_deviation, "\n")
## Average number of tickets: 11.66667 
##  Standard deviation: 12.58169

Based on the 30 simulated trials, a person needs to buy about 12 lottery tickets on average (11.67) to collect all the letters needed to spell big. The standard deviation of about 12.6 tickets is larger than the mean, which reflects how long some players must wait for the rare letter g.

This estimate gives players a sense of how many tickets winning typically requires, and it helps the organizers anticipate player behavior and plan prize distributions accordingly. Because only 30 trials were run and the results vary widely, the estimate is imprecise.
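For comparison, the expected number of tickets can be worked out exactly by inclusion-exclusion over the subsets of letters: each non-empty subset S contributes (−1)^(|S|+1) divided by the total probability of the letters in S. The sketch below assumes, as the simulation does, that every ticket carries exactly one letter; the variable names are illustrative.

```r
# Expected tickets needed to collect b, i and g by inclusion-exclusion (sketch)
letter_prob <- c(b = 0.6, i = 0.3, g = 0.1)

expected_tickets <- 0
for (k in seq_along(letter_prob)) {
  # All subsets of the letters with exactly k members
  subsets <- combn(letter_prob, k, simplify = FALSE)
  for (s in subsets) {
    # Each subset contributes +/- 1 / (total probability of its letters)
    expected_tickets <- expected_tickets + (-1)^(k + 1) / sum(s)
  }
}

cat("Expected number of tickets:", round(expected_tickets, 2), "\n")
```

The exact expectation is about 10.96 tickets, so the simulated mean of 11.67 is in the right range given the small number of trials.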


Conclusion

In summary, this report demonstrated applications of nonparametric methods, sampling, and simulations to gain data-backed insights on real-world problems. The analysis provided evidence regarding game attendance figures, lottery sales claims, equity in prison sentencing, comparisons of countries, transportation linkage, and prize contests.

By leveraging statistical testing, the report revealed valuable findings instead of relying solely on intuition or anecdotal evidence. Hypotheses and claims were evaluated objectively by appropriate statistical tests based on data distribution assumptions. Concepts like sampling, simulations and nonparametric methods add valuable tools to the decision-making toolkit.

Overall, the report emphasized how statistical analysis supports making reliably informed decisions under uncertainty across domains. The tests and concepts can be extended to new problems for data-driven decision making in business, policy, social sectors and more based on evidence rather than opinions or anecdotes alone.


References

  1. Bluman, A. (2015). Elementary statistics: A step by step approach. McGraw-Hill Education.

  2. Kabacoff, R. (2015). R in action: Data analysis and graphics with R (2nd ed.). Manning Publications.


Appendix
This report is accompanied by an R Markdown file named ALY6015_ZeeshanAhmadAnsari_WEEK_5_FALL_B_2023.Rmd.