Assignment 1: Randomization tests - golf balls

What is the distribution of these numbers? Are the numbers 1, 2, 3, and 4 equally likely? What test statistic should we use? Chosen: sd, max, min, chisq. How should we conduct the test using simulations?

Including the following:
A 3-5 page write-up, in the form of a knitted file (ideally a pdf or .md file)

What to turn in: Submit both an Rmarkdown file and a ‘knitted’ file (i.e. a file that also shows the output).

                                                                    Introduction 
  Monte Carlo simulation uses a large number of random draws to approximate a quantity of interest empirically. More specifically, it estimates the probability of an event A, P(A), as the proportion of simulations in which A occurs (the “empirical estimate” of the probability). Likewise, it estimates the expected value of a random variable by the average value it takes over a number of simulations. In applied settings, Monte Carlo methods are used to estimate the likelihood of schedule and cost overruns, to quantify the risk posed by an unknown variable or event, and to show how much an outcome varies, which gives decision makers objective data. Because the same idea applies to any process that can be simulated, the technique is used in many economic sectors to study large-scale problems by replaying an intervention several thousand times. In this project, we used Monte Carlo simulation to generate 10,000 random datasets of 486 golf balls each.
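
As a minimal sketch of these two ideas (not part of the assignment code), the following R snippet estimates a probability and an expected value by simulation for a single draw from the numbers 1 to 4. The names `n_sims`, `draws`, `p_hat`, and `ev_hat`, the seed, and the event "the draw equals 1" are illustrative assumptions.

```r
# Illustrative sketch: Monte Carlo estimates of a probability and an expected value.
set.seed(1)                                    # arbitrary seed, for reproducibility only
n_sims <- 10000                                # number of simulated draws (assumed)
draws  <- sample(1:4, n_sims, replace = TRUE)  # each number equally likely

p_hat  <- mean(draws == 1)  # proportion of simulations in which the event occurs
ev_hat <- mean(draws)       # average value over the simulations

p_hat   # should be close to the true probability 1/4
ev_hat  # should be close to the true expected value 2.5
```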
  
  In this problem, we were given the numbers on 500 golf balls collected from a yard along a golf course. Of these, 486 were numbered 1 through 4; the other 14 carried a different number or no number at all and were excluded from the analysis. We then conducted a Monte Carlo simulation to assess whether the numbers on the remaining 486 golf balls were uniformly distributed over 1, 2, 3, and 4. For each simulated dataset we computed several test statistics (standard deviation, max, min, and the chi-square statistic), and we used the resulting p-values to either reject or fail to reject the null hypothesis. By examining the p-values and the four histograms of the test statistics, we could then judge whether the data are consistent with the null hypothesis.
  
  For this test, the null hypothesis is that the golf ball numbers are uniformly distributed over 1 to 4, that is, each of the four numbers is equally likely. The alternative hypothesis is that the numbers are not uniformly distributed over 1 to 4. A hypothesis test lets us assess how credible the null hypothesis is, given the sample of 486 golf balls numbered 1-4. To carry it out, we used four test statistics: standard deviation, chi-square statistic, max, and min.
  

                                                                    Methods (1-2 page)
                                          

Describe specific method you are using (Monte Carlo Simulation)

For our Monte Carlo simulation, we used four test statistics, each computed on the table of counts for the four ball numbers: standard deviation, the chi-square statistic, max, and min. We set the number of golf balls to 486 (NumberofSamples is also set to 1000 in the setup, but it is not used in the simulation itself). We then repeated the simulation 10,000 times, as controlled by the variable nsim. In each repetition we generated a simulated dataset of the same size as the observed one, with 486 golf balls whose numbers are drawn uniformly from 1-4, and computed the four test statistics. To show the resulting null distributions, we created an orange histogram for each of the four test statistics.

Calculate a Test Statistic (Show output) We use the standard deviation of the four counts (a measure of spread) and the chi-square statistic (which compares observed with expected counts). The max and min counts are also informative, since they show how far the largest and smallest cells depart from the expected count.
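
The four statistics can be computed for the observed data as in the sketch below. The counts come from the assignment statement; the names `observed`, `expected`, and `obs_stat` are introduced here for illustration, and the chi-square statistic is written out by hand so that no particular helper function is assumed.

```r
# Observed counts of balls numbered 1, 2, 3, 4 (from the assignment statement)
observed <- c(137, 138, 107, 104)
expected <- sum(observed) / 4   # 486 / 4 = 121.5 under the null hypothesis

obs_stat <- c(
  sd    = sd(observed),                            # spread of the four counts
  max   = max(observed),                           # largest count
  min   = min(observed),                           # smallest count
  chisq = sum((observed - expected)^2 / expected)  # observed vs expected
)
print(round(obs_stat, 3))
```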

Determine the “p-value” - how do you do this via simulation? 

Interpret the p-value you get

What makes a good test statistic? Compare several.

  Whether a test statistic yields statistically significant results depends on the hypothesis being tested and on how the data depart from it. In our Monte Carlo simulation scenario, standard deviation, chi-square statistic, min, and max are the most relevant test statistics. The standard deviation measures the spread of the four counts, which helps reveal whether the golf ball numbers are uniformly distributed between 1 and 4: under the null hypothesis, all four counts should be close to 486/4 = 121.5, so the spread should be small. The chi-square test statistic is important because it measures the discrepancy between observed and expected counts in each of the 10,000 Monte Carlo simulations that were conducted.
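
One relationship is worth noting when comparing these statistics. Because the total is fixed at n = 486 and every expected count equals E = n/4 = 121.5, the mean of the four counts is always E, so the sample variance of the counts and the chi-square statistic are proportional. The notation O_i (the count of balls numbered i) and s (the standard deviation of the four counts) is introduced here and is not from the assignment.

$$
s^2 = \frac{1}{3}\sum_{i=1}^{4}(O_i - E)^2,
\qquad
\chi^2 = \sum_{i=1}^{4}\frac{(O_i - E)^2}{E} = \frac{3\,s^2}{E}
$$

For the observed counts (137, 138, 107, 104):

$$
\sum_{i=1}^{4}(O_i - E)^2 = 15.5^2 + 16.5^2 + 14.5^2 + 17.5^2 = 1029,
\qquad
s = \sqrt{343} \approx 18.52,
\qquad
\chi^2 = \frac{1029}{121.5} \approx 8.47
$$

Since the chi-square statistic is an increasing function of s, the two statistics rank simulated datasets identically and should lead to the same simulated p-value. The max and min, by contrast, carry different information.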


                                    Results (1-2 page)

Show your results
Calculated test statistics (plus histograms)
Restate hypothesis
Show code + calculated p-value via simulation
Interpret p-value once more
Output: Histograms (a, b, c, d)

                                    Conclusion (½ page)

Summarize the results; what did you learn?
Comment on things such as the following in a Conclusion section:

    – What would you do if you knew the sampling distribution of the test statistic? How does simulation help if you don’t?
                        
    – Power against particular types of alternatives:
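
One way to compare power is by simulation. The sketch below is a hedged illustration, not part of the assignment: the alternative probabilities `p_alt`, the 5% level, the number of simulations, and the helper `sim_stats()` are all assumptions chosen for the example. It takes the critical value of each statistic from its simulated null distribution, then estimates how often each statistic rejects when the data actually come from `p_alt`.

```r
# Hypothetical helper: simulate n_sim tables of n balls with cell probabilities p
# and return one row of test statistics (sd, max, min, chisq) per simulated table.
sim_stats <- function(n_sim, n, p) {
  counts <- rmultinom(n_sim, size = n, prob = p)  # 4 x n_sim matrix of cell counts
  expected <- n / 4                               # expected count under the null
  cbind(
    sd    = apply(counts, 2, sd),
    max   = apply(counts, 2, max),
    min   = apply(counts, 2, min),
    chisq = colSums((counts - expected)^2) / expected
  )
}

set.seed(2022)
n     <- 486
n_sim <- 2000                       # kept small so the sketch runs quickly
alpha <- 0.05
p_alt <- c(0.30, 0.30, 0.20, 0.20)  # assumed alternative, for illustration only

null_stats <- sim_stats(n_sim, n, rep(0.25, 4))
alt_stats  <- sim_stats(n_sim, n, p_alt)

# Large sd, max, and chisq are evidence against the null; for min, small values are.
power <- c(
  sd    = mean(alt_stats[, "sd"]    > quantile(null_stats[, "sd"],    1 - alpha)),
  max   = mean(alt_stats[, "max"]   > quantile(null_stats[, "max"],   1 - alpha)),
  min   = mean(alt_stats[, "min"]   < quantile(null_stats[, "min"],   alpha)),
  chisq = mean(alt_stats[, "chisq"] > quantile(null_stats[, "chisq"], 1 - alpha))
)
print(round(power, 3))
```

Rerunning this with other choices of `p_alt` shows which statistic is most sensitive to which kind of departure. One would expect max to respond mainly to a single over-represented number and min to a single under-represented number, while chisq responds to any departure from equal counts.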

title: “Golf balls assignment - pseudocode”
author: “Michael Chase”
date: “3/1/2022”
output: html_document


Now onto the assignment…..

This assignment asks you the following:

Allan Rossman used to live along a golf course and collected the golf balls that landed in his yard. Most of these golf balls had a number on them. Allan tallied the numbers on the first 500 golf balls that landed in his yard one summer.

Specifically, he collected the following data:
137 golf balls numbered 1
138 golf balls numbered 2
107 golf balls numbered 3
104 golf balls numbered 4
14 “others” (Either with a different number, or no number at all. We will ignore these other balls for the purposes of this question.)

Question: What is the distribution of these numbers? In particular, are the numbers 1, 2, 3, and 4 equally likely?

The intuition is that you are going to conduct a simulation-based hypothesis test. (If you happen to know the correct theoretical test, I want you to assess the significance of the associated test statistic using simulation rather than theoretical results.)

Here’s the pseudocode. Your job is to flesh it out, run it for at least 3 test statistics (that you have to come up with), and report on the results. Results should be turned in as a nicely formatted R-markdown doc that compiles to an html file.

# Pseudocode for the golf ball problem in assignment 1
# Do the following repeatedly, for each test statistic that you use.
# 1 Pick a test statistic. An example might be to use the range between the highest and lowest number in the four cells of the table.
  
    NumberofGolfBalls<-486
    NumberofSamples<-1000
    HighestNumberSeen<-4
    
    ## Test Statistics chosen in this study
      ### Standard Deviation (sd), Maximum (max), Minimum (min), Chi-square (chisq).
    
  
# 2. We then find the behavior of the test statistic under the null hypothesis that all numbers are equally likely by repeatedly doing the following (say 10000 times):
    nsim <- 10000
    set.seed(2022)
    
# 2a. Generate a simulated data set of the same size as the one we observed. So, in this example it will contain 486 simulated golf-balls,
# with the number of each being sampled from a discrete uniform distribution DU[1,4].


# Columns hold the four test statistics for each simulated dataset
test_stat <- matrix(ncol = 4, nrow = nsim)
colnames(test_stat) <- c("sd", "max", "min", "chisq")  # names() does not label matrix columns
size <- 486
for (i in seq_len(nsim)) {
  # factor() keeps all four cells in the table even if a number is never drawn
  golfballs <- factor(sample(1:4, size, replace = TRUE), levels = 1:4)
  gb_table <- table(golfballs)
  test_stat[i, 1] <- sd(gb_table)
  test_stat[i, 2] <- max(gb_table)
  test_stat[i, 3] <- min(gb_table)
  test_stat[i, 4] <- as.numeric(chisq.test(gb_table)$statistic)
}
# 2b. Calculate the value of each test statistic for that simulated dataset (so here, we construct a table of frequencies, just like for the observed data, and then compute the sd, max, min, and chi-square statistic of that table). Store these values.

summary(test_stat)
##        sd                max             min             chisq       
##  Min.   : 0.5774   Min.   :122.0   Min.   : 83.0   Min.   : 0.00823  
##  1st Qu.: 7.0475   1st Qu.:129.0   1st Qu.:107.0   1st Qu.: 1.22634  
##  Median : 9.7468   Median :132.0   Median :111.0   Median : 2.34568  
##  Mean   :10.1295   Mean   :132.9   Mean   :110.3   Mean   : 2.97858  
##  3rd Qu.:12.7932   3rd Qu.:136.0   3rd Qu.:114.0   3rd Qu.: 4.04115  
##  Max.   :28.6182   Max.   :161.0   Max.   :121.0   Max.   :20.22222
# 3. After repeating 2a and 2b 10000 times, and storing the 10000 values of the test statistic that results, plot those values, using a histogram, say. This is the "null distribution" of the test statistic under the null hypothesis that all numbers are equally likely.

4. Plot the value of the test statistic for the observed dataset (which is 34 if we use the range as an example).

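# Observed counts for balls numbered 1-4, and the observed value of each test statistic
observed <- c(137, 138, 107, 104)
obs_stat <- c(sd(observed), max(observed), min(observed),
              as.numeric(chisq.test(observed)$statistic))
stat_names <- c("sd", "max", "min", "chisq")
par(mfrow = c(2, 2))
for (j in 1:4) {
  # One histogram per test statistic, titled with the statistic's name
  hist(test_stat[, j], col = "orange", main = stat_names[j], xlab = stat_names[j])
  abline(v = obs_stat[j], lwd = 2)  # mark the observed value (step 4)
}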

5. If the observed test statistic value falls in the tails of the null distribution plotted in step 3, (quantify this using the percentile of that value in the null distribution - known as the p-value) we reject the null hypothesis that the numbers on the balls are uniformly distributed.
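
Step 5 can be sketched as follows for the chi-square statistic. This is a minimal illustration under assumptions: it draws the simulated counts with `rmultinom()` rather than `sample()` plus `table()`, and the names `observed`, `chisq_stat`, `null_chisq`, and `p_value` are introduced here. The p-value is the proportion of simulated statistics at least as large as the observed one, since large chi-square values are evidence against the numbers being equally likely.

```r
set.seed(2022)
nsim     <- 10000
observed <- c(137, 138, 107, 104)  # counts for balls numbered 1-4
n        <- sum(observed)          # 486
expected <- n / 4                  # 121.5 under the null hypothesis

# Chi-square statistic for a vector of four counts
chisq_stat <- function(counts) sum((counts - expected)^2 / expected)

obs_chisq <- chisq_stat(observed)

# Null distribution: nsim simulated tables with all four numbers equally likely
null_counts <- rmultinom(nsim, size = n, prob = rep(0.25, 4))
null_chisq  <- apply(null_counts, 2, chisq_stat)

# Simulated p-value: share of null statistics at least as extreme as the observed one
p_value <- mean(null_chisq >= obs_chisq)
p_value

# For comparison only: the large-sample chi-square approximation with 3 degrees of freedom
pchisq(obs_chisq, df = 3, lower.tail = FALSE)
```

For the min statistic the direction reverses, because small values are the extreme ones: the corresponding proportion would use `<=` instead of `>=`.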
