Midterm Exam 1


(Instructions)

Due Date: September 27. Canvas submission will be locked after the due date.
Total Points: 175.
Hypothesis Testing: For full credit in hypothesis tests, clearly state all hypotheses, explain all notations, mention the test statistic, and provide a justification for your conclusion.
$\texttt{R}$ Code and Calculations: For any credit, please ensure that the corresponding code, calculations, and results are shown in R code chunks within your submission.
Submission Format: Ensure that your submission is an HTML or PDF file that includes all the $\texttt{R}$ code chunks, outputs, and relevant plots.
Academic Integrity: Adhere to the university’s academic integrity policy. Ensure that all work submitted is your own and properly cited where necessary.

(Question 1)

A. Suppose a normal distribution has a median of $12$ and an interquartile range (IQR) of $8$. What are the mean, variance, and the first and third quartiles of the distribution? (15 points)

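Answer: For a normal distribution the mean equals the median, so the mean is $12$. The quartiles lie symmetrically about the mean, half an IQR away on each side, so $Q_1 = 12 - 4 = 8$ and $Q_3 = 12 + 4 = 16$. Since $\text{IQR} = 2 z_{0.75} \sigma \approx 1.349 \sigma$, we get $\sigma \approx 8 / 1.349 \approx 5.93$ and variance $\sigma^2 \approx 35.17$.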

# Your code, if any.
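The calculation for part A can be checked numerically. This is a minimal sketch that assumes only the stated median and IQR; the variable names are illustrative and not part of the exam.

```r
# Normal distribution: the mean equals the median by symmetry.
mu <- 12
iqr <- 8

# IQR = Q3 - Q1 = 2 * qnorm(0.75) * sigma, so solve for sigma.
sigma <- iqr / (2 * qnorm(0.75))
variance <- sigma^2

# Quartiles of a normal distribution with this mean and standard deviation.
q1 <- qnorm(0.25, mean = mu, sd = sigma)
q3 <- qnorm(0.75, mean = mu, sd = sigma)

c(mean = mu, sd = sigma, variance = variance, Q1 = q1, Q3 = q3)
```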

B. Mrs. Bhattacharya conducted a survey among 12 students to find out how many hours they spend per day on their research projects. The recorded responses are as follows:

24, 8, 7.5, 9, 6, 7, 8, 10, 5, 6, 8, 45

What are the range and inter-quartile range (IQR) of the data? (10 points)

hours = c(24, 8, 7.5, 9, 6, 7, 8, 10, 5, 6, 8, 45)
max(hours) - min(hours)

## [1] 40

quantile(hours,0.25)

##  25% 
## 6.75

quantile(hours,0.50)

## 50% 
##   8

quantile(hours,0.75)

##  75% 
## 9.25

IQR(hours)

## [1] 2.5

C. Do you suspect any suspicious number in the data based on common sense? If yes, use a statistical tool to justify your answer. (10 points)

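Answer: Yes, 24 and 45 hours do not appear to belong in the set. Both stand out as outliers compared to the other responses, and because the data record hours per day, 45 hours is impossible and 24 hours would mean working the entire day. By the $1.5 \times \text{IQR}$ rule with $Q_1 = 6.75$, $Q_3 = 9.25$, and $\text{IQR} = 2.5$, any value below $3$ or above $13$ is flagged, and the boxplot below shows both 24 and 45 beyond the whiskers.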

boxplot(hours)
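The boxplot's outlier rule can also be applied directly. The sketch below computes the $1.5 \times \text{IQR}$ fences from the quartiles reported above and lists the values outside them; `boxplot.stats` is the base R function that `boxplot` uses internally (it works from hinges rather than `quantile`, so its fences can differ slightly).

```r
hours <- c(24, 8, 7.5, 9, 6, 7, 8, 10, 5, 6, 8, 45)

# Fences from the 1.5 * IQR rule, using the default quantile definition.
q <- quantile(hours, c(0.25, 0.75))
fence <- 1.5 * IQR(hours)
lower <- q[[1]] - fence
upper <- q[[2]] + fence

# Values outside the fences are flagged as potential outliers.
flagged <- hours[hours < lower | hours > upper]
flagged

# Points that boxplot() would draw beyond the whiskers.
boxplot.stats(hours)$out
```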

(Question 2)

A. Which of the following represent(s) a population parameter? (2 points)

a) The average height of 15 student athletes.

b) The morning blood pressure readings of a patient over a week.

c) The median age of all employees in a company.

d) The variance of soil moisture data at selected plots.

Answer: C

B. Which of the following statements is/are true? (2 points)

a) A sample is always identical to the population

b) From a sample, we can usually calculate a population parameter exactly.

c) A sample can be used to estimate population parameters.

d) From a sample, we can usually calculate a sample statistic exactly.

Answer: C and D

(Question 3)

A junior tennis academy purchased a new batch of performance-enhancing shoes that claim to improve players’ running speeds. To assess this claim, the coach decided to compare the sprint times of 18 players using the new shoes with those of another 18 players wearing traditional shoes. The coaching staff measured the sprint times (in seconds) of all the players. The data below shows the measurements:

Performance-enhancing shoes:	6.4	6.8	6.3	7.0	6.5	6.7	6.6	6.9	6.3	7.1	6.8	7.0	6.5	6.9	7.1	6.7	6.4	6.8
Traditional shoes:	6.3	6.7	6.2	6.9	6.4	6.6	6.5	6.8	6.2	7.0	6.7	6.9	6.4	6.8	7.0	6.6	6.3	6.7

A. Under the assumption of normality, test whether the new performance-enhancing shoes significantly improve sprint times compared to traditional shoes. Use significance level $\alpha$ = 0.05. Clearly explain the notations, hypotheses, and explain whether this is a paired or independent test. Finally, justify your conclusion based on the $p$-value. (20 points)

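Answer: Let $\mu_1$ be the mean sprint time with the performance-enhancing shoes and $\mu_2$ the mean sprint time with traditional shoes. An improvement means shorter times, so $H_{0}: \mu_1 \ge \mu_2$ and $H_{A}: \mu_1 < \mu_2$. The two groups consist of different players, so this is an independent (two-sample) test. The code below runs the two-sided version of the Welch $t$-test, giving $t = 1.1302$ with $34$ degrees of freedom and a two-sided $p$-value of $0.2663$. Because $t$ is positive (the sample mean with the new shoes is larger), the one-sided $p$-value for $H_{A}: \mu_1 < \mu_2$ is $1 - 0.2663/2 \approx 0.867 > \alpha = 0.05$. We fail to reject $H_{0}$: there is no evidence that the new shoes improve sprint times.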

all_shoes = c(6.4,  6.8,    6.3,    7.0,    6.5,    6.7,    6.6,    6.9,    6.3,    7.1,    6.8,    7.0,    6.5,    6.9,    7.1,    6.7,    6.4,    6.8, 
6.3,    6.7,    6.2,    6.9,    6.4,    6.6,    6.5,    6.8,    6.2,    7.0,    6.7,    6.9,    6.4,    6.8,    7.0,    6.6,    6.3,    6.7)
shoe_types = c(rep("1.Performance",18),
                   rep("2.Traditional",18))
t.test(all_shoes~shoe_types, alternative="two.sided",mu=0, conf.level=1-0.05)

## 
##  Welch Two Sample t-test
## 
## data:  all_shoes by shoe_types
## t = 1.1302, df = 34, p-value = 0.2663
## alternative hypothesis: true difference in means between group 1.Performance and group 2.Traditional is not equal to 0
## 95 percent confidence interval:
##  -0.07981187  0.27981187
## sample estimates:
## mean in group 1.Performance mean in group 2.Traditional 
##                    6.711111                    6.611111
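Since the claim is directional (shorter times with the new shoes), the same Welch test can be run one-sided. This is a sketch with the two samples entered as separate vectors; the names `performance` and `traditional` are chosen here for illustration.

```r
performance <- c(6.4, 6.8, 6.3, 7.0, 6.5, 6.7, 6.6, 6.9, 6.3,
                 7.1, 6.8, 7.0, 6.5, 6.9, 7.1, 6.7, 6.4, 6.8)
traditional <- c(6.3, 6.7, 6.2, 6.9, 6.4, 6.6, 6.5, 6.8, 6.2,
                 7.0, 6.7, 6.9, 6.4, 6.8, 7.0, 6.6, 6.3, 6.7)

# H0: mu1 >= mu2 versus HA: mu1 < mu2 (new shoes give shorter times).
res <- t.test(performance, traditional,
              alternative = "less", mu = 0, conf.level = 0.95)

res$statistic  # same t statistic as the two-sided test
res$p.value    # one-sided p-value, compared against alpha = 0.05
```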

B. Based on the test conducted in part (A), explain how you can reach the same conclusion using the confidence interval, without relying on the $p$-value. (6 points)

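Answer: From the output, the 95% confidence interval for the difference in mean sprint times (performance-enhancing minus traditional) is $(-0.0798, 0.2798)$. Because this interval contains $0$, a difference of zero is consistent with the data, so we fail to reject $H_{0}$ at $\alpha = 0.05$. This agrees with the $p$-value conclusion, since $p = 0.2663 > \alpha = 0.05$.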

C. Perform the hypothesis test from part (A), but this time without assuming normality (i.e., use a nonparametric test). Be sure to include all the relevant details of the test. (20 points)

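Answer: Without assuming normality, the two independent samples are compared with the Wilcoxon rank sum (Mann-Whitney) test. $H_{0}$: the two sprint-time distributions have the same location (location shift $= 0$). $H_{A}$: the location shift is not $0$. The test statistic is $W = 197$ with $p$-value $0.2722$, based on the normal approximation with continuity correction because ties prevent an exact $p$-value. Since $p = 0.2722 > \alpha = 0.05$, we fail to reject $H_{0}$. $W$ also exceeds its null expectation of $18 \times 18 / 2 = 162$, meaning times with the new shoes tend to be larger, so a one-sided test for improvement would not reject $H_{0}$ either.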

wilcox.test(all_shoes~shoe_types, alternative="two.sided",mu=0, conf.level=1-0.05)

## Warning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): cannot
## compute exact p-value with ties

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  all_shoes by shoe_types
## W = 197, p-value = 0.2722
## alternative hypothesis: true location shift is not equal to 0
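A one-sided version of the rank sum test matches the directional claim more closely. The sketch below assumes the same two samples, entered as separate vectors with illustrative names, and sets `exact = FALSE` explicitly because the data contain ties.

```r
performance <- c(6.4, 6.8, 6.3, 7.0, 6.5, 6.7, 6.6, 6.9, 6.3,
                 7.1, 6.8, 7.0, 6.5, 6.9, 7.1, 6.7, 6.4, 6.8)
traditional <- c(6.3, 6.7, 6.2, 6.9, 6.4, 6.6, 6.5, 6.8, 6.2,
                 7.0, 6.7, 6.9, 6.4, 6.8, 7.0, 6.6, 6.3, 6.7)

# HA: sprint times with the new shoes are shifted to the left (shorter).
res <- wilcox.test(performance, traditional,
                   alternative = "less", mu = 0, exact = FALSE)

res$statistic  # W, the rank sum statistic for the first sample
res$p.value    # one-sided p-value, compared against alpha = 0.05
```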

(Question 4)

My following code loads the Antarctic ocean sea-ice extent data in $\texttt{R}$ into a dataframe called $\texttt{seaice}$ and cleans the data by removing a missing observation. Please do not modify this code. Note that you need to be connected to the internet to download this file.

seaice = read.csv(url("https://noaadata.apps.nsidc.org/NOAA/G02135/south/monthly/data/S_12_extent_v3.0.csv"))
seaice = seaice[-10, ]

The dataset $\texttt{seaice}$ contains six columns or variables: $\texttt{year}$, $\texttt{mo}$, $\texttt{data.type}$, $\texttt{region}$, $\texttt{extent}$, and $\texttt{area}$. In this dataset, $\texttt{extent}$ refers to the total region where sea ice is present, while $\texttt{area}$ represents the actual coverage of ice within that region. These measurements are recorded over the year and months for the southern region.

A. We are interested in the sea ice extent (column name = extent) and area (column name = area) from the above dataset. Conduct a hypothesis test to determine if the mean sea ice extent is significantly greater than the mean sea ice area at a significance level of $0.05$. Include all relevant details, such as the hypotheses, test statistic, notations, and interpretation of results. (20 points)

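Answer: Each row records extent and area for the same year and month, so the observations are paired. Let $\mu_d$ be the mean of the differences extent $-$ area. $H_{0}: \mu_d \le 0$ and $H_{A}: \mu_d > 0$. Using a paired $t$-test (parametric), $t = 62.186$ with $44$ degrees of freedom and $p$-value $< 2.2 \times 10^{-16}$. Since $p < \alpha = 0.05$, we reject $H_{0}$: the mean sea ice extent is significantly greater than the mean sea ice area.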

colnames(seaice)

## [1] "year"      "mo"        "data.type" "region"    "extent"    "area"

mse= seaice$extent
msa= seaice$area
t.test(mse, msa, paired=TRUE,
       alternative = "greater",
       mu=0, conf.level = 1-0.05)

## 
##  Paired t-test
## 
## data:  mse and msa
## t = 62.186, df = 44, p-value < 2.2e-16
## alternative hypothesis: true mean difference is greater than 0
## 95 percent confidence interval:
##  3.371702      Inf
## sample estimates:
## mean difference 
##        3.465333

B. Perform the appropriate nonparametric test to determine if there is a significant difference between the sea ice extent (column name = extent) and area (column name = area), as described in the previous question. Be sure to include all relevant details, such as your notations, the hypotheses, test statistic, and conclusions. (20 points)

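Answer: Let $M_d$ be the median of the paired differences extent $-$ area. $H_{0}: M_d \le 0$ and $H_{A}: M_d > 0$. The Wilcoxon signed rank test gives $V = 1035$ with $p$-value $2.678 \times 10^{-9}$. Since $p < \alpha = 0.05$, $H_{0}$ is rejected: extent is significantly greater than area. Note that $V = 1035 = 45 \times 46 / 2$ is the largest value possible with $45$ pairs, so extent exceeds area in every pair.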

wilcox.test(mse, msa, paired=TRUE,
       alternative = "greater",
       mu=0, conf.level = 1-0.05)

## Warning in wilcox.test.default(mse, msa, paired = TRUE, alternative =
## "greater", : cannot compute exact p-value with ties

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  mse and msa
## V = 1035, p-value = 2.678e-09
## alternative hypothesis: true location shift is greater than 0

C. Perform a nonparametric test to determine if the median sea ice extent (column name = extent) is less than 9.90. Include all relevant details, such as the hypotheses, test statistic, assumptions, and conclusions. (20 points)

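Answer: Let $M$ be the population median of sea ice extent. $H_{0}: M \ge 9.90$ and $H_{A}: M < 9.90$. Note that the output below comes from a one-sample $t$-test, which is parametric and tests the mean; the nonparametric counterparts are the Wilcoxon signed rank test (which assumes a distribution symmetric about its median) or the sign test. In the $t$-test output, $t = 2.8184$ with $44$ degrees of freedom and $p$-value $0.9964$. Since $p > \alpha = 0.05$, we cannot reject $H_{0}$; the sample mean of $10.254$ lies above $9.90$, so there is no evidence that the center of the distribution is below $9.90$.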

mse= seaice$extent
t.test(mse, mu=9.90,
            alternative = "less",conf.level=0.95)

## 
##  One Sample t-test
## 
## data:  mse
## t = 2.8184, df = 44, p-value = 0.9964
## alternative hypothesis: true mean is less than 9.9
## 95 percent confidence interval:
##      -Inf 10.46504
## sample estimates:
## mean of x 
##    10.254
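A nonparametric version of this test could look like the sketch below. The helper `median_less_tests` and the `toy` vector are hypothetical and only illustrate the calls; applied to the exam data, the call would be `median_less_tests(seaice$extent, m0 = 9.90)`.

```r
# Hypothetical helper: one-sided tests of H0: median >= m0 vs HA: median < m0.
median_less_tests <- function(x, m0) {
  x <- x[!is.na(x)]

  # Wilcoxon signed rank test; assumes symmetry about the median.
  signed_rank <- wilcox.test(x, mu = m0, alternative = "less", exact = FALSE)

  # Sign test: under H0 the number of values below m0 is Binomial(n, 0.5),
  # after dropping values exactly equal to m0.
  below <- sum(x < m0)
  n <- sum(x != m0)
  sign_test <- binom.test(below, n, p = 0.5, alternative = "greater")

  list(signed_rank_V = unname(signed_rank$statistic),
       signed_rank_p = signed_rank$p.value,
       sign_p = sign_test$p.value)
}

# Toy values for illustration only; these are not the sea ice data.
toy <- c(10.4, 10.1, 9.8, 10.6, 10.3, 9.7, 10.9, 10.2)
out <- median_less_tests(toy, m0 = 9.90)
out
```

The signed rank test uses the magnitudes of the deviations from $9.90$ and is usually more powerful when the symmetry assumption is reasonable; the sign test needs no symmetry assumption.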

(Question 5)

In $\texttt{R}$, one can simulate data from a normal distribution using the $\texttt{rnorm}$ function as shown below:

x = rnorm( n=<number_of_observations>, mean= <true_value_of_mu>, sd = <true_value_of_sigma> )

Note: Last argument in the code is the standard deviation $\sigma$, not the variance $\sigma^2$.

A. Use the above code to sample $1000$ observations from a normal distribution with mean $5$ and variance $16$. (5 points)

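Answer: The template above does not run as written because the placeholders in angle brackets must be replaced with actual values. A variance of $16$ corresponds to a standard deviation of $\sigma = \sqrt{16} = 4$, so the arguments are n = 1000, mean = 5, and sd = 4.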
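A minimal sketch of the sampling step follows. The seed value is an assumption added here so the draw is reproducible; the exam does not specify one.

```r
set.seed(1)  # assumed seed, only for reproducibility

# Variance 16 means standard deviation sqrt(16) = 4.
x <- rnorm(n = 1000, mean = 5, sd = sqrt(16))

length(x)
head(x)
```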

B. Using the data from the previous question, and assuming normality, estimate the population variance and the inter-quartile range (IQR). (10 points)

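Answer: Assuming normality, the population variance is estimated by the sample variance $s^2$, and the IQR follows from the normal quantiles as $2 z_{0.75} s \approx 1.349 s$. For reference, the true values are $\sigma^2 = 16$ and $\text{IQR} = 1.349 \times 4 \approx 5.40$.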

# Code
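The parametric estimates can be sketched as follows. The sample is regenerated with an assumed seed so the block runs on its own; the sample variance (divisor $n - 1$) is used here, and the maximum likelihood version would use divisor $n$ instead.

```r
set.seed(1)  # assumed seed, only for reproducibility
x <- rnorm(n = 1000, mean = 5, sd = 4)

# Estimate of the population variance.
s2 <- var(x)

# Under normality, IQR = 2 * qnorm(0.75) * sigma; plug in the estimated sd.
iqr_normal <- 2 * qnorm(0.75) * sqrt(s2)

c(variance = s2, IQR = iqr_normal)
```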

C. Using the same data, and without assuming any parametric conditions, estimate the population variance and the population inter-quartile range (IQR). (10 points)

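Answer: Without parametric assumptions, the population variance is still estimated by the sample variance, while the population IQR is estimated directly by the sample IQR, the difference between the sample third and first quartiles.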

# Code
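The nonparametric estimates use only sample quantities. As above, the sample is regenerated with an assumed seed so the block is self-contained.

```r
set.seed(1)  # assumed seed, only for reproducibility
x <- rnorm(n = 1000, mean = 5, sd = 4)

# Sample variance: no distributional assumption needed.
s2 <- var(x)

# Sample IQR: difference between the empirical third and first quartiles.
iqr_sample <- IQR(x)
quartiles <- quantile(x, c(0.25, 0.75))

c(variance = s2, IQR = iqr_sample)
```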

(Question 6)

What aspects of the course have been most beneficial for you so far? Are there areas where you feel additional resources or support would be helpful? (Free 5 points)

Answer: Going over more homework or example problems alongside the lectures would help a lot in this course. Practice exams would help as well. I am more of a visual learner, and visuals and repetition help a lot with understanding the concepts.


Azzi Parries

2024-09-27
