1.1 Structure of Data

Data refers to pieces of information or facts that are collected, recorded, and analyzed for the purpose of gaining insights, making informed decisions, or conducting research. Data can take various forms and types, and it serves as the foundation for drawing conclusions, identifying trends, and generating knowledge.

The structure of data refers to the organization, format, and arrangement of data within a dataset. It encompasses how data is stored, how variables are defined, and how records or observations are organized. A well-structured dataset is crucial for effective data analysis and interpretation. Here are some key components of data structure:

Variables: Variables are characteristics, properties, or attributes of the data that you are measuring or observing. Each column in a dataset represents a variable. For example, in a dataset of student information, “Age,” “Gender,” and “Score” could be variables.
Observations or Records: Observations, also known as records or cases, are individual instances or data points within the dataset. Each row represents an observation. Continuing with the student dataset example, each student’s information (age, gender, score) would constitute an observation.
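As a minimal sketch of this layout, the following R code builds a small data frame for the student example. The three students and their values are invented for illustration; only the variable names Age, Gender, and Score come from the text above.

```r
# Hypothetical student records (values invented for illustration)
students <- data.frame(
  Age    = c(19, 21, 20),
  Gender = c("F", "M", "F"),
  Score  = c(88, 75, 92)
)

# Each column is a variable; each row is an observation
names(students)   # variable names
ncol(students)    # number of variables
nrow(students)    # number of observations

# The first observation (one student's record)
students[1, ]
```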

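There are several types of data. The categories below overlap: quantitative data are either discrete or continuous, and qualitative data are either ordinal or nominal.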

Quantitative Data: This type of data is numeric and represents measurable quantities. It includes variables like age, height, temperature, and income.

Qualitative Data: Also known as categorical or non-numeric data, qualitative data consists of labels or categories. Examples include gender, color preferences, or types of fruits.

Discrete Data: Discrete data consists of separate, distinct values with no intermediate values possible. Examples include the number of students in a class or the number of cars in a parking lot.

Continuous Data: Continuous data can take any value within a given range and may have infinite decimal places. Examples include weight, height, and temperature.

Ordinal Data: Ordinal data represents categories with a meaningful order, but the differences between the categories are not necessarily equal. A common example is survey responses with options like “strongly agree,” “agree,” “neutral,” “disagree,” and “strongly disagree.”

Nominal Data: Nominal data represents categories with no inherent order. It’s used to label and classify data. Examples include colors, gender, and types of animals.
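In R, these data types roughly correspond to different kinds of vectors. The sketch below uses made-up values, and the variable names are illustrative assumptions rather than anything defined in these notes.

```r
# Quantitative, continuous: measured values (hypothetical heights in cm)
height <- c(162.5, 175.0, 168.3)

# Quantitative, discrete: counts (hypothetical class sizes)
class_size <- c(25L, 30L, 18L)

# Qualitative, nominal: categories with no inherent order
color <- factor(c("red", "blue", "red"))

# Qualitative, ordinal: categories with a meaningful order
response <- factor(
  c("agree", "neutral", "strongly agree"),
  levels = c("strongly disagree", "disagree", "neutral",
             "agree", "strongly agree"),
  ordered = TRUE
)

is.numeric(height)     # TRUE
is.ordered(color)      # FALSE
is.ordered(response)   # TRUE

# Order comparisons are meaningful only for ordinal data:
# "agree" ranks above "neutral"
response[1] > response[2]
```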

1.2 Sampling from a population

Sampling from a population involves selecting a subset of individuals or elements from a larger group (the population) to gather information and make inferences about the entire population. Sampling is a fundamental concept in statistics and research, allowing you to study a smaller group to draw conclusions about the larger whole. Here are some key points about sampling:

Population: The population refers to the entire group of interest that you want to study or make inferences about. It’s often impractical or impossible to collect data from the entire population, so you select a sample instead.
Sample: A sample is a smaller group selected from the population for study. The goal is for the sample to be representative of the population so that the insights drawn from the sample can be generalized to the entire population.
Sampling Methods: There are various sampling methods, each with its own strengths and limitations. Some common methods include:
- Random Sampling: Each individual has an equal chance of being selected. This helps avoid bias in the sample.
- Stratified Sampling: Dividing the population (say, Minnesota residents) into subgroups called strata (say, males/females, or Democrats/Republicans/independents) and then randomly selecting a sample from each stratum.
- Cluster Sampling: Dividing the population (say, Minnesota residents) into clusters (such as counties) and then randomly selecting entire clusters for the sample.
- Convenience Sampling: Choosing individuals who are easily accessible. This method can introduce bias, so it is generally not recommended.
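The first three methods can be sketched in R with the sample() function. The population below is entirely hypothetical (1000 people spread evenly over 10 made-up counties), and the sample sizes are arbitrary choices for illustration.

```r
set.seed(1)  # make the random draws reproducible

# Hypothetical population of 1000 people in 10 counties
population <- data.frame(
  id     = 1:1000,
  sex    = rep(c("F", "M"), times = 500),
  county = rep(paste0("County", 1:10), each = 100)
)

# Simple random sampling: every individual has the same chance
srs <- population[sample(nrow(population), 50), ]

# Stratified sampling: draw 25 at random from each sex
rows_by_sex <- split(seq_len(nrow(population)), population$sex)
strat_rows  <- unlist(lapply(rows_by_sex, sample, size = 25))
stratified  <- population[strat_rows, ]

# Cluster sampling: pick 2 counties at random and keep everyone in them
chosen  <- sample(unique(population$county), 2)
cluster <- population[population$county %in% chosen, ]

table(stratified$sex)   # 25 from each stratum
table(cluster$county)   # 100 from each chosen county
```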

Some concepts in sampling:

Representativeness: A sample is considered representative when the characteristics of the sample closely mirror those of the population. This allows for valid inferences.

Sample Size: The size of the sample is an important consideration. When the sample is selected without bias, a larger sample tends to give more precise estimates and a better representation of the population.
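A small simulation can illustrate the effect of sample size. This is only a sketch under stated assumptions: the population is simulated (10,000 values from a normal distribution), and the helper function sample_means() is a hypothetical name introduced here.

```r
set.seed(42)

# Hypothetical population: 10,000 values with mean about 50
population <- rnorm(10000, mean = 50, sd = 10)

# Draw many random samples of size n and record each sample mean
sample_means <- function(n, reps = 1000) {
  replicate(reps, mean(sample(population, n)))
}

means_small <- sample_means(10)
means_large <- sample_means(100)

# The sample means vary less when the samples are larger
sd(means_small)
sd(means_large)
```

The second standard deviation should be clearly smaller than the first, meaning that means from larger samples cluster more tightly around the population mean.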

Inferential Statistics: After collecting data from the sample, inferential statistics are used to make predictions or draw conclusions about the entire population.

Sampling Bias: Bias can occur when the sample is not representative of the population. It can lead to incorrect inferences.

Sampling is a crucial step in research and statistics, as it allows you to study a manageable subset of a population and draw meaningful insights. The choice of sampling method depends on the research question, available resources, and the level of precision required in the results.

1.3 Observational and Experimental Studies

Data is collected from various sources, such as surveys, experiments, observations, sensors, or online platforms. Once collected, data can be organized, analyzed, and interpreted using various tools and techniques to extract meaningful insights and make informed decisions.

In general, there are two types of studies:

An observational study is a type of research design in which the researcher observes and collects data on individuals or subjects without manipulating any variables. The aim is to understand patterns, relationships, or trends within the observed data. Observational studies are often used in situations where it’s not feasible or ethical to conduct experiments that involve manipulating variables.

Here are three examples of observational studies:

Cross-Sectional Study: In a cross-sectional study, data is collected at a single point in time to analyze the relationships between different variables. For instance, researchers might collect data on the smoking habits and lung health of a sample of individuals to understand the potential link between smoking and respiratory health.

Longitudinal Study: Longitudinal studies involve collecting data from the same subjects over an extended period. Researchers track changes in variables over time to observe trends or patterns. An example could be a study tracking the cognitive development of a group of children from infancy to adolescence to investigate how early experiences impact cognitive growth.

Case-Control Study: In a case-control study, researchers compare individuals with a specific outcome or condition (cases) to individuals without that outcome (controls). The goal is to identify potential factors associated with the outcome. For instance, a case-control study might compare individuals with a certain type of cancer (cases) to those without the cancer (controls) to investigate potential risk factors.

These examples illustrate the diverse applications of observational studies in understanding relationships and patterns within various fields of research.

Another type of study is the experimental study.

Experiments in statistics are controlled, systematic procedures conducted to collect data and investigate relationships between variables. Unlike an observational study, an experiment has the researcher deliberately manipulate one or more variables. Experiments are a fundamental aspect of statistical research and are used to draw conclusions, make inferences, and test hypotheses. Here are some key components and concepts related to experiments in statistics:

Experimental Design: An experimental design involves planning and organizing the experiment to ensure valid and reliable results. This includes defining the research question, selecting variables, choosing appropriate treatments, and determining the experimental conditions.

Variables: The independent variable (IV) is the variable that is intentionally manipulated by the researcher; it is also known as the treatment variable. The dependent variable (DV) is the variable that is measured or observed to assess the effect of the independent variable.

Control Group and Treatment Group: In an experiment, participants are divided into two or more groups: a control group and one or more treatment groups. The control group serves as a baseline for comparison, while treatment groups receive the experimental treatment.

Randomization: Random assignment of participants to different groups helps control for potential confounding variables and reduces bias. It makes the groups comparable to one another on average, so that differences in the outcome can be attributed to the treatment.

Experimental Conditions: Each experimental group is exposed to a specific condition or treatment. The conditions are manipulated by the researcher to test hypotheses or investigate relationships.

Hypothesis Testing: Experiments are designed to test hypotheses about the relationship between variables. Researchers formulate a null hypothesis (H0) and an alternative hypothesis (Ha), which are then tested using statistical methods.

Experimental Control: Researchers aim to control extraneous variables that could influence the outcome. This helps establish a cause-and-effect relationship between the independent and dependent variables.

Data Collection: Data are collected by measuring the dependent variable(s) for each group under the different experimental conditions. Data collection methods may include surveys, observations, measurements, or other techniques.

Analysis of Results: After data collection, statistical analysis is performed to determine if the observed differences between groups are statistically significant. Common techniques include t-tests, analysis of variance (ANOVA), regression analysis, and more.

Interpreting Results: Researchers interpret the results in the context of the research question and draw conclusions based on the statistical analysis. They assess whether the experimental treatment had a significant impact on the dependent variable.

Experiments play a crucial role in drawing causal inferences and understanding relationships between variables. They are widely used in scientific research, psychology, medicine, social sciences, and various other fields to investigate hypotheses and provide evidence for making informed decisions.

Here is an example of an experimental design.

Purpose: Investigating the Effect of Fertilizer on Plant Growth with Three Treatments

Research Question: How does the application of different fertilizers impact the growth of plants?

Experimental Design:

Plant Species: Sunflower plants (Helianthus annuus).
Independent Variable (IV): Type of fertilizer.
Dependent Variable (DV): Height of the sunflower plants (in centimeters).
Experimental Conditions: Sunflower plants are divided into four groups:
- Group A: Plants receive no fertilizer (control group).
- Group B: Plants receive Fertilizer X.
- Group C: Plants receive Fertilizer Y.
- Group D: Plants receive Fertilizer Z.
Random Assignment: Plants are randomly assigned to the four groups to ensure fairness and avoid bias.
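Random assignment to the four groups could be carried out along the following lines in R. The number of pots (20, giving 5 per group) is an assumption made for this sketch; the example itself does not state how many plants are used.

```r
set.seed(2024)  # make the random assignment reproducible

# Assume 20 pots, labeled 1 to 20 (the count is an assumption)
pots <- 1:20

# Shuffle the four group labels so each group gets 5 pots at random
groups <- sample(rep(c("A", "B", "C", "D"), each = 5))

assignment <- data.frame(pot = pots, group = groups)
table(assignment$group)   # 5 pots in each group
head(assignment)
```

Shuffling a fixed set of labels keeps the group sizes equal while leaving it to chance which pot lands in which group.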

Here is a procedure to conduct the experiment:

Prepare the soil and pots for planting the sunflower seeds.
Plant the sunflower seeds in each pot according to the experimental groups.
Apply the respective fertilizers (Fertilizer X, Y, or Z) to the soil around the plants as per the instructions.
Water all plants equally and maintain consistent environmental conditions (light, temperature, humidity).

Here is how data are collected:

Measure the height of each sunflower plant in each group after a specific time period (e.g., after six weeks).

Here is how hypotheses are set up (more in Chapter 4 or later):

Null Hypothesis (H0): There is no difference in the growth of sunflower plants among the different fertilizer treatments.
Alternative Hypothesis (Ha): The growth of sunflower plants differs based on the type of fertilizer treatment.

Here is how analysis of results is obtained (more in Chapter 8):

Perform analysis of variance (ANOVA) to compare the mean heights of sunflower plants in different fertilizer groups.

Here is how you can interpret results:

If ANOVA indicates a significant difference in mean plant heights among the groups, additional tests may be conducted to identify which specific groups differ significantly.
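The analysis step might look like the sketch below. The plant heights are invented purely for illustration (5 plants per group), and Tukey's HSD is shown as one common follow-up test, not necessarily the one used later in these notes.

```r
# Hypothetical heights (cm) after six weeks, 5 plants per group
height <- c(48, 52, 50, 47, 51,    # Group A: no fertilizer
            58, 61, 57, 60, 59,    # Group B: Fertilizer X
            55, 53, 56, 54, 57,    # Group C: Fertilizer Y
            62, 65, 63, 61, 66)    # Group D: Fertilizer Z
group <- factor(rep(c("A", "B", "C", "D"), each = 5))

# One-way ANOVA comparing the mean heights of the four groups
fit <- aov(height ~ group)
summary(fit)

# If the F-test is significant, Tukey's HSD compares each pair of groups
TukeyHSD(fit)
```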

Here is how you can draw a conclusion:

Based on the statistical analysis, researchers can draw conclusions about the effects of Fertilizer X, Y, and Z on the growth of sunflower plants.

Chapter 2. Describing Data

2.1 Numerical Summaries of Quantitative Data

Numerical summaries of quantitative data provide concise and informative statistics that describe the characteristics and distribution of a dataset. These summaries help you gain insights, compare datasets, and make informed decisions. Here are some common numerical summaries used for quantitative data:

Measures of Central Tendency:

Mean (Average): The sum of all values divided by the number of values. It represents the “typical” value in the dataset.
Median: The middle value when the data is sorted. It’s less sensitive to extreme values compared to the mean.
Mode: The category or score that occurs most frequently in the data. In other words, it is the most common value.

Measures of Dispersion:

Range: The difference between the maximum and minimum values in the dataset.
Variance: Roughly the average of the squared differences between each data point and the mean. For a sample, the sum of squared differences is divided by $n-1$ rather than $n$. It measures how much the values vary from the mean.
Standard Deviation: The square root of the variance. It provides a measure of the spread of the data.
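In symbols, for a sample $x_1, x_2, \ldots, x_n$, the mean, variance, and standard deviation can be written as follows. These are the sample versions, with $n-1$ in the variance, which agree with the numerical example later in this section:

\[\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i, \qquad s^2=\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2, \qquad s=\sqrt{s^2}\]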

Percentiles and Quartiles:

Percentiles: Values that divide the data into specific percent intervals. For example, the 75th percentile is the value below which 75% of the data falls.
Quartiles: Values that divide the data into four equal parts. The first quartile (Q1) is the 25th percentile, the median (Q2) is the 50th percentile, and the third quartile (Q3) is the 75th percentile.

Interquartile Range (IQR):

The difference between the third quartile (Q3) and the first quartile (Q1). It represents the range of the middle 50% of the data and is less affected by outliers. Outliers are data points that significantly deviate from the rest of the observations in a dataset. They can be unusually high or low values that are distant from the bulk of the data points. Outliers can have a substantial impact on data analysis and interpretation, and understanding and handling them appropriately is important.

An example:

Given the exam scores of 20 students: 86, 87, 90, 85, 80, 89, 84, 84, 83, 79, 93, 81, 76, 90, 85, 85, 88, 84, 85, 86, which can be sorted as: 76, 79, 80, 81, 83, 84, 84, 84, 85, 85, 85, 85, 86, 86, 87, 88, 89, 90, 90, 93

find the

Sample size: the number of observations
Mean ($\bar{x}$)
Median ($m$)
Mode: the most frequently occurring value
First (or lower) quartile $Q_1$
Third (or upper) quartile $Q_3$
Inter-quartile range ($IQR$)
Variance ($s^2$)
Standard deviation ($s$)
Range

Answer:

Sample size: 20
Mean ($\bar{x}$): 85
Median ($m$): the middle values are 85 and 85, so the median is the average of them, or 85.
Mode: the most frequently occurring value. It is 85, which appears four times.
First (or lower) quartile $Q_1$: 83.5 (the value may vary slightly depending on the method used).
Third (or upper) quartile $Q_3$: 87.5 (the value may vary slightly depending on the method used).
Inter-quartile range ($IQR$): 4
Variance ($s^2$): 16.32
Standard deviation ($s$): 4.04
Range: $93-76$ or 17.
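These answers can be checked in R. The sketch below uses only base R functions. It also shows why the quartiles "may vary": fivenum() reproduces 83.5 and 87.5, while quantile() uses a different default method.

```r
exam_score <- c(86, 87, 90, 85, 80, 89, 84, 84, 83, 79,
                93, 81, 76, 90, 85, 85, 88, 84, 85, 86)

length(exam_score)        # sample size: 20
mean(exam_score)          # mean: 85
median(exam_score)        # median: 85
var(exam_score)           # sample variance: about 16.32
sd(exam_score)            # standard deviation: about 4.04
diff(range(exam_score))   # range: 17

# R has no built-in function for the mode of data; count each value instead
counts <- table(exam_score)
names(counts)[counts == max(counts)]   # "85"

# fivenum() returns min, lower hinge, median, upper hinge, max;
# for these data the hinges are 83.5 and 87.5
fivenum(exam_score)

# quantile() uses a different default method and gives 83.75 and 87.25
quantile(exam_score, c(0.25, 0.75))
```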

2.2 Graphical Summaries of Quantitative Data

We will use a histogram and a boxplot to display the distribution of data.

A histogram is a graphical representation of the frequency distribution of numerical data. It displays the data’s distribution by grouping it into bins and depicting the frequency or count of data points in each bin.

Shape of the Histogram:

Symmetry: A symmetrical histogram indicates a balanced distribution around the center.
Skewness: Skewed histograms have a long tail on one side. Positive skewness indicates a tail to the right (longer right tail), and negative skewness indicates a tail to the left (longer left tail).

Spread:

Wider histograms have more spread-out data, while narrower histograms have more concentrated data.

Interpretation (left-skewed histogram): If the values represented the scores of students in a class, the class would be easy, since the majority of students have high scores.

Interpretation (right-skewed histogram): If the values represented the scores of students in a class, the class would be difficult, since the majority of students have low scores.
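To see what such histograms can look like, here is a minimal sketch with two made-up sets of scores: one where most scores are high (long tail to the left) and its mirror image where most scores are low (long tail to the right). All values are hypothetical.

```r
# Hypothetical scores for an "easy" class: most scores are high
easy_class <- c(55, 70, 78, 82, 85, 88, 90, 92, 93, 95, 96, 97, 98, 99, 100)

# Mirror the scores to get a "difficult" class: most scores are low
hard_class <- 155 - easy_class

# Draw the two histograms side by side
old_par <- par(mfrow = c(1, 2))
hist(easy_class, main = "Easy class (left-skewed)", xlab = "Score")
hist(hard_class, main = "Difficult class (right-skewed)", xlab = "Score")
par(old_par)
```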

Boxplot (Box-and-Whisker Plot):

A boxplot displays the distribution of data through minimum, maximum, quartiles, median, and potential outliers.

From the boxplots, the distribution of data A is symmetric, B is left-skewed, and C is right-skewed.

The box represents the middle 50% of the data (between the first and third quartiles).

The length of the box indicates the spread of this middle 50%.

The line inside the box represents the median (second quartile) of the data.

The whiskers (the two lines connecting to the box) extend to the smallest and largest data points that lie within a certain range (from $Q_1 - 1.5\cdot IQR$ to $Q_3 + 1.5\cdot IQR$). Any data point beyond this range is considered an outlier.

Points outside the whiskers are potential outliers. Outliers are typically plotted individually as points beyond the whiskers.

Interpreting Both Visualizations:

Use both the histogram and boxplot together to get a comprehensive understanding of the data’s distribution, central tendency, spread, and presence of outliers.

Look for consistency between the shape of the histogram and the characteristics of the boxplot.
Pay attention to any deviations from normality, skewness, or multimodality in both visualizations.

Overall, histograms and boxplots are valuable tools for exploring and understanding the distribution of your data, identifying potential patterns, and making informed decisions about subsequent data analysis steps.

Before creating a histogram, we need to break the data into nonoverlapping intervals of equal length. The number of intervals is usually between 5 and 25. The choice of breaks is subjective, but the breaks are usually chosen to be easy numbers. Let’s say we want 5 intervals, so we need 6 break points (4 interior breaks plus the two endpoints). For our data, the range is $93-76$ or 17, so each interval must be at least $\frac{17}{5} = 3.4$ wide; we round up to a width of 4. Since the minimum of our data is 76, we start just below it at 75 and choose the intervals (75, 79], (79, 83], (83, 87], (87, 91], (91, 95]. The number of observations in each interval is given below:

(75, 79]: 2

(79, 83]: 3

(83, 87]: 10

(87, 91]: 4

(91, 95]: 1
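The counts above can be reproduced in R with cut() and table(), as in this short sketch. By default, cut() builds intervals that are open on the left and closed on the right, matching the notation (75, 79].

```r
exam_score <- c(76, 79, 80, 81, 83, 84, 84, 84, 85, 85,
                85, 85, 86, 86, 87, 88, 89, 90, 90, 93)

# Assign each score to one of the five intervals
intervals <- cut(exam_score, breaks = c(75, 79, 83, 87, 91, 95))

# Count the observations in each interval
table(intervals)
```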

We can use R code to help us draw a histogram with given breaks:

# Prepare data 
exam_score=c(76, 79, 80, 81, 83, 84, 84, 84, 85, 85, 85, 85, 86, 86, 87, 88, 89, 90, 90, 93)

# Draw a histogram using the R function hist()
# A function in a programming language is just like a set of instructions. 
# When you press a button on a refrigerator, you are calling a function to 
# execute a set of instructions.
hist(exam_score, c(75, 79, 83, 87, 91, 95))

Note that the symbol “#” indicates the whole line is a comment, not part of the code. A comment helps you understand the code.

How can you run the code here? You can’t run the code on this page. You need to follow the instructions at the beginning of this lecture note: start Posit/RStudio. Ask the instructor if you get stuck! A very short introduction to R is here: https://scsu.shinyapps.io/ProgrammingR/#section-r-data-types-and-objects

Suppose that you have tried the above code. The breaks for the produced histogram are 75, 79, 83, 87, 91, 95, which are what we wanted. If you don’t provide breaks, R will choose them for you, usually picking round (“pretty”) numbers.

hist(exam_score)

You can also produce a boxplot, which shows the minimum, first quartile ($Q_1$), median, third quartile ($Q_3$), and the maximum.

# draw a boxplot, showing the 5-number summaries, using R function boxplot() 
boxplot(exam_score)

Note that there is a dot in the boxplot, which indicates that the observation 76 is an outlier. An outlier is an observation far away from the majority of the data. Outliers are found by looking for observations that are less than $Q_1-1.5\cdot IQR$ or greater than $Q_3+1.5\cdot IQR$. If there are no outliers, the result is a standard boxplot whose whiskers reach the minimum and maximum. If there are outliers, they are drawn as isolated points, and the result is called a modified boxplot.
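The outlier rule can be applied by hand in R. This sketch takes the quartiles from fivenum(), which for these data match the values 83.5 and 87.5 found earlier; the names lower_fence and upper_fence are introduced here for illustration.

```r
exam_score <- c(76, 79, 80, 81, 83, 84, 84, 84, 85, 85,
                85, 85, 86, 86, 87, 88, 89, 90, 90, 93)

# fivenum() returns min, Q1, median, Q3, max (using hinges)
q   <- fivenum(exam_score)
q1  <- q[2]                    # 83.5
q3  <- q[4]                    # 87.5
iqr <- q3 - q1                 # 4

lower_fence <- q1 - 1.5 * iqr  # 77.5
upper_fence <- q3 + 1.5 * iqr  # 93.5

# Observations outside the fences are flagged as outliers
outliers <- exam_score[exam_score < lower_fence | exam_score > upper_fence]
outliers                       # 76
```

The maximum score 93 lies just inside the upper fence of 93.5, so only 76 is flagged.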

2.3 Standard Scores or z-Scores

Standardization is a statistical technique used to transform individual data points into a standard scale. This process allows for meaningful comparisons between data points that might originally be in different units or have different scales.

In standardization, each data point is transformed by subtracting the mean from the point then dividing the result by the standard deviation. The formula is \[z=\frac{x-\bar{x}}{s}\]

Where:

z is the standardized score (z-score) of the data point.
x is the original data point.
$\bar{x}$ is the mean of the data.
$s$ is the standard deviation of the data.

The resulting z-score represents how many standard deviations a data point is away from the mean. A positive z-score indicates that the data point is above the mean, while a negative z-score indicates that it’s below the mean.

Benefits of Standardization:

Comparison: Standardization allows you to compare data points from different distributions or with different units on a common scale.
Identifying Outliers: Extreme z-scores (far from 0) can help identify outliers in a dataset.

Standardization is commonly used in various fields. For instance, in machine learning, standardizing features ensures that they have comparable scales, which can improve the performance of certain algorithms that are sensitive to feature scales.

Example 1.

For the exam scores data, the mean is 85 and the standard deviation is 4.039. Let’s find the z scores of the minimum score 76 and maximum score 93.

For 76, the z score is $\frac{76-85}{4.039}$ or $-2.23$, meaning that the minimum score 76 is 2.23 standard deviations below the class mean. For 93, the z score is $\frac{93-85}{4.039}$ or 1.98, meaning that the maximum score 93 is 1.98 standard deviations above the class mean.
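The same z-scores can be computed for the whole class at once in R. This is a minimal sketch using only base R.

```r
exam_score <- c(76, 79, 80, 81, 83, 84, 84, 84, 85, 85,
                85, 85, 86, 86, 87, 88, 89, 90, 90, 93)

# Subtract the mean, then divide by the standard deviation
z <- (exam_score - mean(exam_score)) / sd(exam_score)

round(z[exam_score == 76], 2)   # -2.23
round(z[exam_score == 93], 2)   #  1.98

# After standardization, the z-scores have mean 0 and standard deviation 1
round(mean(z), 10)
sd(z)
```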

Example 2.

Tom’s physics score is 88. His class has a mean of 80 and standard deviation of 2.5. Tom’s chemistry score is 91. His class has a mean of 95 and standard deviation of 1.6. Which score is relatively better?

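Tom’s z-score on physics is $\frac{88-80}{2.5}=3.2$, and his z-score on chemistry is $\frac{91-95}{1.6}=-2.5$. His physics score is 3.2 standard deviations above the class mean, while his chemistry score is 2.5 standard deviations below it. Therefore, he does relatively better on physics.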
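The comparison can be sketched in R with a small helper function (z_score is a hypothetical name introduced here). Note the order of subtraction: the score comes first, so a score below the class mean gives a negative z-score.

```r
# z-score: how many standard deviations a value lies from the mean
z_score <- function(x, mean, sd) (x - mean) / sd

z_physics   <- z_score(88, mean = 80, sd = 2.5)   # about  3.2
z_chemistry <- z_score(91, mean = 95, sd = 1.6)   # about -2.5

# The larger z-score marks the relatively better result
z_physics > z_chemistry
```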

Numerical Summaries of Categorical Data

Given the exam grades of 20 students: A, B, A, B, C, A, B, C, C, B, A, B, A, B, B, C, D, B, F, D

find the

Proportion of students who get A grades: $\frac{5}{20} = 0.25$
Proportion of students who get B grades:
Proportion of students who get C grades:
Proportion of students who get D grades:
Proportion of students who get F grades:
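Once you have worked out the proportions by hand, you can check them in R. The sketch below counts each grade with table() and converts the counts to proportions with prop.table().

```r
grade <- c("A", "B", "A", "B", "C", "A", "B", "C", "C", "B",
           "A", "B", "A", "B", "B", "C", "D", "B", "F", "D")

# Count how many students received each grade
counts <- table(grade)
counts

# Divide each count by the total number of students
prop.table(counts)
```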

Graphical Summaries of Categorical Data

# Prepare data
grade = c("A", "B", "A", "B", "C", "A", "B", "C", "C", "B", "A", "B", "A", "B", "B", "C", "D", "B", "F", "D")

# Make a frequency table
# (avoid naming it T, which is R's shorthand for TRUE)
grade_table = table(grade)

# Make a barplot using the previous table
barplot(grade_table)
title("Grade Distribution of JohnDoe's Students")

Here, we created a vector called grade that contains the exam grades of 20 students. Each grade is represented as a character value (“A”, “B”, “C”, etc.).

The code table(grade) calls the function table() to create a frequency table of the exam grades. It counts how many times each grade appears in the grade vector.

The barplot() function takes the frequency table as input and generates a barplot.

The title() function adds a title to the barplot.

Here are some insights that can be gleaned from the barplot:

Grade Frequencies:

The most common grade among the students is grade “B,” followed by grade “A” and then “C.” Grades “D” and “F” are less common, indicating that fewer students received lower grades.

Distribution Spread:

The barplot shows that the distribution of grades is not evenly spread across all possible grades. There is variation in the frequency of each grade.

No “E” Grade:

It’s interesting to note that there is no grade “E” in the dataset. This might be due to the grading scale used or the specific context of the assessment.

Performance Range:

No student received an “A+,” which suggests that the grading scale may top out at “A.” At the other end, one student received an “F.”

Majority Grades:

The majority of students (17 of 20, or 85%) received grades “A,” “B,” or “C,” which implies that the overall performance was relatively good.

Uncommon Grades:

The grades “D” and “F” are less common (3 of 20 students, or 15%), indicating that only a small percentage of students performed poorly in this assessment.

Potential Areas for Improvement:

In the context of the educational institution, educators might want to analyze the distribution of grades to identify areas where students might need more support or where curriculum adjustments are needed.

Two Quantitative Variables: Scatterplot, Correlation and Regression

A scatterplot is a graphical representation that helps us understand the relationship between two quantitative variables. It consists of points on a graph where each point corresponds to a pair of values from the two variables being studied. The x-axis typically represents one variable, and the y-axis represents the other variable. By plotting the points, we can visually identify patterns, trends, or potential correlations between the variables.

Correlation measures the strength and direction of the linear relationship between two quantitative variables. The correlation coefficient (or just correlation), often denoted as $r$, ranges from $-1$ to 1. A positive correlation indicates that as one variable increases, the other tends to increase as well. A negative correlation indicates that as one variable increases, the other tends to decrease. However, correlation does not imply causation; it only measures the degree of linear association.
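One common way to write the sample correlation coefficient, for $n$ pairs $(x_i, y_i)$ with means $\bar{x}$, $\bar{y}$ and standard deviations $s_x$, $s_y$, is as an average product of z-scores. The subscripts $x$ and $y$ are notation introduced here for this formula only:

\[r=\frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)\]

Each factor is a z-score, so $r$ is positive when above-average values of one variable tend to go with above-average values of the other, and negative when they tend to go with below-average values.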

Regression analysis is a statistical technique used to model and examine the relationship between a dependent variable and one or more independent variables. In simple linear regression, the goal is to fit a straight line to the data points that best represents the relationship between the variables. This line is used to make predictions about the dependent variable based on the values of the independent variable(s). The coefficients of the regression equation represent the $y$-intercept and the slope of the line, which provide insights into how the variables are related.

In summary, scatterplots provide a visual representation of data points, correlation measures the strength and direction of a linear relationship, and regression analysis helps us predict one variable based on the values of another variable using a mathematical model. These concepts are fundamental in exploring, analyzing, and interpreting relationships between quantitative variables in various fields such as statistics, economics, social sciences, and more.

An example:

# Example data: Study Hours and Exam Scores
study_hours <- c(2, 3, 1.5, 4, 2.5)
exam_scores <- c(65, 75, 60, 85, 70)

# Creating a scatterplot
plot(study_hours, 
     exam_scores, 
     main="Scatterplot of Study Hours vs. Exam Scores", 
     xlab="Study Hours", ylab="Exam Scores", pch=19, col="blue"
    )

# Calculating correlation coefficient
cor(study_hours, exam_scores)

## [1] 1

# Performing a simple linear regression 
# with exam_scores being the response variable and study_hours the explanatory variable
linear_model <- lm(exam_scores ~ study_hours)

# Print the estimated y-intercept and slope of the straight line
linear_model

## 
## Call:
## lm(formula = exam_scores ~ study_hours)
## 
## Coefficients:
## (Intercept)  study_hours  
##          45           10

In this R code, we have used quite a few R functions (a function in programming is a named set of instructions that does its work behind the scenes when you call it):

We define the study hours and exam scores as vectors.
We create a scatterplot using the plot function, with study hours on the x-axis and exam scores on the y-axis.
The specification of the parameter “pch” to 19 makes the data points on the scatterplot appear as solid circles.
The specification of the parameter “col” to blue makes the data points on the scatterplot appear blue.
We calculate the correlation coefficient using the cor function.
We perform a simple linear regression using the lm function, where we regress exam scores on study hours.

Copying and pasting the code into Posit/RStudio gives the $y$-intercept 45 and slope 10, so the equation of the fitted straight line can be written as

\[exam\_score = 45 + 10 \cdot (study\_hours)\]

How can we interpret the slope 10? It means that for each extra hour a student studies, the student’s predicted exam score increases by 10 points.

For a student who studies for 3.5 hours, the predicted exam score is $45+10\cdot (3.5) = 45+35=80$. If we know that the real score of the student is 78, then the difference $78-80=-2$ means that the model overpredicts the student’s exam score by 2 points.
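The same prediction can be obtained from the fitted model with predict(). This sketch refits the model so that it runs on its own; new_student is a hypothetical name for the one-row data frame holding the new value.

```r
study_hours <- c(2, 3, 1.5, 4, 2.5)
exam_scores <- c(65, 75, 60, 85, 70)
linear_model <- lm(exam_scores ~ study_hours)

# Predicted score for a student who studies 3.5 hours;
# the column name must match the explanatory variable in the model
new_student <- data.frame(study_hours = 3.5)
predicted_score <- predict(linear_model, newdata = new_student)
predicted_score        # 80

# Residual = real score - predicted score
78 - predicted_score   # -2
```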

We can find the predicted score for each of the students in the data and calculate the difference between the real score and the predicted score. These differences are called residuals.

Plotting these residuals versus the predicted values allows us to check the adequacy of the model (i.e., the fitted equation). Here is the code:

# Use the R function resid to find residuals
residuals = resid(linear_model)

# Use the R function predict to find predicted values
predicted = predict(linear_model)

plot(predicted, residuals)

# Add a horizontal dashed reference line at zero
abline(h = 0, lty = 2)

The residuals are close to the horizontal dashed zero-line, so the model fit is adequate.

Case Study 1

Analyze the data of students’ study time (in hours per week) on a Stat 101 course:

2.61, 2.99, 3.33, 2.12, 2.6, 2.78, 2.92, 2.83, 1.78, 2.92, 2.49, 1.97, 2.31, 3.01, 2.01, 2.48, 2.3, 2.6, 2.27, 3.19, 2.92, 3.14, 1.76, 2.73, 2.04, 2.63, 2.36, 2.33, 2.57, 2.53, 1.92, 2.05, 2.61, 2.38, 2.23, 2.52, 2.59, 3.31, 2.28, 2.47, 3, 1.87, 1.92, 2.5, 2.52, 2.72, 2.9, 2.76, 2.39, 2.83.

Interpret the result.

A Solution

We use the R software to conduct the analysis.

# Prepare data by creating a vector of values
study_time <- c(2.61, 2.99, 3.33, 2.12, 2.6, 2.78, 2.92, 2.83, 1.78, 2.92, 2.49, 1.97, 2.31, 3.01, 2.01, 2.48, 2.3, 2.6, 2.27, 3.19, 2.92, 3.14, 1.76, 2.73, 2.04, 2.63, 2.36, 2.33, 2.57, 2.53, 1.92, 2.05, 2.61, 2.38, 2.23, 2.52, 2.59, 3.31, 2.28, 2.47, 3, 1.87, 1.92, 2.5, 2.52, 2.72, 2.9, 2.76, 2.39, 2.83)

# Call the summary() function to summarize the numeric characteristics of data
summary(study_time)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.760   2.285   2.525   2.526   2.817   3.330

# Call the var() and sd() functions to calculate the variance and standard deviation.
var(study_time)

## [1] 0.1575432

sd(study_time)

## [1] 0.3969172

# We can combine the summary statistics, variance, and standard deviation.
c(summary(study_time), Var. = var(study_time), S.D. = sd(study_time))

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      Var.      S.D. 
## 1.7600000 2.2850000 2.5250000 2.5258000 2.8175000 3.3300000 0.1575432 0.3969172

This line combines the summary statistics, variance, and standard deviation into a single output. The c() function is used to concatenate these values into a single vector.

# Graph the distribution of the data using the hist() or boxplot() functions.
hist(study_time)

boxplot(study_time)

Finally, the hist() function generates a histogram that visualizes the distribution of study time values. A histogram displays the frequency or count of data points in various intervals or bins. The boxplot() function creates a box plot that provides a graphical representation of the data’s central tendency (i.e., the median), spread, and potential outliers.

Interpretation of Results

The average study time per week among the students is approximately 2.53 hours.
The median study time is 2.53 hours, indicating that half of the students studied for 2.53 hours or less, while the other half studied for more than 2.53 hours.
The first (or lower) quartile of the study time is 2.29 hours, indicating that 25% of the students studied for 2.29 hours or less, while the remaining 75% students studied for more than 2.29 hours.
The third (or upper) quartile of the study time is 2.82 hours, indicating that 75% of the students studied for 2.82 hours or less, while the remaining 25% students studied for more than 2.82 hours.
The variance is approximately 0.16 square hours; it is roughly the average squared deviation of the study times from the mean.
The standard deviation is around 0.40 hours, which indicates the typical spread of study times around the mean in the original units. (I prefer standard deviation to variance)
The histogram shows the distribution of study times, with a peak around 2.5 hours and a relatively symmetrical distribution.
The boxplot shows the distribution of study times with quartiles, with a median around 2.5 hours and a relatively symmetrical distribution. It also shows the range of the study times to be roughly between 1.76 and 3.33 (I’m cheating, since it’s hard to tell the range). The box itself represents the interquartile range of 2.3 to 2.8, indicating that the middle 50% of study times fall within this range.

Overall, the data suggests that students tend to study for around 2.53 hours per week on average, with some variability ($\pm 0.40$ hours) around this average.

Chapter 3. Confidence Intervals

In statistics, a population and a parameter are fundamental concepts that help us understand and describe data. Let’s define each term and provide a few examples:

Population:

A population refers to the entire group or collection of individuals, items, or entities that a statistical study aims to investigate or describe. It’s the complete set of subjects that share a common characteristic of interest. In many cases, it’s not feasible or practical to collect data from an entire population, so we often work with samples to make inferences about populations.

Parameter:

A parameter is a numerical value or characteristic that summarizes some aspect of a population. Parameters provide specific information about the entire population but are often unknown because it’s difficult to collect data from every member of the population. Parameters are typically fixed and unchanging for a given population.

Examples of Populations and Parameters:

Population: All the registered voters in a country.

Parameter: The proportion of registered voters who support a particular candidate.

Population: All the apples in an orchard.

Parameter: The average weight of all the apples in the orchard.

Population: All the students attending a university.

Parameter: The median household income of the families of all the students.

Population: All the manufactured light bulbs in a factory.

Parameter: The proportion of light bulbs that are defective.

Population: All the houses in a city.

Parameter: The average age of the houses in the city.

In each of these examples, the population is the complete group of individuals, objects, or entities that we’re interested in studying. The parameter is a specific characteristic or numerical value associated with that population. In practice, it’s often not feasible to collect data from the entire population, so we use statistical techniques to gather information from samples and make inferences about the population based on those samples.

A point estimate is a single value that is calculated from a sample of data and used to approximate an unknown population parameter. It provides a “best guess” of the true parameter value based on the available sample data. Here are three examples of point estimates for different population parameters:

Population Parameter: Mean height of all students in a school.

Point Estimate: The calculated average height of a sample of say 100 students from the school.

Population Parameter: Proportion of defective products in a manufacturing process.

Point Estimate: The observed proportion of defects in a sample of 500 products produced by the manufacturing process.

Population Parameter: Median income of households in a city.

Point Estimate: The computed median income of a random sample of 200 households in the city.

In each of these examples, the population parameter is an unknown value that characterizes a specific aspect of the entire population.

3.1 Sampling Distribution

A point estimate is derived from a sample and provides a single value that we use to estimate the true parameter value. It’s important to note that while point estimates provide a useful starting point, they may not be exactly equal to the true population parameter due to variability in sampling. Confidence intervals are commonly used alongside point estimates to assess the precision and reliability of the estimates.

We now use the app here https://www.lock5stat.com/StatKey/index.html to study variability in sampling.

Go to the page
Locate the “Sampling Distributions” row and click the “Mean” link.
Now, you are seeing a new page. Under the word “StatKey”, there is a dropdown menu which currently shows “Baseball Players…”. Click it and choose “US 2-year colleges”.
Click “Show Data Table” (located just to the right of the dropdown). You will see the content of the data.
Close the data table.
The app will choose a sample and automatically set the size of the sample to 10. Click the number “10” and change it to whatever you want, say 25.
Now, click “Generate 1 sample”. On the left panel, you will see a dot, which corresponds to a value under the panel. This number is the mean of the sample of 25 randomly chosen colleges.
Click “Generate 1 sample” one more time. You now see another dot with a mean for the new sample just generated.
If you click “Generate 1000 samples”, then you will have a total of 1002 samples generated and 1002 means calculated. These sample means will form a dot plot on the left panel. This plot gives an approximation to the sampling distribution of all possible sample means. The standard deviation of those generated sample means is called the standard error (denoted by $se$), which tells us how the sample statistic varies from sample to sample.
On the right panel, there are two plots. The top one shows the population distribution (as a histogram). The bottom one shows the distribution of the most recent sample (also as a histogram). When the sample size is large, the sample distribution is very similar to the population distribution.
The sampling distribution does not have to resemble the population distribution, but theoretically the two share the same mean. As the sample size gets larger, the sampling distribution gets closer and closer to a bell shape.
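The StatKey steps above can also be imitated in R. The following is a minimal sketch, not the app's own procedure: the population is a hypothetical right-skewed one generated with rexp() (it is not the “US 2-year colleges” data), and the names population, sample_means, and se are chosen here for illustration.

```r
# Hypothetical right-skewed population (NOT the StatKey college data)
set.seed(1)
population <- rexp(10000, rate = 1 / 50)

n <- 25       # sample size, as in the walkthrough above
reps <- 1000  # number of samples to generate

# Draw `reps` random samples of size n and record the mean of each one
sample_means <- replicate(reps, mean(sample(population, n)))

# The center of the sample means should be close to the population mean
c(population_mean = mean(population), mean_of_sample_means = mean(sample_means))

# Standard error: the standard deviation of the generated sample means
se <- sd(sample_means)
se

# A histogram plays the role of StatKey's dot plot
hist(sample_means, main = "Approximate sampling distribution of the mean")
```

Even though the assumed population is skewed, the histogram of the sample means should look closer to a bell shape, and it should look more so if n is increased.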

3.2 Constructing and Interpreting Confidence Intervals

The sample statistic varies from sample to sample. If the sampling distribution is bell-shaped, roughly 95% of the values of the sample statistic are within two standard deviations of the center (which equals the value of the unknown parameter). This suggests that 95% of the time the interval

\[(statistic - 2\cdot se, ~~statistic + 2\cdot se)\]

often written as \[statistic \pm 2\cdot se\] covers the value of the unknown parameter. This interval is called a confidence interval for the parameter.

In practice, we usually have a sample, but we don’t know the value of the standard error $se$ and we must estimate it in order to construct a ready-to-use confidence interval.

Constructing Bootstrap Confidence Intervals

One method of estimating the standard error is through resampling. The process of resampling is as follows:

Resample with replacement from the original sample so that the size of the resample is the same as that of the original sample.
Repeat the resampling, say, $B$ times. For example, $B$ can be 500. For each resample, calculate a value using the same formula as you used for the original sample statistic. This value is called a bootstrap statistic. You should have a total of $B$ such bootstrap statistics. The distribution of these bootstrap statistics is called the bootstrap distribution. The standard deviation of these bootstrap statistics can be used to estimate the standard error $se$.
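The resampling process can be sketched in a few lines of R. This is only an illustration under assumptions: the vector original holds made-up values, and the names boot_means and se_boot are chosen here. Resampling is done with replacement, since otherwise every resample of the same size would simply reproduce the original sample.

```r
# Hypothetical original sample (illustrative values only)
original <- c(98.6, 97.9, 98.2, 98.8, 97.6, 98.4, 98.0, 98.3, 98.7, 97.8)

set.seed(2)
B <- 500  # number of resamples

# Each resample has the same size as the original sample and is drawn
# with replacement; the bootstrap statistic here is the mean
boot_means <- replicate(B, mean(sample(original, length(original), replace = TRUE)))

# Estimated standard error: standard deviation of the bootstrap statistics
se_boot <- sd(boot_means)
se_boot

# Rough 95% interval: sample statistic plus or minus two standard errors
mean(original) + c(-2, 2) * se_boot
```

Replacing mean with median in both places would give the corresponding sketch for a population median.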

Example 1.

Go to the page https://www.lock5stat.com/StatKey/index.html. Click “CI for Single Mean, Median, St.Dev.” On the new page, click the tab right below “StatKey” and choose BodyTemp50 (Temperature). Click “Show Data Table” to see the sample data. Close the data table. Now, click “Generate 1000 Samples”. The left panel shows the bootstrap distribution. On the right panel, the top plot shows the distribution of the original sample, while the bottom plot shows the distribution of the 1000th bootstrap sample. The estimated standard error is shown at the upper-right corner of the left panel. The sample statistic (i.e., the mean of the original sample) is given above the top plot of the right panel.

The bootstrap samples will vary from run to run. Based on resampling on my laptop, the estimated standard error is $se=0.107$. The mean of the original sample does not change and is 98.26.

Now, the 95% bootstrap confidence interval is constructed as follows:

\[98.26\pm 2\cdot 0.107\] or $98.26\pm 0.214$ or from 98.046 to 98.474.

Example 2.

If we click the tab right above the left panel and choose “median” as our parameter, then the 95% bootstrap confidence interval for the population median is

\[98.2\pm 2\cdot 0.120\] where the number 98.2 is the median of the original sample and the number 0.120 is the corresponding estimated standard error.

The interval can be written as $98.2\pm 0.24$ or from 97.96 to 98.44.

Bootstrap Confidence Intervals Using Percentiles

Skip!

Chapter 4. Hypothesis Tests

Hypothesis tests are used to answer a wide range of questions across various fields. Here are some examples of questions that can be answered using hypothesis tests:

Does a new medication lead to a significant reduction in blood pressure compared to a placebo?
Is there a difference in recovery time between two surgical procedures? (in Medicine/Healthcare)
Does a new teaching method improve students’ test scores compared to the traditional method?
Is there a significant difference in reading comprehension levels between two different textbooks? (in education)
Is there a relationship between air pollution levels and respiratory illness rates in a specific city?
Does the mean temperature differ significantly between two different time periods? (Environmental Science)
Is there a difference in anxiety levels before and after participating in a stress management workshop?
Does a new therapy approach lead to a significant reduction in symptoms for patients with a particular mental disorder? (in Psychology)
Is there a significant difference in species diversity between two different habitats within a specific ecosystem?
Does the introduction of an invasive species impact the population sizes of native species? (in Ecology)
Is there a significant association between a specific genetic mutation and the occurrence of a particular disease?
Does a certain genetic factor influence the expression of a specific trait in an organism? (in Genetics)
Does a specific type of fertilizer lead to a significant increase in plant growth compared to another type?
Is there evidence to support that a particular plant species has a competitive advantage over others in a specific ecosystem? (in Botany)
Is there a significant difference in heart rate before and after exposure to a certain stimulus in an animal model?

Does a specific drug lead to a significant reduction in blood pressure in patients with a certain medical condition? (in Physiology)

4.1 Introducing Hypothesis Tests

Hypothesis testing is a fundamental concept in statistics used to make informed decisions about population parameters based on sample data. It involves formulating two competing hypotheses, the null hypothesis (H0) and the alternative hypothesis (Ha), and assessing the evidence from the sample data to determine which hypothesis is more likely.

The null hypothesis (H0) is the default assumption or status quo. It typically states that there is no effect, no difference, or no relationship in the population.

The alternative hypothesis (Ha) is the assertion you’re trying to test. It states that there is a significant effect, difference, or relationship in the population. The alternative hypothesis (aka the researcher’s hypothesis) might involve a “<” sign, a “$\ne$” sign, or a “>” sign, in which case the test is called a left-tail, two-tail, or right-tail test, respectively.

Examples:

To answer each of the following questions, specify the null and alternative hypothesis.

Research Question: Does a new fertilizer lead to an increase in crop yield compared to the standard fertilizer?

Null Hypothesis (H0): The new fertilizer does not lead to an increase in crop yield compared to the standard fertilizer, or $\mu_1=\mu_2$ in notation, where $\mu_1$ is the mean yield for the new fertilizer and $\mu_2$ for the standard fertilizer.
Alternative Hypothesis (Ha): The new fertilizer leads to an increase in crop yield compared to the standard fertilizer, or $\mu_1>\mu_2$ in notation.

Research Question: Is there a relationship between caffeine consumption and heart rate in adults?

Null Hypothesis (H0): There is no relationship between caffeine consumption and heart rate in adults.
Alternative Hypothesis (Ha): There is a relationship between caffeine consumption and heart rate in adults.

Research Question: Does a specific exercise regimen result in a decrease in body fat percentage among overweight individuals?

Null Hypothesis (H0): The specific exercise regimen does not result in a decrease in body fat percentage among overweight individuals, or $\mu_1=\mu_2$ in notation, where $\mu_1$ is the mean body fat percentage of overweight individuals after following the regimen and $\mu_2$ is the mean before following it.
Alternative Hypothesis (Ha): The specific exercise regimen results in a decrease in body fat percentage among overweight individuals, or $\mu_1<\mu_2$ in notation.

Research Question: Is there a difference in the average survival time of two different groups of cancer patients treated with different therapies?

Null Hypothesis (H0): There is no difference in the average survival time of the two groups of cancer patients treated with different therapies, or $\mu_1=\mu_2$ in notation, where $\mu_1$ and $\mu_2$ are the mean survival times for the two groups, respectively.
Alternative Hypothesis (Ha): There is a difference in the average survival time of the two groups of cancer patients treated with different therapies, or $\mu_1\ne \mu_2$ in notation.

Research Question: Does exposure to a specific chemical lead to changes in the behavior of a particular animal species?

Null Hypothesis (H0): Exposure to the specific chemical does not lead to changes in the behavior of the animal species.
Alternative Hypothesis (Ha): Exposure to the specific chemical leads to changes in the behavior of the animal species.

Research Question: Is there a relationship between temperature and the rate of enzyme activity?

Null Hypothesis (H0): There is no relationship between temperature and the rate of enzyme activity.
Alternative Hypothesis (Ha): There is a relationship between temperature and the rate of enzyme activity.

Research Question: Does a certain dietary supplement result in a significant increase in bone density among postmenopausal women?

Null Hypothesis (H0): The dietary supplement does not result in a significant increase in bone density among postmenopausal women.
Alternative Hypothesis (Ha): The dietary supplement results in a significant increase in bone density among postmenopausal women.

Research Question: Is there a difference in the flowering time of two different varieties of a plant species?

Null Hypothesis (H0): There is no difference in the flowering time of the two varieties of the plant species.
Alternative Hypothesis (Ha): There is a difference in the flowering time of the two varieties of the plant species.

Research Question: Is there a significant difference in the proportions of patients who experience side effects between two different medication treatments?

Null Hypothesis (H0): The proportion of patients with side effects is equal for both medication treatments, or $p_1 = p_2$, where $p_1$ is the proportion of patients with side effects for the first medication group and $p_2$ for the other medication group.

Alternative Hypothesis (Ha): The proportion of patients with side effects differs between the two medication treatments, or $p_1 \ne p_2$.

Research Question: Is there a significant correlation between hours of study and exam scores for a group of students?

Null hypothesis (H0): There is no correlation between study hours and exam scores ($\rho=0$), where $\rho$ is the correlation coefficient at the population level.
Alternative hypothesis (Ha): There is a correlation between study hours and exam scores ($\rho\ne 0$).

4.2 Measuring Evidence with $P$-values

The $P$-value for a hypothesis test is calculated by first assuming that the null hypothesis is true. It is the probability, under that assumption, of obtaining a sample statistic at least as extreme as the one actually observed. When this value is too small, that is, smaller than a pre-selected number (called the significance level and denoted by $\alpha$), the null hypothesis is rejected. Otherwise, it is not rejected.

The $P$-value can be estimated by generating many samples consistent with the null hypothesis. These samples are called randomization samples.

How to generate randomization samples? Let’s show an example with the StatKey software here: https://www.lock5stat.com/StatKey/index.html

Click the tab “Randomization Test for a Mean”.
On a new page, click the tab right below the word “StatKey”.
Choose “Body Temperature”.
Click “Show Data Table” and then close it. It just shows you the sample data.
Now, before generating 1000 samples by clicking “Generate 1000 samples”, we change the value (to the right of “$\mu=$”) to 98. This means we are testing the null hypothesis $\mu=98$. We also need to set the alternative hypothesis.
Let’s check the box indicated by “Left Tail”. This means we are testing the null hypothesis against the alternative hypothesis $\mu<98$.

The boxed number above the red part is the $p$-value we want. Based on the result from my laptop, the $p$-value is 0.025. Your result might be different, since it involves a procedure that generates random numbers, but our results should be close.

The left panel shows the randomization distribution. The top plot on the right panel shows the distribution of the original sample. The bottom one shows the distribution of the 1000th randomization sample.
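For readers who want to see the idea in code, here is a minimal R sketch of one common way to generate randomization samples for a single mean: shift the sample so that its mean equals the null value, then resample with replacement. The data in x are made up, the names shifted, rand_means, and p_value are chosen here, and this is not claimed to reproduce StatKey's output.

```r
# Hypothetical sample (illustrative values only)
x <- c(97.4, 98.1, 97.9, 97.6, 98.3, 97.8, 98.0, 97.5, 97.7, 98.2, 97.9, 97.6)
mu0 <- 98  # value of the mean under the null hypothesis

# Shift the data so that the null hypothesis is true for the shifted sample
shifted <- x - mean(x) + mu0

set.seed(3)
# Each randomization sample is a resample (with replacement) of the shifted data
rand_means <- replicate(1000, mean(sample(shifted, length(shifted), replace = TRUE)))

# Left-tail p-value: proportion of randomization means at or below the observed mean
p_value <- mean(rand_means <= mean(x))
p_value
```

For a right-tail test the comparison would be >= instead of <=.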

4.3 Determining Statistical Significance

To make a decision on whether or not a null hypothesis should be rejected, a test statistic is first calculated and a $p$-value is determined. If the $p$-value is no greater than a pre-selected threshold (called the significance level and denoted by $\alpha$), the null hypothesis is rejected (meaning that the test is significant). Otherwise, it is not rejected.

When the null hypothesis is rejected, we draw a conclusion by saying something like “The data provide sufficient evidence that …”, where “…” should be consistent with the statement under the alternative hypothesis.

When the null hypothesis is not rejected, we draw a conclusion by saying something like “The data do not provide sufficient evidence that …”, where “…” should be consistent with the statement under the alternative hypothesis.

Let’s continue the example in the previous section. We got the $p$-value 0.025.

If our pre-selected significance level is 0.05, the null hypothesis is rejected. If our pre-selected significance level is 0.01, the null hypothesis is not rejected. Our decision is up to the choice of the significance level.

4.4 A Closer Look at Testing

We have seen that when testing hypotheses, the decision depends on the choice of the significance level.

There are two types of errors we might commit.

Type I error: Rejecting a true null hypothesis.
Type II error: Failing to reject a false null hypothesis.

When type I error is worse, we use a smaller significance level when making a decision. This is intended to reduce the chance of rejecting the null hypothesis, thus reducing the chance of committing type I error.

When type II error is worse, we use a larger significance level when making a decision. This is intended to increase the chance of rejecting the null hypothesis, thus reducing the chance of committing type II error.

Examples:

Suppose a pharmaceutical company is testing a new drug for a rare disease. The null hypothesis ($H_0$) is that the drug has no effect, meaning it’s not better than a placebo. After conducting a clinical trial, the company mistakenly concludes that the drug is effective and decides to market it, even though it doesn’t actually work. Here a type I error is made.
Continuing with the pharmaceutical example, let’s say the new drug is genuinely effective at treating the disease. However, in the clinical trial, the researchers fail to detect this effectiveness, and they do not reject the null hypothesis. As a result, the drug is not approved for use, and patients miss out on a potentially life-saving treatment. Here a type II error is made.

When using a significance level such as 0.05, about 5% of tests of a true null hypothesis will incorrectly reject it. This issue becomes even more important when many hypothesis tests are carried out for the same true null hypothesis at the same significance level, since about 5% of these tests lead to a false rejection. For instance, if 100 independent research teams around the world all test for an effect that does not actually exist, about 5 of these teams will incorrectly report that an effect does exist. This is the so-called “multiple testing issue”.
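A small simulation can make the 100-team scenario concrete. The sketch below is illustrative only: each hypothetical team draws 30 observations from a normal population whose true mean really is 0 and runs a one-sample t test of $\mu=0$; the sample size and the population are assumptions made here.

```r
set.seed(4)
alpha <- 0.05
teams <- 100

# The null hypothesis (mu = 0) is true for every team, so every rejection
# below is a type I error
p_values <- replicate(teams, t.test(rnorm(30, mean = 0, sd = 1), mu = 0)$p.value)

# Number of teams that incorrectly reject the true null hypothesis
false_rejections <- sum(p_values <= alpha)
false_rejections
```

The count changes with the seed, but over many runs it should average about 5 out of 100.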

Often, only significant results get published. If many independent tests are conducted, some of them will be significant just by chance, and it may be only these studies that we hear about.

Sometimes, even though a test is significant, it may not necessarily mean that the effect found is large enough for it to be considered important. That is, statistical significance does not always imply practical significance. Usually, large samples tend to give significant results (i.e., small p-values), since small differences can be detected with large samples.
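The effect of sample size on significance can be illustrated with simulated data. In this sketch the true mean is assumed to be 0.05 (with standard deviation 1), a departure from the null value 0 that would rarely matter in practice; the two sample sizes are arbitrary choices.

```r
set.seed(5)
# Same tiny true effect, two very different sample sizes
small_sample <- rnorm(20, mean = 0.05, sd = 1)
large_sample <- rnorm(100000, mean = 0.05, sd = 1)

p_small <- t.test(small_sample, mu = 0)$p.value
p_large <- t.test(large_sample, mu = 0)$p.value

c(n_20 = p_small, n_100000 = p_large)
```

With the large sample the p-value is extremely small even though the effect itself is tiny, whereas the small sample will usually not detect it.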

4.5 Making Connections between a Confidence Interval and a Hypothesis Test

For a two-tail alternative hypothesis, if the value under the null hypothesis does not fall within a $1-\alpha$ confidence interval, then at the significance level $\alpha$, the null hypothesis can be rejected.

For example, we want to test whether the mean weight of walleye in a small pond is 1.5 pounds at the significance level 0.05. If a sample of 25 walleye gives a 95% confidence interval of (0.84, 1.53), we can’t reject the null hypothesis, since the hypothesized value 1.5 falls in the confidence interval. However, if a sample of 25 walleye gives a 95% confidence interval of (0.92, 1.48), we reject the null hypothesis, since the hypothesized value 1.5 does not fall in the confidence interval.
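The rule can be written as a one-line check. The helper reject_null() below is a hypothetical function defined only for this illustration; it applies to two-tail tests where the confidence level is $1-\alpha$.

```r
# Hypothetical helper: TRUE when the null value lies outside the interval,
# i.e. when the two-tail test rejects at the matching significance level
reject_null <- function(null_value, ci) {
  null_value < ci[1] || null_value > ci[2]
}

reject_null(1.5, c(0.84, 1.53))  # FALSE: 1.5 is inside, do not reject
reject_null(1.5, c(0.92, 1.48))  # TRUE: 1.5 is outside, reject
```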

Chapter 5. Approximating with a Distribution

If we go back to the sections introducing sampling distributions, bootstrap distributions, and randomization distributions, we can see that many of them have bell-like shapes. This is not a coincidence. Under fairly general conditions (such as a sufficiently large sample size), the distributions of many statistics will follow this same bell-shaped pattern. This result is called the Central Limit Theorem. The formal name for this shape is a normal distribution.

A normal distribution has two parameters: the mean ($\mu$) and the standard deviation ($\sigma$). The following graph shows the curves of a few normal distributions:

curve(dnorm(x, 10, 3), xlim=c(-10, 30), ylab = "Density")
curve(dnorm(x, 10, 5), ylab = "Density", add = TRUE, col = "red")
curve(dnorm(x, 15, 5), ylab = "Density", add = TRUE, col = "blue")
legend(15, 0.12, legend = c("mean = 10, sd = 3", "mean = 10, sd = 5", "mean = 15, sd = 5"), 
       text.col = c("black", "red", "blue"), bty = "n")

The first curve (in black) is based on a normal distribution with a mean (μ) of 10 and a standard deviation (σ) of 3. It is plotted over the x-axis range from -10 to 30, though the range is unlimited. The y-axis label is set to “Density.”

The second curve (in red) is added to the same plot as the first one, and it represents a normal distribution with the same mean (μ) of 10 but with a larger standard deviation (σ) of 5. This curve is drawn in red and is overlaid on the first curve.

The third curve (in blue) is added to the same plot as the previous two, and it represents a normal distribution with a higher mean (μ) of 15 and the same standard deviation (σ) of 5. This curve is drawn in blue and is overlaid on the previous curves.

The resulting plot shows three overlaid normal distribution curves with different means and standard deviations. The first curve (in black) has the smallest spread, while the second (in red) and third (in blue) curves have larger spreads. These curves illustrate how changing the mean and standard deviation affects the shape and width of the probability density function for a normal distribution.

When a value $X$ from a normal distribution with mean $\mu$ and standard deviation $\sigma$ is converted to a new scale by $Z=\frac{X-\mu}{\sigma}$, the result is called the $z$-score. Values on this new scale $Z$ follow the so-called standard normal distribution, whose mean is 0 and standard deviation is 1. The curve of the standard normal distribution looks like the following:

For all normal distributions, there is a common result:

https://hyperskill.org/learn/step/24676#step-title

Within one standard deviation (σ) of the mean: Approximately 68% of the data from a normal distribution falls within one standard deviation of the mean. This region covers the range from μ - σ to μ + σ.
Within two standard deviations (2σ) of the mean: Approximately 95% of the data falls within two standard deviations of the mean. This region covers the range from μ - 2σ to μ + 2σ.
Within three standard deviations (3σ) of the mean: Approximately 99.7% of the data falls within three standard deviations of the mean. This region covers the range from μ - 3σ to μ + 3σ.

This is termed the empirical rule, which is a useful guideline for understanding the distribution of data from a normal distribution. It provides a quick way to estimate the percentage of data points that fall within specific standard deviation intervals from the mean.
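The three percentages can be checked with pnorm(), which returns the area under a normal curve to the left of a given value. The sketch below first uses the standard normal distribution and then, as an arbitrary example, a normal distribution with mean 10 and standard deviation 3.

```r
k <- 1:3  # number of standard deviations from the mean

# Area between -k and k under the standard normal curve
coverage <- pnorm(k) - pnorm(-k)
round(coverage, 4)
# 0.6827 0.9545 0.9973

# The same areas for any normal distribution, e.g. mean 10 and sd 3
mu <- 10
sigma <- 3
coverage_general <- pnorm(mu + k * sigma, mean = mu, sd = sigma) -
  pnorm(mu - k * sigma, mean = mu, sd = sigma)
round(coverage_general, 4)
```

Both calls give the same three numbers because converting to $z$-scores maps every normal distribution onto the standard normal one.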

Chapter 6. Inference for Means and Proportions

We introduced normal distributions in the previous chapter. We learned that the sampling distributions of many statistics are bell-shaped when sample sizes are sufficiently large. This includes the sample mean (used to estimate the corresponding population mean parameter $\mu$) and the sample proportion (used to estimate the corresponding population proportion parameter $p$).

Specifically, when the sample size is sufficiently large,

The sample mean ($\bar{X}$) is approximately normally distributed with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$, where $\sigma$ is the population standard deviation.
The sample proportion ($\hat{P}$) is approximately normally distributed with mean $p$ and standard deviation $\sqrt{\frac{p(1-p)}{n}}$.

When $p$ is unknown, it is estimated by the corresponding sample proportion $\hat{p}$, and $\sqrt{\frac{p(1-p)}{n}}$ is estimated by $\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$, called the standard error (se) of $\hat{p}$.

When $\sigma$ is unknown, it is estimated by the sample standard deviation $s$, and $\frac{\sigma}{\sqrt{n}}$ is estimated by $\frac{s}{\sqrt{n}}$, called the standard error of $\bar{x}$.

Examples. Find the standard error for each of the following situations:

$\hat{p}=0.4, n=35$
$s=2.3, n=42$

Solution

$se = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}=\sqrt{\frac{0.4(1-0.4)}{35}}=0.0828$
$se=\frac{s}{\sqrt{n}}=\frac{2.3}{\sqrt{42}}=0.3549$
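The same two calculations in R, as a quick check (the variable names se_prop and se_mean are chosen here):

```r
# Standard error of a sample proportion: p-hat = 0.4, n = 35
se_prop <- sqrt(0.4 * (1 - 0.4) / 35)

# Standard error of a sample mean: s = 2.3, n = 42
se_mean <- 2.3 / sqrt(42)

round(c(se_prop = se_prop, se_mean = se_mean), 4)
# se_prop se_mean
#  0.0828  0.3549
```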

6.1 Confidence Interval for a Population Proportion

When the sample size $n$ is large, instead of using a bootstrap confidence interval (introduced in Chapter 3) for a population proportion, we can use a confidence interval based on the standard normal distribution. Such an interval is given by the following formula (assuming the confidence level is $1-\alpha$):

\[\hat{p}\pm z^*\cdot se\] where $z^*$ is called the $z$-critical value, a value that separates the top $\alpha/2$ area under the standard normal curve from the rest of the area. The product $z^*\cdot se$ is called the margin of error. The smaller the margin of error, the more precise the result, assuming the sample size and confidence level are fixed.

The app https://www.lock5stat.com/StatKey/theoretical_distribution/theoretical_distribution.html#normal allows you to produce $z^*$ for a given confidence level. For example,

If the confidence level $1-\alpha$ is 0.90, then $\alpha = 0.10$ and $\alpha/2 = 0.05$. Checking the box labeled “Right Tail” and changing the value inside the plotting region to 0.05 yields the cutoff value 1.645 on the horizontal x-axis. This 1.645 is the critical value of $z^*$.
If the confidence level $1-\alpha$ is 0.95, then $\alpha = 0.05$ and $\alpha/2 = 0.025$. Checking the box labeled “Right Tail” and changing the value inside the plotting region to 0.025 yields the cutoff value 1.96 on the horizontal x-axis. This 1.96 is the critical value of $z^*$.
If the confidence level $1-\alpha$ is 0.99, then $\alpha = 0.01$ and $\alpha/2 = 0.005$. Checking the box labeled “Right Tail” and changing the value inside the plotting region to 0.005 yields the cutoff value 2.576 on the horizontal x-axis. This 2.576 is the critical value of $z^*$.
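As an alternative to the app, the same critical values can be obtained in R with qnorm(), which returns the cutoff having a given area to its left. Since $z^*$ leaves area $\alpha/2$ in the right tail, the area to its left is $1-\alpha/2$.

```r
conf_level <- c(0.90, 0.95, 0.99)
alpha <- 1 - conf_level

# Cutoff with area alpha/2 to its right, i.e. area 1 - alpha/2 to its left
z_star <- qnorm(1 - alpha / 2)
round(z_star, 3)
# 1.645 1.960 2.576
```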

Example.

Suppose we have a sample of 200 people, and 120 of them support a particular candidate. Determine the 95% confidence interval for the population proportion.

Solution: Calculate the sample proportion: $\hat{p}$ equals the number of supporters divided by the total sample size or 120/200 = 0.6. For a 95% confidence level, the critical $z^*$ value is 1.96 using the StatKey app. The standard error $\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}=\sqrt{\frac{0.6(1-0.6)}{200}}=0.0346$. The margin of error is $1.96\cdot 0.0346=0.0679$. The confidence interval is $0.6\pm 0.0679$. The lower bound is $0.6-0.0679=0.5321$ and the upper bound is $0.6+0.0679=0.6679$. We conclude that with 95% confidence, the proportion of people who support the candidate is approximately 0.532 to 0.668.
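The steps of this solution can be reproduced in R. This is a direct transcription of the formula above rather than a call to a built-in interval function; the variable names are chosen here.

```r
supporters <- 120
n <- 200

p_hat <- supporters / n                 # sample proportion
se <- sqrt(p_hat * (1 - p_hat) / n)     # standard error of p-hat
z_star <- qnorm(0.975)                  # critical value for 95% confidence
margin <- z_star * se                   # margin of error

ci <- p_hat + c(-1, 1) * margin
round(ci, 3)
# 0.532 0.668
```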

6.2 Confidence Interval for a Population Mean

When a population has a normal distribution, instead of using the bootstrap method, the $1-\alpha$ confidence interval for the population mean $\mu$ can be constructed using the following formula:

\[\bar{x}\pm t^*\cdot se\]

where $se= \frac{s}{\sqrt{n}}$ is the standard error of the sample mean $\bar{x}$, $s$ is the sample standard deviation, and $t^*$ is called the $t$-critical value, a value that separates the top $\alpha/2$ area under the $t$-distribution curve from the rest of the area. A $t$-distribution is associated with a number called the number of degrees of freedom. In the current situation, this number equals the sample size minus one, or $n-1$. Some typical t-curves are given below:

In the graph, $\nu$ represents the number of degrees of freedom. As $\nu$ gets larger, the peak of the curve gets higher, and the curve eventually approaches the standard normal curve ($\nu=+\infty$).

In the previous confidence interval formula, the product $t^*\cdot se$ is called the margin of error. The smaller the margin of error, the more precise the result, assuming the population standard deviation, the sample size, and the confidence level are fixed.

The app https://www.lock5stat.com/StatKey/theoretical_distribution/theoretical_distribution.html#t also allows you to produce $t^*$ for a given confidence level. For example, if the sample size is 24, then the number of degrees of freedom is $df = n-1=23$.

If the confidence level $1-\alpha$ is 0.90, then $\alpha = 0.10$ and $\alpha/2 = 0.05$. Checking the box labeled “Right Tail”, changing the value inside the plotting region to 0.05, and clicking “Edit Parameters” to set “df” to 23 yields the cutoff value 1.714 on the horizontal x-axis. This 1.714 is the critical value of $t^*$.
If the confidence level $1-\alpha$ is 0.95, then $\alpha = 0.05$ and $\alpha/2 = 0.025$. Checking the box labeled “Right Tail” and changing the value inside the plotting region to 0.025 yields the cutoff value 2.069 on the horizontal x-axis. This 2.069 is the critical value of $t^*$.
If the confidence level $1-\alpha$ is 0.99, then $\alpha = 0.01$ and $\alpha/2 = 0.005$. Checking the box labeled “Right Tail” and changing the value inside the plotting region to 0.005 yields the cutoff value 2.807 on the horizontal x-axis. This 2.807 is the critical value of $t^*$.
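
As an alternative to the app, the critical values can be obtained in R with the quantile functions `qt()` and `qnorm()`. The sketch below assumes the same setting as above (sample size 24, so 23 degrees of freedom); the variable names are illustrative.

```r
# Critical values for several confidence levels
df_t        <- 23                      # degrees of freedom, n - 1 with n = 24
conf_levels <- c(0.90, 0.95, 0.99)
alpha       <- 1 - conf_levels

t_star <- qt(1 - alpha / 2, df = df_t)   # t-critical values
z_star <- qnorm(1 - alpha / 2)           # z-critical values, for comparison

data.frame(conf_level = conf_levels,
           t_star = round(t_star, 3),
           z_star = round(z_star, 3))
```

Each $t^*$ is larger than the matching $z^*$, which reflects the heavier tails of the t distribution.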

Example.

Suppose we have the following sample data from a population: 12,14,16,18,20,22,24,26,28,30,32,34,36,38,40. Construct a 90% confidence interval for the population mean.

Solution.

The sample mean $\bar{x}$ is the average of the values, which is 26.
The sample variance $s^2$ is calculated as follows:

\[s^2=\frac{(12-26)^2+(14-26)^2+ (16-26)^2+\cdots+(40-26)^2}{15-1}=80\]

The sample standard deviation is the square root of the sample variance, or 8.9443.

The t-critical value $t^*$ is 1.761, using the StatKey app with df = 14 and right-tail area 0.05.
The standard error $se=\frac{s}{\sqrt{n}}=\frac{8.9443}{\sqrt{15}}=2.3094$.
The margin of error is $(1.761)\cdot (2.3094)= 4.07$.
The 90% confidence interval is $26\pm 4.07$, or from 21.93 to 30.07.

Interpretation: With 90% confidence, the population mean is between 21.93 and 30.07.
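
A hand calculation like this one can be checked in R. The sketch below builds the interval from the formula $\bar{x}\pm t^*\cdot se$ and compares it with the interval reported by `t.test()`; the names `manual_ci` and `builtin_ci` are illustrative. For a 90% confidence level the right-tail area is 0.05, so the critical value is `qt(0.95, df = 14)`.

```r
# 90% t confidence interval for the mean, by formula and with t.test()
x <- seq(12, 40, by = 2)     # the 15 sample values 12, 14, ..., 40
n <- length(x)
conf_level <- 0.90

xbar   <- mean(x)                                   # sample mean
s      <- sd(x)                                     # sample standard deviation
se     <- s / sqrt(n)                               # standard error
t_star <- qt(1 - (1 - conf_level) / 2, df = n - 1)  # t-critical value

manual_ci  <- c(xbar - t_star * se, xbar + t_star * se)
builtin_ci <- t.test(x, conf.level = conf_level)$conf.int

manual_ci
builtin_ci
```

The two intervals should agree, since `t.test()` uses the same formula.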

6.3 R Code for Confidence Intervals about $p$ and $\mu$

We demonstrate these through examples.

Example 1 To study the percent of individuals who have been vaccinated, 76 people are surveyed and 52 have been vaccinated. Find the 90% confidence interval for the percent of individuals in the population who have been vaccinated.

You use the following R code to compute the confidence interval for $p$, the proportion of individuals in the population who have been vaccinated:

prop.test(n = 76, x = 52, conf.level = 0.90)

## 
##  1-sample proportions test with continuity correction
## 
## data:  52 out of 76, null probability 0.5
## X-squared = 9.5921, df = 1, p-value = 0.001954
## alternative hypothesis: true p is not equal to 0.5
## 90 percent confidence interval:
##  0.5846547 0.7701705
## sample estimates:
##         p 
## 0.6842105

The computer output shows that the 90 percent confidence interval is from 0.58 to 0.77. Note that the computer result might be slightly different from hand calculation, since prop.test uses a different method (a score interval with a continuity correction) rather than the formula $\hat{p}\pm z^*\cdot se$. Here, by hand, with $z^*=1.645$, the lower bound of the confidence interval is 0.5965 and the upper bound is 0.7719.
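
To see the two results side by side, the following sketch computes the interval from the formula $\hat{p}\pm z^*\cdot se$ and extracts the interval reported by `prop.test()`. The names `wald_ci` and `software_ci` are chosen here for illustration.

```r
# Hand-formula interval versus the interval reported by prop.test()
x <- 52
n <- 76
conf_level <- 0.90

p_hat  <- x / n
z_star <- qnorm(1 - (1 - conf_level) / 2)
se     <- sqrt(p_hat * (1 - p_hat) / n)

wald_ci     <- c(p_hat - z_star * se, p_hat + z_star * se)            # by formula
software_ci <- prop.test(x = x, n = n, conf.level = conf_level)$conf.int

round(wald_ci, 4)
round(software_ci, 4)
```

The two intervals are close but not identical, which is expected because they come from different methods.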

Example 2 Suppose a group of medical researchers wants to estimate the average blood pressure of a specific population. They collect data from a random sample of 100 individuals from that population and measure their blood pressure. The sample data yields a sample mean ($\bar{x}$) of 120 mm Hg and a sample standard deviation ($s$) of 10 mm Hg.

The researchers are interested in creating a 95% confidence interval for the mean blood pressure of the entire population.

You use the following R code to compute the confidence interval for $\mu$, the mean blood pressure of the entire population:

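confidence_level = 0.95
alpha = 1 - confidence_level
n = 100
df = n-1
t_critical = qt(1-alpha/2, df) # t-critical value with n-1 degrees of freedom
xbar = 120
s = 10
se = s/sqrt(n) # sqrt is the function for calculating square root
lower_bound = xbar - t_critical*se
upper_bound = xbar + t_critical*se
round(c(lower_bound, upper_bound), 2) # display the confidence interval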

The computer output shows that the 95 percent confidence interval is from 118.02 to 121.98. This would be consistent with hand calculation.

Example 3. Consider the example of estimating the mean cholesterol level in a clinical trial. Let’s assume we have the following individual cholesterol level (in mg/dL) data for a random sample of 49 participants before treatment:

215,198,220,207,202,205,210,193,199,213,204,215,211,216,208,201,209,207,206,198,212,199,197,203,204,217,214, 208,211,202,206,219,207,201,196,207,198,204,214,200,199,205,203,210,208,194,198,200,203

Calculate the 95% confidence interval for the mean cholesterol level of the population.

Solution.

The computer code is:

x= c(215,198,220,207,202,205,210,193,199,213,204,215,211,216,208,201,209,207,206,198,212,199,197,203,204,217,214,
208,211,202,206,219,207,201,196,207,198,204,214,200,199,205,203,210,208,194,198,200,203)

t.test(x, conf.level = 0.95)

## 
##  One Sample t-test
## 
## data:  x
## t = 214.91, df = 48, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  203.7088 207.5565
## sample estimates:
## mean of x 
##  205.6327

The computer output shows that the 95 percent confidence interval is from 203.7088 to 207.5565. This would be consistent with hand calculation.

Interpretation: with 95% confidence, the mean cholesterol level of the population lies between 203.7088 and 207.5565.

6.4 Sample Size Determination When Estimating Population Proportion $p$

To estimate the $1-\alpha$ confidence interval of a population proportion so that the margin of error is $E$, the minimum sample size is calculated by

\[n=\left(\frac{z^*}{E}\right)^2\cdot p(1-p)\] where $z^*$ is the critical value and $p$ is the population proportion. Since $p$ is unknown, we can use an estimate of it. If no such estimate is available, use 0.5 for $p$, which results in a conservative (largest possible) sample size. If the calculated sample size is not an integer, always round it up to the next whole integer.

Example.

Suppose you want to estimate the proportion of people in a city who support a proposed policy with a 95% confidence level and a margin of error of 3%. You have no prior estimate of the proportion, so you use 0.5 as a conservative estimate.

Solution.

Desired confidence level = 95%, so the critical value $z^*$ is 1.96.
Margin of error (E) = 3% = 0.03.
Estimated proportion ($\hat{p}$) = 0.5.
Now, calculate the sample size (n):

\[(\frac{z^*}{E})^2\cdot p(1-p)=(\frac{1.96}{0.03})^2\cdot 0.5(1-0.5)=1067.11\] Round the calculated value 1067.11 up to the next whole integer 1068. The desired sample size is 1068.
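
The same calculation can be wrapped in a small R helper. This is a sketch only: `sample_size_for_proportion()` is a hypothetical function name introduced here, not a built-in R function. It uses the unrounded critical value from `qnorm()` and rounds up with `ceiling()`.

```r
# Minimum sample size for estimating a proportion with a given margin of error
# (hypothetical helper written for this example)
sample_size_for_proportion <- function(margin, conf_level = 0.95, p = 0.5) {
  z_star <- qnorm(1 - (1 - conf_level) / 2)   # critical value
  n_raw  <- (z_star / margin)^2 * p * (1 - p) # sample size before rounding
  ceiling(n_raw)                              # always round up
}

# 95% confidence, margin of error 3%, conservative p = 0.5
sample_size_for_proportion(margin = 0.03, conf_level = 0.95, p = 0.5)
```

Supplying a prior estimate such as `p = 0.3` instead of 0.5 gives a smaller required sample size, because $p(1-p)$ is largest at 0.5.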

6.5 Confidence Interval for the Difference between Two Population Proportions $p_1 - p_2$

When comparing two populations in terms of their proportions, a $1-\alpha$ confidence interval for $p_1 - p_2$ is given by

\[(\hat{p}_1 - \hat{p}_2)\pm z^* \cdot se\] where

$\hat{p}_1$ and $\hat{p}_2$ are the two sample proportions
$z^*$ is the critical value based on the standard normal distribution corresponding to the desired confidence level $1-\alpha$
$se$ is the standard error of the difference between the sample proportions and $se = \sqrt{\frac{\hat{p}_1\cdot (1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2\cdot (1-\hat{p}_2)}{n_2}}$.
$n_1$ and $n_2$ are sample sizes

Example.

Suppose you want to estimate the difference in proportions of people who prefer Product A and Product B in a survey. You collect two independent samples:

Sample 1 (Product A): Sample size is 200 and Number of people who prefer Product A is 120.
Sample 2 (Product B): Sample size is 250 and Number of people who prefer Product B is 140.

You want to calculate a 95% confidence interval for the difference in proportions ($p_1-p_2$).

Solution.

n1 = 200
x1 = 120

n2 = 250
x2 = 140

prop.test(n = c(n1, n2), x = c(x1, x2), conf.level = 0.95)

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(x1, x2) out of c(n1, n2)
## X-squared = 0.574, df = 1, p-value = 0.4487
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.0561291  0.1361291
## sample estimates:
## prop 1 prop 2 
##   0.60   0.56

The computer output shows that the 95 percent confidence interval is from -0.0561 to 0.1361. This differs slightly from hand calculation with the formula above (about -0.0516 to 0.1316), since the software also applies a continuity correction.

Interpretation: with 95% confidence, the difference (first minus second) between population proportions lies between -0.0561 and 0.1361. Since 0 is contained in the interval, the two population proportions might not be significantly different. In other words, there is no strong evidence to suggest that one product is significantly preferred over the other.
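
The formula $(\hat{p}_1 - \hat{p}_2)\pm z^* \cdot se$ can be applied directly in R and compared with the software interval. In the sketch below, the continuity correction $0.5\,(1/n_1 + 1/n_2)$ added to the margin is an assumption about what `prop.test()` does for these data; the comparison at the end checks that assumption numerically. All variable names are illustrative.

```r
# Confidence interval for p1 - p2: by formula and with prop.test()
n1 <- 200; x1 <- 120
n2 <- 250; x2 <- 140
conf_level <- 0.95

p1 <- x1 / n1
p2 <- x2 / n2
z_star <- qnorm(1 - (1 - conf_level) / 2)
se     <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
margin <- z_star * se

# Interval from the formula in this section
wald_ci <- (p1 - p2) + c(-1, 1) * margin

# Same interval widened by a continuity correction (assumed form)
cc           <- 0.5 * (1 / n1 + 1 / n2)
corrected_ci <- (p1 - p2) + c(-1, 1) * (margin + cc)

software_ci <- prop.test(x = c(x1, x2), n = c(n1, n2),
                         conf.level = conf_level)$conf.int

round(wald_ci, 4)
round(corrected_ci, 4)
round(software_ci, 4)
```

Calling `prop.test()` with the argument `correct = FALSE` turns the continuity correction off.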

6.6 Confidence Interval for the Difference between Two Population Means $\mu_1 - \mu_2$

When comparing two populations in terms of their means, a $1-\alpha$ confidence interval for $\mu_1 - \mu_2$ is given by

\[(\bar{x}_1 - \bar{x}_2)\pm t^* \cdot se\] where

$\bar{x}_1$ and $\bar{x}_2$ are the two sample means
$t^*$ is the critical value based on the t distribution corresponding to the desired confidence level $1-\alpha$. The number of degrees of freedom of this t distribution can be found using the following formula

\[df = \frac{(A+B)^2}{\frac{A^2}{n_1 -1}+\frac{B^2}{n_2 -1}}\] where $A=\frac{s_1^2}{n_1}$ and $B=\frac{s_2^2}{n_2}$.

$se$ is the standard error of the difference between the sample means and $se = \sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}$.
$n_1$ and $n_2$ are sample sizes

Example.

Suppose you want to estimate the difference in the mean test scores between two groups of students, Group A and Group B. You collect the following data:

Sample 1 (group A): Sample size is 30, sample mean 85 and sample standard deviation 10.
Sample 2 (group B): Sample size is 35, sample mean 78 and sample standard deviation 12.

You want to calculate a 95% confidence interval for the difference in means ($\mu_1-\mu_2$).

Solution.

n1 = 30
x1bar = 85
s1 = 10

n2 = 35
x2bar = 78
s2 = 12

A = s1^2/n1
B = s2^2/n2

df = (A+B)^2/(A^2/(n1-1) + B^2/(n2-1) )

alpha = 0.05  # 1-0.95
t_critical = qt(1-alpha/2, df)

se = sqrt(s1^2/n1 + s2^2/n2)

Lower_bound = (x1bar - x2bar) - t_critical*se
Upper_bound = (x1bar - x2bar) + t_critical*se

cat("The lower bound is", Lower_bound, "\n\n")

## The lower bound is 1.546394

cat("The upper bound is", Upper_bound, "\n\n")

## The upper bound is 12.45361

The computer output shows that the 95 percent confidence interval is from 1.5464 to 12.4536.

Interpretation: with 95% confidence, the difference (first minus second) between population means lies between 1.5464 and 12.4536. Since the interval contains only positive values, group A has a significantly larger mean than group B. In other words, there is strong evidence to suggest that group A students have higher mean test scores than group B students.

6.7 R Code for Calculating the Confidence Interval for the Difference in Population Proportions and Means

We demonstrate these by examples.

Example 1 (Difference in Proportions).

Given $n_1 = 40, x_1 = 28$ and $n_2 = 45, x_2 = 33$, construct a 90% confidence interval for $p_1-p_2$.

n1 = 40
x1 = 28

n2 = 45
x2 = 33

prop.test(n = c(n1, n2), x = c(x1, x2), conf.level = 0.90)

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(x1, x2) out of c(n1, n2)
## X-squared = 0.0098783, df = 1, p-value = 0.9208
## alternative hypothesis: two.sided
## 90 percent confidence interval:
##  -0.2180703  0.1514036
## sample estimates:
##    prop 1    prop 2 
## 0.7000000 0.7333333

The 90 percent confidence interval for the difference in population proportions (one minus two): from -0.2180703 to 0.1514036.

Example 2 (Difference in Means).

Suppose you are a botanist conducting a study to determine the effect of a particular fertilizer on the growth of a specific plant species. You have collected data from two groups of 20 plants each. One group received the fertilizer treatment, and the other group did not. You want to calculate a 99% t-confidence interval for the difference in mean plant height between the two groups, based on the following data:

Sample 1 (Treated Group):

Plant heights (in centimeters) of treated plants: 18, 20, 19, 17, 21, 20, 22, 18, 19, 21, 19, 20, 22, 18, 20, 19, 17, 21, 20, 22

Sample 2 (Control Group):

Plant heights (in centimeters) of control plants: 16, 18, 17, 15, 19, 18, 20, 16, 17, 19, 17, 18, 20, 16, 18, 17, 15, 19, 18, 20

Solution.

treat = c(18, 20, 19, 17, 21, 20, 22, 18, 19, 21, 19, 20, 22, 18, 20, 19, 17, 21, 20, 22)
control = c(16, 18, 17, 15, 19, 18, 20, 16, 17, 19, 17, 18, 20, 16, 18, 17, 15, 19, 18, 20)

t.test(treat, control, conf.level = 0.99)

## 
##  Welch Two Sample t-test
## 
## data:  treat and control
## t = 4.0406, df = 38, p-value = 0.0002503
## alternative hypothesis: true difference in means is not equal to 0
## 99 percent confidence interval:
##  0.6578475 3.3421525
## sample estimates:
## mean of x mean of y 
##     19.65     17.65

The 99 percent confidence interval for the mean difference (treat minus control): from 0.6578475 to 3.3421525.

6.8 The Confidence Interval for the Difference in Population Means Based on Paired Data

To calculate the confidence interval for the difference in population means based on paired data, you can follow these steps. Paired data refers to situations where you have two measurements for each observation or subject. For example, before and after measurements on the same subjects, or measurements on matched pairs. The steps for calculating the confidence interval are as follows:

Step 1: Collect your paired data. These could be, for example, pre-treatment and post-treatment measurements on the same individuals, or measurements on matched pairs.

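Step 2: Calculate the differences (d) for each pair, using the same order of subtraction for every pair. If you have data for the same subjects before and after treatment, you can subtract the before value from the after value (to measure the change) or the after value from the before value (to measure the reduction).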

Step 3: Treat the differences (d) as one-sample data from a normal distribution. Use the one-sample t confidence interval we introduced before to construct an interval.

Example.

You want to evaluate whether a new diet plan has a significant impact on reducing blood pressure. You’ve measured the participants’ blood pressure before they started the diet plan (baseline) and after they completed the 6-month diet plan. Calculate a 95% confidence interval for the mean difference in blood pressure due to the diet plan based on the following data:

Participant 1: Baseline Blood Pressure (mm Hg): 140 Post-Diet Blood Pressure (mm Hg): 130

Participant 2: Baseline Blood Pressure (mm Hg): 150 Post-Diet Blood Pressure (mm Hg): 138

Participant 3: Baseline Blood Pressure (mm Hg): 132 Post-Diet Blood Pressure (mm Hg): 128

Participant 4: Baseline Blood Pressure (mm Hg): 138 Post-Diet Blood Pressure (mm Hg): 130

Participant 5: Baseline Blood Pressure (mm Hg): 145 Post-Diet Blood Pressure (mm Hg): 134

Solution.

# Prepare data
baseline = c(140, 150, 132, 138, 145)
post = c(130, 138, 128, 130, 134)

# Calculate reductions in blood pressure
d = baseline - post

# Construct 95% confidence interval for the mean reduction in population
t.test(d, conf.level = 0.95)

## 
##  One Sample t-test
## 
## data:  d
## t = 6.364, df = 4, p-value = 0.003126
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##   5.073514 12.926486
## sample estimates:
## mean of x 
##         9

The computer gives the 95 percent confidence interval for the mean reduction in blood pressure in the population: from 5.0735 to 12.9265.

Interpretation: with 95% confidence, the mean difference in blood pressure due to the diet plan is at least 5.0735 and at most 12.9265.
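
Instead of computing the differences first, the two columns can be passed to `t.test()` with `paired = TRUE`. The sketch below checks that this gives the same interval as the one-sample approach on the differences; the object names are illustrative.

```r
# Paired-data confidence interval: two equivalent ways
baseline <- c(140, 150, 132, 138, 145)
post     <- c(130, 138, 128, 130, 134)

# Way 1: paired t procedure on the two columns (baseline minus post)
paired_result <- t.test(baseline, post, paired = TRUE, conf.level = 0.95)

# Way 2: one-sample t procedure on the differences
one_sample_result <- t.test(baseline - post, conf.level = 0.95)

paired_result$conf.int
all.equal(as.numeric(paired_result$conf.int),
          as.numeric(one_sample_result$conf.int))
```

The order of the first two arguments determines the direction of the difference, so `t.test(post, baseline, paired = TRUE)` would give the same interval with the signs reversed.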

6.9 Hypothesis Test for a Population Proportion

Instead of estimating a population parameter using a confidence interval, one may test a claim or hypothesis about the parameter. There are two types of hypotheses in statistics: the null hypothesis and the alternative hypothesis. The following are examples of hypotheses:

1. Treatment Efficacy:

Null Hypothesis (H0): The new drug treatment is not more effective than the existing treatment. Alternative Hypothesis (H1): The new drug treatment is more effective than the existing treatment.

2. Vaccine Efficacy:

Null Hypothesis (H0): The vaccine does not provide protection against the target disease. Alternative Hypothesis (H1): The vaccine provides protection against the target disease.

3. Impact of Lifestyle Changes:

Null Hypothesis (H0): There is no significant difference in blood pressure after 12 weeks of exercise and dietary changes. Alternative Hypothesis (H1): There is a significant decrease in blood pressure after 12 weeks of exercise and dietary changes.

4. Genetic Associations:

Null Hypothesis (H0): There is no association between a specific genetic marker and the risk of developing a certain disease. Alternative Hypothesis (H1): There is a significant association between the genetic marker and the risk of the disease.

5. Environmental Exposures:

Null Hypothesis (H0): Exposure to a specific environmental toxin does not increase the risk of cancer. Alternative Hypothesis (H1): Exposure to the environmental toxin is associated with an increased risk of cancer.

6. Nutritional Studies:

Null Hypothesis (H0): There is no difference in weight loss between two diet plans. Alternative Hypothesis (H1): One diet plan results in greater weight loss than the other.

7. Medical Device Performance:

Null Hypothesis (H0): The accuracy of a new medical device is not better than the existing device. Alternative Hypothesis (H1): The accuracy of the new medical device is superior to the existing device.

8. Clinical Trials:

Null Hypothesis (H0): The experimental treatment has no impact on patient survival. Alternative Hypothesis (H1): The experimental treatment improves patient survival.

9. Effects of Age:

Null Hypothesis (H0): Age has no effect on cognitive decline in the elderly. Alternative Hypothesis (H1): Cognitive decline is associated with age in the elderly.

10. Epidemiological Studies:

Null Hypothesis (H0): There is no association between exposure to a specific pathogen and the incidence of a certain disease. Alternative Hypothesis (H1): Exposure to the pathogen is associated with an increased risk of the disease.

These are just a few examples of hypotheses or claims commonly investigated in health sciences and life sciences. Statistical inference is crucial in testing these hypotheses and making evidence-based decisions in these fields.

1. Population Mean (μ):

Null Hypothesis (H0): The mean blood pressure of a certain population is 120 mm Hg. Alternative Hypothesis (H1): The mean blood pressure of the population is not 120 mm Hg.

2. Population Proportion (p):

Null Hypothesis (H0): The proportion of smokers in a specific population is 0.30. Alternative Hypothesis (H1): The proportion of smokers in the population is different from 0.30.

3. Population Mean (μ):

Null Hypothesis (H0): The mean response time to a specific stimulus is 100 milliseconds. Alternative Hypothesis (H1, one-tailed): The mean response time is less than 100 milliseconds.

4. Population Proportion (p):

Null Hypothesis (H0): The proportion of patients experiencing a certain side effect from a drug is 0.10. Alternative Hypothesis (H1, one-tailed): The proportion of patients with the side effect is greater than 0.10.

In this section, we introduce the steps for testing hypotheses about a single population proportion.

Step 1. Define the Null and Alternative Hypotheses:

Null Hypothesis ($H_0$): This is a statement that represents the status quo or no effect. It often states that the population proportion equals a specific value ($p = p_0$).

Alternative Hypothesis ($H_1$ or $H_a$): This is a statement that represents what you want to test or find evidence for. It can be one of three types:

One-tailed, greater than: $p > p_0$

One-tailed, less than: $p < p_0$

Two-tailed: $p \ne p_0$

Step 2. Calculate the Test Statistic: For testing a single proportion, you typically use the z-test statistic, which, under the null hypothesis, follows a standard normal distribution. The formula for the z-test statistic is: \[z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}\] where $\hat{p}$ is the sample proportion, $p_0$ is the hypothesized proportion from the null hypothesis, and $n$ is the sample size. Keep two decimal places for $z$.
Step 3. Calculate the P-Value: The p-value is the probability of observing a sample proportion as extreme as, or more extreme than, the one in your sample, assuming the null hypothesis is true.
Step 4. Make a Decision: Based on the comparison of the p-value to a benchmark (called the significance level and denoted by $\alpha$), make a decision to either reject the null hypothesis (if the test is significant) or fail to reject it (if the test is not significant). The decision rule is: if the p-value is no greater than $\alpha$, reject the null hypothesis; otherwise, do not reject the null hypothesis.
Step 5. Interpret the Result:

In the context of your problem, interpret your decision. If you reject the null hypothesis, you have evidence to support the alternative hypothesis, and you can draw conclusions about the population proportion.

Example.

You want to test whether the proportion of malaria cases in the population is greater than 0.10. You randomly select 500 individuals from the population and determine if they have malaria (success) or not (failure). Let’s assume that 65 out of the 500 individuals have malaria. At the significance level 0.05, test the appropriate null hypothesis against the alternative.

Solution.

Step 1. Define the Null and Alternative Hypotheses:

Null Hypothesis (H0): The proportion of malaria cases in a specific population is equal to 0.10 ($p = 0.10$). Alternative Hypothesis (H1): The proportion of malaria cases in the population is greater than 0.10 ($p > 0.10$).

Step 2. Calculate the Test Statistic: We have $n=500$, $\hat{p}=65/500=0.13$, and $p_0=0.10$.

Use the formula for the z-test statistic: \[z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}=\frac{0.13-0.10}{\sqrt{\frac{0.10(1 - 0.10)}{500}}}=2.24\]

Step 3. Calculate the p-value: The p-value is the probability of observing a sample proportion at least as extreme as 0.13, assuming the null hypothesis is true. This p-value is equal to the area of the right region under the standard normal curve beyond the value 2.24 on the x-axis. Using the StatKey app https://www.lock5stat.com/StatKey/theoretical_distribution/theoretical_distribution.html#normal, check the box “Right Tail” and set the cutoff to 2.24. The shaded red region has an area of 0.013, which is the desired p-value. Why check “Right Tail”? Because $z$ values larger than 2.24 provide even more evidence against the null hypothesis. If you still have difficulty understanding this, use the following rule: if there is a greater-than sign in the alternative hypothesis, check the box “Right Tail”; if there is a less-than sign in the alternative hypothesis, check the box “Left Tail”; if there is a not-equal sign in the alternative hypothesis, check the box “Two-Tail”.
Step 4. Make a Decision:

Since the p-value is less than 0.05, you reject the null hypothesis.

Step 5. Interpret the Result:

Based on the data and the analysis, you do have sufficient evidence to conclude that the proportion of malaria cases in the population is greater than 0.10.

We really don’t know whether the null hypothesis is true or not. If it were true, we would have made an error, called the type I error in statistics. A type I error occurs when rejecting a true null hypothesis.

It’s important to note that the test is not significant if you choose another significance level such as 0.01. In this case, we fail to reject the null hypothesis. If the null hypothesis were false, we would have made an error, called the type II error in statistics. A type II error occurs when failing to reject a false null hypothesis.

A guideline to choose $\alpha$:

To reduce the chance of making a type I error, choose a smaller $\alpha$ to use.
To reduce the chance of making a type II error, choose a larger $\alpha$ to use.
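
The malaria example above can be reproduced in R directly from the formulas. This is a minimal sketch with illustrative variable names; `pnorm()` with `lower.tail = FALSE` gives the right-tail area under the standard normal curve. Because the code does not round $z$ to two decimal places before computing the p-value, the p-value can differ slightly in the third decimal place from the one read off the app.

```r
# One-sample z-test for a proportion, right-tailed (Ha: p > p0)
x  <- 65       # number of malaria cases in the sample
n  <- 500      # sample size
p0 <- 0.10     # hypothesized proportion under H0

p_hat   <- x / n
z       <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)   # test statistic
p_value <- pnorm(z, lower.tail = FALSE)             # right-tail area

round(z, 2)
p_value

# Decisions at two significance levels
p_value <= 0.05   # TRUE means reject H0 at alpha = 0.05
p_value <= 0.01   # TRUE means reject H0 at alpha = 0.01
```

For a left-tailed alternative use `pnorm(z)`, and for a two-tailed alternative use `2 * pnorm(abs(z), lower.tail = FALSE)`.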

6.10 Hypothesis Test for a Population Mean

We have learned the steps for testing a population proportion. To test hypotheses about a population mean, use the following steps:

Step 1. Define the Null and Alternative Hypotheses:

Null Hypothesis ($H_0$): This is a statement that represents the status quo or no effect. It often states that the population mean equals a specific value ($\mu = \mu_0$).

Alternative Hypothesis ($H_1$ or $H_a$): This is a statement that represents what you want to test or find evidence for. It can be one of three types:

One-tailed, greater than: $\mu > \mu_0$

One-tailed, less than: $\mu < \mu_0$

Two-tailed: $\mu \ne \mu_0$

Step 2. Calculate the Test Statistic: For testing a single mean, you typically use the t-test statistic, which, under the null hypothesis, follows a t distribution with $n-1$ degrees of freedom. The formula for the t-test statistic is: \[t = \frac{\bar{x} - \mu_0}{\frac{s}{\sqrt{n}}}\] where $\bar{x}$ is the sample mean, $\mu_0$ is the hypothesized mean from the null hypothesis, $s$ is the sample standard deviation, and $n$ is the sample size. Keep two decimal places for $t$.
Step 3. Calculate the P-Value: The p-value is the probability of observing a sample mean as extreme as, or more extreme than, the one in your sample, assuming the null hypothesis is true.
Step 4. Make a Decision: Based on the comparison of the p-value to a benchmark (called the significance level and denoted by $\alpha$), make a decision to either reject the null hypothesis (if the test is significant) or fail to reject it (if the test is not significant). The decision rule is: if the p-value is no greater than $\alpha$, reject the null hypothesis; otherwise, do not reject the null hypothesis.
Step 5. Interpret the Result: In the context of your problem, interpret your decision.

Example

Suppose you are a researcher studying the cholesterol levels in a sample of 30 patients involved in a clinical trial for a new cholesterol-lowering drug. The data is as follows:

Cholesterol levels (mg/dL): 195, 200, 195, 210, 190, 202, 198, 205, 215, 192, 205, 198, 200, 207, 203, 208, 190, 193, 199, 198, 212, 201, 193, 209, 200, 195, 210, 204, 196, 202

You can follow the steps mentioned above to test whether the mean cholesterol level in this population of patients is different from the hypothesized value of 200 mg/dL using a t-test. Choose $\alpha = 0.05$.

Solution.

Step 1. Define the Null and Alternative Hypotheses:

Null Hypothesis $H_0:\mu = 200$.

Alternative Hypothesis $H_a: \mu \ne 200$.

Step 2. Calculate the Test Statistic:

The sample mean $\bar{x}=200.8333$ and sample standard deviation $s=6.6751$.

The formula for the t-test statistic is: \[t = \frac{\bar{x} - \mu_0}{\frac{s}{\sqrt{n}}}= \frac{200.8333 - 200}{\frac{6.6751}{\sqrt{30}}}=0.68\]

Step 3. Calculate the P-Value: The p-value is the probability of observing a sample mean as extreme as, or more extreme than, the one in your sample, assuming the null hypothesis is true. Using the StatKey app https://www.lock5stat.com/StatKey/theoretical_distribution/theoretical_distribution.html#t, set df = 29 (sample size minus 1), check the box “Two-Tail”, and set the right cutoff to 0.68. The p-value is the sum of the two shaded tail areas, which is $0.251 + 0.251 = 0.502$.
Step 4. Make a Decision: Based on the comparison of the p-value to $\alpha=0.05$, we do not reject the null hypothesis, since the p-value is greater than $\alpha$.
Step 5. Interpret the Result: You do not have sufficient evidence to conclude that the mean cholesterol level in the population of patients is different from 200 mg/dL. If the null hypothesis is in fact false, we have made a type II error.
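
The cholesterol example can be checked with `t.test()` by supplying the hypothesized mean through the `mu` argument. This is a sketch; the object names `chol` and `result` are illustrative. Because R does not round $t$ to two decimal places, its p-value may differ slightly in the third decimal place from the value read off the app.

```r
# One-sample t-test of H0: mu = 200 against Ha: mu != 200
chol <- c(195, 200, 195, 210, 190, 202, 198, 205, 215, 192,
          205, 198, 200, 207, 203, 208, 190, 193, 199, 198,
          212, 201, 193, 209, 200, 195, 210, 204, 196, 202)

result <- t.test(chol, mu = 200, alternative = "two.sided")

result$statistic   # t test statistic
result$parameter   # degrees of freedom, n - 1
result$p.value     # two-tailed p-value

result$p.value <= 0.05   # FALSE means do not reject H0 at alpha = 0.05
```

Setting `alternative = "greater"` or `alternative = "less"` performs the corresponding one-tailed test.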

6.11 Hypothesis Test for the Difference in Two Population Proportions or Population Means

The test steps are similar to those outlined for one-sample tests. To test the difference in two population proportions, the steps are:

Step 1. Define the Null and Alternative Hypotheses:

Null Hypothesis $H_0: p_1 = p_2$, where $p_1$ and $p_2$ are the two population proportions.

Alternative Hypothesis ($H_1$ or $H_a$): This is a statement that represents what you want to test or find evidence for. It can be one of three types:

One-tailed, greater than: $p_1 > p_2$

One-tailed, less than: $p_1 < p_2$

Two-tailed: $p_1 \ne p_2$

Step 2. Calculate the Test Statistic: For comparing two proportions, you typically use the z-test statistic, which, under the null hypothesis, approximately follows a standard normal distribution. The formula for the z-test statistic is: \[z = \frac{\hat{p}_1 - \hat{p}_2 }{\sqrt{\frac{\hat{p}_1 (1 - \hat{p}_1 )}{n_1}+\frac{\hat{p}_2 (1 - \hat{p}_2 )}{n_2}}}\] where $\hat{p}_1$ and $\hat{p}_2$ are the sample proportions, and $n_1$ and $n_2$ are the sample sizes. Keep two decimal places for $z$.
Step 3. Calculate the P-Value: The p-value is the probability of observing a difference in sample proportions as extreme as, or more extreme than, the one in your samples, assuming the null hypothesis is true.
Step 4. Make a Decision: if the p-value is no greater than $\alpha$, reject the null hypothesis; otherwise, do not reject the null hypothesis.
Step 5. Interpret the Result: In the context of your problem, interpret your decision.

To test the difference in two population means, the steps are:

Step 1. Define the Null and Alternative Hypotheses:

Null Hypothesis ($H_0$): This is a statement that represents the status quo or no effect. It states that the two population means are equal ($\mu_1 = \mu_2$).

Alternative Hypothesis ($H_1$ or $H_a$): This is a statement that represents what you want to test or find evidence for. It can be one of three types:

One-tailed, greater than: $\mu_1 > \mu_2$

One-tailed, less than: $\mu_1 < \mu_2$

Two-tailed: $\mu_1 \ne \mu_2$

Step 2. Calculate the Test Statistic: For comparing two means, you typically use the two-sample t-test statistic, which, under the null hypothesis, approximately follows a t distribution whose number of degrees of freedom is given by the formula for $df$ in Section 6.6. The formula for the t-test statistic is: \[t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s^2_1}{n_1}+\frac{s^2_2}{n_2}}}\] where $\bar{x}_1$ and $\bar{x}_2$ are the sample means, $s_1^2$ and $s_2^2$ are the sample variances, and $n_1$ and $n_2$ are the sample sizes. Keep two decimal places for $t$.
Step 3. Calculate the P-Value: The p-value is the probability of observing a difference in sample means as extreme as, or more extreme than, the one in your samples, assuming the null hypothesis is true.
Step 4. Make a Decision: if the p-value is no greater than $\alpha$, reject the null hypothesis; otherwise, do not reject the null hypothesis.
Step 5. Interpret the Result: In the context of your problem, interpret your decision.

We will not give examples to show calculations, since we will introduce R code for doing so.
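
For readers who want to see the two test statistics computed from the formulas, here is a short sketch that reuses the summary data from Sections 6.5 and 6.6 and tests the two-tailed alternatives. The variable names are illustrative, and the degrees of freedom for the t statistic are taken from the formula for $df$ in Section 6.6.

```r
# Two-sample z statistic for proportions (data from Section 6.5)
n1 <- 200; x1 <- 120
n2 <- 250; x2 <- 140
p1 <- x1 / n1
p2 <- x2 / n2
z  <- (p1 - p2) / sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
p_value_z <- 2 * pnorm(abs(z), lower.tail = FALSE)    # two-tailed p-value

# Two-sample t statistic for means (summary data from Section 6.6)
m1 <- 30; xbar1 <- 85; s1 <- 10
m2 <- 35; xbar2 <- 78; s2 <- 12
A <- s1^2 / m1
B <- s2^2 / m2
t_stat   <- (xbar1 - xbar2) / sqrt(A + B)
df_welch <- (A + B)^2 / (A^2 / (m1 - 1) + B^2 / (m2 - 1))
p_value_t <- 2 * pt(abs(t_stat), df = df_welch, lower.tail = FALSE)

round(c(z = z, p_value = p_value_z), 4)
round(c(t = t_stat, df = df_welch, p_value = p_value_t), 4)
```

At $\alpha = 0.05$ these decisions agree with the confidence intervals in Sections 6.5 and 6.6: the interval for $p_1-p_2$ contains 0, while the interval for $\mu_1-\mu_2$ does not.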

6.12 R Code for Hypothesis Tests

Example 1. (R code for one-sample z-test for a population proportion)

# Sample data: Number of successes and sample size
no_successes <- 65   # Number of successes (e.g., patients with a specific condition)
sample_size <- 500          # Sample size

# Choose significance level
alpha <- 0.05

# Null hypotheses
p0 <- 0.10        # Null hypothesis H0: p = 0.10

# Perform the one-sample proportion test with alternative hypothesis Ha: p < 0.10
prop.test(n = sample_size, x = no_successes, p = p0, alternative = "less")

## 
##  1-sample proportions test with continuity correction
## 
## data:  no_successes out of sample_size, null probability p0
## X-squared = 4.6722, df = 1, p-value = 0.9847
## alternative hypothesis: true p is less than 0.1
## 95 percent confidence interval:
##  0.0000000 0.1578178
## sample estimates:
##    p 
## 0.13

# Perform the one-sample proportion test with alternative hypothesis Ha: p > 0.10
prop.test(n = sample_size, x = no_successes, p = p0, alternative = "greater")

## 
##  1-sample proportions test with continuity correction
## 
## data:  no_successes out of sample_size, null probability p0
## X-squared = 4.6722, df = 1, p-value = 0.01533
## alternative hypothesis: true p is greater than 0.1
## 95 percent confidence interval:
##  0.1063249 1.0000000
## sample estimates:
##    p 
## 0.13

# Perform the one-sample proportion test with alternative hypothesis Ha: p not equal 0.10
prop.test(n = sample_size, x = no_successes, p = p0, alternative = "two.sided")

## 
##  1-sample proportions test with continuity correction
## 
## data:  no_successes out of sample_size, null probability p0
## X-squared = 4.6722, df = 1, p-value = 0.03065
## alternative hypothesis: true p is not equal to 0.1
## 95 percent confidence interval:
##  0.1024235 0.1634084
## sample estimates:
##    p 
## 0.13

For the left-tail alternative ($p<0.10$), the p-value is 0.9847. Don’t reject the null hypothesis at a significance level 0.05.

For the right-tail alternative ($p>0.10$), the p-value is 0.0153. Reject the null hypothesis at a significance level 0.05.

For the two-tail alternative ($p\ne 0.10$), the p-value is 0.0306. Reject the null hypothesis at a significance level 0.05.
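
As a rough cross-check of the output above, the z statistic for a one-sample proportion test can be computed directly. This is only a sketch: the names p_hat, se0, z, and uncorrected are introduced here for illustration, and the calculation leaves out the continuity correction that prop.test() applies by default, so its p-values differ slightly from those reported above.

```r
# Same inputs as Example 1
no_successes <- 65
sample_size  <- 500
p0           <- 0.10

# Sample proportion and its standard error under the null hypothesis
p_hat <- no_successes / sample_size
se0   <- sqrt(p0 * (1 - p0) / sample_size)

# z statistic
z <- (p_hat - p0) / se0

# p-values for the three alternatives
p_less    <- pnorm(z)                                # Ha: p < 0.10
p_greater <- pnorm(z, lower.tail = FALSE)            # Ha: p > 0.10
p_two     <- 2 * pnorm(abs(z), lower.tail = FALSE)   # Ha: p not equal 0.10

# Without the continuity correction, prop.test() reports z^2 as X-squared
uncorrected <- prop.test(x = no_successes, n = sample_size, p = p0,
                         alternative = "greater", correct = FALSE)
c(z_squared = z^2, X_squared = unname(uncorrected$statistic))
```

The decisions at the 0.05 level are the same with or without the correction for these data.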

Example 2. (R code for two-sample z-test for two population proportions)

# Sample data: Numbers of successes and sample sizes
no_successes_1 <- 32   # Number of successes in sample 1
sample_size_1 <- 65          # Sample size in sample 1

no_successes_2 <- 45   # Number of successes in sample 2
sample_size_2 <- 78          # Sample size in sample 2

# Choose significance level
alpha <- 0.05


# Perform the 2-sample proportion test with alternative hypothesis Ha: p1 < p2
prop.test(n = c(sample_size_1, sample_size_2), x = c(no_successes_1, no_successes_2), alternative = "less")

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(no_successes_1, no_successes_2) out of c(sample_size_1, sample_size_2)
## X-squared = 0.70933, df = 1, p-value = 0.1998
## alternative hypothesis: less
## 95 percent confidence interval:
##  -1.00000000  0.06685472
## sample estimates:
##    prop 1    prop 2 
## 0.4923077 0.5769231

# Perform the 2-sample proportion test with alternative hypothesis Ha: p1 > p2
prop.test(n = c(sample_size_1, sample_size_2), x = c(no_successes_1, no_successes_2), alternative = "greater")

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(no_successes_1, no_successes_2) out of c(sample_size_1, sample_size_2)
## X-squared = 0.70933, df = 1, p-value = 0.8002
## alternative hypothesis: greater
## 95 percent confidence interval:
##  -0.2360855  1.0000000
## sample estimates:
##    prop 1    prop 2 
## 0.4923077 0.5769231

# Perform the 2-sample proportion test with alternative hypothesis Ha: p1 not equal p2
prop.test(n = c(sample_size_1, sample_size_2), x = c(no_successes_1, no_successes_2), alternative = "two.sided")

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(no_successes_1, no_successes_2) out of c(sample_size_1, sample_size_2)
## X-squared = 0.70933, df = 1, p-value = 0.3997
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.2624015  0.0931707
## sample estimates:
##    prop 1    prop 2 
## 0.4923077 0.5769231

For the left-tail alternative ($p_1<p_2$), the p-value is 0.1998. Don’t reject the null hypothesis at a significance level 0.05.

For the right-tail alternative ($p_1>p_2$), the p-value is 0.8002. Don’t reject the null hypothesis at a significance level 0.05.

For the two-tail alternative ($p_1\ne p_2$), the p-value is 0.3997. Don’t reject the null hypothesis at a significance level 0.05.

Example 3. (R code for 1-sample t-test for a population mean)

# Sample data: 
x <- c(28, 32, 35, 30, 31, 27, 29, 26, 33, 34, 32, 30, 31, 29, 28)

# Hypothesized population mean
hypothesized_mean <- 30  # You can change this to your specific hypothesized mean

# Perform a one-sample t-test with alternative hypothesis Ha: mu < 30
t.test(x, mu = hypothesized_mean, alternative = "less")

## 
##  One Sample t-test
## 
## data:  x
## t = 0.5, df = 14, p-value = 0.6876
## alternative hypothesis: true mean is less than 30
## 95 percent confidence interval:
##      -Inf 31.50754
## sample estimates:
## mean of x 
##  30.33333

# Perform a one-sample t-test with alternative hypothesis Ha: mu > 30
t.test(x, mu = hypothesized_mean, alternative = "greater")

## 
##  One Sample t-test
## 
## data:  x
## t = 0.5, df = 14, p-value = 0.3124
## alternative hypothesis: true mean is greater than 30
## 95 percent confidence interval:
##  29.15913      Inf
## sample estimates:
## mean of x 
##  30.33333

# Perform a one-sample t-test with alternative hypothesis Ha: mu not equal 30
t.test(x, mu = hypothesized_mean, alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  x
## t = 0.5, df = 14, p-value = 0.6248
## alternative hypothesis: true mean is not equal to 30
## 95 percent confidence interval:
##  28.90348 31.76319
## sample estimates:
## mean of x 
##  30.33333

For the left-tail alternative ($\mu<30$), the p-value is 0.6876. Don’t reject the null hypothesis at a significance level 0.05.

For the right-tail alternative ($\mu>30$), the p-value is 0.3124. Don’t reject the null hypothesis at a significance level 0.05.

For the two-tail alternative ($\mu \ne 30$), the p-value is 0.6248. Don’t reject the null hypothesis at a significance level 0.05.
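
The decision rule in Step 4 (reject when the p-value is no greater than $\alpha$) can also be applied in code instead of by reading the output. The sketch below assumes $\alpha = 0.05$; the names alternatives, p_values, and decisions are hypothetical helpers, not part of the original example.

```r
# Same data as Example 3
x <- c(28, 32, 35, 30, 31, 27, 29, 26, 33, 34, 32, 30, 31, 29, 28)
hypothesized_mean <- 30
alpha <- 0.05

# Run the test once per alternative and keep only the p-value
alternatives <- c("less", "greater", "two.sided")
p_values <- sapply(alternatives, function(alt) {
  t.test(x, mu = hypothesized_mean, alternative = alt)$p.value
})

# Apply the decision rule: reject H0 when the p-value is no greater than alpha
decisions <- ifelse(p_values <= alpha, "reject H0", "do not reject H0")

data.frame(alternative = alternatives,
           p_value = round(p_values, 4),
           decision = decisions)
```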

Example 4. (R code for 2-sample t-test for two population means)

# data for two groups
group1 <- c(25, 27, 30, 32, 28, 29, 26, 31, 33, 30)
group2 <- c(34, 35, 32, 31, 30, 33, 31, 32, 30, 33)

# Perform a two-sample t-test with alternative hypothesis Ha: mu1 < mu2
t_test_less <- t.test(group1, group2, alternative = "less")
t_test_less$p.value      # show the p-value

# Perform a two-sample t-test with alternative hypothesis Ha: mu1 > mu2
t_test_greater <- t.test(group1, group2, alternative = "greater")
t_test_greater$p.value   # show the p-value

# Perform a two-sample t-test with alternative hypothesis Ha: mu1 not equal mu2
t_test_two_sided <- t.test(group1, group2, alternative = "two.sided")
t_test_two_sided$p.value # show the p-value

For the left-tail alternative ($\mu_1<\mu_2$), the p-value is 0.003797. Reject the null hypothesis at a significance level 0.05.

For the right-tail alternative ($\mu_1>\mu_2$), the p-value is 0.9962. Don’t reject the null hypothesis at a significance level 0.05.

For the two-tail alternative ($\mu_1\ne \mu_2$), the p-value is 0.007593. Reject the null hypothesis at a significance level 0.05.
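
To connect this output with the two-sample t statistic formula given in Step 2 earlier, the statistic can be rebuilt from the sample means, variances, and sizes. This is a sketch: the variable names are introduced here for illustration, and the degrees of freedom use the Welch-Satterthwaite approximation, which is what t.test() uses by default when equal variances are not assumed.

```r
# Same data as Example 4
group1 <- c(25, 27, 30, 32, 28, 29, 26, 31, 33, 30)
group2 <- c(34, 35, 32, 31, 30, 33, 31, 32, 30, 33)

# Ingredients of the formula
x_bar1 <- mean(group1); x_bar2 <- mean(group2)
s2_1   <- var(group1);  s2_2   <- var(group2)
n1     <- length(group1); n2   <- length(group2)

# t statistic: difference in means divided by its standard error
se        <- sqrt(s2_1 / n1 + s2_2 / n2)
t_by_hand <- (x_bar1 - x_bar2) / se

# Welch-Satterthwaite degrees of freedom
df_welch <- (s2_1 / n1 + s2_2 / n2)^2 /
  ((s2_1 / n1)^2 / (n1 - 1) + (s2_2 / n2)^2 / (n2 - 1))

# Left-tail p-value (Ha: mu1 < mu2)
p_less <- pt(t_by_hand, df = df_welch)

# Compare with t.test()
welch <- t.test(group1, group2, alternative = "less")
c(t_by_hand = t_by_hand, t_from_R = unname(welch$statistic),
  p_by_hand = p_less, p_from_R = welch$p.value)
```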

6.13 R Code for Testing Hypotheses Based on Data from a Dataset

In this section, I will introduce R code for testing various hypotheses using data from a dataset. The data must be read into RStudio to form a so-called data frame. Usually, your original data are in an Excel spreadsheet or are stored in a “.csv” file.

We will use the National Health and Nutrition Examination Survey (NHANES) data to demonstrate the use of code. The documentation of the data is available at https://www.lock5stat.com/datasets3e/Lock5DataGuide3e.pdf.

This dataset is a subset of the 2009-2010 National Health and Nutrition Examination Survey (NHANES). NHANES is a national survey conducted by the Centers for Disease Control and Prevention (CDC) on a random sample of Americans. This subset contains data on select variables for the subset of people with responses to the questions about buying organic food and self-reported health status.

The data contain 4716 observations on the following 5 variables.

Case: Case ID number
Organic: Buy any food labeled organic (past 30 days)? (No or Yes)
Health: Self-rating of health (Excellent, Very good, Fair, Good, or Poor)
HealthBinary: Health with two categories: Poor / Fair / Good or Very good / Excellent
Income: Monthly income? (dollars)

Here are some questions and R code examples for the 1-sample z-test for a single population proportion, the 1-sample t-test for a single population mean, the 2-sample z-test for comparing two population proportions, and the 2-sample t-test for comparing two population means.

We introduce a very nice package called “lessR” so that you use less R code to do your work. You will need to install the package. To do this, in the lower-left Console panel of RStudio, type the code

install.packages("lessR")

and hit the “enter” key.

Once installed, the package stays on your computer, so don’t include this code in any of your code files or reports.

We are ready to take a look at some examples.

Before we explore some questions, we read the data from the webpage to RStudio using the following code:

# load the lessR package
library(lessR)

# Read data from Lock's web without showing data automatically
survey = Read("https://www.lock5stat.com/datasets3e/NHANES.csv", quiet=TRUE)

# Show the first 6 rows/observations of your data
head(survey)

##   Case Organic    Health          HealthBinary Income
## 1    1      No      Good    Poor / Fair / Good 3324.5
## 2    2      No      Fair    Poor / Fair / Good   1024
## 3    3     Yes      Good    Poor / Fair / Good   2500
## 4    4      No Excellent Very good / Excellent   1450
## 5    5      No      Good    Poor / Fair / Good   1450
## 6    6      No      Good    Poor / Fair / Good   5824

How did I get the following link “https://www.lock5stat.com/datasets3e/NHANES.csv”? I right clicked the file “NHANES.csv” and chose “copy link address”. The quiet = TRUE allows me to suppress useless messages.

Now, we answer each of the following questions using R code:

Question 1: 1-Sample Z-Test for Proportion (Buy Organic Food)

You want to test whether the proportion of people who buy organic food is significantly different from 0.5 (i.e., 50%). Using the NHANES dataset, perform a 1-sample z-test to evaluate this claim at a significance level of 0.05; that is, choose $\alpha=0.05$.

R Code:

Prop_test(data = survey, 
          variable = Organic, 
          success = "Yes", 
          pi = 0.5
         )

## 
## <<< Exact binomial test of a proportion 
## 
## variable: Organic 
## success: Yes 
## 
## ------ Describe ------
## 
## Number of missing values: 0 
## Number of successes: 1711 
## Number of failures: 3005 
## Number of trials: 4716 
## Sample proportion: 0.363 
## 
## ------ Infer ------
## 
## Hypothesis test for null of 0.5, p-value: 0.000
## 95% Confidence interval: 0.349 to 0.377

I performed a 1-sample test for a proportion with $H_0: p = 0.5$ vs $H_a: p \ne 0.5$ (as the output header shows, Prop_test() carried out an exact binomial test here). The result in the “Infer” portion shows a p-value of 0.000 and a 95% confidence interval between 0.349 and 0.377. Since the p-value is basically zero, less than any commonly used significance level such as 0.05, the null hypothesis is rejected. We conclude that the proportion of people who bought organic food differs from 0.5.

Note that in the code, we only used the “Organic” column in the data. We tested whether the proportion of people who buy organic food is 50% against the alternative that it is not.
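
The same question can be approached with base R, using only the counts reported in the “Describe” portion of the output (1711 successes out of 4716 trials). This is a sketch under that assumption; the names successes, trials, exact_result, and z_result are introduced here for illustration. binom.test() gives an exact binomial test, while prop.test() with correct = FALSE gives the large-sample z-test (reported as a chi-square statistic).

```r
# Counts taken from the Prop_test() output above
successes <- 1711
trials    <- 4716

# Exact binomial test of H0: p = 0.5 vs Ha: p not equal 0.5
exact_result <- binom.test(x = successes, n = trials, p = 0.5,
                           alternative = "two.sided")

# Large-sample z-test (no continuity correction) of the same hypotheses
z_result <- prop.test(x = successes, n = trials, p = 0.5,
                      alternative = "two.sided", correct = FALSE)

# Sample proportion and the two p-values
c(sample_proportion = unname(exact_result$estimate),
  exact_p_value = exact_result$p.value,
  z_test_p_value = z_result$p.value)
```

Both approaches lead to the same decision here because the sample is large and the sample proportion is far from 0.5.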

Question 2: 1-Sample T-Test for Mean (Income)

You want to test whether the average monthly income of the surveyed population is significantly more than $2,500. Using the NHANES dataset, perform a 1-sample t-test for the mean income at a significance level of 0.05.

R Code:

# Perform the 1-sample t-test for mean with H0: mu = 2500 vs Ha: mu > 2500
survey$Income = as.numeric(survey$Income)
ttest(data = survey, 
      x = Income, 
      mu = 2500, 
      alternative = "greater",
      brief = TRUE)

## 
## Income:  n.miss = 532,  n = 4184,   mean = 3443.379,  sd = 2494.818
## 
## t-cutoff for 95% range of variation: tcut =  -1.645 
## Standard Error of Mean: SE =  38.569 
## 
## 
## Alternative hypothesis: Population mean difference is greater than 2500 
## Hypothesized Value H0: mu = 2500 
## Hypothesis Test of Mean:  t-value = 24.459,  df = 4183,  p-value = 0.000
## 
## Margin of Error for 95% Confidence Level:  Inf
## 95% Confidence Interval for Mean:  3379.924 to Inf
## 
## Distance of sample mean from hypothesized:  943.379
## Standardized Distance, Cohen's d:  0.378

I performed a 1-sample t-test for mean income with $H_0: \mu = 2500$ vs $H_a: \mu > 2500$. The output shows a p-value of 0.000. Since the p-value is basically zero, less than any commonly used significance level such as 0.05, the null hypothesis is rejected. We conclude that the mean monthly income of the surveyed population is greater than $2,500.
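
The t-value in the output can be reproduced from the summary statistics printed in its first line (n = 4184, mean = 3443.379, sd = 2494.818). The sketch below assumes those rounded values; the variable names are introduced here for illustration.

```r
# Summary statistics copied from the ttest() output above
n     <- 4184
x_bar <- 3443.379
s     <- 2494.818
mu0   <- 2500        # hypothesized mean under H0

# Standard error of the mean and the t statistic
se     <- s / sqrt(n)
t_stat <- (x_bar - mu0) / se

# Right-tail p-value (Ha: mu > 2500) with n - 1 degrees of freedom
p_value <- pt(t_stat, df = n - 1, lower.tail = FALSE)

c(SE = se, t = t_stat, p_value = p_value)
```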

Question 3: Two-Sample Z-Test for Proportions (Buy Organic Food and Health Status)

You want to test whether the proportion of people buying organic food differs between those with “Very good / Excellent” health and those with “Poor / Fair / Good” health. Using the NHANES dataset, perform a two-sample z-test for proportions to evaluate this claim at a significance level of 0.05.

R Code:

# Perform the 2-sample z-test for comparing proportions
Prop_test(data = survey, 
          variable = Organic, 
          success = "Yes", 
          by = HealthBinary)

## 
## <<< 2-sample test for equality of proportions without continuity correction 
## 
## variable: Organic 
## success: Yes 
## by: HealthBinary 
## 
## --- Description
## 
##               Poor / Fair / Good   Very good / Excellent
## -----------  -------------------  ----------------------
## n_Yes                        925                     786
## n_total                     2939                    1777
## proportion                 0.315                   0.442
## 
## --- Inference
## 
## Chi-square statistic: 77.978 
## Degrees of freedom: 1 
## Hypothesis test of equal population proportions: p-value = 0.000

I performed the 2-sample z-test for comparing proportions, with $H_0: p_1 = p_2$ vs $H_a: p_1 \ne p_2$, where $p_1$ is the proportion of “Very good / Excellent” health people who buy organic food and $p_2$ is the proportion of “Poor / Fair / Good” health people who buy organic food. The “Inference” portion of the above result shows a p-value that is basically 0, so the null hypothesis is rejected. We conclude that the proportion of people buying organic food differs between those with “Very good / Excellent” health and those with “Poor / Fair / Good” health.
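
For comparison, the chi-square statistic above can be recovered in base R from the counts in the “Description” table. This is a sketch that assumes those counts; the names n_yes, n_total, result, p_pool, and z are introduced here for illustration. The squared pooled z statistic equals the chi-square statistic when no continuity correction is used.

```r
# Counts from the Prop_test() output:
# first entry = Very good / Excellent, second entry = Poor / Fair / Good
n_yes   <- c(786, 925)
n_total <- c(1777, 2939)

# Two-sample test of equal proportions without continuity correction
result <- prop.test(x = n_yes, n = n_total,
                    alternative = "two.sided", correct = FALSE)

# Pooled two-sample z statistic computed by hand
p_hat  <- n_yes / n_total
p_pool <- sum(n_yes) / sum(n_total)
z      <- (p_hat[1] - p_hat[2]) / sqrt(p_pool * (1 - p_pool) * sum(1 / n_total))

c(z = z, z_squared = z^2, X_squared = unname(result$statistic),
  p_value = result$p.value)
```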

Question 4: Two-Sample T-Test for Means (Income and HealthBinary)

You want to test whether there’s a significant difference in mean income between those with “Poor/Fair/Good” health and those with “Very good/Excellent” health. Using the NHANES dataset, perform a two-sample t-test for means to evaluate this claim at a significance level of 0.05.

# Perform the 2-sample t-test for comparing means
ttest(Income~HealthBinary, 
      data=survey, 
      alternative="two_sided",
      brief = TRUE
     )

## 
## Compare Income across HealthBinary with levels Very good / Excellent and Poor / Fair / Good 
## Response Variable:  Income, Income
## Grouping Variable:  HealthBinary, 
## 
## 
##  --- Describe ---
## 
## Income for HealthBinary Very good / Excellent:  n.miss = 188,  n = 1589,  mean = 4067.983,  sd = 2687.072
## Income for HealthBinary Poor / Fair / Good:  n.miss = 344,  n = 2595,  mean = 3060.915,  sd = 2287.208
## 
## Mean Difference of Income:  1007.068
## Weighted Average Standard Deviation:   2446.753 
## Standardized Mean Difference of Income: 0.412
## 
##  --- Infer ---
## 
## t-cutoff for 95% range of variation: tcut =  1.961 
## Standard Error of Mean Difference: SE =  77.939 
## 
## Hypothesis Test of 0 Mean Diff:  t-value = 12.921,  df = 4182,  p-value = 0.000
## 
## Margin of Error for 95% Confidence Level:  152.802
## 95% Confidence Interval for Mean Difference:  854.266 to 1159.870

I performed the 2-sample t-test for comparing mean incomes, with $H_0: \mu_1 = \mu_2$ vs $H_a: \mu_1 \ne \mu_2$, where $\mu_1$ is the mean income of people who have “Very good / Excellent” health and $\mu_2$ is the mean income of people who have “Poor / Fair / Good” health. The “Infer” portion of the above result shows a p-value that is basically 0, so the null hypothesis is rejected. We conclude that the mean monthly income differs between those with “Very good / Excellent” health and those with “Poor / Fair / Good” health.

We can also use the traditional t.test() function to do the test of hypotheses.

R Code:

t.test(Income~HealthBinary, data=survey, alternative="two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  Income by HealthBinary
## t = -12.434, df = 2953.7, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means between group Poor / Fair / Good and group Very good / Excellent is not equal to 0
## 95 percent confidence interval:
##  -1165.8766  -848.2593
## sample estimates:
##    mean in group Poor / Fair / Good mean in group Very good / Excellent 
##                            3060.915                            4067.983

Since the p-value is less than 0.00000000000000022, we again arrive at the same conclusion.
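
The two outputs report different t-values and degrees of freedom (12.921 with df = 4182 versus -12.434 with df = 2953.7). The numbers are consistent with the first output using a pooled (equal-variance) standard error and the second using the Welch (unequal-variance) version; the sign differs only because the groups are subtracted in the opposite order. The sketch below checks this from the summary statistics in the “Describe” portion; the variable names are introduced here for illustration.

```r
# Summary statistics copied from the ttest() output above
n1 <- 1589; m1 <- 4067.983; s1 <- 2687.072   # Very good / Excellent
n2 <- 2595; m2 <- 3060.915; s2 <- 2287.208   # Poor / Fair / Good

# Pooled (equal-variance) version
sp        <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
t_pooled  <- (m1 - m2) / (sp * sqrt(1 / n1 + 1 / n2))
df_pooled <- n1 + n2 - 2

# Welch (unequal-variance) version
v1       <- s1^2 / n1
v2       <- s2^2 / n2
t_welch  <- (m1 - m2) / sqrt(v1 + v2)
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

round(c(t_pooled = t_pooled, df_pooled = df_pooled,
        t_welch = t_welch, df_welch = df_welch), 3)
```

With samples this large, both versions give the same decision.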

Chapter 7. Chi-Square Tests for Categorical Variables

Case Study #1:

Suppose a pharmaceutical company is conducting a clinical trial to assess the adherence of patients to a new medication. They want to determine if the observed distribution of medication adherence (e.g., completely adherent, partially adherent, non-adherent) among 300 trial participants matches the expected distribution based on prior research or clinical guidelines.

The company has data on the expected distribution based on clinical guidelines:

Completely Adherent: 60%
Partially Adherent: 30%
Non-Adherent: 10%

They also have observed data from the clinical trial:

Completely Adherent: 180 patients
Partially Adherent: 90 patients
Non-Adherent: 30 patients

How can the pharmaceutical company assess whether the observed medication adherence in the clinical trial matches the expected distribution based on clinical guidelines?

Case Study #2:

Imagine you are a researcher studying the association between a specific disease (e.g., COVID-19) and gender (male or female) among individuals in a community. You want to determine if there is a significant association between the disease and gender.

You have data on 500 randomly selected individuals, of which 200 are males and 300 are females:

Among 200 males, 50 have the disease, and 150 do not.
Among 300 females, 60 have the disease, and 240 do not.

You wish to test whether the distribution of the disease differs significantly between males and females. That is, your hypotheses are

Null Hypothesis (H0): There is no association between the disease and gender.
Alternative Hypothesis (Ha): There is an association between the disease and gender.

How can you carry out such a hypothesis test?

7.1 Testing Goodness-of-Fit for a Single Categorical Variable

Case study one at the beginning of this chapter considered a population of patients. The variable of interest is adherence (e.g., completely adherent, partially adherent, non-adherent), which is a categorical variable. We do not know the underlying distribution of the population on this variable, and we wish to test the hypothesis that the variable follows a particular distribution, i.e.,

Completely Adherent: 60%
Partially Adherent: 30%
Non-Adherent: 10%

The test method, called the Chi-squared goodness-of-fit test, is based on the chi-square distribution.

Each chi-squared distribution is associated with a number called the number of degrees of freedom. The above graph shows 6 different chi-squared distributions. All chi-squared distributions are skewed to the right.

In this chapter, we will consider hypothesis testing problems that involve calculating p-values based on chi-squared distributions. Instead of case study one, we consider a more general problem:

The population is divided into k categories based on a categorical variable.
The population proportions for the $k$ categories are $p_1, p_2, \cdots, p_k$.
The null hypothesis is $H_0: p_1=p_{10}, p_2=p_{20}, \cdots, p_k=p_{k0}$, where $p_{10}, p_{20}, \cdots, p_{k0}$ are given.
The alternative hypothesis is $H_a: \text{At least one of the equations under the null hypothesis is incorrect.}$

The test procedure requires a random sample of size $n$ from the population whose probability distribution is unknown. These $n$ observations are arranged in a frequency table with $k$ classes/categories.

Let $O_i$ be the observed frequency in the $i$th class. Under the null hypothesis, we compute the expected frequency in the $i$th class, denoted $E_i = n\cdot p_{i0}$, $i = 1, 2, ..., k$. The test statistic is

\[\chi_0^2=\sum_{i=1}^k \frac{(O_i -E_i)^2}{E_i}\]

Under the null hypothesis, $\chi_0^2$ has, approximately, a chi-square distribution with $k - 1$ degrees of freedom.

The $p$-value is always the upper-tail area under the chi-square curve with cutoff $\chi_0^2$.

The above is called the goodness of fit test.
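
The procedure above can be applied to case study one. The sketch below uses the observed counts (180, 90, 30) and the guideline proportions (60%, 30%, 10%) stated at the beginning of this chapter; the variable names are introduced here for illustration.

```r
# Observed counts from the clinical trial (case study one)
observed <- c(completely = 180, partially = 90, non_adherent = 30)

# Proportions under the null hypothesis (clinical guidelines)
expected_props <- c(0.60, 0.30, 0.10)

# Expected frequencies E_i = n * p_i0
expected <- sum(observed) * expected_props

# Test statistic and upper-tail p-value with k - 1 degrees of freedom
chi_sq  <- sum((observed - expected)^2 / expected)
p_value <- pchisq(chi_sq, df = length(observed) - 1, lower.tail = FALSE)

# The same test with the built-in function
gof <- chisq.test(x = observed, p = expected_props)

c(chi_sq = chi_sq, p_value = p_value)
```

In this case study the observed counts equal the expected counts exactly, so the statistic is 0 and the p-value is 1: the data give no evidence against the guideline distribution.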

Example.

Throw a 6-sided die 100 times. The observations are

17 ones
18 twos
13 threes
17 fours
22 fives
13 sixes

Test, at the significance level 0.05, whether the die is fair.

Solution.

The null hypothesis is $H_0: p_1=p_2=\cdots=p_6=1/6$ and the alternative hypothesis is $H_a: \text{At least one of the probabilities is not 1/6}$. Under the null hypothesis, the expected frequencies are all $(\frac{1}{6})(100)=16.67$. The observed frequencies are $O_1=17, O_2=18, O_3=13, O_4=17, O_5=22, O_6=13$.
The test statistic is

\[\chi_0^2=\sum_{i=1}^k \frac{(O_i -E_i)^2}{E_i}\]

\[\chi_0^2=\frac{(17 -16.67)^2}{16.67}+\frac{(18 -16.67)^2}{16.67}+\frac{(13 -16.67)^2}{16.67}+\frac{(17 -16.67)^2}{16.67}+\frac{(22 -16.67)^2}{16.67}+\frac{(13 -16.67)^2}{16.67}\] \[\chi_0^2=3.44\] with $6-1=5$ degrees of freedom.

The $p$-value is $P(\chi^2>3.44)= 0.6325$, the area of the right region under the chi-squared density curve $(df=5)$ (Watch: https://www.youtube.com/watch?v=HwD7ekD5l0g).
Decision & conclusion: Since the $p$-value is greater than the significance level 0.05, the null hypothesis is NOT rejected. We conclude that we don’t have enough evidence to say that the die is unfair.

The R code:

chisq.test(x=c(17, 18, 13, 17, 22, 13 ))
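
The hand calculation in the solution can be reproduced step by step as a check on chisq.test(). This is a sketch; the names observed, expected, chi_sq, p_value, and fit are introduced here for illustration.

```r
# Observed frequencies for faces 1 to 6
observed <- c(17, 18, 13, 17, 22, 13)
n <- sum(observed)      # number of throws
k <- length(observed)   # number of categories

# Expected frequencies under H0: each face has probability 1/6
expected <- rep(n / k, k)

# Chi-square statistic and upper-tail p-value with k - 1 degrees of freedom
chi_sq  <- sum((observed - expected)^2 / expected)
p_value <- pchisq(chi_sq, df = k - 1, lower.tail = FALSE)

# Compare with the built-in test (equal probabilities are the default)
fit <- chisq.test(x = observed)

round(c(by_hand = chi_sq, chisq_test = unname(fit$statistic),
        p_value = p_value), 4)
```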

7.2 Testing for an Association between Two categorical Variables

Case study #2 tests whether there is an association between disease and gender. In the following, we consider testing whether there is an association between any two given categorical variables. Assume that the first variable has $r$ levels and that the second has $c$ levels. We will let $O_{ij}$ be the observed frequency for level $i$ of the first variable and level $j$ of the second variable. The data would, in general, appear as shown in the following table. Such a table is usually called an $r \times c$ contingency table.

To test whether the two categorical variables are associated, the null and alternative hypotheses are

\[H_0: \text{The two categorical variables are independent} \quad \text{vs.} \quad H_a:\text{The two categorical variables are associated}\]

We again use the chi-square test, and the test statistic is

\[\chi_0^2=\sum_{i,j} \frac{(O_{ij} -E_{ij})^2}{E_{ij}}\]

where the expected frequency $E_{ij}$ is calculated as the sum of the $i$th row multiplied by the sum of the $j$th column, then divided by the sum of all frequencies.

Under the null hypothesis, this test statistic has an approximate chi-square distribution with $(r - 1)(c - 1)$ degrees of freedom.

The p-value is calculated in the same way as for the goodness of fit test.
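
The test just described can be applied to case study #2 (disease and gender). The sketch below builds the $2 \times 2$ table from the counts given at the beginning of the chapter and follows the formulas above; the object names are introduced here for illustration.

```r
# Observed counts from case study #2 (rows: gender, columns: disease status)
disease_table <- matrix(c(50, 150,
                          60, 240),
                        nrow = 2, byrow = TRUE,
                        dimnames = list(Gender  = c("Male", "Female"),
                                        Disease = c("Yes", "No")))

# Expected frequencies: (row total * column total) / grand total
expected <- outer(rowSums(disease_table), colSums(disease_table)) /
  sum(disease_table)

# Chi-square statistic with (r - 1)(c - 1) degrees of freedom
chi_sq  <- sum((disease_table - expected)^2 / expected)
df      <- (nrow(disease_table) - 1) * (ncol(disease_table) - 1)
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

c(chi_sq = chi_sq, df = df, p_value = p_value)
```

The statistic is about 1.75, which is below the 5% critical value of 3.84 for one degree of freedom, so at the 0.05 level these data do not provide sufficient evidence of an association between the disease and gender.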

Example: Association Between Smoking Status and Lung Cancer in a Population

Suppose you are studying the association between smoking status (smoker or non-smoker) and the presence of lung cancer (yes or no) in a population of 500 individuals.

The data (fake!!!) are

                Lung Cancer (Yes)   Lung Cancer (No)   Total
Smoker                         80                120     200
Non-Smoker                     30                270     300
Total                         110                390     500

Test whether there is a significant association between smoking and the development of lung cancer. Use a 5% significance level.

Solution.

Null Hypothesis (H0): There is no association between smoking status and the presence of lung cancer.

Alternative Hypothesis (Ha): There is an association between smoking status and the presence of lung cancer.

Calculate Expected Frequencies

To calculate expected frequencies, we assume that smoking status and lung cancer are independent. We find the expected frequency for each cell using the formula:

\[\text{Expected Frequency} = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}\]

For example, the expected frequency for “Smoker” and “Lung Cancer (Yes)” is:

\[\text{Expected Frequency} = \frac{200 \times 110}{500}=44\]

Repeat this calculation for all cells in the table. The expected frequency table is:

                Lung Cancer (Yes)   Lung Cancer (No)
Smoker                         44                156
Non-Smoker                     66                234

Calculate the Chi-Squared Statistic

\[\chi_0^2 = \frac{(80-44)^2}{44} + \frac{(120-156)^2}{156} + \frac{(30-66)^2}{66} + \frac{(270-234)^2}{234} = 62.94\]

The chi-square distribution has $(2-1)(2-1)=1$ degree of freedom.

The p-value (0.000000000000002134) is the area beyond the cutoff 62.94 under the chi-square distribution with one degree of freedom. Since it is so small (of course smaller than 0.05), the null hypothesis is rejected. We conclude that the data provide sufficient evidence that lung cancer is associated with smoking status. (Keep in mind, our data are fake!)

The R code:

chisq.test(x=matrix(c(80, 30, 120, 270), 2, 2))

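The result based on the R code is close to, but not exactly the same as, the value obtained by hand, because chisq.test() applies a continuity correction to $2 \times 2$ tables by default.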
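
To reproduce the hand calculation exactly, the expected frequencies can be computed from the row and column totals and the continuity correction turned off with correct = FALSE. This is a sketch; the object names and the row and column labels are introduced here for illustration.

```r
# Observed counts (rows: smoking status, columns: lung cancer)
smoking_table <- matrix(c(80, 30, 120, 270), 2, 2,
                        dimnames = list(Smoking = c("Smoker", "Non-Smoker"),
                                        LungCancer = c("Yes", "No")))

# Expected frequencies: (row total * column total) / grand total
expected <- outer(rowSums(smoking_table), colSums(smoking_table)) /
  sum(smoking_table)
expected

# Without the continuity correction (matches the hand calculation)
uncorrected <- chisq.test(smoking_table, correct = FALSE)

# With the default continuity correction
corrected <- chisq.test(smoking_table)

round(c(uncorrected = unname(uncorrected$statistic),
        corrected = unname(corrected$statistic)), 2)
```

Either way the p-value is far below 0.05, so the conclusion does not change.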

Chapter 8 ANOVA to Compare Means

A one-way analysis of variance (ANOVA) is a statistical test used to determine whether there are statistically significant differences among the means of three or more independent (unrelated) groups. It’s an extension of the t-test, which is used to compare means between two groups. In a one-way ANOVA, you have one categorical independent variable (also known as a factor) and one continuous dependent variable.

Here are the key components of a one-way ANOVA:

Groups or Treatments: You have three or more groups or treatments. These are the categories or levels of the independent variable. For example, in a plant growth experiment, the groups might be different types of fertilizers (Fertilizer A, B, C).

Dependent Variable: You measure a continuous variable, such as height, weight, or time. In the plant growth experiment, the dependent variable is plant height.

Hypotheses:

Null Hypothesis (H0): There are no significant differences among the group means.
Alternative Hypothesis (Ha): There are significant differences among at least two group means.

Assumptions:

Independence: The observations within each group are independent of each other.
Normality: The data within each group follow a normal distribution.
Homogeneity of Variance: The variances within each group are roughly equal.
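
These assumptions can be examined informally in R. The sketch below uses the fertilizer data from the example later in this chapter and two common checks that are my choice rather than part of the original text: the Shapiro-Wilk test on the residuals for normality and Bartlett’s test for equal variances. With only 8 observations per group, such tests have little power, so treat the results as a rough guide.

```r
# Fertilizer data from the example later in this chapter
mydata <- data.frame(
  Fertilizer = rep(c("A", "B", "C"), each = 8),
  PlantHeight = c(25, 28, 30, 32, 27, 29, 31, 33, 30, 33, 35, 36, 32, 34,
                  38, 37, 20, 22, 25, 24, 21, 23, 26, 27)
)

# Fit the one-way ANOVA model so residuals are available
fit <- aov(PlantHeight ~ Fertilizer, data = mydata)

# Normality: Shapiro-Wilk test on the residuals
normality <- shapiro.test(residuals(fit))

# Homogeneity of variance: Bartlett's test across the groups
equal_var <- bartlett.test(PlantHeight ~ Fertilizer, data = mydata)

# Group standard deviations for a direct comparison
group_sd <- tapply(mydata$PlantHeight, mydata$Fertilizer, sd)
group_sd

c(shapiro_p = normality$p.value, bartlett_p = equal_var$p.value)
```

Large p-values in these checks mean there is no clear evidence against the corresponding assumption.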

Analysis:

A one-way ANOVA calculates an $F$-statistic, which is the ratio of the variance between group means to the variance within groups. We will use software to calculate $F$.

Significance Level (α): You choose a significance level to determine the threshold for statistical significance (e.g., α = 0.05).

Interpretation:

If the p-value (the probability of obtaining the observed result if the null hypothesis is true) is less than the chosen significance level (α), you reject the null hypothesis. This indicates that there are significant differences among at least two of the group means.

If the p-value is greater than α, you fail to reject the null hypothesis, meaning that the data do not provide sufficient evidence of differences among the group means.

Post-Hoc Tests (Optional): If the one-way ANOVA indicates significant differences, post-hoc tests like Tukey’s HSD or Bonferroni tests can be used to identify which specific groups differ from each other.

One-way ANOVA is a powerful tool for comparing means across multiple groups and is commonly used in various fields, including experimental research, clinical trials, social sciences, and quality control to determine if different treatments or conditions have a significant impact on the dependent variable.

Example: Comparing the Effect of Three Fertilizers on Plant Growth

Suppose you are a botanist conducting an experiment to determine the effect of three different fertilizers (Fertilizer A, Fertilizer B, and Fertilizer C) on the growth of a specific type of plant. You have divided 24 plants into three groups of 8, each treated with a different fertilizer, and you want to test whether there are significant differences in plant height among the three fertilizer groups.

Your data:

Fertilizer A (Group 1): Plant heights (in centimeters)- 25, 28, 30, 32, 27, 29, 31, 33
Fertilizer B (Group 2): Plant heights (in centimeters)- 30, 33, 35, 36, 32, 34, 38, 37
Fertilizer C (Group 3): Plant heights (in centimeters)- 20, 22, 25, 24, 21, 23, 26, 27

Solution.

Let’s first visualize the data using the following R code:

# Create a data frame with two columns called "Fertilizer" and "PlantHeight"
mydata <- data.frame(
  Fertilizer = rep(c("A", "B", "C"), each = 8),
  PlantHeight = c(25, 28, 30, 32, 27, 29, 31, 33, 30, 33, 35, 36, 32, 34, 
                  38, 37, 20, 22, 25, 24, 21, 23, 26, 27)
)

# Plot data
boxplot(PlantHeight ~ Fertilizer, 
        data = mydata
       )

This code creates a data frame called mydata with two columns: “Fertilizer” and “PlantHeight.” The “Fertilizer” column is created using the rep function to repeat the values “A,” “B,” and “C” each 8 times (for a total of 24 observations). The “PlantHeight” column contains numeric values representing the height of plants.

The remaining code generates a boxplot to visualize the distribution of plant heights for different fertilizers. The formula PlantHeight ~ Fertilizer specifies that the variable “PlantHeight” is plotted against the variable “Fertilizer.” The data = mydata argument indicates that the data comes from the mydata data frame.

The boxplots seem to indicate a significant difference among the means of the three fertilizers.

Step 1: Specify hypotheses.

Null Hypothesis ($H_0$): There are no significant differences in plant height among the three fertilizer groups ($\mu_1 = \mu_2 = \mu_3$).

Alternative Hypothesis ($H_a$): There are significant differences in plant height among at least two of the three fertilizer groups.

Step 2: Perform the One-Way ANOVA with R.

# Perform a one-way ANOVA
anova_result <- aov(PlantHeight ~ Fertilizer, data = mydata)

# Summarize the ANOVA results
summary(anova_result)

##             Df Sum Sq Mean Sq F value    Pr(>F)    
## Fertilizer   2  474.1  237.04   35.12 0.0000002 ***
## Residuals   21  141.7    6.75                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step 3: Interpret the Results

Since the p-value (0.0000002) is less than any commonly chosen significance level such as 0.05, you reject the null hypothesis. This indicates that there are significant differences in plant height among at least two of the three fertilizer groups.

(If the p-value were greater than the chosen significance level, you would fail to reject the null hypothesis. This would suggest that there are no significant differences in plant height among the fertilizer groups.)
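
The F value in the ANOVA table is the ratio of the between-group mean square to the within-group mean square, and it can be rebuilt from the data as a check. This is a sketch; names such as ss_between, ss_within, and f_stat are introduced here for illustration.

```r
# Same data as in the example
mydata <- data.frame(
  Fertilizer = rep(c("A", "B", "C"), each = 8),
  PlantHeight = c(25, 28, 30, 32, 27, 29, 31, 33, 30, 33, 35, 36, 32, 34,
                  38, 37, 20, 22, 25, 24, 21, 23, 26, 27)
)

# Group means, group sizes, and the overall mean
group_means <- tapply(mydata$PlantHeight, mydata$Fertilizer, mean)
group_sizes <- tapply(mydata$PlantHeight, mydata$Fertilizer, length)
grand_mean  <- mean(mydata$PlantHeight)
k <- length(group_means)   # number of groups
n <- nrow(mydata)          # total number of observations

# Between-group and within-group sums of squares
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
ss_within  <- sum((mydata$PlantHeight -
                   ave(mydata$PlantHeight, mydata$Fertilizer))^2)

# Mean squares, F statistic, and upper-tail p-value
ms_between <- ss_between / (k - 1)
ms_within  <- ss_within / (n - k)
f_stat     <- ms_between / ms_within
p_value    <- pf(f_stat, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

round(c(SS_between = ss_between, SS_within = ss_within, F = f_stat), 2)
```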

To tell which two specific fertilizers differ, we can find the confidence interval for the mean difference between any two fertilizers.

# Perform Tukey's HSD post-hoc test to get confidence intervals
tukey_result <- TukeyHSD(anova_result)

# Print results
tukey_result

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = PlantHeight ~ Fertilizer, data = mydata)
## 
## $Fertilizer
##        diff        lwr       upr     p adj
## B-A   5.000   1.725683  8.274317 0.0025705
## C-A  -5.875  -9.149317 -2.600683 0.0005250
## C-B -10.875 -14.149317 -7.600683 0.0000001

In this code:

We first create a data frame containing the fertilizer and plant height data.

We use the aov() function to perform a one-way ANOVA to determine if there are significant differences among the groups.

We perform Tukey’s post-hoc test using the TukeyHSD() function, which calculates the Tukey Honestly Significant Difference (HSD) intervals for pairwise comparisons between the groups.

Finally, we view the results by typing tukey_result.

The tukey_result object contains the results of the post-hoc test: for each pair of groups, the difference in means (diff), the lower and upper limits of the Tukey HSD interval (lwr and upr), and the adjusted p-value (p adj). You can examine these results to identify which specific pairs of groups have statistically significant differences in means.

Using Tukey’s test is a common method for post-hoc analysis following an ANOVA to determine which groups are significantly different from each other while controlling for the family-wise error rate.

In the previous output, the “diff” value 5.000 in the B-A row is the difference in sample mean plant height between fertilizers B and A (B minus A). The 95% confidence interval for this mean difference runs from 1.725683 to 8.274317. Because this interval does not contain 0, and the adjusted p-value 0.0025705 is below 0.05, the means for fertilizers A and B differ significantly. The other rows can be interpreted similarly.
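The rule "an interval that excludes zero marks a significant pair" can also be applied in code. The sketch below uses a small hypothetical dataset (the object names demo, demo_anova, demo_tukey and pair_table, and the plant heights themselves, are made up for illustration and are not the data analysed above) so that it runs on its own.

```r
# Hypothetical plant heights for three fertilizers; these numbers are
# invented for illustration and are NOT the data used in the notes.
demo <- data.frame(
  Fertilizer  = factor(rep(c("A", "B", "C"), each = 5)),
  PlantHeight = c(20, 22, 21, 23, 24,   # fertilizer A
                  26, 27, 25, 28, 29,   # fertilizer B
                  15, 16, 14, 17, 18)   # fertilizer C
)

# One-way ANOVA followed by Tukey's HSD, as in the example above
demo_anova <- aov(PlantHeight ~ Fertilizer, data = demo)
demo_tukey <- TukeyHSD(demo_anova)

# The pairwise table is a matrix stored under the factor's name,
# with columns "diff", "lwr", "upr" and "p adj"
pair_table <- as.data.frame(demo_tukey$Fertilizer)

# A pair differs significantly at the 5% family-wise level when its
# confidence interval lies entirely above or entirely below zero
pair_table$excludes_zero <- pair_table$lwr > 0 | pair_table$upr < 0
pair_table
```

With the real data, the same two lines applied to tukey_result$Fertilizer would flag every row whose interval excludes zero.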

Chapter 9 Inference for Regression

We introduced regression in Chapter 2. In this chapter, we introduce inference for regression; that is, we construct confidence intervals for, and test hypotheses about, the slope parameter.

Example: Relationship Between Study Hours and Exam Scores

Suppose you want to investigate the relationship between the number of study hours (independent variable) and the exam scores (dependent variable) of a group of students. You collect the following data from 20 students:

Study hours: 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40

Exam scores: 45, 55, 60, 70, 75, 80, 85, 90, 92, 95, 100, 105, 110, 112, 115, 120, 125, 128, 132, 135

You want to perform inference for simple linear regression.

Here’s how you can do it in R:

Prepare Data:

# Prepare data
study_hours <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40)
exam_scores <- c(45, 55, 60, 70, 75, 80, 85, 90, 92, 95, 100, 105, 110, 112, 115, 120, 125, 128, 132, 135)

# Create a data frame
mydata <- data.frame(StudyHours = study_hours, ExamScores = exam_scores)

Plot Data:

plot(ExamScores ~ StudyHours, 
     data = mydata,
     xlab = "Study Hour",
     ylab = "Exam Score",
     main = "Plot of Exam Scores vs. Study Time")

In this code:

We use the plot() function to create a scatter plot. The xlab, ylab, and main arguments are used to label the x-axis, y-axis, and add a title to the plot.

The scatter plot visually shows the relationship between study hours and exam scores. In this case, you can see how exam scores tend to increase as study hours increase.

Fit a Simple Regression Model:

# Fit a simple linear regression model
model <- lm(ExamScores ~ StudyHours, data = mydata)

# Summary of the regression model
summary(model)

## 
## Call:
## lm(formula = ExamScores ~ StudyHours, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3286 -1.8122  0.3992  2.3938  4.6346 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 49.89474    1.60840   31.02 <0.0000000000000002 ***
## StudyHours   2.21692    0.06713   33.02 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.462 on 18 degrees of freedom
## Multiple R-squared:  0.9838, Adjusted R-squared:  0.9829 
## F-statistic:  1090 on 1 and 18 DF,  p-value: < 0.00000000000000022

In this code:

We create sample data for study hours and exam scores.

We create a data frame using the sample data.

We fit a simple linear regression model using the lm() function, where ExamScores is the dependent variable, and StudyHours is the independent variable.

We obtain a summary of the regression model using summary(model).

The summary(model) provides various statistics, including the coefficient estimates (slope and intercept), standard errors, t-values, and p-values.
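If you need these statistics as numbers rather than printed text, one option is to pull them out of the summary object. The following is a minimal sketch that rebuilds the data and model from this example so it runs on its own; the names coef_table, slope, slope_se, t_value and p_value are illustrative choices.

```r
# Re-create the data and model from this example
study_hours <- seq(2, 40, by = 2)
exam_scores <- c(45, 55, 60, 70, 75, 80, 85, 90, 92, 95,
                 100, 105, 110, 112, 115, 120, 125, 128, 132, 135)
mydata <- data.frame(StudyHours = study_hours, ExamScores = exam_scores)
model  <- lm(ExamScores ~ StudyHours, data = mydata)

# coef(summary(model)) returns the coefficient table as a numeric matrix
coef_table <- coef(summary(model))

# Pick out the entries in the StudyHours (slope) row by name
slope    <- coef_table["StudyHours", "Estimate"]
slope_se <- coef_table["StudyHours", "Std. Error"]
t_value  <- coef_table["StudyHours", "t value"]
p_value  <- coef_table["StudyHours", "Pr(>|t|)"]

# The t value is the estimate divided by its standard error
all.equal(t_value, slope / slope_se)

# R-squared is stored separately in the summary object
summary(model)$r.squared
```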

The “Estimate” in the “StudyHours” row is the slope of the regression line. In this example, the estimated slope is about 2.22. This means that, on average, each additional hour of study is associated with an increase of approximately 2.22 points in the exam score.

The “Std. Error” is the standard error of the estimate. For the slope, it measures how much the slope estimate would vary from sample to sample; here it is 0.06713.

The “t value” is the test statistic for the slope. It indicates how many standard errors the estimated slope is away from zero. In this case, the t-value is 33.02.

The “Pr(>|t|)” represents the p-value associated with the test of the null hypothesis that the slope is zero (no relationship). In this example, the p-value is very small (< 0.0000000000000002), indicating that the slope is significantly different from 0.

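R-squared is the proportion of the variation in the response that the regression model explains. It lies between 0 and 1, and values closer to 1 indicate a closer fit to the data. Here R-squared is 0.9838, so the model explains about 98.4% of the variation in exam scores. The adjusted R-squared (0.9829) applies a penalty for the number of predictors and is mainly useful when comparing models with different numbers of predictors.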

In summary, based on the results, we can infer that there is a statistically significant positive relationship between the number of study hours and exam scores. For each additional hour of study, students, on average, can expect their exam scores to increase by approximately 2.22 points.
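The test above can be complemented with a confidence interval for the slope. The sketch below uses confint() and then repeats the calculation by hand as estimate ± t* × standard error; it rebuilds the data and model so it runs on its own, and the names ci, est, se, t_star and manual_ci are illustrative.

```r
# Re-create the data and model from this example
study_hours <- seq(2, 40, by = 2)
exam_scores <- c(45, 55, 60, 70, 75, 80, 85, 90, 92, 95,
                 100, 105, 110, 112, 115, 120, 125, 128, 132, 135)
mydata <- data.frame(StudyHours = study_hours, ExamScores = exam_scores)
model  <- lm(ExamScores ~ StudyHours, data = mydata)

# 95% confidence intervals for the intercept and the slope
ci <- confint(model, level = 0.95)
ci["StudyHours", ]

# The same interval by hand: estimate +/- t* x standard error,
# where t* is the 97.5th percentile of the t distribution
# with n - 2 = 18 degrees of freedom
est    <- coef(summary(model))["StudyHours", "Estimate"]
se     <- coef(summary(model))["StudyHours", "Std. Error"]
t_star <- qt(0.975, df = df.residual(model))
manual_ci <- c(est - t_star * se, est + t_star * se)
manual_ci
```

The interval for the slope is roughly 2.08 to 2.36. Because it lies entirely above zero, it agrees with the test's conclusion that the slope differs from zero.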

Finally, we can add the regression line to the scatterplot.

# Scatter plot
plot(ExamScores ~ StudyHours, 
     data = mydata,
     xlab = "Study Hours", 
     ylab = "Exam Scores",
     main = "Scatter Plot of Exam Scores vs. Study Hours")

# Add regression line to the plot
abline(model, col = "red")

In this code:

We use the plot() function to create a scatter plot. The xlab, ylab, and main arguments are used to label the x-axis, y-axis, and add a title to the plot.

We add a regression line to the plot using the abline() function. The model object contains the regression model, and we specify the color of the line as “red.”

Another example:

Our research focuses on understanding the impact of aging on the physical abilities of mice. We are particularly interested in how the age of mice correlates with their average running speed. The study involves a group of mice of varying ages, ranging from 6 to 20 months.

As mice age, there is anecdotal evidence suggesting that their physical activity levels might change. To investigate this, we conducted experiments where we measured the average running speed of individual mice at different ages. The running speed was assessed using a controlled environment with a running wheel, allowing us to capture their natural locomotor behavior.

The data consists of paired observations of mouse age and the corresponding average running speed.

Age: 12, 14, 16, 18, 20
Running Speed: 24, 20, 16, 12, 10

The following is the code for doing a regression analysis. We first prepare and plot the data.
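The notes do not show the code for this step, so the following is a sketch of how the data could be prepared and plotted. The names mouse_data, Age and RunningSpeed are taken from the Call line of the model output; the vector names, axis labels and plot title are assumptions made for illustration.

```r
# Data from the example above
age           <- c(12, 14, 16, 18, 20)
running_speed <- c(24, 20, 16, 12, 10)

# Create a data frame (column names match the model's Call line)
mouse_data <- data.frame(Age = age, RunningSpeed = running_speed)

# Scatter plot of running speed against age
# (labels and title are illustrative choices)
plot(RunningSpeed ~ Age,
     data = mouse_data,
     xlab = "Age (months)",
     ylab = "Average Running Speed",
     main = "Running Speed vs. Age")
```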

Now, we fit a simple regression model:
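The fitting code is not shown either. A call along the following lines would produce a summary like the output printed next; the object name mouse_model is hypothetical, and the data frame is rebuilt here so the block runs on its own.

```r
# Re-create the data frame from the example
mouse_data <- data.frame(Age          = c(12, 14, 16, 18, 20),
                         RunningSpeed = c(24, 20, 16, 12, 10))

# Fit a simple linear regression of running speed on age
mouse_model <- lm(RunningSpeed ~ Age, data = mouse_data)

# Summary of the regression model
summary(mouse_model)
```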

## 
## Call:
## lm(formula = RunningSpeed ~ Age, data = mouse_data)
## 
## Residuals:
##                     1                     2                     3 
##  0.400000000000002465 -0.000000000000002179 -0.400000000000001021 
##                     4                     5 
## -0.800000000000000377  0.800000000000001155 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45.2000     1.8762   24.09 0.000157 ***
## Age          -1.8000     0.1155  -15.59 0.000574 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7303 on 3 degrees of freedom
## Multiple R-squared:  0.9878, Adjusted R-squared:  0.9837 
## F-statistic:   243 on 1 and 3 DF,  p-value: 0.0005737

Interpretation:

The $R^2$ value measures the goodness of fit of the model; the closer it is to 1, the better the model fits the data. The $R^2$ here is 98.78%, which means that the model explains 98.78% of the total variation in running speed.
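To see where the 98.78% comes from, $R^2$ can be computed by hand as one minus the ratio of the residual variation to the total variation. The sketch below assumes the model is stored under the hypothetical name mouse_model; ss_res, ss_tot and r_squared are likewise illustrative names.

```r
# Re-create the data and the fitted model
mouse_data <- data.frame(Age          = c(12, 14, 16, 18, 20),
                         RunningSpeed = c(24, 20, 16, 12, 10))
mouse_model <- lm(RunningSpeed ~ Age, data = mouse_data)

# Variation left unexplained by the regression line
ss_res <- sum(residuals(mouse_model)^2)

# Total variation of running speed around its mean
ss_tot <- sum((mouse_data$RunningSpeed - mean(mouse_data$RunningSpeed))^2)

# Proportion of the total variation explained by the model
r_squared <- 1 - ss_res / ss_tot
r_squared

# In simple linear regression this equals the squared correlation
cor(mouse_data$Age, mouse_data$RunningSpeed)^2
```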

“Pr(>|t|)” gives the p-value for each corresponding hypothesis test. The p-value 0.000574 in the Age row indicates that the null hypothesis $H_0: \beta_1 = 0$ should be rejected. The data provide significant evidence that the slope is not zero, meaning that running speed is linearly related to the age of a mouse.

The value $-1.8$ is the estimate of the slope ($\beta_1$), which means that for each additional month of age, the running speed decreases by 1.8 units on average.
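The slope test in the coefficient table can be reproduced by hand: the t statistic is the slope estimate divided by its standard error, and the two-sided p-value comes from a t distribution with n - 2 = 3 degrees of freedom. This is a sketch; mouse_model, coefs, b1, se_b1, t_stat and p_val are hypothetical names.

```r
# Re-create the data and the fitted model
mouse_data <- data.frame(Age          = c(12, 14, 16, 18, 20),
                         RunningSpeed = c(24, 20, 16, 12, 10))
mouse_model <- lm(RunningSpeed ~ Age, data = mouse_data)

# Slope estimate and its standard error from the coefficient table
coefs <- coef(summary(mouse_model))
b1    <- coefs["Age", "Estimate"]
se_b1 <- coefs["Age", "Std. Error"]

# Test statistic for H0: beta_1 = 0
t_stat <- b1 / se_b1

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_val <- 2 * pt(-abs(t_stat), df = df.residual(mouse_model))

c(t = t_stat, p = p_val)
```

The values agree with the “t value” and “Pr(>|t|)” entries in the Age row of the output.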

Shorter Lecture Notes for Stat 239
