2023-03-11

What is Statistical Hypothesis Testing?

  • Hypothesis testing is widely used in science. Researchers in fields including biology, economics, and engineering use this statistical method to evaluate experimental results and identify statistically significant findings.

  • Hypothesis testing uses statistics to investigate an idea about the world. It lets us make statistical inferences about a population from a sample of data.

  • The analysis relies only on the data available, so it must account for variability and sample size.

Steps to conduct a hypothesis test:

  • Formulate the null hypothesis (H0) and the alternative hypothesis (HA)
    • The null hypothesis is a statement of no effect: there is no relationship between the variables or groups being tested.
    • The alternative hypothesis is the claim we are looking for evidence of: there is a relationship or a significant difference between the variables or groups being tested.
  • Choose a significance level
    • The significance level is a probability threshold chosen before the test is run. The null hypothesis is rejected when the p-value falls below it. The significance level is usually set to 0.05.
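
To make the role of the significance level concrete, the following sketch simulates many experiments in which the null hypothesis is true by construction and counts how often a test at 0.05 rejects it anyway. The one-sample t-test, the sample size of 30, and the number of simulations are illustrative assumptions and are not part of the original slides.

```r
# Assumption: a one-sample t-test on normal data with true mean 0,
# so H0 (mean = 0) is true in every simulated experiment.
set.seed(42)
alpha <- 0.05
n_sims <- 2000

p_values <- replicate(n_sims, {
  x <- rnorm(30, mean = 0, sd = 1)
  t.test(x, mu = 0)$p.value
})

# Share of experiments that wrongly reject H0; it should land close to alpha.
false_positive_rate <- mean(p_values < alpha)
print(false_positive_rate)
```

The rejection rate should hover around the chosen significance level, which is what "a 5% chance of rejecting a true null hypothesis" means in practice.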

Steps to conduct a hypothesis test: (Cont.)

  • Calculate a test statistic by observing the sample data
    • A test statistic summarizes the sample data in a single number that measures how far the data depart from what the null hypothesis predicts.
  • Determine the p-value
    • The p-value is the probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true.
  • Compare the p-value to the significance level
    • If the p-value is less than the significance level, we reject the null hypothesis; the data provide evidence for the alternative hypothesis.
    • If the p-value is greater than the significance level, then we fail to reject the null hypothesis.
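
The last two steps can be sketched as a pair of small functions. Both `p_from_t` and `decide` are hypothetical helpers written for this illustration, and the sketch assumes the test statistic follows a t distribution under the null hypothesis; other tests use other reference distributions.

```r
# Hypothetical helper: convert a t statistic into a p-value.
p_from_t <- function(t_stat, df,
                     alternative = c("two.sided", "greater", "less")) {
  alternative <- match.arg(alternative)
  switch(alternative,
    two.sided = 2 * pt(abs(t_stat), df, lower.tail = FALSE),
    greater   = pt(t_stat, df, lower.tail = FALSE),
    less      = pt(t_stat, df, lower.tail = TRUE)
  )
}

# Hypothetical helper: compare the p-value to the significance level.
decide <- function(p_value, alpha = 0.05) {
  if (p_value < alpha) "reject H0" else "fail to reject H0"
}

# Illustrative values: t = 2.5 with 20 degrees of freedom.
p <- p_from_t(2.5, df = 20)
print(p)
print(decide(p))
```

Separating the p-value calculation from the decision keeps the choice of alternative (one-sided or two-sided) explicit, since that choice changes the p-value.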

Example: Violent Crime Rates by US State

  • Let’s use the USArrests dataset to examine the relationship between the percentage of the population living in urban areas and the number of assault arrests.
library(dplyr)  # provides select()
# Keep only the two variables of interest
arrests <- USArrests |>
  select(UrbanPop, Assault)
head(arrests)
##            UrbanPop Assault
## Alabama          58     236
## Alaska           48     263
## Arizona          80     294
## Arkansas         50     190
## California       91     276
## Colorado         78     204

Test Hypothesis for Assault Arrests

Step (1) H0: The percentage of the urban population does not affect the assault arrest rate in US states.
HA: The percentage of the urban population has a positive linear relationship with the assault arrest rate in US states.

Step (2) Our significance level for this test will be 0.05, meaning there is a 5% chance that we will reject the null hypothesis when it is true.

Step (3) Calculate the test statistic.

print(paste("Correlation Coefficient: ", round(correlation_coefficient, 2)))
## [1] "Correlation Coefficient:  0.26"
print(paste("Test Statistic: ", round(test_stat, 2)))
## [1] "Test Statistic:  1.86"
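
The slides print `correlation_coefficient` and `test_stat` without showing how they were computed. The sketch below is a reconstruction, not the author's code: it assumes Pearson's correlation and the usual t statistic for a correlation, \(t = r\sqrt{n-2}/\sqrt{1-r^2}\), which reproduces the printed values.

```r
# Assumption: Pearson correlation and its t statistic with n - 2 degrees of freedom.
x <- USArrests$UrbanPop
y <- USArrests$Assault
n <- nrow(USArrests)

correlation_coefficient <- cor(x, y)  # Pearson is the default method
test_stat <- correlation_coefficient * sqrt(n - 2) /
  sqrt(1 - correlation_coefficient^2)

print(paste("Correlation Coefficient: ", round(correlation_coefficient, 2)))
print(paste("Test Statistic: ", round(test_stat, 2)))
```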

Test Hypothesis for Assault Arrests (Cont.)

Step (4) Determine the P-Value

print(paste("P-Value: ", round(p_value, 2)))
## [1] "P-Value:  0.07"
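
The calculation behind `p_value` is likewise not shown. Assuming the same Pearson-based t statistic, a two-sided p-value from the t distribution with n - 2 degrees of freedom matches the printed 0.07, and the built-in `cor.test()` gives the same result. This is a sketch under that assumption.

```r
x <- USArrests$UrbanPop
y <- USArrests$Assault
n <- nrow(USArrests)

r <- cor(x, y)
test_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom.
p_value <- 2 * pt(abs(test_stat), df = n - 2, lower.tail = FALSE)
print(paste("P-Value: ", round(p_value, 2)))

# Cross-check against the built-in test, which is two-sided by default.
builtin <- cor.test(x, y, method = "pearson")
print(isTRUE(all.equal(p_value, builtin$p.value)))
```

For a directional alternative, `cor.test(x, y, alternative = "greater")` reports the one-sided p-value, which is half the two-sided value when the statistic is positive.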

Step (5) Since the p-value is greater than the significance level, we fail to reject the null hypothesis. This means that our sample did not provide enough evidence to support the alternative hypothesis.

The plot shows a weak positive correlation.
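
The plot itself did not survive in this text version. A minimal sketch of a comparable figure, assuming a ggplot2 scatter plot with a least-squares line (the original plotting code is not shown, and the axis labels are my own wording):

```r
library(ggplot2)

# Assumption: scatter plot of the two variables with a fitted straight line.
p <- ggplot(USArrests, aes(x = UrbanPop, y = Assault)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Urban population (%)", y = "Assault arrests per 100,000")
print(p)
```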

A teacher wanted to determine whether a new teaching method had an effect on students’ test scores.
This is an example of a strong positive correlation.

## [1] 0.9387595

Arrests Per US Region

  • The following plot groups the states by region and positions each state according to its crime rate.
  • We can use this plot to see which US regions have the most violent crime.
  • A potential hypothesis test could be:
    • H0: Crime rates do not differ across US regions.
    • HA: Crime rates vary among US regions, with the most crime in the South and the least in the North Central region.
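
One way such a test could be set up is sketched below. The slides do not name a test for this comparison, so the choices here are assumptions: the Kruskal-Wallis test stands in for "no difference across regions", the assault rate stands in for the crime rate, and regions come from R's built-in `state.region` vector.

```r
# Attach a region to each state using the built-in state.name / state.region vectors.
arrests_by_region <- data.frame(
  USArrests,
  Region = state.region[match(rownames(USArrests), state.name)]
)

# Median assault arrests per region, for a first look at the differences.
print(aggregate(Assault ~ Region, data = arrests_by_region, FUN = median))

# Assumption: Kruskal-Wallis test of H0 "assault rates do not differ by region".
region_test <- kruskal.test(Assault ~ Region, data = arrests_by_region)
print(region_test$p.value)
```

This test only addresses whether the regions differ at all. The directional part of HA (South highest, North Central lowest) would need follow-up pairwise comparisons.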

Test Statistics: Spearman’s Rank Correlation Coefficient

  • There are many different kinds of test statistics.

  • A suitable test statistic for the Violent Crimes dataset is Spearman’s rank correlation coefficient.

  • This statistic measures the strength of the monotonic association between two variables, based on the ranks of their values.

  • Formula to calculate Spearman’s Rank Correlation Coefficient (exact when there are no tied ranks): \(\rho = 1 - \frac{6\sum_{i=1}^n d_i^2}{n(n^2-1)}\)

  • where:

    • \(\rho\) is Spearman’s Rank correlation coefficient
    • \(d_i\) is the difference between the two ranks of observation \(i\)
    • \(n\) is the number of observations
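
The formula can be checked against R's built-in calculation. The data below are hypothetical and chosen so that neither variable has tied values, which is the case where the shortcut formula is exact.

```r
# Hypothetical data with no tied values in either variable.
x <- c(10, 20, 30, 40, 50, 60)
y <- c(12, 9, 25, 31, 28, 47)

d <- rank(x) - rank(y)  # difference in ranks for each observation
n <- length(x)

rho_manual <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))
rho_builtin <- cor(x, y, method = "spearman")

print(c(manual = rho_manual, builtin = rho_builtin))
```

When ties are present, `cor(..., method = "spearman")` computes Pearson's correlation on average ranks, and the shortcut formula only approximates it.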

Test Statistics: Paired T-Test

  • There are many different kinds of t-tests, including a one-sample, two-sample, and paired test.

  • In our test score example, a paired t-test can assess whether the mean difference between each student’s old and new scores differs from zero.

  • Formula to calculate a paired t-test: \(t = \frac{\bar{d}}{s_d/\sqrt{n}}\)

  • where:

    • \(\bar{d}\) is the mean of the differences between the paired values
    • \(s_d\) is the standard deviation of the differences
    • \(n\) is the number of pairs
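
A short sketch of the formula in use follows. The slides do not include the test score data, so the scores below are hypothetical values made up for illustration only. The manual calculation is compared with `t.test(..., paired = TRUE)`.

```r
# Hypothetical scores for the same eight students (illustrative only).
old_scores <- c(72, 65, 80, 58, 90, 77, 69, 84)
new_scores <- c(78, 70, 79, 66, 94, 85, 72, 88)

d <- new_scores - old_scores  # one difference per student
n <- length(d)

# Paired t statistic: mean difference over its standard error.
t_manual <- mean(d) / (sd(d) / sqrt(n))
builtin <- t.test(new_scores, old_scores, paired = TRUE)

print(c(manual = t_manual, builtin = unname(builtin$statistic)))
print(builtin$p.value)
```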

Most Violent Crimes by State

  • The following graph shows the crime rate of each state, with red marking the states with the most crime and yellow the states with the least. The data in the graph are arrests per 100,000 people.
  • The 3D plot can be used to show the association between the rates of the different crimes.
  • A potential hypothesis test could be:
    • H0: The assault rate is not associated with the murder and rape rates.
    • HA: The assault rate is positively associated with the murder and rape rates: where the assault rate is high, the murder and rape rates are also high.
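
A first look at this hypothesis could use pairwise correlations. The sketch below assumes one-sided Pearson correlation tests, which is one reasonable choice and is not specified in the slides.

```r
crimes <- USArrests[, c("Murder", "Assault", "Rape")]

# Pairwise Pearson correlations among the three crime rates.
print(round(cor(crimes), 2))

# Assumption: one-sided tests for a positive association with the assault rate.
assault_murder <- cor.test(crimes$Assault, crimes$Murder, alternative = "greater")
assault_rape <- cor.test(crimes$Assault, crimes$Rape, alternative = "greater")
print(c(murder = assault_murder$p.value, rape = assault_rape$p.value))
```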
