1.Introduction

Say you work for a major social media website. Your boss comes to you with two questions:

1.does the demographic of users on your site match the company’s expectation?

2.did the new interface update affect user engagement?

With terabytes of user data at your fingertips, you decide the best way to answer these questions is with statistical hypothesis tests!

Statistical hypothesis testing is a process that allows you to evaluate if a change or difference seen in a dataset is “real”, or if it’s just a result of random fluctuation in the data.

Hypothesis testing can be an integral component of any decision making process. It provides a framework for evaluating how confident one can be in making conclusions based on data. Some instances where this might come up include:

1.a professor expects an exam average to be roughly 75%, and wants to know if the actual scores line up with this expectation. Was the test actually too easy or too hard?

2.a product manager for a website wants to compare the time spent on different versions of a homepage. Does one version make users stay on the page significantly longer?

In this lesson, you will cover the fundamental concepts that will help you run and evaluate hypothesis tests:

1.Sample and Population Mean

2.P-Values

3.Significance Level

4.Type I and Type II Errors

You will then learn about three different hypothesis tests you can perform to answer the kinds of questions discussed above:

1.One Sample T-Test

2.Two Sample T-Test

3.ANOVA (Analysis of Variance)

Let’s get started!

Instructions

The code in notebook.Rmd performs a hypothesis test on data for a company BuyPie.com. The test evaluates whether the time spent per visitor on the website changes significantly between two weeks.

Read the output at the bottom of the rendered notebook. Do you think there is a difference in time spent per visitor between Week 1 and Week 2?

By the end of the lesson, you will be able to perform and interpret such hypothesis tests yourself!

# load data
load("week_1.Rda")
load("week_2.Rda")
# calculate week_1_mean and week_2_mean:
week_1_mean <- mean(week_1)
week_1_mean
week_2_mean <- mean(week_2)
week_2_mean

[1] 25.44806

[1] 29.02157

# calculate week_1_sd and week_2_sd:
week_1_sd <- sd(week_1)
week_1_sd
week_2_sd <- sd(week_2)
week_2_sd

[1] 4.577702

[1] 5.553785

# run two sample t-test:
results <- t.test(week_1,week_2)
results

    Welch Two Sample t-test

data:  week_1 and week_2
t = -3.5109, df = 94.554, p-value = 0.0006863
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -5.594299 -1.552718
sample estimates:
mean of x mean of y 
 25.44806  29.02157 

2.Sample Mean and Population Mean - I

Suppose you want to know the average height of an oak tree in your local park. On Monday, you measure 10 trees and get an average height of 32 ft. On Tuesday, you measure 12 different trees and reach an average height of 35 ft. On Wednesday, you measure the remaining 11 trees in the park, whose average height is 31 ft. The average height for all 33 trees in your local park is 32.8 ft.

The collections of individual height measurements taken on Monday, Tuesday, and Wednesday are each called a sample. A sample is a subset of the entire population (all the oak trees in the park). The mean of each sample is a sample mean and it is an estimate of the population mean.

Note: the sample means (32 ft., 35 ft., and 31 ft.) were all close to the population mean (32.8 ft.), but were all slightly different from the population mean and from each other.
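
The park-wide figure can be checked from the three daily samples. As a minimal sketch (the vector names are illustrative and not part of the lesson), the population mean is the average of the sample means, weighted by how many trees were measured each day:

```r
# Daily sample means (ft) and the number of trees measured each day
sample_means <- c(monday = 32, tuesday = 35, wednesday = 31)
sample_sizes <- c(monday = 10, tuesday = 12, wednesday = 11)

# The three samples together cover all 33 trees, so the size-weighted
# average of the sample means equals the population mean
population_mean <- weighted.mean(sample_means, sample_sizes)
round(population_mean, 1)
```

A plain `mean(sample_means)` would weight each day equally and give a slightly different number, because the days did not cover the same number of trees.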

For a population, the mean is a constant value no matter how many times it’s recalculated. But with a set of samples, the mean will depend on exactly which samples are selected. From a sample mean, we can then extrapolate the mean of the population as a whole. There are three main reasons we might use sampling:

1.data on the entire population is not available

2.data on the entire population is available, but it is so large that it is unfeasible to analyze

3.meaningful answers to questions can be found faster with sampling

Instructions

1.In the workspace, we’ve generated a random population of size 300 that follows a normal distribution with a mean of 65. Update the value of population_mean to store the mean() of population. Does it closely match your expectation?

# generate random population
population <- rnorm(300, mean=65, sd=3.5)

# calculate population mean here:
population_mean <- mean(population)
population_mean
[1] 64.90532

2.Let’s look at how the means of different samples can vary within the same population.

The code in the notebook generates 5 random samples from population. sample_1 is displayed and sample_1_mean has been calculated.

Replace the “Not calculated” strings with calculations of the means for sample_2, sample_3, sample_4, and sample_5.

Look at the population mean and the sample means. Are they all the same? All different? Why?

# generate sample 1
sample_1 <- sample(population, size=30)
sample_1
 [1] 67.54371 70.74646 58.92483 66.32206 70.76844 61.03942 61.92440 65.01594 66.31625 66.64947 67.29845 70.28490
[13] 61.20255 63.25246 62.48976 65.40590 65.50519 58.48546 64.96503 63.32665 63.64597 65.27635 64.68773 64.14368
[25] 65.83720 60.94812 56.08455 64.11198 67.19988 64.40878
# calculate sample 1 mean
sample_1_mean <- mean(sample_1)
sample_1_mean
[1] 64.46039
# generate samples 2,3,4 and 5
sample_2 <- sample(population, size=30)
sample_3 <- sample(population, size=30)
sample_4 <- sample(population, size=30)
sample_5 <- sample(population, size=30)
# calculate sample means here:
sample_2_mean <- mean(sample_2)
sample_2_mean
[1] 65.82616
sample_3_mean <- mean(sample_3)
sample_3_mean
[1] 64.76717
sample_4_mean <- mean(sample_4)
sample_4_mean
[1] 65.69348
sample_5_mean <- mean(sample_5)
sample_5_mean
[1] 64.93156

3.Sample Mean and Population Mean - II

In the previous exercise, the sample means you calculated closely approximated the population mean. This won’t always be the case!

Consider a tailor of school uniforms at a school for students aged 11 to 13. The tailor needs to know the average height of all the students in order to know which sizes to make the uniforms.

The tailor measures the heights of a random sample of 20 students out of the 300 in the school. The average height of the sample is 57.5 inches. Using this sample mean, the tailor makes uniforms that fit students of this height, some smaller, and some larger.

After delivering the uniforms, the tailor starts to receive some feedback — many of the uniforms are too small! They go back to take measurements on the rest of the students, collecting the following data:

1.11 year olds average height: 56.7 inches

2.12 year olds average height: 59 inches

3.13 year olds average height: 62.8 inches

4.All students average height (population mean): 59.5 inches

The original sample mean was off from the population mean by 2 inches! How did this happen?

The random sample of 20 students was skewed to one direction of the total population. More 11 year olds were chosen in the sample than is representative of the whole school, bringing down the average height of the sample. This is called a sampling error, and occurs when a sample is not representative of the population it comes from. How do you get an average sample height that looks more like the average population height, and reduce the chance of a sampling error?

Selecting only 20 students for the sample allowed for the chance that only younger, shorter students were included. This is a natural consequence of the fact that a sample has less data than the population to which it belongs. If the sample selection is poor, then you will have a sample mean seriously skewed from the population mean.

There is one surefire way to mitigate the risk of having a skewed sample mean — take a larger set of samples! The sample mean of a larger sample set will more closely approximate the population mean, and reduce the chance of a sampling error.
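
The effect of sample size can be explored with a small simulation. This is only a sketch under stated assumptions: the population mirrors the one generated in the previous exercise, while the seed, the sample sizes, and the helper `mean_abs_error()` are hypothetical choices made for illustration.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable

# Population shaped like the one in the previous exercise
population <- rnorm(300, mean = 65, sd = 3.5)
population_mean <- mean(population)

# Average distance between a sample mean and the population mean,
# taken over many repeated samples of the same size
mean_abs_error <- function(n, reps = 1000) {
  sample_means <- replicate(reps, mean(sample(population, size = n)))
  mean(abs(sample_means - population_mean))
}

sample_sizes <- c(10, 50, 150)
errors <- sapply(sample_sizes, mean_abs_error)
data.frame(sample_size = sample_sizes, mean_abs_error = errors)
```

The typical gap between sample mean and population mean should shrink as the sample size grows, which is the pattern the paragraph above describes.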

Instructions

In the workspace, we have a population that is normally distributed. Generate samples of different sizes and see how the sample mean could differ from the population mean.

What happens to the difference between the sample mean and the population mean as you increase the sample size?

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo1.png")

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo2.png")

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo3.png")

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo4.png")

4.Hypothesis Formulation

You begin the statistical hypothesis testing process by defining a hypothesis, or an assumption about your population that you want to test. A hypothesis can be written in words, but can also be explained in terms of the sample and population means you just learned about.

Say you are developing a website and want to compare the time spent on different versions of a homepage. You could run a hypothesis test to see if version A or B makes users stay on the page significantly longer. Your hypothesis might be:

“The average time spent on homepage A is greater than the average time spent on homepage B.”

While this is a fine hypothesis to make, data analysts are often very hesitant people. They don’t like to make bold claims without having data to back them up! Thus when constructing hypotheses for a hypothesis test, you want to formulate a null hypothesis. A null hypothesis states that there is no difference between the populations you are comparing, and it implies that any difference seen in the sample data is due to sampling error. A null hypothesis for the same scenario is as follows:

“The average time spent on homepage A is the same as the average time spent on homepage B.”

You could also restate this in terms of population mean:

“The population mean of time spent on homepage A is the same as the population mean of time spent on homepage B.”

After collecting some sample data on how users interact with each homepage, you can then run a hypothesis test on that data to decide whether the null hypothesis can be rejected (i.e. there is a difference in time spent on homepage A and homepage B) or whether the data do not provide enough evidence to reject it.

Instructions

1.A researcher at a pharmaceutical company is working on the development of a new medication to lower blood pressure, DeePressurize. They run an experiment with a control group of 100 patients that receive a placebo (a sugar pill), and an experimental group of 100 patients that receive DeePressurize. Blood pressure measurements are taken after a 3 month period on both groups of patients.

The researcher wants to run a hypothesis test to compare the resulting datasets. Two hypotheses, hypo_a and hypo_b, are given in notebook.Rmd. Which could be a null hypothesis for comparing the two sets of data? Update the value of null_hypo_1 to the string “hypo_a” or “hypo_b” based on your answer.

# experiment 1
hypo_a <- "DeePressurize lowers blood pressure in patients."
hypo_b <- "DeePressurize has no effect on blood pressure in patients."
null_hypo_1 <- "hypo_b"
null_hypo_1
[1] "hypo_b"

2.A product manager at a dating app company is developing a new user profile page with a different picture layout. They want to see if the new layout results in more matches between users than the current layout. 50% of profiles are updated to the new layout, and over a 1 month period the number of matches for users with the new layout and the original layout are recorded.

The product manager wants to run a hypothesis test to compare the resulting datasets. Two hypotheses, hypo_c and hypo_d, are given in notebook.Rmd. Which could be a null hypothesis for comparing the two sets of data? Update the value of null_hypo_2 to the string “hypo_c” or “hypo_d” based on your answer.

# experiment 2
hypo_c <- "The new profile layout has no effect on number of matches with other users."
hypo_d <- "The new profile layout results in more matches with other users than the original layout."
null_hypo_2 <- "hypo_c"
null_hypo_2
[1] "hypo_c"

5.Designing an Experiment

Suppose you want to know if students who study history are more interested in volleyball than students who study chemistry. Before doing anything else to answer your original question, you come up with a null hypothesis: “History and chemistry students are interested in volleyball at the same rates.”

To test this hypothesis, you need to design an experiment and collect data. You invite 100 history majors and 100 chemistry majors from your university to join an extracurricular volleyball team. After one week, 34 history majors sign up (34%), and 39 chemistry majors sign up (39%). More chemistry majors than history majors signed up, but is this a “real”, or significant difference? Can you conclude that students who study chemistry are more interested in volleyball than students who study history?

In your experiment, the 100 history and 100 chemistry majors at your university are samples of their respective populations (all history and chemistry majors). The sample means are the percentages of history majors (34%) and chemistry majors (39%) that signed up for the team, and the difference in sample means is 39% - 34% = 5%. The population means are the percentage of history and chemistry majors worldwide that would sign up for an extracurricular volleyball team if given the chance.

You want to know if the difference you observed in these sample means (5%) reflects a difference in the population means, or if the difference was caused by sampling error, and the samples of students you chose do not represent the greater populations of history and chemistry students.

Restating the null hypothesis in terms of the population means yields the following:

“The percentage of all history majors who would sign up for volleyball is the same as the percentage of all chemistry majors who would sign up for volleyball, and the observed difference in sample means is due to sampling error.”

This is the same as saying, “If you gave the same volleyball invitation to every history and chemistry major in the world, they would sign up at the same rate, and the sample of 200 students you selected are not representative of their populations.”

Instructions

1.Your friend is a dog walker who specializes in working with Golden Retrievers and Goldendoodles. They are interested in knowing if there is a significant difference in the lengths of the two breeds. After a few weeks of data collection, they give you a spreadsheet of 10 Golden Retrievers’ lengths and 10 Goldendoodles’ lengths.

The lengths of the dogs are given in retriever_lengths and doodle_lengths. Calculate the mean of each breed and save the results to mean_retriever_l and mean_doodle_l. View mean_retriever_l and mean_doodle_l.

# load data
load("retriever_lengths.Rda")
load("doodle_lengths.Rda")
# calculate mean_retriever_l and mean_doodle_l here:
mean_retriever_l <- mean(retriever_lengths)
mean_retriever_l
mean_doodle_l <- mean(doodle_lengths)
mean_doodle_l

[1] 23

[1] 20.5

2.Calculate the difference between mean_retriever_l and mean_doodle_l and save the result to mean_difference. View mean_difference.

# calculate mean_difference here:
mean_difference <- mean_retriever_l - mean_doodle_l
mean_difference

[1] 2.5

3.You want to run a hypothesis test to see if there is a significant difference in the lengths of Golden Retrievers and Goldendoodles. Which of the two statements could be a formulation of the null hypothesis?

Update the value of null_hypo with “st_1” or “st_2” depending on your answer.

# statements:
st_1 <- "The average length of Golden Retrievers is 2.5 inches longer than the average length of Goldendoodles."
st_2 <- "The average length of Golden Retrievers is the same as the average length of Goldendoodles."

# update null_hypo here:
null_hypo <- "st_2"
null_hypo
[1] "st_2"

6.Type I and Type II Errors

When using automated processes to make decisions, you need to be aware of how this automation can lead to mistakes. Computer programs can be as fallible as the humans who design them. Because of this, there is a responsibility to understand what can go wrong and what can be done to contain these foreseeable problems.

In statistical hypothesis testing, there are two types of error. A Type I error occurs when a hypothesis test finds a correlation between things that are not related. This error is sometimes called a “false positive” and occurs when the null hypothesis is rejected even though it is true.

For example, consider the history and chemistry major experiment from the previous exercise. Say you run a hypothesis test on the sample data you collected and conclude that there is a significant difference in interest in volleyball between history and chemistry majors. You have rejected the null hypothesis that there is no difference between the two populations of students. If, in reality, your results were due to the groups you happened to pick (sampling error), and there actually is no significant difference in interest in volleyball between history and chemistry majors in the greater population, you have become the victim of a false positive, or a Type I error.

The second kind of error, a Type II error, is failing to find a correlation between things that are actually related. This error is referred to as a “false negative” and occurs when the null hypothesis is not rejected even though it is false.

For example, with the history and chemistry student experiment, say that after you perform the hypothesis test, you conclude that there is no significant difference in interest in volleyball between history and chemistry majors. You did not reject the null hypothesis. If there actually is a difference in the populations as a whole, and there is a significant difference in interest in volleyball between history and chemistry majors, your test has resulted in a false negative, or a Type II error.
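
The two error types can be summarized as a lookup on two facts: whether the null hypothesis is actually true, and whether the test rejected it. The helper below is a hypothetical illustration (`classify_outcome()` is not part of the lesson), and in a real study the truth about the null hypothesis is of course unknown.

```r
# Label the outcome of a test, given the (normally unknown) truth about
# the null hypothesis and the decision the test led to
classify_outcome <- function(null_is_true, null_rejected) {
  if (null_is_true && null_rejected) {
    "Type I error (false positive)"
  } else if (!null_is_true && !null_rejected) {
    "Type II error (false negative)"
  } else {
    "correct decision"
  }
}

classify_outcome(null_is_true = TRUE,  null_rejected = TRUE)
classify_outcome(null_is_true = FALSE, null_rejected = FALSE)
classify_outcome(null_is_true = TRUE,  null_rejected = FALSE)
```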

Instructions

# the true positives and negatives:
actual_positive <- c(2, 5, 6, 7, 8, 10, 18, 21, 24, 25, 29, 30, 32, 33, 38, 39, 42, 44, 45, 47)
actual_negative <- c(1, 3, 4, 9, 11, 12, 13, 14, 15, 16, 17, 19, 20, 22, 23, 26, 27, 28, 31, 34, 35, 36, 37, 40, 41, 43, 46, 48, 49)

# the positives and negatives we determine by running the experiment:
experimental_positive <- c(2, 4, 5, 7, 8, 9, 10, 11, 13, 15, 16, 17, 18, 19, 20, 21, 22, 24, 26, 27, 28, 32, 35, 36, 38, 39, 40, 45, 46, 49)
experimental_negative <- c(1, 3, 6, 12, 14, 23, 25, 29, 30, 31, 33, 34, 37, 41, 42, 43, 44, 47, 48)
# define type_i_errors and type_ii_errors here:
type_i_errors <- intersect(actual_negative, experimental_positive)
print('false positives')
[1] "false positives"
type_i_errors
 [1]  4  9 11 13 15 16 17 19 20 22 26 27 28 35 36 40 46 49

2.Now, define type_ii_errors, the list representing the false negatives of the experiment.

type_ii_errors <- intersect(actual_positive, experimental_negative)
print('false negatives')
[1] "false negatives"
type_ii_errors
[1]  6 25 29 30 33 42 44 47

7.P-Values

You know that a hypothesis test is used to determine the validity of a null hypothesis. Once again, the null hypothesis states that there is no actual difference between the two populations of data. But what result does a hypothesis test actually return, and how can you interpret it?

A hypothesis test returns a few numeric measures, most of which are out of the scope of this introductory lesson. Here we will focus on one: p-values. P-values help determine how confident you can be in validating the null hypothesis. In this context, a p-value is the probability that, assuming the null hypothesis is true, you would see at least such a difference in the sample means of your data.

Consider the experiment on history and chemistry majors and their interest in volleyball from a previous exercise:

Null Hypothesis: “History and chemistry students are interested in volleyball at the same rates”

Experiment Sample Means: 34% of history majors and 39% of chemistry majors sign up for the volleyball team

Assuming the null hypothesis is true, there is no actual difference in preference for volleyball between all history and chemistry majors, and any difference present in the experiment data is the result of sampling error. Imagine you run a hypothesis test on this experiment data and it returns a p-value of 0.04. A p-value of 0.04 indicates that, if the null hypothesis were true, you would expect to see a difference of at least 5% (calculated as 39% - 34% = 5%) in the sample means only 4% of the time.

Essentially, if you ran this same experiment 100 times, you would expect to see as large a difference in the sample means only 4 times given the assumption that there is no actual difference between the populations (i.e. they have the same mean).
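
The idea of "repeating the experiment many times under the null hypothesis" can be sketched with a simulation. Everything below is an assumption made for illustration: the shared sign-up rate of 36.5% (the pooled rate of the two samples), the seed, and the number of repetitions. The 0.04 above is an imagined test result, so the simulated fraction is not expected to match it.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

n_students  <- 100     # students invited per major, as in the experiment
shared_rate <- 0.365   # ASSUMPTION: common sign-up rate under the null
n_repeats   <- 10000   # number of simulated repetitions of the experiment

# Simulated sign-up counts when both majors share the same true rate
history_signups   <- rbinom(n_repeats, size = n_students, prob = shared_rate)
chemistry_signups <- rbinom(n_repeats, size = n_students, prob = shared_rate)

# Fraction of repetitions whose gap is at least as large as the observed
# 39 - 34 = 5 sign-ups, in either direction
observed_gap <- 39 - 34
simulated_p <- mean(abs(chemistry_signups - history_signups) >= observed_gap)
simulated_p
```

Counts are compared rather than percentages so that the `>=` check is not affected by floating-point rounding.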

Seems like a really small probability, right? Are you thinking about rejecting the null hypothesis you originally stated?

p-value ≠ probability that your result is wrong

p-value = probability of seeing your data (or something more extreme) if the null hypothesis is true

The p-value itself is not the probability that the null hypothesis is wrong.

In practice, though, a very small p-value means the observed data would be surprising if the null hypothesis were true, so it is treated as evidence against the null hypothesis.

Instructions

1.You are a big fan of apples, so you gather 10 green and 10 red apples to compare their weights. The green apples average 150 grams in weight, and the red apples average 160 grams in weight.

You run a hypothesis test to see if there is a significant difference in the weight of green and red apples. The test returns a p-value of 0.2. Which statement (st_1, st_2, st_3, or st_4) indicates how this p-value can be interpreted?

Update the value of interpretation with the string “st_1”, “st_2”, “st_3”, or “st_4” depending on your answer.

# possible interpretations
st_1 <- "There is a 20% chance that the difference in average weight of green and red apples is due to random sampling."
st_2 <- "There is a 20% chance that green and red apples have the same average weight."
st_3 <- "There is a 20% chance red apples weigh more than green apples."
st_4 <- "There is a 20% chance green apples weigh more than red apples."

# update the value of interpretation here:
interpretation <- "st_1"
interpretation
[1] "st_1"

A p-value of 0.2 means:

If green and red apples really weigh the same on average, there is a 20% chance of seeing a difference of 10 grams (or more) just due to random variation.

8.Significance Level

While a hypothesis test will return a p-value indicating a level of confidence in the null hypothesis, it does not definitively claim whether you should reject the null hypothesis. To make this decision, you need to determine a threshold p-value for which all p-values below it will result in rejecting the null hypothesis. This threshold is known as the significance level.

A higher significance level is more likely to give a false positive, as it makes it “easier” to state that there is a difference in the populations of your data when such a difference might not actually exist. If you want to be very sure that the result is not due to sampling error, you should select a very small significance level.
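
The link between the significance level and false positives can be checked empirically. The sketch below assumes a setup that is not in the lesson: 2000 simulated experiments in which the null hypothesis is true by construction, each comparing two samples of 30 drawn from the same normal population.

```r
set.seed(7)  # arbitrary seed so the sketch is repeatable

# Run many experiments in which the null hypothesis is TRUE:
# both samples come from the same normal population
p_values <- replicate(2000, {
  group_a <- rnorm(30, mean = 65, sd = 3.5)
  group_b <- rnorm(30, mean = 65, sd = 3.5)
  t.test(group_a, group_b)$p.value
})

# Share of experiments that would (wrongly) reject the null at each level
significance_levels <- c(0.01, 0.05, 0.10)
false_positive_rate <- sapply(significance_levels,
                              function(alpha) mean(p_values < alpha))
data.frame(significance_level = significance_levels, false_positive_rate)
```

Each false positive rate should land close to its significance level, so raising the threshold directly raises the share of Type I errors.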

It is important to choose the significance level before you perform a statistical hypothesis test. If you wait until after you receive a p-value from a test, you might pick a significance level such that you get the result you want to see. For instance, if someone is trying to publish the results of their scientific study in a journal, they might set a higher significance level that makes their results appear statistically significant. Choosing a significance level in advance helps keep everyone honest.

It is an industry standard to set a significance level of 0.05 or less, meaning that if the null hypothesis is true, there is a 5% or smaller chance of rejecting it because of sampling error alone.

Recall that the p-value is the probability of getting results at least as extreme as the observed ones just by random chance, assuming the null hypothesis is true.

Instructions

1.Before you run a hypothesis test on a set of data, you set your significance level to 0.05. The hypothesis test then returns a p-value of 0.1. Can you reject the null hypothesis? Update the value of reject_hypothesis to TRUE or FALSE depending on your answer.

The p-value of 0.1 is greater than the significance level of 0.05, so you cannot reject the null hypothesis.

# update reject_hypothesis here:
reject_hypothesis <- FALSE
reject_hypothesis
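
The decision rule used in this exercise can be written as a one-line comparison. `reject_null()` is a hypothetical helper introduced here for illustration; it is not a function from R or from the lesson.

```r
# Hypothetical helper: reject the null only when the p-value is below
# the significance level chosen before running the test
reject_null <- function(p_value, significance_level = 0.05) {
  p_value < significance_level
}

reject_null(0.1)    # the exercise above: FALSE
reject_null(0.04)   # TRUE
```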

9.One Sample T-Test

Consider the fictional business BuyPie, which sends ingredients for pies to your household so that you can make them from scratch. Suppose that a product manager hypothesizes the average age of visitors to BuyPie.com is 30. In the past hour, the website had 100 visitors and the average age was 31. Are the visitors older than expected? Or is this just the result of chance (sampling error) and a small sample size?

You can test this using a One Sample T-Test. A One Sample T-Test compares a sample mean to a hypothetical population mean. It answers the question “What is the probability that the sample came from a distribution with the desired mean?”

The first step is formulating a null hypothesis, which again is the hypothesis that there is no difference between the populations you are comparing. The second population in a One Sample T-Test is the hypothetical population you choose. The null hypothesis that this test examines can be phrased as follows: “The set of samples belongs to a population with the target mean”.

One result of a One Sample T-Test will be a p-value, which tells you whether or not you can reject this null hypothesis. If the p-value you receive is less than your significance level, normally 0.05, you can reject the null hypothesis and state that there is a significant difference.

R has a function called t.test() in the stats package which can perform a One Sample T-Test for you.

t.test() requires two arguments, a distribution of values and an expected mean:

results <- t.test(sample_distribution, mu = expected_mean)

1.sample_distribution is the sample of values that were collected

2.mu is an argument indicating the desired mean of the hypothetical population

3.expected_mean is the value of the desired mean

t.test() will return, among other information we will not cover here, a p-value — this tells you how confident you can be that the sample of values came from a distribution with the specified mean.

P-values give you an idea of how confident you can be in a result. Just because you don’t have enough data to detect a difference doesn’t mean that there isn’t one. Generally, the more samples you have, the smaller a difference you can detect.

Instructions

1.We have provided a small dataset called ages, representing the ages of visitors to BuyPie.com in the past hour, in notebook.Rmd.

Even with a small dataset like this, it is hard to make judgments from just looking at the numbers.

To understand the data better, let’s look at the mean. Calculate the mean of ages, and store the result in a variable called ages_mean. View ages_mean.

# load and view data
ages <- c(32, 34, 29, 29, 22, 39, 38, 37, 38, 36, 30, 26, 22, 22)
ages
 [1] 32 34 29 29 22 39 38 37 38 36 30 26 22 22
# calculate ages_mean here:
ages_mean <- mean(ages)
ages_mean
[1] 31

2.Use the t.test() function with ages to see what p-value the experiment returns for this distribution, where we expect the mean to be 30.

Store the results of the test in a variable called results.

Does the p-value you got with the One Sample T-Test make sense, knowing the mean of ages?

# perform t-test here:
results <- t.test(ages, mu = 30)
results

    One Sample t-test

data:  ages
t = 0.59738, df = 13, p-value = 0.5605
alternative hypothesis: true mean is not equal to 30
95 percent confidence interval:
 27.38359 34.61641
sample estimates:
mean of x 
       31 

If the true mean is really 30 (null hypothesis is true), then there is about a 56% chance you’d see a sample mean as far from 30 as 31 (or even farther) just by chance.

It means this observed difference (mean of 31 instead of 30) is completely plausible as random sampling error.

So you fail to reject the null hypothesis.

The data is consistent with the population mean being 30.
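
To see where the numbers in the test output come from, the t statistic and p-value can be recomputed by hand. This is a sketch of the standard one-sample formula rather than a description of how t.test() is implemented internally; the variable names are illustrative.

```r
ages <- c(32, 34, 29, 29, 22, 39, 38, 37, 38, 36, 30, 26, 22, 22)
expected_mean <- 30

# t statistic: distance of the sample mean from the expected mean,
# measured in standard errors
standard_error <- sd(ages) / sqrt(length(ages))
t_statistic <- (mean(ages) - expected_mean) / standard_error

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
degrees_of_freedom <- length(ages) - 1
p_value <- 2 * pt(-abs(t_statistic), df = degrees_of_freedom)

c(t = t_statistic, df = degrees_of_freedom, p_value = p_value)

# The same p-value can be read directly off the test object
t.test(ages, mu = expected_mean)$p.value
```

Reading `$p.value` from the result is handy when the p-value has to be compared with a significance level in code instead of by eye.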

10.Two Sample T-Test

Suppose that last week, the average amount of time spent per visitor to a website was 25 minutes. This week, the average amount of time spent per visitor to a website was 29 minutes. Did the average time spent per visitor change (i.e. was there a statistically significant bump in user time on the site)? Or is this just part of natural fluctuations?

One way of testing whether this difference is significant is by using a Two Sample T-Test. A Two Sample T-Test compares two sets of data, which are both approximately normally distributed.

The null hypothesis, in this case, is that the two distributions have the same mean.

You can use R’s t.test() function to perform a Two Sample T-Test, as shown below:

results <- t.test(distribution_1, distribution_2)

When performing a Two Sample T-Test, t.test() takes two distributions as arguments and returns, among other information, a p-value.

Remember, the p-value lets you know the probability of seeing a difference in the means at least this large by chance (sampling error) alone, assuming the null hypothesis is true.

Instructions

1.We’ve created two distributions representing the time spent per visitor to BuyPie.com last week, week_1, and the time spent per visitor to BuyPie.com this week, week_2.

Find the means of these two distributions. Store them in week_1_mean and week_2_mean. View both means.

# load data
week_1 <- c(23.90507, 26.67632, 27.27434, 24.25757, 32.40423, 39.56919, 23.07010, 29.82068, 27.59434, 28.05640, 27.06757, 30.41193, 25.71359, 24.94295, 28.23124, 24.95338, 18.51232, 27.46235, 28.38017, 13.91206, 29.02616, 26.90747, 22.86777, 24.89383, 25.96948, 26.86870, 20.72676, 27.35988, 20.68409, 21.19846, 16.25801, 23.92518, 24.47923, 29.47051, 27.28425, 26.93339, 28.61027, 18.88377, 33.65469, 25.69470, 20.98291, 22.69700, 28.60279, 21.36000, 30.77685, 20.83416, 23.79367, 19.75567, 29.54421, 20.14331)
week_1
 [1] 23.90507 26.67632 27.27434 24.25757 32.40423 39.56919 23.07010 29.82068 27.59434 28.05640 27.06757 30.41193
[13] 25.71359 24.94295 28.23124 24.95338 18.51232 27.46235 28.38017 13.91206 29.02616 26.90747 22.86777 24.89383
[25] 25.96948 26.86870 20.72676 27.35988 20.68409 21.19846 16.25801 23.92518 24.47923 29.47051 27.28425 26.93339
[37] 28.61027 18.88377 33.65469 25.69470 20.98291 22.69700 28.60279 21.36000 30.77685 20.83416 23.79367 19.75567
[49] 29.54421 20.14331
week_2 <- c( 18.63432, 31.28788, 34.96798, 21.81678, 28.21620, 39.39314, 35.52223, 27.54222, 33.64395, 25.31674, 28.81392, 30.73580, 26.37242, 26.09456, 26.34073, 19.42196, 32.58798, 24.84002, 28.93348, 20.43668, 22.72496, 32.31728, 35.38431, 29.66710, 24.53513, 30.91406, 19.56118, 24.90817, 30.13164, 31.47466, 27.77684, 16.51307, 35.07702, 31.74818, 36.36053, 27.70501, 29.49870, 27.65575, 37.18504, 25.16055, 29.26554, 38.22163, 28.92102, 24.82154, 38.30155, 34.76021, 22.26869, 28.82594, 32.00975, 36.46438)
week_2
 [1] 18.63432 31.28788 34.96798 21.81678 28.21620 39.39314 35.52223 27.54222 33.64395 25.31674 28.81392 30.73580
[13] 26.37242 26.09456 26.34073 19.42196 32.58798 24.84002 28.93348 20.43668 22.72496 32.31728 35.38431 29.66710
[25] 24.53513 30.91406 19.56118 24.90817 30.13164 31.47466 27.77684 16.51307 35.07702 31.74818 36.36053 27.70501
[37] 29.49870 27.65575 37.18504 25.16055 29.26554 38.22163 28.92102 24.82154 38.30155 34.76021 22.26869 28.82594
[49] 32.00975 36.46438
# calculate week_1_mean and week_2_mean here:
week_1_mean <- mean(week_1)
week_1_mean
[1] 25.44806
week_2_mean <- mean(week_2)
week_2_mean
[1] 29.02157

2.Find the standard deviations of these two distributions. Store them in week_1_sd and week_2_sd. View both standard deviations.

# calculate week_1_sd and week_2_sd here:
week_1_sd <- sd(week_1)
week_1_sd
[1] 4.577702
week_2_sd <- sd(week_2)
week_2_sd
[1] 5.553785

3.Run a Two Sample T-Test using the t.test() function.

Save the results to a variable called results and view it. Does the p-value make sense, knowing what you know about these datasets?

# run two sample t-test here:
results <- t.test(week_1, week_2)
results

Assuming there is no real difference in average time spent between the two weeks (null hypothesis), the probability of seeing a difference as large as ~3.57 minutes (or more extreme) just by random sampling is about 0.07%.

There is strong evidence that the average time per visitor changed between week_1 and week_2.
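To see where a p-value of this size comes from, the test statistic can be recomputed by hand from the summary statistics printed above. This is only a sketch of the Welch calculation, not a replacement for t.test(). It assumes 50 observations per week (the length of each vector above), and the names n, var_1, var_2, std_error, t_stat, df_welch, and p_value are illustrative.

```r
# Summary statistics copied from the output above
week_1_mean <- 25.44806
week_2_mean <- 29.02157
week_1_sd <- 4.577702
week_2_sd <- 5.553785
n <- 50  # assumed number of observations in each week

# Variance of each sample mean
var_1 <- week_1_sd^2 / n
var_2 <- week_2_sd^2 / n

# Standard error of the difference in means (variances not pooled)
std_error <- sqrt(var_1 + var_2)

# Welch t statistic
t_stat <- (week_1_mean - week_2_mean) / std_error

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (var_1 + var_2)^2 / (var_1^2 / (n - 1) + var_2^2 / (n - 1))

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

t_stat
df_welch
p_value
```

The p-value computed this way should land a little under 0.001, in line with the roughly 0.07% quoted above.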

11.Dangers of Multiple T-Tests

Suppose that you own a chain of stores that sell ants, called VeryAnts. There are three different locations: A, B, and C. You want to know if the average ant sales over the past year are significantly different between the three locations.

At first, it seems that you could perform T-tests between each pair of stores.

You know that the significance level is the probability of incorrectly rejecting a true null hypothesis on any single t-test. The more t-tests you perform, the more likely you are to get at least one false positive, a Type I error.

For a significance level of 0.05, if the null hypothesis is true, then the probability of correctly getting a non-significant result is 1 – 0.05 = 0.95. When you run another t-test, the probability that both results are correct is 0.95 * 0.95, or 0.9025 (treating the tests as independent). That means your probability of making at least one error is now close to 10%! This error probability only grows as you run more t-tests.
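The arithmetic above generalizes to any number of tests. The sketch below assumes the tests are independent and each uses the same significance level. The helper name familywise_error is hypothetical, not a function from any package.

```r
# Probability of at least one Type I error across k independent tests,
# each run at significance level alpha (hypothetical helper)
familywise_error <- function(k, alpha = 0.05) {
  1 - (1 - alpha)^k
}

# Error probability for 1 through 10 tests
error_by_tests <- familywise_error(1:10)
round(error_by_tests, 4)

# Number of pairwise comparisons needed for a given number of groups
choose(3, 2)  # three stores
choose(5, 2)  # five stores
```

Because the number of pairwise comparisons grows quickly with the number of groups, the chance of at least one false positive grows quickly too.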

Instructions

1.We have created samples store_a, store_b, and store_c, representing the sales at VeryAnts at locations A, B, and C, respectively. We want to see if there’s a significant difference in sales between the three locations.

Explore datasets store_a, store_b, and store_c by finding and viewing the means and standard deviations of each one. Store the means in variables called store_a_mean, store_b_mean, and store_c_mean. Store the standard deviations in variables called store_a_sd, store_b_sd, and store_c_sd.

# load data
load("store_a.Rda")
load("store_b.Rda")
load("store_c.Rda")
# calculate means here:
store_a_mean <- mean(store_a)
store_a_mean
store_b_mean <- mean(store_b)
store_b_mean
store_c_mean <- mean(store_c)
store_c_mean

[1] 58.34964

[1] 65.62629

[1] 62.36117

# calculate standard deviations here:
store_a_sd <- sd(store_a)
store_a_sd
store_b_sd <- sd(store_b)
store_b_sd
store_c_sd <- sd(store_c)
store_c_sd

[1] 14.80313

[1] 14.79597

[1] 15.14302

2.Perform a Two Sample T-test between each pair of location data.

Store the results of the tests in variables called a_b_results, a_c_results, and b_c_results. View the results for each test.

# perform two sample t-test here:
a_b_results <- t.test(store_a, store_b)
a_b_results
a_c_results <- t.test(store_a, store_c)
a_c_results
b_c_results <- t.test(store_b, store_c)
b_c_results

Welch Two Sample t-test

data:  store_a and store_b
t = -4.2581, df = 298, p-value = 2.767e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -10.639701  -3.913601
sample estimates:
mean of x mean of y
 58.34964  65.62629

Welch Two Sample t-test

data:  store_a and store_c
t = -2.3201, df = 297.85, p-value = 0.02101
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -7.4142456 -0.6088286
sample estimates:
mean of x mean of y
 58.34964  62.36117

Welch Two Sample t-test

data:  store_b and store_c
t = 1.8888, df = 297.84, p-value = 0.05989
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.1367903  6.6670182
sample estimates:
mean of x mean of y
 65.62629  62.36117
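To read the three results side by side, the p-values can be compared against a significance level in one step. This is a small sketch using the p-values copied from the output above. The 0.05 threshold and the names p_values, alpha, and significant are assumptions for illustration.

```r
# p-values copied from the three test outputs above
p_values <- c(a_b = 2.767e-05, a_c = 0.02101, b_c = 0.05989)

# Assumed significance level
alpha <- 0.05

# TRUE where a single test, taken on its own, would reject the null hypothesis
significant <- p_values < alpha
significant
```

Taken individually, A vs. B and A vs. C fall below 0.05 while B vs. C does not. Since three tests were run, the chance that at least one of these rejections is a false positive is higher than 0.05.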

3.Store the probability of error for running three T-Tests in a variable called error_prob. View error_prob.

# calculate error_prob here:
error_prob <- 1 - 0.95^3
error_prob
[1] 0.142625


12.ANOVA

In the last exercise, you saw that the probability of making a Type I error got dangerously high as you performed more t-tests.

When comparing more than two numerical datasets, a standard way to keep the Type I error probability at 0.05 is to use ANOVA. ANOVA (Analysis of Variance) tests the null hypothesis that all of the datasets you are considering have the same mean. If you reject the null hypothesis with ANOVA, you’re saying that at least one of the sets has a different mean; however, it does not tell you which datasets are different.

You can use the stats package function aov() to perform ANOVA on multiple datasets. aov() takes the different datasets combined into a data frame as an argument. For example, if you were comparing scores on a video game between math majors, writing majors, and psychology majors, you could format the data in a data frame df_scores as follows:

group             score
math major        88
math major        81
writing major     92
writing major     80
psychology major  94
psychology major  83

You can then run an ANOVA test with this line:

results <- aov(score ~ group, data = df_scores)

Note: score ~ group is a formula that indicates the relationship you want to analyze (i.e., how each group, or major, relates to score on the video game).

To retrieve the p-value from the results of calling aov(), use the summary() function:

summary(results)

The null hypothesis, in this case, is that all three populations have the same mean score on this video game. If you reject this null hypothesis (if the p-value is less than 0.05), you can say you are reasonably confident that at least one population has a mean score that differs from the others. After using only ANOVA, however, you can’t draw any conclusions about which populations differ.
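Putting these pieces together, here is a minimal runnable sketch that builds df_scores from the six example rows shown earlier and runs the ANOVA. With only two scores per group it is purely illustrative. The extraction of the p-value through summary(results)[[1]] is one common approach, and the name p_value is an assumption.

```r
# Build the example data frame from the six rows shown above
df_scores <- data.frame(
  group = c("math major", "math major",
            "writing major", "writing major",
            "psychology major", "psychology major"),
  score = c(88, 81, 92, 80, 94, 83)
)

# Run the ANOVA: does mean score differ by group?
results <- aov(score ~ group, data = df_scores)
summary(results)

# Pull the p-value for the group term out of the summary table
p_value <- summary(results)[[1]][["Pr(>F)"]][1]
p_value
```

For this tiny example the p-value is well above 0.05, so you would not reject the null hypothesis that the three majors share the same mean score.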

Let’s look at an example of ANOVA in action.

Instructions

# load libraries
library(tidyr)
# load data
load("stores.Rda")
load("stores_new.Rda")
# inspect stores here:
stores
knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo5.png")

2.Perform an ANOVA on the stores data and save the test results to a variable results. Use the summary() function to view the p-value of the test. Does this p-value lead you to reject the null hypothesis?

# perform anova on stores here:
results <- aov(sales ~ store, data = stores)
summary(results)
             Df Sum Sq Mean Sq F value   Pr(>F)
store         2   3985  1992.6   8.957 0.000153 ***
Residuals   447  99437   222.5
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The column labeled Pr(>F) is the p-value for the F-test.
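To make the link between the columns concrete, the F value and p-value can be recomputed from the sums of squares and degrees of freedom in the first summary table above (the one for stores). This is a sketch of the arithmetic only. The variable names are illustrative, and the inputs are the rounded figures from the printed table.

```r
# Values copied from the summary(results) table for stores
ss_between <- 3985   # Sum Sq for store
df_between <- 2      # Df for store
ss_within <- 99437   # Sum Sq for Residuals
df_within <- 447     # Df for Residuals

# Mean squares: sum of squares divided by degrees of freedom
ms_between <- ss_between / df_between
ms_within <- ss_within / df_within

# F statistic: between-group variability relative to within-group variability
f_value <- ms_between / ms_within

# p-value: upper tail of the F distribution
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

f_value
p_value
```

Up to rounding, these reproduce the F value of 8.957 and the p-value of 0.000153 from the table.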

3.Let’s say the sales at location B have suddenly soared (maybe there’s an ant convention happening nearby). The new sales for location B have been updated in the stores_new data frame.

Re-run the ANOVA test on stores_new and save the test results to a variable results_new. Use the summary() function to see what the p-value is now. Does this new value make sense?

# perform anova on stores_new here:
results_new <- aov(sales ~ store, data = stores_new)
summary(results_new)
             Df Sum Sq Mean Sq F value Pr(>F)
store         2 775599  387799    1805 <2e-16 ***
Residuals   447  96058     215
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Key value: Pr(>F) is reported as < 2e-16, which is shorthand for a p-value less than 0.0000000000000002.

This is far below any common significance level (such as 0.05 or 0.01), so you reject the null hypothesis.

There is a statistically significant difference among the mean sales of the three stores. If all three stores truly had the same mean sales, a difference this large would be extremely unlikely to arise by chance.

ANOVA alone does not tell you which store differs, but the result makes sense here, since the sales at location B have soared.

13.Assumptions of Numerical Hypothesis Tests

Before you use numerical hypothesis tests, you need to be sure that the following things are true:

1.The samples should each be normally distributed…ish

Data analysts in the real world often still perform hypothesis tests on datasets that aren’t exactly normally distributed. What is more important is to recognize if there is some reason to believe that a normal distribution is especially unlikely. If your dataset is definitively not normal, the numerical hypothesis tests won’t work as intended.

For example, imagine you have three datasets, each representing a day of traffic data in three different cities. Each dataset is independent, as traffic in one city should not impact traffic in another city. However, it is unlikely that each dataset is normally distributed. In fact, each dataset probably has two distinct peaks, one at the morning rush hour and one during the evening rush hour. The histogram of a day of traffic data might look something like this:

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo6.png")

In this scenario, using a numerical hypothesis test would be inappropriate.
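The two-peak shape described above can be reproduced with simulated data. The sketch below uses made-up trip times (not real traffic data), with rush hours assumed to centre on 8:00 and 17:30. All names and parameters are illustrative.

```r
set.seed(42)  # make the simulated data reproducible

# Simulated (not real) trip times in hours of the day
morning <- rnorm(500, mean = 8, sd = 1)     # assumed morning rush
evening <- rnorm(500, mean = 17.5, sd = 1)  # assumed evening rush
traffic_hours <- c(morning, evening)

# One bin per hour; plot = FALSE returns the counts without drawing.
# Drop plot = FALSE to draw the histogram itself.
traffic_hist <- hist(traffic_hours, breaks = seq(0, 26, by = 1), plot = FALSE)

# Counts for the hourly bins ending at 8:00, 13:00, and 18:00
traffic_hist$counts[c(8, 13, 18)]
```

The bins around the two rush hours hold most of the observations while the midday bin is nearly empty, which is the opposite of the single central peak a normal distribution would show.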

2.The population standard deviations of the groups should be equal

For ANOVA and Two Sample T-Tests, using datasets with standard deviations that are significantly different from each other will often obscure the differences in group means.

To check for similarity between the standard deviations, it is normally sufficient to divide the two standard deviations and see if the ratio is “close enough” to 1. “Close enough” may differ in different contexts, but generally staying within 10% should suffice.
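The ratio check described above can be wrapped in a small helper. This is a sketch under the 10% rule of thumb from the text. The function name sd_ratio_close and its tolerance argument are hypothetical, and the standard deviations are copied from earlier output in this lesson.

```r
# TRUE when the ratio of two standard deviations is within `tolerance` of 1
# (hypothetical helper; 0.10 mirrors the 10% rule of thumb)
sd_ratio_close <- function(sd_a, sd_b, tolerance = 0.10) {
  ratio <- sd_a / sd_b
  abs(ratio - 1) <= tolerance
}

# Standard deviations copied from earlier output
sd_ratio_close(14.80313, 14.79597)  # store_a vs. store_b
sd_ratio_close(4.577702, 5.553785)  # week_1 vs. week_2
```

Note that t.test() runs the Welch variant by default (var.equal = FALSE), which is why the outputs earlier in this lesson are headed "Welch Two Sample t-test". That variant does not pool the two variances, so the equal standard deviation check matters most for ANOVA and for t-tests run with var.equal = TRUE.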

3.The samples must be independent

When comparing two or more datasets, the values in one distribution should not affect the values in another distribution. In other words, knowing more about one distribution should not give you any information about any other distribution.

Here are some examples where it would seem the samples are not independent:

1.the number of goals scored per soccer player before, during, and after undergoing a rigorous training regimen

2.a group of patients’ blood pressure levels before, during, and after the administration of a drug

It is important to understand your datasets before you begin conducting hypothesis tests on them so that you know you are choosing the right test.

Instructions

1.Use the base R hist() function to display the histograms for dist_one, dist_two, dist_three, and dist_four.

# load data
load("dist_one.Rda")
load("dist_two.Rda")
load("dist_three.Rda")
load("dist_four.Rda")
# plot histograms and define not_normal here:
hist(dist_one)
knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo7.png")

hist(dist_two)
knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo8.png")

hist(dist_three)
knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo9.png")

hist(dist_four)
knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo10.png")

2.Do the distributions look normal?

One of these distributions would probably not be a good choice to use in an ANOVA comparison. Create a variable called not_normal and set it equal to the distribution number (1, 2, 3, or 4) that would be least suited for use in an ANOVA test.

Hint: A normal distribution has a bell-shaped curve with a single peak.

not_normal <- 4
not_normal 
[1] 4

3.Calculate the ratio of standard deviations between dist_two and dist_three, and store the value in a variable called ratio. View ratio. Is this “close enough” to perform a numerical hypothesis test between the two datasets?

# define ratio here:
ratio <- sd(dist_two) / sd(dist_three)
ratio

[1] 0.5784782

One of the assumptions of a numerical hypothesis test is that the ratio of the standard deviations of the datasets is close to 1.

Since the ratio is not close to 1, these datasets should not be used together in a numerical hypothesis test.

14.Review

Phew! Nobody said hypothesis testing is easy, but you made it to the end of the lesson. Congratulations! The world of hypothesis testing is vast. There is much more you can learn, and there are many applications where you can use it.

Let’s review what you’ve learned in this lesson:

1.Samples are subsets of an entire population, and the sample mean can be used to approximate the population mean

2.The null hypothesis is an assumption that there is no difference between the populations you are comparing in a hypothesis test

3.Type I Errors occur when a hypothesis test rejects a null hypothesis that is actually true (a false positive), and Type II Errors occur when a hypothesis test fails to reject a null hypothesis that is actually false (a false negative)

4.P-Values indicate the probability, assuming the null hypothesis is true, of seeing differences in the samples at least as extreme as the ones you observed

5.The Significance Level is a threshold p-value for which all p-values below it will result in rejecting the null hypothesis

6.One Sample T-Tests indicate whether a dataset belongs to a distribution with a given mean

7.Two Sample T-Tests indicate whether there is a significant difference between the means of two datasets

8.ANOVA (Analysis of Variance) allows you to detect whether at least one of several datasets has a significantly different mean

