1.Introduction
Say you work for a major social media website. Your boss comes to you
with two questions:
1.does the demographic of users on your site match the company’s
expectation?
2.did the new interface update affect user engagement?
With terabytes of user data at your hands, you decide the best way to
answer these questions is with statistical hypothesis tests!
Statistical hypothesis testing is a process that allows you to
evaluate if a change or difference seen in a dataset is “real”, or if
it’s just a result of random fluctuation in the data.
Hypothesis testing can be an integral component of any decision
making process. It provides a framework for evaluating how confident one
can be in making conclusions based on data. Some instances where this
might come up include:
1.a professor expects an exam average to be roughly 75%, and wants to
know if the actual scores line up with this expectation. Was the test
actually too easy or too hard?
2.a product manager for a website wants to compare the time spent on
different versions of a homepage. Does one version make users stay on
the page significantly longer?
In this lesson, you will cover the fundamental concepts that will
help you run and evaluate hypothesis tests:
1.Sample and Population Mean
2.P-Values
3.Significance Level
4.Type I and Type II Errors
You will then learn about three different hypothesis tests you can
perform to answer the kinds of questions discussed above:
1.One Sample T-Test
2.Two Sample T-Test
3.ANOVA (Analysis of Variance)
Let’s get started!
Instructions
The code in notebook.Rmd performs a hypothesis test on data for a
company BuyPie.com. The test evaluates whether the time spent per
visitor on the website changes significantly between two weeks.
Read the output at the bottom of the rendered notebook. Do you think
there is a difference in time spent per visitor between Week 1 and Week
2?
By the end of the lesson, you will be able to perform and interpret
such hypothesis tests yourself!
# load data
load("week_1.Rda")
load("week_2.Rda")
# calculate week_1_mean and week_2_mean:
week_1_mean <- mean(week_1)
week_1_mean
week_2_mean <- mean(week_2)
week_2_mean
[1] 25.44806
[1] 29.02157
# calculate week_1_sd and week_2_sd:
week_1_sd <- sd(week_1)
week_1_sd
week_2_sd <- sd(week_2)
week_2_sd
[1] 4.577702
[1] 5.553785
# run two sample t-test:
results <- t.test(week_1,week_2)
results
Welch Two Sample t-test
data: week_1 and week_2
t = -3.5109, df = 94.554, p-value = 0.0006863
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-5.594299 -1.552718
sample estimates:
mean of x mean of y
25.44806 29.02157
2.Sample Mean and Population Mean - I
Suppose you want to know the average height of an oak tree in your
local park. On Monday, you measure 10 trees and get an average height of
32 ft. On Tuesday, you measure 12 different trees and reach an average
height of 35 ft. On Wednesday, you measure the remaining 11 trees in the
park, whose average height is 31 ft. The average height for all 33 trees
in your local park is 32.8 ft.
The collection of individual height measurements on Monday, Tuesday,
and Wednesday are each called samples. A sample is a subset of the
entire population (all the oak trees in the park). The mean of each
sample is a sample mean and it is an estimate of the population
mean.
Note: the sample means (32 ft., 35 ft., and 31 ft.) were all close to
the population mean (32.8 ft.), but were all slightly different from the
population mean and from each other.
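As a quick check of the arithmetic in the oak tree example, the park-wide average can be recovered by weighting each day's sample mean by the number of trees measured that day. A minimal sketch in R, where the variable names are made up for illustration:

```r
# daily sample means (ft) and the number of trees measured each day
daily_means <- c(monday = 32, tuesday = 35, wednesday = 31)
trees_measured <- c(monday = 10, tuesday = 12, wednesday = 11)

# weight each daily mean by its sample size to recover the mean of all 33 trees
park_mean <- weighted.mean(daily_means, w = trees_measured)
round(park_mean, 1)
```

Because the three samples together cover every tree in the park, the weighted mean is the population mean itself, which rounds to 32.8 ft.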
For a population, the mean is a constant value no matter how many
times it’s recalculated. But for a sample, the mean depends on exactly
which members of the population are selected. From a sample mean, we can
then estimate the mean of the population as a whole. There are three main
reasons we might use sampling:
1.data on the entire population is not available
2.data on the entire population is available, but it is so large that
it is unfeasible to analyze
3.meaningful answers to questions can be found faster with
sampling
Instructions
1.In the workspace, we’ve generated a random population of size 300
that follows a normal distribution with a mean of 65. Update the value
of population_mean to store the mean() of population. Does it closely
match your expectation?
# generate random population
population <- rnorm(300, mean=65, sd=3.5)
# calculate population mean here:
population_mean <- mean(population)
population_mean
[1] 64.90532
2.Let’s look at how the means of different samples can vary within
the same population.
The code in the notebook generates 5 random samples from population.
sample_1 is displayed and sample_1_mean has been calculated.
Replace the “Not calculated” strings with calculations of the means
for sample_2, sample_3, sample_4, and sample_5.
Look at the population mean and the sample means. Are they all the
same? All different? Why?
# generate sample 1
sample_1 <- sample(population, size=30)
sample_1
[1] 67.54371 70.74646 58.92483 66.32206 70.76844 61.03942 61.92440 65.01594 66.31625 66.64947 67.29845 70.28490
[13] 61.20255 63.25246 62.48976 65.40590 65.50519 58.48546 64.96503 63.32665 63.64597 65.27635 64.68773 64.14368
[25] 65.83720 60.94812 56.08455 64.11198 67.19988 64.40878
# calculate sample 1 mean
sample_1_mean <- mean(sample_1)
sample_1_mean
[1] 64.46039
# generate samples 2,3,4 and 5
sample_2 <- sample(population, size=30)
sample_3 <- sample(population, size=30)
sample_4 <- sample(population, size=30)
sample_5 <- sample(population, size=30)
# calculate sample means here:
sample_2_mean <- mean(sample_2)
sample_2_mean
[1] 65.82616
sample_3_mean <- mean(sample_3)
sample_3_mean
[1] 64.76717
sample_4_mean <- mean(sample_4)
sample_4_mean
[1] 65.69348
sample_5_mean <- mean(sample_5)
sample_5_mean
[1] 64.93156
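The five sample means can also be collected in one step, which makes it easier to see how far each one lands from the population mean. This is only a sketch of the same exercise: the seed is an assumption added for reproducibility, so the numbers will not match the output shown in the exercise.

```r
set.seed(42)  # assumed seed, only so the sketch is reproducible

# same setup as the exercise: a population of 300 values around 65
population <- rnorm(300, mean = 65, sd = 3.5)
population_mean <- mean(population)

# draw 5 samples of size 30 and take the mean of each
sample_means <- replicate(5, mean(sample(population, size = 30)))

# how far each sample mean lands from the population mean
differences <- sample_means - population_mean
data.frame(sample = 1:5, sample_mean = sample_means, difference = differences)
```

Each row should show a small but nonzero difference: every sample is a different subset of the population, so no two sample means are expected to agree exactly.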
3.Sample Mean and Population Mean - II
In the previous exercise, the sample means you calculated closely
approximated the population mean. This won’t always be the case!
Consider a tailor of school uniforms at a school for students aged 11
to 13. The tailor needs to know the average height of all the students
in order to know which sizes to make the uniforms.
The tailor measures the heights of a random sample of 20 students out
of the 300 in the school. The average height of the sample is 57.5
inches. Using this sample mean, the tailor makes uniforms that fit
students of this height, some smaller, and some larger.
After delivering the uniforms, the tailor starts to receive some
feedback — many of the uniforms are too small! They go back to take
measurements on the rest of the students, collecting the following
data:
1.11 year olds average height: 56.7 inches
2.12 year olds average height: 59 inches
3.13 year olds average height: 62.8 inches
4.All students average height (population mean): 59.5 inches
The original sample mean was off from the population mean by 2
inches! How did this happen?
The random sample of 20 students was skewed to one direction of the
total population. More 11 year olds were chosen in the sample than is
representative of the whole school, bringing down the average height of
the sample. This is called a sampling error, and occurs when a sample is
not representative of the population it comes from. How do you get an
average sample height that looks more like the average population
height, and reduce the chance of a sampling error?
Selecting only 20 students for the sample allowed for the chance that
only younger, shorter students were included. This is a natural
consequence of the fact that a sample has less data than the population
to which it belongs. If the sample selection is poor, then you will have
a sample mean seriously skewed from the population mean.
There is one reliable way to reduce the risk of having a skewed
sample mean: take a larger sample! The mean of a larger sample will tend
to approximate the population mean more closely, reducing the chance of
a sampling error.
Instructions
In the workspace, we have a population that is normally distributed.
Generate samples of different sizes and see how the sample mean could
differ from the population mean.
What happens to the difference between the sample mean and the
population mean as you increase the sample size?
knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo1.png")

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo2.png")

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo3.png")

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo4.png")
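One way to explore the question in this exercise is to simulate it. The sketch below assumes a hypothetical normally distributed population (its size, mean, and standard deviation are made up) and measures, for several sample sizes, the typical gap between a sample mean and the population mean.

```r
set.seed(1)  # assumed seed for reproducibility

# hypothetical normally distributed population
population <- rnorm(10000, mean = 65, sd = 3.5)
population_mean <- mean(population)

# for each sample size, draw 1000 samples and record the average
# absolute gap between the sample mean and the population mean
sample_sizes <- c(5, 20, 100, 500)
typical_gap <- sapply(sample_sizes, function(n) {
  gaps <- replicate(1000, abs(mean(sample(population, size = n)) - population_mean))
  mean(gaps)
})

data.frame(sample_size = sample_sizes, typical_gap = typical_gap)
```

The typical gap should shrink as the sample size grows, which is the pattern the exercise asks you to look for.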
4.Hypothesis Formulation
You begin the statistical hypothesis testing process by defining a
hypothesis, or an assumption about your population that you want to
test. A hypothesis can be written in words, but can also be explained in
terms of the sample and population means you just learned about.
Say you are developing a website and want to compare the time spent
on different versions of a homepage. You could run a hypothesis test to
see if version A or B makes users stay on the page significantly longer.
Your hypothesis might be:
“The average time spent on homepage A is greater than the average
time spent on homepage B.”
While this is a fine hypothesis to make, data analysts are often very
hesitant people. They don’t like to make bold claims without having data
to back them up! Thus when constructing hypotheses for a hypothesis
test, you want to formulate a null hypothesis. A null hypothesis states
that there is no difference between the populations you are comparing,
and it implies that any difference seen in the sample data is due to
sampling error. A null hypothesis for the same scenario is as
follows:
“The average time spent on homepage A is the same as the average time
spent on homepage B.”
You could also restate this in terms of population mean:
“The population mean of time spent on homepage A is the same as the
population mean of time spent on homepage B.”
After collecting some sample data on how users interact with each
homepage, you can then run a hypothesis test using the data collected to
determine whether your null hypothesis can be rejected (i.e. the data
indicate a difference in time spent on homepage A and homepage B).
Instructions
1.A researcher at a pharmaceutical company is working on the
development of a new medication to lower blood pressure, DeePressurize.
They run an experiment with a control group of 100 patients that receive
a placebo (a sugar pill), and an experimental group of 100 patients that
receive DeePressurize. Blood pressure measurements are taken after a 3
month period on both groups of patients.
The researcher wants to run a hypothesis test to compare the
resulting datasets. Two hypotheses, hypo_a and hypo_b, are given in
notebook.Rmd. Which could be a null hypothesis for comparing the two
sets of data? Update the value of null_hypo_1 to the string “hypo_a” or
“hypo_b” based on your answer.
# experiment 1
hypo_a <- "DeePressurize lowers blood pressure in patients."
hypo_b <- "DeePressurize has no effect on blood pressure in patients."
null_hypo_1 <- "hypo_b"
null_hypo_1
[1] "hypo_b"
2.A product manager at a dating app company is developing a new user
profile page with a different picture layout. They want to see if the
new layout results in more matches between users than the current
layout. 50% of profiles are updated to the new layout, and over a 1
month period the number of matches for users with the new layout and the
original layout are recorded.
The product manager wants to run a hypothesis test to compare the
resulting datasets. Two hypotheses, hypo_c and hypo_d, are given in
notebook.Rmd. Which could be a null hypothesis for comparing the two
sets of data? Update the value of null_hypo_2 to the string “hypo_c” or
“hypo_d” based on your answer.
# experiment 2
hypo_c <- "The new profile layout has no effect on number of matches with other users."
hypo_d <- "The new profile layout results in more matches with other users than the original layout."
null_hypo_2 <- "hypo_c"
null_hypo_2
[1] "hypo_c"
5.Designing an Experiment
Suppose you want to know if students who study history are more
interested in volleyball than students who study chemistry. Before doing
anything else to answer your original question, you come up with a null
hypothesis: “History and chemistry students are interested in volleyball
at the same rates.”
To test this hypothesis, you need to design an experiment and collect
data. You invite 100 history majors and 100 chemistry majors from your
university to join an extracurricular volleyball team. After one week,
34 history majors sign up (34%), and 39 chemistry majors sign up (39%).
More chemistry majors than history majors signed up, but is this a
“real”, or significant difference? Can you conclude that students who
study chemistry are more interested in volleyball than students who
study history?
In your experiment, the 100 history and 100 chemistry majors at your
university are samples of their respective populations (all history and
chemistry majors). The sample means are the percentages of history
majors (34%) and chemistry majors (39%) that signed up for the team, and
the difference in sample means is 39% - 34% = 5%. The population means
are the percentage of history and chemistry majors worldwide that would
sign up for an extracurricular volleyball team if given the chance.
You want to know if the difference you observed in these sample means
(5%) reflects a difference in the population means, or if the difference
was caused by sampling error, and the samples of students you chose do
not represent the greater populations of history and chemistry
students.
Restating the null hypothesis in terms of the population means yields
the following:
“The percentage of all history majors who would sign up for
volleyball is the same as the percentage of all chemistry majors who
would sign up for volleyball, and the observed difference in sample
means is due to sampling error.”
This is the same as saying, “If you gave the same volleyball
invitation to every history and chemistry major in the world, they would
sign up at the same rate, and the 200 students you sampled are not
representative of their populations.”
Instructions
1.Your friend is a dog walker who specializes in working with Golden
Retrievers and Goldendoodles. They are interested in knowing if there is
a significant difference in the lengths of the two breeds. After a few
weeks of data collection, they give you a spreadsheet of 10 Golden
Retrievers’ lengths and 10 Goldendoodles’ lengths.
The lengths of the dogs are given in retriever_lengths and
doodle_lengths. Calculate the mean of each breed and save the results to
mean_retriever_l and mean_doodle_l. View mean_retriever_l and
mean_doodle_l.
# load data
load("retriever_lengths.Rda")
load("doodle_lengths.Rda")
# calculate mean_retriever_l and mean_doodle_l here:
mean_retriever_l <- mean(retriever_lengths)
mean_retriever_l
mean_doodle_l <- mean(doodle_lengths)
mean_doodle_l
[1] 23
[1] 20.5
2.Calculate the difference between mean_retriever_l and mean_doodle_l
and save the result to mean_difference. View mean_difference.
# calculate mean_difference here:
mean_difference <- mean_retriever_l - mean_doodle_l
mean_difference
[1] 2.5
3.You want to run a hypothesis test to see if there is a significant
difference in the lengths of Golden Retrievers and Goldendoodles. Which
of the two statements could be a formulation of the null hypothesis?
Update the value of null_hypo with “st_1” or “st_2” depending on your
answer.
# statements:
st_1 <- "The average length of Golden Retrievers is 2.5 inches longer than the average length of Goldendoodles."
st_2 <- "The average length of Golden Retrievers is the same as the average length of Goldendoodles."
# update null_hypo here:
null_hypo <- "st_2"
null_hypo
[1] "st_2"
6.Type I and Type II Errors
When using automated processes to make decisions, you need to be
aware of how this automation can lead to mistakes. Computer programs can
be as fallible as the humans who design them. Because of this, there is
a responsibility to understand what can go wrong and what can be done to
contain these foreseeable problems.
In statistical hypothesis testing, there are two types of error. A
Type I error occurs when a hypothesis test finds a correlation between
things that are not related. This error is sometimes called a “false
positive” and occurs when the null hypothesis is rejected even though it
is true.
For example, consider the history and chemistry major experiment from
the previous exercise. Say you run a hypothesis test on the sample data
you collected and conclude that there is a significant difference in
interest in volleyball between history and chemistry majors. You have
rejected the null hypothesis that there is no difference between the two
populations of students. If, in reality, your results were due to the
groups you happened to pick (sampling error), and there actually is no
significant difference in interest in volleyball between history and
chemistry majors in the greater population, you have become the victim
of a false positive, or a Type I error.
The second kind of error, a Type II error, is failing to find a
correlation between things that are actually related. This error is
referred to as a “false negative” and occurs when the null hypothesis is
not rejected even though it is false.
For example, with the history and chemistry student experiment, say
that after you perform the hypothesis test, you conclude that there is
no significant difference in interest in volleyball between history and
chemistry majors. You did not reject the null hypothesis. If there
actually is a difference in the populations as a whole, and there is a
significant difference in interest in volleyball between history and
chemistry majors, your test has resulted in a false negative, or a Type
II error.
Instructions
1.Define type_i_errors, the list representing the false positives of
the experiment.
# the true positives and negatives:
actual_positive <- c(2, 5, 6, 7, 8, 10, 18, 21, 24, 25, 29, 30, 32, 33, 38, 39, 42, 44, 45, 47)
actual_negative <- c(1, 3, 4, 9, 11, 12, 13, 14, 15, 16, 17, 19, 20, 22, 23, 26, 27, 28, 31, 34, 35, 36, 37, 40, 41, 43, 46, 48, 49)
# the positives and negatives we determine by running the experiment:
experimental_positive <- c(2, 4, 5, 7, 8, 9, 10, 11, 13, 15, 16, 17, 18, 19, 20, 21, 22, 24, 26, 27, 28, 32, 35, 36, 38, 39, 40, 45, 46, 49)
experimental_negative <- c(1, 3, 6, 12, 14, 23, 25, 29, 30, 31, 33, 34, 37, 41, 42, 43, 44, 47, 48)
# define type_i_errors and type_ii_errors here:
type_i_errors <- intersect(actual_negative, experimental_positive)
print('false positives')
[1] "false positives"
type_i_errors
[1] 4 9 11 13 15 16 17 19 20 22 26 27 28 35 36 40 46 49
2.Now, define type_ii_errors, the list representing the false
negatives of the experiment.
type_ii_errors <- intersect(actual_positive, experimental_negative)
print('false negatives')
[1] "false negatives"
type_ii_errors
[1] 6 25 29 30 33 42 44 47
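To put the two error lists in context, it can help to count all four possible outcomes, including the cases the experiment got right. The sketch below reuses the vectors from the exercise; true_positives, true_negatives, and outcome_counts are names introduced here for illustration.

```r
# the true positives and negatives (copied from the exercise)
actual_positive <- c(2, 5, 6, 7, 8, 10, 18, 21, 24, 25, 29, 30, 32, 33, 38, 39, 42, 44, 45, 47)
actual_negative <- c(1, 3, 4, 9, 11, 12, 13, 14, 15, 16, 17, 19, 20, 22, 23, 26, 27, 28, 31, 34, 35, 36, 37, 40, 41, 43, 46, 48, 49)
# the positives and negatives determined by running the experiment
experimental_positive <- c(2, 4, 5, 7, 8, 9, 10, 11, 13, 15, 16, 17, 18, 19, 20, 21, 22, 24, 26, 27, 28, 32, 35, 36, 38, 39, 40, 45, 46, 49)
experimental_negative <- c(1, 3, 6, 12, 14, 23, 25, 29, 30, 31, 33, 34, 37, 41, 42, 43, 44, 47, 48)

# correct results: the experiment agrees with reality
true_positives <- intersect(actual_positive, experimental_positive)
true_negatives <- intersect(actual_negative, experimental_negative)
# errors: the experiment disagrees with reality
type_i_errors <- intersect(actual_negative, experimental_positive)
type_ii_errors <- intersect(actual_positive, experimental_negative)

outcome_counts <- c(
  true_positive = length(true_positives),
  true_negative = length(true_negatives),
  type_i_error = length(type_i_errors),
  type_ii_error = length(type_ii_errors)
)
outcome_counts
sum(outcome_counts)  # every one of the 49 cases falls into exactly one group
```

In this dataset the experiment produced 18 false positives and 8 false negatives against 12 true positives and 11 true negatives.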
7.P-Values
You know that a hypothesis test is used to determine the validity of
a null hypothesis. Once again, the null hypothesis states that there is
no actual difference between the two populations of data. But what
result does a hypothesis test actually return, and how can you interpret
it?
A hypothesis test returns a few numeric measures, most of which are
out of the scope of this introductory lesson. Here we will focus on one:
p-values. P-values help you judge how compatible your data are with
the null hypothesis. In this context, a p-value is the probability that,
assuming the null hypothesis is true, you would see a difference in the
sample means at least as large as the one in your data.
Consider the experiment on history and chemistry majors and their
interest in volleyball from a previous exercise:
Null Hypothesis: “History and chemistry students are interested in
volleyball at the same rates”
Experiment Sample Means: 34% of history majors and 39% of chemistry
majors sign up for the volleyball team
Assuming the null hypothesis is true, there is no actual difference in
preference for volleyball between all history and chemistry majors, and
any difference present in the experiment data is the result of sampling
error. Imagine you run a hypothesis test on this experiment data and it
returns a p-value of 0.04. A p-value of 0.04 indicates that, if the null
hypothesis is true, you could expect to see a difference of at least 5%
(calculated as 39% - 34% = 5%) in the sample means only 4% of the time.
Essentially, if you ran this same experiment 100 times, you would
expect to see as large a difference in the sample means only 4 times
given the assumption that there is no actual difference between the
populations (i.e. they have the same mean).
Seems like a really small probability, right? Are you thinking about
rejecting the null hypothesis you originally stated?
p-value ≠ probability that your result is wrong
p-value = probability of your data (or more extreme) if the null
hypothesis is true
The p-value itself is not the probability that the null hypothesis is
wrong. In practice, though, a very small p-value is strong evidence
against the null hypothesis.
Instructions
1.You are a big fan of apples, so you gather 10 green and 10 red apples
to compare their weights. The green apples average 150 grams in weight,
and the red apples average 160 grams in weight.
You run a hypothesis test to see if there is a significant difference
in the weight of green and red apples. The test returns a p-value of
0.2. Which statement (st_1, st_2, st_3, or st_4) indicates how this
p-value can be interpreted?
Update the value of interpretation with the string “st_1”, “st_2”,
“st_3”, or “st_4” depending on your answer.
# possible interpretations
st_1 <- "There is a 20% chance that the difference in average weight of green and red apples is due to random sampling."
st_2 <- "There is a 20% chance that green and red apples have the same average weight."
st_3 <- "There is a 20% chance red apples weigh more than green apples."
st_4 <- "There is a 20% chance green apples weigh more than red apples."
# update the value of interpretation here:
interpretation <- "st_1"
interpretation
[1] "st_1"
A p-value of 0.2 means:
If green and red apples really weigh the same on average, there is a
20% chance of seeing a difference of 10 grams (or more) just due to
random variation.
8.Significance Level
While a hypothesis test will return a p-value indicating how
compatible the data are with the null hypothesis, it does not
definitively tell you whether you should reject the null hypothesis. To
make this decision, you need to choose a threshold such that every
p-value below it results in rejecting the null hypothesis. This
threshold is known as the significance level.
A higher significance level is more likely to give a false positive,
as it makes it “easier” to state that there is a difference in the
populations of your data when such a difference might not actually
exist. If you want to be very sure that the result is not due to
sampling error, you should select a very small significance level.
It is important to choose the significance level before you perform a
statistical hypothesis test. If you wait until after you receive a
p-value from a test, you might pick a significance level such that you
get the result you want to see. For instance, if someone is trying to
publish the results of their scientific study in a journal, they might
set a higher significance level that makes their results appear
statistically significant. Choosing a significance level in advance
helps keep everyone honest.
It is an industry standard to set a significance level of 0.05 or
less, meaning that if the null hypothesis is true, there is a 5% or
smaller chance of rejecting it by mistake (a false positive).
Remember, the p-value is the probability of getting results at least
as extreme as the observed ones just by random chance, assuming the null
hypothesis is true.
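The link between the significance level and false positives can be checked with a small simulation. The sketch below is an illustration under assumed settings: the group sizes, mean, standard deviation, and seed are all made up, and it uses the t.test() function that is covered later in this lesson. Both groups are drawn from the same population, so the null hypothesis is true in every simulated experiment and any significant result is a false positive.

```r
set.seed(123)  # assumed seed for reproducibility

# run many experiments in which the null hypothesis is TRUE:
# both groups are drawn from the same hypothetical population
p_values <- replicate(2000, {
  group_a <- rnorm(30, mean = 50, sd = 10)
  group_b <- rnorm(30, mean = 50, sd = 10)
  t.test(group_a, group_b)$p.value
})

# share of experiments that would be wrongly declared significant
false_positive_rate_05 <- mean(p_values < 0.05)
false_positive_rate_10 <- mean(p_values < 0.10)
c(alpha_0.05 = false_positive_rate_05, alpha_0.10 = false_positive_rate_10)
```

The false positive rate should land near the chosen significance level in each case, so doubling the significance level roughly doubles the share of false positives.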
Instructions
1.Before you run a hypothesis test on a set of data, you set your
significance level to 0.05. The hypothesis test then returns a p-value
of 0.1. Can you reject the null hypothesis? Update the value of
reject_hypothesis to TRUE or FALSE depending on your answer.
The p-value of 0.1 is greater than the significance level of 0.05, so
you cannot reject the null hypothesis.
# update reject_hypothesis here:
reject_hypothesis <- FALSE
reject_hypothesis
9.One Sample T-Test
Consider the fictional business BuyPie, which sends ingredients for
pies to your household so that you can make them from scratch. Suppose
that a product manager hypothesizes the average age of visitors to
BuyPie.com is 30. In the past hour, the website had 100 visitors and the
average age was 31. Are the visitors older than expected? Or is this
just the result of chance (sampling error) and a small sample size?
You can test this using a One Sample T-Test. A One Sample T-Test
compares a sample mean to a hypothetical population mean. It answers the
question “If the population really had the hypothesized mean, how likely
is a sample mean at least this far from it?”
The first step is formulating a null hypothesis, which again is the
hypothesis that there is no difference between the populations you are
comparing. The second population in a One Sample T-Test is the
hypothetical population you choose. The null hypothesis that this test
examines can be phrased as follows: “The set of samples belongs to a
population with the target mean”.
One result of a One Sample T-Test will be a p-value, which tells you
whether or not you can reject this null hypothesis. If the p-value you
receive is less than your significance level, normally 0.05, you can
reject the null hypothesis and state that there is a significant
difference.
R has a function called t.test() in the stats package which can
perform a One Sample T-Test for you.
t.test() requires two arguments, a distribution of values and an
expected mean:
results <- t.test(sample_distribution, mu = expected_mean)
1.sample_distribution is the sample of values that were collected
2.mu is an argument indicating the desired mean of the hypothetical
population
3.expected_mean is the value of the desired mean
t.test() will return, among other information we will not cover here,
a p-value. This is the probability of seeing a sample mean at least this
far from the specified mean if the sample really came from a
distribution with that mean.
P-values give you an idea of how confident you can be in a result.
Just because you don’t have enough data to detect a difference doesn’t
mean that there isn’t one. Generally, the more samples you have, the
smaller a difference you can detect.
Instructions
1.We have provided a small dataset called ages, representing the ages
of customers to BuyPie.com in the past hour, in notebook.Rmd.
Even with a small dataset like this, it is hard to make judgments
from just looking at the numbers.
To understand the data better, let’s look at the mean. Calculate the
mean of ages, and store the result in a variable called ages_mean. View
ages_mean.
# load and view data
ages <- c(32, 34, 29, 29, 22, 39, 38, 37, 38, 36, 30, 26, 22, 22)
ages
[1] 32 34 29 29 22 39 38 37 38 36 30 26 22 22
# calculate ages_mean here:
ages_mean <- mean(ages)
ages_mean
[1] 31
2.Use the t.test() function with ages to see what p-value the
experiment returns for this distribution, where we expect the mean to be
30.
Store the results of the test in a variable called results.
Does the p-value you got with the One Sample T-Test make sense,
knowing the mean of ages?
# perform t-test here:
results <- t.test(ages, mu = 30)
results
One Sample t-test
data: ages
t = 0.59738, df = 13, p-value = 0.5605
alternative hypothesis: true mean is not equal to 30
95 percent confidence interval:
27.38359 34.61641
sample estimates:
mean of x
31
If the true mean is really 30 (null hypothesis is true), then there
is about a 56% chance you’d see a sample mean as far from 30 as 31 (or
even farther) just by chance.
It means this observed difference (mean of 31 instead of 30) is
completely plausible as random sampling error.
So you fail to reject the null hypothesis.
The data is consistent with the population mean being 30.
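The decision described above can also be made in code rather than by reading the printed output. A minimal sketch, assuming a significance level of 0.05 chosen before the test; significance_level, p_value, and reject_null are names introduced here for illustration:

```r
ages <- c(32, 34, 29, 29, 22, 39, 38, 37, 38, 36, 30, 26, 22, 22)
significance_level <- 0.05  # chosen before running the test

results <- t.test(ages, mu = 30)

# the object returned by t.test() stores the p-value in $p.value
p_value <- results$p.value

# reject the null hypothesis only if the p-value is below the threshold
reject_null <- p_value < significance_level

p_value
reject_null
```

Here p_value is about 0.56, well above 0.05, so reject_null is FALSE.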
10.Two Sample T-Test
Suppose that last week, the average amount of time spent per visitor
to a website was 25 minutes. This week, the average amount of time spent
per visitor to a website was 29 minutes. Did the average time spent per
visitor change (i.e. was there a statistically significant bump in user
time on the site)? Or is this just part of natural fluctuations?
One way of testing whether this difference is significant is by using
a Two Sample T-Test. A Two Sample T-Test compares two sets of data,
which are both approximately normally distributed.
The null hypothesis, in this case, is that the two distributions have
the same mean.
You can use R’s t.test() function to perform a Two Sample T-Test, as
shown below:
results <- t.test(distribution_1, distribution_2)
When performing a Two Sample T-Test, t.test() takes two distributions
as arguments and returns, among other information, a p-value.
Remember, the p-value lets you know the probability of seeing a
difference in the means at least this large by chance (sampling error)
alone, assuming the null hypothesis is true.
Instructions
1.We’ve created two distributions representing the time spent per
visitor to BuyPie.com last week, week_1, and the time spent per visitor
to BuyPie.com this week, week_2.
Find the means of these two distributions. Store them in week_1_mean
and week_2_mean. View both means.
# load data
week_1 <- c(23.90507, 26.67632, 27.27434, 24.25757, 32.40423, 39.56919, 23.07010, 29.82068, 27.59434, 28.05640, 27.06757, 30.41193, 25.71359, 24.94295, 28.23124, 24.95338, 18.51232, 27.46235, 28.38017, 13.91206, 29.02616, 26.90747, 22.86777, 24.89383, 25.96948, 26.86870, 20.72676, 27.35988, 20.68409, 21.19846, 16.25801, 23.92518, 24.47923, 29.47051, 27.28425, 26.93339, 28.61027, 18.88377, 33.65469, 25.69470, 20.98291, 22.69700, 28.60279, 21.36000, 30.77685, 20.83416, 23.79367, 19.75567, 29.54421, 20.14331)
week_1
[1] 23.90507 26.67632 27.27434 24.25757 32.40423 39.56919 23.07010 29.82068 27.59434 28.05640 27.06757 30.41193
[13] 25.71359 24.94295 28.23124 24.95338 18.51232 27.46235 28.38017 13.91206 29.02616 26.90747 22.86777 24.89383
[25] 25.96948 26.86870 20.72676 27.35988 20.68409 21.19846 16.25801 23.92518 24.47923 29.47051 27.28425 26.93339
[37] 28.61027 18.88377 33.65469 25.69470 20.98291 22.69700 28.60279 21.36000 30.77685 20.83416 23.79367 19.75567
[49] 29.54421 20.14331
week_2 <- c( 18.63432, 31.28788, 34.96798, 21.81678, 28.21620, 39.39314, 35.52223, 27.54222, 33.64395, 25.31674, 28.81392, 30.73580, 26.37242, 26.09456, 26.34073, 19.42196, 32.58798, 24.84002, 28.93348, 20.43668, 22.72496, 32.31728, 35.38431, 29.66710, 24.53513, 30.91406, 19.56118, 24.90817, 30.13164, 31.47466, 27.77684, 16.51307, 35.07702, 31.74818, 36.36053, 27.70501, 29.49870, 27.65575, 37.18504, 25.16055, 29.26554, 38.22163, 28.92102, 24.82154, 38.30155, 34.76021, 22.26869, 28.82594, 32.00975, 36.46438)
week_2
[1] 18.63432 31.28788 34.96798 21.81678 28.21620 39.39314 35.52223 27.54222 33.64395 25.31674 28.81392 30.73580
[13] 26.37242 26.09456 26.34073 19.42196 32.58798 24.84002 28.93348 20.43668 22.72496 32.31728 35.38431 29.66710
[25] 24.53513 30.91406 19.56118 24.90817 30.13164 31.47466 27.77684 16.51307 35.07702 31.74818 36.36053 27.70501
[37] 29.49870 27.65575 37.18504 25.16055 29.26554 38.22163 28.92102 24.82154 38.30155 34.76021 22.26869 28.82594
[49] 32.00975 36.46438
# calculate week_1_mean and week_2_mean here:
week_1_mean <- mean(week_1)
week_1_mean
[1] 25.44806
week_2_mean <- mean(week_2)
week_2_mean
[1] 29.02157
2.Find the standard deviations of these two distributions. Store them
in week_1_sd and week_2_sd. View both standard deviations.
# calculate week_1_sd and week_2_sd here:
week_1_sd <- sd(week_1)
week_1_sd
[1] 4.577702
week_2_sd <- sd(week_2)
week_2_sd
[1] 5.553785
3.Run a Two Sample T-Test using the t.test() function.
Save the results to a variable called results and view it. Does the
p-value make sense, knowing what you know about these datasets?
# run two sample t-test here:
results <- t.test(week_1, week_2)
results
Assuming there is no real difference in average time spent between
the two weeks (null hypothesis), the probability of seeing a difference
as large as ~3.57 minutes (or more extreme) just by random sampling is
about 0.07%.
There is strong evidence that the average time per visitor changed
between week_1 and week_2.
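The interpretation above can be cross-checked by hand. The following is a minimal sketch that rebuilds the Welch t statistic, degrees of freedom, and two-sided p-value from the summary statistics printed earlier (the two means, the two standard deviations, and 50 observations per week). The names n_1, n_2, v_1, v_2, se_diff, and df_welch are illustrative and are not part of the lesson's notebook.

```r
# Summary statistics copied from the output earlier in this exercise
week_1_mean <- 25.44806; week_1_sd <- 4.577702; n_1 <- 50
week_2_mean <- 29.02157; week_2_sd <- 5.553785; n_2 <- 50

# Squared standard error contributed by each sample
v_1 <- week_1_sd^2 / n_1
v_2 <- week_2_sd^2 / n_2

# Welch t statistic: difference in means divided by its standard error
se_diff <- sqrt(v_1 + v_2)
t_stat <- (week_1_mean - week_2_mean) / se_diff

# Welch-Satterthwaite approximation for the degrees of freedom
df_welch <- (v_1 + v_2)^2 / (v_1^2 / (n_1 - 1) + v_2^2 / (n_2 - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

t_stat
df_welch
p_value
```

When the object returned by t.test() is available, results$statistic, results$parameter, and results$p.value hold the same three quantities, so the hand calculation is only a sanity check.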
11.Dangers of Multiple T-Tests
Suppose that you own a chain of stores that sell ants, called
VeryAnts. There are three different locations: A, B, and C. You want to
know if the average ant sales over the past year are significantly
different between the three locations.
At first, it seems that you could perform T-tests between each pair
of stores.
You know that the significance level is the probability that you
incorrectly reject a true null hypothesis on each t-test. The more
t-tests you perform, the more likely you are to get a false positive, a
Type I error.
For a significance level of 0.05, if the null hypothesis is true, then
the probability of correctly failing to reject it is 1 – 0.05 = 0.95.
When you run another t-test, the probability of still getting a correct
result on both is 0.95 * 0.95, or 0.9025 (treating the tests as
independent). That means your probability of making at least one error
is now close to 10%! This error probability only gets bigger the more
t-tests you do.
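The arithmetic above generalizes to any number of tests. As a small sketch, assuming the tests are independent and every null hypothesis is true, the chance of at least one false positive across k tests is 1 - (1 - alpha)^k. The variable names alpha, n_tests, familywise_error, and n_groups are illustrative.

```r
alpha <- 0.05      # significance level used for each individual test
n_tests <- 1:10    # number of t-tests performed

# Probability of at least one false positive across n_tests tests,
# assuming independent tests and a true null hypothesis in every test
familywise_error <- 1 - (1 - alpha)^n_tests

data.frame(n_tests, familywise_error = round(familywise_error, 4))

# Number of pairwise t-tests needed to compare every pair of groups
n_groups <- 3:6
choose(n_groups, 2)
```

The second calculation shows why the problem escalates quickly: the number of pairwise comparisons grows faster than the number of groups being compared.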
Instructions
1.We have created samples store_a, store_b, and store_c, representing
the sales at VeryAnts at locations A, B, and C, respectively. We want to
see if there’s a significant difference in sales between the three
locations.
Explore datasets store_a, store_b, and store_c by finding and viewing
the means and standard deviations of each one. Store the means in
variables called store_a_mean, store_b_mean, and store_c_mean. Store the
standard deviations in variables called store_a_sd, store_b_sd, and
store_c_sd.
# load data
load("store_a.Rda")
load("store_b.Rda")
load("store_c.Rda")
# calculate means here:
store_a_mean <- mean(store_a)
store_a_mean
store_b_mean <- mean(store_b)
store_b_mean
store_c_mean <- mean(store_c)
store_c_mean
[1] 58.34964
[1] 65.62629
[1] 62.36117
# calculate standard deviations here:
store_a_sd <- sd(store_a)
store_a_sd
store_b_sd <- sd(store_b)
store_b_sd
store_c_sd <- sd(store_c)
store_c_sd
[1] 14.80313
[1] 14.79597
[1] 15.14302
2.Perform a Two Sample T-test between each pair of location data.
Store the results of the tests in variables called a_b_results,
a_c_results, and b_c_results. View the results for each test.
# perform two sample t-test here:
a_b_results <- t.test(store_a, store_b)
a_b_results
a_c_results <- t.test(store_a, store_c)
a_c_results
b_c_results <- t.test(store_b, store_c)
b_c_results
        Welch Two Sample t-test

data:  store_a and store_b
t = -4.2581, df = 298, p-value = 2.767e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -10.639701  -3.913601
sample estimates:
mean of x mean of y
 58.34964  65.62629

        Welch Two Sample t-test

data:  store_a and store_c
t = -2.3201, df = 297.85, p-value = 0.02101
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -7.4142456 -0.6088286
sample estimates:
mean of x mean of y
 58.34964  62.36117

        Welch Two Sample t-test

data:  store_b and store_c
t = 1.8888, df = 297.84, p-value = 0.05989
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.1367903  6.6670182
sample estimates:
mean of x mean of y
 65.62629  62.36117
3.Store the probability of error for running three T-Tests in a
variable called error_prob. View error_prob.
# calculate error_prob here:
error_prob <- 1 - 0.95^3
error_prob
[1] 0.142625
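The inflated error rate can also be seen empirically. The sketch below is a simulation on made-up data, not the VeryAnts samples: three groups are drawn from the same normal distribution, so every null hypothesis is true, and any significant pairwise t-test is a false positive. The seed, sample size, mean, and standard deviation are arbitrary assumptions. Because the three pairwise tests share data they are not fully independent, so the simulated rate need not match 0.142625 exactly, but it should sit well above 0.05.

```r
set.seed(42)     # arbitrary seed so the simulation is reproducible

n_sims <- 2000   # number of simulated experiments
n_obs <- 30      # observations per group (hypothetical sample size)

any_false_positive <- replicate(n_sims, {
  # All three groups share the same true mean, so no real difference exists
  sim_a <- rnorm(n_obs, mean = 60, sd = 15)
  sim_b <- rnorm(n_obs, mean = 60, sd = 15)
  sim_c <- rnorm(n_obs, mean = 60, sd = 15)

  p_values <- c(
    t.test(sim_a, sim_b)$p.value,
    t.test(sim_a, sim_c)$p.value,
    t.test(sim_b, sim_c)$p.value
  )

  # TRUE if at least one of the three tests is (wrongly) significant
  any(p_values < 0.05)
})

# Share of experiments with at least one Type I error
mean(any_false_positive)
```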
12.ANOVA
In the last exercise, you saw that the probability of making a Type I
error got dangerously high as you performed more t-tests.
When comparing more than two numerical datasets, the best way to
preserve a Type I error probability of 0.05 is to use ANOVA. ANOVA
(Analysis of Variance) tests the null hypothesis that all of the
datasets you are considering have the same mean. If you reject the null
hypothesis with ANOVA, you’re saying that at least one of the sets has a
different mean; however, it does not tell you which datasets are
different.
You can use the stats package function aov() to perform ANOVA on
multiple datasets. aov() takes the different datasets combined into a
data frame as an argument. For example, if you were comparing scores on
a video game between math majors, writing majors, and psychology majors,
you could format the data in a data frame df_scores as follows:
group               score
math major          88
math major          81
writing major       92
writing major       80
psychology major    94
psychology major    83
You can then run an ANOVA test with this line:
results <- aov(score ~ group, data = df_scores)
Note: score ~ group indicates the relationship you want to analyze
(i.e. how each group, or major, relates to score on the video game)
To retrieve the p-value from the results of calling aov(), use the
summary() function:
summary(results)
The null hypothesis, in this case, is that all three populations have
the same mean score on this video game. If you reject this null
hypothesis (if the p-value is less than 0.05), you can say you are
reasonably confident that at least one pair of populations has
significantly different means. After using only ANOVA, however, you
can’t draw any conclusions about which populations differ.
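The steps described above can be put together in one runnable sketch. It builds df_scores from the six rows shown in the example table, runs aov(), and pulls the p-value out of the summary. The name p_value and the indexing used to extract it are illustrative choices; with only two scores per group this is a toy dataset, so the result says little about real majors.

```r
# Data frame built from the example table above
df_scores <- data.frame(
  group = c("math major", "math major",
            "writing major", "writing major",
            "psychology major", "psychology major"),
  score = c(88, 81, 92, 80, 94, 83)
)

# One-way ANOVA: does mean score differ by group?
results <- aov(score ~ group, data = df_scores)
summary(results)

# summary() returns a list whose first element is the ANOVA table;
# the first entry of its Pr(>F) column is the p-value for group
p_value <- summary(results)[[1]][["Pr(>F)"]][1]
p_value

# Compare against a significance level of 0.05
p_value < 0.05
```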
Let’s look at an example of ANOVA in action.
Instructions
# load libraries
library(tidyr)
# load data
load("stores.Rda")
load("stores_new.Rda")
# inspect stores here:
stores
knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo5.png")

2.Perform an ANOVA on the stores data and save the test results to a
variable results. Use the summary() function to view the p-value of the
test. Does this p-value lead you to reject the null hypothesis?
# perform anova on stores here:
results <- aov(sales ~ store, data = stores)
summary(results)
             Df Sum Sq Mean Sq F value   Pr(>F)
store         2   3985  1992.6   8.957 0.000153 ***
Residuals   447  99437   222.5
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The column labeled Pr(>F) is the p-value for the F-test.
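To see where the F value and Pr(>F) come from, the table can be recomputed from its own sums of squares and degrees of freedom. This is a sketch using the rounded figures printed above, so the results may differ from the table in the last digit. The variable names are illustrative.

```r
# Values copied from the summary(results) table above
ss_store <- 3985;  df_store <- 2
ss_resid <- 99437; df_resid <- 447

# Mean squares are sums of squares divided by their degrees of freedom
ms_store <- ss_store / df_store
ms_resid <- ss_resid / df_resid

# F statistic: between-store variation relative to within-store variation
f_value <- ms_store / ms_resid
f_value

# The upper tail of the F distribution gives Pr(>F)
p_value <- pf(f_value, df_store, df_resid, lower.tail = FALSE)
p_value
```

A large F value means the store means are spread out far more than the variation within each store would explain, which is why the upper-tail probability is so small.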
3.Let’s say the sales at location B have suddenly soared (maybe
there’s an ant convention happening nearby). The new sales for location
B have been updated in the stores_new data frame.
Re-run the ANOVA test on stores_new and save the test results to a
variable results_new. Use the summary() function to see what the p-value
is now. Does this new value make sense?
# perform anova on stores_new here:
results_new <- aov(sales ~ store, data = stores_new)
summary(results_new)
             Df Sum Sq Mean Sq F value Pr(>F)
store         2 775599  387799    1805 <2e-16 ***
Residuals   447  96058     215
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
- Key value: Pr(>F) = <2e-16, which is shorthand for a p-value less
than 0.0000000000000002.
- This is far below any common significance level (such as 0.05 or
0.01), so you reject the null hypothesis.
- There is a statistically significant difference between the mean
sales of the stores.
- A difference this large is extremely unlikely to be due to chance
alone.
There’s overwhelming evidence that average sales differ among the three
stores.
13.Assumptions of Numerical Hypothesis Tests
Before you use numerical hypothesis tests, you need to be sure that
the following things are true:
- The samples should each be normally distributed…ish
Data analysts in
the real world often still perform hypothesis tests on datasets that
aren’t exactly normally distributed. What is more important is to
recognize if there is some reason to believe that a normal distribution
is especially unlikely. If your dataset is definitively not normal, the
numerical hypothesis tests won’t work as intended.
For example, imagine you have three datasets, each representing a day
of traffic data in three different cities. Each dataset is independent,
as traffic in one city should not impact traffic in another city.
However, it is unlikely that each dataset is normally distributed. In
fact, each dataset probably has two distinct peaks, one at the morning
rush hour and one during the evening rush hour. The histogram of a day
of traffic data might look something like this:
knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo6.png")

In this scenario, using a numerical hypothesis test would be
inappropriate.
- The population standard deviations of the groups should be equal
For
ANOVA and Two Sample T-Tests, using datasets with standard deviations
that are significantly different from each other will often obscure the
differences in group means.
To check for similarity between the standard deviations, it is
normally sufficient to divide the two standard deviations and see if the
ratio is “close enough” to 1. “Close enough” may differ in different
contexts, but generally staying within 10% should suffice.
- The samples must be independent
When comparing two or more datasets,
the values in one distribution should not affect the values in another
distribution. In other words, knowing more about one distribution should
not give you any information about any other distribution.
Here are some examples where it would seem the samples are not
independent:
1.the number of goals scored per soccer player before, during, and
after undergoing a rigorous training regimen
2.a group of patients’ blood pressure levels before, during, and
after the administration of a drug
It is important to understand your datasets before you begin
conducting hypothesis tests on them so that you know you are choosing
the right test.
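The standard deviation check described above is easy to automate. The sketch below defines a hypothetical helper, sd_ratio_ok(), that applies the 10% rule of thumb, and tries it on the standard deviations reported for the three VeryAnts stores earlier in the lesson. The helper name and its tolerance argument are assumptions made for illustration, and the 10% cutoff is a guideline rather than a formal test.

```r
# Hypothetical helper: TRUE when the ratio of two standard deviations
# is within `tolerance` of 1 (0.10 mirrors the 10% rule of thumb)
sd_ratio_ok <- function(sd_x, sd_y, tolerance = 0.10) {
  ratio <- sd_x / sd_y
  abs(ratio - 1) <= tolerance
}

# Standard deviations reported for the VeryAnts stores earlier
store_a_sd <- 14.80313
store_b_sd <- 14.79597
store_c_sd <- 15.14302

# Check every pair of stores
sd_ratio_ok(store_a_sd, store_b_sd)
sd_ratio_ok(store_a_sd, store_c_sd)
sd_ratio_ok(store_b_sd, store_c_sd)
```

In practice you would pass sd() of each sample directly, for example sd_ratio_ok(sd(store_a), sd(store_b)).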
Instructions
1.Use the base R hist() function to display the histograms for
dist_one, dist_two, dist_three, and dist_four.
# load data
load("dist_one.Rda")
load("dist_two.Rda")
load("dist_three.Rda")
load("dist_four.Rda")
# plot histograms and define not_normal here:
hist(dist_one)
knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo7.png")

hist(dist_two)
knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo8.png")

hist(dist_three)
knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo9.png")

hist(dist_four)
knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo10.png")

2.Do the distributions look normal?
One of these distributions would probably not be a good choice to use
in an ANOVA comparison. Create a variable called not_normal and set it
equal to the distribution number (1, 2, 3, or 4) that would be least
suited for use in an ANOVA test.
Hint: A normal distribution has a bell-shaped curve with a single
peak.
not_normal <- 4
not_normal
[1] 4
3.Calculate the ratio of standard deviations between dist_two and
dist_three, and store the value in a variable called ratio. View ratio.
Is this “close enough” to perform a numerical hypothesis test between
the two datasets?
# define ratio here:
ratio <- sd(dist_two) / sd(dist_three)
ratio
[1] 0.5784782
One of the assumptions of a numerical hypothesis test is that the
ratio of the standard deviations of the datasets is close to 1.
Since the ratio is not close to 1, these datasets should not be used
together in a numerical hypothesis test.
14.Review
Phew! Nobody said hypothesis testing is easy, but you made it to the
end of the lesson. Congratulations! The world of hypothesis testing is
vast. There is much more you can learn, and there are many applications
where you can use these tests.
Let’s review what you’ve learned in this lesson:
1.Samples are subsets of an entire population, and the sample mean
can be used to approximate the population mean
2.The null hypothesis is an assumption that there is no difference
between the populations you are comparing in a hypothesis test
3.Type I Errors occur when a hypothesis test rejects a null hypothesis
that is actually true (a false positive), and Type II Errors occur when
a hypothesis test fails to reject a null hypothesis that is actually
false (a false negative)
4.P-Values indicate the probability that, assuming the null
hypothesis is true, differences at least as large as those observed in
the samples you are comparing would occur
5.The Significance Level is a threshold p-value for which all
p-values below it will result in rejecting the null hypothesis
6.One Sample T-Tests indicate whether a dataset belongs to a
distribution with a given mean
7.Two Sample T-Tests indicate whether there is a significant
difference between two datasets
8.ANOVA (Analysis of Variance) allows you to detect whether at least
one of multiple datasets has a mean that differs significantly from the
others
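The decision rule from item 5 can be sketched in a few lines, using the p-values reported by the three pairwise t-tests in the VeryAnts exercise. The names alpha, p_values, and reject_null are illustrative. Keep in mind the caveat from that exercise: running several t-tests at the same significance level raises the overall chance of a Type I error.

```r
alpha <- 0.05   # significance level, chosen before looking at the data

# p-values reported by the pairwise t-tests in the VeryAnts exercise
p_values <- c(a_vs_b = 2.767e-05, a_vs_c = 0.02101, b_vs_c = 0.05989)

# Reject the null hypothesis only when the p-value falls below alpha
reject_null <- p_values < alpha
reject_null
```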
