1.Introduction
Say you work for a major social media website. Your boss comes to you
with two questions:
1.does the demographic of users on your site match the company’s
expectation?
2.did the new interface update affect user engagement?
With terabytes of user data at your hands, you decide the best way to
answer these questions is with statistical hypothesis tests!
Statistical hypothesis testing is a process that allows you to
evaluate if a change or difference seen in a dataset is “real”, or if
it’s just a result of random fluctuation in the data.
Hypothesis testing can be an integral component of any decision
making process. It provides a framework for evaluating how confident one
can be in making conclusions based on data. Some instances where this
might come up include:
1.a professor expects an exam average to be roughly 75%, and wants to
know if the actual scores line up with this expectation. Was the test
actually too easy or too hard?
2.a product manager for a website wants to compare the time spent on
different versions of a homepage. Does one version make users stay on
the page significantly longer?
In this lesson, you will cover the fundamental concepts that will
help you run and evaluate hypothesis tests:
1.Sample and Population Mean
2.P-Values
3.Significance Level
4.Type I and Type II Errors
You will then learn about three different hypothesis tests you can
perform to answer the kinds of questions discussed above:
1.One Sample T-Test
2.Two Sample T-Test
3.ANOVA (Analysis of Variance)
Let’s get started!
Instructions
The code in notebook.Rmd performs a hypothesis test on data for a
company BuyPie.com. The test evaluates whether the time spent per
visitor on the website changes significantly between two weeks.
Read the output at the bottom of the rendered notebook. Do you think
there is a difference in time spent per visitor between Week 1 and Week
2?
By the end of the lesson, you will be able to perform and interpret
such hypothesis tests yourself!
# load data
load("week_1.Rda")
load("week_2.Rda")
# calculate week_1_mean and week_2_mean:
week_1_mean <- mean(week_1)
week_1_mean
week_2_mean <- mean(week_2)
week_2_mean
[1] 25.44806
[1] 29.02157
# calculate week_1_sd and week_2_sd:
week_1_sd <- sd(week_1)
week_1_sd
week_2_sd <- sd(week_2)
week_2_sd
[1] 4.577702
[1] 5.553785
# run two sample t-test:
results <- t.test(week_1,week_2)
results
Welch Two Sample t-test
data: week_1 and week_2
t = -3.5109, df = 94.554, p-value = 0.0006863
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-5.594299 -1.552718
sample estimates:
mean of x mean of y
25.44806 29.02157
2.Sample Mean and Population Mean - I
Suppose you want to know the average height of an oak tree in your
local park. On Monday, you measure 10 trees and get an average height of
32 ft. On Tuesday, you measure 12 different trees and reach an average
height of 35 ft. On Wednesday, you measure the remaining 11 trees in the
park, whose average height is 31 ft. The average height for all 33 trees
in your local park is 32.8 ft.
The collection of individual height measurements on Monday, Tuesday,
and Wednesday are each called samples. A sample is a subset of the
entire population (all the oak trees in the park). The mean of each
sample is a sample mean and it is an estimate of the population
mean.
Note: the sample means (32 ft., 35 ft., and 31 ft.) were all close to
the population mean (32.8 ft.), but were all slightly different from the
population mean and from each other.
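The 32.8 ft. figure can be checked by weighting each day's sample mean by the number of trees measured that day. A minimal sketch in R; the vector names n_trees, sample_means, and park_mean are illustrative and not part of the lesson's workspace:

```r
# number of trees measured each day and the mean height (ft) of each sample
n_trees      <- c(monday = 10, tuesday = 12, wednesday = 11)
sample_means <- c(monday = 32, tuesday = 35, wednesday = 31)

# the three samples together cover all 33 trees, so the population mean
# is the average of the sample means weighted by sample size
park_mean <- weighted.mean(sample_means, w = n_trees)
round(park_mean, 1)   # 32.8

# how far each sample mean lies from the population mean
sample_means - park_mean
```

A plain mean(sample_means) would treat the three days as equally large samples, which they are not; the weighting is what makes the result equal the mean of all 33 trees.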
For a population, the mean is a constant value no matter how many
times it’s recalculated. For a sample, however, the mean depends on
exactly which members of the population are selected. From a sample
mean, we can then estimate the mean of the population as a whole. There
are three main reasons we might use sampling:
1.data on the entire population is not available
2.data on the entire population is available, but it is so large that
it is unfeasible to analyze
3.meaningful answers to questions can be found faster with
sampling
Instructions
1.In the workspace, we’ve generated a random population of size 300
that follows a normal distribution with a mean of 65. Update the value
of population_mean to store the mean() of population. Does it closely
match your expectation?
# generate random population
population <- rnorm(300, mean=65, sd=3.5)
# calculate population mean here:
population_mean <- mean(population)
population_mean
[1] 64.90532
2.Let’s look at how the means of different samples can vary within
the same population.
The code in the notebook generates 5 random samples from population.
sample_1 is displayed and sample_1_mean has been calculated.
Replace the “Not calculated” strings with calculations of the means
for sample_2, sample_3, sample_4, and sample_5.
Look at the population mean and the sample means. Are they all the
same? All different? Why?
# generate sample 1
sample_1 <- sample(population, size=30)
sample_1
[1] 67.54371 70.74646 58.92483 66.32206 70.76844 61.03942 61.92440 65.01594 66.31625 66.64947 67.29845 70.28490
[13] 61.20255 63.25246 62.48976 65.40590 65.50519 58.48546 64.96503 63.32665 63.64597 65.27635 64.68773 64.14368
[25] 65.83720 60.94812 56.08455 64.11198 67.19988 64.40878
# calculate sample 1 mean
sample_1_mean <- mean(sample_1)
sample_1_mean
[1] 64.46039
# generate samples 2,3,4 and 5
sample_2 <- sample(population, size=30)
sample_3 <- sample(population, size=30)
sample_4 <- sample(population, size=30)
sample_5 <- sample(population, size=30)
# calculate sample means here:
sample_2_mean <- mean(sample_2)
sample_2_mean
[1] 65.82616
sample_3_mean <- mean(sample_3)
sample_3_mean
[1] 64.76717
sample_4_mean <- mean(sample_4)
sample_4_mean
[1] 65.69348
sample_5_mean <- mean(sample_5)
sample_5_mean
[1] 64.93156
3.Sample Mean and Population Mean - II
In the previous exercise, the sample means you calculated closely
approximated the population mean. This won’t always be the case!
Consider a tailor of school uniforms at a school for students aged 11
to 13. The tailor needs to know the average height of all the students
in order to know which sizes to make the uniforms.
The tailor measures the heights of a random sample of 20 students out
of the 300 in the school. The average height of the sample is 57.5
inches. Using this sample mean, the tailor makes uniforms that fit
students of this height, some smaller, and some larger.
After delivering the uniforms, the tailor starts to receive some
feedback — many of the uniforms are too small! They go back to take
measurements on the rest of the students, collecting the following
data:
1.11 year olds average height: 56.7 inches
2.12 year olds average height: 59 inches
3.13 year olds average height: 62.8 inches
4.All students average height (population mean): 59.5 inches
The original sample mean was off from the population mean by 2
inches! How did this happen?
The random sample of 20 students was skewed to one direction of the
total population. More 11 year olds were chosen in the sample than is
representative of the whole school, bringing down the average height of
the sample. This is called a sampling error, and occurs when a sample is
not representative of the population it comes from. How do you get an
average sample height that looks more like the average population
height, and reduce the chance of a sampling error?
Selecting only 20 students for the sample allowed for the chance that
only younger, shorter students were included. This is a natural
consequence of the fact that a sample has less data than the population
to which it belongs. If the sample selection is poor, then you will have
a sample mean seriously skewed from the population mean.
There is one reliable way to reduce the risk of having a skewed
sample mean: take a larger sample! The mean of a larger sample will, on
average, lie closer to the population mean, which reduces the chance of
a large sampling error.
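The effect of sample size can be sketched with a small simulation. The population below is hypothetical (300 heights drawn from a normal distribution with an assumed standard deviation of 3 inches), and mean_abs_error() is a helper written for this illustration only:

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# hypothetical population: heights (inches) of 300 students
population <- rnorm(300, mean = 59.5, sd = 3)
population_mean <- mean(population)

# average distance between the sample mean and the population mean,
# taken over many random samples of size n
mean_abs_error <- function(n, reps = 2000) {
  errors <- replicate(reps, abs(mean(sample(population, size = n)) - population_mean))
  mean(errors)
}

sizes <- c(5, 20, 80, 200)
errors_by_size <- sapply(sizes, mean_abs_error)
data.frame(sample_size = sizes, mean_abs_error = round(errors_by_size, 3))
```

The typical gap between sample mean and population mean should shrink as the sample size grows, although any single small sample can still land close to the population mean by luck.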
Instructions
In the workspace, we have a population that is normally distributed.
Generate samples of different sizes and see how the sample mean could
differ from the population mean.
What happens to the difference between the sample mean and the
population mean as you increase the sample size?
knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo1.png")

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo2.png")

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo3.png")

knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo4.png")
4.Hypothesis Formulation
You begin the statistical hypothesis testing process by defining a
hypothesis, or an assumption about your population that you want to
test. A hypothesis can be written in words, but can also be explained in
terms of the sample and population means you just learned about.
Say you are developing a website and want to compare the time spent
on different versions of a homepage. You could run a hypothesis test to
see if version A or B makes users stay on the page significantly longer.
Your hypothesis might be:
“The average time spent on homepage A is greater than the average
time spent on homepage B.”
While this is a fine hypothesis to make, data analysts are often very
hesitant people. They don’t like to make bold claims without having data
to back them up! Thus when constructing hypotheses for a hypothesis
test, you want to formulate a null hypothesis. A null hypothesis states
that there is no difference between the populations you are comparing,
and it implies that any difference seen in the sample data is due to
sampling error. A null hypothesis for the same scenario is as
follows:
“The average time spent on homepage A is the same as the average time
spent on homepage B.”
You could also restate this in terms of population mean:
“The population mean of time spent on homepage A is the same as the
population mean of time spent on homepage B.”
After collecting some sample data on how users interact with each
homepage, you can then run a hypothesis test using the data collected to
determine whether your null hypothesis can be rejected (i.e. whether the
data give evidence of a difference in time spent on homepage A and
homepage B). A hypothesis test never proves the null hypothesis true; it
only tells you whether the data are strong enough to reject it.
Instructions
1.A researcher at a pharmaceutical company is working on the
development of a new medication to lower blood pressure, DeePressurize.
They run an experiment with a control group of 100 patients that receive
a placebo (a sugar pill), and an experimental group of 100 patients that
receive DeePressurize. Blood pressure measurements are taken after a 3
month period on both groups of patients.
The researcher wants to run a hypothesis test to compare the
resulting datasets. Two hypotheses, hypo_a and hypo_b, are given in
notebook.Rmd. Which could be a null hypothesis for comparing the two
sets of data? Update the value of null_hypo_1 to the string “hypo_a” or
“hypo_b” based on your answer.
# experiment 1
hypo_a <- "DeePressurize lowers blood pressure in patients."
hypo_b <- "DeePressurize has no effect on blood pressure in patients."
null_hypo_1 <- "hypo_b"
null_hypo_1
[1] "hypo_b"
2.A product manager at a dating app company is developing a new user
profile page with a different picture layout. They want to see if the
new layout results in more matches between users than the current
layout. 50% of profiles are updated to the new layout, and over a 1
month period the number of matches for users with the new layout and the
original layout are recorded.
The product manager wants to run a hypothesis test to compare the
resulting datasets. Two hypotheses, hypo_c and hypo_d, are given in
notebook.Rmd. Which could be a null hypothesis for comparing the two
sets of data? Update the value of null_hypo_2 to the string “hypo_c” or
“hypo_d” based on your answer.
# experiment 2
hypo_c <- "The new profile layout has no effect on number of matches with other users."
hypo_d <- "The new profile layout results in more matches with other users than the original layout."
null_hypo_2 <- "hypo_c"
null_hypo_2
[1] "hypo_c"
5.Designing an Experiment
Suppose you want to know if students who study history are more
interested in volleyball than students who study chemistry. Before doing
anything else to answer your original question, you come up with a null
hypothesis: “History and chemistry students are interested in volleyball
at the same rates.”
To test this hypothesis, you need to design an experiment and collect
data. You invite 100 history majors and 100 chemistry majors from your
university to join an extracurricular volleyball team. After one week,
34 history majors sign up (34%), and 39 chemistry majors sign up (39%).
More chemistry majors than history majors signed up, but is this a
“real”, or significant difference? Can you conclude that students who
study chemistry are more interested in volleyball than students who
study history?
In your experiment, the 100 history and 100 chemistry majors at your
university are samples of their respective populations (all history and
chemistry majors). The sample means are the percentages of history
majors (34%) and chemistry majors (39%) that signed up for the team, and
the difference in sample means is 39% - 34% = 5%. The population means
are the percentage of history and chemistry majors worldwide that would
sign up for an extracurricular volleyball team if given the chance.
You want to know if the difference you observed in these sample means
(5%) reflects a difference in the population means, or if the difference
was caused by sampling error, and the samples of students you chose do
not represent the greater populations of history and chemistry
students.
Restating the null hypothesis in terms of the population means yields
the following:
“The percentage of all history majors who would sign up for
volleyball is the same as the percentage of all chemistry majors who
would sign up for volleyball, and the observed difference in sample
means is due to sampling error.”
This is the same as saying, “If you gave the same volleyball
invitation to every history and chemistry major in the world, they would
sign up at the same rate, and the two samples you selected (200 students
in total) are not representative of their populations.”
Instructions
1.Your friend is a dog walker that specializes in working with Golden
Retrievers and Goldendoodles. They are interested in knowing if there is
a significant difference in the lengths of the two breeds. After a few
weeks of data collection, they give you a spreadsheet of 10 Golden
Retrievers’ lengths and 10 Goldendoodles’ lengths.
The lengths of the dogs are given in retriever_lengths and
doodle_lengths. Calculate the mean of each breed and save the results to
mean_retriever_l and mean_doodle_l. View mean_retriever_l and
mean_doodle_l.
# load data
load("retriever_lengths.Rda")
load("doodle_lengths.Rda")
# calculate mean_retriever_l and mean_doodle_l here:
mean_retriever_l <- mean(retriever_lengths)
mean_retriever_l
mean_doodle_l <- mean(doodle_lengths)
mean_doodle_l
[1] 23
[1] 20.5
2.Calculate the difference between mean_retriever_l and mean_doodle_l
and save the result to mean_difference. View mean_difference.
# calculate mean_difference here:
mean_difference <- mean_retriever_l - mean_doodle_l
mean_difference
[1] 2.5
3.You want to run a hypothesis test to see if there is a significant
difference in the lengths of Golden Retrievers and Goldendoodles. Which
of the two statements could be a formulation of the null hypothesis?
Update the value of null_hypo with “st_1” or “st_2” depending on your
answer.
# statements:
st_1 <- "The average length of Golden Retrievers is 2.5 inches longer than the average length of Goldendoodles."
st_2 <- "The average length of Golden Retrievers is the same as the average length of Goldendoodles."
# update null_hypo here:
null_hypo <- "st_2"
null_hypo
[1] "st_2"
6.Type I and Type II Errors
When using automated processes to make decisions, you need to be
aware of how this automation can lead to mistakes. Computer programs can
be as fallible as the humans who design them. Because of this, there is
a responsibility to understand what can go wrong and what can be done to
contain these foreseeable problems.
In statistical hypothesis testing, there are two types of error. A
Type I error occurs when a hypothesis test finds a difference or
relationship that does not actually exist. This error is sometimes
called a “false positive” and occurs when the null hypothesis is
rejected even though it is true.
For example, consider the history and chemistry major experiment from
the previous exercise. Say you run a hypothesis test on the sample data
you collected and conclude that there is a significant difference in
interest in volleyball between history and chemistry majors. You have
rejected the null hypothesis that there is no difference between the two
populations of students. If, in reality, your results were due to the
groups you happened to pick (sampling error), and there actually is no
significant difference in interest in volleyball between history and
chemistry majors in the greater population, you have become the victim
of a false positive, or a Type I error.
The second kind of error, a Type II error, is failing to find a
difference or relationship that actually exists. This error is
referred to as a “false negative” and occurs when the null hypothesis is
not rejected even though it is false.
For example, with the history and chemistry student experiment, say
that after you perform the hypothesis test, you conclude that there is
no significant difference in interest in volleyball between history and
chemistry majors. You did not reject the null hypothesis. If there
actually is a difference in the populations as a whole, and there is a
significant difference in interest in volleyball between history and
chemistry majors, your test has resulted in a false negative, or a Type
II error.
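Both error types can be made concrete with a simulation. The sketch below relies on t.test() and a 0.05 threshold, which are covered later in this lesson; the group sizes, means, and standard deviations are assumptions chosen only for illustration:

```r
set.seed(1)      # arbitrary seed for reproducibility
alpha <- 0.05    # threshold below which a result counts as significant
n_tests <- 2000  # number of simulated experiments

# Case 1: both groups come from the SAME population, so the null
# hypothesis is true and every significant result is a Type I error
p_null_true <- replicate(n_tests, {
  group_a <- rnorm(30, mean = 50, sd = 10)
  group_b <- rnorm(30, mean = 50, sd = 10)
  t.test(group_a, group_b)$p.value
})
type_i_rate <- mean(p_null_true < alpha)

# Case 2: the populations really differ (means 50 vs 53), so the null
# hypothesis is false and every non-significant result is a Type II error
p_null_false <- replicate(n_tests, {
  group_a <- rnorm(30, mean = 50, sd = 10)
  group_b <- rnorm(30, mean = 53, sd = 10)
  t.test(group_a, group_b)$p.value
})
type_ii_rate <- mean(p_null_false >= alpha)

c(type_i_rate = type_i_rate, type_ii_rate = type_ii_rate)
```

Under these assumptions the Type I rate should sit near the chosen threshold, while the Type II rate depends on how large the true difference is relative to the noise and on the sample size.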
Instructions
# the true positives and negatives:
actual_positive <- c(2, 5, 6, 7, 8, 10, 18, 21, 24, 25, 29, 30, 32, 33, 38, 39, 42, 44, 45, 47)
actual_negative <- c(1, 3, 4, 9, 11, 12, 13, 14, 15, 16, 17, 19, 20, 22, 23, 26, 27, 28, 31, 34, 35, 36, 37, 40, 41, 43, 46, 48, 49)
# the positives and negatives we determine by running the experiment:
experimental_positive <- c(2, 4, 5, 7, 8, 9, 10, 11, 13, 15, 16, 17, 18, 19, 20, 21, 22, 24, 26, 27, 28, 32, 35, 36, 38, 39, 40, 45, 46, 49)
experimental_negative <- c(1, 3, 6, 12, 14, 23, 25, 29, 30, 31, 33, 34, 37, 41, 42, 43, 44, 47, 48)
# define type_i_errors and type_ii_errors here:
type_i_errors <- intersect(actual_negative, experimental_positive)
print('false positives')
[1] "false positives"
type_i_errors
[1] 4 9 11 13 15 16 17 19 20 22 26 27 28 35 36 40 46 49
2.Now, define type_ii_errors, the list representing the false
negatives of the experiment.
type_ii_errors <- intersect(actual_positive, experimental_negative)
print('false negatives')
[1] "false negatives"
type_ii_errors
[1] 6 25 29 30 33 42 44 47
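As a follow-up, the two error lists can be turned into counts and rates. This is a sketch rather than part of the exercise; false_positive_rate and false_negative_rate are names introduced here:

```r
# same data as in the exercise
actual_positive <- c(2, 5, 6, 7, 8, 10, 18, 21, 24, 25, 29, 30, 32, 33, 38, 39, 42, 44, 45, 47)
actual_negative <- c(1, 3, 4, 9, 11, 12, 13, 14, 15, 16, 17, 19, 20, 22, 23, 26, 27, 28, 31, 34, 35, 36, 37, 40, 41, 43, 46, 48, 49)
experimental_positive <- c(2, 4, 5, 7, 8, 9, 10, 11, 13, 15, 16, 17, 18, 19, 20, 21, 22, 24, 26, 27, 28, 32, 35, 36, 38, 39, 40, 45, 46, 49)
experimental_negative <- c(1, 3, 6, 12, 14, 23, 25, 29, 30, 31, 33, 34, 37, 41, 42, 43, 44, 47, 48)

type_i_errors  <- intersect(actual_negative, experimental_positive)
type_ii_errors <- intersect(actual_positive, experimental_negative)

# number of errors of each kind
length(type_i_errors)   # 18
length(type_ii_errors)  # 8

# share of true negatives wrongly flagged, and of true positives missed
false_positive_rate <- length(type_i_errors) / length(actual_negative)
false_negative_rate <- length(type_ii_errors) / length(actual_positive)
c(false_positive_rate = false_positive_rate, false_negative_rate = false_negative_rate)
```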
7.P-Values
You know that a hypothesis test is used to determine the validity of
a null hypothesis. Once again, the null hypothesis states that there is
no actual difference between the two populations of data. But what
result does a hypothesis test actually return, and how can you interpret
it?
A hypothesis test returns a few numeric measures, most of which are
out of the scope of this introductory lesson. Here we will focus on one:
p-values. P-values help you judge how strongly your data speak against
the null hypothesis. In this context, a p-value is the probability that,
assuming the null hypothesis is true, you would see a difference in the
sample means at least as large as the one in your data.
Consider the experiment on history and chemistry majors and their
interest in volleyball from a previous exercise:
Null Hypothesis: “History and chemistry students are interested in
volleyball at the same rates”
Experiment Sample Means: 34% of history majors and 39% of chemistry
majors sign up for the volleyball team
Assuming the null hypothesis is true, there is no actual difference in
preference for volleyball between all history and chemistry majors, and
any difference present in the experiment data is the result of sampling
error. Imagine you run a hypothesis test on this experiment data and it
returns a p-value of 0.04. A p-value of 0.04 indicates that, if the null
hypothesis is true, you could expect to see a difference of at least 5%
(calculated as 39% - 34% = 5%) in the sample means only 4% of the time.
Essentially, if you ran this same experiment 100 times, you would
expect to see as large a difference in the sample means only 4 times
given the assumption that there is no actual difference between the
populations (i.e. they have the same mean).
Seems like a really small probability, right? Are you thinking about
rejecting the null hypothesis you originally stated?
p-value ≠ probability that your result is wrong
p-value = probability of your data (or more extreme data) if the null
hypothesis is true
The p-value itself is not the probability that the null hypothesis is
wrong. In practice, though, a very small p-value means the observed data
would be unlikely if the null hypothesis were true, which counts as
evidence against it.
Instructions
1.You are a big fan of apples, so you gather 10 green and 10 red apples
to compare their weights. The green apples average 150 grams in weight,
and the red apples average 160 grams in weight.
You run a hypothesis test to see if there is a significant difference
in the weight of green and red apples. The test returns a p-value of
0.2. Which statement (st_1, st_2, st_3, or st_4) indicates how this
p-value can be interpreted?
Update the value of interpretation with the string “st_1”, “st_2”,
“st_3”, or “st_4” depending on your answer.
# possible interpretations
st_1 <- "There is a 20% chance that the difference in average weight of green and red apples is due to random sampling."
st_2 <- "There is a 20% chance that green and red apples have the same average weight."
st_3 <- "There is a 20% chance red apples weigh more than green apples."
st_4 <- "There is a 20% chance green apples weigh more than red apples."
# update the value of interpretation here:
interpretation <- "st_1"
interpretation
[1] "st_1"
A p-value of 0.2 means:
If green and red apples really weigh the same on average, there is a
20% chance of seeing a difference of 10 grams (or more) just due to
random variation.
8.Significance Level
While a hypothesis test will return a p-value indicating how
compatible your data are with the null hypothesis, it does not tell you
outright whether you should reject the null hypothesis. To make this
decision, you need to set a threshold: any p-value below it will result
in rejecting the null hypothesis. This threshold is known as the
significance level.
A higher significance level is more likely to give a false positive,
as it makes it “easier” to state that there is a difference in the
populations of your data when such a difference might not actually
exist. If you want to be very sure that the result is not due to
sampling error, you should select a very small significance level.
It is important to choose the significance level before you perform a
statistical hypothesis test. If you wait until after you receive a
p-value from a test, you might pick a significance level such that you
get the result you want to see. For instance, if someone is trying to
publish the results of their scientific study in a journal, they might
set a higher significance level that makes their results appear
statistically significant. Choosing a significance level in advance
helps keep everyone honest.
It is an industry standard to set a significance level of 0.05 or
less, meaning that if the null hypothesis is true, there is at most a 5%
chance of rejecting it by mistake (a false positive).
The p-value is the probability of getting results at least as extreme
as the observed ones, just by random chance, if the null hypothesis is
true.
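The decision rule described in this section fits in a few lines of R. This is a sketch: reject_null() is a hypothetical helper, not a function from the lesson or from base R, and the p-values passed to it are made up:

```r
alpha <- 0.05  # significance level, fixed before looking at the data

# hypothetical helper: TRUE when the p-value falls below the threshold
reject_null <- function(p_value, significance_level = alpha) {
  p_value < significance_level
}

reject_null(0.03)  # TRUE: significant at the 0.05 level
reject_null(0.20)  # FALSE: not enough evidence to reject
```

Keeping alpha as a value set at the top of the analysis, before any test is run, is one simple way to avoid adjusting the threshold after seeing the p-value.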
Instructions
1.Before you run a hypothesis test on a set of data, you set your
significance level to 0.05. The hypothesis test then returns a p-value
of 0.1. Can you reject the null hypothesis? Update the value of
reject_hypothesis to TRUE or FALSE depending on your answer.
A p-value of 0.1 is greater than the significance level of 0.05, so
you cannot reject the null hypothesis.
# update reject_hypothesis here:
reject_hypothesis <- FALSE
reject_hypothesis
9.One Sample T-Test
Consider the fictional business BuyPie, which sends ingredients for
pies to your household so that you can make them from scratch. Suppose
that a product manager hypothesizes the average age of visitors to
BuyPie.com is 30. In the past hour, the website had 100 visitors and the
average age was 31. Are the visitors older than expected? Or is this
just the result of chance (sampling error) and a small sample size?
You can test this using a One Sample T-Test. A One Sample T-Test
compares a sample mean to a hypothetical population mean. It answers the
question “If the sample came from a distribution with the desired mean,
how likely is a sample mean at least this far from it?”
The first step is formulating a null hypothesis, which again is the
hypothesis that there is no difference between the populations you are
comparing. The second population in a One Sample T-Test is the
hypothetical population you choose. The null hypothesis that this test
examines can be phrased as follows: “The set of samples belongs to a
population with the target mean”.
One result of a One Sample T-Test will be a p-value, which tells you
whether or not you can reject this null hypothesis. If the p-value you
receive is less than your significance level, normally 0.05, you can
reject the null hypothesis and state that there is a significant
difference.
R has a function called t.test() in the stats package which can
perform a One Sample T-Test for you.
t.test() requires two arguments, a distribution of values and an
expected mean:
results <- t.test(sample_distribution, mu = expected_mean)
1.sample_distribution is the sample of values that were collected
2.mu is an argument indicating the desired mean of the hypothetical
population
3.expected_mean is the value of the desired mean
t.test() will return, among other information we will not cover here,
a p-value: the probability of seeing a sample mean at least this far
from the specified mean if the sample really came from a distribution
with that mean.
P-values give you an idea of how confident you can be in a result.
Just because you don’t have enough data to detect a difference doesn’t
mean that there isn’t one. Generally, the more samples you have, the
smaller a difference you can detect.
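The object returned by t.test() is a list, so individual pieces such as the p-value can be pulled out by name. A minimal sketch, assuming a hypothetical sample of 25 simulated values (the data and the expected mean of 50 are made up for illustration):

```r
set.seed(7)  # arbitrary seed for reproducibility

# hypothetical sample: 25 values drawn from a population whose true mean is 52
sample_distribution <- rnorm(25, mean = 52, sd = 5)
expected_mean <- 50

results <- t.test(sample_distribution, mu = expected_mean)

results$statistic  # t statistic
results$parameter  # degrees of freedom (n - 1)
results$p.value    # p-value of the two-sided test
results$conf.int   # 95% confidence interval for the population mean
results$estimate   # sample mean
```

Extracting results$p.value is handy when the p-value needs to be compared with a significance level in code rather than read off the printed summary.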
Instructions
1.We have provided a small dataset called ages, representing the ages
of customers to BuyPie.com in the past hour, in notebook.Rmd.
Even with a small dataset like this, it is hard to make judgments
from just looking at the numbers.
To understand the data better, let’s look at the mean. Calculate the
mean of ages, and store the result in a variable called ages_mean. View
ages_mean.
# load and view data
ages <- c(32, 34, 29, 29, 22, 39, 38, 37, 38, 36, 30, 26, 22, 22)
ages
[1] 32 34 29 29 22 39 38 37 38 36 30 26 22 22
# calculate ages_mean here:
ages_mean <- mean(ages)
ages_mean
[1] 31
2.Use the t.test() function with ages to see what p-value the
experiment returns for this distribution, where we expect the mean to be
30.
Store the results of the test in a variable called results.
Does the p-value you got with the One Sample T-Test make sense,
knowing the mean of ages?
# perform t-test here:
results <- t.test(ages, mu = 30)
results
One Sample t-test
data: ages
t = 0.59738, df = 13, p-value = 0.5605
alternative hypothesis: true mean is not equal to 30
95 percent confidence interval:
27.38359 34.61641
sample estimates:
mean of x
31
If the true mean is really 30 (null hypothesis is true), then there
is about a 56% chance you’d see a sample mean as far from 30 as 31 (or
even farther) just by chance.
It means this observed difference (mean of 31 instead of 30) is
completely plausible as random sampling error.
So you fail to reject the null hypothesis.
The data is consistent with the population mean being 30.
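To see where these numbers come from, the t statistic and p-value can be recomputed by hand from the sample mean, the sample standard deviation, and the sample size. This is a sketch of the standard one-sample formula; the variable names are illustrative:

```r
ages <- c(32, 34, 29, 29, 22, 39, 38, 37, 38, 36, 30, 26, 22, 22)
mu <- 30  # hypothesized population mean

n <- length(ages)
standard_error <- sd(ages) / sqrt(n)

# distance between sample mean and hypothesized mean, in standard errors
t_stat <- (mean(ages) - mu) / standard_error

# two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

round(t_stat, 5)   # 0.59738
round(p_value, 4)  # 0.5605

# cross-check against t.test()
all.equal(p_value, t.test(ages, mu = mu)$p.value)
```

The sample mean is only about 0.6 standard errors away from 30, which is why the p-value is so large.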
10.Two Sample T-Test
Suppose that last week, the average amount of time spent per visitor
to a website was 25 minutes. This week, the average amount of time spent
per visitor to a website was 29 minutes. Did the average time spent per
visitor change (i.e. was there a statistically significant bump in user
time on the site)? Or is this just part of natural fluctuations?
One way of testing whether this difference is significant is by using
a Two Sample T-Test. A Two Sample T-Test compares two sets of data,
which are both approximately normally distributed.
The null hypothesis, in this case, is that the two distributions have
the same mean.
You can use R’s t.test() function to perform a Two Sample T-Test, as
shown below:
results <- t.test(distribution_1, distribution_2)
When performing a Two Sample T-Test, t.test() takes two distributions
as arguments and returns, among other information, a p-value.
Remember, the p-value lets you know how likely it is that a
difference in the means at least this large would arise by chance alone
(sampling error) if the null hypothesis were true.
Instructions
1.We’ve created two distributions representing the time spent per
visitor to BuyPie.com last week, week_1, and the time spent per visitor
to BuyPie.com this week, week_2.
Find the means of these two distributions. Store them in week_1_mean
and week_2_mean. View both means.
# load data
week_1 <- c(23.90507, 26.67632, 27.27434, 24.25757, 32.40423, 39.56919, 23.07010, 29.82068, 27.59434, 28.05640, 27.06757, 30.41193, 25.71359, 24.94295, 28.23124, 24.95338, 18.51232, 27.46235, 28.38017, 13.91206, 29.02616, 26.90747, 22.86777, 24.89383, 25.96948, 26.86870, 20.72676, 27.35988, 20.68409, 21.19846, 16.25801, 23.92518, 24.47923, 29.47051, 27.28425, 26.93339, 28.61027, 18.88377, 33.65469, 25.69470, 20.98291, 22.69700, 28.60279, 21.36000, 30.77685, 20.83416, 23.79367, 19.75567, 29.54421, 20.14331)
week_1
[1] 23.90507 26.67632 27.27434 24.25757 32.40423 39.56919 23.07010 29.82068 27.59434 28.05640 27.06757 30.41193
[13] 25.71359 24.94295 28.23124 24.95338 18.51232 27.46235 28.38017 13.91206 29.02616 26.90747 22.86777 24.89383
[25] 25.96948 26.86870 20.72676 27.35988 20.68409 21.19846 16.25801 23.92518 24.47923 29.47051 27.28425 26.93339
[37] 28.61027 18.88377 33.65469 25.69470 20.98291 22.69700 28.60279 21.36000 30.77685 20.83416 23.79367 19.75567
[49] 29.54421 20.14331
week_2 <- c( 18.63432, 31.28788, 34.96798, 21.81678, 28.21620, 39.39314, 35.52223, 27.54222, 33.64395, 25.31674, 28.81392, 30.73580, 26.37242, 26.09456, 26.34073, 19.42196, 32.58798, 24.84002, 28.93348, 20.43668, 22.72496, 32.31728, 35.38431, 29.66710, 24.53513, 30.91406, 19.56118, 24.90817, 30.13164, 31.47466, 27.77684, 16.51307, 35.07702, 31.74818, 36.36053, 27.70501, 29.49870, 27.65575, 37.18504, 25.16055, 29.26554, 38.22163, 28.92102, 24.82154, 38.30155, 34.76021, 22.26869, 28.82594, 32.00975, 36.46438)
week_2
[1] 18.63432 31.28788 34.96798 21.81678 28.21620 39.39314 35.52223 27.54222 33.64395 25.31674 28.81392 30.73580
[13] 26.37242 26.09456 26.34073 19.42196 32.58798 24.84002 28.93348 20.43668 22.72496 32.31728 35.38431 29.66710
[25] 24.53513 30.91406 19.56118 24.90817 30.13164 31.47466 27.77684 16.51307 35.07702 31.74818 36.36053 27.70501
[37] 29.49870 27.65575 37.18504 25.16055 29.26554 38.22163 28.92102 24.82154 38.30155 34.76021 22.26869 28.82594
[49] 32.00975 36.46438
# calculate week_1_mean and week_2_mean here:
week_1_mean <- mean(week_1)
week_1_mean
[1] 25.44806
week_2_mean <- mean(week_2)
week_2_mean
[1] 29.02157
2.Find the standard deviations of these two distributions. Store them
in week_1_sd and week_2_sd. View both standard deviations.
# calculate week_1_sd and week_2_sd here:
week_1_sd <- sd(week_1)
week_1_sd
[1] 4.577702
week_2_sd <- sd(week_2)
week_2_sd
[1] 5.553785
3.Run a Two Sample T-Test using the t.test() function.
Save the results to a variable called results and view it. Does the
p-value make sense, knowing what you know about these datasets?
# run two sample t-test here:
results<- t.test(week_1, week_2)
results
Assuming there is no real difference in average time spent between
the two weeks (null hypothesis), the probability of seeing a difference
as large as ~3.57 minutes (or more extreme) just by random sampling is
about 0.07%.
There is strong evidence that the average time per visitor changed
between week_1 and week_2.
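As a rough cross-check, the test can be reproduced from the summary
statistics alone. The sketch below assumes 50 observations per week (the
printed vectors end at index 50) and uses Welch's unequal-variance
formula, which t.test() applies by default. The names n_1, n_2, var_1,
var_2, std_error, t_stat, and welch_df are illustrative and do not come
from the lesson.

```r
# Summary statistics printed earlier in the notebook
week_1_mean <- 25.44806
week_1_sd <- 4.577702
week_2_mean <- 29.02157
week_2_sd <- 5.553785
n_1 <- 50  # assumed sample sizes, read off the printed vectors
n_2 <- 50

# Variance of each sample mean
var_1 <- week_1_sd^2 / n_1
var_2 <- week_2_sd^2 / n_2

# Welch t statistic and Welch-Satterthwaite degrees of freedom
std_error <- sqrt(var_1 + var_2)
t_stat <- (week_1_mean - week_2_mean) / std_error
welch_df <- (var_1 + var_2)^2 / (var_1^2 / (n_1 - 1) + var_2^2 / (n_2 - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = welch_df)
c(t = t_stat, df = welch_df, p = p_value)
```

These values should land close to the t = -3.5109, df = 94.554, and
p-value = 0.0006863 that t.test() reports for this data.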
11.Dangers of Multiple T-Tests
Suppose that you own a chain of stores that sell ants, called
VeryAnts. There are three different locations: A, B, and C. You want to
know if the average ant sales over the past year are significantly
different between the three locations.
At first, it seems that you could perform T-tests between each pair
of stores.
Recall that the significance level is the probability that you
incorrectly reject a true null hypothesis on each t-test. The more
t-tests you perform, the more likely you are to get at least one false
positive, a Type I error.
For a significance level of 0.05, if the null hypothesis is true, then
the probability of correctly failing to reject it is 1 – 0.05 = 0.95.
When you run another t-test, the probability of still avoiding a false
positive is 0.95 * 0.95, or 0.9025 (treating the tests as independent).
That means your probability of making at least one Type I error is now
close to 10%! This error probability only gets bigger the more t-tests
you do.
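The arithmetic above generalizes to any number of tests. The following
is a minimal sketch, assuming the tests are independent (pairwise tests
that share data are not strictly independent, so treat the numbers as an
approximation). The names alpha, num_tests, and familywise_error are
illustrative.

```r
alpha <- 0.05      # significance level for each individual test
num_tests <- 1:6   # number of t-tests performed

# Probability of at least one false positive across all tests,
# assuming the tests are independent
familywise_error <- 1 - (1 - alpha)^num_tests
data.frame(num_tests, familywise_error = round(familywise_error, 4))
```

With two tests the value is 0.0975, and with three tests it is 0.142625,
which matches the three pairwise comparisons needed for three stores.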
Instructions
1.We have created samples store_a, store_b, and store_c, representing
the sales at VeryAnts at locations A, B, and C, respectively. We want to
see if there’s a significant difference in sales between the three
locations.
Explore datasets store_a, store_b, and store_c by finding and viewing
the means and standard deviations of each one. Store the means in
variables called store_a_mean, store_b_mean, and store_c_mean. Store the
standard deviations in variables called store_a_sd, store_b_sd, and
store_c_sd.
# load data
load("store_a.Rda")
load("store_b.Rda")
load("store_c.Rda")
# calculate means here:
store_a_mean <- mean(store_a)
store_a_mean
store_b_mean <- mean(store_b)
store_b_mean
store_c_mean <- mean(store_c)
store_c_mean
[1] 58.34964
[1] 65.62629
[1] 62.36117
# calculate standard deviations here:
store_a_sd <- sd(store_a)
store_a_sd
store_b_sd <- sd(store_b)
store_b_sd
store_c_sd <- sd(store_c)
store_c_sd
[1] 14.80313
[1] 14.79597
[1] 15.14302
2.Perform a Two Sample T-test between each pair of location data.
Store the results of the tests in variables called a_b_results,
a_c_results, and b_c_results. View the results for each test.
# perform two sample t-test here:
a_b_results <- t.test(store_a, store_b)
a_b_results
a_c_results <- t.test(store_a, store_c)
a_c_results
b_c_results <- t.test(store_b, store_c)
b_c_results
    Welch Two Sample t-test

data:  store_a and store_b
t = -4.2581, df = 298, p-value = 2.767e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -10.639701  -3.913601
sample estimates:
mean of x mean of y
 58.34964  65.62629

    Welch Two Sample t-test

data:  store_a and store_c
t = -2.3201, df = 297.85, p-value = 0.02101
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -7.4142456 -0.6088286
sample estimates:
mean of x mean of y
 58.34964  62.36117

    Welch Two Sample t-test

data:  store_b and store_c
t = 1.8888, df = 297.84, p-value = 0.05989
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.1367903  6.6670182
sample estimates:
mean of x mean of y
 65.62629  62.36117
3.Store the probability of error for running three T-Tests in a
variable called error_prob. View error_prob.
# calculate error_prob here:
error_prob <- 1 - 0.95^3
error_prob
[1] 0.142625
12.ANOVA
In the last exercise, you saw that the probability of making a Type I
error got dangerously high as you performed more t-tests.
When comparing more than two numerical datasets, the best way to
preserve a Type I error probability of 0.05 is to use ANOVA. ANOVA
(Analysis of Variance) tests the null hypothesis that all of the
datasets you are considering have the same mean. If you reject the null
hypothesis with ANOVA, you’re saying that at least one of the sets has a
different mean; however, it does not tell you which datasets are
different.
You can use the stats package function aov() to perform ANOVA on
multiple datasets. aov() takes the different datasets combined into a
data frame as an argument. For example, if you were comparing scores on
a video game between math majors, writing majors, and psychology majors,
you could format the data in a data frame df_scores as follows:
group score
math major 88
math major 81
writing major 92
writing major 80
psychology major 94
psychology major 83
You can then run an ANOVA test with this line:
results <- aov(score ~ group, data = df_scores)
Note: score ~ group indicates the relationship you want to analyze
(i.e. how each group, or major, relates to score on the video game)
To retrieve the p-value from the results of calling aov(), use the
summary() function:
summary(results)
The null hypothesis, in this case, is that all three populations have
the same mean score on this video game. If you reject this null
hypothesis (if the p-value is less than 0.05), you can say you are
reasonably confident that at least one population mean differs from the
others. After using only ANOVA, however, you can’t make any conclusions
about which populations differ.
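To make the steps above concrete, here is a small self-contained sketch
that builds df_scores from the example table, runs aov(), and pulls the
p-value out of the summary. The names anova_table and p_value are
illustrative, and indexing the summary with [[1]] is one common way to
reach the underlying table.

```r
# Data frame built from the example table above
df_scores <- data.frame(
  group = c("math major", "math major",
            "writing major", "writing major",
            "psychology major", "psychology major"),
  score = c(88, 81, 92, 80, 94, 83)
)

results <- aov(score ~ group, data = df_scores)
anova_table <- summary(results)[[1]]  # the ANOVA table as a data frame

# p-value for the group effect (first row of the Pr(>F) column)
p_value <- anova_table[["Pr(>F)"]][1]
p_value < 0.05  # TRUE would mean: reject the null hypothesis
```

With only two scores per group, this toy table has very little power, so
it shows the mechanics rather than a meaningful result.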
Let’s look at an example of ANOVA in action.
Instructions
# load libraries
library(tidyr)
# load data
load("stores.Rda")
load("stores_new.Rda")
# inspect stores here:
stores
knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo5.png")

2.Perform an ANOVA on the stores data and save the test results to a
variable results. Use the summary() function to view the p-value of the
test. Does this p-value lead you to reject the null hypothesis?
# perform anova on stores here:
results <- aov(sales ~ store, data = stores)
summary(results)
             Df Sum Sq Mean Sq F value   Pr(>F)
store         2   3985  1992.6   8.957 0.000153 ***
Residuals   447  99437   222.5
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The column labeled Pr(>F) is the p-value for the F-test.
3.Let’s say the sales at location B have suddenly soared (maybe
there’s an ant convention happening nearby). The new sales for location
B have been updated in the stores_new data frame.
Re-run the ANOVA test on stores_new and save the test results to a
variable results_new. Use the summary() function to see what the p-value
is now. Does this new value make sense?
# perform anova on stores_new here:
results_new <- aov(sales ~ store, data = stores_new)
summary(results_new)
             Df Sum Sq Mean Sq F value Pr(>F)
store         2 775599  387799    1805 <2e-16 ***
Residuals   447  96058     215
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Key value: Pr(>F) = <2e-16. This is shorthand for a p-value less than
0.0000000000000002, which is far below any common significance level
(like 0.05 or 0.01), so you reject the null hypothesis.
There is a statistically significant difference between the mean sales
of the different stores. A difference this large would be extremely
unlikely to arise by chance if all three stores truly had the same mean
sales.
13.Assumptions of Numerical Hypothesis Tests
Before you use numerical hypothesis tests, you need to be sure that
the following things are true:
- The samples should each be normally distributed…ish
Data analysts in the real world often still perform hypothesis tests on
datasets that aren’t exactly normally distributed. What is more
important is to recognize if there is some reason to believe that a
normal distribution is especially unlikely. If your dataset is
definitively not normal, the numerical hypothesis tests won’t work as
intended.
For example, imagine you have three datasets, each representing a day
of traffic data in three different cities. Each dataset is independent,
as traffic in one city should not impact traffic in another city.
However, it is unlikely that each dataset is normally distributed. In
fact, each dataset probably has two distinct peaks, one at the morning
rush hour and one during the evening rush hour. The histogram of a day
of traffic data might look something like this:
knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo6.png")

In this scenario, using a numerical hypothesis test would be
inappropriate.
- The population standard deviations of the groups should be equal
For ANOVA and Two Sample T-Tests, using datasets with standard
deviations that are significantly different from each other will often
obscure the differences in group means.
To check for similarity between the standard deviations, it is
normally sufficient to divide the two standard deviations and see if the
ratio is “close enough” to 1. “Close enough” may differ in different
contexts, but generally staying within 10% should suffice.
- The samples must be independent
When comparing two or more datasets, the values in one distribution
should not affect the values in another distribution. In other words,
knowing more about one distribution should not give you any information
about any other distribution.
Here are some examples where it would seem the samples are not
independent:
1.the number of goals scored per soccer player before, during, and
after undergoing a rigorous training regimen
2.a group of patients’ blood pressure levels before, during, and
after the administration of a drug
It is important to understand your datasets before you begin
conducting hypothesis tests on them so that you know you are choosing
the right test.
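The standard deviation check described above can be sketched as a small
helper. The function name sd_ratio_close and its tolerance argument are
hypothetical, and the 10% default simply mirrors the rule of thumb given
in this section. The inputs are standard deviations reported earlier in
the lesson.

```r
# Hypothetical helper: TRUE when the ratio of two standard deviations
# is within `tolerance` of 1
sd_ratio_close <- function(sd_x, sd_y, tolerance = 0.10) {
  abs(sd_x / sd_y - 1) <= tolerance
}

# Stores A and B: the ratio is very close to 1
sd_ratio_close(14.80313, 14.79597)

# Week 1 and week 2: the ratio is about 0.82, outside the 10% band
sd_ratio_close(4.577702, 5.553785)
```

Note that t.test() in R defaults to Welch's version of the test, which
does not assume equal variances, so this check matters most for ANOVA
and for t-tests run with var.equal = TRUE.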
Instructions
1.Use the base R hist() function to display the histograms for
dist_one, dist_two, dist_three, and dist_four.
# load data
load("dist_one.Rda")
load("dist_two.Rda")
load("dist_three.Rda")
load("dist_four.Rda")
# plot histograms and define not_normal here:
hist(dist_one)
knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo7.png")

hist(dist_two)
knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo8.png")

hist(dist_three)
knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo9.png")

hist(dist_four)
knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo10.png")

2.Do the distributions look normal?
One of these distributions would probably not be a good choice to use
in an ANOVA comparison. Create a variable called not_normal and set it
equal to the distribution number (1, 2, 3, or 4) that would be least
suited for use in an ANOVA test.
Hint: A normal distribution has a bell-shaped curve with a single
peak.
not_normal <- 4
not_normal
[1] 4
3.Calculate the ratio of standard deviations between dist_two and
dist_three, and store the value in a variable called ratio. View ratio.
Is this “close enough” to perform a numerical hypothesis test between
the two datasets?
# define ratio here:
ratio <- sd(dist_two) / sd(dist_three)
ratio
[1] 0.5784782
One of the assumptions of a numerical hypothesis test is that the
ratio of the standard deviations of the datasets is close to 1.
Since the ratio is not close to 1, these datasets should not be used
together in a numerical hypothesis test.
14.Review
Phew! Nobody said hypothesis testing is easy, but you made it to the
end of the lesson. Congratulations! The world of hypothesis testing is
vast. There is much more you can learn, and so many applications where
you can use them.
Let’s review what you’ve learned in this lesson:
1.Samples are subsets of an entire population, and the sample mean
can be used to approximate the population mean
2.The null hypothesis is an assumption that there is no difference
between the populations you are comparing in a hypothesis test
3.Type I Errors occur when a hypothesis test rejects a null hypothesis
that is actually true (a false positive), and Type II Errors occur when
a hypothesis test fails to reject a null hypothesis that is actually
false (a false negative)
4.P-Values indicate the probability that, assuming the null
hypothesis is true, differences at least as large as those seen in the
samples you are comparing would occur
5.The Significance Level is a threshold p-value for which all
p-values below it will result in rejecting the null hypothesis
6.One Sample T-Tests indicate whether a dataset belongs to a
distribution with a given mean
7.Two Sample T-Tests indicate whether there is a significant
difference between the means of two datasets
8.ANOVA (Analysis of Variance) allows you to detect whether at least
one of several datasets has a significantly different mean
---
title: "Hypothesis Testing in R"
author: "Annabel Kuo"
date: "`r format(Sys.time(), '%Y-%m-%d %H:%M')`"
output: html_notebook
---

# 1.Introduction

Say you work for a major social media website. Your boss comes to you with two questions:

1.does the demographic of users on your site match the company’s expectation?

2.did the new interface update affect user engagement?

With terabytes of user data at your hands, you decide the best way to answer these questions is with statistical hypothesis tests!

Statistical hypothesis testing is a process that allows you to evaluate if a change or difference seen in a dataset is “real”, or if it’s just a result of random fluctuation in the data.

Hypothesis testing can be an integral component of any decision making process. It provides a framework for evaluating how confident one can be in making conclusions based on data. Some instances where this might come up include:

1.a professor expects an exam average to be roughly 75%, and wants to know if the actual scores line up with this expectation. Was the test actually too easy or too hard?

2.a product manager for a website wants to compare the time spent on different versions of a homepage. Does one version make users stay on the page significantly longer?

In this lesson, you will cover the fundamental concepts that will help you run and evaluate hypothesis tests:

1.Sample and Population Mean

2.P-Values

3.Significance Level

4.Type I and Type II Errors

You will then learn about three different hypothesis tests you can perform to answer the kinds of questions discussed above:

1.One Sample T-Test

2.Two Sample T-Test

3.ANOVA (Analysis of Variance)

Let’s get started!

## Instructions

The code in notebook.Rmd performs a hypothesis test on data for a company BuyPie.com. The test evaluates whether the time spent per visitor on the website changes significantly between two weeks.

Read the output at the bottom of the rendered notebook. Do you think there is a difference in time spent per visitor between Week 1 and Week 2?

By the end of the lesson, you will be able to perform and interpret such hypothesis tests yourself!

```{r message = FALSE}
# load data
load("week_1.Rda")
load("week_2.Rda")
```

```{r}
# calculate week_1_mean and week_2_mean:
week_1_mean <- mean(week_1)
week_1_mean
week_2_mean <- mean(week_2)
week_2_mean
```

[1] 25.44806

[1] 29.02157

```{r}
# calculate week_1_sd and week_2_sd:
week_1_sd <- sd(week_1)
week_1_sd
week_2_sd <- sd(week_2)
week_2_sd
```

[1] 4.577702

[1] 5.553785


```{r}
# run two sample t-test:
results <- t.test(week_1,week_2)
results
```

   Welch Two Sample t-test

data:  week_1 and week_2
t = -3.5109, df = 94.554, p-value = 0.0006863
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -5.594299 -1.552718
sample estimates:
mean of x mean of y 
 25.44806  29.02157 
 
 
# 2.Sample Mean and Population Mean - I

Suppose you want to know the average height of an oak tree in your local park. On Monday, you measure 10 trees and get an average height of 32 ft. On Tuesday, you measure 12 different trees and reach an average height of 35 ft. On Wednesday, you measure the remaining 11 trees in the park, whose average height is 31 ft. The average height for all 33 trees in your local park is 32.8 ft.

The collections of individual height measurements taken on Monday, Tuesday, and Wednesday are each called samples. A sample is a subset of the entire population (all the oak trees in the park). The mean of each sample is a sample mean, and it is an estimate of the population mean.

Note: the sample means (32 ft., 35 ft., and 31 ft.) were all close to the population mean (32.8 ft.), but were all slightly different from the population mean and from each other.

For a population, the mean is a constant value no matter how many times it’s recalculated. But with a set of samples, the mean will depend on exactly which samples are selected. From a sample mean, we can then extrapolate the mean of the population as a whole. There are three main reasons we might use sampling:

1.data on the entire population is not available

2.data on the entire population is available, but it is so large that it is unfeasible to analyze

3.meaningful answers to questions can be found faster with sampling

## Instructions

1.In the workspace, we’ve generated a random population of size 300 that follows a normal distribution with a mean of 65. Update the value of population_mean to store the mean() of population. Does it closely match your expectation?

```{r}
# generate random population
population <- rnorm(300, mean=65, sd=3.5)

# calculate population mean here:
population_mean <- mean(population)
population_mean
```
2.Let’s look at how the means of different samples can vary within the same population.

The code in the notebook generates 5 random samples from population. sample_1 is displayed and sample_1_mean has been calculated.

Replace the "Not calculated" strings with calculations of the means for sample_2, sample_3, sample_4, and sample_5.

Look at the population mean and the sample means. Are they all the same? All different? Why?

```{r}
# generate sample 1
sample_1 <- sample(population, size=30)
sample_1

# calculate sample 1 mean
sample_1_mean <- mean(sample_1)
sample_1_mean
```

```{r}
# generate samples 2,3,4 and 5
sample_2 <- sample(population, size=30)
sample_3 <- sample(population, size=30)
sample_4 <- sample(population, size=30)
sample_5 <- sample(population, size=30)
```

```{r}
# calculate sample means here:
sample_2_mean <- mean(sample_2)
sample_2_mean
sample_3_mean <- mean(sample_3)
sample_3_mean
sample_4_mean <- mean(sample_4)
sample_4_mean
sample_5_mean <- mean(sample_5)
sample_5_mean
```

# 3.Sample Mean and Population Mean - II

In the previous exercise, the sample means you calculated closely approximated the population mean. This won’t always be the case!

Consider a tailor of school uniforms at a school for students aged 11 to 13. The tailor needs to know the average height of all the students in order to know which sizes to make the uniforms.

The tailor measures the heights of a random sample of 20 students out of the 300 in the school. The average height of the sample is 57.5 inches. Using this sample mean, the tailor makes uniforms that fit students of this height, some smaller, and some larger.

After delivering the uniforms, the tailor starts to receive some feedback — many of the uniforms are too small! They go back to take measurements on the rest of the students, collecting the following data:

1.11 year olds average height: 56.7 inches

2.12 year olds average height: 59 inches

3.13 year olds average height: 62.8 inches

4.All students average height (population mean): 59.5 inches

The original sample mean was off from the population mean by 2 inches! How did this happen?

The random sample of 20 students was skewed to one direction of the total population. More 11 year olds were chosen in the sample than is representative of the whole school, bringing down the average height of the sample. This is called a sampling error, and occurs when a sample is not representative of the population it comes from. How do you get an average sample height that looks more like the average population height, and reduce the chance of a sampling error?

Selecting only 20 students for the sample allowed for the chance that only younger, shorter students were included. This is a natural consequence of the fact that a sample has less data than the population to which it belongs. If the sample selection is poor, then you will have a sample mean seriously skewed from the population mean.

One reliable way to reduce the risk of a skewed sample mean is to take a larger sample! The mean of a larger sample will tend to approximate the population mean more closely, reducing the chance of a sampling error.
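
The effect of sample size can be checked with a short simulation. This is a minimal sketch: it reuses the population settings from the earlier exercise (300 values, mean 65, standard deviation 3.5), while the seed, the sample sizes, the 1000 repetitions, and the names sample_sizes and mean_abs_error are arbitrary choices made for illustration.

```r
set.seed(42)  # arbitrary seed so the run is repeatable

# Same kind of population as in the earlier exercise
population <- rnorm(300, mean = 65, sd = 3.5)
population_mean <- mean(population)

# For each sample size, draw many samples and record how far the
# sample mean lands from the population mean on average
sample_sizes <- c(5, 20, 50, 150)
mean_abs_error <- sapply(sample_sizes, function(n) {
  sample_means <- replicate(1000, mean(sample(population, size = n)))
  mean(abs(sample_means - population_mean))
})

data.frame(sample_size = sample_sizes, mean_abs_error = mean_abs_error)
```

The average distance between sample mean and population mean should shrink as the sample size grows, although any single small sample can still land close to the population mean by luck.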


## Instructions

In the workspace, we have a population that is normally distributed. Generate samples of different sizes and see how the sample mean could differ from the population mean.

What happens to the difference between the sample mean and the population mean as you increase the sample size?

```{r Hypo1, out.width="60%"}
knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo1.png")
```
```{r Hypo2, out.width="60%"}
knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo2.png")
```
```{r Hypo3, out.width="60%"}
knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo3.png")
```
```{r Hypo4, out.width="60%"}
knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo4.png")
```

# 4.Hypothesis Formulation

You begin the statistical hypothesis testing process by defining a hypothesis, or an assumption about your population that you want to test. A hypothesis can be written in words, but can also be explained in terms of the sample and population means you just learned about.

Say you are developing a website and want to compare the time spent on different versions of a homepage. You could run a hypothesis test to see if version A or B makes users stay on the page significantly longer. Your hypothesis might be:

"The average time spent on homepage A is greater than the average time spent on homepage B."

While this is a fine hypothesis to make, data analysts are often very hesitant people. They don’t like to make bold claims without having data to back them up! Thus when constructing hypotheses for a hypothesis test, you want to formulate a null hypothesis. A null hypothesis states that there is no difference between the populations you are comparing, and it implies that any difference seen in the sample data is due to sampling error. A null hypothesis for the same scenario is as follows:

"The average time spent on homepage A is the same as the average time spent on homepage B."

You could also restate this in terms of population mean:

"The population mean of time spent on homepage A is the same as the population mean of time spent on homepage B."

After collecting some sample data on how users interact with each homepage, you can then run a hypothesis test on that data to determine whether the null hypothesis can be rejected (i.e. whether there is evidence of a difference in time spent on homepage A and homepage B).

## Instructions

1.A researcher at a pharmaceutical company is working on the development of a new medication to lower blood pressure, DeePressurize. They run an experiment with a control group of 100 patients that receive a placebo (a sugar pill), and an experimental group of 100 patients that receive DeePressurize. Blood pressure measurements are taken after a 3 month period on both groups of patients.

The researcher wants to run a hypothesis test to compare the resulting datasets. Two hypotheses, hypo_a and hypo_b, are given in notebook.Rmd. Which could be a null hypothesis for comparing the two sets of data? Update the value of null_hypo_1 to the string "hypo_a" or "hypo_b" based on your answer.

```{r}
# experiment 1
hypo_a <- "DeePressurize lowers blood pressure in patients."
hypo_b <- "DeePressurize has no effect on blood pressure in patients."
null_hypo_1 <- "hypo_b"
null_hypo_1
```

2.A product manager at a dating app company is developing a new user profile page with a different picture layout. They want to see if the new layout results in more matches between users than the current layout. 50% of profiles are updated to the new layout, and over a 1 month period the number of matches for users with the new layout and the original layout are recorded.

The product manager wants to run a hypothesis test to compare the resulting datasets. Two hypotheses, hypo_c and hypo_d, are given in notebook.Rmd. Which could be a null hypothesis for comparing the two sets of data? Update the value of null_hypo_2 to the string "hypo_c" or "hypo_d" based on your answer.

```{r}
# experiment 2
hypo_c <- "The new profile layout has no effect on number of matches with other users."
hypo_d <- "The new profile layout results in more matches with other users than the original layout."
null_hypo_2 <- "hypo_c"
null_hypo_2
```

# 5.Designing an Experiment

Suppose you want to know if students who study history are more interested in volleyball than students who study chemistry. Before doing anything else to answer your original question, you come up with a null hypothesis: "History and chemistry students are interested in volleyball at the same rates."

To test this hypothesis, you need to design an experiment and collect data. You invite 100 history majors and 100 chemistry majors from your university to join an extracurricular volleyball team. After one week, 34 history majors sign up (34%), and 39 chemistry majors sign up (39%). More chemistry majors than history majors signed up, but is this a “real”, or significant difference? Can you conclude that students who study chemistry are more interested in volleyball than students who study history?

In your experiment, the 100 history and 100 chemistry majors at your university are samples of their respective populations (all history and chemistry majors). The sample means are the percentages of history majors (34%) and chemistry majors (39%) that signed up for the team, and the difference in sample means is 39% - 34% = 5%. The population means are the percentage of history and chemistry majors worldwide that would sign up for an extracurricular volleyball team if given the chance.

You want to know if the difference you observed in these sample means (5%) reflects a difference in the population means, or if the difference was caused by sampling error, and the samples of students you chose do not represent the greater populations of history and chemistry students.

Restating the null hypothesis in terms of the population means yields the following:

"The percentage of all history majors who would sign up for volleyball is the same as the percentage of all chemistry majors who would sign up for volleyball, and the observed difference in sample means is due to sampling error."

This is the same as saying, “If you gave the same volleyball invitation to every history and chemistry major in the world, they would sign up at the same rate, and the sample of 200 students you selected are not representative of their populations.”

## Instructions

1.Your friend is a dog walker who specializes in working with Golden Retrievers and Goldendoodles. They are interested in knowing if there is a significant difference in the lengths of the two breeds. After a few weeks of data collection, they give you a spreadsheet of 10 Golden Retrievers’ lengths and 10 Goldendoodles’ lengths.

The lengths of the dogs are given in retriever_lengths and doodle_lengths. Calculate the mean of each breed and save the results to mean_retriever_l and mean_doodle_l. View mean_retriever_l and mean_doodle_l.

```{r}
# load data
load("retriever_lengths.Rda")
load("doodle_lengths.Rda")
```

```{r}
# calculate mean_retriever_l and mean_doodle_l here:
mean_retriever_l <- mean(retriever_lengths)
mean_retriever_l
mean_doodle_l <- mean(doodle_lengths)
mean_doodle_l
```

[1] 23

[1] 20.5


2.Calculate the difference between mean_retriever_l and mean_doodle_l and save the result to mean_difference. View mean_difference.

```{r}
# calculate mean_difference here:
mean_difference <- mean_retriever_l - mean_doodle_l
mean_difference
```

[1] 2.5

3.You want to run a hypothesis test to see if there is a significant difference in the lengths of Golden Retrievers and Goldendoodles. Which of the two statements could be a formulation of the null hypothesis?

Update the value of null_hypo with "st_1" or "st_2" depending on your answer.

```{r}
# statements:
st_1 <- "The average length of Golden Retrievers is 2.5 inches longer than the average length of Goldendoodles."
st_2 <- "The average length of Golden Retrievers is the same as the average length of Goldendoodles."

# update null_hypo here:
null_hypo <- "st_2"
null_hypo
```

# 6.Type I and Type II Errors

When using automated processes to make decisions, you need to be aware of how this automation can lead to mistakes. Computer programs can be as fallible as the humans who design them. Because of this, there is a responsibility to understand what can go wrong and what can be done to contain these foreseeable problems.

In statistical hypothesis testing, there are two types of error. A Type I error occurs when a hypothesis test finds a correlation between things that are not related. This error is sometimes called a “false positive” and occurs when the null hypothesis is rejected even though it is true.

For example, consider the history and chemistry major experiment from the previous exercise. Say you run a hypothesis test on the sample data you collected and conclude that there is a significant difference in interest in volleyball between history and chemistry majors. You have rejected the null hypothesis that there is no difference between the two populations of students. If, in reality, your results were due to the groups you happened to pick (sampling error), and there actually is no significant difference in interest in volleyball between history and chemistry majors in the greater population, you have become the victim of a false positive, or a Type I error.

The second kind of error, a Type II error, is failing to find a correlation between things that are actually related. This error is referred to as a “false negative” and occurs when the null hypothesis is not rejected even though it is false.

For example, with the history and chemistry student experiment, say that after you perform the hypothesis test, you conclude that there is no significant difference in interest in volleyball between history and chemistry majors. You did not reject the null hypothesis. If there actually is a difference in the populations as a whole, and there is a significant difference in interest in volleyball between history and chemistry majors, your test has resulted in a false negative, or a Type II error.
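
A Type I error rate can also be observed directly by simulation. The sketch below is illustrative only: the population parameters (mean 50, standard deviation 10), the sample size of 30, the seed, and the names p_values and type_i_rate are assumptions, and it uses the t.test() function that this lesson introduces for comparing two samples.

```r
set.seed(1)  # arbitrary seed so the run is repeatable

# Both samples come from the SAME population, so the null hypothesis
# is true and every rejection is a false positive
p_values <- replicate(2000, {
  sample_a <- rnorm(30, mean = 50, sd = 10)
  sample_b <- rnorm(30, mean = 50, sd = 10)
  t.test(sample_a, sample_b)$p.value
})

# Share of tests that wrongly reject the null at the 0.05 level
type_i_rate <- mean(p_values < 0.05)
type_i_rate
```

The resulting rate should sit near 0.05, since the threshold used to reject the null hypothesis is also the rate at which true null hypotheses get rejected.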

## Instructions

```{r}
# the true positives and negatives:
actual_positive <- c(2, 5, 6, 7, 8, 10, 18, 21, 24, 25, 29, 30, 32, 33, 38, 39, 42, 44, 45, 47)
actual_negative <- c(1, 3, 4, 9, 11, 12, 13, 14, 15, 16, 17, 19, 20, 22, 23, 26, 27, 28, 31, 34, 35, 36, 37, 40, 41, 43, 46, 48, 49)

# the positives and negatives we determine by running the experiment:
experimental_positive <- c(2, 4, 5, 7, 8, 9, 10, 11, 13, 15, 16, 17, 18, 19, 20, 21, 22, 24, 26, 27, 28, 32, 35, 36, 38, 39, 40, 45, 46, 49)
experimental_negative <- c(1, 3, 6, 12, 14, 23, 25, 29, 30, 31, 33, 34, 37, 41, 42, 43, 44, 47, 48)
```

```{r}
# define type_i_errors here:
type_i_errors <- intersect(actual_negative, experimental_positive)
print('false positives')
type_i_errors
```

2.Now, define type_ii_errors, the list representing the false negatives of the experiment.

```{r}
type_ii_errors <- intersect(actual_positive, experimental_negative)
print('false negatives')
type_ii_errors
```

# 7.P-Values

You know that a hypothesis test is used to determine the validity of a null hypothesis. Once again, the null hypothesis states that there is no actual difference between the two populations of data. But what result does a hypothesis test actually return, and how can you interpret it?

A hypothesis test returns a few numeric measures, most of which are out of the scope of this introductory lesson. Here we will focus on one: p-values. P-values help determine how confident you can be in validating the null hypothesis. In this context, a p-value is the probability that, assuming the null hypothesis is true, you would see at least such a difference in the sample means of your data.

Consider the experiment on history and chemistry majors and their interest in volleyball from a previous exercise:

Null Hypothesis: "History and chemistry students are interested in volleyball at the same rates"

Experiment Sample Means: 34% of history majors and 39% of chemistry majors sign up for the volleyball class

Assuming the null hypothesis is true, there is no actual difference in preference for volleyball between all history and chemistry majors, and any difference present in the experiment data is the result of sampling error. Imagine you run a hypothesis test on this experiment data and it returns a p-value of 0.04. A p-value of 0.04 indicates that, if the null hypothesis is true, you could expect to see a difference of at least 5 percentage points (calculated as 39% - 34% = 5%) in the sample means only 4% of the time.

Essentially, if you ran this same experiment 100 times, you would expect to see as large a difference in the sample means only 4 times given the assumption that there is no actual difference between the populations (i.e. they have the same mean).
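The "run the experiment many times" idea can be sketched directly. The lesson does not give group sizes, so the 800 students per major below is an assumption, chosen only so that the simulated value lands near the 0.04 in the example. The shared sign-up rate of 36.5% (the midpoint of 34% and 39%) is also an assumption about what the null hypothesis looks like.

```{r}
# Sketch: approximate a p-value by simulating the experiment under the null.
set.seed(1)
n_per_group <- 800    # assumed group size (not given in the lesson)
shared_rate <- 0.365  # assumed common sign-up rate if the null is true
observed_gap <- 40    # 5 percentage points of 800 students, as a head count
n_sims <- 10000

# Each simulated experiment draws both majors from the same population
history_signups <- rbinom(n_sims, size = n_per_group, prob = shared_rate)
chemistry_signups <- rbinom(n_sims, size = n_per_group, prob = shared_rate)

# Share of simulated experiments with a gap at least as large as the observed one
simulated_p <- mean(abs(history_signups - chemistry_signups) >= observed_gap)
simulated_p
```

The result is the fraction of null-hypothesis experiments that show a gap of 5 percentage points or more, which is what a p-value measures.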

Seems like a really small probability, right? Are you thinking about rejecting the null hypothesis you originally stated?


Keep two things apart when you answer:

p-value ≠ the probability that your result is wrong

p-value = the probability of seeing your data (or something more extreme) if the null hypothesis is true

The p-value itself is not the probability that the null hypothesis is false. In practice, though, a very small p-value means the observed data would be unusual under the null hypothesis, which counts as evidence against it.

## Instructions

1.You are a big fan of apples, so you gather 10 green and 10 red apples to compare their weights. The green apples average 150 grams in weight, and the red apples average 160 grams in weight.

You run a hypothesis test to see if there is a significant difference in the weight of green and red apples. The test returns a p-value of 0.2. Which statement (st_1, st_2, st_3, or st_4) indicates how this p-value can be interpreted?

Update the value of interpretation with the string "st_1", "st_2", "st_3", or "st_4" depending on your answer.

```{r}
# possible interpretations
st_1 <- "There is a 20% chance that the difference in average weight of green and red apples is due to random sampling."
st_2 <- "There is a 20% chance that green and red apples have the same average weight."
st_3 <- "There is a 20% chance red apples weigh more than green apples."
st_4 <- "There is a 20% chance green apples weigh more than red apples."

# update the value of interpretation here:
interpretation <- "st_1"
interpretation
```

A p-value of 0.2 means:

If green and red apples really weigh the same on average, there is a 20% chance of seeing a difference of 10 grams (or more) just due to random variation.

# 8.Significance Level

While a hypothesis test will return a p-value indicating a level of confidence in the null hypothesis, it does not definitively claim whether you should reject the null hypothesis. To make this decision, you need to determine a threshold p-value for which all p-values below it will result in rejecting the null hypothesis. This threshold is known as the significance level.

A higher significance level is more likely to give a false positive, as it makes it “easier” to state that there is a difference in the populations of your data when such a difference might not actually exist. If you want to be very sure that the result is not due to sampling error, you should select a very small significance level.

It is important to choose the significance level before you perform a statistical hypothesis test. If you wait until after you receive a p-value from a test, you might pick a significance level such that you get the result you want to see. For instance, if someone is trying to publish the results of their scientific study in a journal, they might set a higher significance level that makes their results appear statistically significant. Choosing a significance level in advance helps keep everyone honest.
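As a minimal sketch of "choose first, test second", the threshold below is fixed before any p-value is examined. The helper name `reject_null` and the two example p-values are hypothetical, and 0.05 is used only as an example threshold.

```{r}
# Fix the significance level before looking at any p-value.
alpha <- 0.05  # example threshold, chosen in advance

# Hypothetical helper: TRUE means "reject the null hypothesis".
reject_null <- function(p_value, alpha) {
  p_value < alpha
}

reject_null(0.04, alpha)  # below the threshold
reject_null(0.20, alpha)  # above the threshold
```

Because `alpha` is set before the comparison, the decision cannot be nudged after the fact by picking a more convenient threshold.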

It is an industry standard to set a significance level of 0.05 or less, meaning that you accept at most a 5% chance of rejecting the null hypothesis when it is actually true (a Type I error).

Recall that the p-value is the probability of getting results at least as extreme as the observed ones, just by random chance, if the null hypothesis is true.

## Instructions

1.Before you run a hypothesis test on a set of data, you set your significance level to 0.05. The hypothesis test then returns a p-value of 0.1. Can you reject the null hypothesis? Update the value of reject_hypothesis to TRUE or FALSE depending on your answer.

The p-value of 0.1 is greater than the significance level of 0.05, so you cannot reject the null hypothesis.

```{r}
# update reject_hypothesis here:
reject_hypothesis <- FALSE
reject_hypothesis
```

# 9.One Sample T-Test

Consider the fictional business BuyPie, which sends ingredients for pies to your household so that you can make them from scratch. Suppose that a product manager hypothesizes the average age of visitors to BuyPie.com is 30. In the past hour, the website had 100 visitors and the average age was 31. Are the visitors older than expected? Or is this just the result of chance (sampling error) and a small sample size?

You can test this using a One Sample T-Test. A One Sample T-Test compares a sample mean to a hypothetical population mean. It answers the question “What is the probability that the sample came from a distribution with the desired mean?”

The first step is formulating a null hypothesis, which again is the hypothesis that there is no difference between the populations you are comparing. The second population in a One Sample T-Test is the hypothetical population you choose. The null hypothesis that this test examines can be phrased as follows: "The set of samples belongs to a population with the target mean".

One result of a One Sample T-Test will be a p-value, which tells you whether or not you can reject this null hypothesis. If the p-value you receive is less than your significance level, normally 0.05, you can reject the null hypothesis and state that there is a significant difference.

R has a function called t.test() in the stats package which can perform a One Sample T-Test for you.

t.test() requires two arguments, a distribution of values and an expected mean:

```{r eval = FALSE}
# template only: sample_distribution and expected_mean are placeholders
results <- t.test(sample_distribution, mu = expected_mean)
```

1.sample_distribution is the sample of values that were collected

2.mu is an argument indicating the desired mean of the hypothetical population

3.expected_mean is the value of the desired mean

t.test() will return, among other information we will not cover here, a p-value: the probability of seeing a sample mean at least this far from the specified mean if the sample really did come from a distribution with that mean.

P-values give you an idea of how confident you can be in a result. Just because you don’t have enough data to detect a difference doesn’t mean that there isn’t one. Generally, the more samples you have, the smaller a difference you can detect.
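The object returned by t.test() is a list, so individual pieces can be pulled out by name. The sketch below uses an invented sample and an assumed expected mean of 30 purely for illustration.

```{r}
# Hypothetical sample; these values are invented for illustration.
sample_distribution <- c(28, 35, 31, 27, 33, 30, 36, 29, 32, 34)
expected_mean <- 30

results <- t.test(sample_distribution, mu = expected_mean)

results$statistic  # t statistic
results$parameter  # degrees of freedom
results$p.value    # p-value
results$estimate   # sample mean
results$conf.int   # 95% confidence interval for the population mean
```

Reading `results$p.value` directly is handy when you want to compare it against a significance level in code rather than by eye.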

## Instructions

1.We have provided a small dataset called ages, representing the ages of customers to BuyPie.com in the past hour, in notebook.Rmd.

Even with a small dataset like this, it is hard to make judgments from just looking at the numbers.

To understand the data better, let’s look at the mean. Calculate the mean of ages, and store the result in a variable called ages_mean. View ages_mean.


```{r message = FALSE}
# load and view data
ages <- c(32, 34, 29, 29, 22, 39, 38, 37, 38, 36, 30, 26, 22, 22)
ages
```
 
```{r}
# calculate ages_mean here:
ages_mean <- mean(ages)
ages_mean
```

2.Use the t.test() function with ages to see what p-value the experiment returns for this distribution, where we expect the mean to be 30.

Store the results of the test in a variable called results.

Does the p-value you got with the One Sample T-Test make sense, knowing the mean of ages?

```{r}
# perform t-test here:
results <- t.test(ages, mu = 30)
results
```

If the true mean is really 30 (null hypothesis is true), then there is about a 56% chance you’d see a sample mean as far from 30 as 31 (or even farther) just by chance.

It means this observed difference (mean of 31 instead of 30) is completely plausible as random sampling error.

So you fail to reject the null hypothesis.

The data is consistent with the population mean being 30.
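To see where that p-value comes from, the calculation can be reproduced by hand. This is a sketch of the standard one-sample t formula, not a replacement for t.test(). It redefines `ages` so that it runs on its own.

```{r}
ages <- c(32, 34, 29, 29, 22, 39, 38, 37, 38, 36, 30, 26, 22, 22)
mu <- 30

n <- length(ages)
standard_error <- sd(ages) / sqrt(n)
t_stat <- (mean(ages) - mu) / standard_error

# Two-sided p-value: probability of a t statistic at least this far from 0
# in either direction, with n - 1 degrees of freedom.
p_manual <- 2 * pt(-abs(t_stat), df = n - 1)

c(t = t_stat, p = p_manual)
t.test(ages, mu = mu)$p.value
```

The manual value and the one reported by t.test() should agree, which shows the p-value depends on the distance between the sample mean and 30, the spread of the data, and the sample size.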

# 10.Two Sample T-Test

Suppose that last week, the average amount of time spent per visitor to a website was 25 minutes. This week, the average amount of time spent per visitor to a website was 29 minutes. Did the average time spent per visitor change (i.e. was there a statistically significant bump in user time on the site)? Or is this just part of natural fluctuations?

One way of testing whether this difference is significant is by using a Two Sample T-Test. A Two Sample T-Test compares two sets of data, which are both approximately normally distributed.

The null hypothesis, in this case, is that the two distributions have the same mean.

You can use R’s t.test() function to perform a Two Sample T-Test, as shown below:

```{r eval = FALSE}
# template only: distribution_1 and distribution_2 are placeholders
results <- t.test(distribution_1, distribution_2)
```

When performing a Two Sample T-Test, t.test() takes two distributions as arguments and returns, among other information, a p-value. 
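Here is a small self-contained sketch of that call. The two samples are simulated, and their sizes, means, and standard deviations are assumptions loosely modeled on the 25 and 29 minute example. By default, R's t.test() runs the Welch version of the test, so the sketch also recomputes the Welch t statistic by hand for comparison.

```{r}
set.seed(7)
distribution_1 <- rnorm(50, mean = 25, sd = 5)  # simulated minutes, last week
distribution_2 <- rnorm(50, mean = 29, sd = 5)  # simulated minutes, this week

results <- t.test(distribution_1, distribution_2)

# Welch t statistic: difference in sample means over its standard error
standard_error <- sqrt(var(distribution_1) / length(distribution_1) +
                       var(distribution_2) / length(distribution_2))
t_manual <- (mean(distribution_1) - mean(distribution_2)) / standard_error

t_manual
results$statistic
results$p.value
```

The hand-computed statistic should match `results$statistic`, and `results$p.value` is the number to compare against your significance level.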

Remember, the p-value lets you know the probability of seeing a difference in the means at least this large by chance (sampling error) alone, assuming the null hypothesis is true.


## Instructions

1.We’ve created two distributions representing the time spent per visitor to BuyPie.com last week, week_1, and the time spent per visitor to BuyPie.com this week, week_2.

Find the means of these two distributions. Store them in week_1_mean and week_2_mean. View both means.

```{r message = FALSE}
# load data
week_1 <- c(23.90507, 26.67632, 27.27434, 24.25757, 32.40423, 39.56919, 23.07010, 29.82068, 27.59434, 28.05640, 27.06757, 30.41193, 25.71359, 24.94295, 28.23124, 24.95338, 18.51232, 27.46235, 28.38017, 13.91206, 29.02616, 26.90747, 22.86777, 24.89383, 25.96948, 26.86870, 20.72676, 27.35988, 20.68409, 21.19846, 16.25801, 23.92518, 24.47923, 29.47051, 27.28425, 26.93339, 28.61027, 18.88377, 33.65469, 25.69470, 20.98291, 22.69700, 28.60279, 21.36000, 30.77685, 20.83416, 23.79367, 19.75567, 29.54421, 20.14331)
week_1
week_2 <- c( 18.63432, 31.28788, 34.96798, 21.81678, 28.21620, 39.39314, 35.52223, 27.54222, 33.64395, 25.31674, 28.81392, 30.73580, 26.37242, 26.09456, 26.34073, 19.42196, 32.58798, 24.84002, 28.93348, 20.43668, 22.72496, 32.31728, 35.38431, 29.66710, 24.53513, 30.91406, 19.56118, 24.90817, 30.13164, 31.47466, 27.77684, 16.51307, 35.07702, 31.74818, 36.36053, 27.70501, 29.49870, 27.65575, 37.18504, 25.16055, 29.26554, 38.22163, 28.92102, 24.82154, 38.30155, 34.76021, 22.26869, 28.82594, 32.00975, 36.46438)
week_2
```

```{r}
# calculate week_1_mean and week_2_mean here:
week_1_mean <- mean(week_1)
week_1_mean
week_2_mean <- mean(week_2)
week_2_mean
```

2.Find the standard deviations of these two distributions. Store them in week_1_sd and week_2_sd. View both standard deviations.

```{r}
# calculate week_1_sd and week_2_sd here:
week_1_sd <- sd(week_1)
week_1_sd
week_2_sd <- sd(week_2)
week_2_sd
```

3.Run a Two Sample T-Test using the t.test() function.

Save the results to a variable called results and view it. Does the p-value make sense, knowing what you know about these datasets?

```{r}
# run two sample t-test here:
results <- t.test(week_1, week_2)
results
```

Assuming there is no real difference in average time spent between the two weeks (null hypothesis), the probability of seeing a difference as large as ~3.57 minutes (or more extreme) just by random sampling is about 0.07%.

There is strong evidence that the average time per visitor changed between week_1 and week_2.


# 11.Dangers of Multiple T-Tests

Suppose that you own a chain of stores that sell ants, called VeryAnts. There are three different locations: A, B, and C. You want to know if the average ant sales over the past year are significantly different between the three locations.

At first, it seems that you could perform T-tests between each pair of stores.

You know that the significance level is the probability that you incorrectly reject a true null hypothesis on each t-test. The more t-tests you perform, the more likely you are to get a false positive, a Type I error.

For a significance level of 0.05, if the null hypothesis is true, then the probability of correctly obtaining a non-significant result is 1 - 0.05 = 0.95. When you run another t-test, the probability of still getting a correct result on both tests is 0.95 * 0.95, or 0.9025. That means your probability of making at least one error is now close to 10%! This error probability only gets bigger the more t-tests you do.
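The arithmetic generalizes to any number of tests. The helper below, `family_error`, is a hypothetical name, and the formula treats the tests as independent, which is a simplifying assumption.

```{r}
alpha <- 0.05

# Probability of at least one false positive across k tests,
# treating the tests as independent (a simplifying assumption).
family_error <- function(k, alpha) {
  1 - (1 - alpha)^k
}

n_tests <- choose(3, 2)  # 3 stores -> 3 pairwise comparisons
family_error(1:n_tests, alpha)
```

With three stores there are three pairwise comparisons, so the last value in the output is the error probability for testing every pair.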

## Instructions

1.We have created samples store_a, store_b, and store_c, representing the sales at VeryAnts at locations A, B, and C, respectively. We want to see if there’s a significant difference in sales between the three locations.

Explore datasets store_a, store_b, and store_c by finding and viewing the means and standard deviations of each one. Store the means in variables called store_a_mean, store_b_mean, and store_c_mean. Store the standard deviations in variables called store_a_sd, store_b_sd, and store_c_sd.

```{r message = FALSE}
# load data
load("store_a.Rda")
load("store_b.Rda")
load("store_c.Rda")
```

```{r}
# calculate means here:
store_a_mean <- mean(store_a)
store_a_mean
store_b_mean <- mean(store_b)
store_b_mean
store_c_mean <- mean(store_c)
store_c_mean
```

[1] 58.34964

[1] 65.62629

[1] 62.36117

```{r}
# calculate standard deviations here:
store_a_sd <- sd(store_a)
store_a_sd
store_b_sd <- sd(store_b)
store_b_sd
store_c_sd <- sd(store_c)
store_c_sd
```

[1] 14.80313

[1] 14.79597

[1] 15.14302

2.Perform a Two Sample T-test between each pair of location data.

Store the results of the tests in variables called a_b_results, a_c_results, and b_c_results. View the results for each test.

```{r}
# perform two sample t-test here:
a_b_results <- t.test(store_a, store_b)
a_b_results
a_c_results <- t.test(store_a, store_c)
a_c_results
b_c_results <- t.test(store_b, store_c)
b_c_results
```

  Welch Two Sample t-test

data:  store_a and store_b
t = -4.2581, df = 298, p-value = 2.767e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -10.639701  -3.913601
sample estimates:
mean of x mean of y 
 58.34964  65.62629 
 
 Welch Two Sample t-test

data:  store_a and store_c
t = -2.3201, df = 297.85, p-value = 0.02101
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -7.4142456 -0.6088286
sample estimates:
mean of x mean of y 
 58.34964  62.36117 
 
 

  Welch Two Sample t-test

data:  store_b and store_c
t = 1.8888, df = 297.84, p-value = 0.05989
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.1367903  6.6670182
sample estimates:
mean of x mean of y 
 65.62629  62.36117 
 
 
3.Store the probability of error for running three T-Tests in a variable called error_prob. View error_prob.

```{r}
# calculate error_prob here:
error_prob <- 1 - 0.95^3
error_prob
```
[1] 0.142625
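That figure can be checked empirically. The sketch below draws three hypothetical stores from the same population, so every significant result is a false positive, and counts how often at least one of the three pairwise t-tests comes out significant. The sample size, mean, and standard deviation are assumptions.

```{r}
set.seed(123)
alpha <- 0.05
n_runs <- 2000

any_false_positive <- replicate(n_runs, {
  # Three hypothetical stores drawn from the same population
  a <- rnorm(30, mean = 60, sd = 15)
  b <- rnorm(30, mean = 60, sd = 15)
  c_store <- rnorm(30, mean = 60, sd = 15)
  p_values <- c(t.test(a, b)$p.value,
                t.test(a, c_store)$p.value,
                t.test(b, c_store)$p.value)
  any(p_values < alpha)
})

mean(any_false_positive)
```

Because the three comparisons share data, they are not fully independent, so the simulated rate will usually not match 1 - 0.95^3 exactly. It should still sit well above the 0.05 you would get from a single test.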


# 12.ANOVA

In the last exercise, you saw that the probability of making a Type I error got dangerously high as you performed more t-tests.

When comparing more than two numerical datasets, the best way to preserve a Type I error probability of 0.05 is to use ANOVA. ANOVA (Analysis of Variance) tests the null hypothesis that all of the datasets you are considering have the same mean. If you reject the null hypothesis with ANOVA, you’re saying that at least one of the sets has a different mean; however, it does not tell you which datasets are different.

You can use the stats package function aov() to perform ANOVA on multiple datasets. aov() takes the different datasets combined into a data frame as an argument. For example, if you were comparing scores on a video game between math majors, writing majors, and psychology majors, you could format the data in a data frame df_scores as follows:

| group | score |
|------------------|-------|
| math major | 88 |
| math major | 81 |
| writing major | 92 |
| writing major | 80 |
| psychology major | 94 |
| psychology major | 83 |

You can then run an ANOVA test with this line:

```{r eval = FALSE}
# template only: df_scores is the example data frame described above
results <- aov(score ~ group, data = df_scores)
```

Note: score ~ group indicates the relationship you want to analyze (i.e. how each group, or major, relates to score on the video game)

To retrieve the p-value from the results of calling aov(), use the summary() function:

```{r eval = FALSE}
# template only: results is the object returned by aov()
summary(results)
```

The null hypothesis, in this case, is that all three populations have the same mean score on this video game. If you reject this null hypothesis (if the p-value is less than 0.05), you can say you are reasonably confident that a pair of datasets is significantly different. After using only ANOVA, however, you can’t make any conclusions on which two populations have a significant difference.
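Putting these pieces together, the following sketch builds df_scores from the six example rows and pulls the p-value out of the summary table. With only two scores per group, this is a toy dataset meant to show the mechanics, not a meaningful analysis.

```{r}
df_scores <- data.frame(
  group = c("math major", "math major",
            "writing major", "writing major",
            "psychology major", "psychology major"),
  score = c(88, 81, 92, 80, 94, 83)
)

results <- aov(score ~ group, data = df_scores)
anova_table <- summary(results)[[1]]

# First row is the group effect; the Residuals row has no p-value (NA).
p_value <- anova_table[["Pr(>F)"]][1]
p_value
```

Storing the p-value in a variable makes it easy to compare against a significance level chosen beforehand.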

Let’s look at an example of ANOVA in action.

## Instructions

```{r}
# load libraries
library(tidyr)
```

```{r message = FALSE}
# load data
load("stores.Rda")
load("stores_new.Rda")
```

```{r}
# inspect stores here:
stores
```


```{r Hypo5, out.width="60%"}
knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo5.png")
```

2.Perform an ANOVA on the stores data and save the test results to a variable results. Use the summary() function to view the p-value of the test. Does this p-value lead you to reject the null hypothesis?

```{r}
# perform anova on stores here:
results <- aov(sales ~ store, data = stores)
summary(results)
```

             Df Sum Sq Mean Sq F value   Pr(>F)    
store         2   3985  1992.6   8.957 0.000153 ***
Residuals   447  99437   222.5                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


The column labeled Pr(>F) is the p-value for the F-test.
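The F value and its p-value can be recovered from the other columns of that table. The sketch below copies the rounded Mean Sq and Df figures from the output above, so small rounding differences from the printed F value are expected.

```{r}
# Values copied from the ANOVA table above (rounded as printed)
mean_sq_store <- 1992.6
mean_sq_residuals <- 222.5
df_store <- 2
df_residuals <- 447

# F value: between-group variation relative to within-group variation
f_value <- mean_sq_store / mean_sq_residuals

# Pr(>F): upper-tail probability of the F distribution
p_value <- pf(f_value, df1 = df_store, df2 = df_residuals, lower.tail = FALSE)

f_value
p_value
```

A large F value means the store means are spread out far more than the variation within each store would explain, which is why the p-value is so small.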


3.Let’s say the sales at location B have suddenly soared (maybe there’s an ant convention happening nearby). The new sales for location B have been updated in the stores_new data frame.

Re-run the ANOVA test on stores_new and save the test results to a variable results_new. Use the summary() function to see what the p-value is now. Does this new value make sense?


```{r}
# perform anova on stores_new here:
results_new <- aov(sales ~ store, data = stores_new)
summary(results_new)
```

             Df Sum Sq Mean Sq F value Pr(>F)    
store         2 775599  387799    1805 <2e-16 ***
Residuals   447  96058     215                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Key value: Pr(>F) = <2e-16, which is shorthand for a p-value less than 0.0000000000000002.

This is far below any common significance level (like 0.05 or 0.01), so you reject the null hypothesis.

There is a statistically significant difference between the mean sales of the stores, and a difference this large is extremely unlikely to be due to chance.

Keep in mind that ANOVA only tells you that at least one store's mean sales differ from the others. It does not tell you which store is different.

# 13.Assumptions of Numerical Hypothesis Tests

Before you use numerical hypothesis tests, you need to be sure that the following things are true:

1. The samples should each be normally distributed…ish

Data analysts in the real world often still perform hypothesis tests on datasets that aren’t exactly normally distributed. What is more important is to recognize if there is some reason to believe that a normal distribution is especially unlikely. If your dataset is definitively not normal, the numerical hypothesis tests won’t work as intended.

For example, imagine you have three datasets, each representing a day of traffic data in three different cities. Each dataset is independent, as traffic in one city should not impact traffic in another city. However, it is unlikely that each dataset is normally distributed. In fact, each dataset probably has two distinct peaks, one at the morning rush hour and one during the evening rush hour. The histogram of a day of traffic data might look something like this:

```{r Hypo6, out.width="60%"}
knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo6.png")
```

In this scenario, using a numerical hypothesis test would be inappropriate.

2. The population standard deviations of the groups should be equal

For ANOVA and Two Sample T-Tests, using datasets with standard deviations that are significantly different from each other will often obscure the differences in group means.

To check for similarity between the standard deviations, it is normally sufficient to divide the two standard deviations and see if the ratio is “close enough” to 1. “Close enough” may differ in different contexts, but generally staying within 10% should suffice.

3. The samples must be independent

When comparing two or more datasets, the values in one distribution should not affect the values in another distribution. In other words, knowing more about one distribution should not give you any information about any other distribution.

Here are some examples where it would seem the samples are not independent:

1.the number of goals scored per soccer player before, during, and after undergoing a rigorous training regimen

2.a group of patients’ blood pressure levels before, during, and after the administration of a drug

It is important to understand your datasets before you begin conducting hypothesis tests on them so that you know you are choosing the right test.
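The standard deviation check from assumption 2 is easy to wrap in a helper. The function name `sd_ratio_ok`, the 10% tolerance default, and the three small datasets are all illustrative assumptions that follow the rule of thumb described above.

```{r}
# Hypothetical helper: ratio of sample standard deviations plus a
# "within 10% of 1" check.
sd_ratio_ok <- function(x, y, tolerance = 0.10) {
  ratio <- sd(x) / sd(y)
  list(ratio = ratio, close_enough = abs(ratio - 1) <= tolerance)
}

store_x <- c(52, 61, 58, 70, 49, 66, 63, 55)  # invented values
store_y <- store_x + 5  # same spread, shifted mean
store_z <- store_x * 2  # twice the spread

sd_ratio_ok(store_x, store_y)
sd_ratio_ok(store_x, store_z)
```

Shifting every value leaves the spread unchanged, so the first ratio is 1, while doubling every value doubles the spread and gives a ratio of 0.5. Note also that R's t.test() runs the Welch test unless you pass var.equal = TRUE, and the Welch version does not assume equal variances.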

## Instructions

1.Use the base R hist() function to display the histograms for dist_one, dist_two, dist_three, and dist_four.

```{r message = FALSE}
# load data
load("dist_one.Rda")
load("dist_two.Rda")
load("dist_three.Rda")
load("dist_four.Rda")
```

```{r}
# plot histograms and define not_normal here:
hist(dist_one)
```

```{r Hypo7, out.width="60%"}
knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo7.png")
```

```{r}
hist(dist_two)
```

```{r Hypo8, out.width="60%"}
knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo8.png")
```

```{r}
hist(dist_three)
```

```{r Hypo9, out.width="60%"}
knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo9.png")
```

```{r}
hist(dist_four)
```

```{r Hypo10, out.width="60%"}
knitr::include_graphics("C:/Users/kuoan/Desktop/R Code/Hypo10.png")
```


2.Do the distributions look normal?

One of these distributions would probably not be a good choice to use in an ANOVA comparison. Create a variable called not_normal and set it equal to the distribution number (1, 2, 3, or 4) that would be least suited for use in an ANOVA test.

Hint: A normal distribution has a bell-shaped curve with a single peak.


```{r}
not_normal <- 4
not_normal 
```

3.Calculate the ratio of standard deviations between dist_two and dist_three, and store the value in a variable called ratio. View ratio. Is this “close enough” to perform a numerical hypothesis test between the two datasets?

```{r}
# define ratio here:
ratio <- sd(dist_two) / sd(dist_three)
ratio
```

[1] 0.5784782

One of the assumptions of a numerical hypothesis test is that the ratio of the standard deviations of the datasets is close to 1.

Since the ratio is not close to 1, these datasets should not be used together in a numerical hypothesis test.

# 14.Review

Phew! Nobody said hypothesis testing is easy, but you made it to the end of the lesson. Congratulations! The world of hypothesis testing is vast. There is much more you can learn, and so many applications where you can use it.

Let’s review what you’ve learned in this lesson:

1.Samples are subsets of an entire population, and the sample mean can be used to approximate the population mean

2.The null hypothesis is an assumption that there is no difference between the populations you are comparing in a hypothesis test

3.Type I Errors occur when a hypothesis test finds a correlation between things that are not related, and Type II Errors occur when a hypothesis test fails to find a correlation between things that are actually related

4.P-Values indicate the probability that, assuming the null hypothesis is true, such differences in the samples you are comparing would exist

5.The Significance Level is a threshold p-value for which all p-values below it will result in rejecting the null hypothesis

6.One Sample T-Tests indicate whether a dataset belongs to a distribution with a given mean

7.Two Sample T-Tests indicate whether there is a significant difference between two datasets

8.ANOVA (Analysis of Variance) allows you to detect whether at least one of multiple datasets has a mean that differs significantly from the others


