Use + - * / for addition, subtraction, multiplication,
and division.
100 + 2
[1] 102
100 - 2
[1] 98
100 * 2
[1] 200
100/2
[1] 50
Use <- to store values in variables. For example, you
can store the number 100 in a variable called x and the
number 2 in a variable called y like this:
x <- 100
y <- 2
You can now use x and y for calculations
involving 100 and 2:
x + y
[1] 102
x - y
[1] 98
x*y
[1] 200
x/y
[1] 50
To create a list, use c(). For example, to store the
numbers 1,3,5,4 in a list called numbers, do this:
numbers <- c(1,3,5,4)
You can use things like mean(), median(),
etc for basic calculations. For example, if we want the mean, median,
and standard deviation of the numbers in the list numbers
above, we can do this:
mean(numbers)
[1] 3.25
median(numbers)
[1] 3.5
sd(numbers)
[1] 1.707825
The data.frame() command can be used to manually create
a dataset in R. For example, consider the following dataset:
| Name | Major | GPA |
|---|---|---|
| Alice | Math | 3.6 |
| Charlie | CS | 3.7 |
| Bob | Math | 3.2 |
Here’s the syntax to create this dataset in R
students <- data.frame(Name = c('Alice', 'Bob', 'Charlie'), Major = c('Math','CS','Math'), GPA = c(3.6, 3.7, 3.2))
This creates a dataset (called students) which has three
variables (Name, Major, GPA), and we enter the values of those variables
as lists (using c()). You can also split this into separate
lines to make it easier to read:
students <- data.frame(
Name = c('Alice', 'Bob', 'Charlie'),
Major = c('Math','CS','Math'),
GPA = c(3.6, 3.7, 3.2))
To access a certain value in a dataset, type the name of the dataset,
followed by a dollar sign $, followed by the name of the
variable.
For example, students$Name will just print out the names
of the students:
students$Name
[1] "Alice" "Bob" "Charlie"
You can also compute means/medians/standard deviations of
quantitative variables in datasets. For example, here are the mean,
median, and standard deviation of the GPAs in the students
dataset.
mean(students$GPA)
[1] 3.5
median(students$GPA)
[1] 3.6
sd(students$GPA)
[1] 0.2645751
You can get the sample size of a dataset using nrow()
(for “number of rows”):
nrow(students)
[1] 3
We won’t usually enter datasets manually like this. Instead, we’ll
load in csv files. There are two ways to do this. First, if you have the
actual csv (in the same folder as your current project), you can load it
directly using read.csv("filename"). For example, if you
have a csv file called studentSample.csv, you can just
enter this:
studentsSample <- read.csv('studentsample.csv')
In this class, for most datasets, I’ll give you the code to load it from a dropbox folder that you can just copy/paste. It’ll look something like this:
studentSample <- read.csv("https://www.dropbox.com/scl/fi/v573fxi9np280m3exwh79/MajorDrinkHeight.csv?rlkey=74kyi6v2kmbi6kbmmwsxslbfg&dl=1")
If you have a dataset in which the individuals are placed in
different groups, you can use tapply to find the
mean/median/etc within each group by using the tapply
command. The syntax is
tapply(dataset$Var1, dataset$Var2, function), where
Var1 is the name of the quantitative variable,
Var2 is the name of the categorical variable (which
determine the groups), and function is the name of the
function you want to use (like mean, median,
etc).
For example, the Salaries dataset gives the salary in
USD (the salaray_in_usd variable) of individuals, and
groups them by experience level (the experience_level
variable). If we want the mean salary in USD for each experience level,
we can do this:
tapply(Salaries$salary_in_usd, Salaries$experience_level, mean)
EN EX MI SE
90930.13 186808.88 123173.74 161651.86
For example, this means that the mean salary for Executive level
positions (EX) is $186808.88.
Use quantile(dataset$variable, pct) to get a percentile.
For example, to get the 90th percentile of the heights in the student
sample dataset, use
quantile(studentSample$Height, 0.90)
90%
73.1
Summary Statistics Use summary to get
the min, max, quartiles, and mean all at once:
summary(studentSample$Height)
Min. 1st Qu. Median Mean 3rd Qu. Max.
62.00 65.25 67.00 68.00 71.50 74.00
Frequency Tables: The general syntax to make a
frequency table for a qualitative variable in a dataset is
table(Dataset$Variable). For example, to make a table of
the Major variable from the
studentSample Dataset, do the following
table(studentSample$Major)
CS Math
4 6
Adding totals: Use addmargins() to
append the totals to the table:
addmargins(table(studentSample$Major))
CS Math Sum
4 6 10
Relative Frequency Tables Use
proportions() to create a relative frequency table:
proportions(table(studentSample$Major))
CS Math
0.4 0.6
You can also add the totals:
addmargins(proportions(table(studentSample$Major)))
CS Math Sum
0.4 0.6 1.0
Two-Way Tables To make a table showing two
qualitative variables, just plug both variables into
table():
table(studentSample$Major, studentSample$FavDrink)
Coffee Tea Water
CS 2 1 1
Math 3 2 1
You can add the totals:
addmargins(table(studentSample$Major, studentSample$FavDrink))
Coffee Tea Water Sum
CS 2 1 1 4
Math 3 2 1 6
Sum 5 3 2 10
or display relative frequencies:
proportions(table(studentSample$Major, studentSample$FavDrink))
Coffee Tea Water
CS 0.2 0.1 0.1
Math 0.3 0.2 0.1
and add the totals to the relative frequency table:
addmargins(proportions(table(studentSample$Major, studentSample$FavDrink)))
Coffee Tea Water Sum
CS 0.2 0.1 0.1 0.4
Math 0.3 0.2 0.1 0.6
Sum 0.5 0.3 0.2 1.0
To make a bar graph using “Base R”, you need to plug a
table() into barplot():
barplot(table(studentSample$FavDrink))
To make a histogram using Base R, just plug the
dataset$variable into hist():
hist(studentSample$Height)
To make a boxplot using Base R, just plug
dataset$variable into boxplot:
boxplot(studentSample$Height)
One of the nice features of R is that there are many packages that you can install to add more functionality.
One of these package, ggplot2 allows you to make
nicer-looking graphs.
To start, you need to install the package (go to “packages” on the
bottom-right pane, then click install, search for ggplot2,
and click install).
Then, you need to run library(ggplot2) to “load” the
library.
Whenever you want to create a graph, always start with
ggplot(dataset).
library(ggplot2)
ggplot(Salaries)
You’ll see this just creates a blank “background”.
The next thing you need to do is to set up the ``aesthetic.” For now, just think of this as specifying what the axes should represent.
For example, if we want to make a grapht of the
salary_in_usd variable, we’d want the x-axis to correspond
to salary_in_usd. To specify this in ggplot,
type aes(x = salary_in_usd)
library(ggplot2)
ggplot(Salaries, aes(x=salary_in_usd))
ggplot2Now, whenever you want to actually plot something, you use a
+ symbol, followed by what you want to add. If we want to
add a histogram, the command is + geom_histogram()
library(ggplot2)
ggplot(Salaries, aes(x=salary_in_usd)) + geom_histogram()
You can customize lots of things. For example, we can make the outline of the bars black and the fill grey like this:
library(ggplot2)
ggplot(Salaries, aes(x=salary_in_usd)) + geom_histogram(color="black", fill="grey")
You can change the axis titles like this:
library(ggplot2)
ggplot(Salaries, aes(x=salary_in_usd)) +
geom_histogram(color="black", fill="grey") +
xlab("Salary in USD")+
ylab("Count")
And you can add a title like this:
library(ggplot2)
ggplot(Salaries, aes(x=salary_in_usd)) +
geom_histogram(color="black", fill="grey") +
xlab("Salary in USD")+
ylab("Count")+
ggtitle("Histogram of Salaries")
If you want to have separate histograms for different groups in a
dataset, you can use the facet_grid function. The syntax is
to add +facet_grid(~Var), where Var is the
name of the variable giving the groups. For example, if we want separate
histograms for the different experience levels
(experience_level variable), we just need this
modification:
ggplot(Salaries, aes(x=salary_in_usd)) +
geom_histogram(color="black", fill="grey") +
xlab("Salary in USD")+
ylab("Count")+
ggtitle("Histogram of Salaries")+
facet_grid(~experience_level)
The labels on the x-axis are a little jumbled, don’t worry about this.
ggplot2Boxplots work similarly:
library(ggplot2)
ggplot(Salaries, aes(x=salary_in_usd)) +
geom_boxplot() +
xlab("Salary in USD")+
ggtitle("Boxplot of Salaries")
(The numbers on the vertical axis don’t matter here)
You can also make the vertical by using aes(y=...)
library(ggplot2)
ggplot(Salaries, aes(y=salary_in_usd)) +
geom_boxplot() +
ylab("Salary in USD")+
ggtitle("Boxplot of Salaries")
ggplot2If you want to display boxplots for separate groups, you can specify
both the x and y axes: one for the group
(qualitative), one for the quantitative variable
library(ggplot2)
ggplot(Salaries, aes(y=salary_in_usd, x=experience_level)) +
geom_boxplot() +
xlab("Experience Level")+
ylab("Salary in USD")+
ggtitle("Boxplots of Salaries by Experience Level")
Use pnorm(#) to get the area to the left of
# in N(0,1). Remember this is the same thing
as \(P(z\le\#)\). For example, to get
\(P(z\le1.2)\), this is all you need to
do:
pnorm(1.2)
[1] 0.8849303
So, \(P(z\le 1.2)=0.8849303\).
To get \(P(a\le z\le b)\) (in \(N(0,1)\)), just use
pnorm(b)-pnorm(a). For example, for \(P(-2\le z\le 1.2),\) this is all you
need:
pnorm(1.2)-pnorm(-2)
[1] 0.8621802
So, \(P(-2\le z\le 1.2)=0.8621802.\)
For probabilities in other normal distributions, you can first convert to \(N(0,1)\) using z-scores. Remember the z-score for \(x\) in \(N(\mu,\sigma)\) is \(z=\frac{x-\mu}{\sigma}.\) For example, the z-score for 5.5 in \(N(10,3)\) is \(z=\frac{5.5-10}{3}.\) You can do this in R:
(5.5-10)/3
[1] -1.5
So, the z-score is -1.5. You can use this to get \(P(x\le -1.5)\) (in \(N(10,-1.5)\)) using
pnorm(z-score for -1.5):
pnorm(-1.5)
[1] 0.0668072
Similarly, to get \(P(5.5\le x\le 13)\) in \(N(10,3)\), you can first find the z-scores for 5.5 and 13:
(5.5-10)/3
[1] -1.5
(13-10)/3
[1] 1
Then use pnorm:
pnorm(1)-pnorm(-1.5)
[1] 0.7745375
You can also store the z-scores in variables by doing something like this:
z1 <- (5.5-10)/3
z2 <- (13-10)/3
Now you can plug z1 and z2 into
pnorm rather than using the actual numbers.
pnorm(z2)-pnorm(z1)
[1] 0.7745375
You can also specify the mean and sd in pnorm without
using z-scores:
pnorm(13, mean=10, sd=3) - pnorm(5.5, mean=10, sd=3)
[1] 0.7745375
Either way is fine, but make sure you’re able to do the calculations using z-scores.
Recall the sampling distribution for \(\bar x\) in samples of size \(n\) is (approximately) \(N(\mu,\frac{\sigma}{\sqrt{n}})\), and \(\frac{\sigma}{\sqrt{n}}\) is called the standard error. So we can use this distribution to find probabilities involving \(\bar x\). For example, if the population is \(N(10,3),\) then the standard error for samples of size 36 is \(SE=\frac{3}{\sqrt{36}}\). You can do this calculation in R:
3/sqrt(36)
[1] 0.5
So, \(SE=0.5\). This means the distribution of \(\bar x\) is \(N(10,0.5).\) If we want \(P(\bar x\le 11.25),\) we need to first get the z-score for 11.25, using this distribution:
(11.25-10)/0.5
[1] 2.5
So, the z-score is 2.5. Then use pnorm to get the
probability:
pnorm(2.5)
[1] 0.9937903
So, \(P(\bar x\le 11.25)=0.9937903\).
Remember the formula for a confidence interval for \(\mu\) when \(\sigma\) is known is \[ \bar x\pm ME, \] where \[ ME=z_*\cdot SE, \] and \[ SE=\frac{\sigma}{\sqrt{n}} \]
To get a \(z_*\) value, either use a
table, or use qnorm. For example, to get the \(z_*\) value for 98% confidence, since you
want the middle 98%, each tail is 1%, so you want the 0.01 percentile.
To get this in R, do this:
qnorm(0.01)
[1] -2.326348
You’ll see that it’s negative, the actual \(z_*\) value is just the absolute value of
this (so it’s 2.326348 in this case). You can also make it positive by
just putting a - in front:
-qnorm(0.01)
[1] 2.326348
If the population is normal with \(\sigma=4\) and a sample of size 25 has an average of \(\bar x=90,\) give a 98% confidence interval for \(\mu.\)
First, find the standard error:
4/sqrt(25)
[1] 0.8
So, SE=0.8. Now, find the margin of error using \(z_*=2.326348\) and \(SE=0.8\):
2.326348*0.8
[1] 1.861078
Now find the confidence interval:
90-1.861078
[1] 88.13892
90+1.861078
[1] 91.86108
So, the confidence interval is 88.13892 to 91.86108 (98% confident that \(\mu\) is between these two numbers).
If you’re comfortable working with variables, you can streamline all of this:
SE <- 4/sqrt(25)
zstar <- -qnorm(0.01)
ME <- zstar * SE
90 - ME
[1] 88.13892
90 + ME
[1] 91.86108
Suppose we want to test to see if the average amount of caffeine in all cups of coffee is less than 100mg. We take a sample of size 30 and determine the sample mean is 98. Assume the population standard deviation is 6. Write out the steps for the hypothesis test.
Step 1: \(H_0:\mu=100\), \(H_a:\mu<100\), where \(\mu\) is the average amount of caffeine in all cups of coffee.
Step 2: Use a significance level of \(\alpha=0.05\).
Step 3: We know the population sd, so use \(z=\frac{98-100}{6/\sqrt{30}}\) as the standardized test statistic.
(98-100)/(6/sqrt(30))
[1] -1.825742
Step 4: The p-value is the area to the left of
-1.825742 in \(N(0,1).\) We can use
pnorm to get this area:
pnorm(-1.825742)
[1] 0.03394457
Step 5: Our p-value is less than the significance level, so reject \(H_0\).
Step 6: We have statistically discernible evidence that the average amount of caffeine in all cups of coffee is less than 100mg.
Suppose we want to test to see if the average amount of caffeine in all cups of coffee is greater than 100mg. We take a sample of size 30 and determine the sample mean is 101. Assume the population standard deviation is 6. Write out the steps for the hypothesis test.
Step 1: \(H_0:\mu=100\), \(H_a:\mu>100\), where \(\mu\) is the average amount of caffeine in all cups of coffee.
Step 2: Use a significance level of \(\alpha=0.05\).
Step 3: We know the population sd, so use \(z=\frac{101-100}{6/\sqrt{30}}\) as the standardized test statistic.
(101-100)/(6/sqrt(30))
[1] 0.9128709
Step 4: The p-value is the area to the right of
0.9128709 in \(N(0,1).\) We can use
pnorm to get this area:
1-pnorm(0.9128709)
[1] 0.1806552
Step 5: Our p-value is greater than the significance level, so do not reject \(H_0\).
Step 6: We do not have statistically discernible evidence that the average amount of caffeine in all cups of coffee is greater than 100mg.
Suppose we want to test to see if the average amount of caffeine in all cups of coffee is not equal to 100mg. We take a sample of size 30 and determine the sample mean is 103. Assume the population standard deviation is 6. Write out the steps for the hypothesis test.
Step 1: \(H_0:\mu=100\), \(H_a:\mu\ne100\), where \(\mu\) is the average amount of caffeine in all cups of coffee.
Step 2: Use a significance level of \(\alpha=0.05\).
Step 3: We know the population sd, so use \(z=\frac{103-100}{6/\sqrt{30}}\) as the standardized test statistic.
(103-100)/(6/sqrt(30))
[1] 2.738613
Step 4: The p-value is the combined area to the left
of -2.738613 and the right of 2.738613 in \(N(0,1).\) This is the same as twice the
area to the left of -2.738613. We can use pnorm to get this
area:
2 * pnorm(-2.738613)
[1] 0.006169895
Step 5: Our p-value is less than the significance level, so reject \(H_0\).
Step 6: We have statistically discernible evidence that the average amount of caffeine in all cups of coffee is not equal to 100mg.
Recall the t-statistic for a sample with mean \(\bar x\) and standard deviation \(s\) is \[
t=\frac{\bar x - \mu}{s/\sqrt{n}}
\] The t-statistic follows a “t-distribution with n-1 degrees of
freedom.” In R, the functions pt and qt are
the analogs of pnorm and qnorm. The only
difference is that with pt and qt you need to
specify the degrees of freedom (which is n-1). For example, if we want
\(P(-1.2\le z\le 2)\) in \(N(0,1)\), we would use this:
pnorm(2) - pnorm(-1.2)
[1] 0.8621802
So, \(P(-1.2\le z\le 2)=0.8621802.\) If we want \(P(-1.2\le t\le 2)\) in a t-distribution for samples of size 10, then there are 9 degrees of freedom, so we’d use this:
pt(2, df=9) - pt(-1.2, df=9)
[1] 0.831337
So, \(P(-1.2\le t\le 2)=0.8621802.\)
Remember the formula for a confidence interval for \(\mu\) when \(\sigma\) is unknown is \[ \bar x\pm ME, \] where \[ ME=t_*\cdot SE, \] and \[ SE=\frac{s}{\sqrt{n}} \]
To get the \(t_*\) values, either
use a table, or qt in R. For example, if we want the \(t_*\) value for 98% confidence for samples
of size 25 (df = 24), then similarly to finding the \(z_*\) value, we want the area to the left
to be 0.01, so we can use this:
-qt(0.01, df=24)
[1] 2.492159
So for samples of size 25, the \(t_*\) value for 98% confidence is 2.492159
If the population is normal with unknown mean and unknown standard deviation, and a sample of size 25 has an average of \(\bar x=90\) and a standard deviation of \(s=4,\) give a 98% confidence interval for \(\mu.\)
Note This is the same as the example in the normal distribution section, except we only know the sample standard deviation, rather than the population standard deviation. The steps are almost the same, except that we use the \(t_*\) value rather than a \(z_*\) value.
First, find the standard error:
4/sqrt(25)
[1] 0.8
So, SE=0.8. Now, find the margin of error using \(t_*=2.492159\) (from above) and \(SE=0.8\):
2.492159*0.8
[1] 1.993727
Now find the confidence interval:
90-1.993727
[1] 88.00627
90+1.993727
[1] 91.99373
So, the confidence interval is 88.00627 to 91.99373 (98% confident that \(\mu\) is between these two numbers).
Again, you can can streamline all of this using variables
SE <- 4/sqrt(25)
tstar <- -qt(0.01, df=24)
ME <- tstar * SE
90 - ME
[1] 88.00627
90 + ME
[1] 91.99373
So, we are 98% confident that \(\mu\) is between 88.00627 and 91.99373. Notice this is a slightly wider interval than when we knew \(\sigma\) (rather than \(s\)). This makes sense intuitively- the confidence interval is less precise since we have less information.
Suppose the dataset called Coffee contains the amount of
caffeine in a sample of 25 cups of coffee (the Caffeine
variable).
Us this to give a 95% confidence interval for the amount of caffeine in all cups of coffee.
We should first check to see if it’s “appropriate” to use a t-distribution. For this, we want to look at the histogram of the sample.
ggplot(Coffee, aes(x=Caffeine)) +
geom_histogram(color="black", fill="grey",bins=8)
It’s not perfectly normal, but we have a medium sample size, so it’s still okay to use the t-procedure.
Here are the steps in R (this is all using variables, you can do it by manually plugging in numbers too):
# compute sample size
n <- nrow(Coffee)
# compute sample mean
xbar <- mean(Coffee$Caffeine)
# compute sample sd
s <- sd(Coffee$Caffeine)
# Compute SE
SE <- s/sqrt(n)
# Compute tstar
tstar <- -qt(0.025, df=n-1)
# compute ME
ME <- tstar * SE
# compute CI
xbar - ME
[1] 100.6827
xbar + ME
[1] 103.2919
So, we are 95% confident that the average amount of caffeine in all cups of coffee is between 100.7 and 103.3mg.
Suppose we want to test to see if the average amount of caffeine in all cups of coffee is less than 100mg. We take a sample of size 30 and determine the sample mean is 98, and the sample standard deviation is 6. Write out the steps for the hypothesis test.
Step 1: \(H_0:\mu=100\), \(H_a:\mu<100\), where \(\mu\) is the average amount of caffeine in all cups of coffee.
Step 2: Use a significance level of \(\alpha=0.05\).
Step 3: We do not know the population sd, so use \(t=\frac{98-100}{6/\sqrt{30}}\) as the standardized test statistic.
(98-100)/(6/sqrt(30))
[1] -1.825742
Step 4: The p-value is the area to the left of
-1.825742 in the t-distribution withdf=29. We can use
pt to get this area:
pt(-1.825742, df=29)
[1] 0.03910165
Step 5: Our p-value is less than the significance level, so reject \(H_0\).
Step 6: We have statistically discernible evidence that the average amount of caffeine in all cups of coffee is less than 100mg.
Suppose we want to test to see if the average amount of caffeine in all cups of coffee is greater than 100mg. We take a sample of size 30 and determine the sample mean is 101 and the sample standard deviation is 6. Write out the steps for the hypothesis test.
Step 1: \(H_0:\mu=100\), \(H_a:\mu>100\), where \(\mu\) is the average amount of caffeine in all cups of coffee.
Step 2: Use a significance level of \(\alpha=0.05\).
Step 3: We do not know the population sd, so use \(t=\frac{101-100}{6/\sqrt{30}}\) as the standardized test statistic.
(101-100)/(6/sqrt(30))
[1] 0.9128709
Step 4: The p-value is the area to the right of
0.9128709 in the t-distribution with df=29. We can use
pt to get this area:
1-pt(0.9128709, df=29)
[1] 0.1844188
Step 5: Our p-value is greater than the significance level, so do not reject \(H_0\).
Step 6: We do not have statistically discernible evidence that the average amount of caffeine in all cups of coffee is greater than 100mg.
Suppose we want to test to see if the average amount of caffeine in all cups of coffee is not equal to 100mg. We take a sample of size 30 and determine the sample mean is 103, and the sample standard deviation is 6. Write out the steps for the hypothesis test.
Step 1: \(H_0:\mu=100\), \(H_a:\mu\ne100\), where \(\mu\) is the average amount of caffeine in all cups of coffee.
Step 2: Use a significance level of \(\alpha=0.05\).
Step 3: We do not know the population sd, so use \(t=\frac{103-100}{6/\sqrt{30}}\) as the standardized test statistic.
(103-100)/(6/sqrt(30))
[1] 2.738613
Step 4: The p-value is the combined area to the left
of -2.738613 and the right of 2.738613 in the t-distribution with
df=29. This is the same as twice the area to the left of
-2.738613. We can use pt to get this area:
2 * pt(-2.738613,df=29)
[1] 0.01043738
Step 5: Our p-value is less than the significance level, so reject \(H_0\).
Step 6: We have statistically discernible evidence that the average amount of caffeine in all cups of coffee is not equal to 100mg.
The average height for all US adults is 66 inches. Suppose a dataset
called studentSample contains the heights of 10 college
students (in the Height variable). Does this provide
statistically significant evidence that the average height of all
college students is different than the average for all US adults.
(assume it’s “appropriate” to use a t-procedure).
Step 1 The hypotheses are \(H_0:\mu=66\) and \(H_a:\mu\ne66,\) where \(\mu\) is the average height for all college students.
Step 2 Use \(\alpha=0.05.\)
Step 3: Calculate standardized test statistic.
# Option 1: do calculations and copy/paste numbers
# Find sample mean
mean(studentSample$Height)
[1] 68
# Find sample sd
sd(studentSample$Height)
[1] 4.082483
# Calculate t-stat
(68-66)/(4.082483/sqrt(10))
[1] 1.549193
#Option 2: use variables
# Store null parameter in mu0
mu0 <- 66
# store sample mean in xbar
xbar <- mean(studentSample$Height)
# store sample sd in s
s <- sd(studentSample$Height)
# store sample size in n
n <- nrow(studentSample)
# store t-stat in t
t <- (xbar-mu0)/(s/sqrt(n))
#print the answer
print(t)
[1] 1.549193
Step 4 Calculate p-value
# Option 1: plug in -1.549193 to pt
2*pt(-1.549193,df=9)
[1] 0.1557447
# option 2: you can plug in t if you stored the t-stat in t. This is where you can use -abs(t) to make sure you're plugging in a negative t value.
2*pt(-abs(t), df=9)
[1] 0.1557446
Step 5 Do not reject \(H_0\).
Step 6 We do not have statistically discernible evidence that the average height of all college students is different than 66 inches.
In R, you can use the functions t.test to do hypothesis
tests and confidence intervals all at once.
In Example 4 above, we wanted to use the Height variable
in the studentSample dataset to test \(H_0:\mu=66\) vs \(H_a:\mu\ne 66\).
We can get the t-stat and the p-value using t.test like
this:
t.test(studentSample$Height, mu=66)
One Sample t-test
data: studentSample$Height
t = 1.5492, df = 9, p-value = 0.1557
alternative hypothesis: true mean is not equal to 66
95 percent confidence interval:
65.07957 70.92043
sample estimates:
mean of x
68
This gives us the sample mean (68), the t-stat (1.5492), the degrees
of freedom (9) and the p-value (0.1557). It also gives us the confidence
interval (65.07957 to 70.92043). For the hypothesis test, we would still
write out the six steps; t.test is just a quicker way to
get the t-stat and the p-value.
Notice by default, the alternative is the “not equal to” alternative.
You can change this by specifying alternative in the
t.test function. For example, if we wanted to use the same
data to test \(H_0:\mu=66\) vs \(H_a:\mu>66\), we would do this:
t.test(studentSample$Height, mu=66, alternative="greater")
One Sample t-test
data: studentSample$Height
t = 1.5492, df = 9, p-value = 0.07787
alternative hypothesis: true mean is greater than 66
95 percent confidence interval:
65.63346 Inf
sample estimates:
mean of x
68
Notice the p-value changes. This also gives us a “one-sided”
confidence interval. Don’t worry about what this means. If you have the
“less than” alternative, just use alternative="less".
If you want a confidence interval with a confidence level other than
95%, you can specify this using conf.level=.... For
example, if we want 99% confidence:
t.test(studentSample$Height, conf.level=0.99)
One Sample t-test
data: studentSample$Height
t = 52.673, df = 9, p-value = 1.61e-12
alternative hypothesis: true mean is not equal to 0
99 percent confidence interval:
63.80448 72.19552
sample estimates:
mean of x
68
Suppose a dataset called FiveKTimes contains the 5k
times (the RaceTime variable) for 30 morning runners and 20
afternoon runners. The time of day is recorded in the
TimeOfDay variable. The dataset is structured like
this:
TimeOfDay RaceTime
1 Morning 25.8
2 Morning 26.2
3 Morning 22.4
4 Morning 22.4
5 Afternoon 25.7
6 Morning 25.5
For inference about two means, the syntax is
t.test(dataset$Var1 ~ dataset$Var2), where
Var1 is the quantitative variable and Var2 is
the variable giving the groups (Var2 can only have two
possibilities). In this example, we would do this:
t.test(FiveKTimes$RaceTime ~ FiveKTimes$TimeOfDay)
Welch Two Sample t-test
data: FiveKTimes$RaceTime by FiveKTimes$TimeOfDay
t = 4.1253, df = 36.57, p-value = 0.0002044
alternative hypothesis: true difference in means between group Afternoon and group Morning is not equal to 0
95 percent confidence interval:
1.029572 3.018704
sample estimates:
mean in group Afternoon mean in group Morning
26.20000 24.17586
R will treat the groups alphabetically. Since Afternoon comes before Morning alphabetically, it treats Afternoon as the first group. So we should think of \(\mu_1\) as the average 5k time for all afternoon runners, and \(\mu_2\) as the average 5k time for all morning runners.
The confidence interval (1.03 to 3.018) gives an estimate for \(\mu_1-\mu_2,\) so we are 95% confident that the average 5k time for all afternoon runners is between 1.03 minutes and 3.02 minutes longer than the average 5k time for all morning runners.
If we want to run a hypothesis test to see if this gives
statistically discernible evidence that the average 5k time for after
runners is greater than the average 5k time for morning runners, the
hypotheses are \(H_0:\mu_1=\mu_2\) and
\(H_a:\mu_1>\mu_2.\) We just need to
specify alternative="greater"
t.test(FiveKTimes$RaceTime ~ FiveKTimes$TimeOfDay, alternative="greater")
Welch Two Sample t-test
data: FiveKTimes$RaceTime by FiveKTimes$TimeOfDay
t = 4.1253, df = 36.57, p-value = 0.0001022
alternative hypothesis: true difference in means between group Afternoon and group Morning is greater than 0
95 percent confidence interval:
1.196099 Inf
sample estimates:
mean in group Afternoon mean in group Morning
26.20000 24.17586
The t-statistic is 4.1253, and the p-value is 0.0001022, so we have statistically discernible evidence that the average 5k for all afternoon runners is greater than the average 5k time for all afternoon runners.
In the previous example, the times are measured on different individuals. If instead we have both an afternoon time and a morning time for each individual runner, the dataset might look more like this.
Afternoon Morning
1 29.79 26.53
2 25.02 23.35
3 29.99 26.66
4 29.82 26.54
5 27.24 24.83
6 21.38 20.92
In this case, it makes sense to look at the difference between the afternoon time and the morning time for each individual, rather the difference between the overall afternoon average and the overall morning average.
In other words, we can pair the afternoon time with the morning time
of each individual runner and look at the difference in those pairs. In
R, we can create a new variable called Difference in the
FiveKTimes dataset like this:
FiveKTimes$Difference <- FiveKTimes$Afternoon - FiveKTimes$Morning
Now, the dataset looks like this:
Afternoon Morning Difference
1 29.79 26.53 3.26
2 25.02 23.35 1.67
3 29.99 26.66 3.33
4 29.82 26.54 3.28
5 27.24 24.83 2.41
6 21.38 20.92 0.46
When we’re doing inference, the parameter of interest is now the average difference between the afternoon 5k time and the morning 5k time for all individual runners, rather than the difference between the overall afternoon average and the overall morning average.
So, this is really inference for a single mean (\(\mu\) would be the average difference
between afternoon and morning 5k times). We would use the
Difference variable to test this:
t.test(FiveKTimes$Difference)
One Sample t-test
data: FiveKTimes$Difference
t = 15.624, df = 49, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
1.763849 2.284551
sample estimates:
mean of x
2.0242
The confidence interval means that we are 95% confident that on average, all runners are between 1.76 and 2.28 minutes slower in the afternoon.
If we were looking for evidence that the average difference between the afternoon and morning times is not 0, the hypotheses would be \(H_0:\mu=0\) and \(H_a:\mu\ne0,\) where \(\mu\) is the average difference between afternoon and morning 5k times. The t-stat is 15.6 and the p-value is 2.2e-16 (the e-16 means move the decimal 16 places to the left, so the p-value is basically 0). So we have statistically discernible evidence that the average difference between afternoon and morning 5k times is not 0.
The syntax for using chisq.test for association is just
chisq.test(observed), where observed is a
table with the observed counts.
Suppose we want to use the following dataset to test to see if there is an association between the time of day and the season for all of my bike rides:
Fall Spring Summer Winter
AM 16 16 28 10
PM 6 8 4 12
We first need to store the observed counts in a table in R. The
function rbind (for “row bind”) can be used to create a
table. Enter the rows as individual lists (using c()). The
first row is 10,16,28,19 (enter as c(10,16,28,19)), and the
second is 12,8,4,6 (enter as c(12,8,4,6)), so create the
table like this:
# store observed counts in variable called observed
observed <- rbind(c(10,16,28,16),c(12,8,4,6))
print(observed)
[,1] [,2] [,3] [,4]
[1,] 10 16 28 16
[2,] 12 8 4 6
Ignore the pieces that look like [1,]- these are just
keeping track of the rows and columns.
Then, all you need to do is run
chisq.test(observed):
chisq.test(observed)
Pearson's Chi-squared test
data: observed
X-squared = 11.183, df = 3, p-value = 0.01078
This gives you the \(\chi^2\) statistic, the degrees of freedom, and the p-value. In an actual problem, you still need to write out the six steps, this is just an easier way to get the \(\chi^2\) statistic and the p-value.
You don’t need to include the expected counts in the six steps, but if you want to see them, you can do this:
chisq.test(observed)$expected
[,1] [,2] [,3] [,4]
[1,] 15.4 16.8 22.4 15.4
[2,] 6.6 7.2 9.6 6.6
If you have an actual dataset, you can just use the
table command to create the table of the observed counts.
For example, if the data for the bike rides was stored in a dataset
called BikeSample that looks like this:
TimeOfDay Season
1 PM Summer
2 AM Summer
3 AM Summer
4 AM Summer
5 AM Summer
6 AM Spring
You could do this:
observed <- table(BikeSample$TimeOfDay,BikeSample$Season)
This would give you a table with the observed counts:
print(observed)
Fall Spring Summer Winter
AM 16 16 28 10
PM 6 8 4 12
Then you could use the same command to get the \(\chi^2\) stat and the p-value:
chisq.test(observed)
Pearson's Chi-squared test
data: observed
X-squared = 11.183, df = 3, p-value = 0.01078
The syntax for using chisq.test for multiple proportions
is chisq.test(observed, p=nullProportions), where
observed is the list of observed counts, and
nullProportions is the list of the proportions in the null
hypothesis.
If we want to use the bike rides data to test to see if the proportions of all my bike rides that take place in the winter, spring, summer, and fall are not equal, we’d want to test \[ H_O:p_W=0.25,\ p_{Sp}=0.25,\ p_{Su}=0.25,\ p_F=0.25, \] and \[ H_a:\text{at least one proportion is not as claimed} \] where \(p_W,p_{Sp},p_{Su},\) and \(p_F\) are the proportions of all winter, spring, summer, and fall rides.
If the sample data looks like this:
Fall Spring Summer Winter
22 24 32 22
Then the observed counts are 22,24,32,32, and the null proportions are 0.25,0.25,0.25,0.25, so we can do this:
chisq.test(c(22,24,32,22),p=c(0.25,0.25,0.25,0.25))
Chi-squared test for given probabilities
data: c(22, 24, 32, 22)
X-squared = 2.72, df = 3, p-value = 0.4368
If you’d rather actually store the counts and proportions in variables, you could do this:
observed <- c(22,24,32,22)
nullProportions <- c(0.25,0.25,0.25,0.25)
chisq.test(observed, p=nullProportions)
Chi-squared test for given probabilities
data: observed
X-squared = 2.72, df = 3, p-value = 0.4368
If you have an actual dataset, you can again use table
to get the observed counts. For example, if the season is recorded in
the Season variable in the BikeSample dataset,
you could do this:
observed <- table(BikeSample$Season)
nullProportions <- c(0.25,0.25,0.25,0.25)
chisq.test(observed, p=nullProportions)
Chi-squared test for given probabilities
data: observed
X-squared = 2.72, df = 3, p-value = 0.4368
Recall ANOVA is used to test to see if all means are equal between several groups.
We are testing
\(H_0:\mu_1=\mu_2=\cdots=\mu_k\)
\(H_a:\) at least two means are not
equal.
TO visualize the sample data, it’s useful to creat side-by-side boxplots
For this use the syntax
ggplot("Dataset", aes(x="Group", y="Variable")) + geom_boxplot()
where "Variable" is the quantitative variable, and
"Group" is the variable with the groups, and
"Dataset" is the name of the dataset,
In R, aov is the command for ANOVA:
aov(Dataset$Variable ~ Dataset$Group) will return something
like:
aov(Dataset$Variable ~ Dataset$Group)
Call:
aov(formula = Dataset$Variable ~ Dataset$Group)
Terms:
Dataset$Group Residuals
Sum of Squares 64.6520 394.2047
Deg. of Freedom 2 267
Residual standard error: 1.215081
Estimated effects may be unbalanced
This tells us SSG=64.6520 and SSE=394.2047. It also gives you df1=2 (so there are 3 groups) and df2=267 (so the total sample size is 270)
You can use summary to get more statistics:
summary(aov(Dataset$Variable ~ Dataset$Group))
Df Sum Sq Mean Sq F value Pr(>F)
Dataset$Group 2 64.7 32.33 21.89 1.57e-09 ***
Residuals 267 394.2 1.48
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see MSG=32.33, MSE=1.48, F= 21.89 (same thing as 32.33/1.48), and the p-value is \(1.57\times 10^{-9}\). The F-stat is the “standardized test statistic” used in the third step of the hypothesis test.
The Salaries dataset contains information about the
salaries in USD (the salary_in_usd variable) for a sample
of employees the the tech industry. The variable
experience_level gives the experience level
(EN for entry-level MI for mid-level,
SE for senior-level, and EX for
executive-level).
First, let’s make side-by-side boxplots showing the salaries by experience level:
ggplot(Salaries, aes(x=experience_level, y=salary_in_usd))+
geom_boxplot()+
xlab("Experience Level")+
ylab("Salary (USD)")+
ggtitle("Salary (USD) by Experience Level")
Now, we can use summary(aov(...)) to get the relevant
statistics.
summary(aov(salary_in_usd ~ experience_level, data=Salaries))
Df Sum Sq Mean Sq F value Pr(>F)
experience_level 3 2.814e+12 9.380e+11 273.6 <2e-16 ***
Residuals 4982 1.708e+13 3.428e+09
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Homes dataset contains a sample of 33 homes,
including the cost in thousands (Price) and their square
footage (Sqft). To create a scatterplot, use
ggplot with geom_point():
ggplot(Homes, aes(x=Sqft, y=Price)) + geom_point()
The points fit closely into a line, so we would expect the
correlation to be close to 1. We can check this using the
cor command:
cor(Homes$Price, Homes$Sqft)
[1] 0.888334
To find the equation of the least square regression line, use the
lm command. The syntax is
lm(Dataset$Response ~ Dataset$Explanatory)
lm(Homes$Price ~ Homes$Sqft)
Call:
lm(formula = Homes$Price ~ Homes$Sqft)
Coefficients:
(Intercept) Homes$Sqft
-157.1243 0.1503
This means the least squares regression line is \(\widehat{Price} = -157.12 + 0.1503*Sqft\).
Recall that for inference for regression, we want to test to see if an explanatory variable is an “effective predictor” of a response variable. This is the same thing as saying the population slope \(\beta_1\) is not zero.
If we want to test to see if square footage is an effective predictor of price, the six steps would be:
Step 1: \(H_0\): Square footage is not an effective predictor of price.
\(H_a\): Square footage is an effective predictor of price.
Step 2: \(\alpha = 0.05.\)
Step 3: The standardized test stat is \(t=\frac{b_1}{SE}\), where \(b_1\) is the sample slope (0.1503 in this problem), and SE is the standard error for the slope. For now, we’ll be given the SE: SE=0.01395 in this problem.
0.1503/0.01395
[1] 10.77419
Step 4: The p-value will be twice the area below
-abs(t)
2 * pt(-10.77, df=31)
[1] 5.29358e-12
Step 5: Reject \(H_0.\)
Step 6: We have statistically discernible evidence that square footage is an effective predictor of price.
summary to find the statisticsIf we place the lm command inside of
summary, R will do all of the calculations for us.
summary(lm(Homes$Price ~ Homes$Sqft))
Call:
lm(formula = Homes$Price ~ Homes$Sqft)
Residuals:
Min 1Q Median 3Q Max
-84.948 -12.730 -2.704 14.224 112.473
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -157.12426 46.74986 -3.361 0.00208 **
Homes$Sqft 0.15030 0.01395 10.771 5.28e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 47.4 on 31 degrees of freedom
Multiple R-squared: 0.7891, Adjusted R-squared: 0.7823
F-statistic: 116 on 1 and 31 DF, p-value: 5.281e-12
There’s a lot here. All you need to know is that the
Estimate column gives the intercept and the slope, the
second number in the Std. Error column is the SE for the
slope, the second number in the t value column is the
t-stat for step 3, and the second number in the Pr(>|t|)
is the p-value.