Basic Calculations

Arithmetic

Use + - * / for addition, subtraction, multiplication, and division.

100 + 2
[1] 102
100 - 2
[1] 98
100 * 2
[1] 200
100/2
[1] 50

Storing values in variables

Use <- to store values in variables. For example, you can store the number 100 in a variable called x and the number 2 in a variable called y like this:

x <- 100
y <- 2

You can now use x and y for calculations involving 100 and 2:

x + y
[1] 102
x - y
[1] 98
x*y
[1] 200
x/y
[1] 50

Lists

To create a list, use c(). For example, to store the numbers 1,3,5,4 in a list called numbers, do this:

numbers <- c(1,3,5,4)

Mean, Median, standard deviation, etc

You can use things like mean(), median(), etc for basic calculations. For example, if we want the mean, median, and standard deviation of the numbers in the list numbers above, we can do this:

mean(numbers)
[1] 3.25
median(numbers)
[1] 3.5
sd(numbers)
[1] 1.707825

Working with Datasets

The data.frame() command can be used to manually create a dataset in R. For example, consider the following dataset:

Name Major GPA
Alice Math 3.6
Charlie CS 3.7
Bob Math 3.2

Here’s the syntax to create this dataset in R

students <- data.frame(Name = c('Alice', 'Bob', 'Charlie'), Major = c('Math','CS','Math'), GPA = c(3.6, 3.7, 3.2))

This creates a dataset (called students) which has three variables (Name, Major, GPA), and we enter the values of those variables as lists (using c()). You can also split this into separate lines to make it easier to read:

students <- data.frame(
  Name = c('Alice', 'Bob', 'Charlie'), 
  Major = c('Math','CS','Math'), 
  GPA = c(3.6, 3.7, 3.2))

To access a certain value in a dataset, type the name of the dataset, followed by a dollar sign $, followed by the name of the variable.

For example, students$Name will just print out the names of the students:

students$Name
[1] "Alice"   "Bob"     "Charlie"

You can also compute means/medians/standard deviations of quantitative variables in datasets. For example, here are the mean, median, and standard deviation of the GPAs in the students dataset.

mean(students$GPA)
[1] 3.5
median(students$GPA)
[1] 3.6
sd(students$GPA)
[1] 0.2645751

You can get the sample size of a dataset using nrow() (for “number of rows”):

nrow(students)
[1] 3

We won’t usually enter datasets manually like this. Instead, we’ll load in csv files. There are two ways to do this. First, if you have the actual csv (in the same folder as your current project), you can load it directly using read.csv("filename"). For example, if you have a csv file called studentSample.csv, you can just enter this:

studentsSample <- read.csv('studentsample.csv')

In this class, for most datasets, I’ll give you the code to load it from a dropbox folder that you can just copy/paste. It’ll look something like this:

studentSample <- read.csv("https://www.dropbox.com/scl/fi/v573fxi9np280m3exwh79/MajorDrinkHeight.csv?rlkey=74kyi6v2kmbi6kbmmwsxslbfg&dl=1")

Finding mean, median, etc in groups

If you have a dataset in which the individuals are placed in different groups, you can use tapply to find the mean/median/etc within each group by using the tapply command. The syntax is tapply(dataset$Var1, dataset$Var2, function), where Var1 is the name of the quantitative variable, Var2 is the name of the categorical variable (which determine the groups), and function is the name of the function you want to use (like mean, median, etc).

For example, the Salaries dataset gives the salary in USD (the salaray_in_usd variable) of individuals, and groups them by experience level (the experience_level variable). If we want the mean salary in USD for each experience level, we can do this:

tapply(Salaries$salary_in_usd, Salaries$experience_level, mean)
       EN        EX        MI        SE 
 90930.13 186808.88 123173.74 161651.86 

For example, this means that the mean salary for Executive level positions (EX) is $186808.88.

Percentiles

Use quantile(dataset$variable, pct) to get a percentile. For example, to get the 90th percentile of the heights in the student sample dataset, use

quantile(studentSample$Height, 0.90)
 90% 
73.1 

Summary Statistics Use summary to get the min, max, quartiles, and mean all at once:

summary(studentSample$Height)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  62.00   65.25   67.00   68.00   71.50   74.00 

Tables

Frequency Tables: The general syntax to make a frequency table for a qualitative variable in a dataset is table(Dataset$Variable). For example, to make a table of the Major variable from the studentSample Dataset, do the following

table(studentSample$Major)

  CS Math 
   4    6 

Adding totals: Use addmargins() to append the totals to the table:

addmargins(table(studentSample$Major))

  CS Math  Sum 
   4    6   10 

Relative Frequency Tables Use proportions() to create a relative frequency table:

proportions(table(studentSample$Major))

  CS Math 
 0.4  0.6 

You can also add the totals:

addmargins(proportions(table(studentSample$Major)))

  CS Math  Sum 
 0.4  0.6  1.0 

Two-Way Tables To make a table showing two qualitative variables, just plug both variables into table():

table(studentSample$Major, studentSample$FavDrink)
      
       Coffee Tea Water
  CS        2   1     1
  Math      3   2     1

You can add the totals:

addmargins(table(studentSample$Major, studentSample$FavDrink))
      
       Coffee Tea Water Sum
  CS        2   1     1   4
  Math      3   2     1   6
  Sum       5   3     2  10

or display relative frequencies:

proportions(table(studentSample$Major, studentSample$FavDrink))
      
       Coffee Tea Water
  CS      0.2 0.1   0.1
  Math    0.3 0.2   0.1

and add the totals to the relative frequency table:

addmargins(proportions(table(studentSample$Major, studentSample$FavDrink)))
      
       Coffee Tea Water Sum
  CS      0.2 0.1   0.1 0.4
  Math    0.3 0.2   0.1 0.6
  Sum     0.5 0.3   0.2 1.0

Graphs

Bar Graphs

To make a bar graph using “Base R”, you need to plug a table() into barplot():

barplot(table(studentSample$FavDrink))

Histograms

To make a histogram using Base R, just plug the dataset$variable into hist():

hist(studentSample$Height)

Boxplots

To make a boxplot using Base R, just plug dataset$variable into boxplot:

boxplot(studentSample$Height)

ggplot2

One of the nice features of R is that there are many packages that you can install to add more functionality.

One of these package, ggplot2 allows you to make nicer-looking graphs.

To start, you need to install the package (go to “packages” on the bottom-right pane, then click install, search for ggplot2, and click install).

Then, you need to run library(ggplot2) to “load” the library.

Setting up the background and Aesthetic

Whenever you want to create a graph, always start with ggplot(dataset).

library(ggplot2)
ggplot(Salaries)

You’ll see this just creates a blank “background”.

The next thing you need to do is to set up the ``aesthetic.” For now, just think of this as specifying what the axes should represent.

For example, if we want to make a grapht of the salary_in_usd variable, we’d want the x-axis to correspond to salary_in_usd. To specify this in ggplot, type aes(x = salary_in_usd)

library(ggplot2)
ggplot(Salaries, aes(x=salary_in_usd))

Histograms in ggplot2

Single Histograms

Now, whenever you want to actually plot something, you use a + symbol, followed by what you want to add. If we want to add a histogram, the command is + geom_histogram()

library(ggplot2)
ggplot(Salaries, aes(x=salary_in_usd)) + geom_histogram()

You can customize lots of things. For example, we can make the outline of the bars black and the fill grey like this:

library(ggplot2)
ggplot(Salaries, aes(x=salary_in_usd)) + geom_histogram(color="black", fill="grey")

You can change the axis titles like this:

library(ggplot2)
ggplot(Salaries, aes(x=salary_in_usd)) +
  geom_histogram(color="black", fill="grey") +
  xlab("Salary in USD")+
  ylab("Count")

And you can add a title like this:

library(ggplot2)
ggplot(Salaries, aes(x=salary_in_usd)) +
  geom_histogram(color="black", fill="grey") +
  xlab("Salary in USD")+
  ylab("Count")+
  ggtitle("Histogram of Salaries")

Histograms for groups

If you want to have separate histograms for different groups in a dataset, you can use the facet_grid function. The syntax is to add +facet_grid(~Var), where Var is the name of the variable giving the groups. For example, if we want separate histograms for the different experience levels (experience_level variable), we just need this modification:

ggplot(Salaries, aes(x=salary_in_usd)) +
  geom_histogram(color="black", fill="grey") +
  xlab("Salary in USD")+
  ylab("Count")+
  ggtitle("Histogram of Salaries")+
  facet_grid(~experience_level)

The labels on the x-axis are a little jumbled, don’t worry about this.

Boxplots in ggplot2

Single boxplot

Boxplots work similarly:

library(ggplot2)
ggplot(Salaries, aes(x=salary_in_usd)) +
  geom_boxplot() +
  xlab("Salary in USD")+
  ggtitle("Boxplot of Salaries")

(The numbers on the vertical axis don’t matter here)

You can also make the vertical by using aes(y=...)

library(ggplot2)
ggplot(Salaries, aes(y=salary_in_usd)) +
  geom_boxplot() +
  ylab("Salary in USD")+
  ggtitle("Boxplot of Salaries")

side-by-side grouped boxplots in ggplot2

If you want to display boxplots for separate groups, you can specify both the x and y axes: one for the group (qualitative), one for the quantitative variable

library(ggplot2)

ggplot(Salaries, aes(y=salary_in_usd, x=experience_level)) +
  geom_boxplot() +
  xlab("Experience Level")+
  ylab("Salary in USD")+
  ggtitle("Boxplots of Salaries by Experience Level")

Working with normal distributions

Standard Normal Distribution

Use pnorm(#) to get the area to the left of # in N(0,1). Remember this is the same thing as \(P(z\le\#)\). For example, to get \(P(z\le1.2)\), this is all you need to do:

pnorm(1.2)
[1] 0.8849303

So, \(P(z\le 1.2)=0.8849303\).

To get \(P(a\le z\le b)\) (in \(N(0,1)\)), just use pnorm(b)-pnorm(a). For example, for \(P(-2\le z\le 1.2),\) this is all you need:

pnorm(1.2)-pnorm(-2)
[1] 0.8621802

So, \(P(-2\le z\le 1.2)=0.8621802.\)

Other Normal Distributions

For probabilities in other normal distributions, you can first convert to \(N(0,1)\) using z-scores. Remember the z-score for \(x\) in \(N(\mu,\sigma)\) is \(z=\frac{x-\mu}{\sigma}.\) For example, the z-score for 5.5 in \(N(10,3)\) is \(z=\frac{5.5-10}{3}.\) You can do this in R:

(5.5-10)/3
[1] -1.5

So, the z-score is -1.5. You can use this to get \(P(x\le -1.5)\) (in \(N(10,-1.5)\)) using pnorm(z-score for -1.5):

pnorm(-1.5)
[1] 0.0668072

Similarly, to get \(P(5.5\le x\le 13)\) in \(N(10,3)\), you can first find the z-scores for 5.5 and 13:

(5.5-10)/3
[1] -1.5
(13-10)/3
[1] 1

Then use pnorm:

pnorm(1)-pnorm(-1.5)
[1] 0.7745375

You can also store the z-scores in variables by doing something like this:

z1 <- (5.5-10)/3
z2 <- (13-10)/3

Now you can plug z1 and z2 into pnorm rather than using the actual numbers.

pnorm(z2)-pnorm(z1)
[1] 0.7745375

You can also specify the mean and sd in pnorm without using z-scores:

pnorm(13, mean=10, sd=3) - pnorm(5.5, mean=10, sd=3)
[1] 0.7745375

Either way is fine, but make sure you’re able to do the calculations using z-scores.

Probabilities in sampling distributions

Recall the sampling distribution for \(\bar x\) in samples of size \(n\) is (approximately) \(N(\mu,\frac{\sigma}{\sqrt{n}})\), and \(\frac{\sigma}{\sqrt{n}}\) is called the standard error. So we can use this distribution to find probabilities involving \(\bar x\). For example, if the population is \(N(10,3),\) then the standard error for samples of size 36 is \(SE=\frac{3}{\sqrt{36}}\). You can do this calculation in R:

3/sqrt(36)
[1] 0.5

So, \(SE=0.5\). This means the distribution of \(\bar x\) is \(N(10,0.5).\) If we want \(P(\bar x\le 11.25),\) we need to first get the z-score for 11.25, using this distribution:

(11.25-10)/0.5
[1] 2.5

So, the z-score is 2.5. Then use pnorm to get the probability:

pnorm(2.5)
[1] 0.9937903

So, \(P(\bar x\le 11.25)=0.9937903\).

Confidence Intervals Using Normal Distributions

Remember the formula for a confidence interval for \(\mu\) when \(\sigma\) is known is \[ \bar x\pm ME, \] where \[ ME=z_*\cdot SE, \] and \[ SE=\frac{\sigma}{\sqrt{n}} \]

To get a \(z_*\) value, either use a table, or use qnorm. For example, to get the \(z_*\) value for 98% confidence, since you want the middle 98%, each tail is 1%, so you want the 0.01 percentile. To get this in R, do this:

qnorm(0.01)
[1] -2.326348

You’ll see that it’s negative, the actual \(z_*\) value is just the absolute value of this (so it’s 2.326348 in this case). You can also make it positive by just putting a - in front:

-qnorm(0.01)
[1] 2.326348

Example 1

If the population is normal with \(\sigma=4\) and a sample of size 25 has an average of \(\bar x=90,\) give a 98% confidence interval for \(\mu.\)

First, find the standard error:

4/sqrt(25)
[1] 0.8

So, SE=0.8. Now, find the margin of error using \(z_*=2.326348\) and \(SE=0.8\):

2.326348*0.8
[1] 1.861078

Now find the confidence interval:

90-1.861078
[1] 88.13892
90+1.861078
[1] 91.86108

So, the confidence interval is 88.13892 to 91.86108 (98% confident that \(\mu\) is between these two numbers).

If you’re comfortable working with variables, you can streamline all of this:

SE <- 4/sqrt(25)
zstar <- -qnorm(0.01)
ME <- zstar * SE
90 - ME
[1] 88.13892
90 + ME
[1] 91.86108

Hypothesis testing using normal distributions

Example 1

Suppose we want to test to see if the average amount of caffeine in all cups of coffee is less than 100mg. We take a sample of size 30 and determine the sample mean is 98. Assume the population standard deviation is 6. Write out the steps for the hypothesis test.

Step 1: \(H_0:\mu=100\), \(H_a:\mu<100\), where \(\mu\) is the average amount of caffeine in all cups of coffee.

Step 2: Use a significance level of \(\alpha=0.05\).

Step 3: We know the population sd, so use \(z=\frac{98-100}{6/\sqrt{30}}\) as the standardized test statistic.

(98-100)/(6/sqrt(30))
[1] -1.825742

Step 4: The p-value is the area to the left of -1.825742 in \(N(0,1).\) We can use pnorm to get this area:

pnorm(-1.825742)
[1] 0.03394457

Step 5: Our p-value is less than the significance level, so reject \(H_0\).

Step 6: We have statistically discernible evidence that the average amount of caffeine in all cups of coffee is less than 100mg.

Example 2

Suppose we want to test to see if the average amount of caffeine in all cups of coffee is greater than 100mg. We take a sample of size 30 and determine the sample mean is 101. Assume the population standard deviation is 6. Write out the steps for the hypothesis test.

Step 1: \(H_0:\mu=100\), \(H_a:\mu>100\), where \(\mu\) is the average amount of caffeine in all cups of coffee.

Step 2: Use a significance level of \(\alpha=0.05\).

Step 3: We know the population sd, so use \(z=\frac{101-100}{6/\sqrt{30}}\) as the standardized test statistic.

(101-100)/(6/sqrt(30))
[1] 0.9128709

Step 4: The p-value is the area to the right of 0.9128709 in \(N(0,1).\) We can use pnorm to get this area:

1-pnorm(0.9128709)
[1] 0.1806552

Step 5: Our p-value is greater than the significance level, so do not reject \(H_0\).

Step 6: We do not have statistically discernible evidence that the average amount of caffeine in all cups of coffee is greater than 100mg.

Example 3

Suppose we want to test to see if the average amount of caffeine in all cups of coffee is not equal to 100mg. We take a sample of size 30 and determine the sample mean is 103. Assume the population standard deviation is 6. Write out the steps for the hypothesis test.

Step 1: \(H_0:\mu=100\), \(H_a:\mu\ne100\), where \(\mu\) is the average amount of caffeine in all cups of coffee.

Step 2: Use a significance level of \(\alpha=0.05\).

Step 3: We know the population sd, so use \(z=\frac{103-100}{6/\sqrt{30}}\) as the standardized test statistic.

(103-100)/(6/sqrt(30))
[1] 2.738613

Step 4: The p-value is the combined area to the left of -2.738613 and the right of 2.738613 in \(N(0,1).\) This is the same as twice the area to the left of -2.738613. We can use pnorm to get this area:

2 * pnorm(-2.738613)
[1] 0.006169895

Step 5: Our p-value is less than the significance level, so reject \(H_0\).

Step 6: We have statistically discernible evidence that the average amount of caffeine in all cups of coffee is not equal to 100mg.

Working with t-distributions

Recall the t-statistic for a sample with mean \(\bar x\) and standard deviation \(s\) is \[ t=\frac{\bar x - \mu}{s/\sqrt{n}} \] The t-statistic follows a “t-distribution with n-1 degrees of freedom.” In R, the functions pt and qt are the analogs of pnorm and qnorm. The only difference is that with pt and qt you need to specify the degrees of freedom (which is n-1). For example, if we want \(P(-1.2\le z\le 2)\) in \(N(0,1)\), we would use this:

pnorm(2) - pnorm(-1.2)
[1] 0.8621802

So, \(P(-1.2\le z\le 2)=0.8621802.\) If we want \(P(-1.2\le t\le 2)\) in a t-distribution for samples of size 10, then there are 9 degrees of freedom, so we’d use this:

pt(2, df=9) - pt(-1.2, df=9)
[1] 0.831337

So, \(P(-1.2\le t\le 2)=0.8621802.\)

Confidence intervals using t-distributions

Remember the formula for a confidence interval for \(\mu\) when \(\sigma\) is unknown is \[ \bar x\pm ME, \] where \[ ME=t_*\cdot SE, \] and \[ SE=\frac{s}{\sqrt{n}} \]

To get the \(t_*\) values, either use a table, or qt in R. For example, if we want the \(t_*\) value for 98% confidence for samples of size 25 (df = 24), then similarly to finding the \(z_*\) value, we want the area to the left to be 0.01, so we can use this:

-qt(0.01, df=24)
[1] 2.492159

So for samples of size 25, the \(t_*\) value for 98% confidence is 2.492159

Example 1

If the population is normal with unknown mean and unknown standard deviation, and a sample of size 25 has an average of \(\bar x=90\) and a standard deviation of \(s=4,\) give a 98% confidence interval for \(\mu.\)

Note This is the same as the example in the normal distribution section, except we only know the sample standard deviation, rather than the population standard deviation. The steps are almost the same, except that we use the \(t_*\) value rather than a \(z_*\) value.

First, find the standard error:

4/sqrt(25)
[1] 0.8

So, SE=0.8. Now, find the margin of error using \(t_*=2.492159\) (from above) and \(SE=0.8\):

2.492159*0.8
[1] 1.993727

Now find the confidence interval:

90-1.993727
[1] 88.00627
90+1.993727
[1] 91.99373

So, the confidence interval is 88.00627 to 91.99373 (98% confident that \(\mu\) is between these two numbers).

Again, you can can streamline all of this using variables

SE <- 4/sqrt(25)
tstar <- -qt(0.01, df=24)
ME <- tstar * SE
90 - ME
[1] 88.00627
90 + ME
[1] 91.99373

So, we are 98% confident that \(\mu\) is between 88.00627 and 91.99373. Notice this is a slightly wider interval than when we knew \(\sigma\) (rather than \(s\)). This makes sense intuitively- the confidence interval is less precise since we have less information.

Example 2: Working with an actual dataset.

Suppose the dataset called Coffee contains the amount of caffeine in a sample of 25 cups of coffee (the Caffeine variable).

Us this to give a 95% confidence interval for the amount of caffeine in all cups of coffee.

We should first check to see if it’s “appropriate” to use a t-distribution. For this, we want to look at the histogram of the sample.

ggplot(Coffee, aes(x=Caffeine)) +
  geom_histogram(color="black", fill="grey",bins=8)

It’s not perfectly normal, but we have a medium sample size, so it’s still okay to use the t-procedure.

Here are the steps in R (this is all using variables, you can do it by manually plugging in numbers too):

# compute sample size
n <- nrow(Coffee)

# compute sample mean
xbar <- mean(Coffee$Caffeine)

# compute sample sd
s <- sd(Coffee$Caffeine)

# Compute SE
SE <- s/sqrt(n)

# Compute tstar
tstar <- -qt(0.025, df=n-1)

# compute ME
ME <- tstar * SE

# compute CI
xbar - ME
[1] 100.6827
xbar + ME
[1] 103.2919

So, we are 95% confident that the average amount of caffeine in all cups of coffee is between 100.7 and 103.3mg.

Hypothesis testing using t-distributions

Example 1

Suppose we want to test to see if the average amount of caffeine in all cups of coffee is less than 100mg. We take a sample of size 30 and determine the sample mean is 98, and the sample standard deviation is 6. Write out the steps for the hypothesis test.

Step 1: \(H_0:\mu=100\), \(H_a:\mu<100\), where \(\mu\) is the average amount of caffeine in all cups of coffee.

Step 2: Use a significance level of \(\alpha=0.05\).

Step 3: We do not know the population sd, so use \(t=\frac{98-100}{6/\sqrt{30}}\) as the standardized test statistic.

(98-100)/(6/sqrt(30))
[1] -1.825742

Step 4: The p-value is the area to the left of -1.825742 in the t-distribution withdf=29. We can use pt to get this area:

pt(-1.825742, df=29)
[1] 0.03910165

Step 5: Our p-value is less than the significance level, so reject \(H_0\).

Step 6: We have statistically discernible evidence that the average amount of caffeine in all cups of coffee is less than 100mg.

Example 2

Suppose we want to test to see if the average amount of caffeine in all cups of coffee is greater than 100mg. We take a sample of size 30 and determine the sample mean is 101 and the sample standard deviation is 6. Write out the steps for the hypothesis test.

Step 1: \(H_0:\mu=100\), \(H_a:\mu>100\), where \(\mu\) is the average amount of caffeine in all cups of coffee.

Step 2: Use a significance level of \(\alpha=0.05\).

Step 3: We do not know the population sd, so use \(t=\frac{101-100}{6/\sqrt{30}}\) as the standardized test statistic.

(101-100)/(6/sqrt(30))
[1] 0.9128709

Step 4: The p-value is the area to the right of 0.9128709 in the t-distribution with df=29. We can use pt to get this area:

1-pt(0.9128709, df=29)
[1] 0.1844188

Step 5: Our p-value is greater than the significance level, so do not reject \(H_0\).

Step 6: We do not have statistically discernible evidence that the average amount of caffeine in all cups of coffee is greater than 100mg.

Example 3

Suppose we want to test to see if the average amount of caffeine in all cups of coffee is not equal to 100mg. We take a sample of size 30 and determine the sample mean is 103, and the sample standard deviation is 6. Write out the steps for the hypothesis test.

Step 1: \(H_0:\mu=100\), \(H_a:\mu\ne100\), where \(\mu\) is the average amount of caffeine in all cups of coffee.

Step 2: Use a significance level of \(\alpha=0.05\).

Step 3: We do not know the population sd, so use \(t=\frac{103-100}{6/\sqrt{30}}\) as the standardized test statistic.

(103-100)/(6/sqrt(30))
[1] 2.738613

Step 4: The p-value is the combined area to the left of -2.738613 and the right of 2.738613 in the t-distribution with df=29. This is the same as twice the area to the left of -2.738613. We can use pt to get this area:

2 * pt(-2.738613,df=29)
[1] 0.01043738

Step 5: Our p-value is less than the significance level, so reject \(H_0\).

Step 6: We have statistically discernible evidence that the average amount of caffeine in all cups of coffee is not equal to 100mg.

Example 4 (working with an actual dataset)

The average height for all US adults is 66 inches. Suppose a dataset called studentSample contains the heights of 10 college students (in the Height variable). Does this provide statistically significant evidence that the average height of all college students is different than the average for all US adults. (assume it’s “appropriate” to use a t-procedure).

Step 1 The hypotheses are \(H_0:\mu=66\) and \(H_a:\mu\ne66,\) where \(\mu\) is the average height for all college students.

Step 2 Use \(\alpha=0.05.\)

Step 3: Calculate standardized test statistic.

# Option 1: do calculations and copy/paste numbers

# Find sample mean
mean(studentSample$Height)
[1] 68
# Find sample sd
sd(studentSample$Height)
[1] 4.082483
# Calculate t-stat
(68-66)/(4.082483/sqrt(10))
[1] 1.549193
#Option 2: use variables

# Store null parameter in mu0
mu0 <- 66

# store sample mean in xbar
xbar <- mean(studentSample$Height)

# store sample sd in s
s <- sd(studentSample$Height)

# store sample size in n
n <- nrow(studentSample)

# store t-stat in t
t <- (xbar-mu0)/(s/sqrt(n))

#print the answer
print(t)
[1] 1.549193

Step 4 Calculate p-value

# Option 1: plug in -1.549193 to pt

2*pt(-1.549193,df=9)
[1] 0.1557447
# option 2: you can plug in t if you stored the t-stat in t. This is where you can use -abs(t) to make sure you're plugging in a negative t value.

2*pt(-abs(t), df=9)
[1] 0.1557446

Step 5 Do not reject \(H_0\).

Step 6 We do not have statistically discernible evidence that the average height of all college students is different than 66 inches.

t.test

In R, you can use the functions t.test to do hypothesis tests and confidence intervals all at once.

Example 1 (one mean)

In Example 4 above, we wanted to use the Height variable in the studentSample dataset to test \(H_0:\mu=66\) vs \(H_a:\mu\ne 66\).

We can get the t-stat and the p-value using t.test like this:

t.test(studentSample$Height, mu=66)

    One Sample t-test

data:  studentSample$Height
t = 1.5492, df = 9, p-value = 0.1557
alternative hypothesis: true mean is not equal to 66
95 percent confidence interval:
 65.07957 70.92043
sample estimates:
mean of x 
       68 

This gives us the sample mean (68), the t-stat (1.5492), the degrees of freedom (9) and the p-value (0.1557). It also gives us the confidence interval (65.07957 to 70.92043). For the hypothesis test, we would still write out the six steps; t.test is just a quicker way to get the t-stat and the p-value.

Notice by default, the alternative is the “not equal to” alternative. You can change this by specifying alternative in the t.test function. For example, if we wanted to use the same data to test \(H_0:\mu=66\) vs \(H_a:\mu>66\), we would do this:

t.test(studentSample$Height, mu=66, alternative="greater")

    One Sample t-test

data:  studentSample$Height
t = 1.5492, df = 9, p-value = 0.07787
alternative hypothesis: true mean is greater than 66
95 percent confidence interval:
 65.63346      Inf
sample estimates:
mean of x 
       68 

Notice the p-value changes. This also gives us a “one-sided” confidence interval. Don’t worry about what this means. If you have the “less than” alternative, just use alternative="less".

If you want a confidence interval with a confidence level other than 95%, you can specify this using conf.level=.... For example, if we want 99% confidence:

t.test(studentSample$Height, conf.level=0.99)

    One Sample t-test

data:  studentSample$Height
t = 52.673, df = 9, p-value = 1.61e-12
alternative hypothesis: true mean is not equal to 0
99 percent confidence interval:
 63.80448 72.19552
sample estimates:
mean of x 
       68 

Example 2 (two means)

Suppose a dataset called FiveKTimes contains the 5k times (the RaceTime variable) for 30 morning runners and 20 afternoon runners. The time of day is recorded in the TimeOfDay variable. The dataset is structured like this:

  TimeOfDay RaceTime
1   Morning     25.8
2   Morning     26.2
3   Morning     22.4
4   Morning     22.4
5 Afternoon     25.7
6   Morning     25.5

For inference about two means, the syntax is t.test(dataset$Var1 ~ dataset$Var2), where Var1 is the quantitative variable and Var2 is the variable giving the groups (Var2 can only have two possibilities). In this example, we would do this:

t.test(FiveKTimes$RaceTime ~ FiveKTimes$TimeOfDay)

    Welch Two Sample t-test

data:  FiveKTimes$RaceTime by FiveKTimes$TimeOfDay
t = 4.1253, df = 36.57, p-value = 0.0002044
alternative hypothesis: true difference in means between group Afternoon and group Morning is not equal to 0
95 percent confidence interval:
 1.029572 3.018704
sample estimates:
mean in group Afternoon   mean in group Morning 
               26.20000                24.17586 

R will treat the groups alphabetically. Since Afternoon comes before Morning alphabetically, it treats Afternoon as the first group. So we should think of \(\mu_1\) as the average 5k time for all afternoon runners, and \(\mu_2\) as the average 5k time for all morning runners.

The confidence interval (1.03 to 3.018) gives an estimate for \(\mu_1-\mu_2,\) so we are 95% confident that the average 5k time for all afternoon runners is between 1.03 minutes and 3.02 minutes longer than the average 5k time for all morning runners.

If we want to run a hypothesis test to see if this gives statistically discernible evidence that the average 5k time for after runners is greater than the average 5k time for morning runners, the hypotheses are \(H_0:\mu_1=\mu_2\) and \(H_a:\mu_1>\mu_2.\) We just need to specify alternative="greater"

t.test(FiveKTimes$RaceTime ~ FiveKTimes$TimeOfDay, alternative="greater")

    Welch Two Sample t-test

data:  FiveKTimes$RaceTime by FiveKTimes$TimeOfDay
t = 4.1253, df = 36.57, p-value = 0.0001022
alternative hypothesis: true difference in means between group Afternoon and group Morning is greater than 0
95 percent confidence interval:
 1.196099      Inf
sample estimates:
mean in group Afternoon   mean in group Morning 
               26.20000                24.17586 

The t-statistic is 4.1253, and the p-value is 0.0001022, so we have statistically discernible evidence that the average 5k for all afternoon runners is greater than the average 5k time for all afternoon runners.

Example 3 (mean difference in pairs)

In the previous example, the times are measured on different individuals. If instead we have both an afternoon time and a morning time for each individual runner, the dataset might look more like this.

  Afternoon Morning
1     29.79   26.53
2     25.02   23.35
3     29.99   26.66
4     29.82   26.54
5     27.24   24.83
6     21.38   20.92

In this case, it makes sense to look at the difference between the afternoon time and the morning time for each individual, rather the difference between the overall afternoon average and the overall morning average.

In other words, we can pair the afternoon time with the morning time of each individual runner and look at the difference in those pairs. In R, we can create a new variable called Difference in the FiveKTimes dataset like this:

FiveKTimes$Difference <- FiveKTimes$Afternoon - FiveKTimes$Morning

Now, the dataset looks like this:

  Afternoon Morning Difference
1     29.79   26.53       3.26
2     25.02   23.35       1.67
3     29.99   26.66       3.33
4     29.82   26.54       3.28
5     27.24   24.83       2.41
6     21.38   20.92       0.46

When we’re doing inference, the parameter of interest is now the average difference between the afternoon 5k time and the morning 5k time for all individual runners, rather than the difference between the overall afternoon average and the overall morning average.

So, this is really inference for a single mean (\(\mu\) would be the average difference between afternoon and morning 5k times). We would use the Difference variable to test this:

t.test(FiveKTimes$Difference)

    One Sample t-test

data:  FiveKTimes$Difference
t = 15.624, df = 49, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 1.763849 2.284551
sample estimates:
mean of x 
   2.0242 

The confidence interval means that we are 95% confident that on average, all runners are between 1.76 and 2.28 minutes slower in the afternoon.

If we were looking for evidence that the average difference between the afternoon and morning times is not 0, the hypotheses would be \(H_0:\mu=0\) and \(H_a:\mu\ne0,\) where \(\mu\) is the average difference between afternoon and morning 5k times. The t-stat is 15.6 and the p-value is 2.2e-16 (the e-16 means move the decimal 16 places to the left, so the p-value is basically 0). So we have statistically discernible evidence that the average difference between afternoon and morning 5k times is not 0.


chisq.test

chisq.test for association

The syntax for using chisq.test for association is just chisq.test(observed), where observed is a table with the observed counts.

Working with a given table of observed counts

Suppose we want to use the following dataset to test to see if there is an association between the time of day and the season for all of my bike rides:

    
     Fall Spring Summer Winter
  AM   16     16     28     10
  PM    6      8      4     12

We first need to store the observed counts in a table in R. The function rbind (for “row bind”) can be used to create a table. Enter the rows as individual lists (using c()). The first row is 10,16,28,19 (enter as c(10,16,28,19)), and the second is 12,8,4,6 (enter as c(12,8,4,6)), so create the table like this:

# store observed counts in variable called observed
observed <- rbind(c(10,16,28,16),c(12,8,4,6))
print(observed)
     [,1] [,2] [,3] [,4]
[1,]   10   16   28   16
[2,]   12    8    4    6

Ignore the pieces that look like [1,]- these are just keeping track of the rows and columns.

Then, all you need to do is run chisq.test(observed):

chisq.test(observed)

    Pearson's Chi-squared test

data:  observed
X-squared = 11.183, df = 3, p-value = 0.01078

This gives you the \(\chi^2\) statistic, the degrees of freedom, and the p-value. In an actual problem, you still need to write out the six steps, this is just an easier way to get the \(\chi^2\) statistic and the p-value.

You don’t need to include the expected counts in the six steps, but if you want to see them, you can do this:

chisq.test(observed)$expected
     [,1] [,2] [,3] [,4]
[1,] 15.4 16.8 22.4 15.4
[2,]  6.6  7.2  9.6  6.6

Working with an actual dataset

If you have an actual dataset, you can just use the table command to create the table of the observed counts. For example, if the data for the bike rides was stored in a dataset called BikeSample that looks like this:

  TimeOfDay Season
1        PM Summer
2        AM Summer
3        AM Summer
4        AM Summer
5        AM Summer
6        AM Spring

You could do this:

observed <- table(BikeSample$TimeOfDay,BikeSample$Season)

This would give you a table with the observed counts:

print(observed)
    
     Fall Spring Summer Winter
  AM   16     16     28     10
  PM    6      8      4     12

Then you could use the same command to get the \(\chi^2\) stat and the p-value:

chisq.test(observed)

    Pearson's Chi-squared test

data:  observed
X-squared = 11.183, df = 3, p-value = 0.01078

Chi-square test for multiple proportions

The syntax for using chisq.test for multiple proportions is chisq.test(observed, p=nullProportions), where observed is the list of observed counts, and nullProportions is the list of the proportions in the null hypothesis.

Working with given observed counts

If we want to use the bike rides data to test to see if the proportions of all my bike rides that take place in the winter, spring, summer, and fall are not equal, we’d want to test \[ H_O:p_W=0.25,\ p_{Sp}=0.25,\ p_{Su}=0.25,\ p_F=0.25, \] and \[ H_a:\text{at least one proportion is not as claimed} \] where \(p_W,p_{Sp},p_{Su},\) and \(p_F\) are the proportions of all winter, spring, summer, and fall rides.

If the sample data looks like this:


  Fall Spring Summer Winter 
    22     24     32     22 

Then the observed counts are 22,24,32,32, and the null proportions are 0.25,0.25,0.25,0.25, so we can do this:

chisq.test(c(22,24,32,22),p=c(0.25,0.25,0.25,0.25))

    Chi-squared test for given probabilities

data:  c(22, 24, 32, 22)
X-squared = 2.72, df = 3, p-value = 0.4368

If you’d rather actually store the counts and proportions in variables, you could do this:

observed <- c(22,24,32,22)
nullProportions <- c(0.25,0.25,0.25,0.25)
chisq.test(observed, p=nullProportions)

    Chi-squared test for given probabilities

data:  observed
X-squared = 2.72, df = 3, p-value = 0.4368

Working with an actual dataset

If you have an actual dataset, you can again use table to get the observed counts. For example, if the season is recorded in the Season variable in the BikeSample dataset, you could do this:

observed <- table(BikeSample$Season)
nullProportions <- c(0.25,0.25,0.25,0.25)
chisq.test(observed, p=nullProportions)

    Chi-squared test for given probabilities

data:  observed
X-squared = 2.72, df = 3, p-value = 0.4368

ANOVA

Recall ANOVA is used to test to see if all means are equal between several groups.

We are testing

\(H_0:\mu_1=\mu_2=\cdots=\mu_k\)
\(H_a:\) at least two means are not equal.

TO visualize the sample data, it’s useful to creat side-by-side boxplots

For this use the syntax

ggplot("Dataset", aes(x="Group", y="Variable")) + geom_boxplot()

where "Variable" is the quantitative variable, and "Group" is the variable with the groups, and "Dataset" is the name of the dataset,

In R, aov is the command for ANOVA: aov(Dataset$Variable ~ Dataset$Group) will return something like:

aov(Dataset$Variable ~ Dataset$Group)
Call:
   aov(formula = Dataset$Variable ~ Dataset$Group)

Terms:
                Dataset$Group Residuals
Sum of Squares        64.6520  394.2047
Deg. of Freedom             2       267

Residual standard error: 1.215081
Estimated effects may be unbalanced

This tells us SSG=64.6520 and SSE=394.2047. It also gives you df1=2 (so there are 3 groups) and df2=267 (so the total sample size is 270)

You can use summary to get more statistics:

summary(aov(Dataset$Variable ~ Dataset$Group))
               Df Sum Sq Mean Sq F value   Pr(>F)    
Dataset$Group   2   64.7   32.33   21.89 1.57e-09 ***
Residuals     267  394.2    1.48                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see MSG=32.33, MSE=1.48, F= 21.89 (same thing as 32.33/1.48), and the p-value is \(1.57\times 10^{-9}\). The F-stat is the “standardized test statistic” used in the third step of the hypothesis test.

Example

The Salaries dataset contains information about the salaries in USD (the salary_in_usd variable) for a sample of employees the the tech industry. The variable experience_level gives the experience level (EN for entry-level MI for mid-level, SE for senior-level, and EX for executive-level).

First, let’s make side-by-side boxplots showing the salaries by experience level:

ggplot(Salaries, aes(x=experience_level, y=salary_in_usd))+
  geom_boxplot()+
  xlab("Experience Level")+
  ylab("Salary (USD)")+
  ggtitle("Salary (USD) by Experience Level")

Now, we can use summary(aov(...)) to get the relevant statistics.

summary(aov(salary_in_usd ~ experience_level, data=Salaries))
                   Df    Sum Sq   Mean Sq F value Pr(>F)    
experience_level    3 2.814e+12 9.380e+11   273.6 <2e-16 ***
Residuals        4982 1.708e+13 3.428e+09                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Linear Regression

Scatterplots, correlation, LSR lines

The Homes dataset contains a sample of 33 homes, including the cost in thousands (Price) and their square footage (Sqft). To create a scatterplot, use ggplot with geom_point():

ggplot(Homes, aes(x=Sqft, y=Price)) + geom_point()

The points fit closely into a line, so we would expect the correlation to be close to 1. We can check this using the cor command:

cor(Homes$Price, Homes$Sqft)
[1] 0.888334

To find the equation of the least square regression line, use the lm command. The syntax is lm(Dataset$Response ~ Dataset$Explanatory)

lm(Homes$Price ~ Homes$Sqft)

Call:
lm(formula = Homes$Price ~ Homes$Sqft)

Coefficients:
(Intercept)   Homes$Sqft  
  -157.1243       0.1503  

This means the least squares regression line is \(\widehat{Price} = -157.12 + 0.1503*Sqft\).

Inference for regression

Recall that for inference for regression, we want to test to see if an explanatory variable is an “effective predictor” of a response variable. This is the same thing as saying the population slope \(\beta_1\) is not zero.

If we want to test to see if square footage is an effective predictor of price, the six steps would be:

Step 1: \(H_0\): Square footage is not an effective predictor of price.

\(H_a\): Square footage is an effective predictor of price.

Step 2: \(\alpha = 0.05.\)

Step 3: The standardized test stat is \(t=\frac{b_1}{SE}\), where \(b_1\) is the sample slope (0.1503 in this problem), and SE is the standard error for the slope. For now, we’ll be given the SE: SE=0.01395 in this problem.

0.1503/0.01395
[1] 10.77419

Step 4: The p-value will be twice the area below -abs(t)

2 * pt(-10.77, df=31)
[1] 5.29358e-12

Step 5: Reject \(H_0.\)

Step 6: We have statistically discernible evidence that square footage is an effective predictor of price.

Using summary to find the statistics

If we place the lm command inside of summary, R will do all of the calculations for us.

summary(lm(Homes$Price ~ Homes$Sqft))

Call:
lm(formula = Homes$Price ~ Homes$Sqft)

Residuals:
    Min      1Q  Median      3Q     Max 
-84.948 -12.730  -2.704  14.224 112.473 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -157.12426   46.74986  -3.361  0.00208 ** 
Homes$Sqft     0.15030    0.01395  10.771 5.28e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 47.4 on 31 degrees of freedom
Multiple R-squared:  0.7891,    Adjusted R-squared:  0.7823 
F-statistic:   116 on 1 and 31 DF,  p-value: 5.281e-12

There’s a lot here. All you need to know is that the Estimate column gives the intercept and the slope, the second number in the Std. Error column is the SE for the slope, the second number in the t value column is the t-stat for step 3, and the second number in the Pr(>|t|) is the p-value.