2 January 2018
Welcome to MA256. This is an advanced probability and statistics course for select cadets who have demonstrated high performance in the core math program and have a high aptitude for mathematical science. It is designed to lay the foundation of probability theory and inferential statistics.
In the Course Admin section, you’ll find general information about the course, the overall objectives of the course, and a brief tutorial designed to help you get started using R, the primary technology resource we’ll use in MA256. The course guide is then laid out to match the layout of the course: block by block and lesson by lesson. Each lesson throughout the course occupies a dedicated portion of this document. The portion devoted to each lesson will provide the lesson objectives, a specific reading assignment for that day, links to helpful videos, suggested book problems, and lesson-specific problems available in WebAssign.
The MA256 Course Webpage provides resources you may find helpful throughout the semester.
Throughout MA256 we will use the 9th edition of Jay Devore’s textbook, Probability and Statistics for Engineering and the Sciences. A WebAssign subscription provides you access to an electronic version of this textbook. You are required to have access to the textbook in either hard-copy or electronic form (or both).
MA256 is a 1000-point course.
| Event | Points |
|---|---|
| Block 1 WPR | 100 |
| Block 2 WPR | 200 |
| Block 3 WPR | 200 |
| Instructor Points | 100 |
| Course Project (Block 4) | 100 |
| Term End Exam | 300 |
Understand the notion of randomness and the role of variability and sampling in making inferences.
Apply the axioms, theorems, and basic properties of probability and conditional probability to quantify and communicate the likelihood of events.
Employ univariate and joint models using discrete or continuous random variables to answer basic probability questions.
Employ the concepts of confidence intervals and know when they apply to gain insight into populations from their representative samples.
Construct several types of hypothesis tests and develop appropriate conclusions.
Construct, apply, assess, and communicate linear regression models to estimate and interpret regression coefficients and draw conclusions from the models. Understand how variable transformations can assist when modeling assumptions are not met and be able to implement basic transformation of both outcome and explanatory variables.
Model, solve, and interpret solutions to applied problems that can be examined using principles of probability theory, simulation, or statistical analysis.
Develop comfort acquiring real-world data sets and developing appropriate statistical models to address questions of scientific, engineering or public interest.
Use technology with appropriate software to solve, gain insight, and improve understanding on probability and statistics problems.
Communicate assumptions, models, and results using technical language in both written and oral formats.
Apply principles of probability and statistics to model and solve problems with engineering applications.
Develop the skills to critically synthesize statistical data and analysis from all types of media in order to develop rational, well-informed conclusions and opinions about real-world issues.
Display, analyze, and interpret data.
Calculate measures of location and variability and discuss the importance of each.
Describe events, the union of events, the intersection of events, the complement of events, or any combination thereof with appropriate notation.
Calculate the probability of an event, the union of events, the intersection of events, the complement of events, or any combination thereof.
Use counting methods to calculate the size of certain sets. Determine probabilities using these results.
Apply the conditional probability formula, Bayes Theorem, and the definition of independence to calculate probabilities.
Work through the Introductory R Tutorial in order to download, install, and become familiar with the R-Studio environment.
Describe the relationship between probability and statistics.
Identify and explain the difference between a sample and a population.
Find your instructor’s MA256 page for specific day 1 instructions.
| Familiarize | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 1.1 | Introduction to RStudio | 1 | Register Using Class Key |
| Devore 1.2 | | 18, 21, 26 | WebAssign Cadet Introduction |
| Devore 1.3 | | 35, 37, 41 | |
| Devore 1.4 | | 44abc, 45, 50, 51 | |
| Introductory R Tutorial | | | |
| MA256 Course Guide | | | |
See the R Help below for assistance with the suggested and WebAssign problems.
Here is the book data for Devore Chapter 1, Problem 18.
directors = c(4,5,6,7,8,9,10,11,12,13,14,15,16,17,21,24,32)
frequency = c(3,12,13,25,24,42,23,19,16,11,5,4,1,3,1,1,1)
board.size = data.frame(directors,frequency)
Here is the book data for Devore Chapter 1, Problem 21.
y = c(1,0,1,0,0,2,0,1,1,1,2,1,0,0,1,1,0,1,1,1,1,0,0,0,1,1,2,0,1,2,2,1,1,0,2,1,1,0,1,5,0,3,1,0,0,0,3)
z = c(1,8,6,1,1,5,3,0,0,4,4,0,0,1,2,1,4,0,4,0,3,0,1,1,0,1,3,2,4,6,6,0,1,1,8,3,3,5,0,5,2,3,1,0,0,0,3)
Here is the book data for Devore Chapter 1, Problem 26
rel.freq = c(0.177, 0.166, 0.175, 0.136, 0.194, 0.078, 0.044, 0.03)
Here is the book data for Devore Chapter 1, Problem 35.
data = c(0.2,0.22,0.25,0.3,0.34,0.41,0.55,0.56,
1.42,1.7,1.83,2.2,2.25,3.07,3.25)
Here is the book data for Devore Chapter 1, Problem 37.
snowcover = c(6.5,12.0,14.9,10.0,10.7,7.9,21.9,12.5,14.5,9.2)
Here is the book data for Devore Chapter 1, Problem 44.
data = c(180.50,181.70,180.90,181.60,182.60,181.00,
181.30,182.10,182.10,180.30,181.70,180.50)
Here is the book data for Devore Chapter 1, Problem 50.
data = c(37,60,75,115,135,140,149,150,238,290,340,410,600,750,750,
750,1050,1100,1139,1150,1200,1200,1250,1576,1700,1825,2000)
Here is the book data for Devore Chapter 1, Problem 51.
data = c(87,103,130,160,180,195,132,145,211,
105,145,153,152,138,87,99,93,119,129)
Let’s create a histogram using the grades data introduced in the Introductory R Tutorial. We’ll plot the WPR1 scores with frequencies on the y-axis:
grades = read.csv("GradeData.csv",header=TRUE)
WPR1 = grades[,1]
hist(WPR1)
We have several additional arguments we can add to customize this histogram with desired labels and a desired number of breaks.
hist(WPR1, breaks = 10, freq = T, xlab = "WPR1 Grades",
main = "Histogram of WPR1 Grades", col = "gray",labels=TRUE)
We can also look at the underlying calculations used to create the histogram. Let’s say we are interested in the number of grades greater than or equal to 100. To answer this question, we’ll need to know how to count data points that meet a certain criterion. Consider the following example:
my.data = 1:5
print(my.data)
## [1] 1 2 3 4 5
What percentage of my.data is less than 3?
# Step 1. What's my denominator? How many data points do I have?
total.data.points = length(my.data)
print(total.data.points)
## [1] 5
# Step 2. How many of my data points meet my criteria?
my.data<3 # Notice this returns a TRUE or FALSE for each element of my.data.
## [1] TRUE TRUE FALSE FALSE FALSE
# These will be added as 1s and 0s.
# Now I just count the number of times my.data met my criteria.
sum(my.data<3)
## [1] 2
# Step 3. To get a percentage, I can divide my count by the total number of data points.
sum(my.data<3)/total.data.points
## [1] 0.4
Now I’m ready to calculate how many students scored at least 100 on WPR1.
sum(WPR1>=100)
## [1] 48
What percentage of students scored at least 80 but less than 90? There are lots of ways to calculate this.
total.students = length(WPR1)
(sum(WPR1<90)-sum(WPR1<80))/total.students
## [1] 0.01694915
sum((WPR1>=80)&(WPR1<90))/total.students
## [1] 0.01694915
When you start with a frequency distribution table like the one provided in Problem 19, you simply want to add certain frequency counts to get the total observed frequency across multiple bins.
In Problem 19a, we seek the proportion of sampled wafers that had at least five particles.
num.particles = 0:14
frequency = c(1,2,3,12,11,15,18,10,12,4,5,3,1,2,1)
wafers = data.frame(num.particles,frequency)
wafers$num.particles # Find the indices that correspond to 5 particles or greater.
## [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
# Indices 6 through 15 correspond to particle counts of at least 5.
sum(wafers$frequency[6:15])/sum(wafers$frequency)
## [1] 0.71
# Alternatively, we can use logical indexing in this manner:
sum(wafers$frequency[wafers$num.particles>=5])/sum(wafers$frequency)
## [1] 0.71
Given a set of data, we can quickly compute a variety of measures of location and variability in R. Let’s revisit the WPR1 scores from GradeData.csv.
grades = read.csv("GradeData.csv",header=TRUE)
WPR1 = grades[,1]
Compute the mean WPR1 score.
mean(WPR1)
## [1] 109.3898
Compute a 10% trimmed mean for WPR1. The “trim” argument removes the specified portion of the sample from BOTH the lower and upper ends of the sorted data set.
mean(WPR1,trim=.1)
## [1] 110.2653
Compute the median WPR1 score.
median(WPR1)
## [1] 111
Compute the standard deviation of WPR1 scores.
sd(WPR1)
## [1] 12.9615
Compute the variance of WPR1 scores.
var(WPR1)
## [1] 168.0006
Compute quantiles for the WPR1 scores. By default, we get the quartiles.
quantile(WPR1)
## 0% 25% 50% 75% 100%
## 60 103 111 118 130
Compute specific percentiles of the WPR1 scores as specified by the “probs” argument.
quantile(WPR1,probs=c(.1,.9))
## 10% 90%
## 93.0 123.2
Use one command to get a brief summary of the data set.
summary(WPR1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 60.0 103.0 111.0 109.4 118.0 130.0
Determine the sample space of an experiment.
Identify the relationship between an outcome and an event, and be able to explain them in terms of subsets within a sample space.
Given two events, identify the complement, union, intersection, or any combination thereof.
Draw and label a Venn Diagram.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 2.1 | Sample Space and Intersection and Union | 1, 2 | 3 |
| | | Review 1-21, 39, 51 | |
Define and interpret probability statements.
Apply the three axioms of probability to determine probabilities of specific events.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 2.2 | Axioms of Probability | 12, 17, 22 | 14 |
| | | Review 6 | |
Explain the difference between a combination and permutation.
Solve problems involving combinations, permutations, and the product rule.
Calculate the probability of a union or intersection of events using counting techniques.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 2.3 | Permutations | 30, 34, 39 | 30, 32 |
| | Combinations | Review 22 | |
| | Probability Using Combinations | | |
What is the probability that you share a birthday with someone in this classroom?
What is the probability that any two people share a birthday in this classroom?
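As a minimal sketch of how these two questions could be computed in R, assume a hypothetical classroom of 20 people and 365 equally likely birthdays (the class size is an assumption, not given above):
n = 20  # hypothetical class size
# P(at least one of the other n-1 people shares YOUR birthday)
1 - (364/365)^(n - 1)          # approximately 0.051
# P(at least two people in the room share a birthday)
1 - prod((365:(365 - n + 1))/365)  # approximately 0.411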
Combinations in R use the function choose(n,k).
For example, the number of ways to choose 3 items from 5:
choose(n=5,k=3)
## [1] 10
In R, you can define and then use this short function to compute permutations. The Pkn() function name corresponds to the \(P_{k,n}\) notation in the Devore textbook on p. 70.
Pkn = function(n, k){
return(factorial(n)/factorial(n - k))
}
Pkn(n=5,k=3)
## [1] 60
Compute probabilities using the Law of Total Probability.
Apply Bayes Theorem to calculate conditional probabilities.
Given mutually exclusive and exhaustive conditional events, draw a tree diagram to represent the scenario.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 2.4 | Conditional Probability | 46, 47, 60, 62 | 45, 60 |
| | Bayes’ Theorem | Review 39 | |
Calculate the probability of the union and intersection of two or more independent events.
Use the definition of independence to determine the independence of two events.
Explain the differences between mutually exclusive events and independent events.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 2.5 | Independence and Reliability Diagrams | 71, 80, 83 | 71, 80, 86 |
| | Excerpt from Drunkard’s Walk | Review 64 | |
Define random variables as real-valued functions and discuss the differences between discrete and continuous random variables.
Calculate and interpret probabilities, expected values, and variances for discrete and continuous random variables.
Discuss the relationship between a probability mass function (PMF) or probability density function (PDF) and the corresponding cumulative distribution function (CDF). Construct a CDF from a PMF or PDF (and vice-versa) and use either to compute probabilities.
Recognize a binomial experiment and understand its underlying assumptions.
Utilize technology to calculate probabilities associated with named discrete and continuous distributions.
Construct the joint distribution of two random variables given their independent marginal distributions.
Calculate Expected Value, Covariance, and Correlation and understand their analytical significance.
Apply and use a basic Monte Carlo simulation to analyze real world problems.
Define a discrete random variable in terms of a real-valued function.
Create a probability mass function (PMF) in a table or graphical format.
Construct a CDF from a PMF and be able to use either to calculate probabilities.
Apply Characteristics 1-6 for discrete distributions (see PMF Reference Sheet).
Calculate and interpret the expected value and variance of a discrete random variable (characteristics 7 and 8 in the PMF Reference Sheet)
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 3.1 | Random Variables | 1, 7 | 1 |
| Devore 3.2 | Discrete Distributions | 12, 16, 24 | 12 |
Here is the book data for Devore Chapter 3, Problem 12.
y = 45:55
py = c(.05,.1,.12,.14,.25,.17,.06,.05,.03,.02,.01)
A common operation with a PMF is to compute the probability over a range of values. Consider Devore Chapter 3, Problem 12. What is the probability that more ticketed passengers show up than there are seats on the plane (i.e., at least 51 ticketed passengers)?
y = 45:55
py = c(.05,.1,.12,.14,.25,.17,.06,.05,.03,.02,.01)
sum(py[y>50])
## [1] 0.17
We can also generate a CDF quickly from any PMF in R using the cumsum() command.
Fy = cumsum(py)
print(Fy)
## [1] 0.05 0.15 0.27 0.41 0.66 0.83 0.89 0.94 0.97 0.99 1.00
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 3.3 | Expected Value and Variance | 29, 30, 36 | 29, 30 |
| | | Review 16, 23 | |
Let’s calculate the expected value and variance for the airline passenger problem (Devore Chapter 3, Problem 12).
y = 45:55
py = c(.05,.1,.12,.14,.25,.17,.06,.05,.03,.02,.01)
Compute the expected value of \(Y\) or \(E[Y]\).
EY = sum(y*py)
print(EY)
## [1] 48.84
Compute the variance of \(Y\) or \(V[Y]\).
VY = sum((y-EY)^2*py)
print(VY)
## [1] 4.4944
Alternatively we can use the shortcut formula for the variance of \(Y\) or \(V[Y]\).
VY = sum(y^2*py)-EY^2
print(VY)
## [1] 4.4944
Recognize the underlying assumptions of a binomial experiment.
Calculate expected value, variance, and probabilities associated with a binomial random variable.
Explain the uses of the Poisson distribution.
Calculate expected value, variance, and probabilities associated with a Poisson random variable.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 3.4 | Binomial Distribution | 48, 49, 54, 60 | 54 |
| | Expected Value of Binomial Distribution | | |
| Devore 3.6 | Poisson Distribution 1 | 81, 87, 89 | 80 |
| | Poisson Distribution 2 | Review 36 | |
We have already seen how to calculate probabilities with unnamed discrete distributions. We can accomplish the same task with named distributions even more easily in R. Let’s use the binomial distribution to demonstrate a new suite of functions. All of the functions introduced here follow the *binom format. The letter we use in place of the * will achieve different results, always using an underlying binomial distribution.
As an example, you may flip a fair coin 10 times (\(n=10\)) with a 50% chance of heads on each flip (\(p=0.5\)). Let \(X\) be the number of heads you observe. Let \(X \sim binom(n=10,p=0.5)\).
What is the probability of exactly 7 heads (dbinom())? Calculate \(P(X=7)\).
# We need to use the PMF, which is called using "dbinom".
# We can say dbinom(x=7,size=10,prob=.5) or more simply:
dbinom(7,10,.5)
## [1] 0.1171875
What is the probability of at most 7 heads (pbinom())? Calculate \(P(X\leq7)\).
# We need to use the CDF, which is called using "pbinom".
# We can say pbinom(q=7,size=10,prob=.5) or more simply:
pbinom(7,10,.5)
## [1] 0.9453125
What is the probability of between 3 and 6 heads, inclusive (pbinom())? Calculate \(P(3\leq X \leq 6)\). Note that \(P(3\leq X \leq 6) = F(6)-F(2)\).
# We can use the CDF, which is called using "pbinom".
# We can say pbinom(q=6,size=10,prob=.5)-pbinom(q=2,size=10,prob=.5) or more simply:
pbinom(6,10,.5)-pbinom(2,10,.5)
## [1] 0.7734375
# We can also use the PMF, which is called using "dbinom".
# We can say sum(dbinom(x=3:6,size=10,prob=.5)) or more simply:
sum(dbinom(x=3:6,size=10,prob=.5))
## [1] 0.7734375
The last command calculated the probabilities for each specified value of \(x\) (\(3:6\)) and summed all the individual probabilities.
As an example, the number of customers, \(X\), arriving at a particular Subway in any given hour has a Poisson distribution with \(\mu=10\). Let \(X \sim Poisson(\mu=10)\).
What is the probability of exactly 7 customers in a particular hour (dpois())?
# We need to use the PMF, which is called using "dpois".
# We can say dpois(x=7,lambda=10) or more simply:
dpois(7,10)
## [1] 0.09007923
What is the probability of no more than 7 customers (ppois())?
# We need to use the CDF, which is called using "ppois".
# We can say ppois(q=7,lambda=10) or more simply:
ppois(7,10)
## [1] 0.2202206
Define a continuous random variable in terms of a real-valued function.
Compare and contrast discrete and continuous random variables.
Verify a function is a valid PDF by using characteristics 1 and 2 on the PDF Reference Sheet.
Explain why the probability of any particular value for a continuous random variable is zero (Characteristic 3).
Given a continuous PDF, identify the domain and calculate corresponding probabilities (Characteristic 5).
Identify the uniform probability distribution and explain its properties, parameters, and applications.
Employ the calculus-based relationship between PDFs and CDFs in order to transform one into the other (Characteristic 4 on the PDF Reference Sheet).
Calculate probabilities using the CDF (Characteristic 6).
Calculate and interpret the Expected Value and Variance of a continuous random variable (Characteristics 7 and 8).
Calculate and interpret a percentile from the distribution of a continuous random variable (Characteristic 9).
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 4.1 | Probability Density Functions | 1, 5, 7, 9 | 5, 7 |
| Devore 4.2 | | 11, 14, 15 | 11, 14 |
| | | Review 60, 89 | |
As an example, the length of time you wait, \(X\), for a sandwich at Subway is uniformly distributed between 2 and 5 minutes. Let \(X \sim Uniform(A=2,B=5)\).
What is the probability you wait between 3 and 4 minutes for a sandwich (punif())?
# We need to use the CDF, which is called using "punif".
# We can say punif(q=4,min=2,max=5)-punif(q=3,min=2,max=5) or more simply:
punif(4,2,5)-punif(3,2,5)
## [1] 0.3333333
In this lesson, we see our first named continuous distribution: the uniform distribution. Let \(f(x)\) represent the uniform distribution pdf and \(F(x)\) represent the uniform distribution CDF. Recall:
\[ P(a \leq x \leq b) = \int_{a}^{b} f(x)dx = F(b)-F(a). \]
R evaluates the CDF of many named distributions using the “p” version of the distribution (e.g., punif()). The technique below will help you calculate integrals for named distributions. As an example, suppose grades are distributed uniformly between 0 and 100. What is the probability that you score between 80 and 100? You can solve this in your head quickly (\(0.2\)), so now let’s try to produce the same answer in R.
# Define the limits of integration.
a = 80
b = 100
# Specify the distribution parameters.
A = 0
B = 100
# Calculate F(b)-F(a)
punif(b,min=A,max=B)-punif(a,min=A,max=B)
## [1] 0.2
We can also calculate percentiles of named continuous distributions. Let’s continue our Subway example from LSN 14 in which the length of time you wait, \(X\), for a sandwich at Subway is uniformly distributed between 2 and 5 minutes. Let \(X \sim Uniform(A=2,B=5)\). What is the 80th percentile wait time for a sandwich (qunif())?
# We need to find a particular percentile (quantile), which is obtained using "qunif".
# We can say qunif(p=.8,min=2,max=5) or more simply:
qunif(.8,2,5)
## [1] 4.4
It is possible to find percentiles for unnamed continuous distributions as well. In one of the suggested problems (Devore Chapter 4, Problem 15d), we are asked to find the 75th percentile of this pdf:
\[ f(x)=\left\{ \begin{array}{ll} 90x^8(1-x) & 0<x<1 \\ 0 & \textrm{otherwise} \\ \end{array} \right. \]
We use the CDF to find percentiles, so let’s derive the CDF for this unnamed pdf.
\[F(x) = \int_0^x 90y^8(1-y)\,dy = \left(10y^9-9y^{10}\right)\bigg|_0^x = 10x^9-9x^{10}\]
We’re looking for the value of \(c\) such that \(F(c) = 10c^9-9c^{10} = .75\).
Manipulate the CDF to become a function of \(c\) that equals 0.
\[10c^9-9c^{10} = .75\]
\[10c^9-9c^{10} - .75 = 0\]
Let R’s uniroot() find the value of \(c\) that makes this last equation true.
# First we need to define our function of interest. We'll call it 'p' for percentile. Note that we need to frame our function such that the root(s) of the function (where the function equals 0) provide(s) our answer(s).
p = function(c) 10*c^9-9*c^10-.75
# Use the uniroot() function to find when p=0. We specify the interval our solution should exist in based on the domain of the pdf - (0,1) in this case.
solution = uniroot(p,interval=c(0,1))
solution$root
## [1] 0.9035964
This result denotes the 75th percentile of the pdf above.
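As a quick sanity check (a minimal sketch, not part of the book problem), we can numerically integrate the pdf up to this value and confirm the area is 0.75:
f = function(x) 90*x^8*(1-x)        # the pdf from Problem 15d
integrate(f, lower = 0, upper = 0.9035964)  # should report a value very close to 0.75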
Identify the normal probability distribution and explain its properties, parameters, and applications.
Transform the normal distribution into the standard normal distribution with mean = 0 and standard deviation = 1.
Identify the exponential probability distribution and explain its properties, parameters, and applications.
Use technology to calculate probabilities and percentiles associated with random variables from the normal and exponential distributions.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 4.3 (p. 156-165) | | 33, 39 | 38, 40 |
| Devore 4.4 (p. 170-172) | | 59, 61 | 59 |
| | | Review 9, 15 | |
In this lesson, we calculate numerous probabilities using the normal distribution. Let \(f(x)\) represent the normal distribution pdf and \(F(x)\) represent the normal distribution CDF. Recall:
\[ P(a \leq x \leq b) = \int_{a}^{b} f(x)dx = F(b)-F(a). \]
R evaluates the CDF of many named distributions using the “p” version of the distribution (e.g., pnorm()). As an example, the height of cadets, \(X\) (in inches), at USMA is normally distributed with \(\mu=70\) and \(\sigma=2\). Let \(X \sim N(\mu=70,\sigma=2)\).
What is the probability that a randomly selected cadet is between 72 and 75 inches tall (pnorm())?
# Define the limits of integration.
a = 72
b = 75
# Specify the distribution parameters.
mu = 70
sigma = 2
# Calculate F(b)-F(a)
# We need to use the CDF, which is called using "pnorm".
# We can say pnorm(q=b,mean=mu,sd=sigma)-pnorm(q=a,mean=mu,sd=sigma) or more simply:
pnorm(b,mu,sigma)-pnorm(a,mu,sigma)
## [1] 0.1524456
What is the probability that a randomly selected cadet is at least 75 inches tall (pnorm())?
# We need to use the CDF, which is called using "pnorm".
# We can say pnorm(q=Inf,mean=70,sd=2)-pnorm(q=75,mean=70,sd=2) or more simply:
pnorm(Inf,70,2)-pnorm(75,70,2)
## [1] 0.006209665
# or more simply:
1 - pnorm(75,70,2)
## [1] 0.006209665
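Equivalently, we can standardize first and use the standard normal distribution; here is a minimal sketch of the transformation \(Z = (X-\mu)/\sigma\):
z = (75 - 70)/2   # standardize: z = (x - mu)/sigma = 2.5
1 - pnorm(z)      # pnorm defaults to mean = 0 and sd = 1
## [1] 0.006209665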
How tall is the 80th percentile cadet (qnorm())?
# We need to find a particular percentile (quantile), which is obtained using "qnorm".
# We can say qnorm(p=.8,mean=70,sd=2) or more simply:
qnorm(.8,70,2)
## [1] 71.68324
We’ll use the pexp() and qexp() functions for the exponential distribution.
Example 4.21 from Devore: Suppose \(X \sim Exp(\lambda=\frac{1}{6})\). What is \(P(X\leq 10)\)?
# Here we use the CDF of the exponential distribution.
# pexp(x=10,rate=1/6)
pexp(10,1/6)
## [1] 0.8111244
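The qexp() function works the same way as qnorm() and qunif(). As a small sketch, the median (50th percentile) of this same distribution is:
# We use the quantile function of the exponential distribution.
qexp(p = 0.5, rate = 1/6)
## [1] 4.158883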
Note: We sometimes drop the argument names (e.g., mean=10 \(\rightarrow\) 10) as long as the values we supply are in the order R expects. Be careful with distributions that can be parameterized by either a rate or a scale (\(\beta\)) parameter: an unnamed value in the rate position will be treated as a rate, so if you intend to supply the scale you must name it explicitly. When you aren’t 100% sure how R is going to treat your function arguments, you can never go wrong by spelling them out completely. The order of your listed arguments doesn’t matter when you clearly specify what each argument is.
For example, pnorm(q=75,mean=70,sd=2)\(=\)pnorm(sd=2,q=75,mean=70).
Explain the role of independence in joint probability distribution functions for both discrete and continuous random variables.
Explain the relationship between the joint, marginal, and conditional distributions for both discrete and continuous random variables.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 5.1 | | 3, 13 | 3, 9 |
| | | Review 4-46, 60 | |
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 5.2 | | 22, 27 | 22 |
| | | Review 3, 12 | |
Calculate the covariance between two continuous or discrete random variables.
Calculate the correlation coefficient between two continuous or discrete random variables.
Explain the pros and cons of analyzing data using these statistical measures.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 5.2 | | 30, 31 | |
| | | Review 26 | |
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| | | | 24 |
Apply Monte Carlo Simulation to model and analyze real world problems.
Given a PDF or CDF, find the inverse CDF and understand its role in Monte Carlo Simulation.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Review 5.3 (p. 225-229) | | Monte Carlo Suggested Problems | |
| Supplemental Material | | | |
You can produce random variates in R using the r prefix for named distributions (e.g., rbinom(), rpois(), rnorm(), rexp(), rgamma()). You can simulate 10 \(U[0,1]\) random variates by calling runif(n=10). You can then apply a derived transformation function to these uniform random variates to simulate drawing random variates from another distribution. The supplemental material explains both the theory and implementation of Monte Carlo simulation. An Army-Navy Football Monte Carlo tutorial can be found at Army-Navy Football: A Monte Carlo Tutorial.
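As a minimal sketch of the inverse-CDF idea (the exponential distribution with rate 1/6 from Example 4.21 is reused here purely for illustration), we can transform uniform random variates and check the result against pexp():
set.seed(1)             # make the simulation reproducible
u = runif(n = 10000)    # U[0,1] random variates
x = -log(1 - u)/(1/6)   # inverse exponential CDF: F^{-1}(u) = -ln(1-u)/lambda
mean(x)                 # should be close to the true mean, 1/lambda = 6
mean(x <= 10)           # should be close to pexp(10, 1/6) = 0.811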
Given a scenario, build a Monte Carlo simulation and analyze the results.
Interpret results from a simulation to provide analysis and answer probability questions.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
Apply the Central Limit Theorem (CLT) to find the distribution and parameters of the sample mean.
Calculate confidence intervals for the population mean or variance for small sample sizes and a normal underlying population distribution.
Perform a hypothesis test by completing the following four steps: Construct the appropriate null and alternative hypothesis; Calculate the test statistic; Calculate a p-value from the test statistic; Interpret the p-value and state a conclusion in terms of the original problem.
Identify the differences between a 1-sample t-test, a 2-sample t-test, and a paired t-test. Recognize when to apply each test and how to interpret the results.
Define and explain the concept of a random sample.
Explain why the sample mean is a random variable.
Approximate the sampling distribution of the sample mean of a random sample from a normally distributed population.
Using the Central Limit Theorem, approximate the sampling distribution of the sample mean of a sufficiently large random sample from any population distribution.
Calculate probabilities associated with the sampling distribution of the sample mean.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 5.3 (p. 220-222), 5.4 | Central Limit Theorem | 46, 47, 50, 53 | 46, 51 |
| Supplemental Material (below) | Standard Error of the Mean | | |
An excerpt from Devore 6.1, p. 259:
The technique below will help you solve Problem 47a. (Note: Alternatively, all of the numbers can be inserted directly into the pnorm() function.)
# Define the limits of integration.
a = 69
b = 71
# Specify the distribution parameters.
mu = 70
sigma = 1.6/sqrt(16)
# Calculate F(b)-F(a)
pnorm(b,mean=mu,sd=sigma)-pnorm(a,mean=mu,sd=sigma)
## [1] 0.9875807
Explain how the confidence interval expression is derived (pg. 277-278, Equations 7.1 - 7.4).
Interpret a confidence interval on the true population mean of a normally distributed population.
Calculate a confidence interval using technology and interpret the result.
Analyze how changing the confidence level and sample size affect a confidence interval.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 7.1 (p. 276-282) | CI for Mean (known variance) | 3, 5, 6 | 6 |
| | | Review 5-53 | |
Recognize when the t distribution is appropriate to use in calculating a confidence interval for the true population mean (e.g., small sample from an underlying normal distribution with unknown variance or a large sample from any distribution with unknown variance).
Calculate a confidence interval for the true population mean using the t-distribution.
Interpret a confidence interval for the true population mean.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 7.3 (p. 295-299) | CI for Mean (Small Sample Sizes) | 29, 33, 35 | 29, 33 |
| Supplemental Material (below) | | Review 5 | |
Confidence Interval for the Population Mean (\(\sigma\) known)
Objective: Determine an interval of plausible values for \(\mu\).
Step 1: Find the \(z\) critical value of the interval (\(z_{\alpha / 2}\)).
Plot the \(z\) distribution.
Place the confidence level in the middle (for two-tail intervals).
Find the \(z_{\alpha / 2}\) that forces the remaining area into the tails.
Step 2: Calculate the interval.
Step 3: Interpret the interval.
Confidence Interval for the Population Mean (\(\sigma\) unknown)
Objective: Determine an interval of plausible values for \(\mu\).
Step 1: Find the t critical value of the interval (\(t_{\alpha/2,\nu}\)).
Plot the \(t_{n-1}\) distribution.
Place the confidence level in the middle (for two-tail intervals).
Find the \(t_{\alpha/2,\nu}\) that forces the remaining area into the tails.
Step 2: Calculate the interval.
Step 3: Interpret the interval.
Note: Recall that \(\nu = df = n-1\) for this application.
Here is the book data for Devore Chapter 7, Problem 33.
data = c(418,421,421,422,425,427,431,434,437,
439,446,447,448,453,454,463,465)
Start with the confidence interval expression for a normally distributed random variable (p. 281):
\[ \left(\bar{y}-z_{\alpha/2} \cdot \frac{\sigma_Y}{\sqrt{n}},\bar{y}+z_{\alpha/2} \cdot \frac{\sigma_Y}{\sqrt{n}}\right) \]
Substitute values from Devore Chapter 7, Problem 5 to obtain the solution to Problem 5a.
# Store the provided information.
ybar=4.85
n=20
sigma.Y = .75
# The critical z value (alpha = .05) falls at the 97.5th percentile of the standard normal distribution.
z.crit = qnorm(.975)
# Calculate the confidence interval. Note: c(a,b) simply combines two values 'a' and 'b' to form a vector (a,b).
c(ybar-z.crit*sigma.Y/sqrt(n),ybar+z.crit*sigma.Y/sqrt(n))
## [1] 4.521304 5.178696
The confidence interval formula (7.15) on page 297 can be calculated in R as follows:
# Suppose we started with this data set:
y = c(418,421,421,422,425,427,431,434,437,
439,446,447,448,453,454,463,465)
# Store some summary statistics.
ybar=mean(y)
n=length(y)
sigma.Y = sd(y)
alpha = .05
# The critical t value (alpha = .05) falls at the 97.5th percentile of the t distribution with n-1 degrees of freedom.
t.crit = qt(1-alpha/2,n-1)
# Calculate the confidence interval. Note: c(a,b) simply combines two values 'a' and 'b' to form a vector (a,b).
c(ybar-t.crit*sigma.Y/sqrt(n),ybar+t.crit*sigma.Y/sqrt(n))
## [1] 430.5077 446.0805
The confidence interval formula (7.15) on page 297 can also be calculated in R using the t.test() function. We’ll explore more of the t.test() arguments and outputs in future lessons. Notice the confidence interval is identical to the one produced above.
y = c(418,421,421,422,425,427,431,434,437,
439,446,447,448,453,454,463,465)
t.test(y)
##
## One Sample t-test
##
## data: y
## t = 119.33, df = 16, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 430.5077 446.0805
## sample estimates:
## mean of x
## 438.2941
By default, R provides a 95% confidence interval. If we want to produce an interval with a different level of confidence, simply change the conf.level argument. Here is an example of a 90% confidence interval.
data = c(418,421,421,422,425,427,431,434,437,
439,446,447,448,453,454,463,465)
t.test(data,conf.level=.9)
##
## One Sample t-test
##
## data: data
## t = 119.33, df = 16, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
## 431.8815 444.7067
## sample estimates:
## mean of x
## 438.2941
Problem 33 introduces the boxplot. While not a lesson objective, constructing a boxplot in R is quite easy. Check out the command boxplot().
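For instance, using the Problem 33 data given above (a minimal sketch; the plot title is just illustrative):
data = c(418,421,421,422,425,427,431,434,437,
         439,446,447,448,453,454,463,465)
boxplot(data, horizontal = TRUE, main = "Boxplot of Problem 33 Data")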
Explain the purpose of conducting a hypothesis test.
Given a scenario, be able to construct the proper null and alternate hypotheses.
Explain the types of error associated with hypothesis testing.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 8.1 | Hypothesis Testing | 2adf, 5, 7, 9 | 2adf, 5 |
| | Type I Error | Review 7-34, 35 | |
Calculate the test statistic (\(t_{stat}\)) and p-value (\(p\)) for a hypothesis test.
Determine the conclusion of the hypothesis test based on the comparison of the p-value and significance level \(\left(p \, \textrm{vs.} \, \alpha\right)\) or the test statistic and t-critical values \(\left(t_{stat} \, \textrm{vs.} \, t_{\alpha \, \textrm{or} \, \alpha/2}\right)\).
Conduct a four-step hypothesis test (see Supplemental Material) on the population mean of a normally distributed population and interpret the results in the context of the original problem.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 8.3 | Hypothesis Testing and p-values | 30, 37b, 38 | 29, 31 |
| Supplemental Material (below) | What is a p-value? | Review 8 | |
| | Hypothesis Testing - One Tail vs. Two Tails | | |
| | One-Sample t-Test in R | | |
The p-value is the probability of obtaining the observed test statistic (or a more extreme result) when the null hypothesis is actually true. More simply stated, how rare is what I just saw if I assume the null hypothesis to be true?
Objective: To determine what the true mean, \(\mu\), is NOT (gather evidence against \(H_0\)).
Step 1: State \(H_0\) and \(H_a\). Note the significance level (\(\alpha\)) and pay particular attention to the inequality of \(H_a\).
Step 2: Calculate the test statistic.
Step 3: Find the P-Value.
Plot the distribution of the test statistic if \(H_0\) is true.
Plot the test statistic on the number line.
Calculate the appropriate area based on the form of \(H_A\) (one or two tails of the distribution).
Step 4: Interpret the P-Value in terms of the original problem.
Here is the book data for Devore Chapter 8, Problem 37.
data = c(112.3,97,92.7,86,102,99.2,95.8,103.5,89,86.7)
Here is the book data for Devore Chapter 8, Problem 38.
data = c(1.10,5.09,0.97,1.59,4.60,0.32,0.55,1.45,0.14,4.47,
1.20,3.50,5.02,4.67,5.22,2.69,3.98,3.17,3.03,2.21,
0.69,4.47,3.31,1.17,0.76,1.17,1.57,2.62,1.66,2.05)
We’ll now explore the arguments and outputs of the t.test() function in R. In Problem 37b, we wish to conduct a hypothesis test regarding the strength of a batch of concrete. In this case, \(H_0:\mu=100 \, \textrm{MPa}\) and \(H_a: \mu<100 \, \textrm{MPa}\). Let \(\alpha=0.05\). The t.test() function has several arguments; for our purposes in this one-sample test, we’ll concern ourselves with x, alternative, and mu:
t.test(x, y = NULL,
alternative = c("two.sided", "less", "greater"),
mu = 0, paired = FALSE, var.equal = FALSE,
conf.level = 0.95, ...)
x refers to our sample data, alternative will be “less” as stated in our \(H_a\), and mu is 100 as stated in our \(H_0\).
data = c(112.3,97,92.7,86,102,99.2,95.8,103.5,89,86.7)
t.test(x=data,alternative="less",mu=100)
##
## One Sample t-test
##
## data: data
## t = -1.3708, df = 9, p-value = 0.1018
## alternative hypothesis: true mean is less than 100
## 95 percent confidence interval:
## -Inf 101.2073
## sample estimates:
## mean of x
## 96.42
Consider the same scenario presented in this manner:
\[H_0:\mu=100 \, \textrm{MPa}\]
\[H_a: \mu<100 \, \textrm{MPa}\]
\[\bar{x}=96.42\]
\[s=8.259\]
\[n=10\]
xbar = 96.42
s = 8.259
n = 10
t.stat = (xbar - 100)/(s/sqrt(n))
print(t.stat)
## [1] -1.370741
# We'll use the t distribution CDF (pt) to calculate the p-value.
p.value = pt(t.stat,n-1)
print(p.value)
## [1] 0.1018333
Thus, we arrive at the same p-value whether you start with data or sample statistics.
If you choose to compare the test statistic to the t critical value, recall how we obtain the t critical values using the qt() function.
alpha = .05
# Note that the "one-sided, less than" t critical value occurs at the 5th percentile of the t distribution with 9 degrees of freedom.
t.crit = qt(alpha,n-1)
print(t.crit)
## [1] -1.833113
Recognize and explain the differences between a one-sample t-test, a two-sample t-test, and a paired t-test. Recognize when the data columns are dependent on one another and a paired t-test is more appropriate than a two-sample t-test.
Given two sets of independent data, use the R function “t.test” to perform a hypothesis test for the true difference of means and interpret the result using the p-value or test statistic.
Given two sets of independent data, calculate a confidence interval for the true difference of means.
Given paired data, perform a hypothesis test in R for the true mean difference and interpret the result using the p-value or test statistic.
Using paired data, calculate a confidence interval for the true mean difference of two populations.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 9.2 (374-377) | | 23c, 28, 32 | 28 |
| Devore 9.3 | Paired t-Test in R | 36, 40ac | 36 |
| | | Review 8-37ab, 40 | |
Here is the book data for Devore Chapter 9, Problem 23.
H = c(1.2,0.9,0.7,1,1.7,1.7,1.1,0.9,1.7,1.9,1.3,2.1,
1.6,1.8,1.4,1.3,1.9,1.6,0.8,2,1.7,1.6,2.3,2)
P = c(1.6,1.5,1.1,2.1,1.5,1.3,1,2.6)
Here is the book data for Devore Chapter 9, Problem 28.
YF = c(29,34,33,27,28,32,31,34,32,27)
OF = c(18,15,23,13,12)
Here is the book data for Devore Chapter 9, Problem 36.
unabraded = c(36.4,55,51.5,38.7,43.2,48.8,25.6,49.8)
abraded = c(28.5,20,46,34.5,36.5,52.5,26.5,46.5)
Here is the book data for Devore Chapter 9, Problem 40.
L = c(1928,2549,2825,1924,1628,2175,2114,2621,1843,2541)
P = c(2126,2885,2895,1942,1750,2184,2164,2626,2006,2627)
We’ll now explore additional arguments and outputs of the t.test() function in R. In Problem 23c, we wish to conduct a hypothesis test (\(\alpha = .05\)) assessing whether the true averages differ between the two fabrics. In this case, \(H_0:\mu_1-\mu_2=0\) and \(H_a: \mu_1-\mu_2 \neq 0\). The t.test() function has several arguments; for our purposes in this two-sample test, we’ll concern ourselves with x, y, alternative, conf.level, and mu:
t.test(x, y = NULL,
alternative = c("two.sided", "less", "greater"),
mu = 0, paired = FALSE, var.equal = FALSE,
conf.level = 0.95, ...)
x refers to the first sample, y refers to a second sample, alternative will be “two.sided” as stated in our \(H_a\), conf.level is .95, and mu is 0 as stated in our \(H_0\).
H = c(1.2,0.9,0.7,1,1.7,1.7,1.1,0.9,1.7,1.9,1.3,2.1,
1.6,1.8,1.4,1.3,1.9,1.6,0.8,2,1.7,1.6,2.3,2)
P = c(1.6,1.5,1.1,2.1,1.5,1.3,1,2.6)
t.test(x=H,y=P,alternative="two.sided",mu=0,conf.level=.95)
##
## Welch Two Sample t-test
##
## data: H and P
## t = -0.38011, df = 10.482, p-value = 0.7115
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.5403506 0.3820172
## sample estimates:
## mean of x mean of y
## 1.508333 1.587500
We’ll now explore a final argument of the t.test() function in R. In Problem 36, we wish to conduct a hypothesis test (\(\alpha=0.05\)) assessing various fabrics in both abraded and unabraded conditions. Since the same fabric is tested under two conditions, we have paired data. In this case, \(H_0:\mu_D=0\) and \(H_a: \mu_D > 0\). The t.test() function has several arguments; for our purposes in this paired test, we’ll concern ourselves with x, y, alternative, conf.level, mu, and paired:
t.test(x, y = NULL,
alternative = c("two.sided", "less", "greater"),
mu = 0, paired = FALSE, var.equal = FALSE,
conf.level = 0.95, ...)
x refers to the first sample, y refers to a second sample, alternative will be “greater” as stated in our \(H_a\), mu is 0, and paired is TRUE.
unabraded = c(36.4,55,51.5,38.7,43.2,48.8,25.6,49.8)
abraded = c(28.5,20,46,34.5,36.5,52.5,26.5,46.5)
t.test(x=unabraded,y=abraded,alternative="greater",mu=0,conf.level = 0.95,paired=TRUE)
##
## Paired t-test
##
## data: unabraded and abraded
## t = 1.7286, df = 7, p-value = 0.06375
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -0.6961062 Inf
## sample estimates:
## mean of the differences
## 7.25
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| | | Review 32, 40ac | |
Create, use, and interpret a simple linear regression model using R and R-Studio.
Determine the adequacy and appropriateness of a simple linear regression model on a given data set.
Verify linear model assumptions: linearity, independence, and normally distributed residuals with a mean of 0 and constant variance (i.e., \(\epsilon \sim N(0,\sigma^2)\)).
Use appropriate techniques to model data when assumptions are violated.
Develop and assess multiple regression models with continuous and categorical variables.
Use appropriate statistical techniques when the response variable is discrete.
Identify and explain the underlying assumptions of a simple linear regression model.
Given regression output in R, identify the three parameters of the simple linear regression model (slope, intercept, and variance of the random error); identify the R-squared, adjusted R-squared, estimate of the intercept, and estimate of the slope; and explain what each means.
Explain the random error term and identify its distribution.
Explain the principle of Least Squares and recall how it is used to find the best linear regression model.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 12.1 | Simple Linear Regression | 1c, 7, 8ab, 9a-d | 7, 11abc |
| Devore 12.2 | Simple Linear Regression in R | 14, 16, 24 | 14 |
| Project Prompt | | | |
Here is the book data for Devore Chapter 12, Problem 1.
Temp = c(170,172,173,174,174,175,176,177,180,180,180,180,
180,181,181,182,182,182,182,84,184,185,186,188)
Ratio = c(0.84,1.31,1.42,1.03,1.07,1.08,1.04,1.8,1.45,1.6,1.61,2.13,
2.15,0.84,1.43,0.9,1.81,1.94,2.68,1.49,2.52,3,1.87,3.08)
Here is the book data for Devore Chapter 12, Problem 14.
Temp = c(170,172,173,174,174,175,176,177,180,180,180,180,
180,181,181,182,182,182,182,84,184,185,186,188)
Ratio = c(0.84,1.31,1.42,1.03,1.07,1.08,1.04,1.8,1.45,1.6,1.61,2.13,
2.15,0.84,1.43,0.9,1.81,1.94,2.68,1.49,2.52,3,1.87,3.08)
Here is the book data for Devore Chapter 12, Problem 16.
x = c(5,12,14,17,23,30,40,47,55,67,72,81,96,112,127)
y = c(4,10,13,15,15,25,27,46,38,46,53,70,82,99,100)
Here is the book data for Devore Chapter 12, Problem 24.
x = c(50,71,55,50,33,58,79,26,69,44,37,70,20,45,49)
y = c(152,1929,48,22,2,5,35,7,269,38,171,13,43,185,25)
Regression analysis investigates the relationship between two or more variables, and simple linear regression investigates the relationship between only two variables. The model for simple linear regression is as follows:
\[ Y = \beta_0 +\beta_1x + \epsilon \text{ where } \epsilon \sim N(0,\sigma^2) \]
We want to use our grades data again (see Introductory R Tutorial) in order to determine if there is a linear relationship between WPR 1 scores and instructor grades. Previously we created a scatter plot, and it appeared that there might be a positive linear relationship. Let’s re-create the graph here:
grades = read.csv("GradeData.csv",header=TRUE)
plot(grades[,3], grades[,1], xlab = "Instructor Grades", ylab = "WPR1 Scores", main = "A Scatter Plot of WPR1 Scores against Instructor Grades")
Since we do not have the data on the entire population, we will have to estimate the parameters of the true regression line by using our sample data. The estimated regression line is as follows:
\[ \hat{y} = \hat{\beta_0} +\hat{\beta_1}x \]
Notice that the estimated regression line has hats that denote estimates, and the epsilon term (\(\epsilon\)) drops out because \(\hat{y}\) is not a random variable after the slope and intercept are determined. If we know what \(x\) is, then we can estimate \(\hat{y}\) using the calculated parameter estimates \(\hat{\beta_0}\) and \(\hat{\beta_1}\).
We can easily use calculus and the equations in the book to determine the estimates for the slope and intercept, or we can use the lm() function to do this for us (lm stands for linear model). Let’s determine the estimated regression line in R where we use instructor grades as our x-variable (predictor) and WPR 1 scores as our y-variable (response):
wpr1 = grades[,1] # response
inst.grade = grades[,3] # predictor
Creating new objects for both \(x\) and \(y\) helps the user follow the output of the linear model lm() function. Now that we have defined the response \(y\) and independent \(x\) variables, we can use the lm() function to perform simple linear regression:
model1 = lm(wpr1~inst.grade) # Notice the tilde after wpr1. The tilde means "is a function of."
You can read this as “\(wpr1\) is a function of \(inst.grade\).” Let’s check out the model we produced. It turns out this model1 object we created is a collection of calculated attributes of our linear model. We’ll explore several of these attributes in the coming lessons.
summary(model1) # This prints some details about our created model.
##
## Call:
## lm(formula = wpr1 ~ inst.grade)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.015 -5.238 1.474 9.050 14.729
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 57.753 9.893 5.838 2.65e-07 ***
## inst.grade 67.883 12.875 5.272 2.16e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.72 on 57 degrees of freedom
## Multiple R-squared: 0.3278, Adjusted R-squared: 0.316
## F-statistic: 27.8 on 1 and 57 DF, p-value: 2.157e-06
From the R regression output, the rows under the Coefficients: label contain information for both the intercept and slope parameters. Notice the estimate of the intercept (Intercept), \(\hat{\beta}_0 = 57.753\), and the estimate of the slope (inst.grade), \(\hat{\beta}_1 = 67.883\). Therefore, the estimated regression line is:
\[ \hat{\text{wpr1}} = 57.753 + 67.883 \times \textrm{inst.grade} \]
The interpretation of the intercept is typically not important in MA256. Instead, we will focus on interpreting the slope. First, the estimate of the slope is positive: as the instructor grade increases, we expect the WPR1 score to increase as well, which isn’t terribly surprising. Second, if the instructor grade increases by 1 percentage point (0.01), then the expected WPR1 score increases by \(.01 \times 67.88 = .6788\) points. Notice that the scales of the two grades differ; WPR1 grades are whole numbers from 60 to 130, while instructor grades are proportions between .49 and .97. Recognizing differences in scales is critical for model interpretation.
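As a quick check of this interpretation (a minimal sketch; the instructor grades 0.85 and 0.86 are hypothetical values), we can ask the fitted model for predictions one percentage point apart:
# predict() evaluates the fitted line at new predictor values.
predict(model1, newdata = data.frame(inst.grade = c(0.85, 0.86)))
# The two predictions differ by 0.01 * 67.88, or about 0.68 points.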
You’ll notice that there is a lot of data contained in the regression summary output. At a minimum, we will want to extract the Intercept, Slope, Standard Error of the Model, and R-squared.
When we defined the regression model above, we named it model1. When we displayed the output from the simple regression model, we used the summary command (i.e., summary(model1)). In order to see the names of all the attributes contained in the model summary, we use the names() function:
names(summary(model1))
## [1] "call" "terms" "residuals" "coefficients"
## [5] "aliased" "sigma" "df" "r.squared"
## [9] "adj.r.squared" "fstatistic" "cov.unscaled"
We can assign all of the summary information to a new object as follows:
output = summary(model1)
We may want to index into the output object and get the data that we need for additional calculations. Notice that the coefficients data contains the intercept and slope. We can “grab” specific attributes of the model summary (stored as output) by using the $ operator. See the examples below:
output$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 57.75300 9.892894 5.837826 2.647529e-07
## inst.grade 67.88264 12.875310 5.272311 2.156943e-06
This gives us both the intercept and the slope along with other information in matrix-style format. We can get each individual number by indexing further:
output$coefficients[1,1] # Extract the estimate for the intercept only.
## [1] 57.753
output$coefficients[2,1] # Extract the estimate for the slope only.
## [1] 67.88264
Picking up where we finished, there is additional information stored in each linear model summary. We’ll start again with:
model1 = lm(wpr1~inst.grade)
output = summary(model1)
The r.squared data contains the R-squared value:
output$r.squared # There is no need to index further since only one value is stored in this attribute.
## [1] 0.3278085
The sigma value contains the Standard Error of the Model:
output$sigma
## [1] 10.7196
We can store these numbers for use in other calculations. For example, we can use specific values from the output$coefficients to calculate a confidence interval for the slope coefficient (see LSN 29 R Help).
Determine the relationship between the independent and dependent variables using linear regression. Consider the sign and magnitude of the slope coefficient as well as the scales of the predictor and response variables.
Given regression output in R, conduct the model utility test on the slope parameter, identify the test statistic and p-value, and draw appropriate conclusions about the usefulness of the slope estimate.
Calculate a confidence interval for the true value of the slope parameter.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 12.3 | | 32, 34ab | 32 |
| | | Review 16, 24 | |
Here is the book data for Devore Chapter 12, Problem 32.
x = c(5,12,14,17,23,30,40,47,55,67,72,81,96,112,127)
y = c(4,10,13,15,15,25,27,46,38,46,53,70,82,99,100)
Here is the book data for Devore Chapter 12, Problem 34.
x = c(4.35,4.79,5.57,5.20,5.07,5.79,5.36,6.40,5.66,
5.90,6.49,5.70,6.49,6.37,6.51,7.88,6.74,7.08)
y = c(4.55,4.49,4.50,4.47,4.47,4.45,4.40,4.34,4.43,
4.43,4.42,4.40,4.33,4.44,4.40,4.26,4.32,4.34)
We’ll continue with our model from the last couple lessons.
model1 = lm(wpr1~inst.grade)
output = summary(model1)
With the use of R, a confidence interval for any of the regression parameters is easily calculated. Suppose we want to determine a 95% confidence interval for the slope parameter, \(\hat{\beta_1}\). We can manually calculate a confidence interval using the attributes available from our model summary. We’ll need the output$coefficients:
output$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 57.75300 9.892894 5.837826 2.647529e-07
## inst.grade 67.88264 12.875310 5.272311 2.156943e-06
We’ll need some of these values for use in the \(\hat{\beta_1}\) confidence interval:
\[\hat{\beta_1} \pm t_{\alpha/2,n-2}s_{\hat{\beta_1}}\]
The output$coefficients values can be indexed like a traditional matrix. Specifically, we need to access the inst.grade estimate and its standard error. These exist on the second row in the first and second columns.
alpha = .05
n = length(inst.grade)
t.crit = qt(1-alpha/2,n-2)
estimate = output$coefficients[2,1] # Index the coefficient estimate for the inst.grade parameter.
SE = output$coefficients[2,2] # Index the standard error estimate for the inst.grade parameter.
# Now we can calculate and display the confidence interval for the inst.grade coefficient.
c(estimate-t.crit*SE,estimate+t.crit*SE)
## [1] 42.10028 93.66501
From the output above, we are 95% confident that the true value of the slope parameter for the population regression line is between 42.1 and 93.67 (rounded). Alternatively, we can use the confint() function by entering the regression model as the first argument, the parameter name as the second argument, and the confidence level as the third argument. Try the following code:
confint(model1, parm = "inst.grade", level = .95)
## 2.5 % 97.5 %
## inst.grade 42.10028 93.66501
The results from the two methods agree. You can change the confidence level by changing the level argument in the confint() function.
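For instance, a 90% interval for the same slope parameter (a minimal illustration):
confint(model1, parm = "inst.grade", level = .90)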
Correctly identify categorical predictors.
Properly encode categorical predictor variables in R.
Identify significant categorical variables by comparing p-values to a predetermined significance level (\(\alpha\)).
Correctly interpret the \(\beta\) coefficients in terms of the initial problem.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 13.4 | Categorical Predictors in R | 39, Suggested Problem Below | 39 |
| Supplemental Material Below | | Review 37 | |
You learned in the video that we sometimes work with data that is categorical.
For example, a person’s gender or hair color takes one of a small number of categories rather than a numeric value.
We must handle these predictors differently because they do not operate on a continuous scale. Either a person is male or they are not. There is not a continuous scale between male and female.
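As a minimal sketch of how R handles such a predictor (the tiny gender vector below is made up purely for illustration), a categorical variable is stored as a factor and expanded into 0/1 dummy columns when it enters a model:
gender = factor(c("M", "F", "M", "F"))   # a made-up categorical variable
model.matrix(~ gender)                   # the 'genderM' column is 1 for males and 0 for females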
Take, for example, the following simulated data on an individual’s pay.
pay <- read.csv(file="payexample.csv", header=TRUE, sep=",")
attach(pay)
head(pay)
## Years.of.Service Pay Gender Hair.Color
## 1 3 53520.54 M notblond
## 2 15 113240.48 M notblond
## 3 24 162067.65 M notblond
## 4 20 143756.46 F blond
## 5 12 101980.35 F notblond
## 6 4 62929.25 M notblond
As you can see, we have one response variable (Pay) and three predictor variables.
We want to determine what impact years of service, gender, and hair color have on an individual’s pay at this company (if any).
To help answer this, let’s see what happens when we regress Pay on our three predictors.
lm.pay = lm(Pay~(Years.of.Service+Hair.Color+Gender))
summary(lm.pay)
##
## Call:
## lm(formula = Pay ~ (Years.of.Service + Hair.Color + Gender))
##
## Residuals:
## Min 1Q Median 3Q Max
## -9745.5 -3278.6 -453.8 3223.9 13522.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44006.52 1849.10 23.799 < 2e-16 ***
## Years.of.Service 4990.71 99.94 49.939 < 2e-16 ***
## Hair.Colornotblond -5857.65 1543.86 -3.794 0.000431 ***
## GenderM 992.41 1473.42 0.674 0.503973
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5039 on 46 degrees of freedom
## Multiple R-squared: 0.9822, Adjusted R-squared: 0.981
## F-statistic: 846.3 on 3 and 46 DF, p-value: < 2.2e-16
You should notice several things that are different about this output.
First, our predictor Years.of.Service is significant because its p-value is less than our predetermined alpha of .05.
Second, Gender is displayed in the model as GenderM. Since this is not a continuous predictor, R treats it as a categorical predictor, and the GenderM coefficient captures the effect that being male has on an individual’s pay. What about being female? Since being female can be considered the absence of being male, the female effect is rolled into the intercept of the model. The same is true of Hair.Colornotblond, which captures the effect that not being blond has on an individual’s pay.
If a categorical predictor is insignificant (i.e., \(p > \alpha\)), then we do not have evidence that the presence of that predictor has an impact on our response.
In the example above, since the p-value for GenderM is greater than .05, we conclude that there is no statistically significant pay difference between men and women, and we can remove Gender from our model.
lm.pay2 = lm(Pay~(Years.of.Service+Hair.Color))
summary(lm.pay2)
##
## Call:
## lm(formula = Pay ~ (Years.of.Service + Hair.Color))
##
## Residuals:
## Min 1Q Median 3Q Max
## -9366.7 -3628.8 -409.4 3293.9 12923.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44441.55 1722.53 25.800 < 2e-16 ***
## Years.of.Service 4995.83 99.07 50.430 < 2e-16 ***
## Hair.Colornotblond -5709.77 1519.26 -3.758 0.000472 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5010 on 47 degrees of freedom
## Multiple R-squared: 0.982, Adjusted R-squared: 0.9813
## F-statistic: 1284 on 2 and 47 DF, p-value: < 2.2e-16
However, since the p value for Hair.Colornotblond is <.05, this provides strong evidence that there is a significant difference in pay between individuals who have blond hair and those who do not.
How much difference? We must look at our coefficient for our linear model:
lm.pay2$coefficients
## (Intercept) Years.of.Service Hair.Colornotblond
## 44441.548 4995.832 -5709.767
We can see that our model predicts that someone without blond hair gets paid $5,709.77 less, on average, than someone with blond hair, holding years of service constant.
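As a quick, hypothetical check on this interpretation, we can ask the model to predict pay for two individuals who differ only in hair color (the 10 years of service below is an arbitrary value chosen for illustration):
# Predict pay for two employees with 10 years of service, one blond and one not.
new.pay = data.frame(Years.of.Service = c(10, 10), Hair.Color = c("blond", "notblond"))
predict(lm.pay2, newdata = new.pay)
diff(predict(lm.pay2, newdata = new.pay)) # matches the Hair.Colornotblond coefficient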
### Suggested Problem {#Sugprob2}
x1 = c(8.782746, 7.553363, 1.608668, 8.070618, 2.676193, 9.273625, 1.609907, 6.812618, 3.177070, 2.043432, 3.158704, 6.631469, 8.905731, 5.935131, 3.148553, 1.198568, 8.451915, 8.664580, 8.759836, 3.117905)
x2 = c(0.07339088, 0.26012713, 0.65840424, 0.12574442, 0.01284719, 0.93768603, 0.25010248, 0.66528929, 0.27026371, 0.54081582, 0.19599295, 0.57015303, 0.83265614, 0.79580407, 0.74082821, 0.50192293, 0.78072478, 0.94169540, 0.03287943, 0.43837245)
x3 = c(4.2280742, 2.0078027, 4.4866210, 0.9259695, 2.2998340, 2.1388518, 0.3326518, 2.3098570, 3.2388353, 1.3700555, 1.0926034, 1.2696775, 3.4287515, 1.6269742, 0.8906086, 3.6825764, 2.5702117, 4.3030499, 2.9233768, 2.2676811)
y1 = c(32.088026, 26.935867, 12.596351, 27.259926, 11.378572, 37.648887, 7.779088, 25.528008, 15.225756, 9.145160, 10.752428, 27.167595, 34.758202, 22.669995, 14.543876, 9.491130, 34.520567, 36.624624, 30.636396, 17.580203)
Develop and assess a multiple regression model to explain the variation of a response variable using quantitative and categorical predictor variables.
Using R-Studio, perform multiple regression using quantitative and categorical predictors.
Given multiple regression output in R, identify and explain the adjusted R-squared, the estimates of the regression coefficients, and the standard error of the model.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 13.4 | Multiple Linear Regression in R | 36, 38, Suggested Problem Below | 38 |
Multiple regression takes into account more than one independent variable. The multiple regression model is as follows:
\[ Y = \beta_0 +\beta_1x_1 + \beta_2x_2 + ...+ \beta_kx_k + \epsilon \text{ where } \epsilon \sim N(0,\sigma^2) \]
The goal of multiple regression is to determine estimates of the \(\beta_i\) coefficients from our sample data set that best predict the dependent variable \(y\). A useful multiple regression model will have coefficient estimates that are statistically different from zero, the coefficients will be easily interpretable, and the linear model assumptions will be met. The final estimated multiple regression equation will take on the form:
\[ \hat{y} = \hat{\beta_0} +\hat{\beta_1}x_1 + \hat{\beta_2}x_2 + ...+ \hat{\beta_k}x_k \]
Before we begin, recall the different columns of data in the grades data we’ve been using:
grades = read.csv("GradeData.csv",header=TRUE)
names(grades)
## [1] "WPR1" "WPR2" "InstructorGrade" "APSC"
## [5] "SAT.Math"
Let’s assume that we want to find a relationship between WPR2 grades and all the other scores (WPR1, InstructorGrade, APSC, and SAT.Math).
You can create scatterplots using the plot() function and plot the dependent variable \(y\) against each \(x_i\). If we wanted to explore all the relationships quickly, we can create a scatterplot of all the different variable pairs using the pairs() function:
pairs(grades)
You only need to focus on either the upper right triangle above the diagonal or the lower left triangle below the diagonal. For example, let’s look at the graph in the 1st row and the 2nd column. This graph has WPR1 scores on the y-axis and WPR2 scores on the x-axis. There appears to be a positive linear relationship between these two test scores (i.e., the better I perform on WPR1, then the better I can expect to perform on WPR2). The remaining graphs can be read in a similar way. Creating these scatterplots will help identify any non-linear relationships among the data. In this data set, non-linear relationships appear to be absent, which is a good thing since we are trying to fit a linear model using all of the data.
Multiple regression in R is done similarly to how we performed simple linear regression, but now we can include more than just a single independent variable. In order to easily interpret the regression output, assign each column of data to its own object:
wpr1 = grades[,1]
wpr2 = grades[,2]
inst.grade = grades[,3]
apsc = grades[,4]
sat.math = grades[,5]
Using the objects above, we can use the lm() function to estimate the coefficients. Since we are creating another model, we can label our different models in sequential order. For example, we name the next linear model model2 as follows:
model2 = lm(wpr2~wpr1+inst.grade+apsc+sat.math) # Recall the use of "tilde" after wpr2.
summary(model2) # Print the regression output.
##
## Call:
## lm(formula = wpr2 ~ wpr1 + inst.grade + apsc + sat.math)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.087 -4.353 1.726 7.295 29.952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.73128 19.11153 0.718 0.4756
## wpr1 0.08087 0.17214 0.470 0.6404
## inst.grade 23.58476 26.68980 0.884 0.3808
## apsc 7.17491 5.08628 1.411 0.1641
## sat.math 0.06549 0.02748 2.383 0.0207 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.99 on 54 degrees of freedom
## Multiple R-squared: 0.4152, Adjusted R-squared: 0.3719
## F-statistic: 9.585 on 4 and 54 DF, p-value: 6.252e-06
From the output, the only coefficient that is statistically different from zero is sat.math. In MA256, we want to reduce our initial model down to the most parsimonious model available. A parsimonious model balances explaining as much variance in the response variable as possible while using as few predictor variables as possible. Since wpr1 has the highest p-value in the output, we eliminate it from the model first.
model3 = lm(wpr2~apsc+sat.math+inst.grade)
summary(model3)
##
## Call:
## lm(formula = wpr2 ~ apsc + sat.math + inst.grade)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.149 -5.398 1.942 7.243 29.722
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.36348 17.35353 1.001 0.3214
## apsc 7.65062 4.94902 1.546 0.1279
## sat.math 0.06842 0.02657 2.575 0.0127 *
## inst.grade 26.21973 25.90828 1.012 0.3160
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.9 on 55 degrees of freedom
## Multiple R-squared: 0.4128, Adjusted R-squared: 0.3808
## F-statistic: 12.89 on 3 and 55 DF, p-value: 1.729e-06
It looks like we also need to eliminate inst.grade. Students should only eliminate variables ONE AT A TIME.
model4 = lm(wpr2~apsc+sat.math)
output = summary(model4)
summary(model4)
##
## Call:
## lm(formula = wpr2 ~ apsc + sat.math)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.591 -5.272 2.882 8.123 28.655
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.10703 14.44092 1.877 0.065717 .
## apsc 11.49067 3.17796 3.616 0.000643 ***
## sat.math 0.06676 0.02652 2.517 0.014727 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.9 on 56 degrees of freedom
## Multiple R-squared: 0.4019, Adjusted R-squared: 0.3805
## F-statistic: 18.81 on 2 and 56 DF, p-value: 5.627e-07
Suppose you want to investigate all the possible variable combinations to see which combination of predictor variables explains the greatest amount of variation in the dependent variable (HINT: there might be an R function for this…try to use GOOGLE to find it).
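One possible starting point (a sketch, not the only approach) is R's built-in step() function, which performs stepwise selection using AIC; the regsubsets() function from the leaps package, shown later in this guide, is another option.
# Backward elimination from the full model using AIC (smaller AIC is better).
best.model = step(model2, direction = "backward")
summary(best.model)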
Returning to the final model produced above, we see that the apsc and sat.math data explain 38.05% of the variation in the dependent variable in this multiple regression model (make sure to use the adjusted R-squared value in multiple regression). Additionally, each estimated slope coefficient is positive and statistically different from zero (at the 5% level). The final multiple regression model is as follows:
\[ \hat{y} = 27.107 + 11.49\times apsc + .067 \times sat.math \]
Does this make sense? Each student should spend time trying to analyze the meaning of the final multiple regression model. In this situation, WPR2 scores are modeled using the students’ academic proficiency score and SAT math score. The better either one of these scores, the better a student should perform on WPR2. Does it make sense that a student’s SAT score (usually taken in high school) should have an impact on a test in the second year at West Point? A close look at the SAT variable may cause us to eliminate it from the model. We might ask ourselves, “What is the average impact of these variables on our model?” We could take the average scores of apsc and sat.math and substitute them into the model:
apsc.avg = mean(apsc)
sat.avg = mean(sat.math)
Now we can take these two averages and multiply by the corresponding coefficient estimates:
apsc.avg.impact = apsc.avg*model4$coefficients[2]
apsc.avg.impact
## apsc
## 33.60438
sat.avg.impact = sat.avg*model4$coefficients[3]
sat.avg.impact
## sat.math
## 41.40723
The average impact of the SAT math score is approximately 41.4 points and is higher than the APSC average impact. We might begin to question the validity of our model since we attribute an average of 41.4 points of the WPR2 score to our high school SAT math score. Wouldn’t it make more sense if the current APSC at West Point has a higher average impact on our model?
What if there is a 1% increase in either variable from its average value? What would be the expected point change to the WPR2 score?
.01*apsc.avg*model4$coefficient[2]
## apsc
## 0.3360438
.01*sat.avg*model4$coefficient[3]
## sat.math
## 0.4140723
A 1% increase in the APSC average would increase the WPR2 score in the model by .34 points, and the corresponding increase in the SAT math score would increase the WPR2 score by .41 points. From the preceding analyses, SAT math has the biggest impact on our model, but perhaps it shouldn’t since this test is taken at least two years prior to taking MA256. Perhaps this SAT variable acts as a proxy for something else. Could it be that individuals who perform well on the SAT math are better test takers? Harder workers? More comfortable in math?
You can do so much more than we’ve shown in this R Help. Take a look at Model Selection in R to see some other regression capabilities and considerations.
x1 = c(2.510684, 2.452358, 2.528762, 1.239320, 4.875952, 7.840882, 9.818687,
6.250102, 3.959507, 6.590692, 5.605196, 2.375454, 9.851381, 6.839154,
1.236571, 3.928925, 9.503634, 1.954501, 3.861245, 1.595649)
x2 = c(0.966144488, 0.380199655, 0.402895897, 0.160270553, 0.218428088,
0.823071292, 0.419936759, 0.765025441, 0.146236090, 0.009051597,
0.806242274, 0.614748965, 0.310321721, 0.620226537, 0.536446290,
0.162144724, 0.079734733, 0.383490324, 0.900058977, 0.569140746)
x3 = c(0.521057383, 1.423197392, 0.865561773, 0.849311931, 0.069467869,
2.423375367, 0.999890500, 1.180788156, 0.323237168, 1.646717797,
2.175627203, 1.742936300, 1.308336647, 2.443687823, 0.483587769,
0.075603643, 1.137584465, 1.723667013, 0.520898202, 2.440183042)
y1 = c(12.883660, 10.342506, 8.269724, 5.213510, 15.388985, 27.125959,
30.921964, 23.011831, 12.214357, 17.949329, 21.634420, 12.927383,
30.112521, 25.259303, 5.783378, 12.853683, 27.894458, 9.237791,
18.602418, 7.186046)
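No problem statement accompanies this data in the guide; assuming the task mirrors the example above (explain the variation in y1 using x1, x2, and x3), a possible starting point is:
# Fit a full multiple regression model, then examine which coefficients
# are statistically different from zero before reducing the model.
full.model = lm(y1 ~ x1 + x2 + x3)
summary(full.model)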
Execute variable selection using forward selection
Execute variable selection using backward elimination
Execute variable selection using ‘all subsets’.
Be able to compare models and know strengths/weakness of each method (AIC/BIC/R2/Adj R2/Mallow’s CP)
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 13.5 (p 599-603) | | Problem Below | 58 |
Suggested Problem
Investigators studied physical characteristics and ability in 13 football punters. Each volunteer punted a football ten times. The investigators recorded the average distance for the ten punts, in feet. They also recorded the average hang time (time the ball is in the air before the receiver catches it) for the ten punts, in seconds. In addition, the investigators recorded five measures of strength and flexibility for each punter: right leg strength (pounds), left leg strength (pounds), right hamstring muscle flexibility (degrees), left hamstring muscle flexibility (degrees), and overall leg strength (foot-pounds). From the study “The relationship between selected physical performance variables and football punting ability” by the Department of Health, Physical Education and Recreation at the Virginia Polytechnic Institute and State University, 1983.
distance=c(162.50, 144.00, 147.50, 163.50, 192.00, 171.75, 162.00, 104.93, 105.67, 117.59, 140.25, 150.17, 165.17)
rstrength=c(170, 140, 180, 160, 170, 150, 170, 110, 120, 130, 120, 140, 160)
lstrength=c(170, 130, 170, 160, 150, 150, 180, 110, 110, 120, 140, 130, 150)
rflexibility=c(106, 92, 93, 103, 104, 101, 108, 86, 90, 85, 89, 92, 95)
lflexibility=c(106, 93, 78, 93, 93, 87, 106, 92, 86, 80, 83, 94, 95)
ostrength=c(240.57, 195.49, 152.99, 197.09, 266.56, 260.56, 219.25, 132.68, 130.24, 205.88, 153.92, 154.64, 240.57)
Create a model to predict the distance a punter can kick a football using forward selection and backwards elimination in R.
Use R’s ‘best subsets’ to find the best model.
Interpret your models in accordance with the context of the problem.
To help with the problem above, consider installing the R package below
install.packages("leaps")
Use the package in the following way:
library(leaps)
puntsubsets <-
regsubsets(response ~ pred1+pred2+pred3+...+predn,
nbest = 1, # 1 best model for each number of predictors
nvmax = NULL, # NULL for no limit on number of variables
force.in = NULL, force.out = NULL,
method = "exhaustive")
summary(puntsubsets) ## displays the results
summary(puntsubsets)$adjr2 ## displays adjusted R^2 for each model
summary(puntsubsets)$rsq ## displays R^2 for each model
summary(puntsubsets)$cp ## displays Mallow's CP for each model
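As an illustrative sketch (not a worked solution), the same template applied to the punting data above might look like the following; with only five candidate predictors, the default settings are sufficient.
library(leaps)
# Exhaustive search over all subsets of the five strength/flexibility predictors.
puntsubsets <- regsubsets(distance ~ rstrength + lstrength + rflexibility +
                          lflexibility + ostrength,
                          nbest = 1, method = "exhaustive")
summary(puntsubsets)                   # best model of each size
summary(puntsubsets)$adjr2             # adjusted R^2 for each of those models
which.max(summary(puntsubsets)$adjr2)  # size of the model with the largest adjusted R^2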
Beat Navy
Beat Air Force
Construct and interpret multiple regression diagnostic plots (e.g., \(e^* \, \textrm{vs.} \, x_i\), \(e^* \, \textrm{vs.} \, \hat{y}\), \(\hat{y} \, \textrm{vs.} \, y\), and a normal probability plot of the standardized residuals).
Use diagnostic plots to verify linear model assumptions: linearity, independence, and normally distributed residuals with a mean of 0 and constant variance (i.e., \(\epsilon \sim N(0,\sigma^2)\)).
Given R regression output and residual plots, identify if one or more of the six problems on pg. 546 exist.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 4.6 (p. 184-190), 13.1 | Checking Assumptions with Residual Plots | 4,5,7 | 7 |
Here is the book data for Devore Chapter 13, Problem 4.
x = c(2761,19764,25713,3980,12782,19008,
20782,19028,14397,9606,3905,25731)
y = c(1553,14999,32813,1667,8741,16526,
26770,16526,9868,6640,1220,30730)
Here is the book data for Devore Chapter 13, Problem 5.
x = c(0.167,0.333,0.500,0.667,0.833,1.000,1.167,1.333,1.500,
1.667,1.833,2.000,2.167,2.333,2.500,2.667,2.833,3.000,
3.167,3.333,3.500,3.667,3.833,4.000,4.167,4.333,4.500,
4.667,4.833,5.000,5.167,5.333,5.500)
y = c(0.5,1.25,1.5,2.75,3.5,4.75,5.75,5.6,7,8,8.25,9.5,10.5,
11,10.75,12.5,12.25,13.25,15.5,15,15.25,16.25,17.25,18,
18.25,18.15,20.25,19.5,20,20.5,20.6,20.5,19.8)
Here is the book data for Devore Chapter 13, Problem 7.
x = c(0.246,0.25,0.251,0.251,0.254,0.262,0.264,0.27,
0.272,0.277,0.281,0.289,0.29,0.292,0.293)
y = c(16,11,15,10.5,13.5,7.5,6.1,
1.7,3.6,0.7,0.9,1,0.7,3,3.1)
Let’s create the four diagnostic plots as described on page 544 of the textbook:
First, we need to calculate the standardized residuals in R using the rstandard() function:
grades = read.csv("GradeData.csv",header=TRUE)
wpr2 = grades[,2]
apsc = grades[,4]
sat.math = grades[,5]
model1 = lm(wpr2~apsc+sat.math)
standard.resid = rstandard(model1)
This command assigns all of the standardized residuals to the standard.resid object. We can now use this object to help us create the different plots. The first plot that we will create is a standardized residuals (\(e^*\)) versus \(x_i\) plot. We’ll do this for every predictor in our final model. In this example, we’ll look at just the sat.math predictor.
plot(sat.math, standard.resid, main = "Standardized Residuals vs. SAT Math Scores", xlab =
"SAT Math Scores", ylab = "Standardized Residuals", ylim = c(-3,3))
abline(h=c(0,-2,2), lty=3)
The abline() function is a nice way to add additional lines to any plot. For this plot, the abline() function is used to add three horizontal lines at -2, 0, and 2. The lty argument is for the line type. In this case, lty=3 corresponds to dashed lines.
For this plot, we are looking to see if the standardized residuals fall “equally” above and below zero (see the middle horizontal line). Essentially, we wish to see a random scatter of points (i.e., a shotgun blast) around zero. Additionally, we want to check for constant variance. Constant variance implies that the spread of the residuals throughout the plot is approximately the same. In the above graph, there are no patterns, and the majority of the residuals fall within two standard deviations of zero (i.e., they lie between -2 and 2).
Since \(\hat{y}\) refers to the predicted values, we can get these values from the linear model that we created above (model1):
names(model1)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
Using the names() function above, we can see that there is an attribute called fitted.values, and we will assign these values to the object yhat as follows:
yhat = model1$fitted.values
# Alternatively, we can use the fitted() function.
yhat =fitted(model1)
Now that we have all the fitted values, we can plot the standardized residuals against the fitted values:
plot(yhat, standard.resid, main = "Standard Residuals vs. Fitted Values", xlab =
"Predicted WPR2 Scores", ylab = "Standardized Residuals", ylim = c(-3,3))
abline(h=c(0,-2,2), lty=3)
In this plot, we are looking for the same random scatter around zero as we saw in the plot above. After looking at both of these residual plots, we can conclude that the residuals appear centered at zero with a constant variance. We do not make any conclusions about the normality of the residuals from these two plots.
Typically, there are two ways to determine if the standardized residuals follow a normal distribution. The first method consists of constructing a histogram of the standardized residuals. If the histogram looks like a bell-shaped curve, we can conclude that the standardized residuals follow a normal distribution. Below, we construct the histogram of standardized residuals.
hist(standard.resid, breaks = 10, freq=T, xlim = c(-4,2),
xlab = "Standardized Residuals",
main = "Histogram of Standardized Residuals", col = "red")
The histogram has a moderate bell-shaped look, but it exhibits negative skewness. We may want to explore the data points influencing the left tail. For this class, it isn’t necessary to have perfect looking diagnostic plots; however, you should be able to identify the potential issues with each plot.
Another way in which we can check normality of the standardized residuals is to create a QQ-Plot. See below:
qqnorm(standard.resid)
qqline(standard.resid)
If the residuals plotted on this QQ plot are generally aligned with the plotted QQ line, we can say the residuals follow a normal distribution. In this case, there appears to be a “heavy” left tail (negative skewness) and a “light” right tail (smaller tail than the normal distribution). We now have some concerns about the residuals following a normal distribution, and this assumption would have to be investigated further.
This plot will give us some indication as to how well we can predict the dependent variable. Ideally, we would like to see a 45-degree line when we plot \(\hat{y} \, \textrm{vs.} \, y\). Let’s see how well our predicted WPR2 scores (\(\hat{y} \, \text{or} \, \hat{\text{wpr2}}\)) model the observed WPR2 scores (\(\text{wpr2}\)).
plot(wpr2, yhat, main = "Predicted WPR2 Scores vs. Observed WPR2 Scores", xlab =
"Observed WPR2 Scores, wpr2", ylab = "Predicted WPR2 Scores, yhat")
abline(a=0, b=1, lty=6) #add the line w/ slope of 1
A perfect model would allow us to exactly predict all wpr2 scores, resulting in all points falling exactly on the 45-degree line in the plot above. In this case, it appears there must be other variables not considered by our model that critically affect wpr2 scores. There is, however, evidence to suggest that our model has at least some value in predicting wpr2 scores.
Recognize when a model violates the linearity assumption and potential issues making inference off of models that violate this assumption
Apply transformations to data to overcome nonlinearity.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| 13.2 (pg 550-555) | | 16, 19 | 15, 18 |
| | | Review 5 | |
If it happens that we have violated one of our 4 assumptions in linear regression, not all hope is lost.
As you read in 13.2, not all data used in regression is linear. Sometimes the data is intrinsically linear, meaning a transformation of the variables will produce a linear relationship, and we must transform the data in order to exploit its predictive power. Below is problem 16 from the book to show you how to transform the data in R.
First we load the data and create a linear regression:
loadtime = read.csv("ex13-16.csv",header=TRUE, sep = ",")
attach(loadtime)
lm.loadtime = lm(Time~Load)
summary(lm.loadtime)
##
## Call:
## lm(formula = Time ~ Load)
##
## Residuals:
## Min 1Q Median 3Q Max
## -121.49 -32.75 -19.15 8.97 433.47
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 985.03 501.50 1.964 0.0671 .
## Load -11.14 5.99 -1.859 0.0815 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 123.7 on 16 degrees of freedom
## Multiple R-squared: 0.1777, Adjusted R-squared: 0.1263
## F-statistic: 3.457 on 1 and 16 DF, p-value: 0.08148
From our initial impression, our predictor variable, Load, is not quite significant at alpha = .05 (p = 0.0815).
Let’s look at a plot of the data and the residuals:
par(mfrow=c(1, 2)) ## Changes the graphical parameters for side by side viewing.
plot(Time~Load, main="Data")
lm.loadtime.fitted = lm.loadtime$fitted.values
lm.loadtime.residuals = lm.loadtime$residuals
plot(lm.loadtime.residuals~lm.loadtime.fitted, main="Residuals Plot")
abline(h=0)
It is clear that the data is not linear in its current form. We must apply a transformation.
As the book suggests, we will apply a log transformation to the predictor variable.
transform.lm.loadtime = lm(log(Time)~Load)
summary(transform.lm.loadtime)
##
## Call:
## lm(formula = log(Time) ~ Load)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9788 -2.1590 0.4689 1.6371 3.4915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.1482 8.8328 4.772 0.000208 ***
## Load -0.4932 0.1055 -4.675 0.000253 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.179 on 16 degrees of freedom
## Multiple R-squared: 0.5774, Adjusted R-squared: 0.551
## F-statistic: 21.86 on 1 and 16 DF, p-value: 0.0002533
You can see from the summary that the fit has improved considerably: the Load predictor is now highly significant, and the adjusted R-squared has increased from about 0.13 to 0.55.
Let’s take a look at a plot of the transformed data and the residuals.
transform.lm.loadtime.fitted = transform.lm.loadtime$fitted.values
transform.lm.loadtime.residuals = transform.lm.loadtime$residuals
par(mfrow=c(1, 2))
plot(log(Time)~Load, main = "Transformed Data")
plot(transform.lm.loadtime.residuals~transform.lm.loadtime.fitted, main = "Transformed Model's Residuals")
abline(h=0)
par(mfrow=c(1, 1)) ## Resets our graphical parameters
Much better.
However, be careful interpreting this model. Because we transformed the response, the coefficients now describe the effect of load percent on log(Time) rather than on Time itself.
The estimates of our coefficients are below.
summary(transform.lm.loadtime)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.1481947 8.8327889 4.771788 0.0002079588
## Load -0.4932479 0.1054976 -4.675443 0.0002532861
To predict the time until failure at a load percent of 80, we begin by computing the predicted value of log(Time):
summary(transform.lm.loadtime)$coefficients[1]+summary(transform.lm.loadtime)$coefficients[2]*80
## [1] 2.688361
This answer yields the log time until failure. To get our prediction in the units we care about, we must exponentiate our result.
exp(summary(transform.lm.loadtime)$coefficients[1]+summary(transform.lm.loadtime)$coefficients[2]*80)
## [1] 14.70756
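Equivalently, you may find it convenient to let R's predict() function compute the fitted value on the log scale and then exponentiate the result yourself; a brief sketch:
# predict() returns the fitted log(Time) at Load = 80; exponentiate to get time until failure.
log.pred = predict(transform.lm.loadtime, newdata = data.frame(Load = 80))
exp(log.pred)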
All of the data transformations covered in 13.2 can be handled in the same way.
Recognize when a model violates the linearity assumption and potential issues making inference off of models that violate this assumption
Apply transformations to data to overcome nonlinearity.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| 13.2 (pg 550-555) | | 27 | 26 |
Use this class time to work on your course project.
Correctly identify count and/or frequency data.
Given count and/or frequency data, create a two-way contingency table.
Properly identify the hypothesis test for Homogeneity and create the test statistic.
Interpret the result of the Hypothesis Test for Homogeneity in terms of the real world problem.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| 14.3 | | 26, 29 | 25, 28 |
Below is an example of how to input and analyze a two-way contingency table in R.
Problem 29 from section 14.3
Input Data:
spirituality <- matrix(c(56,162,198,211,56,223,243,239,109,164,74,28),ncol=4,byrow=TRUE)
colnames(spirituality) <- c("Very","Moderate","Slightly", "Not at all")
rownames(spirituality) <- c("N.S.","S.S","G.D")
spirituality <- as.table(spirituality)
spirituality
## Very Moderate Slightly Not at all
## N.S. 56 162 198 211
## S.S 56 223 243 239
## G.D 109 164 74 28
Part A)
summary(spirituality)
## Number of cases in table: 1763
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 213.21, df = 6, p-value = 2.912e-43
As you learned in the reading, since the p-value is less than alpha = .05, we reject the null hypothesis of homogeneity. There is substantial evidence that the three types of individuals are not homogeneous with respect to their degree of spirituality.
Part B)
summary(spirituality[-3,])
## Number of cases in table: 1388
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 3.0912, df = 3, p-value = 0.3778
Now, since our p-value is greater than alpha = .05, we fail to reject the null hypothesis; there is no statistically significant evidence of a difference in spirituality between the N.S. and S.S groups.
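Note that summary() on a table object carries out a chi-squared test. The same tests can also be run with R's chisq.test() function, which should report the same test statistic, degrees of freedom, and p-value:
# Chi-squared test on the full table and on the table with the G.D row removed.
chisq.test(spirituality)
chisq.test(spirituality[-3, ])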
Create a two-way contingency table in R.
Properly interpret the Hypothesis Test for Homogeneity in R in terms of the real world problem.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| 14.3 | | 30, 41 | |
Review concepts throughout the semester. Come to class prepared to ask questions to clarify any misunderstandings.
R is a free software environment for statistical computing and graphics. R-Studio is a user-friendly interface used to run the R software package.
Please ensure that you download R before you download R-Studio.
In order to download R, visit The Comprehensive R Archive Network (CRAN) and click on Download R for Windows. In the first line at the top of the page, select install R for the first time. RIGHT CLICK on the top link that should say Download R 3.4.3 for Windows (or something similar depending on the current version of R). Select Save target as... and save it to a folder that you can easily find. After the file has downloaded, execute the .exe file to install R. Note: You may have to right-click and Run as administrator. You must install R before you install R-Studio.
In order to download R-Studio, visit the R-Studio Downloads page. Scroll down the page and RIGHT CLICK on the installer file for RStudio 1.0.383 - Windows Vista/7/8/10 (or something similar depending on the current version of R Studio). Select Save target as... and save the file to the same folder that you saved the previous R installation file. After the file downloads, execute the .exe file and install R-Studio. The first time that you attempt to open R-Studio you may have to search for the program through the search function under the Windows button (the Windows symbol in the lower left corner of your screen). Once you locate R-Studio, right click on the program and select Pin to taskbar so that the program is readily available for future use.
There are numerous R tutorials available on the Internet. Here are a few tutorials that you might want to explore after the minimal introductory tutorial provided in this course guide.
These tutorials will expose you to R’s capabilities beyond what you’ll see in MA256. In addition to the introductory tutorial here, this course guide will provide some R Help at the bottom of many lesson descriptions in order to guide you toward specific R functions that will help you complete the suggested problems.
The default R-Studio interface places a sub-window called the Console on the left or lower-left part of the screen. When you evaluate commands in the Console, recently used commands are available in a History tab or by using the up arrow to scroll through commands when the cursor is in the Console. Alternatively, you can create an R Script by selecting File - New File - R Script or pressing Ctrl+Shift+N. In an R script, you can write a line of code and execute it by pressing Ctrl+Enter. You can run the entire script by selecting Source or by pressing Ctrl+Shift+S. You should not expect to revisit at a later date anything you calculate in the Console. Preserving your work in an R Script will help you organize and document all of your work for the entire course. You can also easily adapt previous work to a new application when you have your work saved in scripts. You may wish to name your scripts with dates or lesson numbers to help you keep them organized.
R can do basic calculations like any calculator. Just enter the expression you want to evaluate in the Console and press Enter.
2+2
## [1] 4
Let’s accomplish the same task in a script. Create a new script, enter the expression you want to evaluate on the first line of your script, and press Ctrl+Enter. You’ll see the same result appear in your Console.
2+2
## [1] 4
You can look up the description and proper usage of any R function using the ? command in the Console. The Help panel in RStudio will show you the documentation associated with the function you entered.
?sum
You can also type help(sum) to reach the same documentation page.
R allows you to define and use variables using either the = or <- assignment operators.
x = 2
print(x)
## [1] 2
y <- 2
print(y)
## [1] 2
Our data often comes in the form of a vector or matrix. We can define these as follows:
my.vector = c(1,2,3,4,5,6)
my.vector
## [1] 1 2 3 4 5 6
my.matrix = matrix(my.vector,nrow=2,ncol=3)
my.matrix
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
Notice that filling a matrix with a vector of values uses those values to fill the columns first.
Indexing vectors and matrices starts with an index of 1 and always proceeds in a [row,column] order. Let’s retrieve the number 3 from both the vector and the matrix we defined. The number 3 is the third element of the vector, and it occupies the 1st row, 2nd column of the matrix.
my.vector[3]
## [1] 3
my.matrix[1,2]
## [1] 3
R has a wide range of built-in functions ready to accomplish whatever you’re trying to do. To use one, simply type the name of the function (case-sensitive) and pass it whatever arguments you want the function to evaluate. You can find out what the function is expecting using the R documentation. Let’s find out what the sum() function does.
?sum
The sum() function returns the sum of all the values present in its arguments. Let’s evaluate the sum of my.vector.
my.vector = c(1,2,3,4,5,6)
sum(my.vector)
## [1] 21
You can also write your own functions. Let’s write a function that returns the square root of the input (i.e., the sqrt() function).
my.sqrt = function(x){
result = sqrt(x)
return(result)
}
Now we can use our function to evaluate the square root of 4.
my.sqrt(4)
## [1] 2
We previously created a vector of the values 1-6. There are other ways to create this and related sequences.
1:6 # This syntax means "1 to 6" and is understood to have increments of 1.
## [1] 1 2 3 4 5 6
seq(1,6) # Create a sequence from "1" to "6" with an understood step size of 1.
## [1] 1 2 3 4 5 6
seq(2,12,by=2) # Create a sequence from "2" to "12" with a step size of "2."
## [1] 2 4 6 8 10 12
x=seq(0,2*pi,length.out=100) # Create a sequence of 100 values evenly spaced from 0 to 2*pi.
We’ll generate a basic X-Y plot here, but there are many more Internet resources that show more detailed plot construction. Try Quick-R Graphs to see a wider range of plot capabilities. For now, let’s plot the curve \(y=sin(x)\) for \(x \in [0,2\pi]\).
x=seq(0,2*pi,length.out=100)
y=sin(x)
plot(x,y)
Plot customization is something of an art, and you can find extensive Internet resources to help you produce the exact plot you want. Try ?plot to see the basic arguments to the plot() function. Let’s change our previous plot to a line (add the argument type="l") instead of a scatter plot, and let’s give it a title (add the argument main="sin(x)").
plot(x,y,type='l',main='sin(x)')
One of the first things to do as a new R user is to Set the Working Directory. This will make sure R always knows where to start looking for any data files you wish to reference. Create a folder on your desktop or someplace convenient to save all of the data you work with using R.
You can set a default working directory that R will remember for future use.
Go to the Tools tab in the toolbar at the top.
Select Global Options.
Select Browse.
It is highly recommended that you set the default working directory to the folder that you recently created.
You can also set the working directory any time you open R by performing the following steps:
Go to the Session tab in the toolbar at the top.
Select Set Working Directory.
Select Choose Directory (you can also use the shortcut keys CTRL+SHIFT+H to accomplish this step).
Locate the folder where the desired data file is stored.
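You can also check or set the working directory directly from the Console or a script. The folder path below is only a placeholder; replace it with the folder you created.
getwd()                                    # display the current working directory
setwd("C:/Users/yourname/Documents/MA256") # set the working directory (placeholder path)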
Data files come in many formats, but we will typically use .csv files that resemble spreadsheets you can open in Microsoft Excel. You’ll need to ensure you have specified the working directory before attempting to import a data set. To get started go to the course website’s page for data sets and download GradeData.csv. Be sure to store it in the folder you’ve specified as your working directory. Use the following command to import GradeData.csv into RStudio.
grades = read.csv("GradeData.csv",header=TRUE)
The data object grades now contains all of the data contained in the GradeData.csv spreadsheet.
Look in the RStudio Environment panel for an object called grades. The grades object is called a data frame and has a right-facing arrow next to it that you can click to get an expanded view of what is included in grades. You can also examine grades by examining its column names, first six rows, last six rows, and overall dimensions.
# Display the column names of the grades data frame.
names(grades)
## [1] "WPR1" "WPR2" "InstructorGrade" "APSC"
## [5] "SAT.Math"
# Display the first six rows of the grades data frame.
head(grades)
## WPR1 WPR2 InstructorGrade APSC SAT.Math
## 1 99 103 0.67 1.942 550
## 2 104 108 0.68 2.675 560
## 3 121 120 0.91 3.582 640
## 4 108 89 0.75 2.139 500
## 5 120 89 0.70 3.358 690
## 6 110 51 0.59 2.126 600
# Display the last six rows of the grades data frame.
tail(grades)
## WPR1 WPR2 InstructorGrade APSC SAT.Math
## 54 93 91 0.68 2.055 510
## 55 118 108 0.82 3.246 470
## 56 114 71 0.66 2.342 610
## 57 102 77 0.60 1.894 580
## 58 113 103 0.74 2.899 590
## 59 79 72 0.66 1.844 410
# Display the dimensions of the grades data frame. The Environment panel said grades consisted of 59 observations of 5 variables.
dim(grades)
## [1] 59 5
Now that we have saved the data into a data frame called grades, we may wish to index into the data frame to extract a specific portion of the data for analysis. Just like indexing with matrices, indexing into a data frame is done by using square brackets at the end of an object. First, let’s index into the tenth row and second column:
grades[10,2]
## [1] 109
What if we only wanted to look at WPR 1 scores for all students? This is the first column of the data frame, which is indexed as follows:
grades[ ,1]
## [1] 99 104 121 108 120 110 107 111 114 129 115 103 112 89 101 124 115
## [18] 122 107 117 111 110 115 128 103 95 123 103 103 94 106 122 117 125
## [35] 120 91 111 106 118 109 130 116 60 130 93 102 99 119 109 115 113
## [52] 121 90 93 118 114 102 113 79
The blank space before the comma denotes that we want ALL the rows while the “1” denotes that we want the first column. If we wanted only the first row we would type the following:
grades[1,]
## WPR1 WPR2 InstructorGrade APSC SAT.Math
## 1 99 103 0.67 1.942 550
There are several ways to yield a single column from a data frame. In addition to using grades[,1] to produce WPR1 scores, we can also use the $ operator.
grades$WPR1
## [1] 99 104 121 108 120 110 107 111 114 129 115 103 112 89 101 124 115
## [18] 122 107 117 111 110 115 128 103 95 123 103 103 94 106 122 117 125
## [35] 120 91 111 106 118 109 130 116 60 130 93 102 99 119 109 115 113
## [52] 121 90 93 118 114 102 113 79
We can drop the leading grades$ from this command by using the attach() command with our data frame.
attach(grades)
## The following object is masked _by_ .GlobalEnv:
##
## WPR1
WPR1
## [1] 99 104 121 108 120 110 107 111 114 129 115 103 112 89 101 124 115
## [18] 122 107 117 111 110 115 128 103 95 123 103 103 94 106 122 117 125
## [35] 120 91 111 106 118 109 130 116 60 130 93 102 99 119 109 115 113
## [52] 121 90 93 118 114 102 113 79
Now WPR1 can be accessed directly without going through grades. The attach() function brings the contents of grades to a higher level of visibility.
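When you are finished working with the data frame this way, you can reverse the attach() with detach(), which also avoids the kind of masking message shown above.
detach(grades) # remove grades from the search path; WPR1 must again be accessed as grades$WPR1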
We may also want to assign the data that we work with into new objects. For example, if we only wanted to work with WPR1 data, we could assign the WPR1 data into a new object. Let’s assign new objects for each column as follows:
wpr1 = grades[,1]
wpr2 = grades[,2]
instructor = grades[,3]
apsc = grades[,4]
sat = grades[,5]
We can combine objects in a data frame by using rbind() and cbind(). The function rbind() combines rows while cbind() combines columns. Let’s combine only the WPR data together using three separate techniques:
wprs = cbind(wpr1,wpr2)
head(wprs)
## wpr1 wpr2
## [1,] 99 103
## [2,] 104 108
## [3,] 121 120
## [4,] 108 89
## [5,] 120 89
## [6,] 110 51
wprs = grades[,1:2] # The colon takes columns 1 "through" 2.
head(wprs) # We get the same result as Technique 1.
## WPR1 WPR2
## 1 99 103
## 2 104 108
## 3 121 120
## 4 108 89
## 5 120 89
## 6 110 51
wprs = grades[,-3:-5] # Think of the negative sign as "All Except -3, -4, and -5."
head(wprs) # We get the same result as Techniques 1 and 2.
## WPR1 WPR2
## 1 99 103
## 2 104 108
## 3 121 120
## 4 108 89
## 5 120 89
## 6 110 51
Sometimes our data may contain outliers, and we want to perform analysis with and without the outliers. Let’s assume that the second observation (2nd row) is an outlier and we want to remove this observation. This is easily done as follows:
wprs.minus.outlier = wprs[-2,] # Take out 2nd row.
head(wprs.minus.outlier) # Notice the second row is gone.
## WPR1 WPR2
## 1 99 103
## 3 121 120
## 4 108 89
## 5 120 89
## 6 110 51
## 7 107 111
There are many instances when you want to create your own data frame if you are working on a small data set (such as a question in WebAssign). First let’s create two vectors of data, and then we’ll combine the vectors into a data frame. The first vector will be the integers 1 to 10:
x = seq(from=1, to=10, by=1)
Let’s create a second vector that is the reverse sequence of numbers from 10 to 1.
y = seq(from=10, to=1, by=-1)
Now, we can combine both vectors into a data frame:
my.data.frame = data.frame("X"=x, "Y" = y)
class(my.data.frame) # The class of the object is a data frame as desired.
## [1] "data.frame"
head(my.data.frame) #We have two columns of data with the proper column names.
## X Y
## 1 1 10
## 2 2 9
## 3 3 8
## 4 4 7
## 5 5 6
## 6 6 5
If we want to change the names of the columns in the data frame, we can do so using the colnames() function and assign the new column names as follows:
colnames(my.data.frame) = c("Column 1", "Column 2")
head(my.data.frame) # Column names have changed.
## Column 1 Column 2
## 1 1 10
## 2 2 9
## 3 3 8
## 4 4 7
## 5 5 6
## 6 6 5
Let’s go back to the original grades object that we created earlier. Maybe we want to explore the relationship between WPR1 grades and the instructor grades. We may have some intuition and believe that if the student has been doing their homework and maintains a high instructor average, then they should have a high WPR1 score. In other words, we think that there is a positive relationship between WPR1 scores and instructor grades.
The first step in this exploratory analysis is to create a scatter plot of the two variables:
plot(grades[,3], grades[,1], xlab = "Instructor Grades", ylab =
"WPR1 Scores", main = "A Scatter Plot of WPR1 Scores against Instructor Grades")
From the plot above, there appears to be a positive relationship between WPR1 scores and the instructor grade.
We can look at a few quantitative measures by using the summary() function. Let’s take the summary measures from the 1st and 3rd column since those columns correspond to the WPR1 and instructor grade data:
summary(grades[,c(1,3)])
## WPR1 InstructorGrade
## Min. : 60.0 Min. :0.4900
## 1st Qu.:103.0 1st Qu.:0.6750
## Median :111.0 Median :0.7600
## Mean :109.4 Mean :0.7607
## 3rd Qu.:118.0 3rd Qu.:0.8450
## Max. :130.0 Max. :0.9700
If we want to compute the variance, standard deviation, or range on columns in a data frame, we can use the apply() function as follows:
apply(grades[,c(1,3)], 2, sd)
## WPR1 InstructorGrade
## 12.9615039 0.1093218
This applies the standard deviation function (sd) to columns 1 and 3 (columns are denoted with a 2 and rows with a 1 in the second argument of the function). Type ?apply in the console for more information. In the following lines of code, we will compute the range, variance, and mean, respectively:
apply(grades[,c(1,3)],2, range) # Compute the ranges of the 1st and 3rd column.
## WPR1 InstructorGrade
## [1,] 60 0.49
## [2,] 130 0.97
apply(grades[,c(1,3)],2, var) # Compute the variances of the 1st and 3rd column.
## WPR1 InstructorGrade
## 168.00058445 0.01195126
apply(grades[,c(1,3)],2, mean) # Compute the means - compare to summary() output.
## WPR1 InstructorGrade
## 109.389831 0.760678
You can also compute each one of these values directly by using the mean() or sd() function for the mean and standard deviation:
mean(grades[,1]) # Compute mean of column 1.
## [1] 109.3898
sd(grades[,1]) # Compute the standard deviation of column 1.
## [1] 12.9615
Return to LSN 8.
Characteristic 1: The pmf must be greater than or equal to zero for all x.
\[p(x) \geq 0 \quad \forall x\]
Characteristic 2: The sum of probabilities of x, \(p(x)\), over all possible values of \(x\) must be one.
\[\sum_x p(x) = 1\]
Characteristic 3: For a discrete random variable \(X\), the probability that \(X\) is equal to a specific value, \(c\), is:
\[P(X=c) =p(c)\]
Characteristic 4: Given the pmf \(p(x)\) for random variable \(X\), we define the cumulative distribution function (CDF) \(F(x)\) as follows:
\[F(x) = P(X\leq x)= \sum_{y:y\leq x}p(y)\]
#### Given a discrete random variable \(X\) and its associated pmf or CDF, find the probability that the random variable \(X\) is between \(a\) and \(b\) (inclusive) for \(a < b\).
Characteristic 5: Using the pmf:
\[P(a \leq X \leq b ) = \sum_{x : a \leq x \leq b}p(x)\]
Characteristic 6: Using the CDF:
\[P(a \leq X \leq b)= F(b) - F(a-)\]
where “\(a-\)” represents the largest possible \(X\) value that is strictly less than \(a \, \left( \textrm{i.e.,} \, F(a-) = \textrm{lim}_{x\rightarrow a-}F(x)\right)\).
#### Calculate the expected value (mean or average), \(\mu_X\), and the variance, \(\sigma_X^2\), of a discrete random variable \(X\).
Characteristic 7: The expected value of X:
\[E(X) = \mu_X = \sum_x x \cdot p(x)\]
Characteristic 8: The variance of X:
\[V(X) = \sigma_X^2 = \sum_x (x-\mu_x)^2 \cdot p(x)\]
Alternatively, the following shortcut formula may prove easier to calculate:
\[V(X) = \sigma_X^2 = E[X^2] - (E[X])^2\]
Characteristic 9: The (100p)th percentile of the discrete random variable \(X\) is the minimum value of \(X\) such that \(F(x)\geq p\).
\[x^* = \textrm{min}\{ x|F(x)\geq p \}\]
| Distribution | PMF \(\quad p(x)\) | E[X] | V[X] |
|---|---|---|---|
| Binomial | \(\binom{n}{x}p^x(1-p)^{n-x}\) | \(np\) | \(np(1-p)\) |
| Poisson | \(e^{-\mu}\mu^x/x!\) | \(\mu\) | \(\mu\) |
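R provides the pmf and CDF of these named distributions directly through its d and p function families; a short sketch with illustrative parameter values:
# Binomial with n = 10 and p = 0.3: pmf at x = 4, and P(2 <= X <= 5) = F(5) - F(1).
dbinom(4, size = 10, prob = 0.3)
pbinom(5, size = 10, prob = 0.3) - pbinom(1, size = 10, prob = 0.3)
# Poisson with mu = 3: pmf at x = 2, and the CDF F(4).
dpois(2, lambda = 3)
ppois(4, lambda = 3)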
Return to LSN 11.
The pdf must be greater than or equal to zero for all x.
\[f(x) \geq 0 \quad \forall x\]
The integral of the pdf, \(f(x)\), over all possible values of \(x\) must be one.
\[\int_{-\infty}^{\infty} f(x)dx = 1\]
For a continuous random variable \(X\), the probability that \(X\) equals any specific value, \(c\), is zero.
\[P(X=c) =0\]
Given the pdf \(f(x)\) for random variable \(X\), we define the cumulative distribution function (CDF) \(F(x)\) as follows:
\[F(x) = P(X\leq x)= \int_{-\infty}^{x} f(y)dy\]
#### Given a continuous random variable \(X\) and its associated pdf or CDF, find the probability that the random variable \(X\) is between \(a\) and \(b\) for \(a < b\).
\[P(a \leq X \leq b ) = \int_{a}^{b} f(x)dx\]
\[P(a \leq X \leq b)= F(b) - F(a)\]
#### Calculate the expected value (mean or average), \(\mu_X\), and the variance, \(\sigma_X^2\), of a continuous random variable \(X\).
\[E(X) = \mu_X = \int_{-\infty}^{\infty} x \cdot f(x)dx\]
\[V(X) = \sigma_X^2 = \int_{-\infty}^{\infty} (x-\mu_X)^2 \cdot f(x)dx\]
Alternatively, the following shortcut formula may prove easier to calculate:
\[V(X) = \sigma_X^2 = E[X^2] - (E[X])^2\]
\[x^* = F^{-1}(p)\]
where \(F^{-1}\) denotes the inverse CDF of \(X\).
| Distribution | PDF \(\quad f(x)\) | E[X] | V[X] |
|---|---|---|---|
| Uniform | \(\frac{1}{B-A}\) | \(\frac{A+B}{2}\) | \(\frac{1}{12}(B-A)^2\) |
| Normal | \(\frac{1}{\sqrt{2\pi}\sigma}e^{-(x-\mu)^2/2\sigma^2}\) | \(\mu\) | \(\sigma^2\) |
| Exponential | \(\lambda e^{-\lambda x}\) | \(1/\lambda\) | \(1/\lambda^2\) |
| Gamma | \(\frac{1}{\beta^{\alpha}\Gamma(\alpha)}x^{\alpha-1}e^{-x/\beta}\) | \(\alpha\beta\) | \(\alpha\beta^2\) |
| t | \(\frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\Gamma\left(\frac{\nu}{2}\right)}\left(1+\frac{x^2}{\nu }\right)^{-\frac{\nu+1}{2}}\) | \(0\) | \(\frac{\nu}{\nu-2}\) for \(\nu>2\), \(\infty\) for \(1 < \nu \leq 2\) |
| Chi-Squared | \(\frac{1}{2^{\frac{\nu}{2}}\Gamma\left(\frac{\nu}{2}\right)}x^{\frac{\nu}{2}-1}e^{-\frac{x}{2}}\) | \(\nu\) | \(2\nu\) |
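As with the discrete case, R provides the pdf, CDF, and inverse CDF of these distributions; a brief sketch with illustrative parameter values:
# Normal with mu = 100 and sigma = 15: P(90 <= X <= 120) = F(120) - F(90).
pnorm(120, mean = 100, sd = 15) - pnorm(90, mean = 100, sd = 15)
# Exponential with lambda = 1/10: the 65th percentile via the inverse CDF.
qexp(0.65, rate = 1/10) # approximately 10.498
# Uniform on [2, 8]: the pdf height is 1/(B - A) = 1/6.
dunif(5, min = 2, max = 8)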
(Return to LSN 18.) (Go to Monte Carlo worksheet.)
Monte Carlo simulation is an example of a statistical simulation method that utilizes sequences of random numbers to simulate real-world occurrences. Although Monte Carlo Simulation has its roots in the development of nuclear weapons, it is now regularly used in many diverse applications ranging from the simulation of complex physical phenomena such as radiation transport in the earth’s atmosphere to the simulation of something as simple as a game of Bingo.
The goal of the Monte Carlo method is to simulate the physical (or mathematical) system of interest by randomly sampling from the system’s CDF(s). We can more clearly define Monte Carlo simulation as a scheme of employing \(\textrm{Uniform}(0,1)\) random numbers to help solve certain problems that are generally static in nature. Monte Carlo simulation is now widely used to solve problems that are often not easily solved analytically. Simulations rely on a large number of trials, and we utilize the collective results to infer a solution.
Suppose we know that a particular random variable has an exponential distribution with \(\lambda=1/10\). We would like to generate random numbers that follow this distribution in order to simulate this random variable. If we know that most mathematics software packages (Excel, Mathematica, and R included) can produce \(\textrm{Uniform}(0,1)\) numbers, how can we use this to produce exponentially distributed (\(\lambda=1/10\)) random numbers? In order to answer this question, we must review inverse functions.
A function \(f(x)\) has an inverse if you can find a function, \(f^{-1}(x)\), such that \(f^{-1}(f(x))=x\). By substituting \(y=f(x)\) into \(f^{-1}(f(x))=x\), we see that \(f^{-1}(y)=x\). If we plot the function \(y=f(x)\), we can see the relationship between the numbers on the \(y\)-axis and the numbers on the \(x\)-axis. We’ll now demonstrate this concept with an example.
Example: Find the inverse for the function \(f(x)=x^3\).
\[y=x^3\]
\[x=y^{1/3}\]
\[f^{-1}(y)=y^{1/3}\]
\[f^{-1}(x^3)=(x^3)^{1/3}=x^{3/3}=x\]
What is \(f(2)\)? In order to find an approximate value from the graph, we would start at 2 on the \(x\)-axis and move up to the curve. From there, we move to the left until we reach the \(y\)-axis. This gives the functional value for \(x = 2\). Follow the arrows in Figure 2 to see that \(f(2) = 8\).
Now that we have reviewed inverse functions, let’s look at a CDF and review some of its properties. Remember that a CDF is the cumulative distribution function, and CDFs always have the same range: \(\left[0,1\right]\). As you can see from Figure 4, the \(\textrm{Exp}(\lambda=1/10)\) CDF starts at \(0\), and as \(x\rightarrow \infty\), the function approaches \(1\). If we wanted to produce the numbers on the \(x\)-axis from the numbers on the \(y\)-axis, then we would want to find the inverse of the CDF. For example, if you randomly chose \(0.65\) from the \(\textrm{Uniform}(0,1)\) distribution, to what exponentially distributed value of \(x\) would that correspond? If you follow the arrows on Figure 4, then you should see that given a \(\textrm{Uniform}(0,1)\) number (y-value), we can find the \(x\)-value by substituting the \(y\)-value into the inverse CDF function. For this example, \(F^{-1}(0.65) \approx 10\).
Let’s see if we can actually find the functional form of the inverse. If we would like to sample from an \(\textrm{Exp}(\lambda=1/10)\) distribution, then we use the known exponential CDF: \(F(x)=1-e^{\frac{-x}{10}} \, \forall x\geq 0\). Instead of setting the function equal to \(y\), let’s set it equal to \(p\) and solve for \(x\).
\[p=1-e^{\frac{-x}{10}} \] \[ 1-p=e^{\frac{-x}{10}} \] \[ \textrm{ln}(1-p)=-\frac{x}{10} \] \[ x=-10 \, \textrm{ln}(1-p)\]
We can now take \(\textrm{Uniform}(0,1)\) numbers and substitute them into the equation above (for \(p\)) in order to produce \(\textrm{Exp}(\lambda=1/10)\) numbers. For the example above (where \(p = 0.65\)), we get \(x=-10\,\textrm{ln}(1-0.65)=10.4982\).
Why does a \(\textrm{Uniform}(0,1)\) number produce these \(\textrm{Exp}(\lambda=1/10)\) numbers? Let’s look closely at the graph of the CDF again. What is the probability that a random \(\textrm{Uniform}(0,1)\) number will be less than \(0.65\)? The probability is exactly \(0.65\) according to the properties of the \(\textrm{Uniform}(0,1)\) distribution. A \(y\)-value of \(0.65\) corresponds to an \(x\)-value of \(10.4982\), which by the properties of the CDF denotes the \(x\)-value where \(P(X\leq x)=.65\). This relationship holds no matter where you enter on the \(y\)-axis and serves as the foundation for turning \(\textrm{Uniform}(0,1)\) random variates into random variates from another distribution.
Let’s show how Monte Carlo simulation works in R. First, we’ll create 1,000 \(\textrm{Uniform}(0,1)\) random variates.
my.sample = runif(1000)
Now we’ll transform the \(\textrm{Uniform}(0,1)\) random variates into \(\textrm{Exp}(\lambda=1/10)\) random variates using our derived expression.
exp.sample = -10*log(1-my.sample)
Now we’ll create a histogram of our sample and scale it to match the \(\textrm{Exp}(\lambda=1/10)\) pdf.
hist(exp.sample,freq=FALSE,
main="Histogram of Exp(1/10) Random Variates with Underlying PDF",
xlab="x")
curve(dexp(x,rate=1/10),from=0,to=60,n=200,add=TRUE)
This example shows that we can simulate random variates from a distribution using the CDF inverse transform in such a way that the resulting sample adheres closely to the underlying probability density function. Fortunately R has built in the inverse CDF transform approach to sample from many common distributions. We can produce our exponentially distributed random variates directly using the rexp() function without needing \(\textrm{Uniform}(0,1)\) random variates as our starting point.
exp.sample = rexp(1000,rate=1/10)
We can again show that this produced the type of sample we think it did by constructing a histogram.
hist(exp.sample,freq=FALSE,
main="Histogram of Exp(1/10) Random Variates with Underlying PDF",
xlab="x")
curve(dexp(x,rate=1/10),from=0,to=60,n=200,add=TRUE)
Check out this Simulation in R video to see some more advanced techniques than we’ll show you in MA256.
(Return to LSN 19) (Return to Monte Carlo supplemental material.)
Comments
When you write a script, you will benefit later from placing relevant comments in your code to explain what a particular line of code was meant to accomplish. That can be done using the # symbol. You can manually type # anywhere in a line of code to comment out what comes after it, or you can press Ctrl+Shift+C to comment the current line or an entire block of selected code.
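A short, hypothetical example of a commented script using the grades data from earlier:
# Import the course data set and summarize the WPR1 scores.
grades = read.csv("GradeData.csv", header = TRUE) # read the data into a data frame
wpr1 = grades[,1]                                 # extract the WPR1 column
mean(wpr1)                                        # average WPR1 score
sd(wpr1)                                          # standard deviation of WPR1 scores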