2 January 2018
Welcome to MA256. This is an advanced probability and statistics course for select cadets who have demonstrated high performance in the core math program and have a high aptitude for mathematical science. It is designed to lay the foundation of probability theory and inferential statistics.
In the Course Admin section, you’ll find general information about the course, the overall objectives of the course, and a brief tutorial designed to help you get started using R, the primary technology resource we’ll use in MA256. The course guide is then laid out to match the layout of the course: block by block and lesson by lesson. Each lesson throughout the course occupies a dedicated portion of this document. The portion devoted to each lesson will provide the lesson objectives, a specific reading assignment for that day, links to helpful videos, suggested book problems, and lesson-specific problems available in WebAssign.
The MA256 Course Webpage provides resources you may find helpful throughout the semester.
Throughout MA256 we will use the 9th edition of Jay Devore’s textbook, Probability and Statistics for Engineering and the Sciences. A WebAssign subscription provides you access to an electronic version of this textbook. You are required to have access to the textbook in either hard-copy or electronic form (or both).
MA256 is a 1000-point course.
| Event | Points |
|---|---|
| Block 1 WPR | 100 |
| Block 2 WPR | 200 |
| Block 3 WPR | 200 |
| Instructor Points | 100 |
| Course Project (Block 4) | 100 |
| Term End Exam | 300 |
Understand the notion of randomness and the role of variability and sampling in making inferences.
Apply the axioms, theorems, and basic properties of probability and conditional probability to quantify and communicate the likelihood of events.
Employ univariate and joint models using discrete or continuous random variables to answer basic probability questions.
Employ the concepts of confidence intervals and know when they apply to gain insight into populations from their representative samples.
Construct several types of hypothesis tests and develop appropriate conclusions.
Construct, apply, assess, and communicate linear regression models to estimate and interpret regression coefficients and draw conclusions from the models. Understand how variable transformations can assist when modeling assumptions are not met and be able to implement basic transformation of both outcome and explanatory variables.
Model, solve, and interpret solutions to applied problems that can be examined using principles of probability theory, simulation, or statistical analysis.
Develop comfort acquiring real-world data sets and developing appropriate statistical models to address questions of scientific, engineering or public interest.
Use technology with appropriate software to solve, gain insight, and improve understanding on probability and statistics problems.
Communicate assumptions, models, and results using technical language in both written and oral formats.
Apply principles of probability and statistics to model and solve problems with engineering applications.
Develop the skills to critically synthesize statistical data and analysis from all types of media in order to develop rational, well-informed conclusions and opinions about real-world issues.
Display, analyze, and interpret data.
Calculate measures of location and variability and discuss the importance of each.
Describe events, the union of events, the intersection of events, the complement of events, or any combination thereof with appropriate notation.
Calculate the probability of an event, the union of events, the intersection of events, the complement of events, or any combination thereof.
Use counting methods to calculate the size of certain sets. Determine probabilities using these results.
Apply the conditional probability formula, Bayes Theorem, and the definition of independence to calculate probabilities.
Work through the Introductory R Tutorial in order to download, install, and become familiar with the R-Studio environment.
Describe the relationship between probability and statistics.
Identify and explain the difference between a sample and a population.
Find your instructor’s MA256 page for specific day 1 instructions.
| Familiarize | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 1.1 | Introduction to RStudio | 1 | Register Using Class Key |
| Devore 1.2 | | 18, 21, 26 | WebAssign Cadet Introduction |
| Devore 1.3 | | 35, 37, 41 | |
| Devore 1.4 | | 44abc, 45, 50, 51 | |
| Introductory R Tutorial | | | |
| MA256 Course Guide | | | |
See the R Help below for assistance with the suggested and WebAssign problems.
Here is the book data for Devore Chapter 1, Problem 18.
directors = c(4,5,6,7,8,9,10,11,12,13,14,15,16,17,21,24,32)
frequency = c(3,12,13,25,24,42,23,19,16,11,5,4,1,3,1,1,1)
board.size = data.frame(directors,frequency)
Here is the book data for Devore Chapter 1, Problem 21.
y = c(1,0,1,0,0,2,0,1,1,1,2,1,0,0,1,1,0,1,1,1,1,0,0,0,1,1,2,0,1,2,2,1,1,0,2,1,1,0,1,5,0,3,1,0,0,0,3)
z = c(1,8,6,1,1,5,3,0,0,4,4,0,0,1,2,1,4,0,4,0,3,0,1,1,0,1,3,2,4,6,6,0,1,1,8,3,3,5,0,5,2,3,1,0,0,0,3)
Here is the book data for Devore Chapter 1, Problem 26
rel.freq = c(0.177, 0.166, 0.175, 0.136, 0.194, 0.078, 0.044, 0.03)
Here is the book data for Devore Chapter 1, Problem 35.
data = c(0.2,0.22,0.25,0.3,0.34,0.41,0.55,0.56,
1.42,1.7,1.83,2.2,2.25,3.07,3.25)
Here is the book data for Devore Chapter 1, Problem 37.
snowcover = c(6.5,12.0,14.9,10.0,10.7,7.9,21.9,12.5,14.5,9.2)
Here is the book data for Devore Chapter 1, Problem 44.
data = c(180.50,181.70,180.90,181.60,182.60,181.00,
181.30,182.10,182.10,180.30,181.70,180.50)
Here is the book data for Devore Chapter 1, Problem 50.
data = c(37,60,75,115,135,140,149,150,238,290,340,410,600,750,750,
750,1050,1100,1139,1150,1200,1200,1250,1576,1700,1825,2000)
Here is the book data for Devore Chapter 1, Problem 51.
data = c(87,103,130,160,180,195,132,145,211,
105,145,153,152,138,87,99,93,119,129)
Let’s create a histogram using the grades data introduced in the Introductory R Tutorial. We’ll plot the WPR1 scores with frequencies on the y-axis:
grades = read.csv("GradeData.csv",header=TRUE)
WPR1 = grades[,1]
hist(WPR1)
We have several additional arguments we can add to customize this histogram with desired labels and a desired number of breaks.
hist(WPR1, breaks = 10, freq = T, xlab = "WPR1 Grades",
main = "Histogram of WPR1 Grades", col = "gray",labels=TRUE)
We can also look at the underlying calculations used to create the histogram. Let’s say we are interested in the number of grades greater than or equal to 100. To answer this question, we’ll need to know how to count data points that meet a certain criterion. Consider the following example:
my.data = 1:5
print(my.data)
## [1] 1 2 3 4 5
What percentage of my.data is less than 3?
# Step 1. What's my denominator? How many data points do I have?
total.data.points = length(my.data)
print(total.data.points)
## [1] 5
# Step 2. How many of my data points meet my criteria?
my.data<3 # Notice this returns a TRUE or FALSE for each element of my.data.
## [1] TRUE TRUE FALSE FALSE FALSE
# These will be added as 1s and 0s.
# Now I just count the number of times my.data met my criteria.
sum(my.data<3)
## [1] 2
# Step 3. To get a percentage, I can divide my count by the total number of data points.
sum(my.data<3)/total.data.points
## [1] 0.4
Now I’m ready to calculate how many students scored at least 100 on WPR1.
sum(WPR1>=100)
## [1] 48
What percentage of students scored at least 80 but less than 90? There are lots of ways to calculate this.
total.students = length(WPR1)
(sum(WPR1<90)-sum(WPR1<80))/total.students
## [1] 0.01694915
sum((WPR1>=80)&(WPR1<90))/total.students
## [1] 0.01694915
When you start with a frequency distribution table like the one provided in Problem 19, you simply want to add certain frequency counts to get the total observed frequency across multiple bins.
In Problem 19a, we seek the proportion of sampled wafers that had at least five particles.
num.particles = 0:14
frequency = c(1,2,3,12,11,15,18,10,12,4,5,3,1,2,1)
wafers = data.frame(num.particles,frequency)
wafers$num.particles # Find the indices that correspond to 5 particles or greater.
## [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
# Indices 6 through 15 correspond to particle counts of at least 5.
sum(wafers$frequency[6:15])/sum(wafers$frequency)
## [1] 0.71
# Alternatively, we can use logical indexing in this manner:
sum(wafers$frequency[wafers$num.particles>=5])/sum(wafers$frequency)
## [1] 0.71
Given a set of data, we can quickly compute a variety of measures of location and variability in R. Let’s revisit the WPR1 scores from GradeData.csv.
grades = read.csv("GradeData.csv",header=TRUE)
WPR1 = grades[,1]
Compute the mean WPR1 score.
mean(WPR1)
## [1] 109.3898
Compute a 10% trimmed mean for WPR1. The “trim” argument removes the specified portion of the sample from BOTH the lower and upper ends of the sorted data set.
mean(WPR1,trim=.1)
## [1] 110.2653
Compute the median WPR1 score.
median(WPR1)
## [1] 111
Compute the standard deviation of WPR1 scores.
sd(WPR1)
## [1] 12.9615
Compute the variance of WPR1 scores.
var(WPR1)
## [1] 168.0006
Compute quantiles for the WPR1 scores. By default, we get the quartiles.
quantile(WPR1)
## 0% 25% 50% 75% 100%
## 60 103 111 118 130
Compute specific percentiles of the WPR1 scores as specified by the “probs” argument.
quantile(WPR1,probs=c(.1,.9))
## 10% 90%
## 93.0 123.2
Use one command to get a brief summary of the data set.
summary(WPR1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 60.0 103.0 111.0 109.4 118.0 130.0
Determine the sample space of an experiment.
Identify the relationship between an outcome and an event, and be able to explain them in terms of subsets within a sample space.
Given two events, identify the complement, union, intersection, or any combination thereof.
Draw and label a Venn Diagram.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 2.1 | Sample Space and Intersection and Union | 1, 2 | 3 |
| | | Review 1-21, 39, 51 | |
Define and interpret probability statements.
Apply the three axioms of probability to determine probabilities of specific events.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 2.2 | Axioms of Probability | 12, 17, 22 | 14 |
| | | Review 6 | |
Explain the difference between a combination and permutation.
Solve problems involving combinations, permutations, and the product rule.
Calculate the probability of a union or intersection of events using counting techniques.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 2.3 | Permutations | 30, 34, 39 | 30, 32 |
| | Combinations | Review 22 | |
| | Probability Using Combinations | | |
What is the probability that you share a birthday with someone in this classroom?
What is the probability that any two people share a birthday in this classroom?
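As a minimal sketch of how these two questions could be computed in R, assume a hypothetical classroom of 20 people and 365 equally likely birthdays (the class size is an assumption, not given above):
n = 20  # hypothetical class size
# P(at least one of the other n-1 people shares YOUR birthday)
1 - (364/365)^(n - 1)          # approximately 0.051
# P(at least two people in the room share a birthday)
1 - prod((365:(365 - n + 1))/365)  # approximately 0.411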
Combinations in R use the function choose(n,k).
For example, the number of ways to choose 3 items from 5:
choose(n=5,k=3)
## [1] 10
In R, you can define and then use this short function to compute permutations. The Pkn() function name corresponds to the \(P_{k,n}\) notation in the Devore textbook on p. 70.
Pkn = function(n, k){
return(factorial(n)/factorial(n - k))
}
Pkn(n=5,k=3)
## [1] 60
Compute probabilities using the Law of Total Probability.
Apply Bayes Theorem to calculate conditional probabilities.
Given mutually exclusive and exhaustive conditional events, draw a tree diagram to represent the scenario.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 2.4 | Conditional Probability | 46, 47, 60, 62 | 45, 60 |
| | Bayes’ Theorem | Review 39 | |
Calculate the probability of the union and intersection of two or more independent events.
Use the definition of independence to determine the independence of two events.
Explain the differences between mutually exclusive events and independent events.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 2.5 | Independence and Reliability Diagrams | 71, 80, 83 | 71, 80, 86 |
| | Excerpt from Drunkard’s Walk | Review 64 | |
Define random variables as real-valued functions and discuss the differences between discrete and continuous random variables.
Calculate and interpret probabilities, expected values, and variances for discrete and continuous random variables.
Discuss the relationship between a probability mass function (PMF) or probability density function (PDF) and the corresponding cumulative distribution function (CDF). Construct a CDF from a PMF or PDF (and vice-versa) and use either to compute probabilities.
Recognize a binomial experiment and understand its underlying assumptions.
Utilize technology to calculate probabilities associated with named discrete and continuous distributions.
Construct the joint distribution of two random variables given their independent marginal distributions.
Calculate Expected Value, Covariance, and Correlation and understand their analytical significance.
Apply and use a basic Monte Carlo simulation to analyze real world problems.
Define a discrete random variable in terms of a real-valued function.
Create a probability mass function (PMF) in a table or graphical format.
Construct a CDF from a PMF and be able to use either to calculate probabilities.
Apply Characteristics 1-6 for discrete distributions (see PMF Reference Sheet).
Calculate and interpret the expected value and variance of a discrete random variable (characteristics 7 and 8 in the PMF Reference Sheet)
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 3.1 | Random Variables | 1, 7 | 1 |
| Devore 3.2 | Discrete Distributions | 12, 16, 24 | 12 |
Here is the book data for Devore Chapter 3, Problem 12.
y = 45:55
py = c(.05,.1,.12,.14,.25,.17,.06,.05,.03,.02,.01)
A common operation with a PMF is to compute the probability over a range of values. Consider Devore Chapter 3, Problem 12. What is the probability that more ticketed passengers show up than there are seats on the plane (i.e., at least 51 ticketed passengers)?
y = 45:55
py = c(.05,.1,.12,.14,.25,.17,.06,.05,.03,.02,.01)
sum(py[y>50])
## [1] 0.17
We can also generate a CDF quickly from any PMF in R using the cumsum() command.
Fy = cumsum(py)
print(Fy)
## [1] 0.05 0.15 0.27 0.41 0.66 0.83 0.89 0.94 0.97 0.99 1.00
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 3.3 | Expected Value and Variance | 29, 30, 36 | 29, 30 |
| | | Review 16, 23 | |
Let’s calculate the expected value and variance for the airline passenger problem (Devore Chapter 3, Problem 12).
y = 45:55
py = c(.05,.1,.12,.14,.25,.17,.06,.05,.03,.02,.01)
Compute the expected value of \(Y\) or \(E[Y]\).
EY = sum(y*py)
print(EY)
## [1] 48.84
Compute the variance of \(Y\) or \(V[Y]\).
VY = sum((y-EY)^2*py)
print(VY)
## [1] 4.4944
Alternatively we can use the shortcut formula for the variance of \(Y\) or \(V[Y]\).
VY = sum(y^2*py)-EY^2
print(VY)
## [1] 4.4944
Recognize the underlying assumptions of a binomial experiment.
Calculate expected value, variance, and probabilities associated with a binomial random variable.
Explain the uses of the Poisson distribution.
Calculate expected value, variance, and probabilities associated with a Poisson random variable.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 3.4 | Binomial Distribution | 48, 49, 54, 60 | 54 |
| | Expected Value of Binomial Distribution | | |
| Devore 3.6 | Poisson Distribution 1 | 81, 87, 89 | 80 |
| | Poisson Distribution 2 | Review 36 | |
We have already seen how to calculate probabilities with unnamed discrete distributions. We can accomplish the same task with named distributions even more easily in R. Let’s use the binomial distribution to demonstrate a new suite of functions. All of the functions introduced here follow the *binom format. The letter we use in place of the * will achieve different results, always using an underlying binomial distribution.
As an example, you may flip a fair coin 10 times (\(n=10\)) with a 50% chance of heads on each flip (\(p=0.5\)). Let \(X\) be the number of heads you observe. Let \(X \sim binom(n=10,p=0.5)\).
What is the probability of exactly 7 heads (dbinom())? Calculate \(P(X=7)\).
# We need to use the PMF, which is called using "dbinom".
# We can say dbinom(x=7,size=10,prob=.5) or more simply:
dbinom(7,10,.5)
## [1] 0.1171875
What is the probability of at most 7 heads (pbinom())? Calculate \(P(X\leq7)\).
# We need to use the CDF, which is called using "pbinom".
# We can say pbinom(q=7,size=10,prob=.5) or more simply:
pbinom(7,10,.5)
## [1] 0.9453125
What is the probability of between 3 and 6 heads, inclusive (pbinom())? Calculate \(P(3\leq X \leq 6)\). Note that \(P(3\leq X \leq 6) = F(6)-F(2)\).
# We can use the CDF, which is called using "pbinom".
# We can say pbinom(q=6,size=10,prob=.5)-pbinom(q=2,size=10,prob=.5) or more simply:
pbinom(6,10,.5)-pbinom(2,10,.5)
## [1] 0.7734375
# We can also use the PMF, which is called using "dbinom".
# We can say sum(dbinom(x=3:6,size=10,prob=.5)) or more simply:
sum(dbinom(x=3:6,size=10,prob=.5))
## [1] 0.7734375
The last command calculated the probabilities for each specified value of \(x\) (\(3:6\)) and summed all the individual probabilities.
As an example, the number of customers, \(X\), arriving at a particular Subway in any given hour has a Poisson distribution with \(\mu=10\). Let \(X \sim Poisson(\mu=10)\).
What is the probability of exactly 7 customers in a particular hour (dpois())?
# We need to use the PMF, which is called using "dpois".
# We can say dpois(x=7,lambda=10) or more simply:
dpois(7,10)
## [1] 0.09007923
What is the probability of no more than 7 customers (ppois())?
# We need to use the CDF, which is called using "ppois".
# We can say ppois(q=7,lambda=10) or more simply:
ppois(7,10)
## [1] 0.2202206
Define a continuous random variable in terms of a real-valued function.
Compare and contrast discrete and continuous random variables.
Verify a function is a valid PDF by using characteristics 1 and 2 on the PDF Reference Sheet.
Explain why the probability of any particular value for a continuous random variable is zero (Characteristic 3).
Given a continuous PDF, identify the domain and calculate corresponding probabilities (Characteristic 5).
Identify the uniform probability distribution and explain its properties, parameters, and applications.
Employ the calculus-based relationship between PDFs and CDFs in order to transform one into the other (Characteristic 4 on the PDF Reference Sheet).
Calculate probabilities using the CDF (Characteristic 6).
Calculate and interpret the Expected Value and Variance of a continuous random variable (Characteristics 7 and 8).
Calculate and interpret a percentile from the distribution of a continuous random variable (Characteristic 9).
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 4.1 | Probability Density Functions | 1, 5, 7, 9 | 5, 7 |
| Devore 4.2 | | 11, 14, 15 | 11, 14 |
| | | Review 60, 89 | |
As an example, the length of time you wait, \(X\), for a sandwich at Subway is uniformly distributed between 2 and 5 minutes. Let \(X \sim Uniform(A=2,B=5)\).
What is the probability you wait between 3 and 4 minutes for a sandwich (punif())?
# We need to use the CDF, which is called using "punif".
# We can say punif(q=4,min=2,max=5)-punif(q=3,min=2,max=5) or more simply:
punif(4,2,5)-punif(3,2,5)
## [1] 0.3333333
In this lesson, we see our first named continuous distribution: the uniform distribution. Let \(f(x)\) represent the uniform distribution pdf and \(F(x)\) represent the uniform distribution CDF. Recall:
\[ P(a \leq x \leq b) = \int_{a}^{b} f(x)dx = F(b)-F(a). \]
R evaluates the CDF of many named distributions using the “p” version of the distribution (e.g., punif()). The technique below will help you calculate integrals for named distributions. As an example, suppose grades are distributed uniformly between 0 and 100. What is the probability that you score between 80 and 100? You can solve this in your head quickly (\(0.2\)), so now let’s try to produce the same answer in R.
# Define the limits of integration.
a = 80
b = 100
# Specify the distribution parameters.
A = 0
B = 100
# Calculate F(b)-F(a)
punif(b,min=A,max=B)-punif(a,min=A,max=B)
## [1] 0.2
We can also calculate percentiles of named continuous distributions. Let’s continue our Subway example from LSN 14 in which the length of time you wait, \(X\), for a sandwich at Subway is uniformly distributed between 2 and 5 minutes. Let \(X \sim Uniform(A=2,B=5)\). What is the 80th percentile wait time for a sandwich (qunif())?
# We need to find a particular percentile (quantile), which is obtained using "qunif".
# We can say qunif(p=.8,min=2,max=5) or more simply:
qunif(.8,2,5)
## [1] 4.4
It is possible to find percentiles for unnamed continuous distributions as well. In one of the suggested problems (Devore Chapter 4, Problem 15d), we are asked to find the 75th percentile of this pdf:
\[ f(x)=\left\{ \begin{array}{ll} 90x^8(1-x) & 0<x<1 \\ 0 & \textrm{otherwise} \\ \end{array} \right. \]
We use the CDF to find percentiles, so let’s derive the CDF for this unnamed pdf.
\[F(x) = \int_0^x 90y^8(1-y)\,dy = \left(10y^9-9y^{10}\right)\bigg|_0^x = 10x^9-9x^{10}\]
We’re looking for the value of \(c\) such that \(F(c) = 10c^9-9c^{10} = .75\).
Manipulate the CDF to become a function of \(c\) that equals 0.
\[10c^9-9c^{10} = .75\]
\[10c^9-9c^{10} - .75 = 0\]
Let R’s uniroot() find the value of \(c\) that makes this last equation true.
# First we need to define our function of interest. We'll call it 'p' for percentile. Note that we need to frame our function such that the root(s) of the function (where the function equals 0) provide(s) our answer(s).
p = function(c) 10*c^9-9*c^10-.75
# Use the uniroot() function to find when p=0. We specify the interval our solution should exist in based on the domain of the pdf - (0,1) in this case.
solution = uniroot(p,interval=c(0,1))
solution$root
## [1] 0.9035964
This result denotes the 75th percentile of the pdf above.
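As a quick sanity check (a minimal sketch, not part of the book problem), we can numerically integrate the pdf up to this value and confirm the area is 0.75:
f = function(x) 90*x^8*(1-x)        # the pdf from Problem 15d
integrate(f, lower = 0, upper = 0.9035964)  # should report a value very close to 0.75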
Identify the normal probability distribution and explain its properties, parameters, and applications.
Transform the normal distribution into the standard normal distribution with mean = 0 and standard deviation = 1.
Identify the exponential probability distribution and explain its properties, parameters, and applications.
Use technology to calculate probabilities and percentiles associated with random variables from the normal and exponential distributions.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 4.3 (p. 156-165) | | 33, 39 | 38, 40 |
| Devore 4.4 (p. 170-172) | | 59, 61 | 59 |
| | | Review 9, 15 | |
In this lesson, we calculate numerous probabilities using the normal distribution. Let \(f(x)\) represent the normal distribution pdf and \(F(x)\) represent the normal distribution CDF. Recall:
\[ P(a \leq x \leq b) = \int_{a}^{b} f(x)dx = F(b)-F(a). \]
R evaluates the CDF of many named distributions using the “p” version of the distribution (e.g., pnorm()). As an example, the height of cadets, \(X\) (in inches), at USMA is normally distributed with \(\mu=70\) and \(\sigma=2\). Let \(X \sim N(\mu=70,\sigma=2)\).
What is the probability that a randomly selected cadet is between 72 and 75 inches tall (pnorm())?
# Define the limits of integration.
a = 72
b = 75
# Specify the distribution parameters.
mu = 70
sigma = 2
# Calculate F(b)-F(a)
# We need to use the CDF, which is called using "pnorm".
# We can say pnorm(q=b,mean=mu,sd=sigma)-pnorm(q=a,mean=mu,sd=sigma) or more simply:
pnorm(b,mu,sigma)-pnorm(a,mu,sigma)
## [1] 0.1524456
What is the probability that a randomly selected cadet is at least 75 inches tall (pnorm())?
# We need to use the CDF, which is called using "pnorm".
# We can say pnorm(q=Inf,mean=70,sd=2)-pnorm(q=75,mean=70,sd=2) or more simply:
pnorm(Inf,70,2)-pnorm(75,70,2)
## [1] 0.006209665
# or more simply:
1 - pnorm(75,70,2)
## [1] 0.006209665
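Equivalently, we can standardize first and use the standard normal distribution; here is a minimal sketch of the transformation \(Z = (X-\mu)/\sigma\):
z = (75 - 70)/2   # standardize: z = (x - mu)/sigma = 2.5
1 - pnorm(z)      # pnorm defaults to mean = 0 and sd = 1
## [1] 0.006209665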
How tall is the 80th percentile cadet (qnorm())?
# We need to find a particular percentile (quantile), which is obtained using "qnorm".
# We can say qnorm(p=.8,mean=70,sd=2) or more simply:
qnorm(.8,70,2)
## [1] 71.68324
We’ll use the pexp() and qexp() functions for the exponential distribution.
Example 4.21 from Devore: Suppose \(X \sim Exp(\lambda=\frac{1}{6})\). What is \(P(X\leq 10)\)?
# Here we use the CDF of the exponential distribution.
# pexp(x=10,rate=1/6)
pexp(10,1/6)
## [1] 0.8111244
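The qexp() function works the same way as qnorm() and qunif(). As a small sketch, the median (50th percentile) of this same distribution is:
# We use the quantile function of the exponential distribution.
qexp(p = 0.5, rate = 1/6)
## [1] 4.158883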
Note: We sometimes drop the argument names (e.g., mean=10 \(\rightarrow\) 10) as long as the values we supply are in the order R expects. Be careful with distributions that can be parameterized by either a rate or a scale (\(\beta\)) parameter: an unnamed value in the rate position will be treated as a rate, so if you intend to supply the scale you must name it explicitly. When you aren’t 100% sure how R is going to treat your function arguments, you can never go wrong by spelling them out completely. The order of your listed arguments doesn’t matter when you clearly specify what each argument is.
For example, pnorm(q=75,mean=70,sd=2)\(=\)pnorm(sd=2,q=75,mean=70).
Explain the role of independence in joint probability distribution functions for both discrete and continuous random variables.
Explain the relationship between the joint, marginal, and conditional distributions for both discrete and continuous random variables.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 5.1 | | 3, 13 | 3, 9 |
| | | Review 4-46, 60 | |
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 5.2 | | 22, 27 | 22 |
| | | Review 3, 12 | |
Calculate the covariance between two continuous or discrete random variables.
Calculate the correlation coefficient between two continuous or discrete random variables.
Explain the pros and cons of analyzing data using these statistical measures.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 5.2 | | 30, 31 | |
| | | Review 26 | |
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| | | | 24 |
Apply Monte Carlo Simulation to model and analyze real world problems.
Given a PDF or CDF, find the inverse CDF and understand its role in Monte Carlo Simulation.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Review 5.3 (p. 225-229) | | Monte Carlo Suggested Problems | |
| Supplemental Material | | | |
You can produce random variates in R using the r prefix for named distributions (e.g., rbinom(), rpois(), rnorm(), rexp(), rgamma()). You can simulate 10 \(U[0,1]\) random variates by calling runif(n=10). You can then apply a derived transformation function to these uniform random variates to simulate drawing random variates from another distribution. The supplemental material explains both the theory and implementation of Monte Carlo simulation. An Army-Navy Football Monte Carlo tutorial can be found at Army-Navy Football: A Monte Carlo Tutorial.
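As a minimal sketch of the inverse-CDF idea (the exponential distribution with rate 1/6 from Example 4.21 is reused here purely for illustration), we can transform uniform random variates and check the result against pexp():
set.seed(1)             # make the simulation reproducible
u = runif(n = 10000)    # U[0,1] random variates
x = -log(1 - u)/(1/6)   # inverse exponential CDF: F^{-1}(u) = -ln(1-u)/lambda
mean(x)                 # should be close to the true mean, 1/lambda = 6
mean(x <= 10)           # should be close to pexp(10, 1/6) = 0.811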
Given a scenario, build a Monte Carlo simulation and analyze the results.
Interpret results from a simulation to provide analysis and answer probability questions.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
Apply the Central Limit Theorem (CLT) to find the distribution and parameters of the sample mean.
Calculate confidence intervals for the population mean or variance for small sample sizes and a normal underlying population distribution.
Perform a hypothesis test by completing the following four steps: Construct the appropriate null and alternative hypothesis; Calculate the test statistic; Calculate a p-value from the test statistic; Interpret the p-value and state a conclusion in terms of the original problem.
Identify the differences between a 1-sample t-test, a 2-sample t-test, and a paired t-test. Recognize when to apply each test and how to interpret the results.
Define and explain the concept of a random sample.
Explain why the sample mean is a random variable.
Approximate the sampling distribution of the sample mean of a random sample from a normally distributed population.
Using the Central Limit Theorem, approximate the sampling distribution of the sample mean of a sufficiently large random sample from any population distribution.
Calculate probabilities associated with the sampling distribution of the sample mean.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 5.3 (p. 220-222), 5.4 | Central Limit Theorem | 46, 47, 50, 53 | 46, 51 |
| Supplemental Material (below) | Standard Error of the Mean | | |
An excerpt from Devore 6.1, p. 259:
The technique below will help you solve Problem 47a. (Note: Alternatively, all of the numbers can be inserted directly into the pnorm() function.)
# Define the limits of integration.
a = 69
b = 71
# Specify the distribution parameters.
mu = 70
sigma = 1.6/sqrt(16)
# Calculate F(b)-F(a)
pnorm(b,mean=mu,sd=sigma)-pnorm(a,mean=mu,sd=sigma)
## [1] 0.9875807
Explain how the confidence interval expression is derived (pg. 277-278, Equations 7.1 - 7.4).
Interpret a confidence interval on the true population mean of a normally distributed population.
Calculate a confidence interval using technology and interpret the result.
Analyze how changing the confidence level and sample size affect a confidence interval.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 7.1 (p. 276-282) | CI for Mean (known variance) | 3, 5, 6 | 6 |
| | | Review 5-53 | |
Recognize when the t distribution is appropriate to use in calculating a confidence interval for the true population mean (e.g., small sample from an underlying normal distribution with unknown variance or a large sample from any distribution with unknown variance).
Calculate a confidence interval for the true population mean using the t-distribution.
Interpret a confidence interval for the true population mean.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 7.3 (p. 295-299) | CI for Mean (Small Sample Sizes) | 29, 33, 35 | 29, 33 |
| Supplemental Material (below) | | Review 5 | |
Confidence Interval for the Population Mean (\(\sigma\) known)
Objective: Determine an interval of plausible values for \(\mu\).
Step 1: Find the \(z\) critical value of the interval (\(z_{\alpha / 2}\)).
Plot the \(z\) distribution.
Place the confidence level in the middle (for two-tail intervals).
Find the \(z_{\alpha / 2}\) that forces the remaining area into the tails.
Step 2: Calculate the interval.
Step 3: Interpret the interval.
Confidence Interval for the Population Mean (\(\sigma\) unknown)
Objective: Determine an interval of plausible values for \(\mu\).
Step 1: Find the t critical value of the interval (\(t_{\alpha/2,\nu}\)).
Plot the \(t_{n-1}\) distribution.
Place the confidence level in the middle (for two-tail intervals).
Find the \(t_{\alpha/2,\nu}\) that forces the remaining area into the tails.
Step 2: Calculate the interval.
Step 3: Interpret the interval.
Note: Recall that \(\nu = df = n-1\) for this application.
Here is the book data for Devore Chapter 7, Problem 33.
data = c(418,421,421,422,425,427,431,434,437,
439,446,447,448,453,454,463,465)
Start with the confidence interval expression for a normally distributed random variable (p. 281):
\[ \left(\bar{y}-z_{\alpha/2} \cdot \frac{\sigma_Y}{\sqrt{n}},\bar{y}+z_{\alpha/2} \cdot \frac{\sigma_Y}{\sqrt{n}}\right) \]
Substitute values from Devore Chapter 7, Problem 5 to obtain the solution to Problem 5a.
# Store the provided information.
ybar=4.85
n=20
sigma.Y = .75
# The critical z value (alpha = .05) falls at the 97.5th percentile of the standard normal distribution.
z.crit = qnorm(.975)
# Calculate the confidence interval. Note: c(a,b) simply combines two values 'a' and 'b' to form a vector (a,b).
c(ybar-z.crit*sigma.Y/sqrt(n),ybar+z.crit*sigma.Y/sqrt(n))
## [1] 4.521304 5.178696
The confidence interval formula (7.15) on page 297 can be calculated in R as follows:
# Suppose we started with this data set:
y = c(418,421,421,422,425,427,431,434,437,
439,446,447,448,453,454,463,465)
# Store some summary statistics.
ybar=mean(y)
n=length(y)
sigma.Y = sd(y)
alpha = .05
# The critical t value (alpha = .05) falls at the 97.5th percentile of the t distribution with n-1 degrees of freedom.
t.crit = qt(1-alpha/2,n-1)
# Calculate the confidence interval. Note: c(a,b) simply combines two values 'a' and 'b' to form a vector (a,b).
c(ybar-t.crit*sigma.Y/sqrt(n),ybar+t.crit*sigma.Y/sqrt(n))
## [1] 430.5077 446.0805
The confidence interval formula (7.15) on page 297 can also be calculated in R using the t.test() function. We’ll explore more of the t.test() arguments and outputs in future lessons. Notice the confidence interval is identical to the one produced above.
y = c(418,421,421,422,425,427,431,434,437,
439,446,447,448,453,454,463,465)
t.test(y)
##
## One Sample t-test
##
## data: y
## t = 119.33, df = 16, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 430.5077 446.0805
## sample estimates:
## mean of x
## 438.2941
By default, R provides a 95% confidence interval. If we want to produce an interval with a different level of confidence, simply change the conf.level argument. Here is an example of a 90% confidence interval.
data = c(418,421,421,422,425,427,431,434,437,
439,446,447,448,453,454,463,465)
t.test(data,conf.level=.9)
##
## One Sample t-test
##
## data: data
## t = 119.33, df = 16, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
## 431.8815 444.7067
## sample estimates:
## mean of x
## 438.2941
Problem 33 introduces the boxplot. While not a lesson objective, constructing a boxplot in R is quite easy. Check out the command boxplot().
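For instance, using the Problem 33 data given above (a minimal sketch; the plot title is just illustrative):
data = c(418,421,421,422,425,427,431,434,437,
         439,446,447,448,453,454,463,465)
boxplot(data, horizontal = TRUE, main = "Boxplot of Problem 33 Data")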
Explain the purpose of conducting a hypothesis test.
Given a scenario, be able to construct the proper null and alternate hypotheses.
Explain the types of error associated with hypothesis testing.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 8.1 | Hypothesis Testing | 2adf, 5, 7, 9 | 2adf, 5 |
| | Type I Error | Review 7-34, 35 | |
Calculate the test statistic (\(t_{stat}\)) and p-value (\(p\)) for a hypothesis test.
Determine the conclusion of the hypothesis test based on the comparison of the p-value and significance level \(\left(p \, \textrm{vs.} \, \alpha\right)\) or the test statistic and t-critical values \(\left(t_{stat} \, \textrm{vs.} \, t_{\alpha \, \textrm{or} \, \alpha/2}\right)\).
Conduct a four-step hypothesis test (see Supplemental Material) on the population mean of a normally distributed population and interpret the results in the context of the original problem.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 8.3 | Hypothesis Testing and p-values | 30, 37b, 38 | 29, 31 |
| Supplemental Material (below) | What is a p-value? | Review 8 | |
| | Hypothesis Testing - One Tail vs. Two Tails | | |
| | One-Sample t-Test in R | | |
The p-value is the probability of obtaining the observed test statistic (or a more extreme result) when the null hypothesis is actually true. More simply stated, how rare is what I just saw if I assume the null hypothesis to be true?
Objective: To determine what the true mean, \(\mu\), is NOT (gather evidence against \(H_0\)).
Step 1: State \(H_0\) and \(H_a\). Note the significance level (\(\alpha\)) and pay particular attention to the inequality of \(H_a\).
Step 2: Calculate the test statistic.
Step 3: Find the P-Value.
Plot the distribution of the test statistic if \(H_0\) is true.
Plot the test statistic on the number line.
Calculate the appropriate area based on the form of \(H_A\) (one or two tails of the distribution).
Step 4: Interpret the P-Value in terms of the original problem.
Here is the book data for Devore Chapter 8, Problem 37.
data = c(112.3,97,92.7,86,102,99.2,95.8,103.5,89,86.7)
Here is the book data for Devore Chapter 8, Problem 38.
data = c(1.10,5.09,0.97,1.59,4.60,0.32,0.55,1.45,0.14,4.47,
1.20,3.50,5.02,4.67,5.22,2.69,3.98,3.17,3.03,2.21,
0.69,4.47,3.31,1.17,0.76,1.17,1.57,2.62,1.66,2.05)
We’ll now explore the arguments and outputs of the t.test() function in R. In Problem 37b, we wish to conduct a hypothesis test regarding the strength of a batch of concrete. In this case, \(H_0:\mu=100 \, \textrm{MPa}\) and \(H_a: \mu<100 \, \textrm{MPa}\). Let \(\alpha=0.05\). The t.test() function has several arguments; for our purposes in this one-sample test, we’ll concern ourselves with x, alternative, and mu:
t.test(x, y = NULL,
alternative = c("two.sided", "less", "greater"),
mu = 0, paired = FALSE, var.equal = FALSE,
conf.level = 0.95, ...)
x refers to our sample data, alternative will be “less” as stated in our \(H_a\), and mu is 100 as stated in our \(H_0\).
data = c(112.3,97,92.7,86,102,99.2,95.8,103.5,89,86.7)
t.test(x=data,alternative="less",mu=100)
##
## One Sample t-test
##
## data: data
## t = -1.3708, df = 9, p-value = 0.1018
## alternative hypothesis: true mean is less than 100
## 95 percent confidence interval:
## -Inf 101.2073
## sample estimates:
## mean of x
## 96.42
Consider the same scenario presented in this manner:
\[H_0:\mu=100 \, \textrm{MPa}\]
\[H_a: \mu<100 \, \textrm{MPa}\]
\[\bar{x}=96.42\]
\[s=8.259\]
\[n=10\]
xbar = 96.42
s = 8.259
n = 10
t.stat = (xbar - 100)/(s/sqrt(n))
print(t.stat)
## [1] -1.370741
# We'll use the t distribution CDF (pt) to calculate the p-value.
p.value = pt(t.stat,n-1)
print(p.value)
## [1] 0.1018333
Thus, we arrive at the same p-value whether you start with data or sample statistics.
If you choose to compare the test statistic to the t critical value, recall how we obtain the t critical values using the qt() function.
alpha = .05
# Note that the "one-sided, less than" t critical value occurs at the 5th percentile of the t distribution with 9 degrees of freedom.
t.crit = qt(alpha,n-1)
print(t.crit)
## [1] -1.833113
Recognize and explain the differences between a one-sample t-test, a two-sample t-test, and a paired t-test. Recognize when the data columns are dependent on one another and a paired t-test is more appropriate than a two-sample t-test.
Given two sets of independent data, use the R function “t.test” to perform a hypothesis test for the true difference of means and interpret the result using the p-value or test statistic.
Given two sets of independent data, calculate a confidence interval for the true difference of means.
Given paired data, perform a hypothesis test in R for the true mean difference and interpret the result using the p-value or test statistic.
Using paired data, calculate a confidence interval for the true mean difference of two populations.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 9.2 (374-377) | | 23c, 28, 32 | 28 |
| Devore 9.3 | Paired t-Test in R | 36, 40ac | 36 |
| | | Review 8-37ab, 40 | |
Here is the book data for Devore Chapter 9, Problem 23.
H = c(1.2,0.9,0.7,1,1.7,1.7,1.1,0.9,1.7,1.9,1.3,2.1,
1.6,1.8,1.4,1.3,1.9,1.6,0.8,2,1.7,1.6,2.3,2)
P = c(1.6,1.5,1.1,2.1,1.5,1.3,1,2.6)
Here is the book data for Devore Chapter 9, Problem 28.
YF = c(29,34,33,27,28,32,31,34,32,27)
OF = c(18,15,23,13,12)
Here is the book data for Devore Chapter 9, Problem 36.
unabraded = c(36.4,55,51.5,38.7,43.2,48.8,25.6,49.8)
abraded = c(28.5,20,46,34.5,36.5,52.5,26.5,46.5)
Here is the book data for Devore Chapter 9, Problem 40.
L = c(1928,2549,2825,1924,1628,2175,2114,2621,1843,2541)
P = c(2126,2885,2895,1942,1750,2184,2164,2626,2006,2627)
We’ll now explore additional arguments and outputs of the t.test() function in R. In Problem 23c, we wish to conduct a hypothesis test (\(\alpha = .05\)) assessing whether the true averages differ between the two fabrics. In this case, \(H_0:\mu_1-\mu_2=0\) and \(H_a: \mu_1-\mu_2 \neq 0\). The t.test() function has several arguments; for our purposes in this two-sample test, we’ll concern ourselves with x, y, alternative, conf.level, and mu:
t.test(x, y = NULL,
alternative = c("two.sided", "less", "greater"),
mu = 0, paired = FALSE, var.equal = FALSE,
conf.level = 0.95, ...)
x refers to the first sample, y refers to a second sample, alternative will be “two.sided” as stated in our \(H_a\), conf.level is .95, and mu is 0 as stated in our \(H_0\).
H = c(1.2,0.9,0.7,1,1.7,1.7,1.1,0.9,1.7,1.9,1.3,2.1,
1.6,1.8,1.4,1.3,1.9,1.6,0.8,2,1.7,1.6,2.3,2)
P = c(1.6,1.5,1.1,2.1,1.5,1.3,1,2.6)
t.test(x=H,y=P,alternative="two.sided",mu=0,conf.level=.95)
##
## Welch Two Sample t-test
##
## data: H and P
## t = -0.38011, df = 10.482, p-value = 0.7115
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.5403506 0.3820172
## sample estimates:
## mean of x mean of y
## 1.508333 1.587500
We’ll now explore a final argument of the t.test() function in R. In Problem 36, we wish to conduct a hypothesis test (\(\alpha=0.05\)) assessing various fabrics in both abraded and unabraded conditions. Since the same fabric is tested under two conditions, we have paired data. In this case, \(H_0:\mu_D=0\) and \(H_a: \mu_D > 0\). The t.test() function has several arguments; for our purposes in this paired test, we’ll concern ourselves with x, y, alternative, conf.level, mu, and paired:
t.test(x, y = NULL,
alternative = c("two.sided", "less", "greater"),
mu = 0, paired = FALSE, var.equal = FALSE,
conf.level = 0.95, ...)
x refers to the first sample, y refers to a second sample, alternative will be “greater” as stated in our \(H_a\), mu is 0, and paired is TRUE.
unabraded = c(36.4,55,51.5,38.7,43.2,48.8,25.6,49.8)
abraded = c(28.5,20,46,34.5,36.5,52.5,26.5,46.5)
t.test(x=unabraded,y=abraded,alternative="greater",mu=0,conf.level = 0.95,paired=TRUE)
##
## Paired t-test
##
## data: unabraded and abraded
## t = 1.7286, df = 7, p-value = 0.06375
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -0.6961062 Inf
## sample estimates:
## mean of the differences
## 7.25
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| | | Review 32, 40ac | |
Create, use, and interpret a simple linear regression model using R and R-Studio.
Determine the adequacy and appropriateness of a simple linear regression model on a given data set.
Verify linear model assumptions: linearity, independence, and normally distributed residuals with a mean of 0 and constant variance (i.e., \(\epsilon \sim N(0,\sigma^2)\)).
Use appropriate techniques to model data when assumptions are violated.
Develop and assess multiple regression models with continuous and categorical variables.
Use appropriate statistical techniques when the response variable is discrete.
Identify and explain the underlying assumptions of a simple linear regression model.
Given regression output in R, identify the three parameters of the simple linear regression model (slope, intercept, and variance of the random error); identify the R-squared, adjusted R-squared, estimate of the intercept, and estimate of the slope; and explain what each means.
Explain the random error term and identify its distribution.
Explain the principle of Least Squares and recall how it is used to find the best linear regression model.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 12.1 | Simple Linear Regression | 1c, 7, 8ab, 9a-d | 7, 11abc |
| Devore 12.2 | Simple Linear Regression in R | 14, 16, 24 | 14 |
| Project Prompt | | | |
Here is the book data for Devore Chapter 12, Problem 1.
Temp = c(170,172,173,174,174,175,176,177,180,180,180,180,
180,181,181,182,182,182,182,84,184,185,186,188)
Ratio = c(0.84,1.31,1.42,1.03,1.07,1.08,1.04,1.8,1.45,1.6,1.61,2.13,
2.15,0.84,1.43,0.9,1.81,1.94,2.68,1.49,2.52,3,1.87,3.08)
Here is the book data for Devore Chapter 12, Problem 14.
Temp = c(170,172,173,174,174,175,176,177,180,180,180,180,
180,181,181,182,182,182,182,84,184,185,186,188)
Ratio = c(0.84,1.31,1.42,1.03,1.07,1.08,1.04,1.8,1.45,1.6,1.61,2.13,
2.15,0.84,1.43,0.9,1.81,1.94,2.68,1.49,2.52,3,1.87,3.08)
Here is the book data for Devore Chapter 12, Problem 16.
x = c(5,12,14,17,23,30,40,47,55,67,72,81,96,112,127)
y = c(4,10,13,15,15,25,27,46,38,46,53,70,82,99,100)
Here is the book data for Devore Chapter 12, Problem 24.
x = c(50,71,55,50,33,58,79,26,69,44,37,70,20,45,49)
y = c(152,1929,48,22,2,5,35,7,269,38,171,13,43,185,25)
Regression analysis investigates the relationship between two or more variables, and simple linear regression investigates the relationship between only two variables. The model for simple linear regression is as follows:
\[ Y = \beta_0 +\beta_1x + \epsilon \text{ where } \epsilon \sim N(0,\sigma^2) \]
We want to use our grades data again (see Introductory R Tutorial) in order to determine if there is a linear relationship between WPR 1 scores and instructor grades. Previously we created a scatter plot, and it appeared that there might be a positive linear relationship. Let’s re-create the graph here:
grades = read.csv("GradeData.csv",header=TRUE)
plot(grades[,3], grades[,1], xlab = "Instructor Grades", ylab = "WPR1 Scores", main = "A Scatter Plot of WPR1 Scores against Instructor Grades")
Since we do not have the data on the entire population, we will have to estimate the parameters of the true regression line by using our sample data. The estimated regression line is as follows:
\[ \hat{y} = \hat{\beta_0} +\hat{\beta_1}x \]
Notice that the estimated regression line has hats that denote estimates, and the epsilon term (\(\epsilon\)) drops out because \(\hat{y}\) is not a random variable after the slope and intercept are determined. If we know what \(x\) is, then we can estimate \(\hat{y}\) using the calculated parameter estimates \(\hat{\beta_0}\) and \(\hat{\beta_1}\).
We can easily use calculus and the equations in the book to determine the estimates for the slope and intercept, or we can use the lm() function to do this for us (lm stands for linear model). Let’s determine the estimated regression line in R where we use instructor grades as our x-variable (predictor) and WPR 1 scores as our y-variable (response):
wpr1 = grades[,1] # response
inst.grade = grades[,3] # predictor
Creating new objects for both \(x\) and \(y\) helps the user follow the output of the linear model lm() function. Now that we have defined the response \(y\) and independent \(x\) variables, we can use the lm() function to perform simple linear regression:
model1 = lm(wpr1~inst.grade) # Notice the tilde after wpr1. The tilde means "is a function of."
You can read this as “\(wpr1\) is a function of \(inst.grade\).” Let’s check out the model we produced. It turns out this model1 object we created is a collection of calculated attributes of our linear model. We’ll explore several of these attributes in the coming lessons.
summary(model1) # This prints some details about our created model.
##
## Call:
## lm(formula = wpr1 ~ inst.grade)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.015 -5.238 1.474 9.050 14.729
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 57.753 9.893 5.838 2.65e-07 ***
## inst.grade 67.883 12.875 5.272 2.16e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.72 on 57 degrees of freedom
## Multiple R-squared: 0.3278, Adjusted R-squared: 0.316
## F-statistic: 27.8 on 1 and 57 DF, p-value: 2.157e-06
From the R regression output, the rows under the Coefficients: label contain information for both the intercept and slope parameters. Notice the estimate of the intercept (Intercept), \(\hat{\beta}_0 = 57.753\), and the estimate of the slope (inst.grade), \(\hat{\beta}_1 = 67.883\). Therefore, the estimated regression line is:
\[ \hat{\text{wpr1}} = 57.753 + 67.883 \times \textrm{inst.grade} \]
The interpretation of the intercept is typically not important in MA256. Instead, we will focus on interpreting the slope. First, the estimate of the slope is positive: as the instructor grade increases, we expect the WPR1 score to increase as well, which isn’t terribly surprising. Second, if the instructor grade increases by 1 percentage point (0.01), then the expected WPR1 score increases by \(.01 \times 67.88 = .6788\) points. Notice that the scales of the two grades differ; WPR1 grades are whole numbers from 60 to 130, while instructor grades are proportions between .49 and .97. Recognizing differences in scales is critical for model interpretation.
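As a quick check of this interpretation (a minimal sketch; the instructor grades 0.85 and 0.86 are hypothetical values), we can ask the fitted model for predictions one percentage point apart:
# predict() evaluates the fitted line at new predictor values.
predict(model1, newdata = data.frame(inst.grade = c(0.85, 0.86)))
# The two predictions differ by 0.01 * 67.88, or about 0.68 points.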
You’ll notice that there is a lot of data contained in the regression summary output. At a minimum, we will want to extract the Intercept, Slope, Standard Error of the Model, and R-squared.
When we defined the regression model above, we named it model1. When we displayed the output from the simple regression model, we used the summary command (i.e., summary(model1)). In order to see the names of all the attributes contained in the model summary, we use the names() function:
names(summary(model1))
## [1] "call" "terms" "residuals" "coefficients"
## [5] "aliased" "sigma" "df" "r.squared"
## [9] "adj.r.squared" "fstatistic" "cov.unscaled"
We can assign all of the summary information to a new object as follows:
output = summary(model1)
We may want to index into the output object and get the data that we need for additional calculations. Notice that the coefficients data contains the intercept and slope. We can “grab” specific attributes of the model summary (stored as output) by using the $ operator. See the examples below:
output$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 57.75300 9.892894 5.837826 2.647529e-07
## inst.grade 67.88264 12.875310 5.272311 2.156943e-06
This gives us both the intercept and the slope along with other information in matrix-style format. We can get each individual number by indexing further:
output$coefficients[1,1] # Extract the estimate for the intercept only.
## [1] 57.753
output$coefficients[2,1] # Extract the estimate for the slope only.
## [1] 67.88264
Picking up where we finished, there is additional information stored in each linear model summary. We’ll start again with:
model1 = lm(wpr1~inst.grade)
output = summary(model1)
The r.squared data contains the R-squared value:
output$r.squared # There is no need to index further since only one value is stored in this attribute.
## [1] 0.3278085
The sigma value contains the Standard Error of the Model:
output$sigma
## [1] 10.7196
We can store these numbers for use in other calculations. For example, we can use specific values from the output$coefficients to calculate a confidence interval for the slope coefficient (see LSN 29 R Help).
Determine the relationship between the independent and dependent variables using linear regression. Consider the sign and magnitude of the slope coefficient as well as the scales of the predictor and response variables.
Given regression output in R, conduct the model utility test on the slope parameter, identify the test statistic and p-value, and draw appropriate conclusions about the usefulness of the slope estimate.
Calculate a confidence interval for the true value of the slope parameter.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 12.3 | | 32, 34ab | 32 |
| | | Review 16, 24 | |
Here is the book data for Devore Chapter 12, Problem 32.
x = c(5,12,14,17,23,30,40,47,55,67,72,81,96,112,127)
y = c(4,10,13,15,15,25,27,46,38,46,53,70,82,99,100)
Here is the book data for Devore Chapter 12, Problem 34.
x = c(4.35,4.79,5.57,5.20,5.07,5.79,5.36,6.40,5.66,
5.90,6.49,5.70,6.49,6.37,6.51,7.88,6.74,7.08)
y = c(4.55,4.49,4.50,4.47,4.47,4.45,4.40,4.34,4.43,
4.43,4.42,4.40,4.33,4.44,4.40,4.26,4.32,4.34)
We’ll continue with our model from the last couple lessons.
model1 = lm(wpr1~inst.grade)
output = summary(model1)
With the use of R, a confidence interval for any of the regression parameters is easily calculated. Suppose we want to determine a 95% confidence interval for the slope parameter, \(\hat{\beta_1}\). We can manually calculate a confidence interval using the attributes available from our model summary. We’ll need the output$coefficients:
output$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 57.75300 9.892894 5.837826 2.647529e-07
## inst.grade 67.88264 12.875310 5.272311 2.156943e-06
We’ll need some of these values for use in the \(\hat{\beta_1}\) confidence interval:
\[\hat{\beta_1} \pm t_{\alpha/2,n-2}s_{\hat{\beta_1}}\]
The output$coefficients values can be indexed like a traditional matrix. Specifically, we need to access the inst.grade estimate and its standard error. These exist on the second row in the first and second columns.
alpha = .05
n = length(inst.grade)
t.crit = qt(1-alpha/2,n-2)
estimate = output$coefficients[2,1] # Index the coefficient estimate for the inst.grade parameter.
SE = output$coefficients[2,2] # Index the standard error estimate for the inst.grade parameter.
# Now we can calculate and display the confidence interval for the inst.grade coefficient.
c(estimate-t.crit*SE,estimate+t.crit*SE)
## [1] 42.10028 93.66501
From the output above, we are 95% confident that the true value of the slope parameter for the population regression line is between 42.1 and 93.67 (rounded). Alternatively, we can use the confint() function by entering the regression model as the first argument, the parameter name as the second argument, and the confidence level as the third argument. Try the following code:
confint(model1, parm = "inst.grade", level = .95)
## 2.5 % 97.5 %
## inst.grade 42.10028 93.66501
The results from the two methods agree. You can change the confidence level by changing the level argument in the confint() function.
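For instance, a 90% interval for the same slope parameter (a minimal illustration):
confint(model1, parm = "inst.grade", level = .90)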
Correctly identify categorical predictors.
Properly encode categorical predictor variables in R.
Identify significant categorical variables by comparing p-values to a predetermined significance level (\(\alpha\)).
Correctly interpret the \(\beta\) coefficients in terms of the initial problem.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 13.4 | Categorical Predictors in R | 39, Suggested Problem Below | 39 |
| Supplemental Material Below | | Review 37 | |
You learned in the video that we sometimes work with data that is categorical.
For example, a person’s gender or hair color takes one of a small number of categories rather than a numeric value.
We must handle these predictors differently because they do not operate on a continuous scale. Either a person is male or they are not. There is not a continuous scale between male and female.
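As a minimal sketch of how R handles such a predictor (the tiny gender vector below is made up purely for illustration), a categorical variable is stored as a factor and expanded into 0/1 dummy columns when it enters a model:
gender = factor(c("M", "F", "M", "F"))   # a made-up categorical variable
model.matrix(~ gender)                   # the 'genderM' column is 1 for males and 0 for females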
Take, for example, the following simulated data on an individual’s pay.
pay <- read.csv(file="payexample.csv", header=TRUE, sep=",")
attach(pay)
head(pay)
## Years.of.Service Pay Gender Hair.Color
## 1 3 53520.54 M notblond
## 2 15 113240.48 M notblond
## 3 24 162067.65 M notblond
## 4 20 143756.46 F blond
## 5 12 101980.35 F notblond
## 6 4 62929.25 M notblond
As you can see, we have one response variable (Pay) and three predictor variables.
We want to determine what impact years of service, gender, and hair color have on an individual’s pay at this company (if any).
To help answer this, let’s see what happens when we regress Pay on our three predictors.
lm.pay = lm(Pay~(Years.of.Service+Hair.Color+Gender))
summary(lm.pay)
##
## Call:
## lm(formula = Pay ~ (Years.of.Service + Hair.Color + Gender))
##
## Residuals:
## Min 1Q Median 3Q Max
## -9745.5 -3278.6 -453.8 3223.9 13522.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44006.52 1849.10 23.799 < 2e-16 ***
## Years.of.Service 4990.71 99.94 49.939 < 2e-16 ***
## Hair.Colornotblond -5857.65 1543.86 -3.794 0.000431 ***
## GenderM 992.41 1473.42 0.674 0.503973
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5039 on 46 degrees of freedom
## Multiple R-squared: 0.9822, Adjusted R-squared: 0.981
## F-statistic: 846.3 on 3 and 46 DF, p-value: < 2.2e-16
You should notice several things that are different about this output.
First, our predictor Years.of.Service is significant because its p-value is less than our predetermined alpha of .05.
Second, Gender is displayed in the model as GenderM. Since this is not a continuous predictor, R treats it as a categorical predictor, and the GenderM coefficient captures the effect that being male has on an individual’s pay. What about being female? Since being female can be considered the absence of being male, the female effect is rolled into the intercept of the model. The same is true of Hair.Colornotblond, which captures the effect that not being blond has on an individual’s pay.
If a categorical predictor is insignificant (i.e., \(p > \alpha\)), then we do not have evidence that the presence of that predictor has an impact on our response.
In the example above, since the p-value for GenderM is greater than .05, we conclude that there is no statistically significant pay difference between men and women, and we can remove Gender from our model.
lm.pay2 = lm(Pay~(Years.of.Service+Hair.Color))
summary(lm.pay2)
##
## Call:
## lm(formula = Pay ~ (Years.of.Service + Hair.Color))
##
## Residuals:
## Min 1Q Median 3Q Max
## -9366.7 -3628.8 -409.4 3293.9 12923.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44441.55 1722.53 25.800 < 2e-16 ***
## Years.of.Service 4995.83 99.07 50.430 < 2e-16 ***
## Hair.Colornotblond -5709.77 1519.26 -3.758 0.000472 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5010 on 47 degrees of freedom
## Multiple R-squared: 0.982, Adjusted R-squared: 0.9813
## F-statistic: 1284 on 2 and 47 DF, p-value: < 2.2e-16
However, since the p value for Hair.Colornotblond is <.05, this provides strong evidence that there is a significant difference in pay between individuals who have blond hair and those who do not.
How much difference? We must look at our coefficient for our linear model:
lm.pay2$coefficients
## (Intercept) Years.of.Service Hair.Colornotblond
## 44441.548 4995.832 -5709.767
We can see that our model predicts that someone without blond hair gets paid $5,709.77 less, on average, than someone with blond hair, holding years of service constant.
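As a quick, hypothetical check on this interpretation, we can ask the model to predict pay for two individuals who differ only in hair color (the 10 years of service below is an arbitrary value chosen for illustration):
# Predict pay for two employees with 10 years of service, one blond and one not.
new.pay = data.frame(Years.of.Service = c(10, 10), Hair.Color = c("blond", "notblond"))
predict(lm.pay2, newdata = new.pay)
diff(predict(lm.pay2, newdata = new.pay)) # matches the Hair.Colornotblond coefficient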
### Suggested Problem {#Sugprob2}
x1 = c(8.782746, 7.553363, 1.608668, 8.070618, 2.676193, 9.273625, 1.609907, 6.812618, 3.177070, 2.043432, 3.158704, 6.631469, 8.905731, 5.935131, 3.148553, 1.198568, 8.451915, 8.664580, 8.759836, 3.117905)
x2 = c(0.07339088, 0.26012713, 0.65840424, 0.12574442, 0.01284719, 0.93768603, 0.25010248, 0.66528929, 0.27026371, 0.54081582, 0.19599295, 0.57015303, 0.83265614, 0.79580407, 0.74082821, 0.50192293, 0.78072478, 0.94169540, 0.03287943, 0.43837245)
x3 = c(4.2280742, 2.0078027, 4.4866210, 0.9259695, 2.2998340, 2.1388518, 0.3326518, 2.3098570, 3.2388353, 1.3700555, 1.0926034, 1.2696775, 3.4287515, 1.6269742, 0.8906086, 3.6825764, 2.5702117, 4.3030499, 2.9233768, 2.2676811)
y1 = c(32.088026, 26.935867, 12.596351, 27.259926, 11.378572, 37.648887, 7.779088, 25.528008, 15.225756, 9.145160, 10.752428, 27.167595, 34.758202, 22.669995, 14.543876, 9.491130, 34.520567, 36.624624, 30.636396, 17.580203)
Develop and assess a multiple regression model to explain the variation of a response variable using quantitative and categorical predictor variables.
Using R-Studio, perform multiple regression using quantitative and categorical predictors.
Given multiple regression output in R, identify and explain the adjusted R-squared, the estimates of the regression coefficients, and the standard error of the model.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 13.4 | Multiple Linear Regression in R | 36, 38, Suggested Problem Below | 38 |
Multiple regression takes into account more than one independent variable. The multiple regression model is as follows:
\[ Y = \beta_0 +\beta_1x_1 + \beta_2x_2 + ...+ \beta_kx_k + \epsilon \text{ where } \epsilon \sim N(0,\sigma^2) \]
The goal of multiple regression is to determine estimates of the \(\beta_i\) coefficients from our sample data set that best predict the dependent variable \(y\). A useful multiple regression model will have coefficient estimates that are statistically different from zero, the coefficients will be easily interpretable, and the linear model assumptions will be met. The final estimated multiple regression equation will take on the form:
\[ \hat{y} = \hat{\beta_0} +\hat{\beta_1}x_1 + \hat{\beta_2}x_2 + ...+ \hat{\beta_k}x_k \]
Before we begin, recall the different columns of data in the grades data we’ve been using:
grades = read.csv("GradeData.csv",header=TRUE)
names(grades)
## [1] "WPR1" "WPR2" "InstructorGrade" "APSC"
## [5] "SAT.Math"
Let’s assume that we want to find a relationship between WPR2 grades and all the other scores (WPR1, InstructorGrade, APSC, and SAT.Math).
You can create scatterplots using the plot() function and plot the dependent variable \(y\) against each \(x_i\). If we wanted to explore all the relationships quickly, we can create a scatterplot of all the different variable pairs using the pairs() function:
pairs(grades)
You only need to focus on either the upper right triangle above the diagonal or the lower left triangle below the diagonal. For example, let’s look at the graph in the 1st row and the 2nd column. This graph has WPR1 scores on the y-axis and WPR2 scores on the x-axis. There appears to be a positive linear relationship between these two test scores (i.e., the better I perform on WPR1, then the better I can expect to perform on WPR2). The remaining graphs can be read in a similar way. Creating these scatterplots will help identify any non-linear relationships among the data. In this data set, non-linear relationships appear to be absent, which is a good thing since we are trying to fit a linear model using all of the data.
Multiple regression in R is done similarly to how we performed simple linear regression, but now we can include more than just a single independent variable. In order to easily interpret the regression output, assign each column of data to its own object:
wpr1 = grades[,1]
wpr2 = grades[,2]
inst.grade = grades[,3]
apsc = grades[,4]
sat.math = grades[,5]
Using the objects above, we can use the lm() function to estimate the coefficients. Since we are creating another model, we can label our different models in sequential order. For example, we name the next linear model model2 as follows:
model2 = lm(wpr2~wpr1+inst.grade+apsc+sat.math) # Recall the use of "tilde" after wpr2.
summary(model2) # Print the regression output.
##
## Call:
## lm(formula = wpr2 ~ wpr1 + inst.grade + apsc + sat.math)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.087 -4.353 1.726 7.295 29.952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.73128 19.11153 0.718 0.4756
## wpr1 0.08087 0.17214 0.470 0.6404
## inst.grade 23.58476 26.68980 0.884 0.3808
## apsc 7.17491 5.08628 1.411 0.1641
## sat.math 0.06549 0.02748 2.383 0.0207 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.99 on 54 degrees of freedom
## Multiple R-squared: 0.4152, Adjusted R-squared: 0.3719
## F-statistic: 9.585 on 4 and 54 DF, p-value: 6.252e-06
From the output, the only coefficient that is statistically different from zero is sat.math. In MA256, we want to reduce our initial model down to the most parsimonious model available. A parsimonious model balances explaining as much variance in the response variable as possible while using as few predictor variables as possible. Since wpr1 has the highest p-value in the output, we eliminate it from the model first.
model3 = lm(wpr2~apsc+sat.math+inst.grade)
summary(model3)
##
## Call:
## lm(formula = wpr2 ~ apsc + sat.math + inst.grade)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.149 -5.398 1.942 7.243 29.722
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.36348 17.35353 1.001 0.3214
## apsc 7.65062 4.94902 1.546 0.1279
## sat.math 0.06842 0.02657 2.575 0.0127 *
## inst.grade 26.21973 25.90828 1.012 0.3160
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.9 on 55 degrees of freedom
## Multiple R-squared: 0.4128, Adjusted R-squared: 0.3808
## F-statistic: 12.89 on 3 and 55 DF, p-value: 1.729e-06
It looks like we also need to eliminate inst.grade. Students should only eliminate variables ONE AT A TIME.
model4 = lm(wpr2~apsc+sat.math)
output = summary(model4)
summary(model4)
##
## Call:
## lm(formula = wpr2 ~ apsc + sat.math)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.591 -5.272 2.882 8.123 28.655
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.10703 14.44092 1.877 0.065717 .
## apsc 11.49067 3.17796 3.616 0.000643 ***
## sat.math 0.06676 0.02652 2.517 0.014727 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.9 on 56 degrees of freedom
## Multiple R-squared: 0.4019, Adjusted R-squared: 0.3805
## F-statistic: 18.81 on 2 and 56 DF, p-value: 5.627e-07
Suppose you want to investigate all the possible variable combinations to see which combination of predictor variables explains the greatest amount of variation in the dependent variable (HINT: there might be an R function for this…try to use GOOGLE to find it).
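One possible starting point (a sketch, not the only approach) is R's built-in step() function, which performs stepwise selection using AIC; the regsubsets() function from the leaps package, shown later in this guide, is another option.
# Backward elimination from the full model using AIC (smaller AIC is better).
best.model = step(model2, direction = "backward")
summary(best.model)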
Returning to the final model produced above, we see that the apsc and sat.math data explain 38.05% of the variation in the dependent variable in this multiple regression model (make sure to use the adjusted R-squared value in multiple regression). Additionally, each estimated slope coefficient is positive and statistically different from zero (at the 5% level). The final multiple regression model is as follows:
\[ \hat{y} = 27.107 + 11.49\times apsc + .067 \times sat.math \]
Does this make sense? Each student should spend time trying to analyze the meaning of the final multiple regression model. In this situation, WPR2 scores are modeled using the students’ academic proficiency score and SAT math score. The better either one of these scores, the better a student should perform on WPR2. Does it make sense that a student’s SAT score (usually taken in high school) should have an impact on a test in the second year at West Point? A close look at the SAT variable may cause us to eliminate it from the model. We might ask ourselves, “What is the average impact of these variables on our model?” We could take the average scores of apsc and sat.math and substitute them into the model:
apsc.avg = mean(apsc)
sat.avg = mean(sat.math)
Now we can take these two averages and multiply by the corresponding coefficient estimates:
apsc.avg.impact = apsc.avg*model4$coefficients[2]
apsc.avg.impact
## apsc
## 33.60438
sat.avg.impact = sat.avg*model4$coefficients[3]
sat.avg.impact
## sat.math
## 41.40723
The average impact of the SAT math score is approximately 41.4 points and is higher than the APSC average impact. We might begin to question the validity of our model since we attribute an average of 41.4 points of the WPR2 score to our high school SAT math score. Wouldn’t it make more sense if the current APSC at West Point has a higher average impact on our model?
What if there is a 1% increase in either variable from its average value? What would be the expected point change to the WPR2 score?
.01*apsc.avg*model4$coefficient[2]
## apsc
## 0.3360438
.01*sat.avg*model4$coefficient[3]
## sat.math
## 0.4140723
A 1% increase in the APSC average would increase the WPR2 score in the model by .34 points, and the corresponding increase in the SAT math score would increase the WPR2 score by .41 points. From the preceding analyses, SAT math has the biggest impact on our model, but perhaps it shouldn’t since this test is taken at least two years prior to taking MA256. Perhaps this SAT variable acts as a proxy for something else. Could it be that individuals who perform well on the SAT math are better test takers? Harder workers? More comfortable in math?
You can do so much more than we’ve shown in this R Help. Take a look at Model Selection in R to see some other regression capabilities and considerations.
x1 = c(2.510684, 2.452358, 2.528762, 1.239320, 4.875952, 7.840882, 9.818687,
6.250102, 3.959507, 6.590692, 5.605196, 2.375454, 9.851381, 6.839154,
1.236571, 3.928925, 9.503634, 1.954501, 3.861245, 1.595649)
x2 = c(0.966144488, 0.380199655, 0.402895897, 0.160270553, 0.218428088,
0.823071292, 0.419936759, 0.765025441, 0.146236090, 0.009051597,
0.806242274, 0.614748965, 0.310321721, 0.620226537, 0.536446290,
0.162144724, 0.079734733, 0.383490324, 0.900058977, 0.569140746)
x3 = c(0.521057383, 1.423197392, 0.865561773, 0.849311931, 0.069467869,
2.423375367, 0.999890500, 1.180788156, 0.323237168, 1.646717797,
2.175627203, 1.742936300, 1.308336647, 2.443687823, 0.483587769,
0.075603643, 1.137584465, 1.723667013, 0.520898202, 2.440183042)
y1 = c(12.883660, 10.342506, 8.269724, 5.213510, 15.388985, 27.125959,
30.921964, 23.011831, 12.214357, 17.949329, 21.634420, 12.927383,
30.112521, 25.259303, 5.783378, 12.853683, 27.894458, 9.237791,
18.602418, 7.186046)
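No problem statement accompanies this data in the guide; assuming the task mirrors the example above (explain the variation in y1 using x1, x2, and x3), a possible starting point is:
# Fit a full multiple regression model, then examine which coefficients
# are statistically different from zero before reducing the model.
full.model = lm(y1 ~ x1 + x2 + x3)
summary(full.model)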
Execute variable selection using forward selection
Execute variable selection using backward elimination
Execute variable selection using ‘all subsets’.
Be able to compare models and know strengths/weakness of each method (AIC/BIC/R2/Adj R2/Mallow’s CP)
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 13.5 (p 599-603) | | Problem Below | 58 |
Suggested Problem
Investigators studied physical characteristics and ability in 13 football punters. Each volunteer punted a football ten times. The investigators recorded the average distance for the ten punts, in feet. They also recorded the average hang time (time the ball is in the air before the receiver catches it) for the ten punts, in seconds. In addition, the investigators recorded five measures of strength and flexibility for each punter: right leg strength (pounds), left leg strength (pounds), right hamstring muscle flexibility (degrees), left hamstring muscle flexibility (degrees), and overall leg strength (foot-pounds). From the study “The relationship between selected physical performance variables and football punting ability” by the Department of Health, Physical Education and Recreation at the Virginia Polytechnic Institute and State University, 1983.
distance=c(162.50, 144.00, 147.50, 163.50, 192.00, 171.75, 162.00, 104.93, 105.67, 117.59, 140.25, 150.17, 165.17)
rstrength=c(170, 140, 180, 160, 170, 150, 170, 110, 120, 130, 120, 140, 160)
lstrength=c(170, 130, 170, 160, 150, 150, 180, 110, 110, 120, 140, 130, 150)
rflexibility=c(106, 92, 93, 103, 104, 101, 108, 86, 90, 85, 89, 92, 95)
lflexibility=c(106, 93, 78, 93, 93, 87, 106, 92, 86, 80, 83, 94, 95)
ostrength=c(240.57, 195.49, 152.99, 197.09, 266.56, 260.56, 219.25, 132.68, 130.24, 205.88, 153.92, 154.64, 240.57)
Create a model to predict the distance a punter can kick a football using forward selection and backwards elimination in R.
Use R’s ‘best subsets’ to find the best model.
Interpret your models in accordance with the context of the problem.
To help with the problem above, consider installing the R package below
install.packages("leaps")
Use the package in the following way:
library(leaps)
puntsubsets <-
regsubsets(response ~ pred1+pred2+pred3+...+predn,
nbest = 1, # 1 best model for each number of predictors
nvmax = NULL, # NULL for no limit on number of variables
force.in = NULL, force.out = NULL,
method = "exhaustive")
summary(puntsubsets) ## displays the results
summary(puntsubsets)$adjr2 ## displays adjusted R^2 for each model
summary(puntsubsets)$rsq ## displays R^2 for each model
summary(puntsubsets)$cp ## displays Mallow's CP for each model
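As an illustrative sketch (not a worked solution), the same template applied to the punting data above might look like the following; with only five candidate predictors, the default settings are sufficient.
library(leaps)
# Exhaustive search over all subsets of the five strength/flexibility predictors.
puntsubsets <- regsubsets(distance ~ rstrength + lstrength + rflexibility +
                          lflexibility + ostrength,
                          nbest = 1, method = "exhaustive")
summary(puntsubsets)                   # best model of each size
summary(puntsubsets)$adjr2             # adjusted R^2 for each of those models
which.max(summary(puntsubsets)$adjr2)  # size of the model with the largest adjusted R^2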
Beat Navy
Beat Air Force
Construct and interpret multiple regression diagnostic plots (e.g., \(e^* \, \textrm{vs.} \, x_i\), \(e^* \, \textrm{vs.} \, \hat{y}\), \(\hat{y} \, \textrm{vs.} \, y\), and a normal probability plot of the standardized residuals).
Use diagnostic plots to verify linear model assumptions: linearity, independence, and normally distributed residuals with a mean of 0 and constant variance (i.e., \(\epsilon \sim N(0,\sigma^2)\)).
Given R regression output and residual plots, identify if one or more of the six problems on pg. 546 exist.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| Devore 4.6 (p. 184-190), 13.1 | Checking Assumptions with Residual Plots | 4,5,7 | 7 |
Here is the book data for Devore Chapter 13, Problem 4.
x = c(2761,19764,25713,3980,12782,19008,
20782,19028,14397,9606,3905,25731)
y = c(1553,14999,32813,1667,8741,16526,
26770,16526,9868,6640,1220,30730)
Here is the book data for Devore Chapter 13, Problem 5.
x = c(0.167,0.333,0.500,0.667,0.833,1.000,1.167,1.333,1.500,
1.667,1.833,2.000,2.167,2.333,2.500,2.667,2.833,3.000,
3.167,3.333,3.500,3.667,3.833,4.000,4.167,4.333,4.500,
4.667,4.833,5.000,5.167,5.333,5.500)
y = c(0.5,1.25,1.5,2.75,3.5,4.75,5.75,5.6,7,8,8.25,9.5,10.5,
11,10.75,12.5,12.25,13.25,15.5,15,15.25,16.25,17.25,18,
18.25,18.15,20.25,19.5,20,20.5,20.6,20.5,19.8)
Here is the book data for Devore Chapter 13, Problem 7.
x = c(0.246,0.25,0.251,0.251,0.254,0.262,0.264,0.27,
0.272,0.277,0.281,0.289,0.29,0.292,0.293)
y = c(16,11,15,10.5,13.5,7.5,6.1,
1.7,3.6,0.7,0.9,1,0.7,3,3.1)
Let’s create the four diagnostic plots as described on page 544 of the textbook:
First, we need to calculate the standardized residuals in R using the rstandard() function:
grades = read.csv("GradeData.csv",header=TRUE)
wpr2 = grades[,2]
apsc = grades[,4]
sat.math = grades[,5]
model1 = lm(wpr2~apsc+sat.math)
standard.resid = rstandard(model1)
This command assigns all of the standardized residuals to the standard.resid object. We can now use this object to help us create the different plots. The first plot that we will create is a standardized residuals (\(e^*\)) versus \(x_i\) plot. We’ll do this for every predictor in our final model. In this example, we’ll look at just the sat.math predictor.
plot(sat.math, standard.resid, main = "Standardized Residuals vs. SAT Math Scores", xlab =
"SAT Math Scores", ylab = "Standardized Residuals", ylim = c(-3,3))
abline(h=c(0,-2,2), lty=3)
The abline() function is a nice way to add additional lines to any plot. For this plot, the abline() function is used to add three horizontal lines at -2, 0, and 2. The lty argument is for the line type. In this case, lty=3 corresponds to dashed lines.
For this plot, we are looking to see if the standardized residuals fall “equally” above and below zero (see the middle horizontal line). Essentially, we wish to see a random scatter of points (i.e., a shotgun blast) around zero. Additionally, we want to check for constant variance. Constant variance implies that the spread of the residuals throughout the plot is approximately the same. In the above graph, there are no patterns, and the majority of the residuals fall within two standard deviations of zero (i.e., they lie between -2 and 2).
Since \(\hat{y}\) refers to the predicted values, we can get these values from the linear model that we created above (model1):
names(model1)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
Using the names() function above, we can see that there is an attribute called fitted.values, and we will assign these values to the object yhat as follows:
yhat = model1$fitted.values
# Alternatively, we can use the fitted() function.
yhat =fitted(model1)
Now that we have all the fitted values, we can plot the standardized residuals against the fitted values:
plot(yhat, standard.resid, main = "Standard Residuals vs. Fitted Values", xlab =
"Predicted WPR2 Scores", ylab = "Standardized Residuals", ylim = c(-3,3))
abline(h=c(0,-2,2), lty=3)
In this plot, we are looking for the same random scatter around zero as we saw in the plot above. After looking at both of these residual plots, we can conclude that the residuals appear centered at zero with a constant variance. We do not make any conclusions about the normality of the residuals from these two plots.
Typically, there are two ways to determine if the standardized residuals follow a normal distribution. The first method consists of constructing a histogram of the standardized residuals. If the histogram looks like a bell-shaped curve, we can conclude that the standardized residuals follow a normal distribution. Below, we construct the histogram of standardized residuals.
hist(standard.resid, breaks = 10, freq=T, xlim = c(-4,2),
xlab = "Standardized Residuals",
main = "Histogram of Standardized Residuals", col = "red")
The histogram has a moderate bell-shaped look, but it exhibits negative skewness. We may want to explore the data points influencing the left tail. For this class, it isn’t necessary to have perfect looking diagnostic plots; however, you should be able to identify the potential issues with each plot.
Another way in which we can check normality of the standardized residuals is to create a QQ-Plot. See below:
qqnorm(standard.resid)
qqline(standard.resid)
If the residuals plotted on this QQ plot are generally aligned with the plotted QQ line, we can say the residuals follow a normal distribution. In this case, there appears to be a “heavy” left tail (negative skewness) and a “light” right tail (smaller tail than the normal distribution). We now have some concerns about the residuals following a normal distribution, and this assumption would have to be investigated further.
This plot will give us some indication as to how well we can predict the dependent variable. Ideally, we would like to see a 45-degree line when we plot \(\hat{y} \, \textrm{vs.} \, y\). Let’s see how well our predicted WPR2 scores (\(\hat{y} \, \text{or} \, \hat{\text{wpr2}}\)) model the observed WPR2 scores (\(\text{wpr2}\)).
plot(wpr2, yhat, main = "Predicted WPR2 Scores vs. Observed WPR2 Scores", xlab =
"Observed WPR2 Scores, wpr2", ylab = "Predicted WPR2 Scores, yhat")
abline(a=0, b=1, lty=6) #add the line w/ slope of 1
A perfect model would allow us to exactly predict all wpr2 scores, resulting in all points falling exactly on the 45-degree line in the plot above. In this case, it appears there must be other variables not considered by our model that critically affect wpr2 scores. There is, however, evidence to suggest that our model has at least some value in predicting wpr2 scores.
Recognize when a model violates the linearity assumption and potential issues making inference off of models that violate this assumption
Apply transformations to data to overcome nonlinearity.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| 13.2 (pg 550-555) | | 16, 19 | 15, 18 |
| | | Review 5 | |
If it happens that we have violated one of our 4 assumptions in linear regression, not all hope is lost.
As you read in 13.2, not all data used in regression is linear. Sometimes the data is intrinsically linear, meaning a transformation of the variables will produce a linear relationship, and we must transform the data in order to exploit its predictive power. Below is problem 16 from the book to show you how to transform the data in R.
First we load the data and create a linear regression:
loadtime = read.csv("ex13-16.csv",header=TRUE, sep = ",")
attach(loadtime)
lm.loadtime = lm(Time~Load)
summary(lm.loadtime)
##
## Call:
## lm(formula = Time ~ Load)
##
## Residuals:
## Min 1Q Median 3Q Max
## -121.49 -32.75 -19.15 8.97 433.47
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 985.03 501.50 1.964 0.0671 .
## Load -11.14 5.99 -1.859 0.0815 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 123.7 on 16 degrees of freedom
## Multiple R-squared: 0.1777, Adjusted R-squared: 0.1263
## F-statistic: 3.457 on 1 and 16 DF, p-value: 0.08148
From our initial impression, our predictor variable, Load, is not quite significant at alpha = .05 (p = 0.0815).
Let’s look at a plot of the data and the residuals:
par(mfrow=c(1, 2)) ## Changes the graphical parameters for side by side viewing.
plot(Time~Load, main="Data")
lm.loadtime.fitted = lm.loadtime$fitted.values
lm.loadtime.residuals = lm.loadtime$residuals
plot(lm.loadtime.residuals~lm.loadtime.fitted, main="Residuals Plot")
abline(h=0)
It is clear that the data is not linear in its current form. We must apply a transformation.
As the book suggests, we will apply a log transformation to the predictor variable.
transform.lm.loadtime = lm(log(Time)~Load)
summary(transform.lm.loadtime)
##
## Call:
## lm(formula = log(Time) ~ Load)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9788 -2.1590 0.4689 1.6371 3.4915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.1482 8.8328 4.772 0.000208 ***
## Load -0.4932 0.1055 -4.675 0.000253 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.179 on 16 degrees of freedom
## Multiple R-squared: 0.5774, Adjusted R-squared: 0.551
## F-statistic: 21.86 on 1 and 16 DF, p-value: 0.0002533
You can see from the summary that the fit has improved considerably: the Load predictor is now highly significant, and the adjusted R-squared has increased from about 0.13 to 0.55.
Let’s take a look at a plot of the transformed data and the residuals.
transform.lm.loadtime.fitted = transform.lm.loadtime$fitted.values
transform.lm.loadtime.residuals = transform.lm.loadtime$residuals
par(mfrow=c(1, 2))
plot(log(Time)~Load, main = "Transformed Data")
plot(transform.lm.loadtime.residuals~transform.lm.loadtime.fitted, main = "Transformed Model's Residuals")
abline(h=0)
par(mfrow=c(1, 1)) ## Resets our graphical parameters
Much better.
However, be careful interpreting this model. Because we transformed the response, the coefficients now describe the effect of load percent on log(Time) rather than on Time itself.
The estimates of our coefficients are below.
summary(transform.lm.loadtime)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.1481947 8.8327889 4.771788 0.0002079588
## Load -0.4932479 0.1054976 -4.675443 0.0002532861
To predict the time until failure at a load percent of 80, we begin by computing the predicted value of log(Time):
summary(transform.lm.loadtime)$coefficients[1]+summary(transform.lm.loadtime)$coefficients[2]*80
## [1] 2.688361
This answer yields the log time until failure. To get our prediction in the units we care about, we must exponentiate our result.
exp(summary(transform.lm.loadtime)$coefficients[1]+summary(transform.lm.loadtime)$coefficients[2]*80)
## [1] 14.70756
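Equivalently, you may find it convenient to let R's predict() function compute the fitted value on the log scale and then exponentiate the result yourself; a brief sketch:
# predict() returns the fitted log(Time) at Load = 80; exponentiate to get time until failure.
log.pred = predict(transform.lm.loadtime, newdata = data.frame(Load = 80))
exp(log.pred)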
All of the data transformations covered in 13.2 can be handled in the same way.
Recognize when a model violates the linearity assumption and potential issues making inference off of models that violate this assumption
Apply transformations to data to overcome nonlinearity.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| 13.2 (pg 550-555) | | 27 | 26 |
Use this class time to work on your course project.
Correctly identify count and/or frequency data.
Given count and/or frequency data, create a two-way contingency table.
Properly identify the hypothesis test for Homogeneity and create the test statistic.
Interpret the result of the Hypothesis Test for Homogeneity in terms of the real world problem.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| 14.3 | | 26, 29 | 25, 28 |
Below is an example of how to input and analyze a two-way contingency table in R.
Problem 29 from section 14.3
Input Data:
spirituality <- matrix(c(56,162,198,211,56,223,243,239,109,164,74,28),ncol=4,byrow=TRUE)
colnames(spirituality) <- c("Very","Moderate","Slightly", "Not at all")
rownames(spirituality) <- c("N.S.","S.S","G.D")
spirituality <- as.table(spirituality)
spirituality
## Very Moderate Slightly Not at all
## N.S. 56 162 198 211
## S.S 56 223 243 239
## G.D 109 164 74 28
Part A)
summary(spirituality)
## Number of cases in table: 1763
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 213.21, df = 6, p-value = 2.912e-43
As you learned in the reading, since the p-value is less than alpha = .05, we reject the null hypothesis of homogeneity. There is substantial evidence that the three types of individuals are not homogeneous with respect to their degree of spirituality.
Part B)
summary(spirituality[-3,])
## Number of cases in table: 1388
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 3.0912, df = 3, p-value = 0.3778
Now, since our p-value is greater than alpha = .05, we fail to reject the null hypothesis; there is no statistically significant evidence of a difference in spirituality between the N.S. and S.S groups.
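Note that summary() on a table object carries out a chi-squared test. The same tests can also be run with R's chisq.test() function, which should report the same test statistic, degrees of freedom, and p-value:
# Chi-squared test on the full table and on the table with the G.D row removed.
chisq.test(spirituality)
chisq.test(spirituality[-3, ])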
Create a two-way contingency table in R.
Properly interpret the Hypothesis Test for Homogeneity in R in terms of the real world problem.
| Read | Watch | Suggested Problems | WebAssign Problems |
|---|---|---|---|
| 14.3 | | 30, 41 | |
Review concepts throughout the semester. Come to class prepared to ask questions to clarify any misunderstandings.
R is a free software environment for statistical computing and graphics. R-Studio is a user-friendly interface used to run the R software package.
Please ensure that you download R before you download R-Studio.
In order to download R, visit The Comprehensive R Archive Network (CRAN) and click on Download R for Windows. In the first line at the top of the page, select install R for the first time. RIGHT CLICK on the top link that should say Download R 3.4.3 for Windows (or something similar depending on the current version of R). Select Save target as... and save it to a folder that you can easily find. After the file has downloaded, execute the .exe file to install R. Note: You may have to right-click and Run as administrator. You must install R before you install R-Studio.
In order to download R-Studio, visit the R-Studio Downloads page. Scroll down the page and RIGHT CLICK on the installer file for RStudio 1.0.383 - Windows Vista/7/8/10 (or something similar depending on the current version of R Studio). Select Save target as... and save the file to the same folder that you saved the previous R installation file. After the file downloads, execute the .exe file and install R-Studio. The first time that you attempt to open R-Studio you may have to search for the program through the search function under the Windows button (the Windows symbol in the lower left corner of your screen). Once you locate R-Studio, right click on the program and select Pin to taskbar so that the program is readily available for future use.
There are numerous R tutorials available on the Internet. Here are a few tutorials that you might want to explore after the minimal introductory tutorial provided in this course guide.
These tutorials will expose you to R’s capabilities beyond what you’ll see in MA256. In addition to the introductory tutorial here, this course guide will provide some R Help at the bottom of many lesson descriptions in order to guide you toward specific R functions that will help you complete the suggested problems.
The default R-Studio interface places a sub-window called the Console on the left or lower-left part of the screen. When you evaluate commands in the Console, recently used commands are available in a History tab or by using the up arrow to scroll through commands when the cursor is in the Console. Alternatively, you can create an R Script by selecting File - New File - R Script or pressing Ctrl+Shift+N. In an R script, you can write a line of code and execute it by pressing Ctrl+Enter. You can run the entire script by selecting Source or by pressing Ctrl+Shift+S. You should not expect to revisit at a later date anything you calculate in the Console. Preserving your work in an R Script will help you organize and document all of your work for the entire course. You can also easily adapt previous work to a new application when you have your work saved in scripts. You may wish to name your scripts with dates or lesson numbers to help you keep them organized.
R can do basic calculations like any calculator. Just enter the expression you want to evaluate in the Console and press Enter.
2+2
## [1] 4
Let’s accomplish the same task in a script. Create a new script, enter the expression you want to evaluate on the first line of your script, and press Ctrl+Enter. You’ll see the same result appear in your Console.
2+2
## [1] 4
You can look up the description and proper usage of any R function using the ? command in the Console. The Help panel in RStudio will show you the documentation associated with the function you entered.
?sum
You can also type help(sum) to reach the same documentation page.
R allows you to define and use variables using either the = or <- assignment operators.
x = 2
print(x)
## [1] 2
y <- 2
print(y)
## [1] 2
Our data often comes in the form of a vector or matrix. We can define these as follows:
my.vector = c(1,2,3,4,5,6)
my.vector
## [1] 1 2 3 4 5 6
my.matrix = matrix(my.vector,nrow=2,ncol=3)
my.matrix
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
Notice that filling a matrix with a vector of values uses those values to fill the columns first.
Indexing vectors and matrices starts with an index of 1 and always proceeds in a [row,column] order. Let’s retrieve the number 3 from both the vector and the matrix we defined. The number 3 is the third element of the vector, and it occupies the 1st row, 2nd column of the matrix.
my.vector[3]
## [1] 3
my.matrix[1,2]
## [1] 3
R has a wide range of built-in functions ready to accomplish whatever you’re trying to do. To use one, simply type the name of the function (case-sensitive) and pass it whatever arguments you want the function to evaluate. You can find out what the function is expecting using the R documentation. Let’s find out what the sum() function does.
?sum
The sum() function returns the sum of all the values present in its arguments. Let’s evaluate the sum of my.vector.
my.vector = c(1,2,3,4,5,6)
sum(my.vector)
## [1] 21
You can also write your own functions. Let’s write a function that returns the square root of the input (i.e., the sqrt() function).
my.sqrt = function(x){
result = sqrt(x)
return(result)
}
Now we can use our function to evaluate the square root of 4.
my.sqrt(4)
## [1] 2
We previously created a vector of the values 1-6. There are other ways to create this and related sequences.
1:6 # This syntax means "1 to 6" and is understood to have increments of 1.
## [1] 1 2 3 4 5 6
seq(1,6) # Create a sequence from "1" to "6" with an understood step size of 1.
## [1] 1 2 3 4 5 6
seq(2,12,by=2) # Create a sequence from "2" to "12" with a step size of "2."
## [1] 2 4 6 8 10 12
x=seq(0,2*pi,length.out=100) # Create a sequence of 100 values evenly spaced from 0 to 2*pi.
We’ll generate a basic X-Y plot here, but there are many more Internet resources that show more detailed plot construction. Try Quick-R Graphs to see a wider range of plot capabilities. For now, let’s plot the curve \(y=sin(x)\) for \(x \in [0,2\pi]\).
x=seq(0,2*pi,length.out=100)
y=sin(x)
plot(x,y)
Plot customization is something of an art, and you can find extensive Internet resources to help you produce the exact plot you want. Try ?plot to see the basic arguments to the plot() function. Let’s change our previous plot to a line (add the argument type="l") instead of a scatter plot, and let’s give it a title (add the argument main="sin(x)").
plot(x,y,type='l',main='sin(x)')
One of the first things to do as a new R user is to Set the Working Directory. This will make sure R always knows where to start looking for any data files you wish to reference. Create a folder on your desktop or someplace convenient to save all of the data you work with using R.
You can set a default working directory that R will remember for future use.
Go to the Tools tab in the toolbar at the top.
Select Global Options.
Select Browse.
It is highly recommended that you set the default working directory to the folder that you recently created.
You can also set the working directory any time you open R by performing the following steps:
Go to the Session tab in the toolbar at the top.
Select Set Working Directory.
Select Choose Directory (you can also use the shortcut keys CTRL+SHIFT+H to accomplish this step).
Locate the folder where the desired data file is stored.
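You can also check or set the working directory directly from the Console or a script. The folder path below is only a placeholder; replace it with the folder you created.
getwd()                                    # display the current working directory
setwd("C:/Users/yourname/Documents/MA256") # set the working directory (placeholder path)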
Data files come in many formats, but we will typically use .csv files that resemble spreadsheets you can open in Microsoft Excel. You’ll need to ensure you have specified the working directory before attempting to import a data set. To get started go to the course website’s page for data sets and download GradeData.csv. Be sure to store it in the folder you’ve specified as your working directory. Use the following command to import GradeData.csv into RStudio.
grades = read.csv("GradeData.csv",header=TRUE)
The data object grades now contains all of the data contained in the GradeData.csv spreadsheet.
Look in the RStudio Environment panel for an object called grades. The grades object is called a data frame and has a right-facing arrow next to it that you can click to get an expanded view of what is included in grades. You can also examine grades by examining its column names, first six rows, last six rows, and overall dimensions.
# Display the column names of the grades data frame.
names(grades)
## [1] "WPR1" "WPR2" "InstructorGrade" "APSC"
## [5] "SAT.Math"
# Display the first six rows of the grades data frame.
head(grades)
## WPR1 WPR2 InstructorGrade APSC SAT.Math
## 1 99 103 0.67 1.942 550
## 2 104 108 0.68 2.675 560
## 3 121 120 0.91 3.582 640
## 4 108 89 0.75 2.139 500
## 5 120 89 0.70 3.358 690
## 6 110 51 0.59 2.126 600
# Display the last six rows of the grades data frame.
tail(grades)
## WPR1 WPR2 InstructorGrade APSC SAT.Math
## 54 93 91 0.68 2.055 510
## 55 118 108 0.82 3.246 470
## 56 114 71 0.66 2.342 610
## 57 102 77 0.60 1.894 580
## 58 113 103 0.74 2.899 590
## 59 79 72 0.66 1.844 410
# Display the dimensions of the grades data frame. The Environment panel said grades consisted of 59 observations of 5 variables.
dim(grades)
## [1] 59 5
Now that we have saved the data into a data frame called grades, we may wish to index into the data frame to extract a specific portion of the data for analysis. Just like indexing with matrices, indexing into a data frame is done by using square brackets at the end of an object. First, let’s index into the tenth row and second column:
grades[10,2]
## [1] 109
What if we only wanted to look at WPR 1 scores for all students? This is the first column of the data frame, which is indexed as follows:
grades[ ,1]
## [1] 99 104 121 108 120 110 107 111 114 129 115 103 112 89 101 124 115
## [18] 122 107 117 111 110 115 128 103 95 123 103 103 94 106 122 117 125
## [35] 120 91 111 106 118 109 130 116 60 130 93 102 99 119 109 115 113
## [52] 121 90 93 118 114 102 113 79
The blank space before the comma denotes that we want ALL the rows while the “1” denotes that we want the first column. If we wanted only the first row we would type the following:
grades[1,]
## WPR1 WPR2 InstructorGrade APSC SAT.Math
## 1 99 103 0.67 1.942 550
There are several ways to yield a single column from a data frame. In addition to using grades[,1] to produce WPR1 scores, we can also use the $ operator.
grades$WPR1
## [1] 99 104 121 108 120 110 107 111 114 129 115 103 112 89 101 124 115
## [18] 122 107 117 111 110 115 128 103 95 123 103 103 94 106 122 117 125
## [35] 120 91 111 106 118 109 130 116 60 130 93 102 99 119 109 115 113
## [52] 121 90 93 118 114 102 113 79
We can drop the leading grades$ from this command by using the attach() command with our data frame.
attach(grades)
## The following object is masked _by_ .GlobalEnv:
##
## WPR1
WPR1
## [1] 99 104 121 108 120 110 107 111 114 129 115 103 112 89 101 124 115
## [18] 122 107 117 111 110 115 128 103 95 123 103 103 94 106 122 117 125
## [35] 120 91 111 106 118 109 130 116 60 130 93 102 99 119 109 115 113
## [52] 121 90 93 118 114 102 113 79
Now WPR1 can be accessed directly without going through grades. The attach() function brings the contents of grades to a higher level of visibility.
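When you are finished working with the data frame this way, you can reverse the attach() with detach(), which also avoids the kind of masking message shown above.
detach(grades) # remove grades from the search path; WPR1 must again be accessed as grades$WPR1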
We may also want to assign the data that we work with into new objects. For example, if we only wanted to work with WPR1 data, we could assign the WPR1 data into a new object. Let’s assign new objects for each column as follows:
wpr1 = grades[,1]
wpr2 = grades[,2]
instructor = grades[,3]
apsc = grades[,4]
sat = grades[,5]
We can combine objects in a data frame by using rbind() and cbind(). The function rbind() combines rows while cbind() combines columns. Let’s combine only the WPR data together using three separate techniques:
wprs = cbind(wpr1,wpr2)
head(wprs)
## wpr1 wpr2
## [1,] 99 103
## [2,] 104 108
## [3,] 121 120
## [4,] 108 89
## [5,] 120 89
## [6,] 110 51
wprs = grades[,1:2] # The colon takes columns 1 "through" 2.
head(wprs) # We get the same result as Technique 1.
## WPR1 WPR2
## 1 99 103
## 2 104 108
## 3 121 120
## 4 108 89
## 5 120 89
## 6 110 51
wprs = grades[,-3:-5] # Think of the negative sign as "All Except -3, -4, and -5."
head(wprs) # We get the same result as Techniques 1 and 2.
## WPR1 WPR2
## 1 99 103
## 2 104 108
## 3 121 120
## 4 108 89
## 5 120 89
## 6 110 51
Sometimes our data may contain outliers, and we want to perform analysis with and without the outliers. Let’s assume that the second observation (2nd row) is an outlier and we want to remove this observation. This is easily done as follows:
wprs.minus.outlier = wprs[-2,] # Take out 2nd row.
head(wprs.minus.outlier) # Notice the second row is gone.
## WPR1 WPR2
## 1 99 103
## 3 121 120
## 4 108 89
## 5 120 89
## 6 110 51
## 7 107 111
There are many instances when you want to create your own data frame if you are working on a small data set (such as a question in WebAssign). First let’s create two vectors of data, and then we’ll combine the vectors into a data frame. The first vector will be the integers 1 to 10:
x = seq(from=1, to=10, by=1)
Let’s create a second vector that is the reverse sequence of numbers from 10 to 1.
y = seq(from=10, to=1, by=-1)
Now, we can combine both vectors into a data frame:
my.data.frame = data.frame("X"=x, "Y" = y)
class(my.data.frame) # The class of the object is a data frame as desired.
## [1] "data.frame"
head(my.data.frame) #We have two columns of data with the proper column names.
## X Y
## 1 1 10
## 2 2 9
## 3 3 8
## 4 4 7
## 5 5 6
## 6 6 5
If we want to change the names of the columns in the data frame, we can do so using the colnames() function and assign the new column names as follows:
colnames(my.data.frame) = c("Column 1", "Column 2")
head(my.data.frame) # Column names have changed.
## Column 1 Column 2
## 1 1 10
## 2 2 9
## 3 3 8
## 4 4 7
## 5 5 6
## 6 6 5
Let’s go back to the original grades object that we created earlier. Maybe we want to explore the relationship between WPR1 grades and the instructor grades. We may have some intuition and believe that if the student has been doing their homework and maintains a high instructor average, then they should have a high WPR1 score. In other words, we think that there is a positive relationship between WPR1 scores and instructor grades.
The first step in this exploratory analysis is to create a scatter plot of the two variables:
plot(grades[,3], grades[,1], xlab = "Instructor Grades", ylab =
"WPR1 Scores", main = "A Scatter Plot of WPR1 Scores against Instructor Grades")
From the plot above, there appears to be a positive relationship between WPR1 scores and the instructor grade.
We can look at a few quantitative measures by using the summary() function. Let’s take the summary measures from the 1st and 3rd column since those columns correspond to the WPR1 and instructor grade data:
summary(grades[,c(1,3)])
## WPR1 InstructorGrade
## Min. : 60.0 Min. :0.4900
## 1st Qu.:103.0 1st Qu.:0.6750
## Median :111.0 Median :0.7600
## Mean :109.4 Mean :0.7607
## 3rd Qu.:118.0 3rd Qu.:0.8450
## Max. :130.0 Max. :0.9700
If we want to compute the variance, standard deviation, or range on columns in a data frame, we can use the apply() function as follows:
apply(grades[,c(1,3)], 2, sd)
## WPR1 InstructorGrade
## 12.9615039 0.1093218
This applies the standard deviation function (sd) to columns 1 and 3 (columns are denoted with a 2 and rows with a 1 in the second argument of the function). Type ?apply in the console for more information. In the following lines of code, we will compute the range, variance, and mean, respectively:
apply(grades[,c(1,3)],2, range) # Compute the ranges of the 1st and 3rd column.
## WPR1 InstructorGrade
## [1,] 60 0.49
## [2,] 130 0.97
apply(grades[,c(1,3)],2, var) # Compute the variances of the 1st and 3rd column.
## WPR1 InstructorGrade
## 168.00058445 0.01195126
apply(grades[,c(1,3)],2, mean) # Compute the means - compare to summary() output.
## WPR1 InstructorGrade
## 109.389831 0.760678
You can also compute each one of these values directly by using the mean() or sd() function for the mean and standard deviation:
mean(grades[,1]) # Compute mean of column 1.
## [1] 109.3898
sd(grades[,1]) # Compute the standard deviation of column 1.
## [1] 12.9615
Return to LSN 8.
Characteristic 1: The pmf must be greater than or equal to zero for all x.
\[p(x) \geq 0 \quad \forall x\]
Characteristic 2: The sum of probabilities of x, \(p(x)\), over all possible values of \(x\) must be one.
\[\sum_x p(x) = 1\]
Characteristic 3: For a discrete random variable \(X\), the probability that \(X\) is equal to a specific value, \(c\), is:
\[P(X=c) =p(c)\]
Characteristic 4: Given the pmf \(p(x)\) for random variable \(X\), we define the cumulative distribution function (CDF) \(F(x)\) as follows:
\[F(x) = P(X\leq x)= \sum_{y:y\leq x}p(y)\]
#### Given a discrete random variable \(X\) and its associated pmf or CDF, find the probability that the random variable \(X\) is between \(a\) and \(b\) (inclusive) for \(a < b\).
Characteristic 5: Using the pmf:
\[P(a \leq X \leq b ) = \sum_{x : a \leq x \leq b}p(x)\]
Characteristic 6: Using the CDF:
\[P(a \leq X \leq b)= F(b) - F(a-)\]
where “\(a-\)” represents the largest possible \(X\) value that is strictly less than \(a \, \left( \textrm{i.e.,} \, F(a-) = \textrm{lim}_{x\rightarrow a-}F(x)\right)\).
#### Calculate the expected value (mean or average), \(\mu_X\), and the variance, \(\sigma_X^2\), of a discrete random variable \(X\).
Characteristic 7: The expected value of X:
\[E(X) = \mu_X = \sum_x x \cdot p(x)\]
Characteristic 8: The variance of X:
\[V(X) = \sigma_X^2 = \sum_x (x-\mu_x)^2 \cdot p(x)\]
Alternatively, the following shortcut formula may prove easier to calculate:
\[V(X) = \sigma_X^2 = E[X^2] - (E[X])^2\]
Characteristic 9: The (100p)th percentile of the discrete random variable \(X\) is the minimum value of \(X\) such that \(F(x)\geq p\).
\[x^* = \textrm{min}\{ x|F(x)\geq p \}\]
| Distribution | PMF \(\quad p(x)\) | E[X] | V[X] |
|---|---|---|---|
| Binomial | \(\binom{n}{x}p^x(1-p)^{n-x}\) | \(np\) | \(np(1-p)\) |
| Poisson | \(e^{-\mu}\mu^x/x!\) | \(\mu\) | \(\mu\) |
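R provides the pmf and CDF of these named distributions directly through its d and p function families; a short sketch with illustrative parameter values:
# Binomial with n = 10 and p = 0.3: pmf at x = 4, and P(2 <= X <= 5) = F(5) - F(1).
dbinom(4, size = 10, prob = 0.3)
pbinom(5, size = 10, prob = 0.3) - pbinom(1, size = 10, prob = 0.3)
# Poisson with mu = 3: pmf at x = 2, and the CDF F(4).
dpois(2, lambda = 3)
ppois(4, lambda = 3)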
Return to LSN 11.
The pdf must be greater than or equal to zero for all x.
\[f(x) \geq 0 \quad \forall x\]
The integral of the pdf, \(f(x)\), over all possible values of \(x\) must be one.
\[\int_{-\infty}^{\infty} f(x)dx = 1\]
For a continuous random variable \(X\), the probability that \(X\) equals any specific value, \(c\), is zero.
\[P(X=c) =0\]
Given the pdf \(f(x)\) for random variable \(X\), we define the cumulative distribution function (CDF) \(F(x)\) as follows:
\[F(x) = P(X\leq x)= \int_{-\infty}^{x} f(y)dy\]
#### Given a continuous random variable \(X\) and its associated pdf or CDF, find the probability that the random variable \(X\) is between \(a\) and \(b\) for \(a < b\).
\[P(a \leq X \leq b ) = \int_{a}^{b} f(x)dx\]
\[P(a \leq X \leq b)= F(b) - F(a)\]
#### Calculate the expected value (mean or average), \(\mu_X\), and the variance, \(\sigma_X^2\), of a continuous random variable \(X\).
\[E(X) = \mu_X = \int_{-\infty}^{\infty} x \cdot f(x)dx\]
\[V(X) = \sigma_X^2 = \int_{-\infty}^{\infty} (x-\mu_X)^2 \cdot f(x)dx\]
Alternatively, the following shortcut formula may prove easier to calculate:
\[V(X) = \sigma_X^2 = E[X^2] - (E[X])^2\]
\[x^* = F^{-1}(p)\]
where \(F^{-1}\) denotes the inverse CDF of \(X\).
| Distribution | PDF \(\quad f(x)\) | E[X] | V[X] |
|---|---|---|---|
| Uniform | \(\frac{1}{B-A}\) | \(\frac{A+B}{2}\) | \(\frac{1}{12}(B-A)^2\) |
| Normal | \(\frac{1}{\sqrt{2\pi}\sigma}e^{-(x-\mu)^2/2\sigma^2}\) | \(\mu\) | \(\sigma^2\) |
| Exponential | \(\lambda e^{-\lambda x}\) | \(1/\lambda\) | \(1/\lambda^2\) |
| Gamma | \(\frac{1}{\beta^{\alpha}\Gamma(\alpha)}x^{\alpha-1}e^{-x/\beta}\) | \(\alpha\beta\) | \(\alpha\beta^2\) |
| t | \(\frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\Gamma\left(\frac{\nu}{2}\right)}\left(1+\frac{x^2}{\nu }\right)^{-\frac{\nu+1}{2}}\) | \(0\) | \(\frac{\nu}{\nu-2}\) for \(\nu>2\), \(\infty\) for \(1 < \nu \leq 2\) |
| Chi-Squared | \(\frac{1}{2^{\frac{\nu}{2}}\Gamma\left(\frac{\nu}{2}\right)}x^{\frac{\nu}{2}-1}e^{-\frac{x}{2}}\) | \(\nu\) | \(2\nu\) |
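As with the discrete case, R provides the pdf, CDF, and inverse CDF of these distributions; a brief sketch with illustrative parameter values:
# Normal with mu = 100 and sigma = 15: P(90 <= X <= 120) = F(120) - F(90).
pnorm(120, mean = 100, sd = 15) - pnorm(90, mean = 100, sd = 15)
# Exponential with lambda = 1/10: the 65th percentile via the inverse CDF.
qexp(0.65, rate = 1/10) # approximately 10.498
# Uniform on [2, 8]: the pdf height is 1/(B - A) = 1/6.
dunif(5, min = 2, max = 8)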
(Return to LSN 18.) (Go to Monte Carlo worksheet.)
Monte Carlo simulation is an example of a statistical simulation method that utilizes sequences of random numbers to simulate real-world occurrences. Although Monte Carlo Simulation has its roots in the development of nuclear weapons, it is now regularly used in many diverse applications ranging from the simulation of complex physical phenomena such as radiation transport in the earth’s atmosphere to the simulation of something as simple as a game of Bingo.
The goal of the Monte Carlo method is to simulate the physical (or mathematical) system of interest by randomly sampling from the system’s CDF(s). We can more clearly define Monte Carlo simulation as a scheme of employing \(\textrm{Uniform}(0,1)\) random numbers to help solve certain problems that are generally static in nature. Monte Carlo simulation is now widely used to solve problems that are often not easily solved analytically. Simulations rely on a large number of trials, and we utilize the collective results to infer a solution.
Suppose we know that a particular random variable has an exponential distribution with \(\lambda=1/10\). We would like to generate random numbers that follow this distribution in order to simulate this random variable. If we know that most mathematics software packages (Excel, Mathematica, and R included) can produce \(\textrm{Uniform}(0,1)\) numbers, how can we use this to produce exponentially distributed (\(\lambda=1/10\)) random numbers? In order to answer this question, we must review inverse functions.
A function \(f(x)\) has an inverse if you can find a function, \(f^{-1}(x)\), such that \(f^{-1}(f(x))=x\). By substituting \(y=f(x)\) into \(f^{-1}(f(x))=x\), we see that \(f^{-1}(y)=x\). If we plot the function \(y=f(x)\), we can see the relationship between the numbers on the \(y\)-axis and the numbers on the \(x\)-axis. We’ll now demonstrate this concept with an example.
Example: Find the inverse for the function \(f(x)=x^3\).
\[y=x^3\]
\[x=y^{1/3}\]
\[f^{-1}(y)=y^{1/3}\]
\[f^{-1}(x^3)=(x^3)^{1/3}=x^{3/3}=x\]
What is \(f(2)\)? In order to find an approximate value from the graph, we would start at 2 on the \(x\)-axis and move up to the curve. From there, we move to the left until we reach the \(y\)-axis. This gives the functional value for \(x = 2\). Follow the arrows in Figure 2 to see that \(f(2) = 8\).
Now that we have reviewed inverse functions, let’s look at a CDF and review some of its properties. Remember that a CDF is the cumulative distribution function, and CDFs always have the same range: \(\left[0,1\right]\). As you can see from Figure 4, the \(\textrm{Exp}(\lambda=1/10)\) CDF starts at \(0\), and as \(x\rightarrow \infty\), the function approaches \(1\). If we wanted to produce the numbers on the \(x\)-axis from the numbers on the \(y\)-axis, then we would want to find the inverse of the CDF. For example, if you randomly chose \(0.65\) from the \(\textrm{Uniform}(0,1)\) distribution, to what exponentially distributed value of \(x\) would that correspond? If you follow the arrows on Figure 4, then you should see that given a \(\textrm{Uniform}(0,1)\) number (y-value), we can find the \(x\)-value by substituting the \(y\)-value into the inverse CDF function. For this example, \(F^{-1}(0.65) \approx 10\).
Let’s see if we can actually find the functional form of the inverse. If we would like to sample from an \(\textrm{Exp}(\lambda=1/10)\) distribution, then we use the known exponential CDF: \(F(x)=1-e^{\frac{-x}{10}} \, \forall x\geq 0\). Instead of setting the function equal to \(y\), let’s set it equal to \(p\) and solve for \(x\).
\[p=1-e^{\frac{-x}{10}} \] \[ 1-p=e^{\frac{-x}{10}} \] \[ \textrm{ln}(1-p)=-\frac{x}{10} \] \[ x=-10 \, \textrm{ln}(1-p)\]
We can now take \(\textrm{Uniform}(0,1)\) numbers and substitute them into the equation above (for \(p\)) in order to produce \(\textrm{Exp}(\lambda=1/10)\) numbers. For the example above (where \(p = 0.65\)), we get \(x=-10\,\textrm{ln}(1-0.65)=10.4982\).
Why does a \(\textrm{Uniform}(0,1)\) number produce these \(\textrm{Exp}(\lambda=1/10)\) numbers? Let’s look closely at the graph of the CDF again. What is the probability that a random \(\textrm{Uniform}(0,1)\) number will be less than \(0.65\)? The probability is exactly \(0.65\) according to the properties of the \(\textrm{Uniform}(0,1)\) distribution. A \(y\)-value of \(0.65\) corresponds to an \(x\)-value of \(10.4982\), which by the properties of the CDF denotes the \(x\)-value where \(P(X\leq x)=.65\). This relationship holds no matter where you enter on the \(y\)-axis and serves as the foundation for turning \(\textrm{Uniform}(0,1)\) random variates into random variates from another distribution.
Let’s show how Monte Carlo simulation works in R. First, we’ll create 1,000 \(\textrm{Uniform}(0,1)\) random variates.
my.sample = runif(1000)
Now we’ll transform the \(\textrm{Uniform}(0,1)\) random variates into \(\textrm{Exp}(\lambda=1/10)\) random variates using our derived expression.
exp.sample = -10*log(1-my.sample)
Now we’ll create a histogram of our sample and scale it to match the \(\textrm{Exp}(\lambda=1/10)\) pdf.
hist(exp.sample,freq=FALSE,
main="Histogram of Exp(1/10) Random Variates with Underlying PDF",
xlab="x")
curve(dexp(x,rate=1/10),from=0,to=60,n=200,add=TRUE)
This example shows that we can simulate random variates from a distribution using the CDF inverse transform in such a way that the resulting sample adheres closely to the underlying probability density function. Fortunately R has built in the inverse CDF transform approach to sample from many common distributions. We can produce our exponentially distributed random variates directly using the rexp() function without needing \(\textrm{Uniform}(0,1)\) random variates as our starting point.
exp.sample = rexp(1000,rate=1/10)
We can again show that this produced the type of sample we think it did by constructing a histogram.
hist(exp.sample,freq=FALSE,
main="Histogram of Exp(1/10) Random Variates with Underlying PDF",
xlab="x")
curve(dexp(x,rate=1/10),from=0,to=60,n=200,add=TRUE)
Check out this Simulation in R video to see some more advanced techniques than we’ll show you in MA256.
(Return to LSN 19) (Return to Monte Carlo supplemental material.)
Comments
When you write a script, you will benefit later from placing relevant comments in your code to explain what a particular line of code was meant to accomplish. That can be done using the # symbol. You can manually type # anywhere in a line of code to comment out what comes after it, or you can press Ctrl+Shift+C to comment the current line or an entire block of selected code.
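A short, hypothetical example of a commented script using the grades data from earlier:
# Import the course data set and summarize the WPR1 scores.
grades = read.csv("GradeData.csv", header = TRUE) # read the data into a data frame
wpr1 = grades[,1]                                 # extract the WPR1 column
mean(wpr1)                                        # average WPR1 score
sd(wpr1)                                          # standard deviation of WPR1 scores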