Objectives

Understand linear regression in R and verify linear regression assumptions
Plotting time series to analyze trends
Use R for sampling and simulation
Calculate theory-based probabilities for normal and binomial distributions

Collaboration Policy

In lab you are encouraged to work in pairs or small groups to discuss the concepts on the assignments. However, DO NOT copy each other’s work as this constitutes cheating. The work you submit must be entirely your own. If you have a question in lab, feel free to reach out to other groups or talk to your TA if you get stuck.

Linear Regression in R

You learned about linear regression in lecture. Now, we will learn how to code it in lab. We will be using the following new commands:

lm() will create an output containing the equation of the regression line, correlation coefficient, p-values, residuals, and much more. We typically create an object to store the results of the lm() function (see example below).

abline() is a command to generate a regression line on a plot of the form y = a + bx. It requires arguments for a and b, or linear model results (see example below).

# load the data
NCbirths <- read.csv("births.csv")

# Run the linear model of weight against Mom's age and print a summary
linear_model <- lm(NCbirths$weight ~ NCbirths$Mage)
summary(linear_model)

## 
## Call:
## lm(formula = NCbirths$weight ~ NCbirths$Mage)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -103.574   -9.381    1.221   12.823   61.423 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   105.19108    2.08824  50.373  < 2e-16 ***
## NCbirths$Mage   0.39945    0.07497   5.328  1.1e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.27 on 1990 degrees of freedom
## Multiple R-squared:  0.01407,    Adjusted R-squared:  0.01357 
## F-statistic: 28.39 on 1 and 1990 DF,  p-value: 1.104e-07

# Create a plot of the data, and draw the regression line using abline
plot(NCbirths$weight ~ NCbirths$Mage, 
     xlab = "Mom Age", ylab = "Weight",
     main = "Regression of Weight on Mother's Age")
abline(linear_model, col = "red", lwd = 2)

# Create a plot of the residuals to assess regression assumptions
plot(linear_model$residuals ~ NCbirths$Mage, 
     main = "Residual plot")
# Add a line of y = 0 to help visualize the residuals
abline(a = 0, b = 0, col = "red", lwd = 2)

Try running an example of the code on the following page by loading and using NCbirths below. We will discuss what each line does in section.

Exercise 1

We will be working with some soil mining data and are interested in looking at some of the relationships between metal concentrations (in ppm). Download the data ‘soil_complete.txt’ from BruinLearn and read it into R. When you read in the data, name your object “soil”.

soil <- read.table("soil_complete.txt", header = TRUE)

Run a linear regression of lead against zinc concentrations (treat lead as the response variable). Use the summary function just like in the example above and paste the output into your report.

lm1 <- lm(lead ~ zinc, data=soil)
summary(lm1)

## 
## Call:
## lm(formula = lead ~ zinc, data = soil)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -80.455 -12.570  -1.834  15.946 101.651 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 16.582928   4.410443    3.76 0.000244 ***
## zinc         0.291335   0.007415   39.29  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33.37 on 149 degrees of freedom
## Multiple R-squared:  0.912,  Adjusted R-squared:  0.9114 
## F-statistic:  1544 on 1 and 149 DF,  p-value: < 2.2e-16

Plot the lead and zinc data, then use the abline() function to overlay the regression line onto the data.

plot(soil$zinc, soil$lead)
abline(lm1, col = "red")

In a separate plot, plot the residuals of the regression from (a), and again use the abline() function to overlay a horizontal line.

plot(lm1, which=1)
abline(a = 0, b = 0, col = "lightblue", lwd = 2)

hist(lm1$residuals)

plot(lm1$residuals ~ soil$zinc, main = "residual plt")
abline(a = 0, b = 0, col = "red", lwd = 2)

Parts d-h can be answered by hand, using a calculator, or any R functions of your choice.

Based on the output from (a), what is the equation of the linear regression line?

lm1$coefficients

## (Intercept)        zinc 
##  16.5829280   0.2913355

Imagine we have a new data point. We find out that the zinc concentration at this point is 1,000 ppm. What would we expect the lead concentration at this point to be?

lm1$coefficients %*% c(1, 1000) # dot product

##          [,1]
## [1,] 307.9184

sum(lm1$coefficients * c(1, 1000)) # sm

## [1] 307.9184

# More general way 
new_data <- data.frame(zinc = c(1000))
predict(lm1, newdata = new_data)

##        1 
## 307.9184

Imagine two locations (A and B) for which we only observe zinc concentrations. Location A contains 100ppm higher concentration of zinc than location B. How much higher would we expect the lead concentration to be in location A compared to location B?

lm1$coefficients %*% c(1, 100)

##          [,1]
## [1,] 45.71648

hist(soil$lead)

lead_colors <- heat.colors(100)
lead_scaled <- cut(
  soil$lead,
  breaks = 500,
  labels = FALSE
)

plot(
  soil$x,
  soil$y,
  col = lead_colors[lead_scaled],
  pch = 19,
  asp = 1,
  xlab = "X Coordinate",
  ylab = "Y Coordinate",
  main = "Lead Concentration Map"
)

Report the R-squared value and explain in words what it means in context.

summary(lm1)$r.squared |> round(2)

## [1] 0.91

lm1

## 
## Call:
## lm(formula = lead ~ zinc, data = soil)
## 
## Coefficients:
## (Intercept)         zinc  
##     16.5829       0.2913

91% of the variation in lead can be described via zinc concentration

Comment on whether you believe the three main assumptions (linearity, symmetry, equal variance) for linear regression are met for this data. List any concerns you have.

Linearity :

plot(soil$zinc, soil$lead)
abline(lm1, col = "red")

Symmetric / Normal Dist of Errors :

plot(lm1, which = 2)

hist(lm1$residuals)

Equal variance

plot(lm1, which =1)

As we can see heteroscedasticity is present. As X increases (zinc) – our amt of variance increases. Unequal variance – cone shaped.

Note from TM: We only made the assumption of linearity, i.e., a linear association between the two variables x and y, which is all that is actually needed for linear regression. For your information, symmetry refers to the points scattering symmetrically around the regression line. Equal variance refers to the points having equal variability around the regression line for all values of x. Neither of them are necessary conditions for a correct and interpretable linear model, but they are “nice to have”. Feel free to discuss them with your TA and comment on them in the report.

Exercise 2

Our next data set is what is known as a time series, or data in time. It contains the measurements via satellite imagery of sea ice extent in millions of square kilometers for each month from 1988 to 2011. Please download the “sea_ice” data from BruinLearn and read it into R. If you have your working directory properly set, you can use the line below:

ice <- read.csv("sea_ice.csv", header = TRUE)

Note that currently R does not know what class the Date column is. We need to convert the Date column into class “date” using the following line:

ice$Date <- as.Date(ice$Date, "%m/%d/%Y")

Produce a summary of a linear model of sea ice extent against time.

lm2 <- lm(Extent~Date, dat=ice)
lm2

## 
## Call:
## lm(formula = Extent ~ Date, data = ice)
## 
## Coefficients:
## (Intercept)         Date  
##   1.011e+01    1.438e-04

Plot the data and overlay the regression line. Does there seem to be a trend in this data?

plot(ice$Date, ice$Extent)
abline(lm2, col = "red")

# notice seasonal trends
plot(ice$Date, ice$Extent, type = "l")

Plot the residuals of the model over time and include a horizontal line. What assumption(s) about the linear model should we be concerned about?

plot(lm2$residuals ~ ice$Date, main = "residual plt")
abline(a = 0, b = 0, col = "red", lwd = 2)

alternative method :

plot(lm2, which = 1)
abline(a = 0, b = 0, col = "red", lwd = 2)

Sampling and simulating in R

We can use the sample() function to sample data from a vector, and the replicate() function to simulate random events many times over. Note that the computer is not truly random, but it is close enough for our purposes to consider it random. We also rely on the set.seed() function to make our “random” results reproducible. Try the following examples in R and see the ## comments for descriptions.

## Set seed for reproducibility
set.seed(1335)
## Create a names vector
names = c("Leslie", "Ron", "Andy", "April", "Tom", "Ben", "Jerry")
## Sample 3 of the names with replacement
sample(names, 3, replace = TRUE)

## [1] "Ron"   "April" "April"

## Sample 3 of the names without replacement
sample(names, 3, replace = FALSE)

## [1] "Tom"    "Leslie" "Ben"

## Create a vector from 1 to 10
numbers = 1:10
## Simulate sampling 3 different numbers at random, 10 times
replicate(10, sample(numbers, 3, replace = FALSE))

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]   10    5    8    7    5    3    4    7    9     3
## [2,]    1    8    5    2    9    9    5   10    8     5
## [3,]    5    2    2    5   10    5    1    6    4     9

## One more time, but save as an object
rand_draws = replicate(10, sample(numbers, 3, replace = FALSE))
## Perform analysis on the random draws
colMeans(rand_draws) ## Takes the mean of each sample

##  [1] 8.333333 5.000000 2.000000 5.666667 5.666667 4.333333 5.666667 6.000000
##  [9] 4.000000 6.000000

colSums(rand_draws) ## Takes the sum of each sample

##  [1] 25 15  6 17 17 13 17 18 12 18

Exercise 3

Personal crabs simulation)

die_1 <- rep(1:6, each = 6)
die_2 <- rep(1:6, times = 6)
sums <- die_1 + die_2

result <- rep("Continue", length(sums))

result[sums %in% c(7, 11)] <- "Double"
result[sums %in% c(2, 3, 12)] <- "Lose"

craps_outcomes <- data.frame(
  die_1 = die_1,
  die_2 = die_2,
  sum = sums,
  result = result
)

craps_outcomes

hist(craps_outcomes$sum,
     breaks = seq(1.5, 12.5, by = 1))

One of Adam’s favorite casino games is called “Craps”. In the first round of this game, two fair 6-sided dice are rolled. If the sum of the two dice equal 7 or 11, Adam doubles his money! If a 2, 3, or 12 are rolled, Adam loses all the money he bets.

Based on your lecture notes, what is the chance Adam will double his money in the first round of the game? What is the chance Adam will lose his money in the first round of the game?

Well the first thing to do is enumerate – there are 36 total options

sums

##  [1]  2  3  4  5  6  7  3  4  5  6  7  8  4  5  6  7  8  9  5  6  7  8  9 10  6
## [26]  7  8  9 10 11  7  8  9 10 11 12

of all the cases :

we are interested when he doubles his money so : x={7, 11}

library(dplyr)
craps_outcomes |> group_by(result) |> count()

there are 8 times we might double –
- Therefore : P(sum = 7 or 11) = 8/36

(8/36) |> round(2)

## [1] 0.22

so about a 22% pct chance

Let’s now approximate the results in (a) by simulation. First, set the seed to 123. Then, create an object that contains 5,000 sample first round Craps outcomes (simulate the sum of 2 dice, 5,000 times). Use the appropriate function to visualize the distribution of these outcomes (hint: are the outcomes discrete or continuous?).

This is a discrete dist – hence bar plot of 1-12 events

set.seed(123)
die_1_sim <- sample(1:6, size = 5000, replace = TRUE) # eq-prob
die_2_sim <- sample(1:6, size = 5000, replace = TRUE) # eq-prob
craps_sim <- die_1_sim + die_2_sim

sum_table_sim <- table(craps_sim)

barplot(sum_table_sim,
        main = "Simulated Distribution of Dice Sums (n = 5000)",
        xlab = "Sum of Two Dice",
        ylab = "Frequency")

Imagine these sample results happened in real life for Adam. Using R functions of your choice, calculate the percentage of time Adam doubled his money. Calculate the percentage of time Adam lost his money.

pct <- sum(craps_sim == 7 | craps_sim == 11)/length(craps_sim) 
pct |> round(2)

## [1] 0.22

recall our theoretical solution is just about the same

Adam winning money and Adam losing money can both be considered events. Are these two events independent, disjoint, or both? Explain why.

these events are disjoint – the reason for this is that Adam can either : Win, Loose or try again. He cannot do both at the same time. So the prob of doing both is 0.
If these events were indendent then we would suspect \(P(W \cap L)=P(W)P(L)\)

Quickly mathematically verify by calculator if those events are independent using part (a) and what you learned in lecture. Show work.

pct_win <- sum(craps_sim == 7 | craps_sim == 11)/length(craps_sim) 


pct_loose <- sum(craps_sim == 2 | craps_sim == 3 | craps_sim == 12)/length(craps_sim) 

0==pct_win*pct_loose

## [1] FALSE

Calculating normal and binomial distribution probabilities The functions pnorm(), dbinom(), and pbinom() allow us to calculate theoretical probabilities using certain assumptions about the distribution. To use the pnorm() function, R assumes a normal distribution with a mean, sd, and some observation we want to find a probability for. The binom functions assume a binomial distribution with sample size n, the probability of success p, and some observation. Try running the below examples.

## Coin flipping scenario. Probability of getting 4 heads when 7 coins are tossed
dbinom(4, size = 7, prob = 0.5)

## [1] 0.2734375

## Probability of getting 4 heads or less when 7 coins are tossed
pbinom(4, size = 7, prob = 0.5)

## [1] 0.7734375

## Probability of getting a number less than 4 from a normal distribution with mean 2, sd of 7
pnorm(4, mean = 2, sd = 0.7)

## [1] 0.9978626

Feel free to use the help screen, or Google, for more examples.

Exercise 4

Tenorio National Park in Costa Rica has a roughly consistent year-round climate. On any given day, we assume there is a 40% chance of heavy rain.

We are interested in forecasting the number of heavy rain days during 2019. Write down n and p if we are to use the binomial distribution for this forecast.

n=365
p=.4

Calculate the mean and standard deviation of heavy rain days in 2019 using the binomial model. Use R or a calculator.

n <- 365
p <- 0.40

# mean
mean <- n * p
mean

## [1] 146

# sd
sd <- sqrt(n * p * (1 - p))
sd

## [1] 9.359487

Find the probability that the park will experience exactly 145 days of heavy rain in 2019.

# dbinom since we want to know exactly 145
dbinom(145, size = 365, prob = 0.40)

## [1] 0.04239996

Find the probability that the park will see between 125 and 175 days of heavy rain in 2019.

below_175 <- pbinom(175, size = 365, prob = 0.40) 
below_124 <- pbinom(124, size = 365, prob = 0.40)

below_175-below_124

## [1] 0.9888137

The yearly amount of rainfall in the park is normally distributed with mean 200 inches and standard deviation of 20 inches. Find the probability that the park will experience more than 230 inches of rain in 2019.

1 - pnorm(230, mean = 200, sd = 20)

## [1] 0.0668072

STATS 10 Lab 3

your name

today’s date