In lab you are encouraged to work in pairs or small groups to discuss the concepts on the assignments. However, DO NOT copy each other’s work as this constitutes cheating. The work you submit must be entirely your own. If you have a question in lab, feel free to reach out to other groups or talk to your TA if you get stuck.
You learned about linear regression in lecture. Now, we will learn how to code it in lab. We will be using the following new commands:
lm() will create an output containing the equation of the regression line, correlation coefficient, p-values, residuals, and much more. We typically create an object to store the results of the lm() function (see example below).
abline() is a command to generate a regression line on a plot of the form y = a + bx. It requires arguments for a and b, or linear model results (see example below).
# load the data
NCbirths <- read.csv("births.csv")
# Run the linear model of weight against Mom's age and print a summary
linear_model <- lm(NCbirths$weight ~ NCbirths$Mage)
summary(linear_model)
##
## Call:
## lm(formula = NCbirths$weight ~ NCbirths$Mage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -103.574 -9.381 1.221 12.823 61.423
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 105.19108 2.08824 50.373 < 2e-16 ***
## NCbirths$Mage 0.39945 0.07497 5.328 1.1e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.27 on 1990 degrees of freedom
## Multiple R-squared: 0.01407, Adjusted R-squared: 0.01357
## F-statistic: 28.39 on 1 and 1990 DF, p-value: 1.104e-07
# Create a plot of the data, and draw the regression line using abline
plot(NCbirths$weight ~ NCbirths$Mage,
xlab = "Mom Age", ylab = "Weight",
main = "Regression of Weight on Mother's Age")
abline(linear_model, col = "red", lwd = 2)
# Create a plot of the residuals to assess regression assumptions
plot(linear_model$residuals ~ NCbirths$Mage,
main = "Residual plot")
# Add a line of y = 0 to help visualize the residuals
abline(a = 0, b = 0, col = "red", lwd = 2)
Try running an example of the code on the following page by loading and using NCbirths below. We will discuss what each line does in section.
We will be working with some soil mining data and are interested in looking at some of the relationships between metal concentrations (in ppm). Download the data ‘soil_complete.txt’ from BruinLearn and read it into R. When you read in the data, name your object “soil”.
soil <- read.table("soil_complete.txt", header = TRUE)
lm1 <- lm(lead ~ zinc, data=soil)
summary(lm1)
##
## Call:
## lm(formula = lead ~ zinc, data = soil)
##
## Residuals:
## Min 1Q Median 3Q Max
## -80.455 -12.570 -1.834 15.946 101.651
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.582928 4.410443 3.76 0.000244 ***
## zinc 0.291335 0.007415 39.29 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.37 on 149 degrees of freedom
## Multiple R-squared: 0.912, Adjusted R-squared: 0.9114
## F-statistic: 1544 on 1 and 149 DF, p-value: < 2.2e-16
plot(soil$zinc, soil$lead)
abline(lm1, col = "red")
plot(lm1, which=1)
abline(a = 0, b = 0, col = "lightblue", lwd = 2)
hist(lm1$residuals)
plot(lm1$residuals ~ soil$zinc, main = "residual plt")
abline(a = 0, b = 0, col = "red", lwd = 2)
Parts d-h can be answered by hand, using a calculator, or any R functions of your choice.
lm1$coefficients
## (Intercept) zinc
## 16.5829280 0.2913355
lm1$coefficients %*% c(1, 1000) # dot product
## [,1]
## [1,] 307.9184
sum(lm1$coefficients * c(1, 1000)) # sm
## [1] 307.9184
# More general way
new_data <- data.frame(zinc = c(1000))
predict(lm1, newdata = new_data)
## 1
## 307.9184
lm1$coefficients %*% c(1, 100)
## [,1]
## [1,] 45.71648
hist(soil$lead)
lead_colors <- heat.colors(100)
lead_scaled <- cut(
soil$lead,
breaks = 500,
labels = FALSE
)
plot(
soil$x,
soil$y,
col = lead_colors[lead_scaled],
pch = 19,
asp = 1,
xlab = "X Coordinate",
ylab = "Y Coordinate",
main = "Lead Concentration Map"
)
summary(lm1)$r.squared |> round(2)
## [1] 0.91
lm1
##
## Call:
## lm(formula = lead ~ zinc, data = soil)
##
## Coefficients:
## (Intercept) zinc
## 16.5829 0.2913
Linearity :
plot(soil$zinc, soil$lead)
abline(lm1, col = "red")
Symmetric / Normal Dist of Errors :
plot(lm1, which = 2)
hist(lm1$residuals)
Equal variance
plot(lm1, which =1)
Note from TM: We only made the assumption of linearity, i.e., a linear association between the two variables x and y, which is all that is actually needed for linear regression. For your information, symmetry refers to the points scattering symmetrically around the regression line. Equal variance refers to the points having equal variability around the regression line for all values of x. Neither of them are necessary conditions for a correct and interpretable linear model, but they are “nice to have”. Feel free to discuss them with your TA and comment on them in the report.
Our next data set is what is known as a time series, or data in time. It contains the measurements via satellite imagery of sea ice extent in millions of square kilometers for each month from 1988 to 2011. Please download the “sea_ice” data from BruinLearn and read it into R. If you have your working directory properly set, you can use the line below:
ice <- read.csv("sea_ice.csv", header = TRUE)
Note that currently R does not know what class the Date column is. We need to convert the Date column into class “date” using the following line:
ice$Date <- as.Date(ice$Date, "%m/%d/%Y")
lm2 <- lm(Extent~Date, dat=ice)
lm2
##
## Call:
## lm(formula = Extent ~ Date, data = ice)
##
## Coefficients:
## (Intercept) Date
## 1.011e+01 1.438e-04
plot(ice$Date, ice$Extent)
abline(lm2, col = "red")
# notice seasonal trends
plot(ice$Date, ice$Extent, type = "l")
plot(lm2$residuals ~ ice$Date, main = "residual plt")
abline(a = 0, b = 0, col = "red", lwd = 2)
alternative method :
plot(lm2, which = 1)
abline(a = 0, b = 0, col = "red", lwd = 2)
We can use the sample() function to sample data from a vector, and the replicate() function to simulate random events many times over. Note that the computer is not truly random, but it is close enough for our purposes to consider it random. We also rely on the set.seed() function to make our “random” results reproducible. Try the following examples in R and see the ## comments for descriptions.
## Set seed for reproducibility
set.seed(1335)
## Create a names vector
names = c("Leslie", "Ron", "Andy", "April", "Tom", "Ben", "Jerry")
## Sample 3 of the names with replacement
sample(names, 3, replace = TRUE)
## [1] "Ron" "April" "April"
## Sample 3 of the names without replacement
sample(names, 3, replace = FALSE)
## [1] "Tom" "Leslie" "Ben"
## Create a vector from 1 to 10
numbers = 1:10
## Simulate sampling 3 different numbers at random, 10 times
replicate(10, sample(numbers, 3, replace = FALSE))
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 10 5 8 7 5 3 4 7 9 3
## [2,] 1 8 5 2 9 9 5 10 8 5
## [3,] 5 2 2 5 10 5 1 6 4 9
## One more time, but save as an object
rand_draws = replicate(10, sample(numbers, 3, replace = FALSE))
## Perform analysis on the random draws
colMeans(rand_draws) ## Takes the mean of each sample
## [1] 8.333333 5.000000 2.000000 5.666667 5.666667 4.333333 5.666667 6.000000
## [9] 4.000000 6.000000
colSums(rand_draws) ## Takes the sum of each sample
## [1] 25 15 6 17 17 13 17 18 12 18
die_1 <- rep(1:6, each = 6)
die_2 <- rep(1:6, times = 6)
sums <- die_1 + die_2
result <- rep("Continue", length(sums))
result[sums %in% c(7, 11)] <- "Double"
result[sums %in% c(2, 3, 12)] <- "Lose"
craps_outcomes <- data.frame(
die_1 = die_1,
die_2 = die_2,
sum = sums,
result = result
)
craps_outcomes
hist(craps_outcomes$sum,
breaks = seq(1.5, 12.5, by = 1))
One of Adam’s favorite casino games is called “Craps”. In the first round of this game, two fair 6-sided dice are rolled. If the sum of the two dice equal 7 or 11, Adam doubles his money! If a 2, 3, or 12 are rolled, Adam loses all the money he bets.
sums
## [1] 2 3 4 5 6 7 3 4 5 6 7 8 4 5 6 7 8 9 5 6 7 8 9 10 6
## [26] 7 8 9 10 11 7 8 9 10 11 12
of all the cases :
library(dplyr)
craps_outcomes |> group_by(result) |> count()
there are 8 times we might double –
(8/36) |> round(2)
## [1] 0.22
so about a 22% pct chance
set.seed(123)
die_1_sim <- sample(1:6, size = 5000, replace = TRUE) # eq-prob
die_2_sim <- sample(1:6, size = 5000, replace = TRUE) # eq-prob
craps_sim <- die_1_sim + die_2_sim
sum_table_sim <- table(craps_sim)
barplot(sum_table_sim,
main = "Simulated Distribution of Dice Sums (n = 5000)",
xlab = "Sum of Two Dice",
ylab = "Frequency")
pct <- sum(craps_sim == 7 | craps_sim == 11)/length(craps_sim)
pct |> round(2)
## [1] 0.22
these events are disjoint – the reason for this is that Adam can either : Win, Loose or try again. He cannot do both at the same time. So the prob of doing both is 0.
If these events were indendent then we would suspect \(P(W \cap L)=P(W)P(L)\)
pct_win <- sum(craps_sim == 7 | craps_sim == 11)/length(craps_sim)
pct_loose <- sum(craps_sim == 2 | craps_sim == 3 | craps_sim == 12)/length(craps_sim)
0==pct_win*pct_loose
## [1] FALSE
Calculating normal and binomial distribution probabilities The functions pnorm(), dbinom(), and pbinom() allow us to calculate theoretical probabilities using certain assumptions about the distribution. To use the pnorm() function, R assumes a normal distribution with a mean, sd, and some observation we want to find a probability for. The binom functions assume a binomial distribution with sample size n, the probability of success p, and some observation. Try running the below examples.
## Coin flipping scenario. Probability of getting 4 heads when 7 coins are tossed
dbinom(4, size = 7, prob = 0.5)
## [1] 0.2734375
## Probability of getting 4 heads or less when 7 coins are tossed
pbinom(4, size = 7, prob = 0.5)
## [1] 0.7734375
## Probability of getting a number less than 4 from a normal distribution with mean 2, sd of 7
pnorm(4, mean = 2, sd = 0.7)
## [1] 0.9978626
Feel free to use the help screen, or Google, for more examples.
Tenorio National Park in Costa Rica has a roughly consistent year-round climate. On any given day, we assume there is a 40% chance of heavy rain.
n=365
p=.4
n <- 365
p <- 0.40
# mean
mean <- n * p
mean
## [1] 146
# sd
sd <- sqrt(n * p * (1 - p))
sd
## [1] 9.359487
# dbinom since we want to know exactly 145
dbinom(145, size = 365, prob = 0.40)
## [1] 0.04239996
below_175 <- pbinom(175, size = 365, prob = 0.40)
below_124 <- pbinom(124, size = 365, prob = 0.40)
below_175-below_124
## [1] 0.9888137
1 - pnorm(230, mean = 200, sd = 20)
## [1] 0.0668072