Inference Guide
There is a useful Inference Cheat Sheet in your readings folder
The histogram below shows the sample's mileage in MPG. Note that these data are user estimates.
Do you think this is reasonable?
Answer: No, there is uncertainty about the accuracy of the source data. The data reported online are not verifiable, and the population of 2012 Prius drivers may differ from the 2012 Prius drivers who report on fueleconomy.gov.
Do these data provide strong evidence against this estimate?
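For reference, the calculation below uses the standard one-sample t statistic:

\[T = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{53.3 - 50}{5.2/\sqrt{14}}\]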
test_stat = (53.3 - 50)/(5.2/sqrt(14))
test_stat
## [1] 2.374513
reject = qt(0.975, df = 13)
reject
## [1] 2.160369
Answer: Assuming the sample was randomly selected and using alpha = 0.05, the test statistic is higher than the upper boundary of the rejection region. Therefore, the data provide convincing evidence that the average MPG for drivers who post on fueleconomy.gov differs from the estimate for 2012 Prius drivers overall.
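As a cross-check (not part of the original solution), the two-sided p-value can be computed directly from the t distribution using the test statistic above; it should come out below 0.05, consistent with the rejection above.

# Two-sided p-value for the one-sample t test above (cross-check)
2 * pt(-abs(test_stat), df = 13)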
z_value <- qnorm(.975)
stan_error <- 5.2/sqrt(14)
lower <- 53.3 - (z_value * stan_error)
upper <- 53.3 + (z_value * stan_error)
lower
## [1] 50.57612
upper
## [1] 56.02388
Answer: The 95% CI for the average gas mileage of 2012 Prius drivers who post on fueleconomy.gov is (50.57612, 56.02388).
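Note that the interval above uses a normal (z) critical value even though n = 14 is small and the standard deviation is estimated; a t critical value with 13 degrees of freedom, as in the hypothesis test, would give a slightly wider interval. A minimal sketch of that alternative, reusing stan_error from above:

# Alternative 95% CI using the t distribution (df = 13) instead of z
t_value <- qt(0.975, df = 13)
c(53.3 - t_value * stan_error, 53.3 + t_value * stan_error)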
Prices of diamonds are determined by what is known as the 4 Cs: cut, clarity, color, and carat weight. The prices of diamonds go up as carat weight increases, but the increase is not smooth. For example, the difference between the size of a 0.99 carat diamond and a 1 carat diamond is essentially undetectable to the naked eye, but the price difference can be substantial. In this question we use two random samples of diamonds, 0.99 carat and 1 carat diamonds (each of size 23), and compare the average prices of the diamonds.
In order to be able to compare equivalent units, we first divide the price of each diamond by 100 times its weight in carats.
That is, for a 0.99 carat diamond, we divide the price by 99.
For a 1 carat diamond, we divide the price by 100.
The distributions and some sample statistics are shown below.[@wickham_ggplot2:_2016]
(Figure: distributions of point price for the two samples; table of sample statistics.)
Make sure to state your hypotheses clearly and check relevant conditions.
Answer: Alternative hypothesis (H_A): the average standardized prices of 0.99 and 1 carat diamonds are different. Null hypothesis (H_0): the average standardized prices of 0.99 and 1 carat diamonds are not different. Alpha = 0.05. The sample standard deviations of the two groups differ, so the Welch approximation is used to calculate the degrees of freedom.
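For reference, the test statistic and the Welch approximation to the degrees of freedom used in the code below are:

\[T = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}, \qquad df \approx \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}\]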
test_stat = (44.51 - 56.81)/sqrt((13.32^2)/23 + (16.13^2)/23)
degrees = (((13.32^2) / 23 + (16.13^2) / 23)^2) / ((((13.32^2 / 23)^2) / (22)) + (((16.13^2 / 23)^2) / (22)))
reject = -qt(0.975, df = degrees)
test_stat
## [1] -2.819881
reject
## [1] -2.017405
Answer: With alpha = 0.05, the test statistic is lower than the lower boundary of the rejection region, so we reject the null hypothesis; the data provide convincing evidence that the average standardized prices of 0.99 and 1 carat diamonds are different.
We discussed diamond prices (standardized by weight) for 0.99 and 1 carat diamonds above. See the table for summary statistics, and construct a 95% confidence interval for the average difference between the standardized prices of 0.99 and 1 carat diamonds.
You may assume the conditions for inference are met.
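The interval below follows the standard two-sample form, using the Welch degrees of freedom computed for the hypothesis test above:

\[(\bar{x}_1 - \bar{x}_2) \pm t^*_{df} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\]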
t_value <- qt(0.975, df = degrees)
stan_error <- sqrt((13.32^2)/23 + (16.13^2)/23)
lower <- (44.51 - 56.81) - (t_value * stan_error)
upper <- (44.51 - 56.81) + (t_value * stan_error)
lower
## [1] -21.09969
upper
## [1] -3.500307
Answer: The 95% CI for the average price of .99 carat diamonds minus the average price of 1 carat diamonds is (-21.09969, -3.500307).
The movie Moneyball focuses on the “quest for the secret of success in baseball”.
In this lab we’ll be looking at data from all 30 Major League Baseball teams and examining the linear relationship between runs scored in a season and a number of other player statistics. Our aim will be to summarize these relationships both graphically and numerically in order to find which variable, if any, helps us best predict a team’s runs scored in a season.
Let’s load up the data for the 2011 season.
load("./data/mlb11.RData")
In addition to runs scored, there are seven traditionally used variables in the data set: at-bats, hits, home runs, batting average, strikeouts, stolen bases, and wins. There are also three newer variables: on-base percentage (new_onbase), slugging percentage (new_slug), and on-base plus slugging (new_obs).
For the first portion of the analysis we’ll consider the seven traditional variables. At the end of the lab, you’ll work with the newer variables.
What type of plot would you use to display the relationship between runs and one of the other numerical variables? Plot this relationship using the variable at_bats as the predictor. Does the relationship look linear?
plot(mlb11$at_bats, mlb11$runs)
ANSWER: A scatterplot is used to display the relationship between at_bats and runs. The relationship does look linear.
If you knew a team’s at_bats, would you be comfortable using a linear model to predict the number of runs? Quantify the strength of the relationship with the correlation coefficient.
cor(mlb11$at_bats, mlb11$runs)
## [1] 0.610627
ANSWER: I would be comfortable using a linear model for these two variables. The correlation coefficient is 0.610627.
Think back to the way that we described the distribution of a single variable.
Looking at your plot from the previous exercise, describe the relationship between these two variables. Make sure to discuss the form, direction, and strength of the relationship, as well as any unusual observations.
ANSWER: There is a moderately strong, positive linear relationship between at-bats and runs for the 2011 season.
Just as we used the mean and standard deviation to summarize a single variable, we can summarize the relationship between two variables by finding the line that best follows their association. Use the following interactive function to select the line that you think does the best job of going through the cloud of points.
library(statsr)
##
## Attaching package: 'statsr'
## The following objects are masked _by_ '.GlobalEnv':
##
## mlb11, plot_ss
plot_ss(x = mlb11$at_bats, y = mlb11$runs)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
You’ll need to run this command in an R script (.R file). After running it, you’ll be prompted to click two points on the plot to define a line; once you’ve done that, the line you specified and its residuals will be displayed.
Note that there are 30 residuals, one for each of the 30 observations.
\[e_i = y_i- \hat{y}_i\]
The most common way to do linear regression is to select the line that minimizes the sum of the squared residuals. To visualize the squared residuals, you can rerun the plot command and add the argument showSquares = TRUE.
plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
Note that the output from the plot_ss function provides you with the slope and intercept of your line as well as the sum of squares. Using plot_ss, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got? How does it compare to your neighbors?
Q 3.3a ANSWER: The smallest sum of squares I got was 145548.8 on my fifth attempt. This was slightly greater than what my classmate neighbors reported for their fifth attempts.
It is rather cumbersome to try to get the correct least squares line, i.e. the line that minimizes the sum of squared residuals, through trial and error. Instead we can use the lm function in R to fit the linear model (a.k.a. regression line).
m1 <- lm(runs ~ at_bats, data = mlb11)
The first argument in the function lm is a formula that takes the form y ~ x.
The output of lm is an object that contains all of the information we need about the linear model that was just fit. We can access this information using the summary function.
summary(m1)
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.58 -47.05 -16.59 54.40 176.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2789.2429 853.6957 -3.267 0.002871 **
## at_bats 0.6305 0.1545 4.080 0.000339 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared: 0.3729, Adjusted R-squared: 0.3505
## F-statistic: 16.65 on 1 and 28 DF, p-value: 0.0003388
Let’s consider this output piece by piece.
\(\hat{y} = -2789.2429 + 0.6305 \times \text{at\_bats}\)
One last piece of information we will discuss from the summary output is the Multiple R-squared, or more simply, R-squared. For this model, what percentage of the variance in runs is explained by at-bats?
Q 3.3b ANSWER: 37.3% of the variability in runs is explained by at-bats.
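If desired, this value can be pulled directly from the model object rather than read off the printed summary (a small sketch using m1 from above):

# Extract R-squared directly from the fitted model;
# should match the Multiple R-squared reported above (about 0.373)
summary(m1)$r.squared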
Fit a new model that uses homeruns to predict runs.
m2 <- lm(runs ~ homeruns, data = mlb11)
summary(m2)
##
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.615 -33.410 3.231 24.292 104.631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 415.2389 41.6779 9.963 1.04e-10 ***
## homeruns 1.8345 0.2677 6.854 1.90e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared: 0.6266, Adjusted R-squared: 0.6132
## F-statistic: 46.98 on 1 and 28 DF, p-value: 1.9e-07
Q 3.4. ANSWER: \(\widehat{Runs} = 415.2389 + 1.8345 \times \text{homeruns}\). The slope tells us that each additional home run is associated with about 1.8345 additional runs, on average.
Create a scatterplot with the least squares line laid on top.
plot(mlb11$runs ~ mlb11$at_bats)
abline(m1)
The function abline plots a line based on its slope and intercept.
library(ggplot2)
ggplot(mlb11, aes(x=at_bats, y=runs)) + geom_point() + geom_smooth(method = lm,
se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
Here, we used a shortcut by providing the model m1, which contains both parameter estimates, so abline can extract the intercept and slope from it.
If a team manager saw the least squares regression line and not the actual data, how many runs would he or she predict for a team with 5,578 at-bats? Is this an overestimate or an underestimate, and by how much?
Q 3.5a ANSWER: He or she would predict about 728 runs for the team, using the regression equation above. The Phillies have a data point very close to this number of at-bats, with 5579 at-bats and 713 runs. Therefore the prediction is an overestimate, with a residual of about -15.
-2789.2429 + 0.6305*5578
## [1] 727.6861
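Equivalently, the prediction can be obtained from the fitted model with predict(); this small sketch should reproduce the manual calculation above.

# Prediction for 5,578 at-bats using the fitted model m1
predict(m1, newdata = data.frame(at_bats = 5578))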
To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability. Linearity: You already checked whether the relationship between runs and at-bats is linear using a scatterplot. We should also verify this condition using a plot of the residuals vs. at-bats.
Q 3.5b ANSWER: Plot of residuals vs. at-bats shown below.
Plot of the residuals vs. at-bats.
plot(m1$residuals ~ mlb11$at_bats)
Is there any apparent pattern in the residuals plot?
Q 3.6 Answer: There is no apparent pattern. Therefore, the linearity condition appears to be met for the relationship between runs and at-bats.
plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3)
Nearly normal residuals: To check this condition, we can look at a histogram of the residuals and a normal probability plot.
hist(m1$residuals)
Based on the histogram and the normal probability plot, does the nearly normal residuals condition appear to be met?
qqnorm(m1$residuals)
qqline(m1$residuals)
Q 3.7 ANSWER: Although the histogram is slightly right-skewed, the residuals are generally symmetric, so the nearly normal residuals condition appears to be met.
Constant variability: Based on the residuals plot above, does the constant variability condition appear to be met?
Q 3.8 ANSWER: Yes. The constant variability condition appears to be met, since the variability around the line of best fit appears to be roughly constant.
Choose another traditional variable from mlb11 that you think might be a good predictor of runs. Produce a scatterplot of the two variables and fit a linear model.
At a glance, does there seem to be a linear relationship?
Q 3.9a ANSWER: Batting average (bat_avg) was chosen; the scatterplot below suggests a fairly strong, positive linear relationship with runs.
m3 <- lm(runs ~ bat_avg, data = mlb11)
plot(mlb11$runs ~ mlb11$bat_avg)
abline(m3)
summary(m3)
##
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.676 -26.303 -5.496 28.482 131.113
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -642.8 183.1 -3.511 0.00153 **
## bat_avg 5242.2 717.3 7.308 5.88e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared: 0.6561, Adjusted R-squared: 0.6438
## F-statistic: 53.41 on 1 and 28 DF, p-value: 5.877e-08
How does this relationship compare to the relationship between runs and at_bats?
Q 3.9b ANSWER: The R-squared value for this model is 65.61%, which is substantially higher than the R-squared value between runs and at_bats (37.3%). This indicates that batting average is a better variable to predict runs.
Now that you can summarize the linear relationship between two variables, investigate the relationships between runs and each of the other traditional variables. Which variable best predicts runs? Support your conclusion using the graphical and numerical methods we’ve discussed.
Q 3.10 ANSWER: Batting average (which happened to be the variable chosen in 3.9) is the best predictor of runs among the traditional variables, with an R-squared value of 65.61%. The other variables tested were hits, at_bats, homeruns, strikeouts, stolen_bases, and wins. The correlation coefficient between batting average and runs is 0.81, computed below; a sketch of how the other traditional variables could be compared follows that output.
cor(mlb11$runs, mlb11$bat_avg)
## [1] 0.8099859
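As a sketch of how the other traditional variables could be compared (the variable names are taken from the answer above and assumed to match the columns in mlb11), one could loop over the candidate predictors and collect R-squared values:

# Compare R-squared across the traditional predictors (illustrative sketch)
predictors <- c("at_bats", "hits", "homeruns", "bat_avg",
                "strikeouts", "stolen_bases", "wins")
r_squared <- sapply(predictors, function(v) {
  summary(lm(reformulate(v, response = "runs"), data = mlb11))$r.squared
})
sort(r_squared, decreasing = TRUE)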
Now examine the three newer variables. Are they more or less effective at predicting runs than the traditional variables, and which one seems to be the best predictor?
Q 3.11 ANSWER: All three new predictors have higher R-squared values (84.91% for new_onbase, 89.69% for new_slug, and 93.49% for new_obs) and are thus better predictors of runs. new_obs (on-base plus slugging) appears to be the best predictor of runs. The plots for that model show linearity and roughly constant variability in the residuals. With no baseball knowledge, I cannot personally verify whether this makes the most sense.
onbase_model <- lm(runs ~ new_onbase, data = mlb11)
slug_model <- lm(runs ~ new_slug, data = mlb11)
obs_model <- lm(runs ~ new_obs, data = mlb11)
summary(onbase_model)
##
## Call:
## lm(formula = runs ~ new_onbase, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.270 -18.335 3.249 19.520 69.002
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1118.4 144.5 -7.741 1.97e-08 ***
## new_onbase 5654.3 450.5 12.552 5.12e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.61 on 28 degrees of freedom
## Multiple R-squared: 0.8491, Adjusted R-squared: 0.8437
## F-statistic: 157.6 on 1 and 28 DF, p-value: 5.116e-13
summary(slug_model)
##
## Call:
## lm(formula = runs ~ new_slug, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.41 -18.66 -0.91 16.29 52.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -375.80 68.71 -5.47 7.70e-06 ***
## new_slug 2681.33 171.83 15.61 2.42e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.96 on 28 degrees of freedom
## Multiple R-squared: 0.8969, Adjusted R-squared: 0.8932
## F-statistic: 243.5 on 1 and 28 DF, p-value: 2.42e-15
summary(obs_model)
##
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.456 -13.690 1.165 13.935 41.156
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -686.61 68.93 -9.962 1.05e-10 ***
## new_obs 1919.36 95.70 20.057 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared: 0.9349, Adjusted R-squared: 0.9326
## F-statistic: 402.3 on 1 and 28 DF, p-value: < 2.2e-16
plot(mlb11$runs ~ mlb11$new_obs)
abline(obs_model)
plot(obs_model$residuals ~ mlb11$new_obs)
abline(h = 0, lty = 3)
hist(obs_model$residuals)
qqnorm(obs_model$residuals)
qqline(obs_model$residuals)
What concepts from the textbook are covered in this lab?
Q 3.12 ANSWER: The concepts of linear regression, least squares, and checking the conditions for a linear model (such as linearity) are covered in the OpenIntro Statistics textbook. Applying them with R code was covered in lectures and previous labs.
Time series are a common type of data. In this project we will be using classical decomposition to separate a time series into trend, seasonal, and random components. First, as an introduction to decomposition, we will work through a quick example.
What is the decomposition of a time series? Briefly, decomposition splits a series into a long-run trend, a repeating seasonal component, and an irregular (random) remainder.
library(datasets)
library(tidyverse)
## -- Attaching packages --------------------------------------------------- tidyverse 1.3.0 --
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.1 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## v purrr 0.3.4
## -- Conflicts ------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
air <- as.data.frame(AirPassengers)
Plot the total time series of air passengers.
ANSWER: The time series of passengers contains 144 data points, and the plot shows an increase in airline passengers over time with some random and seasonal fluctuations.
plot.ts(AirPassengers)
Use the ts() function in base R to convert the data to a time series object. If the data is taken monthly, how many data points make up one yearly season, and which months tend to have the most and fewest passengers?
ANSWER: In the time series plot, the number of passengers increases with each subsequent year. July and August tend to have the most passengers within each year, and November tends to have the fewest. There are 12 points per yearly season.
ts(AirPassengers, frequency = 12)
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1 112 118 132 129 121 135 148 148 136 119 104 118
## 2 115 126 141 135 125 149 170 170 158 133 114 140
## 3 145 150 178 163 172 178 199 199 184 162 146 166
## 4 171 180 193 181 183 218 230 242 209 191 172 194
## 5 196 196 236 235 229 243 264 272 237 211 180 201
## 6 204 188 235 227 234 264 302 293 259 229 203 229
## 7 242 233 267 269 270 315 364 347 312 274 237 278
## 8 284 277 317 313 318 374 413 405 355 306 271 306
## 9 315 301 356 348 355 422 465 467 404 347 305 336
## 10 340 318 362 348 363 435 491 505 404 359 310 337
## 11 360 342 406 396 420 472 548 559 463 407 362 405
## 12 417 391 419 461 472 535 622 606 508 461 390 432
Use the decompose() function to decompose the time series and remove the seasonality. The type for this time series is multiplicative.
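In the multiplicative form, each observation is modeled as the product of a trend, a seasonal, and a random component:

\[Y_t = T_t \times S_t \times R_t\]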
ANSWER: The plot of the decomposed time series shows a positive trend over the years, a seasonal component with the same cyclic shape each year, and a random component capturing the remaining variation.
a <- ts(AirPassengers, frequency = 12)
b <- decompose(a, "multiplicative")
plot(b)
Isolate the trend and plot the trend on top of the raw data with the seasonality included.
ANSWER: The trend line represents the original air passenger data fairly well; the remaining variation around the trend is due to seasonality and randomness.
# Extract the trend from the decomposition
trend <- b$trend
# Plot the raw data annotated with the trend
plot(a)
lines(trend, col = "red")  # overlay the trend on the raw series
title(main = "AirPassengers Raw Data with Trend")
Now let’s try this with a real-world time series.
We’ll be using one month of power and weather data from a solar power plant.
The data set variables are as follows:
The power from solar panels is dependent on the irradiance hitting them, so we expect power and irradiance to follow the same pattern.
It is useful to have multiple sources of irradiance measurements.
Sensors on the ground are useful because they measure irradiance directly at the site, but like any instrument they can drift or degrade over time. To combat this, we also have irradiance data from SolarGIS, an independent satellite-derived source.
Plot the irradiance and power for the first week of data. How many variables and observations are in the data set, what time interval does it cover, and how do power and the two irradiance measurements relate?
ANSWER: There are 7 columns (variables) and 2880 rows (observations) in the dataset. The time interval for the entire dataset goes from “2012-06-01 UTC” to “2012-06-30 23:45:00 UTC”. The irradiance peaks midday and drops off at night, which is expected. The solar panel power and both sources of irradiance line up well. The way the three series line up on the same plot shows the direct relationship between incident irradiance and the power produced by the solar panels.
# Read in the data
library(ggplot2)
solar_data <- readr::read_csv("data/2001-353-353m-453-le2b-tsa-data-final.csv")
## Parsed with column specification:
## cols(
## time = col_datetime(format = ""),
## ghir = col_double(),
## iacp = col_double(),
## temp = col_double(),
## ghi_solargis = col_double(),
## clear = col_logical(),
## ratio = col_double()
## )
# Find total length of dataset
sprintf('The total length of the data is %0.4f days',
as.numeric(max(solar_data$time) - min(solar_data$time)))
## [1] "The total length of the data is 29.9896 days"
# Find the time interval
table(diff(solar_data$time))
##
## 15
## 2879
# Plot the irradiance and power for the first week
# Entire dataset is in 15-minute increments
first_week <- solar_data[1:672, ]
# Power plot over time
ggplot(first_week, aes(x = time, y = iacp)) +
geom_point(size = 2, shape = 19, alpha = 0.8) +
geom_line(aes(y = iacp), color = 'black', lwd = 1.2) +
xlab('') + ylab("Power (kW)") + ggtitle("First Week Solar Panel Power")
# Both sources of irradiance
ggplot(first_week, aes(x = time)) +
geom_line(aes(y = ghir, color = 'ghir'), lwd = 1.2) +
geom_point(aes(y = ghir, color = 'ghir'), alpha = 0.8,
shape = 19, size = 2) +
geom_line(aes(y = ghi_solargis, color = 's_ghir'), lwd = 1.2) +
geom_point(aes(y = ghi_solargis, color = 's_ghir'), alpha = 0.8,
shape = 19, size = 2) +
scale_color_manual(name = 'legend', values = c('red', 'blue')) +
xlab('') + ylab('Irradiance') +ggtitle('Two Sources of Irradiance')
# All 3 lines on the same plot with a secondary axis
ggplot(first_week, aes(x = time)) +
geom_line(aes(y = iacp, color = 'power')) +
geom_line(aes(y = 0.2 * ghir, color = 'ghir'), lwd = 1.1) +
geom_line(aes(y = 0.2 * ghi_solargis, color = 's_ghi'), lwd = 1.1) +
scale_y_continuous(sec.axis = sec_axis(~.*5, name = "Solar Irradiance")) +
ylab('Power (kW)') + xlab('') + ggtitle("Power vs GHI")
Use the ts() function and the stlplus() function from the stlplus package to decompose the power and both irradiance time series.
ANSWER: All of the decompositions show similar patterns. The seasonal (daily) component matches the raw data well. There is no strong long-term trend, as this dataset only covers one month.
# think carefully about the frequency you'll need to define for this data
# what is the seasonal component to this data and how many data points make up a season?
# use s.window = "periodic" for the stlplus function
library(stlplus)
time_series <- lapply(solar_data[c('iacp', 'ghir', 'ghi_solargis')], ts)
solar_decom <- lapply(time_series, stlplus, s.window = "periodic", n.p = 24*4)
#Plot the data with lapply as well
lapply(solar_decom, plot, main = "LOESS Decomposition")
## $iacp
##
## $ghir
##
## $ghi_solargis
Isolate the trends for the 3 time series you just decomposed and fit a linear model to each trend as a function of time. Compare the models to each other. What do the coefficients tell you about the trends?
ANSWER: All three fitted trend lines are essentially flat: the coefficients on time are tiny (on the order of 1e-7 to 1e-5 per second), so the trend component changes very little over the month for power and for both irradiance series.
# x[[c("data", "trend")]] pulls the trend column from each stlplus object
models <- lapply(solar_decom, function(x) {
  lm(x[[c("data", "trend")]] ~ solar_data$time)
})
lapply(models, function(x) {summary(x)$coefficients})
## $iacp
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.377731e+02 1.696082e+02 -2.581084 0.009898121
## solar_data$time 3.791630e-07 1.265918e-07 2.995163 0.002766366
##
## $ghir
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.191447e+04 9.665667e+02 -22.67249 7.573128e-105
## solar_data$time 1.659443e-05 7.214237e-07 23.00233 1.271864e-107
##
## $ghi_solargis
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.322766e+04 1.079400e+03 -21.51905 2.295667e-95
## solar_data$time 1.759174e-05 8.056396e-07 21.83575 6.201618e-98
Solar panel degradation leads to less power output over time for the same irradiance. Based on the linear models you found for the trends of power and irradiance, does it appear that the system is degrading?
ANSWER: Based on the linear models above, the system is not degrading over time, as solar power does not significantly decrease over the course of the month.
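As a rough back-of-the-envelope check (not part of the original analysis, and assuming the time column is POSIXct so that the regression slope is in units per second), the power-trend slope can be converted into a total change over the record:

# Total implied change in the power trend over the ~30-day record
slope_iacp <- coef(models$iacp)[2]  # slope from the iacp trend model above, per second
total_secs <- as.numeric(max(solar_data$time) - min(solar_data$time), units = "secs")
slope_iacp * total_secs  # positive, i.e. no decline in power over this month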
How do the sensor GHI and the SolarGIS GHI compare to power?
ANSWER: The sensor GHI and SolarGIS GHI trends are also relatively flat, and solar panel power increases slightly along with irradiance over the course of the month.