# Import data into R from an Excel file
# (1) Run install.packages("readxl") if you have never done so
# (2) Save the data file to the same location as the RMarkdown file
# (3) Run the following three commands
library(readxl)
setwd( dirname( rstudioapi::getActiveDocumentContext( )$path ) )
mitchellData <- read_excel("mitchell.xlsx")
head(mitchellData)
## # A tibble: 6 × 3
##     Obs Month  Temp
##   <dbl> <dbl> <dbl>
## 1     1     0 -5.18
## 2     2     1 -1.65
## 3     3     2  2.49
## 4     4     3 10.4
## 5     5     4 15.0
## 6     6     5 21.7
str(mitchellData)
## tibble [204 × 3] (S3: tbl_df/tbl/data.frame)
## $ Obs : num [1:204] 1 2 3 4 5 6 7 8 9 10 ...
## $ Month: num [1:204] 0 1 2 3 4 5 6 7 8 9 ...
## $ Temp : num [1:204] -5.18 -1.65 2.49 10.4 14.99 ...
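The import steps above assume the package is installed and the file sits in the working directory. A minimal defensive sketch of the same import is shown below; the `data_file` variable and the fallback message are illustrative additions, not part of the original commands.

```r
# Hypothetical variable name; the file name itself comes from the steps above
data_file <- "mitchell.xlsx"

# Only attempt the import when readxl is installed and the file is present
if (requireNamespace("readxl", quietly = TRUE) && file.exists(data_file)) {
  mitchellData <- readxl::read_excel(data_file)
  print(head(mitchellData))  # first six rows
  str(mitchellData)          # column types and dimensions
} else {
  message("Install readxl and place ", data_file, " in ", getwd())
}
```

Checking with `file.exists()` first gives a clearer message than the error `read_excel()` raises when the path is wrong.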
I expect a negative relationship. A longer driving distance means a golfer can get closer to the hole with fewer strokes, which should help reduce the total scoring average. So as driving distance increases, scoring average should decrease.
I expect a positive relationship. A higher putting average means the player takes more putts per hole, which would increase their total score.
library(readxl)
golfData <- read_excel("golfers.xls")
plot(avgscore ~ driving, data = golfData,
     pch = 16, col = "darkgreen", xlab = "Driving Distance (yards)",
     ylab = "Scoring Average")
cor(golfData$driving, golfData$avgscore)
## [1] -0.2654319
plot(avgscore ~ putts, data = golfData,
     pch = 16, col = "darkgreen", xlab = "Putts",
     ylab = "Scoring Average")
cor(golfData$putts, golfData$avgscore)
## [1] 0.1900407
Yes, both results confirm my predictions. The negative correlation between driving distance and scoring average makes sense because golfers who drive farther tend to score slightly better. Similarly, the positive correlation between putting average and scoring average makes sense because taking more putts per hole slightly worsens scores.
The relationship between driving distance and scoring average is slightly stronger, as its correlation (−0.265) is greater in absolute value than the putting correlation (+0.190). While both relationships are weak, driving distance has a relatively stronger linear relationship with scoring average in this dataset.
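The comparison above can be reproduced from the two reported correlations alone. This is a small sketch that uses the printed values rather than the golf data file; the vector name `r` is an illustrative choice.

```r
# Correlations reported in the output above
r <- c(driving = -0.2654319, putts = 0.1900407)

# Strength of a linear relationship is judged by |r|, ignoring the sign
strength <- abs(r)
strength

# Name of the predictor with the stronger linear relationship
names(which.max(strength))   # "driving"
```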
# Airfare data
y <- c(631.8, 338.6, 627.9, 352.6, 699.8, 470.7,
557.8, 547.6, 569.83, 321.1, 344.7, 427.67)
# Mean and median
mean_y <- mean(y)
median_y <- median(y)
mean_y
## [1] 490.8417
median_y
## [1] 509.15
# Create candidate yhat values from 400 to 600
yhat <- seq(400, 600, by = 0.1)
# Define SAE function
SAE.fun <- function(yhat) {
  sum(abs(y - yhat))
}
# Compute SAE for each candidate
SAE <- sapply(yhat, SAE.fun)
# Plot SAE vs yhat
plot(yhat, SAE, cex = 0.5, xlab = "yhat", ylab = "SAE",
     main = "Sum of Absolute Errors vs ŷ")
# Candidate value with the smallest SAE
yhat[which.min(SAE)]
## [1] 470.8
# Define SSE function
SSE.fun <- function(yhat) {
  sum((y - yhat)^2)
}
# Compute SSE for each candidate
SSE <- sapply(yhat, SSE.fun)
# Plot SSE vs yhat
plot(yhat, SSE, cex = 0.5, xlab = "yhat", ylab = "SSE",
     main = "Sum of Squared Errors vs ŷ")
# Candidate value with the smallest SSE
yhat[which.min(SSE)]
## [1] 490.8
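When using absolute error (SAE), the optimal prediction is the median of the data. Because there are 12 observations, SAE is equally small for every value between the two middle observations (470.7 and 547.6), which is why the grid search reports 470.8 even though the median is 509.15. When using squared error (SSE), the optimal prediction is the mean (490.84). Choosing between the mean and the median therefore depends on which error criterion we want to minimize.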
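The conclusion above can be checked numerically without a grid. The sketch below reuses the airfare values and the two error functions from the code above, and swaps in base R's `optimize()` as an alternative to the grid search; the names `sse_min` and `middle` are illustrative.

```r
# Airfare data, copied from the example above
y <- c(631.8, 338.6, 627.9, 352.6, 699.8, 470.7,
       557.8, 547.6, 569.83, 321.1, 344.7, 427.67)

SAE.fun <- function(yhat) sum(abs(y - yhat))
SSE.fun <- function(yhat) sum((y - yhat)^2)

# SSE has a single minimum; optimize() should land very close to the mean
sse_min <- optimize(SSE.fun, interval = c(400, 600))$minimum
c(optimize = sse_min, mean = mean(y))

# With an even number of observations, SAE takes the same value at the
# two middle observations and everywhere between them, including the median
middle <- sort(y)[c(6, 7)]
sapply(c(middle[1], median(y), middle[2]), SAE.fun)
```

The three SAE values printed on the last line should agree, which shows that the median is one of many minimizers of SAE for this data set rather than the only one.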
An observational unit is a single textbook from the Cal Poly bookstore.
The response variable is price: it is the variable we are trying to predict from the number of pages.
The explanatory variable is number of pages. This is used to help predict the price of a textbook.
Price = 0.1465 × Pages − 5.13
The correlation is strong and positive, so 0.8 is the value that best describes the relationship.
Negative
Price = 0.1465 × 1300 − 5.13 = 190.45 − 5.13 = 185.32. The arithmetic is correct, but 1300 pages is outside the range of the data used to build the model. This makes the prediction an extrapolation, and therefore unreliable, so we cannot say for sure that the statement is true.
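The hand calculation above can be reproduced in R. This is a minimal sketch using the slope and intercept reported above; `predict_price()` and `is_extrapolation()` are hypothetical helper names, and the observed page counts are not reproduced here, so the second helper is defined but not called.

```r
# Fitted line reported above: Price = 0.1465 * Pages - 5.13
predict_price <- function(pages, slope = 0.1465, intercept = -5.13) {
  intercept + slope * pages
}

predict_price(1300)   # 185.32, matching the hand calculation

# Hypothetical helper: TRUE when `pages` lies outside the observed range.
# Call it with the page counts from the bookstore data.
is_extrapolation <- function(pages, observed_pages) {
  pages < min(observed_pages) | pages > max(observed_pages)
}
```

Pairing the two functions makes the caveat explicit: a prediction is only as trustworthy as the range of page counts the line was fitted on.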
Self-reported measures are subject to response bias. Participants may underreport or exaggerate symptoms because of social desirability, embarrassment, or inaccurate recall, which is a particular concern in this study because symptoms are recalled after drinking. If men and women differ in how they report symptoms (for example, if one gender underreports more), this could introduce systematic bias.
The sample consisted of college students at a single university (University of Missouri) who were enrolled in Intro to Psychology and participated for research credit. This is a non-random, convenience sample. Therefore, the findings cannot be generalized to all adults, or even to all college students.
The two variables are the hangover symptom scale score and the frequency of drinking. The explanatory variable is drinking frequency, since the researchers were examining whether more frequent drinking is associated with higher hangover symptom scores.
The correlation coefficient r = 0.44 means there is a moderate positive linear relationship between how often a student drinks and their total hangover symptom score.
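One supplementary way to gauge what "moderate" means here is to square the correlation, which gives the proportion of variation in one variable accounted for by a straight-line relationship with the other. This calculation is an addition to the answer above, not something reported in the study summary.

```r
# Correlation reported in the study
r <- 0.44

# Proportion of variation accounted for by a linear fit
r_squared <- r^2
r_squared   # 0.1936
```

So, under a linear model, drinking frequency would account for roughly 19% of the variation in hangover symptom scores, leaving most of the variation unexplained.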
In this study, "significantly" means that the result is statistically significant. If drinking frequency and hangover symptoms were truly unrelated, a correlation this strong would be very unlikely to arise by random chance alone (p < 0.001).
It provides strong evidence that drinking frequency and hangover symptoms are related in the sample studied, rather than the observed correlation being a coincidence.