Using R’s lm function, perform regression analysis and measure the significance of the independent variables for the following two data sets.
In the first case, you are evaluating the commonly heard claim that a person's maximum heart rate is related to their age by the following equation:
\[MaxHR=200-Age\]
You have been given the following sample:
Age | 18 | 23 | 25 | 35 | 65 | 54 | 34 | 56 | 72 | 19 | 23 | 42 | 18 | 39 | 37 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MaxHR | 202 | 186 | 187 | 180 | 156 | 169 | 174 | 172 | 153 | 199 | 193 | 174 | 198 | 183 | 178 |
Age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
MaxHR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)
fit_1 <- lm(MaxHR ~ Age, data = data.frame(Age, MaxHR))
summ_1 <- summary(fit_1)
fit_1$coefficients
## (Intercept) Age
## 210.0484584 -0.7977266
\[\hat{MaxHR_i} = 210.048 - 0.798 \cdot Age_i\]
summ_1$coefficients[2,4] < 0.01
## [1] TRUE
The empirical evidence suggests that the effect of age on maximum heart rate is significant.
(p_1 <- summ_1$coefficients[2,4])
## [1] 3.847987e-08
The \(p\)-value for the Age coefficient, \(p \approx 3.85 \times 10^{-8}\), is far below the significance level of \(\alpha=0.01\).
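The test above only asks whether the slope differs from zero. One way to look at the claimed equation itself is to check whether its coefficients fall inside the confidence intervals of the fit. The following is a minimal sketch, assuming a 99% level to match \(\alpha = 0.01\); the names `claimed` and `inside` are illustrative and not part of the original analysis.

```r
# Same sample as above, repeated so the block runs on its own.
Age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
MaxHR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)
fit <- lm(MaxHR ~ Age)

# 99% confidence intervals for the intercept and slope.
ci <- confint(fit, level = 0.99)

# Coefficients of the claimed equation MaxHR = 200 - Age (assumed to map to
# intercept 200 and slope -1).
claimed <- c("(Intercept)" = 200, Age = -1)

# TRUE where the claimed value lies inside the corresponding interval.
inside <- claimed >= ci[, 1] & claimed <= ci[, 2]
print(cbind(ci, claimed, inside))
```

A claimed value that falls outside its interval would be evidence against that part of the equation at the chosen level; a value inside the interval is merely not contradicted by this sample.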
par(mfrow = c(2, 2))
plot(fit_1)
At a significance level of \(\alpha = 0.01\) there is empirical evidence suggesting that the effect of age on maximum heart rate is significant. The data suggest a negative relationship between age and maximum heart rate: as age goes up, maximum heart rate goes down. The adjusted \(R^2\) also suggests that approximately 90.21% of the variation in maximum heart rate can be explained by age.
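The share of explained variation can be read directly from the model summary. The sketch below refits the same model and pulls out both the plain and the adjusted \(R^2\); for this sample the 90.21% figure appears to correspond to the adjusted value, with the unadjusted \(R^2\) slightly higher. The variable names `r2` and `adj_r2` are illustrative.

```r
# Same sample as above, repeated so the block runs on its own.
Age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
MaxHR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)
summ <- summary(lm(MaxHR ~ Age))

# Plain R-squared: share of the variance in MaxHR explained by Age.
r2 <- summ$r.squared

# Adjusted R-squared: penalized for the number of predictors.
adj_r2 <- summ$adj.r.squared

round(c(r.squared = r2, adj.r.squared = adj_r2), 4)
```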
Using the auto-mpg data set from Assignment 5 (attached here), perform a Linear Regression analysis using mpg as the dependent variable and the other 4 variables (displacement, horsepower, weight, acceleration) as independent variables.
\[Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 X_{4i} + \varepsilon_i\] \[mpg_i = \beta_0 + \beta_1 \cdot displacement_i + \beta_2 \cdot horsepower_i + \beta_3 \cdot weight_i + \beta_4 \cdot acceleration_i + \varepsilon_i\]
url <- "https://raw.githubusercontent.com/jzuniga123/SPS/master/DATA%20605/auto-mpg.data"
auto.mpg <- read.table(url)
colnames(auto.mpg) <- c("displacement", "horsepower", "weight", "acceleration", "mpg")
fit_2 <- lm(mpg ~ displacement + horsepower + weight + acceleration, data = auto.mpg)
summ_2 <- summary(fit_2)
What is the final linear regression fit equation?
fit_2$coefficients
## (Intercept) displacement horsepower weight acceleration
## 45.251139699 -0.006000871 -0.043607731 -0.005280508 -0.023147999
\[\hat{mpg_i} = 45.251 - 0.006 \cdot displacement_i - 0.044 \cdot horsepower_i - 0.005 \cdot weight_i - 0.023 \cdot acceleration_i\]
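To make the fitted equation concrete, the sketch below plugs one car into it. The coefficients are copied from the output above; the car itself (200 cubic inches, 100 hp, 3000 lb, acceleration 15) is a hypothetical example and not a row from the data set.

```r
# Coefficients as reported in the fit_2 output above.
beta <- c(intercept = 45.251139699, displacement = -0.006000871,
          horsepower = -0.043607731, weight = -0.005280508,
          acceleration = -0.023147999)

# Hypothetical car (illustrative values, not taken from the data).
car <- c(displacement = 200, horsepower = 100, weight = 3000, acceleration = 15)

# Intercept plus the sum of coefficient * predictor, matched by name.
predicted_mpg <- unname(beta["intercept"] + sum(beta[names(car)] * car))
round(predicted_mpg, 2)  # about 23.5 mpg
```

With a fitted model object available, `predict(fit_2, newdata = data.frame(...))` would give the same number without retyping the coefficients.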
Which of the 4 independent variables have a significant impact on mpg?
summ_2$coefficients[2:5,4] < 0.01
## displacement horsepower weight acceleration
## FALSE TRUE TRUE FALSE
What are their corresponding significance levels?
summ_2$coefficients[2:5,4]
## displacement horsepower weight acceleration
## 3.716584e-01 8.848982e-03 2.302545e-10 8.538765e-01
What are the standard errors on each of the coefficients?
summ_2$coefficients[2:5,2]
## displacement horsepower weight acceleration
## 0.0067093055 0.0165734633 0.0008108541 0.1256011622
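The standard errors connect the coefficient estimates to the significance values reported earlier: each \(t\)-statistic is the estimate divided by its standard error, and the \(p\)-value is the two-sided tail probability of that statistic. The sketch below reproduces this from the printed numbers, assuming 392 observations (as stated in the assignment text) and therefore \(392 - 5 = 387\) residual degrees of freedom.

```r
# Estimates and standard errors copied from the output above.
est <- c(displacement = -0.006000871, horsepower = -0.043607731,
         weight = -0.005280508, acceleration = -0.023147999)
se <- c(displacement = 0.0067093055, horsepower = 0.0165734633,
        weight = 0.0008108541, acceleration = 0.1256011622)

# Residual degrees of freedom: observations minus estimated coefficients
# (assumes 392 rows and 5 coefficients including the intercept).
df_resid <- 392 - 5

# t-statistic and two-sided p-value for each coefficient.
t_stat <- est / se
p_val <- 2 * pt(-abs(t_stat), df = df_resid)
signif(p_val, 4)
```

The resulting values should agree with `summ_2$coefficients[2:5,4]` up to the rounding of the printed inputs.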
First take any random 40 data points from the entire auto data sample and perform the linear regression fit and measure the 95% confidence intervals. Then, take the entire data set (all 392 points) and perform linear regression and measure the 95% confidence intervals. Please report the resulting fit equation, their significance values and confidence intervals for each of the two runs.
set.seed(10014)
random_rows <- sample(1:nrow(auto.mpg), 40, replace = FALSE)
random_sample <- auto.mpg[random_rows, ]
fit_3 <- lm(mpg ~ displacement + horsepower + weight + acceleration, data = random_sample)
summ_3 <- summary(fit_3)
\[\hat{mpg_i} = 31.438 + 0.043 \cdot displacement_i + 0.030 \cdot horsepower_i - 0.013 \cdot weight_i + 1.333 \cdot acceleration_i\]
confint(fit_3, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 9.783414845 53.093025271
## displacement -0.004165032 0.089430522
## horsepower -0.119549172 0.178697785
## weight -0.019772731 -0.005664708
## acceleration 0.351195499 2.315083699
summ_3$coefficients[2:5,4] < 0.01
## displacement horsepower weight acceleration
## FALSE FALSE TRUE TRUE
summ_3$coefficients[2:5,4]
## displacement horsepower weight acceleration
## 0.0728557296 0.6896816398 0.0008234431 0.0092237539
\[\hat{mpg_i} = 45.251 - 0.006 \cdot displacement_i - 0.044 \cdot horsepower_i - 0.005 \cdot weight_i - 0.023 \cdot acceleration_i\]
confint(fit_2, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 40.422278855 50.080000544
## displacement -0.019192122 0.007190380
## horsepower -0.076193029 -0.011022433
## weight -0.006874738 -0.003686277
## acceleration -0.270094049 0.223798050
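The intervals returned by `confint` can be rebuilt by hand as estimate \(\pm\) critical \(t\) value times standard error. The sketch below does this for the full-data fit using the printed estimates and standard errors, again assuming 387 residual degrees of freedom (392 observations, 5 coefficients).

```r
# Estimates and standard errors copied from the fit_2 output.
est <- c(displacement = -0.006000871, horsepower = -0.043607731,
         weight = -0.005280508, acceleration = -0.023147999)
se <- c(displacement = 0.0067093055, horsepower = 0.0165734633,
        weight = 0.0008108541, acceleration = 0.1256011622)

# Critical value for a two-sided 95% interval (assumes 387 residual df).
df_resid <- 392 - 5
t_crit <- qt(0.975, df = df_resid)

# Lower and upper bounds: estimate -/+ t_crit * standard error.
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
ci
```

An interval that excludes zero corresponds to a coefficient that is significant at the 5% level, which is a looser threshold than the \(\alpha = 0.01\) used in the TRUE/FALSE checks.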
summ_2$coefficients[2:5,4] < 0.01
## displacement horsepower weight acceleration
## FALSE TRUE TRUE FALSE
summ_2$coefficients[2:5,4]
## displacement horsepower weight acceleration
## 3.716584e-01 8.848982e-03 2.302545e-10 8.538765e-01
Comparing the statistics computed from the random sample of 40 observations with those from the full data set (treated here as the population from which the sample was drawn) shows both similarities and differences. Weight is significant in both fits, and displacement is not significant in either. The significance of acceleration in the sample, however, is a false positive (Type I error) relative to the full-data result. Similarly, the non-significance of horsepower in the sample is a false negative (Type II error). The confidence intervals from the sample are also much wider than those from the full data set, reflecting the larger standard errors of the smaller sample.
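The difference in interval width can be quantified directly from the two `confint` outputs. The sketch below copies the printed bounds for the four slope coefficients and computes the ratio of sample width to full-data width. As a rough point of comparison only, standard errors tend to scale with \(1/\sqrt{n}\), which would suggest a factor near \(\sqrt{392/40}\); the actual ratios also depend on which rows were drawn and on the larger critical \(t\) value for the small sample.

```r
# 95% interval bounds copied from confint(fit_3) (40-row sample).
sample_ci <- rbind(displacement = c(-0.004165032, 0.089430522),
                   horsepower   = c(-0.119549172, 0.178697785),
                   weight       = c(-0.019772731, -0.005664708),
                   acceleration = c(0.351195499, 2.315083699))

# 95% interval bounds copied from confint(fit_2) (full data).
full_ci <- rbind(displacement = c(-0.019192122, 0.007190380),
                 horsepower   = c(-0.076193029, -0.011022433),
                 weight       = c(-0.006874738, -0.003686277),
                 acceleration = c(-0.270094049, 0.223798050))

# Ratio of interval widths: sample fit relative to full-data fit.
width_ratio <- (sample_ci[, 2] - sample_ci[, 1]) / (full_ci[, 2] - full_ci[, 1])
round(width_ratio, 2)

# Rule-of-thumb scaling from sample size alone.
sqrt(392 / 40)
```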