Using R’s lm function, perform regression analysis and measure the significance of the independent variables for the following two data sets.

1. Problem Set 1

In the first case, you are evaluating the commonly heard claim that a person's maximum heart rate is related to their age by the following equation:

\[MaxHR=220-Age\]

You have been given the following sample:

Age 18 23 25 35 65 54 34 56 72 19 23 42 18 39 37
MaxHR 202 186 187 180 156 169 174 172 153 199 193 174 198 183 178
  1. Perform a linear regression analysis fitting the Max Heart Rate to Age using the lm function in R.
Age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
MaxHR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)
fit_1 <- lm(MaxHR ~ Age, data = data.frame(Age, MaxHR))
summ_1 <- summary(fit_1)
  1. What is the resulting equation?
fit_1$coefficients
## (Intercept)         Age 
## 210.0484584  -0.7977266

\[\hat{MaxHR_i} = 210.048 - 0.798 \cdot Age_i\]

  1. Is the effect of Age on Max HR significant?
summ_1$coefficients[2,4] < 0.01 
## [1] TRUE

Yes. The \(p\)-value of the Age coefficient is below 0.01, so the effect of age on maximum heart rate is statistically significant at the \(\alpha = 0.01\) level.

  1. What is the significance level?
(p_1 <- summ_1$coefficients[2,4])
## [1] 3.847987e-08

The \(p\)-value of the Age coefficient, \(p = 3.847987 \times 10^{-8}\), is far below the chosen significance level of \(\alpha=0.01\).
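As a rough illustration of where this \(p\)-value comes from, the sketch below rebuilds it from the slope estimate and its standard error using a two-sided \(t\)-test on the residual degrees of freedom. The fit is re-created from the sample so the block runs on its own; the names `fit`, `coefs`, `t_age` and `p_manual` are illustrative and not part of the original analysis.

```r
# Sample data from Problem Set 1
Age   <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
MaxHR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)

fit   <- lm(MaxHR ~ Age)
coefs <- summary(fit)$coefficients

# t statistic for the slope: estimate divided by its standard error
t_age <- coefs["Age", "Estimate"] / coefs["Age", "Std. Error"]

# Two-sided p-value from the t distribution with n - 2 residual degrees of freedom
p_manual <- 2 * pt(abs(t_age), df = df.residual(fit), lower.tail = FALSE)

# Should agree with the value reported by summary()
all.equal(p_manual, coefs["Age", "Pr(>|t|)"])
```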

  1. Please also plot the fitted relationship between MaxHR and Age.
par(mfrow = c(2, 2))
plot(fit_1)
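
Calling `plot()` on an `lm` object draws diagnostic panels rather than the fitted line itself. A minimal sketch of one way to show the fitted relationship with base graphics follows: a scatter plot of the sample with the regression line overlaid. The axis labels, title and colour are arbitrary choices, and the fit is re-created here (as `fit`, an illustrative name) so the block is self-contained.

```r
# Sample data from Problem Set 1
Age   <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
MaxHR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)

fit <- lm(MaxHR ~ Age)

# Scatter plot of the observations
plot(Age, MaxHR,
     xlab = "Age", ylab = "MaxHR",
     main = "MaxHR vs. Age with fitted regression line")

# Overlay the fitted line using the intercept and slope stored in the model
abline(fit, col = "red", lwd = 2)
```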

Problem Set 1 Conclusions

At a significance level of \(\alpha = 0.01\), there is empirical evidence that the effect of age on maximum heart rate is significant. The data suggest a negative relationship: as age goes up, maximum heart rate goes down, by about 0.8 for each additional year of age in the fitted model. The adjusted \(R^2\) of approximately 0.9021 indicates that about 90% of the variance in maximum heart rate is accounted for by age.
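
The explained-variance figure quoted above can be read from the model summary. The sketch below extracts both the plain and the adjusted \(R^2\), so the two can be compared against the 90.21% in the conclusion. The fit is re-created from the sample, and `fit` and `summ` are illustrative names.

```r
# Sample data from Problem Set 1
Age   <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
MaxHR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)

fit  <- lm(MaxHR ~ Age)
summ <- summary(fit)

# Proportion of variance in MaxHR explained by the model
summ$r.squared

# Same quantity penalised for the number of predictors
summ$adj.r.squared

# For a single predictor, R^2 equals the squared sample correlation
all.equal(summ$r.squared, cor(Age, MaxHR)^2)
```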

2. Problem Set 2

Using the auto-mpg data set from Assignment 5 (attached here), perform a Linear Regression analysis using mpg as the dependent variable and the other 4 variables (displacement, horsepower, weight, acceleration) as independent variables.

\[Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 X_{4i} + \varepsilon_i\]

\[mpg_i = \beta_0 + \beta_1 \cdot displacement_i + \beta_2 \cdot horsepower_i + \beta_3 \cdot weight_i + \beta_4 \cdot acceleration_i + \varepsilon_i\]

url <- "https://raw.githubusercontent.com/jzuniga123/SPS/master/DATA%20605/auto-mpg.data"
auto.mpg <- read.table(url)
colnames(auto.mpg) <- c("displacement", "horsepower", "weight", "acceleration", "mpg")
fit_2 <- lm(mpg ~ displacement + horsepower + weight + acceleration, data = auto.mpg)
summ_2 <- summary(fit_2)

What is the final linear regression fit equation?

fit_2$coefficients
##  (Intercept) displacement   horsepower       weight acceleration 
## 45.251139699 -0.006000871 -0.043607731 -0.005280508 -0.023147999

\[\hat{mpg_i} = 45.251 - 0.006 \cdot displacement_i - 0.044 \cdot horsepower_i - 0.005 \cdot weight_i - 0.023 \cdot acceleration_i\]

Which of the 4 independent variables have a significant impact on mpg?

summ_2$coefficients[2:5,4] < 0.01 
## displacement   horsepower       weight acceleration 
##        FALSE         TRUE         TRUE        FALSE

What are their corresponding significance levels?

summ_2$coefficients[2:5,4]
## displacement   horsepower       weight acceleration 
## 3.716584e-01 8.848982e-03 2.302545e-10 8.538765e-01

What are the standard errors on each of the coefficients?

summ_2$coefficients[2:5,2]
## displacement   horsepower       weight acceleration 
## 0.0067093055 0.0165734633 0.0008108541 0.1256011622
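
The standard errors, \(t\) statistics and \(p\)-values all sit in the same coefficient matrix returned by `summary()`. The sketch below shows how the columns relate. It uses R's built-in `mtcars` data as a stand-in (an assumption made so the block runs offline; the analysis above uses the auto-mpg file), with `disp`, `hp`, `wt` and `qsec` loosely playing the roles of displacement, horsepower, weight and acceleration.

```r
# Stand-in data: built-in mtcars, NOT the auto-mpg file used in the report
fit <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)

# Columns: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- summary(fit)$coefficients

# Each t value is the estimate divided by its standard error
t_manual <- coef_table[, "Estimate"] / coef_table[, "Std. Error"]
all.equal(t_manual, coef_table[, "t value"])

# Standard errors of the slope coefficients, selected by name rather than position
coef_table[-1, "Std. Error"]
```

Selecting columns by name (`"Std. Error"`, `"Pr(>|t|)"`) instead of by index is a small design choice that keeps the intent visible when reading the code later.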

Experiment

First take any random 40 data points from the entire auto data sample and perform the linear regression fit and measure the 95% confidence intervals. Then, take the entire data set (all 392 points) and perform linear regression and measure the 95% confidence intervals. Please report the resulting fit equation, their significance values and confidence intervals for each of the two runs.
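
For reference, a 95% confidence interval for a coefficient is the estimate plus or minus a \(t\) critical value times the standard error. The sketch below builds the intervals by hand and checks them against `confint()`. It uses the built-in `mtcars` data as a stand-in for auto-mpg (an assumption so the block runs offline), and the names `est`, `se`, `t_crit` and `ci_manual` are illustrative.

```r
# Stand-in data: built-in mtcars, NOT the auto-mpg file used in the report
fit <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)

est <- coef(fit)
se  <- summary(fit)$coefficients[, "Std. Error"]

# Critical value for a two-sided 95% interval on the residual degrees of freedom
t_crit <- qt(0.975, df = df.residual(fit))

ci_manual <- cbind(lower = est - t_crit * se,
                   upper = est + t_crit * se)

# Should match the intervals computed by confint()
all.equal(unname(ci_manual), unname(confint(fit, level = 0.95)))
```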

Sample

set.seed(10014)
random_rows <- sample(seq_len(nrow(auto.mpg)), 40, replace = FALSE)
random_sample <- auto.mpg[random_rows, ]
fit_3 <- lm(mpg ~ displacement + horsepower + weight + acceleration, data = random_sample)
summ_3 <- summary(fit_3)

\[\hat{mpg_i} = 31.438 + 0.043 \cdot displacement_i + 0.030 \cdot horsepower_i - 0.013 \cdot weight_i + 1.333 \cdot acceleration_i\]

confint(fit_3, level = 0.95)
##                     2.5 %       97.5 %
## (Intercept)   9.783414845 53.093025271
## displacement -0.004165032  0.089430522
## horsepower   -0.119549172  0.178697785
## weight       -0.019772731 -0.005664708
## acceleration  0.351195499  2.315083699
summ_3$coefficients[2:5,4] < 0.01
## displacement   horsepower       weight acceleration 
##        FALSE        FALSE         TRUE         TRUE
summ_3$coefficients[2:5,4]
## displacement   horsepower       weight acceleration 
## 0.0728557296 0.6896816398 0.0008234431 0.0092237539

Population

\[\hat{mpg_i} = 45.251 - 0.006 \cdot displacement_i - 0.044 \cdot horsepower_i - 0.005 \cdot weight_i - 0.023 \cdot acceleration_i\]

confint(fit_2, level = 0.95)
##                     2.5 %       97.5 %
## (Intercept)  40.422278855 50.080000544
## displacement -0.019192122  0.007190380
## horsepower   -0.076193029 -0.011022433
## weight       -0.006874738 -0.003686277
## acceleration -0.270094049  0.223798050
summ_2$coefficients[2:5,4] < 0.01
## displacement   horsepower       weight acceleration 
##        FALSE         TRUE         TRUE        FALSE
summ_2$coefficients[2:5,4]
## displacement   horsepower       weight acceleration 
## 3.716584e-01 8.848982e-03 2.302545e-10 8.538765e-01

Problem Set 2 Conclusions

The statistics computed from the random sample of 40 observations and those obtained from the full data set of 392 observations (treated here as the population from which the sample was drawn) show both similarities and differences. Weight is significant at \(\alpha = 0.01\) in both the sample and the population, while displacement is significant in neither. Acceleration is significant in the sample but not in the population, so taking the population fit as the reference, that result is a false positive (Type I error). Conversely, horsepower is not significant in the sample but is significant in the population, a false negative (Type II error). The confidence intervals from the sample are also much wider than those from the population, which reflects the larger standard errors that come with a smaller sample.
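
The comparison of interval widths can be made explicit by putting the sample and full-data widths side by side. The following is a sketch under stated assumptions: it uses the built-in `mtcars` data as a stand-in for auto-mpg, draws 16 of its 32 rows with an arbitrary seed, and defines a hypothetical helper `ci_width()`. None of this is from the original analysis, and the numbers it prints will differ from those reported above.

```r
# Stand-in data: built-in mtcars, NOT the auto-mpg file used in the report
model <- mpg ~ disp + hp + wt + qsec

full_fit <- lm(model, data = mtcars)

set.seed(1)  # arbitrary seed, chosen only for reproducibility
rows    <- sample(seq_len(nrow(mtcars)), 16, replace = FALSE)
sub_fit <- lm(model, data = mtcars[rows, ])

# Hypothetical helper: width of each 95% confidence interval
ci_width <- function(fit) {
  ci <- confint(fit, level = 0.95)
  ci[, 2] - ci[, 1]
}

# One row per coefficient, one column per fit
widths <- cbind(sample = ci_width(sub_fit), full = ci_width(full_fit))
widths
```

Applied to `fit_3` and `fit_2`, the same helper would give the widths for the 40-row sample and the full auto-mpg data.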

References

https://archive.ics.uci.edu/ml/datasets/Auto+MPG