Using R’s lm function, perform regression analysis and measure the significance of the independent variables for the following two data sets. In the first case, you are evaluating the statement that we hear that Maximum Heart Rate of a person is related to their age by the following equation:
\[MaxHR = 220 - Age\]
You have been given the following sample:
Age: 18 23 25 35 65 54 34 56 72 19 23 42 18 39 37
MaxHR: 202 186 187 180 156 169 174 172 153 199 193 174 198 183 178
Perform a linear regression analysis fitting the Max Heart Rate to Age using the lm function in R. What is the resulting equation? Is the effect of Age on Max HR significant? What is the significance level? Please also plot the fitted relationship between Max HR and Age.
# Our first step is to enter the data as a data frame
age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
HR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)
data.set <- data.frame(age, HR)
View(data.set)
#now that we have our data, we can use lm to perform our linear regression.
lin.reg <- lm(HR ~ age, data = data.set)
lin.reg
##
## Call:
## lm(formula = HR ~ age, data = data.set)
##
## Coefficients:
## (Intercept) age
## 210.0485 -0.7977
Our output shows that the fitted equation is:
\[MaxHR = 210.0485 - 0.7977*age\]
To determine significance, we will use the summary function.
summary(lin.reg)
##
## Call:
## lm(formula = HR ~ age, data = data.set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9258 -2.5383 0.3879 3.1867 6.6242
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 210.04846 2.86694 73.27 < 2e-16 ***
## age -0.79773 0.06996 -11.40 3.85e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.578 on 13 degrees of freedom
## Multiple R-squared: 0.9091, Adjusted R-squared: 0.9021
## F-statistic: 130 on 1 and 13 DF, p-value: 3.848e-08
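From the output of the summary function, the p-value for age is 3.85e-08, so the effect of age on max heart rate is significant at the 0.001 level (the '***' code). The model also explains about 91% of the variance in max heart rate (Multiple R-squared: 0.9091).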
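The same numbers can be pulled out of the fitted model directly, which also allows a rough check of the 220 - Age rule itself. The sketch below refits the model so that it runs on its own; the names `coef.table`, `ci`, `slope.ok` and `intercept.ok` are introduced here for illustration only. The idea is that if the 95% confidence interval for the slope excludes -1, or the interval for the intercept excludes 220, this sample is not consistent with the rule at that level.

```r
# Refit the model from the sample above so this block is self-contained
age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
HR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)
lin.reg <- lm(HR ~ age)

# Coefficient table as a numeric matrix:
# columns are Estimate, Std. Error, t value, Pr(>|t|)
coef.table <- coef(summary(lin.reg))
p.age <- coef.table["age", "Pr(>|t|)"]

# 95% confidence intervals for the intercept and the slope
ci <- confint(lin.reg, level = 0.95)

# Does each interval contain the value claimed by MaxHR = 220 - Age?
slope.ok <- ci["age", 1] <= -1 && -1 <= ci["age", 2]
intercept.ok <- ci["(Intercept)", 1] <= 220 && 220 <= ci["(Intercept)", 2]

print(coef.table)
print(ci)
cat("p-value for age:", p.age, "\n")
cat("slope interval contains -1:", slope.ok, "\n")
cat("intercept interval contains 220:", intercept.ok, "\n")
```

Given the estimates and standard errors reported in the summary output, both checks would be expected to come out FALSE for this sample. Age is clearly related to max heart rate, but the fitted line is flatter and starts lower than the rule suggests.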
Our final step with this portion of the assignment is to plot.
require(ggplot2)
## Loading required package: ggplot2
ggplot(data.set, aes(x = age, y = HR)) + geom_point() + stat_smooth(level = .95, method = lm) + ggtitle("MAX HEART RATE VS. AGE") + xlab("AGE") + ylab("HEART RATE")
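The shaded band that `stat_smooth(level = .95, method = lm)` draws is the 95% confidence interval for the fitted mean, and the same values can be computed with `predict()`. The following is a minimal sketch that refits the model so it runs on its own; the age grid `new.ages` and the `rule` column are assumptions added for illustration, not part of the assignment.

```r
# Refit the model so this block is self-contained
age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
HR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)
data.set <- data.frame(age, HR)
lin.reg <- lm(HR ~ age, data = data.set)

# Ages at which to evaluate the fitted line (hypothetical grid for illustration)
new.ages <- data.frame(age = seq(20, 70, by = 10))

# Fitted value (fit) plus the 95% confidence interval for the mean response (lwr, upr)
band <- predict(lin.reg, newdata = new.ages, interval = "confidence", level = 0.95)

# Place the fitted line next to the 220 - Age rule for comparison
comparison <- cbind(new.ages, band, rule = 220 - new.ages$age)
print(comparison)
```

Comparing the `fit` and `rule` columns row by row shows where the fitted line and the rule of thumb diverge, which is harder to read off the plot alone.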
Using the Auto data set from Assignment 5 (also attached here) perform a Linear Regression analysis using mpg as the dependent variable and the other 4 (displacement, horsepower, weight, acceleration) as independent variables. What is the final linear regression fit equation? Which of the 4 independent variables have a significant impact on mpg? What are their corresponding significance levels? What are the standard errors on each of the coefficients? Please perform this experiment in two ways. First take any random 40 data points from the entire auto data sample and perform the linear regression fit and measure the 95% confidence intervals. Then, take the entire data set (all 392 points) and perform linear regression and measure the 95% confidence intervals. Please report the resulting fit equation, their significance values and confidence intervals for each of the two runs. Please submit an R-markdown file documenting your experiments. Your submission should include the final linear fits, and their corresponding significance levels. In addition, you should clearly state what you concluded from looking at the fit and their significance levels.
#our first step is to load our auto data
auto.data <- read.table('auto-mpg.data', col.names = c('DP', 'HP', 'WT', 'ACC', 'MPG'))
View(auto.data)
Now that we have our data frame, we can take a random sample of 40 data points from the entire sample, then perform the linear regression fit and measure the 95% confidence intervals.
set.seed(100) #this ensures that our results are reproducible
auto.data40 <- auto.data[sample(nrow(auto.data), 40),]
auto.linreg <- lm(MPG ~ DP + HP + WT + ACC, data = auto.data40)
auto.linreg
##
## Call:
## lm(formula = MPG ~ DP + HP + WT + ACC, data = auto.data40)
##
## Coefficients:
## (Intercept) DP HP WT ACC
## 51.577830 -0.038432 0.012317 -0.004231 -0.612764
summary(auto.linreg)
##
## Call:
## lm(formula = MPG ~ DP + HP + WT + ACC, data = auto.data40)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4075 -2.1309 -0.0787 1.7905 5.9743
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 51.577830 8.690585 5.935 9.41e-07 ***
## DP -0.038432 0.017507 -2.195 0.0349 *
## HP 0.012317 0.070821 0.174 0.8629
## WT -0.004231 0.002419 -1.749 0.0890 .
## ACC -0.612764 0.460233 -1.331 0.1917
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.054 on 35 degrees of freedom
## Multiple R-squared: 0.8064, Adjusted R-squared: 0.7843
## F-statistic: 36.45 on 4 and 35 DF, p-value: 5.004e-12
confint(auto.linreg, level=0.95)
## 2.5 % 97.5 %
## (Intercept) 33.935005899 69.2206550291
## DP -0.073972931 -0.0028916705
## HP -0.131456875 0.1560914416
## WT -0.009142027 0.0006798807
## ACC -1.547086269 0.3215579259
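The estimates, standard errors, p-values and interval bounds printed above can also be gathered into a single table, which makes it easier to see which terms are significant. The sketch below is illustrative only: because `auto-mpg.data` is not available here, it uses R's built-in `mtcars` data as a stand-in, renamed to the column names used in this report, with `qsec` (quarter-mile time) assumed as a proxy for acceleration. The names `stand.in`, `fit` and `report` are hypothetical.

```r
# Stand-in data: built-in mtcars, renamed to match the column names used above.
# Assumption: qsec (quarter-mile time) stands in for acceleration.
stand.in <- data.frame(MPG = mtcars$mpg, DP = mtcars$disp, HP = mtcars$hp,
                       WT = mtcars$wt, ACC = mtcars$qsec)
fit <- lm(MPG ~ DP + HP + WT + ACC, data = stand.in)

coef.table <- coef(summary(fit))   # Estimate, Std. Error, t value, Pr(>|t|)
ci <- confint(fit, level = 0.95)   # lower and upper 95% bounds per term

# One row per term: estimate, standard error, p-value and interval bounds
report <- data.frame(
  estimate = coef.table[, "Estimate"],
  std.error = coef.table[, "Std. Error"],
  p.value = coef.table[, "Pr(>|t|)"],
  lower = ci[, 1],
  upper = ci[, 2]
)

# A term is significant at the 5% level exactly when its 95% interval excludes zero
report$significant <- report$p.value < 0.05
report$excludes.zero <- report$lower > 0 | report$upper < 0
print(report)
```

Replacing `stand.in` with `auto.data40` or `auto.data` should give the corresponding table for either run. The last two columns always agree, since the t-test and the confidence interval are built from the same estimate and standard error.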
Our final linear regression fit equation is:
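\[mpg = 51.577830 - (0.038432*DP) + (0.012317*HP) - (0.004231*WT) - (0.612764*ACC)\]
From our sample of 40, only displacement is significant at the 0.05 level (p = 0.0349). Weight is marginal (p = 0.0890, significant only at the 0.1 level), while horsepower (p = 0.8629) and acceleration (p = 0.1917) are not significant. The confidence intervals agree: the interval for DP is the only one among the four predictors that excludes zero.
Our next order of business is to perform the same analysis on our whole data set.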
all.linreg <- lm(MPG ~ DP + HP + WT + ACC, data = auto.data)
all.linreg
##
## Call:
## lm(formula = MPG ~ DP + HP + WT + ACC, data = auto.data)
##
## Coefficients:
## (Intercept) DP HP WT ACC
## 45.251140 -0.006001 -0.043608 -0.005281 -0.023148
summary(all.linreg)
##
## Call:
## lm(formula = MPG ~ DP + HP + WT + ACC, data = auto.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.378 -2.793 -0.333 2.193 16.256
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.2511397 2.4560447 18.424 < 2e-16 ***
## DP -0.0060009 0.0067093 -0.894 0.37166
## HP -0.0436077 0.0165735 -2.631 0.00885 **
## WT -0.0052805 0.0008109 -6.512 2.3e-10 ***
## ACC -0.0231480 0.1256012 -0.184 0.85388
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.247 on 387 degrees of freedom
## Multiple R-squared: 0.707, Adjusted R-squared: 0.704
## F-statistic: 233.4 on 4 and 387 DF, p-value: < 2.2e-16
confint(all.linreg, level=0.95)
## 2.5 % 97.5 %
## (Intercept) 40.422278855 50.080000544
## DP -0.019192122 0.007190380
## HP -0.076193029 -0.011022433
## WT -0.006874738 -0.003686277
## ACC -0.270094049 0.223798050
Our final linear regression fit equation when we use the whole data set is:
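\[mpg = 45.2511397 - (0.0060009*DP) - (0.0436077*HP) - (0.0052805*WT) - (0.0231480*ACC)\]
When looking at the full data set, we see that horsepower (p = 0.00885) and weight (p = 2.3e-10) have a significant effect on mpg, while displacement (p = 0.37166) and acceleration (p = 0.85388) do not. Their 95% confidence intervals tell the same story: the intervals for HP and WT exclude zero, and those for DP and ACC do not.
I couldn't help but think about the importance of sample size. The standard errors from the 40-point sample are roughly three times larger than those from the full data set (for WT, 0.002419 versus 0.0008109), so its confidence intervals are much wider and its significance results less reliable. Ideally, you want a sample size that is large enough to show accurate results, and 40 may not have been sufficient here.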
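The effect of sample size on interval width can be illustrated with a small simulation. This is only a sketch on simulated data, not on the Auto data set: the two predictors `x1` and `x2`, the true coefficients and the noise level are all assumptions chosen for illustration. Only the sample sizes (392 and 40) mirror the two runs above.

```r
set.seed(100)  # makes the simulation reproducible

# Simulated stand-in for the full data set: 392 rows, two predictors
n.full <- 392
x1 <- rnorm(n.full)
x2 <- rnorm(n.full)
y <- 1 + 0.5 * x1 - 0.3 * x2 + rnorm(n.full, sd = 2)
sim <- data.frame(y, x1, x2)

# Random subsample of 40 rows, mirroring the first run
sim40 <- sim[sample(nrow(sim), 40), ]

fit.full <- lm(y ~ x1 + x2, data = sim)
fit.40 <- lm(y ~ x1 + x2, data = sim40)

# Width of each 95% confidence interval (upper bound minus lower bound)
ci.width <- function(fit) {
  ci <- confint(fit, level = 0.95)
  ci[, 2] - ci[, 1]
}

widths <- cbind(n40 = ci.width(fit.40), n392 = ci.width(fit.full))
print(widths)
```

Interval widths shrink roughly in proportion to one over the square root of the sample size, so going from 40 to 392 points should narrow them by about a factor of three. That is in line with the difference in standard errors between the two runs reported above.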