Using R’s lm function, perform regression analysis and measure the significance of the independent variables for the following two data sets. In the first case, you are evaluating the commonly heard statement that a person’s maximum heart rate is related to their age by the following equation: MaxHR = 220 - Age

Age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
MaxHR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)

Heart_Age <- data.frame(Age, MaxHR)

Let’s create a scatter plot

plot(MaxHR ~ Age, data = Heart_Age,
  xlab = "Age",
  ylab = "Max Heart Rate",
  main = "Heart rates with age"
)

The graph suggests that there is a linear relationship between max heart rate and age. Let’s fit a linear regression.

Heart_Age_lg <- lm(MaxHR ~ Age, data = Heart_Age)
summary(Heart_Age_lg)
## 
## Call:
## lm(formula = MaxHR ~ Age, data = Heart_Age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9258 -2.5383  0.3879  3.1867  6.6242 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 210.04846    2.86694   73.27  < 2e-16 ***
## Age          -0.79773    0.06996  -11.40 3.85e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.578 on 13 degrees of freedom
## Multiple R-squared:  0.9091, Adjusted R-squared:  0.9021 
## F-statistic:   130 on 1 and 13 DF,  p-value: 3.848e-08

From the summary, the fitted equation is MaxHR = 210.04846 - 0.79773 × Age

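The residual standard error is 4.578 on 13 degrees of freedom. It measures the typical distance between an observed MaxHR and the value predicted by the fitted line.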
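The residual standard error can be reproduced by hand, which may make its meaning clearer. The sketch below rebuilds the data and model from above; the names `fit`, `rss` and `rse_manual` are introduced here only for illustration.

```r
Age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
MaxHR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)
fit <- lm(MaxHR ~ Age)

# Residual standard error = sqrt(residual sum of squares / residual degrees of freedom)
rss <- sum(residuals(fit)^2)
rse_manual <- sqrt(rss / df.residual(fit))

rse_manual  # manual calculation
sigma(fit)  # the same quantity extracted from the fitted model
```

Both lines should print the value that summary() labels "Residual standard error".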

Take as the null hypothesis that there is no linear relationship between the two variables, that is, the slope on Age is zero. We test it at a significance level of 0.05.

According to the summary, the p-value for Age is 3.85e-08 (3.848e-08 for the overall F-test), which is far below 0.05. Therefore we reject the null hypothesis and conclude that there is a significant relationship between the two variables.
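Rejecting a zero slope shows that age matters, but it does not by itself confirm the specific rule MaxHR = 220 - Age. One way to check that, sketched here under the assumption that 95% confidence intervals are an acceptable yardstick, is to ask whether the intervals for the intercept and slope contain 220 and -1. The names `fit`, `ci` and `contains` are illustrative.

```r
Age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
MaxHR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)
fit <- lm(MaxHR ~ Age)

# 95% confidence intervals for the intercept and the slope
ci <- confint(fit, level = 0.95)
ci

# Does each interval contain the value implied by MaxHR = 220 - Age?
contains <- c(
  intercept_220 = ci["(Intercept)", 1] <= 220 && 220 <= ci["(Intercept)", 2],
  slope_minus_1 = ci["Age", 1] <= -1 && -1 <= ci["Age", 2]
)
contains
```

Given the estimates and standard errors in the summary, neither interval is expected to contain the rule-of-thumb value, so this small sample points to a lower intercept and a shallower decline than one beat per year of age.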

Let’s look at the correlation coefficient

cor(Age, MaxHR) 
## [1] -0.9534656

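The correlation coefficient of -0.9535 is close to -1, which indicates a strong negative linear association between Age and MaxHR. Its square, 0.9091, equals the Multiple R-squared in the summary, so Age explains about 91% of the variance in MaxHR in this sample.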
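The link between the correlation coefficient and the regression output can be verified directly. This is a small sketch that rebuilds the data; `r`, `fit` and `ct` are illustrative names. In a regression with a single predictor, the squared correlation is the R-squared, and the test of zero correlation gives the same p-value as the t-test on the slope.

```r
Age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
MaxHR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)

r <- cor(Age, MaxHR)
fit <- lm(MaxHR ~ Age)

r^2                     # squared correlation
summary(fit)$r.squared  # Multiple R-squared from the model

# cor.test() tests the null hypothesis that the true correlation is zero
ct <- cor.test(Age, MaxHR)
ct$p.value
summary(fit)$coefficients["Age", "Pr(>|t|)"]
```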

Let’s look at the predicted MaxHR versus Age and plot the fitted relationship.

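# fitted(Heart_Age_lg) returns the predicted MaxHR for each observed Age
plot(MaxHR ~ Age, data = Heart_Age,
  xlab = "Age",
  ylab = "Max Heart Rate",
  main = "Heart rates with age"
)
# Overlay the regression line from the model fitted earlier instead of refitting it
abline(Heart_Age_lg)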
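Beyond drawing the line, the fitted model can produce predictions with uncertainty bounds. The sketch below uses predict() with a 95% prediction interval; the ages 20, 40 and 60 and the names `new_ages` and `pred` are hypothetical choices made for illustration.

```r
Age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
MaxHR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)
Heart_Age <- data.frame(Age, MaxHR)
Heart_Age_lg <- lm(MaxHR ~ Age, data = Heart_Age)

# Hypothetical ages at which to predict MaxHR
new_ages <- data.frame(Age = c(20, 40, 60))

# "prediction" intervals cover a single new observation, not just the mean response
pred <- predict(Heart_Age_lg, newdata = new_ages,
                interval = "prediction", level = 0.95)
cbind(new_ages, pred)
```

The columns `fit`, `lwr` and `upr` hold the point prediction and the interval bounds for each age.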

Problem 2

Let’s import the data set

#filepath <- c("https://raw.githubusercontent.com/nobieyi00/CUNY_MSDA_R/master/auto-mpg.data")
filepath <- c("C:/Users/Mezue/Downloads/assign11/assign11/auto-mpg.data")

Auto_mpg <- read.table(filepath, header = FALSE, sep = "")

colnames(Auto_mpg) <- c('displacement', 'horsepower', 'weight', 'acceleration', 'mpg')

Sample of 40 Rows

Get a random subset of 40 rows for the linear regression analysis.

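# Draw 40 distinct row indices. as.integer(runif(...)) can repeat rows
# and can never select the last row, so sample() is used instead.
random.40 <- sample(nrow(Auto_mpg), 40)

Auto_mpg_40 <- Auto_mpg[random.40, ]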
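Because the rows are drawn at random, every run produces a different sample and therefore different coefficients. A minimal sketch of a reproducible draw follows. It assumes an arbitrary seed (123 is only a placeholder) and uses 392 as a stand-in for nrow(Auto_mpg), taken from the 387 residual degrees of freedom plus 5 coefficients in the full-data fit; `n_rows`, `idx` and `idx_again` are illustrative names.

```r
n_rows <- 392  # stand-in for nrow(Auto_mpg); an assumption, see the text above

set.seed(123)  # placeholder seed; any fixed integer makes the draw repeatable
idx <- sample(n_rows, 40)  # 40 distinct indices between 1 and n_rows

# Resetting the same seed reproduces exactly the same sample
set.seed(123)
idx_again <- sample(n_rows, 40)

identical(idx, idx_again)  # TRUE
anyDuplicated(idx)         # 0, because sample() draws without replacement by default
```

Fixing the seed before sampling keeps the printed summary, the written-out equation and the confidence intervals consistent from one run to the next.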

Let’s analyze how displacement, horsepower, weight, and acceleration relate to mpg.

Auto_mpg_40_fit <- lm(mpg ~ displacement + horsepower + weight + acceleration, data = Auto_mpg_40)
Auto_mpg_40_fit_lm <- summary(Auto_mpg_40_fit)
Auto_mpg_40_fit_lm
## 
## Call:
## lm(formula = mpg ~ displacement + horsepower + weight + acceleration, 
##     data = Auto_mpg_40)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8579 -1.9092 -0.4054  1.6305 10.8672 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  50.477375   7.261129   6.952 4.41e-08 ***
## displacement -0.001399   0.023008  -0.061   0.9519    
## horsepower   -0.125737   0.068366  -1.839   0.0744 .  
## weight       -0.003810   0.002254  -1.690   0.0999 .  
## acceleration -0.135412   0.344677  -0.393   0.6968    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.354 on 35 degrees of freedom
## Multiple R-squared:  0.8296, Adjusted R-squared:  0.8101 
## F-statistic: 42.59 on 4 and 35 DF,  p-value: 5.526e-13

mpg = 50.477375 - 0.001399 × displacement - 0.125737 × horsepower - 0.003810 × weight - 0.135412 × acceleration

Which of the 4 independent variables have a significant impact on mpg, assuming a significance level of 0.05?

p_values <- Auto_mpg_40_fit_lm$coefficients[2:5,"Pr(>|t|)"]
p_values[which(p_values < .05)]
## named numeric(0)

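No predictor is significant at the 0.05 level in this sample: the filter returns an empty vector. Horsepower (p = 0.0744) and weight (p = 0.0999) come closest and would be significant only at the 0.10 level. The overall F-test is still highly significant (p-value 5.526e-13), so the predictors jointly explain mpg even though no individual coefficient clears the threshold.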

What are the standard errors on each of the coefficients?

Auto_mpg_40_fit_lm$coefficients[2:5,"Std. Error"]
## displacement   horsepower       weight acceleration 
##  0.023008157  0.068366461  0.002254439  0.344676843

Measure the 95% confidence intervals.

confint(Auto_mpg_40_fit, level=0.95)
##                     2.5 %       97.5 %
## (Intercept)  35.736500604 6.521825e+01
## displacement -0.048107637 4.531045e-02
## horsepower   -0.264528650 1.305394e-02
## weight       -0.008387234 7.662727e-04
## acceleration -0.835142927 5.643195e-01
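Each interval above is the estimate plus or minus a t critical value times the standard error. As a sketch, the weight interval can be rebuilt from the rounded figures printed in the 40-row summary (estimate -0.003810, standard error 0.002254, 35 residual degrees of freedom); the variable names are illustrative, and small differences from confint() come from rounding.

```r
estimate  <- -0.003810  # weight coefficient from the 40-row summary
std_error <- 0.002254   # its standard error
df_resid  <- 35         # residual degrees of freedom

# Two-sided 95% interval: estimate +/- t(0.975, df) * standard error
t_crit <- qt(0.975, df = df_resid)
ci_weight <- estimate + c(-1, 1) * t_crit * std_error
ci_weight
```

The upper bound is above zero, which matches the finding that weight is not significant at the 0.05 level in this sample.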

Entire Dataset

Auto_mpg_fit <- lm(mpg ~ displacement + horsepower + weight + acceleration, data = Auto_mpg)
Auto_mpg_fit_lm <- summary(Auto_mpg_fit)
Auto_mpg_fit_lm
## 
## Call:
## lm(formula = mpg ~ displacement + horsepower + weight + acceleration, 
##     data = Auto_mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.378  -2.793  -0.333   2.193  16.256 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45.2511397  2.4560447  18.424  < 2e-16 ***
## displacement -0.0060009  0.0067093  -0.894  0.37166    
## horsepower   -0.0436077  0.0165735  -2.631  0.00885 ** 
## weight       -0.0052805  0.0008109  -6.512  2.3e-10 ***
## acceleration -0.0231480  0.1256012  -0.184  0.85388    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.247 on 387 degrees of freedom
## Multiple R-squared:  0.707,  Adjusted R-squared:  0.704 
## F-statistic: 233.4 on 4 and 387 DF,  p-value: < 2.2e-16

mpg = 45.2511397 - 0.0060009 × displacement - 0.0436077 × horsepower - 0.0052805 × weight - 0.0231480 × acceleration

Which of the 4 independent variables have a significant impact on mpg, assuming a significance level of 0.05?

p_values <- Auto_mpg_fit_lm$coefficients[2:5,"Pr(>|t|)"]
p_values[which(p_values < .05)]
##   horsepower       weight 
## 8.848982e-03 2.302545e-10

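We can conclude that weight and horsepower have a significant impact on mpg at the 0.05 level, with weight the stronger of the two (p = 2.3e-10 versus 0.00885). Displacement and acceleration are not significant.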

What are the standard errors on each of the coefficients?

Auto_mpg_fit_lm$coefficients[2:5,"Std. Error"]
## displacement   horsepower       weight acceleration 
## 0.0067093055 0.0165734633 0.0008108541 0.1256011622

Measure the 95% confidence intervals.

confint(Auto_mpg_fit, level=0.95)
##                     2.5 %       97.5 %
## (Intercept)  40.422278855 50.080000544
## displacement -0.019192122  0.007190380
## horsepower   -0.076193029 -0.011022433
## weight       -0.006874738 -0.003686277
## acceleration -0.270094049  0.223798050

Conclusion

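In conclusion, the larger dataset gives more precise estimates: the standard errors are smaller and the 95% confidence intervals are narrower. For example, the standard error of the weight coefficient falls from 0.00225 with 40 rows to 0.00081 with the entire dataset, and the intervals for weight and horsepower no longer include zero. More data does not raise R-squared here (0.8296 on the sample versus 0.707 on the full data), but it does make the coefficient estimates more reliable.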
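The gain in precision can be quantified from the standard errors reported in the two summaries. The sketch below divides the 40-row standard errors by the full-data ones and, as a rough yardstick only, compares the ratios with sqrt(392 / 40), the shrinkage expected if precision scaled purely with the square root of the sample size. The figure 392 is inferred from 387 residual degrees of freedom plus 5 coefficients, and the variable names are illustrative.

```r
# Standard errors copied from the two summaries
se_40 <- c(displacement = 0.023008157, horsepower = 0.068366461,
           weight = 0.002254439, acceleration = 0.344676843)
se_full <- c(displacement = 0.0067093055, horsepower = 0.0165734633,
             weight = 0.0008108541, acceleration = 0.1256011622)

# How many times larger each standard error is in the 40-row sample
se_ratio <- se_40 / se_full
round(se_ratio, 2)

# Rough benchmark: shrinkage expected from sample size alone
sqrt(392 / 40)
```

Every ratio is above 1, and the ratios vary around the benchmark because the random sample also differs from the full data in the spread and correlation of its predictors.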