Numeric Response Variable: Blood Pressure

Numeric Explanatory Variable: Age

Loading the dataset

library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.3
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
heartdata1 <- read_csv("workableheartdataset.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   age = col_double(),
##   sex = col_double(),
##   cp = col_double(),
##   trestbps = col_double(),
##   chol = col_double(),
##   fbs = col_double(),
##   restecg = col_double(),
##   thalach = col_double(),
##   exang = col_double(),
##   oldpeak = col_double(),
##   slope = col_double(),
##   ca = col_double(),
##   thal = col_double(),
##   target = col_double(),
##   sexfactor = col_character(),
##   riskfactor = col_character(),
##   exangfactor = col_character(),
##   fbsfactor = col_character()
## )
head(heartdata1)
## # A tibble: 6 x 19
##      X1   age   sex    cp trestbps  chol   fbs restecg thalach exang oldpeak
##   <dbl> <dbl> <dbl> <dbl>    <dbl> <dbl> <dbl>   <dbl>   <dbl> <dbl>   <dbl>
## 1     1    63     1     3      145   233     1       0     150     0     2.3
## 2     2    37     1     2      130   250     0       1     187     0     3.5
## 3     3    41     0     1      130   204     0       0     172     0     1.4
## 4     4    56     1     1      120   236     0       1     178     0     0.8
## 5     5    57     0     0      120   354     0       1     163     1     0.6
## 6     6    57     1     0      140   192     0       1     148     0     0.4
## # … with 8 more variables: slope <dbl>, ca <dbl>, thal <dbl>, target <dbl>,
## #   sexfactor <chr>, riskfactor <chr>, exangfactor <chr>, fbsfactor <chr>

Simple linear regression analysis

heartmod <- lm(trestbps ~ age, data = heartdata1)
heartmod
## 
## Call:
## lm(formula = trestbps ~ age, data = heartdata1)
## 
## Coefficients:
## (Intercept)          age  
##    102.2961       0.5394
summary(heartmod)
## 
## Call:
## lm(formula = trestbps ~ age, data = heartdata1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.439 -11.499  -1.044  10.192  67.495 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 102.2961     5.8906  17.366  < 2e-16 ***
## age           0.5394     0.1069   5.048 7.76e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.87 on 301 degrees of freedom
## Multiple R-squared:  0.07804,    Adjusted R-squared:  0.07497 
## F-statistic: 25.48 on 1 and 301 DF,  p-value: 7.762e-07
# Null hypothesis: Beta1 (slope for age) = 0
# Alternative hypothesis: Beta1 (slope for age) ≠ 0
# Fitted model: Blood Pressure = 102.2961 + 0.5394(age)
# Reference distribution: t-distribution with 301 degrees of freedom
# Test statistic: t = 5.048
# P-value: 7.76e-07

With a p-value of 7.76*10^-7, well below the 0.05 significance level, we reject the null hypothesis. There is convincing evidence of a linear relationship between age and resting blood pressure: each additional year of age is associated with an estimated increase of about 0.54 in resting blood pressure, on average.
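As a rough check on the test above, the t statistic, two-sided p-value, and a 95% confidence interval for the slope can be recomputed from the rounded estimate and standard error printed by summary(heartmod). This is a minimal sketch: the variable names are illustrative, and because the inputs are rounded, the results should only approximately match the reported t = 5.048 and p = 7.76e-07.

```r
# Values copied from the summary(heartmod) output (rounded as printed)
estimate  <- 0.5394   # estimated slope for age
std_error <- 0.1069   # standard error of the slope
df_resid  <- 301      # residual degrees of freedom

# t statistic: the estimate divided by its standard error
t_stat <- estimate / std_error

# Two-sided p-value from the t distribution with 301 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

# 95% confidence interval for the slope
t_crit <- qt(0.975, df = df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_error

print(t_stat)
print(p_value)
print(ci)
```

If the interval excludes 0, it agrees with rejecting the null hypothesis at the 0.05 level. With the fitted model object available, confint(heartmod) would give the same interval without rounding error.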

ANOVA table

anova(heartmod)
## Analysis of Variance Table
## 
## Response: trestbps
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## age         1   7249  7248.9  25.477 7.762e-07 ***
## Residuals 301  85642   284.5                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Total DF: 302
# Total Sum of Squares: 92891
# R Squared: 7249/92891 = 0.0780377

The R-squared value indicates that only about 8% of the variance in resting blood pressure is explained by its linear relationship with age. The remaining 92% or so is left unexplained by this model.
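The quantities reported by summary(heartmod) can be recovered from the ANOVA table alone. The sketch below uses the sums of squares and degrees of freedom as printed (so results are subject to rounding), with illustrative variable names, to rebuild R-squared, adjusted R-squared, the F statistic, and the residual standard error.

```r
# Values copied from the anova(heartmod) table (rounded as printed)
ss_age   <- 7249     # sum of squares explained by age
ss_resid <- 85642    # residual sum of squares
df_age   <- 1
df_resid <- 301

# Total sum of squares and total degrees of freedom
ss_total <- ss_age + ss_resid
df_total <- df_age + df_resid

# R-squared: share of total variation explained by the model
r_squared <- ss_age / ss_total

# Adjusted R-squared penalizes for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * df_total / df_resid

# F statistic: mean square for age over mean square of residuals
ms_resid <- ss_resid / df_resid
f_stat   <- (ss_age / df_age) / ms_resid

# Residual standard error: square root of the residual mean square
rse <- sqrt(ms_resid)

print(c(r_squared = r_squared, adj_r_squared = adj_r_squared,
        f_stat = f_stat, rse = rse))
```

In simple linear regression the F statistic is the square of the slope's t statistic, which is why both tests report the same p-value.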

Diagnostic plots

plot(heartmod)

Summary

The residuals vs. fitted plot supports the assumptions of linearity, zero-mean errors, and constant spread: it shows no curved pattern, the residuals are distributed roughly evenly above and below the line y=0, and the band of residuals has roughly constant width. The normal Q-Q plot, which compares the quantiles of the standardized residuals with those of a normal distribution, suggests that the residuals are approximately normal. There are some minor departures from normality in the tails, but most points lie close to the diagonal line. Lastly, the residuals vs. leverage plot, which shows standardized residuals against leverage along with Cook’s distance contours, does not flag any observations with high influence on the overall model fit. It is therefore fair to conclude that these plots reveal no serious violations of the model assumptions.

The R^2 value of about 8%, however, shows that the model explains only a small portion of the variance in the response. So, while the diagnostic plots do not highlight any problems with the model, the low coefficient of determination means that age accounts for only a small share of the variability in resting blood pressure.
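The visual checks can be complemented with the numeric quantities behind the plots. The sketch below is only illustrative: since the CSV is not reproduced here, it fits the same kind of model to simulated data. The sample size of 303 follows from the 301 residual degrees of freedom, but the age range (29 to 77) and the simulated noise are assumptions, not values taken from the heart data. The cutoffs used (leverage above 2p/n, absolute standardized residual above 3, Cook's distance above 0.5) are common rules of thumb rather than formal tests.

```r
set.seed(1)

# Simulated stand-in for the heart data (assumed age range and noise level)
n        <- 303
age      <- round(runif(n, min = 29, max = 77))
trestbps <- 102.3 + 0.54 * age + rnorm(n, sd = 16.9)
sim_mod  <- lm(trestbps ~ age)

# Quantities underlying the residuals vs. leverage plot
lev     <- hatvalues(sim_mod)        # leverage of each observation
std_res <- rstandard(sim_mod)        # standardized residuals
cooks_d <- cooks.distance(sim_mod)   # Cook's distance

# Rule-of-thumb flags; p = 2 coefficients (intercept and slope)
p <- length(coef(sim_mod))
high_leverage <- which(lev > 2 * p / n)
large_resid   <- which(abs(std_res) > 3)
influential   <- which(cooks_d > 0.5)

print(length(high_leverage))
print(length(large_resid))
print(length(influential))
```

Replacing sim_mod with heartmod would apply the same checks to the actual fit. An empty influential set would agree with the reading of the residuals vs. leverage plot given above.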