Numeric Response Variable: Blood Pressure
Numeric Explanatory Variable: Age
Loading the dataset
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.3
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
heartdata1<-read_csv("workableheartdataset.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_double(),
## age = col_double(),
## sex = col_double(),
## cp = col_double(),
## trestbps = col_double(),
## chol = col_double(),
## fbs = col_double(),
## restecg = col_double(),
## thalach = col_double(),
## exang = col_double(),
## oldpeak = col_double(),
## slope = col_double(),
## ca = col_double(),
## thal = col_double(),
## target = col_double(),
## sexfactor = col_character(),
## riskfactor = col_character(),
## exangfactor = col_character(),
## fbsfactor = col_character()
## )
head(heartdata1)
## # A tibble: 6 x 19
## X1 age sex cp trestbps chol fbs restecg thalach exang oldpeak
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 63 1 3 145 233 1 0 150 0 2.3
## 2 2 37 1 2 130 250 0 1 187 0 3.5
## 3 3 41 0 1 130 204 0 0 172 0 1.4
## 4 4 56 1 1 120 236 0 1 178 0 0.8
## 5 5 57 0 0 120 354 0 1 163 1 0.6
## 6 6 57 1 0 140 192 0 1 148 0 0.4
## # … with 8 more variables: slope <dbl>, ca <dbl>, thal <dbl>, target <dbl>,
## # sexfactor <chr>, riskfactor <chr>, exangfactor <chr>, fbsfactor <chr>
Simple linear regression analysis
heartmod<-lm(trestbps~age, heartdata1)
heartmod
##
## Call:
## lm(formula = trestbps ~ age, data = heartdata1)
##
## Coefficients:
## (Intercept) age
## 102.2961 0.5394
summary(heartmod)
##
## Call:
## lm(formula = trestbps ~ age, data = heartdata1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.439 -11.499 -1.044 10.192 67.495
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 102.2961 5.8906 17.366 < 2e-16 ***
## age 0.5394 0.1069 5.048 7.76e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.87 on 301 degrees of freedom
## Multiple R-squared: 0.07804, Adjusted R-squared: 0.07497
## F-statistic: 25.48 on 1 and 301 DF, p-value: 7.762e-07
# Null hypothesis: Beta1 (slope for age) = 0
# Alternative hypothesis: Beta1 (slope for age) ≠ 0
# Fitted model: Blood Pressure = 102.2961 + 0.5394(age)
# Reference distribution: t-distribution with 301 degrees of freedom
# Test statistic: t = 5.048
# P-value: 7.76e-07
At the 0.05 significance level, we reject the null hypothesis (p = 7.76 × 10^-7). There is convincing evidence of a linear relationship between age and resting blood pressure: each additional year of age is associated with an estimated 0.54-unit increase in mean resting blood pressure.
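As a check on the test above, the t statistic and p-value can be recomputed by hand from the coefficient table. The sketch below is illustrative only: it uses the rounded estimates printed by summary(heartmod) rather than the fitted object, so results agree with the reported values only up to rounding. The 95% confidence interval and the example age of 50 are additions for illustration and are not part of the original analysis.

```r
# Values copied from the summary(heartmod) output (rounded as printed)
intercept <- 102.2961
slope     <- 0.5394
slope_se  <- 0.1069
df_resid  <- 301

# t statistic for H0: Beta1 = 0
t_stat <- slope / slope_se

# Two-sided p-value from the t distribution with 301 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

# 95% confidence interval for the slope (not reported in the original output)
t_crit   <- qt(0.975, df = df_resid)
slope_ci <- slope + c(-1, 1) * t_crit * slope_se

# Fitted mean resting blood pressure at a hypothetical age of 50
age_example <- 50
fitted_bp   <- intercept + slope * age_example

print(c(t = t_stat, p = p_value))
print(slope_ci)
print(fitted_bp)
```

With the fitted model available in a session, confint(heartmod) and predict(heartmod, data.frame(age = 50)) should give the same quantities without rounding error.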
ANOVA table
anova(heartmod)
## Analysis of Variance Table
##
## Response: trestbps
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 7249 7248.9 25.477 7.762e-07 ***
## Residuals 301 85642 284.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Total DF: 302
# Total Sum of Squares: 92891
# R Squared: 7249/92891 = 0.0780377
The R Squared value indicates that only about 8% of the variance in resting blood pressure is explained by its linear relationship with age. The remaining 92% or so is left unexplained by this model.
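The quantities in the ANOVA table are tied together by simple arithmetic, which the following sketch reproduces from the printed sums of squares. It assumes only the rounded values shown in the table, so small rounding differences from the summary(heartmod) output are expected.

```r
# Sums of squares and degrees of freedom copied from anova(heartmod)
ss_age   <- 7249
ss_resid <- 85642
df_age   <- 1
df_resid <- 301

# Totals
ss_total <- ss_age + ss_resid
df_total <- df_age + df_resid

# R squared: share of total variation explained by age
r_squared <- ss_age / ss_total

# Adjusted R squared penalizes for the number of predictors
adj_r_squared <- 1 - (ss_resid / df_resid) / (ss_total / df_total)

# F statistic: mean square for age over residual mean square
ms_resid <- ss_resid / df_resid
f_stat   <- (ss_age / df_age) / ms_resid

# Residual standard error is the square root of the residual mean square
resid_se <- sqrt(ms_resid)

# In simple linear regression, sqrt(F) equals |t| for the slope
t_from_f <- sqrt(f_stat)

print(round(c(r_squared = r_squared, adj_r_squared = adj_r_squared,
              f_stat = f_stat, resid_se = resid_se, t_from_f = t_from_f), 4))
```

The last line shows why the F test in the ANOVA table and the t test for the slope report the same p-value: with a single predictor they are the same test.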
Summary
The residual plot, which shows residuals against fitted values, supports the assumptions of linearity, mean-zero errors, and constant spread: it shows no pattern of non-linearity, the residuals are distributed fairly evenly above and below the line y=0, and they fall within a band of roughly constant width. The QQ plot, which compares the quantiles of the residuals with those of a normal distribution, suggests that the residuals are approximately normal. There are some minor departures from normality in the tails, but most points lie close to the diagonal line. Lastly, the Residuals vs. Leverage plot, which plots residuals against leverage with Cook’s distance contours, does not show any outliers with high influence on the overall model fit. It is therefore fair to conclude from these plots that the model assumptions are not seriously violated. However, the R^2 value of ~8% means that the model explains only a small portion of the variance in the response, so it is not an accurate fit for this data. While the diagnostic plots do not highlight any problems with the model, the low coefficient of determination means that age accounts for only a small share of the variability in resting blood pressure.
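The code that produced the diagnostic plots discussed above is not shown in this report. One common way to draw them is base R's plot() method for lm objects, sketched below. Because the CSV file is not bundled here, the example fits the model to simulated stand-in data: the data frame sim, the model sim_mod, the age range of 30 to 75, and the noise level are all assumptions chosen to loosely resemble the fitted model, not the real heart data.

```r
# Simulated stand-in data (NOT the real heart dataset)
set.seed(1)
n   <- 303  # 301 residual df + 2 estimated coefficients
sim <- data.frame(age = round(runif(n, min = 30, max = 75)))
sim$trestbps <- 102.3 + 0.54 * sim$age + rnorm(n, sd = 16.9)

# Same model form as heartmod
sim_mod <- lm(trestbps ~ age, data = sim)

# Panels 1, 2 and 5 of plot.lm:
#   1 = Residuals vs Fitted, 2 = Normal Q-Q, 5 = Residuals vs Leverage
par(mfrow = c(1, 3))
plot(sim_mod, which = c(1, 2, 5))
par(mfrow = c(1, 1))

# Numeric companions to the Residuals vs Leverage panel
max_cook     <- max(cooks.distance(sim_mod))
max_leverage <- max(hatvalues(sim_mod))
print(c(max_cook = max_cook, max_leverage = max_leverage))
```

In a session where the real model exists, replacing sim_mod with heartmod should reproduce the three plots described in the summary.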