Data: Age and Cholesterol
Source: https://www.kaggle.com/ronitf/heart-disease-uci
The analysis examines the relationship between age and cholesterol. The independent variable is Age and the dependent variable is Cholesterol.
data <- read.csv(file="https://raw.githubusercontent.com/zahirf/Data605/master/heart.csv", header=TRUE, sep=",")
head(data, 10)
## ï..age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1 63 1 3 145 233 1 0 150 0 2.3 0 0 1
## 2 37 1 2 130 250 0 1 187 0 3.5 0 0 2
## 3 41 0 1 130 204 0 0 172 0 1.4 2 0 2
## 4 56 1 1 120 236 0 1 178 0 0.8 2 0 2
## 5 57 0 0 120 354 0 1 163 1 0.6 2 0 2
## 6 57 1 0 140 192 0 1 148 0 0.4 1 0 1
## 7 56 0 1 140 294 0 0 153 0 1.3 1 0 2
## 8 44 1 1 120 263 0 1 173 0 0.0 2 0 3
## 9 52 1 2 172 199 1 1 162 0 0.5 2 0 3
## 10 57 1 2 150 168 0 1 174 0 1.6 2 0 2
## target
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
## 7 1
## 8 1
## 9 1
## 10 1
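# The first column is age. A byte-order mark in the file garbles its name (shown as "ï..age" above), so rename it by position.
names(data)[1] <- "Age"
names(data)[names(data) == "chol"] <- "Cholesterol"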
The scatter plot shows a slight positive relationship between the two variables.
plot(data$Age, data$Cholesterol, xlab = "Age", ylab = "Cholesterol", main="Age vs Cholesterol")
Let us test the relationship further with a linear model.
The model summary shows that R-squared is very low at about 4.6%, meaning that Age explains only about 4.6% of the variation in Cholesterol. The p-values of the coefficients are, however, very small (0.000179 for Age), so the positive association is statistically significant even though it is weak.
fit <- lm(Cholesterol ~ Age, data = data)
summary(fit)
##
## Call:
## lm(formula = Cholesterol ~ Age, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -123.476 -32.560 -5.745 28.024 302.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 179.9675 17.7116 10.161 < 2e-16 ***
## Age 1.2194 0.3213 3.795 0.000179 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.72 on 301 degrees of freedom
## Multiple R-squared: 0.04566, Adjusted R-squared: 0.04249
## F-statistic: 14.4 on 1 and 301 DF, p-value: 0.0001786
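To make the meaning of R-squared concrete, the sketch below computes it by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and checks that it equals the squared correlation between the two variables. It uses only the first ten rows printed by head() above so that it runs on its own. The names demo and demo_fit are illustrative, and the numbers from this small subset will differ from the full 303-observation fit.

```r
# First ten rows of the data set, copied from the head() output above.
# This is an illustrative subset only; the full fit uses 303 observations.
demo <- data.frame(
  Age         = c(63, 37, 41, 56, 57, 57, 56, 44, 52, 57),
  Cholesterol = c(233, 250, 204, 236, 354, 192, 294, 263, 199, 168)
)
demo_fit <- lm(Cholesterol ~ Age, data = demo)

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res <- sum(resid(demo_fit)^2)
ss_tot <- sum((demo$Cholesterol - mean(demo$Cholesterol))^2)
r2_manual <- 1 - ss_res / ss_tot

# With a single predictor, R-squared equals the squared Pearson correlation
r2_cor <- cor(demo$Age, demo$Cholesterol)^2

# All three values should agree
c(manual = r2_manual, from_cor = r2_cor, from_summary = summary(demo_fit)$r.squared)
```

In the session above, cor(data$Age, data$Cholesterol)^2 should likewise reproduce the Multiple R-squared of 0.04566.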
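The intercept is 179.9675 and the slope is 1.2194, so each additional year of age is associated with an average increase of about 1.22 in cholesterol.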
Equation
Cholesterol= 179.9675 + 1.2194 * Age
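As a small worked example, the equation can be wrapped in a function to obtain a point prediction. The coefficients are copied from the summary output; the function name predict_cholesterol and the example age of 50 are chosen here only for illustration.

```r
# Coefficients copied (rounded) from the summary(fit) output
intercept <- 179.9675
slope     <- 1.2194

# Point prediction from the fitted line: Cholesterol = intercept + slope * Age
predict_cholesterol <- function(age) {
  intercept + slope * age
}

predict_cholesterol(50)
# 179.9675 + 1.2194 * 50 = 240.9375
```

In the session above, predict(fit, newdata = data.frame(Age = 50)) should give nearly the same value, differing only because the coefficients here are rounded.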
plot(data$Age, data$Cholesterol, xlab = "Age", ylab = "Cholesterol", main="Age vs Cholesterol")
abline(fit)
The plots below show that the residuals are not evenly scattered around zero: their variability appears to increase toward the right of the fitted-values plot. The Q-Q plot also shows the residuals moving away from the line at both ends, so the residuals are not normally distributed.
plot(fitted(fit), resid(fit))
qqnorm(resid(fit))
qqline(resid(fit))
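The visual checks above can be complemented with formal tests. The sketch below is one possible approach, not part of the original analysis: it applies the Shapiro-Wilk test for normality of the residuals and a studentized (Koenker) Breusch-Pagan test for constant variance, written with base R only. The helper bp_check and the ten-row subset demo are hypothetical names introduced here so that the example runs on its own. With so few rows the results are purely illustrative.

```r
# First ten rows of the data set, copied from the head() output above.
# This is an illustrative subset only; the full fit uses 303 observations.
demo <- data.frame(
  Age         = c(63, 37, 41, 56, 57, 57, 56, 44, 52, 57),
  Cholesterol = c(233, 250, 204, 236, 354, 192, 294, 263, 199, 168)
)
demo_fit <- lm(Cholesterol ~ Age, data = demo)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(resid(demo_fit))

# Studentized Breusch-Pagan test (hypothetical helper):
# regress the squared residuals on the model's predictors; under constant
# variance, n * R^2 of that auxiliary regression is approximately
# chi-squared with (number of predictors) degrees of freedom.
bp_check <- function(model) {
  X   <- model.matrix(model)          # design matrix, including the intercept
  e2  <- resid(model)^2               # squared residuals
  aux <- lm.fit(X, e2)                # auxiliary regression
  r2  <- 1 - sum(aux$residuals^2) / sum((e2 - mean(e2))^2)
  stat <- length(e2) * r2
  df   <- ncol(X) - 1
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df, lower.tail = FALSE))
}

bp <- bp_check(demo_fit)
sw$p.value   # small values suggest non-normal residuals
bp           # a small p.value suggests non-constant variance
```

In the session above, the same checks would be run as shapiro.test(resid(fit)) and bp_check(fit) on the full model.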
The model violates the homoscedasticity (constant variance) assumption of linear regression, and its residuals are not normally distributed. This linear model is therefore not an appropriate description of the relationship between Age and Cholesterol and should not be used.