Data: Age and Cholesterol

Source: https://www.kaggle.com/ronitf/heart-disease-uci

One factor Linear Regression

The analysis examines the relationship between Age and Cholesterol. Age is the independent variable and Cholesterol is the dependent variable.

Load data

data <- read.csv(file="https://raw.githubusercontent.com/zahirf/Data605/master/heart.csv", header=TRUE, sep=",")
head(data, 10)
##    ï..age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1      63   1  3      145  233   1       0     150     0     2.3     0  0    1
## 2      37   1  2      130  250   0       1     187     0     3.5     0  0    2
## 3      41   0  1      130  204   0       0     172     0     1.4     2  0    2
## 4      56   1  1      120  236   0       1     178     0     0.8     2  0    2
## 5      57   0  0      120  354   0       1     163     1     0.6     2  0    2
## 6      57   1  0      140  192   0       1     148     0     0.4     1  0    1
## 7      56   0  1      140  294   0       0     153     0     1.3     1  0    2
## 8      44   1  1      120  263   0       1     173     0     0.0     2  0    3
## 9      52   1  2      172  199   1       1     162     0     0.5     2  0    3
## 10     57   1  2      150  168   0       1     174     0     1.6     2  0    2
##    target
## 1       1
## 2       1
## 3       1
## 4       1
## 5       1
## 6       1
## 7       1
## 8       1
## 9       1
## 10      1

Rename columns

# Rename the first column by position: its mangled name ("ï..age" here) varies by platform
names(data)[1]<-"Age"
names(data)[names(data)=="chol"]<-"Cholesterol"
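
The odd first column name, ï..age, is most likely the trace of a UTF-8 byte-order mark at the start of the CSV, which read.csv folds into the header on some platforms. A minimal sketch, assuming that is the cause: reading with fileEncoding = "UTF-8-BOM" drops the mark so the column comes through as plain age. The temporary file and the demo object below are made up for illustration and stand in for heart.csv.

```r
# Build a tiny CSV that begins with a UTF-8 byte-order mark (bytes EF BB BF).
# The file and its two rows are illustrative only, not the real heart.csv.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("age,chol\n63,233\n37,250\n"), con)
close(con)

# "UTF-8-BOM" asks R to strip the mark while reading.
demo <- read.csv(tmp, header = TRUE, fileEncoding = "UTF-8-BOM")

# The columns can now be renamed by their clean names.
names(demo)[names(demo) == "age"] <- "Age"
names(demo)[names(demo) == "chol"] <- "Cholesterol"
print(names(demo))

unlink(tmp)  # remove the temporary file
```

The same fileEncoding argument could be passed to the read.csv call that loads the real data, in which case the first column would be named age rather than ï..age.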

Plot Age and Cholesterol

The scatter plot shows a slight positive relationship between the two variables.

plot(data$Age, data$Cholesterol, xlab = "Age", ylab = "Cholesterol", main="Age vs Cholesterol")

Linear model

Let us test the relationship further with a linear model.

R-squared is very low, about 0.046: Age explains only about 4.6% of the variation in Cholesterol. The p-values of both coefficients are nevertheless very small (0.000179 for the Age slope), so there is strong evidence that the slope is not zero. The relationship is statistically significant but weak.

fit <- lm(Cholesterol ~ Age, data = data)
summary(fit)
## 
## Call:
## lm(formula = Cholesterol ~ Age, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -123.476  -32.560   -5.745   28.024  302.330 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 179.9675    17.7116  10.161  < 2e-16 ***
## Age           1.2194     0.3213   3.795 0.000179 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.72 on 301 degrees of freedom
## Multiple R-squared:  0.04566,    Adjusted R-squared:  0.04249 
## F-statistic:  14.4 on 1 and 301 DF,  p-value: 0.0001786
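
To read these quantities from the fitted object rather than from the printout, something along the following lines may help. It uses a small simulated data frame (demo, generated with an arbitrary seed) as a stand-in for the heart data, so its values will not match the output above. The point is only where R-squared and the slope p-value are stored, and that in a one-predictor model R-squared equals the squared Pearson correlation.

```r
# Simulated stand-in for the heart data (hypothetical values, arbitrary seed).
set.seed(1)
demo <- data.frame(Age = round(runif(300, 29, 77)))
demo$Cholesterol <- 180 + 1.2 * demo$Age + rnorm(300, sd = 50)

demo_fit <- lm(Cholesterol ~ Age, data = demo)
s <- summary(demo_fit)

# R-squared and the p-value of the Age slope, taken from the summary object.
r_squared <- s$r.squared
slope_p <- s$coefficients["Age", "Pr(>|t|)"]

# With a single predictor, R-squared is the squared correlation.
r <- cor(demo$Age, demo$Cholesterol)

cat("R-squared:     ", r_squared, "\n")
cat("cor squared:   ", r^2, "\n")
cat("slope p-value: ", slope_p, "\n")
```

Applied to the model above, an R-squared of 0.04566 corresponds to a correlation of about 0.21 between Age and Cholesterol.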

Plotting the linear model

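The intercept is 179.9675 and the slope is 1.2194, so each additional year of age is associated with an average increase of about 1.22 in Cholesterol.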

Equation

Cholesterol = 179.9675 + 1.2194 * Age

plot(data$Age, data$Cholesterol, xlab = "Age", ylab = "Cholesterol", main="Age vs Cholesterol")
abline(fit)
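
As a worked use of the equation, the sketch below plugs an age into the fitted line. The helper predict_chol is a hypothetical name introduced here, and the coefficients are the rounded values from the summary. Given the low R-squared, the result is only the estimated mean Cholesterol at that age, with wide scatter around it.

```r
# Rounded coefficients copied from the model summary.
intercept <- 179.9675
slope <- 1.2194

# Hypothetical helper: estimated mean Cholesterol at a given age.
predict_chol <- function(age) intercept + slope * age

predict_chol(50)  # 179.9675 + 1.2194 * 50 = 240.9375
```

With the fitted object itself, predict(fit, newdata = data.frame(Age = 50)) should give the same figure up to rounding of the coefficients.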

Plot of residuals

The plot of residuals against fitted values below shows that the residuals are not evenly scattered around zero: their variability appears to increase toward the right. The Q-Q plot also shows the residuals moving away from the reference line at both ends, so the residuals are not normally distributed.

plot(fitted(fit), resid(fit))

qqnorm(resid(fit))
qqline(resid(fit))
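
The visual checks can be complemented with formal tests. The sketch below is one possible approach and is not part of the original analysis: it runs shapiro.test on the residuals for normality and computes the studentized (Koenker) form of the Breusch-Pagan statistic by hand, as n times the R-squared from regressing the squared residuals on Age. The data frame demo is simulated with variance that grows with Age, purely for illustration; for the real model one would substitute fit and data.

```r
# Simulated stand-in with variance increasing in Age (hypothetical values).
set.seed(42)
n <- 300
demo <- data.frame(Age = round(runif(n, 29, 77)))
demo$Cholesterol <- 180 + 1.2 * demo$Age + rnorm(n, sd = 0.8 * demo$Age)

demo_fit <- lm(Cholesterol ~ Age, data = demo)
res <- resid(demo_fit)

# Normality of residuals: Shapiro-Wilk test (null hypothesis: normal).
normality <- shapiro.test(res)

# Constant variance: studentized Breusch-Pagan statistic computed by hand.
# Regress squared residuals on the predictor; n * R-squared is compared
# with a chi-squared distribution with 1 degree of freedom (one predictor).
demo$res_sq <- res^2
aux <- lm(res_sq ~ Age, data = demo)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

cat("Shapiro-Wilk p-value:  ", normality$p.value, "\n")
cat("Breusch-Pagan statistic:", bp_stat, "\n")
cat("Breusch-Pagan p-value:  ", bp_p, "\n")
```

Small p-values from either test would support what the plots suggest; large ones would mean the plots alone are not conclusive.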

Summary

The residuals violate the constant-variance (homoscedasticity) assumption of linear regression and also depart from normality. This model of the relationship between Age and Cholesterol is therefore not appropriate and should not be used.