Introduction

heartStudy <- read.csv("https://raw.githubusercontent.com/GauravPadawe/Framingham-Heart-Study/master/framingham.csv")

heartStudy.0 = heartStudy    # make a copy of the data for data cleansing 
heartStudy = na.omit(heartStudy.0)       # Delete all records with missing components
y0=heartStudy$TenYearCHD

The Framingham Heart Study is a long-term prospective study of the etiology of cardiovascular disease among a population of free-living subjects in the community of Framingham, Massachusetts. The Framingham Heart Study was a landmark study in epidemiology in that it was the first prospective study of cardiovascular disease and identified the concept of risk factors and their joint effects FHS Longitudinal Data Document. The study began in 1948 with 5,209 adult subjects, and is now on its third generation of participants.

Our data set includes 4240 subjects. After removing subjects with missing data for convenience, we are left with 3,658 subjects.

The variables in the data set are as follows:

male: the gender of the subjects
age: Age of subject at time of medical examination in years.
Education: A categorical variable of the participants education, with the levels: Some high school(1), high school/GED(2), some college/vocational school(3), college(4)
currentSmoker: Whether or not the subject currently smokes at time of examinations
cigsPerDay: Number of cigarettes the subject smokes per day
BPMeds: whether or not the subject takes anti-hypertensive medication at exam
prevalentStroke: Prevalent Stroke (0=free of disease)
prevalentHyp: Prevalent Hypertensive. Subject was defined as hypertensive if treated
diabetes: Diabetic according to criteria of first exam treated
totChol: Total cholesterol (mg/dl)
sysBP: Systolic Blood Pressure (mmHg)
diaBP: Diastolic blood pressure (mmHg)
BMI: body mass index, weight(kg)/height(m)^2
heartRate: Heart Rate (beats/minute)
glucose: Blood glucose level (mg/dL)
TenYearCHD(response variable): The 10 year risk of coronary heart disease(CHD)

Many studies have shown that variables like blood pressure, cholesterol, diabetes, smoking, BMI, and physical inactivity are good predictors of heart disease. The objective of this study is to explore the association between cholesterol and heart disease

According to Robert S Rosenson, a professor of Medicine,

A total cholesterol level of less than 200 mg/dL (5.17 mmol/L) is normal.
A total cholesterol level of 200 to 239 mg/dL (5.17 to 6.18 mmol/L) is borderline high.
A total cholesterol level of 240 mg/dL (6.21 mmol/L) or greater is high.

Building a Simple Logistic Regression

ylimit = max(density(heartStudy$totChol)$y)
hist(heartStudy$totChol, probability = TRUE, main = "Cholesterol Distribution", xlab="", 
       col = "azure1", border="lightseagreen")
  lines(density(heartStudy$totChol, adjust=2), col="blue")

A histogram plotting the variable totChol, Total Cholesterol, shows the variable is not extremely skewed; this is a good sign.

Since we are only assessing the relationship between the response and one predictor variable, we do not need to worry about imbalance, and can proceed with creating a logistic regression model.

s.logit = glm(TenYearCHD ~ totChol, 
          family = binomial(link = "logit"),  # family is the binomial, logit(p) = log(p/(1-p))!
          data = heartStudy)                    # the data frame is a subset of the original iris data
# summary(s.logit)

model.coef.stats = summary(s.logit)$coef       # output stats of coefficients
conf.ci = confint(s.logit)                     # confidence intervals of betas

## Waiting for profiling to be done...

sum.stats = cbind(model.coef.stats, conf.ci.95=conf.ci)   # rounding off decimals
kable(sum.stats,caption = "The summary stats of regression coefficients")

The summary stats of regression coefficients
	Estimate	Std. Error	z value	Pr(>\|z\|)	2.5 %	97.5 %
(Intercept)	-3.0428382	0.2501341	-12.164827	0	-3.5355625	-2.5544337
totChol	0.0055114	0.0010065	5.475914	0	0.0035358	0.0074841

From the above table, we can see that total Cholesterol is positively associated with the status of diabetes since β1=0.00551 with a p-value close to 0. The 95% confidence interval [0.0035358, 0.0074841]. This also supports the results of the research in the literature.

It is more common to interpret the association results from a practical perspective using the odds ratio. Next, we convert the estimated regression coefficients to the odds ratio.

model.coef.stats = summary(s.logit)$coef
odds.ratio = exp(coef(s.logit))
out.stats = cbind(model.coef.stats, odds.ratio = odds.ratio)                 
kable(out.stats,caption = "Summary Stats with Odds Ratios")

Summary Stats with Odds Ratios
	Estimate	Std. Error	z value	Pr(>\|z\|)	odds.ratio
(Intercept)	-3.0428382	0.2501341	-12.164827	0	0.0476993
totChol	0.0055114	0.0010065	5.475914	0	1.0055266

The odds ratio associated with BMI is 1.00552 meaning that as the BMI increases by one unit, the odds of being tested positive for diabetes increase by about 0.55%. This is a practically significant risk factor for diabetes.

Success Probability

bmi.range = range(heartStudy$totChol)
x = seq(bmi.range[1], bmi.range[2], length = 200)
beta.x = coef(s.logit)[1] + coef(s.logit)[2]*x
success.prob = exp(beta.x)/(1+exp(beta.x))
failure.prob = 1/(1+exp(beta.x))
ylimit = max(success.prob, failure.prob)
##
beta1 = coef(s.logit)[2]
success.prob.rate = beta1*exp(beta.x)/(1+exp(beta.x))^2
##
##
par(mfrow = c(1,2))
plot(x, success.prob, type = "l", lwd = 2, col = "navy",
     main = "The probability of being \n  tested positive in Ten Year Heart Disease Risk", 
     ylim=c(0, 1.1*ylimit),
     xlab = "Total Cholesterol",
     ylab = "probability",
     axes = FALSE,
     col.main = "navy",
     cex.main = 0.8)
# lines(x, failure.prob,lwd = 2, col = "darkred")
axis(1, pos = 0)
axis(2)
# legend(30, 1, c("Success Probability", "Failure Probability"), lwd = rep(2,2), 
#       col = c("navy", "darkred"), cex = 0.7, bty = "n")
##
y.rate = max(success.prob.rate)
plot(x, success.prob.rate, type = "l", lwd = 2, col = "navy",
     main = "The rate of change in the probability \n  of being tested positive in Ten Year Heart Disease Risk", 
     xlab = "Total Cholesterol",
     ylab = "Rate of Change",
     ylim=c(0,1.1*y.rate),
     axes = FALSE,
     col.main = "navy",
     cex.main = 0.8
     )
axis(1, pos = 0)
axis(2)

The success probability curve describes the probability of a positive test as Total Cholesterol increases. We can also plot the rate of change of this probability, as we see on the right hand side, where the turning point occurs at roughly 550 Total Cholesterol.

Conclusion

At this point, we have observed the relationship between Cholesterol and risk of Ten year CHD, which looks to be a positive relationship.

Week6 Assignment

Zack Shin

2023-03-06

Introduction

Building a Simple Logistic Regression

Success Probability

Conclusion