\[Research~ protocol\]
Research protocol
The way of evaluating the different variables must be taken into account, for example the character variables to have clarity in the form of classification to homogenize criteria and avoid disorder in the database. For example, the gender variable, mark male 1 and female 2. Likewise, think about the correct way to classify with the other variables that are qualitative.
Before starting the data analysis, check again and clean the database.
\[*Data~ cleaning*\]
To carry out a homogeneous analysis of the database, the data must first be homogenized and corrections made to standardize the variables and performances different score in according values or classification.
For example this data base has different values or names in classification variables. Is important standardize it.
An case that is necessary, we make a new variables in somes cases with other classification.
The following variables will be cleaned. - SEX - TOTCHOL - AGE - CURSMOKE - SYSBP - EDUC - BMI - TIMECVD - CVD
A new variable needs to be made for BMI (3 categories) and TOTCHOL (2 categories).
step 1 of data cleaning: -summary
The following variables show weird values:
SEX is not clean, AGE has a max of 600. Total chol has a possible unrealistic max, SYSBP has a max of 295 (VERY VERY VERY HIGH) BMI has a max of 241.90
summary(data_all) #resume all data (not cleaning)
## RANDID SEX TOTCHOL AGE
## Min. : 2448 Length:11628 Min. :107.0 Min. : 32.00
## 1st Qu.:2474378 Class :character 1st Qu.:210.0 1st Qu.: 48.00
## Median :5006008 Mode :character Median :238.0 Median : 54.00
## Mean :5004741 Mean :241.2 Mean : 54.88
## 3rd Qu.:7472730 3rd Qu.:268.0 3rd Qu.: 62.00
## Max. :9999312 Max. :696.0 Max. :600.00
## NA's :1 NA's :410 NA's :1
## SYSBP DIABP CURSMOKE CIGPDAY
## Min. : 83.5 Min. : 30.00 Min. :0.0000 Min. : 0.00
## 1st Qu.:120.0 1st Qu.: 75.00 1st Qu.:0.0000 1st Qu.: 0.00
## Median :132.0 Median : 82.00 Median :0.0000 Median : 0.00
## Mean :136.3 Mean : 83.04 Mean :0.4325 Mean : 8.25
## 3rd Qu.:149.0 3rd Qu.: 90.00 3rd Qu.:1.0000 3rd Qu.:20.00
## Max. :295.0 Max. :150.00 Max. :1.0000 Max. :90.00
## NA's :1 NA's :1 NA's :1 NA's :80
## BMI DIABETES BPMEDS HEARTRTE
## Min. : 14.43 Min. :0.00000 Min. :0.0000 Min. : 37.00
## 1st Qu.: 23.09 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.: 69.00
## Median : 25.48 Median :0.00000 Median :0.0000 Median : 75.00
## Mean : 25.90 Mean :0.04558 Mean :0.0856 Mean : 76.78
## 3rd Qu.: 28.07 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.: 85.00
## Max. :241.90 Max. :1.00000 Max. :1.0000 Max. :220.00
## NA's :53 NA's :1 NA's :594 NA's :7
## GLUCOSE educ PREVCHD PREVAP
## Min. : 39.00 Min. :1.00 Min. :0.00000 Min. :0.00000
## 1st Qu.: 72.00 1st Qu.:1.00 1st Qu.:0.00000 1st Qu.:0.00000
## Median : 80.00 Median :2.00 Median :0.00000 Median :0.00000
## Mean : 84.12 Mean :1.99 Mean :0.07242 Mean :0.05393
## 3rd Qu.: 89.00 3rd Qu.:3.00 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :478.00 Max. :4.00 Max. :1.00000 Max. :1.00000
## NA's :1441 NA's :296 NA's :1 NA's :1
## PREVMI PREVSTRK PREVHYP TIME
## Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. : 0
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.: 0
## Median :0.00000 Median :0.00000 Median :0.0000 Median :2156
## Mean :0.03217 Mean :0.01307 Mean :0.4596 Mean :1957
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:4252
## Max. :1.00000 Max. :1.00000 Max. :1.0000 Max. :4854
## NA's :1 NA's :1 NA's :1 NA's :1
## PERIOD HDLC LDLC DEATH
## Min. :1.000 Min. : 10.00 Min. : 20.0 Min. :0.0000
## 1st Qu.:1.000 1st Qu.: 39.00 1st Qu.:145.0 1st Qu.:0.0000
## Median :2.000 Median : 48.00 Median :173.0 Median :0.0000
## Mean :1.899 Mean : 49.37 Mean :176.5 Mean :0.3033
## 3rd Qu.:3.000 3rd Qu.: 58.00 3rd Qu.:205.0 3rd Qu.:1.0000
## Max. :3.000 Max. :189.00 Max. :565.0 Max. :1.0000
## NA's :1 NA's :8601 NA's :8602 NA's :1
## ANGINA HOSPMI MI_FCHD ANYCHD
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :0.0000 Median :0.0000
## Mean :0.1636 Mean :0.09925 Mean :0.1538 Mean :0.2716
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.0000
## NA's :1 NA's :1 NA's :1 NA's :1
## STROKE CVD HYPERTEN TIMEAP
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. : 0
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:6224
## Median :0.00000 Median :0.0000 Median :1.0000 Median :8766
## Mean :0.09125 Mean :0.2493 Mean :0.7433 Mean :7242
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:8766
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :8766
## NA's :1 NA's :1 NA's :1 NA's :1
## TIMEMI TIMEMIFC TIMECHD TIMESTRK TIMECVD
## Min. : 0 Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.:7212 1st Qu.:7050 1st Qu.:5598 1st Qu.:7295 1st Qu.:6004
## Median :8766 Median :8766 Median :8766 Median :8766 Median :8766
## Mean :7594 Mean :7543 Mean :7008 Mean :7661 Mean :7166
## 3rd Qu.:8766 3rd Qu.:8766 3rd Qu.:8766 3rd Qu.:8766 3rd Qu.:8766
## Max. :8766 Max. :8766 Max. :8766 Max. :8766 Max. :8766
## NA's :1 NA's :1 NA's :1 NA's :1 NA's :1
## TIMEDTH TIMEHYP
## Min. : 26 Min. : 0
## 1st Qu.:7798 1st Qu.: 0
## Median :8766 Median :2429
## Mean :7854 Mean :3599
## 3rd Qu.:8766 3rd Qu.:7329
## Max. :8766 Max. :8766
## NA's :1 NA's :1
# We select the period 1 because have more lectures or data.
data <- data_all[data_all$PERIOD == 1, ]
summary(data)
## RANDID SEX TOTCHOL AGE
## Min. : 2448 Length:4435 Min. :107 Min. :32.00
## 1st Qu.:2440336 Class :character 1st Qu.:206 1st Qu.:42.00
## Median :4972848 Mode :character Median :234 Median :49.00
## Mean :4987278 Mean :237 Mean :49.93
## 3rd Qu.:7463577 3rd Qu.:264 3rd Qu.:57.00
## Max. :9999312 Max. :696 Max. :70.00
## NA's :1 NA's :53 NA's :1
## SYSBP DIABP CURSMOKE CIGPDAY
## Min. : 83.5 Min. : 48.00 Min. :0.0000 Min. : 0.000
## 1st Qu.:117.5 1st Qu.: 75.00 1st Qu.:0.0000 1st Qu.: 0.000
## Median :129.0 Median : 82.00 Median :0.0000 Median : 0.000
## Mean :132.9 Mean : 83.08 Mean :0.4919 Mean : 8.966
## 3rd Qu.:144.0 3rd Qu.: 90.00 3rd Qu.:1.0000 3rd Qu.:20.000
## Max. :295.0 Max. :142.50 Max. :1.0000 Max. :70.000
## NA's :1 NA's :1 NA's :1 NA's :33
## BMI DIABETES BPMEDS HEARTRTE
## Min. :15.54 Min. :0.00000 Min. :0.00000 Min. : 44.00
## 1st Qu.:23.09 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.: 68.00
## Median :25.45 Median :0.00000 Median :0.00000 Median : 75.00
## Mean :25.85 Mean :0.02729 Mean :0.03293 Mean : 75.89
## 3rd Qu.:28.09 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.: 83.00
## Max. :56.80 Max. :1.00000 Max. :1.00000 Max. :143.00
## NA's :20 NA's :1 NA's :62 NA's :2
## GLUCOSE educ PREVCHD PREVAP
## Min. : 40.00 Min. :1.000 Min. :0.00000 Min. :0.00000
## 1st Qu.: 72.00 1st Qu.:1.000 1st Qu.:0.00000 1st Qu.:0.00000
## Median : 78.00 Median :2.000 Median :0.00000 Median :0.00000
## Mean : 82.19 Mean :1.976 Mean :0.04375 Mean :0.03315
## 3rd Qu.: 87.00 3rd Qu.:3.000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :394.00 Max. :4.000 Max. :1.00000 Max. :1.00000
## NA's :398 NA's :114 NA's :1 NA's :1
## PREVMI PREVSTRK PREVHYP TIME PERIOD
## Min. :0.0000 Min. :0.000000 Min. :0.0000 Min. :0 Min. :1
## 1st Qu.:0.0000 1st Qu.:0.000000 1st Qu.:0.0000 1st Qu.:0 1st Qu.:1
## Median :0.0000 Median :0.000000 Median :0.0000 Median :0 Median :1
## Mean :0.0194 Mean :0.007217 Mean :0.3225 Mean :0 Mean :1
## 3rd Qu.:0.0000 3rd Qu.:0.000000 3rd Qu.:1.0000 3rd Qu.:0 3rd Qu.:1
## Max. :1.0000 Max. :1.000000 Max. :1.0000 Max. :0 Max. :1
## NA's :1 NA's :1 NA's :1 NA's :1 NA's :1
## HDLC LDLC DEATH ANGINA
## Min. : NA Min. : NA Min. :0.0000 Min. :0.0000
## 1st Qu.: NA 1st Qu.: NA 1st Qu.:0.0000 1st Qu.:0.0000
## Median : NA Median : NA Median :0.0000 Median :0.0000
## Mean :NaN Mean :NaN Mean :0.3496 Mean :0.1635
## 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. : NA Max. : NA Max. :1.0000 Max. :1.0000
## NA's :4435 NA's :4435 NA's :1 NA's :1
## HOSPMI MI_FCHD ANYCHD STROKE
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.00000
## Mean :0.1024 Mean :0.1649 Mean :0.2797 Mean :0.09359
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## NA's :1 NA's :1 NA's :1 NA's :1
## CVD HYPERTEN TIMEAP TIMEMI TIMEMIFC
## Min. :0.0000 Min. :0.0000 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:5348 1st Qu.:6224 1st Qu.:6058
## Median :0.0000 Median :1.0000 Median :8766 Median :8766 Median :8766
## Mean :0.2609 Mean :0.7334 Mean :6901 Mean :7240 Mean :7186
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:8766 3rd Qu.:8766 3rd Qu.:8766
## Max. :1.0000 Max. :1.0000 Max. :8766 Max. :8766 Max. :8766
## NA's :1 NA's :1 NA's :1 NA's :1 NA's :1
## TIMECHD TIMESTRK TIMECVD TIMEDTH TIMEHYP
## Min. : 0 Min. : 0 Min. : 0 Min. : 26 Min. : 0
## 1st Qu.:4767 1st Qu.:6345 1st Qu.:5145 1st Qu.:6974 1st Qu.: 0
## Median :8766 Median :8766 Median :8766 Median :8766 Median :2188
## Mean :6666 Mean :7305 Mean :6818 Mean :7506 Mean :3436
## 3rd Qu.:8766 3rd Qu.:8766 3rd Qu.:8766 3rd Qu.:8766 3rd Qu.:6934
## Max. :8766 Max. :8766 Max. :8766 Max. :8766 Max. :8766
## NA's :1 NA's :1 NA's :1 NA's :1 NA's :1
Now sex will be cleaned.
The variablesex that 1= male and 2 = female
data$sex_adj <- 1 #creating a new variable, with all 1's
data$sex_adj[data$SEX == "male"] <- 1
data$sex_adj[data$SEX == "Male"] <- 1
data$sex_adj[data$SEX == "Man"] <- 1
data$sex_adj[data$SEX == "man"] <- 1
data$sex_adj[data$SEX == "2"] <- 2
data$sex_adj[data$SEX == "Female"] <- 2
data$sex_adj[data$SEX == "female"] <- 2
data$sex_adj[data$SEX == " "] <- NA
data$sex_adj[is.na(data$SEX)] <- NA
unique(data$sex_adj)
## [1] 1 2 NA
summary(data$sex_adj)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 1.000 2.000 1.562 2.000 2.000 1
prop.table(table(data$sex_adj))
##
## 1 2
## 0.4384303 0.5615697
summary(data$AGE)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 32.00 42.00 49.00 49.93 57.00 70.00 1
mean(data$AGE, na.rm=TRUE)
## [1] 49.9258
sd(data$AGE, na.rm=TRUE)
## [1] 8.676929
sum(is.na(data$AGE)) # We have 1 na in this variable
## [1] 1
hist(data$AGE, breaks = 90, xlab = "Age (in years)", main = "Age distribution", col = "brown2", xlim = c(20, 80))
which(data$AGE > 99)
## integer(0)
data$AGE[c(10, 15)] <- NA
summary(data$AGE)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 32.00 42.00 49.00 49.93 57.00 70.00 3
mean(data$AGE, na.rm=TRUE)
## [1] 49.92644
sd(data$AGE, na.rm=TRUE)
## [1] 8.674094
sum(is.na(data$AGE))
## [1] 3
hist(data$AGE, breaks = 50, xlab = "Age (in years)", main = "Age distribution ", col = "brown2", xlim = c(25, 80))
quantile(data$AGE, probs=c(0.2),na.rm=TRUE)
## 20%
## 41
qqnorm(data$AGE, col = "pink")
qqline(data$AGE, col="green")
summary(data$CURSMOKE)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0000 0.4919 1.0000 1.0000 1
sum(is.na(data$CURSMOKE))
## [1] 1
which(is.na(data$CURSMOKE))
## [1] 4435
table(data$CURSMOKE)
##
## 0 1
## 2253 2181
prop.table(table(data$CURSMOKE))
##
## 0 1
## 0.5081191 0.4918809
summary(data$TOTCHOL)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 107 206 234 237 264 696 53
mean(data$TOTCHOL, na.rm=TRUE)
## [1] 236.9843
sd(data$TOTCHOL, na.rm=TRUE)
## [1] 44.6511
sum(is.na(data$TOTCHOL))
## [1] 53
hist(data$TOTCHOL, breaks = 100, xlab = "Total cholestrol (in mg/dL) ", main = "Total cholesterol distribution", col = "chocolate3",ylim=c(0,250))
which(data$TOTCHOL > 500)
## [1] 1428 2603
data$TOTCHOL_adj <- 1
data$TOTCHOL_adj [data$TOTCHOL < 200] <- 0
data$TOTCHOL_adj [data$TOTCHOL > 240] <- 2
data$TOTCHOL_adj [is.na(data$BMI)] <- NA
table(data$TOTCHOL_adj)
##
## 0 1 2
## 867 1665 1883
prop.table(table(data$TOTCHOL_adj))
##
## 0 1 2
## 0.1963760 0.3771234 0.4265006
summary(data$SYSBP)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 83.5 117.5 129.0 132.9 144.0 295.0 1
mean(data$SYSBP, na.rm=TRUE)
## [1] 132.9078
sd(data$SYSBP, na.rm=TRUE)
## [1] 22.4216
sum(is.na(data$SYSBP))
## [1] 1
hist(data$SYSBP, breaks = 100, xlab = "Systolic blood pressure", main = "Systolic Blood pressure distribution", col = "cyan",ylim=c(0,250))
which(data$SYSBP > 270)
## [1] 2238
unique(data$educ)
## [1] 4 1 3 2 NA
summary(data$educ)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 1.000 2.000 1.976 3.000 4.000 114
table(data$educ)
##
## 1 2 3 4
## 1822 1281 716 502
prop.table(table(data$educ))
##
## 1 2 3 4
## 0.4216617 0.2964592 0.1657024 0.1161768
unique(data$BMI)
## [1] 26.97 25.34 26.36 23.61 26.31 22.37 26.88 21.96 24.18 26.09 28.15 27.57
## [13] 23.62 18.59 25.90 24.71 38.53 28.09 40.11 27.78 26.21 23.59 24.33 23.47
## [25] 25.65 22.29 23.58 19.64 26.84 23.01 28.61 30.58 26.52 32.51 28.56 31.56
## [37] 25.42 18.23 24.80 22.70 28.57 31.01 27.54 29.62 32.23 26.40 21.93 25.09
## [49] 29.29 29.44 25.95 25.87 25.10 21.84 25.50 24.07 25.63 28.21 24.30 26.82
## [61] 31.61 26.49 28.34 30.77 22.99 23.50 28.04 21.19 23.35 24.12 21.66 26.51
## [73] 26.79 18.99 30.76 23.25 28.82 20.12 28.63 23.79 30.85 29.51 22.93 29.79
## [85] 23.75 20.65 20.27 25.74 22.30 29.56 31.30 28.62 24.85 27.06 26.32 24.04
## [97] 29.35 30.62 21.97 33.61 21.90 19.36 29.77 30.96 24.59 27.33 29.73 38.14
## [109] 22.32 24.15 24.26 25.49 30.28 24.10 25.14 26.22 26.74 21.52 28.39 32.63
## [121] 23.94 21.61 19.97 27.87 33.11 26.41 17.44 29.69 26.72 23.22 29.64 26.50
## [133] 29.91 26.45 23.16 19.26 21.51 25.38 25.77 25.82 25.08 23.27 26.46 21.86
## [145] 29.00 32.35 15.54 24.68 27.17 19.98 27.23 25.62 24.76 24.35 31.42 23.44
## [157] 25.13 27.45 25.52 39.88 26.23 21.79 27.47 36.12 31.80 28.35 24.01 31.07
## [169] 29.18 25.01 25.84 27.61 27.48 23.68 26.57 29.60 29.46 30.02 31.22 25.94
## [181] 25.11 27.22 19.82 27.38 26.26 23.88 26.55 29.15 26.75 25.72 26.76 31.69
## [193] 21.91 27.51 22.26 29.17 24.67 22.67 28.66 24.89 22.78 31.46 30.84 29.63
## [205] 20.57 30.92 30.55 28.07 16.59 22.42 26.47 24.42 26.29 32.58 26.60 22.63
## [217] 29.82 25.79 29.23 28.83 25.83 20.87 30.22 27.05 26.98 24.44 19.53 22.54
## [229] 17.81 20.56 27.29 22.11 23.30 24.58 26.67 27.86 27.80 31.96 29.04 27.08
## [241] 20.39 26.81 27.60 19.11 29.86 21.73 25.35 26.25 31.39 27.73 28.36 26.58
## [253] 29.32 24.87 21.45 28.96 25.88 21.24 26.59 31.44 29.98 27.77 27.43 25.48
## [265] 25.05 24.95 31.68 24.75 18.70 22.72 30.50 29.25 26.48 28.87 22.92 26.14
## [277] 22.56 23.80 25.86 25.61 16.87 27.98 23.55 22.69 28.30 26.17 24.39 21.03
## [289] 27.97 24.83 27.18 31.14 27.01 20.63 28.47 24.86 26.56 27.93 22.17 25.04
## [301] 25.16 28.45 27.50 28.46 24.24 25.26 27.27 26.37 28.49 20.55 23.38 30.64
## [313] 26.77 26.91 24.25 22.45 31.16 21.67 19.42 25.75 27.53 22.12 22.73 30.38
## [325] 30.93 18.84 28.95 30.03 22.91 31.76 24.27 32.73 24.49 21.89 32.52 23.03
## [337] 32.66 23.04 28.74 23.65 23.72 25.29 27.75 25.98 25.02 27.10 26.69 23.98
## [349] 24.54 31.27 29.59 28.23 28.59 23.39 26.62 28.86 25.67 23.85 28.92 28.76
## [361] 28.50 26.53 24.43 25.41 32.40 30.12 30.63 26.65 27.13 20.17 29.01 20.19
## [373] 24.77 29.11 19.44 32.00 24.06 24.08 27.88 34.99 28.28 29.34 33.32 27.55
## [385] 24.52 24.50 25.68 23.31 32.65 26.18 26.28 24.28 22.85 23.90 27.71 27.25
## [397] 30.42 27.63 30.18 29.47 22.81 25.97 36.52 28.58 24.36 27.30 23.48 28.33
## [409] 24.62 30.07 24.45 33.03 35.45 27.44 28.70 22.39 31.74 27.04 26.04 23.37
## [421] 25.18 24.69 27.02 28.25 32.84 30.70 24.72 26.73 22.16 28.78 27.66 26.33
## [433] 29.85 28.89 32.15 28.52 22.48 32.02 22.53 25.54 22.41 35.11 27.99 26.54
## [445] 26.83 21.18 29.76 24.51 24.79 26.02 31.23 23.56 27.82 24.96 32.19 23.51
## [457] 32.27 21.85 29.38 27.67 34.53 30.27 28.88 22.24 20.41 24.74 29.03 29.84
## [469] 29.43 21.01 24.16 22.05 33.49 28.53 35.20 19.84 26.78 21.48 24.94 25.24
## [481] 23.84 26.05 23.60 23.28 30.46 30.69 24.38 27.52 35.85 23.87 18.00 18.65
## [493] 22.68 26.00 20.86 23.19 30.39 28.32 22.74 29.41 31.19 21.68 22.01 27.34
## [505] 25.17 31.60 26.71 26.08 24.90 24.05 25.71 29.42 22.71 21.35 19.94 24.65
## [517] 23.08 NA 24.99 30.11 33.45 24.81 23.70 21.27 31.62 24.47 20.72 27.12
## [529] 25.03 19.09 24.41 23.78 26.92 28.16 21.34 34.69 25.40 23.06 22.97 26.68
## [541] 23.21 25.59 29.13 22.90 27.26 24.66 26.11 27.56 28.06 21.41 23.53 18.92
## [553] 33.07 23.05 26.42 31.55 28.55 31.67 24.17 28.41 34.59 26.38 22.64 29.08
## [565] 23.24 25.22 24.84 28.20 33.10 24.19 33.71 29.93 29.10 29.05 22.98 27.28
## [577] 19.51 23.71 21.42 27.91 25.31 25.45 33.29 27.35 25.64 23.46 30.71 28.68
## [589] 24.93 30.43 22.94 23.40 22.82 30.05 22.25 25.33 22.51 26.64 36.04 26.43
## [601] 31.09 31.38 33.16 23.77 29.89 21.02 26.35 21.71 26.89 23.95 30.29 23.64
## [613] 27.39 30.34 21.98 30.66 29.58 28.18 22.36 25.12 23.32 21.06 22.18 31.06
## [625] 23.10 39.04 27.92 38.42 27.15 22.19 33.26 27.94 24.20 32.03 21.17 28.54
## [637] 21.28 25.06 24.61 23.57 22.13 23.97 27.20 23.41 26.20 33.79 24.00 28.31
## [649] 29.07 25.51 25.57 25.91 28.40 27.03 21.78 28.11 30.19 28.77 20.10 33.14
## [661] 19.61 29.19 25.66 24.46 25.56 31.90 29.99 26.27 29.28 30.41 23.29 23.81
## [673] 31.11 30.16 27.31 19.05 30.10 19.70 22.40 19.54 20.26 30.20 35.05 32.33
## [685] 24.91 29.36 19.68 34.04 29.37 26.86 23.96 25.23 25.20 31.93 32.37 23.63
## [697] 21.43 28.38 22.46 28.69 33.00 35.96 35.53 34.89 27.42 37.38 28.27 30.32
## [709] 31.31 28.03 21.07 18.88 29.66 25.21 33.76 24.22 25.70 20.23 21.64 20.09
## [721] 25.46 28.85 27.62 20.34 25.37 31.12 19.81 22.33 24.92 34.39 23.14 26.13
## [733] 28.84 21.83 16.98 37.15 31.20 18.55 22.49 28.12 24.21 25.36 26.87 20.99
## [745] 35.12 24.23 28.65 17.17 31.02 23.34 22.08 22.27 24.32 23.13 28.26 30.82
## [757] 23.23 31.04 31.57 24.98 29.40 20.03 26.30 20.71 27.68 18.64 26.44 27.83
## [769] 26.95 23.43 31.47 32.98 28.13 23.74 36.01 28.24 28.94 31.78 23.93 30.31
## [781] 21.88 34.32 27.89 23.02 32.92 31.05 19.15 28.44 24.56 26.85 30.75 22.34
## [793] 23.52 21.63 24.14 22.79 20.30 30.08 21.77 33.99 21.76 19.76 24.88 35.31
## [805] 22.86 25.69 31.50 21.14 31.08 28.73 30.33 25.96 21.44 28.08 23.86 21.50
## [817] 32.09 24.29 32.22 24.73 27.72 24.63 17.50 20.42 24.53 28.90 33.08 29.72
## [829] 30.80 22.06 21.33 32.32 20.25 25.55 23.76 32.99 25.58 33.81 28.67 24.11
## [841] 31.84 40.08 22.14 30.89 25.60 28.80 24.60 32.13 22.76 20.50 28.43 29.27
## [853] 29.48 32.18 34.97 25.78 19.99 25.80 23.09 21.49 29.87 21.47 33.52 21.65
## [865] 26.63 22.66 18.76 22.04 24.13 29.54 40.38 29.30 27.14 23.92 21.99 30.13
## [877] 21.38 26.15 27.24 27.81 18.10 22.52 20.70 21.55 25.43 24.57 25.81 20.96
## [889] 24.02 31.98 27.40 23.99 20.45 20.54 32.47 19.14 19.00 25.44 29.02 30.61
## [901] 25.85 20.88 27.11 19.34 30.48 26.70 19.71 30.30 27.64 22.35 21.59 34.17
## [913] 32.80 30.36 20.77 28.93 20.68 30.47 22.15 19.66 20.35 17.61 28.60 20.13
## [925] 33.80 34.55 21.16 45.80 26.03 23.89 38.46 29.33 26.93 40.52 24.37 42.15
## [937] 18.38 26.12 20.20 36.81 24.03 38.39 22.02 26.24 31.13 25.32 23.36 42.00
## [949] 19.24 23.67 16.75 22.89 29.06 21.95 30.23 31.77 18.06 44.27 21.81 20.59
## [961] 36.29 27.07 20.06 31.24 21.94 22.00 20.51 21.10 30.24 21.40 21.05 19.88
## [973] 27.96 22.50 27.59 18.48 27.09 21.31 21.21 29.50 31.35 31.82 29.75 18.58
## [985] 34.36 19.20 20.62 35.58 18.80 27.49 36.11 45.79 21.08 20.53 27.65 25.25
## [997] 27.90 23.45 19.56 38.82 27.16 20.18 26.80 24.40 20.47 19.77 34.84 37.41
## [1009] 24.64 22.55 21.80 36.62 16.69 26.39 19.63 20.49 20.81 37.48 32.53 23.54
## [1021] 32.49 30.74 18.86 18.21 30.78 19.27 23.17 39.60 21.60 25.92 18.18 24.48
## [1033] 20.67 31.63 19.78 44.09 22.28 33.38 19.32 23.42 18.98 33.47 28.91 15.96
## [1045] 23.82 24.55 32.26 40.58 43.30 29.21 32.36 33.36 20.64 32.17 19.93 33.19
## [1057] 43.69 29.53 25.27 39.86 20.24 29.16 30.98 20.15 17.64 23.33 20.89 20.79
## [1069] 19.25 20.94 22.88 25.07 42.53 32.67 31.10 29.94 36.21 30.40 27.76 30.67
## [1081] 34.25 23.26 32.04 17.38 30.04 26.61 19.47 20.33 32.77 34.56 20.97 33.40
## [1093] 38.88 21.09 30.06 30.86 25.99 19.83 18.52 21.30 33.82 20.66 31.34 23.66
## [1105] 38.38 23.18 31.37 20.82 34.40 29.45 21.26 31.48 19.57 30.73 21.74 26.07
## [1117] 33.17 20.74 20.52 22.87 33.65 30.91 22.65 27.95 20.00 35.99 24.09 25.89
## [1129] 21.56 19.39 18.43 29.80 18.87 32.07 21.82 30.35 25.73 21.12 22.62 25.15
## [1141] 33.66 16.92 20.05 38.75 44.55 17.11 20.84 20.02 19.38 29.70 32.41 37.70
## [1153] 20.69 21.57 23.69 31.64 19.03 35.22 21.00 26.10 23.20 34.95 32.43 31.71
## [1165] 30.14 42.98 26.01 32.93 33.97 28.02 39.64 20.93 34.13 36.57 19.87 28.72
## [1177] 36.46 23.07 32.57 33.54 21.15 20.40 32.91 19.50 22.58 30.00 29.65 38.06
## [1189] 18.16 25.39 35.78 32.81 29.67 18.67 26.06 21.22 35.17 34.52 39.91 27.46
## [1201] 20.76 19.74 28.10 38.43 29.14 21.72 29.88 35.42 21.29 17.65 37.04 20.22
## [1213] 33.68 44.71 38.54 17.32 30.21 21.11 28.79 16.61 22.84 20.31 39.53 39.54
## [1225] 31.18 30.65 18.44 35.62 43.48 32.68 22.09 19.23 36.91 39.08 39.69 32.82
## [1237] 32.76 17.51 36.65 32.60 19.16 39.82 35.01 19.80 18.30 20.43 28.00 36.79
## [1249] 37.02 25.19 30.99 20.11 33.60 32.29 35.35 29.09 17.89 19.72 38.81 18.62
## [1261] 31.52 20.32 21.32 33.88 36.54 56.80 18.75 31.75 38.11 22.31 18.40 31.54
## [1273] 40.21 20.98 27.69 29.68 30.53 17.48 18.82 28.22 34.91 19.48 18.14 35.16
## [1285] 39.40 40.81 38.61 19.46 32.10 22.57 39.21 31.72 31.65 18.28 17.23 18.26
## [1297] 29.22 33.90 40.51 34.71 30.79 28.14 32.78 20.44 33.70 35.02 22.44 17.71
## [1309] 19.18 30.57 20.73 43.26 29.49 22.43 31.40 16.48 32.89 26.66 20.92 18.11
## [1321] 19.22 29.97 19.12 38.31 21.20 22.38 26.94 22.83 18.53 20.21 35.68 17.92
## [1333] 16.71 37.10 19.69 38.96 18.01 17.93 18.83 23.83 32.54 39.94 34.54 34.27
## [1345] 18.09 18.73 24.82 30.59 28.81 18.46 28.71 34.46 39.22 41.29 18.68 19.49
## [1357] 34.08 31.29 36.18 41.61 19.91 31.03 29.55 37.62 30.51 33.02 17.84 17.68
## [1369] 37.58 51.28 21.25 32.31 38.94 31.92 16.73 21.75 19.37 37.30 35.13 41.66
## [1381] 26.99 19.28 18.63 35.03 35.19 33.37 38.17 20.85 20.78 40.23 36.07 39.17
## [1393] 43.67 20.91
data$BMI_adj <- 1 #creating a new variable, with all 1's
data$BMI_adj[data$BMI < 18.5] <- 0
data$BMI_adj[data$BMI > 25] <- 2
data$BMI_adj[is.na(data$BMI)] <- NA
unique(data$BMI_adj)
## [1] 2 1 0 NA
prop.table(table(data$BMI_adj))
##
## 0 1 2
## 0.01291053 0.43850510 0.54858437
sum(is.na(data$BMI_adj))
## [1] 20
data$CVD_adj <- 1
data$CVD_adj[data$PREVCHD == "1"] <- 0
data$CVD_adj[data$CVD == "0"] <- 0
summary(data$TIMECVD)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 5145 8766 6818 8766 8766 1
mean(data$TIMECVD, na.rm=TRUE)
## [1] 6817.948
sd(data$TIMECVD, na.rm=TRUE)
## [1] 2804.316
sum(is.na(data$TIMECVD))
## [1] 1
hist(data$TIMECVD, ylim = c(0,200), breaks = 100, xlab = "Time to CVD (in days)", main = "Time to a CVD distribution", col = "#A39038")
fitKMCVD <- survfit(Surv(TIMECVD,CVD_adj)~1, data=data)
ggsurvplot(fitKMCVD, ylim=c(0.5,1.0), xlab = 'Days', ylab = 'Survival', legend.labs = c(""), legend.title ='BMI', palette = c("#332A01", "#7EBBBB", "#290117"),data=data, title = "Kaplan-Meier Curve for incidence of CVD", ggtheme = theme_minimal())
########
fitKMBMI <- survfit(Surv(TIMECVD,CVD_adj)~BMI_adj, data=data)
ggsurvplot(fitKMBMI, ylim=c(0.5,1.0), xlab = 'Days', ylab = 'Survival', legend.labs = c("Underweight", "Normal", "Overweight"), legend.title ='BMI', palette = c("#7EBBBB","#CB1F80", "#2F3BAE"),data=data, title = "Kaplan-Meier Curve for incidence of CVD", ggtheme = theme_minimal())
########
fitKMTOTCHOL <- survfit(Surv(TIMECVD,CVD_adj)~TOTCHOL_adj, data=data)
ggsurvplot(fitKMTOTCHOL, ylim=c(0.5,1.0), xlab = 'Days', ylab = 'Survival', legend.labs = c("Low cholesterol", "Normal cholesterol", "High cholesterol"), legend.title ='Total cholesterol', palette = c("#7EBBBB","#CB1F80","#2F3BAE"),data=data, title = "Kaplan-Meier Curve for incidence of CVD", ggtheme = theme_minimal())
#########
fitKMCURSMOKE <- survfit(Surv(TIMECVD,CVD_adj)~CURSMOKE, data=data)
ggsurvplot(fitKMCURSMOKE, ylim=c(0.5,1.0), xlab = 'Days', ylab = 'Survival', legend.labs = c("Non-smoker", "smoker"), legend.title ='Smoking', palette = c("#7EBBBB","#CB1F80"),data=data, title = "Kaplan-Meier Curve for incidence of CVD", ggtheme = theme_minimal())
For fitKMBMI Patients with Overweight are more likely to develop Myocardial infarction problems.And probabilty of survival is more low in comparison with other groups Underweight and Normal
For fitKMTOTCHOL Patients with high cholesterol are more likely of developing Myocardial infarction problems and survive less than the other groups Low cholesterol and Normal cholesterol.
For fitKMCURSMOKE Patients with smoker are more likely of developing Myocardial infarction problems and survive less than the Non-smoker group.
## Myocardial infarction
unique(data$CVD)
## [1] 1 0 NA
prop.table(table(data$CVD))
##
## 0 1
## 0.7390618 0.2609382
chisq.test(data$CVD_adj, data$BMI_adj)
##
## Pearson's Chi-squared test
##
## data: data$CVD_adj and data$BMI_adj
## X-squared = 65.876, df = 2, p-value = 4.958e-15
A small p-value
chisq.test(data$CVD_adj, data$TOTCHOL_adj)
##
## Pearson's Chi-squared test
##
## data: data$CVD_adj and data$TOTCHOL_adj
## X-squared = 54.682, df = 2, p-value = 1.337e-12
A small p-value
chisq.test(data$CVD_adj, data$CURSMOKE)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: data$CVD_adj and data$CURSMOKE
## X-squared = 4.9843, df = 1, p-value = 0.02558
A small p-value
\[MODELS~FOR~COX~PROPORTIONAL~HAZARD~REGRESION\]
We try variuos model for select the best
library(car) # For see best the p-values
## Loading required package: carData
# AIC criteria is for select the best model. In this case we need select the model with the most lower AIC.
coxfit1 <- coxph(Surv(CVD_adj)~ CURSMOKE, data=data)
summary(coxfit1)
## Call:
## coxph(formula = Surv(CVD_adj) ~ CURSMOKE, data = data)
##
## n= 4434, number of events= 4434
## (1 observation deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## CURSMOKE -0.05473 0.94674 0.03005 -1.822 0.0685 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## CURSMOKE 0.9467 1.056 0.8926 1.004
##
## Concordance= 0.52 (se = 0.009 )
## Likelihood ratio test= 3.32 on 1 df, p=0.07
## Wald test = 3.32 on 1 df, p=0.07
## Score (logrank) test = 3.32 on 1 df, p=0.07
Anova(coxfit1)
AIC(coxfit1)
## [1] 65606.02
coxfit1.1 <- coxph(Surv(CVD_adj)~ CURSMOKE
+ sex_adj
+ AGE
+ educ
, data=data)
summary(coxfit1.1)
## Call:
## coxph(formula = Surv(CVD_adj) ~ CURSMOKE + sex_adj + AGE + educ,
## data = data)
##
## n= 4319, number of events= 4319
## (116 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## CURSMOKE -0.096089 0.908383 0.031898 -3.012 0.00259 **
## sex_adj 0.253587 1.288639 0.031528 8.043 8.76e-16 ***
## AGE -0.020323 0.979883 0.001871 -10.863 < 2e-16 ***
## educ 0.039092 1.039866 0.015110 2.587 0.00968 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## CURSMOKE 0.9084 1.1009 0.8533 0.9670
## sex_adj 1.2886 0.7760 1.2114 1.3708
## AGE 0.9799 1.0205 0.9763 0.9835
## educ 1.0399 0.9617 1.0095 1.0711
##
## Concordance= 0.681 (se = 0.009 )
## Likelihood ratio test= 200.6 on 4 df, p=<2e-16
## Wald test = 198.6 on 4 df, p=<2e-16
## Score (logrank) test = 199.4 on 4 df, p=<2e-16
Anova(coxfit1.1)
AIC(coxfit1.1) # Most lower AIC
## [1] 63486.43
coxfit1.2 <- coxph(Surv(CVD_adj)~ CURSMOKE*sex_adj
+ sex_adj
+ AGE
+ educ
, data=data)
summary(coxfit1.2)
## Call:
## coxph(formula = Surv(CVD_adj) ~ CURSMOKE * sex_adj + sex_adj +
## AGE + educ, data = data)
##
## n= 4319, number of events= 4319
## (116 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## CURSMOKE -0.036657 0.964007 0.103242 -0.355 0.7225
## sex_adj 0.273445 1.314485 0.045562 6.002 1.95e-09 ***
## AGE -0.020368 0.979838 0.001872 -10.879 < 2e-16 ***
## educ 0.039730 1.040529 0.015144 2.623 0.0087 **
## CURSMOKE:sex_adj -0.038093 0.962624 0.062916 -0.605 0.5449
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## CURSMOKE 0.9640 1.0373 0.7874 1.1802
## sex_adj 1.3145 0.7608 1.2022 1.4373
## AGE 0.9798 1.0206 0.9762 0.9834
## educ 1.0405 0.9610 1.0101 1.0719
## CURSMOKE:sex_adj 0.9626 1.0388 0.8509 1.0890
##
## Concordance= 0.681 (se = 0.009 )
## Likelihood ratio test= 200.9 on 5 df, p=<2e-16
## Wald test = 199.1 on 5 df, p=<2e-16
## Score (logrank) test = 199.8 on 5 df, p=<2e-16
Anova(coxfit1.2)
AIC(coxfit1.2)
## [1] 63488.06
The best model was coxfit1.1 since it has the lowest AIC value, in addition, the different variables evaluated have a low p value, which indicates that they are statistically significant.
the p value for sex was 1e-15 with a hazard ratio HR = exp(coef) = 1.28 indicating a strong relationship between the patients’ sex and Myocardial infarction. The hazard ratios of covariates are interpretable as multiplicative effects on the hazard. For example, holding the other covariates constant, being female (sex=2) reduces the hazard by a factor of 1.28. We conclude that being female is associated with good forecast.
There is not evidence of interaction between sex in CVD_adj when CURSMOKE*sex_adj was test
coxfit2 <- coxph(Surv(CVD_adj)~ BMI_adj, data=data)
summary(coxfit2)
## Call:
## coxph(formula = Surv(CVD_adj) ~ BMI_adj, data = data)
##
## n= 4415, number of events= 4415
## (20 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## BMI_adj -0.19087 0.82624 0.02874 -6.641 3.11e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## BMI_adj 0.8262 1.21 0.781 0.8741
##
## Concordance= 0.573 (se = 0.009 )
## Likelihood ratio test= 43.44 on 1 df, p=4e-11
## Wald test = 44.11 on 1 df, p=3e-11
## Score (logrank) test = 44.22 on 1 df, p=3e-11
Anova(coxfit2)
AIC(coxfit2)
## [1] 65246.89
coxfit2.1 <- coxph(Surv(CVD_adj)~ BMI_adj
+ sex_adj
+ AGE
+ CURSMOKE
+ GLUCOSE , data=data)
summary(coxfit2.1)
## Call:
## coxph(formula = Surv(CVD_adj) ~ BMI_adj + sex_adj + AGE + CURSMOKE +
## GLUCOSE, data = data)
##
## n= 4021, number of events= 4021
## (414 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## BMI_adj -0.1311304 0.8771034 0.0311560 -4.209 2.57e-05 ***
## sex_adj 0.2243893 1.2515582 0.0331047 6.778 1.22e-11 ***
## AGE -0.0191379 0.9810441 0.0019275 -9.929 < 2e-16 ***
## CURSMOKE -0.1100844 0.8957585 0.0335242 -3.284 0.001024 **
## GLUCOSE -0.0025211 0.9974821 0.0006833 -3.690 0.000224 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## BMI_adj 0.8771 1.140 0.8251 0.9323
## sex_adj 1.2516 0.799 1.1729 1.3355
## AGE 0.9810 1.019 0.9773 0.9848
## CURSMOKE 0.8958 1.116 0.8388 0.9566
## GLUCOSE 0.9975 1.003 0.9961 0.9988
##
## Concordance= 0.691 (se = 0.009 )
## Likelihood ratio test= 211.2 on 5 df, p=<2e-16
## Wald test = 209.3 on 5 df, p=<2e-16
## Score (logrank) test = 210.3 on 5 df, p=<2e-16
Anova(coxfit2.1)
AIC(coxfit2.1)
## [1] 58509.77
coxfit2.2 <- coxph(Surv(CVD_adj)~ BMI_adj*sex_adj
+ AGE
+ CURSMOKE
+ GLUCOSE , data=data)
summary(coxfit2.2)
## Call:
## coxph(formula = Surv(CVD_adj) ~ BMI_adj * sex_adj + AGE + CURSMOKE +
## GLUCOSE, data = data)
##
## n= 4021, number of events= 4021
## (414 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## BMI_adj -0.1610831 0.8512213 0.1032053 -1.561 0.118570
## sex_adj 0.1949556 1.2152571 0.1021986 1.908 0.056441 .
## AGE -0.0191978 0.9809853 0.0019374 -9.909 < 2e-16 ***
## CURSMOKE -0.1100365 0.8958014 0.0335289 -3.282 0.001031 **
## GLUCOSE -0.0025215 0.9974817 0.0006832 -3.691 0.000224 ***
## BMI_adj:sex_adj 0.0189127 1.0190927 0.0621498 0.304 0.760892
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## BMI_adj 0.8512 1.1748 0.6953 1.0421
## sex_adj 1.2153 0.8229 0.9947 1.4848
## AGE 0.9810 1.0194 0.9773 0.9847
## CURSMOKE 0.8958 1.1163 0.8388 0.9566
## GLUCOSE 0.9975 1.0025 0.9961 0.9988
## BMI_adj:sex_adj 1.0191 0.9813 0.9022 1.1511
##
## Concordance= 0.691 (se = 0.009 )
## Likelihood ratio test= 211.3 on 6 df, p=<2e-16
## Wald test = 209.1 on 6 df, p=<2e-16
## Score (logrank) test = 210.4 on 6 df, p=<2e-16
Anova(coxfit2.2)
AIC(coxfit2.2)
## [1] 58511.68
The best model was coxfit2.1 since it has the most lowest AIC value, in addition, the different variables evaluated have a low p value, which indicates that they are statistically significant.
the p value for sex was 1.22e-11 with a hazard ratio HR = exp(coef) = 1.2516 indicating a strong relationship. The hazard ratios of covariates are interpretable as multiplicative effects on the hazard. For example, holding the other covariates constant, being female (sex=2) reduces the hazard by a factor of 1.25. We conclude that being female is associated with good forecast.
There is not evidence of interaction between sex in CVD_adj when BMI_adj*sex_adj was test.
coxfit3 <- coxph(Surv(CVD_adj)~ TOTCHOL_adj, data=data)
summary(coxfit3)
## Call:
## coxph(formula = Surv(CVD_adj) ~ TOTCHOL_adj, data = data)
##
## n= 4415, number of events= 4415
## (20 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## TOTCHOL_adj -0.1223 0.8849 0.0201 -6.083 1.18e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## TOTCHOL_adj 0.8849 1.13 0.8507 0.9205
##
## Concordance= 0.57 (se = 0.009 )
## Likelihood ratio test= 36.53 on 1 df, p=2e-09
## Wald test = 37 on 1 df, p=1e-09
## Score (logrank) test = 37.08 on 1 df, p=1e-09
Anova(coxfit3)
AIC(coxfit3)
## [1] 65253.8
coxfit3.1 <- coxph(Surv(CVD_adj)~ TOTCHOL_adj
+ BMI_adj
+ AGE
+ sex_adj
+ GLUCOSE, data=data)
summary(coxfit3.1)
## Call:
## coxph(formula = Surv(CVD_adj) ~ TOTCHOL_adj + BMI_adj + AGE +
## sex_adj + GLUCOSE, data = data)
##
## n= 4021, number of events= 4021
## (414 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## TOTCHOL_adj -0.0870528 0.9166287 0.0216676 -4.018 5.88e-05 ***
## BMI_adj -0.1027188 0.9023807 0.0310053 -3.313 0.000923 ***
## AGE -0.0161747 0.9839554 0.0019214 -8.418 < 2e-16 ***
## sex_adj 0.2553925 1.2909682 0.0323732 7.889 3.05e-15 ***
## GLUCOSE -0.0024921 0.9975111 0.0006841 -3.643 0.000270 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## TOTCHOL_adj 0.9166 1.0910 0.8785 0.9564
## BMI_adj 0.9024 1.1082 0.8492 0.9589
## AGE 0.9840 1.0163 0.9803 0.9877
## sex_adj 1.2910 0.7746 1.2116 1.3755
## GLUCOSE 0.9975 1.0025 0.9962 0.9988
##
## Concordance= 0.693 (se = 0.009 )
## Likelihood ratio test= 216.4 on 5 df, p=<2e-16
## Wald test = 217.6 on 5 df, p=<2e-16
## Score (logrank) test = 218.8 on 5 df, p=<2e-16
Anova(coxfit3.1)
AIC(coxfit3.1)
## [1] 58504.54
coxfit3.2 <- coxph(Surv(CVD_adj)~ TOTCHOL_adj*sex_adj
+ BMI_adj
+ AGE
+ sex_adj
+ GLUCOSE, data=data)
summary(coxfit3.2)
## Call:
## coxph(formula = Surv(CVD_adj) ~ TOTCHOL_adj * sex_adj + BMI_adj +
## AGE + sex_adj + GLUCOSE, data = data)
##
## n= 4021, number of events= 4021
## (414 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## TOTCHOL_adj -0.1266727 0.8810220 0.0698381 -1.814 0.069708 .
## sex_adj 0.2239804 1.2510465 0.0618001 3.624 0.000290 ***
## BMI_adj -0.1026301 0.9024607 0.0310054 -3.310 0.000933 ***
## AGE -0.0164030 0.9837308 0.0019593 -8.372 < 2e-16 ***
## GLUCOSE -0.0024931 0.9975100 0.0006843 -3.643 0.000269 ***
## TOTCHOL_adj:sex_adj 0.0256464 1.0259781 0.0429990 0.596 0.550880
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## TOTCHOL_adj 0.8810 1.1350 0.7683 1.0103
## sex_adj 1.2510 0.7993 1.1083 1.4121
## BMI_adj 0.9025 1.1081 0.8493 0.9590
## AGE 0.9837 1.0165 0.9800 0.9875
## GLUCOSE 0.9975 1.0025 0.9962 0.9988
## TOTCHOL_adj:sex_adj 1.0260 0.9747 0.9431 1.1162
##
## Concordance= 0.693 (se = 0.009 )
## Likelihood ratio test= 216.8 on 6 df, p=<2e-16
## Wald test = 216.9 on 6 df, p=<2e-16
## Score (logrank) test = 218.9 on 6 df, p=<2e-16
Anova(coxfit3.2)
AIC(coxfit3.2)
## [1] 58506.18
The best model was coxfit3.1 since it has the most lowest AIC value, in addition, the different variables evaluated have a low p value, which indicates that they are statistically significant.
the p value for sex was 3.05e-15 with a hazard ratio HR = exp(coef) = 1.29 indicating a strong relationship. The hazard ratios of covariates are interpretable as multiplicative effects on the hazard. For example, holding the other covariates constant, being female (sex=2) reduces the hazard by a factor of 1.29. We conclude that being female is associated with good forecast.
There is not evidence of interaction between sex in CVD_adj when TOTCHOL_adj*sex_adj was test.
\[Discussion\]
After cleaning the database, ordering each variable in classificatory values and reducing the events of missing data, the analysis of the survival models continued. This is important to standardize evaluation criteria.
Attempted to find a Cox porportional hazards regesion model to predict Myocardial infarction (CVD_adj) based on Current cigarette smoking at exam (CURSMOKE), Body Mass Index (BMI), and Serum Total Cholesterol (TOTCHOL), as well as AGE, educ, sex_adj, Casual serum glucose (GLUCOSE). Various Models was test in combinations between variables (TOTCHOL + CURSMOKE) in some cases.
In each case, the best model that presented significant values (low p-values) in all the contrasted variables, as well as in the other evaluation criteria, was selected. For each case, three models were tested and the best one was chosen. In addition, the interaction with the sex_adj of the patients was performed (coxfit1.2, coxfit2.2, coxfit3.2), but no statistically significant values were found. This indicates that sex has no interaction with the variables evaluated. But when evaluating them without interaction, they do present statistical significance.
For example, Kaplan-Meier Curve for incidence of CVD with TOTCHOL, allows us to assume that patients with high cholesterol content have a lower probability of survival if they present CDV, while patients with low cholesterol content can and have a longer time of CVD involvement prolonged survival. Likewise, it can be said with Current cigarette smoking patients that they have a lower probability of survival if they present CVD problems.
We can then say that both the graphs and the models allow to show the relationship between the variables evaluated to assume their behavior in the survival of the patients.