\[Research~ protocol\]

Research protocol

The way of evaluating the different variables must be taken into account, for example the character variables to have clarity in the form of classification to homogenize criteria and avoid disorder in the database. For example, the gender variable, mark male 1 and female 2. Likewise, think about the correct way to classify with the other variables that are qualitative.

Before starting the data analysis, check again and clean the database.

  1. Verify that the numerical variables present a normal distribution. This can be verified with histograms.
  2. Calculate tables of proportions of the most important variables in the study
  3. Visualizing the estimated distribution of survival times. visualize the predicted survival proportion at any given point in time for a particular risk group.
  4. Check with chi square statistical differences in the most important variables. compares the observed distribution of the data with an expected distribution of the data.
  5. Cox models are made and the best one is chosen. Models with interactions are also verified.
  6. Select the best model according to criteria such as AIC and best p-values.

\[*Data~ cleaning*\]

Data cleaning

To carry out a homogeneous analysis of the database, the data must first be homogenized and corrections made to standardize the variables and performances different score in according values or classification.

For example this data base has different values or names in classification variables. Is important standardize it.

An case that is necessary, we make a new variables in somes cases with other classification.

The following variables will be cleaned. - SEX - TOTCHOL - AGE - CURSMOKE - SYSBP - EDUC - BMI - TIMECVD - CVD

A new variable needs to be made for BMI (3 categories) and TOTCHOL (2 categories).

step 1 of data cleaning: -summary

The following variables show weird values:
SEX is not clean, AGE has a max of 600. Total chol has a possible unrealistic max, SYSBP has a max of 295 (VERY VERY VERY HIGH) BMI has a max of 241.90

summary(data_all) #resume all data (not cleaning)
##      RANDID            SEX               TOTCHOL           AGE        
##  Min.   :   2448   Length:11628       Min.   :107.0   Min.   : 32.00  
##  1st Qu.:2474378   Class :character   1st Qu.:210.0   1st Qu.: 48.00  
##  Median :5006008   Mode  :character   Median :238.0   Median : 54.00  
##  Mean   :5004741                      Mean   :241.2   Mean   : 54.88  
##  3rd Qu.:7472730                      3rd Qu.:268.0   3rd Qu.: 62.00  
##  Max.   :9999312                      Max.   :696.0   Max.   :600.00  
##  NA's   :1                            NA's   :410     NA's   :1       
##      SYSBP           DIABP           CURSMOKE         CIGPDAY     
##  Min.   : 83.5   Min.   : 30.00   Min.   :0.0000   Min.   : 0.00  
##  1st Qu.:120.0   1st Qu.: 75.00   1st Qu.:0.0000   1st Qu.: 0.00  
##  Median :132.0   Median : 82.00   Median :0.0000   Median : 0.00  
##  Mean   :136.3   Mean   : 83.04   Mean   :0.4325   Mean   : 8.25  
##  3rd Qu.:149.0   3rd Qu.: 90.00   3rd Qu.:1.0000   3rd Qu.:20.00  
##  Max.   :295.0   Max.   :150.00   Max.   :1.0000   Max.   :90.00  
##  NA's   :1       NA's   :1        NA's   :1        NA's   :80     
##       BMI            DIABETES           BPMEDS          HEARTRTE     
##  Min.   : 14.43   Min.   :0.00000   Min.   :0.0000   Min.   : 37.00  
##  1st Qu.: 23.09   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.: 69.00  
##  Median : 25.48   Median :0.00000   Median :0.0000   Median : 75.00  
##  Mean   : 25.90   Mean   :0.04558   Mean   :0.0856   Mean   : 76.78  
##  3rd Qu.: 28.07   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.: 85.00  
##  Max.   :241.90   Max.   :1.00000   Max.   :1.0000   Max.   :220.00  
##  NA's   :53       NA's   :1         NA's   :594      NA's   :7       
##     GLUCOSE            educ         PREVCHD            PREVAP       
##  Min.   : 39.00   Min.   :1.00   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.: 72.00   1st Qu.:1.00   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median : 80.00   Median :2.00   Median :0.00000   Median :0.00000  
##  Mean   : 84.12   Mean   :1.99   Mean   :0.07242   Mean   :0.05393  
##  3rd Qu.: 89.00   3rd Qu.:3.00   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :478.00   Max.   :4.00   Max.   :1.00000   Max.   :1.00000  
##  NA's   :1441     NA's   :296    NA's   :1         NA's   :1        
##      PREVMI           PREVSTRK          PREVHYP            TIME     
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.0000   Min.   :   0  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:   0  
##  Median :0.00000   Median :0.00000   Median :0.0000   Median :2156  
##  Mean   :0.03217   Mean   :0.01307   Mean   :0.4596   Mean   :1957  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:4252  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.0000   Max.   :4854  
##  NA's   :1         NA's   :1         NA's   :1        NA's   :1     
##      PERIOD           HDLC             LDLC           DEATH       
##  Min.   :1.000   Min.   : 10.00   Min.   : 20.0   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.: 39.00   1st Qu.:145.0   1st Qu.:0.0000  
##  Median :2.000   Median : 48.00   Median :173.0   Median :0.0000  
##  Mean   :1.899   Mean   : 49.37   Mean   :176.5   Mean   :0.3033  
##  3rd Qu.:3.000   3rd Qu.: 58.00   3rd Qu.:205.0   3rd Qu.:1.0000  
##  Max.   :3.000   Max.   :189.00   Max.   :565.0   Max.   :1.0000  
##  NA's   :1       NA's   :8601     NA's   :8602    NA's   :1       
##      ANGINA           HOSPMI           MI_FCHD           ANYCHD      
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.00000   Median :0.0000   Median :0.0000  
##  Mean   :0.1636   Mean   :0.09925   Mean   :0.1538   Mean   :0.2716  
##  3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :1.0000  
##  NA's   :1        NA's   :1         NA's   :1        NA's   :1       
##      STROKE             CVD            HYPERTEN          TIMEAP    
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :   0  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:6224  
##  Median :0.00000   Median :0.0000   Median :1.0000   Median :8766  
##  Mean   :0.09125   Mean   :0.2493   Mean   :0.7433   Mean   :7242  
##  3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:8766  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :8766  
##  NA's   :1         NA's   :1        NA's   :1        NA's   :1     
##      TIMEMI        TIMEMIFC       TIMECHD        TIMESTRK       TIMECVD    
##  Min.   :   0   Min.   :   0   Min.   :   0   Min.   :   0   Min.   :   0  
##  1st Qu.:7212   1st Qu.:7050   1st Qu.:5598   1st Qu.:7295   1st Qu.:6004  
##  Median :8766   Median :8766   Median :8766   Median :8766   Median :8766  
##  Mean   :7594   Mean   :7543   Mean   :7008   Mean   :7661   Mean   :7166  
##  3rd Qu.:8766   3rd Qu.:8766   3rd Qu.:8766   3rd Qu.:8766   3rd Qu.:8766  
##  Max.   :8766   Max.   :8766   Max.   :8766   Max.   :8766   Max.   :8766  
##  NA's   :1      NA's   :1      NA's   :1      NA's   :1      NA's   :1     
##     TIMEDTH        TIMEHYP    
##  Min.   :  26   Min.   :   0  
##  1st Qu.:7798   1st Qu.:   0  
##  Median :8766   Median :2429  
##  Mean   :7854   Mean   :3599  
##  3rd Qu.:8766   3rd Qu.:7329  
##  Max.   :8766   Max.   :8766  
##  NA's   :1      NA's   :1
# We select the period 1 because have more lectures or data.
data <- data_all[data_all$PERIOD == 1, ]
summary(data)
##      RANDID            SEX               TOTCHOL         AGE       
##  Min.   :   2448   Length:4435        Min.   :107   Min.   :32.00  
##  1st Qu.:2440336   Class :character   1st Qu.:206   1st Qu.:42.00  
##  Median :4972848   Mode  :character   Median :234   Median :49.00  
##  Mean   :4987278                      Mean   :237   Mean   :49.93  
##  3rd Qu.:7463577                      3rd Qu.:264   3rd Qu.:57.00  
##  Max.   :9999312                      Max.   :696   Max.   :70.00  
##  NA's   :1                            NA's   :53    NA's   :1      
##      SYSBP           DIABP           CURSMOKE         CIGPDAY      
##  Min.   : 83.5   Min.   : 48.00   Min.   :0.0000   Min.   : 0.000  
##  1st Qu.:117.5   1st Qu.: 75.00   1st Qu.:0.0000   1st Qu.: 0.000  
##  Median :129.0   Median : 82.00   Median :0.0000   Median : 0.000  
##  Mean   :132.9   Mean   : 83.08   Mean   :0.4919   Mean   : 8.966  
##  3rd Qu.:144.0   3rd Qu.: 90.00   3rd Qu.:1.0000   3rd Qu.:20.000  
##  Max.   :295.0   Max.   :142.50   Max.   :1.0000   Max.   :70.000  
##  NA's   :1       NA's   :1        NA's   :1        NA's   :33      
##       BMI           DIABETES           BPMEDS           HEARTRTE     
##  Min.   :15.54   Min.   :0.00000   Min.   :0.00000   Min.   : 44.00  
##  1st Qu.:23.09   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.: 68.00  
##  Median :25.45   Median :0.00000   Median :0.00000   Median : 75.00  
##  Mean   :25.85   Mean   :0.02729   Mean   :0.03293   Mean   : 75.89  
##  3rd Qu.:28.09   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.: 83.00  
##  Max.   :56.80   Max.   :1.00000   Max.   :1.00000   Max.   :143.00  
##  NA's   :20      NA's   :1         NA's   :62        NA's   :2       
##     GLUCOSE            educ          PREVCHD            PREVAP       
##  Min.   : 40.00   Min.   :1.000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.: 72.00   1st Qu.:1.000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median : 78.00   Median :2.000   Median :0.00000   Median :0.00000  
##  Mean   : 82.19   Mean   :1.976   Mean   :0.04375   Mean   :0.03315  
##  3rd Qu.: 87.00   3rd Qu.:3.000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :394.00   Max.   :4.000   Max.   :1.00000   Max.   :1.00000  
##  NA's   :398      NA's   :114     NA's   :1         NA's   :1        
##      PREVMI          PREVSTRK           PREVHYP            TIME       PERIOD 
##  Min.   :0.0000   Min.   :0.000000   Min.   :0.0000   Min.   :0   Min.   :1  
##  1st Qu.:0.0000   1st Qu.:0.000000   1st Qu.:0.0000   1st Qu.:0   1st Qu.:1  
##  Median :0.0000   Median :0.000000   Median :0.0000   Median :0   Median :1  
##  Mean   :0.0194   Mean   :0.007217   Mean   :0.3225   Mean   :0   Mean   :1  
##  3rd Qu.:0.0000   3rd Qu.:0.000000   3rd Qu.:1.0000   3rd Qu.:0   3rd Qu.:1  
##  Max.   :1.0000   Max.   :1.000000   Max.   :1.0000   Max.   :0   Max.   :1  
##  NA's   :1        NA's   :1          NA's   :1        NA's   :1   NA's   :1  
##       HDLC           LDLC          DEATH            ANGINA      
##  Min.   : NA    Min.   : NA    Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: NA    1st Qu.: NA    1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : NA    Median : NA    Median :0.0000   Median :0.0000  
##  Mean   :NaN    Mean   :NaN    Mean   :0.3496   Mean   :0.1635  
##  3rd Qu.: NA    3rd Qu.: NA    3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Max.   : NA    Max.   : NA    Max.   :1.0000   Max.   :1.0000  
##  NA's   :4435   NA's   :4435   NA's   :1        NA's   :1       
##      HOSPMI          MI_FCHD           ANYCHD           STROKE       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :0.00000  
##  Mean   :0.1024   Mean   :0.1649   Mean   :0.2797   Mean   :0.09359  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##  NA's   :1        NA's   :1        NA's   :1        NA's   :1        
##       CVD            HYPERTEN          TIMEAP         TIMEMI        TIMEMIFC   
##  Min.   :0.0000   Min.   :0.0000   Min.   :   0   Min.   :   0   Min.   :   0  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:5348   1st Qu.:6224   1st Qu.:6058  
##  Median :0.0000   Median :1.0000   Median :8766   Median :8766   Median :8766  
##  Mean   :0.2609   Mean   :0.7334   Mean   :6901   Mean   :7240   Mean   :7186  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:8766   3rd Qu.:8766   3rd Qu.:8766  
##  Max.   :1.0000   Max.   :1.0000   Max.   :8766   Max.   :8766   Max.   :8766  
##  NA's   :1        NA's   :1        NA's   :1      NA's   :1      NA's   :1     
##     TIMECHD        TIMESTRK       TIMECVD        TIMEDTH        TIMEHYP    
##  Min.   :   0   Min.   :   0   Min.   :   0   Min.   :  26   Min.   :   0  
##  1st Qu.:4767   1st Qu.:6345   1st Qu.:5145   1st Qu.:6974   1st Qu.:   0  
##  Median :8766   Median :8766   Median :8766   Median :8766   Median :2188  
##  Mean   :6666   Mean   :7305   Mean   :6818   Mean   :7506   Mean   :3436  
##  3rd Qu.:8766   3rd Qu.:8766   3rd Qu.:8766   3rd Qu.:8766   3rd Qu.:6934  
##  Max.   :8766   Max.   :8766   Max.   :8766   Max.   :8766   Max.   :8766  
##  NA's   :1      NA's   :1      NA's   :1      NA's   :1      NA's   :1

Now sex will be cleaned.

The variablesex that 1= male and 2 = female

data$sex_adj <- 1 #creating a new variable, with all 1's
data$sex_adj[data$SEX == "male"] <- 1 
data$sex_adj[data$SEX == "Male"] <- 1 
data$sex_adj[data$SEX == "Man"] <- 1 
data$sex_adj[data$SEX == "man"] <- 1 
data$sex_adj[data$SEX == "2"] <- 2
data$sex_adj[data$SEX == "Female"] <- 2
data$sex_adj[data$SEX == "female"] <- 2
data$sex_adj[data$SEX == " "] <- NA
data$sex_adj[is.na(data$SEX)] <- NA

unique(data$sex_adj) 
## [1]  1  2 NA
summary(data$sex_adj)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   1.000   2.000   1.562   2.000   2.000       1
prop.table(table(data$sex_adj))
## 
##         1         2 
## 0.4384303 0.5615697
summary(data$AGE)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   32.00   42.00   49.00   49.93   57.00   70.00       1
mean(data$AGE, na.rm=TRUE)
## [1] 49.9258
sd(data$AGE, na.rm=TRUE)
## [1] 8.676929
sum(is.na(data$AGE)) # We have 1 na in this variable
## [1] 1
hist(data$AGE, breaks = 90, xlab = "Age (in years)", main = "Age distribution", col = "brown2", xlim = c(20, 80))

which(data$AGE > 99)
## integer(0)
data$AGE[c(10, 15)] <- NA 

summary(data$AGE)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   32.00   42.00   49.00   49.93   57.00   70.00       3
mean(data$AGE, na.rm=TRUE)
## [1] 49.92644
sd(data$AGE, na.rm=TRUE)
## [1] 8.674094
sum(is.na(data$AGE))
## [1] 3
hist(data$AGE, breaks = 50, xlab = "Age (in years)", main = "Age distribution ", col = "brown2", xlim = c(25, 80))

quantile(data$AGE, probs=c(0.2),na.rm=TRUE)
## 20% 
##  41
qqnorm(data$AGE, col = "pink")
qqline(data$AGE, col="green") 

summary(data$CURSMOKE)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.4919  1.0000  1.0000       1
sum(is.na(data$CURSMOKE))
## [1] 1
which(is.na(data$CURSMOKE))
## [1] 4435
table(data$CURSMOKE)
## 
##    0    1 
## 2253 2181
prop.table(table(data$CURSMOKE)) 
## 
##         0         1 
## 0.5081191 0.4918809
summary(data$TOTCHOL)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     107     206     234     237     264     696      53
mean(data$TOTCHOL, na.rm=TRUE)
## [1] 236.9843
sd(data$TOTCHOL, na.rm=TRUE)
## [1] 44.6511
sum(is.na(data$TOTCHOL))
## [1] 53
hist(data$TOTCHOL, breaks = 100, xlab = "Total cholestrol (in mg/dL) ", main = "Total cholesterol distribution", col = "chocolate3",ylim=c(0,250))

which(data$TOTCHOL > 500)
## [1] 1428 2603
data$TOTCHOL_adj <- 1
data$TOTCHOL_adj [data$TOTCHOL < 200] <- 0
data$TOTCHOL_adj [data$TOTCHOL > 240] <- 2
data$TOTCHOL_adj [is.na(data$BMI)] <- NA

table(data$TOTCHOL_adj)
## 
##    0    1    2 
##  867 1665 1883
prop.table(table(data$TOTCHOL_adj)) 
## 
##         0         1         2 
## 0.1963760 0.3771234 0.4265006
summary(data$SYSBP)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    83.5   117.5   129.0   132.9   144.0   295.0       1
mean(data$SYSBP, na.rm=TRUE)
## [1] 132.9078
sd(data$SYSBP, na.rm=TRUE)
## [1] 22.4216
sum(is.na(data$SYSBP))
## [1] 1
hist(data$SYSBP, breaks = 100, xlab = "Systolic blood pressure", main = "Systolic Blood pressure distribution", col = "cyan",ylim=c(0,250))

which(data$SYSBP > 270)
## [1] 2238
unique(data$educ)
## [1]  4  1  3  2 NA
summary(data$educ)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   1.000   2.000   1.976   3.000   4.000     114
table(data$educ)
## 
##    1    2    3    4 
## 1822 1281  716  502
prop.table(table(data$educ))
## 
##         1         2         3         4 
## 0.4216617 0.2964592 0.1657024 0.1161768
unique(data$BMI)
##    [1] 26.97 25.34 26.36 23.61 26.31 22.37 26.88 21.96 24.18 26.09 28.15 27.57
##   [13] 23.62 18.59 25.90 24.71 38.53 28.09 40.11 27.78 26.21 23.59 24.33 23.47
##   [25] 25.65 22.29 23.58 19.64 26.84 23.01 28.61 30.58 26.52 32.51 28.56 31.56
##   [37] 25.42 18.23 24.80 22.70 28.57 31.01 27.54 29.62 32.23 26.40 21.93 25.09
##   [49] 29.29 29.44 25.95 25.87 25.10 21.84 25.50 24.07 25.63 28.21 24.30 26.82
##   [61] 31.61 26.49 28.34 30.77 22.99 23.50 28.04 21.19 23.35 24.12 21.66 26.51
##   [73] 26.79 18.99 30.76 23.25 28.82 20.12 28.63 23.79 30.85 29.51 22.93 29.79
##   [85] 23.75 20.65 20.27 25.74 22.30 29.56 31.30 28.62 24.85 27.06 26.32 24.04
##   [97] 29.35 30.62 21.97 33.61 21.90 19.36 29.77 30.96 24.59 27.33 29.73 38.14
##  [109] 22.32 24.15 24.26 25.49 30.28 24.10 25.14 26.22 26.74 21.52 28.39 32.63
##  [121] 23.94 21.61 19.97 27.87 33.11 26.41 17.44 29.69 26.72 23.22 29.64 26.50
##  [133] 29.91 26.45 23.16 19.26 21.51 25.38 25.77 25.82 25.08 23.27 26.46 21.86
##  [145] 29.00 32.35 15.54 24.68 27.17 19.98 27.23 25.62 24.76 24.35 31.42 23.44
##  [157] 25.13 27.45 25.52 39.88 26.23 21.79 27.47 36.12 31.80 28.35 24.01 31.07
##  [169] 29.18 25.01 25.84 27.61 27.48 23.68 26.57 29.60 29.46 30.02 31.22 25.94
##  [181] 25.11 27.22 19.82 27.38 26.26 23.88 26.55 29.15 26.75 25.72 26.76 31.69
##  [193] 21.91 27.51 22.26 29.17 24.67 22.67 28.66 24.89 22.78 31.46 30.84 29.63
##  [205] 20.57 30.92 30.55 28.07 16.59 22.42 26.47 24.42 26.29 32.58 26.60 22.63
##  [217] 29.82 25.79 29.23 28.83 25.83 20.87 30.22 27.05 26.98 24.44 19.53 22.54
##  [229] 17.81 20.56 27.29 22.11 23.30 24.58 26.67 27.86 27.80 31.96 29.04 27.08
##  [241] 20.39 26.81 27.60 19.11 29.86 21.73 25.35 26.25 31.39 27.73 28.36 26.58
##  [253] 29.32 24.87 21.45 28.96 25.88 21.24 26.59 31.44 29.98 27.77 27.43 25.48
##  [265] 25.05 24.95 31.68 24.75 18.70 22.72 30.50 29.25 26.48 28.87 22.92 26.14
##  [277] 22.56 23.80 25.86 25.61 16.87 27.98 23.55 22.69 28.30 26.17 24.39 21.03
##  [289] 27.97 24.83 27.18 31.14 27.01 20.63 28.47 24.86 26.56 27.93 22.17 25.04
##  [301] 25.16 28.45 27.50 28.46 24.24 25.26 27.27 26.37 28.49 20.55 23.38 30.64
##  [313] 26.77 26.91 24.25 22.45 31.16 21.67 19.42 25.75 27.53 22.12 22.73 30.38
##  [325] 30.93 18.84 28.95 30.03 22.91 31.76 24.27 32.73 24.49 21.89 32.52 23.03
##  [337] 32.66 23.04 28.74 23.65 23.72 25.29 27.75 25.98 25.02 27.10 26.69 23.98
##  [349] 24.54 31.27 29.59 28.23 28.59 23.39 26.62 28.86 25.67 23.85 28.92 28.76
##  [361] 28.50 26.53 24.43 25.41 32.40 30.12 30.63 26.65 27.13 20.17 29.01 20.19
##  [373] 24.77 29.11 19.44 32.00 24.06 24.08 27.88 34.99 28.28 29.34 33.32 27.55
##  [385] 24.52 24.50 25.68 23.31 32.65 26.18 26.28 24.28 22.85 23.90 27.71 27.25
##  [397] 30.42 27.63 30.18 29.47 22.81 25.97 36.52 28.58 24.36 27.30 23.48 28.33
##  [409] 24.62 30.07 24.45 33.03 35.45 27.44 28.70 22.39 31.74 27.04 26.04 23.37
##  [421] 25.18 24.69 27.02 28.25 32.84 30.70 24.72 26.73 22.16 28.78 27.66 26.33
##  [433] 29.85 28.89 32.15 28.52 22.48 32.02 22.53 25.54 22.41 35.11 27.99 26.54
##  [445] 26.83 21.18 29.76 24.51 24.79 26.02 31.23 23.56 27.82 24.96 32.19 23.51
##  [457] 32.27 21.85 29.38 27.67 34.53 30.27 28.88 22.24 20.41 24.74 29.03 29.84
##  [469] 29.43 21.01 24.16 22.05 33.49 28.53 35.20 19.84 26.78 21.48 24.94 25.24
##  [481] 23.84 26.05 23.60 23.28 30.46 30.69 24.38 27.52 35.85 23.87 18.00 18.65
##  [493] 22.68 26.00 20.86 23.19 30.39 28.32 22.74 29.41 31.19 21.68 22.01 27.34
##  [505] 25.17 31.60 26.71 26.08 24.90 24.05 25.71 29.42 22.71 21.35 19.94 24.65
##  [517] 23.08    NA 24.99 30.11 33.45 24.81 23.70 21.27 31.62 24.47 20.72 27.12
##  [529] 25.03 19.09 24.41 23.78 26.92 28.16 21.34 34.69 25.40 23.06 22.97 26.68
##  [541] 23.21 25.59 29.13 22.90 27.26 24.66 26.11 27.56 28.06 21.41 23.53 18.92
##  [553] 33.07 23.05 26.42 31.55 28.55 31.67 24.17 28.41 34.59 26.38 22.64 29.08
##  [565] 23.24 25.22 24.84 28.20 33.10 24.19 33.71 29.93 29.10 29.05 22.98 27.28
##  [577] 19.51 23.71 21.42 27.91 25.31 25.45 33.29 27.35 25.64 23.46 30.71 28.68
##  [589] 24.93 30.43 22.94 23.40 22.82 30.05 22.25 25.33 22.51 26.64 36.04 26.43
##  [601] 31.09 31.38 33.16 23.77 29.89 21.02 26.35 21.71 26.89 23.95 30.29 23.64
##  [613] 27.39 30.34 21.98 30.66 29.58 28.18 22.36 25.12 23.32 21.06 22.18 31.06
##  [625] 23.10 39.04 27.92 38.42 27.15 22.19 33.26 27.94 24.20 32.03 21.17 28.54
##  [637] 21.28 25.06 24.61 23.57 22.13 23.97 27.20 23.41 26.20 33.79 24.00 28.31
##  [649] 29.07 25.51 25.57 25.91 28.40 27.03 21.78 28.11 30.19 28.77 20.10 33.14
##  [661] 19.61 29.19 25.66 24.46 25.56 31.90 29.99 26.27 29.28 30.41 23.29 23.81
##  [673] 31.11 30.16 27.31 19.05 30.10 19.70 22.40 19.54 20.26 30.20 35.05 32.33
##  [685] 24.91 29.36 19.68 34.04 29.37 26.86 23.96 25.23 25.20 31.93 32.37 23.63
##  [697] 21.43 28.38 22.46 28.69 33.00 35.96 35.53 34.89 27.42 37.38 28.27 30.32
##  [709] 31.31 28.03 21.07 18.88 29.66 25.21 33.76 24.22 25.70 20.23 21.64 20.09
##  [721] 25.46 28.85 27.62 20.34 25.37 31.12 19.81 22.33 24.92 34.39 23.14 26.13
##  [733] 28.84 21.83 16.98 37.15 31.20 18.55 22.49 28.12 24.21 25.36 26.87 20.99
##  [745] 35.12 24.23 28.65 17.17 31.02 23.34 22.08 22.27 24.32 23.13 28.26 30.82
##  [757] 23.23 31.04 31.57 24.98 29.40 20.03 26.30 20.71 27.68 18.64 26.44 27.83
##  [769] 26.95 23.43 31.47 32.98 28.13 23.74 36.01 28.24 28.94 31.78 23.93 30.31
##  [781] 21.88 34.32 27.89 23.02 32.92 31.05 19.15 28.44 24.56 26.85 30.75 22.34
##  [793] 23.52 21.63 24.14 22.79 20.30 30.08 21.77 33.99 21.76 19.76 24.88 35.31
##  [805] 22.86 25.69 31.50 21.14 31.08 28.73 30.33 25.96 21.44 28.08 23.86 21.50
##  [817] 32.09 24.29 32.22 24.73 27.72 24.63 17.50 20.42 24.53 28.90 33.08 29.72
##  [829] 30.80 22.06 21.33 32.32 20.25 25.55 23.76 32.99 25.58 33.81 28.67 24.11
##  [841] 31.84 40.08 22.14 30.89 25.60 28.80 24.60 32.13 22.76 20.50 28.43 29.27
##  [853] 29.48 32.18 34.97 25.78 19.99 25.80 23.09 21.49 29.87 21.47 33.52 21.65
##  [865] 26.63 22.66 18.76 22.04 24.13 29.54 40.38 29.30 27.14 23.92 21.99 30.13
##  [877] 21.38 26.15 27.24 27.81 18.10 22.52 20.70 21.55 25.43 24.57 25.81 20.96
##  [889] 24.02 31.98 27.40 23.99 20.45 20.54 32.47 19.14 19.00 25.44 29.02 30.61
##  [901] 25.85 20.88 27.11 19.34 30.48 26.70 19.71 30.30 27.64 22.35 21.59 34.17
##  [913] 32.80 30.36 20.77 28.93 20.68 30.47 22.15 19.66 20.35 17.61 28.60 20.13
##  [925] 33.80 34.55 21.16 45.80 26.03 23.89 38.46 29.33 26.93 40.52 24.37 42.15
##  [937] 18.38 26.12 20.20 36.81 24.03 38.39 22.02 26.24 31.13 25.32 23.36 42.00
##  [949] 19.24 23.67 16.75 22.89 29.06 21.95 30.23 31.77 18.06 44.27 21.81 20.59
##  [961] 36.29 27.07 20.06 31.24 21.94 22.00 20.51 21.10 30.24 21.40 21.05 19.88
##  [973] 27.96 22.50 27.59 18.48 27.09 21.31 21.21 29.50 31.35 31.82 29.75 18.58
##  [985] 34.36 19.20 20.62 35.58 18.80 27.49 36.11 45.79 21.08 20.53 27.65 25.25
##  [997] 27.90 23.45 19.56 38.82 27.16 20.18 26.80 24.40 20.47 19.77 34.84 37.41
## [1009] 24.64 22.55 21.80 36.62 16.69 26.39 19.63 20.49 20.81 37.48 32.53 23.54
## [1021] 32.49 30.74 18.86 18.21 30.78 19.27 23.17 39.60 21.60 25.92 18.18 24.48
## [1033] 20.67 31.63 19.78 44.09 22.28 33.38 19.32 23.42 18.98 33.47 28.91 15.96
## [1045] 23.82 24.55 32.26 40.58 43.30 29.21 32.36 33.36 20.64 32.17 19.93 33.19
## [1057] 43.69 29.53 25.27 39.86 20.24 29.16 30.98 20.15 17.64 23.33 20.89 20.79
## [1069] 19.25 20.94 22.88 25.07 42.53 32.67 31.10 29.94 36.21 30.40 27.76 30.67
## [1081] 34.25 23.26 32.04 17.38 30.04 26.61 19.47 20.33 32.77 34.56 20.97 33.40
## [1093] 38.88 21.09 30.06 30.86 25.99 19.83 18.52 21.30 33.82 20.66 31.34 23.66
## [1105] 38.38 23.18 31.37 20.82 34.40 29.45 21.26 31.48 19.57 30.73 21.74 26.07
## [1117] 33.17 20.74 20.52 22.87 33.65 30.91 22.65 27.95 20.00 35.99 24.09 25.89
## [1129] 21.56 19.39 18.43 29.80 18.87 32.07 21.82 30.35 25.73 21.12 22.62 25.15
## [1141] 33.66 16.92 20.05 38.75 44.55 17.11 20.84 20.02 19.38 29.70 32.41 37.70
## [1153] 20.69 21.57 23.69 31.64 19.03 35.22 21.00 26.10 23.20 34.95 32.43 31.71
## [1165] 30.14 42.98 26.01 32.93 33.97 28.02 39.64 20.93 34.13 36.57 19.87 28.72
## [1177] 36.46 23.07 32.57 33.54 21.15 20.40 32.91 19.50 22.58 30.00 29.65 38.06
## [1189] 18.16 25.39 35.78 32.81 29.67 18.67 26.06 21.22 35.17 34.52 39.91 27.46
## [1201] 20.76 19.74 28.10 38.43 29.14 21.72 29.88 35.42 21.29 17.65 37.04 20.22
## [1213] 33.68 44.71 38.54 17.32 30.21 21.11 28.79 16.61 22.84 20.31 39.53 39.54
## [1225] 31.18 30.65 18.44 35.62 43.48 32.68 22.09 19.23 36.91 39.08 39.69 32.82
## [1237] 32.76 17.51 36.65 32.60 19.16 39.82 35.01 19.80 18.30 20.43 28.00 36.79
## [1249] 37.02 25.19 30.99 20.11 33.60 32.29 35.35 29.09 17.89 19.72 38.81 18.62
## [1261] 31.52 20.32 21.32 33.88 36.54 56.80 18.75 31.75 38.11 22.31 18.40 31.54
## [1273] 40.21 20.98 27.69 29.68 30.53 17.48 18.82 28.22 34.91 19.48 18.14 35.16
## [1285] 39.40 40.81 38.61 19.46 32.10 22.57 39.21 31.72 31.65 18.28 17.23 18.26
## [1297] 29.22 33.90 40.51 34.71 30.79 28.14 32.78 20.44 33.70 35.02 22.44 17.71
## [1309] 19.18 30.57 20.73 43.26 29.49 22.43 31.40 16.48 32.89 26.66 20.92 18.11
## [1321] 19.22 29.97 19.12 38.31 21.20 22.38 26.94 22.83 18.53 20.21 35.68 17.92
## [1333] 16.71 37.10 19.69 38.96 18.01 17.93 18.83 23.83 32.54 39.94 34.54 34.27
## [1345] 18.09 18.73 24.82 30.59 28.81 18.46 28.71 34.46 39.22 41.29 18.68 19.49
## [1357] 34.08 31.29 36.18 41.61 19.91 31.03 29.55 37.62 30.51 33.02 17.84 17.68
## [1369] 37.58 51.28 21.25 32.31 38.94 31.92 16.73 21.75 19.37 37.30 35.13 41.66
## [1381] 26.99 19.28 18.63 35.03 35.19 33.37 38.17 20.85 20.78 40.23 36.07 39.17
## [1393] 43.67 20.91
data$BMI_adj <- 1 #creating a new variable, with all 1's
data$BMI_adj[data$BMI < 18.5] <- 0 
data$BMI_adj[data$BMI > 25] <- 2
data$BMI_adj[is.na(data$BMI)] <- NA

unique(data$BMI_adj)
## [1]  2  1  0 NA
prop.table(table(data$BMI_adj))
## 
##          0          1          2 
## 0.01291053 0.43850510 0.54858437
sum(is.na(data$BMI_adj))
## [1] 20
data$CVD_adj <- 1
data$CVD_adj[data$PREVCHD == "1"] <- 0
data$CVD_adj[data$CVD == "0"] <- 0

summary(data$TIMECVD)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0    5145    8766    6818    8766    8766       1
mean(data$TIMECVD, na.rm=TRUE)
## [1] 6817.948
sd(data$TIMECVD, na.rm=TRUE)
## [1] 2804.316
sum(is.na(data$TIMECVD))
## [1] 1
hist(data$TIMECVD, ylim = c(0,200), breaks = 100, xlab = "Time to CVD (in days)", main = "Time to a CVD distribution", col = "#A39038")

fitKMCVD <- survfit(Surv(TIMECVD,CVD_adj)~1, data=data) 

ggsurvplot(fitKMCVD, ylim=c(0.5,1.0), xlab = 'Days', ylab = 'Survival', legend.labs = c(""), legend.title ='BMI', palette = c("#332A01", "#7EBBBB", "#290117"),data=data, title = "Kaplan-Meier Curve for incidence of CVD", ggtheme = theme_minimal())

########
fitKMBMI <- survfit(Surv(TIMECVD,CVD_adj)~BMI_adj, data=data) 

ggsurvplot(fitKMBMI, ylim=c(0.5,1.0), xlab = 'Days', ylab = 'Survival', legend.labs = c("Underweight", "Normal", "Overweight"), legend.title ='BMI', palette = c("#7EBBBB","#CB1F80", "#2F3BAE"),data=data, title = "Kaplan-Meier Curve for incidence of CVD", ggtheme = theme_minimal())

########
fitKMTOTCHOL <- survfit(Surv(TIMECVD,CVD_adj)~TOTCHOL_adj, data=data) 

ggsurvplot(fitKMTOTCHOL, ylim=c(0.5,1.0), xlab = 'Days', ylab = 'Survival', legend.labs = c("Low cholesterol", "Normal cholesterol", "High cholesterol"), legend.title ='Total cholesterol', palette = c("#7EBBBB","#CB1F80","#2F3BAE"),data=data, title = "Kaplan-Meier Curve for incidence of CVD",  ggtheme = theme_minimal())

#########
fitKMCURSMOKE <- survfit(Surv(TIMECVD,CVD_adj)~CURSMOKE, data=data)

ggsurvplot(fitKMCURSMOKE, ylim=c(0.5,1.0), xlab = 'Days', ylab = 'Survival', legend.labs = c("Non-smoker", "smoker"), legend.title ='Smoking', palette = c("#7EBBBB","#CB1F80"),data=data, title = "Kaplan-Meier Curve for incidence of CVD", ggtheme = theme_minimal())

For fitKMBMI Patients with Overweight are more likely to develop Myocardial infarction problems.And probabilty of survival is more low in comparison with other groups Underweight and Normal

For fitKMTOTCHOL Patients with high cholesterol are more likely of developing Myocardial infarction problems and survive less than the other groups Low cholesterol and Normal cholesterol.

For fitKMCURSMOKE Patients with smoker are more likely of developing Myocardial infarction problems and survive less than the Non-smoker group.

## Myocardial infarction
unique(data$CVD)
## [1]  1  0 NA
prop.table(table(data$CVD))
## 
##         0         1 
## 0.7390618 0.2609382
chisq.test(data$CVD_adj, data$BMI_adj)
## 
##  Pearson's Chi-squared test
## 
## data:  data$CVD_adj and data$BMI_adj
## X-squared = 65.876, df = 2, p-value = 4.958e-15

A small p-value

chisq.test(data$CVD_adj, data$TOTCHOL_adj)
## 
##  Pearson's Chi-squared test
## 
## data:  data$CVD_adj and data$TOTCHOL_adj
## X-squared = 54.682, df = 2, p-value = 1.337e-12

A small p-value

chisq.test(data$CVD_adj, data$CURSMOKE)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  data$CVD_adj and data$CURSMOKE
## X-squared = 4.9843, df = 1, p-value = 0.02558

A small p-value

\[MODELS~FOR~COX~PROPORTIONAL~HAZARD~REGRESION\]

We try variuos model for select the best

library(car) # For see best the p-values
## Loading required package: carData
# AIC criteria is for select the best model. In this case we need select the model with the most lower AIC.

coxfit1 <- coxph(Surv(CVD_adj)~ CURSMOKE, data=data) 
summary(coxfit1)
## Call:
## coxph(formula = Surv(CVD_adj) ~ CURSMOKE, data = data)
## 
##   n= 4434, number of events= 4434 
##    (1 observation deleted due to missingness)
## 
##              coef exp(coef) se(coef)      z Pr(>|z|)  
## CURSMOKE -0.05473   0.94674  0.03005 -1.822   0.0685 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##          exp(coef) exp(-coef) lower .95 upper .95
## CURSMOKE    0.9467      1.056    0.8926     1.004
## 
## Concordance= 0.52  (se = 0.009 )
## Likelihood ratio test= 3.32  on 1 df,   p=0.07
## Wald test            = 3.32  on 1 df,   p=0.07
## Score (logrank) test = 3.32  on 1 df,   p=0.07
Anova(coxfit1)
AIC(coxfit1)
## [1] 65606.02
coxfit1.1 <- coxph(Surv(CVD_adj)~ CURSMOKE 
                + sex_adj 
                + AGE
                + educ
                , data=data) 
summary(coxfit1.1)
## Call:
## coxph(formula = Surv(CVD_adj) ~ CURSMOKE + sex_adj + AGE + educ, 
##     data = data)
## 
##   n= 4319, number of events= 4319 
##    (116 observations deleted due to missingness)
## 
##               coef exp(coef)  se(coef)       z Pr(>|z|)    
## CURSMOKE -0.096089  0.908383  0.031898  -3.012  0.00259 ** 
## sex_adj   0.253587  1.288639  0.031528   8.043 8.76e-16 ***
## AGE      -0.020323  0.979883  0.001871 -10.863  < 2e-16 ***
## educ      0.039092  1.039866  0.015110   2.587  0.00968 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##          exp(coef) exp(-coef) lower .95 upper .95
## CURSMOKE    0.9084     1.1009    0.8533    0.9670
## sex_adj     1.2886     0.7760    1.2114    1.3708
## AGE         0.9799     1.0205    0.9763    0.9835
## educ        1.0399     0.9617    1.0095    1.0711
## 
## Concordance= 0.681  (se = 0.009 )
## Likelihood ratio test= 200.6  on 4 df,   p=<2e-16
## Wald test            = 198.6  on 4 df,   p=<2e-16
## Score (logrank) test = 199.4  on 4 df,   p=<2e-16
Anova(coxfit1.1)
AIC(coxfit1.1) # Most lower AIC
## [1] 63486.43
coxfit1.2 <- coxph(Surv(CVD_adj)~ CURSMOKE*sex_adj 
                + sex_adj 
                + AGE
                + educ
                , data=data) 
summary(coxfit1.2)
## Call:
## coxph(formula = Surv(CVD_adj) ~ CURSMOKE * sex_adj + sex_adj + 
##     AGE + educ, data = data)
## 
##   n= 4319, number of events= 4319 
##    (116 observations deleted due to missingness)
## 
##                       coef exp(coef)  se(coef)       z Pr(>|z|)    
## CURSMOKE         -0.036657  0.964007  0.103242  -0.355   0.7225    
## sex_adj           0.273445  1.314485  0.045562   6.002 1.95e-09 ***
## AGE              -0.020368  0.979838  0.001872 -10.879  < 2e-16 ***
## educ              0.039730  1.040529  0.015144   2.623   0.0087 ** 
## CURSMOKE:sex_adj -0.038093  0.962624  0.062916  -0.605   0.5449    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                  exp(coef) exp(-coef) lower .95 upper .95
## CURSMOKE            0.9640     1.0373    0.7874    1.1802
## sex_adj             1.3145     0.7608    1.2022    1.4373
## AGE                 0.9798     1.0206    0.9762    0.9834
## educ                1.0405     0.9610    1.0101    1.0719
## CURSMOKE:sex_adj    0.9626     1.0388    0.8509    1.0890
## 
## Concordance= 0.681  (se = 0.009 )
## Likelihood ratio test= 200.9  on 5 df,   p=<2e-16
## Wald test            = 199.1  on 5 df,   p=<2e-16
## Score (logrank) test = 199.8  on 5 df,   p=<2e-16
Anova(coxfit1.2)
AIC(coxfit1.2)
## [1] 63488.06

The best model was coxfit1.1 since it has the lowest AIC value, in addition, the different variables evaluated have a low p value, which indicates that they are statistically significant.

the p value for sex was 1e-15 with a hazard ratio HR = exp(coef) = 1.28 indicating a strong relationship between the patients’ sex and Myocardial infarction. The hazard ratios of covariates are interpretable as multiplicative effects on the hazard. For example, holding the other covariates constant, being female (sex=2) reduces the hazard by a factor of 1.28. We conclude that being female is associated with good forecast.

There is not evidence of interaction between sex in CVD_adj when CURSMOKE*sex_adj was test

coxfit2 <- coxph(Surv(CVD_adj)~ BMI_adj, data=data)
summary(coxfit2)
## Call:
## coxph(formula = Surv(CVD_adj) ~ BMI_adj, data = data)
## 
##   n= 4415, number of events= 4415 
##    (20 observations deleted due to missingness)
## 
##             coef exp(coef) se(coef)      z Pr(>|z|)    
## BMI_adj -0.19087   0.82624  0.02874 -6.641 3.11e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##         exp(coef) exp(-coef) lower .95 upper .95
## BMI_adj    0.8262       1.21     0.781    0.8741
## 
## Concordance= 0.573  (se = 0.009 )
## Likelihood ratio test= 43.44  on 1 df,   p=4e-11
## Wald test            = 44.11  on 1 df,   p=3e-11
## Score (logrank) test = 44.22  on 1 df,   p=3e-11
Anova(coxfit2)
AIC(coxfit2)
## [1] 65246.89
coxfit2.1 <- coxph(Surv(CVD_adj)~ BMI_adj 
                 + sex_adj 
                 + AGE 
                 + CURSMOKE
                 + GLUCOSE , data=data)
summary(coxfit2.1)
## Call:
## coxph(formula = Surv(CVD_adj) ~ BMI_adj + sex_adj + AGE + CURSMOKE + 
##     GLUCOSE, data = data)
## 
##   n= 4021, number of events= 4021 
##    (414 observations deleted due to missingness)
## 
##                coef  exp(coef)   se(coef)      z Pr(>|z|)    
## BMI_adj  -0.1311304  0.8771034  0.0311560 -4.209 2.57e-05 ***
## sex_adj   0.2243893  1.2515582  0.0331047  6.778 1.22e-11 ***
## AGE      -0.0191379  0.9810441  0.0019275 -9.929  < 2e-16 ***
## CURSMOKE -0.1100844  0.8957585  0.0335242 -3.284 0.001024 ** 
## GLUCOSE  -0.0025211  0.9974821  0.0006833 -3.690 0.000224 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##          exp(coef) exp(-coef) lower .95 upper .95
## BMI_adj     0.8771      1.140    0.8251    0.9323
## sex_adj     1.2516      0.799    1.1729    1.3355
## AGE         0.9810      1.019    0.9773    0.9848
## CURSMOKE    0.8958      1.116    0.8388    0.9566
## GLUCOSE     0.9975      1.003    0.9961    0.9988
## 
## Concordance= 0.691  (se = 0.009 )
## Likelihood ratio test= 211.2  on 5 df,   p=<2e-16
## Wald test            = 209.3  on 5 df,   p=<2e-16
## Score (logrank) test = 210.3  on 5 df,   p=<2e-16
Anova(coxfit2.1)
AIC(coxfit2.1)
## [1] 58509.77
coxfit2.2 <- coxph(Surv(CVD_adj)~ BMI_adj*sex_adj 
                 + AGE 
                 + CURSMOKE
                 + GLUCOSE , data=data)
summary(coxfit2.2)
## Call:
## coxph(formula = Surv(CVD_adj) ~ BMI_adj * sex_adj + AGE + CURSMOKE + 
##     GLUCOSE, data = data)
## 
##   n= 4021, number of events= 4021 
##    (414 observations deleted due to missingness)
## 
##                       coef  exp(coef)   se(coef)      z Pr(>|z|)    
## BMI_adj         -0.1610831  0.8512213  0.1032053 -1.561 0.118570    
## sex_adj          0.1949556  1.2152571  0.1021986  1.908 0.056441 .  
## AGE             -0.0191978  0.9809853  0.0019374 -9.909  < 2e-16 ***
## CURSMOKE        -0.1100365  0.8958014  0.0335289 -3.282 0.001031 ** 
## GLUCOSE         -0.0025215  0.9974817  0.0006832 -3.691 0.000224 ***
## BMI_adj:sex_adj  0.0189127  1.0190927  0.0621498  0.304 0.760892    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                 exp(coef) exp(-coef) lower .95 upper .95
## BMI_adj            0.8512     1.1748    0.6953    1.0421
## sex_adj            1.2153     0.8229    0.9947    1.4848
## AGE                0.9810     1.0194    0.9773    0.9847
## CURSMOKE           0.8958     1.1163    0.8388    0.9566
## GLUCOSE            0.9975     1.0025    0.9961    0.9988
## BMI_adj:sex_adj    1.0191     0.9813    0.9022    1.1511
## 
## Concordance= 0.691  (se = 0.009 )
## Likelihood ratio test= 211.3  on 6 df,   p=<2e-16
## Wald test            = 209.1  on 6 df,   p=<2e-16
## Score (logrank) test = 210.4  on 6 df,   p=<2e-16
Anova(coxfit2.2)
AIC(coxfit2.2)
## [1] 58511.68

The best model was coxfit2.1 since it has the most lowest AIC value, in addition, the different variables evaluated have a low p value, which indicates that they are statistically significant.

the p value for sex was 1.22e-11 with a hazard ratio HR = exp(coef) = 1.2516 indicating a strong relationship. The hazard ratios of covariates are interpretable as multiplicative effects on the hazard. For example, holding the other covariates constant, being female (sex=2) reduces the hazard by a factor of 1.25. We conclude that being female is associated with good forecast.

There is not evidence of interaction between sex in CVD_adj when BMI_adj*sex_adj was test.

coxfit3 <- coxph(Surv(CVD_adj)~ TOTCHOL_adj, data=data)
summary(coxfit3)
## Call:
## coxph(formula = Surv(CVD_adj) ~ TOTCHOL_adj, data = data)
## 
##   n= 4415, number of events= 4415 
##    (20 observations deleted due to missingness)
## 
##                coef exp(coef) se(coef)      z Pr(>|z|)    
## TOTCHOL_adj -0.1223    0.8849   0.0201 -6.083 1.18e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##             exp(coef) exp(-coef) lower .95 upper .95
## TOTCHOL_adj    0.8849       1.13    0.8507    0.9205
## 
## Concordance= 0.57  (se = 0.009 )
## Likelihood ratio test= 36.53  on 1 df,   p=2e-09
## Wald test            = 37  on 1 df,   p=1e-09
## Score (logrank) test = 37.08  on 1 df,   p=1e-09
Anova(coxfit3)
AIC(coxfit3)
## [1] 65253.8
coxfit3.1 <- coxph(Surv(CVD_adj)~ TOTCHOL_adj
                 + BMI_adj
                 + AGE
                 + sex_adj
                 + GLUCOSE, data=data)
summary(coxfit3.1)
## Call:
## coxph(formula = Surv(CVD_adj) ~ TOTCHOL_adj + BMI_adj + AGE + 
##     sex_adj + GLUCOSE, data = data)
## 
##   n= 4021, number of events= 4021 
##    (414 observations deleted due to missingness)
## 
##                   coef  exp(coef)   se(coef)      z Pr(>|z|)    
## TOTCHOL_adj -0.0870528  0.9166287  0.0216676 -4.018 5.88e-05 ***
## BMI_adj     -0.1027188  0.9023807  0.0310053 -3.313 0.000923 ***
## AGE         -0.0161747  0.9839554  0.0019214 -8.418  < 2e-16 ***
## sex_adj      0.2553925  1.2909682  0.0323732  7.889 3.05e-15 ***
## GLUCOSE     -0.0024921  0.9975111  0.0006841 -3.643 0.000270 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##             exp(coef) exp(-coef) lower .95 upper .95
## TOTCHOL_adj    0.9166     1.0910    0.8785    0.9564
## BMI_adj        0.9024     1.1082    0.8492    0.9589
## AGE            0.9840     1.0163    0.9803    0.9877
## sex_adj        1.2910     0.7746    1.2116    1.3755
## GLUCOSE        0.9975     1.0025    0.9962    0.9988
## 
## Concordance= 0.693  (se = 0.009 )
## Likelihood ratio test= 216.4  on 5 df,   p=<2e-16
## Wald test            = 217.6  on 5 df,   p=<2e-16
## Score (logrank) test = 218.8  on 5 df,   p=<2e-16
Anova(coxfit3.1)
AIC(coxfit3.1)
## [1] 58504.54
coxfit3.2 <- coxph(Surv(CVD_adj)~ TOTCHOL_adj*sex_adj 
                   + BMI_adj
                   + AGE
                   + sex_adj
                   + GLUCOSE, data=data)
summary(coxfit3.2)
## Call:
## coxph(formula = Surv(CVD_adj) ~ TOTCHOL_adj * sex_adj + BMI_adj + 
##     AGE + sex_adj + GLUCOSE, data = data)
## 
##   n= 4021, number of events= 4021 
##    (414 observations deleted due to missingness)
## 
##                           coef  exp(coef)   se(coef)      z Pr(>|z|)    
## TOTCHOL_adj         -0.1266727  0.8810220  0.0698381 -1.814 0.069708 .  
## sex_adj              0.2239804  1.2510465  0.0618001  3.624 0.000290 ***
## BMI_adj             -0.1026301  0.9024607  0.0310054 -3.310 0.000933 ***
## AGE                 -0.0164030  0.9837308  0.0019593 -8.372  < 2e-16 ***
## GLUCOSE             -0.0024931  0.9975100  0.0006843 -3.643 0.000269 ***
## TOTCHOL_adj:sex_adj  0.0256464  1.0259781  0.0429990  0.596 0.550880    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                     exp(coef) exp(-coef) lower .95 upper .95
## TOTCHOL_adj            0.8810     1.1350    0.7683    1.0103
## sex_adj                1.2510     0.7993    1.1083    1.4121
## BMI_adj                0.9025     1.1081    0.8493    0.9590
## AGE                    0.9837     1.0165    0.9800    0.9875
## GLUCOSE                0.9975     1.0025    0.9962    0.9988
## TOTCHOL_adj:sex_adj    1.0260     0.9747    0.9431    1.1162
## 
## Concordance= 0.693  (se = 0.009 )
## Likelihood ratio test= 216.8  on 6 df,   p=<2e-16
## Wald test            = 216.9  on 6 df,   p=<2e-16
## Score (logrank) test = 218.9  on 6 df,   p=<2e-16
Anova(coxfit3.2)
AIC(coxfit3.2)
## [1] 58506.18

The best model was coxfit3.1 since it has the most lowest AIC value, in addition, the different variables evaluated have a low p value, which indicates that they are statistically significant.

the p value for sex was 3.05e-15 with a hazard ratio HR = exp(coef) = 1.29 indicating a strong relationship. The hazard ratios of covariates are interpretable as multiplicative effects on the hazard. For example, holding the other covariates constant, being female (sex=2) reduces the hazard by a factor of 1.29. We conclude that being female is associated with good forecast.

There is not evidence of interaction between sex in CVD_adj when TOTCHOL_adj*sex_adj was test.

\[Discussion\]

After cleaning the database, ordering each variable in classificatory values and reducing the events of missing data, the analysis of the survival models continued. This is important to standardize evaluation criteria.

Attempted to find a Cox porportional hazards regesion model to predict Myocardial infarction (CVD_adj) based on Current cigarette smoking at exam (CURSMOKE), Body Mass Index (BMI), and Serum Total Cholesterol (TOTCHOL), as well as AGE, educ, sex_adj, Casual serum glucose (GLUCOSE). Various Models was test in combinations between variables (TOTCHOL + CURSMOKE) in some cases.

In each case, the best model that presented significant values (low p-values) in all the contrasted variables, as well as in the other evaluation criteria, was selected. For each case, three models were tested and the best one was chosen. In addition, the interaction with the sex_adj of the patients was performed (coxfit1.2, coxfit2.2, coxfit3.2), but no statistically significant values were found. This indicates that sex has no interaction with the variables evaluated. But when evaluating them without interaction, they do present statistical significance.

For example, Kaplan-Meier Curve for incidence of CVD with TOTCHOL, allows us to assume that patients with high cholesterol content have a lower probability of survival if they present CDV, while patients with low cholesterol content can and have a longer time of CVD involvement prolonged survival. Likewise, it can be said with Current cigarette smoking patients that they have a lower probability of survival if they present CVD problems.

We can then say that both the graphs and the models allow to show the relationship between the variables evaluated to assume their behavior in the survival of the patients.