What are the significant factors associated with the survival time of individuals diagnosed with prostate-related conditions, and how do these factors, such as education, income, race/ethnicity, and health insurance, impact their survival?
Interpretation: Multiple datasets from NHANES are merged here to create a comprehensive dataset. This approach ensures that all relevant variables are included for the analysis, focusing on factors impacting the survival of individuals with prostate-related conditions.
# Load necessary libraries
library(survival)
library(ggplot2)
library(survminer)
## Loading required package: ggpubr
##
## Attaching package: 'survminer'
## The following object is masked from 'package:survival':
##
## myeloma
library(ggpubr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(mice)
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(tidyr)
library(dplyr)
library(haven)
library(Amelia)
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.1, built: 2022-11-18)
## ## Copyright (C) 2005-2023 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(purrr)
library(readr)
library(survey)
## Loading required package: grid
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
##
## dotchart
# Read the XPT files with the complete file paths
nhanes_data <- read_xpt("/Users/christacrumrine/Desktop/Ad_methods/DEMO_D.XPT")
nhanes_data1 <- read_xpt("/Users/christacrumrine/Desktop/Ad_methods/ALQ_D.XPT")
nhanes_data2 <- read_xpt("/Users/christacrumrine/Desktop/Ad_methods/HIQ_D.XPT")
nhanes_data3 <- read_xpt("/Users/christacrumrine/Desktop/Ad_methods/KIQ_P_D.XPT")
nhanes_data4 <- read_xpt("/Users/christacrumrine/Desktop/Ad_methods/MCQ_D.XPT")
nhanes_data5 <- read_xpt("/Users/christacrumrine/Desktop/Ad_methods/PSA_D.XPT")
nhanes_data6 <- read_xpt("/Users/christacrumrine/Desktop/Ad_methods/PSQ_D.XPT")
# Combine all ways
#merged_data <- left_join(nhanes_data1, nhanes_data, by="SEQN")
#merged_data <- left_join(merged_data, nhanes_data2, by="SEQN")
#merged_data <- left_join(merged_data, nhanes_data3, by="SEQN")
#merged_data <- left_join(merged_data, nhanes_data4, by="SEQN")
#merged_data <- left_join(merged_data, nhanes_data5, by="SEQN")
#merged_data <- left_join(merged_data, nhanes_data6, by="SEQN")
#merged_data <- left_join(merged_data, nhanes_data7, by="SEQN")
#merged_data <- left_join(merged_data, NHANES_2006, by="SEQN")
df = list(nhanes_data,nhanes_data6, nhanes_data2,nhanes_data3, nhanes_data4, nhanes_data5, nhanes_data6)
final_df <- df %>% reduce(full_join,
by = 'SEQN')
# Merge the datasets using bind_rows
#merged_data <- bind_rows(nhanes_data, nhanes_data1, nhanes_data2, nhanes_data3, nhanes_data4)
srvyin <- paste("NHANES_2005_2006_MORT_2019_PUBLIC (1).dat") # full .DAT name here
srvyout <- "NHANES_2006" # shorthand dataset name here
# read in the fixed-width format ASCII file
dsn <- read_fwf(file=srvyin,
col_types = "iiiiiiii",
fwf_cols(seqn = c(1,6),
eligstat = c(15,15),
mortstat = c(16,16),
ucod_leading = c(17,19),
diabetes = c(20,20),
hyperten = c(21,21),
permth_int = c(43,45),
permth_exm = c(46,48)
),
na = c("", ".")
)
dsn <- dsn %>%
rename(SEQN = seqn)
Interpretation: In this step, key variables relevant to the study, such as education, race/ethnicity, income, and health metrics, are selected and cleaned. This includes handling missing values and transforming some variables to the appropriate format for analysis.
# Assuming you have already imported your dataset into the 'data' variable
# You can use the 'rename' function to rename the variables
# Create a new dataset with selected variables
selected_data <- final_df %>%
left_join(dsn, by = "SEQN") %>%
select(
Education = DMDEDUC2, #categorical
Race_Ethnicity = RIDRETH1, #categorical
Income = INDFMINC, #categorical
AGE_TOLD_PROSTATE = KID221, #continuous
DIAGNOSED_PROSTATE = KIQ201, #categorical
PROSTATE_ENLARGE = KIQ121, #continuous
AGE_PSA_TEST = MCQ320,
PSA_TOTAL = LBXP1, #continuous
AGE_CURRENT = RIDAGEYR, #continuous
CITIZEN_STATUS = DMDCITZN, #categorical
HEALTH_INSURANCE = HIQ011, #categorical
CURRENT_AGE = RIDAGEYR, #continuous
#RADIATION_TX = KIQ301,
#MED_TX = KIQ311,
mortstat, # Adding mortstat from dsn,
permth_exm,
ucod_leading,
#AGE_AT_DEATH is a continuous variable
)
#Replace NA with 0 in all columns of select_data
selected_data <- selected_data %>%
mutate_all(funs(ifelse(is.na(.), 0, .)))
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## ℹ Please use a list of either functions or lambdas:
##
## # Simple named list: list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`: tibble::lst(mean, median)
##
## # Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
selected_data <- selected_data %>%
mutate(AGE_TOLD_PROSTATE = as.character(AGE_TOLD_PROSTATE),
AGE_TOLD_PROSTATE = ifelse(AGE_TOLD_PROSTATE == "85 or greater", "85", AGE_TOLD_PROSTATE),
AGE_TOLD_PROSTATE = as.numeric(as.character(AGE_TOLD_PROSTATE)))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `AGE_TOLD_PROSTATE =
## as.numeric(as.character(AGE_TOLD_PROSTATE))`.
## Caused by warning:
## ! NAs introduced by coercion
# First, filter out deaths not caused by ucod_leading == 2
selected_data_filtered <- selected_data %>%
filter(!(mortstat == 1 & ucod_leading != 2))
#0= non-white and 1=white
selected_data <- selected_data %>%
mutate(Race_Ethnicity = ifelse(Race_Ethnicity == 1, 1, 0))
selected_data <- na.omit(selected_data)
#0=no insurance, 1=insurance
selected_data <- selected_data %>%
mutate(HEALTH_INSURANCE = ifelse(HEALTH_INSURANCE == 1, 1, 0))
library(dplyr)
# Filter to keep only those diagnosed with prostate cancer
selected_data <- selected_data %>%
filter(DIAGNOSED_PROSTATE == 1)
# Create or update the Age_or_Status variable
selected_data <- selected_data %>%
mutate(Age_or_Status = ifelse(mortstat == 0,
paste0(CURRENT_AGE, "+"), # Current age with a '+' if alive
as.character(floor(CURRENT_AGE + permth_exm / 12)))) # Rounded down age at death if dead
# This code first filters the selected_data to include only individuals diagnosed with prostate cancer.
# Then, for these individuals:
# - If they died from ucod_leading == 2, it calculates their age at death.
# - If they are still alive, it shows their current age with a '+'.
library(stringr)
library(dplyr)
# Selecting CURRENT_AGE and Age_or_Status from the filtered data
age_status_table <- selected_data %>%
select(CURRENT_AGE, Age_or_Status)
# Create a numeric variable from Age_or_Status
selected_data <- selected_data %>%
mutate(Age_or_Status_Numeric = as.numeric(str_replace(Age_or_Status, "\\+", "")))
selected_data <- selected_data %>%
mutate(permth_exm = floor(permth_exm / 12))
selected_data <- selected_data %>%
mutate(Age_At_Death = AGE_CURRENT + permth_exm)
Interpretation: The dataset is filtered to include only individuals diagnosed with prostate conditions.
selected_data <- selected_data %>% filter(DIAGNOSED_PROSTATE ==1)
selected_data
## # A tibble: 59 × 18
## Education Race_Ethnicity Income AGE_TOLD_PROSTATE DIAGNOSED_PROSTATE
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 0 5 79 1
## 2 4 0 7 65 1
## 3 4 0 6 61 1
## 4 5 0 7 64 1
## 5 3 0 10 62 1
## 6 4 0 11 54 1
## 7 2 0 7 64 1
## 8 3 0 99 81 1
## 9 5 0 11 80 1
## 10 2 0 5 70 1
## # ℹ 49 more rows
## # ℹ 13 more variables: PROSTATE_ENLARGE <dbl>, AGE_PSA_TEST <dbl>,
## # PSA_TOTAL <dbl>, AGE_CURRENT <dbl>, CITIZEN_STATUS <dbl>,
## # HEALTH_INSURANCE <dbl>, CURRENT_AGE <dbl>, mortstat <dbl>,
## # permth_exm <dbl>, ucod_leading <dbl>, Age_or_Status <chr>,
## # Age_or_Status_Numeric <dbl>, Age_At_Death <dbl>
Interpretation: This section provides a detailed summary of key variables, including demographic and health-related factors.
# Summary statistics of key variables
summary(selected_data$Age_At_Death)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 65.00 79.00 85.00 83.41 89.00 94.00
summary(selected_data$Education)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 4.000 3.559 4.000 9.000
summary(selected_data$Income)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 6.00 7.00 11.44 10.50 99.00
summary(selected_data$Race_Ethnicity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05085 0.00000 1.00000
summary(selected_data$Citizenship_Status)
## Warning: Unknown or uninitialised column: `Citizenship_Status`.
## Length Class Mode
## 0 NULL NULL
summary(selected_data$Health_Insurance)
## Warning: Unknown or uninitialised column: `Health_Insurance`.
## Length Class Mode
## 0 NULL NULL
dim(selected_data) # Check dimensions of the dataframe
## [1] 59 18
summary(selected_data)
## Education Race_Ethnicity Income AGE_TOLD_PROSTATE
## Min. :1.000 Min. :0.00000 Min. : 2.00 Min. :54.00
## 1st Qu.:3.000 1st Qu.:0.00000 1st Qu.: 6.00 1st Qu.:64.00
## Median :4.000 Median :0.00000 Median : 7.00 Median :70.00
## Mean :3.559 Mean :0.05085 Mean :11.44 Mean :70.02
## 3rd Qu.:4.000 3rd Qu.:0.00000 3rd Qu.:10.50 3rd Qu.:75.00
## Max. :9.000 Max. :1.00000 Max. :99.00 Max. :85.00
## DIAGNOSED_PROSTATE PROSTATE_ENLARGE AGE_PSA_TEST PSA_TOTAL
## Min. :1 Min. :0.000 Min. : 0.0 Min. :0
## 1st Qu.:1 1st Qu.:1.000 1st Qu.: 56.5 1st Qu.:0
## Median :1 Median :1.000 Median : 66.0 Median :0
## Mean :1 Mean :1.254 Mean :142.0 Mean :0
## 3rd Qu.:1 3rd Qu.:1.000 3rd Qu.: 73.0 3rd Qu.:0
## Max. :1 Max. :9.000 Max. :999.0 Max. :0
## AGE_CURRENT CITIZEN_STATUS HEALTH_INSURANCE CURRENT_AGE
## Min. :57.00 Min. :1.000 Min. :0.0000 Min. :57.00
## 1st Qu.:69.50 1st Qu.:1.000 1st Qu.:1.0000 1st Qu.:69.50
## Median :75.00 Median :1.000 Median :1.0000 Median :75.00
## Mean :74.69 Mean :1.017 Mean :0.9831 Mean :74.69
## 3rd Qu.:80.50 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:80.50
## Max. :85.00 Max. :2.000 Max. :1.0000 Max. :85.00
## mortstat permth_exm ucod_leading Age_or_Status
## Min. :0.0000 Min. : 0.000 Min. : 0.00 Length:59
## 1st Qu.:0.0000 1st Qu.: 5.000 1st Qu.: 0.00 Class :character
## Median :1.0000 Median : 9.000 Median : 2.00 Mode :character
## Mean :0.6949 Mean : 8.712 Mean : 2.61
## 3rd Qu.:1.0000 3rd Qu.:13.000 3rd Qu.: 2.00
## Max. :1.0000 Max. :14.000 Max. :10.00
## Age_or_Status_Numeric Age_At_Death
## Min. :57.00 Min. :65.00
## 1st Qu.:73.00 1st Qu.:79.00
## Median :80.00 Median :85.00
## Mean :79.31 Mean :83.41
## 3rd Qu.:87.50 3rd Qu.:89.00
## Max. :93.00 Max. :94.00
sapply(selected_data, function(x) sum(is.na(x))) # Counts NAs for each variable
## Education Race_Ethnicity Income
## 0 0 0
## AGE_TOLD_PROSTATE DIAGNOSED_PROSTATE PROSTATE_ENLARGE
## 0 0 0
## AGE_PSA_TEST PSA_TOTAL AGE_CURRENT
## 0 0 0
## CITIZEN_STATUS HEALTH_INSURANCE CURRENT_AGE
## 0 0 0
## mortstat permth_exm ucod_leading
## 0 0 0
## Age_or_Status Age_or_Status_Numeric Age_At_Death
## 0 0 0
sapply(selected_data, class) # Displays the data type of each variable
## Education Race_Ethnicity Income
## "numeric" "numeric" "numeric"
## AGE_TOLD_PROSTATE DIAGNOSED_PROSTATE PROSTATE_ENLARGE
## "numeric" "numeric" "numeric"
## AGE_PSA_TEST PSA_TOTAL AGE_CURRENT
## "numeric" "numeric" "numeric"
## CITIZEN_STATUS HEALTH_INSURANCE CURRENT_AGE
## "numeric" "numeric" "numeric"
## mortstat permth_exm ucod_leading
## "numeric" "numeric" "numeric"
## Age_or_Status Age_or_Status_Numeric Age_At_Death
## "character" "numeric" "numeric"
# Summary Statistics for Education
summary(selected_data$Education)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 4.000 3.559 4.000 9.000
# Histogram for Education
hist(selected_data$Education, main = "Histogram of Education", xlab = "Education")
# Structure of the dataset
str(selected_data$Education)
## num [1:59] 4 4 4 5 3 4 2 3 5 2 ...
library(ggplot2)
# Summary Statistics for Education
summary(selected_data$Education)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 4.000 3.559 4.000 9.000
# Summary Statistics for Income
summary(selected_data$Income)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 6.00 7.00 11.44 10.50 99.00
# Histogram for Income
hist(selected_data$Income, main = "Histogram of Income", xlab = "Income")
# Bar Plot for Income (if it's a categorical variable)
# Note: You may need to modify the visualization depending on the nature of the variable.
ggplot(selected_data, aes(x = Income)) +
geom_bar() +
labs(
title = "Distribution of Income ",
x = "Income ",
y = "Frequency"
)
# Summary Statistics for Race_Ethnicity
summary(selected_data$Race_Ethnicity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05085 0.00000 1.00000
# Histogram for Race_Ethnicity
hist(selected_data$Race_Ethnicity, main = "Histogram of Race_Ethnicity", xlab = "Race_Ethnicity")
# Bar Plot for Race_Ethnicity (if it's a categorical variable)
# Note: You may need to modify the visualization depending on the nature of the variable.
ggplot(selected_data, aes(x = Race_Ethnicity)) +
geom_bar() +
labs(
title = "Distribution of Race_Ethnicity ",
x = "Race_Ethnicity ",
y = "Frequency"
)
# Summary Statistics for HEALTH_INSURANCE
summary(selected_data$HEALTH_INSURANCE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 1.0000 1.0000 0.9831 1.0000 1.0000
# Histogram for HEALTH_INSURANCE
hist(selected_data$HEALTH_INSURANCE, main = "Histogram of HEALTH_INSURANCE", xlab = "HEALTH_INSURANCE")
# Bar Plot for HEALTH_INSURANCE (if it's a categorical variable)
# Note: You may need to modify the visualization depending on the nature of the variable.
ggplot(selected_data, aes(x = HEALTH_INSURANCE)) +
geom_bar() +
labs(
title = "Distribution of HEALTH_INSURANCE ",
x = "HEALTH_INSURANCE ",
y = "Frequency"
)
The histogram shows a distribution with multiple peaks, suggesting that the ages at death are concentrated around certain values. There appears to be a relatively uniform distribution with slight increases in frequency at certain age intervals, notably in the late 70s to early 90s. This could indicate common ages at which death occurs in the studied population, possibly due to common age-related factors.
The histogram does not show a smooth, bell-shaped curve, which suggests that the age at death is not normally distributed in this sample. This could be due to a variety of reasons such as the sample size, the presence of outliers, or the natural life course events that don’t follow a normal distribution pattern.
# Summary Statistics for Age_At_Death
summary(selected_data$Age_At_Death)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 65.00 79.00 85.00 83.41 89.00 94.00
# Histogram for Age_At_Death
hist(selected_data$Age_At_Death, main = "Histogram of Age_At_Death", xlab = "mortstat")
# Box Plot for Age_At_Death
#boxplot(selected_data$Age_At_Death, main = "Box Plot of Age_At_Death", ylab = "mortstat")
# Density Plot for Age_At_Death
#plot(density(selected_data$Age_At_Death), main = "Density Plot of Age_At_Death", xlab = "Age_At_Death")
# Q-Q Plot for Age_At_Death (if it's a continuous variable)
#qqnorm(selected_data$Age_At_Death)
#qqline(selected_data$Age_At_Death)
# Bar Plot for Age_At_Death (if it's a categorical variable)
# Note: You may need to modify the visualization depending on the nature of the variable.
ggplot(selected_data, aes(x = Age_At_Death)) +
geom_bar() +
labs(
title = "Distribution of Age_At_Death ",
x = "Age_At_Death ",
y = "Frequency"
)
#tabyl(selected_data$Age_At_Death)
x <-selected_data %>%
filter(mortstat ==1 )
selected_data <- selected_data %>%
mutate(ageevent = AGE_CURRENT + permth_exm) # Calculate ageevent
# Assuming you want to perform a similar analysis as with tabyl
library(gtsummary)
# Create a tbl_summary object for ageevent
tbl_ageevent <- selected_data %>%
select(ageevent) %>%
tbl_summary()
# Print the table for ageevent
tbl_ageevent
| Characteristic | N = 591 |
|---|---|
| ageevent | 85 (79, 89) |
| 1 Median (IQR) | |
Interpretation: A Kaplan-Meier survival curve is created here to estimate the survival function from lifetime data. This is a non-parametric statistic used to estimate the survival probability from observed survival times.
From this curve, we can infer that survival decreases over time, as expected. The curve’s shape could be analyzed to understand specific periods where there may be a higher risk of death post-diagnosis. For instance, since there is a steep drop at a, this might indicate a period of higher mortality risk.
# Calculate Age_At_Death using AGE_CURRENT and permth_exm
selected_data <- selected_data %>%
mutate(Age_At_Death = AGE_CURRENT + permth_exm)
# Load the survival library (if not already loaded)
library(survival)
surv_obj <- Surv(time = selected_data$Age_At_Death, event = selected_data$mortstat)
# Create a Kaplan-Meier survival curve
km_curve <- survfit(surv_obj ~ 1) # Assuming you want to compare all individuals
# Plot the Kaplan-Meier survival curve
plot(km_curve, main = "Kaplan-Meier Survival Curve",
xlab = "Time (Age at Death from Diagnosis)",
ylab = "Survival Probability")
The Chi-squared test statistic is 40.692 with 15 degrees of freedom. The p-value is very small (less than 0.001). A common threshold for statistical significance is p < 0.05. Since the p-value here is much less than this, we would reject the null hypothesis of no association between the variables at conventional significance levels. This suggests that there is a statistically significant association between the two categorical variables in the contingency table.
# Create a contingency table between two categorical variables (e.g., Education and Race_Ethnicity)
contingency_table <- table(selected_data$Education, selected_data$Race_Ethnicity)
# Perform a chi-square test
chi_square_test <- chisq.test(contingency_table)
## Warning in chisq.test(contingency_table): Chi-squared approximation may be
## incorrect
# Print the chi-square test results
print(chi_square_test)
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 21.342, df = 5, p-value = 0.000698
# Calculate correlation matrix for numeric variables
numeric_vars <- select_if(selected_data, is.numeric)
correlation_matrix <- cor(numeric_vars)
## Warning in cor(numeric_vars): the standard deviation is zero
# Print the correlation matrix
print(correlation_matrix)
## Education Race_Ethnicity Income AGE_TOLD_PROSTATE
## Education 1.00000000 0.18557411 0.39313947 -0.05186320
## Race_Ethnicity 0.18557411 1.00000000 0.31709554 0.15950032
## Income 0.39313947 0.31709554 1.00000000 0.11838521
## AGE_TOLD_PROSTATE -0.05186320 0.15950032 0.11838521 1.00000000
## DIAGNOSED_PROSTATE NA NA NA NA
## PROSTATE_ENLARGE 0.07002114 -0.05079330 0.03900663 -0.03895601
## AGE_PSA_TEST -0.12973099 -0.05646898 -0.03508211 0.32832791
## PSA_TOTAL NA NA NA NA
## AGE_CURRENT -0.02664666 0.17785861 0.09617966 0.86579840
## CITIZEN_STATUS 0.51725774 0.56730863 0.60832452 -0.01846651
## HEALTH_INSURANCE -0.04189627 0.03039153 0.03779949 0.16373640
## CURRENT_AGE -0.02664666 0.17785861 0.09617966 0.86579840
## mortstat -0.26471917 -0.01419996 -0.17931908 0.18990634
## permth_exm 0.11348674 0.05045025 0.11741003 -0.24545442
## ucod_leading -0.23896791 0.19136605 -0.14318358 0.15257251
## Age_or_Status_Numeric -0.14479644 0.16534994 0.02065212 0.71745909
## Age_At_Death 0.04052296 0.20503756 0.16392241 0.70909880
## ageevent 0.04052296 0.20503756 0.16392241 0.70909880
## DIAGNOSED_PROSTATE PROSTATE_ENLARGE AGE_PSA_TEST
## Education NA 0.070021136 -0.129730986
## Race_Ethnicity NA -0.050793303 -0.056468975
## Income NA 0.039006628 -0.035082112
## AGE_TOLD_PROSTATE NA -0.038956014 0.328327905
## DIAGNOSED_PROSTATE 1 NA NA
## PROSTATE_ENLARGE NA 1.000000000 -0.006024088
## AGE_PSA_TEST NA -0.006024088 1.000000000
## PSA_TOTAL NA NA NA
## AGE_CURRENT NA 0.062959377 0.233090053
## CITIZEN_STATUS NA 0.084525405 -0.035220402
## HEALTH_INSURANCE NA 0.028815479 0.040752358
## CURRENT_AGE NA 0.062959377 0.233090053
## mortstat NA -0.045237614 0.205077705
## permth_exm NA 0.014435108 -0.216473586
## ucod_leading NA -0.116168134 0.246421782
## Age_or_Status_Numeric NA 0.028626087 0.227942764
## Age_At_Death NA 0.070565389 0.102389451
## ageevent NA 0.070565389 0.102389451
## PSA_TOTAL AGE_CURRENT CITIZEN_STATUS HEALTH_INSURANCE
## Education NA -0.02664666 0.51725774 -0.041896265
## Race_Ethnicity NA 0.17785861 0.56730863 0.030391534
## Income NA 0.09617966 0.60832452 0.037799491
## AGE_TOLD_PROSTATE NA 0.86579840 -0.01846651 0.163736400
## DIAGNOSED_PROSTATE NA NA NA NA
## PROSTATE_ENLARGE NA 0.06295938 0.08452541 0.028815479
## AGE_PSA_TEST NA 0.23309005 -0.03522040 0.040752358
## PSA_TOTAL 1 NA NA NA
## AGE_CURRENT NA 1.00000000 -0.01243566 0.227178279
## CITIZEN_STATUS NA -0.01243566 1.00000000 0.017241379
## HEALTH_INSURANCE NA 0.22717828 0.01724138 1.000000000
## CURRENT_AGE NA 1.00000000 -0.01243566 0.227178279
## mortstat NA 0.27847237 -0.19817172 -0.087002219
## permth_exm NA -0.27430978 0.12853985 -0.008637065
## ucod_leading NA 0.22371294 -0.10405890 0.024325457
## Age_or_Status_Numeric NA 0.87638334 -0.07822591 0.122462287
## Age_At_Death NA 0.82442077 0.06339267 0.218884137
## ageevent NA 0.82442077 0.06339267 0.218884137
## CURRENT_AGE mortstat permth_exm ucod_leading
## Education -0.02664666 -0.26471917 0.113486744 -0.23896791
## Race_Ethnicity 0.17785861 -0.01419996 0.050450252 0.19136605
## Income 0.09617966 -0.17931908 0.117410032 -0.14318358
## AGE_TOLD_PROSTATE 0.86579840 0.18990634 -0.245454423 0.15257251
## DIAGNOSED_PROSTATE NA NA NA NA
## PROSTATE_ENLARGE 0.06295938 -0.04523761 0.014435108 -0.11616813
## AGE_PSA_TEST 0.23309005 0.20507770 -0.216473586 0.24642178
## PSA_TOTAL NA NA NA NA
## AGE_CURRENT 1.00000000 0.27847237 -0.274309784 0.22371294
## CITIZEN_STATUS -0.01243566 -0.19817172 0.128539853 -0.10405890
## HEALTH_INSURANCE 0.22717828 -0.08700222 -0.008637065 0.02432546
## CURRENT_AGE 1.00000000 0.27847237 -0.274309784 0.22371294
## mortstat 0.27847237 1.00000000 -0.715855848 0.52509460
## permth_exm -0.27430978 -0.71585585 1.000000000 -0.28973210
## ucod_leading 0.22371294 0.52509460 -0.289732104 1.00000000
## Age_or_Status_Numeric 0.87638334 0.57248922 -0.233687411 0.40684400
## Age_At_Death 0.82442077 -0.14678280 0.318120583 0.05002831
## ageevent 0.82442077 -0.14678280 0.318120583 0.05002831
## Age_or_Status_Numeric Age_At_Death ageevent
## Education -0.14479644 0.04052296 0.04052296
## Race_Ethnicity 0.16534994 0.20503756 0.20503756
## Income 0.02065212 0.16392241 0.16392241
## AGE_TOLD_PROSTATE 0.71745909 0.70909880 0.70909880
## DIAGNOSED_PROSTATE NA NA NA
## PROSTATE_ENLARGE 0.02862609 0.07056539 0.07056539
## AGE_PSA_TEST 0.22794276 0.10238945 0.10238945
## PSA_TOTAL NA NA NA
## AGE_CURRENT 0.87638334 0.82442077 0.82442077
## CITIZEN_STATUS -0.07822591 0.06339267 0.06339267
## HEALTH_INSURANCE 0.12246229 0.21888414 0.21888414
## CURRENT_AGE 0.87638334 0.82442077 0.82442077
## mortstat 0.57248922 -0.14678280 -0.14678280
## permth_exm -0.23368741 0.31812058 0.31812058
## ucod_leading 0.40684400 0.05002831 0.05002831
## Age_or_Status_Numeric 1.00000000 0.72645966 0.72645966
## Age_At_Death 0.72645966 1.00000000 1.00000000
## ageevent 0.72645966 1.00000000 1.00000000
# Visualize the correlation matrix using a heatmap
library(corrplot)
## corrplot 0.92 loaded
corrplot(correlation_matrix, method = "color", type = "upper", tl.col = "black", tl.srt = 45)
The overall fit of the model can be assessed by the Concordance statistic (C-index), Likelihood ratio test, Wald test, and Score (logrank) test. A C-index of 0.554 suggests that the model is not particularly strong at predicting outcomes (a C-index of 0.5 suggests no predictive power, and 1.0 suggests perfect prediction).
All the p-values from the tests are not significant (all above 0.05), suggesting that the model does not have strong evidence for the effects of DIAGNOSED_PROSTATE, Education, and Income on the age at death. However, it is worth noting that with a sample size of 59 and 41 events, the study may be underpowered to detect all but the largest effects.
# Assuming selected_data contains your dataset
model_data <- selected_data %>%
select(Age_At_Death, DIAGNOSED_PROSTATE, Education, Income, mortstat)
# Create a binary event variable
model_data$event <- ifelse(model_data$mortstat == 1, 1, 0)
# Fit a Cox Proportional Hazards regression model
cox_model <- coxph(Surv(Age_At_Death, event) ~ DIAGNOSED_PROSTATE + Education + Income, data = model_data)
summary(cox_model)
## Call:
## coxph(formula = Surv(Age_At_Death, event) ~ DIAGNOSED_PROSTATE +
## Education + Income, data = model_data)
##
## n= 59, number of events= 41
##
## coef exp(coef) se(coef) z Pr(>|z|)
## DIAGNOSED_PROSTATE NA NA 0.00000 NA NA
## Education -0.03292 0.96762 0.12169 -0.270 0.787
## Income -0.01187 0.98820 0.01081 -1.098 0.272
##
## exp(coef) exp(-coef) lower .95 upper .95
## DIAGNOSED_PROSTATE NA NA NA NA
## Education 0.9676 1.033 0.7623 1.228
## Income 0.9882 1.012 0.9675 1.009
##
## Concordance= 0.554 (se = 0.044 )
## Likelihood ratio test= 2.07 on 2 df, p=0.4
## Wald test = 1.49 on 2 df, p=0.5
## Score (logrank) test = 1.62 on 2 df, p=0.4
# Calculate time_to_event by adding AGE_TOLD_PROSTATE and Age_At_Death
selected_data <- selected_data %>%
mutate(time_to_event = Age_At_Death)
# Fit a Cox Proportional Hazards regression model using Race_Ethnicity as a predictor
cox_model_race_ethnicity <- coxph(Surv(time_to_event, mortstat) ~ Race_Ethnicity, data = selected_data)
# View the summary of the model
summary(cox_model_race_ethnicity)
## Call:
## coxph(formula = Surv(time_to_event, mortstat) ~ Race_Ethnicity,
## data = selected_data)
##
## n= 59, number of events= 41
##
## coef exp(coef) se(coef) z Pr(>|z|)
## Race_Ethnicity -0.7572 0.4690 0.7289 -1.039 0.299
##
## exp(coef) exp(-coef) lower .95 upper .95
## Race_Ethnicity 0.469 2.132 0.1124 1.957
##
## Concordance= 0.535 (se = 0.019 )
## Likelihood ratio test= 1.36 on 1 df, p=0.2
## Wald test = 1.08 on 1 df, p=0.3
## Score (logrank) test = 1.13 on 1 df, p=0.3
#model age to death, then censor people
# Load required libraries
library(survival)
library(survminer)
# Simulated survival data with demographic variables
set.seed(123)
n <- 100 # Number of observations
time_to_event <- rexp(n, rate = 0.02) # Simulated survival times
status <- rbinom(n, size = 1, prob = 0.7) # Simulated censoring (1=event, 0=censored)
HEALTH_INSURANCE <- factor(sample(c("1", "2"), n, replace = TRUE))
CURRENT_AGE <- rnorm(n, mean = 40, sd = 9)
# Create a survival object
surv_data <- Surv(selected_data$time_to_event, selected_data$mortstat)
# Kaplan-Meier survival curves by HEALTH INSURANCE STATUS
#survfit_HEALTH_INSURANCE <- survfit(surv_data ~ HEALTH_INSURANCE)
# Plot Kaplan-Meier curves
#ggsurvplot(survfit_HEALTH_INSURANCE, data = surv_data, title = "Kaplan-Meier Survival Curves by HEALTH INSURANCE STATUS")
# Perform a log-rank test to compare survival curves
#log_rank_test <- survdiff(surv_data ~ HEALTH_INSURANCE)
#print(log_rank_test)
# Fit a Cox Proportional-Hazards model
#cox_model <- coxph(surv_data ~ HEALTH_INSURANCE + time_to_event)
#summary(cox_model)
surv_data <- survfit(formula = Surv(time_to_event, mortstat) ~ HEALTH_INSURANCE,
data = selected_data)
library(ggplot2)
# Calculate mean and standard deviation of CURRENT_AGE
mean_age <- mean(selected_data$CURRENT_AGE, na.rm = TRUE)
sd_age <- sd(selected_data$CURRENT_AGE, na.rm = TRUE)
# Creating a histogram of CURRENT_AGE and adding mean and SD
ggplot(selected_data, aes(x = CURRENT_AGE)) +
geom_histogram(binwidth = 1, fill = "blue", color = "black") +
geom_vline(aes(xintercept = mean_age), color = "red", linetype = "dashed", size = 1) +
geom_vline(aes(xintercept = mean_age + sd_age), color = "green", linetype = "dashed", size = 1) +
geom_vline(aes(xintercept = mean_age - sd_age), color = "green", linetype = "dashed", size = 1) +
labs(title = "Histogram of Current Age", x = "Current Age", y = "Frequency") +
theme_minimal() +
annotate("text", x = mean_age, y = Inf, label = paste("Mean:", round(mean_age, 2)), vjust = 2, color = "red") +
annotate("text", x = mean_age + sd_age, y = Inf, label = paste("Mean + SD:", round(mean_age + sd_age, 2)), vjust = 2, color = "green") +
annotate("text", x = mean_age - sd_age, y = Inf, label = paste("Mean - SD:", round(mean_age - sd_age, 2)), vjust = 2, color = "green")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Interpretation: This section includes checking the proportional hazards assumption, which is crucial for the validity of Cox regression models. Visual and statistical methods are used for this purpose.
# Load the survminer library
library(survminer)
# Check the proportional hazards assumption visually
ggsurvplot(survfit(cox_model), data = model_data, pval = TRUE)
## Warning in .pvalue(fit, data = data, method = method, pval = pval, pval.coord = pval.coord, : There are no survival curves to be compared.
## This is a null model.
# Test the proportional hazards assumption using Schoenfeld residuals
#cox_zph <- cox.zph(cox_model)
#plot(cox_zph)
# Assuming 'cox_model' is your Cox regression model
# Check the proportional hazards assumption for a continuous variable (e.g., 'Age_At_Death')
cox_zph <- cox.zph(cox_model, transform = "identity")
# Print the results
print(cox_zph)
## chisq df p
## Education 0.607 1 0.44
## Income 1.364 1 0.24
## GLOBAL 1.651 2 0.44
# Calculate mean and standard deviation
mean_age_status <- mean(selected_data$Age_or_Status_Numeric, na.rm = TRUE)
sd_age_status <- sd(selected_data$Age_or_Status_Numeric, na.rm = TRUE)
# Creating a histogram of Age_or_Status_Numeric and adding mean and SD
ggplot(selected_data, aes(x = Age_or_Status_Numeric)) +
geom_histogram(binwidth = 1, fill = "blue", color = "black") +
geom_vline(aes(xintercept = mean_age_status), color = "red", linetype = "dashed", size = 1) +
geom_vline(aes(xintercept = mean_age_status + sd_age_status), color = "green", linetype = "dashed", size = 1) +
geom_vline(aes(xintercept = mean_age_status - sd_age_status), color = "green", linetype = "dashed", size = 1) +
labs(title = "Histogram of Age or Status", x = "Age or Status", y = "Frequency") +
theme_minimal() +
annotate("text", x = mean_age_status, y = Inf, label = paste("Mean:", round(mean_age_status, 2)), vjust = 2, color = "red") +
annotate("text", x = mean_age_status + sd_age_status, y = Inf, label = paste("Mean + SD:", round(mean_age_status + sd_age_status, 2)), vjust = 2, color = "green") +
annotate("text", x = mean_age_status - sd_age_status, y = Inf, label = paste("Mean - SD:", round(mean_age_status - sd_age_status, 2)), vjust = 2, color = "green")
# Fit a logistic regression model
#model <- glm(mortstat ~ PSA_TOTAL, data = selected_data, family = binomial)
# View the summary of the model
#summary(model)
model_data <- selected_data %>%
select(Age_At_Death, DIAGNOSED_PROSTATE, Education, Income, HEALTH_INSURANCE, CITIZEN_STATUS, Race_Ethnicity, mortstat)
cox_model <- coxph(Surv(Age_At_Death, mortstat) ~
DIAGNOSED_PROSTATE + Education + Income + Race_Ethnicity + CITIZEN_STATUS + HEALTH_INSURANCE,
data = model_data)
## Warning in coxph.fit(X, Y, istrat, offset, init, control, weights = weights, :
## Loglik converged before variable 5 ; coefficient may be infinite.
summary(cox_model)
## Call:
## coxph(formula = Surv(Age_At_Death, mortstat) ~ DIAGNOSED_PROSTATE +
## Education + Income + Race_Ethnicity + CITIZEN_STATUS + HEALTH_INSURANCE,
## data = model_data)
##
## n= 59, number of events= 41
##
## coef exp(coef) se(coef) z Pr(>|z|)
## DIAGNOSED_PROSTATE NA NA 0.000e+00 NA NA
## Education -4.311e-02 9.578e-01 1.247e-01 -0.346 0.7296
## Income -9.878e-03 9.902e-01 1.060e-02 -0.932 0.3513
## Race_Ethnicity -6.394e-01 5.276e-01 7.368e-01 -0.868 0.3855
## CITIZEN_STATUS -1.489e+01 3.424e-07 5.006e+03 -0.003 0.9976
## HEALTH_INSURANCE -2.359e+00 9.447e-02 1.099e+00 -2.148 0.0317 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## DIAGNOSED_PROSTATE NA NA NA NA
## Education 9.578e-01 1.044e+00 0.75005 1.2231
## Income 9.902e-01 1.010e+00 0.96981 1.0110
## Race_Ethnicity 5.276e-01 1.895e+00 0.12449 2.2361
## CITIZEN_STATUS 3.424e-07 2.921e+06 0.00000 Inf
## HEALTH_INSURANCE 9.447e-02 1.059e+01 0.01097 0.8135
##
## Concordance= 0.616 (se = 0.047 )
## Likelihood ratio test= 6.18 on 5 df, p=0.3
## Wald test = 6.4 on 5 df, p=0.3
## Score (logrank) test = 10.05 on 5 df, p=0.07
# Create a dataset with Age_At_Death and censoring status (mortstat)
model_data <- selected_data %>%
select(Age_At_Death, mortstat)
# Fit a Cox Proportional Hazards regression model
cox_model <- coxph(Surv(Age_At_Death, mortstat) ~ 1, data = model_data)
# View the summary of the Cox model
summary(cox_model)
## Call: coxph(formula = Surv(Age_At_Death, mortstat) ~ 1, data = model_data)
##
## Null model
## log likelihood= -132.7174
## n= 59
# Create a dataset with Age_At_Death
model_data <- selected_data %>%
select(Age_At_Death)
# Fit a Cox Proportional Hazards regression model without censoring
cox_model <- coxph(Surv(Age_At_Death) ~ 1, data = model_data)
# View the summary of the Cox model
summary(cox_model)
## Call: coxph(formula = Surv(Age_At_Death) ~ 1, data = model_data)
##
## Null model
## log likelihood= -184.5338
## n= 59
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
# Create ageevent variable
selected_data <- selected_data %>%
mutate(ageevent = ifelse(mortstat == 1, Age_At_Death, CURRENT_AGE))
# Check the first few rows of the dataset to verify the ageevent variable
head(selected_data)
## # A tibble: 6 × 20
## Education Race_Ethnicity Income AGE_TOLD_PROSTATE DIAGNOSED_PROSTATE
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 0 5 79 1
## 2 4 0 7 65 1
## 3 4 0 6 61 1
## 4 5 0 7 64 1
## 5 3 0 10 62 1
## 6 4 0 11 54 1
## # ℹ 15 more variables: PROSTATE_ENLARGE <dbl>, AGE_PSA_TEST <dbl>,
## # PSA_TOTAL <dbl>, AGE_CURRENT <dbl>, CITIZEN_STATUS <dbl>,
## # HEALTH_INSURANCE <dbl>, CURRENT_AGE <dbl>, mortstat <dbl>,
## # permth_exm <dbl>, ucod_leading <dbl>, Age_or_Status <chr>,
## # Age_or_Status_Numeric <dbl>, Age_At_Death <dbl>, ageevent <dbl>,
## # time_to_event <dbl>
tabyl(selected_data$ageevent)
## selected_data$ageevent n percent
## 57 1 0.01694915
## 60 1 0.01694915
## 65 1 0.01694915
## 66 2 0.03389831
## 68 2 0.03389831
## 69 4 0.06779661
## 70 1 0.01694915
## 71 2 0.03389831
## 73 3 0.05084746
## 74 2 0.03389831
## 76 3 0.05084746
## 77 2 0.03389831
## 78 4 0.06779661
## 79 1 0.01694915
## 80 3 0.05084746
## 81 2 0.03389831
## 82 1 0.01694915
## 83 2 0.03389831
## 84 2 0.03389831
## 85 1 0.01694915
## 87 4 0.06779661
## 88 4 0.06779661
## 89 4 0.06779661
## 90 1 0.01694915
## 91 2 0.03389831
## 92 2 0.03389831
## 93 2 0.03389831
# Create a survival object
surv_obj <- Surv(time = selected_data$ageevent, event = selected_data$mortstat)
# Fit a Cox Proportional Hazards regression model using ageevent
cox_model_ageevent <- coxph(Surv(ageevent, mortstat) ~ Education + Race_Ethnicity + Income + DIAGNOSED_PROSTATE + HEALTH_INSURANCE + CITIZEN_STATUS, data = selected_data)
## Warning in coxph.fit(X, Y, istrat, offset, init, control, weights = weights, :
## Loglik converged before variable 6 ; coefficient may be infinite.
# View the summary of the Cox model
summary(cox_model_ageevent)
## Call:
## coxph(formula = Surv(ageevent, mortstat) ~ Education + Race_Ethnicity +
## Income + DIAGNOSED_PROSTATE + HEALTH_INSURANCE + CITIZEN_STATUS,
## data = selected_data)
##
## n= 59, number of events= 41
##
## coef exp(coef) se(coef) z Pr(>|z|)
## Education -8.908e-02 9.148e-01 1.309e-01 -0.680 0.496
## Race_Ethnicity -1.342e+00 2.613e-01 7.538e-01 -1.781 0.075 .
## Income -1.158e-02 9.885e-01 1.005e-02 -1.152 0.249
## DIAGNOSED_PROSTATE NA NA 0.000e+00 NA NA
## HEALTH_INSURANCE -2.293e+00 1.009e-01 1.099e+00 -2.086 0.037 *
## CITIZEN_STATUS -1.214e+01 5.336e-06 5.115e+03 -0.002 0.998
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## Education 9.148e-01 1.093e+00 0.70775 1.1824
## Race_Ethnicity 2.613e-01 3.828e+00 0.05962 1.1448
## Income 9.885e-01 1.012e+00 0.96921 1.0082
## DIAGNOSED_PROSTATE NA NA NA NA
## HEALTH_INSURANCE 1.009e-01 9.909e+00 0.01170 0.8705
## CITIZEN_STATUS 5.336e-06 1.874e+05 0.00000 Inf
##
## Concordance= 0.611 (se = 0.052 )
## Likelihood ratio test= 8.73 on 5 df, p=0.1
## Wald test = 8.81 on 5 df, p=0.1
## Score (logrank) test = 11.94 on 5 df, p=0.04
# Assuming your dataset is named 'data'
# and mortstat = 1 means the individual is alive,
# while mortstat = 0 means the individual is deceased.
# Creating the time_to_event variable
selected_data$time_to_event <- selected_data$permth_exm
# Creating the status variable based on mortstat
# If mortstat = 0 (alive), then status = 0 (censored)
# If mortstat = 0 (deceased), then status = 1 (event occurred)
selected_data$status <- ifelse(selected_data$mortstat == 0, 1, 1)
# Creating the survival object
surv_object1 <- Surv(time = selected_data$time_to_event, event = selected_data$status)
surv_object <- Surv(time = selected_data$time_to_event, event = selected_data$status)
km_fit <- survfit(surv_object ~ 1, data = selected_data)
ggsurvplot(km_fit,
data = selected_data,
xlab = "Years",
ylab = "Survival probability",
title = "Kaplan-Meier Survival Curve")
cox_model <- coxph(surv_object ~ AGE_TOLD_PROSTATE, data = selected_data)
summary(cox_model)
## Call:
## coxph(formula = surv_object ~ AGE_TOLD_PROSTATE, data = selected_data)
##
## n= 59, number of events= 59
##
## coef exp(coef) se(coef) z Pr(>|z|)
## AGE_TOLD_PROSTATE 0.02149 1.02173 0.02067 1.04 0.298
##
## exp(coef) exp(-coef) lower .95 upper .95
## AGE_TOLD_PROSTATE 1.022 0.9787 0.9812 1.064
##
## Concordance= 0.583 (se = 0.046 )
## Likelihood ratio test= 1.08 on 1 df, p=0.3
## Wald test = 1.08 on 1 df, p=0.3
## Score (logrank) test = 1.08 on 1 df, p=0.3
cox.zph(cox_model)
## chisq df p
## AGE_TOLD_PROSTATE 3.63 1 0.057
## GLOBAL 3.63 1 0.057
# Assuming 'surv_object' is your survival object
KM_fit <- survfit(surv_object ~ 1, data = selected_data)
plot(KM_fit)
# Convert HEALTH_INSURANCE to a factor if it is not already
selected_data$HEALTH_INSURANCE <- as.factor(selected_data$HEALTH_INSURANCE)
# Create the survival object
surv_object <- Surv(time = selected_data$ageevent, event = selected_data$status)
# Fit the Cox proportional hazards model
cox_model <- coxph(surv_object ~ HEALTH_INSURANCE, data = selected_data)
# View the summary of the Cox model
summary(cox_model)
## Call:
## coxph(formula = surv_object ~ HEALTH_INSURANCE, data = selected_data)
##
## n= 59, number of events= 59
##
## coef exp(coef) se(coef) z Pr(>|z|)
## HEALTH_INSURANCE1 -1.4137 0.2432 1.0384 -1.361 0.173
##
## exp(coef) exp(-coef) lower .95 upper .95
## HEALTH_INSURANCE1 0.2432 4.111 0.03178 1.862
##
## Concordance= 0.51 (se = 0.01 )
## Likelihood ratio test= 1.27 on 1 df, p=0.3
## Wald test = 1.85 on 1 df, p=0.2
## Score (logrank) test = 2.18 on 1 df, p=0.1
# Pull specific columns
# For example, to pull columns named 'Age_At_Death' and 'Age_Told_Prostate'
extracted_columns <- selected_data %>% select(Age_or_Status, AGE_TOLD_PROSTATE)
# Pull specific rows
# For example, to pull rows 1 to 5
extracted_rows <- selected_data %>% slice(1:5)
# If you want to pull rows based on a condition
# For example, to pull rows where Age_At_Death is greater than 70
extracted_condition_rows <- selected_data %>% filter(Age_At_Death > 70)
# Display the extracted data
print(extracted_columns)
## # A tibble: 59 × 2
## Age_or_Status AGE_TOLD_PROSTATE
## <chr> <dbl>
## 1 93 79
## 2 78 65
## 3 71 61
## 4 77 64
## 5 68 62
## 6 57+ 54
## 7 74 64
## 8 92 81
## 9 89 80
## 10 88 70
## # ℹ 49 more rows
print(extracted_rows)
## # A tibble: 5 × 21
## Education Race_Ethnicity Income AGE_TOLD_PROSTATE DIAGNOSED_PROSTATE
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 0 5 79 1
## 2 4 0 7 65 1
## 3 4 0 6 61 1
## 4 5 0 7 64 1
## 5 3 0 10 62 1
## # ℹ 16 more variables: PROSTATE_ENLARGE <dbl>, AGE_PSA_TEST <dbl>,
## # PSA_TOTAL <dbl>, AGE_CURRENT <dbl>, CITIZEN_STATUS <dbl>,
## # HEALTH_INSURANCE <fct>, CURRENT_AGE <dbl>, mortstat <dbl>,
## # permth_exm <dbl>, ucod_leading <dbl>, Age_or_Status <chr>,
## # Age_or_Status_Numeric <dbl>, Age_At_Death <dbl>, ageevent <dbl>,
## # time_to_event <dbl>, status <dbl>
print(extracted_condition_rows)
## # A tibble: 53 × 21
## Education Race_Ethnicity Income AGE_TOLD_PROSTATE DIAGNOSED_PROSTATE
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 0 5 79 1
## 2 4 0 7 65 1
## 3 4 0 6 61 1
## 4 5 0 7 64 1
## 5 2 0 7 64 1
## 6 3 0 99 81 1
## 7 5 0 11 80 1
## 8 2 0 5 70 1
## 9 4 0 6 78 1
## 10 4 0 4 73 1
## # ℹ 43 more rows
## # ℹ 16 more variables: PROSTATE_ENLARGE <dbl>, AGE_PSA_TEST <dbl>,
## # PSA_TOTAL <dbl>, AGE_CURRENT <dbl>, CITIZEN_STATUS <dbl>,
## # HEALTH_INSURANCE <fct>, CURRENT_AGE <dbl>, mortstat <dbl>,
## # permth_exm <dbl>, ucod_leading <dbl>, Age_or_Status <chr>,
## # Age_or_Status_Numeric <dbl>, Age_At_Death <dbl>, ageevent <dbl>,
## # time_to_event <dbl>, status <dbl>
# Load necessary libraries
library(survival)
library(survminer)
extracted_condition_rows$time_to_event <- as.numeric(extracted_condition_rows$time_to_event)
# Assuming 'selected_data' is your dataset
# Ensure 'ageevent' is numeric and 'status' is a binary indicator (0 or 1)
# Create the survival object
surv_object <- Surv(time = extracted_condition_rows$time_to_event, event = extracted_condition_rows$status)
# Fit the Kaplan-Meier model
km_fit <- survfit(surv_object ~ 1)
# Plot the Kaplan-Meier survival curve
ggsurvplot(km_fit,
data = extracted_condition_rows,
xlab = "Years since Diagnosis",
ylab = "Survival Probability",
title = "Kaplan-Meier Survival Curve",
xlim = c(0, max(extracted_condition_rows$time_to_event))) # Adjust the x-axis limit as needed
# If you want to start from 0 (year of diagnosis) ensure that 'ageevent' is calculated as the time since diagnosis
# Fit the Kaplan-Meier model
km_fit <- survfit(Surv(time_to_event, status) ~ HEALTH_INSURANCE, data = extracted_condition_rows)
# Plot the Kaplan-Meier survival curve
ggsurvplot(km_fit,
data = extracted_condition_rows,
xlab = "Years since Diagnosis",
ylab = "Survival Probability",
title = "Kaplan-Meier Survival Curve",
xlim = c(0, max(extracted_condition_rows$time_to_event))) # Adjust the x-axis limit as needed
# If you want to start from 0 (year of diagnosis) ensure that 'ageevent' is calculated as the time since diagnosis
# Fit the Kaplan-Meier model
km_fit <- survfit(Surv(time_to_event, status) ~ Race_Ethnicity, data = extracted_condition_rows)
# Plot the Kaplan-Meier survival curve
ggsurvplot(km_fit,
data = extracted_condition_rows,
xlab = "Years since Diagnosis",
ylab = "Survival Probability",
title = "Kaplan-Meier Survival Curve",
xlim = c(0, max(extracted_condition_rows$time_to_event))) # Adjust the x-axis limit as needed
# If you want to start from 0 (year of diagnosis) ensure that 'ageevent' is calculated as the time since diagnosis
library(table1)
##
## Attaching package: 'table1'
## The following objects are masked from 'package:base':
##
## units, units<-
extracted_condition_rows$Race_Ethnicity <- factor(extracted_condition_rows$Race_Ethnicity,
levels = c(0, 1),
labels = c("Nonwhite", "White"))
extracted_condition_rows$mortstat <- factor(extracted_condition_rows$mortstat,
levels = c(0, 1),
labels = c("Alive", "Deceased"))
extracted_condition_rows$HEALTH_INSURANCE <- factor(extracted_condition_rows$HEALTH_INSURANCE,
levels = c(0, 1),
labels = c("Uninsured", "Insured"))
extracted_condition_rows$Education <- as.factor(extracted_condition_rows$Education)
table1::table1(~mortstat + time_to_event + AGE_TOLD_PROSTATE + Income + HEALTH_INSURANCE + Education|Race_Ethnicity, data= extracted_condition_rows)
| Nonwhite (N=50) |
White (N=3) |
Overall (N=53) |
|
|---|---|---|---|
| mortstat | |||
| Alive | 16 (32.0%) | 1 (33.3%) | 17 (32.1%) |
| Deceased | 34 (68.0%) | 2 (66.7%) | 36 (67.9%) |
| time_to_event | |||
| Mean (SD) | 9.16 (4.26) | 9.67 (2.89) | 9.19 (4.17) |
| Median [Min, Max] | 10.0 [0, 14.0] | 8.00 [8.00, 13.0] | 10.0 [0, 14.0] |
| AGE_TOLD_PROSTATE | |||
| Mean (SD) | 70.8 (6.74) | 75.0 (8.72) | 71.0 (6.84) |
| Median [Min, Max] | 70.5 [58.0, 85.0] | 71.0 [69.0, 85.0] | 71.0 [58.0, 85.0] |
| Income | |||
| Mean (SD) | 10.2 (16.4) | 37.3 (53.4) | 11.8 (20.1) |
| Median [Min, Max] | 7.00 [2.00, 99.0] | 8.00 [5.00, 99.0] | 7.00 [2.00, 99.0] |
| HEALTH_INSURANCE | |||
| Uninsured | 1 (2.0%) | 0 (0%) | 1 (1.9%) |
| Insured | 49 (98.0%) | 3 (100%) | 52 (98.1%) |
| Education | |||
| 1 | 4 (8.0%) | 0 (0%) | 4 (7.5%) |
| 2 | 6 (12.0%) | 1 (33.3%) | 7 (13.2%) |
| 3 | 11 (22.0%) | 1 (33.3%) | 12 (22.6%) |
| 4 | 16 (32.0%) | 0 (0%) | 16 (30.2%) |
| 5 | 13 (26.0%) | 0 (0%) | 13 (24.5%) |
| 9 | 0 (0%) | 1 (33.3%) | 1 (1.9%) |
# Create a crosstab of the mortstat variable
mortstat_table <- table(selected_data$mortstat)
# Display the crosstab
print(mortstat_table)
##
## 0 1
## 18 41
cox_model <- coxph(Surv(time_to_event, status) ~ HEALTH_INSURANCE + Education + Income +Race_Ethnicity, data = extracted_condition_rows)
summary(cox_model)
## Call:
## coxph(formula = Surv(time_to_event, status) ~ HEALTH_INSURANCE +
## Education + Income + Race_Ethnicity, data = extracted_condition_rows)
##
## n= 53, number of events= 53
##
## coef exp(coef) se(coef) z Pr(>|z|)
## HEALTH_INSURANCEInsured -0.886375 0.412147 1.055552 -0.840 0.401
## Education2 0.458903 1.582337 0.645001 0.711 0.477
## Education3 0.351592 1.421328 0.586590 0.599 0.549
## Education4 -0.052745 0.948622 0.564853 -0.093 0.926
## Education5 0.269629 1.309478 0.589480 0.457 0.647
## Education9 -0.996283 0.369249 1.654374 -0.602 0.547
## Income 0.001818 1.001819 0.009706 0.187 0.851
## Race_EthnicityWhite 0.682410 1.978640 0.772227 0.884 0.377
##
## exp(coef) exp(-coef) lower .95 upper .95
## HEALTH_INSURANCEInsured 0.4121 2.4263 0.05207 3.262
## Education2 1.5823 0.6320 0.44696 5.602
## Education3 1.4213 0.7036 0.45018 4.487
## Education4 0.9486 1.0542 0.31354 2.870
## Education5 1.3095 0.7637 0.41241 4.158
## Education9 0.3692 2.7082 0.01442 9.452
## Income 1.0018 0.9982 0.98294 1.021
## Race_EthnicityWhite 1.9786 0.5054 0.43556 8.989
##
## Concordance= 0.529 (se = 0.052 )
## Likelihood ratio test= 3.25 on 8 df, p=0.9
## Wald test = 3.45 on 8 df, p=0.9
## Score (logrank) test = 3.58 on 8 df, p=0.9
The Cox Proportional Hazards model, applied to a dataset of 53 observations, aims to analyze the impact of variables such as health insurance status, education, income, and race/ethnicity on the survival of individuals diagnosed with prostate cancer. A notable finding is the hazard ratio of 0.412147 for health insurance status. This ratio indicates that insured individuals have a hazard (risk of dying from prostate cancer) that is approximately 41.2% of the hazard for those who are uninsured. However, the statistical significance of this finding is not established, as indicated by the p-value, which suggests that the observed association might not be robust or could be due to chance.
The analysis further compares various education levels against a reference category of having less than a 9th-grade education. Here, positive coefficients for education levels such as 9-11th Grade, High School Grad/GED or Equivalent, and College Graduate or above suggest these groups have a higher hazard of the event (mortality from prostate cancer) compared to those with less than a 9th-grade education. On the other hand, the category “Some College or AA” (Education Level 4) is associated with a lower hazard compared to the baseline group. However, these interpretations must be approached cautiously. The p-values associated with these education categories do not denote statistical significance, indicating that these differences might not be reliable indicators of actual risk variation.
It is crucial to consider the limitations of this analysis, particularly the small sample size. A limited sample size can affect the power of the study to detect significant differences or associations, potentially leading to less reliable or inconclusive results. As such, while the model provides valuable insights into potential trends and associations, these findings should be interpreted in the context of the study’s inherent limitations and not taken as definitive evidence of the effects of these variables on survival among prostate cancer patients.