library(purrr)

Uploading data:

data_path <- "C:/Users/shanata/Downloads/smoking_driking_dataset_Ver01.csv"

Creating new columns:

Here I have created a new column for hypertension. If a patient has Systolic Blood pressure greater than 140 or Diastolic pressure greater than 90 he/she is considered to be hypertensive.

data$Hypertension[data$SBP > 140] <- "YES"
data$Hypertension[data$SBP < 140] <- "NO"
data$Hypertension[data$DBP > 90] <-"YES"
data$Hypertension[data$SBP < 90] <- "NO"

After creating the hypertensive column, I will create separate columns for hypotensive patients.If a patient has Systolic Blood pressure less than 90 or Diastolic pressure less than 60 he/she is considered to be hypotensive.

data$Hypotension[data$SBP <90 ] <- "YES"
data$Hypotension[data$SBP > 90] <- "NO"
data$Hypotension[data$DBP < 60] <-"YES"
data$Hypotension[data$DBP > 60] <- "NO"

Now we have separate columns for hypertensive and hypotensive patients. Now let’s find out, how many people with drinking habits have hyper and hypo tension.

summary(data$DRK_YN)
##    Length     Class      Mode 
##    991346 character character
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Group-by dataframe 1:

Here, I have grouped people based on their drinking habits and their blood pressure levels. I have stripped the null values before computing.

f1 <- data %>%
  group_by(DRK_YN) %>%
  summarize(Hypertension = sum(Hypertension == "YES", na.rm = TRUE),
            #Normal_sytolic = sum(Hypertension == "NO", na.rm = TRUE),
            #Normal_diastolic=sum(Hypotension=="NO",na.rm=TRUE),
            Hypotension = sum(Hypotension == "YES", na.rm = TRUE)
            )
f1
## # A tibble: 2 × 3
##   DRK_YN Hypertension Hypotension
##   <chr>         <int>       <int>
## 1 N             49005       15537
## 2 Y             50988       10242

People with Drinking habits:

count_yes <- sum(data$DRK_YN == "Y")
count_no <- sum(data$DRK_YN == "N")

# Display the counts
cat("Count of YES:", count_yes, "\n")
## Count of YES: 495488
cat("Count of NO:", count_no, "\n")
## Count of NO: 495858

Now, I have the total number of alcoholic drinkers and non-alcoholic drinkers, I also have the Hypertensive and Hypotensive people among the drinkers and non drinkers. Now, it is time to calculate the expected probability of each case.

Calculating expected probability:

total_drinkers <- 495488
total_non_drinkers <- 495858
total_hypertensive_non_drinkers <- 49005
total_hypotensive_non_drinkers <- 15537
total_hypertensive_drinkers <- 50988
total_hypotensive_drinkers <- 10242

# Calculate the probabilities
p_drinker <- total_drinkers / (total_drinkers + total_non_drinkers)
p_hypertensive_given_drinker <- total_hypertensive_drinkers / total_drinkers
p_hypotensive_given_drinker <- total_hypotensive_drinkers / total_drinkers
p_non_drinker <- total_non_drinkers / (total_drinkers + total_non_drinkers)
p_hypertensive_given_non_drinker <- total_hypertensive_non_drinkers / total_non_drinkers
p_hypotensive_given_non_drinker <- total_hypotensive_non_drinkers / total_non_drinkers


# Calculate the expected probability 
expected_probability_dh <- p_drinker * p_hypertensive_given_drinker
expected_probability_di <- p_drinker * p_hypotensive_given_drinker
expected_probability_dj <- p_non_drinker * p_hypertensive_given_non_drinker
expected_probability_dk <- p_non_drinker * p_hypotensive_given_non_drinker

# Display the result
cat("Expected Probability of Drinkers who are Hypertensive:", expected_probability_dh, "\n")
## Expected Probability of Drinkers who are Hypertensive: 0.0514331
cat("Expected Probability of Drinkers who are Hypotensive:", expected_probability_di, "\n")
## Expected Probability of Drinkers who are Hypotensive: 0.01033141
cat("Expected Probability of Non Drinkers who are Hypertensive:", expected_probability_dj, "\n")
## Expected Probability of Non Drinkers who are Hypertensive: 0.04943279
cat("Expected Probability of NonDrinkers who are Hypotensive:", expected_probability_dk, "\n")
## Expected Probability of NonDrinkers who are Hypotensive: 0.01567263

From this I can draw a hypothesis:

  1. The probability of someone being a drinker and hypertensive is higher compared to the probability of being a drinker and hypotensive.

Hypothesis Testing:

Null Hypothesis (H0):

The probability of being a drinker and hypertensive is equal to the probability of being a drinker and hypotensive.

Alternative Hypothesis (Ha):

The probability of being a drinker and hypertensive is not equal to the probability of being a drinker and hypotensive.

To test the hypothesis, I have used chi-squared test.

  1. Accept null hypothesis if p-value less then 0.05 else reject it
contingency_table <- matrix(c(total_hypertensive_drinkers,total_hypertensive_non_drinkers,total_hypotensive_drinkers,total_hypotensive_non_drinkers), nrow = 2)

# Perform a chi-squared test
chi_squared_test_result <- chisq.test(contingency_table)

# Display the test result
print(chi_squared_test_result)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  contingency_table
## X-squared = 1040, df = 1, p-value < 2.2e-16

Therefore We can reject the null hypothesis.

Hence we can conclude that the probability of being a drinker and hypertensive is not equal to the probability of being a drinker and hypotensive.

Visualization

library(ggplot2)

Here I have used a bar chart to compare the expected probabilities.

v1 <- data.frame(
  Drinking_Status = c("Drinkers", "Drinkers", "Non Drinkers", "Non Drinkers"),
  Tension_Status = c("Hypertensive", "Hypotensive", "Hypertensive", "Hypotensive"),
  Expected_Probability = c(0.0514331, 0.01033141, 0.04943279, 0.01567263)
)
ggplot(v1, aes(x = Drinking_Status, y = Expected_Probability, fill = Tension_Status)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.9), width = 0.7) +
  labs(
    title = "Expected Probabilities of Tension by Drinking Status",
    x = "Drinking Status",
    y = "Expected Probability"
  )

Findings :

The difference in expected probabilities between drinkers and non-drinkers for hypertension is relatively small, suggesting that drinking status alone may not be a strong predictor of hypertension.

Further questions:

Are there any other confounding variables or interactions that may affect the relationship between drinking status and hypertension/hypotension ??

Group-by data frame 2

Grouping data based on smokers and non-smokers. Here 1 indicate people who don’t smoke, 2 indicate people who used to smoke but quit smoking now and 3 indicate people who are active smokers.

f <- data %>%
  group_by(SMK_stat_type_cd) %>%
  summarise(Hypertension = sum(Hypertension == "YES", na.rm = TRUE),
            #Normal_sytolic = sum(Hypertension == "NO", na.rm = TRUE),
            #Normal_diastolic=sum(Hypotension=="NO",na.rm=TRUE),
            Hypotension = sum(Hypotension == "YES", na.rm = TRUE)
            )
f
## # A tibble: 3 × 3
##   SMK_stat_type_cd Hypertension Hypotension
##              <dbl>        <int>       <int>
## 1                1        56357       19900
## 2                2        22044        2541
## 3                3        21592        3338

Count of people in each category :

count_1 <- sum(data$SMK_stat_type_cd == 1)
count_2 <- sum(data$SMK_stat_type_cd == 2)
count_3 <- sum(data$SMK_stat_type_cd == 3)

# Display the counts
cat("Count of people who don't smoke:", count_1, "\n")
## Count of people who don't smoke: 602441
cat("Count of people who used to smoke but quit:", count_2, "\n")
## Count of people who used to smoke but quit: 174951
cat("Count of people who used to still smoke:", count_3, "\n")
## Count of people who used to still smoke: 213954

Calculating expected probability:

total_non_smokers <- 602441
#total_smokers_but_quit <- 174951
total_smokers<- 213954


total_hypertensive_non_smokers <- 56357
total_hypotensive_non_smokers <- 19900

#total_hypertensive_used_to_smoke <- 22044
#total_hypotensive_used_to_smoke <- 2541

total_hypertensive_smokers <- 21592
total_hypotensive_smokers <- 3338

# Calculate the probabilities
p_smoker <-  total_smokers/ (total_non_smokers + total_smokers )
p_hypertensive_given_smoker <- total_hypertensive_smokers / total_smokers
p_hypotensive_given_smoker <- total_hypotensive_smokers / total_smokers


p_non_smokers <- total_non_smokers / (total_smokers + total_non_smokers)
p_hypertensive_given_non_smoker <- total_hypertensive_non_smokers / total_non_smokers
p_hypotensive_given_non_smoker <- total_hypotensive_non_smokers / total_non_smokers


# Calculate the expected probability for drinkers who are hypersensitive
expected_probability_dh <- p_smoker * p_hypertensive_given_smoker
expected_probability_di <- p_smoker * p_hypotensive_given_smoker
expected_probability_dj <- p_non_smokers * p_hypertensive_given_non_smoker
expected_probability_dk <- p_non_smokers * p_hypotensive_given_non_smoker

# Display the result
cat("Expected Probability of Smokers who are Hypertensive:", expected_probability_dh, "\n")
## Expected Probability of Smokers who are Hypertensive: 0.02644798
cat("Expected Probability of Smokers who are Hypotensive:", expected_probability_di, "\n")
## Expected Probability of Smokers who are Hypotensive: 0.004088707
cat("Expected Probability of Non Smokers who are Hypertensive:", expected_probability_dj, "\n")
## Expected Probability of Non Smokers who are Hypertensive: 0.06903153
cat("Expected Probability of Non Smokers who are Hypotensive:", expected_probability_dk, "\n")
## Expected Probability of Non Smokers who are Hypotensive: 0.02437546

Null Hypothesis (H0):

The probability of being a smoker and hypertensive is equal to the probability of being a smoker and hypotensive.

Alternative Hypothesis (Ha):

The probability of being a smoker and hypertensive is not equal to the probability of being a smoker and hypotensive.

contingency_table <- matrix(c(21592,56357,3338,19900), nrow = 2)

# Perform a chi-squared test
chi_squared_test_result <- chisq.test(contingency_table)

# Display the test result
print(chi_squared_test_result)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  contingency_table
## X-squared = 1713.9, df = 1, p-value < 2.2e-16

Therefore We can reject the null hypothesis.

Hence we can conclude that the probability of being a active smoker and hypertensive is not equal to the probability of being a active smoker and hypotensive.

Visualization

v2 <- data.frame(
  Smoking_Status = c("Smokers", "Smokers", "Non Smokers", "Non Smokers"),
  Tension_Status = c("Hypertensive", "Hypotensive", "Hypertensive", "Hypotensive"),
  Expected_Probability = c(0.02644798, 0.004088707 , 0.06903153, 0.02437546)
)
ggplot(v2, aes(x = Smoking_Status, y = Expected_Probability, fill = Tension_Status)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.9), width = 0.7) +
  labs(
    title = "Expected Probabilities of Tension by Smoking Status",
    x = "Smoking Status",
    y = "Expected Probability"
  )

### Group-by dataframe 3

f3 <- data %>%
  group_by(sex) %>%
  summarise(Hypertension = sum(Hypertension == "YES", na.rm = TRUE),
            #Normal_sytolic = sum(Hypertension == "NO", na.rm = TRUE),
            #Normal_diastolic=sum(Hypotension=="NO",na.rm=TRUE),
            Hypotension = sum(Hypotension == "YES", na.rm = TRUE)
            )
f3
## # A tibble: 2 × 3
##   sex    Hypertension Hypotension
##   <chr>         <int>       <int>
## 1 Female        39746       18804
## 2 Male          60247        6975
count_1 <- sum(data$sex=="Female")
count_2 <- sum(data$sex=="Male")

# Display the counts
cat("Count of Female:", count_1, "\n")
## Count of Female: 464931
cat("Count of Male:", count_2, "\n")
## Count of Male: 526415

Calculating probability

total_female <- 464931
#total_smokers_but_quit <- 174951
total_male<- 526415


total_hypertensive_male <- 60247
total_hypotensive_male<- 6975

#total_hypertensive_used_to_smoke <- 22044
#total_hypotensive_used_to_smoke <- 2541

total_hypertensive_female <- 39746
total_hypotensive_female <- 18804

# Calculate the probabilities
p_female <-  total_female/ (total_female + total_male )
p_hypertensive_given_female <- total_hypertensive_female / total_female
p_hypotensive_given_female <- total_hypotensive_female / total_female


p_male <- total_male / (total_male + total_female)
p_hypertensive_given_male <- total_hypertensive_male / total_male
p_hypotensive_given_male <- total_hypotensive_male / total_male


# Calculate the expected probability for drinkers who are hypersensitive
expected_probability_dh <- p_male * p_hypertensive_given_male
expected_probability_di <- p_male * p_hypotensive_given_male
expected_probability_dj <- p_female * p_hypertensive_given_female
expected_probability_dk <- p_female * p_hypotensive_given_female

# Display the result
cat("Expected Probability of Male who are Hypertensive:", expected_probability_dh, "\n")
## Expected Probability of Male who are Hypertensive: 0.06077293
cat("Expected Probability of Male who are Hypotensive:", expected_probability_di, "\n")
## Expected Probability of Male who are Hypotensive: 0.007035889
cat("Expected Probability of Female who are Hypertensive:", expected_probability_dj, "\n")
## Expected Probability of Female who are Hypertensive: 0.04009296
cat("Expected Probability of Female who are Hypotensive:", expected_probability_dk, "\n")
## Expected Probability of Female who are Hypotensive: 0.01896815

Hypothesis :

Null Hypothesis (H0):

The probability of being a male and hypertensive is equal to the probability of being a male and hypotensive.

Alternative Hypothesis (Ha):

The probability of being a male and hypertensive is not equal to the probability of being a male and hypotensive.

contingency_table <- matrix(c(60247,39746,6975,18804), nrow = 2)

# Perform a chi-squared test
chi_squared_test_result <- chisq.test(contingency_table)

# Display the test result
print(chi_squared_test_result)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  contingency_table
## X-squared = 9075, df = 1, p-value < 2.2e-16

Therefore We can reject the null hypothesis.

Hence we can conclude that the probability of being a male and hypertensive is not equal to the probability of being a female and hypertensive.

Findings:

Males seem to have a higher expected probability of both hypertension and hypotension compared to females.

Visualisation

v3 <- data.frame(
  Gender = c("Male", "Male", "Female", "Female"),
  Tension_Status = c("Hypertensive", "Hypotensive", "Hypertensive", "Hypotensive"),
  Expected_Probability = c(0.06077293 ,0.007035889,0.04009296,0.01896815)
)
ggplot(v3, aes(x = Gender, y = Expected_Probability, fill = Tension_Status)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.9), width = 0.7) +
  labs(
    title = "Expected Probabilities of Tension by Gender",
    x = "Gender",
    y = "Expected Probability"
  )

Combinations:

cross_table <- table(data$sex, data$DRK_YN, data$Hypertension,data$Hypotension,data$SMK_stat_type_cd)
cross_table_df <- as.data.frame(cross_table)
cross_table_sum <- rowSums(cross_table)
cross_table_normalized <- scale(cross_table_sum)
cross_table_df
##      Var1 Var2 Var3 Var4 Var5   Freq
## 1  Female    N   NO   NO    1 262236
## 2    Male    N   NO   NO    1  64507
## 3  Female    Y   NO   NO    1 109298
## 4    Male    Y   NO   NO    1  75682
## 5  Female    N  YES   NO    1  30686
## 6    Male    N  YES   NO    1   8195
## 7  Female    Y  YES   NO    1   7207
## 8    Male    Y  YES   NO    1  10172
## 9  Female    N   NO  YES    1  11645
## 10   Male    N   NO  YES    1   1301
## 11 Female    Y   NO  YES    1   5700
## 12   Male    Y   NO  YES    1   1148
## 13 Female    N  YES  YES    1     70
## 14   Male    N  YES  YES    1     17
## 15 Female    Y  YES  YES    1      5
## 16   Male    Y  YES  YES    1      5
## 17 Female    N   NO   NO    2   3762
## 18   Male    N   NO   NO    2  42140
## 19 Female    Y   NO   NO    2   5728
## 20   Male    Y   NO   NO    2  94889
## 21 Female    N  YES   NO    2    321
## 22   Male    N  YES   NO    2   5848
## 23 Female    Y  YES   NO    2    362
## 24   Male    Y  YES   NO    2  15496
## 25 Female    N   NO  YES    2    252
## 26   Male    N   NO  YES    2    866
## 27 Female    Y   NO  YES    2    307
## 28   Male    Y   NO  YES    2   1097
## 29 Female    N  YES  YES    2      0
## 30   Male    N  YES  YES    2      9
## 31 Female    Y  YES  YES    2      0
## 32   Male    Y  YES  YES    2      8
## 33 Female    N   NO   NO    3   5237
## 34   Male    N   NO   NO    3  40947
## 35 Female    Y   NO   NO    3   8795
## 36   Male    Y   NO   NO    3 130116
## 37 Female    N  YES   NO    3    395
## 38   Male    N  YES   NO    3   3458
## 39 Female    Y  YES   NO    3    699
## 40   Male    Y  YES   NO    3  17029
## 41 Female    N   NO  YES    3    386
## 42   Male    N   NO  YES    3    974
## 43 Female    Y   NO  YES    3    430
## 44   Male    Y   NO  YES    3   1536
## 45 Female    N  YES  YES    3      0
## 46   Male    N  YES  YES    3      6
## 47 Female    Y  YES  YES    3      1
## 48   Male    Y  YES  YES    3      4

I have created a Cross-table with Sex, Drinking State, Smoking State, Hypertension and Hypotension as variables. We can note that,

Never show up:

  1. Females: No Drinking, Smoking but quit, Hypertension, Hypo tension
  2. Females: Drinking, Smoking but quit, Hypertension, Hypo tension
  3. Females: No Drinking, Active Smoking, Hypertensive, Hypertensive

This may be due to :

  • Rare occurrences: Females who don’t drink and smoke but have both hypertension and hypo tension is a rare occurrence.

  • The data which I have used may be not large enough to cover all cases.

  • The data I have used, have no female who quit smoking.

Most common occurring combination:

  1. Females: No Smoking, No Drinking, No Hypertension, No Hypotension
  2. Females: Drinking, No Smoking, No Hypertension, No Hypotension
  3. Males: Drinking, Smoking, No Hypertension, No Hypotension

Most Least occurring combinations:

1)Females: Drinking, smoking, Hypertension, Hypotension 2)Male : Drinking, smoking, Hypertension , Hypotension 3)Females: Drinking, no smoking, Hypertensive, Hypotensive

Findings:

  1. Most woman in this data set, follow a healthy lifestyle: no smoking, No Drinking and have normal blood pressure
  2. Males who simultaneously smoke, drink, and have both hypertension and hypotension are not common.
  3. There is a subgroup of females who consume alcohol but do not smoke, and they also do not have hypertension or hypotension. This could represent individuals who consume alcohol without additional risk factors.