library(purrr)

Uploading data:

data_path <- "C:/Users/shanata/Downloads/smoking_driking_dataset_Ver01.csv"

Creating new columns:

Here I have created a new column for hypertension. If a patient has Systolic Blood pressure greater than 140 or Diastolic pressure greater than 90 he/she is considered to be hypertensive.

data$Hypertension[data$SBP > 140] <- "YES"
data$Hypertension[data$SBP < 140] <- "NO"
data$Hypertension[data$DBP > 90] <-"YES"
data$Hypertension[data$SBP < 90] <- "NO"

After creating the hypertensive column, I will create separate columns for hypotensive patients.If a patient has Systolic Blood pressure less than 90 or Diastolic pressure less than 60 he/she is considered to be hypotensive.

data$Hypotension[data$SBP <90 ] <- "YES"
data$Hypotension[data$SBP > 90] <- "NO"
data$Hypotension[data$DBP < 60] <-"YES"
data$Hypotension[data$DBP > 60] <- "NO"

Now we have separate columns for hypertensive and hypotensive patients. Now let’s find out, how many people with drinking habits have hyper and hypo tension.

summary(data$DRK_YN)

##    Length     Class      Mode 
##    991346 character character

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Group-by dataframe 1:

Here, I have grouped people based on their drinking habits and their blood pressure levels. I have stripped the null values before computing.

f1 <- data %>%
  group_by(DRK_YN) %>%
  summarize(Hypertension = sum(Hypertension == "YES", na.rm = TRUE),
            #Normal_sytolic = sum(Hypertension == "NO", na.rm = TRUE),
            #Normal_diastolic=sum(Hypotension=="NO",na.rm=TRUE),
            Hypotension = sum(Hypotension == "YES", na.rm = TRUE)
            )
f1

## # A tibble: 2 × 3
##   DRK_YN Hypertension Hypotension
##   <chr>         <int>       <int>
## 1 N             49005       15537
## 2 Y             50988       10242

People with Drinking habits:

count_yes <- sum(data$DRK_YN == "Y")
count_no <- sum(data$DRK_YN == "N")

# Display the counts
cat("Count of YES:", count_yes, "\n")

## Count of YES: 495488

cat("Count of NO:", count_no, "\n")

## Count of NO: 495858

Now, I have the total number of alcoholic drinkers and non-alcoholic drinkers, I also have the Hypertensive and Hypotensive people among the drinkers and non drinkers. Now, it is time to calculate the expected probability of each case.

Calculating expected probability:

total_drinkers <- 495488
total_non_drinkers <- 495858
total_hypertensive_non_drinkers <- 49005
total_hypotensive_non_drinkers <- 15537
total_hypertensive_drinkers <- 50988
total_hypotensive_drinkers <- 10242

# Calculate the probabilities
p_drinker <- total_drinkers / (total_drinkers + total_non_drinkers)
p_hypertensive_given_drinker <- total_hypertensive_drinkers / total_drinkers
p_hypotensive_given_drinker <- total_hypotensive_drinkers / total_drinkers
p_non_drinker <- total_non_drinkers / (total_drinkers + total_non_drinkers)
p_hypertensive_given_non_drinker <- total_hypertensive_non_drinkers / total_non_drinkers
p_hypotensive_given_non_drinker <- total_hypotensive_non_drinkers / total_non_drinkers


# Calculate the expected probability 
expected_probability_dh <- p_drinker * p_hypertensive_given_drinker
expected_probability_di <- p_drinker * p_hypotensive_given_drinker
expected_probability_dj <- p_non_drinker * p_hypertensive_given_non_drinker
expected_probability_dk <- p_non_drinker * p_hypotensive_given_non_drinker

# Display the result
cat("Expected Probability of Drinkers who are Hypertensive:", expected_probability_dh, "\n")

## Expected Probability of Drinkers who are Hypertensive: 0.0514331

cat("Expected Probability of Drinkers who are Hypotensive:", expected_probability_di, "\n")

## Expected Probability of Drinkers who are Hypotensive: 0.01033141

cat("Expected Probability of Non Drinkers who are Hypertensive:", expected_probability_dj, "\n")

## Expected Probability of Non Drinkers who are Hypertensive: 0.04943279

cat("Expected Probability of NonDrinkers who are Hypotensive:", expected_probability_dk, "\n")

## Expected Probability of NonDrinkers who are Hypotensive: 0.01567263

From this I can draw a hypothesis:

The probability of someone being a drinker and hypertensive is higher compared to the probability of being a drinker and hypotensive.

Hypothesis Testing:

Null Hypothesis (H0):

The probability of being a drinker and hypertensive is equal to the probability of being a drinker and hypotensive.

Alternative Hypothesis (Ha):

The probability of being a drinker and hypertensive is not equal to the probability of being a drinker and hypotensive.

To test the hypothesis, I have used chi-squared test.

Accept null hypothesis if p-value less then 0.05 else reject it

contingency_table <- matrix(c(total_hypertensive_drinkers,total_hypertensive_non_drinkers,total_hypotensive_drinkers,total_hypotensive_non_drinkers), nrow = 2)

# Perform a chi-squared test
chi_squared_test_result <- chisq.test(contingency_table)

# Display the test result
print(chi_squared_test_result)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  contingency_table
## X-squared = 1040, df = 1, p-value < 2.2e-16

Therefore We can reject the null hypothesis.

Hence we can conclude that the probability of being a drinker and hypertensive is not equal to the probability of being a drinker and hypotensive.

Visualization

library(ggplot2)

Here I have used a bar chart to compare the expected probabilities.

v1 <- data.frame(
  Drinking_Status = c("Drinkers", "Drinkers", "Non Drinkers", "Non Drinkers"),
  Tension_Status = c("Hypertensive", "Hypotensive", "Hypertensive", "Hypotensive"),
  Expected_Probability = c(0.0514331, 0.01033141, 0.04943279, 0.01567263)
)
ggplot(v1, aes(x = Drinking_Status, y = Expected_Probability, fill = Tension_Status)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.9), width = 0.7) +
  labs(
    title = "Expected Probabilities of Tension by Drinking Status",
    x = "Drinking Status",
    y = "Expected Probability"
  )

Findings :

The difference in expected probabilities between drinkers and non-drinkers for hypertension is relatively small, suggesting that drinking status alone may not be a strong predictor of hypertension.

Further questions:

Are there any other confounding variables or interactions that may affect the relationship between drinking status and hypertension/hypotension ??

Group-by data frame 2

Grouping data based on smokers and non-smokers. Here 1 indicate people who don’t smoke, 2 indicate people who used to smoke but quit smoking now and 3 indicate people who are active smokers.

f <- data %>%
  group_by(SMK_stat_type_cd) %>%
  summarise(Hypertension = sum(Hypertension == "YES", na.rm = TRUE),
            #Normal_sytolic = sum(Hypertension == "NO", na.rm = TRUE),
            #Normal_diastolic=sum(Hypotension=="NO",na.rm=TRUE),
            Hypotension = sum(Hypotension == "YES", na.rm = TRUE)
            )
f

## # A tibble: 3 × 3
##   SMK_stat_type_cd Hypertension Hypotension
##              <dbl>        <int>       <int>
## 1                1        56357       19900
## 2                2        22044        2541
## 3                3        21592        3338

Count of people in each category :

count_1 <- sum(data$SMK_stat_type_cd == 1)
count_2 <- sum(data$SMK_stat_type_cd == 2)
count_3 <- sum(data$SMK_stat_type_cd == 3)

# Display the counts
cat("Count of people who don't smoke:", count_1, "\n")

## Count of people who don't smoke: 602441

cat("Count of people who used to smoke but quit:", count_2, "\n")

## Count of people who used to smoke but quit: 174951

cat("Count of people who used to still smoke:", count_3, "\n")

## Count of people who used to still smoke: 213954

Calculating expected probability:

total_non_smokers <- 602441
#total_smokers_but_quit <- 174951
total_smokers<- 213954


total_hypertensive_non_smokers <- 56357
total_hypotensive_non_smokers <- 19900

#total_hypertensive_used_to_smoke <- 22044
#total_hypotensive_used_to_smoke <- 2541

total_hypertensive_smokers <- 21592
total_hypotensive_smokers <- 3338

# Calculate the probabilities
p_smoker <-  total_smokers/ (total_non_smokers + total_smokers )
p_hypertensive_given_smoker <- total_hypertensive_smokers / total_smokers
p_hypotensive_given_smoker <- total_hypotensive_smokers / total_smokers


p_non_smokers <- total_non_smokers / (total_smokers + total_non_smokers)
p_hypertensive_given_non_smoker <- total_hypertensive_non_smokers / total_non_smokers
p_hypotensive_given_non_smoker <- total_hypotensive_non_smokers / total_non_smokers


# Calculate the expected probability for drinkers who are hypersensitive
expected_probability_dh <- p_smoker * p_hypertensive_given_smoker
expected_probability_di <- p_smoker * p_hypotensive_given_smoker
expected_probability_dj <- p_non_smokers * p_hypertensive_given_non_smoker
expected_probability_dk <- p_non_smokers * p_hypotensive_given_non_smoker

# Display the result
cat("Expected Probability of Smokers who are Hypertensive:", expected_probability_dh, "\n")

## Expected Probability of Smokers who are Hypertensive: 0.02644798

cat("Expected Probability of Smokers who are Hypotensive:", expected_probability_di, "\n")

## Expected Probability of Smokers who are Hypotensive: 0.004088707

cat("Expected Probability of Non Smokers who are Hypertensive:", expected_probability_dj, "\n")

## Expected Probability of Non Smokers who are Hypertensive: 0.06903153

cat("Expected Probability of Non Smokers who are Hypotensive:", expected_probability_dk, "\n")

## Expected Probability of Non Smokers who are Hypotensive: 0.02437546

Null Hypothesis (H0):

The probability of being a smoker and hypertensive is equal to the probability of being a smoker and hypotensive.

Alternative Hypothesis (Ha):

The probability of being a smoker and hypertensive is not equal to the probability of being a smoker and hypotensive.

contingency_table <- matrix(c(21592,56357,3338,19900), nrow = 2)

# Perform a chi-squared test
chi_squared_test_result <- chisq.test(contingency_table)

# Display the test result
print(chi_squared_test_result)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  contingency_table
## X-squared = 1713.9, df = 1, p-value < 2.2e-16

Therefore We can reject the null hypothesis.

Hence we can conclude that the probability of being a active smoker and hypertensive is not equal to the probability of being a active smoker and hypotensive.

Visualization

v2 <- data.frame(
  Smoking_Status = c("Smokers", "Smokers", "Non Smokers", "Non Smokers"),
  Tension_Status = c("Hypertensive", "Hypotensive", "Hypertensive", "Hypotensive"),
  Expected_Probability = c(0.02644798, 0.004088707 , 0.06903153, 0.02437546)
)
ggplot(v2, aes(x = Smoking_Status, y = Expected_Probability, fill = Tension_Status)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.9), width = 0.7) +
  labs(
    title = "Expected Probabilities of Tension by Smoking Status",
    x = "Smoking Status",
    y = "Expected Probability"
  )

### Group-by dataframe 3

f3 <- data %>%
  group_by(sex) %>%
  summarise(Hypertension = sum(Hypertension == "YES", na.rm = TRUE),
            #Normal_sytolic = sum(Hypertension == "NO", na.rm = TRUE),
            #Normal_diastolic=sum(Hypotension=="NO",na.rm=TRUE),
            Hypotension = sum(Hypotension == "YES", na.rm = TRUE)
            )
f3

## # A tibble: 2 × 3
##   sex    Hypertension Hypotension
##   <chr>         <int>       <int>
## 1 Female        39746       18804
## 2 Male          60247        6975

count_1 <- sum(data$sex=="Female")
count_2 <- sum(data$sex=="Male")

# Display the counts
cat("Count of Female:", count_1, "\n")

## Count of Female: 464931

cat("Count of Male:", count_2, "\n")

## Count of Male: 526415

Calculating probability

total_female <- 464931
#total_smokers_but_quit <- 174951
total_male<- 526415


total_hypertensive_male <- 60247
total_hypotensive_male<- 6975

#total_hypertensive_used_to_smoke <- 22044
#total_hypotensive_used_to_smoke <- 2541

total_hypertensive_female <- 39746
total_hypotensive_female <- 18804

# Calculate the probabilities
p_female <-  total_female/ (total_female + total_male )
p_hypertensive_given_female <- total_hypertensive_female / total_female
p_hypotensive_given_female <- total_hypotensive_female / total_female


p_male <- total_male / (total_male + total_female)
p_hypertensive_given_male <- total_hypertensive_male / total_male
p_hypotensive_given_male <- total_hypotensive_male / total_male


# Calculate the expected probability for drinkers who are hypersensitive
expected_probability_dh <- p_male * p_hypertensive_given_male
expected_probability_di <- p_male * p_hypotensive_given_male
expected_probability_dj <- p_female * p_hypertensive_given_female
expected_probability_dk <- p_female * p_hypotensive_given_female

# Display the result
cat("Expected Probability of Male who are Hypertensive:", expected_probability_dh, "\n")

## Expected Probability of Male who are Hypertensive: 0.06077293

cat("Expected Probability of Male who are Hypotensive:", expected_probability_di, "\n")

## Expected Probability of Male who are Hypotensive: 0.007035889

cat("Expected Probability of Female who are Hypertensive:", expected_probability_dj, "\n")

## Expected Probability of Female who are Hypertensive: 0.04009296

cat("Expected Probability of Female who are Hypotensive:", expected_probability_dk, "\n")

## Expected Probability of Female who are Hypotensive: 0.01896815

Hypothesis :

Null Hypothesis (H0):

The probability of being a male and hypertensive is equal to the probability of being a male and hypotensive.

Alternative Hypothesis (Ha):

The probability of being a male and hypertensive is not equal to the probability of being a male and hypotensive.

contingency_table <- matrix(c(60247,39746,6975,18804), nrow = 2)

# Perform a chi-squared test
chi_squared_test_result <- chisq.test(contingency_table)

# Display the test result
print(chi_squared_test_result)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  contingency_table
## X-squared = 9075, df = 1, p-value < 2.2e-16

Therefore We can reject the null hypothesis.

Hence we can conclude that the probability of being a male and hypertensive is not equal to the probability of being a female and hypertensive.

Findings:

Males seem to have a higher expected probability of both hypertension and hypotension compared to females.

Visualisation

v3 <- data.frame(
  Gender = c("Male", "Male", "Female", "Female"),
  Tension_Status = c("Hypertensive", "Hypotensive", "Hypertensive", "Hypotensive"),
  Expected_Probability = c(0.06077293 ,0.007035889,0.04009296,0.01896815)
)

ggplot(v3, aes(x = Gender, y = Expected_Probability, fill = Tension_Status)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.9), width = 0.7) +
  labs(
    title = "Expected Probabilities of Tension by Gender",
    x = "Gender",
    y = "Expected Probability"
  )

Combinations:

cross_table <- table(data$sex, data$DRK_YN, data$Hypertension,data$Hypotension,data$SMK_stat_type_cd)

cross_table_df <- as.data.frame(cross_table)
cross_table_sum <- rowSums(cross_table)
cross_table_normalized <- scale(cross_table_sum)

cross_table_df

##      Var1 Var2 Var3 Var4 Var5   Freq
## 1  Female    N   NO   NO    1 262236
## 2    Male    N   NO   NO    1  64507
## 3  Female    Y   NO   NO    1 109298
## 4    Male    Y   NO   NO    1  75682
## 5  Female    N  YES   NO    1  30686
## 6    Male    N  YES   NO    1   8195
## 7  Female    Y  YES   NO    1   7207
## 8    Male    Y  YES   NO    1  10172
## 9  Female    N   NO  YES    1  11645
## 10   Male    N   NO  YES    1   1301
## 11 Female    Y   NO  YES    1   5700
## 12   Male    Y   NO  YES    1   1148
## 13 Female    N  YES  YES    1     70
## 14   Male    N  YES  YES    1     17
## 15 Female    Y  YES  YES    1      5
## 16   Male    Y  YES  YES    1      5
## 17 Female    N   NO   NO    2   3762
## 18   Male    N   NO   NO    2  42140
## 19 Female    Y   NO   NO    2   5728
## 20   Male    Y   NO   NO    2  94889
## 21 Female    N  YES   NO    2    321
## 22   Male    N  YES   NO    2   5848
## 23 Female    Y  YES   NO    2    362
## 24   Male    Y  YES   NO    2  15496
## 25 Female    N   NO  YES    2    252
## 26   Male    N   NO  YES    2    866
## 27 Female    Y   NO  YES    2    307
## 28   Male    Y   NO  YES    2   1097
## 29 Female    N  YES  YES    2      0
## 30   Male    N  YES  YES    2      9
## 31 Female    Y  YES  YES    2      0
## 32   Male    Y  YES  YES    2      8
## 33 Female    N   NO   NO    3   5237
## 34   Male    N   NO   NO    3  40947
## 35 Female    Y   NO   NO    3   8795
## 36   Male    Y   NO   NO    3 130116
## 37 Female    N  YES   NO    3    395
## 38   Male    N  YES   NO    3   3458
## 39 Female    Y  YES   NO    3    699
## 40   Male    Y  YES   NO    3  17029
## 41 Female    N   NO  YES    3    386
## 42   Male    N   NO  YES    3    974
## 43 Female    Y   NO  YES    3    430
## 44   Male    Y   NO  YES    3   1536
## 45 Female    N  YES  YES    3      0
## 46   Male    N  YES  YES    3      6
## 47 Female    Y  YES  YES    3      1
## 48   Male    Y  YES  YES    3      4

I have created a Cross-table with Sex, Drinking State, Smoking State, Hypertension and Hypotension as variables. We can note that,

Never show up:

Females: No Drinking, Smoking but quit, Hypertension, Hypo tension
Females: Drinking, Smoking but quit, Hypertension, Hypo tension
Females: No Drinking, Active Smoking, Hypertensive, Hypertensive

This may be due to :

Rare occurrences: Females who don’t drink and smoke but have both hypertension and hypo tension is a rare occurrence.
The data which I have used may be not large enough to cover all cases.
The data I have used, have no female who quit smoking.

Most common occurring combination:

Females: No Smoking, No Drinking, No Hypertension, No Hypotension
Females: Drinking, No Smoking, No Hypertension, No Hypotension
Males: Drinking, Smoking, No Hypertension, No Hypotension

Most Least occurring combinations:

1)Females: Drinking, smoking, Hypertension, Hypotension 2)Male : Drinking, smoking, Hypertension , Hypotension 3)Females: Drinking, no smoking, Hypertensive, Hypotensive

Findings:

Most woman in this data set, follow a healthy lifestyle: no smoking, No Drinking and have normal blood pressure
Males who simultaneously smoke, drink, and have both hypertension and hypotension are not common.
There is a subgroup of females who consume alcohol but do not smoke, and they also do not have hypertension or hypotension. This could represent individuals who consume alcohol without additional risk factors.

Lab 3 Data Dive sharmitha Yazhini

Uploading data:

Creating new columns:

Group-by dataframe 1:

People with Drinking habits:

Calculating expected probability:

From this I can draw a hypothesis:

Hypothesis Testing:

Null Hypothesis (H0):

Alternative Hypothesis (Ha):

Therefore We can reject the null hypothesis.

Hence we can conclude that the probability of being a drinker and hypertensive is not equal to the probability of being a drinker and hypotensive.

Visualization

Here I have used a bar chart to compare the expected probabilities.

Findings :

The difference in expected probabilities between drinkers and non-drinkers for hypertension is relatively small, suggesting that drinking status alone may not be a strong predictor of hypertension.

Further questions:

Group-by data frame 2

Calculating expected probability:

Null Hypothesis (H0):

Alternative Hypothesis (Ha):

Therefore We can reject the null hypothesis.

Hence we can conclude that the probability of being a active smoker and hypertensive is not equal to the probability of being a active smoker and hypotensive.

Visualization

Calculating probability

Hypothesis :

Null Hypothesis (H0):

Alternative Hypothesis (Ha):

Therefore We can reject the null hypothesis.

Hence we can conclude that the probability of being a male and hypertensive is not equal to the probability of being a female and hypertensive.

Findings:

Visualisation

Combinations:

Never show up:

Most common occurring combination:

Most Least occurring combinations:

Findings: