Data Dive Week 2


data_path <-"C:/Users/18136/OneDrive - University of South Florida/Documents/Desktop/smoking_driking_dataset_Ver01.csv"

data <- read.csv(data_path)

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

Question 1: How many males and females are included in the dataset?

summary_table <- data %>%
  count(sex) %>%
  arrange(desc(n))
print(summary_table)

##      sex      n
## 1   Male 526415
## 2 Female 464931

The table above shows that there are 526415 males and 464931 females in the dataset.

Question 2: How many people have a hearing disability in their left ear?

summary_table1 <- data %>%
  count(hear_left) %>%
  arrange(desc(n))
print(summary_table1)

##   hear_left      n
## 1         1 960124
## 2         2  31222

(31222/(960124 + 31222))*100

## [1] 3.149455

This shows that 31222 people (about 3.15 % of the dataset) have abnormal hearing in their left ear.
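
A share of the whole dataset uses the total count as the denominator, not the count of the other group. The sketch below is one way to compute it straight from a count table. The data frame `hear_counts` is a hypothetical stand-in for `summary_table1`, rebuilt from the counts printed above.

```r
library(dplyr)

# Counts copied from the hear_left table printed above
# (hear_counts is a hypothetical stand-in for summary_table1).
hear_counts <- data.frame(hear_left = c(1, 2), n = c(960124, 31222))

# Divide each count by the total number of records to get a percentage.
hear_share <- hear_counts %>%
  mutate(pct = n / sum(n) * 100)

print(hear_share)
```

Dividing by `sum(n)` keeps the percentages of all categories adding up to 100, which is a quick sanity check on the denominator.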

summary_table2 <- data %>%
  count(hear_right) %>%
  arrange(desc(n))
print(summary_table2)

##   hear_right      n
## 1          1 961134
## 2          2  30212

summary_table3 <- data %>%
  count(urine_protein) %>%
  arrange(desc(n))
print(summary_table3)

##   urine_protein      n
## 1             1 935175
## 2             2  30850
## 3             3  16405
## 4             4   6427
## 5             5   1977
## 6             6    512

print("The summary for systolic blood pressure")

## [1] "The summary for systolic blood pressure"

summary(data$SBP)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    67.0   112.0   120.0   122.4   131.0   273.0

print("The summary for diastolic blood pressure")

## [1] "The summary for diastolic blood pressure"

summary(data$DBP)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   32.00   70.00   76.00   76.05   82.00  185.00

print("The summary for fasting blood sugar")

## [1] "The summary for fasting blood sugar"

summary(data$BLDS)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    25.0    88.0    96.0   100.4   105.0   852.0

print("The summary for column age")

## [1] "The summary for column age"

summary(data$age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20.00   35.00   45.00   47.61   60.00   85.00

print("The summary for column height")

## [1] "The summary for column height"

summary(data$height)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   130.0   155.0   160.0   162.2   170.0   190.0

print("The summary for column weight")

## [1] "The summary for column weight"

summary(data$weight)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   25.00   55.00   60.00   63.28   70.00  140.00

print("The summary for column waistline")

## [1] "The summary for column waistline"

summary(data$waistline)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00   74.10   81.00   81.23   87.80  999.00

print("The summary for column cholesterol levels")

## [1] "The summary for column cholesterol levels"

summary(data$tot_chole)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    30.0   169.0   193.0   195.6   219.0  2344.0

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ✔ readr     2.1.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

data |>                     # get data frame
  filter(DRK_YN == "Y") |>  # then, filter it by the drinking yes/no column
  pluck("tot_chole") |>     # then, extract the "tot_chole" (total cholesterol) column
  mean()

## [1] 196.3197

Question 3: Is there any difference in cholesterol levels between drinkers and non-drinkers?

data |>
  group_by(DRK_YN) |>
  summarise(mean_cholestral = mean(tot_chole),
            max_cholestral= max(tot_chole))

## # A tibble: 2 × 3
##   DRK_YN mean_cholestral max_cholestral
##   <chr>            <dbl>          <dbl>
## 1 N                 195.           2046
## 2 Y                 196.           2344

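The mean cholesterol level is slightly lower for non-drinkers (about 195 mg/dL) than for drinkers (about 196 mg/dL), and the maximum is lower as well (2046 versus 2344). The difference in means is small and the maximum reflects a single extreme record, so this table suggests an association at most; it does not by itself show that alcohol consumption raises cholesterol levels.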

data |>
  group_by(sex) |>
  summarise(mean_height = mean(height),
            mean_weight = mean(weight))

## # A tibble: 2 × 3
##   sex    mean_height mean_weight
##   <chr>        <dbl>       <dbl>
## 1 Female        155.        55.5
## 2 Male          169.        70.1

Question 4: What about the hemoglobin for drinkers and non-drinkers?

data |>
  group_by(DRK_YN) |>
  summarise(mean_hemoglobin = mean(hemoglobin),
            median_hemoglobin = median(hemoglobin),
            max_hemoglobin= max(hemoglobin))

## # A tibble: 2 × 4
##   DRK_YN mean_hemoglobin median_hemoglobin max_hemoglobin
##   <chr>            <dbl>             <dbl>          <dbl>
## 1 N                 13.8              13.7             25
## 2 Y                 14.7              14.9             25

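Mean hemoglobin is higher for drinkers (14.7 g/dL) than for non-drinkers (13.8 g/dL), and the median shows the same gap (14.9 versus 13.7); the maximum is 25 in both groups. This table therefore does not indicate that alcohol consumption reduces hemoglobin levels.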
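
The table above pools both sexes. If drinkers and non-drinkers differ in their sex mix (an assumption not checked here), splitting the summary by sex would make the comparison easier to read. The sketch below runs on a small made-up data frame, `toy_hb`, whose values are illustrative only and are not taken from the dataset.

```r
library(dplyr)

# Hypothetical rows for illustration; the real analysis would use `data`.
toy_hb <- data.frame(
  sex        = c("Male", "Male", "Male", "Female", "Female", "Female"),
  DRK_YN     = c("Y", "Y", "N", "Y", "N", "N"),
  hemoglobin = c(15.2, 15.0, 15.1, 13.1, 13.0, 12.9)
)

# One row per sex/drinking combination, with the group size alongside the mean.
hb_by_sex_drk <- toy_hb |>
  group_by(sex, DRK_YN) |>
  summarise(mean_hemoglobin = mean(hemoglobin),
            n = n(),
            .groups = "drop")

print(hb_by_sex_drk)
```

Replacing `toy_hb` with `data` would give the same breakdown on the full dataset.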

Question 5: What about the cholesterol level for smokers?

data |>
  group_by(SMK_stat_type_cd) |>
  summarise(mean_cholestral = mean(tot_chole),
            median_cholestral = median(tot_chole),
            max_cholestral= max(tot_chole))

## # A tibble: 3 × 4
##   SMK_stat_type_cd mean_cholestral median_cholestral max_cholestral
##              <dbl>           <dbl>             <dbl>          <dbl>
## 1                1            195.               193           2344
## 2                2            195.               194           2046
## 3                3            197.               195           2033

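The mean cholesterol level rises slightly from never-smokers and former smokers (about 195) to current smokers (about 197), and the median rises from 193 to 195. The difference is small, so on its own it is only weak evidence of a harmful effect of smoking. An interesting anomaly is that the maximum cholesterol level moves in the opposite direction, falling from 2344 for never-smokers to 2033 for current smokers.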

data |>
  group_by(DRK_YN) |>
  summarise(mean_triglyceride = mean(triglyceride),
            median_triglyceride = median(triglyceride),
            max_triglyceride= max(triglyceride))

## # A tibble: 2 × 4
##   DRK_YN mean_triglyceride median_triglyceride max_triglyceride
##   <chr>              <dbl>               <dbl>            <dbl>
## 1 N                   121.                 102             9490
## 2 Y                   143.                 112             6430

Data Documentation:

sex: male, female
age: rounded up to 5 years
height: rounded up to 5 cm [cm]
weight: [kg]
sight_left: eyesight (left)
sight_right: eyesight (right)
hear_left: hearing left, 1 (normal), 2 (abnormal)
hear_right: hearing right, 1 (normal), 2 (abnormal)
SBP: systolic blood pressure [mmHg]
DBP: diastolic blood pressure [mmHg]
BLDS: fasting blood glucose [mg/dL]
tot_chole: total cholesterol [mg/dL]
HDL_chole: HDL cholesterol [mg/dL]
LDL_chole: LDL cholesterol [mg/dL]
triglyceride: triglyceride [mg/dL]
hemoglobin: hemoglobin [g/dL]
urine_protein: protein in urine, 1 (-), 2 (+/-), 3 (+1), 4 (+2), 5 (+3), 6 (+4)
serum_creatinine: serum (blood) creatinine [mg/dL]
SGOT_AST: SGOT (glutamate-oxaloacetate transaminase), AST (aspartate transaminase) [IU/L]
SGOT_ALT: ALT (alanine transaminase) [IU/L]
gamma_GTP: gamma-glutamyl transpeptidase [IU/L]
SMK_stat_type_cd: smoking state, 1 (never), 2 (used to smoke but quit), 3 (still smoke)
DRK_YN: drinker or not

Project goals:

My goal is to learn about the effects of drinking and smoking on the human body, and how they affect factors such as hemoglobin, cholesterol, and protein levels. I hope my project will create awareness.

ggplot(data, aes(x = height)) +
  geom_histogram(binwidth = 5, fill = "blue", color = "black") +   # 5 cm bins match the rounding of height
  labs(title = "Histogram of height", x = "Height (cm)", y = "Frequency")

This histogram shows the distribution of height for males and females combined. The distribution is centered around 160 cm (the summary above gives a median of 160 cm and a mean of 162.2 cm).

ggplot(data, aes(y = weight)) +
  geom_boxplot(fill = "lightblue", color = "blue") +
  labs(title = "Box Plot of weight",y="Values")

This box plot shows the outliers in weight. The 25th and 75th percentiles of weight are 55 kg and 70 kg according to the summary above. Moreover, it shows that there are more outliers on the high-weight side than on the low-weight side.
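
The outlier counts on each side can also be checked numerically. The sketch below applies the 1.5 × IQR rule, which is the default whisker rule in `geom_boxplot()`, to a short made-up vector `w`. The values are hypothetical and are not taken from the dataset.

```r
# Hypothetical weights in kg, for illustration only (not taken from the dataset).
w <- c(40, 50, 55, 55, 60, 60, 65, 70, 70, 75, 110, 140)

# Quartiles and interquartile range.
q   <- unname(quantile(w, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Points beyond 1.5 * IQR from the box are treated as outliers.
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

n_low  <- sum(w < lower_fence)   # outliers on the low-weight side
n_high <- sum(w > upper_fence)   # outliers on the high-weight side

print(c(low = n_low, high = n_high))
```

Running the same steps on `data$weight` would show how many records fall outside each fence.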

data |>
  ggplot() +
  geom_point(mapping = aes(x = weight, y = hemoglobin)) +
  theme_classic()

It is interesting to observe that as weight increases, the minimum hemoglobin also increases. Overall, hemoglobin appears to rise roughly linearly with weight.

data <- mutate(data, high_risk = (SBP > 140 & DBP >90) )
ggplot(mapping = aes(x = height, y = weight)) +
  geom_point(data = filter(data, !high_risk), color = 'darkblue') +
  geom_point(data = filter(data, high_risk), color = 'yellow') +
  theme_classic() +
  scale_color_brewer(palette = "Set1")

Yellow represents high-risk people, defined here as those with systolic pressure greater than 140 and diastolic pressure greater than 90. The plot shows only height and weight, so it cannot by itself show whether most high-risk patients are current smokers; that would need a comparison against SMK_stat_type_cd.
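
One way to relate the high-risk flag to smoking status is to compute the share of flagged records within each smoking category. The sketch below uses a small made-up data frame, `toy_bp`, whose values are hypothetical. The column names and the high-risk rule follow the code above.

```r
library(dplyr)

# Hypothetical rows for illustration; the real analysis would use `data`.
toy_bp <- data.frame(
  SMK_stat_type_cd = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3),
  SBP = c(120, 150, 118, 130, 145, 125, 160, 150, 110, 135),
  DBP = c(80, 95, 75, 85, 92, 80, 100, 85, 70, 88)
)

# Same high-risk rule as above, then the share of flagged rows per smoking state.
risk_by_smoking <- toy_bp |>
  mutate(high_risk = SBP > 140 & DBP > 90) |>
  group_by(SMK_stat_type_cd) |>
  summarise(n = n(),
            n_high_risk = sum(high_risk),
            share_high_risk = mean(high_risk))

print(risk_by_smoking)
```

Comparing `share_high_risk` across the three smoking states is more informative than comparing raw counts, because the groups differ in size.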

data <- mutate(data, high_risk1 = (hemoglobin < 13 & sex=="Male") |
                 (hemoglobin < 12 & sex=="Female"))
ggplot(mapping = aes(x = height, y = weight)) +
  geom_point(data = filter(data, !high_risk1), color = 'darkblue') +
  geom_point(data = filter(data, high_risk1), color = 'yellow') +
  theme_classic() +
  scale_color_brewer(palette = "Set1")

I have classified as high risk females with hemoglobin below 12 g/dL and males with hemoglobin below 13 g/dL. In both graphs, one observation is that people taller than 180 cm appear to be rarely flagged as high risk.
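
In both plots the point colours are set as constants, so `scale_color_brewer()` has nothing to act on and no legend is drawn. The sketch below shows an alternative that maps the flag to the colour aesthetic, so that a single layer and a legend suffice. It uses a made-up data frame, `toy_pts`, with hypothetical values.

```r
library(dplyr)
library(ggplot2)

# Hypothetical rows for illustration; the real analysis would use `data`.
toy_pts <- data.frame(
  height     = c(150, 160, 170, 180, 185),
  weight     = c(50, 60, 70, 80, 85),
  sex        = c("Female", "Female", "Male", "Male", "Male"),
  hemoglobin = c(11.5, 13.0, 12.5, 15.0, 14.0)
)

# Same low-hemoglobin rule as above.
toy_pts <- mutate(toy_pts,
                  high_risk1 = (hemoglobin < 13 & sex == "Male") |
                    (hemoglobin < 12 & sex == "Female"))

# Mapping the flag to colour lets the brewer palette apply and adds a legend.
p <- ggplot(toy_pts, aes(x = height, y = weight, color = high_risk1)) +
  geom_point() +
  scale_color_brewer(palette = "Set1", name = "High risk") +
  theme_classic()

print(p)
```

With about a million rows the points overlap heavily, so the drawing order of the two groups still affects what is visible.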
