This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
data_path <- "C:/Users/18136/OneDrive - University of South Florida/Documents/Desktop/smoking_driking_dataset_Ver01.csv"
data <- read.csv(data_path)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
summary_table <- data %>%
count(sex) %>%
arrange(desc(n))
print(summary_table)
## sex n
## 1 Male 526415
## 2 Female 464931
The table above shows that the dataset contains 526,415 males and 464,931 females.
summary_table1 <- data %>%
count(hear_left) %>%
arrange(desc(n))
print(summary_table1)
## hear_left n
## 1 1 960124
## 2 2 31222
(31222/960124)*100
## [1] 3.251872
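This shows that 31,222 people have abnormal hearing in the left ear. The 3.25% figure is the ratio of abnormal to normal readings; as a share of all 991,346 records, abnormal left-ear hearing accounts for about 3.15%.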
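The ratio computed above uses the count of normal readings as its denominator. As a minimal sketch, the share among all records can be worked out from the two counts printed in the `hear_left` table (the variable names `counts` and `pct_of_total` are illustrative, not part of the original notebook):

```r
# Counts copied from the hear_left table printed above
counts <- c(normal = 960124, abnormal = 31222)

# Share of each category among all records, in percent
pct_of_total <- counts / sum(counts) * 100
round(pct_of_total, 2)
```

Dividing by `sum(counts)` rather than by the normal count alone is what makes the result a percentage of the whole dataset.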
summary_table2 <- data %>%
count(hear_right) %>%
arrange(desc(n))
print(summary_table2)
## hear_right n
## 1 1 961134
## 2 2 30212
summary_table3 <- data %>%
count(urine_protein) %>%
arrange(desc(n))
print(summary_table3)
## urine_protein n
## 1 1 935175
## 2 2 30850
## 3 3 16405
## 4 4 6427
## 5 5 1977
## 6 6 512
print("The summary for systolic blood pressure")
## [1] "The summary for systolic blood pressure"
summary(data$SBP)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 67.0 112.0 120.0 122.4 131.0 273.0
print("The summary for diastolic blood pressure")
## [1] "The summary for diastolic blood pressure"
summary(data$DBP)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 32.00 70.00 76.00 76.05 82.00 185.00
print("The summary for fasting blood sugar")
## [1] "The summary for fasting blood sugar"
summary(data$BLDS)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 25.0 88.0 96.0 100.4 105.0 852.0
print("The summary for column age")
## [1] "The summary for column age"
summary(data$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.00 35.00 45.00 47.61 60.00 85.00
print("The summary for column height")
## [1] "The summary for column height"
summary(data$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 130.0 155.0 160.0 162.2 170.0 190.0
print("The summary for column weight")
## [1] "The summary for column weight"
summary(data$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 25.00 55.00 60.00 63.28 70.00 140.00
print("The summary for column waistline")
## [1] "The summary for column waistline"
summary(data$waistline)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 74.10 81.00 81.23 87.80 999.00
print("The summary for column cholesterol levels")
## [1] "The summary for column cholesterol levels"
summary(data$tot_chole)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 30.0 169.0 193.0 195.6 219.0 2344.0
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ✔ readr 2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data |> # get data frame
filter(DRK_YN == "Y") |> # then, filter it by the drinking yes/no column
pluck("tot_chole") |> # then, extract the "tot_chole" (total cholesterol) column
mean()
## [1] 196.3197
data |>
group_by(DRK_YN) |>
summarise(mean_cholestral = mean(tot_chole),
max_cholestral= max(tot_chole))
## # A tibble: 2 × 3
## DRK_YN mean_cholestral max_cholestral
## <chr> <dbl> <dbl>
## 1 N 195. 2046
## 2 Y 196. 2344
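Both the mean and the maximum cholesterol levels are slightly lower for non-drinkers than for drinkers (a mean of about 195 mg/dL versus 196 mg/dL). The difference in means is very small, and because this is observational data it points to a weak association at most. It does not by itself show that alcohol consumption raises cholesterol.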
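One way to judge how large a gap between two group means really is would be a confidence interval plus a standardized effect size. The sketch below runs on simulated values, not on the real dataset: the group size of 5000 and the standard deviation of 38 are assumptions chosen only for illustration, and `sim`, `tt`, and `cohens_d` are hypothetical names.

```r
set.seed(1)

# Simulated stand-in for the real data (assumption: not the actual dataset)
sim <- data.frame(
  DRK_YN = rep(c("N", "Y"), each = 5000),
  tot_chole = c(rnorm(5000, mean = 195, sd = 38),
                rnorm(5000, mean = 196, sd = 38))
)

# Welch two-sample t-test: confidence interval for the difference in means
tt <- t.test(tot_chole ~ DRK_YN, data = sim)
tt$conf.int

# Standardized effect size (Cohen's d, pooled SD for equal group sizes)
group_means <- tapply(sim$tot_chole, sim$DRK_YN, mean)
group_sds <- tapply(sim$tot_chole, sim$DRK_YN, sd)
pooled_sd <- sqrt(mean(group_sds^2))
cohens_d <- unname((group_means["Y"] - group_means["N"]) / pooled_sd)
cohens_d
```

With close to a million rows, even a tiny difference can come out statistically significant, so the effect size is usually the more informative number.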
data |>
group_by(sex) |>
summarise(mean_height = mean(height),
mean_weight = mean(weight))
## # A tibble: 2 × 3
## sex mean_height mean_weight
## <chr> <dbl> <dbl>
## 1 Female 155. 55.5
## 2 Male 169. 70.1
data |>
group_by(DRK_YN) |>
summarise(mean_hemoglobin = mean(hemoglobin),
median_hemoglobin = median(hemoglobin),
max_hemoglobin= max(hemoglobin))
## # A tibble: 2 × 4
## DRK_YN mean_hemoglobin median_hemoglobin max_hemoglobin
## <chr> <dbl> <dbl> <dbl>
## 1 N 13.8 13.7 25
## 2 Y 14.7 14.9 25
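Hemoglobin levels are higher for drinkers than for non-drinkers: the mean is 14.7 g/dL versus 13.8 g/dL and the median is 14.9 g/dL versus 13.7 g/dL. Because this is observational data and hemoglobin also differs by sex, the table does not show that drinking itself changes hemoglobin levels.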
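A comparison like the one above can be checked for confounding by repeating it within each sex. The following sketch uses simulated data in which hemoglobin depends on sex only and males are more likely to drink. Every number in it is an assumption made for illustration, and `sim`, `overall`, and `by_sex` are hypothetical names.

```r
library(dplyr)

set.seed(42)
n <- 20000

# Simulated data (assumption): hemoglobin depends on sex only, and males
# are more likely to drink. None of these numbers come from the dataset.
sim <- tibble(
  sex = sample(c("Male", "Female"), n, replace = TRUE),
  DRK_YN = ifelse(runif(n) < ifelse(sex == "Male", 0.7, 0.3), "Y", "N"),
  hemoglobin = rnorm(n, mean = ifelse(sex == "Male", 15, 13), sd = 1)
)

# Unstratified comparison: drinkers look higher purely because of the sex mix
overall <- sim |>
  group_by(DRK_YN) |>
  summarise(mean_hemoglobin = mean(hemoglobin))

# Stratified comparison: within each sex the gap largely disappears
by_sex <- sim |>
  group_by(sex, DRK_YN) |>
  summarise(mean_hemoglobin = mean(hemoglobin), .groups = "drop")

overall
by_sex
```

Applied to the real data, the same `group_by(sex, DRK_YN)` call would show whether the gap between drinkers and non-drinkers persists within each sex.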
data |>
group_by(SMK_stat_type_cd) |>
summarise(mean_cholestral = mean(tot_chole),
median_cholestral = median(tot_chole),
max_cholestral= max(tot_chole))
## # A tibble: 3 × 4
## SMK_stat_type_cd mean_cholestral median_cholestral max_cholestral
## <dbl> <dbl> <dbl> <dbl>
## 1 1 195. 193 2344
## 2 2 195. 194 2046
## 3 3 197. 195 2033
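Mean cholesterol is about the same for people who never smoked (1) and former smokers (2), roughly 195 mg/dL, and slightly higher for current smokers (3), about 197 mg/dL. The median rises in the same order, from 193 to 194 to 195. Note that SMK_stat_type_cd records smoking status, not smoking intensity, and that the differences are small. Interestingly, the maximum cholesterol level runs the other way, falling from 2344 for never-smokers to 2033 for current smokers. A maximum is a single extreme observation, so this most likely reflects outliers rather than an effect of smoking.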
data |>
group_by(DRK_YN) |>
summarise(mean_triglyceride = mean(triglyceride),
median_triglyceride = median(triglyceride),
max_triglyceride= max(triglyceride))
## # A tibble: 2 × 4
## DRK_YN mean_triglyceride median_triglyceride max_triglyceride
## <chr> <dbl> <dbl> <dbl>
## 1 N 121. 102 9490
## 2 Y 143. 112 6430
sex: male, female
age: rounded up to 5 years
height: rounded up to 5 cm [cm]
weight: [kg]
sight_left: eyesight (left)
sight_right: eyesight (right)
hear_left: hearing left, 1 (normal), 2 (abnormal)
hear_right: hearing right, 1 (normal), 2 (abnormal)
SBP: systolic blood pressure [mmHg]
DBP: diastolic blood pressure [mmHg]
BLDS: fasting blood glucose [mg/dL]
tot_chole: total cholesterol [mg/dL]
HDL_chole: HDL cholesterol [mg/dL]
LDL_chole: LDL cholesterol [mg/dL]
triglyceride: triglyceride [mg/dL]
hemoglobin: hemoglobin [g/dL]
urine_protein: protein in urine, 1 (-), 2 (+/-), 3 (+1), 4 (+2), 5 (+3), 6 (+4)
serum_creatinine: serum (blood) creatinine [mg/dL]
SGOT_AST: SGOT (glutamate-oxaloacetate transaminase), also called AST (aspartate transaminase) [IU/L]
SGOT_ALT: ALT (alanine transaminase) [IU/L]
gamma_GTP: γ-glutamyl transpeptidase [IU/L]
SMK_stat_type_cd: smoking state, 1 (never), 2 (used to smoke but quit), 3 (still smoke)
DRK_YN: drinker or not
My goal is to learn about the effects of drinking and smoking on the human body, and how they relate to factors such as hemoglobin, cholesterol, and protein levels. I hope this project will raise awareness.
ggplot(data, aes(x = height)) +
  geom_histogram(binwidth = 5, fill = "blue", color = "black") +
  labs(title = "Histogram of height", x = "Height (cm)", y = "Frequency")
This histogram shows the distribution of height for males and females
combined. The distribution is centered around 160 cm (median 160 cm,
mean 162.2 cm).
ggplot(data, aes(y = weight)) +
  geom_boxplot(fill = "lightblue", color = "blue") +
  labs(title = "Box plot of weight", y = "Weight (kg)")
This box plot shows the outliers in weight. The 25th and 75th
percentiles of weight are 55 kg and 70 kg, matching the summary of the
weight column above. Moreover, there are more outliers on the high side
than on the low side, which suggests that unusually heavy people are
more common in the data than unusually light ones.
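The points a box plot draws individually are those lying more than 1.5 times the interquartile range beyond the box. As a small worked example, the cutoffs can be computed from the quartiles reported by `summary(data$weight)` earlier in the notebook (55 and 70). The names `lower_fence` and `upper_fence` are illustrative, and `geom_boxplot()` works from hinges, which for a dataset this large should be very close to the quartiles.

```r
# Quartiles taken from summary(data$weight) earlier in the notebook
q1 <- 55
q3 <- 70
iqr <- q3 - q1

# Points beyond 1.5 * IQR from the box are drawn individually as outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
c(lower = lower_fence, upper = upper_fence)
# lower = 32.5, upper = 92.5
```

Under these cutoffs, weights below 32.5 kg or above 92.5 kg would be plotted as outliers.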
data |>
ggplot() +
geom_point(mapping = aes(x = weight, y = hemoglobin)) +
theme_classic()
It is interesting to observe that as weight increases, the minimum
hemoglobin also increases. Weight and hemoglobin appear to rise together
in a roughly linear pattern.
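A visual impression of a linear pattern can be backed up with a correlation coefficient and a fitted slope. The sketch below uses simulated values rather than the real columns: the intercept of 11, the slope of 0.05, and the noise level are assumptions picked only so that the example runs on its own.

```r
set.seed(7)

# Simulated stand-in for weight and hemoglobin (assumption: illustrative only)
weight <- rnorm(1000, mean = 63, sd = 12)
hemoglobin <- 11 + 0.05 * weight + rnorm(1000, sd = 1.3)

# Pearson correlation: strength of the linear association
r <- cor(weight, hemoglobin)

# Simple linear regression: estimated change in hemoglobin per kg of weight
fit <- lm(hemoglobin ~ weight)
slope <- coef(fit)[["weight"]]

c(correlation = r, slope = slope)
```

On the real data the equivalent calls would be `cor(data$weight, data$hemoglobin)` and `lm(hemoglobin ~ weight, data = data)`.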
# Flag people whose systolic and diastolic pressure are both elevated
data <- mutate(data, high_risk = (SBP > 140 & DBP > 90))
# Colors are set per layer, so no color scale is needed
ggplot(mapping = aes(x = height, y = weight)) +
  geom_point(data = filter(data, !high_risk), color = "darkblue") +
  geom_point(data = filter(data, high_risk), color = "yellow") +
  theme_classic()
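Yellow represents people at high risk, defined here as systolic pressure
greater than 140 mmHg together with diastolic pressure greater than
90 mmHg. This plot shows only height and weight, so it cannot tell us
whether most high-risk people are current smokers; checking that would
require tabulating high_risk against SMK_stat_type_cd.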
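Whether high-risk people are concentrated among smokers can be checked by computing the share of high-risk people within each smoking status. The sketch below uses a tiny made-up table, not the real data, and `toy` and `risk_by_smoking` are hypothetical names. The same pipeline would be applied to `data` in the notebook.

```r
library(dplyr)

# Tiny made-up example (assumption): the real notebook would use `data`
toy <- tibble(
  SBP = c(150, 120, 145, 130, 160, 118),
  DBP = c(95, 80, 92, 85, 100, 75),
  SMK_stat_type_cd = c(1, 1, 2, 3, 3, 3)
)

# Share of high-risk people within each smoking status
risk_by_smoking <- toy |>
  mutate(high_risk = SBP > 140 & DBP > 90) |>
  group_by(SMK_stat_type_cd) |>
  summarise(n = n(), share_high_risk = mean(high_risk))

risk_by_smoking
```

Comparing shares within each group, rather than raw counts, keeps the result from being dominated by whichever smoking status has the most people.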
# Flag low hemoglobin using sex-specific cutoffs
data <- mutate(data, high_risk1 = (hemoglobin < 13 & sex == "Male") |
  (hemoglobin < 12 & sex == "Female"))
# Colors are set per layer, so no color scale is needed
ggplot(mapping = aes(x = height, y = weight)) +
  geom_point(data = filter(data, !high_risk1), color = "darkblue") +
  geom_point(data = filter(data, high_risk1), color = "yellow") +
  theme_classic()
I have classified as high risk females with hemoglobin below 12 g/dL
and males with hemoglobin below 13 g/dL. One observation from both
graphs is that few people taller than 180 cm appear in either high-risk
group.
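The sex-specific hemoglobin rule described above can also be written as a small reusable function. This is only a sketch: `is_low_hemoglobin` is a hypothetical helper, not part of the original notebook, and it assumes `sex` takes only the values "Male" and "Female", as in this dataset.

```r
# Hypothetical helper: flags low hemoglobin using the cutoffs described above
# (below 13 for males, below 12 for females). Assumes sex is "Male" or "Female".
is_low_hemoglobin <- function(hemoglobin, sex) {
  cutoff <- ifelse(sex == "Male", 13, 12)
  hemoglobin < cutoff
}

is_low_hemoglobin(
  hemoglobin = c(12.5, 12.5, 11.9, 13.0),
  sex = c("Male", "Female", "Female", "Male")
)
# TRUE FALSE TRUE FALSE
```

Keeping the cutoffs in one place makes it easier to change them later without editing every plot.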