library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
dataset <- read_delim("C:/Users/Akshay Dembra/Downloads/Stats_Selected_Dataset/diabetes_binary_5050split_health_indicators_BRFSS2015_1.csv", delim = ",")
## Rows: 70692 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (22): Diabetes_binary, HighBP, HighChol, CholCheck, BMI, Smoker, Stroke,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dataset
Minimum, maximum, mean, median, and standard deviation of BMI.
min_ur <- min(dataset$BMI)
min_ur
## [1] 12
max_ur <- max(dataset$BMI)
max_ur
## [1] 98
mean_ur <- mean(dataset$BMI)
mean_ur
## [1] 29.85699
med_ur <- median(dataset$BMI)
med_ur
## [1] 29
sd_ur <- sd(dataset$BMI)
sd_ur
## [1] 7.113954
The minimum BMI value is 12, and the maximum BMI value is 98.
The mean BMI across the data set is approximately 29.86.
The median BMI is 29. It is the middle value of the BMI distribution and a measure of central tendency that is less sensitive to extreme values than the mean.
The standard deviation of BMI is approximately 7.11, reflecting the variability or dispersion of BMI values around the mean.
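The five statistics above can also be collected in a single named vector. The following is a minimal sketch on a small, made-up `bmi` vector (hypothetical values, not taken from the data set); with the real data, `dataset$BMI` would take its place.

```r
# Hypothetical BMI values for illustration only (not from the data set)
bmi <- c(22, 25, 27, 29, 31, 35, 48)

# Collect the summary statistics in one named vector
bmi_summary <- c(
  min    = min(bmi),
  max    = max(bmi),
  mean   = mean(bmi),
  median = median(bmi),
  sd     = sd(bmi)
)
bmi_summary

# A mean above the median hints at a right-skewed distribution
bmi_summary[["mean"]] > bmi_summary[["median"]]
```

In the real data the mean (29.86) is also above the median (29) and the maximum is 98, which is consistent with a longer right tail of high BMI values.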
Let’s find out how many people fall into each income category in our data set.
Income_1 <- dataset |>
  group_by(Income) |>
  summarise(num = n()) |>
  arrange(desc(Income))
Income_1
From the summary above we can see that many respondents fall in income category 8, while very few fall in income category 1.
The INCOME scale is an 8-point ordinal scale used to categorize respondents’ annual household income. The scale ranges from 1 to 8, with each number corresponding to a specific income bracket. Here’s the breakdown:
1: Less than $10,000
2: $10,000 to less than $15,000
3: $15,000 to less than $20,000
4: $20,000 to less than $25,000
5: $25,000 to less than $35,000
6: $35,000 to less than $50,000
7: $50,000 to less than $75,000
8: $75,000 or more
This scale provides a general overview of household income levels,
allowing for data to be segmented into different income categories for
analysis.
[link for column descriptions: https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset/data?select=diabetes_binary_5050split_health_indicators_BRFSS2015.csv]
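The income codes can be made easier to read in tables and plots by converting them to an ordered factor. The sketch below uses the bracket labels listed above and a short, made-up `income_codes` vector (hypothetical values, not from the data set); `income_labels` and `income_factor` are illustrative names.

```r
# Income codes 1-8 and their brackets, as listed above
income_labels <- c(
  "Less than $10,000",
  "$10,000 to less than $15,000",
  "$15,000 to less than $20,000",
  "$20,000 to less than $25,000",
  "$25,000 to less than $35,000",
  "$35,000 to less than $50,000",
  "$50,000 to less than $75,000",
  "$75,000 or more"
)

# Hypothetical income codes for illustration only (not from the data set)
income_codes <- c(8, 1, 5, 8, 3)

# Convert the numeric codes to an ordered factor with readable labels
income_factor <- factor(
  income_codes,
  levels  = 1:8,
  labels  = income_labels,
  ordered = TRUE
)
income_factor

# Count respondents per bracket, including brackets with zero rows
table(income_factor)
```

With the real data, `dataset$Income` would replace `income_codes`. Keeping the factor ordered preserves the ordinal nature of the scale when sorting or plotting.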
Related questions observed from the data set in the
initial exploration would be:
# Aggregation example: BMI summary statistics for each physical activity group
aggregated_data <- dataset |>
  group_by(PhysActivity) |>
  summarise(
    mean_value = mean(BMI),
    min_value = min(BMI),
    max_value = max(BMI),
    median_value = median(BMI)
  )
print(aggregated_data)
## # A tibble: 2 × 5
## PhysActivity mean_value min_value max_value median_value
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 31.7 13 98 30
## 2 1 29.1 12 98 28
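The analysis of physical activity and the corresponding BMI statistics reveals the following trends: respondents who report no physical activity (PhysActivity = 0) have a mean BMI of 31.7 and a median of 30, while those who report physical activity (PhysActivity = 1) have a mean of 29.1 and a median of 28. The maximum BMI is 98 in both groups.
This indicates that being physically active is associated with a lower mean and median BMI (by about 2.6 and 2 points, respectively), suggesting a potential relationship between physical activity and healthier BMI levels.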
# Histogram for "Poor Mental Health(1-30)"
# Refer to the column by name inside aes() rather than via dataset$...
ggplot(dataset, aes(x = MentHlth)) +
  geom_histogram(binwidth = 1, fill = "purple", color = "black") +
  labs(title = "Number of Days People Suffer from Mental Health Issues", x = "Poor Mental Health(1-30)", y = "Frequency")
Let’s examine the impact of heavy alcohol consumption on BMI. We will analyze BMI data specifically for individuals who consume alcohol heavily.
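One way to restrict the analysis to heavy drinkers is sketched below. It assumes the data set has a 0/1 indicator column named `HvyAlcoholConsump` (an assumption, since the column list is truncated in the output above) and uses a small, made-up `toy` data frame rather than the real data.

```r
# Hypothetical rows for illustration only (not from the data set);
# HvyAlcoholConsump is assumed to be a 0/1 indicator column
toy <- data.frame(
  HvyAlcoholConsump = c(0, 0, 1, 1, 0, 1),
  BMI               = c(30, 28, 26, 27, 33, 25)
)

# Keep only the rows flagged as heavy alcohol consumers
heavy <- subset(toy, HvyAlcoholConsump == 1)

# Summarise BMI within that subgroup
heavy_mean <- mean(heavy$BMI)
heavy_median <- median(heavy$BMI)
c(n = nrow(heavy), mean = heavy_mean, median = heavy_median)
```

With the real data, `dataset` would replace `toy`, and the subgroup statistics could be compared against the overall BMI mean and median reported earlier.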
# Load the ggplot2 library
library(ggplot2)
ggplot(dataset, aes(x = factor(Diabetes_binary), y = MentHlth)) +
  geom_boxplot() +
  labs(
    title = "Mental Health by Diabetes Binary",
    x = "Diabetes (0 = No, 1 = Yes)",
    y = "Days of Poor Mental Health"
  ) +
  theme_minimal()