Week 2 - Data Dive Summaries

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading my data set into R and storing it in the “dataset” data frame.

dataset <- read_delim("C:/Users/Akshay Dembra/Downloads/Stats_Selected_Dataset/diabetes_binary_5050split_health_indicators_BRFSS2015_1.csv", delim = ",")

## Rows: 70692 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (22): Diabetes_binary, HighBP, HighChol, CholCheck, BMI, Smoker, Stroke,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

dataset

Finding 1. Minimum and maximum values of BMI.

Mean and median of BMI.

Standard deviation of BMI.

min_ur <- min(dataset$BMI)
min_ur

## [1] 12

max_ur <- max(dataset$BMI)
max_ur

## [1] 98

mean_ur <- mean(dataset$BMI)
mean_ur

## [1] 29.85699

med_ur <- median(dataset$BMI)
med_ur

## [1] 29

sd_ur <- sd(dataset$BMI)
sd_ur

## [1] 7.113954

The minimum BMI value is 12, and the maximum BMI value is 98.
The mean BMI is approximately 29.86.
The median BMI is 29, the middle value of the BMI distribution and a measure of central tendency. The mean sits slightly above the median, which is consistent with a few very high BMI values pulling the mean upward.
The standard deviation of BMI is approximately 7.11, reflecting the dispersion of BMI values around the mean.
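The five statistics above can also be collected in one step. This is a minimal sketch, assuming a small made-up vector `bmi_example` (hypothetical values, not taken from the data set) in place of `dataset$BMI`:

```r
# Hypothetical BMI values for illustration only; not from the BRFSS data.
bmi_example <- c(22, 25, 27, 29, 31, 34, 45)

# Collect the same five statistics in a single named vector.
bmi_stats <- c(
  min    = min(bmi_example),
  max    = max(bmi_example),
  mean   = mean(bmi_example),
  median = median(bmi_example),
  sd     = sd(bmi_example)
)

print(round(bmi_stats, 2))
```

Replacing `bmi_example` with `dataset$BMI` should give the same values as the separate calls above.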
Let’s find out how many people fall into each income range in our data set.
```
Income_1 <- dataset |>
  group_by(Income) |>
  summarise(num=n()) |>
  arrange(desc(Income))

Income_1
```
From the data previewed above, we can see that many people fall under income scale 8 and very few under income scale 1.
The Income column uses an 8-point ordinal scale to categorize respondents’ annual household income. The scale ranges from 1 to 8, with each number corresponding to a specific income bracket. Here’s the breakdown:

1: Less than $10,000

2: $10,000 to less than $15,000

3: $15,000 to less than $20,000

4: $20,000 to less than $25,000

5: $25,000 to less than $35,000

6: $35,000 to less than $50,000

7: $50,000 to less than $75,000

8: $75,000 or more

This scale provides a general overview of household income levels, allowing for data to be segmented into different income categories for analysis.
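The numeric codes can be turned into readable labels before counting. This is a minimal sketch, assuming a short made-up vector `income_example` (hypothetical codes, not taken from the data set) in place of `dataset$Income`; the labels are copied from the breakdown above:

```r
# Bracket labels copied from the scale listed above.
income_labels <- c(
  "Less than $10,000",
  "$10,000 to less than $15,000",
  "$15,000 to less than $20,000",
  "$20,000 to less than $25,000",
  "$25,000 to less than $35,000",
  "$35,000 to less than $50,000",
  "$50,000 to less than $75,000",
  "$75,000 or more"
)

# Hypothetical income codes for illustration only; not from the BRFSS data.
income_example <- c(8, 1, 5, 8, 3, 8, 5)

# Convert the numeric codes to an ordered factor with readable labels.
income_factor <- factor(income_example, levels = 1:8,
                        labels = income_labels, ordered = TRUE)

# Count respondents per bracket; brackets with no respondents show a count of 0.
income_counts <- table(income_factor)
print(income_counts)
```

Using an ordered factor keeps the brackets in their natural order in tables and plots instead of sorting the labels alphabetically.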

[Link for column descriptions: https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset/data?select=diabetes_binary_5050split_health_indicators_BRFSS2015.csv]

Related questions raised by the initial exploration of the data set:

How do a person’s demographics relate to diabetes?
How does mental health affect a person’s BMI?
How do varying levels of physical activity, ranging from minimal to maximum, correlate with BMI values across the data set?

# Aggregation example: summary statistics of BMI for each physical-activity group
aggregated_data <- dataset |>
  group_by(PhysActivity) |>
  summarise(
    mean_value = mean(BMI),
    min_value = min(BMI),
    max_value = max(BMI),
    median_value = median(BMI)
  )
print(aggregated_data)

## # A tibble: 2 × 5
##   PhysActivity mean_value min_value max_value median_value
##          <dbl>      <dbl>     <dbl>     <dbl>        <dbl>
## 1            0       31.7        13        98           30
## 2            1       29.1        12        98           28

The analysis of physical activity and the corresponding BMI statistics reveals the following trends:

PhysActivity records physical activity in the past 30 days, not including job (0 = no, 1 = yes).
For individuals with a physical activity value of 0, the mean BMI is 31.73, with a minimum of 13, a maximum of 98, and a median of 30.
For those with a physical activity value of 1, the mean BMI is lower at 29.07, with a minimum of 12, a maximum of 98, and a median of 28.

This indicates that people who reported physical activity have a slightly lower mean and median BMI than those who did not, suggesting a potential relationship between physical activity and healthier BMI levels. The comparison shows an association only; it does not show that activity causes the lower BMI.
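The size of the gap between the two groups can be computed directly. This is a minimal sketch, assuming two short made-up vectors `active_example` and `bmi_example` (hypothetical values, not taken from the data set) in place of `dataset$PhysActivity` and `dataset$BMI`:

```r
# Hypothetical values for illustration only; not from the BRFSS data.
active_example <- c(0, 0, 0, 0, 1, 1, 1, 1)
bmi_example    <- c(30, 33, 31, 34, 27, 29, 28, 30)

# Mean BMI within each physical-activity group (names are "0" and "1").
group_means <- tapply(bmi_example, active_example, mean)

# Difference in mean BMI: inactive group minus active group.
mean_gap <- group_means[["0"]] - group_means[["1"]]

print(group_means)
print(mean_gap)
```

With the group means reported above, the same subtraction gives 31.73 - 29.07 = 2.66 BMI points.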

Let’s examine the distribution of days of poor mental health, reported as a count from 0 to 30 days, using a histogram. The histogram shows how many individuals reported each number of poor mental health days.

# Histogram of days of poor mental health in the past 30 days
ggplot(dataset, aes(x = MentHlth)) +
  geom_histogram(binwidth = 1, fill = "purple", color = "black") +
  labs(title = "Number of Days People Suffer from Mental Health Issues", x = "Days of Poor Mental Health (0-30)", y = "Frequency")

The histogram shows that more individuals report 0 days of poor mental health within the month than any other duration. This suggests that while mental health challenges do exist, most people in the data set do not frequently encounter prolonged periods of poor mental health.
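The share of people reporting 0 days can be checked numerically rather than read off the plot. This is a minimal sketch, assuming a short made-up vector `ment_example` (hypothetical values, not taken from the data set) in place of `dataset$MentHlth`:

```r
# Hypothetical counts of poor mental health days; not from the BRFSS data.
ment_example <- c(0, 0, 0, 0, 2, 5, 0, 30, 0, 10)

# Share of respondents reporting zero days of poor mental health.
share_zero <- mean(ment_example == 0)

# Frequency of each reported value, the same counts a binwidth-1 histogram draws.
day_counts <- table(ment_example)

print(share_zero)
print(day_counts)
```

A share above 0.5 on the real column would confirm that most respondents report no poor mental health days.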

Let’s examine how days of poor mental health differ by diabetes status. We will compare the distribution of poor mental health days for individuals with and without diabetes using a box plot.

# Load the ggplot2 library (already attached by library(tidyverse); loading it again is harmless)
library(ggplot2)

ggplot(dataset, aes(x = factor(Diabetes_binary), y = MentHlth)) +
  geom_boxplot() +
  labs(
    title = "Mental Health by Diabetes Binary",
    x = "Diabetes (0 = No, 1 = Yes)",
    y = "Days of Poor Mental Health"
  ) +
  theme_minimal()

The box plot shows distinct differences in the distribution of poor mental health days between individuals with and without diabetes. This points to a possible association between diabetes status and mental health, suggesting that those with diabetes may experience different levels of mental health challenges. The plot alone does not establish the strength of that association or what causes it.
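The quantities a box plot draws can also be tabulated for each group. This is a minimal sketch, assuming a small made-up data frame `example` (hypothetical values, not taken from the data set) with the same column names as `dataset`:

```r
# Hypothetical data for illustration only; not from the BRFSS data.
example <- data.frame(
  Diabetes_binary = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1),
  MentHlth        = c(0, 0, 1, 2, 5, 0, 2, 4, 10, 30)
)

# Median days of poor mental health in each diabetes group (the box's centre line).
group_medians <- tapply(example$MentHlth, example$Diabetes_binary, median)

# Interquartile range in each group (the height of the box).
group_iqrs <- tapply(example$MentHlth, example$Diabetes_binary, IQR)

print(group_medians)
print(group_iqrs)
```

Comparing the medians and interquartile ranges side by side makes it easier to state how the two groups differ than describing the plot alone.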