This data set was downloaded from Kaggle. “Originally, the dataset come from the CDC and is a major part of the Behavioral Risk Factor Surveillance System (BRFSS), which conducts annual telephone surveys to gather data on the health status of U.S. residents.”

Data source: https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease?resource=download&select=heart_2020_cleaned.csv

The data set was already cleaned before download: rows with missing records had been removed.

This is a very large data set with 18 variables and 319,795 observations.

Here’s a list of the variables and what they represent:

HeartDisease: heart disease (yes/no)

BMI: Body mass Index

Smoking: smokes (yes/no)

AlcoholDrinking: consumes alcohol (yes/no)

Stroke: suffered a stroke (yes/no)

PhysicalHealth: number of days in the past 30 with poor physical health (0-30, 30 being the worst)

MentalHealth: number of days in the past 30 with poor mental health (0-30, 30 being the worst)

DiffWalking: Having difficulty walking (yes/no)

Sex: sex of the respondent (Female/Male)

AgeCategory: age range (for example, 55-59 or 80 or older)

Race: race

Diabetic: diabetic (yes/no)

PhysicalActivity: physically active (yes/no)

GenHealth: general condition of health

SleepTime: hours of sleep per day

Asthma: Asthma (yes/no)

KidneyDisease: Kidney disease (yes/no)

SkinCancer: Skin cancer (yes/no)

Loading necessary libraries

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(dbplyr)

## 
## Attaching package: 'dbplyr'

## The following objects are masked from 'package:dplyr':
## 
##     ident, sql

library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

Set working directory and read data set into R

setwd("~/Data101/Project 3a")
dataheart <- read_csv("heart_2020_cleaned.csv")

## Rows: 319795 Columns: 18

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (14): HeartDisease, Smoking, AlcoholDrinking, Stroke, DiffWalking, Sex, ...
## dbl  (4): BMI, PhysicalHealth, MentalHealth, SleepTime

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Let’s take a look at our data structure

str(dataheart)

## spec_tbl_df [319,795 x 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ HeartDisease    : chr [1:319795] "No" "No" "No" "No" ...
##  $ BMI             : num [1:319795] 16.6 20.3 26.6 24.2 23.7 ...
##  $ Smoking         : chr [1:319795] "Yes" "No" "Yes" "No" ...
##  $ AlcoholDrinking : chr [1:319795] "No" "No" "No" "No" ...
##  $ Stroke          : chr [1:319795] "No" "Yes" "No" "No" ...
##  $ PhysicalHealth  : num [1:319795] 3 0 20 0 28 6 15 5 0 0 ...
##  $ MentalHealth    : num [1:319795] 30 0 30 0 0 0 0 0 0 0 ...
##  $ DiffWalking     : chr [1:319795] "No" "No" "No" "No" ...
##  $ Sex             : chr [1:319795] "Female" "Female" "Male" "Female" ...
##  $ AgeCategory     : chr [1:319795] "55-59" "80 or older" "65-69" "75-79" ...
##  $ Race            : chr [1:319795] "White" "White" "White" "White" ...
##  $ Diabetic        : chr [1:319795] "Yes" "No" "Yes" "No" ...
##  $ PhysicalActivity: chr [1:319795] "Yes" "Yes" "Yes" "No" ...
##  $ GenHealth       : chr [1:319795] "Very good" "Very good" "Fair" "Good" ...
##  $ SleepTime       : num [1:319795] 5 7 8 6 8 12 4 9 5 10 ...
##  $ Asthma          : chr [1:319795] "Yes" "No" "Yes" "No" ...
##  $ KidneyDisease   : chr [1:319795] "No" "No" "No" "No" ...
##  $ SkinCancer      : chr [1:319795] "Yes" "No" "No" "Yes" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   HeartDisease = col_character(),
##   ..   BMI = col_double(),
##   ..   Smoking = col_character(),
##   ..   AlcoholDrinking = col_character(),
##   ..   Stroke = col_character(),
##   ..   PhysicalHealth = col_double(),
##   ..   MentalHealth = col_double(),
##   ..   DiffWalking = col_character(),
##   ..   Sex = col_character(),
##   ..   AgeCategory = col_character(),
##   ..   Race = col_character(),
##   ..   Diabetic = col_character(),
##   ..   PhysicalActivity = col_character(),
##   ..   GenHealth = col_character(),
##   ..   SleepTime = col_double(),
##   ..   Asthma = col_character(),
##   ..   KidneyDisease = col_character(),
##   ..   SkinCancer = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

2- Summary statistics: mean, median, mode, min, max, count and standard deviation for 2-3 key quantitative variables.

Some of these summary statistics may not be appropriate for every variable in this data set.

However, we will compute them for the two quantitative variables BMI and SleepTime.

Summary statistics for BMI

mean(dataheart$BMI)

## [1] 28.3254

median(dataheart$BMI)

## [1] 27.34

#dataheart%>% count(BMI)

min(dataheart$BMI)

## [1] 12.02

max(dataheart$BMI)

## [1] 94.85

sd(dataheart$BMI)

## [1] 6.3561

dataheart%>% count(Race)

## # A tibble: 6 x 2
##   Race                                n
##   <chr>                           <int>
## 1 American Indian/Alaskan Native   5202
## 2 Asian                            8068
## 3 Black                           22939
## 4 Hispanic                        27446
## 5 Other                           10928
## 6 White                          245212

Summary statistics for SleepTime

mean(dataheart$SleepTime)

## [1] 7.097075

median(dataheart$SleepTime)

## [1] 7

min(dataheart$SleepTime)

## [1] 1

max(dataheart$SleepTime)

## [1] 24

sd(dataheart$SleepTime)

## [1] 1.436007

dataheart%>% count(SleepTime)

## # A tibble: 24 x 2
##    SleepTime     n
##        <dbl> <int>
##  1         1   551
##  2         2   788
##  3         3  1992
##  4         4  7750
##  5         5 19184
##  6         6 66721
##  7         7 97751
##  8         8 97602
##  9         9 16041
## 10        10  7796
## # ... with 14 more rows
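
The individual calls above can also be collected into a single table. This is a minimal sketch, not the code used in this report: it assumes a toy data frame (hypothetical name `toy`) built from the first five BMI and SleepTime values as displayed in the str() output, so that it runs without the CSV file.

```r
# Toy data: first five BMI and SleepTime values as shown by str(dataheart)
# (BMI values are rounded as displayed; this is NOT the full data set)
toy <- data.frame(
  BMI       = c(16.6, 20.3, 26.6, 24.2, 23.7),
  SleepTime = c(5, 7, 8, 6, 8)
)

# Apply the same five statistics to every column at once;
# sapply() returns a matrix with one row per statistic
stats <- sapply(toy, function(x) {
  c(mean = mean(x), median = median(x), min = min(x), max = max(x), sd = sd(x))
})

round(stats, 2)
```

Replacing `toy` with `dataheart[, c("BMI", "SleepTime")]` should give the same figures as the separate calls above, in one table.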

To calculate the mode, we will copy the data set and convert the two variables to factors.

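# function for mode: returns the most frequent level of a factor
cat_mode <- function(cat_var){
  mode_idx <- which.max(table(cat_var))  # position of the highest count
  levels(cat_var)[mode_idx]              # ties resolve to the first such level
}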

# new data set
newdataheart <- dataheart

# conversion to factor
newdataheart$BMI <- as.factor(newdataheart$BMI)
newdataheart$SleepTime <- as.factor(newdataheart$SleepTime)

# Mode calculation
cat_mode(newdataheart$SleepTime)

## [1] "7"

cat_mode(newdataheart$BMI)

## [1] "26.63"
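
The factor conversion above is one way to get the mode. As an alternative, here is a minimal sketch of a helper (hypothetical name `num_mode`) that works directly on a numeric vector, shown on the first ten SleepTime values from the str() output rather than on the full data set.

```r
# Hypothetical helper: mode of a numeric vector without converting it to a factor
num_mode <- function(x) {
  counts <- table(x)                            # frequency of each distinct value
  as.numeric(names(counts)[which.max(counts)])  # value with the highest count
}

# First ten SleepTime values as shown by str(dataheart)
sleep <- c(5, 7, 8, 6, 8, 12, 4, 9, 5, 10)

# 5 and 8 both appear twice; which.max() keeps the first, so the smaller value wins
num_mode(sleep)
```

When several values tie for the highest count, this sketch returns only the smallest of them, which is the same tie behavior as `cat_mode()`.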

3- A frequency distribution and relative frequency distribution for a key categorical variable.

table(dataheart$HeartDisease)

## 
##     No    Yes 
## 292422  27373

dataheart2 <- dataheart%>%
  group_by(HeartDisease)%>%
  summarise(Freq=n())%>%
  mutate(Relative_Freq=Freq/sum(Freq))
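
The relative frequencies follow directly from the counts printed by table() above. As a worked check (a sketch that hard-codes those two counts instead of reading the CSV):

```r
# Counts copied from the table(dataheart$HeartDisease) output
freq <- c(No = 292422, Yes = 27373)

# Relative frequency = count / total number of observations
rel_freq <- freq / sum(freq)

round(rel_freq, 4)
```

This gives roughly 0.914 for "No" and 0.086 for "Yes", so about 8.6% of respondents report heart disease.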

4 - A contingency table for two categorical variables.

table(dataheart$HeartDisease, dataheart$Sex)

##      
##       Female   Male
##   No  156571 135851
##   Yes  11234  16139

5 - 1 bar graph and 1 pie chart for two of your categorical variables. (A bar graph for one variable and a pie chart for the second variable.) Label your plots!

Bar graph of heart disease

bargraph <- dataheart%>%
  ggplot(aes(Sex,fill=HeartDisease))+
  geom_bar(position = "dodge", alpha=0.5)+
  facet_wrap(~HeartDisease)+
  theme_bw()+
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
  labs(title = "Gender and Heart Disease", x= "Gender", y="Number")
  
bargraph <- ggplotly(bargraph)
bargraph

Pie chart for BMI

For our pie chart we will create a new data set and calculate the average BMI by race among respondents with heart disease.

piechart <- dataheart %>%
  group_by(Race)%>%
  filter(HeartDisease=="Yes")%>%
  summarise(Av_BMI = mean(BMI))

# one color per race (one slice per row of piechart)
pie(piechart$Av_BMI, labels = piechart$Race, main = "Average BMI of People with Heart Disease", sub = "By Race", col = rainbow(nrow(piechart)), clockwise = TRUE, init.angle = 0)

6 - 2 histograms and 2 boxplots for two quantitative variables. (Both a histogram and boxplot for each variable.) Label your plots!

Histograms

#hist(dataheart$SleepTime)

ggplot(dataheart,aes(SleepTime))+
  geom_histogram(color="white")+
  labs(y = "Frequency", x = "Sleep Time", title = "Histogram of Sleep Time")+
  theme_minimal()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
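
The message above appears because no bin width was chosen. SleepTime is recorded in whole hours, so one bin per hour is a natural choice. The sketch below shows the idea in base R on the first ten SleepTime values from the str() output (not the full data set); in ggplot2 the corresponding setting would be `geom_histogram(binwidth = 1)`.

```r
# First ten SleepTime values as shown by str(dataheart)
sleep <- c(5, 7, 8, 6, 8, 12, 4, 9, 5, 10)

# Breaks at the half-hours so that each whole hour sits in the middle of its own bin
h <- hist(sleep, breaks = seq(0.5, 24.5, by = 1), plot = FALSE)

# One count per hour from 1 to 24
data.frame(hours = h$mids, count = h$counts)
```

With `plot = FALSE` the call only returns the bin counts, which makes it easy to check the binning before drawing the chart.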

hist(dataheart$BMI, main = "Histogram of BMI", xlim = c(10,70),xlab="BMI")

Box plots

ggplot(dataheart,aes(GenHealth, SleepTime, col=GenHealth))+
  stat_boxplot(geom = "errorbar", width = 0.2)+
  geom_boxplot()+
  labs(y = "Sleep Time", x = "General Health", title = "Boxplot by General Health")+
  theme_bw()

ggplot(dataheart,aes(AgeCategory, BMI, col=AgeCategory))+
  stat_boxplot(geom = "errorbar", width = 0.2)+
  geom_boxplot()+
  facet_wrap(~Sex)+
  labs(y = "BMI", x = "Age Category", title = "Boxplot by Age Category")+
  theme_bw()

7- A two paragraph summary about what this information (answers to parts 1-6) is telling you about your data. This answer is just text – no code needed.

The average BMI of the survey participants is about 28.33. Assuming that the data were randomly collected, this suggests that the population is, on average, slightly overweight: according to MedicalNewsToday.com, “a BMI of 25–29.9 indicates that a person is slightly overweight.” The boxplots also suggest that, for both men and women, the age groups between 35 and 65 tend to have a higher BMI.

Additionally, the standard deviation of sleep time is low (about 1.44 hours), indicating that most respondents report a similar amount of sleep per day. Another observation, from the bar chart and the contingency table, is that men in this sample are more likely to have heart disease than women. In fact, while there were slightly more women in the study, 16,139 men were shown to have heart disease compared to 11,234 women (about 10.6% of men versus 6.7% of women). More investigation of the data is needed to find out the key risk factors for heart disease; however, our pie chart seems to indicate that a high BMI may be a factor, although it does not compare against respondents without heart disease.

Project 3a

Daniel Lachaud

4/8/2022
