This data set was downloaded from Kaggle. “Originally, the dataset come from the CDC and is a major part of the Behavioral Risk Factor Surveillance System (BRFSS), which conducts annual telephone surveys to gather data on the health status of U.S. residents.”
The data set is clean to begin with. Rows with missing records were deleted.
This is a very large data set with 18 variables and 319,795 observations.
Here’s a list of the variables and what they represent:
HeartDisease: heart disease (yes/no)
BMI: Body Mass Index
Smoking: smokes (yes/no)
AlcoholDrinking: consumes alcohol (yes/no)
Stroke: suffered a stroke (yes/no)
PhysicalHealth: scale of 0-30 (30 being the worst)
MentalHealth: scale of 0-30 (30 being the worst)
DiffWalking: Having difficulty walking (yes/no)
Sex: Gender
AgeCategory: age range (e.g., 55-59)
Race: race
Diabetic: diabetic (yes/no)
PhysicalActivity: engages in physical activity (yes/no)
GenHealth: general condition of health
SleepTime: Amount of sleep per day
Asthma: Asthma (yes/no)
KidneyDisease: Kidney disease (yes/no)
SkinCancer: Skin cancer (yes/no)
Loading necessary libraries
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dbplyr)
##
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
##
## ident, sql
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
Set working directory and read data set into R
setwd("~/Data101/Project 3a")
dataheart <- read_csv("heart_2020_cleaned.csv")
## Rows: 319795 Columns: 18
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (14): HeartDisease, Smoking, AlcoholDrinking, Stroke, DiffWalking, Sex, ...
## dbl (4): BMI, PhysicalHealth, MentalHealth, SleepTime
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Let’s take a look at our data structure
str(dataheart)
## spec_tbl_df [319,795 x 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ HeartDisease : chr [1:319795] "No" "No" "No" "No" ...
## $ BMI : num [1:319795] 16.6 20.3 26.6 24.2 23.7 ...
## $ Smoking : chr [1:319795] "Yes" "No" "Yes" "No" ...
## $ AlcoholDrinking : chr [1:319795] "No" "No" "No" "No" ...
## $ Stroke : chr [1:319795] "No" "Yes" "No" "No" ...
## $ PhysicalHealth : num [1:319795] 3 0 20 0 28 6 15 5 0 0 ...
## $ MentalHealth : num [1:319795] 30 0 30 0 0 0 0 0 0 0 ...
## $ DiffWalking : chr [1:319795] "No" "No" "No" "No" ...
## $ Sex : chr [1:319795] "Female" "Female" "Male" "Female" ...
## $ AgeCategory : chr [1:319795] "55-59" "80 or older" "65-69" "75-79" ...
## $ Race : chr [1:319795] "White" "White" "White" "White" ...
## $ Diabetic : chr [1:319795] "Yes" "No" "Yes" "No" ...
## $ PhysicalActivity: chr [1:319795] "Yes" "Yes" "Yes" "No" ...
## $ GenHealth : chr [1:319795] "Very good" "Very good" "Fair" "Good" ...
## $ SleepTime : num [1:319795] 5 7 8 6 8 12 4 9 5 10 ...
## $ Asthma : chr [1:319795] "Yes" "No" "Yes" "No" ...
## $ KidneyDisease : chr [1:319795] "No" "No" "No" "No" ...
## $ SkinCancer : chr [1:319795] "Yes" "No" "No" "Yes" ...
## - attr(*, "spec")=
## .. cols(
## .. HeartDisease = col_character(),
## .. BMI = col_double(),
## .. Smoking = col_character(),
## .. AlcoholDrinking = col_character(),
## .. Stroke = col_character(),
## .. PhysicalHealth = col_double(),
## .. MentalHealth = col_double(),
## .. DiffWalking = col_character(),
## .. Sex = col_character(),
## .. AgeCategory = col_character(),
## .. Race = col_character(),
## .. Diabetic = col_character(),
## .. PhysicalActivity = col_character(),
## .. GenHealth = col_character(),
## .. SleepTime = col_double(),
## .. Asthma = col_character(),
## .. KidneyDisease = col_character(),
## .. SkinCancer = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
3 - A frequency distribution and relative frequency distribution for a key categorical variable.
table(dataheart$HeartDisease)
##
## No Yes
## 292422 27373
# Frequency and relative frequency of heart disease
dataheart2 <- dataheart%>%
group_by(HeartDisease)%>%
summarise(Freq=n())%>%
mutate(Relative_Freq=Freq/sum(Freq))
dataheart2 # display the summary table
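As a quick cross-check, the relative frequencies can also be derived directly from the counts printed by table() above. The sketch below re-enters those two counts by hand (the names hd_counts and hd_rel are ours, not part of the original analysis) and uses base R's prop.table().

```r
# Counts copied from the table(dataheart$HeartDisease) output above
hd_counts <- c(No = 292422, Yes = 27373)

# Relative frequency = each count divided by the total
hd_rel <- prop.table(hd_counts)
round(hd_rel, 4)
```

About 8.6% of respondents report heart disease, so the two classes are strongly imbalanced.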
4 - A contingency table for two categorical variables.
table(dataheart$HeartDisease, dataheart$Sex)
##
## Female Male
## No 156571 135851
## Yes 11234 16139
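Because the survey includes more women than men, the raw counts are hard to compare directly. One way to read the contingency table is to convert it to column proportions, that is, the share of each sex with and without heart disease. The sketch below rebuilds the table from the counts printed above (the object names hd_by_sex and sex_rates are illustrative) and applies prop.table() with margin = 2.

```r
# Rebuild the contingency table from the printed counts.
# matrix() fills column by column, so each pair below is one Sex column.
hd_by_sex <- matrix(
  c(156571, 11234,   # Female: No, Yes
    135851, 16139),  # Male:   No, Yes
  nrow = 2,
  dimnames = list(HeartDisease = c("No", "Yes"), Sex = c("Female", "Male"))
)

# margin = 2 divides each cell by its column (Sex) total
sex_rates <- prop.table(hd_by_sex, margin = 2)
round(sex_rates, 3)
```

With these counts, roughly 6.7% of women and 10.6% of men report heart disease.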
5 - 1 bar graph and 1 pie chart for two of your categorical variables. (A bar graph for one variable and a pie chart for the second variable.) Label your plots!
Bar graph of heart disease
bargraph <- dataheart%>%
ggplot(aes(Sex,fill=HeartDisease))+
geom_bar(position = "dodge", alpha=0.5)+
facet_wrap(~HeartDisease)+
theme_bw()+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
labs(title = "Gender and Heart Disease", x= "Gender", y="Number")
bargraph <- ggplotly(bargraph)
bargraph
Pie chart for BMI
For our pie chart, we will create a new data set and calculate the average BMI by race among people with heart disease
piechart <- dataheart %>%
filter(HeartDisease=="Yes")%>% # keep only people with heart disease
group_by(Race)%>%
summarise(Av_BMI = mean(BMI))
# one color per race in the summary table
pie(piechart$Av_BMI, labels = piechart$Race, main = "Average BMI of People with Heart Disease", sub="By Race", col = rainbow(nrow(piechart)), clockwise = TRUE, init.angle = 0)

6 - 2 histograms and 2 boxplots for two quantitative variables. (Both a histogram and boxplot for each variable.) Label your plots!
Histograms
ggplot(dataheart,aes(SleepTime))+
geom_histogram(binwidth = 1, color="white")+ # one bin per reported hour of sleep
labs(y = "Frequency", x = "Sleep Time", title = "Histogram of Sleep Time")+
theme_minimal()

hist(dataheart$BMI, main = "Histogram of BMI", xlim = c(10,70),xlab="BMI")

Box plots
ggplot(dataheart,aes(GenHealth, SleepTime, col=GenHealth))+
stat_boxplot(geom = "errorbar", width = 0.2)+
geom_boxplot()+
labs(y = "Sleep Time", x = "General Health", title = "Sleep Time by General Health")+
theme_bw()
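By default, ggplot2 orders a character variable alphabetically, so the General Health boxes do not run from best to worst. A possible fix is to convert the column to a factor with explicit levels before plotting. This sketch assumes the five categories are Excellent, Very good, Good, Fair and Poor (only three of them are visible in the str() output above) and uses a small made-up vector in place of dataheart$GenHealth.

```r
# Hypothetical stand-in for dataheart$GenHealth
gen_health <- c("Very good", "Very good", "Fair", "Good", "Poor", "Excellent")

# Assumed category names, ordered from best to worst
health_levels <- c("Excellent", "Very good", "Good", "Fair", "Poor")

# Factor levels control the order in which ggplot2 draws the boxes
gen_health_f <- factor(gen_health, levels = health_levels)
levels(gen_health_f)
table(gen_health_f)
```

In the full analysis, the same factor() call could be applied to the GenHealth column inside mutate() before the ggplot() call.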

ggplot(dataheart,aes(AgeCategory, BMI, col=AgeCategory))+
stat_boxplot(geom = "errorbar", width = 0.2)+
geom_boxplot()+
facet_wrap(~Sex)+
labs(y = "BMI", x = "Age Category", title = "BMI by Age Category and Sex")+
theme_bw()

7 - A two-paragraph summary about what this information (answers to parts 1-6) is telling you about your data. This answer is just text – no code needed.
The average BMI of the survey participants is about 28.32. Assuming that the data were randomly collected, this suggests that the population is, on average, slightly overweight. According to MedicalNewsToday.com, “a BMI of 25–29.9 indicates that a person is slightly overweight.” The data also suggest that, for both men and women, the age groups between 35 and 65 tend to have a higher BMI.
Additionally, the standard deviation of sleep time is very low, indicating that most respondents report a similar amount of sleep per day. Another observation, from the bar chart we created, is that men are more likely to suffer from heart disease than women. In fact, although there were slightly more women in the study, 16,139 men reported heart disease compared with 11,234 women (about 10.6% of men versus 6.7% of women). More investigation of the data is needed to identify the key risk factors for heart disease; however, our pie chart of average BMI among people with heart disease suggests that a high BMI may be one of them.
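The mean and standard deviation referred to in this summary are not computed in the code above. A minimal sketch of how they could be obtained is shown below. It uses only the first ten SleepTime values printed by str() as a stand-in for the full column, so its results will not match those of the full data set; the variable names are illustrative.

```r
# First ten SleepTime values from the str() output (not the full column)
sleep_sample <- c(5, 7, 8, 6, 8, 12, 4, 9, 5, 10)

sleep_mean <- mean(sleep_sample)  # arithmetic mean
sleep_sd <- sd(sleep_sample)      # sample standard deviation (n - 1 denominator)

c(mean = sleep_mean, sd = sleep_sd)
```

On the full data, mean(dataheart$BMI) and sd(dataheart$SleepTime) would give the figures discussed here.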