2020 annual CDC survey data of 400k adults related to their health status
According to the CDC, heart disease is one of the leading causes of death for people of most races in the US (African Americans, American Indians and Alaska Natives, and white people). About half of all Americans (47%) have at least 1 of 3 key risk factors for heart disease: high blood pressure, high cholesterol, and smoking. Other key indicators include diabetic status, obesity (high BMI), insufficient physical activity, and excessive alcohol consumption. Detecting and preventing the factors that have the greatest impact on heart disease is very important in healthcare. Computational developments, in turn, allow machine learning methods to be applied to detect “patterns” in the data that can predict a patient’s condition.
The dataset originally comes from the CDC and is a major part of the Behavioral Risk Factor Surveillance System (BRFSS), which conducts annual telephone surveys to gather data on the health status of U.S. residents. As the CDC describes: “Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.” The most recent dataset (as of February 15, 2022) includes data from 2020. It consists of 401,958 rows and 279 columns. The vast majority of columns are questions asked of respondents about their health status, such as “Do you have serious difficulty walking or climbing stairs?” or “Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes]”. In this dataset, I noticed many different factors (questions) that directly or indirectly influence heart disease, so I selected the most relevant variables and did some cleaning so that the data would be usable for machine learning projects.
# load required packages
# install.packages("plotly")
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.1.1 v forcats 0.5.1
## Warning: package 'dplyr' was built under R version 4.1.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(readr)
library(ggplot2)
library(dplyr)
library(plotly)
## Warning: package 'plotly' was built under R version 4.1.3
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# load the cleaned heart disease data
setwd("C:/Users/wrxio/projects/Datasets")
heartCleaned <- read_csv("heart_2020_cleaned.csv")
## Rows: 319795 Columns: 18
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (14): HeartDisease, Smoking, AlcoholDrinking, Stroke, DiffWalking, Sex, ...
## dbl (4): BMI, PhysicalHealth, MentalHealth, SleepTime
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
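The message above suggests declaring the column types explicitly. A minimal sketch of that idea, assuming readr 2.0 or later (which accepts literal data wrapped in `I()`); the `sample_csv` string and the `heart_sample` name are made up for illustration and stand in for the real file:

```r
library(readr)

# Hypothetical inline sample standing in for heart_2020_cleaned.csv
sample_csv <- I("HeartDisease,BMI,Smoking,SleepTime\nNo,16.6,Yes,5\nYes,28.9,Yes,12\n")

# Declaring the types up front silences the column-specification message
# and documents what each column is expected to contain
heart_sample <- read_csv(
  sample_csv,
  col_types = cols(
    HeartDisease = col_character(),
    BMI          = col_double(),
    Smoking      = col_character(),
    SleepTime    = col_double()
  )
)

heart_sample
```

The same `col_types` argument should work with a file path in place of the inline string.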
# check the result
head(heartCleaned)
## # A tibble: 6 x 18
## HeartDisease BMI Smoking AlcoholDrinking Stroke PhysicalHealth MentalHealth
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 No 16.6 Yes No No 3 30
## 2 No 20.3 No No Yes 0 0
## 3 No 26.6 Yes No No 20 30
## 4 No 24.2 No No No 0 0
## 5 No 23.7 No No No 28 0
## 6 Yes 28.9 Yes No No 6 0
## # ... with 11 more variables: DiffWalking <chr>, Sex <chr>, AgeCategory <chr>,
## # Race <chr>, Diabetic <chr>, PhysicalActivity <chr>, GenHealth <chr>,
## # SleepTime <dbl>, Asthma <chr>, KidneyDisease <chr>, SkinCancer <chr>
# summarize every column of the dataset
summary(heartCleaned)
## HeartDisease BMI Smoking AlcoholDrinking
## Length:319795 Min. :12.02 Length:319795 Length:319795
## Class :character 1st Qu.:24.03 Class :character Class :character
## Mode :character Median :27.34 Mode :character Mode :character
## Mean :28.33
## 3rd Qu.:31.42
## Max. :94.85
## Stroke PhysicalHealth MentalHealth DiffWalking
## Length:319795 Min. : 0.000 Min. : 0.000 Length:319795
## Class :character 1st Qu.: 0.000 1st Qu.: 0.000 Class :character
## Mode :character Median : 0.000 Median : 0.000 Mode :character
## Mean : 3.372 Mean : 3.898
## 3rd Qu.: 2.000 3rd Qu.: 3.000
## Max. :30.000 Max. :30.000
## Sex AgeCategory Race Diabetic
## Length:319795 Length:319795 Length:319795 Length:319795
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## PhysicalActivity GenHealth SleepTime Asthma
## Length:319795 Length:319795 Min. : 1.000 Length:319795
## Class :character Class :character 1st Qu.: 6.000 Class :character
## Mode :character Mode :character Median : 7.000 Mode :character
## Mean : 7.097
## 3rd Qu.: 8.000
## Max. :24.000
## KidneyDisease SkinCancer
## Length:319795 Length:319795
## Class :character Class :character
## Mode :character Mode :character
##
##
##
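`summary()` reports only length and class for the character columns, so it says nothing about how often each answer occurs. A small sketch of how the class balance could be checked with base R; the `heart_disease` vector is a hypothetical toy sample, not the real column:

```r
# Hypothetical toy vector standing in for heartCleaned$HeartDisease
heart_disease <- c("No", "No", "No", "Yes", "No", "No", "Yes", "No")

# Counts per answer
counts <- table(heart_disease)
counts

# Share of each answer, rounded to three decimals
shares <- round(prop.table(counts), 3)
shares
```

Running the same two calls on the real column would show how imbalanced the target variable is, which matters for any later classification model.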
heartCleaned[,]
## # A tibble: 319,795 x 18
## HeartDisease BMI Smoking AlcoholDrinking Stroke PhysicalHealth MentalHealth
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 No 16.6 Yes No No 3 30
## 2 No 20.3 No No Yes 0 0
## 3 No 26.6 Yes No No 20 30
## 4 No 24.2 No No No 0 0
## 5 No 23.7 No No No 28 0
## 6 Yes 28.9 Yes No No 6 0
## 7 No 21.6 No No No 15 0
## 8 No 31.6 Yes No No 5 0
## 9 No 26.4 No No No 0 0
## 10 No 40.7 No No No 0 0
## # ... with 319,785 more rows, and 11 more variables: DiffWalking <chr>,
## # Sex <chr>, AgeCategory <chr>, Race <chr>, Diabetic <chr>,
## # PhysicalActivity <chr>, GenHealth <chr>, SleepTime <dbl>, Asthma <chr>,
## # KidneyDisease <chr>, SkinCancer <chr>
names(heartCleaned) <- tolower(names(heartCleaned))
names(heartCleaned) <- gsub(" ","",names(heartCleaned))
names(heartCleaned)
## [1] "heartdisease" "bmi" "smoking" "alcoholdrinking"
## [5] "stroke" "physicalhealth" "mentalhealth" "diffwalking"
## [9] "sex" "agecategory" "race" "diabetic"
## [13] "physicalactivity" "genhealth" "sleeptime" "asthma"
## [17] "kidneydisease" "skincancer"
str(heartCleaned)
## spec_tbl_df [319,795 x 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ heartdisease : chr [1:319795] "No" "No" "No" "No" ...
## $ bmi : num [1:319795] 16.6 20.3 26.6 24.2 23.7 ...
## $ smoking : chr [1:319795] "Yes" "No" "Yes" "No" ...
## $ alcoholdrinking : chr [1:319795] "No" "No" "No" "No" ...
## $ stroke : chr [1:319795] "No" "Yes" "No" "No" ...
## $ physicalhealth : num [1:319795] 3 0 20 0 28 6 15 5 0 0 ...
## $ mentalhealth : num [1:319795] 30 0 30 0 0 0 0 0 0 0 ...
## $ diffwalking : chr [1:319795] "No" "No" "No" "No" ...
## $ sex : chr [1:319795] "Female" "Female" "Male" "Female" ...
## $ agecategory : chr [1:319795] "55-59" "80 or older" "65-69" "75-79" ...
## $ race : chr [1:319795] "White" "White" "White" "White" ...
## $ diabetic : chr [1:319795] "Yes" "No" "Yes" "No" ...
## $ physicalactivity: chr [1:319795] "Yes" "Yes" "Yes" "No" ...
## $ genhealth : chr [1:319795] "Very good" "Very good" "Fair" "Good" ...
## $ sleeptime : num [1:319795] 5 7 8 6 8 12 4 9 5 10 ...
## $ asthma : chr [1:319795] "Yes" "No" "Yes" "No" ...
## $ kidneydisease : chr [1:319795] "No" "No" "No" "No" ...
## $ skincancer : chr [1:319795] "Yes" "No" "No" "Yes" ...
## - attr(*, "spec")=
## .. cols(
## .. HeartDisease = col_character(),
## .. BMI = col_double(),
## .. Smoking = col_character(),
## .. AlcoholDrinking = col_character(),
## .. Stroke = col_character(),
## .. PhysicalHealth = col_double(),
## .. MentalHealth = col_double(),
## .. DiffWalking = col_character(),
## .. Sex = col_character(),
## .. AgeCategory = col_character(),
## .. Race = col_character(),
## .. Diabetic = col_character(),
## .. PhysicalActivity = col_character(),
## .. GenHealth = col_character(),
## .. SleepTime = col_double(),
## .. Asthma = col_character(),
## .. KidneyDisease = col_character(),
## .. SkinCancer = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
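The `str()` output shows that 14 of the 18 columns are stored as character. Many modeling functions expect factors instead, so one possible preparation step is sketched below, assuming dplyr 1.0 or later for `across()`; the `toy` tibble is illustrative only:

```r
library(dplyr)

# Hypothetical three-row sample with the same kinds of columns
toy <- tibble(
  heartdisease = c("No", "Yes", "No"),
  sex          = c("Female", "Male", "Female"),
  bmi          = c(16.6, 28.9, 24.2)
)

# Convert every character column to a factor; numeric columns are untouched
toy_fct <- toy %>%
  mutate(across(where(is.character), as.factor))

str(toy_fct)
```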
# keep only rows with no missing value in any of the 18 columns
heartCleaned_nona <- heartCleaned %>%
  drop_na()
head(heartCleaned_nona)
## # A tibble: 6 x 18
## heartdisease bmi smoking alcoholdrinking stroke physicalhealth mentalhealth
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 No 16.6 Yes No No 3 30
## 2 No 20.3 No No Yes 0 0
## 3 No 26.6 Yes No No 20 30
## 4 No 24.2 No No No 0 0
## 5 No 23.7 No No No 28 0
## 6 Yes 28.9 Yes No No 6 0
## # ... with 11 more variables: diffwalking <chr>, sex <chr>, agecategory <chr>,
## # race <chr>, diabetic <chr>, physicalactivity <chr>, genhealth <chr>,
## # sleeptime <dbl>, asthma <chr>, kidneydisease <chr>, skincancer <chr>
# Check the result
str(heartCleaned_nona)
## spec_tbl_df [319,795 x 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ heartdisease : chr [1:319795] "No" "No" "No" "No" ...
## $ bmi : num [1:319795] 16.6 20.3 26.6 24.2 23.7 ...
## $ smoking : chr [1:319795] "Yes" "No" "Yes" "No" ...
## $ alcoholdrinking : chr [1:319795] "No" "No" "No" "No" ...
## $ stroke : chr [1:319795] "No" "Yes" "No" "No" ...
## $ physicalhealth : num [1:319795] 3 0 20 0 28 6 15 5 0 0 ...
## $ mentalhealth : num [1:319795] 30 0 30 0 0 0 0 0 0 0 ...
## $ diffwalking : chr [1:319795] "No" "No" "No" "No" ...
## $ sex : chr [1:319795] "Female" "Female" "Male" "Female" ...
## $ agecategory : chr [1:319795] "55-59" "80 or older" "65-69" "75-79" ...
## $ race : chr [1:319795] "White" "White" "White" "White" ...
## $ diabetic : chr [1:319795] "Yes" "No" "Yes" "No" ...
## $ physicalactivity: chr [1:319795] "Yes" "Yes" "Yes" "No" ...
## $ genhealth : chr [1:319795] "Very good" "Very good" "Fair" "Good" ...
## $ sleeptime : num [1:319795] 5 7 8 6 8 12 4 9 5 10 ...
## $ asthma : chr [1:319795] "Yes" "No" "Yes" "No" ...
## $ kidneydisease : chr [1:319795] "No" "No" "No" "No" ...
## $ skincancer : chr [1:319795] "Yes" "No" "No" "Yes" ...
## - attr(*, "spec")=
## .. cols(
## .. HeartDisease = col_character(),
## .. BMI = col_double(),
## .. Smoking = col_character(),
## .. AlcoholDrinking = col_character(),
## .. Stroke = col_character(),
## .. PhysicalHealth = col_double(),
## .. MentalHealth = col_double(),
## .. DiffWalking = col_character(),
## .. Sex = col_character(),
## .. AgeCategory = col_character(),
## .. Race = col_character(),
## .. Diabetic = col_character(),
## .. PhysicalActivity = col_character(),
## .. GenHealth = col_character(),
## .. SleepTime = col_double(),
## .. Asthma = col_character(),
## .. KidneyDisease = col_character(),
## .. SkinCancer = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
dim(heartCleaned_nona)
## [1] 319795 18
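The unchanged dimensions imply that no row had a missing value. A more direct check counts the missing values per column; the sketch below uses base R on a hypothetical `toy` data frame that, unlike the real data, does contain gaps:

```r
# Hypothetical sample with two missing values
toy <- data.frame(
  bmi       = c(16.6, NA, 26.6),
  smoking   = c("Yes", "No", NA),
  sleeptime = c(5, 7, 8)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Rows with no missing value in any column
complete_rows <- toy[complete.cases(toy), ]
nrow(complete_rows)
```

On the real data, `colSums(is.na(heartCleaned))` would be expected to return zero for every column.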
# Print the tibble to confirm the row and column counts are unchanged
heartCleaned_nona[,]
## # A tibble: 319,795 x 18
## heartdisease bmi smoking alcoholdrinking stroke physicalhealth mentalhealth
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 No 16.6 Yes No No 3 30
## 2 No 20.3 No No Yes 0 0
## 3 No 26.6 Yes No No 20 30
## 4 No 24.2 No No No 0 0
## 5 No 23.7 No No No 28 0
## 6 Yes 28.9 Yes No No 6 0
## 7 No 21.6 No No No 15 0
## 8 No 31.6 Yes No No 5 0
## 9 No 26.4 No No No 0 0
## 10 No 40.7 No No No 0 0
## # ... with 319,785 more rows, and 11 more variables: diffwalking <chr>,
## # sex <chr>, agecategory <chr>, race <chr>, diabetic <chr>,
## # physicalactivity <chr>, genhealth <chr>, sleeptime <dbl>, asthma <chr>,
## # kidneydisease <chr>, skincancer <chr>
# The result is still 319,795 x 18, so no rows contained missing values
# Count of respondents without (No) and with (Yes) heart disease
ggplot(heartCleaned, aes(x = heartdisease)) +
  geom_bar(width = 0.5, fill = "coral") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
  theme_classic() +
  ggtitle("Number of Respondents With and Without Heart Disease") +
  xlab("Heart Disease (No / Yes)") +
  ylab("Count")
# Number of respondents in each race category
ggplot(heartCleaned, aes(x = race, fill = race)) +
  geom_bar(width = 0.3, position = position_dodge()) +
  geom_text(stat = "count", aes(label = after_stat(count)), position = position_dodge(width = 0.5), vjust = -0.5) +
  theme_classic() +
  ggtitle("Number of Respondents by Race") +
  xlab("Race") +
  ylab("Count") +
  labs(fill = "Race")
# Number of respondents of each sex
ggplot(heartCleaned, aes(x = sex, fill = sex)) +
  geom_bar(width = 0.3, position = position_dodge()) +
  geom_text(stat = "count", aes(label = after_stat(count)), position = position_dodge(width = 0.5), vjust = -0.5) +
  theme_classic() +
  ggtitle("Number of Respondents by Sex") +
  xlab("Sex") +
  ylab("Count") +
  labs(fill = "Sex")
# Respondents with and without heart disease, split by sex
ggplot(heartCleaned, aes(x = heartdisease, fill = sex)) +
  geom_bar(position = position_dodge()) +
  geom_text(stat = "count", aes(label = after_stat(count)), position = position_dodge(width = 0.9), vjust = -0.5) +
  theme_classic() +
  ggtitle("Heart Disease Counts by Sex") +
  xlab("Heart Disease (No / Yes)") +
  ylab("Count") +
  labs(fill = "Sex")
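Raw counts by sex are hard to compare when the two groups differ in size, so a rate per group can be easier to read. A sketch of that calculation with dplyr, using a hypothetical `toy` tibble rather than the survey data:

```r
library(dplyr)

# Hypothetical sample: four respondents of each sex
toy <- tibble(
  sex          = c("Female", "Female", "Female", "Female",
                   "Male", "Male", "Male", "Male"),
  heartdisease = c("No", "No", "No", "Yes", "No", "Yes", "No", "Yes")
)

# Group size, number of "Yes" answers, and share of "Yes" within each sex
rate_by_sex <- toy %>%
  group_by(sex) %>%
  summarise(
    n        = n(),
    n_yes    = sum(heartdisease == "Yes"),
    rate_yes = mean(heartdisease == "Yes"),
    .groups  = "drop"
  )

rate_by_sex
```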
# Number of respondents in each age range
ggplot(heartCleaned, aes(x = agecategory, fill = agecategory)) +
  geom_bar(width = 0.3, position = position_dodge()) +
  geom_text(stat = "count", aes(label = after_stat(count)), position = position_dodge(width = 0.5), vjust = -0.5) +
  theme_classic() +
  ggtitle("Number of Respondents by Age Range") +
  xlab("Age Range") +
  ylab("Count") +
  labs(fill = "Age Range")
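# Age density: agecategory is text, so map each category to its position
# among the sorted labels before estimating a density
ggplot(heartCleaned, aes(x = as.numeric(factor(agecategory)))) +
  geom_density(fill = "coral") +
  theme_classic() +
  ggtitle("Age Density") +
  xlab("Age Category (position in sorted order)") +
  ylab("Density")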
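Because `agecategory` is stored as text, it carries no explicit order. One way to make the order explicit is an ordered factor. The sketch below assumes every label follows the pattern visible in the `str()` output (two-digit ranges plus "80 or older"), in which case plain text sorting already gives age order; the `age` vector is a made-up sample:

```r
# Hypothetical sample of age labels in the dataset's format
age <- c("55-59", "80 or older", "65-69", "18-24", "65-69")

# The labels sort correctly as text, so the sorted unique values
# can serve as the level order
age_levels <- sort(unique(age))
age_fct <- factor(age, levels = age_levels, ordered = TRUE)

age_fct
as.integer(age_fct)  # position of each label within the ordered levels
```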
# BMI is continuous, so bin it into 1-unit intervals rather than counting each distinct value
ggplot(heartCleaned, aes(x = bmi)) +
  geom_histogram(binwidth = 1, fill = "coral", color = "white") +
  theme_classic() +
  ggtitle("Distribution of BMI in the Data Set") +
  xlab("BMI") +
  ylab("Count")
# BMI density
ggplot(heartCleaned, aes(x = bmi)) +
  geom_density(fill = "coral") +
  theme_classic() +
  ggtitle("BMI Density") +
  xlab("BMI") +
  ylab("Density")
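For some analyses it may help to group BMI into the standard weight classes (below 18.5 underweight, 18.5 to under 25 normal, 25 to under 30 overweight, 30 and above obese). A sketch with base R `cut()`; the `bmi` vector holds a few values taken from the first rows printed earlier, and `bmi_class` is a name introduced here for illustration:

```r
# A few BMI values taken from the first rows of the data
bmi <- c(16.6, 20.3, 26.6, 24.2, 28.9, 40.7)

bmi_class <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Normal", "Overweight", "Obese"),
  right  = FALSE  # intervals are [lower, upper), so exactly 25 counts as Overweight
)

table(bmi_class)
```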
dim(heartCleaned)
## [1] 319795 18
# Side-by-side boxplots of BMI, one box per age category
boxp1 <- heartCleaned %>%
  ggplot(aes(x = agecategory, y = bmi, fill = agecategory)) +
  geom_boxplot() +
  ggtitle("Side-by-Side Boxplots of BMI by Age Category") +
  xlab("Age Category") +
  ylab("BMI")
boxp1
#boxp12 <- boxp1 + guides(fill=FALSE)
#boxp12
# Horizontal boxplots of BMI by sex; coord_flip() has to be chained with `+`
heartCleaned %>%
  mutate(sex = factor(sex, levels = c("Female", "Male"), ordered = TRUE)) %>%
  ggplot() +
  geom_boxplot(aes(x = sex, y = bmi, fill = sex)) +
  scale_fill_manual(values = c("red", "blue")) +
  theme(axis.text.y = element_blank()) +
  ggtitle("Boxplots of BMI by Sex") +
  xlab("Sex") +
  ylab("BMI") +
  coord_flip()
# Same comparison in grayscale
heartCleaned %>%
  mutate(sex = factor(sex, levels = c("Female", "Male"))) %>%
  ggplot() +
  geom_boxplot(aes(x = sex, y = bmi, fill = sex)) +
  scale_fill_manual(values = c("white", "darkgray")) +
  ggtitle("Boxplots of BMI by Sex") +
  xlab("Sex") +
  ylab("BMI")
# Same comparison with the default fill palette
heartCleaned %>%
  mutate(sex = factor(sex, levels = c("Female", "Male"))) %>%
  ggplot() +
  geom_boxplot(aes(x = sex, y = bmi, fill = sex)) +
  ggtitle("Boxplots of BMI by Sex") +
  xlab("Sex") +
  ylab("BMI")
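A boxplot draws the quartiles, and the same numbers can be computed directly when exact values are needed. A sketch with dplyr on a hypothetical `toy` tibble; the BMI values are invented for illustration:

```r
library(dplyr)

# Hypothetical sample: three respondents of each sex
toy <- tibble(
  sex = c("Female", "Female", "Female", "Male", "Male", "Male"),
  bmi = c(20, 24, 31, 22, 27, 29)
)

# Quartiles of BMI within each sex, matching what the box edges represent
bmi_by_sex <- toy %>%
  group_by(sex) %>%
  summarise(
    q1      = unname(quantile(bmi, 0.25)),
    median  = median(bmi),
    q3      = unname(quantile(bmi, 0.75)),
    .groups = "drop"
  )

bmi_by_sex
```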
head(heartCleaned)
## # A tibble: 6 x 18
## heartdisease bmi smoking alcoholdrinking stroke physicalhealth mentalhealth
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 No 16.6 Yes No No 3 30
## 2 No 20.3 No No Yes 0 0
## 3 No 26.6 Yes No No 20 30
## 4 No 24.2 No No No 0 0
## 5 No 23.7 No No No 28 0
## 6 Yes 28.9 Yes No No 6 0
## # ... with 11 more variables: diffwalking <chr>, sex <chr>, agecategory <chr>,
## # race <chr>, diabetic <chr>, physicalactivity <chr>, genhealth <chr>,
## # sleeptime <dbl>, asthma <chr>, kidneydisease <chr>, skincancer <chr>
# Set up the axes, labels and theme (no geom yet, so the panel is empty)
ggplot(heartCleaned, aes(x = bmi, y = physicalhealth)) +
  xlab("bmi") +
  ylab("physicalhealth") +
  theme_minimal(base_size = 12)
# Add a title and caption, then draw the points
p1 <- ggplot(heartCleaned, aes(x = bmi, y = physicalhealth)) +
  labs(title = "bmi versus physicalhealth",
       caption = "Source: The CDC") +
  xlab("bmi") +
  ylab("physicalhealth") +
  theme_minimal(base_size = 12)
p1 + geom_point()
head(heartCleaned)
## # A tibble: 6 x 18
## heartdisease bmi smoking alcoholdrinking stroke physicalhealth mentalhealth
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 No 16.6 Yes No No 3 30
## 2 No 20.3 No No Yes 0 0
## 3 No 26.6 Yes No No 20 30
## 4 No 24.2 No No No 0 0
## 5 No 23.7 No No No 28 0
## 6 Yes 28.9 Yes No No 6 0
## # ... with 11 more variables: diffwalking <chr>, sex <chr>, agecategory <chr>,
## # race <chr>, diabetic <chr>, physicalactivity <chr>, genhealth <chr>,
## # sleeptime <dbl>, asthma <chr>, kidneydisease <chr>, skincancer <chr>
# Change the theme
ggplot(heartCleaned, aes(x = agecategory, y = physicalhealth)) +
xlab("agecategory") +
ylab("physicalhealth") +
theme_minimal(base_size = 12)
# Add a title and caption, then draw the points
p1 <- ggplot(heartCleaned, aes(x = agecategory, y = physicalhealth)) +
labs(title = "agecategory versus physicalhealth",
caption = "Source: The CDC") +
xlab("agecategory") +
ylab("physicalhealth") +
theme_minimal(base_size = 12)
p1 + geom_point()
head(heartCleaned)
## # A tibble: 6 x 18
## heartdisease bmi smoking alcoholdrinking stroke physicalhealth mentalhealth
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 No 16.6 Yes No No 3 30
## 2 No 20.3 No No Yes 0 0
## 3 No 26.6 Yes No No 20 30
## 4 No 24.2 No No No 0 0
## 5 No 23.7 No No No 28 0
## 6 Yes 28.9 Yes No No 6 0
## # ... with 11 more variables: diffwalking <chr>, sex <chr>, agecategory <chr>,
## # race <chr>, diabetic <chr>, physicalactivity <chr>, genhealth <chr>,
## # sleeptime <dbl>, asthma <chr>, kidneydisease <chr>, skincancer <chr>
# Change the theme
ggplot(heartCleaned, aes(x = sleeptime, y = physicalhealth)) +
xlab("sleeptime") +
ylab("physicalhealth") +
theme_minimal(base_size = 12)
# Add a title and caption, then draw the points
p1 <- ggplot(heartCleaned, aes(x = sleeptime, y = physicalhealth)) +
labs(title = "sleeptime versus physicalhealth",
caption = "Source: The CDC") +
xlab("sleeptime") +
ylab("physicalhealth") +
theme_minimal(base_size = 12)
p1 + geom_point()
head(heartCleaned)
## # A tibble: 6 x 18
## heartdisease bmi smoking alcoholdrinking stroke physicalhealth mentalhealth
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 No 16.6 Yes No No 3 30
## 2 No 20.3 No No Yes 0 0
## 3 No 26.6 Yes No No 20 30
## 4 No 24.2 No No No 0 0
## 5 No 23.7 No No No 28 0
## 6 Yes 28.9 Yes No No 6 0
## # ... with 11 more variables: diffwalking <chr>, sex <chr>, agecategory <chr>,
## # race <chr>, diabetic <chr>, physicalactivity <chr>, genhealth <chr>,
## # sleeptime <dbl>, asthma <chr>, kidneydisease <chr>, skincancer <chr>
# Change the theme
ggplot(heartCleaned, aes(x = sleeptime, y = mentalhealth)) +
xlab("sleeptime") +
ylab("mentalhealth") +
theme_minimal(base_size = 12)
# Add a title and caption, then draw the points
p1 <- ggplot(heartCleaned, aes(x = sleeptime, y = mentalhealth)) +
labs(title = "sleeptime versus mentalhealth",
caption = "Source: The CDC") +
xlab("sleeptime") +
ylab("mentalhealth") +
theme_minimal(base_size = 12)
p1 + geom_point()
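Both variables in these scatter plots take whole-number values, so with roughly 320,000 respondents most points land on top of each other and the density of the data is hidden. One possible remedy, sketched here on a hypothetical `toy` tibble, is to count respondents per value pair and shade tiles by that count:

```r
library(dplyr)
library(ggplot2)

# Hypothetical sample with repeated (sleeptime, mentalhealth) pairs
toy <- tibble(
  sleeptime    = c(7, 7, 7, 8, 8, 6),
  mentalhealth = c(0, 0, 30, 0, 0, 5)
)

# Count how many respondents share each pair of values
pair_counts <- toy %>% count(sleeptime, mentalhealth)
pair_counts

# Draw one tile per pair, shaded by the count
p_tiles <- ggplot(pair_counts, aes(x = sleeptime, y = mentalhealth, fill = n)) +
  geom_tile() +
  theme_minimal(base_size = 12)
p_tiles
```

`geom_count()` or a low `alpha` on `geom_point()` are alternative ways to address the same problem.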
p2 <- ggplot(heartCleaned, aes(x = sleeptime, y = mentalhealth)) +
labs(title = "sleeptime versus mentalhealth",
caption = "Source: The CDC") +
xlab("sleeptime") +
ylab("mentalhealth") +
theme_minimal(base_size = 12)
p2 + geom_point()
p3 <- p2 + xlim(0,25)+ ylim(0,30)
p3 + geom_point()
#p4 <- p3 + geom_point() + geom_smooth(color = "red")
#p4
p4 <- p3 + geom_point(size = 3, alpha = 0.5, aes(color = race)) + geom_smooth(method = 'lm', se =FALSE, color = "red", lty = 2, size = 0.3)
p4
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 38 rows containing missing values (geom_smooth).
# Add a linear regression line (se = FALSE, so no confidence band is drawn)
p5 <- p3 + geom_point(size = 2, alpha = 0.3, aes(color = race)) + geom_smooth(method='lm',se=FALSE,formula=y~x,color = "black", lty = 5, size = 1)
p5
## Warning: Removed 38 rows containing missing values (geom_smooth).
p6 <- p3 + geom_point(size = 3, alpha = 0.5, aes(color = race)) + geom_smooth(method='lm',formula=y~x, se = FALSE, linetype= "dotdash", size = 0.3) +
ggtitle("sleeptime versus mentalhealth")
p6
## Warning: Removed 38 rows containing missing values (geom_smooth).
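`geom_smooth(method = "lm")` only draws the fitted line. To read off the slope and intercept, the same kind of model can be fitted directly with `lm()`. The sketch below uses a hypothetical `toy` data frame constructed to lie exactly on a line, so it says nothing about the real relationship between the two variables:

```r
# Hypothetical data built so that mentalhealth = 20 - 2 * sleeptime
toy <- data.frame(
  sleeptime    = c(4, 5, 6, 7, 8, 9),
  mentalhealth = c(12, 10, 8, 6, 4, 2)
)

# Ordinary least squares fit, the same model family geom_smooth(method = "lm") uses
fit <- lm(mentalhealth ~ sleeptime, data = toy)
coef(fit)

# Pearson correlation between the two variables
r <- cor(toy$sleeptime, toy$mentalhealth)
r
```

Applied to `heartCleaned`, the same two calls would give the slope of the dashed lines in the plots above and a rough measure of how strong the linear relationship is.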
Notice how the aes function colors the points by values in the data, rather than setting them to a single color. ggplot2 recognizes that race is a categorical variable and uses its default qualitative color palette.
Now run this code to see the different effect of setting the aes color mapping for the entire chart rather than for just one geom layer: every layer inherits the mapping, so each race gets its own regression line.
ggplot(heartCleaned, aes(x = sleeptime, y = mentalhealth, color=race)) +
labs(title = "sleeptime versus mentalhealth",
caption = "Source: The CDC") +
xlab("sleeptime") +
ylab("mentalhealth") +
theme_minimal(base_size = 14, base_family = "Georgia") +
geom_point(size = 3, alpha = 0.5) +
geom_smooth(method=lm, se=FALSE, lty = 1, size = 0.1)
## `geom_smooth()` using formula 'y ~ x'
## Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family not
## found in Windows font database
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database
Set your working directory to access your files
# load required packages
library(readr)
library(ggplot2)
library(scales)
## Warning: package 'scales' was built under R version 4.1.3
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(dplyr)
Make a range of simple charts using the highcharter package
Highcharter is a package within the htmlwidgets framework that connects R to the Highcharts and Highstock JavaScript visualization libraries. For more information, see https://github.com/jbkunst/highcharter/
Also check out this site: https://cran.r-project.org/web/packages/highcharter/vignettes/charting-data-frames.html
Now install and load highcharter, plus RColorBrewer, which will make it possible to use ColorBrewer color palettes.
Also load dplyr and readr for loading and processing data.
# install highcharter, RColorBrewer
#install.packages(c("highcharter", "RColorBrewer"))
# load required packages
library(highcharter)
## Warning: package 'highcharter' was built under R version 4.1.3
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
library(RColorBrewer)
First, prepare the data using dplyr.
# prepare data
x <- heartCleaned %>%
group_by(sex, race) %>%
summarize(mentalhealth = sum(mentalhealth, na.rm = TRUE)) %>%
arrange(sex,race)
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
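The message above appears because `summarise()` drops only the last grouping level and the result stays grouped by `sex`. Passing `.groups = "drop"` returns an ungrouped table and silences the message. The sketch below also adds a mean alongside the sum, since sums depend on group size; the `toy` tibble and the column names `total_days` and `mean_days` are illustrative choices, not part of the original analysis:

```r
library(dplyr)

# Hypothetical sample
toy <- tibble(
  sex          = c("Female", "Female", "Male", "Male", "Male"),
  race         = c("White", "White", "White", "Black", "Black"),
  mentalhealth = c(30, 0, 0, 10, 20)
)

by_sex_race <- toy %>%
  group_by(sex, race) %>%
  summarise(
    total_days = sum(mentalhealth, na.rm = TRUE),
    mean_days  = mean(mentalhealth, na.rm = TRUE),
    .groups    = "drop"  # return an ungrouped tibble and skip the message
  ) %>%
  arrange(sex, race)

by_sex_race
```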
# basic area chart, default options
highchart () %>%
hc_add_series(data = x,
type = "area",
hcaes(x = sex,
y = mentalhealth,
group = race))
# prepare data
x <- heartCleaned %>%
group_by(genhealth, race) %>%
summarize(mentalhealth = sum(mentalhealth, na.rm = TRUE)) %>%
arrange(genhealth,race)
## `summarise()` has grouped output by 'genhealth'. You can override using the
## `.groups` argument.
# basic area chart, default options
highchart () %>%
hc_add_series(data = x,
type = "area",
hcaes(x = genhealth,
y = mentalhealth,
group = race))
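Since `genhealth` is text, `arrange()` sorts it alphabetically, which puts "Excellent" ahead of "Fair" and "Good". Converting it to a factor with explicit levels before summarising gives a natural order. In the sketch below, only "Very good", "Fair" and "Good" are visible in the output earlier in this document; the labels "Poor" and "Excellent" are assumptions, and the `toy` tibble is invented:

```r
library(dplyr)

# Assumed full set of labels, worst to best; "Poor" and "Excellent"
# are assumptions not shown in the printed output
health_levels <- c("Poor", "Fair", "Good", "Very good", "Excellent")

# Hypothetical sample
toy <- tibble(
  genhealth    = c("Very good", "Fair", "Good", "Poor", "Excellent", "Good"),
  mentalhealth = c(0, 10, 2, 30, 0, 4)
)

x_ordered <- toy %>%
  mutate(genhealth = factor(genhealth, levels = health_levels)) %>%
  group_by(genhealth) %>%
  summarise(mentalhealth = sum(mentalhealth, na.rm = TRUE), .groups = "drop") %>%
  arrange(genhealth)  # factors sort by level order, not alphabetically

x_ordered
```

Whether the chart's x axis then follows this row order has not been verified here and may depend on how highcharter handles categorical axes.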