Load the required packages and dataset for Project 2

About Dataset - from https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease

Key Indicators of Heart Disease

2020 annual CDC survey data of 400k adults related to their health status

What topic does the dataset cover?

According to the CDC, heart disease is one of the leading causes of death for people of most races in the US (African Americans, American Indians and Alaska Natives, and white people). About half of all Americans (47%) have at least 1 of 3 key risk factors for heart disease: high blood pressure, high cholesterol, and smoking. Other key indicators include diabetic status, obesity (high BMI), not getting enough physical activity, and drinking too much alcohol. Detecting and preventing the factors that have the greatest impact on heart disease is very important in healthcare. Computational developments, in turn, allow the application of machine learning methods to detect “patterns” in the data that can predict a patient’s condition.

Where did the dataset come from and what treatments did it undergo?

Originally, the dataset comes from the CDC and is a major part of the Behavioral Risk Factor Surveillance System (BRFSS), which conducts annual telephone surveys to gather data on the health status of U.S. residents. As the CDC describes: “Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.” The most recent dataset (as of February 15, 2022) includes data from 2020. It consists of 401,958 rows and 279 columns. The vast majority of columns are questions asked of respondents about their health status, such as “Do you have serious difficulty walking or climbing stairs?” or “Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes]”. In this dataset, I noticed many different factors (questions) that directly or indirectly influence heart disease, so I decided to select the most relevant variables from it and do some cleaning so that it would be usable for machine learning projects.

This project uses the dataset for visualization and analysis in Project 2.

# load required packages
# install.packages("plotly")
library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.1     v forcats 0.5.1

## Warning: package 'dplyr' was built under R version 4.1.3

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(readr)
library(ggplot2)
library(dplyr)
library(plotly)

## Warning: package 'plotly' was built under R version 4.1.3

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

# load the heart disease data
setwd("C:/Users/wrxio/projects/Datasets")
heartCleaned <- read_csv("heart_2020_cleaned.csv")

## Rows: 319795 Columns: 18

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (14): HeartDisease, Smoking, AlcoholDrinking, Stroke, DiffWalking, Sex, ...
## dbl  (4): BMI, PhysicalHealth, MentalHealth, SleepTime
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
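The column-specification message above can be avoided by declaring the column types up front. The following is a minimal sketch under stated assumptions: it writes a two-row toy CSV (three of the 18 columns, with values copied from the first rows of the dataset) to a temporary file rather than reading the real `heart_2020_cleaned.csv`, and the names `toy_path` and `toy` are hypothetical.

```r
library(readr)

# Hypothetical toy file standing in for heart_2020_cleaned.csv
toy_path <- tempfile(fileext = ".csv")
writeLines(c("HeartDisease,BMI,Smoking",
             "No,16.6,Yes",
             "No,20.3,No"), toy_path)

# Declare the types explicitly: character by default, double for BMI
toy <- read_csv(
  toy_path,
  col_types = cols(.default = col_character(), BMI = col_double())
)

toy
```

With the types declared, `read_csv()` has nothing to guess and prints no specification message. Passing a full path to `read_csv()` in this way may also be a more portable choice than calling `setwd()` first.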

# check the result 
head(heartCleaned)

## # A tibble: 6 x 18
##   HeartDisease   BMI Smoking AlcoholDrinking Stroke PhysicalHealth MentalHealth
##   <chr>        <dbl> <chr>   <chr>           <chr>           <dbl>        <dbl>
## 1 No            16.6 Yes     No              No                  3           30
## 2 No            20.3 No      No              Yes                 0            0
## 3 No            26.6 Yes     No              No                 20           30
## 4 No            24.2 No      No              No                  0            0
## 5 No            23.7 No      No              No                 28            0
## 6 Yes           28.9 Yes     No              No                  6            0
## # ... with 11 more variables: DiffWalking <chr>, Sex <chr>, AgeCategory <chr>,
## #   Race <chr>, Diabetic <chr>, PhysicalActivity <chr>, GenHealth <chr>,
## #   SleepTime <dbl>, Asthma <chr>, KidneyDisease <chr>, SkinCancer <chr>

# overall summary to check the dataset
summary(heartCleaned)

##  HeartDisease            BMI          Smoking          AlcoholDrinking   
##  Length:319795      Min.   :12.02   Length:319795      Length:319795     
##  Class :character   1st Qu.:24.03   Class :character   Class :character  
##  Mode  :character   Median :27.34   Mode  :character   Mode  :character  
##                     Mean   :28.33                                        
##                     3rd Qu.:31.42                                        
##                     Max.   :94.85                                        
##     Stroke          PhysicalHealth    MentalHealth    DiffWalking       
##  Length:319795      Min.   : 0.000   Min.   : 0.000   Length:319795     
##  Class :character   1st Qu.: 0.000   1st Qu.: 0.000   Class :character  
##  Mode  :character   Median : 0.000   Median : 0.000   Mode  :character  
##                     Mean   : 3.372   Mean   : 3.898                     
##                     3rd Qu.: 2.000   3rd Qu.: 3.000                     
##                     Max.   :30.000   Max.   :30.000                     
##      Sex            AgeCategory            Race             Diabetic        
##  Length:319795      Length:319795      Length:319795      Length:319795     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  PhysicalActivity    GenHealth           SleepTime         Asthma         
##  Length:319795      Length:319795      Min.   : 1.000   Length:319795     
##  Class :character   Class :character   1st Qu.: 6.000   Class :character  
##  Mode  :character   Mode  :character   Median : 7.000   Mode  :character  
##                                        Mean   : 7.097                     
##                                        3rd Qu.: 8.000                     
##                                        Max.   :24.000                     
##  KidneyDisease       SkinCancer       
##  Length:319795      Length:319795     
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##

heartCleaned[,]

## # A tibble: 319,795 x 18
##    HeartDisease   BMI Smoking AlcoholDrinking Stroke PhysicalHealth MentalHealth
##    <chr>        <dbl> <chr>   <chr>           <chr>           <dbl>        <dbl>
##  1 No            16.6 Yes     No              No                  3           30
##  2 No            20.3 No      No              Yes                 0            0
##  3 No            26.6 Yes     No              No                 20           30
##  4 No            24.2 No      No              No                  0            0
##  5 No            23.7 No      No              No                 28            0
##  6 Yes           28.9 Yes     No              No                  6            0
##  7 No            21.6 No      No              No                 15            0
##  8 No            31.6 Yes     No              No                  5            0
##  9 No            26.4 No      No              No                  0            0
## 10 No            40.7 No      No              No                  0            0
## # ... with 319,785 more rows, and 11 more variables: DiffWalking <chr>,
## #   Sex <chr>, AgeCategory <chr>, Race <chr>, Diabetic <chr>,
## #   PhysicalActivity <chr>, GenHealth <chr>, SleepTime <dbl>, Asthma <chr>,
## #   KidneyDisease <chr>, SkinCancer <chr>

Clean up and Organize the Datasets

Make all headers lowercase and remove spaces

Check the result: After cleaning up, look up the variable names and the structure of the data.

names(heartCleaned) <- tolower(names(heartCleaned))
names(heartCleaned) <- gsub(" ","",names(heartCleaned))

names(heartCleaned)

##  [1] "heartdisease"     "bmi"              "smoking"          "alcoholdrinking" 
##  [5] "stroke"           "physicalhealth"   "mentalhealth"     "diffwalking"     
##  [9] "sex"              "agecategory"      "race"             "diabetic"        
## [13] "physicalactivity" "genhealth"        "sleeptime"        "asthma"          
## [17] "kidneydisease"    "skincancer"
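The two renaming steps can be tried in isolation on a small character vector. This sketch uses hypothetical header names (one of them with a space, which the real headers do not contain) to show what each step contributes.

```r
# Hypothetical header names, including one with a space
raw_names <- c("HeartDisease", "Sleep Time", "BMI")

# Lowercase first, then drop literal spaces
clean_names <- gsub(" ", "", tolower(raw_names), fixed = TRUE)

clean_names
```

Since the headers in this dataset contain no spaces, the `gsub()` step leaves them unchanged here; it only matters for files whose column names include spaces.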

str(heartCleaned)

## spec_tbl_df [319,795 x 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ heartdisease    : chr [1:319795] "No" "No" "No" "No" ...
##  $ bmi             : num [1:319795] 16.6 20.3 26.6 24.2 23.7 ...
##  $ smoking         : chr [1:319795] "Yes" "No" "Yes" "No" ...
##  $ alcoholdrinking : chr [1:319795] "No" "No" "No" "No" ...
##  $ stroke          : chr [1:319795] "No" "Yes" "No" "No" ...
##  $ physicalhealth  : num [1:319795] 3 0 20 0 28 6 15 5 0 0 ...
##  $ mentalhealth    : num [1:319795] 30 0 30 0 0 0 0 0 0 0 ...
##  $ diffwalking     : chr [1:319795] "No" "No" "No" "No" ...
##  $ sex             : chr [1:319795] "Female" "Female" "Male" "Female" ...
##  $ agecategory     : chr [1:319795] "55-59" "80 or older" "65-69" "75-79" ...
##  $ race            : chr [1:319795] "White" "White" "White" "White" ...
##  $ diabetic        : chr [1:319795] "Yes" "No" "Yes" "No" ...
##  $ physicalactivity: chr [1:319795] "Yes" "Yes" "Yes" "No" ...
##  $ genhealth       : chr [1:319795] "Very good" "Very good" "Fair" "Good" ...
##  $ sleeptime       : num [1:319795] 5 7 8 6 8 12 4 9 5 10 ...
##  $ asthma          : chr [1:319795] "Yes" "No" "Yes" "No" ...
##  $ kidneydisease   : chr [1:319795] "No" "No" "No" "No" ...
##  $ skincancer      : chr [1:319795] "Yes" "No" "No" "Yes" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   HeartDisease = col_character(),
##   ..   BMI = col_double(),
##   ..   Smoking = col_character(),
##   ..   AlcoholDrinking = col_character(),
##   ..   Stroke = col_character(),
##   ..   PhysicalHealth = col_double(),
##   ..   MentalHealth = col_double(),
##   ..   DiffWalking = col_character(),
##   ..   Sex = col_character(),
##   ..   AgeCategory = col_character(),
##   ..   Race = col_character(),
##   ..   Diabetic = col_character(),
##   ..   PhysicalActivity = col_character(),
##   ..   GenHealth = col_character(),
##   ..   SleepTime = col_double(),
##   ..   Asthma = col_character(),
##   ..   KidneyDisease = col_character(),
##   ..   SkinCancer = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Remove rows with NA values in any column, if any exist

heartCleaned_nona <- heartCleaned %>%
  drop_na()   # tidyr: drops every row that has an NA in any column

head(heartCleaned_nona)

## # A tibble: 6 x 18
##   heartdisease   bmi smoking alcoholdrinking stroke physicalhealth mentalhealth
##   <chr>        <dbl> <chr>   <chr>           <chr>           <dbl>        <dbl>
## 1 No            16.6 Yes     No              No                  3           30
## 2 No            20.3 No      No              Yes                 0            0
## 3 No            26.6 Yes     No              No                 20           30
## 4 No            24.2 No      No              No                  0            0
## 5 No            23.7 No      No              No                 28            0
## 6 Yes           28.9 Yes     No              No                  6            0
## # ... with 11 more variables: diffwalking <chr>, sex <chr>, agecategory <chr>,
## #   race <chr>, diabetic <chr>, physicalactivity <chr>, genhealth <chr>,
## #   sleeptime <dbl>, asthma <chr>, kidneydisease <chr>, skincancer <chr>

# Check the result
str(heartCleaned_nona)

## spec_tbl_df [319,795 x 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ heartdisease    : chr [1:319795] "No" "No" "No" "No" ...
##  $ bmi             : num [1:319795] 16.6 20.3 26.6 24.2 23.7 ...
##  $ smoking         : chr [1:319795] "Yes" "No" "Yes" "No" ...
##  $ alcoholdrinking : chr [1:319795] "No" "No" "No" "No" ...
##  $ stroke          : chr [1:319795] "No" "Yes" "No" "No" ...
##  $ physicalhealth  : num [1:319795] 3 0 20 0 28 6 15 5 0 0 ...
##  $ mentalhealth    : num [1:319795] 30 0 30 0 0 0 0 0 0 0 ...
##  $ diffwalking     : chr [1:319795] "No" "No" "No" "No" ...
##  $ sex             : chr [1:319795] "Female" "Female" "Male" "Female" ...
##  $ agecategory     : chr [1:319795] "55-59" "80 or older" "65-69" "75-79" ...
##  $ race            : chr [1:319795] "White" "White" "White" "White" ...
##  $ diabetic        : chr [1:319795] "Yes" "No" "Yes" "No" ...
##  $ physicalactivity: chr [1:319795] "Yes" "Yes" "Yes" "No" ...
##  $ genhealth       : chr [1:319795] "Very good" "Very good" "Fair" "Good" ...
##  $ sleeptime       : num [1:319795] 5 7 8 6 8 12 4 9 5 10 ...
##  $ asthma          : chr [1:319795] "Yes" "No" "Yes" "No" ...
##  $ kidneydisease   : chr [1:319795] "No" "No" "No" "No" ...
##  $ skincancer      : chr [1:319795] "Yes" "No" "No" "Yes" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   HeartDisease = col_character(),
##   ..   BMI = col_double(),
##   ..   Smoking = col_character(),
##   ..   AlcoholDrinking = col_character(),
##   ..   Stroke = col_character(),
##   ..   PhysicalHealth = col_double(),
##   ..   MentalHealth = col_double(),
##   ..   DiffWalking = col_character(),
##   ..   Sex = col_character(),
##   ..   AgeCategory = col_character(),
##   ..   Race = col_character(),
##   ..   Diabetic = col_character(),
##   ..   PhysicalActivity = col_character(),
##   ..   GenHealth = col_character(),
##   ..   SleepTime = col_double(),
##   ..   Asthma = col_character(),
##   ..   KidneyDisease = col_character(),
##   ..   SkinCancer = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

dim(heartCleaned_nona)

## [1] 319795     18

# Check the rows and columns; no rows were dropped
heartCleaned_nona[,]

## # A tibble: 319,795 x 18
##    heartdisease   bmi smoking alcoholdrinking stroke physicalhealth mentalhealth
##    <chr>        <dbl> <chr>   <chr>           <chr>           <dbl>        <dbl>
##  1 No            16.6 Yes     No              No                  3           30
##  2 No            20.3 No      No              Yes                 0            0
##  3 No            26.6 Yes     No              No                 20           30
##  4 No            24.2 No      No              No                  0            0
##  5 No            23.7 No      No              No                 28            0
##  6 Yes           28.9 Yes     No              No                  6            0
##  7 No            21.6 No      No              No                 15            0
##  8 No            31.6 Yes     No              No                  5            0
##  9 No            26.4 No      No              No                  0            0
## 10 No            40.7 No      No              No                  0            0
## # ... with 319,785 more rows, and 11 more variables: diffwalking <chr>,
## #   sex <chr>, agecategory <chr>, race <chr>, diabetic <chr>,
## #   physicalactivity <chr>, genhealth <chr>, sleeptime <dbl>, asthma <chr>,
## #   kidneydisease <chr>, skincancer <chr>

# The result is still 319,795 x 18, so the dataset has no missing values
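Comparing row counts before and after filtering shows whether anything was missing, but not where. As a rough sketch on hypothetical toy data (the data frame `toy` below is invented for illustration), missing values can also be counted per column before any rows are dropped:

```r
# Hypothetical toy data with two missing values
toy <- data.frame(
  bmi       = c(16.6, NA, 26.6),
  smoking   = c("Yes", "No", NA),
  sleeptime = c(5, 7, 8)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Keep only the rows with no missing value in any column
toy_nona <- toy[complete.cases(toy), ]

na_per_column
nrow(toy_nona)
```

Applied to the real data, `colSums(is.na(heartCleaned))` would be expected to return zero for every column, given that no rows were removed above.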

Get the result of the total Non-heartdisease and Heartdisease

# total Non-heartdisease and Heartdisease count
ggplot(heartCleaned, aes(x = heartdisease)) +
  geom_bar(width=0.5, fill = "coral") +    #Tried: fill = "blue" and worked
  geom_text(stat='count', aes(label=after_stat(count)), vjust=-0.5) +
  theme_classic()+
  ggtitle("Total Number for the Non-heartdisease and Heartdisease") +
  xlab("Total Number for the Non-heartdisease (No) and Heartdisease (Yes)") +
  ylab("Count Number")
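The counts drawn on the bars can also be computed directly, which makes it easy to express them as proportions. The sketch below uses a hypothetical ten-value outcome vector, not the real column, so the numbers are illustrative only.

```r
# Hypothetical toy outcome column
heartdisease <- c("No", "No", "No", "Yes", "No", "No", "No", "Yes", "No", "No")

counts <- table(heartdisease)   # absolute count per class
shares <- prop.table(counts)    # the same counts as proportions

counts
round(shares, 2)
```

On the real data the equivalent call would be `table(heartCleaned$heartdisease)`. Proportions are worth checking because a strongly imbalanced outcome affects how later models should be evaluated.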

Get the result of the Total Number of Each Race in the Data Set

ggplot(heartCleaned, aes(x = race, fill=race)) +
  geom_bar(width=0.3,position = position_dodge()) +
  #geom_bar(width=0.5, fill = "green", position = position_dodge()) +
  geom_text(stat='count', aes(label=after_stat(count)), position=position_dodge(width=0.5), vjust=-0.5)+
  theme_classic() +
  ggtitle("Total Number of Each Race in the Data Set") +
  xlab("Race") +
  ylab("Count Number for the Race") + 
  labs(fill = "Race")

Get the result of the Total Number of Each Sex in the Data Set

ggplot(heartCleaned, aes(x = sex, fill=sex)) +
  geom_bar(width=0.3,position = position_dodge()) +
  #geom_bar(width=0.5, fill = "green", position = position_dodge()) +
  geom_text(stat='count', aes(label=after_stat(count)), position=position_dodge(width=0.5), vjust=-0.5)+
  theme_classic() +
  ggtitle("Total Number of Each Sex in the Data Set") +
  xlab("Sex") +
  ylab("Count Number for the Sex Category") + 
  labs(fill = "Sex")

Get the result of the Total Number of Non-heartdisease and Heartdisease by Sex

ggplot(heartCleaned, aes(x = heartdisease, fill=sex)) +
  geom_bar(position = position_dodge()) +
  geom_text(stat='count', 
      aes(label=after_stat(count)), position=position_dodge(width=1), vjust=-0.5)+
  theme_classic() +
  ggtitle("Total Number of Non-heartdisease and Heartdisease by Sex") +
  xlab("Non-heartdisease (No) and Heartdisease (Yes) by Sex") +
  ylab("Count Number") + 
  labs(fill = "Sex")

Get the result of the Total Number of Each Age Range in the Data Set

ggplot(heartCleaned, aes(x = agecategory, fill=agecategory)) +
  geom_bar(width=0.3,position = position_dodge()) +
  #geom_bar(width=0.5, fill = "green", position = position_dodge()) +
  geom_text(stat='count', aes(label=after_stat(count)), position=position_dodge(width=0.5), vjust=-0.5)+
  theme_classic() +
  ggtitle("Total Number of Each Age Range in the Data Set") +
  xlab("Age Range") +
  ylab("Count Number for the Age Range") + 
  labs(fill = "Age Range")

Get the result of the Age Density

# Age Density
ggplot(heartCleaned, aes(x = agecategory)) +
 geom_density(fill='coral') +
 theme_classic()+
 ggtitle("Age Density") +
  xlab("Age") +
  ylab("Density") + 
  labs(fill = "Age Density")
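A kernel density estimate assumes a continuous variable, whereas `agecategory` is stored as text labels such as "55-59". One possible way to prepare the column, sketched here on a hypothetical sample of labels, is to turn it into an ordered factor for bar charts, or to extract the lower bound of each band as a number. The sketch assumes that every label starts with a two-digit age, as the labels visible in the `str()` output do.

```r
# Hypothetical sample of labels in the format shown in the str() output
agecategory <- c("55-59", "80 or older", "65-69", "75-79", "18-24")

# Ordered factor: labels that start with a two-digit age sort correctly as text
age_factor <- factor(agecategory, levels = sort(unique(agecategory)), ordered = TRUE)

# Numeric lower bound of each age band
age_lower <- as.numeric(sub("^([0-9]+).*$", "\\1", agecategory))

levels(age_factor)
age_lower
```

The lower bound is only a coarse stand-in for age, so a density drawn from it would still show one spike per band rather than a smooth age distribution.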

Get the result of Total Number of the BMI In the Data Set

ggplot(heartCleaned, aes(x = bmi, fill=bmi)) +
  geom_bar(width=0.3,position = position_dodge()) +
  #geom_bar(width=0.5, fill = "green", position = position_dodge()) +
  geom_text(stat='count', aes(label=stat(count)), position=position_dodge(width=0.5), vjust=-0.5)+
  theme_classic() +
  ggtitle("Total Number of the BMI In the Data Set") +
  xlab("BMI") +
  ylab("Count Number for the BMI Category") + 
  labs(fill = "Total Number of the BMI Category")

## Warning: position_dodge requires non-overlapping x intervals
## position_dodge requires non-overlapping x intervals
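The warning above arises because `bmi` is continuous, so `geom_bar()` draws one bar per distinct value. Grouping the values into bands first is one way around this. The sketch below applies `cut()` to the first six BMI values of the dataset; the band limits (18.5, 25, 30) are the commonly used BMI categories and are an assumption here, not something the dataset itself defines.

```r
# First six BMI values from the head() output
bmi <- c(16.6, 20.3, 26.6, 24.2, 23.7, 28.9)

# Assumed bands: <18.5, 18.5 to <25, 25 to <30, >=30 (left-closed intervals)
bmi_band <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Normal", "Overweight", "Obese"),
  right  = FALSE
)

table(bmi_band)
```

A banded column like this can be passed to `geom_bar()` without the dodge warning. For the raw values, `geom_histogram()` with a chosen `binwidth` is the usual alternative.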

Get the result of the BMI Density

# BMI Density
ggplot(heartCleaned, aes(x = bmi)) +
 geom_density(fill='coral') +
 theme_classic()+
 ggtitle("BMI Density") +
  xlab("BMI") +
  ylab("Density") + 
  labs(fill = "BMI Density")

dim(heartCleaned)

## [1] 319795     18

Use Side-by-Side Boxplots

Here is Side-by-Side Boxplots For BMI filled by Age Category.

boxp1 <- heartCleaned %>% ggplot() + geom_boxplot(aes(y=bmi, group=agecategory, fill=agecategory)) +
           ggtitle("Side-by-Side Boxplots For BMI filled by Age Category") +
           xlab("Age Category") +
           ylab("BMI") 
boxp1

#boxp12 <- boxp1 + guides(fill=FALSE)
#boxp12

Ensure that the groups are treated as factors rather than numbers. Then manually fill the boxes with two colors, one per sex, and orient the boxplots horizontally.

heartCleaned %>%
  mutate(sex=factor(sex, levels=c("Female","Male"), ordered=TRUE)) %>%
  ggplot() + geom_boxplot(aes(y=bmi, group=sex, fill=sex)) +
  scale_fill_manual(values=c("red","blue")) +
  theme(axis.text.y=element_blank()) +
  ggtitle("Boxplots For BMI filled by Sex Category") +
           xlab("Sex Category") +
           ylab("BMI") +
  # coord_flip() must be joined to the plot with `+`; on a line of its own it only prints the coord object
  coord_flip()

heartCleaned %>%
  mutate(sex=factor(sex, levels=c("Female","Male"), ordered=TRUE)) %>%
  ggplot() + geom_boxplot(aes(y=bmi, group=sex, fill=sex)) +
  scale_fill_manual(values=c("white","darkgray")) +
  ggtitle("Boxplots For BMI filled by Sex Category") +
           xlab("Sex Category") +
           ylab("BMI")

heartCleaned %>%
  mutate(sex=factor(sex, levels=c("Female","Male"), ordered=TRUE)) %>%
  ggplot() + geom_boxplot(aes(y=bmi, group=sex, fill=sex))+
           ggtitle("Boxplots For BMI filled by Sex Category") +
           xlab("Sex Category") +
           ylab("BMI")
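The boxplots above set only `y`, `group`, and `fill`, so the boxes carry no category labels on the axis. As an alternative sketch, mapping the grouping variable to `x` gives each box its own labelled position, and `coord_flip()` then turns the plot horizontal. The data frame `toy` and the plot name `p_box` are hypothetical; on the real data the columns would be `sex` and `bmi` from `heartCleaned`.

```r
library(ggplot2)

# Hypothetical toy data standing in for heartCleaned
toy <- data.frame(
  sex = factor(c("Female", "Female", "Female", "Male", "Male", "Male"),
               levels = c("Female", "Male")),
  bmi = c(16.6, 20.3, 24.2, 26.6, 23.7, 28.9)
)

# Mapping sex to x gives each box its own labelled position on the axis
p_box <- ggplot(toy, aes(x = sex, y = bmi, fill = sex)) +
  geom_boxplot() +
  scale_fill_manual(values = c("white", "darkgray")) +
  coord_flip() +
  labs(x = "Sex Category", y = "BMI")

p_box
```

Because the labels are attached to the aesthetics rather than to screen positions, "BMI" ends up on the horizontal axis once the coordinates are flipped.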

Draw first chart

Map variables in the data onto the X and Y axes and change the axes labels and theme

head(heartCleaned)

## # A tibble: 6 x 18
##   heartdisease   bmi smoking alcoholdrinking stroke physicalhealth mentalhealth
##   <chr>        <dbl> <chr>   <chr>           <chr>           <dbl>        <dbl>
## 1 No            16.6 Yes     No              No                  3           30
## 2 No            20.3 No      No              Yes                 0            0
## 3 No            26.6 Yes     No              No                 20           30
## 4 No            24.2 No      No              No                  0            0
## 5 No            23.7 No      No              No                 28            0
## 6 Yes           28.9 Yes     No              No                  6            0
## # ... with 11 more variables: diffwalking <chr>, sex <chr>, agecategory <chr>,
## #   race <chr>, diabetic <chr>, physicalactivity <chr>, genhealth <chr>,
## #   sleeptime <dbl>, asthma <chr>, kidneydisease <chr>, skincancer <chr>

# Change the theme
ggplot(heartCleaned, aes(x = bmi, y = physicalhealth)) +
 xlab("bmi") +
 ylab("physicalhealth") +
 theme_minimal(base_size = 12)

# Add a title and a caption, then draw the points
p1 <- ggplot(heartCleaned, aes(x = bmi, y = physicalhealth)) +
 labs(title = "bmi versus physicalhealth",
 caption = "Source: The CDC") +
 xlab("bmi") +
 ylab("physicalhealth") +
 theme_minimal(base_size = 12)
p1 + geom_point()

# Change the theme
ggplot(heartCleaned, aes(x = agecategory, y = physicalhealth)) +
 xlab("agecategory") +
 ylab("physicalhealth") +
 theme_minimal(base_size = 12)

# Add a title and a caption, then draw the points
p1 <- ggplot(heartCleaned, aes(x = agecategory, y = physicalhealth)) +
 labs(title = "agecategory versus physicalhealth",
 caption = "Source: The CDC") +
 xlab("agecategory") +
 ylab("physicalhealth") +
 theme_minimal(base_size = 12)
p1 + geom_point()

# Change the theme
ggplot(heartCleaned, aes(x = sleeptime, y = physicalhealth)) +
 xlab("sleeptime") +
 ylab("physicalhealth") +
 theme_minimal(base_size = 12)

# Add a title and a caption, then draw the points
p1 <- ggplot(heartCleaned, aes(x = sleeptime, y = physicalhealth)) +
 labs(title = "sleeptime versus physicalhealth",
 caption = "Source: The CDC") +
 xlab("sleeptime") +
 ylab("physicalhealth") +
 theme_minimal(base_size = 12)
p1 + geom_point()

# Change the theme
ggplot(heartCleaned, aes(x = sleeptime, y = mentalhealth)) +
 xlab("sleeptime") +
 ylab("mentalhealth") +
 theme_minimal(base_size = 12)

# Add a title and a caption, then draw the points
p1 <- ggplot(heartCleaned, aes(x = sleeptime, y = mentalhealth)) +
 labs(title = "sleeptime versus mentalhealth",
 caption = "Source: The CDC") +
 xlab("sleeptime") +
 ylab("mentalhealth") +
 theme_minimal(base_size = 12)
p1 + geom_point()

p2 <- ggplot(heartCleaned, aes(x = sleeptime, y = mentalhealth)) +
 labs(title = "sleeptime versus mentalhealth",  
caption = "Source: The CDC") +
 xlab("sleeptime") +
 ylab("mentalhealth") +
 theme_minimal(base_size = 12)
p2 + geom_point()

The scatterplot gives a first visual check of whether sleep time and mental health are correlated. Next, set the axis limits explicitly.

p3 <- p2 + xlim(0,25)+ ylim(0,30)
p3 + geom_point()
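A scatterplot of two variables with few distinct values is hard to read by eye, so a numeric correlation coefficient can complement it. The sketch below uses only the first ten values of `sleeptime` and `mentalhealth` shown in the `str()` output, so the result says nothing about the full dataset; it only illustrates the calls.

```r
# First ten values of each column, copied from the str() output
sleeptime    <- c(5, 7, 8, 6, 8, 12, 4, 9, 5, 10)
mentalhealth <- c(30, 0, 30, 0, 0, 0, 0, 0, 0, 0)

# Pearson correlation coefficient, always between -1 and 1
r <- cor(sleeptime, mentalhealth)

# Spearman correlation, computed on ranks rather than on the raw values
rho <- cor(sleeptime, mentalhealth, method = "spearman")

c(pearson = r, spearman = rho)
```

On the full data the same calls would take `heartCleaned$sleeptime` and `heartCleaned$mentalhealth` as arguments.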

Add a linear smoother in red. The confidence interval band is turned off here with se = FALSE.

#p4 <- p3 + geom_point() + geom_smooth(color = "red")
#p4

p4 <- p3 + geom_point(size = 3, alpha = 0.5, aes(color = race)) + geom_smooth(method = 'lm', se  =FALSE, color = "red", lty = 2, size = 0.3)
p4

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 38 rows containing missing values (geom_smooth).

Add a linear regression line, again without the confidence interval band

p5 <- p3 + geom_point(size = 2, alpha = 0.3, aes(color = race)) + geom_smooth(method='lm',se=FALSE,formula=y~x,color = "black", lty = 5, size = 1)
p5

## Warning: Removed 38 rows containing missing values (geom_smooth).

Add a title, make the line dashed, and remove the confidence interval band

p6 <- p3 + geom_point(size = 3, alpha = 0.5, aes(color = race)) + geom_smooth(method='lm',formula=y~x, se = FALSE, linetype= "dotdash", size = 0.3) +
  ggtitle("sleeptime versus mentalhealth")
p6

## Warning: Removed 38 rows containing missing values (geom_smooth).
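`geom_smooth(method = 'lm')` draws an ordinary least squares line but does not report its coefficients. To read off the intercept and slope, the same formula can be fitted with `lm()`. This is a minimal sketch on hypothetical toy data constructed to lie exactly on the line y = 1 + 2x, so the fitted coefficients are known in advance.

```r
# Hypothetical toy data lying exactly on the line y = 1 + 2x
toy <- data.frame(x = 1:5, y = 1 + 2 * (1:5))

# Ordinary least squares fit with the same formula that geom_smooth() reports
fit <- lm(y ~ x, data = toy)

coef(fit)
```

For the chart above, the corresponding call would be `lm(mentalhealth ~ sleeptime, data = heartCleaned)`.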

Color the entire chart by race

Notice how the aes function colors the points by values in the data, rather than setting them to a single color. ggplot2 recognizes that race is a categorical variable, and uses its default qualitative color palette.
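The difference between mapping color inside one geom and mapping it for the whole chart can be seen on a small example. The sketch below uses a hypothetical data frame `toy` with a two-level `category` column; the names `p_layer` and `p_global` are likewise invented for illustration.

```r
library(ggplot2)

# Hypothetical toy data with one categorical column
toy <- data.frame(
  x        = c(1, 2, 3, 4, 1, 2, 3, 4),
  y        = c(2, 3, 5, 6, 6, 5, 3, 2),
  category = rep(c("a", "b"), each = 4)
)

# Color mapped inside geom_point() only: one regression line for all points
p_layer <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(aes(color = category)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Color mapped in ggplot(): every layer inherits it, so one line per category
p_global <- ggplot(toy, aes(x = x, y = y, color = category)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Number of fitted lines in the smooth layer (layer 2) of each plot
n_lines_layer  <- length(unique(layer_data(p_layer, 2)$group))
n_lines_global <- length(unique(layer_data(p_global, 2)$group))
```

With the layer-level mapping only the points are colored. With the chart-level mapping the smoother is split by category as well, because a discrete aesthetic set in `ggplot()` also defines the grouping for every layer.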

Now run this code, to see the different effect of setting the aes color mapping for the entire chart, rather than just one geom layer.

ggplot(heartCleaned, aes(x = sleeptime, y = mentalhealth, color=race)) +
  labs(title = "sleeptime versus mentalhealth",  
  caption = "Source: The CDC") +
  xlab("sleeptime") +
  ylab("mentalhealth") +
           
  theme_minimal(base_size = 14, base_family = "Georgia") + 
  geom_point(size = 3, alpha = 0.5) +
  geom_smooth(method=lm, se=FALSE, lty = 1, size = 0.1)

## `geom_smooth()` using formula 'y ~ x'

## Warning in grid.Call(C_stringMetric, as.graphicsAnnot(x$label)): font family not
## found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

## Warning in grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
## font family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database

Working with HTML Widgets and Highcharter

Set your working directory to access your files

# load required packages
library(readr)
library(ggplot2)
library(scales)

## Warning: package 'scales' was built under R version 4.1.3

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

library(dplyr)

Make a range of simple charts using the highcharter package

Highcharter is a package within the htmlwidgets framework that connects R to the Highcharts and Highstock JavaScript visualization libraries. For more information, see https://github.com/jbkunst/highcharter/

Also check out this site: https://cran.r-project.org/web/packages/highcharter/vignettes/charting-data-frames.html

Install and load required packages

Now install and load highcharter, plus RColorBrewer, which will make it possible to use ColorBrewer color palettes.

Also load dplyr and readr for loading and processing data.

# install highcharter, RColorBrewer
#install.packages(c("highcharter", "RColorBrewer"))

# load required packages
library(highcharter)

## Warning: package 'highcharter' was built under R version 4.1.3

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

## Highcharts (www.highcharts.com) is a Highsoft software product which is

## not free for commercial and Governmental use

library(RColorBrewer)

Prepare the data

First, prepare the data using dplyr.

# prepare data
x <- heartCleaned %>%
  group_by(sex, race) %>%
  summarize(mentalhealth = sum(mentalhealth, na.rm = TRUE)) %>%
  arrange(sex, race)

## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
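The message above is informational: after `summarize()`, the result is still grouped by `sex`. A minimal sketch of how the `.groups` argument can make the result explicitly ungrouped is shown here. The `toy` data frame and its values are made up for illustration and only stand in for `heartCleaned`.

```r
library(dplyr)

# Hypothetical toy data standing in for heartCleaned (values are invented)
toy <- tibble(
  sex = c("Female", "Female", "Male", "Male", "Male"),
  race = c("Asian", "White", "Asian", "White", "White"),
  mentalhealth = c(2, 5, NA, 1, 4)
)

# Same summary as in the chunk above, but .groups = "drop" returns an
# ungrouped result, so the grouping message is not printed
toy_summary <- toy %>%
  group_by(sex, race) %>%
  summarize(mentalhealth = sum(mentalhealth, na.rm = TRUE),
            .groups = "drop") %>%
  arrange(sex, race)

toy_summary
```

Dropping the grouping is usually a safe choice when the summary is passed straight to a plotting function, since later steps then operate on the whole table rather than per group.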

Make an area chart using default options

# basic area chart, default options
highchart() %>%
  hc_add_series(data = x,
                type = "area",
                hcaes(x = sex,
                      y = mentalhealth,
                      group = race))

# prepare data
x <- heartCleaned %>%
  group_by(genhealth, race) %>%
  summarize(mentalhealth = sum(mentalhealth, na.rm = TRUE)) %>%
  arrange(genhealth, race)

## `summarise()` has grouped output by 'genhealth'. You can override using the
## `.groups` argument.

Will try this again tomorrow morning.

# basic area chart, default options
highchart() %>%
  hc_add_series(data = x,
                type = "area",
                hcaes(x = genhealth,
                      y = mentalhealth,
                      group = race))
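Because `genhealth` is a text category rather than a number, one possible alternative is the `hchart()` shortcut, which takes the data frame, the chart type, and the `hcaes()` mapping in a single call. The sketch below is not the chart produced above: the `toy` data frame, its values, and the axis titles are assumptions made for illustration, and the explicit `hc_xAxis(type = "category")` call is there only to state the intended axis type.

```r
library(highcharter)
library(dplyr)

# Hypothetical toy data standing in for the summarized table x
toy <- tibble(
  genhealth = c("Fair", "Fair", "Good", "Good"),
  race = c("Asian", "White", "Asian", "White"),
  mentalhealth = c(10, 30, 20, 40)
)

# One area series per race, with genhealth treated as a category axis
hc <- hchart(toy, "area",
             hcaes(x = genhealth, y = mentalhealth, group = race)) %>%
  hc_xAxis(type = "category", title = list(text = "General health")) %>%
  hc_yAxis(title = list(text = "Total mentalhealth"))

hc
```

Printing `hc` in an interactive session or a knitted HTML document should render the widget; in a plain console it only returns the widget object.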

project2-2a

Bob Xiong

4/14/2022

Load the required packages and dataset for this project 2

About Dataset - from https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease

Key Indicators of Heart Disease

What topic does the dataset cover?

Where did the dataset come from and what treatments did it undergo?

This project uses the dataset for visualization and analysis in Project 2

Clean up and Organize the Datasets

Make all headers lowercase and remove spaces

Check the result: After cleaning up, look up the variable names and the structure of the data.

Remove NA values from all columns, if any exist

Get the total counts of non-heart-disease and heart-disease respondents

Get the total number of each race in the data set

Get the total number of each sex in the data set

Get the total counts of non-heart-disease and heart-disease respondents by sex

Get the total number of each sex in the data set

Get the age density

Get the BMI counts in the data set

Get the BMI density

Use Side-by-Side Boxplots

Here are side-by-side boxplots of BMI, filled by age category.

Ensure that the groups are considered as factors, rather than numbers. Then manually fill with the 3 colors: white, light gray, and dark gray. Make the boxplots orient horizontally.

Draw first chart

Map variables in the data onto the X and Y axes and change the axes labels and theme

The scatterplot gives a visual assessment of the correlation

Add a smoother in red with a confidence interval

Add a title, make the line dashed, and remove the confidence interval band

Color the entire chart by income group

Working with HTML Widgets and Highcharter

Install and load required packages

Prepare the data

Make an area chart using default options