Extended Tidyverse

Our task here is to extend an existing example. Using one of our classmates' examples (as created here), we have to extend his or her example with additional annotated code.

Tidyverse is a collection of R packages designed for data science that provides a consistent framework for working with data. The goal of Tidyverse is to make data manipulation, visualization, and analysis easier and more intuitive by providing a set of tools that work seamlessly together. ggplot2 is the part of tidyverse designed for hassle-free data visualization and exploratory data analysis. The package works in a layered fashion: one can start by plotting raw data for exploratory analysis and later add layers to the existing plot to show more detail and draw more insight out of the data. ggplot2 is built on an underlying grammar called the “Grammar of Graphics”, which is made up of a set of independent components that can be composed in many ways.

How to get ggplot2 in R

To make ggplot2 available in R and RStudio, you first have to install the package with the install.packages() function. The command to run is install.packages("ggplot2"), but ggplot2 is also part of tidyverse, so if you are using other packages like dplyr, forcats, tidyr and readr, it is better to install the tidyverse package. Once the package is installed, you load it with the library() function. If you installed tidyverse instead of ggplot2, the command library(tidyverse) loads all the core packages, including ggplot2, and makes the ggplot2 functions available to you. Since we will be using it in this vignette, let us install and load ggplot2.
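The install-and-load step described above can be sketched as follows. This is a minimal sketch that assumes a CRAN mirror is reachable when the package is missing; it loads only ggplot2, and the commented line shows the tidyverse alternative.

```r
# Install ggplot2 only if it is not already available
# (assumption: a CRAN mirror is configured and reachable).
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Attach the package so that ggplot(), aes() and the geoms can be called directly
library(ggplot2)

# Alternative: library(tidyverse) attaches ggplot2 together with dplyr,
# which supplies the glimpse() function used later in this vignette.
```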

Basics of ggplot2

In this section we will try to understand how the basics of ggplot2 work. We already know that ggplot2 is a plotting package that provides helpful commands to create complex plots from data in a data frame. It provides a more programmatic interface for specifying which variables to plot, how they are displayed, and general visual properties.

ggplot2 is the name of the package itself. When using the package we call the function ggplot() to generate plots, so references to the function are written as ggplot() and references to the package as a whole as ggplot2. To make a plot with ggplot2 we will use the following template as a reference:

ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>()

If you look closely, the ggplot() function asks for a data set and then maps its variables to aesthetics, for example the x and y axes. Before we go any further we need a data set to work with, so let me add one to this vignette. I found this data set about diabetes patients on Kaggle, and here is the link:

https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset

You can also get this data set from my GitHub. Here is the link:

https://github.com/Umerfarooq122/Data_sets/blob/main/diabetes_data.csv

Now let's load this data set and get on with the basics of ggplot2. We will use read.csv() from base R to load the data set, and to confirm that the loaded data has all the variables we can use the head() function from base R, followed by glimpse() from dplyr. The code chunk below does that:

dataset <- read.csv("https://raw.githubusercontent.com/Umerfarooq122/Data_sets/main/diabetes_data.csv")
knitr::kable(head(dataset))

Age	Sex	HighChol	CholCheck	BMI	Smoker	PhysActivity	Fruits	Veggies	GenHlth	MentHlth	PhysHlth	Stroke	HighBP
4	1	0	1	26	0	1	0	1	3	5	30	0	1
12	1	1	1	26	1	0	1	0	3	0	0	1	1
13	1	0	1	26	0	1	1	1	1	0	10	0	0
11	1	1	1	28	1	1	1	1	3	0	3	0	1
8	0	0	1	29	1	1	1	1	2	0	0	0	0
1	0	0	1	18	0	1	1	1	2	7	0	0	0

glimpse(dataset)
#> Rows: 70,692
#> Columns: 18
#> $ Age                  <dbl> 4, 12, 13, 11, 8, 1, 13, 6, 3, 6, 12, 4, 7, 10, 1…
#> $ Sex                  <dbl> 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0…
#> $ HighChol             <dbl> 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0…
#> $ CholCheck            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ BMI                  <dbl> 26, 26, 26, 28, 29, 18, 26, 31, 32, 27, 24, 21, 2…
#> $ Smoker               <dbl> 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0…
#> $ HeartDiseaseorAttack <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
#> $ PhysActivity         <dbl> 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1…
#> $ Fruits               <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0…
#> $ Veggies              <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1…
#> $ HvyAlcoholConsump    <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ GenHlth              <dbl> 3, 3, 1, 3, 2, 2, 1, 4, 3, 3, 3, 1, 2, 3, 1, 3, 2…
#> $ MentHlth             <dbl> 5, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0…
#> $ PhysHlth             <dbl> 30, 0, 10, 3, 0, 0, 0, 0, 0, 6, 4, 0, 0, 3, 0, 0,…
#> $ DiffWalk             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
#> $ Stroke               <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ HighBP               <dbl> 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0…
#> $ Diabetes             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

Now that we have the data set, let's use the ggplot() function to make a basic plot. We will make a simple bar chart first to get an idea of how everything works together.

ggplot(data = dataset, mapping = aes(x=HighChol))+geom_bar()

As we can see, the HighChol column has two distinct values, so ggplot() plotted two bars, one for each value. The dataset data frame was passed through data =, its HighChol column was mapped to the x aesthetic inside aes() through mapping =, and then we added a layer called geom_bar(), which specifies what kind of plot we want from the mapped data. Similarly, we can plot other variables (columns) from the data set to better understand the basics of the ggplot() function. Let's take two variables, Age and BMI, and plot them against each other using another geom.

ggplot(data = dataset, mapping = aes(x=Age, y=BMI))+geom_point()

The graph above plots two variables from the dataset data frame. It puts Age on the x-axis and BMI on the y-axis and tries to depict the relationship between the two. We can see that geom_point() creates a scatter plot, which, combined with some other layers, can be really helpful in understanding how the data clusters.
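With tens of thousands of rows, many points in a scatter plot land on top of each other. The sketch below shows one way extra layers and settings can help; it uses a small simulated data frame named toy (an assumption standing in for the real Age and BMI columns), and the alpha value is an arbitrary choice.

```r
library(ggplot2)

# Simulated stand-in for the Age and BMI columns (not the real data)
set.seed(1)
toy <- data.frame(
  Age = sample(1:13, 500, replace = TRUE),
  BMI = round(rnorm(500, mean = 29, sd = 6))
)

p_scatter <- ggplot(data = toy, mapping = aes(x = Age, y = BMI)) +
  geom_point(alpha = 0.3) +                               # transparency shows where points pile up
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) # straight-line trend as a second layer

p_scatter
```

Darker regions of the plot indicate more overlapping observations, and the fitted line summarises the overall trend.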

Some of the popular geoms for creating different types of graphs, with their respective outcomes, are listed below:

geom_line() : Creates a line graph by connecting observations in order of the variable on the x-axis.
geom_bar() : Creates a bar graph for categorical data. By default it counts the rows in each category of the x variable; use geom_col() when the bar heights come from a numeric y variable.
geom_histogram() : Creates a histogram of a single continuous variable by dividing it into bins and counting the observations in each bin.
geom_point() : Creates a scatter plot between two variables.
geom_boxplot() : Creates a box plot. The box plot compactly displays the distribution of a continuous variable. It visualizes five summary statistics (the median, two hinges and two whiskers), and all “outlying” points individually.
geom_polygon() : Creates polygons, which are filled paths.

The geoms mentioned above are among the most commonly used. In fact, there are more than 40 geoms available in ggplot2 for creating effective and meaningful visualizations. We are just scratching the surface here.
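As a rough illustration of two geoms from the list that have not been plotted yet, the sketch below draws a histogram and a box plot. The toy data frame is simulated (an assumption; it only mimics the GenHlth and BMI columns), and the bin width of 2 is an arbitrary choice.

```r
library(ggplot2)

# Simulated stand-in for the GenHlth and BMI columns (not the real data)
set.seed(42)
toy <- data.frame(
  GenHlth = factor(sample(1:5, 300, replace = TRUE)),
  BMI     = rnorm(300, mean = 29, sd = 6)
)

# Histogram: one continuous variable, split into bins of width 2
p_hist <- ggplot(toy, aes(x = BMI)) +
  geom_histogram(binwidth = 2)

# Box plot: distribution of a continuous variable within each category
p_box <- ggplot(toy, aes(x = GenHlth, y = BMI)) +
  geom_boxplot()

p_hist
p_box
```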

Layers

One of the key ideas behind ggplot2 is that it allows you to easily iterate, building up a complex plot one layer at a time. Each layer can come from a different data set and have different aesthetic mappings, making it possible to create sophisticated plots that display data from multiple sources.

Previously, we used functions like geom_bar() or geom_point() and supplied the data set and aesthetic mappings in the global ggplot() call. The beauty of working with ggplot2 is that we can also define them inside each geom, which helps when adding layers that plot multiple data sets. To demonstrate that, let me load another data set:

dataset_new <- read.csv("https://raw.githubusercontent.com/Umerfarooq122/Data_sets/main/stroke_data.csv")

Now we have another data set, which contains information about stroke and some factors that may contribute to it. Next we plot a graph from the first data frame, dataset, and add a new layer from the second data frame, dataset_new, with the help of the + sign.

plot1 <- ggplot() +
  geom_bar(data = dataset, aes(x = HighChol), fill = "blue") +
  geom_bar(data = dataset_new, aes(x = work_type), fill = "red")
plot1

I added the colors with the fill argument so that we can tell apart the data plotted from the two data sets. Apart from adding plots from different data sets, we can add much more, such as themes, labels and scales, as further layers on a plot. Layers will come up again in every section that follows.
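The idea of giving each layer its own data and mappings can be sketched with two tiny made-up data frames, df_a and df_b (hypothetical names and values, used here only so the example runs on its own).

```r
library(ggplot2)

# Two small, made-up data frames that share a grouping column
df_a <- data.frame(group = c("a", "b", "c"), value  = c(3, 5, 2))
df_b <- data.frame(group = c("a", "b", "c"), target = c(4, 4, 3))

# ggplot() is left empty; each geom brings its own data and aesthetics
p_layers <- ggplot() +
  geom_col(data = df_a, aes(x = group, y = value), fill = "steelblue") +
  geom_point(data = df_b, aes(x = group, y = target), colour = "red", size = 3)

p_layers
```

Because both layers map a column with the same categories to x, they line up on a shared axis even though they come from different data frames.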

Themes

Another plus point of working in ggplot2 is that we can add different themes to our plots and make them more visually appealing. The theme system does not affect how the data is rendered by geoms, or how it is transformed by scales. Themes don’t change the perceptual properties of the plot, but they do offer some control over things like fonts, ticks, panel strips, and backgrounds.

The function theme_bw() gives the plot a white background with a dark border and light grey grid lines, as shown below:

ggplot(data = dataset, mapping = aes(x=HighChol))+geom_bar()+theme_bw()

Similarly, we can add a legend to our graph by assigning categorical data to the fill argument inside aes(), as shown below. Before that, we have to change the data type of HighChol from numeric to factor so that it is treated as a category:

dataset$HighChol <- as.factor(dataset$HighChol)

ggplot(data = dataset, mapping = aes(x=HighChol, fill=HighChol))+geom_bar()

As we can see, the legend was added and ggplot2 automatically picked a color for each category. If you want to take control of parameters like fonts, legends, and backgrounds, I suggest exploring the theme() function, but I will leave that for the extension part.

Coming back to themes, we can use other themes like theme_linedraw(), theme_light(), theme_dark(), theme_minimal(), theme_classic(), theme_void() and theme_test(). I suggest trying each one of them to see what it does to the current plot.
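As a small taste of theme(), the sketch below combines a complete theme with a few individual overrides. The toy data frame and the specific settings (bold title, legend at the bottom, no minor grid lines) are illustrative assumptions, not choices made in the original vignette.

```r
library(ggplot2)

# Made-up categorical data standing in for the HighChol column
toy <- data.frame(HighChol = factor(c(0, 0, 1, 1, 1, 0, 1)))

p_theme <- ggplot(toy, aes(x = HighChol, fill = HighChol)) +
  geom_bar() +
  labs(title = "Theme example") +
  theme_minimal() +                          # complete theme first
  theme(                                     # then fine-grained overrides
    plot.title       = element_text(face = "bold", size = 14),
    legend.position  = "bottom",
    panel.grid.minor = element_blank()
  )

p_theme
```

The order matters: a complete theme such as theme_minimal() added after theme() would overwrite the individual settings.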

Facets

Faceting creates subplots based on subgroups within the data set. The subplots show the same variables from the same data, divided by subgroup. They are placed next to each other, and the number of subplots depends on the number of subgroups in the data: two subgroups give two subplots, and more than two subgroups give more than two subplots. Let's take the very first graph that we made when explaining the basics of ggplot2 and use the facet_wrap() function on it. As a quick reminder, facet_wrap() is added as another layer on the current graph. Here I will facet the data on the GenHlth column, which holds the information about general health.

ggplot(data = dataset, mapping = aes(x=HighChol, fill=HighChol))+geom_bar()+facet_wrap(~GenHlth)

Now we can see that instead of one plot we get five subplots, because we faceted on the GenHlth column, which has five categorical values from 1 to 5. We can define our own labels and titles for the facets and the overall plot, but I will leave that to the next section.

Apart from creating subplots with facet_wrap(), we can also lay the subgroups out on one large grid using the facet_grid() function. Let's recreate the same plot using facet_grid().

ggplot(data = dataset, mapping = aes(x=HighChol, fill=HighChol))+geom_bar()+facet_grid(~GenHlth)

So we can see that facet_grid(~GenHlth) lays out all five panels in a single row that shares one y-axis, whereas facet_wrap() wraps the panels into several rows. facet_grid() is most useful when the panels are defined by two variables, one for the rows and one for the columns.
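The difference between the two faceting functions is easier to see with two faceting variables. The sketch below uses a simulated toy data frame (an assumption; it only mimics the HighChol, GenHlth and Sex columns), and the ncol = 3 setting is an arbitrary choice.

```r
library(ggplot2)

# Simulated stand-in for three categorical columns (not the real data)
set.seed(3)
toy <- data.frame(
  HighChol = factor(sample(0:1, 400, replace = TRUE)),
  GenHlth  = factor(sample(1:5, 400, replace = TRUE)),
  Sex      = factor(sample(0:1, 400, replace = TRUE))
)

base <- ggplot(toy, aes(x = HighChol, fill = HighChol)) + geom_bar()

# facet_wrap(): one variable, panels wrapped into rows of three
p_wrap <- base + facet_wrap(~GenHlth, ncol = 3)

# facet_grid(): rows defined by Sex, columns defined by GenHlth
p_grid <- base + facet_grid(Sex ~ GenHlth)

p_wrap
p_grid
```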

Labels and Annotations

i) Labels

The purpose of creating a visualization is to communicate important information to decision makers. A graph might not convey the information we want it to convey if it is not properly labeled. Labeling properly means defining the axes clearly and adding a meaningful title, along with annotations where needed to highlight important points or key takeaways. Since ggplot2 works with layers, we can add layers for labeling and annotating the graph effectively. Let's go back to the first graph again and this time label it properly using the labs() function.

ggplot(data = dataset, mapping = aes(x=HighChol, fill=HighChol))+geom_bar()+labs(x="High Cholesterol", y="Count", title= "Cholesterol Frequency")

Now we can see that the axes are properly labeled and a title has been added to our plot. With this information the reader knows right away what the graph is about.

We can also add a subtitle and a caption to our graph:

ggplot(data = dataset, mapping = aes(x=HighChol, fill=HighChol))+geom_bar()+labs(x="High Cholesterol", y="Count", title= "Cholesterol Frequency", subtitle = "Cholesterol vs the number of people having it", caption = "Each count represents one human")

So now our graph is more informative than the previous one. We can do much more with the labs() function, but for now we will leave it at this point.

ii) Annotations

We can highlight key takeaways and important points by directing the reader's attention toward them with proper annotations on the graph. We can use the annotate() function to achieve that. Let's take the graph that we just labeled and add a layer of annotation to it.

ggplot(data = dataset, mapping = aes(x=HighChol, fill=HighChol))+geom_bar()+labs(x="High Cholesterol", y="Count", title= "Cholesterol Frequency", subtitle = "Cholesterol vs the number of people having it", caption = "Each count represents one human")+annotate("text", x = 1:2, y = 35000, label = c("Lower", "Higher"))

So we can see that we now have two text annotations on our graph that label the lower and the higher bar. Similarly, we can do much more with annotations, like introducing an arrow, a point or a line to highlight the key takeaway, but I will leave that to the extension part of this vignette.
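An arrow annotation could look roughly like the sketch below. The counts in toy are made up for illustration (they are not the real bar heights), and the label text and coordinates are arbitrary choices.

```r
library(ggplot2)

# Made-up counts standing in for the two HighChol bars
toy <- data.frame(HighChol = factor(c(0, 1)), n = c(300, 400))

p_annot <- ggplot(toy, aes(x = HighChol, y = n, fill = HighChol)) +
  geom_col() +
  # Text placed between the two bars, above them
  annotate("text", x = 1.5, y = 450, label = "Taller bar") +
  # Arrow running from just below the text to the top of the second bar
  annotate("segment", x = 1.5, xend = 1.9, y = 440, yend = 410,
           arrow = arrow(length = unit(0.2, "cm")))

p_annot
```

On a discrete x-axis the categories sit at positions 1, 2, and so on, which is why numeric x values can be used to place the annotations.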

Scales

There is a lot to cover here, but we will only scratch the surface. One of the most difficult parts of any graphics package is scaling, that is, converting from data values to perceptual properties. There are numerous ways to control the scale of the axes; for example, we can use any of the three methods below:

xlim() and ylim()
expand_limits()
scale_x_continuous() and scale_y_continuous()

Here I will touch on the scale_*_continuous() functions and their breaks, labels and limits arguments. We will use the graph that we just labeled and annotated. Since it has categorical data on the x-axis and numbers on the y-axis, in this example we will adjust the y-axis of the plot. So let’s use scale_y_continuous() as an added layer with the breaks, labels and limits arguments.

ggplot(data = dataset, mapping = aes(x=HighChol, fill=HighChol))+geom_bar()+labs(x="High Cholesterol", y="Count", title= "Cholesterol Frequency", subtitle = "Cholesterol vs the number of people having it", caption = "Each count represents one human")+annotate("text", x = 1:2, y = 35000, label = c("Lower", "Higher"))+scale_y_continuous("Count",breaks = scales::breaks_extended(8), labels = scales::label_number() , limits = c(0,40000))

As we can see, the limits argument extends the y-axis to 40,000, the breaks argument requests roughly eight evenly spaced breaks through scales::breaks_extended(8), and the labels argument formats the tick labels as plain numbers through scales::label_number().
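The three approaches listed at the start of this section can be compared side by side. The sketch below uses made-up counts in toy and an arbitrary upper limit of 500; it is only meant to show how each call is written.

```r
library(ggplot2)

# Made-up counts standing in for the two HighChol bars
toy <- data.frame(HighChol = factor(c(0, 1)), n = c(300, 400))

p_base <- ggplot(toy, aes(x = HighChol, y = n)) + geom_col()

# 1. ylim(): shorthand for fixing both ends of the y-axis
p_ylim <- p_base + ylim(0, 500)

# 2. expand_limits(): make sure the axis includes a value, without fixing the rest
p_expand <- p_base + expand_limits(y = 500)

# 3. scale_y_continuous(): limits plus explicit control over the breaks
p_scale <- p_base +
  scale_y_continuous(limits = c(0, 500), breaks = seq(0, 500, by = 100))

p_ylim
p_expand
p_scale
```

One thing to keep in mind is that ylim() and the limits argument drop observations that fall outside the range, so for bar charts the limits should always include zero.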

Data wrangling

# Check for missing values
sum(is.na(dataset_new))
#> [1] 3

# Remove rows with missing values
dataset_new <- na.omit(dataset_new)

# Remove rows with invalid age values
dataset_new <- subset(dataset_new, age >= 0)
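To see what these cleaning steps do, it can help to run them on a tiny data frame first. The toy values below are made up for illustration, with one missing age, one missing bmi and one invalid negative age.

```r
# Made-up rows: one NA in each column and one invalid (negative) age
toy <- data.frame(
  age = c(34, NA, 61, -1, 47),
  bmi = c(22.5, 30.1, NA, 27.0, 31.4)
)

# Missing values per column, which is more informative than one overall total
na_per_column <- colSums(is.na(toy))
na_per_column

# Drop rows with any missing value, then rows with an invalid age
clean <- na.omit(toy)
clean <- subset(clean, age >= 0)
clean
```

Only the first and last rows survive, since the other three each fail one of the checks.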

Descriptive statistics

summary(dataset_new)
#>       sex              age         hypertension    heart_disease   
#>  Min.   :0.0000   Min.   :  0.0   Min.   :0.0000   Min.   :0.0000  
#>  1st Qu.:0.0000   1st Qu.: 35.0   1st Qu.:0.0000   1st Qu.:0.0000  
#>  Median :1.0000   Median : 52.0   Median :0.0000   Median :0.0000  
#>  Mean   :0.5549   Mean   : 51.4   Mean   :0.2136   Mean   :0.1277  
#>  3rd Qu.:1.0000   3rd Qu.: 68.0   3rd Qu.:0.0000   3rd Qu.:0.0000  
#>  Max.   :1.0000   Max.   :103.0   Max.   :1.0000   Max.   :1.0000  
#>   ever_married      work_type     Residence_type   avg_glucose_level
#>  Min.   :0.0000   Min.   :0.000   Min.   :0.0000   Min.   : 55.12   
#>  1st Qu.:1.0000   1st Qu.:3.000   1st Qu.:0.0000   1st Qu.: 78.75   
#>  Median :1.0000   Median :4.000   Median :1.0000   Median : 97.95   
#>  Mean   :0.8213   Mean   :3.461   Mean   :0.5148   Mean   :122.07   
#>  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:1.0000   3rd Qu.:167.41   
#>  Max.   :1.0000   Max.   :4.000   Max.   :1.0000   Max.   :271.74   
#>       bmi        smoking_status       stroke      
#>  Min.   :11.50   Min.   :0.0000   Min.   :0.0000  
#>  1st Qu.:25.90   1st Qu.:0.0000   1st Qu.:0.0000  
#>  Median :29.40   Median :0.0000   Median :0.0000  
#>  Mean   :30.41   Mean   :0.4887   Mean   :0.4994  
#>  3rd Qu.:34.10   3rd Qu.:1.0000   3rd Qu.:1.0000  
#>  Max.   :92.00   Max.   :1.0000   Max.   :1.0000

Correlation analysis

cor(dataset_new[,c("age", "hypertension", "heart_disease", "avg_glucose_level", "bmi")])
#>                           age hypertension heart_disease avg_glucose_level
#> age                1.00000000   0.01604117    0.02496913        0.02779724
#> hypertension       0.01604117   1.00000000    0.08016345        0.20314228
#> heart_disease      0.02496913   0.08016345    1.00000000        0.25238497
#> avg_glucose_level  0.02779724   0.20314228    0.25238497        1.00000000
#> bmi               -0.01213928   0.08255989    0.02132540        0.24271540
#>                           bmi
#> age               -0.01213928
#> hypertension       0.08255989
#> heart_disease      0.02132540
#> avg_glucose_level  0.24271540
#> bmi                1.00000000

The correlation matrix above shows the correlation coefficients between pairs of variables in dataset_new. The values range from -1 to 1, where -1 indicates a perfect negative correlation, 0 indicates no linear correlation, and 1 indicates a perfect positive correlation.

Looking at the matrix, we can see weak positive correlations between hypertension and avg_glucose_level (correlation coefficient of 0.203), heart_disease and avg_glucose_level (correlation coefficient of 0.252), and bmi and avg_glucose_level (correlation coefficient of 0.243). This suggests that higher glucose levels may be weakly associated with these health conditions.

On the other hand, the correlation between age and bmi is negative but negligible (correlation coefficient of -0.012), so there is effectively no linear relationship between age and BMI in this data.
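Since this vignette is about ggplot2, a correlation matrix can also be drawn as a heatmap. The sketch below is one possible approach using simulated data (toy is an assumption and does not reproduce the coefficients above); the column names var1, var2 and r are hypothetical.

```r
library(ggplot2)

# Simulated stand-in data: glucose is made to depend loosely on bmi
set.seed(7)
toy <- data.frame(age = rnorm(200, 50, 15), bmi = rnorm(200, 30, 6))
toy$avg_glucose_level <- 90 + 1.5 * toy$bmi + rnorm(200, 0, 20)

# Correlation matrix, reshaped to one row per pair of variables
cor_mat  <- cor(toy)
cor_long <- as.data.frame(as.table(cor_mat))
names(cor_long) <- c("var1", "var2", "r")

p_cor <- ggplot(cor_long, aes(x = var1, y = var2, fill = r)) +
  geom_tile() +
  geom_text(aes(label = round(r, 2))) +       # print each coefficient in its tile
  scale_fill_gradient2(limits = c(-1, 1)) +   # diverging colours centred on 0
  labs(x = NULL, y = NULL, fill = "Correlation")

p_cor
```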

Regression analysis

model <- lm(stroke ~ age + hypertension + heart_disease + avg_glucose_level + bmi, data = dataset_new)

model
#> 
#> Call:
#> lm(formula = stroke ~ age + hypertension + heart_disease + avg_glucose_level + 
#>     bmi, data = dataset_new)
#> 
#> Coefficients:
#>       (Intercept)                age       hypertension      heart_disease  
#>          0.259209           0.001140           0.253953           0.237208  
#> avg_glucose_level                bmi  
#>          0.001684          -0.003569

The output of the linear regression model shows the coefficients for each independent variable, including the intercept (constant term). These coefficients indicate the strength and direction of the relationship between each independent variable and the dependent variable.

For example, in this model, we can see that age has a positive coefficient, which means that as age increases, the likelihood of having a stroke also increases. Similarly, hypertension and heart disease also have positive coefficients, indicating that these factors are positively associated with stroke risk.

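avg_glucose_level also has a positive, though small, coefficient (0.001684). On the other hand, bmi is the only variable with a negative coefficient (-0.003569), suggesting that, with the other variables held fixed, a higher BMI is associated with a slightly lower predicted stroke value in this data set.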

Overall, the linear regression model provides insight into which variables are most strongly associated with the outcome of interest (in this case, stroke) and can be useful for predicting or explaining the likelihood of that outcome based on the values of the independent variables.
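Because stroke is a 0/1 outcome, a logistic regression is a common alternative to the linear model used above. The sketch below is not part of the original analysis: it fits both models to simulated data (toy and the coefficients used to generate it are assumptions) purely to show how the two calls differ.

```r
# Simulated data: stroke probability rises with age and hypertension
set.seed(123)
n <- 1000
toy <- data.frame(
  age          = runif(n, 20, 90),
  hypertension = rbinom(n, 1, 0.2)
)
p_true     <- plogis(-4 + 0.05 * toy$age + 0.8 * toy$hypertension)
toy$stroke <- rbinom(n, 1, p_true)

# Linear model, as in the vignette: coefficients are changes in predicted stroke value
fit_lm <- lm(stroke ~ age + hypertension, data = toy)

# Logistic model: coefficients are changes in the log-odds of stroke
fit_glm <- glm(stroke ~ age + hypertension, data = toy, family = binomial)

summary(fit_lm)$coefficients   # estimates with standard errors and p-values
exp(coef(fit_glm))             # odds ratios from the logistic fit
```

Unlike the linear model, the logistic fit always produces predicted probabilities between 0 and 1, and summary() on either model reports standard errors and p-values that the bare coefficient printout does not.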

library(ggplot2)

ggplot(data = dataset_new, aes(x = age, y = stroke)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  xlab("Age") +
  ylab("Stroke") +
  ggtitle("Relationship between Age and Stroke")
#> `geom_smooth()` using formula = 'y ~ x'

Conclusion:

We have only scratched the surface in every section, but this basic knowledge of each aspect of ggplot2 is enough to let us create meaningful and effective plots. Plots made in ggplot2 are very flexible and customizable, and we have not really talked much about customizing them. When customizing, we can apply our own color palettes to the plotted data and legends, adjust the size and font styling of the text used in labels, legends and titles, and much more.

We explored the relationship between various factors and the likelihood of stroke using a linear regression model. We found that age, hypertension, heart disease, and average glucose level were positively associated with stroke, while BMI had a small negative coefficient.

These findings suggest that individuals with hypertension or heart disease, as well as those who are older or have higher average glucose levels, may be at increased risk of stroke. The negative BMI coefficient is small and should not be read as evidence that a higher BMI protects against stroke.