Our task here is to extend an existing example. Using one of our classmates’ examples (as created here), we have to extend his or her example with additional annotated code.
Tidyverse is a collection of R packages designed for data science that provides a consistent framework for working with data. The goal of Tidyverse is to make data manipulation, visualization, and analysis easier and more intuitive by providing a set of tools that work seamlessly together.
ggplot2 is the part of tidyverse designed for hassle-free data visualization and exploratory data analysis. The package works in a layered fashion: one can start by plotting raw data for exploratory analysis and later add layers to the existing plot to show more detail and draw more insight out of the data. ggplot2 is built on an underlying grammar called the “Grammar of Graphics”, which is made up of a set of independent components that can be combined in many ways.
How to get ggplot2 in R
In order to make ggplot2 available in R and RStudio you first have
to install the package with the install.packages() function. The
command to run is install.packages("ggplot2"), but ggplot2 is also
part of tidyverse, so if you are using other packages like dplyr,
forcats, tidyr and readr, it is better to install the tidyverse
package. Once the package is installed, you can load it with the
library() function. If you installed tidyverse instead of ggplot2,
the command library(tidyverse) will load all of its core packages,
including ggplot2, and make the functions of ggplot2 available for
you to use. Since we will be using it in this vignette, let’s install
and load ggplot2.
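The install-and-load step described above might look like the following minimal sketch. The requireNamespace() guard is an addition of ours, so that the package is only downloaded when it is missing; it is not part of the original example.

```r
# Install ggplot2 only if it is not already available (this guard is an
# assumption, added so that re-running the chunk does not reinstall it).
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Attach the package so that ggplot(), aes() and the geoms can be called directly.
library(ggplot2)

# Alternatively, library(tidyverse) attaches ggplot2 together with dplyr,
# tidyr, readr, forcats and the other core tidyverse packages.
```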
Table of Contents:
Before jumping into the functions of ggplot2, let’s quickly go over what we will be covering in this vignette:
- Basics of ggplot
- Layers
- Themes
- Facets
- Labels and Annotations
- Scales
- Data wrangling
- Descriptive statistics
- Correlation analysis
- Regression analysis
Basics of ggplot2
In this section we will try to understand how the basics of ggplot2 work. We already have an idea that ggplot2 is a plotting package that provides helpful commands to create complex plots from data in a data frame. It provides a more programmatic interface for specifying what variables to plot, how they are displayed, and general visual properties.
ggplot2 refers to the name of the package itself. When using the
package we use the function ggplot() to generate the plots,
so references to the function will be written as
ggplot() and references to the package as a whole as ggplot2. To
make a plot with ggplot2 we will be using the following template as a
reference:
ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>()
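As a quick illustration of the template, here is a minimal sketch that fills in each placeholder. It uses mtcars, a data set that ships with R, purely as a stand-in; it is not the data set used in the rest of this vignette.

```r
library(ggplot2)

# <DATA> = mtcars, <MAPPINGS> = wt on x and mpg on y, <GEOM_FUNCTION> = geom_point
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point()

p  # printing the object draws the scatter plot
```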
If you look closely, the ggplot() function asks for a data set and then maps it to aesthetics: the mapping ties aesthetics such as the x and y axes to columns of the data. The geom function then decides how the mapped data is drawn. Before we go any further we need a data set to work with, so let me add one to this vignette. I found this data set about diabetes patients on Kaggle and here is the link:
https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset
You can also get this data set from my GitHub; here is the link for that:
https://github.com/Umerfarooq122/Data_sets/blob/main/diabetes_data.csv
Now let’s load this data set and get on with the basics of ggplot2. We
will be using read.csv() from base R to load the data
set, and to confirm that the loaded data set has all the variables
we can use the head() function from base R. The glimpse() function
from dplyr (loaded with tidyverse) then gives a compact overview of
every column. The code chunk below shows that:
dataset <- read.csv("https://raw.githubusercontent.com/Umerfarooq122/Data_sets/main/diabetes_data.csv")
knitr::kable(head(dataset))
| Age | Sex | HighChol | CholCheck | BMI | Smoker | HeartDiseaseorAttack | PhysActivity | Fruits | Veggies | HvyAlcoholConsump | GenHlth | MentHlth | PhysHlth | DiffWalk | Stroke | HighBP | Diabetes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 1 | 0 | 1 | 26 | 0 | 0 | 1 | 0 | 1 | 0 | 3 | 5 | 30 | 0 | 0 | 1 | 0 |
| 12 | 1 | 1 | 1 | 26 | 1 | 0 | 0 | 1 | 0 | 0 | 3 | 0 | 0 | 0 | 1 | 1 | 0 |
| 13 | 1 | 0 | 1 | 26 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 10 | 0 | 0 | 0 | 0 |
| 11 | 1 | 1 | 1 | 28 | 1 | 0 | 1 | 1 | 1 | 0 | 3 | 0 | 3 | 0 | 0 | 1 | 0 |
| 8 | 0 | 0 | 1 | 29 | 1 | 0 | 1 | 1 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 18 | 0 | 0 | 1 | 1 | 1 | 0 | 2 | 7 | 0 | 0 | 0 | 0 | 0 |
glimpse(dataset)
#> Rows: 70,692
#> Columns: 18
#> $ Age <dbl> 4, 12, 13, 11, 8, 1, 13, 6, 3, 6, 12, 4, 7, 10, 1…
#> $ Sex <dbl> 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0…
#> $ HighChol <dbl> 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0…
#> $ CholCheck <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ BMI <dbl> 26, 26, 26, 28, 29, 18, 26, 31, 32, 27, 24, 21, 2…
#> $ Smoker <dbl> 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0…
#> $ HeartDiseaseorAttack <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
#> $ PhysActivity <dbl> 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1…
#> $ Fruits <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0…
#> $ Veggies <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1…
#> $ HvyAlcoholConsump <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ GenHlth <dbl> 3, 3, 1, 3, 2, 2, 1, 4, 3, 3, 3, 1, 2, 3, 1, 3, 2…
#> $ MentHlth <dbl> 5, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0…
#> $ PhysHlth <dbl> 30, 0, 10, 3, 0, 0, 0, 0, 0, 6, 4, 0, 0, 3, 0, 0,…
#> $ DiffWalk <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
#> $ Stroke <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ HighBP <dbl> 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0…
#> $ Diabetes <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
Now that we have the data set, let’s use the ggplot() function to make some basic plots. We will make a simple bar chart first to get an idea of how everything works together.
ggplot(data = dataset, mapping = aes(x = HighChol)) + geom_bar()
As we can see, there are two responses in the HighChol
column of the data set, so ggplot() plotted two bars, one for
each response. The dataset data frame was supplied through
data = and then mapped to the aesthetics
(aes) through mapping =, and then we added a
layer called geom_bar(), which specifies what kind of plot we
want from the data mapped to the aesthetics. Similarly, we can plot
other variables (columns) from the data set to better understand the basics of
the ggplot() function. Let’s take two variables, Age and BMI,
and plot them against each other using another geom.
ggplot(data = dataset, mapping = aes(x = Age, y = BMI)) + geom_point()
The graph above plots two variables from the dataset data
frame. It puts the Age variable on the x-axis and BMI on the y-axis to
depict the relationship between the two variables. We can see
that geom_point() creates a scatter plot, which,
combined with some other layers, can be really helpful in understanding
how the data points cluster.
Some of the popular geoms for creating different types of graphs are listed below:
- geom_line(): Creates a line graph relating two numerical variables.
- geom_bar(): Creates a bar graph for categorical data; the categorical variable goes on the x-axis and, by default, the bar height on the y-axis is the count of rows in each category.
- geom_histogram(): Creates a histogram of a single continuous variable by binning its values and counting the observations in each bin.
- geom_point(): Creates a scatter plot between two related variables.
- geom_boxplot(): Creates a box plot, which compactly displays the distribution of a continuous variable. It visualizes five summary statistics (the median, two hinges and two whiskers) and all “outlying” points individually.
- geom_polygon(): Creates polygons, which are filled paths.
The geoms mentioned above are among the most commonly used. In total there are more than 40 geoms available in ggplot2 for creating effective and meaningful visualizations, so we are just scratching the surface here.
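To make two of the listed geoms concrete, the sketch below draws a histogram and a box plot. The toy data frame is simulated for illustration only; its column names merely mirror those of the diabetes data set.

```r
library(ggplot2)

# Hypothetical toy data: 200 simulated BMI values and a 0/1 cholesterol flag.
set.seed(1)
toy <- data.frame(
  BMI      = rnorm(200, mean = 28, sd = 5),
  HighChol = factor(rep(c(0, 1), each = 100))
)

# Histogram: one continuous variable, split into 20 bins.
p_hist <- ggplot(data = toy, mapping = aes(x = BMI)) +
  geom_histogram(bins = 20)

# Box plot: the distribution of BMI within each HighChol group.
p_box <- ggplot(data = toy, mapping = aes(x = HighChol, y = BMI)) +
  geom_boxplot()

p_hist
p_box
```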
Layers
One of the key ideas behind ggplot2 is that it allows you to easily iterate, building up a complex plot a layer at a time. Each layer can come from a different data set and have different aesthetic mappings, making it possible to create sophisticated plots that display data from multiple sources.
Previously, we used functions like geom_bar() or
geom_point() and supplied the data set and the aesthetic
mapping in the global ggplot() call, but the beauty
of working with ggplot2 is that we can also define them inside
the geom itself. This makes it possible to add layers to a plot from
multiple data sets. To demonstrate that, let me load another
data set.
dataset_new <- read.csv("https://raw.githubusercontent.com/Umerfarooq122/Data_sets/main/stroke_data.csv")
Now we have another data set that contains information about stroke
and some factors that may contribute to causing a
stroke. Next we are going to plot a graph from the first data frame,
called dataset, and add a new layer from the other
data frame, called dataset_new, with the help of the
+ sign.
plot1 <- ggplot() +
  geom_bar(data = dataset, aes(x = HighChol), fill = "blue") +
  geom_bar(data = dataset_new, aes(x = work_type), fill = "red")
plot1
I added the colors with the fill argument so that we can
differentiate between the data plotted from the two data sets.
Similarly, apart from adding plots from different data sets, we
can add many other things, such as themes, labels, and scales, as
separate layers on a plot. Layers will come up again in every section
that follows.
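Another common use of per-layer data is to draw raw observations and a summary of them in the same plot. The following is a minimal sketch of that idea; the two data frames (raw and means) are made up for illustration and do not come from the data sets used in this vignette.

```r
library(ggplot2)

# Hypothetical toy data standing in for a real data set.
set.seed(2)
raw <- data.frame(
  group = rep(c("a", "b", "c"), each = 30),
  value = rnorm(90, mean = rep(c(10, 12, 15), each = 30))
)

# A second, much smaller data frame holding one mean per group.
means <- aggregate(value ~ group, data = raw, FUN = mean)

p_layers <- ggplot() +
  # Layer 1: every raw observation, taken from `raw`.
  geom_point(data = raw, mapping = aes(x = group, y = value), colour = "grey50") +
  # Layer 2: the group means, taken from `means`, drawn on top in red.
  geom_point(data = means, mapping = aes(x = group, y = value),
             colour = "red", size = 4)

p_layers
```

Because each geom carries its own data = argument, the global ggplot() call can stay empty.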
Themes
Another plus point of working in ggplot2 is that we can add different themes to our plots and make them more aesthetically pleasing. The theme system does not affect how the data is rendered by geoms, or how it is transformed by scales. Themes don’t change the perceptual properties of the plot but they do offer some control over things like fonts, ticks, panel strips, and backgrounds.
The function theme_bw() switches the plot to a black-and-white theme
with a white panel background, as shown below:
ggplot(data = dataset, mapping = aes(x = HighChol)) + geom_bar() + theme_bw()
Similarly, we can add a legend to our graph by using the fill aesthetic and
assigning categorical data to it inside
aes(), as shown below. But before that we have to change the
data type of HighChol from numeric to
factor so that it is treated as a category.
dataset$HighChol <- as.factor(dataset$HighChol)
ggplot(data = dataset, mapping = aes(x = HighChol, fill = HighChol)) + geom_bar()
As we can see, the legend has been added and ggplot2 automatically
picks the colors for the different categories. If you want to take
control of elements like fonts, legends, and backgrounds I would
suggest exploring the theme() function, but I will leave
that for the extension part.
Coming back to themes, we can use other themes like
theme_linedraw(), theme_light(),
theme_dark(), theme_minimal(),
theme_classic(), theme_void() and
theme_test(). I would suggest trying each one of them out
to see what it adds to the current plot.
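For finer control, a complete theme can be combined with individual theme() settings. The sketch below is one possible combination, shown on a made-up five-row data frame; the particular sizes and positions are arbitrary choices, not recommendations from the original example.

```r
library(ggplot2)

# Hypothetical toy data with a two-level factor (3 zeros, 2 ones).
toy <- data.frame(HighChol = factor(c(0, 0, 0, 1, 1)))

p_theme <- ggplot(data = toy, mapping = aes(x = HighChol, fill = HighChol)) +
  geom_bar() +
  labs(title = "Toy example") +
  theme_minimal() +                          # start from a complete theme
  theme(
    legend.position = "bottom",              # move the legend under the plot
    plot.title = element_text(face = "bold", size = 14),
    axis.text = element_text(size = 11)
  )

p_theme
```

The theme() call comes after theme_minimal() on purpose: a complete theme added later would overwrite the individual settings.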
Facets
Faceting creates subplots based on a subgroup within the data
set. These subplots show the same variables from the same data,
split by a subgroup in the data. The subplots are placed next to each
other and the number of subplots depends on the number of subgroups in
the data. If there are two subgroups in the data there will be only two
subplots; if there are more than two subgroups
there will be more than two subplots. Let’s take the very first graph that we
made when explaining the basics of ggplot2 and use the facet_wrap()
function on it. Again, a quick reminder that facet_wrap()
is going to be an added layer on the current graph. Here I
will facet the data on the GenHlth column, which holds the information about
general health.
ggplot(data = dataset, mapping = aes(x=HighChol, fill=HighChol))+geom_bar()+facet_wrap(~GenHlth)
Now we can see that instead of one plot we get five subplots, since we
faceted on the GenHlth column, which takes five
categorical values from 1 to 5. We can define our own labels
and titles for the facets and the overall plot, but for now I will leave that to
the next section.
Apart from creating subplots with the facet_wrap()
function, we can also lay the subgroups out on a grid of panels
that share common axes using the facet_grid()
function. Let’s recreate the same plot using
facet_grid().
ggplot(data = dataset, mapping = aes(x=HighChol, fill=HighChol))+geom_bar()+facet_grid(~GenHlth)
So we can see that facet_grid() places the panel for every
subgroup side by side in a single row with a shared y-axis,
rather than wrapping the subplots over several rows
as we saw in the case of
facet_wrap().
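The difference between the two functions becomes clearer when a second variable is involved, because facet_grid() can put one variable on the rows and another on the columns. The sketch below assumes a made-up data frame whose column names mirror the diabetes data; the counts are invented.

```r
library(ggplot2)

# Hypothetical toy data: every combination of Sex (0/1) and GenHlth (1-3)
# appears at least once, so all six panels contain data.
toy <- expand.grid(Sex = c(0, 1), GenHlth = 1:3, HighChol = c(0, 1))
toy <- toy[rep(seq_len(nrow(toy)), times = 1:12), ]  # unequal counts per combination
toy$HighChol <- factor(toy$HighChol)

p_base <- ggplot(data = toy, mapping = aes(x = HighChol, fill = HighChol)) +
  geom_bar()

# facet_wrap(): one variable, panels wrapped into rows and columns.
p_wrap <- p_base + facet_wrap(~ GenHlth, ncol = 2)

# facet_grid(): rows defined by Sex, columns defined by GenHlth.
p_grid <- p_base + facet_grid(Sex ~ GenHlth)

p_wrap
p_grid
```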
Labels and Annotations
i) Labels
The purpose of creating a visualization is to communicate
important information to decision makers. A graph might not
convey the information that we want it to convey if it
is not properly labeled. Labeling properly means defining the
axes clearly and adding a meaningful title, along with annotations where
required to highlight important points or key takeaways on the graph.
Since ggplot2 works with layers, we can
add layers for labeling and annotating the graph effectively. Let’s go
back to the first graph again and this time properly label it
using the labs() function.
ggplot(data = dataset, mapping = aes(x = HighChol, fill = HighChol)) +
  geom_bar() +
  labs(x = "High Cholesterol", y = "Count", title = "Cholesterol Frequency")
Now we can see that the axes are properly labeled and a title has been added to our plot. With this information the reader knows right away what the graph is about.
We can also add a subtitle and a caption to our graph too:
ggplot(data = dataset, mapping = aes(x = HighChol, fill = HighChol)) +
  geom_bar() +
  labs(x = "High Cholesterol", y = "Count", title = "Cholesterol Frequency",
       subtitle = "Cholesterol vs the number of people having it",
       caption = "Each count represents one human")
So now our graph looks more informative compared to the previous
one. We can do much more with the labs() function, but for
now we will leave it at this point.
ii) Annotation:
We can highlight key takeaways and important points by directing the
attention of the reader toward that point using proper annotations on
the graph. We can use annotate() function to achieve that.
Let’s take the graph that we recently labeled. Again we will be adding a
layer of annotation to the existing graph
ggplot(data = dataset, mapping = aes(x = HighChol, fill = HighChol)) +
  geom_bar() +
  labs(x = "High Cholesterol", y = "Count", title = "Cholesterol Frequency",
       subtitle = "Cholesterol vs the number of people having it",
       caption = "Each count represents one human") +
  annotate("text", x = 1:2, y = 35000, label = c("Lower", "Higher"))
So we can see that we have two text annotations on our graph that label the lower and the higher bar. Similarly, we can do much more with annotations, like introducing an arrow marker, a point or a line to point towards the key takeaway, but I will leave that to the extension part of this vignette.
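As a small taste of the arrow annotation mentioned above, the following sketch adds a text label and an arrow with annotate("segment", ...). It runs on a made-up eight-row data frame, and the coordinates are hand-picked for that toy data, so they would need adjusting for the real plot.

```r
library(ggplot2)

# Hypothetical toy counts: 3 rows without and 5 rows with high cholesterol.
toy <- data.frame(HighChol = factor(c(0, 0, 0, 1, 1, 1, 1, 1)))

p_annot <- ggplot(data = toy, mapping = aes(x = HighChol, fill = HighChol)) +
  geom_bar() +
  # A text label placed above the shorter bar.
  annotate("text", x = 1.2, y = 4.5, label = "Higher") +
  # An arrow from the label towards the top of the taller bar
  # (x = 2 is the position of the second factor level).
  annotate("segment", x = 1.4, xend = 1.8, y = 4.5, yend = 4.9,
           arrow = arrow(length = unit(0.3, "cm")))

p_annot
```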
Scales
There is a lot to cover here, so we will just scratch the surface. One of the most difficult parts of any graphics package is scaling, that is, converting from data values to perceptual properties. There are numerous ways to control the scale of the axes; for example, we can use any of the three methods below:
- xlim() and ylim()
- expand_limits()
- scale_x_continuous() and scale_y_continuous()
Here I will be touching on the scale_*_continuous() functions
and their breaks, labels and limits arguments. We will be using the graph that we
recently labeled and annotated. Since we have categorical data on the x-axis
and numbers on the y-axis, in this particular example we will be adjusting
the y-axis of the plot. So let’s use scale_y_continuous()
as an added layer with the breaks, labels and
limits arguments.
ggplot(data = dataset, mapping = aes(x = HighChol, fill = HighChol)) +
  geom_bar() +
  labs(x = "High Cholesterol", y = "Count", title = "Cholesterol Frequency",
       subtitle = "Cholesterol vs the number of people having it",
       caption = "Each count represents one human") +
  annotate("text", x = 1:2, y = 35000, label = c("Lower", "Higher")) +
  scale_y_continuous("Count", breaks = scales::breaks_extended(8),
                     labels = scales::label_number(), limits = c(0, 40000))
As we can see, the y-axis now extends to 40000 through the
limits argument, the breaks argument asks for
roughly 8 evenly spaced tick marks, and the labels
argument controls how the tick labels are formatted
(here as plain numbers).
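The three approaches listed at the start of this section can be compared side by side. The sketch below applies each of them to the same made-up bar chart; the limit of 10 and the break spacing are arbitrary values chosen for the toy data.

```r
library(ggplot2)

# Hypothetical toy counts (3 and 5 rows per category).
toy <- data.frame(HighChol = factor(c(0, 0, 0, 1, 1, 1, 1, 1)))

p_base <- ggplot(data = toy, mapping = aes(x = HighChol)) + geom_bar()

# 1) ylim(): shorthand for setting only the limits of the y-axis.
p_ylim <- p_base + ylim(0, 10)

# 2) expand_limits(): make sure a value is included without fixing both ends.
p_expand <- p_base + expand_limits(y = 10)

# 3) scale_y_continuous(): control over name, limits, breaks and labels.
p_scale <- p_base +
  scale_y_continuous(name = "Count", limits = c(0, 10),
                     breaks = seq(0, 10, by = 2))

p_ylim
p_expand
p_scale
```

One thing to keep in mind is that limits set through ylim() or a scale drop observations that fall outside the range, whereas coord_cartesian(ylim = ...) only zooms the view.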
Data wrangling
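The cleaning steps used in this section can first be tried on a small made-up data frame, where it is easy to see what each call removes. The toy data frame and its values are invented for illustration; only is.na(), na.omit() and subset() correspond to the calls applied to dataset_new.

```r
# Hypothetical toy data frame with two missing values and one impossible age.
toy <- data.frame(
  age = c(34, NA, 51, -2, 67),
  bmi = c(22.5, 30.1, NA, 27.3, 31.8)
)

# Count the missing values in each column, not only in total.
na_per_column <- colSums(is.na(toy))
na_per_column

# Drop every row that contains at least one NA.
toy_complete <- na.omit(toy)

# Keep only rows with a plausible (non-negative) age.
toy_clean <- subset(toy_complete, age >= 0)
toy_clean
```

Counting per column with colSums() shows which variables the missing values sit in before na.omit() removes the affected rows.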
# Check for missing values
sum(is.na(dataset_new))
#> [1] 3
# Remove rows with missing values
dataset_new <- na.omit(dataset_new)
# Remove rows with invalid age values
dataset_new <- subset(dataset_new, age >= 0)
Descriptive statistics
summary(dataset_new)
#> sex age hypertension heart_disease
#> Min. :0.0000 Min. : 0.0 Min. :0.0000 Min. :0.0000
#> 1st Qu.:0.0000 1st Qu.: 35.0 1st Qu.:0.0000 1st Qu.:0.0000
#> Median :1.0000 Median : 52.0 Median :0.0000 Median :0.0000
#> Mean :0.5549 Mean : 51.4 Mean :0.2136 Mean :0.1277
#> 3rd Qu.:1.0000 3rd Qu.: 68.0 3rd Qu.:0.0000 3rd Qu.:0.0000
#> Max. :1.0000 Max. :103.0 Max. :1.0000 Max. :1.0000
#> ever_married work_type Residence_type avg_glucose_level
#> Min. :0.0000 Min. :0.000 Min. :0.0000 Min. : 55.12
#> 1st Qu.:1.0000 1st Qu.:3.000 1st Qu.:0.0000 1st Qu.: 78.75
#> Median :1.0000 Median :4.000 Median :1.0000 Median : 97.95
#> Mean :0.8213 Mean :3.461 Mean :0.5148 Mean :122.07
#> 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:1.0000 3rd Qu.:167.41
#> Max. :1.0000 Max. :4.000 Max. :1.0000 Max. :271.74
#> bmi smoking_status stroke
#> Min. :11.50 Min. :0.0000 Min. :0.0000
#> 1st Qu.:25.90 1st Qu.:0.0000 1st Qu.:0.0000
#> Median :29.40 Median :0.0000 Median :0.0000
#> Mean :30.41 Mean :0.4887 Mean :0.4994
#> 3rd Qu.:34.10 3rd Qu.:1.0000 3rd Qu.:1.0000
#> Max. :92.00 Max. :1.0000 Max. :1.0000
Correlation analysis
cor(dataset_new[,c("age", "hypertension", "heart_disease", "avg_glucose_level", "bmi")])
#> age hypertension heart_disease avg_glucose_level
#> age 1.00000000 0.01604117 0.02496913 0.02779724
#> hypertension 0.01604117 1.00000000 0.08016345 0.20314228
#> heart_disease 0.02496913 0.08016345 1.00000000 0.25238497
#> avg_glucose_level 0.02779724 0.20314228 0.25238497 1.00000000
#> bmi -0.01213928 0.08255989 0.02132540 0.24271540
#> bmi
#> age -0.01213928
#> hypertension 0.08255989
#> heart_disease 0.02132540
#> avg_glucose_level 0.24271540
#> bmi 1.00000000
The correlation matrix above shows the correlation coefficients between pairs of variables in dataset_new. The values range from -1 to 1, where -1 indicates a perfect negative correlation, 0 indicates no linear correlation, and 1 indicates a perfect positive correlation.
Looking at the matrix, we can see that there is a weak positive correlation between hypertension and avg_glucose_level (correlation coefficient of 0.203), heart_disease and avg_glucose_level (correlation coefficient of 0.252), and bmi and avg_glucose_level (correlation coefficient of 0.243). This suggests that higher glucose levels may be associated with these health conditions.
On the other hand, the correlation between age and bmi is negative but negligible (correlation coefficient of -0.012), so in this data set there is essentially no linear relationship between age and BMI.
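Since this vignette is about ggplot2, a correlation matrix can also be drawn as a heat map, which makes the weak and near-zero correlations easier to spot than a table of numbers. The sketch below is one way to do that, assuming a simulated data frame (toy) in which bmi is constructed to correlate with glucose; none of these values come from dataset_new.

```r
library(ggplot2)

# Hypothetical toy data; bmi is constructed to correlate with glucose.
set.seed(3)
toy <- data.frame(age = runif(100, 20, 90), glucose = rnorm(100, 110, 30))
toy$bmi <- 20 + 0.08 * toy$glucose + rnorm(100, sd = 3)

# Correlation matrix, then reshaped to one row per pair of variables.
cor_mat  <- cor(toy)
cor_long <- as.data.frame(as.table(cor_mat))
names(cor_long) <- c("var1", "var2", "r")

p_cor <- ggplot(data = cor_long, mapping = aes(x = var1, y = var2, fill = r)) +
  geom_tile() +
  geom_text(aes(label = round(r, 2))) +
  scale_fill_gradient2(limits = c(-1, 1))   # diverging palette centred on 0

p_cor
```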
Regression analysis
model <- lm(stroke ~ age + hypertension + heart_disease + avg_glucose_level + bmi, data = dataset_new)
model
#>
#> Call:
#> lm(formula = stroke ~ age + hypertension + heart_disease + avg_glucose_level +
#> bmi, data = dataset_new)
#>
#> Coefficients:
#> (Intercept) age hypertension heart_disease
#> 0.259209 0.001140 0.253953 0.237208
#> avg_glucose_level bmi
#> 0.001684 -0.003569
The output of the linear regression model shows the coefficient for each independent variable, together with the intercept (constant term). Each coefficient gives the direction of the relationship and the estimated change in the dependent variable for a one-unit increase in that independent variable, holding the others fixed.
For example, in this model age has a positive coefficient, which means that as age increases, the predicted likelihood of having a stroke also increases. Similarly, hypertension and heart disease have positive coefficients, indicating that these factors are positively associated with stroke risk.
Average glucose level also has a small positive coefficient (0.001684), while bmi is the only predictor with a negative coefficient (-0.003569), suggesting that in this model a higher BMI is associated with a slightly lower predicted stroke risk.
Overall, the linear regression model provides insight into which variables are most strongly associated with the outcome of interest (in this case, stroke) and can be useful for predicting or explaining the likelihood of that outcome based on the values of the independent variables.
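Because stroke is a 0/1 outcome, a common alternative to lm() is logistic regression with glm(), which keeps predicted probabilities between 0 and 1. The sketch below swaps in that technique on simulated data; it is not the model fitted above, and the data frame, the true coefficients and the sample size are all invented for illustration.

```r
# Hypothetical simulated data: stroke probability rises with age.
set.seed(4)
n   <- 1000
toy <- data.frame(age = runif(n, 20, 90), bmi = rnorm(n, 28, 5))
p   <- plogis(-5 + 0.07 * toy$age)          # true probability of stroke
toy$stroke <- rbinom(n, size = 1, prob = p)

# Logistic regression: models the log-odds of stroke as a linear function.
fit <- glm(stroke ~ age + bmi, data = toy, family = binomial)

# summary() adds standard errors and p-values to the coefficients.
summary(fit)

# Predicted probability of stroke for a hypothetical 70-year-old with BMI 28.
pred <- predict(fit, newdata = data.frame(age = 70, bmi = 28), type = "response")
pred
```

Calling summary() on a fitted model is also useful for the lm() fit, since the printed coefficients alone do not show how uncertain each estimate is.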
library(ggplot2)
ggplot(data = dataset_new, aes(x = age, y = stroke)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  xlab("Age") +
  ylab("Stroke") +
  ggtitle("Relationship between Age and Stroke")
#> `geom_smooth()` using formula = 'y ~ x'
Conclusion:
We have only scratched the surface in every section, but the basic knowledge above of each aspect of ggplot2 is enough to create meaningful and effective plots. Plots made in ggplot2 are very flexible and customizable, and we have not really talked much about customizing them. While customizing, we can apply our own predefined color palettes to a graph’s data and legends, play around with the size and font styling of the different kinds of text used in labels, legends, titles, etc., and much more.
We explored the relationship between various factors and the likelihood of stroke using a linear regression model. We found that age, hypertension, heart disease, and average glucose level were positively associated with stroke risk, while BMI was the only predictor with a negative coefficient.
These findings suggest that individuals with hypertension or heart disease, as well as those who are older or have higher average glucose levels, may be at increased risk of stroke. The small negative BMI coefficient should be read with caution, since the model describes associations in this data set rather than causal effects.