library(ggplot2)
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
We load two libraries:
- ggplot2 builds plots layer by layer (we will use it to create the scatter plot).
- dplyr provides functions for exploring and summarizing data (we will use it to understand the categories in the dataset).
We use the built-in dataset iris.
What this dataset contains:
- Each row is one flower sample.
- There are 150 total observations.
- The column Species is a categorical variable with 3 groups:
  - setosa
  - versicolor
  - virginica
- The columns Sepal.Length and Sepal.Width are numeric measurements that we will plot.
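The facts listed above can be checked directly in R. The following is a minimal sketch that uses only base R functions on the built-in iris data frame; the variable names n_rows and species_levels are illustrative and not part of the original walkthrough.

```r
# Inspect the built-in iris data frame with base R only.
n_rows <- nrow(iris)                    # number of flower samples (rows)
species_levels <- levels(iris$Species)  # groups of the categorical column

print(n_rows)
print(species_levels)

# Confirm that the two columns to be plotted are numeric.
print(is.numeric(iris$Sepal.Length))
print(is.numeric(iris$Sepal.Width))

# Compact overview of every column and its type.
str(iris)
```

Note that the column names are case-sensitive: the dataset spells them Sepal.Length and Sepal.Width, with a capital S.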
data <- iris

## Step 3: Preview the dataset (see the first few rows)

head(data, 10)

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
table(data$Species)
setosa versicolor virginica
50 50 50
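Since dplyr was loaded for exploring and summarizing, the same per-species counts can also be obtained with its count() function. This is a minimal sketch, not the walkthrough's own step; the name species_counts is an illustrative choice.

```r
library(dplyr)

data <- iris  # same built-in dataset as in the walkthrough

# count() groups the rows by Species and returns one row per group,
# with the group size stored in a column named n.
species_counts <- count(data, Species)
print(species_counts)
```

Unlike table(), which returns a table object, count() returns a data frame, which is convenient when the counts feed into further dplyr steps.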
A scatter plot shows the relationship between two numeric variables.
Here we plot:
- x-axis: Sepal.Length
- y-axis: Sepal.Width

Important: each dot represents one flower (one row in the dataset).
ggplot(data, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()

Now we include the categorical variable:
- colour = Species tells ggplot2 to assign a different color to each species (ggplot2 treats color and colour as the same aesthetic).
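To make the role of aes() concrete, here is a hedged side-by-side sketch of mapping a column to colour versus setting one fixed color. The object names p_mapped and p_fixed and the color "steelblue" are illustrative assumptions, not part of the original walkthrough.

```r
library(ggplot2)

data <- iris

# Mapping: colour sits inside aes(), so it varies with the Species column
# and ggplot2 adds a legend automatically.
p_mapped <- ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point()

# Setting: colour sits outside aes(), so every point gets the same fixed color
# and no legend is drawn.
p_fixed <- ggplot(data, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(colour = "steelblue")

print(p_mapped)
print(p_fixed)
```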
ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point()

We adjust how points look:
- size = 3 makes each dot bigger, so it is easier to see.
- alpha = 0.7 makes dots slightly transparent, which helps when points overlap.

Why transparency helps:
- If many points overlap in the same region, the overlapping dots stack into darker patches, so dense areas become more visible.
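To see how much overlap there actually is, one can count the flowers that share exactly the same pair of plotted values. This is a small base R sketch; the names xy, n_hidden, and n_distinct are illustrative.

```r
data <- iris

# Keep only the two plotted columns.
xy <- data[, c("Sepal.Length", "Sepal.Width")]

# duplicated() flags rows whose (x, y) pair already appeared earlier,
# i.e. points that would be drawn exactly on top of another point.
n_hidden <- sum(duplicated(xy))

# unique() keeps one row per distinct (x, y) pair.
n_distinct <- nrow(unique(xy))

print(n_hidden)
print(n_distinct)
```

Any value of n_hidden above zero means some dots sit exactly on top of others; with fully opaque points those would be indistinguishable from a single dot.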
ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point(size = 3, alpha = 0.7)

Good plots should clearly communicate what the viewer is seeing. labs() adds:
- title: the plot heading.
- x and y: the axis labels.
- color: the legend title (so the legend has a meaningful name).
ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point(size = 3, alpha = 0.7) +  # the + attaches labs() to the plot
  labs(
    title = "Scatter plot of sepal dimensions",
    x = "Sepal Length",
    y = "Sepal Width",
    color = "Species"
  )
Themes control the background, grid lines, and text styling.
- theme_minimal() removes heavy backgrounds and gives a clean look.
- theme(legend.position = "top") moves the legend above the plot.

Why move the legend?
ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(
    title = "Scatter plot of sepal dimensions",
    x = "Sepal Length",
    y = "Sepal Width",
    color = "Species"
  ) +
  theme_minimal() +
  theme(legend.position = "top")
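Because ggplot2 builds plots layer by layer, the final plot can also be assembled step by step in a stored object. The sketch below is one possible way to do this, with p as an illustrative variable name. It also guards against a common slip: if the + between two calls is missing, R evaluates the second call as a separate expression and prints its value instead of adding it to the plot.

```r
library(ggplot2)

data <- iris

# Step 1: data, aesthetic mappings, and the point layer.
p <- ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point(size = 3, alpha = 0.7)

# Step 2: title, axis labels, and legend title.
p <- p + labs(
  title = "Scatter plot of sepal dimensions",
  x = "Sepal Length",
  y = "Sepal Width",
  color = "Species"
)

# Step 3: clean theme, with the legend moved above the plot.
p <- p + theme_minimal() + theme(legend.position = "top")

# Draw the finished plot.
print(p)
```

Storing the plot in p means each addition is an explicit assignment, so a forgotten + cannot silently drop a component.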