Write an R script to create a scatter plot, incorporating categorical analysis through color-coded data points representing different groups, using ggplot2.
Step 1: Load libraries
We load two libraries:
ggplot2 is used to build plots layer-by-layer (we will use it to create the scatter plot).
dplyr provides functions for exploring and summarizing data (we will use it to understand the categories in the dataset).
library(ggplot2)
library(dplyr)
Step 2: Load the dataset (iris)
We use the built-in dataset iris.
What this dataset contains:
Each row is one flower sample (an observation).
There are 150 total observations.
The column Species is a categorical variable with 3 groups:
setosa
versicolor
virginica
The columns Sepal.Length and Sepal.Width are numeric measurements that we will plot.
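The facts listed above can be checked directly in the console. The following is a minimal sketch that works on the built-in iris data frame; the particular calls chosen here are an assumption about how one might inspect the data, not necessarily the exact calls used in this walkthrough.

```r
# Sketch: confirm the structure of the built-in iris dataset.
# Assumption: iris is used as shipped with base R (datasets package).

dim(iris)                   # rows and columns
str(iris)                   # column types; Species is stored as a factor
levels(iris$Species)        # names of the categorical groups
typeof(iris$Sepal.Length)   # storage type of a numeric column
summary(iris)               # per-column summaries, including counts per species
```

str() and summary() are usually enough to tell numeric measurement columns apart from the categorical column that will later be mapped to color.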
data <- iris
head(data, 10)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
141 6.7 3.1 5.6 2.4 virginica
142 6.9 3.1 5.1 2.3 virginica
143 5.8 2.7 5.1 1.9 virginica
144 6.8 3.2 5.9 2.3 virginica
145 6.7 3.3 5.7 2.5 virginica
146 6.7 3.0 5.2 2.3 virginica
147 6.3 2.5 5.0 1.9 virginica
148 6.5 3.0 5.2 2.0 virginica
149 6.2 3.4 5.4 2.3 virginica
150 5.9 3.0 5.1 1.8 virginica
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
Sepal.Length
1 5.1
2 4.9
3 4.7
4 4.6
5 5.0
6 5.4
7 4.6
8 5.0
9 4.4
10 4.9
11 5.4
12 4.8
13 4.8
14 4.3
15 5.8
16 5.7
17 5.4
18 5.1
19 5.7
20 5.1
21 5.4
22 5.1
23 4.6
24 5.1
25 4.8
26 5.0
27 5.0
28 5.2
29 5.2
30 4.7
31 4.8
32 5.4
33 5.2
34 5.5
35 4.9
36 5.0
37 5.5
38 4.9
39 4.4
40 5.1
41 5.0
42 4.5
43 4.4
44 5.0
45 5.1
46 4.8
47 5.1
48 4.6
49 5.3
50 5.0
51 7.0
52 6.4
53 6.9
54 5.5
55 6.5
56 5.7
57 6.3
58 4.9
59 6.6
60 5.2
61 5.0
62 5.9
63 6.0
64 6.1
65 5.6
66 6.7
67 5.6
68 5.8
69 6.2
70 5.6
71 5.9
72 6.1
73 6.3
74 6.1
75 6.4
76 6.6
77 6.8
78 6.7
79 6.0
80 5.7
81 5.5
82 5.5
83 5.8
84 6.0
85 5.4
86 6.0
87 6.7
88 6.3
89 5.6
90 5.5
91 5.5
92 6.1
93 5.8
94 5.0
95 5.6
96 5.7
97 5.7
98 6.2
99 5.1
100 5.7
101 6.3
102 5.8
103 7.1
104 6.3
105 6.5
106 7.6
107 4.9
108 7.3
109 6.7
110 7.2
111 6.5
112 6.4
113 6.8
114 5.7
115 5.8
116 6.4
117 6.5
118 7.7
119 7.7
120 6.0
121 6.9
122 5.6
123 7.7
124 6.3
125 6.7
126 7.2
127 6.2
128 6.1
129 6.4
130 7.2
131 7.4
132 7.9
133 6.4
134 6.3
135 6.1
136 7.7
137 6.3
138 6.4
139 6.0
140 6.9
141 6.7
142 6.9
143 5.8
144 6.8
145 6.7
146 6.7
147 6.3
148 6.5
149 6.2
150 5.9
[1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
[19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
[37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
[55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
[73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
[91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
[109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
[127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9
typeof(data$Sepal.Length)
Step 3: Preview the dataset (see the first few rows)
We preview the data to understand:
what columns exist,
what the values look like,
whether the dataset was loaded correctly.
head(data, n = 10) displays the first 10 rows.
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
Sepal.Length Sepal.Width Petal.Length
1 5.1 3.5 1.4
2 4.9 3.0 1.4
3 4.7 3.2 1.3
4 4.6 3.1 1.5
5 5.0 3.6 1.4
Step 4: Inspect the categorical variable (Species)
This is our categorical analysis step.
table(data$Species) counts how many observations belong to each species.
Why do this?
It confirms that Species has the groups we expect.
It helps us understand whether the data is balanced (equal counts per category).
setosa versicolor virginica
50 50 50
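The counts above can also be produced with dplyr, which was loaded in Step 1. The sketch below shows both approaches side by side; the variable names species_table, species_counts, and is_balanced are illustrative and do not appear elsewhere in this walkthrough.

```r
library(dplyr)

data <- iris

# Base R: frequency table of the categorical column
species_table <- table(data$Species)

# dplyr equivalent: one row per species, with the count in column n
species_counts <- data %>% count(Species)

# TRUE when every category has the same number of observations
is_balanced <- length(unique(species_counts$n)) == 1

print(species_table)
print(species_counts)
print(is_balanced)
```

count() returns a data frame rather than a table object, which can be more convenient if the counts are needed in later dplyr steps.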
Step 5: Create a basic scatter plot (no categories yet)
A scatter plot shows the relationship between two numeric variables.
Here we plot:
X-axis: Sepal.Length
Y-axis: Sepal.Width
Important point:
Each dot represents one flower (one row in the dataset).
ggplot(data, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()
At this stage, all points are the same color, so we cannot see species-based grouping yet.
Step 6: Add categorical grouping using color = Species
Now we include the categorical variable:
color = Species tells ggplot2 to assign a different color to each species.
What changes?
The plot now visually separates the three species based on color.
This is the main “categorical analysis” idea: we can see if different groups cluster differently.
ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point()
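The visual grouping in the plot can be complemented with a numeric summary per species. The following is a minimal sketch using dplyr; the summary column names (mean_sepal_length, mean_sepal_width) are chosen here for illustration only.

```r
library(dplyr)

data <- iris

# Average sepal measurements for each species.
# Groups whose means differ clearly tend to form separate clusters in the plot.
species_means <- data %>%
  group_by(Species) %>%
  summarise(
    mean_sepal_length = mean(Sepal.Length),
    mean_sepal_width  = mean(Sepal.Width),
    .groups = "drop"
  )

print(species_means)
```

Comparing these means with the colored scatter plot is one way to check whether an apparent cluster reflects a real difference between groups.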
Step 7: Improve point visibility (size and transparency)
We adjust how points look:
size = 3 makes each dot bigger, so it is easier to see.
alpha = 0.7 makes dots slightly transparent, which helps when points overlap.
Why transparency helps:
If many points overlap in the same region, transparency makes dense areas more visible.
ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 3, alpha = 0.7)
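To see why overlap matters for this dataset, one can count how many flowers share exactly the same pair of sepal measurements. This is a small sketch in base R; the names coords, n_overlapping, and n_distinct_positions are illustrative.

```r
data <- iris

# Keep only the two columns that define a point's position in the plot
coords <- data[, c("Sepal.Length", "Sepal.Width")]

# duplicated() flags rows whose (Sepal.Length, Sepal.Width) pair
# has already appeared earlier in the data
n_overlapping <- sum(duplicated(coords))

# Number of distinct positions actually visible in the scatter plot
n_distinct_positions <- nrow(unique(coords))

print(n_overlapping)
print(n_distinct_positions)
```

Any non-zero value of n_overlapping means some dots sit exactly on top of others, which is the situation where alpha below 1 makes stacked points appear darker.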
Step 8: Add labels, apply a clean theme, and move the legend
labs() sets the plot title, the axis labels, and the legend title.
Themes control the background, grid lines, and text styling.
theme_minimal() removes heavy backgrounds and gives a clean look.
theme(legend.position = "top") moves the legend above the plot.
Why move the legend?
When the legend is at the top, it is often easier to notice and read, especially in presentations.
ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(
    title = "Scatter Plot of Sepal Dimensions",
    x = "Sepal Length",
    y = "Sepal Width",
    color = "Species"
  ) +
  theme_minimal() +
  theme(legend.position = "top")
Discussion Questions
Do you see clusters of points by species?
Which species appears most separated from the others?
What happens if you plot Petal.Length vs Petal.Width instead?
What changes if you remove alpha or increase size further?
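As a starting point for the third question, the final plot can be adapted by swapping the two aesthetic mappings. This is only a sketch: the object name petal_plot and the title text are assumptions made for illustration, and interpreting the result is left to the reader.

```r
library(ggplot2)

# Same layers as the sepal plot, with the petal columns mapped to x and y
petal_plot <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(
    title = "Scatter Plot of Petal Dimensions",  # illustrative title
    x = "Petal Length",
    y = "Petal Width",
    color = "Species"
  ) +
  theme_minimal() +
  theme(legend.position = "top")

print(petal_plot)
```

Storing the plot in a variable before printing makes it easy to try variations, such as changing size or dropping alpha, by adding or replacing layers.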