Write an R script to create a scatter plot, incorporating categorical analysis through color-coded data points representing different groups, using ggplot2.
ggplot2 is used to build plots layer by layer (we will use it to create the scatter plot).
dplyr provides functions for exploring and summarizing data (we will use it to understand the categories in the dataset).
library(ggplot2)
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
    filter, lag
The following objects are masked from 'package:base':
    intersect, setdiff, setequal, union
We use the built-in dataset iris.
What this dataset contains:
- Species is a categorical variable with 3 groups:
  - setosa
  - versicolor
  - virginica
- Sepal.Length and Sepal.Width are numeric measurements that we will plot.
data <- iris
We preview the data to understand it:
head(data, 10)
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
tail(data, 10)
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
141          6.7         3.1          5.6         2.4 virginica
142          6.9         3.1          5.1         2.3 virginica
143          5.8         2.7          5.1         1.9 virginica
144          6.8         3.2          5.9         2.3 virginica
145          6.7         3.3          5.7         2.5 virginica
146          6.7         3.0          5.2         2.3 virginica
147          6.3         2.5          5.0         1.9 virginica
148          6.5         3.0          5.2         2.0 virginica
149          6.2         3.4          5.4         2.3 virginica
150          5.9         3.0          5.1         1.8 virginica
str(data)
'data.frame': 150 obs. of 5 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(data)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
 Median :5.800   Median :3.000   Median :4.350   Median :1.300
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
       Species
 setosa    :50
 versicolor:50
 virginica :50
names(data)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
dim(data)
[1] 150   5
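Alongside the inspection calls above, it can help to confirm the column types and check for missing values before plotting. The following is a minimal sketch in base R; the names col_classes and na_counts are illustrative and not part of the original script.

```r
# Work on a copy of the built-in iris dataset, as in the steps above
data <- iris

# Class of each column: the measurements are numeric, Species is a factor
col_classes <- sapply(data, class)
print(col_classes)

# Number of missing values in each column
na_counts <- colSums(is.na(data))
print(na_counts)

# Group labels stored in the Species factor
print(levels(data$Species))
```

If any entry of na_counts were non-zero, ggplot2 would drop those rows from the scatter plot with a warning, so this check is worth running first on other datasets.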
data[1]
Sepal.Length
1 5.1
2 4.9
3 4.7
4 4.6
5 5.0
6 5.4
7 4.6
8 5.0
9 4.4
10 4.9
11 5.4
12 4.8
13 4.8
14 4.3
15 5.8
16 5.7
17 5.4
18 5.1
19 5.7
20 5.1
21 5.4
22 5.1
23 4.6
24 5.1
25 4.8
26 5.0
27 5.0
28 5.2
29 5.2
30 4.7
31 4.8
32 5.4
33 5.2
34 5.5
35 4.9
36 5.0
37 5.5
38 4.9
39 4.4
40 5.1
41 5.0
42 4.5
43 4.4
44 5.0
45 5.1
46 4.8
47 5.1
48 4.6
49 5.3
50 5.0
51 7.0
52 6.4
53 6.9
54 5.5
55 6.5
56 5.7
57 6.3
58 4.9
59 6.6
60 5.2
61 5.0
62 5.9
63 6.0
64 6.1
65 5.6
66 6.7
67 5.6
68 5.8
69 6.2
70 5.6
71 5.9
72 6.1
73 6.3
74 6.1
75 6.4
76 6.6
77 6.8
78 6.7
79 6.0
80 5.7
81 5.5
82 5.5
83 5.8
84 6.0
85 5.4
86 6.0
87 6.7
88 6.3
89 5.6
90 5.5
91 5.5
92 6.1
93 5.8
94 5.0
95 5.6
96 5.7
97 5.7
98 6.2
99 5.1
100 5.7
101 6.3
102 5.8
103 7.1
104 6.3
105 6.5
106 7.6
107 4.9
108 7.3
109 6.7
110 7.2
111 6.5
112 6.4
113 6.8
114 5.7
115 5.8
116 6.4
117 6.5
118 7.7
119 7.7
120 6.0
121 6.9
122 5.6
123 7.7
124 6.3
125 6.7
126 7.2
127 6.2
128 6.1
129 6.4
130 7.2
131 7.4
132 7.9
133 6.4
134 6.3
135 6.1
136 7.7
137 6.3
138 6.4
139 6.0
140 6.9
141 6.7
142 6.9
143 5.8
144 6.8
145 6.7
146 6.7
147 6.3
148 6.5
149 6.2
150 5.9
data$Sepal.Length
[1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
[19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
[37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
[55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
[73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
[91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
[109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
[127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9
typeof(data$Sepal.Length)
[1] "double"
data[][1]
Sepal.Length
1 5.1
2 4.9
3 4.7
4 4.6
5 5.0
6 5.4
7 4.6
8 5.0
9 4.4
10 4.9
11 5.4
12 4.8
13 4.8
14 4.3
15 5.8
16 5.7
17 5.4
18 5.1
19 5.7
20 5.1
21 5.4
22 5.1
23 4.6
24 5.1
25 4.8
26 5.0
27 5.0
28 5.2
29 5.2
30 4.7
31 4.8
32 5.4
33 5.2
34 5.5
35 4.9
36 5.0
37 5.5
38 4.9
39 4.4
40 5.1
41 5.0
42 4.5
43 4.4
44 5.0
45 5.1
46 4.8
47 5.1
48 4.6
49 5.3
50 5.0
51 7.0
52 6.4
53 6.9
54 5.5
55 6.5
56 5.7
57 6.3
58 4.9
59 6.6
60 5.2
61 5.0
62 5.9
63 6.0
64 6.1
65 5.6
66 6.7
67 5.6
68 5.8
69 6.2
70 5.6
71 5.9
72 6.1
73 6.3
74 6.1
75 6.4
76 6.6
77 6.8
78 6.7
79 6.0
80 5.7
81 5.5
82 5.5
83 5.8
84 6.0
85 5.4
86 6.0
87 6.7
88 6.3
89 5.6
90 5.5
91 5.5
92 6.1
93 5.8
94 5.0
95 5.6
96 5.7
97 5.7
98 6.2
99 5.1
100 5.7
101 6.3
102 5.8
103 7.1
104 6.3
105 6.5
106 7.6
107 4.9
108 7.3
109 6.7
110 7.2
111 6.5
112 6.4
113 6.8
114 5.7
115 5.8
116 6.4
117 6.5
118 7.7
119 7.7
120 6.0
121 6.9
122 5.6
123 7.7
124 6.3
125 6.7
126 7.2
127 6.2
128 6.1
129 6.4
130 7.2
131 7.4
132 7.9
133 6.4
134 6.3
135 6.1
136 7.7
137 6.3
138 6.4
139 6.0
140 6.9
141 6.7
142 6.9
143 5.8
144 6.8
145 6.7
146 6.7
147 6.3
148 6.5
149 6.2
150 5.9
data[2][]
Sepal.Width
1 3.5
2 3.0
3 3.2
4 3.1
5 3.6
6 3.9
7 3.4
8 3.4
9 2.9
10 3.1
11 3.7
12 3.4
13 3.0
14 3.0
15 4.0
16 4.4
17 3.9
18 3.5
19 3.8
20 3.8
21 3.4
22 3.7
23 3.6
24 3.3
25 3.4
26 3.0
27 3.4
28 3.5
29 3.4
30 3.2
31 3.1
32 3.4
33 4.1
34 4.2
35 3.1
36 3.2
37 3.5
38 3.6
39 3.0
40 3.4
41 3.5
42 2.3
43 3.2
44 3.5
45 3.8
46 3.0
47 3.8
48 3.2
49 3.7
50 3.3
51 3.2
52 3.2
53 3.1
54 2.3
55 2.8
56 2.8
57 3.3
58 2.4
59 2.9
60 2.7
61 2.0
62 3.0
63 2.2
64 2.9
65 2.9
66 3.1
67 3.0
68 2.7
69 2.2
70 2.5
71 3.2
72 2.8
73 2.5
74 2.8
75 2.9
76 3.0
77 2.8
78 3.0
79 2.9
80 2.6
81 2.4
82 2.4
83 2.7
84 2.7
85 3.0
86 3.4
87 3.1
88 2.3
89 3.0
90 2.5
91 2.6
92 3.0
93 2.6
94 2.3
95 2.7
96 3.0
97 2.9
98 2.9
99 2.5
100 2.8
101 3.3
102 2.7
103 3.0
104 2.9
105 3.0
106 3.0
107 2.5
108 2.9
109 2.5
110 3.6
111 3.2
112 2.7
113 3.0
114 2.5
115 2.8
116 3.2
117 3.0
118 3.8
119 2.6
120 2.2
121 3.2
122 2.8
123 2.8
124 2.7
125 3.3
126 3.2
127 2.8
128 3.0
129 2.8
130 3.0
131 2.8
132 3.8
133 2.8
134 2.8
135 2.6
136 3.0
137 3.4
138 3.1
139 3.0
140 3.1
141 3.1
142 3.1
143 2.7
144 3.2
145 3.3
146 3.0
147 2.5
148 3.0
149 3.4
150 3.0
data[1:5, 1:3]
  Sepal.Length Sepal.Width Petal.Length
1          5.1         3.5          1.4
2          4.9         3.0          1.4
3          4.7         3.2          1.3
4          4.6         3.1          1.5
5          5.0         3.6          1.4
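The indexing calls above return different kinds of objects, which matters when the result is passed to another function. The sketch below checks what each form returns; the variable names are illustrative only.

```r
data <- iris

# Single brackets with one index keep the data frame structure
one_col_df <- data[1]
print(class(one_col_df))    # "data.frame"

# $ and [[ ]] extract the column itself as a numeric vector
one_col_vec <- data$Sepal.Length
print(class(one_col_vec))   # "numeric"
print(identical(one_col_vec, data[[1]]))

# Empty brackets select everything, so data[][1] matches data[1]
print(identical(data[][1], data[1]))

# [rows, columns] selects a rectangular block: first 5 rows, first 3 columns
block <- data[1:5, 1:3]
print(dim(block))
```

This is why data[1] prints as a one-column table with row numbers, while data$Sepal.Length prints as a plain vector of values.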
This is our categorical analysis step.
table(data$Species) counts how many observations belong to each species.
Why do this?
- To confirm that Species has the groups we expect.
table(data$Species)
    setosa versicolor  virginica
        50         50         50
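The same categorical check can also be written with dplyr, which was loaded at the start. The sketch below is one possible equivalent of table(data$Species), followed by a per-species average of the two sepal measurements; the names species_counts, species_means, mean_sepal_length and mean_sepal_width are illustrative.

```r
library(dplyr)

data <- iris

# Count observations per species (same information as table(data$Species))
species_counts <- data %>%
  count(Species)
print(species_counts)

# Mean sepal measurements per species, as a preview of how the groups differ
species_means <- data %>%
  group_by(Species) %>%
  summarise(
    mean_sepal_length = mean(Sepal.Length),
    mean_sepal_width  = mean(Sepal.Width)
  )
print(species_means)
```

Unlike table(), count() returns a data frame, which is convenient if the counts need to be filtered, joined or plotted later.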
A scatter plot shows the relationship between two numeric variables.
Here we plot:
- Sepal.Length
- Sepal.Width
Important point:
ggplot(data, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()
Now we include the categorical variable:
- color = Species tells ggplot2 to assign a different color to each species.
What changes?
ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point()
We adjust how points look:
- size = 3 makes each dot bigger, so it is easier to see.
- alpha = 0.7 makes dots slightly transparent, which helps when points overlap.
Why transparency helps:
ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 3, alpha = 0.7)
Good plots should clearly communicate what the viewer is seeing.
labs() adds:
- title for the plot heading
- x and y axis labels
- color legend title (so the legend has a meaningful name)
ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(
    title = "Scatter Plot of Sepal Dimensions",
    x = "Sepal Length",
    y = "Sepal Width",
    color = "Species"
  )
Themes control the background, grids, and text styling.
- theme_minimal() removes heavy backgrounds and gives a clean look.
- theme(legend.position = "top") moves the legend above the plot.
Why move the legend?
ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(
    title = "Scatter Plot of Sepal Dimensions",
    x = "Sepal Length",
    y = "Sepal Width",
    color = "Species"
  ) +
  theme_minimal() +
  theme(legend.position = "top")
Do you see clusters of points by species?
Yes
Which species appears most separated from the others?
Setosa
What happens if you plot Petal.Length vs Petal.Width instead?
Setosa becomes completely separate. Versicolor and Virginica also become much more distinguishable. The clusters become much cleaner.
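To try the petal version yourself, the sketch below reuses the same layers as the sepal plot and swaps in the petal columns. The plot title, the axis labels, and the names petal_plot and petal_ranges are illustrative choices, not taken from the original script.

```r
library(ggplot2)

data <- iris

# Same layers as the sepal plot, with the petal measurements on the axes
petal_plot <- ggplot(data, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(
    title = "Scatter Plot of Petal Dimensions",
    x = "Petal Length",
    y = "Petal Width",
    color = "Species"
  ) +
  theme_minimal() +
  theme(legend.position = "top")
print(petal_plot)

# Numeric check of the separation: min and max petal length per species
petal_ranges <- sapply(split(data$Petal.Length, data$Species), range)
print(petal_ranges)
```

In petal_ranges, the largest setosa value lies below the smallest value of the other two species, which is the gap that shows up visually in the plot.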
What changes if you remove alpha or increase size further?
It reduces readability: without alpha, overlapping points hide one another, and larger points overlap more.
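One way to see this effect is to draw the variants side by side. The sketch below builds two alternatives to the final plot; size = 6 is an arbitrary value picked for illustration, and the names base_plot, opaque_plot, large_plot and n_overlapping are hypothetical.

```r
library(ggplot2)

data <- iris

# Base plot shared by both variants
base_plot <- ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, color = Species))

# Variant 1: no alpha set, so points are drawn fully opaque
opaque_plot <- base_plot + geom_point(size = 3)

# Variant 2: much larger points, keeping the original transparency
large_plot <- base_plot + geom_point(size = 6, alpha = 0.7)

print(opaque_plot)
print(large_plot)

# Some flowers share the same sepal measurements, so their points sit
# exactly on top of each other; this counts those repeated pairs
n_overlapping <- sum(duplicated(data[, c("Sepal.Length", "Sepal.Width")]))
print(n_overlapping)
```

Because n_overlapping is greater than zero, some points are always stacked; with opaque points only the top one is visible, while transparency makes stacked points appear darker.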