library(ggplot2)
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
We load two libraries:
- ggplot2 is used to build plots layer by layer (we will use it to create the scatter plot).
- dplyr provides functions for exploring and summarizing data (we will use it to understand the categories in the dataset).
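The masking messages printed when dplyr is attached mean that plain filter() and lag() now refer to the dplyr versions. As a minimal sketch, either version can still be called explicitly with the :: operator; the variable names `setosa` and `smoothed` are illustrative and not part of the original tutorial.

```r
library(dplyr)

# dplyr::filter() keeps the rows of a data frame that match a condition.
setosa <- dplyr::filter(iris, Species == "setosa")
nrow(setosa)

# stats::filter() is the masked function: a linear filter for numeric series.
# Here it computes a 3-point moving average.
smoothed <- stats::filter(1:5, rep(1 / 3, 3))
smoothed
```

Writing the package name in front of the function removes any doubt about which filter() is being used.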
We use the built-in dataset iris.
What this dataset contains:
- Each row is one flower sample (an observation).
- There are 150 observations in total.
- The column Species is a categorical variable with 3 groups:
  - setosa
  - versicolor
  - virginica
- The columns Sepal.Length and Sepal.Width are numeric measurements that we will plot.
data <- iris
head(data, n = 10)
 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
tail(data, n = 10)
 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
141 6.7 3.1 5.6 2.4 virginica
142 6.9 3.1 5.1 2.3 virginica
143 5.8 2.7 5.1 1.9 virginica
144 6.8 3.2 5.9 2.3 virginica
145 6.7 3.3 5.7 2.5 virginica
146 6.7 3.0 5.2 2.3 virginica
147 6.3 2.5 5.0 1.9 virginica
148 6.5 3.0 5.2 2.0 virginica
149 6.2 3.4 5.4 2.3 virginica
150 5.9 3.0 5.1 1.8 virginica
str(data)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(data)
 Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
names(data)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
dim(data)
[1] 150 5
data[1]
 Sepal.Length
1 5.1
2 4.9
3 4.7
4 4.6
5 5.0
6 5.4
7 4.6
8 5.0
9 4.4
10 4.9
11 5.4
12 4.8
13 4.8
14 4.3
15 5.8
16 5.7
17 5.4
18 5.1
19 5.7
20 5.1
21 5.4
22 5.1
23 4.6
24 5.1
25 4.8
26 5.0
27 5.0
28 5.2
29 5.2
30 4.7
31 4.8
32 5.4
33 5.2
34 5.5
35 4.9
36 5.0
37 5.5
38 4.9
39 4.4
40 5.1
41 5.0
42 4.5
43 4.4
44 5.0
45 5.1
46 4.8
47 5.1
48 4.6
49 5.3
50 5.0
51 7.0
52 6.4
53 6.9
54 5.5
55 6.5
56 5.7
57 6.3
58 4.9
59 6.6
60 5.2
61 5.0
62 5.9
63 6.0
64 6.1
65 5.6
66 6.7
67 5.6
68 5.8
69 6.2
70 5.6
71 5.9
72 6.1
73 6.3
74 6.1
75 6.4
76 6.6
77 6.8
78 6.7
79 6.0
80 5.7
81 5.5
82 5.5
83 5.8
84 6.0
85 5.4
86 6.0
87 6.7
88 6.3
89 5.6
90 5.5
91 5.5
92 6.1
93 5.8
94 5.0
95 5.6
96 5.7
97 5.7
98 6.2
99 5.1
100 5.7
101 6.3
102 5.8
103 7.1
104 6.3
105 6.5
106 7.6
107 4.9
108 7.3
109 6.7
110 7.2
111 6.5
112 6.4
113 6.8
114 5.7
115 5.8
116 6.4
117 6.5
118 7.7
119 7.7
120 6.0
121 6.9
122 5.6
123 7.7
124 6.3
125 6.7
126 7.2
127 6.2
128 6.1
129 6.4
130 7.2
131 7.4
132 7.9
133 6.4
134 6.3
135 6.1
136 7.7
137 6.3
138 6.4
139 6.0
140 6.9
141 6.7
142 6.9
143 5.8
144 6.8
145 6.7
146 6.7
147 6.3
148 6.5
149 6.2
150 5.9
data$Sepal.Length
[1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
[19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
[37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
[55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
[73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
[91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
[109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
[127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9
typeof(data$Sepal.Length)
[1] "double"
typeof(data[1])
[1] "list"
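The two typeof() results differ because single brackets return a one-column data frame (which is a list internally), while $ returns the column itself as a vector. A short sketch of the three common ways to select a column; the names `col_df`, `col_vec`, and `col_dol` are illustrative.

```r
data <- iris

col_df  <- data[1]            # single bracket: a data frame with one column
col_vec <- data[[1]]          # double bracket: the column as a numeric vector
col_dol <- data$Sepal.Length  # $: the same vector, selected by name

class(col_df)                 # "data.frame"
class(col_vec)                # "numeric"
identical(col_vec, col_dol)   # TRUE
```

Functions such as mean() expect a vector, so data[[1]] or data$Sepal.Length is usually the form to pass to them.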
data[][1]
 Sepal.Length
1 5.1
2 4.9
3 4.7
4 4.6
5 5.0
6 5.4
7 4.6
8 5.0
9 4.4
10 4.9
11 5.4
12 4.8
13 4.8
14 4.3
15 5.8
16 5.7
17 5.4
18 5.1
19 5.7
20 5.1
21 5.4
22 5.1
23 4.6
24 5.1
25 4.8
26 5.0
27 5.0
28 5.2
29 5.2
30 4.7
31 4.8
32 5.4
33 5.2
34 5.5
35 4.9
36 5.0
37 5.5
38 4.9
39 4.4
40 5.1
41 5.0
42 4.5
43 4.4
44 5.0
45 5.1
46 4.8
47 5.1
48 4.6
49 5.3
50 5.0
51 7.0
52 6.4
53 6.9
54 5.5
55 6.5
56 5.7
57 6.3
58 4.9
59 6.6
60 5.2
61 5.0
62 5.9
63 6.0
64 6.1
65 5.6
66 6.7
67 5.6
68 5.8
69 6.2
70 5.6
71 5.9
72 6.1
73 6.3
74 6.1
75 6.4
76 6.6
77 6.8
78 6.7
79 6.0
80 5.7
81 5.5
82 5.5
83 5.8
84 6.0
85 5.4
86 6.0
87 6.7
88 6.3
89 5.6
90 5.5
91 5.5
92 6.1
93 5.8
94 5.0
95 5.6
96 5.7
97 5.7
98 6.2
99 5.1
100 5.7
101 6.3
102 5.8
103 7.1
104 6.3
105 6.5
106 7.6
107 4.9
108 7.3
109 6.7
110 7.2
111 6.5
112 6.4
113 6.8
114 5.7
115 5.8
116 6.4
117 6.5
118 7.7
119 7.7
120 6.0
121 6.9
122 5.6
123 7.7
124 6.3
125 6.7
126 7.2
127 6.2
128 6.1
129 6.4
130 7.2
131 7.4
132 7.9
133 6.4
134 6.3
135 6.1
136 7.7
137 6.3
138 6.4
139 6.0
140 6.9
141 6.7
142 6.9
143 5.8
144 6.8
145 6.7
146 6.7
147 6.3
148 6.5
149 6.2
150 5.9
data[150, 5]
[1] virginica
Levels: setosa versicolor virginica
data[1:5, 1:3]
 Sepal.Length Sepal.Width Petal.Length
1 5.1 3.5 1.4
2 4.9 3.0 1.4
3 4.7 3.2 1.3
4 4.6 3.1 1.5
5 5.0 3.6 1.4
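The same [rows, columns] indexing also accepts column names and logical conditions, which tends to be easier to read than positions. A minimal sketch; the names `first_rows` and `wide_sepals` and the threshold of 4 are illustrative choices, not part of the original tutorial.

```r
data <- iris

# Same selection as data[1:5, 1:3], with the columns given by name.
first_rows <- data[1:5, c("Sepal.Length", "Sepal.Width", "Petal.Length")]
first_rows

# Rows chosen by a logical condition instead of by position.
wide_sepals <- data[data$Sepal.Width > 4, c("Sepal.Width", "Species")]
wide_sepals
```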
table(data[,5])
setosa versicolor virginica
50 50 50
table(data$Species)
setosa versicolor virginica
50 50 50
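The same per-species counts can also be produced with dplyr, which was loaded at the start. The following is a minimal sketch rather than the tutorial's own code; the name `species_summary` and the two mean columns are illustrative additions.

```r
library(dplyr)

data <- iris

# One row per species: the number of flowers and the mean sepal measurements.
species_summary <- data %>%
  group_by(Species) %>%
  summarise(
    n = n(),
    mean_sepal_length = mean(Sepal.Length),
    mean_sepal_width = mean(Sepal.Width)
  )

species_summary
```

Unlike table(), this returns a data frame, so further summary columns can be added alongside the counts.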
A scatter plot shows the relationship between two numeric variables.
Here we plot:
- X-axis: Sepal.Length
- Y-axis: Sepal.Width
Important points:
- Each dot represents one flower (one row in the dataset).
ggplot(data, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()
Now we include the categorical variable:
- color = Species tells ggplot2 to assign a different color to each species.
What changes?
- The plot now visually separates the three species by color.
- This is the main “categorical analysis” idea: we can see whether different groups cluster differently.
ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point()
We adjust how points look:
- size = 3 makes each dot bigger, so it is easier to see.
- alpha = 0.7 makes dots slightly transparent, which keeps overlapping points visible in dense areas.
ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 3, alpha = 0.7)
A good plot should clearly communicate what the viewer is seeing.
labs() adds:
- title for the plot heading
- x and y axis labels
- color legend title (so the legend has a meaningful name)
ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(
    title = "Scatter plot of Sepal Dimensions",
    x = "Sepal Length",
    y = "Sepal Width",
    color = "Species"
  )
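Because ggplot2 builds plots layer by layer, a plot can also be stored in a variable and extended with +. The following is a minimal sketch of that pattern; the variable names `base_plot` and `labelled_plot` are illustrative and not part of the original code.

```r
library(ggplot2)

data <- iris

# Base layers: data, aesthetic mappings, and the points.
base_plot <- ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 3, alpha = 0.7)

# Add the labels on top; base_plot itself is left unchanged.
labelled_plot <- base_plot +
  labs(
    title = "Scatter plot of Sepal Dimensions",
    x = "Sepal Length",
    y = "Sepal Width",
    color = "Species"
  )

# Printing a ggplot object draws it.
print(labelled_plot)
```

Each component must be joined to the previous one with +; a labs() call that is not connected this way is evaluated on its own and does not change the plot.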
Themes control the background, grid lines, and text styling.
- theme_minimal() removes the heavy background and gives a clean look.
- theme(legend.position = "top") moves the legend above the plot.
Why move the legend?
- When the legend is at the top, it is easier to notice and read, especially in presentations.
ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, color = Species))+
geom_point(size = 3,alpha = 0.7) +
labs(
title = "Scatter plot of Sepal Dimensions",
x = "Sepal Length",
y = "Sepal Width",
color = "Species"
)+
theme_minimal()+
  theme(legend.position = "top")
1. Do you see clusters of points by species?
2. Which species appears most separated from the others?
3. What happens if you plot Petal.Length vs Petal.Width instead?
4. What changes if you remove alpha or increase size further?
1. Yes, the points cluster by species, although versicolor and virginica overlap.
2. Setosa is the most clearly separated species.
3. Plotting Petal.Length vs Petal.Width shows much clearer separation of the species.
4. Removing alpha makes the points fully opaque, so overlapping points hide each other; increasing size further makes neighboring points overlap more.
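Questions 3 and 4 can be tried directly by changing the mappings and point settings of the final plot. A minimal sketch under the same setup; the names `petal_plot` and `opaque_plot`, the petal plot title, and the size value of 5 are illustrative choices.

```r
library(ggplot2)

data <- iris

# Question 3: the same layers, with the petal measurements on the axes.
petal_plot <- ggplot(data, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(
    title = "Scatter plot of Petal Dimensions",
    x = "Petal Length",
    y = "Petal Width",
    color = "Species"
  ) +
  theme_minimal() +
  theme(legend.position = "top")

# Question 4: no alpha (fully opaque points) and a larger point size.
opaque_plot <- ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 5)

petal_plot
opaque_plot
```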