Program 2

Author

Stephen George

Write an R script to create a scatter plot, incorporating categorical analysis though color-coded data points representing different groups, using ggplot2

Step 1: Load libraries We had two libraries:

  • ggplot is used to build plots layer-by-layer (we will use it to create the scatter plot).
  • dplyr provides functions for exploring and summarizing data (we will use it to understand the categories in the dataset).
library(ggplot2)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Step 2: Load the dataset(iris)

We use the built-in dataset iris.

What this dataset contains:

  • Each row is one flower sample (on observation).
  • There are 150 total observations.
  • The column Species is a categorical variable with 3 groups:
    • setosa
    • versicolor
    • virginica
  • The columns Sepal.Length and Sepal.Width are numeric measurements that we will plot.
data <- iris

Step 3: Preview the dataset (see the first few rows)

We preview the data to understand:

head(data,10)
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
tail(data,10)
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
141          6.7         3.1          5.6         2.4 virginica
142          6.9         3.1          5.1         2.3 virginica
143          5.8         2.7          5.1         1.9 virginica
144          6.8         3.2          5.9         2.3 virginica
145          6.7         3.3          5.7         2.5 virginica
146          6.7         3.0          5.2         2.3 virginica
147          6.3         2.5          5.0         1.9 virginica
148          6.5         3.0          5.2         2.0 virginica
149          6.2         3.4          5.4         2.3 virginica
150          5.9         3.0          5.1         1.8 virginica
str(data)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(data)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                
names(data)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     
dim(data)
[1] 150   5
data[1]
    Sepal.Length
1            5.1
2            4.9
3            4.7
4            4.6
5            5.0
6            5.4
7            4.6
8            5.0
9            4.4
10           4.9
11           5.4
12           4.8
13           4.8
14           4.3
15           5.8
16           5.7
17           5.4
18           5.1
19           5.7
20           5.1
21           5.4
22           5.1
23           4.6
24           5.1
25           4.8
26           5.0
27           5.0
28           5.2
29           5.2
30           4.7
31           4.8
32           5.4
33           5.2
34           5.5
35           4.9
36           5.0
37           5.5
38           4.9
39           4.4
40           5.1
41           5.0
42           4.5
43           4.4
44           5.0
45           5.1
46           4.8
47           5.1
48           4.6
49           5.3
50           5.0
51           7.0
52           6.4
53           6.9
54           5.5
55           6.5
56           5.7
57           6.3
58           4.9
59           6.6
60           5.2
61           5.0
62           5.9
63           6.0
64           6.1
65           5.6
66           6.7
67           5.6
68           5.8
69           6.2
70           5.6
71           5.9
72           6.1
73           6.3
74           6.1
75           6.4
76           6.6
77           6.8
78           6.7
79           6.0
80           5.7
81           5.5
82           5.5
83           5.8
84           6.0
85           5.4
86           6.0
87           6.7
88           6.3
89           5.6
90           5.5
91           5.5
92           6.1
93           5.8
94           5.0
95           5.6
96           5.7
97           5.7
98           6.2
99           5.1
100          5.7
101          6.3
102          5.8
103          7.1
104          6.3
105          6.5
106          7.6
107          4.9
108          7.3
109          6.7
110          7.2
111          6.5
112          6.4
113          6.8
114          5.7
115          5.8
116          6.4
117          6.5
118          7.7
119          7.7
120          6.0
121          6.9
122          5.6
123          7.7
124          6.3
125          6.7
126          7.2
127          6.2
128          6.1
129          6.4
130          7.2
131          7.4
132          7.9
133          6.4
134          6.3
135          6.1
136          7.7
137          6.3
138          6.4
139          6.0
140          6.9
141          6.7
142          6.9
143          5.8
144          6.8
145          6.7
146          6.7
147          6.3
148          6.5
149          6.2
150          5.9
data$Sepal.Length
  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
 [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
 [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
 [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
 [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
 [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
[109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
[127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9
typeof(data$Sepal.Length)
[1] "double"
data[][1]
    Sepal.Length
1            5.1
2            4.9
3            4.7
4            4.6
5            5.0
6            5.4
7            4.6
8            5.0
9            4.4
10           4.9
11           5.4
12           4.8
13           4.8
14           4.3
15           5.8
16           5.7
17           5.4
18           5.1
19           5.7
20           5.1
21           5.4
22           5.1
23           4.6
24           5.1
25           4.8
26           5.0
27           5.0
28           5.2
29           5.2
30           4.7
31           4.8
32           5.4
33           5.2
34           5.5
35           4.9
36           5.0
37           5.5
38           4.9
39           4.4
40           5.1
41           5.0
42           4.5
43           4.4
44           5.0
45           5.1
46           4.8
47           5.1
48           4.6
49           5.3
50           5.0
51           7.0
52           6.4
53           6.9
54           5.5
55           6.5
56           5.7
57           6.3
58           4.9
59           6.6
60           5.2
61           5.0
62           5.9
63           6.0
64           6.1
65           5.6
66           6.7
67           5.6
68           5.8
69           6.2
70           5.6
71           5.9
72           6.1
73           6.3
74           6.1
75           6.4
76           6.6
77           6.8
78           6.7
79           6.0
80           5.7
81           5.5
82           5.5
83           5.8
84           6.0
85           5.4
86           6.0
87           6.7
88           6.3
89           5.6
90           5.5
91           5.5
92           6.1
93           5.8
94           5.0
95           5.6
96           5.7
97           5.7
98           6.2
99           5.1
100          5.7
101          6.3
102          5.8
103          7.1
104          6.3
105          6.5
106          7.6
107          4.9
108          7.3
109          6.7
110          7.2
111          6.5
112          6.4
113          6.8
114          5.7
115          5.8
116          6.4
117          6.5
118          7.7
119          7.7
120          6.0
121          6.9
122          5.6
123          7.7
124          6.3
125          6.7
126          7.2
127          6.2
128          6.1
129          6.4
130          7.2
131          7.4
132          7.9
133          6.4
134          6.3
135          6.1
136          7.7
137          6.3
138          6.4
139          6.0
140          6.9
141          6.7
142          6.9
143          5.8
144          6.8
145          6.7
146          6.7
147          6.3
148          6.5
149          6.2
150          5.9
data[2][]
    Sepal.Width
1           3.5
2           3.0
3           3.2
4           3.1
5           3.6
6           3.9
7           3.4
8           3.4
9           2.9
10          3.1
11          3.7
12          3.4
13          3.0
14          3.0
15          4.0
16          4.4
17          3.9
18          3.5
19          3.8
20          3.8
21          3.4
22          3.7
23          3.6
24          3.3
25          3.4
26          3.0
27          3.4
28          3.5
29          3.4
30          3.2
31          3.1
32          3.4
33          4.1
34          4.2
35          3.1
36          3.2
37          3.5
38          3.6
39          3.0
40          3.4
41          3.5
42          2.3
43          3.2
44          3.5
45          3.8
46          3.0
47          3.8
48          3.2
49          3.7
50          3.3
51          3.2
52          3.2
53          3.1
54          2.3
55          2.8
56          2.8
57          3.3
58          2.4
59          2.9
60          2.7
61          2.0
62          3.0
63          2.2
64          2.9
65          2.9
66          3.1
67          3.0
68          2.7
69          2.2
70          2.5
71          3.2
72          2.8
73          2.5
74          2.8
75          2.9
76          3.0
77          2.8
78          3.0
79          2.9
80          2.6
81          2.4
82          2.4
83          2.7
84          2.7
85          3.0
86          3.4
87          3.1
88          2.3
89          3.0
90          2.5
91          2.6
92          3.0
93          2.6
94          2.3
95          2.7
96          3.0
97          2.9
98          2.9
99          2.5
100         2.8
101         3.3
102         2.7
103         3.0
104         2.9
105         3.0
106         3.0
107         2.5
108         2.9
109         2.5
110         3.6
111         3.2
112         2.7
113         3.0
114         2.5
115         2.8
116         3.2
117         3.0
118         3.8
119         2.6
120         2.2
121         3.2
122         2.8
123         2.8
124         2.7
125         3.3
126         3.2
127         2.8
128         3.0
129         2.8
130         3.0
131         2.8
132         3.8
133         2.8
134         2.8
135         2.6
136         3.0
137         3.4
138         3.1
139         3.0
140         3.1
141         3.1
142         3.1
143         2.7
144         3.2
145         3.3
146         3.0
147         2.5
148         3.0
149         3.4
150         3.0
data[1:5, 1:3]
  Sepal.Length Sepal.Width Petal.Length
1          5.1         3.5          1.4
2          4.9         3.0          1.4
3          4.7         3.2          1.3
4          4.6         3.1          1.5
5          5.0         3.6          1.4

Step 4: Inspect the categorical variable (Species)

This is our categorical analysis step.

table(data$Species) counts how many observations belong to each species.

Why do this?

  • It confirms that Species has the groups we expect.
  • It helps us understand whether the
table(data$Species)

    setosa versicolor  virginica 
        50         50         50 

Step 5: Create a basic scatter plot (no categories yet)

A scatter plot shows the relationship between two numeric variables.

Here we plot:

  • X-axis: Sepal.Length
  • Y-axis: Sepal.Width

Important point:

  • Each dot represents one flower (one row in the dataset).
ggplot(data, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()


Step 6: Add categorical grouping using color = Species

Now we include the categorical variable:

  • color = Sepcies tells ggplot2 to assign a different color to each species.

What changes?

  • The plot now visually separates the three species based on color.
  • This is the main “categorical analysis” idea: we can see if different groups cluster differently.
ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point()


Step 7: Improve point visibility (size and transparency)

We adjust how points look:

  • size = 3 makes each dot bigger, so it is easier to see.
  • alpha = 0.7 makes dots slightly transparent, which helps when points overlap.

Why transparency helps:

  • If many points overlap in the same region, transparency makes dense areas more visible.
ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 3, alpha = 0.7)


Step 8: Add informative labels (title, axes, legend)

Good plots should clearly communicate what the viewer is seeing.

labs() adds:

  • title for the plot heading.
  • x and y axis labels
  • color legend title (so the legend has a meaningful name)
ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(
    title = "Scatter Plot of Sepal Dimensions",
    x = "Sepal Length",
    y = "Sepal Width",
    color = "Species"
  )


Step 9: Apply a clean theme and move the legend

Themes control the background, grids, and text styling.

  • theme_minimal() removes heavy backgrounds and gives a clean look.
  • theme(legend.position = "top") moves the legend above the plot.

Why move the legend?

  • When the legend is at the top, it is often easier to notice and read, especially in presentations.
ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(
    title = "Scatter Plot of Sepal Dimensions",
    x = "Sepal Length",
    y = "Sepal Width",
    color = "Species"
  ) +
  theme_minimal() +
  theme(legend.position = "top")


Discussion Questions

  1. Do you see clusters of points by species?

    Yes

  2. Which species appears most separated from the others?

    Setosa

  3. What happens if you plot Petal.Length vs Petal.Width instead?

    Setosa becomes completely separate. Versicolor and Virginica also become much more distinguishable. The clusters become much cleaner.

  4. What changes if you remove aplha or increase size further?

    It reduces readablility.