# Libraries used in this document

library(rmarkdown)
library(readxl)
library(productplots) # For happy
library(dplyr)
library(ggplot2)
library(gridExtra)
library(Rfast)
library(Hmisc)       # For scatter plot with error bars
library(knitr)
library(kableExtra)
library(magrittr)

# Data Sets used in this document

data("happy")

carSales = read_excel("DataSets/carSales.xlsx")

Introduction to data analysis using R, R Studio and R Markdown
Dee Chiluiza, PhD
Northeastern University
Boston, Massachusetts

Short manual series:
Scatter Plots

1 Scatter Plots

In data analysis and statistics, it is often important to establish the relationship between two variables. For example: does an increase in calorie intake correlate with an increase in body weight? Do people who study longer hours obtain higher grades?

Scatter plots are diagrams that use Cartesian coordinates to display the relationship between two numerical variables. Put simply, a table can contain information for two variables: variable A and variable B. Does variable A affect the behavior of variable B? We will see this topic in more detail in the correlation and regression section. Observe the following scenarios:

If variable A increases, variable B increases.
If variable A decreases, variable B decreases.
These two scenarios refer to positive correlations, in which the two variables move in the same direction.
If variable A increases, variable B decreases.
If variable A decreases, variable B increases.
These two scenarios refer to negative correlations, in which the two variables move in opposite directions.
There is, of course, also the case in which changes in variable A do not affect the behavior of variable B; this is referred to as no correlation.
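
As a minimal sketch of these three scenarios, the made-up vectors below show how the sign of cor() reflects the direction of a relationship. The names x, y_up, y_down, and y_flat are illustrative assumptions and are not part of the data used in this manual.

```r
# Hypothetical vectors, invented only to illustrate the three scenarios
x      <- c(1, 2, 3, 4, 5, 6)
y_up   <- c(2, 4, 5, 8, 9, 12)   # tends to increase when x increases
y_down <- c(12, 9, 8, 5, 4, 2)   # tends to decrease when x increases
y_flat <- c(3, 5, 4, 4, 5, 3)    # no linear trend with x

cor_up   <- cor(x, y_up)     # positive value: same direction
cor_down <- cor(x, y_down)   # negative value: opposite directions
cor_flat <- cor(x, y_flat)   # value near zero: no linear relationship

print(c(up = cor_up, down = cor_down, flat = cor_flat))
```

Values close to 1 or -1 indicate a strong linear relationship; values close to 0 indicate little or no linear relationship.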

Using the six vectors in the R chunk below, we will create scatter plots to observe the relationships between A1 and B1, A2 and B2, and A3 and B3. In all cases, A is the independent variable and B is the dependent variable.

Observe the {r} chunk below; it contains the objects mentioned above and their corresponding data.

# Vectors used for correlation analysis and scatter plot displays.
# Set 1
A1 = c(22,23,24,26,28,45,58,64,71, 87, 110, 135)
B1 = c(125,129,134,146,157,253,324,380,425, 501, 854, 876)
# Set 2
A2 = c(2,8,23,34,43,51,63,75,96,120, 145, 160)
B2 = c(501,490,479,468,457,447,437,427,410,403, 396, 363)
# Set 3
A3 = c(2,8,23,34,43,51,63,75,96,120, 145, 160)
B3 = c(562,120,537,445,438,420,305,231,480,300, 100, 224)

# Create a table to present data

table1 = matrix(c(A1, B1, A2, B2, A3, B3), 
                nrow = 12, 
                byrow = FALSE)
colnames(table1) = c("A1", "B1", "A2", "B2", "A3", "B3")


# Present table, check addition of three follow-up codes: 
# kable(), kable_classic_2(), add_header_above(). 

table1 %>%
  kable(align = "c", 
             caption = "Table 1. The variables",
             table.attr = "style='width:60%;'") %>%
  kable_classic_2(bootstrap_options=c("hover","bordered"),
              html_font = "Cambria",
              position = "center",
              font_size = 12) %>%
  add_header_above(c("Group 1" = 2,"Group 2" = 2,"Group 3" = 2))

Table 1. The variables
Group 1		Group 2		Group 3
A1	B1	A2	B2	A3	B3
22	125	2	501	2	562
23	129	8	490	8	120
24	134	23	479	23	537
26	146	34	468	34	445
28	157	43	457	43	438
45	253	51	447	51	420
58	324	63	437	63	305
64	380	75	427	75	231
71	425	96	410	96	480
87	501	120	403	120	300
110	854	145	396	145	100
135	876	160	363	160	224
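
The table above was built with matrix() and byrow = FALSE, which fills the matrix column by column. As an alternative sketch (not the method used in this manual), cbind() should produce the same layout for the first two vectors. The object names by_matrix and by_cbind are assumptions made for this example.

```r
A1 <- c(22, 23, 24, 26, 28, 45, 58, 64, 71, 87, 110, 135)
B1 <- c(125, 129, 134, 146, 157, 253, 324, 380, 425, 501, 854, 876)

# Approach used in this document: fill a matrix column by column
by_matrix <- matrix(c(A1, B1), nrow = 12, byrow = FALSE)
colnames(by_matrix) <- c("A1", "B1")

# Alternative sketch: cbind() binds the vectors as columns
# and takes the column names from the object names
by_cbind <- cbind(A1, B1)

# Compare the two results cell by cell
same_values <- all(by_matrix == by_cbind)
print(same_values)
```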

I will not explain correlation or linear regression in detail in this section; for more information, please read the corresponding section.

# Correlations
correlation_1 = cor(B1,A1)
correlation_2 = cor(B2,A2)
correlation_3 = cor(B3,A3)

Correlation between A1 and B1 is 0.984.
This is an example of a strong positive correlation.

Correlation between A2 and B2 is -0.983.
This is an example of a strong negative correlation.

Correlation between A3 and B3 is -0.508.
This is an example of a weak negative correlation.
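
Unlike plot(), the order of the two variables does not matter inside cor(). The sketch below checks this for set 1 and shows one possible way to round a value to three decimals for reporting; the object names r1, r1_swapped, and r3 are assumptions made for this example.

```r
A1 <- c(22, 23, 24, 26, 28, 45, 58, 64, 71, 87, 110, 135)
B1 <- c(125, 129, 134, 146, 157, 253, 324, 380, 425, 501, 854, 876)
A3 <- c(2, 8, 23, 34, 43, 51, 63, 75, 96, 120, 145, 160)
B3 <- c(562, 120, 537, 445, 438, 420, 305, 231, 480, 300, 100, 224)

# Pearson correlation is symmetric: swapping the arguments gives the same value
r1         <- cor(B1, A1)
r1_swapped <- cor(A1, B1)
print(isTRUE(all.equal(r1, r1_swapped)))

# Round to three decimals before reporting the values in the text
r3 <- cor(B3, A3)
print(round(c(set1 = r1, set3 = r3), 3))
```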

2 Create a Scatter Plot

To create a scatter plot, we use code plot(), where we separate the variables using a tilde (~). The order of the variables is very important: always start with the dependent variable (it will appear on the y-axis), then write the independent variable (it will appear on the x-axis).
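
As a small sketch of why the order matters, the example below draws the same made-up data twice: once with the formula form (dependent ~ independent) and once with the positional form plot(x, y), where the order is reversed. The vectors x_var and y_var are hypothetical and used only for illustration.

```r
# Hypothetical data, invented for this example
x_var <- c(1, 2, 3, 4, 5)   # independent variable
y_var <- c(3, 5, 4, 8, 9)   # dependent variable

par(mfrow = c(1, 2))

# Formula form: dependent variable first, then the independent variable
plot(y_var ~ x_var, main = "Formula: y ~ x")

# Positional form: x comes first, then y; both calls draw the same points
plot(x_var, y_var, main = "Positional: x, y")

# all.vars() lists the variables of a formula; the left-hand side comes first
formula_vars <- all.vars(y_var ~ x_var)
print(formula_vars)
```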

2.1 Plot a basic scatter plot

Observe the codes below. At this point, you are familiar with the par(mfrow = ) code combination to present figures in a matrix; in this case, one row and two columns, mfrow=c(1,2).

We are using the plots below to observe the relationships between variables A1-B1 (group 1) and A2-B2 (group 2).

par(mfrow=c(1,2), mai=c(0.6, 0.8, 0.5, 0.4), mar=c(4,4,1,1))
# Plot 1
plot(B1~A1)
# Plot 2
plot(B2~A2)

2.2 Improve graph presentation

The two plots above were created using the very basic code plot() with the name of the variables inside. We will start with some common changes using plot # 1:

Change direction of y-axis values (las=).
Improve x- and y-axes labels (xlab, ylab).
Change x- and y-axes limits to improve data visualization (xlim, ylim).
Change the color of data points (col).
Change the shape of the data points (pch).

For the last item, we use the pch argument; check the available options using ?pch in your console. Notice that pch numbers from 0 to 25 are the most commonly used values since they produce plot-friendly shapes. You can also use any ASCII characters (numbers 32 to 127) or native characters (numbers 128 to 255); try them.

par(font=1)
# Plot 1
plot(B1~A1,
     las=1,
     ylab="B1 (Dependent)",
     xlab="A1 (Independent)",
     xlim=c(0,140),
     ylim=c(0,900),
     col="#A11515",
     pch=8)
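
To compare the pch shapes 0 to 25 side by side, a quick reference chart can be drawn. This is only a sketch; the grid layout and the names pch_values, grid_x, and grid_y are assumptions made for this example.

```r
# The 26 standard plotting symbols
pch_values <- 0:25

# Arrange the symbols on a grid with 9 columns and 3 rows
grid_x <- pch_values %% 9
grid_y <- pch_values %/% 9

plot(grid_x, grid_y,
     pch = pch_values,
     cex = 2,
     col = "#A11515",
     bg = "grey80",              # fill color, used only by pch 21 to 25
     xlim = c(-0.5, 8.5),
     ylim = c(-0.5, 2.5),
     axes = FALSE,
     xlab = "",
     ylab = "")

# Write each pch number below its symbol
text(grid_x, grid_y, labels = pch_values, pos = 1, cex = 0.8)
```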

2.3 Choose plot type

One of the options you can use inside the plot code is type. Use ?plot in the console to find more information about these options:

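p: for points
l: for lines
b: for both points and lines
c: for the lines part of "b" alone (empty gaps where the points would be)
o: for overplotted points and lines
s: for stair steps (moving horizontally first, then vertically)
S: for stair steps (moving vertically first, then horizontally)
h: for histogram-like vertical lines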

Using our first scatter plot, let’s explore all these options; keep in mind that our plot uses pch=8, which can be changed. It is up to you to decide which option best fits your data visualization needs. In the plots below, each type option is shown in the top-left corner of its plot. The first option (p) creates the basic plot observed above; it is the default type.

par(mfrow = c(2,2), mar=c(1,1,1,1), mai=c(0.5,0.5,0.5,0.5))

# Using type p
plot(B1~A1,
     type = "p",
     las=1,
     ylab="B1 (Dependent)",
     xlab="A1 (Independent)",
     xlim=c(0,140),
     ylim=c(0,900),
     col="#A11515",
     pch=8)
text("p", y=800, x=10, cex=2)

# Using type l
plot(B1~A1,
     type = "l",
     las=1,
     ylab="B1 (Dependent)",
     xlab="A1 (Independent)",
     xlim=c(0,140),
     ylim=c(0,900),
     col="#A11515",
     pch=8)
text("l", y=800, x=10, cex=2)

# Using type b
plot(B1~A1,
     type = "b",
     las=1,
     ylab="B1 (Dependent)",
     xlab="A1 (Independent)",
     xlim=c(0,140),
     ylim=c(0,900),
     col="#A11515",
     pch=8)
text("b", y=800, x=10, cex=2)

# Using type c
plot(B1~A1,
     type = "c",
     las=1,
     ylab="B1 (Dependent)",
     xlab="A1 (Independent)",
     xlim=c(0,140),
     ylim=c(0,900),
     col="#A11515",
     pch=8)
text("c", y=800, x=10, cex=2)

# Using type o
plot(B1~A1,
     type = "o",
     las=1,
     ylab="B1 (Dependent)",
     xlab="A1 (Independent)",
     xlim=c(0,140),
     ylim=c(0,900),
     col="#A11515",
     pch=8)
text("o", y=800, x=10, cex=2)

# Using type s
plot(B1~A1,
     type = "s",
     las=1,
     ylab="B1 (Dependent)",
     xlab="A1 (Independent)",
     xlim=c(0,140),
     ylim=c(0,900),
     col="#A11515",
     pch=8)
text("s", y=800, x=10, cex=2)

# Using type S
plot(B1~A1,
     type = "S",
     las=1,
     ylab="B1 (Dependent)",
     xlab="A1 (Independent)",
     xlim=c(0,140),
     ylim=c(0,900),
     col="#A11515",
     pch=8)
text("S", y=800, x=10, cex=2)

# Using type h
plot(B1~A1,
     type = "h",
     las=1,
     ylab="B1 (Dependent)",
     xlab="A1 (Independent)",
     xlim=c(0,140),
     ylim=c(0,900),
     col="#A11515",
     pch=8)
text("h", y=800, x=10, cex=2)
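
The eight plots above differ only in the value of type, so the same result can be sketched with a for loop. This is one possible shortcut, not the approach used in this manual; the names plot_types and plot_type, and the 2 by 4 layout, are assumptions made for this example.

```r
A1 <- c(22, 23, 24, 26, 28, 45, 58, 64, 71, 87, 110, 135)
B1 <- c(125, 129, 134, 146, 157, 253, 324, 380, 425, 501, 854, 876)

# All type options discussed in this section
plot_types <- c("p", "l", "b", "c", "o", "s", "S", "h")

# Two rows and four columns, so the eight plots fit in one figure
par(mfrow = c(2, 4), mar = c(4, 4, 1, 1))

for (plot_type in plot_types) {
  plot(B1 ~ A1,
       type = plot_type,
       las = 1,
       ylab = "B1 (Dependent)",
       xlab = "A1 (Independent)",
       xlim = c(0, 140),
       ylim = c(0, 900),
       col = "#A11515",
       pch = 8)
  # Label each panel with its type in the top-left corner
  text(x = 10, y = 800, labels = plot_type, cex = 2)
}
```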

2.4 Obtain and plot the linear regression model

The linear regression model is used to add a trend line (also called a trendline) to scatter plots. This line allows us to observe the direction of the relationship.

To obtain the linear regression model, use code lm().
Inside the code, separate the variables exactly as you did in the plot, using a tilde (~), starting with the dependent variable (y), and then the independent variable (x).
For practical purposes, give the linear regression model a name; you will use it to plot the trend line.
Then, after the plot code, add abline() with the name of the linear regression model object.

In the {r} chunk below, observe how abline() is used to add the linear regression line after the end of the plot code. All you have to do is enter the name of the object containing the linear regression model.

# Create an object with linear regression model
linReg1 = lm(B1 ~ A1)

# Create the scatter plot and add the trendline with the linear regression model

plot(B1~A1,
     type = "p",
     las=1,
     ylab="B1 (Dependent)",
     xlab="A1 (Independent)",
     xlim=c(0,140),
     ylim=c(0,900),
     col="#A11515",
     pch=8)

text("Type: p", y=800, x=10, cex=1)

abline(linReg1, col="#A11515")
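
To see what abline() reads from the model object, the sketch below extracts the intercept and slope with coef() and draws the same line from those two numbers. The names reg_coefs, intercept, slope, and predicted_at_50 are assumptions made for this example, and the value 50 is an arbitrary choice.

```r
A1 <- c(22, 23, 24, 26, 28, 45, 58, 64, 71, 87, 110, 135)
B1 <- c(125, 129, 134, 146, 157, 253, 324, 380, 425, 501, 854, 876)

linReg1 <- lm(B1 ~ A1)

# coef() returns the intercept and the slope of the fitted line
reg_coefs <- coef(linReg1)
intercept <- reg_coefs[["(Intercept)"]]
slope     <- reg_coefs[["A1"]]
print(reg_coefs)

# abline(a, b) draws the line y = a + b * x, the same line as abline(linReg1)
plot(B1 ~ A1, las = 1)
abline(a = intercept, b = slope, col = "#A11515")

# The same two numbers give the value of the line at any x, for example x = 50
predicted_at_50 <- intercept + slope * 50
print(predicted_at_50)
```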

2.5 Change image size and position

Perform these steps only if you feel they are necessary for your data visualization needs. I will use exactly the same codes from the previous plot; the only changes are in the {r} chunk options.

Since the {r} chunk options are not displayed in the output document (the HTML file you are reading), I will mention them here:

{r, fig.align="center", fig.width=4, fig.height=4, fig.cap="Scatter Plot 1: Linear relationship between variables A1 and B1"}

Notice the use of fig.align, fig.width, fig.height, and fig.cap. There are many other options; investigate and learn.

# Create an object with linear regression model
linReg1 = lm(B1 ~ A1)

# Create the scatter plot and add the trendline with the linear regression model

plot(B1~A1,
     type = "p",
     las=1,
     ylab="B1 (Dependent)",
     xlab="A1 (Independent)",
     xlim=c(0,140),
     ylim=c(0,900),
     col="#A11515",
     pch=8)

text("Type: p", y=800, x=10, cex=1)

abline(linReg1, col="#A11515")

Scatter Plot 1: Linear relationship between variables A1 and B1

3 Scatter plot 2, negative correlation

Since you already know the codes and their applications, I will include them all in the R chunk below. Following the recommendations provided above, customize the plot in your RStudio.

# Linear regression
linReg2 = lm(B2 ~ A2)

# Create graph with regression line
plot(B2~A2)
abline(linReg2)
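
As one possible customization of this plot, the sketch below applies the options from section 2.2 to set 2. The axis limits, color, and point shape are arbitrary choices made for this example; change them to fit your needs.

```r
A2 <- c(2, 8, 23, 34, 43, 51, 63, 75, 96, 120, 145, 160)
B2 <- c(501, 490, 479, 468, 457, 447, 437, 427, 410, 403, 396, 363)

# Linear regression
linReg2 <- lm(B2 ~ A2)

# Customized scatter plot; limits, color, and shape are illustrative choices
plot(B2 ~ A2,
     las = 1,
     xlab = "A2 (Independent)",
     ylab = "B2 (Dependent)",
     xlim = c(0, 170),
     ylim = c(350, 520),
     col = "#15538A",
     pch = 16)

# Dashed trend line; it slopes downward because the slope is negative
abline(linReg2, col = "#15538A", lty = 2)
```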

4 Scatter plot with Error bars: the Hmisc library

For this part we will use the car sales data set; check the information in the summary and table below.

summary(carSales) %>%
  kable(align = "c",
        format = "html",
        table.attr = "style='width:60%;'")%>%
  kable_classic_2(bootstrap_options=c("hover","bordered","condensed"),
              html_font = "Cambria",
              position = "center",
              font_size = 10)

Location	Year	FuelType	Transmission	Owner	Efficiency	Engine_cc	Power_bhp	Seats	Km	Price
Length:5844	Min. :2001	Length:5844	Length:5844	Length:5844	Min. : 8.00	Min. :1000	Min. : 50.0	Min. : 2.000	Min. : 5539	Min. : 800
Class :character	1st Qu.:2012	Class :character	Class :character	Class :character	1st Qu.:16.00	1st Qu.:1500	1st Qu.:100.0	1st Qu.: 5.000	1st Qu.: 79159	1st Qu.:18280
Mode :character	Median :2014	Mode :character	Mode :character	Mode :character	Median :18.00	Median :1500	Median :100.0	Median : 5.000	Median :100295	Median :27157
NA	Mean :2014	NA	NA	NA	Mean :18.46	Mean :1790	Mean :139.3	Mean : 5.171	Mean :101558	Mean :26474
NA	3rd Qu.:2016	NA	NA	NA	3rd Qu.:22.00	3rd Qu.:2000	3rd Qu.:150.0	3rd Qu.: 5.000	3rd Qu.:121491	3rd Qu.:35695
NA	Max. :2020	NA	NA	NA	Max. :30.00	Max. :6000	Max. :600.0	Max. :10.000	Max. :452805	Max. :63496

4.1 Car sales: engine size

Imagine you are interested in knowing what effect the size of an engine has on its efficiency. People say that big cars consume more gas than smaller cars; is that true?
Using code unique(carSales$Engine_cc), you will find that the engine size variable contains seven different values: 1000, 1500, 2000, 2500, 3000, 4000, and 6000 cubic centimeters (cc).

Check section Continuous Data to see an extended explanation of the table below.

# Using car sales data set
# Plot and observe relationship between engine size and efficiency
# Start by grouping by the different engine sizes

eff = carSales %>% 
  group_by(EngineSize = Engine_cc) %>%
  summarise(Mean = mean(Efficiency), 
            SD = sd(Efficiency),
            Minimum = min(Efficiency),
            Maximum = max(Efficiency))
 
eff %>%
  kable(align = "c",
        caption = "Descriptive values of efficiency per engine size",
        format = "html",
        digits = 2,
        table.attr = "style='width:60%;'")%>%
  kable_classic_2(bootstrap_options=c("hover","bordered","condensed"),
              html_font = "Cambria",
              position = "center",
              font_size = 12) %>%
  add_header_above(c(" " = 1,"Engine Efficiency" = 4))

Descriptive values of efficiency per engine size
	Engine Efficiency
EngineSize	Mean	SD	Minimum	Maximum
1000	24.95	1.96	22	30
1500	19.67	3.41	16	26
2000	16.51	2.88	10	22
2500	14.49	3.13	10	18
3000	12.78	2.20	8	18
4000	10.62	1.75	8	12
6000	8.83	1.03	8	10
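
The grouping step can also be sketched in base R with tapply(). Since the carSales file is not included here, the example uses a tiny made-up data frame; toy_cars and all of its values are hypothetical and are not taken from the real data set.

```r
# Hypothetical data frame, invented only to illustrate the grouping step
toy_cars <- data.frame(
  Engine_cc  = c(1000, 1000, 1000, 2000, 2000, 2000),
  Efficiency = c(24, 26, 25, 16, 18, 17)
)

# tapply() applies a function to Efficiency within each engine size
mean_by_size <- tapply(toy_cars$Efficiency, toy_cars$Engine_cc, mean)
sd_by_size   <- tapply(toy_cars$Efficiency, toy_cars$Engine_cc, sd)

# Collect the results in one table, one row per engine size
toy_summary <- data.frame(
  EngineSize = as.numeric(names(mean_by_size)),
  Mean       = as.vector(mean_by_size),
  SD         = as.vector(sd_by_size)
)
print(toy_summary)
```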

By observing the table, we could deduce that the mean efficiency decreases with increasing engine size. Is that a correct observation? To have a better idea, let’s create two graphs:
1. One scatter plot to observe the distribution of the two variables.
2. A graph with standard deviations as error bars.

4.2 Size vs efficiency: Scatter plot

Similar to the graphs above, we have two variables under analysis. Which one is the independent variable and which one is the dependent variable? It should be quite easy in this case.
- Does efficiency affect engine size? Of course not.
- Does engine size affect efficiency? Yes. This indicates that the independent variable is engine size and it should be plotted on the x-axis. Observe the codes and graph below.

plot(carSales$Efficiency ~ carSales$Engine_cc,
     las=2,
     xlab = "Engine size in CC",
     ylab = "Efficiency (mpg)",
     xlim=c(500,6000),
     ylim = c(0,35),
     pch = 19,
     col="blue")

abline(lm(carSales$Efficiency ~ carSales$Engine_cc),
       col="red")

legend("topright", 
       cex=0.8,
       paste("Y = 26.365 - 0.0046X"))

# Add the information for the correlation
# Use x and y coordinates to place the text

correlation = cor(carSales$Efficiency , carSales$Engine_cc)

text(x=6000,
     y= 28,
     paste("Correlation: ", round(correlation, 3)),
     cex=0.8,
     pos = 2)
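
In the code above, the regression equation in the legend is typed by hand. As a sketch of an alternative, the label can be built from the model coefficients, so that it updates if the data change. The example uses made-up, exactly linear data so the result is easy to check; engine_cc, efficiency, toy_model, and equation_label are hypothetical names.

```r
# Hypothetical data that follow the line y = 30 - 0.004 * x exactly
engine_cc  <- c(1000, 1500, 2000, 2500, 3000)
efficiency <- 30 - 0.004 * engine_cc

toy_model <- lm(efficiency ~ engine_cc)
intercept <- coef(toy_model)[["(Intercept)"]]
slope     <- coef(toy_model)[["engine_cc"]]

# Build the equation text from the coefficients instead of typing it by hand
equation_label <- sprintf("Y = %.3f %s %.4fX",
                          intercept,
                          ifelse(slope < 0, "-", "+"),
                          abs(slope))
print(equation_label)

plot(efficiency ~ engine_cc, las = 2, pch = 19, col = "blue")
abline(toy_model, col = "red")
legend("topright", legend = equation_label, cex = 0.8, bty = "n")
```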

4.3 Size vs efficiency: Hmisc::errbar graph

Hmisc::errbar(x = eff$EngineSize, 
       y = eff$Mean, 
       las=2,
       yplus = eff$Mean+eff$SD, 
       yminus = eff$Mean-eff$SD, 
       cap = 0.03,
       xlab = "Engine size in CC",
       ylab = "Mean efficiency (mpg +/- SD)",
       lty = 8,
       lwd = 1,
       pch=24,
       errbar.col = "#A11515",
      xlim=c(500,6000),
      ylim = c(0,35))
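
If the Hmisc library is not available, a similar graph can be sketched with base R: plot() for the means and arrows() for the error bars. This is an alternative technique, not the errbar() method used above. The values are copied from the table of descriptive values in section 4.1; the names lower and upper are assumptions made for this example.

```r
# Means and standard deviations copied from the table in section 4.1
eff <- data.frame(
  EngineSize = c(1000, 1500, 2000, 2500, 3000, 4000, 6000),
  Mean       = c(24.95, 19.67, 16.51, 14.49, 12.78, 10.62, 8.83),
  SD         = c(1.96, 3.41, 2.88, 3.13, 2.20, 1.75, 1.03)
)

# Ends of each error bar: mean minus SD and mean plus SD
lower <- eff$Mean - eff$SD
upper <- eff$Mean + eff$SD

plot(Mean ~ EngineSize, data = eff,
     las = 2,
     pch = 24,
     xlab = "Engine size in CC",
     ylab = "Mean efficiency (mpg +/- SD)",
     xlim = c(500, 6000),
     ylim = c(0, 35))

# angle = 90 and code = 3 draw a flat cap at both ends of each bar
arrows(x0 = eff$EngineSize, y0 = lower,
       x1 = eff$EngineSize, y1 = upper,
       angle = 90, code = 3, length = 0.03,
       col = "#A11515")
```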

5 Recommended readings

Hmisc::errbar: Plot Error Bars. RDocumentation.
https://www.rdocumentation.org/packages/Hmisc/versions/4.6-0/topics/errbar
errbar: Plot Error Bars in Hmisc: Harrell Miscellaneous. Oct. 7, 2021.
https://rdrr.io/cran/Hmisc/man/errbar.html
Harrell Jr, Frank E. and Dupont, Charles. (2021) Harrell Miscellaneous, package ‘Hmisc’. PDF document.
https://cran.r-project.org/web/packages/Hmisc/Hmisc.pdf
Kabacoff, R. Scatter plots. (2017). Quick-R by Datacamp.
https://www.statmethods.net/graphs/scatterplot.html

Disclaimer

This short manual series project is a work in progress. Unless otherwise clearly stated, this material is considered to be a draft version.

Dee Chiluiza, PhD
June 2021
Last update: 12 March, 2022
Boston, Massachusetts, USA
Bruno Dog
