# Libraries used in this document
library(rmarkdown)
library(readxl)
library(productplots) # For happy
library(dplyr)
library(ggplot2)
library(gridExtra)
library(Rfast)
library(Hmisc) # For scatter plot with error bars
library(knitr)
library(kableExtra)
library(magrittr)
# Data Sets used in this document
data("happy")
carSales = read_excel("DataSets/carSales.xlsx")
Dee Chiluiza, PhD, Northeastern University, Boston, Massachusetts
Short manual series: Scatter Plots
In data analysis and statistics, it is often important to establish the relationship between two variables. For example: does an increase in calorie intake correlate with an increase in body weight? Do people who study longer hours obtain higher grades?
Scatter plots are diagrams that use Cartesian coordinates to display the relationship between two numerical variables. Put simply, a table can contain information for two variables, variable A and variable B, and we want to know whether variable A affects the behavior of variable B. We will see this topic in more detail in the correlation and regression section. Observe the following scenarios:
If variable A increases, variable B increases.
If variable A decreases, variable B decreases.
These two scenarios refer to positive correlations, in which the two variables move in the same direction.
If variable A increases, variable B decreases.
If variable A decreases, variable B increases.
These two scenarios refer to negative correlations, in which the two variables move in opposite directions.
There is also, of course, the case in which changes in variable A do not affect the behavior of variable B; this is referred to as no correlation.
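As a minimal sketch of these three cases, the chunk below uses small made-up vectors and the base R function cor(). The names x_values, y_same_direction, y_opposite_direction, x_centered, and y_squared are illustrative only and are not used elsewhere in this manual. A result close to 1 points to a positive correlation, a result close to -1 to a negative correlation, and a result close to 0 to no linear correlation.

```r
# Illustrative data only; these vectors are not part of the manual's data sets.
x_values = 1:10
y_same_direction = 2 * x_values + 1       # rises whenever x rises
y_opposite_direction = 50 - 3 * x_values  # falls whenever x rises

cor_positive = cor(x_values, y_same_direction)
cor_negative = cor(x_values, y_opposite_direction)

# A symmetric curve: y depends on x, but there is no linear trend,
# so the correlation coefficient is zero.
x_centered = c(-2, -1, 0, 1, 2)
y_squared = x_centered^2
cor_none = cor(x_centered, y_squared)

print(c(positive = cor_positive, negative = cor_negative, none = cor_none))
```

Note that the last pair is clearly related (y is the square of x), yet its correlation is zero, because cor() only measures linear association. This is one reason to look at the scatter plot and not only at the number.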
Using the six vectors in the R chunk below, we will create scatter plots to observe the relationships between A1 and B1, A2 and B2, and A3 and B3. In all cases, A is the independent variable and B is the dependent variable.
Observe the {r} chunk below; it contains the objects mentioned above and their corresponding data.
# Vectors used for correlation analysis and scatter plot displays.
# Set 1
A1 = c(22,23,24,26,28,45,58,64,71, 87, 110, 135)
B1 = c(125,129,134,146,157,253,324,380,425, 501, 854, 876)
# Set 2
A2 = c(2,8,23,34,43,51,63,75,96,120, 145, 160)
B2 = c(501,490,479,468,457,447,437,427,410,403, 396, 363)
# Set 3
A3 = c(2,8,23,34,43,51,63,75,96,120, 145, 160)
B3 = c(562,120,537,445,438,420,305,231,480,300, 100, 224)
# Create a table to present data
table1 = matrix(c(A1, B1, A2, B2, A3, B3),
nrow = 12,
byrow = FALSE)
colnames(table1) = c("A1", "B1", "A2", "B2", "A3", "B3")
# Present table, check addition of three follow-up codes:
# kable(), kable_classic_2(), add_header_above().
table1 %>%
kable(align = "c",
caption = "Table 1. The variables",
table.attr = "style='width:60%;'") %>%
kable_classic_2(bootstrap_options=c("hover","bordered"),
html_font = "Cambria",
position = "center",
font_size = 12) %>%
add_header_above(c("Group 1" = 2,"Group 2" = 2,"Group 3" = 2))
Table 1. The variables (Group 1: A1 and B1; Group 2: A2 and B2; Group 3: A3 and B3)
| A1 | B1 | A2 | B2 | A3 | B3 |
|---|---|---|---|---|---|
| 22 | 125 | 2 | 501 | 2 | 562 |
| 23 | 129 | 8 | 490 | 8 | 120 |
| 24 | 134 | 23 | 479 | 23 | 537 |
| 26 | 146 | 34 | 468 | 34 | 445 |
| 28 | 157 | 43 | 457 | 43 | 438 |
| 45 | 253 | 51 | 447 | 51 | 420 |
| 58 | 324 | 63 | 437 | 63 | 305 |
| 64 | 380 | 75 | 427 | 75 | 231 |
| 71 | 425 | 96 | 410 | 96 | 480 |
| 87 | 501 | 120 | 403 | 120 | 300 |
| 110 | 854 | 145 | 396 | 145 | 100 |
| 135 | 876 | 160 | 363 | 160 | 224 |
I will not explain correlation or linear regression in detail in this section; for more information, please read the corresponding section.
# Correlations
correlation_1 = cor(B1,A1)
correlation_2 = cor(B2,A2)
correlation_3 = cor(B3,A3)
Correlation between A1 and B1 is 0.984.
This is an example of a strong positive correlation.
Correlation between A2 and B2 is -0.983.
This is an example of a strong negative correlation.
Correlation between A3 and B3 is -0.508.
This is an example of a weak negative correlation.
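One detail worth checking for yourself is that cor() is symmetric: swapping the two vectors does not change the result. The sketch below repeats the vectors of sets 1 and 2 so that it runs on its own; the object names cor_set1, cor_set1_swapped, and cor_set2 are illustrative choices.

```r
# Vectors copied from sets 1 and 2 of this manual
A1 = c(22,23,24,26,28,45,58,64,71, 87, 110, 135)
B1 = c(125,129,134,146,157,253,324,380,425, 501, 854, 876)
A2 = c(2,8,23,34,43,51,63,75,96,120, 145, 160)
B2 = c(501,490,479,468,457,447,437,427,410,403, 396, 363)

# The order of the arguments does not matter for cor()
cor_set1 = cor(B1, A1)
cor_set1_swapped = cor(A1, B1)
cor_set2 = cor(B2, A2)

# Round to three decimals for reporting
print(round(c(set1 = cor_set1, set1_swapped = cor_set1_swapped, set2 = cor_set2), 3))
```

The order of the variables does matter later, in plot() and lm(), where the formula decides which variable goes on which axis.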
To create a scatter plot, we use the function plot(), where we separate the variables with a tilde (~). The order of the variables is very important: always start with the dependent variable (it will appear on the y-axis), then write the independent variable (it will appear on the x-axis).
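To make the ordering concrete, here is a small sketch with made-up data (x_hours and y_grade are hypothetical names). In the formula form the dependent variable is written first, while in the positional form plot(x, y) the independent variable is written first; both calls should produce the same picture.

```r
# Illustrative data only
x_hours = c(1, 2, 3, 4, 5)       # independent variable
y_grade = c(55, 60, 68, 71, 80)  # dependent variable

old_par = par(mfrow = c(1, 2))
# Formula form: dependent ~ independent
plot(y_grade ~ x_hours, main = "Formula form")
# Positional form: x first, then y
plot(x_hours, y_grade, main = "Positional form")
par(old_par)

# The formula stores the dependent variable first
formula_variables = all.vars(y_grade ~ x_hours)
print(formula_variables)
```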
Observe the code below. At this point, you are familiar with the par(mfrow = ) setting used to present figures in a matrix; in this case, one row and two columns, mfrow=c(1,2).
We are using the plots below to observe the relationships between variables A1-B1 (group 1) and A2-B2 (group 2).
par(mfrow=c(1,2), mai=c(0.6, 0.8, 0.5, 0.4), mar=c(4,4,1,1))
# Plot 1
plot(B1~A1)
# Plot 2
plot(B2~A2)
The two plots above were created using the very basic code plot() with the names of the variables inside. We will start with some common changes using plot 1:
Change the orientation of the y-axis values (las).
Improve x- and y-axes labels (xlab, ylab).
Change x- and y-axes limits to improve data visualization (xlim, ylim).
Change the color of data points (col).
Change the shape of the data points (pch).
For the last item, we use the argument pch; check the available options using ?pch in your console. The pch numbers from 0 to 25 are the most commonly used values, since they produce plot-friendly shapes. You can also use ASCII characters (numbers 32 to 127) or native characters (numbers 128 to 255); try them.
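A quick way to compare the shapes is to draw all of them at once. The sketch below places pch values 0 to 25 on a small grid; the names pch_values, grid_x, and grid_y are illustrative, and the grid layout is just one possible arrangement.

```r
# One point per pch value, arranged in rows of seven
pch_values = 0:25
grid_x = (pch_values %% 7) + 1
grid_y = 4 - (pch_values %/% 7)

plot(grid_x, grid_y,
     pch = pch_values,
     cex = 2,
     col = "#A11515",
     bg = "grey80",        # fill color, only used by pch 21 to 25
     xlim = c(0.5, 7.5),
     ylim = c(0.5, 4.8),
     axes = FALSE,
     xlab = "",
     ylab = "")
# Write each pch number above its symbol
text(grid_x, grid_y, labels = pch_values, pos = 3, cex = 0.8)
```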
par(font=1)
# Plot 1
plot(B1~A1,
las=1,
ylab="B1 (Dependent)",
xlab="A1 (Independent)",
xlim=c(0,140),
ylim=c(0,900),
col="#A11515",
pch=8)
One of the options you can use inside the plot code is type. Use ?plot in the console to find more information about these options:
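p: for points
l: for lines
b: for both points and lines
c: for the lines part of "b" alone (gaps are left where the points would be)
o: for overplotted points and lines
s: for stair steps (the line moves horizontally first, then vertically)
S: for stair steps (the line moves vertically first, then horizontally)
h: for histogram-like vertical lines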
Using our first scatter plot, let’s explore all these options; keep in mind that our plot uses pch=8, which can be changed. It is up to you to decide which option best fits your data visualization needs. In the plots below, the type option used is shown in the top-left corner of each plot. The first option (p) creates the basic plot observed above; it is the default type.
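If you prefer a shorter version, the same comparison can be sketched with a for loop over the type options. This is only one possible arrangement (two rows of four panels), and the names plot_types and plot_type are illustrative; the vectors are repeated so that the chunk runs on its own.

```r
# Vectors copied from set 1 of this manual
A1 = c(22,23,24,26,28,45,58,64,71, 87, 110, 135)
B1 = c(125,129,134,146,157,253,324,380,425, 501, 854, 876)

plot_types = c("p", "l", "b", "c", "o", "s", "S", "h")

old_par = par(mfrow = c(2, 4), mar = c(4, 4, 1, 1))
for (plot_type in plot_types) {
  plot(B1 ~ A1,
       type = plot_type,
       las = 1,
       xlab = "A1",
       ylab = "B1",
       xlim = c(0, 140),
       ylim = c(0, 900),
       col = "#A11515",
       pch = 8)
  # Label each panel with the type it uses
  text(x = 10, y = 800, labels = plot_type, cex = 2)
}
par(old_par)
```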
par(mfrow = c(2,2), mar=c(1,1,1,1), mai=c(0.5,0.5,0.5,0.5))
# Using type p
plot(B1~A1,
type = "p",
las=1,
ylab="B1 (Dependent)",
xlab="A1 (Independent)",
xlim=c(0,140),
ylim=c(0,900),
col="#A11515",
pch=8)
text("p", y=800, x=10, cex=2)
# Using type l
plot(B1~A1,
type = "l",
las=1,
ylab="B1 (Dependent)",
xlab="A1 (Independent)",
xlim=c(0,140),
ylim=c(0,900),
col="#A11515",
pch=8)
text("l", y=800, x=10, cex=2)
# Using type b
plot(B1~A1,
type = "b",
las=1,
ylab="B1 (Dependent)",
xlab="A1 (Independent)",
xlim=c(0,140),
ylim=c(0,900),
col="#A11515",
pch=8)
text("b", y=800, x=10, cex=2)
# Using type c
plot(B1~A1,
type = "c",
las=1,
ylab="B1 (Dependent)",
xlab="A1 (Independent)",
xlim=c(0,140),
ylim=c(0,900),
col="#A11515",
pch=8)
text("c", y=800, x=10, cex=2)
# Using type o
plot(B1~A1,
type = "o",
las=1,
ylab="B1 (Dependent)",
xlab="A1 (Independent)",
xlim=c(0,140),
ylim=c(0,900),
col="#A11515",
pch=8)
text("o", y=800, x=10, cex=2)
# Using type s
plot(B1~A1,
type = "s",
las=1,
ylab="B1 (Dependent)",
xlab="A1 (Independent)",
xlim=c(0,140),
ylim=c(0,900),
col="#A11515",
pch=8)
text("s", y=800, x=10, cex=2)
# Using type S
plot(B1~A1,
type = "S",
las=1,
ylab="B1 (Dependent)",
xlab="A1 (Independent)",
xlim=c(0,140),
ylim=c(0,900),
col="#A11515",
pch=8)
text("S", y=800, x=10, cex=2)
# Using type h
plot(B1~A1,
type = "h",
las=1,
ylab="B1 (Dependent)",
xlab="A1 (Independent)",
xlim=c(0,140),
ylim=c(0,900),
col="#A11515",
pch=8)
text("h", y=800, x=10, cex=2)
The linear regression model is used to add a trend line (also called a trendline) to scatter plots. This line allows us to observe the direction of the relationship.
To obtain the linear regression model, use the function lm().
Inside the code, separate the variables exactly as you did in the plot, using a tilde (~), starting with the dependent variable (y) and then the independent variable (x).
For practical purposes, give the linear regression model a name; you will use it to plot the trend line.
In the {r} chunk below, observe how abline() is used to add the linear regression line after the end of the plot code. All you have to do is enter the name of the object containing the linear regression model.
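To see what the model object holds, the sketch below fits lm() to made-up data that lie exactly on the line y = 3 + 2x (x_input, y_output, and line_fit are illustrative names). The function coef() returns the intercept and the slope, which are the two values abline() uses to draw the line.

```r
# Illustrative data that follow y = 3 + 2x exactly
x_input = 1:5
y_output = 3 + 2 * x_input

# Dependent variable first, then the independent variable
line_fit = lm(y_output ~ x_input)

# Intercept and slope of the fitted line
fit_coefficients = coef(line_fit)
print(fit_coefficients)

# Draw the points, then add the fitted line
plot(y_output ~ x_input, las = 1)
abline(line_fit, col = "#A11515")
```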
# Create an object with linear regression model
linReg1 = lm(B1 ~ A1)
# Create the scatter plot and add the trendline with the linear regression model
plot(B1~A1,
type = "p",
las=1,
ylab="B1 (Dependent)",
xlab="A1 (Independent)",
xlim=c(0,140),
ylim=c(0,900),
col="#A11515",
pch=8)
text("Type: p", y=800, x=10, cex=1)
abline(linReg1, col="#A11515")
Perform these steps only if you feel they are necessary for your data visualization needs. I will use exactly the same code as in the previous plot; the only changes are in the {r} chunk options.
Since the {r} chunk options are not displayed in the output document (the HTML file you are reading), I show them here:
{r, fig.align="center", fig.width=4, fig.height=4, fig.cap="Scatter Plot 1: Linear relationship between variables A1 and B1"}
Notice the use of fig.align, fig.width, fig.height, and fig.cap. There are many other options; investigate and learn.
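If you want the same figure settings in every chunk, one option is to set them once in a setup chunk with knitr::opts_chunk$set(). The sketch below assumes the knitr package loaded at the top of this document; fig.cap is left out because a caption normally belongs to a single figure and is better written in that figure's own chunk options.

```r
# Set default figure options for all following chunks.
# The guard keeps the code from failing where knitr is not installed.
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::opts_chunk$set(fig.align = "center",
                        fig.width = 4,
                        fig.height = 4)
}
```

Options written in an individual chunk header still take precedence over these defaults.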
# Create an object with linear regression model
linReg1 = lm(B1 ~ A1)
# Create the scatter plot and add the trendline with the linear regression model
plot(B1~A1,
type = "p",
las=1,
ylab="B1 (Dependent)",
xlab="A1 (Independent)",
xlim=c(0,140),
ylim=c(0,900),
col="#A11515",
pch=8)
text("Type: p", y=800, x=10, cex=1)
abline(linReg1, col="#A11515")
Scatter Plot 1: Linear relationship between variables A1 and B1
Since you already know the codes and their applications, I will include them all in the R chunk below. Following the recommendation I provided above, customize the plot in RStudio.
# Linear regression
linReg2 = lm(B2 ~ A2)
# Create graph with regression line
plot(B2~A2)
abline(linReg2)
For this part we will use the car sales data set; check the information in the summary and table below.
summary(carSales) %>%
kable(align = "c",
format = "html",
table.attr = "style='width:60%;'")%>%
kable_classic_2(bootstrap_options=c("hover","bordered","condensed"),
html_font = "Cambria",
position = "center",
font_size = 10)
| Location | Year | FuelType | Transmission | Owner | Efficiency | Engine_cc | Power_bhp | Seats | Km | Price |
|---|---|---|---|---|---|---|---|---|---|---|
| Length:5844 | Min. :2001 | Length:5844 | Length:5844 | Length:5844 | Min. : 8.00 | Min. :1000 | Min. : 50.0 | Min. : 2.000 | Min. : 5539 | Min. : 800 |
| Class :character | 1st Qu.:2012 | Class :character | Class :character | Class :character | 1st Qu.:16.00 | 1st Qu.:1500 | 1st Qu.:100.0 | 1st Qu.: 5.000 | 1st Qu.: 79159 | 1st Qu.:18280 |
| Mode :character | Median :2014 | Mode :character | Mode :character | Mode :character | Median :18.00 | Median :1500 | Median :100.0 | Median : 5.000 | Median :100295 | Median :27157 |
| NA | Mean :2014 | NA | NA | NA | Mean :18.46 | Mean :1790 | Mean :139.3 | Mean : 5.171 | Mean :101558 | Mean :26474 |
| NA | 3rd Qu.:2016 | NA | NA | NA | 3rd Qu.:22.00 | 3rd Qu.:2000 | 3rd Qu.:150.0 | 3rd Qu.: 5.000 | 3rd Qu.:121491 | 3rd Qu.:35695 |
| NA | Max. :2020 | NA | NA | NA | Max. :30.00 | Max. :6000 | Max. :600.0 | Max. :10.000 | Max. :452805 | Max. :63496 |
Imagine you are interested in knowing what effect the size of an engine has on its efficiency. People say that big cars consume more gas than smaller cars. Is that true?
Using code unique(carSales$Engine_cc) you will find that the engine size variable contains seven different values: 1000, 1500, 2000, 2500, 3000, 4000, and 6000 cubic centimeters (cc).
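As a small sketch of what unique() does, the chunk below uses a short made-up vector (engine_cc_example is hypothetical, not the real carSales column). Wrapping the call in sort() lists the distinct values in increasing order, and table() additionally counts how many times each value appears.

```r
# Illustrative values only; not taken from the carSales data set
engine_cc_example = c(1500, 1000, 1500, 2000, 1000, 1500)

# Distinct values, sorted from smallest to largest
distinct_sizes = sort(unique(engine_cc_example))
print(distinct_sizes)

# Number of observations per distinct value
size_counts = table(engine_cc_example)
print(size_counts)
```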
Check section Continuous Data to see an extended explanation of the table below.
# Using car sales data set
# Plot and observe relationship between engine size and efficiency
# Start by grouping by the different engine sizes
eff = carSales %>%
group_by(EngineSize = Engine_cc) %>%
summarise(Mean = mean(Efficiency),
SD = sd(Efficiency),
Minimum = min(Efficiency),
Maximum = max(Efficiency))
eff %>%
kable(align = "c",
caption = "Descriptive values of efficiency per engine size",
format = "html",
digits = 2,
table.attr = "style='width:60%;'")%>%
kable_classic_2(bootstrap_options=c("hover","bordered","condensed"),
html_font = "Cambria",
position = "center",
font_size = 12) %>%
add_header_above(c(" " = 1,"Engine Efficiency" = 4))
Descriptive values of efficiency per engine size (the columns Mean, SD, Minimum, and Maximum describe engine efficiency)
| EngineSize | Mean | SD | Minimum | Maximum |
|---|---|---|---|---|
| 1000 | 24.95 | 1.96 | 22 | 30 |
| 1500 | 19.67 | 3.41 | 16 | 26 |
| 2000 | 16.51 | 2.88 | 10 | 22 |
| 2500 | 14.49 | 3.13 | 10 | 18 |
| 3000 | 12.78 | 2.20 | 8 | 18 |
| 4000 | 10.62 | 1.75 | 8 | 12 |
| 6000 | 8.83 | 1.03 | 8 | 10 |
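The grouped summary above relies on dplyr. As a sketch of the same idea in base R, the chunk below uses tapply() on a tiny made-up data frame; cars_example and its values are hypothetical and only mimic the Engine_cc and Efficiency columns of carSales.

```r
# Illustrative data frame; the values are invented for this example
cars_example = data.frame(
  Engine_cc = c(1000, 1000, 1500, 1500, 1500),
  Efficiency = c(24, 26, 18, 20, 22)
)

# Apply mean() and sd() to Efficiency within each engine size
mean_by_size = tapply(cars_example$Efficiency, cars_example$Engine_cc, mean)
sd_by_size = tapply(cars_example$Efficiency, cars_example$Engine_cc, sd)

# Collect the results in one table
summary_example = data.frame(
  EngineSize = as.numeric(names(mean_by_size)),
  Mean = as.numeric(mean_by_size),
  SD = as.numeric(sd_by_size)
)
print(summary_example)
```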
By observing the table, we could deduce that the mean efficiency decreases as engine size increases. Is that a correct observation? To get a better idea, let’s create two graphs:
1. One scatter plot to observe the distribution of the two variables.
2. A graph with standard deviations as error bars.
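For the second graph, error bars can also be sketched with base graphics only, using arrows() with flat heads. This is an alternative shown for illustration, not the method used later in this manual; the means and standard deviations are copied from the table above, and the object names are illustrative.

```r
# Values copied from the table of efficiency per engine size
engine_size = c(1000, 1500, 2000, 2500, 3000, 4000, 6000)
mean_eff = c(24.95, 19.67, 16.51, 14.49, 12.78, 10.62, 8.83)
sd_eff = c(1.96, 3.41, 2.88, 3.13, 2.20, 1.75, 1.03)

# Lower and upper ends of each error bar (mean minus/plus one SD)
lower_end = mean_eff - sd_eff
upper_end = mean_eff + sd_eff

plot(engine_size, mean_eff,
     las = 2,
     pch = 19,
     xlim = c(500, 6000),
     ylim = c(0, 35),
     xlab = "Engine size in CC",
     ylab = "Mean efficiency (mpg +/- SD)")
# angle = 90 and code = 3 draw a flat cap at both ends of each segment
arrows(x0 = engine_size, y0 = lower_end,
       x1 = engine_size, y1 = upper_end,
       angle = 90, code = 3, length = 0.05,
       col = "#A11515")
```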
Similar to the graphs above, we have two variables under analysis. Which one is the independent variable and which one is the dependent variable? It should be quite easy in this case.
- Does efficiency affect engine size? Of course not.
- Does engine size affect efficiency? Yes. This indicates that the independent variable is engine size, and it should be plotted on the x-axis. Observe the code and graph below.
plot(carSales$Efficiency ~ carSales$Engine_cc,
las=2,
xlab = "Engine size in CC",
ylab = "Efficiency (mpg)",
xlim=c(500,6000),
ylim = c(0,35),
pch = 19,
col="blue")
abline(lm(carSales$Efficiency ~ carSales$Engine_cc),
col="red")
legend("topright",
cex=0.8,
paste("Y = 26.365 - 0.0046X"))
# Add the information for the correlation
# Use x and y coordinates to place the text
correlation = cor(carSales$Efficiency , carSales$Engine_cc)
text(x=6000,
y= 28,
paste("Correlation: ", round(correlation, 3)),
cex=0.8,
pos = 2)
Hmisc::errbar(x = eff$EngineSize,
y = eff$Mean,
las=2,
yplus = eff$Mean+eff$SD,
yminus = eff$Mean-eff$SD,
cap = 0.03,
xlab = "Engine size in CC",
ylab = "Mean efficiency (mpg +/- SD)",
lty = 8,
lwd = 1,
pch=24,
errbar.col = "#A11515",
xlim=c(500,6000),
ylim = c(0,35))
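In the engine-size scatter plot above, the regression equation in the legend is typed by hand. One way to avoid copying numbers manually is to build the label from the model itself with coef() and sprintf(). The sketch below uses three made-up points (engine_cc_example and efficiency_example are hypothetical), so its equation is not the one from the carSales data.

```r
# Illustrative data only; not taken from the carSales data set
engine_cc_example = c(1000, 2000, 3000)
efficiency_example = c(25, 20, 15)

example_fit = lm(efficiency_example ~ engine_cc_example)

# Extract intercept and slope as plain numbers
intercept_value = coef(example_fit)[[1]]
slope_value = coef(example_fit)[[2]]

# Choose the sign to print between the two terms
sign_symbol = if (0 > slope_value) "-" else "+"

# Build the label: intercept with 3 decimals, slope with 4 decimals
equation_label = sprintf("Y = %.3f %s %.4fX",
                         intercept_value, sign_symbol, abs(slope_value))
print(equation_label)
```

The resulting text can then be passed to legend() or text() in place of the hand-typed string.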
Disclaimer: This short manual series is a work in progress. Unless otherwise clearly stated, this material should be considered a draft version.