# Libraries used in this document

library(rmarkdown)
library(readxl)
library(productplots) # For happy
library(dplyr)
library(ggplot2)
library(gridExtra)
library(Rfast)
library(Hmisc)       # For scatter plot with error bars
library(knitr)
library(kableExtra)
library(magrittr)

# Data Sets used in this document

data("happy")

carSales = read_excel("DataSets/carSales.xlsx")


Introduction to data analysis using R, R Studio and R Markdown
Dee Chiluiza, PhD
Northeastern University
Boston, Massachusetts

 
Short manual series:
Scatter Plots

1 Scatter Plots

In data analysis and statistics, it is often important to establish the relationship between two variables. For example: does an increase in calorie intake correlate with an increase in body weight? Do people who study longer hours obtain higher grades?

Scatter plots are diagrams that use Cartesian coordinates to display the relationship between two numerical variables. Put simply, suppose a table contains two variables, A and B: does variable A affect the behavior of variable B? This topic is covered in more detail in the correlation and regression section. Observe the following scenarios:

  1. If variable A increases, variable B increases.
    If variable A decreases, variable B decreases.
    These two scenarios refer to positive correlations, in which the two variables move in the same direction.

  2. If variable A increases, variable B decreases.
    If variable A decreases, variable B increases.
    These two scenarios refer to negative correlations, in which the two variables move in opposite directions.

  3. Finally, there is the case in which changes in variable A do not affect the behavior of variable B; this is referred to as no correlation.

Using the six vectors in the R chunk below, we will create scatter plots to observe the relationships between A1 and B1, A2 and B2, and A3 and B3. In all cases, A is the independent variable and B is the dependent variable.

Observe the {r} chunk below; it contains the objects mentioned above and their corresponding data.

# Vectors used for correlation analysis and scatter plot displays.
# Set 1
A1 = c(22,23,24,26,28,45,58,64,71, 87, 110, 135)
B1 = c(125,129,134,146,157,253,324,380,425, 501, 854, 876)
# Set 2
A2 = c(2,8,23,34,43,51,63,75,96,120, 145, 160)
B2 = c(501,490,479,468,457,447,437,427,410,403, 396, 363)
# Set 3
A3 = c(2,8,23,34,43,51,63,75,96,120, 145, 160)
B3 = c(562,120,537,445,438,420,305,231,480,300, 100, 224)

# Create a table to present data

table1 = matrix(c(A1, B1, A2, B2, A3, B3), 
                nrow = 12, 
                byrow = FALSE)
colnames(table1) = c("A1", "B1", "A2", "B2", "A3", "B3")


# Present the table; note the three chained functions:
# kable(), kable_classic_2(), add_header_above().

table1 %>%
  kable(align = "c", 
             caption = "Table 1. The variables",
             table.attr = "style='width:60%;'") %>%
  kable_classic_2(bootstrap_options=c("hover","bordered"),
              html_font = "Cambria",
              position = "center",
              font_size = 12) %>%
  add_header_above(c("Group 1" = 2,"Group 2" = 2,"Group 3" = 2))
Table 1. The variables

 Group 1      Group 2      Group 3
  A1   B1      A2   B2      A3   B3
  22  125       2  501       2  562
  23  129       8  490       8  120
  24  134      23  479      23  537
  26  146      34  468      34  445
  28  157      43  457      43  438
  45  253      51  447      51  420
  58  324      63  437      63  305
  64  380      75  427      75  231
  71  425      96  410      96  480
  87  501     120  403     120  300
 110  854     145  396     145  100
 135  876     160  363     160  224


I will not explain correlation or linear regression in detail in this section; for more information, please read the corresponding section.

# Correlations
correlation_1 = cor(B1,A1)
correlation_2 = cor(B2,A2)
correlation_3 = cor(B3,A3)

Correlation between A1 and B1 is 0.984.
This is an example of a strong positive correlation.

Correlation between A2 and B2 is -0.983.
This is an example of a strong negative correlation.

Correlation between A3 and B3 is -0.508.
This is an example of a weak negative correlation.
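
As a minimal sketch, the three values above can be collected in one named vector and rounded for reporting. The vectors are the ones defined earlier; the object name correlations is my own choice and not part of the original chunk.

```r
# Vectors from the chunk above
A1 = c(22, 23, 24, 26, 28, 45, 58, 64, 71, 87, 110, 135)
B1 = c(125, 129, 134, 146, 157, 253, 324, 380, 425, 501, 854, 876)
A2 = c(2, 8, 23, 34, 43, 51, 63, 75, 96, 120, 145, 160)
B2 = c(501, 490, 479, 468, 457, 447, 437, 427, 410, 403, 396, 363)
A3 = c(2, 8, 23, 34, 43, 51, 63, 75, 96, 120, 145, 160)
B3 = c(562, 120, 537, 445, 438, 420, 305, 231, 480, 300, 100, 224)

# cor() is symmetric: cor(B1, A1) gives the same value as cor(A1, B1)
correlations = c(Set1 = cor(B1, A1),
                 Set2 = cor(B2, A2),
                 Set3 = cor(B3, A3))

# Round to three decimals for reporting
round(correlations, 3)
```

Because cor() is symmetric, the order of the two variables does not matter here; it only starts to matter in plot() and lm(), where one variable is treated as dependent.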

2 Create a Scatter Plot


To create a scatter plot, we use the function plot(), separating the two variables with a tilde (~). The order of the variables is very important: always start with the dependent variable (it will appear on the y-axis), then write the independent variable (it will appear on the x-axis).
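
The formula notation described above can be compared with the positional form of plot(). This is only a sketch: the toy vectors hours and grade are hypothetical and not part of this document's data.

```r
# Hypothetical toy data: study hours (independent) and grade (dependent)
hours = c(1, 2, 3, 4, 5, 6)
grade = c(55, 60, 68, 71, 80, 86)

par(mfrow = c(1, 2))

# Formula interface: dependent ~ independent
plot(grade ~ hours, main = "plot(grade ~ hours)")

# Positional interface: x (independent) first, then y (dependent)
plot(hours, grade, main = "plot(hours, grade)")
```

Both calls should place hours on the x-axis and grade on the y-axis; only the order in which the variables are written differs.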

2.1 Plot a basic scatter plot

Observe the code below. At this point, you are familiar with using par(mfrow = ...) to present figures in a matrix; in this case, one row and two columns, mfrow=c(1,2).

We are using the plots below to observe the relationships between variables A1-B1 (group 1) and A2-B2 (group 2).

par(mfrow=c(1,2), mai=c(0.6, 0.8, 0.5, 0.4), mar=c(4,4,1,1))
# Plot 1
plot(B1~A1)
# Plot 2
plot(B2~A2)

2.2 Improve graph presentation

The two plots above were created using the very basic code plot() with the name of the variables inside. We will start with some common changes using plot # 1:

  1. Change direction of y-axis values (las=).

  2. Improve x- and y-axes labels (xlab, ylab).

  3. Change x- and y-axes limits to improve data visualization (xlim, ylim).

  4. Change the color of data points (col).

  5. Change the shape of the data points (pch).

For the last item, we use the argument pch; check the available options with ?pch in your console. The pch numbers from 0 to 25 are the most commonly used since they produce plot-friendly shapes. You can also use ASCII characters (numbers 32 to 127) or native characters (numbers 128 to 255); try them.

par(font=1)
# Plot 1
plot(B1~A1,
     las=1,
     ylab="B1 (Dependent)",
     xlab="A1 (Independent)",
     xlim=c(0,140),
     ylim=c(0,900),
     col="#A11515",
     pch=8)
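
To browse the pch shapes mentioned above before choosing one, a small reference chart can help. The sketch below is one possible layout; the object names shapes, columns, and rows are my own.

```r
# Draw the 26 standard plotting symbols (pch 0 to 25) in a grid
shapes  = 0:25
columns = (shapes %% 7) + 1      # column position, 1 to 7
rows    = 4 - (shapes %/% 7)     # row position, top row first

plot(columns, rows,
     pch = shapes,
     cex = 2,
     col = "#A11515",
     bg = "grey80",              # fill color, used only by pch 21 to 25
     xlim = c(0.5, 7.5),
     ylim = c(0.5, 4.5),
     axes = FALSE,
     xlab = "",
     ylab = "",
     main = "pch values 0 to 25")

# Label each symbol with its pch number
text(columns, rows, labels = shapes, pos = 1, offset = 1)
```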

2.3 Choose plot type

One of the options you can use inside the plot code is type. Use ?plot in the console to find more information about these options:

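  1. p: for points

  2. l: for lines

  3. b: for both points and lines

  4. c: for the lines part alone of type b (gaps are left where the points would be)

  5. o: for overplotted points and lines

  6. s: for stair steps (the line moves horizontally first, then vertically)

  7. S: for stair steps (the line moves vertically first, then horizontally)

  8. h: for histogram-like vertical lines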

Using our first scatter plot, let’s explore all these options; keep in mind that our plot uses pch=8, which can be changed. It is up to you to decide which option best fits your data visualization needs. In the plots below, the type used is shown in the top-left corner of each plot. The first option (p) creates the basic plot observed above; it is the default type.

par(mfrow = c(2,2), mar=c(1,1,1,1), mai=c(0.5,0.5,0.5,0.5))

# Using type p
plot(B1~A1,
     type = "p",
     las=1,
     ylab="B1 (Dependent)",
     xlab="A1 (Independent)",
     xlim=c(0,140),
     ylim=c(0,900),
     col="#A11515",
     pch=8)
text("p", y=800, x=10, cex=2)

# Using type l
plot(B1~A1,
     type = "l",
     las=1,
     ylab="B1 (Dependent)",
     xlab="A1 (Independent)",
     xlim=c(0,140),
     ylim=c(0,900),
     col="#A11515",
     pch=8)
text("l", y=800, x=10, cex=2)

# Using type b
plot(B1~A1,
     type = "b",
     las=1,
     ylab="B1 (Dependent)",
     xlab="A1 (Independent)",
     xlim=c(0,140),
     ylim=c(0,900),
     col="#A11515",
     pch=8)
text("b", y=800, x=10, cex=2)

# Using type c
plot(B1~A1,
     type = "c",
     las=1,
     ylab="B1 (Dependent)",
     xlab="A1 (Independent)",
     xlim=c(0,140),
     ylim=c(0,900),
     col="#A11515",
     pch=8)
text("c", y=800, x=10, cex=2)

# Using type o
plot(B1~A1,
     type = "o",
     las=1,
     ylab="B1 (Dependent)",
     xlab="A1 (Independent)",
     xlim=c(0,140),
     ylim=c(0,900),
     col="#A11515",
     pch=8)
text("o", y=800, x=10, cex=2)

# Using type s
plot(B1~A1,
     type = "s",
     las=1,
     ylab="B1 (Dependent)",
     xlab="A1 (Independent)",
     xlim=c(0,140),
     ylim=c(0,900),
     col="#A11515",
     pch=8)
text("s", y=800, x=10, cex=2)

# Using type S
plot(B1~A1,
     type = "S",
     las=1,
     ylab="B1 (Dependent)",
     xlab="A1 (Independent)",
     xlim=c(0,140),
     ylim=c(0,900),
     col="#A11515",
     pch=8)
text("S", y=800, x=10, cex=2)

# Using type h
plot(B1~A1,
     type = "h",
     las=1,
     ylab="B1 (Dependent)",
     xlab="A1 (Independent)",
     xlim=c(0,140),
     ylim=c(0,900),
     col="#A11515",
     pch=8)
text("h", y=800, x=10, cex=2)
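
The eight calls above differ only in the value of type, so a loop is one way to avoid the repetition. This is a sketch under my own assumptions: the names plot_types and current_type are hypothetical, the grid is set to two rows and four columns so all eight panels fit in one figure, and the y-axis label names B1 because that is the variable being plotted.

```r
# Vectors from Set 1
A1 = c(22, 23, 24, 26, 28, 45, 58, 64, 71, 87, 110, 135)
B1 = c(125, 129, 134, 146, 157, 253, 324, 380, 425, 501, 854, 876)

# One entry per plot type discussed above
plot_types = c("p", "l", "b", "c", "o", "s", "S", "h")

# Two rows and four columns hold all eight panels
par(mfrow = c(2, 4), mar = c(4, 4, 1, 1))

for (current_type in plot_types) {
  plot(B1 ~ A1,
       type = current_type,
       las = 1,
       ylab = "B1 (Dependent)",
       xlab = "A1 (Independent)",
       xlim = c(0, 140),
       ylim = c(0, 900),
       col = "#A11515",
       pch = 8)
  # Show the type in the top-left corner of each panel
  text(current_type, y = 800, x = 10, cex = 2)
}
```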

2.4 Obtain and plot the linear regression model

The linear regression model is used to add a trend line (also called a trendline) to scatter plots. This line allows us to observe the direction of the relationship.

  • To obtain the linear regression model, use the function lm().

  • Inside the function, separate the variables exactly as you did in the plot, using a tilde (~): start with the dependent variable (y), then the independent variable (x).

  • For practical purposes, give the linear regression model a name; you will use it to plot the trend line.

  • After the plot code, simply add abline() with the name of the linear regression model object.

In the {r} chunk below, observe how abline() is used to add the linear regression line after the end of the plot code. All you have to do is enter the name of the object containing the linear regression model.

# Create an object with linear regression model
linReg1 = lm(B1 ~ A1)

# Create the scatter plot and add the trendline with the linear regression model

plot(B1~A1,
     type = "p",
     las=1,
     ylab="B1 (Dependent)",
     xlab="A1 (Independent)",
     xlim=c(0,140),
     ylim=c(0,900),
     col="#A11515",
     pch=8)

text("Type: p", y=800, x=10, cex=1)

abline(linReg1, col="#A11515")
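
The line drawn by abline() comes from the intercept and slope stored in the model object. As a brief sketch (the names intercept, slope, and r_squared are my own), these can be inspected with coef() and summary():

```r
# Vectors from Set 1
A1 = c(22, 23, 24, 26, 28, 45, 58, 64, 71, 87, 110, 135)
B1 = c(125, 129, 134, 146, 157, 253, 324, 380, 425, 501, 854, 876)

linReg1 = lm(B1 ~ A1)

# coef() returns a named vector: "(Intercept)" and "A1" (the slope)
coef(linReg1)

intercept = coef(linReg1)[["(Intercept)"]]
slope     = coef(linReg1)[["A1"]]

# For a model with a single predictor, R-squared equals the squared correlation
r_squared = summary(linReg1)$r.squared
all.equal(r_squared, cor(A1, B1)^2)
```

A positive slope corresponds to the upward trend line seen in the plot above.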

2.5 Change image size and position

Perform these steps only if you feel they are necessary for your data visualization needs. I will use exactly the same code as in the previous plot; the only changes are in the {r} chunk options.

Since the {r} chunk options are not displayed in the output document (the HTML file you are reading), I will show them here:

{r, fig.align="center", fig.width=4, fig.height=4, fig.cap="Scatter Plot 1: Linear relationship between variables A1 and B1"}

Notice the use of fig.align, fig.width, fig.height, and fig.cap. There are many other options; investigate and learn.

# Create an object with linear regression model
linReg1 = lm(B1 ~ A1)

# Create the scatter plot and add the trendline with the linear regression model

plot(B1~A1,
     type = "p",
     las=1,
     ylab="B1 (Dependent)",
     xlab="A1 (Independent)",
     xlim=c(0,140),
     ylim=c(0,900),
     col="#A11515",
     pch=8)

text("Type: p", y=800, x=10, cex=1)

abline(linReg1, col="#A11515")
Scatter Plot 1: Linear relationship between variables A1 and B1
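
If the same size and alignment should apply to every figure in a document, one option is to set them once as defaults with knitr::opts_chunk$set(), typically in a setup chunk near the top of the file. This is a sketch of that alternative, not the approach used for the figure above; fig.cap is left out because captions usually differ per figure.

```r
library(knitr)

# Set default figure options for all later chunks in the document.
# Options written in an individual chunk header still override these.
opts_chunk$set(fig.align = "center",
               fig.width = 4,
               fig.height = 4)

# Check the current defaults
opts_chunk$get("fig.width")
```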

3 Scatter plot 2, negative correlation

Since you already know the code and its applications, I include it all in the R chunk below. Following the recommendations provided above, customize the plot in your own R Studio session.

# Linear regression
linReg2 = lm(B2 ~ A2)

# Create graph with regression line
plot(B2~A2)
abline(linReg2)
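
One possible customization of the plot above is sketched below. The color, point shape, axis limits, and text position are my own choices for illustration; adjust them to your needs.

```r
# Vectors from Set 2
A2 = c(2, 8, 23, 34, 43, 51, 63, 75, 96, 120, 145, 160)
B2 = c(501, 490, 479, 468, 457, 447, 437, 427, 410, 403, 396, 363)

linReg2 = lm(B2 ~ A2)

plot(B2 ~ A2,
     las = 1,
     ylab = "B2 (Dependent)",
     xlab = "A2 (Independent)",
     xlim = c(0, 170),
     ylim = c(300, 550),
     col = "#15158A",
     pch = 17)

# Dashed trend line in the same color as the points
abline(linReg2, col = "#15158A", lty = 2)

# Report the correlation inside the plot, right-aligned at x = 170
text(x = 170, y = 540,
     labels = paste("Correlation:", round(cor(B2, A2), 3)),
     pos = 2,
     cex = 0.8)
```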

4 Scatter plot with Error bars: the Hmisc library

For this part we will use the car sales data set; check the information in the summary and table below.

summary(carSales) %>%
  kable(align = "c",
        format = "html",
        table.attr = "style='width:60%;'")%>%
  kable_classic_2(bootstrap_options=c("hover","bordered","condensed"),
              html_font = "Cambria",
              position = "center",
              font_size = 10) 
Location          Year          FuelType          Transmission      Owner             Efficiency     Engine_cc     Power_bhp      Seats           Km              Price
Length:5844       Min.   :2001  Length:5844       Length:5844       Length:5844       Min.   : 8.00  Min.   :1000  Min.   : 50.0  Min.   : 2.000  Min.   :  5539  Min.   :  800
Class :character  1st Qu.:2012  Class :character  Class :character  Class :character  1st Qu.:16.00  1st Qu.:1500  1st Qu.:100.0  1st Qu.: 5.000  1st Qu.: 79159  1st Qu.:18280
Mode  :character  Median :2014  Mode  :character  Mode  :character  Mode  :character  Median :18.00  Median :1500  Median :100.0  Median : 5.000  Median :100295  Median :27157
NA                Mean   :2014  NA                NA                NA                Mean   :18.46  Mean   :1790  Mean   :139.3  Mean   : 5.171  Mean   :101558  Mean   :26474
NA                3rd Qu.:2016  NA                NA                NA                3rd Qu.:22.00  3rd Qu.:2000  3rd Qu.:150.0  3rd Qu.: 5.000  3rd Qu.:121491  3rd Qu.:35695
NA                Max.   :2020  NA                NA                NA                Max.   :30.00  Max.   :6000  Max.   :600.0  Max.   :10.000  Max.   :452805  Max.   :63496

4.1 Car sales: engine size

Imagine you are interested in knowing what effect the size of a car's engine has on its efficiency. People say that big cars consume more gas than smaller cars; is that true?
Using the code unique(carSales$Engine_cc), you will find that the engine size variable contains seven different values: 1000, 1500, 2000, 2500, 3000, 4000, and 6000 cubic centimeters (cc).
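
As a small illustration of what unique() returns, the sketch below uses a hypothetical toy vector, engine_cc, rather than the real carSales data. Wrapping the call in sort() lists the values in increasing order, and table() counts how often each value occurs.

```r
# Hypothetical toy vector standing in for carSales$Engine_cc
engine_cc = c(1500, 1000, 2000, 1500, 6000, 2500, 1500, 1000)

# Distinct values, in order of first appearance
unique(engine_cc)

# Distinct values, sorted from smallest to largest
sort(unique(engine_cc))

# Number of observations per engine size
table(engine_cc)
```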

Check section Continuous Data to see an extended explanation of the table below.

# Using car sales data set
# Plot and observe relationship between engine size and efficiency
# Start by grouping by the different engine sizes

eff = carSales %>% 
  group_by(EngineSize = Engine_cc) %>%
  summarise(Mean = mean(Efficiency), 
            SD = sd(Efficiency),
            Minimum = min(Efficiency),
            Maximum = max(Efficiency))
 
eff %>%
  kable(align = "c",
        caption = "Descriptive values of efficiency per engine size",
        format = "html",
        digits = 2,
        table.attr = "style='width:60%;'")%>%
  kable_classic_2(bootstrap_options=c("hover","bordered","condensed"),
              html_font = "Cambria",
              position = "center",
              font_size = 12) %>%
  add_header_above(c(" " = 1,"Engine Efficiency" = 4))
Descriptive values of efficiency per engine size

                    Engine Efficiency
EngineSize     Mean      SD   Minimum   Maximum
      1000    24.95    1.96        22        30
      1500    19.67    3.41        16        26
      2000    16.51    2.88        10        22
      2500    14.49    3.13        10        18
      3000    12.78    2.20         8        18
      4000    10.62    1.75         8        12
      6000     8.83    1.03         8        10

By observing the table, we could deduce that the mean efficiency decreases as engine size increases. Is that a correct observation? To get a better idea, let’s create two graphs:
1. One scatter plot to observe the distribution of the two variables.
2. A graph with standard deviations as error bars.

4.2 Size vs efficiency: Scatter plot

Similar to the graphs above, we have two variables under analysis. Which one is the independent variable and which one is the dependent variable? It should be quite easy in this case.
- Does efficiency affect engine size? Of course not.
- Does engine size affect efficiency? Yes. This indicates that the independent variable is engine size, and it should be plotted on the x-axis. Observe the code and graph below.

plot(carSales$Efficiency ~ carSales$Engine_cc,
     las=2,
     xlab = "Engine size in CC",
     ylab = "Efficiency (mpg)",
     xlim=c(500,6000),
     ylim = c(0,35),
     pch = 19,
     col="blue")

abline(lm(carSales$Efficiency ~ carSales$Engine_cc),
       col="red")

legend("topright", 
       cex=0.8,
       paste("Y = 26.365 - 0.0046X"))

# Add the information for the correlation
# Use x and y coordinates to place the text

correlation = cor(carSales$Efficiency , carSales$Engine_cc)

text(x=6000,
     y= 28,
     paste("Correlation: ", round(correlation, 3)),
     cex=0.8,
     pos = 2)
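
The equation shown in the legend above is typed by hand. A possible alternative is to build the label from the fitted model so that it updates with the data. Since carSales.xlsx is not available here, this sketch uses mtcars (a data set built into R) as a stand-in; the columns mpg and disp and the object names fit and equation are therefore assumptions, not part of the car sales analysis.

```r
# mtcars (built into R) stands in for carSales in this sketch
fit = lm(mpg ~ disp, data = mtcars)

intercept = coef(fit)[["(Intercept)"]]
slope     = coef(fit)[["disp"]]

# Build the label "Y = a - bX" or "Y = a + bX" from the coefficients
equation = sprintf("Y = %.3f %s %.4fX",
                   intercept,
                   ifelse(slope < 0, "-", "+"),
                   abs(slope))

plot(mpg ~ disp, data = mtcars,
     las = 1,
     pch = 19,
     col = "blue",
     xlab = "Displacement (cu. in.)",
     ylab = "Efficiency (mpg)")

abline(fit, col = "red")

legend("topright", legend = equation, cex = 0.8)
```

With the car sales data, the same pattern would use lm(Efficiency ~ Engine_cc, data = carSales) in place of the mtcars model.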

4.3 Size vs efficiency: Hmisc::errbar graph

Hmisc::errbar(x = eff$EngineSize, 
       y = eff$Mean, 
       las=2,
       yplus = eff$Mean+eff$SD, 
       yminus = eff$Mean-eff$SD, 
       cap = 0.03,
       xlab = "Engine size in CC",
       ylab = "Mean efficiency (mpg +/- SD)",
       lty = 8,
       lwd = 1,
       pch=24,
       errbar.col = "#A11515",
       xlim=c(500,6000),
       ylim = c(0,35))
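
For comparison, a similar error-bar graph can be sketched in base R with arrows(), without the Hmisc library. The means and standard deviations below are copied from the table "Descriptive values of efficiency per engine size"; the object names (engine_size, mean_eff, sd_eff, upper, lower) are my own.

```r
# Values taken from the descriptive table above
engine_size = c(1000, 1500, 2000, 2500, 3000, 4000, 6000)
mean_eff    = c(24.95, 19.67, 16.51, 14.49, 12.78, 10.62, 8.83)
sd_eff      = c(1.96, 3.41, 2.88, 3.13, 2.20, 1.75, 1.03)

# Ends of the error bars: mean plus and minus one standard deviation
upper = mean_eff + sd_eff
lower = mean_eff - sd_eff

plot(engine_size, mean_eff,
     las = 2,
     pch = 19,
     xlim = c(500, 6000),
     ylim = c(0, 35),
     xlab = "Engine size in CC",
     ylab = "Mean efficiency (mpg +/- SD)")

# angle = 90 with code = 3 draws flat caps at both ends of each bar
arrows(x0 = engine_size, y0 = lower,
       x1 = engine_size, y1 = upper,
       angle = 90,
       code = 3,
       length = 0.05,
       col = "#A11515")
```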