Disclaimer
Please do not cite this document. Data and syntax can be reused freely while keep attributing and refering to the original source (where applicable) in the reference. This document is used for educational purpose only.

Introduction to Data Exploration and Visualization Techniques

Data Exploration Instance

Univariate Data Exploration

Data Distribution Exploration (Histogram, Boxplot, Density Plot)

Histogram

Benefits of using Histogram:
1. To give immediate information about central tendency of the data (the mean, median, mode) and the dispersion and variability of data
2. To recognize general pattern of data distribution of one’s interest
3. To identify the outlier, extreme, and unusual observation

Aside from base, ggplot2, and other packages, R package that also can produce Histogram is graphics packages. Please install and activate the library prior to graph execution.

#install.packages("graphics")
library(graphics)

We are going to use Low Birth Weight data from Hosmer & Lemeshow (2013) that supplied from different packages aplore3 (stands for Applied Logistic Regression 3rd Ed.)

#install.packages("aplore3")
library(aplore3)

Let us inspect the low birth weight data:

head(lowbwt, n = 10)
##    id      low age lwt  race smoke  ptl  ht  ui       ftv  bwt
## 1   4 < 2500 g  28 120 Other   Yes  One  No Yes      None  709
## 2  10 < 2500 g  29 130 White    No None  No Yes Two, etc. 1021
## 3  11 < 2500 g  34 187 Black   Yes None Yes  No      None 1135
## 4  13 < 2500 g  25 105 Other    No  One Yes  No      None 1330
## 5  15 < 2500 g  25  85 Other    No None  No Yes      None 1474
## 6  16 < 2500 g  27 150 Other    No None  No  No      None 1588
## 7  17 < 2500 g  23  97 Other    No None  No Yes       One 1588
## 8  18 < 2500 g  24 128 Black    No  One  No  No       One 1701
## 9  19 < 2500 g  24 132 Other    No None Yes  No      None 1729
## 10 20 < 2500 g  21 165 White   Yes None Yes  No       One 1790
names(lowbwt)
##  [1] "id"    "low"   "age"   "lwt"   "race"  "smoke" "ptl"   "ht"    "ui"   
## [10] "ftv"   "bwt"
str(lowbwt)
## 'data.frame':    189 obs. of  11 variables:
##  $ id   : int  4 10 11 13 15 16 17 18 19 20 ...
##  $ low  : Factor w/ 2 levels ">= 2500 g","< 2500 g": 2 2 2 2 2 2 2 2 2 2 ...
##  $ age  : int  28 29 34 25 25 27 23 24 24 21 ...
##  $ lwt  : int  120 130 187 105 85 150 97 128 132 165 ...
##  $ race : Factor w/ 3 levels "White","Black",..: 3 1 2 3 3 3 3 2 3 1 ...
##  $ smoke: Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 2 ...
##  $ ptl  : Factor w/ 3 levels "None","One","Two, etc.": 2 1 1 2 1 1 1 2 1 1 ...
##  $ ht   : Factor w/ 2 levels "No","Yes": 1 1 2 2 1 1 1 1 2 2 ...
##  $ ui   : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 2 1 1 1 ...
##  $ ftv  : Factor w/ 3 levels "None","One","Two, etc.": 1 3 1 1 1 1 2 2 1 2 ...
##  $ bwt  : int  709 1021 1135 1330 1474 1588 1588 1701 1729 1790 ...

The above example is worked when the data frame is embedded on the package itself. What if we only had it externally in our working directory?. As a dummy, let us export the lowbwt data and store it in the same assigned working directory.

#write.csv(lowbwt,"/Users/sachnazdes/Documents/2022-IPB/MATA KULIAH/STA563 - Eksplorasi dan Visualisasi Data/lowbwt.csv", row.names = TRUE)
data.lahir <- read.csv("lowbwt.csv")
colnames(data.lahir)
##  [1] "X"     "id"    "low"   "age"   "lwt"   "race"  "smoke" "ptl"   "ht"   
## [10] "ui"    "ftv"   "bwt"
#dim(data.lahir)
#View(data.lahir)

The data comprises 11 variables:

  • id is Identification number
  • low is Low birth weight (1: >= 2500, 2: < 2500 g)
  • age is Age of mother (Years)
  • lwt is Weight of mother at last menstrual period (Pounds)
  • race is Race (1: White, 2: Black, 3: Other)
  • smoke is Smoking status during pregnancy (1: No, 2: Yes)
  • ptl is History of premature labor (1: None, 2: One, 3: Two, etc)
  • ht is History of hypertension (1: No, 2: Yes)
  • ui is Presence of Uterine irritability (1: No, 2: Yes)
  • ftv is Number of physician visits during the first trimester (1: None, 2: One, 3: Two, etc)
  • bwt is Recorded birth weight (Grams)

Be cautious on class of variables after importing. Specify the class so that it satisfies the analysis tool requirement.
Coerce multiple variables in data frame into factor at once:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
cols <- c("low","race","smoke","ptl", "ht", "ui", "ftv")
data.lahir[cols] <- lapply(data.lahir[cols], factor)
str(data.lahir)
## 'data.frame':    189 obs. of  12 variables:
##  $ X    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ id   : int  4 10 11 13 15 16 17 18 19 20 ...
##  $ low  : Factor w/ 2 levels "< 2500 g",">= 2500 g": 1 1 1 1 1 1 1 1 1 1 ...
##  $ age  : int  28 29 34 25 25 27 23 24 24 21 ...
##  $ lwt  : int  120 130 187 105 85 150 97 128 132 165 ...
##  $ race : Factor w/ 3 levels "Black","Other",..: 2 3 1 2 2 2 2 1 2 3 ...
##  $ smoke: Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 2 ...
##  $ ptl  : Factor w/ 3 levels "None","One","Two, etc.": 2 1 1 2 1 1 1 2 1 1 ...
##  $ ht   : Factor w/ 2 levels "No","Yes": 1 1 2 2 1 1 1 1 2 2 ...
##  $ ui   : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 2 1 1 1 ...
##  $ ftv  : Factor w/ 3 levels "None","One","Two, etc.": 1 3 1 1 1 1 2 2 1 2 ...
##  $ bwt  : int  709 1021 1135 1330 1474 1588 1588 1701 1729 1790 ...

Prior to creating histogram, we can describe the summary of the data as follows:

summary(data.lahir)
##        X             id               low           age             lwt       
##  Min.   :  1   Min.   :  4.0   < 2500 g : 59   Min.   :14.00   Min.   : 80.0  
##  1st Qu.: 48   1st Qu.: 68.0   >= 2500 g:130   1st Qu.:19.00   1st Qu.:110.0  
##  Median : 95   Median :123.0                   Median :23.00   Median :121.0  
##  Mean   : 95   Mean   :121.1                   Mean   :23.24   Mean   :129.8  
##  3rd Qu.:142   3rd Qu.:176.0                   3rd Qu.:26.00   3rd Qu.:140.0  
##  Max.   :189   Max.   :226.0                   Max.   :45.00   Max.   :250.0  
##     race    smoke            ptl        ht        ui             ftv     
##  Black:26   No :115   None     :159   No :177   No :161   None     :100  
##  Other:67   Yes: 74   One      : 24   Yes: 12   Yes: 28   One      : 47  
##  White:96             Two, etc.:  6                       Two, etc.: 42  
##                                                                          
##                                                                          
##                                                                          
##       bwt      
##  Min.   : 709  
##  1st Qu.:2414  
##  Median :2977  
##  Mean   :2945  
##  3rd Qu.:3475  
##  Max.   :4990

To create histogram, one might utilize hist() function

hist(data.lahir$age, breaks = 25, col = "coral",
     xlab = "Age of Mother (years)",
     main = "Age Distribution of Mother when Giving Birth")

Let us learn about the argument of hist() function. breaks= is option to indicate how much range or bar that suit our interest. The more breaks we use, the more denser the histogram will be. col= is option to denote color of the chart. xlab = is used to add label on x axis, while mean is to showcast the title of the histogram.

From this graph we can see that the distribution of data is somewhat skewed to the right. which means the mean of the data might exceed its median. The average age of mother when she is giving birth is mostly around 23 years old. Accordingly we can also guess the outlier of the data. Which age is it?

Challenge:
Construct the histogram of other variables, what can you explain from it? What is distribution of each variable?, are there any outliers? which one(s)?

Boxplot
boxplot(data.lahir$age, horizontal = TRUE, col = 'coral', xlab="Age of Mother (years)")

Density Plot (Naïve and Kernel)

Naïve Estimator

weight <- data.lahir$bwt
hist(weight, breaks= 25, col = "coral",
     xlab= "birth weight in (g)",
     main = "",
     freq = FALSE)

If we arbitrarily change the value of freq= TRUE, then the vertical axis will be its corresponding frequency, or else it would be its probability density.

weight <- data.lahir$bwt
hist(weight, breaks= 25, col = "coral",
     xlab= "birth weight (g)",
     main = "",
     freq = FALSE)

h = 500
n = length(weight)
x <- seq(700, 5000, by=20)
fx <- NULL
for (i in 1:length(x)) {
  fx[i] = sum(ifelse(abs(weight - x[i]) < h, 1, 0))/(2*h*n)
}
lines(x, fx, type="l", col="blue", lwd=2)

The program to calculate the estimated value of f(x) can be explained as follows: The command sum(ifelse(abs(weight - x[i]) < h, 1, 0)) is used to calculate the number of observations that are in the interval (x-h, x+h ) by first converting the observations with values in the interval to 1 and those outside the interval to 0. The complete line of the fx[i] command is to store the value of the density function for each value of x in the range between 700 and 5000 g. Since it is impossible to calculate f(x) at any value, in this program only x weight values between 700 to 5000 with intervals of 20 g are calculated. The smaller the interval, the longer the computation will be due to the large number of x values that are estimated to be f(x) using the program above.

Challenge:
Fine-tuning the chart above using different h values. What can you digest from it?

Kernel Estimator
Unlike the naive estimator, the kernel estimator has a smoothing parameter feature.

dense <- density(data.lahir$age, bw=1, kernel="epanechnikov")
hist(data.lahir$age, freq = FALSE, breaks = 15, col = "skyblue1", main = "", xlab = "Age of Mother (years)")
lines(dense, col="blue", lwd=2, main="", ylim=c(0, 0.09))

The main options available in the density() function are bw and the kernel. The bw option is used to specify the window width to use, while the kernel option is used to select the kernel function to use. The density() function provides several kernel options, such as: gaussian, epanechnikov, rectangular, triangular, and biweight.

Summary statistics for Categorical Data
table(data.lahir$race)
## 
## Black Other White 
##    26    67    96
frekuensi <- table(data.lahir$race)
prop.table(frekuensi)
## 
##     Black     Other     White 
## 0.1375661 0.3544974 0.5079365
barplot(frekuensi)

pie(frekuensi)

Pembandingan antar kelompok:

aggregate(bwt ~ smoke, data = data.lahir, mean)
##   smoke      bwt
## 1    No 3054.957
## 2   Yes 2773.243

Cross tabulation:

ras <- data.lahir$race
kategori_weight <- data.lahir$low
table(ras, kategori_weight)
##        kategori_weight
## ras     < 2500 g >= 2500 g
##   Black       11        15
##   Other       25        42
##   White       23        73

99 / 5,000 Translation results Row percentage calculation can be done using the prop.table() function with margin = 1 option as follows:

crosstab<-table(ras, kategori_weight)
prop.table(crosstab, margin = 1)
##        kategori_weight
## ras      < 2500 g >= 2500 g
##   Black 0.4230769 0.5769231
##   Other 0.3731343 0.6268657
##   White 0.2395833 0.7604167

The comparison view can also be expressed in the form of a stacked barplot:

persentase <- prop.table(crosstab, margin = 1)
barplot(t(persentase))

Comparison of Distribution Shape of Several Population

#install.packages("sm")
library(sm)
## Warning: package 'sm' was built under R version 4.1.2
## Package 'sm', version 2.2-5.7: type help(sm) for summary information
sm.density.compare(data.lahir$age, data.lahir$race,
                   xlab="Age of Mother (year)",
                   lwd = 2)
colfill <-c(2:(2+length(levels(data.lahir$race))))
legend("topright", levels(data.lahir$race), fill = colfill)

hrbrthemes::import_roboto_condensed()
## You will likely need to install these fonts on your system as well.
## 
## You can find them in [/Library/Frameworks/R.framework/Versions/4.1/Resources/library/hrbrthemes/fonts/roboto-condensed]
library(ggridges)
library(ggplot2)
ggplot(data.lahir, aes(x = age, y = race, fill=race)) + 
  geom_density_ridges() +
  theme_ridges() +
  theme(legend.position = "none")
## Picking joint bandwidth of 2.05

library(ggridges)
library(ggplot2)
library(viridis)
## Loading required package: viridisLite
library(hrbrthemes)
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
##       Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
##       if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
#plot
ggplot(data.lahir, aes( x = age, y = race, fill = ..x..))+
  geom_density_ridges_gradient(scale=1, rel_min_height=0.01)+
  scale_fill_viridis(name = "usia", option = "C") +
  labs(title = "") +
  theme_ipsum() +
  theme(legend.position = "none",
        panel.spacing = unit(0.1, "lines"),
        strip.text.x = element_text(size=8))
## Picking joint bandwidth of 2.05

boxplot(data.lahir$age ~ data.lahir$race,
        col=c("coral","green","skyblue"),
        ylab="Mother Age (year)",
        xlab = "Race")

Outlier Detection

The outlier identification technique on a single variable can be done by paying attention to the difference between the observations and its sample average. An observation is said to be an outlier if the difference between the observed value and the mean (x bar) is greater than 3s (s is the standard deviation). In other words, an observation will be called an outlier if it is outside the range:
(xbar - 3s, xbar + 3s)

m <- mean(weight)
s <- sd(weight)
pencilan <- (weight > m + 3*s) | (weight < m - 3*s)

#to count how many outliers in our data
sum(pencilan)
## [1] 1
#to identify which index of observation belong to the outlier
which(pencilan)
## [1] 1
#value of outlier
weight[which(pencilan)]
## [1] 709

Another possible approach is to use a limit value. The limit used to identify outliers is the lower limit and the upper limit with the formula:
- Lower Limit = Q1 - 1.5*IQR - Upper Limit = Q3 + 1.5*IQR

#outlier detection
nilai.pencilan <- boxplot.stats(weight)$out
which(weight == nilai.pencilan)
## [1] 1
nilai.pencilan
## [1] 709

The existence of large extreme data will cause the sample average to be larger than it should be. Likewise, when there is small extreme data, the average sample will be smaller than it should be. So in this condition it can be said that our estimator is not robust because its value is easily disturbed by the presence of outliers, especially if the size of the data cluster is not large. The solution to this problem is a robust estimator, such as: Trimmed Mean, Winsorized Mean, and M-Estimators for the average.

example1 <- c(12, 12, 12, 13, 13, 14, 16, 17, 18, 18, 19, 19, 19, 19, 20, 20, 21, 22, 22, 67)
mean(example1)
## [1] 19.65
mean(example1, trim = 0.05)
## [1] 17.44444

The pysch package provides a function named winsor.mean() which can be used to calculate the winsorized mean of a data vector.

library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
winsor.mean(example1, trim=0.1)
## [1] 17.4
winsor.mean(example1, trim=0.05)
## [1] 17.5125

Bivariate Data Exploration

Scatter Plot

The scatter plot is a two-dimensional graph that displays the cross of the Cartesian axis and data points where each axis represents each variable and each point on the plot represents one observation in the data whose position coordinates are adjusted to the value of the variable. You can find harga rumah data through this link

rumah <- read.table("harga rumah.txt", header=TRUE)
attach(rumah)
plot(size, price, ylab="Housing Price (US$)",
     xlab = "Size (sq feet)",
     pch = 19, col = "coral", cex = 1.5)

Signed Scatter Plot

plot(size, price, ylab="Housing Price (US$)",
     xlab = "Size (sq feet)",
     pch = 19, cex = 1.5,
     col=ifelse(location == 1, "coral", "navyblue"))

Correlation matrix

pairs(rumah[,1:2], pch=19, col="coral", cex=1.5)

library(PerformanceAnalytics)
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## 
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
## 
##     legend
chart.Correlation(rumah, histogram = TRUE, pch= 19)

pairs(data.lahir[,c(4,5,12)], pch=19, col="coral", cex=1.5)

2D Density Plot

# Library
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble  3.1.4     ✓ purrr   0.3.4
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x psych::%+%()    masks ggplot2::%+%()
## x psych::alpha()  masks ggplot2::alpha()
## x dplyr::filter() masks stats::filter()
## x xts::first()    masks dplyr::first()
## x dplyr::lag()    masks stats::lag()
## x xts::last()     masks dplyr::last()
#Generate data
a <- data.frame( x=rnorm(20000, 10, 1.9), y=rnorm(20000, 10, 1.2) )
b <- data.frame( x=rnorm(20000, 14.5, 1.9), y=rnorm(20000, 14.5, 1.9) )
c <- data.frame( x=rnorm(20000, 9.5, 1.9), y=rnorm(20000, 15.5, 1.9) )
data <- rbind(a,b,c)

Create basic scatterplot:

ggplot(data, aes(x=x, y=y) ) +
  geom_point()

Second histogram with default feature:

ggplot(data, aes(x=x, y=y) ) +
  geom_bin2d() +
  theme_bw()

2d plot is useful when we have abundance data/ observation. For the 2d histogram, the plot area is divided in a multitude of squares. (It is a 2d version of the classic histogram). It is called using the geom_bin2d() function.This function offers a bins argument that controls the number of bins you want to display. Please execute the following syntax:

# Bin size control + color palette
ggplot(data, aes(x=x, y=y) ) +
  geom_bin2d(bins = 70) +
  scale_fill_continuous(type = "viridis") +
  theme_bw()

Another representative of 2d plot is hexagone histogram as an alternative of square multitude. It is then called hexbin chart. It is using function geom_hex() to produce this chart. This function provides the bins argument as well, to control the number of division per axis.

It is also need to activate dependency packages such as: hexbin packages.

#install.packages("hexbin")
library(hexbin)

Let’s create hexbin chart with default option:

ggplot(data, aes(x=x, y=y) ) +
  geom_hex() +
  theme_bw()

We can control bin size + custom color pallette:

ggplot(data, aes(x=x, y=y) ) +
  geom_hex(bins = 70) +
  scale_fill_continuous(type = "viridis") +
  theme_bw()

Another type of 2D density plot is contour 2d density plot. it is supported with raster function.

Show contour only:

ggplot(data, aes(x=x, y=y) ) +
  geom_density_2d()

Show the area only:

ggplot(data, aes(x=x, y=y) ) +
  stat_density_2d(aes(fill = ..level..), geom = "polygon")

Show area and contour:

ggplot(data, aes(x=x, y=y) ) +
  stat_density_2d(aes(fill = ..level..), geom = "polygon", colour="white")

Using Raster:

ggplot(data, aes(x=x, y=y) ) +
  stat_density_2d(aes(fill = ..density..), geom = "raster", contour = FALSE) +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  theme(
    legend.position='none'
  )

Bubble Plot

It is used when we have at least 3 variables of interest. For ilustration we take data of Hans Rosling TED Talk - Gapminder. We need to install and activate gapminder library.

#install.packages("gapminder")

Activate other supplementary libraries:

library(tidyverse)
library(hrbrthemes)
library(viridis)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(ggrepel)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Import the data:

library(gapminder)
data_bubble <- gapminder %>% filter(year=="2007") %>% dplyr::select(-year)

Display the bubble chart:

data_bubble %>%
  mutate(pop=pop/1000000) %>%
  arrange(desc(pop)) %>%
  mutate(country = factor(country, country)) %>%
  ggplot( aes(x=gdpPercap, y=lifeExp, size = pop, color = continent)) +
    geom_point(alpha=0.7) +
    scale_size(range = c(1.4, 19), name="Population (M)") +
    scale_color_viridis(discrete=TRUE, guide=FALSE) +
    theme_ipsum() +
    theme(legend.position="bottom")
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.

To make it more insightful, we can annotate the name of country on the bubble chart. Firstly, let us pre-proccessed the data.

tmp <- data_bubble %>%
 mutate(
   annotation = case_when(
    gdpPercap > 5000 & lifeExp < 60 ~ "yes",
    lifeExp < 30 ~ "yes",
     gdpPercap > 40000 ~ "yes"
    )
) %>%
mutate(pop=pop/1000000) %>%
  arrange(desc(pop)) %>%
  mutate(country = factor(country, country))

Create the plot:

ggplot( tmp, aes(x=gdpPercap, y=lifeExp, size = pop, color = continent)) +
    geom_point(alpha=0.7) +
    scale_size(range = c(1.4, 19), name="Population (M)") +
    scale_color_viridis(discrete=TRUE) +
    theme_ipsum() +
    theme(legend.position="none") +
    geom_text_repel(data=tmp %>% filter(annotation=="yes"), aes(label=country), size=4 )

Create interactive version:

p <- data_bubble %>%
  mutate(gdpPercap=round(gdpPercap,0)) %>%
  mutate(pop=round(pop/1000000,2)) %>%
  mutate(lifeExp=round(lifeExp,1)) %>%
  arrange(desc(pop)) %>%
  mutate(country = factor(country, country)) %>%
  mutate(text = paste("Country: ", country, "\nPopulation (M): ", pop, "\nLife Expectancy: ", lifeExp, "\nGdp per capita: ", gdpPercap, sep="")) %>%
  ggplot( aes(x=gdpPercap, y=lifeExp, size = pop, color = continent, text=text)) +
    geom_point(alpha=0.7) +
    scale_size(range = c(1.4, 19), name="Population (M)") +
    scale_color_viridis(discrete=TRUE, guide=FALSE) +
    theme_ipsum() +
    theme(legend.position="none")

ggplotly(p, tooltip="text")

Local Regression

Local Regression (loess) is a non-parametric approach that fits multiple regressions in local neighborhood. It is used to smoothen plot. The principle of local regression is to determine the regression line using only the closest few points, not using all the data in the, thus it is so called local regression.

data(economics, package="ggplot2")  # load data
economics$index <- 1:nrow(economics)  # create index variable
economics <- economics[1:80, ]  # retail 80rows for better graphical understanding
loessMod10 <- loess(uempmed ~ index, data=economics, span=0.10) # 10% smoothing span
loessMod25 <- loess(uempmed ~ index, data=economics, span=0.25) # 25% smoothing span
loessMod50 <- loess(uempmed ~ index, data=economics, span=0.50) # 50% smoothing span

Predict Loess:

smoothed10 <- predict(loessMod10) 
smoothed25 <- predict(loessMod25) 
smoothed50 <- predict(loessMod50) 

Create the plot:

plot(economics$uempmed, x=economics$date, type="l", main="Loess Smoothing and Prediction", xlab="Date", ylab="Unemployment (Median)")
lines(smoothed10, x=economics$date, col="red")
lines(smoothed25, x=economics$date, col="green")
lines(smoothed50, x=economics$date, col="blue")
legend('bottomright', legend=c('.1', '.25', '.5'),
        col=c('red', 'green', 'blue'), pch=19, title='Smoothing Span')

References

  1. https://www.data-to-viz.com/
  2. https://r-graph-gallery.com/
  3. http://r-statistics.co/
  4. Hosmer Jr, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied logistic regression (Vol. 398). John Wiley & Sons.
  5. Sartono, B., Kiswani, D. B., Dito, G.A. (2021). Teknik Eksplorasi Data yang Harus Dikuasai Data Scientist. IPB Press.