1 Exploratory Data Analysis (EDA)

1.1 Data

  • Data comes from many sources such as measurements, events, text and videos.

  • Raw data are often unstructured (such as images, text and click streams) and need to be converted into structured, actionable data before statistical methods can be applied.

  • Structured data formats:

    • Rectangular: two-dimensional data with rows indicating records (cases) and columns indicating features (variables)
    • Non-rectangular: time-series data, spatial data, graphs (or networks)
  • Rectangular data types

    1. Numeric:

      • Continuous (float)
      • Discrete (integer, count)
    2. Categorical:

      • Non-ordinal
      • Ordinal
      • Binary
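
As a rough illustration of how these data types map onto R column classes, the sketch below builds a small data frame. The data frame `toy` and all of its columns are made up for this example and are not part of the crime dataset used later.

```r
# Hypothetical toy data: one column per rectangular data type
toy <- data.frame(
  height_cm = c(172.5, 160.2, 181.0),            # numeric, continuous (double)
  n_visits  = c(3L, 0L, 7L),                     # numeric, discrete (integer count)
  colour    = factor(c("red", "blue", "red")),   # categorical, non-ordinal
  size      = factor(c("small", "large", "medium"),
                     levels = c("small", "medium", "large"),
                     ordered = TRUE),            # categorical, ordinal
  smoker    = c(TRUE, FALSE, FALSE)              # categorical, binary (logical)
)

# First class label of each column
col_classes <- sapply(toy, function(col) class(col)[1])
print(col_classes)

# Ordered factors can be compared against a level; unordered factors cannot
print(toy$size < "large")
```

Storing an ordinal variable as an ordered factor keeps the level order available to sorting and comparison, which a plain character column would lose.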

1.2 Estimates of central tendency

  • Estimates sensitive to outlier effect:

    • Mean: Sum of values divided by the number of values.

    • Weighted mean: The sum of each value times its weight, divided by the sum of the weights (motivation to use: to reduce the influence of intrinsic variability, or to unify the values based on another variable, such as population).

  • Estimates robust to outlier effect:

    • Median: The value such that one-half of the data lies above it and one-half lies below it.

    • Weighted median: The value in the sorted data such that one-half of the sum of the weights lies above it and one-half lies below it.

    • Trimmed mean: The average of all values after dropping a fixed number of extreme values from each end of the sorted data.

  • Here we load an example dataset (state crime data for the USA) to show the R code for calculating these estimates. The first rows of the dataset are shown in table 1.1.

# load libraries
library(RSkittleBrewer)
library(matrixStats)
library(DT)
library(tidyverse)

# Make the colors pretty
trop = RSkittleBrewer("tropical")
palette(trop)
par(pch=19)

# import crime dataset
url <- "https://corgis-edu.github.io/corgis/datasets/csv/state_crime/state_crime.csv"
destfile <- "crime_dataset.csv"
download.file(url, destfile)
state <- read.csv("crime_dataset.csv")
state[1:10,1:10] %>% knitr::kable(booktabs = TRUE,
  caption = "A table of the first 10 rows and 10 columns of the crime dataset.")
Table 1.1: A table of the first 10 rows and 10 columns of the crime dataset.
State Year Data.Population Data.Rates.Property.All Data.Rates.Property.Burglary Data.Rates.Property.Larceny Data.Rates.Property.Motor Data.Rates.Violent.All Data.Rates.Violent.Assault Data.Rates.Violent.Murder
Alabama 1960 3266740 1035.4 355.9 592.1 87.3 186.6 138.1 12.4
Alabama 1961 3302000 985.5 339.3 569.4 76.8 168.5 128.9 12.9
Alabama 1962 3358000 1067.0 349.1 634.5 83.4 157.3 119.0 9.4
Alabama 1963 3347000 1150.9 376.9 683.4 90.6 182.7 142.1 10.2
Alabama 1964 3407000 1358.7 466.6 784.1 108.0 213.1 163.0 9.3
Alabama 1965 3462000 1392.7 473.7 812.1 106.9 199.8 149.1 11.4
Alabama 1966 3517000 1528.0 527.5 869.6 131.0 230.3 177.7 10.9
Alabama 1967 3540000 1612.4 571.4 895.0 146.0 238.6 183.5 11.7
Alabama 1968 3566000 1766.6 628.2 967.7 170.7 232.4 168.5 11.8
Alabama 1969 3531000 1876.2 667.2 1037.8 171.2 250.4 181.7 13.7
# Mean
mean(state[["Data.Population"]])
## [1] 9708502
# Weighted mean 
weighted.mean(state[["Data.Rates.Violent.Murder"]],w = state[["Data.Population"]])
## [1] 6.83844
# Trimmed mean (trimming 10 percent of data on both tails)
mean(state[["Data.Population"]], trim = 0.1)
## [1] 4009627
# Median
median(state[["Data.Population"]])
## [1] 3358000
# Weighted median
matrixStats::weightedMedian(state[["Data.Rates.Violent.Murder"]], w = state[["Data.Population"]])
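
To make the difference between outlier-sensitive and robust estimates concrete, the sketch below applies them to a tiny made-up vector with one extreme value. The vectors `x`, `v` and `w` and the helper `weighted_median()` are hypothetical. The helper uses a simple no-interpolation rule, so it may not match `matrixStats::weightedMedian()`, which can interpolate between values.

```r
# Toy data with one extreme value (hypothetical, for illustration only)
x <- c(1, 2, 3, 4, 100)

mean_x    <- mean(x)              # pulled upward by the outlier
median_x  <- median(x)            # unaffected by the outlier
trimmed_x <- mean(x, trim = 0.2)  # drops the lowest and highest 20% (one value each)

# Weighted mean: sum(w * v) / sum(w)
v <- c(1, 2, 3)
w <- c(3, 1, 1)
wmean_v <- weighted.mean(v, w)

# Hypothetical helper: the smallest sorted value at which the cumulative
# weight reaches half of the total weight (no interpolation)
weighted_median <- function(values, weights) {
  ord <- order(values)
  cum_w <- cumsum(weights[ord])
  values[ord][which(cum_w >= sum(weights) / 2)[1]]
}
wmedian_v <- weighted_median(v, w)

print(c(mean = mean_x, median = median_x, trimmed = trimmed_x))
# mean = 22, median = 3, trimmed = 3
print(c(weighted_mean = wmean_v, weighted_median = wmedian_v))
# weighted_mean = 1.6, weighted_median = 1
```

The single value 100 moves the mean to 22, while the median and the trimmed mean both stay at 3.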

1.3 Estimates of variability (dispersion)

  • The second dimension of any numerical data is variability (dispersion), which measures whether the data are tightly clustered or spread out.

  • Estimates sensitive to outlier effect:

    • Variance: The sum of squared deviations from the mean divided by n - 1, where n is the number of data values.

    • Standard deviation (l2-norm or Euclidean norm): The square root of the variance.

    • Mean absolute deviation (l1-norm or Manhattan norm): The mean of the absolute values of the deviations from the mean.

    • Range: The difference between the largest and the smallest value in a data set.

  • Estimates robust to outlier effect:

    • Median absolute deviation from the median (MAD): The median of the absolute value of the deviations from the median.

    • Percentile (quantile): The value such that P percent of the values take on this value or less and 100-P percent take on this value or more.

    • Interquartile range (IQR): The difference between the 75th percentile and the 25th percentile.

  • Here we use the same crime dataset (see table 1.1) to show the R code for calculating these estimates.

# load libraries
library(RSkittleBrewer)
library(matrixStats)
library(DT)
library(tidyverse)


# Make the colors pretty
trop = RSkittleBrewer("tropical")
palette(trop)
par(pch=19)

# import crime dataset
url <- "https://corgis-edu.github.io/corgis/datasets/csv/state_crime/state_crime.csv"
destfile <- "crime_dataset.csv"
download.file(url, destfile)
state <- read.csv("crime_dataset.csv")

# Standard deviation
sd(state[["Data.Population"]])
## [1] 35067501
# Range
range (state[["Data.Population"]])
## [1]    226167 328239523
# Interquartile range (IQR)
IQR (state[["Data.Population"]])
## [1] 4803680
# MAD (mad() multiplies by constant = 1.4826 by default)
mad (state[["Data.Population"]])
## [1] 3412314
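
The definitions above can be checked against R's built-in functions on a small made-up vector. This is only a sketch: the vector `x` is hypothetical, and base R has no dedicated function for the mean absolute deviation, so it is computed by hand here.

```r
# Hypothetical toy data
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Variance and standard deviation from their definitions
variance_manual <- sum((x - mean(x))^2) / (n - 1)
sd_manual       <- sqrt(variance_manual)

# Mean absolute deviation from the mean (computed by hand)
mean_abs_dev <- mean(abs(x - mean(x)))

# Range as a single number: largest minus smallest value
range_width <- diff(range(x))

# Median absolute deviation from the median, without the scaling constant
mad_raw <- median(abs(x - median(x)))

# Interquartile range from the 25th and 75th percentiles
iqr_manual <- unname(diff(quantile(x, probs = c(0.25, 0.75))))

# Compare each hand-computed value with the built-in function
print(c(manual = variance_manual, builtin = var(x)))
print(c(manual = sd_manual,       builtin = sd(x)))
print(c(manual = mad_raw,         builtin = mad(x, constant = 1)))
print(c(manual = iqr_manual,      builtin = IQR(x)))
print(c(mean_abs_dev = mean_abs_dev, range_width = range_width))
```

Note that `mad(x)` with its default constant returns the raw value multiplied by 1.4826, which makes it comparable to the standard deviation for normally distributed data.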

1.4 Exploring the Data Distribution

  • Another aspect of the data which should be explored is overall data distribution.

  • There are three ways to explore data distribution:

    1. Percentile and Boxplots

      • Percentiles are also valuable to summarize the entire distribution. It is common to report the quartiles (25th, 50th, and 75th percentiles) or the deciles (10th, 20th, … , 90th percentiles).
      • Box plots visualize the 25th, 50th and 75th percentiles (as the bottom, middle and top of the box). The dashed lines (known as whiskers) indicate the range for the bulk of the data (by default they extend to the most extreme point within 1.5 times the IQR of the box), and outliers are shown as single points.
    2. Frequency table and histogram

      • A frequency table is a table of the counts of numeric data values that fall into a set of intervals (bins). A histogram is a plot of the frequency table with the bins on the x-axis and the counts (or proportions) on the y-axis.
    3. Density estimate

      • A density plot is a smoothed version of the histogram, often based on a kernel density estimate.
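
The whisker rule for box plots can be sketched on a small made-up vector. The vector `x` is hypothetical. `boxplot.stats()` is the base R function that computes the numbers behind `boxplot()`. It uses hinges from `fivenum()`, which can differ slightly from the quartiles returned by `quantile()`.

```r
# Hypothetical toy data with one extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# Numbers behind a box plot: whisker ends, hinges and median, plus outliers
bp <- boxplot.stats(x, coef = 1.5)
print(bp$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
# 1.0 3.0 5.5 8.0 9.0
print(bp$out)    # points drawn individually as outliers
# 30

# The same rule by hand: points beyond 1.5 times the box height are outliers,
# and the whiskers end at the most extreme remaining points
hinges      <- fivenum(x)[c(2, 4)]
fence_width <- 1.5 * diff(hinges)
is_outlier  <- x < hinges[1] - fence_width | x > hinges[2] + fence_width
whiskers    <- range(x[!is_outlier])
print(whiskers)
# 1 9
```

The upper whisker stops at 9, the largest value inside the fence, and 30 is plotted on its own.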
# load libraries
library(RSkittleBrewer)
library(matrixStats)
library(DT)
library(tidyverse)
library (ggplot2)


# Make the colors pretty
trop = RSkittleBrewer("tropical")
palette(trop)
par(pch=19)

# import crime dataset
url <- "https://corgis-edu.github.io/corgis/datasets/csv/state_crime/state_crime.csv"
destfile <- "crime_dataset.csv"
download.file(url, destfile)
state <- read.csv("crime_dataset.csv")


# Percentile
quantile(state[["Data.Rates.Violent.Murder"]], probs = c(0.05, 0.25, 0.5, 0.75, 0.95))
##   5%  25%  50%  75%  95% 
##  1.5  3.1  5.4  8.4 13.2
#Boxplot using basic R graphics 

state <- state %>% filter(State != "United States")
boxplot(state[["Data.Population"]]/1000000, ylab = "Population (million)", xlab= "USA states", main = "Boxplot using basic R graphics")

#Boxplot using ggplot  
ggplot(state,aes(y= Data.Population/1000000)) +
  geom_boxplot() +
  labs(x="USA states",y="Population (million)",title="Boxplot using ggplot package") +
  theme_classic() +
  theme(axis.ticks.x = element_blank(), axis.text.x = element_blank())+
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

# Frequency table 
breaks <- seq(from = min(state[["Data.Population"]]), to = max(state[["Data.Population"]]), length.out = 11)
pop_freq <- cut (state[["Data.Population"]], breaks = breaks)

knitr::kable(table (pop_freq), booktabs = TRUE,
  caption = "Frequency table of population across different USA states")
Table 1.2: Frequency table of population across different USA states
pop_freq Freq
(2.26e+05,4.16e+06] 1806
(4.16e+06,8.09e+06] 721
(8.09e+06,1.2e+07] 286
(1.2e+07,1.6e+07] 71
(1.6e+07,1.99e+07] 93
(1.99e+07,2.38e+07] 25
(2.38e+07,2.78e+07] 16
(2.78e+07,3.17e+07] 12
(3.17e+07,3.56e+07] 8
(3.56e+07,3.96e+07] 16
# Histogram using basic R graphics
hist (state[["Data.Population"]], breaks = breaks,xlab= "Population", main = "Histogram using basic R graphics")

# Histogram using ggplot package
ggplot(state,aes(x= Data.Population)) +
  geom_histogram(breaks = breaks) +
  labs(x="Population",y="Frequency",title="Histogram using ggplot package") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

# Density plot using basic R graphics
hist(state[["Data.Rates.Violent.Murder"]], freq = FALSE, xlab = "Murder rate", main = "Density plot using basic R graphics")
lines (density(state[["Data.Rates.Violent.Murder"]]), lwd = 2, col ="blue")

# Density plot using ggplot package
ggplot(state,aes(x=Data.Rates.Violent.Murder))+
  geom_density(color="blue", fill="lightblue")+
  labs(x="Murder rate",y="Density",title="Density plot using ggplot package") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))
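
The link between a frequency table, a histogram and a density estimate can be checked numerically without drawing anything. The sketch below uses simulated data rather than the crime dataset. The sample `x`, the seed and the bin edges are arbitrary choices made for illustration.

```r
set.seed(1)                    # arbitrary seed so the simulation is reproducible
x <- rexp(200, rate = 1 / 5)   # hypothetical right-skewed sample

# Ten equal-width bins covering the whole sample
breaks <- seq(0, ceiling(max(x)), length.out = 11)

# Frequency table: counts per bin (include.lowest keeps a value equal to the first break)
freq_table <- table(cut(x, breaks = breaks, include.lowest = TRUE))

# Histogram on the same bins, computed but not plotted
h <- hist(x, breaks = breaks, plot = FALSE)

# The histogram counts are exactly the frequency table
print(all(as.vector(freq_table) == h$counts))

# On the density scale, the bar areas sum to 1
hist_area <- sum(h$density * diff(h$breaks))

# Kernel density estimate: the area under the curve is approximately 1
d <- density(x)
kde_area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)  # trapezoid rule

print(c(hist_area = hist_area, kde_area = kde_area))
```

Both the density-scaled histogram and the kernel density estimate describe proportions rather than counts, which is why a density curve can be overlaid on a histogram drawn with `freq = FALSE`.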

1.5 Exploring Binary and Categorical Data