1 Exploratory Data Analysis (EDA)

1.1 Data

Data comes from many sources such as measurements, events, text and videos.
Raw data are often unstructured (such as images, text, and clickstreams) and need to be converted into structured, actionable data before statistical methods can be applied.
Structured data formats:
- Rectangular: two-dimensional data with rows indicating records (cases) and columns indicating features (variables)
- Non-rectangular: time-series data, spatial data, graphs (or networks)
Rectangular data types:
1. Numeric:
  - Continuous (float)
  - Discrete (integer, count)
2. Categorical:
  - Non-ordinal
  - Ordinal
  - Binary
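
As a minimal sketch of how these data types can be represented in R, the toy data frame below stores one column per type. All column names and values are hypothetical and are not taken from the crime dataset used later.

```r
# Toy data frame (hypothetical values) with one column per data type
toy <- data.frame(
  height_cm  = c(172.5, 160.2, 181.0),            # numeric, continuous
  n_children = c(0L, 2L, 1L),                     # numeric, discrete (count)
  blood_type = factor(c("A", "O", "AB")),         # categorical, non-ordinal
  education  = factor(c("high school", "PhD", "BSc"),
                      levels = c("high school", "BSc", "PhD"),
                      ordered = TRUE),            # categorical, ordinal
  smoker     = c(TRUE, FALSE, FALSE)              # categorical, binary
)

# Inspect the first class of every column
col_classes <- sapply(toy, function(x) class(x)[1])
print(col_classes)

# Ordered factors support comparisons between levels
below_phd <- toy$education < "PhD"
print(below_phd)
```

Storing ordinal variables as ordered factors keeps the level order available to sorting, comparison, and plotting functions.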

1.2 Estimates of central tendency

Estimates sensitive to outliers:
- Mean: The sum of the values divided by the number of values.
- Weighted mean: The sum of each value times its weight, divided by the sum of the weights (motivation: to down-weight intrinsically more variable observations, or to adjust values according to another variable).
Estimates robust to outliers:
- Median: The value such that half of the data lie above it and half below it.
- Weighted median: The value such that, in the sorted data, half of the sum of the weights lies above it and half below it.
- Trimmed mean: The average of all values after dropping a fixed number of extreme values from each end.
Here we load an example dataset (the US state crime dataset) to show the R code for calculating these metrics. The first rows of the dataset are shown in Table 1.1.

# load libraries
library(RSkittleBrewer)
library(matrixStats)
library(DT)
library(tidyverse)

# Make the colors pretty
trop = RSkittleBrewer("tropical")
palette(trop)
par(pch=19)

# import crime dataset
url <- "https://corgis-edu.github.io/corgis/datasets/csv/state_crime/state_crime.csv"
destfile <- "crime_dataset.csv"
download.file(url, destfile)
state <- read.csv("crime_dataset.csv")
state[1:10,1:10] %>% knitr::kable(booktabs = TRUE,
  caption = "A table of the first 10 rows and 10 columns of the crime dataset.")

Table 1.1: A table of the first 10 rows and 10 columns of the crime dataset.
State	Year	Data.Population	Data.Rates.Property.All	Data.Rates.Property.Burglary	Data.Rates.Property.Larceny	Data.Rates.Property.Motor	Data.Rates.Violent.All	Data.Rates.Violent.Assault	Data.Rates.Violent.Murder
Alabama	1960	3266740	1035.4	355.9	592.1	87.3	186.6	138.1	12.4
Alabama	1961	3302000	985.5	339.3	569.4	76.8	168.5	128.9	12.9
Alabama	1962	3358000	1067.0	349.1	634.5	83.4	157.3	119.0	9.4
Alabama	1963	3347000	1150.9	376.9	683.4	90.6	182.7	142.1	10.2
Alabama	1964	3407000	1358.7	466.6	784.1	108.0	213.1	163.0	9.3
Alabama	1965	3462000	1392.7	473.7	812.1	106.9	199.8	149.1	11.4
Alabama	1966	3517000	1528.0	527.5	869.6	131.0	230.3	177.7	10.9
Alabama	1967	3540000	1612.4	571.4	895.0	146.0	238.6	183.5	11.7
Alabama	1968	3566000	1766.6	628.2	967.7	170.7	232.4	168.5	11.8
Alabama	1969	3531000	1876.2	667.2	1037.8	171.2	250.4	181.7	13.7

# Mean
mean(state[["Data.Population"]])

## [1] 9708502

# Weighted mean 
weighted.mean(state[["Data.Rates.Violent.Murder"]],w = state[["Data.Population"]])

## [1] 6.83844

# Trimmed mean (trimming 10 percent of data on both tails)
mean(state[["Data.Population"]], trim = 0.1)

## [1] 4009627

# Median
median(state[["Data.Population"]])

## [1] 3358000

# Weighted median (weightedMedian(), not weightedMean())
matrixStats::weightedMedian(state[["Data.Rates.Violent.Murder"]], w = state[["Data.Population"]])
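
To make the difference between outlier-sensitive and robust estimates concrete, here is a small sketch on a toy vector with a single extreme value. The values and weights are hypothetical and chosen only for illustration; only base R functions are used.

```r
# Toy sample (hypothetical values) with one extreme value
x <- c(2, 3, 3, 4, 5, 5, 6, 7, 8, 100)

mean_x    <- mean(x)              # pulled upwards by the outlier
trimmed_x <- mean(x, trim = 0.1)  # drops 10% of the values from each tail
median_x  <- median(x)            # does not depend on the size of the outlier

# Weighted mean: sum(w * x) / sum(w); here the outlier is down-weighted
w <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 0.1)  # hypothetical weights
wmean_x <- weighted.mean(x, w)

print(c(mean = mean_x, trimmed = trimmed_x, median = median_x, weighted = wmean_x))
```

The plain mean lands far from the bulk of the data, while the trimmed mean and the median stay close to it.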

1.3 Estimates of variability (dispersion)

The second dimension of numerical data is variability (dispersion), which measures whether the values are tightly clustered or spread out.
Estimates sensitive to outliers:
- Variance: The sum of squared deviations from the mean divided by n - 1, where n is the number of data values.
- Standard deviation (l2-norm or Euclidean norm): The square root of the variance.
- Mean absolute deviation (l1-norm or Manhattan norm): The mean of the absolute values of the deviations from the mean.
- Range: The difference between the largest and the smallest value in a data set.
Estimates robust to outliers:
- Median absolute deviation from the median (MAD): The median of the absolute values of the deviations from the median.
- Percentile (quantile): The value such that P percent of the values take on this value or less and 100-P percent take on this value or more.
- Interquartile range (IQR): The difference between the 75th percentile and the 25th percentile.
Here we use the same example dataset (the US state crime dataset, Table 1.1) to show the R code for calculating these metrics.

# load libraries
library(RSkittleBrewer)
library(matrixStats)
library(DT)
library(tidyverse)


# Make the colors pretty
trop = RSkittleBrewer("tropical")
palette(trop)
par(pch=19)

# import crime dataset
url <- "https://corgis-edu.github.io/corgis/datasets/csv/state_crime/state_crime.csv"
destfile <- "crime_dataset.csv"
download.file(url, destfile)
state <- read.csv("crime_dataset.csv")

# Standard deviation
sd(state[["Data.Population"]])

## [1] 35067501

# Range (returns the smallest and the largest value)
range(state[["Data.Population"]])

## [1]    226167 328239523

# Interquartile range (IQR)
IQR(state[["Data.Population"]])

## [1] 4803680

# MAD
mad(state[["Data.Population"]])

## [1] 3412314
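
The variance and the mean absolute deviation are defined above but not computed. The sketch below works through them on a toy vector (hypothetical values). It also shows that base R's `mad()` multiplies the raw median absolute deviation by a default constant of 1.4826, so its result is not the raw MAD unless `constant = 1` is passed.

```r
# Toy sample (hypothetical values) with one extreme value
x <- c(2, 3, 3, 4, 5, 5, 6, 7, 8, 100)
n <- length(x)

# Variance: sum of squared deviations from the mean divided by n - 1
var_manual <- sum((x - mean(x))^2) / (n - 1)
sd_manual  <- sqrt(var_manual)

# Mean absolute deviation from the mean, computed directly
mean_abs_dev <- mean(abs(x - mean(x)))

# Median absolute deviation from the median
mad_raw      <- median(abs(x - median(x)))  # definition, no scaling
mad_scaled   <- mad(x)                      # default: 1.4826 * mad_raw
mad_unscaled <- mad(x, constant = 1)        # same as mad_raw

print(c(sd = sd_manual, mean_abs_dev = mean_abs_dev,
        mad_raw = mad_raw, mad_scaled = mad_scaled))
```

The default scaling makes `mad()` comparable to the standard deviation for normally distributed data, which is worth keeping in mind when reading the MAD value reported for the population column.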

1.4 Exploring the Data Distribution

Another aspect of the data that should be explored is its overall distribution.
There are three ways to explore the data distribution:
1. Percentiles and boxplots
  - Percentiles are also valuable for summarizing the entire distribution. It is common to report the quartiles (25th, 50th, and 75th percentiles) or the deciles (10th, 20th, … , 90th percentiles).
  - Boxplots visualize the 25th, 50th, and 75th percentiles (as the bottom, middle, and top of the box). The dashed lines (known as whiskers) indicate the range of the bulk of the data (by default, they extend to the most extreme point within 1.5 times the IQR of the box), and outliers are shown as single points.
2. Frequency table and histogram
  - A frequency table is a table of the counts of numeric data values that fall into a set of intervals (bins). A histogram is a plot of the frequency table with the bins on the x-axis and the counts (or proportions) on the y-axis.
3. Density estimate
  - A density plot is a smoothed version of the histogram, often based on a kernel density estimate.

# load libraries
library(RSkittleBrewer)
library(matrixStats)
library(DT)
library(tidyverse)
library (ggplot2)


# Make the colors pretty
trop = RSkittleBrewer("tropical")
palette(trop)
par(pch=19)

# import crime dataset
url <- "https://corgis-edu.github.io/corgis/datasets/csv/state_crime/state_crime.csv"
destfile <- "crime_dataset.csv"
download.file(url, destfile)
state <- read.csv("crime_dataset.csv")


# Percentile
quantile(state[["Data.Rates.Violent.Murder"]], probs = c(0.05, 0.25, 0.5, 0.75, 0.95))

##   5%  25%  50%  75%  95% 
##  1.5  3.1  5.4  8.4 13.2

# Boxplot using base R graphics

# Drop the national aggregate rows so that only individual states remain
state <- state %>% filter(State != "United States")
boxplot(state[["Data.Population"]]/1000000, ylab = "Population (million)", xlab= "USA states", main = "Boxplot using basic R graphics")

#Boxplot using ggplot  
ggplot(state,aes(y= Data.Population/1000000)) +
  geom_boxplot() +
  labs(x="USA states",y="Population (million)",title="Boxplot using ggplot package") +
  theme_classic() +
  theme(axis.ticks.x = element_blank(), axis.text.x = element_blank())+
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

# Frequency table (10 equal-width bins, so 11 break points)
breaks <- seq(from = min(state[["Data.Population"]]), to = max(state[["Data.Population"]]), length.out = 11)
pop_freq <- cut(state[["Data.Population"]], breaks = breaks)

knitr::kable(table (pop_freq), booktabs = TRUE,
  caption = "Frequency table of population across different USA states")

Table 1.2: Frequency table of population across different USA states
pop_freq	Freq
(2.26e+05,4.16e+06]	1806
(4.16e+06,8.09e+06]	721
(8.09e+06,1.2e+07]	286
(1.2e+07,1.6e+07]	71
(1.6e+07,1.99e+07]	93
(1.99e+07,2.38e+07]	25
(2.38e+07,2.78e+07]	16
(2.78e+07,3.17e+07]	12
(3.17e+07,3.56e+07]	8
(3.56e+07,3.96e+07]	16
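
One detail of `cut()` is easy to miss: by default the intervals are open on the left, `(a, b]`, so a value exactly equal to the first break is assigned to no bin. When the breaks start at the minimum of the data, as above, the smallest observation may therefore be left out of the frequency table. The sketch below illustrates this on a toy vector (hypothetical values) and shows the `include.lowest = TRUE` option as one possible remedy.

```r
# Toy values (hypothetical) binned into 3 equal-width intervals
x <- c(1, 2, 3, 4, 5, 6, 7)
breaks <- seq(from = min(x), to = max(x), length.out = 4)  # 1, 3, 5, 7

# Default: intervals are (a, b], so the minimum value falls outside every bin
bins_default <- cut(x, breaks = breaks)
n_missing <- sum(is.na(bins_default))

# include.lowest = TRUE closes the first interval on the left: [1, 3]
bins_closed <- cut(x, breaks = breaks, include.lowest = TRUE)
freq_closed <- table(bins_closed)

print(n_missing)
print(freq_closed)
```

With `include.lowest = TRUE`, the bin counts add up to the number of observations.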

# Histogram using basic R graphics
hist (state[["Data.Population"]], breaks = breaks,xlab= "Population", main = "Histogram using basic R graphics")

# Histogram using ggplot package
ggplot(state,aes(x= Data.Population)) +
  geom_histogram(breaks = breaks) +
  labs(x="Population",y="Frequency",title="Histogram using ggplot package") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

# Density plot using basic R graphics
hist(state[["Data.Rates.Violent.Murder"]], freq = FALSE, xlab = "Murder rate", main = "Density plot using basic R graphics")
lines(density(state[["Data.Rates.Violent.Murder"]]), lwd = 2, col = "blue")

# Density plot using ggplot package
ggplot(state,aes(x=Data.Rates.Violent.Murder))+
  geom_density(color="blue", fill="lightblue")+
  labs(x="Murder rate",y="Density",title="Density plot using ggplot package") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))
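
The numbers behind a boxplot can also be inspected without drawing it. As a rough sketch, `boxplot.stats()` from base R returns the whisker ends, the hinges (which approximate the 25th and 75th percentiles), the median, and the points flagged as outliers. The toy vector below is hypothetical.

```r
# Toy sample (hypothetical values) with one extreme value
x <- c(2, 3, 3, 4, 5, 5, 6, 7, 8, 100)

# coef = 1.5 is the default whisker length in units of the box height
bs <- boxplot.stats(x, coef = 1.5)

# lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$stats)

# points beyond the whiskers, drawn as single points in a boxplot
print(bs$out)
```

Here the whiskers stop at the most extreme values that are still within 1.5 box heights of the hinges, and the value 100 is reported separately as an outlier.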

Statistic analysis and data visualization in R

Mohammad Amin Honardoost

05 October 2022
