Data comes from many sources such as measurements, events, text and videos.
Raw data are often unstructured (such as images, text and clickstreams) and need to be processed into structured, actionable data before statistical methods can be applied.
Structured data formats:
Rectangular data types
Numeric:
Categorical:
Estimates sensitive to outlier effect:
Mean: Sum of values divided by the number of values.
Weighted mean: The sum of each value multiplied by its weight, divided by the sum of the weights (motivation: to down-weight values with high intrinsic variability, or to weight each value by another variable, such as weighting state-level rates by population).
Estimates robust to outlier effect:
Median: The value such that one-half of the data lies above it and one-half below it.
Weighted median: The value in the sorted data such that one-half of the sum of the weights lies above it and one-half below it.
Trimmed mean: The average of all values after dropping a fixed number of extreme values from each end of the sorted data.
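As a quick illustration of the difference between the two groups of estimates, the following sketch uses a small made-up vector (not the crime data) with a single extreme value. The vector `x`, the weights `w` and the helper `weighted_median()` are assumptions introduced here for illustration. The helper implements one simple definition of the weighted median; packaged implementations such as `matrixStats::weightedMedian()` may interpolate and return slightly different values.

```r
# Toy data (hypothetical): four typical values and one outlier
x <- c(1, 2, 3, 4, 100)
# Hypothetical weights that down-weight the outlier
w <- c(1, 1, 1, 1, 0.1)

# Estimates sensitive to the outlier
mean(x)                  # pulled far above the typical values
weighted.mean(x, w = w)  # down-weighting reduces, but does not remove, the pull

# Estimates robust to the outlier
median(x)
mean(x, trim = 0.2)      # drops one value from each tail before averaging

# Simple weighted median: the smallest sorted value at which the
# cumulative weight reaches half of the total weight
weighted_median <- function(x, w) {
  ord <- order(x)
  cum_w <- cumsum(w[ord])
  x[ord][which(cum_w >= sum(w) / 2)[1]]
}
weighted_median(x, w)
```

On this toy vector the mean is 22 while the median, the 20 percent trimmed mean and the weighted median are all 3, which shows how strongly one extreme value can move the non-robust estimates.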
Here we load an example dataset (US state crime data) to show the R code for calculating these metrics. A preview of the dataset is shown in table 1.1.
# load libraries
library(RSkittleBrewer)
library(matrixStats)
library(DT)
library(tidyverse)
# Make the colors pretty
trop = RSkittleBrewer("tropical")
palette(trop)
par(pch=19)
# import crime dataset
url <- "https://corgis-edu.github.io/corgis/datasets/csv/state_crime/state_crime.csv"
destfile <- "crime_dataset.csv"
download.file(url, destfile)
state <- read.csv("crime_dataset.csv")
state[1:10, 1:10] %>%
  knitr::kable(booktabs = TRUE,
               caption = "A table of the first 10 rows and 10 columns of the crime dataset.")
| State | Year | Data.Population | Data.Rates.Property.All | Data.Rates.Property.Burglary | Data.Rates.Property.Larceny | Data.Rates.Property.Motor | Data.Rates.Violent.All | Data.Rates.Violent.Assault | Data.Rates.Violent.Murder |
|---|---|---|---|---|---|---|---|---|---|
| Alabama | 1960 | 3266740 | 1035.4 | 355.9 | 592.1 | 87.3 | 186.6 | 138.1 | 12.4 |
| Alabama | 1961 | 3302000 | 985.5 | 339.3 | 569.4 | 76.8 | 168.5 | 128.9 | 12.9 |
| Alabama | 1962 | 3358000 | 1067.0 | 349.1 | 634.5 | 83.4 | 157.3 | 119.0 | 9.4 |
| Alabama | 1963 | 3347000 | 1150.9 | 376.9 | 683.4 | 90.6 | 182.7 | 142.1 | 10.2 |
| Alabama | 1964 | 3407000 | 1358.7 | 466.6 | 784.1 | 108.0 | 213.1 | 163.0 | 9.3 |
| Alabama | 1965 | 3462000 | 1392.7 | 473.7 | 812.1 | 106.9 | 199.8 | 149.1 | 11.4 |
| Alabama | 1966 | 3517000 | 1528.0 | 527.5 | 869.6 | 131.0 | 230.3 | 177.7 | 10.9 |
| Alabama | 1967 | 3540000 | 1612.4 | 571.4 | 895.0 | 146.0 | 238.6 | 183.5 | 11.7 |
| Alabama | 1968 | 3566000 | 1766.6 | 628.2 | 967.7 | 170.7 | 232.4 | 168.5 | 11.8 |
| Alabama | 1969 | 3531000 | 1876.2 | 667.2 | 1037.8 | 171.2 | 250.4 | 181.7 | 13.7 |
# Mean
mean(state[["Data.Population"]])
## [1] 9708502
# Weighted mean
weighted.mean(state[["Data.Rates.Violent.Murder"]],w = state[["Data.Population"]])
## [1] 6.83844
# Trimmed mean (trimming 10 percent of data on both tails)
mean(state[["Data.Population"]], trim = 0.1)
## [1] 4009627
# Median
median(state[["Data.Population"]])
## [1] 3358000
# Weighted median
matrixStats::weightedMedian(state[["Data.Rates.Violent.Murder"]], w = state[["Data.Population"]])
The second dimension of any numerical data is variability (dispersion), which measures whether the data values are tightly clustered or spread out.
Estimates sensitive to outlier effect:
Variance: The sum of squared deviations from the mean divided by n - 1, where n is the number of data values.
Standard deviation (l2-norm or Euclidean norm): The square root of the variance.
Mean absolute deviation (l1-norm or Manhattan norm): The mean of the absolute values of the deviations from the mean.
Range: The difference between the largest and the smallest value in a data set.
Estimates robust to outlier effect:
Median absolute deviation from the median (MAD): The median of the absolute value of the deviations from the median.
Percentile (quantile): The value such that P percent of the values take on this value or less and 100-P percent take on this value or more.
Interquartile range (IQR): The difference between the 75th percentile and the 25th percentile.
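The definitions above can be checked by hand on a small made-up vector (not the crime data). The vector `x` and the variable names are assumptions for illustration. Note that R's `mad()` multiplies the raw median absolute deviation by 1.4826 by default, so `constant = 1` is passed here to match the definition as stated.

```r
# Toy data (hypothetical): four typical values and one outlier
x <- c(1, 2, 3, 4, 100)
n <- length(x)

# Estimates sensitive to the outlier, computed from the definitions
variance     <- sum((x - mean(x))^2) / (n - 1)
std_dev      <- sqrt(variance)
mean_abs_dev <- mean(abs(x - mean(x)))
value_range  <- max(x) - min(x)

# Estimates robust to the outlier, computed from the definitions
raw_mad <- median(abs(x - median(x)))
iqr     <- unname(quantile(x, 0.75) - quantile(x, 0.25))

# Compare with the built-in functions
all.equal(variance, var(x))
all.equal(std_dev, sd(x))
all.equal(raw_mad, mad(x, constant = 1))
all.equal(iqr, IQR(x))
```

Here the variance is 1902.5 and the range is 99, both dominated by the single value 100, while the unscaled MAD is 1 and the IQR is 2.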
The same crime dataset (table 1.1) is used to show the R code for calculating these metrics.
# Standard deviation
sd(state[["Data.Population"]])
## [1] 35067501
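# Range (range() returns the minimum and maximum; diff(range(x)) gives their difference)
range(state[["Data.Population"]])
## [1] 226167 328239523
# Interquartile range (IQR)
IQR(state[["Data.Population"]])
## [1] 4803680
# MAD (scaled by the default constant 1.4826; use constant = 1 for the raw value)
mad(state[["Data.Population"]])
## [1] 3412314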
Another aspect of the data that should be explored is its overall distribution.
There are three ways to explore the distribution of the data:
Percentiles and boxplots
Frequency table and histogram
Density estimate
# Percentile
quantile(state[["Data.Rates.Violent.Murder"]], probs = c(0.05, 0.25, 0.5, 0.75, 0.95))
## 5% 25% 50% 75% 95%
## 1.5 3.1 5.4 8.4 13.2
# Boxplot using basic R graphics
# (drop the "United States" rows so that only individual states remain)
state <- state %>% filter(State != "United States")
boxplot(state[["Data.Population"]] / 1000000, ylab = "Population (million)", xlab = "USA states", main = "Boxplot using basic R graphics")
# Boxplot using ggplot
ggplot(state, aes(y = Data.Population / 1000000)) +
  geom_boxplot() +
  labs(x = "USA states", y = "Population (million)", title = "Boxplot using ggplot package") +
  theme_classic() +
  theme(axis.ticks.x = element_blank(), axis.text.x = element_blank()) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))
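To see which numbers a boxplot actually draws, base R's `boxplot.stats()` can be called on a small made-up vector (an assumption for illustration, not the crime data). It returns the whisker ends, the hinges (approximately the 25th and 75th percentiles) and the median, plus any points flagged as outliers under the default rule of 1.5 times the box length.

```r
# Toy data (hypothetical): four typical values and one outlier
x <- c(1, 2, 3, 4, 100)

# Quartiles with R's default quantile method
quartiles <- quantile(x, probs = c(0.25, 0.5, 0.75))

# Numbers behind the boxplot:
# lower whisker, lower hinge, median, upper hinge, upper whisker
bs <- boxplot.stats(x)
bs$stats
# Points drawn individually beyond the whiskers
bs$out
```

In this sketch the upper whisker stops at 4, the largest value within 1.5 box lengths of the upper hinge, and 100 is reported separately in `bs$out`.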
# Frequency table
breaks <- seq(from = min(state[["Data.Population"]]), to = max(state[["Data.Population"]]), length.out = 11)
pop_freq <- cut(state[["Data.Population"]], breaks = breaks)
knitr::kable(table(pop_freq), booktabs = TRUE,
             caption = "Frequency table of population across different USA states")
| pop_freq | Freq |
|---|---|
| (2.26e+05,4.16e+06] | 1806 |
| (4.16e+06,8.09e+06] | 721 |
| (8.09e+06,1.2e+07] | 286 |
| (1.2e+07,1.6e+07] | 71 |
| (1.6e+07,1.99e+07] | 93 |
| (1.99e+07,2.38e+07] | 25 |
| (2.38e+07,2.78e+07] | 16 |
| (2.78e+07,3.17e+07] | 12 |
| (3.17e+07,3.56e+07] | 8 |
| (3.56e+07,3.96e+07] | 16 |
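One detail of `cut()` is worth keeping in mind when the first break equals the minimum of the data, as in the frequency table above: by default the intervals are open on the left, so the minimum value itself falls outside the first bin and becomes `NA`. The sketch below shows this on a small made-up vector; `toy_x` and `toy_breaks` are hypothetical names used only for illustration.

```r
# Toy data (hypothetical)
toy_x <- c(1, 2, 3, 4, 5, 10)
# Three equal-width bins from the minimum to the maximum: 1, 4, 7, 10
toy_breaks <- seq(from = min(toy_x), to = max(toy_x), length.out = 4)

# Default: intervals are (1,4], (4,7], (7,10], so the value 1 is not counted
default_bins <- cut(toy_x, breaks = toy_breaks)
table(default_bins)

# include.lowest = TRUE closes the first interval, [1,4], so every value is counted
closed_bins <- cut(toy_x, breaks = toy_breaks, include.lowest = TRUE)
table(closed_bins)
```

Whether the minimum should be counted is a choice for the analysis; `hist()` includes it by default, so a table built with `cut()` and a histogram built from the same breaks can differ by that one observation.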
# Histogram using basic R graphics
hist(state[["Data.Population"]], breaks = breaks, xlab = "Population", main = "Histogram using basic R graphics")
# Histogram using ggplot package (breaks already fixes the bins, so bins is not needed)
ggplot(state, aes(x = Data.Population)) +
  geom_histogram(breaks = breaks) +
  labs(x = "Population", y = "Frequency", title = "Histogram using ggplot package") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))
# Density plot using basic R graphics
hist(state[["Data.Rates.Violent.Murder"]], freq = FALSE, xlab = "Murder rate", main = "Density plot using basic R graphics")
lines(density(state[["Data.Rates.Violent.Murder"]]), lwd = 2, col = "blue")
# Density plot using ggplot package
ggplot(state, aes(x = Data.Rates.Violent.Murder)) +
  geom_density(color = "blue", fill = "lightblue") +
  labs(x = "Murder rate", y = "Density", title = "Density plot using ggplot package") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))
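The density plots use the density scale rather than counts, which means the total area under the histogram bars is 1 and the area under the kernel density curve is approximately 1. A minimal check on a small made-up vector (an assumption for illustration, not the crime data) might look like this:

```r
# Toy data (hypothetical)
x <- c(1.5, 3.1, 5.4, 8.4, 13.2)

# Histogram on the density scale: bar height times bar width sums to 1
h <- hist(x, plot = FALSE)
hist_area <- sum(h$density * diff(h$breaks))

# Kernel density estimate: approximate the area with a Riemann sum
# over the evenly spaced grid returned by density()
d <- density(x)
step <- d$x[2] - d$x[1]
density_area <- sum(d$y) * step

hist_area
density_area
```

The kernel density area is only approximately 1 because `density()` evaluates the curve on a finite grid that is cut off a few bandwidths beyond the data.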