Explain what the four data types (levels of measurements) are and provide simple examples to illustrate these levels.


Note: Since I am rather uncreative, I leaned on a commonly used open-source data set for my example. Here, I proceed with the mtcars data, which comprises a series of variables on 32 automobiles (1973-74 models) from the US magazine Motor Trend. I am not particularly fond of cars, but the data should allow me to address the requirements of this assignment.


library(dplyr)
# Move the row names into an explicit `car` column and display the full data
# set in a scrollable table.
mtcars %>%
  mutate(car = row.names(.)) %>%
  select(car, everything()) %>%
  DT::datatable(rownames = FALSE,
                class = 'cell-border stripe',
                extensions = 'Scroller',
                caption = 'Table 1: 1974 Motor Trend Cars',
                options = list(
                  dom         = "tipf",
                  scrollY     = 200,
                  scrollX     = TRUE,
                  scroller    = TRUE
  ))
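
Returning to the prompt about levels of measurement, the table below is a small illustrative sketch in the same style. The classification of the mtcars variables is my own reading of the data, and since the set has no true interval-scale variable, temperature in Celsius stands in as the usual example for that level.

# The four levels of measurement with illustrative examples (my own mapping).
data.frame(
  level      = c("nominal", "ordinal", "interval", "ratio"),
  definition = c("categories with no inherent order",
                 "ordered categories whose distances are not meaningful",
                 "ordered values with meaningful distances but no true zero",
                 "ordered values with meaningful distances and a true zero"),
  example    = c("transmission type (am: automatic vs. manual)",
                 "cylinder count grouped as low/medium/high (based on cyl)",
                 "temperature in degrees Celsius (not in mtcars)",
                 "weight in 1000 lbs (wt)")
) %>%
  DT::datatable(rownames = FALSE,
                class = 'cell-border stripe',
                caption = 'Levels of measurement (illustrative)',
                options = list(dom = "t"))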

Explain in your own words what distributional moments mean and why they might be important.

As per the required textbook, a distribution can be described by four moments, as follows:

  1. Measuring the center: Descriptive statistics used to find a central position in the data.

For instance, Table 2 reports these statistics for the weight variable (wt, measured in 1000 lbs) from the 1974 Motor Trend data:

data.frame(measure    = c("mean",
                          "median",
                          "mode"),
           definition = c("average",
                          "center of the distribution",
                          "most commonly occurring value"),
           value       = c(mean(mtcars$wt),
                           median(mtcars$wt),
                            # first most frequent value (a single mode, even if there are ties)
                            names(which.max(table(mtcars$wt)))
                       )
           ) %>%
    DT::datatable(rownames = FALSE,
                class = 'cell-border stripe',
                extensions = 'Scroller',
                caption = 'Table 2: Measures of center',
                options = list(
                  dom         = "t"
  ))

Why do we care about the center? It gives us a single reference point for the data, one that the measures of spread in the next section build upon (see the sketch below).
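
One way to see this is that the spread is measured around the center. The following is a minimal sketch of my own (not from the textbook) showing that deviations from the mean sum to zero, and that averaging their squares yields the variance reported in the next table:

# Deviations from the mean balance out, and their squared average is the
# variance, so every measure of spread below is anchored at the center.
deviations <- mtcars$wt - mean(mtcars$wt)
sum(deviations)      # numerically ~0: the mean is the balance point of the data
mean(deviations^2)   # average squared deviation (population variance)
var(mtcars$wt)       # R's var() divides by n - 1, so it is slightly larger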


  2. Measuring the spread: Measures that provide information about how the values of a variable are distributed in the data.
data.frame(
  measure    = c("range",
                 "variance",
                 "standard deviation",
                 "coeficient variation"),
  definition = c("the minimum and maximun numbers in a distribution",
                 "the average of the squared differences from the mean",
                 "the square root of the variance",
                 "the ratio of the standard deviation to the mean"),
  value      = c(paste0(range(mtcars$wt)[1], " to ", range(mtcars$wt)[2]),
                 round(var(mtcars$wt),
                       digits = 4),
                 round(sd(mtcars$wt),
                       digits = 4),
                 round(sd(mtcars$wt)/mean(mtcars$wt) * 100,
                       digits = 4)
                 )
  ) %>%
    DT::datatable(rownames = FALSE,
                class = 'cell-border stripe',
                extensions = 'Scroller',
                caption = 'Table 3: Measures of spread',
                options = list(
                  dom         = "t"
  ))

Why do we care about the spread? The spread of the data indicates how much variation there is in the distribution. While the measures of center give us a starting point, measures of spread tell us how far the data extend above and below that point. Furthermore, as descriptive measures, they help the analyst understand how observations fall above or below the center (quantified after Figure 1 below).

library(ggplot2)
# Histogram and kernel density of car weight; the dotted line marks the mean.
ggplot(mtcars, aes(x = wt)) + 
  geom_histogram(aes(y      = after_stat(density)),
                 binwidth   = 0.5,
                 color      = "lightgrey",
                 fill       = "lightgrey") +
  geom_density(alpha        = 0.5, 
               fill         = "darkgrey",
               color        = "darkgrey") +
  geom_vline(aes(xintercept = mean(wt)),
             linetype       = "dotted") +
  xlab("Weight") +
  ylab("Density") +
  ggtitle(label = "Figure 1. Weight spread") +
  theme_bw()
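
To put numbers on how the observations sit around the center shown in Figure 1, here is a quick count of my own (the roughly-two-thirds rule is only the usual heuristic for bell-shaped data, not a property guaranteed by these 32 cars):

# Count observations relative to the mean weight and within one SD of it.
m <- mean(mtcars$wt)
s <- sd(mtcars$wt)
sum(mtcars$wt < m)                              # cars lighter than the mean
sum(mtcars$wt > m)                              # cars heavier than the mean
mean(mtcars$wt >= m - s & mtcars$wt <= m + s)   # share of cars within mean +/- 1 SD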


  3. Measuring the skew: A measure of the distortion or asymmetry of a distribution relative to the symmetric normal curve.
data.frame(
  measure    = c("skew"),
  definition = c("whether or not the curve shifts to the left of the right"),
  value      = c(round(e1071::skewness(mtcars$wt),
                       digits = 4))
) %>%
    DT::datatable(rownames = FALSE,
                class = 'cell-border stripe',
                extensions = 'Scroller',
                caption = 'Table 4: Measure of skewness',
                options = list(
                  dom         = "t"
  ))

Why do we care about skewness? It helps us describe whether the curve is shifted to the left or the right. For instance, here the skewness value is positive, which means the distribution has a longer tail to the right of the average; it is pulled towards the right tail. We care about skewness because it tells us about the symmetry of the distribution: in a perfectly symmetric distribution, the same number of observations falls on the left and the right side, whereas in a skewed distribution one side is more prevalent, which can mean different things in different contexts.
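
For intuition about what the statistic in Table 4 computes, the standardized third central moment can be written out by hand. This is a sketch of the textbook formula; e1071::skewness may apply a finite-sample adjustment (controlled by its type argument), so the two numbers can differ slightly.

# Skewness as the standardized third central moment: the average cubed
# deviation from the mean, scaled by the spread cubed. A positive value
# means the longer tail is on the right.
x  <- mtcars$wt
m2 <- mean((x - mean(x))^2)   # second central moment
m3 <- mean((x - mean(x))^3)   # third central moment
m3 / m2^(3/2)                 # hand-rolled skewness (g1)
e1071::skewness(x)            # package value, possibly with a small-sample adjustment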


  4. Measuring catastrophic tail events: A metric indicating the degree to which a frequency distribution is clustered in the tails or around the peak.
data.frame(
  measure    = c("kurtosis"),
  definition = c("degree to which the distribution clusters in the center or the tails"),
  value      = c(round(e1071::kurtosis(mtcars$wt),
                       digits = 4))
) %>%
    DT::datatable(rownames = FALSE,
                class = 'cell-border stripe',
                extensions = 'Scroller',
                caption = 'Table 5: Measure of kurtosis',
                options = list(
                  dom         = "t"
  ))

Why do we care about kurtosis? This measure indicates the extent to which observations pile up in the tails of the distribution: a positive excess-kurtosis value signals heavier tails than a normal distribution, and a negative value signals lighter tails. A distribution with high kurtosis has a large proportion of observations in the extremes; as such, a random draw from that distribution is more likely to be an extreme value (very high or very low) than a draw from a comparable normal distribution.
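
As a reference point, a small sketch of my own: e1071::kurtosis reports excess kurtosis (kurtosis minus the normal distribution's value of 3), so a large normal sample should land near zero, while a heavy-tailed sample comes out clearly positive.

# Excess kurtosis: ~0 for normal data, positive for heavy tails, negative for
# light tails. Simulated samples give a benchmark for the mtcars value above.
set.seed(123)
e1071::kurtosis(mtcars$wt)         # the value shown in Table 5
e1071::kurtosis(rnorm(1e4))        # approximately 0 for a normal sample
e1071::kurtosis(rt(1e4, df = 5))   # clearly positive for a heavy-tailed t sample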