Explain what the four data types (levels of measurement) are and provide simple examples to illustrate these levels.
Note: Since I am rather uncreative, I draw on a commonly used open-source data set for my examples. Here, I use the mtcars data, which comprises a series of variables on 32 automobiles (1973-74 models) from Motor Trend US magazine. I am not particularly fond of cars, but the data should allow me to address some of the requirements for this assignment.
Nominal: car name (e.g., Merc 240D, Merc 230, etc.), engine shape (vs: 0 = V-shaped, 1 = straight), and transmission (am: 0 = automatic, 1 = manual).
Interval: quarter-mile time (qsec) is an interval, as the distance between a time of 5 and 6 seconds is the same as that between 6 and 7 seconds.
Ratio: weight (wt) is a prime example. Here, a weight equaling zero would mean that there is no discernible weight.
library(dplyr)
# Move the row names into a "car" column and show the data as an interactive table
mtcars %>%
  mutate(car = row.names(.)) %>%
  select(car, everything()) %>%
  DT::datatable(rownames = FALSE,
                class = 'cell-border stripe',
                extensions = 'Scroller',
                caption = 'Table 1: 1974 Motor Trend Cars',
                options = list(
                  dom = "tipf",
                  scrollY = 200,
                  scrollX = TRUE,
                  scroller = TRUE
                ))
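As a rough illustration of how these levels map onto R data types, the sketch below encodes a nominal variable as an unordered factor, an ordinal variable as an ordered factor, and a ratio variable as a plain numeric vector. Treating the number of cylinders (cyl) as ordinal is an assumption made for this example only, and the object names are hypothetical.

```r
# Nominal: categories without any inherent order (am: 0 = automatic, 1 = manual)
am_nominal <- factor(mtcars$am,
                     levels = c(0, 1),
                     labels = c("automatic", "manual"))

# Ordinal (assumed example): categories with a meaningful order
cyl_ordinal <- factor(mtcars$cyl,
                      levels = c(4, 6, 8),
                      ordered = TRUE)

# Ratio: a numeric vector with a true zero, so ratios are meaningful
wt_ratio <- mtcars$wt

# Ordered factors support comparisons; row 1 has 6 cylinders, row 5 has 8
cyl_ordinal[1] < cyl_ordinal[5]

# "How many times heavier is the heaviest car than the lightest?"
max(wt_ratio) / min(wt_ratio)
```

Comparisons such as `<` are only defined for the ordered factor (R returns NA with a warning for an unordered one), while statements like "twice as heavy" only make sense for the ratio variable.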
Explain in your own words what distributional moments mean and why they might be important.
As per the required textbook, a distribution can be described by four moments, as follows: center, spread, skewness, and kurtosis.
For instance, Table 2 reports the measures of center for the weight in 1000 lbs (wt) from the 1974 Motor Trend data:
# Tabulate the weights once; the mode is the most frequent value
wt_counts <- table(mtcars$wt)
data.frame(measure = c("mean",
                       "median",
                       "mode"),
           definition = c("average",
                          "middle value of the ordered data",
                          "most commonly occurring value"),
           # c() coerces all three values to character, which is fine for display
           value = c(mean(mtcars$wt),
                     median(mtcars$wt),
                     names(wt_counts)[wt_counts == max(wt_counts)]
           )
) %>%
  DT::datatable(rownames = FALSE,
                class = 'cell-border stripe',
                extensions = 'Scroller',
                caption = 'Table 2: Measures of center',
                options = list(
                  dom = "t"
                ))
Why do we care about the center? Measures of center summarize the typical value of the data in a single number, which gives us the reference point that the measures of spread build upon.
data.frame(
  measure = c("range",
              "variance",
              "standard deviation",
              "coefficient of variation"),
  definition = c("the minimum and maximum numbers in a distribution",
                 "the average of the squared differences from the mean",
                 "the square root of the variance",
                 "the ratio of the standard deviation to the mean, as a percentage"),
  value = c(paste0(range(mtcars$wt)[1], " to ", range(mtcars$wt)[2]),
            round(var(mtcars$wt),
                  digits = 4),
            round(sd(mtcars$wt),
                  digits = 4),
            # Multiply by 100 to express the ratio as a percentage
            round(sd(mtcars$wt)/mean(mtcars$wt) * 100,
                  digits = 4)
  )
) %>%
  DT::datatable(rownames = FALSE,
                class = 'cell-border stripe',
                extensions = 'Scroller',
                caption = 'Table 3: Measures of spread',
                options = list(
                  dom = "t"
                ))
Why do we care about the spread? The spread of the data indicates how much variation there is in the distribution. While the measures of center give us a starting point, the measures of spread tell us how far the observations lie from it: the range gives the lower and upper bounds of the data, and the variance and standard deviation describe how far a typical observation falls from the mean. Taken together, these descriptive measures help the analyst understand how the observations are dispersed above and below the center.
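To make the definitions in Table 3 concrete, the following sketch recomputes the variance, standard deviation, and coefficient of variation for wt step by step and compares them with R's built-in functions. Note that var() and sd() divide by n - 1 rather than n, so the "average" of the squared differences is the sample version. The variable names are illustrative only.

```r
x <- mtcars$wt
n <- length(x)

# Squared differences from the mean
dev_sq <- (x - mean(x))^2

# Sample variance: sum of squared differences divided by n - 1
var_manual <- sum(dev_sq) / (n - 1)

# Standard deviation: square root of the variance
sd_manual <- sqrt(var_manual)

# Coefficient of variation, expressed as a percentage of the mean
cv_manual <- sd_manual / mean(x) * 100

# Both comparisons should return TRUE
all.equal(var_manual, var(x))
all.equal(sd_manual, sd(x))
```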
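library(ggplot2)
# Histogram of weight on the density scale, with a density curve and the mean marked
ggplot(mtcars, aes(x = wt)) +
  # after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)
  geom_histogram(aes(y = after_stat(density)),
                 binwidth = 0.5,
                 color = "lightgrey",
                 fill = "lightgrey") +
  geom_density(alpha = 0.5,
               fill = "darkgrey",
               color = "darkgrey") +
  geom_vline(aes(xintercept = mean(wt)),
             linetype = "dotted") +
  xlab("Weight (1000 lbs)") +
  ylab("Density") +
  ggtitle(label = "Figure 1. Weight spread") +
  theme_bw()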
data.frame(
  measure = c("skew"),
  definition = c("whether the curve is shifted to the left or the right"),
  value = c(round(e1071::skewness(mtcars$wt),
                  digits = 4))
) %>%
  DT::datatable(rownames = FALSE,
                class = 'cell-border stripe',
                extensions = 'Scroller',
                caption = 'Table 4: Measure of skewness',
                options = list(
                  dom = "t"
                ))
Why do we care about skewness? It describes whether the curve is shifted to the right or the left. For instance, here the skewness value is positive, which means that the distribution is pulled towards the right tail, that is, the longer tail lies to the right of the average. We care about it because it helps us assess the symmetry of the distribution. In a perfectly symmetric distribution, the same number of observations would fall on the left and right sides. In a skewed distribution, one tail is longer than the other, which might mean different things in different contexts.
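As a minimal sketch of what sits behind the skewness value, the function below computes the moment-based coefficient: the third central moment divided by the second central moment raised to the power 3/2. The function name moment_skewness is hypothetical. To my understanding, e1071::skewness() with its default type = 3 rescales this quantity by ((n - 1)/n)^(3/2), so the two values differ slightly in a sample of 32 cars.

```r
# Moment-based skewness: third central moment over the variance to the power 3/2
moment_skewness <- function(x) {
  centered <- x - mean(x)
  m2 <- mean(centered^2)  # second central moment (divides by n)
  m3 <- mean(centered^3)  # third central moment
  m3 / m2^(3 / 2)
}

# A symmetric vector has zero skewness
moment_skewness(c(1, 2, 3, 4, 5))

# Weight in mtcars: a positive value indicates a longer right tail
moment_skewness(mtcars$wt)
```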
data.frame(
  measure = c("kurtosis"),
  definition = c("degree to which the distribution clusters in the center or the tails"),
  value = c(round(e1071::kurtosis(mtcars$wt),
                  digits = 4))
) %>%
  DT::datatable(rownames = FALSE,
                class = 'cell-border stripe',
                extensions = 'Scroller',
                caption = 'Table 5: Measure of kurtosis',
                options = list(
                  dom = "t"
                ))
Why do we care about kurtosis? This measure indicates how heavy the tails of the distribution are relative to a normal distribution. A positive value means heavier tails, typically alongside a sharper peak, while a negative value means lighter tails and a flatter shape. In a distribution with high kurtosis, a larger proportion of observations falls in the extremes. As such, if we draw at random from that distribution, we are more likely to get an extreme value (very high or very low) than we would be under a normal distribution.
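The kurtosis value can be sketched in the same moment-based way: the fourth central moment divided by the squared second central moment, minus 3 so that a normal distribution scores zero (often called excess kurtosis). The function name moment_excess_kurtosis is hypothetical, and, as far as I know, the default e1071::kurtosis() applies a small-sample rescaling, so its value will not match this one exactly.

```r
# Moment-based excess kurtosis: fourth central moment over the squared variance, minus 3
moment_excess_kurtosis <- function(x) {
  centered <- x - mean(x)
  m2 <- mean(centered^2)  # second central moment (divides by n)
  m4 <- mean(centered^4)  # fourth central moment
  m4 / m2^2 - 3
}

# Evenly spaced values have lighter tails than a normal distribution (negative value)
moment_excess_kurtosis(c(1, 2, 3, 4, 5))

# Weight in mtcars
moment_excess_kurtosis(mtcars$wt)
```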