Quantitative Research Overview

Explain what the four data types (levels of measurement) are and provide simple examples to illustrate these levels.

Note: Since I am rather uncreative, I drew on my knowledge of commonly used open source data sets for an example. Here, I use the mtcars data, which comprises a series of variables on 32 automobiles (1973-74 models) from Motor Trend US magazine. I am not particularly fond of cars, but the data should allow me to address some of the requirements for this assignment.

Nominal: Categories with no inherent order. For instance, in Table 1, which contains data from the 1974 Motor Trend magazine, the nominal variables are the car name (e.g., Merc 240D, Merc 230), engine shape (vs: 0 = V-shaped, 1 = straight), and transmission (am: 0 = automatic, 1 = manual).
Ordinal: Ordered categories without a clearly defined distance between units. Table 1 does not include ordinal data; however, one could add an ordinal variable by asking editors to rate how much they like each vehicle on a Likert scale from 0 (‘strongly dislike’) to 5 (‘strongly like’).
Interval: Variables that exhibit both order and a meaningful distance between values. For instance, in Table 1 the quarter-mile time (qsec) has this property, as the difference between times of 5 and 6 seconds is the same as that between 6 and 7 seconds. Strictly speaking, qsec also has a true zero, so it meets the ratio criteria as well.
Ratio: Variables that exhibit order, distance and a true zero of consequence. For instance, in the table below the weight (wt) is a prime example. Here, a weight equaling zero would mean that there is no discernible weight.
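The four levels can also be made explicit in R. The following is a minimal sketch, not part of the original analysis: the `cars` copy and the `rating` vector are illustrative assumptions, and the three ratings are made-up values used only to show how an ordered factor behaves.

```r
# Work on a copy so the built-in mtcars data stays untouched
cars <- mtcars

# Nominal: unordered factors with descriptive labels
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))
cars$vs <- factor(cars$vs, levels = c(0, 1), labels = c("V-shaped", "straight"))

# Ordinal: a hypothetical 0-5 Likert rating (values invented for illustration)
rating <- factor(c(0, 3, 5), levels = 0:5, ordered = TRUE)

# Comparisons are defined for ordered factors but not for nominal ones
rating[1] < rating[3]

# Ratio: weight has a true zero, so a statement such as
# "car A weighs x times as much as car B" is meaningful
cars$wt[2] / cars$wt[1]
```

Storing nominal variables as factors keeps R from treating codes such as 0 and 1 as quantities in later summaries.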

library(dplyr)
mtcars %>%
  mutate(car = row.names(.)) %>%
  select(car, everything()) %>%
  DT::datatable(rownames = FALSE,
                class = 'cell-border stripe',
                extensions = 'Scroller',
                caption = 'Table 1: 1974 Motor Trend Cars',
                options = list(
                  dom         = "tipf",
                  scrollY     = 200,
                  scrollX     = TRUE,
                  scroller    = TRUE
  ))

car	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
Mazda RX4	21	6	160	110	3.9	2.62	16.46	0	1	4	4
Mazda RX4 Wag	21	6	160	110	3.9	2.875	17.02	0	1	4	4
Datsun 710	22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
Hornet 4 Drive	21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
Hornet Sportabout	18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
Valiant	18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
Duster 360	14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
Merc 240D	24.4	4	146.7	62	3.69	3.19	20	1	0	4	2
Merc 230	22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
Merc 280	19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4
Merc 280C	17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4
Merc 450SE	16.4	8	275.8	180	3.07	4.07	17.4	0	0	3	3
Merc 450SL	17.3	8	275.8	180	3.07	3.73	17.6	0	0	3	3
Merc 450SLC	15.2	8	275.8	180	3.07	3.78	18	0	0	3	3
Cadillac Fleetwood	10.4	8	472	205	2.93	5.25	17.98	0	0	3	4
Lincoln Continental	10.4	8	460	215	3	5.424	17.82	0	0	3	4
Chrysler Imperial	14.7	8	440	230	3.23	5.345	17.42	0	0	3	4
Fiat 128	32.4	4	78.7	66	4.08	2.2	19.47	1	1	4	1
Honda Civic	30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2
Toyota Corolla	33.9	4	71.1	65	4.22	1.835	19.9	1	1	4	1
Toyota Corona	21.5	4	120.1	97	3.7	2.465	20.01	1	0	3	1
Dodge Challenger	15.5	8	318	150	2.76	3.52	16.87	0	0	3	2
AMC Javelin	15.2	8	304	150	3.15	3.435	17.3	0	0	3	2
Camaro Z28	13.3	8	350	245	3.73	3.84	15.41	0	0	3	4
Pontiac Firebird	19.2	8	400	175	3.08	3.845	17.05	0	0	3	2
Fiat X1-9	27.3	4	79	66	4.08	1.935	18.9	1	1	4	1
Porsche 914-2	26	4	120.3	91	4.43	2.14	16.7	0	1	5	2
Lotus Europa	30.4	4	95.1	113	3.77	1.513	16.9	1	1	5	2
Ford Pantera L	15.8	8	351	264	4.22	3.17	14.5	0	1	5	4
Ferrari Dino	19.7	6	145	175	3.62	2.77	15.5	0	1	5	6
Maserati Bora	15	8	301	335	3.54	3.57	14.6	0	1	5	8
Volvo 142E	21.4	4	121	109	4.11	2.78	18.6	1	1	4	2

Explain in your own words what distributional moments mean and why they might be important.

As per the required textbook, the distributional moments can be grouped into four moments, as follows:

Measuring the center: Descriptive statistics that locate a central position in the data.

For instance, Table 2 reports these statistics for the weight (wt, in 1000 lbs) from the 1974 Motor Trend data:

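# Tabulate the weights once so the mode lookup stays readable
wt_counts <- table(mtcars$wt)

data.frame(measure    = c("mean",
                          "median",
                          "mode"),
           definition = c("average",
                          "center of the distribution",
                          "most commonly occurring value"),
           # as.numeric() keeps the column numeric instead of coercing it to character
           value      = c(mean(mtcars$wt),
                          median(mtcars$wt),
                          as.numeric(names(wt_counts)[wt_counts == max(wt_counts)]))
           ) %>%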
    DT::datatable(rownames = FALSE,
                class = 'cell-border stripe',
                extensions = 'Scroller',
                caption = 'Table 2: Measures of center',
                options = list(
                  dom         = "t"
  ))

Table 2: Measures of center
measure	definition	value
mean	average	3.21725
median	center of the distribution	3.325
mode	most commonly occurring value	3.44

Why do we care about the center? It gives us a single representative value for the variable, and it is the reference point on which the measures of spread are built.
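Base R has no built-in function for the statistical mode, and the table-based lookup used above would return several values if there were a tie. A small helper makes that behavior explicit. This is a sketch only; the name `all_modes` is hypothetical and not part of any package.

```r
# Return every value that occurs with the highest frequency
all_modes <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

# One mode for the car weights
all_modes(mtcars$wt)

# Two modes when values are tied
all_modes(c(1, 1, 2, 2, 3))
```

Because the result can have more than one element, it is worth checking its length before placing it in a fixed-size summary table.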

Measuring the spread: Descriptive statistics that show how widely the values of a variable are distributed in the data.

data.frame(
  measure    = c("range",
                 "variance",
                 "standard deviation",
                 "coefficient of variation"),
  definition = c("the minimum and maximum numbers in a distribution",
                 "the average of the squared differences from the mean",
                 "the square root of the variance",
                 "the ratio of the standard deviation to the mean"),
  value      = c(paste0(range(mtcars$wt)[1], " to ", range(mtcars$wt)[2]),
                 round(var(mtcars$wt),
                       digits = 4),
                 round(sd(mtcars$wt),
                       digits = 4),
                 round(sd(mtcars$wt)/mean(mtcars$wt) * 100,
                       digits = 4)
                 )
  ) %>%
    DT::datatable(rownames = FALSE,
                class = 'cell-border stripe',
                extensions = 'Scroller',
                caption = 'Table 3: Measures of spread',
                options = list(
                  dom         = "t"
  ))

Table 3: Measures of spread
measure	definition	value
range	the minimum and maximum numbers in a distribution	1.513 to 5.424
variance	the average of the squared differences from the mean	0.9574
standard deviation	the square root of the variance	0.9785
coefficient of variation	the ratio of the standard deviation to the mean	30.4129

Why do we care about the spread? The spread of the data indicates how much variation there is in the distribution. While the measures of center give us a starting point, measures of spread tell us about the upper and lower bounds of our data. Furthermore, as descriptive measures, they help the analyst understand how far observations typically fall above or below the center.
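To connect the definitions to the numbers, the spread measures can be recomputed by hand. This is a sketch for checking purposes; the variable names are my own. Note that R's `var()` divides by n - 1 rather than n, so the "average" of the squared differences is the sample version.

```r
x <- mtcars$wt
n <- length(x)

# Sample variance: squared deviations from the mean, divided by n - 1
var_manual <- sum((x - mean(x))^2) / (n - 1)

# Standard deviation: square root of the variance
sd_manual <- sqrt(var_manual)

# Coefficient of variation, expressed as a percentage of the mean
cv_manual <- sd_manual / mean(x) * 100

# The manual values agree with the built-in functions
all.equal(var_manual, var(x))
all.equal(sd_manual, sd(x))
```

Expressing the coefficient of variation as a percentage makes it unit-free, which is useful when comparing variables measured on different scales.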

library(ggplot2)
ggplot(mtcars, aes(x = wt)) + 
  geom_histogram(aes(y      = after_stat(density)),
                 binwidth   = 0.5,
                 color      = "lightgrey",
                 fill       = "lightgrey") +
  geom_density(alpha        = 0.5, 
               fill         = "darkgrey",
               color        = "darkgrey") +
  geom_vline(aes(xintercept = mean(wt)),
             linetype       = "dotted") +
  xlab("Weight") +
  ylab("Density") +
  ggtitle(label = "Figure 1. Weight spread") +
  theme_bw()

Measuring the skew: A measure of how asymmetric a distribution is relative to a symmetric one, such as the normal distribution.

data.frame(
  measure    = c("skew"),
  definition = c("whether or not the curve shifts to the left or the right"),
  value      = c(round(e1071::skewness(mtcars$wt),
                       digits = 4))
) %>%
    DT::datatable(rownames = FALSE,
                class = 'cell-border stripe',
                extensions = 'Scroller',
                caption = 'Table 4: Measure of skewness',
                options = list(
                  dom         = "t"
  ))

Table 4: Measure of skewness
measure	definition	value
skew	whether or not the curve shifts to the left or the right	0.4231

Why do we care about skewness? It describes whether the curve is shifted to the right or the left. Here the skewness value is positive, which means that the distribution has a longer right tail: observations stretch farther above the average than below it. We care about it because it tells us how symmetric the distribution is. In a perfectly symmetric distribution, the left and right sides of the center mirror each other. In a skewed distribution, one tail is longer than the other, which might mean different things in different contexts.
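The skewness statistic is the third central moment scaled by the cube of the standard deviation. The sketch below recomputes it in base R, assuming the default `type = 3` definition in `e1071::skewness()` (third central moment divided by the cubed sample standard deviation); the variable names are my own.

```r
x <- mtcars$wt

# Third central moment: average cubed deviation from the mean
m3 <- mean((x - mean(x))^3)

# Scale by the cubed sample standard deviation (n - 1 denominator)
skew_manual <- m3 / sd(x)^3

# Should match the value reported in Table 4 after rounding
round(skew_manual, digits = 4)
```

Cubing the deviations preserves their sign, which is why a longer right tail produces a positive value.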

Measuring the catastrophic tail events: A metric indicating the degree to which observations cluster in the tails or around the peak of a frequency distribution.

data.frame(
  measure    = c("kurtosis"),
  definition = c("degree to which the distribution clusters in the center or the tails"),
  value      = c(round(e1071::kurtosis(mtcars$wt),
                       digits = 4))
) %>%
    DT::datatable(rownames = FALSE,
                class = 'cell-border stripe',
                extensions = 'Scroller',
                caption = 'Table 5: Measures of kurtosis',
                options = list(
                  dom         = "t"
  ))

Table 5: Measures of kurtosis
measure	definition	value
kurtosis	degree to which the distribution clusters in the center or the tails	-0.0227

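Why do we care about kurtosis? This measure indicates how much of the data sits in the tails of the distribution rather than around its center. The value reported by e1071::kurtosis is excess kurtosis, so it is measured relative to a normal distribution: a positive value means heavier tails than the normal, a negative value means lighter tails, and a value near zero (as here) means tails similar to the normal. A distribution with a high kurtosis has a larger proportion of observations at the extremes. As such, if we draw at random from that distribution, we are more likely to get an extreme value (very high or very low) than we would be under a normal distribution.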
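Kurtosis follows the same pattern as skewness but uses the fourth central moment. The sketch below recomputes excess kurtosis in base R, assuming the default `type = 3` definition in `e1071::kurtosis()` (fourth central moment divided by the fourth power of the sample standard deviation, minus 3); the variable names are my own.

```r
x <- mtcars$wt

# Fourth central moment: average deviation from the mean raised to the fourth power
m4 <- mean((x - mean(x))^4)

# Subtracting 3 expresses the result relative to a normal distribution
kurt_manual <- m4 / sd(x)^4 - 3

# Should match the value reported in Table 5 after rounding
round(kurt_manual, digits = 4)
```

Raising deviations to the fourth power weights observations far from the mean very heavily, which is why the statistic is sensitive to tail behavior.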

Short Assignment

Chris Callaghan

2020-04-13