Center, Shape & Spread

Steve Hoffman

Center, Shape, & Spread

Measurement and Evaluation in Psychology and Education

8th Edition (2010)

Robert M. Thorndike & Tracy Thorndike-Christ

Pages 39 - 57

SETUP

library(tidyverse)

# READ IN DATA 
Table.2.1 <- read_csv(file = "Table.2.1_clean.csv") 
Table.2.1
# A tibble: 52 × 7
   First    Last     Gender Class   Reading Spelling  Math
   <chr>    <chr>    <chr>  <chr>     <dbl>    <dbl> <dbl>
 1 Aaron    Andrews  male   Johnson      32       64    43
 2 Byron    Biggs    male   Johnson      40       64    37
 3 Charles  Cowen    male   Johnson      36       40    38
 4 Donna    Davis    female Johnson      41       74    40
 5 Erin     Edwards  female Johnson      36       69    28
 6 Fernando Franco   male   Johnson      41       67    42
 7 Gail     Galaraga female Johnson      40       71    37
 8 Harpo    Henry    male   Johnson      30       51    34
 9 Irrida   Ignacio  female Johnson      37       68    35
10 Jack     Johanson male   Johnson      26       56    26
# … with 42 more rows

Center of the data

Thorndike and Thorndike-Christ describe three different measures of the center of a group of scores (measures of central tendency):

  • Mode
  • Median
  • Mean

Let’s look at the spelling test scores from the Johnson and Cordero classes.

Sort the data

Sort the data set by spelling score, from lowest to highest.

arrange(Table.2.1, Spelling)
# A tibble: 52 × 7
   First     Last     Gender Class   Reading Spelling  Math
   <chr>     <chr>    <chr>  <chr>     <dbl>    <dbl> <dbl>
 1 Bellinda  Brown    female Cordero      33       38    41
 2 Charles   Cowen    male   Johnson      36       40    38
 3 Larry     Lewis    male   Cordero      29       40    34
 4 Thelma    Thwaites female Cordero      38       43    45
 5 Quadra    Quickly  female Johnson      21       44    19
 6 Nancy     Nowits   female Cordero      28       44    44
 7 Nathan    Natts    male   Johnson      22       47    22
 8 Charlotta Cowen    female Cordero      33       47    50
 9 Zephina   Zoro     female Cordero      30       47    38
10 Quincy    Quirn    male   Cordero      33       48    33
# … with 42 more rows

Look at only spelling scores

sorted <- arrange(Table.2.1,
  Spelling)
sorted$Spelling 
 [1] 38 40 40 43 44 44 47 47 47 48 49 50 51 51 51 52 52 53 53 53 54 54 55 55 56
[26] 57 57 58 59 59 59 61 61 61 63 64 64 64 64 64 65 65 66 67 68 68 68 69 71 73
[51] 74 76

Mode

Which score occurred most frequently?

sorted$Spelling 
 [1] 38 40 40 43 44 44 47 47 47 48 49 50 51 51 51 52 52 53 53 53 54 54 55 55 56
[26] 57 57 58 59 59 59 61 61 61 63 64 64 64 64 64 65 65 66 67 68 68 68 69 71 73
[51] 74 76
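# Base R has no built-in function for the statistical mode
# (the existing mode() reports an object's storage type instead).
# Define a helper that returns the most frequent value in x;
# if several values tie, it returns the one that appears first in x.
Mode <- function(x) {
   uniqx <- unique(x)
   uniqx[which.max(tabulate(match(x, uniqx)))]
}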
Mode(sorted$Spelling) 
[1] 64

So in this case, the mode is 64, which occurs five times.
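A frequency table is another way to see the mode, and it also shows how many times each score occurred. This is only a sketch: the vector `spelling` is typed in from the sorted output above so that the example runs without the CSV file, and it stands in for `Table.2.1$Spelling`.

```r
# Spelling scores copied from the sorted output above (52 values)
spelling <- c(38, 40, 40, 43, 44, 44, 47, 47, 47, 48, 49, 50, 51, 51, 51, 52, 52,
              53, 53, 53, 54, 54, 55, 55, 56, 57, 57, 58, 59, 59, 59, 61, 61, 61,
              63, 64, 64, 64, 64, 64, 65, 65, 66, 67, 68, 68, 68, 69, 71, 73, 74, 76)

# Count how often each distinct score occurs
counts <- table(spelling)

# Keep every score that reaches the highest count (handles ties too)
modes <- as.numeric(names(counts)[counts == max(counts)])
modes        # 64
max(counts)  # 5
```

Unlike a helper built on which.max(), this version would report all tied scores if two scores shared the top count.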

Median

No need to build a function for median (the middle value), as this is already built into R.

Note that Table.2.1$Spelling tells R to look at the dataset called Table.2.1 and then specifically at its column called Spelling.

The median value is 57 in this case.

How does R calculate it? “Under the hood” R sorts the values from smallest to largest and then finds the value that is halfway down the list. If there is an even number of values, R takes the average of the middle two.

median(Table.2.1$Spelling)
[1] 57
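The sort-and-pick-the-middle description can be sketched by hand. The helper name `manual_median` is made up for this illustration, and `spelling` is typed in from the sorted output earlier as a stand-in for `Table.2.1$Spelling`.

```r
# Spelling scores copied from the sorted output earlier (52 values)
spelling <- c(38, 40, 40, 43, 44, 44, 47, 47, 47, 48, 49, 50, 51, 51, 51, 52, 52,
              53, 53, 53, 54, 54, 55, 55, 56, 57, 57, 58, 59, 59, 59, 61, 61, 61,
              63, 64, 64, 64, 64, 64, 65, 65, 66, 67, 68, 68, 68, 69, 71, 73, 74, 76)

# Hypothetical helper: median by sorting and taking the middle value(s)
manual_median <- function(x) {
  s <- sort(x)
  n <- length(s)
  if (n %% 2 == 0) {
    # Even count: average the two middle values
    mean(s[c(n / 2, n / 2 + 1)])
  } else {
    # Odd count: take the single middle value
    s[(n + 1) / 2]
  }
}

manual_median(spelling)  # 57, the average of the 26th and 27th scores (both 57)
```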

Mean

The mean is also known as the average. R calculates it by adding up all of the values in the column (e.g. all 52 of the spelling scores: 64 + 64 + 40 + 74 + …) and dividing this sum by the number of scores (52 in this case).

mean(Table.2.1$Spelling)
[1] 57.15385

Just a fraction over 57 in this case.
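The add-them-up-and-divide recipe can be checked directly. As a sketch, `spelling` below is typed in from the sorted output earlier and stands in for `Table.2.1$Spelling`.

```r
# Spelling scores copied from the sorted output earlier (52 values)
spelling <- c(38, 40, 40, 43, 44, 44, 47, 47, 47, 48, 49, 50, 51, 51, 51, 52, 52,
              53, 53, 53, 54, 54, 55, 55, 56, 57, 57, 58, 59, 59, 59, 61, 61, 61,
              63, 64, 64, 64, 64, 64, 65, 65, 66, 67, 68, 68, 68, 69, 71, 73, 74, 76)

total <- sum(spelling)     # 2972, the sum of all scores
n     <- length(spelling)  # 52, the number of scores
total / n                  # 57.15385, same as mean(spelling)
```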

Shape (pages 45 & 46)

When the distribution of a variable is NOT symmetrical, it is described as skewed.

Negative skewness indicates that the distribution is left skewed (a longer tail of low scores), and the mean of the data (like Spelling scores) is usually less than the median. Positive skewness means the opposite: the distribution is right skewed, and the mean is usually higher than the median.

Moments (how to calculate skewness)

There’s a package called “moments” that calculates this statistic. If you want to calculate skewness, install the package on your computer first (this only needs to be done once): install.packages("moments")

library(moments)
skewness(Table.2.1$Spelling)
[1] -0.05793692

This value is pretty small (and only a touch negative), indicating that our Spelling scores are fairly symmetrical.
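For anyone curious what sits behind the number, here is a sketch of one common moment-based definition of skewness: the average cubed deviation from the mean, divided by the average squared deviation raised to the power 3/2. I am assuming this matches what the moments package computes, so compare the two results rather than taking it on faith. The helper name `manual_skewness` is made up for this illustration, and `spelling` is typed in from the sorted output earlier.

```r
# Spelling scores copied from the sorted output earlier (52 values)
spelling <- c(38, 40, 40, 43, 44, 44, 47, 47, 47, 48, 49, 50, 51, 51, 51, 52, 52,
              53, 53, 53, 54, 54, 55, 55, 56, 57, 57, 58, 59, 59, 59, 61, 61, 61,
              63, 64, 64, 64, 64, 64, 65, 65, 66, 67, 68, 68, 68, 69, 71, 73, 74, 76)

# Hypothetical helper: moment-based skewness (an assumed definition, see above)
manual_skewness <- function(x) {
  d  <- x - mean(x)  # deviations from the mean
  m2 <- mean(d^2)    # second central moment (divides by n)
  m3 <- mean(d^3)    # third central moment (divides by n)
  m3 / m2^(3 / 2)
}

manual_skewness(c(1, 2, 3, 4, 5))   # symmetric values give 0
manual_skewness(c(1, 2, 3, 4, 10))  # one large value pulls the result positive
manual_skewness(spelling)           # compare with skewness(Table.2.1$Spelling)
```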

Example of skewed

An example of a positively skewed variable is the engine displacement (displ) in the mpg dataset used in our R for Data Science book.

ggplot(mpg) +
  geom_density(aes(displ))

Calculating skewness

This is an example where the median is 3.3 while the mean is a bit higher, at 3.47, and the skewness is positive (0.44).

median(mpg$displ)
[1] 3.3
mean(mpg$displ)
[1] 3.471795
skewness(mpg$displ)
[1] 0.441463

Spread

Our authors identify three different ways to describe how spread out a set of scores (like our spelling scores) is:

  • range
  • interquartile range
  • standard deviation

Range

range(Table.2.1$Spelling)
[1] 38 76

And if you somehow needed to put this in a single statistic, you might subtract the minimum value from the maximum value.

HowBig <- (max(Table.2.1$Spelling) - min(Table.2.1$Spelling))
HowBig
[1] 38

Interquartile Range

The interquartile range (IQR) is the width of the middle half of the scores: the distance from the 25th percentile to the 75th percentile.

quantile(Table.2.1$Spelling, probs = c(.25, .5, .75))
25% 50% 75% 
 51  57  64 
IQR(Table.2.1$Spelling)
[1] 13

In this case, the 25th percentile is 51 (only about a quarter of the scores are below 51). The 50th percentile (the median) is 57. And the 75th percentile is 64.

The IQR in this case is 13 points, the distance between the 25th percentile (51) and the 75th percentile (64).
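The subtraction behind IQR() can be written out as a quick check. As a sketch, `spelling` is typed in from the sorted output earlier and stands in for `Table.2.1$Spelling`.

```r
# Spelling scores copied from the sorted output earlier (52 values)
spelling <- c(38, 40, 40, 43, 44, 44, 47, 47, 47, 48, 49, 50, 51, 51, 51, 52, 52,
              53, 53, 53, 54, 54, 55, 55, 56, 57, 57, 58, 59, 59, 59, 61, 61, 61,
              63, 64, 64, 64, 64, 64, 65, 65, 66, 67, 68, 68, 68, 69, 71, 73, 74, 76)

# 25th and 75th percentiles; unname() drops the "25%" / "75%" labels
q1 <- unname(quantile(spelling, probs = 0.25))  # 51
q3 <- unname(quantile(spelling, probs = 0.75))  # 64

q3 - q1  # 13, same as IQR(spelling)
```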

Boxplots show the IQR

Boxplots (also called box & whiskers plots) are constructed by drawing a box that runs from the lower quartile (Q1, the point below which 25% of the scores fall) to the upper quartile (Q3, the point below which 75% of the scores fall). The median is then marked with a line inside the box, showing the middle score (for spelling scores, in our case).

ggplot(Table.2.1, mapping = aes(Spelling)) + geom_boxplot()
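The numbers the boxplot draws can also be worked out by hand. This sketch assumes the default whisker rule in geom_boxplot(), where each whisker reaches to the most extreme score that lies within 1.5 times the IQR of the box; `spelling` is typed in from the sorted output earlier.

```r
# Spelling scores copied from the sorted output earlier (52 values)
spelling <- c(38, 40, 40, 43, 44, 44, 47, 47, 47, 48, 49, 50, 51, 51, 51, 52, 52,
              53, 53, 53, 54, 54, 55, 55, 56, 57, 57, 58, 59, 59, 59, 61, 61, 61,
              63, 64, 64, 64, 64, 64, 65, 65, 66, 67, 68, 68, 68, 69, 71, 73, 74, 76)

q1  <- unname(quantile(spelling, 0.25))  # 51, bottom edge of the box
q3  <- unname(quantile(spelling, 0.75))  # 64, top edge of the box
iqr <- q3 - q1                           # 13

# Limits beyond which a score would be drawn as a separate point
lower_fence <- q1 - 1.5 * iqr            # 31.5
upper_fence <- q3 + 1.5 * iqr            # 83.5

# Whiskers end at the most extreme scores inside the fences
whiskers <- range(spelling[spelling >= lower_fence & spelling <= upper_fence])
whiskers                                 # 38 76

# Scores outside the fences (none for the spelling data)
outliers <- spelling[spelling < lower_fence | spelling > upper_fence]
length(outliers)                         # 0
```

Since every spelling score sits inside the fences, the whiskers simply run to the minimum and maximum.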

Standard Deviation

On page 48, “There are also measures of variability that belong to the family of the arithmetic mean and are based on score deviations from the mean. The most commonly used one is called the standard deviation.”

The mean and standard deviation of our Spelling scores from the Table 2.1 data can be calculated this way:

mean(Table.2.1$Spelling)
[1] 57.15385
sd(Table.2.1$Spelling)
[1] 9.350232

Easy enough to find out that the mean (the average) Spelling score is 57.15 and the standard deviation of these spelling scores is 9.35. But what does this mean?
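One way to start answering that is to see how the number is built from score deviations from the mean. The sketch below follows the usual sample formula (divide by n - 1), which is what sd() uses; `spelling` is typed in from the sorted output earlier and stands in for `Table.2.1$Spelling`.

```r
# Spelling scores copied from the sorted output earlier (52 values)
spelling <- c(38, 40, 40, 43, 44, 44, 47, 47, 47, 48, 49, 50, 51, 51, 51, 52, 52,
              53, 53, 53, 54, 54, 55, 55, 56, 57, 57, 58, 59, 59, 59, 61, 61, 61,
              63, 64, 64, 64, 64, 64, 65, 65, 66, 67, 68, 68, 68, 69, 71, 73, 74, 76)

# Step 1: how far is each score from the mean?
deviations <- spelling - mean(spelling)

# Step 2: square the deviations, add them up, divide by n - 1 (the variance)
variance <- sum(deviations^2) / (length(spelling) - 1)

# Step 3: take the square root to get back to the original units (points)
manual_sd <- sqrt(variance)
manual_sd  # 9.350232, same as sd(spelling)
```

So the standard deviation is, loosely, a typical distance between a score and the mean.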

Normal distribution of Spelling

In the book they talked about Math scores (Figure 2-8 on page 51), but we were already talking about Spelling, so let’s produce a similar plot.

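# Histogram of Spelling with a normal curve overlaid.
# dnorm() returns a density, so multiply by binwidth * n (2 * 52) to put it on the count scale.
ggplot(Table.2.1, aes(Spelling)) +
  geom_histogram(binwidth = 2, fill = "grey") +
  stat_function(fun = dnorm, args = list(mean = 57.15, sd = 9.35), aes(y = after_stat(y * 2 * 52))) +
  labs(y = "count")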

Interpreting SD

The superimposed normal curve is meant to demonstrate that, when scores are normally distributed, roughly two-thirds (about 68%) of them fall between one standard deviation below the mean and one standard deviation above the mean.
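The 68% figure comes from the normal curve itself and can be looked up with pnorm(), which gives the proportion of a standard normal distribution below a given number of standard deviations. A minimal sketch:

```r
# Proportion of a normal distribution within one SD of the mean
within_one_sd <- pnorm(1) - pnorm(-1)
within_one_sd  # 0.6826895, i.e. roughly two-thirds
```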

In our case, about two-thirds of the spelling scores should fall between 47.8 and 66.5.

mean(Table.2.1$Spelling) - sd(Table.2.1$Spelling)
[1] 47.80361
mean(Table.2.1$Spelling) + sd(Table.2.1$Spelling)
[1] 66.50408

Pass the sniff test?

Line up our spelling scores in order from lowest to highest.

How many are below 47.8? I see nine scores (the three scores of 47 count, since 47 is below 47.8).

And how many are above 66.5? I see nine.

That leaves 34 spelling scores (out of 52) within one standard deviation of the mean.

Is 34 out of 52 about two-thirds? Yes: 34/52 = 65%
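The counting can also be left to R rather than done by eye, which helps with scores that sit right next to a boundary (the three scores of 47 are below the lower bound of 47.8). As a sketch, `spelling` is typed in from the sorted output earlier and stands in for `Table.2.1$Spelling`.

```r
# Spelling scores copied from the sorted output earlier (52 values)
spelling <- c(38, 40, 40, 43, 44, 44, 47, 47, 47, 48, 49, 50, 51, 51, 51, 52, 52,
              53, 53, 53, 54, 54, 55, 55, 56, 57, 57, 58, 59, 59, 59, 61, 61, 61,
              63, 64, 64, 64, 64, 64, 65, 65, 66, 67, 68, 68, 68, 69, 71, 73, 74, 76)

lower <- mean(spelling) - sd(spelling)  # about 47.8
upper <- mean(spelling) + sd(spelling)  # about 66.5

below  <- sum(spelling < lower)                       # 9 scores below the interval
above  <- sum(spelling > upper)                       # 9 scores above the interval
within <- sum(spelling >= lower & spelling <= upper)  # 34 scores inside it

within / length(spelling)  # 0.6538462, close to two-thirds
```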