The data

What does the distribution of housefly wing lengths (in mm) look like? This was apparently an important question in 1955, and the data set we look at is a famous one, from Sokal & Hunter, reproduced here:

https://seattlecentral.edu/qelp/sets/057/057.html

The data set can be scanned in from the web:

wings <- scan("https://seattlecentral.edu/qelp/sets/057/s057.txt")

This reads the data in as a vector.
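
By default, scan() parses whitespace-separated values as doubles and returns them as a numeric vector. A small sketch of the same function on an inline string, using the text argument with made-up values (not the wing data), which is handy for checking the behaviour without a network connection:

```r
# Made-up values for illustration only; scan() parses them as doubles by default
x <- scan(text = "36 37 38 38")

# Confirm the result is a plain numeric vector and count its elements
is.numeric(x)
length(x)
```

Running length(wings) after the real scan() call is a quick way to confirm that the whole file was read.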

Summary statistics

Find the sample mean:

mean(wings)
## [1] 45.5

The sample mean fly wing length is 45.5 millimeters.

Find the plug-in standard deviation and the sample standard deviation:

# Plug-in
sqrt(mean(wings^2) - mean(wings)^2)
## [1] 3.9
# Sample SD
sd(wings)
## [1] 3.919647

There’s hardly any difference. To one decimal place, the standard deviation of fly wing lengths is 3.9 millimeters.
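
The two estimates differ only in the divisor: the plug-in version divides the sum of squared deviations by n, while sd() divides by n - 1. The plug-in value therefore equals sd() multiplied by sqrt((n - 1) / n). A minimal sketch of that relationship, using a small made-up vector x rather than the wing data:

```r
# Toy data for illustration only (not the wing lengths)
x <- c(2, 4, 4, 5, 7, 9)
n <- length(x)

# Plug-in SD: divides the sum of squared deviations by n
plug_in <- sqrt(mean(x^2) - mean(x)^2)

# Sample SD: divides by n - 1
sample_sd <- sd(x)

# The two agree once the sample SD is rescaled by sqrt((n - 1) / n)
all.equal(plug_in, sample_sd * sqrt((n - 1) / n))
```

The ratio of the two values reported above, 3.9 / 3.919647, is about 0.995, which is consistent with sqrt(99 / 100), that is, with a sample of 100 flies.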

We can also describe the data using the five-number summary:

summary(wings)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    36.0    43.0    45.5    45.5    48.0    55.0

The median fly wing length is 45.5 millimeters. The interquartile range is 5.0 millimeters.
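
The interquartile range quoted above is the third quartile minus the first, 48.0 - 43.0. R can also compute it directly with IQR(). A short sketch on a made-up vector x (not the wing data), which also shows fivenum() for comparison:

```r
# Toy data for illustration only (not the wing lengths)
x <- c(1, 3, 4, 6, 8, 9, 12)

# Quartiles using R's default quantile definition (type 7), as summary() does
q <- quantile(x, probs = c(0.25, 0.75))

# IQR() returns the difference between the third and first quartiles
iqr_direct <- IQR(x)
iqr_manual <- unname(q[2] - q[1])
all.equal(iqr_direct, iqr_manual)

# fivenum() gives Tukey's five numbers: min, lower hinge, median, upper hinge, max
fivenum(x)
```

Note that summary() prints six values because it includes the mean, and that the hinges from fivenum() can differ slightly from the quartiles returned by quantile().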

Plot the data

Plot the empirical CDF, adding labels:

plot(ecdf(wings), main = "ECDF of housefly wing lengths",
     xlab = "Wing length (mm)", ylab = "Empirical CDF")

Draw a well-labeled frequency histogram, using bins of width 1 mm:

hist(wings, breaks = 35.5:55.5, main = "Housefly Wing Lengths",
     xlab = "Length (x 0.1 mm)")

Draw a kernel density plot:

plot(density(wings), main = "Kernel Density of Housefly Wing Lengths",
     xlab = "Length (x 0.1 mm)")

Draw a boxplot:

boxplot(wings, main = "Housefly Wing Lengths (By Quartile)",
        ylab = "Length (x 0.1 mm)")

Draw a normal quantile-quantile plot:

qqnorm(wings, main = "Normal QQ Plot of Housefly Wing Lengths")
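
A reference line makes it easier to judge how closely the points follow a straight line. A minimal sketch using qqline() from base R, shown here on simulated, rounded normal values (an assumption for illustration; substitute wings for the real data):

```r
# Simulated data for illustration only: normal draws rounded to whole numbers
set.seed(1)
x <- round(rnorm(100, mean = 45.5, sd = 3.9))

qqnorm(x, main = "Normal QQ plot (simulated, rounded data)")
# Reference line through the first and third quartiles
qqline(x, col = "red")
```

With rounded data, tied values appear as short horizontal runs of points, so the plot looks like a staircase climbing along the reference line.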

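Does the data look like it follows a normal distribution, apart from small discrepancies like rounding? Yes. The points in the QQ plot fall close to a straight line, and the histogram and density plot are bell-shaped with no visible skew in either direction. The main discrepancy is rounding: because the lengths are recorded as whole numbers, the QQ plot and the ECDF show small steps rather than a smooth curve.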

There’s one thing about the data that might make one suspicious that the data is made up. The suspicious thing is: the frequencies rise and fall in exactly the same way on either side of the mean, so the distribution is perfectly symmetric. Real-life data sets are almost never exactly symmetric about the mean, which suggests that this one was constructed rather than measured.
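
One way to check this kind of symmetry is to tabulate the values and compare the counts with their mirror image. A sketch on a small made-up vector (hypothetical values, not the wing data); applying the same steps to wings would test the claim directly:

```r
# Toy data for illustration only: counts 1, 2, 3, 2, 1 across the values 1 to 5
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5)

# Count every integer in the observed range, including any with zero count
counts <- as.vector(table(factor(x, levels = min(x):max(x))))

# Symmetric data have counts that read the same forwards and backwards
is_symmetric <- all(counts == rev(counts))
is_symmetric
```

Using factor() with explicit levels keeps values that never occur in the table with a count of zero, so the mirror-image comparison lines up value by value.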