Missing Data

Sometimes not every variable is measured for every individual in a data set. When this happens, we need a placeholder in our data files so that R knows that the value is missing. The standard way of doing this in R is to put “NA” (without the quotes) where the value would have gone. NA is short for “not available”.

For example, in the Titanic data set, we do not know the age of several passengers. Let’s look at it. Load the Titanic data set:

titanicData <- read.csv("C:\\BI412L\\ABDLabs\\ABDLabs\\DataForLabs\\titanic.csv", stringsAsFactors = TRUE)
titanicData$age
##    [1] 29.0000  2.0000 30.0000 25.0000  0.9167 47.0000 63.0000 39.0000 58.0000
##   [10] 71.0000 47.0000 19.0000      NA      NA      NA 50.0000 24.0000 36.0000
##   [19] 37.0000 47.0000 26.0000 25.0000 25.0000 19.0000 28.0000 45.0000 39.0000
##   [28] 30.0000 58.0000      NA 45.0000 22.0000      NA 41.0000 48.0000      NA
##   [37] 44.0000 59.0000 60.0000 45.0000      NA 53.0000 58.0000 36.0000 33.0000
##   [46]      NA      NA 36.0000 36.0000 14.0000 11.0000 49.0000      NA 36.0000
##   [55]      NA 46.0000 47.0000 27.0000 31.0000      NA      NA      NA      NA
##   [64] 27.0000 26.0000      NA      NA 64.0000 37.0000 39.0000 55.0000      NA
##   [73] 70.0000 69.0000 36.0000 39.0000 38.0000      NA 27.0000 31.0000 27.0000
##   [82]      NA 31.0000 17.0000      NA      NA  4.0000 27.0000 50.0000 48.0000
##   [91] 49.0000 48.0000 39.0000 23.0000 53.0000 36.0000      NA      NA 30.0000
##  [100] 24.0000 19.0000 28.0000 23.0000 64.0000 60.0000      NA 49.0000      NA
##  [109] 44.0000 22.0000 60.0000 48.0000 37.0000 35.0000 47.0000 22.0000 45.0000
##  [118] 49.0000      NA 71.0000 54.0000 38.0000 19.0000 58.0000 45.0000 23.0000
##  [127] 46.0000 25.0000 21.0000 48.0000 49.0000 45.0000 36.0000      NA 55.0000
##  [136] 52.0000 24.0000      NA      NA      NA 16.0000 44.0000 51.0000 42.0000
##  [145] 35.0000 35.0000 38.0000 35.0000      NA 50.0000 49.0000 46.0000      NA
##  [154] 58.0000 41.0000      NA 42.0000 40.0000      NA      NA      NA 42.0000
##  [163] 55.0000 50.0000 16.0000      NA 29.0000 21.0000 30.0000 15.0000 30.0000
##  [172]      NA      NA      NA 46.0000 54.0000 36.0000 28.0000      NA 65.0000
##  [181] 33.0000 44.0000 37.0000      NA 55.0000 47.0000 36.0000 58.0000 31.0000
##  [190] 23.0000 19.0000 64.0000      NA 64.0000 22.0000 28.0000      NA      NA
##  [199] 22.0000      NA      NA 18.0000 17.0000 52.0000 46.0000 56.0000      NA
##  [208]      NA 43.0000 31.0000      NA      NA 33.0000      NA 27.0000 55.0000
##  [217] 54.0000      NA 61.0000 48.0000 18.0000 13.0000 21.0000      NA      NA
##  [226]      NA 34.0000 40.0000 36.0000 50.0000 39.0000 56.0000 28.0000 56.0000
##  [235] 56.0000 24.0000 18.0000      NA 24.0000 23.0000 45.0000 40.0000  6.0000
##  [244] 57.0000      NA 32.0000 62.0000 54.0000 43.0000 52.0000      NA 62.0000
##  [253] 67.0000 63.0000 61.0000 46.0000 52.0000 39.0000 18.0000 48.0000      NA
##  [262] 49.0000 39.0000 17.0000 46.0000      NA 31.0000      NA 61.0000 47.0000
##  [271] 64.0000 60.0000 60.0000 55.0000 54.0000 21.0000 57.0000 45.0000 31.0000
##  [280] 50.0000 50.0000 27.0000 20.0000 51.0000      NA 21.0000      NA      NA
##  [289] 36.0000      NA      NA      NA      NA      NA      NA      NA      NA
##  [298]      NA      NA      NA      NA      NA      NA      NA      NA      NA
##  [307] 40.0000      NA      NA 32.0000      NA      NA      NA      NA      NA
##  [316]      NA 33.0000      NA      NA      NA      NA      NA 30.0000 28.0000
##  [325] 18.0000      NA 34.0000 32.0000 57.0000 18.0000 23.0000 36.0000 28.0000
##  [334] 51.0000 32.0000 19.0000 28.0000 36.0000  4.0000  1.0000 12.0000 34.0000
##  [343] 19.0000 23.0000 26.0000      NA 27.0000 15.0000 45.0000 40.0000 20.0000
##  [352] 25.0000 36.0000 25.0000      NA 42.0000 26.0000 26.0000  0.8333 31.0000
##  [361]      NA 19.0000 54.0000 44.0000 52.0000 30.0000 30.0000      NA      NA
##  [370] 29.0000      NA 29.0000 27.0000 24.0000 35.0000 31.0000  8.0000 22.0000
##  [379] 30.0000      NA 20.0000      NA 21.0000 49.0000  8.0000 28.0000 18.0000
##  [388]      NA 28.0000 22.0000 25.0000 18.0000 32.0000 18.0000      NA 42.0000
##  [397] 34.0000  8.0000      NA      NA 23.0000 21.0000 19.0000      NA      NA
##  [406]      NA 38.0000      NA 38.0000 35.0000 35.0000 38.0000 24.0000 16.0000
##  [415] 26.0000 45.0000 24.0000 21.0000 22.0000      NA 34.0000 30.0000 50.0000
##  [424] 30.0000 23.0000  1.0000 44.0000 28.0000  6.0000 30.0000      NA 43.0000
##  [433] 45.0000  7.0000 24.0000 24.0000 49.0000 48.0000      NA 34.0000 32.0000
##  [442] 21.0000 18.0000 53.0000 23.0000 21.0000      NA 52.0000 42.0000 36.0000
##  [451] 21.0000 41.0000      NA      NA 33.0000 17.0000      NA      NA      NA
##  [460]      NA      NA      NA 23.0000 34.0000      NA 22.0000      NA      NA
##  [469] 45.0000      NA      NA 31.0000 30.0000 26.0000      NA 34.0000 26.0000
##  [478] 22.0000  1.0000  3.0000      NA      NA      NA 25.0000      NA 48.0000
##  [487]      NA 57.0000      NA      NA      NA  2.0000      NA 27.0000 19.0000
##  [496] 30.0000 20.0000 45.0000      NA 46.0000 41.0000 13.0000 19.0000 30.0000
##  [505] 48.0000 71.0000 54.0000      NA      NA 64.0000 32.0000 18.0000  2.0000
##  [514] 32.0000  3.0000 26.0000 19.0000      NA 20.0000 29.0000 39.0000 22.0000
##  [523]      NA 24.0000      NA 28.0000      NA 50.0000 20.0000 40.0000 42.0000
##  [532] 21.0000 32.0000 34.0000      NA      NA 33.0000  2.0000  8.0000 36.0000
##  [541] 34.0000 30.0000 28.0000 23.0000  0.8333 25.0000  3.0000 50.0000      NA
##  [550] 21.0000      NA      NA 25.0000 18.0000 20.0000 30.0000 59.0000 30.0000
##  [559] 35.0000 22.0000      NA 25.0000 41.0000 25.0000 14.0000 50.0000 22.0000
##  [568]      NA 27.0000 29.0000 27.0000 30.0000 22.0000 35.0000 30.0000 28.0000
##  [577] 23.0000      NA 12.0000 40.0000 36.0000 28.0000 32.0000 29.0000  4.0000
##  [586]  2.0000      NA      NA 36.0000 33.0000      NA      NA      NA 32.0000
##  [595]      NA      NA 26.0000      NA 30.0000 24.0000      NA 18.0000 42.0000
##  [604] 13.0000 16.0000 35.0000 16.0000 25.0000 18.0000 20.0000 30.0000 26.0000
##  [613] 40.0000 24.0000 41.0000 18.0000  0.8333 23.0000 20.0000 25.0000 35.0000
##  [622] 17.0000 32.0000 20.0000 39.0000 39.0000  6.0000  2.0000 17.0000 38.0000
##  [631]  9.0000 26.0000 11.0000  4.0000 20.0000 26.0000 25.0000 18.0000 24.0000
##  [640] 35.0000 40.0000 38.0000  5.0000  9.0000  3.0000 13.0000 23.0000  5.0000
##  [649]      NA 45.0000 23.0000 17.0000 27.0000 23.0000 20.0000 32.0000 33.0000
##  [658]  3.0000      NA      NA      NA 18.0000 40.0000 26.0000 15.0000 45.0000
##  [667] 18.0000 27.0000 22.0000 19.0000 26.0000 22.0000 20.0000 32.0000 21.0000
##  [676] 18.0000 26.0000  6.0000      NA      NA  9.0000 40.0000 32.0000      NA
##  [685] 26.0000 18.0000 20.0000      NA 29.0000 22.0000 22.0000 35.0000 21.0000
##  [694] 20.0000 19.0000 18.0000 18.0000 38.0000      NA 30.0000 17.0000 21.0000
##  [703] 21.0000 21.0000      NA      NA 24.0000 33.0000 33.0000 28.0000 16.0000
##  [712] 37.0000 28.0000      NA 24.0000 21.0000      NA 32.0000 29.0000 26.0000
##  [721] 18.0000 20.0000 19.0000 24.0000 24.0000 36.0000 31.0000 31.0000 30.0000
##  [730] 22.0000      NA 43.0000 35.0000 27.0000 19.0000 30.0000 36.0000  3.0000
##  [739]  9.0000 59.0000 19.0000 44.0000 17.0000      NA 45.0000 22.0000 19.0000
##  [748] 29.0000 30.0000 34.0000 28.0000  0.3333 27.0000 25.0000 24.0000 22.0000
##  [757] 21.0000 17.0000      NA      NA 26.0000 33.0000  1.0000  0.1667 25.0000
##  [766] 36.0000 36.0000 30.0000      NA 23.0000 26.0000 19.0000 65.0000      NA
##  [775] 42.0000 43.0000 32.0000 19.0000 30.0000 24.0000 23.0000      NA 24.0000
##  [784] 24.0000 23.0000 22.0000      NA 18.0000 16.0000 45.0000      NA      NA
##  [793]      NA 47.0000  5.0000      NA      NA      NA      NA      NA      NA
##  [802]      NA      NA      NA      NA      NA 21.0000 18.0000  9.0000 48.0000
##  [811] 16.0000      NA      NA 25.0000      NA      NA 22.0000 16.0000      NA
##  [820] 33.0000      NA  9.0000 41.0000 38.0000 40.0000 43.0000 14.0000 16.0000
##  [829]  9.0000 10.0000  6.0000 11.0000 40.0000 32.0000      NA 20.0000 37.0000
##  [838] 28.0000 19.0000      NA      NA      NA      NA      NA      NA      NA
##  [847]      NA      NA      NA      NA      NA      NA      NA      NA      NA
##  [856]      NA      NA      NA      NA      NA      NA      NA      NA      NA
##  [865]      NA      NA      NA      NA      NA      NA      NA      NA      NA
##  [874]      NA      NA      NA      NA      NA      NA      NA      NA      NA
##  [883]      NA      NA      NA      NA      NA      NA      NA      NA      NA
##  [892]      NA      NA      NA      NA      NA      NA      NA      NA      NA
##  [901]      NA      NA      NA      NA      NA      NA      NA      NA      NA
##  [910]      NA      NA      NA      NA      NA      NA      NA      NA      NA
##  [919]      NA      NA      NA      NA      NA      NA      NA      NA      NA
##  [928]      NA      NA      NA      NA      NA      NA      NA      NA      NA
##  [937]      NA      NA      NA      NA      NA      NA      NA      NA      NA
##  [946]      NA      NA      NA      NA      NA      NA      NA      NA      NA
##  [955]      NA      NA      NA      NA      NA      NA      NA      NA      NA
##  [964]      NA      NA      NA      NA      NA      NA      NA      NA      NA
##  [973]      NA      NA      NA      NA      NA      NA      NA      NA      NA
##  [982]      NA      NA      NA      NA      NA      NA      NA      NA      NA
##  [991]      NA      NA      NA      NA      NA      NA      NA      NA      NA
## [1000]      NA      NA      NA      NA      NA      NA      NA      NA      NA
## [1009]      NA      NA      NA      NA      NA      NA      NA      NA      NA
## [1018]      NA      NA      NA      NA      NA      NA      NA      NA      NA
## [1027]      NA      NA      NA      NA      NA      NA      NA      NA      NA
## [1036]      NA      NA      NA      NA      NA      NA      NA      NA      NA
## [1045]      NA      NA      NA      NA      NA      NA      NA      NA      NA
## [1054]      NA      NA      NA      NA      NA      NA      NA      NA      NA
## [1063]      NA      NA      NA      NA      NA      NA      NA      NA      NA
## [1072]      NA      NA      NA      NA      NA      NA      NA      NA      NA
## [1081]      NA      NA      NA      NA      NA      NA      NA      NA      NA
## [1090]      NA      NA      NA      NA      NA      NA      NA      NA      NA
## [1099]      NA      NA      NA      NA      NA      NA      NA      NA      NA
## [1108]      NA      NA      NA      NA      NA      NA      NA      NA      NA
## [1117]      NA      NA      NA      NA      NA      NA      NA      NA      NA
## [1126]      NA      NA      NA      NA      NA      NA      NA      NA      NA
## [1135]      NA      NA      NA      NA      NA      NA      NA      NA      NA
## [1144]      NA      NA      NA      NA      NA      NA      NA      NA      NA
## [1153]      NA      NA      NA      NA      NA      NA      NA      NA      NA
## [1162]      NA      NA      NA      NA      NA      NA      NA      NA      NA
## [1171]      NA      NA      NA      NA      NA      NA      NA      NA      NA
## [1180]      NA      NA      NA      NA      NA      NA      NA      NA      NA
## [1189]      NA      NA      NA      NA      NA      NA      NA      NA      NA
## [1198]      NA      NA      NA      NA      NA      NA      NA      NA      NA
## [1207]      NA      NA      NA      NA      NA      NA      NA      NA      NA
## [1216]      NA      NA      NA      NA      NA      NA      NA      NA      NA
## [1225]      NA      NA      NA      NA      NA      NA      NA      NA      NA
## [1234]      NA      NA      NA      NA      NA      NA      NA      NA      NA
## [1243]      NA      NA      NA      NA      NA      NA      NA      NA      NA
## [1252]      NA      NA      NA      NA      NA      NA      NA      NA      NA
## [1261]      NA      NA      NA      NA      NA      NA      NA      NA      NA
## [1270]      NA      NA      NA      NA      NA      NA      NA      NA      NA
## [1279]      NA      NA      NA      NA      NA      NA      NA      NA      NA
## [1288]      NA      NA      NA      NA      NA      NA      NA      NA      NA
## [1297]      NA      NA      NA      NA      NA      NA      NA      NA      NA
## [1306]      NA      NA      NA      NA      NA      NA      NA      NA

Some of the entries for the variable age are NA; these represent the passengers whose age is not known.

By the way, the titanic.csv file simply has nothing in the places where there is missing data. When R loaded it, it replaced the empty spots with NA automatically.
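As a small illustration of that behaviour, the sketch below reads a made-up three-row CSV (not the Titanic file) in which one age field is empty, then counts the missing values with is.na() and sum(). The names csvText and toyData are hypothetical and used only for this example.

```r
# Hypothetical three-row CSV; the age field for passenger B is left empty.
csvText <- "name,age\nA,29\nB,\nC,2"
toyData <- read.csv(text = csvText, stringsAsFactors = TRUE)

toyData$age               # the empty field is read in as NA
is.na(toyData$age)        # TRUE wherever a value is missing
sum(is.na(toyData$age))   # number of missing values in the column
```

The same sum(is.na(...)) call can be applied to titanicData$age to count how many passengers have no recorded age.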

Measures of Location

For this lab, we are going to use R to retrieve some basic descriptive statistics for numerical data.

Mean

Let’s get the mean of age using the mean() function.

mean(titanicData$age)
## [1] NA

NA Values

Notice that the output is NA. As we know, the age variable has some NA values, and a calculation that includes an NA returns NA because the result would depend on values that are unknown. To fix this, we tell R to drop the NA values before computing the mean, so that only the actual numbers are used. We do this by adding the option na.rm = TRUE.

mean(titanicData$age, na.rm = TRUE)
## [1] 31.19418
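To see what na.rm = TRUE does on a smaller scale, here is a minimal sketch with a made-up vector (the name ages and its values are assumptions for illustration, not taken from the Titanic data). Removing the NA values by hand with is.na() gives the same answer.

```r
# Made-up ages with one missing value
ages <- c(20, 30, NA, 40)

mean(ages)                  # NA, because one value is unknown
mean(ages, na.rm = TRUE)    # 30, the mean of 20, 30 and 40
mean(ages[!is.na(ages)])    # 30, same result after dropping the NA by hand
```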

The same na.rm = TRUE option works with many other functions whenever a variable contains NA values. Let’s apply it to a few of them.

Median

Get the median for age using the median() function and use na.rm = TRUE to remove any NA values.

median(titanicData$age, na.rm = TRUE)
## [1] 30

Summary

Another useful function is summary(), which gives the mean and median at the same time, along with some additional information about a variable, including the number of NA values. Get the summary for age.

summary(titanicData$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.1667 21.0000 30.0000 31.1942 41.0000 71.0000     680

From the output, we can see that summary() gives the minimum value, the 1st quartile, the median, the mean, the 3rd quartile, the maximum value, and the number of NA values in the variable.

Measures of Variability

R can also calculate measures of the variability of a sample. In this section we’ll learn how to calculate the variance, standard deviation, coefficient of variation and interquartile range of a set of data.

Variance

To get the variance of a variable in R, use the var() function.

var(titanicData$age)
## [1] NA
var(titanicData$age, na.rm = TRUE)
## [1] 217.4895

Standard Deviation

To get the standard deviation of a variable in R, use the sd() function.

sd(titanicData$age, na.rm = TRUE)
## [1] 14.74753

Note: The standard deviation is the square root of the variance and the variance is the standard deviation squared. Let us verify this in R:

sqrt(var(titanicData$age, na.rm = TRUE))
## [1] 14.74753
(sd(titanicData$age, na.rm = TRUE))^2
## [1] 217.4895
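For readers who want to see what var() and sd() compute, the sketch below works through the sample variance by hand, dividing the sum of squared deviations by n - 1. It uses a made-up vector; the names ages, observed and manualVar are assumptions for illustration only.

```r
# Made-up ages with one missing value
ages <- c(20, 30, NA, 40)

observed <- ages[!is.na(ages)]   # keep only the non-missing values
n <- length(observed)            # number of values actually used

# Sample variance: sum of squared deviations from the mean, divided by n - 1
manualVar <- sum((observed - mean(observed))^2) / (n - 1)

manualVar                  # 100
var(ages, na.rm = TRUE)    # 100, matches the manual calculation
sqrt(manualVar)            # 10
sd(ages, na.rm = TRUE)     # 10, the square root of the variance
```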

Coefficient of Variation

The coefficient of variation (CV) is a statistical measure that represents the ratio of the standard deviation to the mean of a dataset. It is often expressed as a percentage and is used to describe the relative variability or dispersion of data in relation to its mean.

Interpretation:

  • Low CV: Typically, a CV less than 10% is considered low. It indicates that the data points are tightly clustered around the mean, reflecting relatively low variability.
  • Moderate CV: A CV between 10% and 20% is often considered moderate. The variability is present, but it’s not extreme.

  • High CV: A CV greater than 20% is generally considered high. It suggests high variability, meaning the data points are spread out more widely around the mean.

Whether a CV counts as high or low depends on the context of the data and the field in which it is applied. For example:

  • In fields like quality control or engineering, where consistency is important, even a CV above 5-10% may be considered high.

  • In biological or social sciences, where more natural variability is expected, CV values of up to 30% might be considered normal, and higher values may still be acceptable depending on the circumstances.

  • For financial data, a high CV often reflects higher risk or volatility, so a lower CV might be preferred for stability.

The coefficient of variation is particularly useful when comparing the degree of variation between datasets with different units or scales because it provides a standardized measure of dispersion.

R has no built-in function for the coefficient of variation. However, just as we typed in mathematical expressions in the first couple of labs to convert temperatures, we can write out the calculation ourselves. Here is the formula for the coefficient of variation:

\[ \text{CV} = \frac{\text{standard deviation}(\sigma)}{\text{mean} (\mu)} \times 100 \]

(sd(titanicData$age, na.rm = TRUE)/mean(titanicData$age, na.rm = TRUE))*100
## [1] 47.27653
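If the CV is needed for several variables, the same expression can be wrapped in a small function. The sketch below is one way to do this; coefVar is a hypothetical helper name, not a built-in R function, and the ages vector is made up for illustration.

```r
# Hypothetical helper: coefficient of variation as a percentage,
# ignoring NA values
coefVar <- function(x) {
  sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE) * 100
}

# Made-up ages with one missing value: sd is 10 and mean is 30
ages <- c(20, 30, NA, 40)
coefVar(ages)   # 10 / 30 * 100, about 33.3
```

With the lab data loaded, coefVar(titanicData$age) would give the same value as the expression above.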

Interquartile Range (IQR)

The interquartile range (IQR) is a measure of statistical dispersion that represents the range within which the middle 50% of the data lies. It helps identify the spread of the central portion of the data and is useful for detecting outliers. In R, we use the IQR() function to get the interquartile range for a variable.

IQR(titanicData$age, na.rm = TRUE)
## [1] 20

We get the same value by subtracting the 1st quartile from the 3rd quartile reported by summary() (41 − 21 = 20):

summary(titanicData$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.1667 21.0000 30.0000 31.1942 41.0000 71.0000     680
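The subtraction can also be done directly in R with the quantile() function, which returns the requested quartiles. This is a minimal sketch on a made-up vector; the names ages and quartiles are assumptions for illustration.

```r
# Made-up ages with one missing value
ages <- c(10, 20, 30, NA, 40, 50)

# 1st and 3rd quartiles, ignoring the NA
quartiles <- quantile(ages, probs = c(0.25, 0.75), na.rm = TRUE)
quartiles

# 3rd quartile minus 1st quartile; unname() drops the "75%" label
unname(quartiles[2] - quartiles[1])

# Same value from the built-in function
IQR(ages, na.rm = TRUE)
```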

Confidence Intervals (CI) of the Mean

The confidence interval for an estimate tells us a range of values that is likely to contain the true value of the parameter. For example, in 95% of random samples the 95% confidence interval of the mean will contain the true value of the mean.

R does not have a simple built-in function to calculate only the confidence interval of the mean, but the function that calculates t-tests will give us this information. The function t.test() has many results in its output. By adding $conf.int after the function call we get back only the confidence interval for the mean. By default it gives us the 95% confidence interval. Note that t.test() drops NA values on its own, so we do not need na.rm = TRUE here.

t.test(titanicData$age)$conf.int
## [1] 30.04312 32.34524
## attr(,"conf.level")
## [1] 0.95

As the result above shows, the 95% confidence interval of the mean of age in the titanicData data set is from about 30.0 to 32.3. R also tells us that it used a 95% confidence level for its calculation.

To calculate confidence intervals with a different level of confidence, we can add the option conf.level to the t.test() function. For example, for a 99% confidence interval we can use the following:

t.test(titanicData$age, conf.level = 0.99)$conf.int
## [1] 29.67976 32.70861
## attr(,"conf.level")
## [1] 0.99

Notice that with a higher confidence level the interval becomes wider: to be more certain that the interval contains the true population mean, we have to allow a larger range of values.
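To make the calculation behind t.test() less of a black box, the sketch below builds a 95% confidence interval by hand as the mean plus or minus a critical t value times the standard error, and compares it with the t.test() result. The vector is made up, and the names observed, standardError, tCritical and manualCI are assumptions for illustration.

```r
# Made-up ages with two missing values
ages <- c(22, 35, NA, 41, 28, 30, NA, 19)

observed <- ages[!is.na(ages)]            # keep only the non-missing values
n <- length(observed)                     # sample size actually used

standardError <- sd(observed) / sqrt(n)   # standard error of the mean
tCritical <- qt(0.975, df = n - 1)        # two-sided 95% critical value

# Lower and upper limits: mean -/+ critical value * standard error
manualCI <- mean(observed) + c(-1, 1) * tCritical * standardError
manualCI

# Same limits from t.test(); as.numeric() drops the conf.level attribute
as.numeric(t.test(ages)$conf.int)
```

For a 99% interval, the only change in the manual version would be to use qt(0.995, df = n - 1), which gives a larger critical value and therefore a wider interval.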