One Quantitative Variable

Harold Nelson

April 2, 2020

Setup

if (!require(tidyverse)) install.packages('tidyverse')
library(tidyverse)
if (!require(openintro)) install.packages('openintro')
library(openintro)

Descriptive Statistics for One Quantitative Variable

This set of exercises is not just for reading. You should have a script window in RStudio Cloud open in another browser tab so that you can work the problems. Be sure to do each problem before you look at my results. You can copy the code from what follows directly into RStudio.

There are three basic questions you need to answer to explore a single quantitative variable.

  1. Location - basically, where is the data located?
  2. Variation (scatter, spread, etc.) - how spread out is the data?
  3. Shape - Beyond location and variation, what should be said about the data?

Let’s create some artificial data to explore these ideas.

x = rnorm(1000)

That created a vector x with 1,000 numbers drawn from a normal distribution with mean 0 and standard deviation 1. Note that nothing seems to have happened: in R, creating something doesn’t automatically display it. But we can do many things to explore x.

Look at some measures of location and variation to see if they are close to what we would expect from the way we created x.

mean(x)
## [1] 0.02471982
median(x)
## [1] 0.0195678
sd(x)
## [1] 1.040631
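
How close is "close"? A rough yardstick is the standard error of the mean, sd / sqrt(n), which is about 0.032 for 1,000 draws with standard deviation 1. The mean of 0.0247 reported above is less than one standard error from 0. The sketch below assumes a fixed seed so the numbers are reproducible; the seed value 123 and the name x_demo are arbitrary choices for illustration, and your own x will give slightly different values.

```r
# Reproducible version of the experiment; the seed value is an arbitrary assumption.
set.seed(123)
n <- 1000
x_demo <- rnorm(n)

# Standard error of the mean for n draws with standard deviation 1.
se_mean <- 1 / sqrt(n)

# How many standard errors is the sample mean away from the true mean of 0?
z <- (mean(x_demo) - 0) / se_mean

print(se_mean)
print(z)
```

A value of z between roughly -2 and 2 is what you would usually expect to see.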

Following Changes

Why do we call something a measure of location or variation of a collection of numbers? In a practical sense, we want the measure to reflect changes that we make to the set of numbers. If we move the set of numbers to the right, we want the measure of location to move to the right. If we spread the numbers out more, we want the measure of variation to increase.
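
To make the two kinds of change concrete, here is a minimal sketch on a vector small enough to check by hand. The names v, v_shifted and v_spread are made up for this illustration. Shifting adds 10 to every value, so a good measure of location should go from 6 to 16. Spreading doubles every value's distance from the mean, so a good measure of variation should double.

```r
# A tiny vector so the arithmetic can be checked by hand.
v <- c(2, 4, 6, 8, 10)

# Move every value 10 units to the right.
v_shifted <- v + 10

# Double every value's distance from the mean; the mean itself stays put.
v_spread <- mean(v) + 2 * (v - mean(v))

# Location before and after the shift.
print(c(mean(v), mean(v_shifted)))

# Variation before and after the spread.
print(c(sd(v), sd(v_spread)))
```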

Now let’s change x and see what happens. First, add 100 to each number in the vector.

xplus100 = x + 100

Exercise

What would you want to see in the mean of this new variable? After you think about it, apply the R function mean() to both variables. Then advance to the next slide and see.

Answer

mean(x)
## [1] 0.02471982
mean(xplus100)
## [1] 100.0247

The Median

Now look at the median of both variables. Does it do what a good measure of location should do?

Answer

median(x)
## [1] 0.0195678
median(xplus100)
## [1] 100.0196

Measures of Variation - The Standard Deviation

Look at the standard deviations of both variables. What do you see? Is this what you should see?

Answer

sd(x)
## [1] 1.040631
sd(xplus100)
## [1] 1.040631

There is no change. This does make sense because the entire set of numbers has been uniformly displaced. There is no change in variation, just a change in location.
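
The reason can be checked directly. The standard deviation is built from each value's deviation from the mean, and adding a constant shifts both the values and the mean by the same amount, so the deviations do not change. The sketch below is only an illustration; the seed and the names z and z_shifted are assumptions, not part of the exercise.

```r
# Illustrative data; the seed and the names are arbitrary assumptions.
set.seed(1)
z <- rnorm(10)
z_shifted <- z + 100

# sd() is built from deviations from the mean, and the shift cancels in them.
dev_before <- z - mean(z)
dev_after  <- z_shifted - mean(z_shifted)

# The two sets of deviations agree up to floating-point rounding.
print(all.equal(dev_before, dev_after))

# Rebuilding the standard deviation by hand from the deviations.
sd_by_hand <- sqrt(sum(dev_after^2) / (length(z) - 1))
print(c(sd_by_hand, sd(z)))
```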

Measures of Variation - The IQR

Does the IQR do the same thing?

Answer

IQR(x)
## [1] 1.392278
IQR(xplus100)
## [1] 1.392278

Yes, the IQR remains the same.

Measures of Variation - The Range

Does the range also show no change?

Answer

max(x) - min(x)
## [1] 6.678957
max(xplus100) - min(xplus100)
## [1] 6.678957

Yes.
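
Note that R's range() function returns the minimum and the maximum rather than a single number, which is why the code above subtracts min() from max(). An equivalent shorthand is diff(range(...)). The sketch below uses its own illustrative data; the seed and the names w and w_plus100 are assumptions.

```r
# Illustrative data; the seed and names are arbitrary assumptions.
set.seed(2)
w <- rnorm(1000)
w_plus100 <- w + 100

# range() returns c(min, max); diff() subtracts the first from the second.
spread_w <- diff(range(w))
spread_w_plus100 <- diff(range(w_plus100))

print(spread_w)
print(spread_w_plus100)

# Same result as the max() - min() form.
print(all.equal(spread_w, max(w) - min(w)))
```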

An Exercise for you

I’ll leave it as an exercise for you to describe the relationships between the measures of location and variation for x and those for xtimes100. The values of xtimes100 should be the values of x multiplied by 100.
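
If you want a starting point, one possible setup is sketched below. It collects the measures used so far into a single named vector so the two variables can be compared side by side. The helper describe() and the stand-in x_demo are hypothetical names introduced here. With your own data, use the x you created earlier instead.

```r
# Hypothetical helper that gathers the measures used so far into one named vector.
describe <- function(v) {
  c(mean = mean(v), median = median(v), sd = sd(v),
    IQR = IQR(v), range = max(v) - min(v))
}

# Stand-in for the x created earlier; the seed is an arbitrary assumption.
set.seed(3)
x_demo <- rnorm(1000)
xtimes100 <- x_demo * 100

# One column per variable, one row per measure.
comparison <- cbind(x = describe(x_demo), xtimes100 = describe(xtimes100))
print(comparison)
```

Describing the pattern in the resulting table is still up to you.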

Shape - Graphical Displays

The standard graphical displays for quantitative variables are the histogram and the boxplot. To make the boxplot comparable with the histogram, you may want to lay it out horizontally; by default, R draws it vertically. The function summary() prints the key numerical results displayed in the boxplot, along with the mean.

hist(x)

boxplot(x,horizontal=TRUE)

summary(x)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -3.22145 -0.67695  0.01957  0.02472  0.71533  3.45750

These graphical displays of x show a very conventional symmetric distribution with a single central peak.
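
The numbers behind the boxplot can also be retrieved without drawing it. The sketch below uses boxplot.stats(), whose stats component holds the lower whisker end, lower hinge, median, upper hinge and upper whisker end. Two caveats: the hinges can differ slightly from the 1st Qu. and 3rd Qu. values printed by summary(), and each whisker stops at the most extreme point within 1.5 box lengths of the box, so the whisker ends need not equal Min. and Max. The data here are an illustrative stand-in with an arbitrary seed.

```r
# Stand-in for x; the seed is an arbitrary assumption.
set.seed(4)
x_demo <- rnorm(1000)

bs <- boxplot.stats(x_demo)

# Lower whisker end, lower hinge, median, upper hinge, upper whisker end.
print(bs$stats)

# Points drawn individually beyond the whiskers (possibly none).
print(bs$out)

# For comparison with the quartiles used by summary().
print(summary(x_demo))
```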

Exercise - Another Example

It is useful to look at some other examples to see a few possibilities. Let’s generate some uniformly distributed numbers between 5 and 10 and create the graphical displays.

flatones = runif(1000,min=5,max=10)
summary(flatones)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.004   6.191   7.489   7.507   8.792   9.999
boxplot(flatones,horizontal=TRUE)

hist(flatones)

Do you see what you expected to see? Is this distribution symmetric? Is there a noticeable peak? Are there outliers?
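
One way to pin down what to expect is to compute the theoretical values for a uniform distribution on the interval from a to b: the mean is (a + b) / 2, the standard deviation is (b - a) / sqrt(12), and the quartiles come from qunif(). The variable names below are made up for this sketch. Compare the results with the summary() output above.

```r
# Endpoints of the uniform distribution used to create flatones.
a <- 5
b <- 10

# Theoretical quartiles, mean and standard deviation.
theory_quartiles <- qunif(c(0.25, 0.50, 0.75), min = a, max = b)
theory_mean <- (a + b) / 2
theory_sd <- (b - a) / sqrt(12)

print(theory_quartiles)
print(theory_mean)
print(theory_sd)
```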

Another Example

Let’s try some values drawn from a Chi-squared distribution with 10 degrees of freedom. Don’t worry about what this means. Just concentrate on the shape questions.

cq = rchisq(1000,df=10)
hist(cq)

boxplot(cq,horizontal=TRUE)

summary(cq)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.673   6.535   9.297   9.845  12.601  28.238
IQR(cq)
## [1] 6.066829

Is this distribution symmetric? Is there a noticeable peak? Are there outliers?

Answer

The distribution is asymmetric with a clear right skew. There is a clear peak between 9 and 10. By the 1.5 × IQR criterion, any value above 12.601 + 1.5 × 6.067 ≈ 21.70 is an outlier, so the maximum value of 28.238 is an outlier.
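
The IQR criterion flags any value more than 1.5 times the IQR below the first quartile or above the third quartile. A minimal sketch of that rule follows; the function name iqr_outliers and the small test vector are hypothetical, made up for illustration.

```r
# Hypothetical helper: returns the values of v lying outside the 1.5 * IQR fences.
iqr_outliers <- function(v, k = 1.5) {
  q <- quantile(v, c(0.25, 0.75), names = FALSE)
  fence_low  <- q[1] - k * (q[2] - q[1])
  fence_high <- q[2] + k * (q[2] - q[1])
  v[v < fence_low | v > fence_high]
}

# Small made-up vector with one obviously extreme value.
toy <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)
print(iqr_outliers(toy))
```

Calling iqr_outliers(cq) would list every flagged value in cq, not just the maximum.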

Real Data

We can use the cdc dataframe for some examples. Load the dataframe and run str() on it to identify some quantitative variables. Just click on cdc.Rdata in the Files pane to load it.

Answer

load("/cloud/project/cdc.Rdata")
str(cdc)
## 'data.frame':    20000 obs. of  9 variables:
##  $ genhlth : Factor w/ 5 levels "excellent","very good",..: 3 3 3 3 2 2 2 2 3 3 ...
##  $ exerany : num  0 0 1 1 0 1 1 0 0 1 ...
##  $ hlthplan: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ smoke100: num  0 1 1 0 0 0 0 0 1 0 ...
##  $ height  : num  70 64 60 66 61 64 71 67 65 70 ...
##  $ weight  : int  175 125 105 132 150 114 194 170 150 180 ...
##  $ wtdesire: int  175 115 105 124 130 114 185 160 130 170 ...
##  $ age     : int  77 33 49 42 55 55 31 45 27 44 ...
##  $ gender  : Factor w/ 2 levels "m","f": 1 2 2 2 2 2 1 1 2 1 ...
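
Keep in mind that str() reports how each column is stored, not what kind of variable it is. A column stored as num can still be a 0/1 indicator rather than a measurement. The sketch below rebuilds the first ten rows of four columns from the str() output above as a small data frame and screens them. The name cdc_head and the screening rule are assumptions for illustration, not a general test for quantitative variables.

```r
# First ten rows of four columns, copied from the str(cdc) output.
cdc_head <- data.frame(
  exerany = c(0, 0, 1, 1, 0, 1, 1, 0, 0, 1),
  height  = c(70, 64, 60, 66, 61, 64, 71, 67, 65, 70),
  weight  = c(175L, 125L, 105L, 132L, 150L, 114L, 194L, 170L, 150L, 180L),
  gender  = factor(c("m", "f", "f", "f", "f", "f", "m", "m", "f", "m"),
                   levels = c("m", "f"))
)

# Which columns are stored as numbers?
is_num <- sapply(cdc_head, is.numeric)
print(names(cdc_head)[is_num])

# A numeric column holding only 0 and 1 is an indicator, not a measurement.
is_indicator <- sapply(cdc_head,
                       function(col) is.numeric(col) && all(col %in% c(0, 1)))
print(names(cdc_head)[is_num & !is_indicator])
```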

Examine height

This variable is clearly quantitative, measured in inches. Use your exploratory tools on it. What should you say about it? Remember that since it’s inside the dataframe cdc, you must refer to it as cdc$height.
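
The dollar sign is one of several equivalent ways to reach a column in base R. The sketch below shows three of them on a small stand-in data frame built from the first ten height values in the str() output. The name cdc_head is hypothetical; with the real data you would write cdc in its place.

```r
# Hypothetical stand-in built from the first ten height values shown by str(cdc).
cdc_head <- data.frame(height = c(70, 64, 60, 66, 61, 64, 71, 67, 65, 70))

m1 <- mean(cdc_head$height)          # dollar-sign access
m2 <- mean(cdc_head[["height"]])     # access by column name given as a string
m3 <- with(cdc_head, mean(height))   # evaluate the expression inside the data frame

print(c(m1, m2, m3))
```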

Answer

summary(cdc$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00
sd(cdc$height)
## [1] 4.125954
IQR(cdc$height)
## [1] 6
hist(cdc$height)

boxplot(cdc$height, horizontal = TRUE)

The distribution is essentially symmetric with a center of 67 inches. There is a clear outlier at 93 inches, which is likely invalid.

Exercise

Examine age. Describe what you see.

Answer

summary(cdc$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   31.00   43.00   45.07   57.00   99.00
sd(cdc$age)
## [1] 17.19269
IQR(cdc$age)
## [1] 26
hist(cdc$age)

boxplot(cdc$age, horizontal = TRUE)

There is a single peak between 40 and 50. The distribution is skewed to the right. There are outliers on the right.