Part 2: Descriptive Summaries

Aimee Schwab
June 11, 2013

When we look at data, it's important to understand what's actually happening behind the scenes.

Looking at a table or a list of observations just won't cut it! Instead we need a way to conceptualize the data…

Are there any trends?
Are there any unusual observations? Why are they unusual?
What conclusions can we draw?

Where We're Headed…

Types of variables
Summarizing categorical variables
Summarizing quantitative variables

Now's a good time to open RStudio and load the mosaic library!

library(mosaic)

Some starting definitions:

Variable: any characteristic that we observe for a subject in our study

Different types of variables will require different analysis methods. There are two major types:

Categorical variable: each observation belongs to one of a set of categories
Quantitative variable: each observation takes on numerical values

Example: Identify each of the following variables as categorical or quantitative.

Number of pets in your family.
County of residence.
Choice of car to buy.
Distance (in miles) of your daily commute.

The best type of graph to use depends on the type of variable we're interested in. Since categorical variables tend to be easier to work with, we'll start with those.

Graphs: bar charts, mosaic plots
Numerical summaries: contingency tables

Example: The Whickham study

Data on age, smoking, and mortality was collected from 1972 to 1974 in a survey of voters in Whickham, UK, to study heart disease and thyroid disease. A follow-up on those participants was conducted 20 years later.

Step 1: Load the data.

data(Whickham)

Step 2: View the data set. The head function shows us the first 6 observations. Which variables are categorical and which are quantitative?

head(Whickham)

  outcome smoker age
1   Alive    Yes  23
2   Alive    Yes  18
3    Dead    Yes  71
4   Alive     No  67
5   Alive     No  64
6   Alive    Yes  38

outcome and smoker are categorical, age is quantitative.
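One way to double-check variable types in R itself is to ask whether each column is numeric. The sketch below retypes the six rows printed above into a small data frame. The name whickham_head is made up for this illustration, and the sketch assumes the categorical columns are stored as factors.

```r
# First six rows of Whickham, copied from the head() output above.
# 'whickham_head' is a hypothetical name used only for this sketch.
whickham_head <- data.frame(
  outcome = factor(c("Alive", "Alive", "Dead", "Alive", "Alive", "Alive")),
  smoker  = factor(c("Yes", "Yes", "Yes", "No", "No", "Yes")),
  age     = c(23, 18, 71, 67, 64, 38)
)

# TRUE marks a quantitative (numeric) column, FALSE a categorical one.
is_quantitative <- sapply(whickham_head, is.numeric)
print(is_quantitative)
```

Only age comes back TRUE, which matches the classification above.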

Bar chart: each category or response for a single variable is represented with a vertical bar. The height of the bar represents the number of observations which fall into that category.

bargraph(~outcome, data=Whickham)

Each function that we use in R will follow the same formula notation. The basic syntax is:

function_name(y ~ x, data=DATASET, options=...)

The y variable is on the vertical axis of our graph, the x variable is on the horizontal axis. For a bar chart, there's no y variable, so we leave that spot blank.
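To see what R actually stores when we type a formula, here is a small sketch using only base R. The names y and x are placeholders, not variables from any of our data sets; smoker and outcome are borrowed from Whickham just as labels.

```r
# A two-sided formula: y on the left of ~, x on the right.
two_sided <- y ~ x
# A one-sided formula: the y spot is left blank, as for a bar chart.
one_sided <- ~ x
# A one-sided formula with two x variables joined by +.
two_x <- ~ smoker + outcome

# all.vars() lists the variable names a formula mentions.
vars_two_sided <- all.vars(two_sided)
vars_one_sided <- all.vars(one_sided)
vars_two_x     <- all.vars(two_x)

print(vars_two_sided)
print(vars_one_sided)
print(vars_two_x)
```

Nothing is computed until a function receives the formula together with a data set, which is why the same notation can drive graphs and numerical summaries alike.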

Functions will have options associated with them. To see all the options, type the function name into the console and hit the TAB button on your keyboard.

Back to the Whickham study:

How could you change the code in the previous example to produce a bar graph of the number of smokers in the Whickham study?

bargraph(~smoker, data=Whickham)

How can we graphically represent multiple categorical variables? We have some options:

Option 1. Separate our bar graph into groups.

bargraph(~smoker, groups=outcome, data=Whickham, auto.key=TRUE)

Option 2. Use a mosaic plot: divides a square into subportions depending on the number of observations in each category.

mosaicplot(~outcome+smoker, data=Whickham, color=TRUE)

Based on the two graphs on the previous slides, who do you think was more likely to be alive at the end of the 20-year period: smokers or nonsmokers?

Smokers were more likely than nonsmokers to be alive at the end of the study! Follow-up question: is this surprising?

We'll come back to this issue later on in the course!

Next question: how could we find out how many observations fall into each category?

To get a tally of how many observations are in each category, we can use the tally function!

tally(~outcome, data=Whickham)


Alive  Dead Total 
  945   369  1314

tally(~smoker, data=Whickham)


   No   Yes Total 
  732   582  1314

Suppose we wanted to use the tally function to split the data up by both smoking status and outcome. In R's formula notation, we can use a + to add multiple x variables.

tally(~smoker+outcome, data=Whickham)

       outcome
smoker  Alive Dead Total
  No      502  230   732
  Yes     443  139   582
  Total   945  369  1314

This is called a contingency table.

By default, tally reports observation counts in each category. What do you think the code below will report?

tally(~smoker+outcome, data=Whickham, format='proportion')

Try running it!

       outcome
smoker   Alive   Dead  Total
  No    0.3820 0.1750 0.5571
  Yes   0.3371 0.1058 0.4429
  Total 0.7192 0.2808 1.0000
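The proportions above divide every cell by the grand total. As a rough sketch of what is happening, the same numbers can be rebuilt in base R from the counts in the contingency table. The object names here are made up, and prop.table is a base R function, not part of mosaic.

```r
# Counts copied from the contingency table above (margins left out).
counts <- matrix(
  c(502, 230,
    443, 139),
  nrow = 2, byrow = TRUE,
  dimnames = list(smoker = c("No", "Yes"), outcome = c("Alive", "Dead"))
)

# Joint proportions: each cell divided by the grand total of 1314.
joint <- prop.table(counts)
print(round(joint, 4))

# Row proportions: each cell divided by its own row total, giving the
# share of nonsmokers (and of smokers) who ended up in each outcome.
by_smoker <- prop.table(counts, margin = 1)
print(round(by_smoker, 4))
```

The row proportions are the ones to use when comparing survival between groups: about 76% of smokers and 69% of nonsmokers were alive at follow-up.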

Next we'll look at ways to describe quantitative variables. As you may have guessed, working with numbers gives us many more options!

Graphs: histograms
Numerical summaries: mean, median, mode, range, standard deviation, quantiles

Example: Who watches more TV?

The Carnegie Mellon Online Learning Initiative wanted to know who watches more TV per week: men or women. A random sample of 400 adults was chosen. At the end of the week, each subject reported the total amount of time (in minutes) that he or she watched TV during that week.

data(TV2)
head(TV2)

  time gender
1  180 Female
2  150 Female
3  130 Female
4  990 Female
5  470 Female
6  260 Female

Histogram: like a bar chart, except that each bar represents an interval of values rather than a single category

xhistogram(~time, data=TV2, nint=15)

nint controls the number of intervals in the graph. There is no “best” number to use!
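To get a feel for how the number of intervals changes the picture, here is a small base R sketch using only the six viewing times printed by head(TV2). It uses hist with plot=FALSE, a base R function rather than xhistogram, and the interval boundaries on 0 to 1000 are an arbitrary choice for illustration.

```r
# The six viewing times (in minutes) shown by head(TV2) above.
tv_head <- c(180, 150, 130, 990, 470, 260)

# Same data, cut into 2 and then 4 equal-width intervals on 0 to 1000.
two_bins  <- hist(tv_head, breaks = seq(0, 1000, length.out = 3), plot = FALSE)
four_bins <- hist(tv_head, breaks = seq(0, 1000, length.out = 5), plot = FALSE)

print(two_bins$counts)   # observations per interval, 2 intervals
print(four_bins$counts)  # observations per interval, 4 intervals
```

With two intervals the large value of 990 just looks like a short second bar; with four intervals an empty bar appears between it and the rest of the data.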

What does this histogram tell you about TV watching habits? Does it answer our research question: which gender watches more TV?

This isn't enough! Can we break up this histogram by gender? Yes!

xhistogram(~time|gender, data=TV2, nint=15, auto.key=TRUE)

Now can we say anything about gender and TV watching?

Example: Do students with cellphones get less sleep?

Data was collected from 312 college students at a large state university. The data is contained in the Cellphones data set.

data(Cellphones)
head(Cellphones)

  Math Verbal Credits Year Exer Sleep Cell  Veg
1  640    470      15    1   60   7.0  yes   no
2  660    650      14    1   20   7.5  yes   no
3  550    580      15    2    0   9.0   no   no
4  560    660      16    1   30   7.0  yes   no
5  600    790      15    4   45   6.5   no some
6  560    640      16    2   75   4.5   no  yes

With a partner, use the variables Cell and Sleep to make a graph showing the relationship between owning a cellphone and sleep. Use the tally function to find out how many students don't have a cell phone. Write a short paragraph interpreting your results.


   no   yes Total 
   67   243   310

When we're formally analyzing a histogram, there are several characteristics we might look for:

Shape: if we drew a smooth curve over the histogram, what shape would we see?

Histograms typically have one of the three shapes below.

[Figure: histograms illustrating the three typical shapes (skewed left, symmetric, skewed right)]

Example: What shape would you expect the following data sets to have? Sketch a graph for each.

Annual income of Nebraskans
Height of American adults
Lifespan of a house cat

Annual income should be skewed right; height should be symmetric; lifespan should be skewed left.

Pattern: Does the data cluster together, or is there a gap? Is one observation significantly different from the rest?

Outlier: an extreme observation that falls far below or far above the rest of the observed data

[Figure: histogram with an outlier]

Modes: how is the data concentrated?

[Figure: histograms illustrating modes]

Example: Revisit your description of the histograms you generated for cellphone use and sleep, using the definitions for shape, pattern, and modes.

Graphs don't necessarily tell us all that we need to know about a data set. Typically, there are three questions that a numerical summary can answer.

Where is the “center” of our data?
How “spread out” is our data?
Are there outliers in our data?

Q1. Where is the “center” of our data?

There are two measures of center commonly used in statistics. These should already be familiar to you!

Mean: average of all observations \[ \overline{x}=\frac{1}{n}\sum_{i=1}^n x_{i} \]
Median: the “middle” observation when all data points are ordered from smallest to largest

Each of these is easy to find by hand. But, they're even easier to find in R.

mean(~time, data=TV2)

[1] 598.5

median(~time, data=TV2)

[1] 470
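To connect the definitions to the R functions, here is a by-hand sketch on just the six viewing times printed by head(TV2). These are only the first six rows, so the results will not match the full-data answers above.

```r
# The six viewing times (in minutes) shown by head(TV2).
tv_head <- c(180, 150, 130, 990, 470, 260)
n <- length(tv_head)

# Mean: add everything up and divide by n.
mean_by_hand <- sum(tv_head) / n

# Median: sort, then take the middle. n is even here, so we average
# the two middle observations.
sorted <- sort(tv_head)
median_by_hand <- (sorted[n / 2] + sorted[n / 2 + 1]) / 2

print(c(by_hand = mean_by_hand, built_in = mean(tv_head)))
print(c(by_hand = median_by_hand, built_in = median(tv_head)))
```

Even in this tiny sample the mean (about 363) sits well above the median (220), because the single large value of 990 pulls the mean upward.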

The mean and median are rarely the same! Let's compare the mean and median for TV watching times.

xhistogram(~time, data=TV2, v=c(598.55, 470))

Hint: c(., .) combines two numbers into one vector, and the v option draws a vertical line at each of them.

Example: For a skewed right distribution (like TV watching times), the mean is greater than the median. How do you think the mean and median compare in a…

Skewed left distribution?

Mean is less than the median.

Symmetric distribution?

Mean and the median are about the same.

Example: Use the Cellphones data set to find the mean and median number of hours slept that night. Based on how the mean and median compare, is Sleep symmetric, skewed right, or skewed left? Make a histogram of Sleep alone to confirm.

[1] 7.194

[1] 7

The mean and the median are almost identical, so hours of sleep is roughly symmetric.

Example: Only one of the two, the mean or the median, is “resistant to outliers” - relatively unaffected by an extreme observation. Which do you think it is? Explain your choice.

The median is “resistant to outliers”.
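A quick experiment makes the point. The sketch below takes the six viewing times from head(TV2) and pretends that 990 was mistyped as 9900. The typo is invented purely for illustration.

```r
# The six viewing times (in minutes) shown by head(TV2).
tv_head <- c(180, 150, 130, 990, 470, 260)

# Hypothetical data-entry error: 990 recorded as 9900.
tv_typo <- replace(tv_head, tv_head == 990, 9900)

before <- c(mean = mean(tv_head), median = median(tv_head))
after  <- c(mean = mean(tv_typo), median = median(tv_typo))

print(before)
print(after)
```

The mean jumps from about 363 to about 1848, while the median stays at 220.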

Q2. How “spread out” is our data?

Measures of center are helpful, but they don't tell us much about the shape of our data. The shape of our data can affect our decisions!

Example: The graphs below show annual income distributions of music teachers in two countries, Denmark (in red) and the United States (in blue). Both countries have an annual average income of $40,000 (adjusted to US dollars). Based on the graphs alone, where would you rather be a music teacher and why?

[Figure: annual income distributions of music teachers in Denmark (red) and the United States (blue)]

Both distributions show variability.

Variability:

Variability is a natural part of life! Just like no two people are exactly the same, no two data points are exactly the same.
Variability can be measured.

Standard deviation: a measure of variability (larger $ s $ means more variability!)

Standard deviation tells us the average “variation” from the mean.
We can calculate the standard deviation by hand. Let $ n $ represent the sample size, and $ s $ represent the standard deviation.\[ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \overline{x})^2} \]
In words, standard deviation is the square root of the (roughly) average squared distance from the mean.
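The formula can be followed step by step in R. This is a minimal sketch on a made-up set of eight values; the data are hypothetical, and the names n, x_bar, and s_by_hand simply mirror the symbols in the formula.

```r
# Hypothetical data, chosen so the arithmetic is easy to check.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

n     <- length(x)          # sample size
x_bar <- mean(x)            # sample mean (5 for these values)

squared_dev <- (x - x_bar)^2                     # squared distances from the mean
s_by_hand   <- sqrt(sum(squared_dev) / (n - 1))  # divide by n - 1, then square root

print(s_by_hand)
print(sd(x))   # R's built-in standard deviation for comparison
```

Both lines print the same value, since sd also divides by n - 1.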

Important Properties of $ s $:

The larger the standard deviation, the more dispersed the data is from the mean.
$ s $ can never be negative
$ s=0 $ only if there is no variation in the data
$ s $ is highly affected by outliers

Example: Look at the income distributions for music teachers in Denmark and the US. Based on the graphs alone, which country has a higher standard deviation? How do you know?

[Figure: the same income distributions for Denmark (red) and the United States (blue)]

The United States has a higher standard deviation, since it has more observations that are “far” from $40,000.

Example: Exams are typically graded on a scale from 0 to 100. Assume that the mean score on an exam is 80. Which of these values is most likely to be the standard deviation? Why? What can we say about the shape of the exam scores?

a. $ s=0 $ b. $ s=10 $ c. $ s=50 $ d. $ s=-5 $

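10 is the most likely standard deviation. A standard deviation can never be negative, $ s=0 $ would mean every student scored exactly 80, and $ s=50 $ is too large for scores that must stay between 0 and 100 with a mean of 80.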
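To see why 50 is out of reach, consider a class whose scores are as spread out as a 0 to 100 scale allows while still averaging 80: most students at 100 and the rest at 0. The class of 100 students below is invented for illustration.

```r
# Hypothetical class of 100 students: 80 perfect scores and 20 zeros.
extreme_scores <- c(rep(100, 80), rep(0, 20))

extreme_mean <- mean(extreme_scores)
extreme_sd   <- sd(extreme_scores)

print(extreme_mean)
print(extreme_sd)
```

Even with every score pushed to one end of the scale or the other, the standard deviation is only a little over 40, so a value of 50 cannot happen when the mean is 80.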

With R, the standard deviation is easy to calculate.

Example: For the TV2 data set, find the mean and standard deviation of TV watching times for both men and women. Which gender watches more TV on average? Which gender is more variable?

favstats(~time|gender, data=TV2)

       min  Q1 median  Q3  max  mean    sd   n missing
Female 120 260    400 650 2650 521.4 410.1 191       0
Male   200 330    530 820 2700 669.1 478.6 209       0

The favstats function will be your best friend this semester! We can use this to calculate lots of numerical summaries. So far we've mentioned:

median
mean
standard deviation (sd)
n (sample size in each category)
missing (how many data points have missing information)

Example: Use the favstats function to find the standard deviation for hours of sleep in the Cellphones data set. Compare the standard deviation and mean of hours of sleep for students who use a cell phone and students who do not use a cell phone. Do you think there's a significant difference in sleep patterns for students who do/don't use cell phones?

To remind yourself, make a histogram!

xhistogram(~Sleep|Cell, data=Cellphones, auto.key=TRUE)

favstats(~Sleep|Cell, data=Cellphones)

Q3. Are there outliers in our data?

In statistics, we are sometimes interested in how a certain observation falls in a data set relative to all other observations.

Percentile: the pth percentile is a value such that p% of the data falls at or below that value

Suppose your total score of 28 (out of 36) on the ACT college entrance exam falls at the 90th percentile. Then 90% of the students who took the exam at the same time you did scored 28 or lower. Only 10% of the scores were higher than 28.
How could we use a percentile to decide whether an observation is an outlier?
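The "at or below" idea is easy to check in R. The ten scores below are invented for illustration and are not real ACT data.

```r
# Hypothetical ACT scores for ten test-takers.
scores <- c(18, 20, 21, 22, 24, 25, 26, 27, 28, 31)

# scores <= 28 is TRUE or FALSE for each student; the mean of those
# TRUE/FALSE values is the share of scores at or below 28.
share_at_or_below <- mean(scores <= 28)
print(share_at_or_below)
```

Nine of the ten scores are 28 or lower, so in this made-up group a score of 28 sits at the 90th percentile.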

Quartiles: the 25th (Q1), 50th (Q2), and 75th (Q3) percentiles

Quartiles split the distribution of the data into four parts.
We've already used Q2 today…
Quartiles can be found using favstats!
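Outside of favstats, base R's quantile function returns any percentile we ask for. Here is a small sketch on the six viewing times from head(TV2). Note that R offers several slightly different rules for interpolating between observations; this uses the default, and other software may give slightly different quartiles for small samples.

```r
# The six viewing times (in minutes) shown by head(TV2).
tv_head <- c(180, 150, 130, 990, 470, 260)

# Ask for the 25th, 50th, and 75th percentiles (Q1, Q2, Q3).
quartiles <- quantile(tv_head, probs = c(0.25, 0.50, 0.75))
print(quartiles)

# Q2 is the same number that median() reports.
print(median(tv_head))
```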

Example: For the TV2 data set, identify the quartiles in the output for men and women. Identify the 0th percentile and 100th percentile.

favstats(~time|gender, data=TV2)

       min  Q1 median  Q3  max  mean    sd   n missing
Female 120 260    400 650 2650 521.4 410.1 191       0
Male   200 330    530 820 2700 669.1 478.6 209       0

Interquartile Range: the difference between Q1 and Q3 (IQR = Q3-Q1)

Why would we want to know how spread out the middle 50% of our data is?

Example: Use the favstats output to find the IQR for the TV2 data set.

favstats(~time|gender, data=TV2)

       min  Q1 median  Q3  max  mean    sd   n missing
Female 120 260    400 650 2650 521.4 410.1 191       0
Male   200 330    530 820 2700 669.1 478.6 209       0

Boxplot: a convenient way to describe the variability of a data set in a single graph

A boxplot is an incredibly useful plot! Using the bwplot function, our graph shows:
- A black dot at the median value
- A blue box surrounding the middle 50% (Q1 to Q3)
- Dotted lines extend from the box to the most extreme observations within 1.5*IQR of Q1 and Q3
- Points more than 1.5*IQR below Q1 or above Q3 are shown as dots

Why 1.5 times IQR? This is the IQR Criterion!

A point that falls more than 1.5*IQR below Q1 or above Q3 is usually considered an outlier.
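As a worked example of the criterion, the sketch below plugs in Q1, Q3, and the maximum from the favstats output for TV2. The cutoffs are called "fences" here; that is a common name for them, not a term used elsewhere in these slides.

```r
# Q1, Q3, and max copied from the favstats output for TV2.
q1       <- c(Female = 260,  Male = 330)
q3       <- c(Female = 650,  Male = 820)
max_time <- c(Female = 2650, Male = 2700)

iqr <- q3 - q1                  # spread of the middle 50%

lower_fence <- q1 - 1.5 * iqr   # anything below this is flagged
upper_fence <- q3 + 1.5 * iqr   # anything above this is flagged

print(iqr)
print(upper_fence)

# Is the largest viewing time in each group an outlier?
max_is_outlier <- max_time > upper_fence
print(max_is_outlier)
```

The upper fences come out to 1235 minutes for women and 1555 minutes for men, so the largest viewing time in each group counts as an outlier. The lower fences are negative, so no viewing time can be an outlier on the low side.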

Example: Compare the boxplots for TV viewing habits of men and women. Write a short paragraph summarizing the patterns the boxplot tells you.

bwplot(~time|gender, data=TV2)

Example: Repeat the previous exercise using the Cellphones data set to compare sleeping habits and cell phone use.

On a sheet of paper, please complete the following:

Big Picture: Write in your own words the most important take-away message of this section.
I understand…: What do you feel most confident with in this section?
Let's do more…: What would you like to see more examples of?