Part 2: Descriptive Summaries

When we look at data, it's important to understand what's actually happening behind the scenes.

Looking at a table or a list of observations just won't cut it! Instead we need a way to conceptualize the data…

Are there any trends?
Are there any unusual observations? Why are they unusual?
What conclusions can we draw?

Where We're Headed…

Types of variables
Summarizing categorical variables
Summarizing quantitative variables

Now's a good time to open RStudio and load the mosaic library!

library(mosaic)

Some starting definitions:

Variable:

Different types of variables will require different analysis methods. There are two major types:

Categorical variable:
Quantitative variable:

Example: Identify each of the following variables as categorical or quantitative.

Number of pets in your family.
County of residence.
Choice of car to buy.
Distance (in miles) of your daily commute.

Categorical Variables

The best type of graph to use depends on the type of variable we're interested in. Since categorical variables tend to be easier to work with, we'll start with those.

Graphs: bar charts, mosaic plots
Numerical summaries: contingency tables

Example: The Whickham study

Data on age, smoking, and mortality were collected from 1972 to 1974 in a survey of voters in Whickham, UK, to study heart disease and thyroid disease. A follow-up on those participants was conducted 20 years later.

Step 1: Load the data.

data(Whickham)

Step 2: View the data set. The head function shows us the first 6 observations.

Which variables are categorical and which are quantitative?

head(Whickham)

##   outcome smoker age
## 1   Alive    Yes  23
## 2   Alive    Yes  18
## 3    Dead    Yes  71
## 4   Alive     No  67
## 5   Alive     No  64
## 6   Alive    Yes  38
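One way to double-check your answer is to ask R how it has stored each column. The sketch below rebuilds only the six rows printed above as a small data frame, so that it runs without mosaic; the name whickham_head is made up for this illustration. Columns stored as factors are usually categorical and numeric columns are usually quantitative, but that is only a rule of thumb: a numeric code such as a zip code would still be categorical.

```r
# First six rows of Whickham, typed in by hand from the head() output above.
whickham_head <- data.frame(
  outcome = factor(c("Alive", "Alive", "Dead", "Alive", "Alive", "Alive")),
  smoker  = factor(c("Yes", "Yes", "Yes", "No", "No", "Yes")),
  age     = c(23, 18, 71, 67, 64, 38)
)

# Report the storage class of each column.
col_classes <- sapply(whickham_head, class)
print(col_classes)

# List the numeric columns (candidates for quantitative variables).
is_numeric_col <- sapply(whickham_head, is.numeric)
print(names(whickham_head)[is_numeric_col])
```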

Bar chart:

bargraph(~outcome, data = Whickham)

How could you change the code in the previous example to produce a bar graph of the number of smokers in the Whickham study?

How can we graphically represent multiple categorical variables? We have some options:

Option 1. Separate our bar graph into groups.

bargraph(~smoker, groups = outcome, data = Whickham, auto.key = TRUE)

Option 2. Use a mosaic plot, which divides a square into regions whose areas are proportional to the number of observations in each category.

mosaicplot(~outcome + smoker, data = Whickham, color = TRUE)

Example: Based on the two graphs on the previous slides, who do you think was more likely to be alive at the end of the 20-year period: smokers or nonsmokers?

We'll come back to this issue later on in the course!

Next question: how could we find out how many observations fall into each category? We can use the tally function!

tally(~outcome, data = Whickham)

## 
## Alive  Dead Total 
##   945   369  1314

tally(~smoker, data = Whickham)

## 
##    No   Yes Total 
##   732   582  1314

Suppose we wanted to use the tally function to split the data up by both smoking status and outcome. In R's formula notation, we can use a + to add multiple x variables.

tally(~smoker + outcome, data = Whickham)

##        outcome
## smoker  Alive Dead Total
##   No      502  230   732
##   Yes     443  139   582
##   Total   945  369  1314

This is called a contingency table.
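The same table can be rebuilt in base R from the four cell counts, which is a handy check on the formula notation. This is a sketch rather than part of the mosaic workflow: the data frame name whickham_counts and its column n are made up for the illustration, and xtabs and addmargins are base R functions, not mosaic ones.

```r
# Counts copied from the tally() output above: one row per smoker/outcome cell.
whickham_counts <- data.frame(
  smoker  = c("No", "No", "Yes", "Yes"),
  outcome = c("Alive", "Dead", "Alive", "Dead"),
  n       = c(502, 230, 443, 139)
)

# The left side of ~ holds the counts; smoker + outcome crosses the two variables.
tab <- xtabs(n ~ smoker + outcome, data = whickham_counts)

# Add row and column totals (base R labels them "Sum" rather than "Total").
tab_with_totals <- addmargins(tab)
print(tab_with_totals)
```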

By default, tally reports observation counts in each category.

Example: What do you think the code below will report?

tally(~smoker + outcome, data = Whickham, format = "proportion")

Try running it!
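Once you have run it, you can check where the proportions come from: each one is a cell count divided by the grand total of 1314. The base R sketch below assumes the counts from the contingency table above; the object names are made up, and prop.table is a base R function used here in place of tally.

```r
# Cell counts copied from the contingency table above.
counts <- matrix(
  c(502, 230,
    443, 139),
  nrow = 2, byrow = TRUE,
  dimnames = list(smoker = c("No", "Yes"), outcome = c("Alive", "Dead"))
)

# Overall proportions: every cell divided by the grand total.
overall <- prop.table(counts)
print(round(overall, 3))

# Row proportions: every cell divided by its row total, so each row sums to 1.
by_smoker <- prop.table(counts, margin = 1)
print(round(by_smoker, 3))
```

The row proportions are the ones to compare when asking whether smokers or nonsmokers were more likely to be alive.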

Quantitative Variables

Next we'll look at ways to describe quantitative variables. As you may have guessed, working with numbers gives us many more options!

Graphs: histograms
Numerical summaries: mean, median, mode, range, standard deviation, quantiles

Example: Who watches more TV?

The Carnegie Mellon Online Learning Initiative wanted to know who watches more TV per week: men or women. A random sample of 400 adults was chosen. At the end of the week, each subject reported the total amount of time (in minutes) that he or she watched TV during that week.

data(TV2)
head(TV2)

##   time gender
## 1  180 Female
## 2  150 Female
## 3  130 Female
## 4  990 Female
## 5  470 Female
## 6  260 Female

Histogram:

xhistogram(~time, data = TV2, nint = 15)

nint controls the number of intervals in the graph. There is no “best” number to use!
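To see why the number of intervals matters, the sketch below bins the six times printed by head(TV2) in two different ways. It uses base R's hist with plot = FALSE so that only the counts are computed; the vector name times and the break points are choices made for this illustration, not part of the original analysis.

```r
# The six viewing times (in minutes) shown in the head(TV2) output above.
times <- c(180, 150, 130, 990, 470, 260)

# Four intervals, each 250 minutes wide.
narrow <- hist(times, breaks = seq(0, 1000, by = 250), plot = FALSE)
print(narrow$counts)

# Two intervals, each 500 minutes wide.
wide <- hist(times, breaks = seq(0, 1000, by = 500), plot = FALSE)
print(wide$counts)
```

With the narrower intervals, the empty stretch between 470 and 990 minutes shows up as an interval with no observations; with the wider intervals it disappears.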

Example: What does this histogram tell you about TV watching habits? Does it answer our research question: which gender watches more TV?

xhistogram(~time | gender, data = TV2, nint = 15, auto.key = TRUE)

Example: Now can we say anything about gender and TV watching?

Example: Do students with cellphones get less sleep?

Data was collected from 312 college students at a large state university. The data is contained in the Cellphones data set.

data(Cellphones)
head(Cellphones)

##   Math Verbal Credits Year Exer Sleep Cell  Veg
## 1  640    470      15    1   60   7.0  yes   no
## 2  660    650      14    1   20   7.5  yes   no
## 3  550    580      15    2    0   9.0   no   no
## 4  560    660      16    1   30   7.0  yes   no
## 5  600    790      15    4   45   6.5   no some
## 6  560    640      16    2   75   4.5   no  yes

With a partner, use the variables Cell and Sleep to make a graph showing the relationship between owning a cellphone and sleep. Use the tally function to find out how many students don't have a cellphone. Write a short paragraph interpreting your results.

## 
##    no   yes Total 
##    67   243   310

When we're formally analyzing a histogram, there are several characteristics we might look for:

Shape: Histograms typically have one of the three shapes below.

[Figure: example histograms of the three shapes]

Example: What shape would you expect the following data sets to have? Sketch a graph for each.

Pattern: Does the data cluster together, or is there a gap? Is one observation significantly different from the rest?

Outlier:

[Figure: histogram illustrating an outlier]

Modes: how is the data concentrated?

[Figure: histograms illustrating modes]

Example: Revisit your description of the histograms you generated for cellphone use and sleep, using the definitions for shape, pattern, and modes.

Graphs don't necessarily tell us all that we need to know about a data set. Typically, there are three questions that a numerical summary can answer.

Q1. Where is the “center” of our data?

There are two measures of center commonly used in statistics. These should already be familiar to you!

Mean:
Median:

Each of these is easy to find by hand, but they're even easier to find in R.

mean(~time, data = TV2)

## [1] 598.5

median(~time, data = TV2)

## [1] 470
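To see what these two functions are doing, here is a by-hand sketch on the six times printed by head(TV2). It uses base R only, and because it works with six rows rather than all 400, its results will not match the output above.

```r
# The six viewing times (in minutes) shown in the head(TV2) output earlier.
times <- c(180, 150, 130, 990, 470, 260)

# Mean: add everything up and divide by the number of observations.
mean_by_hand <- sum(times) / length(times)

# Median: sort, then average the two middle values (n = 6 is even).
sorted <- sort(times)
median_by_hand <- (sorted[3] + sorted[4]) / 2

print(c(mean = mean_by_hand, median = median_by_hand))
```

Even in this tiny sample, the single large value (990) pulls the mean well above the median.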

The mean and median are rarely the same! Let's compare the mean and median for TV watching times.

xhistogram(~time, data = TV2, v = c(598.55, 470))

## c(., .) combines two numbers into one vector; v = draws a vertical line at each of them.

Example: For a skewed right distribution (like TV watching times), the mean is greater than the median. How do you think the mean and median compare in a…

Skewed left distribution?
Symmetric distribution?

Example: Use the Cellphones data set to find the mean and median number of hours slept. Based on how the mean and median compare, is Sleep symmetric, skewed right, or skewed left? Make a histogram of Sleep alone to confirm.

## [1] 7.194

## [1] 7

Example: Only the mean or the median is “resistant to outliers” - relatively unaffected by an extreme observation. Which do you think it is? Explain your choice.

Q2. How “spread out” is our data?

Measures of center are helpful, but they don't tell us much about the shape of our data. The shape of our data can affect our decisions!

Example: The graphs below show annual income distributions of music teachers in two countries, Denmark (in red) and the United States (in blue). In both countries, the average annual income is $40,000 (adjusted to US dollars). Based on the graphs alone, where would you rather be a music teacher and why?

[Figure: income distributions of music teachers in Denmark (red) and the United States (blue)]

Both distributions show variability.

Variability:

Variability is a natural part of life! Just like no two people are exactly the same, no two data points are exactly the same.
Variability can be measured.

Standard deviation:

Standard deviation tells us roughly how far a typical observation falls from the mean.
We can calculate the standard deviation by hand. Let $ n $ represent the sample size, $ x_i $ the $ i $th observation, $ \overline{x} $ the sample mean, and $ s $ the standard deviation.
\[ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \overline{x})^2} \]
In words, standard deviation is the square root of the (nearly) average squared distance from the mean. "Nearly" because we divide by $ n-1 $ rather than $ n $.
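The formula can be followed one step at a time in R. The sketch below uses the six times printed by head(TV2), so the result is not the standard deviation of the full data set, and compares the by-hand value with R's built-in sd function. The variable names are made up for the illustration.

```r
# The six viewing times (in minutes) shown in the head(TV2) output earlier.
times <- c(180, 150, 130, 990, 470, 260)

n    <- length(times)   # sample size
xbar <- mean(times)     # sample mean

# Distance of each observation from the mean, then squared and summed.
deviations <- times - xbar
sum_sq     <- sum(deviations^2)

# Divide by n - 1 and take the square root.
s_by_hand <- sqrt(sum_sq / (n - 1))

print(s_by_hand)
print(sd(times))        # built-in function, should agree
```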

Important Properties of $ s $:

The larger the standard deviation, the more dispersed the data is from the mean.
$ s $ can never be negative
$ s=0 $ only if there is no variation in the data
$ s $ is highly affected by outliers
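These properties are easy to check in R. The vectors below are hypothetical numbers chosen only to make the point; they do not come from any data set in these notes.

```r
# No variation at all: every value is the same, so s is 0.
constant <- c(80, 80, 80, 80, 80)
print(sd(constant))

# Some variation, no extreme values.
scores <- c(70, 75, 80, 85, 90)
print(sd(scores))

# The same scores with one extreme value added: s grows considerably.
scores_with_outlier <- c(scores, 10)
print(sd(scores_with_outlier))
```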

Example: Look at the income distributions for music teachers in Denmark and the US. Based on the graphs alone, which country has a higher standard deviation? How do you know?

[Figure: income distributions of music teachers in Denmark (red) and the United States (blue)]

Example: Exams are typically graded on a scale from 0 to 100. Assume that the mean score in an exam is 80. Which of these values is most likely to be the standard deviation? Why? What can we say about the shape of the exam scores?

a. $ s=0 $
b. $ s=10 $
c. $ s=50 $
d. $ s=-5 $

With R, the standard deviation is easy to calculate.

Example: For the TV2 data set, find the mean and standard deviation of TV watching times for both men and women. Which gender watches more TV on average? Which gender is more variable?

favstats(~time | gender, data = TV2)

##        min  Q1 median  Q3  max  mean    sd   n missing
## Female 120 260    400 650 2650 521.4 410.1 191       0
## Male   200 330    530 820 2700 669.1 478.6 209       0

The favstats function will be your best friend this semester! We can use this to calculate lots of numerical summaries. So far we've mentioned:

median
mean
standard deviation (sd)
n (sample size in each category)
missing (how many data points have missing information)

Example: Use the favstats function to find the standard deviation for hours of sleep in the Cellphones data set. Compare the standard deviation and mean of hours of sleep for students who use a cell phone and students who do not use a cell phone. Do you think there's a significant difference in sleep patterns for students who do/don't use cell phones?

xhistogram(~Sleep | Cell, data = Cellphones, auto.key = TRUE)

Q3. Are there outliers in our data?

In statistics, we are sometimes interested in where a certain observation falls in a data set relative to all other observations.

Percentile:

Suppose your total score of 28 (out of 36) on the ACT college entrance exam falls at the 90th percentile. Then 90% of the students who took the exam at the same time you did scored 28 or lower, and only 10% scored higher than 28.
How could we use a percentile to decide whether an observation is an outlier?
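As a rough sketch of the idea in R, the share of observations at or below a value can be computed directly, and the quantile function goes the other way, from a percentile to a value. The example reuses the six times printed by head(TV2); the cutoff of 470 minutes is chosen only for illustration. Note that quantile interpolates between observations, and several interpolation rules are in use, so results on small data sets can differ slightly from a hand calculation.

```r
# The six viewing times (in minutes) shown in the head(TV2) output earlier.
times <- c(180, 150, 130, 990, 470, 260)

# Share of observations at or below 470 minutes.
share_at_or_below <- mean(times <= 470)
print(share_at_or_below)

# The other direction: the value at the 50th percentile, which is the median.
p50 <- quantile(times, probs = 0.5)
print(p50)
```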

Quartile:

Quartiles split the distribution of the data into four parts.
We've already used Q2 today…
Quartiles can be found using favstats!

Example: For the TV2 data set, identify the quartiles in the output for men and women. Identify the 0th percentile and 100th percentile.

favstats(~time | gender, data = TV2)

##        min  Q1 median  Q3  max  mean    sd   n missing
## Female 120 260    400 650 2650 521.4 410.1 191       0
## Male   200 330    530 820 2700 669.1 478.6 209       0

Interquartile Range:

Why would we want to know how spread out the middle 50% of our data is?

Example: Use the favstats output to find the IQR for the TV2 data set.

favstats(~time | gender, data = TV2)

##        min  Q1 median  Q3  max  mean    sd   n missing
## Female 120 260    400 650 2650 521.4 410.1 191       0
## Male   200 330    530 820 2700 669.1 478.6 209       0

Boxplot:

A boxplot is an incredibly useful plot! Using the bwplot function, our graph shows:
- A black dot at the median value
- A blue box surrounding the middle 50% (Q1 to Q3)
- Dotted lines (whiskers) extending from the box to the most extreme observations within 1.5*IQR of Q1 and Q3
- Points more than 1.5*IQR below Q1 or above Q3, shown as individual dots

Why 1.5 times IQR? This is the IQR Criterion!

A point that falls more than 1.5*IQR below Q1 or above Q3 is usually considered an outlier.
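The criterion can be applied directly to the numbers in the favstats output. The sketch below copies Q1, Q3, and max for each gender into a small data frame (the name tv and the new column names are made up for this illustration) and checks whether the largest observation in each group lies beyond the upper cutoff.

```r
# Q1, Q3, and max copied from the favstats output above.
tv <- data.frame(
  gender = c("Female", "Male"),
  Q1     = c(260, 330),
  Q3     = c(650, 820),
  max    = c(2650, 2700)
)

# IQR and the two cutoffs from the IQR criterion.
tv$IQR         <- tv$Q3 - tv$Q1
tv$lower_fence <- tv$Q1 - 1.5 * tv$IQR
tv$upper_fence <- tv$Q3 + 1.5 * tv$IQR

# Is the largest observation in each group beyond the upper cutoff?
tv$max_is_outlier <- tv$max > tv$upper_fence
print(tv)
```

Both lower cutoffs come out negative. Since viewing time cannot be negative, no observation can be an outlier on the low side in this data set.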

Example: Compare the boxplots for TV viewing habits of men and women. Write a short paragraph summarizing the patterns the boxplot tells you.

bwplot(~time | gender, data = TV2)

Example: Repeat the previous exercise using the Cellphones data set to compare sleeping habits and cell phone use.