Part 2: Descriptive Summaries

When we look at data, it's important to understand what's actually happening behind the scenes.

Looking at a table or a list of observations just won't cut it! Instead we need a way to conceptualize the data…

Where We're Headed…

Now's a good time to open RStudio and load the mosaic library!

library(mosaic)

Some starting definitions:

Variable:

Different types of variables will require different analysis methods. There are two major types: categorical and quantitative.

Example: Identify each of the following variables as categorical or quantitative.

  1. Number of pets in your family.
  2. County of residence.
  3. Choice of car to buy.
  4. Distance (in miles) of your daily commute.

Categorical Variables

The best type of graph to use will depend on the type of variable which we're interested in. Since categorical variables tend to be easier to work with, we'll start with those.

Example: The Whickham study

Data on age, smoking, and mortality from a survey of voters in Whickham, UK was collected from 1972-1974 to study heart disease and thyroid disease. A follow-up on those participants was conducted 20 years later.

data(Whickham)

Which variables are categorical and which are quantitative?

head(Whickham)
##   outcome smoker age
## 1   Alive    Yes  23
## 2   Alive    Yes  18
## 3    Dead    Yes  71
## 4   Alive     No  67
## 5   Alive     No  64
## 6   Alive    Yes  38

Bar chart:

bargraph(~outcome, data = Whickham)


How could you change the code in the previous example to produce a bar graph of the number of smokers in the Whickham study?



How can we graphically represent multiple categorical variables? We have some options:

Option 1. Separate our bar graph into groups.

bargraph(~smoker, groups = outcome, data = Whickham, auto.key = TRUE)


Option 2. Use a mosaic plot, which divides a square into regions sized according to the number of observations in each category.

mosaicplot(~outcome + smoker, data = Whickham, color = TRUE)


Example: Based on the two graphs on the previous slides, who do you think was more likely to be alive at the end of the 20-year period: smokers or nonsmokers?

We'll come back to this issue later on in the course!

Next question: how could we find out how many observations fall into each category? We can use the tally function!

tally(~outcome, data = Whickham)
## 
## Alive  Dead Total 
##   945   369  1314
tally(~smoker, data = Whickham)
## 
##    No   Yes Total 
##   732   582  1314

Suppose we wanted to use the tally function to split the data up by both smoking status and outcome. In R's formula notation, we can use a + to add multiple x variables.

tally(~smoker + outcome, data = Whickham)
##        outcome
## smoker  Alive Dead Total
##   No      502  230   732
##   Yes     443  139   582
##   Total   945  369  1314

This is called a contingency table.

By default, tally reports observation counts in each category.

Example: What do you think the code below will report?

tally(~smoker + outcome, data = Whickham, format = "proportion")

Try running it!
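Once you've run it, you can check the result by hand. The sketch below is one way to do that in base R, using the counts copied from the contingency table above. It does not use tally; prop.table is a base R function, and the names counts, overall, and by_smoker are chosen here for illustration only.

```r
# Counts copied from the contingency table above
# (rows: smoker, columns: outcome).
counts <- matrix(c(502, 230,
                   443, 139),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(smoker  = c("No", "Yes"),
                                 outcome = c("Alive", "Dead")))

# Overall proportions: each cell divided by the grand total (1314).
overall <- prop.table(counts)

# Row proportions: each cell divided by its row total, i.e. the share
# of nonsmokers (or smokers) with each outcome.
by_smoker <- prop.table(counts, margin = 1)

round(overall, 3)
round(by_smoker, 3)
```

The row proportions are often the more useful view when comparing groups of different sizes, since each row is rescaled to sum to 1.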


Quantitative Variables

Next we'll look at ways to describe quantitative variables. As you may have guessed, working with numbers gives us many more options!

Example: Who watches more TV?

The Carnegie Mellon Online Learning Initiative wanted to know who watches more TV per week: men or women. A random sample of 400 adults was chosen. At the end of the week, each subject reported the total amount of time (in minutes) that he or she watched TV during that week.

data(TV2)
head(TV2)
##   time gender
## 1  180 Female
## 2  150 Female
## 3  130 Female
## 4  990 Female
## 5  470 Female
## 6  260 Female

Histogram:

xhistogram(~time, data = TV2, nint = 15)


Example: What does this histogram tell you about TV watching habits? Does it answer our research question: which gender watches more TV?

xhistogram(~time | gender, data = TV2, nint = 15, auto.key = TRUE)


Example: Now can we say anything about gender and TV watching?


Example: Do students with cellphones get less sleep?

Data was collected from 312 college students at a large state university. The data is contained in the Cellphones data set.

data(Cellphones)
head(Cellphones)
##   Math Verbal Credits Year Exer Sleep Cell  Veg
## 1  640    470      15    1   60   7.0  yes   no
## 2  660    650      14    1   20   7.5  yes   no
## 3  550    580      15    2    0   9.0   no   no
## 4  560    660      16    1   30   7.0  yes   no
## 5  600    790      15    4   45   6.5   no some
## 6  560    640      16    2   75   4.5   no  yes

With a partner, use the variables Cell and Sleep to make a graph showing the relationship between owning a cellphone and sleep. Use the tally function to find out how many students don't have a cellphone. Write a short paragraph interpreting your results.


## 
##    no   yes Total 
##    67   243   310

When we're formally analyzing a histogram, there are several characteristics we might look for:

Shape: Histograms typically have one of the three shapes below.

[Figures: three example histograms, one for each shape]

Example: What shape would you expect the following data sets to have? Sketch a graph for each.


Pattern: Does the data cluster together, or is there a gap? Is one observation significantly different from the rest?

Outlier:



Modes: how is the data concentrated?


Example: Revisit your description of the histograms you generated for cellphone use and sleep, using the definitions for shape, pattern, and modes.


Graphs don't necessarily tell us all that we need to know about a data set. Typically, there are three questions that a numerical summary can answer.

Q1. Where is the “center” of our data?

There are two measures of center commonly used in statistics. These should already be familiar to you!

Each of these is easy to find by hand. But, they're even easier to find in R.

mean(~time, data = TV2)
## [1] 598.5
median(~time, data = TV2)
## [1] 470

The mean and median are rarely the same! Let's compare the mean and median for TV watching times.

xhistogram(~time, data = TV2, v = c(598.55, 470))


## c(., .) combines two numbers into a vector; v = draws a vertical line at each one.
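To see how a few large values pull the mean away from the median, here is a minimal base R sketch. The numbers are made up for illustration and are not taken from TV2; times is a hypothetical name.

```r
# Hypothetical weekly TV times in minutes (made up for illustration,
# NOT taken from the TV2 data set).
times <- c(120, 200, 260, 400, 470, 650, 2650)

mean(times)    # pulled upward by the single large value
median(times)  # the middle value of the sorted data

# Compare the two measures of center directly.
mean(times) > median(times)
```

Try changing the largest value and rerunning to see which of the two summaries moves.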

Example: For a skewed right distribution (like TV watching times), the mean is greater than the median. How do you think the mean and median compare in a…


Example: Use the Cellphones data set to find the mean and median number of hours slept that night. Based on how the mean and median compare, is Sleep symmetric, skewed right, or skewed left? Make a histogram of Sleep alone to confirm.

## [1] 7.194
## [1] 7

Example: Only one of the two, the mean or the median, is “resistant to outliers”, meaning it is relatively unaffected by an extreme observation. Which do you think it is? Explain your choice.


Q2. How “spread out” is our data?

Measures of center are helpful, but they don't tell us much about the shape of our data. The shape of our data can affect our decisions!

Example: The graphs below show annual income distributions of music teachers in two countries, Denmark (in red) and the United States (in blue). Both countries have an annual average income of $40,000 (adjusted to US dollars). Based on the graphs alone, where would you rather be a music teacher and why?

[Figures: annual income distributions for music teachers in Denmark (red) and the United States (blue)]


Both distributions show variability.

Variability:

Standard deviation:

Important Properties of \( s \):


Example: Look at the income distributions for music teachers in Denmark and the US. Based on the graphs alone, which country has a higher standard deviation? How do you know?

[Figures: annual income distributions for music teachers in Denmark (red) and the United States (blue)]

Example: Exams are typically graded on a scale from 0 to 100. Assume that the mean score on an exam is 80. Which of these values is most likely to be the standard deviation? Why? What can we say about the shape of the exam scores?

a. \( s=0 \) b. \( s=10 \) c. \( s=50 \) d. \( s=-5 \)


With R, the standard deviation is easy to calculate.
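As a rough sketch of what R is doing, the block below computes the sample standard deviation by hand and compares it with base R's sd function. The scores are made up for illustration, and the names scores, xbar, s_by_hand, and s_builtin are hypothetical.

```r
# Hypothetical exam scores (made up for illustration).
scores <- c(62, 75, 80, 84, 99)

n    <- length(scores)
xbar <- mean(scores)

# Sample standard deviation: square root of the sum of squared
# deviations from the mean, divided by n - 1.
s_by_hand <- sqrt(sum((scores - xbar)^2) / (n - 1))

# Base R's sd() uses the same n - 1 formula.
s_builtin <- sd(scores)

all.equal(s_by_hand, s_builtin)
```

Because the deviations are squared before summing, the result can never be negative.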

Example: For the TV2 data set, find the mean and standard deviation of TV watching times for both men and women. Which gender watches more TV on average? Which gender is more variable?

favstats(~time | gender, data = TV2)
##        min  Q1 median  Q3  max  mean    sd   n missing
## Female 120 260    400 650 2650 521.4 410.1 191       0
## Male   200 330    530 820 2700 669.1 478.6 209       0

The favstats function will be your best friend this semester! We can use this to calculate lots of numerical summaries. So far we've mentioned:


Example: Use the favstats function to find the standard deviation for hours of sleep in the Cellphones data set. Compare the standard deviation and mean of hours of sleep for students who use a cell phone and students who do not use a cell phone. Do you think there's a significant difference in sleep patterns for students who do/don't use cell phones?

xhistogram(~Sleep | Cell, data = Cellphones, auto.key = TRUE)



Q3. Are there outliers in our data?

In statistics, we are sometimes interested in where a certain observation falls in a data set relative to all other observations.

Percentile:

Quartile:

Example: For the TV2 data set, identify the quartiles in the output for men and women. Identify the 0th percentile and 100th percentile.

favstats(~time | gender, data = TV2)
##        min  Q1 median  Q3  max  mean    sd   n missing
## Female 120 260    400 650 2650 521.4 410.1 191       0
## Male   200 330    530 820 2700 669.1 478.6 209       0

Interquartile Range:

Example: Use the favstats output to find the IQR for the TV2 data set.

favstats(~time | gender, data = TV2)
##        min  Q1 median  Q3  max  mean    sd   n missing
## Female 120 260    400 650 2650 521.4 410.1 191       0
## Male   200 330    530 820 2700 669.1 478.6 209       0

Boxplot:

Why 1.5 times IQR? This is the IQR Criterion!
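Here is a minimal sketch of the criterion in base R, assuming the usual rule that an observation is flagged as a potential outlier when it lies more than 1.5 times the IQR below Q1 or above Q3. The quartiles and maximums are copied from the favstats output above; the names q1, q3, maxes, iqr, lower_fence, and upper_fence are chosen here for illustration.

```r
# Quartiles and maximums copied from the favstats output above.
q1    <- c(Female = 260,  Male = 330)
q3    <- c(Female = 650,  Male = 820)
maxes <- c(Female = 2650, Male = 2700)

iqr <- q3 - q1                  # interquartile range for each group
lower_fence <- q1 - 1.5 * iqr   # values below this are flagged
upper_fence <- q3 + 1.5 * iqr   # values above this are flagged

# Does the largest observation in each group lie beyond the upper fence?
maxes > upper_fence
```

The same arithmetic works for any group once you have its Q1 and Q3.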


Example: Compare the boxplots for TV viewing habits of men and women. Write a short paragraph summarizing the patterns the boxplots show.

bwplot(~time | gender, data = TV2)


Example: Repeat the previous exercise using the Cellphones data set to compare sleeping habits and cellphone use.