Aimee Schwab
June 11, 2013
When we look at data, it's important to understand what's actually happening behind the scenes.
Looking at a table or a list of observations just won't cut it! Instead we need a way to conceptualize the data…
Where We're Headed…
Now's a good time to open RStudio and load the mosaic library!
library(mosaic)
Some starting definitions:
Variable: any characteristic that we observe for a subject in our study
Different types of variables require different analysis methods. There are two major types: categorical and quantitative.
Example: Identify each of the following variables as categorical or quantitative.
The best type of graph depends on the type of variable we're interested in. Since categorical variables tend to be easier to work with, we'll start with those.
Example: The Whickham study
Data on age, smoking, and mortality were collected from 1972 to 1974 in a survey of voters in Whickham, UK, to study heart disease and thyroid disease. A follow-up on those participants was conducted 20 years later.
data(Whickham)
The head function shows us the first 6 observations. Which variables are categorical and which are quantitative?
head(Whickham)
outcome smoker age
1 Alive Yes 23
2 Alive Yes 18
3 Dead Yes 71
4 Alive No 67
5 Alive No 64
6 Alive Yes 38
outcome and smoker are categorical; age is quantitative.
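One way to double-check an answer like this is to ask R how each column is stored. The sketch below is only an illustration: it rebuilds the six rows printed above as a small data frame (the name whickham_head is made up here), uses base R rather than mosaic, and assumes categorical variables are stored as factors and quantitative ones as numbers.

```r
# Rebuild the six rows shown by head(Whickham); the object name is hypothetical
whickham_head <- data.frame(
  outcome = factor(c("Alive", "Alive", "Dead", "Alive", "Alive", "Alive")),
  smoker  = factor(c("Yes", "Yes", "Yes", "No", "No", "Yes")),
  age     = c(23, 18, 71, 67, 64, 38)
)

# TRUE for columns stored as numbers (likely quantitative)
is_quantitative <- sapply(whickham_head, is.numeric)
# TRUE for columns stored as factors (likely categorical)
is_categorical <- sapply(whickham_head, is.factor)

print(is_quantitative)
print(is_categorical)
```

Storage type is only a hint. A variable coded with numbers, such as a class year recorded as 1 to 4, can still be categorical.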
Bar chart: each category or response for a single variable is represented with a vertical bar. The height of the bar represents the number of observations which fall into that category.
bargraph(~outcome, data=Whickham)
Each function that we use will follow the same formula notation. The basic syntax, with formula standing in for the name of the function, is:
formula(y ~ x, data=DATASET, options=...)
The y variable is on the vertical axis of our graph, and the x variable is on the horizontal axis. For a bar chart, there's no y variable (the bar heights are counts), so we leave that spot blank.
Functions will have options associated with them. To see all the options, type the function name into the console and hit the TAB button on your keyboard.
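The same y ~ x idea shows up in base R too. As a rough sketch (not part of the mosaic workflow used in these slides), the code below contrasts a one-sided formula with a two-sided one, using the six rows printed by head(Whickham). The object names are made up for this illustration.

```r
# Six rows from head(Whickham); the object name is hypothetical
whickham_head <- data.frame(
  outcome = c("Alive", "Alive", "Dead", "Alive", "Alive", "Alive"),
  smoker  = c("Yes", "Yes", "Yes", "No", "No", "Yes"),
  age     = c(23, 18, 71, 67, 64, 38)
)

# One-sided formula: nothing to the left of ~, so we only name an x variable
outcome_counts <- xtabs(~ outcome, data = whickham_head)
print(outcome_counts)

# Two-sided formula: y (age) on the left, x (smoker) on the right
mean_age_by_smoker <- aggregate(age ~ smoker, data = whickham_head, FUN = mean)
print(mean_age_by_smoker)
```

In both calls the data argument tells R where to look up the variable names, just as data=Whickham does in the bargraph call.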
Back to the Whickham study:
How could you change the code in the previous example to produce a bar graph of the number of smokers in the Whickham study?
bargraph(~smoker, data=Whickham)
How can we graphically represent multiple categorical variables? We have some options:
Option 1. Separate our bar graph into groups.
bargraph(~smoker, groups=outcome, data=Whickham, auto.key=TRUE)
Option 2. Use a mosaic plot, which divides a square into regions sized according to the number of observations in each category.
mosaicplot(~outcome+smoker, data=Whickham, color=TRUE)
Based on the two graphs on the previous slides, who do you think was more likely to be alive at the end of the 20-year period: smokers or nonsmokers?
Smokers were more likely than nonsmokers to be alive at the end of the study (about 76% of smokers versus 69% of nonsmokers)! Follow-up question: is this surprising?
We'll come back to this issue later on in the course!
Next question, how could we find out how many observations fall into each category?
To get a tally of how many observations are in each category, we can use the tally function!
tally(~outcome, data=Whickham)
Alive Dead Total
945 369 1314
tally(~smoker, data=Whickham)
No Yes Total
732 582 1314
Suppose we wanted to use the tally function to split the data up by both smoking status and outcome. In R's formula notation, we can use a + to add multiple x variables.
tally(~smoker+outcome, data=Whickham)
      outcome
smoker Alive Dead Total
  No     502  230   732
  Yes    443  139   582
  Total  945  369  1314
This is called a contingency table.
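If you only have the counts and not the raw data, a contingency table can also be typed in directly. This is a minimal base R sketch, assuming the four cell counts from the tally output above; the object names are made up for this illustration.

```r
# Cell counts copied from the tally output above
counts <- as.table(matrix(
  c(502, 230,
    443, 139),
  nrow = 2, byrow = TRUE,
  dimnames = list(smoker = c("No", "Yes"), outcome = c("Alive", "Dead"))
))

# addmargins() appends row and column totals (labelled "Sum" rather than "Total")
with_totals <- addmargins(counts)
print(with_totals)
```

The row and column totals should match the one-variable tallies of smoker and outcome.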
By default, tally reports observation counts in each category. What do you think the code below will report?
tally(~smoker+outcome, data=Whickham, format='proportion')
Try running it!
       outcome
smoker   Alive   Dead  Total
  No    0.3820 0.1750 0.5571
  Yes   0.3371 0.1058 0.4429
  Total 0.7192 0.2808 1.0000
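These proportions divide every cell by the grand total of 1314. To compare smokers with nonsmokers, it can help to divide each cell by its own row total instead. Here is a hedged base R sketch of both calculations, assuming the counts from the contingency table; the object names are made up for this illustration.

```r
# Cell counts copied from the contingency table
counts <- as.table(matrix(
  c(502, 230,
    443, 139),
  nrow = 2, byrow = TRUE,
  dimnames = list(smoker = c("No", "Yes"), outcome = c("Alive", "Dead"))
))

# Overall proportions: each cell divided by the grand total
overall <- prop.table(counts)

# Row proportions: each cell divided by its row total,
# i.e. the share of nonsmokers (or smokers) with each outcome
by_smoker <- prop.table(counts, margin = 1)

print(round(overall, 4))
print(round(by_smoker, 4))
```

The row proportions answer the question "of the smokers, what share were alive?", which is the comparison between smokers and nonsmokers asked about earlier.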
Next we'll look at ways to describe quantitative variables. As you may have guessed, working with numbers gives us many more options!
Example: Who watches more TV?
The Carnegie Mellon Online Learning Initiative wanted to know who watches more TV per week: men or women. A random sample of 400 adults was chosen. At the end of the week, each subject reported the total amount of time (in minutes) that he or she watched TV during that week.
data(TV2)
head(TV2)
time gender
1 180 Female
2 150 Female
3 130 Female
4 990 Female
5 470 Female
6 260 Female
Histogram: like a bar chart, but each bar represents an interval of values
xhistogram(~time, data=TV2, nint=15)
nint controls the number of intervals in the graph. There is no “best” number to use!
What does this histogram tell you about TV watching habits? Does it answer our research question: which gender watches more TV?
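To see why the number of intervals matters, here is a small base R sketch that bins the same values two different ways. It uses only the six viewing times printed by head(TV2), and it sets the interval edges by hand with breaks, which is a stand-in for what nint does in xhistogram rather than the same argument.

```r
# First six viewing times (minutes) from head(TV2)
time <- c(180, 150, 130, 990, 470, 260)

# Same data, two different interval widths; plot = FALSE returns counts only
wide   <- hist(time, breaks = seq(0, 1000, by = 500), plot = FALSE)
narrow <- hist(time, breaks = seq(0, 1000, by = 250), plot = FALSE)

print(wide$counts)    # number of observations in each 500-minute interval
print(narrow$counts)  # number of observations in each 250-minute interval
```

Fewer intervals hide detail, while more intervals can leave empty gaps, so it is worth trying a few values.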
This isn't enough! Can we break up this histogram by gender? Yes!
xhistogram(~time|gender, data=TV2, nint=15, auto.key=TRUE)
Now can we say anything about gender and TV watching?
Example: Do students with cellphones get less sleep?
Data was collected from 312 college students at a large state university. The data is contained in the Cellphones data set.
data(Cellphones)
head(Cellphones)
  Math Verbal Credits Year Exer Sleep Cell  Veg
1  640    470      15    1   60   7.0  yes   no
2  660    650      14    1   20   7.5  yes   no
3  550    580      15    2    0   9.0   no   no
4  560    660      16    1   30   7.0  yes   no
5  600    790      15    4   45   6.5   no some
6  560    640      16    2   75   4.5   no  yes
With a partner, use the variables Cell and Sleep to make a graph showing the relationship between owning a cellphone and sleep. Use the tally function to find out how many students don't have a cellphone. Write a short paragraph interpreting your results.
no yes Total
67 243 310
When we're formally analyzing a histogram, there are several characteristics we might look for:
Shape: if we drew a smooth curve over the histogram, what shape would we see?
Histograms typically have one of the three shapes below.
Example: What shape would you expect the following data sets to have? Sketch a graph for each.
Annual income should be skewed right; height should be symmetric; lifespan should be skewed left.
Pattern: Does the data cluster together, or is there a gap? Is one observation significantly different from the rest?
Outlier: an extreme observation that falls far below or far above the rest of the observed data
Modes: how is the data concentrated?
Example: Revisit your description of the histograms you generated for cellphone use and sleep, using the definitions for shape, pattern, and modes.
Graphs don't necessarily tell us all that we need to know about a data set. Typically, there are three questions that a numerical summary can answer.
Q1. Where is the “center” of our data?
Two measures of center are commonly used in statistics: the mean and the median. These should already be familiar to you!
Each of these is easy to find by hand. But they're even easier to find in R.
mean(~time, data=TV2)
[1] 598.5
median(~time, data=TV2)
[1] 470
The mean and median rarely are the same! Let's compare the mean and median for TV watching times.
xhistogram(~time, data=TV2, v=c(598.55, 470))
Hint: c(., .) combines two numbers into one vector, so v=c(598.55, 470) draws a vertical line at each value.
Example: For a skewed right distribution (like TV watching times), the mean is greater than the median. How do you think the mean and median compare in a…
Mean is less than the median.
Mean and the median are about the same.
Example: Use the Cellphones data set to find the mean and median number of hours slept. Based on how the mean and median compare, is Sleep symmetric, skewed right, or skewed left? Make a histogram of Sleep alone to confirm.
[1] 7.194
[1] 7
The mean and the median are almost identical, so hours of sleep is roughly symmetric.
Example: Only the mean or the median is “resistant to outliers” - relatively unaffected by an extreme observation. Which do you think it is? Explain your choice.
The median is “resistant to outliers”.
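A quick experiment makes this concrete. The sketch below takes the six viewing times printed by head(TV2), exaggerates the largest one, and compares the two measures before and after. The inflated value of 9900 is made up purely for illustration.

```r
# First six viewing times (minutes) from head(TV2)
time <- c(180, 150, 130, 990, 470, 260)

# Exaggerate the largest value to mimic an extreme observation (made-up value)
time_extreme <- replace(time, time == 990, 9900)

# The mean moves a lot when one value becomes extreme...
print(c(mean_before = mean(time), mean_after = mean(time_extreme)))

# ...while the median does not move at all
print(c(median_before = median(time), median_after = median(time_extreme)))
```

The median only depends on the middle of the sorted data, so changing the largest value leaves it alone.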
Q2. How “spread out” is our data?
Measures of center are helpful, but they don't tell us much about the shape of our data. The shape of our data can affect our decisions!
Example: The graphs below show annual income distributions of music teachers in two countries, Denmark (in red) and the United States (in blue). Both countries have an annual average income of $40,000 (adjusted to US dollars). Based on the graphs alone, where would you rather be a music teacher and why?
Both distributions show variability.
Variability:
Standard deviation: a measure of variability (larger \( s \) means more variability!)
Important Properties of \( s \):
Example: Look at the income distributions for music teachers in Denmark and the US. Based on the graphs alone, which country has a higher standard deviation? How do you know?
The United States has a higher standard deviation, since it has more observations that are “far” from $40,000.
Example: Exams are typically graded on a scale from 0 to 100. Assume that the mean score in an exam is 80. Which of these values is most likely to be the standard deviation? Why? What can we say about the shape of the exam scores?
a. \( s=0 \) b. \( s=10 \) c. \( s=50 \) d. \( s=-5 \)
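10 is the most likely standard deviation: \( s=0 \) would mean every student scored exactly 80, \( s=50 \) is too large for scores between 0 and 100 with a mean of 80, and a standard deviation can never be negative.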
With R, the standard deviation is easy to calculate.
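To see what R is doing for us, here is a minimal sketch that computes a standard deviation step by step and compares it with the built-in sd function. It assumes the usual sample formula, which divides the sum of squared deviations by n - 1, and uses only the six viewing times printed by head(TV2).

```r
# First six viewing times (minutes) from head(TV2)
time <- c(180, 150, 130, 990, 470, 260)

n <- length(time)

# How far each observation is from the mean
deviations <- time - mean(time)

# Sample standard deviation: square, add up, divide by n - 1, take the root
s_by_hand <- sqrt(sum(deviations^2) / (n - 1))

# Built-in version for comparison
s_builtin <- sd(time)

print(c(by_hand = s_by_hand, builtin = s_builtin))
```

Observations far from the mean contribute large squared deviations, which is why a wider spread gives a larger \( s \).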
Example: For the TV2 data set, find the mean and standard deviation of TV watching times for both men and women. Which gender watches more TV on average? Which gender is more variable?
favstats(~time|gender, data=TV2)
       min  Q1 median  Q3  max  mean    sd   n missing
Female 120 260    400 650 2650 521.4 410.1 191       0
Male   200 330    530 820 2700 669.1 478.6 209       0
The favstats function will be your best friend this semester! We can use this to calculate lots of numerical summaries. So far we've mentioned:
Example: Use the favstats function to find the standard deviation for hours of sleep in the Cellphones data set. Compare the standard deviation and mean of hours of sleep for students who use a cell phone and students who do not use a cell phone. Do you think there's a significant difference in sleep patterns for students who do/don't use cell phones?
To remind yourself, make a histogram!
xhistogram(~Sleep|Cell, data=Cellphones, auto.key=TRUE)
favstats(~Sleep|Cell, data=Cellphones)
Q3. Are there outliers in our data?
In statistics, we are sometimes interested in where a certain observation falls in a data set relative to all other observations.
Percentile: the pth percentile is a value such that p% of the data falls at or below that value
Suppose your total score of 28 (out of 36) on the ACT college entrance exam falls at the 90th percentile. Then 90% of the students who took the exam at the same time you did scored 28 (your score) or lower. Only 10% of the scores were higher than 28.
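Here is a small sketch of the same idea in base R. The ten scores are made up for illustration (they are not real ACT data), and the name scores is hypothetical.

```r
# Hypothetical exam scores for ten students (made up for illustration)
scores <- c(17, 19, 20, 21, 22, 23, 24, 25, 28, 31)

# Share of scores at or below 28: a score's percentile rank
share_at_or_below <- mean(scores <= 28)
print(share_at_or_below)

# quantile() goes the other way: from a percentage to a value
middle_value <- quantile(scores, probs = 0.5)
print(middle_value)
```

R offers several slightly different rules for computing quantiles, so results for small data sets may differ a little from a hand calculation.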
How could we use a percentile to decide whether an observation is an outlier?
Quartile: the 25th (Q1), 50th (Q2), and 75th (Q3) percentiles
We can find the quartiles with favstats!
Example: For the TV2 data set, identify the quartiles in the output for men and women. Identify the 0th percentile and 100th percentile.
favstats(~time|gender, data=TV2)
       min  Q1 median  Q3  max  mean    sd   n missing
Female 120 260    400 650 2650 521.4 410.1 191       0
Male   200 330    530 820 2700 669.1 478.6 209       0
Interquartile Range: the difference between Q1 and Q3 (IQR = Q3-Q1)
Example: Use the favstats output to find the IQR for the TV2 data set.
favstats(~time|gender, data=TV2)
       min  Q1 median  Q3  max  mean    sd   n missing
Female 120 260    400 650 2650 521.4 410.1 191       0
Male   200 330    530 820 2700 669.1 478.6 209       0
Boxplot: a convenient way to describe the variability of a data set in a single graph
Using the bwplot function, our graph shows:
Why 1.5 times IQR? This is the IQR Criterion!
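As a rough sketch of how the criterion can be applied, the code below starts from the quartiles, minimum, and maximum in the favstats output for TV2. It assumes the common convention that an observation is flagged as a potential outlier when it lies more than 1.5 times the IQR below Q1 or above Q3. The data frame name tv_summary and its column names are made up for this illustration.

```r
# Summary values copied from the favstats output for TV2
tv_summary <- data.frame(
  gender = c("Female", "Male"),
  min    = c(120, 200),
  Q1     = c(260, 330),
  Q3     = c(650, 820),
  max    = c(2650, 2700)
)

# Interquartile range: Q3 - Q1
tv_summary$IQR <- tv_summary$Q3 - tv_summary$Q1

# Cutoffs ("fences") 1.5 IQRs beyond the quartiles
tv_summary$lower_fence <- tv_summary$Q1 - 1.5 * tv_summary$IQR
tv_summary$upper_fence <- tv_summary$Q3 + 1.5 * tv_summary$IQR

# Does the smallest or largest observation fall outside the fences?
tv_summary$low_outlier  <- tv_summary$min < tv_summary$lower_fence
tv_summary$high_outlier <- tv_summary$max > tv_summary$upper_fence

print(tv_summary)
```

Because viewing times cannot be negative, a lower cutoff below zero simply means no observation can be flagged on the low side.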
Example: Compare the boxplots for TV viewing habits of men and women. Write a short paragraph summarizing the patterns the boxplot tells you.
bwplot(~time|gender, data=TV2)
Example: Repeat the previous exercise using the Cellphones data set to compare sleeping habits and cell phone use.
On a sheet of paper, please complete the following:
Big Picture: Write in your own words the most important take-away message of this section.
I understand…: What do you feel most confident with in this section?
Let's do more…: What would you like to see more examples of?