When we look at data, it's important to understand what's actually happening behind the scenes.
Looking at a table or a list of observations just won't cut it! Instead we need a way to conceptualize the data…
Now's a good time to open RStudio and load the mosaic library!
library(mosaic)
Some starting definitions:
Variable:
Different types of variables will require different analysis methods. There are two major types:
Example: Identify each of the following variables as categorical or quantitative.
The best type of graph to use will depend on the type of variable which we're interested in. Since categorical variables tend to be easier to work with, we'll start with those.
Example: The Whickham study
Data on age, smoking, and mortality from a survey of voters in Whickham, UK was collected from 1972-1974 to study heart disease and thyroid disease. A follow-up on those participants was conducted 20 years later.
data(Whickham)
The head function shows us the first 6 observations. Which variables are categorical and which are quantitative?
head(Whickham)
##   outcome smoker age
## 1   Alive    Yes  23
## 2   Alive    Yes  18
## 3    Dead    Yes  71
## 4   Alive     No  67
## 5   Alive     No  64
## 6   Alive    Yes  38
Bar chart:
bargraph(~outcome, data = Whickham)
How could you change the code in the previous example to produce a bar graph of the number of smokers in the Whickham study?
How can we graphically represent multiple categorical variables? We have some options:
Option 1. Separate our bar graph into groups.
bargraph(~smoker, groups = outcome, data = Whickham, auto.key = TRUE)
Option 2. Use a mosaic plot: divides a square into subportions depending on the number of observations in each category.
mosaicplot(~outcome + smoker, data = Whickham, color = TRUE)
Example: Based on the two graphs on the previous slides, who do you think was more likely to be alive at the end of the 20-year period: smokers or nonsmokers?
We'll come back to this issue later on in the course!
Next question, how could we find out how many observations fall into each category? We can use the tally function!
tally(~outcome, data = Whickham)
##
## Alive Dead Total
## 945 369 1314
tally(~smoker, data = Whickham)
##
## No Yes Total
## 732 582 1314
Suppose we wanted to use the tally function to split the data up by both smoking status and outcome. In R's formula notation, we can use a + to add multiple x variables.
tally(~smoker + outcome, data = Whickham)
##        outcome
## smoker  Alive Dead Total
##   No      502  230   732
##   Yes     443  139   582
##   Total   945  369  1314
This is called a contingency table.
By default, tally reports observation counts in each category.
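For comparison, here is a minimal sketch of the same contingency table in base R, built directly from the four counts printed above. It is only an illustration: the variable names counts and with_totals are made up for this sketch, and base R labels the margins "Sum" where tally prints "Total".

```r
# The four cell counts are copied from the tally output above.
counts <- as.table(matrix(
  c(502, 230,
    443, 139),
  nrow = 2, byrow = TRUE,
  dimnames = list(smoker = c("No", "Yes"), outcome = c("Alive", "Dead"))
))

# addmargins() appends row and column totals (labelled "Sum").
with_totals <- addmargins(counts)
print(with_totals)
```

The row and column totals should match the one-variable tallies shown earlier (732 and 582 for smoker, 945 and 369 for outcome).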
Example: What do you think the code below will report?
tally(~smoker + outcome, data = Whickham, format = "proportion")
Try running it!
Next we'll look at ways to describe quantitative variables. As you may have guessed, working with numbers gives us many more options!
Example: Who watches more TV?
The Carnegie Mellon Online Learning Initiative wanted to know who watches more TV per week: men or women. A random sample of 400 adults was chosen. At the end of the week, each subject reported the total amount of time (in minutes) that he or she watched TV during that week.
data(TV2)
head(TV2)
##   time gender
## 1  180 Female
## 2  150 Female
## 3  130 Female
## 4  990 Female
## 5  470 Female
## 6  260 Female
Histogram:
xhistogram(~time, data = TV2, nint = 15)
nint controls the number of intervals in the graph. There is no “best” number to use!
Example: What does this histogram tell you about TV watching habits? Does it answer our RQ: which gender watches more TV?
xhistogram(~time | gender, data = TV2, nint = 15, auto.key = TRUE)
Example: Now can we say anything about gender and TV watching?
Example: Do students with cellphones get less sleep?
Data was collected from 312 college students at a large state university. The data is contained in the Cellphones data set.
data(Cellphones)
head(Cellphones)
##   Math Verbal Credits Year Exer Sleep Cell  Veg
## 1  640    470      15    1   60   7.0  yes   no
## 2  660    650      14    1   20   7.5  yes   no
## 3  550    580      15    2    0   9.0   no   no
## 4  560    660      16    1   30   7.0  yes   no
## 5  600    790      15    4   45   6.5   no some
## 6  560    640      16    2   75   4.5   no  yes
With a partner, use the variables Cell and Sleep to make a graph showing the relationship between owning a cellphone and sleep. Use the tally function to find out how many students don't have a cell phone. Write a short paragraph interpreting your results.
##
## no yes Total
## 67 243 310
When we're formally analyzing a histogram, there are several characteristics we might look for:
Shape: Histograms typically have one of the three shapes below.
Example: What shape would you expect the following data sets to have? Sketch a graph for each.
Pattern: Does the data cluster together, or is there a gap? Is one observation significantly different from the rest?
Outlier:
Modes: how is the data concentrated?
Example: Revisit your description of the histograms you generated for cellphone use and sleep, using the definitions for shape, pattern, and modes.
Graphs don't necessarily tell us all that we need to know about a data set. Typically, there are three questions that a numerical summary can answer.
Q1. Where is the “center” of our data?
There are two measures of center commonly used in statistics. These should already be familiar to you!
Each of these is easy to find by hand. But, they're even easier to find in R.
mean(~time, data = TV2)
## [1] 598.5
median(~time, data = TV2)
## [1] 470
The mean and median are rarely the same! Let's compare the mean and median for TV watching times.
xhistogram(~time, data = TV2, v = c(598.55, 470))
## c(.,.) lets us read in two numbers to draw vertical lines.
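To see how far apart the two measures of center can be, here is a small base R sketch using only the six TV times printed by head(TV2) earlier. This is an illustration on a tiny subset, not the full data set, so the numbers differ from the ones above. The variable name times is made up for this sketch.

```r
# First six TV times (in minutes), copied from the head(TV2) output.
times <- c(180, 150, 130, 990, 470, 260)

center_mean   <- mean(times)    # sum of the values divided by how many there are
center_median <- median(times)  # middle of the sorted values

print(c(mean = center_mean, median = center_median))
# mean is about 363.3, median is 220
```

With one large value (990) in such a small sample, the two summaries land far apart.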
Example: For a skewed right distribution (like TV watching times), the mean is greater than the median. How do you think the mean and median compare in a…
Example: Use the Cellphones data set to find the mean and median number of hours slept that night. Based on how the mean and median compare, is Sleep symmetric, skewed right, or skewed left? Make a histogram of Sleep alone to confirm.
## [1] 7.194
## [1] 7
Example: Only the mean or the median is “resistant to outliers” - relatively unaffected by an extreme observation. Which do you think it is? Explain your choice.
Q2. How “spread out” is our data?
Measures of center are helpful, but they don't tell us much about the shape of our data. The shape of our data can affect our decisions!
Example: The graphs below show annual income distributions of music teachers in two countries, Denmark (in red) and the United States (in blue). Both countries have an annual average income of $40,000 (adjusted to US dollars). Based on the graphs alone, where would you rather be a music teacher and why?
Both distributions show variability.
Variability:
Standard deviation:
Important Properties of \( s \):
Example: Look at the income distributions for music teachers in Denmark and the US. Based on the graphs alone, which country has a higher standard deviation? How do you know?
Example: Exams are typically graded on a scale from 0 to 100. Assume that the mean score in an exam is 80. Which of these values is most likely to be the standard deviation? Why? What can we say about the shape of the exam scores?
a. \( s=0 \) b. \( s=10 \) c. \( s=50 \) d. \( s=-5 \)
With R, the standard deviation is easy to calculate.
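As a rough sketch, assuming the definition intended here is the usual sample standard deviation \( s = \sqrt{\sum (x_i - \bar{x})^2 / (n - 1)} \), the calculation can be done step by step in base R and checked against the built-in sd function. The six values are the TV times printed by head(TV2) earlier. The variable names are made up for this sketch.

```r
# First six TV times (in minutes), copied from the head(TV2) output.
times <- c(180, 150, 130, 990, 470, 260)

n          <- length(times)
deviations <- times - mean(times)                  # distance of each value from the mean
s_by_hand  <- sqrt(sum(deviations^2) / (n - 1))    # sample standard deviation
s_builtin  <- sd(times)                            # base R's sd() also divides by n - 1

print(c(by_hand = s_by_hand, builtin = s_builtin))
```

The two numbers should agree, which is a quick way to convince yourself what sd is computing.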
Example: For the TV2 data set, find the mean and standard deviation of TV watching times for both men and women. Which gender watches more TV on average? Which gender is more variable?
favstats(~time | gender, data = TV2)
##        min  Q1 median  Q3  max  mean    sd   n missing
## Female 120 260    400 650 2650 521.4 410.1 191       0
## Male   200 330    530 820 2700 669.1 478.6 209       0
The favstats function will be your best friend this semester! We can use this to calculate lots of numerical summaries. So far we've mentioned:
Example: Use the favstats function to find the standard deviation for hours of sleep in the Cellphones data set. Compare the standard deviation and mean of hours of sleep for students who use a cell phone and students who do not use a cell phone. Do you think there's a significant difference in sleep patterns for students who do/don't use cell phones?
xhistogram(~Sleep | Cell, data = Cellphones, auto.key = TRUE)
Q3. Are there outliers in our data?
In statistics, we are sometimes interested in how a certain observation falls in a data set relative to all other observations.
Percentile:
Suppose your total score of 28 (out of 36) on the ACT college entrance exam falls at the 90th percentile. Then 90% of the students who took the exam at the same time you did scored between 0 and 28 (your score). Only 10% of the scores were higher than 28.
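The same idea can be sketched in base R. The ten scores below are hypothetical, invented only so that a score of 28 lands at the 90th percentile. They are not real ACT data.

```r
# Hypothetical exam scores (made up for illustration).
scores   <- c(18, 20, 21, 22, 24, 25, 26, 27, 28, 31)
my_score <- 28

# mean() of a TRUE/FALSE vector gives the proportion of TRUEs.
share_at_or_below <- mean(scores <= my_score)  # 9 of 10 scores
share_above       <- mean(scores > my_score)   # 1 of 10 scores

print(c(at_or_below = share_at_or_below, above = share_above))
```

Here 9 of the 10 scores are at or below 28, so 28 sits at the 90th percentile of this made-up group.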
How could we use a percentile to decide whether an observation is an outlier?
Quartile:
favstats!
Example: For the TV2 data set, identify the quartiles in the output for men and women. Identify the 0th percentile and 100th percentile.
favstats(~time | gender, data = TV2)
##        min  Q1 median  Q3  max  mean    sd   n missing
## Female 120 260    400 650 2650 521.4 410.1 191       0
## Male   200 330    530 820 2700 669.1 478.6 209       0
Interquartile Range:
Example: Use the favstats output to find the IQR for the TV2 data set.
favstats(~time | gender, data = TV2)
##        min  Q1 median  Q3  max  mean    sd   n missing
## Female 120 260    400 650 2650 521.4 410.1 191       0
## Male   200 330    530 820 2700 669.1 478.6 209       0
Boxplot:
With the bwplot function, our graph shows:
Why 1.5 times IQR? This is the IQR Criterion!
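Here is a minimal base R sketch of the criterion, assuming it means the usual rule: flag any observation below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR. It uses only the six TV times printed by head(TV2) earlier, and the variable names are made up for this sketch. Note that quantile uses R's default interpolation method, so the quartiles may differ slightly from ones found by hand.

```r
# First six TV times (in minutes), copied from the head(TV2) output.
times <- c(180, 150, 130, 990, 470, 260)

q   <- quantile(times, probs = c(0.25, 0.75))  # Q1 and Q3 (R's default method)
iqr <- q[[2]] - q[[1]]                         # interquartile range

lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr

# Observations outside the fences are flagged as possible outliers.
flagged <- times[times < lower_fence | times > upper_fence]
print(flagged)
# 990
```

In this tiny subset only the 990-minute observation falls outside the fences.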
Example: Compare the boxplots for TV viewing habits of men and women. Write a short paragraph summarizing the patterns the boxplots show.
bwplot(~time | gender, data = TV2)
Example: Repeat the previous exercise using the Cellphones data set to compare sleeping habits and cell phone use.