knitr::opts_chunk$set(echo = TRUE)
library(Lock5Data)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.1.2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.2
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.0     v forcats 0.5.1
## v purrr   0.3.4
## Warning: package 'tidyr' was built under R version 4.1.2
## Warning: package 'readr' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(hrbrthemes)
## Warning: package 'hrbrthemes' was built under R version 4.1.2
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
##       Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
##       if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
library(psych)
## Warning: package 'psych' was built under R version 4.1.2
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(viridis)
## Warning: package 'viridis' was built under R version 4.1.2
## Loading required package: viridisLite

Unit A: Data

Chapter 1. Collecting Data

1.1 The structure of Data

Statistics is the science of collecting, describing, and analyzing data.

DATA 1.1: StudentSurvey [1]

Case and Variables
The cases (or units) are the subjects or objects for which we have information; each case corresponds to a row of a data table. A variable is any characteristic recorded for each case; each variable corresponds to a column.

Example 1.1 Explain the information recorded for Student ID 1.

Student 1 is male, does not smoke, and would prefer an Olympic gold medal over a Nobel Prize or an Academy Award. He exercises 10 hours a week.

Categorical and Quantitative Variable
A categorical variable divides the cases into groups, placing each case into exactly one of two or more categories. A quantitative variable measures or records a numerical quantity for each case.

Example 1.2 Classify variables in StudentSurvey table 1.1.

Gender is categorical
Smoke is categorical
Award is categorical
Exercise, TV, GPA, and Pulse are all quantitative
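Birth (birth order) is an ambiguous variable: it could be treated as either categorical or quantitative, depending on how it is used.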
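The classification above can be roughly checked in R by looking at how each column is stored. The sketch below is only a heuristic under stated assumptions: the data frame `students` is a toy copy of the first ten rows of four StudentSurvey columns (taken from the `str()` output in these notes), and the rule "factor or character means categorical" is an assumption that fails for numerically coded categories such as birth order.

```r
# Toy data: first ten rows of four StudentSurvey columns, copied from the
# str() output shown in these notes (not the full dataset).
students <- data.frame(
  Sex      = factor(c("M", "F", "M", "M", "F", "F", "F", "M", "F", "F")),
  Smoke    = factor(c("No", "Yes", "No", "No", "No", "No", "No", "No", "No", "No")),
  Exercise = c(10, 4, 14, 3, 3, 5, 10, 13, 3, 12),
  Pulse    = c(54, 66, 130, 78, 40, 80, 94, 77, 60, 94)
)

# Heuristic (an assumption, not a rule): factors and character columns are
# treated as categorical, everything else as quantitative.
variable_type <- sapply(students, function(col) {
  if (is.factor(col) || is.character(col)) "categorical" else "quantitative"
})
print(variable_type)
```

A numeric column can still be categorical in meaning, so the result should always be checked against what the variable actually records.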

Investigating Variables and Relationships between Variables
DATA 1.2: AllCountries [2]

install.packages("Lock5Data", repos = "http://cran.us.r-project.org")
## Warning: package 'Lock5Data' is in use and will not be installed
library(Lock5Data)
head(AllCountries)
##          Country Code LandArea Population Density   GDP Rural  CO2 PumpPrice
## 1    Afghanistan  AFG   652.86     37.172    56.9   521  74.5 0.29      0.70
## 2        Albania  ALB    27.40      2.866   104.6  5254  39.7 1.98      1.36
## 3        Algeria  DZA  2381.74     42.228    17.7  4279  27.4 3.74      0.28
## 4 American Samoa  ASM     0.20      0.055   277.3    NA  12.8   NA        NA
## 5        Andorra  AND     0.47      0.077   163.8 42030  11.9 5.83        NA
## 6         Angola  AGO  1246.70     30.810    24.7  3432  34.5 1.29      0.97
##   Military Health ArmedForces Internet  Cell HIV Hunger Diabetes BirthRate
## 1     3.72   2.01         323     11.4  67.4  NA   30.3      9.6      32.5
## 2     4.08   9.51           9     71.8 123.7 0.1    5.5     10.1      11.7
## 3    13.81  10.73         317     47.7 111.0 0.1    4.7      6.7      22.3
## 4       NA     NA          NA       NA    NA  NA     NA       NA        NA
## 5       NA  14.02          NA     98.9 104.4  NA     NA      8.0        NA
## 6     9.40   5.43         117     14.3  44.7 1.9   23.9      3.9      41.3
##   DeathRate ElderlyPop LifeExpectancy FemaleLabor Unemployment Energy
## 1       6.6        2.6           64.0        50.3          1.5     NA
## 2       7.5       13.6           78.5        55.9         13.9    808
## 3       4.8        6.4           76.3        16.4         12.1   1328
## 4        NA         NA             NA          NA           NA     NA
## 5        NA         NA             NA          NA           NA     NA
## 6       8.4        2.5           61.8        76.4          7.3    545
##   Electricity Developed
## 1          NA        NA
## 2        2309         1
## 3        1363         1
## 4          NA        NA
## 5          NA        NA
## 6         312         1

Example 1.3

  1. To determine that 96.5% of Icelanders have access to the Internet (the highest rate), what are the cases and the variable? Is the variable categorical or quantitative?
  2. In the AllCountries dataset, we record the percent of people with access to the Internet for each country. What are the cases and the variable? Is the variable categorical or quantitative?

  1. The cases are people in Iceland, and the relevant variable is whether or not each person has access to the Internet. This is a categorical variable.
  2. The cases are the countries of the world, and the variable is the proportion of people with access to the Internet. It is quantitative.

Using Data to Answer a Question

We begin with a question of interest and then collect data that help to answer that question.

Example 1.4 Is there a “Sprinting Gene”?

  1. The cases are the people included in the study. One variable is whether the individual has the gene variant or not. Since we record simply “yes” or “no,” this is a categorical variable. The second variable keeps track of the group to which the individual belongs. This is also a categorical variable, with three possible categories (sprinter, marathon runner, or non-athlete). We are interested in the relationship between these two categorical variables.
  2. The table of data must record answers for each of these variables and may or may not have an identifier column. Table 1.2 shows a possible first two rows for this dataset.

TABLE 1.2 Possible table to investigate whether there is a sprinter’s gene.

Name    Gene Variant   Group
Allan   Yes            Marathon runner
Beth    No             Sprinter

Explanatory and Response Variables

If we use one variable to help understand or predict values of another variable, we call the former the explanatory variable and the latter the response variable.

Example 1.5 AllCountries dataset.

  1. Do countries larger in area tend to have a more rural population?
  2. Is the birth rate higher in developed or undeveloped countries?
  1. The question indicates that the area might influence the percent of a country that is rural, so we call area the explanatory variable and percent rural the response variable.
  2. The explanatory variable is whether the country is developed or undeveloped, and the response variable is birth rate.

SECTION LEARNING GOALS

  1. Recognize that a dataset consists of cases and variables

  2. Identify variables as either categorical or quantitative

  3. Determine explanatory and response variables where appropriate

  4. Describe how data might be used to address a specific question

  5. Recognize that understanding statistics allows us to investigate a wide variety of interesting phenomena.

1.2 Sampling from a Population

A population includes all individuals or objects of interest. Data are collected from a sample (n), which is a subset of the population.

Statistical inference is the process of using data from a sample to gain information about the population.

Figure 1.1 From population to sample and from sample to population

Sampling Bias occurs when the method of selecting a sample causes the sample to differ from the population in some relevant way. If sampling bias exists, then we cannot trust generalizations from the sample to the population.

Example 1.6 After a flight, one of the authors received an email from the airline asking her to fill out a survey regarding her satisfaction with the travel experience. The airline analyzes the data from all responses to such emails.

  1. The sample is all people who choose to fill out the survey and the population is all people who fly this airline
  2. The survey results will probably not accurately portray customer satisfaction. Many people won’t bother to fill out the survey if the flight was uneventful, while people with a particularly bad or good experience are more likely to fill out the survey.

In a simple random sample, each unit of the population has an equal chance of being selected, regardless of the other units chosen for the sample. Taking a simple random sample avoids sampling bias.

Bias exists when the method of collecting data causes the sample data to inaccurately reflect the population.

Example 1.7 Random sample using R for AllCountries dataset

sample(AllCountries$Country, size = 5, replace = FALSE) # simple random sample: no country is drawn twice
## [1] Mongolia Kuwait   Honduras Tuvalu   Tonga   
## 217 Levels: Afghanistan Albania Algeria American Samoa Andorra ... Zimbabwe
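The same idea extends to sampling whole rows (cases) rather than a single column. The sketch below assumes a hypothetical population data frame named `population` with made-up values; the seed is arbitrary and only makes the draw reproducible.

```r
# Hypothetical population of 200 cases (values are simulated, not real data).
set.seed(1)  # arbitrary seed so the same rows are drawn each time
population <- data.frame(id = 1:200,
                         value = rnorm(200, mean = 50, sd = 10))

# Simple random sample of 5 cases: every row has the same chance of selection,
# and replace = FALSE means no row can be chosen twice.
srs_rows <- sample(nrow(population), size = 5, replace = FALSE)
srs <- population[srs_rows, ]
print(srs)
```

Sampling row indices and then subsetting keeps all variables for each selected case together.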

SECTION LEARNING GOALS

  1. Distinguish between a sample and a population

  2. Recognize when it is appropriate to use sample data to infer information about the population

  3. Critically examine the way a sample is selected, identifying possible sources of sampling bias

  4. Recognize that random sampling is a powerful way to avoid sampling bias

  5. Identify other potential sources of bias that may arise in studies on humans

1.3 Experiments and Observational Studies

Association and Causation

Two variables are associated if values of one variable tend to be related to the values of the other variable. Two variables are causally associated if changing the value of one variable influences the value of the other variable.

Example 1.8 State whether each sentence implies no association between the variables, association without causation, or association with causation. If there is causation, which variable is explanatory and which is the response?

  1. Studies show that taking a practice exam increases your score on an exam.
  2. Families with many cars tend to also own many television sets.
  3. Sales are the same even with different levels of spending on advertising.
  4. Taking a low-dose aspirin a day reduces the risk of heart attacks.
  5. Goldfish who live in large ponds are usually larger than goldfish who live in small ponds.
  6. Putting a goldfish into a larger pond will cause it to grow larger.
  1. Association with causation: taking a practice exam (explanatory) causes an increase in the exam score (response).
  2. Association without causation.
  3. No association: sales do not vary in any systematic way as advertising varies.
  4. Association with causation: a daily low-dose aspirin (explanatory) causes the risk of heart attack (response) to go down.
  5. Association without causation.
  6. Association with causation: a larger pond (explanatory) causes the goldfish to grow larger (response).

A confounding variable, also known as a confounding factor or lurking variable, is a third variable that is associated with both the explanatory variable and the response variable. A confounding variable can offer a plausible explanation for an association between two variables of interest.

Example 1.9 Describe a possible confounding variable in Data 1.5 about vehicle registrations and life expectancy.

One confounding variable is the year. As time goes on, the population grows, so more vehicles are registered, and improvements in medical care help people live longer. Both variables naturally tend to increase as the year increases and may have little direct relationship with each other. The year offers a plausible explanation for the association between vehicle registrations and life expectancy.

Observational Studies vs Experiments

An experiment is a study in which the researcher actively controls one or more of the explanatory variables.

An observational study is a study in which the researcher does not actively control the value of any variable but simply observes the values as they naturally exist.

Example 1.10 Both studies below are designed to examine the effect of fertilizer on the yield of an apple orchard. Indicate whether each method of collecting the data is an experiment or an observational study.
(a) Researchers find several different apple orchards and record the amount of fertilizer that had been used and the yield of the orchards.
(b) Researchers find several different apple orchards and assign different amounts of fertilizer to each orchard. They record the resulting yield from each.

  1. This is an observational study, since data were recorded after the fact and no variables were actively manipulated. Notice that there are many possible confounding variables that might be associated with both the amount of fertilizer used and the yield, such as the quality of soil.

  2. This is an experiment since the amount of fertilizer was assigned to different orchards. The researchers actively manipulated the assignment of the fertilizer variable, in order to determine the effect on the yield variable.

Randomized Experiments

In a randomized experiment, the value of the explanatory variable for each unit is determined randomly, before the response variable is measured. If a randomized experiment yields an association between the two variables, we can establish a causal relationship from the explanatory to the response variable.
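Random assignment can be sketched in R with `sample()`. The example below is an illustration under stated assumptions: the twelve orchard names and the three fertilizer levels are hypothetical, and the seed is arbitrary.

```r
# Hypothetical experimental units and treatment levels (not from the text).
set.seed(42)  # arbitrary seed so the assignment can be reproduced
orchards   <- paste0("orchard_", 1:12)
treatments <- rep(c("low", "medium", "high"), each = 4)

# Shuffling the treatment labels assigns each orchard a fertilizer level at
# random, before any yield is measured.
assignment <- data.frame(orchard = orchards,
                         fertilizer = sample(treatments))
print(assignment)
print(table(assignment$fertilizer))
```

Because chance alone decides which orchard gets which level, confounding variables such as soil quality should be roughly balanced across the groups.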

Figure 1.2 Two fundamental questions about data collection

SECTION LEARNING GOALS

  1. Recognize that not every association implies causation

  2. Identify potential confounding variables in a study

  3. Distinguish between an observational study and a randomized experiment

  4. Recognize that only randomized experiments can lead to claims of causation

  5. Explain how and why placebos and blinding are used in experiments


Chapter 2. Describing Data

“Technology [has] allowed us to collect vast amounts of data in almost every business. The people who are able to in a sophisticated and practical way analyze that data are going to have terrific jobs.” - Chrystia Freeland, Managing Editor, Financial Times

2.1 Categorical Variables

One categorical variable

Proportion

\[\text{Proportion in a category} = \frac{\text{Number in that category}}{\text{Total number}}\]

Proportions are also called relative frequencies, and we can display them in a relative frequency table. The proportions in a relative frequency table will add to 1 (or approximately 1 if there is some round-off error). Relative frequencies allow us to make comparisons without referring to the sample size.

Example 2.1

tl.response <- c(1, 2, 3)
tl.frequency <- c(735, 1812, 78)
tl.relative <- c(0.28, 0.69, 0.03)
truelove <- data.frame(tl.response, tl.frequency, tl.relative)
truelove
##   tl.response tl.frequency tl.relative
## 1           1          735        0.28
## 2           2         1812        0.69
## 3           3           78        0.03
Table 2.1 A frequency table
Response     Frequency   Relative Frequency
Agree              735                 0.28
Disagree          1812                 0.69
Don’t know          78                 0.03
Total             2625                 1.00
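Rather than typing the relative frequencies by hand, they can be computed from the counts. This is a minimal sketch using the counts from Table 2.1; the object names `counts`, `relative`, and `freq_table` are chosen here for illustration.

```r
# Counts from Table 2.1, labeled by response.
counts <- c(Agree = 735, Disagree = 1812, "Don't know" = 78)

# prop.table() divides each count by the total, giving relative frequencies.
relative <- prop.table(counts)

freq_table <- data.frame(Frequency = counts,
                         Relative  = round(relative, 2))
print(freq_table)
print(sum(relative))  # relative frequencies add to 1
```

Computing the proportions from the counts avoids transcription errors and makes the rounding explicit.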

Notation for a Proportion

The proportion for a sample is denoted \(\hat{p}\) and read “p-hat.”

The proportion for a population is denoted p.

Two Categorical Variables: Two-Way Tables

A two-way table is used to show the relationship between two categorical variables. The categories for one variable are listed down the side (rows) and the categories for the second variable are listed across the top (columns). Each cell of the table contains the count of the number of data cases that are in both the row and column categories.

Table 2.2 Two-way table with row and column totals
             Male   Female   Total
Agree         372      363     735
Disagree      807     1005    1812
Don’t know     34       44      78
Total        1213     1412    2625

Example 2.2

  1. \[\text{Proportion of females who agree} = \frac{363}{1412} = 0.26\]
  2. \[\text{Proportion of those who agree that are female} = \frac{363}{735} = 0.49\]
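The two proportions differ only in which total is used as the denominator. The sketch below rebuilds Table 2.2 as an R table (the object name `two_way` is an illustrative choice) and uses `prop.table()` with a `margin` argument to pick the denominator.

```r
# Counts from Table 2.2 (rows: response, columns: sex).
two_way <- as.table(matrix(
  c(372,  363,
    807, 1005,
     34,   44),
  nrow = 3, byrow = TRUE,
  dimnames = list(Response = c("Agree", "Disagree", "Don't know"),
                  Sex      = c("Male", "Female"))
))
print(addmargins(two_way))  # adds the row and column totals

col_props <- prop.table(two_way, margin = 2)  # divide by column totals
row_props <- prop.table(two_way, margin = 1)  # divide by row totals

print(round(col_props["Agree", "Female"], 2))  # proportion of females who agree
print(round(row_props["Agree", "Female"], 2))  # proportion of agreers who are female
```

`margin = 2` conditions on the column (sex) and `margin = 1` conditions on the row (response), which matches the two questions in Example 2.2.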

SECTION LEARNING GOALS

  1. Display information from a categorical variable in a table or graph

  2. Use information about a categorical variable to find a proportion, with correct notation

  3. Display information about a relationship between two categorical variables in a two-way table

  4. Use a two-way table to find proportions

2.2 One Quantitative Variable: Shape and Center

The Shape of a Distribution

Dotplot

#DATA 2.2 Longevity of Mammals
#Example 2.3
#Load data
data("MammalLongevity")
str(MammalLongevity)
## 'data.frame':    40 obs. of  3 variables:
##  $ Animal   : Factor w/ 40 levels "baboon","bear,black",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Gestation: int  187 219 225 240 122 278 406 63 231 31 ...
##  $ Longevity: int  20 18 25 20 5 15 12 12 20 6 ...
library(extrafont)
## Registering fonts with R
loadfonts(device = "win")
ggplot(MammalLongevity, aes(x = Longevity)) +
  geom_dotplot(fill = "grey", binwidth = 1.5) +
  labs(title = "longevity of mammals",
       y = "Species",
       x = "Longevity") +
  theme_ipsum(base_family = "serif", base_size = 11.5)

An outlier is an observed value that is notably distinct from the other values in a dataset. Usually, an outlier is much larger or much smaller than the rest of the data values.

Histograms

Symmetric and Skewed Distribution

#Load StudentSurvey dataset
data("StudentSurvey")
str(StudentSurvey)
## 'data.frame':    362 obs. of  17 variables:
##  $ Year      : Factor w/ 5 levels "","FirstYear",..: 4 5 2 3 5 5 2 5 3 2 ...
##  $ Sex       : Factor w/ 2 levels "F","M": 2 1 2 2 1 1 1 2 1 1 ...
##  $ Smoke     : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 1 ...
##  $ Award     : Factor w/ 3 levels "Academy","Nobel",..: 3 1 2 2 2 2 3 3 2 2 ...
##  $ HigherSAT : Factor w/ 3 levels "","Math","Verbal": 2 2 2 2 3 3 2 2 3 2 ...
##  $ Exercise  : num  10 4 14 3 3 5 10 13 3 12 ...
##  $ TV        : int  1 7 5 1 3 4 10 8 6 1 ...
##  $ Height    : int  71 66 72 63 65 65 66 74 61 60 ...
##  $ Weight    : int  180 120 208 110 150 114 128 235 NA 115 ...
##  $ Siblings  : int  4 2 2 1 1 2 1 1 2 7 ...
##  $ BirthOrder: int  4 2 1 1 1 2 1 1 2 8 ...
##  $ VerbalSAT : int  540 520 550 490 720 600 640 660 550 670 ...
##  $ MathSAT   : int  670 630 560 630 450 550 680 710 550 700 ...
##  $ SAT       : int  1210 1150 1110 1120 1170 1150 1320 1370 1100 1370 ...
##  $ GPA       : num  3.13 2.5 2.55 3.1 2.7 3.2 2.77 3.3 2.8 3.7 ...
##  $ Pulse     : int  54 66 130 78 40 80 94 77 60 94 ...
##  $ Piercings : int  0 3 0 0 6 4 8 0 7 2 ...
#Histogram Pulse rate
m <- mean(StudentSurvey$Pulse)
std <- sd(StudentSurvey$Pulse)
hist(StudentSurvey$Pulse, breaks = 20, labels = FALSE, prob=T)
curve(dnorm(x, mean = m, sd=std), col="darkblue", lwd=2, add = TRUE)

#Histogram Exercise
m1 <- mean(StudentSurvey$Exercise, trim = 0, na.rm = TRUE)
std2 <- sd(StudentSurvey$Exercise, na.rm = TRUE) 
hist(StudentSurvey$Exercise, breaks = 20, labels = FALSE, prob=T)
curve(dnorm(x, mean = m1, sd = std2), col="darkblue", lwd=2, add=TRUE)

Common Shapes for Distributions

  • Symmetric if the two sides approximately match when folded on a vertical center line

  • Skewed to the right if the data are piled up on the left and the tail extends relatively far out to the right. Skewed to the left is the opposite: the data are piled up on the right and the tail extends to the left.

  • Bell-shaped if the data are symmetric and, in addition, have a shape like a bell.

The Center of a Distribution

Mean

The mean of the data values for a single quantitative variable is given by

\[\text{Mean} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x}{n}\]

Notation for a Mean

The mean of a sample is denoted \(\bar{x}\) and read “x-bar.” The mean of a population is denoted \(\mu\), “mu.”

Median

The median of a set of data values for a single quantitative variable, denoted m, is

  • the middle entry if an ordered list of the data values contains an odd number of entries, or

  • the average of the middle two values if an ordered list contains an even number of entries.

The median splits the data in half.

Resistance

In general, we say that a statistic is resistant if it is relatively unaffected by extreme values. The median is resistant, while the mean is not.
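A small numerical illustration of resistance, using made-up values: adding one extreme value moves the mean a great deal but barely changes the median.

```r
# Hypothetical data values chosen for illustration.
values       <- c(1, 3, 4, 5, 7)
with_outlier <- c(values, 100)  # same data plus one extreme value

mean_before   <- mean(values)          # 4
median_before <- median(values)        # 4
mean_after    <- mean(with_outlier)    # 20
median_after  <- median(with_outlier)  # 4.5

print(c(mean_before = mean_before, mean_after = mean_after))
print(c(median_before = median_before, median_after = median_after))
```

The mean jumps from 4 to 20 while the median only moves from 4 to 4.5, which is what "resistant" means in practice.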

DATA 2.4: FloridaLakes [3]

#Example 2.4 
data(FloridaLakes)
mflorida <- mean(FloridaLakes$Alkalinity, na.rm = TRUE)
#median() has no trim argument, so only na.rm is passed
meflorida <- median(FloridaLakes$Alkalinity, na.rm = TRUE)

#Histogram of alkalinity with the mean (dashed) and the median (solid)
FloridaLakes %>%
  ggplot(aes(x = Alkalinity)) +
  geom_histogram(binwidth = 10, fill = "#69b3a2", color = "#e9ecef", alpha = 0.9) +
  geom_vline(xintercept = mflorida, linetype = "dashed") +
  geom_vline(xintercept = meflorida, linetype = "solid")

SECTION LEARNING GOALS

• Use a dotplot or histogram to describe the shape of a distribution

• Find the mean and the median for a set of data values, with appropriate notation

• Identify the approximate locations of the mean and the median on a dotplot or histogram

• Explain how outliers and skewness affect the values for the mean and median

Exercises for Section 2.2

#Exercise 2.65 Population of States in the US
data("USStates")
mean.uss <- mean(USStates$Population, 
                 trim = 0, 
                 na.rm = TRUE)
sd.uss <- sd(USStates$Population, 
             na.rm = TRUE)
hist(USStates$Population, 
     labels = FALSE, 
     prob=T)
curve(dnorm(x, 
            mean = mean.uss, 
            sd = sd.uss), 
      col="darkblue", 
      lwd=2, add=TRUE)

  1. The values represent a population, not a sample.
  2. The shape of the distribution is skewed to the right, which means the mean is greater than the median.
#Exercise 2.75 Create a histogram 
data("AllCountries")
xbar <- mean(AllCountries$BirthRate, trim = 0, na.rm = TRUE)
xbar
## [1] 20.1104
sd.br <- sd(AllCountries$BirthRate, na.rm = TRUE)
sd.br
## [1] 9.977
hist(AllCountries$BirthRate, 
     main = "Histogram for Birthrate",
     xlab = "Birthrate",
     xlim = c(0,50),
     las=1, 
     breaks = 5,
     prob=T)
curve(dnorm(x,
            mean = xbar,
            sd = sd.br),
      col="darkblue",
      lwd=2, add=TRUE)

2.3 One Quantitative Variable: Measures of Spread

#Example 2.3.1 Des Moines vs San Francisco Temp
data("April14Temps")
str(April14Temps)
## 'data.frame':    25 obs. of  3 variables:
##  $ Year        : int  1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 ...
##  $ DesMoines   : num  56 37.5 37.2 56 54.3 63.3 54.7 60.6 70.6 53.7 ...
##  $ SanFrancisco: num  51 55.3 55.7 48.7 56.2 57.2 49.5 61 51.4 55.8 ...
summary(April14Temps)
##       Year        DesMoines      SanFrancisco  
##  Min.   :1995   Min.   :35.70   Min.   :48.70  
##  1st Qu.:2001   1st Qu.:44.40   1st Qu.:52.10  
##  Median :2007   Median :54.70   Median :54.20  
##  Mean   :2007   Mean   :53.73   Mean   :54.25  
##  3rd Qu.:2013   3rd Qu.:60.20   3rd Qu.:56.20  
##  Max.   :2019   Max.   :74.90   Max.   :61.00

The standard deviation for a quantitative variable measures the spread of the data in a sample:

\[\text{Standard deviation} = s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}\]

The standard deviation gives a rough estimate of the typical distance of a data value from the mean. The larger the standard deviation, the more variability there is in the data and the more spread out the data are. The standard deviation of a sample is denoted s, and the standard deviation of a population is denoted \(\sigma\), “sigma.”
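To connect the formula to R's `sd()`, the sketch below computes the sample standard deviation step by step and compares it with the built-in function. The vector `temps` holds the first ten Des Moines temperatures shown in the `str(April14Temps)` output above, so it is only a subset of the dataset.

```r
# First ten Des Moines temperatures from the str() output (a subset only).
temps <- c(56, 37.5, 37.2, 56, 54.3, 63.3, 54.7, 60.6, 70.6, 53.7)

n    <- length(temps)
xbar <- mean(temps)

# Sum of squared deviations from the mean, divided by n - 1, then square root.
s_manual  <- sqrt(sum((temps - xbar)^2) / (n - 1))
s_builtin <- sd(temps)

print(c(manual = s_manual, builtin = s_builtin))
```

Both values agree because `sd()` uses the same n - 1 denominator as the formula.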

The 95% Rule

If a distribution of data is approximately symmetric and bell-shaped,
about 95% of the data should fall within two standard deviations of
the mean. This means that about 95% of the data in a sample from a
bell-shaped distribution should fall in the interval from \(\bar{x} - 2s\) to \(\bar{x} + 2s\).
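The 95% rule can be checked by simulation. The sketch below draws hypothetical bell-shaped data with `rnorm()` (the mean of 100, standard deviation of 15, sample size, and seed are all arbitrary choices) and counts the share of values within two standard deviations of the mean.

```r
# Simulated bell-shaped data; all parameters here are arbitrary assumptions.
set.seed(2024)
x <- rnorm(10000, mean = 100, sd = 15)

lower <- mean(x) - 2 * sd(x)
upper <- mean(x) + 2 * sd(x)

# Proportion of simulated values that fall inside the interval.
prop_within <- mean(x >= lower & x <= upper)
print(prop_within)
```

For data that are clearly skewed, the same check would typically give a proportion further from 0.95, which is why the rule is stated only for symmetric, bell-shaped distributions.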

z-Scores

A common way to determine how unusual a single data value is, independent of the units used, is to count how many standard deviations it is away from the mean.

\[z\text{-score} = \frac{x - \bar{x}}{s}\]

Example 2.3.2 A patient has a high systolic blood pressure of 204 mmHg and a low pulse rate of 52 bpm. Which of these values is more unusual relative to the other patients in the sample? The summary statistics for systolic blood pressure show a mean of 132.3 and standard deviation of 32.95, while the heart rates have a mean of 98.9 and standard deviation of 26.83.

\[\text{Blood pressure: } z = \frac{x - \bar{x}}{s} = \frac{204 - 132.3}{32.95} = 2.18\]

This patient’s blood pressure is slightly more than two standard deviations above the sample mean.

\[\text{Heart rate: } z = \frac{x - \bar{x}}{s} = \frac{52 - 98.9}{26.83} = -1.75\]

This patient’s heart rate is less than two standard deviations below the sample mean heart rate. The high blood pressure is somewhat more unusual than the low heart rate.
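The two z-scores can be reproduced in R from the summary statistics given in the example. The helper `z_score()` is a hypothetical function written for this sketch, not part of any package.

```r
# Hypothetical helper: number of standard deviations a value lies from the mean.
z_score <- function(x, center, spread) {
  (x - center) / spread
}

# Summary statistics taken from Example 2.3.2.
z_bp <- z_score(204, center = 132.3, spread = 32.95)  # systolic blood pressure
z_hr <- z_score(52,  center = 98.9,  spread = 26.83)  # heart rate

print(round(c(blood_pressure = z_bp, heart_rate = z_hr), 2))
```

Comparing absolute values, `abs(z_bp)` is larger than `abs(z_hr)`, which supports the conclusion that the blood pressure reading is the more unusual of the two.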

The Pth percentile is the value of a quantitative variable that is greater than P percent of the data.

#Example 2.3.3 
data("SandP500")
hist(SandP500$Volume,
     main = "Histogram of SandP500 Volume",
     density = NULL, angle = 45, col = "#6666cc", border = NULL,
     xlim = c(1400,6600),
     breaks = 30,
     xlab = "Volume (million)")

summary(SandP500$Volume)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1652    3234    3475    3612    3843    7609

Five Number Summary = (minimum, Q1, median, Q3, maximum), where

  • Q1 = First Quartile = 25th percentile

  • Q3 = Third Quartile = 75th percentile

Range = Maximum − Minimum

Interquartile range (IQR) = Q3 − Q1
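These summaries can be computed directly in R. The sketch below uses a small set of eleven values and R's default quantile method; other software, and R's own `fivenum()`, may use a slightly different rule for quartiles and so can give slightly different values of Q1 and Q3.

```r
# Eleven data values used for illustration.
x <- c(1, 3, 4, 5, 7, 10, 18, 20, 25, 31, 42)

# Five number summary via percentiles (R's default quantile method).
five_num <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))
print(five_num)

data_range <- max(x) - min(x)  # Range = Maximum - Minimum
iqr        <- IQR(x)           # IQR = Q3 - Q1

print(c(range = data_range, IQR = iqr))
```

The range depends only on the two most extreme values, while the IQR describes the spread of the middle 50% of the data and is therefore resistant to outliers.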

SECTION LEARNING GOALS

  • Use technology to compute summary statistics for a quantitative variable

  • Recognize the uses and meaning of the standard deviation

  • Compute and interpret a z-score

  • Interpret a five number summary or percentiles

  • Use the range, the interquartile range, and the sd as measures of spread.

Exercises for Section 2.3

#1)
summary(c(1, 3, 4, 5, 7, 10, 18, 20, 25, 31, 42))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    4.50   10.00   15.09   22.50   42.00
#2) The variable TV in StudentSurvey dataset.
summary(StudentSurvey$TV)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   3.000   5.000   6.504   9.000  40.000       1
summary(c(45, 46, 48, 49, 49, 50, 50, 52, 52, 54, 57, 57, 58, 58, 60,61))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   45.00   49.00   52.00   52.88   57.25   61.00
sd(c(45, 46, 48, 49, 49, 50, 50, 52, 52, 54, 57, 57, 58, 58, 60, 61))
## [1] 5.07116
#3) Percent Obese by State
data("USStates")
summary(USStates$Obese)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   23.00   28.77   30.90   31.43   34.38   39.50

2.4 Boxplots and Quantitative/ Categorical Relationships

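Boxplots

A boxplot is a graphical display of the five number summary for a single quantitative variable. It shows: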
  • A numerical scale appropriate for the data values.

  • A box stretching from Q1 to Q3.

  • A line that divides the box drawn at the median.

  • A line from each quartile to the most extreme data value that is not
    an outlier.

  • Each outlier plotted individually with a symbol such as an asterisk
    or a dot.

Example 2.4.1

# DATA 2.7 Hollywood Movies
data("HollywoodMovies")
boxplot(HollywoodMovies$Budget,
        ylim= c(0,400),
        ylab="millions of dollars")

Detection of Outliers

IQR Method for Detecting Outliers

For boxplots, we call a data value an outlier if it is smaller than Q1 − 1.5(IQR) or larger than Q3 + 1.5(IQR).

Example 2.4.2

summary(MammalLongevity$Longevity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    8.00   12.00   13.15   15.25   40.00

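The calculation below uses Q3 = 16, so IQR = 16 − 8 = 8. R's default quartile method reports Q3 = 15.25 in the summary above, which would give IQR = 7.25 and cutoffs of −2.875 and 26.125 instead.

Q1 − 1.5(IQR) = 8 − 1.5(8) = 8 − 12 = −4

Q3 + 1.5(IQR) = 16 + 1.5(8) = 16 + 12 = 28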

Clearly, there are no mammals with negative lifetimes, so there can be no outliers below the lower value of −4. On the upper side, the elephant, as expected, is clearly an outlier beyond the value of 28 years. No other mammal in this sample exceeds that value, so the elephant is the only outlier in the longevity data.
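The IQR method is easy to wrap in a small function. The sketch below is an illustration under stated assumptions: `iqr_outliers()` is a hypothetical helper, it uses R's default quantile method (so its cutoffs can differ slightly from hand calculations that use another quartile rule), and `lifetimes` is a made-up vector rather than the MammalLongevity data.

```r
# Hypothetical helper: return the values flagged as outliers by the IQR method.
iqr_outliers <- function(x) {
  q     <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  iqr   <- q[[2]] - q[[1]]
  lower <- q[[1]] - 1.5 * iqr  # Q1 - 1.5(IQR)
  upper <- q[[2]] + 1.5 * iqr  # Q3 + 1.5(IQR)
  x[!is.na(x) & (x < lower | x > upper)]
}

# Made-up lifetimes (in years) with one unusually large value.
lifetimes <- c(1, 8, 10, 12, 12, 15, 16, 20, 25, 40)
print(iqr_outliers(lifetimes))
```

For these values Q1 = 10.5 and Q3 = 19, so the cutoffs are −2.25 and 31.75, and only the value 40 is flagged.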

One Quantitative and One Categorical Variable

Visualizing a Relationship between Quantitative and Categorical Variables

Example 2.4.3

data("StudentSurvey")
str(StudentSurvey)
## 'data.frame':    362 obs. of  17 variables:
##  $ Year      : Factor w/ 5 levels "","FirstYear",..: 4 5 2 3 5 5 2 5 3 2 ...
##  $ Sex       : Factor w/ 2 levels "F","M": 2 1 2 2 1 1 1 2 1 1 ...
##  $ Smoke     : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 1 ...
##  $ Award     : Factor w/ 3 levels "Academy","Nobel",..: 3 1 2 2 2 2 3 3 2 2 ...
##  $ HigherSAT : Factor w/ 3 levels "","Math","Verbal": 2 2 2 2 3 3 2 2 3 2 ...
##  $ Exercise  : num  10 4 14 3 3 5 10 13 3 12 ...
##  $ TV        : int  1 7 5 1 3 4 10 8 6 1 ...
##  $ Height    : int  71 66 72 63 65 65 66 74 61 60 ...
##  $ Weight    : int  180 120 208 110 150 114 128 235 NA 115 ...
##  $ Siblings  : int  4 2 2 1 1 2 1 1 2 7 ...
##  $ BirthOrder: int  4 2 1 1 1 2 1 1 2 8 ...
##  $ VerbalSAT : int  540 520 550 490 720 600 640 660 550 670 ...
##  $ MathSAT   : int  670 630 560 630 450 550 680 710 550 700 ...
##  $ SAT       : int  1210 1150 1110 1120 1170 1150 1320 1370 1100 1370 ...
##  $ GPA       : num  3.13 2.5 2.55 3.1 2.7 3.2 2.77 3.3 2.8 3.7 ...
##  $ Pulse     : int  54 66 130 78 40 80 94 77 60 94 ...
##  $ Piercings : int  0 3 0 0 6 4 8 0 7 2 ...
#show the distribution of hours spent watching television for males and females
#using side-by-side boxplots and dotplots
StudentSurvey$Sex <- factor(StudentSurvey$Sex)
#as.numeric() keeps TV a plain numeric vector that ggplot2 can scale directly
StudentSurvey$TV <- as.numeric(StudentSurvey$TV)
ggplot(data = StudentSurvey, aes(y=TV, x=Sex, fill=Sex)) +
  geom_boxplot() +
  ggtitle("Hours Watching TV/Week")
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).

ggplot(data = StudentSurvey, aes(x=TV)) +
  geom_dotplot(binwidth = 2, fill="Gray") +
  ggtitle("Hours Watching TV/Week") +
  xlab("TV") +
  ylab("") +
  facet_grid(. ~ Sex)
## Warning: Removed 1 rows containing non-finite values (stat_bindot).

#Both distributions are skewed to the right and have many outliers. There appears
#to be an association: In this group of students, males tend to watch more
#television. In fact, we see that the females who watch about 15 hours of TV a week
#are considered outliers, whereas males who watch the same amount of television are
#not so unusual. The minimum, first quartile, and median are relatively similar for
#the two genders, but the third quartile for males is much higher than the third
#quartile for females and the maximum for males is also much higher. While the
#medians are similar, the distribution for males is more highly skewed to the right,
#so the mean for males will be higher than the mean for females.

SECTION LEARNING GOALS

  • Identify outliers in a dataset based on the IQR method

  • Use a boxplot to describe data for a single quantitative variable

  • Use a side-by-side graph to visualize a relationship between quantitative and categorical variables

2.5 Two Quantitative Variables: Scatterplot and Correlation

2.6 Two Quantitative Variables: Linear Regression

2.7 Data Visualization and Multiple Variables

Chapter 3. Confidence Intervals

Chapter 4. Hypothesis Tests

Unit B: Understanding Inference

Unit B: Essential Synthesis

Unit C: Inference with Normal and t-Distributions

Unit D: Inference for Multiple Parameters

The Big Picture: Essential Synthesis


  1. http://www.wiley.com/college/lock

  2. http://data.worldbank.org/indicator/IT.NET.USER.P2

  3. Lange, T., Royals, H., and Connor, L., “Mercury accumulation in largemouth bass (Micropterus salmoides) in a Florida Lake,” Archives of Environmental Contamination and Toxicology, 2004; 27(4): 466–471.