knitr::opts_chunk$set(echo = TRUE)
library(Lock5Data)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.1.2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.2
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.1.0 v forcats 0.5.1
## v purrr 0.3.4
## Warning: package 'tidyr' was built under R version 4.1.2
## Warning: package 'readr' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(hrbrthemes)
## Warning: package 'hrbrthemes' was built under R version 4.1.2
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
## Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
## if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
library(psych)
## Warning: package 'psych' was built under R version 4.1.2
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(viridis)
## Warning: package 'viridis' was built under R version 4.1.2
## Loading required package: viridisLite
Collect and describe data
Summarize and visualize data
Statistics is the science of collecting, describing, and analyzing data.
DATA 1.1: StudentSurvey1
Case and Variables
A case (or unit) is the subject or object about which information is recorded; each case corresponds to a row of a data table. A variable is a characteristic recorded for each case; each variable corresponds to a column.
Example 1.1 Describe the first case (student 1) in the StudentSurvey data.
Student 1 is male, does not smoke, and would prefer to win an Olympic gold medal rather than a Nobel Prize or an Academy Award. He exercises 10 hours a week.
Categorical and Quantitative Variable
A categorical variable divides the cases into groups, placing each case into exactly one of two or more categories. A quantitative variable measures or records a numerical quantity for each case.
Example 1.2 Classify variables in StudentSurvey table 1.1.
Gender is categorical
Smoke is categorical
Award is categorical
Exercise, TV, GPA, and Pulse are all quantitative
Birth is ambiguous: it could reasonably be treated as either categorical or quantitative.
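One way to check a classification like this in R is to look at how each column is stored. The sketch below uses a small made-up data frame (the object `survey` and its values are hypothetical, not the real StudentSurvey data) and labels numeric columns as quantitative and everything else as categorical. It is only a first pass: a variable coded with numbers can still be categorical, so the result should be checked against what the variable actually means.

```r
# Hypothetical mini survey; values are invented for illustration only
survey <- data.frame(
  Gender   = factor(c("M", "F", "M")),
  Smoke    = factor(c("No", "Yes", "No")),
  Exercise = c(10, 4, 14),
  GPA      = c(3.13, 2.50, 2.55)
)

# Numeric columns are treated as quantitative, all others as categorical
var_type <- sapply(survey, function(col) {
  if (is.numeric(col)) "quantitative" else "categorical"
})
var_type
```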
Investigating Variables and Relationships between Variables
DATA 1.2 AllCountries2
install.packages("Lock5Data", repos = "http://cran.us.r-project.org")
## Warning: package 'Lock5Data' is in use and will not be installed
library(Lock5Data)
head(AllCountries)
## Country Code LandArea Population Density GDP Rural CO2 PumpPrice
## 1 Afghanistan AFG 652.86 37.172 56.9 521 74.5 0.29 0.70
## 2 Albania ALB 27.40 2.866 104.6 5254 39.7 1.98 1.36
## 3 Algeria DZA 2381.74 42.228 17.7 4279 27.4 3.74 0.28
## 4 American Samoa ASM 0.20 0.055 277.3 NA 12.8 NA NA
## 5 Andorra AND 0.47 0.077 163.8 42030 11.9 5.83 NA
## 6 Angola AGO 1246.70 30.810 24.7 3432 34.5 1.29 0.97
## Military Health ArmedForces Internet Cell HIV Hunger Diabetes BirthRate
## 1 3.72 2.01 323 11.4 67.4 NA 30.3 9.6 32.5
## 2 4.08 9.51 9 71.8 123.7 0.1 5.5 10.1 11.7
## 3 13.81 10.73 317 47.7 111.0 0.1 4.7 6.7 22.3
## 4 NA NA NA NA NA NA NA NA NA
## 5 NA 14.02 NA 98.9 104.4 NA NA 8.0 NA
## 6 9.40 5.43 117 14.3 44.7 1.9 23.9 3.9 41.3
## DeathRate ElderlyPop LifeExpectancy FemaleLabor Unemployment Energy
## 1 6.6 2.6 64.0 50.3 1.5 NA
## 2 7.5 13.6 78.5 55.9 13.9 808
## 3 4.8 6.4 76.3 16.4 12.1 1328
## 4 NA NA NA NA NA NA
## 5 NA NA NA NA NA NA
## 6 8.4 2.5 61.8 76.4 7.3 545
## Electricity Developed
## 1 NA NA
## 2 2309 1
## 3 1363 1
## 4 NA NA
## 5 NA NA
## 6 312 1
Example 1.3
- The cases are people in Iceland, and the relevant variable is whether or not each person has access to the Internet. This is a categorical variable.
- The cases are countries of the world, and the variable is the proportion of each country's population with access to the Internet. It is quantitative.
Using Data to Answer a Question
Start with a question of interest, then collect data that help to answer that question.
Example 1.4 Is there a “Sprinting Gene”?
TABLE 1.2 Possible table to investigate whether there is a sprinter’s gene.
| Name | Gene Variant | Group |
|---|---|---|
| Allan | Yes | Marathon runner |
| Beth | No | Sprinter |
| … | … | … |
Explanatory and Response Variables
If one variable helps us understand or predict the values of another variable, we call the former the explanatory variable and the latter the response variable.
Example 1.5 AllCountries dataset.
- The question indicates that the area might influence the percent of a country that is rural, so we call area the explanatory variable and percent rural the response variable.
- The explanatory variable is whether the country is developed or undeveloped, and the response variable is birth rate.
SECTION LEARNING GOALS
Recognize that a dataset consists of cases and variables
Identify variables as either categorical or quantitative
Determine explanatory and response variables where appropriate
Describe how data might be used to address a specific question
Recognize that understanding statistics allows us to investigate a wide variety of interesting phenomena.
A population includes all individuals or objects of interest. Data are collected from a sample (n), which is a subset of the population.
Statistical inference is the process of using data from a sample to gain information about the population.
Figure 1.1 From population to sample and from sample to population
Sampling Bias occurs when the method of selecting a sample causes the sample to differ from the population in some relevant way. If sampling bias exists, then we cannot trust generalizations from the sample to the population.
Example 1.6 After a flight, one of the authors received an email from the airline asking her to fill out a survey regarding her satisfaction with the travel experience. The airline analyzes the data from all responses to such emails.
- The sample is all people who choose to fill out the survey and the population is all people who fly this airline
- The survey results will probably not accurately portray customer satisfaction. Many people won’t bother to fill out the survey if the flight was uneventful, while people with a particularly bad or good experience are more likely to fill out the survey.
Simple Random Sample: each unit of the population has an equal chance of being selected, regardless of the other units chosen for the sample. Taking a simple random sample avoids sampling bias.
Bias exists when the method of collecting data causes the sample data to inaccurately reflect the population.
Example 1.7 Random sample using R for AllCountries dataset
sample(AllCountries$Country, size = 5, replace = TRUE) #replace = TRUE allows repeated values; a simple random sample would use replace = FALSE
## [1] Mongolia Kuwait Honduras Tuvalu Tonga
## 217 Levels: Afghanistan Albania Algeria American Samoa Andorra ... Zimbabwe
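As a complement, here is a minimal sketch of a simple random sample drawn without replacement, so that no unit can appear twice. The population vector and the seed are made up for illustration; only `sample()` and `set.seed()` from base R are used.

```r
set.seed(42)  # arbitrary seed, used only to make the draw reproducible

# Hypothetical population of 20 labelled units
population <- paste0("unit_", 1:20)

# replace = FALSE (the default) means no unit can be drawn twice
srs <- sample(population, size = 5, replace = FALSE)
srs
```

Because every unit has the same chance of being chosen, the selection method itself does not favor any part of the population.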
SECTION LEARNING GOALS
Distinguish between a sample and a population
Recognize when it is appropriate to use sample data to infer information about the population
Critically examine the way a sample is selected, identifying possible sources of sampling bias
Recognize that random sampling is a powerful way to avoid sampling bias
Identify other potential sources of bias that may arise in studies on humans
Association and Causation
Two variables are associated if values of one variable tend to be related to the values of the other variable. Two variables are causally associated if changing the value of one variable influences the value of the other variable.
Example 1.8 State whether each sentence implies no association, association without causation, or association with causation. If there is causation, which variable is explanatory and which is the response?
- Taking a practice exam causes an increase in the exam grade -> association with causation.
- Association without causation.
- Because sales don’t vary in any systematic way as advertising varies, there is no association.
- Association with causation. A daily low-dose aspirin causes heart attack risk to go down.
- Association without causation.
- Association with causation. A larger pond causes goldfish to grow larger.
A confounding variable, also known as a confounding factor or lurking variable, is a third variable that is associated with both the explanatory variable and the response variable. A confounding variable can offer a plausible explanation for an association between two variables of interest.
Example 1.9 Describe a possible confounding variable in Data 1.5 about vehicle registrations and life expectancy.
One confounding variable is the year. As time goes along, the population grows so more vehicles are registered and improvements in medical care help people live longer. Both variables naturally tend to increase as the year increases and may have little direct relationship with each other. The years are an explanation for the association between vehicle registrations and life expectancy.
Observational Studies vs Experiments
An experiment is a study in which the researcher actively controls one or more of the explanatory variables.
An observational study is a study in which the researcher does not actively control the value of any variable but simply observes the values as they naturally exist.
Example 2.0 Both studies below are designed to examine the effect of fertilizer on the yield of an apple orchard. Indicate whether each method of collecting the data is an experiment or an observational study.
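(a) Researchers find several different apple orchards and record the amount of fertilizer that had been used and the yield of the orchards. This is an observational study, because the researchers only record values that already exist.
(b) Researchers find several different apple orchards and assign different amounts of fertilizer to each orchard. They record the resulting yield from each. This is an experiment, because the researchers actively control the explanatory variable.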
In a randomized experiment the value of the explanatory variable for each unit is determined randomly, before the response variable is measured. If a randomized experiment yields an association between the two variables, we can establish a causal relationship from the explanatory to the response variable.
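Random assignment can be sketched in a few lines of base R. The orchard names, the two fertilizer levels, and the seed below are all hypothetical; the point is only that chance, not the researcher, decides which unit receives which treatment.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical orchards and fertilizer levels (names are made up)
orchards   <- paste0("orchard_", 1:6)
treatments <- rep(c("low", "high"), each = 3)

# Shuffle the treatment labels so chance alone decides the assignment
assignment <- data.frame(
  orchard    = orchards,
  fertilizer = sample(treatments)
)
assignment
```

Shuffling a fixed set of labels keeps the two groups the same size, which is one common design choice among several.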
Figure 1.2 Two fundamental questions about data collection
SECTION LEARNING GOALS
Recognize that not every association implies causation
Identify potential confounding variables in a study
Distinguish between an observational study and a randomized experiment
Recognize that only randomized experiments can lead to claims of causation
Explain how and why placebos and blinding are used in experiments
“Technology [has] allowed us to collect vast amounts of data in almost every business. The people who are able to in a sophisticated and practical way analyze that data are going to have terrific jobs.” - Chrystia Freeland, Managing Editor, Financial Times
One categorical variable
Proportion
\[ \text{Proportion in a category} = \frac{\text{Number in that category}}{\text{Total number}} \]
Proportions are also called relative frequencies, and we can display them in a relative frequency table. The proportions in a relative frequency table will add to 1 (or approximately 1 if there is some round-off error). Relative frequencies allow us to make comparisons without referring to the sample size.
Example 2.1
tl.response <- c(1, 2, 3)
tl.frequency <- c(735, 1812, 78)
tl.relative <- c(0.28, 0.69, 0.03)
truelove <- data.frame(tl.response, tl.frequency, tl.relative)
truelove
## tl.response tl.frequency tl.relative
## 1 1 735 0.28
## 2 2 1812 0.69
## 3 3 78 0.03
| Response | Frequency | Relative Frequency |
|---|---|---|
| Agree | 735 | 0.28 |
| Disagree | 1812 | 0.69 |
| Don’t know | 78 | 0.03 |
| Total | 2625 | 1.00 |
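The relative frequencies in this table can be computed from the counts rather than typed in by hand. The sketch below assumes only the three counts shown in the table; the object names `counts` and `rel_freq` are our own.

```r
# Counts from the table above
counts <- c(Agree = 735, Disagree = 1812, "Don't know" = 78)

# Each proportion is the count in a category divided by the total count
rel_freq <- counts / sum(counts)
round(rel_freq, 2)
```

Computing the proportions from the counts guarantees that they sum to 1 before rounding.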
Notation for a Proportion
The proportion for a sample is denoted \(\hat{p}\) and read “p-hat.”
The proportion for a population is denoted p.
Two Categorical Variables: Two-Way Tables
A two-way table is used to show the relationship between two categorical variables. The categories for one variable are listed down the side (rows) and the categories for the second variable are listed across the top (columns). Each cell of the table contains the count of the number of data cases that are in both the row and column categories.
| | Male | Female | Total |
|---|---|---|---|
| Agree | 372 | 363 | 735 |
| Disagree | 807 | 1005 | 1812 |
| Don’t know | 34 | 44 | 78 |
| Total | 1213 | 1412 | 2625 |
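A two-way table like this one can be entered in R as a matrix of counts, from which totals and proportions follow. This is a minimal sketch using the counts above; the object names `love` and `col_props` are our own, and the choice to compute proportions within each sex (column proportions) is just one of several possible comparisons.

```r
# Two-way table of counts from the table above (rows: response, columns: sex)
love <- matrix(
  c(372, 363,
    807, 1005,
    34,  44),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Response = c("Agree", "Disagree", "Don't know"),
    Sex      = c("Male", "Female")
  )
)

addmargins(love)                           # adds row and column totals
col_props <- prop.table(love, margin = 2)  # proportions within each sex
round(col_props, 3)
```

For instance, the proportion of males who agree is 372/1213, and the proportion of females who agree is 363/1412.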
Example 2.2
SECTION LEARNING GOALS
Display information from a categorical variable in a table or graph
Use information about a categorical variable to find a proportion, with correct notation
Display information about a relationship between two categorical variables in a two-way table
Use a two-way table to find proportions
The Shape of a Distribution
Dotplot
#DATA 2.2 Longevity of Mammals
#Example 2.3
#Load data
data("MammalLongevity")
str(MammalLongevity)
## 'data.frame': 40 obs. of 3 variables:
## $ Animal : Factor w/ 40 levels "baboon","bear,black",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Gestation: int 187 219 225 240 122 278 406 63 231 31 ...
## $ Longevity: int 20 18 25 20 5 15 12 12 20 6 ...
library(extrafont)
## Registering fonts with R
loadfonts(device = "win")
ggplot(MammalLongevity, aes(x = Longevity)) +
  geom_dotplot(fill = "grey", binwidth = 1.5) +
  labs(title = "Longevity of mammals",
       y = "Species",
       x = "Longevity") +
  theme_ipsum(base_family = "serif", base_size = 11.5)
An outlier is an observed value that is notably distinct from the other values in a dataset. Usually, an outlier is much larger or much smaller than the rest of the data values.
Histograms
Symmetric and Skewed Distribution
#Load StudentSurvey dataset
data("StudentSurvey")
str(StudentSurvey)
## 'data.frame': 362 obs. of 17 variables:
## $ Year : Factor w/ 5 levels "","FirstYear",..: 4 5 2 3 5 5 2 5 3 2 ...
## $ Sex : Factor w/ 2 levels "F","M": 2 1 2 2 1 1 1 2 1 1 ...
## $ Smoke : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 1 ...
## $ Award : Factor w/ 3 levels "Academy","Nobel",..: 3 1 2 2 2 2 3 3 2 2 ...
## $ HigherSAT : Factor w/ 3 levels "","Math","Verbal": 2 2 2 2 3 3 2 2 3 2 ...
## $ Exercise : num 10 4 14 3 3 5 10 13 3 12 ...
## $ TV : int 1 7 5 1 3 4 10 8 6 1 ...
## $ Height : int 71 66 72 63 65 65 66 74 61 60 ...
## $ Weight : int 180 120 208 110 150 114 128 235 NA 115 ...
## $ Siblings : int 4 2 2 1 1 2 1 1 2 7 ...
## $ BirthOrder: int 4 2 1 1 1 2 1 1 2 8 ...
## $ VerbalSAT : int 540 520 550 490 720 600 640 660 550 670 ...
## $ MathSAT : int 670 630 560 630 450 550 680 710 550 700 ...
## $ SAT : int 1210 1150 1110 1120 1170 1150 1320 1370 1100 1370 ...
## $ GPA : num 3.13 2.5 2.55 3.1 2.7 3.2 2.77 3.3 2.8 3.7 ...
## $ Pulse : int 54 66 130 78 40 80 94 77 60 94 ...
## $ Piercings : int 0 3 0 0 6 4 8 0 7 2 ...
#Histogram Pulse rate
m <- mean(StudentSurvey$Pulse)
std <- sd(StudentSurvey$Pulse)
hist(StudentSurvey$Pulse, breaks = 20, labels = FALSE, prob=T)
curve(dnorm(x, mean = m, sd=std), col="darkblue", lwd=2, add = TRUE)
#Histogram Exercise
m1 <- mean(StudentSurvey$Exercise, trim = 0, na.rm = TRUE)
std2 <- sd(StudentSurvey$Exercise, na.rm = TRUE)
hist(StudentSurvey$Exercise, breaks = 20, labels = FALSE, prob=T)
curve(dnorm(x, mean = m1, sd = std2), col="darkblue", lwd=2, add=TRUE)
Common Shapes for Distributions
Symmetric if the two sides approximately match when folded on a vertical center line.
Skewed to the right if the data are piled up on the left and the tail extends relatively far out to the right; skewed to the left if the data are piled up on the right and the tail extends relatively far out to the left.
Bell-shaped if the data are symmetric and, in addition, have a shape like a bell.
The Center of a Distribution
Mean
The mean of the data values for a single quantitative variable is given by
\[ \text{Mean} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x}{n} \]
Notation for a Mean
The mean of a sample is denoted \(\bar{x}\) and read “x-bar.” The mean of a population is denoted \(\mu\), the Greek letter “mu.”
Median
The median of a set of data values for a single quantitative variable, denoted m, is
• the middle entry if an ordered list of the data values contains an odd number of entries, or
• the average of the middle two values if an ordered list contains an even number of entries.
The median splits the data in half.
Resistance
In general, we say that a statistic is resistant if it is relatively unaffected by extreme values. The median is resistant, while the mean is not.
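A small made-up example shows resistance in action. The two vectors below are invented for illustration and differ only in their largest value.

```r
x         <- c(1, 2, 3, 4, 5)
x_outlier <- c(1, 2, 3, 4, 500)  # same data with one extreme value

mean(x);         median(x)          # 3 and 3
mean(x_outlier); median(x_outlier)  # 102 and 3
```

Replacing a single value moves the mean from 3 to 102, while the median stays at 3.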
DATA 2.4 FloridaLakes3
#Example 2.4
data(FloridaLakes)
#Mean and median alkalinity; median() has no trim argument, so it is omitted
mflorida <- mean(FloridaLakes$Alkalinity, na.rm = TRUE)
meflorida <- median(FloridaLakes$Alkalinity, na.rm = TRUE)
#Histogram with the mean (dashed line) and the median (solid line)
FloridaLakes %>%
  ggplot(aes(x = Alkalinity)) +
  geom_histogram(binwidth = 10, fill = "#69b3a2", color = "#e9ecef", alpha = 0.9) +
  geom_vline(xintercept = mflorida, linetype = "dashed") +
  geom_vline(xintercept = meflorida, linetype = "solid")
SECTION LEARNING GOALS
• Use a dotplot or histogram to describe the shape of a distribution
• Find the mean and the median for a set of data values, with appropriate notation
• Identify the approximate locations of the mean and the median on a dotplot or histogram
• Explain how outliers and skewness affect the values for the mean and median
Exercises for Section 2.2
#Exercise 2.65 Population of States in the US
data("USStates")
mean.uss <- mean(USStates$Population,
trim = 0,
na.rm = TRUE)
sd.uss <- sd(USStates$Population,
na.rm = TRUE)
hist(USStates$Population,
labels = FALSE,
prob=T)
curve(dnorm(x,
mean = mean.uss,
sd = sd.uss),
col="darkblue",
lwd=2, add=TRUE)
- The values represent a population, not a sample.
- The distribution is skewed to the right, which means the mean is greater than the median.
#Exercise 2.75 Create a histogram
data("AllCountries")
xbar <- mean(AllCountries$BirthRate, trim = 0, na.rm = TRUE)
xbar
## [1] 20.1104
sd.br <- sd(AllCountries$BirthRate, na.rm = TRUE)
sd.br
## [1] 9.977
hist(AllCountries$BirthRate,
main = "Histogram for Birthrate",
xlab = "Birthrate",
xlim = c(0,50),
las=1,
breaks = 5,
prob=T)
curve(dnorm(x,
mean = xbar,
sd = sd.br),
col="darkblue",
lwd=2, add=TRUE)
#Example 2.3.1 Des Moines vs San Francisco Temp
data("April14Temps")
str(April14Temps)
## 'data.frame': 25 obs. of 3 variables:
## $ Year : int 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 ...
## $ DesMoines : num 56 37.5 37.2 56 54.3 63.3 54.7 60.6 70.6 53.7 ...
## $ SanFrancisco: num 51 55.3 55.7 48.7 56.2 57.2 49.5 61 51.4 55.8 ...
summary(April14Temps)
## Year DesMoines SanFrancisco
## Min. :1995 Min. :35.70 Min. :48.70
## 1st Qu.:2001 1st Qu.:44.40 1st Qu.:52.10
## Median :2007 Median :54.70 Median :54.20
## Mean :2007 Mean :53.73 Mean :54.25
## 3rd Qu.:2013 3rd Qu.:60.20 3rd Qu.:56.20
## Max. :2019 Max. :74.90 Max. :61.00
The standard deviation for a quantitative variable measures the spread of the data in a sample:
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]
The standard deviation gives a rough estimate of the typical distance of a data value from the mean. The larger the standard deviation, the more variability there is in the data and the more spread out the data are. The standard deviation of a sample is denoted s; the standard deviation of a population is denoted \(\sigma\), the Greek letter “sigma.”
If a distribution of data is approximately symmetric and bell-shaped, about 95% of the data should fall within two standard deviations of the mean. This means that about 95% of the data in a sample from a bell-shaped distribution should fall in the interval from \(\bar{x} - 2s\) to \(\bar{x} + 2s\).
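The 95% rule can be checked on simulated data. The sketch below draws a bell-shaped sample with `rnorm()`; the seed, the sample size, and the mean and standard deviation of the simulated values are arbitrary assumptions, not taken from any dataset in these notes.

```r
set.seed(123)  # arbitrary seed for reproducibility

# Simulated bell-shaped sample; mean 100 and sd 15 are arbitrary choices
x <- rnorm(10000, mean = 100, sd = 15)

lower <- mean(x) - 2 * sd(x)
upper <- mean(x) + 2 * sd(x)

# Proportion of values inside the interval; expected to be close to 0.95
within_2sd <- mean(x >= lower & x <= upper)
within_2sd
```

For skewed data the proportion can differ noticeably from 95%, which is why the rule is stated only for symmetric, bell-shaped distributions.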
z-Scores
A common way to determine how unusual a single data value is, independent of the units used, is to count how many standard deviations it is away from the mean:
\[ z\text{-score} = \frac{x - \bar{x}}{s} \]
Example 2.3.2 A patient has a high systolic blood pressure of 204 mmHg and a low pulse rate of 52 bpm. Which of these values is more unusual relative to the other patients in the sample? The summary statistics for systolic blood pressure show a mean of 132.3 and standard deviation of 32.95, while the heart rates have a mean of 98.9 and standard deviation of 26.83.
\[ \text{Blood pressure: } z = \frac{x - \bar{x}}{s} = \frac{204 - 132.3}{32.95} = 2.18 \]
This patient’s blood pressure is slightly more than two standard deviations above the sample mean.
\[ \text{Heart rate: } z = \frac{x - \bar{x}}{s} = \frac{52 - 98.9}{26.83} = -1.75 \]
This patient’s heart rate is less than two standard deviations below the sample mean heart rate. The high blood pressure is somewhat more unusual than the low heart rate.
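The two z-scores in this example can be reproduced with a one-line helper. The function name `z_score` is our own; the inputs are the summary statistics quoted in the example.

```r
# Helper for illustration; the name z_score is our own
z_score <- function(x, mean, sd) (x - mean) / sd

z_bp <- z_score(204, mean = 132.3, sd = 32.95)  # systolic blood pressure
z_hr <- z_score(52,  mean = 98.9,  sd = 26.83)  # heart rate

round(c(blood_pressure = z_bp, heart_rate = z_hr), 2)
```

Comparing absolute values, 2.18 is larger than 1.75, which matches the conclusion that the blood pressure is the more unusual of the two readings.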
The P-th percentile is the value of a quantitative variable that is greater than P percent of the data.
#Example 2.3.3
data("SandP500")
hist(SandP500$Volume,
main = "Histogram of SandP500 Volume",
density = NULL, angle = 45, col = "#6666cc", border = NULL,
xlim = c(1400,6600),
breaks = 30,
xlab = "Volume (million)")
summary(SandP500$Volume)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1652 3234 3475 3612 3843 7609
Five Number Summary = (minimum, Q1, median, Q3, maximum), where
Q1 = First Quartile = 25th percentile
Q3 = Third Quartile = 75th percentile
Range = Maximum - Minimum
Interquartile range (IQR) = Q3 - Q1
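These quantities map directly onto base R functions. The sketch below uses a short illustrative vector; note that `quantile()` and `IQR()` use R's default quantile method (type 7), so quartiles can differ slightly from values computed by hand with another convention.

```r
x <- c(1, 3, 4, 5, 7, 10, 18, 20, 25, 31, 42)

five_num <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))  # default type 7
five_num

data_range <- max(x) - min(x)  # range
iqr        <- IQR(x)           # Q3 - Q1
c(range = data_range, IQR = iqr)
```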
SECTION LEARNING GOALS
Use technology to compute summary statistics for a quantitative variable
Recognize the uses and meaning of the standard deviation
Compute and interpret a z-score
Interpret a five number summary or percentiles
Use the range, the interquartile range, and the sd as measures of spread.
Exercises for Section 2.3
#1)
summary(c(1, 3, 4, 5, 7, 10, 18, 20, 25, 31, 42))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 4.50 10.00 15.09 22.50 42.00
#2) The variable TV in StudentSurvey dataset.
summary(StudentSurvey$TV)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 3.000 5.000 6.504 9.000 40.000 1
summary(c(45, 46, 48, 49, 49, 50, 50, 52, 52, 54, 57, 57, 58, 58, 60,61))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 45.00 49.00 52.00 52.88 57.25 61.00
sd(c(45, 46, 48, 49, 49, 50, 50, 52, 52, 54, 57, 57, 58, 58, 60, 61))
## [1] 5.07116
#3) Percent Obese by State
data("USStates")
summary(USStates$Obese)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 23.00 28.77 30.90 31.43 34.38 39.50
A boxplot includes:
- A numerical scale appropriate for the data values.
- A box stretching from Q1 to Q3.
- A line that divides the box drawn at the median.
- A line from each quartile to the most extreme data value that is not an outlier.
- Each outlier plotted individually with a symbol such as an asterisk or a dot.
Example 2.4.1
# DATA 2.7 Hollywood Movies
data("HollywoodMovies")
boxplot(HollywoodMovies$Budget,
ylim= c(0,400),
ylab="millions of dollars")
Detection of Outliers
IQR Method for Detecting Outliers
For boxplots, we call a data value an outlier if it is
- smaller than Q1 − 1.5(IQR), or
- larger than Q3 + 1.5(IQR).
Example 2.4.2
summary(MammalLongevity$Longevity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 8.00 12.00 13.15 15.25 40.00
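The calculation uses Q1 = 8 and Q3 = 16, so IQR = 16 − 8 = 8. (The summary() output above reports Q3 = 15.25 instead, because R's default quantile method interpolates between data values.)
Q1 − 1.5(IQR) = 8 − 1.5(8) = 8 − 12 = −4
Q3 + 1.5(IQR) = 16 + 1.5(8) = 16 + 12 = 28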
Clearly, there are no mammals with negative lifetimes, so there can be no outliers below the lower value of −4. On the upper side, the elephant, as expected, is clearly an outlier beyond the value of 28 years. No other mammal in this sample exceeds that value, so the elephant is the only outlier in the longevity data.
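The IQR rule is easy to wrap in a small helper. This is a sketch rather than a standard function: the name `iqr_fences` is our own, and it takes the quartiles as inputs so that the hand calculation with Q1 = 8 and Q3 = 16 can be reproduced exactly.

```r
# Helper for illustration; the name iqr_fences is our own
iqr_fences <- function(q1, q3) {
  iqr <- q3 - q1
  c(lower = q1 - 1.5 * iqr, upper = q3 + 1.5 * iqr)
}

fences <- iqr_fences(q1 = 8, q3 = 16)
fences

# A longevity of 40 years (the maximum in the summary) lies above the upper fence
40 > fences[["upper"]]
```

Passing the quartiles explicitly avoids depending on which quantile method the software uses.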
One Quantitative and One Categorical Variable
Visualizing a Relationship between Quantitative and Categorical Variables
Example 2.4.2
data("StudentSurvey")
str(StudentSurvey)
## 'data.frame': 362 obs. of 17 variables:
## $ Year : Factor w/ 5 levels "","FirstYear",..: 4 5 2 3 5 5 2 5 3 2 ...
## $ Sex : Factor w/ 2 levels "F","M": 2 1 2 2 1 1 1 2 1 1 ...
## $ Smoke : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 1 ...
## $ Award : Factor w/ 3 levels "Academy","Nobel",..: 3 1 2 2 2 2 3 3 2 2 ...
## $ HigherSAT : Factor w/ 3 levels "","Math","Verbal": 2 2 2 2 3 3 2 2 3 2 ...
## $ Exercise : num 10 4 14 3 3 5 10 13 3 12 ...
## $ TV : int 1 7 5 1 3 4 10 8 6 1 ...
## $ Height : int 71 66 72 63 65 65 66 74 61 60 ...
## $ Weight : int 180 120 208 110 150 114 128 235 NA 115 ...
## $ Siblings : int 4 2 2 1 1 2 1 1 2 7 ...
## $ BirthOrder: int 4 2 1 1 1 2 1 1 2 8 ...
## $ VerbalSAT : int 540 520 550 490 720 600 640 660 550 670 ...
## $ MathSAT : int 670 630 560 630 450 550 680 710 550 700 ...
## $ SAT : int 1210 1150 1110 1120 1170 1150 1320 1370 1100 1370 ...
## $ GPA : num 3.13 2.5 2.55 3.1 2.7 3.2 2.77 3.3 2.8 3.7 ...
## $ Pulse : int 54 66 130 78 40 80 94 77 60 94 ...
## $ Piercings : int 0 3 0 0 6 4 8 0 7 2 ...
#show the distribution of hours spent watching television for males and females using side-by-side boxplots
StudentSurvey$Sex <- factor(StudentSurvey$Sex)
#as.numeric() keeps TV a plain numeric vector; tibble's num() is meant for display formatting and confuses ggplot2's scale selection
StudentSurvey$TV <- as.numeric(StudentSurvey$TV)
ggplot(data = StudentSurvey, aes(y = TV, x = Sex, fill = Sex)) +
  geom_boxplot() +
  ggtitle("Hours Watching TV/Week")
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
#the same comparison using dotplots, one panel per sex
ggplot(data = StudentSurvey, aes(x = TV)) +
  geom_dotplot(binwidth = 2, fill = "Gray") +
  ggtitle("Hours Watching TV/Week") +
  xlab("TV") +
  ylab("") +
  facet_grid(. ~ Sex)
## Warning: Removed 1 rows containing non-finite values (stat_bindot).
#Both distributions are skewed to the right and have many outliers. There appears
#to be an association: In this group of students, males tend to watch more
#television. In fact, we see that the females who watch about 15 hours of TV a week
#are considered outliers, whereas males who watch the same amount of television are
#not so unusual. The minimum, first quartile, and median are relatively similar for
#the two genders, but the third quartile for males is much higher than the third
#quartile for females and the maximum for males is also much higher. While the
#medians are similar, the distribution for males is more highly skewed to the right,
#so the mean for males will be higher than the mean for females.
SECTION LEARNING GOALS
Identify outliers in a dataset based on the IQR method
Use a boxplot to describe data for a single quantitative variable
Use a side-by-side graph to visualize a relationship between quantitative and categorical variables
Lange, T., Royals, H., and Connor, L., “Mercury accumulation in largemouth bass (Micropterus salmoides) in a Florida Lake,” Archives of Environmental Contamination and Toxicology, 2004; 27(4): 466–471.