library(UsingR)
## Warning: package 'UsingR' was built under R version 3.6.1
## Loading required package: MASS
## Loading required package: HistData
## Warning: package 'HistData' was built under R version 3.6.1
## Loading required package: Hmisc
## Warning: package 'Hmisc' was built under R version 3.6.1
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
##
## Attaching package: 'UsingR'
## The following object is masked from 'package:survival':
##
## cancer
res <- lm(response ~ predictor, dataset, subset=…)
res <- lm(maxrate ~ age, data=heartrate)
res
##
## Call:
## lm(formula = maxrate ~ age, data = heartrate)
##
## Coefficients:
## (Intercept) age
## 210.0485 -0.7977
So the rule of Thumb is not entirely off the mark. 2. Visualise the regression line on a scatter plot
plot(maxrate ~ age, data=heartrate)
abline(res)
3. What is the sum of residuals.
The residuals are computed for each point and returned by the resid function
sum(resid(res))
## [1] -1.776357e-15
This is virtually Zero. 4. Residuals are defined as Actual points- Fitted points. Check the sum of the residuals using this relation and check if the values returned by the resid function are matching.
actual <- heartrate$maxrate
diffs <- resid(res) - (actual - fitted(res))
max(diffs)
## [1] 1.199041e-14
This is the same as earlier. 5. Lets use it to predict, what is the maximum heart rate for a 30 and 40 year old.
age <- c(30, 40)
coef(res)[1] + coef(res)[2] * age # using the linear function.
## [1] 186.1167 178.1394
So the values are 186 and 178 beats per minute respectively. 6. Use predict function to get these values
predict(res, data.frame(age=c(30, 40)))
## 1 2
## 186.1167 178.1394
plot(log(brain) ~ log(body), data=Animals)
## label the outliers
idx <- log(Animals$body) > 6 & log(Animals$brain) < 6 # select
text(log(Animals$body[idx]), log(Animals$brain[idx]), # label
labels=rownames(Animals)[idx], pos=2)
8. Now a fit a linear model with log(body) and log(brain) excluding the outliers
lm(log(brain) ~ log(body), data=Animals, subset=!idx)
##
## Call:
## lm(formula = log(brain) ~ log(body), data = Animals, subset = !idx)
##
## Coefficients:
## (Intercept) log(body)
## 2.1504 0.7523
Example 3.4: Body mass index The body mass index (BMI) is a measure of an individual’s fitness designed by Quetelet in the 1800s, but still very much part of the everyday vernacular. BMI is a simple measure computed by taking an individual’s weight and dividing by their height squared using units of kilograms and meters. This comparison allows the same scale to be used for people of varying heights. It also implies that we should expect weight and height squared to exhibit a linear relationship. For that we can investigate. We use the weight and height data in the kid.weights (UsingR) data set, after restricting the data to fiveyear-olds.Note more than 72 and not less than 60.
?kid.weights
## starting httpd help server ... done
A data frame with 250 observations on the following 4 variables.
age Age in months
weight weight in pounds
height height in inches
gender Male of Female
summary(kid.weights)
## age weight height gender
## Min. : 3.00 Min. : 10.00 Min. :12.00 F:129
## 1st Qu.: 12.25 1st Qu.: 22.00 1st Qu.:28.00 M:121
## Median : 39.00 Median : 32.00 Median :36.00
## Mean : 47.95 Mean : 38.38 Mean :36.52
## 3rd Qu.: 69.75 3rd Qu.: 45.00 3rd Qu.:43.00
## Max. :144.00 Max. :150.00 Max. :67.00
idx <- 6 <= kid.weights$age & kid.weights$age < 72
fm <- weight/2.2~I((height*2.54/100)^2)
y <- lm(fm,kid.weights,subset=idx)
plot(fm,data=kid.weights,subset=idx)
abline(y)
y
##
## Call:
## lm(formula = fm, data = kid.weights, subset = idx)
##
## Coefficients:
## (Intercept) I((height * 2.54/100)^2)
## 5.472 11.212
Here the formula is wrapped in I as * and ^ have different meanings in the model syntex. For every 1 meter increase in height squared meter, the weight increases by 11 kg.
3.13 For the homedata (UsingR) data set, make a histogram and density estimate of the multiplicative change in values (the variable y2/y197). Describe the shape, and explain why it is shaped thus. (Hint: There are two sides to the tracks.)
?homedata
Data set containing assessed values of homes in Maplewood NJ for the years 1970 and 2000. The homedata data frame has 6841 rows and 2 columns. This data frame contains the following columns:
y1970 a numeric vector
y2000 a numeric vector
head(homedata)
## y1970 y2000
## 1 89700 359100
## 2 118400 504500
## 3 116400 477300
## 4 122000 500400
## 5 91500 433900
## 6 102800 464800
column<- homedata$y2000/homedata$y1970
b_hist <- hist(column, plot=FALSE)
b_dens <- density(column)
hist(column, probability=TRUE,
xlim = range(c(b_hist$breaks, b_dens$x)),
ylim = range(c(b_hist$density, b_dens$y)))
lines(b_dens, lwd=2)
3.14 The data set normtemp (UsingR) contains body measurements for 130 healthy, randomly selected individuals. The variable temperature measures normal body temperature, and the variable hr measures resting heart rate. Make a scatter plot of the two variables and find the Pearson correlation coefficient.
x <- normtemp$temperature; y <- normtemp$hr
plot(x, y, main="Temperature Vs. Resting Heart Rate")
abline(v = mean(x), lty=2) # dashed vertical line
abline(h = mean(y), lty=2) # dashed horizontal line
points(mean(x), mean(y), pch=16, cex=4, col=rgb(0,0,0, .25))
cor(x,y)
## [1] 0.2536564
There is hardly any correlation between them. 3.15 The data set fat (UsingR) contains several circumference measurements for 252 men. The variable body.fat contains body fat percentage, and the variable BMI records the body mass index. Make a scatter plot of the two variables and then find the correlation coefficient.
x <- fat$body.fat; y <- fat$BMI
plot(x, y, main="body Fat Vs. BMI")
abline(v = mean(x), lty=2) # dashed vertical line
abline(h = mean(y), lty=2) # dashed horizontal line
points(mean(x), mean(y), pch=16, cex=4, col=rgb(0,0,0, .25))
cor(x,y)
## [1] 0.7279942
So BMI is positively correlated with Body Fat 3.16 The data set twins (UsingR) contains IQ scores for pairs of identical twins who were separated at birth. Make a scatter plot of the variables Foster and Biological. Based on the scatter plot, predict what the Pearson correlation coefficient will be and whether the Pearson and Spearman coefficients will be similar. Check your guesses.
x <- twins$Foster; y <- twins$Biological
plot(x, y, main="Foster Vs. Biological")
abline(v = mean(x), lty=2) # dashed vertical line
abline(h = mean(y), lty=2) # dashed horizontal line
points(mean(x), mean(y), pch=16, cex=4, col=rgb(0,0,0, .25))
cor(x,y)
## [1] 0.8819877
So there is a 88% association in the IQ similarity between the identical twins. 3.17 The state.x77 data set contains various information for each of the fifty United States. We wish to explore possible relationships among the variables. First, we make the data set easier to work with by turning it into a data frame. x77 <- data.frame(state.x77); Now, make scatter plots of Population and Frost; Population and Murder; Population and Area; and Income and HS.Grad. Do any relationships appear linear? Are there any surprising correlations?
x77 <- data.frame(state.x77)
x <- x77$Frost; y <-x77$Population
plot(x, y, main="Frost Vs. Population")
abline(v = mean(x), lty=2) # dashed vertical line
abline(h = mean(y), lty=2) # dashed horizontal line
points(mean(x), mean(y), pch=16, cex=4, col=rgb(0,0,0, .25))
cor(x,y)
## [1] -0.3321525
y <- x77$Murder; x <-x77$Population
plot(x, y, main="Population Vs. Murder")
abline(v = mean(x), lty=2) # dashed vertical line
abline(h = mean(y), lty=2) # dashed horizontal line
points(mean(x), mean(y), pch=16, cex=4, col=rgb(0,0,0, .25))
cor(x,y)
## [1] 0.3436428
x <- x77$Area; y <-x77$Population
plot(x, y, main="Area vs. Population")
abline(v = mean(x), lty=2) # dashed vertical line
abline(h = mean(y), lty=2) # dashed horizontal line
points(mean(x), mean(y), pch=16, cex=4, col=rgb(0,0,0, .25))
cor(x,y)
## [1] 0.02254384
x <- x77$HS.Grad ; y <- x77$Income
plot(x, y, main="Title")
abline(v = mean(x), lty=2) # dashed vertical line
abline(h = mean(y), lty=2) # dashed horizontal line
points(mean(x), mean(y), pch=16, cex=4, col=rgb(0,0,0, .25))
cor(x,y)
## [1] 0.6199323
3.18 The data set nym.22 (UsingR) contains information about the 2002 New York City Marathon. What do you expect the correlation between age and finishing time to be? Find it and see whether you were close.
cor(nym.2002$age,nym.2002$time)
## [1] 0.1898672
Surprisingly, there is a positive correlation. Checking the scatter plot
x <- nym.2002$age; y <- nym.2002$time
plot(x, y, main="Age vs. Finishing Time")
abline(v = mean(x), lty=2) # dashed vertical line
abline(h = mean(y), lty=2) # dashed horizontal line
points(mean(x), mean(y), pch=16, cex=4, col=rgb(0,0,0, .25))
3.19 For the data set state.center do this plot:
plot(y ~ x, data=state.center)
Can you tell from the shape of the points what the data set is?
3.20 The batting (UsingR) data set contains baseball statistics for the 2002 Major League Baseball season. What is the correlation between the number of strikeouts (SO) and the number of home runs (HR)? Make a scatter plot to see whether there is any trend. Does the data suggest that in order to hit a lot of home runs one should strike out a lot?
x <- batting$SO; y <- batting$HR
plot(x, y, main="Stikes Vs. Home Runs")
abline(v = mean(x), lty=2) # dashed vertical line
abline(h = mean(y), lty=2) # dashed horizontal line
points(mean(x), mean(y), pch=16, cex=4, col=rgb(0,0,0, .25))
cor(x,y)
## [1] 0.7084697
3.22 The data set fat (UsingR) contains ten body circumference measurements. Fit a linear model modeling the circumference of the abdomen by the circumference of the wrist. A 17-cm wrist size has what predicted abdomen size?
y<- fat$abdomen
x<- fat$wrist
dataset<- fat
cor(x,y)
## [1] 0.6198324
res <- lm(y ~ x, data=dataset)
res
##
## Call:
## lm(formula = y ~ x, data = dataset)
##
## Coefficients:
## (Intercept) x
## -37.954 7.159
plot(y ~ x, data=dataset)
abline(res)
predict(res, data.frame(x=17))
## 1
## 83.75187
3.23 The data set wtloss (MASS) contains measurements of a patient’s weight in kilograms during a weight-rehabilitation program. Make a scatter plot showing how the variable Weight decays as a function of Days
x <- wtloss$Days; y <- wtloss$Weight
plot(x, y, main="Days Vs. Weight")
abline(v = mean(x), lty=2) # dashed vertical line
abline(h = mean(y), lty=2) # dashed horizontal line
points(mean(x), mean(y), pch=16, cex=4, col=rgb(0,0,0, .25))
What is the Pearson correlation coefficient of the two variables?
cor(x,y)
## [1] -0.9853149
Does the data appear appropriate for a linear model? (A linear model says that for two comparable time periods the same amount of weight is expected to be lost.)
Fit a linear model. Store the results in res. Add the regression line to your scatter plot. Does the regression line fit the data well?
y<- wtloss$Weight
x<- wtloss$Days
dataset<- wtloss
cor(x,y)
## [1] -0.9853149
res <- lm(y ~ x, data=dataset, subset= )
res
##
## Call:
## lm(formula = y ~ x, data = dataset)
##
## Coefficients:
## (Intercept) x
## 176.8902 -0.2907
plot(y ~ x, data=dataset)
abline(res)
. Make a plot of the residuals, residuals(res), against the Days variable. Comment on the shape of the points.
rd<-resid(res) #residuals
x <-x ; y <- rd
plot(x, y, main="x vs, Residuals")
abline(v = mean(x), lty=2) # dashed vertical line
abline(h = mean(y), lty=2) # dashed horizontal line
points(mean(x), mean(y), pch=16, cex=4, col=rgb(0,0,0, .25))
As x is the number of days, it is showing an autocorrelation.
3.24 The data frame x77 contains data from each of the fifty United States. First coerce the state.x77 variable into a data frame with x77 <- data.frame(state.x77) For the following, make a scatter plot with regression line: 1. The model of illiteracy rate (Illiteracy) modeled by high school graduation rate (HS.Grad). 2. The model of life expectancy (Life.Exp) modeled by the murder rate (Murder). 3. The model of income (Income) modeled by the illiteracy rate (Illiteracy). Write a sentence or two describing any relationship. In particular, do you find it as expected or is it surprising?
x77 <- data.frame(state.x77)
x <- x77$HS.Grad; y <- x77$Illiteracy
plot(x, y, main="Hs.Grad vs. Illiteracy")
abline(v = mean(x), lty=2) # dashed vertical line
abline(h = mean(y), lty=2) # dashed horizontal line
points(mean(x), mean(y), pch=16, cex=4, col=rgb(0,0,0, .25))
res <- lm(y~x,data=x77, subset=)
abline(res)
cor(x,y)
## [1] -0.6571886
x77 <- data.frame(state.x77)
x <- x77$Murder; y <- x77$Life.Exp
plot(x, y, main="Murder Vs. Life.Exp")
abline(v = mean(x), lty=2) # dashed vertical line
abline(h = mean(y), lty=2) # dashed horizontal line
points(mean(x), mean(y), pch=16, cex=4, col=rgb(0,0,0, .25))
res <- lm(y~x,data=x77, subset=)
abline(res)
cor(x,y)
## [1] -0.7808458
x77 <- data.frame(state.x77)
x <- x77$Illiteracy; y <- x77$Income
plot(x, y, main="Illiteracy vs. Income")
abline(v = mean(x), lty=2) # dashed vertical line
abline(h = mean(y), lty=2) # dashed horizontal line
points(mean(x), mean(y), pch=16, cex=4, col=rgb(0,0,0, .25))
res <- lm(y~x,data=x77, subset=)
abline(res)
cor(x,y)
## [1] -0.4370752
3.25 The data set batting (UsingR) contains baseball statistics for the year 2002. Fit a linear model to runs batted in (RBI) modeled by number of home runs (HR). Make a scatter plot and add a regression line. In 2002, Mike Piazza had 33 home runs and 98 runs batted in. What is his predicted number of RBIs based on his number of home runs? What is his residual?
y<- batting$RBI
x<- batting$HR
dataset<- batting
cor(x,y)
## [1] 0.9069777
res <- lm(y ~ x, data=dataset, subset= )
res
##
## Call:
## lm(formula = y ~ x, data = dataset)
##
## Coefficients:
## (Intercept) x
## 16.550 2.653
plot(y ~ x, data=dataset)
abline(res)
In 2002, Mike Piazza had 33 home runs and 98 runs batted in. What is his predicted number of RBIs based on his number of home runs? What is his residual?
b <- predict(res, data.frame(x=33))
b
## 1
## 104.1099
residual1 <- b-98
residual1
## 1
## 6.1099
3.26 In the American culture, it is not considered unusual or inappropriate for a man to date a younger woman. But it is viewed as inappropriate for a man to date a much younger woman. Just what is too young? Some say anything less than half the man’s age plus seven. This is tested with a survey of ten people, each indicating what the cutoff is for various ages. The results are in the data set too.young (UsingR). Fit the regression model and compare it with the rule of thumb by also plotting the line y = 7 + (1/2)x. How do they compare?
y<- too.young$Male
x<- too.young$Female
dataset<- too.young
cor(x,y)
## [1] NA
res <- lm(y ~ x, data=dataset, subset= )
res
##
## Call:
## lm(formula = y ~ x, data = dataset)
##
## Coefficients:
## (Intercept) x
## -3.088 1.474
plot(y ~ x, data=dataset)
abline(res)
abline(-7,2,lw=2)
3.27 The data set diamond (UsingR) contains data about the price of 48 diamond rings. The variable price records the price in Singapore dollars and the variable carat records the size of the diamond. Make a scatter plot of carat versus price. Use pch=5 to plot with diamonds. Add the regression line and predict the amount a one-third carat diamond ring would cost.
x <- diamond$carat; y <- diamond$price
plot(x, y, main="Caret vs. Price",pch=5)
abline(v = mean(x), lty=2) # dashed vertical line
abline(h = mean(y), lty=2) # dashed horizontal line
points(mean(x), mean(y), pch=5, cex=4, col=rgb(0,0,0, .25))
res <- lm(y~x,data=dataset, subset=)
abline(res)
predict the amount a one-third carat diamond ring would cost.
predict(res, data.frame(x=(1/3)))
## 1
## 980.7157
3.28 To gain an understanding of the variability present in a measurement, a researcher may repeat or replicate a measurement several times. The data set breakdown (UsingR) includes measurements in minutes of the time it takes an insulating fluid to break down as a function of an applied voltage. The relationship calls for a log-transform. Plot the voltage against the logarithm of time. Find the coefficients for simple linear regression and discuss the amount of variance for each level of the voltage.
y<- breakdown$voltage
x<-log(breakdown$time)
dataset<- breakdown
cor(x,y)
## [1] -0.7111953
res <- lm(y ~ x, data=dataset, subset= )
res
##
## Call:
## lm(formula = y ~ x, data = dataset)
##
## Coefficients:
## (Intercept) x
## 35.2356 -0.9973
plot(y ~ x, data=dataset)
abline(res)
3.29 The motors (MASS) data set contains measurements on how long, in hours, it takes a motor to fail. For a range of temperatures, in degrees Celsius, a number of motors were run in an accelerated manner until they failed, or until time was cut off. (When time is cut off the data is said to have been censored.) The data shows a relationship between increased temperature and shortened life span. The commands data(motors, package=“MASS”) plot(time ~ temp, pch=cens, data=motors) produce a scatter plot of the variable time modeled by temp. The pch=cens argument marks points that were censored with a square; otherwise a circle is used. Make the scatter plot and answer the following: 1. How many different temperatures were used in the experiment? 2. Does the data look to be a candidate for a linear model? (You might want to consider why the data point (150,800) is marked with a square.) 3. Fit a linear model. What are the coefficients? 4. Use the linear model to make a prediction for the accelerated lifetime of a motor run at a temperature of 210°C
data(motors, package="MASS")
plot(time ~ temp, pch=cens, data=motors)
Fitting a Linear Model
y<- motors$time
x<- motors$temp
dataset<- motors
cor(x,y)
## [1] -0.9108416
res <- lm(y ~ x, data=dataset, subset= )
res
##
## Call:
## lm(formula = y ~ x, data = dataset)
##
## Coefficients:
## (Intercept) x
## 22999.1 -106.8
plot(y ~ x, data=dataset)
abline(res)
4. Use the linear model to make a prediction for the accelerated lifetime of a motor run at a temperature of 210°C Predicting from linear model
predict(res, data.frame(x=210))
## 1
## 580.5888
Bivariate Categorical Data
• Example 3.5: Is past performance an indicator of future performance? A common belief is that an “A” student in one class will be an “A” student in the next. Is this so? The data set grades contains the grades students received in a math class and their grades in a previous math class.Create a two way table
head(grades)
## prev grade
## 1 B+ B+
## 2 A- A-
## 3 B+ A-
## 4 F F
## 5 F F
## 6 A B
table(grades$prev,grades$grade)
##
## A A- B+ B B- C+ C D F
## A 15 3 1 4 0 0 3 2 0
## A- 3 1 1 0 0 0 0 0 0
## B+ 0 2 2 1 2 0 0 1 1
## B 0 1 1 4 3 1 3 0 2
## B- 0 1 0 2 0 0 1 0 0
## C+ 1 1 0 0 0 0 1 0 0
## C 1 0 0 1 1 3 5 9 7
## D 0 0 0 1 0 0 4 3 1
## F 1 0 0 1 1 1 3 4 11
seatbelts <- matrix(c(56, 2, 8, 16), nrow=2)
dimnames(seatbelts) <- list(parent=c("buckled","unbuckled"),
child=c("buckled","unbuckled"))
seatbelts
## child
## parent buckled unbuckled
## buckled 56 8
## unbuckled 2 16
margin.table(seatbelts, margin=1) # Rowsum for parents
## parent
## buckled unbuckled
## 64 18
margin.table(seatbelts,margin=2) #Colsum for child
## child
## buckled unbuckled
## 58 24
Alternatively, the function addmargins will return the marginal distributions by extending the table.
addmargins(seatbelts)
## child
## parent buckled unbuckled Sum
## buckled 56 8 64
## unbuckled 2 16 18
## Sum 58 24 82
Looking at the marginal distributions of the grade data also shows two similar distributions:
tbl <- with(grades, table(prev, grade))
margin.table(tbl,1)
## prev
## A A- B+ B B- C+ C D F
## 28 5 9 15 4 3 27 9 22
margin.table(tbl,2)
## grade
## A A- B+ B B- C+ C D F
## 21 9 5 14 7 5 20 19 22
Conditional Distribution of Two Way Tables For example, is there a difference in the grade a student gets if her previous grade is a B or a C? Or does the fact that a parent wears a seat belt affect the chance a child does? These questions are answered by comparing the rows or columns in a two-way table. It is usually much easier to compare proportions or percentages and not the absolute counts.
For that we use prop.table
round(prop.table(table(grades$prev, grades$grade), margin=1) * 100)
##
## A A- B+ B B- C+ C D F
## A 54 11 4 14 0 0 11 7 0
## A- 60 20 20 0 0 0 0 0 0
## B+ 0 22 22 11 22 0 0 11 11
## B 0 7 7 27 20 7 20 0 13
## B- 0 25 0 50 0 0 25 0 0
## C+ 33 33 0 0 0 0 33 0 0
## C 4 0 0 4 4 11 19 33 26
## D 0 0 0 11 0 0 44 33 11
## F 5 0 0 5 5 5 14 18 50
We can use xtabs ,To look at the distribution of type of car by origin in the Cars93. The xtabs function maps a data frame into a contingency table.
xtabs(~Origin+Type,Cars93)
## Type
## Origin Compact Large Midsize Small Sporty Van
## USA 7 11 10 7 8 5
## non-USA 9 0 12 14 6 4
xtabs(~Origin+Type+Passengers,Cars93)
## , , Passengers = 2
##
## Type
## Origin Compact Large Midsize Small Sporty Van
## USA 0 0 0 0 1 0
## non-USA 0 0 0 0 1 0
##
## , , Passengers = 4
##
## Type
## Origin Compact Large Midsize Small Sporty Van
## USA 0 0 0 2 7 0
## non-USA 1 0 2 6 5 0
##
## , , Passengers = 5
##
## Type
## Origin Compact Large Midsize Small Sporty Van
## USA 5 0 6 5 0 0
## non-USA 8 0 9 8 0 0
##
## , , Passengers = 6
##
## Type
## Origin Compact Large Midsize Small Sporty Van
## USA 2 11 4 0 0 0
## non-USA 0 0 1 0 0 0
##
## , , Passengers = 7
##
## Type
## Origin Compact Large Midsize Small Sporty Van
## USA 0 0 0 0 0 4
## non-USA 0 0 0 0 0 4
##
## , , Passengers = 8
##
## Type
## Origin Compact Large Midsize Small Sporty Van
## USA 0 0 0 0 0 1
## non-USA 0 0 0 0 0 0
However, we would like to present it like this. Row - Type should be there Col- Origin and passenders we use ftable for this
tbl <- xtabs( ~ Origin + Type + Passengers, Cars93)
ftable(tbl, row.vars=2, col.vars=c(1,3))
## Origin USA non-USA
## Passengers 2 4 5 6 7 8 2 4 5 6 7 8
## Type
## Compact 0 0 5 2 0 0 0 1 8 0 0 0
## Large 0 0 0 11 0 0 0 0 0 0 0 0
## Midsize 0 0 6 4 0 0 0 2 9 1 0 0
## Small 0 2 5 0 0 0 0 6 8 0 0 0
## Sporty 1 7 0 0 0 0 1 5 0 0 0 0
## Van 0 0 0 0 4 1 0 0 0 0 4 0
We can use proportions also using prop.table
tbl <- xtabs( ~ Origin + Type + Passengers, Cars93)
out <- ftable(tbl, row.vars=2, col.vars=c(1,3))
round(prop.table(out,1)*100)
## Origin USA non-USA
## Passengers 2 4 5 6 7 8 2 4 5 6 7 8
## Type
## Compact 0 0 31 12 0 0 0 6 50 0 0 0
## Large 0 0 0 100 0 0 0 0 0 0 0 0
## Midsize 0 0 27 18 0 0 0 9 41 5 0 0
## Small 0 10 24 0 0 0 0 29 38 0 0 0
## Sporty 7 50 0 0 0 0 7 36 0 0 0 0
## Van 0 0 0 0 44 11 0 0 0 0 44 0
We can also create bins of a continuous function and use in the contingency table
price.range <- cut(Cars93$Price, c(0, 20, 70))
tbl <- xtabs( ~ price.range + Origin + Cylinders, data=Cars93)
out <- ftable(tbl, row.vars=3)
100 * prop.table(out, margin=1)
## price.range (0,20] (20,70]
## Origin USA non-USA USA non-USA
## Cylinders
## 3 0.000000 100.000000 0.000000 0.000000
## 4 44.897959 44.897959 0.000000 10.204082
## 5 0.000000 50.000000 0.000000 50.000000
## 6 35.483871 6.451613 29.032258 29.032258
## 8 14.285714 0.000000 71.428571 14.285714
## rotary 0.000000 0.000000 0.000000 100.000000
Graphical Summaries of Two Way Contingency Tables Barplots can be used effectively. 1. One variable is chosen to form the categories for barplots 2. Either the segment the bars for others or 3. Separate bars are plotted side by side
barplot(seatbelts, xlab="Parent", main="Child seat-belt usage",legend.text = TRUE)
barplot(seatbelts, xlab="Parent", main="Child seat-belt usage",
beside=TRUE,legend.text = TRUE)
if we want the parents distribution, we an flip the table around.
barplot(t(seatbelts))
We can use percentage also
prop.table(barplot(seatbelts, xlab="Parent", main="Child seat-belt usage",legend.text = TRUE))
## [1] 0.2692308 0.7307692
Mosaic Plot that makes it suitable for viewing relationships between two or more categorical variables.
titanic <- as.data.frame(Titanic)
xtabs(Freq ~ Survived + Class, data=titanic, subset=Sex=="Female")
## Class
## Survived 1st 2nd 3rd Crew
## No 4 13 106 3
## Yes 141 93 90 20
We can create a mosaic plot like this
tbl <- xtabs(Freq ~ Sex, titanic)
mosaicplot(tbl)
Or this
tbl <- xtabs(Freq ~ Sex + Survived, titanic)
mosaicplot(tbl)
Or this
tbl <- xtabs(Freq ~ Sex + Survived + Class, titanic)
mosaicplot(tbl)
c. Here we note how the variables are assigned: The first variable (Sex) again splits the x axis, the second variable (Survived) again splits the y axis. The third (Class) then splits each of the four blocks.
Different orderings give different insights. Putting Class first (not shown) makes it easier to focus on the classes.
mosaicplot( xtabs(Freq ~ Class + Sex + Survived, data=titanic))
titanic <- as.data.frame(Titanic)
titanic
## Class Sex Age Survived Freq
## 1 1st Male Child No 0
## 2 2nd Male Child No 0
## 3 3rd Male Child No 35
## 4 Crew Male Child No 0
## 5 1st Female Child No 0
## 6 2nd Female Child No 0
## 7 3rd Female Child No 17
## 8 Crew Female Child No 0
## 9 1st Male Adult No 118
## 10 2nd Male Adult No 154
## 11 3rd Male Adult No 387
## 12 Crew Male Adult No 670
## 13 1st Female Adult No 4
## 14 2nd Female Adult No 13
## 15 3rd Female Adult No 89
## 16 Crew Female Adult No 3
## 17 1st Male Child Yes 5
## 18 2nd Male Child Yes 11
## 19 3rd Male Child Yes 13
## 20 Crew Male Child Yes 0
## 21 1st Female Child Yes 1
## 22 2nd Female Child Yes 13
## 23 3rd Female Child Yes 14
## 24 Crew Female Child Yes 0
## 25 1st Male Adult Yes 57
## 26 2nd Male Adult Yes 14
## 27 3rd Male Adult Yes 75
## 28 Crew Male Adult Yes 192
## 29 1st Female Adult Yes 140
## 30 2nd Female Adult Yes 80
## 31 3rd Female Adult Yes 76
## 32 Crew Female Adult Yes 20
3.31 The data set coins (UsingR) contains the number of coins in a change bin and the years they were minted. Do the following: 1. How much money is in the change bin? 2. Make a barplot of the years. Is there a trend? 3. Use cut to construct a barplot by decade. 4. Make a contingency table of the year and the value. Does it look like the two variables are associated? Why?
head(coins)
## year value
## 1 2003 0.01
## 2 2003 0.25
## 3 2003 0.25
## 4 2002 0.01
## 5 2002 0.01
## 6 2002 0.25
summary(coins)
## year value
## Min. :1929 Min. :0.01000
## 1st Qu.:1983 1st Qu.:0.01000
## Median :1993 Median :0.01000
## Mean :1989 Mean :0.06989
## 3rd Qu.:1999 3rd Qu.:0.10000
## Max. :2003 Max. :0.25000
str(coins)
## 'data.frame': 371 obs. of 2 variables:
## $ year : num 2003 2003 2003 2002 2002 ...
## $ value: num 0.01 0.25 0.25 0.01 0.01 0.25 0.01 0.01 0.01 0.01 ...
sum(coins$value)
## [1] 25.93
tbl <- table(coins$year,coins$value)
tbl
##
## 0.01 0.05 0.1 0.25
## 1929 2 1 0 0
## 1936 0 0 1 0
## 1939 0 0 1 0
## 1955 0 0 0 1
## 1959 1 0 0 0
## 1960 1 0 0 0
## 1964 1 0 0 0
## 1965 2 1 0 0
## 1966 1 1 0 0
## 1967 0 1 0 0
## 1968 1 0 0 1
## 1969 2 0 1 0
## 1970 1 1 0 2
## 1971 3 0 2 1
## 1972 2 0 0 1
## 1973 1 0 0 0
## 1974 1 1 1 0
## 1975 4 1 0 4
## 1976 2 1 0 1
## 1977 4 2 0 1
## 1978 5 1 0 0
## 1979 1 1 0 0
## 1980 1 1 0 0
## 1981 7 1 2 0
## 1982 9 1 2 1
## 1983 7 3 1 0
## 1984 7 3 0 1
## 1985 2 4 1 1
## 1986 3 0 1 1
## 1987 7 1 1 0
## 1988 7 3 2 2
## 1989 4 0 3 2
## 1990 3 3 1 2
## 1991 2 1 1 1
## 1992 6 0 0 6
## 1993 8 3 1 2
## 1994 7 1 2 1
## 1995 6 2 3 2
## 1996 7 3 0 4
## 1997 6 2 1 3
## 1998 5 1 0 5
## 1999 16 2 6 1
## 2000 15 4 3 8
## 2001 13 5 0 7
## 2002 19 3 5 3
## 2003 1 0 0 2
barplot(t(tbl),legend.text = TRUE)
Use cut to construct a barplot by decade.
yrange <- cut(coins$year,c(1940,1950,1960,1970,1980,1990,2000,2010))
tbl2 <- table(yrange,coins$value)
barplot(t(tbl2))
4. Make a contingency table of the year and the value. Does it look like the two variables are associated? Why?
str(coins)
## 'data.frame': 371 obs. of 2 variables:
## $ year : num 2003 2003 2003 2002 2002 ...
## $ value: num 0.01 0.25 0.25 0.01 0.01 0.25 0.01 0.01 0.01 0.01 ...