UsingR2

library(UsingR)

## Warning: package 'UsingR' was built under R version 3.6.1

## Loading required package: MASS

## Loading required package: HistData

## Warning: package 'HistData' was built under R version 3.6.1

## Loading required package: Hmisc

## Warning: package 'Hmisc' was built under R version 3.6.1

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## Loading required package: ggplot2

## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, units

## 
## Attaching package: 'UsingR'

## The following object is masked from 'package:survival':
## 
##     cancer

res <- lm(response ~ predictor, dataset, subset=…)

Have a look at the data set heartrate (UsingR). This data set records an age (age) and maximum heart rate (maxrate) for a hypothetical cohort of people. The rule of thumb in many gyms is that the maximum heart rate is related linearly to a person’s age by the formula 220 age. Mathematically, this is a line with slope of 1 and intercept of 220.11. Test this hypothesis.

res <- lm(maxrate ~ age, data=heartrate)
res

## 
## Call:
## lm(formula = maxrate ~ age, data = heartrate)
## 
## Coefficients:
## (Intercept)          age  
##    210.0485      -0.7977

So the rule of Thumb is not entirely off the mark. 2. Visualise the regression line on a scatter plot

plot(maxrate ~ age, data=heartrate)
abline(res)

3. What is the sum of residuals.

The residuals are computed for each point and returned by the resid function

sum(resid(res))

## [1] -1.776357e-15

This is virtually Zero. 4. Residuals are defined as Actual points- Fitted points. Check the sum of the residuals using this relation and check if the values returned by the resid function are matching.

actual <- heartrate$maxrate
diffs <- resid(res) - (actual - fitted(res))
max(diffs)

## [1] 1.199041e-14

This is the same as earlier. 5. Lets use it to predict, what is the maximum heart rate for a 30 and 40 year old.

age <- c(30, 40)
coef(res)[1] + coef(res)[2] * age  # using the linear function.

## [1] 186.1167 178.1394

So the values are 186 and 178 beats per minute respectively. 6. Use predict function to get these values

predict(res, data.frame(age=c(30, 40)))

##        1        2 
## 186.1167 178.1394

Plot the outliers for Animals as follows if log for Animal body is >6 and that for brain is less than 6. label the outliers.

plot(log(brain) ~ log(body), data=Animals)
## label the outliers
idx <- log(Animals$body) > 6 & log(Animals$brain) < 6 # select
text(log(Animals$body[idx]), log(Animals$brain[idx]), # label
labels=rownames(Animals)[idx], pos=2)

8. Now a fit a linear model with log(body) and log(brain) excluding the outliers

lm(log(brain) ~ log(body), data=Animals, subset=!idx)

## 
## Call:
## lm(formula = log(brain) ~ log(body), data = Animals, subset = !idx)
## 
## Coefficients:
## (Intercept)    log(body)  
##      2.1504       0.7523

Interpret the data- for 1% increase in the body the brain increases by 0.75%.

Example 3.4: Body mass index The body mass index (BMI) is a measure of an individual’s fitness designed by Quetelet in the 1800s, but still very much part of the everyday vernacular. BMI is a simple measure computed by taking an individual’s weight and dividing by their height squared using units of kilograms and meters. This comparison allows the same scale to be used for people of varying heights. It also implies that we should expect weight and height squared to exhibit a linear relationship. For that we can investigate. We use the weight and height data in the kid.weights (UsingR) data set, after restricting the data to fiveyear-olds.Note more than 72 and not less than 60.

?kid.weights

## starting httpd help server ... done

A data frame with 250 observations on the following 4 variables.

age Age in months

weight weight in pounds

height height in inches

gender Male of Female

summary(kid.weights)

##       age             weight           height      gender 
##  Min.   :  3.00   Min.   : 10.00   Min.   :12.00   F:129  
##  1st Qu.: 12.25   1st Qu.: 22.00   1st Qu.:28.00   M:121  
##  Median : 39.00   Median : 32.00   Median :36.00          
##  Mean   : 47.95   Mean   : 38.38   Mean   :36.52          
##  3rd Qu.: 69.75   3rd Qu.: 45.00   3rd Qu.:43.00          
##  Max.   :144.00   Max.   :150.00   Max.   :67.00

idx <- 6 <= kid.weights$age & kid.weights$age < 72
fm <- weight/2.2~I((height*2.54/100)^2)
y <- lm(fm,kid.weights,subset=idx)
plot(fm,data=kid.weights,subset=idx)
abline(y)

## 
## Call:
## lm(formula = fm, data = kid.weights, subset = idx)
## 
## Coefficients:
##              (Intercept)  I((height * 2.54/100)^2)  
##                    5.472                    11.212

Here the formula is wrapped in I as * and ^ have different meanings in the model syntex. For every 1 meter increase in height squared meter, the weight increases by 11 kg.

3.13 For the homedata (UsingR) data set, make a histogram and density estimate of the multiplicative change in values (the variable y2/y197). Describe the shape, and explain why it is shaped thus. (Hint: There are two sides to the tracks.)

?homedata

Data set containing assessed values of homes in Maplewood NJ for the years 1970 and 2000. The homedata data frame has 6841 rows and 2 columns. This data frame contains the following columns:

y1970 a numeric vector

y2000 a numeric vector

head(homedata)

##    y1970  y2000
## 1  89700 359100
## 2 118400 504500
## 3 116400 477300
## 4 122000 500400
## 5  91500 433900
## 6 102800 464800

column<- homedata$y2000/homedata$y1970
b_hist <- hist(column, plot=FALSE)
b_dens <- density(column)
hist(column, probability=TRUE,
xlim = range(c(b_hist$breaks, b_dens$x)),
ylim = range(c(b_hist$density, b_dens$y)))
lines(b_dens, lwd=2)

3.14 The data set normtemp (UsingR) contains body measurements for 130 healthy, randomly selected individuals. The variable temperature measures normal body temperature, and the variable hr measures resting heart rate. Make a scatter plot of the two variables and find the Pearson correlation coefficient.

x <- normtemp$temperature; y <- normtemp$hr 
plot(x, y, main="Temperature Vs. Resting Heart Rate")
 abline(v = mean(x), lty=2) # dashed vertical line 
abline(h = mean(y), lty=2) # dashed horizontal line 
points(mean(x), mean(y), pch=16, cex=4, col=rgb(0,0,0, .25))

cor(x,y)

## [1] 0.2536564

There is hardly any correlation between them. 3.15 The data set fat (UsingR) contains several circumference measurements for 252 men. The variable body.fat contains body fat percentage, and the variable BMI records the body mass index. Make a scatter plot of the two variables and then find the correlation coefficient.

x <- fat$body.fat; y <- fat$BMI 
plot(x, y, main="body Fat Vs. BMI")
 abline(v = mean(x), lty=2) # dashed vertical line 
abline(h = mean(y), lty=2) # dashed horizontal line 
points(mean(x), mean(y), pch=16, cex=4, col=rgb(0,0,0, .25))

cor(x,y)

## [1] 0.7279942

So BMI is positively correlated with Body Fat 3.16 The data set twins (UsingR) contains IQ scores for pairs of identical twins who were separated at birth. Make a scatter plot of the variables Foster and Biological. Based on the scatter plot, predict what the Pearson correlation coefficient will be and whether the Pearson and Spearman coefficients will be similar. Check your guesses.

x <- twins$Foster; y <- twins$Biological 
plot(x, y, main="Foster Vs. Biological")
 abline(v = mean(x), lty=2) # dashed vertical line 
abline(h = mean(y), lty=2) # dashed horizontal line 
points(mean(x), mean(y), pch=16, cex=4, col=rgb(0,0,0, .25))

cor(x,y)

## [1] 0.8819877

So there is a 88% association in the IQ similarity between the identical twins. 3.17 The state.x77 data set contains various information for each of the fifty United States. We wish to explore possible relationships among the variables. First, we make the data set easier to work with by turning it into a data frame. x77 <- data.frame(state.x77); Now, make scatter plots of Population and Frost; Population and Murder; Population and Area; and Income and HS.Grad. Do any relationships appear linear? Are there any surprising correlations?

x77 <- data.frame(state.x77)

x <- x77$Frost; y <-x77$Population
plot(x, y, main="Frost Vs. Population")
 abline(v = mean(x), lty=2) # dashed vertical line 
abline(h = mean(y), lty=2) # dashed horizontal line 
points(mean(x), mean(y), pch=16, cex=4, col=rgb(0,0,0, .25))

cor(x,y)

## [1] -0.3321525

y <- x77$Murder; x <-x77$Population
plot(x, y, main="Population Vs. Murder")
 abline(v = mean(x), lty=2) # dashed vertical line 
abline(h = mean(y), lty=2) # dashed horizontal line 
points(mean(x), mean(y), pch=16, cex=4, col=rgb(0,0,0, .25))

cor(x,y)

## [1] 0.3436428

x <- x77$Area; y <-x77$Population
plot(x, y, main="Area vs. Population")
 abline(v = mean(x), lty=2) # dashed vertical line 
abline(h = mean(y), lty=2) # dashed horizontal line 
points(mean(x), mean(y), pch=16, cex=4, col=rgb(0,0,0, .25))

cor(x,y)

## [1] 0.02254384

x <- x77$HS.Grad ; y <- x77$Income
plot(x, y, main="Title")
 abline(v = mean(x), lty=2) # dashed vertical line 
abline(h = mean(y), lty=2) # dashed horizontal line 
points(mean(x), mean(y), pch=16, cex=4, col=rgb(0,0,0, .25))

cor(x,y)

## [1] 0.6199323

3.18 The data set nym.22 (UsingR) contains information about the 2002 New York City Marathon. What do you expect the correlation between age and finishing time to be? Find it and see whether you were close.

cor(nym.2002$age,nym.2002$time)

## [1] 0.1898672

Surprisingly, there is a positive correlation. Checking the scatter plot

x <- nym.2002$age; y <- nym.2002$time
plot(x, y, main="Age vs. Finishing Time")
 abline(v = mean(x), lty=2) # dashed vertical line 
abline(h = mean(y), lty=2) # dashed horizontal line 
points(mean(x), mean(y), pch=16, cex=4, col=rgb(0,0,0, .25))

3.19 For the data set state.center do this plot:

plot(y ~ x, data=state.center)

Can you tell from the shape of the points what the data set is?

3.20 The batting (UsingR) data set contains baseball statistics for the 2002 Major League Baseball season. What is the correlation between the number of strikeouts (SO) and the number of home runs (HR)? Make a scatter plot to see whether there is any trend. Does the data suggest that in order to hit a lot of home runs one should strike out a lot?

x <- batting$SO; y <- batting$HR
plot(x, y, main="Stikes Vs. Home Runs")
 abline(v = mean(x), lty=2) # dashed vertical line 
abline(h = mean(y), lty=2) # dashed horizontal line 
points(mean(x), mean(y), pch=16, cex=4, col=rgb(0,0,0, .25))

cor(x,y)

## [1] 0.7084697

3.22 The data set fat (UsingR) contains ten body circumference measurements. Fit a linear model modeling the circumference of the abdomen by the circumference of the wrist. A 17-cm wrist size has what predicted abdomen size?

y<- fat$abdomen
x<- fat$wrist
dataset<- fat
cor(x,y)

## [1] 0.6198324

res <- lm(y ~ x, data=dataset)
res

## 
## Call:
## lm(formula = y ~ x, data = dataset)
## 
## Coefficients:
## (Intercept)            x  
##     -37.954        7.159

plot(y ~ x, data=dataset) 
abline(res)

predict(res, data.frame(x=17))

##        1 
## 83.75187

3.23 The data set wtloss (MASS) contains measurements of a patient’s weight in kilograms during a weight-rehabilitation program. Make a scatter plot showing how the variable Weight decays as a function of Days

x <- wtloss$Days; y <- wtloss$Weight
plot(x, y, main="Days Vs. Weight")
 abline(v = mean(x), lty=2) # dashed vertical line 
abline(h = mean(y), lty=2) # dashed horizontal line 
points(mean(x), mean(y), pch=16, cex=4, col=rgb(0,0,0, .25))

What is the Pearson correlation coefficient of the two variables?

cor(x,y)

## [1] -0.9853149

Does the data appear appropriate for a linear model? (A linear model says that for two comparable time periods the same amount of weight is expected to be lost.)
Fit a linear model. Store the results in res. Add the regression line to your scatter plot. Does the regression line fit the data well?

y<- wtloss$Weight
x<- wtloss$Days
dataset<- wtloss
cor(x,y)

## [1] -0.9853149

res <- lm(y ~ x, data=dataset, subset= )
res

## 
## Call:
## lm(formula = y ~ x, data = dataset)
## 
## Coefficients:
## (Intercept)            x  
##    176.8902      -0.2907

plot(y ~ x, data=dataset) 
abline(res)

. Make a plot of the residuals, residuals(res), against the Days variable. Comment on the shape of the points.

rd<-resid(res)   #residuals
x <-x ; y <- rd 
plot(x, y, main="x vs, Residuals")
 abline(v = mean(x), lty=2) # dashed vertical line 
abline(h = mean(y), lty=2) # dashed horizontal line 
points(mean(x), mean(y), pch=16, cex=4, col=rgb(0,0,0, .25))

As x is the number of days, it is showing an autocorrelation.

3.24 The data frame x77 contains data from each of the fifty United States. First coerce the state.x77 variable into a data frame with x77 <- data.frame(state.x77) For the following, make a scatter plot with regression line: 1. The model of illiteracy rate (Illiteracy) modeled by high school graduation rate (HS.Grad). 2. The model of life expectancy (Life.Exp) modeled by the murder rate (Murder). 3. The model of income (Income) modeled by the illiteracy rate (Illiteracy). Write a sentence or two describing any relationship. In particular, do you find it as expected or is it surprising?

x77 <- data.frame(state.x77)
x <- x77$HS.Grad; y <- x77$Illiteracy 
plot(x, y, main="Hs.Grad vs. Illiteracy")
 abline(v = mean(x), lty=2) # dashed vertical line 
abline(h = mean(y), lty=2) # dashed horizontal line 
points(mean(x), mean(y), pch=16, cex=4, col=rgb(0,0,0, .25)) 
res <- lm(y~x,data=x77, subset=) 
abline(res)

cor(x,y)

## [1] -0.6571886

The model of life expectancy (Life.Exp) modeled by the murder rate (Murder).

x77 <- data.frame(state.x77)
x <- x77$Murder; y <- x77$Life.Exp
plot(x, y, main="Murder Vs. Life.Exp")
 abline(v = mean(x), lty=2) # dashed vertical line 
abline(h = mean(y), lty=2) # dashed horizontal line 
points(mean(x), mean(y), pch=16, cex=4, col=rgb(0,0,0, .25)) 
res <- lm(y~x,data=x77, subset=) 
abline(res)

cor(x,y)

## [1] -0.7808458

The model of income (Income) modeled by the illiteracy rate (Illiteracy).

x77 <- data.frame(state.x77)
x <- x77$Illiteracy; y <- x77$Income
plot(x, y, main="Illiteracy vs. Income")
 abline(v = mean(x), lty=2) # dashed vertical line 
abline(h = mean(y), lty=2) # dashed horizontal line 
points(mean(x), mean(y), pch=16, cex=4, col=rgb(0,0,0, .25)) 
res <- lm(y~x,data=x77, subset=) 
abline(res)

cor(x,y)

## [1] -0.4370752

3.25 The data set batting (UsingR) contains baseball statistics for the year 2002. Fit a linear model to runs batted in (RBI) modeled by number of home runs (HR). Make a scatter plot and add a regression line. In 2002, Mike Piazza had 33 home runs and 98 runs batted in. What is his predicted number of RBIs based on his number of home runs? What is his residual?

y<- batting$RBI
x<- batting$HR
dataset<- batting
cor(x,y)

## [1] 0.9069777

res <- lm(y ~ x, data=dataset, subset= )
res

## 
## Call:
## lm(formula = y ~ x, data = dataset)
## 
## Coefficients:
## (Intercept)            x  
##      16.550        2.653

plot(y ~ x, data=dataset) 
abline(res)

In 2002, Mike Piazza had 33 home runs and 98 runs batted in. What is his predicted number of RBIs based on his number of home runs? What is his residual?

b <- predict(res, data.frame(x=33))
b

##        1 
## 104.1099

residual1 <- b-98
residual1

##      1 
## 6.1099

3.26 In the American culture, it is not considered unusual or inappropriate for a man to date a younger woman. But it is viewed as inappropriate for a man to date a much younger woman. Just what is too young? Some say anything less than half the man’s age plus seven. This is tested with a survey of ten people, each indicating what the cutoff is for various ages. The results are in the data set too.young (UsingR). Fit the regression model and compare it with the rule of thumb by also plotting the line y = 7 + (1/2)x. How do they compare?

y<- too.young$Male
x<- too.young$Female
dataset<- too.young
cor(x,y)

## [1] NA

res <- lm(y ~ x, data=dataset, subset= )
res

## 
## Call:
## lm(formula = y ~ x, data = dataset)
## 
## Coefficients:
## (Intercept)            x  
##      -3.088        1.474

plot(y ~ x, data=dataset) 
abline(res)
abline(-7,2,lw=2)

3.27 The data set diamond (UsingR) contains data about the price of 48 diamond rings. The variable price records the price in Singapore dollars and the variable carat records the size of the diamond. Make a scatter plot of carat versus price. Use pch=5 to plot with diamonds. Add the regression line and predict the amount a one-third carat diamond ring would cost.

x <- diamond$carat; y <- diamond$price 
plot(x, y, main="Caret vs. Price",pch=5)
abline(v = mean(x), lty=2) # dashed vertical line 
abline(h = mean(y), lty=2) # dashed horizontal line 
points(mean(x), mean(y), pch=5, cex=4, col=rgb(0,0,0, .25)) 
res <- lm(y~x,data=dataset, subset=) 
abline(res)

predict the amount a one-third carat diamond ring would cost.

predict(res, data.frame(x=(1/3)))

##        1 
## 980.7157

3.28 To gain an understanding of the variability present in a measurement, a researcher may repeat or replicate a measurement several times. The data set breakdown (UsingR) includes measurements in minutes of the time it takes an insulating fluid to break down as a function of an applied voltage. The relationship calls for a log-transform. Plot the voltage against the logarithm of time. Find the coefficients for simple linear regression and discuss the amount of variance for each level of the voltage.

y<- breakdown$voltage
x<-log(breakdown$time)
dataset<- breakdown
cor(x,y)

## [1] -0.7111953

res <- lm(y ~ x, data=dataset, subset= )
res

## 
## Call:
## lm(formula = y ~ x, data = dataset)
## 
## Coefficients:
## (Intercept)            x  
##     35.2356      -0.9973

plot(y ~ x, data=dataset) 
abline(res)

3.29 The motors (MASS) data set contains measurements on how long, in hours, it takes a motor to fail. For a range of temperatures, in degrees Celsius, a number of motors were run in an accelerated manner until they failed, or until time was cut off. (When time is cut off the data is said to have been censored.) The data shows a relationship between increased temperature and shortened life span. The commands data(motors, package=“MASS”) plot(time ~ temp, pch=cens, data=motors) produce a scatter plot of the variable time modeled by temp. The pch=cens argument marks points that were censored with a square; otherwise a circle is used. Make the scatter plot and answer the following: 1. How many different temperatures were used in the experiment? 2. Does the data look to be a candidate for a linear model? (You might want to consider why the data point (150,800) is marked with a square.) 3. Fit a linear model. What are the coefficients? 4. Use the linear model to make a prediction for the accelerated lifetime of a motor run at a temperature of 210°C

data(motors, package="MASS")
plot(time ~ temp, pch=cens, data=motors)

Fitting a Linear Model

y<- motors$time
x<- motors$temp
dataset<- motors
cor(x,y)

## [1] -0.9108416

res <- lm(y ~ x, data=dataset, subset= )
res

## 
## Call:
## lm(formula = y ~ x, data = dataset)
## 
## Coefficients:
## (Intercept)            x  
##     22999.1       -106.8

plot(y ~ x, data=dataset) 
abline(res)

4. Use the linear model to make a prediction for the accelerated lifetime of a motor run at a temperature of 210°C Predicting from linear model

predict(res, data.frame(x=210))

##        1 
## 580.5888

Bivariate Categorical Data

• Example 3.5: Is past performance an indicator of future performance? A common belief is that an “A” student in one class will be an “A” student in the next. Is this so? The data set grades contains the grades students received in a math class and their grades in a previous math class.Create a two way table

head(grades)

##   prev grade
## 1  B+    B+ 
## 2  A-    A- 
## 3  B+    A- 
## 4  F     F  
## 5  F     F  
## 6  A     B

table(grades$prev,grades$grade)

##       
##         A    A-   B+   B    B-   C+   C    D    F  
##    A     15    3    1    4    0    0    3    2    0
##    A-     3    1    1    0    0    0    0    0    0
##    B+     0    2    2    1    2    0    0    1    1
##    B      0    1    1    4    3    1    3    0    2
##    B-     0    1    0    2    0    0    1    0    0
##    C+     1    1    0    0    0    0    1    0    0
##    C      1    0    0    1    1    3    5    9    7
##    D      0    0    0    1    0    0    4    3    1
##    F      1    0    0    1    1    1    3    4   11

seatbelts <- matrix(c(56, 2, 8, 16), nrow=2)
dimnames(seatbelts) <- list(parent=c("buckled","unbuckled"),
child=c("buckled","unbuckled"))
seatbelts

##            child
## parent      buckled unbuckled
##   buckled        56         8
##   unbuckled       2        16

margin.table(seatbelts, margin=1) # Rowsum for parents

## parent
##   buckled unbuckled 
##        64        18

margin.table(seatbelts,margin=2) #Colsum for child

## child
##   buckled unbuckled 
##        58        24

Alternatively, the function addmargins will return the marginal distributions by extending the table.

addmargins(seatbelts)

##            child
## parent      buckled unbuckled Sum
##   buckled        56         8  64
##   unbuckled       2        16  18
##   Sum            58        24  82

Looking at the marginal distributions of the grade data also shows two similar distributions:

tbl <- with(grades, table(prev, grade))
margin.table(tbl,1)

## prev
##  A    A-   B+   B    B-   C+   C    D    F   
##   28    5    9   15    4    3   27    9   22

margin.table(tbl,2)

## grade
##  A    A-   B+   B    B-   C+   C    D    F   
##   21    9    5   14    7    5   20   19   22

Conditional Distribution of Two Way Tables For example, is there a difference in the grade a student gets if her previous grade is a B or a C? Or does the fact that a parent wears a seat belt affect the chance a child does? These questions are answered by comparing the rows or columns in a two-way table. It is usually much easier to compare proportions or percentages and not the absolute counts.

For that we use prop.table

round(prop.table(table(grades$prev, grades$grade), margin=1) * 100)

##       
##         A    A-   B+   B    B-   C+   C    D    F  
##    A     54   11    4   14    0    0   11    7    0
##    A-    60   20   20    0    0    0    0    0    0
##    B+     0   22   22   11   22    0    0   11   11
##    B      0    7    7   27   20    7   20    0   13
##    B-     0   25    0   50    0    0   25    0    0
##    C+    33   33    0    0    0    0   33    0    0
##    C      4    0    0    4    4   11   19   33   26
##    D      0    0    0   11    0    0   44   33   11
##    F      5    0    0    5    5    5   14   18   50

We can use xtabs ,To look at the distribution of type of car by origin in the Cars93. The xtabs function maps a data frame into a contingency table.

xtabs(~Origin+Type,Cars93)

##          Type
## Origin    Compact Large Midsize Small Sporty Van
##   USA           7    11      10     7      8   5
##   non-USA       9     0      12    14      6   4

xtabs(~Origin+Type+Passengers,Cars93)

## , , Passengers = 2
## 
##          Type
## Origin    Compact Large Midsize Small Sporty Van
##   USA           0     0       0     0      1   0
##   non-USA       0     0       0     0      1   0
## 
## , , Passengers = 4
## 
##          Type
## Origin    Compact Large Midsize Small Sporty Van
##   USA           0     0       0     2      7   0
##   non-USA       1     0       2     6      5   0
## 
## , , Passengers = 5
## 
##          Type
## Origin    Compact Large Midsize Small Sporty Van
##   USA           5     0       6     5      0   0
##   non-USA       8     0       9     8      0   0
## 
## , , Passengers = 6
## 
##          Type
## Origin    Compact Large Midsize Small Sporty Van
##   USA           2    11       4     0      0   0
##   non-USA       0     0       1     0      0   0
## 
## , , Passengers = 7
## 
##          Type
## Origin    Compact Large Midsize Small Sporty Van
##   USA           0     0       0     0      0   4
##   non-USA       0     0       0     0      0   4
## 
## , , Passengers = 8
## 
##          Type
## Origin    Compact Large Midsize Small Sporty Van
##   USA           0     0       0     0      0   1
##   non-USA       0     0       0     0      0   0

However, we would like to present it like this. Row - Type should be there Col- Origin and passenders we use ftable for this

tbl <- xtabs( ~ Origin + Type + Passengers, Cars93)
ftable(tbl, row.vars=2, col.vars=c(1,3))

##         Origin     USA                non-USA               
##         Passengers   2  4  5  6  7  8       2  4  5  6  7  8
## Type                                                        
## Compact              0  0  5  2  0  0       0  1  8  0  0  0
## Large                0  0  0 11  0  0       0  0  0  0  0  0
## Midsize              0  0  6  4  0  0       0  2  9  1  0  0
## Small                0  2  5  0  0  0       0  6  8  0  0  0
## Sporty               1  7  0  0  0  0       1  5  0  0  0  0
## Van                  0  0  0  0  4  1       0  0  0  0  4  0

We can use proportions also using prop.table

tbl <- xtabs( ~ Origin + Type + Passengers, Cars93)
out <- ftable(tbl, row.vars=2, col.vars=c(1,3))
round(prop.table(out,1)*100)

##         Origin     USA                     non-USA                    
##         Passengers   2   4   5   6   7   8       2   4   5   6   7   8
## Type                                                                  
## Compact              0   0  31  12   0   0       0   6  50   0   0   0
## Large                0   0   0 100   0   0       0   0   0   0   0   0
## Midsize              0   0  27  18   0   0       0   9  41   5   0   0
## Small                0  10  24   0   0   0       0  29  38   0   0   0
## Sporty               7  50   0   0   0   0       7  36   0   0   0   0
## Van                  0   0   0   0  44  11       0   0   0   0  44   0

We can also create bins of a continuous function and use in the contingency table

price.range <- cut(Cars93$Price, c(0, 20, 70))
tbl <- xtabs( ~ price.range + Origin + Cylinders, data=Cars93)
out <- ftable(tbl, row.vars=3)
100 * prop.table(out, margin=1)

##           price.range     (0,20]               (20,70]           
##           Origin             USA    non-USA        USA    non-USA
## Cylinders                                                        
## 3                       0.000000 100.000000   0.000000   0.000000
## 4                      44.897959  44.897959   0.000000  10.204082
## 5                       0.000000  50.000000   0.000000  50.000000
## 6                      35.483871   6.451613  29.032258  29.032258
## 8                      14.285714   0.000000  71.428571  14.285714
## rotary                  0.000000   0.000000   0.000000 100.000000

Graphical Summaries of Two Way Contingency Tables Barplots can be used effectively. 1. One variable is chosen to form the categories for barplots 2. Either the segment the bars for others or 3. Separate bars are plotted side by side

barplot(seatbelts, xlab="Parent", main="Child seat-belt usage",legend.text = TRUE)

barplot(seatbelts, xlab="Parent", main="Child seat-belt usage",
beside=TRUE,legend.text = TRUE)

if we want the parents distribution, we an flip the table around.

barplot(t(seatbelts))

We can use percentage also

prop.table(barplot(seatbelts, xlab="Parent", main="Child seat-belt usage",legend.text = TRUE))

## [1] 0.2692308 0.7307692

Mosaic Plot that makes it suitable for viewing relationships between two or more categorical variables.

titanic <- as.data.frame(Titanic)
xtabs(Freq ~ Survived + Class, data=titanic, subset=Sex=="Female")

##         Class
## Survived 1st 2nd 3rd Crew
##      No    4  13 106    3
##      Yes 141  93  90   20

We can create a mosaic plot like this

tbl <- xtabs(Freq ~ Sex, titanic)
mosaicplot(tbl)

Or this

tbl <- xtabs(Freq ~ Sex + Survived, titanic)
mosaicplot(tbl)

Or this

tbl <- xtabs(Freq ~ Sex + Survived + Class, titanic)
mosaicplot(tbl)

c. Here we note how the variables are assigned: The first variable (Sex) again splits the x axis, the second variable (Survived) again splits the y axis. The third (Class) then splits each of the four blocks.

Different orderings give different insights. Putting Class first (not shown) makes it easier to focus on the classes.

mosaicplot( xtabs(Freq ~ Class + Sex + Survived, data=titanic))

titanic <- as.data.frame(Titanic)
titanic

##    Class    Sex   Age Survived Freq
## 1    1st   Male Child       No    0
## 2    2nd   Male Child       No    0
## 3    3rd   Male Child       No   35
## 4   Crew   Male Child       No    0
## 5    1st Female Child       No    0
## 6    2nd Female Child       No    0
## 7    3rd Female Child       No   17
## 8   Crew Female Child       No    0
## 9    1st   Male Adult       No  118
## 10   2nd   Male Adult       No  154
## 11   3rd   Male Adult       No  387
## 12  Crew   Male Adult       No  670
## 13   1st Female Adult       No    4
## 14   2nd Female Adult       No   13
## 15   3rd Female Adult       No   89
## 16  Crew Female Adult       No    3
## 17   1st   Male Child      Yes    5
## 18   2nd   Male Child      Yes   11
## 19   3rd   Male Child      Yes   13
## 20  Crew   Male Child      Yes    0
## 21   1st Female Child      Yes    1
## 22   2nd Female Child      Yes   13
## 23   3rd Female Child      Yes   14
## 24  Crew Female Child      Yes    0
## 25   1st   Male Adult      Yes   57
## 26   2nd   Male Adult      Yes   14
## 27   3rd   Male Adult      Yes   75
## 28  Crew   Male Adult      Yes  192
## 29   1st Female Adult      Yes  140
## 30   2nd Female Adult      Yes   80
## 31   3rd Female Adult      Yes   76
## 32  Crew Female Adult      Yes   20

3.31 The data set coins (UsingR) contains the number of coins in a change bin and the years they were minted. Do the following: 1. How much money is in the change bin? 2. Make a barplot of the years. Is there a trend? 3. Use cut to construct a barplot by decade. 4. Make a contingency table of the year and the value. Does it look like the two variables are associated? Why?

head(coins)

##   year value
## 1 2003  0.01
## 2 2003  0.25
## 3 2003  0.25
## 4 2002  0.01
## 5 2002  0.01
## 6 2002  0.25

summary(coins)

##       year          value        
##  Min.   :1929   Min.   :0.01000  
##  1st Qu.:1983   1st Qu.:0.01000  
##  Median :1993   Median :0.01000  
##  Mean   :1989   Mean   :0.06989  
##  3rd Qu.:1999   3rd Qu.:0.10000  
##  Max.   :2003   Max.   :0.25000

str(coins)

## 'data.frame':    371 obs. of  2 variables:
##  $ year : num  2003 2003 2003 2002 2002 ...
##  $ value: num  0.01 0.25 0.25 0.01 0.01 0.25 0.01 0.01 0.01 0.01 ...

How much money is in the change bin?

sum(coins$value)

## [1] 25.93

Make a barplot of the years. Is there a trend?

tbl <- table(coins$year,coins$value)
tbl

##       
##        0.01 0.05 0.1 0.25
##   1929    2    1   0    0
##   1936    0    0   1    0
##   1939    0    0   1    0
##   1955    0    0   0    1
##   1959    1    0   0    0
##   1960    1    0   0    0
##   1964    1    0   0    0
##   1965    2    1   0    0
##   1966    1    1   0    0
##   1967    0    1   0    0
##   1968    1    0   0    1
##   1969    2    0   1    0
##   1970    1    1   0    2
##   1971    3    0   2    1
##   1972    2    0   0    1
##   1973    1    0   0    0
##   1974    1    1   1    0
##   1975    4    1   0    4
##   1976    2    1   0    1
##   1977    4    2   0    1
##   1978    5    1   0    0
##   1979    1    1   0    0
##   1980    1    1   0    0
##   1981    7    1   2    0
##   1982    9    1   2    1
##   1983    7    3   1    0
##   1984    7    3   0    1
##   1985    2    4   1    1
##   1986    3    0   1    1
##   1987    7    1   1    0
##   1988    7    3   2    2
##   1989    4    0   3    2
##   1990    3    3   1    2
##   1991    2    1   1    1
##   1992    6    0   0    6
##   1993    8    3   1    2
##   1994    7    1   2    1
##   1995    6    2   3    2
##   1996    7    3   0    4
##   1997    6    2   1    3
##   1998    5    1   0    5
##   1999   16    2   6    1
##   2000   15    4   3    8
##   2001   13    5   0    7
##   2002   19    3   5    3
##   2003    1    0   0    2

barplot(t(tbl),legend.text = TRUE)

Use cut to construct a barplot by decade.

yrange <- cut(coins$year,c(1940,1950,1960,1970,1980,1990,2000,2010))
tbl2 <- table(yrange,coins$value)
barplot(t(tbl2))

4. Make a contingency table of the year and the value. Does it look like the two variables are associated? Why?

str(coins)

## 'data.frame':    371 obs. of  2 variables:
##  $ year : num  2003 2003 2003 2002 2002 ...
##  $ value: num  0.01 0.25 0.25 0.01 0.01 0.25 0.01 0.01 0.01 0.01 ...

UsingR2

Priyank Goyal

02/11/2019