library(UsingR)
## Warning: package 'UsingR' was built under R version 3.6.1
## Loading required package: MASS
## Loading required package: HistData
## Warning: package 'HistData' was built under R version 3.6.1
## Loading required package: Hmisc
## Warning: package 'Hmisc' was built under R version 3.6.1
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
##
## Attaching package: 'UsingR'
## The following object is masked from 'package:survival':
##
## cancer
library(aplpack)
## Warning: package 'aplpack' was built under R version 3.6.1
2.1 Enter the following data into a variable p using c 2 3 5 7 11 13 17 19 Use length to check its length.
2.2 Al recorded his car’s mileage at gust last eight fill-ups: 65311 65624 6598 66219 66499 66821 67145 67447 Enter these numbers into the variable gas. Use the function diff on the data. What does it give? Interpret what both of these commands return: mean(gas) and mean(diff(gas)).
gas <- scan()
gas
## numeric(0)
diff(gas)
## numeric(0)
mean(gas)
## [1] NaN
mean(diff(gas))
## [1] NaN
2.3 Let our small data set be 2 5 4 1 8 1. Enter this data into a data vector x. 2. Find the square of each number. 3. Subtract 6 from each number. 4. Subtract 9 from each number and then square the answers.
x <- c(2,5,4,1,8)
x
## [1] 2 5 4 1 8
x*x
## [1] 4 25 16 1 64
x-6
## [1] -4 -1 -2 -5 2
(x-9)^2
## [1] 49 16 25 64 1
2.4 Create the following sequences: 1. “a”, “a”, “a”, “a”, “a” 2. 1, 3, . . ., 99 (the odd numbers in [1, 100]) 3. 1, 1, 1, 2, 2, 2, 3, 3, 3 4. 1, 1, 1, 2, 2, 3 5. 1, 2, 3, 4, 5, 4, 3, 2, 1 using :, seq, or rep as appropriate.
rep("a",times=5)
## [1] "a" "a" "a" "a" "a"
seq(1,100,2)
## [1] 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45
## [24] 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91
## [47] 93 95 97 99
c(rep(c(1,2,3),times=c(3,3,3)))
## [1] 1 1 1 2 2 2 3 3 3
c(rep(c(1,2,3),times=c(3,2,1)))
## [1] 1 1 1 2 2 3
c(1:5,4:1)
## [1] 1 2 3 4 5 4 3 2 1
2.5 Store the following data sets into a variable any way you can:
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
1/(1:10)
## [1] 1.0000000 0.5000000 0.3333333 0.2500000 0.2000000 0.1666667 0.1428571
## [8] 0.1250000 0.1111111 0.1000000
(1:6)^3
## [1] 1 8 27 64 125 216
1964:2014
## [1] 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977
## [15] 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991
## [29] 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005
## [43] 2006 2007 2008 2009 2010 2011 2012 2013 2014
seq(0,1000,25)
## [1] 0 25 50 75 100 125 150 175 200 225 250 275 300 325
## [15] 350 375 400 425 450 475 500 525 550 575 600 625 650 675
## [29] 700 725 750 775 800 825 850 875 900 925 950 975 1000
2.6 The average distance from the center is computed by (|x1-x¯| + ··· + |xn -x¯|)/n, where x¯ is the mean of the data vector. Compute this for the rivers data set using the function sum to add the values and abs to find the absolute value.
xbar <- mean(rivers)
sum(abs(rivers-xbar))
## [1] 44210.67
2.7 Precedence rules are used to decide the order of evaluating operations when parentheses are not present. Look at the value produced by -1:3. What is done first - or :? Now look at 1:23. Which is done first : or ?
-1:3
## [1] -1 0 1 2 3
-1:2*3
## [1] -3 0 3 6
2.8 The precip data set records average annual rainfall for many different cities in the United States. It is stored as a data vector with names. Find the average amounts for the cities starting with a “J”.
names(precip)
## [1] "Mobile" "Juneau" "Phoenix"
## [4] "Little Rock" "Los Angeles" "Sacramento"
## [7] "San Francisco" "Denver" "Hartford"
## [10] "Wilmington" "Washington" "Jacksonville"
## [13] "Miami" "Atlanta" "Honolulu"
## [16] "Boise" "Chicago" "Peoria"
## [19] "Indianapolis" "Des Moines" "Wichita"
## [22] "Louisville" "New Orleans" "Portland"
## [25] "Baltimore" "Boston" "Detroit"
## [28] "Sault Ste. Marie" "Duluth" "Minneapolis/St Paul"
## [31] "Jackson" "Kansas City" "St Louis"
## [34] "Great Falls" "Omaha" "Reno"
## [37] "Concord" "Atlantic City" "Albuquerque"
## [40] "Albany" "Buffalo" "New York"
## [43] "Charlotte" "Raleigh" "Bismark"
## [46] "Cincinnati" "Cleveland" "Columbus"
## [49] "Oklahoma City" "Portland" "Philadelphia"
## [52] "Pittsburg" "Providence" "Columbia"
## [55] "Sioux Falls" "Memphis" "Nashville"
## [58] "Dallas" "El Paso" "Houston"
## [61] "Salt Lake City" "Burlington" "Norfolk"
## [64] "Richmond" "Seattle Tacoma" "Spokane"
## [67] "Charleston" "Milwaukee" "Cheyenne"
## [70] "San Juan"
str(precip)
## Named num [1:70] 67 54.7 7 48.5 14 17.2 20.7 13 43.4 40.2 ...
## - attr(*, "names")= chr [1:70] "Mobile" "Juneau" "Phoenix" "Little Rock" ...
head(precip)
## Mobile Juneau Phoenix Little Rock Los Angeles Sacramento
## 67.0 54.7 7.0 48.5 14.0 17.2
2.9 An experiment had 10 different trials. Create a character vector with 10 different names for the trials, e.g., “Trial 1”,….
paste("trial",1:10)
## [1] "trial 1" "trial 2" "trial 3" "trial 4" "trial 5" "trial 6"
## [7] "trial 7" "trial 8" "trial 9" "trial 10"
2.10 Working with file names in R is easy, but requires the proper use of file separators, which vary depending on the operating system. For example, suppose you have the directory and file name of a file and want to get the entire file. f <- system.file(“DESCRIPTION”, package=“UsingR”) dname <- dirname(f) fname <- basename(f) To combine dname and fname into a full pathname use paste with the sep argument being .Platform$file.sep. What is the result?
2.11 The Manufacturer variable in the Cars93 (MASS) data set is stored as a factor. How many levels are there? How many different cases are there?
Cars93$Manufacturer
## [1] Acura Acura Audi Audi BMW
## [6] Buick Buick Buick Buick Cadillac
## [11] Cadillac Chevrolet Chevrolet Chevrolet Chevrolet
## [16] Chevrolet Chevrolet Chevrolet Chevrolet Chrylser
## [21] Chrysler Chrysler Dodge Dodge Dodge
## [26] Dodge Dodge Dodge Eagle Eagle
## [31] Ford Ford Ford Ford Ford
## [36] Ford Ford Ford Geo Geo
## [41] Honda Honda Honda Hyundai Hyundai
## [46] Hyundai Hyundai Infiniti Lexus Lexus
## [51] Lincoln Lincoln Mazda Mazda Mazda
## [56] Mazda Mazda Mercedes-Benz Mercedes-Benz Mercury
## [61] Mercury Mitsubishi Mitsubishi Nissan Nissan
## [66] Nissan Nissan Oldsmobile Oldsmobile Oldsmobile
## [71] Oldsmobile Plymouth Pontiac Pontiac Pontiac
## [76] Pontiac Pontiac Saab Saturn Subaru
## [81] Subaru Subaru Suzuki Toyota Toyota
## [86] Toyota Toyota Volkswagen Volkswagen Volkswagen
## [91] Volkswagen Volvo Volvo
## 32 Levels: Acura Audi BMW Buick Cadillac Chevrolet Chrylser ... Volvo
2.12 The Cylinders variable in the Cars93 (MASS) data set records the number of cylinders in the respective car. Why can this not be stored as a numeric value? Which cars have 5 cylinders?
View(Cars93)
Cars93$Cylinders
## [1] 4 6 6 6 4 4 6 6 6 8
## [11] 8 4 4 6 4 6 6 8 8 6
## [21] 4 6 4 4 4 6 4 6 4 6
## [31] 4 4 4 4 4 6 6 8 3 4
## [41] 4 4 4 4 4 4 4 8 6 6
## [51] 6 8 4 4 4 6 rotary 4 6 4
## [61] 6 4 6 4 4 6 6 4 4 6
## [71] 6 4 4 4 6 6 6 4 4 3
## [81] 4 4 3 4 4 4 4 4 5 4
## [91] 6 4 5
## Levels: 3 4 5 6 8 rotary
head(Cars93)
## Manufacturer Model Type Min.Price Price Max.Price MPG.city
## 1 Acura Integra Small 12.9 15.9 18.8 25
## 2 Acura Legend Midsize 29.2 33.9 38.7 18
## 3 Audi 90 Compact 25.9 29.1 32.3 20
## 4 Audi 100 Midsize 30.8 37.7 44.6 19
## 5 BMW 535i Midsize 23.7 30.0 36.2 22
## 6 Buick Century Midsize 14.2 15.7 17.3 22
## MPG.highway AirBags DriveTrain Cylinders EngineSize
## 1 31 None Front 4 1.8
## 2 25 Driver & Passenger Front 6 3.2
## 3 26 Driver only Front 6 2.8
## 4 26 Driver & Passenger Front 6 2.8
## 5 30 Driver only Rear 4 3.5
## 6 31 Driver only Front 4 2.2
## Horsepower RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity
## 1 140 6300 2890 Yes 13.2
## 2 200 5500 2335 Yes 18.0
## 3 172 5500 2280 Yes 16.9
## 4 172 5500 2535 Yes 21.1
## 5 208 5700 2545 Yes 21.1
## 6 110 5200 2565 No 16.4
## Passengers Length Wheelbase Width Turn.circle Rear.seat.room
## 1 5 177 102 68 37 26.5
## 2 5 195 115 71 38 30.0
## 3 5 180 102 67 37 28.0
## 4 6 193 106 70 37 31.0
## 5 4 186 109 69 39 27.0
## 6 6 189 105 69 41 28.0
## Luggage.room Weight Origin Make
## 1 11 2705 non-USA Acura Integra
## 2 15 3560 non-USA Acura Legend
## 3 14 3375 non-USA Audi 90
## 4 17 3405 non-USA Audi 100
## 5 13 3640 non-USA BMW 535i
## 6 16 2880 USA Buick Century
Cars93$Model[Cars93$Cylinders==5]
## [1] Eurovan 850
## 93 Levels: 100 190E 240 300E 323 535i 626 850 90 900 Accord ... Vision
2.13 The mtcars data set records information about cars from 1972. The values are coded using numbers. Recoding as factors can be more informative for the user. Recode the am variable, with 0 being “automatic” and 1 being “manual.”
str(mtcars$am)
## num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
mtcars$am <- as.factor(mtcars$am)
mtcars$am
## [1] 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1
## Levels: 0 1
levels(mtcars$am) <- paste(c("automatic","manual"))
mtcars$am
## [1] manual manual manual automatic automatic automatic automatic
## [8] automatic automatic automatic automatic automatic automatic automatic
## [15] automatic automatic automatic manual manual manual automatic
## [22] automatic automatic automatic automatic manual manual manual
## [29] manual manual manual manual
## Levels: automatic manual
2.14 The Arbuthnot (HistData) data set contains information on the number of Male and Female births in London from 1629 to 1710. Using and and > determine if there was ever a year with more female births.
Arbuthnot
## Year Males Females Plague Mortality Ratio Total
## 1 1629 5218 4683 0 8771 1.114243 9.901
## 2 1630 4858 4457 1317 10554 1.089971 9.315
## 3 1631 4422 4102 274 8562 1.078011 8.524
## 4 1632 4994 4590 8 9535 1.088017 9.584
## 5 1633 5158 4839 0 8393 1.065923 9.997
## 6 1634 5035 4820 1 10400 1.044606 9.855
## 7 1635 5106 4928 0 10651 1.036120 10.034
## 8 1636 4917 4605 10400 23359 1.067752 9.522
## 9 1637 4703 4457 3082 11763 1.055194 9.160
## 10 1638 5359 4952 363 13624 1.082189 10.311
## 11 1639 5366 4784 314 9862 1.121656 10.150
## 12 1640 5518 5332 1450 12771 1.034884 10.850
## 13 1641 5470 5200 1375 13142 1.051923 10.670
## 14 1642 5460 4910 1274 13273 1.112016 10.370
## 15 1643 4793 4617 996 13212 1.038120 9.410
## 16 1644 4107 3997 1492 10933 1.027521 8.104
## 17 1645 4047 3919 1871 11479 1.032661 7.966
## 18 1646 3768 3395 2365 12780 1.109867 7.163
## 19 1647 3796 3536 3597 14059 1.073529 7.332
## 20 1648 3363 3181 611 9894 1.057215 6.544
## 21 1649 3079 2746 67 10566 1.121267 5.825
## 22 1650 2890 2722 15 8754 1.061719 5.612
## 23 1651 3231 2840 23 10827 1.137676 6.071
## 24 1652 3220 2908 16 12569 1.107290 6.128
## 25 1653 3196 2959 6 10087 1.080095 6.155
## 26 1654 3441 3179 16 13247 1.082416 6.620
## 27 1655 3655 3349 9 11357 1.091371 7.004
## 28 1656 3668 3382 6 13921 1.084565 7.050
## 29 1657 3396 3289 4 12434 1.032533 6.685
## 30 1658 3157 3013 14 14993 1.047793 6.170
## 31 1659 3209 2781 36 14756 1.153901 5.990
## 32 1660 3724 3247 13 12681 1.146905 6.971
## 33 1661 4748 4107 20 16665 1.156075 8.855
## 34 1662 5216 4803 12 13664 1.085988 10.019
## 35 1663 5411 4881 9 12741 1.108584 10.292
## 36 1664 6041 5681 5 15453 1.063369 11.722
## 37 1665 5114 4858 68596 97306 1.052697 9.972
## 38 1666 4678 4319 1998 12738 1.083121 8.997
## 39 1667 5616 5322 35 15842 1.055242 10.938
## 40 1668 6073 5560 14 17278 1.092266 11.633
## 41 1669 6506 5829 3 19432 1.116143 12.335
## 42 1670 6278 5719 0 20198 1.097744 11.997
## 43 1671 6449 6061 5 15729 1.064016 12.510
## 44 1672 6443 6120 5 18230 1.052778 12.563
## 45 1673 6073 5822 5 17504 1.043112 11.895
## 46 1674 6113 5738 3 21201 1.065354 11.851
## 47 1675 6058 5717 1 17244 1.059647 11.775
## 48 1676 6552 5847 2 18732 1.120575 12.399
## 49 1677 6423 6203 2 19076 1.035467 12.626
## 50 1678 6568 6033 5 20678 1.088679 12.601
## 51 1679 6247 6041 2 21730 1.034100 12.288
## 52 1680 6548 6299 0 21053 1.039530 12.847
## 53 1681 6822 6533 0 23951 1.044237 13.355
## 54 1682 6909 6744 0 20691 1.024466 13.653
## 55 1683 7577 7158 0 20587 1.058536 14.735
## 56 1684 7575 7127 0 23202 1.062860 14.702
## 57 1685 7484 7246 0 23222 1.032846 14.730
## 58 1686 7575 7119 0 22609 1.064054 14.694
## 59 1687 7737 7214 0 21460 1.072498 14.951
## 60 1688 7487 7101 0 22921 1.054359 14.588
## 61 1689 7604 7167 0 23502 1.060974 14.771
## 62 1690 7909 7302 0 21461 1.083128 15.211
## 63 1691 7662 7392 0 22691 1.036526 15.054
## 64 1692 7602 7316 0 20874 1.039092 14.918
## 65 1693 7676 7483 0 20959 1.025792 15.159
## 66 1694 6985 6647 0 24109 1.050850 13.632
## 67 1695 7263 6713 0 19047 1.081931 13.976
## 68 1696 7632 7229 0 18638 1.055748 14.861
## 69 1697 8062 7767 0 20292 1.037981 15.829
## 70 1698 8426 7626 0 20183 1.104904 16.052
## 71 1699 7911 7452 0 20795 1.061594 15.363
## 72 1700 7578 7061 0 19443 1.073219 14.639
## 73 1701 8102 7514 0 20471 1.078254 15.616
## 74 1702 8031 7656 0 19481 1.048981 15.687
## 75 1703 7765 7683 0 20720 1.010673 15.448
## 76 1704 6113 5738 0 22684 1.065354 11.851
## 77 1705 8366 7779 0 22097 1.075460 16.145
## 78 1706 7952 7417 0 19847 1.072132 15.369
## 79 1707 8379 7687 0 21600 1.090022 16.066
## 80 1708 8239 7623 0 21291 1.080808 15.862
## 81 1709 7840 7380 0 21800 1.062331 15.220
## 82 1710 7640 7288 0 24620 1.048299 14.928
Arbuthnot$Year[Arbuthnot$Females>Arbuthnot$Males]
## integer(0)
2.15 The negation operator ! is used to reverse Boolean values. For example: A <- c(TRUE, FALSE, TRUE, TRUE) !A ## [1] FALSE TRUE FALSE FALSE One of De Morgan’s laws in R code is !(A & B) = !A | !B. Verify this with B <- c(FALSE, TRUE, FALSE, TRUE) and A as above.
A <- c(F,T,F,F)
B <- c(F,T,F,T)
!(A&B)
## [1] TRUE FALSE TRUE TRUE
!A|!B
## [1] TRUE FALSE TRUE TRUE
2.16 In the precip data set, find all the cities with an average annual rainfall exceeding 50 inches.
names(precip)
## [1] "Mobile" "Juneau" "Phoenix"
## [4] "Little Rock" "Los Angeles" "Sacramento"
## [7] "San Francisco" "Denver" "Hartford"
## [10] "Wilmington" "Washington" "Jacksonville"
## [13] "Miami" "Atlanta" "Honolulu"
## [16] "Boise" "Chicago" "Peoria"
## [19] "Indianapolis" "Des Moines" "Wichita"
## [22] "Louisville" "New Orleans" "Portland"
## [25] "Baltimore" "Boston" "Detroit"
## [28] "Sault Ste. Marie" "Duluth" "Minneapolis/St Paul"
## [31] "Jackson" "Kansas City" "St Louis"
## [34] "Great Falls" "Omaha" "Reno"
## [37] "Concord" "Atlantic City" "Albuquerque"
## [40] "Albany" "Buffalo" "New York"
## [43] "Charlotte" "Raleigh" "Bismark"
## [46] "Cincinnati" "Cleveland" "Columbus"
## [49] "Oklahoma City" "Portland" "Philadelphia"
## [52] "Pittsburg" "Providence" "Columbia"
## [55] "Sioux Falls" "Memphis" "Nashville"
## [58] "Dallas" "El Paso" "Houston"
## [61] "Salt Lake City" "Burlington" "Norfolk"
## [64] "Richmond" "Seattle Tacoma" "Spokane"
## [67] "Charleston" "Milwaukee" "Cheyenne"
## [70] "San Juan"
head(precip)
## Mobile Juneau Phoenix Little Rock Los Angeles Sacramento
## 67.0 54.7 7.0 48.5 14.0 17.2
precip[precip>50]
## Mobile Juneau Jacksonville Miami New Orleans
## 67.0 54.7 54.5 59.8 56.8
## San Juan
## 59.2
2.17 For the precip data set, we can find the mean and the 25%-trimmed mean with mean(precip) and mean(precip, trim=.25). Are any values in the data set more than 1.5 times the trimmed mean above the mean?
mean(precip)
## [1] 34.88571
tmean <- mean(precip,trim=.25)
tmean
## [1] 36.65278
precip[precip>1.5*tmean]
## Mobile Miami New Orleans San Juan
## 67.0 59.8 56.8 59.2
2.19 You track your commute times for two weeks (ten days), recording the following times in minutes: 17 16 2 24 22 15 21 15 17 22 Enter these into R. Use the function max to find the longest commute time, the function mean to find the average, and the function min to find the minimum. Oops, the 24 was a mistake. It should have been 18. How can you fix this? Do so, and then find the new average. How many times was your commute 20 minutes or more? What percent of your commutes are less than 18 minutes long?
comm <- c(17,16, 20,24, 22, 15, 21, 15, 17, 22)
max(comm)
## [1] 24
mean(comm)
## [1] 18.9
min(comm)
## [1] 15
comm[4] <- 18
comm
## [1] 17 16 20 18 22 15 21 15 17 22
mean(comm)
## [1] 18.3
sum(comm>=20)
## [1] 4
(sum(comm<18)*100)/length(comm)
## [1] 50
2.20 Suppose monthly sales (in 10,000s) of CDs in 2013 were JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC 79 74 161 127 133 21 99 143 249 249 368 32 Enter the data into a data vector cd. Through indexing, form two data vectors: one containing the months with 31 days, the other the remaining months. Compare the means of these two data vectors.
cd <- c(Jan=79,Feb=74,Mar=161,Apr=127,May=133,Jun=210,Jul=99,Aug=143,Sep=249,Oct=249,Nov=368,Dec=302)
cd
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 79 74 161 127 133 210 99 143 249 249 368 302
cd31 <- cd[c(1,3,5,7,8,10,12)]
cd31
## Jan Mar May Jul Aug Oct Dec
## 79 161 133 99 143 249 302
cdOth <- cd[c(2,4,6,9,11)]
cdOth
## Feb Apr Jun Sep Nov
## 74 127 210 249 368
mean(cd31)-mean(cdOth)
## [1] -39.02857
2.21 The following data records the average salary in major league baseball for the years 1990–1999 (in millions): .57 .89 1.08 1.12 1.18 1.07 1.17 1.38 1.44 1.72 Use diff to find the differences from year to year. Are there any years where the amount dropped from the previous year? The percentage difference is the difference divided by the previous year times 100. This can be found by dividing the output of diff by the first nine numbers (but not all ten). After doing this, determine which year has the biggest percentage increase.
sal <- c(.57, .89, 1.08, 1.12, 1.18, 1.07, 1.17, 1.38, 1.44,1.72)
diff(sal)
## [1] 0.32 0.19 0.04 0.06 -0.11 0.10 0.21 0.06 0.28
perdiff <- diff(sal)/sal[1:length(sal)-1]
perdiff
## [1] 0.56140351 0.21348315 0.03703704 0.05357143 -0.09322034 0.09345794
## [7] 0.17948718 0.04347826 0.19444444
2.22 Write a function to compute the average distance from the mean for some data vector.
dis <- function(xVec){
mean1 <- mean(xVec)
mean(abs(xVec-mean1))
}
a<- c(2,4,6,8,10)
dis(a)
## [1] 2.4
2.23 Write a function f which finds the average of the x values after squaring and subtracts the square of the average of the numbers. Verify this output will always be non-negative by computing f(1:1).
2.24 An integer is even if the remainder upon dividing it by 2 is 0. This remainder is given by R with the syntax x %% 2. Use this to write a function iseven that indicates if a number is even. How would you write isodd?
2.25 Write a function isprime that checks if a number x is prime by dividing x by all the values in 2, . . . , x -1 then checking to see if there is a remainder of 0. The expression a %% b returns the remainder of a divided by b.
isprime <- function(x){
all(x %% 2:(x-1)!=0)
}
isprime(37)
## [1] TRUE
• Example 2.9: Grading by z-scores Professor H. grades by z-scores. Students having a z-score greater than 1.28 will get an “A.” In a small class the following student grades were recorded: 54 5 79 79 51 69 55 62 1 8 Did anyone get an “A”?
score1 <- c(54,50, 79, 79, 51, 69, 55, 62, 100, 80)
zscore1 <- scale(score1)
zscore1
## [,1]
## [1,] -0.84681616
## [2,] -1.09050427
## [3,] 0.67623449
## [4,] 0.67623449
## [5,] -1.02958224
## [6,] 0.06701423
## [7,] -0.78589414
## [8,] -0.35943995
## [9,] 1.95559704
## [10,] 0.73715652
## attr(,"scaled:center")
## [1] 67.9
## attr(,"scaled:scale")
## [1] 16.41442
score1[zscore1>1.28]
## [1] 100
Now, given these values for the mean and standard deviation, what score would be just good enough for an “A”?
ScoreGoodEnough <- mean(score1)+1.28*sd(score1)
ScoreGoodEnough
## [1] 88.91046
For the exec.pay (UsingR) data set there are 199 values. What proportion are more than 3 standard deviations from the mean? Translating, this means the absolute value of the z-scores should be more than 3:
head(exec.pay)
## [1] 136 74 8 38 46 43
mean(exec.pay)
## [1] 59.88945
sd(exec.pay)
## [1] 207.0435
threesdaway <- mean(exec.pay)+3*sd(exec.pay)
sum(exec.pay>threesdaway)/length(exec.pay)
## [1] 0.01507538
That is 1.5% if the z-scores are larger than 3. This is much more than would be expected for bell shaped data (we would expect 0.3%) but less than theoretically possible, which is 11.1% (1/9).
• Example 2.11: SAT rankings Many college-bound students in the United States are familiar with the IQR, even if not by name. The ubiquitous college rankings almost always list the 25th and 75th percentiles for SAT scores. For example, Table 2.3 reproduces a table showing the first and third quartiles for accepted students at IvyLeague colleges. The IQR is Q3-Q1. “Writing” has the smallest average IQR of 85, with “reading” the highest with 98.75. Does the data imply that nearly 25% of Yale students have perfect scores (800 on each category)?
skew <- function(x) {
n <- length(x)
z <- (x - mean(x)) / sd(x)
sum(z^3) / n }
skew(exec.pay)
## [1] 9.578542
kurtosis <- function(x) {
n <- length(x)
z <- (x - mean(x)) / sd(x)
sum(z^4)/n - 3
}
kurtosis(galton$parent)
## [1] 0.05104267
stem(bumpers)
##
## The decimal point is 3 digit(s) to the right of the |
##
## 0 | 68
## 1 | 333556
## 2 | 000123445
## 3 | 011233
From this we can see this data is mostly symmetric and not long tailed. We can also count that there are 23 values in the data set so by counting can find that the median has a stem of 2 and leaf of 1. But what is it as a number? The line “The decimal point is 3 digit(s) to the right of the |” needs to be interpreted. Mathematically it is saying the value is 2.1e3, or 2, 100. (It in fact is 2, 129 which gets truncated to 2, 100 in the making of the summary.)
head(bumpers)
## Honda Accord Chevrolet Cavalier Toyota Camry
## 618 795 1304
## Saturn SL2 Mitsubishi Galant Dodge Monaco
## 1308 1340 1456
?bumpers
## starting httpd help server ... done
hist(faithful$waiting)
How to Create a Histogram Manually 1. We first need to create the bins bins <- seq(40, 100, by=5) 2. We then create breaks x <- faithful$waiting out <- cut(x, breaks=bins) head(out) ## [1] (75,8] (5,55] (7,75] (6,65] (8,85] (5,55] ## 12 Levels: (4,45] (45,5] (5,55] (55,6] (6,65] … (95,1] 3. We then create frequencies
table(out) ## out ## (4,45] (45,5] (5,55] (55,6] (6,65] (65,7] (7,75] ## 4 22 33 24 14 1 27 ## (75,8] (8,85] (85,9] (9,95] (95,1] ## 54 55 23 5 1 The histogram then just represents these counts with bars.
plot( density(bumpers) )
How to Create a Histogram and a density plot
b_hist <- hist(bumpers, plot=FALSE)
b_dens <- density(bumpers)
hist(bumpers, probability=TRUE,
xlim = range(c(b_hist$breaks, b_dens$x)),
ylim = range(c(b_hist$density, b_dens$y)))
lines(b_dens, lwd=2)
The construction of the density plot uses a function (called a density) and a window size (referred to as the bandwidth). Different choices produce different graphics. The bandwidth is similar to the choice of bin size in the histogram: larger values smooth out the graphic, smaller ones make it more spiky. R provides algorithms for automatic selection. The default is “nrd”. We can specify other names to the bw argument: “nrd”, “ucv”, “bcv”, or “SJ”.
boxplot(bumpers, horizontal=TRUE, main="Bumpers")
head(Macdonell)
## height finger frequency
## 1 4.630208 9.4 0
## 2 4.630208 9.5 0
## 3 4.630208 9.6 0
## 4 4.630208 9.7 0
## 5 4.630208 9.8 0
## 6 4.630208 9.9 0
x <- rep(Macdonell$finger, Macdonell$frequency)
qqnorm(x)
x <- jitter(HistData::Galton$child, factor=5) # add noise
qqnorm(x)
2.26 The beatles (LearnEDA) data set records the length of songs (in seconds) by the Beatles in the time variable. Find the mean length, median length, longest length, and shortest length in minutes.
2.27 The ChestSizes (HistData) data set contains Quetelet’s data on chest measurements of 5,738 Scottish Militiamen. The data is tabulated. Find the mean.
# We need to find the weighted mean
head(ChestSizes)
## chest count
## 1 33 3
## 2 34 18
## 3 35 81
## 4 36 185
## 5 37 420
## 6 38 749
summary(ChestSizes)
## chest count
## Min. :33.00 Min. : 1.00
## 1st Qu.:36.75 1st Qu.: 20.25
## Median :40.50 Median : 138.50
## Mean :40.50 Mean : 358.62
## 3rd Qu.:44.25 3rd Qu.: 680.75
## Max. :48.00 Max. :1079.00
propf <- prop.table(ChestSizes$count)
propf
## [1] 0.0005228303 0.0031369815 0.0141164169 0.0322411990 0.0731962356
## [6] 0.1305332869 0.1869989543 0.1880446148 0.1627744859 0.1146741025
## [11] 0.0644823980 0.0160334611 0.0087138376 0.0036598118 0.0006971070
## [16] 0.0001742768
mean=sum(ChestSizes$chest*propf)
mean
## [1] 39.83182
farm <- read.csv("farms.csv",header = TRUE)
stem(farm$count)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 0 | 1133346677890266
## 2 | 155890139
## 4 | 013589003589
## 6 | 5589
## 8 | 0149118
## 10 | 0
## 12 |
## 14 |
## 16 |
## 18 |
## 20 |
## 22 | 7
mean(farm$count)
## [1] 44.04
2.31 For the data sets bumpers (UsingR), firstchi (UsingR), and math (UsingR), make histograms. Try to predict the mean, median, and standard deviation from the graphic. Check your guesses with the appropriate R commands
hist(bumpers)
hist(firstchi)
hist(math)
plot(density(pi2000))
hist(pi2000,breaks = 0:10-0.5)
x <- babies$smoke
x <- factor(x, labels=c("never", "now", "until current",
"once, quit", "unknown"))
table(x)
## x
## never now until current once, quit unknown
## 544 484 95 103 10
barplot(table(x),horiz = TRUE,main="Smoking Data")
dotchart(table(x))
## Warning in dotchart(table(x)): 'x' is neither a vector nor a matrix: using
## as.numeric(x)
pie(table(x))
2.61 Numeric data can be discretized through the cut function. For example, the command cut(bumpers, c(0, 1000, 2000, 3000, 4000)) will categorize the repair cost of a bumper by its rough amount. Make a table of this. Which range has the largest number of data points?
a <- cut(bumpers, c(0, 1000, 2000, 3000, 4000))
table(a)
## a
## (0,1e+03] (1e+03,2e+03] (2e+03,3e+03] (3e+03,4e+03]
## 2 8 7 6
2.62 The Cylinders variable Cars93 (MASS) data set records the number of cylinders in a factor. What kind of summary does R compute for factors? Look at summary(Cars93$Cylinders) to see.
summary(Cars93$Cylinders)
## 3 4 5 6 8 rotary
## 3 49 2 31 7 1
2.63 The lorem variable in UsingR contains 5 paragraphs of dummy typesetting text that has been in use for centuries. What is the most common letter used? To answer this, you can break a character value into letters by the idiom unlist(strsplit(x,"")) where x is character data.
x <- lorem
sort(table(unlist(strsplit(x,""))),decreasing = TRUE)
##
## e i t u s a l n r o c m d p . v ,
## 589 370 343 289 289 272 251 220 211 183 174 156 142 102 80 73 49 48
## g b q f h N \n D I M S P C V j A E L
## 45 38 30 22 17 13 10 8 8 7 7 6 5 5 4 3 3 3
## U x F Q ;
## 3 3 2 2 1
After blanks, e is the most common character used. 2.64 Make a dot chart of the Cylinders variable in the Cars93 (MASS) data set.
dotchart(table(Cars93$Cylinders))
## Warning in dotchart(table(Cars93$Cylinders)): 'x' is neither a vector nor a
## matrix: using as.numeric(x)
2.68 The data set central.park (UsingR) holds the coded variable WX representing bad weather (e.g., 1 for “fog”, 3 for “thunder”, 8 for “smoke” or “haze”). NA is used when none of the types occurred. Make a table of the data, then make a table with the extra argument exclude=FALSE. Why is the second table better?
head(central.park)
## DY MAX MIN AVG DEP HDD CDD WTR SNW DPTH SPD SPD1 DIR MIN2 PSBL S.S WX
## 1 1 72 51 62 4 3 0 0.04 0 0 5.8 12 180 0 0 5 18
## 2 2 75 52 64 6 1 0 0.29 0 0 7.0 17 160 M M 4 1
## 3 3 65 50 58 0 7 0 0.00 0 0 8.5 16 40 M M 4 NA
## 4 4 63 48 56 -3 9 0 0.00 0 0 2.8 10 150 M M 0 NA
## 5 5 59 45 52 -7 13 0 0.05 0 0 4.6 13 160 M M 4 1
## 6 6 61 45 53 -7 12 0 0.00 0 M 4.5 10 90 0 0 7 18
## SPD3 DR
## 1 17 170
## 2 22 350
## 3 26 40
## 4 14 190
## 5 17 130
## 6 14 70
table(central.park$WX)
##
## 1 18
## 10 11
table(central.park$WX,exclude=FALSE)
##
## 1 18 <NA>
## 10 11 10
For example, suppose that to investigate the question of what foods can have an impact on sports performance, a study was done with a group of subjects.2 The treatment group was fed 3/2 cups of beets 75 minutes prior to an activity, the control group was not. The activity was a timed run. The researcher’s thinking was that the nitrates found in beets would widen the subjects’ blood vessels, increase blood flow to the working muscles, hence lower the time duration of the activity.
Suppose the data measured (in minutes) are given by:
Beets eaten: 41 40 41 42 44 35 41 36 47 45 No Beets eaten: 51 51 50 42 40 31 43 45 What can be said about this data?
Does the data seem to come from similar populations, that is do the two have the same shape?
Does the data seem to have similar center, similar spread?
Compare Mean and SD
beets <- c(41, 40, 41, 42, 44, 35, 41, 36, 47, 45)
no_beets <- c(51, 51, 50, 42, 40, 31, 43, 45)
c(xbar1=mean(beets), xbar2=mean(no_beets),
sd1=sd(beets), sd2=sd(no_beets))
## xbar1 xbar2 sd1 sd2
## 41.200000 44.125000 3.705851 6.812541
Visualise with Appropriate Data Set stem.leaf.backback from the aplpack package does
stem.leaf.backback(beets,no_beets,rule.line = "Sturges")
## ________________________________
## 1 | 2: represents 12, leaf unit: 1
## beets no_beets
## ________________________________
## | 3* |1 1
## 2 65| 3. |
## (6) 421110| 4* |023 (3)
## 2 75| 4. |5 (1)
## | 5* |011 3
## | 5. |
## | 6* |
## ________________________________
## n: 10 8
## ________________________________
Focussing in the middle, we see that no_beets has more spread.but the difference in the centres is not obvious.
boxplot(no_beets, beets, names=c("no beets", "beets"),horizontal=TRUE)
How to Make multiple Density Plots
Let’s consider a different data set. The michelson (MASS) data set contains measurement of the speed of light in air made by Michelson in 1879. The data set consists of five experiments. We compare the fourth and fifth below. The data is stored in a data frame with one variable Speed recording the speed and one Expt recording the experiment number. We use indexing to get our two variables:
speed <- michelson$Speed; expt <- michelson$Expt
fourth <- speed[expt == 4]
fifth <- speed[expt == 5]
With this, we now pre-compute the densities so that we can find their respective ranges:
d_4 <- density(fourth)
d_5 <- density(fifth)
xrange <- range(c(d_4$x, d_5$x))
yrange <- range(c(d_4$y, d_5$y))
Now we plot. The first density plot is drawn using plot, the second layer added with lines. As the default labels come from the initial density plotted—and don’t apply to both—we modify both the xlab and main arguments. As well, we use the lty=2 argument for the lines command, to have that line added with a different line type: The figure drawn would benefit from a legend indicting which line represents which variable. The legend function will add a legend in base graphics. The position of the legend, the text, and the defining feature needs to be specified. The following would place the legend in the upper left with the line type marking which part of the graph corresponds to which variables:
plot(d_4, xlim=xrange, ylim=yrange, xlab="densities", main="")
lines(d_5, lty=2)
legend(65, .8, legend=c("Fourth", "Fifth"), lty=c(1,2))
create a quantile Comaprison to see if the two distributions are the same.
qqplot(fourth, fifth)
The data more or less tracks a straight line, though the observed difference in the shape of the peak causes it to flatten out in the middle. The small size of the data sets involved make such irregularities hard to gauge if they are due to sampling or some systematic difference. Either way, the most important question to answer with this from a statistical inference viewpoint is whether the tails seem comparable, as they do in this case.
beets <- c(41, 40, 41, 42, 44, 35, 41, 36, 47, 45)
no_beets <- c(51, 51, 50, 42, 40, 31, 43, 45)
b <- list(beets = beets, no_beets = no_beets)
b
## $beets
## [1] 41 40 41 42 44 35 41 36 47 45
##
## $no_beets
## [1] 51 51 50 42 40 31 43 45
b$beets
## [1] 41 40 41 42 44 35 41 36 47 45
b[1]
## $beets
## [1] 41 40 41 42 44 35 41 36 47 45
b[[1]]
## [1] 41 40 41 42 44 35 41 36 47 45
boxplot(b)
speed <- michelson$Speed; expt <- michelson$Expt
speeds <- list(speed[expt == 1], speed[expt == 2], speed[expt == 3],
speed[expt == 4], speed[expt == 5])
names(speeds) <- paste("Expt", 1:5)
boxplot(speeds)
Data frames We have mentioned that data frames are a preferred way to store many variables in rectangular manner with each variable being a different column and observations for similar cases in rows. R provides the data.frame constructor for creating data frames, among other ways—they can be produced column-by-column with cbind and rowby-row with rbind. The usage is similar to list. For example, here we create a data frame to hold some student data:
student <- c("Alice", "Bob")
grade <- c("A", "B")
attendance <- c("awesome", "bad")
data.frame(student, grade, attendance)
## student grade attendance
## 1 Alice A awesome
## 2 Bob B bad
## student grade attendance
boxplot(Speed ~ Expt, data=michelson)
boxplot(Speed ~ Expt, data=michelson, subset=Run %in% 11:20)
plot(Speed ~ Expt, data=michelson)
out <- summary(Speed ~ Expt, data=michelson)
plot(out)
b <- list("beets" = beets, "no beets" = no_beets)
stacked <- stack(b)
str(stacked)
## 'data.frame': 18 obs. of 2 variables:
## $ values: num 41 40 41 42 44 35 41 36 47 45 ...
## $ ind : Factor w/ 2 levels "beets","no beets": 1 1 1 1 1 1 1 1 1 1 ...
head(stacked)
## values ind
## 1 41 beets
## 2 40 beets
## 3 41 beets
## 4 42 beets
## 5 44 beets
## 6 35 beets
plot(values ~ ind, data=stacked)
3.4 Three students record the time spent on homework per class. Their data is Marsha 25 0 45 90 0 Bill 30 30 30 30 Holly 15 0 90 0 Use a list to store these values. Then create a boxplot to compare.
Marsha <- c(25,0,45,90,0)
Bill <- c(30,30,30,30)
Holly <- c(15,0,90,0)
All <- list(Marsha,Bill,Holly)
All
## [[1]]
## [1] 25 0 45 90 0
##
## [[2]]
## [1] 30 30 30 30
##
## [[3]]
## [1] 15 0 90 0
boxplot(All)
3.5 A group of nursing students take turns measuring some basic assessments. Their data is: Temp Pulse Systolic Diastolic —————————————- Jackie 98.2 96 134 90 Florence 98.6 56 120 80 Mildred 98.2 76 150 95 Create a data frame of these values. Will plot and boxplot produce same graphic?
Jackie <- c(98.2, 96, 134,90)
Florence <- c(98.6, 56, 120, 80)
Mildred <- c(98.2, 76, 150, 95)
all <- data.frame(Jackie,Florence,Mildred)
str(all)
## 'data.frame': 4 obs. of 3 variables:
## $ Jackie : num 98.2 96 134 90
## $ Florence: num 98.6 56 120 80
## $ Mildred : num 98.2 76 150 95
boxplot(all)
plot(all)
3.6 After loading Hmisc, what kind of summary is produced for a model formula of the type x~f? To see, investigate the output of
require(Hmisc) # loaded by UsingR
library(MASS)
summary(Speed ~ Expt, michelson)
## Speed N= 100
##
## +-------+-+---+-----+
## | | |N |Speed|
## +-------+-+---+-----+
## |Expt |1| 20|909.0|
## | |2| 20|856.0|
## | |3| 20|845.0|
## | |4| 20|820.5|
## | |5| 20|831.5|
## +-------+-+---+-----+
## |Overall| |100|852.4|
## +-------+-+---+-----+
3.7 The split function will coerce its second argument to a factor then split. This is useful if the grouping variable is stored as a numeric value. Verify this by splitting the mpg variable in the mtcars data set by the cyl variable.
a <- split(mtcars$mpg,mtcars$cyl)
a
## $`4`
## [1] 22.8 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26.0 30.4 21.4
##
## $`6`
## [1] 21.0 21.0 21.4 18.1 19.2 17.8 19.7
##
## $`8`
## [1] 18.7 14.3 16.4 17.3 15.2 10.4 10.4 14.7 15.5 15.2 13.3 19.2 15.8 15.0
3.8 The second argument to split can be a list of factors. The result is that all interactions (possible combinations) are used for the groups. In the ToothGrowth data set, growth (len) is measured for two types of supplements (supp) and three doses (dose). Split this len value into 6 groups.
b <- split(ToothGrowth$supp,ToothGrowth$dose)
b
## $`0.5`
## [1] VC VC VC VC VC VC VC VC VC VC OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ
## Levels: OJ VC
##
## $`1`
## [1] VC VC VC VC VC VC VC VC VC VC OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ
## Levels: OJ VC
##
## $`2`
## [1] VC VC VC VC VC VC VC VC VC VC OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ
## Levels: OJ VC
c <- split(ToothGrowth$len,b)
c
## $OJ.OJ.OJ
## [1] 16.5 16.5 15.2 17.3 22.5 17.3 13.6 14.5 18.8 15.5 15.2 21.5 17.6 9.7
## [15] 14.5 10.0 8.2 9.4 16.5 9.7 25.5 26.4 22.4 24.5 24.8 30.9 26.4 27.3
## [29] 29.4 23.0
##
## $VC.OJ.OJ
## numeric(0)
##
## $OJ.VC.OJ
## numeric(0)
##
## $VC.VC.OJ
## numeric(0)
##
## $OJ.OJ.VC
## numeric(0)
##
## $VC.OJ.VC
## numeric(0)
##
## $OJ.VC.VC
## numeric(0)
##
## $VC.VC.VC
## [1] 4.2 11.5 7.3 5.8 6.4 10.0 11.2 11.2 5.2 7.0 23.6 18.5 33.9 25.5
## [15] 26.4 32.5 26.7 21.5 23.3 29.5 19.7 23.3 23.6 26.4 20.0 25.2 25.8 21.2
## [29] 14.5 27.3
3.9 The use of a cell phone while driving is often thought to increase the chance of an accident. The data set reaction.time (UsingR) is simulated data on the time it takes to react to an external event while driving. Subjects with control == “C” are not using a cell phone, and those with control == “T” are. Their time to respond to some external event is recorded in seconds. Create side-by-side boxplots of the variable reaction.time for the two values of control. Compare the centers and spreads
reaction.time
## age gender control time
## 1 16-24 F T 1.360075
## 2 16-24 M T 1.467939
## 3 16-24 M T 1.512036
## 4 16-24 F T 1.390647
## 5 16-24 M T 1.384208
## 6 16-24 M C 1.393875
## 7 16-24 M T 1.419043
## 8 16-24 F T 1.461187
## 9 16-24 F T 1.382461
## 10 16-24 M C 1.260152
## 11 16-24 M T 1.292917
## 12 16-24 F T 1.331092
## 13 16-24 M T 1.369171
## 14 16-24 F C 1.330471
## 15 16-24 F T 1.344506
## 16 16-24 M C 1.357869
## 17 16-24 F T 1.520244
## 18 16-24 M T 1.352397
## 19 16-24 F T 1.348420
## 20 16-24 F T 1.373000
## 21 25+ M C 1.306813
## 22 25+ F C 1.384192
## 23 25+ M T 1.345643
## 24 25+ M C 1.594901
## 25 25+ M T 1.341972
## 26 25+ F T 1.472212
## 27 25+ F T 1.440685
## 28 25+ M T 1.502757
## 29 25+ F T 1.501541
## 30 25+ F T 1.562813
## 31 25+ F C 1.475851
## 32 25+ F T 1.475456
## 33 25+ M C 1.312376
## 34 25+ F T 1.447160
## 35 25+ M T 1.463118
## 36 25+ M C 1.293425
## 37 25+ F C 1.402860
## 38 25+ M C 1.245497
## 39 25+ F T 1.475806
## 40 25+ M T 1.520106
## 41 25+ F T 1.437079
## 42 25+ F C 1.583610
## 43 25+ F T 1.499246
## 44 25+ M C 1.436701
## 45 25+ F T 1.524056
## 46 25+ F T 1.510929
## 47 25+ F C 1.522563
## 48 25+ M C 1.462412
## 49 25+ F C 1.313935
## 50 25+ F T 1.440382
## 51 25+ M T 1.444503
## 52 25+ F C 1.316956
## 53 25+ M T 1.517435
## 54 25+ M T 1.536446
## 55 25+ M T 1.444865
## 56 25+ M C 1.442599
## 57 25+ M C 1.355206
## 58 25+ F T 1.614979
## 59 25+ M T 1.527699
## 60 25+ M T 1.466591
boxplot(time~control,data = reaction.time)
3.10 For the data set twins (UsingR) make a boxplot of the Foster and Biological variables. Do they appear to have the same spread? The same center?
head(twins)
## Foster Biological Social
## 1 82 82 high
## 2 80 90 high
## 3 88 91 high
## 4 108 115 high
## 5 116 115 high
## 6 117 129 high
boxplot(twins$Foster,twins$Biological)
3.11 The data set stud.recs (UsingR) contains 160 SAT scores for incoming college students stored in the variables sat.v and sat.m. Produce layered density plots of the data. Do the two data sets appear to have the same center? Then make a quantile-quantile plot. Do the data sets appear to have the same shape?
a <- density(stud.recs$sat.v)
b <- density(stud.recs$sat.m)
plot(a)
lines(b,lty=2)
Quantile-Quantile Plot
qqplot(stud.recs$sat.v,stud.recs$sat.m)
3.12 The data set normtemp (UsingR) contains normal body temperature measurements for 130 healthy individuals recorded in the variable temperature.The variable gender is 1 for a male subject and 2 for a female subject. Break the data up by gender and create side-by-side boxplots. Does it appear that males and females have similar normal body temperatures?
boxplot(normtemp$temperature~normtemp$gender)
res <- lm(maxrate ~ age, data=heartrate)
res
##
## Call:
## lm(formula = maxrate ~ age, data = heartrate)
##
## Coefficients:
## (Intercept) age
## 210.0485 -0.7977
plot(maxrate ~ age, data=heartrate)
abline(res)
Use this format when drawing a scatter plot
x <- fat$wrist[1:20]; y <- fat$neck[1:20] # some data
plot(x, y, main="Neck by wrist")
abline(v = mean(x), lty=2) # dashed vertical line
abline(h = mean(y), lty=2) # dashed horizontal line
points(mean(x), mean(y), pch=16, cex=4, col=rgb(0,0,0,.25))
For the correlated data shown, most of the data appear in two opposing quadrants of the four
body <- Animals$body; brain <- Animals$brain
cross_prods <- (body - mean(body)) * (brain - mean(brain))
Animals[cross_prods < 0, ]
## body brain
## Dipliodocus 11700 50.0
## Asian elephant 2547 4603.0
## Horse 521 655.0
## Giraffe 529 680.0
## Human 62 1320.0
## Triceratops 9400 70.0
## Brachiosaurus 87000 154.5
```
rank(Animals$body)
## [1] 6 21 13 11 5 27 24 18 22 10 8 23 20 16 25 26 9 12 2 1 7 15 17
## [24] 14 4 28 3 19
cor(ToothGrowth$dose,ToothGrowth$len)
## [1] 0.8026913
cor(SAT$salary, SAT$total)
## [1] -0.4398834
plot( total ~ salary, SAT)
points(total ~ salary, SAT, subset = perc < 10, pch=15)
points(total ~ salary, SAT, subset = perc > 40, pch=16)
total <- SAT$total; salary <- SAT$salary; perc <- SAT$perc
less_10 <- perc < 10
more_40 <- perc > 40
between <- !less_10 & !more_40
c(less = cor(total[less_10], salary[less_10]),
between = cor(total[between], salary[between]),
more = cor(total[more_40], salary[more_40]))
## less between more
## 0.2588199 0.2224926 0.3673461