library(UsingR)
## Warning: package 'UsingR' was built under R version 3.6.1
## Loading required package: MASS
## Loading required package: HistData
## Warning: package 'HistData' was built under R version 3.6.1
## Loading required package: Hmisc
## Warning: package 'Hmisc' was built under R version 3.6.1
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
## 
## Attaching package: 'UsingR'
## The following object is masked from 'package:survival':
## 
##     cancer
library(aplpack)
## Warning: package 'aplpack' was built under R version 3.6.1

2.1 Enter the following data into a variable p using c: 2 3 5 7 11 13 17 19. Use length to check its length.
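
A minimal solution sketch:

p <- c(2, 3, 5, 7, 11, 13, 17, 19)
length(p)
## [1] 8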

2.2 Al recorded his car’s mileage at his last eight fill-ups: 65311 65624 65908 66219 66499 66821 67145 67447. Enter these numbers into the variable gas. Use the function diff on the data. What does it give? Interpret what both of these commands return: mean(gas) and mean(diff(gas)).

gas <- c(65311, 65624, 65908, 66219, 66499, 66821, 67145, 67447)
gas
## [1] 65311 65624 65908 66219 66499 66821 67145 67447
diff(gas)
## [1] 313 284 311 280 322 324 302
mean(gas)
## [1] 66371.75
mean(diff(gas))
## [1] 305.1429

diff(gas) gives the miles driven between consecutive fill-ups. mean(gas) is the average odometer reading, which is not very meaningful, while mean(diff(gas)) is the average number of miles driven between fill-ups.

2.3 Let our small data set be 2 5 4 1 8.

  1. Enter this data into a data vector x.
  2. Find the square of each number.
  3. Subtract 6 from each number.
  4. Subtract 9 from each number and then square the answers.

x <- c(2,5,4,1,8)
x
## [1] 2 5 4 1 8
x*x
## [1]  4 25 16  1 64
x-6
## [1] -4 -1 -2 -5  2
(x-9)^2
## [1] 49 16 25 64  1

2.4 Create the following sequences using :, seq, or rep as appropriate:

  1. "a", "a", "a", "a", "a"
  2. 1, 3, ..., 99 (the odd numbers in [1, 100])
  3. 1, 1, 1, 2, 2, 2, 3, 3, 3
  4. 1, 1, 1, 2, 2, 3
  5. 1, 2, 3, 4, 5, 4, 3, 2, 1

rep("a",times=5)
## [1] "a" "a" "a" "a" "a"
seq(1,100,2)
##  [1]  1  3  5  7  9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45
## [24] 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91
## [47] 93 95 97 99
c(rep(c(1,2,3),times=c(3,3,3)))
## [1] 1 1 1 2 2 2 3 3 3
c(rep(c(1,2,3),times=c(3,2,1)))
## [1] 1 1 1 2 2 3
c(1:5,4:1)
## [1] 1 2 3 4 5 4 3 2 1

2.5 Store the following data sets into a variable any way you can:

  1. 2, 3, 5, 7, 11, 13, 17, 19
  2. 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 (positive integers)
  3. 1/1, 1/2, 1/3, 1/4, 1/5, 1/6, 1/7, 1/8, 1/9, 1/10 (reciprocals)
  4. 1, 8, 27, 64, 125, 216 (the cubes)
  5. 1964, 1965, . . ., 2014 (some years)
  6. 14, 18, 23, 28, 34, 42, 50, 59, 66, 72, 79, 86, 96, 103, 110 (stops on New York’s No. 9 subway)
  7. 0, 25, 50, 75, 100, . . . , 975, 1000 (0 to 1000 by 25s) Use c only when : or seq will not work.
1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
1/(1:10)
##  [1] 1.0000000 0.5000000 0.3333333 0.2500000 0.2000000 0.1666667 0.1428571
##  [8] 0.1250000 0.1111111 0.1000000
(1:6)^3
## [1]   1   8  27  64 125 216
1964:2014
##  [1] 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977
## [15] 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991
## [29] 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005
## [43] 2006 2007 2008 2009 2010 2011 2012 2013 2014
seq(0,1000,25)
##  [1]    0   25   50   75  100  125  150  175  200  225  250  275  300  325
## [15]  350  375  400  425  450  475  500  525  550  575  600  625  650  675
## [29]  700  725  750  775  800  825  850  875  900  925  950  975 1000
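
Items 1 and 6 have no arithmetic pattern, so c is needed there (a sketch using the values given in the exercise):

primes <- c(2, 3, 5, 7, 11, 13, 17, 19)
stops <- c(14, 18, 23, 28, 34, 42, 50, 59, 66, 72, 79, 86, 96, 103, 110)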

2.6 The average distance from the center is computed by (|x1 − x̄| + ··· + |xn − x̄|)/n, where x̄ is the mean of the data vector. Compute this for the rivers data set using the function sum to add the values and abs to find the absolute value.

xbar <- mean(rivers)
sum(abs(rivers-xbar))
## [1] 44210.67

Dividing this sum by length(rivers) (141 observations) gives the average distance from the mean, roughly 313.6.

2.7 Precedence rules are used to decide the order of evaluating operations when parentheses are not present. Look at the value produced by -1:3. What is done first, - or :? Now look at 1:2*3. Which is done first, : or *?

-1:3
## [1] -1  0  1  2  3
-1:2*3
## [1] -3  0  3  6

Because -1:3 returns -1 0 1 2 3 (not -1 -2 -3), the unary minus is applied before :. Because -1:2*3 returns (-1:2)*3, that is -3 0 3 6, the : is evaluated before *.

2.8 The precip data set records average annual rainfall for many different cities in the United States. It is stored as a data vector with names. Find the average amounts for the cities starting with a “J”.

names(precip)
##  [1] "Mobile"              "Juneau"              "Phoenix"            
##  [4] "Little Rock"         "Los Angeles"         "Sacramento"         
##  [7] "San Francisco"       "Denver"              "Hartford"           
## [10] "Wilmington"          "Washington"          "Jacksonville"       
## [13] "Miami"               "Atlanta"             "Honolulu"           
## [16] "Boise"               "Chicago"             "Peoria"             
## [19] "Indianapolis"        "Des Moines"          "Wichita"            
## [22] "Louisville"          "New Orleans"         "Portland"           
## [25] "Baltimore"           "Boston"              "Detroit"            
## [28] "Sault Ste. Marie"    "Duluth"              "Minneapolis/St Paul"
## [31] "Jackson"             "Kansas City"         "St Louis"           
## [34] "Great Falls"         "Omaha"               "Reno"               
## [37] "Concord"             "Atlantic City"       "Albuquerque"        
## [40] "Albany"              "Buffalo"             "New York"           
## [43] "Charlotte"           "Raleigh"             "Bismark"            
## [46] "Cincinnati"          "Cleveland"           "Columbus"           
## [49] "Oklahoma City"       "Portland"            "Philadelphia"       
## [52] "Pittsburg"           "Providence"          "Columbia"           
## [55] "Sioux Falls"         "Memphis"             "Nashville"          
## [58] "Dallas"              "El Paso"             "Houston"            
## [61] "Salt Lake City"      "Burlington"          "Norfolk"            
## [64] "Richmond"            "Seattle Tacoma"      "Spokane"            
## [67] "Charleston"          "Milwaukee"           "Cheyenne"           
## [70] "San Juan"
str(precip)
##  Named num [1:70] 67 54.7 7 48.5 14 17.2 20.7 13 43.4 40.2 ...
##  - attr(*, "names")= chr [1:70] "Mobile" "Juneau" "Phoenix" "Little Rock" ...
head(precip)
##      Mobile      Juneau     Phoenix Little Rock Los Angeles  Sacramento 
##        67.0        54.7         7.0        48.5        14.0        17.2
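
One way to finish the exercise is to index by name, matching the names that start with "J" via grepl; a sketch (output not shown):

j_cities <- precip[grepl("^J", names(precip))]   # Juneau, Jacksonville, Jackson
j_cities
mean(j_cities)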

2.9 An experiment had 10 different trials. Create a character vector with 10 different names for the trials, e.g., “Trial 1”,….

paste("trial",1:10)
##  [1] "trial 1"  "trial 2"  "trial 3"  "trial 4"  "trial 5"  "trial 6" 
##  [7] "trial 7"  "trial 8"  "trial 9"  "trial 10"

2.10 Working with file names in R is easy, but requires the proper use of file separators, which vary depending on the operating system. For example, suppose you have the directory and file name of a file and want to get the entire file: f <- system.file("DESCRIPTION", package="UsingR"); dname <- dirname(f); fname <- basename(f). To combine dname and fname into a full pathname use paste with the sep argument being .Platform$file.sep. What is the result?
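
A sketch of the requested combination; the result should equal the original value of f (the exact path is machine-specific, so no output is shown):

f <- system.file("DESCRIPTION", package="UsingR")
dname <- dirname(f)
fname <- basename(f)
paste(dname, fname, sep=.Platform$file.sep)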

2.11 The Manufacturer variable in the Cars93 (MASS) data set is stored as a factor. How many levels are there? How many different cases are there?

Cars93$Manufacturer
##  [1] Acura         Acura         Audi          Audi          BMW          
##  [6] Buick         Buick         Buick         Buick         Cadillac     
## [11] Cadillac      Chevrolet     Chevrolet     Chevrolet     Chevrolet    
## [16] Chevrolet     Chevrolet     Chevrolet     Chevrolet     Chrylser     
## [21] Chrysler      Chrysler      Dodge         Dodge         Dodge        
## [26] Dodge         Dodge         Dodge         Eagle         Eagle        
## [31] Ford          Ford          Ford          Ford          Ford         
## [36] Ford          Ford          Ford          Geo           Geo          
## [41] Honda         Honda         Honda         Hyundai       Hyundai      
## [46] Hyundai       Hyundai       Infiniti      Lexus         Lexus        
## [51] Lincoln       Lincoln       Mazda         Mazda         Mazda        
## [56] Mazda         Mazda         Mercedes-Benz Mercedes-Benz Mercury      
## [61] Mercury       Mitsubishi    Mitsubishi    Nissan        Nissan       
## [66] Nissan        Nissan        Oldsmobile    Oldsmobile    Oldsmobile   
## [71] Oldsmobile    Plymouth      Pontiac       Pontiac       Pontiac      
## [76] Pontiac       Pontiac       Saab          Saturn        Subaru       
## [81] Subaru        Subaru        Suzuki        Toyota        Toyota       
## [86] Toyota        Toyota        Volkswagen    Volkswagen    Volkswagen   
## [91] Volkswagen    Volvo         Volvo        
## 32 Levels: Acura Audi BMW Buick Cadillac Chevrolet Chrylser ... Volvo
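
The printout above already shows the answer; nlevels and length confirm it:

nlevels(Cars93$Manufacturer)    # 32 levels
length(Cars93$Manufacturer)     # 93 cases, one per car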

2.12 The Cylinders variable in the Cars93 (MASS) data set records the number of cylinders in the respective car. Why can this not be stored as a numeric value? Which cars have 5 cylinders?

View(Cars93)
Cars93$Cylinders
##  [1] 4      6      6      6      4      4      6      6      6      8     
## [11] 8      4      4      6      4      6      6      8      8      6     
## [21] 4      6      4      4      4      6      4      6      4      6     
## [31] 4      4      4      4      4      6      6      8      3      4     
## [41] 4      4      4      4      4      4      4      8      6      6     
## [51] 6      8      4      4      4      6      rotary 4      6      4     
## [61] 6      4      6      4      4      6      6      4      4      6     
## [71] 6      4      4      4      6      6      6      4      4      3     
## [81] 4      4      3      4      4      4      4      4      5      4     
## [91] 6      4      5     
## Levels: 3 4 5 6 8 rotary
head(Cars93)
##   Manufacturer   Model    Type Min.Price Price Max.Price MPG.city
## 1        Acura Integra   Small      12.9  15.9      18.8       25
## 2        Acura  Legend Midsize      29.2  33.9      38.7       18
## 3         Audi      90 Compact      25.9  29.1      32.3       20
## 4         Audi     100 Midsize      30.8  37.7      44.6       19
## 5          BMW    535i Midsize      23.7  30.0      36.2       22
## 6        Buick Century Midsize      14.2  15.7      17.3       22
##   MPG.highway            AirBags DriveTrain Cylinders EngineSize
## 1          31               None      Front         4        1.8
## 2          25 Driver & Passenger      Front         6        3.2
## 3          26        Driver only      Front         6        2.8
## 4          26 Driver & Passenger      Front         6        2.8
## 5          30        Driver only       Rear         4        3.5
## 6          31        Driver only      Front         4        2.2
##   Horsepower  RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity
## 1        140 6300         2890             Yes               13.2
## 2        200 5500         2335             Yes               18.0
## 3        172 5500         2280             Yes               16.9
## 4        172 5500         2535             Yes               21.1
## 5        208 5700         2545             Yes               21.1
## 6        110 5200         2565              No               16.4
##   Passengers Length Wheelbase Width Turn.circle Rear.seat.room
## 1          5    177       102    68          37           26.5
## 2          5    195       115    71          38           30.0
## 3          5    180       102    67          37           28.0
## 4          6    193       106    70          37           31.0
## 5          4    186       109    69          39           27.0
## 6          6    189       105    69          41           28.0
##   Luggage.room Weight  Origin          Make
## 1           11   2705 non-USA Acura Integra
## 2           15   3560 non-USA  Acura Legend
## 3           14   3375 non-USA       Audi 90
## 4           17   3405 non-USA      Audi 100
## 5           13   3640 non-USA      BMW 535i
## 6           16   2880     USA Buick Century
Cars93$Model[Cars93$Cylinders==5]
## [1] Eurovan 850    
## 93 Levels: 100 190E 240 300E 323 535i 626 850 90 900 Accord ... Vision

Cylinders cannot be stored as a numeric value because one car has a rotary engine, recorded with the non-numeric level "rotary". The two cars with 5 cylinders are the Eurovan and the 850.

2.13 The mtcars data set records information about cars from 1972. The values are coded using numbers. Recoding as factors can be more informative for the user. Recode the am variable, with 0 being “automatic” and 1 being “manual.”

str(mtcars$am)
##  num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
mtcars$am <- as.factor(mtcars$am)
mtcars$am
##  [1] 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1
## Levels: 0 1
levels(mtcars$am) <- c("automatic","manual")
mtcars$am
##  [1] manual    manual    manual    automatic automatic automatic automatic
##  [8] automatic automatic automatic automatic automatic automatic automatic
## [15] automatic automatic automatic manual    manual    manual    automatic
## [22] automatic automatic automatic automatic manual    manual    manual   
## [29] manual    manual    manual    manual   
## Levels: automatic manual

2.14 The Arbuthnot (HistData) data set contains information on the number of Male and Female births in London from 1629 to 1710. Using $ and > determine if there was ever a year with more female births.

Arbuthnot
##    Year Males Females Plague Mortality    Ratio  Total
## 1  1629  5218    4683      0      8771 1.114243  9.901
## 2  1630  4858    4457   1317     10554 1.089971  9.315
## 3  1631  4422    4102    274      8562 1.078011  8.524
## 4  1632  4994    4590      8      9535 1.088017  9.584
## 5  1633  5158    4839      0      8393 1.065923  9.997
## 6  1634  5035    4820      1     10400 1.044606  9.855
## 7  1635  5106    4928      0     10651 1.036120 10.034
## 8  1636  4917    4605  10400     23359 1.067752  9.522
## 9  1637  4703    4457   3082     11763 1.055194  9.160
## 10 1638  5359    4952    363     13624 1.082189 10.311
## 11 1639  5366    4784    314      9862 1.121656 10.150
## 12 1640  5518    5332   1450     12771 1.034884 10.850
## 13 1641  5470    5200   1375     13142 1.051923 10.670
## 14 1642  5460    4910   1274     13273 1.112016 10.370
## 15 1643  4793    4617    996     13212 1.038120  9.410
## 16 1644  4107    3997   1492     10933 1.027521  8.104
## 17 1645  4047    3919   1871     11479 1.032661  7.966
## 18 1646  3768    3395   2365     12780 1.109867  7.163
## 19 1647  3796    3536   3597     14059 1.073529  7.332
## 20 1648  3363    3181    611      9894 1.057215  6.544
## 21 1649  3079    2746     67     10566 1.121267  5.825
## 22 1650  2890    2722     15      8754 1.061719  5.612
## 23 1651  3231    2840     23     10827 1.137676  6.071
## 24 1652  3220    2908     16     12569 1.107290  6.128
## 25 1653  3196    2959      6     10087 1.080095  6.155
## 26 1654  3441    3179     16     13247 1.082416  6.620
## 27 1655  3655    3349      9     11357 1.091371  7.004
## 28 1656  3668    3382      6     13921 1.084565  7.050
## 29 1657  3396    3289      4     12434 1.032533  6.685
## 30 1658  3157    3013     14     14993 1.047793  6.170
## 31 1659  3209    2781     36     14756 1.153901  5.990
## 32 1660  3724    3247     13     12681 1.146905  6.971
## 33 1661  4748    4107     20     16665 1.156075  8.855
## 34 1662  5216    4803     12     13664 1.085988 10.019
## 35 1663  5411    4881      9     12741 1.108584 10.292
## 36 1664  6041    5681      5     15453 1.063369 11.722
## 37 1665  5114    4858  68596     97306 1.052697  9.972
## 38 1666  4678    4319   1998     12738 1.083121  8.997
## 39 1667  5616    5322     35     15842 1.055242 10.938
## 40 1668  6073    5560     14     17278 1.092266 11.633
## 41 1669  6506    5829      3     19432 1.116143 12.335
## 42 1670  6278    5719      0     20198 1.097744 11.997
## 43 1671  6449    6061      5     15729 1.064016 12.510
## 44 1672  6443    6120      5     18230 1.052778 12.563
## 45 1673  6073    5822      5     17504 1.043112 11.895
## 46 1674  6113    5738      3     21201 1.065354 11.851
## 47 1675  6058    5717      1     17244 1.059647 11.775
## 48 1676  6552    5847      2     18732 1.120575 12.399
## 49 1677  6423    6203      2     19076 1.035467 12.626
## 50 1678  6568    6033      5     20678 1.088679 12.601
## 51 1679  6247    6041      2     21730 1.034100 12.288
## 52 1680  6548    6299      0     21053 1.039530 12.847
## 53 1681  6822    6533      0     23951 1.044237 13.355
## 54 1682  6909    6744      0     20691 1.024466 13.653
## 55 1683  7577    7158      0     20587 1.058536 14.735
## 56 1684  7575    7127      0     23202 1.062860 14.702
## 57 1685  7484    7246      0     23222 1.032846 14.730
## 58 1686  7575    7119      0     22609 1.064054 14.694
## 59 1687  7737    7214      0     21460 1.072498 14.951
## 60 1688  7487    7101      0     22921 1.054359 14.588
## 61 1689  7604    7167      0     23502 1.060974 14.771
## 62 1690  7909    7302      0     21461 1.083128 15.211
## 63 1691  7662    7392      0     22691 1.036526 15.054
## 64 1692  7602    7316      0     20874 1.039092 14.918
## 65 1693  7676    7483      0     20959 1.025792 15.159
## 66 1694  6985    6647      0     24109 1.050850 13.632
## 67 1695  7263    6713      0     19047 1.081931 13.976
## 68 1696  7632    7229      0     18638 1.055748 14.861
## 69 1697  8062    7767      0     20292 1.037981 15.829
## 70 1698  8426    7626      0     20183 1.104904 16.052
## 71 1699  7911    7452      0     20795 1.061594 15.363
## 72 1700  7578    7061      0     19443 1.073219 14.639
## 73 1701  8102    7514      0     20471 1.078254 15.616
## 74 1702  8031    7656      0     19481 1.048981 15.687
## 75 1703  7765    7683      0     20720 1.010673 15.448
## 76 1704  6113    5738      0     22684 1.065354 11.851
## 77 1705  8366    7779      0     22097 1.075460 16.145
## 78 1706  7952    7417      0     19847 1.072132 15.369
## 79 1707  8379    7687      0     21600 1.090022 16.066
## 80 1708  8239    7623      0     21291 1.080808 15.862
## 81 1709  7840    7380      0     21800 1.062331 15.220
## 82 1710  7640    7288      0     24620 1.048299 14.928
Arbuthnot$Year[Arbuthnot$Females>Arbuthnot$Males]
## integer(0)

The result is integer(0): in no year were there more female births than male births.

2.15 The negation operator ! is used to reverse Boolean values. For example: A <- c(TRUE, FALSE, TRUE, TRUE) !A ## [1] FALSE TRUE FALSE FALSE One of De Morgan’s laws in R code is !(A & B) = !A | !B. Verify this with B <- c(FALSE, TRUE, FALSE, TRUE) and A as above.

A <- c(TRUE, FALSE, TRUE, TRUE)
B <- c(FALSE, TRUE, FALSE, TRUE)
!(A & B)
## [1]  TRUE  TRUE  TRUE FALSE
!A | !B
## [1]  TRUE  TRUE  TRUE FALSE

2.16 In the precip data set, find all the cities with an average annual rainfall exceeding 50 inches.

names(precip)
##  [1] "Mobile"              "Juneau"              "Phoenix"            
##  [4] "Little Rock"         "Los Angeles"         "Sacramento"         
##  [7] "San Francisco"       "Denver"              "Hartford"           
## [10] "Wilmington"          "Washington"          "Jacksonville"       
## [13] "Miami"               "Atlanta"             "Honolulu"           
## [16] "Boise"               "Chicago"             "Peoria"             
## [19] "Indianapolis"        "Des Moines"          "Wichita"            
## [22] "Louisville"          "New Orleans"         "Portland"           
## [25] "Baltimore"           "Boston"              "Detroit"            
## [28] "Sault Ste. Marie"    "Duluth"              "Minneapolis/St Paul"
## [31] "Jackson"             "Kansas City"         "St Louis"           
## [34] "Great Falls"         "Omaha"               "Reno"               
## [37] "Concord"             "Atlantic City"       "Albuquerque"        
## [40] "Albany"              "Buffalo"             "New York"           
## [43] "Charlotte"           "Raleigh"             "Bismark"            
## [46] "Cincinnati"          "Cleveland"           "Columbus"           
## [49] "Oklahoma City"       "Portland"            "Philadelphia"       
## [52] "Pittsburg"           "Providence"          "Columbia"           
## [55] "Sioux Falls"         "Memphis"             "Nashville"          
## [58] "Dallas"              "El Paso"             "Houston"            
## [61] "Salt Lake City"      "Burlington"          "Norfolk"            
## [64] "Richmond"            "Seattle Tacoma"      "Spokane"            
## [67] "Charleston"          "Milwaukee"           "Cheyenne"           
## [70] "San Juan"
head(precip)
##      Mobile      Juneau     Phoenix Little Rock Los Angeles  Sacramento 
##        67.0        54.7         7.0        48.5        14.0        17.2
precip[precip>50]
##       Mobile       Juneau Jacksonville        Miami  New Orleans 
##         67.0         54.7         54.5         59.8         56.8 
##     San Juan 
##         59.2

2.17 For the precip data set, we can find the mean and the 25%-trimmed mean with mean(precip) and mean(precip, trim=.25). Are any values in the data set more than 1.5 times the trimmed mean above the mean?

mean(precip)
## [1] 34.88571
tmean <- mean(precip,trim=.25)
tmean
## [1] 36.65278
precip[precip>1.5*tmean]
##      Mobile       Miami New Orleans    San Juan 
##        67.0        59.8        56.8        59.2

2.19 You track your commute times for two weeks (ten days), recording the following times in minutes: 17 16 20 24 22 15 21 15 17 22. Enter these into R. Use the function max to find the longest commute time, the function mean to find the average, and the function min to find the minimum. Oops, the 24 was a mistake. It should have been 18. How can you fix this? Do so, and then find the new average. How many times was your commute 20 minutes or more? What percent of your commutes are less than 18 minutes long?

comm <- c(17,16, 20,24, 22, 15, 21, 15, 17, 22)
max(comm)
## [1] 24
mean(comm)
## [1] 18.9
min(comm)
## [1] 15
comm[4] <- 18
comm
##  [1] 17 16 20 18 22 15 21 15 17 22
mean(comm)
## [1] 18.3
sum(comm>=20)
## [1] 4
(sum(comm<18)*100)/length(comm)
## [1] 50

2.20 Suppose monthly sales (in 10,000s) of CDs in 2013 were JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC 79 74 161 127 133 210 99 143 249 249 368 302. Enter the data into a data vector cd. Through indexing, form two data vectors: one containing the months with 31 days, the other the remaining months. Compare the means of these two data vectors.

cd <- c(Jan=79,Feb=74,Mar=161,Apr=127,May=133,Jun=210,Jul=99,Aug=143,Sep=249,Oct=249,Nov=368,Dec=302)
cd
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 
##  79  74 161 127 133 210  99 143 249 249 368 302
cd31 <- cd[c(1,3,5,7,8,10,12)]
cd31
## Jan Mar May Jul Aug Oct Dec 
##  79 161 133  99 143 249 302
cdOth <- cd[c(2,4,6,9,11)]
cdOth
## Feb Apr Jun Sep Nov 
##  74 127 210 249 368
mean(cd31)-mean(cdOth)
## [1] -39.02857

2.21 The following data records the average salary in major league baseball for the years 1990–1999 (in millions): .57 .89 1.08 1.12 1.18 1.07 1.17 1.38 1.44 1.72 Use diff to find the differences from year to year. Are there any years where the amount dropped from the previous year? The percentage difference is the difference divided by the previous year times 100. This can be found by dividing the output of diff by the first nine numbers (but not all ten). After doing this, determine which year has the biggest percentage increase.

sal <- c(.57, .89, 1.08, 1.12, 1.18, 1.07, 1.17, 1.38, 1.44,1.72)
diff(sal)
## [1]  0.32  0.19  0.04  0.06 -0.11  0.10  0.21  0.06  0.28
perdiff <- diff(sal)/sal[-length(sal)]
perdiff
## [1]  0.56140351  0.21348315  0.03703704  0.05357143 -0.09322034  0.09345794
## [7]  0.17948718  0.04347826  0.19444444
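
Only one difference is negative (the drop from 1994 to 1995). The largest percentage increase is the first one, the jump from 1990 to 1991, which which.max confirms:

which.max(perdiff)
## [1] 1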

2.22 Write a function to compute the average distance from the mean for some data vector.

dis <- function(xVec){
  mean1 <- mean(xVec)
  mean(abs(xVec-mean1))
}
a<- c(2,4,6,8,10)
dis(a)
## [1] 2.4

2.23 Write a function f which finds the average of the x values after squaring and subtracts the square of the average of the numbers. Verify this output will always be non-negative by computing f(1:10).
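
A minimal sketch of such a function, checked on the input from the exercise:

f <- function(x) mean(x^2) - mean(x)^2
f(1:10)
## [1] 8.25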

2.24 An integer is even if the remainder upon dividing it by 2 is 0. This remainder is given by R with the syntax x %% 2. Use this to write a function iseven that indicates if a number is even. How would you write isodd?
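
A sketch:

iseven <- function(x) x %% 2 == 0
isodd <- function(x) x %% 2 == 1    # or, equivalently, !iseven(x)
iseven(4)
## [1] TRUE
isodd(4)
## [1] FALSE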

2.25 Write a function isprime that checks if a number x is prime by dividing x by all the values in 2, . . . , x -1 then checking to see if there is a remainder of 0. The expression a %% b returns the remainder of a divided by b.

isprime <- function(x){
  all(x %% 2:(x-1)!=0)
}
isprime(37)
## [1] TRUE

• Example 2.9: Grading by z-scores Professor H. grades by z-scores. Students having a z-score greater than 1.28 will get an “A.” In a small class the following student grades were recorded: 54 50 79 79 51 69 55 62 100 80. Did anyone get an “A”?

score1 <- c(54,50, 79, 79, 51, 69, 55, 62, 100, 80)
zscore1 <- scale(score1)
zscore1
##              [,1]
##  [1,] -0.84681616
##  [2,] -1.09050427
##  [3,]  0.67623449
##  [4,]  0.67623449
##  [5,] -1.02958224
##  [6,]  0.06701423
##  [7,] -0.78589414
##  [8,] -0.35943995
##  [9,]  1.95559704
## [10,]  0.73715652
## attr(,"scaled:center")
## [1] 67.9
## attr(,"scaled:scale")
## [1] 16.41442
score1[zscore1>1.28]
## [1] 100

Now, given these values for the mean and standard deviation, what score would be just good enough for an “A”?

ScoreGoodEnough <- mean(score1)+1.28*sd(score1)
ScoreGoodEnough
## [1] 88.91046

For the exec.pay (UsingR) data set there are 199 values. What proportion are more than 3 standard deviations from the mean? Translating, this means the absolute value of the z-scores should be more than 3:

head(exec.pay)
## [1] 136  74   8  38  46  43
mean(exec.pay)
## [1] 59.88945
sd(exec.pay)
## [1] 207.0435
threesdaway <- mean(exec.pay)+3*sd(exec.pay)
sum(exec.pay>threesdaway)/length(exec.pay)
## [1] 0.01507538

That is, about 1.5% of the z-scores are larger than 3 (since the mean minus 3 standard deviations is negative here, only the upper tail matters). This is much more than would be expected for bell-shaped data (we would expect about 0.3%) but less than the theoretical maximum of 11.1% (1/9, from Chebyshev’s inequality).

• Example 2.11: SAT rankings Many college-bound students in the United States are familiar with the IQR, even if not by name. The ubiquitous college rankings almost always list the 25th and 75th percentiles for SAT scores. For example, Table 2.3 reproduces a table showing the first and third quartiles for accepted students at Ivy League colleges. The IQR is Q3 - Q1. “Writing” has the smallest average IQR at 85, and “reading” the highest at 98.75. Does the data imply that nearly 25% of Yale students have perfect scores (800 on each category)?

skew <- function(x) {
  n <- length(x)
  z <- (x - mean(x)) / sd(x)
  sum(z^3) / n
}
skew(exec.pay)
## [1] 9.578542
kurtosis <- function(x) {
  n <- length(x)
  z <- (x - mean(x)) / sd(x)
  sum(z^4) / n - 3
}
kurtosis(galton$parent)
## [1] 0.05104267
stem(bumpers)
## 
##   The decimal point is 3 digit(s) to the right of the |
## 
##   0 | 68
##   1 | 333556
##   2 | 000123445
##   3 | 011233

From this we can see that the data is mostly symmetric and not long tailed. We can also count that there are 23 values in the data set, so the median is the 12th value, which has a stem of 2 and a leaf of 1. But what is it as a number? The line “The decimal point is 3 digit(s) to the right of the |” needs to be interpreted. Mathematically it is saying the value is 2.1e3, or 2,100. (It is in fact 2,129, which gets truncated to 2,100 in the making of the summary.)

head(bumpers)
##       Honda Accord Chevrolet Cavalier       Toyota Camry 
##                618                795               1304 
##         Saturn SL2  Mitsubishi Galant       Dodge Monaco 
##               1308               1340               1456
?bumpers
## starting httpd help server ... done
hist(faithful$waiting)

How to Create a Histogram Manually

1. We first need to create the bins:

bins <- seq(40, 100, by=5)

2. We then cut the data at the bin boundaries:

x <- faithful$waiting
out <- cut(x, breaks=bins)
head(out)
## [1] (75,80] (50,55] (70,75] (60,65] (80,85] (50,55]
## 12 Levels: (40,45] (45,50] (50,55] (55,60] (60,65] ... (95,100]

3. We then tabulate the frequencies:

table(out)
## out
##  (40,45]  (45,50]  (50,55]  (55,60]  (60,65]  (65,70]  (70,75] 
##        4       22       33       24       14       10       27 
##  (75,80]  (80,85]  (85,90]  (90,95] (95,100] 
##       54       55       23        5        1

The histogram then just represents these counts with bars.

plot( density(bumpers) )

How to Create a Histogram and a Density Plot

b_hist <- hist(bumpers, plot=FALSE)
b_dens <- density(bumpers)
hist(bumpers, probability=TRUE,
xlim = range(c(b_hist$breaks, b_dens$x)),
ylim = range(c(b_hist$density, b_dens$y)))
lines(b_dens, lwd=2)

The construction of the density plot uses a function (called a kernel) and a window size (referred to as the bandwidth). Different choices produce different graphics. The bandwidth is similar to the choice of bin size in the histogram: larger values smooth out the graphic, smaller ones make it more spiky. R provides algorithms for automatic selection. The default is “nrd0”. We can specify other names to the bw argument: “nrd”, “ucv”, “bcv”, or “SJ”.
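
For example, a quick sketch comparing two bandwidth rules on the bumpers data; the two curves will show different amounts of smoothing:

plot(density(bumpers, bw="SJ"), main="bumpers: two bandwidth rules")
lines(density(bumpers, bw="nrd0"), lty=2)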

boxplot(bumpers, horizontal=TRUE, main="Bumpers")

head(Macdonell)
##     height finger frequency
## 1 4.630208    9.4         0
## 2 4.630208    9.5         0
## 3 4.630208    9.6         0
## 4 4.630208    9.7         0
## 5 4.630208    9.8         0
## 6 4.630208    9.9         0
x <- rep(Macdonell$finger, Macdonell$frequency)
qqnorm(x)

x <- jitter(HistData::Galton$child, factor=5) # add noise
qqnorm(x)

2.26 The beatles (LearnEDA) data set records the length of songs (in seconds) by the Beatles in the time variable. Find the mean length, median length, longest length, and shortest length in minutes.
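
A sketch, assuming the LearnEDA package is installed and that time records seconds as stated:

library(LearnEDA)           # provides the beatles data
mins <- beatles$time / 60   # convert seconds to minutes
c(mean=mean(mins), median=median(mins), longest=max(mins), shortest=min(mins))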

2.27 The ChestSizes (HistData) data set contains Quetelet’s data on chest measurements of 5,738 Scottish Militiamen. The data is tabulated. Find the mean.

# We need to find the weighted mean
head(ChestSizes)
##   chest count
## 1    33     3
## 2    34    18
## 3    35    81
## 4    36   185
## 5    37   420
## 6    38   749
summary(ChestSizes)
##      chest           count        
##  Min.   :33.00   Min.   :   1.00  
##  1st Qu.:36.75   1st Qu.:  20.25  
##  Median :40.50   Median : 138.50  
##  Mean   :40.50   Mean   : 358.62  
##  3rd Qu.:44.25   3rd Qu.: 680.75  
##  Max.   :48.00   Max.   :1079.00
propf <- prop.table(ChestSizes$count)
propf
##  [1] 0.0005228303 0.0031369815 0.0141164169 0.0322411990 0.0731962356
##  [6] 0.1305332869 0.1869989543 0.1880446148 0.1627744859 0.1146741025
## [11] 0.0644823980 0.0160334611 0.0087138376 0.0036598118 0.0006971070
## [16] 0.0001742768
wtd_mean <- sum(ChestSizes$chest * propf)
wtd_mean
## [1] 39.83182
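
The built-in weighted.mean function computes the same quantity directly:

weighted.mean(ChestSizes$chest, ChestSizes$count)
## [1] 39.83182
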
farm <- read.csv("farms.csv",header = TRUE)
stem(farm$count)
## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##    0 | 1133346677890266
##    2 | 155890139
##    4 | 013589003589
##    6 | 5589
##    8 | 0149118
##   10 | 0
##   12 | 
##   14 | 
##   16 | 
##   18 | 
##   20 | 
##   22 | 7
mean(farm$count)
## [1] 44.04

2.31 For the data sets bumpers (UsingR), firstchi (UsingR), and math (UsingR), make histograms. Try to predict the mean, median, and standard deviation from the graphic. Check your guesses with the appropriate R commands.

hist(bumpers)

hist(firstchi)

hist(math)

plot(density(pi2000))

hist(pi2000,breaks = 0:10-0.5)

x <- babies$smoke
x <- factor(x, labels=c("never", "now", "until current",
"once, quit", "unknown"))
table(x)
## x
##         never           now until current    once, quit       unknown 
##           544           484            95           103            10
barplot(table(x),horiz = TRUE,main="Smoking Data")

dotchart(table(x))
## Warning in dotchart(table(x)): 'x' is neither a vector nor a matrix: using
## as.numeric(x)

pie(table(x))

2.61 Numeric data can be discretized through the cut function. For example, the command cut(bumpers, c(0, 1000, 2000, 3000, 4000)) will categorize the repair cost of a bumper by its rough amount. Make a table of this. Which range has the largest number of data points?

a <- cut(bumpers, c(0, 1000, 2000, 3000, 4000)) 
table(a)
## a
##     (0,1e+03] (1e+03,2e+03] (2e+03,3e+03] (3e+03,4e+03] 
##             2             8             7             6

The (1e+03,2e+03] range (repair costs between $1,000 and $2,000) contains the most data points, with 8.

2.62 The Cylinders variable Cars93 (MASS) data set records the number of cylinders in a factor. What kind of summary does R compute for factors? Look at summary(Cars93$Cylinders) to see.

summary(Cars93$Cylinders)
##      3      4      5      6      8 rotary 
##      3     49      2     31      7      1

For a factor, summary returns the count of each level rather than numeric summaries.

2.63 The lorem variable in UsingR contains 5 paragraphs of dummy typesetting text that has been in use for centuries. What is the most common letter used? To answer this, you can break a character value into letters by the idiom unlist(strsplit(x,"")) where x is character data.

x <- lorem
sort(table(unlist(strsplit(x,""))),decreasing = TRUE)
## 
##       e   i   t   u   s   a   l   n   r   o   c   m   d   p   .   v   , 
## 589 370 343 289 289 272 251 220 211 183 174 156 142 102  80  73  49  48 
##   g   b   q   f   h   N  \n   D   I   M   S   P   C   V   j   A   E   L 
##  45  38  30  22  17  13  10   8   8   7   7   6   5   5   4   3   3   3 
##   U   x   F   Q   ; 
##   3   3   2   2   1

After blanks, e is the most common character used.

2.64 Make a dot chart of the Cylinders variable in the Cars93 (MASS) data set.

dotchart(table(Cars93$Cylinders))
## Warning in dotchart(table(Cars93$Cylinders)): 'x' is neither a vector nor a
## matrix: using as.numeric(x)

2.68 The data set central.park (UsingR) holds the coded variable WX representing bad weather (e.g., 1 for “fog”, 3 for “thunder”, 8 for “smoke” or “haze”). NA is used when none of the types occurred. Make a table of the data, then make a table with the extra argument exclude=FALSE. Why is the second table better?

head(central.park)
##   DY MAX MIN AVG DEP HDD CDD  WTR SNW DPTH SPD SPD1 DIR MIN2 PSBL S.S WX
## 1  1  72  51  62   4   3   0 0.04   0    0 5.8   12 180    0    0   5 18
## 2  2  75  52  64   6   1   0 0.29   0    0 7.0   17 160    M    M   4  1
## 3  3  65  50  58   0   7   0 0.00   0    0 8.5   16  40    M    M   4 NA
## 4  4  63  48  56  -3   9   0 0.00   0    0 2.8   10 150    M    M   0 NA
## 5  5  59  45  52  -7  13   0 0.05   0    0 4.6   13 160    M    M   4  1
## 6  6  61  45  53  -7  12   0 0.00   0    M 4.5   10  90    0    0   7 18
##   SPD3  DR
## 1   17 170
## 2   22 350
## 3   26  40
## 4   14 190
## 5   17 130
## 6   14  70
table(central.park$WX)
## 
##  1 18 
## 10 11
table(central.park$WX,exclude=FALSE)
## 
##    1   18 <NA> 
##   10   11   10

The second table is better because it also counts the NA days, showing how many days had none of the bad-weather types and making the counts account for every day of the month.

For example, suppose that to investigate the question of what foods can have an impact on sports performance, a study was done with a group of subjects. The treatment group was fed 3/2 cups of beets 75 minutes prior to an activity; the control group was not. The activity was a timed run. The researcher’s thinking was that the nitrates found in beets would widen the subjects’ blood vessels and increase blood flow to the working muscles, hence lowering the duration of the activity.

Suppose the data measured (in minutes) are given by:

Beets eaten: 41 40 41 42 44 35 41 36 47 45

No beets eaten: 51 51 50 42 40 31 43 45

What can be said about this data?

Does the data seem to come from similar populations, that is do the two have the same shape?

Does the data seem to have similar center, similar spread?

Compare Mean and SD

beets <- c(41, 40, 41, 42, 44, 35, 41, 36, 47, 45)
no_beets <- c(51, 51, 50, 42, 40, 31, 43, 45)
c(xbar1=mean(beets), xbar2=mean(no_beets),
sd1=sd(beets), sd2=sd(no_beets))
##     xbar1     xbar2       sd1       sd2 
## 41.200000 44.125000  3.705851  6.812541

Visualise with an appropriate plot: stem.leaf.backback from the aplpack package draws a back-to-back stem-and-leaf plot.

stem.leaf.backback(beets,no_beets,rule.line = "Sturges")
## ________________________________
##   1 | 2: represents 12, leaf unit: 1 
##         beets      no_beets 
## ________________________________
##              | 3* |1        1   
##    2       65| 3. |             
##   (6)  421110| 4* |023     (3)  
##    2       75| 4. |5       (1)  
##              | 5* |011      3   
##              | 5. |             
##              | 6* |             
## ________________________________
## n:         10      8        
## ________________________________

Focusing on the middle, we see that no_beets has more spread, but the difference in the centres is not obvious.

boxplot(no_beets, beets, names=c("no beets", "beets"),horizontal=TRUE)

How to Make Multiple Density Plots

Let’s consider a different data set. The michelson (MASS) data set contains measurements of the speed of light in air made by Michelson in 1879. The data set consists of five experiments; we compare the fourth and fifth below. The data is stored in a data frame with one variable, Speed, recording the speed and another, Expt, recording the experiment number. We use indexing to get our two variables:

speed <- michelson$Speed; expt <- michelson$Expt
fourth <- speed[expt == 4]
fifth <- speed[expt == 5]

With this, we now pre-compute the densities so that we can find their respective ranges:

d_4 <- density(fourth)
d_5 <- density(fifth)
xrange <- range(c(d_4$x, d_5$x))
yrange <- range(c(d_4$y, d_5$y))

Now we plot. The first density plot is drawn using plot, and the second layer is added with lines. As the default labels come from the initial density plotted (and don’t apply to both), we modify both the xlab and main arguments. As well, we use the lty=2 argument for the lines command, so that line is added with a different line type. The figure drawn would benefit from a legend indicating which line represents which variable. The legend function adds a legend in base graphics. The position of the legend, the text, and the defining feature need to be specified. The following places the legend in the upper left, with the line type marking which part of the graph corresponds to which variable:

plot(d_4, xlim=xrange, ylim=yrange, xlab="densities", main="")
lines(d_5, lty=2)
legend("topleft", legend=c("Fourth", "Fifth"), lty=c(1,2))

Create a quantile comparison (a q-q plot) to see if the two distributions have similar shapes.

qqplot(fourth, fifth)

The data more or less tracks a straight line, though the observed difference in the shape of the peak causes it to flatten out in the middle. The small size of the data sets involved makes it hard to gauge whether such irregularities are due to sampling or to some systematic difference. Either way, the most important question to answer from a statistical inference viewpoint is whether the tails seem comparable, as they do in this case.

beets <- c(41, 40, 41, 42, 44, 35, 41, 36, 47, 45)
no_beets <- c(51, 51, 50, 42, 40, 31, 43, 45)
b <- list(beets = beets, no_beets = no_beets)
b
## $beets
##  [1] 41 40 41 42 44 35 41 36 47 45
## 
## $no_beets
## [1] 51 51 50 42 40 31 43 45
b$beets
##  [1] 41 40 41 42 44 35 41 36 47 45
b[1]
## $beets
##  [1] 41 40 41 42 44 35 41 36 47 45
b[[1]]
##  [1] 41 40 41 42 44 35 41 36 47 45
boxplot(b)

speed <- michelson$Speed; expt <- michelson$Expt
speeds <- list(speed[expt == 1], speed[expt == 2], speed[expt == 3],
speed[expt == 4], speed[expt == 5])
names(speeds) <- paste("Expt", 1:5)
boxplot(speeds)
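
The same list can be built in one step with split, which divides a vector according to the levels of a factor (a sketch):

speeds <- split(michelson$Speed, michelson$Expt)
names(speeds) <- paste("Expt", 1:5)
boxplot(speeds)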

Data frames We have mentioned that data frames are a preferred way to store many variables in a rectangular manner, with each variable in a different column and observations for similar cases in rows. R provides the data.frame constructor for creating data frames, among other ways; they can be produced column-by-column with cbind and row-by-row with rbind. The usage is similar to list. For example, here we create a data frame to hold some student data:

student <- c("Alice", "Bob")
grade <- c("A", "B")
attendance <- c("awesome", "bad")
data.frame(student, grade, attendance)
##   student grade attendance
## 1   Alice     A    awesome
## 2     Bob     B        bad
boxplot(Speed ~ Expt, data=michelson)

boxplot(Speed ~ Expt, data=michelson, subset=Run %in% 11:20)

plot(Speed ~ Expt, data=michelson)