Exercises: Week 4

Asad Zaidi

Data and Computing Fundamentals

Data Cleaning

Implement the steps for cleaning the OrdwayBirdsOrig data, including

data(OrdwayBirdsOrig)
groupBy(OrdwayBirdsOrig, by = Year)
##           Year count
## 1                  4
## 2         1968    24
## 3         1969     9
## 4         1970    25
## 5         1971    63
## 6         1972  1500
## 7         1973  2434
## 8         1974   120
## 9         1975  1882
## 10        1976  2214
## 11        1978  1402
## 12        1979   928
## 13        1980  1138
## 14        1981  1031
## 15        1982   934
## 16        1983  1231
## 17        1984   811
## 18        1985    56
## 19        1994    22
## 20        2979     1
## 21 Year (19xx)     0
snames = with(OrdwayBirdsOrig, levels(SpeciesName))
namesCleaned = fetchGoogle("https://docs.google.com/spreadsheet/pub?key=0Av2C2RiwUxpVdFlPWUp6NERSQzhld3o4QklQd1p6d2c&single=true&gid=0&output=csv")
## Loading required package: RCurl
## Loading required package: bitops
birds = join(OrdwayBirdsOrig, namesCleaned)
## Joining by: SpeciesName
groupBy(birds, by = SpeciesNameCleaned)
##             SpeciesNameCleaned count
## 1           Acadian Flycatcher     1
## 2           American Goldfinch  1204
## 3             Baltimore Oriole   208
## 4      Black and White Warbler    10
## 5          Black-billed Cookoo    16
## 6       Black-capped Chickadee  1327
## 7         Black-throat Sparrow    62
## 8         Brown-headed Cowbird     4
## 9                     Cardinal    77
## 10          Carolina Chickadee     1
## 11                     Catbird   554
## 12               Cedar Waxwing    59
## 13   Chestnut-backed Chickadee     3
## 14      Chestnut-sided Warbler     1
## 15                   Chickadee     3
## 16            Chipping Sparrow   319
## 17        Clay-colored Sparrow    14
## 18                     Cowbird    90
## 19       Curve-billed Thrasher    14
## 20              Eastern Phoebe    12
## 21          Eastern Wood Pewee     6
## 22               Field Sparrow  1164
## 23      Golden-Crowned Kinglet     9
## 24       Gray - cheeked Thrush     2
## 25    Great Crested Flycatcher     3
## 26            Harris's Sparrow    16
## 27                  House Wren   460
## 28                     Kestrel     2
## 29            Least Flycatcher   372
## 30           Lincoln's Sparrow   790
## 31                        Lost     5
## 32               Mourning Dove    29
## 33            Mourning Warbler     4
## 34              Myrtle Warbler   454
## 35                         N/A     9
## 36           Nashville Warbler   160
## 37             Northern Shrike    11
## 38        Northern Waterthrush    13
## 39       Northern Yellowthroat     7
## 40      Olive-sided Flycatcher    22
## 41      Orange-Crowned Warbler    57
## 42              Orchard Oriole     8
## 43                Oregon Junco     3
## 44                    Ovenbird    24
## 45                Palm Warbler   127
## 46  Partly-cloudy; light winds     1
## 47          Pectoral Sandpiper     1
## 48                       Pewee     9
## 49                 Phainopepla     1
## 50          Philadelphia Vireo    21
## 51                      Phoebe    19
## 52                 Pine Siskin    35
## 53                Purple Finch   122
## 54                 Pyrrhuloxia     1
## 55      Red-Bellied Woodpecker     9
## 56         Red-Breast Grosbeak    20
## 57        Red-Winged Blackbird    53
## 58       Red-bellied Sapsucker     1
## 59            Red-eyed Cowbird     1
## 60              Red-eyed Viero    43
## 61       Red-headed Woodpecker     1
## 62             Red-tailed Hawk     1
## 63                    Redstart     3
## 64                       Robin   608
## 65      Rose Breasted Grosbeak   201
## 66        Ruby-Crested Kinglet    21
## 67          Ruby-crown Kinglet   112
## 68   Ruby-throated Hummingbird     5
## 69         Rufous-sided Towhee    10
## 70            Savannah Sparrow     1
## 71         Slate-colored Junco  2732
## 72              Solitary Vireo    14
## 73                Song Sparrow   512
## 74                Sparrow Hawk     1
## 75                    Starling    37
## 76               Steller's Jay    11
## 77           Swainson's Thrush   103
## 78               Swamp Sparrow    83
## 79           Tennessee Warbler    86
## 80         Traill's Flycatcher    47
## 81                      Tree L     2
## 82                Tree Swallow  1537
## 83             Tufted Titmouse     2
## 84                     Unknown     1
## 85               Varied Thrush     2
## 86                       Veery     6
## 87              Vesper Sparrow     2
## 88              Warbling Vireo     1
## 89       White-Crested Sparrow     1
## 90          White-Fronted Dove     1
## 91     White-breasted Nuthatch   281
## 92       White-crowned Sparrow    95
## 93            White-eyed Vireo     1
## 94        White-throat Sparrow   328
## 95          White-winged Junco     2
## 96            Wilson's Warbler    26
## 97                 Winter Wren     1
## 98                  Wood Pewee    37
## 99                 Wood Thrush     3
## 100                   Woodcock     1
## 101                       Wren     2
## 102     Yellow Shafted Flicker    17
## 103             Yellow Warbler    19
## 104  Yellow-bellied Flycatcher     7
## 105   Yellow-bellied Sapsucker     3
## 106       Yellow-tailed Oriole     1
## 107               Yellowthroat   107
## 108                       none     2
with(birds, class(Year))
## [1] "factor"
with(birds, levels(Year))
##  [1] ""            "1968"        "1969"        "1970"        "1971"       
##  [6] "1972"        "1973"        "1974"        "1975"        "1976"       
## [11] "1978"        "1979"        "1980"        "1981"        "1982"       
## [16] "1983"        "1984"        "1985"        "1994"        "2979"       
## [21] "Year (19xx)"
groupBy(birds, by = Year)
##           Year count
## 1                  4
## 2         1968    24
## 3         1969     9
## 4         1970    25
## 5         1971    64
## 6         1972  1674
## 7         1973  2706
## 8         1974   120
## 9         1975  1999
## 10        1976  2289
## 11        1978  1464
## 12        1979  1014
## 13        1980  1222
## 14        1981  1088
## 15        1982  1044
## 16        1983  1383
## 17        1984   912
## 18        1985    56
## 19        1994    22
## 20        2979     1
## 21 Year (19xx)     0
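One thing worth checking after a lookup-table join is the row count. The per-year counts printed after the join are higher than those before it (1674 versus 1500 for 1972), which would be consistent with some SpeciesName values appearing more than once in the lookup table. The sketch below illustrates that effect on made-up data, using base R merge() as a stand-in for join(); the data frames and their contents are hypothetical, not the Ordway records.

```r
# Toy banding records (hypothetical data, not the Ordway records)
records <- data.frame(
  SpeciesName = c("Robin", "robin", "Catbird"),
  Year = c(1972, 1972, 1973),
  stringsAsFactors = FALSE
)

# Lookup table in which "Catbird" accidentally appears twice
lookup <- data.frame(
  SpeciesName        = c("Robin", "robin", "Catbird", "Catbird"),
  SpeciesNameCleaned = c("Robin", "Robin", "Catbird", "Catbird"),
  stringsAsFactors = FALSE
)

# Each record is matched against every lookup row with the same key,
# so the duplicated key produces an extra output row
joined <- merge(records, lookup, by = "SpeciesName")

# Dropping exact duplicate rows from the lookup restores one row per record
lookup_unique <- unique(lookup)
joined_unique <- merge(records, lookup_unique, by = "SpeciesName")

print(c(original = nrow(records),
        joined = nrow(joined),
        joined_unique = nrow(joined_unique)))
```

Comparing nrow() before and after the join is a cheap way to notice this kind of duplication early.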
birds = transform(birds, Year = as.numeric(as.character(Year)))
birds = subset(birds, Year %in% 1960:2020)
groupBy(birds, by = Year)
##    Year count
## 1  1968    24
## 2  1969     9
## 3  1970    25
## 4  1971    64
## 5  1972  1674
## 6  1973  2706
## 7  1974   120
## 8  1975  1999
## 9  1976  2289
## 10 1978  1464
## 11 1979  1014
## 12 1980  1222
## 13 1981  1088
## 14 1982  1044
## 15 1983  1383
## 16 1984   912
## 17 1985    56
## 18 1994    22
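The conversion above goes through as.character() before as.numeric() because Year is stored as a factor. A minimal sketch of why that matters, on a small made-up factor (the values are illustrative only):

```r
# A factor mixing valid years, a typo, a blank and a stray header label
year_factor <- factor(c("1972", "1973", "2979", "", "Year (19xx)", "1972"))

# as.numeric() applied directly to a factor returns the internal level
# codes (small integers), not the years themselves
codes <- as.numeric(year_factor)

# Converting to character first recovers the labels; labels that are not
# numbers become NA (the coercion warning is suppressed here)
years <- suppressWarnings(as.numeric(as.character(year_factor)))

# Keep only plausible years; NA and 2979 both fail the %in% test
plausible <- years[years %in% 1960:2020]

print(codes)
print(years)
print(plausible)
```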
groupBy(birds, by = Month)
##    Month count
## 1            0
## 2      1   660
## 3     10  3549
## 4     11  1166
## 5     12   554
## 6      2   601
## 7     25     1
## 8      3   906
## 9      4  1667
## 10     5  2780
## 11     6  1124
## 12     7  1159
## 13     8   875
## 14     9  2073
## 15 Month     0
birds = transform(birds, month = as.numeric(as.character(Month)))
groupBy(birds, by = month)
##    month count
## 1      1   660
## 2      2   601
## 3      3   906
## 4      4  1667
## 5      5  2780
## 6      6  1124
## 7      7  1159
## 8      8   875
## 9      9  2073
## 10    10  3549
## 11    11  1166
## 12    12   554
## 13    25     1
birds = subset(birds, month %in% 1:12)
groupBy(birds, by = month)
##    month count
## 1      1   660
## 2      2   601
## 3      3   906
## 4      4  1667
## 5      5  2780
## 6      6  1124
## 7      7  1159
## 8      8   875
## 9      9  2073
## 10    10  3549
## 11    11  1166
## 12    12   554
groupBy(birds, by = Day)
##     Day count
## 1           0
## 2     1   566
## 3    10   544
## 4    11   516
## 5    12   475
## 6    13   693
## 7    14   608
## 8    15   600
## 9    16   579
## 10   17   494
## 11   18   573
## 12   19   539
## 13 1975     0
## 14    2   583
## 15   20   542
## 16   21   498
## 17   22   467
## 18   23   580
## 19   24   587
## 20   25   515
## 21   26   633
## 22   27   551
## 23   28   508
## 24   29   459
## 25    3   663
## 26   30   507
## 27   31   317
## 28    4   637
## 29    5   619
## 30    6   566
## 31    7   569
## 32    8   598
## 33   80     1
## 34    9   527
## 35  Day     0
birds = transform(birds, Day = as.numeric(as.character(Day)))
birds = subset(birds, Day %in% 1:31)
groupBy(birds, Day)
##    Day count
## 1    1   566
## 2    2   583
## 3    3   663
## 4    4   637
## 5    5   619
## 6    6   566
## 7    7   569
## 8    8   598
## 9    9   527
## 10  10   544
## 11  11   516
## 12  12   475
## 13  13   693
## 14  14   608
## 15  15   600
## 16  16   579
## 17  17   494
## 18  18   573
## 19  19   539
## 20  20   542
## 21  21   498
## 22  22   467
## 23  23   580
## 24  24   587
## 25  25   515
## 26  26   633
## 27  27   551
## 28  28   508
## 29  29   459
## 30  30   507
## 31  31   317
# Strip the "grams" unit, then convert; blanks and other non-numeric entries become NA
birds = transform(birds, weight = gsub("grams", "", as.character(Weight), fixed = TRUE))
birds = transform(birds, weight = as.numeric(as.character(weight)))
## Warning: NAs introduced by coercion
groupBy(birds, is.na(Weight))
##   is.na(Weight) count
## 1         FALSE 17113
groupBy(birds, is.na(weight))
##   is.na(weight) count
## 1         FALSE 11943
## 2          TRUE  5170
densityplot(~weight, data = birds)

Figure: density plot of weight (chunk unnamed-chunk-2)

# Assuming the intent is to drop implausibly heavy values: set weights above 200 to NA
birds = transform(birds, weight = ifelse(weight > 200, NA, weight))
with(birds, class(WingChord))
## [1] "factor"
with(birds, levels(WingChord))
##   [1] ""           "10"         "10.2"       "100"        "101"       
##   [6] "102"        "103"        "104"        "105"        "106"       
##  [11] "107"        "108"        "109"        "11.5"       "110"       
##  [16] "111"        "112"        "113"        "114"        "115"       
##  [21] "116"        "117"        "118"        "119"        "12.1"      
##  [26] "12.6"       "12.9"       "120"        "121"        "122"       
##  [31] "123"        "124"        "125"        "125 mm"     "126"       
##  [36] "126 mm"     "127"        "128"        "129"        "129 mm"    
##  [41] "13.1"       "13.5"       "130"        "130 (84.9)" "131"       
##  [46] "131 mm"     "132"        "133"        "133 (86.3)" "134"       
##  [51] "135"        "136"        "137"        "138"        "139"       
##  [56] "140"        "140 mm"     "141"        "142"        "143"       
##  [61] "144"        "145"        "146"        "147"        "148"       
##  [66] "149"        "15"         "150"        "151"        "152"       
##  [71] "153"        "154"        "155"        "156"        "158"       
##  [76] "159"        "160"        "163"        "170"        "181"       
##  [81] "187"        "19.4"       "195"        "197"        "21"        
##  [86] "222"        "26"         "28"         "38"         "39"        
##  [91] "41"         "44"         "46"         "47"         "48"        
##  [96] "49"         "50"         "51"         "52"         "53"        
## [101] "54"         "55"         "56"         "57"         "58"        
## [106] "59"         "6"          "60"         "61"         "62"        
## [111] "62 mm"      "63"         "64"         "64 mm"      "65"        
## [116] "65 mm"      "66"         "67"         "68"         "68 mm"     
## [121] "69"         "69 mm"      "7.3"        "70"         "70 mm"     
## [126] "71"         "71 mm"      "71.5"       "72"         "72 mm"     
## [131] "73"         "73 mm"      "74"         "74 mm"      "75"        
## [136] "75 mm"      "76"         "76 mm"      "77"         "77 mm"     
## [141] "78"         "78 mm"      "79"         "79 mm"      "80"        
## [146] "80 mm"      "81"         "81 mm"      "82"         "83"        
## [151] "84"         "85"         "85 mm"      "86"         "87"        
## [156] "88"         "89"         "90"         "91"         "91.6"      
## [161] "92"         "93"         "93 mm"      "94"         "95"        
## [166] "96"         "96 mm"      "97"         "98"         "99"        
## [171] "N/A"        "none"       "p/s 10 63"  "p/s 11 57"  "p/s 11 62" 
## [176] "p/s 12 59"  "p/s 9 57"   "Wing chord"
# Strip the "mm" unit, then convert; blanks and other non-numeric entries become NA
birds = transform(birds, WC = gsub("mm", "", as.character(WingChord), fixed = TRUE))
birds = transform(birds, WC = as.numeric(as.character(WC)))
## Warning: NAs introduced by coercion
groupBy(birds, is.na(WingChord))
##   is.na(WingChord) count
## 1            FALSE 17113
groupBy(birds, is.na(WC))
##   is.na(WC) count
## 1     FALSE  9788
## 2      TRUE  7325
densityplot(~WC, data = birds)

Figure: density plot of WC (chunk unnamed-chunk-3)

birds = transform(birds, WC = ifelse(WC < 20, NA, WC))
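The wing chord cleaning can be traced on a handful of strings taken from the levels listed above. This is a sketch on a plain character vector rather than on the birds data frame; the variable names are made up for the illustration.

```r
# A few raw entries of the kinds seen in levels(WingChord)
wing_raw <- c("125 mm", "71.5", "130 (84.9)", "p/s 10 63", "N/A", "", "12.1")

# Remove the literal unit "mm"; fixed = TRUE treats the pattern as plain text
wing_txt <- gsub("mm", "", wing_raw, fixed = TRUE)

# Anything that is still not a number becomes NA
# (the coercion warning is suppressed here)
wing_num <- suppressWarnings(as.numeric(wing_txt))

# Treat values below 20 as missing, mirroring the ifelse() step above
wing_num[!is.na(wing_num) & wing_num < 20] <- NA

print(data.frame(raw = wing_raw, cleaned = wing_num))
```

Entries such as "130 (84.9)" and "p/s 10 63" end up as NA under this approach, so any measurement they contain is lost unless it is parsed separately.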

Using the Cleaned Data

Using Grouped Data for Individual Cases

Using the Cleaned Ordway Bird data:

Country Data

The countrySynonyms data file (you can load it with data(countrySynonyms)) gives synonyms for each country listed in the World-Map software, together with an official ISO 3-letter country code. This data set is in “wide” format. To turn it into narrow format, you can do this:

library(reshape2)  # provides melt() with the value.name / variable.name arguments used below
data(countrySynonyms)
foo <- melt(countrySynonyms, id.vars = c("ID", "ISO3"), value.name = "Country", 
    measure.vars = names(countrySynonyms)[-(1:2)], variable.name = "whence")
countrySynonymsLong <- subset(foo, !is.na(Country))
countrySynonymsLong$whence <- NULL
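To see what the wide-to-narrow conversion does, here is a base R sketch on a tiny made-up table. The column names name1 and name2 and the rows are hypothetical, and the real countrySynonyms file may mark empty slots differently (for example with empty strings rather than NA).

```r
# Wide format: one row per country, one column per synonym slot
wide <- data.frame(
  ID    = 1:2,
  ISO3  = c("USA", "GBR"),
  name1 = c("United States", "United Kingdom"),
  name2 = c("USA", NA),
  stringsAsFactors = FALSE
)

# Narrow format: stack the synonym columns into a single Country column,
# repeating the identifying columns once per stacked column
long <- data.frame(
  ID      = rep(wide$ID, times = 2),
  ISO3    = rep(wide$ISO3, times = 2),
  Country = c(wide$name1, wide$name2),
  stringsAsFactors = FALSE
)

# Drop the empty synonym slots
long <- subset(long, !is.na(Country))
print(long)
```

Each remaining row pairs one synonym with its ISO3 code, which is the shape needed to join against a table keyed by country name.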

The countryRegions data file (you can load it with data(countryRegions) or get documentation with help(countryRegions)) lets you aggregate countries in various ways.
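A minimal sketch of this kind of aggregation, using base R and made-up data. The REGION column name, the country-to-region assignments, and the population figures are assumptions for illustration and are not taken from countryRegions.

```r
# Hypothetical region lookup keyed by ISO3 code
regions <- data.frame(
  ISO3   = c("USA", "CAN", "FRA", "DEU"),
  REGION = c("North America", "North America", "Europe", "Europe"),
  stringsAsFactors = FALSE
)

# Hypothetical per-country values (arbitrary units)
values <- data.frame(
  ISO3       = c("USA", "CAN", "FRA", "DEU"),
  population = c(10, 4, 6, 9),
  stringsAsFactors = FALSE
)

# Attach the region to each country, then sum within each region
merged    <- merge(values, regions, by = "ISO3")
by_region <- aggregate(population ~ REGION, data = merged, FUN = sum)
print(by_region)
```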

Model Fitting

Height versus Age

Further Exploration:

Height

Look at people aged greater than 20 years, first with a very small sample (about 100 people) and then with the entire data set.
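The comparison between a small sample and the full data set can be sketched as follows. The data here are simulated, since the exercise's data set is not shown in this section, and the variable names age and height and the assumed trend are placeholders.

```r
set.seed(1)

# Simulated adults: a slight decline of height with age plus noise
n      <- 5000
age    <- runif(n, min = 21, max = 80)
height <- 170 - 0.05 * (age - 21) + rnorm(n, sd = 7)
people <- data.frame(age = age, height = height)

# A small random sample of about 100 people
small <- people[sample(nrow(people), 100), ]

# Fit the same straight-line model to both
fit_small <- lm(height ~ age, data = small)
fit_full  <- lm(height ~ age, data = people)

# Compare the standard error of the age coefficient
se_small <- summary(fit_small)$coefficients["age", "Std. Error"]
se_full  <- summary(fit_full)$coefficients["age", "Std. Error"]
print(c(se_small = se_small, se_full = se_full))
```

With the small sample the slope is estimated much less precisely, so a weak trend that is clear in the full data can be indistinguishable from zero in 100 cases.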

BMI

For Math 135 graduates …

The BMI models weight as being proportional to height squared. This should give a straight line on a graph of log-weight against log-height. Is this a reasonable model for adults?
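If weight = k * height^2, then log(weight) = log(k) + 2 * log(height), so the fitted slope on the log-log scale should be close to 2. The sketch below checks this on simulated data that follow the BMI model exactly, up to multiplicative noise. The constant 24, the noise level, and the height range are arbitrary assumptions, and real adult data may well give a different slope.

```r
set.seed(2)

# Simulated adults whose weight follows weight = k * height^2 with noise
height <- runif(500, min = 1.5, max = 2.0)                  # metres
k      <- 24                                                # assumed BMI-like constant
weight <- k * height^2 * exp(rnorm(500, sd = 0.1))          # kilograms

# Straight-line fit on the log-log scale
fit   <- lm(log(weight) ~ log(height))
slope <- coef(fit)[["log(height)"]]
print(slope)
```

Running the same lm() call on the real data and comparing the slope and its confidence interval with 2 is one way to judge whether the BMI model is reasonable for adults.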