Data from the web

Heike Hofmann
Stat 579, Fall 2013

Outline

  • reading tables from web sources
  • working with dates
  • … and messy data

Data on Websites

Getting Data from the web into R

  • Single Data File: copy & paste into spread sheet software
  • Multiple Data Files: use automatic parser - e.g. R script

Package XML:

allows parsing of HTML and XML, e.g.: batting statistics for Chicago Cubs

library("XML")
url ="http://www.baseball-reference.com/teams/CHC/2013.shtml"
tables <- readHTMLTable(url)
is.list(tables)
[1] TRUE

Working with lists

  • lists are the most general data types in R - they can contain anything
  • lists sometimes hard to work with
  • subsetting individual element i: [[i]]
  • but: use single brackets for multiple elements, i.e. [i:j]
  • in the plyr package: ldply(list, function)
library(plyr)
ldply(tables, dim)
                     .id V1 V2
1           team_batting 58 28
2          team_pitching 33 33
3      standard_fielding 56 26
4  players_value_batting 56 22
5 players_value_pitching 31 22

table of names and dimensions (#rows, #columns) of the data tables on the website

head(tables[[1]])
  Rk Pos                    Age   G  PA  AB  R   H 2B 3B HR
1  1   C Welington Castillo  26 113 428 380 41 104 23  0  8
2  2  1B     Anthony Rizzo*  23 160 690 606 71 141 40  2 23
3  3  2B      Darwin Barney  27 141 555 501 49 104 25  1  7
4  4  SS     Starlin Castro  23 161 705 666 59 163 34  2 10
5  5  3B     Luis Valbuena*  27 108 391 331 34  72 15  1 12
6  6  LF    Alfonso Soriano  37  93 383 362 47  92 24  1 17
  RBI SB CS BB  SO   BA  OBP  SLG  OPS OPS+  TB GDP HBP SH
1  32  2  0 34  97 .274 .349 .397 .746  104 151  13  11  1
2  80  6  5 76 127 .233 .323 .419 .742  101 254  12   6  0
3  41  4  2 36  64 .208 .266 .303 .569   56 152  22   6  4
4  44  9  6 30 129 .245 .284 .347 .631   72 231  21   7  1
5  37  1  4 53  63 .218 .331 .378 .708   94 125   4   4  1
6  51 10  5 15  89 .254 .287 .467 .754  102 169   9   3  0
  SF IBB
1  2   3
2  2   7
3  6   5
4  1   0
5  2   4
6  3   2

Your turn

  • Find your favorite baseball team (if you don’t have a favorite, use the one that matches your initials the closest)

  • Use the function readHTMLTable to get the batting statistics for the 2013, 2012 and 2011 seasons (you might need to find the correct URL first)

  • Introduce a new variable called ‘Handedness’ into the data set, that has values ‘L’, ‘R’, or ‘B’ for the batting handedness (see website how this is encoded).

Working with dates

  • Package lubridate allows to work with dates as.Date converts (most commonly used) date formats in character variables to dates
  • in case of ambiguities - e.g. 6/3/2012 could be interpreted (Europe) as 6th of March instead of June 3rd, use parameter format accessor functions: month(), wday(), year(), ...
  • allows for sensible summaries and comparisons, e.g. date > as.Date("2012-11-11")
schedule <- readHTMLTable("http://espn.go.com/nba/team/schedule/_/name/chi/chicago-bulls")[[1]]
head(schedule)
                            V1            V2       V3   V4
1 2014 Regular Season Schedule          <NA>     <NA> <NA>
2                      OCTOBER      OPPONENT   RESULT  W-L
3                  Tue, Oct 29        @Miami  L107-95  0-1
4                  Thu, Oct 31   vsNY Knicks   W82-81  1-1
5                     NOVEMBER      OPPONENT   RESULT  W-L
6                   Sat, Nov 2 @Philadelphia L107-104  1-2
                     V5             V6
1                  <NA>           <NA>
2             HI POINTS    HI REBOUNDS
3          C. Boozer 31     J. Noah 11
4         C. Anthony 22 T. Chandler 19
5             HI POINTS    HI REBOUNDS
6 M. Carter-Williams 26    S. Hawes 11
                     V7
1                  <NA>
2            HI ASSISTS
3            L. James 8
4          C. Anthony 6
5            HI ASSISTS
6 M. Carter-Williams 10

Little bit of cleanup first:

names(schedule) <- as.character(unlist(schedule[2,]))
schedule <- schedule[-(1:2),]
library(lubridate)
head(schedule$OCTOBER)
[1] Tue, Oct 29 Thu, Oct 31 NOVEMBER    Sat, Nov 2 
[5] Wed, Nov 6  Fri, Nov 8 
90 Levels: 2014 Regular Season Schedule APRIL ... Wed, Nov 6
schedule$Date <- as.Date(as.character(schedule$OCTOBER), format='%a, %b %d')
summary(schedule$Date)
        Min.      1st Qu.       Median         Mean 
"2013-01-02" "2013-02-11" "2013-03-26" "2013-06-06" 
     3rd Qu.         Max.         NA's 
"2013-11-20" "2013-12-31"          "7" 

see ?strptime for the formatting of date and time.

Your Turn

  • Look at your favorite's baseball team's schedule for the last season, and read the data

  • Introduce a new variable into the data set called date and re-format the dates of the season's schedule.

Beyond tables ...

We can use HTML structure directly (using readXML). HTML is hierarchically structured as a tree:

Alt text

Messy data

  • The website http://www.google.org/flutrends/us/#US shows the Google trends for flu cases across the US

  • Read the data into R (without using any text editor help)

  • Extract the data for the last month for all states.

  • Plot one chart with time series of flu cases for each state.

as a start: read the data, but this data set needs some reshaping with reshape.

flu <- read.table("http://www.google.org/flutrends/us/data.txt", sep=",", skip=10, header=TRUE)
head(flu)
        Date United.States Alabama Alaska Arizona Arkansas
1 2003-09-28           902     477     NA     606       NA
2 2003-10-05           952     501     NA     663       NA
3 2003-10-12          1092     492     NA     700       NA
4 2003-10-19          1209     533     NA     819       NA
5 2003-10-26          1249     594     NA     959       NA
6 2003-11-02          1374     715     NA    1167       NA
  California Colorado Connecticut Delaware
1        929      233         223       NA
2        849      251         243       NA
3       1032      283         261       NA
4       1084      310         268       NA
5        989      344         334       NA
6       1284      569         394       NA
  District.of.Columbia Florida Georgia Hawaii Idaho
1                  927     587     514     NA    NA
2                  993     582     532     NA    NA
3                 1033     606     557     NA    NA
4                 1089     698     608     NA    NA
5                 1177     708     745     NA    NA
6                 1270     746     767    936    NA
  Illinois Indiana Iowa Kansas Kentucky Louisiana Maine
1      677     544  303    272      420      1017    NA
2      732     607  303    270      442      1096    NA
3      799     637  312    280      460      1144    NA
4      841     680  342    292      485      1308    NA
5      978     765  368    313      496      1936    NA
6     1102     773  399    323      505      2791    NA
  Maryland Massachusetts Michigan Minnesota Mississippi
1     1268           344      685       484          NA
2     1374           362      748       514          NA
3     1445           372      791       588          NA
4     1432           367      829       692          NA
5     1426           370      803       701          NA
6     1495           404      898       708          NA
  Missouri Montana Nebraska Nevada New.Hampshire New.Jersey
1      349      NA       NA     NA            NA        695
2      359      NA       NA     NA            NA        716
3      381      NA       NA     NA            NA        815
4      431      NA       NA     NA            NA        831
5      452      NA       NA     NA            NA        901
6      450      NA       NA     NA            NA        897
  New.Mexico New.York North.Carolina North.Dakota Ohio
1         NA      649            565           NA  616
2         NA      725            660           NA  699
3         NA      739            861           NA  729
4         NA      874            852           NA  759
5         NA      865            857           NA  847
6         NA     1005            922           NA  830
  Oklahoma Oregon Pennsylvania Rhode.Island South.Carolina
1     1040    409         1186           NA            462
2     1065    409         1176           NA            478
3     1122    428         1340           NA            521
4     1199    467         1424           NA            529
5     1278    510         1514           NA            556
6     1437    576         1448           NA            621
  South.Dakota Tennessee Texas Utah Vermont Virginia
1           NA       551  1398   NA      NA     1112
2           NA       597  1517   NA      NA     1198
3           NA       670  2010   NA      NA     1343
4           NA       666  2786   NA      NA     1448
5           NA       717  2917   NA      NA     1556
6           NA       733  3177   NA      NA     1640
  Washington West.Virginia Wisconsin Wyoming
1        588            NA       466      NA
2        624            NA       504      NA
3        777            NA       538      NA
4        866            NA       595      NA
5       1026            NA       653      NA
6       1053            NA       688      NA
  HHS.Region.1..CT..ME..MA..NH..RI..VT.
1                                   322
2                                   381
3                                   410
4                                   397
5                                   444
6                                   457
  HHS.Region.2..NJ..NY.
1                   666
2                   711
3                   819
4                   839
5                   917
6                   997
  HHS.Region.3..DE..DC..MD..PA..VA..WV.
1                                  1366
2                                  1335
3                                  1411
4                                  1444
5                                  1608
6                                  1616
  HHS.Region.4..AL..FL..GA..KY..MS..NC..SC..TN.
1                                           631
2                                           652
3                                           735
4                                           822
5                                           797
6                                           850
  HHS.Region.5..IL..IN..MI..MN..OH..WI.
1                                   690
2                                   775
3                                   760
4                                   803
5                                   931
6                                   938
  HHS.Region.6..AR..LA..NM..OK..TX.
1                              1385
2                              1613
3                              2089
4                              2583
5                              2955
6                              3202
  HHS.Region.7..IA..KS..MO..NE.
1                           385
2                           400
3                           422
4                           498
5                           530
6                           509
  HHS.Region.8..CO..MT..ND..SD..UT..WY.
1                                   266
2                                   271
3                                   285
4                                   338
5                                   387
6                                   653
  HHS.Region.9..AZ..CA..HI..NV.
1                           878
2                           853
3                          1102
4                          1086
5                          1028
6                          1394
  HHS.Region.10..AK..ID..OR..WA. Anchorage..AK
1                            624            NA
2                            688            NA
3                            791            NA
4                            852            NA
5                           1012            NA
6                           1160            NA
  Birmingham..AL Little.Rock..AR Mesa..AZ Phoenix..AZ
1            407              NA       NA         757
2            402              NA       NA         796
3            428              NA       NA         766
4            494              NA       NA         820
5            531              NA       NA         908
6            548              NA       NA        1109
  Scottsdale..AZ Tempe..AZ Tucson..AZ Berkeley..CA
1             NA       585        598           NA
2             NA       608        674           NA
3             NA       629        731           NA
4             NA       649        778           NA
5             NA       658       1167           NA
6             NA       810       1584           NA
  Fresno..CA Irvine..CA Los.Angeles..CA Oakland..CA
1         NA         NA             901         848
2         NA         NA             891         888
3         NA         NA            1165         839
4         NA         NA            1158         794
5         NA         NA             922         908
6         NA         NA            1306         930
  Sacramento..CA San.Diego..CA San.Francisco..CA
1            448           562              1003
2            436           840              1115
3            468           938              1311
4            495          1045              1429
5            525           988              1377
6            643          1021              1423
  San.Jose..CA Santa.Clara..CA Sunnyvale..CA
1          731             990           602
2          740             915           594
3          826             989           609
4          958            1020           604
5         1123            1070           640
6         1035            1322           728
  Colorado.Springs..CO Denver..CO Washington..DC
1                   NA        235           1153
2                   NA        270           1310
3                   NA        257           1309
4                   NA        314           1151
5                   NA        354           1285
6                   NA        845           1522
  Gainesville..FL Hialeah..FL Jacksonville..FL Miami..FL
1              NA          NA               NA       373
2              NA          NA               NA       386
3             641          NA               NA       370
4             680          NA               NA       380
5             679          NA               NA       412
6             736          NA               NA       445
  Orlando..FL Tampa..FL Atlanta..GA Roswell..GA
1         609       461         519          NA
2         663       581         484          NA
3         615       567         497          NA
4         679       628         563          NA
5         664       593         845          NA
6         797       699         771          NA
  Honolulu..HI Des.Moines..IA Boise..ID Chicago..IL
1          794             NA        NA         731
2          877             NA        NA         850
3         1030             NA        NA         799
4         1012             NA        NA         852
5          910             NA        NA        1051
6          946             NA        NA        1137
  Indianapolis..IN Wichita..KS Lexington..KY
1              641          NA            NA
2              657          NA            NA
3              685          NA            NA
4              697          NA            NA
5              715          NA            NA
6              681          NA            NA
  Baton.Rouge..LA New.Orleans..LA Boston..MA Somerville..MA
1              NA            1154        314            332
2              NA            1162        323            375
3              NA            1274        369            447
4              NA            1488        321            440
5              NA            2229        346            464
6              NA            3288        395            487
  Baltimore..MD Grand.Rapids..MI St.Paul..MN
1          1505               NA         426
2          1535               NA         423
3          1549               NA         457
4          1466               NA         440
5          1460               NA         445
6          1501               NA         462
  Kansas.City..MO Springfield..MO St.Louis..MO Jackson..MS
1             330              NA          391          NA
2             316              NA          397          NA
3             343              NA          408          NA
4             358              NA          426          NA
5             384              NA          448          NA
6             405              NA          464          NA
  Cary..NC Charlotte..NC Durham..NC Greensboro..NC
1       NA           561        521             NA
2       NA           673        536             NA
3       NA           738        521             NA
4       NA           799        563             NA
5       NA           925        573             NA
6       NA           975        601             NA
  Raleigh..NC Lincoln..NE Omaha..NE Newark..NJ
1         503          NA       314        540
2         586          NA       331        549
3         838          NA       373        575
4         866          NA       382        584
5         831          NA       421        626
6        1065          NA       431        658
  Albuquerque..NM Las.Vegas..NV Reno..NV Albany..NY
1              NA           843       NA        505
2              NA           831       NA        508
3            1068           824       NA        555
4            1092           834       NA        604
5            1089           872       NA        629
6            1194          1028       NA        678
  Buffalo..NY New.York..NY Rochester..NY Cincinnati..OH
1          NA          579           406            678
2          NA          730           483            765
3          NA          652           476            854
4         548          831           498            930
5         557          776           662            864
6         653         1030          1008            778
  Cleveland..OH Columbus..OH Dayton..OH Oklahoma.City..OK
1           466          437         NA               924
2           535          415         NA               894
3           671          442         NA               922
4           703          485         NA              1024
5           794          522         NA              1191
6           928          546         NA              1232
  Tulsa..OK Beaverton..OR Eugene..OR Portland..OR
1      1034            NA         NA          444
2      1042            NA         NA          471
3      1089            NA         NA          574
4      1089            NA         NA          683
5      1095            NA         NA          735
6      1187            NA         NA          866
  Philadelphia..PA Pittsburgh..PA State.College..PA
1             1204           1122                NA
2             1124           1193                NA
3             1249           1306                NA
4             1279           1343                NA
5             1537           1484                NA
6             1542           1445                NA
  Providence..RI Columbia..SC Greenville..SC Knoxville..TN
1             NA           NA             NA            NA
2             NA           NA             NA            NA
3             NA           NA             NA            NA
4             NA           NA             NA            NA
5             NA           NA             NA            NA
6             NA           NA             NA            NA
  Memphis..TN Nashville..TN Austin..TX Dallas..TX
1          NA           425       1150       1200
2          NA           468       1331       1487
3          NA           497       1492       1869
4          NA           544       2696       1897
5          NA           554       3918       2029
6          NA           603       3840       2647
  Ft.Worth..TX Houston..TX Irving..TX Lubbock..TX Plano..TX
1           NA        1412       1122          NA        NA
2           NA        2057       1208          NA        NA
3           NA        3770       1191          NA        NA
4           NA        6618       1516          NA        NA
5           NA        6412       1559          NA        NA
6           NA        6046       1891          NA        NA
  San.Antonio..TX Salt.Lake.City..UT Arlington..VA
1             986                261          1066
2             989                249          1249
3            1463                295          1289
4            2095                292          1308
5            3282                328          1426
6            3554                355          1419
  Norfolk..VA Reston..VA Richmond..VA Bellevue..WA
1         948         NA         1035           NA
2         963         NA         1135           NA
3         970         NA         1170           NA
4        1011         NA         1342           NA
5        1009         NA         1354           NA
6        1032         NA         1446           NA
  Seattle..WA Spokane..WA Madison..WI Milwaukee..WI
1         668          NA         622           452
2         787          NA         626           449
3         994          NA         661           437
4         999          NA         649           487
5        1297          NA         657           499
6        1100          NA         707           521
library(lubridate)
library(ggplot2)
flu$date <- as.Date(flu$Date)
qplot(date, Arkansas, data=flu)

plot of chunk unnamed-chunk-7