Heike Hofmann
Stat 579, Fall 2013
Election results (2008) http://www.nytimes.com/2008/results/president/votes.html
Election results (2012)
http://elections.nytimes.com/2012/results/president
Election results (2013)
http://elections.huffingtonpost.com/2013/results
Data from goverment agencies: http://data.gov
Weather Data, e.g. http://www.wunderground.com/cgi-bin/findweather/getForecast?code=KAMW&airportorwmo=airport&historytype=DailyHistory&month=11&day=11&year=2013
Sports data, e.g. baseball: http://www.baseball-reference.com/teams/CHC/2013.shtml
Package XML:
allows parsing of HTML and XML, e.g.: batting statistics for Chicago Cubs
library("XML")
url ="http://www.baseball-reference.com/teams/CHC/2013.shtml"
tables <- readHTMLTable(url)
is.list(tables)
[1] TRUE
library(plyr)
ldply(tables, dim)
.id V1 V2
1 team_batting 58 28
2 team_pitching 33 33
3 standard_fielding 56 26
4 players_value_batting 56 22
5 players_value_pitching 31 22
table of names and dimensions (#rows, #columns) of the data tables on the website
head(tables[[1]])
Rk Pos Age G PA AB R H 2B 3B HR
1 1 C Welington Castillo 26 113 428 380 41 104 23 0 8
2 2 1B Anthony Rizzo* 23 160 690 606 71 141 40 2 23
3 3 2B Darwin Barney 27 141 555 501 49 104 25 1 7
4 4 SS Starlin Castro 23 161 705 666 59 163 34 2 10
5 5 3B Luis Valbuena* 27 108 391 331 34 72 15 1 12
6 6 LF Alfonso Soriano 37 93 383 362 47 92 24 1 17
RBI SB CS BB SO BA OBP SLG OPS OPS+ TB GDP HBP SH
1 32 2 0 34 97 .274 .349 .397 .746 104 151 13 11 1
2 80 6 5 76 127 .233 .323 .419 .742 101 254 12 6 0
3 41 4 2 36 64 .208 .266 .303 .569 56 152 22 6 4
4 44 9 6 30 129 .245 .284 .347 .631 72 231 21 7 1
5 37 1 4 53 63 .218 .331 .378 .708 94 125 4 4 1
6 51 10 5 15 89 .254 .287 .467 .754 102 169 9 3 0
SF IBB
1 2 3
2 2 7
3 6 5
4 1 0
5 2 4
6 3 2
Find your favorite baseball team (if you don’t have a favorite, use the one that matches your initials the closest)
Use the function readHTMLTable to get the batting statistics for the 2013, 2012 and 2011 seasons (you might need to find the correct URL first)
Introduce a new variable called ‘Handedness’ into the data set, that has values ‘L’, ‘R’, or ‘B’ for the batting handedness (see website how this is encoded).
lubridate allows to work with dates
as.Date converts (most commonly used) date formats in character variables to datesformat
accessor functions: month(), wday(), year(), ... date > as.Date("2012-11-11")schedule <- readHTMLTable("http://espn.go.com/nba/team/schedule/_/name/chi/chicago-bulls")[[1]]
head(schedule)
V1 V2 V3 V4
1 2014 Regular Season Schedule <NA> <NA> <NA>
2 OCTOBER OPPONENT RESULT W-L
3 Tue, Oct 29 @Miami L107-95 0-1
4 Thu, Oct 31 vsNY Knicks W82-81 1-1
5 NOVEMBER OPPONENT RESULT W-L
6 Sat, Nov 2 @Philadelphia L107-104 1-2
V5 V6
1 <NA> <NA>
2 HI POINTS HI REBOUNDS
3 C. Boozer 31 J. Noah 11
4 C. Anthony 22 T. Chandler 19
5 HI POINTS HI REBOUNDS
6 M. Carter-Williams 26 S. Hawes 11
V7
1 <NA>
2 HI ASSISTS
3 L. James 8
4 C. Anthony 6
5 HI ASSISTS
6 M. Carter-Williams 10
Little bit of cleanup first:
names(schedule) <- as.character(unlist(schedule[2,]))
schedule <- schedule[-(1:2),]
library(lubridate)
head(schedule$OCTOBER)
[1] Tue, Oct 29 Thu, Oct 31 NOVEMBER Sat, Nov 2
[5] Wed, Nov 6 Fri, Nov 8
90 Levels: 2014 Regular Season Schedule APRIL ... Wed, Nov 6
schedule$Date <- as.Date(as.character(schedule$OCTOBER), format='%a, %b %d')
summary(schedule$Date)
Min. 1st Qu. Median Mean
"2013-01-02" "2013-02-11" "2013-03-26" "2013-06-06"
3rd Qu. Max. NA's
"2013-11-20" "2013-12-31" "7"
see ?strptime for the formatting of date and time.
Look at your favorite's baseball team's schedule for the last season, and read the data
Introduce a new variable into the data set called date and re-format the dates of the season's schedule.
We can use HTML structure directly (using readXML).
HTML is hierarchically structured as a tree:
The website http://www.google.org/flutrends/us/#US shows the Google trends for flu cases across the US
Read the data into R (without using any text editor help)
Extract the data for the last month for all states.
Plot one chart with time series of flu cases for each state.
as a start: read the data, but this data set needs some reshaping with reshape.
flu <- read.table("http://www.google.org/flutrends/us/data.txt", sep=",", skip=10, header=TRUE)
head(flu)
Date United.States Alabama Alaska Arizona Arkansas
1 2003-09-28 902 477 NA 606 NA
2 2003-10-05 952 501 NA 663 NA
3 2003-10-12 1092 492 NA 700 NA
4 2003-10-19 1209 533 NA 819 NA
5 2003-10-26 1249 594 NA 959 NA
6 2003-11-02 1374 715 NA 1167 NA
California Colorado Connecticut Delaware
1 929 233 223 NA
2 849 251 243 NA
3 1032 283 261 NA
4 1084 310 268 NA
5 989 344 334 NA
6 1284 569 394 NA
District.of.Columbia Florida Georgia Hawaii Idaho
1 927 587 514 NA NA
2 993 582 532 NA NA
3 1033 606 557 NA NA
4 1089 698 608 NA NA
5 1177 708 745 NA NA
6 1270 746 767 936 NA
Illinois Indiana Iowa Kansas Kentucky Louisiana Maine
1 677 544 303 272 420 1017 NA
2 732 607 303 270 442 1096 NA
3 799 637 312 280 460 1144 NA
4 841 680 342 292 485 1308 NA
5 978 765 368 313 496 1936 NA
6 1102 773 399 323 505 2791 NA
Maryland Massachusetts Michigan Minnesota Mississippi
1 1268 344 685 484 NA
2 1374 362 748 514 NA
3 1445 372 791 588 NA
4 1432 367 829 692 NA
5 1426 370 803 701 NA
6 1495 404 898 708 NA
Missouri Montana Nebraska Nevada New.Hampshire New.Jersey
1 349 NA NA NA NA 695
2 359 NA NA NA NA 716
3 381 NA NA NA NA 815
4 431 NA NA NA NA 831
5 452 NA NA NA NA 901
6 450 NA NA NA NA 897
New.Mexico New.York North.Carolina North.Dakota Ohio
1 NA 649 565 NA 616
2 NA 725 660 NA 699
3 NA 739 861 NA 729
4 NA 874 852 NA 759
5 NA 865 857 NA 847
6 NA 1005 922 NA 830
Oklahoma Oregon Pennsylvania Rhode.Island South.Carolina
1 1040 409 1186 NA 462
2 1065 409 1176 NA 478
3 1122 428 1340 NA 521
4 1199 467 1424 NA 529
5 1278 510 1514 NA 556
6 1437 576 1448 NA 621
South.Dakota Tennessee Texas Utah Vermont Virginia
1 NA 551 1398 NA NA 1112
2 NA 597 1517 NA NA 1198
3 NA 670 2010 NA NA 1343
4 NA 666 2786 NA NA 1448
5 NA 717 2917 NA NA 1556
6 NA 733 3177 NA NA 1640
Washington West.Virginia Wisconsin Wyoming
1 588 NA 466 NA
2 624 NA 504 NA
3 777 NA 538 NA
4 866 NA 595 NA
5 1026 NA 653 NA
6 1053 NA 688 NA
HHS.Region.1..CT..ME..MA..NH..RI..VT.
1 322
2 381
3 410
4 397
5 444
6 457
HHS.Region.2..NJ..NY.
1 666
2 711
3 819
4 839
5 917
6 997
HHS.Region.3..DE..DC..MD..PA..VA..WV.
1 1366
2 1335
3 1411
4 1444
5 1608
6 1616
HHS.Region.4..AL..FL..GA..KY..MS..NC..SC..TN.
1 631
2 652
3 735
4 822
5 797
6 850
HHS.Region.5..IL..IN..MI..MN..OH..WI.
1 690
2 775
3 760
4 803
5 931
6 938
HHS.Region.6..AR..LA..NM..OK..TX.
1 1385
2 1613
3 2089
4 2583
5 2955
6 3202
HHS.Region.7..IA..KS..MO..NE.
1 385
2 400
3 422
4 498
5 530
6 509
HHS.Region.8..CO..MT..ND..SD..UT..WY.
1 266
2 271
3 285
4 338
5 387
6 653
HHS.Region.9..AZ..CA..HI..NV.
1 878
2 853
3 1102
4 1086
5 1028
6 1394
HHS.Region.10..AK..ID..OR..WA. Anchorage..AK
1 624 NA
2 688 NA
3 791 NA
4 852 NA
5 1012 NA
6 1160 NA
Birmingham..AL Little.Rock..AR Mesa..AZ Phoenix..AZ
1 407 NA NA 757
2 402 NA NA 796
3 428 NA NA 766
4 494 NA NA 820
5 531 NA NA 908
6 548 NA NA 1109
Scottsdale..AZ Tempe..AZ Tucson..AZ Berkeley..CA
1 NA 585 598 NA
2 NA 608 674 NA
3 NA 629 731 NA
4 NA 649 778 NA
5 NA 658 1167 NA
6 NA 810 1584 NA
Fresno..CA Irvine..CA Los.Angeles..CA Oakland..CA
1 NA NA 901 848
2 NA NA 891 888
3 NA NA 1165 839
4 NA NA 1158 794
5 NA NA 922 908
6 NA NA 1306 930
Sacramento..CA San.Diego..CA San.Francisco..CA
1 448 562 1003
2 436 840 1115
3 468 938 1311
4 495 1045 1429
5 525 988 1377
6 643 1021 1423
San.Jose..CA Santa.Clara..CA Sunnyvale..CA
1 731 990 602
2 740 915 594
3 826 989 609
4 958 1020 604
5 1123 1070 640
6 1035 1322 728
Colorado.Springs..CO Denver..CO Washington..DC
1 NA 235 1153
2 NA 270 1310
3 NA 257 1309
4 NA 314 1151
5 NA 354 1285
6 NA 845 1522
Gainesville..FL Hialeah..FL Jacksonville..FL Miami..FL
1 NA NA NA 373
2 NA NA NA 386
3 641 NA NA 370
4 680 NA NA 380
5 679 NA NA 412
6 736 NA NA 445
Orlando..FL Tampa..FL Atlanta..GA Roswell..GA
1 609 461 519 NA
2 663 581 484 NA
3 615 567 497 NA
4 679 628 563 NA
5 664 593 845 NA
6 797 699 771 NA
Honolulu..HI Des.Moines..IA Boise..ID Chicago..IL
1 794 NA NA 731
2 877 NA NA 850
3 1030 NA NA 799
4 1012 NA NA 852
5 910 NA NA 1051
6 946 NA NA 1137
Indianapolis..IN Wichita..KS Lexington..KY
1 641 NA NA
2 657 NA NA
3 685 NA NA
4 697 NA NA
5 715 NA NA
6 681 NA NA
Baton.Rouge..LA New.Orleans..LA Boston..MA Somerville..MA
1 NA 1154 314 332
2 NA 1162 323 375
3 NA 1274 369 447
4 NA 1488 321 440
5 NA 2229 346 464
6 NA 3288 395 487
Baltimore..MD Grand.Rapids..MI St.Paul..MN
1 1505 NA 426
2 1535 NA 423
3 1549 NA 457
4 1466 NA 440
5 1460 NA 445
6 1501 NA 462
Kansas.City..MO Springfield..MO St.Louis..MO Jackson..MS
1 330 NA 391 NA
2 316 NA 397 NA
3 343 NA 408 NA
4 358 NA 426 NA
5 384 NA 448 NA
6 405 NA 464 NA
Cary..NC Charlotte..NC Durham..NC Greensboro..NC
1 NA 561 521 NA
2 NA 673 536 NA
3 NA 738 521 NA
4 NA 799 563 NA
5 NA 925 573 NA
6 NA 975 601 NA
Raleigh..NC Lincoln..NE Omaha..NE Newark..NJ
1 503 NA 314 540
2 586 NA 331 549
3 838 NA 373 575
4 866 NA 382 584
5 831 NA 421 626
6 1065 NA 431 658
Albuquerque..NM Las.Vegas..NV Reno..NV Albany..NY
1 NA 843 NA 505
2 NA 831 NA 508
3 1068 824 NA 555
4 1092 834 NA 604
5 1089 872 NA 629
6 1194 1028 NA 678
Buffalo..NY New.York..NY Rochester..NY Cincinnati..OH
1 NA 579 406 678
2 NA 730 483 765
3 NA 652 476 854
4 548 831 498 930
5 557 776 662 864
6 653 1030 1008 778
Cleveland..OH Columbus..OH Dayton..OH Oklahoma.City..OK
1 466 437 NA 924
2 535 415 NA 894
3 671 442 NA 922
4 703 485 NA 1024
5 794 522 NA 1191
6 928 546 NA 1232
Tulsa..OK Beaverton..OR Eugene..OR Portland..OR
1 1034 NA NA 444
2 1042 NA NA 471
3 1089 NA NA 574
4 1089 NA NA 683
5 1095 NA NA 735
6 1187 NA NA 866
Philadelphia..PA Pittsburgh..PA State.College..PA
1 1204 1122 NA
2 1124 1193 NA
3 1249 1306 NA
4 1279 1343 NA
5 1537 1484 NA
6 1542 1445 NA
Providence..RI Columbia..SC Greenville..SC Knoxville..TN
1 NA NA NA NA
2 NA NA NA NA
3 NA NA NA NA
4 NA NA NA NA
5 NA NA NA NA
6 NA NA NA NA
Memphis..TN Nashville..TN Austin..TX Dallas..TX
1 NA 425 1150 1200
2 NA 468 1331 1487
3 NA 497 1492 1869
4 NA 544 2696 1897
5 NA 554 3918 2029
6 NA 603 3840 2647
Ft.Worth..TX Houston..TX Irving..TX Lubbock..TX Plano..TX
1 NA 1412 1122 NA NA
2 NA 2057 1208 NA NA
3 NA 3770 1191 NA NA
4 NA 6618 1516 NA NA
5 NA 6412 1559 NA NA
6 NA 6046 1891 NA NA
San.Antonio..TX Salt.Lake.City..UT Arlington..VA
1 986 261 1066
2 989 249 1249
3 1463 295 1289
4 2095 292 1308
5 3282 328 1426
6 3554 355 1419
Norfolk..VA Reston..VA Richmond..VA Bellevue..WA
1 948 NA 1035 NA
2 963 NA 1135 NA
3 970 NA 1170 NA
4 1011 NA 1342 NA
5 1009 NA 1354 NA
6 1032 NA 1446 NA
Seattle..WA Spokane..WA Madison..WI Milwaukee..WI
1 668 NA 622 452
2 787 NA 626 449
3 994 NA 661 437
4 999 NA 649 487
5 1297 NA 657 499
6 1100 NA 707 521
library(lubridate)
library(ggplot2)
flu$date <- as.Date(flu$Date)
qplot(date, Arkansas, data=flu)