Heike Hofmann
Stat 579, Fall 2013
Databases are important because they allow
More and more datasets are in a Gigabyte range … R reads data into main memory … most computers have a very limited amount of main memory(
(Now, data on the order of some activities today, terabytes on a frequent basis are TOO BIG for databases! Then analysts revert back to FLAT FILES.)
Relationships are defined by key-to-key relationships: 1:1 one to one, 1:n one to many, or m:n many to many
Normal Forms of data:
Distinction between measured data and data indices (keys)
“The key, the whole key & nothing but the key”
header line contains variable names (optional), in particular no information is lost by renaming columns with random strings
some column(s) uniquely describe a data record
Keys:
i.e. data sets in 1st NF with single key are automatically in 2nd NF
the example is in 1st NF but not in 2nd:
SQL (Structured Query Language) for querying and managing data in DB:
SELECT columns(, aggregate function(*))
FROM table1(, table2)
WHERE row_condition
(AND table1.id = table2.id)
(GROUP BY column)
(ORDER BY order_by_columns)
SELECT Name, Major FROM Student
SELECT * FROM Student
SELECT * FROM Student WHERE Major = “Math”
SELECT Name, Date FROM Student, Attendance WHERE Student.ID = Attendance.ID AND Status = “Absent”
SELECT ID, count(ID) FROM Attendance WHERE Status=“Absent” GROUP BY ID
COUNT AVG MAX MIN SUM ROUND LEN
SQL is a very powerful language - almost any mathematical function is implemented
http://www.w3schools.com/sql/ has a list of functions
Go to website http://www.w3schools.com/sql/sql_tryit.asp to try for yourself:
What fields are in the table “customers”?
Select the CompanyName and ContactName of customers that come from Germany
Find a frequency breakdown of all customers by country.
frontend depends on DBS, there is RMySQL, RSQLite, ROracle, …
Install both packages: DBI, RMySQL
On your own machine you’ll need to install the mysql client in order to run RMySQL
On terminal servers just call libraries:
library(DBI)
library(RMySQL)
#Connect to the database:
con <- dbConnect(dbDriver("MySQL"), user="2009Expo", password="R R0cks", port=3306, dbname="accidents", host="headnode.stat.iastate.edu")
con
<MySQLConnection:(20317,0)>
# What tables are in the database?
dbListTables(con)
[1] "accidents" "person" "vehicle"
# What variables are in the table?
dbListFields(con, "accidents")
[1] "row_names" "STATE" "COUNTY" "MONTH"
[5] "DAY" "HOUR" "MINUTE" "VE_TOTAL"
[9] "PERSONS" "PEDS" "NHS" "ROAD_FNC"
[13] "ROUTE" "SP_JUR" "HARM_EV" "MAN_COLL"
[17] "REL_JUNC" "REL_ROAD" "TRAF_FLO" "NO_LANES"
[21] "SP_LIMIT" "ALIGNMNT" "PROFILE" "PAVE_TYP"
[25] "SUR_COND" "TRA_CONT" "T_CONT_F" "LGT_COND"
[29] "WEATHER1" "WEATHER2" "WRK_ZONE" "NOT_HOUR"
[33] "NOT_MIN" "ARR_HOUR" "ARR_MIN" "HOSP_HR"
[37] "HOSP_MN" "SCH_BUS" "CF1" "CF2"
[41] "CF3" "FATALS" "DAY_WEEK" "DRUNK_DR"
[45] "ST_CASE" "CITY" "YEAR" "MILEPT"
[49] "TWAY_ID" "TWAY_ID2" "RAIL" "LATITUDE"
[53] "LONGITUD" "VE_FORMS" "WEATHER" "srsindex"
dbDisconnect(con)
[1] TRUE
con <- dbConnect(dbDriver("MySQL"), user="2009Expo", password="R R0cks", port=3306, dbname="accidents", host="headnode.stat.iastate.edu")
dbGetQuery(con, "SELECT * FROM accidents LIMIT 6")
row_names STATE COUNTY MONTH DAY HOUR MINUTE VE_TOTAL
1 1 1 127 1 2 18 15 2
2 2 1 3 1 1 20 0 1
3 3 1 15 1 1 15 59 2
4 4 1 69 1 3 8 40 2
5 5 1 77 1 3 0 5 1
6 6 1 9 1 6 14 50 2
PERSONS PEDS NHS ROAD_FNC ROUTE SP_JUR HARM_EV MAN_COLL
1 2 0 1 2 2 0 12 5
2 2 1 0 15 4 0 9 0
3 3 0 0 3 3 0 12 2
4 3 0 0 4 3 0 12 5
5 1 0 0 6 4 0 34 0
6 2 0 0 3 3 0 12 5
REL_JUNC REL_ROAD TRAF_FLO NO_LANES SP_LIMIT ALIGNMNT
1 2 1 2 2 65 1
2 1 1 2 2 35 1
3 1 1 1 2 55 1
4 1 1 1 2 55 1
5 1 4 1 2 45 1
6 1 1 1 2 55 1
PROFILE PAVE_TYP SUR_COND TRA_CONT T_CONT_F LGT_COND
1 1 2 1 20 3 2
2 1 2 1 0 0 2
3 1 2 1 0 0 1
4 2 2 1 0 0 1
5 2 2 1 0 0 2
6 1 2 2 0 0 1
WEATHER1 WEATHER2 WRK_ZONE NOT_HOUR NOT_MIN ARR_HOUR
1 1 0 0 99 99 18
2 1 0 0 99 99 20
3 1 0 0 99 99 16
4 1 0 0 99 99 9
5 5 0 0 99 99 0
6 2 0 0 99 99 15
ARR_MIN HOSP_HR HOSP_MN SCH_BUS CF1 CF2 CF3 FATALS
1 50 99 99 0 0 0 0 1
2 7 99 99 0 0 0 0 1
3 9 99 99 0 0 0 0 2
4 5 99 99 0 0 0 0 1
5 30 88 88 0 0 0 0 1
6 5 99 99 0 0 0 0 1
DAY_WEEK DRUNK_DR ST_CASE CITY YEAR MILEPT TWAY_ID
1 6 0 10001 0 2009 1626 US-SR 5
2 5 0 10002 0 2009 0 1480
3 5 0 10003 0 2009 2228 SR-9
4 7 0 10004 0 2009 149 SR-123
5 7 1 10005 0 2009 0 1611
6 3 0 10006 0 2009 2503 SR-53
TWAY_ID2 RAIL LATITUDE LONGITUD VE_FORMS
1 7797 0000000 33.81 -87.13 2
2 99999999999999999999 0000000 30.60 -87.86 1
3 99999999999999999999 0000000 33.69 -85.68 2
4 99999999999999999999 0000000 31.27 -85.62 2
5 99999999999999999999 0000000 34.99 -87.23 1
6 99999999999999999999 0000000 33.94 -86.42 2
WEATHER srsindex
1 1 1
2 1 2
3 1 3
4 1 4
5 5 5
6 2 6
accidents <- dbGetQuery(con, "SELECT * FROM accidents WHERE Year = 2009")
dbDisconnect(con)
[1] TRUE
library(ggplot2)
qplot(LONGITUD, LATITUDE, data=accidents)
The accident data are collected within FARS (http://www.nhtsa.gov/FARS, Fatality Accident Reporting System).
Using an SQL statement each,
find the number of total accidents by year
find the percentage of accidents, in which alcohol played a role (DRUNK_DR > 0)
find the percentage of accidents, in which alcohol played a role by time of the day (group by HOUR)