R and Databases

Heike Hofmann
Stat 579, Fall 2013

Outline

  • Normal forms of data
  • Select Query Language: SQL
  • … ggplot2 layer system

What is a database?

  • a collection of data
  • a set of rules to manipulate data

Databases are important because they allow

  • efficient manipulation of large data
  • convenient processing of data
  • integration of multiple sources of data
  • access to a shared resource

Reasons

  • More and more datasets are in a Gigabyte range … R reads data into main memory … most computers have a very limited amount of main memory(

  • (Now, data on the order of some activities today, terabytes on a frequent basis are TOO BIG for databases! Then analysts revert back to FLAT FILES.)

Relational Databases

  • Database is collection of tables (in normal form) and links between them
  • SQL (Structured Query Language) for querying
  • DBMS (Database Management System) for managing data

Relational Databases

alt text

Relationships are defined by key-to-key relationships: 1:1 one to one, 1:n one to many, or m:n many to many

DRY principle for data

  • Normal Forms of data:

    • provide more consistency, less repetition
    • are harder to view & edit
  • Distinction between measured data and data indices (keys)

  • “The key, the whole key & nothing but the key”

Why normalize?

  • Reduce overall data size
  • Easier to maintain
  • Reduce redundancy to detect possible errors
  • Useful way of thinking about objects under study

The key: 1st normal form

  • Data has rectangular shape
  • header line contains variable names (optional), in particular no information is lost by renaming columns with random strings

  • some column(s) uniquely describe a data record

Keys:

  • e.g. unique row names provide a key
  • keys are id variables, fixed by the design (as opposed to measured variables)
  • multiple id variables are called a composite key

The whole key: 2nd normal form

  • Violated, if non-key entry is described by part of the key
  • i.e. data sets in 1st NF with single key are automatically in 2nd NF

  • the example is in 1st NF but not in 2nd: alt text

Remedy: normalization by splitting tables

alt text

  • Both tables are now in normal form

and nothing but the key: 3rd NF

  • Violated, if non-key entry is identified by another non-key entry
  • example is in 2nd but not in 3rd normal form

alt text

Databases and SQL

  • Database is collection of tables in normal form

SQL (Structured Query Language) for querying and managing data in DB:

  • Structured Query Language (1970, E Codds)
  • Programming language used for accessing data in a database
  • ANSI standard since 1986, ISO standard since 1987
  • Still some portability issues between software and operating systems!
  • We'll mainly focus on SQL queries to access data

Select Query:

SELECT columns(, aggregate function(*))

FROM table1(, table2)

WHERE row_condition

(AND table1.id = table2.id)

(GROUP BY column)

(ORDER BY order_by_columns)

Examples

alt text

  • SELECT Name, Major FROM Student

  • SELECT * FROM Student

  • SELECT * FROM Student WHERE Major = “Math”

  • SELECT Name, Date FROM Student, Attendance WHERE Student.ID = Attendance.ID AND Status = “Absent”

  • SELECT ID, count(ID) FROM Attendance WHERE Status=“Absent” GROUP BY ID

Functions for aggregation

COUNT AVG MAX MIN SUM ROUND LEN

SQL is a very powerful language - almost any mathematical function is implemented

http://www.w3schools.com/sql/ has a list of functions

Your turn

  • Go to website http://www.w3schools.com/sql/sql_tryit.asp to try for yourself:

  • What fields are in the table “customers”?

  • Select the CompanyName and ContactName of customers that come from Germany

  • Find a frequency breakdown of all customers by country.

Accessing databases from R

  • Packages in R have Front-/Backend Set-up
  • backend is the same for all database systems (DBS): done by package DBI
  • frontend depends on DBS, there is RMySQL, RSQLite, ROracle, …

  • Install both packages: DBI, RMySQL

  • On your own machine you’ll need to install the mysql client in order to run RMySQL

  • On terminal servers just call libraries:

library(DBI)
library(RMySQL)

RMySQL

#Connect to the database:
con <- dbConnect(dbDriver("MySQL"), user="2009Expo", password="R R0cks", port=3306, dbname="accidents", host="headnode.stat.iastate.edu")
con
<MySQLConnection:(20317,0)> 

# What tables are in the database?
dbListTables(con)
[1] "accidents" "person"    "vehicle"  
# What variables are in the table?
dbListFields(con, "accidents")
 [1] "row_names" "STATE"     "COUNTY"    "MONTH"    
 [5] "DAY"       "HOUR"      "MINUTE"    "VE_TOTAL" 
 [9] "PERSONS"   "PEDS"      "NHS"       "ROAD_FNC" 
[13] "ROUTE"     "SP_JUR"    "HARM_EV"   "MAN_COLL" 
[17] "REL_JUNC"  "REL_ROAD"  "TRAF_FLO"  "NO_LANES" 
[21] "SP_LIMIT"  "ALIGNMNT"  "PROFILE"   "PAVE_TYP" 
[25] "SUR_COND"  "TRA_CONT"  "T_CONT_F"  "LGT_COND" 
[29] "WEATHER1"  "WEATHER2"  "WRK_ZONE"  "NOT_HOUR" 
[33] "NOT_MIN"   "ARR_HOUR"  "ARR_MIN"   "HOSP_HR"  
[37] "HOSP_MN"   "SCH_BUS"   "CF1"       "CF2"      
[41] "CF3"       "FATALS"    "DAY_WEEK"  "DRUNK_DR" 
[45] "ST_CASE"   "CITY"      "YEAR"      "MILEPT"   
[49] "TWAY_ID"   "TWAY_ID2"  "RAIL"      "LATITUDE" 
[53] "LONGITUD"  "VE_FORMS"  "WEATHER"   "srsindex" 
dbDisconnect(con)
[1] TRUE
con <- dbConnect(dbDriver("MySQL"), user="2009Expo", password="R R0cks", port=3306, dbname="accidents", host="headnode.stat.iastate.edu")

dbGetQuery(con, "SELECT * FROM accidents LIMIT 6")
  row_names STATE COUNTY MONTH DAY HOUR MINUTE VE_TOTAL
1         1     1    127     1   2   18     15        2
2         2     1      3     1   1   20      0        1
3         3     1     15     1   1   15     59        2
4         4     1     69     1   3    8     40        2
5         5     1     77     1   3    0      5        1
6         6     1      9     1   6   14     50        2
  PERSONS PEDS NHS ROAD_FNC ROUTE SP_JUR HARM_EV MAN_COLL
1       2    0   1        2     2      0      12        5
2       2    1   0       15     4      0       9        0
3       3    0   0        3     3      0      12        2
4       3    0   0        4     3      0      12        5
5       1    0   0        6     4      0      34        0
6       2    0   0        3     3      0      12        5
  REL_JUNC REL_ROAD TRAF_FLO NO_LANES SP_LIMIT ALIGNMNT
1        2        1        2        2       65        1
2        1        1        2        2       35        1
3        1        1        1        2       55        1
4        1        1        1        2       55        1
5        1        4        1        2       45        1
6        1        1        1        2       55        1
  PROFILE PAVE_TYP SUR_COND TRA_CONT T_CONT_F LGT_COND
1       1        2        1       20        3        2
2       1        2        1        0        0        2
3       1        2        1        0        0        1
4       2        2        1        0        0        1
5       2        2        1        0        0        2
6       1        2        2        0        0        1
  WEATHER1 WEATHER2 WRK_ZONE NOT_HOUR NOT_MIN ARR_HOUR
1        1        0        0       99      99       18
2        1        0        0       99      99       20
3        1        0        0       99      99       16
4        1        0        0       99      99        9
5        5        0        0       99      99        0
6        2        0        0       99      99       15
  ARR_MIN HOSP_HR HOSP_MN SCH_BUS CF1 CF2 CF3 FATALS
1      50      99      99       0   0   0   0      1
2       7      99      99       0   0   0   0      1
3       9      99      99       0   0   0   0      2
4       5      99      99       0   0   0   0      1
5      30      88      88       0   0   0   0      1
6       5      99      99       0   0   0   0      1
  DAY_WEEK DRUNK_DR ST_CASE CITY YEAR MILEPT TWAY_ID
1        6        0   10001    0 2009   1626 US-SR 5
2        5        0   10002    0 2009      0    1480
3        5        0   10003    0 2009   2228    SR-9
4        7        0   10004    0 2009    149  SR-123
5        7        1   10005    0 2009      0    1611
6        3        0   10006    0 2009   2503   SR-53
              TWAY_ID2    RAIL LATITUDE LONGITUD VE_FORMS
1                 7797 0000000    33.81   -87.13        2
2 99999999999999999999 0000000    30.60   -87.86        1
3 99999999999999999999 0000000    33.69   -85.68        2
4 99999999999999999999 0000000    31.27   -85.62        2
5 99999999999999999999 0000000    34.99   -87.23        1
6 99999999999999999999 0000000    33.94   -86.42        2
  WEATHER srsindex
1       1        1
2       1        2
3       1        3
4       1        4
5       5        5
6       2        6
accidents <- dbGetQuery(con, "SELECT * FROM accidents WHERE Year = 2009")

dbDisconnect(con)
[1] TRUE
library(ggplot2)
qplot(LONGITUD, LATITUDE, data=accidents)

plot of chunk unnamed-chunk-4

Your Turn

The accident data are collected within FARS (http://www.nhtsa.gov/FARS, Fatality Accident Reporting System).

Using an SQL statement each,

  • find the number of total accidents by year

  • find the percentage of accidents, in which alcohol played a role (DRUNK_DR > 0)

  • find the percentage of accidents, in which alcohol played a role by time of the day (group by HOUR)