2017-lec3

Today

New printr package for making nice tables
DC chap 4: Files and Documents
a brief intro to Databases (for hw 1)

printr

Here is a print out of mtcars rendered using knitr (it looks like the printout at the console):

head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

To produce more publication-quality tables, use the new package printr. It isn’t available yet on the CRAN repository, so you have to install it by running the following chunk (you need to run it, not just Knit HTML).

install.packages(
  'printr',
  type = 'source',
  repos = c('http://yihui.name/xran', 'http://cran.rstudio.com')
)

Once you install it you will see printr in your packages list in RStudio. You only have to do this once.

Next you need to remember to put library(printr) in a chunk in your rmarkdown file.

library(printr)

Here is a print out of mtcars now:

head(mtcars)

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
Mazda RX4	21.0	6	160	110	3.90	2.620	16.46	0	1	4	4
Mazda RX4 Wag	21.0	6	160	110	3.90	2.875	17.02	0	1	4	4
Datsun 710	22.8	4	108	93	3.85	2.320	18.61	1	1	4	1
Hornet 4 Drive	21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
Hornet Sportabout	18.7	8	360	175	3.15	3.440	17.02	0	0	3	2
Valiant	18.1	6	225	105	2.76	3.460	20.22	1	0	3	1

DC Chap 4: Files and Documents

The file path is the set of successive folders that bring you to the file.

There is a standard format for file paths. An example: /Users/kaplan/Downloads/0021_001.pdf. Here the filename is 0021_001, the filename extension is .pdf, and the file itself is in the Downloads folder contained in the kaplan folder, which is in turn contained in the Users folder. The starting / means “on this computer”.

You could also write this as ~/Downloads/0021_001.pdf where ~/ means “in my home directory”
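
As a small sketch of how these pieces fit together, base R can take the example path apart and put it back together. The path is the one from the text; the variable names are only for illustration.

```r
# Take apart the example path from the text using base R helpers.
path <- "/Users/kaplan/Downloads/0021_001.pdf"

file_name <- basename(path)                        # last component of the path
folder    <- dirname(path)                         # everything before the last component
extension <- tools::file_ext(path)                 # extension, returned without the leading dot
stem      <- tools::file_path_sans_ext(file_name)  # file name with the extension removed

print(c(file_name, folder, extension, stem))

# path.expand() replaces a leading ~ with your home directory,
# so its result depends on the computer it runs on.
print(path.expand("~/Downloads/0021_001.pdf"))

# file.path() builds a path from its pieces using "/" as the separator.
rebuilt <- file.path(folder, file_name)
print(identical(rebuilt, path))
```

Note that tools::file_ext() gives "pdf" rather than ".pdf".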

The R function file.choose(), which should be used only in the console and not in an Rmd file, brings up an interactive file browser. You can select a file with the browser. The returned value will be a quoted character string with the path name.

Run file.choose(), then select a file. The output is: [1] "/Users/kaplan/Downloads/0021_001.pdf"

For example I have a .csv file on my Desktop called names.csv. If I want to load it into R I might first find the path:

file.choose()
[1] "/Users/Adam/Desktop/names.csv"

Then to load it I type:

my_names <- read.file("/Users/Adam/Desktop/names.csv")

## Reading data with read.csv()

my_names

long_name	short_name
NPT41_PUB	Q1
NPT42_PUB	Q2
NPT43_PUB	Q3
NPT44_PUB	Q4
NPT45_PUB	Q5
NPT41_PRIV	Q1
NPT42_PRIV	Q2
NPT43_PRIV	Q3
NPT44_PRIV	Q4
NPT45_PRIV	Q5
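
The message printed above suggests that read.file() hands the work to read.csv(). Here is a minimal sketch that calls base R's read.csv() directly. Since your Desktop will not have names.csv, it first writes a small stand-in file (the first three rows of the table above) to a temporary location; the temporary path is an assumption made only so the example runs anywhere.

```r
# Write a small stand-in for names.csv to a temporary file,
# then read it back the same way you would read the real file.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("long_name,short_name",
    "NPT41_PUB,Q1",
    "NPT42_PUB,Q2",
    "NPT43_PUB,Q3"),
  csv_path
)

# read.csv() takes the path as a quoted character string
my_names <- read.csv(csv_path, stringsAsFactors = FALSE)
str(my_names)
```

With the real file you would pass the path returned by file.choose() instead of csv_path.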

Some common filename extensions for the sort of web resources you will be using:

.png for pictures
.jpg or .jpeg for photographs
.csv or .Rdata for data files
.Rmd for the human editable text of a document (called R Markdown)
.html for web pages themselves
.dbf for a certain kind of database

Writing reproducible reports in R with R Markdown

It is important that when you do some data cleaning or analysis every command that you use is documented in your report and can be repeated to get your results. This makes your report reproducible. A .Rmd (R Markdown) file makes your report reproducible since it integrates computer commands into the narrative so that graphics are produced by the commands rather than being inserted from another source.

For example, when you make a plot you can include the R commands that made the plot in your report using “chunks”:

library(DataComputing)
BabyNames %>%
  filter(name=="Adam") %>%
  group_by(year) %>%
  summarise(total_births=sum(count)) %>%
  ggplot( aes(x=year, y=total_births)) + geom_line()

This allows the reader to see how you generated the plot and how to modify it.

In lab this week you will learn about making R Markdown files. There is a link to an R-Markdown quick reference pdf in the data camp course you are doing in lab this week.

Note: Most .Rmd files will draw on a library that needs to be loaded into the R session.

When you compile .Rmd -> .html, R starts a brand new session that has no libraries loaded. You don’t need to install packages again, but you do need to load the libraries in a chunk. Usually you put a chunk at the top of your document that looks like:

# you may want ```{r, include=FALSE} so report doesn't show the contents of this chunk in your output file
library(DataComputing)
library(mosaic)

embedding URLs in R Markdown

You will sometimes need to copy URLs into your work in R, to access a dataset, make a link to some reference, etc. Remember to copy the entire URL, including the http:// part if it is there. For example, to include the jpeg picture of the gray wolf in your own Rmd file, use the following markup, which includes the URL.

![Gray Wolf](http://www.macalester.edu/~montgomery/GrayWolf.jpg)

For example, here is a link to a nice summary of R Markdown basics: R Markdown basics.

Here is an R Markdown cheat sheet.

embedding a picture in R Markdown

In the console, type file.choose() and search for your file interactively. Then put the path it returns into the image markup:

![](/Users/Adam/Desktop/Stat133_S17/lectures/dog.png)

Embedded Source Files within HTML files

When you hand in documents for the course, you will always be handing in an HTML file. However, it’s very useful to be able to access the original Rmd file from which the HTML was compiled, or perhaps other files such as CSV data. The Data Computing package provides a way to do this easily. Inside an R chunk in the Rmd file, include the following command:

includeSourceDocuments()

Source file ⇒ 2017-lec3.Rmd

For example see the top chunk of today’s lecture notes.

Creating an .Rmd file

When doing an assignment, you will be creating an .Rmd file, compiling it to .html, and then handing in your assignment by uploading the HTML file to b-courses.

To create a new .Rmd file, open “File/New File/R Markdown/Document”, then click on HTML.

To see what commands to put at the top of your .Rmd file look at the source file for this html and copy the stuff above ##Today.

Software for R Markdown (see Reporting with R Markdown course in Data Camp).

RStudio
R
The rmarkdown R package
a software package that makes human editable text
The knitr R package
a software package that allows you to weave R code into your R Markdown text
The Shiny R package
a software package that allows you to make interactive graphs
pandoc
a software program that allows you to render .Rmd to different formats including .html, .pdf, and .doc
LaTeX
a software package that allows you to format math equations. It is needed to produce .pdf files, not .html files.

knitr, Shiny and pandoc are all included with RStudio. You only need to download LaTeX.

In Class exercise #1

Figure out what the knitted html will look like for the following RMarkdown code

In Class exercises #2

DC chapter 4 exercises

In Class exercise #3

Make the following html file: Birds

A brief intro to databases

Since we are on the topic of files and documents, I will make a quick mention of storage of large datafiles. Splitting data into a number of related tables (called a database) brings many advantages for big data over a simple data table. Generally, if your data fits in memory there is no advantage to putting it in a database: it will only be slower and more hassle.

example

Typically a library stores details of all their books in a database. When you want to know if a book is in stock you can enter either the title, author or ISBN number and search for information about the book. You can find out how many copies are stored not only by your local library but also libraries in neighbouring towns. You can check when the book is due to be returned and also reserve it.

The database also records details of all the borrowers, what books they currently have out on loan and when they are due back. When they return their books the librarian will be informed if they are overdue and whether there are any fines outstanding.

For example, here is a data table containing information about a set of books, including the ISBN and title for the book, the author of the book and the author’s gender, the publisher of the book, and the publisher’s country of origin.

There are many advantages of putting large data files in a database.

Data is only stored once. In the previous example, the city data was gathered into one table so now there is only one record per city. The advantages of this are that no multiple record changes are needed, storage is more efficient, and details are simple to delete or modify: all records in other tables having a link to that entry will show the change.
Complex queries can be carried out. A language called SQL has been developed to allow programmers to ‘Select’, ‘Insert’, ‘Update’ and ‘Delete’ table records and to ‘Create’ and ‘Drop’ tables. These actions are further refined by a ‘Where’ clause.
For example: SELECT * FROM Customer WHERE ID = 2. This SQL statement will extract record number 2 from the Customer table. Far more complicated queries can be written that extract data from many tables at once.
Easier to maintain security. By splitting data into tables, certain tables can be made confidential. When a person logs on with their username and password, the system can then limit access only to those tables whose records they are authorised to view. For example, a receptionist would be able to view employee location and contact details but not their salary. A salesman may see his team’s sales performance but not that of competing teams.
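
To get a feel for what the SELECT statement in the list above does, here is a rough base R analogue on an ordinary data frame. The Customer table, its columns, and its values are made up for illustration; a real database would run the SQL itself.

```r
# Hypothetical Customer table; names and values are invented for illustration.
Customer <- data.frame(
  ID   = c(1, 2, 3),
  name = c("Ana", "Ben", "Cy"),
  stringsAsFactors = FALSE
)

# Base R analogue of: SELECT * FROM Customer WHERE ID = 2
# The condition before the comma plays the role of the WHERE clause;
# leaving the column position empty keeps every column, like "*".
record_2 <- Customer[Customer$ID == 2, ]
print(record_2)

# subset() expresses the same row filter
print(subset(Customer, ID == 2))
```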

Central to the design of databases is the concept of entities and their attributes. An entity is most easily thought of as a person, place, or physical object (e.g. a book); an event; or a concept. An attribute is a piece of information about the entity. The entities in our data set are books, publishers and authors. The attributes in the book table are ISBN, title, author and publisher.

An important principle of databases is that if an attribute is repeated, you can more efficiently store information about each level of the entity in a separate table.
In the above data table, the author Lynley Dodd and the publisher Mallinson Rendel are repeated multiple times.

There should be a separate table for each entity in the data set (namely book, author, publisher).
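
Here is a minimal sketch of that splitting step in base R, assuming a small flat table. The titles, the second publisher, and the country codes are placeholders, not the actual rows of the data table.

```r
# Hypothetical flat table: every value except the publisher name
# "Mallinson Rendel" is a placeholder for illustration.
flat <- data.frame(
  title     = c("Book A", "Book B", "Book C"),
  publisher = c("Mallinson Rendel", "Mallinson Rendel", "Publisher X"),
  country   = c("NZ", "NZ", "UK"),
  stringsAsFactors = FALSE
)

# One row per publisher: the repeated publisher details are stored once.
publisher_table <- unique(flat[, c("publisher", "country")])
publisher_table$ID <- seq_len(nrow(publisher_table))

# The book table keeps only a reference (pub) to the publisher's ID.
book_table <- data.frame(
  title = flat$title,
  pub   = publisher_table$ID[match(flat$publisher, publisher_table$publisher)],
  stringsAsFactors = FALSE
)

print(publisher_table)
print(book_table)
```

If a publisher's country ever needs correcting, it now changes in one row of publisher_table instead of in every book row.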

Shown below is a simple example of a database table that contains information about some of the books in our data set. This table has three columns–the ISBN of the book, the title of the book, and the author of the book–and four rows, with each row representing one book.

Each table in a database has a primary key. The primary key must be unique for every row in a table. In the book table, the ISBN provides a perfect primary key because every book has a different ISBN.
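
One way to check whether a column could serve as a primary key is to test that none of its values repeat. This is a sketch on a made-up book table; the ISBN strings, titles, and the helper is_key() are all hypothetical.

```r
# Hypothetical book table; the ISBN strings and titles are placeholders.
book_table <- data.frame(
  ISBN   = c("isbn-001", "isbn-002", "isbn-003", "isbn-004"),
  title  = c("Book A", "Book B", "Book C", "Book D"),
  author = c("Lynley Dodd", "Lynley Dodd", "Author Q", "Author R"),
  stringsAsFactors = FALSE
)

# is_key() is a helper written for this example: a column qualifies
# only if it has no missing values and no repeated values.
is_key <- function(x) !anyNA(x) && anyDuplicated(x) == 0

print(is_key(book_table$ISBN))    # every ISBN differs, so TRUE
print(is_key(book_table$author))  # "Lynley Dodd" repeats, so FALSE
```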

A database containing information on books might also contain information on book publishers. Below we show another table in the same database containing information on publishers.

The primary key in this table is the ID column.

Tables within the same database can be related to each other using foreign keys. These are columns in one table that specify a value from the primary key in another table. For example, we can relate each book in the book_table to a publisher in the publisher_table by adding a foreign key to the book_table. This foreign key consists of a column, pub, containing the appropriate publisher ID. The book_table now looks like this:

Notice that two of the books in the book_table have the same publisher, with a pub value of 1. This corresponds to the publisher with an ID value of 1 in the publisher_table, which is the publisher called Mallinson Rendel.

Also notice that a foreign key column in one table does not have to have the same name as the primary key column that it refers to in another table. The foreign key column in the book_table is called pub, but it refers to the primary key column in the publisher_table called ID.
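
Following a foreign key to the table it points at is what a join does. Here is a sketch with base R's merge(), using data frames as stand-ins for the two tables. The column names pub and ID and the publisher Mallinson Rendel come from the text; the ISBN strings, titles, and second publisher are placeholders.

```r
# Stand-in for the publisher table (second publisher is hypothetical).
publisher_table <- data.frame(
  ID   = c(1, 2),
  name = c("Mallinson Rendel", "Publisher X"),
  stringsAsFactors = FALSE
)

# Stand-in for the book table; pub is the foreign key.
book_table <- data.frame(
  ISBN  = c("isbn-001", "isbn-002", "isbn-003"),
  title = c("Book A", "Book B", "Book C"),
  pub   = c(1, 1, 2),
  stringsAsFactors = FALSE
)

# by.x and by.y let the key columns have different names in the two tables.
joined <- merge(book_table, publisher_table, by.x = "pub", by.y = "ID")
print(joined[, c("title", "name")])
```

The merged result keeps the key under the name from the first table (pub), which matches the point that the two key columns need not share a name.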

We can represent the database schematically this way:

Each “box” in this diagram represents one table in the database, with the name of the table as the heading in the box. The other names in each box are the names of the columns within the table; if the name is bold, then that column is part of the primary key for the table and if the name is italic, then that column is a foreign key. Arrows are used to show the link between a foreign key in one table and the primary key in another table.

In order to accommodate information about authors in the database, there should be another table for author information.

What is the relationship between books and authors? An author can write several books and a book can have more than one author, so this is an example of a many-to-many relationship.

A many-to-many relationship can only be represented by creating a new table in the database.

For example, we can create a table, called the book_author_table, that contains the relationship between authors and books.

Usually it isn’t possible for every table in the database to be about a single entity. Our book author table is concerned with both book and author entities.
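
To see how a linking table represents a many-to-many relationship, here is a sketch in base R with data frames standing in for the database tables. All rows, the column names book and author in the linking table, and the author names are hypothetical.

```r
# Hypothetical tables; every value is a placeholder.
book_table <- data.frame(
  ISBN  = c("isbn-001", "isbn-002"),
  title = c("Book A", "Book B"),
  stringsAsFactors = FALSE
)
author_table <- data.frame(
  ID   = c(1, 2),
  name = c("Author P", "Author Q"),
  stringsAsFactors = FALSE
)

# One row per (book, author) pair: Book A has two authors,
# and Author P has written two books.
book_author_table <- data.frame(
  book   = c("isbn-001", "isbn-001", "isbn-002"),
  author = c(1, 2, 1),
  stringsAsFactors = FALSE
)

# Follow both foreign keys to recover readable title/author pairs.
step1 <- merge(book_author_table, book_table, by.x = "book", by.y = "ISBN")
pairs <- merge(step1, author_table, by.x = "author", by.y = "ID")
print(pairs[, c("title", "name")])
```

Neither book_table nor author_table needs a repeated row; all of the repetition lives in the narrow linking table.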

Here we have four tables: the book author table, the book table, the publisher table, and the author table.

You can interact with databases through a variety of database management systems (DBMS), such as SQLite, MySQL, PostgreSQL, Oracle and Access. We will discuss this later in the semester.

For hw 1 you will play around with a database concerning accident fatalities.