Source file ⇒ R2.Rmd
#Today

  1. Review the basics of graphing with ggplot (finish the first ggplot2 course)
  2. Importing flat files (do Importing Data from Flat Files, first chapter only)
  3. Some data cleaning and visualization (do Cleaning Data in R in DataCamp, whole course)

ggplot basics

Here is a reference for ggplot

  • scatterplot

head(CPS85)
wage educ race sex hispanic south married exper union age sector
9.0 10 W M NH NS Married 27 Not 43 const
5.5 12 W M NH NS Married 20 Not 38 sales
3.8 12 W F NH NS Single 4 Not 22 sales
10.5 12 W F NH NS Married 29 Not 47 clerical
15.0 12 W M NH NS Married 40 Union 58 const
9.0 16 W F NH NS Married 27 Not 49 clerical
CPS85 %>%
  ggplot(aes(x = age, y = wage)) +
  geom_point(aes(shape = sex)) +
  facet_grid(married ~ .) +
  labs(title = "Wage versus Age", x = "Age", y = "Wage")

Task for you

Can you make this scatterplot from CPS85? (Hint: shape=1 is a hollow circle.)

library(reshape2)  # provides the tips data set
library(ggplot2)   # provides ggplot() and geom_histogram()
head(tips)
total_bill tip sex smoker day time size
16.99 1.01 Female No Sun Dinner 2
10.34 1.66 Male No Sun Dinner 3
21.01 3.50 Male No Sun Dinner 3
23.68 3.31 Male No Sun Dinner 2
24.59 3.61 Female No Sun Dinner 4
25.29 4.71 Male No Sun Dinner 4
hp <- ggplot(tips, aes(x=total_bill)) + geom_histogram(binwidth=2,colour="blue")

# Histogram of total_bill, divided by sex and smoker
hp + facet_grid(sex ~ smoker)

2. Flat files such as CSV, tab delimited or text files

CSV stands for comma-separated values. It’s a text format that can be read by a huge variety of software. It has a data table format, with the values of the variables in each case separated by commas. Here’s an example of the first several lines of a CSV file:

"name","sex","count","year"
"Mary","F",7065,1880
"Anna","F",2604,1880
"Emma","F",2003,1880
"Elizabeth","F",1939,1880

The top row usually (but not always) contains the variable names. Quotation marks are often used at the start and end of character strings; these quotation marks are not part of the content of the string.

Although CSV files are often named with the .csv suffix, it’s also common for them to be named with .txt or some other suffix. You will also see characters other than commas used to delimit the fields; tabs and vertical bars are particularly common.
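
As a rough illustration of delimiters, the sketch below writes a few columns of the example table to temporary files using a comma, a tab, and a vertical bar, then reads each one back with base R's read.table(). The temporary file names and the object names (babies, from_comma, and so on) are made up for this example.

```r
# A few columns of the example table shown above.
babies <- data.frame(
  name  = c("Mary", "Anna", "Emma"),
  count = c(7065, 2604, 2003),
  year  = c(1880, 1880, 1880)
)

# Temporary files; note that two of them use a .txt ending.
comma_file <- tempfile(fileext = ".csv")
tab_file   <- tempfile(fileext = ".txt")
bar_file   <- tempfile(fileext = ".txt")

# Write the same table three times, changing only the delimiter.
write.table(babies, comma_file, sep = ",",  row.names = FALSE)
write.table(babies, tab_file,   sep = "\t", row.names = FALSE)
write.table(babies, bar_file,   sep = "|",  row.names = FALSE)

# The sep argument tells the reader which character separates the fields.
from_comma <- read.table(comma_file, header = TRUE, sep = ",")
from_tab   <- read.table(tab_file,   header = TRUE, sep = "\t")
from_bar   <- read.table(bar_file,   header = TRUE, sep = "|")

# The raw file contains quotation marks; the values read back do not.
readLines(comma_file, n = 2)
from_comma$name
```

All three reads give the same table; only the sep value has to match the file.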

read.csv() is an old-school base R function. It has no real advantage: it is slow on large files and has historically been unreliable for reading CSV files from the web. Better functions for reading CSV files into R are read_csv() in the readr package and fread() in the data.table package. fread() figures out the delimiter for you by inspecting the file’s contents, so it can be used with different types of files. The problem with fread() is that it outputs a data.table instead of a plain data.frame (a data.table is a data structure defined by the data.table package). read.file() in the mosaic package has the advantages of fread() but outputs a data frame. Hence we will use read.file().

Here is a useful overview of functions to read CSV files:

Function      Package      WebURL   Fast
read.csv()    base         no       no
read_csv()    readr        yes      yes
fread()       data.table   yes      yes
read.file()   mosaic       yes      yes

Here’s a way to access a .csv file over the Internet.

houses

library(mosaic)

myURL <- "http://www.mosaic-web.org/go/datasets/SaratogaHouses.csv"
my_dataTable <- myURL %>% 
  read.file() %>%
  head(3)

  class(my_dataTable)
## [1] "data.frame"
  my_dataTable
Price Living.Area Baths Bedrooms Fireplace Acres Age
142212 1982 1.0 3 N 2.00 133
134865 1676 1.5 3 Y 0.38 14
118007 1694 2.0 3 Y 0.96 15

Here is an example of a text file whose delimiters are tabs.

potatoes

Use read_tsv() from the readr package for this:

library(readr)
url_delim <- "http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/potatoes.txt"
potatoes <- read_tsv(url_delim)
potatoes %>% head(3)
area temp size storage method texture flavor moistness
1 1 1 1 1 2.9 3.2 3.0
1 1 1 1 2 2.3 2.5 2.6
1 1 1 1 3 2.5 2.8 2.8

Now try it with read.file(). From the file ending, it knows to use the tab delimiter.

library(mosaic)
myURL <- "http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/potatoes.txt"
myURL %>% 
  read.file() %>%
  head(3)
area temp size storage method texture flavor moistness
1 1 1 1 1 2.9 3.2 3.0
1 1 1 1 2 2.3 2.5 2.6
1 1 1 1 3 2.5 2.8 2.8

Reading from a file on your own computer is even easier. You just need the file path, which you can find using file.choose(). For instance:

# Call file.choose() then copy the string with the file path below
fileName <- "~/Project1/Important.csv"
fileName %>%
  read.file()

A useful argument to read.file():

  • stringsAsFactors=FALSE is useful since we often don’t want a variable with character string values to be converted to a categorical variable (a factor).
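
To see what this argument changes, here is a minimal sketch using base R's read.csv(), which takes an argument of the same name. That read.file() treats the argument the same way is an assumption here, and the inline CSV text and object names are made up for the example.

```r
# A tiny CSV given inline as text, so no file is needed.
csv_text <- "name,count
Mary,7065
Anna,2604"

# Same data read twice, changing only stringsAsFactors.
as_strings <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# With FALSE the name column stays a character vector;
# with TRUE it becomes a factor (a categorical variable) with levels.
class(as_strings$name)
class(as_factors$name)
levels(as_factors$name)
```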

With download.file() you can download any kind of file from the web over HTTP or HTTPS: images, executable files, and also RData files. An RData file is a very efficient format for storing R data.

url_rdata <- "https://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/wine.RData"

# Download the wine file to your Desktop
download.file(url_rdata,"~/Desktop/wine_local.RData")

# Load the wine data into your workspace using load()
load("~/Desktop/wine_local.RData")
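
For a sense of how RData files behave without touching the network, the sketch below does a local round trip with save() and load(). The scores data frame and the temporary path are made up for illustration.

```r
# A small made-up data frame to store.
scores <- data.frame(id = 1:3, passed = c(TRUE, FALSE, TRUE))

# Save it to a temporary .RData file, then remove it from the workspace.
rdata_path <- tempfile(fileext = ".RData")
save(scores, file = rdata_path)
rm(scores)

# load() restores objects under their original names and
# (invisibly) returns a character vector of those names.
loaded_names <- load(rdata_path)
loaded_names
scores
```

Note that load() does not need an assignment: the object comes back under the name it had when it was saved.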

Task for you

  1. Download this file on your computer: http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/swimming_pools.csv using download.file()

  2. Load it into R using file.choose() and read.file(), and name the result pools.

  3. Wrangle pools so your data table looks like:

#I did this from the web
myURL <-"http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/swimming_pools.csv"
myURL %>% 
  read.file() %>%
  select(Name, Address) %>%
  head(3)
Name                          Address
Acacia Ridge Leisure Centre   1391 Beaudesert Road, Acacia Ridge
Bellbowrie Pool               Sugarwood Street, Bellbowrie
Carole Park                   Cnr Boundary Road and Waterford Road Wacol

Some more data wrangling and visualization

The file Stat133.csv contains my class’s overall grades. I want to analyze the midterm scores.

Let’s clean this data, find the mean, median, and standard deviation, and make a histogram.

Steps:
1. Read in the csv file
2. Select midterm grades column
3. Make a histogram
4. Find stats and label graph
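
The steps above can be sketched as follows. Since the contents of Stat133.csv are not shown here, this sketch uses a made-up stand-in file: the column names (Student, Midterm) and the scores are hypothetical. It also uses base R's hist() instead of ggplot2 so that it runs without extra packages.

```r
# Hypothetical stand-in for Stat133.csv (made-up names and scores).
grades_file <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    Student = c("s1", "s2", "s3", "s4", "s5", "s6"),
    Midterm = c(78, 85, NA, 91, 64, 85)
  ),
  grades_file,
  row.names = FALSE
)

# 1. Read in the CSV file
grades <- read.csv(grades_file, stringsAsFactors = FALSE)

# 2. Select the midterm column and drop missing scores (the cleaning step)
midterm <- grades$Midterm
midterm <- midterm[!is.na(midterm)]

# 4. Find the summary statistics (computed first so they can label the graph)
midterm_mean   <- mean(midterm)
midterm_median <- median(midterm)
midterm_sd     <- sd(midterm)

# 3. Make a histogram, labeled with the statistics
hist(
  midterm,
  main = sprintf("Midterm scores (mean %.1f, median %.1f, sd %.1f)",
                 midterm_mean, midterm_median, midterm_sd),
  xlab = "Midterm score"
)
```

With the real file, the column name in step 2 would need to match whatever the midterm column is actually called.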

Let’s make a histogram