UFO Sightings Analysis with R

Lynna Jirpongopas
Mon Apr 27 19:52:22 2015


What is R?

"R is a free software programming language and software environment for statistical computing and graphics." 

-Wikipedia

Dialect of S language
Free software
R version 1.0.0 was released in 2000

Learning goals

Install RStudio
Use common features of RStudio
Load a csv table
Use the zoo package to work with dated data
Subset data
Draw insights from data!

Downloading and installing RStudio

Getting RStudio: http://www.rstudio.com/products/rstudio/download/
Customizing your GUI
Alternative: RGui from http://www.r-project.org/

Dive into R basics (works for RGui or RStudio)

Case sensitive

For comments use #

; at the end of a line is optional

<- is the usual assignment operator (= also works)

" " are required around strings

: generates a sequence of numbers, e.g. 1:5

? before a function name to get help with a function
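
A minimal sketch of the basics above, using made-up variable names (x, X, greeting, oneToFive are only for illustration):

```r
# Comments start with a hash; R ignores the rest of the line.
x <- 5                    # assign with <-
X <- 10                   # a different variable: R is case sensitive
y <- x + X; z <- y * 2    # ; separates two statements on one line
greeting <- "hello"       # strings need quotes
oneToFive <- 1:5          # : generates the sequence 1 2 3 4 5

print(z)
print(oneToFive)

# ?mean   # uncomment in the console to open the help page for mean()
```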

Dive into R basics (works for RGui or RStudio)

Hotkeys:

Up and down arrows recall the last or recent lines of code
Ctrl + R runs a line of code in RGui (in RStudio, use Ctrl + Enter)
Ctrl + L clears the console

RStudio tools

install packages
RStudio has help tab
set working directory
preview data
executing R code - try 2+3
save and open your R file
save data

A very handy keystroke:

Tab for autocompletion

Load a csv table

There are several methods:

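1) Use the RStudio GUI to set the working directory, then give read.csv() just the file name

2) Use setwd() to set the working directory, then give read.csv() just the file name

3) Give read.csv() the full path to your csv file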

Load a csv table, method 3

nasdaqData <- read.csv("~/Github/gdi-r/class1/datasets/NASDAQOMX-NDX.csv")

read.csv() arguments

There are a bunch of arguments (i.e. options)! Here is my advice for best practice:

setwd("~/Github/gdi-r/class1/datasets")
opecData <- read.csv("OPEC-ORB.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE) # loading a csv file

opecData <- read.csv("OPEC-ORB.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE) # loading a tab-delimited txt file

PS. Data sets were taken from quandl.com
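
To try these arguments without downloading anything, here is a small sketch that writes a made-up table to a temporary file first. The column names and values are invented for illustration and are not the Quandl data.

```r
# Write a tiny made-up table to a temporary csv file so the example is self-contained.
csvPath <- tempfile(fileext = ".csv")
writeLines(c("Date,Value,Source",
             "2015-04-01,52.1,demo",
             "2015-04-02,53.4,demo"),
           csvPath)

# Same arguments as above: first row is the header, comma separated,
# and text columns stay as character instead of becoming factors.
demoData <- read.csv(csvPath, header = TRUE, sep = ",", stringsAsFactors = FALSE)

class(demoData$Source)   # text column is kept as character
nrow(demoData)           # number of data rows, header not counted
```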

Structure of your table

Two ways to find out what kind of table you have (matrix, list, data frame, etc.)

Method 1:

str(nasdaqData)

Method 2: Go to Environment tab
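
A quick sketch of Method 1 on a made-up table (demoTable and its values are invented; nasdaqData would work the same way once loaded):

```r
# A tiny stand-in table with the same kind of columns as the NASDAQ data.
demoTable <- data.frame(Date = c("2015-04-01", "2015-04-02"),
                        Index.Value = c(4300.5, 4325.1),
                        stringsAsFactors = FALSE)

class(demoTable)   # "data.frame"
str(demoTable)     # one line per column: name, type, first few values
```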

Subsetting data

Method 1: using subset()
Method 2: using square brackets (i.e. extraction brackets)

subset()

Let's say we only want to see the part of the data with an index value higher than 4000

highIndexValue <- subset(nasdaqData, Index.Value > 4000)

using square brackets []

highIndexValueBrackets <- nasdaqData[nasdaqData$Index.Value > 4000 , ]

The comma is needed because the brackets take 2 arguments: the rows and the columns

[row, col]

The condition that we're using tells R which rows to extract!

using square brackets [] to extract a column

To pick a column, your selection goes after the comma!

justACol <- nasdaqData[, "Low"]

You can use the column name with quotes or the number of the column.

Now you have a vector, justACol, that contains everything in the Low column of nasdaqData
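
To see that the two subsetting methods agree, here is a sketch on a made-up table (demoIndex and its numbers are invented, but the column names match the ones used above):

```r
demoIndex <- data.frame(Index.Value = c(3900, 4100, 4250),
                        Low = c(3880, 4050, 4200))

# Rows: both methods keep the rows where Index.Value is above 4000.
viaSubset   <- subset(demoIndex, Index.Value > 4000)
viaBrackets <- demoIndex[demoIndex$Index.Value > 4000, ]
all.equal(viaSubset, viaBrackets)

# Columns: select by name or by position, after the comma.
lowByName   <- demoIndex[, "Low"]
lowByNumber <- demoIndex[, 2]
identical(lowByName, lowByNumber)
```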

Basic statistics

use summary() to get summary statistics

summary(nasdaqData)
summary(highIndexValue)
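
On a single numeric vector, summary() returns named values that you can pick out by name. A small sketch with made-up numbers:

```r
demoValues <- c(2, 4, 4, 10)
demoSummary <- summary(demoValues)

names(demoSummary)       # "Min." "1st Qu." "Median" "Mean" "3rd Qu." "Max."
demoSummary["Mean"]      # 5
demoSummary["Median"]    # 4
```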

UFO data

http://blog.modeanalytics.com/five-public-dataset/

Which month is most popular for UFO sighting?

Load our UFO data!


Dates

ufoData$month <- as.Date(ufoData$month, format = "%Y-%m-%d")
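
A sketch of what the conversion does, on a made-up stand-in for ufoData. It assumes, as the code above does, a text column called month and a numeric column called sightings; the values are invented.

```r
demoUfo <- data.frame(month = c("2014-06-01", "2014-07-01"),
                      sightings = c(10, 12),
                      stringsAsFactors = FALSE)

class(demoUfo$month)           # "character"
demoUfo$month <- as.Date(demoUfo$month, format = "%Y-%m-%d")
class(demoUfo$month)           # "Date"

format(demoUfo$month, "%m")    # "06" "07"
```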

Basic statistics

use summary() to get summary statistics

summary(ufoData)

Plot the data using base package

plot(ufoData$month, ufoData$sightings)


Plot only 1900 to 2014

ufoData1900to2014 <- subset(ufoData, month >= as.Date("1900-01-01") & month <= as.Date("2014-12-31"))

plot(ufoData1900to2014$month, 
     ufoData1900to2014$sightings)
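
The date filter can be checked on a made-up table before running it on the real data. This sketch uses invented rows, two of which fall outside 1900 to 2014:

```r
demoUfo <- data.frame(
  month = as.Date(c("1899-12-01", "1950-06-01", "2014-07-01", "2015-01-01")),
  sightings = c(1, 3, 12, 9)
)

# Keep only rows dated from 1900-01-01 through 2014-12-31.
inRange <- subset(demoUfo,
                  month >= as.Date("1900-01-01") & month <= as.Date("2014-12-31"))

inRange$month   # only the 1950 and 2014 rows remain
```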


Zoo package

Install the zoo package so that we can plot the data as a line graph!


Plot using zoo package

Instead of a scatterplot, we want a line graph

library(zoo)
z <- zoo(ufoData$sightings, ufoData$month)
plot(z)


Which month is most popular for UFO sighting?

There are many ways to solve this problem.

Plan:

Aggregate sightings by month
Combine all the months together to form a data frame (i.e. table)
Use summary() to find the mean and median sightings for each month

Alternatively, we can use summary() for each month, then combine the summaries

Which month is most popular for UFO sighting?

The problem with the first plan is that different months have different numbers of sightings records.

cbind() does not handle vectors of different lengths cleanly: it recycles the shorter ones to fill the gaps.

There are ways around this by padding with NA, but let's just use the second method…
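
For the curious, a sketch of the NA workaround on two made-up vectors (a and b are invented; this is one possible way to pad, not the only one):

```r
a <- c(5, 7, 9)
b <- c(2, 4)

# cbind(a, b) as-is would recycle b to 2, 4, 2, which invents a value.
# Setting the length pads b with NA instead.
length(b) <- length(a)

padded <- cbind(a, b)
padded   # third row of column b is NA
```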

Subset data by month

Use what we know about:

summary()
subset()
square brackets

Put them all together!

Hint: format.Date(month, "%m") == "01" is the condition that tells the subset function to select the rows where the month is 01.

Subset data by month

Jan <- summary(subset(ufoData, format.Date(month, "%m") == "01" )[,"sightings"])
Feb <- summary(subset(ufoData, format.Date(month, "%m") == "02" )[,"sightings"])
Mar <- summary(subset(ufoData, format.Date(month, "%m") == "03" )[,"sightings"])
Apr <- summary(subset(ufoData, format.Date(month, "%m") == "04" )[,"sightings"])

… replicate each line for each month of the year

Or use a self-defined function to replace the long line of R nested functions

We can cover how to write functions in a different R course

summMonth <- function(mm){
  summary(subset(ufoData, format.Date(month, "%m") == mm )[,"sightings"])
}

Now we only replicate the call to the self-defined function

We still have to type each month name and its corresponding month number…

Jan <- summMonth("01")
Feb <- summMonth("02")
Mar <- summMonth("03")
Apr <- summMonth("04")
May <- summMonth("05")
Jun <- summMonth("06")
Jul <- summMonth("07")
Aug <- summMonth("08")
Sep <- summMonth("09")
Oct <- summMonth("10")
Nov <- summMonth("11")
Dec <- summMonth("12")
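
If typing twelve lines feels repetitive, sapply() can call the function once per month code. This is only a sketch on made-up data: toyUfo, summToyMonth and toyByMonth are invented stand-ins for ufoData, summMonth and the table we build next.

```r
# Made-up data: 24 consecutive months, so each calendar month appears twice.
toyUfo <- data.frame(
  month = seq(as.Date("2000-01-01"), by = "month", length.out = 24),
  sightings = 1:24
)

# Same idea as summMonth(), but on the toy table.
summToyMonth <- function(mm) {
  summary(toyUfo$sightings[format(toyUfo$month, "%m") == mm])
}

monthCodes <- sprintf("%02d", 1:12)          # "01", "02", ..., "12"
toyByMonth <- sapply(monthCodes, summToyMonth)
colnames(toyByMonth) <- month.abb            # built-in "Jan", "Feb", ..., "Dec"

toyByMonth["Mean", ]
```

The result has one column per month and one row per summary statistic, the same shape that cbind() gives for the twelve separate summaries.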

Which month is most popular for UFO sighting?

ufoDataByMonth <- cbind(Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec)

Check out ufoDataByMonth table!

Know the answer?!


We're only interested in the row called Mean.

maxMean <- max(ufoDataByMonth["Mean",])

Which month!!!?

colnames(ufoDataByMonth)[ufoDataByMonth["Mean",] == maxMean]

[1] "Jul"

maxMedian <- max(ufoDataByMonth["Median",])
colnames(ufoDataByMonth)[ufoDataByMonth["Median",] == maxMedian]

[1] "Jun"
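
A shorter alternative is which.max(), which returns the position of the largest value along with its name. A sketch on a made-up three-month table (the numbers are invented, not the UFO results):

```r
toyByMonth <- rbind(Median = c(Jan = 4, Feb = 9, Mar = 6),
                    Mean   = c(Jan = 5, Feb = 7, Mar = 8))

bestByMean   <- names(which.max(toyByMonth["Mean", ]))
bestByMedian <- names(which.max(toyByMonth["Median", ]))

bestByMean     # "Mar"
bestByMedian   # "Feb"
```

Note that which.max() reports only the first month when there is a tie, while the == comparison above returns every tied month.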

How was it?!


Has more light been shed on the mystery of R?

Want to learn more?

1) A foundation course

Understanding data types for data cleansing

2) Best of R

Intro to statistics and machine learning with R

3) Dataviz with R

Learn how to visualize your data with the following R packages: ggplot2 & shiny

Data visualisation and interactivity with Shiny

We will go through this tutorial together: http://shiny.rstudio.com/tutorial/lesson5/

or we can do something else!

I'm open to suggestions for the type of R class you're interested in! I'm also open to using any type of data set.