Lynna Jirpongopas
Mon Apr 27 19:52:22 2015
"R is a free software programming language and software environment for statistical computing and graphics."
-Wikipedia
Getting RStudio: http://www.rstudio.com/products/rstudio/download/
Customizing your GUI
Alternative: RGui from http://www.r-project.org/
Case sensitive
For comments use #
; at the end of a line is optional, not required
Use <- for assignment instead of =
" " are required for strings
: is used to generate a sequence of numbers
? before a function name to get help with a function
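A quick sketch of the rules above in action. All variable names and values here are made up for illustration.

```r
# Names are case sensitive: myValue and myvalue are two different objects
myValue <- 10
myvalue <- 20

greeting <- "hello"   # strings need quotes
oneToFive <- 1:5      # the colon generates the sequence 1 2 3 4 5

x <- 1; y <- 2        # a semicolon is only needed to put two statements on one line

# At the console, typing ?mean opens the help page for mean()
```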
Hotkeys:
A very handy hotkey:
There are several methods:
1) use RStudio GUI to set working directory
2) use setwd()
3) give read.csv() the full path to your csv file
nasdaqData <- read.csv("~/Github/gdi-r/class1/datasets/NASDAQOMX-NDX.csv")
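Methods 2 and 3 should give you the same data. Here is a small sketch that runs anywhere: it writes a made-up file called example.csv into a temporary folder (a stand-in for your own datasets folder) and reads it back both ways.

```r
# Hypothetical stand-in for a datasets folder, so the sketch runs on any machine
dataDir <- tempdir()
write.csv(data.frame(Date = c("2015-04-24", "2015-04-27"), Value = c(1, 2)),
          file.path(dataDir, "example.csv"), row.names = FALSE)

# Method 2: change the working directory, then use the bare file name
oldDir <- setwd(dataDir)   # setwd() returns the previous working directory
fromWd <- read.csv("example.csv")
setwd(oldDir)              # put the working directory back

# Method 3: give read.csv() the full path
fromPath <- read.csv(file.path(dataDir, "example.csv"))

identical(fromWd, fromPath)
```

getwd() shows the current working directory if you are not sure where R is looking.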
There are a bunch of arguments (i.e., options)! Here is my advice for best practice:
setwd("~/Github/gdi-r/class1/datasets")
opecData <- read.csv("OPEC-ORB.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE) # loading a csv file
opecData <- read.csv("OPEC-ORB.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE) # loading a tab-delimited txt file
PS. Data sets were taken from quandl.com
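To see what stringsAsFactors actually changes, here is a small sketch. It reads made-up csv text through the text argument of read.csv() instead of a real file, so the values are illustrative only.

```r
# Made-up csv content, used in place of a real file
csvText <- "Date,Value\n2015-04-24,61.5\n2015-04-27,62.1"

asCharacter <- read.csv(text = csvText, header = TRUE, sep = ",",
                        stringsAsFactors = FALSE)
asFactor <- read.csv(text = csvText, header = TRUE, sep = ",",
                     stringsAsFactors = TRUE)

class(asCharacter$Date)   # "character"
class(asFactor$Date)      # "factor"
```

In R 4.0.0 and later the default is already stringsAsFactors = FALSE, but spelling it out keeps the code behaving the same on older versions.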
2 ways to find out what kind of table you have
(matrix, list, data frame, etc.)
Method 1:
str(nasdaqData)
Method 2: Go to Environment tab
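A few more functions answer the same question from the console. This sketch uses a tiny made-up data frame called toy rather than the real nasdaqData.

```r
# Tiny made-up table for illustration
toy <- data.frame(Trade.Date = c("2015-04-24", "2015-04-27"),
                  Index.Value = c(4500, 4510),
                  stringsAsFactors = FALSE)

class(toy)           # "data.frame"
is.data.frame(toy)   # TRUE
str(toy)             # prints the type of each column and the first few values
```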
Let's say we only want to see the part of the data with an index value higher than 4000
highIndexValue <- subset(nasdaqData, Index.Value > 4000)
highIndexValueBrackets <- nasdaqData[nasdaqData$Index.Value > 4000 , ]
The comma is needed because brackets take 2 arguments: the row and the column
[row, col]
The criteria before the comma tell R which rows to extract!
To extract a column, your criteria go after the comma!
justACol <- nasdaqData[, "Low"]
You can use the column name with quotes or the number of the column.
Now you have a vector, justACol, that contains everything in the Low column of nasdaqData
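Here is a small sketch comparing the two ways of picking rows, and the two ways of picking a column. The data frame toy and its numbers are made up; only the column names mirror the ones used above.

```r
# Made-up table with the same column names as in the examples above
toy <- data.frame(Index.Value = c(3900, 4100, 4200),
                  Low = c(3890, 4050, 4150))

# Rows: the criteria go before the comma
bySubset <- subset(toy, Index.Value > 4000)
byBrackets <- toy[toy$Index.Value > 4000, ]
all.equal(bySubset, byBrackets)

# Columns: the name (in quotes) or the number goes after the comma
byName <- toy[, "Low"]
byNumber <- toy[, 2]
identical(byName, byNumber)
```

One difference to be aware of: if the column being tested contains NA, subset() drops those rows, while the bracket version keeps them as rows full of NA.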
Use summary() to get summary statistics
summary(nasdaqData)
summary(highIndexValue)
http://blog.modeanalytics.com/five-public-dataset/
ufoData$month <- as.Date(ufoData$month, format = "%Y-%m-%d")
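Why bother converting to Date? A small sketch with made-up date strings shows what the conversion buys you. The format codes %Y, %m and %d stand for 4-digit year, 2-digit month and 2-digit day.

```r
# Made-up date strings for illustration
rawMonth <- c("1990-01-01", "1990-02-01")
parsed <- as.Date(rawMonth, format = "%Y-%m-%d")

class(parsed)           # "Date"
format(parsed, "%m")    # "01" "02"
daysApart <- as.numeric(parsed[2] - parsed[1])   # dates can be subtracted

# A format that does not match the text gives NA instead of an error
mismatch <- as.Date("01/02/1990", format = "%Y-%m-%d")
is.na(mismatch)         # TRUE
```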
Use summary() to get summary statistics
summary(ufoData)
plot(ufoData$month, ufoData$sightings)
ufoData1900to2014 <- subset(ufoData, month < "2014-12-31" & month > "1900-01-01")
plot(ufoData1900to2014$month,
ufoData1900to2014$sightings)
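The date filter can also be written with as.Date() on both boundaries, which makes it explicit that dates are being compared with dates. A minimal sketch on a made-up table called ufoToy:

```r
# Made-up sightings table for illustration
ufoToy <- data.frame(
  month = as.Date(c("1850-06-01", "1950-06-01", "2015-01-01")),
  sightings = c(1, 20, 300)
)

inRange <- subset(ufoToy,
                  month > as.Date("1900-01-01") & month < as.Date("2014-12-31"))
nrow(inRange)   # 1, only the 1950 row falls inside the range
```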
Install the zoo package so that we can plot the data as a line graph!
Instead of a scatterplot, you want a line graph
# install.packages("zoo") # run once if zoo is not installed yet
library(zoo)
z <- zoo(ufoData$sightings, ufoData$month)
plot(z)
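If you would rather not install a package, base R can also draw a line graph with type = "l". This is a different technique from zoo, sketched here on a made-up table called ufoToy. The rows are sorted by date first, since the line connects points in row order.

```r
# Made-up sightings table, deliberately out of order
ufoToy <- data.frame(
  month = as.Date(c("2000-03-01", "2000-01-01", "2000-02-01")),
  sightings = c(30, 10, 20)
)

# Sort by date so the line runs left to right
sorted <- ufoToy[order(ufoToy$month), ]

# type = "l" draws lines instead of points
plot(sorted$month, sorted$sightings, type = "l",
     xlab = "month", ylab = "sightings")
```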
There are many ways to solve this problem.
Plan:
OR
The problem with the first plan is that different months have different numbers of sightings.
Can't use cbind with vectors that have different lengths.
There are ways around this by filling in NA, but let's just use the second method…
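For the curious, here is one possible way to do the NA padding. The vectors jan and feb hold made-up numbers. Without padding, cbind() recycles the shorter vector and gives a warning, which is not what we want.

```r
# Made-up sightings with different lengths
jan <- c(5, 8, 13)
feb <- c(4, 9)

# Assigning a longer length pads the end of the vector with NA
n <- max(length(jan), length(feb))
length(jan) <- n
length(feb) <- n

padded <- cbind(jan, feb)
padded   # the third row of feb is NA
```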
Use what we know about:
Put them all together!
Hint: format.Date(month, "%m") == "01" is the criterion that tells the subset function to select rows where the month is 01.
Jan <- summary(subset(ufoData, format.Date(month, "%m") == "01" )[,"sightings"])
Feb <- summary(subset(ufoData, format.Date(month, "%m") == "02" )[,"sightings"])
Mar <- summary(subset(ufoData, format.Date(month, "%m") == "03" )[,"sightings"])
Apr <- summary(subset(ufoData, format.Date(month, "%m") == "04" )[,"sightings"])
… replicate each line for each month of the year
We can cover how to write functions in a different R course
# Summary statistics of sightings for one month, given as a two-digit string such as "01"
summMonth <- function(mm){
  summary(subset(ufoData, format.Date(month, "%m") == mm )[,"sightings"])
}
We have to input month name and corresponding month number…
Jan <- summMonth("01")
Feb <- summMonth("02")
Mar <- summMonth("03")
Apr <- summMonth("04")
May <- summMonth("05")
Jun <- summMonth("06")
Jul <- summMonth("07")
Aug <- summMonth("08")
Sep <- summMonth("09")
Oct <- summMonth("10")
Nov <- summMonth("11")
Dec <- summMonth("12")
ufoDataByMonth <- cbind(Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec)
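The twelve assignments can also be generated in one go with sapply(). This is only a sketch of an alternative, not a replacement for the steps above: ufoToy and summMonthToy are made-up stand-ins for ufoData and summMonth, and month.abb is R's built-in vector of English month abbreviations.

```r
# Made-up data: two years of monthly sightings, 1 to 24
ufoToy <- data.frame(
  month = seq(as.Date("2000-01-01"), by = "month", length.out = 24),
  sightings = 1:24
)

# Same idea as summMonth, but working on the toy table
summMonthToy <- function(mm) {
  summary(ufoToy[format(ufoToy$month, "%m") == mm, "sightings"])
}

monthNumbers <- sprintf("%02d", 1:12)          # "01", "02", ..., "12"
byMonth <- sapply(monthNumbers, summMonthToy)  # one column per month
colnames(byMonth) <- month.abb                 # "Jan", "Feb", ..., "Dec"

byMonth["Mean", ]
```

Each column of byMonth holds the summary for one month, so the table has the same shape as the one built with cbind().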
Check out ufoDataByMonth table!
We're only interested in the row called Mean.
maxMean <- max(ufoDataByMonth["Mean",])
colnames(ufoDataByMonth)[ufoDataByMonth["Mean",] == maxMean]
[1] "Jul"
maxMedian <- max(ufoDataByMonth["Median",])
colnames(ufoDataByMonth)[ufoDataByMonth["Median",] == maxMedian]
[1] "Jun"
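A shorter route to the same kind of answer is which.max(), which returns the position (and name) of the largest value. The table toyTable below is made up for illustration and is not the real result.

```r
# Made-up summary table with months as columns
toyTable <- rbind(Mean = c(Jan = 10, Feb = 12, Mar = 9),
                  Median = c(8, 7, 9))

topMean <- names(which.max(toyTable["Mean", ]))
topMedian <- names(which.max(toyTable["Median", ]))

topMean     # "Feb"
topMedian   # "Mar"
```

If two months tie for the maximum, which.max() returns only the first one, while the == comparison returns all of them.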
Want more light shed on the mystery of R?
1) A foundation course
Understanding data types for data cleansing
2) Best of R
Intro to statistics and machine learning with R
3) Dataviz with R
Learn how to visualize your data with the following R packages: ggplot2 & shiny
We will go through this tutorial together: http://shiny.rstudio.com/tutorial/lesson5/
or we can do something else!
I'm open to suggestions for the type of R class you're interested in! I'm also open to using any type of data set.