IntroR

Statistics 4869/6620: Statistical Learning with R

Prof. Eric A. Suess

3/29/2017

Today

  • We will introduce R
  • data structures in R
  • functions in R
  • understanding data with R

Introduction to R: Data Structures

The main types of data structures in R

  • vectors - numeric or character or logical
  • factors - for nominal variables/features
  • lists - numeric and/or character and/or logical
  • data frames - list of vectors and/or lists
  • matrices - numeric, r by c, fills columns
  • arrays - layers, like sheets in MS Excel
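
Arrays are the one structure in this list without a worked example in these slides, so here is a minimal sketch of the "layers" idea. The values 1:24 and the dimensions are arbitrary choices made for illustration only.

```r
# A 3 x 4 x 2 array: two layers, each a 3-by-4 matrix.
# Like a matrix, an array is filled column by column.
A <- array(1:24, dim = c(3, 4, 2))

dim(A)        # rows, columns, layers
A[, , 1]      # the first layer, a 3-by-4 matrix
A[2, 3, 2]    # row 2, column 3 of the second layer
```

Leaving an index blank, as in A[, , 1], keeps everything along that dimension.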

Introduction to R: vectors

x = c(34,45,56)
y = c(178,132,99)
plot(x,y)

[Figure: scatterplot of y against x]

Introduction to R: factors

gender = factor(c("F", "M", "F"))
gender
[1] F M F
Levels: F M
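
A short sketch of how one might inspect a factor, using only base R functions and the same gender values as above.

```r
gender <- factor(c("F", "M", "F"))

levels(gender)       # the distinct categories, in sorted order
nlevels(gender)      # how many categories there are
as.integer(gender)   # the integer codes R stores internally
table(gender)        # counts per level
```

The integer codes are why factors suit nominal variables: the labels are stored once and each observation only records which level it belongs to.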

Introduction to R: lists

subject1 = list(x = x[1], y = y[1], 
  gender = gender[1])
subject1
$x
[1] 34

$y
[1] 178

$gender
[1] F
Levels: F M
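
List elements can be pulled out by name or by position. The sketch below rebuilds subject1 with the same values so that it runs on its own.

```r
subject1 <- list(x = 34, y = 178,
                 gender = factor("F", levels = c("F", "M")))

subject1$x          # access by name with $
subject1[["y"]]     # access by name with [[ ]]
subject1[[3]]       # access by position
names(subject1)     # the element names
```

Single brackets, as in subject1["x"], would return a one-element list rather than the value itself.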

Introduction to R: data frames

mydata = data.frame(x, y, gender)
mydata
   x   y gender
1 34 178      F
2 45 132      M
3 56  99      F
mydata$x
[1] 34 45 56
mydata$gender
[1] F M F
Levels: F M

Introduction to R: data frames

mydata = data.frame(x, y, gender)
mydata[1,]
   x   y gender
1 34 178      F
mydata[,c(2,3)]
    y gender
1 178      F
2 132      M
3  99      F
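
Rows can also be selected with a logical condition instead of a row number. This is a small sketch on the same three-row data frame; the cutoff of 40 is an arbitrary value chosen for illustration.

```r
x <- c(34, 45, 56)
y <- c(178, 132, 99)
gender <- factor(c("F", "M", "F"))
mydata <- data.frame(x, y, gender)

str(mydata)                                  # column types at a glance

females <- mydata[mydata$gender == "F", ]    # rows where gender is F
females

mydata[mydata$x > 40, c("x", "y")]           # rows with x > 40, two columns
```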

Introduction to R: matrices

X = matrix(c(x,y), ncol=2)
X
     [,1] [,2]
[1,]   34  178
[2,]   45  132
[3,]   56   99
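
To see what "fills columns" means in practice, the sketch below compares the default fill with byrow = TRUE on the same x and y vectors. cbind() is shown as one alternative way to get the same layout with column names.

```r
x <- c(34, 45, 56)
y <- c(178, 132, 99)

X <- matrix(c(x, y), ncol = 2)                       # column fill: x, then y
X_byrow <- matrix(c(x, y), ncol = 2, byrow = TRUE)   # row fill: values go across
X_byrow

X2 <- cbind(x, y)   # same values as X, with column names x and y
dim(X)              # rows, columns
t(X)                # transpose: 2 rows, 3 columns
```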

Introduction to R: Managing data

Set the working directory.

getwd()

setwd("C:\\path\\to\\your\\data")  # replace with the folder that holds your data; on Windows use double backslashes (\\) or forward slashes (/)
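
A minimal sketch of changing the working directory and then restoring it. tempdir() is used here only as a stand-in for your own data folder, and the "data" folder name in the last line is hypothetical.

```r
old_dir <- getwd()       # remember where we started

data_dir <- tempdir()    # stand-in for your data folder (assumption)
setwd(data_dir)
getwd()                  # now points at the stand-in folder
list.files()             # files R can see without a full path

setwd(old_dir)           # go back to the original directory

# file.path() joins path pieces with forward slashes on every platform.
file.path("data", "usedcars.csv")
```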

Introduction to R: Managing data

Reading and writing .csv files

usedcars <- read.csv("usedcars.csv", stringsAsFactors = FALSE)

write.csv(mydata, file = "mydata.csv")

In RStudio, you can also load the data from the menu:

Environment > Import Dataset > From Text File… or From Web URL…
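
A round trip through a temporary file shows how write.csv() and read.csv() fit together. The temporary path and the row.names = FALSE choice are assumptions made for this sketch, and the small data frame reuses the values from the earlier slides.

```r
mydata <- data.frame(x = c(34, 45, 56),
                     y = c(178, 132, 99),
                     gender = c("F", "M", "F"))

csv_path <- tempfile(fileext = ".csv")   # temporary file, for illustration

# write.csv() takes the data frame object itself, not its name in quotes.
write.csv(mydata, file = csv_path, row.names = FALSE)

# Read it back; character columns stay character.
mydata2 <- read.csv(csv_path, stringsAsFactors = FALSE)
str(mydata2)
```

Without row.names = FALSE, the row numbers would be written as an extra unnamed first column.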

Introduction to R: Understanding data

When exploring quantitative/numeric variables we use

  • mean and median
  • standard deviation
  • 5-number summary
  • box-plots
  • histograms
  • normal distributions?

Introduction to R: Understanding data

usedcars <- read.csv("usedcars.csv", 
     stringsAsFactors = FALSE)
head(usedcars)
  year model price mileage  color transmission
1 2011   SEL 21992    7413 Yellow         AUTO
2 2011   SEL 20995   10926   Gray         AUTO
3 2011   SEL 19995    7351 Silver         AUTO
4 2011   SEL 17809   11613   Gray         AUTO
5 2012    SE 17500    8367  White         AUTO
6 2010   SEL 17495   25125 Silver         AUTO

Introduction to R: Understanding data

usedcars <- read.csv("usedcars.csv", 
     stringsAsFactors = FALSE)
summary(usedcars$price)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   3800   11000   13590   12960   14900   21990 

Introduction to R: Understanding data

mean(usedcars$price)
[1] 12961.93
sd(usedcars$price)
[1] 3122.482
range(usedcars$price)
[1]  3800 21992
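
The 5-number summary, box-plot, and histogram can be sketched the same way. To keep the example self-contained, it uses only the six prices shown in the head(usedcars) output, so the numbers differ from those for the full data set.

```r
# Prices from the first six rows of usedcars (see head() output).
prices <- c(21992, 20995, 19995, 17809, 17500, 17495)

median(prices)     # 18902
fivenum(prices)    # min, lower hinge, median, upper hinge, max
quantile(prices)   # quartiles, computed slightly differently from fivenum()
IQR(prices)        # interquartile range

boxplot(prices, horizontal = TRUE, main = "Price (first six cars)")
hist(prices, main = "Price (first six cars)", xlab = "Price")
```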

Introduction to R: Understanding data

When exploring qualitative/categorical variables we use

  • counts and percentages
  • tables
  • mode
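
Counts, percentages, and the mode can all be read off a table. The sketch below uses the six color values from the head(usedcars) output. Base R has no built-in function for the statistical mode, so the last line is one possible way to find it.

```r
# Colors from the first six rows of usedcars (see head() output).
color <- c("Yellow", "Gray", "Silver", "Gray", "White", "Silver")

tab <- table(color)              # counts per category
tab

pct <- round(100 * prop.table(tab), 1)   # percentages
pct

# Mode: the most frequent category, keeping ties.
modes <- names(tab)[tab == max(tab)]
modes
```

Note that mode() in R reports the storage type of an object, not the most common value.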

Introduction to R: Understanding data

When exploring the relationships between quantitative/numeric variables we use

  • correlation
  • scatterplots
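
A minimal sketch of both tools, using price and mileage from the six rows shown in the head(usedcars) output. With so few points the correlation is only illustrative and will not match the value for the full data set.

```r
# Price and mileage from the first six rows of usedcars.
price   <- c(21992, 20995, 19995, 17809, 17500, 17495)
mileage <- c(7413, 10926, 7351, 11613, 8367, 25125)

r <- cor(mileage, price)   # Pearson correlation, between -1 and 1
r

plot(mileage, price,
     main = "Price vs. mileage (first six cars)",
     xlab = "Mileage", ylab = "Price")
```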

Introduction to R: Understanding data

When exploring the relationships between qualitative/categorical variables we use

  • tables
  • Chi-Square
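
A two-way table and a Chi-Square test of independence can be sketched in base R as follows. The 100 observations are hypothetical, made up only so the example runs; they are not taken from usedcars.

```r
# Hypothetical data: 50 AUTO and 50 MANUAL cars, each Dark or Light.
transmission <- rep(c("AUTO", "MANUAL"), times = c(50, 50))
color <- c(rep(c("Dark", "Light"), times = c(30, 20)),   # AUTO cars
           rep(c("Dark", "Light"), times = c(10, 40)))   # MANUAL cars

tab <- table(transmission, color)   # two-way table of counts
tab

res <- chisq.test(tab)   # 2x2 tables get a continuity correction by default
res
res$p.value              # small values suggest the variables are related
```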

Introduction to R: Understanding data

install.packages("gmodels")

library(gmodels)
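
Once gmodels is installed, its CrossTable() function prints a two-way table with row and column proportions. The sketch below is guarded so it still runs when the package is missing, and the data are hypothetical, made up only for illustration.

```r
# Hypothetical data, not taken from usedcars.
transmission <- rep(c("AUTO", "MANUAL"), times = c(50, 50))
color <- c(rep(c("Dark", "Light"), times = c(30, 20)),
           rep(c("Dark", "Light"), times = c(10, 40)))

if (requireNamespace("gmodels", quietly = TRUE)) {
  # chisq = TRUE adds a Chi-Square test below the table.
  gmodels::CrossTable(x = transmission, y = color, chisq = TRUE)
} else {
  message("gmodels is not installed; run install.packages(\"gmodels\") first")
}
```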