IntroR

Statistics 4869/6620: Statistical Learning with R

Prof. Eric A. Suess

3/29/2017

Today

We will introduce R
data structures in R
functions in R
understanding data with R

Introduction to R: Data Structures

The main types of data structures in R

vectors - numeric or character or logical
factors - for nominal variables/features
lists - can mix numeric, character, and logical elements
data frames - list of equal-length vectors and/or factors, one per column
matrices - usually numeric, r rows by c columns, filled column by column
arrays - matrices stacked in layers, like sheets in MS Excel
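
A minimal sketch of the first and last items in the list above: class() reports the type of a vector, and array() stacks matrices into layers. The names num, chr, lgl, and A are made up for this illustration.

```r
# Atomic vectors hold a single type; class() reports which one
num <- c(34, 45, 56)
chr <- c("F", "M", "F")
lgl <- c(TRUE, FALSE, TRUE)
class(num)
class(chr)
class(lgl)

# An array adds a third dimension: here, two layers of 2-by-3 matrices
A <- array(1:12, dim = c(2, 3, 2))
A[, , 1]    # the first layer, filled column by column with 1 to 6
A[1, 2, 2]  # row 1, column 2 of the second layer
```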

Introduction to R: vectors

x = c(34,45,56)
y = c(178,132,99)
plot(x,y)

[Plot: scatterplot of y against x]

Introduction to R: factors

gender = factor(c("F", "M", "F"))
gender

[1] F M F
Levels: F M

Introduction to R: lists

subject1 = list(x = x[1], y = y[1], 
  gender = gender[1])
subject1

$x
[1] 34

$y
[1] 178

$gender
[1] F
Levels: F M
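
Elements of a list can be pulled out by name. A short sketch, rebuilding subject1 so the block runs on its own:

```r
# Rebuild the objects from the earlier slides
x <- c(34, 45, 56)
y <- c(178, 132, 99)
gender <- factor(c("F", "M", "F"))
subject1 <- list(x = x[1], y = y[1], gender = gender[1])

subject1$x          # extract one element by name
subject1[["y"]]     # double brackets also return the element itself
subject1["gender"]  # single brackets return a list of length one
str(subject1)       # compact overview of every element and its type
```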

Introduction to R: data frames

mydata = data.frame(x, y, gender)
mydata

   x   y gender
1 34 178      F
2 45 132      M
3 56  99      F

mydata$x

[1] 34 45 56

mydata$gender

[1] F M F
Levels: F M

Introduction to R: data frames

mydata = data.frame(x, y, gender)
mydata[1,]

   x   y gender
1 34 178      F

mydata[,c(2,3)]

    y gender
1 178      F
2 132      M
3  99      F

Introduction to R: matrices

X = matrix(c(x,y), ncol=2)
X

     [,1] [,2]
[1,]   34  178
[2,]   45  132
[3,]   56   99
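
By default matrix() fills column by column, which is why x lands in column 1 and y in column 2. A small sketch comparing this with byrow = TRUE and with cbind(); the names X_col, X_row, and X_bind are chosen only for this illustration.

```r
x <- c(34, 45, 56)
y <- c(178, 132, 99)

X_col <- matrix(c(x, y), ncol = 2)                # default: fill down the columns
X_row <- matrix(c(x, y), ncol = 2, byrow = TRUE)  # fill across the rows instead
X_bind <- cbind(x, y)                             # bind vectors as named columns

X_col
X_row
X_bind
dim(X_col)  # number of rows, number of columns
```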

Introduction to R: Managing data

Set the working directory.

getwd()

setwd("C:\\path\\to\\your\\data")   # placeholder path; on Windows write each backslash as \\

Introduction to R: Managing data

Reading and writing .csv files

usedcars <- read.csv("usedcars.csv", stringsAsFactors = FALSE)

write.csv(mydata, file = "mydata.csv")

In RStudio, try loading the data with

Environment > Import Dataset >

From Text File… or From Web URL…
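
One way to check the read/write pair is a round trip through a temporary file. This is only a sketch: csv_path and mydata2 are hypothetical names, and row.names = FALSE is an optional choice that keeps the row numbers out of the file.

```r
# Rebuild the small data frame from the earlier slides
x <- c(34, 45, 56)
y <- c(178, 132, 99)
gender <- factor(c("F", "M", "F"))
mydata <- data.frame(x, y, gender)

# Write to a temporary file, then read it back
csv_path <- tempfile(fileext = ".csv")
write.csv(mydata, file = csv_path, row.names = FALSE)
mydata2 <- read.csv(csv_path, stringsAsFactors = FALSE)

str(mydata2)  # gender comes back as character, not factor
```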

Introduction to R: Understanding data

When exploring quantitative/numeric variables we use

mean and median
standard deviation
5-number summary
box-plots
histograms
normal distributions?
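
The summaries in the list above might be computed as follows. This is a sketch on six prices only (the first six rows of the usedcars data), so the numbers will not match the full data set; the vector name price is made up for the illustration.

```r
# Six prices from the first rows of usedcars, typed in by hand
price <- c(21992, 20995, 19995, 17809, 17500, 17495)

mean(price)     # arithmetic mean
median(price)   # middle value
sd(price)       # sample standard deviation
fivenum(price)  # min, lower hinge, median, upper hinge, max

boxplot(price, main = "Boxplot of price", ylab = "Price")
hist(price, main = "Histogram of price", xlab = "Price")

# A normal Q-Q plot is one informal check of normality
qqnorm(price)
qqline(price)
```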

Introduction to R: Understanding data

usedcars <- read.csv("usedcars.csv", 
     stringsAsFactors = FALSE)
head(usedcars)

  year model price mileage  color transmission
1 2011   SEL 21992    7413 Yellow         AUTO
2 2011   SEL 20995   10926   Gray         AUTO
3 2011   SEL 19995    7351 Silver         AUTO
4 2011   SEL 17809   11613   Gray         AUTO
5 2012    SE 17500    8367  White         AUTO
6 2010   SEL 17495   25125 Silver         AUTO

Introduction to R: Understanding data

usedcars <- read.csv("usedcars.csv", 
     stringsAsFactors = FALSE)
summary(usedcars$price)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   3800   11000   13590   12960   14900   21990

Introduction to R: Understanding data

mean(usedcars$price)

[1] 12961.93

sd(usedcars$price)

[1] 3122.482

range(usedcars$price)

[1]  3800 21992

Introduction to R: Understanding data

When exploring qualitative/categorical variables we use

counts and percentages
tables
mode
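
A sketch of these summaries for one categorical variable, using the model values from the first six rows of usedcars. Note that the base R function mode() reports the storage type of an object, not the most frequent value, so the mode is taken from the table instead. The names counts, percents, and mode_value are made up for this illustration.

```r
# Model values from the first six rows of usedcars, typed in by hand
model <- c("SEL", "SEL", "SEL", "SEL", "SE", "SEL")

counts <- table(model)                          # counts per category
percents <- round(100 * prop.table(counts), 1)  # percentages per category
counts
percents

# Most frequent category (if there is a tie, which.max() returns the first)
mode_value <- names(counts)[which.max(counts)]
mode_value
```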

Introduction to R: Understanding data

When exploring the relationships between quantitative/numeric variables we use

correlation
scatterplots
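
A minimal sketch with cor() and plot(), again using only the first six rows of usedcars typed in by hand, so the result is illustrative and not the correlation for the full data.

```r
# Price and mileage from the first six rows of usedcars
price <- c(21992, 20995, 19995, 17809, 17500, 17495)
mileage <- c(7413, 10926, 7351, 11613, 8367, 25125)

r <- cor(price, mileage)  # Pearson correlation by default
r

plot(mileage, price,
     main = "Price versus mileage",
     xlab = "Mileage", ylab = "Price")
```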

Introduction to R: Understanding data

When exploring the relationships between qualitative/categorical variables we use

tables
Chi-Square
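
A sketch of a two-way table and a Chi-Square test of independence. The six usedcars rows shown earlier are too few for the test, so this example assumes the built-in mtcars data instead (am: 0 = automatic, 1 = manual; vs: 0 = V-shaped, 1 = straight engine).

```r
# Two-way table of transmission type by engine shape in mtcars
tab <- table(transmission = mtcars$am, engine = mtcars$vs)
tab

# Chi-Square test of independence
# (for a 2-by-2 table, chisq.test() applies a continuity correction by default)
result <- chisq.test(tab)
result
result$p.value
```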

Introduction to R: Understanding data

install.packages("gmodels")

library(gmodels)
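
Once gmodels is installed, its CrossTable() function prints a two-way table with row and column proportions. A hedged sketch on the first six rows of usedcars, typed in by hand; the fallback to table() is an assumption added so the block still runs when the package is missing.

```r
# Model and year from the first six rows of usedcars
model <- c("SEL", "SEL", "SEL", "SEL", "SE", "SEL")
year <- c(2011, 2011, 2011, 2011, 2012, 2010)

if (requireNamespace("gmodels", quietly = TRUE)) {
  # Cross-tabulation with counts and proportions
  gmodels::CrossTable(x = model, y = year)
} else {
  # Plain counts from base R if gmodels is not installed
  print(table(model, year))
}
```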