R Data Types

R Data Structures

Most of our work will be concerned with vectors and dataframes.

Specifying an individual element of a vector

First create a vector with c(), supply names and examine the vector.

vec <- c(6,7,8,9)
names(vec) <- c("a","b","c","d")
# Simple display
vec
## a b c d 
## 6 7 8 9
# Look at the properties.
str(vec)
##  Named num [1:4] 6 7 8 9
##  - attr(*, "names")= chr [1:4] "a" "b" "c" "d"

All of the elements of a vector must be of the same type. R will let you assign a value of any atomic type to an element of a vector, but it will coerce the whole vector to a common type to enforce this restriction.

vec["a"] = "six"
# Simple display
vec
##     a     b     c     d 
## "six"   "7"   "8"   "9"
# Look at the properties.
str(vec)
##  Named chr [1:4] "six" "7" "8" "9"
##  - attr(*, "names")= chr [1:4] "a" "b" "c" "d"

Return vec to being numeric.

# Let's try a simple fix. Note that we can use the positional index or the name to refer to an element of a vector.
vec[1] <- 6
# Simple display
vec
##   a   b   c   d 
## "6" "7" "8" "9"
# Look at the properties.
str(vec)
##  Named chr [1:4] "6" "7" "8" "9"
##  - attr(*, "names")= chr [1:4] "a" "b" "c" "d"
# Since these character strings are all numbers, we can manually coerce.
vec <- as.numeric(vec)
# Simple display
vec
## [1] 6 7 8 9
# Look at the properties.
str(vec)
##  num [1:4] 6 7 8 9

Note that as.numeric() returned a new vector without the names attribute, so when we replaced the entire vector with the result we lost the names.
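If we want to keep the names, one option is to copy them back onto the converted vector. The sketch below is only an illustration: the object names num and bad are made up for this example. It also shows what happens when a string is not a number.

```r
vec <- c(a = "6", b = "7", c = "8", d = "9")

# as.numeric() drops attributes, so copy the names back explicitly.
num <- setNames(as.numeric(vec), names(vec))
num
str(num)

# A string that is not a number becomes NA (R also issues a warning,
# which is suppressed here to keep the output short).
bad <- suppressWarnings(as.numeric(c("6", "seven")))
bad
```

The manual coercion above worked cleanly only because every string looked like a number.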

Now let’s see what happens when we replace a numeric element with a logical value.

vec[1] <- FALSE
# Simple display
vec
## [1] 0 7 8 9
# Look at the properties.
str(vec)
##  num [1:4] 0 7 8 9

Let’s try to make a logical vector.

vec <- as.logical(vec)
# Simple display
vec
## [1] FALSE  TRUE  TRUE  TRUE
# Look at the properties.
str(vec)
##  logi [1:4] FALSE TRUE TRUE TRUE
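The coercion between logical and numeric values is useful in practice: in arithmetic, TRUE counts as 1 and FALSE as 0. Here is a small sketch with made-up values (the name flags is only for illustration) showing how this lets us count the elements that meet a condition.

```r
x <- c(0, 7, 8, 9)
flags <- x > 7   # logical vector: FALSE FALSE TRUE TRUE

sum(flags)       # number of TRUE values
mean(flags)      # proportion of TRUE values
```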

Exercise

Create a vector with three elements of three different types. What is the class of your vector?

Specifying and using subvectors of a vector

If x is a vector, x[sub-vector specification] defines a sub-vector of x. The specification may be a vector of numbers indicating positions in x, a logical vector, or a vector of names if names have been assigned.

x <- 11:20
names(x) = c("a","b","c","d","e",
             "f","g","h","i","j")
x
##  a  b  c  d  e  f  g  h  i  j 
## 11 12 13 14 15 16 17 18 19 20
x[c(2,3,4)]
##  b  c  d 
## 12 13 14
x[c("f","b","e")]
##  f  b  e 
## 16 12 15
x[c(2,2,2)]
##  b  b  b 
## 12 12 12
x[7:9]
##  g  h  i 
## 17 18 19
x[-c(7:9)] # "-" means everything but 
##  a  b  c  d  e  f  j 
## 11 12 13 14 15 16 20
x[c(rep(TRUE,4),rep(FALSE,4),rep(TRUE,2))]
##  a  b  c  d  i  j 
## 11 12 13 14 19 20
x[c(TRUE,FALSE)] # Note the recycling
##  a  c  e  g  i 
## 11 13 15 17 19
x > 15 & x <= 18 # Create a logical vector
##     a     b     c     d     e     f     g     h     i     j 
## FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE
x[x > 15 & x <= 18] # Use it to specify a subvector
##  f  g  h 
## 16 17 18
x[] # No specification means "everything."
##  a  b  c  d  e  f  g  h  i  j 
## 11 12 13 14 15 16 17 18 19 20
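The three kinds of specification can be converted into one another. The following sketch is an illustration only (the name keep is made up). It uses which() to turn a logical vector into positions and %in% to build a logical vector from names.

```r
x <- 11:20
names(x) <- letters[1:10]

keep <- x > 15 & x <= 18   # logical vector, one value per element of x
which(keep)                # positions (with names) of the TRUE values
x[which(keep)]             # same result as x[keep]

# Build a logical specification from names; "z" matches nothing.
x[names(x) %in% c("b", "f", "z")]

# Any expression that yields a logical vector will do.
x[x %% 2 == 0]             # the even values
```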

Note that a subvector specification can be used in three ways: on its own to display the subvector, on the right side of an assignment to copy the subvector, or on the left side of an assignment to replace its values.

x <- 11:20
x[7:9]
## [1] 17 18 19
y <- x[7:9]
y
## [1] 17 18 19
x[7:9] <- 5
x
##  [1] 11 12 13 14 15 16  5  5  5 20
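A replacement does not have to use a single value. As a rough sketch (the name sel is made up for this example), we can supply one value per selected position, or use the same specification on both sides of the assignment to modify values in place.

```r
x <- 11:20
x[c(1, 10)] <- c(0, 99)   # one replacement value per selected position

sel <- x > 15 & x < 99    # logical specification
x[sel] <- x[sel] * 10     # subvector on both sides of the assignment
x
```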

Exercises

Create a numeric vector consisting of the integers between 81 and 90.

What is the class of your vector? Convert one of your values to a logical value. What is the class of your vector now?

Create a logical vector with three values. What is the class of your vector? Replace one of the values with a number. What is the class of your vector now?

Subsets of a dataframe

Most of the principles are the same as with vectors, but there are two dimensions rather than one. The specifications are separated by a comma: in general we have df[Row Spec,Col Spec]. The result is usually a new dataframe, although selecting a single column returns a vector by default. To select every row or every column, leave that specification empty but keep the comma, as in df[Row Spec,] or df[,Col Spec].

Recall the dataframe mtcars.

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
NewDF <- mtcars[3:7,c("cyl","mpg")]
NewDF
##                   cyl  mpg
## Datsun 710          4 22.8
## Hornet 4 Drive      6 21.4
## Hornet Sportabout   8 18.7
## Valiant             6 18.1
## Duster 360          8 14.3
str(NewDF)
## 'data.frame':    5 obs. of  2 variables:
##  $ cyl: num  4 6 8 6 8
##  $ mpg: num  22.8 21.4 18.7 18.1 14.3
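The sketch below illustrates the empty specifications and the single-column case using mtcars. The object names are made up for this example, and drop = FALSE is the argument of the dataframe subsetting operator that keeps the dataframe structure.

```r
# All columns for the first three rows: the column specification is empty.
first3 <- mtcars[1:3, ]
dim(first3)

# All rows for two columns: the row specification is empty.
two_cols <- mtcars[, c("mpg", "wt")]
dim(two_cols)

# Selecting a single column returns a plain vector by default...
mpg_vec <- mtcars[, "mpg"]
class(mpg_vec)

# ...unless drop = FALSE is supplied, which keeps the result a dataframe.
mpg_df <- mtcars[, "mpg", drop = FALSE]
class(mpg_df)
```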

Note that we can use specified subsets of a dataframe the same way we used specified subvectors.

mtcars[3:7,c("cyl","mpg")]
##                   cyl  mpg
## Datsun 710          4 22.8
## Hornet 4 Drive      6 21.4
## Hornet Sportabout   8 18.7
## Valiant             6 18.1
## Duster 360          8 14.3
NewDF <- mtcars[3:7,c("cyl","mpg")]
NewDF
##                   cyl  mpg
## Datsun 710          4 22.8
## Hornet 4 Drive      6 21.4
## Hornet Sportabout   8 18.7
## Valiant             6 18.1
## Duster 360          8 14.3
NewDF[1:2,"mpg"] <- 100
NewDF
##                   cyl   mpg
## Datsun 710          4 100.0
## Hornet 4 Drive      6 100.0
## Hornet Sportabout   8  18.7
## Valiant             6  18.1
## Duster 360          8  14.3
NewDF[NewDF$mpg > 50,"class"] <- "High MPG"
NewDF # Note that we added a new column and it has NA values where we supplied nothing.
##                   cyl   mpg    class
## Datsun 710          4 100.0 High MPG
## Hornet 4 Drive      6 100.0 High MPG
## Hornet Sportabout   8  18.7     <NA>
## Valiant             6  18.1     <NA>
## Duster 360          8  14.3     <NA>

Dealing with NA values

Let’s fix the NA values. Look at the problem rows first.

NewDF[is.na(NewDF$class),]
##                   cyl  mpg class
## Hornet Sportabout   8 18.7  <NA>
## Valiant             6 18.1  <NA>
## Duster 360          8 14.3  <NA>
# Replace the NA values
NewDF[is.na(NewDF$class),"class"] <- "Low MPG"
NewDF
##                   cyl   mpg    class
## Datsun 710          4 100.0 High MPG
## Hornet 4 Drive      6 100.0 High MPG
## Hornet Sportabout   8  18.7  Low MPG
## Valiant             6  18.1  Low MPG
## Duster 360          8  14.3  Low MPG
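As an alternative to filling the column in two steps, ifelse() can assign both labels at once, so no NA values are created along the way. This is just a sketch of one possible approach; it rebuilds NewDF with the same artificial values used above.

```r
NewDF <- mtcars[3:7, c("cyl", "mpg")]
NewDF[1:2, "mpg"] <- 100   # same artificial values as above

# ifelse(test, yes, no) picks a label for every row in one pass.
NewDF$class <- ifelse(NewDF$mpg > 50, "High MPG", "Low MPG")
NewDF
```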

Sometimes we need to replace suspicious data with NA values.

NewDF[NewDF$mpg > 50, "mpg"] <- NA
NewDF
##                   cyl  mpg    class
## Datsun 710          4   NA High MPG
## Hornet 4 Drive      6   NA High MPG
## Hornet Sportabout   8 18.7  Low MPG
## Valiant             6 18.1  Low MPG
## Duster 360          8 14.3  Low MPG

One problem is that arithmetic involving NA values generally returns NA.

mean(NewDF$mpg)
## [1] NA
# There is a solution
mean(NewDF$mpg,na.rm=TRUE)
## [1] 17.03333
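Before deciding how to handle missing values, it helps to know how many there are. The sketch below uses a small made-up dataframe (df is a hypothetical name) and relies on the fact that TRUE counts as 1 when summed.

```r
df <- data.frame(mpg = c(NA, NA, 18.7, 18.1, 14.3),
                 cyl = c(4, 6, 8, 6, 8))

is.na(df$mpg)               # TRUE where the value is missing
sum(is.na(df$mpg))          # how many values are missing in one column
colSums(is.na(df))          # missing values per column

# na.rm = TRUE is available in many summary functions, not only mean().
sum(df$mpg, na.rm = TRUE)
max(df$mpg, na.rm = TRUE)
```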

We can use the complete.cases function to find those rows which have no NA values.

GoodOnes <- complete.cases(NewDF)
GoodOnes
## [1] FALSE FALSE  TRUE  TRUE  TRUE
NewDF[GoodOnes,]
##                   cyl  mpg   class
## Hornet Sportabout   8 18.7 Low MPG
## Valiant             6 18.1 Low MPG
## Duster 360          8 14.3 Low MPG
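When the goal is simply to drop the incomplete rows, na.omit() does the selection in one step. Here is a minimal sketch on a made-up dataframe (df and clean are hypothetical names), compared with the complete.cases approach.

```r
df <- data.frame(cyl = c(4, 6, 8),
                 mpg = c(NA, 18.1, 14.3))

clean <- na.omit(df)           # drops every row that contains an NA
clean
nrow(clean)

# The same rows are selected by complete.cases().
same_rows <- df[complete.cases(df), ]
all(clean$mpg == same_rows$mpg)
```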

User-written functions

I want to mention the Quick-R website as a good resource on this and many other topics. http://www.statmethods.net/management/userfunctions.html

Recall that R doesn’t quite do what we want when we ask for the range: range() returns the minimum and the maximum rather than the difference between them. It’s easy to construct a function of our own which does what we want.

MyRange <- function(x){
   Result <- max(x) - min(x)
   return(Result)
}

# Here's an example
y <- rnorm(100)
MyRange(y)
## [1] 5.266417
# Verify
max(y) - min(y)
## [1] 5.266417
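MyRange returns NA if the vector contains missing values. One possible variation, sketched below under the hypothetical name MyRange2, adds an na.rm argument and builds on range() and diff() instead of max() and min().

```r
MyRange2 <- function(x, na.rm = FALSE) {
  # range() returns c(min, max); diff() subtracts the first from the second.
  diff(range(x, na.rm = na.rm))
}

y <- c(3, 10, NA, 7)
MyRange2(y)                 # NA, because of the missing value
MyRange2(y, na.rm = TRUE)   # 10 - 3 = 7
```

Defaulting na.rm to FALSE mirrors the behavior of mean() and similar functions, so missing values are never dropped silently.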

Exercise: The function summary gives us several useful descriptive statistics for a numerical vector, but not all that we would want. Write a function called MySummary that adds the standard deviation, interquartile range and range to the usual output. Enhance the names of the output vector to include the names of the added items.