CSC 530 Session 2 Notes

Harold Nelson

9/10/2018

Installing and Loading Packages

  1. Review packages tab and command line.

Exercise

Install the package mosaic using the Packages pane. A few other packages, including mosaicData, will also be installed.

Look at the documentation for mosaicData in the Packages pane. What is in the Alcohol data?

Try to examine this dataframe in RStudio.

Solution

# Before the package is loaded, str(Alcohol) fails because Alcohol cannot be found.
# Check the box to the left of the package to load it, or call library().
# Now you can see it.
library("mosaicData")
str(Alcohol)
## 'data.frame':    411 obs. of  4 variables:
##  $ X      : int  139 328 517 706 895 980 997 1012 1084 1273 ...
##  $ country: chr  "Russia" "Russia" "Russia" "Russia" ...
##  $ year   : int  1985 1986 1987 1988 1989 1990 1990 1990 1990 1991 ...
##  $ alcohol: num  13.3 10.8 11 11.6 12 ...
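
The same steps can be done from the command line instead of the Packages pane. What follows is a minimal sketch: the install step is commented out because it needs an internet connection, and the requireNamespace() guard and the name has_mosaic_data are additions for illustration, so the code does not stop with an error when mosaicData is absent.

```r
# One-time install from the console (equivalent to using the Packages pane).
# Commented out so that running this block does not trigger a download.
# install.packages("mosaic")

# requireNamespace() returns TRUE or FALSE instead of raising an error,
# which makes it a convenient check before loading.
has_mosaic_data <- requireNamespace("mosaicData", quietly = TRUE)

if (has_mosaic_data) {
  library(mosaicData)   # same effect as ticking the box in the Packages pane
  str(Alcohol)          # the data frame is now visible
} else {
  message("mosaicData is not installed; run install.packages(\"mosaic\") first.")
}
```

Installing is needed once per machine; loading with library() is needed once per R session.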

The assignment statement.

You may use either “=” or “<-”. The arrow emphasizes the direction in which information flows.

Exercise

Does a reversed arrow move information from left to right?

Solution

3 -> three
three
## [1] 3
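
The three assignment forms can be compared side by side. The sketch below also shows one place where “=” and “<-” are not interchangeable, namely inside a function call. The variable names a, b, d, s, vals and total are made up for this illustration.

```r
a <- 3      # left arrow: the value flows into a
b = 3       # equals sign: same effect at the top level
3 -> d      # right arrow: the value flows into d

identical(a, b)   # TRUE
identical(a, d)   # TRUE

# Inside a function call the two forms differ.
# Here = only names the argument x of sd(); no variable called x is
# created or changed.
s <- sd(x = c(2, 4, 6))

# Here <- creates the variable vals and then passes its value to sum().
total <- sum(vals <- c(1, 2, 3))
vals    # 1 2 3
total   # 6
```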

Exercise

  1. Creating an object in R does not automatically display it.

Note the difference in the following.

rnorm(10)
##  [1]  0.881707322 -0.471165743 -1.297096793  0.446784815 -0.826080628
##  [6]  0.141705507 -0.681989303  1.117945399  0.002295676  1.640320880
# The results appear, but are not available for future use
x = rnorm(10)
# The results do not appear but are available for future work.
mean(x)
## [1] 0.2067563
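
If you want both effects at once, that is, to store the result and see it, two common idioms are sketched below. The names z and z2 are only for illustration.

```r
# Wrapping an assignment in parentheses stores the value and prints it.
(z <- c(2, 4, 6))

# An explicit print() after the assignment does the same thing in two steps.
z2 <- c(2, 4, 6)
print(z2)

# Either way the object remains available for later work.
mean(z)   # 4
```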

Creating Vectors

Generally we use the c() function. A vector can’t hold more than one type of entry. Use class() to find out what type of entries a vector holds.

vec = c(1,2,3)
class(vec)
## [1] "numeric"
vecL = c(1L,2L,3L)
class(vecL)
## [1] "integer"
vec[3] = "Three"
vec
## [1] "1"     "2"     "Three"
class(vec)
## [1] "character"

Note that R forces (coerces) compliance with its one-type rule.
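
Coercion always moves toward the more general type: logical, then integer, then numeric, then character. The following sketch shows each step, and also what happens when you ask R to go back the other way. The name back is made up for this example.

```r
class(c(TRUE, 2L))      # "integer":   TRUE becomes 1L
class(c(2L, 3.5))       # "numeric":   2L becomes 2
class(c(3.5, "a"))      # "character": 3.5 becomes "3.5"

# Coercion in the other direction must be requested and can fail.
# Text that does not look like a number becomes NA (with a warning,
# suppressed here to keep the output clean).
back <- suppressWarnings(as.numeric(c("1", "2", "Three")))
back    # 1 2 NA
```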

The c() function is very flexible with its inputs.

u = c(6,7,8)
y = c(1:5,rnorm(4),u,9,10,rep(0,5))
y
##  [1]  1.0000000  2.0000000  3.0000000  4.0000000  5.0000000 -1.6299282
##  [7]  1.6770383  0.6159077  0.3791377  6.0000000  7.0000000  8.0000000
## [13]  9.0000000 10.0000000  0.0000000  0.0000000  0.0000000  0.0000000
## [19]  0.0000000
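
One way to see why c() is so flexible: it flattens whatever it is given into a single vector, so nesting makes no difference and the lengths simply add up. A short sketch, with made-up names nested, flat and parts:

```r
nested <- c(c(1, 2), c(3, c(4, 5)))
flat   <- c(1, 2, 3, 4, 5)
identical(nested, flat)   # TRUE

# The length of the result is the sum of the input lengths.
parts <- c(1:5, rep(0, 5), 9, 10)
length(parts)             # 5 + 5 + 1 + 1 = 12

# rep() can repeat a whole vector or each element.
rep(c(1, 2), times = 2)   # 1 2 1 2
rep(c(1, 2), each = 2)    # 1 1 2 2
```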

Exercise

Use the c() function to create a vector w containing the following in order.

  1. The contents of a numeric vector created by c(12,95,26).

  2. 23 values of 1.

  3. The values 16, -4 and 0.

  4. The mean value of the vector in item 1.

  5. The standard deviation of the vector in item 1.

  6. The median of the vector in item 1.

Solution

v1 = c(12,95,26)
w = c(v1,rep(1,23),16,-4,0,mean(v1),sd(v1),median(v1))
w
##  [1] 12.00000 95.00000 26.00000  1.00000  1.00000  1.00000  1.00000
##  [8]  1.00000  1.00000  1.00000  1.00000  1.00000  1.00000  1.00000
## [15]  1.00000  1.00000  1.00000  1.00000  1.00000  1.00000  1.00000
## [22]  1.00000  1.00000  1.00000  1.00000  1.00000 16.00000 -4.00000
## [29]  0.00000 44.33333 44.43347 26.00000

Creating and Using Logical Vectors

We can create logical vectors with the values TRUE and FALSE. Note that these are not character strings.

xlog = c(TRUE,TRUE,FALSE,FALSE,TRUE)
xlog
## [1]  TRUE  TRUE FALSE FALSE  TRUE

We can also use the abbreviations T and F to save typing.

ylog = c(T,T,F,F,T)
ylog
## [1]  TRUE  TRUE FALSE FALSE  TRUE
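
A caution about the abbreviations: TRUE and FALSE are reserved words, but T and F are ordinary variables that happen to start out holding TRUE and FALSE, so they can be overwritten. The sketch below demonstrates this and then restores the default. The names shadowed and restored are only for illustration.

```r
T <- 0                   # legal: T is an ordinary variable name
shadowed <- isTRUE(T)    # FALSE, because T now holds 0
rm(T)                    # remove our copy; T refers to TRUE again
restored <- isTRUE(T)    # TRUE

shadowed
restored

# TRUE itself cannot be reassigned; the line below would be an error.
# TRUE <- 0
```

For this reason many style guides prefer spelling out TRUE and FALSE in code that others will run.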

When we apply numerical functions to logical vectors, the TRUE and FALSE values are coerced to act like 1 and 0 respectively.

sum(xlog)
## [1] 3
mean(xlog)
## [1] 0.6

Two useful facts:

  1. The sum of a vector of logical expressions is the count of TRUE values.
  2. The mean of a vector of logical expressions is the proportion of cases in which the logical expression is TRUE.

Example

xn = 1:10
xn
##  [1]  1  2  3  4  5  6  7  8  9 10
y = xn >5
y
##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
sum(y)
## [1] 5
sum(xn>5)
## [1] 5
mean(xn>5)
## [1] 0.5
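
The same counting trick works with compound conditions built from & (and) and | (or). A brief sketch using the vector xn from above:

```r
xn <- 1:10
sum(xn > 3 & xn <= 8)    # 5: the values 4, 5, 6, 7, 8
sum(xn < 3 | xn > 9)     # 3: the values 1, 2, 10
mean(xn %% 2 == 0)       # 0.5: half of the values are even
```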

Exercise

Create a vector of 1000 observations drawn from a standard normal distribution.

What fraction of these observations is positive? Repeat. Do you get exactly the same answer?

Solution

x = rnorm(1000)
mean(x>0)
## [1] 0.514

You get different answers with each run, but they are all close to 0.5.
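
If you need the same answer on every run, fix the random seed first. The sketch below shows this and then repeats the experiment many times to see how far the fraction strays from one half. The seed value 530 and the names first, second and props are arbitrary choices for this illustration.

```r
set.seed(530)                      # arbitrary seed, chosen for this sketch
first <- mean(rnorm(1000) > 0)
set.seed(530)                      # same seed, so the same draws
second <- mean(rnorm(1000) > 0)
identical(first, second)           # TRUE

# Repeat the experiment 200 times to see how much the fraction varies.
props <- replicate(200, mean(rnorm(1000) > 0))
range(props)
```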

NA

NA means “not available.” NA values may exist in datasets you import. You may also set the value of a variable to NA. NaN (not a number) values arise from computations with no defined numeric result, such as 0/0.

x = c(1,2,3,NA)
x4 = x[4]
mean(x)
## [1] NA
# Most computations involving an NA value result in NA.
# Use "na.rm = T" to get numeric functions to skip the NA values.

mean(x,na.rm=T)
## [1] 2
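
NA and NaN are related but not the same. The following sketch shows how NaN values arise and how is.na() and is.nan() treat the two.

```r
0 / 0                              # NaN
suppressWarnings(sqrt(-1))         # NaN (R also issues a warning)

is.nan(0 / 0)    # TRUE
is.na(0 / 0)     # TRUE: is.na() treats NaN as missing too
is.nan(NA)       # FALSE: NA is missing, not the result of a computation
```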

Use of is.na()

is.na() tests each value of its argument and returns a logical vector that is TRUE wherever the value is NA.

is.na(x)
## [1] FALSE FALSE FALSE  TRUE

Exercise

How do you count the number of NA values in a vector? Try with our vector x.

Solution

sum(is.na(x))
## [1] 1
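
Because is.na() returns a logical vector, the other tools for logical vectors apply as well. A short sketch with the same x:

```r
x <- c(1, 2, 3, NA)
mean(is.na(x))      # 0.25: one value in four is missing
which(is.na(x))     # 4: the position of the missing value
x[!is.na(x)]        # 1 2 3: the non-missing values
```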

Exercise: Use of ==

Can we use logical equality (==) to test for NA values? Use our vector x.

Solution

navals = x==NA
navals
## [1] NA NA NA NA

Almost any calculation involving NA produces NA, and that includes comparisons with ==.

NA == NA
## [1] NA
# You might expect TRUE, but comparing with NA gives NA. Use is.na() instead.
is.na(NA)
## [1] TRUE
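
A few operations are exceptions, because their result is the same whatever the missing value turns out to be. The sketch below lists some of them, along with anyNA(), a quick test for whether a vector contains any NA at all.

```r
NA & FALSE    # FALSE: the answer is FALSE whatever the missing value is
NA | TRUE     # TRUE:  the answer is TRUE whatever the missing value is
NA & TRUE     # NA:    the answer depends on the missing value
NA^0          # 1:     anything raised to the power 0 is 1

x <- c(1, 2, 3, NA)
anyNA(x)      # TRUE
```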

Recycling

Normally, when we operate on a pair of vectors, they have the same length and the operation is applied element by element.

Example

x = 1:4
y = 6:9
x
## [1] 1 2 3 4
y
## [1] 6 7 8 9
x+y
## [1]  7  9 11 13
x*y
## [1]  6 14 24 36

What happens if x and y are of different lengths? See if you can infer the rule.

x = 1:4
x
## [1] 1 2 3 4
y = 6:7
y
## [1] 6 7
x+y
## [1]  7  9  9 11

What happens in this case?

x = 1:8
x
## [1] 1 2 3 4 5 6 7 8
y = 2:4
y
## [1] 2 3 4
x+y
## Warning in x + y: longer object length is not a multiple of shorter object
## length
## [1]  3  5  7  6  8 10  9 11

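The same rule applies: the shorter vector is reused from its start until it matches the length of the longer one. Here the lengths do not divide evenly, so R also issues a warning.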
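
Recycling can be made explicit with rep_len(), which stretches a vector to a given length by repeating it. The sketch below checks that doing so by hand gives the same result as letting R recycle. The names y_long, x8 and y3 are made up for this example.

```r
x <- 1:4
y <- 6:7
y_long <- rep_len(y, length(x))      # 6 7 6 7
identical(x + y, x + y_long)         # TRUE

# A single number is a vector of length one, so it is recycled as well.
x * 2                                # 2 4 6 8

# When the lengths do not divide evenly, the last cycle is cut short.
x8 <- 1:8
y3 <- 2:4
rep_len(y3, length(x8))              # 2 3 4 2 3 4 2 3
```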

Extracting

x = 1:4
y = c(T,T,F,F)
x[y]
## [1] 1 2

How do you explain this?

x=1:10
y= c(T,T,F)
x[y]
## [1]  1  2  4  5  7  8 10

Another example of the recycling rule.
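
The recycled index can be written out in full to confirm the explanation. The sketch below does that, and then shows the more common case, where the logical index comes from a comparison and already has the right length. The name y_long is only for illustration.

```r
x <- 1:10
y <- c(TRUE, TRUE, FALSE)
y_long <- rep_len(y, length(x))   # T T F T T F T T F T
identical(x[y], x[y_long])        # TRUE

# In practice the logical index usually comes from a comparison,
# which already has the same length as x.
x[x > 5]           # 6 7 8 9 10
x[x %% 2 == 0]     # 2 4 6 8 10
```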