Cargill R Training Day 1

Section 1: Exploring a Data Set

mpg is a dataset that comes with the ggplot2 package. Our goal is to explore the structure of this dataset and see what we can learn from it. This exploration can be done with Spotfire or by using R directly, which is the approach we take here.

First we examine the dimensions and structure of the dataset.

library('ggplot2') # needed for mpg dataset and later plotting
dim(mpg)

## [1] 234  11

str(mpg)

## 'data.frame':    234 obs. of  11 variables:
##  $ manufacturer: Factor w/ 15 levels "audi","chevrolet",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ model       : Factor w/ 38 levels "4runner 4wd",..: 2 2 2 2 2 2 2 3 3 3 ...
##  $ displ       : num  1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int  1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int  4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : Factor w/ 10 levels "auto(av)","auto(l3)",..: 4 9 10 1 4 9 1 9 4 10 ...
##  $ drv         : Factor w/ 3 levels "4","f","r": 2 2 2 2 2 2 2 1 1 1 ...
##  $ cty         : int  18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int  29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : Factor w/ 5 levels "c","d","e","p",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ class       : Factor w/ 7 levels "2seater","compact",..: 2 2 2 2 2 2 2 2 2 2 ...

The qplot() function from the ggplot2 package produces static plots. You can use it to examine the variables in mpg, for example:

qplot(hwy, data = mpg) + theme_bw()

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

qplot(displ, hwy, data = mpg) + theme_bw()

qplot(displ, hwy, data = mpg, colour = factor(cyl)) + theme_bw()

qplot(displ, hwy, data = mpg, colour = drv) + theme_bw()

qplot(displ, hwy, data = mpg, colour = factor(cyl), size = cty) + theme_bw()

qplot(displ, hwy, data = mpg, colour = factor(cyl), size = cty) +
  stat_smooth(method = "lm") + theme_bw()

?qplot

Section 2: Your Turn

Explore the economics data set. To see its description, type ?economics. What are some of the interesting relationships among the numerical variables? Can you see any trends over time?

Section 3: Simple Simulation and Reading and Writing Data

The first thing you will probably want to do is to read in data. Here we will simulate some data, write it out, and then read it back in.

# simulate some data
X <- matrix(rnorm(25), nrow = 5)
# ?matrix
# ?rnorm
X

##         [,1]    [,2]     [,3]    [,4]      [,5]
## [1,] -0.2126 -0.3187 -0.04648 -0.2172 -0.553741
## [2,]  0.1559  0.6345  1.80849  1.4361  1.555234
## [3,]  0.8712  0.8077  0.08187  0.6804 -0.006837
## [4,]  0.8579 -0.2901  0.08815 -0.1529  0.261421
## [5,]  1.3025  0.4982  0.03750  0.0970 -0.189652
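
Because rnorm() draws random numbers, the values printed above will differ every time the code is run. A minimal sketch of how to make a simulation repeatable, assuming you want the same numbers on each run (the seed value 42 and the names X1 and X2 are arbitrary choices for illustration):

```r
# Fixing the seed makes the "random" draws repeatable.
set.seed(42)                      # 42 is an arbitrary choice
X1 <- matrix(rnorm(25), nrow = 5)

set.seed(42)                      # reset to the same seed
X2 <- matrix(rnorm(25), nrow = 5)

identical(X1, X2)                 # TRUE: same seed, same numbers
```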

pairs(X, panel = panel.smooth)

heatmap(X) # ?heatmap

gender <- sample(c("M", "F"), 5, replace = TRUE) # ?sample
gender

## [1] "M" "M" "M" "F" "M"

X <- data.frame(X, gender)
colnames(X)[1:5] <- LETTERS[1:5]
class(X)

## [1] "data.frame"

str(X)

## 'data.frame':    5 obs. of  6 variables:
##  $ A     : num  -0.213 0.156 0.871 0.858 1.303
##  $ B     : num  -0.319 0.635 0.808 -0.29 0.498
##  $ C     : num  -0.0465 1.8085 0.0819 0.0882 0.0375
##  $ D     : num  -0.217 1.436 0.68 -0.153 0.097
##  $ E     : num  -0.55374 1.55523 -0.00684 0.26142 -0.18965
##  $ gender: Factor w/ 2 levels "F","M": 2 2 2 1 2
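
The Factor type shown for gender reflects the R version this material was written with. As far as I know, since R 4.0.0 data.frame() no longer converts strings to factors by default, so on a newer installation gender may show up as chr instead. A small sketch of asking for either behaviour explicitly (the names g, df_fac and df_chr are made up for illustration):

```r
g <- c("M", "F", "M")

# Ask for a factor explicitly; behaves the same on old and new R versions
df_fac <- data.frame(g = g, stringsAsFactors = TRUE)
class(df_fac$g)    # "factor"

# Keep plain character strings
df_chr <- data.frame(g = g, stringsAsFactors = FALSE)
class(df_chr$g)    # "character"
```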

# write out CSV file
write.csv(X, "rand.csv", row.names = FALSE)

# to read the CSV file back in, use read.csv
read.csv("rand.csv", stringsAsFactors = FALSE)

##         A       B        C       D         E gender
## 1 -0.2126 -0.3187 -0.04648 -0.2172 -0.553741      M
## 2  0.1559  0.6345  1.80849  1.4361  1.555234      M
## 3  0.8712  0.8077  0.08187  0.6804 -0.006837      M
## 4  0.8579 -0.2901  0.08815 -0.1529  0.261421      F
## 5  1.3025  0.4982  0.03750  0.0970 -0.189652      M
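
One way to convince yourself that nothing was lost is to compare what was written with what was read back. The sketch below is only an illustration: it builds its own small data frame, writes to a temporary file (via tempfile()) rather than rand.csv, and uses all.equal() because a trip through text can lose the last few bits of a floating point number.

```r
set.seed(1)
X <- data.frame(A = rnorm(5), B = rnorm(5),
                gender = sample(c("M", "F"), 5, replace = TRUE))

path <- tempfile(fileext = ".csv")   # temporary file, so nothing is overwritten
write.csv(X, path, row.names = FALSE)

Y <- read.csv(path, stringsAsFactors = FALSE)

# Compare numbers with a tolerance, strings exactly
all.equal(X$A, Y$A)
identical(as.character(X$gender), Y$gender)

unlink(path)                         # clean up the temporary file
```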

Section 4: Your Turn

Take a look at the iris data set. Do you see any clustering patterns? Try using the heatmap() function to find the clusters.

Section 5: Quick Demo: Finding Clusters using Spotfire

qplot(Sepal.Length, Sepal.Width, colour = Species, data = iris) + theme_bw()

Section 6: More on Data Types

Most R functions work with vectors and matrices (or other rectangular data structures). For example:

x <- 10
y <- 5

x * y

## [1] 50

x <- 1:10
y <- 11:20
x; y

##  [1]  1  2  3  4  5  6  7  8  9 10

##  [1] 11 12 13 14 15 16 17 18 19 20

# what do you expect from the following? 
x * y

##  [1]  11  24  39  56  75  96 119 144 171 200

# %*% is a matrix multiplication operator
# what do you expect from the following?
x %*% y

##      [,1]
## [1,]  935

if (sum(x * y) == x %*% y) {
  cat("Sum of the pairwise multiplication equals to dot product")
} else {
  cat("Something is seriously wrong with the universe")
}

## Sum of the pairwise multiplication equals to dot product

# now x and y are different lengths.
# what do you expect from the following?
y <- 1:5
x ; y

##  [1]  1  2  3  4  5  6  7  8  9 10

## [1] 1 2 3 4 5

# notice the recycling "feature" of R
x * y

##  [1]  1  4  9 16 25  6 14 24 36 50
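
Recycling is silent only when the longer length is a multiple of the shorter one. The sketch below shows the other case; tryCatch() is used here just to capture the warning text in a variable (res is a name made up for this example) instead of letting it print.

```r
x <- 1:10

# Length 5 divides 10 evenly: the shorter vector is silently repeated twice
x * 1:5

# Length 3 does not divide 10: R still recycles, but issues a warning
res <- tryCatch(x * 1:3, warning = function(w) conditionMessage(w))
res   # the text of the warning
```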

# this obviously should fail as a matrix operation
# x %*% y
## Error in x %*% y : non-conformable arguments

# matrix inverse is available through solve()
# which links to the LAPACK routines DGESV and ZGESV
m <- matrix(sample(25), nrow = 5)
m

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    5   23   10   17   14
## [2,]   19   25    2   22    7
## [3,]   21    1   15   20    3
## [4,]   24   12   13   11   16
## [5,]    9    8    4    6   18

t(m) # transpose

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    5   19   21   24    9
## [2,]   23   25    1   12    8
## [3,]   10    2   15   13    4
## [4,]   17   22   20   11    6
## [5,]   14    7    3   16   18

solve(m)

##           [,1]     [,2]      [,3]     [,4]     [,5]
## [1,] -0.044322  0.02371 -0.009286  0.04531 -0.01347
## [2,]  0.029042  0.01215 -0.045558  0.04739 -0.06184
## [3,]  0.059292 -0.05843  0.004562  0.05846 -0.07612
## [4,]  0.001269  0.01996  0.057921 -0.08977  0.06139
## [5,] -0.004345 -0.01092  0.004570 -0.02678  0.08623

# was this really an inverse?
# how would you check?

m == solve(solve(m))

##       [,1]  [,2]  [,3]  [,4]  [,5]
## [1,] FALSE FALSE FALSE  TRUE  TRUE
## [2,]  TRUE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE
## [4,]  TRUE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE

m - solve(solve(m))

##            [,1]       [,2]       [,3]       [,4]       [,5]
## [1,] -5.329e-15  3.553e-15 -1.776e-15  0.000e+00  0.000e+00
## [2,]  0.000e+00  3.553e-15  2.442e-15 -3.553e-15  5.329e-15
## [3,] -1.066e-14 -1.177e-14 -3.553e-15 -1.066e-14 -5.773e-15
## [4,]  0.000e+00 -3.553e-15  3.553e-15 -3.553e-15  3.553e-15
## [5,]  3.553e-15  8.882e-16  2.665e-15  8.882e-16  7.105e-15

m %*% solve(m) # should be identity

##            [,1]       [,2]       [,3]       [,4]      [,5]
## [1,]  1.000e+00 -2.776e-17 -1.943e-16  2.220e-16 -2.22e-16
## [2,]  1.041e-16  1.000e+00 -1.527e-16  4.441e-16  0.00e+00
## [3,] -1.492e-16 -5.551e-17  1.000e+00 -2.776e-17 -1.11e-16
## [4,] -1.388e-17  0.000e+00  0.000e+00  1.000e+00  0.00e+00
## [5,] -1.388e-17  0.000e+00  2.776e-17 -1.110e-16  1.00e+00
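
The differences above are tiny floating point errors, which is why == reports FALSE almost everywhere. A sketch of a tolerance-based check using all.equal(), with the matrix m typed in by hand from the printout above so that the example stands on its own (m_inv is a name introduced for this example):

```r
# Same values as the matrix m printed above, filled column by column
m <- matrix(c( 5, 19, 21, 24,  9,
              23, 25,  1, 12,  8,
              10,  2, 15, 13,  4,
              17, 22, 20, 11,  6,
              14,  7,  3, 16, 18), nrow = 5)

m_inv <- solve(m)

# all.equal() compares within a small numerical tolerance
isTRUE(all.equal(m %*% m_inv, diag(5)))   # TRUE: product is the identity
isTRUE(all.equal(solve(m_inv), m))        # TRUE: inverse of the inverse is m
```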

For robust sparse matrix math, see the Matrix package.

The most common type of data structure is a data frame, which is like a matrix but may contain columns of different types. In that sense it is like an Excel spreadsheet.

d.f <- data.frame(poison = rpois(20, 1), 
                category = sample(c("M", "F"), 20, replace = TRUE))
dim(d.f)

## [1] 20  2

class(d.f)

## [1] "data.frame"

Perhaps the most flexible type of data structure is a list. A list can contain other data structures of arbitrary types. Lists are often used when a function needs to return several non-conforming data structures (e.g. regression coefficients and residuals).

l <- list(vec = 1:10, mat = m, df = d.f)
class(l)

## [1] "list"

str(l)

## List of 3
##  $ vec: int [1:10] 1 2 3 4 5 6 7 8 9 10
##  $ mat: int [1:5, 1:5] 5 19 21 24 9 23 25 1 12 8 ...
##  $ df :'data.frame': 20 obs. of  2 variables:
##   ..$ poison  : int [1:20] 2 1 0 1 2 1 1 1 1 0 ...
##   ..$ category: Factor w/ 2 levels "F","M": 1 2 2 1 2 2 1 1 1 2 ...

dim(l)

## NULL

l

## $vec
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $mat
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    5   23   10   17   14
## [2,]   19   25    2   22    7
## [3,]   21    1   15   20    3
## [4,]   24   12   13   11   16
## [5,]    9    8    4    6   18
## 
## $df
##    poison category
## 1       2        F
## 2       1        M
## 3       0        M
## 4       1        F
## 5       2        M
## 6       1        M
## 7       1        F
## 8       1        F
## 9       1        F
## 10      0        M
## 11      1        M
## 12      1        F
## 13      1        F
## 14      0        M
## 15      3        M
## 16      4        M
## 17      1        F
## 18      1        M
## 19      2        M
## 20      0        M
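
The regression case mentioned above can be seen directly: the object returned by lm() is a list underneath. A short sketch using the cars data set that ships with R (the name fit is chosen for this example):

```r
fit <- lm(dist ~ speed, data = cars)   # cars ships with base R

is.list(fit)            # TRUE: an lm fit is a list underneath
names(fit)[1:2]         # "coefficients" "residuals"
length(fit$residuals)   # one residual per observation (50 rows in cars)
fit[["coefficients"]]   # [[ ]] and $ both extract a single element
```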

Section 7: Subsetting

Often we don’t need the entire matrix, data frame, or list. In this section we will look at some ways of getting at individual elements.

# Here is our normal matrix m
m # same as m[, ]

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    5   23   10   17   14
## [2,]   19   25    2   22    7
## [3,]   21    1   15   20    3
## [4,]   24   12   13   11   16
## [5,]    9    8    4    6   18

# first row of m
m[1, ]

## [1]  5 23 10 17 14

# first two rows of m
m[1:2, ]

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    5   23   10   17   14
## [2,]   19   25    2   22    7

# last two columns of m (note: returned in reverse order)
# figure out why this works...
m[, c(ncol(m), ncol(m) - 1)]

##      [,1] [,2]
## [1,]   14   17
## [2,]    7   22
## [3,]    3   20
## [4,]   16   11
## [5,]   18    6

You can also compute on rows and columns.

rowSums(m)

## [1] 69 75 60 76 45

colSums(m)

## [1] 78 69 44 76 58

# What if you wanted to compute average per row or columns?
# ?apply

# Notice that I can match arguments by name (order does not matter) and by position (order does matter)
apply(X = m, MARGIN = 1, FUN = mean)

## [1] 13.8 15.0 12.0 15.2  9.0

apply(m, 2, mean)

## [1] 15.6 13.8  8.8 15.2 11.6

apply(m, 2, sd)

## [1]  8.173 10.134  5.630  6.611  6.348

apply() is a very flexible function and can take your own function as the FUN argument. In general, R functions can take functions as arguments and return functions as values. A returned function that remembers the environment in which it was created is called a closure.
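
A small sketch of both ideas: passing your own function to apply(), and a function that returns a function. The matrix and the names row_spread, make_scaler and times_ten are invented for this example.

```r
m <- matrix(1:6, nrow = 2)      # small example matrix
                                #      [,1] [,2] [,3]
                                # [1,]    1    3    5
                                # [2,]    2    4    6

# An anonymous function passed as FUN: max minus min of each row
row_spread <- apply(m, 1, function(v) max(v) - min(v))
row_spread                      # 4 4

# A function that returns a function. The returned function
# remembers k from the environment where it was created.
make_scaler <- function(k) {
  function(v) v * k
}
times_ten <- make_scaler(10)

col_scaled <- apply(m, 2, times_ten)   # each column multiplied by 10
```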

Subsetting data frames can be done with the $ operator.

d.f$poison

##  [1] 2 1 0 1 2 1 1 1 1 0 1 1 1 0 3 4 1 1 2 0

# you can still use numeric subscripts
d.f$poison == d.f[, 1]

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [15] TRUE TRUE TRUE TRUE TRUE TRUE

# this is safer
all.equal(d.f$poison, d.f[, 1])

## [1] TRUE

subset(d.f, poison > 0, select = category)

##    category
## 1         F
## 2         M
## 4         F
## 5         M
## 6         M
## 7         F
## 8         F
## 9         F
## 11        M
## 12        F
## 13        F
## 15        M
## 16        M
## 17        F
## 18        M
## 19        M

subset(d.f, poison > 0, select = c(poison, category))

##    poison category
## 1       2        F
## 2       1        M
## 4       1        F
## 5       2        M
## 6       1        M
## 7       1        F
## 8       1        F
## 9       1        F
## 11      1        M
## 12      1        F
## 13      1        F
## 15      3        M
## 16      4        M
## 17      1        F
## 18      1        M
## 19      2        M
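
The same rows and columns can be selected with square brackets and a logical vector, which is what subset() does for you behind the scenes. A sketch under the assumption that d.f is built the same way as earlier; the seed and the names keep, by_subset and by_bracket are illustrative only.

```r
set.seed(2014)
d.f <- data.frame(poison = rpois(20, 1),
                  category = sample(c("M", "F"), 20, replace = TRUE))

keep <- d.f$poison > 0                       # logical vector, one entry per row

by_subset  <- subset(d.f, poison > 0, select = c(poison, category))
by_bracket <- d.f[keep, c("poison", "category")]

all.equal(by_subset, by_bracket)             # both approaches give the same result
```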

Your Turn

Look at the ?transform function and the ?mtcars dataset.

Plot columns 1 through 7 against each other with a “smooth” curve fit to each plot. Do the relationships make sense?
Compute the means and standard deviations of columns 1 through 7 (mpg through qsec).
Compute horsepower per unit of weight overall and then for 4-, 6-, and 8-cylinder cars.
Compare some of the results to the results of the summary() function.

Section 8: Grouping, Reshaping, and Merging Data

plyr is a great package for performing computations on grouped data. There is a newer package called dplyr, which you should explore, but here I demonstrate simple use of plyr.

library("plyr")

mt.new <- transform(mtcars, hp.weight = hp/wt)

ddply(mt.new, .(cyl), summarise, 
      mean = mean(hp.weight),
      sd   = sd(hp.weight))

##   cyl  mean    sd
## 1   4 37.93 13.78
## 2   6 39.93 10.85
## 3   8 53.86 17.08
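
If plyr is not installed, a similar grouped summary can be put together with base R. This is a sketch of one alternative using aggregate(), not what ddply() does internally; grp_mean and grp_sd are names made up for the example.

```r
mt.new <- transform(mtcars, hp.weight = hp / wt)

# aggregate() with a formula: summarise hp.weight within each level of cyl
grp_mean <- aggregate(hp.weight ~ cyl, data = mt.new, FUN = mean)
grp_sd   <- aggregate(hp.weight ~ cyl, data = mt.new, FUN = sd)

# combine into one table, similar in shape to the ddply() result
data.frame(cyl  = grp_mean$cyl,
           mean = grp_mean$hp.weight,
           sd   = grp_sd$hp.weight)
```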

qplot(factor(cyl), hp.weight, geom = "boxplot", data = mt.new) + theme_bw()

Time series (and other) processing often requires reshaping data. In Excel this is called pivoting. Here we will take a look at the reshape2 package. First take a look at the ?melt function in the reshape2 package and see if you can figure out what it does.

library("reshape2")

str(economics)

## 'data.frame':    478 obs. of  6 variables:
##  $ date    : Date, format: "1967-06-30" "1967-07-31" ...
##  $ pce     : num  508 511 517 513 518 ...
##  $ pop     : int  198712 198911 199113 199311 199498 199657 199808 199920 200056 200208 ...
##  $ psavert : num  9.8 9.8 9 9.8 9.7 9.4 9 9.5 8.9 9.6 ...
##  $ uempmed : num  4.5 4.7 4.6 4.9 4.7 4.8 5.1 4.5 4.1 4.6 ...
##  $ unemploy: int  2944 2945 2958 3143 3066 3018 2878 3001 2877 2709 ...

head(economics)

##         date   pce    pop psavert uempmed unemploy
## 1 1967-06-30 507.8 198712     9.8     4.5     2944
## 2 1967-07-31 510.9 198911     9.8     4.7     2945
## 3 1967-08-31 516.7 199113     9.0     4.6     2958
## 4 1967-09-30 513.3 199311     9.8     4.9     3143
## 5 1967-10-31 518.5 199498     9.7     4.7     3066
## 6 1967-11-30 526.2 199657     9.4     4.8     3018

econ <- melt(economics, id = "date")

str(econ)

## 'data.frame':    2390 obs. of  3 variables:
##  $ date    : Date, format: "1967-06-30" "1967-07-31" ...
##  $ variable: Factor w/ 5 levels "pce","pop","psavert",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ value   : num  508 511 517 513 518 ...

head(econ)

##         date variable value
## 1 1967-06-30      pce 507.8
## 2 1967-07-31      pce 510.9
## 3 1967-08-31      pce 516.7
## 4 1967-09-30      pce 513.3
## 5 1967-10-31      pce 518.5
## 6 1967-11-30      pce 526.2
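
To see what this wide-to-long reshaping amounts to, here is a hand-rolled sketch in base R on a tiny made-up data frame. It is only an illustration of the idea (it uses stack(), not melt(), and the data frame wide is invented), but the result has the same date / variable / value layout.

```r
wide <- data.frame(date = as.Date("2014-01-01") + 0:2,
                   a = c(1, 2, 3),
                   b = c(10, 20, 30))

# stack() puts all measurement columns into one `values` column
# and records the source column name in `ind`
s <- stack(wide[, c("a", "b")])

long <- data.frame(date     = rep(wide$date, times = 2),
                   variable = s$ind,
                   value    = s$values)
long   # 6 rows: 3 dates x 2 variables
```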

qplot(date, value, colour = variable, data = econ)

qplot(date, value, colour = variable, facets = . ~ variable, data = econ)

qplot(date, value, colour = variable, geom = "line", data = econ) +
  facet_wrap(~ variable, scales = "free_y")

Suppose I wanted to plot all the lines on the same chart so I can compare the patterns. How would I do that? Take a look at the scale() function. Also, some.data.frame[, -1] returns all but the first column.

scaled <- scale(economics[, -1])

econ <- data.frame(date = economics$date, scaled)

econ <- melt(econ, id = "date")

qplot(date, value, colour = variable, geom = "line", data = econ)

rang <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / diff(rng)
}

scaled <- apply(economics[, -1], 2, rang)

econ <- data.frame(date = economics$date, scaled)

econ <- melt(econ, id = "date")

qplot(date, value, colour = variable, geom = "line", data = econ)
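
The two approaches rescale the data differently, which is easier to see on a small vector than on the economics data. A sketch with a made-up vector v: scale() centres to mean 0 and standard deviation 1, while rang() maps the values onto the interval from 0 to 1.

```r
rang <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / diff(rng)
}

v <- c(2, 4, 6, 8, 10)

z <- as.numeric(scale(v))   # centre to mean 0, scale to sd 1
r <- rang(v)                # rescale to the interval [0, 1]

c(mean(z), sd(z))           # approximately 0 and 1
r                           # 0.00 0.25 0.50 0.75 1.00
```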

Section 9: Intro to Statistical Models

# ?lm

mpg.lm <- lm(hwy ~ ., data = subset(mpg, select = c(-manufacturer, -model)))

str(mpg.lm)

## List of 13
##  $ coefficients : Named num [1:26] -126.7859 -0.1884 0.0699 -0.1082 -0.225 ...
##   ..- attr(*, "names")= chr [1:26] "(Intercept)" "displ" "year" "cyl" ...
##  $ residuals    : Named num [1:234] 2.064 -0.442 2.147 1.106 1.371 ...
##   ..- attr(*, "names")= chr [1:234] "1" "2" "3" "4" ...
##  $ effects      : Named num [1:234] -358.57 -69.63 -10.61 -11.78 -1.59 ...
##   ..- attr(*, "names")= chr [1:234] "(Intercept)" "displ" "year" "cyl" ...
##  $ rank         : int 26
##  $ fitted.values: Named num [1:234] 26.9 29.4 28.9 28.9 24.6 ...
##   ..- attr(*, "names")= chr [1:234] "1" "2" "3" "4" ...
##  $ assign       : int [1:26] 0 1 2 3 4 4 4 4 4 4 ...
##  $ qr           :List of 5
##   ..$ qr   : num [1:234, 1:26] -15.2971 0.0654 0.0654 0.0654 0.0654 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:234] "1" "2" "3" "4" ...
##   .. .. ..$ : chr [1:26] "(Intercept)" "displ" "year" "cyl" ...
##   .. ..- attr(*, "assign")= int [1:26] 0 1 2 3 4 4 4 4 4 4 ...
##   .. ..- attr(*, "contrasts")=List of 4
##   .. .. ..$ trans: chr "contr.treatment"
##   .. .. ..$ drv  : chr "contr.treatment"
##   .. .. ..$ fl   : chr "contr.treatment"
##   .. .. ..$ class: chr "contr.treatment"
##   ..$ qraux: num [1:26] 1.07 1.08 1.08 1.02 1.01 ...
##   ..$ pivot: int [1:26] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ tol  : num 1e-07
##   ..$ rank : int 26
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 208
##  $ contrasts    :List of 4
##   ..$ trans: chr "contr.treatment"
##   ..$ drv  : chr "contr.treatment"
##   ..$ fl   : chr "contr.treatment"
##   ..$ class: chr "contr.treatment"
##  $ xlevels      :List of 4
##   ..$ trans: chr [1:10] "auto(av)" "auto(l3)" "auto(l4)" "auto(l5)" ...
##   ..$ drv  : chr [1:3] "4" "f" "r"
##   ..$ fl   : chr [1:5] "c" "d" "e" "p" ...
##   ..$ class: chr [1:7] "2seater" "compact" "midsize" "minivan" ...
##  $ call         : language lm(formula = hwy ~ ., data = subset(mpg, select = c(-manufacturer,      -model)))
##  $ terms        :Classes 'terms', 'formula' length 3 hwy ~ displ + year + cyl + trans + drv + cty + fl + class
##   .. ..- attr(*, "variables")= language list(hwy, displ, year, cyl, trans, drv, cty, fl, class)
##   .. ..- attr(*, "factors")= int [1:9, 1:8] 0 1 0 0 0 0 0 0 0 0 ...
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:9] "hwy" "displ" "year" "cyl" ...
##   .. .. .. ..$ : chr [1:8] "displ" "year" "cyl" "trans" ...
##   .. ..- attr(*, "term.labels")= chr [1:8] "displ" "year" "cyl" "trans" ...
##   .. ..- attr(*, "order")= int [1:8] 1 1 1 1 1 1 1 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(hwy, displ, year, cyl, trans, drv, cty, fl, class)
##   .. ..- attr(*, "dataClasses")= Named chr [1:9] "numeric" "numeric" "numeric" "numeric" ...
##   .. .. ..- attr(*, "names")= chr [1:9] "hwy" "displ" "year" "cyl" ...
##  $ model        :'data.frame':   234 obs. of  9 variables:
##   ..$ hwy  : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##   ..$ displ: num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##   ..$ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##   ..$ cyl  : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
##   ..$ trans: Factor w/ 10 levels "auto(av)","auto(l3)",..: 4 9 10 1 4 9 1 9 4 10 ...
##   ..$ drv  : Factor w/ 3 levels "4","f","r": 2 2 2 2 2 2 2 1 1 1 ...
##   ..$ cty  : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##   ..$ fl   : Factor w/ 5 levels "c","d","e","p",..: 4 4 4 4 4 4 4 4 4 4 ...
##   ..$ class: Factor w/ 7 levels "2seater","compact",..: 2 2 2 2 2 2 2 2 2 2 ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula' length 3 hwy ~ displ + year + cyl + trans + drv + cty + fl + class
##   .. .. ..- attr(*, "variables")= language list(hwy, displ, year, cyl, trans, drv, cty, fl, class)
##   .. .. ..- attr(*, "factors")= int [1:9, 1:8] 0 1 0 0 0 0 0 0 0 0 ...
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:9] "hwy" "displ" "year" "cyl" ...
##   .. .. .. .. ..$ : chr [1:8] "displ" "year" "cyl" "trans" ...
##   .. .. ..- attr(*, "term.labels")= chr [1:8] "displ" "year" "cyl" "trans" ...
##   .. .. ..- attr(*, "order")= int [1:8] 1 1 1 1 1 1 1 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. .. ..- attr(*, "predvars")= language list(hwy, displ, year, cyl, trans, drv, cty, fl, class)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:9] "numeric" "numeric" "numeric" "numeric" ...
##   .. .. .. ..- attr(*, "names")= chr [1:9] "hwy" "displ" "year" "cyl" ...
##  - attr(*, "class")= chr "lm"

summary(mpg.lm)

## 
## Call:
## lm(formula = hwy ~ ., data = subset(mpg, select = c(-manufacturer, 
##     -model)))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.434 -0.566 -0.072  0.603  2.909 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -126.7859    45.0283   -2.82  0.00534 ** 
## displ             -0.1884     0.2152   -0.88  0.38239    
## year               0.0699     0.0226    3.09  0.00227 ** 
## cyl               -0.1082     0.1411   -0.77  0.44416    
## transauto(l3)     -0.2250     0.9830   -0.23  0.81916    
## transauto(l4)      1.1150     0.5634    1.98  0.04912 *  
## transauto(l5)      1.4867     0.5579    2.66  0.00831 ** 
## transauto(l6)      1.8330     0.7171    2.56  0.01129 *  
## transauto(s4)      0.0183     0.8233    0.02  0.98229    
## transauto(s5)      1.9106     0.8150    2.34  0.02001 *  
## transauto(s6)      1.0889     0.5754    1.89  0.05982 .  
## transmanual(m5)    1.1397     0.5614    2.03  0.04362 *  
## transmanual(m6)    0.9097     0.5696    1.60  0.11177    
## drvf               0.9644     0.2990    3.23  0.00146 ** 
## drvr               1.1779     0.3432    3.43  0.00072 ***
## cty                0.9512     0.0477   19.94  < 2e-16 ***
## fld               -1.6453     1.2522   -1.31  0.19034    
## fle               -5.2199     1.2236   -4.27  3.0e-05 ***
## flp               -3.3776     1.1583   -2.92  0.00393 ** 
## flr               -3.6397     1.1356   -3.21  0.00156 ** 
## classcompact      -1.5004     0.7026   -2.14  0.03391 *  
## classmidsize      -1.1462     0.6982   -1.64  0.10217    
## classminivan      -3.0181     0.8040   -3.75  0.00023 ***
## classpickup       -4.6279     0.7203   -6.42  8.8e-10 ***
## classsubcompact   -2.1500     0.6943   -3.10  0.00223 ** 
## classsuv          -4.3160     0.6824   -6.32  1.5e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.09 on 208 degrees of freedom
## Multiple R-squared:  0.97,   Adjusted R-squared:  0.967 
## F-statistic:  270 on 25 and 208 DF,  p-value: <2e-16

op <- par(mfrow = c(2, 2))
plot(mpg.lm)

## Warning: not plotting observations with leverage one:
##   107
## Warning: not plotting observations with leverage one:
##   107

par(op)
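
Rather than digging through the str() output, the pieces of a fitted model are usually pulled out with accessor functions. A sketch on a smaller model fitted to mtcars, so that it runs without ggplot2 (the model and the names fit and s are chosen for this example only):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # mtcars ships with base R

coef(fit)                       # named vector of estimates
head(residuals(fit), 3)         # first few residuals

s <- summary(fit)
s$r.squared                     # multiple R-squared
s$coefficients[, "Pr(>|t|)"]    # p-values, one per coefficient
```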

Eric Novik

Thursday, August 14, 2014
