Class 1 Data Analysis

Statistics 101

Tonight, I will introduce R, R Markdown, and RStudio as a soft introduction to the course. R is a comprehensive statistical environment and programming language that is widely used in data science. I suggest that anyone who analyzes large data sets learn both R and Python.

#I like to load up some libraries that are useful.  You have to install these first.
require (IPSUR)  #This library is part of your textbook.

## Loading required package: IPSUR

## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'IPSUR'

require (psych)  #This library helps with descriptive statistics.

## Loading required package: psych

require(MASS)    #Some functions that are generally required for EDA

## Loading required package: MASS

require(car)     #Same here

## Loading required package: car

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

require(forecast) # A good package for time series

## Loading required package: forecast

require(data.tree) #A good package for tree diagrams (and Bayes!)

## Loading required package: data.tree

## Warning: package 'data.tree' was built under R version 3.5.2
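
The IPSUR warning above shows what happens when a package has not been installed: require() only warns and returns FALSE, so later code may fail in confusing ways. A minimal sketch of checking for missing packages before a session is shown here. The names pkgs, installed, and missing_pkgs are illustrative, and the install line is left commented out on purpose.

```r
# Packages used in these notes (IPSUR is left out because the session
# above could not find it).
pkgs <- c("psych", "MASS", "car", "forecast", "data.tree")

# requireNamespace() returns TRUE or FALSE without attaching the package.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
} else {
  message("All packages are available.")
}
```

If you would rather have a missing package stop the script immediately, library() raises an error where require() only warns.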

Basic Operations In R

R is fully programmable. I write functions to assist in repetitive tasks. But for tonight, I want to focus on some basics.

2+3-2 #adding & subtracting

## [1] 3

2*2/3 #multiplying & dividing

## [1] 1.333333

2^3   #exponents

## [1] 8

myadd=2+3  #assigning to variables / containers
myadd

## [1] 5

options(digits=17) #specifying total digits to be displayed
exp(1)  #e

## [1] 2.7182818284590451

pi   #pi

## [1] 3.1415926535897931

options(digits=10)
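
One point worth a quick sketch: options(digits=...) changes only how numbers are displayed, not the values R stores. The variable x below is illustrative.

```r
x <- 2 / 3              # "<-" is the conventional assignment operator; "=" also works
print(x, digits = 3)    # display only: shows 0.667
x == 2 / 3              # TRUE, the stored value is unchanged

format(pi, digits = 5)  # "3.1416", a character string for display
round(pi, 2)            # 3.14, this one really does change the value
```

Use print() or format() when you only want a tidier display, and round() or signif() when you actually want a rounded number.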

Vectors and Matrices in R

Defining vectors in R is easy to do. You build a vector using the concatenate function, c(). You may also use the scan function, but this can be painful. Similarly, you declare a matrix in R by passing a vector of values to matrix() and giving the number of rows (or columns); R fills the matrix column by column.

myvector=c(1,3,4)  #Note:  a plain vector has no dimensions; it displays as a row but acts as a column in mymatrix %*% myvector
mymatrix=matrix(c(1,3,4,4,3,4,4,4,4), nrow=3)
myvector #show the vector

## [1] 1 3 4

mymatrix #show the matrix

##      [,1] [,2] [,3]
## [1,]    1    4    4
## [2,]    3    3    4
## [3,]    4    4    4

myvector[-2]  #reports all but the 2nd element (here, the 1st and 3rd) of myvector

## [1] 1 4

myvector[2]   #reports only the 2nd element of myvector

## [1] 3

mymatrix[,1]  #reports only the first column of mymatrix

## [1] 1 3 4

mymatrix[,-2] #reports all but the second column of mymatrix

##      [,1] [,2]
## [1,]    1    4
## [2,]    3    4
## [3,]    4    4

myseq=seq(1,10, by=.1)  #generates a sequence from 1 to 10 in steps of 0.1
myseq2=1:10  #generates the integers 1 through 10
LETTERS[1:26]  #produces capital letters

##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
## [18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

letters[1:26]  #lower case

##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"
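
A short sketch of a few details behind the examples above: a plain vector carries no dimensions, matrix() fills by column unless told otherwise, and %*% does true matrix multiplication. All object names here are illustrative.

```r
v <- c(1, 3, 4)
is.null(dim(v))          # TRUE: a plain vector is neither a row nor a column

m_col <- matrix(1:6, nrow = 2)                # filled column by column (the default)
m_row <- matrix(1:6, nrow = 2, byrow = TRUE)  # filled row by row
m_col[1, ]               # 1 3 5
m_row[1, ]               # 1 2 3

m <- matrix(c(1, 3, 4, 4, 3, 4, 4, 4, 4), nrow = 3)
prod_mv <- m %*% v       # v is treated as a column here; the result is 3 x 1
prod_mv[, 1]             # 29 28 32

length(seq(1, 10, by = 0.1))  # 91 values from 1 to 10
v[v > 2]                      # logical indexing: 3 4
```

Note that m * v would multiply element by element with recycling, which is a different operation from m %*% v.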

Reading Data in R

Reading data is not too difficult in R, depending on the format. Functions such as read.csv and read.table return a data frame, and scan reads values into a plain vector. Here are some simple examples of reading data into R.

mydata=read.csv("d:/naturalgas.csv")  #read in the data
head(mydata)  #show the first six rows

##   Year Month  Amount     I_1      Forecast         Error SSR X9.89107E.12
## 1 2001     1 2505011       0            NA       0.00000  NA           NA
## 2 2001     2 2156873 -348138       0.00000 -348138.00000  NA           NA
## 3 2001     3 2086568  -70305 -181821.32560  111516.32560  NA           NA
## 4 2001     4 1663832 -422736   12484.31930 -435220.31930  NA           NA
## 5 2001     5 1385163 -278669 -224159.83240  -54509.16759  NA           NA
## 6 2001     6 1313119  -72044  -84880.36997   12836.36997  NA           NA
##   Coef.for.AR X0.251659618
## 1 Coef for MA  0.270608349
## 2                       NA
## 3                       NA
## 4                       NA
## 5                       NA
## 6                       NA

#pasteddata=read.table(file="clipboard", sep="")  #pasting from the clipboard...Omitted.
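
Because the natural gas file lives on a local drive, here is a self-contained sketch that writes a tiny CSV to a temporary file and reads it back. The file contents and the names tmp and toy are made up for illustration only.

```r
# Write a small, made-up CSV to a temporary file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Year,Month,Amount,Note",
             "2001,1,100,first",
             "2001,2,90,",
             "2001,3,,"), tmp)

# stringsAsFactors = FALSE keeps text columns as character vectors.
toy <- read.csv(tmp, stringsAsFactors = FALSE)

head(toy, n = 2)   # first two rows
str(toy)           # column types; the empty Amount field is read as NA
unlink(tmp)        # remove the temporary file
```

In R 4.0.0 and later, read.csv keeps text columns as character by default; in older versions, such as the one that produced the output in these notes, text columns become factors unless you set stringsAsFactors = FALSE.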

Some Basic Functions

R makes sense and is Google friendly. If you don’t know how to do something, Google “how do I do XYZ in R?”

mean(mydata$Amount)

## [1] 1819415.67

median(mydata$Amount)

## [1] 1690836

sd(mydata$Amount)

## [1] 396746.8344

var(mydata$Amount)

## [1] 157408050620

fivenum(mydata$Amount)

## [1] 1240196.0 1506585.0 1690836.0 2102802.5 2995327.0
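
A small sketch of how these summary functions behave on a toy vector, including what happens with a missing value. The vector x is made up for illustration.

```r
x <- c(2, 4, 4, 5, 10, NA)

mean(x)                  # NA: one missing value makes the mean missing
mean(x, na.rm = TRUE)    # 5
median(x, na.rm = TRUE)  # 4
sd(x, na.rm = TRUE)      # 3
var(x, na.rm = TRUE)     # 9, the square of the standard deviation
fivenum(x)               # 2 4 4 5 10 (min, lower hinge, median, upper hinge, max)
```

fivenum() drops missing values by default, while mean(), median(), sd(), and var() need na.rm = TRUE. Typing ?mean at the console opens the built-in help page for a function.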

Functions in R

R is easy to program. You can query functions as demonstrated in your book, but you can also write them easily.

myf = function(x)  {  # this states that myf is a function.
  a1=mean(x)
  a2=median(x)
  a3=sd(x)
  y=c(a1,a2,a3) #concatenates the items into a vector
  names(y)=c("Mean","Median","SD")  #names them
  return(y)
}
myf(mydata$Amount)  #call the function on the Amount column
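
As a variation on the same idea, here is a sketch of a summary function with an optional argument, applied to every column of a small data frame at once. The names summarize_vec, toy, and res are hypothetical, and the data are made up.

```r
# Same three statistics, with an na.rm argument that defaults to TRUE.
summarize_vec <- function(x, na.rm = TRUE) {
  c(Mean   = mean(x, na.rm = na.rm),
    Median = median(x, na.rm = na.rm),
    SD     = sd(x, na.rm = na.rm))
}

toy <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, NA, 40))

# sapply() calls the function on each column and binds the results into a matrix.
res <- sapply(toy, summarize_vec)
res
```

Naming the elements inside c() replaces the separate names() step, and the value of the last expression is returned, so an explicit return() is optional.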

Plotting in R

R is simple to use for plots. And if you use the right packages, you can create some amazing diagrams. I will show you some simple ones.

par(mfrow=c(2,3)) #set up a 2x3 grid for plots (2 rows, 3 columns)
hist(mydata$Amount, breaks="STURGES",main="Histogram of Natural Gas Consumption", col="RED") #histogram
hist(mydata$Amount, breaks="STURGES", freq=FALSE,main="Histogram of Natural Gas Consumption", col="BLUE")
boxplot(mydata$Amount, main="Boxplot, NG Consumption", notch=TRUE, col="BLUE") #boxplot
boxplot(mydata$Amount~mydata$Month, main="Boxplot, NG Consumption", notch=TRUE, col="GREEN")

## Warning in bxp(list(stats = structure(c(2306943, 2450862, 2538368,
## 2643016.5, : some notches went outside hinges ('box'): maybe set
## notch=FALSE

myts=ts(mydata$Amount, start=c(2001,01), frequency=12)  #declare a time series
plot(myts) #plots the time series
plot(forecast(myts,h=12))  #plot a forecast of myts using ETS

plot(decompose(myts)) #plot a decomposition of the time series.

plot(density(myts))  #plot a density
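
A self-contained sketch of the same plotting pattern on simulated data, since the natural gas file is not bundled with these notes. The series toy is made up (a seasonal sine wave plus noise), and saving and restoring par() is a habit I would suggest, not something the code above does.

```r
set.seed(1)  # make the simulated noise reproducible

# Four years of made-up monthly data with a 12-month cycle.
toy <- ts(100 + 10 * sin(2 * pi * (1:48) / 12) + rnorm(48),
          start = c(2001, 1), frequency = 12)

old_par <- par(mfrow = c(1, 2))   # 1 row, 2 columns; keep the old settings
hist(toy, main = "Histogram of toy series", col = "red")
boxplot(as.numeric(toy) ~ cycle(toy),   # cycle() gives the month (1 to 12)
        main = "Boxplot by month", xlab = "Month", ylab = "Value", col = "green")
par(old_par)                      # restore the previous layout
```

Restoring the layout matters because par(mfrow=...) stays in effect for every later plot on the same device.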

Basic Data Types

Stevens’s typology is widely used in deciding which statistics to use, which plots to generate, and which tests to conduct. Essentially, data are categorized as either quantitative or qualitative. Quantitative data are further subdivided into interval or ratio, and qualitative data into nominal or ordinal.

Qualitative data are data where mathematical operations make no sense. Nominal data are those data which are “in name only,” e.g., color; there is no direction. Ordinal data have a direction but no equal spacing (e.g., finishing position in a race). Quantitative data are data where math operations make sense. Interval data have constant spacing between values but an arbitrary zero (e.g., temperature in Celsius). Ratio data have constant spacing and a true zero (e.g., temperature in Kelvin). These levels of measurement help dictate what statistics and graphs are appropriate. We can begin to evaluate how pre-built data sets are defined (quantitative vs. qualitative) using the structure (str) function.

str(mydata)  # this displays the structure of the data

## 'data.frame':    176 obs. of  10 variables:
##  $ Year        : int  2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
##  $ Month       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Amount      : int  2505011 2156873 2086568 1663832 1385163 1313119 1459919 1528483 1360871 1507428 ...
##  $ I_1         : int  0 -348138 -70305 -422736 -278669 -72044 146800 68564 -167612 146557 ...
##  $ Forecast    : num  NA 0 -181821 12484 -224160 ...
##  $ Error       : num  0 -348138 111516 -435220 -54509 ...
##  $ SSR         : logi  NA NA NA NA NA NA ...
##  $ X9.89107E.12: logi  NA NA NA NA NA NA ...
##  $ Coef.for.AR : Factor w/ 2 levels "","Coef for MA": 2 1 1 1 1 1 1 1 1 1 ...
##  $ X0.251659618: num  0.271 NA NA NA NA ...

mydata$Year=factor(mydata$Year)  #converts the integer to a factor (qualitative)
mydata$Month=factor(mydata$Month)  #converts the integer to a factor (qualitative)
str(mydata) #shows that we have converted to factors

## 'data.frame':    176 obs. of  10 variables:
##  $ Year        : Factor w/ 15 levels "2001","2002",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Month       : Factor w/ 12 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Amount      : int  2505011 2156873 2086568 1663832 1385163 1313119 1459919 1528483 1360871 1507428 ...
##  $ I_1         : int  0 -348138 -70305 -422736 -278669 -72044 146800 68564 -167612 146557 ...
##  $ Forecast    : num  NA 0 -181821 12484 -224160 ...
##  $ Error       : num  0 -348138 111516 -435220 -54509 ...
##  $ SSR         : logi  NA NA NA NA NA NA ...
##  $ X9.89107E.12: logi  NA NA NA NA NA NA ...
##  $ Coef.for.AR : Factor w/ 2 levels "","Coef for MA": 2 1 1 1 1 1 1 1 1 1 ...
##  $ X0.251659618: num  0.271 NA NA NA NA ...

table(mydata$Month)  #Looks at the number of observations by month

## 
##  1  2  3  4  5  6  7  8  9 10 11 12 
## 15 15 15 15 15 15 15 15 14 14 14 14
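
The four measurement levels described in this section map onto R types roughly as sketched here. The toy values and the names color, place, temp_c, temp_k, and yr are made up for illustration.

```r
# Nominal: an unordered factor.
color <- factor(c("red", "blue", "red"))

# Ordinal: an ordered factor; comparisons work, arithmetic does not.
place <- factor(c("2nd", "1st", "3rd"),
                levels = c("1st", "2nd", "3rd"), ordered = TRUE)
place[1] < place[3]        # TRUE: "2nd" comes before "3rd"

# Interval vs. ratio: both are plain numeric vectors in R.
temp_c <- c(10, 20)
temp_k <- temp_c + 273.15
temp_c[2] / temp_c[1]      # 2, but 20 C is not "twice as hot" as 10 C
temp_k[2] / temp_k[1]      # about 1.035, the physically meaningful ratio

# Pitfall: converting a factor straight to numeric returns the level codes.
yr <- factor(c(2001, 2002, 2001))
as.numeric(yr)                  # 1 2 1
as.numeric(as.character(yr))    # 2001 2002 2001
```

R cannot tell interval from ratio data, so that distinction is left to the analyst.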

Describing Data

Measures of center, variation, location, and shape are important for understanding data. Fortunately, R packages can compute all of these for us. For example, the “describe” function in library(psych) generates many of the statistics. Note that Hmisc also has a function named describe, and the startup messages below show it masking the psych version, so write psych::describe() when you want that one. We can also generate our own statistics.

## Warning: package 'qcc' was built under R version 3.5.2

## Package 'qcc' version 2.7

## Type 'citation("qcc")' for citing this R package in publications.

## Warning: package 'UsingR' was built under R version 3.5.2

## Loading required package: HistData

## Warning: package 'HistData' was built under R version 3.5.2

## Loading required package: Hmisc

## Warning: package 'Hmisc' was built under R version 3.5.2

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## Warning: package 'Formula' was built under R version 3.5.2

## Loading required package: ggplot2

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

## 
## Attaching package: 'Hmisc'

## The following object is masked from 'package:psych':
## 
##     describe

## The following objects are masked from 'package:base':
## 
##     format.pval, units

## 
## Attaching package: 'UsingR'

## The following object is masked from 'package:survival':
## 
##     cancer

## The following objects are masked from 'package:psych':
## 
##     galton, headtail

## 
##   The decimal point is 5 digit(s) to the right of the |
## 
##   12 | 4
##   13 | 000133566688999
##   14 | 0111223334555566677899
##   15 | 000001111111333445666778888999
##   16 | 00112222444566777888
##   17 | 0001233333344457889
##   18 | 1346789
##   19 | 55777779
##   20 | 001456999
##   21 | 011266778
##   22 | 1339
##   23 | 1111222455667
##   24 | 0017
##   25 | 11234556
##   26 | 26
##   27 | 004
##   28 | 
##   29 | 0
##   30 | 0

## [[1]]
##       Min       Max        5%       25%       75%       95%      Mean 
## 1.240e+06 2.995e+06 1.357e+06 1.507e+06 2.100e+06 2.553e+06 1.819e+06 
##    Median         s       s^2     MAveD       MAD        CV      Skew 
## 1.691e+06 3.967e+05 1.574e+11 3.345e+05 3.166e+05 2.181e+01 7.964e-01 
##  Kurtosis 
## 2.657e+00 
## 
## [[2]]
## 
##  Shapiro-Wilk normality test
## 
## data:  a
## W = 0.91, p-value = 1e-08
## 
## 
## [[3]]
## Estimated transformation parameter 
##    new 
## -2.126
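
The chunk that produced the output above is not shown in these notes. As a stand-in, here is a minimal base R sketch that computes a similar set of statistics. The function describe_vec, the toy vector, and the moment-based skewness and kurtosis formulas are my assumptions, not the course's own code, and the transformation parameter is not reproduced here.

```r
describe_vec <- function(x) {
  x <- x[!is.na(x)]                   # drop missing values
  n <- length(x)
  m <- mean(x)
  s <- sd(x)
  z <- (x - m) / (s * sqrt((n - 1) / n))  # standardize with the population SD
  c(Min = min(x), Max = max(x), Mean = m, Median = median(x),
    s = s, MAD = mad(x), CV = 100 * s / m,
    Skew = mean(z^3),                 # moment-based skewness (assumed convention)
    Kurtosis = mean(z^4))             # non-excess kurtosis; a normal gives about 3
}

toy <- c(2, 4, 4, 5, 10, NA)
d <- describe_vec(toy)
round(d, 3)

stem(toy)                              # stem-and-leaf display
sw <- shapiro.test(toy[!is.na(toy)])   # Shapiro-Wilk normality test
sw$p.value
```

A small Shapiro-Wilk p-value, like the one reported above for the natural gas data, suggests the data are not well described by a normal distribution.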