str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Tonight, I will introduce R, R Markdown, and RStudio as a gentle introduction to the course. R is a comprehensive statistical environment and programming language that is widely used in data science. I would suggest that anyone analyzing large data sets learn both R and Python.
#I like to load some libraries that are useful. You have to install these first.
require(car) #Companion to Applied Regression
## Loading required package: car
## Loading required package: carData
require(data.tree) #A good package for tree diagrams (and Bayes!)
## Loading required package: data.tree
require(DiagrammeR)
## Loading required package: DiagrammeR
require(forecast) #A good package for time series
## Loading required package: forecast
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
require(IPSUR) #This library is part of your textbook.
## Loading required package: IPSUR
#https://cran.r-project.org/src/contrib/Archive/IPSUR/
require(kableExtra)
## Loading required package: kableExtra
require(MASS) #Some functions that are generally required for EDA
## Loading required package: MASS
require(psych) #This library helps with descriptive statistics.
## Loading required package: psych
##
## Attaching package: 'psych'
## The following object is masked from 'package:car':
##
## logit
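Because these packages must be installed before `require()` can load them, a small check can report which ones are missing. The sketch below is only an illustration: the helper name `missing_packages` is hypothetical, and the `install.packages()` call is left commented out so nothing is installed unintentionally.

```r
# Hypothetical helper: return the names of packages that are not installed.
missing_packages <- function(pkgs) {
  # requireNamespace() returns FALSE (instead of erroring) when a package is absent
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

wanted <- c("car", "data.tree", "DiagrammeR", "forecast",
            "kableExtra", "MASS", "psych")
to_install <- missing_packages(wanted)
print(to_install)  # character(0) means everything is already installed

# Uncomment to install whatever is missing:
# install.packages(to_install)
```

IPSUR is left out of `wanted` on purpose, since the notes point to the CRAN archive for it.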
R is fully programmable. I write functions to assist in repetitive tasks. But for tonight, I want to focus on some basics.
2+3-2 #adding & subtracting
## [1] 3
2*2/3 #multiplying & dividing
## [1] 1.333333
2^3 #exponents (** also works and means the same as ^)
## [1] 8
(myadd=2+3) #assigning to variables / containers
## [1] 5
options(digits=17) #specifying the number of significant digits to display
exp(1) #e
## [1] 2.7182818284590451
(paste0("pi=",pi)) #pi
## [1] "pi=3.14159265358979"
options(digits=10)
noquote(paste0("pi=", pi))
## [1] pi=3.14159265358979
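In the output above, `exp(1)` follows `options(digits=17)`, while the pasted string still shows 15 significant digits, because `paste0()` converts numbers with `as.character()` rather than through `print()`. A minimal sketch of controlling precision explicitly, using only base R (the variable names are illustrative):

```r
x <- 2 / 3

format(x, digits = 3)      # "0.667"  (significant digits, returns a string)
round(x, 3)                # 0.667    (decimal places, returns a number)
signif(123456, 2)          # 120000   (significant digits, returns a number)
sprintf("%.10f", pi)       # "3.1415926536" (fixed number of decimals)

# Build a label with explicit precision instead of relying on options(digits)
label <- paste0("pi=", format(pi, digits = 10))
print(label)               # "pi=3.141592654"
```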
Defining vectors in R is easy. You build a vector with the combine function, c(). You may also use the scan function, but this can be painful. Similarly, you declare a matrix by passing a vector of values to matrix() and giving the number of rows (or columns); the values fill the matrix column by column.
myvector=c(1,3,4) #Note: a plain vector has no row/column orientation; as.matrix() treats it as a column
mymatrix=matrix(c(1,3,4,4,3,4,4,4,4), nrow=3)
myvector #show the vector
## [1] 1 3 4
mymatrix #show the matrix
## [,1] [,2] [,3]
## [1,] 1 4 4
## [2,] 3 3 4
## [3,] 4 4 4
myvector[-2] #drops the 2nd element, reporting only the 1st and 3rd elements of myvector
## [1] 1 4
myvector[2] #reports only the 2nd element of myvector
## [1] 3
mymatrix[,1] #reports only the first column of mymatrix
## [1] 1 3 4
mymatrix[,-2] #reports all but the second column of mymatrix
## [,1] [,2]
## [1,] 1 4
## [2,] 3 4
## [3,] 4 4
myseq=seq(1,10, by=.1) #generates a sequence from 1 to 10 in steps of 0.1
myseq2=1:10 #generates the integers 1 through 10
LETTERS[1:26] #produces capital letters
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
## [20] "T" "U" "V" "W" "X" "Y" "Z"
letters[1:26] #lower case
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
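The vector and matrix definitions above rely on two behaviors that are easy to check directly: a plain vector carries no dimensions, and `matrix()` fills by column unless told otherwise. A short base R sketch (the object names are illustrative):

```r
v <- c(1, 3, 4)
dim(v)                      # NULL: a plain vector has no rows or columns
dim(as.matrix(v))           # 3 1: as.matrix() makes it a single column

# matrix() fills column by column unless byrow = TRUE is given
by_col <- matrix(1:6, nrow = 2)
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)
by_col[1, ]                 # 1 3 5
by_row[1, ]                 # 1 2 3

# Selecting one row normally returns a vector; drop = FALSE keeps a matrix
by_row[2, , drop = FALSE]   # a 1 x 3 matrix holding 4 5 6
t(by_col)                   # transpose: 3 x 2
```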
Reading data into R is not too difficult, although the details depend on the file format. R can read data into matrices, vectors, data frames, etc. Here are some simple examples of reading data into R.
mydata=read.csv("d:/naturalgas.csv", stringsAsFactors = TRUE) #read in the data
mydata$Amount=mydata$Amount/1000000
head(mydata) #show the first six rows
## Year Month Amount
## 1 2001 1 2.505011
## 2 2001 2 2.156873
## 3 2001 3 2.086568
## 4 2001 4 1.663832
## 5 2001 5 1.385163
## 6 2001 6 1.313119
#pasteddata=read.table(file="clipboard", sep="") #pasting from the clipboard...Omitted.
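The `read.csv()` call above depends on a file at a fixed drive path. The following sketch runs anywhere by writing a few made-up rows to a temporary file first; the values are invented for illustration and are not the natural gas data.

```r
# Made-up rows in the same Year/Month/Amount layout (not the real data)
csv_text <- "Year,Month,Amount
2001,1,2500000
2001,2,2100000
2001,3,2000000"

# Write to a temporary file so the example does not depend on a drive letter
path <- tempfile(fileext = ".csv")
writeLines(csv_text, path)

demo <- read.csv(path, stringsAsFactors = TRUE)
demo$Amount <- demo$Amount / 1e6   # rescale by one million, as in the notes
str(demo)                          # column types chosen by read.csv()
head(demo, 2)                      # first two rows

# file.path() builds paths portably; the folder name "data" is a placeholder
file.path("data", "naturalgas.csv")
```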
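R makes sense and is Google friendly. If you don’t know how to do something, Google “how do I do XYZ in R?” For a function you already know by name, typing ?mean (for example) at the console opens its help page.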
mean(mydata$Amount)
## [1] 1.81941567
median(mydata$Amount)
## [1] 1.690836
sd(mydata$Amount)
## [1] 0.3967468344
var(mydata$Amount)
## [1] 0.1574080506
fivenum(mydata$Amount)
## [1] 1.2401960 1.5065850 1.6908360 2.1028025 2.9953270
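The second and fourth values from `fivenum()` are Tukey's hinges, which can differ slightly from the quartiles that `quantile()` and `summary()` report. A tiny worked example with a four-element vector (chosen only for illustration) makes the difference visible:

```r
x <- c(1, 2, 3, 4)

fivenum(x)    # min, lower hinge, median, upper hinge, max: 1 1.5 2.5 3.5 4
quantile(x)   # default (type 7) quartiles: 1 1.75 2.5 3.25 4
summary(x)    # quantile()-based quartiles plus the mean

# var() is the square of sd(); both use the n - 1 denominator
all.equal(var(x), sd(x)^2)   # TRUE
```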
R is easy to program. You can call built-in functions as demonstrated in your book, but you can also write your own easily.
myf = function(x) {# this states that myf is a function.
a1=mean(x)
a2=median(x)
a3=sd(x) #standard deviation
y=c(a1,a2,a3) #concatenates the items into a vector
names(y)=c("Mean","Median","SD") #names them
return(round(y,3))
}
myf(mydata$Amount)
## Mean Median SD
## 1.819 1.691 0.397
for (i in 1:12){ #repeat the summary for each month
print(myf(mydata$Amount[mydata$Month==i]))
}
## Mean Median SD
## 2.541 2.538 0.238
## Mean Median SD
## 2.302 2.319 0.188
## Mean Median SD
## 2.089 2.048 0.152
## Mean Median SD
## 1.671 1.664 0.096
## Mean Median SD
## 1.481 1.442 0.115
## Mean Median SD
## 1.467 1.454 0.134
## Mean Median SD
## 1.622 1.603 0.144
## Mean Median SD
## 1.648 1.608 0.128
## Mean Median SD
## 1.436 1.418 0.119
## Mean Median SD
## 1.526 1.507 0.120
## Mean Median SD
## 1.765 1.698 0.202
## Mean Median SD
## 2.266 2.226 0.203
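The per-month loop above can also be written without an explicit `for`, by splitting the values by group and applying the summary to each piece. This is a sketch on a small invented data frame; `toy` and `summarise_amount` are hypothetical names, and the helper labels its third statistic as the standard deviation, which is what `sd()` computes.

```r
# Hypothetical toy data: two "months" with four readings each
toy <- data.frame(
  Month  = rep(1:2, each = 4),
  Amount = c(2.4, 2.6, 2.5, 2.7, 1.4, 1.5, 1.6, 1.5)
)

# Summary helper: mean, median, and standard deviation, rounded to 3 places
summarise_amount <- function(x) {
  round(c(Mean = mean(x), Median = median(x), SD = sd(x)), 3)
}

# split() groups Amount by Month; sapply() applies the helper to each group.
# sapply() returns one column per month, so t() turns months into rows.
by_month <- t(sapply(split(toy$Amount, toy$Month), summarise_amount))
by_month
```

The result is a single matrix with one row per month, which is easier to pass to a table function than a series of printed vectors.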
R is simple to use for plots. And if you use the right packages, you can create some amazing diagrams. I will show you some simple ones.
par(mfrow=c(2,3)) #set up a 2x3 grid for plots
hist(mydata$Amount, breaks="STURGES",main="Histogram of Natural Gas Consumption", col="RED") #histogram
hist(mydata$Amount, breaks="STURGES", freq=FALSE,main="Histogram of Natural Gas Consumption", col="BLUE")
boxplot(mydata$Amount, main="Boxplot, NG Consumption", notch=TRUE, col="BLUE") #boxplot
boxplot(mydata$Amount~mydata$Month, main="Boxplot, NG Consumption", notch=TRUE, col="GREEN")
## Warning in bxp(list(stats = structure(c(2.306943, 2.450862, 2.538368,
## 2.6430165, : some notches went outside hinges ('box'): maybe set notch=FALSE
myts=ts(mydata$Amount, start=c(2001,01), frequency=12) #declare a time series
plot(myts) #plots the time series
plot(forecast(myts,h=12)) #plot a forecast of myts using ETS
plot(decompose(myts)) #plot a decomposition of the time series.
plot(density(myts)) #plot a density
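The time series calls above depend on how `ts()` labels the observations. The sketch below builds a synthetic monthly series (an assumption for illustration, not the natural gas data) and inspects it without plotting, using only base R.

```r
# Synthetic series: gentle trend + yearly sine wave + noise, three full years
set.seed(1)
n <- 36
values <- 2 + 0.01 * seq_len(n) +
  0.5 * sin(2 * pi * seq_len(n) / 12) + rnorm(n, sd = 0.05)

demo_ts <- ts(values, start = c(2001, 1), frequency = 12)  # January 2001 onward
frequency(demo_ts)                                    # 12 observations per year
year_2002 <- window(demo_ts, start = c(2002, 1), end = c(2002, 12))
length(year_2002)                                     # 12 months

# decompose() needs at least two full periods; it returns trend, seasonal,
# and random components, which plot() draws as stacked panels
parts <- decompose(demo_ts)
round(parts$figure, 2)                 # the 12 estimated seasonal effects
```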
Stevens’s typology is widely used in deciding which statistics to use, plots to generate, and tests to conduct. Essentially, data are categorized as either quantitative or qualitative. Quantitative data are further subdivided into interval or ratio, and qualitative data into nominal or ordinal.
Qualitative data are data for which arithmetic operations make no sense. Nominal data are “in name only” (e.g., color); there is no direction. Ordinal data have a direction but no equal spacing (e.g., finishing position in a race). Quantitative data are data for which arithmetic operations make sense. Interval data have constant spacing between values but an arbitrary zero (e.g., temperature in Celsius). Ratio data have constant spacing and a true zero (e.g., temperature in kelvins). These levels of measurement help dictate which statistics and graphs are appropriate. We can begin to evaluate how pre-built data sets are defined (quantitative vs. qualitative) using the structure (str) function.
str(mydata) # this displays the structure of the data
## 'data.frame': 176 obs. of 3 variables:
## $ Year : int 2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
## $ Month : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Amount: num 2.51 2.16 2.09 1.66 1.39 ...
mydata$Year=factor(mydata$Year) #converts the integer to a factor (qualitative)
mydata$Month=factor(mydata$Month) #converts the integer to a factor (qualitative)
str(mydata) #shows that we have converted to factors
## 'data.frame': 176 obs. of 3 variables:
## $ Year : Factor w/ 15 levels "2001","2002",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Month : Factor w/ 12 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Amount: num 2.51 2.16 2.09 1.66 1.39 ...
table(mydata$Month) #Looks at the number of observations by month
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 15 15 15 15 15 15 15 15 14 14 14 14
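The four measurement levels map onto R types only approximately. R distinguishes unordered factors, ordered factors, and numbers, but it does not separate interval from ratio data. A brief sketch with invented values:

```r
# Nominal: categories with no order
colour <- factor(c("red", "blue", "red", "green"))
table(colour)                       # counting is meaningful
nlevels(colour)                     # 3

# Ordinal: ordered categories, but the spacing between them is undefined
place <- factor(c("2nd", "1st", "3rd"),
                levels = c("1st", "2nd", "3rd"), ordered = TRUE)
place < "3rd"                       # order comparisons work: TRUE TRUE FALSE
max(place)                          # 3rd

# Interval vs. ratio: both are numeric in R; the difference is interpretation
celsius <- c(10, 20)
kelvin  <- celsius + 273.15
celsius[2] / celsius[1]             # 2, but 20 C is not "twice as hot" as 10 C
kelvin[2] / kelvin[1]               # about 1.035, a meaningful ratio
```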
mydescribe=round(describe(mydata[,-1]),3)
mydescribe%>%kbl()%>%kable_classic(html_font = "Cambria")
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Month* | 1 | 176 | 6.409 | 3.443 | 6.000 | 6.394 | 4.448 | 1.00 | 12.000 | 11.000 | 0.035 | -1.219 | 0.26 |
| Amount | 2 | 176 | 1.819 | 0.397 | 1.691 | 1.781 | 0.366 | 1.24 | 2.995 | 1.755 | 0.790 | -0.373 | 0.03 |
Measures of center, variation, location, and shape are important for understanding data. Fortunately, R packages can compute all of these for us. For example, the “describe” function in library(psych) generates many of these statistics. However, we can also generate our own.
## Package 'qcc' version 2.7
## Type 'citation("qcc")' for citing this R package in publications.
## Loading required package: HistData
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
##
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
##
## describe
## The following objects are masked from 'package:base':
##
## format.pval, units
##
## Attaching package: 'UsingR'
## The following object is masked from 'package:survival':
##
## cancer
## The following object is masked from 'package:psych':
##
## headtail
##
## The decimal point is 1 digit(s) to the left of the |
##
## 12 | 4
## 13 | 000133566688999
## 14 | 0111223334555566677899
## 15 | 000001111111333445666778888999
## 16 | 00112222444566777888
## 17 | 0001233333344457889
## 18 | 1346789
## 19 | 55777779
## 20 | 001456999
## 21 | 011266778
## 22 | 1339
## 23 | 1111222455667
## 24 | 0017
## 25 | 11234556
## 26 | 26
## 27 | 004
## 28 |
## 29 | 0
## 30 | 0
## [[1]]
## Min Max 5% 25% 75% 95% Mean Median
## 1.2402 2.9953 1.3574 1.5068 2.0999 2.5527 1.8194 1.6908
## s s^2 MAveD MAD CV Skew Kurtosis
## 0.3967 0.1574 0.3345 0.3166 21.8063 0.7964 2.6570
##
## [[2]]
##
## Shapiro-Wilk normality test
##
## data: a
## W = 0.91, p-value = 1e-08
##
##
## [[3]]
## Estimated transformation parameter
## new
## -2.875
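The code that produced the output above is not shown in these notes. As a rough illustration of generating our own statistics, the following base R sketch computes a few measures of center, variation, location, and shape. The function name `my_summary` is hypothetical, it does not claim to reproduce the author's function, and it assumes population (divide-by-n) moments for skewness and kurtosis.

```r
# Hypothetical helper (not the author's code)
my_summary <- function(x) {
  m  <- mean(x)
  m2 <- mean((x - m)^2)          # second central moment
  m3 <- mean((x - m)^3)          # third central moment
  m4 <- mean((x - m)^4)          # fourth central moment
  c(
    Min      = min(x),
    Max      = max(x),
    quantile(x, c(0.25, 0.75)),  # location: lower and upper quartiles
    Mean     = m,
    Median   = median(x),
    s        = sd(x),
    CV       = 100 * sd(x) / m,  # coefficient of variation, in percent
    Skew     = m3 / m2^1.5,      # 0 for a symmetric sample
    Kurtosis = m4 / m2^2         # about 3 for normal data (not excess kurtosis)
  )
}

x <- c(1, 2, 3, 4, 5)
round(my_summary(x), 4)
shapiro.test(x)$p.value          # Shapiro-Wilk normality test from base R
```

Note that this kurtosis is not "excess" kurtosis. Functions that subtract 3, as the psych table earlier appears to, report values near 0 rather than near 3 for roughly normal data.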