Test

Tab1

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Tab2

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Statistics 101

Tonight, I will introduce R, R Markdown, and RStudio as a soft introduction to the course. R is a comprehensive statistical package and programming language that is widely used in data science. I suggest that anyone who analyzes large data sets learn both R and Python.

#I like to load up some libraries that are useful.  You have to install these first.

require(car)     #Companion to Applied Regression: regression tools and diagnostics
## Loading required package: car
## Loading required package: carData
require(data.tree) #A good package for tree diagrams (and Bayes!)
## Loading required package: data.tree
require(DiagrammeR)
## Loading required package: DiagrammeR
require(forecast) # A good package for time series
## Loading required package: forecast
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
require(IPSUR)  #This library is part of your textbook.
## Loading required package: IPSUR
#https://cran.r-project.org/src/contrib/Archive/IPSUR/
require(kableExtra)
## Loading required package: kableExtra
require(MASS)    #Some functions that are generally required for EDA
## Loading required package: MASS
require(psych)  #This library helps with descriptive statistics.
## Loading required package: psych
## 
## Attaching package: 'psych'
## The following object is masked from 'package:car':
## 
##     logit
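
The note above says the packages must be installed before they can be loaded. One way to check for that is sketched below. The name missing_packages is a hypothetical helper written for this example (it is not part of any package), and the install line is left commented out so that nothing is installed by accident.

```r
# Return the names of packages in `pkgs` that are not installed.
# `missing_packages` is a hypothetical helper written for this example.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("car", "psych"))
print(to_install)  # empty when both packages are already installed

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

Checking with requireNamespace avoids attaching the packages just to test for them; require or library can still be used afterwards to attach them.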

Basic Operations In R

R is fully programmable. I write functions to assist in repetitive tasks. But for tonight, I want to focus on some basics.

2+3-2 #adding & subtracting
## [1] 3
2*2/3 #multiplying & dividing
## [1] 1.333333
2^3   #exponents; ** also works in place of ^
## [1] 8
(myadd=2+3)  #assigning to variables / containers
## [1] 5
options(digits=17) #specifying total digits to be displayed
exp(1)  #e
## [1] 2.7182818284590451
(paste0("pi=",pi))   #pi
## [1] "pi=3.14159265358979"
options(digits=10)

noquote(paste0("pi=", pi))
## [1] pi=3.14159265358979
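
The output above shows exp(1) with 17 significant digits but pi with only 15. That is because paste0 converts numbers with as.character, which keeps at most 15 significant digits regardless of options(digits). A small sketch of controlling the digits explicitly with format and sprintf from base R (the variable names are only for illustration):

```r
# paste0() converts numbers via as.character(), which ignores options(digits)
short_pi <- paste0("pi=", pi)

# format() and sprintf() let you request the digits explicitly
long_pi  <- paste0("pi=", format(pi, digits = 17))
fixed_pi <- sprintf("pi=%.4f", pi)   # four digits after the decimal point

print(short_pi)
print(long_pi)
print(fixed_pi)
```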

Vectors and Matrices in R

Defining vectors in R is easy. You build a vector using the combine function, c(). You may also use the scan function, but this can be painful. Similarly, you declare a matrix by passing a vector of values to matrix() and specifying the number of rows (or columns); by default, the values fill the matrix column by column.

myvector=c(1,3,4)  #Note:  a plain vector has no dimensions; it displays as a row but can act as a column vector in matrix algebra
mymatrix=matrix(c(1,3,4,4,3,4,4,4,4), nrow=3)
myvector #show the vector
## [1] 1 3 4
mymatrix #show the matrix
##      [,1] [,2] [,3]
## [1,]    1    4    4
## [2,]    3    3    4
## [3,]    4    4    4
myvector[-2]  #drops the 2nd element, leaving the 1st and 3rd elements of myvector
## [1] 1 4
myvector[2]   #reports only the 2nd element of myvector
## [1] 3
mymatrix[,1]  #reports only the first column of mymatrix
## [1] 1 3 4
mymatrix[,-2] #reports all but the second column of mymatrix
##      [,1] [,2]
## [1,]    1    4
## [2,]    3    4
## [3,]    4    4
myseq=seq(1,10, by=.1)  #generates a sequence from 1 to 10 in steps of 0.1
myseq2=1:10  #generates the integers from 1 to 10
LETTERS[1:26]  #produces capital letters
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
## [20] "T" "U" "V" "W" "X" "Y" "Z"
letters[1:26]  #lower case
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
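
Two details from the examples above are easy to miss: a plain vector carries no row or column shape, and matrix() fills its values column by column unless told otherwise. The sketch below illustrates both with the same numbers; the names v, vals, by_col, by_row, and col_v are chosen only for this example.

```r
v <- c(1, 3, 4)
dim(v)                      # NULL: a plain vector has no row/column shape

vals <- c(1, 3, 4, 4, 3, 4, 4, 4, 4)
by_col <- matrix(vals, nrow = 3)                # default: fill down columns
by_row <- matrix(vals, nrow = 3, byrow = TRUE)  # fill across rows instead

by_col[1, 2]                # 4
by_row[1, 2]                # 3

col_v <- matrix(v, ncol = 1)  # an explicit 3 x 1 column vector
dim(col_v)                    # 3 1
by_col %*% col_v              # matrix-vector product, a 3 x 1 result
```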

Reading Data in R

Reading data is not too difficult in R, depending on the format. R can read data into vectors, matrices, data frames, and other structures. Here is a simple example of reading a CSV file into a data frame.

mydata=read.csv("d:/naturalgas.csv", stringsAsFactors = TRUE)  #read in the data
mydata$Amount=mydata$Amount/1000000  #rescale Amount to millions
head(mydata)  #show the first six rows
##   Year Month   Amount
## 1 2001     1 2.505011
## 2 2001     2 2.156873
## 3 2001     3 2.086568
## 4 2001     4 1.663832
## 5 2001     5 1.385163
## 6 2001     6 1.313119
#pasteddata=read.table(file="clipboard", sep="")  #pasting from the clipboard...Omitted.
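
The file d:/naturalgas.csv is not included with these notes, so the sketch below reads the same layout from a string instead, using the text argument of read.csv. The three rows are copied from the head(mydata) output above (Amount already rescaled); the names csv_text and demo are only for illustration.

```r
# Three rows copied from the head(mydata) output; Amount is already rescaled
csv_text <- "Year,Month,Amount
2001,1,2.505011
2001,2,2.156873
2001,3,2.086568"

# read.csv() accepts the file contents through its `text` argument
demo <- read.csv(text = csv_text)

str(demo)     # column types that R inferred
nrow(demo)    # 3
```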

Some Basic Functions

R makes sense and is Google friendly. If you don’t know how to do something, Google “how do I do XYZ in R?”

mean(mydata$Amount)
## [1] 1.81941567
median(mydata$Amount)
## [1] 1.690836
sd(mydata$Amount)
## [1] 0.3967468344
var(mydata$Amount)
## [1] 0.1574080506
fivenum(mydata$Amount)
## [1] 1.2401960 1.5065850 1.6908360 2.1028025 2.9953270
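
To see how these functions relate to one another, here is a small sketch on the six Amount values from the head(mydata) output. It only uses base R; the names x and fn are chosen for this example.

```r
x <- c(2.505011, 2.156873, 2.086568, 1.663832, 1.385163, 1.313119)

all.equal(var(x), sd(x)^2)   # TRUE: variance is the squared standard deviation

fn <- fivenum(x)             # min, lower hinge, median, upper hinge, max
names(fn) <- c("Min", "LowerHinge", "Median", "UpperHinge", "Max")
fn

summary(x)                   # similar summary; quartiles are computed slightly differently
quantile(x, c(0.05, 0.95))   # arbitrary percentiles
```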

Functions in R

R is easy to program. You can query functions as demonstrated in your book, but you can also write them easily.

myf = function(x)  {# this states that myf is a function.
  a1=mean(x)
  a2=median(x)
  a3=sd(x)
  y=c(a1,a2,a3) #concatenates the items into a vector
  names(y)=c("Mean","Median","SD")  #names them (the third item is the standard deviation)
  return(round(y,3))
}

myf(mydata$Amount)
##   Mean Median     SD 
##  1.819  1.691  0.397
for (i in 1:12){
  print(myf(mydata$Amount[mydata$Month==i]))
}
##   Mean Median     SD 
##  2.541  2.538  0.238 
##   Mean Median     SD 
##  2.302  2.319  0.188 
##   Mean Median     SD 
##  2.089  2.048  0.152 
##   Mean Median     SD 
##  1.671  1.664  0.096 
##   Mean Median     SD 
##  1.481  1.442  0.115 
##   Mean Median     SD 
##  1.467  1.454  0.134 
##   Mean Median     SD 
##  1.622  1.603  0.144 
##   Mean Median     SD 
##  1.648  1.608  0.128 
##   Mean Median     SD 
##  1.436  1.418  0.119 
##   Mean Median     SD 
##  1.526  1.507  0.120 
##   Mean Median     SD 
##  1.765  1.698  0.202 
##   Mean Median     SD 
##  2.266  2.226  0.203
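
A loop like the one above can also be written without explicit indexing. The sketch below is one possible alternative using split and sapply from base R. The data frame toy is invented for this example, and summ is a hypothetical stand-in for a summary function of this kind.

```r
# Toy data, invented for this example (two months, three values each)
toy <- data.frame(
  Month  = c(1, 1, 1, 2, 2, 2),
  Amount = c(2.5, 2.3, 2.6, 2.1, 2.2, 2.4)
)

# Hypothetical summary function: mean, median, and standard deviation
summ <- function(x) {
  round(c(Mean = mean(x), Median = median(x), SD = sd(x)), 3)
}

# split() groups Amount by Month; sapply() applies summ to each group
by_month <- sapply(split(toy$Amount, toy$Month), summ)
by_month   # one column per month, one row per statistic
```

The result is a single matrix, which is often easier to pass to a table or plot than twelve separately printed vectors.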

Plotting in R

R is simple to use for plots. And if you use the right packages, you can create some amazing diagrams. I will show you some simple ones.

par(mfrow=c(2,3)) #set up a 2x3 grid for plots
hist(mydata$Amount, breaks="STURGES",main="Histogram of Natural Gas Consumption", col="RED") #histogram
hist(mydata$Amount, breaks="STURGES", freq=FALSE,main="Histogram of Natural Gas Consumption", col="BLUE")
boxplot(mydata$Amount, main="Boxplot, NG Consumption", notch=TRUE, col="BLUE") #boxplot
boxplot(mydata$Amount~mydata$Month, main="Boxplot, NG Consumption", notch=TRUE, col="GREEN")
## Warning in bxp(list(stats = structure(c(2.306943, 2.450862, 2.538368,
## 2.6430165, : some notches went outside hinges ('box'): maybe set notch=FALSE
myts=ts(mydata$Amount, start=c(2001,01), frequency=12)  #declare a time series
plot(myts) #plots the time series
plot(forecast(myts,h=12))  #plot a forecast of myts using ETS

plot(decompose(myts)) #plot a decomposition of the time series.

plot(density(myts))  #plot a density
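
Because the natural gas file is not distributed with these notes, here is a self-contained sketch of the same kind of plots using AirPassengers, a monthly time series that ships with base R. It is an illustration under that substitution, not a reproduction of the figures above. It also saves and restores the par settings so the plotting grid does not carry over to later plots.

```r
# AirPassengers is a monthly time series shipped with base R (datasets package)
ap <- AirPassengers
frequency(ap)               # 12 observations per year

op <- par(mfrow = c(1, 2))  # 1 x 2 grid; keep the old settings in `op`
hist(as.numeric(ap), breaks = "Sturges", main = "Histogram", col = "red")
plot(ap, main = "Time series")
par(op)                     # restore the previous layout

parts <- decompose(ap)      # trend, seasonal, and random components
plot(parts)
```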

Basic Data Types

Stevens's typology (levels of measurement) is widely used in determining which statistics to use, which plots to generate, and which tests to conduct. Essentially, data are categorized as either quantitative or qualitative. Qualitative data are further subdivided into nominal or ordinal, and quantitative data into interval or ratio.

Qualitative data are data for which arithmetic operations make no sense (e.g., color). Nominal data are “in name only,” e.g., color; the categories have no order. Ordinal data have an order but no equal spacing (e.g., finishing position in a race). Quantitative data are data for which arithmetic operations make sense. Interval data have constant spacing between values but an arbitrary zero (e.g., temperature in Celsius). Ratio data have constant spacing and a true zero, so ratios are meaningful (e.g., temperature in Kelvin).

These levels of measurement help dictate which statistics and graphs are appropriate. We can begin to evaluate how pre-built data sets are defined (quantitative vs. qualitative) using the structure (str) function.

str(mydata)  # this displays the structure of the data
## 'data.frame':    176 obs. of  3 variables:
##  $ Year  : int  2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
##  $ Month : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Amount: num  2.51 2.16 2.09 1.66 1.39 ...
mydata$Year=factor(mydata$Year)  #converts the integer to a factor (qualitative)
mydata$Month=factor(mydata$Month)  #converts the integer to a factor (qualitative)
str(mydata) #shows that we have converted to factors
## 'data.frame':    176 obs. of  3 variables:
##  $ Year  : Factor w/ 15 levels "2001","2002",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Month : Factor w/ 12 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Amount: num  2.51 2.16 2.09 1.66 1.39 ...
table(mydata$Month)  #Looks at the number of observations by month
## 
##  1  2  3  4  5  6  7  8  9 10 11 12 
## 15 15 15 15 15 15 15 15 14 14 14 14
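
In R, nominal data map naturally to plain factors and ordinal data to ordered factors. The sketch below shows the difference on small made-up vectors (color, place, and yr are invented for this example), along with one common way to turn a numeric-looking factor back into numbers.

```r
# Nominal: categories with no order
color <- factor(c("red", "blue", "red", "green"))
levels(color)               # alphabetical by default

# Ordinal: categories with an order but no defined spacing
place <- factor(c("third", "first", "second", "first"),
                levels = c("first", "second", "third"),
                ordered = TRUE)
place[1] > place[2]         # TRUE: "third" ranks after "first"
table(place)                # counts follow the level order

# Converting a numeric-looking factor back to numbers
yr <- factor(c(2001, 2002, 2001))
as.numeric(as.character(yr))   # 2001 2002 2001, not the internal codes 1 2 1
```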

Describing Data, Easy Way

mydescribe=round(describe(mydata[,-1]),3)
mydescribe%>%kbl()%>%kable_classic(html_font = "Cambria")
        vars   n  mean    sd median trimmed   mad  min    max  range  skew kurtosis   se
Month*     1 176 6.409 3.443  6.000   6.394 4.448 1.00 12.000 11.000 0.035   -1.219 0.26
Amount     2 176 1.819 0.397  1.691   1.781 0.366 1.24  2.995  1.755 0.790   -0.373 0.03
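
More than one package exports a function called describe; the loading messages later in these notes show Hmisc masking the psych version once Hmisc is attached. One way to guard against that is to name the package explicitly with the :: operator, as sketched below. The call is wrapped in a requireNamespace check so the example still runs when psych is not installed; x and d are names chosen for this example.

```r
# Guarded so the example still runs when psych is not installed
if (requireNamespace("psych", quietly = TRUE)) {
  x <- c(2.505011, 2.156873, 2.086568, 1.663832, 1.385163, 1.313119)
  d <- psych::describe(x)   # always psych's version, whatever else is attached
  print(round(d, 3))
}
```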

Describing Data

Measures of center, variation, location, and shape are important for understanding data. Fortunately, R packages can compute all of these for us. For example, the “describe” function in the psych package generates many of these statistics. However, we can also generate our own.
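
As a rough illustration of generating our own statistics, the sketch below computes a few measures of center, variation, location, and shape with base R only. The function name my_describe is hypothetical, and the skewness and kurtosis are the simple moment-based versions, which is an assumption; other packages use slightly different formulas, so the numbers may not match theirs exactly.

```r
# Hypothetical helper: a few descriptive statistics from base R only.
# Skewness and kurtosis here are the simple moment-based versions;
# other packages use slightly different formulas.
my_describe <- function(x) {
  x  <- x[!is.na(x)]
  m  <- mean(x)
  m2 <- mean((x - m)^2)          # second central moment
  m3 <- mean((x - m)^3)          # third central moment
  m4 <- mean((x - m)^4)          # fourth central moment
  c(Min = min(x), Max = max(x),
    quantile(x, c(0.25, 0.75)),  # location: quartiles
    Mean = m, Median = median(x),
    s = sd(x), CV = 100 * sd(x) / m,
    Skew = m3 / m2^1.5,
    Kurtosis = m4 / m2^2)        # about 3 for normal data
}

res <- my_describe(c(1, 2, 3, 4, 10))
round(res, 4)
```

Returning a named vector keeps the output compact and lets the result be rounded or indexed by name, for example res["Skew"].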

## Package 'qcc' version 2.7
## Type 'citation("qcc")' for citing this R package in publications.
## Loading required package: HistData
## Loading required package: Hmisc
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
## 
## Attaching package: 'Hmisc'
## The following object is masked from 'package:psych':
## 
##     describe
## The following objects are masked from 'package:base':
## 
##     format.pval, units
## 
## Attaching package: 'UsingR'
## The following object is masked from 'package:survival':
## 
##     cancer
## The following object is masked from 'package:psych':
## 
##     headtail

## 
##   The decimal point is 1 digit(s) to the left of the |
## 
##   12 | 4
##   13 | 000133566688999
##   14 | 0111223334555566677899
##   15 | 000001111111333445666778888999
##   16 | 00112222444566777888
##   17 | 0001233333344457889
##   18 | 1346789
##   19 | 55777779
##   20 | 001456999
##   21 | 011266778
##   22 | 1339
##   23 | 1111222455667
##   24 | 0017
##   25 | 11234556
##   26 | 26
##   27 | 004
##   28 | 
##   29 | 0
##   30 | 0
## [[1]]
##      Min      Max       5%      25%      75%      95%     Mean   Median 
##   1.2402   2.9953   1.3574   1.5068   2.0999   2.5527   1.8194   1.6908 
##        s      s^2    MAveD      MAD       CV     Skew Kurtosis 
##   0.3967   0.1574   0.3345   0.3166  21.8063   0.7964   2.6570 
## 
## [[2]]
## 
##  Shapiro-Wilk normality test
## 
## data:  a
## W = 0.91, p-value = 1e-08
## 
## 
## [[3]]
## Estimated transformation parameter 
##    new 
## -2.875