Class Notes

Fundamentals

Population - all the observations of interest
- Most of the time you won’t know the number of observations, N. We’re going to treat the population as unknown and unknowable.
Observation -
- An observation must be independent from any other observation
Sample - subset of observations drawn somehow from the population
- A known number of observations, n, used to draw conclusions about the unknown population
- Representative sample or ‘microcosm’
  - An unbiased sample
  - We can never really be sure if our sample is representative
  - If the interval \(\bar{x} \pm CI\) overlaps \(\mu\), then we treat our sample as representative
- Random sample - a sample where every observation has equal chance of being picked
  - This is how we get an unbiased sample
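
The ideas of a random sample and of checking whether \(\bar{x} \pm CI\) overlaps \(\mu\) can be sketched in R. This is only an illustration under stated assumptions: the population is simulated (so \(\mu\) is known here, which it never is in practice), the seed is arbitrary, and the interval is a t-based 95% interval, which is one common choice rather than something these notes prescribe.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible

# Hypothetical population; in real work mu is unknown
population <- rnorm(10000, mean = 5, sd = 2)
mu <- mean(population)

# Simple random sample: every observation has the same chance of being picked
n <- 30
samp <- sample(population, size = n, replace = FALSE)

# 95% confidence interval for the mean (assumption: t-based interval)
xbar <- mean(samp)
half.width <- qt(0.975, df = n - 1) * sd(samp) / sqrt(n)
ci <- c(lower = xbar - half.width, upper = xbar + half.width)

# TRUE when the interval overlaps mu
covers.mu <- unname(ci["lower"] <= mu && mu <= ci["upper"])
covers.mu
```

With a different seed the interval will occasionally miss \(\mu\), which is why we can never be fully sure a sample is representative.
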
Variable - characteristics measured and recorded from each observation in the sample
- Categorical or qualitative - recorded as a label, not a real number
  - Ordinal - there is a self-evident rank (ex: A-F grades)
  - Nominal - no meaningful gradient or rank
- Quantitative
  - Discrete - countable, integers
    - Fractions make no sense
  - Continuous
    - Fractions make sense
    - Theoretically an infinite number of values
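
A rough sketch of how these variable types might be stored in R. The breed labels and weights are made-up values for illustration; only the letter grades and litter sizes echo examples from these notes.

```r
# Nominal: categories with no meaningful rank (hypothetical breed labels)
breed <- factor(c("Duroc", "Yorkshire", "Duroc", "Landrace"))

# Ordinal: categories with a self-evident rank (letter grades, F lowest)
grade <- factor(c("B", "A", "F", "C"),
                levels = c("F", "D", "C", "B", "A"),
                ordered = TRUE)

# Discrete: countable integers (litter sizes)
litter.size <- c(5L, 7L, 10L, 11L)

# Continuous: fractions make sense (hypothetical weights in kg)
weight <- c(1.42, 1.37, 1.55, 1.60)

is.ordered(breed)        # FALSE, no rank among breeds
is.ordered(grade)        # TRUE
grade[1] > grade[4]      # TRUE, "B" outranks "C"
is.integer(litter.size)  # TRUE
is.double(weight)        # TRUE
```
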
Parameter vs. estimate
- The population has parameters; the sample has estimates
Inductive reasoning - making a conclusion about a larger whole from a part of that whole
- Statistical inference - making a conclusion about the population from a sample
  - Making conclusions about parameters from estimates
Errors
- Sampling error - variation in an estimate from one sample to the next
  - Usually quantified with a confidence interval
- Measurement error - error due to imperfect measurements (ignored in this class)
- A true statistical inference needs to account for both
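
Sampling error can be made visible by drawing many samples from the same population and watching the estimate move. The sketch below assumes a simulated normal population and an arbitrary seed; the sample size of 20 and the 1000 repetitions are illustrative choices.

```r
set.seed(42)  # assumed seed for reproducibility

# Hypothetical population with mean 5 and standard deviation 2
population <- rnorm(10000, mean = 5, sd = 2)

# Draw 1000 samples of n = 20 and record each sample mean
sample.means <- replicate(1000, mean(sample(population, size = 20)))

# The estimates scatter around the parameter; that scatter is sampling error
mean(population)
range(sample.means)
sd(sample.means)   # roughly 2 / sqrt(20), the standard error of the mean
```
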

Descriptive stats

Summary statistics are concerned with central tendency and dispersion
Measures of central tendency can be judged by whether they are unique and efficient
- Unique - can’t have more than one
  - Median and mean are unique, mode might not be
- Efficient - every number in the dataset goes into the calculation
  - Mean is efficient, mode and median are not
Dispersion
- Range = max - min
  - Unique, but not efficient
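- Variance and standard deviation
  - Both unique and efficient
  - Standard deviation has the same units as the data; variance is in squared units
  - Sample variance divides by n-1 to avoid bias; the sample standard deviation is its square root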
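
As a worked example, the summary statistics above can be computed on the litter-size data from the handouts. The mode calculation is a small workaround, since base R has no built-in function for the statistical mode; the variable names are illustrative.

```r
litter <- c(5,7,7,8,8,8,9,9,9,rep(10,9),rep(11,8),rep(12,5),13,13,13,14,14)
n <- length(litter)

# Central tendency
mean(litter)     # uses every value (efficient)
median(litter)   # 10.5
# Mode: tabulate and keep the most frequent value(s); may not be unique in general
counts <- table(litter)
modes <- as.numeric(names(counts)[counts == max(counts)])   # 10

# Dispersion
litter.range <- diff(range(litter))   # max - min = 9
var(litter)
sd(litter)

# var() and sd() divide by n - 1
manual.var <- sum((litter - mean(litter))^2) / (n - 1)
all.equal(manual.var, var(litter))         # TRUE
all.equal(sqrt(manual.var), sd(litter))    # TRUE
```
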

Distributions

Skew
- Named for the side with the long tail (right-skewed = long right tail)
- Most distributions we’re interested in are unimodal
- We, and the rest of the world, use the Gaussian (normal) distribution so often because of its symmetry

R Introduction Handout

Repetitions (rep)
Random sampling
Loading and saving .Rdata

#Repeat the first 4 CAPITAL letters of the alphabet 4 times
rep(LETTERS[1:4],4)
##  [1] "A" "B" "C" "D" "A" "B" "C" "D" "A" "B" "C" "D" "A" "B" "C" "D"

#Repeat 1 three times, 2 two times, 3 one time
rep(1:3,c(3,2,1))
## [1] 1 1 1 2 2 3

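#Different ways of random sampling
#sam1: 20 draws straight from a normal distribution (mean 5, sd 2)
sam1 = rnorm(20,5,2)
#sam2: 30 draws without replacement from a simulated population of 10000
sam2 = sample(rnorm(10000,5,2), size=30, replace=FALSE)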

#Saving data - this will save all values in the current global environment
# save.image(file.choose())

#Loading data
# load(file.choose())

Quartiles and Boxplot Handout (Litter Size)

Data = litter size of 36 sows
Formulas for quartiles - sort the data, then take the \(i\)th ranked value (interpolating between neighbors when \(i\) is not a whole number), where \(i\) is:
- \(Q_1: i = (n+1) \times 0.25\)
- \(Q_2: i = (n+1) \times 0.50\)
- \(Q_3: i = (n+1) \times 0.75\)
- \(IQR = Q_3 - Q_1\)
Formulas for percentiles
- \(j\)th percentile: \(i = (n+1) \times \frac{j}{100}\)
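
The rank formulas can be checked by hand in R on the litter-size data. In this sketch `value.at.rank` is a hypothetical helper written for illustration (it assumes the rank lies between 1 and n). R's `quantile()` reproduces the \((n+1)\) rule when called with `type = 6`; its default, `type = 7`, uses a slightly different rank and can give a different answer on the same data.

```r
litter <- sort(c(5,7,7,8,8,8,9,9,9,rep(10,9),rep(11,8),rep(12,5),13,13,13,14,14))
n <- length(litter)

# Hypothetical helper: value at a fractional rank i, interpolating between neighbors
value.at.rank <- function(x, i) {
  lo <- floor(i)
  hi <- ceiling(i)
  x[lo] + (i - lo) * (x[hi] - x[lo])
}

# Ranks from the (n + 1) formulas: 9.25, 18.5, 27.75
i <- (n + 1) * c(0.25, 0.50, 0.75)
by.hand <- sapply(i, function(k) value.at.rank(litter, k))   # 9.25 10.5 12

# type = 6 uses the same (n + 1) * p rank
by.r <- quantile(litter, c(0.25, 0.50, 0.75), type = 6)

# The default (type = 7) puts Q1 at rank (n - 1) * 0.25 + 1 = 9.75
default.q1 <- quantile(litter, 0.25)   # 9.75
```
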
Box plot
- Gives you a sense of the distribution’s shape
- The dots below the dashed vertical line indicate outliers
- The bottom of the box is the first quartile
- The band in the box is the second quartile (median)
- The top of the box is the third quartile
Whiskers
- The whiskers can mean different things based on the type of boxplot
- The default in R and ggplot2 is for each whisker to reach the most extreme observation within 1.5 \(\times\) IQR of the box
- Outliers are values outside of the whiskers
  - This makes identification of outliers arbitrary; it all depends on what you choose
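
To see how the outlier rule depends on the chosen multiplier, here is a minimal sketch on the litter-size data. `flag.outliers` is a hypothetical helper, and the quartiles come from R's default `quantile()` rule, so the fences may differ slightly from a hand calculation.

```r
litter <- c(5,7,7,8,8,8,9,9,9,rep(10,9),rep(11,8),rep(12,5),13,13,13,14,14)

# Hypothetical helper: values lying more than k * IQR beyond the box
flag.outliers <- function(x, k = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75)))
  iqr <- q[2] - q[1]
  lower.fence <- q[1] - k * iqr
  upper.fence <- q[2] + k * iqr
  x[x < lower.fence | x > upper.fence]
}

flag.outliers(litter, k = 1.5)   # 5 is flagged
flag.outliers(litter, k = 3)     # nothing is flagged with wider fences
```

The same litter of 5 is an outlier under one multiplier and an ordinary observation under another, which is the sense in which the choice is arbitrary.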

#ggplot requires the ggplot2 library and all data in a data frame
library(ggplot2)
litter.df = data.frame(size=c(5,7,7,8,8,8,9,9,9,rep(10,9),rep(11,8),rep(12,5),13,13,13,14,14))

#Create boxplot variable using 'grammar of graphics'
litter.plot = ggplot(litter.df, aes(x="",y=size)) +
  stat_boxplot(geom="errorbar", width=0.2) + 
  geom_boxplot(width = 0.4, outlier.shape=1) + 
  ylab("Litter Size") + 
  scale_y_continuous(breaks=seq(4,14,2)) +
  theme(axis.title.x=element_blank(),axis.ticks.x=element_blank())

#Quartiles and IQR (stored as litter.iqr to avoid reusing the name of base R's IQR() function)
litter.quartile = quantile(litter.df$size,c(.25,.50,.75))
litter.iqr = litter.quartile[3]-litter.quartile[1]

#Output boxplot
litter.plot

For the example above

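Statistics (from quantile() above, which uses R's default type 7 rule; the \((n+1)\) rank formulas give \(Q_1 = 9.25\) and IQR = 2.75 instead):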
- Q1 = 9.75
- Q2 = 10.5
- Q3 = 12
- IQR = 2.25
Other:
- Observation is a litter of pigs
- Variable is litter size
- Variable is discrete

Ungrouped Frequency Table Handout (Litter Size)

Mathematical definition of random variable
- Frequency as a function of observation
- Thus, we can essentially use a relative frequency table as a random variable
Hints
- Relative frequencies should always sum to 1
- CRF of largest \(x_i\) should = 1
- CF of largest \(x_i\) should = n
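
These three hints can be checked directly in R. This is a small sketch on the litter-size data using unrounded values; the variable names are illustrative.

```r
litter <- c(5,7,7,8,8,8,9,9,9,rep(10,9),rep(11,8),rep(12,5),13,13,13,14,14)
n <- length(litter)

freq <- as.numeric(table(litter))   # frequency of each distinct litter size
rel  <- freq / n                    # relative frequency
cf   <- cumsum(freq)                # cumulative frequency
crf  <- cumsum(rel)                 # cumulative relative frequency

# all.equal() allows for floating-point rounding in the sums
isTRUE(all.equal(sum(rel), 1))        # relative frequencies sum to 1
isTRUE(all.equal(tail(crf, 1), 1))    # CRF of the largest x_i is 1
tail(cf, 1) == n                      # CF of the largest x_i is n
```
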

#This is a lot easier in Excel, but here is a reference in case you want to do it in R.
#It's a little confusing, so I'll walk you through it.  The table() function constructs an 
#ungrouped frequency table of your data, with your variable of interest as a factor and Freq as
#an integer.  For the table to be usable, we need to convert it to a data frame and change both 
#columns to numeric.  Converting straight from factor to numeric gives you the integer level 
#codes instead of the actual values, so we must convert to character first.

#Create frequency table
litter = c(5,7,7,8,8,8,9,9,9,rep(10,9),rep(11,8),rep(12,5),13,13,13,14,14)
freq.table = as.data.frame(table(litter))
freq.table$litter = as.numeric(as.character(freq.table$litter))
freq.table$Freq = as.numeric(freq.table$Freq)
rel = freq.table$Freq/sum(freq.table$Freq)  #unrounded; format() below returns text
freq.table$Rel = format(round(rel,3), nsmall=3)
freq.table$CF = cumsum(freq.table$Freq)
freq.table$CRF = format(round(cumsum(rel),3), nsmall=3)

#Display interactive table
library(DT)
datatable(data = freq.table, 
          options = list(ordering = FALSE, 
                         dom = "t", 
                         columnDefs = list(list(className = "dt-right", targets = "_all" ))))

Grouped Frequency Table Handout (Dead Presidents)

Bio 7405

Handouts

October 6, 2017
