Class Notes

Fundamentals

  • Population - all the observations of interest
    • Most of the time you won’t know the number of observations, N. We’re going to treat the population as unknown and unknowable.
  • Observation - a single member of the population from which variables are measured
    • An observation must be independent from any other observation
  • Sample - subset of observations drawn somehow from the population
    • A known number of observations, n, to make a conclusion about the unknown population
    • Representative sample or ‘microcosm’
      • An unbiased sample
      • We can never really be sure if our sample is representative
      • If the interval \(\bar{x} \pm CI\) contains \(\mu\), then our sample is representative
    • Random sample - a sample where every observation has equal chance of being picked
      • This is how we get an unbiased sample
  • Variable - characteristics measured and recorded from each observation in the sample
    • Categorical or qualitative - not a real number to record
      • Ordinal - self evident there is a rank (ex: A-F grades)
      • Nominal - no meaningful gradient or rank
    • Quantitative
      • Discrete - countable, integers
        • Fractions make no sense
      • Continuous
        • Fractions make sense
        • Theoretically an infinite number of values
  • Parameter vs. estimate
    • The population has parameters; the sample has estimates
  • Inductive reasoning - making a conclusion about a larger whole from a part of that whole
    • Statistical inference - making a conclusion about the population from a sample
      • Making conclusions about parameters from estimates
  • Errors
    • Sampling error - variation between samples drawn from the same population
      • Usually uses a confidence interval
    • Measurement error - error due to imperfect measurements (ignored in this class)
    • A true statistical inference needs to account for both
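
As a rough illustration of statistical inference, the sketch below simulates a population (something we could never do with a real one), draws a simple random sample, and computes a 95% confidence interval for the mean. The population values, the seed, and the names `pop`, `samp`, `ci`, and `covers` are assumptions made up for this example, not part of the class material.

```r
# Simulated population: in practice N and mu are unknown; here we invent them
set.seed(1)                       # arbitrary seed so the sketch is reproducible
pop = rnorm(10000, mean = 5, sd = 2)
mu = mean(pop)                    # the parameter

# Simple random sample: every observation has an equal chance of being picked
n = 30
samp = sample(pop, size = n, replace = FALSE)
xbar = mean(samp)                 # the estimate

# 95% confidence interval for the mean, based on the t distribution
ci = t.test(samp, conf.level = 0.95)$conf.int

# Does the interval contain the population mean?
covers = ci[1] <= mu & mu <= ci[2]
```

Across many repeated samples, roughly 95% of such intervals would be expected to contain \(\mu\); any single interval may or may not.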

Descriptive stats

  • Summary statistics are concerned with central tendency and dispersion
  • Central tendency values can be evaluated by being unique or efficient
    • Unique - can’t have more than one
      • Median and mean are unique, mode might not be
    • Efficient - every number in the dataset goes into the calculation
      • Mean is efficient, mode and median are not
  • Dispersion
    • Range = max - min
      • Unique, but not efficient
    • Variance
      • Unique and efficient, but in squared units of the data
    • Standard deviation
      • Unique and efficient
      • Same units as the data
    • Sample variance and standard deviation divide by n-1 instead of n to correct for bias
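
The summary statistics above can be sketched in R as follows, using the litter size data from the handouts later in these notes. The variable names are chosen for this example only, and the mode calculation is one possible approach since base R has no function for the statistical mode.

```r
# Litter sizes of 36 sows (same data as the litter size handouts)
litter = c(5,7,7,8,8,8,9,9,9,rep(10,9),rep(11,8),rep(12,5),13,13,13,14,14)
n = length(litter)

# Central tendency
litter.mean = mean(litter)       # efficient: uses every value
litter.median = median(litter)   # unique, but not efficient
# Mode: most frequent value(s); there may be more than one
counts = table(litter)
litter.mode = as.numeric(names(which(counts == max(counts))))

# Dispersion
litter.range = max(litter) - min(litter)
litter.var = sum((litter - litter.mean)^2) / (n - 1)   # sample variance by hand
litter.sd = sqrt(litter.var)                            # same units as the data
```

The hand-computed values should agree with the built-in `var(litter)` and `sd(litter)`, which also divide by n-1.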

Distributions

  • Skew
    • Refers to the tail that flares
    • Most distributions we’re interested in are unimodal
    • We, and the rest of the world, use the Gaussian (normal) distribution so often because of its symmetry

R Introduction Handout

  • Repetitions (rep)
  • Random sampling
  • Loading and saving .Rdata
#Repeat the first 4 CAPITAL letters of the alphabet 4 times
rep(LETTERS[1:4],4)
##  [1] "A" "B" "C" "D" "A" "B" "C" "D" "A" "B" "C" "D" "A" "B" "C" "D"

#Repeat 1 three times, 2 two times, 3 one time
rep(1:3,c(3,2,1))
## [1] 1 1 1 2 2 3

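#Different ways of random sampling
#Draw 20 values directly from a normal distribution with mean 5 and sd 2
sam1 = rnorm(20,5,2)
#Simulate a population of 10000 values, then draw 30 of them without replacement
sam2 = sample(rnorm(10000,5,2), size=30, replace=FALSE)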

#Saving data - this will save all values in the current global environment
# save.image(file.choose())

#Loading data
# load(file.choose())

Quartiles and Boxplot Handout (Litter Size)

  • Data = litter size of 36 sows
  • Formulas for quartiles - sort data, the \(i\)th ranked value where \(i\) is equal to:
    • \(Q_1: i = (n+1) \times 0.25\)
    • \(Q_2: i = (n+1) \times 0.50\)
    • \(Q_3: i = (n+1) \times 0.75\)
    • \(IQR = Q_3 - Q_1\)
  • Formulas for percentiles
    • \(j\)th percentile: \(i = (n+1) \times \frac{j}{100}\)
  • Box plot
    • Gives you a sense of the distribution’s shape
    • The dots below the dashed vertical line indicate outliers
    • The bottom of the box is the first quartile
    • The band in the box is the second quartile (median)
    • The top of the box is the third quartile
  • Whiskers
    • The whiskers can mean different things based on the type of boxplot
    • By default, each whisker extends to the most extreme observation within 1.5 \(\times\) IQR of the box
    • Outliers are values outside of the whiskers
      • This makes identification of outliers arbitrary; it all depends on what you choose
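
A minimal sketch of the 1.5 \(\times\) IQR rule, assuming the quartiles come from R's default `quantile()` method. The names `lower.fence`, `upper.fence`, `whisker.low`, and `whisker.high` are made up for this example.

```r
# Litter sizes of 36 sows
litter = c(5,7,7,8,8,8,9,9,9,rep(10,9),rep(11,8),rep(12,5),13,13,13,14,14)

# First and third quartiles (R's default quantile method)
q = quantile(litter, c(0.25, 0.75), names = FALSE)
q1 = q[1]
q3 = q[2]
iqr = q3 - q1

# Fences: observations beyond these are flagged as outliers
lower.fence = q1 - 1.5 * iqr
upper.fence = q3 + 1.5 * iqr
outliers = litter[litter < lower.fence | litter > upper.fence]

# Whiskers end at the most extreme observations still inside the fences
whisker.low = min(litter[litter >= lower.fence])
whisker.high = max(litter[litter <= upper.fence])
```

Replacing 1.5 with another multiplier changes which observations are flagged, which is why the identification of outliers is arbitrary.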
#ggplot requires the ggplot2 library and all data in a data frame
library(ggplot2)
litter.df = data.frame(size=c(5,7,7,8,8,8,9,9,9,rep(10,9),rep(11,8),rep(12,5),13,13,13,14,14))

#Create boxplot variable using 'grammar of graphics'
litter.plot = ggplot(litter.df, aes(x="",y=size)) +
  stat_boxplot(geom="errorbar", width=0.2) + 
  geom_boxplot(width = 0.4, outlier.shape=1) + 
  ylab("Litter Size") + 
  scale_y_continuous(breaks=seq(4,14,2)) +
  theme(axis.title.x=element_blank(),axis.ticks.x=element_blank())

#Quartiles and IQR
litter.quartile = quantile(litter.df$size,c(.25,.50,.75))
#Named litter.iqr to avoid reusing the name of R's built-in IQR() function
litter.iqr = litter.quartile[3]-litter.quartile[1]

#Output boxplot
litter.plot

For the example above

  • Statistics (from the R code above, which uses quantile()'s default method):
    • Q1 = 9.75
    • Q2 = 10.5
    • Q3 = 12
    • IQR = 2.25
  • Other:
    • Observation is a litter of pigs
    • Variable is litter size
    • Variable is discrete
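
The \((n+1)\) rank formulas from the handout and R's default `quantile()` do not use the same ranks, so they can give different quartiles. The sketch below compares them. The helper `rank.value` is hypothetical, and its linear interpolation between neighbouring ranks is an assumption, since the notes do not say how to handle a fractional \(i\).

```r
# Litter sizes of 36 sows, sorted so that x[i] is the i-th ranked value
litter = sort(c(5,7,7,8,8,8,9,9,9,rep(10,9),rep(11,8),rep(12,5),13,13,13,14,14))
probs = c(0.25, 0.50, 0.75)

# Handout rule: i = (n + 1) * p, interpolating when i is not a whole number
# (assumes 1 <= i <= n, which holds for the quartiles of this dataset)
rank.value = function(x, p) {
  i = (length(x) + 1) * p
  lo = floor(i)
  hi = ceiling(i)
  x[lo] + (i - lo) * (x[hi] - x[lo])
}
hand = sapply(probs, function(p) rank.value(litter, p))

# quantile() with type = 6 uses the same (n + 1) * p ranks
type6 = quantile(litter, probs, type = 6, names = FALSE)

# quantile()'s default is type = 7, which uses ranks (n - 1) * p + 1
type7 = quantile(litter, probs, names = FALSE)
```

For this dataset the \((n+1)\) rule gives Q1 = 9.25, Q2 = 10.5, Q3 = 12 (IQR = 2.75), while the default method gives the values listed above (Q1 = 9.75, IQR = 2.25).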

Ungrouped Frequency Table Handout (Litter Size)

  • Mathematical definition of random variable
    • Frequency as a function of observation
    • Thus, we can essentially use a relative frequency table as a random variable
  • Hints
    • Relative frequencies should always sum to 1
    • CRF of largest \(x_i\) should = 1
    • CF of largest \(x_i\) should = n
#This is much easier in Excel, but here is a reference for doing it in R.
#It's a little confusing, so I'll walk you through it.  The table() function constructs an 
#ungrouped frequency table of your data with your variable of interest as a factor, and Freq as
#an integer.  For the table to be usable, we need to convert it to a data frame and change both 
#columns to numeric.  Converting straight from factor to numeric will give you the ranked order 
#of the values, instead of the actual values.  Thus, we must convert it to character first.

#Create frequency table
litter = c(5,7,7,8,8,8,9,9,9,rep(10,9),rep(11,8),rep(12,5),13,13,13,14,14)
freq.table = as.data.frame(table(litter))
freq.table$litter = as.numeric(as.character(freq.table$litter))
freq.table$Freq = as.numeric(freq.table$Freq)
freq.table$Rel = format(round(freq.table$Freq/sum(freq.table$Freq),3), nsmall=3)
freq.table$CF = cumsum(freq.table$Freq)
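#Rel is now formatted text, so compute CRF from the numeric CF column to avoid accumulating rounding error
freq.table$CRF = format(round(freq.table$CF/sum(freq.table$Freq),3), nsmall=3)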

#Display interactive table
library(DT)
datatable(data = freq.table, 
          options = list(ordering = FALSE, 
                         dom = "t", 
                         columnDefs = list(list(className = "dt-right", targets = "_all" ))))
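
The hints above, and the factor-to-numeric pitfall described in the comments, can be checked with a short sketch. It keeps every column numeric instead of formatting for display. The names `codes`, `values`, `rel`, `cf`, and `crf` are made up for this example.

```r
# Litter sizes of 36 sows
litter = c(5,7,7,8,8,8,9,9,9,rep(10,9),rep(11,8),rep(12,5),13,13,13,14,14)
freq.table = as.data.frame(table(litter))

# Pitfall: as.numeric() on a factor returns the level codes (1, 2, 3, ...)
codes = as.numeric(freq.table$litter)
# Converting to character first recovers the actual litter sizes
values = as.numeric(as.character(freq.table$litter))

# Unrounded numeric columns
n = sum(freq.table$Freq)
rel = freq.table$Freq / n          # relative frequency
cf = cumsum(freq.table$Freq)       # cumulative frequency
crf = cumsum(rel)                  # cumulative relative frequency

# Hints from the handout, each of which should be TRUE
check.rel = isTRUE(all.equal(sum(rel), 1))          # relative frequencies sum to 1
check.cf = cf[length(cf)] == n                      # CF of largest x_i equals n
check.crf = isTRUE(all.equal(crf[length(crf)], 1))  # CRF of largest x_i equals 1
```

The comparisons use `all.equal()` because sums of fractions may differ from 1 by floating-point error.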

Grouped Frequency Table Handout (Dead Presidents)