Class Notes
Fundamentals
- population - all the observations of interest
- Most of the time you won’t know the number of observations, N. We’re going to treat the population as unknown and unknowable.
- Observation -
- An observation must be independent from any other observation
- Sample - subset of observations drawn somehow from the population
- A known number of observations, n, to make a conclusion about the unknown population
- Representative sample or ‘microcosm’
- An unbiased sample
- We can never really be sure if our sample is representative
- If the interval \(\bar{x} \pm CI\) covers \(\mu\), we treat the sample as representative
- Random sample - a sample where every observation has equal chance of being picked
- This is how we get an unbiased sample
- Variable - characteristics measured and recorded from each observation in the sample
- Categorical or qualitative - not a real number to record
- Ordinal - there is a self-evident rank (ex: A-F grades)
- Nominal - no meaningful gradient or rank
- Quantitative
- Discrete - countable, integers
- Continuous
- Fractions make sense
- Theoretically an infinite number of values
- Parameter vs. estimate
- The population has parameters; the sample has estimates
- Inductive reasoning - making a conclusion about a larger whole from a part of that whole
- Statistical inference - making a conclusion about the population from a sample
- Making conclusions about parameters from estimates
- Errors
- Sampling error - variation from sample to sample
- Usually uses a confidence interval
- Measurement error - error due to imperfect measurements (ignored in this class)
- A true statistical inference needs to account for both
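The population/sample and parameter/estimate distinction above can be sketched in R. This is only an illustration under assumptions that are not in the notes: a simulated population of 10,000 normal values (mean 5, sd 2), a sample of n = 30, and a 95% t-based confidence interval. The names pop, samp, and ci are hypothetical.

```r
#Simulated "population" (a real population is unknown; this is for illustration only)
set.seed(1)
pop = rnorm(10000, mean=5, sd=2)
mu = mean(pop)                       #parameter
#Random sample: every observation has an equal chance of being picked
n = 30
samp = sample(pop, size=n, replace=FALSE)
xbar = mean(samp)                    #estimate
#Half-width of a 95% confidence interval based on the t distribution
half.width = qt(0.975, df=n-1) * sd(samp)/sqrt(n)
ci = c(lower=xbar-half.width, upper=xbar+half.width)
#Does the interval cover the population mean?
covers = ci["lower"] <= mu && mu <= ci["upper"]
covers
```

Rerunning with different seeds gives a different sample each time, which is the sampling error the confidence interval is meant to describe.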
Descriptive stats
- Summary statistics are concerned with central tendency and dispersion
- Central tendency values can be evaluated by being unique or efficient
- Unique - can’t have more than one
- Median and mean are unique, mode might not be
- Efficient - every number in the dataset goes into the calculation
- Mean is efficient, mode and median are not
- Dispersion
- Range = max - min
- Unique, but not efficient
- Variance
- Standard deviation
- Both unique and efficient
- Same units as the data
- Sample standard deviations use n-1 to avoid bias
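The summary statistics above can be computed by hand as a quick check. A minimal sketch, assuming a small made-up vector x (not from any handout); the mode is found with table() because base R has no function for the statistical mode.

```r
#Hypothetical data, not from the handouts
x = c(2, 4, 4, 5, 7, 9)
n = length(x)
#Central tendency
x.mean = mean(x)
x.median = median(x)
#Mode: most frequent value(s); can return more than one, so it is not always unique
counts = table(x)
x.mode = as.numeric(names(counts)[counts == max(counts)])
#Dispersion
x.range = max(x) - min(x)
#Sample variance divides by n-1; standard deviation is its square root
x.var = sum((x - x.mean)^2)/(n-1)
x.sd = sqrt(x.var)
#These should agree with the built-in functions
all.equal(x.var, var(x))
all.equal(x.sd, sd(x))
```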
Distributions
- Skew
- Refers to the tail that flares
- Most distributions we’re interested in are unimodal
- We, and the rest of the world, use the Gaussian (normal) distribution so often because of its symmetry
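One way to see which tail "flares" is a moment-based skewness coefficient. This is a sketch under assumptions not in the notes: the exponential sample is just a convenient right-skewed example, and skewness() is a hypothetical helper defined here (base R does not provide one).

```r
set.seed(2)
#Right-skewed sample: long tail toward large values
right = rexp(1000, rate=1)
#Moment-based skewness: positive = right tail flares, negative = left tail flares
skewness = function(x) {
  d = x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}
skewness(right)      #positive
skewness(-right)     #mirror image of the same data, negative
skewness(c(1,2,3))   #symmetric data, zero
```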
R Introduction Handout
- Repetitions (rep)
- Random sampling
- Loading and saving .Rdata
#Repeat the first 4 CAPITAL letters of the alphabet 4 times
rep(LETTERS[1:4],4)
## [1] "A" "B" "C" "D" "A" "B" "C" "D" "A" "B" "C" "D" "A" "B" "C" "D"
#Repeat 1 three times, 2 two times, 3 one time
rep(1:3,c(3,2,1))
## [1] 1 1 1 2 2 3
#Different ways of random sampling
sam1 = rnorm(20,5,2)
sam2 = sample(rnorm(10000,5,2), size=30, replace=FALSE)
#Saving data - this will save all values in the current global environment
# save.image(file.choose())
#Loading data
# load(file.choose())
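Two additions that may help when rerunning the code above, neither of which is in the handout: set.seed() to make random draws repeatable, and save() to store a single object instead of the whole environment. The object names and the temporary file are hypothetical.

```r
#Same seed, same draws
set.seed(42)
sam.a = rnorm(20,5,2)
set.seed(42)
sam.b = rnorm(20,5,2)
identical(sam.a, sam.b)
#Save one object to a temporary .Rdata file, remove it, then load it back
tmp = tempfile(fileext=".Rdata")
save(sam.a, file=tmp)
rm(sam.a)
load(tmp)
identical(sam.a, sam.b)
```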
Quartiles and Boxplot Handout (Litter Size)
- Data = litter size of 36 sows
- Formulas for quartiles - sort data, the \(i\)th ranked value where \(i\) is equal to:
- \(Q_1: i = (n+1) \times 0.25\)
- \(Q_2: i = (n+1) \times 0.50\)
- \(Q_3: i = (n+1) \times 0.75\)
- \(IQR = Q_3 - Q_1\) (a difference of values, not a rank)
- Formulas for percentiles
- \(j\)th percentile: \(i = (n+1) \times \frac{j}{100}\)
- Box plot
- Gives you a sense of the distribution’s shape
- The dots below the dashed vertical line indicate outliers
- The bottom of the box is the first quartile
- The band in the box is the second quartile (median)
- The top of the box is the third quartile
- Whiskers
- The whiskers can mean different things based on the type of boxplot
- By default, each whisker extends to the most extreme value within 1.5 \(\times\) IQR of the box
- Outliers are values outside of the whiskers
- This makes identification of outliers arbitrary; it all depends on what you choose
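The rank formulas above can be applied directly to the litter data. A minimal sketch: rank.quantile() is a hypothetical helper, and interpolating linearly between neighboring ranks when \(i\) is fractional is an assumption (the formulas only give the rank). It is valid only when \(1 \le i \le n\). As far as I know this rule corresponds to type = 6 in R's quantile(), whereas the default type = 7 uses a different rank and so can give different quartiles.

```r
litter = c(5,7,7,8,8,8,9,9,9,rep(10,9),rep(11,8),rep(12,5),13,13,13,14,14)
#Value at rank i = (n+1)*p, interpolating between neighbors when i is fractional
rank.quantile = function(x, p) {
  x = sort(x)
  n = length(x)
  i = (n+1)*p
  lo = floor(i)
  hi = ceiling(i)
  x[lo] + (i - lo)*(x[hi] - x[lo])
}
q = sapply(c(.25,.50,.75), rank.quantile, x=litter)
q
#Compare with quantile() using the same rank rule
quantile(litter, c(.25,.50,.75), type=6)
#The j-th percentile uses p = j/100, e.g. the 90th percentile
rank.quantile(litter, 90/100)
```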
#ggplot requires the ggplot2 library and all data in a data frame
library(ggplot2)
litter.df = data.frame(size=c(5,7,7,8,8,8,9,9,9,rep(10,9),rep(11,8),rep(12,5),13,13,13,14,14))
#Create boxplot variable using 'grammar of graphics'
litter.plot = ggplot(litter.df, aes(x="",y=size)) +
stat_boxplot(geom="errorbar", width=0.2) +
geom_boxplot(width = 0.4, outlier.shape=1) +
ylab("Litter Size") +
scale_y_continuous(breaks=seq(4,14,2)) +
theme(axis.title.x=element_blank(),axis.ticks.x=element_blank())
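#Quartiles and IQR (quantile() defaults to type = 7; type = 6 uses the (n+1) rank formulas)
litter.quartile = quantile(litter.df$size,c(.25,.50,.75))
#Named litter.iqr so the built-in IQR() function is not masked
litter.iqr = unname(litter.quartile[3]-litter.quartile[1])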
#Output boxplot
litter.plot

For the example above
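- Statistics (from R's default quantile(), type 7; the \((n+1)\) rank formulas above give \(Q_1\) = 9.25 and IQR = 2.75 instead):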
- Q1 = 9.75
- Q2 = 10.5
- Q3 = 12
- IQR = 2.25
- Other:
- Observation is a litter of pigs
- Variable is litter size
- Variable is discrete
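Using the statistics listed above, the 1.5 \(\times\) IQR rule can be worked through by hand to see which litters fall outside the whiskers. A sketch assuming the default quantile() method; the names lower.fence, upper.fence, and whiskers are hypothetical.

```r
litter = c(5,7,7,8,8,8,9,9,9,rep(10,9),rep(11,8),rep(12,5),13,13,13,14,14)
q = unname(quantile(litter, c(.25,.75)))   #9.75 and 12
iqr = q[2] - q[1]                          #2.25
#Fences: the farthest a whisker is allowed to reach
lower.fence = q[1] - 1.5*iqr               #6.375
upper.fence = q[2] + 1.5*iqr               #15.375
#Outliers lie beyond the fences
outliers = litter[litter < lower.fence | litter > upper.fence]
#Whiskers end at the most extreme observations inside the fences
whiskers = range(litter[litter >= lower.fence & litter <= upper.fence])
outliers
whiskers
```

Only the litter of 5 is flagged. A different multiplier or quartile method would move the fences, which is why outlier identification is a matter of choice.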
Ungrouped Frequency Table Handout (Litter Size)
- Mathematical definition of random variable
- Frequency as a function of observation
- Thus, we can essentially use a relative frequency table as the distribution of a random variable
- Hints
- Relative frequencies should always sum to 1
- CRF of largest \(x_i\) should = 1
- CF of largest \(x_i\) should = n
#This can be done a lot easier in Excel, but in case you want a reference on how to do this in R.
#It's a little confusing, so I'll walk you through it. The table() function constructs an
#ungrouped frequency table of your data with your variable of interest as a factor, and Freq as
#an integer. For the table to be useable, we need to convert it to a dataframe and change both
#columns to numeric. Converting straight from factor to numeric will give you the ranked order
#of the values, instead of the actual values. Thus, we must convert it to character first.
#Create frequency table
litter = c(5,7,7,8,8,8,9,9,9,rep(10,9),rep(11,8),rep(12,5),13,13,13,14,14)
freq.table = as.data.frame(table(litter))
freq.table$litter = as.numeric(as.character(freq.table$litter))
freq.table$Freq = as.numeric(freq.table$Freq)
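#Keep the unrounded relative frequencies numeric; format() returns character strings,
#so the cumulative sum is taken before rounding and formatting
rel = freq.table$Freq/sum(freq.table$Freq)
freq.table$Rel = format(round(rel,3), nsmall=3)
freq.table$CF = cumsum(freq.table$Freq)
freq.table$CRF = format(round(cumsum(rel),3), nsmall=3)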
#Display interactive table
library(DT)
datatable(data = freq.table,
options = list(ordering = FALSE,
dom = "t",
columnDefs = list(list(className = "dt-right", targets = "_all" ))))
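The three hints above can be checked numerically once the table is built. A minimal sketch that rebuilds the columns as plain numeric vectors (freq, rel, cf, crf are hypothetical names) so that nothing depends on the formatted display columns.

```r
litter = c(5,7,7,8,8,8,9,9,9,rep(10,9),rep(11,8),rep(12,5),13,13,13,14,14)
n = length(litter)
freq = as.vector(table(litter))   #frequency of each distinct litter size
rel = freq/sum(freq)              #relative frequency
cf = cumsum(freq)                 #cumulative frequency
crf = cumsum(rel)                 #cumulative relative frequency
#Hint checks; all.equal() allows for floating point rounding
all.equal(sum(rel), 1)            #relative frequencies sum to 1
all.equal(crf[length(crf)], 1)    #CRF of the largest x_i is 1
cf[length(cf)] == n               #CF of the largest x_i is n
```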
Grouped Frequency Table Handout (Dead Presidents)