Key Concepts covered in the lecture include:
1. Descriptive statistics vs. inferential statistics
2. Measures of central tendency
3. Sampling (Swain vs. Alabama)
4. Z-scores
5. Hypothesis Testing
6. Types of Error (I and II)
7. Assessing normality
8. Test Statistics and P-values
9. Parametric vs. non-parametric
10. Chi-squared
The in-class exercise covers basic descriptive statistics.
Part 1: What is the mean and standard deviation of annual salary?
# import the csv file
faculty <- read.csv("E:/Quant/inClassExercises/InClassExerciseData/faculty.csv")
names(faculty)
## [1] "AYSALARY" "R1" "R2" "R7" "PRIOREXP" "YRBG"
## [7] "YRRANK" "TERMDEG" "YRDG" "EMINENT" "FEMALE"
# The column name for annual salary is AYSALARY
# Using the summary() function, we can easily get some descriptive
# statistics about the data
summary(faculty$AYSALARY)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 23800 36800 46700 47800 57600 103000
# get the mean and standard deviation using built-in functions
mean(faculty$AYSALARY) #get the mean
## [1] 47801
sd(faculty$AYSALARY) #get the standard deviation using sd()
## [1] 13992
This can also be calculated manually:
# We can also calculate the mean manually by creating objects. We will still
# use built-in functions. This could probably also be done in a loop to
# achieve the same result, or a modified result (e.g. omitting outliers from
# the mean)
sumSal <- sum(faculty$AYSALARY) #add all of the observed salaries together
totObs <- length(faculty$AYSALARY) #get the number of observed salaries
sumSal/totObs #calculate the mean
## [1] 47801
# Check out Homework 1 for calculating the Root Mean Square Standard
# Deviation
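The same manual approach extends to the standard deviation. The following is a minimal sketch, assuming the sample formula with n - 1 in the denominator, which is what sd() uses. The salaries vector is made up for illustration and is not taken from faculty.csv.

```r
# Hypothetical salaries, made up for illustration (not from faculty.csv)
salaries <- c(30000, 42000, 47000, 55000, 65000)

n <- length(salaries)                    # number of observations
meanSal <- sum(salaries) / n             # mean, computed manually
sqDev <- (salaries - meanSal)^2          # squared deviations from the mean
manualSd <- sqrt(sum(sqDev) / (n - 1))   # sample SD divides by n - 1

manualSd       # manual result
sd(salaries)   # built-in result, should agree with the manual one
```

Dividing by n instead of n - 1 would give the population form, which is slightly smaller for the same data.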
Part 2: Is annual salary normally distributed?
# The standard deviation tells us about the spread of the data, but it
# tells us nothing about whether the data is skewed. We can assess this in
# multiple ways. First, we will assess it statistically, then we will look
# at it graphically.
# Statistical methods: Kolmogorov-Smirnov test
# ks.test(faculty$AYSALARY, 'pnorm') # this didn't work. A likely reason
# (not confirmed here) is that 'pnorm' defaults to a standard normal
# (mean 0, sd 1) unless mean and sd are supplied
# Shapiro-Wilk test: the null hypothesis is that the data is normally
# distributed
shapiro.test(faculty$AYSALARY)
##
## Shapiro-Wilk normality test
##
## data: faculty$AYSALARY
## W = 0.9771, p-value = 3.197e-09
# With a W statistic of 0.9771 and a p-value < 0.001, we can reject the
# null hypothesis and conclude that the data is NOT normally distributed.
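The commented-out ks.test() call above compares the salaries against 'pnorm' with its default parameters, i.e. a standard normal with mean 0 and sd 1. That may be why it did not behave as expected, though the notes do not say what went wrong. The sketch below passes the sample mean and sd instead. It uses simulated right-skewed data as a stand-in for faculty$AYSALARY (an assumption, since the CSV is not available here). Because the parameters are estimated from the same sample, the resulting p-value is only approximate.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated right-skewed "salaries" (assumption: stand-in for faculty$AYSALARY)
simSalary <- rlnorm(200, meanlog = log(45000), sdlog = 0.3)

# Default call: compares against a standard normal (mean 0, sd 1),
# so data on a salary scale is rejected regardless of its shape
defaultResult <- ks.test(simSalary, "pnorm")

# Compare against a normal with the sample's own mean and sd instead
ksResult <- ks.test(simSalary, "pnorm",
                    mean = mean(simSalary), sd = sd(simSalary))

defaultResult$statistic  # D statistic for the default comparison
ksResult$p.value         # p-value for the fitted-normal comparison
```

With the real data, the equivalent call would be ks.test(faculty$AYSALARY, "pnorm", mean = mean(faculty$AYSALARY), sd = sd(faculty$AYSALARY)). Repeated salary values would trigger a warning about ties.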
We can also assess normality using graphical methods.
# quantile-quantile (QQ Plot)
qqnorm(faculty$AYSALARY) #this creates a normal Q-Q plot. The closer the points fall to a straight line, the more normally distributed the data is
qqline(faculty$AYSALARY) #adds the reference line that the points should follow. We can see that the data isn't normally distributed
# histogram
hist(faculty$AYSALARY)
# We can also modify the appearance of the histogram
hist(faculty$AYSALARY, col = "red", breaks = seq(20000, 110000, by = 5000),
xlab = "Salary")
# boxplot
boxplot(faculty$AYSALARY, ylab = "Salary") #This allows us to better see the dispersion of the data and the presence of outliers (by default, points more than 1.5 times the IQR beyond the edges of the box)
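The points drawn individually on a boxplot can also be listed directly. The following is a minimal sketch of the 1.5 x IQR rule, using a made-up salaries vector (not taken from faculty.csv). boxplot.stats() is the function boxplot() uses to compute its statistics. It works from hinges rather than quantile(), so its fences can differ slightly from the manual ones.

```r
# Hypothetical salaries, made up for illustration (not from faculty.csv)
salaries <- c(30000, 36000, 42000, 47000, 50000, 55000, 58000, 103000)

q <- unname(quantile(salaries, c(0.25, 0.75)))  # first and third quartiles
iqr <- q[2] - q[1]                              # interquartile range
lower <- q[1] - 1.5 * iqr                       # lower fence
upper <- q[2] + 1.5 * iqr                       # upper fence

# Values outside the fences are flagged as potential outliers
flagged <- salaries[salaries < lower | salaries > upper]
flagged

# The values boxplot() would draw as individual points
boxOut <- boxplot.stats(salaries)$out
boxOut
```

For the real data, boxplot.stats(faculty$AYSALARY)$out would list the salaries plotted beyond the whiskers.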