Lecture 3: Review Key Concepts

Key Concepts covered in the lecture include:
1. Descriptive statistics vs. inferential statistics
2. Measures of central tendency
3. Sampling (Swain vs. Alabama)
4. Z-scores
5. Hypothesis Testing
6. Types of Error (I and II)
7. Assessing normality
8. Test Statistics and P-values
9. Parametric vs. non-parametric
10. Chi-squared

The in-class exercise covers basic descriptive statistics.

Part 1: What is the mean and standard deviation of annual salary?

# import the csv file
faculty <- read.csv("E:/Quant/inClassExercises/InClassExerciseData/faculty.csv")

names(faculty)

##  [1] "AYSALARY" "R1"       "R2"       "R7"       "PRIOREXP" "YRBG"    
##  [7] "YRRANK"   "TERMDEG"  "YRDG"     "EMINENT"  "FEMALE"

# the column name for annual salary is AYSALARY


# Using the summary function, we can easily get some descriptive
# statistics about the data
summary(faculty$AYSALARY)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   23800   36800   46700   47800   57600  103000


# get the mean and standard deviation using built-in functions
mean(faculty$AYSALARY)  #get the mean

## [1] 47801

sd(faculty$AYSALARY)  #get the standard deviation using sd()

## [1] 13992

This can also be calculated manually:

# We can also calculate the mean manually by creating objects. We will still
# use built-in functions. This could probably also be done in a loop to
# achieve the same result, or a modified result (e.g. omitting outliers
# from the mean).

sumSal <- sum(faculty$AYSALARY)  #add all of the observed salaries together
totObs <- length(faculty$AYSALARY)  #get the number of observed salaries
sumSal/totObs  #calculate the mean

## [1] 47801


# Check out Homework 1 for calculating the Root Mean Square Standard
# Deviation
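
The same manual approach extends to the standard deviation. The sketch below is a minimal illustration on a small made-up salary vector (`salaries` is hypothetical, not the faculty data). It shows both the sample version that `sd()` computes (dividing by n - 1) and the root mean square version (dividing by n). Which of the two Homework 1 asks for is an assumption here.

```r
# Hypothetical salaries, used only for illustration (not the faculty data)
salaries <- c(31000, 42000, 47000, 52000, 68000)

n <- length(salaries)        # number of observations
avg <- sum(salaries) / n     # mean, calculated manually
sqDev <- (salaries - avg)^2  # squared deviations from the mean

sdSample <- sqrt(sum(sqDev) / (n - 1))  # sample SD, the version sd() returns
sdRMS <- sqrt(sum(sqDev) / n)           # root mean square (population) SD

sdSample
sdRMS
sd(salaries)  # should agree with sdSample
```

The two versions differ only by the factor sqrt((n - 1) / n), so they are nearly identical for a large data set.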

Part 2: Is annual salary normally distributed?

# The standard deviation tells us about the spread of the data, but it
# does not tell us whether the data is skewed. We can check this in
# multiple ways. First, we will assess normality statistically, then
# we will look at it graphically.

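# Statistical methods: Kolmogorov-Smirnov test
# ks.test(faculty$AYSALARY, 'pnorm')  # this didn't give a usable result,
# probably because 'pnorm' defaults to mean 0 and sd 1; the mean and sd of
# the data would need to be passed as extra arguments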

# Shapiro-Wilk test: the null hypothesis is that the data is normally
# distributed
shapiro.test(faculty$AYSALARY)

## 
##  Shapiro-Wilk normality test
## 
## data:  faculty$AYSALARY 
## W = 0.9771, p-value = 3.197e-09


# With a W statistic of 0.9771 and a p-value < 0.001, we reject the
# null hypothesis and conclude that the data is NOT normally distributed.
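
For comparison, here is a hedged sketch of how the Kolmogorov-Smirnov test can be run against a normal distribution with the same mean and standard deviation as the data. It uses simulated, right-skewed values (`simSalary` is an assumption, not the faculty data). Because the mean and standard deviation are estimated from the same sample, the reported p-value is only approximate, so treat this as an illustration rather than a replacement for the Shapiro-Wilk test.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated right-skewed "salaries"; hypothetical, not the faculty data
simSalary <- rlnorm(200, meanlog = 10.7, sdlog = 0.3)

# Without extra arguments, 'pnorm' is the standard normal (mean 0, sd 1),
# so values in the tens of thousands are as far from it as possible
ksStandard <- ks.test(simSalary, "pnorm")
ksStandard$statistic  # D is essentially 1

# Pass the sample mean and sd so the comparison is on the right scale
ksScaled <- ks.test(simSalary, "pnorm",
                    mean = mean(simSalary), sd = sd(simSalary))
ksScaled
```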

We can look at this using graphical methods.


# quantile-quantile (QQ Plot)
qqnorm(faculty$AYSALARY)  # creates a normal QQ plot. The closer the points fall to a straight line, the more normally distributed the data is
qqline(faculty$AYSALARY)  # adds the reference line the points should follow. We can see that the data isn't normally distributed
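
To see what `qqnorm()` is plotting, the points can be reconstructed by hand: the sorted data go on the y-axis and the matching standard normal quantiles go on the x-axis. A minimal sketch on simulated values (`simSalary` is hypothetical, not the faculty data):

```r
set.seed(1)  # make the simulated data reproducible
simSalary <- rlnorm(50, meanlog = 10.7, sdlog = 0.3)  # hypothetical data

n <- length(simSalary)
theoretical <- qnorm(ppoints(n))  # standard normal quantiles for n points
observed <- sort(simSalary)       # sample quantiles (the sorted data)

# qqnorm() can return its coordinates without drawing anything
qq <- qqnorm(simSalary, plot.it = FALSE)

all.equal(sort(qq$x), theoretical)  # x-axis: theoretical quantiles
all.equal(sort(qq$y), observed)     # y-axis: sorted observations
```

Calling `plot(theoretical, observed)` would draw the same points that `qqnorm()` shows.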



# histogram
hist(faculty$AYSALARY)


# We can also modify the appearance of the histogram
hist(faculty$AYSALARY, col = "red", breaks = seq(20000, 110000, by = 5000), 
    xlab = "Salary")
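
The `breaks` argument sets the bin edges explicitly. As a rough illustration on a made-up vector (`salaries` and `binEdges` are hypothetical names), `hist()` with `plot = FALSE` returns the count in each bin without drawing, which makes it easy to check how the observations were grouped:

```r
# Hypothetical salaries, used only for illustration (not the faculty data)
salaries <- c(23800, 36800, 41000, 46700, 47800, 52000, 57600, 103000)

# Bin edges every 5000; they must span the full range of the data,
# otherwise hist() stops with an error
binEdges <- seq(20000, 110000, by = 5000)

h <- hist(salaries, breaks = binEdges, plot = FALSE)

h$counts              # number of observations in each bin
length(binEdges) - 1  # number of bins
sum(h$counts)         # every observation is counted once
```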



# boxplot
boxplot(faculty$AYSALARY, ylab = "Salary")  # This lets us better see the dispersion of the data and the presence of outliers (by default, points more than 1.5 x IQR beyond the edges of the box)
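
The points a boxplot draws as outliers can also be found numerically. The sketch below uses a small made-up vector (`salaries` is hypothetical, not the faculty data) and the default rule of 1.5 times the interquartile range beyond the hinges, then compares the result with `boxplot.stats()`:

```r
# Hypothetical salaries, used only for illustration (not the faculty data)
salaries <- c(30, 35, 38, 40, 42, 45, 47, 50, 55, 103) * 1000

hinges <- fivenum(salaries)    # min, lower hinge, median, upper hinge, max
iqr <- hinges[4] - hinges[2]   # width of the box

lowerFence <- hinges[2] - 1.5 * iqr  # anything below this is an outlier
upperFence <- hinges[4] + 1.5 * iqr  # anything above this is an outlier

outliers <- salaries[salaries < lowerFence | salaries > upperFence]
outliers

boxplot.stats(salaries)$out  # the outliers boxplot() would draw
```

The hinges from `fivenum()` are used here instead of `quantile()` because they are what `boxplot.stats()` uses; the two can differ slightly for small samples.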
