Statistical Inference for Data Science: Core Concepts


IID Random Variables

Random variables are said to be independent and identically distributed (IID) if they are mutually independent and are all drawn from the same distribution. IID samples are important because they are the model for a random sample, which is the default starting point for most statistical inference.

Note: An IID sequence does not require that every outcome in the sample space be equally likely. For example, repeated throws of a loaded die produce an IID sequence even though the outcomes are biased.
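For instance, an IID sequence of throws of a loaded die can be simulated in R as follows (the probability weights below are invented purely for illustration):

# a loaded die: throws are IID, but the six faces are not equally likely
loaded_probs <- c(0.05, 0.10, 0.10, 0.15, 0.20, 0.40)
throws <- sample(1:6, size = 1000, replace = TRUE, prob = loaded_probs)
table(throws) / length(throws)  # empirical frequencies approximate loaded_probs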

Expected Value

In a probability distribution, the weighted average of the possible values of a random variable, with weights given by their respective theoretical probabilities, is known as the expected value, usually written E(X).

The expected value tells you what to expect in an experiment “in the long run”, after many trials. In many cases, the expected value is not itself a possible outcome in the sample space (see the dice example below).

Formula

The weighted average formula for expected value is given by multiplying each possible value for the random variable by the probability that the random variable takes that value, and summing all these products.
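As a quick worked example of this definition, the expected value of a single roll of a fair die can be computed directly:

# each face 1..6 of a fair die has probability 1/6
faces <- 1:6
probs <- rep(1/6, 6)
sum(faces * probs)  # 3.5 -- not itself a possible outcome of a single roll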

# simulate n rolls of a fair six-sided die
Roll1Die <- function(n) sample(1:6, n, replace = TRUE)
# sum of two dice, repeated 100 times
TwoDie <- Roll1Die(100) + Roll1Die(100)
hist(TwoDie, breaks = seq(1.5, 12.5, by = 1), ylim = c(0, 0.2), prob = TRUE, xlab = "Sum of Two Die Rolls", main = "Demonstration of Expected Value", col = "lightblue")

[Figure: histogram of the simulated sums of two dice ("Demonstration of Expected Value")]

# expected value = long-run average value over many repetitions of the experiment (7 for the sum of two fair dice)
mean(TwoDie)
## [1] 6.75



Statistical Robustness

A robust statistic is resistant to non-intuitive behavior in the results produced by deviations from assumptions (e.g., of normality). This means that if the assumptions are only approximately met, a robust estimator will still have reasonable efficiency and reasonably small bias, as well as being asymptotically unbiased, i.e., having a bias that tends toward 0 as the sample size tends toward infinity.
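As a small sketch of this idea (the numbers below are made up for illustration), compare how the mean and the median react when a single extreme outlier is added to a sample:

x <- c(2, 3, 3, 4, 5, 5, 6)   # hypothetical small sample
x_outlier <- c(x, 500)        # the same sample with one extreme outlier added
mean(x); mean(x_outlier)      # the mean jumps from 4 to 66
median(x); median(x_outlier)  # the median (a robust statistic) barely moves (4 to 4.5)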



Pearson’s Correlation Coefficient

Pearson's Correlation Coefficient is not robust: adding or removing a single outlier can significantly affect the results.
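A quick sketch of this sensitivity, using synthetic data rather than the Boston example below:

set.seed(42)
x <- rnorm(30)
y <- x + rnorm(30, sd = 0.5)     # strongly correlated synthetic data
cor(x, y)                        # high Pearson correlation
x2 <- c(x, 10); y2 <- c(y, -10)  # add a single extreme outlier
cor(x2, y2)                      # the coefficient changes drastically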

You can quantify the strength of a linear relationship between two variables by calculating a correlation coefficient; this corresponds to “R” in regression table output. The coefficient of determination, R-squared, is its square in simple linear regression.

library(MASS)    # Boston housing data
data(Boston)
attach(Boston)   # make the columns (rm, medv, ...) directly accessible
cor(rm, medv)    # Pearson correlation between average rooms and median home value
## [1] 0.6953599
plot(rm, medv, xlab = "Avg # of Rooms Per Dwelling", ylab = "Median Value of Occupied Homes (1000s)", main = "Visualizing the Relationship", type = "p", col = "red")
reg <- lm(medv ~ rm)   # simple linear regression of value on rooms
abline(reg, col = "blue", lwd = 5)

[Figure: scatterplot of rm vs. medv with the fitted regression line ("Visualizing the Relationship")]

Spearman’s Rank Correlation Coefficient

The Spearman correlation is less sensitive than the Pearson correlation to strong outliers in the tails of both samples.

Running the correlation again with the Spearman method, the results indicate a somewhat weaker relationship between the two variables:

cor(rm, medv, method = "spearman")
## [1] 0.6335764
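Spearman's coefficient is simply Pearson's correlation computed on the ranks of the data, which is why extreme values have less influence. A quick way to verify this with the same Boston columns:

# Spearman = Pearson correlation of the ranks (ties receive average ranks)
cor(rank(rm), rank(medv))  # should match the Spearman value above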



Coefficient of Variation

In order to compare the variation of one dataset to another, we use the coefficient of variation, which is the standard deviation divided by the mean (often expressed as a percentage, as in the function below).

# coefficient of variation, expressed as a percentage
CV <- function(sd, mean) {
  (sd / mean) * 100
}

CV(sd(rm), mean(rm))
## [1] 11.17992
CV(sd(medv), mean(medv))
## [1] 40.81651



Maximum Likelihood Estimation

Maximum likelihood estimation (MLE) is a method of estimating the parameters of a statistical model given data. For example, one may be interested in the heights of adult female penguins, but be unable to measure the height of every single penguin in the population due to cost or time constraints.

Assuming that the heights are normally distributed with some unknown mean and variance, the mean and variance can be estimated with MLE while only knowing the heights of a sample of the overall population. MLE accomplishes this by treating the mean and variance as parameters and finding the particular parameter values that make the observed results the most probable given the model.

Note:

The method of maximum likelihood selects the set of values of the model parameters that maximizes the likelihood function. Intuitively, this maximizes the “agreement” of the selected model with the observed data, and for discrete random variables it indeed maximizes the probability of the observed data under the resulting distribution.

# example data: number of right turns during 300 three-minute intervals at a specific intersection, modeled with a Poisson distribution
X <- c(rep(0,14), rep(1,30), rep(2,36), rep(3,68), rep(4, 43), rep(5,43), rep(6, 30), rep(7,14), rep(8,10), rep(9, 6), rep(10,4), rep(11,1), rep(12,1)) 

hist(X, main = "Histogram of Number of Right Turns", xlab = "Number of Right Turns per Interval", right = FALSE, prob = TRUE, col = "red", ylim = c(0, 0.25))

[Figure: histogram of the number of right turns per interval]

n <- length(X)
# negative log-likelihood of a Poisson(lambda) sample:
#   n*lambda - sum(X)*log(lambda) + sum(log(X!))
# minimizing this is equivalent to maximizing the log-likelihood
negloglike <- function(lam) { n * lam - sum(X) * log(lam) + sum(log(factorial(X))) }

#evaluate the function by providing a value for lambda
negloglike(0.3) 
## [1] 2583.046
# after defining the negative log-likelihood, we can perform the optimization as follows
# nlm() is R's non-linear minimization function
# in the output, $minimum is the minimized negative log-likelihood and $estimate is the MLE of the parameter (lambda, the mean, in this case)
out <- nlm(negloglike, p = c(0.5), hessian = TRUE)
out
## $minimum
## [1] 667.183
## 
## $estimate
## [1] 3.893331
## 
## $gradient
## [1] -2.569636e-05
## 
## $hessian
##          [,1]
## [1,] 77.03948
## 
## $code
## [1] 1
## 
## $iterations
## [1] 10
# check the numerical estimate against the sample mean, which is the closed-form MLE of lambda for a Poisson sample (they should be very similar)
mean(X)
## [1] 3.893333
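Because hessian = TRUE was requested, the returned Hessian can also be used to obtain an approximate standard error for the estimate (a sketch relying on the usual large-sample approximation):

# the inverse Hessian of the negative log-likelihood approximates the variance of the MLE
sqrt(1 / out$hessian[1, 1])  # roughly 0.114, essentially sqrt(lambda_hat / n)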