Estimating Parameters from data - sample size, central tendency, variation, skew, kurtosis
SAMPLE SIZE - denoted by ‘n’. This is an important factor when sampling *** finish later
CENTRAL TENDENCY
# rm(x) # removes any existing vectors by the name x
distribution = read.table("C:\\temp\\mode.txt", header = TRUE)
attach(distribution)
names(distribution)
## [1] "fx" "fy" "fz" "x"
To understand distribution of fx and fy, we will not plot them on a histogram.
par(mfrow=c(1,2))
barplot(fx,names=as.character(x))
barplot(fy,names=as.character(x))
2. MEDIAN - shows the middle value from a ranked list. Not sensitive to outliers. In our example, each value of x is repeated fx number of times. To find and plot the median, we need to first create the actual list and then rank it.
y = rep(x,fx) # create actual list of values to find median for
y = sort(y) # rank this list
length(y) # find the total number of data points in this list
## [1] 33
This shows that the list contains an odd no. of elements (n=33). Therefore, median will be the (n/2)th or 17th element. The same can be calculated using R as follows:
y[ceiling(length(y)/2)] #celiling returns the integer part of our answer
## [1] 14
However, one can avoid the above code and use R’s built-in function.
median(y)
## [1] 14
3a. ARITHEMATIC MEAN
mean(y)
## [1] 13.78788
3b. GEOMETRIC MEAN - used to average variables that change exponentially/multiplicatively.
prod(y)^(1/length(y)) # approach 1 - nth root of the product of y values
## [1] 13.71531
exp(sum(log(y))/length(y)) # approach 2 - using logarithm ("log of times is plus")
## [1] 13.71531
3c. HARMONIC MEAN - Used for averaging variable with a “rate” characteristic.
1/mean(1/y)
## [1] 13.64209
# par(mfrow=c(1,1))
barplot(fz,names.arg = as.character(x))
Above bar plot shows a long-tail on the right side, therefore the distribution of fz is a positively skewed. Now, we will create the distribution x,fz (x contains labels, fz contains their frequencies) and then produce a summary of all measures of central tendency for it.
w = rep(x,fz)
AM = mean(w)
AM
## [1] 12.60606
GM = prod(w)^(1/(length(w)))
GM
## [1] 12.4038
HM = 1/mean(1/w)
HM
## [1] 12.22037
Median = median(w)
Median
## [1] 12
# Modes are at 11 and 16, with x=11 showing the largest mode.
VARIATION Here we will discuss the different ways to measure variation in a variable.
y=c(13,7,5,12,9,15,6,11,9,7,12)
plot(y,ylim = c(0,20))
1. Range - shows the min and max values of a variable
range(y)
## [1] 5 15
Range is highly impacted by outliers and the uncertainty it represents increases with increasing sample size. Therfore, we will try to find a more reliable measure of variability. One way is to fit the average value/mean of y through the data and measure how far each value is from the mean.
plot(y,ylim = c(0,20))
abline(mean(y),0)
Important concepts - sum of squares, mean square, degrees of freedom (n-p, where n is sample size, p is no. of parameters estimated from data)
Working on the ozone levels in Gardens A,B,C example:
A=c(3,4,4,3,2,3,1,3,5,2)
B=c(5,5,6,7,4,4,3,5,6,5)
C=c(3,3,2,1,10,4,3,11,3,10)
mean(A)
## [1] 3
mean(B)
## [1] 5
mean(C)
## [1] 5
Now, we have 3 samples and means of each. Now, we calculate vectors containing differences y-mean(y):
dA = A-3
dB = B-5
dC = C-5
Next, sum of squares of these differences:
SSA = sum(dA^2)
SSB = sum(dB^2)
SSC = sum(dC^2)
Obtaining variances by dividing each sum of squares by deg of freedom:
s2A = SSA/(length(A)-1)
s2B = SSB/(length(B)-1)
s2C = SSC/(length(C)-1)
s2A
## [1] 1.333333
s2B
## [1] 1.333333
s2C
## [1] 14.22222
Calculating variance using built-in function:
S2A = var(A)
S2B = var(B)
S2C = var(C)
S2A
## [1] 1.333333
S2B
## [1] 1.333333
S2C
## [1] 14.22222
Above example shows that: - 2 samples with the same mean can have very different variances - 2 samples with the same variance can have very different means Therefore, its important to check “variance ratio” for 2 samples as it should not be >4 (large variance/small variance.
When to use variance? - for establishing unreliability (Standard error = sqrt(variance/sample size)) - for testing hypotheses
sqrt(var(A)/length(A))
## [1] 0.3651484
sqrt(var(B)/length(B))
## [1] 0.3651484
sqrt(var(C)/length(C))
## [1] 1.19257
Constructing a Confidence Interval = Multiply the z or t-value with standard error
# For B, alpha=95%, deg of freedom = n-1 = 9
qt(0.025,9)*sqrt(var(B)/length(B))
## [1] -0.826023
Therefore, we can say that the average ozone conc in Garden B was Mean+-CI, i.e., 5+-0.826 (95% C.I.,n=10)
QUANTILES Creating a std normal dist of 1000 random numbers
z = rnorm(1000)
mean(z)
## [1] 0.003637856
quantile(z,c(0.025,0.975))
## 2.5% 97.5%
## -2.146163 1.990625
For a std normal dist, 2.5% of the values should be < -1.96 and 97.5% of the values <+1.96. Next trying with a larger sample size:
z = rnorm(10000)
mean(z)
## [1] 0.008095526
quantile(z,c(0.025,0.975))
## 2.5% 97.5%
## -1.926039 1.971935
Results are much more closer to what we expected with a larger sample size.
ROBUST ESTIMATOR - MEAN ABSOLUTE DEVIATION
Below we compare MAD vs SD on robustness
p = c(3,4,6,4,5,2,4,5,1,5,4,6)
mad(p)
## [1] 1.4826
sd(p)
## [1] 1.505042
mean(p)
## [1] 4.083333
Adding an outlier to our list
p1 = c(p,100)
mad(p1)
## [1] 1.4826
sd(p1)
## [1] 26.64149
mean(p1)
## [1] 11.46154
sd(p1)/mad(p1) # this calculates the ratio of b/w sd and mad for our updated sample
## [1] 17.96944
Therefore, if we allow a window of upto 4x differences in sd and MAD, anything outside can be defined as an outlier. Next, creating a function to check for outliers:
outlier = function(x) {
if(sd(x)>4*mad(x)) print("Outliers present")
else print("Deviation is reasonable")
}
outlier(p)
## [1] "Deviation is reasonable"
outlier(p1)
## [1] "Outliers present"
| SINGLE SAMPLE ESTIMATION |
r light=read.csv("C:\\temp\\Course Data\\light.csv", header=TRUE) names(light) |
## [1] "speed" |
r summary(light) |
## speed ## Min. : 650 ## 1st Qu.: 850 ## Median : 940 ## Mean : 909 ## 3rd Qu.: 980 ## Max. :1070 |
r hist(light$speed) |
| INTERQUARTILE RANGE = difference b/w 1st and 3rd quartile = 980-850 = 130. This is a great measure for testing outliers (=any value > 1.5x of the interquartile range above the 3rd quartile or below the 1st quartile). |
| “Non-normality, outliers and serial correlation can all invalidate inferences made by standard parametric tests like Student’s t-test. Much better in such cases to use a robust non-parametric technique like Wilcoxon’s signed-rank test.” |
| INFERENCE IN THE 1-SAMPLE CASE We want to test the hypothesis that Michelson’s estimate of the speed of light is significantly different from the value of 299,990 thought to prevail at the time. |
r #library(ctest) wilcox.test(light$speed,mu=990) |
## Warning in wilcox.test.default(light$speed, mu = 990): cannot compute exact p- ## value with ties |
## ## Wilcoxon signed rank test with continuity correction ## ## data: light$speed ## V = 22.5, p-value = 0.00213 ## alternative hypothesis: true location is not equal to 990 Result: Accept alternate hypothesis - true mu != 990 |
| For demonstration purposes, using the t-test on above problem: |
r t.test(light$speed,mu=990) |
## ## One Sample t-test ## ## data: light$speed ## t = -3.4524, df = 19, p-value = 0.002669 ## alternative hypothesis: true mean is not equal to 990 ## 95 percent confidence interval: ## 859.8931 958.1069 ## sample estimates: ## mean of x ## 909 The above results show that the hypothesized value (mu=990) is far outside the 95% CI. The sample mean is highly significantly different from the hypothesized mean. |
| Applying 2-sample t-test on the “ozone levels in 3 gardens” problem: Hnull = the mean ozone concs in Gardens A and B are the same (Diff in means = 0) Halt = the mean ozone concs in Gardens A and B are not the same (Diff in means != 0) |
r qt(0.975,18) |
## [1] 2.100922 Above is the critical value of student’s t (boundary established by setting a 95% CI net on a 2-tailed test) |
r t.test(A,B) |
## ## Welch Two Sample t-test ## ## data: A and B ## t = -3.873, df = 18, p-value = 0.001115 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## -3.0849115 -0.9150885 ## sample estimates: ## mean of x mean of y ## 3 5 Since the value of the test statistic is much larger than the critical t-value (sign of the test statistic does not matter in t-test), therefore we reject Hnull. Another way to look at it is that the CI of the test statistic does not include a “zero”, therefore the probability of the diff of means being zero is very low. |
WILCOX RANK SUM TEST - an alternative to the t-test when the standard errors are not normal (which implies that the variable distribution is not normal).
Looking at the standard errors in the ozone-garden example:
par(mfrow=c(1,2))
hist(B,breaks=c(0.5:11.5))
hist(C,breaks=c(0.5:11.5))
The errors in B look normal but that’s not the case with C. Performing the Wilcox test on this:
wilcox.test(B,C)
## Warning in wilcox.test.default(B, C): cannot compute exact p-value with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: B and C
## W = 66, p-value = 0.2349
## alternative hypothesis: true location shift is not equal to 0
Since p-value is large enough (>0.05), we accept the null hypothesis, i.e., mean ozone concs in Gardens B and C are the same. The last statement in the result above is only meant to display the alt hyp (just for clarity) and is not the final result being expressed by the function.
| TEST ON PAIRED SAMPLES - correlation expected between the 2 samples. Not effective when the correl is negative though. |
| Below data are a composite biodiversity score based on a kick sample of aquatic invertebrates.The elements of x and y are paired because the 2 samples were taken on the same river, upstream (y) or downstream (x) of a sewage outfall. |
| ```r x=c(20,15,10,5,20,15,10,5,20,15,10,5,20,15,10,5) y=c(23,16,10,4,22,15,12,7,21,16,11,5,22,14,10,6) |
| t.test(x,y) # 2-sample t-test ``` |
## ## Welch Two Sample t-test ## ## data: x and y ## t = -0.40876, df = 29.755, p-value = 0.6856 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## -5.248256 3.498256 ## sample estimates: ## mean of x mean of y ## 12.500 13.375 |
r t.test(x,y, paired = T) # paired t-test |
## ## Paired t-test ## ## data: x and y ## t = -3.0502, df = 15, p-value = 0.0081 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## -1.4864388 -0.2635612 ## sample estimates: ## mean of the differences ## -0.875 |
| According to the first test, there is no difference b/w means since p-value is >0.05. However, when we feed the function the information of the samples being paired, the results are completely different. Test 2 shows that the means of the 2 samples are significantly different since p-value < 0.05. |
CENTRAL LIMIT THEOREM
Generating 1000 random numbers from a negative binomial distribution with 2 parameters ‘size’ = 1, ‘probability’ = 0.2
par(mfrow=c(1,2))
y=rnbinom(1000,1,.2)
mean(y)
## [1] 4.079
var(y)
## [1] 19.61437
table(y)
## y
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
## 177 161 128 104 94 75 62 35 36 25 22 10 11 10 14 7 8 5 2 2
## 20 21 23 25 26 28 29 30
## 2 3 1 2 1 1 1 1
hist(y,breaks=-0.5:38.5)
Creating a normal distribution of the mean values:
my = numeric(1000)
for (i in 1:1000) {
y=rnbinom(30, 1, 0.2)
my[i]=mean(y)}
hist(my)
par(mfrow=c(1,1))