Download this RData file to your working directory: http://courses.edx.org/c4x/HarvardX/PH525.1x/asset/skew.RData. Then load the data into R with the following command:
load("skew.RData")
# You should have a 1000 x 9 dimensional matrix 'dat':
dim(dat)
## [1] 1000 9
Using QQ-plots, compare the distribution of each column of the matrix to a normal. That is, use qqnorm() on each column. To accomplish this quickly, you can use the following line of code to set up a grid for 3x3=9 plots.
# Note: "mfrow" means we want a multifigure grid filled in row-by-row. Another choice is mfcol.
par(mfrow = c(3,3))
# Then you can use a for loop to loop through the columns and display one qqnorm() plot at a time. Replace the text between ** with your own code.
for (i in 1:9) {
# **put your qqnorm call here**
x <- dat[,i]
qqnorm(x, main=paste0("Q-Q plot for column ", i))
qqline(x)
}
# You can use the following line to reset your graph to just show one at a time:
# par(mfrow=c(1,1))
Identify the two columns which are skewed. (ANS: columns 4 and 9.)
Examine each of these two columns using a histogram.
hist(dat[,4])
hist(dat[,9])
Note which column has “positive skew”, in other words the histogram shows a long tail to the right (toward larger values) [#4]. Note which column has “negative skew”, that is, a long tail to the left (toward smaller values) [#9]. Note that positive skew looks like an up-shaping curve in a qqnorm() plot, while negative skew looks like a down-shaping curve.
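To see these shapes outside the assessment data, here is a small simulated illustration (an optional sketch; the exponential distribution is just one convenient skewed example):
set.seed(1)
pos <- rexp(1000)   # positively skewed: long right tail
neg <- -rexp(1000)  # negatively skewed: long left tail
par(mfrow=c(1,2))
qqnorm(pos, main="Positive skew: curves up")
qqline(pos)
qqnorm(neg, main="Negative skew: curves down")
qqline(neg)
par(mfrow=c(1,1))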
QUESTION 2.1 (1/1 point) Which column has positive skew (a long tail to the right)? - #4
QUESTION 2.2 (1/1 point) Which column has negative skew (a long tail to the left)? - #9
The InsectSprays data set measures the counts of insects in agricultural experimental units treated with different insecticides. This dataset is included in R, and you can examine it by typing:
head(InsectSprays)
## count spray
## 1 10 A
## 2 7 A
## 3 20 A
## 4 14 A
## 5 14 A
## 6 12 A
# Try out two equivalent ways of drawing boxplots in R, using the InsectSprays dataset:
# 1) using split:
boxplot(split(InsectSprays$count, InsectSprays$spray), main="split(values, factor) version")
# 2) using a formula:
boxplot(InsectSprays$count ~ InsectSprays$spray, main="values ~ factor version")
QUESTION 3.1 (1 point possible) Note: this question has 2 attempts, because there are only 5 options…
Which spray seems the most effective (has the lowest median)?
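One way to check this directly is to compute the per-spray medians (an optional sketch; tapply is one of several ways to do this):
sort(tapply(InsectSprays$count, InsectSprays$spray, median))
The spray with the smallest median count, spray C, is the answer.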
Load the father's and son's heights data by installing the UsingR package.
Here we will use the plots we've learned about to explore a dataset: some stats on a random sample of runners from the New York City Marathon in 2002. This data set can be found in the UsingR package (used in the previous assessment). Load the library and then load the nym.2002 data set with the following lines:
#install.packages("UsingR")
library(UsingR)
data(nym.2002)
# Examine the head() of the data in nym.2002:
head(nym.2002)
##        place gender age home     time
## 3475    3592   Male  52  GBR 217.4833
## 13594  13853 Female  40   NY 272.5500
## 12012  12256   Male  31  FRA 265.2833
## 10236  10457 Female  33   MI 256.1500
## 9476    9686   Male  33   NY 252.2500
## 1720    1784   Male  40   NJ 201.9667
# Try the following plots:
# Histogram of times:
hist(nym.2002$time, main="Histogram of times in the nym.2002 data set")
# Plot of runners' ages vs. their times:
plot(nym.2002$time, nym.2002$age, main="Age vs. runners' times in the nym.2002 data set")
# Plot of runners' times vs. their places in the race:
plot(nym.2002$time, nym.2002$place, main="Place vs. runners' times in the nym.2002 data set")
# qqnorm() of the runners' times:
qqnorm(nym.2002$time, main="Runners' times in nym.2002 vs. a normal distribution")
qqline(nym.2002$time)
# Are the times normally distributed? Positive or negative skew?
# ANS: Not normal; positive skew, due to the long tail on the right.
# A barplot of the most common locations of origin.
# Hint: tail(sort(table(nym.2002$home)), 10)
barplot(tail(sort(table(nym.2002$home)), 15))
# A boxplot of the runners' times split by gender:
boxplot(split(nym.2002$time, nym.2002$gender), main="Runners' times by gender")
In the previous video, we saw that multiplicative changes are symmetric around 0 when we are on the logarithmic scale. In other words, if we use the log scale, 1/2 times a number x, and 2 times a number x, are equally far away from x. We will explore this with the NYC marathon data.
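A quick numeric check of this symmetry (optional):
x <- 100
log2(c(x/2, x, 2*x))
## [1] 5.643856 6.643856 7.643856
# halving and doubling x are each exactly 1 log2 unit away from log2(x)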
# Create a vector time of the sorted times:
time = sort(nym.2002$time)
# ------
# QUESTION 2.1 (1 point possible)
# What is the fastest time divided by the median time?
fast.over.median <- min(time)/median(time)
paste("Fastest over median time is ", format(fast.over.median, digits=3), sep="")
## [1] "Fastest over median time is 0.561"
# ANS: 0.561
# QUESTION 2.2 (1 point possible)
# What is the slowest time divided by the median time?
slow.over.median <- max(time)/median(time)
paste("Slowest over median time is ", format(slow.over.median, digits=3), sep="")
## [1] "Slowest over median time is 2.16"
# ANS: 2.16
Compare the following two plots.
1) A plot of the ratio of times to the median time, with horizontal lines at twice as fast as the median time, and twice as slow as the median time:
plot(time/median(time), ylim=c(1/4,4), main="Plot on the ratio scale")
abline(h=c(1/2,1,2))
2) A plot of the log2 ratio of times to the median time. The horizontal lines indicate the same as above, twice as fast and twice as slow:
plot(log2(time/median(time)), ylim=c(-2,2), main="Plot on the log2 scale")
abline(h=-1:1)
Note that the lines are equally spaced in the second (log2-scale) plot.
Read in the msleep dataset from the CSV file available [here](https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/msleep_ggplot2.csv), then load it into R:
(note: we’ve given you 20 total attempts at this question)
# Get the data from a URL
library(RCurl) # may be needed to work around issues reading https URLs on some systems
## Loading required package: bitops
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/msleep_ggplot2.csv"
filename <- "msleep_ggplot2.csv"
# Read from the local file if it exists; otherwise fetch the CSV from the URL:
if (file.exists(filename)) {
  msleep <- read.csv(filename)
} else {
  msleep <- read.csv(text = getURL(url))
}
#install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:Hmisc':
##
## combine, src, summarize
##
## The following object is masked from 'package:MASS':
##
## select
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# QUESTION 1.1 (1 point possible)
# Using dplyr and the pipe command %>%, perform the following steps:
# 1) Add a column of the proportion of REM sleep to total sleep
msleep %>%
mutate(rem_proportion = sleep_rem / sleep_total) %>%
# 2) Group the animals by their taxonomic order
group_by(order) %>%
# 3) Summarise by the median REM proportion
# Note: to match the expected answer, do not use na.rm=TRUE in median()
summarise(median.rem = median(rem_proportion), total = n()) %>%
# 4) Arrange by the median REM proportion
arrange(median.rem) %>%
# Take the head() of this to see just the orders
# with smallest median REM proportion
head
## Source: local data frame [6 x 3]
##
## order median.rem total
## 1 Hyracoidea 0.09433962 3
## 2 Lagomorpha 0.10714286 1
## 3 Diprotodontia 0.13326100 2
## 4 Afrosoricida 0.14743590 1
## 5 Chiroptera 0.14923603 2
## 6 Pilosa 0.15277778 1
# What is the median REM proportion of the order with the smallest median REM proportion?
# ANS: Hyracoidea 0.09433962
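As an optional cross-check, the same answer can be reproduced in base R (a sketch using tapply on the same msleep columns):
rem_prop <- msleep$sleep_rem / msleep$sleep_total
med_by_order <- tapply(rem_prop, msleep$order, median) # NA for orders with missing sleep_rem
head(sort(med_by_order), 1) # sort() drops NAs by default; the smallest is Hyracoidea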
Robust Summaries
We’re going to explore the properties of robust statistics, using a dataset of the weight of chicks in grams as they grow from day 0 to day 21. This dataset also splits up the chicks by different protein diets, which are coded from 1 to 4.
This dataset is built into R, and can be loaded with:
data(ChickWeight)
# Just to start, take a look at the weights of all observations over time, and color this by the Diet:
plot(ChickWeight$Time, ChickWeight$weight, col=ChickWeight$Diet)
First, in order to easily compare weights at different time points across the different chicks, we will use the reshape() function in R to change the dataset from a “long” shape to a “wide” shape. Long data and wide data are useful for different purposes (for example, the plotting library ggplot2 and the manipulation library dplyr want to have data in the long format).
head(ChickWeight)
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
chick = reshape(ChickWeight,idvar=c("Chick","Diet"),timevar="Time",direction="wide")
# The meaning of this line: reshape the data from long to wide, where the columns Chick and Diet are the IDs and the column Time indicates different observations for each ID. Now examine the head of this dataset:
head(chick)
## Chick Diet weight.0 weight.2 weight.4 weight.6 weight.8 weight.10
## 1 1 1 42 51 59 64 76 93
## 13 2 1 40 49 58 72 84 103
## 25 3 1 43 39 55 67 84 99
## 37 4 1 42 49 56 67 74 87
## 49 5 1 41 42 48 60 79 106
## 61 6 1 41 49 59 74 97 124
## weight.12 weight.14 weight.16 weight.18 weight.20 weight.21
## 1 106 125 149 171 199 205
## 13 122 138 162 187 209 215
## 25 115 138 163 187 198 202
## 37 102 108 136 154 160 157
## 49 141 164 197 199 220 223
## 61 141 148 155 160 160 157
# The only remaining step is to remove any chicks that have missing observations at any time point (NA, for "not available"). The following line identifies these rows and removes them:
chick = na.omit(chick)
QUESTION 2.1 (1 point possible) We will focus on the chick weights on day 4 (check the column names of ‘chick’ and note the numbers). How much does the average of chick weights at day 4 increase if we add an outlier measurement of 3000 grams? Specifically what is the average weight of the day 4 chicks including the outlier chick divided by the average of the weight of the day 4 chicks without the outlier. Hint: use c() to add a number to a vector.
day4 <- chick$weight.4
day4plus <- c(day4,3000)
wt.ratio <- mean(day4plus)/mean(day4)
# ANS: 2.062
QUESTION 2.2 (1 point possible) In the problem above we saw how sensitive the mean is to outliers. Now let's see what happens when we use the median instead of the mean. Compute the same ratio, but now using the median instead of the mean. Specifically, what is the median weight of the day 4 chicks including the outlier chick divided by the median of the weight of the day 4 chicks without the outlier?
wt.ratio.median <- median(day4plus)/median(day4) # ANS: 1
QUESTION 2.3 (1 point possible) Now try the same thing with the sample standard deviation (the sd() function in R). Add a chick with weight 3000 grams to the chick weights from day 4. How much does the standard deviation change? What's the standard deviation with the outlier chick divided by the standard deviation without the outlier chick?
wt.ratio.sd <- sd(day4plus)/sd(day4) # ANS: 101.29
paste("SD ratio ", wt.ratio.sd)
## [1] "SD ratio 101.285878530203"
wt.ratio.mad <- mad(day4plus)/mad(day4) # ANS: 1
paste("MAD ratio ", wt.ratio.mad)
## [1] "MAD ratio 1"
Compare this to the median absolute deviation in R, which is calculated with the mad() function. Note that the MAD is unaffected by the addition of a single outlier. The mad() function in R includes the scaling factor 1.4826, mentioned in the video, such that mad(x) and sd(x) are very similar for a sample from a normal distribution.
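A small simulation (an optional sketch) illustrates this scaling factor:
set.seed(1)
z <- rnorm(10000)
sd(z)  # close to 1
mad(z) # also close to 1: mad() multiplies the raw median absolute deviation by 1.4826
mad(z, constant=1) # the unscaled MAD, noticeably smaller (about 0.674)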
QUESTION 2.4 (1 point possible) Our last question relates to how the Pearson correlation is affected by an outlier, compared to the Spearman correlation. The Pearson correlation between x and y is given in R by cor(x,y). The Spearman correlation is given by cor(x,y,method="spearman").
Plot the weights of chicks from day 4 and day 21. We can see that there is some general trend, with the lower weight chicks on day 4 having low weight again on day 21, and likewise for the high weight chicks.
plot(chick$weight.4,chick$weight.21, main="Plot of Day 4 and Day 21 with median lines")
abline(h=median(chick$weight.21), v=median(chick$weight.4))
Calculate the Pearson correlation of the weights of chicks from day 4 and day 21. Now calculate how much the Pearson correlation changes if we add a chick that weighs 3000 on day 4 and 3000 on day 21. Again, divide the Pearson correlation computed with the outlier chick by the Pearson correlation computed without the outlier. ANS: 2.37
day4 <- chick$weight.4
day4plus <- c(day4,3000)
day21 <- chick$weight.21
day21plus <- c(day21,3000)
pearson4 <- cor(day4,day21) # Pearson correlation
pearson4plus <- cor(day4plus,day21plus) # Pearson correlation with outlier
pearson.ratio <- pearson4plus/pearson4
paste("Pearson regular ", format(pearson4, digits=2))
## [1] "Pearson regular 0.42"
paste("Pearson outlier ", format(pearson4plus, digits=2))
## [1] "Pearson outlier 0.99"
paste("Pearson ratio ", format(pearson.ratio, digits=2)) # ANS: 2.37
## [1] "Pearson ratio 2.4"
Note that the Spearman correlation also changes with the addition of this outlier chick, but much less drastically: cor(x,y,method="spearman") compared to cor(c(x,3000),c(y,3000),method="spearman"), with x and y the vectors of interest. (The ratio is 1.085.)
spearman4 <- cor(day4,day21, method="spearman") # Spearman correlation
spearman4plus <- cor(day4plus,day21plus, method="spearman") # Spearman correlation with outlier
spearman.ratio <- spearman4plus/spearman4
paste("Spearman regular ", format(spearman4, digits=2))
## [1] "Spearman regular 0.43"
paste("Spearman outlier ", format(spearman4plus, digits=2))
## [1] "Spearman outlier 0.47"
paste("spearman ratio ", format(spearman.ratio, digits=2))
## [1] "spearman ratio 1.1"
We will continue using the chick weight dataset from the previous problem. As a reminder, run these lines of code in R to prepare the data:
data(ChickWeight)
chick <- reshape(ChickWeight,idvar=c("Chick","Diet"),timevar="Time",direction="wide")
chick <- na.omit(chick)
head(chick)
## Chick Diet weight.0 weight.2 weight.4 weight.6 weight.8 weight.10
## 1 1 1 42 51 59 64 76 93
## 13 2 1 40 49 58 72 84 103
## 25 3 1 43 39 55 67 84 99
## 37 4 1 42 49 56 67 74 87
## 49 5 1 41 42 48 60 79 106
## 61 6 1 41 49 59 74 97 124
## weight.12 weight.14 weight.16 weight.18 weight.20 weight.21
## 1 106 125 149 171 199 205
## 13 122 138 162 187 209 215
## 25 115 138 163 187 198 202
## 37 102 108 136 154 160 157
## 49 141 164 197 199 220 223
## 61 141 148 155 160 160 157
Make a strip chart with horizontal jitter of the chick weights from day 4 over the different diets:
stripchart(chick$weight.4 ~ chick$Diet, method="jitter", vertical=TRUE)
Suppose we want to know if diet 4 has a significant impact on chick weight over diet 1 by day 4. It certainly appears so, but we can use statistical tests to quantify the probability of seeing such a difference if the different diets had equal effect on chick weight.
QUESTION 2.1 (1 point possible) Save the weights of the chicks on day 4 from diet 1 as a vector ‘x’. Save the weights of the chicks on day 4 from diet 4 as a vector ‘y’. Now perform a t test comparing x and y (in R the function t.test(x,y) will perform the test). Now, perform a Wilcoxon test of x and y (in R the function wilcox.test(x,y) will perform the test). Note that a warning will appear that an exact p-value cannot be calculated with ties (so an approximation is used, which is fine for our purposes).
x <- chick[chick$Diet==1,"weight.4"]
y <- chick[chick$Diet==4,"weight.4"]
t.test(x,y) # p-value = 7.32e-06
##
## Welch Two Sample t-test
##
## data: x and y
## t = -5.8393, df = 21.827, p-value = 7.32e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -10.513134 -5.000755
## sample estimates:
## mean of x mean of y
## 56.68750 64.44444
wilcox.test(x,y)$p.value # exact p-value cannot be computed due to ties; a normal approximation is used
## Warning in wilcox.test.default(x, y): cannot compute exact p-value with
## ties
## [1] 0.0002011939
Now, perform a t-test of x and y, after adding a single chick of weight 200 grams to ‘x’ (the diet 1 chicks). What is the p-value from this test? The p-value of a test is available with the following code: t.test(x,y)$p.value ANS: 0.938
x <- c(x,200)
t.test(x,y)$p.value # p-value = 0.938
## [1] 0.9380347
Do the same for the Wilcoxon test. Note that the Wilcoxon test is robust to the outlier (and, in addition, makes fewer assumptions than the t-test about the distribution of the underlying data).
wilcox.test(x,y)$p.value # p-value = 0.0009841
## Warning in wilcox.test.default(x, y): cannot compute exact p-value with
## ties
## [1] 0.0009840921
QUESTION 2.2 (1 point possible) We will now investigate a possible downside to the Wilcoxon-Mann-Whitney test statistic. Use the following code to make three boxplots, showing the true Diet 1 vs. 4 weights, and then two altered versions: one with an additional difference of 10 grams and one with an additional difference of 100 grams. (Use the x and y as defined above, NOT the ones with the added outlier.)
x <- chick[chick$Diet==1,"weight.4"]
y <- chick[chick$Diet==4,"weight.4"]
par(mfrow=c(1,3))
boxplot(x,y)
boxplot(x,y+10)
boxplot(x,y+100)
What is the difference in the t-test statistic (obtained by t.test(x,y)$statistic) between adding 10 and adding 100 to all the values in the group 'y'? That is, take the t-test statistic with x and y+10 and subtract the t-test statistic with x and y+100. (The value should be positive.) ANS: 67.8
t.diff <- t.test(x,y+10)$statistic - t.test(x,y+100)$statistic
t.diff # ANS: 67.8
Now examine the Wilcoxon test statistic for x and y+10 and for x and y+100. Because the Wilcoxon test works on ranks, once the groups have complete separation (all points from group 'y' are above all points from group 'x'), the statistic will not change, regardless of how large the difference grows. Likewise, the p-value has a minimum value, regardless of how far apart the groups are. This means that the Wilcoxon test can be considered less powerful than the t-test in certain contexts, especially at small significance levels (alpha). In fact, for small sample sizes, the p-value can't be very small, even when the difference is very large. Compare:
wilcox.test(c(1,2,3),c(4,5,6))
##
## Wilcoxon rank sum test
##
## data: c(1, 2, 3) and c(4, 5, 6)
## W = 0, p-value = 0.1
## alternative hypothesis: true location shift is not equal to 0
wilcox.test(c(1,2,3),c(400,500,600))
##
## Wilcoxon rank sum test
##
## data: c(1, 2, 3) and c(400, 500, 600)
## W = 0, p-value = 0.1
## alternative hypothesis: true location shift is not equal to 0
# p-value is 0.1 in both cases
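With n = m = 3 there are only choose(6,3) = 20 equally likely rank assignments under the null, so even with complete separation the two-sided p-value can never drop below 2/20 (a quick check, not part of the assessment):
2/choose(6, 3)
## [1] 0.1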
This issue becomes important in Course 3 on Advanced Statistics, when we discuss correction for performing thousands of tests, as is done in high-throughput biological assays.