In this project, three data sets that were considered untidy were put through tidying and transformation using R. In a tidy data set, each variable and observation is saved as its own column and row respectively. These are the main packages use for data wrangling, visualization and graphics. Any other minor packages for analysis will be listed when needed.
library(tidyr)
library(dplyr)
# Other main packages
library(kableExtra)
library(ggplot2)
library(car)
The three data sets were taken from Richard A. Johnson and Dean W. Wichern’s Applied Multivariate Statistical Analysis textbook. These sets are relatively small, but the focus of this project is on the tidying and transformation methodology using only dplyr and tidyr, where necessary, rather than the amount of data and the analysis performed. These techniques can therefore be translated to another data set that needs to be format in a similar way. The data sets were saved to my Github and were retrieved from there. The difficulty of the tidying to be done can by ranked as follow:
Hooked-billed kites are tropical raptor found broadly across Central and South America, occasionally north to the very southern tip of Texas. They have long tail and broad, rounded wings distinctive in flight. Males are gray overall with white barring on the belly. Females are brown above with rufous barred underparts and mostly gray head. Juveniles paler below with thinner barring.
In this data set, a wildlife ecologist measured the tail length (in millimeters) and wing length (in millimeters) for a sample of n = 45 female hook-billed kites. The data is a text file which list the tail and wing lengths in three separate columns, 15 rows each, to conserve page space.
theURL <- "https://raw.githubusercontent.com/greeneyefirefly/Data607/master/Projects/Project%202/untidy1.txt"
untidy1 <- read.delim(theURL, header = FALSE, stringsAsFactors = FALSE, na.strings = c("", "NA"))
Click to show raw data
## V1
## 1 Table 5.12 Bird Data
## 2 =============================================================================================
## 3 (Tail length) |
## 4 ---------------------------------------------------------------------------------------------
## 5 x1
## 6 ---------------------------------------------------------------------------------------------
## 7 191
## 8 197
## 9 208
## 10 180
## 11 180
## 12 188
## 13 210
## 14 196
## 15 191
## 16 179
## 17 208
## 18 202
## 19 200
## 20 192
## 21 199
## 22 ---------------------------------------------------------------------------------------------
## 23 Source: Data courtesy of S. Temple.
## V2 V3 V4 V5
## 1 <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA>
## 3 (Wing length) | (Tail length) | (Wing length) | (Tail length) |
## 4 <NA> <NA> <NA> <NA>
## 5 <NA> x2 <NA> x1
## 6 <NA> <NA> <NA> <NA>
## 7 <NA> 284 <NA> 186
## 8 <NA> 285 <NA> 197
## 9 <NA> 288 <NA> 201
## 10 <NA> 273 <NA> 190
## 11 <NA> 275 <NA> 209
## 12 <NA> 280 <NA> 187
## 13 <NA> 283 <NA> 207
## 14 <NA> 288 <NA> 178
## 15 <NA> 271 <NA> 202
## 16 <NA> 257 <NA> 205
## 17 <NA> 289 <NA> 190
## 18 <NA> 285 <NA> 189
## 19 <NA> 272 <NA> 211
## 20 <NA> 282 <NA> 216
## 21 <NA> 280 <NA> 189
## 22 <NA> <NA> <NA> <NA>
## 23 <NA> <NA> <NA> <NA>
## V6 V7 V8 V9 V10 V11
## 1 <NA> <NA> NA <NA> NA <NA>
## 2 <NA> <NA> NA <NA> NA <NA>
## 3 (Wing length) <NA> NA <NA> NA <NA>
## 4 <NA> <NA> NA <NA> NA <NA>
## 5 <NA> x2 NA x1 NA x2
## 6 <NA> <NA> NA <NA> NA <NA>
## 7 <NA> 266 NA 173 NA 271
## 8 <NA> 285 NA 194 NA 280
## 9 <NA> 295 NA 198 NA 300
## 10 <NA> 282 NA 180 NA 272
## 11 <NA> 305 NA 190 NA 292
## 12 <NA> 285 NA 191 NA 286
## 13 <NA> 297 NA 196 NA 285
## 14 <NA> 268 NA 207 NA 286
## 15 <NA> 271 NA 109 NA 303
## 16 <NA> 285 NA 179 NA 261
## 17 <NA> 280 NA 186 NA 262
## 18 <NA> 277 NA 174 NA 245
## 19 <NA> 310 NA 181 NA 250
## 20 <NA> 305 NA 189 NA 262
## 21 <NA> 274 NA 188 NA 258
## 22 <NA> <NA> NA <NA> NA <NA>
## 23 <NA> <NA> NA <NA> NA <NA>
Each of the same variable was spread across three columns, this made it cumbersome to index to run any analysis. In addition, there are unnecessary variable labels which can be removed once transformed to make the data set more comprehensible, not to mention the numerous empty columns and rows. Therefore, the main reason for tidying was to make the data set easily identifiable with the necessary data.
In this case, it was decided that the columns needed to be merged, transforming the wide data set into a long format, as shown below.
dplyr and tidyr.# Remove empty rows and columns
untidy1 <- data.frame(sapply(untidy1, as.numeric))
tidy1 <- untidy1 %>% filter_all(any_vars(!is.na(.))) %>% select_if(function(x) !(all(is.na(x)) | all(x=="")))
# Gather the data of all the columns into one
# Convert variables to factor for identification
# Recode variable factors into appropriate labels
tidy1 <- tidy1 %>% gather(tidy1, lengths, 1:6, convert = TRUE) %>% mutate(tidy1= recode(tidy1, "c('V1', 'V5', 'V9')='TailLength'; c('V3', 'V7', 'V11') = 'WingLength'"))
# Renaming rows
rownames(untidy1) <- 1:nrow(untidy1)
# Spreading the data to make it long
tidy1 <- tidy1 %>% group_by(tidy1) %>% mutate(grouped_id = row_number()) %>% spread(tidy1, lengths) %>% select(-grouped_id)
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 |
|---|---|---|---|---|---|---|---|---|---|---|
| NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 191 | NA | 284 | NA | 186 | NA | 266 | NA | 173 | NA | 271 |
| 197 | NA | 285 | NA | 197 | NA | 285 | NA | 194 | NA | 280 |
| 208 | NA | 288 | NA | 201 | NA | 295 | NA | 198 | NA | 300 |
| 180 | NA | 273 | NA | 190 | NA | 282 | NA | 180 | NA | 272 |
| 180 | NA | 275 | NA | 209 | NA | 305 | NA | 190 | NA | 292 |
| 188 | NA | 280 | NA | 187 | NA | 285 | NA | 191 | NA | 286 |
| 210 | NA | 283 | NA | 207 | NA | 297 | NA | 196 | NA | 285 |
| 196 | NA | 288 | NA | 178 | NA | 268 | NA | 207 | NA | 286 |
| 191 | NA | 271 | NA | 202 | NA | 271 | NA | 109 | NA | 303 |
| 179 | NA | 257 | NA | 205 | NA | 285 | NA | 179 | NA | 261 |
| 208 | NA | 289 | NA | 190 | NA | 280 | NA | 186 | NA | 262 |
| 202 | NA | 285 | NA | 189 | NA | 277 | NA | 174 | NA | 245 |
| 200 | NA | 272 | NA | 211 | NA | 310 | NA | 181 | NA | 250 |
| 192 | NA | 282 | NA | 216 | NA | 305 | NA | 189 | NA | 262 |
| 199 | NA | 280 | NA | 189 | NA | 274 | NA | 188 | NA | 258 |
| NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| TailLength | WingLength |
|---|---|
| 191 | 284 |
| 197 | 285 |
| 208 | 288 |
| 180 | 273 |
| 180 | 275 |
| 188 | 280 |
| 210 | 283 |
| 196 | 288 |
| 191 | 271 |
| 179 | 257 |
| 208 | 289 |
| 202 | 285 |
| 200 | 272 |
| 192 | 282 |
| 199 | 280 |
| 186 | 266 |
| 197 | 285 |
| 201 | 295 |
| 190 | 282 |
| 209 | 305 |
| 187 | 285 |
| 207 | 297 |
| 178 | 268 |
| 202 | 271 |
| 205 | 285 |
| 190 | 280 |
| 189 | 277 |
| 211 | 310 |
| 216 | 305 |
| 189 | 274 |
| 173 | 271 |
| 194 | 280 |
| 198 | 300 |
| 180 | 272 |
| 190 | 292 |
| 191 | 286 |
| 196 | 285 |
| 207 | 286 |
| 109 | 303 |
| 179 | 261 |
| 186 | 262 |
| 174 | 245 |
| 181 | 250 |
| 189 | 262 |
| 188 | 258 |
Let’s do some fun analysis! Using this data, let’s find the simultaneous confidence level for the component means, i.e. the wing and tail lengths. When multiple confidence intervals are examined, the chances that at least one of the confidence intervals does not contain the population parameter is greater for a set of intervals than for any single interval.
In this analysis, the 95% Bonferroni confidence intervals were determined. The plot reveals that the Bonferroni confidence intervals, indicated by the dash lines, suggest that the ecologist can be 95% confident that the points that fall within all the set of confidence intervals includes the true population standard deviations for all variables. This plot is helpful to the wildlife ecologist because when he/she discovers more female hooked-billed kites, he/she can establish whether its measurements are within the typical standards. Knowing this, they can further study why a measurement, such as an outlier, was recorded in the first place.
Click to show code
n <- length(tidy1$TailLength)
# Get the fitted values of the linear model
model <- lm(TailLength ~ WingLength, data = tidy1)
fit <- model$fitted.values
# Find standard error
serror <- sqrt(sum((tidy1$TailLength - fit)^2) / (n - 2)) * sqrt(1 / n + (tidy1$WingLength - mean(tidy1$WingLength))^2 / sum((tidy1$WingLength - mean(tidy1$WingLength))^2))
# Calculate B statistics
B <- 1-qt(.95/(2 * 3), n - 1)
# Compute the simultaneous Bonferroni confidence intervals
bon.upper <- fit + B * serror
bon.lower <- fit - B * serror
Moreover, from the plot, there are a few outliers within the data set that can be a result of type 1 error. However, the confidence ellipse still appeared to contain 95% of the observations. As the ecologist observations for female hook-billed kites increases, the mean will usually be better and better estimated, leading to smaller and smaller confidence ellipses, which in turn contain a smaller and smaller proportion of the actual data.
Anacondas are some of the largest snakes in the world. A team of researchers recorded the snout vent length (in cm) or the length from the snout of the snake to its vent where it evacuates waste, and its weight (in kg). The data set has the measurements sub-divided by the snakes’ gender.
theURL <- "https://raw.githubusercontent.com/greeneyefirefly/Data607/master/Projects/Project%202/untidy2.csv"
untidy2 <- data.frame(read.csv(file = theURL, header = TRUE, sep = ","))
Click to show raw data
## Female X Male X.1
## 1 Snout Vent Length Weight Snout Vent Length Weight
## 2 271 18.5 176.7 3
## 3 477 82.5 259.5 9.75
## 4 306.3 23.4 258 10.07
## 5 365.3 33.5 229.8 7.50
## 6 466 69 233 6.25
## 7 440.7 54 237.5 9.85
## 8 315 24.97 268.3 10
## 9 417.5 56.75 222.5 9
## 10 307.3 23.15 186.5 3.75
## 11 319 29.51 238.8 9.75
## 12 303.9 19.98 257.6 9.75
## 13 331.7 24 172 3
## 14 435 70.37 244.7 10
## 15 261.3 15.5 224.7 7.25
## 16 384.8 63 231.7 9.25
## 17 360.3 39 235.9 7.5
## 18 441.4 53 236.5 5.75
## 19 246.7 15.75 247.4 7.75
## 20 365.3 44 223 5.75
## 21 336.8 30 223 5.75
## 22 326.7 34 212.5 7.65
## 23 312 25 223 7.75
## 24 226.7 9.25 225 5.84
## 25 347.4 30 228 7.53
## 26 280.2 15.25 215.6 5.75
## 27 290.7 21.5 221 6.45
## 28 438.6 57 236.7 6.49
## 29 377.1 61.5 235.3 6
For this data set, let the aim be to test for equality of means between male and female anacondas. The subdivision of this data set makes the test inputs very untidy and will not allow for quick analysis since careful attention is needed for indexing. Therefore, in this case, tidying is based on the statistical analysis.
Here, it was decided that the variable for gender needed to have its own column, and the respective measurements needed to be merge, changing from the wide format into a long format, as shown below:
dplyr and tidyr.# Identifying the columns by renaming
tidy2 <- untidy2 %>% rename(Fsnout = Female, Fwght = X, Msnout = Male, Mwght = X.1)
# Simple row removal and renaming
tidy2 <- tidy2[-1,]
rownames(tidy2) <- 1:nrow(tidy2)
# Gather the data by respective columns
tidy2 <- gather(tidy2, "gender", "measurement", c(c(1,3),c(2,4)))
# Creating a new data frame of variable identification to be merged with data set
id <- data.frame(tidy2$gender, stringsAsFactors=FALSE)
titles <- c("snoutLength", "weight")
id[c(1:28),] <- titles[1]
id[c(29:56),] <- titles[2]
id[c(57:84),] <- titles[1]
id[c(85:112),] <- titles[2]
# Convert variables to factor for identification
# Recode variable factors into appropriate labels
tidy2 <- tidy2 %>% mutate_at(vars(gender), as.factor) %>% mutate(gender = recode(gender, "c('Fsnout', 'Fwght') = 'Female' ; c('Msnout', 'Mwght') = 'Male'"))
# Spread the data based on the identified gender categories for snout and weight data
# Gather once more to create the final gender key
tidy2 <- tidy2 %>% group_by(gender) %>% mutate(grouped_id = row_number()) %>% spread(gender, measurement, convert = TRUE) %>% select(-grouped_id) %>% gather("Key", "measurement", 1:2)
# Attach the new column data
tidy2 <- data.frame(append(tidy2,id))
# Spread once more by the gender key so each measurement is in its own column
tidy2 <- tidy2 %>% group_by(tidy2.gender) %>% mutate(grouped_id = row_number()) %>% spread(tidy2.gender, measurement, convert = TRUE) %>% select(-grouped_id) %>% rename(gender = Key)
| Female | X | Male | X.1 |
|---|---|---|---|
| Snout Vent Length | Weight | Snout Vent Length | Weight |
| 271 | 18.5 | 176.7 | 3 |
| 477 | 82.5 | 259.5 | 9.75 |
| 306.3 | 23.4 | 258 | 10.07 |
| 365.3 | 33.5 | 229.8 | 7.50 |
| 466 | 69 | 233 | 6.25 |
| 440.7 | 54 | 237.5 | 9.85 |
| 315 | 24.97 | 268.3 | 10 |
| 417.5 | 56.75 | 222.5 | 9 |
| 307.3 | 23.15 | 186.5 | 3.75 |
| 319 | 29.51 | 238.8 | 9.75 |
| 303.9 | 19.98 | 257.6 | 9.75 |
| 331.7 | 24 | 172 | 3 |
| 435 | 70.37 | 244.7 | 10 |
| 261.3 | 15.5 | 224.7 | 7.25 |
| 384.8 | 63 | 231.7 | 9.25 |
| 360.3 | 39 | 235.9 | 7.5 |
| 441.4 | 53 | 236.5 | 5.75 |
| 246.7 | 15.75 | 247.4 | 7.75 |
| 365.3 | 44 | 223 | 5.75 |
| 336.8 | 30 | 223 | 5.75 |
| 326.7 | 34 | 212.5 | 7.65 |
| 312 | 25 | 223 | 7.75 |
| 226.7 | 9.25 | 225 | 5.84 |
| 347.4 | 30 | 228 | 7.53 |
| 280.2 | 15.25 | 215.6 | 5.75 |
| 290.7 | 21.5 | 221 | 6.45 |
| 438.6 | 57 | 236.7 | 6.49 |
| 377.1 | 61.5 | 235.3 | 6 |
| gender | snoutLength | weight |
|---|---|---|
| Female | 271.0 | 18.50 |
| Female | 477.0 | 82.50 |
| Female | 306.3 | 23.40 |
| Female | 365.3 | 33.50 |
| Female | 466.0 | 69.00 |
| Female | 440.7 | 54.00 |
| Female | 315.0 | 24.97 |
| Female | 417.5 | 56.75 |
| Female | 307.3 | 23.15 |
| Female | 319.0 | 29.51 |
| Female | 303.9 | 19.98 |
| Female | 331.7 | 24.00 |
| Female | 435.0 | 70.37 |
| Female | 261.3 | 15.50 |
| Female | 384.8 | 63.00 |
| Female | 360.3 | 39.00 |
| Female | 441.4 | 53.00 |
| Female | 246.7 | 15.75 |
| Female | 365.3 | 44.00 |
| Female | 336.8 | 30.00 |
| Female | 326.7 | 34.00 |
| Female | 312.0 | 25.00 |
| Female | 226.7 | 9.25 |
| Female | 347.4 | 30.00 |
| Female | 280.2 | 15.25 |
| Female | 290.7 | 21.50 |
| Female | 438.6 | 57.00 |
| Female | 377.1 | 61.50 |
| Male | 176.7 | 3.00 |
| Male | 259.5 | 9.75 |
| Male | 258.0 | 10.07 |
| Male | 229.8 | 7.50 |
| Male | 233.0 | 6.25 |
| Male | 237.5 | 9.85 |
| Male | 268.3 | 10.00 |
| Male | 222.5 | 9.00 |
| Male | 186.5 | 3.75 |
| Male | 238.8 | 9.75 |
| Male | 257.6 | 9.75 |
| Male | 172.0 | 3.00 |
| Male | 244.7 | 10.00 |
| Male | 224.7 | 7.25 |
| Male | 231.7 | 9.25 |
| Male | 235.9 | 7.50 |
| Male | 236.5 | 5.75 |
| Male | 247.4 | 7.75 |
| Male | 223.0 | 5.75 |
| Male | 223.0 | 5.75 |
| Male | 212.5 | 7.65 |
| Male | 223.0 | 7.75 |
| Male | 225.0 | 5.84 |
| Male | 228.0 | 7.53 |
| Male | 215.6 | 5.75 |
| Male | 221.0 | 6.45 |
| Male | 236.7 | 6.49 |
| Male | 235.3 | 6.00 |
With the untidy data, there are some simple analysis that can be done (but note that the untidy data still needed to be converted to the appropriate class). For instance, a correlation test can be done to evaluate the association between snout length and weight for both male and female snakes. Since the data is already separated, it can be easily fed into a function like this.
cor.test(as.numeric(untidy2$Female), as.numeric(untidy2$X), method = "pearson")
##
## Pearson's product-moment correlation
##
## data: as.numeric(untidy2$Female) and as.numeric(untidy2$X)
## t = 6.1083, df = 27, p-value = 1.591e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5482271 0.8819781
## sample estimates:
## cor
## 0.761688
The p-value of the test is less than 0.05, therefore, it can concluded that the weight and snout length for female anacondas are significantly correlated with a correlation coefficient of 0.76. Since it is a positive correlation, it further suggest that as the weight increases, the snout length increases as well for female anacondas.
cor.test(as.numeric(untidy2$Male), as.numeric(untidy2$X.1), method = "pearson")
##
## Pearson's product-moment correlation
##
## data: as.numeric(untidy2$Male) and as.numeric(untidy2$X.1)
## t = 1.6046, df = 27, p-value = 0.1202
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.08011696 0.59699954
## sample estimates:
## cor
## 0.2950524
With the male anacondas, the p-value of the test is more than 0.05. Therefore, it can concluded that the weight and snout length for male are not significantly correlated, and there is no evidence that weight and snout length has any relationship for male anacondas.
Now, with the tidy data, a two sample independent test can be conducted. Two data samples are independent as they come from unrelated populations and the samples does not affect each other. Firstly, from the normal Q-Q plot, the data on snout lengths can be said to follow closely to a normal distribution, therefore satisfying the assumption of normality to conduct a t-test.
t.test(snoutLength ~ gender, data = tidy2)
##
## Welch Two Sample t-test
##
## data: snoutLength by gender
## t = 8.7618, df = 32.712, p-value = 4.289e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 91.78351 147.32364
## sample estimates:
## mean in group Female mean in group Male
## 348.2750 228.7214
Using the t-test, the interval estimate of the difference between two gender means was obtained. In the tidy data, the mean snout length for female anacondas is 348.28 cm and the mean snout length for male anacondas is 228.72 cm. The 95% confidence interval of the difference in mean snout length is between 91.78 and 147.32 cm. As a result, there is a statistical significant difference in snout length based on gender where female anacondas have a larger snout than males.
Now with the weight of the anacondas, the normal Q-Q plot also show that the data satisfy the assumption of normality to conduct a t-test.
t.test(weight ~ gender, data = tidy2)
##
## Welch Two Sample t-test
##
## data: weight by gender
## t = 7.8475, df = 27.592, p-value = 1.672e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 22.14418 37.80225
## sample estimates:
## mean in group Female mean in group Male
## 37.263571 7.290357
The t-test results shows that the mean weight for female anacondas is 37.26 kg and the mean weight for male anacondas is 7.29 kg. The 95% confidence interval of the difference in mean weight is between 22.14 and 37.80 kg. Therefore, there is also a statistical significant difference in weight based on gender where female anacondas are heavier than males.
In conclusion, when analyzing the tidy set, it is evident that female anacondas are considerably longer and heavier than males. This may be because there is a relationship with female snout length and weight, which was not significant with male anacondas.
Peanut are an important crop in parts of the southern United States. To develop improved plants, crop scientists routinely compare varieties with respect to several attributes. A table of data is given in a .csv file. There are four varieties of peanuts, Runner, Spanish, Valencia, and Virginia that are grown at four geographical locations, Alabama, Florida, Georgia, and Texas. Moreover, this table consist the yield (plot weight), and two important grade-grain characteristics, sound mature kernels (weight in grams), and seed size (weight in grams).
theURL <- "https://raw.githubusercontent.com/greeneyefirefly/Data607/master/Projects/Project%202/untidy3.csv"
untidy3 <- data.frame(read.csv(file = theURL, header = TRUE, sep = ","))
Click to show raw data
## Table.6.17.Peanut.Data X X.1 X.2 X.3 X.4
## 1 Batch 1
## 2 Variety
## 3 Runner Spanish
## 4 Location Georgia Yield 195.3 203
## 5 kernels weight 153.1 156.8
## 6 seed size 51.4 49.8
## 7
## 8 Texas Yield 189.7 202.7
## 9 kernels weight 139.5 166.1
## 10 seed size 55.5 60.4
## 11
## 12 Alabama Yield 181.8 204.6
## 13 kernels weight 139.6 151.7
## 14 seed size 56.4 57.4
## 15
## 16 Florida Yield 185.2 204.5
## 17 kernels weight 158.2 159.8
## 18 seed size 53.7 57.1
## 19
## 20 Batch 2
## 21 Variety
## 22 Runner Spanish
## 23 Location Georgia Yield 194.3 195.9
## 24 kernels weight 167.7 166
## 25 seed size 53.7 45.8
## 26
## 27 Texas Yield 180.4 197.6
## 28 kernels weight 121.1 161.8
## 29 seed size 44.4 54.1
## 30
## 31 Alabama Yield 186.7 194.5
## 32 kernels weight 157.5 157.3
## 33 seed size 49.6 54.1
## 34
## 35 Florida Yield 193.7 198.4
## 36 kernels weight 146.5 160.3
## 37 seed size 52.5 57.4
## X.5 X.6
## 1
## 2
## 3 Valencia Virgina
## 4 193.5 196.4
## 5 164.5 158.3
## 6 57.8 53.1
## 7
## 8 201.5 195.5
## 9 166.8 157.4
## 10 65 57.8
## 11
## 12 198.8 195.7
## 13 166.1 161.9
## 14 63.5 57.2
## 15
## 16 197.4 195.9
## 17 165.2 157.2
## 18 60.9 56.8
## 19
## 20
## 21
## 22 Valencia Virgina
## 23 187 194.2
## 24 165.1 162.4
## 25 58.6 48.9
## 26
## 27 200 192.3
## 28 173.8 153.5
## 29 67.2 58.7
## 30
## 31 197.4 193.7
## 32 168.8 159.5
## 33 58.7 54.8
## 34
## 35 197.5 193.3
## 36 169.4 156.7
## 37 59.8 57.4
The data set is in a table format. The table itself can be easily read, but it is in no way workable for any statistical analysis using R. More specifically, there are necessary variable names which do not have their own columns. The subdivision of this data set makes the test inputs very untidy and will not allow for quick analysis since careful attention is needed for indexing. Therefore, tidying is based on the statistical analysis to make the data set easily identifiable with the necessary data.
In this case, it was decided that specific rows needed to become columns, and the respective measurements needed to be merge, transforming the wide table into a long data set, as shown below.
dplyr and tidyr.# Rigorous cleaning methods #
## Removing unnecessary rows
tidy3 <- untidy3[-c(2,7,11,15,19,21,26,30,34),-2]
# Identify the variable names for missing column before tidying
## Batch labels
tidy3[c(2:14),1] <- tidy3[1,1]
tidy3[c(16:28),1] <- tidy3[15,1]
## Location labels
tidy3[c(3:5,17:19),2] <- tidy3[3,2]
tidy3[c(6:8,20:22),2] <- tidy3[6,2]
tidy3[c(9:11,23:25),2] <- tidy3[9,2]
tidy3[c(12:14,26:28),2] <- tidy3[12,2]
# Identifying the columns by renaming
tidy3 <- rename(tidy3, Batch = Table.6.17.Peanut.Data, Location = X.1, Variable = X.2, Runner = X.3, Spanish = X.4, Valencia = X.5, Virginia = X.6)
# Removing and renaming final rows
tidy3 = tidy3[-c(1,2,15,16),]
rownames(tidy3) <- 1:nrow(tidy3)
# Gather the data into variables of peanut variety and the measurements
# Spread based on the measurement variable names
# Convert character data into factors and numbers
tidy3 <- tidy3 %>% gather("Variety", "Measurement", 4:7) %>% spread("Variable", "Measurement", convert = TRUE)
tidy3 <- droplevels(mutate_if(tidy3, is.character, as.factor))
| Table.6.17.Peanut.Data | X | X.1 | X.2 | X.3 | X.4 | X.5 | X.6 |
|---|---|---|---|---|---|---|---|
| Batch 1 | |||||||
| Variety | |||||||
| Runner | Spanish | Valencia | Virgina | ||||
| Location | Georgia | Yield | 195.3 | 203 | 193.5 | 196.4 | |
| kernels weight | 153.1 | 156.8 | 164.5 | 158.3 | |||
| seed size | 51.4 | 49.8 | 57.8 | 53.1 | |||
| Texas | Yield | 189.7 | 202.7 | 201.5 | 195.5 | ||
| kernels weight | 139.5 | 166.1 | 166.8 | 157.4 | |||
| seed size | 55.5 | 60.4 | 65 | 57.8 | |||
| Alabama | Yield | 181.8 | 204.6 | 198.8 | 195.7 | ||
| kernels weight | 139.6 | 151.7 | 166.1 | 161.9 | |||
| seed size | 56.4 | 57.4 | 63.5 | 57.2 | |||
| Florida | Yield | 185.2 | 204.5 | 197.4 | 195.9 | ||
| kernels weight | 158.2 | 159.8 | 165.2 | 157.2 | |||
| seed size | 53.7 | 57.1 | 60.9 | 56.8 | |||
| Batch 2 | |||||||
| Variety | |||||||
| Runner | Spanish | Valencia | Virgina | ||||
| Location | Georgia | Yield | 194.3 | 195.9 | 187 | 194.2 | |
| kernels weight | 167.7 | 166 | 165.1 | 162.4 | |||
| seed size | 53.7 | 45.8 | 58.6 | 48.9 | |||
| Texas | Yield | 180.4 | 197.6 | 200 | 192.3 | ||
| kernels weight | 121.1 | 161.8 | 173.8 | 153.5 | |||
| seed size | 44.4 | 54.1 | 67.2 | 58.7 | |||
| Alabama | Yield | 186.7 | 194.5 | 197.4 | 193.7 | ||
| kernels weight | 157.5 | 157.3 | 168.8 | 159.5 | |||
| seed size | 49.6 | 54.1 | 58.7 | 54.8 | |||
| Florida | Yield | 193.7 | 198.4 | 197.5 | 193.3 | ||
| kernels weight | 146.5 | 160.3 | 169.4 | 156.7 | |||
| seed size | 52.5 | 57.4 | 59.8 | 57.4 |
| Batch | Location | Variety | kernels weight | seed size | Yield |
|---|---|---|---|---|---|
| Batch 1 | Alabama | Runner | 139.6 | 56.4 | 181.8 |
| Batch 1 | Alabama | Spanish | 151.7 | 57.4 | 204.6 |
| Batch 1 | Alabama | Valencia | 166.1 | 63.5 | 198.8 |
| Batch 1 | Alabama | Virginia | 161.9 | 57.2 | 195.7 |
| Batch 1 | Florida | Runner | 158.2 | 53.7 | 185.2 |
| Batch 1 | Florida | Spanish | 159.8 | 57.1 | 204.5 |
| Batch 1 | Florida | Valencia | 165.2 | 60.9 | 197.4 |
| Batch 1 | Florida | Virginia | 157.2 | 56.8 | 195.9 |
| Batch 1 | Georgia | Runner | 153.1 | 51.4 | 195.3 |
| Batch 1 | Georgia | Spanish | 156.8 | 49.8 | 203.0 |
| Batch 1 | Georgia | Valencia | 164.5 | 57.8 | 193.5 |
| Batch 1 | Georgia | Virginia | 158.3 | 53.1 | 196.4 |
| Batch 1 | Texas | Runner | 139.5 | 55.5 | 189.7 |
| Batch 1 | Texas | Spanish | 166.1 | 60.4 | 202.7 |
| Batch 1 | Texas | Valencia | 166.8 | 65.0 | 201.5 |
| Batch 1 | Texas | Virginia | 157.4 | 57.8 | 195.5 |
| Batch 2 | Alabama | Runner | 157.5 | 49.6 | 186.7 |
| Batch 2 | Alabama | Spanish | 157.3 | 54.1 | 194.5 |
| Batch 2 | Alabama | Valencia | 168.8 | 58.7 | 197.4 |
| Batch 2 | Alabama | Virginia | 159.5 | 54.8 | 193.7 |
| Batch 2 | Florida | Runner | 146.5 | 52.5 | 193.7 |
| Batch 2 | Florida | Spanish | 160.3 | 57.4 | 198.4 |
| Batch 2 | Florida | Valencia | 169.4 | 59.8 | 197.5 |
| Batch 2 | Florida | Virginia | 156.7 | 57.4 | 193.3 |
| Batch 2 | Georgia | Runner | 167.7 | 53.7 | 194.3 |
| Batch 2 | Georgia | Spanish | 166.0 | 45.8 | 195.9 |
| Batch 2 | Georgia | Valencia | 165.1 | 58.6 | 187.0 |
| Batch 2 | Georgia | Virginia | 162.4 | 48.9 | 194.2 |
| Batch 2 | Texas | Runner | 121.1 | 44.4 | 180.4 |
| Batch 2 | Texas | Spanish | 161.8 | 54.1 | 197.6 |
| Batch 2 | Texas | Valencia | 173.8 | 67.2 | 200.0 |
| Batch 2 | Texas | Virginia | 153.5 | 58.7 | 192.3 |
In this tidy data set, there are two batches (1 and 2) of peanut shipment from four different locations, and there is an interest in the kernels weight, seed size and yield of the peanuts. Simple two sample t-tests between the batches resulted in no difference in the means for the three attributes. Hence, let’s investigate if all three together are affected by the difference in location and peanut variety. A multivariate analysis of variance could be used to test this, by answering the following research questions:
Is there any significant difference in the three attributes among the different location they were sourced?
Is there any significant difference in the three attributes among the different variety of peanuts?
Is there any significant difference in the three attributes with the interactions of the different variety and location they were sourced?
res.man <- manova(cbind(`kernels weight`,`seed size`, Yield) ~ Location + Variety + Location*Variety, data = tidy3)
summary(res.man, test="Pillai")
## Df Pillai approx F num Df den Df Pr(>F)
## Location 3 0.69178 1.5984 9 48 0.1427
## Variety 3 1.52281 5.4981 9 48 3.512e-05 ***
## Location:Variety 9 1.33139 1.4185 27 48 0.1430
## Residuals 16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The model reveal that there is no significant difference among the three attributes and the location they were grown, p-value > 0.05. Moreover, the findings indicated that the interaction effect was also non-significant overall. However, there is a statistical significant difference among the three attributes and the different variety of peanut, p-value < 0.05. Let’s look at the fit residuals and effects in more detail.
The residuals for kernels weight appears very large in absolute value, but the Q-Q plots of residuals indicate that univariable normality cannot be rejected for all three variables. Since the data can be assume to follow a normal distribution based on the Q-Q plots, an analysis of the variance model can be done.
Here, the detailed result shows that there is a difference when the kernel weight was considered based on the variety of the peanut and there is an interaction effect with the location and variety, p-value < 0.05. For seed size, both location and variety were statistically significant but there was no interaction effect. Lastly, for yield, only the different variety were found to be statistically significant, p-value < 0.05.
summary.aov(res.man)
## Response kernels weight :
## Df Sum Sq Mean Sq F value Pr(>F)
## Location 3 189.10 63.03 1.6059 0.2273214
## Variety 3 1558.87 519.62 13.2379 0.0001333 ***
## Location:Variety 9 949.11 105.46 2.6866 0.0406680 *
## Residuals 16 628.04 39.25
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response seed size :
## Df Sum Sq Mean Sq F value Pr(>F)
## Location 3 141.64 47.213 5.0843 0.0116171 *
## Variety 3 373.98 124.659 13.4245 0.0001232 ***
## Location:Variety 9 122.98 13.665 1.4716 0.2396095
## Residuals 16 148.57 9.286
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Yield :
## Df Sum Sq Mean Sq F value Pr(>F)
## Location 3 10.08 3.361 0.2280 0.8755284
## Variety 3 584.48 194.826 13.2169 0.0001345 ***
## Location:Variety 9 259.01 28.779 1.9524 0.1164411
## Residuals 16 235.85 14.741
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In conclusion, the two-factor MANOVA indicates that there is evidence of variety effect for all three attributes. There is also a location effect for the seed size, and an interaction effect for the kernel weight only. These findings could have only been calculated with a tidy data set, demonstrating why the data needed to be tidy appropriately based on the analysis.
For each data set, it was established how and why the tidying and transformation were carried out. The reasons include to make the data set easily identifiable with the necessary data and for appropriate inputs to statistical analyses. The challenging of this project was deciding how to transform the data. For the Level 1 data set, Hook-billed kites, the text data needed to be correctly merge with their respective attributes. The Level 2 Anaconda data set needed to be gather and spread a few times with additional information before it was in the right format. Lastly, the Level 3 Peanut data table needed rigorous cleaning and proper labeling to make a table into a data frame. Overall, each data set needed careful attention to understand how and when to use tidyr and dplyr to tidy and transform them as necessary.
Johnson, Richard A, and Dean W. Wichern. Applied Multivariate Statistical Analysis. Englewood Cliffs, N.J: Prentice Hall, 1992. Print.
Tinsley, Howard E. Handbook of Applied Multivariate Statistics and Mathematical Modeling. Academic Press, 2006. Print.