Overview

In this project, three data sets that were considered untidy were put through tidying and transformation using R. In a tidy data set, each variable and observation is saved as its own column and row respectively. These are the main packages use for data wrangling, visualization and graphics. Any other minor packages for analysis will be listed when needed.

library(tidyr)
library(dplyr)

# Other main packages
library(kableExtra)
library(ggplot2)
library(car)

The three data sets were taken from Richard A. Johnson and Dean W. Wichern’s Applied Multivariate Statistical Analysis textbook. These sets are relatively small, but the focus of this project is on the tidying and transformation methodology using only dplyr and tidyr, where necessary, rather than the amount of data and the analysis performed. These techniques can therefore be translated to another data set that needs to be format in a similar way. The data sets were saved to my Github and were retrieved from there. The difficulty of the tidying to be done can by ranked as follow:

  • Level 1: Hook-Billed Kites
  • Level 2: Anacondas
  • Level 3: Peanuts

LEVEL 1

Hook-Billed Kites

Hooked-billed kites are tropical raptor found broadly across Central and South America, occasionally north to the very southern tip of Texas. They have long tail and broad, rounded wings distinctive in flight. Males are gray overall with white barring on the belly. Females are brown above with rufous barred underparts and mostly gray head. Juveniles paler below with thinner barring.

In this data set, a wildlife ecologist measured the tail length (in millimeters) and wing length (in millimeters) for a sample of n = 45 female hook-billed kites. The data is a text file which list the tail and wing lengths in three separate columns, 15 rows each, to conserve page space.

theURL <- "https://raw.githubusercontent.com/greeneyefirefly/Data607/master/Projects/Project%202/untidy1.txt"
untidy1 <- read.delim(theURL, header = FALSE, stringsAsFactors = FALSE, na.strings = c("", "NA"))

Click to show raw data

##                                                                                               V1
## 1                                                                           Table 5.12 Bird Data
## 2  =============================================================================================
## 3                                                                                (Tail length) |
## 4  ---------------------------------------------------------------------------------------------
## 5                                                                                             x1
## 6  ---------------------------------------------------------------------------------------------
## 7                                                                                            191
## 8                                                                                            197
## 9                                                                                            208
## 10                                                                                           180
## 11                                                                                           180
## 12                                                                                           188
## 13                                                                                           210
## 14                                                                                           196
## 15                                                                                           191
## 16                                                                                           179
## 17                                                                                           208
## 18                                                                                           202
## 19                                                                                           200
## 20                                                                                           192
## 21                                                                                           199
## 22 ---------------------------------------------------------------------------------------------
## 23                                                          Source: Data courtesy of S. Temple. 
##                 V2              V3              V4              V5
## 1             <NA>            <NA>            <NA>            <NA>
## 2             <NA>            <NA>            <NA>            <NA>
## 3  (Wing length) | (Tail length) | (Wing length) | (Tail length) |
## 4             <NA>            <NA>            <NA>            <NA>
## 5             <NA>              x2            <NA>              x1
## 6             <NA>            <NA>            <NA>            <NA>
## 7             <NA>             284            <NA>             186
## 8             <NA>             285            <NA>             197
## 9             <NA>             288            <NA>             201
## 10            <NA>             273            <NA>             190
## 11            <NA>             275            <NA>             209
## 12            <NA>             280            <NA>             187
## 13            <NA>             283            <NA>             207
## 14            <NA>             288            <NA>             178
## 15            <NA>             271            <NA>             202
## 16            <NA>             257            <NA>             205
## 17            <NA>             289            <NA>             190
## 18            <NA>             285            <NA>             189
## 19            <NA>             272            <NA>             211
## 20            <NA>             282            <NA>             216
## 21            <NA>             280            <NA>             189
## 22            <NA>            <NA>            <NA>            <NA>
## 23            <NA>            <NA>            <NA>            <NA>
##               V6   V7 V8   V9 V10  V11
## 1           <NA> <NA> NA <NA>  NA <NA>
## 2           <NA> <NA> NA <NA>  NA <NA>
## 3  (Wing length) <NA> NA <NA>  NA <NA>
## 4           <NA> <NA> NA <NA>  NA <NA>
## 5           <NA>   x2 NA   x1  NA   x2
## 6           <NA> <NA> NA <NA>  NA <NA>
## 7           <NA>  266 NA  173  NA  271
## 8           <NA>  285 NA  194  NA  280
## 9           <NA>  295 NA  198  NA  300
## 10          <NA>  282 NA  180  NA  272
## 11          <NA>  305 NA  190  NA  292
## 12          <NA>  285 NA  191  NA  286
## 13          <NA>  297 NA  196  NA  285
## 14          <NA>  268 NA  207  NA  286
## 15          <NA>  271 NA  109  NA  303
## 16          <NA>  285 NA  179  NA  261
## 17          <NA>  280 NA  186  NA  262
## 18          <NA>  277 NA  174  NA  245
## 19          <NA>  310 NA  181  NA  250
## 20          <NA>  305 NA  189  NA  262
## 21          <NA>  274 NA  188  NA  258
## 22          <NA> <NA> NA <NA>  NA <NA>
## 23          <NA> <NA> NA <NA>  NA <NA>


How & Why to Tidy

Each of the same variable was spread across three columns, this made it cumbersome to index to run any analysis. In addition, there are unnecessary variable labels which can be removed once transformed to make the data set more comprehensible, not to mention the numerous empty columns and rows. Therefore, the main reason for tidying was to make the data set easily identifiable with the necessary data.

In this case, it was decided that the columns needed to be merged, transforming the wide data set into a long format, as shown below.

Data transformation using dplyr and tidyr.
# Remove empty rows and columns
untidy1 <- data.frame(sapply(untidy1, as.numeric))
tidy1 <- untidy1 %>% filter_all(any_vars(!is.na(.))) %>% select_if(function(x) !(all(is.na(x)) | all(x=="")))

# Gather the data of all the columns into one 
# Convert variables to factor for identification
# Recode variable factors into appropriate labels
tidy1 <- tidy1 %>% gather(tidy1, lengths, 1:6, convert = TRUE) %>%  mutate(tidy1= recode(tidy1, "c('V1', 'V5', 'V9')='TailLength'; c('V3', 'V7', 'V11') = 'WingLength'"))

# Renaming rows
rownames(untidy1) <- 1:nrow(untidy1)

# Spreading the data to make it long
tidy1 <- tidy1 %>% group_by(tidy1) %>% mutate(grouped_id = row_number()) %>% spread(tidy1, lengths) %>% select(-grouped_id)

Hook-Billed Kites Data set

Untidy

V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
NA NA NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA NA NA
191 NA 284 NA 186 NA 266 NA 173 NA 271
197 NA 285 NA 197 NA 285 NA 194 NA 280
208 NA 288 NA 201 NA 295 NA 198 NA 300
180 NA 273 NA 190 NA 282 NA 180 NA 272
180 NA 275 NA 209 NA 305 NA 190 NA 292
188 NA 280 NA 187 NA 285 NA 191 NA 286
210 NA 283 NA 207 NA 297 NA 196 NA 285
196 NA 288 NA 178 NA 268 NA 207 NA 286
191 NA 271 NA 202 NA 271 NA 109 NA 303
179 NA 257 NA 205 NA 285 NA 179 NA 261
208 NA 289 NA 190 NA 280 NA 186 NA 262
202 NA 285 NA 189 NA 277 NA 174 NA 245
200 NA 272 NA 211 NA 310 NA 181 NA 250
192 NA 282 NA 216 NA 305 NA 189 NA 262
199 NA 280 NA 189 NA 274 NA 188 NA 258
NA NA NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA NA NA

Tidy

TailLength WingLength
191 284
197 285
208 288
180 273
180 275
188 280
210 283
196 288
191 271
179 257
208 289
202 285
200 272
192 282
199 280
186 266
197 285
201 295
190 282
209 305
187 285
207 297
178 268
202 271
205 285
190 280
189 277
211 310
216 305
189 274
173 271
194 280
198 300
180 272
190 292
191 286
196 285
207 286
109 303
179 261
186 262
174 245
181 250
189 262
188 258

Results & Conclusion

Let’s do some fun analysis! Using this data, let’s find the simultaneous confidence level for the component means, i.e. the wing and tail lengths. When multiple confidence intervals are examined, the chances that at least one of the confidence intervals does not contain the population parameter is greater for a set of intervals than for any single interval.

In this analysis, the 95% Bonferroni confidence intervals were determined. The plot reveals that the Bonferroni confidence intervals, indicated by the dash lines, suggest that the ecologist can be 95% confident that the points that fall within all the set of confidence intervals includes the true population standard deviations for all variables. This plot is helpful to the wildlife ecologist because when he/she discovers more female hooked-billed kites, he/she can establish whether its measurements are within the typical standards. Knowing this, they can further study why a measurement, such as an outlier, was recorded in the first place.

Click to show code

n <- length(tidy1$TailLength)

# Get the fitted values of the linear model
model <- lm(TailLength ~ WingLength, data = tidy1)
fit <- model$fitted.values
  
# Find standard error 
serror <- sqrt(sum((tidy1$TailLength - fit)^2) / (n - 2)) * sqrt(1 / n + (tidy1$WingLength - mean(tidy1$WingLength))^2 / sum((tidy1$WingLength - mean(tidy1$WingLength))^2))

# Calculate B statistics 
B <- 1-qt(.95/(2 * 3), n - 1)

# Compute the simultaneous Bonferroni confidence intervals 
bon.upper <- fit + B * serror
bon.lower <- fit - B * serror

Moreover, from the plot, there are a few outliers within the data set that can be a result of type 1 error. However, the confidence ellipse still appeared to contain 95% of the observations. As the ecologist observations for female hook-billed kites increases, the mean will usually be better and better estimated, leading to smaller and smaller confidence ellipses, which in turn contain a smaller and smaller proportion of the actual data.


LEVEL 2

Anacondas

Anacondas are some of the largest snakes in the world. A team of researchers recorded the snout vent length (in cm) or the length from the snout of the snake to its vent where it evacuates waste, and its weight (in kg). The data set has the measurements sub-divided by the snakes’ gender.

theURL <- "https://raw.githubusercontent.com/greeneyefirefly/Data607/master/Projects/Project%202/untidy2.csv"
untidy2 <- data.frame(read.csv(file = theURL, header = TRUE, sep = ","))

Click to show raw data

##               Female      X              Male    X.1
## 1  Snout Vent Length Weight Snout Vent Length Weight
## 2                271   18.5             176.7      3
## 3                477   82.5             259.5   9.75
## 4              306.3   23.4               258  10.07
## 5              365.3   33.5             229.8   7.50
## 6                466     69               233   6.25
## 7              440.7     54             237.5   9.85
## 8                315  24.97             268.3     10
## 9              417.5  56.75             222.5      9
## 10             307.3  23.15             186.5   3.75
## 11               319  29.51             238.8   9.75
## 12             303.9  19.98             257.6   9.75
## 13             331.7     24               172      3
## 14               435  70.37             244.7     10
## 15             261.3   15.5             224.7   7.25
## 16             384.8     63             231.7   9.25
## 17             360.3     39             235.9    7.5
## 18             441.4     53             236.5   5.75
## 19             246.7  15.75             247.4   7.75
## 20             365.3     44               223   5.75
## 21             336.8     30               223   5.75
## 22             326.7     34             212.5   7.65
## 23               312     25               223   7.75
## 24             226.7   9.25               225   5.84
## 25             347.4     30               228   7.53
## 26             280.2  15.25             215.6   5.75
## 27             290.7   21.5               221   6.45
## 28             438.6     57             236.7   6.49
## 29             377.1   61.5             235.3      6


How & Why to Tidy

For this data set, let the aim be to test for equality of means between male and female anacondas. The subdivision of this data set makes the test inputs very untidy and will not allow for quick analysis since careful attention is needed for indexing. Therefore, in this case, tidying is based on the statistical analysis.

Here, it was decided that the variable for gender needed to have its own column, and the respective measurements needed to be merge, changing from the wide format into a long format, as shown below:

Data transformation using dplyr and tidyr.
# Identifying the columns by renaming
tidy2 <- untidy2 %>% rename(Fsnout = Female, Fwght = X, Msnout = Male, Mwght = X.1) 

# Simple row removal and renaming
tidy2 <- tidy2[-1,]
rownames(tidy2) <- 1:nrow(tidy2)

# Gather the data by respective columns
tidy2 <- gather(tidy2, "gender", "measurement", c(c(1,3),c(2,4)))

# Creating a new data frame of variable identification to be merged with data set
id <- data.frame(tidy2$gender, stringsAsFactors=FALSE)
titles <- c("snoutLength", "weight")
id[c(1:28),] <- titles[1]
id[c(29:56),] <- titles[2]
id[c(57:84),] <- titles[1]
id[c(85:112),] <- titles[2]

# Convert variables to factor for identification
# Recode variable factors into appropriate labels
tidy2 <- tidy2 %>% mutate_at(vars(gender), as.factor) %>% mutate(gender = recode(gender, "c('Fsnout', 'Fwght') = 'Female' ; c('Msnout', 'Mwght') = 'Male'"))

# Spread the data based on the identified gender categories for snout and weight data
# Gather once more to create the final gender key
tidy2 <- tidy2 %>% group_by(gender) %>% mutate(grouped_id = row_number()) %>% spread(gender, measurement, convert = TRUE) %>% select(-grouped_id) %>% gather("Key", "measurement", 1:2) 

# Attach the new column data
tidy2 <- data.frame(append(tidy2,id))

# Spread once more by the gender key so each measurement is in its own column
tidy2 <- tidy2 %>% group_by(tidy2.gender) %>% mutate(grouped_id = row_number()) %>% spread(tidy2.gender, measurement, convert = TRUE) %>% select(-grouped_id) %>% rename(gender = Key)

Anacondas Data set

Untidy

Female X Male X.1
Snout Vent Length Weight Snout Vent Length Weight
271 18.5 176.7 3
477 82.5 259.5 9.75
306.3 23.4 258 10.07
365.3 33.5 229.8 7.50
466 69 233 6.25
440.7 54 237.5 9.85
315 24.97 268.3 10
417.5 56.75 222.5 9
307.3 23.15 186.5 3.75
319 29.51 238.8 9.75
303.9 19.98 257.6 9.75
331.7 24 172 3
435 70.37 244.7 10
261.3 15.5 224.7 7.25
384.8 63 231.7 9.25
360.3 39 235.9 7.5
441.4 53 236.5 5.75
246.7 15.75 247.4 7.75
365.3 44 223 5.75
336.8 30 223 5.75
326.7 34 212.5 7.65
312 25 223 7.75
226.7 9.25 225 5.84
347.4 30 228 7.53
280.2 15.25 215.6 5.75
290.7 21.5 221 6.45
438.6 57 236.7 6.49
377.1 61.5 235.3 6

Tidy

gender snoutLength weight
Female 271.0 18.50
Female 477.0 82.50
Female 306.3 23.40
Female 365.3 33.50
Female 466.0 69.00
Female 440.7 54.00
Female 315.0 24.97
Female 417.5 56.75
Female 307.3 23.15
Female 319.0 29.51
Female 303.9 19.98
Female 331.7 24.00
Female 435.0 70.37
Female 261.3 15.50
Female 384.8 63.00
Female 360.3 39.00
Female 441.4 53.00
Female 246.7 15.75
Female 365.3 44.00
Female 336.8 30.00
Female 326.7 34.00
Female 312.0 25.00
Female 226.7 9.25
Female 347.4 30.00
Female 280.2 15.25
Female 290.7 21.50
Female 438.6 57.00
Female 377.1 61.50
Male 176.7 3.00
Male 259.5 9.75
Male 258.0 10.07
Male 229.8 7.50
Male 233.0 6.25
Male 237.5 9.85
Male 268.3 10.00
Male 222.5 9.00
Male 186.5 3.75
Male 238.8 9.75
Male 257.6 9.75
Male 172.0 3.00
Male 244.7 10.00
Male 224.7 7.25
Male 231.7 9.25
Male 235.9 7.50
Male 236.5 5.75
Male 247.4 7.75
Male 223.0 5.75
Male 223.0 5.75
Male 212.5 7.65
Male 223.0 7.75
Male 225.0 5.84
Male 228.0 7.53
Male 215.6 5.75
Male 221.0 6.45
Male 236.7 6.49
Male 235.3 6.00

Analysis & Results

Test of Correlation

With the untidy data, there are some simple analysis that can be done (but note that the untidy data still needed to be converted to the appropriate class). For instance, a correlation test can be done to evaluate the association between snout length and weight for both male and female snakes. Since the data is already separated, it can be easily fed into a function like this.

  1. Is there a relationship between the snout length and weight for female anacondas?
cor.test(as.numeric(untidy2$Female), as.numeric(untidy2$X), method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(untidy2$Female) and as.numeric(untidy2$X)
## t = 6.1083, df = 27, p-value = 1.591e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5482271 0.8819781
## sample estimates:
##      cor 
## 0.761688

The p-value of the test is less than 0.05, therefore, it can concluded that the weight and snout length for female anacondas are significantly correlated with a correlation coefficient of 0.76. Since it is a positive correlation, it further suggest that as the weight increases, the snout length increases as well for female anacondas.

  1. Is there a relationship between the snout length and weight for male anacondas?
cor.test(as.numeric(untidy2$Male), as.numeric(untidy2$X.1), method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(untidy2$Male) and as.numeric(untidy2$X.1)
## t = 1.6046, df = 27, p-value = 0.1202
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08011696  0.59699954
## sample estimates:
##       cor 
## 0.2950524

With the male anacondas, the p-value of the test is more than 0.05. Therefore, it can concluded that the weight and snout length for male are not significantly correlated, and there is no evidence that weight and snout length has any relationship for male anacondas.


Test of Difference

Now, with the tidy data, a two sample independent test can be conducted. Two data samples are independent as they come from unrelated populations and the samples does not affect each other. Firstly, from the normal Q-Q plot, the data on snout lengths can be said to follow closely to a normal distribution, therefore satisfying the assumption of normality to conduct a t-test.

t.test(snoutLength ~ gender, data = tidy2)
## 
##  Welch Two Sample t-test
## 
## data:  snoutLength by gender
## t = 8.7618, df = 32.712, p-value = 4.289e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   91.78351 147.32364
## sample estimates:
## mean in group Female   mean in group Male 
##             348.2750             228.7214

Using the t-test, the interval estimate of the difference between two gender means was obtained. In the tidy data, the mean snout length for female anacondas is 348.28 cm and the mean snout length for male anacondas is 228.72 cm. The 95% confidence interval of the difference in mean snout length is between 91.78 and 147.32 cm. As a result, there is a statistical significant difference in snout length based on gender where female anacondas have a larger snout than males.

Now with the weight of the anacondas, the normal Q-Q plot also show that the data satisfy the assumption of normality to conduct a t-test.

t.test(weight ~ gender, data = tidy2)
## 
##  Welch Two Sample t-test
## 
## data:  weight by gender
## t = 7.8475, df = 27.592, p-value = 1.672e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  22.14418 37.80225
## sample estimates:
## mean in group Female   mean in group Male 
##            37.263571             7.290357

The t-test results shows that the mean weight for female anacondas is 37.26 kg and the mean weight for male anacondas is 7.29 kg. The 95% confidence interval of the difference in mean weight is between 22.14 and 37.80 kg. Therefore, there is also a statistical significant difference in weight based on gender where female anacondas are heavier than males.


Conclusion

In conclusion, when analyzing the tidy set, it is evident that female anacondas are considerably longer and heavier than males. This may be because there is a relationship with female snout length and weight, which was not significant with male anacondas.


LEVEL 3

Peanut

Peanut are an important crop in parts of the southern United States. To develop improved plants, crop scientists routinely compare varieties with respect to several attributes. A table of data is given in a .csv file. There are four varieties of peanuts, Runner, Spanish, Valencia, and Virginia that are grown at four geographical locations, Alabama, Florida, Georgia, and Texas. Moreover, this table consist the yield (plot weight), and two important grade-grain characteristics, sound mature kernels (weight in grams), and seed size (weight in grams).

theURL <- "https://raw.githubusercontent.com/greeneyefirefly/Data607/master/Projects/Project%202/untidy3.csv"
untidy3 <- data.frame(read.csv(file = theURL, header = TRUE, sep = ","))

Click to show raw data

##    Table.6.17.Peanut.Data        X     X.1            X.2     X.3     X.4
## 1                 Batch 1                                                
## 2                                                         Variety        
## 3                                                          Runner Spanish
## 4                         Location Georgia          Yield   195.3     203
## 5                                          kernels weight   153.1   156.8
## 6                                               seed size    51.4    49.8
## 7                                                                        
## 8                                    Texas          Yield   189.7   202.7
## 9                                          kernels weight   139.5   166.1
## 10                                              seed size    55.5    60.4
## 11                                                                       
## 12                                 Alabama          Yield   181.8   204.6
## 13                                         kernels weight   139.6   151.7
## 14                                              seed size    56.4    57.4
## 15                                                                       
## 16                                 Florida          Yield   185.2   204.5
## 17                                         kernels weight   158.2   159.8
## 18                                              seed size    53.7    57.1
## 19                                                                       
## 20                Batch 2                                                
## 21                                                        Variety        
## 22                                                         Runner Spanish
## 23                        Location Georgia          Yield   194.3   195.9
## 24                                         kernels weight   167.7     166
## 25                                              seed size    53.7    45.8
## 26                                                                       
## 27                                   Texas          Yield   180.4   197.6
## 28                                         kernels weight   121.1   161.8
## 29                                              seed size    44.4    54.1
## 30                                                                       
## 31                                 Alabama          Yield   186.7   194.5
## 32                                         kernels weight   157.5   157.3
## 33                                              seed size    49.6    54.1
## 34                                                                       
## 35                                 Florida          Yield   193.7   198.4
## 36                                         kernels weight   146.5   160.3
## 37                                              seed size    52.5    57.4
##         X.5     X.6
## 1                  
## 2                  
## 3  Valencia Virgina
## 4     193.5   196.4
## 5     164.5   158.3
## 6      57.8    53.1
## 7                  
## 8     201.5   195.5
## 9     166.8   157.4
## 10       65    57.8
## 11                 
## 12    198.8   195.7
## 13    166.1   161.9
## 14     63.5    57.2
## 15                 
## 16    197.4   195.9
## 17    165.2   157.2
## 18     60.9    56.8
## 19                 
## 20                 
## 21                 
## 22 Valencia Virgina
## 23      187   194.2
## 24    165.1   162.4
## 25     58.6    48.9
## 26                 
## 27      200   192.3
## 28    173.8   153.5
## 29     67.2    58.7
## 30                 
## 31    197.4   193.7
## 32    168.8   159.5
## 33     58.7    54.8
## 34                 
## 35    197.5   193.3
## 36    169.4   156.7
## 37     59.8    57.4


How & Why to Tidy

The data set is in a table format. The table itself can be easily read, but it is in no way workable for any statistical analysis using R. More specifically, there are necessary variable names which do not have their own columns. The subdivision of this data set makes the test inputs very untidy and will not allow for quick analysis since careful attention is needed for indexing. Therefore, tidying is based on the statistical analysis to make the data set easily identifiable with the necessary data.

In this case, it was decided that specific rows needed to become columns, and the respective measurements needed to be merge, transforming the wide table into a long data set, as shown below.

Data Transformation using dplyr and tidyr.
# Rigorous cleaning methods #
## Removing unnecessary rows
tidy3 <- untidy3[-c(2,7,11,15,19,21,26,30,34),-2]

# Identify the variable names for missing column before tidying
## Batch labels
tidy3[c(2:14),1] <- tidy3[1,1]
tidy3[c(16:28),1] <- tidy3[15,1]

## Location labels
tidy3[c(3:5,17:19),2] <- tidy3[3,2]
tidy3[c(6:8,20:22),2] <- tidy3[6,2]
tidy3[c(9:11,23:25),2] <- tidy3[9,2]
tidy3[c(12:14,26:28),2] <- tidy3[12,2]

# Identifying the columns by renaming
tidy3 <- rename(tidy3, Batch = Table.6.17.Peanut.Data, Location = X.1, Variable = X.2, Runner = X.3, Spanish = X.4, Valencia = X.5, Virginia = X.6) 

# Removing and renaming final rows
tidy3 = tidy3[-c(1,2,15,16),]
rownames(tidy3) <- 1:nrow(tidy3)

# Gather the data into variables of peanut variety and the measurements
# Spread based on the measurement variable names
# Convert character data into factors and numbers
tidy3 <- tidy3 %>% gather("Variety", "Measurement", 4:7) %>% spread("Variable", "Measurement", convert = TRUE) 
tidy3 <- droplevels(mutate_if(tidy3, is.character, as.factor))

Peanut Data set

Untidy

Table.6.17.Peanut.Data X X.1 X.2 X.3 X.4 X.5 X.6
Batch 1
Variety
Runner Spanish Valencia Virgina
Location Georgia Yield 195.3 203 193.5 196.4
kernels weight 153.1 156.8 164.5 158.3
seed size 51.4 49.8 57.8 53.1
Texas Yield 189.7 202.7 201.5 195.5
kernels weight 139.5 166.1 166.8 157.4
seed size 55.5 60.4 65 57.8
Alabama Yield 181.8 204.6 198.8 195.7
kernels weight 139.6 151.7 166.1 161.9
seed size 56.4 57.4 63.5 57.2
Florida Yield 185.2 204.5 197.4 195.9
kernels weight 158.2 159.8 165.2 157.2
seed size 53.7 57.1 60.9 56.8
Batch 2
Variety
Runner Spanish Valencia Virgina
Location Georgia Yield 194.3 195.9 187 194.2
kernels weight 167.7 166 165.1 162.4
seed size 53.7 45.8 58.6 48.9
Texas Yield 180.4 197.6 200 192.3
kernels weight 121.1 161.8 173.8 153.5
seed size 44.4 54.1 67.2 58.7
Alabama Yield 186.7 194.5 197.4 193.7
kernels weight 157.5 157.3 168.8 159.5
seed size 49.6 54.1 58.7 54.8
Florida Yield 193.7 198.4 197.5 193.3
kernels weight 146.5 160.3 169.4 156.7
seed size 52.5 57.4 59.8 57.4

Tidy

Batch Location Variety kernels weight seed size Yield
Batch 1 Alabama Runner 139.6 56.4 181.8
Batch 1 Alabama Spanish 151.7 57.4 204.6
Batch 1 Alabama Valencia 166.1 63.5 198.8
Batch 1 Alabama Virginia 161.9 57.2 195.7
Batch 1 Florida Runner 158.2 53.7 185.2
Batch 1 Florida Spanish 159.8 57.1 204.5
Batch 1 Florida Valencia 165.2 60.9 197.4
Batch 1 Florida Virginia 157.2 56.8 195.9
Batch 1 Georgia Runner 153.1 51.4 195.3
Batch 1 Georgia Spanish 156.8 49.8 203.0
Batch 1 Georgia Valencia 164.5 57.8 193.5
Batch 1 Georgia Virginia 158.3 53.1 196.4
Batch 1 Texas Runner 139.5 55.5 189.7
Batch 1 Texas Spanish 166.1 60.4 202.7
Batch 1 Texas Valencia 166.8 65.0 201.5
Batch 1 Texas Virginia 157.4 57.8 195.5
Batch 2 Alabama Runner 157.5 49.6 186.7
Batch 2 Alabama Spanish 157.3 54.1 194.5
Batch 2 Alabama Valencia 168.8 58.7 197.4
Batch 2 Alabama Virginia 159.5 54.8 193.7
Batch 2 Florida Runner 146.5 52.5 193.7
Batch 2 Florida Spanish 160.3 57.4 198.4
Batch 2 Florida Valencia 169.4 59.8 197.5
Batch 2 Florida Virginia 156.7 57.4 193.3
Batch 2 Georgia Runner 167.7 53.7 194.3
Batch 2 Georgia Spanish 166.0 45.8 195.9
Batch 2 Georgia Valencia 165.1 58.6 187.0
Batch 2 Georgia Virginia 162.4 48.9 194.2
Batch 2 Texas Runner 121.1 44.4 180.4
Batch 2 Texas Spanish 161.8 54.1 197.6
Batch 2 Texas Valencia 173.8 67.2 200.0
Batch 2 Texas Virginia 153.5 58.7 192.3

Analysis & Results

Variable Effect

In this tidy data set, there are two batches (1 and 2) of peanut shipment from four different locations, and there is an interest in the kernels weight, seed size and yield of the peanuts. Simple two sample t-tests between the batches resulted in no difference in the means for the three attributes. Hence, let’s investigate if all three together are affected by the difference in location and peanut variety. A multivariate analysis of variance could be used to test this, by answering the following research questions:

  1. Is there any significant difference in the three attributes among the different location they were sourced?

  2. Is there any significant difference in the three attributes among the different variety of peanuts?

  3. Is there any significant difference in the three attributes with the interactions of the different variety and location they were sourced?

res.man <- manova(cbind(`kernels weight`,`seed size`, Yield) ~ Location + Variety + Location*Variety, data = tidy3)

summary(res.man, test="Pillai")
##                  Df  Pillai approx F num Df den Df    Pr(>F)    
## Location          3 0.69178   1.5984      9     48    0.1427    
## Variety           3 1.52281   5.4981      9     48 3.512e-05 ***
## Location:Variety  9 1.33139   1.4185     27     48    0.1430    
## Residuals        16                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model reveal that there is no significant difference among the three attributes and the location they were grown, p-value > 0.05. Moreover, the findings indicated that the interaction effect was also non-significant overall. However, there is a statistical significant difference among the three attributes and the different variety of peanut, p-value < 0.05. Let’s look at the fit residuals and effects in more detail.


Residuals of the model

The residuals for kernels weight appears very large in absolute value, but the Q-Q plots of residuals indicate that univariable normality cannot be rejected for all three variables. Since the data can be assume to follow a normal distribution based on the Q-Q plots, an analysis of the variance model can be done.


Test of Difference

Here, the detailed result shows that there is a difference when the kernel weight was considered based on the variety of the peanut and there is an interaction effect with the location and variety, p-value < 0.05. For seed size, both location and variety were statistically significant but there was no interaction effect. Lastly, for yield, only the different variety were found to be statistically significant, p-value < 0.05.

summary.aov(res.man)
##  Response kernels weight :
##                  Df  Sum Sq Mean Sq F value    Pr(>F)    
## Location          3  189.10   63.03  1.6059 0.2273214    
## Variety           3 1558.87  519.62 13.2379 0.0001333 ***
## Location:Variety  9  949.11  105.46  2.6866 0.0406680 *  
## Residuals        16  628.04   39.25                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response seed size :
##                  Df Sum Sq Mean Sq F value    Pr(>F)    
## Location          3 141.64  47.213  5.0843 0.0116171 *  
## Variety           3 373.98 124.659 13.4245 0.0001232 ***
## Location:Variety  9 122.98  13.665  1.4716 0.2396095    
## Residuals        16 148.57   9.286                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Yield :
##                  Df Sum Sq Mean Sq F value    Pr(>F)    
## Location          3  10.08   3.361  0.2280 0.8755284    
## Variety           3 584.48 194.826 13.2169 0.0001345 ***
## Location:Variety  9 259.01  28.779  1.9524 0.1164411    
## Residuals        16 235.85  14.741                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusion

In conclusion, the two-factor MANOVA indicates that there is evidence of variety effect for all three attributes. There is also a location effect for the seed size, and an interaction effect for the kernel weight only. These findings could have only been calculated with a tidy data set, demonstrating why the data needed to be tidy appropriately based on the analysis.


The Challenge of Tidying and Transformation

For each data set, it was established how and why the tidying and transformation were carried out. The reasons include to make the data set easily identifiable with the necessary data and for appropriate inputs to statistical analyses. The challenging of this project was deciding how to transform the data. For the Level 1 data set, Hook-billed kites, the text data needed to be correctly merge with their respective attributes. The Level 2 Anaconda data set needed to be gather and spread a few times with additional information before it was in the right format. Lastly, the Level 3 Peanut data table needed rigorous cleaning and proper labeling to make a table into a data frame. Overall, each data set needed careful attention to understand how and when to use tidyr and dplyr to tidy and transform them as necessary.


Works Cited

  • Johnson, Richard A, and Dean W. Wichern. Applied Multivariate Statistical Analysis. Englewood Cliffs, N.J: Prentice Hall, 1992. Print.

  • Tinsley, Howard E. Handbook of Applied Multivariate Statistics and Mathematical Modeling. Academic Press, 2006. Print.