Overview

In this project, three data sets that were considered untidy were put through tidying and transformation using R. In a tidy data set, each variable and observation is saved as its own column and row respectively. These are the main packages use for data wrangling, visualization and graphics. Any other minor packages for analysis will be listed when needed.

library(tidyr)
library(dplyr)

# Other main packages
library(kableExtra)
library(ggplot2)
library(car)

The three data sets were taken from Richard A. Johnson and Dean W. Wichern’s Applied Multivariate Statistical Analysis textbook. These sets are relatively small, but the focus of this project is on the tidying and transformation methodology using only dplyr and tidyr, where necessary, rather than the amount of data and the analysis performed. These techniques can therefore be translated to another data set that needs to be format in a similar way. The data sets were saved to my Github and were retrieved from there. The difficulty of the tidying to be done can by ranked as follow:

Level 1: Hook-Billed Kites
Level 2: Anacondas
Level 3: Peanuts

LEVEL 1

Hook-Billed Kites

Hooked-billed kites are tropical raptor found broadly across Central and South America, occasionally north to the very southern tip of Texas. They have long tail and broad, rounded wings distinctive in flight. Males are gray overall with white barring on the belly. Females are brown above with rufous barred underparts and mostly gray head. Juveniles paler below with thinner barring.

In this data set, a wildlife ecologist measured the tail length (in millimeters) and wing length (in millimeters) for a sample of n = 45 female hook-billed kites. The data is a text file which list the tail and wing lengths in three separate columns, 15 rows each, to conserve page space.

theURL <- "https://raw.githubusercontent.com/greeneyefirefly/Data607/master/Projects/Project%202/untidy1.txt"
untidy1 <- read.delim(theURL, header = FALSE, stringsAsFactors = FALSE, na.strings = c("", "NA"))

Click to show raw data

##                                                                                               V1
## 1                                                                           Table 5.12 Bird Data
## 2  =============================================================================================
## 3                                                                                (Tail length) |
## 4  ---------------------------------------------------------------------------------------------
## 5                                                                                             x1
## 6  ---------------------------------------------------------------------------------------------
## 7                                                                                            191
## 8                                                                                            197
## 9                                                                                            208
## 10                                                                                           180
## 11                                                                                           180
## 12                                                                                           188
## 13                                                                                           210
## 14                                                                                           196
## 15                                                                                           191
## 16                                                                                           179
## 17                                                                                           208
## 18                                                                                           202
## 19                                                                                           200
## 20                                                                                           192
## 21                                                                                           199
## 22 ---------------------------------------------------------------------------------------------
## 23                                                          Source: Data courtesy of S. Temple. 
##                 V2              V3              V4              V5
## 1             <NA>            <NA>            <NA>            <NA>
## 2             <NA>            <NA>            <NA>            <NA>
## 3  (Wing length) | (Tail length) | (Wing length) | (Tail length) |
## 4             <NA>            <NA>            <NA>            <NA>
## 5             <NA>              x2            <NA>              x1
## 6             <NA>            <NA>            <NA>            <NA>
## 7             <NA>             284            <NA>             186
## 8             <NA>             285            <NA>             197
## 9             <NA>             288            <NA>             201
## 10            <NA>             273            <NA>             190
## 11            <NA>             275            <NA>             209
## 12            <NA>             280            <NA>             187
## 13            <NA>             283            <NA>             207
## 14            <NA>             288            <NA>             178
## 15            <NA>             271            <NA>             202
## 16            <NA>             257            <NA>             205
## 17            <NA>             289            <NA>             190
## 18            <NA>             285            <NA>             189
## 19            <NA>             272            <NA>             211
## 20            <NA>             282            <NA>             216
## 21            <NA>             280            <NA>             189
## 22            <NA>            <NA>            <NA>            <NA>
## 23            <NA>            <NA>            <NA>            <NA>
##               V6   V7 V8   V9 V10  V11
## 1           <NA> <NA> NA <NA>  NA <NA>
## 2           <NA> <NA> NA <NA>  NA <NA>
## 3  (Wing length) <NA> NA <NA>  NA <NA>
## 4           <NA> <NA> NA <NA>  NA <NA>
## 5           <NA>   x2 NA   x1  NA   x2
## 6           <NA> <NA> NA <NA>  NA <NA>
## 7           <NA>  266 NA  173  NA  271
## 8           <NA>  285 NA  194  NA  280
## 9           <NA>  295 NA  198  NA  300
## 10          <NA>  282 NA  180  NA  272
## 11          <NA>  305 NA  190  NA  292
## 12          <NA>  285 NA  191  NA  286
## 13          <NA>  297 NA  196  NA  285
## 14          <NA>  268 NA  207  NA  286
## 15          <NA>  271 NA  109  NA  303
## 16          <NA>  285 NA  179  NA  261
## 17          <NA>  280 NA  186  NA  262
## 18          <NA>  277 NA  174  NA  245
## 19          <NA>  310 NA  181  NA  250
## 20          <NA>  305 NA  189  NA  262
## 21          <NA>  274 NA  188  NA  258
## 22          <NA> <NA> NA <NA>  NA <NA>
## 23          <NA> <NA> NA <NA>  NA <NA>

How & Why to Tidy

Each of the same variable was spread across three columns, this made it cumbersome to index to run any analysis. In addition, there are unnecessary variable labels which can be removed once transformed to make the data set more comprehensible, not to mention the numerous empty columns and rows. Therefore, the main reason for tidying was to make the data set easily identifiable with the necessary data.

In this case, it was decided that the columns needed to be merged, transforming the wide data set into a long format, as shown below.

Data transformation using `dplyr` and `tidyr`.

# Remove empty rows and columns
untidy1 <- data.frame(sapply(untidy1, as.numeric))
tidy1 <- untidy1 %>% filter_all(any_vars(!is.na(.))) %>% select_if(function(x) !(all(is.na(x)) | all(x=="")))

# Gather the data of all the columns into one 
# Convert variables to factor for identification
# Recode variable factors into appropriate labels
tidy1 <- tidy1 %>% gather(tidy1, lengths, 1:6, convert = TRUE) %>%  mutate(tidy1= recode(tidy1, "c('V1', 'V5', 'V9')='TailLength'; c('V3', 'V7', 'V11') = 'WingLength'"))

# Renaming rows
rownames(untidy1) <- 1:nrow(untidy1)

# Spreading the data to make it long
tidy1 <- tidy1 %>% group_by(tidy1) %>% mutate(grouped_id = row_number()) %>% spread(tidy1, lengths) %>% select(-grouped_id)

Hook-Billed Kites Data set

Untidy

V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	V11
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
191	NA	284	NA	186	NA	266	NA	173	NA	271
197	NA	285	NA	197	NA	285	NA	194	NA	280
208	NA	288	NA	201	NA	295	NA	198	NA	300
180	NA	273	NA	190	NA	282	NA	180	NA	272
180	NA	275	NA	209	NA	305	NA	190	NA	292
188	NA	280	NA	187	NA	285	NA	191	NA	286
210	NA	283	NA	207	NA	297	NA	196	NA	285
196	NA	288	NA	178	NA	268	NA	207	NA	286
191	NA	271	NA	202	NA	271	NA	109	NA	303
179	NA	257	NA	205	NA	285	NA	179	NA	261
208	NA	289	NA	190	NA	280	NA	186	NA	262
202	NA	285	NA	189	NA	277	NA	174	NA	245
200	NA	272	NA	211	NA	310	NA	181	NA	250
192	NA	282	NA	216	NA	305	NA	189	NA	262
199	NA	280	NA	189	NA	274	NA	188	NA	258
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA

Tidy

TailLength	WingLength
191	284
197	285
208	288
180	273
180	275
188	280
210	283
196	288
191	271
179	257
208	289
202	285
200	272
192	282
199	280
186	266
197	285
201	295
190	282
209	305
187	285
207	297
178	268
202	271
205	285
190	280
189	277
211	310
216	305
189	274
173	271
194	280
198	300
180	272
190	292
191	286
196	285
207	286
109	303
179	261
186	262
174	245
181	250
189	262
188	258

Results & Conclusion

Let’s do some fun analysis! Using this data, let’s find the simultaneous confidence level for the component means, i.e. the wing and tail lengths. When multiple confidence intervals are examined, the chances that at least one of the confidence intervals does not contain the population parameter is greater for a set of intervals than for any single interval.

In this analysis, the 95% Bonferroni confidence intervals were determined. The plot reveals that the Bonferroni confidence intervals, indicated by the dash lines, suggest that the ecologist can be 95% confident that the points that fall within all the set of confidence intervals includes the true population standard deviations for all variables. This plot is helpful to the wildlife ecologist because when he/she discovers more female hooked-billed kites, he/she can establish whether its measurements are within the typical standards. Knowing this, they can further study why a measurement, such as an outlier, was recorded in the first place.

Click to show code

n <- length(tidy1$TailLength)

# Get the fitted values of the linear model
model <- lm(TailLength ~ WingLength, data = tidy1)
fit <- model$fitted.values
  
# Find standard error 
serror <- sqrt(sum((tidy1$TailLength - fit)^2) / (n - 2)) * sqrt(1 / n + (tidy1$WingLength - mean(tidy1$WingLength))^2 / sum((tidy1$WingLength - mean(tidy1$WingLength))^2))

# Calculate B statistics 
B <- 1-qt(.95/(2 * 3), n - 1)

# Compute the simultaneous Bonferroni confidence intervals 
bon.upper <- fit + B * serror
bon.lower <- fit - B * serror

Moreover, from the plot, there are a few outliers within the data set that can be a result of type 1 error. However, the confidence ellipse still appeared to contain 95% of the observations. As the ecologist observations for female hook-billed kites increases, the mean will usually be better and better estimated, leading to smaller and smaller confidence ellipses, which in turn contain a smaller and smaller proportion of the actual data.

LEVEL 2

Anacondas

Anacondas are some of the largest snakes in the world. A team of researchers recorded the snout vent length (in cm) or the length from the snout of the snake to its vent where it evacuates waste, and its weight (in kg). The data set has the measurements sub-divided by the snakes’ gender.

theURL <- "https://raw.githubusercontent.com/greeneyefirefly/Data607/master/Projects/Project%202/untidy2.csv"
untidy2 <- data.frame(read.csv(file = theURL, header = TRUE, sep = ","))

Click to show raw data

##               Female      X              Male    X.1
## 1  Snout Vent Length Weight Snout Vent Length Weight
## 2                271   18.5             176.7      3
## 3                477   82.5             259.5   9.75
## 4              306.3   23.4               258  10.07
## 5              365.3   33.5             229.8   7.50
## 6                466     69               233   6.25
## 7              440.7     54             237.5   9.85
## 8                315  24.97             268.3     10
## 9              417.5  56.75             222.5      9
## 10             307.3  23.15             186.5   3.75
## 11               319  29.51             238.8   9.75
## 12             303.9  19.98             257.6   9.75
## 13             331.7     24               172      3
## 14               435  70.37             244.7     10
## 15             261.3   15.5             224.7   7.25
## 16             384.8     63             231.7   9.25
## 17             360.3     39             235.9    7.5
## 18             441.4     53             236.5   5.75
## 19             246.7  15.75             247.4   7.75
## 20             365.3     44               223   5.75
## 21             336.8     30               223   5.75
## 22             326.7     34             212.5   7.65
## 23               312     25               223   7.75
## 24             226.7   9.25               225   5.84
## 25             347.4     30               228   7.53
## 26             280.2  15.25             215.6   5.75
## 27             290.7   21.5               221   6.45
## 28             438.6     57             236.7   6.49
## 29             377.1   61.5             235.3      6

How & Why to Tidy

For this data set, let the aim be to test for equality of means between male and female anacondas. The subdivision of this data set makes the test inputs very untidy and will not allow for quick analysis since careful attention is needed for indexing. Therefore, in this case, tidying is based on the statistical analysis.

Here, it was decided that the variable for gender needed to have its own column, and the respective measurements needed to be merge, changing from the wide format into a long format, as shown below:

Data transformation using `dplyr` and `tidyr`.

# Identifying the columns by renaming
tidy2 <- untidy2 %>% rename(Fsnout = Female, Fwght = X, Msnout = Male, Mwght = X.1) 

# Simple row removal and renaming
tidy2 <- tidy2[-1,]
rownames(tidy2) <- 1:nrow(tidy2)

# Gather the data by respective columns
tidy2 <- gather(tidy2, "gender", "measurement", c(c(1,3),c(2,4)))

# Creating a new data frame of variable identification to be merged with data set
id <- data.frame(tidy2$gender, stringsAsFactors=FALSE)
titles <- c("snoutLength", "weight")
id[c(1:28),] <- titles[1]
id[c(29:56),] <- titles[2]
id[c(57:84),] <- titles[1]
id[c(85:112),] <- titles[2]

# Convert variables to factor for identification
# Recode variable factors into appropriate labels
tidy2 <- tidy2 %>% mutate_at(vars(gender), as.factor) %>% mutate(gender = recode(gender, "c('Fsnout', 'Fwght') = 'Female' ; c('Msnout', 'Mwght') = 'Male'"))

# Spread the data based on the identified gender categories for snout and weight data
# Gather once more to create the final gender key
tidy2 <- tidy2 %>% group_by(gender) %>% mutate(grouped_id = row_number()) %>% spread(gender, measurement, convert = TRUE) %>% select(-grouped_id) %>% gather("Key", "measurement", 1:2) 

# Attach the new column data
tidy2 <- data.frame(append(tidy2,id))

# Spread once more by the gender key so each measurement is in its own column
tidy2 <- tidy2 %>% group_by(tidy2.gender) %>% mutate(grouped_id = row_number()) %>% spread(tidy2.gender, measurement, convert = TRUE) %>% select(-grouped_id) %>% rename(gender = Key)

Anacondas Data set

Untidy

Female	X	Male	X.1
Snout Vent Length	Weight	Snout Vent Length	Weight
271	18.5	176.7	3
477	82.5	259.5	9.75
306.3	23.4	258	10.07
365.3	33.5	229.8	7.50
466	69	233	6.25
440.7	54	237.5	9.85
315	24.97	268.3	10
417.5	56.75	222.5	9
307.3	23.15	186.5	3.75
319	29.51	238.8	9.75
303.9	19.98	257.6	9.75
331.7	24	172	3
435	70.37	244.7	10
261.3	15.5	224.7	7.25
384.8	63	231.7	9.25
360.3	39	235.9	7.5
441.4	53	236.5	5.75
246.7	15.75	247.4	7.75
365.3	44	223	5.75
336.8	30	223	5.75
326.7	34	212.5	7.65
312	25	223	7.75
226.7	9.25	225	5.84
347.4	30	228	7.53
280.2	15.25	215.6	5.75
290.7	21.5	221	6.45
438.6	57	236.7	6.49
377.1	61.5	235.3	6

Tidy

gender	snoutLength	weight
Female	271.0	18.50
Female	477.0	82.50
Female	306.3	23.40
Female	365.3	33.50
Female	466.0	69.00
Female	440.7	54.00
Female	315.0	24.97
Female	417.5	56.75
Female	307.3	23.15
Female	319.0	29.51
Female	303.9	19.98
Female	331.7	24.00
Female	435.0	70.37
Female	261.3	15.50
Female	384.8	63.00
Female	360.3	39.00
Female	441.4	53.00
Female	246.7	15.75
Female	365.3	44.00
Female	336.8	30.00
Female	326.7	34.00
Female	312.0	25.00
Female	226.7	9.25
Female	347.4	30.00
Female	280.2	15.25
Female	290.7	21.50
Female	438.6	57.00
Female	377.1	61.50
Male	176.7	3.00
Male	259.5	9.75
Male	258.0	10.07
Male	229.8	7.50
Male	233.0	6.25
Male	237.5	9.85
Male	268.3	10.00
Male	222.5	9.00
Male	186.5	3.75
Male	238.8	9.75
Male	257.6	9.75
Male	172.0	3.00
Male	244.7	10.00
Male	224.7	7.25
Male	231.7	9.25
Male	235.9	7.50
Male	236.5	5.75
Male	247.4	7.75
Male	223.0	5.75
Male	223.0	5.75
Male	212.5	7.65
Male	223.0	7.75
Male	225.0	5.84
Male	228.0	7.53
Male	215.6	5.75
Male	221.0	6.45
Male	236.7	6.49
Male	235.3	6.00

Analysis & Results

Test of Correlation

With the untidy data, there are some simple analysis that can be done (but note that the untidy data still needed to be converted to the appropriate class). For instance, a correlation test can be done to evaluate the association between snout length and weight for both male and female snakes. Since the data is already separated, it can be easily fed into a function like this.

Is there a relationship between the snout length and weight for female anacondas?

cor.test(as.numeric(untidy2$Female), as.numeric(untidy2$X), method = "pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(untidy2$Female) and as.numeric(untidy2$X)
## t = 6.1083, df = 27, p-value = 1.591e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5482271 0.8819781
## sample estimates:
##      cor 
## 0.761688

The p-value of the test is less than 0.05, therefore, it can concluded that the weight and snout length for female anacondas are significantly correlated with a correlation coefficient of 0.76. Since it is a positive correlation, it further suggest that as the weight increases, the snout length increases as well for female anacondas.

Is there a relationship between the snout length and weight for male anacondas?

cor.test(as.numeric(untidy2$Male), as.numeric(untidy2$X.1), method = "pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(untidy2$Male) and as.numeric(untidy2$X.1)
## t = 1.6046, df = 27, p-value = 0.1202
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08011696  0.59699954
## sample estimates:
##       cor 
## 0.2950524

With the male anacondas, the p-value of the test is more than 0.05. Therefore, it can concluded that the weight and snout length for male are not significantly correlated, and there is no evidence that weight and snout length has any relationship for male anacondas.

Test of Difference

Now, with the tidy data, a two sample independent test can be conducted. Two data samples are independent as they come from unrelated populations and the samples does not affect each other. Firstly, from the normal Q-Q plot, the data on snout lengths can be said to follow closely to a normal distribution, therefore satisfying the assumption of normality to conduct a t-test.

t.test(snoutLength ~ gender, data = tidy2)

## 
##  Welch Two Sample t-test
## 
## data:  snoutLength by gender
## t = 8.7618, df = 32.712, p-value = 4.289e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   91.78351 147.32364
## sample estimates:
## mean in group Female   mean in group Male 
##             348.2750             228.7214

Using the t-test, the interval estimate of the difference between two gender means was obtained. In the tidy data, the mean snout length for female anacondas is 348.28 cm and the mean snout length for male anacondas is 228.72 cm. The 95% confidence interval of the difference in mean snout length is between 91.78 and 147.32 cm. As a result, there is a statistical significant difference in snout length based on gender where female anacondas have a larger snout than males.

Now with the weight of the anacondas, the normal Q-Q plot also show that the data satisfy the assumption of normality to conduct a t-test.

t.test(weight ~ gender, data = tidy2)

## 
##  Welch Two Sample t-test
## 
## data:  weight by gender
## t = 7.8475, df = 27.592, p-value = 1.672e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  22.14418 37.80225
## sample estimates:
## mean in group Female   mean in group Male 
##            37.263571             7.290357

The t-test results shows that the mean weight for female anacondas is 37.26 kg and the mean weight for male anacondas is 7.29 kg. The 95% confidence interval of the difference in mean weight is between 22.14 and 37.80 kg. Therefore, there is also a statistical significant difference in weight based on gender where female anacondas are heavier than males.

Conclusion

In conclusion, when analyzing the tidy set, it is evident that female anacondas are considerably longer and heavier than males. This may be because there is a relationship with female snout length and weight, which was not significant with male anacondas.

LEVEL 3

Peanut

Peanut are an important crop in parts of the southern United States. To develop improved plants, crop scientists routinely compare varieties with respect to several attributes. A table of data is given in a .csv file. There are four varieties of peanuts, Runner, Spanish, Valencia, and Virginia that are grown at four geographical locations, Alabama, Florida, Georgia, and Texas. Moreover, this table consist the yield (plot weight), and two important grade-grain characteristics, sound mature kernels (weight in grams), and seed size (weight in grams).

theURL <- "https://raw.githubusercontent.com/greeneyefirefly/Data607/master/Projects/Project%202/untidy3.csv"
untidy3 <- data.frame(read.csv(file = theURL, header = TRUE, sep = ","))

Click to show raw data

##    Table.6.17.Peanut.Data        X     X.1            X.2     X.3     X.4
## 1                 Batch 1                                                
## 2                                                         Variety        
## 3                                                          Runner Spanish
## 4                         Location Georgia          Yield   195.3     203
## 5                                          kernels weight   153.1   156.8
## 6                                               seed size    51.4    49.8
## 7                                                                        
## 8                                    Texas          Yield   189.7   202.7
## 9                                          kernels weight   139.5   166.1
## 10                                              seed size    55.5    60.4
## 11                                                                       
## 12                                 Alabama          Yield   181.8   204.6
## 13                                         kernels weight   139.6   151.7
## 14                                              seed size    56.4    57.4
## 15                                                                       
## 16                                 Florida          Yield   185.2   204.5
## 17                                         kernels weight   158.2   159.8
## 18                                              seed size    53.7    57.1
## 19                                                                       
## 20                Batch 2                                                
## 21                                                        Variety        
## 22                                                         Runner Spanish
## 23                        Location Georgia          Yield   194.3   195.9
## 24                                         kernels weight   167.7     166
## 25                                              seed size    53.7    45.8
## 26                                                                       
## 27                                   Texas          Yield   180.4   197.6
## 28                                         kernels weight   121.1   161.8
## 29                                              seed size    44.4    54.1
## 30                                                                       
## 31                                 Alabama          Yield   186.7   194.5
## 32                                         kernels weight   157.5   157.3
## 33                                              seed size    49.6    54.1
## 34                                                                       
## 35                                 Florida          Yield   193.7   198.4
## 36                                         kernels weight   146.5   160.3
## 37                                              seed size    52.5    57.4
##         X.5     X.6
## 1                  
## 2                  
## 3  Valencia Virgina
## 4     193.5   196.4
## 5     164.5   158.3
## 6      57.8    53.1
## 7                  
## 8     201.5   195.5
## 9     166.8   157.4
## 10       65    57.8
## 11                 
## 12    198.8   195.7
## 13    166.1   161.9
## 14     63.5    57.2
## 15                 
## 16    197.4   195.9
## 17    165.2   157.2
## 18     60.9    56.8
## 19                 
## 20                 
## 21                 
## 22 Valencia Virgina
## 23      187   194.2
## 24    165.1   162.4
## 25     58.6    48.9
## 26                 
## 27      200   192.3
## 28    173.8   153.5
## 29     67.2    58.7
## 30                 
## 31    197.4   193.7
## 32    168.8   159.5
## 33     58.7    54.8
## 34                 
## 35    197.5   193.3
## 36    169.4   156.7
## 37     59.8    57.4

How & Why to Tidy

The data set is in a table format. The table itself can be easily read, but it is in no way workable for any statistical analysis using R. More specifically, there are necessary variable names which do not have their own columns. The subdivision of this data set makes the test inputs very untidy and will not allow for quick analysis since careful attention is needed for indexing. Therefore, tidying is based on the statistical analysis to make the data set easily identifiable with the necessary data.

In this case, it was decided that specific rows needed to become columns, and the respective measurements needed to be merge, transforming the wide table into a long data set, as shown below.

Data Transformation using `dplyr` and `tidyr`.

# Rigorous cleaning methods #
## Removing unnecessary rows
tidy3 <- untidy3[-c(2,7,11,15,19,21,26,30,34),-2]

# Identify the variable names for missing column before tidying
## Batch labels
tidy3[c(2:14),1] <- tidy3[1,1]
tidy3[c(16:28),1] <- tidy3[15,1]

## Location labels
tidy3[c(3:5,17:19),2] <- tidy3[3,2]
tidy3[c(6:8,20:22),2] <- tidy3[6,2]
tidy3[c(9:11,23:25),2] <- tidy3[9,2]
tidy3[c(12:14,26:28),2] <- tidy3[12,2]

# Identifying the columns by renaming
tidy3 <- rename(tidy3, Batch = Table.6.17.Peanut.Data, Location = X.1, Variable = X.2, Runner = X.3, Spanish = X.4, Valencia = X.5, Virginia = X.6) 

# Removing and renaming final rows
tidy3 = tidy3[-c(1,2,15,16),]
rownames(tidy3) <- 1:nrow(tidy3)

# Gather the data into variables of peanut variety and the measurements
# Spread based on the measurement variable names
# Convert character data into factors and numbers
tidy3 <- tidy3 %>% gather("Variety", "Measurement", 4:7) %>% spread("Variable", "Measurement", convert = TRUE) 
tidy3 <- droplevels(mutate_if(tidy3, is.character, as.factor))

Peanut Data set

Untidy

Table.6.17.Peanut.Data	X	X.1	X.2	X.3	X.4	X.5	X.6
Batch 1
				Variety
				Runner	Spanish	Valencia	Virgina
	Location	Georgia	Yield	195.3	203	193.5	196.4
			kernels weight	153.1	156.8	164.5	158.3
			seed size	51.4	49.8	57.8	53.1

		Texas	Yield	189.7	202.7	201.5	195.5
			kernels weight	139.5	166.1	166.8	157.4
			seed size	55.5	60.4	65	57.8

		Alabama	Yield	181.8	204.6	198.8	195.7
			kernels weight	139.6	151.7	166.1	161.9
			seed size	56.4	57.4	63.5	57.2

		Florida	Yield	185.2	204.5	197.4	195.9
			kernels weight	158.2	159.8	165.2	157.2
			seed size	53.7	57.1	60.9	56.8

Batch 2
				Variety
				Runner	Spanish	Valencia	Virgina
	Location	Georgia	Yield	194.3	195.9	187	194.2
			kernels weight	167.7	166	165.1	162.4
			seed size	53.7	45.8	58.6	48.9

		Texas	Yield	180.4	197.6	200	192.3
			kernels weight	121.1	161.8	173.8	153.5
			seed size	44.4	54.1	67.2	58.7

		Alabama	Yield	186.7	194.5	197.4	193.7
			kernels weight	157.5	157.3	168.8	159.5
			seed size	49.6	54.1	58.7	54.8

		Florida	Yield	193.7	198.4	197.5	193.3
			kernels weight	146.5	160.3	169.4	156.7
			seed size	52.5	57.4	59.8	57.4

Tidy

Batch	Location	Variety	kernels weight	seed size	Yield
Batch 1	Alabama	Runner	139.6	56.4	181.8
Batch 1	Alabama	Spanish	151.7	57.4	204.6
Batch 1	Alabama	Valencia	166.1	63.5	198.8
Batch 1	Alabama	Virginia	161.9	57.2	195.7
Batch 1	Florida	Runner	158.2	53.7	185.2
Batch 1	Florida	Spanish	159.8	57.1	204.5
Batch 1	Florida	Valencia	165.2	60.9	197.4
Batch 1	Florida	Virginia	157.2	56.8	195.9
Batch 1	Georgia	Runner	153.1	51.4	195.3
Batch 1	Georgia	Spanish	156.8	49.8	203.0
Batch 1	Georgia	Valencia	164.5	57.8	193.5
Batch 1	Georgia	Virginia	158.3	53.1	196.4
Batch 1	Texas	Runner	139.5	55.5	189.7
Batch 1	Texas	Spanish	166.1	60.4	202.7
Batch 1	Texas	Valencia	166.8	65.0	201.5
Batch 1	Texas	Virginia	157.4	57.8	195.5
Batch 2	Alabama	Runner	157.5	49.6	186.7
Batch 2	Alabama	Spanish	157.3	54.1	194.5
Batch 2	Alabama	Valencia	168.8	58.7	197.4
Batch 2	Alabama	Virginia	159.5	54.8	193.7
Batch 2	Florida	Runner	146.5	52.5	193.7
Batch 2	Florida	Spanish	160.3	57.4	198.4
Batch 2	Florida	Valencia	169.4	59.8	197.5
Batch 2	Florida	Virginia	156.7	57.4	193.3
Batch 2	Georgia	Runner	167.7	53.7	194.3
Batch 2	Georgia	Spanish	166.0	45.8	195.9
Batch 2	Georgia	Valencia	165.1	58.6	187.0
Batch 2	Georgia	Virginia	162.4	48.9	194.2
Batch 2	Texas	Runner	121.1	44.4	180.4
Batch 2	Texas	Spanish	161.8	54.1	197.6
Batch 2	Texas	Valencia	173.8	67.2	200.0
Batch 2	Texas	Virginia	153.5	58.7	192.3

Analysis & Results

Variable Effect

In this tidy data set, there are two batches (1 and 2) of peanut shipment from four different locations, and there is an interest in the kernels weight, seed size and yield of the peanuts. Simple two sample t-tests between the batches resulted in no difference in the means for the three attributes. Hence, let’s investigate if all three together are affected by the difference in location and peanut variety. A multivariate analysis of variance could be used to test this, by answering the following research questions:

Is there any significant difference in the three attributes among the different location they were sourced?
Is there any significant difference in the three attributes among the different variety of peanuts?
Is there any significant difference in the three attributes with the interactions of the different variety and location they were sourced?

res.man <- manova(cbind(`kernels weight`,`seed size`, Yield) ~ Location + Variety + Location*Variety, data = tidy3)

summary(res.man, test="Pillai")

##                  Df  Pillai approx F num Df den Df    Pr(>F)    
## Location          3 0.69178   1.5984      9     48    0.1427    
## Variety           3 1.52281   5.4981      9     48 3.512e-05 ***
## Location:Variety  9 1.33139   1.4185     27     48    0.1430    
## Residuals        16                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model reveal that there is no significant difference among the three attributes and the location they were grown, p-value > 0.05. Moreover, the findings indicated that the interaction effect was also non-significant overall. However, there is a statistical significant difference among the three attributes and the different variety of peanut, p-value < 0.05. Let’s look at the fit residuals and effects in more detail.

Residuals of the model

The residuals for kernels weight appears very large in absolute value, but the Q-Q plots of residuals indicate that univariable normality cannot be rejected for all three variables. Since the data can be assume to follow a normal distribution based on the Q-Q plots, an analysis of the variance model can be done.

Test of Difference

Here, the detailed result shows that there is a difference when the kernel weight was considered based on the variety of the peanut and there is an interaction effect with the location and variety, p-value < 0.05. For seed size, both location and variety were statistically significant but there was no interaction effect. Lastly, for yield, only the different variety were found to be statistically significant, p-value < 0.05.

summary.aov(res.man)

##  Response kernels weight :
##                  Df  Sum Sq Mean Sq F value    Pr(>F)    
## Location          3  189.10   63.03  1.6059 0.2273214    
## Variety           3 1558.87  519.62 13.2379 0.0001333 ***
## Location:Variety  9  949.11  105.46  2.6866 0.0406680 *  
## Residuals        16  628.04   39.25                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response seed size :
##                  Df Sum Sq Mean Sq F value    Pr(>F)    
## Location          3 141.64  47.213  5.0843 0.0116171 *  
## Variety           3 373.98 124.659 13.4245 0.0001232 ***
## Location:Variety  9 122.98  13.665  1.4716 0.2396095    
## Residuals        16 148.57   9.286                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Yield :
##                  Df Sum Sq Mean Sq F value    Pr(>F)    
## Location          3  10.08   3.361  0.2280 0.8755284    
## Variety           3 584.48 194.826 13.2169 0.0001345 ***
## Location:Variety  9 259.01  28.779  1.9524 0.1164411    
## Residuals        16 235.85  14.741                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conclusion

In conclusion, the two-factor MANOVA indicates that there is evidence of variety effect for all three attributes. There is also a location effect for the seed size, and an interaction effect for the kernel weight only. These findings could have only been calculated with a tidy data set, demonstrating why the data needed to be tidy appropriately based on the analysis.

The Challenge of Tidying and Transformation

For each data set, it was established how and why the tidying and transformation were carried out. The reasons include to make the data set easily identifiable with the necessary data and for appropriate inputs to statistical analyses. The challenging of this project was deciding how to transform the data. For the Level 1 data set, Hook-billed kites, the text data needed to be correctly merge with their respective attributes. The Level 2 Anaconda data set needed to be gather and spread a few times with additional information before it was in the right format. Lastly, the Level 3 Peanut data table needed rigorous cleaning and proper labeling to make a table into a data frame. Overall, each data set needed careful attention to understand how and when to use tidyr and dplyr to tidy and transform them as necessary.

Works Cited

Johnson, Richard A, and Dean W. Wichern. Applied Multivariate Statistical Analysis. Englewood Cliffs, N.J: Prentice Hall, 1992. Print.
Tinsley, Howard E. Handbook of Applied Multivariate Statistics and Mathematical Modeling. Academic Press, 2006. Print.

CUNY MSDS Data 607 Project #2

Samantha Deokinanan

10th March, 2019

Overview

Hook-Billed Kites

How & Why to Tidy

Data transformation using `dplyr` and `tidyr`.

Untidy

Tidy

Results & Conclusion

Anacondas

How & Why to Tidy

Data transformation using `dplyr` and `tidyr`.

Untidy

Tidy

Analysis & Results

Test of Correlation

Test of Difference

Conclusion

Peanut

How & Why to Tidy

Data Transformation using `dplyr` and `tidyr`.

Untidy

Tidy

Analysis & Results

Variable Effect

Residuals of the model

Test of Difference

Conclusion

The Challenge of Tidying and Transformation

Works Cited

V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	V11
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
191	NA	284	NA	186	NA	266	NA	173	NA	271
197	NA	285	NA	197	NA	285	NA	194	NA	280
208	NA	288	NA	201	NA	295	NA	198	NA	300
180	NA	273	NA	190	NA	282	NA	180	NA	272
180	NA	275	NA	209	NA	305	NA	190	NA	292
188	NA	280	NA	187	NA	285	NA	191	NA	286
210	NA	283	NA	207	NA	297	NA	196	NA	285
196	NA	288	NA	178	NA	268	NA	207	NA	286
191	NA	271	NA	202	NA	271	NA	109	NA	303
179	NA	257	NA	205	NA	285	NA	179	NA	261
208	NA	289	NA	190	NA	280	NA	186	NA	262
202	NA	285	NA	189	NA	277	NA	174	NA	245
200	NA	272	NA	211	NA	310	NA	181	NA	250
192	NA	282	NA	216	NA	305	NA	189	NA	262
199	NA	280	NA	189	NA	274	NA	188	NA	258
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA

TailLength	WingLength
191	284
197	285
208	288
180	273
180	275
188	280
210	283
196	288
191	271
179	257
208	289
202	285
200	272
192	282
199	280
186	266
197	285
201	295
190	282
209	305
187	285
207	297
178	268
202	271
205	285
190	280
189	277
211	310
216	305
189	274
173	271
194	280
198	300
180	272
190	292
191	286
196	285
207	286
109	303
179	261
186	262
174	245
181	250
189	262
188	258

V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	V11
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
191	NA	284	NA	186	NA	266	NA	173	NA	271
197	NA	285	NA	197	NA	285	NA	194	NA	280
208	NA	288	NA	201	NA	295	NA	198	NA	300
180	NA	273	NA	190	NA	282	NA	180	NA	272
180	NA	275	NA	209	NA	305	NA	190	NA	292
188	NA	280	NA	187	NA	285	NA	191	NA	286
210	NA	283	NA	207	NA	297	NA	196	NA	285
196	NA	288	NA	178	NA	268	NA	207	NA	286
191	NA	271	NA	202	NA	271	NA	109	NA	303
179	NA	257	NA	205	NA	285	NA	179	NA	261
208	NA	289	NA	190	NA	280	NA	186	NA	262
202	NA	285	NA	189	NA	277	NA	174	NA	245
200	NA	272	NA	211	NA	310	NA	181	NA	250
192	NA	282	NA	216	NA	305	NA	189	NA	262
199	NA	280	NA	189	NA	274	NA	188	NA	258
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA

TailLength	WingLength
191	284
197	285
208	288
180	273
180	275
188	280
210	283
196	288
191	271
179	257
208	289
202	285
200	272
192	282
199	280
186	266
197	285
201	295
190	282
209	305
187	285
207	297
178	268
202	271
205	285
190	280
189	277
211	310
216	305
189	274
173	271
194	280
198	300
180	272
190	292
191	286
196	285
207	286
109	303
179	261
186	262
174	245
181	250
189	262
188	258

CUNY MSDS Data 607 Project #2

Samantha Deokinanan

10th March, 2019

Overview

Hook-Billed Kites

How & Why to Tidy

Data transformation using dplyr and tidyr.

Untidy

Tidy

Results & Conclusion

Anacondas

How & Why to Tidy

Data transformation using dplyr and tidyr.

Untidy

Tidy

Analysis & Results

Test of Correlation

Test of Difference

Conclusion

Peanut

How & Why to Tidy

Data Transformation using dplyr and tidyr.

Untidy

Tidy

Analysis & Results

Variable Effect

Residuals of the model

Test of Difference

Conclusion

The Challenge of Tidying and Transformation

Works Cited

Data transformation using `dplyr` and `tidyr`.

Data transformation using `dplyr` and `tidyr`.

Data Transformation using `dplyr` and `tidyr`.

V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	V11
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
191	NA	284	NA	186	NA	266	NA	173	NA	271
197	NA	285	NA	197	NA	285	NA	194	NA	280
208	NA	288	NA	201	NA	295	NA	198	NA	300
180	NA	273	NA	190	NA	282	NA	180	NA	272
180	NA	275	NA	209	NA	305	NA	190	NA	292
188	NA	280	NA	187	NA	285	NA	191	NA	286
210	NA	283	NA	207	NA	297	NA	196	NA	285
196	NA	288	NA	178	NA	268	NA	207	NA	286
191	NA	271	NA	202	NA	271	NA	109	NA	303
179	NA	257	NA	205	NA	285	NA	179	NA	261
208	NA	289	NA	190	NA	280	NA	186	NA	262
202	NA	285	NA	189	NA	277	NA	174	NA	245
200	NA	272	NA	211	NA	310	NA	181	NA	250
192	NA	282	NA	216	NA	305	NA	189	NA	262
199	NA	280	NA	189	NA	274	NA	188	NA	258
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA

TailLength	WingLength
191	284
197	285
208	288
180	273
180	275
188	280
210	283
196	288
191	271
179	257
208	289
202	285
200	272
192	282
199	280
186	266
197	285
201	295
190	282
209	305
187	285
207	297
178	268
202	271
205	285
190	280
189	277
211	310
216	305
189	274
173	271
194	280
198	300
180	272
190	292
191	286
196	285
207	286
109	303
179	261
186	262
174	245
181	250
189	262
188	258