Excel Clean-up

Here are the steps I took to clean up my Excel file before loading it into R:

  1. I first deleted any rows of variables that were not needed for this analysis. I was pretty conservative with this, and left any column I thought I could possibly ever need to reference. I also kept an original version of the Excel file so that if there was something I deleted that I end up needing, I still have access to it.
    • The original file path is: “C:/Users/james/Dropbox/Laura and Cheryl/Data Files/Habituation Criteria FileMaker Files/Export for Habituation Strategies - 2022.xlsx”
  2. Some columns displayed the presence or absence of a variable, with each row (follow) marked with an “X” if the variable was present and left blank if it was absent (such as “Eat Minimally”). In these cases, I did a Find and Replace for that column, replacing “X” with “Y” for Yes and replacing Blanks with “N” for No.
  3. I created a new variable which will act as my response variable. In order to do this, I created a new column called “Habituation Response” and created a formula in Excel as follows:

Research Question: Which variable(s) is/are most likely to predict/account for most of the variation in the behavioral responses (i.e. Stay, Flee, Vocalize) in GP wild orangutans during the process of habituation?

Response Variable:

  1. Habituation Reaction: Stay, Flee, or Vocalize

Fixed Effect/Predictor Variables:

  1. Date (DD/MM/YYYY)
  2. Phenology Z-Score
  3. Age-Sex Class:
    • Adult Female
    • Flanged Male
    • Adult Unflanged Male
  4. Mother Class:
    • Adult Male (thus no offspring)
    • Adult Female with No Offspring Present
    • Adult Female with Infant Present (and perhaps a juvenile as well)
    • Adult Female with Only Juvenile Present (and no infant)
  5. Follow Series Day: How many days this orangutan has been followed in a row on the day this data was taken
  6. Percent Time Drizzle: Percent of the time during the follow day that it was drizzling
  7. Percent Time Rain: Percent of the time during the follow that it was raining (harder than a drizzle)
  8. Total Minutes Drizzle: Total number of minutes it drizzled during the follow day
  9. Total Minutes Rain: Total number of minutes it rained during the follow day
  10. Social Interaction: Did the focal orangutan encounter another orangutan any time during the follow?
    • This is a binary (Y/N/) variable

Random Effect Variable:
1. Orangutan Name - This will account for individual orangutans as a variable that could impact the variation in habituation responses.

The path for the cleaned up file I will be using is: “C:/Users/james/OneDrive/Laura/PhD Dissertation/Statistical Analyses/Habituation Behaviors Stats/Master for Model.xlsx”

Importing the Excel data into R:

  1. Before importing the data, I make sure all the packages I will be using are loaded and ready to go. See below, plus explanations for the purpose of each:
library(readxl) #to read xls and xlsx files to use in R
library(dplyr) #"A Grammar of Data Manipulation" - several useful commands
library(mclogit) #Package for performing multinomial logit models
library(nnet)
library(MASS)
library(memisc)
  1. Now I will bring in the Excel sheet with the data I will be using. It includes 555 follows (thus, 555 rows of data in addition to the title row) which fit the criteria for the evaluation of “Habituation Response” as either Stay, Flee, or Vocalize.
habituation_data <- read_excel("C:/Users/james/OneDrive/Laura/PhD Dissertation/Statistical Analyses/Habituation Behaviors Stats/Master for Model.xlsx")
# Reminder that when copying in file path, must change back slashes to forward slashes between each file level
# Reminder that the file must not be open in Excel

Now that the file is imported into R, I want to take a look at what the data looks like:

head(habituation_data)
## # A tibble: 6 × 39
##   Follow_Number Date_MDY            Nama_OH Pheno_Zscore Mother_Class  Age_Class
##           <dbl> <dttm>              <chr>          <dbl> <chr>         <chr>    
## 1             1 1994-08-22 00:00:00 Herman        NA     Adult Male    Flanged …
## 2             2 1994-09-14 00:00:00 Kristen        0.584 Juvenile Only Adult Fe…
## 3             3 1994-09-15 00:00:00 Cindy          0.584 No Offspring… Adult Fe…
## 4             4 1994-09-15 00:00:00 Kristen        0.584 Juvenile Only Adult Fe…
## 5             5 1994-09-15 00:00:00 Cindy          0.584 No Offspring… Adult Fe…
## 6             6 1994-09-16 00:00:00 Kristen        0.584 Juvenile Only Adult Fe…
## # ℹ 33 more variables: Percent_Time_Feed <dbl>, Percent_Time_Rest <dbl>,
## #   Percent_Time_Travel <dbl>, Total_Vocalizations <dbl>,
## #   Vocalization_per_time_awake <dbl>, Vocalization_Group_Calc <dbl>,
## #   Vocalization_Group <chr>, Count_Alarm_at_OBS <dbl>,
## #   Follow_Series_Day <dbl>, Follow_Type <chr>, Habituated <chr>,
## #   Habituation_2 <chr>, Minutes_awake <dbl>, Number_Eating <chr>,
## #   Number_Nesting <dbl>, New_Day_Nest <chr>, Number_Rest_Bouts <dbl>, …

I also want to check the variable classes for each column:

str(habituation_data)
## tibble [555 × 39] (S3: tbl_df/tbl/data.frame)
##  $ Follow_Number              : num [1:555] 1 2 3 4 5 6 14 15 16 17 ...
##  $ Date_MDY                   : POSIXct[1:555], format: "1994-08-22" "1994-09-14" ...
##  $ Nama_OH                    : chr [1:555] "Herman" "Kristen" "Cindy" "Kristen" ...
##  $ Pheno_Zscore               : num [1:555] NA 0.584 0.584 0.584 0.584 ...
##  $ Mother_Class               : chr [1:555] "Adult Male" "Juvenile Only" "No Offspring Present" "Juvenile Only" ...
##  $ Age_Class                  : chr [1:555] "Flanged Male" "Adult Female" "Adult Female" "Adult Female" ...
##  $ Percent_Time_Feed          : num [1:555] 5.3 22.568 1.901 12.374 0.958 ...
##  $ Percent_Time_Rest          : num [1:555] 89.9 51.6 65 64.1 69.8 ...
##  $ Percent_Time_Travel        : num [1:555] 4.8 20.2 31.8 23.5 29.3 ...
##  $ Total_Vocalizations        : num [1:555] 0 0 2 1 13 1 0 0 1 0 ...
##  $ Vocalization_per_time_awake: num [1:555] 0.01 0.01167 0.01897 0.00685 0.0607 ...
##  $ Vocalization_Group_Calc    : num [1:555] 2 3 4 2 9 5 3 1 5 10 ...
##  $ Vocalization_Group         : chr [1:555] "NA" "7" "5" "4" ...
##  $ Count_Alarm_at_OBS         : num [1:555] 5 3 7 4 6 17 5 0 11 41 ...
##  $ Follow_Series_Day          : num [1:555] 1 1 1 2 1 3 1 2 2 1 ...
##  $ Follow_Type                : chr [1:555] "H" "D" "F" "F" ...
##  $ Habituated                 : chr [1:555] "NA" "H" "U" "U" ...
##  $ Habituation_2              : chr [1:555] "U" "H" "H" "H" ...
##  $ Minutes_awake              : num [1:555] 500 257 474 730 313 ...
##  $ Number_Eating              : chr [1:555] "4" "7" "4" "8" ...
##  $ Number_Nesting             : num [1:555] 0 2 1 0 0 0 5 0 0 1 ...
##  $ New_Day_Nest               : chr [1:555] "0" "NA" "0" "0" ...
##  $ Number_Rest_Bouts          : num [1:555] 24 15 23 38 14 16 39 1 13 6 ...
##  $ Number_Traveling_Bouts     : chr [1:555] "21" "17" "26" "34" ...
##  $ Percent_Time_Nesting       : num [1:555] 0 5.64 1.27 0 0 ...
##  $ Hide                       : chr [1:555] "N" "N" "N" "N" ...
##  $ Eat_Excessive              : chr [1:555] "N" "N" "N" "N" ...
##  $ Eat_Minimal                : chr [1:555] "Y" "N" "Y" "Y" ...
##  $ Rest_Excessive             : chr [1:555] "Y" "N" "N" "N" ...
##  $ Rest_Minimal               : chr [1:555] "N" "N" "N" "N" ...
##  $ Travel_Excessive           : chr [1:555] "N" "Y" "Y" "Y" ...
##  $ Travel_Minimal             : chr [1:555] "N" "N" "N" "N" ...
##  $ Vocalize_Excessive         : chr [1:555] "N" "N" "Y" "N" ...
##  $ Habituation_Response       : chr [1:555] "S" "F" "F" "F" ...
##  $ Percent_time_drizzle       : num [1:555] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Percent_Time_Rain          : num [1:555] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Ttl_Min_drizzle            : num [1:555] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Ttl_Min_Rain               : num [1:555] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Social_Interaction         : chr [1:555] "N" "Y" "N" "Y" ...

I am going to change the column titled “Follow_Series_Day” from a character class to a numerical class:

habituation_data$Follow_Series_Day <- as.numeric(habituation_data$Follow_Series_Day) # This line of code indicated that the "Follow_Series_Column" in the "habituation_data" data frame/tibble should be changed from its current class to the numeric class
class(habituation_data$Follow_Series_Day) #this code allows us to check the class designation of this specific column in our data frame, as opposed to the str() command, which shows us the class of every column in the data frame.
## [1] "numeric"

I will change some of the other columns from a character class to a factor class. These are the ones that may be better to switch to factors, since the data in these columns only can fall into a specific number of categories, vs. one like the name of the orangutans in which new categories could be added in the future. For example, Social Interaction can only be a Y/N answer, and in this model, Age-Sex Class can only be Adult Female, Flanged Male, or Unflanged Male. Here are the columns that seem to fit this pattern:

  1. Mother_Class
  2. Age_Class
  3. Habituation_Response
  4. Social_Interaction

There are other columns of data similar to the ones above, but not ones I will be using for this model, so I believe these are the only ones I need to from character to factor class.

habituation_data$Mother_Class <- as.factor(habituation_data$Mother_Class)
class(habituation_data$Mother_Class)
## [1] "factor"
habituation_data$Habituation_Response <- as.factor(habituation_data$Habituation_Response)
class(habituation_data$Habituation_Response)
## [1] "factor"
habituation_data$Age_Class <- as.factor(habituation_data$Age_Class)
class(habituation_data$Age_Class)
## [1] "factor"
habituation_data$Social_Interaction <- as.factor(habituation_data$Social_Interaction)
class(habituation_data$Social_Interaction)
## [1] "factor"

Multinomial Logistic Regression

For this analysis, I will be performing a multinomial logistic regression with random effects. To do so, I will be using the mlogit package, already installed loaded into this session in an earlier step.

The reason I will be trying this type of analysis is because my response (dependent) variable can fit into three categories: Stay (S), Flee (F), or Vocalize (V). These are not ordinal nor binary, thus a multinomial logistic regression makes sense for this analysis, and can also account for the various types of predictor (independent) variables, some of which are continuous, some of which are categorical.

Furthermore, the mclogit command within the mlogit package will allow me to perform a multinomial logistic regression with a random effect variable included, in this case, Orangutan Name (labeled as “Nama_OH” in the table).

model1 <- mclogit(Follow_Number|Habituation_Response ~ Pheno_Zscore + Age_Class + Mother_Class + Follow_Series_Day + Percent_time_drizzle + Percent_Time_Rain, random = ~1|Nama_OH, data = habituation_data)
## 
## Iteration 1 - deviance = 156952.4 - criterion = 0.4486491
## Iteration 2 - deviance = 129410.6 - criterion = 0.01516399
## Iteration 3 - deviance = 128531.2 - criterion = 0.000593259
## Iteration 4 - deviance = 128529.6 - criterion = 1.316346e-06
## Iteration 5 - deviance = 128529.6 - criterion = 1.098118e-11
## converged
summary(model1)
## 
## Call:
## mclogit(formula = Follow_Number | Habituation_Response ~ Pheno_Zscore + 
##     Age_Class + Mother_Class + Follow_Series_Day + Percent_time_drizzle + 
##     Percent_Time_Rain, data = habituation_data, random = ~1 | 
##     Nama_OH)
## 
## Coefficents:
##                                    Estimate Std. Error z value Pr(>|z|)    
## Pheno_Zscore                      0.0497730  0.0009817  50.702  < 2e-16 ***
## Age_ClassAdult Female             0.1513781  0.0110415  13.710  < 2e-16 ***
## Age_ClassFlanged Male             0.2937861  0.2780801   1.056  0.29075    
## Age_ClassUnflanged Male           0.0252364  0.2367220   0.107  0.91510    
## Mother_ClassInfant Present        0.1950003  0.0066117  29.493  < 2e-16 ***
## Mother_ClassJuvenile Only         0.2829159  0.0065225  43.376  < 2e-16 ***
## Mother_ClassNo Offspring Present -0.3352736  0.0075603 -44.347  < 2e-16 ***
## Follow_Series_Day                 0.0056568  0.0004446  12.722  < 2e-16 ***
## Percent_time_drizzle              0.0010673  0.0000781  13.665  < 2e-16 ***
## Percent_Time_Rain                -0.0003404  0.0001108  -3.071  0.00213 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Co-)Variances:
## Grouping level: Nama_OH 
##          Estimate   Std.Err.
## (Const.) 1.891      0.489   
## 
## Approximate residual deviance: 128500 
## Number of Fisher scoring iterations:  5
## Number of observations
##   Groups by Nama_OH: 187
##   Individual observations:  1971030  (11 observations deleted due to missingness)

Okay, this is as far as I got! I am not sure if the formula I am using in the mclogit command is correct. Specifically, the formula part of the mclogit command (i.e. the formula for the model) requires the first part of it create a two-column matrix in the form of choice|set. I am not sure if I am understanding the definitions of “choice” and “set” correctly. I used “Follow Number” as that is the only number that is unique to each row, and “Habituation Response” is the dependent variable I am looking at, which has three levels. I am also not completely sure how to interpret the results even if I did set up the entire formula correctly.