Here are the steps I took to clean up my Excel file before loading it into R:
Research Question: Which variable(s) is/are
most likely to predict/account for most of the variation in the
behavioral responses (i.e. Stay, Flee, Vocalize) in GP wild orangutans
during the process of habituation?
Response Variable:
Fixed Effect/Predictor Variables:
Random Effect Variable:
1. Orangutan Name - This will account for individual orangutans as a
variable that could impact the variation in habituation responses.
The path for the cleaned up file I will be using is: “C:/Users/james/OneDrive/Laura/PhD Dissertation/Statistical Analyses/Habituation Behaviors Stats/Master for Model.xlsx”
library(readxl) #to read xls and xlsx files to use in R
library(dplyr) #"A Grammar of Data Manipulation" - several useful commands
library(mclogit) #Package for performing multinomial logit models
library(nnet)
library(MASS)
library(memisc)
habituation_data <- read_excel("C:/Users/james/OneDrive/Laura/PhD Dissertation/Statistical Analyses/Habituation Behaviors Stats/Master for Model.xlsx")
# Reminder that when copying in file path, must change back slashes to forward slashes between each file level
# Reminder that the file must not be open in Excel
Now that the file is imported into R, I want to take a look at what the data looks like:
head(habituation_data)
## # A tibble: 6 × 39
## Follow_Number Date_MDY Nama_OH Pheno_Zscore Mother_Class Age_Class
## <dbl> <dttm> <chr> <dbl> <chr> <chr>
## 1 1 1994-08-22 00:00:00 Herman NA Adult Male Flanged …
## 2 2 1994-09-14 00:00:00 Kristen 0.584 Juvenile Only Adult Fe…
## 3 3 1994-09-15 00:00:00 Cindy 0.584 No Offspring… Adult Fe…
## 4 4 1994-09-15 00:00:00 Kristen 0.584 Juvenile Only Adult Fe…
## 5 5 1994-09-15 00:00:00 Cindy 0.584 No Offspring… Adult Fe…
## 6 6 1994-09-16 00:00:00 Kristen 0.584 Juvenile Only Adult Fe…
## # ℹ 33 more variables: Percent_Time_Feed <dbl>, Percent_Time_Rest <dbl>,
## # Percent_Time_Travel <dbl>, Total_Vocalizations <dbl>,
## # Vocalization_per_time_awake <dbl>, Vocalization_Group_Calc <dbl>,
## # Vocalization_Group <chr>, Count_Alarm_at_OBS <dbl>,
## # Follow_Series_Day <dbl>, Follow_Type <chr>, Habituated <chr>,
## # Habituation_2 <chr>, Minutes_awake <dbl>, Number_Eating <chr>,
## # Number_Nesting <dbl>, New_Day_Nest <chr>, Number_Rest_Bouts <dbl>, …
I also want to check the variable classes for each column:
str(habituation_data)
## tibble [555 × 39] (S3: tbl_df/tbl/data.frame)
## $ Follow_Number : num [1:555] 1 2 3 4 5 6 14 15 16 17 ...
## $ Date_MDY : POSIXct[1:555], format: "1994-08-22" "1994-09-14" ...
## $ Nama_OH : chr [1:555] "Herman" "Kristen" "Cindy" "Kristen" ...
## $ Pheno_Zscore : num [1:555] NA 0.584 0.584 0.584 0.584 ...
## $ Mother_Class : chr [1:555] "Adult Male" "Juvenile Only" "No Offspring Present" "Juvenile Only" ...
## $ Age_Class : chr [1:555] "Flanged Male" "Adult Female" "Adult Female" "Adult Female" ...
## $ Percent_Time_Feed : num [1:555] 5.3 22.568 1.901 12.374 0.958 ...
## $ Percent_Time_Rest : num [1:555] 89.9 51.6 65 64.1 69.8 ...
## $ Percent_Time_Travel : num [1:555] 4.8 20.2 31.8 23.5 29.3 ...
## $ Total_Vocalizations : num [1:555] 0 0 2 1 13 1 0 0 1 0 ...
## $ Vocalization_per_time_awake: num [1:555] 0.01 0.01167 0.01897 0.00685 0.0607 ...
## $ Vocalization_Group_Calc : num [1:555] 2 3 4 2 9 5 3 1 5 10 ...
## $ Vocalization_Group : chr [1:555] "NA" "7" "5" "4" ...
## $ Count_Alarm_at_OBS : num [1:555] 5 3 7 4 6 17 5 0 11 41 ...
## $ Follow_Series_Day : num [1:555] 1 1 1 2 1 3 1 2 2 1 ...
## $ Follow_Type : chr [1:555] "H" "D" "F" "F" ...
## $ Habituated : chr [1:555] "NA" "H" "U" "U" ...
## $ Habituation_2 : chr [1:555] "U" "H" "H" "H" ...
## $ Minutes_awake : num [1:555] 500 257 474 730 313 ...
## $ Number_Eating : chr [1:555] "4" "7" "4" "8" ...
## $ Number_Nesting : num [1:555] 0 2 1 0 0 0 5 0 0 1 ...
## $ New_Day_Nest : chr [1:555] "0" "NA" "0" "0" ...
## $ Number_Rest_Bouts : num [1:555] 24 15 23 38 14 16 39 1 13 6 ...
## $ Number_Traveling_Bouts : chr [1:555] "21" "17" "26" "34" ...
## $ Percent_Time_Nesting : num [1:555] 0 5.64 1.27 0 0 ...
## $ Hide : chr [1:555] "N" "N" "N" "N" ...
## $ Eat_Excessive : chr [1:555] "N" "N" "N" "N" ...
## $ Eat_Minimal : chr [1:555] "Y" "N" "Y" "Y" ...
## $ Rest_Excessive : chr [1:555] "Y" "N" "N" "N" ...
## $ Rest_Minimal : chr [1:555] "N" "N" "N" "N" ...
## $ Travel_Excessive : chr [1:555] "N" "Y" "Y" "Y" ...
## $ Travel_Minimal : chr [1:555] "N" "N" "N" "N" ...
## $ Vocalize_Excessive : chr [1:555] "N" "N" "Y" "N" ...
## $ Habituation_Response : chr [1:555] "S" "F" "F" "F" ...
## $ Percent_time_drizzle : num [1:555] 0 0 0 0 0 0 0 0 0 0 ...
## $ Percent_Time_Rain : num [1:555] 0 0 0 0 0 0 0 0 0 0 ...
## $ Ttl_Min_drizzle : num [1:555] 0 0 0 0 0 0 0 0 0 0 ...
## $ Ttl_Min_Rain : num [1:555] 0 0 0 0 0 0 0 0 0 0 ...
## $ Social_Interaction : chr [1:555] "N" "Y" "N" "Y" ...
I am going to change the column titled “Follow_Series_Day” from a character class to a numerical class:
habituation_data$Follow_Series_Day <- as.numeric(habituation_data$Follow_Series_Day) # This line of code indicated that the "Follow_Series_Column" in the "habituation_data" data frame/tibble should be changed from its current class to the numeric class
class(habituation_data$Follow_Series_Day) #this code allows us to check the class designation of this specific column in our data frame, as opposed to the str() command, which shows us the class of every column in the data frame.
## [1] "numeric"
I will change some of the other columns from a character class to a
factor class. These are the ones that may be better to switch to
factors, since the data in these columns only can fall into a specific
number of categories, vs. one like the name of the orangutans in which
new categories could be added in the future. For example, Social
Interaction can only be a Y/N answer, and in this
model, Age-Sex Class can only be Adult Female,
Flanged Male, or Unflanged Male. Here are the columns
that seem to fit this pattern:
There are other columns of data similar to the ones above, but not ones I will be using for this model, so I believe these are the only ones I need to from character to factor class.
habituation_data$Mother_Class <- as.factor(habituation_data$Mother_Class)
class(habituation_data$Mother_Class)
## [1] "factor"
habituation_data$Habituation_Response <- as.factor(habituation_data$Habituation_Response)
class(habituation_data$Habituation_Response)
## [1] "factor"
habituation_data$Age_Class <- as.factor(habituation_data$Age_Class)
class(habituation_data$Age_Class)
## [1] "factor"
habituation_data$Social_Interaction <- as.factor(habituation_data$Social_Interaction)
class(habituation_data$Social_Interaction)
## [1] "factor"
For this analysis, I will be performing a multinomial
logistic regression with random effects. To do so, I
will be using the mlogit package, already installed loaded
into this session in an earlier step.
The reason I will be trying this type of analysis is because my response (dependent) variable can fit into three categories: Stay (S), Flee (F), or Vocalize (V). These are not ordinal nor binary, thus a multinomial logistic regression makes sense for this analysis, and can also account for the various types of predictor (independent) variables, some of which are continuous, some of which are categorical.
Furthermore, the mclogit command within the
mlogit package will allow me to perform a multinomial
logistic regression with a random effect variable included, in this
case, Orangutan Name (labeled as “Nama_OH” in the table).
model1 <- mclogit(Follow_Number|Habituation_Response ~ Pheno_Zscore + Age_Class + Mother_Class + Follow_Series_Day + Percent_time_drizzle + Percent_Time_Rain, random = ~1|Nama_OH, data = habituation_data)
##
## Iteration 1 - deviance = 156952.4 - criterion = 0.4486491
## Iteration 2 - deviance = 129410.6 - criterion = 0.01516399
## Iteration 3 - deviance = 128531.2 - criterion = 0.000593259
## Iteration 4 - deviance = 128529.6 - criterion = 1.316346e-06
## Iteration 5 - deviance = 128529.6 - criterion = 1.098118e-11
## converged
summary(model1)
##
## Call:
## mclogit(formula = Follow_Number | Habituation_Response ~ Pheno_Zscore +
## Age_Class + Mother_Class + Follow_Series_Day + Percent_time_drizzle +
## Percent_Time_Rain, data = habituation_data, random = ~1 |
## Nama_OH)
##
## Coefficents:
## Estimate Std. Error z value Pr(>|z|)
## Pheno_Zscore 0.0497730 0.0009817 50.702 < 2e-16 ***
## Age_ClassAdult Female 0.1513781 0.0110415 13.710 < 2e-16 ***
## Age_ClassFlanged Male 0.2937861 0.2780801 1.056 0.29075
## Age_ClassUnflanged Male 0.0252364 0.2367220 0.107 0.91510
## Mother_ClassInfant Present 0.1950003 0.0066117 29.493 < 2e-16 ***
## Mother_ClassJuvenile Only 0.2829159 0.0065225 43.376 < 2e-16 ***
## Mother_ClassNo Offspring Present -0.3352736 0.0075603 -44.347 < 2e-16 ***
## Follow_Series_Day 0.0056568 0.0004446 12.722 < 2e-16 ***
## Percent_time_drizzle 0.0010673 0.0000781 13.665 < 2e-16 ***
## Percent_Time_Rain -0.0003404 0.0001108 -3.071 0.00213 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Co-)Variances:
## Grouping level: Nama_OH
## Estimate Std.Err.
## (Const.) 1.891 0.489
##
## Approximate residual deviance: 128500
## Number of Fisher scoring iterations: 5
## Number of observations
## Groups by Nama_OH: 187
## Individual observations: 1971030 (11 observations deleted due to missingness)
Okay, this is as far as I got! I am not sure if the formula I am
using in the mclogit command is correct. Specifically, the
formula part of the mclogit command (i.e. the formula for
the model) requires the first part of it create a two-column matrix in
the form of choice|set. I am not sure if I am understanding
the definitions of “choice” and “set” correctly. I used “Follow Number”
as that is the only number that is unique to each row, and “Habituation
Response” is the dependent variable I am looking at, which has three
levels. I am also not completely sure how to interpret the results even
if I did set up the entire formula correctly.