Dogs have been popular pets for thousands of years, evolving alongside humans while building a close bond that has been passed down for generations. Dogs put all of their trust and love into their owners, so why should we behave any differently?
The data come from a study titled “Cohort profile: The Golden Retriever Lifetime Study (GRLS)”. The research was about how cancer affects the lifespan of dogs, but the dataset also records a number of other features for each participant, such as sex, neuter status, age, weight, and age at death. The study enrolled 3,044 American Golden Retrievers aged 6 months to 2 years between June 2012 and April 2015. Information was collected during vet visits as well as from questionnaires completed by the owners.
Some of the variables used are: Age_at_death_or_ltf_years for the age at death, Body_condition_categorical for body condition (underweight, normal, or overweight), and Sex for the sex and neuter status of the dog.
I chose this topic because I love dogs and want to advocate for others to take care of their dogs as best as possible in order to keep their pets around for as long as possible. I want to find what features shorten lifespan and by how much, and then see if I can make an equation to predict the lifespan of a dog by what features they have.
# I used some of these during the brainstorming phase of this project but I doubt that all are still relevant to the final project... I do not want to risk breaking anything by removing a library though
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.2
✔ ggplot2 4.0.0 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(ggridges)
library(gridExtra)
Warning: package 'ggridges' was built under R version 4.5.2
Warning: package 'gridExtra' was built under R version 4.5.2
Attaching package: 'gridExtra'
The following object is masked from 'package:dplyr':
combine
library(plyr)
------------------------------------------------------------------------------
You have loaded plyr after dplyr - this is likely to cause problems.
If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
library(plyr); library(dplyr)
------------------------------------------------------------------------------
Attaching package: 'plyr'
The following objects are masked from 'package:dplyr':
arrange, count, desc, failwith, id, mutate, rename, summarise,
summarize
The following object is masked from 'package:purrr':
compact
library(highcharter)
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
Highcharts (www.highcharts.com) is a Highsoft software product which is
not free for commercial and Governmental use
library(plotly)
Attaching package: 'plotly'
The following objects are masked from 'package:plyr':
arrange, mutate, rename, summarise
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
Rows: 3049 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): ID, Enrollment_status, Sex, Age_at_gonadectomy_categorical, Region_...
dbl (7): Age_at_enrollment_years, Age_current_years, Primary_endpoint_diagno...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# the dataset was originally meant for studying cancer risk in Golden Retrievers, but I am using it to find a correlation between spaying/neutering and cancer/death
# plan: find the average age at death for the two groups, then find the optimal age for spaying, and maybe later compare the sexes in a faceted plot
# DIAGNOSTIC PLOT! DO NOT GRADE THIS PLEASE I tried to take them away but it broke everything
dogdata2 <- dogdata |> filter(!is.na(Age_at_death_or_ltf_years))
dogdata3 <- dogdata |> filter(!is.na(Body_condition_categorical))
dogdata4 <- dogdata |> filter(!is.na(Sex))
a <- boxplot(Age_at_death_or_ltf_years ~ Sex, data = dogdata2)
b <- boxplot(Age_at_death_or_ltf_years ~ Body_condition_categorical, data = dogdata3)
c <- boxplot(Age_at_death_or_ltf_years ~ Sex, data = dogdata4)
Disclaimer: Obesity is bad. I am not sure why the overweight dogs live longer. I am guessing that it is because larger Golden Retrievers are typically better bred and have a higher average weight because of their frame so they may be categorized as overweight. PLEASE DO NOT OVERFEED YOUR DOG! It is very unhealthy and neglectful.
After seeing the difference in medians, I decided to make a new factor that determines the lifespan of the dog based on all factors with appropriate “weights” for each factor. Below are the statistics for all three boxplots:
The difference in median lifespan (in years) shows how much each factor affects lifespan: the larger the difference, the larger the effect.
Neuter vs Intact: 2.08 for female, 1.715 for male, 1.8975 average for both (I'll use the number for both sexes, but a female's health usually depends heavily on being spayed)
Bodyweight: 1.585, overweight lives longer (not really, please don't overfeed your dog)
Gender: 0.515, males live longer
The total of all median differences is 3.9975 years (1.8975 + 1.585 + 0.515).
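The median differences listed above can be computed with a few lines of base R. This is only a sketch on made-up values: the six dogs in `toy` are hypothetical and are not rows from the GRLS data, and only the column names are borrowed from the dataset.

```r
# Toy data only: these six dogs are invented for illustration.
toy <- data.frame(
  Sex = c("Male Neutered", "Male Neutered", "Male Neutered",
          "Male Intact", "Male Intact", "Male Intact"),
  Age_at_death_or_ltf_years = c(6.0, 7.0, 8.0, 4.0, 5.0, 6.5)
)

# Median lifespan within each group
group_medians <- tapply(toy$Age_at_death_or_ltf_years, toy$Sex, median)

# Difference in medians: neutered minus intact
median_diff <- group_medians[["Male Neutered"]] - group_medians[["Male Intact"]]
median_diff
```

The same pattern should work for Body_condition_categorical by swapping the grouping column.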
Based on these findings, male, neutered, overweight dogs should be the longest-lived, while underweight, intact, female dogs should be the shortest-lived. I made the graph below to get more accurate numbers by comparing each feature against this new control group (male, neutered, overweight).
# make groups different colors, then dot plot for lifespan maybe markers for individual stats DELETE INSTRUCTIONS BEFORE SUBMISSION
# make a faceted plot to compare "healthiest dog" to another dog with one attribute changed to see percent change and compare it to risk hypothesis above
Compare1 <- dogdata
# I got help with Boolean statements from this pdf document: https://gsp.humboldt.edu/olm/R/02_05_BooleanComparisonsIfStatements.html
Compare1$Primary_endpoint_diagnosis_tier[Compare1$Sex == "Male Neutered" & Compare1$Body_condition_categorical == "Overweight"] <- "A"
Compare1$Primary_endpoint_diagnosis_tier[Compare1$Sex == "Female Spayed" & Compare1$Body_condition_categorical == "Overweight"] <- "B"
Compare1$Primary_endpoint_diagnosis_tier[Compare1$Sex == "Male Intact" & Compare1$Body_condition_categorical == "Overweight"] <- "C"
# Caution: & is evaluated before |, so this line labels every "Normal" dog as D, not only neutered males; wrapping the two weight tests in parentheses would restrict it to neutered males
Compare1$Primary_endpoint_diagnosis_tier[Compare1$Sex == "Male Neutered" & Compare1$Body_condition_categorical == "Underweight" | Compare1$Body_condition_categorical == "Normal"] <- "D"
# Making the control with a different dataframe so that I can put it in all comparisons, I found this in the R documentation if you press f1 while hovering over the "Facet_wrap" command
Compare1filter1 <- Compare1 |>
  filter(Primary_endpoint_diagnosis_tier == "A" | Primary_endpoint_diagnosis_tier == "B" | Primary_endpoint_diagnosis_tier == "C" | Primary_endpoint_diagnosis_tier == "D")
Compare1filter1 = cbind("rowname" = rownames(Compare1filter1), Compare1filter1)
my_colors <- c("blue", "red")
my_colors2 <- c("blue", "purple")
my_colors3 <- c("blue", "yellow")
# I got the cbind command from https://www.geeksforgeeks.org/r-language/how-to-merge-r-dataframes-of-different-length/ I originally wanted to use another command I already knew but it kept acting weird
# The my_colors vectors had no effect because they were passed to scale_color_manual() inside aes(); the boxplot fill colors are set by scale_fill_manual() instead
Compare1filter2 <- Compare1filter1 |>
  filter(Primary_endpoint_diagnosis_tier == "A" | Primary_endpoint_diagnosis_tier == "B")
p6 <- Compare1filter2 |>
  ggplot(aes(x = Primary_endpoint_diagnosis_tier, y = Age_at_death_or_ltf_years, fill = Primary_endpoint_diagnosis_tier)) +
  scale_fill_manual(values = c("A" = "#FF4700", "B" = "#55B4F9", "D" = "#F1E542", "C" = "#51FFC2")) +
  labs(x = "Male vs Female", y = "Average lifespan") +
  theme(legend.position = "none") +
  geom_boxplot()
Compare1filter3 <- Compare1filter1 |>
  filter(Primary_endpoint_diagnosis_tier == "A" | Primary_endpoint_diagnosis_tier == "C")
p7 <- Compare1filter3 |>
  ggplot(aes(x = Primary_endpoint_diagnosis_tier, y = Age_at_death_or_ltf_years, fill = Primary_endpoint_diagnosis_tier)) +
  labs(x = "Neutered vs Intact", y = "") +
  scale_fill_manual(values = c("A" = "#FF4700", "B" = "#55B4F9", "D" = "#F1E542", "C" = "#51FFC2")) +
  theme(legend.position = "none") +
  geom_boxplot()
Compare1filter4 <- Compare1filter1 |>
  filter(Primary_endpoint_diagnosis_tier == "A" | Primary_endpoint_diagnosis_tier == "D")
p8 <- Compare1filter4 |>
  ggplot(aes(x = Primary_endpoint_diagnosis_tier, y = Age_at_death_or_ltf_years, fill = Primary_endpoint_diagnosis_tier)) +
  labs(x = "Healthy vs unhealthy weight", y = "") +
  scale_fill_manual(values = c("A" = "#FF4700", "B" = "#55B4F9", "D" = "#F1E542", "C" = "#51FFC2")) +
  theme(legend.position = "none") +
  geom_boxplot()
grid.arrange(p6, p7, p8, nrow = 1)
Warning: Removed 631 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 367 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 1504 rows containing non-finite outside the scale range
(`stat_boxplot()`).
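The group labels A to D are built from conditions that mix `&` and `|`, and in R `&` is evaluated before `|`. The sketch below shows the difference parentheses make; the four dogs in `toy` and the short column name `Body` are made up for illustration and are not from the dataset.

```r
# Toy data only (hypothetical dogs) to show how & and | combine.
toy <- data.frame(
  Sex  = c("Male Neutered", "Female Spayed", "Male Neutered", "Female Spayed"),
  Body = c("Underweight", "Normal", "Normal", "Overweight")
)

# Without parentheses, & binds tighter than |, so this reads as
# (neutered male AND underweight) OR (any dog that is Normal).
loose <- toy$Sex == "Male Neutered" & toy$Body == "Underweight" | toy$Body == "Normal"

# With parentheses, the sex test applies to both weight categories.
strict <- toy$Sex == "Male Neutered" & (toy$Body == "Underweight" | toy$Body == "Normal")

loose   # TRUE TRUE TRUE FALSE
strict  # TRUE FALSE TRUE FALSE
```

The second dog (a spayed female of normal weight) is the one that changes: it is picked up by the unparenthesized test but not by the parenthesized one.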
d <-boxplot(Age_at_death_or_ltf_years ~ Primary_endpoint_diagnosis_tier, data = Compare1filter2)
e <-boxplot(Age_at_death_or_ltf_years ~ Primary_endpoint_diagnosis_tier, data = Compare1filter3)
f <-boxplot(Age_at_death_or_ltf_years ~ Primary_endpoint_diagnosis_tier, data = Compare1filter4)
Neuter vs. intact: 2.97 years, neutered dogs live longer
Weight: 1.71 years, overweight dogs live longer (No they don’t! please don’t overfeed your dog)
Percent difference in lifespan…
Sex: 6.8%
Neuter vs. intact: 53%
Weight: 30.6%
A rough way to predict a dog’s lifespan is to scale it by the percent change from one median to another. For example, if a dog from the control group lived 10 years, it would likely have lived only about 4.7 years if it were intact, because the difference in median lifespan divided by the control median is 0.53, and 10 × (1 − 0.53) = 4.7. In general, multiplying a dog’s lifespan by (1 − % change) gives the expected lifespan if the dog had that unhealthy factor.
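The percent-change rule can be written as a small R function. This is a minimal sketch: the percentages are the rounded ones reported in this write-up, and `pct_change` and `predict_lifespan` are hypothetical names chosen for the example.

```r
# Rounded percent differences in median lifespan from the results above.
pct_change <- c(sex = 0.068, neuter = 0.53, weight = 0.306)

# Scale a control-group lifespan by (1 - percent change) for one factor.
predict_lifespan <- function(control_years, factor_name) {
  control_years * (1 - pct_change[[factor_name]])
}

predict_lifespan(10, "neuter")  # about 4.7 years
predict_lifespan(10, "weight")  # about 6.94 years
```

With the rounded 0.53 the worked example comes out to about 4.7 years; an unrounded ratio would shift the second decimal slightly.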
Since neuter status showed the largest difference of the three factors, with intact dogs having the shortest lifespans, there needs to be a plan to neuter dogs as efficiently and safely as possible. A big question is at what age to get a dog neutered. Most vets recommend getting large male dogs neutered at around 1 to 1.5 years old.
# This is a sidebar thing. Not part of the final project.
dogdata2$Age_at_gonadectomy_categorical <- factor(dogdata2$Age_at_gonadectomy_categorical,
  levels = c("<= 6 mo", "6 mo - 1 yr", "1 - 2 yrs", "2 - 5 yrs", ">5 yrs", "Intact"))
ggplot(dogdata2, aes(x = Age_at_death_or_ltf_years, y = Age_at_gonadectomy_categorical, fill = Age_at_gonadectomy_categorical)) +
  labs(y = "Age of Neuter", x = "Age of death", title = "Lifespan based on age of neuter") +
  geom_density_ridges() +
  theme_ridges() +
  theme(legend.position = "none")
Picking joint bandwidth of 0.767
# I saw this plot type in the R documentation in "Control aesthetic evaluation" and thought it looked cool.
This plot shows lifespan based on the age at which the dog was neutered. The distributions differ only slightly for dogs neutered before age 5, yet lifespan rises sharply for dogs neutered at 5 years or older. Part of this is unavoidable, since a dog neutered after age 5 must already have lived at least 5 years. Another possible cause is that the dog is fully physically and mentally matured and therefore has fewer issues with the stress that surgery puts on its body. Of course, this does not show the cancer risk that comes with staying intact… You can’t have testicular cancer if you’re neutered.
To summarize, I will make a graph of lifespans based on each factor that affects lifespan. You can hover over each dot to see the individual statistics of each dog.
color = gender
shape = Neuter status
size = Age at being neutered
# I am making two new dataframes so that I can separate the ages of dogs based on their group to compare each group to the control group "A"
Compare2filter2 <- Compare1filter2 |> filter(!is.na(Age_at_death_or_ltf_years))
Compare2filter3 <- Compare1filter2 |> filter(!is.na(Age_at_death_or_ltf_years))
Compare2filter4 <- Compare1filter3 |> filter(!is.na(Age_at_death_or_ltf_years))
Compare2filter5 <- Compare1filter3 |> filter(!is.na(Age_at_death_or_ltf_years))
Compare2filter6 <- Compare1filter4 |> filter(!is.na(Age_at_death_or_ltf_years))
Compare2filter7 <- Compare1filter4 |> filter(!is.na(Age_at_death_or_ltf_years))
# I am separating the death for the testing and control groups, group A's death is going to be in the higher number of each pair of dataframes while the lower number is going to get a new column named "predictedlife" where the death date from the control group will be changed by the change in median from the previous diagnostic plots
Compare2filter2$predictedlife[Compare2filter2$Primary_endpoint_diagnosis_tier == "A"] <- Compare2filter2$Age_at_death_or_ltf_years
Warning in
Compare2filter2$predictedlife[Compare2filter2$Primary_endpoint_diagnosis_tier
== : number of items to replace is not a multiple of replacement length
Warning in
Compare2filter4$predictedlife[Compare2filter4$Primary_endpoint_diagnosis_tier
== : number of items to replace is not a multiple of replacement length
Warning in
Compare2filter6$predictedlife[Compare2filter6$Primary_endpoint_diagnosis_tier
== : number of items to replace is not a multiple of replacement length
# I am ordering all dataframes from least to greatest so I can create ordered pairs in ascending value when I combine the data
Compare2filter2 <- Compare2filter2[order(Compare2filter2$predictedlife), ]
Compare2filter3 <- Compare2filter3[order(Compare2filter3$Age_at_death_or_ltf_years), ]
Compare2filter4 <- Compare2filter4[order(Compare2filter4$predictedlife), ]
Compare2filter5 <- Compare2filter5[order(Compare2filter5$Age_at_death_or_ltf_years), ]
Compare2filter6 <- Compare2filter6[order(Compare2filter6$predictedlife), ]
Compare2filter7 <- Compare2filter7[order(Compare2filter7$Age_at_death_or_ltf_years), ]
# I am combining the data so that the dataframe has the lifespan of a dog that was in the control group as well as what the
Compare2filter3$predictedlife <- Compare2filter2$predictedlife
Compare2filter5$predictedlife <- Compare2filter4$predictedlife
Compare2filter7$predictedlife <- Compare2filter6$predictedlife
# I learned to move a column from one dataframe to another through this stack overflow post: https://stackoverflow.com/questions/19201209/copying-multiple-columns-from-one-data-frame-to-another
# This is the second official plot
final2 <- Compare2filter5 |>
  ggplot(aes(x = Age_at_death_or_ltf_years, y = predictedlife,
             text = paste("Age of death:", Age_at_death_or_ltf_years, "\nAge of death:", predictedlife))) +
  geom_smooth(color = "black") +
  geom_point(color = "grey") +
  geom_abline(intercept = 0, slope = 1) +
  scale_x_continuous(breaks = seq(0, 10, by = 2), minor_breaks = seq(0, 10, by = 1)) +
  labs(x = "Intact", y = "", title = "Accuracy of formula based off distance from y=x",
       subtitle = "red is the predicted life, grey is the real lifespan",
       caption = "data gotten from plabs") +
  theme_minimal()
final22 <- final2 +
  geom_point(x = Compare2filter5$Age_at_death_or_ltf_years, y = Compare2filter5$predictedlife - 2.97, color = "red")
fig2 <- ggplotly(final22)
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
fig2
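Because both columns are sorted before they are combined, each point in this plot pairs dogs by rank (shortest-lived with shortest-lived, and so on) rather than by identity. The sketch below shows that idea in its simplest form, assuming two groups of equal size; the lifespans in `control` and `intact` are made up and are not from the dataset.

```r
# Made-up lifespans in years; illustration only.
control <- c(9.1, 4.2, 7.5, 6.0)   # stands in for the control group "A"
intact  <- c(3.9, 5.2, 2.5, 4.4)   # stands in for a comparison group

# Sort each group separately, then pair the values by rank.
paired <- data.frame(
  Age_at_death_or_ltf_years = sort(intact),
  predictedlife             = sort(control)
)
paired

# Fit the same kind of line used in this report.
fit <- lm(predictedlife ~ Age_at_death_or_ltf_years, data = paired)
coef(fit)
```

Since both columns increase together by construction, the fitted slope is positive and the points tend to fall close to a line, which is worth keeping in mind when reading the r-squared values.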
fit1 <- lm(predictedlife ~ Age_at_death_or_ltf_years, data = Compare2filter3)
summary(fit1)
Call:
lm(formula = predictedlife ~ Age_at_death_or_ltf_years, data = Compare2filter3)
Residuals:
Min 1Q Median 3Q Max
-1.6063 -0.3853 -0.1037 0.5569 0.9701
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.00217 0.13451 7.451 5.18e-11 ***
Age_at_death_or_ltf_years 0.95366 0.02292 41.613 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5606 on 91 degrees of freedom
Multiple R-squared: 0.9501, Adjusted R-squared: 0.9495
F-statistic: 1732 on 1 and 91 DF, p-value: < 2.2e-16
fit2 <- lm(predictedlife ~ Age_at_death_or_ltf_years, data = Compare2filter5)
summary(fit2)
Call:
lm(formula = predictedlife ~ Age_at_death_or_ltf_years, data = Compare2filter5)
Residuals:
Min 1Q Median 3Q Max
-0.94347 -0.29474 -0.09994 0.44906 0.80817
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.22749 0.11256 10.91 1.08e-13 ***
Age_at_death_or_ltf_years 0.46410 0.02324 19.97 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4595 on 41 degrees of freedom
Multiple R-squared: 0.9068, Adjusted R-squared: 0.9045
F-statistic: 398.9 on 1 and 41 DF, p-value: < 2.2e-16
fit3 <- lm(predictedlife ~ Age_at_death_or_ltf_years, data = Compare2filter7)
summary(fit3)
Call:
lm(formula = predictedlife ~ Age_at_death_or_ltf_years, data = Compare2filter7)
Residuals:
Min 1Q Median 3Q Max
-1.0809 -0.3431 -0.0080 0.3942 0.8977
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.5210 0.1619 -9.394 3.26e-15 ***
Age_at_death_or_ltf_years 5.4910 0.1118 49.131 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5341 on 95 degrees of freedom
(388 observations deleted due to missingness)
Multiple R-squared: 0.9621, Adjusted R-squared: 0.9617
F-statistic: 2414 on 1 and 95 DF, p-value: < 2.2e-16
So if the equation for the lifespan of a female dog is written as L(x), where x is years of life, the best-fit equation is L(x) = 0.95366x + 1.00217. Since the line y = x is the lifespan of a healthy dog, the distance from y = x shows how much the lifespan decreases. To find the distance, use the point-to-line formula |x1 - y1| / √2, where x1 and y1 are the coordinates of the point from the first equation. I wish I knew if this works but I am really not sure; I am awful at math and could barely find anything on Google, so I mostly worked it out myself on Desmos. The distance between that point and the line y = x should be the number of years by which your dog’s lifespan decreases.
The p-value is reported as < 2.2e-16, which is an incredibly small number. A small p-value means that a slope this far from zero would be very unlikely if there were no linear relationship between the two columns; it does not by itself prove that the given feature decreases the lifespan of the dog. The adjusted r-squared is 0.9495, which shows a very strong correlation between the points and the fitted line. This means that the line fits closely and very few points stray far from it.
The equations for the other two features are:
Neuter status: L(x) = 0.4641x + 1.22749 and |x1 - y1| / √2
Weight: L(x) = 5.491x - 1.521 and |x1 - y1| / √2
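The three fitted lines can be evaluated in R together with the distance to y = x. This is a sketch under assumptions: the coefficients are copied from the lm() summaries above, `fits`, `L`, and `dist_to_identity` are hypothetical names, and the distance used is the standard perpendicular distance from a point (x1, y1) to the line y = x, which is |x1 - y1| / √2. Whether that distance is the right measure of years lost is an assumption of this write-up, not an established result.

```r
# Intercept and slope for each factor, from the lm() summaries.
fits <- list(
  sex    = c(intercept = 1.00217, slope = 0.95366),
  neuter = c(intercept = 1.22749, slope = 0.46410),
  weight = c(intercept = -1.5210, slope = 5.4910)
)

# L(x) for a chosen factor
L <- function(x, factor_name) {
  co <- fits[[factor_name]]
  co[["intercept"]] + co[["slope"]] * x
}

# Perpendicular distance from the point (x1, y1) to the line y = x
dist_to_identity <- function(x1, y1) abs(x1 - y1) / sqrt(2)

x1 <- 8                    # example input, chosen arbitrarily
y1 <- L(x1, "neuter")      # 1.22749 + 0.4641 * 8
y1
dist_to_identity(x1, y1)
```

If the vertical gap is what matters instead (how far the predicted value sits below x), `abs(x1 - y1)` alone gives that.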
All adjusted r-squared values are above 0.9, which means that each line fits its data points closely. The p-values are all reported the same way as well, as < 2.2e-16.
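The adjusted r-squared and the slope p-value can also be read directly from the summary object instead of the printed table. The sketch below uses simulated toy data, not the dog data. As far as I know, R prints `< 2.2e-16` whenever a p-value falls below `.Machine$double.eps`, which would explain why all three models show the same value.

```r
set.seed(1)                              # toy data only, for reproducibility
x <- 1:20
y <- 1 + 0.9 * x + rnorm(20, sd = 0.5)   # a nearly straight line plus noise

fit <- lm(y ~ x)
s <- summary(fit)

s$adj.r.squared                    # adjusted r-squared as a number
s$coefficients["x", "Pr(>|t|)"]    # p-value for the slope
.Machine$double.eps                # the threshold behind "< 2.2e-16"
```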
Many problems arose during this project. It took a very long time to get data: almost no projects actually share their data, and the locations of their data are private. The only other open-data project required a data request; I waited days for a response and only got access on the day the project was due. Thankfully I found another data set a few days before the deadline and started to work. One of my biggest issues afterwards was that dogs tend to die only when they are really old or really young. That is especially visible in the last graph for neutering status, where the danger of not being neutered comes through cancer, which usually only happens in old age. I also had to deal with very one-dimensional data and had to make a graph with weird axes that subsequently gave a really weird formula for finding the lifespan of a dog.
I think that the median ages of death were off too. This reminds me of how people always say that the average person in medieval times only lived to 30… No they didn’t. There were just so many infant deaths that the average age at death was pulled far to the left. Once you made it out of infancy you were likely to live decently long. I think the same happened in my data, where a bunch of puppies died and skewed the data to the left, which would make sense if most dogs included in the study were sick, since the original purpose of the data set was to look at cancer in dogs. Otherwise, I think it was a good project. I learned a lot about manipulating data, especially on the second plot. Putting the control data in each plot in the faceted chart was difficult and took many hours to figure out.
Works Cited
“Adding colors in r for boxplot.” Stack Overflow, https://stackoverflow.com/questions/61926574/adding-colors-in-r-for-boxplot.
“Cohort profile: The Golden Retriever Lifetime Study (GRLS).” PLOS One, 9 June 2022, https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0269425. Accessed 16 December 2025.
“Copying multiple columns from one data frame to another.” Stack Overflow, https://stackoverflow.com/questions/19201209/copying-multiple-columns-from-one-data-frame-to-another.
“How to merge R dataframes of different length.” GeeksforGeeks, https://www.geeksforgeeks.org/r-language/how-to-merge-r-dataframes-of-different-length