Animal Crossing: New Horizons (ACNH) is a life simulation game developed by Nintendo and released in 2020. A big part of the game is choosing which villagers, the animal characters that live on your island, to invite and keep. There are over 400 unique villagers, each with its own species, personality, style, and color palette, so players are constantly making decisions about who to keep around.
This project uses two datasets: one with official game data for each villager, and one with community popularity rankings. The goal is to see whether measurable villager traits are actually associated with how popular a villager is in the community. Since the rankings are entirely community-decided, this is not about any in-game quality. It is about what kinds of villagers players tend to gravitate toward.
The main question is: Do a villager’s traits predict how popular they are within the ACNH community?
The main goal of this project is to find out whether traits like species, personality, style, and color are meaningfully linked to community popularity. Other goals are figuring out which traits matter most and building models that could help predict whether a villager is likely to be a fan favorite.
On a personal level, this project came from a real gameplay problem: trying to decide which villagers to keep without having met all of them in game. On a broader level, it shows how community preference data can be modeled, which is something that applies well beyond video games. Any situation where a group of people ranks or rates things, like movies, restaurants, or products, involves similar challenges.
| Variable | Type | Description |
|---|---|---|
| Name | Character | Villager’s name |
| Species | Character | Animal species (35 unique types) |
| Gender | Character | Male or Female |
| Personality | Character | One of 8 personality archetypes |
| Hobby | Character | One of 6 hobby categories |
| Style 1 | Character | Primary clothing style preference |
| Style 2 | Character | Secondary clothing style preference |
| Color 1 | Character | Primary associated color |
| Color 2 | Character | Secondary associated color |
| Birthday | Character | In-game birthday (month-day) |
| Favorite Song | Character | Villager’s preferred K.K. Slider song |
| tier | Integer | Community popularity tier (1 = most popular, 6 = least popular) |
| rank | Integer | Overall community rank (1 = most popular) |
The two datasets were merged by villager name using an inner join. Five of the 391 villagers in villagers.csv could not be matched because of naming differences between the two sources (things like special characters or alternate spellings) and were excluded. The final dataset contains 386 villagers with no missing values after the merge.
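One way to see which names fail to match is to compare the two name columns before joining. The sketch below uses two tiny made-up data frames; the names and the accent mismatch are illustrative assumptions, not the actual unmatched villagers from this project.

```r
# Hypothetical miniature versions of the two sources. The real files are
# villagers.csv (column Name) and acnh_villager_data.csv (column name).
villagers_demo <- data.frame(
  Name = c("Marina", "Ren\u00e9e", "Raymond"),
  stringsAsFactors = FALSE
)
popularity_demo <- data.frame(
  name = c("Marina", "Renee", "Raymond", "Shino"),
  stringsAsFactors = FALSE
)

# Names that appear in one source but not the other
only_in_villagers  <- setdiff(villagers_demo$Name, popularity_demo$name)
only_in_popularity <- setdiff(popularity_demo$name, villagers_demo$Name)

print(only_in_villagers)
print(only_in_popularity)
```

Listing both directions is useful because an inner join silently drops unmatched rows from either side.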
A continuous popularity score was created from rank using this formula:
popularity_score = 1 − ((rank − 1) / (max(rank) − 1))
This rescales rank to a 0 to 1 scale where 1 is the most popular villager and 0 is the least, which makes it easier to work with in the regression models.
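As a quick worked example of the formula, the sketch below applies it to a few ranks. The helper name rank_to_score and the maximum rank of 386 are assumptions made for illustration; in the real data max(rank) is whatever the largest rank in the merged table happens to be.

```r
# Rescale rank so that rank 1 maps to 1 and the worst rank maps to 0
rank_to_score <- function(rank, max_rank = max(rank)) {
  1 - ((rank - 1) / (max_rank - 1))
}

# Illustrative ranks, assuming the worst rank is 386
example_ranks <- c(1, 2, 194, 386)
scores <- rank_to_score(example_ranks, max_rank = 386)

print(round(scores, 3))
# [1] 1.000 0.997 0.499 0.000
```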
library(tidyverse)
library(dplyr)
library(ggplot2)
library(caret)
library(randomForest)
library(tree)
library(rpart)
library(rpart.plot)
library(knitr)
set.seed(2541)
villagers <- read_csv("villagers.csv")
popularity <- read_csv("acnh_villager_data.csv")
# Dimensions before merging
cat("villagers.csv dimensions:", dim(villagers), "\n")
## villagers.csv dimensions: 391 17
cat("acnh_villager_data.csv dimensions:", dim(popularity), "\n")
## acnh_villager_data.csv dimensions: 413 3
# Merging by villager names
merged <- inner_join(villagers, popularity, by = c("Name" = "name"))
cat("Merged dataset dimensions:", dim(merged), "\n")
## Merged dataset dimensions: 386 19
# Confirm that there are no missing values
colSums(is.na(merged))
## Name Species Gender Personality Hobby
## 0 0 0 0 0
## Birthday Catchphrase Favorite Song Style 1 Style 2
## 0 0 0 0 0
## Color 1 Color 2 Wallpaper Flooring Furniture List
## 0 0 0 0 0
## Filename Unique Entry ID tier rank
## 0 0 0 0
# Create some derived variables
merged <- merged %>%
  mutate(
    popularity_score = 1 - ((rank - 1) / (max(rank) - 1)),
    top_tier = as.factor(ifelse(tier <= 2, "Yes", "No"))
  )
head(merged)
There are 8 personality types in ACNH and all 8 show up in the dataset. Lazy and Normal are the most common, while Big Sister is the rarest with only 24 villagers.
ggplot(merged, aes(x = reorder(Personality, Personality, FUN = length), fill = Personality)) +
  geom_bar(show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "Count of Villagers by Personality Type",
    x = "Personality",
    y = "Count"
  ) +
  theme_minimal()
There are 35 species total. The chart below shows the 15 most common ones. Cats lead with 23 villagers, followed by rabbits and squirrels. Since species is closely tied to a villager’s visual identity, it was expected to be a strong predictor of popularity.
merged %>%
  count(Species, sort = TRUE) %>%
  slice_head(n = 15) %>%
  ggplot(aes(x = reorder(Species, n), y = n, fill = n)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_fill_gradient(low = "#a8dadc", high = "#457b9d") +
  labs(
    title = "Top 15 Most Common Villager Species",
    x = "Species",
    y = "Count"
  ) +
  theme_minimal()
Most villagers fall into tiers 5 and 6, which makes sense since the community only elevates a small fraction of the roster to the top. Tiers 1 and 2 combined hold only 36 of the 386 villagers, about 9%.
merged %>%
  mutate(tier = as.factor(tier)) %>%
  ggplot(aes(x = tier, fill = tier)) +
  geom_bar(show.legend = FALSE) +
  labs(
    title = "Number of Villagers per Popularity Tier",
    subtitle = "Tier 1 = most popular, Tier 6 = least popular",
    x = "Tier",
    y = "Count"
  ) +
  theme_minimal()
When looking at average popularity score by species, Octopus comes out on top by a wide margin. This is driven by a very small group: there are only a handful of octopus villagers, and Marina in particular is extremely popular. Wolf, Deer, and Cat follow behind. Species with fewer members tend to have more extreme averages because a single standout villager can move the mean.
merged %>%
  group_by(Species) %>%
  summarise(avg_pop = mean(popularity_score), .groups = "drop") %>%
  arrange(desc(avg_pop)) %>%
  slice_head(n = 15) %>%
  ggplot(aes(x = reorder(Species, avg_pop), y = avg_pop, fill = avg_pop)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_fill_gradient(low = "#a8dadc", high = "#1d3557") +
  labs(
    title = "Top 15 Species by Average Popularity Score",
    x = "Species",
    y = "Average Popularity Score (0-1)"
  ) +
  theme_minimal()
Lazy and Normal personalities have higher median popularity scores and more top-ranking outliers compared to other types. Cranky and Jock trend lower. This suggests the community leans toward softer and more easygoing characters.
merged %>%
  ggplot(aes(x = reorder(Personality, popularity_score, FUN = median),
             y = popularity_score, fill = Personality)) +
  geom_boxplot(alpha = 0.7, show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "Popularity Score by Personality Type",
    subtitle = "Ordered by median",
    x = "Personality",
    y = "Popularity Score (0-1)"
  ) +
  theme_minimal()
Villagers with a “Cute” primary style have a noticeably higher average popularity score than any other style. This lines up with the ACNH community’s preference for soft, approachable character designs. Active and Cool styles trend lowest.
merged %>%
  group_by(`Style 1`) %>%
  summarise(avg_pop = mean(popularity_score), .groups = "drop") %>%
  ggplot(aes(x = reorder(`Style 1`, avg_pop), y = avg_pop, fill = `Style 1`)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "Average Popularity Score by Primary Style",
    x = "Style 1",
    y = "Average Popularity Score (0-1)"
  ) +
  theme_minimal()
Beige and white are the primary colors that are associated with the highest average popularity. Black, red, and orange trend lower. This pattern matches the style findings since lighter and softer color palettes seem to be what the community prefers.
merged %>%
  group_by(`Color 1`) %>%
  summarise(avg_pop = mean(popularity_score), .groups = "drop") %>%
  ggplot(aes(x = reorder(`Color 1`, avg_pop), y = avg_pop, fill = `Color 1`)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "Average Popularity Score by Primary Color",
    x = "Color 1",
    y = "Average Popularity Score (0-1)"
  ) +
  theme_minimal()
Music and Nature hobbies are associated with slightly higher average popularity compared to Fitness and Education. The differences are small but show up consistently.
merged %>%
  group_by(Hobby) %>%
  summarise(avg_pop = mean(popularity_score), .groups = "drop") %>%
  ggplot(aes(x = reorder(Hobby, avg_pop), y = avg_pop, fill = Hobby)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "Average Popularity Score by Hobby",
    x = "Hobby",
    y = "Average Popularity Score (0-1)"
  ) +
  theme_minimal()
Four models are used in this analysis, each chosen for a specific reason:
| Model | Purpose | Outcome Variable |
|---|---|---|
| Linear Regression | Quantify the effect of each trait on popularity | Continuous popularity score (0-1) |
| Logistic Regression | Classify villagers as top-tier (Yes/No) | Binary: tier 1 or 2 vs. all others |
| Decision Tree | Show the most important split points visually | Binary: top-tier Yes/No |
| Random Forest | Rank overall variable importance using ensemble trees | Continuous popularity score (0-1) |
Predictors used in every model: Gender, Personality, Hobby, Style 1, Style 2, Color 1, Color 2, Species
A 70/30 split is used for all models, created with createDataPartition() from the caret package and stratified on popularity_score. The same split is applied across all models so results are comparable.
# Build clean modeling dataframe
model_df <- merged %>%
  select(Species, Gender, Personality, Hobby,
         `Style 1`, `Style 2`, `Color 1`, `Color 2`,
         popularity_score, top_tier) %>%
  mutate(
    Gender = as.factor(Gender),
    Personality = as.factor(Personality),
    Species = as.factor(Species),
    Hobby = as.factor(Hobby),
    Style1 = as.factor(`Style 1`),
    Style2 = as.factor(`Style 2`),
    Color1 = as.factor(`Color 1`),
    Color2 = as.factor(`Color 2`)
  ) %>%
  select(-`Style 1`, -`Style 2`, -`Color 1`, -`Color 2`)
dim(model_df)
## [1] 386 10
# 70/30 train/test split
train_ind <- createDataPartition(model_df$popularity_score, p = 0.7, list = FALSE)
training <- model_df[train_ind, ]
testing <- model_df[-train_ind, ]
cat("Training rows:", nrow(training), "\n")
## Training rows: 271
cat("Testing rows:", nrow(testing), "\n")
## Testing rows: 115
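Because the partition is stratified on the continuous score rather than on top_tier, it can be worth checking that the share of top-tier villagers is similar in both halves. The sketch below does that arithmetic using the class counts reported in the model outputs of this report; the variable names are illustrative.

```r
# Class counts copied from outputs reported in this document
train_counts <- c(No = 245, Yes = 26)   # table(training$top_tier)
test_counts  <- c(No = 105, Yes = 10)   # reference totals in the test confusion matrices

# Share of top-tier ("Yes") villagers in each half of the split
train_share <- unname(train_counts["Yes"] / sum(train_counts))
test_share  <- unname(test_counts["Yes"] / sum(test_counts))

print(round(c(train = train_share, test = test_share), 3))
```

Both shares come out close to 9%, so the split did not leave one side short on top-tier villagers, although the absolute counts (26 and 10) are small.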
Features: Gender, Personality, Hobby, Style1, Style2, Color1, Color2, Species
Linear regression models the relationship between a set of predictors and a continuous outcome. Here it shows how much each trait moves a villager’s predicted popularity score up or down relative to the baseline level of that trait. Based on the EDA, species and style were expected to show the strongest effects.
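To make "relative to the baseline" concrete, the sketch below shows how R turns a factor into dummy columns. It uses a toy factor with three hobby levels (an illustrative subset, not the project data): the alphabetically first level gets no column of its own and is absorbed into the intercept.

```r
# Toy factor with three hobby levels (illustrative subset)
hobby <- factor(c("Education", "Fitness", "Music", "Music", "Education"))

# model.matrix() builds the same dummy columns that lm() uses internally
X <- model.matrix(~ hobby)

print(levels(hobby)[1])   # the baseline level
print(colnames(X))        # no column for the baseline level
```

Each remaining coefficient is then read as the difference from that baseline level, holding the other predictors fixed.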
lm_model <- lm(popularity_score ~ Gender + Personality + Hobby +
                 Style1 + Style2 + Color1 + Color2 + Species,
               data = training)
summary(lm_model)
##
## Call:
## lm(formula = popularity_score ~ Gender + Personality + Hobby +
## Style1 + Style2 + Color1 + Color2 + Species, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.60828 -0.11389 -0.00159 0.13852 0.44388
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.382609 0.217030 1.763 0.0795 .
## GenderMale 0.109439 0.096922 1.129 0.2603
## PersonalityCranky -0.133557 0.079611 -1.678 0.0951 .
## PersonalityJock -0.182228 0.091959 -1.982 0.0490 *
## PersonalityLazy -0.151432 0.092119 -1.644 0.1019
## PersonalityNormal -0.033394 0.090270 -0.370 0.7118
## PersonalityPeppy 0.067047 0.097919 0.685 0.4944
## PersonalitySmug NA NA NA NA
## PersonalitySnooty -0.216450 0.094471 -2.291 0.0231 *
## HobbyFashion 0.002454 0.074736 0.033 0.9738
## HobbyFitness 0.127023 0.080413 1.580 0.1159
## HobbyMusic 0.098835 0.060819 1.625 0.1058
## HobbyNature 0.064898 0.067275 0.965 0.3360
## HobbyPlay 0.089991 0.074141 1.214 0.2264
## Style1Cool -0.001688 0.079372 -0.021 0.9831
## Style1Cute 0.030404 0.087554 0.347 0.7288
## Style1Elegant 0.150847 0.081706 1.846 0.0664 .
## Style1Gorgeous 0.073982 0.091732 0.806 0.4210
## Style1Simple 0.017281 0.071681 0.241 0.8098
## Style2Cool -0.054125 0.067862 -0.798 0.4261
## Style2Cute 0.123951 0.068636 1.806 0.0725 .
## Style2Elegant 0.001048 0.070248 0.015 0.9881
## Style2Gorgeous -0.021273 0.069930 -0.304 0.7613
## Style2Simple 0.049454 0.057457 0.861 0.3905
## Color1Black 0.086974 0.096990 0.897 0.3710
## Color1Blue -0.090623 0.089169 -1.016 0.3108
## Color1Brown -0.151070 0.119401 -1.265 0.2074
## Color1Colorful 0.009788 0.111757 0.088 0.9303
## Color1Gray -0.075337 0.115128 -0.654 0.5137
## Color1Green -0.115673 0.084400 -1.371 0.1722
## Color1Light blue -0.043235 0.099337 -0.435 0.6639
## Color1Orange 0.105117 0.101822 1.032 0.3032
## Color1Pink -0.033748 0.100783 -0.335 0.7381
## Color1Purple -0.039973 0.106032 -0.377 0.7066
## Color1Red -0.064054 0.090676 -0.706 0.4808
## Color1White -0.022845 0.107422 -0.213 0.8318
## Color1Yellow -0.031155 0.096551 -0.323 0.7473
## Color2Black 0.130078 0.110598 1.176 0.2410
## Color2Blue 0.239923 0.102909 2.331 0.0208 *
## Color2Brown 0.209160 0.117906 1.774 0.0777 .
## Color2Colorful 0.183313 0.122364 1.498 0.1358
## Color2Gray 0.020217 0.110025 0.184 0.8544
## Color2Green 0.095120 0.109526 0.868 0.3862
## Color2Light blue 0.283021 0.122542 2.310 0.0220 *
## Color2Orange 0.133441 0.120503 1.107 0.2696
## Color2Pink 0.305148 0.115088 2.651 0.0087 **
## Color2Purple 0.251246 0.113066 2.222 0.0275 *
## Color2Red 0.082449 0.102217 0.807 0.4209
## Color2White 0.206874 0.108093 1.914 0.0572 .
## Color2Yellow 0.209096 0.107274 1.949 0.0528 .
## SpeciesAnteater 0.054106 0.165484 0.327 0.7441
## SpeciesBear -0.045570 0.149833 -0.304 0.7614
## SpeciesBird -0.048585 0.146110 -0.333 0.7399
## SpeciesBull 0.242303 0.179814 1.348 0.1794
## SpeciesCat 0.189213 0.146696 1.290 0.1987
## SpeciesChicken -0.111832 0.150597 -0.743 0.4587
## SpeciesCow -0.055050 0.176348 -0.312 0.7553
## SpeciesCub 0.206197 0.148195 1.391 0.1658
## SpeciesDeer 0.287387 0.166250 1.729 0.0855 .
## SpeciesDog 0.098607 0.144887 0.681 0.4970
## SpeciesDuck 0.077534 0.145895 0.531 0.5957
## SpeciesEagle -0.047042 0.154891 -0.304 0.7617
## SpeciesElephant -0.012267 0.165162 -0.074 0.9409
## SpeciesFrog 0.048877 0.138442 0.353 0.7244
## SpeciesGoat 0.077707 0.172889 0.449 0.6536
## SpeciesGorilla -0.105605 0.169918 -0.622 0.5350
## SpeciesHamster 0.141545 0.149643 0.946 0.3454
## SpeciesHippo -0.047517 0.172128 -0.276 0.7828
## SpeciesHorse -0.041686 0.142891 -0.292 0.7708
## SpeciesKangaroo -0.174860 0.167624 -1.043 0.2982
## SpeciesKoala 0.108719 0.152418 0.713 0.4765
## SpeciesLion 0.160319 0.160450 0.999 0.3190
## SpeciesMonkey 0.164464 0.162176 1.014 0.3118
## SpeciesMouse -0.259293 0.152334 -1.702 0.0904 .
## SpeciesOctopus 0.258766 0.187689 1.379 0.1696
## SpeciesOstrich 0.034090 0.147458 0.231 0.8174
## SpeciesPenguin 0.097657 0.153739 0.635 0.5261
## SpeciesPig -0.108926 0.144281 -0.755 0.4512
## SpeciesRabbit 0.094635 0.149882 0.631 0.5286
## SpeciesRhino 0.285719 0.188834 1.513 0.1319
## SpeciesSheep 0.092683 0.147882 0.627 0.5316
## SpeciesSquirrel 0.166291 0.136569 1.218 0.2249
## SpeciesTiger 0.058783 0.192241 0.306 0.7601
## SpeciesWolf 0.358538 0.169533 2.115 0.0358 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2379 on 188 degrees of freedom
## Multiple R-squared: 0.4681, Adjusted R-squared: 0.2361
## F-statistic: 2.018 on 82 and 188 DF, p-value: 4.611e-05
# Predictions and performance on test data
lm_pred <- predict(lm_model, newdata = testing)
data.frame(
  R_Square = R2(lm_pred, testing$popularity_score),
  RMSE = RMSE(lm_pred, testing$popularity_score),
  MAE = MAE(lm_pred, testing$popularity_score)
)
plot(testing$popularity_score, lm_pred,
     xlab = "Actual Popularity Score",
     ylab = "Predicted Popularity Score",
     main = "Linear Regression: Predicted vs. Actual",
     pch = 19, col = "lightgreen")
abline(0, 1, col = "lightblue", lwd = 2)
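Result: On the training data the model explains about 47% of the variance in popularity score (multiple R-squared 0.468), but the adjusted R-squared is only 0.236 once the 82 estimated terms are accounted for. The predicted vs. actual plot shows decent alignment in the middle range with more scatter at the extremes. The largest coefficients belong to a few species (Wolf at 0.36, Deer and Rhino near 0.29) and to secondary colors (Pink 0.31, Light blue 0.28, Purple 0.25, Blue 0.24). Of these, Wolf and the four secondary colors are significant at the 0.05 level, along with negative effects for the Snooty and Jock personalities. Style effects are smaller and only marginally significant (Style1 Elegant, Style2 Cute), so the EDA expectation holds more clearly for species than for style. PersonalitySmug is not estimated because each personality type belongs to a single gender in the game, which makes Gender and Personality collinear. Most of the variance is still unexplained, likely because of visual design details that are not in the dataset.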
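The NA reported for PersonalitySmug in the summary can be reproduced on toy data. The sketch below assumes the in-game rule that each personality type belongs to one gender; the outcome values are arbitrary and only there so the model can be fit.

```r
# Personality-to-gender mapping as it works in the game
personality <- c("Big Sister", "Cranky", "Jock", "Lazy",
                 "Normal", "Peppy", "Smug", "Snooty")
gender      <- c("Female", "Male", "Male", "Male",
                 "Female", "Female", "Male", "Female")

# Toy data: three villagers per personality with an arbitrary outcome
toy <- data.frame(
  Personality = factor(rep(personality, each = 3)),
  Gender      = factor(rep(gender, each = 3)),
  y           = seq_len(24) / 24
)

# Every personality row has exactly one non-zero gender cell
print(table(toy$Personality, toy$Gender))

# Same term order as the report's model: Gender first, then Personality
fit <- lm(y ~ Gender + Personality, data = toy)
na_terms <- names(coef(fit))[is.na(coef(fit))]
print(na_terms)
```

Since the Male indicator equals the sum of the Cranky, Jock, Lazy, and Smug indicators, the last of those columns adds no new information and lm() drops it. Removing Gender from the formula would be one way to avoid the redundancy.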
Features: Gender, Personality, Hobby, Style1, Style2, Color1, Color2, Species
Logistic regression predicts a binary outcome: whether a villager is in tier 1 or 2 (“Yes”) or not (“No”). This directly answers the practical player question of what traits make a villager a community favorite. Based on the EDA, Cute style and certain species like Cat, Wolf, and Squirrel were expected to increase the probability of top-tier classification.
# Class balance check
table(training$top_tier)
##
## No Yes
## 245 26
log_model <- glm(top_tier ~ Gender + Personality + Hobby +
                   Style1 + Style2 + Color1 + Color2 + Species,
                 family = "binomial", data = training)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(log_model)
##
## Call:
## glm(formula = top_tier ~ Gender + Personality + Hobby + Style1 +
## Style2 + Color1 + Color2 + Species, family = "binomial",
## data = training)
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.805e+02 1.210e+06 0.000 1.000
## GenderMale 1.347e+02 5.639e+05 0.000 1.000
## PersonalityCranky -1.219e+02 4.607e+05 0.000 1.000
## PersonalityJock 5.092e+01 3.849e+05 0.000 1.000
## PersonalityLazy -5.661e+01 6.674e+05 0.000 1.000
## PersonalityNormal 4.189e+01 3.883e+05 0.000 1.000
## PersonalityPeppy 6.392e+00 3.666e+05 0.000 1.000
## PersonalitySmug NA NA NA NA
## PersonalitySnooty 4.406e+01 2.145e+05 0.000 1.000
## HobbyFashion -5.794e+01 3.754e+05 0.000 1.000
## HobbyFitness -1.000e+02 5.281e+05 0.000 1.000
## HobbyMusic 4.146e+01 2.891e+05 0.000 1.000
## HobbyNature -5.947e+01 1.177e+05 -0.001 1.000
## HobbyPlay 4.436e+00 4.169e+05 0.000 1.000
## Style1Cool 9.545e+01 1.659e+05 0.001 1.000
## Style1Cute 1.959e+02 2.763e+05 0.001 0.999
## Style1Elegant 1.093e+02 1.654e+05 0.001 0.999
## Style1Gorgeous 4.844e+01 2.098e+05 0.000 1.000
## Style1Simple 1.085e+02 4.911e+05 0.000 1.000
## Style2Cool 1.935e+01 4.297e+05 0.000 1.000
## Style2Cute 1.428e+02 3.063e+05 0.000 1.000
## Style2Elegant 6.016e+01 3.611e+05 0.000 1.000
## Style2Gorgeous -3.673e+01 4.964e+05 0.000 1.000
## Style2Simple 7.585e+01 3.001e+05 0.000 1.000
## Color1Black -4.332e+01 2.788e+05 0.000 1.000
## Color1Blue -7.823e+01 2.538e+05 0.000 1.000
## Color1Brown -1.370e+02 7.139e+05 0.000 1.000
## Color1Colorful -1.349e+02 1.269e+05 -0.001 0.999
## Color1Gray -1.162e+02 3.919e+05 0.000 1.000
## Color1Green -1.478e+02 1.625e+05 -0.001 0.999
## Color1Light blue -1.072e+02 1.092e+05 -0.001 0.999
## Color1Orange -1.989e+02 4.334e+05 0.000 1.000
## Color1Pink -1.421e+02 3.622e+05 0.000 1.000
## Color1Purple -3.946e+01 2.591e+05 0.000 1.000
## Color1Red -4.157e+01 1.200e+05 0.000 1.000
## Color1White -5.792e+01 5.174e+05 0.000 1.000
## Color1Yellow -1.159e+02 1.516e+05 -0.001 0.999
## Color2Black 9.929e+01 2.378e+05 0.000 1.000
## Color2Blue 7.565e+01 4.443e+05 0.000 1.000
## Color2Brown 2.210e+01 4.790e+05 0.000 1.000
## Color2Colorful -1.782e+01 3.639e+05 0.000 1.000
## Color2Gray 8.352e+01 2.786e+05 0.000 1.000
## Color2Green 3.219e+01 3.551e+05 0.000 1.000
## Color2Light blue 5.403e+01 3.343e+05 0.000 1.000
## Color2Orange -6.857e+01 2.362e+05 0.000 1.000
## Color2Pink 1.419e+02 2.221e+05 0.001 0.999
## Color2Purple 1.231e+02 2.887e+05 0.000 1.000
## Color2Red -1.105e+00 2.816e+05 0.000 1.000
## Color2White 3.875e+01 2.972e+05 0.000 1.000
## Color2Yellow 1.233e+02 3.271e+05 0.000 1.000
## SpeciesAnteater -8.142e+00 8.725e+05 0.000 1.000
## SpeciesBear 1.401e+01 3.923e+05 0.000 1.000
## SpeciesBird -4.183e+01 5.120e+05 0.000 1.000
## SpeciesBull 5.097e+01 1.125e+06 0.000 1.000
## SpeciesCat 5.971e+01 7.110e+05 0.000 1.000
## SpeciesChicken -1.201e+02 6.915e+05 0.000 1.000
## SpeciesCow 5.847e+00 7.160e+05 0.000 1.000
## SpeciesCub 6.422e+01 6.667e+05 0.000 1.000
## SpeciesDeer 1.022e+02 8.518e+05 0.000 1.000
## SpeciesDog 2.531e+01 8.109e+05 0.000 1.000
## SpeciesDuck 2.195e+01 8.189e+05 0.000 1.000
## SpeciesEagle 2.194e+01 6.869e+05 0.000 1.000
## SpeciesElephant -2.907e+00 4.954e+05 0.000 1.000
## SpeciesFrog -5.805e+01 7.577e+05 0.000 1.000
## SpeciesGoat 7.300e+01 9.852e+05 0.000 1.000
## SpeciesGorilla 1.785e+02 9.514e+05 0.000 1.000
## SpeciesHamster -7.822e+01 7.530e+05 0.000 1.000
## SpeciesHippo 6.148e+01 7.951e+05 0.000 1.000
## SpeciesHorse 2.499e+01 1.163e+06 0.000 1.000
## SpeciesKangaroo -5.128e+00 1.159e+06 0.000 1.000
## SpeciesKoala -2.238e+02 1.014e+06 0.000 1.000
## SpeciesLion 5.746e-01 5.713e+05 0.000 1.000
## SpeciesMonkey -9.097e+01 8.544e+05 0.000 1.000
## SpeciesMouse -6.502e+01 7.359e+05 0.000 1.000
## SpeciesOctopus 2.679e+01 8.095e+05 0.000 1.000
## SpeciesOstrich 8.912e+01 8.361e+05 0.000 1.000
## SpeciesPenguin 5.903e+01 6.552e+05 0.000 1.000
## SpeciesPig -6.873e+01 1.336e+06 0.000 1.000
## SpeciesRabbit -2.352e+01 7.139e+05 0.000 1.000
## SpeciesRhino 1.070e+02 6.633e+05 0.000 1.000
## SpeciesSheep 2.885e+01 6.952e+05 0.000 1.000
## SpeciesSquirrel 2.874e+01 9.773e+05 0.000 1.000
## SpeciesTiger -5.422e+01 7.587e+05 0.000 1.000
## SpeciesWolf -1.260e+02 1.111e+06 0.000 1.000
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1.7131e+02 on 270 degrees of freedom
## Residual deviance: 3.1568e-08 on 188 degrees of freedom
## AIC: 166
##
## Number of Fisher Scoring iterations: 25
# Predict and evaluate on test data
log_probs <- predict(log_model, newdata = testing, type = "response")
log_pred <- ifelse(log_probs > 0.5, "Yes", "No")
confusionMatrix(data = as.factor(log_pred), reference = as.factor(testing$top_tier))
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 92 6
## Yes 13 4
##
## Accuracy : 0.8348
## 95% CI : (0.7541, 0.8975)
## No Information Rate : 0.913
## P-Value [Acc > NIR] : 0.9979
##
## Kappa : 0.2098
##
## Mcnemar's Test P-Value : 0.1687
##
## Sensitivity : 0.8762
## Specificity : 0.4000
## Pos Pred Value : 0.9388
## Neg Pred Value : 0.2353
## Prevalence : 0.9130
## Detection Rate : 0.8000
## Detection Prevalence : 0.8522
## Balanced Accuracy : 0.6381
##
## 'Positive' Class : No
##
# Visualize: predicted probability by Style
testing %>%
  mutate(predicted_prob = log_probs) %>%
  ggplot(aes(x = reorder(Style1, predicted_prob, FUN = median),
             y = predicted_prob, fill = Style1)) +
  geom_boxplot(alpha = 0.7, show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "Logistic Regression: Predicted Top-Tier Probability by Style",
    subtitle = "Cute style villagers show the highest predicted probability",
    x = "Style 1",
    y = "Predicted Probability of Top-Tier Status"
  ) +
  theme_minimal()
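Result: The logistic regression model reaches 83.5% accuracy on the test set, but that is below the 91.3% no-information rate, so it does worse than simply predicting “No” for every villager. Because caret treats “No” as the positive class here, the reported sensitivity (0.876) describes the “No” class, while the specificity (0.40) means only 4 of the 10 top-tier test villagers were identified, and 13 of the 17 “Yes” predictions were wrong. The fit itself is unstable: glm warned that the algorithm did not converge and that fitted probabilities of 0 or 1 occurred, which points to complete separation. With 83 coefficients estimated from only 26 top-tier training villagers, every estimate has an enormous standard error and a p-value near 1, so individual coefficients should not be interpreted. The probability plot assigns Cute-style villagers the highest predicted probabilities of top-tier status and Cool and Active styles the lowest, which agrees with the EDA, but given the separation this is suggestive rather than confirmatory.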
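To put the accuracy figure in context, the main statistics can be recomputed by hand from the confusion matrix counts printed above. This is a minimal base R sketch; the variable names are illustrative and the per-class figures are computed for the “Yes” class, which is the class of interest.

```r
# Test-set confusion matrix copied from the logistic regression output
# rows = predicted class, columns = actual class
cm <- matrix(c(92, 13, 6, 4), nrow = 2,
             dimnames = list(Prediction = c("No", "Yes"),
                             Reference  = c("No", "Yes")))

accuracy      <- sum(diag(cm)) / sum(cm)               # (92 + 4) / 115
no_info_rate  <- max(colSums(cm)) / sum(cm)            # always predict the majority class
recall_yes    <- cm["Yes", "Yes"] / sum(cm[, "Yes"])   # share of actual top-tier found
precision_yes <- cm["Yes", "Yes"] / sum(cm["Yes", ])   # share of "Yes" predictions that are right

print(round(c(accuracy = accuracy, no_info_rate = no_info_rate,
              recall_yes = recall_yes, precision_yes = precision_yes), 3))
```

With classes this imbalanced, recall and precision for “Yes” say more about usefulness than overall accuracy does.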
Features: Gender, Personality, Hobby, Style1, Style2, Color1, Color2, Species
A decision tree splits the data at the most informative points and produces a visual, easy-to-read set of rules. Unlike the regression models, there are no coefficients to interpret, just a tree that can be read directly. This makes it a useful complement to logistic regression since it shows which specific splits the model finds most useful. Based on EDA, the first split was expected to be on Species or Style.
library(rpart)
library(rpart.plot)
# classification tree with cp tuning
tree_model <- rpart(top_tier ~ Gender + Personality + Hobby +
                      Style1 + Style2 + Color1 + Color2 + Species,
                    data = training,
                    method = "class",
                    control = rpart.control(cp = 0.001))
summary(tree_model)
## Call:
## rpart(formula = top_tier ~ Gender + Personality + Hobby + Style1 +
## Style2 + Color1 + Color2 + Species, data = training, method = "class",
## control = rpart.control(cp = 0.001))
## n= 271
##
## CP nsplit rel error xerror xstd
## 1 0.115384615 0 1.0000000 1.000000 0.1864712
## 2 0.009615385 3 0.5769231 1.192308 0.2015248
## 3 0.001000000 7 0.5384615 1.230769 0.2043224
##
## Variable importance
## Species Color1 Personality Style1 Color2 Style2
## 22 18 16 16 15 9
## Hobby
## 4
##
## Node number 1: 271 observations, complexity param=0.1153846
## predicted class=No expected loss=0.09594096 P(node) =1
## class counts: 245 26
## probabilities: 0.904 0.096
## left son=2 (224 obs) right son=3 (47 obs)
## Primary splits:
## Species splits as LLLLLRLLRRLRLLLRLLLLLLLLLRLLLLRLLLL, improve=6.797544, (0 missing)
## Color1 splits as RLLLRRLLLRLLRL, improve=2.043906, (0 missing)
## Style2 splits as LLRLLL, improve=2.008783, (0 missing)
## Personality splits as LLLRRRRL, improve=1.692750, (0 missing)
## Hobby splits as LLLRRR, improve=1.452166, (0 missing)
##
## Node number 2: 224 observations, complexity param=0.009615385
## predicted class=No expected loss=0.04464286 P(node) =0.8265683
## class counts: 214 10
## probabilities: 0.955 0.045
## left son=4 (140 obs) right son=5 (84 obs)
## Primary splits:
## Species splits as LLLLL-LL--R-RLL-LLLRLLLLL-RRLR-RRLL, improve=1.4880950, (0 missing)
## Color1 splits as RRLLRLLRLLRRLL, improve=0.6668067, (0 missing)
## Color2 splits as LRRLRLRLLRRLRR, improve=0.5357143, (0 missing)
## Hobby splits as RLLRLR, improve=0.5222131, (0 missing)
## Personality splits as RLRLRLRL, improve=0.2237814, (0 missing)
## Surrogate splits:
## Style1 splits as LLLLRL, agree=0.638, adj=0.036, (0 split)
## Color1 splits as LLLLLLLRLLLLLL, agree=0.634, adj=0.024, (0 split)
## Color2 splits as LLLRLLLLLLLLLL, agree=0.629, adj=0.012, (0 split)
##
## Node number 3: 47 observations, complexity param=0.1153846
## predicted class=No expected loss=0.3404255 P(node) =0.1734317
## class counts: 31 16
## probabilities: 0.660 0.340
## left son=6 (23 obs) right son=7 (24 obs)
## Primary splits:
## Style1 splits as LLRLLR, improve=7.943340, (0 missing)
## Style2 splits as LLRLLL, improve=5.365842, (0 missing)
## Personality splits as LLLRRRLL, improve=4.564258, (0 missing)
## Color1 splits as RLR-RRLRLRLLRL, improve=4.564258, (0 missing)
## Color2 splits as LLRLLLRRRRLRRR, improve=3.747512, (0 missing)
## Surrogate splits:
## Personality splits as LLLRRRLL, agree=0.872, adj=0.739, (0 split)
## Color1 splits as RLL-RLRRRRLLRR, agree=0.766, adj=0.522, (0 split)
## Color2 splits as LLRRLLRRRRLRRR, agree=0.766, adj=0.522, (0 split)
## Style2 splits as LLRLLR, agree=0.745, adj=0.478, (0 split)
## Species splits as -----L--RL-L---L---------R----R----, agree=0.681, adj=0.348, (0 split)
##
## Node number 4: 140 observations
## predicted class=No expected loss=0 P(node) =0.5166052
## class counts: 140 0
## probabilities: 1.000 0.000
##
## Node number 5: 84 observations, complexity param=0.009615385
## predicted class=No expected loss=0.1190476 P(node) =0.3099631
## class counts: 74 10
## probabilities: 0.881 0.119
## left son=10 (34 obs) right son=11 (50 obs)
## Primary splits:
## Hobby splits as RLRRLR, improve=1.6190480, (0 missing)
## Color2 splits as LRRLRLRLLRRLLR, improve=1.2703300, (0 missing)
## Color1 splits as RRLLR-LLLLRRLL, improve=1.1613820, (0 missing)
## Personality splits as RLRLRLRL, improve=0.9416858, (0 missing)
## Style2 splits as RRRRLR, improve=0.6041222, (0 missing)
## Surrogate splits:
## Personality splits as RRRRRLRL, agree=0.726, adj=0.324, (0 split)
## Color2 splits as LRRLRRLRRLRRRR, agree=0.679, adj=0.206, (0 split)
## Species splits as ----------R-R------R------RR-L-LR--, agree=0.655, adj=0.147, (0 split)
## Style1 splits as RRLRLR, agree=0.631, adj=0.088, (0 split)
## Color1 splits as RRRRR-RRRLRLRR, agree=0.631, adj=0.088, (0 split)
##
## Node number 6: 23 observations
## predicted class=No expected loss=0.04347826 P(node) =0.08487085
## class counts: 22 1
## probabilities: 0.957 0.043
##
## Node number 7: 24 observations, complexity param=0.1153846
## predicted class=Yes expected loss=0.375 P(node) =0.08856089
## class counts: 9 15
## probabilities: 0.375 0.625
## left son=14 (7 obs) right son=15 (17 obs)
## Primary splits:
## Color1 splits as R-R-RRLRLRL-RL, improve=4.594538, (0 missing)
## Color2 splits as -LRL--LLRR-RRL, improve=2.450000, (0 missing)
## Style2 splits as -LRL-L, improve=2.005556, (0 missing)
## Species splits as -----L--LR-R---L---------R----L----, improve=1.065126, (0 missing)
## Personality splits as -L-LRL-R, improve=0.375000, (0 missing)
## Surrogate splits:
## Color2 splits as -LRL--RRRR-RRR, agree=0.792, adj=0.286, (0 split)
## Personality splits as -L-RRR-R, agree=0.750, adj=0.143, (0 split)
## Style2 splits as -LRR-R, agree=0.750, adj=0.143, (0 split)
##
## Node number 10: 34 observations
## predicted class=No expected loss=0 P(node) =0.1254613
## class counts: 34 0
## probabilities: 1.000 0.000
##
## Node number 11: 50 observations, complexity param=0.009615385
## predicted class=No expected loss=0.2 P(node) =0.1845018
## class counts: 40 10
## probabilities: 0.800 0.200
## left son=22 (23 obs) right son=23 (27 obs)
## Primary splits:
## Color2 splits as LRRLRLRLLRRLLR, improve=2.0869570, (0 missing)
## Color1 splits as RLLLL-LLL-RRLL, improve=1.8275060, (0 missing)
## Personality splits as LLRLRRRL, improve=0.9275362, (0 missing)
## Style2 splits as RRRRLR, improve=0.8780488, (0 missing)
## Style1 splits as RLRLLL, improve=0.1904762, (0 missing)
## Surrogate splits:
## Personality splits as RLRLRRRL, agree=0.70, adj=0.348, (0 split)
## Color1 splits as LRLRR-RRL-RRLR, agree=0.68, adj=0.304, (0 split)
## Style1 splits as RRLLRL, agree=0.66, adj=0.261, (0 split)
## Style2 splits as RRRLLR, agree=0.66, adj=0.261, (0 split)
## Species splits as ----------R-L------L------LR-R-RR--, agree=0.64, adj=0.217, (0 split)
##
## Node number 14: 7 observations
## predicted class=No expected loss=0.1428571 P(node) =0.02583026
## class counts: 6 1
## probabilities: 0.857 0.143
##
## Node number 15: 17 observations
## predicted class=Yes expected loss=0.1764706 P(node) =0.06273063
## class counts: 3 14
## probabilities: 0.176 0.824
##
## Node number 22: 23 observations
## predicted class=No expected loss=0.04347826 P(node) =0.08487085
## class counts: 22 1
## probabilities: 0.957 0.043
##
## Node number 23: 27 observations, complexity param=0.009615385
## predicted class=No expected loss=0.3333333 P(node) =0.099631
## class counts: 18 9
## probabilities: 0.667 0.333
## left son=46 (20 obs) right son=47 (7 obs)
## Primary splits:
## Personality splits as LLLLRLR-, improve=1.0714290, (0 missing)
## Color1 splits as RRLLR-RLL-RR-L, improve=0.9868421, (0 missing)
## Style1 splits as LRRRLL, improve=0.8823529, (0 missing)
## Hobby splits as R-LR-L, improve=0.5274725, (0 missing)
## Style2 splits as RLRRLR, improve=0.3333333, (0 missing)
## Surrogate splits:
## Hobby splits as R-LL-L, agree=0.852, adj=0.429, (0 split)
## Color1 splits as RLLLL-LRL-LL-L, agree=0.852, adj=0.429, (0 split)
## Color2 splits as -LR-L-L--LL--L, agree=0.852, adj=0.429, (0 split)
## Style1 splits as LLLRLL, agree=0.778, adj=0.143, (0 split)
## Species splits as ----------L-L------R------LL-L-LL--, agree=0.778, adj=0.143, (0 split)
##
## Node number 46: 20 observations
## predicted class=No expected loss=0.25 P(node) =0.07380074
## class counts: 15 5
## probabilities: 0.750 0.250
##
## Node number 47: 7 observations
## predicted class=Yes expected loss=0.4285714 P(node) =0.02583026
## class counts: 3 4
## probabilities: 0.429 0.571
# Bar chart of variable importance from the decision tree
tree_importance <- data.frame(
  Variable = names(tree_model$variable.importance),
  Importance = tree_model$variable.importance
)
ggplot(tree_importance, aes(x = reorder(Variable, Importance), y = Importance, fill = Importance)) +
geom_col(show.legend = FALSE) +
coord_flip() +
scale_fill_gradient(low = "#a8dadc", high = "#1d3557") +
labs(
title = "Decision Tree: Variable Importance",
subtitle = "Which traits drive the top-tier classification split?",
x = "Variable",
y = "Importance"
) +
theme_minimal()
# Use cross-validation to find the best cp value
plotcp(tree_model)
# Prune to best cp based on CV results
best_cp <- tree_model$cptable[which.min(tree_model$cptable[, "xerror"]), "CP"]
tree_pruned <- prune(tree_model, cp = best_cp)
# Predict and evaluate on test data
tree_pred <- predict(tree_pruned, newdata = testing, type = "class")
confusionMatrix(tree_pred, testing$top_tier)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 105 10
## Yes 0 0
##
## Accuracy : 0.913
## 95% CI : (0.8459, 0.9575)
## No Information Rate : 0.913
## P-Value [Acc > NIR] : 0.583126
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 0.004427
##
## Sensitivity : 1.000
## Specificity : 0.000
## Pos Pred Value : 0.913
## Neg Pred Value : NaN
## Prevalence : 0.913
## Detection Rate : 0.913
## Detection Prevalence : 1.000
## Balanced Accuracy : 0.500
##
## 'Positive' Class : No
##
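Result: The full rpart tree does branch, and Species appears near the
top of it, which is consistent with what the EDA and random forest
importance both suggest. The pruned version, however, does not separate
the classes on the test set. After pruning to the cp with the lowest
cross-validated error, the tree predicts "No" for all 115 test
villagers, which is what a tree pruned back to (or near) its root would
do. Accuracy is 0.913, exactly the no-information rate, and Kappa is 0.
Because caret treats "No" as the positive class here, the sensitivity of
1.000 describes non-top-tier villagers, while the specificity of 0.000
means none of the 10 top-tier villagers in the test set were identified.
With only about 9% of villagers in the top tier, accuracy alone is a
misleading measure for this model.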
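The summary statistics in the confusion matrix above follow directly from its four counts. As a minimal sketch in base R (the counts are copied from the output; the variable names are only illustrative), the headline numbers can be recomputed by hand, keeping in mind that caret reports "No" as the positive class:

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference; matrix() fills column by column)
cm <- matrix(c(105, 0, 10, 0), nrow = 2,
             dimnames = list(Prediction = c("No", "Yes"),
                             Reference  = c("No", "Yes")))

n <- sum(cm)

# Share of test villagers classified correctly
accuracy <- sum(diag(cm)) / n

# Accuracy of always guessing the majority class
no_info_rate <- max(colSums(cm)) / n

# "No" is the positive class, so sensitivity is the hit rate on "No"
sensitivity <- cm["No", "No"] / sum(cm[, "No"])

# Specificity is therefore the hit rate on "Yes" (top-tier villagers)
specificity <- cm["Yes", "Yes"] / sum(cm[, "Yes"])

balanced_accuracy <- (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement corrected for chance agreement
expected_agreement <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa <- (accuracy - expected_agreement) / (1 - expected_agreement)

round(c(accuracy = accuracy, no_info_rate = no_info_rate,
        sensitivity = sensitivity, specificity = specificity,
        balanced_accuracy = balanced_accuracy, kappa = kappa), 3)
```

Balanced accuracy and Kappa are the more informative figures here, since roughly 91% of test villagers fall in the "No" class.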
Features: Gender, Personality, Hobby, Style1, Style2, Color1, Color2, Species
Random forest builds hundreds of decision trees and averages their predictions, which reduces overfitting compared to a single tree. It also handles complex interactions between predictors automatically and produces a variable importance ranking that directly answers which traits matter most. Random forest was expected to outperform linear regression since the relationship between traits and popularity is likely non-linear.
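To make the "many trees, averaged" idea concrete, here is a minimal sketch of bagging regression trees by hand with rpart on made-up data. It is not the randomForest algorithm itself: a real random forest also samples a random subset of predictors at each split (the mtry argument), which this sketch omits. The toy data frame, its effect sizes, and all variable names are hypothetical.

```r
library(rpart)  # recommended package that ships with R

set.seed(1)

# Hypothetical toy data: two categorical traits and a noisy numeric score
n <- 200
toy <- data.frame(
  Species = factor(sample(c("Cat", "Wolf", "Frog"), n, replace = TRUE)),
  Style1  = factor(sample(c("Cute", "Cool", "Simple"), n, replace = TRUE))
)
toy$score <- 0.3 * (toy$Species == "Cat") +
  0.2 * (toy$Style1 == "Cute") +
  rnorm(n, sd = 0.2)

train <- toy[1:150, ]
test  <- toy[151:200, ]

n_trees <- 50

# Fit one tree per bootstrap resample of the training rows;
# each column of pred_matrix holds one tree's predictions for the test rows
pred_matrix <- sapply(seq_len(n_trees), function(i) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  fit  <- rpart(score ~ Species + Style1, data = boot)
  predict(fit, newdata = test)
})

# The ensemble prediction is the average across trees
bagged_pred <- rowMeans(pred_matrix)

rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
single_rmse <- rmse(pred_matrix[, 1], test$score)
bagged_rmse <- rmse(bagged_pred, test$score)
c(single_tree = single_rmse, bagged = bagged_rmse)
```

Averaging smooths out the quirks of any one bootstrap tree, which is the variance reduction the paragraph above refers to.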
rf_model <- randomForest(popularity_score ~ Gender + Personality + Hobby +
Style1 + Style2 + Color1 + Color2 + Species,
data = training, mtry = 3, importance = TRUE, ntree = 500)
rf_model
##
## Call:
## randomForest(formula = popularity_score ~ Gender + Personality + Hobby + Style1 + Style2 + Color1 + Color2 + Species, data = training, mtry = 3, importance = TRUE, ntree = 500)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 0.06770375
## % Var explained: 8.25
# Predictions and performance
rf_pred <- predict(rf_model, newdata = testing)
data.frame(
R_Square = R2(rf_pred, testing$popularity_score),
RMSE = RMSE(rf_pred, testing$popularity_score),
MAE = MAE(rf_pred, testing$popularity_score)
)
# Predicted vs actual
plot(testing$popularity_score, rf_pred,
xlab = "Actual Popularity Score",
ylab = "Predicted Popularity Score",
main = "Random Forest: Predicted vs. Actual",
pch = 19, col = "darkgreen")
abline(0, 1, col = "lightgreen", lwd = 2)
# Variable importance
# Reuse the rf_model fitted above so the plot describes the same model that was evaluated;
# refitting without a fixed seed would produce a slightly different forest.
varImpPlot(rf_model,
main = "What traits matter most to community popularity?")
Result: Random forest outperforms linear regression in both R² and RMSE, which suggests that the relationships between traits and popularity are not purely linear. Overall fit is still modest: the fitted forest reports only 8.25% of variance explained on out-of-bag data. The variable importance plot is probably the most useful single output of the entire analysis. Species is the dominant predictor by a wide margin, followed by Style1 and Color1. Personality and Hobby contribute something but fall well behind the visual and species-based predictors. This hierarchy is consistent across every model used.
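One way to read an importance ranking like this is through permutation importance: shuffle one predictor in held-out data and see how much the error grows. The sketch below illustrates that idea with a plain linear model on made-up data. It is not how randomForest computes its own scores internally (those are based on out-of-bag samples for each tree), and the data, effect sizes, and names are all hypothetical.

```r
set.seed(42)

# Hypothetical toy data: Species has a large effect, Hobby a small one
n <- 300
toy <- data.frame(
  Species = factor(sample(c("Cat", "Wolf", "Frog"), n, replace = TRUE)),
  Hobby   = factor(sample(c("Music", "Fitness"), n, replace = TRUE))
)
toy$score <- 0.5 * (toy$Species == "Cat") +
  0.05 * (toy$Hobby == "Music") +
  rnorm(n, sd = 0.1)

train <- toy[1:200, ]
test  <- toy[201:300, ]

fit <- lm(score ~ Species + Hobby, data = train)

mse <- function(pred, obs) mean((pred - obs)^2)
baseline_mse <- mse(predict(fit, newdata = test), test$score)

# For each predictor: shuffle its column in the test set, re-predict,
# and record the average increase in MSE over 20 shuffles
perm_importance <- sapply(c("Species", "Hobby"), function(v) {
  increases <- replicate(20, {
    shuffled <- test
    shuffled[[v]] <- sample(shuffled[[v]])
    mse(predict(fit, newdata = shuffled), shuffled$score) - baseline_mse
  })
  mean(increases)
})

sort(perm_importance, decreasing = TRUE)
```

A predictor whose shuffling barely changes the error is one the model was not relying on, which is the sense in which Personality and Hobby "fall well behind" in the ranking above.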
The regression models (Linear Regression and Random Forest) are compared using R², RMSE, and MAE since they predict a continuous popularity score. The classification models (Logistic Regression and Decision Tree) are compared using accuracy, sensitivity, and specificity from their confusion matrices.
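For reference, the three regression metrics can be written out in a few lines of base R. This is a small sketch on made-up numbers, not the project's data. R² is computed here as the squared correlation between predictions and observations, which is assumed to match caret's default definition for R2(); that assumption is worth checking against the caret documentation.

```r
# Hypothetical observed and predicted popularity scores
obs  <- c(0.10, 0.40, 0.35, 0.80)
pred <- c(0.20, 0.30, 0.40, 0.60)

# Root mean squared error: penalizes large misses more heavily
rmse <- sqrt(mean((pred - obs)^2))

# Mean absolute error: average size of a miss, in score units
mae <- mean(abs(pred - obs))

# R-squared as squared Pearson correlation (assumed caret default)
r2 <- cor(pred, obs)^2

c(R_Square = r2, RMSE = rmse, MAE = mae)
```

RMSE and MAE are in the same units as the popularity score, so lower is better, while R² is unitless and higher is better.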
Regression Models
data.frame(
Model = c("Linear Regression", "Random Forest"),
R_Square = c(
R2(lm_pred, testing$popularity_score),
R2(rf_pred, testing$popularity_score)
),
RMSE = c(
RMSE(lm_pred, testing$popularity_score),
RMSE(rf_pred, testing$popularity_score)
),
MAE = c(
MAE(lm_pred, testing$popularity_score),
MAE(rf_pred, testing$popularity_score)
)
)
Classification Models
# Pull metrics from confusion matrices
log_cm <- confusionMatrix(as.factor(log_pred), as.factor(testing$top_tier))
tree_cm <- confusionMatrix(tree_pred, testing$top_tier)
data.frame(
Model = c("Logistic Regression", "Decision Tree"),
Accuracy = c(
round(log_cm$overall["Accuracy"], 4),
round(tree_cm$overall["Accuracy"], 4)
),
Sensitivity = c(
round(log_cm$byClass["Sensitivity"], 4),
round(tree_cm$byClass["Sensitivity"], 4)
),
Specificity = c(
round(log_cm$byClass["Specificity"], 4),
round(tree_cm$byClass["Specificity"], 4)
)
)
Now let's see which villagers have the best rank in tier 1.
# Sort by tier, then by rank within each tier (assumes a lower rank number is better)
merged %>%
arrange(tier, rank) %>%
select(Name, Species, rank, tier) %>%
head(15)
Looks like Raymond, Marshal, and Sherb have the best rank.
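The analysis shows that villager traits are meaningfully associated with community popularity in ACNH, though no single trait fully determines where a villager lands in the rankings, and the models' predictive power is modest: the random forest explains about 8% of out-of-bag variance, and the pruned decision tree does no better than always predicting "No". The following conclusions draw on the EDA and the model outputs together.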
Species is the strongest predictor. The random forest variable importance output, the decision tree splits, and the EDA all agree on this. Octopus, Wolf, Deer, and Cat are associated with substantially higher popularity scores. This makes sense since species determines a villager’s core visual identity, which is the first thing a player sees.
Aesthetic style is a close second. Villagers with a “Cute” primary style tag are much more likely to be top-tier, both in raw average scores and in model predicted probabilities. The ACNH community clearly prefers soft, rounded, approachable designs, and this dataset confirms that pattern quantitatively.
Color palette reinforces the trend. Lighter colors like Beige and White are associated with higher popularity, while darker or bolder colors like Black and Red are lower. This is consistent with the style finding and suggests the community’s preference is for a coherent soft aesthetic overall.
Personality and hobby matter less, but they still show up. Lazy and Normal personalities trend higher and Music and Nature hobbies edge out Fitness and Education. These effects are smaller than species and style but appear across models.
A useful next step would be incorporating image-based features extracted directly from villager sprites. A big part of what drives community preference is purely visual in ways that text-based game metadata cannot capture, and closing that gap would likely improve model performance considerably.
This project went in a direction that was not fully expected. Going in, personality type seemed like the obvious driver of popularity since it shapes almost all in-game dialogue and interaction. But every model kept pointing back to species and visual aesthetics as the dominant factor, which was one of the more interesting takeaways from the whole project.
The hardest part was handling species as a predictor. With 35 categories, several of which have only one or two villagers, estimates for rarer species are noisy and hard to trust. Grouping species by a broader visual archetype, like fluffy vs. scaly vs. avian, would be worth trying in a future version to get more stable estimates.
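As a rough sketch of what that grouping could look like, a named lookup vector can map each species to an archetype, with anything unmapped falling into "Other". The archetype assignments below are purely hypothetical and chosen for illustration; they are not taken from the game data or from this analysis.

```r
# Hypothetical species column
species <- c("Cat", "Wolf", "Eagle", "Alligator", "Octopus", "Cat", "Bird")

# Hypothetical lookup from species to a broader visual archetype
archetype_map <- c(
  Cat       = "Fluffy",
  Wolf      = "Fluffy",
  Eagle     = "Avian",
  Bird      = "Avian",
  Alligator = "Scaly"
)

# Look up each species; species missing from the map come back as NA
archetype <- unname(archetype_map[species])

# Send unmapped species to a catch-all level
archetype[is.na(archetype)] <- "Other"
archetype <- factor(archetype)

table(archetype)
```

The resulting factor has far fewer levels than Species, so each level is backed by more villagers, at the cost of hiding differences between species within an archetype.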
More time was spent on the EDA than expected, especially once color and style were added as variables. Those ended up being some of the most informative visuals in the whole report, which was a surprise. The main takeaway from the whole process is that the ACNH community responds strongly to a consistent soft aesthetic, and that signal shows up whether looking at a bar chart or a random forest importance plot.
tidyverse, ggplot2, caret, randomForest, rpart, rpart.plot