1. Introduction

Animal Crossing: New Horizons (ACNH) is a life simulation game developed by Nintendo and released in 2020. A big part of the game is choosing which villagers, the animal characters that live on your island, to invite and keep. There are over 400 unique villagers, each with its own species, personality, style, and color palette, so players are constantly making decisions about who to keep around.

This project uses two datasets: one with official game data for each villager, and one with community popularity rankings. The goal is to see whether measurable villager traits are actually associated with how popular a villager is in the community. Since the rankings are entirely community-decided, this is not about any in-game quality. It is about what kinds of villagers players tend to gravitate toward.

The main question is: Do a villager’s traits predict how popular they are within the ACNH community?


2. Purpose

Goals

The main goal of this project is to find out whether traits like species, personality, style, and color are meaningfully linked to community popularity. Other goals are figuring out which traits matter most and building models that could help predict whether a villager is likely to be a fan favorite.

Why This Matters

On a personal level, this project came from a real gameplay problem: trying to decide which villagers to keep without having met all of them in game. On a broader level, it shows how community preference data can be modeled, which is something that applies well beyond video games. Any situation where a group of people ranks or rates things, like movies, restaurants, or products, involves similar challenges.

Questions

  • Which villager traits are most strongly associated with popularity?
  • Can a villager’s tier be predicted from their traits alone?
  • Do community preferences follow recognizable aesthetic patterns, like a preference for cute or soft designs?

3. Data

Sources

  • villagers.csv — Game data scraped from the ACNH wiki with official attributes for each villager
  • acnh_villager_data.csv — Community popularity rankings from a fan-maintained tier list, with villagers ranked 1 through 413 and grouped into tiers 1 through 6

Data Dictionary

Variable       Type       Description
Name           Character  Villager’s name
Species        Character  Animal species (35 unique types)
Gender         Character  Male or Female
Personality    Character  One of 8 personality archetypes
Hobby          Character  One of 6 hobby categories
Style 1        Character  Primary clothing style preference
Style 2        Character  Secondary clothing style preference
Color 1        Character  Primary associated color
Color 2        Character  Secondary associated color
Birthday       Character  In-game birthday (month-day)
Favorite Song  Character  Villager’s preferred K.K. Slider song
tier           Integer    Community popularity tier (1 = most popular, 6 = least popular)
rank           Integer    Overall community rank (1 = most popular)

Notes on the Data

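The two datasets were merged by villager name with an inner join. Five of the 391 villagers in villagers.csv could not be matched due to naming differences between the two sources (things like special characters or alternate spellings) and were excluded. The ranking file has 413 rows, so ranked villagers without a matching row in villagers.csv also drop out. The final dataset contains 386 villagers with no missing values after the merge.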
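
One way to see which names fail to match is an anti-join in each direction. The sketch below uses two tiny made-up data frames rather than the real files, and it assumes the same key columns as the merge (Name and name). The names and spellings are only illustrative.

```r
library(dplyr)

# Toy stand-ins for the two sources (illustrative values only)
villagers_toy  <- data.frame(Name = c("Marina", "Ren\u00e9e", "Bob"),
                             stringsAsFactors = FALSE)
popularity_toy <- data.frame(name = c("Marina", "Renee", "Bob", "Raymond"),
                             rank = c(3, 120, 40, 1),
                             stringsAsFactors = FALSE)

# Rows of the game data with no match in the rankings
unmatched_game <- anti_join(villagers_toy, popularity_toy, by = c("Name" = "name"))

# Rows of the rankings with no match in the game data
unmatched_rank <- anti_join(popularity_toy, villagers_toy, by = c("name" = "Name"))

unmatched_game$Name
unmatched_rank$name
```

Listing the unmatched names on both sides makes it easier to decide whether a spelling fix would recover a villager before falling back on exclusion.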

A continuous popularity score was created from rank using this formula:

popularity_score = 1 − ((rank − 1) / (max(rank) − 1))

This rescales rank to a 0 to 1 scale where 1 is the most popular villager and 0 is the least, which makes it easier to work with in the regression models. Here max(rank) is the largest rank remaining in the merged data.
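
As a quick check of the formula, the sketch below applies it to a few made-up ranks. The helper name rank_to_score and the rank values are illustrative assumptions, not part of the project code.

```r
# Rescale rank to [0, 1]: rank 1 maps to 1, the largest rank maps to 0
rank_to_score <- function(rank) {
  1 - (rank - 1) / (max(rank) - 1)
}

ranks  <- c(1, 104, 207, 310, 413)   # illustrative ranks, not real villagers
scores <- rank_to_score(ranks)
scores
```

With these evenly spaced ranks the scores step down evenly from 1 to 0, which shows that the transformation is linear in rank and only reverses its direction.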


4. Libraries & Data Preparation

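library(tidyverse)     # data import, wrangling, and plotting (attaches dplyr and ggplot2)
library(dplyr)         # already attached by tidyverse
library(ggplot2)       # already attached by tidyverse
library(caret)         # data partitioning and evaluation metrics
library(randomForest)  # random forest models
library(tree)          # tree models
library(rpart)         # recursive partitioning trees
library(rpart.plot)    # plotting for rpart trees
library(knitr)         # report rendering helpers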

set.seed(2541)
villagers  <- read_csv("villagers.csv")
popularity <- read_csv("acnh_villager_data.csv")
# Dimensions before merging
cat("villagers.csv dimensions:", dim(villagers), "\n")
## villagers.csv dimensions: 391 17
cat("acnh_villager_data.csv dimensions:", dim(popularity), "\n")
## acnh_villager_data.csv dimensions: 413 3
# Merging by villager names
merged <- inner_join(villagers, popularity, by = c("Name" = "name"))

cat("Merged dataset dimensions:", dim(merged), "\n")
## Merged dataset dimensions: 386 19
# Confirm that there are no missing values
colSums(is.na(merged))
##            Name         Species          Gender     Personality           Hobby 
##               0               0               0               0               0 
##        Birthday     Catchphrase   Favorite Song         Style 1         Style 2 
##               0               0               0               0               0 
##         Color 1         Color 2       Wallpaper        Flooring  Furniture List 
##               0               0               0               0               0 
##        Filename Unique Entry ID            tier            rank 
##               0               0               0               0
# Create some derived variables
merged <- merged %>%
  mutate(
    popularity_score = 1 - ((rank - 1) / (max(rank) - 1)),
    top_tier         = as.factor(ifelse(tier <= 2, "Yes", "No"))
  )

head(merged)

5. Exploratory Data Analysis

Personality Type Distribution

There are 8 personality types in ACNH and all 8 show up in the dataset. Lazy and Normal are the most common, while Big Sister is the rarest with only 24 villagers.

ggplot(merged, aes(x = reorder(Personality, Personality, FUN = length), fill = Personality)) +
  geom_bar(show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "Count of Villagers by Personality Type",
    x = "Personality",
    y = "Count"
  ) +
  theme_minimal()

Species Distribution (Top 15)

There are 35 species total. The chart below shows the 15 most common ones. Cats lead with 23 villagers, followed by rabbits and squirrels. Since species is closely tied to a villager’s visual identity, it was expected to be a strong predictor of popularity.

merged %>%
  count(Species, sort = TRUE) %>%
  slice_head(n = 15) %>%
  ggplot(aes(x = reorder(Species, n), y = n, fill = n)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_fill_gradient(low = "#a8dadc", high = "#457b9d") +
  labs(
    title = "Top 15 Most Common Villager Species",
    x = "Species",
    y = "Count"
  ) +
  theme_minimal()

Popularity Tier Distribution

Most villagers fall into tiers 5 and 6, which makes sense since the community only elevates a small fraction of the roster to the top. Tiers 1 and 2 combined hold only 36 of the 386 merged villagers (about 9%).

merged %>%
  mutate(tier = as.factor(tier)) %>%
  ggplot(aes(x = tier, fill = tier)) +
  geom_bar(show.legend = FALSE) +
  labs(
    title = "Number of Villagers per Popularity Tier",
    subtitle = "Tier 1 = most popular, Tier 6 = least popular",
    x = "Tier",
    y = "Count"
  ) +
  theme_minimal()

Average Popularity by Species (Top 15)

When looking at average popularity score by species, Octopus comes out on top by a wide margin. This is driven by a tiny group: only a handful of octopus villagers exist, and Marina in particular is extremely popular. Wolf, Deer, and Cat follow behind. Species with fewer members tend to have more extreme averages since the sample size is smaller.
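
Because small groups produce extreme averages, it can help to report the group size and spread next to each mean. The sketch below does this on made-up scores. The toy data frame is an assumption for illustration and does not reproduce the real species values.

```r
library(dplyr)

# Toy data: one tiny species with high scores, one larger and more varied species
toy <- data.frame(
  Species          = c("Octopus", "Octopus", "Cat", "Cat", "Cat", "Cat", "Cat", "Cat"),
  popularity_score = c(0.99, 0.95, 0.98, 0.70, 0.55, 0.40, 0.30, 0.10)
)

species_summary <- toy %>%
  group_by(Species) %>%
  summarise(
    n       = n(),                      # group size
    avg_pop = mean(popularity_score),   # the quantity shown in the bar chart
    sd_pop  = sd(popularity_score),     # spread within the species
    .groups = "drop"
  ) %>%
  arrange(desc(avg_pop))

species_summary
```

Showing n alongside avg_pop makes it clear when a high average rests on only two or three villagers.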

merged %>%
  group_by(Species) %>%
  summarise(avg_pop = mean(popularity_score), .groups = "drop") %>%
  arrange(desc(avg_pop)) %>%
  slice_head(n = 15) %>%
  ggplot(aes(x = reorder(Species, avg_pop), y = avg_pop, fill = avg_pop)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_fill_gradient(low = "#a8dadc", high = "#1d3557") +
  labs(
    title = "Top 15 Species by Average Popularity Score",
    x = "Species",
    y = "Average Popularity Score (0-1)"
  ) +
  theme_minimal()

Personality and Popularity

Lazy and Normal personalities have higher median popularity scores and more top-ranking outliers compared to other types. Cranky and Jock trend lower. This suggests the community leans toward softer and more easygoing characters.

merged %>%
  ggplot(aes(x = reorder(Personality, popularity_score, FUN = median),
             y = popularity_score, fill = Personality)) +
  geom_boxplot(alpha = 0.7, show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "Popularity Score by Personality Type",
    subtitle = "Ordered by median",
    x = "Personality",
    y = "Popularity Score (0-1)"
  ) +
  theme_minimal()

Style Preference and Popularity

Villagers with a “Cute” primary style have a noticeably higher average popularity score than any other style. This lines up with the ACNH community’s preference for soft, approachable character designs. Active and Cool styles trend lowest.

merged %>%
  group_by(`Style 1`) %>%
  summarise(avg_pop = mean(popularity_score), .groups = "drop") %>%
  ggplot(aes(x = reorder(`Style 1`, avg_pop), y = avg_pop, fill = `Style 1`)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "Average Popularity Score by Primary Style",
    x = "Style 1",
    y = "Average Popularity Score (0-1)"
  ) +
  theme_minimal()

Color Palette and Popularity

Beige and white are the primary colors associated with the highest average popularity. Black, red, and orange trend lower. This matches the style findings: lighter, softer color palettes seem to be what the community prefers.

merged %>%
  group_by(`Color 1`) %>%
  summarise(avg_pop = mean(popularity_score), .groups = "drop") %>%
  ggplot(aes(x = reorder(`Color 1`, avg_pop), y = avg_pop, fill = `Color 1`)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "Average Popularity Score by Primary Color",
    x = "Color 1",
    y = "Average Popularity Score (0-1)"
  ) +
  theme_minimal()

Hobby and Popularity

Music and Nature hobbies are associated with slightly higher average popularity compared to Fitness and Education. The differences are small but show up consistently.

merged %>%
  group_by(Hobby) %>%
  summarise(avg_pop = mean(popularity_score), .groups = "drop") %>%
  ggplot(aes(x = reorder(Hobby, avg_pop), y = avg_pop, fill = Hobby)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "Average Popularity Score by Hobby",
    x = "Hobby",
    y = "Average Popularity Score (0-1)"
  ) +
  theme_minimal()


6. Methodology

Overview

Four models are used in this analysis, each chosen for a specific reason:

Model                Purpose                                                Outcome Variable
Linear Regression    Quantify the effect of each trait on popularity        Continuous popularity score (0-1)
Logistic Regression  Classify villagers as top-tier (Yes/No)                Binary: tier 1 or 2 vs. all others
Decision Tree        Show the most important split points visually          Binary: top-tier Yes/No
Random Forest        Rank overall variable importance using ensemble trees  Continuous popularity score (0-1)

Evaluation Metrics

  • R², RMSE, MAE for regression models (Linear Regression, Random Forest)
  • Confusion Matrix, Accuracy, Sensitivity, Specificity for classification models (Logistic Regression, Decision Tree)

Features Used

Gender, Personality, Hobby, Style 1, Style 2, Color 1, Color 2, Species

Train/Test Split

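A 70/30 split is used for all models, created with createDataPartition() from the caret package. The split is stratified on popularity_score (for a numeric outcome, createDataPartition() samples within quantile groups), so it balances the score distribution rather than the top_tier classes directly. The same split is applied across all models so results are comparable.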
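
To illustrate the difference between stratifying on a numeric score and stratifying on a class label, here is a minimal sketch on synthetic data. The variables n, score, and flag are made up for this example, and the seed is arbitrary.

```r
library(caret)

set.seed(1)  # arbitrary seed for this sketch

# Synthetic stand-in: a numeric score and a rare "Yes" class (about 10%)
n     <- 400
score <- runif(n)
flag  <- factor(ifelse(score > 0.9, "Yes", "No"))

# Stratified on the numeric score: caret bins the values into
# quantile groups and samples within each bin
idx_num <- createDataPartition(score, p = 0.7, list = FALSE)

# Stratified on the class label: sampling happens within "No" and "Yes"
idx_cls <- createDataPartition(flag, p = 0.7, list = FALSE)

# Compare the share of "Yes" in each training set
prop.table(table(flag[idx_num]))
prop.table(table(flag[idx_cls]))
```

Both calls return roughly 70% of the rows. Only the second one directly controls the class proportions, which matters most when the positive class is rare.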


7. Analysis & Results

Data Setup for Modeling

# Build clean modeling dataframe
model_df <- merged %>%
  select(Species, Gender, Personality, Hobby,
         `Style 1`, `Style 2`, `Color 1`, `Color 2`,
         popularity_score, top_tier) %>%
  mutate(
    Gender      = as.factor(Gender),
    Personality = as.factor(Personality),
    Species     = as.factor(Species),
    Hobby       = as.factor(Hobby),
    Style1      = as.factor(`Style 1`),
    Style2      = as.factor(`Style 2`),
    Color1      = as.factor(`Color 1`),
    Color2      = as.factor(`Color 2`)
  ) %>%
  select(-`Style 1`, -`Style 2`, -`Color 1`, -`Color 2`)

dim(model_df)
## [1] 386  10
# 70/30 train/test split
train_ind <- createDataPartition(model_df$popularity_score, p = 0.7, list = FALSE)

training <- model_df[train_ind, ]
testing  <- model_df[-train_ind, ]

cat("Training rows:", nrow(training), "\n")
## Training rows: 271
cat("Testing rows:", nrow(testing), "\n")
## Testing rows: 115

Model 1: Linear Regression

Features: Gender, Personality, Hobby, Style1, Style2, Color1, Color2, Species

Linear regression models the relationship between a set of predictors and a continuous outcome. Here it shows how much each trait moves a villager’s predicted popularity score up or down relative to the baseline level of that trait. Based on the EDA, species and style were expected to show the strongest effects.
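
The baseline comes from R’s default treatment coding, where the first level of each factor is absorbed into the intercept. The short sketch below shows this on a made-up factor. The values are illustrative and not taken from the dataset.

```r
# Treatment (dummy) coding: the first factor level becomes the reference
personality <- factor(c("Big Sister", "Cranky", "Lazy", "Normal", "Cranky", "Lazy"))

ref_level <- levels(personality)[1]              # reference level
dummy_cols <- colnames(model.matrix(~ personality))  # one column per non-reference level

ref_level
dummy_cols
```

In a model summary, the levels that do not appear in the coefficient table (for example Big Sister for Personality and Female for Gender) are the baselines that every other coefficient is measured against.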

lm_model <- lm(popularity_score ~ Gender + Personality + Hobby +
                 Style1 + Style2 + Color1 + Color2 + Species,
               data = training)
summary(lm_model)
## 
## Call:
## lm(formula = popularity_score ~ Gender + Personality + Hobby + 
##     Style1 + Style2 + Color1 + Color2 + Species, data = training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.60828 -0.11389 -0.00159  0.13852  0.44388 
## 
## Coefficients: (1 not defined because of singularities)
##                    Estimate Std. Error t value Pr(>|t|)   
## (Intercept)        0.382609   0.217030   1.763   0.0795 . 
## GenderMale         0.109439   0.096922   1.129   0.2603   
## PersonalityCranky -0.133557   0.079611  -1.678   0.0951 . 
## PersonalityJock   -0.182228   0.091959  -1.982   0.0490 * 
## PersonalityLazy   -0.151432   0.092119  -1.644   0.1019   
## PersonalityNormal -0.033394   0.090270  -0.370   0.7118   
## PersonalityPeppy   0.067047   0.097919   0.685   0.4944   
## PersonalitySmug          NA         NA      NA       NA   
## PersonalitySnooty -0.216450   0.094471  -2.291   0.0231 * 
## HobbyFashion       0.002454   0.074736   0.033   0.9738   
## HobbyFitness       0.127023   0.080413   1.580   0.1159   
## HobbyMusic         0.098835   0.060819   1.625   0.1058   
## HobbyNature        0.064898   0.067275   0.965   0.3360   
## HobbyPlay          0.089991   0.074141   1.214   0.2264   
## Style1Cool        -0.001688   0.079372  -0.021   0.9831   
## Style1Cute         0.030404   0.087554   0.347   0.7288   
## Style1Elegant      0.150847   0.081706   1.846   0.0664 . 
## Style1Gorgeous     0.073982   0.091732   0.806   0.4210   
## Style1Simple       0.017281   0.071681   0.241   0.8098   
## Style2Cool        -0.054125   0.067862  -0.798   0.4261   
## Style2Cute         0.123951   0.068636   1.806   0.0725 . 
## Style2Elegant      0.001048   0.070248   0.015   0.9881   
## Style2Gorgeous    -0.021273   0.069930  -0.304   0.7613   
## Style2Simple       0.049454   0.057457   0.861   0.3905   
## Color1Black        0.086974   0.096990   0.897   0.3710   
## Color1Blue        -0.090623   0.089169  -1.016   0.3108   
## Color1Brown       -0.151070   0.119401  -1.265   0.2074   
## Color1Colorful     0.009788   0.111757   0.088   0.9303   
## Color1Gray        -0.075337   0.115128  -0.654   0.5137   
## Color1Green       -0.115673   0.084400  -1.371   0.1722   
## Color1Light blue  -0.043235   0.099337  -0.435   0.6639   
## Color1Orange       0.105117   0.101822   1.032   0.3032   
## Color1Pink        -0.033748   0.100783  -0.335   0.7381   
## Color1Purple      -0.039973   0.106032  -0.377   0.7066   
## Color1Red         -0.064054   0.090676  -0.706   0.4808   
## Color1White       -0.022845   0.107422  -0.213   0.8318   
## Color1Yellow      -0.031155   0.096551  -0.323   0.7473   
## Color2Black        0.130078   0.110598   1.176   0.2410   
## Color2Blue         0.239923   0.102909   2.331   0.0208 * 
## Color2Brown        0.209160   0.117906   1.774   0.0777 . 
## Color2Colorful     0.183313   0.122364   1.498   0.1358   
## Color2Gray         0.020217   0.110025   0.184   0.8544   
## Color2Green        0.095120   0.109526   0.868   0.3862   
## Color2Light blue   0.283021   0.122542   2.310   0.0220 * 
## Color2Orange       0.133441   0.120503   1.107   0.2696   
## Color2Pink         0.305148   0.115088   2.651   0.0087 **
## Color2Purple       0.251246   0.113066   2.222   0.0275 * 
## Color2Red          0.082449   0.102217   0.807   0.4209   
## Color2White        0.206874   0.108093   1.914   0.0572 . 
## Color2Yellow       0.209096   0.107274   1.949   0.0528 . 
## SpeciesAnteater    0.054106   0.165484   0.327   0.7441   
## SpeciesBear       -0.045570   0.149833  -0.304   0.7614   
## SpeciesBird       -0.048585   0.146110  -0.333   0.7399   
## SpeciesBull        0.242303   0.179814   1.348   0.1794   
## SpeciesCat         0.189213   0.146696   1.290   0.1987   
## SpeciesChicken    -0.111832   0.150597  -0.743   0.4587   
## SpeciesCow        -0.055050   0.176348  -0.312   0.7553   
## SpeciesCub         0.206197   0.148195   1.391   0.1658   
## SpeciesDeer        0.287387   0.166250   1.729   0.0855 . 
## SpeciesDog         0.098607   0.144887   0.681   0.4970   
## SpeciesDuck        0.077534   0.145895   0.531   0.5957   
## SpeciesEagle      -0.047042   0.154891  -0.304   0.7617   
## SpeciesElephant   -0.012267   0.165162  -0.074   0.9409   
## SpeciesFrog        0.048877   0.138442   0.353   0.7244   
## SpeciesGoat        0.077707   0.172889   0.449   0.6536   
## SpeciesGorilla    -0.105605   0.169918  -0.622   0.5350   
## SpeciesHamster     0.141545   0.149643   0.946   0.3454   
## SpeciesHippo      -0.047517   0.172128  -0.276   0.7828   
## SpeciesHorse      -0.041686   0.142891  -0.292   0.7708   
## SpeciesKangaroo   -0.174860   0.167624  -1.043   0.2982   
## SpeciesKoala       0.108719   0.152418   0.713   0.4765   
## SpeciesLion        0.160319   0.160450   0.999   0.3190   
## SpeciesMonkey      0.164464   0.162176   1.014   0.3118   
## SpeciesMouse      -0.259293   0.152334  -1.702   0.0904 . 
## SpeciesOctopus     0.258766   0.187689   1.379   0.1696   
## SpeciesOstrich     0.034090   0.147458   0.231   0.8174   
## SpeciesPenguin     0.097657   0.153739   0.635   0.5261   
## SpeciesPig        -0.108926   0.144281  -0.755   0.4512   
## SpeciesRabbit      0.094635   0.149882   0.631   0.5286   
## SpeciesRhino       0.285719   0.188834   1.513   0.1319   
## SpeciesSheep       0.092683   0.147882   0.627   0.5316   
## SpeciesSquirrel    0.166291   0.136569   1.218   0.2249   
## SpeciesTiger       0.058783   0.192241   0.306   0.7601   
## SpeciesWolf        0.358538   0.169533   2.115   0.0358 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2379 on 188 degrees of freedom
## Multiple R-squared:  0.4681, Adjusted R-squared:  0.2361 
## F-statistic: 2.018 on 82 and 188 DF,  p-value: 4.611e-05
# Predictions and performance on test data
lm_pred <- predict(lm_model, newdata = testing)

data.frame(
  R_Square = R2(lm_pred, testing$popularity_score),
  RMSE     = RMSE(lm_pred, testing$popularity_score),
  MAE      = MAE(lm_pred, testing$popularity_score)
)
plot(testing$popularity_score, lm_pred,
     xlab = "Actual Popularity Score",
     ylab = "Predicted Popularity Score",
     main = "Linear Regression: Predicted vs. Actual",
     pch = 19, col = "lightgreen")
abline(0, 1, col = "lightblue", lwd = 2)

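Result: On the training data the model explains about 47% of the variance in popularity score (multiple R² = 0.468), but the adjusted R² drops to 0.236 once the 82 estimated coefficients are accounted for, so much of the fit comes from the sheer number of dummy variables. The predicted vs. actual plot shows decent alignment in the middle range with more scatter at the extremes. The largest and most clearly significant effects come from secondary color (Color2 Pink, Light blue, Purple, and Blue) and from a few species, led by Wolf. Jock and Snooty personalities carry significant negative coefficients. Style effects are weaker than the EDA suggested: Style1 Elegant and Style2 Cute are only marginal (p < 0.10). PersonalitySmug is reported as NA because each personality belongs to a single gender, which makes Gender and Personality perfectly collinear. A lot of variance remains unexplained, which likely comes from visual design details that are not in the dataset.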
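
The “1 not defined because of singularities” note in the summary can be reproduced on toy data. In ACNH each personality type belongs to a single gender, so the Gender dummy is an exact sum of personality dummies. The sketch below assumes that structure and uses a random outcome. The data frame and seed are made up for illustration.

```r
set.seed(1)  # arbitrary seed for this sketch

# Toy data: each personality belongs to exactly one gender
female_types <- c("Big Sister", "Normal", "Peppy", "Snooty")
male_types   <- c("Cranky", "Jock", "Lazy", "Smug")

toy <- data.frame(
  Personality = factor(rep(c(female_types, male_types), each = 5))
)
toy$Gender <- factor(ifelse(toy$Personality %in% male_types, "Male", "Female"))
toy$score  <- runif(nrow(toy))   # random outcome, for illustration only

fit <- lm(score ~ Gender + Personality, data = toy)

# GenderMale equals the sum of the four male personality dummies, so one
# coefficient cannot be estimated and is reported as NA
na_coefs <- names(coef(fit))[is.na(coef(fit))]
na_coefs
```

Dropping Gender from the formula would remove the redundancy without losing information, since Personality already determines it.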


Model 2: Logistic Regression

Features: Gender, Personality, Hobby, Style1, Style2, Color1, Color2, Species

Logistic regression predicts a binary outcome: whether a villager is in tier 1 or 2 (“Yes”) or not (“No”). This directly answers the practical player question of what traits make a villager a community favorite. Based on the EDA, Cute style and certain species like Cat, Wolf, and Squirrel were expected to increase the probability of top-tier classification.

# Class balance check
table(training$top_tier)
## 
##  No Yes 
## 245  26
log_model <- glm(top_tier ~ Gender + Personality + Hobby +
                   Style1 + Style2 + Color1 + Color2 + Species,
                 family = "binomial", data = training)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(log_model)
## 
## Call:
## glm(formula = top_tier ~ Gender + Personality + Hobby + Style1 + 
##     Style2 + Color1 + Color2 + Species, family = "binomial", 
##     data = training)
## 
## Coefficients: (1 not defined because of singularities)
##                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)       -2.805e+02  1.210e+06   0.000    1.000
## GenderMale         1.347e+02  5.639e+05   0.000    1.000
## PersonalityCranky -1.219e+02  4.607e+05   0.000    1.000
## PersonalityJock    5.092e+01  3.849e+05   0.000    1.000
## PersonalityLazy   -5.661e+01  6.674e+05   0.000    1.000
## PersonalityNormal  4.189e+01  3.883e+05   0.000    1.000
## PersonalityPeppy   6.392e+00  3.666e+05   0.000    1.000
## PersonalitySmug           NA         NA      NA       NA
## PersonalitySnooty  4.406e+01  2.145e+05   0.000    1.000
## HobbyFashion      -5.794e+01  3.754e+05   0.000    1.000
## HobbyFitness      -1.000e+02  5.281e+05   0.000    1.000
## HobbyMusic         4.146e+01  2.891e+05   0.000    1.000
## HobbyNature       -5.947e+01  1.177e+05  -0.001    1.000
## HobbyPlay          4.436e+00  4.169e+05   0.000    1.000
## Style1Cool         9.545e+01  1.659e+05   0.001    1.000
## Style1Cute         1.959e+02  2.763e+05   0.001    0.999
## Style1Elegant      1.093e+02  1.654e+05   0.001    0.999
## Style1Gorgeous     4.844e+01  2.098e+05   0.000    1.000
## Style1Simple       1.085e+02  4.911e+05   0.000    1.000
## Style2Cool         1.935e+01  4.297e+05   0.000    1.000
## Style2Cute         1.428e+02  3.063e+05   0.000    1.000
## Style2Elegant      6.016e+01  3.611e+05   0.000    1.000
## Style2Gorgeous    -3.673e+01  4.964e+05   0.000    1.000
## Style2Simple       7.585e+01  3.001e+05   0.000    1.000
## Color1Black       -4.332e+01  2.788e+05   0.000    1.000
## Color1Blue        -7.823e+01  2.538e+05   0.000    1.000
## Color1Brown       -1.370e+02  7.139e+05   0.000    1.000
## Color1Colorful    -1.349e+02  1.269e+05  -0.001    0.999
## Color1Gray        -1.162e+02  3.919e+05   0.000    1.000
## Color1Green       -1.478e+02  1.625e+05  -0.001    0.999
## Color1Light blue  -1.072e+02  1.092e+05  -0.001    0.999
## Color1Orange      -1.989e+02  4.334e+05   0.000    1.000
## Color1Pink        -1.421e+02  3.622e+05   0.000    1.000
## Color1Purple      -3.946e+01  2.591e+05   0.000    1.000
## Color1Red         -4.157e+01  1.200e+05   0.000    1.000
## Color1White       -5.792e+01  5.174e+05   0.000    1.000
## Color1Yellow      -1.159e+02  1.516e+05  -0.001    0.999
## Color2Black        9.929e+01  2.378e+05   0.000    1.000
## Color2Blue         7.565e+01  4.443e+05   0.000    1.000
## Color2Brown        2.210e+01  4.790e+05   0.000    1.000
## Color2Colorful    -1.782e+01  3.639e+05   0.000    1.000
## Color2Gray         8.352e+01  2.786e+05   0.000    1.000
## Color2Green        3.219e+01  3.551e+05   0.000    1.000
## Color2Light blue   5.403e+01  3.343e+05   0.000    1.000
## Color2Orange      -6.857e+01  2.362e+05   0.000    1.000
## Color2Pink         1.419e+02  2.221e+05   0.001    0.999
## Color2Purple       1.231e+02  2.887e+05   0.000    1.000
## Color2Red         -1.105e+00  2.816e+05   0.000    1.000
## Color2White        3.875e+01  2.972e+05   0.000    1.000
## Color2Yellow       1.233e+02  3.271e+05   0.000    1.000
## SpeciesAnteater   -8.142e+00  8.725e+05   0.000    1.000
## SpeciesBear        1.401e+01  3.923e+05   0.000    1.000
## SpeciesBird       -4.183e+01  5.120e+05   0.000    1.000
## SpeciesBull        5.097e+01  1.125e+06   0.000    1.000
## SpeciesCat         5.971e+01  7.110e+05   0.000    1.000
## SpeciesChicken    -1.201e+02  6.915e+05   0.000    1.000
## SpeciesCow         5.847e+00  7.160e+05   0.000    1.000
## SpeciesCub         6.422e+01  6.667e+05   0.000    1.000
## SpeciesDeer        1.022e+02  8.518e+05   0.000    1.000
## SpeciesDog         2.531e+01  8.109e+05   0.000    1.000
## SpeciesDuck        2.195e+01  8.189e+05   0.000    1.000
## SpeciesEagle       2.194e+01  6.869e+05   0.000    1.000
## SpeciesElephant   -2.907e+00  4.954e+05   0.000    1.000
## SpeciesFrog       -5.805e+01  7.577e+05   0.000    1.000
## SpeciesGoat        7.300e+01  9.852e+05   0.000    1.000
## SpeciesGorilla     1.785e+02  9.514e+05   0.000    1.000
## SpeciesHamster    -7.822e+01  7.530e+05   0.000    1.000
## SpeciesHippo       6.148e+01  7.951e+05   0.000    1.000
## SpeciesHorse       2.499e+01  1.163e+06   0.000    1.000
## SpeciesKangaroo   -5.128e+00  1.159e+06   0.000    1.000
## SpeciesKoala      -2.238e+02  1.014e+06   0.000    1.000
## SpeciesLion        5.746e-01  5.713e+05   0.000    1.000
## SpeciesMonkey     -9.097e+01  8.544e+05   0.000    1.000
## SpeciesMouse      -6.502e+01  7.359e+05   0.000    1.000
## SpeciesOctopus     2.679e+01  8.095e+05   0.000    1.000
## SpeciesOstrich     8.912e+01  8.361e+05   0.000    1.000
## SpeciesPenguin     5.903e+01  6.552e+05   0.000    1.000
## SpeciesPig        -6.873e+01  1.336e+06   0.000    1.000
## SpeciesRabbit     -2.352e+01  7.139e+05   0.000    1.000
## SpeciesRhino       1.070e+02  6.633e+05   0.000    1.000
## SpeciesSheep       2.885e+01  6.952e+05   0.000    1.000
## SpeciesSquirrel    2.874e+01  9.773e+05   0.000    1.000
## SpeciesTiger      -5.422e+01  7.587e+05   0.000    1.000
## SpeciesWolf       -1.260e+02  1.111e+06   0.000    1.000
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1.7131e+02  on 270  degrees of freedom
## Residual deviance: 3.1568e-08  on 188  degrees of freedom
## AIC: 166
## 
## Number of Fisher Scoring iterations: 25
# Predict and evaluate on test data
log_probs <- predict(log_model, newdata = testing, type = "response")
log_pred  <- ifelse(log_probs > 0.5, "Yes", "No")

confusionMatrix(data = as.factor(log_pred), reference = as.factor(testing$top_tier))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  92   6
##        Yes 13   4
##                                           
##                Accuracy : 0.8348          
##                  95% CI : (0.7541, 0.8975)
##     No Information Rate : 0.913           
##     P-Value [Acc > NIR] : 0.9979          
##                                           
##                   Kappa : 0.2098          
##                                           
##  Mcnemar's Test P-Value : 0.1687          
##                                           
##             Sensitivity : 0.8762          
##             Specificity : 0.4000          
##          Pos Pred Value : 0.9388          
##          Neg Pred Value : 0.2353          
##              Prevalence : 0.9130          
##          Detection Rate : 0.8000          
##    Detection Prevalence : 0.8522          
##       Balanced Accuracy : 0.6381          
##                                           
##        'Positive' Class : No              
## 
# Visualize: predicted probability by Style
testing %>%
  mutate(predicted_prob = log_probs) %>%
  ggplot(aes(x = reorder(Style1, predicted_prob, FUN = median),
             y = predicted_prob, fill = Style1)) +
  geom_boxplot(alpha = 0.7, show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "Logistic Regression: Predicted Top-Tier Probability by Style",
    subtitle = "Cute style villagers show the highest predicted probability",
    x = "Style 1",
    y = "Predicted Probability of Top-Tier Status"
  ) +
  theme_minimal()

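Result: The logistic regression fit is not reliable. glm() warned that the algorithm did not converge and that fitted probabilities of 0 or 1 occurred, the residual deviance is essentially zero, and every coefficient has an enormous standard error with a p-value near 1. These are signs of complete separation: with 82 coefficients and only 26 top-tier villagers in the training set, the model fits the training data perfectly and its individual coefficients cannot be interpreted. On the test set, accuracy is 83.5%, which is below the 91.3% no-information rate from always predicting “No”. Note that confusionMatrix() treats “No” as the positive class here, so the reported sensitivity (0.876) is for “No” and the specificity (0.400) is the share of top-tier villagers correctly identified: 4 of 10. Only 4 of the 17 “Yes” predictions were correct. The probability plot groups predicted probabilities by primary style, with Cute-style villagers assigned the highest values and Cool and Active the lowest, but given the separation problem this should be read as a rough pattern rather than confirmation of the EDA findings.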
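
The per-class figures can be recomputed by hand from the four counts in the confusion matrix reported earlier in this section, this time treating “Yes” as the class of interest. The sketch uses base R only. The object names are illustrative.

```r
# Counts copied from the reported confusion matrix
# (rows = prediction, columns = reference)
cm <- matrix(c(92, 13, 6, 4), nrow = 2,
             dimnames = list(Prediction = c("No", "Yes"),
                             Reference  = c("No", "Yes")))

accuracy      <- sum(diag(cm)) / sum(cm)               # (92 + 4) / 115
no_info_rate  <- max(colSums(cm)) / sum(cm)            # always predicting "No"
recall_yes    <- cm["Yes", "Yes"] / sum(cm[, "Yes"])   # top-tier villagers found
precision_yes <- cm["Yes", "Yes"] / sum(cm["Yes", ])   # "Yes" predictions that are right

round(c(accuracy = accuracy, no_info_rate = no_info_rate,
        recall_yes = recall_yes, precision_yes = precision_yes), 3)
```

Passing positive = "Yes" to confusionMatrix() would make caret report sensitivity and specificity from the top-tier perspective directly.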


Model 3: Decision Tree

Features: Gender, Personality, Hobby, Style1, Style2, Color1, Color2, Species

A decision tree splits the data at the most informative points and produces a visual, easy-to-read set of rules. Unlike the regression models, there are no coefficients to interpret, just a tree that can be read directly. This makes it a useful complement to logistic regression since it shows which specific splits the model finds most useful. Based on EDA, the first split was expected to be on Species or Style.
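
How large a tree grows is governed by rpart’s complexity parameter (cp), and a common follow-up is to prune back to the subtree with the lowest cross-validated error. The sketch below shows that pattern on the kyphosis data that ships with rpart, used here as a stand-in for the villager data. It is not the procedure used in this report, and the seed is arbitrary.

```r
library(rpart)

set.seed(1)  # cross-validation folds are random; arbitrary seed

# kyphosis ships with rpart; it stands in for the villager data here
fit <- rpart(Kyphosis ~ Age + Number + Start,
             data = kyphosis, method = "class",
             control = rpart.control(cp = 0.001))

# cptable columns: CP, nsplit, rel error, xerror (cross-validated), xstd
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]

# Prune back to the subtree with the lowest cross-validated error
pruned <- prune(fit, cp = best_cp)

nrow(fit$frame)     # nodes before pruning
nrow(pruned$frame)  # nodes after pruning (never more than before)
```

If the lowest xerror in the table belongs to the row with zero splits, pruning collapses the tree to its root, which is a sign that the splits do not generalize beyond the training data.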

library(rpart)
library(rpart.plot)

# Classification tree grown with a small fixed cp (0.001); no cp tuning or pruning is done here
tree_model <- rpart(top_tier ~ Gender + Personality + Hobby +
                      Style1 + Style2 + Color1 + Color2 + Species,
                    data = training,
                    method = "class",
                    control = rpart.control(cp = 0.001))
summary(tree_model)
## Call:
## rpart(formula = top_tier ~ Gender + Personality + Hobby + Style1 + 
##     Style2 + Color1 + Color2 + Species, data = training, method = "class", 
##     control = rpart.control(cp = 0.001))
##   n= 271 
## 
##            CP nsplit rel error   xerror      xstd
## 1 0.115384615      0 1.0000000 1.000000 0.1864712
## 2 0.009615385      3 0.5769231 1.192308 0.2015248
## 3 0.001000000      7 0.5384615 1.230769 0.2043224
## 
## Variable importance
##     Species      Color1 Personality      Style1      Color2      Style2 
##          22          18          16          16          15           9 
##       Hobby 
##           4 
## 
## Node number 1: 271 observations,    complexity param=0.1153846
##   predicted class=No   expected loss=0.09594096  P(node) =1
##     class counts:   245    26
##    probabilities: 0.904 0.096 
##   left son=2 (224 obs) right son=3 (47 obs)
##   Primary splits:
##       Species     splits as  LLLLLRLLRRLRLLLRLLLLLLLLLRLLLLRLLLL, improve=6.797544, (0 missing)
##       Color1      splits as  RLLLRRLLLRLLRL, improve=2.043906, (0 missing)
##       Style2      splits as  LLRLLL, improve=2.008783, (0 missing)
##       Personality splits as  LLLRRRRL, improve=1.692750, (0 missing)
##       Hobby       splits as  LLLRRR, improve=1.452166, (0 missing)
## 
## Node number 2: 224 observations,    complexity param=0.009615385
##   predicted class=No   expected loss=0.04464286  P(node) =0.8265683
##     class counts:   214    10
##    probabilities: 0.955 0.045 
##   left son=4 (140 obs) right son=5 (84 obs)
##   Primary splits:
##       Species     splits as  LLLLL-LL--R-RLL-LLLRLLLLL-RRLR-RRLL, improve=1.4880950, (0 missing)
##       Color1      splits as  RRLLRLLRLLRRLL, improve=0.6668067, (0 missing)
##       Color2      splits as  LRRLRLRLLRRLRR, improve=0.5357143, (0 missing)
##       Hobby       splits as  RLLRLR, improve=0.5222131, (0 missing)
##       Personality splits as  RLRLRLRL, improve=0.2237814, (0 missing)
##   Surrogate splits:
##       Style1 splits as  LLLLRL, agree=0.638, adj=0.036, (0 split)
##       Color1 splits as  LLLLLLLRLLLLLL, agree=0.634, adj=0.024, (0 split)
##       Color2 splits as  LLLRLLLLLLLLLL, agree=0.629, adj=0.012, (0 split)
## 
## Node number 3: 47 observations,    complexity param=0.1153846
##   predicted class=No   expected loss=0.3404255  P(node) =0.1734317
##     class counts:    31    16
##    probabilities: 0.660 0.340 
##   left son=6 (23 obs) right son=7 (24 obs)
##   Primary splits:
##       Style1      splits as  LLRLLR, improve=7.943340, (0 missing)
##       Style2      splits as  LLRLLL, improve=5.365842, (0 missing)
##       Personality splits as  LLLRRRLL, improve=4.564258, (0 missing)
##       Color1      splits as  RLR-RRLRLRLLRL, improve=4.564258, (0 missing)
##       Color2      splits as  LLRLLLRRRRLRRR, improve=3.747512, (0 missing)
##   Surrogate splits:
##       Personality splits as  LLLRRRLL, agree=0.872, adj=0.739, (0 split)
##       Color1      splits as  RLL-RLRRRRLLRR, agree=0.766, adj=0.522, (0 split)
##       Color2      splits as  LLRRLLRRRRLRRR, agree=0.766, adj=0.522, (0 split)
##       Style2      splits as  LLRLLR, agree=0.745, adj=0.478, (0 split)
##       Species     splits as  -----L--RL-L---L---------R----R----, agree=0.681, adj=0.348, (0 split)
## 
## Node number 4: 140 observations
##   predicted class=No   expected loss=0  P(node) =0.5166052
##     class counts:   140     0
##    probabilities: 1.000 0.000 
## 
## Node number 5: 84 observations,    complexity param=0.009615385
##   predicted class=No   expected loss=0.1190476  P(node) =0.3099631
##     class counts:    74    10
##    probabilities: 0.881 0.119 
##   left son=10 (34 obs) right son=11 (50 obs)
##   Primary splits:
##       Hobby       splits as  RLRRLR, improve=1.6190480, (0 missing)
##       Color2      splits as  LRRLRLRLLRRLLR, improve=1.2703300, (0 missing)
##       Color1      splits as  RRLLR-LLLLRRLL, improve=1.1613820, (0 missing)
##       Personality splits as  RLRLRLRL, improve=0.9416858, (0 missing)
##       Style2      splits as  RRRRLR, improve=0.6041222, (0 missing)
##   Surrogate splits:
##       Personality splits as  RRRRRLRL, agree=0.726, adj=0.324, (0 split)
##       Color2      splits as  LRRLRRLRRLRRRR, agree=0.679, adj=0.206, (0 split)
##       Species     splits as  ----------R-R------R------RR-L-LR--, agree=0.655, adj=0.147, (0 split)
##       Style1      splits as  RRLRLR, agree=0.631, adj=0.088, (0 split)
##       Color1      splits as  RRRRR-RRRLRLRR, agree=0.631, adj=0.088, (0 split)
## 
## Node number 6: 23 observations
##   predicted class=No   expected loss=0.04347826  P(node) =0.08487085
##     class counts:    22     1
##    probabilities: 0.957 0.043 
## 
## Node number 7: 24 observations,    complexity param=0.1153846
##   predicted class=Yes  expected loss=0.375  P(node) =0.08856089
##     class counts:     9    15
##    probabilities: 0.375 0.625 
##   left son=14 (7 obs) right son=15 (17 obs)
##   Primary splits:
##       Color1      splits as  R-R-RRLRLRL-RL, improve=4.594538, (0 missing)
##       Color2      splits as  -LRL--LLRR-RRL, improve=2.450000, (0 missing)
##       Style2      splits as  -LRL-L, improve=2.005556, (0 missing)
##       Species     splits as  -----L--LR-R---L---------R----L----, improve=1.065126, (0 missing)
##       Personality splits as  -L-LRL-R, improve=0.375000, (0 missing)
##   Surrogate splits:
##       Color2      splits as  -LRL--RRRR-RRR, agree=0.792, adj=0.286, (0 split)
##       Personality splits as  -L-RRR-R, agree=0.750, adj=0.143, (0 split)
##       Style2      splits as  -LRR-R, agree=0.750, adj=0.143, (0 split)
## 
## Node number 10: 34 observations
##   predicted class=No   expected loss=0  P(node) =0.1254613
##     class counts:    34     0
##    probabilities: 1.000 0.000 
## 
## Node number 11: 50 observations,    complexity param=0.009615385
##   predicted class=No   expected loss=0.2  P(node) =0.1845018
##     class counts:    40    10
##    probabilities: 0.800 0.200 
##   left son=22 (23 obs) right son=23 (27 obs)
##   Primary splits:
##       Color2      splits as  LRRLRLRLLRRLLR, improve=2.0869570, (0 missing)
##       Color1      splits as  RLLLL-LLL-RRLL, improve=1.8275060, (0 missing)
##       Personality splits as  LLRLRRRL, improve=0.9275362, (0 missing)
##       Style2      splits as  RRRRLR, improve=0.8780488, (0 missing)
##       Style1      splits as  RLRLLL, improve=0.1904762, (0 missing)
##   Surrogate splits:
##       Personality splits as  RLRLRRRL, agree=0.70, adj=0.348, (0 split)
##       Color1      splits as  LRLRR-RRL-RRLR, agree=0.68, adj=0.304, (0 split)
##       Style1      splits as  RRLLRL, agree=0.66, adj=0.261, (0 split)
##       Style2      splits as  RRRLLR, agree=0.66, adj=0.261, (0 split)
##       Species     splits as  ----------R-L------L------LR-R-RR--, agree=0.64, adj=0.217, (0 split)
## 
## Node number 14: 7 observations
##   predicted class=No   expected loss=0.1428571  P(node) =0.02583026
##     class counts:     6     1
##    probabilities: 0.857 0.143 
## 
## Node number 15: 17 observations
##   predicted class=Yes  expected loss=0.1764706  P(node) =0.06273063
##     class counts:     3    14
##    probabilities: 0.176 0.824 
## 
## Node number 22: 23 observations
##   predicted class=No   expected loss=0.04347826  P(node) =0.08487085
##     class counts:    22     1
##    probabilities: 0.957 0.043 
## 
## Node number 23: 27 observations,    complexity param=0.009615385
##   predicted class=No   expected loss=0.3333333  P(node) =0.099631
##     class counts:    18     9
##    probabilities: 0.667 0.333 
##   left son=46 (20 obs) right son=47 (7 obs)
##   Primary splits:
##       Personality splits as  LLLLRLR-, improve=1.0714290, (0 missing)
##       Color1      splits as  RRLLR-RLL-RR-L, improve=0.9868421, (0 missing)
##       Style1      splits as  LRRRLL, improve=0.8823529, (0 missing)
##       Hobby       splits as  R-LR-L, improve=0.5274725, (0 missing)
##       Style2      splits as  RLRRLR, improve=0.3333333, (0 missing)
##   Surrogate splits:
##       Hobby   splits as  R-LL-L, agree=0.852, adj=0.429, (0 split)
##       Color1  splits as  RLLLL-LRL-LL-L, agree=0.852, adj=0.429, (0 split)
##       Color2  splits as  -LR-L-L--LL--L, agree=0.852, adj=0.429, (0 split)
##       Style1  splits as  LLLRLL, agree=0.778, adj=0.143, (0 split)
##       Species splits as  ----------L-L------R------LL-L-LL--, agree=0.778, adj=0.143, (0 split)
## 
## Node number 46: 20 observations
##   predicted class=No   expected loss=0.25  P(node) =0.07380074
##     class counts:    15     5
##    probabilities: 0.750 0.250 
## 
## Node number 47: 7 observations
##   predicted class=Yes  expected loss=0.4285714  P(node) =0.02583026
##     class counts:     3     4
##    probabilities: 0.429 0.571
# Bar chart of variable importance from the decision tree
tree_importance <- data.frame(
  Variable   = names(tree_model$variable.importance),
  Importance = tree_model$variable.importance
)

ggplot(tree_importance, aes(x = reorder(Variable, Importance), y = Importance, fill = Importance)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_fill_gradient(low = "#a8dadc", high = "#1d3557") +
  labs(
    title = "Decision Tree: Variable Importance",
    subtitle = "Which traits drive the top-tier classification split?",
    x = "Variable",
    y = "Importance"
  ) +
  theme_minimal()

# Use cross-validation to find the best cp value
plotcp(tree_model)

# Prune to best cp based on CV results
best_cp     <- tree_model$cptable[which.min(tree_model$cptable[, "xerror"]), "CP"]
tree_pruned <- prune(tree_model, cp = best_cp)
# Predict and evaluate on test data
tree_pred <- predict(tree_pruned, newdata = testing, type = "class")
confusionMatrix(tree_pred, testing$top_tier)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  105  10
##        Yes   0   0
##                                           
##                Accuracy : 0.913           
##                  95% CI : (0.8459, 0.9575)
##     No Information Rate : 0.913           
##     P-Value [Acc > NIR] : 0.583126        
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 0.004427        
##                                           
##             Sensitivity : 1.000           
##             Specificity : 0.000           
##          Pos Pred Value : 0.913           
##          Neg Pred Value :   NaN           
##              Prevalence : 0.913           
##          Detection Rate : 0.913           
##    Detection Prevalence : 1.000           
##       Balanced Accuracy : 0.500           
##                                           
##        'Positive' Class : No              
## 

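Result: Fitting the tree with rpart gives an actual branching tree instead of a collapsed single-node output. The full tree splits on Species first, which is consistent with what the EDA and random forest importance both suggest, and then on Style1 and Color1 within the higher-popularity species group. Pruning at the cross-validated cp, however, appears to remove the splits that lead to a “Yes” prediction: on the test set the pruned tree predicts “No” for all 115 villagers. Its accuracy of 0.913 is exactly the no-information rate, Kappa is 0, and none of the 10 top-tier test villagers is identified (the reported sensitivity of 1.000 refers to the “No” class, which is the positive class in this output). The tree is therefore more useful for showing which traits drive the splits than for classifying top-tier villagers.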
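
The arithmetic behind the confusion matrix above can be checked by hand. The sketch below is only an illustration in base R: it re-enters the four counts from the printed matrix (it does not use the fitted model) and recomputes accuracy, the no-information rate, and the per-class recall. The variable names are made up for this example.

```r
# Counts copied from the printed confusion matrix (rows = prediction, cols = reference)
cm <- matrix(c(105, 0, 10, 0), nrow = 2,
             dimnames = list(Prediction = c("No", "Yes"),
                             Reference  = c("No", "Yes")))

# Share of test villagers classified correctly
accuracy <- sum(diag(cm)) / sum(cm)

# Accuracy of always guessing the majority class
no_info_rate <- max(colSums(cm)) / sum(cm)

# Recall for each class: correct predictions divided by actual members
recall_no  <- cm["No", "No"]   / sum(cm[, "No"])
recall_yes <- cm["Yes", "Yes"] / sum(cm[, "Yes"])

# Balanced accuracy averages the two recalls
balanced_accuracy <- (recall_no + recall_yes) / 2

round(c(accuracy = accuracy, no_info_rate = no_info_rate,
        recall_no = recall_no, recall_yes = recall_yes,
        balanced_accuracy = balanced_accuracy), 3)
```

Because accuracy and the no-information rate are both 105/115, the high accuracy reflects the class imbalance rather than any ability to find top-tier villagers.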


Model 4: Random Forest

Features: Gender, Personality, Hobby, Style1, Style2, Color1, Color2, Species

Random forest builds hundreds of decision trees and averages their predictions, which reduces overfitting compared to a single tree. It also handles complex interactions between predictors automatically and produces a variable importance ranking that directly answers which traits matter most. Random forest was expected to outperform linear regression since the relationship between traits and popularity is likely non-linear.
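
The "build many trees and average them" idea can be sketched with plain bagging. This is a simplified illustration, not the randomForest algorithm itself: it uses the built-in mtcars data rather than the villager data, fits rpart trees on bootstrap resamples, and omits the random subset of predictors tried at each split (the mtry setting). The object names are made up for this example.

```r
library(rpart)  # recommended package that ships with standard R installations

set.seed(42)
n_trees <- 50
n <- nrow(mtcars)

# Fit each tree on a bootstrap resample of the rows
trees <- lapply(seq_len(n_trees), function(i) {
  boot_rows <- sample(n, replace = TRUE)
  rpart(mpg ~ wt + hp + disp, data = mtcars[boot_rows, ],
        control = rpart.control(minsplit = 5))
})

# Each column holds one tree's predictions for every car
pred_matrix <- sapply(trees, predict, newdata = mtcars)

# The ensemble prediction is the row-wise average across trees
bagged_pred <- rowMeans(pred_matrix)
head(bagged_pred)
```

Averaging smooths out the quirks of any single tree, which is why the ensemble tends to overfit less than one deep tree.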

rf_model <- randomForest(popularity_score ~ Gender + Personality + Hobby +
                           Style1 + Style2 + Color1 + Color2 + Species,
                         data = training, mtry = 3, importance = TRUE, ntree = 500)
rf_model
## 
## Call:
##  randomForest(formula = popularity_score ~ Gender + Personality +      Hobby + Style1 + Style2 + Color1 + Color2 + Species, data = training,      mtry = 3, importance = TRUE, ntree = 500) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 0.06770375
##                     % Var explained: 8.25
# Predictions and performance
rf_pred <- predict(rf_model, newdata = testing)

data.frame(
  R_Square = R2(rf_pred, testing$popularity_score),
  RMSE     = RMSE(rf_pred, testing$popularity_score),
  MAE      = MAE(rf_pred, testing$popularity_score)
)
# Predicted vs actual
plot(testing$popularity_score, rf_pred,
     xlab = "Actual Popularity Score",
     ylab = "Predicted Popularity Score",
     main = "Random Forest: Predicted vs. Actual",
     pch = 19, col = "darkgreen")
abline(0, 1, col = "lightgreen", lwd = 2)

# Variable importance (reuses the rf_model fitted above; refitting here without a
# seed would give a slightly different forest than the one evaluated on the test set)
varImpPlot(rf_model,
           main = "What traits matter most to community popularity?")

Result: Random forest outperforms linear regression in both R² and RMSE, which suggests that the relationships between traits and popularity are not purely linear. Even so, the out-of-bag variance explained is only 8.25%, so the traits in this dataset account for a small share of the variation in popularity score. The variable importance plot is probably the most useful single output of the entire analysis. Species is the dominant predictor by a wide margin, followed by Style1 and Color1. Personality and Hobby contribute something but fall well behind the visual and species-based predictors. This hierarchy is consistent across every model used.
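
One way to read an importance ranking is through permutation: shuffle one predictor, see how much the prediction error grows, and treat a larger increase as a sign of a more important variable. The sketch below illustrates that idea in base R under simplifying assumptions: it uses a linear model on the built-in mtcars data and measures error on the same rows used for fitting, whereas randomForest works with out-of-bag observations. The helper `mse()` and the other names are made up for this example.

```r
set.seed(1)

fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Mean squared error of a model on a given data frame
mse <- function(model, data) {
  mean((data$mpg - predict(model, newdata = data))^2)
}

baseline <- mse(fit, mtcars)

# For each predictor, shuffle its values 20 times and average the rise in MSE
perm_importance <- sapply(c("wt", "hp", "qsec"), function(v) {
  increases <- replicate(20, {
    shuffled <- mtcars
    shuffled[[v]] <- sample(shuffled[[v]])
    mse(fit, shuffled) - baseline
  })
  mean(increases)
})

sort(perm_importance, decreasing = TRUE)
```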


Model Comparison Summary

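The regression models (Linear Regression and Random Forest) are compared using R², RMSE, and MAE since they predict a continuous popularity score. The classification models (Logistic Regression and Decision Tree) are compared using accuracy, sensitivity, and specificity from their confusion matrices. By default caret takes the first factor level as the positive class (“No” in the decision tree output above), so sensitivity there measures how well non-top-tier villagers are identified and specificity measures how well top-tier villagers are identified.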

Regression Models

data.frame(
  Model    = c("Linear Regression", "Random Forest"),
  R_Square = c(
    R2(lm_pred, testing$popularity_score),
    R2(rf_pred, testing$popularity_score)
  ),
  RMSE = c(
    RMSE(lm_pred, testing$popularity_score),
    RMSE(rf_pred, testing$popularity_score)
  ),
  MAE = c(
    MAE(lm_pred, testing$popularity_score),
    MAE(rf_pred, testing$popularity_score)
  )
)
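
For reference, the three regression metrics can be computed by hand in base R. The sketch below uses made-up numbers, not the villager data, and the vector names are hypothetical. The R² shown is the squared correlation between predicted and actual values, which is what caret's R2() reports with its default formula = "corr" setting.

```r
# Made-up actual and predicted scores, for illustration only
actual    <- c(0.10, 0.35, 0.20, 0.80, 0.55)
predicted <- c(0.20, 0.30, 0.25, 0.60, 0.50)

errors <- predicted - actual

# Root mean squared error: penalizes large misses more heavily
rmse <- sqrt(mean(errors^2))

# Mean absolute error: the average size of a miss
mae <- mean(abs(errors))

# R-squared as the squared correlation between predictions and actuals
r_square <- cor(predicted, actual)^2

data.frame(R_Square = r_square, RMSE = rmse, MAE = mae)
```

RMSE and MAE are in the same units as the popularity score, so lower is better, while R² closer to 1 means the predictions track the actual scores more closely.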

Classification Models

# Pull metrics from confusion matrices
log_cm  <- confusionMatrix(as.factor(log_pred), as.factor(testing$top_tier))
tree_cm <- confusionMatrix(tree_pred, testing$top_tier)

data.frame(
  Model       = c("Logistic Regression", "Decision Tree"),
  Accuracy    = c(
    round(log_cm$overall["Accuracy"], 4),
    round(tree_cm$overall["Accuracy"], 4)
  ),
  Sensitivity = c(
    round(log_cm$byClass["Sensitivity"], 4),
    round(tree_cm$byClass["Sensitivity"], 4)
  ),
  Specificity = c(
    round(log_cm$byClass["Specificity"], 4),
    round(tree_cm$byClass["Specificity"], 4)
  )
)

Now let's see which villagers have the best rank in tier 1.

# Sort by tier, then by rank within each tier (assumes a lower rank number is better)
merged %>%
  arrange(tier, rank) %>%
  select(Name, Species, rank, tier) %>%
  head(15)

It looks like Raymond, Marshal, and Sherb have the best rank.


8. Conclusion

The analysis shows that villager traits are meaningfully associated with community popularity in ACNH, though no single trait fully determines where a villager lands in the rankings. The following conclusions are supported across all four models and the EDA.

Species is the strongest predictor. The random forest variable importance output, the decision tree splits, and the EDA all agree on this. Octopus, Wolf, Deer, and Cat are associated with substantially higher popularity scores. This makes sense since species determines a villager’s core visual identity, which is the first thing a player sees.

Aesthetic style is a close second. Villagers with a “Cute” primary style tag are much more likely to be top-tier, both in raw average scores and in model-predicted probabilities. The ACNH community clearly prefers soft, rounded, approachable designs, and this dataset supports that pattern quantitatively.

Color palette reinforces the trend. Lighter colors like Beige and White are associated with higher popularity, while darker or bolder colors like Black and Red are lower. This is consistent with the style finding and suggests the community’s preference is for a coherent soft aesthetic overall.

Personality and hobby matter less, but they still show up. Lazy and Normal personalities trend higher, and Music and Nature hobbies edge out Fitness and Education. These effects are smaller than species and style but appear across models.

A useful next step would be incorporating image-based features extracted directly from villager sprites. A big part of what drives community preference is purely visual in ways that text-based game metadata cannot capture, and closing that gap could improve model performance considerably.


9. Discussion

This project went in a direction that was not fully expected. Going in, personality type seemed like the obvious driver of popularity since it shapes almost all in-game dialogue and interaction. But every model kept pointing back to species and visual aesthetics as the dominant factors, which was one of the more interesting takeaways from the whole project.

The hardest part was handling species as a predictor. With 35 categories, several of which have only one or two villagers, estimates for rarer species are noisy and hard to trust. Grouping species by a broader visual archetype, like fluffy vs. scaly vs. avian, would be worth trying in a future version to get more stable estimates.
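
A simpler, related option is to lump rare species into a single “Other” level before modeling. The sketch below shows that idea in base R. It is not the archetype grouping described above and is not part of the original analysis: the species vector and the cutoff of 3 are made up for illustration.

```r
# Made-up species column, for illustration only
species <- factor(c("Cat", "Cat", "Cat", "Octopus", "Wolf",
                    "Wolf", "Cow", "Cat", "Wolf", "Bull"))

min_count <- 3  # hypothetical cutoff: levels with fewer rows are lumped together

# Find the levels that appear fewer than min_count times
counts <- table(species)
rare_levels <- names(counts[counts < min_count])

# Recode rare levels as "Other" and rebuild the factor
species_lumped <- as.character(species)
species_lumped[species_lumped %in% rare_levels] <- "Other"
species_lumped <- factor(species_lumped)

table(species_lumped)
```

The trade-off is that lumping hides any real signal carried by a rare species, so it is worth comparing model results with and without it.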

More time was spent on the EDA than expected, especially once color and style were added as variables. Those ended up being some of the most informative visuals in the whole report, which was a surprise. The main takeaway from the whole process is that the ACNH community responds strongly to a consistent soft aesthetic, and that signal shows up whether looking at a bar chart or a random forest importance plot.


10. References

  • Villager game data: Animal Crossing: New Horizons Wiki via TidyTuesday / Kaggle
  • Community popularity rankings: ACNH fan tier list community data (acnh_villager_data.csv)
  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning with Applications in R (2nd ed.). Springer.
  • R packages used: tidyverse, ggplot2, caret, randomForest, rpart, rpart.plot