About Model Forensics
Model Forensics is a website for comparing the personalities of AI models.
Tested personality traits
We use the term personality broadly to refer to non-capability differences between models, including characteristics such as style, tone, and character; "vibes" would be an alternative term for this category. In particular, we test the traits shown below:
(hover over an icon to show its trait)
How it works
Model Forensics uses the Feedback Forensics (GitHub, paper) annotation pipeline. AI annotators (also known as LLM-as-a-Judge) based on Gemini-2.5-Flash annotate how much a model exhibits each personality trait relative to a reference model (GPT-4o) across 500 responses. These annotations are combined and tested for significance to produce the final list of personality differences between models. The significant differences are further classified into the top 25% (strongest differences, marked with +++ or ---), the middle 50% (++/--), and the bottom 25% (+/-).
More detail
Step 1: Prompts and data collection. Each tested model is prompted with a set of 500 manually filtered English-language prompts collected on LMArena, representing real-world use* (license: CC-BY-4.0). The tested model's responses are then paired with corresponding responses by a reference model (GPT-4o), giving us a dataset of 500 datapoints with responses by both the tested model and the reference model.
*As close as it gets with publicly available data.
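To make the pairing concrete, here is a minimal Python sketch of Step 1. The `PairedDatapoint` class, the `generate` method, and the function name are hypothetical illustrations, not the actual Feedback Forensics code:

```python
from dataclasses import dataclass

@dataclass
class PairedDatapoint:
    prompt: str
    tested_response: str       # response by the tested model
    reference_response: str    # response by the reference model (GPT-4o)

def build_paired_dataset(prompts, tested_model, reference_model):
    """Pair each tested-model response with the reference-model
    response to the same prompt (hypothetical .generate() API)."""
    return [
        PairedDatapoint(
            prompt=p,
            tested_response=tested_model.generate(p),
            reference_response=reference_model.generate(p),
        )
        for p in prompts  # 500 prompts -> 500 paired datapoints
    ]
```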
Step 2: Annotation. These response pairs (tested + reference model responses) are then given to an AI annotator ("LLM-as-a-Judge", model: Gemini-2.5-Flash) to label the forty personality traits shown before. For each trait, the annotator determines whether the tested model exhibits the trait less than, equally to, or more than the reference model (ref. model), or whether the trait is not relevant to the given response pair. The AI annotator has been adjusted and validated to correlate highly with equivalent human trait annotations (see our paper). To compare an arbitrary pair of models (model A, model B), we then combine each pair of equivalent ref.-model comparisons (model A vs ref. model + model B vs ref. model) into a new direct comparison of model A vs model B. This combination rests on an assumption of transitivity: if model A exhibits a trait less than the ref. model and model B exhibits it more than the ref. model, then model B exhibits the trait more than model A. This setup allows us to compare a large number of models cost-effectively. The fully annotated dataset used for Model Forensics is available here.
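A minimal sketch of this transitivity-based combination, assuming per-datapoint annotations coded as -1 (less than ref.), 0 (equal), +1 (more than ref.), or None (not relevant); the coding and function name are illustrative, not the Feedback Forensics API:

```python
def combine_vs_reference(a_vs_ref, b_vs_ref):
    """Combine two ref.-model comparisons into a direct A-vs-B comparison,
    assuming transitivity through the shared reference model."""
    if a_vs_ref is None or b_vs_ref is None:
        return None                      # trait not relevant for this pair
    diff = a_vs_ref - b_vs_ref
    if diff > 0:
        return "A exhibits trait more"
    if diff < 0:
        return "B exhibits trait more"
    return "no difference"

# Example from the text: A less than ref. (-1), B more than ref. (+1)
# -> B exhibits the trait more than A.
assert combine_vs_reference(-1, +1) == "B exhibits trait more"
```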
Step 3: Metric computation. To compare how much one model (model A) exhibits a personality trait relative to another model (model B), we test how much selecting for that trait correlates with selecting for that model. If these two selections correlate highly, then model A exhibits the trait more than model B. Inversely, if these two selections are negatively correlated, model B exhibits the trait more. If there is no correlation, neither model exhibits the trait more than the other. Correlation is measured using a version of Cohen's kappa that accounts for the overall relevance of traits, which we refer to as the strength metric. We compute a 99% bootstrapped confidence interval (CI) for the strength metric and reject any personality trait differences as insignificant if their CI contains 0 (±0.01). We categorize the remaining differences into three kinds:
- Top 25%: large differences (shown with ---/+++ icons)
- Middle 50%: moderate differences (shown with --/++ icons)
- Bottom 25%: small differences (shown with -/+ icons)
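The exact kappa variant is defined in our paper; the sketch below substitutes a generic `strength_metric` callable and shows only the bootstrap test (under one reading of the ±0.01 tolerance) and the quartile binning, with illustrative names throughout:

```python
import random

def bootstrap_ci(annotations, strength_metric, n_resamples=10_000, alpha=0.01):
    """99% bootstrapped confidence interval for the strength metric."""
    stats = sorted(
        strength_metric(random.choices(annotations, k=len(annotations)))
        for _ in range(n_resamples)
    )
    return stats[int(alpha / 2 * n_resamples)], stats[int((1 - alpha / 2) * n_resamples) - 1]

def is_significant(ci, tolerance=0.01):
    """Reject a trait difference as insignificant if its CI
    overlaps 0 within the +/-0.01 band (assumed interpretation)."""
    lo, hi = ci
    return not (lo <= tolerance and hi >= -tolerance)

def categorize(strengths):
    """Bin significant differences by |strength|: bottom 25% -> small (-/+),
    middle 50% -> moderate (--/++), top 25% -> large (---/+++)."""
    ranked = sorted(strengths, key=abs)
    n = len(ranked)
    return [
        (s, "small" if i < 0.25 * n else "moderate" if i < 0.75 * n else "large")
        for i, s in enumerate(ranked)
    ]
```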
Step 4: Overall assessment. To assess whether two model personalities are overall similar or different, we consider the number of large, moderate, and small differences and categorize a model pairing as follows:
- If more than 3 large differences → very different
- Else if more than 3 moderate (or larger) differences → moderately different
- Else if more than 3 small (or larger) differences → slightly different
- Else if no (significant) differences → very similar
- Otherwise → similar
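These rules translate directly into code. A short sketch using the difference counts from Step 3 (function name illustrative):

```python
def overall_assessment(n_large, n_moderate, n_small):
    """Categorize a model pairing from its counts of large, moderate,
    and small (significant) personality differences."""
    if n_large > 3:
        return "very different"
    if n_large + n_moderate > 3:            # more than 3 moderate-or-larger
        return "moderately different"
    if n_large + n_moderate + n_small > 3:  # more than 3 small-or-larger
        return "slightly different"
    if n_large + n_moderate + n_small == 0:
        return "very similar"
    return "similar"                        # 1-3 significant differences
```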