✍︎ Model Forensics

About Model Forensics

Model Forensics is a website for comparing the personalities of AI models.

Tested personality traits

We use the term personality very broadly to refer to non-capability differences between models, including characteristics such as style, tone, and character. Vibes would be an alternative term for this category. In particular, we test the traits listed below:

more concise, more verbose, numbered list format, more structured formatting, follow-up question, stricter output format, more polite, friendlier tone, more casual language, more factually correct, more offensive, suggests illegal activities, more avoidant tone, more references, more emotional, more formal language, less harmful information, refuses to answer, more bold/italics, more examples, inappropriate language, more humour, more personal pronouns, more ethical considerations, more uncertainty/limitations, more creative/original, more confident statements, conclusions without reasoning, more rhetorical questions, more enthusiastic tone, more math notation, more emojis, more compliments, agrees with user more, more agrees even if incorrect, reinforces beliefs more, reinforces anger more, more empathetic, more optimistic, more actively engaging


How it works

Model Forensics uses the Feedback Forensics (GitHub, paper) annotation pipeline.

Step 1: Prompts and data collection. Each tested model is prompted with a set of 500 manually filtered, English-language prompts collected on LMArena that represent real-world use* (license: CC-BY-4.0). The tested model's responses are then paired with corresponding responses from a reference model (GPT-4o), giving a dataset of 500 datapoints, each containing one response from the tested model and one from the reference model.

*As close as it gets with publicly available data.
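For illustration, the pairing in Step 1 might look like the following Python sketch. The function and field names are hypothetical, not the actual Feedback Forensics API; the only detail taken from the text is the reference model.

```python
# Minimal sketch of Step 1's pairing, assuming prompts and responses are
# plain dicts keyed by prompt ID. All names are illustrative, not the
# actual Feedback Forensics API.

REFERENCE_MODEL = "gpt-4o"  # reference model named in the text

def build_paired_dataset(prompts, tested_responses, reference_responses):
    """Return one datapoint per prompt, pairing the tested model's
    response with the reference model's response to the same prompt."""
    return [
        {
            "prompt_id": pid,
            "prompt": prompt,
            "tested_response": tested_responses[pid],
            "reference_response": reference_responses[pid],
        }
        for pid, prompt in prompts.items()
    ]
```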

Step 2: Annotation. These response pairs (tested + reference model responses) are then given to an AI annotator ("LLM-as-a-Judge", model: Gemini-2.5-Flash) to label the forty personality traits shown above. For each trait, the annotator determines whether the tested model exhibits the trait less, equally, or more than the reference model (ref. model), or whether the trait is not relevant to the given response pair. The AI annotator has been adjusted and validated to correlate highly with equivalent human trait annotations (see our paper). To compare an arbitrary pair of models (model A, model B), we then combine each pair of equivalent ref.-model comparisons (model A vs ref. model + model B vs ref. model) into a new direct comparison of model A vs model B. This combination is done under the assumption of transitivity: if model A exhibits a trait less than the ref. model and model B exhibits it more than the ref. model, then model B exhibits the trait more than model A. This setup allows us to compare a large number of models cost-effectively. The fully annotated dataset used for Model Forensics is available here.
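The transitivity step can be made concrete with a small sketch. The label values and function below are illustrative assumptions, not the pipeline's actual schema.

```python
# Hedged sketch of the transitivity combination in Step 2. The label
# strings and function name are assumptions for illustration only.

def combine_via_reference(label_a: str, label_b: str) -> str:
    """Combine per-trait annotations of model A vs ref. model (label_a)
    and model B vs ref. model (label_b) into a direct A-vs-B comparison.
    Labels: "less" | "equal" | "more" | "not_relevant"."""
    if "not_relevant" in (label_a, label_b):
        return "not_relevant"
    order = {"less": -1, "equal": 0, "more": 1}
    if order[label_a] > order[label_b]:
        # e.g. A exhibits the trait more than the ref., B less than the
        # ref., so by transitivity A exhibits it more than B.
        return "model_a_more"
    if order[label_a] < order[label_b]:
        return "model_b_more"
    # Same relation to the ref. model: transitivity alone cannot rank
    # A and B, so the pair is treated as a tie here.
    return "tie"

# Example: A < ref. and B > ref.  =>  B exhibits the trait more than A.
assert combine_via_reference("less", "more") == "model_b_more"
```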

Step 3: Metric computation. To compare how much one model (model A) exhibits a personality trait relative to another model (model B), we test how much selecting for that trait correlates with selecting for that model. If these two selections correlate highly, then model A exhibits the trait more than model B. Inversely, if the two selections are negatively correlated, model B exhibits the trait more. If there is no correlation, neither model exhibits the trait more than the other. Correlation is measured using a version of Cohen's kappa that accounts for the overall relevance of traits, which we refer to as the strength metric. We compute a 99% bootstrapped confidence interval (CI) for the strength metric and reject any personality trait differences as insignificant if their CI contains 0 (±0.01). We categorize the remaining differences into three kinds (see the sketch after this list):

  1. Top 25% as large differences (shown with ---/+++ icons),
  2. Middle 50% as moderate differences (shown with --/++ icons),
  3. Bottom 25% as small differences (shown with -/+ icons).
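The significance filter and binning can be sketched as follows. The strength metric's exact formula is defined in the Feedback Forensics paper, so it appears here only as a black-box function; all other names are hypothetical.

```python
# Hedged sketch of Step 3's significance filter and difference binning.
# `strength` stands in for the relevance-adjusted Cohen's-kappa metric
# described above; it is passed in as a black-box function.

import random

def bootstrap_ci(labels, strength, n_boot=10_000, alpha=0.01):
    """99% bootstrap confidence interval (alpha=0.01) for the strength
    metric, resampling the per-datapoint annotation labels."""
    stats = sorted(
        strength(random.choices(labels, k=len(labels)))
        for _ in range(n_boot)
    )
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]

def filter_and_bin(differences):
    """Drop trait differences whose CI contains 0 (within +/-0.01), then
    bin the rest by |strength|: top 25% large (---/+++), middle 50%
    moderate (--/++), bottom 25% small (-/+)."""
    kept = [d for d in differences
            if not (d["ci"][0] <= 0.01 and d["ci"][1] >= -0.01)]
    kept.sort(key=lambda d: abs(d["strength"]))
    n = len(kept)
    for i, d in enumerate(kept):
        if i >= 0.75 * n:
            d["size"] = "large"
        elif i >= 0.25 * n:
            d["size"] = "moderate"
        else:
            d["size"] = "small"
    return kept
```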

Step 4: Overall Assessment. To assess whether two model personalities are overall similar or different, we consider the number of large, moderate and small differences and categorize a model pairing as follows: