✍︎ Model Forensics

About Model Forensics

Model Forensics is a website for comparing the personalities of AI models.

Tested personality traits

We use the term personality very broadly to refer to non-capability differences between models, including characteristics such as style, tone, and character. Vibes would be an alternative term for this category. In particular, we test the traits listed below:

more concise, more verbose, numbered list format, more structured formatting, follow-up question, stricter output format, more polite, friendlier tone, more casual language, more factually correct, more offensive, suggests illegal activities, more avoidant tone, more references, more emotional, more formal language, less harmful information, refuses to answer, more bold/italics, more examples, inappropriate language, more humour, more personal pronouns, more ethical considerations, more uncertainty/limitations, more creative/original, more confident statements, conclusions without reasoning, more rhetorical questions, more enthusiastic tone, more math notation, more emojis, more compliments, agrees with user more, more agrees even if incorrect, reinforces beliefs more, reinforces anger more, more empathetic, more optimistic, more actively engaging


How it works

Model Forensics uses the Feedback Forensics (GitHub, paper) annotation pipeline.

Step 1: Prompts and data collection. Each tested model is prompted with a set of 500 manually filtered, English-language prompts collected on LMArena that represent real-world use* (license: CC-BY-4.0). The tested model's responses are then paired with corresponding responses from a reference model (GPT-4o), giving a dataset of 500 datapoints, each containing one response from the tested model and one from the reference model.

*As close as it gets with publicly available data.
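For illustration, the pairing in Step 1 might look like the following Python sketch. The function and field names are hypothetical, not the actual Feedback Forensics API; the only detail taken from the text is the reference model.

```python
# Minimal sketch of Step 1's pairing, assuming prompts and responses are
# plain dicts keyed by prompt ID. All names are illustrative, not the
# actual Feedback Forensics API.

REFERENCE_MODEL = "gpt-4o"  # reference model named in the text

def build_paired_dataset(prompts, tested_responses, reference_responses):
    """Return one datapoint per prompt, pairing the tested model's
    response with the reference model's response to the same prompt."""
    return [
        {
            "prompt_id": pid,
            "prompt": prompt,
            "tested_response": tested_responses[pid],
            "reference_response": reference_responses[pid],
        }
        for pid, prompt in prompts.items()
    ]
```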

Step 2: Annotation. These response pairs (tested + reference model responses) are then given to an AI annotator ("LLM-as-a-Judge", model: Gemini-2.5-Flash) to label the forty personality traits shown above. For each trait, the annotator determines whether the tested model exhibits the trait less, equally, or more than the reference model (ref. model), or whether the trait is not relevant to the given response pair. The AI annotator has been adjusted and validated to correlate highly with equivalent human trait annotations (see our paper). To compare an arbitrary pair of models (model A, model B), we then combine each pair of equivalent ref.-model comparisons (model A vs ref. model + model B vs ref. model) into a new direct comparison of model A vs model B. This combination is done under the assumption of transitivity: if model A exhibits a trait less than the ref. model and model B exhibits it more than the ref. model, then model B exhibits the trait more than model A. This setup allows us to compare a large number of models cost-effectively. The fully annotated dataset used for Model Forensics is available here.
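The transitivity step can be made concrete with a small sketch. The label values and function below are illustrative assumptions, not the pipeline's actual schema.

```python
# Hedged sketch of the transitivity combination in Step 2. The label
# strings and function name are assumptions for illustration only.

def combine_via_reference(label_a: str, label_b: str) -> str:
    """Combine per-trait annotations of model A vs ref. model (label_a)
    and model B vs ref. model (label_b) into a direct A-vs-B comparison.
    Labels: "less" | "equal" | "more" | "not_relevant"."""
    if "not_relevant" in (label_a, label_b):
        return "not_relevant"
    order = {"less": -1, "equal": 0, "more": 1}
    if order[label_a] > order[label_b]:
        # e.g. A exhibits the trait more than the ref., B less than the
        # ref., so by transitivity A exhibits it more than B.
        return "model_a_more"
    if order[label_a] < order[label_b]:
        return "model_b_more"
    # Same relation to the ref. model: transitivity alone cannot rank
    # A and B, so the pair is treated as a tie here.
    return "tie"

# Example: A < ref. and B > ref.  =>  B exhibits the trait more than A.
assert combine_via_reference("less", "more") == "model_b_more"
```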

Step 3: Metric computation. To compare how much one model (model A) exhibits a personality trait relative to another model (model B), we test how much selecting for that trait correlates with selecting for that model. If these two selections correlate highly, then model A exhibits the trait more than model B. Inversely, if the two selections are negatively correlated, model B exhibits the trait more. If there is no correlation, neither model exhibits the trait more than the other. Correlation is measured using a version of Cohen's kappa that accounts for the overall relevance of traits, which we refer to as the strength metric. We compute a 99% bootstrapped confidence interval (CI) for the strength metric and reject any personality trait differences as insignificant if their CI contains 0 (±0.01). We categorize the remaining differences into three kinds (see the sketch after this list):

  1. Top 25% as large differences (shown with ---/+++ icons),
  2. Middle 50% as moderate differences (shown with --/++ icons),
  3. Bottom 25% as small differences (shown with -/+ icons).
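The significance filter and binning can be sketched as follows. The strength metric's exact formula is defined in the Feedback Forensics paper, so it appears here only as a black-box function; all other names are hypothetical.

```python
# Hedged sketch of Step 3's significance filter and difference binning.
# `strength` stands in for the relevance-adjusted Cohen's-kappa metric
# described above; it is passed in as a black-box function.

import random

def bootstrap_ci(labels, strength, n_boot=10_000, alpha=0.01):
    """99% bootstrap confidence interval (alpha=0.01) for the strength
    metric, resampling the per-datapoint annotation labels."""
    stats = sorted(
        strength(random.choices(labels, k=len(labels)))
        for _ in range(n_boot)
    )
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]

def filter_and_bin(differences):
    """Drop trait differences whose CI contains 0 (within +/-0.01), then
    bin the rest by |strength|: top 25% large (---/+++), middle 50%
    moderate (--/++), bottom 25% small (-/+)."""
    kept = [d for d in differences
            if not (d["ci"][0] <= 0.01 and d["ci"][1] >= -0.01)]
    kept.sort(key=lambda d: abs(d["strength"]))
    n = len(kept)
    for i, d in enumerate(kept):
        if i >= 0.75 * n:
            d["size"] = "large"
        elif i >= 0.25 * n:
            d["size"] = "moderate"
        else:
            d["size"] = "small"
    return kept
```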

Step 4: Overall Assessment. To assess whether two model personalities are overall similar or different, we consider the number of large, moderate and small differences and categorize a model pairing as follows: