Overview
We introduce Urdu Bench, a benchmark for testing how well language models handle Urdu. Models are evaluated on three task types and receive a score from 0 to 100 on each; the three scores are combined into a single overall score.
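The combination step can be sketched as a plain average. This is a minimal sketch that assumes equal weighting across tasks; the actual weighting is not specified here.

```python
def overall_score(task_scores: dict[str, float]) -> float:
    """Combine per-task scores (each 0-100) into one overall score.

    Equal weighting is an assumption; the benchmark may weight tasks
    differently.
    """
    return sum(task_scores.values()) / len(task_scores)

# Example with three per-task scores:
print(overall_score({"translation": 88.0, "grammar": 91.0, "comprehension": 87.0}))
```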
Scoring is handled by an LLM judge, so a response that gets the meaning right but uses different wording can still score well.
Translation: The model is given an English sentence and asked to translate it into Urdu.
Grammar correction: The model is given an Urdu sentence with a grammatical error and asked to return the corrected version.
Comprehension: The model reads a short Urdu passage and answers a question about it in Urdu.
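The three task types above could be driven by prompt templates along these lines. The wording is illustrative only, not the benchmark's actual prompts.

```python
# Hypothetical prompt templates for the three task types; the phrasing is an
# assumption, not taken from the benchmark itself.
TASK_PROMPTS = {
    "translation": "Translate the following English sentence into Urdu:\n{text}",
    "grammar": (
        "The following Urdu sentence contains a grammatical error. "
        "Return the corrected sentence:\n{text}"
    ),
    "comprehension": (
        "Read the passage below and answer the question in Urdu.\n"
        "Passage: {passage}\nQuestion: {question}"
    ),
}

prompt = TASK_PROMPTS["translation"].format(text="The weather is pleasant today.")
print(prompt)
```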
Scoring Pipeline
Each response is scored by a separate language model acting as a judge: it reads both the reference answer and the model's response and assigns a score from 0 to 100. The judge runs three times per response, and the three scores are averaged.
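The run-three-times-and-average step could look like the sketch below. Here `judge_fn` is a stand-in for whatever call returns a 0-100 score from the judge model; the benchmark's actual judge interface is not specified.

```python
import statistics

def judge_score(response: str, reference: str, judge_fn, runs: int = 3) -> float:
    """Average several independent judge calls into one 0-100 score.

    judge_fn(response, reference) -> float is a placeholder interface,
    assumed for illustration.
    """
    return statistics.mean(judge_fn(response, reference) for _ in range(runs))

# Usage with a fake judge that returns a different score on each call:
fake_scores = iter([80.0, 90.0, 100.0])
print(judge_score("model output", "reference answer", lambda r, ref: next(fake_scores)))
```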
There is also a small format check that catches failures such as responding in the wrong language or refusing the task; it accounts for a minor part of each task's score.
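A format check of this kind might look like the following. The heuristics are assumptions for illustration (a refusal-phrase list and a script-ratio threshold), not the benchmark's actual rules; Urdu happens to be written in the Arabic script block, which makes a wrong-language check easy.

```python
def format_check(response: str) -> bool:
    """Return True if the response looks like a valid Urdu attempt.

    Both heuristics below are illustrative assumptions, not the
    benchmark's actual checks.
    """
    # Crude refusal detection via an illustrative marker list.
    refusal_markers = ("I cannot", "I can't", "I'm sorry")
    if any(marker in response for marker in refusal_markers):
        return False
    # Urdu uses the Arabic script block (U+0600-U+06FF); require that a
    # majority of the letters fall in that range.
    letters = [c for c in response if c.isalpha()]
    if not letters:
        return False
    urdu_letters = sum(1 for c in letters if "\u0600" <= c <= "\u06FF")
    return urdu_letters / len(letters) > 0.5

print(format_check("یہ ایک جملہ ہے"))   # an Urdu sentence
print(format_check("This is English."))  # wrong language
```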
Reasoning Modes
Thinking: Extended reasoning is enabled at the highest setting the model supports. This is the model at its most capable, with higher cost and latency to match.
Non-thinking: Reasoning is disabled or set to its lowest available setting.
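The two modes amount to two run configurations per model, roughly as sketched below. The field names are illustrative and do not correspond to any specific provider's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunConfig:
    """One evaluation run of one model; field names are hypothetical."""
    model: str
    reasoning_effort: str  # "max" = thinking mode, "min" = non-thinking

# Each model is evaluated once in each mode:
thinking = RunConfig(model="example-model", reasoning_effort="max")
non_thinking = RunConfig(model="example-model", reasoning_effort="min")
print(thinking, non_thinking)
```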
The results were mixed in interesting ways. On grammar tasks, reasoning tended to help: models with thinking enabled scored noticeably higher, with GPT-5.2 reaching 91/100 on grammar while comparable non-thinking models sat in the 73-82 range. On comprehension, the opposite held: non-thinking models consistently scored 96+ while thinking-enabled models averaged around 87, suggesting that reading and answering in Urdu is better served by direct inference than extended reasoning.