How do we know if an LLM’s answer to a given task is correct? This is fundamentally a verification problem. I think about the LLM verification problem using the following simple taxonomy:
- When verification is subjective.
- When verification is relatively easy / cheap.
- When verification is relatively hard / expensive.
When verification is subjective
Many artistic / creative tasks do not really have an objective verification standard (e.g., is a certain poem / painting beautiful or not). Their evaluation largely relies on social norms or subjective human sentiment. There is nothing wrong with subjective evaluation, but for the sake of tractability, I do not consider this type of task as part of the LLM verification problem.
When verification is relatively easy / cheap
By “relatively easy / cheap”, what I mean is that verification is easier / cheaper than solving the problem itself. Some example tasks in this category include:
- Coding: generating a piece of code to solve a problem is not easy, but once it is generated, it can be verified by compiling and running it. Larger systems can be verified (to an extent) almost automatically with a battery of LLM-generated unit tests (see the sketch after this list).
- Math: having an LLM generate math calculations or proofs is hard and error-prone, but verification is comparatively simpler. Simple calculations can be verified with a calculator tool. Generated proofs can be verified with proof assistants such as Lean.
- Web Search: search is not easy per se, but is a URL generated by the LLM real or hallucinated? Just click it to verify.
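To make the "verification is cheaper than generation" point concrete for coding, here is a minimal sketch of checking an LLM-generated snippet by simply running unit tests against it. This assumes Python and pytest are available; `verify_generated_code`, `candidate`, and `tests` are made-up names for illustration, not any particular framework's API.

```python
import subprocess
import tempfile
import textwrap

def verify_generated_code(generated_code: str, test_code: str) -> bool:
    """Run (hypothetical) LLM-generated code against a unit test file.

    Returns True if the tests pass, i.e., the candidate solution is
    (partially) verified; False otherwise. No human in the loop.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        with open(f"{tmpdir}/solution.py", "w") as f:
            f.write(generated_code)
        with open(f"{tmpdir}/test_solution.py", "w") as f:
            f.write(test_code)
        # Verification = just run the tests.
        result = subprocess.run(
            ["python", "-m", "pytest", "test_solution.py", "-q"],
            cwd=tmpdir,
            capture_output=True,
        )
        return result.returncode == 0

# Toy example: a candidate solution and its tests (both could be LLM-generated).
candidate = textwrap.dedent("""
    def add(a, b):
        return a + b
""")
tests = textwrap.dedent("""
    from solution import add

    def test_add():
        assert add(2, 3) == 5
""")

print(verify_generated_code(candidate, tests))  # True if the tests pass
```

Writing the solution may take many model calls; checking it is one cheap, repeatable command.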
This category of tasks is currently witnessing tremendous progress with agentic AI systems, which can follow a generation <-> verification loop until the system reaches a satisfactory end state. I do not mean to imply that all issues in this category are completely solved, but the fact that verification can be (at least partially) automated and run repeatedly is very helpful.
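As a sketch of what such a loop looks like in the abstract (the `generate` and `verify` callables below are placeholders standing in for an LLM call and an automated checker, not any specific agent framework's API):

```python
from typing import Callable, Optional

def generate_and_verify(
    generate: Callable[[str, Optional[str]], str],  # LLM call: (task, feedback) -> candidate
    verify: Callable[[str], tuple[bool, str]],      # automated checker: candidate -> (ok, feedback)
    task: str,
    max_rounds: int = 5,
) -> Optional[str]:
    """Generic generation <-> verification loop.

    Keep asking the model for a candidate solution and feeding the
    verifier's feedback back in, until verification passes or the
    round budget runs out.
    """
    feedback = None
    for _ in range(max_rounds):
        candidate = generate(task, feedback)
        ok, feedback = verify(candidate)
        if ok:
            return candidate  # verified solution
    return None  # no verified solution within the budget
```

The whole design hinges on `verify` being cheap and automatic; when it is not, the loop loses its advantage, which is exactly the next category.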
When verification is relatively hard / expensive
Again, by “relatively hard / expensive”, I mean that verifying a solution is almost as hard as solving the problem itself. Keep in mind that this does not necessarily mean that verification or solution is hard per se, just that the cost of verification is almost as high as the cost of solution. Some examples include:
- Translation: to really verify if the translation of a paragraph is correct or not, you’d have to translate it.
- Unstructured information retrieval / processing: imagine you have 1 million PDFs with unstructured information stored in them (text, tables, figures, tables that are stored as figures, etc.). The task is to go through each PDF, find a particular table (irregular names, sometimes stored as text but sometimes as images with uneven resolution), and precisely retrieve the number at row 2, column 3. A multi-modal LLM can do this, but to verify each of its answers, you’d have to manually open that PDF, find that table, and identify the target number yourself.
To me, this is the most interesting category of tasks, where the verification problem is largely unsolved. Many existing strategies involve pairing a “solution LLM” with a “verification LLM”, where the latter is trained/tasked to verify the former’s output. However, this is a circular design – if verification is almost as hard as solving, then it’s unclear how much additional credibility a “verification LLM” can provide on top of a “solution LLM”.