Signal

[P] I built an open-source benchmark to test whether open-source LLMs are actually as confident as they claim to be (spoiler: they often aren't)

Hey everyone. When building systems around modern open-source LLMs, one of the biggest issues is that they can confidently hallucinate, stating an incorrect answer with 95%+ confidence.
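One common way to quantify this gap between stated confidence and actual correctness is expected calibration error (ECE). Below is a minimal sketch (not the benchmark's actual code, and the data is hypothetical) showing how a model that claims ~95% confidence but answers correctly only half the time yields a large calibration error:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average gap between stated confidence and accuracy,
    computed over equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # assign to a confidence bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)   # mean stated confidence in bin
        accuracy = sum(ok for _, ok in b) / len(b)  # fraction actually correct in bin
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece

# Hypothetical answers: the model states ~95% confidence but is right only 3/6 times.
confs = [0.95, 0.96, 0.94, 0.95, 0.97, 0.93]
right = [1, 0, 1, 0, 0, 1]
print(round(expected_calibration_error(confs, right), 3))  # → 0.45
```

An ECE near 0 would mean the model's stated probabilities match its accuracy; here the 0.45 gap reflects exactly the overconfidence described above.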

(via Reddit)