When models commonly achieve 97% on a task, it means it's time to define a harder task, as it's long stopped providing any useful signal.
When models commonly achieve 97% on a task, it means it's time to define a harder task, as it's long stopped providing any useful signal.