Can AI and People Play Nice?

THE MAY 2019 SHOWDOWN between people and artificial intelligence (AI) at the University of Maryland wasn’t exactly a “Terminator”-style last stand for humanity—just a quiz game. Yet it represented, in a sense, the final days of an era when flesh-and-blood competitors could beat silicon-based ones built to relentlessly pursue trivia wins.

That day, humankind was represented by four visiting “Jeopardy!” champions. As the game unfolded, the computer buzzed in at least as quickly and accurately as the humans with answers about straightforward facts, like the birth date of a pop star or the melting point of an element. But other questions required creative logic to link, say, cryptic clues about gender norms to others about religion.

“What I call squishy questions, the kind that require lateral thinking, tripped up the system sometimes,” says Roger Craig, a “Jeopardy!” Tournament of Champions winner who held the record for one-day winnings ($77,000) for nearly a decade. Thanks to that limitation, he and past winners Aaron Lichtig, Kristin Sausville and Monica Thieu prevailed 190-160. “It wasn’t a nail-biter, but not a blowout either. If we’d played multiple times, I think we would have won maybe 60% of the time.”

But the balance of power in the game they were playing—quizbowl—has shifted. AI technology has evolved by leaps and bounds since that match; the advent of powerful and uncanny large language models behind popular chatbots and image generators is the prime example. People, though, have pretty much stayed the same.

“When it comes to raw accuracy, the computer is now better at answering questions than humans, and I mean humans who are good at this,” says Jordan Boyd-Graber, a UMD computer science professor who hosted the 2019 contest and has been developing the system used in the match, QANTA (Question Answering is Not a Trivial Activity), for nearly 15 years.

As one rough measure of the current state of AI, the system answered a recent set of quizbowl questions he and his research team used in a study with 70% accuracy, while humans managed about 55%. Even on “adversarial” questions crafted by AI experts to exploit its weaknesses and give the advantage to humans, computers are catching up.

Although he built a system that can reliably beat the smartest people on earth, Boyd-Graber’s goal isn’t AI dominance, or to fuel predictions of AI-induced utopias or apocalyptic visions of humanity rendered obsolete.

Now, instead of a face-off, Boyd-Graber is having the two sides team up in a new phase of research. He’s trying to understand what people are good at, what AI is good at, and how this meeting of the minds can bring about benefits for all of society while avoiding, he jokes, our “enslavement by robots.”

“I want people to be able to live happily and productively alongside AI,” he says, “and to do that we need ways to measure and understand human-computer collaboration.”

What he and his students have found so far is that despite its capability, AI is far too confident in its authoritative pronouncements, while people are unsure when to trust AI suggestions. Should they stand aside for a superior intelligence, or tell it to shut up and play quizbowl the old-fashioned way? It’s a question with broader implications for how we live our lives and do our work in the future. Now he’s looking for ways to help AI earn badly needed trust, but also to ensure that humans can always run the show when the time comes to deliver the final answer.

QUIZBOWL HAS BEEN CENTRAL to Boyd-Graber’s life for 20 years, and he has the wife to prove it.

He competed in high school at a science-focused boarding school in Arkansas, during his undergraduate years at Caltech (where he met the fellow quizbowler he would marry) and grad school at Princeton. Like most who play the game, he’d amassed broad knowledge along with specialties in history and science—he was also the guy to call on for German literature and light opera questions—helping each of his collegiate teams place fourth at national championships.

He’d just become an assistant professor at UMD (where he’s long served as the quizbowl club’s faculty adviser) when AI struck its first major blow in the trivia war. IBM’s Watson, which could respond with encyclopedic knowledge to queries posed in natural, human language rather than in formal query language, beat “Jeopardy!” GOAT and current show host Ken Jennings in a heavily hyped 2011 match.

There’s nothing unusual about computerized tech besting humans at cognitive tasks. Few would try to race a calculator in basic number crunching, while in the world of games, IBM’s Deep Blue supercomputer beat chess world champ Garry Kasparov in 1997. This inaugurated the era of “centaur” chess, in which the best teams were human-AI hybrids (similar to Boyd-Graber’s collaborative quizbowl teams), a situation that persisted until the 2010s, when AI alone could finally beat the best hybrid teams. But math calculations and coldly strategic games like chess bear little resemblance to quizbowl’s complex language play, which seems closer to the ambiguities of real life.

As a specialist in natural language processing, which focuses on computers’ ability to use human language, Boyd-Graber was captivated—and perhaps a bit aggravated—by Watson’s win. “I was jealous about not being involved,” he says. “The two things I did were now coming together in front of me.”

He set to work on QANTA, publishing his first paper on a trivia answer system, titled “Besting the Quizmaster,” in 2012. By 2015, working with then-UMD student, now Associate Professor Mohit Iyyer Ph.D. ’17 and others, Boyd-Graber had the system running well enough to defeat Jennings in a solo demonstration match at the University of Washington.

During that faceoff, Jennings said that part of Watson’s winning ability was its superhuman speed hitting the buzzer. But QANTA, he said, wasn’t just buzzing in faster, it was thinking through answers at lightning speed: “It’s legit faster than me, as well as knowing more stuff,” Jennings says in a video of the contest. (Boyd-Graber himself appeared on “Jeopardy!” in 2018, where he came in second with $8,200.)

(Photo by Dylan Singleton)

Watson, a room-size, heavily engineered crown jewel of one of the world’s most powerful corporations, represents the traditional, statistical approach to AI that stretches back to the mid-20th century. QANTA, by comparison, has modest hardware demands and springs from a newer type of AI: neural representations. It teaches itself from large datasets, such as the complete text of Wikipedia, and uses computational models that emulate the workings of the human brain.

That difference in cost and scalability of the two approaches is why Watson was retired from the trivia world, while QANTA kept getting better at answering hard questions until it rendered humans passé as trivia competitors.

Not that this AI-induced obsolescence matters in real life, where millions continue to tune into “Jeopardy!” each weeknight without a QANTA or a Watson in sight. “People want to watch other people competing,” says former champ Craig, himself a computer scientist. “No one is going to care about computers playing a game.”

Boyd-Graber, shown with late “Jeopardy!” host Alex Trebek, competed on the show in 2018. (Photo courtesy of NBC)

Computer science Professor Jordan Boyd-Graber (second from left) and Ph.D. students (from left) Yoo Yeon Sung, Feng Gu and Ishani Mondal work with the QANTA trivia game-playing system. (Photo by Mike Morgan)

IF HUMANITY IS HAVING a moment of self-doubt in the face of AI, the technology itself appears to be overflowing with self-assurance.

By now, we’re all familiar with the crisply stated, often unequivocal answers provided by AI chatbots and even web browsers whether we want them or not. Just one problem: The AI is often wrong. One 2025 study found AI summaries provide misinformation about recent news more often than not. A key difference between premium and free chatbots was the former’s higher level of confidence in their errors.

How AI systems come up with answers—and botch them—has been a central point of Boyd-Graber’s and colleagues’ question-answering research. An early collaboration with computer science Professor Hal Daumé began with a shared interest in AI systems able to reason in real time, exercising “incremental processing,” as Daumé called it.

“This is what humans do; we don’t wait until the end of a sentence to start thinking about what’s in that sentence,” says Daumé, director of the Artificial Intelligence Interdisciplinary Institute at Maryland, which coordinates AI research and education efforts across the university. It infuses UMD’s widely acknowledged strengths in many areas of AI with a particular dedication to leading its ethical and responsible use in society.

The question-answering studies put Boyd-Graber at the forefront of successive “next big things” in AI and computer science time and again, from neural networks to large language models themselves, Daumé says. “His biggest academic flaw is that he’s not good at bragging about himself ... (but) his research has been fundamental in a series of things that really resulted in the AI we have today.”

To test computers’ ability to incrementally process, the collaborators chose two applications: simultaneous language interpretation and quizbowl. Daumé had never played, but the game struck him as an ideal test bed: “If you don’t keep up with the question being asked and wait until the end to try and answer, you will definitely lose.”

Unlike “Jeopardy!,” where competitors can’t hit the buzzer until the announcer has read the entire clue, players in quizbowl can buzz in at any point on “toss-up” questions and offer an answer as quickly as they can blurt it out. The questions are also longer and more complex than on the TV game show; for instance, a question with the answer “King Lear” might start with a hard clue mentioning only a minor character, progressing in several steps to the simple, “Name this Shakespeare play about an aging king.” An inability to break up problems and think on its feet is why Watson, which can only consider a question as a complete unit, would never be able to compete at quizbowl.

QANTA, an AI system that plays trivia games, surges 290 points ahead (475 to 185) in a demonstration match against Boyd-Graber’s former students: Jo Shoemaker Ph.D. ’20, Dennis Peskov Ph.D. ’18 and Michelle Yuan Ph.D. ’22. (Photo by John T. Consoli)

BOYD-GRABER’S AREA OF STUDY has become more mainstream in recent years, but he was a pioneer at the start, says one of his research collaborators. “One unique thing is that he’s not just interested in accuracy, but other aspects too, like time—how quickly the system can answer,” says Sewon Min, a University of California, Berkeley computer science assistant professor who has worked with Boyd-Graber on natural language question answering. “That’s how trivia games work, but that’s also how actual human communication works.”

The opportunity on toss-up questions to buzz as soon as possible with only partial clues makes quizbowl not just a contest of knowledge, but of self-awareness and confidence as well: How much do you trust yourself?
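The buzzing decision described here can be sketched as a simple threshold rule. This is a minimal illustration and not QANTA's actual method: the function name `first_buzz`, the 0.8 threshold, and the per-clue guesses and confidence values are all hypothetical, invented for the example.

```python
from typing import List, Optional, Tuple

# A running guess after each clue is revealed: (answer, confidence).
# All values below are invented for illustration, not real system output.
Guess = Tuple[str, float]


def first_buzz(guesses: List[Guess], threshold: float = 0.8) -> Optional[Tuple[int, str]]:
    """Return (clue_index, answer) at the first clue where the stated
    confidence reaches the threshold, or None if the player never buzzes."""
    for index, (answer, confidence) in enumerate(guesses):
        if confidence >= threshold:
            return index, answer
    return None


king_lear_question = [
    ("Hamlet", 0.20),     # obscure opening clue: a weak, wrong guess
    ("King Lear", 0.55),  # more context, still uncertain
    ("King Lear", 0.85),  # confident enough to buzz at the default threshold
    ("King Lear", 0.99),  # giveaway clue
]

print(first_buzz(king_lear_question))       # cautious player buzzes at clue index 2
print(first_buzz(king_lear_question, 0.5))  # bolder player buzzes at clue index 1
```

Lowering the threshold buys an earlier buzz at the cost of more wrong answers. A player whose stated confidence runs higher than its real accuracy will cross any threshold too soon.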

A 2025 study led by Boyd-Graber and presented at the conference of the Association for Computational Linguistics found that the computer competitors trusted their knowledge too much; the AI system provided inflated estimates of its likelihood of being correct on the answers it provided in a game played alongside human teammates (who were better able to gauge whether the answer they’d provided was right).

The computer was still more likely than people to nail the answer, Boyd-Graber says. But weight the raw results by the players’ knowledge of whether they were correct, and the picture changes. For example, if humans say they are 25% sure and AI says it is 75% sure that the capital of Persia under the ruler who defeated the Median Empire was Persepolis, the humans effectively come out on top, even though both are wrong.
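One common way to make that comparison concrete is the Brier score, which penalizes the squared gap between a stated confidence and the actual outcome (lower is better). The article does not say which metric the study used, so treat this as an assumed stand-in; the function name `brier_penalty` is hypothetical, and the 25% and 75% figures come from the Persepolis example.

```python
def brier_penalty(confidence: float, correct: bool) -> float:
    """Squared error between stated confidence and the outcome
    (outcome is 1.0 for a right answer, 0.0 for a wrong one)."""
    outcome = 1.0 if correct else 0.0
    return (confidence - outcome) ** 2


# Both players answer "Persepolis" and both are wrong.
human = brier_penalty(0.25, correct=False)  # humans were 25% sure
ai = brier_penalty(0.75, correct=False)     # AI was 75% sure

print(human, ai)  # 0.0625 0.5625
```

Under this assumed scoring, the AI's penalty is nine times the humans' for the same wrong answer, because it claimed far more certainty than it had.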

But wait: If AI gets the correct answer more often than humans overall, isn’t AI simply better, no matter how you slice it?

No, because a confidently delivered wrong answer is far worse than no answer, says Yoo Yeon Sung Ph.D. ’25, first author of the paper on the study. This is particularly true given the range of sensitive societal functions AI is being introduced to, from supporting military decision-making to sifting through mortgage applications.

Sung, whose doctorate is in information science, now works as an applied scientist at Hippocratic AI, a company that creates conversational AI agents to help patients with health care decisions; its motto of “Do no harm” acknowledges the damage that bad AI, like bad medicine, can do in the world.

“Health decisions are one of the most crucial things people think about using AI, and some of the decision-making can be extremely delicate,” she says. “People have to be able to trust these systems are providing something that is useful and accurate.”

All the upsides of AI—from saving time on tedious tasks to crunching complicated data to potentially solving major societal problems—can benefit us only if people trust the technology can live up to its own confident claims, Daumé says.

“This is really important with ‘agentic AI,’ where it’s actually acting as your agent out in the world,” he says. “If an AI is doing my travel booking for me, I need to be able to trust when it tells me it paid a certain amount, it didn’t hallucinate an extra zero onto the payment—otherwise, I’m not using it.”

IN THEIR MOST RECENT research, Boyd-Graber and his team turned their attention more toward the people teamed up with AI than the technology. What they found, essentially, was that human competitors weren’t sure how to react to talented but cocky AI teammates.

In these quizbowl games, humans and computers alike can buzz in for toss-up questions (although if an AI teammate is really hurting the score, its human teammates can “mute” it). Per standard quizbowl rules, the team that wins the toss-up then has a chance to add to the score in a bonus round in which teammates confer before they answer.

In Boyd-Graber’s games, the rules specify that human and AI players come up with bonus answers separately, then compare notes, with the human players ultimately determining the answer. In many cases, computers and humans agreed on the right answer; it was when they disagreed that things got interesting.

“We saw in the previous paper that computers aren’t very good at knowing what they know, and in our new research, it shows people don’t always know whether to trust them or trust themselves,” says Gor, first author of a forthcoming paper in Findings of the Association for Computational Linguistics.

One noteworthy type of disagreement occurred when human players were wrong, the AI player was right, and the people stuck with their wrong answer not because they were simply stubborn, but because the computer failed to provide a logical case for its answer.

The worst-case scenario, says Boyd-Graber, was when the humans had the answer, the computer was mired in error, and people still conceded to their electronic teammate. The group’s paper on the phenomenon, “AI, Take the Wheel,” found that people believe in themselves less than they should, while “AI is banking on trust that it hasn’t earned,” he says.

Like a quizbowl player jumping to answer a still-ambiguous question, humanity is giving up more and more decision-making to AI systems. Boyd-Graber hopes that what he ironically calls a “silly game” can help build the foundation needed for this partnership to pay off: an understanding of the gaps between voluminous facts and knowledge, and between confidence and competence.

Anatomy of a Quizbowl Question

Competition questions start with more obscure clues and proceed to ones that provide more obvious (although not necessarily easy) context, as in this practice question from National Academic Quiz Tournaments.

A book by this economist describes an experiment in which it took him nearly a year to license a small garment workshop. Another book on informal economies by this man argues that the assets of the poor are “dead capital.” He proposed formalizing property rights to combat terrorism in the 1986 book The Other Path, whose title plays on the name of a militant group in his country. “The Mystery of Capital” is a book by—for 10 points—what Peruvian economist who shares his name with a conquistador?

Answer

Hernando de Soto

In his classes, however, Boyd-Graber often assigns students to write “adversarial” questions that take advantage of AI’s weaknesses. These can range from current events not yet included in the system’s training data (for example, the title of a new book that came out last week) to more complex approaches, like questions that require computers to use common sense to fill in unspoken parts of questions and logic to piece separate parts of a question together correctly.
