
Weekend: Anthropic’s Secret Trick for Measuring Claude

The Weekend
Mar 1, 2025
Welcome, Weekenders! In this newsletter:
The Big Read: Why Runway thinks it can outrun OpenAI and Google
Artificial Intelligence: AI.com is up for sale. Asking price? $100 million
The Top 5: The very best sleep and health tech for kids 
Plus, our Recommendations: A father confesses his darkest secrets; how a financial scheme fueled a murder land; and Netflix's slam-dunk documentary
 
On a bit of a whim last year, an Anthropic staffer, David Hershey, sought out a different method for tracking the development of the startup's Claude chatbot. (After all, even the nerdiest AI nerds get tired of staring at the same old benchmark material.) He wanted something that would let him see how Claude performed on a long-term solo project, so he decided to have it play the original Pokémon game released by Nintendo in 1996.
When Hershey set up the older 3.0 Sonnet version of Claude to play, the bot couldn't even get the game going. The next iteration of Claude, 3.5 Sonnet, did a little better. "There were glimmers of hope," Hershey said. And now 3.7 Sonnet, the hybrid reasoning model version of Claude released this week, which can think more about what it's doing, can go much further into the game. Discussion of Hershey's impromptu Pokémon tests with Claude became a viral pastime within Anthropic, and to let more people in on the fun, the startup has decided to set up a public Twitch stream of 3.7 Sonnet playing Pokémon. (When I last checked, Claude was at Mount Moon preparing to square off against a Team Rocket henchman just as a wild Level 10 Paras appeared, forcing Claude to battle it out with a Level 14 Spearow.) 
"This is just a much more visceral way of seeing general improvement on intelligence," Hershey said. 
Monitoring Claude's Pokémon games has allowed Hershey and other researchers at Anthropic to hone their thinking around the development of the startup's agentic technology, a buzzy artificial intelligence subcategory focused on developing AI that can complete tasks by itself. Sure, Anthropic could—and does—keep measuring Claude with traditional tests, but they're a matter of routine. The researchers hope more unusual assignments like having the bot play Pokémon can spark insights on how to hone the model that might not have dawned on them from using just the standard assessments.
"In the last few years, benchmarks and evaluations don't really tell the full story of the quality of these models—just like you don't know how smart someone is just by giving them a SAT test," said Dianne Na Penn, who leads the research efforts of Anthropic's product team. "Knowing how well a model can do on a goal-oriented agentic task is not something you can know from multiple-choice questions." 
Hershey suspects 3.7 Sonnet still won't be able to finish the game. On the Twitch stream, it had spent at least a day in labyrinthine Mount Moon. (I feel for Claude: Exactly the same fate befell me when I first played the game at age 8.) And Hershey thinks even if the model makes it out of Mount Moon, it will really struggle to get through Fuchsia City's Safari Zone, which it must traverse in a set number of steps. "If we've learned anything from Mount Moon, we know that optimal pathing is not what we're good at yet," Hershey acknowledged. 
Would Claude's ability to finish the game be a sign that artificial general intelligence is finally here?
"We'd be a lot closer—" Na Penn said with a laugh.
"Yeah, it would definitely make me question some things," Hershey said.
But About That AI Bubble…
I cannot remember the last time I had a lengthy conversation with anyone important in tech that didn't involve AI. It's the industry's preeminent preoccupation. 
Yet Pew Research is out with some eyebrow-raising poll results. In a survey of white-collar Americans, more than half said they rarely or never use an AI chatbot at work and just 9% said they use one daily. Nearly a third of the respondents said they had never even heard of a chatbot. 
Both of those figures astounded me and left me puzzling on what Pew's study suggests: It may be a marker that AI has an enormous market still to conquer. It could just as easily be a warning sign that Silicon Valley has vastly overestimated how much America wants AI. Or needs it.—Abram Brown
 
Cristóbal Valenzuela, co-founder and CEO of Runway, an AI video startup, has gone to substantial lengths to win over Hollywood. While he lives in New York, he has lately found himself spending about a quarter of his time in Southern California and has fallen into a steady routine—taking every breakfast, coffee, lunch, brunch and Chateau Marmont cocktail he possibly can schedule. 
"I'll start at 9 a.m. and try to fit in at least six meetings a day," Valenzuela told Weekend newcomer Gary Rivlin for this week's Big Read. "Producers, actors, writers, directors, the guilds, the big studios, the production houses—pretty much everyone." 
Backed by millions from the likes of Nvidia and General Atlantic, Valenzuela is trying hard to entrench Runway deeply into Hollywood. He hopes to make the company's tools an industry standard, explains Rivlin, a Pulitzer Prize winner formerly of The New York Times and the author of the upcoming book "AI Valley." Still, Runway faces stiff competition from a slew of rivals, including OpenAI, which released an initial version of its video tool, Sora, in December. 
Nonetheless, Valenzuela believes Runway's tools will be far more sophisticated than anyone else's—thanks partly to a new landmark deal with Lionsgate, the Hollywood studio. "We see these other companies as creating concept cars while we're building a real car," he said. "But our job is to make cars that are useful to people."
Larry Fischer, a fast-talking native New Yorker, has pulled off quite a few domain name megadeals over the years, including selling Messenger.com to Facebook in 2014 and Teams.com to Microsoft in 2020. Now the domain broker says he is "drooling" over the prospect of an even bigger sale: AI.com, which he thinks can fetch $100 million.
It would likely be a record-breaking domain sale, as our Akash Pasricha reports, and the anonymous owner of AI.com—the person who hired Fischer—has been trying to catch the attention of prospective buyers by continually switching up where AI.com redirects to.
Our Paris Martineau has taken a break from stories about the technology that imperils young people to assemble this list of gadgets that help wee ones stay asleep, rested and fit—which of course then affords similar comfort to their parents. 
Abram Brown, editor of The Information's Weekend section, remains a committed plant-type Pokémon guy. Reach him at abe@theinformation.com.
 
Listening: A Saga From a Father's Sins
At first listen, "Crook County" might seem like a fairly ordinary true-crime podcast: a narrative about a young Chicago-area mafia soldier, Kenny "The Kid" Tekiela, who for several decades led a double life as a fireman by day and a low-level assassin by night. But wait—is that Kenny himself as co-narrator, going on the record about his felonious past? Yup, it is. Somehow Kenny is still alive to talk about it all. "I did fly under the radar," he says. He worked hard to keep his ambitions guarded—knowing full well the more ambitious guys were often the ones who didn't make it.
Rather than picture himself as the next Al Capone, he held smaller-minded goals: Earn some extra dollars for his family and hope they never found out about his moonlighting. He did and they didn't—until recently. And there's the second unique element to "Crook County": The other guy narrating the iHeartRadio pod is Kyle Tekiela, Kenny's son, who grew up to become a Hollywood producer of films such as 2017's "Mudbound." "I hope we can all find some healing through this journey together," Kyle says as the first episode ends. He then gives a special nod to his wife Nicole "for not leaving me after she found out about all this shit."—A.B.
Reading: Money Games and Murders
In the '60s, East New York was a place teeming with sharks—real estate brokers, that is—and one of the most predatory was Ortrud Kapraki. She was hard working, organized and ruthless. "She was a sociopath," recalls one FBI agent who tangled with her. Through her efforts, she emerged as a crucial player in a financial scheme that stretched from Brooklyn to Washington in which realtors, mortgage lenders and Dun & Bradstreet credit analysts conspired with federal housing officials to flip cheap properties and sell them to low-income, minority owners who couldn't actually afford them. 
The money shenanigans form only one part of "The Killing Fields of East New York" from Stacy Horn, who channels a fair David Simon impression to fill out the book. She also shows how the actions of those white-collar criminals fueled street-level gangsterism in East New York, eventually turning the Brooklyn enclave into one of the most dangerous towns in the country. During a nadir in 1991, the place saw 116 murders. More than a third remain unsolved.—A.B.
Watching: Netflix Goes for Gold
My most joyful sports experience in recent memory was watching the USA men's basketball team win a gold medal at the Olympics last summer. The second most joyful? Rewatching them win it in "Court of Gold," a six-part Netflix documentary that closely follows the national men's basketball teams from the U.S., Canada, France and Serbia on their quests for victory in Paris. 
Reliving Steph Curry stealing the souls of Team France by raining threes down on them in the final is far from the only memorable moment in it. There's also the cocktail reception spectacle of the Minnesota Timberwolves' Anthony Edwards cockily declaring "I'm the truth" to Barack Obama (who produced the film along with wife Michelle). There's Serbia's coach chewing out his team in the locker room after his players left a transcendent Kevin Durant unguarded during an early loss to the U.S.: "He gets the ball as if…he is alone as a ghost." And there is the unexpected pathos of Durant getting emotional while describing how the Olympics and basketball have elevated his life. This doc is podium-worthy.—Nick Wingfield
 
Kinda like: It's not "Killing Eve," it's "Killing Steve!"