The promise and perils of synthetic data

Is it possible for an AI to be trained just on data generated by another AI? It might sound like a harebrained idea. But it’s one that’s been around for quite some time — and as new, real data is increasingly hard to come by, it’s been gaining traction.

Anthropic used some synthetic data to train one of its flagship models, Claude 3.5 Sonnet. Meta fine-tuned its Llama 3.1 models using AI-generated data. And OpenAI is said to be sourcing synthetic training data from o1, its “reasoning” model, for the upcoming Orion.

But why does AI need data in the first place — and what kind of data does it need? And can this data really be replaced by synthetic data?

The importance of annotations

AI systems are statistical machines. Trained on a lot of examples, they learn the patterns in those examples to make predictions, like that “to whom” in an email typically precedes “it may concern.”

Annotations, usually text labeling the meaning or parts of the data these systems ingest, are a key piece in these examples. They serve as guideposts, “teaching” a model to distinguish among things, places, and ideas.

Consider a photo-classifying model shown lots of pictures of kitchens labeled with the word “kitchen.” As it trains, the model will begin to make associations between “kitchen” and general characteristics of kitchens (e.g. that they contain fridges and countertops). After training, given a photo of a kitchen that wasn’t included in the initial examples, the model should be able to identify it as such. (Of course, if the pictures of kitchens were labeled “cow,” it would identify them as cows, which emphasizes the importance of good annotation.)
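The kitchen-versus-cow point can be made concrete with a toy sketch. This is not how a production image classifier works. It is a minimal nearest-centroid model over made-up, hand-coded features (`has_fridge`, `has_countertop`, `has_grass`), and every name and data point in it is an assumption chosen for illustration. It shows only that the model repeats whatever label the annotator supplied.

```python
from collections import defaultdict


def train_centroids(examples):
    """Average the feature vectors for each label (a nearest-centroid model)."""
    sums = {}
    counts = defaultdict(int)
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))
    return min(centroids, key=lambda label: distance(centroids[label]))


# Hypothetical features per photo: [has_fridge, has_countertop, has_grass]
kitchens = [[1, 1, 0], [1, 0, 0], [1, 1, 0]]
pastures = [[0, 0, 1], [0, 0, 1]]

# Same photos, two annotation jobs: one careful, one that tags kitchens "cow".
good_labels = [(f, "kitchen") for f in kitchens] + [(f, "pasture") for f in pastures]
bad_labels = [(f, "cow") for f in kitchens] + [(f, "pasture") for f in pastures]

# A kitchen photo that was not in the training examples (no fridge in frame).
unseen_kitchen = [0, 1, 0]
good_prediction = predict(train_centroids(good_labels), unseen_kitchen)
bad_prediction = predict(train_centroids(bad_labels), unseen_kitchen)
```

The photos and the learned associations are identical in both runs. Only the annotation differs, and so does the answer the model gives.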

The appetite for AI and the need to provide labeled data for its development have ballooned the market for annotation services. Dimension Market Research estimates that it’s worth $838.2 million today — and will be worth $10.34 billion in the next ten years. While there aren’t precise estimates of how many people engage in labeling work, a 2022 paper pegs the number in the “millions.”

Companies large and small rely on workers employed by data annotation firms to create labels for AI training sets. Some of these jobs pay reasonably well, particularly if the labeling requires specialized knowledge (e.g. math expertise). Others can be backbreaking. Annotators in developing countries are paid only a few dollars per hour on average without any benefits or guarantees of future gigs.

A drying data well

So there are humanistic reasons to seek out alternatives to human-generated labels. But there are also practical ones.

Humans can only label so fast. Annotators also have biases that can manifest in their annotations, and, subsequently, any models trained on them. Annotators make mistakes, or get tripped up by labeling instructions. And paying humans to do things is expensive.

Data in general is expensive, for that matter. Shutterstock is charging AI vendors tens of millions of dollars to access its archives, while Reddit has made hundreds of millions from licensing data to Google, OpenAI, and others.

Lastly, data is also becoming harder to acquire.

Most models are trained on massive collections of public data — data that owners are increasingly choosing to gate over fears their data will be plagiarized, or that they won’t receive credit or attribution for it. More than 35% of the world’s top 1,000 websites now block OpenAI’s web scraper. And around 25% of data from “high-quality” sources has been restricted from the major datasets used to train models, one recent study found.

Should the current access-blocking trend continue, the research group Epoch AI projects that developers will run out of data to train generative AI models between 2026 and 2032. That, combined with fears of copyright lawsuits and of objectionable material making its way into open data sets, has forced a reckoning for AI vendors.

Synthetic alternatives

At first glance, synthetic data would appear to be the solution to all these problems. Need annotations? Generate ’em. More example data? No problem. The sky’s the limit.

And to a certain extent, this is true.

“If ‘data is the new oil,’ synthetic data pitches itself as biofuel, creatable without the negative externalities of the real thing,” Os Keyes, a PhD candidate at the University of Washington who studies the ethical impact of emerging technologies, told TechCrunch. “You can take a small starting set of data and simulate and extrapolate new entries from it.”

The AI industry has taken the concept and run with it.

This month, Writer, an enterprise-focused generative AI company, debuted a model, Palmyra X 004, trained almost entirely on synthetic data. Developing it cost just $700,000, Writer claims — compared to estimates of $4.6 million for a comparably-sized OpenAI model.

Microsoft’s Phi open models were trained using synthetic data, in part. So were Google’s Gemma models. Nvidia this summer unveiled a model family designed to generate synthetic training data, and AI startup Hugging Face recently released what it claims is the largest AI training dataset of synthetic text.

Synthetic data generation has become a business in its own right, one that could be worth $2.34 billion by 2030. Gartner predicts that 60% of the data used for AI and analytics projects this year will be synthetically generated.

Luca Soldaini, a senior research scientist at the Allen Institute for AI, noted that synthetic data techniques can be used to generate training data in a format that’s not easily obtained through scraping (or even content licensing). For example, in training its video generator Movie Gen, Meta used Llama 3 to create captions for footage in the training data, which humans then refined to add more detail, like descriptions of the lighting.

Along these same lines, OpenAI says that it fine-tuned GPT-4o using synthetic data to build the sketchpad-like Canvas feature for ChatGPT. And Amazon has said that it generates synthetic data to supplement the real-world data it uses to train speech recognition models for Alexa.

“Synthetic data models can be used to quickly expand upon human intuition of which data is needed to achieve a specific model behavior,” Soldaini said.

Synthetic risks

Synthetic data is no panacea, however. It suffers from the same “garbage in, garbage out” problem as all AI. Models create synthetic data, and if the data used to train these models has biases and limitations, their outputs will be similarly tainted. For instance, groups poorly represented in the base data will be so in the synthetic data.

“The problem is, you can only do so much,” Keyes said. “Say you only have 30 Black people in a dataset. Extrapolating out might help, but if those 30 people are all middle-class, or all light-skinned, that’s what the ‘representative’ data will all look like.”

To this point, a 2023 study by researchers at Rice University and Stanford found that over-reliance on synthetic data during training can create models whose “quality or diversity progressively decrease.” Sampling bias — poor representation of the real world — causes a model’s diversity to worsen after a few generations of training, according to the researchers (although they also found that mixing in a bit of real-world data helps to mitigate this).
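A rough way to see why diversity erodes is the toy loop below. It is not the researchers' experiment. It is a deterministic sketch that assumes a "model" is just a probability table over three made-up output types, and that sampling bias can be mimicked by squaring and renormalizing the probabilities (the hypothetical `sharpen` step), so that common outputs get over-produced. The `real_share` parameter, also an assumption, blends real data back in at each generation.

```python
import math


def entropy(dist):
    """Shannon entropy in bits; higher means more diverse outputs."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def sharpen(dist, power=2.0):
    """Toy stand-in for sampling bias: over-produce what is already common."""
    weights = {k: p ** power for k, p in dist.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def mix(synthetic, real, real_share):
    """Blend synthetic and real distributions; real_share is the real fraction."""
    return {
        k: (1 - real_share) * synthetic[k] + real_share * real[k]
        for k in synthetic
    }


def run_generations(real, generations, real_share):
    """Train each generation on the previous one's biased output.

    Returns the entropy after each generation, starting with the real data.
    """
    current = dict(real)
    history = [entropy(current)]
    for _ in range(generations):
        synthetic = sharpen(current)
        current = mix(synthetic, real, real_share)
        history.append(entropy(current))
    return history


# Hypothetical "real world" mix of output types.
real = {"common": 0.5, "uncommon": 0.3, "rare": 0.2}

synthetic_only = run_generations(real, generations=5, real_share=0.0)
with_real_data = run_generations(real, generations=5, real_share=0.3)
```

In this sketch, entropy in the synthetic-only run falls with every generation and is close to zero by the fifth, which means nearly all output is the "common" type. The run that keeps 30% real data stays well above that floor. That mirrors the study's qualitative finding, under the heavy simplifications stated above.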

Keyes sees additional risks in complex models such as OpenAI’s o1, which he thinks could produce harder-to-spot hallucinations in their synthetic data. These, in turn, could reduce the accuracy of models trained on the data — especially if the hallucinations’ sources aren’t easy to identify.

“Complex models hallucinate; data produced by complex models contain hallucinations,” Keyes added. “And with a model like o1, the developers themselves can’t necessarily explain why artefacts appear.”

Compounding hallucinations can lead to gibberish-spewing models. A study published in the journal Nature reveals how models, trained on error-ridden data, generate even more error-ridden data, and how this feedback loop degrades future generations of models. Models lose their grasp of more esoteric knowledge over generations, the researchers found — becoming more generic and often producing answers irrelevant to the questions they’re asked.

Image Credits:Ilia Shumailov et al.

A follow-up study shows that other types of models, like image generators, aren’t immune to this sort of collapse.

Soldaini agrees that “raw” synthetic data isn’t to be trusted, at least if the goal is to avoid training forgetful chatbots and homogenous image generators. Using it “safely,” he says, requires thoroughly reviewing, curating, and filtering it, and ideally pairing it with fresh, real data — just like you’d do with any other dataset.

Failing to do so could eventually lead to model collapse, where a model becomes less “creative” — and more biased — in its outputs, eventually seriously compromising its functionality. Though this process could be identified and arrested before it gets serious, it is a risk.

“Researchers need to examine the generated data, iterate on the generation process, and identify safeguards to remove low-quality data points,” Soldaini said. “Synthetic data pipelines are not a self-improving machine; their output must be carefully inspected and improved before being used for training.”
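What such safeguards might look like in practice can be sketched in a few lines. The example below is an illustrative assumption, not a description of any lab's pipeline. The function names, the thresholds, and the `unique_word_ratio` heuristic are all hypothetical. The filter drops duplicates, very short records, and records with a low score, and it keeps the rejects with a reason attached so a person can inspect them.

```python
def unique_word_ratio(text):
    """Hypothetical quality heuristic: share of distinct words in the text."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0


def filter_synthetic(records, quality_score, min_words=4, min_score=0.5):
    """Split generated records into kept ones and rejects with a reason."""
    seen = set()
    kept, rejected = [], []
    for text in records:
        # Normalize case and whitespace so near-identical copies match.
        normalized = " ".join(text.lower().split())
        if normalized in seen:
            rejected.append((text, "duplicate"))
        elif len(normalized.split()) < min_words:
            rejected.append((text, "too short"))
        elif quality_score(text) < min_score:
            rejected.append((text, "low quality score"))
        else:
            kept.append(text)
        seen.add(normalized)
    return kept, rejected


# Made-up generated captions, including some obviously bad ones.
samples = [
    "The fridge sits next to the countertop.",
    "the fridge sits next to the countertop.",
    "Too short.",
    "data data data data data data",
    "A cow grazes in the open pasture.",
]

kept, rejected = filter_synthetic(samples, unique_word_ratio)
```

Here two of the five samples survive, and the other three land in `rejected` tagged as a duplicate, too short, or low-scoring. Keeping the rejects is deliberate. They are what a researcher would examine before adjusting the generation process.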

OpenAI CEO Sam Altman once argued that AI will someday produce synthetic data good enough to effectively train itself. But — assuming that’s even feasible — the tech doesn’t exist yet. No major AI lab has released a model trained on synthetic data alone.

At least for the foreseeable future, it seems we’ll need humans in the loop somewhere to make sure a model’s training doesn’t go awry.

Volkswagen’s cheapest EV ever is the first to use Rivian software

Volkswagen’s ultra-cheap EV called the ID EVERY1 — a small four-door hatchback revealed Wednesday — will be the first to roll out with software and architecture from Rivian, according to a source familiar with the new model.

The EV is expected to go into production in 2027 with a starting price of 20,000 euros ($21,500). A second EV called the ID.2all, which will be priced in the 25,000 euro price category, will be available in 2026. Both vehicles are part of the automaker’s new category of electric urban front-wheel-drive cars that are being developed under the so-called “Brand Group Core” that makes up the volume brands in the VW Group. And both vehicles are for the European market.

The EVERY1 will be the first to ship with Rivian’s vehicle architecture and software as part of a $5.8 billion joint venture struck last year between the German automaker and U.S. EV maker. The ID.2all is based on the E3 1.1 architecture and software developed by VW’s software unit Cariad.

VW didn’t name Rivian in its reveal Wednesday, although there were numerous nods to next-generation software. Kai Grünitz, member of the Volkswagen Brand Board of Management responsible for Technical Development, noted it would be the first model in the entire VW Group to use a “fundamentally new, particularly powerful software architecture.”

“This means the future entry-level Volkswagen can be equipped with new functions throughout its entire life cycle,” he said. “Even after purchase of a new car, the small Volkswagen can still be individually adapted to customer needs.”

Sources, who didn’t want to be named because they were not authorized to speak publicly, confirmed to TechCrunch that Rivian’s software will be in the ID EVERY1 EV. TechCrunch has reached out to Rivian and VW and will update the article if the companies respond.

The new joint venture provides Rivian with a needed influx of cash and the opportunity to diversify its business. Meanwhile, VW Group gains a next-generation electrical architecture and software for EVs that will help it better compete. Both companies have said that the joint venture, called Rivian and Volkswagen Group Technologies, will reduce development costs and help scale new technologies more quickly.

The joint venture is a 50-50 partnership with co-CEOs. Rivian’s head of software, Wassym Bensaid, and Volkswagen Group’s chief technical engineer, Carsten Helbing, will lead the joint venture. The team will be based initially in Palo Alto, California. Three other sites are in development in North America and Europe, the companies have previously said.

“The ID. EVERY1 represents the last piece of the puzzle on our way to the widest model selection in the volume segment,” Thomas Schäfer, CEO of the Volkswagen Passenger Cars brand and Head of the Brand Group Core, said in a statement. “We will then offer every customer the right car with the right drive system–including affordable all-electric entry-level mobility. Our goal is to be the world’s technologically leading high-volume manufacturer by 2030. And as a brand for everyone–just as you would expect from Volkswagen.”

The Volkswagen ID EVERY1 is just a concept for now, and the unveiling came with only a few details. The concept vehicle reaches a top speed of 130 km/h (80 miles per hour) and is powered by a newly developed electric drive motor with 70 kW, according to Volkswagen. The German automaker said the range on the EVERY1 will be at least 250 kilometers (150 miles). The vehicle is small but larger than VW’s former UP! vehicle. The company said it will have enough space for four people and a luggage compartment volume of 305 liters.

The hottest AI models, what they do, and how to use them

AI models are being cranked out at a dizzying pace, by everyone from Big Tech companies like Google to startups like OpenAI and Anthropic. Keeping track of the latest ones can be overwhelming. 

Adding to the confusion is that AI models are often promoted based on industry benchmarks. But these technical metrics often reveal little about how real people and companies actually use them. 

To cut through the noise, TechCrunch has compiled an overview of the most advanced AI models released since 2024, with details on how to use them and what they’re best for. We’ll keep this list updated with the latest launches, too.

There are literally over a million AI models out there: Hugging Face, for example, hosts over 1.4 million. So this list might miss some models that perform better, in one way or another. 

AI models released in 2025

Cohere’s Aya Vision

Cohere released a multimodal model called Aya Vision that it claims is best in class at doing things like captioning images and answering questions about photos. It also excels in languages other than English, unlike other models, Cohere claims. It is available for free on WhatsApp.

OpenAI’s GPT 4.5 ‘Orion’

OpenAI calls Orion its largest model to date, touting its strong “world knowledge” and “emotional intelligence.” However, it underperforms on certain benchmarks compared to newer reasoning models. Orion is available to subscribers of OpenAI’s $200 a month plan.

Claude Sonnet 3.7

Anthropic says this is the industry’s first ‘hybrid’ reasoning model, because it can both fire off quick answers and really think things through when needed. It also gives users control over how long the model can think for, per Anthropic. Sonnet 3.7 is available to all Claude users, but heavier users will need a $20 a month Pro plan.

xAI’s Grok 3

Grok 3 is the latest flagship model from Elon Musk-founded startup xAI. It’s claimed to outperform other leading models on math, science, and coding. The model requires X Premium (which is $50 a month). After one study found Grok 2 leaned left, Musk pledged to shift Grok to be more “politically neutral,” but it’s not yet clear if that’s been achieved.

OpenAI o3-mini

This is OpenAI’s latest reasoning model and is optimized for STEM-related tasks like coding, math, and science. It’s not OpenAI’s most powerful model but because it’s smaller, the company says it’s significantly lower cost. It is available for free but requires a subscription for heavy users.

OpenAI Deep Research

OpenAI’s Deep Research is designed for doing in-depth research on a topic with clear citations. This service is only available with ChatGPT’s $200 per month Pro subscription. OpenAI recommends it for everything from science to shopping research, but beware that hallucinations remain a problem for AI.

Mistral Le Chat

Mistral has launched app versions of Le Chat, a multimodal AI personal assistant. Mistral claims Le Chat responds faster than any other chatbot. It also has a paid version with up-to-date journalism from the AFP. Tests from Le Monde found Le Chat’s performance impressive, although it made more errors than ChatGPT.

OpenAI Operator

OpenAI’s Operator is meant to be a personal intern that can do things independently, like help you buy groceries. It requires a $200 a month ChatGPT Pro subscription. AI agents hold a lot of promise, but they’re still experimental: a Washington Post reviewer says Operator decided on its own to order a dozen eggs for $31, paid with the reviewer’s credit card.

Google Gemini 2.0 Pro Experimental

Google says its much-awaited flagship Gemini model excels at coding and understanding general knowledge. It also has a super-long context window of 2 million tokens, helping users who need to quickly process massive chunks of text. The service requires (at minimum) a Google One AI Premium subscription of $19.99 a month.

AI models released in 2024

DeepSeek R1

This Chinese AI model took Silicon Valley by storm. DeepSeek’s R1 performs well on coding and math, while its open source nature means anyone can run it locally. Plus, it’s free. However, R1 integrates Chinese government censorship and faces rising bans for potentially sending user data back to China.

Gemini Deep Research

Deep Research summarizes Google’s search results in a simple and well-cited document. The service is helpful for students and anyone else who needs a quick research summary. However, its quality isn’t nearly as good as an actual peer-reviewed paper. Deep Research requires a $19.99 Google One AI Premium subscription.

Meta Llama 3.3 70B

This is the newest and most advanced version of Meta’s open source Llama AI models. Meta has touted this version as its cheapest and most efficient yet, especially for math, general knowledge, and instruction following. It is free and open source.

OpenAI Sora

Sora is a model that creates realistic videos based on text. While it can generate entire scenes rather than just clips, OpenAI admits that it often generates “unrealistic physics.” It’s currently only available on paid versions of ChatGPT, starting with Plus, which is $20 a month. 

Alibaba Qwen QwQ-32B-Preview

This model is one of the few to rival OpenAI’s o1 on certain industry benchmarks, excelling in math and coding. Ironically for a “reasoning model,” it has “room for improvement in common sense reasoning,” Alibaba says. It also incorporates Chinese government censorship, TechCrunch testing shows. It’s free and open source.

Anthropic’s Computer Use

Claude’s Computer Use is meant to take control of your computer to complete tasks like coding or booking a plane ticket, making it a predecessor of OpenAI’s Operator. Computer use, however, remains in beta. Pricing is via API: $0.80 per million tokens of input and $4 per million tokens of output.

x.AI’s Grok 2 

Elon Musk’s AI company, x.AI, has launched an enhanced version of its flagship Grok 2 chatbot it claims is “three times faster.” Free users are limited to 10 questions every two hours on Grok, while subscribers to X’s Premium and Premium+ plans enjoy higher usage limits. x.AI also launched an image generator, Aurora, that produces highly photorealistic images, including some graphic or violent content.

OpenAI o1

OpenAI’s o1 family is meant to produce better answers by “thinking” through responses using a hidden reasoning feature. The model excels at coding, math, and safety, OpenAI claims, but it has issues with deceiving humans, too. Using o1 requires subscribing to ChatGPT Plus, which is $20 a month.

Anthropic’s Claude Sonnet 3.5 

Claude Sonnet 3.5 is a model Anthropic claims is best in class. It’s become known for its coding capabilities and is considered a tech insider’s chatbot of choice. The model can be accessed for free on Claude, although heavy users will need a $20 monthly Pro subscription. While it can understand images, it can’t generate them.

OpenAI GPT 4o-mini

OpenAI has touted GPT 4o-mini as its most affordable and fastest model yet thanks to its small size. It’s meant to enable a broad range of tasks like powering customer service chatbots. The model is available on ChatGPT’s free tier. It’s better suited for high-volume simple tasks compared to more complex ones.

Cohere Command R+

Cohere’s Command R+ model excels at complex Retrieval-Augmented Generation (or RAG) applications for enterprises. That means it can find and cite specific pieces of information really well. (The inventor of RAG actually works at Cohere.) Still, RAG doesn’t fully solve AI’s hallucination problem.

Not all cancer patients need chemo. Ataraxis AI raised $20M to fix that.

Artificial intelligence is a big trend in cancer care, and it’s mostly focused on detecting cancer at the earliest possible stage. That makes a lot of sense, given that cancer is less deadly the earlier it’s detected.

But fewer are asking another fundamental question: if someone does have cancer, is an aggressive treatment like chemotherapy necessary? That’s the problem Ataraxis AI is trying to solve.

The New York-based startup is focused on using AI to accurately predict not only if a patient has cancer, but also what their cancer outcome looks like in 5 to 10 years. If there’s only a small chance of the cancer coming back, chemo can be avoided altogether – saving a lot of money, while avoiding the treatment’s notorious side effects.

Ataraxis AI now plans to launch its first commercial test, for breast cancer, to U.S. oncologists in the coming months, its co-founder Jan Witowski tells TechCrunch. To bolster the launch and expand into other types of cancer, the startup has raised a $20.4 million Series A, it told TechCrunch exclusively.

The round was led by AIX Ventures with participation from Thiel Bio, Founders Fund, Floating Point, Bertelsmann, and existing investors Giant Ventures and Obvious Ventures. Ataraxis emerged from stealth last year with a $4 million seed round.

Ataraxis was co-founded by Witowski and Krzysztof Geras, an assistant professor at NYU’s medical school who focuses on AI.

Ataraxis’ tech is powered by an AI model that extracts information from high-resolution images of cancer cells. The model is trained on hundreds of millions of real images from thousands of patients, Witowski said. A recent study showed Ataraxis’ tech was 30% more accurate than the current standard of care for breast cancer, per Ataraxis.

Long term, Ataraxis has big ambitions. It wants its tests to impact at least half of new cancer cases by 2030. It also views itself as a frontier AI company that builds its own models, touting Meta’s chief AI scientist Yann LeCun as an AI advisor.

“I think at Ataraxis we are trying to build what is essentially an AI frontier lab, but for healthcare applications,” Witowski said. “Because so many of those problems require a very novel technology.”

The AI boom has led to a rush of fundraises for cancer care startups. Valar Labs raised $22 million to help patients figure out their treatment plan in May 2024, for example. There’s also a bevy of AI-powered drug discovery firms in the cancer space, like Manas AI, which raised $24.6 million in January 2025 and was co-founded by Reid Hoffman, the LinkedIn co-founder.
