Amazon Products

800M+

LinkedIn Profiles

900M+

Reddit Posts

430M

SEC Filings

300K+/yr

Glassdoor Reviews

100M+

Job Listings

20M+

Trustpilot Reviews

9.2M

News Articles

50K+ sources

$ Data Packages

The data you need.
Already collected.

200+ pre-built datasets from the public web. Structured, compliant, and delivered on a schedule you set. No scrapers to build. No proxies to manage. Just data.

Browse catalog

Talk to a data expert

200+

ready-to-use packages

800M+

products tracked

50K+

news sources

20M+

job listings

99.99%

delivery uptime

Trusted by 20,000+ customers and 70% of AI labs

View pricing

Deloitte

McDonald's

Moody's

NBC Universal

Nokia

Oxford

Pfizer

Shopee

Taboola

eToro

United Nations

Club Med

SOC 2 Type II

ISO 27001

GDPR

CCPA

CSA STAR

View Trust Center

200+ packages across
every vertical.

Pre-collected, pre-structured, and continuously refreshed. Free samples available for every package.

E-commerce & Retail

800M+ products

Amazon, Shopify, eBay, Walmart, Target

Product listings with pricing + ratings

Review datasets with sentiment labels

Competitor pricing snapshots

Category-level inventory signals

Social & Community

1B+ posts

Reddit, Trustpilot, G2, Glassdoor, Yelp

Subreddit threads by topic

Business reviews with ratings + dates

B2B software reviews by category

Employee sentiment by company

News & Media

Continuous feed

50K+ global news outlets

Full article text + metadata

Topic-tagged news feeds

Regional press by language

Breaking news with publish timestamps

Jobs & Talent

20M+ listings

LinkedIn, Indeed, Glassdoor, Greenhouse, Lever

Job postings with salary + skills

Company headcount signals

Hiring velocity by role + region

Executive moves and promotions

Finance & Legal

Daily updates

SEC EDGAR, court systems, exchanges

SEC 10-K, 10-Q filings

Earnings call transcripts

Regulatory filings by agency

Court docket and case data

Real Estate

50M+ listings

Zillow, Realtor.com, MLS, Rightmove

Active listings with price history

Agent + broker profiles

Neighborhood market comparables

Rental pricing by zip code

Don't see your source? Request a custom dataset →

How it works

From catalog to pipeline
in three steps.

No scrapers to write. No proxies to configure. No maintenance.

Browse the catalog

Pick a package

Choose from 200+ ready-to-use datasets or define a custom source. Filter by vertical, region, language, or refresh cadence. Free sample files available for every package.

Browse by vertical: e-commerce, finance, social, jobs, news

Filter by language, region, and recency

Free samples for every package, no commitment

Custom source definition for any public website

Your format, your schedule

Configure delivery

Set the output format, refresh cadence, and delivery destination. One-time snapshot or recurring feed. Data lands in your S3, GCS, or Azure Blob on your schedule.

Output: JSON, JSONL, CSV, Parquet, or custom schema

Delivery: S3, GCS, Azure Blob, or webhook

One-time snapshot, weekly, daily, or continuous feed

Volume filtering and deduplication built-in
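The delivery choices above can be captured as a small, validated configuration object. The following is only a sketch: the option values come from the list above, but `DeliveryConfig`, its field names, and the `amazon-products` identifier are hypothetical and do not represent Bright Data's actual API.

```python
from dataclasses import dataclass

# Option values are taken from the list above; the class and field
# names are hypothetical and do not mirror any real Bright Data API.
OUTPUT_FORMATS = {"json", "jsonl", "csv", "parquet", "custom"}
DESTINATIONS = {"s3", "gcs", "azure_blob", "webhook"}
CADENCES = {"snapshot", "weekly", "daily", "continuous"}


@dataclass(frozen=True)
class DeliveryConfig:
    package: str
    output_format: str
    destination: str
    cadence: str
    deduplicate: bool = True

    def __post_init__(self) -> None:
        # Reject values outside the options the page lists.
        for value, allowed, label in (
            (self.output_format, OUTPUT_FORMATS, "output_format"),
            (self.destination, DESTINATIONS, "destination"),
            (self.cadence, CADENCES, "cadence"),
        ):
            if value not in allowed:
                raise ValueError(
                    f"{label} must be one of {sorted(allowed)}, got {value!r}"
                )


config = DeliveryConfig(
    package="amazon-products",  # hypothetical package identifier
    output_format="parquet",
    destination="s3",
    cadence="daily",
)
```

Validating the combination up front means a typo in a format or destination name fails immediately rather than at the first scheduled delivery.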

Fresh and structured

Data arrives

Bright Data handles all extraction, anti-bot defenses, rate limits, and format parsing. You receive clean, structured data exactly when you need it.

Anti-bot and CAPTCHA handling, fully managed

Schema-consistent output every delivery

99.99% uptime SLA on delivery pipelines

Compliance: SOC 2 Type II, GDPR, CCPA

Ready to browse the full catalog?

Browse datasets →

What teams use it for

One catalog. Every use case.

Whether you're training models, grounding agents, or building intelligence products, the data is already here.

AI model training

High-volume, diverse web data as pre-training corpora or multimodal training sets. Delivered to your cloud.

Foundation models, Multimodal training, Pre-training

Fine-tuning & SFT

Domain-specific datasets structured as instruction/input/output pairs. JSONL-ready for OpenAI, HuggingFace, and Axolotl.

SFT datasets, JSONL format, Domain-specific
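To make the instruction/input/output layout concrete, here is a minimal sketch of reading such a JSONL file, one JSON object per line. The key names follow the description above; the sample records and the `parse_sft_jsonl` helper are illustrative assumptions, not content from an actual package.

```python
import json

# Keys named in the description above; real packages may carry more fields.
REQUIRED_KEYS = ("instruction", "input", "output")


def parse_sft_jsonl(text: str) -> list[dict]:
    """Parse JSONL text and check each record has the SFT keys."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing keys {missing}")
        records.append(record)
    return records


# Made-up sample records, for illustration only.
sample = "\n".join(
    [
        json.dumps(
            {
                "instruction": "Summarize the review.",
                "input": "Great blender, but the motor is loud.",
                "output": "Positive overall; noise is the main complaint.",
            }
        ),
        json.dumps(
            {
                "instruction": "Classify the sentiment.",
                "input": "Arrived broken and support never replied.",
                "output": "negative",
            }
        ),
    ]
)

records = parse_sft_jsonl(sample)
```

Checking for the three keys at load time catches malformed lines before they reach a training job.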

RAG grounding

Chunked, metadata-rich documents for vector databases. Continuously updated so your RAG pipeline answers with current information.

Pinecone, Weaviate, Qdrant
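As an illustration of what "chunked, metadata-rich" can mean in practice, the sketch below splits a document into overlapping character windows and attaches source metadata to each chunk. The chunk size, overlap, field names, and example URL are assumptions chosen for the example, not the format the packages actually ship in.

```python
def chunk_document(
    text: str, source: str, max_chars: int = 200, overlap: int = 20
) -> list[dict]:
    """Split text into overlapping chunks, each tagged with its origin."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    chunks = []
    start = 0
    index = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(
            {
                "text": text[start:end],
                "source": source,        # hypothetical metadata field
                "chunk_index": index,    # position of the chunk in the document
                "char_start": start,     # offset back into the original text
            }
        )
        if end == len(text):
            break
        start = end - overlap  # step back so neighbouring chunks overlap
        index += 1
    return chunks


chunks = chunk_document("a" * 450, source="https://example.com/article")
```

Each dictionary can then be embedded and upserted into a vector database, with the metadata kept alongside the vector so answers can cite their source.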

Market intelligence

Structured snapshots of competitor pricing, job postings, hiring signals, and industry news, delivered on a recurring schedule.

Competitive analysis, Alt data, B2B intelligence

Research & analysis

Historical and current datasets for academic research, product analytics, and market studies. Academic licensing available.

Academic licensing, Historical data, Custom schema

Product enrichment

Enrich your database with fresh web signals (firmographics, technographics, contact data, or pricing) on a recurring cadence.

Data enrichment, CRM enrichment, Fresh signals

$ Delivery

Your data, your infra.

Datasets land wherever your pipeline expects them. Choose any cloud storage, delivery cadence, and output format. Schema consistency is guaranteed across every delivery.

Snapshot, daily, weekly, or continuous feed

Schema-consistent across deliveries

Volume filtering and deduplication

Custom fields and format transformations
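On the receiving side, schema consistency and deduplication are easy to verify when a delivery lands. The sketch below shows one way to do that; the field names, the `id` key, and the sample records are hypothetical, and this is not a description of how the service performs these steps internally.

```python
def find_schema_violations(records: list[dict], expected_fields: set[str]) -> list[int]:
    """Return indexes of records whose fields differ from the expected set."""
    return [i for i, record in enumerate(records) if set(record) != expected_fields]


def deduplicate(records: list[dict], key: str) -> list[dict]:
    """Keep the first record seen for each value of `key`."""
    seen = set()
    unique = []
    for record in records:
        if record[key] in seen:
            continue
        seen.add(record[key])
        unique.append(record)
    return unique


# Made-up delivery contents, for illustration only.
delivery = [
    {"id": "p1", "title": "Blender", "price": 49.99},
    {"id": "p2", "title": "Kettle", "price": 24.50},
    {"id": "p1", "title": "Blender", "price": 49.99},
]

violations = find_schema_violations(delivery, {"id", "title", "price"})
unique_records = deduplicate(delivery, key="id")
```

Running a check like this on every delivery gives an early warning if a field is added or dropped upstream.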

Amazon S3

Direct to your bucket

Google Cloud

GCS or BigQuery

Azure Blob

ADLS or Blob storage

Direct Download

One-time files, any size

API / Webhook

Push on each delivery

Snowflake / DBT

Data warehouse native

“The quality and freshness of the data packages is the reason we can keep our models up-to-date. What used to take weeks of scraping infrastructure work now arrives in our S3 bucket automatically.”

Data Engineering Team

Enterprise AI platform

$ FAQ

Common questions

A data package is a pre-built, continuously refreshed dataset extracted from a specific public web source. Each package covers a defined vertical (e.g. Amazon product listings, Reddit posts, SEC filings) with a consistent schema, ready to deliver to your storage on your schedule.

Refresh cadence depends on the package and your plan. Most packages offer daily or weekly updates. Some high-velocity datasets (news, pricing, job listings) support continuous or near-real-time feeds. Delivery cadence is fully configurable.

Yes. Free sample files are available for every package in the catalog. You can download and inspect the schema and content before purchasing. No credit card required for samples.

JSON, JSONL, CSV, Parquet, and custom schemas. For AI training use cases, JSONL with instruction/input/output formatting is available. For RAG pipelines, chunked text with source metadata and optional embeddings is supported.

Yes. If your source or schema isn't in the catalog, our team can set up a custom extraction pipeline. Contact us with your target source, required fields, and delivery cadence. Most custom datasets go live in 1–5 business days.

All packages collect only publicly available information and comply with GDPR, CCPA, and applicable data protection laws. Bright Data holds SOC 2 Type II and ISO 27001:2022 certifications and has won court cases against Meta and X, establishing legal precedent for ethical data collection.

Packages are priced by category, volume, and delivery cadence. One-time snapshots are the most cost-efficient option. Recurring feeds are priced per delivery. Enterprise plans include volume discounts and custom SLAs. Contact us for a quote.

The catalog has what you need.

200+ datasets. Any vertical. Your cloud. Start with a free sample.

Browse catalog →

Talk to an expert

No credit card required for free tier
