$ Data Packages

The data you need.
Already collected.

200+ pre-built datasets from the public web. Structured, compliant, and delivered on a schedule you set. No scrapers to build. No proxies to manage. Just data.

200+ ready-to-use packages
800M+ products tracked
50K+ news sources
20M+ job listings
99.99% delivery uptime

Trusted by 20,000+ customers and 70% of AI labs

View pricing

Deloitte
McDonald's
Moody's
NBC Universal
Nokia
Oxford
Pfizer
Shopee
Taboola
eToro
United Nations
Club Med
SOC 2 Type II
ISO 27001
GDPR
CCPA
CSA STAR
View Trust Center

200+ packages across
every vertical.

Pre-collected, pre-structured, and continuously refreshed. Free samples available for every package.

E-commerce & Retail

800M+ products

Amazon, Shopify, eBay, Walmart, Target

Product listings with pricing + ratings
Review datasets with sentiment labels
Competitor pricing snapshots
Category-level inventory signals

Social & Community

1B+ posts

Reddit, Trustpilot, G2, Glassdoor, Yelp

Subreddit threads by topic
Business reviews with ratings + dates
B2B software reviews by category
Employee sentiment by company

News & Media

Continuous feed

50K+ global news outlets

Full article text + metadata
Topic-tagged news feeds
Regional press by language
Breaking news with publish timestamps

Jobs & Talent

20M+ listings

LinkedIn, Indeed, Glassdoor, Greenhouse, Lever

Job postings with salary + skills
Company headcount signals
Hiring velocity by role + region
Executive moves and promotions

Finance & Legal

Daily updates

SEC EDGAR, court systems, exchanges

SEC 10-K, 10-Q filings
Earnings call transcripts
Regulatory filings by agency
Court docket and case data

Real Estate

50M+ listings

Zillow, Realtor.com, MLS, Rightmove

Active listings with price history
Agent + broker profiles
Neighborhood market comparables
Rental pricing by zip code

Don't see your source? Request a custom dataset →

How it works

From catalog to pipeline
in three steps.

No scrapers to write. No proxies to configure. No maintenance.

1
Browse the catalog

Pick a package

Choose from 200+ ready-to-use datasets or define a custom source. Filter by vertical, region, language, or refresh cadence. Free sample files available for every package.

Browse by vertical: e-commerce, finance, social, jobs, news
Filter by language, region, and recency
Free samples for every package, no commitment
Custom source definition for any public website
2
Your format, your schedule

Configure delivery

Set the output format, refresh cadence, and delivery destination. One-time snapshot or recurring feed. Data lands in your S3, GCS, or Azure Blob on your schedule.

Output: JSON, JSONL, CSV, Parquet, or custom schema
Delivery: S3, GCS, Azure Blob, or webhook
One-time snapshot, weekly, daily, or continuous feed
Volume filtering and deduplication built-in
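
As an illustration only, the delivery options listed above can be written down as a small validated settings object. This is a sketch, not Bright Data's API: the `DeliveryConfig` class, its field names, and the lowercase option strings are all hypothetical, chosen to mirror the formats, destinations, and cadences named in this step.

```python
from dataclasses import dataclass

# Option sets mirror the list above. The names and spellings are assumptions
# made for this sketch and are not taken from any Bright Data API.
FORMATS = {"json", "jsonl", "csv", "parquet", "custom"}
DESTINATIONS = {"s3", "gcs", "azure_blob", "webhook"}
CADENCES = {"snapshot", "weekly", "daily", "continuous"}


@dataclass(frozen=True)
class DeliveryConfig:
    """Hypothetical container for one package's delivery settings."""

    output_format: str
    destination: str
    cadence: str

    def __post_init__(self):
        # Reject any value that is not one of the options listed above.
        if self.output_format not in FORMATS:
            raise ValueError(f"unsupported format: {self.output_format}")
        if self.destination not in DESTINATIONS:
            raise ValueError(f"unsupported destination: {self.destination}")
        if self.cadence not in CADENCES:
            raise ValueError(f"unsupported cadence: {self.cadence}")


# Example: a daily Parquet feed delivered to an S3 bucket.
config = DeliveryConfig(output_format="parquet", destination="s3", cadence="daily")
```

Validating at construction time means a typo in a format or cadence fails immediately instead of surfacing on the first scheduled delivery.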
3
Fresh and structured

Data arrives

Bright Data handles all extraction, anti-bot defenses, rate limits, and format parsing. You receive clean, structured data exactly when you need it.

Anti-bot and CAPTCHA handling, fully managed
Schema-consistent output every delivery
99.99% uptime SLA on delivery pipelines
Compliance: SOC 2 Type II, GDPR, CCPA

Ready to browse the full catalog?

Browse datasets

What teams use it for

One catalog. Every use case.

Whether you're training models, grounding agents, or building intelligence products, the data is already here.

AI model training

High-volume, diverse web data as pre-training corpora or multimodal training sets. Delivered to your cloud.

Foundation models, Multimodal training, Pre-training

Fine-tuning & SFT

Domain-specific datasets structured as instruction/input/output pairs. JSONL-ready for OpenAI, HuggingFace, and Axolotl.

SFT datasets, JSONL format, Domain-specific
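
For readers unfamiliar with the layout, here is a minimal sketch of what an instruction/input/output record looks like in JSONL, with one JSON object per line. The exact key names (`instruction`, `input`, `output`) and the helper functions are assumptions for illustration; check a free sample file for the real schema of any given package.

```python
import json

# Assumed key names, based on the "instruction/input/output" description.
REQUIRED_KEYS = {"instruction", "input", "output"}


def to_sft_line(instruction, input_text, output_text):
    """Serialize one training example as a single JSONL line."""
    record = {
        "instruction": instruction,
        "input": input_text,
        "output": output_text,
    }
    return json.dumps(record, ensure_ascii=False)


def read_sft_lines(lines):
    """Parse JSONL lines, skipping blanks and rejecting incomplete records."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"record is missing keys: {sorted(missing)}")
        records.append(record)
    return records


line = to_sft_line(
    "Summarize the review in one sentence.",
    "Battery lasts two days, but the screen scratches easily.",
    "Long battery life, fragile screen.",
)
records = read_sft_lines([line, ""])
```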

RAG grounding

Chunked, metadata-rich documents for vector databases. Continuously updated so your RAG pipeline answers with current information.

Pinecone, Weaviate, Qdrant
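
To make "chunked, metadata-rich" concrete, the sketch below splits a document into overlapping character windows and attaches source metadata to each one. It is one common chunking approach under stated assumptions, not a description of how the packages are actually chunked: the function name, the chunk fields, and the default sizes are hypothetical.

```python
def chunk_document(text, source_url, chunk_size=200, overlap=50):
    """Split text into overlapping character windows with source metadata."""
    if chunk_size <= 0 or overlap < 0 or overlap >= chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        chunks.append(
            {
                "chunk_id": index,
                "text": text[start:start + chunk_size],
                "source_url": source_url,  # lets an answer cite its source
                "start_char": start,       # offset back into the original text
            }
        )
        # Stop once a window has reached the end of the document.
        if start + chunk_size >= len(text):
            break
    return chunks


# The URL is a placeholder, not a real dataset location.
chunks = chunk_document("a" * 450, "https://example.com/article")
```

Each chunk dictionary can then be embedded and upserted into whichever vector database you use, with the metadata fields stored alongside the vector.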

Market intelligence

Structured snapshots of competitor pricing, job postings, hiring signals, and industry news, delivered on a recurring schedule.

Competitive analysis, Alt data, B2B intelligence

Research & analysis

Historical and current datasets for academic research, product analytics, and market studies. Academic licensing available.

Academic licensing, Historical data, Custom schema

Product enrichment

Enrich your database with fresh web signals (firmographics, technographics, contact data, or pricing) on a recurring cadence.

Data enrichment, CRM enrichment, Fresh signals
$ Delivery

Your data, your infra.

Datasets land wherever your pipeline expects them. Choose any cloud storage, delivery cadence, and output format. Schema consistency is guaranteed across every delivery.

Snapshot, daily, weekly, or continuous feed
Schema-consistent across deliveries
Volume filtering and deduplication
Custom fields and format transformations
Amazon S3: Direct to your bucket
Google Cloud: GCS or BigQuery
Azure Blob: ADLS or Blob storage
Direct Download: One-time files, any size
API / Webhook: Push on each delivery
Snowflake / dbt: Data warehouse native
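
If you want to verify schema consistency on your own side, one simple approach is to compare the field names and value types seen in two consecutive deliveries. The sketch below assumes each delivery has already been loaded as a list of dictionaries; the function names and the sample records are hypothetical.

```python
def schema_of(records):
    """Collect the (field name, type name) pairs seen across all records."""
    return {
        (key, type(value).__name__)
        for record in records
        for key, value in record.items()
    }


def schema_drift(previous, current):
    """Report fields or types that differ between two deliveries."""
    prev_schema, cur_schema = schema_of(previous), schema_of(current)
    return {
        "removed": sorted(prev_schema - cur_schema),
        "added": sorted(cur_schema - prev_schema),
    }


# Made-up records: the price field changes from a number to a string.
monday = [{"id": 1, "price": 9.5}]
tuesday = [{"id": 2, "price": "9.5"}]
drift = schema_drift(monday, tuesday)
```

An empty `removed` and `added` list means the two deliveries share the same fields and types, so downstream loaders need no changes.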

The quality and freshness of the data packages is the reason we can keep our models up-to-date. What used to take weeks of scraping infrastructure work now arrives in our S3 bucket automatically.

Data Engineering Team
Enterprise AI platform
$ FAQ

Common questions

A data package is a pre-built, continuously refreshed dataset extracted from a specific public web source. Each package covers a defined vertical (e.g. Amazon product listings, Reddit posts, SEC filings) with a consistent schema, ready to deliver to your storage on your schedule.
Refresh cadence depends on the package and your plan. Most packages offer daily or weekly updates. Some high-velocity datasets (news, pricing, job listings) support continuous or near-real-time feeds. Delivery cadence is fully configurable.
Yes. Free sample files are available for every package in the catalog. You can download and inspect the schema and content before purchasing. No credit card required for samples.
JSON, JSONL, CSV, Parquet, and custom schemas. For AI training use cases, JSONL with instruction/input/output formatting is available. For RAG pipelines, chunked text with source metadata and optional embeddings is supported.
Yes. If your source or schema isn't in the catalog, our team can set up a custom extraction pipeline. Contact us with your target source, required fields, and delivery cadence. Most custom datasets go live in 1–5 business days.
All packages collect only publicly available information and comply with GDPR, CCPA, and applicable data protection laws. Bright Data holds a SOC 2 Type II attestation and ISO 27001:2022 certification, and has won court cases against Meta and X that established legal precedent for ethical data collection.
Packages are priced by category, volume, and delivery cadence. One-time snapshots are the most cost-efficient option. Recurring feeds are priced per delivery. Enterprise plans include volume discounts and custom SLAs. Contact us for a quote.

The catalog has what you need.

200+ datasets. Any vertical. Your cloud. Start with a free sample.

No credit card required for free tier