
Bulk Enrichment

Enrich existing entities at scale

Learn how to enrich large datasets efficiently, including matching strategies, handling partial matches, and considerations for processing high-volume data.

Prerequisites

Before starting, ensure you have a Linkt API key and a CSV file containing the entities you want to enrich.

Understanding Bulk Enrichment

Bulk enrichment adds data to entities that already exist in your sheets or are being imported from CSV files. Unlike discovery searches, enrichment focuses on gathering additional information about known entities.

Enrichment vs Discovery

| Aspect | Discovery | Enrichment |
| --- | --- | --- |
| Entity source | Search criteria | CSV file or existing sheet |
| Primary goal | Find new entities | Add data to known entities |
| Entity matching | N/A | By name, domain, or identifier |
| Typical use | Lead generation | Data enhancement |

Enrichment via ICP Configuration

Define enrichment fields in your ICP's entity target description:

curl -X POST "https://api.linkt.ai/v1/icp" \
  -H "x-api-key: your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Company Enrichment",
    "description": "Enrich imported companies with additional data",
    "mode": "discovery",
    "entity_targets": [
      {
        "entity_type": "company",
        "root": true,
        "description": "## Enrichment Fields\n- primary_product: Main product or service offered\n- tech_stack: Key technologies and frameworks used\n- funding_status: Latest funding round and amount\n- employee_growth: Year-over-year headcount change\n- key_customers: Notable customer names\n- recent_news: Latest company announcements"
      }
    ]
  }'

Matching Strategies

When importing CSV data, Linkt matches rows to real-world entities using various strategies.

Match by Company Name

The primary column contains company names:

{
  "task_config": {
    "type": "ingest",
    "file_id": "507f1f77bcf86cd799439001",
    "primary_column": "company_name",
    "csv_entity_type": "company"
  }
}

CSV Example:

company_name,industry,location
Acme Corporation,Software,San Francisco
TechStartup Inc,SaaS,New York

Match by Domain

Include website domains for more accurate matching:

company_name,website,industry
Acme Corporation,acme.com,Software
TechStartup Inc,techstartup.io,SaaS

Linkt uses the domain to verify entity identity and improve match accuracy.

Match by LinkedIn URL

For person entities, LinkedIn URLs provide the most reliable matching:

full_name,linkedin_url,company
Sarah Chen,https://linkedin.com/in/sarachen,TechCorp
James Wilson,https://linkedin.com/in/jameswilson,DataFlow

Matching Best Practices

| Strategy | Accuracy | Best For |
| --- | --- | --- |
| Name + Domain | Highest | Company enrichment |
| LinkedIn URL | Highest | Person enrichment |
| Name only | Good | When domain unavailable |
| Name + Location | Good | Common company names |

Handling Partial Matches

Not every CSV row will match an entity cleanly. Check the run queue to see which rows matched and which were discarded, then decide how to handle the misses.

Match States

| State | Description | Action |
| --- | --- | --- |
| completed | Entity matched and enriched | Data saved to sheet |
| discarded | No match found | Skipped, logged in queue |
| processing | Currently being matched | In progress |

Monitoring Match Status

Check the run queue for match results:

curl -X GET "https://api.linkt.ai/v1/run/{run_id}/queue" \
  -H "x-api-key: your-api-key"

Response:

{
  "items": [
    {
      "row_data": {"company_name": "Acme Corp", "industry": "Software"},
      "state": "completed",
      "entity_id": "507f1f77bcf86cd799439015"
    },
    {
      "row_data": {"company_name": "Unknown Inc", "industry": "Tech"},
      "state": "discarded",
      "reason": "Entity not found"
    }
  ],
  "total": 150,
  "completed": 142,
  "discarded": 8
}
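
The totals in this response are enough to compute a match rate without inspecting every item. A minimal sketch, assuming the queue fields shown above and the API_KEY/BASE_URL constants used in the later Python examples:

import requests

API_KEY = "your-api-key"
BASE_URL = "https://api.linkt.ai/v1"

def summarize_match_rate(run_id):
    """Fetch the run queue and report how many rows matched."""
    queue = requests.get(
        f"{BASE_URL}/run/{run_id}/queue",
        headers={"x-api-key": API_KEY}
    ).json()

    total = queue.get("total", 0)
    completed = queue.get("completed", 0)
    discarded = queue.get("discarded", 0)
    rate = (completed / total * 100) if total else 0
    print(f"Matched {completed}/{total} rows ({rate:.1f}%), {discarded} discarded")
    return rate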

Reviewing Discarded Rows

Filter to see only discarded items:

curl -X GET "https://api.linkt.ai/v1/run/{run_id}/queue?state=discarded" \
  -H "x-api-key: your-api-key"

Improving Match Rates

  1. Clean data before import — Fix typos, standardize names (see the cleanup sketch after this list)
  2. Include domains — Add website column for companies
  3. Use LinkedIn URLs — Most reliable for people
  4. Remove duplicates — Dedupe CSV before upload
  5. Verify entity existence — Some companies may have been acquired or renamed
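
The cleanup sketch below applies points 1 and 4 and normalizes the website column from point 2 using pandas. The column names (company_name, website) follow the CSV examples earlier; adjust them to your file:

import pandas as pd

def clean_company_csv(in_path, out_path):
    """Normalize names, reduce websites to bare domains, and drop duplicates before upload."""
    df = pd.read_csv(in_path)

    # Standardize company names: trim whitespace, collapse repeated spaces
    df["company_name"] = (
        df["company_name"].astype(str).str.strip().str.replace(r"\s+", " ", regex=True)
    )

    # Reduce websites to bare domains so they match cleanly
    if "website" in df.columns:
        df["website"] = (
            df["website"].astype(str)
            .str.replace(r"^https?://", "", regex=True)
            .str.replace(r"^www\.", "", regex=True)
            .str.rstrip("/")
            .str.lower()
        )

    # Dedupe on the matching column, keeping the first occurrence
    df = df.drop_duplicates(subset=["company_name"], keep="first")
    df.to_csv(out_path, index=False)
    print(f"Wrote {len(df)} cleaned rows to {out_path}")

clean_company_csv("companies_raw.csv", "companies_clean.csv")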

Enrichment Data Handling

Understand how enrichment data is stored and merged.

Data Storage Pattern

Each enriched field follows the EntityAttribute structure:

{
  "field_name": {
    "value": "The enriched value",
    "references": ["https://source-url.com"],
    "created_at": "2025-01-06T10:00:00Z",
    "updated_at": "2025-01-06T10:00:00Z"
  }
}

Field Update Behavior

| Scenario | Behavior |
| --- | --- |
| New field | Created with value |
| Existing field, new value | Updated with new value |
| Existing field, null result | Original value preserved |
| Re-enrichment | Updates timestamps |
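
If you keep local copies of entity data, the rules in the table can be modeled in a few lines. This is an illustrative sketch of the behavior described above, not the service's own implementation:

from datetime import datetime, timezone

def apply_enrichment(entity_data, field_name, new_value, references=None):
    """Apply one enriched value to a local copy of entity data, following the table above."""
    now = datetime.now(timezone.utc).isoformat()
    current = entity_data.get(field_name)

    if new_value is None:
        # Existing field, null result: original value preserved
        return entity_data

    if current is None:
        # New field: created with value
        entity_data[field_name] = {
            "value": new_value,
            "references": references or [],
            "created_at": now,
            "updated_at": now,
        }
    else:
        # Existing field, new value / re-enrichment: value and updated_at refreshed
        current["value"] = new_value
        if references:
            current["references"] = references
        current["updated_at"] = now

    return entity_data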

Preserving Original Data

Original CSV data is preserved alongside enriched data:

{
  "name": {
    "value": "Acme Corporation",
    "references": ["csv_import"]
  },
  "website": {
    "value": "https://acme.com",
    "references": ["csv_import"]
  },
  "primary_product": {
    "value": "Enterprise CRM platform",
    "references": ["https://acme.com/products"]
  }
}
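
Because imported fields carry a csv_import reference while enriched fields cite source URLs, you can split the two when reading an entity back. A small sketch assuming the attribute shape shown above:

def split_by_source(entity_data):
    """Separate fields that came from the CSV import from fields added by enrichment."""
    original, enriched = {}, {}
    for field, attr in entity_data.items():
        if "csv_import" in attr.get("references", []):
            original[field] = attr.get("value")
        else:
            enriched[field] = attr.get("value")
    return original, enriched

# Using the structure shown above
original, enriched = split_by_source({
    "name": {"value": "Acme Corporation", "references": ["csv_import"]},
    "primary_product": {"value": "Enterprise CRM platform", "references": ["https://acme.com/products"]},
})
print(original)   # {'name': 'Acme Corporation'}
print(enriched)   # {'primary_product': 'Enterprise CRM platform'}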

Large Dataset Considerations

Handle high-volume enrichment efficiently.

Batch Size Guidelines

| Dataset Size | Approach | Notes |
| --- | --- | --- |
| 1-100 rows | Single import | Direct processing |
| 100-500 rows | Single import | Monitor queue progress |
| 500-1000 rows | Consider splitting | Better error handling |
| 1000+ rows | Multiple batches | Recommended |

Splitting Large Files

import pandas as pd
 
def split_csv(file_path, batch_size=500):
    """Split large CSV into smaller batches."""
    df = pd.read_csv(file_path)
    num_batches = (len(df) + batch_size - 1) // batch_size
 
    for i in range(num_batches):
        start = i * batch_size
        end = start + batch_size
        batch_df = df[start:end]
        batch_df.to_csv(f"batch_{i+1}.csv", index=False)
        print(f"Created batch_{i+1}.csv with {len(batch_df)} rows")
 
split_csv("large_dataset.csv", batch_size=500)

Parallel Processing

Process multiple batches concurrently:

import requests
import time
from concurrent.futures import ThreadPoolExecutor
 
API_KEY = "your-api-key"
BASE_URL = "https://api.linkt.ai/v1"
HEADERS = {"x-api-key": API_KEY, "Content-Type": "application/json"}
 
def process_batch(file_path, icp_id, sheet_id):
    """Process a single batch file."""
 
    # Upload file
    with open(file_path, "rb") as f:
        upload = requests.post(
            f"{BASE_URL}/files/upload",
            headers={"x-api-key": API_KEY},
            files={"file": f}
        ).json()
 
    # Create and execute task
    task = requests.post(
        f"{BASE_URL}/task",
        headers=HEADERS,
        json={
            "name": f"Enrich {file_path}",
            "flow_name": "ingest",
            "deployment_name": "ingest/v1",
            "sheet_id": sheet_id,
            "task_config": {
                "type": "ingest",
                "file_id": upload["file_id"],
                "primary_column": "company_name",
                "csv_entity_type": "company"
            }
        }
    ).json()
 
    run = requests.post(
        f"{BASE_URL}/task/{task['id']}/execute",
        headers=HEADERS,
        json={"icp_id": icp_id}
    ).json()
 
    # Wait for completion
    while True:
        status = requests.get(
            f"{BASE_URL}/run/{run['run_id']}",
            headers={"x-api-key": API_KEY}
        ).json()
 
        if status["status"] in ["COMPLETED", "FAILED", "CANCELED", "CRASHED"]:
            return {
                "file": file_path,
                "status": status["status"],
                "run_id": run["run_id"]
            }
 
        time.sleep(10)  # Poll every 10 seconds
 
# Process batches in parallel (limit concurrency)
# icp_id and sheet_id refer to an existing ICP and sheet created beforehand
batch_files = ["batch_1.csv", "batch_2.csv", "batch_3.csv"]
 
with ThreadPoolExecutor(max_workers=3) as executor:
    futures = [
        executor.submit(process_batch, f, icp_id, sheet_id)
        for f in batch_files
    ]
 
    for future in futures:
        result = future.result()
        print(f"{result['file']}: {result['status']}")

Rate Limiting

Be mindful of API rate limits:

import time
 
def process_with_rate_limit(items, delay_seconds=1):
    """Process items with rate limiting."""
    for i, item in enumerate(items):
        process_item(item)  # process_item stands in for your per-item API call
 
        # Add delay between requests
        if i < len(items) - 1:
            time.sleep(delay_seconds)

Progress Monitoring

Track progress across large imports:

def monitor_enrichment_progress(run_ids):
    """Monitor multiple runs and report progress."""
    while True:
        all_complete = True
        total_completed = 0
        total_items = 0
 
        for run_id in run_ids:
            status = requests.get(
                f"{BASE_URL}/run/{run_id}",
                headers={"x-api-key": API_KEY}
            ).json()
 
            if status["status"] not in ["COMPLETED", "FAILED", "CANCELED", "CRASHED"]:
                all_complete = False
 
            # Get queue progress
            queue = requests.get(
                f"{BASE_URL}/run/{run_id}/queue",
                headers={"x-api-key": API_KEY}
            ).json()
 
            total_completed += queue.get("completed", 0)
            total_items += queue.get("total", 0)
 
        progress = (total_completed / total_items * 100) if total_items > 0 else 0
        print(f"Progress: {total_completed}/{total_items} ({progress:.1f}%)")
 
        if all_complete:
            break
 
        time.sleep(10)
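
Usage is a one-liner once you have collected the run IDs from the execute responses (the constants and imports come from the parallel processing example above):

run_ids = ["run-id-1", "run-id-2", "run-id-3"]  # placeholders; collect these from each execute response
monitor_enrichment_progress(run_ids)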

Complete Example

Full bulk enrichment workflow:

import requests
import time
import pandas as pd
 
API_KEY = "your-api-key"
BASE_URL = "https://api.linkt.ai/v1"
HEADERS = {"x-api-key": API_KEY, "Content-Type": "application/json"}
 
def bulk_enrich(csv_path, enrichment_fields, batch_size=500):
    """
    Bulk enrich companies from a CSV file.
 
    Args:
        csv_path: Path to CSV file with company data
        enrichment_fields: List of fields to enrich
        batch_size: Rows per batch (default 500)
    """
 
    # Step 1: Create enrichment ICP
    print("Creating enrichment ICP...")
    fields_description = "\n".join([
        f"- {field['name']}: {field['description']}"
        for field in enrichment_fields
    ])
 
    icp = requests.post(
        f"{BASE_URL}/icp",
        headers=HEADERS,
        json={
            "name": "Bulk Enrichment",
            "description": "Enrich imported company data",
            "mode": "discovery",
            "entity_targets": [{
                "entity_type": "company",
                "root": True,
                "description": f"## Enrichment Fields\n{fields_description}"
            }]
        }
    ).json()
    icp_id = icp["id"]
 
    # Step 2: Create sheet
    print("Creating sheet...")
    sheet = requests.post(
        f"{BASE_URL}/sheet",
        headers=HEADERS,
        json={
            "name": "Enriched Companies",
            "icp_id": icp_id,
            "entity_type": "company"
        }
    ).json()
    sheet_id = sheet["id"]
 
    # Step 3: Split and process CSV
    df = pd.read_csv(csv_path)
    total_rows = len(df)
    num_batches = (total_rows + batch_size - 1) // batch_size
 
    print(f"Processing {total_rows} rows in {num_batches} batches...")
 
    run_ids = []
    for i in range(num_batches):
        start = i * batch_size
        end = min(start + batch_size, total_rows)
        batch_df = df[start:end]
 
        # Save batch to temp file
        batch_path = f"/tmp/batch_{i+1}.csv"
        batch_df.to_csv(batch_path, index=False)
 
        print(f"Processing batch {i+1}/{num_batches} ({len(batch_df)} rows)...")
 
        # Upload batch
        with open(batch_path, "rb") as f:
            upload = requests.post(
                f"{BASE_URL}/files/upload",
                headers={"x-api-key": API_KEY},
                files={"file": f}
            ).json()
 
        # Create task
        task = requests.post(
            f"{BASE_URL}/task",
            headers=HEADERS,
            json={
                "name": f"Enrich Batch {i+1}",
                "flow_name": "ingest",
                "deployment_name": "ingest/v1",
                "sheet_id": sheet_id,
                "task_config": {
                    "type": "ingest",
                    "file_id": upload["file_id"],
                    "primary_column": "company_name",
                    "csv_entity_type": "company"
                }
            }
        ).json()
 
        # Execute
        run = requests.post(
            f"{BASE_URL}/task/{task['id']}/execute",
            headers=HEADERS,
            json={"icp_id": icp_id}
        ).json()
 
        run_ids.append(run["run_id"])
 
        # Small delay between batches
        time.sleep(2)
 
    # Step 4: Wait for all batches
    print("\nWaiting for completion...")
    while True:
        all_complete = True
        completed_count = 0
 
        for run_id in run_ids:
            status = requests.get(
                f"{BASE_URL}/run/{run_id}",
                headers={"x-api-key": API_KEY}
            ).json()
 
            if status["status"] == "COMPLETED":
                completed_count += 1
            elif status["status"] in ["FAILED", "CANCELED", "CRASHED"]:
                completed_count += 1
                print(f"  Run {run_id} failed: {status.get('error')}")
            else:
                all_complete = False
 
        print(f"  Batches complete: {completed_count}/{len(run_ids)}")
 
        if all_complete:
            break
 
        time.sleep(10)
 
    # Step 5: Get results
    print("\nRetrieving enriched entities...")
    entities = requests.get(
        f"{BASE_URL}/sheet/{sheet_id}/entities",
        headers={"x-api-key": API_KEY},
        params={"page_size": 100}
    ).json()
 
    print(f"\nEnrichment complete!")
    print(f"  Total entities: {entities['total']}")
    print(f"  ICP ID: {icp_id}")
    print(f"  Sheet ID: {sheet_id}")
 
    return {
        "icp_id": icp_id,
        "sheet_id": sheet_id,
        "total_entities": entities["total"],
        "entities": entities["entities"]
    }
 
# Run bulk enrichment
result = bulk_enrich(
    csv_path="companies.csv",
    enrichment_fields=[
        {"name": "primary_product", "description": "Main product or service offered"},
        {"name": "tech_stack", "description": "Key technologies used"},
        {"name": "funding_status", "description": "Latest funding round and amount"},
        {"name": "employee_count", "description": "Current number of employees"},
        {"name": "key_customers", "description": "Notable customer names"}
    ],
    batch_size=500
)
 
# Print sample results
print("\nSample enriched data:")
for entity in result["entities"][:3]:
    name = entity["data"]["name"]["value"]
    product = entity["data"].get("primary_product", {}).get("value", "N/A")
    tech = entity["data"].get("tech_stack", {}).get("value", "N/A")
    print(f"\n  {name}")
    print(f"    Product: {product}")
    print(f"    Tech: {tech}")

Best Practices

Data Preparation

  1. Clean your CSV — Remove duplicates, fix obvious errors
  2. Include identifiers — Add domain or LinkedIn URL when available
  3. Standardize formats — Consistent company name formatting
  4. Validate encoding — Save the file as UTF-8

Enrichment Efficiency

  1. Focus on high-value fields — Don't request unnecessary data
  2. Batch appropriately — 500-row batches work well
  3. Monitor progress — Track queue completion rates
  4. Handle failures — Implement retry logic for transient errors (see the retry sketch after this list)
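
For point 4, a small wrapper with exponential backoff covers most transient failures (timeouts, 429 rate limits, 5xx responses). A sketch, not tied to any particular Linkt endpoint:

import time
import requests

def request_with_retries(method, url, max_attempts=4, backoff_seconds=2, **kwargs):
    """Retry a request on timeouts, rate limits (429), and server errors (5xx)."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.request(method, url, timeout=30, **kwargs)
        except requests.RequestException as exc:  # connection errors, timeouts
            last_error = str(exc)
        else:
            if response.status_code in (429, 500, 502, 503, 504):
                last_error = f"HTTP {response.status_code}"
            else:
                response.raise_for_status()  # surface non-retryable errors immediately
                return response
        if attempt < max_attempts:
            # Exponential backoff: 2s, 4s, 8s, ...
            time.sleep(backoff_seconds * (2 ** (attempt - 1)))
    raise RuntimeError(f"Request to {url} failed after {max_attempts} attempts: {last_error}")

# Example: poll a run with retries
# status = request_with_retries("GET", f"{BASE_URL}/run/{run_id}", headers={"x-api-key": API_KEY}).json()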

Quality Assurance

  1. Spot check results — Review a sample of enriched entities
  2. Track match rates — Monitor discarded rows
  3. Validate data quality — Check enrichment accuracy
  4. Export and backup — Save enriched data externally (see the backup sketch below)
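
For the export point, the same sheet entities endpoint used in the complete example can feed a local backup. A sketch; page_size is the only pagination parameter shown in this guide, so larger sheets may need additional paging per the API's pagination scheme:

import json
import requests

API_KEY = "your-api-key"
BASE_URL = "https://api.linkt.ai/v1"

def backup_sheet(sheet_id, out_path="enriched_backup.json"):
    """Save a snapshot of enriched entities to a local JSON file."""
    entities = requests.get(
        f"{BASE_URL}/sheet/{sheet_id}/entities",
        headers={"x-api-key": API_KEY},
        params={"page_size": 100}
    ).json()

    with open(out_path, "w") as f:
        json.dump(entities, f, indent=2)

    saved = len(entities.get("entities", []))
    print(f"Backed up {saved} of {entities.get('total', 0)} entities to {out_path}")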

Next Steps