Save Document to Table Block

Save documents to tables within Scout collections

The Save Document to Table block allows you to save data from your workflow into a table within a collection. This block is perfect for storing workflow results, logging actions, or building databases from automated processes.

Overview

This block saves a document (row) to a specified table in one of your collections. Each field in the document can pull values from previous blocks in your workflow using template variables, making it easy to build dynamic workflows that store structured data.

When to Use This Block

Use the Save Document to Table block when you want to:

  • ✅ Store workflow results in a structured format
  • ✅ Log workflow executions for audit trails
  • ✅ Build databases from automated processes
  • ✅ Save data extracted from previous blocks
  • ✅ Create records that can be viewed and managed in your Collections interface
  • ✅ Build searchable knowledge bases with semantic search capabilities

Configuration

Required Fields

  1. Collection - Select the collection where you want to save the document

    • You can create a new collection directly from this dropdown if needed
  2. Table - Select the table within the collection

    • The table must exist in the selected collection
    • You can create a new table directly from this dropdown if needed
  3. Values - Configure one or more fields to save:

    • Column: Select which column in the table to populate
    • Type: Choose the data type (string, number, boolean, or JSON)
    • Value: Enter the value using Jinja2 template syntax

Field Configuration

For each field you want to save, you need to specify:

  • Column: The column ID from your table schema
  • Type: One of four supported types:
    • string - Text values
    • number - Numeric values (integers or decimals)
    • boolean - True/false values
    • json - Structured JSON data
  • Value: A Jinja2 template that can reference data from previous blocks

Template Variables

You can use Jinja2 template syntax in the Value field to reference data from previous blocks in your workflow. The template has access to all the state from previous blocks.

Accessing Block Output

To reference data from a previous block, use the block’s ID followed by the output path:

{{ block_id.output.field_name }}

For example, if you have a block with ID extract_data that outputs:

{
  "name": "John Doe",
  "email": "john@example.com",
  "score": 95
}

You would reference these values like:

  • {{ extract_data.output.name }} → “John Doe”
  • {{ extract_data.output.email }} → “john@example.com”
  • {{ extract_data.output.score }} → 95
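
To make the lookup concrete, here is a minimal Python sketch of how a dotted reference resolves against workflow state. It is a simplified stand-in, not the Jinja2 engine: `resolve`, `render`, and the shape of `state` are assumptions made for illustration, and only plain `{{ a.b.c }}` lookups are handled.

```python
import re

# Hypothetical workflow state: block IDs mapped to their outputs.
state = {
    "extract_data": {
        "output": {"name": "John Doe", "email": "john@example.com", "score": 95}
    }
}

# Matches placeholders such as {{ extract_data.output.name }}.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def resolve(path, data):
    """Walk a dotted path such as 'extract_data.output.name' through nested dicts."""
    value = data
    for key in path.split("."):
        value = value[key]
    return value


def render(template, data):
    """Replace each {{ dotted.path }} placeholder with its value as text."""
    return PLACEHOLDER.sub(lambda m: str(resolve(m.group(1), data)), template)


print(render("{{ extract_data.output.name }}", state))
# John Doe
print(render("User: {{ extract_data.output.name }} ({{ extract_data.output.email }})", state))
# User: John Doe (john@example.com)
```

Note that the rendered result is always text, even for the numeric score. That is why each field also carries a Type.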

Common Template Patterns

Simple text value:

{{ extract_data.output.name }}

Concatenated strings:

{{ extract_data.output.first_name }} {{ extract_data.output.last_name }}

Formatted text:

User: {{ extract_data.output.name }} ({{ extract_data.output.email }})

Direct number:

{{ calculate_score.output.total }}

Boolean from condition:

{{ check_status.output.is_active }}

JSON object:

{{ "{\"key\": \"value\", \"number\": 123}" }}

Or reference a JSON object from another block:

{{ previous_block.output.metadata }}

Conditional values:

{% if check_status.output.is_active %}Active{% else %}Inactive{% endif %}

Using date/time from previous blocks:

{{ extract_data.output.created_at }}

Type Casting

The block automatically converts values to match the selected type. This helps ensure data consistency:

  • String: Converts any value to text
  • Number: Converts strings like “123” to the number 123
  • Boolean: Converts truthy/falsy values to true/false
  • JSON: Parses JSON strings into structured objects

Example: Type Conversions

If you set the type to number and the template evaluates to "42", it will be saved as the number 42, not the string "42".
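
The conversions described above can be sketched in Python. This is an illustration only: `cast_value` is a hypothetical helper, and the exact rules the block applies (for example, which strings count as true) are assumptions rather than documented behavior.

```python
import json


def cast_value(rendered, type_name):
    """Convert rendered template text to the selected column type."""
    text = rendered.strip()
    if type_name == "string":
        return rendered
    if type_name == "number":
        # Prefer int for whole numbers, otherwise fall back to float.
        try:
            return int(text)
        except ValueError:
            return float(text)
    if type_name == "boolean":
        # Assumed truthy spellings; the block's real rules may differ.
        return text.lower() in {"true", "1", "yes"}
    if type_name == "json":
        return json.loads(text)
    raise ValueError(f"unsupported type: {type_name}")


print(cast_value("42", "number"))      # 42
print(cast_value("3.5", "number"))     # 3.5
print(cast_value("True", "boolean"))   # True
print(cast_value('{"key": "value", "number": 123}', "json"))
# {'key': 'value', 'number': 123}
```

A value that cannot be converted (such as "abc" with type number) raises an error in this sketch; how the block itself reports such failures is not covered here.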

Step-by-Step Example

Let’s walk through saving a customer record after extracting data from an email:

  1. Add the block to your workflow

  2. Select Collection: Choose “Customer Database”

  3. Select Table: Choose “Customers”

  4. Configure Fields:

    • Field 1:
      • Column: full_name
      • Type: string
      • Value: {{ extract_email.output.customer_name }}
    • Field 2:
      • Column: email_address
      • Type: string
      • Value: {{ extract_email.output.email }}
    • Field 3:
      • Column: signup_date
      • Type: string
      • Value: {{ extract_email.output.date }}
    • Field 4:
      • Column: is_premium
      • Type: boolean
      • Value: {{ check_tier.output.premium }}
  5. Run the workflow - The document will be saved to your table

Complete Example Workflow

Here’s a complete example workflow that processes a support ticket and saves it:

1. Trigger: Webhook receives support ticket
2. Extract Data: Parse ticket JSON
3. Enrich Data: Look up customer information
4. Save Document to Table:
   - Collection: Support Tickets
   - Table: Tickets
   - Fields:
     - ticket_id (string): {{ extract_data.output.id }}
     - customer_name (string): {{ enrich_data.output.customer.name }}
     - subject (string): {{ extract_data.output.subject }}
     - priority (string): {{ extract_data.output.priority }}
     - created_at (string): {{ extract_data.output.created_at }}
     - metadata (json): {{ extract_data.output }}

Collection
string, Required

The collection to save the document to. Ensure that the collection ID is correct to avoid saving data to the wrong collection.

Table
string, Required

The table to save to. Ensure that the table ID is correct and exists within the desired collection.

Values
list

A list of field mappings to save. Each field maps a column in your table to a value. The default is an empty list. This field supports Jinja2 template syntax, allowing for dynamic content generation.

Each item in the Values list contains:

  • Column (string, required): The column ID in the target table where the value will be saved
  • Type (string, required): The data type of the value. Must be one of: “string”, “number”, “boolean”, or “json”
  • Value (string, required): The value to save, using Jinja2 template syntax. This field supports dynamic content generation by referencing data from previous blocks in your workflow.

See Workflow Logic & State > State Management for details on using dynamic variables in this block. For advanced Jinja template features including date/time operations, see Jinja Templates in Workflows.
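
As a rough illustration of the shape described above, the following Python sketch checks a Values list before it is used. The lowercase key names (`column`, `type`, `value`) and the `validate_values` helper are assumptions for this example, not the block's actual schema or API.

```python
ALLOWED_TYPES = {"string", "number", "boolean", "json"}
REQUIRED_KEYS = ("column", "type", "value")


def validate_values(values):
    """Return a list of problems found in a Values list (empty means it looks valid)."""
    problems = []
    for index, item in enumerate(values):
        for key in REQUIRED_KEYS:
            if not isinstance(item.get(key), str):
                problems.append(f"item {index}: '{key}' must be a string")
        item_type = item.get("type")
        if isinstance(item_type, str) and item_type not in ALLOWED_TYPES:
            problems.append(f"item {index}: unknown type '{item_type}'")
    return problems


values = [
    {"column": "ticket_id", "type": "string", "value": "{{ extract_data.output.id }}"},
    {"column": "metadata", "type": "object", "value": "{{ extract_data.output }}"},
    {"column": "priority", "type": "string"},
]

for problem in validate_values(values):
    print(problem)
# item 1: unknown type 'object'
# item 2: 'value' must be a string
```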

When you save documents to a table, you’re creating a database that can be searched in different ways. Tables can be configured with vector indexes, enabling powerful semantic search capabilities alongside traditional keyword search.

What is a Vector Database?

A vector database stores data as embeddings - mathematical representations (vectors) that capture the semantic meaning of text. Each document’s content is converted into a high-dimensional vector (typically 768 dimensions) that represents its meaning in a way that computers can understand and compare.

There are two primary ways to search your saved documents:

Keyword Search (Traditional)

How it works:

  • Searches for exact word matches or phrases
  • Uses techniques like BM25 (a ranking algorithm) or full-text search
  • Looks for specific terms in the document text
  • Fast and precise for exact matches

Best for:

  • Finding documents containing specific terms
  • Searching for exact phrases or names
  • Cases where terminology is consistent
  • Structured queries with specific keywords

Example:

  • Query: "customer support ticket"
  • Finds: Documents that contain the exact words “customer”, “support”, and “ticket”

Semantic Search (Vector)

How it works:

  • Converts your search query into a vector (embedding)
  • Compares the query vector against all document vectors
  • Uses cosine similarity to find documents with similar meaning
  • Understands context, synonyms, and related concepts

Best for:

  • Finding documents with similar meaning, even without exact word matches
  • Natural language queries
  • Concept-based searches
  • Handling synonyms and variations in terminology

Example:

  • Query: "user having trouble accessing their account"
  • Finds: Documents about “login issues”, “authentication problems”, “account access errors” - even if they don’t contain the exact words from your query

Many tables support hybrid search, which combines both approaches:

  • Keyword search ensures you find exact matches and important terms
  • Semantic search finds conceptually similar content
  • Results are combined and ranked intelligently using Reciprocal Rank Fusion (RRF)

Benefits:

  • More comprehensive results
  • Balances precision (keyword) with recall (semantic)
  • Better handles queries with both specific terms and conceptual intent

Example:

  • Query: "customer complaint about billing"
  • Keyword search finds: Documents with “billing”, “complaint”, “customer”
  • Semantic search finds: Documents about “payment issues”, “invoice problems”, “account charges”
  • Hybrid combines both for the best results

Choosing between hybrid search and pure semantic search depends on your use case, query types, and data characteristics. Here’s a practical guide:

Use Hybrid Search When:

1. You Need Both Precision and Recall

  • ✅ Users search with both specific terms AND natural language
  • ✅ You want to catch exact matches while also finding conceptually similar content
  • ✅ Your data contains both technical terms and descriptive content
  • ✅ You need to balance finding exact keywords with understanding intent

Example Use Cases:

  • Knowledge bases where users might search for “API documentation” (keyword) or “how to integrate with our service” (semantic)
  • Support ticket systems where users search by ticket number (keyword) or describe their problem (semantic)
  • Product catalogs with both SKU numbers (keyword) and product descriptions (semantic)

2. Your Queries Mix Specific Terms with Concepts

  • Query: "React hooks useState tutorial"
  • Hybrid search finds: Documents with “React”, “hooks”, “useState” (keyword) AND documents about “state management in React” (semantic)

3. You Want Maximum Coverage

  • Hybrid search ensures you don’t miss results that might only appear in one search method
  • Better for general-purpose search where query types vary

4. Your Data Has Technical Terminology

  • Technical terms, product names, or codes need exact matching
  • But you also want to find related concepts and explanations

Use Pure Semantic Search When:

1. Queries are Primarily Natural Language

  • ✅ Users describe what they’re looking for in conversational language
  • ✅ Exact keyword matching is less important than understanding intent
  • ✅ You want to find conceptually related content even without exact word matches

Example Use Cases:

  • Customer support chatbots where users describe problems
  • Content discovery systems (“find articles about similar topics”)
  • Research and knowledge exploration
  • Recommendation systems

2. Your Data Has Synonym-Rich Content

  • Same concepts expressed in many different ways
  • Terminology varies across documents
  • You want to find all variations of an idea

Example:

  • Query: "feeling overwhelmed at work"
  • Semantic search finds: Documents about “stress management”, “burnout prevention”, “work-life balance” - even without exact word matches

3. You Prioritize Conceptual Understanding

  • Finding related ideas and concepts is more important than exact matches
  • Users explore topics rather than search for specific items

4. Your Queries are Ambiguous or Context-Dependent

  • Query meaning depends on context
  • Semantic search better understands context and intent

Hybrid Search Weighting (Alpha Parameter)

The alpha parameter controls how much weight semantic (vector) search has compared to keyword (BM25) search in hybrid search results. Understanding alpha is crucial for tuning your search to match your specific needs.

Alpha Range:

  • Alpha = 0.0: Pure keyword search (BM25 only, no vector search)
  • Alpha = 0.5: Balanced hybrid (default, equal weighting)
  • Alpha = 1.0: Pure semantic search (vector only, no keyword search)

How Alpha Works in Reciprocal Rank Fusion (RRF)

Hybrid search uses Reciprocal Rank Fusion (RRF) to combine results from both search methods. Alpha controls the vector search contribution in the fusion formula:

Hybrid Score = [1.0 / (k + BM25_rank)] + [alpha / (k + Vector_rank)]

Where:

  • k is a smoothing constant (typically 60) that dampens the outsized influence of the very top ranks
  • BM25_rank is the document’s rank from keyword search (1st = rank 1, 2nd = rank 2, etc.)
  • Vector_rank is the document’s rank from semantic search
  • alpha multiplies the vector search contribution

What This Means:

  • Lower alpha: Vector search has less influence on final rankings
  • Higher alpha: Vector search has more influence on final rankings
  • BM25 always contributes: Even with alpha = 0, BM25 is still part of the formula (but vector search is effectively ignored when alpha = 0)

Understanding Alpha Values

Alpha = 0.0 - Pure Keyword Search

  • Only keyword (BM25) results are considered
  • Vector search runs but doesn’t affect rankings
  • Best when exact term matching is critical
  • Use for: Technical documentation, code searches, product SKUs

Alpha = 0.1 - 0.3 - Strongly Favor Keyword

  • Keyword search dominates, but semantic search provides some boost
  • Documents that rank well in both methods get extra boost
  • Use for: API documentation, technical specs, structured data search

Alpha = 0.4 - 0.6 - Balanced (Recommended)

  • Both search methods contribute significantly
  • Good balance between precision (keyword) and recall (semantic)
  • Default value (0.5) is a good starting point
  • Use for: General knowledge bases, customer support, most use cases

Alpha = 0.7 - 0.9 - Favor Semantic Search

  • Semantic search has more influence on rankings
  • Still benefits from keyword precision for exact matches
  • Use for: Content discovery, research, natural language queries

Alpha = 1.0 - Pure Semantic Search

  • Only semantic (vector) results are considered
  • Keyword search runs but doesn’t affect rankings
  • Use for: Conceptual exploration, finding related ideas

Visual Example: How Alpha Affects Rankings

Let’s say you search for “React hooks tutorial” and these documents are found:

| Document | BM25 Rank | Vector Rank | Alpha = 0.3 | Alpha = 0.5 | Alpha = 0.8 |
| --- | --- | --- | --- | --- | --- |
| “React Hooks Tutorial” | 1 | 2 | 1st (keyword wins) | 1st (balanced) | 1st (still keyword wins) |
| “React State Management Guide” | 5 | 1 | 2nd (keyword priority) | 2nd (balanced) | 2nd (vector boost) |
| “JavaScript Functions Guide” | 10 | 3 | 3rd | 3rd | 3rd |
| “Frontend Development” | 20 | 4 | 4th | 4th | 4th |

Key Observations:

  • With alpha = 0.3: Keyword ranking dominates, exact matches prioritized
  • With alpha = 0.5: Balanced - “State Management Guide” gets equal boost from both
  • With alpha = 0.8: Semantic search has more influence - conceptually related content ranks higher
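
The rankings in the table can be reproduced with a short Python sketch that applies the hybrid score formula given earlier, using k = 60. The function names are illustrative, and the ranks are the ones from the example, not output from a real search.

```python
def hybrid_score(bm25_rank, vector_rank, alpha, k=60):
    """Hybrid Score = 1 / (k + BM25_rank) + alpha / (k + Vector_rank)."""
    return 1.0 / (k + bm25_rank) + alpha / (k + vector_rank)


# (BM25 rank, vector rank) for each document in the example table.
documents = {
    "React Hooks Tutorial": (1, 2),
    "React State Management Guide": (5, 1),
    "JavaScript Functions Guide": (10, 3),
    "Frontend Development": (20, 4),
}


def rank(docs, alpha):
    """Return document titles ordered by hybrid score, best first."""
    return sorted(docs, key=lambda title: hybrid_score(*docs[title], alpha), reverse=True)


for alpha in (0.3, 0.5, 0.8):
    print(alpha, rank(documents, alpha))
```

For all three alpha values the order matches the table. What changes is the margin: as alpha rises, the gap between “React Hooks Tutorial” and “React State Management Guide” shrinks, because the second document's better vector rank counts for more.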

Adjust Alpha Based On:

Lower Alpha (0.0 - 0.4) - Favor Keyword Search:

Use when:

  • Exact term matching is critical
  • Technical terminology is important
  • Users search with specific keywords
  • Product names, codes, or IDs need to be found

Examples:

  • API documentation (alpha = 0.3)
  • Code repositories (alpha = 0.2)
  • Product catalogs with SKUs (alpha = 0.3)
  • Technical specifications (alpha = 0.3-0.4)

Trade-offs:

  • ✅ Excellent precision for exact matches
  • ✅ Finds specific technical terms
  • ❌ May miss conceptually related content
  • ❌ Less effective for natural language queries

Medium Alpha (0.4 - 0.6) - Balanced:

Use when:

  • You want the best of both worlds
  • Queries mix specific terms and natural language
  • General-purpose search across varied content
  • Most knowledge bases and support systems

Examples:

  • General knowledge bases (alpha = 0.5)
  • Customer support systems (alpha = 0.5)
  • Internal wikis (alpha = 0.5)
  • Documentation with mixed content (alpha = 0.4-0.6)

Trade-offs:

  • ✅ Balanced precision and recall
  • ✅ Handles both keyword and semantic queries well
  • ✅ Good default for most use cases
  • ⚖️ May not be optimal for extreme use cases

Higher Alpha (0.6 - 1.0) - Favor Semantic Search:

Use when:

  • Understanding intent is more important than exact matches
  • Natural language queries are common
  • Finding conceptually similar content
  • Users explore topics rather than search for specific items

Examples:

  • Content discovery (alpha = 0.8)
  • Research and exploration (alpha = 0.7-0.9)
  • Conversational interfaces (alpha = 0.7)
  • Recommendation systems (alpha = 0.8-1.0)

Trade-offs:

  • ✅ Excellent for finding related concepts
  • ✅ Handles synonyms and variations well
  • ✅ Better for natural language
  • ❌ May miss exact keyword matches
  • ❌ Less precise for specific technical terms

How to Choose the Right Alpha Value

Step 1: Start with Default

  • Begin with alpha = 0.5 (balanced)
  • This works well for most use cases

Step 2: Analyze Your Queries

  • Are queries mostly keywords? → Lower alpha (0.3-0.4)
  • Are queries natural language? → Higher alpha (0.7-0.8)
  • Mixed queries? → Keep alpha = 0.5

Step 3: Test and Iterate

  • Try different alpha values with real queries
  • Compare result quality
  • Adjust based on user feedback

Step 4: Monitor Results

  • Track which results users find helpful
  • Identify patterns in successful searches
  • Fine-tune alpha based on data
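
One way to approach Steps 3 and 4 is to sweep alpha over a small set of labeled queries and measure how often the expected document ranks first. The sketch below assumes the hybrid score formula given earlier with k = 60. The evaluation data and helper names are hypothetical, and in practice the ranks would come from your own search results.

```python
def hybrid_score(bm25_rank, vector_rank, alpha, k=60):
    """Hybrid Score = 1 / (k + BM25_rank) + alpha / (k + Vector_rank)."""
    return 1.0 / (k + bm25_rank) + alpha / (k + vector_rank)


def top_result(candidates, alpha):
    """Return the document ID with the highest hybrid score."""
    return max(candidates, key=lambda doc: hybrid_score(*candidates[doc], alpha))


def top1_accuracy(labeled_queries, alpha):
    """Fraction of queries whose expected document ranks first."""
    hits = sum(
        1 for candidates, expected in labeled_queries
        if top_result(candidates, alpha) == expected
    )
    return hits / len(labeled_queries)


# Hypothetical evaluation set: for each query, {doc_id: (bm25_rank, vector_rank)}
# plus the document a user marked as the most helpful.
labeled_queries = [
    ({"a": (1, 3), "b": (2, 1)}, "b"),
    ({"c": (1, 2), "d": (4, 1)}, "c"),
    ({"e": (1, 6), "f": (3, 1)}, "f"),
]

for alpha in (0.0, 0.3, 0.5, 0.8, 1.0):
    print(f"alpha={alpha}: top-1 accuracy={top1_accuracy(labeled_queries, alpha):.2f}")
```

With this made-up data, accuracy improves as alpha increases because two of the three expected documents rank better in vector search. A real evaluation set may point the other way, which is the reason to measure rather than guess.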

Alpha vs. Minimum Similarity

Important: Alpha and minimum similarity threshold serve different purposes:

| Parameter | Purpose | Controls |
| --- | --- | --- |
| Alpha | Weighting between search methods | How much keyword vs. semantic search influences rankings |
| Min Similarity | Quality filter | Which documents are included at all (regardless of method) |

Example:

alpha = 0.3, min_similarity = 0.7
This means:
- Semantic search's contribution to the ranking is scaled by 0.3, so keyword search dominates
- Semantic results must still have similarity ≥ 0.7 to be included
- Even though semantic search contributes less to ranking, the threshold still filters results

Common Alpha Patterns

Technical Documentation:

alpha = 0.3
Reason: Exact API names, function names, and technical terms need precise matching

Customer Support:

alpha = 0.5
Reason: Mix of specific error codes (keyword) and problem descriptions (semantic)

Content Discovery:

alpha = 0.8
Reason: Users explore topics, find related content, not searching for exact terms

Product Search:

alpha = 0.3
Reason: SKUs, product codes need exact matches, but descriptions help too

Research/Exploration:

alpha = 0.7
Reason: Finding related concepts and ideas, not specific items

Advanced: Alpha and Result Quality

When Alpha is Too Low:

  • Exact matches rank highly ✅
  • But conceptually relevant content might be buried
  • Users might miss helpful related information

When Alpha is Too High:

  • Conceptually related content ranks well ✅
  • But exact keyword matches might be ranked lower
  • Users searching for specific terms might be frustrated

The Sweet Spot:

  • Balance that matches your query patterns
  • Usually between 0.4-0.6 for most use cases
  • Adjust based on actual user behavior and feedback

Summary

  • Alpha controls the weight of semantic search in hybrid rankings
  • Range: 0.0 (keyword only) to 1.0 (semantic only), default 0.5 (balanced)
  • Lower alpha (0.0-0.4): Favor keyword search for exact matches
  • Medium alpha (0.4-0.6): Balanced, good for most use cases
  • Higher alpha (0.6-1.0): Favor semantic search for conceptual matching
  • Start with 0.5 and adjust based on your queries and results
  • Alpha is separate from minimum similarity threshold (they control different things)

Decision Matrix

| Scenario | Recommendation | Why |
| --- | --- | --- |
| Technical documentation with exact terms | Hybrid, Alpha 0.3-0.4 | Need keyword precision + some semantic understanding |
| Customer support/knowledge base | Hybrid, Alpha 0.5-0.6 | Mix of specific terms and natural language |
| Content discovery/recommendations | Semantic only or Hybrid Alpha 0.7-0.8 | Concept-based, less about exact matches |
| Product search with SKUs | Hybrid, Alpha 0.3-0.4 | Exact product codes + descriptions |
| Research and exploration | Semantic only or Hybrid Alpha 0.7-1.0 | Finding related concepts and ideas |
| General-purpose search | Hybrid, Alpha 0.5 | Balanced approach for varied queries |

Performance Considerations

Hybrid Search:

  • Slightly slower (runs two searches and combines results)
  • More comprehensive results
  • Better for varied query types

Semantic Search:

  • Faster (single search operation)
  • More focused on conceptual matching
  • Better for natural language queries

Testing Your Choice

Start with hybrid search (alpha = 0.5) and adjust based on:

  1. Query Analysis: Review common queries - are they keyword-heavy or natural language?
  2. Result Quality: Check if results are too focused on keywords (raise alpha) or missing exact matches (lower alpha)
  3. User Feedback: Monitor which results users find most relevant
  4. A/B Testing: Try different alpha values and compare user engagement

Tip: For most use cases, hybrid search with alpha = 0.5 is a good starting point. Adjust based on your specific needs.

How Vector Search Works in Practice

When you save a document to a table with vector indexing enabled:

  1. Document Processing: The document’s content is automatically converted into a vector embedding using a machine learning model (typically Google’s text-embedding-004)

  2. Storage: The vector is stored alongside your document data in the table

  3. Search Time: When someone searches:

    • The search query is converted into a query vector
    • The system compares this vector against all document vectors
    • Documents are ranked by similarity (cosine distance)
    • Results are returned sorted by relevance
  4. Similarity Threshold: You can set a minimum similarity threshold (0-1) to filter out irrelevant results

Understanding Minimum Similarity Scores and Vector Distance

When working with vector search, understanding similarity scores and vector distance is crucial for getting the right results. These concepts control how relevant your search results are.

What is Vector Distance?

Vector distance measures how far apart two vectors are in high-dimensional space. In semantic search, we use cosine distance to compare query vectors with document vectors.

How Cosine Distance Works:

  • Distance = 0.0: Vectors point in the exact same direction (identical meaning)
  • Distance = 1.0: Vectors are orthogonal (unrelated meaning)
  • Distance = 2.0: Vectors point in opposite directions (rare for text embeddings in practice, so distances usually fall between 0 and 1)

Visual Analogy: Think of vectors as arrows in space. Cosine distance (1 minus the cosine of the angle) measures the angle between arrows:

  • 0° angle (distance = 0): Arrows point the same direction → Very similar
  • 90° angle (distance = 1.0): Arrows are perpendicular → Unrelated
  • 180° angle (distance = 2.0): Arrows point opposite directions → Opposite

What is Similarity Score?

Similarity score is the complement of distance (the two always sum to 1), which makes it more intuitive to work with:

Similarity = 1 - Distance

Similarity Score Range:

  • Similarity = 1.0: Perfect match (distance = 0)
  • Similarity = 0.5: Moderate similarity (distance = 0.5)
  • Similarity = 0.0: No similarity (distance = 1.0)

Why Use Similarity Instead of Distance?

  • Higher numbers = better matches (more intuitive)
  • Easier to understand thresholds (“I want results with at least 0.7 similarity”)
  • Standard practice in search systems

Minimum Similarity Threshold

The minimum similarity threshold (also called min_similarity) filters out results that aren’t similar enough to your query. Only documents with similarity scores at or above the threshold are returned.

How It Works:

  1. System calculates similarity for all documents
  2. Filters out documents below the threshold
  3. Returns only documents that meet or exceed the minimum similarity

Example:

Query: "customer support"
Document A: Similarity = 0.85 ✅ (above threshold of 0.7)
Document B: Similarity = 0.65 ❌ (below threshold of 0.7)
Document C: Similarity = 0.92 ✅ (above threshold of 0.7)
With min_similarity = 0.7, only Documents A and C are returned.
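
The same filtering step can be written as a short Python sketch. The `filter_by_similarity` helper is illustrative, and the scores are the ones from the example above.

```python
def filter_by_similarity(results, min_similarity):
    """Keep results whose similarity meets or exceeds the threshold, best first."""
    kept = [(doc, score) for doc, score in results if score >= min_similarity]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


results = [("Document A", 0.85), ("Document B", 0.65), ("Document C", 0.92)]

print(filter_by_similarity(results, 0.7))
# [('Document C', 0.92), ('Document A', 0.85)]
print(filter_by_similarity(results, 0.6))
# [('Document C', 0.92), ('Document A', 0.85), ('Document B', 0.65)]
```

Lowering the threshold from 0.7 to 0.6 brings Document B back into the results, which is the trade-off between precision and recall described in the next section.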

Choosing the Right Threshold

The threshold you choose depends on your use case and quality requirements:

Low Threshold (0.0 - 0.4):

  • Use when: You want maximum recall (find everything potentially relevant)
  • Trade-off: May include less relevant results
  • Best for: Exploratory searches, research, content discovery
  • Example: min_similarity = 0.3 for finding all related articles

Medium Threshold (0.4 - 0.7):

  • Use when: You want balanced precision and recall
  • Trade-off: Good balance between relevance and coverage
  • Best for: General-purpose search, knowledge bases
  • Example: min_similarity = 0.5 for customer support knowledge base

High Threshold (0.7 - 1.0):

  • Use when: You need high precision (only very relevant results)
  • Trade-off: May miss some relevant but less similar content
  • Best for: Specific answers, exact matches, critical applications
  • Example: min_similarity = 0.8 for finding exact technical documentation

Typical Threshold Values by Use Case

| Use Case | Recommended Threshold | Reasoning |
| --- | --- | --- |
| Content discovery | 0.2 - 0.4 | Cast a wide net, find related ideas |
| General knowledge base | 0.5 - 0.6 | Balanced relevance and coverage |
| Technical documentation | 0.6 - 0.7 | Need precise, accurate results |
| Customer support | 0.5 - 0.7 | Balance between finding solutions and accuracy |
| Research/exploration | 0.3 - 0.5 | Find related concepts, not just exact matches |
| Critical answers | 0.7 - 0.9 | Only return highly confident matches |

The Relationship Between Distance and Similarity

Remember: Distance and Similarity are inverses

| Distance | Similarity | Meaning | Example |
| --- | --- | --- | --- |
| 0.0 | 1.0 | Perfect match | Query: “Python tutorial”, Document: “Python tutorial” |
| 0.1 | 0.9 | Very similar | Query: “Python tutorial”, Document: “Learn Python programming” |
| 0.3 | 0.7 | Moderately similar | Query: “Python tutorial”, Document: “Programming guide” |
| 0.5 | 0.5 | Somewhat related | Query: “Python tutorial”, Document: “Software development” |
| 0.7 | 0.3 | Not very similar | Query: “Python tutorial”, Document: “Cooking recipes” |
| 1.0 | 0.0 | Completely different | Query: “Python tutorial”, Document: “Vacation photos” |

Conversion Formula:

If you see distance = 0.3, then similarity = 1 - 0.3 = 0.7
If you need similarity ≥ 0.7, then distance must be ≤ 0.3

Why Thresholds Matter

Without a Threshold:

  • All documents returned, even completely unrelated ones
  • Low-quality results mixed with good ones
  • Harder to find what you’re looking for

With Too Low a Threshold:

  • Includes marginally relevant results
  • More noise in results
  • Harder to find the best matches

With Too High a Threshold:

  • Only very similar results returned
  • May miss relevant but differently worded content
  • Fewer results overall

With the Right Threshold:

  • Filters out noise while keeping relevant results
  • Better user experience
  • More focused, useful results

Practical Tips

1. Start with Defaults:

  • Most systems default to min_similarity = 0.5 or 0.7
  • Good starting point for most use cases

2. Adjust Based on Results:

  • Too many irrelevant results? → Raise the threshold
  • Missing relevant results? → Lower the threshold
  • Too few results? → Lower the threshold

3. Test with Real Queries:

  • Try different thresholds with actual user queries
  • Monitor which results users find helpful
  • Adjust based on feedback

4. Consider Your Content:

  • Technical content (specific terminology): Higher threshold (0.6-0.8)
  • General content (varied language): Lower threshold (0.4-0.6)
  • Synonym-rich content: Lower threshold to catch variations

5. Monitor Search Quality:

  • Track average similarity scores of returned results
  • If consistently low, your content might need improvement
  • If consistently high, threshold might be too restrictive

Example: Adjusting Thresholds

Scenario: Customer Support Knowledge Base

Initial Setup:

min_similarity = 0.7

Problem: Users complain about missing relevant articles

Investigation:

  • Review queries: “how do I reset my password”
  • Found article: “password reset instructions” (similarity = 0.65)
  • Article was filtered out because 0.65 < 0.7

Solution:

min_similarity = 0.6 (lowered to catch more relevant content)

Result: More relevant articles returned, users find what they need

Later Adjustment:

  • Too many marginally relevant results appearing
  • Raise to min_similarity = 0.65 for better balance

Advanced: Understanding Cosine Distance

Cosine distance measures the angle between vectors, not their magnitude:

Formula:

Cosine Distance = 1 - (A · B) / (||A|| × ||B||)

Where:

  • A · B = dot product of vectors A and B
  • ||A|| = magnitude (length) of vector A
  • ||B|| = magnitude (length) of vector B

Key Insight: Cosine distance focuses on direction (meaning) rather than magnitude (length). This makes it perfect for semantic search because:

  • Two documents with similar meaning point in similar directions
  • Document length doesn’t affect similarity (important for comparing long vs. short documents)
  • Focuses on semantic relationships, not word counts
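
The formula above can be checked with a few lines of Python. This sketch uses tiny two-dimensional vectors as toy inputs (real embeddings have hundreds of dimensions), and `cosine_distance` is an illustrative helper rather than part of any Scout API.

```python
import math


def cosine_distance(a, b):
    """Cosine Distance = 1 - (A · B) / (||A|| × ||B||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


print(cosine_distance([1.0, 0.0], [1.0, 0.0]))    # 0.0  same direction
print(cosine_distance([1.0, 0.0], [5.0, 0.0]))    # 0.0  same direction, longer vector
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))    # 1.0  perpendicular
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))   # 2.0  opposite direction
```

The second call shows the key insight: scaling a vector changes its length but not its direction, so the distance stays at 0.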

Summary

  • Vector Distance: Measures how different two vectors are (0 = identical, 1 = completely different)
  • Similarity Score: Inverse of distance (1 = identical, 0 = completely different)
  • Minimum Similarity Threshold: Filters out results below a certain similarity
  • Relationship: Similarity = 1 - Distance
  • Best Practice: Start with 0.5-0.7, adjust based on your results and user feedback

Enable vector indexing on your table when you want to:

  • ✅ Search documents by meaning, not just keywords
  • ✅ Find relevant content even with different wording
  • ✅ Support natural language queries
  • ✅ Build AI-powered search experiences
  • ✅ Create knowledge bases that understand context

The content column is treated specially when your table has vector indexing enabled. Understanding this difference is crucial for building effective searchable databases.

What Makes the Content Column Different?

When a table has vector indexing, the content column receives special processing that other columns don’t:

  1. Automatic Text Chunking: The content column is automatically split into smaller chunks (typically 2,500 characters with 200 character overlap) using a RecursiveCharacter splitter. This allows:

    • Better handling of long documents
    • More precise search results (searching within relevant sections)
    • Improved embedding quality (smaller, focused chunks produce better embeddings)
  2. Vector Embedding Generation: Each chunk from the content column gets its own vector embedding, which enables semantic search

  3. Multiple Index Records: A single document with a long content field can create multiple index records (one per chunk), all linked back to the original document

  4. Vector Index Field: The content column is designated as the vector_index_field, meaning it’s the primary field used for vector similarity search

Other Columns: Standard Storage

Other columns in your table are stored differently:

  • No Chunking: Other columns are stored as-is, without splitting
  • Filterable Only: Other columns can be used for filtering and exact matching, but not for vector search
  • Full-Text Search: String columns get basic full-text search capabilities (keyword search), but not semantic/vector search
  • Metadata Storage: Other columns serve as metadata that can be filtered and displayed alongside search results

Practical Implications

Content Column:

  • ✅ Use for the main searchable text (descriptions, articles, summaries, etc.)
  • ✅ Automatically chunked for optimal search performance
  • ✅ Enables semantic search (understanding meaning)
  • ✅ Can be very long (will be chunked automatically)
  • ✅ Best for natural language content

Other Columns:

  • ✅ Use for structured data (titles, IDs, categories, dates, etc.)
  • ✅ Stored as-is without chunking
  • ✅ Used for filtering and exact matching
  • ✅ Can be used for keyword search (if string type)
  • ✅ Best for metadata, tags, and structured information

Example: Building a Knowledge Base

When saving a document to a knowledge base table:

# Content column - the main searchable text
- Column: content
- Type: string
- Value: {{ extract_article.output.full_text }}
  # This will be chunked into ~2,500 character pieces
  # Each chunk gets its own embedding for semantic search

# Title column - structured metadata
- Column: title
- Type: string
- Value: {{ extract_article.output.title }}
  # Stored as-is, used for filtering and display

# Category column - structured metadata
- Column: category
- Type: string
- Value: {{ extract_article.output.category }}
  # Stored as-is, great for filtering results

# Published date - structured metadata
- Column: published_date
- Type: string
- Value: {{ extract_article.output.date }}
  # Stored as-is, used for sorting and filtering

How Chunking Works

When you save a document with a content column containing 10,000 characters:

  1. Original Document: One row in your table with all fields
  2. Index Records Created: Multiple chunks (e.g., about 5 chunks of up to ~2,500 characters each, because adjacent chunks overlap by 200 characters)
  3. Each Chunk Gets: Its own embedding vector
  4. Search Behavior: When someone searches, the system:
    • Finds relevant chunks (not just whole documents)
    • Returns the parent document with the matching chunk highlighted
    • Maintains context through chunk overlap (200 characters)

Benefits:

  • More precise search results (finds relevant sections, not just documents)
  • Better handling of long documents
  • Improved search relevance (smaller chunks = more focused embeddings)
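
One way to picture the search behavior is as a grouping step: chunk-level hits are collapsed so that each parent document appears once, carrying its best-matching chunk. The sketch below is an assumption about the general approach, not Scout's implementation; the function name, the hit tuples, and the scores are all made-up sample data.

```python
def group_hits_by_document(hits):
    """Collapse chunk hits to one result per parent document.

    `hits` is an iterable of (document_id, chunk_text, score) tuples,
    where a higher score means a closer match.
    """
    best = {}
    for document_id, chunk, score in hits:
        current = best.get(document_id)
        if current is None or score > current["score"]:
            best[document_id] = {"matching_chunk": chunk, "score": score}
    # Order documents by their best chunk score, highest first.
    return sorted(
        ({"document_id": doc_id, **info} for doc_id, info in best.items()),
        key=lambda result: result["score"],
        reverse=True,
    )


# Illustrative hits: two chunks from doc-1 and one from doc-2.
sample_hits = [
    ("doc-1", "intro paragraph", 0.62),
    ("doc-2", "pricing details", 0.81),
    ("doc-1", "refund policy section", 0.90),
]
results = group_hits_by_document(sample_hits)
for result in results:
    print(result["document_id"], "->", result["matching_chunk"])
```

Keeping only the highest-scoring chunk per document is one reasonable design choice; a system could equally return several matching chunks per document.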

Best Practices

For the Content Column:

  • Save complete, meaningful text (not just keywords)
  • Include context and full descriptions
  • Write naturally (embeddings understand language, not just terms)
  • Longer content is fine (it will be chunked automatically)

For Other Columns:

  • Use structured, consistent values
  • Keep them concise (they’re not chunked)
  • Use for filtering, sorting, and display
  • Consider what users will want to filter by

When Your Table Doesn’t Have a Content Column

If your table doesn’t have vector indexing enabled or doesn’t have a content column:

  • All columns are treated equally (no special processing)
  • No automatic chunking occurs
  • Vector search is not available
  • Only keyword/full-text search is available for string columns

Performance Considerations

  • Vector search is faster for large datasets compared to traditional keyword search
  • Hybrid search provides the best balance of accuracy and coverage
  • Similarity thresholds help filter irrelevant results and improve performance
  • Vector indexes are optimized for fast similarity comparisons
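
To make the idea of a similarity threshold concrete, here is a minimal Python sketch. It assumes cosine similarity as the comparison metric and a threshold of 0.75; both are illustrative assumptions, since the metric and default threshold used by the platform are not stated here. The two-dimensional vectors are toy data standing in for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def filter_by_threshold(query_vector, candidates, threshold=0.75):
    """Keep candidates whose similarity to the query meets the threshold."""
    scored = [
        (name, cosine_similarity(query_vector, vector))
        for name, vector in candidates.items()
    ]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


query = [1.0, 0.0]
candidates = {
    "close": [0.9, 0.1],
    "orthogonal": [0.0, 1.0],
    "opposite": [-1.0, 0.0],
}
matches = filter_by_threshold(query, candidates)
print([name for name, _ in matches])
```

Raising the threshold returns fewer, more closely related results; lowering it widens coverage at the cost of more noise.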

Best Practices

1. Use Descriptive Field Names

Make sure your table columns have clear, descriptive names that match the data you’re saving.

2. Handle Missing Data

Use Jinja2 conditionals to handle cases where data might be missing:

{% if extract_data.output.name %}{{ extract_data.output.name }}{% else %}Unknown{% endif %}

3. Validate Data Types

Ensure the data type you select matches the actual data. For example:

  • Use number for numeric calculations
  • Use boolean for true/false values
  • Use json for complex nested structures

4. Error Handling

The block will raise an error if:

  • The API request fails (non-200 status)
  • The table or column doesn’t exist
  • The data type doesn’t match the column schema

Make sure to test your workflow with sample data before deploying.
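
When testing with sample data, it can help to check a field mapping against the table schema before running the workflow. The Python sketch below shows one way to do that offline. The schema dictionary, the field list format, and `find_mapping_problems` are hypothetical stand-ins for illustration; they are not a Scout API.

```python
SUPPORTED_TYPES = {"string", "number", "boolean", "json"}


def find_mapping_problems(schema, fields):
    """Return a list of readable problems; an empty list means no issues found.

    `schema` maps column IDs to their expected type.
    `fields` is a list of {"column": ..., "type": ...} dicts.
    """
    problems = []
    for field in fields:
        column, field_type = field["column"], field["type"]
        if field_type not in SUPPORTED_TYPES:
            problems.append(f"{column}: unsupported type '{field_type}'")
        if column not in schema:
            problems.append(f"{column}: column does not exist in the table")
        elif schema[column] != field_type:
            problems.append(f"{column}: expected {schema[column]}, got {field_type}")
    return problems


# Hypothetical table schema and block configuration.
schema = {"content": "string", "title": "string", "views": "number"}
fields = [
    {"column": "content", "type": "string"},
    {"column": "views", "type": "string"},   # type mismatch
    {"column": "Title", "type": "string"},   # wrong case, so no such column ID
]
for problem in find_mapping_problems(schema, fields):
    print(problem)
```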

5. Naming Conventions

Use consistent naming for your block IDs to make templates easier to write:

  • extract_* for data extraction blocks
  • process_* for data processing blocks
  • save_* for save blocks

6. Optimize Content for Vector Search

If your table has vector indexing enabled:

  • Save comprehensive content: Store full, meaningful text in the content column for better semantic search
  • Use descriptive text: Include context and details rather than just keywords
  • Avoid abbreviations: Spell out terms to improve search quality
  • Include synonyms: If possible, include alternative phrasings in your content
  • Structure matters: Well-structured, complete sentences produce better embeddings than fragments

Output

After the block executes successfully, it returns:

  • output: The response from the Collections API (typically includes the document ID and metadata)
  • details: Execution metadata including elapsed_time_ms

You can reference this output in subsequent blocks using:

{{ save_document.output.document_id }}

Troubleshooting

“Failed to save document” Error

Possible causes:

  • The table or column doesn’t exist
  • The data type doesn’t match the column schema
  • The template variable references a block that hasn’t run yet
  • The template syntax is incorrect

Solutions:

  1. Verify the table and column exist in your collection
  2. Check that the data type matches the column’s expected type
  3. Ensure the referenced block runs before this block (check your workflow dependencies)
  4. Test your template syntax in a simple block first

Template Variable Not Found

If you see an error about a missing variable:

  1. Check that the block ID is correct (case-sensitive)
  2. Verify the block runs before this one in your workflow
  3. Check the output structure of the previous block to ensure the field path is correct
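
The checks above amount to walking a dotted path through the outputs of earlier blocks. The Python sketch below mimics that lookup on a plain dictionary and reports the first segment that cannot be found. The `workflow_state` structure and `resolve_path` are hypothetical and only approximate how template variables are resolved.

```python
def resolve_path(context, path):
    """Follow a dotted path such as 'extract_article.output.title'.

    Raises KeyError naming the first segment that cannot be found.
    """
    current = context
    walked = []
    for segment in path.split("."):
        if not isinstance(current, dict) or segment not in current:
            location = ".".join(walked) or "<root>"
            raise KeyError(f"'{segment}' not found under {location}")
        current = current[segment]
        walked.append(segment)
    return current


# Hypothetical outputs of blocks that have already run.
workflow_state = {
    "extract_article": {"output": {"title": "Hello", "full_text": "Body text"}},
}

print(resolve_path(workflow_state, "extract_article.output.title"))

for bad_path in ("Extract_Article.output.title", "extract_article.output.name"):
    try:
        resolve_path(workflow_state, bad_path)
    except KeyError as exc:
        print("lookup failed:", exc.args[0])
```

The first failing path shows the case-sensitivity problem (the block ID does not match), and the second shows a field that is missing from the previous block's output.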

Type Conversion Errors

If type conversion fails:

  1. Check the actual value being generated by your template
  2. For JSON types, ensure the template produces valid JSON
  3. For numbers, ensure the template evaluates to a numeric value or numeric string
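
To debug a conversion failure, it can be useful to reproduce the conversion on the rendered template value outside the workflow. The sketch below is a rough Python approximation for the four supported types. The exact rules the block applies (for example, which boolean spellings it accepts) are not documented here, so the ones in `convert_rendered_value` are assumptions.

```python
import json


def convert_rendered_value(rendered, field_type):
    """Convert a rendered template string to the selected field type.

    Raises ValueError with a descriptive message when conversion fails.
    """
    text = rendered.strip()
    if field_type == "string":
        return rendered
    if field_type == "number":
        try:
            return int(text)
        except ValueError:
            try:
                return float(text)
            except ValueError:
                raise ValueError(f"not a numeric value: {rendered!r}") from None
    if field_type == "boolean":
        lowered = text.lower()
        if lowered in ("true", "false"):  # assumed accepted spellings
            return lowered == "true"
        raise ValueError(f"not a boolean value: {rendered!r}")
    if field_type == "json":
        try:
            return json.loads(text)
        except json.JSONDecodeError as exc:
            raise ValueError(f"invalid JSON: {exc.msg}") from None
    raise ValueError(f"unsupported type: {field_type!r}")


print(convert_rendered_value("42", "number"))
print(convert_rendered_value('{"tags": ["a", "b"]}', "json"))
try:
    # Single-quoted keys are valid Python but not valid JSON.
    convert_rendered_value("{'tags': ['a']}", "json")
except ValueError as exc:
    print("conversion failed:", exc)
```

A common cause of JSON failures is a template that renders a Python-style or loosely quoted structure, as in the last call above.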

Column Not Found

If you see a column error:

  1. Refresh the column dropdown to ensure you have the latest schema
  2. Verify the column exists in the selected table
  3. Check that you’re using the correct column ID (not the display name)

Related Blocks

  • Save Document to Collection (deprecated) - Older version that saves to collections without tables
  • Other collection blocks that read or update documents

Tips

💡 Tip 1: Use the block’s output to chain multiple saves or create relationships between documents.

💡 Tip 2: Save workflow metadata (like execution time, status) alongside your data for better debugging.

💡 Tip 3: Use JSON type for complex nested data structures that don’t fit well into individual columns.

💡 Tip 4: Create test workflows with sample data to validate your field mappings before using real data.

💡 Tip 5: Use descriptive block display names in your workflow to make it easier to write templates that reference them.

💡 Tip 6: If you’re building a searchable knowledge base, enable vector indexing on your table and save full, descriptive content in the content column for better semantic search results.

See Also