Save Document to Table Block | Scout

The Save Document to Table block allows you to save data from your workflow into a table within a collection. This block is perfect for storing workflow results, logging actions, or building databases from automated processes.

Overview

This block saves a document (row) to a specified table in one of your collections. Each field in the document can pull values from previous blocks in your workflow using template variables, making it easy to build dynamic workflows that store structured data.

When to Use This Block

Use the Save Document to Table block when you want to:

✅ Store workflow results in a structured format
✅ Log workflow executions for audit trails
✅ Build databases from automated processes
✅ Save data extracted from previous blocks
✅ Create records that can be viewed and managed in your Collections interface
✅ Build searchable knowledge bases with semantic search capabilities

Configuration

Required Fields

Collection - Select the collection where you want to save the document
- You can create a new collection directly from this dropdown if needed
Table - Select the table within the collection
- The table must exist in the selected collection
- You can create a new table directly from this dropdown if needed
Values - Configure one or more fields to save:
- Column: Select which column in the table to populate
- Type: Choose the data type (string, number, boolean, or JSON)
- Value: Enter the value using Jinja2 template syntax

Field Configuration

For each field you want to save, you need to specify:

Column: The column ID from your table schema
Type: One of four supported types:
- string - Text values
- number - Numeric values (integers or decimals)
- boolean - True/false values
- json - Structured JSON data
Value: A Jinja2 template that can reference data from previous blocks

Template Variables

You can use Jinja2 template syntax in the Value field to reference data from previous blocks in your workflow. The template has access to all the state from previous blocks.

Accessing Block Output

To reference data from a previous block, use the block’s ID followed by the output path:

{{ block_id.output.field_name }}

For example, if you have a block with ID extract_data that outputs:

{
  "name": "John Doe",
  "email": "john@example.com",
  "score": 95
}

You would reference these values like:

{{ extract_data.output.name }} → “John Doe”
{{ extract_data.output.email }} → “john@example.com”
{{ extract_data.output.score }} → 95
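The lookups above can be tried outside the product with the open-source Jinja2 library. This is a minimal sketch, assuming workflow state reaches the template as a nested dictionary keyed by block ID. The `state` dictionary and the `render_value` helper are hypothetical names used for illustration, not part of the block itself.

```python
from jinja2 import Environment

# Hypothetical workflow state: one entry per block ID, each with an "output".
state = {
    "extract_data": {
        "output": {"name": "John Doe", "email": "john@example.com", "score": 95}
    }
}

env = Environment()


def render_value(template: str, state: dict) -> str:
    """Render a Value template against the (assumed) workflow state."""
    return env.from_string(template).render(**state)


name = render_value("{{ extract_data.output.name }}", state)
score = render_value("{{ extract_data.output.score }}", state)
print(name)         # John Doe
print(repr(score))  # '95'
```

Rendering a template always yields text, so the score comes back as the string '95'. The Type selected for the field decides how that text is stored.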

Common Template Patterns

Simple text value:

{{ extract_data.output.name }}

Concatenated strings:

{{ extract_data.output.first_name }} {{ extract_data.output.last_name }}

Formatted text:

User: {{ extract_data.output.name }} ({{ extract_data.output.email }})

Direct number:

{{ calculate_score.output.total }}

Boolean from condition:

{{ check_status.output.is_active }}

JSON object:

{{ "{\"key\": \"value\", \"number\": 123}" }}

Or reference a JSON object from another block:

{{ previous_block.output.metadata }}

Conditional values:

{% if check_status.output.is_active %}Active{% else %}Inactive{% endif %}

Using date/time from previous blocks:

{{ extract_data.output.created_at }}

Type Casting

The block automatically converts values to match the selected type. This helps ensure data consistency:

String: Converts any value to text
Number: Converts strings like “123” to the number 123
Boolean: Converts truthy/falsy values to true/false
JSON: Parses JSON strings into structured objects

Example: Type Conversions

If you set the type to number and the template evaluates to "42", it will be saved as the number 42, not the string "42".
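The conversions described above can be sketched in a few lines of Python. This is an illustration only: the `cast_value` function and the list of strings treated as false are assumptions, and the block's actual casting rules may differ.

```python
import json

# Assumed set of rendered strings that count as false (not confirmed by the docs).
FALSY_STRINGS = {"", "false", "0", "no", "none", "null"}


def cast_value(rendered: str, type_name: str):
    """Convert a rendered template string to the selected column type."""
    text = rendered.strip()
    if type_name == "string":
        return rendered
    if type_name == "number":
        try:
            return int(text)
        except ValueError:
            return float(text)
    if type_name == "boolean":
        return text.lower() not in FALSY_STRINGS
    if type_name == "json":
        return json.loads(text)
    raise ValueError(f"unsupported type: {type_name}")


print(cast_value("42", "number"))      # 42
print(cast_value("False", "boolean"))  # False
print(cast_value('{"key": "value", "number": 123}', "json"))
```

A value that cannot be converted, such as "abc" with type number, raises an error in this sketch. How the block itself reports such a failure is not covered here.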

Step-by-Step Example

Let’s walk through saving a customer record after extracting data from an email:

1. Add the block to your workflow
2. Select Collection: Choose “Customer Database”
3. Select Table: Choose “Customers”
4. Configure Fields:
   - Field 1:
     - Column: full_name
     - Type: string
     - Value: {{ extract_email.output.customer_name }}
   - Field 2:
     - Column: email_address
     - Type: string
     - Value: {{ extract_email.output.email }}
   - Field 3:
     - Column: signup_date
     - Type: string
     - Value: {{ extract_email.output.date }}
   - Field 4:
     - Column: is_premium
     - Type: boolean
     - Value: {{ check_tier.output.premium }}
5. Run the workflow - The document will be saved to your table

Complete Example Workflow

Here’s a complete example workflow that processes a support ticket and saves it:

1. Trigger: Webhook receives support ticket
2. Extract Data: Parse ticket JSON
3. Enrich Data: Look up customer information
4. Save Document to Table:
   - Collection: Support Tickets
   - Table: Tickets
   - Fields:
     - ticket_id (string): {{ extract_data.output.id }}
     - customer_name (string): {{ enrich_data.output.customer.name }}
     - subject (string): {{ extract_data.output.subject }}
     - priority (string): {{ extract_data.output.priority }}
     - created_at (string): {{ extract_data.output.created_at }}
     - metadata (json): {{ extract_data.output }}

Collection

string, required

The collection to save the document to. Ensure that the collection ID is correct to avoid saving data to the wrong collection.

Table

string, required

The table to save to. Ensure that the table ID is correct and exists within the desired collection.

Values

list

A list of field mappings to save. Each field maps a column in your table to a value. The default is an empty list. This field supports Jinja2 template syntax, allowing for dynamic content generation.

Each item in the Values list contains:

Column (string, required): The column ID in the target table where the value will be saved
Type (string, required): The data type of the value. Must be one of: “string”, “number”, “boolean”, or “json”
Value (string, required): The value to save, using Jinja2 template syntax. This field supports dynamic content generation by referencing data from previous blocks in your workflow.

See Workflow Logic & State > State Management for details on using dynamic variables in this block. For advanced Jinja template features including date/time operations, see Jinja Templates in Workflows.
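The rules for each item in the Values list can be expressed as a small check. This is a sketch under assumptions: the lowercase keys `column`, `type`, and `value` and the `validate_values` helper are hypothetical, chosen to mirror the field names above rather than any documented payload format.

```python
ALLOWED_TYPES = {"string", "number", "boolean", "json"}
REQUIRED_KEYS = ("column", "type", "value")


def validate_values(values: list) -> list:
    """Return a list of problems; an empty list means every item is valid."""
    problems = []
    for index, item in enumerate(values):
        # Every item needs all three keys, each holding a string.
        for key in REQUIRED_KEYS:
            if not isinstance(item.get(key), str):
                problems.append(f"item {index}: '{key}' must be a string")
        # The type must be one of the four supported names.
        if isinstance(item.get("type"), str) and item["type"] not in ALLOWED_TYPES:
            problems.append(f"item {index}: unsupported type '{item['type']}'")
    return problems


values = [
    {"column": "ticket_id", "type": "string", "value": "{{ extract_data.output.id }}"},
    {"column": "metadata", "type": "object", "value": "{{ extract_data.output }}"},
]
problems = validate_values(values)
print(problems)  # ["item 1: unsupported type 'object'"]
```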

Understanding Vector Databases and Search

When you save documents to a table, you’re creating a database that can be searched in different ways. Tables can be configured with vector indexes, enabling powerful semantic search capabilities alongside traditional keyword search.

What is a Vector Database?

A vector database stores data as embeddings - mathematical representations (vectors) that capture the semantic meaning of text. Each document’s content is converted into a high-dimensional vector (typically 768 dimensions) that represents its meaning in a way that computers can understand and compare.

Semantic Search vs Keyword Search

There are two primary ways to search your saved documents:

Keyword Search (Traditional)

How it works:

Searches for exact word matches or phrases
Uses techniques like BM25 (a ranking algorithm) or full-text search
Looks for specific terms in the document text
Fast and precise for exact matches

Best for:

Finding documents containing specific terms
Searching for exact phrases or names
Cases where terminology is consistent
Structured queries with specific keywords

Example:

Query: "customer support ticket"
Finds: Documents that contain the exact words “customer”, “support”, and “ticket”

Semantic Search (Vector Search)

How it works:

Converts your search query into a vector (embedding)
Compares the query vector against all document vectors
Uses cosine similarity to find documents with similar meaning
Understands context, synonyms, and related concepts

Best for:

Finding documents with similar meaning, even without exact word matches
Natural language queries
Concept-based searches
Handling synonyms and variations in terminology

Example:

Query: "user having trouble accessing their account"
Finds: Documents about “login issues”, “authentication problems”, “account access errors” - even if they don’t contain the exact words from your query

Hybrid Search

Many tables support hybrid search, which combines both approaches:

Keyword search ensures you find exact matches and important terms
Semantic search finds conceptually similar content
Results are combined and ranked intelligently using Reciprocal Rank Fusion (RRF)

Benefits:

More comprehensive results
Balances precision (keyword) with recall (semantic)
Better handles queries with both specific terms and conceptual intent

Example:

Query: "customer complaint about billing"
Keyword search finds: Documents with “billing”, “complaint”, “customer”
Semantic search finds: Documents about “payment issues”, “invoice problems”, “account charges”
Hybrid combines both for the best results

When to Use Hybrid Search vs. Semantic Search

Choosing between hybrid search and pure semantic search depends on your use case, query types, and data characteristics. Here’s a practical guide:

Use Hybrid Search When:

1. You Need Both Precision and Recall

✅ Users search with both specific terms AND natural language
✅ You want to catch exact matches while also finding conceptually similar content
✅ Your data contains both technical terms and descriptive content
✅ You need to balance finding exact keywords with understanding intent

Example Use Cases:

Knowledge bases where users might search for “API documentation” (keyword) or “how to integrate with our service” (semantic)
Support ticket systems where users search by ticket number (keyword) or describe their problem (semantic)
Product catalogs with both SKU numbers (keyword) and product descriptions (semantic)

2. Your Queries Mix Specific Terms with Concepts

Query: "React hooks useState tutorial"
Hybrid search finds: Documents with “React”, “hooks”, “useState” (keyword) AND documents about “state management in React” (semantic)

3. You Want Maximum Coverage

Hybrid search ensures you don’t miss results that might only appear in one search method
Better for general-purpose search where query types vary

4. Your Data Has Technical Terminology

Technical terms, product names, or codes need exact matching
But you also want to find related concepts and explanations

Use Pure Semantic Search When:

1. Queries are Primarily Natural Language

✅ Users describe what they’re looking for in conversational language
✅ Exact keyword matching is less important than understanding intent
✅ You want to find conceptually related content even without exact word matches

Example Use Cases:

Customer support chatbots where users describe problems
Content discovery systems (“find articles about similar topics”)
Research and knowledge exploration
Recommendation systems

2. Your Data Has Synonym-Rich Content

Same concepts expressed in many different ways
Terminology varies across documents
You want to find all variations of an idea

Example:

Query: "feeling overwhelmed at work"
Semantic search finds: Documents about “stress management”, “burnout prevention”, “work-life balance” - even without exact word matches

3. You Prioritize Conceptual Understanding

Finding related ideas and concepts is more important than exact matches
Users explore topics rather than search for specific items

4. Your Queries are Ambiguous or Context-Dependent

Query meaning depends on context
Semantic search better understands context and intent

Hybrid Search Weighting (Alpha Parameter)

The alpha parameter controls how much weight semantic (vector) search has compared to keyword (BM25) search in hybrid search results. Understanding alpha is crucial for tuning your search to match your specific needs.

Alpha Range:

Alpha = 0.0: Pure keyword search (BM25 only, no vector search)
Alpha = 0.5: Balanced hybrid (default, equal weighting)
Alpha = 1.0: Pure semantic search (vector only, no keyword search)

How Alpha Works in Reciprocal Rank Fusion (RRF)

Hybrid search uses Reciprocal Rank Fusion (RRF) to combine results from both search methods. Alpha controls the vector search contribution in the fusion formula:

Hybrid Score = [1.0 / (k + BM25_rank)] + [alpha / (k + Vector_rank)]

Where:

k is a smoothing constant (typically 60) that limits how much the very top ranks dominate the score
BM25_rank is the document’s rank from keyword search (1st = rank 1, 2nd = rank 2, etc.)
Vector_rank is the document’s rank from semantic search
alpha multiplies the vector search contribution

What This Means:

Lower alpha: Vector search has less influence on final rankings
Higher alpha: Vector search has more influence on final rankings
BM25 always contributes: The keyword term is not multiplied by alpha, so it is part of the score at every setting; at alpha = 0 the vector term drops out and only BM25 ranks affect the result
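To see how alpha scales the vector term, the formula can be written as a short function. This sketch follows the formula exactly as given above, with k = 60. The name `hybrid_score` is illustrative, and the code does not model any special handling the product may apply at alpha = 0.0 or alpha = 1.0.

```python
def hybrid_score(bm25_rank: int, vector_rank: int, alpha: float, k: int = 60) -> float:
    """Score per the formula above: 1/(k + BM25_rank) + alpha/(k + Vector_rank)."""
    keyword_term = 1.0 / (k + bm25_rank)
    vector_term = alpha / (k + vector_rank)
    return keyword_term + vector_term


# A document ranked 5th by keyword search and 1st by vector search.
for alpha in (0.0, 0.3, 0.5, 0.8):
    print(f"alpha={alpha}: {hybrid_score(5, 1, alpha):.5f}")
```

The keyword term stays fixed while the vector term grows with alpha, so the same document scores higher as alpha rises.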

Understanding Alpha Values

Alpha = 0.0 - Pure Keyword Search

Only keyword (BM25) results are considered
Vector search runs but doesn’t affect rankings
Best when exact term matching is critical
Use for: Technical documentation, code searches, product SKUs

Alpha = 0.1 - 0.3 - Strongly Favor Keyword

Keyword search dominates, but semantic search provides some boost
Documents that rank well in both methods get extra boost
Use for: API documentation, technical specs, structured data search

Alpha = 0.4 - 0.6 - Balanced (Recommended)

Both search methods contribute significantly
Good balance between precision (keyword) and recall (semantic)
Default value (0.5) is a good starting point
Use for: General knowledge bases, customer support, most use cases

Alpha = 0.7 - 0.9 - Favor Semantic Search

Semantic search has more influence on rankings
Still benefits from keyword precision for exact matches
Use for: Content discovery, research, natural language queries

Alpha = 1.0 - Pure Semantic Search

Only semantic (vector) results are considered
Keyword search runs but doesn’t affect rankings
Use for: Conceptual exploration, finding related ideas

Visual Example: How Alpha Affects Rankings

Let’s say you search for “React hooks tutorial” and these documents are found:

Document	BM25 Rank	Vector Rank	Alpha = 0.3	Alpha = 0.5	Alpha = 0.8
“React Hooks Tutorial”	1	2	1st (keyword wins)	1st (balanced)	1st (still keyword wins)
“React State Management Guide”	5	1	2nd (keyword priority)	2nd (balanced)	2nd (vector boost)
“JavaScript Functions Guide”	10	3	3rd	3rd	3rd
“Frontend Development”	20	4	4th	4th	4th

Key Observations:

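With alpha = 0.3: Keyword ranking dominates, exact matches prioritized
With alpha = 0.5: Vector ranks carry more weight, so “React State Management Guide” (vector rank 1) closes some of the gap to the top result
With alpha = 0.8: Semantic search has the most influence of the three settings, yet the order is unchanged in this example because the exact-match document also ranks 2nd in vector search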
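The rankings in the table can be reproduced from the fusion formula. The sketch below assumes k = 60 and uses the rank pairs from the table; `hybrid_score` and `rank_documents` are illustrative names rather than product functions.

```python
def hybrid_score(bm25_rank: int, vector_rank: int, alpha: float, k: int = 60) -> float:
    """Score per the fusion formula: 1/(k + BM25_rank) + alpha/(k + Vector_rank)."""
    return 1.0 / (k + bm25_rank) + alpha / (k + vector_rank)


# (BM25 rank, vector rank) pairs copied from the table above.
DOCS = {
    "React Hooks Tutorial": (1, 2),
    "React State Management Guide": (5, 1),
    "JavaScript Functions Guide": (10, 3),
    "Frontend Development": (20, 4),
}


def rank_documents(docs: dict, alpha: float) -> list:
    """Return document titles ordered from highest to lowest hybrid score."""
    return sorted(docs, key=lambda title: hybrid_score(*docs[title], alpha), reverse=True)


for alpha in (0.3, 0.5, 0.8):
    print(alpha, rank_documents(DOCS, alpha))
```

All three settings return the same order here. Alpha changes the final order only when documents rank very differently in the two searches.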

Adjust Alpha Based On:

Lower Alpha (0.0 - 0.4) - Favor Keyword Search:

Use when:

Exact term matching is critical
Technical terminology is important
Users search with specific keywords
Product names, codes, or IDs need to be found

Examples:

API documentation (alpha = 0.3)
Code repositories (alpha = 0.2)
Product catalogs with SKUs (alpha = 0.3)
Technical specifications (alpha = 0.3-0.4)

Trade-offs:

✅ Excellent precision for exact matches
✅ Finds specific technical terms
❌ May miss conceptually related content
❌ Less effective for natural language queries

Medium Alpha (0.4 - 0.6) - Balanced:

Use when:

You want the best of both worlds
Queries mix specific terms and natural language
General-purpose search across varied content
Most knowledge bases and support systems

Examples:

General knowledge bases (alpha = 0.5)
Customer support systems (alpha = 0.5)
Internal wikis (alpha = 0.5)
Documentation with mixed content (alpha = 0.4-0.6)

Trade-offs:

✅ Balanced precision and recall
✅ Handles both keyword and semantic queries well
✅ Good default for most use cases
⚖️ May not be optimal for extreme use cases

Higher Alpha (0.6 - 1.0) - Favor Semantic Search:

Use when:

Understanding intent is more important than exact matches
Natural language queries are common
Finding conceptually similar content
Users explore topics rather than search for specific items

Examples:

Content discovery (alpha = 0.8)
Research and exploration (alpha = 0.7-0.9)
Conversational interfaces (alpha = 0.7)
Recommendation systems (alpha = 0.8-1.0)

Trade-offs:

✅ Excellent for finding related concepts
✅ Handles synonyms and variations well
✅ Better for natural language
❌ May miss exact keyword matches
❌ Less precise for specific technical terms

How to Choose the Right Alpha Value

Step 1: Start with Default

Begin with alpha = 0.5 (balanced)
This works well for most use cases

Step 2: Analyze Your Queries

Are queries mostly keywords? → Lower alpha (0.3-0.4)
Are queries natural language? → Higher alpha (0.7-0.8)
Mixed queries? → Keep alpha = 0.5

Step 3: Test and Iterate

Try different alpha values with real queries
Compare result quality
Adjust based on user feedback

Step 4: Monitor Results

Track which results users find helpful
Identify patterns in successful searches
Fine-tune alpha based on data
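The test-and-iterate steps can be made concrete with a small offline comparison. This sketch assumes you have a handful of queries for which you already know the most helpful document, plus that document's rank in each search method. The data below is made up for illustration, and mean reciprocal rank is one common quality metric chosen here, not one the product prescribes.

```python
def hybrid_score(bm25_rank: int, vector_rank: int, alpha: float, k: int = 60) -> float:
    """Score per the fusion formula: 1/(k + BM25_rank) + alpha/(k + Vector_rank)."""
    return 1.0 / (k + bm25_rank) + alpha / (k + vector_rank)


def rank_of(target: str, candidates: dict, alpha: float) -> int:
    """1-based position of the target document in the fused ranking."""
    ordered = sorted(candidates, key=lambda d: hybrid_score(*candidates[d], alpha), reverse=True)
    return ordered.index(target) + 1


def mean_reciprocal_rank(labelled_queries: list, alpha: float) -> float:
    """Average of 1/rank of the known-helpful document across all queries."""
    total = sum(1.0 / rank_of(target, cands, alpha) for target, cands in labelled_queries)
    return total / len(labelled_queries)


# Hypothetical labelled queries: (helpful document, {doc: (BM25 rank, vector rank)}).
LABELLED = [
    ("doc_y", {"doc_x": (1, 10), "doc_y": (2, 1)}),
    ("doc_m", {"doc_m": (1, 1), "doc_n": (2, 2)}),
]

for alpha in (0.1, 0.3, 0.5, 0.8):
    print(alpha, mean_reciprocal_rank(LABELLED, alpha))
```

With this toy data the lowest alpha scores worse because it buries a document that only vector search ranks first. Real query logs will show their own pattern.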

Alpha vs. Minimum Similarity

Important: Alpha and minimum similarity threshold serve different purposes:

Parameter	Purpose	Controls
Alpha	Weighting between search methods	How much keyword vs. semantic search influences rankings
Min Similarity	Quality filter	Which documents are included at all (regardless of method)

Example:

alpha = 0.3, min_similarity = 0.7
This means:
- Keyword ranks carry full weight in the score, while semantic ranks are scaled by 0.3
- But semantic results must still have similarity ≥ 0.7 to be included
- Even if semantic search contributes less to ranking, it still filters results

Common Alpha Patterns

Technical Documentation:

alpha = 0.3
Reason: Exact API names, function names, and technical terms need precise matching

Customer Support:

alpha = 0.5
Reason: Mix of specific error codes (keyword) and problem descriptions (semantic)

Content Discovery:

alpha = 0.8
Reason: Users explore topics, find related content, not searching for exact terms

Product Search:

alpha = 0.3
Reason: SKUs, product codes need exact matches, but descriptions help too

Research/Exploration:

alpha = 0.7
Reason: Finding related concepts and ideas, not specific items

Advanced: Alpha and Result Quality

When Alpha is Too Low:

Exact matches rank highly ✅
But conceptually relevant content might be buried
Users might miss helpful related information

When Alpha is Too High:

Conceptually related content ranks well ✅
But exact keyword matches might be ranked lower
Users searching for specific terms might be frustrated

The Sweet Spot:

Balance that matches your query patterns
Usually between 0.4-0.6 for most use cases
Adjust based on actual user behavior and feedback

Summary

Alpha controls the weight of semantic search in hybrid rankings
Range: 0.0 (keyword only) to 1.0 (semantic only), default 0.5 (balanced)
Lower alpha (0.0-0.4): Favor keyword search for exact matches
Medium alpha (0.4-0.6): Balanced, good for most use cases
Higher alpha (0.6-1.0): Favor semantic search for conceptual matching
Start with 0.5 and adjust based on your queries and results
Alpha is separate from minimum similarity threshold (they control different things)

Decision Matrix

Scenario	Recommendation	Why
Technical documentation with exact terms	Hybrid, Alpha 0.3-0.4	Need keyword precision + some semantic understanding
Customer support/knowledge base	Hybrid, Alpha 0.5-0.6	Mix of specific terms and natural language
Content discovery/recommendations	Semantic only or Hybrid Alpha 0.7-0.8	Concept-based, less about exact matches
Product search with SKUs	Hybrid, Alpha 0.3-0.4	Exact product codes + descriptions
Research and exploration	Semantic only or Hybrid Alpha 0.7-1.0	Finding related concepts and ideas
General-purpose search	Hybrid, Alpha 0.5	Balanced approach for varied queries

Performance Considerations

Hybrid Search:

Slightly slower (runs two searches and combines results)
More comprehensive results
Better for varied query types

Semantic Search:

Faster (single search operation)
More focused on conceptual matching
Better for natural language queries

Testing Your Choice

Start with hybrid search (alpha = 0.5) and adjust based on:

Query Analysis: Review common queries - are they keyword-heavy or natural language?
Result Quality: Check if results are too focused on keywords (raise alpha) or missing exact matches (lower alpha)
User Feedback: Monitor which results users find most relevant
A/B Testing: Try different alpha values and compare user engagement

Tip: For most use cases, hybrid search with alpha = 0.5 is a good starting point. Adjust based on your specific needs.

How Vector Search Works in Practice

When you save a document to a table with vector indexing enabled:

Document Processing: The document’s content is automatically converted into a vector embedding using a machine learning model (typically Google’s text-embedding-004)
Storage: The vector is stored alongside your document data in the table
Search Time: When someone searches:
- The search query is converted into a query vector
- The system compares this vector against all document vectors
- Documents are ranked by similarity (cosine distance)
- Results are returned sorted by relevance
Similarity Threshold: You can set a minimum similarity threshold (0-1) to filter out irrelevant results
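The search-time steps above can be sketched with toy data. The three-dimensional vectors below are hand-written stand-ins for real embeddings, which come from a model and have hundreds of dimensions. `cosine_similarity` and `vector_search` are illustrative helpers, not product APIs.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def vector_search(query_vec: list, documents: dict, min_similarity: float = 0.0) -> list:
    """Score every document, drop those below the threshold, sort best first."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in documents.items()]
    kept = [(score, name) for score, name in scored if score >= min_similarity]
    return [name for score, name in sorted(kept, reverse=True)]


# Toy document vectors (assumed values for illustration only).
DOCUMENTS = {
    "login issues": [0.9, 0.1, 0.0],
    "billing questions": [0.1, 0.9, 0.1],
    "vacation photos": [0.0, 0.1, 0.9],
}
query = [1.0, 0.2, 0.0]  # stands in for an embedded account-access query

print(vector_search(query, DOCUMENTS))
print(vector_search(query, DOCUMENTS, min_similarity=0.5))
```

Without a threshold all three documents come back in similarity order. With min_similarity = 0.5 only the closest one remains.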

Understanding Minimum Similarity Scores and Vector Distance

When working with vector search, understanding similarity scores and vector distance is crucial for getting the right results. These concepts control how relevant your search results are.

What is Vector Distance?

Vector distance measures how far apart two vectors are in high-dimensional space. In semantic search, we use cosine distance to compare query vectors with document vectors.

How Cosine Distance Works:

Distance = 0.0: Vectors point in the exact same direction (identical meaning)
Distance = 1.0: Vectors are orthogonal (unrelated meaning)
Distance = 2.0: Vectors point in opposite directions (opposite meaning)

Distances above 1.0 are uncommon for text embeddings, which is why the scores and thresholds in this guide use the 0 to 1 range.

Visual Analogy: Think of vectors as arrows in space. Cosine distance measures the angle between arrows:

0° angle (distance = 0): Arrows point the same direction → Very similar
90° angle (distance = 1.0): Arrows are perpendicular → Unrelated
180° angle (distance = 2.0): Arrows point opposite directions → Opposed

What is Similarity Score?

Similarity score is the inverse of distance, making it more intuitive to work with:

Similarity = 1 - Distance

Similarity Score Range:

Similarity = 1.0: Perfect match (distance = 0)
Similarity = 0.5: Moderate similarity (distance = 0.5)
Similarity = 0.0: No similarity (distance = 1.0)

Why Use Similarity Instead of Distance?

Higher numbers = better matches (more intuitive)
Easier to understand thresholds (“I want results with at least 0.7 similarity”)
Standard practice in search systems

Minimum Similarity Threshold

The minimum similarity threshold (also called min_similarity) filters out results that aren’t similar enough to your query. Only documents with similarity scores at or above the threshold are returned.

How It Works:

System calculates similarity for all documents
Filters out documents below the threshold
Returns only documents that meet or exceed the minimum similarity

Example:

Query: "customer support"
Document A: Similarity = 0.85 ✅ (above threshold of 0.7)
Document B: Similarity = 0.65 ❌ (below threshold of 0.7)
Document C: Similarity = 0.92 ✅ (above threshold of 0.7)
With min_similarity = 0.7, only Documents A and C are returned.
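The filtering step in this example amounts to a single comparison per document. A minimal sketch, using the scores from the example and a hypothetical `filter_by_similarity` helper:

```python
def filter_by_similarity(scores: dict, min_similarity: float) -> list:
    """Keep documents whose similarity meets or exceeds the threshold."""
    return [name for name, score in scores.items() if score >= min_similarity]


scores = {"Document A": 0.85, "Document B": 0.65, "Document C": 0.92}
print(filter_by_similarity(scores, 0.7))  # ['Document A', 'Document C']
```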

Choosing the Right Threshold

The threshold you choose depends on your use case and quality requirements:

Low Threshold (0.0 - 0.4):

Use when: You want maximum recall (find everything potentially relevant)
Trade-off: May include less relevant results
Best for: Exploratory searches, research, content discovery
Example: min_similarity = 0.3 for finding all related articles

Medium Threshold (0.4 - 0.7):

Use when: You want balanced precision and recall
Trade-off: Good balance between relevance and coverage
Best for: General-purpose search, knowledge bases
Example: min_similarity = 0.5 for customer support knowledge base

High Threshold (0.7 - 1.0):

Use when: You need high precision (only very relevant results)
Trade-off: May miss some relevant but less similar content
Best for: Specific answers, exact matches, critical applications
Example: min_similarity = 0.8 for finding exact technical documentation

Typical Threshold Values by Use Case

Use Case	Recommended Threshold	Reasoning
Content discovery	0.2 - 0.4	Cast a wide net, find related ideas
General knowledge base	0.5 - 0.6	Balanced relevance and coverage
Technical documentation	0.6 - 0.7	Need precise, accurate results
Customer support	0.5 - 0.7	Balance between finding solutions and accuracy
Research/exploration	0.3 - 0.5	Find related concepts, not just exact matches
Critical answers	0.7 - 0.9	Only return highly confident matches

The Relationship Between Distance and Similarity

Remember: Distance and Similarity are inverses

Distance	Similarity	Meaning	Example
0.0	1.0	Perfect match	Query: “Python tutorial”, Document: “Python tutorial”
0.1	0.9	Very similar	Query: “Python tutorial”, Document: “Learn Python programming”
0.3	0.7	Moderately similar	Query: “Python tutorial”, Document: “Programming guide”
0.5	0.5	Somewhat related	Query: “Python tutorial”, Document: “Software development”
0.7	0.3	Not very similar	Query: “Python tutorial”, Document: “Cooking recipes”
1.0	0.0	Completely different	Query: “Python tutorial”, Document: “Vacation photos”

Conversion Formula:

If you see distance = 0.3, then similarity = 1 - 0.3 = 0.7
If you need similarity ≥ 0.7, then distance must be ≤ 0.3

Why Thresholds Matter

Without a Threshold:

All documents returned, even completely unrelated ones
Low-quality results mixed with good ones
Harder to find what you’re looking for

With Too Low a Threshold:

Includes marginally relevant results
More noise in results
Harder to find the best matches

With Too High a Threshold:

Only very similar results returned
May miss relevant but differently worded content
Fewer results overall

With the Right Threshold:

Filters out noise while keeping relevant results
Better user experience
More focused, useful results

Practical Tips

1. Start with Defaults:

Most systems default to min_similarity = 0.5 or 0.7
Good starting point for most use cases

2. Adjust Based on Results:

Too many irrelevant results? → Raise the threshold
Missing relevant results? → Lower the threshold
Too few results? → Lower the threshold

3. Test with Real Queries:

Try different thresholds with actual user queries
Monitor which results users find helpful
Adjust based on feedback

4. Consider Your Content:

Technical content (specific terminology): Higher threshold (0.6-0.8)
General content (varied language): Lower threshold (0.4-0.6)
Synonym-rich content: Lower threshold to catch variations

5. Monitor Search Quality:

Track average similarity scores of returned results
If consistently low, your content might need improvement
If consistently high, threshold might be too restrictive

Example: Adjusting Thresholds

Scenario: Customer Support Knowledge Base

Initial Setup:

min_similarity = 0.7

Problem: Users complain about missing relevant articles

Investigation:

Review queries: “how do I reset my password”
Found article: “password reset instructions” (similarity = 0.65)
Article was filtered out because 0.65 < 0.7

Solution:

min_similarity = 0.6  (lowered to catch more relevant content)

Result: More relevant articles returned, users find what they need

Later Adjustment:

Too many marginally relevant results appearing
Raise to min_similarity = 0.65 for better balance

Advanced: Understanding Cosine Distance

Cosine distance measures the angle between vectors, not their magnitude:

Formula:

Cosine Distance = 1 - (A · B) / (||A|| × ||B||)

Where:

A · B = dot product of vectors A and B
||A|| = magnitude (length) of vector A
||B|| = magnitude (length) of vector B

Key Insight: Cosine distance focuses on direction (meaning) rather than magnitude (length). This makes it perfect for semantic search because:

Two documents with similar meaning point in similar directions
Document length doesn’t affect similarity (important for comparing long vs. short documents)
Focuses on semantic relationships, not word counts
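The formula translates directly into code. The sketch below uses plain Python lists as vectors and shows that scaling a vector does not change its distance to another. `cosine_distance` is an illustrative helper, and the two-dimensional vectors are toy values.

```python
import math


def cosine_distance(a: list, b: list) -> float:
    """1 - (A . B) / (||A|| * ||B||), as in the formula above."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


short_doc = [1.0, 2.0]
long_doc = [10.0, 20.0]  # same direction, ten times the magnitude
other_doc = [2.0, 1.0]   # different direction

print(cosine_distance(short_doc, long_doc))   # approximately 0: same direction
print(cosine_distance(short_doc, other_doc))  # larger: directions differ
```

Because both norms appear in the denominator, multiplying either vector by a positive constant cancels out. Only the angle between the vectors matters.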

Summary

Vector Distance: Measures how different two vectors are (0 = identical, 1 = completely different)
Similarity Score: Inverse of distance (1 = identical, 0 = completely different)
Minimum Similarity Threshold: Filters out results below a certain similarity
Relationship: Similarity = 1 - Distance
Best Practice: Start with 0.5-0.7, adjust based on your results and user feedback

When to Enable Vector Search

Enable vector indexing on your table when you want to:

✅ Search documents by meaning, not just keywords
✅ Find relevant content even with different wording
✅ Support natural language queries
✅ Build AI-powered search experiences
✅ Create knowledge bases that understand context

The Content Column: Special Handling for Vector Search

The content column is treated specially when your table has vector indexing enabled. Understanding this difference is crucial for building effective searchable databases.

What Makes the Content Column Different?

When a table has vector indexing, the content column receives special processing that other columns don’t:

Automatic Text Chunking: The content column is automatically split into smaller chunks (typically 2,500 characters with 200 character overlap) using a RecursiveCharacter splitter. This allows:
- Better handling of long documents
- More precise search results (searching within relevant sections)
- Improved embedding quality (smaller, focused chunks produce better embeddings)
Vector Embedding Generation: Each chunk from the content column gets its own vector embedding, which enables semantic search
Multiple Index Records: A single document with a long content field can create multiple index records (one per chunk), all linked back to the original document
Vector Index Field: The content column is designated as the vector_index_field, meaning it’s the primary field used for vector similarity search

Other Columns: Standard Storage

Other columns in your table are stored differently:

No Chunking: Other columns are stored as-is, without splitting
Filterable Only: Other columns can be used for filtering and exact matching, but not for vector search
Full-Text Search: String columns get basic full-text search capabilities (keyword search), but not semantic/vector search
Metadata Storage: Other columns serve as metadata that can be filtered and displayed alongside search results

Practical Implications

Content Column:

✅ Use for the main searchable text (descriptions, articles, summaries, etc.)
✅ Automatically chunked for optimal search performance
✅ Enables semantic search (understanding meaning)
✅ Can be very long (will be chunked automatically)
✅ Best for natural language content

Other Columns:

✅ Use for structured data (titles, IDs, categories, dates, etc.)
✅ Stored as-is without chunking
✅ Used for filtering and exact matching
✅ Can be used for keyword search (if string type)
✅ Best for metadata, tags, and structured information

Example: Building a Knowledge Base

When saving a document to a knowledge base table:

# Content column - the main searchable text
- Column: content
- Type: string
- Value: {{ extract_article.output.full_text }}
  # This will be chunked into ~2,500 character pieces
  # Each chunk gets its own embedding for semantic search

# Title column - structured metadata
- Column: title
- Type: string
- Value: {{ extract_article.output.title }}
  # Stored as-is, used for filtering and display

# Category column - structured metadata
- Column: category
- Type: string
- Value: {{ extract_article.output.category }}
  # Stored as-is, great for filtering results

# Published date - structured metadata
- Column: published_date
- Type: string
- Value: {{ extract_article.output.date }}
  # Stored as-is, used for sorting and filtering

How Chunking Works

When you save a document with a content column containing 10,000 characters:

Original Document: One row in your table with all fields
Index Records Created: Multiple chunks (e.g., 4 to 5 chunks of ~2,500 chars each, depending on how the overlap is applied)
Each Chunk Gets: Its own embedding vector
Search Behavior: When someone searches, the system:
- Finds relevant chunks (not just whole documents)
- Returns the parent document with the matching chunk highlighted
- Maintains context through chunk overlap (200 characters)

Benefits:

More precise search results (finds relevant sections, not just documents)
Better handling of long documents
Improved search relevance (smaller chunks = more focused embeddings)
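
The chunking described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a plain fixed-size character window of 2,500 characters that advances by 2,300 characters so consecutive chunks share 200 characters. The names `IndexRecord`, `chunk_text`, and `build_index_records` are hypothetical, and the actual indexer may split differently (for example on sentence boundaries).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IndexRecord:
    """Hypothetical index record: one chunk linked back to its parent row."""
    document_id: str
    chunk_index: int
    text: str


def chunk_text(text: str, chunk_size: int = 2500, overlap: int = 200) -> List[str]:
    """Split text into fixed-size windows that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    chunks: List[str] = []
    step = chunk_size - overlap  # how far each window advances
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


def build_index_records(document_id: str, content: str) -> List[IndexRecord]:
    """Create one index record per chunk, all pointing at the same document."""
    return [
        IndexRecord(document_id, index, chunk)
        for index, chunk in enumerate(chunk_text(content))
    ]
```

Under these assumptions, a 10,000-character content value produces five records: four full 2,500-character chunks and a shorter final one. The exact count in practice depends on how the indexer applies the overlap.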

Best Practices

For the Content Column:

Save complete, meaningful text (not just keywords)
Include context and full descriptions
Write naturally (embeddings understand language, not just terms)
Longer content is fine (it will be chunked automatically)

For Other Columns:

Use structured, consistent values
Keep them concise (they’re not chunked)
Use for filtering, sorting, and display
Consider what users will want to filter by

When Your Table Doesn’t Have a Content Column

If your table doesn’t have vector indexing enabled or doesn’t have a content column:

All columns are treated equally (no special processing)
No automatic chunking occurs
Vector search is not available
Only keyword/full-text search is available for string columns

Performance Considerations

Vector search can be faster than traditional keyword search on large datasets
Hybrid search provides the best balance of accuracy and coverage
Similarity thresholds help filter irrelevant results and improve performance
Vector indexes are optimized for fast similarity comparisons

Best Practices

1. Use Descriptive Field Names

Make sure your table columns have clear, descriptive names that match the data you’re saving.

2. Handle Missing Data

Use Jinja2 conditionals to handle cases where data might be missing:

{% if extract_data.output.name %}{{ extract_data.output.name }}{% else %}Unknown{% endif %}

3. Validate Data Types

Ensure the data type you select matches the actual data. For example:

Use number for numeric calculations
Use boolean for true/false values
Use json for complex nested structures

4. Error Handling

The block will raise an error if:

The API request fails (non-200 status)
The table or column doesn’t exist
The data type doesn’t match the column schema

Make sure to test your workflow with sample data before deploying.

5. Naming Conventions

Use consistent naming for your block IDs to make templates easier to write:

extract_* for data extraction blocks
process_* for data processing blocks
save_* for save blocks

6. Optimizing for Vector Search

If your table has vector indexing enabled:

Save comprehensive content: Store full, meaningful text in the content column for better semantic search
Use descriptive text: Include context and details rather than just keywords
Avoid abbreviations: Spell out terms to improve search quality
Include synonyms: If possible, include alternative phrasings in your content
Structure matters: Well-structured, complete sentences produce better embeddings than fragments

Output

After the block executes successfully, it returns:

output: The response from the Collections API (typically includes the document ID and metadata)
details: Execution metadata including elapsed_time_ms

You can reference this output in subsequent blocks using:

{{ save_document.output.document_id }}

Troubleshooting

“Failed to save document” Error

Possible causes:

The table or column doesn’t exist
The data type doesn’t match the column schema
The template variable references a block that hasn’t run yet
The template syntax is incorrect

Solutions:

Verify the table and column exist in your collection
Check that the data type matches the column’s expected type
Ensure the referenced block runs before this block (check your workflow dependencies)
Test your template syntax in a simple block first

Template Variable Not Found

If you see an error about a missing variable:

Check that the block ID is correct (case-sensitive)
Verify the block runs before this one in your workflow
Check the output structure of the previous block to ensure the field path is correct
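
When a field path is wrong, it helps to know which segment is missing. The sketch below walks a dotted path through a nested dictionary and reports the first segment that cannot be found. It assumes workflow state can be pictured as nested dictionaries keyed by block ID; `resolve_path` and the sample `state` are hypothetical and only illustrate the checks listed above.

```python
from typing import Any, List, Mapping


def resolve_path(state: Mapping[str, Any], path: str) -> Any:
    """Follow a dotted path such as 'extract_data.output.name' through state."""
    current: Any = state
    walked: List[str] = []
    for part in path.split("."):
        if not isinstance(current, Mapping) or part not in current:
            where = ".".join(walked) or "<root>"
            raise KeyError(f"'{part}' not found under '{where}'")
        current = current[part]
        walked.append(part)
    return current


# Hypothetical state: one block with ID 'extract_data' has already run.
state = {"extract_data": {"output": {"name": "Ada"}}}
name = resolve_path(state, "extract_data.output.name")
```

Because dictionary lookups are case-sensitive, a path that starts with `Extract_Data` fails at the first segment, which mirrors the case-sensitivity of block IDs noted above.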

Type Conversion Errors

If type conversion fails:

Check the actual value being generated by your template
For JSON types, ensure the template produces valid JSON
For numbers, ensure the template evaluates to a numeric value or numeric string
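
To see what a rendered template value needs to look like for each type, you can run a quick local check. The sketch below is an approximation, not the block's actual conversion logic: `coerce_value` is a hypothetical helper, and the accepted boolean spellings (`true`/`false`, case-insensitive) are an assumption.

```python
import json
from typing import Any


def coerce_value(rendered: str, type_name: str) -> Any:
    """Convert a rendered template string to one of the four field types."""
    text = rendered.strip()
    if type_name == "string":
        return rendered
    if type_name == "number":
        try:
            return int(text)
        except ValueError:
            return float(text)  # raises ValueError if text is not numeric
    if type_name == "boolean":
        lowered = text.lower()
        if lowered in ("true", "false"):  # assumed spellings
            return lowered == "true"
        raise ValueError(f"not a boolean: {rendered!r}")
    if type_name == "json":
        return json.loads(text)  # JSONDecodeError is a ValueError subclass
    raise ValueError(f"unsupported type: {type_name}")
```

A common failure this surfaces is a template that renders a Python-style dictionary with single quotes, which is not valid JSON and fails to parse.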

Column Not Found

If you see a column error:

Refresh the column dropdown to ensure you have the latest schema
Verify the column exists in the selected table
Check that you’re using the correct column ID (not the display name)

Related Blocks

Save Document to Collection (deprecated) - Older version that saves to collections without tables
Other collection blocks that read or update documents

Tips

💡 Tip 1: Use the block’s output to chain multiple saves or create relationships between documents.

💡 Tip 2: Save workflow metadata (like execution time, status) alongside your data for better debugging.

💡 Tip 3: Use JSON type for complex nested data structures that don’t fit well into individual columns.

💡 Tip 4: Create test workflows with sample data to validate your field mappings before using real data.

💡 Tip 5: Use descriptive block display names in your workflow to make it easier to write templates that reference them.

💡 Tip 6: If you’re building a searchable knowledge base, enable vector indexing on your table and save full, descriptive content in the content column for better semantic search results.
