How to Upload a CSV to a Scout Collection

You can pull down and run this code from here: Scout Recipe: How to Upload a CSV to Scout Collection

This notebook demonstrates how to upload a CSV file into a Scout collection.

The example use case adds a CSV file that contains a list of queries along with their expected responses. The same approach can be applied to Retrieval-Augmented Generation (RAG) apps, model fine-tuning, semantic clustering, and more.

Each row of the CSV will be stored as a separate document. The text field will be indexed for semantic search purposes. The id serves as the unique identifier for each document; if a document with the same id already exists, the new entry will overwrite the existing document. The title field of the document determines the title displayed in the dashboard.
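As a rough illustration of the row-to-document mapping described above, the sketch below uses only the standard library. The sample CSV content and the rows_to_documents helper are hypothetical; the column names query, category, and expected_response follow the example CSV used later in this notebook.

```python
import csv
import io

# Hypothetical sample data; column names mirror the example CSV in this notebook.
SAMPLE_CSV = """query,category,expected_response
How do I reset my password?,account,Use the reset link on the login page.
What is the refund policy?,billing,Refunds are available within 30 days.
"""


def rows_to_documents(csv_text):
    """Turn each CSV row into one document dict with id, text, and title."""
    documents = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        doc = dict(row)  # the original columns ride along as metadata
        doc["id"] = row["query"]     # unique identifier
        doc["text"] = row["query"]   # content to be indexed
        doc["title"] = row["query"]  # title shown in the dashboard
        documents.append(doc)
    return documents


documents = rows_to_documents(SAMPLE_CSV)
print(len(documents), "documents built")
```

Because the id is taken from the query column here, two rows with the same query would end up targeting the same document.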

Document Format

When creating documents, the following fields are required:

  • id: A unique identifier for the document. It should be formatted as a string.
  • text: The main content of the document to be indexed. It should be formatted as a string.
  • title: The title of the document as it will appear in the dashboard. It should be formatted as a string.

Here is an example of a Pydantic model representing the document structure:

from pydantic import BaseModel

class Document(BaseModel):
    id: str
    text: str
    title: str
    # Any additional keys and their values will be saved as metadata.
    # The metadata can be of any valid datatype that is supported by JSON.

Any additional keys and their values included in the document are saved as metadata. Metadata values can be of any datatype that JSON supports.
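Before uploading, it can help to check each record against the required fields listed above. The following is a minimal sketch in plain Python, without depending on Pydantic; the helper names validate_document and metadata_keys are hypothetical and not part of any Scout SDK.

```python
REQUIRED_FIELDS = ("id", "text", "title")


def validate_document(doc):
    """Return a list of problems found in a document dict (empty if none)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in doc:
            problems.append(f"missing required field: {field}")
        elif not isinstance(doc[field], str):
            problems.append(
                f"{field} must be a string, got {type(doc[field]).__name__}"
            )
    return problems


def metadata_keys(doc):
    """Return the extra keys that would be treated as metadata."""
    return sorted(key for key in doc if key not in REQUIRED_FIELDS)


good = {"id": "q1", "text": "hello", "title": "hello", "category": "greeting"}
bad = {"id": 1, "text": "hello"}

print(validate_document(good))
print(validate_document(bad))
print(metadata_keys(good))
```

Returning a list of problems instead of raising on the first one makes it easier to report every issue in a large CSV at once.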

# Fill in your collection ID and API key before running.
COLLECTION_ID = ''
API_KEY = ''

import pandas as pd
import json
import logging

logger = logging.getLogger(__name__)
df = pd.read_csv("curations.csv")

# Add the fields Scout requires to the dataframe
df["id"] = df["query"]  # 'id' is the unique identifier; uploading a document whose id already exists overwrites (upserts) it.
df["title"] = df["query"]  # 'title' is displayed as the title on the Scout dashboard.
df["text"] = df["query"]  # 'text' is the content that is embedded and indexed for vector search.
# 'category' and 'expected_response' are already columns in the CSV, so no assignment is needed.
# They have no special function in Scout and are stored as metadata on each document.

df  # Display the dataframe (notebook cell output)

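Since the id is copied from the query column and a document with an existing id is overwritten, repeated queries in the CSV would silently replace one another. The sketch below shows one way to spot such collisions before uploading; find_duplicate_ids and the sample records are hypothetical.

```python
from collections import Counter


def find_duplicate_ids(documents):
    """Return ids that occur more than once, in first-seen order."""
    counts = Counter(doc["id"] for doc in documents)
    return [doc_id for doc_id, n in counts.items() if n > 1]


# Hypothetical records standing in for df.to_dict(orient="records").
records = [
    {"id": "What is the refund policy?", "category": "billing"},
    {"id": "How do I reset my password?", "category": "account"},
    {"id": "What is the refund policy?", "category": "legal"},
]

duplicates = find_duplicate_ids(records)
print(duplicates)
```

With the dataframe from this notebook, the same check could be run on df.to_dict(orient="records"), or with pandas directly via df["id"].duplicated().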
import requests

BASE_URL = "https://api.scoutos.com"
url = f"{BASE_URL}/v1/collections/{COLLECTION_ID}/documents"
headers = {"Content-Type": "application/json", "Authorization": f"Bearer {API_KEY}"}
documents = df.to_dict(orient="records")
body = {"documents": documents}

response = requests.post(url, headers=headers, data=json.dumps(body))
response.raise_for_status()
res = response.json()

# Add the index_job to the dataframe for display purposes
df["index_job"] = res["jobs"]

df
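One caveat with the request above: pandas reads empty CSV cells as NaN, and json.dumps writes NaN as the bare token NaN, which is not valid JSON. The sketch below replaces NaN values with None before serializing. The clean_record helper is hypothetical, and whether the Scout API accepts null metadata values is an assumption that should be checked against its documentation.

```python
import json
import math


def clean_record(record):
    """Replace float NaN values with None so the record serializes to valid JSON."""
    return {
        key: None if isinstance(value, float) and math.isnan(value) else value
        for key, value in record.items()
    }


# Hypothetical record with a missing 'category' cell, as pandas would load it.
raw = {"id": "q1", "text": "q1", "title": "q1", "category": float("nan")}

cleaned = clean_record(raw)
payload = json.dumps({"documents": [cleaned]})
print(payload)
```

In the notebook this could be applied as documents = [clean_record(d) for d in df.to_dict(orient="records")]. Note that a NaN in a required field (id, text, or title) would become None, so such rows are better fixed or dropped than uploaded.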

Success!!

You should now see the documents in your collection on the Scout dashboard.