Skip to content

bytewax/bytewax-azure-ai-search

Repository files navigation

Actions Status PyPI Bytewax User Guide

Bytewax

bytewax-azure-ai-search

Custom sink for Azure AI Search vector database for real time indexing.

bytewax-azure-ai-search is commercially licensed with publicly available source code. Please see the full details in LICENSE.

Installation and import sample

To install you can run

pip install bytewax-azure-ai-search

Then import

from bytewax.azure_ai_search import AzureSearchSink

You can then add it to your dataflow

azure_sink = AzureSearchSink(
    azure_search_service=service_name,
    index_name="bytewax-index",
    search_api_version="2024-07-01",
    search_admin_key=api_key,
    schema={
        "id": {"type": "string", "default": None},
        "content": {"type": "string", "default": None},
        "meta": {"type": "string", "default": None},
        "vector": {"type": "collection", "item_type": "single", "default": []},
    },
)

flow = Dataflow("indexing-pipeline")
input_data = op.input("input", flow, FileSource("data/news_out.jsonl"))
deserialize_data = op.map("deserialize", input_data, safe_deserialize)
extract_html = op.map("extract_html", deserialize_data, process_event)
op.output("output", extract_html, azure_sink)

Note

This installation includes the following dependencies:

azure-search-documents==11.5.1
azure-common==1.1.28
azure-core==1.30.2
openai==1.44.1

These are used to write the vectors on the appropriate services based on an Azure schema provided. We will provide an example in this README for working versions of schema definition under these versions.

Setting up Azure AI services

This asumes you have set up an Azure AI Search service on the Azure portal. For more instructions, visit their documentation

Optional To generate embeddings, you can set up an Azure OpenAI service and deploy an embedding model such as text-ada-002-embedding

Once you have set up the resources, ensure to idenfity and store the following information from the Azure portal:

  • You Azure AI Search admin key
  • You Azure AI Search service name
  • You Azure AI Search service endpoint url

If you deployed an embedding model through Azure AI OpenAI service:

  • You Azure OpenAI endpoint url
  • You Azure OpenAI API key
  • Your Azure OpenAI service name
  • You Azure OpenAI embedding deployment name
  • Your Azure OpenAI embedding name (e.g. text-ada-002-embedding`)

Sample usage

You can find a complete example under the examples/ folder.

To execute the examples, you can generate a .env file with the following keywords:

# OpenAI
AZURE_OPENAI_ENDPOINT= <your-azure-openai-endpoint>
AZURE_OPENAI_API_KEY= <your-azure-openai-key>
AZURE_OPENAI_SERVICE=<your-azure-openai-named-service>
# Azure Document Search
AZURE_SEARCH_ADMIN_KEY=<your-azure-ai-search-admin-key>
AZURE_SEARCH_SERVICE=<your-azure-ai-search-named-service>
AZURE_SEARCH_SERVICE_ENDPOINT=<your-azure-ai-search-endpoint-url>

# Optional - if you prefer to generate embeddings with embedding models deployed on Azure
AZURE_EMBEDDING_DEPLOYMENT_NAME=<your-azure-openai-given-deployment-name>
AZURE_EMBEDDING_MODEL_NAME=<your-azure-openai-model-name>

# Optional - if you prefer to generate the embeddings with OpenAI
OPENAI_API_KEY=<your-openai-key>

Set up the connection and schema by running

python connection.py

You can verify the creation of the index was successful by visiting the portal.

If you click on the created index and press "Search" you can verify it was created - but empty at this point.

Generate the embeddings and store in Azure AI Search through the bytewax-azure-ai-search sink

python -m bytewax.run dataflow:flow

Verify the index was populated by pressing "Search" with an empty query.

Note

In the dataflow we initialized the custom sink as follows:

from bytewax.azure_ai_search import AzureSearchSink

azure_sink = AzureSearchSink(
    azure_search_service=service_name,
    index_name="bytewax-index",
    search_api_version="2024-07-01",
    search_admin_key=api_key,
    schema={
        "id": {"type": "string", "default": None},
        "content": {"type": "string", "default": None},
        "meta": {"type": "string", "default": None},
        "vector": {"type": "collection", "item_type": "single", "default": []},
    },
)

The schema and structure need to match how you configure the schema through the Azure AI Search Python API. For more information, visit their page

In this example:

from azure.search.documents.indexes.models import (
    SimpleField,
    SearchFieldDataType,
)

# Define schema
fields = [
    SimpleField(
        name="id",
        type=SearchFieldDataType.String,
        searchable=True,
        filterable=True,
        sortable=True,
        facetable=True,
        key=True,
    ),
    SearchableField(
        name="content",
        type=SearchFieldDataType.String,
        searchable=True,
        filterable=False,
        sortable=False,
        facetable=False,
        key=False,
    ),
    SearchableField(
        name="meta",
        type=SearchFieldDataType.String,
        searchable=True,
        filterable=False,
        sortable=False,
        facetable=False,
        key=False,
    ),
    SimpleField(
        name="vector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Double),
        searchable=False,
        filterable=False,
        sortable=False,
        facetable=False,
        vector_search_dimensions=DIMENSIONS,
        vector_search_profile_name="myHnswProfile",
    ),
]

For developers - Setting up the project

Install just

We use just as a command runner for actions / recipes related to developing Bytewax. Please follow the installation instructions. There's probably a package for your OS already.

Install pyenv and Python 3.12

I suggest using pyenv to manage python versions. the installation instructions.

You can also use your OS's package manager to get access to different Python versions.

Ensure that you have Python 3.12 installed and available as a "global shim" so that it can be run anywhere. The following will make plain python run your OS-wide interpreter, but will make 3.12 available via python3.12.

$ pyenv global system 3.12

Install uv

We use uv as a virtual environment creator, package installer, and dependency pin-er. There are a few different ways to install it, but I recommend installing it through either brew on macOS or pipx.

Development

We have a just recipe that will:

  1. Set up a venv in venvs/dev/.

  2. Install all dependencies into it in a reproducible way.

Start by adding any dependencies that are needed into pyproject.toml or into requirements/dev.in if they are needed for development.

Next, generate the pinned set of dependencies with

> just venv-compile-all

Create and activate a virtual environment

Once you have compiled your dependencies, run the following:

> just get-started

Activate your development environment and run the development task:

> . venvs/dev/bin/activate
> just develop

License

bytewax-azure-ai-search is commercially licensed with publicly available source code. You are welcome to prototype using this module for free, but any use on business data requires a paid license. See https://modules.bytewax.io/ for a license. Please see the full details in LICENSE.