A custom Bytewax sink for the Azure AI Search vector database, enabling real-time indexing.
bytewax-azure-ai-search is commercially licensed with publicly available source code. Please see the full details in LICENSE.
To install, run:

```shell
pip install bytewax-azure-ai-search
```
Then import the sink:

```python
from bytewax.azure_ai_search import AzureSearchSink
```
You can then add it to your dataflow:

```python
import bytewax.operators as op
from bytewax.connectors.files import FileSource
from bytewax.dataflow import Dataflow

azure_sink = AzureSearchSink(
    azure_search_service=service_name,
    index_name="bytewax-index",
    search_api_version="2024-07-01",
    search_admin_key=api_key,
    schema={
        "id": {"type": "string", "default": None},
        "content": {"type": "string", "default": None},
        "meta": {"type": "string", "default": None},
        "vector": {"type": "collection", "item_type": "single", "default": []},
    },
)

flow = Dataflow("indexing-pipeline")
input_data = op.input("input", flow, FileSource("data/news_out.jsonl"))
deserialize_data = op.map("deserialize", input_data, safe_deserialize)
extract_html = op.map("extract_html", deserialize_data, process_event)
op.output("output", extract_html, azure_sink)
```
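The `safe_deserialize` and `process_event` steps in the dataflow are user-defined functions, not part of this package. A minimal sketch of what they might look like, assuming each JSONL record should be mapped onto the schema's field names (the exact field mapping here is an assumption for illustration):

```python
import json


def safe_deserialize(line):
    # Parse one JSONL line; return None for malformed records
    # instead of crashing the dataflow.
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        return None


def process_event(event):
    # Map a parsed record onto the field names declared in the
    # sink's schema. Hypothetical mapping - adapt to your data.
    if event is None:
        return None
    return {
        "id": str(event.get("id", "")),
        "content": event.get("content", ""),
        "meta": json.dumps(event.get("meta", {})),
        "vector": event.get("vector", []),
    }
```

Returning `None` from `safe_deserialize` lets a later step filter out bad records rather than aborting the whole run.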
Note
This installation includes the following dependencies:

```
azure-search-documents==11.5.1
azure-common==1.1.28
azure-core==1.30.2
openai==1.44.1
```
These are used to write the vectors to the appropriate services based on the Azure schema you provide. This README includes an example schema definition that works with these versions.

This assumes you have set up an Azure AI Search service in the Azure portal. For instructions, visit the Azure documentation.
Optional
To generate embeddings, you can set up an Azure OpenAI service and deploy an embedding model such as `text-embedding-ada-002`.
Once you have set up the resources, make sure to identify and store the following information from the Azure portal:

- Your Azure AI Search admin key
- Your Azure AI Search service name
- Your Azure AI Search service endpoint URL

If you deployed an embedding model through the Azure OpenAI service:

- Your Azure OpenAI endpoint URL
- Your Azure OpenAI API key
- Your Azure OpenAI service name
- Your Azure OpenAI embedding deployment name
- Your Azure OpenAI embedding model name (e.g. `text-embedding-ada-002`)
You can find a complete example under the `examples/` folder. To execute the examples, generate a `.env` file with the following keywords:
```shell
# OpenAI
AZURE_OPENAI_ENDPOINT=<your-azure-openai-endpoint>
AZURE_OPENAI_API_KEY=<your-azure-openai-key>
AZURE_OPENAI_SERVICE=<your-azure-openai-named-service>

# Azure Document Search
AZURE_SEARCH_ADMIN_KEY=<your-azure-ai-search-admin-key>
AZURE_SEARCH_SERVICE=<your-azure-ai-search-named-service>
AZURE_SEARCH_SERVICE_ENDPOINT=<your-azure-ai-search-endpoint-url>

# Optional - if you prefer to generate embeddings with embedding models deployed on Azure
AZURE_EMBEDDING_DEPLOYMENT_NAME=<your-azure-openai-given-deployment-name>
AZURE_EMBEDDING_MODEL_NAME=<your-azure-openai-model-name>

# Optional - if you prefer to generate the embeddings with OpenAI
OPENAI_API_KEY=<your-openai-key>
```
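If you go the Azure OpenAI route, the deployed model's embeddings endpoint can be reached over plain REST using only the variables above. This is a hand-rolled sketch, not part of the sink; the helper names and the `2024-02-01` API version are assumptions to adapt to your deployment:

```python
import json
import os
import urllib.request


def build_embeddings_request(endpoint, deployment, api_key, text,
                             api_version="2024-02-01"):
    # Construct the POST request for the Azure OpenAI embeddings
    # endpoint: {endpoint}/openai/deployments/{deployment}/embeddings
    url = (f"{endpoint.rstrip('/')}/openai/deployments/{deployment}"
           f"/embeddings?api-version={api_version}")
    body = json.dumps({"input": text}).encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={"api-key": api_key, "Content-Type": "application/json"},
    )


def embed(text):
    # Read connection details from the .env-provided environment.
    req = build_embeddings_request(
        os.environ["AZURE_OPENAI_ENDPOINT"],
        os.environ["AZURE_EMBEDDING_DEPLOYMENT_NAME"],
        os.environ["AZURE_OPENAI_API_KEY"],
        text,
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["data"][0]["embedding"]
```

In the bundled examples the equivalent call is made through the `openai` client library; the REST shape above is just the dependency-free view of the same request.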
Set up the connection and schema by running:

```shell
python connection.py
```
You can verify that the index was created successfully by visiting the portal. If you click on the created index and press "Search", you can confirm it exists - though it is empty at this point.
Generate the embeddings and store them in Azure AI Search through the bytewax-azure-ai-search sink:

```shell
python -m bytewax.run dataflow:flow
```
Verify that the index was populated by pressing "Search" with an empty query.
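You can also check the document count without the portal, via the Azure AI Search REST API. A sketch using only the standard library (the helper names are illustrative; the `$count` endpoint and `api-version` match the service version used above):

```python
import os
import urllib.request


def build_count_request(endpoint, index_name, api_key,
                        api_version="2024-07-01"):
    # Azure AI Search REST: GET /indexes/{index}/docs/$count
    # returns the number of documents in the index as plain text.
    url = (f"{endpoint.rstrip('/')}/indexes/{index_name}"
           f"/docs/$count?api-version={api_version}")
    return urllib.request.Request(url, headers={"api-key": api_key})


def document_count():
    req = build_count_request(
        os.environ["AZURE_SEARCH_SERVICE_ENDPOINT"],
        "bytewax-index",
        os.environ["AZURE_SEARCH_ADMIN_KEY"],
    )
    with urllib.request.urlopen(req) as resp:
        return int(resp.read())
```

A non-zero count after running the dataflow confirms the sink wrote documents.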
Note
In the dataflow we initialized the custom sink as follows:

```python
from bytewax.azure_ai_search import AzureSearchSink

azure_sink = AzureSearchSink(
    azure_search_service=service_name,
    index_name="bytewax-index",
    search_api_version="2024-07-01",
    search_admin_key=api_key,
    schema={
        "id": {"type": "string", "default": None},
        "content": {"type": "string", "default": None},
        "meta": {"type": "string", "default": None},
        "vector": {"type": "collection", "item_type": "single", "default": []},
    },
)
```
The schema and structure need to match how you configure the index through the Azure AI Search Python API. For more information, visit the Azure documentation.
In this example:

```python
from azure.search.documents.indexes.models import (
    SearchableField,
    SearchField,
    SearchFieldDataType,
    SimpleField,
)

DIMENSIONS = 1536  # must match the output size of your embedding model

# Define schema
fields = [
    SimpleField(
        name="id",
        type=SearchFieldDataType.String,
        filterable=True,
        sortable=True,
        facetable=True,
        key=True,
    ),
    SearchableField(
        name="content",
        type=SearchFieldDataType.String,
        searchable=True,
        filterable=False,
        sortable=False,
        facetable=False,
        key=False,
    ),
    SearchableField(
        name="meta",
        type=SearchFieldDataType.String,
        searchable=True,
        filterable=False,
        sortable=False,
        facetable=False,
        key=False,
    ),
    # Vector fields need SearchField (not SimpleField) so that
    # vector_search_dimensions and vector_search_profile_name are
    # passed through to the index definition.
    SearchField(
        name="vector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        filterable=False,
        sortable=False,
        facetable=False,
        vector_search_dimensions=DIMENSIONS,
        vector_search_profile_name="myHnswProfile",
    ),
]
```
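The `myHnswProfile` referenced by the vector field must be defined in the index's vector search configuration before the index is created. One way to see the full shape is the equivalent REST index definition; this is a sketch against the `2024-07-01` API, where the algorithm name `myHnsw` and the helper name are arbitrary choices:

```python
def build_index_definition(name, dimensions):
    # JSON body for PUT https://{service}.search.windows.net/indexes/{name}
    # mirroring the Python field list above, plus the "myHnswProfile"
    # vector search profile that the vector field references.
    return {
        "name": name,
        "fields": [
            {"name": "id", "type": "Edm.String", "key": True,
             "filterable": True, "sortable": True, "facetable": True},
            {"name": "content", "type": "Edm.String", "searchable": True},
            {"name": "meta", "type": "Edm.String", "searchable": True},
            {"name": "vector", "type": "Collection(Edm.Single)",
             "searchable": True, "dimensions": dimensions,
             "vectorSearchProfile": "myHnswProfile"},
        ],
        "vectorSearch": {
            # An HNSW algorithm definition and a profile that binds
            # the vector field to it.
            "algorithms": [{"name": "myHnsw", "kind": "hnsw"}],
            "profiles": [{"name": "myHnswProfile", "algorithm": "myHnsw"}],
        },
    }
```

With the Python SDK, the same configuration is expressed through `VectorSearch`, `VectorSearchProfile`, and `HnswAlgorithmConfiguration` when constructing the `SearchIndex`.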
We use just as a command runner for actions / recipes related to developing Bytewax. Please follow the installation instructions. There's probably a package for your OS already.

I suggest using pyenv to manage Python versions; follow its installation instructions. You can also use your OS's package manager to get access to different Python versions.

Ensure that you have Python 3.12 installed and available as a "global shim" so that it can be run anywhere. The following will make plain `python` run your OS-wide interpreter, but will make 3.12 available via `python3.12`:

```shell
$ pyenv global system 3.12
```
We use uv as a virtual environment creator, package installer, and dependency pin-er. There are a few different ways to install it, but I recommend installing it through either `brew` on macOS or `pipx`.
We have a just recipe that will:

- Set up a venv in `venvs/dev/`.
- Install all dependencies into it in a reproducible way.
Start by adding any dependencies that are needed into `pyproject.toml`, or into `requirements/dev.in` if they are needed for development.

Next, generate the pinned set of dependencies with:

```shell
> just venv-compile-all
```

Once you have compiled your dependencies, run the following:

```shell
> just get-started
```

Activate your development environment and run the development task:

```shell
> . venvs/dev/bin/activate
> just develop
```
`bytewax-azure-ai-search` is commercially licensed with publicly available source code. You are welcome to prototype using this module for free, but any use on business data requires a paid license. See https://modules.bytewax.io/ for a license. Please see the full details in LICENSE.