Create embeddings with cloud AI providers

The Cypher® function genai.vector.encode and procedure genai.vector.encodeBatch allow you to generate embeddings for one or more pieces of text through external AI providers. You need an API token for one of the supported providers (OpenAI, Vertex AI, Azure OpenAI, Amazon Bedrock).

This page assumes you have already imported the recommendations dataset and set up your environment, and shows how to generate and store embeddings for Movie nodes based on their title and plot.

Embeddings are always generated outside of Neo4j, but stored in the Neo4j database.
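For a single piece of text, the function form can be called directly inside a query. The following is a minimal sketch rather than part of the original example: the helper name `encode_text` is hypothetical, and `driver` is assumed to be an already connected Neo4j Python driver on an instance where the GenAI plugin is available.

```python
def encode_text(driver, text, token, database):
    """Return the embedding of `text`, generated by OpenAI through the GenAI plugin.

    `encode_text` is a hypothetical helper name; `driver` is assumed to be
    a connected Neo4j Python driver object.
    """
    records, _, _ = driver.execute_query(
        # The function returns the embedding as a list of floats
        "RETURN genai.vector.encode($text, 'OpenAI', { token: $token }) AS vector",
        text=text, token=token, database_=database,
    )
    return records[0]['vector']
```

Calling the function once per text means one request to the provider per call, so the batch procedure is usually the better fit when many texts need encoding.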

Set up environment

The encoding functions are part of the Neo4j GenAI plugin.

On Aura instances the plugin is enabled by default, so you don’t need to take any further actions if you are using Neo4j on Aura.
For self-managed instances, the plugin needs to be installed. You do so by moving the neo4j-genai.jar file from /products to /plugins in your Neo4j home directory, or by starting the Docker container with the extra parameter --env NEO4J_PLUGINS='["genai"]'.
For more information, see Configuration → Plugins.
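One way to confirm that the plugin is loaded is to list the functions whose names start with `genai.`. The sketch below assumes a Neo4j version that supports filtering the output of `SHOW FUNCTIONS`; the helper name `genai_functions` is hypothetical, and `driver` is assumed to be a connected Neo4j Python driver.

```python
def genai_functions(driver, database):
    """Return the sorted names of all functions in the `genai.` namespace.

    `genai_functions` is a hypothetical helper name; an empty list suggests
    that the GenAI plugin is not installed or not enabled.
    """
    records, _, _ = driver.execute_query(
        "SHOW FUNCTIONS YIELD name WHERE name STARTS WITH 'genai.' RETURN name",
        database_=database,
    )
    return sorted(record['name'] for record in records)
```

If the plugin is available, the returned list should include `genai.vector.encode`. The batch variant is a procedure, so it would show up under `SHOW PROCEDURES` instead.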

Create embeddings for movies

The example below fetches all Movie nodes from the database, generates an embedding of the concatenation of movie title and plot, and adds that as an extra embedding property to each node.

import neo4j


URI = '<URI for Neo4j database>'
AUTH = ('<Username>', '<Password>')
DB_NAME = '<Database name>'  # examples: 'recommendations-50', 'neo4j'

openAI_token = '<OpenAI API token>'


def main():
    driver = neo4j.GraphDatabase.driver(URI, auth=AUTH)  (1)
    driver.verify_connectivity()

    batch_size = 100
    batch_n = 1
    movies_batch = []
    with driver.session(database=DB_NAME) as session:
        # Fetch `Movie` nodes
        result = session.run('MATCH (m:Movie) RETURN m.plot AS plot, m.title AS title')
        for record in result:
            title = record.get('title')
            plot = record.get('plot')

            if title is not None and plot is not None:
                movies_batch.append({
                    'title': title,
                    'plot': plot,
                    'to_encode': f'Title: {title}\nPlot: {plot}'  (2)
                })

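            # Import a batch; flush buffer
            if len(movies_batch) == batch_size:  (3)
                import_batch(driver, movies_batch, batch_n)
                movies_batch = []
                batch_n += 1

        # Flush the last, possibly partial, batch
        if movies_batch:
            import_batch(driver, movies_batch, batch_n)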

    # Import complete, show counters
    records, _, _ = driver.execute_query('''
    MATCH (m:Movie WHERE m.embedding IS NOT NULL)
    RETURN count(*) AS countMoviesWithEmbeddings, size(m.embedding) AS embeddingSize
    ''', database_=DB_NAME)
    print(f"""
Embeddings generated and attached to nodes.
Movie nodes with embeddings: {records[0].get('countMoviesWithEmbeddings')}.
Embedding size: {records[0].get('embeddingSize')}.
    """)


def import_batch(driver, nodes, batch_n):
    # Generate and store embeddings for Movie nodes
    driver.execute_query('''
    CALL genai.vector.encodeBatch($to_encode_list, 'OpenAI', { token: $token }) YIELD index, vector  (4)
    MATCH (m:Movie {title: $movies[index].title, plot: $movies[index].plot})  (5)
    CALL db.create.setNodeVectorProperty(m, 'embedding', vector)  (6)
    ''', movies=nodes, to_encode_list=[movie['to_encode'] for movie in nodes], token=openAI_token,
    database_=DB_NAME)
    print(f'Processed batch {batch_n}')


if __name__ == '__main__':
    main()

'''
Movie nodes with embeddings: 9083.
Embedding size: 1536.
'''

1	The `driver` object is the interface to interact with your Neo4j instance. For more information, see Build applications with Neo4j and Python.
2	The strings that OpenAI should encode into embeddings.
3	Movies are collected into a batch before the whole batch is submitted for encoding and storage. This avoids holding the whole dataset in memory and reduces the risk of timeouts (especially relevant for larger datasets).
4	The procedure `genai.vector.encodeBatch()` submits the batch for encoding to OpenAI. The default model for OpenAI is `text-embedding-ada-002`, which embeds text into vectors of size 1536 (i.e. lists of 1536 numbers). See GenAI providers for a list of supported providers and options.
5	The `index` returned by `genai.vector.encodeBatch` relates each embedding to its movie, so that it’s possible to retrieve each movie node and attach its embedding to it.
6	The procedure `db.create.setNodeVectorProperty` stores the embedding `vector` in the property named `embedding` for each movie node `m`. Adding embeddings with the procedure is more efficient than with the `SET` Cypher clause. To set vector properties on relationships, use `db.create.setRelationshipVectorProperty`.
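
The batching described in callout 3 can be isolated in a small generator. This is a sketch, not part of the original example: the name `in_batches` is hypothetical, and plain integers stand in for the movie dictionaries. It also yields the final, shorter batch, so no item is skipped when the total is not a multiple of the batch size.

```python
from itertools import islice


def in_batches(items, batch_size):
    """Yield lists of up to `batch_size` items, including a final shorter list."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Integers stand in for movie dictionaries
batches = list(in_batches(range(7), 3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Because the generator consumes its input lazily, it can wrap a query result directly without first loading every record into memory.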

Once embeddings are in the database, you can use them to compare how similar one movie is to another.
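
Cosine similarity is one common way to compare two embeddings. The sketch below is a plain-Python illustration of the measure, not a description of how Neo4j computes similarity; the function name `cosine_similarity` and the two-dimensional toy vectors are assumptions made for the example (real embeddings from this page have 1536 dimensions).

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError('Vectors must have the same length')
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError('Cosine similarity is undefined for zero vectors')
    return dot / (norm_a * norm_b)


# Toy vectors standing in for movie embeddings
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

Values closer to 1 indicate embeddings that point in a similar direction, which is usually read as the underlying texts being semantically close.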