Create embeddings with cloud AI providers
The Cypher® function genai.vector.encode and the procedure genai.vector.encodeBatch allow you to generate embeddings for one or more pieces of text through external AI providers.
You need an API token for one of the supported providers (OpenAI, Vertex AI, Azure OpenAI, Amazon Bedrock).
This page assumes you have already imported the recommendations dataset and set up your environment, and shows how to generate and store embeddings for Movie nodes based on their title and plot.
Embeddings are always generated outside of Neo4j, but stored in the Neo4j database.
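For a single piece of text, the function form can be called directly inside a query. The sketch below is a minimal illustration, assuming the OpenAI provider; `encode_text` is a hypothetical helper name (not part of the Neo4j driver), and the driver object, token, and database name are placeholders you supply.

```python
def encode_text(driver, text, token, database='neo4j'):
    """Return the embedding of `text` as a list of floats.

    Hypothetical helper: `driver` is an already connected Neo4j driver,
    `token` is an API token for the provider (OpenAI is assumed here).
    """
    records, _, _ = driver.execute_query(
        "RETURN genai.vector.encode($text, 'OpenAI', { token: $token }) AS vector",
        text=text, token=token, database_=database,
    )
    # The query returns exactly one row with one column named `vector`
    return records[0]['vector']
```

Each call sends one request to the provider, so for many texts the batch procedure is likely the better fit.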
Set up the environment
The encoding functions are part of the Neo4j GenAI plugin.
- On Aura instances, the plugin is enabled by default, so no further action is needed.
- For self-managed instances, the plugin needs to be installed. To do so, move the neo4j-genai.jar file from /products to /plugins in your Neo4j home directory, or start the Docker container with the extra parameter --env NEO4J_PLUGINS='["genai"]'.
For more information, see Configuration → Plugins.
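One way to check that the plugin was picked up is to list the functions and procedures in the genai namespace. The sketch below assumes a connected driver and a Neo4j version that supports filtering on the SHOW FUNCTIONS and SHOW PROCEDURES commands; `list_genai_callables` is a hypothetical helper name.

```python
def list_genai_callables(driver, database='neo4j'):
    """Return the sorted names of functions and procedures starting with 'genai.'.

    Hypothetical helper: an empty list suggests the plugin is not loaded.
    """
    names = []
    for command in ('SHOW FUNCTIONS', 'SHOW PROCEDURES'):
        records, _, _ = driver.execute_query(
            f"{command} YIELD name WHERE name STARTS WITH 'genai.' RETURN name",
            database_=database,
        )
        names.extend(record['name'] for record in records)
    return sorted(names)
```

If the plugin is loaded, the result is expected to include genai.vector.encode and genai.vector.encodeBatch.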
Create embeddings for movies
The example below fetches all Movie nodes from the database, generates an embedding for the concatenation of each movie's title and plot, and stores it in an extra embedding property on each node.
import neo4j


URI = '<URI for Neo4j database>'
AUTH = ('<Username>', '<Password>')
DB_NAME = '<Database name>'  # examples: 'recommendations-50', 'neo4j'

openAI_token = '<OpenAI API token>'


def main():
    driver = neo4j.GraphDatabase.driver(URI, auth=AUTH)  # (1)
    driver.verify_connectivity()

    batch_size = 100
    batch_n = 1
    movies_batch = []
    with driver.session(database=DB_NAME) as session:
        # Fetch `Movie` nodes
        result = session.run('MATCH (m:Movie) RETURN m.plot AS plot, m.title AS title')
        for record in result:
            title = record.get('title')
            plot = record.get('plot')

            # Skip movies that lack a title or a plot
            if title is not None and plot is not None:
                movies_batch.append({
                    'title': title,
                    'plot': plot,
                    'to_encode': f'Title: {title}\nPlot: {plot}'  # (2)
                })

            # Import a batch; flush buffer
            if len(movies_batch) == batch_size:  # (3)
                import_batch(driver, movies_batch, batch_n)
                movies_batch = []
                batch_n += 1

        # Flush the last, partially filled batch (if any)
        if movies_batch:
            import_batch(driver, movies_batch, batch_n)

    # Import complete, show counters
    records, _, _ = driver.execute_query('''
    MATCH (m:Movie WHERE m.embedding IS NOT NULL)
    RETURN count(*) AS countMoviesWithEmbeddings, size(m.embedding) AS embeddingSize
    ''', database_=DB_NAME)
    print(f"""
Embeddings generated and attached to nodes.
Movie nodes with embeddings: {records[0].get('countMoviesWithEmbeddings')}.
Embedding size: {records[0].get('embeddingSize')}.
    """)


def import_batch(driver, nodes, batch_n):
    # Generate and store embeddings for Movie nodes
    driver.execute_query('''
    CALL genai.vector.encodeBatch($to_encode_list, 'OpenAI', { token: $token }) YIELD index, vector  // (4)
    MATCH (m:Movie {title: $movies[index].title, plot: $movies[index].plot})  // (5)
    CALL db.create.setNodeVectorProperty(m, 'embedding', vector)  // (6)
    ''', movies=nodes, to_encode_list=[movie['to_encode'] for movie in nodes], token=openAI_token,
    database_=DB_NAME)
    print(f'Processed batch {batch_n}')


if __name__ == '__main__':
    main()
'''
Movie nodes with embeddings: 9083.
Embedding size: 1536.
'''
(1) The driver object is the interface to interact with your Neo4j instance. For more information, see Build applications with Neo4j and Python.
(2) The strings that OpenAI should encode into embeddings.
(3) A number of movies are collected before a whole batch is submitted to the database. This avoids holding the whole dataset in memory and reduces the risk of timeouts (especially relevant for larger datasets).
(4) The procedure genai.vector.encodeBatch submits the batch to OpenAI for encoding. The default model for OpenAI is text-embedding-ada-002, which embeds text into vectors of size 1536 (i.e. lists of 1536 numbers). See GenAI providers for a list of supported providers and options.
(5) The index returned by genai.vector.encodeBatch relates each embedding to the movie it was generated from, so that it's possible to retrieve each movie node and attach its embedding to it.
(6) The procedure db.create.setNodeVectorProperty stores the embedding vector in the property named embedding for each movie node m. Adding embeddings with the procedure is more efficient than with the SET Cypher clause. To set vector properties on relationships, use db.create.setRelationshipVectorProperty.
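The batching and index-matching ideas from the notes above can be tried without a database. The following is a minimal sketch in plain Python: `batches` and `pair_with_vectors` are hypothetical helper names, and the `encoded` rows are made-up stand-ins for the (index, vector) pairs that the batch procedure yields.

```python
def batches(items, batch_size):
    """Yield lists of at most `batch_size` items, including the final partial one."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # remaining items that did not fill a whole batch
        yield batch


def pair_with_vectors(batch, encoded):
    """Relate each vector to its source entry through the yielded index.

    `encoded` is an iterable of (index, vector) pairs, where `index`
    is the position of the encoded text within `batch`.
    """
    return [(batch[index]['title'], vector) for index, vector in encoded]


movies = [{'title': f'Movie {i}'} for i in range(5)]
print([len(chunk) for chunk in batches(movies, 2)])  # [2, 2, 1]

# Made-up vectors, deliberately out of order, to show that the index decides the match
encoded = [(1, [0.2, 0.2]), (0, [0.1, 0.1])]
print(pair_with_vectors(movies[:2], encoded))
# [('Movie 1', [0.2, 0.2]), ('Movie 0', [0.1, 0.1])]
```

Yielding the leftover items after the loop matters: with 5 movies and a batch size of 2, the last movie would otherwise never be submitted.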
Once embeddings are in the database, you can use them to compare how similar one movie is to another.
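To make the idea of comparing embeddings concrete, here is a minimal client-side sketch of cosine similarity, one common way to compare two vectors. It is an illustration in plain Python with toy vectors, not a description of how Neo4j computes or scores similarity; `cosine_similarity` is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equally sized vectors."""
    if len(a) != len(b):
        raise ValueError('vectors must have the same length')
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError('cosine similarity is undefined for zero vectors')
    return dot / (norm_a * norm_b)


# Toy 2-dimensional vectors; real embeddings have many more dimensions
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Values close to 1 indicate vectors pointing in nearly the same direction, which for embeddings usually corresponds to semantically similar texts.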