Model caches
This notebook covers how to cache results of individual LLM calls using different caches.
First, let's install some dependencies
%pip install -qU langchain-openai langchain-community
import os
from getpass import getpass
if "OPENAI_API_KEY" not in os.environ:
os.environ["OPENAI_API_KEY"] = getpass()
from langchain.globals import set_llm_cache
from langchain_openai import OpenAI
# To make the caching really obvious, lets use a slower and older model.
# Caching supports newer chat models as well.
llm = OpenAI(model="gpt-3.5-turbo-instruct", n=2, best_of=2)
API Reference:set_llm_cache | OpenAI
In Memory
Cache
from langchain_community.cache import InMemoryCache
set_llm_cache(InMemoryCache())
API Reference:InMemoryCache
%%time
# The first time, it is not yet in cache, so it should take longer
llm.invoke("Tell me a joke")
CPU times: user 7.57 ms, sys: 8.22 ms, total: 15.8 ms
Wall time: 649 ms
"\n\nWhy couldn't the bicycle stand up by itself? Because it was two-tired!"
%%time
# The second time it is, so it goes faster
llm.invoke("Tell me a joke")
CPU times: user 551 µs, sys: 221 µs, total: 772 µs
Wall time: 1.23 ms
"\n\nWhy couldn't the bicycle stand up by itself? Because it was two-tired!"
SQLite
Cache
!rm .langchain.db
# We can do the same thing with a SQLite cache
from langchain_community.cache import SQLiteCache
set_llm_cache(SQLiteCache(database_path=".langchain.db"))
API Reference:SQLiteCache
%%time
# The first time, it is not yet in cache, so it should take longer
llm.invoke("Tell me a joke")
CPU times: user 12.6 ms, sys: 3.51 ms, total: 16.1 ms
Wall time: 486 ms
"\n\nWhy couldn't the bicycle stand up by itself? Because it was two-tired!"
%%time
# The second time it is, so it goes faster
llm.invoke("Tell me a joke")
CPU times: user 52.6 ms, sys: 57.7 ms, total: 110 ms
Wall time: 113 ms
"\n\nWhy couldn't the bicycle stand up by itself? Because it was two-tired!"
Upstash Redis
Cache
Standard Cache
Use Upstash Redis to cache prompts and responses with a serverless HTTP API.
%pip install -qU upstash_redis
import langchain
from langchain_community.cache import UpstashRedisCache
from upstash_redis import Redis
URL = "<UPSTASH_REDIS_REST_URL>"
TOKEN = "<UPSTASH_REDIS_REST_TOKEN>"
langchain.llm_cache = UpstashRedisCache(redis_=Redis(url=URL, token=TOKEN))
API Reference:UpstashRedisCache
%%time
# The first time, it is not yet in cache, so it should take longer
llm.invoke("Tell me a joke")
CPU times: user 7.56 ms, sys: 2.98 ms, total: 10.5 ms
Wall time: 1.14 s
'\n\nWhy did the chicken cross the road?\n\nTo get to the other side!'
%%time
# The second time it is, so it goes faster
llm.invoke("Tell me a joke")
CPU times: user 2.78 ms, sys: 1.95 ms, total: 4.73 ms
Wall time: 82.9 ms
'\n\nWhy did the chicken cross the road?\n\nTo get to the other side!'
Semantic Cache
Use Upstash Vector to do a semantic similarity search and cache the most similar response in the database. The vectorization is automatically done by the selected embedding model while creating Upstash Vector database.
%pip install upstash-semantic-cache
from langchain.globals import set_llm_cache
from upstash_semantic_cache import SemanticCache
API Reference:set_llm_cache
UPSTASH_VECTOR_REST_URL = "<UPSTASH_VECTOR_REST_URL>"
UPSTASH_VECTOR_REST_TOKEN = "<UPSTASH_VECTOR_REST_TOKEN>"
cache = SemanticCache(
url=UPSTASH_VECTOR_REST_URL, token=UPSTASH_VECTOR_REST_TOKEN, min_proximity=0.7
)
set_llm_cache(cache)
%%time
llm.invoke("Which city is the most crowded city in the USA?")
CPU times: user 28.4 ms, sys: 3.93 ms, total: 32.3 ms
Wall time: 1.89 s
'\n\nNew York City is the most crowded city in the USA.'
%%time
llm.invoke("Which city has the highest population in the USA?")
CPU times: user 3.22 ms, sys: 940 μs, total: 4.16 ms
Wall time: 97.7 ms
'\n\nNew York City is the most crowded city in the USA.'
Redis
Cache
See the main Redis cache docs for detail.