AI / LangChain4j interview questions
LangChain4j is a Java library that brings the capabilities of large language models (LLMs) into the Java ecosystem in a structured, type-safe, and production-friendly way. Before LangChain4j, Java developers who wanted to integrate GPT, Gemini, Mistral, or any other LLM into their applications had to write HTTP clients, manage JSON serialization manually, build prompt templates from scratch, and figure out how to chain multiple AI calls together — all without any standardized pattern.
LangChain4j solves this by providing a unified abstraction layer over dozens of LLM providers (OpenAI, Azure OpenAI, Anthropic, Google Vertex AI, Ollama, Mistral, HuggingFace, and more), a clean interface for building conversational memory, RAG pipelines, and tool-calling agents, and — most distinctively — an AI Services pattern that lets you declare AI behavior as a plain Java interface, completely eliminating boilerplate prompt construction code.
It is the Java equivalent of Python's LangChain/LlamaIndex ecosystems, but built idiomatically for the JVM: strongly typed, annotation-driven, Spring Boot and Quarkus compatible, and deeply integrated with Java's dependency injection patterns. The library is actively maintained and has become the de facto standard for enterprise Java teams embedding LLM capabilities into existing Spring applications.
LangChain4j is organized into several Maven modules so you only pull in what you actually need. The main ones you will encounter in real projects are:
| Module | Artifact ID | Purpose |
|---|---|---|
| Core | langchain4j-core | Interfaces and abstractions (ChatLanguageModel, EmbeddingModel, ChatMemory, etc.) — no provider-specific code |
| Main | langchain4j | High-level features: AI Services, PromptTemplate, RAG pipeline components, chains, tools |
| Provider starters | langchain4j-open-ai, langchain4j-anthropic, etc. | One module per LLM provider — concrete implementations of core interfaces |
| Embedding stores | langchain4j-chroma, langchain4j-pgvector, langchain4j-pinecone, etc. | Vector database integrations for RAG |
| Document loaders | Built into main module | FileSystemDocumentLoader, UrlDocumentLoader, AmazonS3DocumentLoader, etc. |
| Spring Boot starter | langchain4j-spring-boot-starter | Auto-configuration, bean injection, properties binding for Spring applications |
| Quarkus extension | quarkus-langchain4j | CDI integration and native compilation support for Quarkus |
The design intentionally separates interfaces (core) from implementations (provider modules) so your application code can remain provider-agnostic. If you start with OpenAI and later want to switch to Anthropic or an on-premise Ollama instance, you swap the dependency and update a few configuration properties — the AI Services interface code stays unchanged.
AI Services is the flagship abstraction in LangChain4j. The idea is simple but powerful: you write a plain Java interface, annotate its methods with LangChain4j annotations that describe what each method should do with the LLM, and the library generates a working implementation at runtime using JDK dynamic proxies. You never write prompt-construction or HTTP-calling code — that is all handled by the generated proxy.
A minimal example:
import dev.langchain4j.service.AiServices;
import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.UserMessage;
interface CodeReviewer {
@SystemMessage("You are a senior Java developer. Review code concisely.")
@UserMessage("Review this code snippet for bugs and style issues: {{code}}")
String review(String code);
}
// Wire it up
CodeReviewer reviewer = AiServices.builder(CodeReviewer.class)
.chatLanguageModel(model)
.build();
// Use it like any Java object
String feedback = reviewer.review("public void foo() { int x = 1/0; }");The interface method can return String for raw text, a custom POJO for structured output (LangChain4j adds JSON extraction instructions automatically), TokenStream for streaming, or AiMessage for full response metadata. You can also inject ChatMemory into the service for conversational state, add @Tool-annotated methods to the same class for function calling, and mix multiple retrieval augmentors for RAG — all declared at the builder level, none of it in your interface methods.
ChatMemory in LangChain4j is the component responsible for maintaining conversation history across multiple exchanges with an LLM. Without it, every call to the model is stateless — the model has no knowledge of what was said in previous turns. ChatMemory solves this by accumulating the message history and injecting it into each subsequent LLM request.
LangChain4j ships two built-in ChatMemory implementations:
- MessageWindowChatMemory — Keeps the last N messages (by message count). When the window is full, the oldest messages are dropped to make room for new ones. Simple and predictable, but a very long first user message might push out important context.
- TokenWindowChatMemory — Keeps messages up to a maximum token count. Requires a tokenizer (model-specific) to count tokens accurately. More precise than message count for managing context window limits of the underlying LLM.
// Message-window memory — keep last 10 messages
ChatMemory memory = MessageWindowChatMemory.withMaxMessages(10);
// Token-window memory — stay under 4096 tokens
ChatMemory memory = TokenWindowChatMemory.builder()
.maxTokens(4096, new OpenAiTokenizer(GPT_3_5_TURBO))
.build();
// Inject into AI Services for automatic history management
Assistant assistant = AiServices.builder(Assistant.class)
.chatLanguageModel(model)
.chatMemory(memory)
.build();For multi-user applications where each user needs isolated memory, LangChain4j provides ChatMemoryProvider — a factory that returns a memory instance per memory ID. The memory ID is typically the user session ID or user account ID, passed as an annotated parameter on the AI Services method.
RAG (Retrieval-Augmented Generation) is the technique of enriching an LLM prompt with relevant external content retrieved from a knowledge base before asking the model to generate a response. It solves the core limitation of LLMs — their knowledge is frozen at training time — by dynamically injecting up-to-date or domain-specific content at inference time.
In LangChain4j, a RAG pipeline has two distinct phases:
Ingestion phase (run once or periodically): Load documents → split into chunks → embed each chunk → store vectors in an EmbeddingStore.
Retrieval phase (at query time): Embed the user query → similarity-search the EmbeddingStore → inject top-K relevant chunks into the prompt → call the LLM.
// --- Ingestion ---
EmbeddingModel embeddingModel = new OpenAiEmbeddingModel.Builder()
.apiKey(apiKey).modelName("text-embedding-ada-002").build();
EmbeddingStore<TextSegment> store = new InMemoryEmbeddingStore<>();
List<Document> docs = FileSystemDocumentLoader.loadDocuments("./docs");
EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
.documentSplitter(DocumentSplitters.recursive(500, 50))
.embeddingModel(embeddingModel)
.embeddingStore(store)
.build();
ingestor.ingest(docs);
// --- Retrieval at query time via AI Services ---
interface Assistant {
String answer(String question);
}
Assistant assistant = AiServices.builder(Assistant.class)
.chatLanguageModel(chatModel)
.contentRetriever(EmbeddingStoreContentRetriever.from(store))
.build();
String answer = assistant.answer("What are our refund policies?");LangChain4j also supports advanced RAG patterns like query compression, re-ranking with a cross-encoder, and multiple content retrievers that are combined via a DefaultRetrievalAugmentor. These address quality issues in naive RAG implementations where retrieved chunks are too generic or poorly ranked.
Tools (also called function calling) give LLMs the ability to invoke real Java methods during a conversation. Instead of answering entirely from its training knowledge, the model can recognize when a specific capability is needed — fetching live data, running calculations, calling APIs — and request that the application execute a registered tool and return the result to the model for incorporation into its final answer.
In LangChain4j, tools are defined by annotating Java methods with @Tool on a plain Java object. Parameters can be annotated with @P (or @ToolParam) to provide descriptions that help the model understand when and how to use them.
class WeatherTools {
@Tool("Returns the current weather in a given city in Celsius")
String currentWeather(@P("City name, e.g. 'London'") String city) {
return weatherApiService.fetchCurrent(city); // real API call
}
@Tool("Returns the 5-day forecast for a city")
String forecast(@P("City name") String city,
@P("Number of days 1-5") int days) {
return weatherApiService.fetchForecast(city, days);
}
}
// Register with AI Services
TravelAssistant assistant = AiServices.builder(TravelAssistant.class)
.chatLanguageModel(model)
.tools(new WeatherTools())
.build();The flow is: user sends a message → LLM decides a tool should be called → LangChain4j intercepts the tool-use response → executes the Java method → appends the result to the conversation → re-calls the LLM with the result → LLM generates the final answer. All of this happens transparently within the assistant.chat() call. The model may call tools multiple times before producing a final answer, and LangChain4j handles those multi-step loops automatically.
LangChain4j provides a dedicated Spring Boot starter (langchain4j-spring-boot-starter) that wires everything up through standard Spring Boot auto-configuration. You add the starter plus the provider-specific starter for your chosen LLM, drop configuration into application.properties, and Spring automatically creates the ChatLanguageModel, EmbeddingModel, and related beans that you can inject anywhere in the application.
<!-- pom.xml -->
<dependency>
<groupId>dev.langchain4j</groupId>
<artifactId>langchain4j-spring-boot-starter</artifactId>
<version>0.32.0</version>
</dependency>
<dependency>
<groupId>dev.langchain4j</groupId>
<artifactId>langchain4j-open-ai-spring-boot-starter</artifactId>
<version>0.32.0</version>
</dependency># application.properties
langchain4j.open-ai.chat-model.api-key=${OPENAI_API_KEY}
langchain4j.open-ai.chat-model.model-name=gpt-4o
langchain4j.open-ai.chat-model.temperature=0.7
langchain4j.open-ai.embedding-model.api-key=${OPENAI_API_KEY}For AI Services specifically, Spring Boot integration uses the @AiService annotation (or you declare a @Bean manually). LangChain4j detects annotated interfaces during component scan and creates Spring-managed proxy beans — meaning the AI service is injectable like any other Spring component:
@AiService
interface CustomerSupportAgent {
@SystemMessage("You are a helpful customer support agent.")
String chat(String userMessage);
}
@RestController
class SupportController {
private final CustomerSupportAgent agent;
SupportController(CustomerSupportAgent agent) { this.agent = agent; }
@PostMapping("/support")
String support(@RequestBody String message) {
return agent.chat(message);
}
}
An EmbeddingModel in LangChain4j converts text into dense numerical vectors (embeddings) that capture semantic meaning. Texts with similar meanings produce vectors that are geometrically close, enabling similarity search. EmbeddingModels are used during RAG ingestion (to vectorize document chunks) and at query time (to vectorize the user's question so it can be matched against stored chunks).
The core interface is minimal by design:
public interface EmbeddingModel {
Response<Embedding> embed(String text);
Response<List<Embedding>> embedAll(List<TextSegment> textSegments);
}Supported embedding model providers include:
| Provider | Example Model | Notes |
|---|---|---|
| OpenAI | text-embedding-3-small / ada-002 | Most commonly used; cloud API |
| Azure OpenAI | text-embedding-ada-002 | Enterprise Azure deployments |
| Google Vertex AI | textembedding-gecko | GCP-based workloads |
| Ollama | nomic-embed-text, mxbai-embed | Local/on-premise, no API costs |
| HuggingFace | sentence-transformers models | Open-source models via HF Inference API |
| In-process (Onnx) | all-MiniLM-L6-v2 | Embedded in the JVM — no external calls, fastest |
The in-process ONNX option (langchain4j-embeddings module) is particularly useful for offline environments or when minimizing API costs: the model runs entirely within the JVM with no network calls, at the cost of slightly lower embedding quality compared to frontier models.
An EmbeddingStore is the vector database layer in LangChain4j's RAG pipeline — it stores embedding vectors alongside their source text and metadata, and supports approximate nearest-neighbor (ANN) similarity search. LangChain4j implements a unified EmbeddingStore<TextSegment> interface across all backends, so swapping stores requires only a dependency and configuration change.
| Store | Type | Best For |
|---|---|---|
| InMemoryEmbeddingStore | In-memory (no persistence) | Development, unit tests, prototyping |
| PgVector | PostgreSQL extension | Teams already on Postgres; no separate vector DB infrastructure |
| Chroma | Open-source vector DB | Local dev/staging, self-hosted deployments |
| Pinecone | Managed cloud vector DB | Production scale, fully managed |
| Weaviate | Open-source / cloud | Multi-modal search, built-in vectorization |
| Qdrant | Open-source / cloud | High-performance filtered search |
| Milvus / Zilliz | Open-source / cloud | Very large-scale vector workloads |
| Elasticsearch | Managed / self-hosted | Teams already running ELK stack |
| Azure AI Search | Managed Azure service | Azure-native deployments |
| Redis Stack | In-memory + persistence | Low-latency, existing Redis infrastructure |
For choosing: start with InMemoryEmbeddingStore during development. For production, use PgVector if you already run PostgreSQL (zero additional infrastructure), or Pinecone/Qdrant if you need a dedicated managed vector database with advanced filtering and scaling controls. The interface is identical across all stores, so the choice is purely operational.
Document splitting (also called chunking) is the process of dividing a large document into smaller, overlapping segments before embedding and storing them in the vector database. It is a necessary step in RAG pipelines because LLMs have a fixed context window (e.g., 8K, 32K, or 128K tokens). You cannot embed an entire 200-page PDF as a single unit — you need to break it into pieces that fit comfortably in the context window while still carrying enough context to be meaningful.
LangChain4j provides several DocumentSplitter implementations:
- DocumentSplitters.recursive() — Recursively splits on paragraphs, then sentences, then words, aiming to preserve semantic boundaries. This is the recommended default for most text documents.
- DocumentSplitters.byParagraph() — Splits strictly at paragraph boundaries. DocumentSplitters.bySentence() — Uses sentence boundary detection (requires a sentence detector model).
- DocumentSplitters.byWord(maxTokens) — Splits by word count up to a token limit.
// Recursive splitter: 500 token chunks, 50 token overlap
DocumentSplitter splitter = DocumentSplitters.recursive(500, 50);
List<TextSegment> segments = splitter.split(document);The overlap parameter is critical: by repeating some tokens at the boundary of adjacent chunks, you ensure that sentences or ideas that span a chunk boundary are not lost in either chunk. Without overlap, a sentence split exactly at a boundary would appear truncated in both chunks, reducing retrieval quality. A 10-20% overlap of the chunk size is a common starting point.
@SystemMessage and @UserMessage are the two prompt-definition annotations at the core of LangChain4j's AI Services pattern. Together they define what gets sent to the LLM for each method invocation, replacing all manual prompt string assembly.
@SystemMessage defines the system prompt — the persona, context, constraints, and behavioral instructions that frame the entire conversation. It is sent as the role: system message in the API request. It can be a plain string literal, or point to a classpath resource file for longer prompts.
@UserMessage defines the user turn — what gets sent as the role: user message. Method parameters are injected into the template via {{paramName}} placeholders or can be injected automatically when there is only one String parameter. If @UserMessage is omitted, the first String parameter is used as the user message verbatim.
interface Translator {
@SystemMessage("You are a professional translator. Translate precisely without adding commentary.")
@UserMessage("Translate the following text to {{targetLanguage}}: {{text}}")
String translate(String text, @V("targetLanguage") String lang);
}
// Or loading from a classpath template file:
interface LegalReviewer {
@SystemMessage(fromResource = "prompts/legal-reviewer-system.txt")
@UserMessage("Review this contract clause: {{clause}}")
ReviewResult review(String clause);
}The @V annotation explicitly names a variable for injection when the parameter name differs or when there are multiple parameters. Without @V, LangChain4j uses the Java parameter name (requires compilation with -parameters flag, or the @Param annotation).
Streaming in LangChain4j allows the LLM's response to be delivered token-by-token as it is generated, rather than waiting for the entire response to be produced before returning anything to the caller. For user-facing chat interfaces, this dramatically improves perceived responsiveness — the user sees text appearing progressively instead of staring at a loading spinner for several seconds.
LangChain4j supports streaming through two mechanisms:
1. TokenStream (AI Services) — Declare the return type as TokenStream in your AI Services interface. The caller registers handlers for each token, completion, and errors:
interface StreamingAssistant {
TokenStream chat(String message);
}
StreamingAssistant assistant = AiServices.builder(StreamingAssistant.class)
.streamingChatLanguageModel(streamingModel) // note: streaming model
.build();
assistant.chat("Explain quantum entanglement")
.onNext(token -> System.out.print(token))
.onComplete(response -> System.out.println("\nDone. Tokens used: " + response.tokenUsage()))
.onError(Throwable::printStackTrace)
.start();2. Direct StreamingChatLanguageModel — Use the lower-level interface for custom streaming logic without AI Services.
For Spring Boot applications serving a web API, the streaming response is typically connected to an SSE (Server-Sent Events) endpoint or a WebSocket. Spring WebFlux's Flux<String> integrates naturally with LangChain4j's streaming by bridging the onNext callback to a reactive publisher.
Use streaming when: building conversational UIs, generating long-form content where early tokens are already useful, or when you need to display a typing indicator. Avoid streaming for batch jobs, automated pipelines, or API calls where the complete response is needed before any processing begins.
LangChain4j's advanced RAG API introduces a cleaner abstraction hierarchy above the basic EmbeddingStoreContentRetriever. The two key interfaces are ContentRetriever and RetrievalAugmentor.
ContentRetriever is the interface responsible for fetching relevant content given a query. Multiple implementations are available:
EmbeddingStoreContentRetriever— retrieves via vector similarity from an EmbeddingStoreWebSearchContentRetriever— fetches live web results (e.g., via Tavily, Google) for up-to-date informationSqlDatabaseContentRetriever— generates and executes SQL to retrieve structured data (text-to-SQL RAG)
RetrievalAugmentor is the higher-level orchestrator that sits between the user query and the LLM call. The default implementation, DefaultRetrievalAugmentor, exposes a full pipeline with configurable stages:
- Query transformer — Rewrites or decomposes the original query (e.g., query compression using conversation history, or HyDE — Hypothetical Document Embeddings)
- Query router — Routes queries to one or more ContentRetrievers based on the query type
- Content aggregator — Merges results from multiple retrievers
- Content injector — Formats retrieved content for injection into the prompt
RetrievalAugmentor augmentor = DefaultRetrievalAugmentor.builder()
.queryTransformer(new CompressingQueryTransformer(chatModel))
.contentRetriever(EmbeddingStoreContentRetriever.from(store))
.contentInjector(DefaultContentInjector.builder()
.promptTemplate(PromptTemplate.from("Context:\n{{contents}}\n\nQuestion: {{userMessage}}"))
.build())
.build();
Structured output means getting the LLM to return data that maps directly to a Java object — a POJO, record, enum, or collection — rather than free-form text that you parse yourself. LangChain4j makes this transparent: declare the return type of your AI Services method as the desired Java type, and the library handles everything else.
Internally, LangChain4j uses one of two strategies depending on the provider:
- JSON schema injection — For models that do not natively support constrained output, LangChain4j generates a JSON schema from the return type and appends it to the prompt as instructions (e.g., "respond only in this JSON format"). The response is then deserialized using Jackson.
- Native JSON mode / response format — For providers that support constrained JSON output (OpenAI's
response_format: { type: json_object }or Anthropic's tool-use-for-structured-output), LangChain4j activates the native mode for more reliable output.
record ProductReview(
String productName,
int ratingOutOf5,
List<String> pros,
List<String> cons
) {}
interface ReviewAnalyzer {
@UserMessage("Analyze this customer review and extract key information: {{review}}")
ProductReview analyze(String review);
}
// Returns a fully populated ProductReview object
ProductReview result = analyzer.analyze("Great laptop, very fast but battery life is poor");
System.out.println(result.ratingOutOf5()); // e.g., 4Enums work too: if you return an enum Sentiment { POSITIVE, NEUTRAL, NEGATIVE }, LangChain4j instructs the model to return exactly one of those values and maps the response to the correct enum constant. For complex nested objects and lists, Jackson handles the deserialization as long as the model produces valid JSON matching the schema.
PromptTemplate is the lower-level prompt construction API in LangChain4j, used when you are working directly with ChatLanguageModel or building custom chains without the AI Services abstraction. It lets you define a reusable template string with {{variable}} placeholders and fill them in programmatically at runtime.
PromptTemplate template = PromptTemplate.from(
"You are translating from English to {{language}}. Translate: {{text}}"
);
Prompt prompt = template.apply(Map.of(
"language", "French",
"text", "The quick brown fox jumps over the lazy dog"
));
// Generates a Prompt object containing the filled-in text
String result = chatModel.generate(prompt.toUserMessage())
.content().text();The key difference from @UserMessage is the level of abstraction and who drives the execution:
| Aspect | PromptTemplate | @UserMessage |
|---|---|---|
| Usage context | Direct ChatLanguageModel calls, custom chains | AI Services interface methods only |
| Variable injection | Manual Map.of(...) call | Automatic from method parameters |
| Code required | Template creation, apply(), generate() | Just annotation — no code |
| Best for | Dynamic, programmatically constructed prompts | Declarative, fixed-structure interactions |
Use PromptTemplate when you need to dynamically compose different prompt templates at runtime, when you are building low-level chains, or when the fixed annotation approach of AI Services is too rigid for a particular use case.
LangChain4j supports a wide range of LLM providers, both cloud-based and local, through its modular dependency design. Each provider is a separate Maven module that implements the core ChatLanguageModel and optionally EmbeddingModel, StreamingChatLanguageModel, and ImageModel interfaces.
| Provider | Artifact | Notable Models |
|---|---|---|
| OpenAI | langchain4j-open-ai | GPT-4o, GPT-4 Turbo, o1, o1-mini |
| Azure OpenAI | langchain4j-azure-open-ai | OpenAI models on Azure endpoints |
| Anthropic | langchain4j-anthropic | Claude 3.5, Claude 3 Opus/Sonnet/Haiku |
| Google Vertex AI | langchain4j-vertex-ai-gemini | Gemini 1.5 Pro, Gemini 1.5 Flash |
| Mistral AI | langchain4j-mistral-ai | Mistral Large, Codestral |
| Ollama | langchain4j-ollama | Llama 3, Mistral, Phi-3 (local) |
| HuggingFace | langchain4j-hugging-face | Open-source models via HF Inference |
| Amazon Bedrock | langchain4j-bedrock | Claude, Llama on Bedrock |
| Groq | langchain4j-open-ai (compatible) | OpenAI-compatible fast inference |
Switching providers is entirely a configuration concern — your AI Services interface and application logic do not change:
// OpenAI
ChatLanguageModel model = OpenAiChatModel.builder()
.apiKey("sk-...").modelName("gpt-4o").build();
// Switch to Anthropic — same interface, different builder
ChatLanguageModel model = AnthropicChatModel.builder()
.apiKey("sk-ant-...").modelName("claude-3-5-sonnet-20241022").build();
// Same AI Services usage for both
Assistant assistant = AiServices.builder(Assistant.class)
.chatLanguageModel(model).build();
In LangChain4j, an Agent is an AI Services instance that has been equipped with a set of Tools and operates in an autonomous reasoning loop. Instead of a single-shot prompt-and-respond interaction, an agent decides at each step whether to answer directly from its knowledge or to invoke one of the available tools to gather more information, then loops until it has enough to produce a final answer.
A simple AI Services call is a single round-trip: user message in → LLM response out. An agent uses a ReAct-style (Reasoning + Acting) loop:
- User message is sent to the LLM along with tool descriptions
- LLM reasons: "I need current stock prices" → requests a
getStockPrice("AAPL")tool call - LangChain4j executes the tool and appends the result to the conversation
- LLM reasons with the result: maybe needs another tool call, or produces the final answer
- Loop ends when the LLM generates a final text response (no more tool requests)
class FinanceTools {
@Tool("Gets the current stock price for a ticker symbol")
double getStockPrice(@P("Ticker symbol like AAPL") String ticker) {
return marketDataService.getPrice(ticker);
}
@Tool("Gets the P/E ratio for a company ticker")
double getPERatio(@P("Ticker symbol") String ticker) {
return fundamentalsService.getPERatio(ticker);
}
}
FinancialAnalyst analyst = AiServices.builder(FinancialAnalyst.class)
.chatLanguageModel(model)
.tools(new FinanceTools())
.chatMemory(MessageWindowChatMemory.withMaxMessages(20))
.build();
// The agent may call both tools before answering
String answer = analyst.analyze("Is Apple stock overvalued relative to its P/E ratio?");The critical difference: a simple AI Services call completes in one LLM round-trip with no tool access. An agent orchestrates multiple LLM calls and tool executions autonomously to answer questions that require real data.
Implementing per-user conversational memory in a Spring REST API requires three things: an AI Services interface with a memory-id parameter, a ChatMemoryProvider that returns isolated memory per ID, and a backing store to persist conversations across requests (or restarts).
// 1. AI Services interface with per-user memory
interface ChatAssistant {
String chat(@MemoryId String userId, @UserMessage String message);
}
// 2. In-memory store for development (switch to Redis/DB for production)
Map<String, ChatMemory> memoryMap = new ConcurrentHashMap<>();
ChatMemoryProvider memoryProvider = memoryId ->
memoryMap.computeIfAbsent(memoryId.toString(), id ->
MessageWindowChatMemory.withMaxMessages(20));
// 3. Build the AI Service with the provider
ChatAssistant assistant = AiServices.builder(ChatAssistant.class)
.chatLanguageModel(model)
.chatMemoryProvider(memoryProvider)
.build();
// 4. Spring REST controller
@RestController
@RequestMapping("/api/chat")
class ChatController {
private final ChatAssistant assistant;
ChatController(ChatAssistant assistant) {
this.assistant = assistant;
}
@PostMapping("/{userId}")
String chat(@PathVariable String userId, @RequestBody String message) {
return assistant.chat(userId, message); // each user gets isolated memory
}
}The @MemoryId annotation tells LangChain4j which parameter is the memory key. The ChatMemoryProvider lambda receives this key and returns the appropriate memory store for that user. For production, replace the ConcurrentHashMap with a Redis-backed or JDBC-backed memory store so conversations survive application restarts and work across multiple pods.
ImageModel is the LangChain4j interface for text-to-image generation — sending a text prompt and receiving a generated image in return. It follows the same provider-abstraction pattern as ChatLanguageModel: your code works against the interface, and the actual generation is delegated to whichever provider you configure.
public interface ImageModel {
Response<Image> generate(String prompt);
Response<List<Image>> generate(String prompt, int n);
Response<Image> edit(Image image, String prompt); // inpainting
Response<Image> edit(Image image, Image mask, String prompt);
}The Image response object contains either a URL to the generated image (hosted by the provider) or a Base64-encoded data URI, depending on the provider and configuration.
ImageModel model = OpenAiImageModel.builder()
.apiKey(apiKey)
.modelName(DALL_E_3)
.size("1024x1024")
.quality("standard")
.build();
Response<Image> response = model.generate(
"A serene Japanese zen garden at dawn, photorealistic");
String imageUrl = response.content().url().toString();
// or Base64: response.content().base64Data()Supported image generation providers in LangChain4j:
- OpenAI DALL-E 2 / DALL-E 3 — Available via
langchain4j-open-ai; DALL-E 3 supports higher quality and natural language understanding - Azure OpenAI DALL-E — Via
langchain4j-azure-open-aifor enterprise Azure deployments - Stability AI — Via
langchain4j-stability-aifor Stable Diffusion models
Note that ImageModel is a distinct interface from ChatLanguageModel. Multi-modal vision models that accept images as input (GPT-4V, Claude 3) are handled through ChatLanguageModel's message API using ImageContent, not through ImageModel.
LangChain4j itself does not provide a built-in retry framework — it intentionally delegates retry logic to the infrastructure layer. However, there are several natural integration points for error handling depending on your deployment context.
Rate limit handling (HTTP 429) — Most provider implementations in LangChain4j throw a dev.langchain4j.exception.RateLimitException when the LLM provider returns a 429. You handle this at the call site or through Spring's @Retryable mechanism:
// Using Spring Retry with @Retryable
@Service
class AiService {
private final ChatAssistant assistant;
@Retryable(
retryFor = RateLimitException.class,
maxAttempts = 3,
backoff = @Backoff(delay = 2000, multiplier = 2)
)
public String chat(String userId, String message) {
return assistant.chat(userId, message);
}
@Recover
public String fallback(RateLimitException ex, String userId, String message) {
return "Service is temporarily busy. Please try again in a moment.";
}
}Timeout handling — Configure timeouts directly on the ChatLanguageModel builder:
OpenAiChatModel model = OpenAiChatModel.builder()
.apiKey(apiKey)
.timeout(Duration.ofSeconds(30))
.maxRetries(2) // some providers support built-in retries in the client
.build();The OpenAI and some other provider clients support a maxRetries parameter that enables automatic retries with exponential backoff inside the HTTP client before the exception propagates to your code. For structured error handling across all exceptions, wrapping the AI Services call in a try-catch and mapping to application-specific error responses is standard practice. Resilience4j's circuit breaker is another option for preventing cascading failures when an LLM provider is degraded.
Testing AI Services without hitting real LLM endpoints is essential for fast, cost-free, deterministic unit tests. LangChain4j supports this through mock model implementations and the AiServices builder accepting any ChatLanguageModel — including test doubles you create yourself.
The most direct approach is to implement a simple mock that returns predetermined responses:
// Simple lambda mock
ChatLanguageModel mockModel = (messages, toolSpecifications) ->
new AiMessage("The capital of France is Paris.");
GeographyAssistant assistant = AiServices.builder(GeographyAssistant.class)
.chatLanguageModel(mockModel)
.build();
String answer = assistant.ask("What is the capital of France?");
assertThat(answer).isEqualTo("The capital of France is Paris.");For more complex scenarios, Mockito works naturally since ChatLanguageModel is an interface:
@ExtendWith(MockitoExtension.class)
class TranslatorTest {
@Mock
ChatLanguageModel mockModel;
@Test
void translatesText() {
AiMessage fakeResponse = new AiMessage("Bonjour le monde");
when(mockModel.generate(anyList())).thenReturn(new Response<>(fakeResponse));
Translator translator = AiServices.builder(Translator.class)
.chatLanguageModel(mockModel).build();
assertThat(translator.translate("Hello world", "French"))
.isEqualTo("Bonjour le monde");
}
}For integration tests that require a real LLM but want cost control, use Ollama with a small local model (e.g., tinyllama) via Testcontainers. This gives you real model behavior without OpenAI billing and can run in CI pipelines. The langchain4j-ollama module combined with the Testcontainers Ollama image enables fully automated integration test suites with no API keys required.
Document loaders are the entry point of any RAG ingestion pipeline — they read raw content from a source and return it as a list of Document objects, each containing the text content and source metadata. LangChain4j's loaders all implement the DocumentLoader interface and populate the Document.metadata() map with source-specific information like file path, URL, or S3 key.
Built-in document loader sources include:
| Loader | Source | Notes |
|---|---|---|
| FileSystemDocumentLoader | Local files and directories | Supports glob patterns; auto-detects parser by extension |
| UrlDocumentLoader | HTTP/HTTPS URLs | Fetches and parses web pages |
| ClassPathDocumentLoader | Classpath resources | Good for embedded documentation in JARs |
| AmazonS3DocumentLoader | AWS S3 buckets | Via langchain4j-document-loader-amazon-s3 |
| AzureBlobStorageDocumentLoader | Azure Blob Storage | Via langchain4j-document-loader-azure-storage-blob |
| GitHubDocumentLoader | GitHub repositories | Loads files from a repo branch |
Document parsers handle format-specific extraction: TextDocumentParser for plain text, ApachePdfBoxDocumentParser for PDFs, ApacheTikaDocumentParser for Word/Excel/PowerPoint and 100+ other formats. Parsers are composable with loaders:
// Load all PDFs from a directory
List<Document> docs = FileSystemDocumentLoader.loadDocuments(
"./knowledge-base",
PathMatcher.of("glob:**.pdf"),
new ApachePdfBoxDocumentParser()
);
The @Moderate annotation integrates content moderation directly into the AI Services pipeline. When placed on an AI Services method, LangChain4j automatically runs the user message through OpenAI's Moderation API before passing it to the language model. If the content is flagged as violating content policies, a ModerationException is thrown before the LLM is ever called — protecting you from sending inappropriate content upstream and from generating harmful responses.
interface SafeAssistant {
@Moderate // automatic moderation check on every call
@SystemMessage("You are a helpful customer service assistant.")
String chat(String userMessage);
}
// Build with a moderation model configured
SafeAssistant assistant = AiServices.builder(SafeAssistant.class)
.chatLanguageModel(chatModel)
.moderationModel(OpenAiModerationModel.builder()
.apiKey(apiKey)
.build())
.build();
// Usage
try {
String response = assistant.chat(userInput);
} catch (ModerationException e) {
// Input was flagged — respond with a rejection message
return "I cannot process that request.";
}The moderation check happens before the main LLM call, which means: no tokens wasted on the primary model, no risk of the LLM processing harmful prompts, and your system gets an automatic first line of defense. The moderation model (currently only OpenAI's text-moderation-latest is supported natively) returns categories and confidence scores for hate, harassment, self-harm, violence, and sexual content.
For applications where OpenAI's moderation is not suitable (on-premise deployments, or different moderation criteria), you can implement the ModerationModel interface with custom logic and plug it in identically.
Multi-modal LLMs like GPT-4o, Claude 3, and Gemini can process images alongside text. In LangChain4j, image input is handled through the UserMessage content builder, which accepts a list of Content objects — combining TextContent and ImageContent in a single user turn.
// Pass an image URL
UserMessage message = UserMessage.from(
TextContent.from("What defects do you see in this product image?"),
ImageContent.from("https://cdn.example.com/product-photo.jpg")
);
AiMessage response = chatModel.generate(List.of(message)).content();
// Or pass Base64-encoded image data (for local files)
byte[] imageBytes = Files.readAllBytes(Path.of("screenshot.png"));
String base64 = Base64.getEncoder().encodeToString(imageBytes);
UserMessage visionMessage = UserMessage.from(
TextContent.from("Describe any errors shown in this screenshot"),
ImageContent.from(base64, "image/png")
);Vision capabilities also work with AI Services. You can define a method that accepts a UserMessage directly, or use @UserMessage with image parameters:
interface ImageAnalyzer {
@UserMessage("Analyze this image and describe what you see.")
String analyze(UserMessage messageWithImage);
}Important considerations: not all models in the same provider family support vision (e.g., GPT-3.5 cannot process images; GPT-4o can). Check that your configured model name is a vision-capable variant. Image inputs consume significantly more tokens than text, which affects both cost and context window usage — high-resolution images can consume thousands of tokens depending on the model's tile-based processing strategy.
LangChain4j supports both synchronous and asynchronous execution models for LLM calls. The choice affects how your application thread behaves while waiting for the (potentially slow) LLM response.
Synchronous — The calling thread blocks until the complete response is received. This is the default and simplest mode, appropriate for batch jobs, background tasks, and thread-per-request servers where thread blocking is acceptable.
// Sync: thread blocks until response arrives (may take 5-30 seconds)
String answer = assistant.chat("What is quantum computing?");Asynchronous (CompletableFuture) — Declare the return type as CompletableFuture<String> (or any other response type) in your AI Services interface. LangChain4j submits the call on a separate thread and returns immediately with a future:
interface AsyncAssistant {
CompletableFuture<String> chat(String message);
CompletableFuture<ProductReview> analyze(String review); // works with POJOs too
}
// Non-blocking: returns immediately, response arrives later
CompletableFuture<String> future = assistant.chat("Explain blockchain");
future.thenAccept(answer -> System.out.println("Got answer: " + answer));
// ... continue doing other work ...Streaming (TokenStream) — Token-by-token delivery. Neither sync nor truly async — it is event-driven and provides progressive output rather than waiting for the full response or getting it all at once later. Best for UI responsiveness.
For Spring WebFlux applications, the recommended pattern is returning Flux<String> by bridging LangChain4j's TokenStream to a reactive publisher via Sinks.Many or a FluxSink. Pure CompletableFuture works for non-streaming Spring MVC async (DeferredResult) or WebFlux scenarios.
LangChain4j has a dedicated Quarkus extension (quarkus-langchain4j) maintained under the Quarkiverse umbrella. It provides CDI-based injection, Quarkus-native configuration, and — critically — native compilation support through GraalVM, enabling LangChain4j applications to be compiled to native executables with sub-second startup times.
The main differences from the Spring Boot integration:
| Aspect | Spring Boot Integration | Quarkus Integration |
|---|---|---|
| Dependency injection | Spring IoC / @Autowired | CDI / @Inject |
| Configuration | application.properties (spring.langchain4j.*) | application.properties (quarkus.langchain4j.*) |
| AI Service registration | @AiService (or @Bean) | @RegisterAiService |
| Native image support | Spring Native (experimental for AI libs) | First-class via GraalVM — officially supported |
| Dev mode | Spring DevTools hot reload | Quarkus Dev mode live reload + Dev UI panel for AI |
| Observability | Spring Actuator + Micrometer | Quarkus OpenTelemetry auto-instrumentation |
// Quarkus - register an AI service with @RegisterAiService
@RegisterAiService(tools = WeatherTools.class)
interface WeatherAssistant {
@SystemMessage("You are a weather assistant.")
String chat(String userMessage);
}
// Inject it as a CDI bean
@ApplicationScoped
class WeatherEndpoint {
@Inject WeatherAssistant assistant;
}Quarkus Dev mode provides a visual Dev UI panel specifically for LangChain4j where you can inspect registered AI services, test prompts interactively, and view conversation history — a significant developer experience advantage over the Spring Boot approach for teams working on Quarkus applications.
The ReAct (Reasoning + Acting) pattern in LangChain4j is implemented automatically by the AI Services framework whenever you register tools with a chat language model. There is no explicit ReAct class to instantiate — the pattern emerges from the interaction between the tool-equipped LLM and LangChain4j's tool execution loop.
The concrete mechanics inside LangChain4j's AI Services when tools are present:
- Tool schemas (name, description, parameter types) are serialized from your
@Toolannotated methods and included in every LLM request - If the LLM returns a tool call in its response, LangChain4j intercepts it, looks up the corresponding method, deserializes the arguments, and invokes the Java method via reflection
- The tool result is appended as a
ToolExecutionResultMessageto the conversation history - The LLM is called again with the updated history — it can reason about the result and either call another tool or produce a final text answer
- This loop continues until the LLM stops requesting tools (step 4's output is not a tool call)
Known limitations of the current implementation:
- No parallel tool execution — When the LLM requests multiple tools simultaneously (some models support this), LangChain4j executes them sequentially, not in parallel, which increases latency for multi-tool queries
- No configurable max iterations — There is no built-in loop guard. A misbehaving model or misconfigured tool could theoretically loop indefinitely. You must add your own application-level timeout
- Single agent only — LangChain4j does not natively orchestrate multi-agent workflows where agents delegate subtasks to other agents. Custom code is required for that pattern
- Tool schemas depend on model support — Tool calling requires a model that supports the function calling protocol. Older or smaller models may produce unreliable tool call JSON
The ModerationModel interface in LangChain4j defines the contract for content moderation checks. It takes a String input and returns a Response<Moderation> — where Moderation contains a boolean flagged() result and optionally category-level scores. LangChain4j's @Moderate AI Services annotation uses whichever ModerationModel you register on the builder.
The built-in implementation is OpenAiModerationModel, which calls OpenAI's text-moderation-latest API. But for custom moderation logic — rule-based keyword filtering, an internal ML model, or a different provider's moderation API — you implement the interface directly:
public class KeywordModerationModel implements ModerationModel {
private static final Set<String> BLOCKED = Set.of(
"badword1", "badword2", "competitor-brand"
);
@Override
public Response<Moderation> moderate(String text) {
boolean flagged = BLOCKED.stream()
.anyMatch(word -> text.toLowerCase().contains(word));
return Response.from(
flagged ? Moderation.flagged(text) : Moderation.notFlagged()
);
}
}
// Plug in as the moderation model
SafeAssistant assistant = AiServices.builder(SafeAssistant.class)
.chatLanguageModel(chatModel)
.moderationModel(new KeywordModerationModel())
.build();Custom implementations are particularly useful for on-premise deployments that cannot use external APIs, for organizations with specific terminology blocklists, or for domain-specific moderation where generic toxicity models produce too many false positives. The interface is small and straightforward — moderate(String) is the only method you must implement.
The Tokenizer interface in LangChain4j counts the number of tokens in a given string or list of messages using the specific tokenization algorithm of a target model. This is necessary because LLMs do not process raw characters or words — they operate on tokens, which are sub-word units that vary in count depending on the model's vocabulary. The same sentence can produce different token counts in GPT-4 vs Claude vs Llama.
Token counting matters for two concrete reasons in LangChain4j:
- TokenWindowChatMemory — Uses a Tokenizer to ensure the accumulated conversation history never exceeds the model's context window limit. Without accurate token counting, you either truncate valid context too early or exceed the limit and get API errors.
- Cost estimation — Before sending a request, counting tokens lets you estimate API cost (most providers charge per input/output token) and set guardrails on expensive queries.
// Count tokens for OpenAI GPT-4
Tokenizer tokenizer = new OpenAiTokenizer(GPT_4);
int tokensInPrompt = tokenizer.estimateTokenCountInMessage(
SystemMessage.from("You are a helpful assistant.")
);
// Use with TokenWindowChatMemory for precise context management
ChatMemory memory = TokenWindowChatMemory.builder()
.maxTokens(8192, new OpenAiTokenizer(GPT_4))
.build();LangChain4j ships tokenizers for OpenAI models (using the jtokkit library, which implements the BPE tokenization algorithm used by OpenAI), and approximate tokenizers for other models. For models without exact tokenizer support, the approximate tokenizer estimates based on average characters-per-token ratios — less precise but sufficient for rough context management.
LangChain4j's built-in MessageWindowChatMemory and TokenWindowChatMemory use in-memory storage — conversations vanish when the application restarts or when a new pod starts in a Kubernetes cluster. For production persistence you need a persistent ChatMemoryStore implementation.
LangChain4j defines the ChatMemoryStore interface with three methods:
public interface ChatMemoryStore {
List<ChatMessage> getMessages(Object memoryId);
void updateMessages(Object memoryId, List<ChatMessage> messages);
void deleteMessages(Object memoryId);
}You implement this against any persistence backend and plug it into the memory configuration:
// Redis-backed implementation example
@Component
class RedisChatMemoryStore implements ChatMemoryStore {
private final RedisTemplate<String, String> redis;
private final ObjectMapper mapper;
@Override
public List<ChatMessage> getMessages(Object memoryId) {
String json = redis.opsForValue().get("chat:" + memoryId);
if (json == null) return new ArrayList<>();
return mapper.readValue(json, new TypeReference<>(){});
}
@Override
public void updateMessages(Object memoryId, List<ChatMessage> messages) {
redis.opsForValue().set("chat:" + memoryId,
mapper.writeValueAsString(messages),
Duration.ofHours(24));
}
@Override
public void deleteMessages(Object memoryId) {
redis.delete("chat:" + memoryId);
}
}
// Wire into memory
ChatMemoryProvider provider = memoryId ->
MessageWindowChatMemory.builder()
.id(memoryId)
.maxMessages(20)
.chatMemoryStore(redisChatMemoryStore)
.build();
Prompt engineering in LangChain4j is about designing the @SystemMessage and @UserMessage content so the LLM reliably produces what you need. Several practices have proven effective in production LangChain4j applications:
1. Keep system messages focused and specific. A system message that tries to do too many things (act as a customer service agent AND a code reviewer AND limit to company topics) produces mediocre results for all of them. One interface, one clear role.
2. Use explicit output format instructions for structured responses. When returning POJOs, the auto-generated JSON schema is usually sufficient, but for edge cases add explicit instructions: "Always respond in valid JSON. Do not add explanation text outside the JSON."
3. Load long prompts from classpath resources, not annotations. Multi-paragraph system prompts inlined in annotations are hard to read, test, and update without a recompile:
// Hard to maintain
@SystemMessage("You are a... (200 words here)...")
// Better — load from file
@SystemMessage(fromResource = "prompts/customer-service-system.txt")4. Use few-shot examples for consistent formatting. Include 1-3 examples of ideal input → output pairs in the system message when the output format is non-trivial. This dramatically reduces malformed JSON or incorrect tone.
5. Version prompt files in source control separately from code. Treat src/main/resources/prompts/ as a versioned artifact. Prompt changes should go through review since they affect model behavior as much as code changes do.
6. Test with multiple inputs before deploying. LLM outputs are non-deterministic. Write parameterized tests covering edge cases: empty input, very long input, input in a non-English language, adversarial prompt injection attempts.
LangChain4j 0.31+ introduced native OpenTelemetry instrumentation for tracing LLM calls. When the langchain4j-open-telemetry module is on the classpath alongside an OTel SDK, LangChain4j automatically creates spans for each LLM call, embedding attributes from the OpenTelemetry Semantic Conventions for Generative AI Systems (draft spec).
Each span captures:
gen_ai.system— The LLM provider (e.g.,openai)gen_ai.request.model— The model name usedgen_ai.request.max_tokens— Max tokens configuredgen_ai.usage.input_tokens— Actual input tokens consumedgen_ai.usage.output_tokens— Actual output tokens generatedgen_ai.request.temperature— Temperature setting
For Spring Boot, adding the OTel Spring Boot starter alongside the LangChain4j OTel module is sufficient for automatic instrumentation:
<dependency>
<groupId>dev.langchain4j</groupId>
<artifactId>langchain4j-open-telemetry</artifactId>
</dependency>
<dependency>
<groupId>io.opentelemetry.instrumentation</groupId>
<artifactId>opentelemetry-spring-boot-starter</artifactId>
</dependency>With these in place, every LLM call appears as a span in your Jaeger, Zipkin, Grafana Tempo, or any OTLP-compatible backend — showing latency distribution across providers and models, token usage trends, and which AI services are called in which order within a user request. This is critical for diagnosing slow AI paths in production without guessing.
InMemoryEmbeddingStore is LangChain4j's simplest EmbeddingStore implementation: it holds all embeddings in a Java List in heap memory, performs linear scan (brute-force cosine similarity) for similarity search, and has zero external dependencies. It ships in the core module with no additional Maven dependency.
// Zero setup — ready to use in any test or prototype
EmbeddingStore<TextSegment> store = new InMemoryEmbeddingStore<>();
// Serialize to JSON file for lightweight persistence
String json = store.serializeToJson();
Files.writeString(Path.of("embeddings.json"), json);
// Deserialize on next startup
EmbeddingStore<TextSegment> restored =
InMemoryEmbeddingStore.fromJson(Files.readString(Path.of("embeddings.json")));It does support basic JSON file persistence via serializeToJson() and fromJson(), so for truly small corpora it can survive restarts — but it is still a single-file, single-node solution.
You should migrate to a real vector database (PgVector, Qdrant, Pinecone, etc.) when any of these conditions are true:
- Scale — More than ~50,000 document chunks. Linear scan becomes visibly slow (~100ms+) at this scale versus ANN index millisecond queries
- Filtering — You need metadata-filtered similarity search (find documents by author AND semantic similarity). InMemoryEmbeddingStore has no filtering support
- Persistence — Multiple pods that need to share the same embeddings. A JSON file cannot serve multiple instances
- Updates — Frequent document additions or deletions. Rebuilding the in-memory store from scratch is expensive for large corpora
- Disaster recovery — If re-embedding your entire corpus on every restart takes more than seconds, the file-based approach is too fragile
As LangChain4j adoption has grown, several recurring mistakes in production deployments have emerged. Knowing these saves debugging time and prevents costly incidents.
1. Creating ChatLanguageModel or AI Services as request-scoped beans. These are expensive to initialize (TCP connections, key validation, token counting setup). They must be singletons — one instance per application lifecycle, not one per request.
2. Using InMemoryEmbeddingStore in production. Linear scan becomes unacceptably slow above ~50,000 chunks, there is no filtering support, and multiple pods cannot share it. Switch to PgVector or a managed vector DB before going live.
3. Not configuring timeouts. LLM API calls can stall for 60+ seconds. Without a .timeout(Duration.ofSeconds(30)) on the model builder, a hung upstream provider will exhaust your thread pool in a synchronous Spring MVC application.
4. Logging full prompts in production. System messages often contain proprietary business logic. User messages may contain PII. Log only token counts and model names by default; log full prompts only at DEBUG with PII scrubbing.
5. Ignoring ModerationException in safety-critical applications. If you enable @Moderate, surround every AI Services call with a try-catch for ModerationException and return a safe fallback. Uncaught exceptions surface as 500 errors.
6. Embedding the same corpus on every application startup. The ingestion pipeline (load → split → embed → store) should run once and persist results. Re-embedding on startup wastes API budget and delays readiness for large corpora.
7. Hardcoding model names as String literals. Use the constants provided by each provider module (e.g., OpenAiChatModelName.GPT_4_O) so model upgrades are refactor-friendly and typos are caught at compile time.
Beyond text and image inputs, some LLM providers support audio transcription and document (PDF) understanding as native model inputs. LangChain4j exposes these through additional Content types in the UserMessage builder, following the same pattern as ImageContent.
Audio input — For providers that support audio understanding (like OpenAI GPT-4o Audio or Google Gemini), AudioContent wraps a base64-encoded audio clip with a MIME type:
byte[] audioBytes = Files.readAllBytes(Path.of("customer-call.mp3"));
String base64Audio = Base64.getEncoder().encodeToString(audioBytes);
UserMessage message = UserMessage.from(
AudioContent.from(base64Audio, "audio/mp3"),
TextContent.from("Summarize the key complaints in this customer call recording.")
);PDF / document input — Some providers (Anthropic Claude, Gemini) accept raw PDF bytes as input, allowing the model to read and understand the document structure natively rather than extracting text first:
byte[] pdfBytes = Files.readAllBytes(Path.of("contract.pdf"));
String base64Pdf = Base64.getEncoder().encodeToString(pdfBytes);
UserMessage message = UserMessage.from(
TextContent.from("Identify all payment terms in this contract."),
PdfFileContent.from(base64Pdf) // provider-specific support required
);Important caveat: multi-modal support beyond text and images is provider-specific. Before using AudioContent or PdfFileContent, verify that your configured model and LangChain4j provider module version support it. Using these content types with a model that does not support them results in an API error from the provider. Always check the LangChain4j integration page for your provider for the current supported content types.
LangChain4j tools support complex parameter types beyond simple strings and primitives. When a tool method accepts a custom POJO, enum, or collection, LangChain4j automatically generates a JSON schema from the parameter type and includes it in the tool specification sent to the LLM. The model uses this schema to understand what JSON structure it should produce for the tool call arguments, and LangChain4j deserializes them via Jackson before invoking the method.
// Enum parameter
enum Priority { LOW, MEDIUM, HIGH, CRITICAL }
// Complex POJO parameter
record TaskFilter(
String assignee,
Priority minPriority,
@P("Filter to tasks due before this date (ISO-8601)") String dueBefore,
boolean includeCompleted
) {}
class ProjectTools {
@Tool("Search project tasks by multiple filter criteria")
List<Task> searchTasks(
@P("Filter criteria for the task search") TaskFilter filter
) {
return taskRepository.search(
filter.assignee(),
filter.minPriority(),
LocalDate.parse(filter.dueBefore()),
filter.includeCompleted()
);
}
@Tool("Update the priority of a specific task")
void updatePriority(
@P("Task ID to update") String taskId,
@P("New priority level") Priority newPriority
) {
taskRepository.updatePriority(taskId, newPriority);
}
}The LLM sees the fully expanded JSON schema for TaskFilter including field types and the @P descriptions. Good @P descriptions on nested fields are critical — without them the model may misinterpret the date format, the priority semantics, or which fields are required vs. optional. The return type of tool methods is also automatically serialized to JSON before being added to the conversation as a tool result.
HyDE (Hypothetical Document Embedder) is a query enhancement technique for RAG that improves retrieval quality by addressing a fundamental mismatch: the user's question is short and query-like, while the stored documents are long and answer-like. Embedding a question and a document paragraph in the same vector space often produces sub-optimal similarity scores because their styles differ.
The HyDE solution: before embedding the user's query, ask the LLM to generate a hypothetical document that would answer the question — essentially a plausible answer written in the style of the stored documents. Then embed this hypothetical document instead of the question. The resulting vector is much more similar to real matching documents.
// Custom HyDE QueryTransformer
class HydeQueryTransformer implements QueryTransformer {
private final ChatLanguageModel languageModel;
@Override
public Collection<Query> transform(Query originalQuery) {
String hypothetical = languageModel.generate(
"Write a short paragraph that would answer this question: "
+ originalQuery.text()
);
return List.of(Query.from(hypothetical));
}
}
// Wire into the RAG pipeline
RetrievalAugmentor augmentor = DefaultRetrievalAugmentor.builder()
.queryTransformer(new HydeQueryTransformer(chatModel))
.contentRetriever(EmbeddingStoreContentRetriever.from(store))
.build();
Assistant assistant = AiServices.builder(Assistant.class)
.chatLanguageModel(chatModel)
.retrievalAugmentor(augmentor)
.build();HyDE adds one additional LLM call per user query (to generate the hypothetical), which increases latency and cost. It is most effective for complex technical queries against document corpora where direct question embedding produces poor recall. Simpler query rewriting (compressing conversation context into a standalone question) is usually the better default trade-off.
When LangChain4j requests structured output (returning a POJO from an AI Services method), the LLM occasionally produces malformed JSON despite format instructions — especially with smaller models or complex schemas. Without explicit error handling, this surfaces as a OutputParsingException or JsonParseException from Jackson. Graceful handling is critical for production reliability.
There are three layers where you can handle parsing failures:
1. Return Optional to signal missing/failed results:
interface ReviewExtractor {
Optional<ProductReview> extractReview(String rawText);
}
// Returns Optional.empty() if parsing fails (safer than exception-based control flow)2. Catch OutputParsingException at the call site and fall back:
try {
ProductReview review = extractor.extractReview(text);
return review;
} catch (OutputParsingException e) {
log.warn("Failed to parse review structure: {}. Falling back to raw text.", e.getMessage());
return ProductReview.unparsed(text); // your fallback model
}3. Retry with an explicit correction prompt:
@Retryable(retryFor = OutputParsingException.class, maxAttempts = 2)
ProductReview extractWithRetry(String text) {
return extractor.extractReview(text);
}Reducing parsing failures proactively:
- Use providers with native JSON mode (OpenAI's
response_format: json_object) — configure viaOpenAiChatModelNameand setresponseFormaton the model builder - Add few-shot examples of correct JSON structure in the system message
- Use simpler schemas — fewer fields, no deeply nested objects, enums instead of free-text strings for constrained values
- Use a more capable model for extraction tasks where schema adherence is critical
Standard vector similarity RAG retrieves semantically similar text chunks, but it struggles with multi-hop reasoning — questions like "What are all the direct reports of the manager of the product that had the most returns in Q3?" require traversing multiple relationships, not just finding similar text. Graph-based RAG addresses this by integrating a knowledge graph (like Neo4j) as a content retriever alongside or instead of a vector store.
LangChain4j supports this through the ContentRetriever abstraction. You can implement a Neo4jContentRetriever (or similar) that translates the user's natural language query into a Cypher query using the LLM, executes it against Neo4j, and returns the structured results as text for context injection:
class Neo4jContentRetriever implements ContentRetriever {
private final Driver neo4jDriver;
private final ChatLanguageModel queryGeneratorModel;
@Override
public List<Content> retrieve(Query query) {
// Step 1: LLM generates Cypher from natural language
String cypher = queryGeneratorModel.generate(
"Convert this to a Cypher query: " + query.text()
);
// Step 2: Execute against Neo4j
try (Session session = neo4jDriver.session()) {
Result result = session.run(cypher);
String resultText = result.list().toString();
return List.of(Content.from(resultText));
}
}
}
// Use alongside vector retrieval
RetrievalAugmentor augmentor = DefaultRetrievalAugmentor.builder()
.contentRetriever(new Neo4jContentRetriever(driver, chatModel))
.build();The pattern is often called "GraphRAG" or "Text2Cypher RAG". For production, add query validation (reject Cypher that includes WRITE operations), result size limits, and retry logic for LLM-generated invalid Cypher. LangChain4j's modular ContentRetriever design makes this a clean extension point — no framework modification required.
RAG pipeline quality is notoriously hard to measure because "good retrieval" and "good answers" are context-dependent and partially subjective. LangChain4j does not provide a built-in RAG evaluation framework, but the ecosystem approach involves using LLMs themselves as evaluators (LLM-as-judge) combined with ground-truth question-answer test sets.
The standard evaluation dimensions for RAG systems are:
| Metric | What It Measures | How to Compute |
|---|---|---|
| Context Recall | Were the relevant documents retrieved? | Compare retrieved chunks vs. ground-truth relevant docs |
| Context Precision | What fraction of retrieved docs are actually relevant? | LLM-as-judge scores each retrieved chunk for relevance |
| Answer Faithfulness | Is the answer grounded in the retrieved context? | LLM judge checks if every claim in answer appears in context |
| Answer Relevance | Does the answer address the question? | LLM judge rates how directly the answer responds to the query |
A practical evaluation approach in LangChain4j:
record EvalCase(String question, String groundTruthAnswer, List<String> relevantDocIds) {}
interface RagEvaluator {
@SystemMessage("You are a factual accuracy judge. Rate 0-10.")
@UserMessage("Question: {{question}}\nGenerated Answer: {{answer}}\nContext: {{context}}")
int rateAnswerFaithfulness(String question, String answer, String context);
}
// Run evaluation on a test set
for (EvalCase testCase : testCases) {
String generatedAnswer = ragAssistant.answer(testCase.question());
List<Content> retrieved = contentRetriever.retrieve(Query.from(testCase.question()));
int score = evaluator.rateAnswerFaithfulness(testCase.question(), generatedAnswer, retrieved.toString());
// Aggregate scores across test cases
}For more comprehensive RAG evaluation, integrate LangChain4j with Python-based frameworks like RAGAS or DeepEval via their REST APIs, or use Azure AI Studio's evaluation workflows which support Java-generated answer datasets.
