A production document intelligence pipeline, built on AWS Bedrock & Azure AI.
Behavioral-health and regulatory work generates thousands of documents a day (clinical records, policies, forms), each a wall of unstructured text someone has to read. Here’s the pipeline we built to read them: extract, understand and act at scale, routing each document to the cheapest model that can handle it.
Beyond OCR and brittle rules
Traditional document processing means manual review, basic OCR that misses context, and rigid rule-based extraction that breaks on variation. This pipeline addresses those problems by combining services, processing asynchronously, scaling automatically, and adapting to document type and complexity.
Better accuracy
Combine multiple AI services rather than betting on a single solution.
Non-blocking
Asynchronous processing so users never wait on a 50-page contract.
Scales automatically
From dozens to millions of documents on the same architecture.
Adaptive
Routes each document by type and complexity to the right tool, at the right cost.
A three-stage assembly line
The pipeline mirrors how humans read documents (extract, understand, structure) but at machine scale. Each stage is optimized independently, and we can route around any service that has a bad day.
A storage layer that hides the cloud
An abstraction over Azure Blob, S3 or GCS, so application logic never depends on the provider. Each document gets a unique id; input and output use predictable {jobId}_input / {jobId}_output keys.
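The key convention above can be kept in one place so that no caller formats keys by hand. The following is a minimal sketch, not the pipeline's actual code: `DocumentKeys`, `NewJobId`, `Input` and `Output` are hypothetical names, and using a GUID as the unique id is an assumption.

```csharp
using System;

// Hypothetical helper: centralizes the {jobId}_input / {jobId}_output convention.
public static class DocumentKeys
{
    // Assumption: a GUID without dashes is an acceptable unique job id.
    public static string NewJobId() => Guid.NewGuid().ToString("N");

    public static string Input(string jobId) => $"{Require(jobId)}_input";

    public static string Output(string jobId) => $"{Require(jobId)}_output";

    // Reject empty ids early so a malformed key never reaches the store.
    private static string Require(string jobId) =>
        string.IsNullOrWhiteSpace(jobId)
            ? throw new ArgumentException("Job id must be non-empty.", nameof(jobId))
            : jobId;
}
```

With this in place, a handler would ask for `DocumentKeys.Input(jobId)` instead of interpolating the suffix itself, which keeps the naming rule changeable in a single file.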
```csharp
public interface IDocumentStore
{
    // Uploads content under the given key and returns the stored key and size.
    Task<(string logicalKey, long fileSize)> UploadAsync(
        string key, Stream content, string contentType, CancellationToken ct);

    // Copies the blob into destination; returns null when the key does not exist.
    Task<long?> TryDownloadAsync(string key, Stream destination, CancellationToken ct);

    Uri GetResourceUri(string blobName);
}
```
Getting the words out, with structure
Different documents need different extraction. A unified interface picks the right method, and returns text as individual lines so spatial relationships survive into the AI stage.
Simple text
Basic OCR is enough.
Complex forms
Layout understanding and field recognition.
Handwriting
Advanced recognition models.
Multi-language
Specialized language models.
```csharp
public interface ITextRecognition
{
    Task<IList<string>> DetectTextAsync(Uri blobPath, CancellationToken ct);
}

public class DocumentIntelligenceService : ITextRecognition
{
    public async Task<IList<string>> DetectTextAsync(Uri blobPath, CancellationToken ct)
    {
        var op = await _client.AnalyzeDocumentAsync(
            WaitUntil.Completed, "prebuilt-read", blobPath, ct);

        // Return individual lines to preserve structure for the LLM stage
        return op.Value.Pages
            .SelectMany(p => p.Lines)
            .Select(l => l.Content)
            .ToList();
    }
}
```
LLM orchestration: a team of specialists
A model registry routes each task to the right model, tracks token usage and cost, and wraps calls in a retry policy.
```csharp
public class LLMOrchestrationClient
{
    // Per-model Bedrock id, cache support, and price per 1K tokens (USD)
    private static readonly Dictionary<AIModel, ModelConfig> ModelRegistry = new()
    {
        { AIModel.Claude35Sonnet, new(Brand.Anthropic, "anthropic.claude-3-5-sonnet-20241022-v2:0",
            supportsCache: true, inputCost: 0.003M, outputCost: 0.015M) },
        { AIModel.NovaLite, new(Brand.Amazon, "us.amazon.nova-lite-v1:0",
            supportsCache: true, inputCost: 0.00006M, outputCost: 0.00024M) },
    };

    public async Task<LLMResponse> ProcessAsync(AIModel model, string prompt,
        string systemPrompt, CancellationToken ct)
    {
        // Every call goes through the retry policy and logs token usage
        return await _retryPolicy.ExecuteAsync(async (c) =>
        {
            var response = await _client.ConverseAsync(BuildRequest(model, prompt, systemPrompt), c);
            LogTokenUsage(model, response.Usage);
            return new LLMResponse(response.Output.Message.Content[0].Text, response);
        }, ct);
    }
}
```
| Model | Best for | Input ($/1K tokens) | Output ($/1K tokens) |
|---|---|---|---|
| Nova Lite | Fast and cheap, for simple extraction | $0.00006 | $0.00024 |
| Nova Pro | Balanced, for general-purpose tasks | n/a | n/a |
| Claude 3.5 Sonnet | Complex reasoning & analysis | $0.003 | $0.015 |
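
To make the price gap in the table concrete, per-request cost can be estimated from token counts. This is an illustrative sketch only: `CostEstimator` and its string model keys are hypothetical, the prices are copied from the table, and Nova Pro is left out because the table gives no price for it.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical estimator: turns token counts into a dollar figure.
public static class CostEstimator
{
    // USD per 1K tokens, copied from the table above.
    private static readonly Dictionary<string, (decimal Input, decimal Output)> PricePer1K = new()
    {
        ["NovaLite"] = (0.00006m, 0.00024m),
        ["Claude35Sonnet"] = (0.003m, 0.015m),
    };

    public static decimal Estimate(string model, int inputTokens, int outputTokens)
    {
        if (!PricePer1K.TryGetValue(model, out var price))
            throw new ArgumentException($"No price registered for '{model}'.", nameof(model));

        // cost = (tokens / 1000) * price per 1K, summed over input and output
        return inputTokens / 1000m * price.Input + outputTokens / 1000m * price.Output;
    }
}
```

For an 8,000-token input and a 1,000-token output, this gives $0.00072 on Nova Lite versus $0.039 on Claude 3.5 Sonnet, roughly a 54x difference per document.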
Prompts as first-class citizens
Many document-AI projects treat prompts as an afterthought. We version, test and compose them: system prompts define the role, user prompts define the task.
```yaml
# system_prompts.yml
document_analysis: >-
  You are a document intelligence specialist trained to extract structured
  information from unstructured text. Identify key entities, relationships,
  and regulatory requirements.

policy_extraction: >-
  You are a policy analyst trained to interpret regulatory documents. Extract
  all verifiable requirements that can be assessed through documentation review.
```
```csharp
public class PromptEngine
{
    public async Task<PolicyAnalysisResult> AnalyzePolicyDocument(
        string documentText, CancellationToken ct)
    {
        var systemPrompt = GetPrompt("system", "policy_extraction");
        var userPrompt = BuildPrompt("extract_policy_questions", new()
        {
            ["DOCUMENT_TEXT"] = documentText
        });

        var response = await _llmClient.ProcessAsync(
            AIModel.Claude35Sonnet, userPrompt, systemPrompt, ct);

        // ProcessAsync returns an LLMResponse; ParsePolicyAnalysis is a hypothetical
        // helper standing in for the mapping to PolicyAnalysisResult.
        return ParsePolicyAnalysis(response);
    }
}
```
Modularity
Role vs task separated into composable parts.
Testability
A/B test prompt versions without code changes.
Maintainability
Non-engineers improve prompts safely.
Version control
Prompts live in Git alongside the code.
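Composing a user prompt from a versioned template comes down to placeholder substitution. The sketch below assumes a `{{NAME}}` placeholder syntax, which the article does not specify; `PromptTemplate.Render` is a hypothetical stand-in for whatever `BuildPrompt` does internally.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical renderer; the {{NAME}} syntax is an assumption.
public static class PromptTemplate
{
    private static readonly Regex Placeholder =
        new(@"\{\{([A-Z0-9_]+)\}\}", RegexOptions.Compiled);

    public static string Render(string template, IReadOnlyDictionary<string, string> values) =>
        Placeholder.Replace(template, match =>
        {
            var name = match.Groups[1].Value;
            // Fail loudly on a missing value rather than sending a half-filled prompt.
            return values.TryGetValue(name, out var value)
                ? value
                : throw new KeyNotFoundException($"No value supplied for placeholder '{name}'.");
        });
}
```

Because `Regex.Replace` does not rescan inserted text, a document that happens to contain `{{SOMETHING}}` is passed through verbatim instead of being expanded a second time.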
Event-driven, asynchronous processing
A 50-page contract might take 30 seconds, and users can’t block on that. An event-driven handler scales with the queue and recovers from transient failures with retries.
```csharp
[MessageHandler(concurrencyLimit: 3)]
public class DocumentProcessingHandler : IMessageConsumer<DocumentUploadedEvent>
{
    public async Task HandleAsync(DocumentUploadedEvent message, CancellationToken ct)
    {
        var jobId = message.DocumentId;
        try
        {
            // Stage 1: extract text line by line
            var documentUri = _documentStore.GetResourceUri($"{jobId}_input");
            var lines = await _textRecognition.DetectTextAsync(documentUri, ct);
            var documentText = string.Join("\n", lines);

            // Stage 2 and 3: analyze with the LLM, then persist the structured result
            var analysis = await _llmProcessor.AnalyzePolicyDocument(documentText, ct);
            await StoreResults(jobId, analysis, ct);
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Failed to process document {JobId}", jobId);
            // Rethrow so the message bus can retry or dead-letter the message
            // instead of treating it as successfully handled.
            throw;
        }
    }
}
```
Sync for small, async for large
```csharp
[HttpPost("/api/v1/documents/analyze")]
public async Task<IActionResult> AnalyzeDocument([FromForm] IFormFile document, CancellationToken ct)
{
    // A null document must be rejected too; `document?.Length <= 0` is false for null
    if (document is null || document.Length <= 0) return BadRequest("No document provided");

    var preferAsync = Request.Headers.Prefer().Return == ReturnPreference.Minimal;
    if (!preferAsync)
    {
        var bytes = await ReadDocumentBytes(document, ct);
        return Ok(await _processor.AnalyzeDocumentAsync(bytes, ct)); // small: immediate
    }

    var jobId = await InitiateAsyncProcessing(document, ct); // large: async
    Response.Headers["Job-ID"] = jobId;
    return AcceptedAtAction("GetJobStatus", new { id = jobId });
}
```
- Small documents (< 1 MB): results returned immediately in the response.
- Large documents: a job id to poll, with real-time status updates.
- Consistent error format whether processing is sync or async.
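
The split described in these bullets can be written as a small pure function that combines the client's preference with a size limit. This is a sketch under assumptions: `ProcessingModeSelector` is a hypothetical name, and the 1 MB threshold is interpreted as 1,048,576 bytes.

```csharp
public enum ProcessingMode { Synchronous, Asynchronous }

// Hypothetical selector: decides how a request should be processed.
public static class ProcessingModeSelector
{
    // Assumption: "1 MB" means 1,048,576 bytes.
    public const long SyncLimitBytes = 1024 * 1024;

    public static ProcessingMode Select(long lengthBytes, bool clientPrefersAsync) =>
        clientPrefersAsync || lengthBytes >= SyncLimitBytes
            ? ProcessingMode.Asynchronous  // return a job id to poll
            : ProcessingMode.Synchronous;  // return the result in the response
}
```

Keeping the decision free of HTTP types makes the threshold easy to unit test and to tune without touching the controller.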
Fast and cheap, by design
Performance and cost aren’t afterthoughts. They’re built into the architecture.
Intelligent model selection
Not every document needs the most powerful model. Routing by size and complexity cuts AI cost by up to 80% with no quality loss. A simple form doesn’t need Claude’s reasoning.
```csharp
public AIModel SelectOptimalModel(DocumentAnalysisRequest request)
{
    var size = request.Content.Length;
    var complexity = EstimateComplexity(request.DocumentType);

    return (size, complexity) switch
    {
        (< 10_000, ComplexityLevel.Low) => AIModel.NovaLite,    // fast, cheap
        (< 50_000, ComplexityLevel.Medium) => AIModel.NovaPro,  // balanced
        (_, ComplexityLevel.High) => AIModel.Claude35Sonnet,    // most capable
        _ => AIModel.NovaPro                                    // default
    };
}
```
Multi-layer caching
Identical contracts and standard forms shouldn’t be reprocessed. A memory → distributed → process fallback eliminates redundant work.
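Recognizing an identical document requires a stable cache key derived from its content. A minimal sketch, assuming SHA-256 over the raw bytes is acceptable; `DocumentHasher` is a hypothetical name, and prefixing the key with a prompt version is an added assumption so that a prompt change does not serve stale results.

```csharp
using System;
using System.Security.Cryptography;

// Hypothetical key builder for the document cache.
public static class DocumentHasher
{
    public static string ComputeKey(ReadOnlySpan<byte> content, string promptVersion)
    {
        // SHA-256 of the raw bytes, rendered as uppercase hex
        var digest = Convert.ToHexString(SHA256.HashData(content));
        return $"{promptVersion}:{digest}";
    }
}
```

Two uploads with the same bytes and prompt version map to the same key, so the second one can be answered from cache without another OCR or LLM call.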
```csharp
public async Task<AnalysisResult> ProcessDocumentAsync(string documentHash,
    Func<Task<AnalysisResult>> processor, CancellationToken ct)
{
    // L1: in-memory (fastest)
    if (_cache.TryGetValue(documentHash, out AnalysisResult cached)) return cached;

    // L2: distributed (shared across instances)
    var distributed = await _distributedCache.GetAsync(documentHash, ct);
    if (distributed != null)
    {
        var result = JsonSerializer.Deserialize<AnalysisResult>(distributed);
        _cache.Set(documentHash, result, TimeSpan.FromMinutes(30));
        return result;
    }

    // Miss: process once, then populate both cache layers for next time
    var processed = await processor();
    await _distributedCache.SetAsync(documentHash,
        JsonSerializer.SerializeToUtf8Bytes(processed),
        new() { AbsoluteExpirationRelativeToNow = TimeSpan.FromHours(24) }, ct);
    _cache.Set(documentHash, processed, TimeSpan.FromMinutes(30));
    return processed;
}
```
Sensitive data, handled correctly
Financial, medical and legal documents make security fundamental, not optional.
```csharp
public async Task<ProcessingResult> ProcessSecureDocumentAsync(
    SecureDocumentRequest request, CancellationToken ct)
{
    await _auditLogger.LogDocumentAccessAsync(request.UserId, request.DocumentId);
    try
    {
        var decrypted = await _encryption.DecryptAsync(request.EncryptedContent);
        var masked = MaskSensitiveData(decrypted); // mask PII before AI
        var result = await ProcessDocumentContent(masked, ct);

        result.EncryptedOutput = await _encryption.EncryptAsync(result.Output);
        result.Output = null; // drop the plaintext reference
        return result;
    }
    finally
    {
        // GC.Collect() only encourages reclamation of unreachable objects;
        // it does not zero their memory. Plaintext held in byte buffers
        // should also be wiped explicitly.
        GC.Collect();
    }
}
```
Encrypt everything
At rest, in transit, and during processing.
Audit all access
Who accessed which document, and when.
Minimize exposure
Mask PII before sending to AI services.
Clean up
Clear sensitive plaintext from memory immediately.
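The "minimize exposure" and "clean up" points can be illustrated with two small helpers. This is a sketch, not the pipeline's `MaskSensitiveData`: the two regular expressions (US Social Security numbers and email addresses) are illustrative assumptions and nowhere near a complete PII detector, and `SensitiveData` is a hypothetical name. Note that .NET strings are immutable and cannot be zeroed, so explicit wiping only works for plaintext held in byte buffers.

```csharp
using System;
using System.Security.Cryptography;
using System.Text.RegularExpressions;

// Hypothetical helpers for masking PII and wiping plaintext buffers.
public static class SensitiveData
{
    // Illustrative patterns only; real PII detection needs far broader coverage.
    private static readonly Regex Ssn =
        new(@"\b\d{3}-\d{2}-\d{4}\b", RegexOptions.Compiled);
    private static readonly Regex Email =
        new(@"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", RegexOptions.Compiled);

    // Replace matches with fixed tokens before the text leaves the trust boundary.
    public static string Mask(string text) =>
        Email.Replace(Ssn.Replace(text, "[SSN]"), "[EMAIL]");

    // Overwrite a plaintext buffer with zeros; the runtime will not optimize this away.
    public static void Wipe(byte[] buffer) =>
        CryptographicOperations.ZeroMemory(buffer);
}
```

Calling `SensitiveData.Wipe` in a `finally` block gives a deterministic guarantee that a garbage collection cannot.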
You can’t optimize what you can’t measure
```csharp
public async Task<T> TrackProcessingAsync<T>(string operation, Func<Task<T>> processor)
{
    // The timer records the duration when it is disposed, on success or failure
    using var timer = _metrics.StartTimer($"document_processing.{operation}.duration");
    try
    {
        var result = await processor();
        _metrics.Increment($"document_processing.{operation}.success");
        return result;
    }
    catch (Exception ex)
    {
        _metrics.Increment($"document_processing.{operation}.error",
            new[] { ("error_type", ex.GetType().Name) });
        throw;
    }
}
```
- Processing times per stage; success / failure rates by document type.
- Cost metrics: token usage and AI service spend.
- Queue depth and error patterns, to catch issues before users do.
Hard-won lessons from production
Start simple, scale smart
One document type working reliably beats every edge case half-done.
Prompt engineering is critical
A well-crafted prompt can lift accuracy ~40% while cutting cost.
Plan for failure
Retries, circuit breakers and graceful degradation from day one.
Monitor everything
Token usage, latency, error rates and cost, from the start.
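As an illustration of "plan for failure", here is a minimal retry helper with exponential backoff. It is a sketch rather than the `_retryPolicy` used in the pipeline: `Retry.WithBackoffAsync` is a hypothetical name, and it treats every exception except cancellation as transient, which a production policy would narrow down.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical retry helper with exponential backoff.
public static class Retry
{
    public static async Task<T> WithBackoffAsync<T>(
        Func<CancellationToken, Task<T>> action,
        int maxAttempts,
        TimeSpan baseDelay,
        CancellationToken ct)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await action(ct);
            }
            // On the last attempt the filter is false, so the exception propagates.
            catch (Exception ex) when (attempt < maxAttempts && ex is not OperationCanceledException)
            {
                // Wait baseDelay, then 2x, 4x, ... before the next attempt
                var delay = TimeSpan.FromMilliseconds(
                    baseDelay.TotalMilliseconds * Math.Pow(2, attempt - 1));
                await Task.Delay(delay, ct);
            }
        }
    }
}
```

A circuit breaker and random jitter on the delay are natural next additions once several workers retry against the same service.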
Challenges, and how we solved them
Multi-cloud complexity
Different APIs, auth and formats per provider.
Abstraction layers with unified interfaces that preserve provider-specific optimizations.
LLM response reliability
Models occasionally return malformed JSON or unexpected structures.
Schema validation, multiple fallback parsers, and backup processing paths.
Cost management
LLM spend escalates fast with large documents and complex prompts.
Intelligent model selection, cost tracking and automatic alerts.
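For the LLM response reliability problem, a fallback parser with a basic schema check might look like the sketch below. The names are hypothetical and the two strategies (parse as-is, then parse the outermost `{...}` span) are an assumed, simplified version of "multiple fallback parsers"; the schema check only verifies that required top-level properties exist.

```csharp
using System;
using System.Text.Json;

// Hypothetical tolerant parser for model output that should be a JSON object.
public static class LlmJson
{
    public static bool TryParseObject(string raw, string[] requiredProperties, out JsonElement result)
    {
        result = default;
        if (raw is null) return false;

        // Strategy 1: the raw text. Strategy 2: the outermost {...} span,
        // which recovers JSON wrapped in prose or a markdown code fence.
        foreach (var candidate in new[] { raw, ExtractBraces(raw) })
        {
            if (string.IsNullOrWhiteSpace(candidate)) continue;
            try
            {
                using var doc = JsonDocument.Parse(candidate);
                var root = doc.RootElement;
                if (root.ValueKind != JsonValueKind.Object) continue;
                if (!Array.TrueForAll(requiredProperties, p => root.TryGetProperty(p, out _))) continue;

                result = root.Clone(); // Clone so the element outlives the document
                return true;
            }
            catch (JsonException)
            {
                // Malformed: fall through to the next strategy
            }
        }
        return false;
    }

    private static string ExtractBraces(string raw)
    {
        var start = raw.IndexOf('{');
        var end = raw.LastIndexOf('}');
        return start >= 0 && end > start ? raw.Substring(start, end - start + 1) : string.Empty;
    }
}
```

When `TryParseObject` returns false, the caller can route the document to a backup path, such as a retry with a stricter prompt or a more capable model.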
Where this goes next
Multi-modal
Images, tables and charts understood inline within documents.
Streaming responses
Results appear as they’re generated, not after the whole analysis.
Federated learning
Improve models from processed docs while preserving privacy.
Semantic chunking
Context-preserving splitting for better long-document accuracy.
Composable services, designed for failure
The key insight is to treat each component (OCR, LLM processing, storage) as an independent, composable service. That separation lets each part evolve while the system stays coherent. Start with a simple use case, get it working reliably, then expand. Build in monitoring and cost controls from day one. And design for failure: in production, failure isn’t a possibility, it’s a certainty.