Memo
Semantic search and vector storage library for Crystal.
Features
- CLI tool - Index, search, and manage from the command line
- Text chunking - Splits long text into token-bounded pieces before embedding
- Embedding storage - Deduplication by content hash
- Similarity search - Cosine similarity with filtering
- Text storage - Optional persistent text with LIKE and FTS5 full-text search
- Projection filtering - Fast candidate pre-filtering via random projections (sketched below)
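Projection filtering follows the standard random-projection idea: each embedding is reduced to a short bit signature (one bit per random hyperplane, set by the sign of the dot product), and candidates whose signatures are far from the query's in Hamming distance are skipped before exact cosine similarity is computed. The sketch below shows the general technique only; the signature width, seed, and threshold are illustrative, not Memo's internals:
# Illustrative random-projection prefilter, not Memo's implementation.
SIGNATURE_BITS = 16

# Fixed random hyperplanes, one per signature bit.
def make_planes(dims : Int32) : Array(Array(Float64))
  rng = Random.new(42) # fixed seed so signatures are stable across runs
  Array.new(SIGNATURE_BITS) { Array.new(dims) { rng.rand * 2.0 - 1.0 } }
end

# Bit i is set when the vector lies on the positive side of hyperplane i.
def signature(vec : Array(Float64), planes : Array(Array(Float64))) : UInt32
  sig = 0_u32
  planes.each_with_index do |plane, i|
    dot = 0.0
    plane.each_with_index { |p, j| dot += p * vec[j] }
    sig |= (1_u32 << i) if dot > 0
  end
  sig
end

# Keep only candidates whose signature is close to the query's; exact
# cosine similarity then runs on this much smaller set.
def prefilter(query_sig : UInt32, candidates : Array({Int64, UInt32}), max_hamming = 4)
  candidates.select { |(_id, sig)| (query_sig ^ sig).popcount <= max_hamming }
end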
Installation
Add to your shard.yml:
dependencies:
  memo:
    github: trans/memo
Then run shards install.
CLI
Build the CLI:
shards build
Environment Variables
The CLI reads API keys from environment variables:
export MEMO_API_KEY=sk-... # Primary
export OPENAI_API_KEY=sk-... # Fallback
export VOYAGE_API_KEY=pa-... # Fallback
Global Options
-d, --db=PATH Database path (default: memo.db)
-s, --service=NAME Service name (default: openai)
-k, --api-key=KEY API key (overrides environment variables)
-j, --json Output as JSON (default: human-readable)
--no-vocab Disable vocabulary building during index
-h, --help Show help
-v, --version Show version
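Global options go before the command, as in the JSON Output example below. For instance, to print stats for an alternate database as JSON:
memo --db=/tmp/scratch.db --json stats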
Commands
Index text:
memo index text "Your document text"
memo index text "Document" --source-type=article --source-id=1
Index files from directory:
memo index dir . # Index current directory
memo index dir /path/to/project # Index specific path
memo index dir . --dry-run # Preview without indexing
memo index dir . --full # Force re-index all files
Search:
memo search "semantic search"
memo search "query" --limit=5 --min-score=0.5
Delete:
memo delete source-id=1
Stats:
memo stats
Find similar words:
Vocabulary is built automatically during indexing, so like works out of the box:
memo like "database"
# 0.70 data
# 0.70 databases
# 0.57 sqlite
Rebuild vocabulary (optional):
memo build-vocab # Full rebuild from all indexed texts
Service Management
List available services:
memo service list
memo service # 'list' is the default
Set default service:
memo service use voyage
Create custom service:
memo service create name=my-openai format=openai model=text-embedding-3-large dimensions=1024 max-tokens=8191
Delete service:
memo service delete my-openai
memo service delete my-openai force=true # if service has embeddings
JSON Input
Commands accept JSON via stdin with --stdin:
echo '{"query":"semantic search","limit":5}' | memo search --stdin
JSON Output
Use --json for machine-readable output:
memo --json search query="test" | jq '.[] | select(.score > 0.8)'
Quick Start (Library)
require "memo"
# Create service with database path
memo = Memo::Service.new(
db_path: "/var/data/memo.db",
format: "openai",
api_key: ENV["OPENAI_API_KEY"]
)
# Index a document
memo.index(
source_type: "article",
source_id: 42_i64,
text: "Your document text here..."
)
# Search
results = memo.search(query: "search query", limit: 10)
results.each do |r|
puts "#{r.source_type}:#{r.source_id} (score: #{r.score})"
end
# Clean up
memo.close
API
Memo::Service
The main entry point. Handles the database lifecycle, chunking, and embeddings.
Initialization
memo = Memo::Service.new(
  db_path: "/var/data/memo.db", # Path to database file
  format: "openai",             # API format ("openai", "voyage", "mock")
  api_key: "sk-...",            # API key for provider
  model: nil,                   # Optional: override default model
  dimensions: nil,              # Optional: embedding dimensions (provider default)
  store_text: true,             # Optional: enable text storage (default true)
  chunking_max_tokens: 2000     # Optional: max tokens per chunk
)
For smaller embeddings (faster search, less storage):
memo = Memo::Service.new(
  db_path: "/var/data/memo.db",
  format: "openai",
  api_key: key,
  model: "text-embedding-3-large",
  dimensions: 1024 # Reduced from 3072 default
)
Indexing
# Index single document
memo.index(
  source_type: "article",
  source_id: 123_i64,
  text: "Long text to index...",
  pair_id: nil,   # Optional: related source
  parent_id: nil  # Optional: hierarchical parent
)

# Index with Document struct
doc = Memo::Document.new(
  source_type: "article",
  source_id: 123_i64,
  text: "Document text..."
)
memo.index(doc)

# Batch indexing (more efficient)
docs = [
  Memo::Document.new(source_type: "article", source_id: 1_i64, text: "First..."),
  Memo::Document.new(source_type: "article", source_id: 2_i64, text: "Second..."),
]
memo.index_batch(docs)
Search
results = memo.search(
  query: "search query",
  limit: 10,
  min_score: 0.7,
  source_type: nil,    # Optional: filter by type
  source_id: nil,      # Optional: filter by ID
  pair_id: nil,        # Optional: filter by pair
  parent_id: nil,      # Optional: filter by parent
  like: nil,           # Optional: LIKE pattern(s) for text filtering
  match: nil,          # Optional: FTS5 full-text search query
  sql_where: nil,      # Optional: raw SQL WHERE clause
  include_text: false  # Optional: include text content in results
)
Text Filtering
When text storage is enabled, you can filter by text content:
# LIKE pattern (single)
results = memo.search(query: "cats", like: "%kitten%")
# LIKE patterns (AND logic)
results = memo.search(query: "pets", like: ["%cat%", "%dog%"])
# FTS5 full-text search
results = memo.search(query: "animals", match: "cats OR dogs")
results = memo.search(query: "animals", match: "quick brown*") # prefix
results = memo.search(query: "animals", match: '"exact phrase"')
# Include text in results
results = memo.search(query: "cats", include_text: true)
results.each { |r| puts r.text }
Queue Operations
All indexing goes through an embed queue with automatic retry support:
# Check queue status
stats = memo.queue_stats
puts "Pending: #{stats[:pending]}, Failed: #{stats[:failed]}"
# Process any pending/failed items in queue
memo.process_queue
# Process queue in background (non-blocking)
memo.process_queue_async
# Re-index all documents of a type (requires text storage)
memo.reindex("article")
# Re-index with custom text provider (no text storage needed)
memo.reindex("article") do |source_id|
Article.find(source_id).content # Your app provides text
end
# Clear completed items from queue
memo.clear_completed_queue
# Clear entire queue (pending, failed, completed)
memo.clear_queue
Vocabulary (Word-Level Similarity)
Build a vocabulary from indexed content for word-level semantic search:
# Build vocabulary from all indexed texts
memo.build_vocab # => 1523 (words stored)
# Find words similar to a query
results = memo.like("database")
results.each do |r|
  puts "#{r.word}: #{r.score} (freq: #{r.frequency})"
end
# data: 0.70 (freq: 5)
# databases: 0.70 (freq: 2)
# sqlite: 0.57 (freq: 1)
# Get vocabulary size
memo.vocab_stats # => 1523
# Clear vocabulary
memo.clear_vocab
Other Operations
# Get statistics
stats = memo.stats
puts "Embeddings: #{stats.embeddings}, Chunks: #{stats.chunks}, Sources: #{stats.sources}"
# Delete by source
memo.delete(source_id: 123_i64)
memo.delete(source_id: 123_i64, source_type: "article") # More specific
# Mark chunks as read
memo.mark_as_read(chunk_ids: [1_i64, 2_i64])
# Close connection
memo.close
Search Results
struct Memo::Search::Result
  getter chunk_id : Int64
  getter source_type : String
  getter source_id : Int64
  getter score : Float64
  getter pair_id : Int64?
  getter parent_id : Int64?
  getter text : String? # When include_text: true
end
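pair_id, parent_id, and text are nilable, so guard them when formatting output. A small example using only the documented search options:
results = memo.search(query: "cats", include_text: true, min_score: 0.5)
results.each do |r|
  snippet = r.text.try { |t| t[0, 60] } || "(no stored text)"
  puts "#{r.score.round(2)}  #{r.source_type}:#{r.source_id}  #{snippet}"
end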
Storage
Memo stores all data in a single SQLite file at the specified db_path: services, embeddings, chunks, projections, texts, and the embed queue.
Text storage can be disabled with store_text: false if you prefer to manage text separately.
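With store_text: false, the like:, match:, and include_text: search options no longer apply, since they rely on stored text; vector search and the metadata filters still work:
memo = Memo::Service.new(
  db_path: "/var/data/memo.db",
  format: "openai",
  api_key: ENV["OPENAI_API_KEY"],
  store_text: false # the app keeps its own text; Memo stores vectors and metadata
)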
Providers
Currently supported:
- openai - OpenAI text-embedding-3-small (default), text-embedding-3-large
- voyage - Voyage AI voyage-3 (default), voyage-3-lite, voyage-code-3
- mock - Deterministic embeddings for testing
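The mock provider is handy in test suites: its embeddings are deterministic, so indexing and search behave reproducibly. A minimal sketch (whether api_key may be omitted for mock is not documented here, so a placeholder is passed):
memo = Memo::Service.new(db_path: "spec.db", format: "mock", api_key: "unused")
memo.index(source_type: "spec", source_id: 1_i64, text: "hello world")
results = memo.search(query: "hello", limit: 1)
memo.close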
Architecture
See DESIGN.md for detailed architecture documentation.
License
MIT