module Memo::Chunking
Overview
Text chunking for semantic search
Splits large text into semantically meaningful chunks based on configurable limits (sketched after this list):
- Text under no_chunk_threshold tokens is kept whole (no chunking)
- Text over no_chunk_threshold tokens is split on paragraph breaks ("\n\n")
- Paragraphs over max_tokens are further split on sentence boundaries
- Sentences under min_tokens are combined with the next sentence
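For illustration, here is a self-contained Crystal sketch of that cascade. The default limits, the byte-based token estimate, and the sentence-boundary regex are all assumptions, and unlike the real implementation it builds strings rather than tracking character ranges (see the next list):

```crystal
# Illustration only, not the module's actual code. Default limits and the
# 0.25 tokens-per-byte heuristic are assumed values.
def chunk_sketch(text : String,
                 no_chunk_threshold : Int32 = 512,
                 max_tokens : Int32 = 256,
                 min_tokens : Int32 = 16) : Array(String)
  est = ->(s : String) { (s.bytesize * 0.25).to_i }
  return [text] if est.call(text) < no_chunk_threshold

  chunks = [] of String
  text.split("\n\n").each do |para|
    if est.call(para) <= max_tokens
      chunks << para
      next
    end
    # Oversized paragraph: split on sentence endings, merging short
    # sentences forward until the buffer reaches min_tokens.
    buffer = ""
    para.split(/(?<=[.!?])\s+/).each do |sentence|
      buffer = buffer.empty? ? sentence : "#{buffer} #{sentence}"
      if est.call(buffer) >= min_tokens
        chunks << buffer
        buffer = ""
      end
    end
    chunks << buffer unless buffer.empty?
  end
  chunks
end
```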
Range-based approach:
- All operations track exact character positions in the original text
- No string searching or reconstruction; positions are computed during splitting
- The returned offset and size are exact character ranges, compatible with SQLite's SUBSTR
- chunk_text is the exact slice: text[offset, size]
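A hypothetical usage sketch of that slice invariant follows; it assumes the module's methods are callable at module level (e.g. via extend self) and that Config::Chunking can be constructed with defaults:

```crystal
require "memo/chunking"

# Hypothetical construction; the actual Config::Chunking initializer may differ.
config = Memo::Config::Chunking.new

text = File.read("notes.md")
Memo::Chunking.chunk_text(text, config).each do |chunk, offset, size|
  # The documented invariant: every chunk is an exact character slice.
  raise "range mismatch" unless text[offset, size] == chunk
end
```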
Extended Modules
Defined in:
memo/chunking.cr

Instance Method Summary
- #chunk_text(text : String, config : Config::Chunking) : Array(Tuple(String, Int32, Int32))
  Chunks text into segments based on the configuration
- #estimate_tokens(text : String, tokens_per_byte : Float64 = 0.25) : Int32
  Estimates the token count using the tokens_per_byte ratio
Instance Method Detail
def chunk_text(text : String, config : Config::Chunking) : Array(Tuple(String, Int32, Int32))

Chunks text into segments based on the configuration.

Returns an array of tuples: {chunk_text, offset, size}
- chunk_text: the exact slice from the original text (text[offset, size])
- offset: character position in the original text (0-indexed)
- size: character length of the chunk

SQLite usage: SUBSTR(content, offset + 1, size) returns chunk_text exactly (SQLite's SUBSTR is 1-indexed, hence the + 1).
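As a retrieval sketch, assuming chunks are persisted as (offset, size) pairs against the stored document and using the crystal-sqlite3 shard; the docs table, content column, and ids here are hypothetical:

```crystal
require "db"
require "sqlite3"

offset, size = 128, 512 # values previously returned by chunk_text
doc_id = 1              # hypothetical row id

DB.open "sqlite3://./memo.db" do |db|
  # Crystal offsets are 0-indexed; SQLite's SUBSTR is 1-indexed, hence + 1.
  chunk = db.query_one "SELECT SUBSTR(content, ? + 1, ?) FROM docs WHERE id = ?",
                       offset, size, doc_id, as: String
  puts chunk
end
```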
def estimate_tokens(text : String, tokens_per_byte : Float64 = 0.25) : Int32

Estimates the token count using the tokens_per_byte ratio (the 0.25 default matches the common rough heuristic of about four bytes per token for English text).
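A plausible one-line body consistent with the documented ratio; the rounding behavior is an assumption:

```crystal
# Assumes simple truncation toward zero; the real body may round differently.
def estimate_tokens(text : String, tokens_per_byte : Float64 = 0.25) : Int32
  (text.bytesize * tokens_per_byte).to_i
end

estimate_tokens("hello world") # => 2 (11 bytes * 0.25, truncated)
```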