module Memo::Chunking
Overview
Text chunking for semantic search
Splits large text into semantically meaningful chunks based on configurable limits (sketched after this list):
- Text under no_chunk_threshold tokens is kept whole (no chunking)
- Text over no_chunk_threshold tokens is split on paragraph breaks ("\n\n")
- Paragraphs over max_tokens are further split on sentence boundaries
- Sentences under min_tokens are combined with the next sentence
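For illustration, here is a self-contained Crystal sketch of that cascade. The default limits, the byte-based token estimate, and the sentence-boundary regex are all assumptions, and unlike the real implementation it builds strings rather than tracking character ranges (see the next list):

```crystal
# Illustration only, not the module's actual code. Default limits and the
# 0.25 tokens-per-byte heuristic are assumed values.
def chunk_sketch(text : String,
                 no_chunk_threshold : Int32 = 512,
                 max_tokens : Int32 = 256,
                 min_tokens : Int32 = 16) : Array(String)
  est = ->(s : String) { (s.bytesize * 0.25).to_i }
  return [text] if est.call(text) < no_chunk_threshold

  chunks = [] of String
  text.split("\n\n").each do |para|
    if est.call(para) <= max_tokens
      chunks << para
      next
    end
    # Oversized paragraph: split on sentence endings, merging short
    # sentences forward until the buffer reaches min_tokens.
    buffer = ""
    para.split(/(?<=[.!?])\s+/).each do |sentence|
      buffer = buffer.empty? ? sentence : "#{buffer} #{sentence}"
      if est.call(buffer) >= min_tokens
        chunks << buffer
        buffer = ""
      end
    end
    chunks << buffer unless buffer.empty?
  end
  chunks
end
```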
Range-based approach:
- All operations track exact character positions in the original text
- No string searching or reconstruction; positions are computed during splitting
- The returned offset and size are exact character ranges, compatible with SQLite's SUBSTR
- chunk_text is the exact slice: text[offset, size]
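A hypothetical usage sketch of that slice invariant follows; it assumes the module's methods are callable at module level (e.g. via extend self) and that Config::Chunking can be constructed with defaults:

```crystal
require "memo/chunking"

# Hypothetical construction; the actual Config::Chunking initializer may differ.
config = Memo::Config::Chunking.new

text = File.read("notes.md")
Memo::Chunking.chunk_text(text, config).each do |chunk, offset, size|
  # The documented invariant: every chunk is an exact character slice.
  raise "range mismatch" unless text[offset, size] == chunk
end
```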
Extended Modules
Defined in:
memo/chunking.cr

Instance Method Summary
- #chunk_text(text : String, config : Config::Chunking) : Array(Tuple(String, Int32, Int32))
  Chunks text into segments based on the configuration
- #estimate_tokens(text : String, tokens_per_byte : Float64 = 0.25) : Int32
  Estimates the token count using the tokens_per_byte ratio
Instance Method Detail
def chunk_text(text : String, config : Config::Chunking) : Array(Tuple(String, Int32, Int32))

Chunks text into segments based on the configuration.

Returns an array of tuples: {chunk_text, offset, size}
- chunk_text: the exact slice from the original text (text[offset, size])
- offset: character position in the original text (0-indexed)
- size: character length of the chunk

SQLite usage: SUBSTR(content, offset + 1, size) returns chunk_text exactly (SQLite's SUBSTR is 1-indexed, hence the + 1).
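As a retrieval sketch, assuming chunks are persisted as (offset, size) pairs against the stored document and using the crystal-sqlite3 shard; the docs table, content column, and ids here are hypothetical:

```crystal
require "db"
require "sqlite3"

offset, size = 128, 512 # values previously returned by chunk_text
doc_id = 1              # hypothetical row id

DB.open "sqlite3://./memo.db" do |db|
  # Crystal offsets are 0-indexed; SQLite's SUBSTR is 1-indexed, hence + 1.
  chunk = db.query_one "SELECT SUBSTR(content, ? + 1, ?) FROM docs WHERE id = ?",
                       offset, size, doc_id, as: String
  puts chunk
end
```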
def estimate_tokens(text : String, tokens_per_byte : Float64 = 0.25) : Int32

Estimates the token count using the tokens_per_byte ratio (the 0.25 default matches the common rough heuristic of about four bytes per token for English text).
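A plausible one-line body consistent with the documented ratio; the rounding behavior is an assumption:

```crystal
# Assumes simple truncation toward zero; the real body may round differently.
def estimate_tokens(text : String, tokens_per_byte : Float64 = 0.25) : Int32
  (text.bytesize * tokens_per_byte).to_i
end

estimate_tokens("hello world") # => 2 (11 bytes * 0.25, truncated)
```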