Most practical systems represent a sentence at several levels at once:
- syntax (grammar structure)
- semantics (meaning)
- token sequence
- statistical pattern
- embedding/vector representation
They then compare or recombine sentences at one or more of these levels.
Example:
Sentence A:
“the expensive salmon”
Sentence B:
“I bought for my own dinner”
A practical system usually does NOT combine raw text blindly.
Instead, it extracts structures such as:
Grammar:
- Sentence A: DET + ADJ + NOUN
- Sentence B: SUBJECT + VERB + PURPOSE
Semantics:
- food item
- ownership
- purchase action
- personal consumption
Then it searches for compatible continuations.
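The grammar-extraction step can be sketched with a toy tagger. This is only an illustration: the `LEXICON` table and the `tag` and `pattern` helpers are hypothetical, and the tag names loosely follow common part-of-speech labels. A real system would use a trained tagger rather than a hand-written word list.

```python
# Toy part-of-speech lexicon (hypothetical, hand-written for this example only).
LEXICON = {
    "the": "DET",
    "expensive": "ADJ",
    "salmon": "NOUN",
    "i": "PRON",
    "bought": "VERB",
    "for": "ADP",
    "my": "PRON",
    "own": "ADJ",
    "dinner": "NOUN",
}


def tag(sentence: str) -> list[tuple[str, str]]:
    """Return (word, tag) pairs; words missing from the lexicon get 'X'."""
    return [(word, LEXICON.get(word.lower(), "X")) for word in sentence.split()]


def pattern(sentence: str) -> str:
    """Collapse a sentence into its tag sequence, joined with ' + '."""
    return " + ".join(t for _, t in tag(sentence))


print(pattern("the expensive salmon"))
print(pattern("I bought for my own dinner"))
```

The first call yields the DET + ADJ + NOUN pattern from the example. The second yields a finer-grained tag sequence than the SUBJECT + VERB + PURPOSE roles, which a parser would assign in a later step.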
- SIMPLEST METHOD: Token probability
LLMs generate text mainly by estimating:
P(next_token | previous_tokens)
Example:
Input:
“the expensive salmon”
Possible continuations:
- was delicious
- smelled awful
- I bought yesterday
- I bought for my own dinner
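The model ranks these continuations by probability.
This ranking alone already produces recombination behavior: a fragment such as “I bought for my own dinner” can follow a phrase it never followed in the training data.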
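A minimal sketch of the idea, assuming a bigram model as a stand-in: it conditions on only the single previous token, whereas an LLM conditions on the whole preceding context. The `CORPUS` sentences and the `next_token_probs` helper are made up for illustration.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; a real model is trained on vastly more text.
CORPUS = [
    "the expensive salmon was delicious",
    "the expensive salmon smelled awful",
    "the expensive salmon was delicious",
    "the cheap salmon was fine",
]

# Count how often each token follows each previous token (bigram counts).
counts: dict[str, Counter] = defaultdict(Counter)
for line in CORPUS:
    tokens = line.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1


def next_token_probs(prev: str) -> dict[str, float]:
    """Estimate P(next_token | prev) from relative bigram frequencies."""
    following = counts[prev]
    total = sum(following.values())
    # most_common() orders candidates from most to least frequent.
    return {tok: n / total for tok, n in following.most_common()}


print(next_token_probs("salmon"))
```

In this corpus "salmon" is followed by "was" three times and "smelled" once, so "was" ranks first.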
- EMBEDDING SIMILARITY: Most practical modern method
Convert sentence → vector.
Example:
“The expensive salmon” → [0.12, -0.44, 0.88, …]
“I bought seafood for dinner” → [0.10, -0.39, 0.81, …]
Then compare using cosine similarity.
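As a rough sketch, cosine similarity is the dot product of two vectors divided by the product of their lengths. The example below uses only the three components printed above; the real vectors are longer, so the number it prints is illustrative, not the true similarity of those sentences.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Only the first three components shown in the text (assumption: truncated).
salmon = [0.12, -0.44, 0.88]
seafood = [0.10, -0.39, 0.81]

print(cosine_similarity(salmon, seafood))
```

Values near 1 mean the vectors point in almost the same direction, values near 0 mean they are unrelated, and negative values mean they point in opposing directions.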
Common uses:
- semantic search
- RAG
- clustering
- deduplication
- recommendation
- sentence matching
Best practice:
- use transformer embeddings
- compare vectors with cosine similarity
- index them with approximate nearest neighbor (ANN) search or a vector database
Common tools:
- sentence-transformers
- spaCy vectors
- FAISS
- ChromaDB
- Milvus
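Conceptually, a vector index answers "which stored vectors are closest to this query?". The sketch below does that by exact brute-force search, which is what ANN libraries approximate at scale. It does not use any of the tools listed above; the `INDEX` contents are made-up three-dimensional vectors, and `top_k` is a hypothetical helper.

```python
import heapq
import math

# Made-up 3-dimensional vectors standing in for real sentence embeddings.
INDEX = {
    "I bought seafood for dinner": [0.10, -0.39, 0.81],
    "the train was late again": [-0.70, 0.52, 0.05],
    "grilled fish is my favorite meal": [0.20, -0.30, 0.75],
}


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k(query: list[float], k: int = 2) -> list[tuple[float, str]]:
    """Return the k stored sentences most similar to the query (exact search)."""
    scored = ((cosine(query, vec), text) for text, vec in INDEX.items())
    return heapq.nlargest(k, scored)


for score, text in top_k([0.12, -0.44, 0.88]):
    print(f"{score:.3f}  {text}")
```

Brute force compares the query against every stored vector, which is fine for small collections. ANN indexes trade a little accuracy for much faster lookups on large ones.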
- STRUCTURE MATCHING: Very important for generation
Systems often learn reusable templates.
Example:
Pattern:
“the [ADJ] [NOUN] I bought for my own [NOUN]”
Can generate:
- the expensive salmon I bought for my own dinner
- the fancy wine I bought for my own party
- the rare book I bought for my own collection
This is compositional generation.
Very important concept:
- reusable latent structures
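The template above can be sketched as slot filling. This is a deliberately explicit version: real models learn such patterns implicitly rather than storing a template string. The slot names `adj`, `noun`, and `purpose` are assumptions chosen for this example.

```python
# The pattern from the text, with its three slots given hypothetical names.
TEMPLATE = "the {adj} {noun} I bought for my own {purpose}"

FILLERS = [
    {"adj": "expensive", "noun": "salmon", "purpose": "dinner"},
    {"adj": "fancy", "noun": "wine", "purpose": "party"},
    {"adj": "rare", "noun": "book", "purpose": "collection"},
]


def generate(template: str, fillers: list[dict[str, str]]) -> list[str]:
    """Fill the template once per set of slot values."""
    return [template.format(**slots) for slots in fillers]


for sentence in generate(TEMPLATE, FILLERS):
    print(sentence)
```

One structure plus three sets of fillers reproduces the three sentences listed above.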
- VALIDATION: “How do we know it makes sense?”
Practical systems use multiple checks.
- Grammar validation
  - POS tagging
  - dependency parsing
  - grammar models
- Semantic plausibility. Example:
- World knowledge. Example:
  - salmon is edible
  - cats eat fish
- Statistical likelihood: if many similar patterns appeared during training, the probability increases.
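A world-knowledge check can be sketched as a lookup against stored facts. The triple format, the `FACTS` set, and the `can_eat` helper are all hypothetical. The "salmon is a fish" triple is added here so the two facts from the text can be chained.

```python
# Hypothetical fact store of (subject, relation, object) triples.
FACTS = {
    ("salmon", "is", "edible"),
    ("cats", "eat", "fish"),
    ("salmon", "is_a", "fish"),  # added for this example to link the two facts
}


def can_eat(eater: str, food: str) -> bool:
    """True if the eater eats the food itself or a category the food belongs to."""
    if (eater, "eat", food) in FACTS:
        return True
    categories = {obj for subj, rel, obj in FACTS if subj == food and rel == "is_a"}
    return any((eater, "eat", category) in FACTS for category in categories)


print(can_eat("cats", "salmon"))
print(can_eat("cats", "book"))
```

A sentence about a cat eating salmon passes this check, while one about a cat eating a book does not. A real system would draw on a much larger knowledge source.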
- BEST PRACTICES: Most practical systems
- Use embeddings for semantic similarity, NOT raw string matching.
Bad:
Good:
- Keep separate representations for:
  - syntax
  - semantics
  - entities
  - actions
  Keeping these layers apart improves recombination quality.
- Use chunk-level recombination, NOT random word mixing.
Good chunks:
- noun phrases
- verb phrases
- clauses
Example:
- “the expensive salmon”
- “I bought yesterday”
- “for my own dinner”
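Chunk-level recombination can be sketched as choosing one chunk per slot and joining them in a fixed order. The slot lists below reuse phrases from this text, and the `recombine` helper is hypothetical. The output is a set of candidates that still needs validation, not finished sentences.

```python
from itertools import product

# One list of interchangeable chunks per slot (phrases taken from this text).
NOUN_PHRASES = ["the expensive salmon", "the fancy wine"]
VERB_PHRASES = ["I bought yesterday", "I bought"]
PURPOSE_PHRASES = ["for my own dinner", "for my own party"]


def recombine(*slots: list[str]) -> list[str]:
    """Build every candidate that takes exactly one chunk from each slot, in order."""
    return [" ".join(chunks) for chunks in product(*slots)]


candidates = recombine(NOUN_PHRASES, VERB_PHRASES, PURPOSE_PHRASES)
for candidate in candidates:
    print(candidate)
```

Because whole phrases are kept intact, every candidate stays locally grammatical, which random word mixing does not guarantee.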
- Validate generated text before accepting.
Typical checks:
- grammar
- semantic coherence
- contradiction detection
- toxicity
- domain rules
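One way to organize such checks is a small pipeline that reports which ones failed. The three checks below are crude stand-ins chosen so the example runs without any models. Real grammar, coherence, and toxicity checks would call trained components. All function names and the `BLOCKLIST` entry are hypothetical.

```python
from typing import Callable

# Placeholder term; a real toxicity check would use a trained classifier.
BLOCKLIST = {"badword"}


def not_empty(text: str) -> bool:
    """Reject empty or whitespace-only text."""
    return bool(text.strip())


def no_repeated_word(text: str) -> bool:
    """Crude fluency check: reject immediate repeats such as 'the the'."""
    words = text.lower().split()
    return all(a != b for a, b in zip(words, words[1:]))


def no_blocked_terms(text: str) -> bool:
    """Reject text containing any blocklisted word."""
    return not any(word in BLOCKLIST for word in text.lower().split())


CHECKS: dict[str, Callable[[str], bool]] = {
    "not_empty": not_empty,
    "no_repeated_word": no_repeated_word,
    "no_blocked_terms": no_blocked_terms,
}


def validate(text: str) -> list[str]:
    """Return the names of failed checks; an empty list means the text is accepted."""
    return [name for name, check in CHECKS.items() if not check(text)]


print(validate("the expensive salmon I bought for my own dinner"))
print(validate("the the salmon"))
```

Returning the names of failed checks, rather than a single pass/fail flag, makes it easier to see why a candidate was rejected.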
- Store latent representations instead of memorizing sentences.
Modern systems prefer: sentence → embedding/vector
rather than: sentence → raw lookup table
- ADVANCED SYSTEMS
More advanced architectures combine:
- embeddings
- symbolic logic
- knowledge graphs
- memory retrieval
- syntactic trees
- world models
- probabilistic decoding
This allows:
- novelty
- consistency
- reasoning
- controllable generation
instead of simple copying.
- IMPORTANT REALIZATION
Human language itself is highly compositional.
Humans also recombine learned fragments:
- adjective patterns
- verb patterns
- social scripts
- semantic templates
So a “new sentence” usually means a new arrangement of known structures, NOT a completely alien creation.