Generalizing Model Intent

Transitioning from Word-based learning to Token-class learning using De-identification.

1 The Masking Process

By replacing specific identifiers with Functional Labels, we force the model to learn linguistic structure rather than memorizing data.

Original Text                           | Generalized Version
"Ace accommodation, how can I help?"    | "[ORG], how can I help?"
"I'd like to stay on the Gold Coast."   | "I'd like to stay in [LOCATION]."
"Who am I speaking to?"                 | "Who am I speaking to?"
"Miss Mackinlay. Sylvia Mackinlay."     | "[TITLE] [SURNAME]. [FIRSTNAME] [SURNAME]."

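The substitution shown in the examples above can be sketched as a lookup-and-replace pass. This is a minimal illustration, assuming the identifiers are already known; the `LABELS` mapping and the `mask` function are hypothetical names, not part of any library.

```python
import re

# Hypothetical lookup from known identifiers to functional labels.
LABELS = {
    "Ace accommodation": "[ORG]",
    "the Gold Coast": "[LOCATION]",
    "Miss": "[TITLE]",
    "Sylvia": "[FIRSTNAME]",
    "Mackinlay": "[SURNAME]",
}

def mask(text: str, labels: dict = LABELS) -> str:
    """Replace each known identifier with its functional label."""
    # Longest identifiers first, so multi-word phrases win over their parts.
    for identifier in sorted(labels, key=len, reverse=True):
        pattern = r"\b" + re.escape(identifier) + r"\b"
        text = re.sub(pattern, labels[identifier], text)
    return text

print(mask("Miss Mackinlay. Sylvia Mackinlay."))
# [TITLE] [SURNAME]. [FIRSTNAME] [SURNAME].
```

A fixed lookup only covers identifiers seen in advance, which is why the programmatic strategies in the next section matter.
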
2 Programmatic Strategies

A. OOV Dictionary

Checks each word against a standard dictionary word list (e.g., Oxford). Words that are not found are treated as out-of-vocabulary (OOV).

  • ✅ Keep: Words found in dictionary.
  • ❌ Mask: Capitalized non-dictionary words.
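The keep/mask rule above could look roughly like this. It is a sketch under stated assumptions: `DICTIONARY` is a tiny stand-in for a real word list, and `mask_oov` and the generic `[ENTITY]` label are illustrative choices.

```python
import re

# Tiny stand-in word list; a real one would be loaded from a dictionary file.
DICTIONARY = {"my", "name", "is", "i", "live", "in", "who", "am", "speaking", "to"}

def mask_oov(text: str, dictionary: set = DICTIONARY) -> str:
    """Mask capitalized words that are not found in the dictionary."""
    def replace(match: re.Match) -> str:
        word = match.group(0)
        if word.lower() in dictionary:
            return word        # Keep: found in dictionary
        if word[0].isupper():
            return "[ENTITY]"  # Mask: capitalized and out of vocabulary
        return word            # Lowercase unknown words are left unchanged
    return re.sub(r"[A-Za-z]+", replace, text)

print(mask_oov("My name is Sylvia Mackinlay."))
# My name is [ENTITY] [ENTITY].
```

Lookups are lowercased so that a sentence-initial dictionary word such as "My" is kept rather than masked.
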

B. POS Tagging

Uses NLP libraries to identify parts of speech.

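# tagged: (word, tag) pairs from a POS tagger; "NNP" marks a proper noun
masked = ["[ENTITY]" if tag == "NNP" else word
          for word, tag in tagged]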
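A slightly fuller sketch of the same idea follows. To stay self-contained it uses hand-written Penn Treebank tags in place of real tagger output (in practice the pairs might come from a tagger such as NLTK's `nltk.pos_tag`). The function name `mask_proper_nouns` and the inclusion of the plural tag `NNPS` are assumptions made for illustration.

```python
# Penn Treebank tags for singular and plural proper nouns.
PROPER_NOUN_TAGS = {"NNP", "NNPS"}

def mask_proper_nouns(tagged):
    """Replace every proper-noun token with the generic [ENTITY] label."""
    return ["[ENTITY]" if tag in PROPER_NOUN_TAGS else word
            for word, tag in tagged]

# Hand-written tags standing in for tagger output (an assumption).
tagged = [
    ("I", "PRP"), ("'d", "MD"), ("like", "VB"), ("to", "TO"), ("stay", "VB"),
    ("on", "IN"), ("the", "DT"), ("Gold", "NNP"), ("Coast", "NNP"), (".", "."),
]
print(mask_proper_nouns(tagged))
```

Unlike the dictionary check, this catches "Gold" and "Coast" even though both are ordinary dictionary words, because the tagger uses context rather than a word list.
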

Why This Works for Any Text

The model stops learning names and starts mastering Contextual Triggers.

Trigger Phrase     | Expectation
"My name is..."    | [NAME]
"I live in..."     | [LOCATION]
"Arriving on..."   | [DATE]
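To make the trigger idea concrete, here is a rule-based illustration of the association a model is expected to learn. It is not a model: the `TRIGGERS` list and `expected_label` function are hypothetical, and a trained model would learn these associations from data rather than from hand-written patterns.

```python
import re
from typing import Optional

# Trigger phrases listed above, each mapped to the label expected next.
TRIGGERS = [
    (re.compile(r"\bmy name is\b", re.IGNORECASE), "[NAME]"),
    (re.compile(r"\bI live in\b", re.IGNORECASE), "[LOCATION]"),
    (re.compile(r"\barriving on\b", re.IGNORECASE), "[DATE]"),
]

def expected_label(context: str) -> Optional[str]:
    """Return the label that a trigger phrase in `context` predicts, if any."""
    for pattern, label in TRIGGERS:
        if pattern.search(context):
            return label
    return None

print(expected_label("Hello, my name is"))  # [NAME]
```
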

3 Implementation Example

Rental Agent: Good morning! [ORG], how can I help you?

Client: I'd like to stay in [LOCATION].

Rental Agent: Certainly, who am I speaking to?

Client: [TITLE] [SURNAME].

Key Result: By focusing on the "scaffolding" (the ordinary English words around each label), your model can handle names it has never seen, because it recognizes the pattern rather than the specific person.