How to build smarter lexicon searches

March 16, 2026

Bloomberg Professional Services

This article was written by Eugene Semetsky and Polly Abreu, product managers at Bloomberg Vault.

Designing effective lexicons is a critical component of any communications surveillance program, yet even mature teams encounter avoidable pitfalls that diminish accuracy, create operational strain, and weaken explainability.

As regulatory expectations rise and communication channels diversify, firms must balance breadth, precision, and efficiency in how they detect risk signals. This article outlines common challenges observed across the industry and provides practical guidance for building more resilient searches that strengthen detection while preserving operational effectiveness.

This second article in our four-part series on lexicon-based surveillance focuses on how to design smarter, more resilient lexicon searches that improve precision, reduce noise, and enhance explainability.

Part one explored how to build a robust lexicon policy framework. Part three examines the lexicon calibration process, and part four explores how AI-enabled methods can augment lexicon-based approaches to strengthen surveillance outcomes.

Common lexicon pitfalls

This section explores how compliance teams can design more effective, explainable lexicon searches by avoiding common pitfalls.

Starting with overly broad coverage

Rolling out lexicons with overly broad coverage can overwhelm reviewers with false positives, reducing effectiveness and creating operational strain. Regulators may also view any later narrowing of scope unfavorably unless there is clear, well-documented justification for why certain lexicon terms are no longer required; high alert volume, on its own, is not a defensible reason for reducing coverage. A more sustainable approach is to start small, iterate, and scale gradually, allowing each phase to be rolled out, assessed for impact, and refined before additional terms or phrases are introduced.

Remedy: Consider starting with targeted rules, validating performance and expanding coverage in stages.

Overly general searches

Overly general searches can generate excessive alerts and dilute effectiveness. For instance, a standalone rule for “quid pro quo” with no contextual filters can produce a high volume of false positives, because the phrase often appears in benign contexts unrelated to misconduct.

Remedy: Consider adding proximity and contextual filters to narrow results to relevant scenarios.

Example – “Book Scrubbing” and Wash Language

“Let’s scrub the books before end-of-month reporting.”
“We need to clean our book before compliance sees it.”

A basic keyword pattern that matches only “clean” or “scrub” would capture every instance of either word, resulting in high alert volumes and false positives.

Introducing contextual proximity can significantly reduce unrelated false positives, but it can also make the search too narrow, potentially missing relevant variations or slang expressions.
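To make the trade-off concrete, here is a minimal Python sketch comparing a bare keyword rule with a proximity rule. It is not Bloomberg Vault syntax. The function names, the five-token window, the context terms, and the last two sample messages are assumptions made for illustration only.

```python
import re

TOKEN_RE = re.compile(r"\w+")

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())

def keyword_hit(text, keywords):
    """Bare keyword rule: fire if any keyword appears anywhere."""
    return any(tok in keywords for tok in tokenize(text))

def proximity_hit(text, triggers, context, window=5):
    """Proximity rule: fire only if a trigger word sits within
    `window` tokens of a context word (assumed window size)."""
    tokens = tokenize(text)
    trig_pos = [i for i, t in enumerate(tokens) if t in triggers]
    ctx_pos = [i for i, t in enumerate(tokens) if t in context]
    return any(abs(i - j) <= window for i in trig_pos for j in ctx_pos)

TRIGGERS = {"scrub", "clean"}   # terms from the example above
CONTEXT = {"book", "books"}     # assumed context terms

messages = [
    "Let's scrub the books before end-of-month reporting.",
    "We need to clean our book before compliance sees it.",
    "Please clean the kitchen before you leave.",   # hypothetical benign message
    "Time to tidy up the ledger before audit.",     # hypothetical variant phrasing
]

for msg in messages:
    print(keyword_hit(msg, TRIGGERS), proximity_hit(msg, TRIGGERS, CONTEXT), msg)
```

On these samples the keyword rule flags the kitchen message while the proximity rule does not, and both rules miss the final message because it uses neither trigger word.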

Overly narrow searches

Overly narrow searches risk missing variations in phrasing or intent. Focusing on a single expression or word may fail to capture the broader behavior you want to detect.

Remedy: Consider inclusive searches that incorporate proximity logic, synonyms, and common variants.

Continuing the “Book Scrubbing” example, a pattern that combines proximity logic with synonyms and common variants captures intent even when slang, abbreviations, or spelling variations appear, while narrowing the focus to contexts that are truly relevant to potential misconduct.
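As a rough illustration of an inclusive search, the Python sketch below pairs word stems and abbreviations with proximity logic. It is a hypothetical example, not the article's pattern or any product's query syntax. The synonym and abbreviation lists, the five-token window, and every sample message other than the first are assumptions.

```python
import re

# Assumed action stems and abbreviations; the trailing \w* lets
# "scrubbing", "cleaned", and similar inflections match.
ACTION_RE = re.compile(r"(scrub|clean|cln|tidy|wash)\w*")
# Assumed object terms, with an optional plural "s".
OBJECT_RE = re.compile(r"(book|bk|ledger)s?")

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def inclusive_hit(text, window=5):
    """Fire when any action variant is within `window` tokens
    of any object variant."""
    tokens = tokenize(text)
    actions = [i for i, t in enumerate(tokens) if ACTION_RE.fullmatch(t)]
    objects = [i for i, t in enumerate(tokens) if OBJECT_RE.fullmatch(t)]
    return any(abs(a - o) <= window for a in actions for o in objects)

samples = [
    "Let's scrub the books before end-of-month reporting.",
    "pls cln the bk b4 month end",               # hypothetical abbreviations
    "Time to tidy up the ledger before audit.",  # hypothetical synonym
    "Please clean the kitchen before you leave.",
]

for msg in samples:
    print(inclusive_hit(msg), msg)
```

Broad stems such as `wash` will also match unrelated words, so any list like this would need to be tested against real communications before use.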

Overlooking emojis and non-text signals

Digital communication has evolved beyond words and often includes emojis as shorthand for sentiment, intent, or coded meaning, particularly in informal channels like chat or mobile messaging. Yet many surveillance programs still do not include emojis in their lexicon searches.

This is a risk because, in practice, emojis can convey tone, approval, or disapproval, and can even act as substitutes for prohibited language. Emojis can signal tipping, collusion, or emotional states associated with high-risk behavior such as front-running or insider discussions.

Remedy: Consider incorporating high-risk emojis into lexicon policies where relevant.

Example pattern:

{🚀} <within 5 tokens of> {stock* OR share* OR name OR deal}

This pattern captures emoji references linked to key risk words, helping detect coded or mixed-format communication that plain text searches miss.
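One way to approximate the example pattern outside a surveillance platform is sketched below in Python. The tokenizer, the reading of "within 5 tokens" as an index distance of at most five, and the sample messages are all assumptions. Real platforms may tokenize emojis and punctuation differently.

```python
import re

HIGH_RISK_EMOJIS = ("🚀",)  # from the example pattern
# stock* OR share* OR name OR deal, expressed as a regex per token
TERM_RE = re.compile(r"(stock\w*|share\w*|name|deal)")
# Tokens are runs of word characters or one of the listed emojis;
# other punctuation is ignored (an assumption of this sketch).
TOKEN_RE = re.compile(
    r"\w+|" + "|".join(re.escape(e) for e in HIGH_RISK_EMOJIS)
)

def emoji_proximity_hit(text, window=5):
    """Fire when a listed emoji is within `window` tokens of a risk term."""
    tokens = TOKEN_RE.findall(text.lower())
    emoji_pos = [i for i, t in enumerate(tokens) if t in HIGH_RISK_EMOJIS]
    term_pos = [i for i, t in enumerate(tokens) if TERM_RE.fullmatch(t)]
    return any(abs(e - t) <= window for e in emoji_pos for t in term_pos)

samples = [
    "This name is going 🚀 after the close",  # hypothetical message
    "🚀 launch party on Friday!",              # emoji, no risk term
    "Buy shares now",                          # risk term, no emoji
]

for msg in samples:
    print(emoji_proximity_hit(msg), msg)
```

Only the first sample fires, because it is the only one where the emoji and a risk term appear together within the window.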

Global consistency versus local relevance

Firms have sometimes prioritized global lexicons to strengthen control and explainability. However, applying uniform global rules can dilute sensitivity in specific regions where slang, cultural nuances, or risk exposures differ.

Remedy: Maintain global standards for core risks, such as market abuse and conduct breaches, but allow regional customization to reflect local communication styles, regulatory expectations, and business contexts.

Multilingual and regional nuance

Firms have historically underestimated how much language and cultural nuance affect risk detection. Direct translation often misses intent; regional phrasing, abbreviations, and tone vary widely. Overlooking these differences can leave blind spots in surveillance coverage.

Remedy: Use a language-risk matrix to allocate effort proportionally across regions and involve regional SMEs, particularly in areas such as bribery and corruption, where local phrasing and context matter most.
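The article does not define the structure of a language-risk matrix. One possible reading is a table of languages against risk categories, with effort allocated in proportion to each language's total score. The Python sketch below follows that reading. The language labels, risk categories, scores, and the 100-hour budget are placeholders, not recommendations.

```python
def allocate_effort(matrix, total_hours):
    """Split `total_hours` across languages in proportion to the
    sum of each language's risk scores."""
    totals = {lang: sum(scores.values()) for lang, scores in matrix.items()}
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("matrix has no non-zero risk scores")
    return {lang: total_hours * t / grand_total for lang, t in totals.items()}

# Hypothetical scores on an assumed 0-3 scale.
language_risk_matrix = {
    "Language A": {"market_abuse": 3, "bribery_corruption": 2},
    "Language B": {"market_abuse": 1, "bribery_corruption": 2},
    "Language C": {"market_abuse": 1, "bribery_corruption": 1},
}

allocation = allocate_effort(language_risk_matrix, total_hours=100)
for lang, hours in allocation.items():
    print(f"{lang}: {hours:.0f} hours")
```

In practice the scores would come from regional SMEs and the firm's own risk assessment rather than fixed numbers like these.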

Over-reliance on vendor baselines

Firms have often been inclined to rely too heavily on vendor-provided lexicons, assuming they are comprehensive or regulator-ready. In reality, these lists are designed for general use and rarely reflect a firm’s specific risks, business structure, or communication style.

Remedy: Treat vendor lexicons as a foundation rather than a finished product. Customize, test, and validate rules against your own historical data and risk scenarios before deployment.
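Validation against historical data can be as simple as scoring each rule on a reviewed sample. The Python sketch below compares a generic keyword rule with a customized proximity rule using precision and recall. Both rules, the labels, and all sample messages beyond the two "Book Scrubbing" quotes are hypothetical. A real test set would be far larger and drawn from the firm's own archive.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def baseline_rule(text):
    """Stand-in for a generic rule: bare keywords only."""
    return any(t in {"scrub", "clean"} for t in tokenize(text))

def customized_rule(text, window=5):
    """Stand-in for a tuned rule: keyword near 'book' or 'books'."""
    tokens = tokenize(text)
    trig = [i for i, t in enumerate(tokens) if t in {"scrub", "clean"}]
    ctx = [i for i, t in enumerate(tokens) if t in {"book", "books"}]
    return any(abs(i - j) <= window for i in trig for j in ctx)

def evaluate(rule, labeled):
    """Return (precision, recall) of `rule` on (text, is_relevant) pairs."""
    tp = sum(1 for text, y in labeled if rule(text) and y)
    fp = sum(1 for text, y in labeled if rule(text) and not y)
    fn = sum(1 for text, y in labeled if not rule(text) and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical reviewed sample: (message, reviewer judged it relevant)
labeled = [
    ("Let's scrub the books before end-of-month reporting.", True),
    ("We need to clean our book before compliance sees it.", True),
    ("Tidy up the ledger before audit arrives.", True),
    ("Please clean the kitchen before you leave.", False),
    ("The cleaners scrub the floors every night.", False),
    ("Can you book a room for Tuesday?", False),
]

for name, rule in [("baseline", baseline_rule), ("customized", customized_rule)]:
    precision, recall = evaluate(rule, labeled)
    print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
```

On this toy sample the customized rule removes the false positives without changing recall, and both rules still miss the "tidy up the ledger" message, which is the kind of gap that testing against historical data is meant to surface.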

Conclusion

Building smarter searches requires more than adding keywords. It involves carefully calibrating scope, context, proximity, and cultural nuance to ensure lexicons capture true risk while minimizing noise and operational burden. By starting small, refining iteratively, incorporating contextual logic, accounting for emojis and slang, and tailoring searches to regional and business-specific realities, firms can materially improve both coverage and explainability. Ultimately, effective lexicon design is not a one-off exercise but a continuous process of testing, validation, and evolution as communication behaviors and regulatory expectations change.

Revisit part one for a reminder of the core principles of building a strong lexicon policy framework. If you are ready for the next step, part three looks at the calibration process and how to tune lexicons so they perform reliably in real-world surveillance environments.

Disclaimer: The information included in these materials is for illustrative purposes only and does not constitute legal, financial, or professional advice. Readers should not rely on this content as a substitute for advice from qualified legal or compliance professionals. Always consult your own legal and compliance teams before making decisions or taking action based on the information contained herein. The BLOOMBERG TERMINAL service and Bloomberg data products (the “Services”) are owned and distributed by Bloomberg Finance L.P. (“BFLP”) except (i) in Argentina, Australia and certain jurisdictions in the Pacific islands, Bermuda, China, India, Japan, Korea and New Zealand, where Bloomberg L.P. and its subsidiaries (“BLP”) distribute these products, and (ii) in Singapore and the jurisdictions serviced by Bloomberg’s Singapore office, where a subsidiary of BFLP distributes these products. BLP or one of its subsidiaries provides BFLP and its subsidiaries with global marketing and operational support and service. Certain features, functions, products and services are available only to sophisticated investors and only where permitted. BFLP, BLP and their affiliates do not guarantee the accuracy of prices or other information in the Services. Nothing in the Services shall constitute or be construed as an offering of financial instruments by BFLP, BLP or their affiliates, or as investment advice or recommendations by BFLP, BLP or their affiliates of an investment strategy or whether or not to “buy”, “sell” or “hold” an investment. Information available via the Services should not be considered as information sufficient upon which to base an investment decision. The following are trademarks and service marks of BFLP, a Delaware limited partnership, or its subsidiaries: BLOOMBERG, BLOOMBERG ANYWHERE, BLOOMBERG MARKETS, BLOOMBERG NEWS, BLOOMBERG PROFESSIONAL, BLOOMBERG TERMINAL and BLOOMBERG.COM. 
Absence of any trademark or service mark from this list does not waive Bloomberg’s intellectual property rights in that name, mark or logo. All rights reserved. ©Bloomberg.