How AI Reads Your 10-K: A Visual Guide to Algorithmic Filing Analysis
AI investor relations systems are not reading your 10-K the way a buy-side analyst would. They are not absorbing the narrative, weighing the management team's credibility, or thinking about industry context. They are running statistical operations on a corpus of text — and the output of those operations feeds directly into institutional trading models.
Understanding how that process actually works is the first step toward optimizing for it.
The Pipeline: From EDGAR to Algo Signal in 60 Seconds
When you submit a 10-K to EDGAR, the clock starts. Here is what happens before any human reads it:
Step 1 — Text extraction. Automated systems strip the HTML/XBRL shell and isolate the prose: MD&A, Risk Factors, Business description, footnotes. Each section is processed independently because algo models weight them differently. MD&A and Risk Factors receive the highest signal weight.
Step 2 — Tokenization and normalization. The text is broken into individual words and phrases, stripped of punctuation, and normalized to lowercase roots. "Expecting" becomes "expect." "Significantly" becomes "significant." This step erases tone of voice and leaves only word choice.
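The normalization step can be sketched in a few lines. This is a simplified illustration, not any vendor's actual pipeline: real systems typically use a Porter stemmer or a lemmatizer rather than the crude suffix stripping shown here.

```python
# Minimal sketch of Step 2: lowercase, strip punctuation, reduce to roots.
# The suffix list is illustrative only; production systems use a real stemmer.
import re

def normalize(text: str) -> list[str]:
    # Lowercase and keep only alphabetic tokens
    tokens = re.findall(r"[a-z]+", text.lower())
    stripped = []
    for tok in tokens:
        # Crude suffix stripping: "expecting" -> "expect"
        for suffix in ("ingly", "edly", "ing", "ed", "ly", "s"):
            if tok.endswith(suffix) and len(tok) - len(suffix) >= 4:
                tok = tok[: -len(suffix)]
                break
        stripped.append(tok)
    return stripped

print(normalize("Expecting significantly higher costs."))
# → ['expect', 'significant', 'higher', 'cost']
```

Note what survives: only the word roots. Sentence rhythm, emphasis, and tone are gone before scoring begins.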
Step 3 — Dictionary scoring. The cleaned word list is run against specialized financial lexicons. The dominant standard is the Loughran-McDonald Master Dictionary, developed by researchers Tim Loughran and Bill McDonald at Notre Dame. Unlike general-purpose sentiment tools (which would score "liability" as negative even in a neutral legal context), the LM dictionary was built specifically for financial disclosures. It contains six word categories that have been empirically validated against stock returns.
Step 4 — Signal generation. Scores from each dictionary category are combined with structural features — filing length, sentence complexity, forward-guidance density — to produce composite signals. These signals feed into quant screens, short-seller models, and institutional risk algorithms.
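Steps 3 and 4 together look roughly like the sketch below. The word sets are tiny samples quoted from this article (the real Loughran-McDonald Master Dictionary, distributed by Notre Dame, has thousands of entries), and the composite weights are arbitrary placeholders, not any model's actual coefficients.

```python
# Illustrative sketch of Steps 3-4: dictionary scoring plus a toy composite.
# Word sets are small samples; the full LM dictionary is far larger.
LM_SAMPLE = {
    "negative":    {"adverse", "volatility", "uncertainty", "breach", "decline"},
    "positive":    {"achieve", "strength", "improvement", "confident"},
    "uncertainty": {"approximately", "uncertain", "contingent", "possibility"},
    "weak_modal":  {"could", "may", "might", "would", "possible"},
}

def category_densities(tokens: list[str]) -> dict[str, float]:
    """Fraction of tokens falling in each dictionary category."""
    n = len(tokens) or 1
    return {cat: sum(t in words for t in tokens) / n
            for cat, words in LM_SAMPLE.items()}

def composite_signal(tokens: list[str], sentences: int) -> float:
    """Toy composite combining dictionary scores with one structural
    feature (sentence length). Weights are placeholders for illustration."""
    d = category_densities(tokens)
    avg_sentence_len = len(tokens) / max(sentences, 1)
    return (d["positive"]
            - 2.0 * d["negative"]
            - 1.5 * d["uncertainty"]
            - 1.0 * d["weak_modal"]
            - 0.01 * avg_sentence_len)
```

The structure is the point: word counts in, one number out, no human judgment anywhere in between.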
The entire pipeline runs in under 60 seconds. Your 10-K is scored before your IR team finishes the post-filing checklist.
The Six Loughran-McDonald Categories (And What They Actually Measure)
The LM dictionary's six categories are not arbitrary. Each was tested against actual market outcomes — specifically, whether the presence of these words predicts abnormal returns in the 3 to 12 months following a filing.
1. Negative (2,355 words) — This is the most important category. Words like "adverse," "volatility," "uncertainty," "breach," "decline." Academic research confirms that a 1 standard deviation increase in negative word density is associated with 3–5% lower abnormal returns in the 12 months post-filing. Every negative word is a drag on your score.
2. Positive (354 words) — "Achieve," "strength," "improvement," "confident." Notably, the LM positive list is much shorter than negative, reflecting the reality that financial disclosures are structurally cautious. Positive words in the right ratio to negative ones signal a healthy filing, but overuse triggers a different problem: the algorithm reads excessive positivity as a credibility discount.
3. Uncertainty (297 words) — "Approximately," "uncertain," "contingent," "possibility." High uncertainty scores are strongly correlated with analyst downgrades and increased short interest in the 60 days following a filing. Uncertainty language is the #1 predictor of follow-on analyst scrutiny.
4. Litigious (903 words) — Legal boilerplate triggers this category. "Litigation," "plaintiff," "defendant," "alleged." Some litigation language is unavoidable in Risk Factors. The question is whether your litigious word density is higher than sector peers — because that is what the model measures.
5. Strong Modal (68 words) — "Will," "must," "require," "should." High strong-modal density signals commitment and is generally positive. Companies that guide confidently with strong modals score better on algo certainty indices.
6. Weak Modal (27 words) — "Could," "may," "might," "would," "possible." The inverse of strong modal. High weak-modal density is the single most common avoidable signal problem we see in small-cap filings. Companies hedge obsessively because their lawyers told them to. Algorithms interpret it as management lacking conviction.
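Because weak-modal density is both common and avoidable, it is worth auditing before filing. A minimal sketch, using the weak-modal words quoted above (the full LM weak-modal list has additional entries) and an illustrative 5% per-sentence threshold:

```python
# Flag sentences with high weak-modal density. Word list and threshold
# are illustrative samples, not the full LM list or a calibrated cutoff.
import re

WEAK_MODALS = {"could", "may", "might", "would", "possible", "possibly"}

def weak_modal_density(text: str) -> float:
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(t in WEAK_MODALS for t in tokens) / max(len(tokens), 1)

def flag_hedged_sentences(text: str, threshold: float = 0.05) -> list[str]:
    """Return sentences whose weak-modal density exceeds the threshold."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if weak_modal_density(s) > threshold]

print(flag_hedged_sentences(
    "We may be able to grow. Revenue increased sharply."))
# → ['We may be able to grow.']
```

Running a pass like this over MD&A and guidance language surfaces exactly the sentences an algorithm reads as low conviction.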
The Fog Index: Readability as a Signal
Beyond the LM dictionary, algo systems also score your filing's readability using the Gunning Fog Index — a formula measuring average sentence length and percentage of complex words. The formula: 0.4 × (average sentence length + percentage of words with 3+ syllables).
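The formula is simple enough to compute directly. The sketch below approximates syllable counts by counting vowel groups rather than using a pronunciation dictionary, so scores are estimates:

```python
# Gunning Fog sketch: 0.4 * (avg sentence length + % of 3+ syllable words).
# Syllable counting here is a vowel-group approximation, not exact.
import re

def syllables(word: str) -> int:
    return max(len(re.findall(r"[aeiouy]+", word.lower())), 1)

def fog_index(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    avg_sentence_len = len(words) / len(sentences)
    pct_complex = 100 * sum(syllables(w) >= 3 for w in words) / len(words)
    return 0.4 * (avg_sentence_len + pct_complex)
```

Running a function like this on an MD&A draft before filing gives a rough but directionally useful readability check.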
A Fog score under 12 is generally readable. Most academic papers score around 15–20. Many 10-Ks score in the 18–22 range — technically unreadable by the original scale.
Why does this matter for algorithms? Research by Feng Li at the University of Michigan found that companies with harder-to-read annual reports tend to have lower earnings persistence — and that obfuscation correlates with future earnings declines. Whether or not that's causal, institutional quant models have incorporated readability as a factor. A high Fog score is a modest but real negative signal.
Actionable benchmark: Target a Fog score under 14 for your MD&A. Break long sentences at the semi-colon. Remove words like "notwithstanding," "aforementioned," and "hereinafter" — they are legal fossils that cost you readability points without adding legal protection.
The Delta Problem: How Change Gets Detected
Most small-cap CEOs focus on the absolute content of their filing. Algorithmic models focus on the change from prior period.
Academic research has demonstrated that cosine similarity between consecutive filings is a statistically significant predictor of post-filing stock performance. The "Lazy Prices" paper (Cohen, Malloy, and Nguyen — now cited in institutional quant research) showed that companies whose 10-Ks change substantially relative to the prior year generate significantly lower returns. The implication: large textual changes signal instability, strategic pivots, or disclosure anxiety.
The specific sections being delta-analyzed:
- Risk Factors — new risks added vs. prior year
- MD&A — language drift on core business description
- Liquidity section — changes in going-concern language or cash management discussion
- Legal proceedings — additions and escalations
A company that adds 3 new risk factors quarter-over-quarter while removing none is sending a compounding negative delta signal — even if each individual risk factor is boilerplate. The volume of change matters.
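The similarity metric behind this delta analysis is straightforward. A bag-of-words cosine similarity between two filings' text, sketched here as a simplified stand-in for the production versions (which typically weight terms and compare section by section):

```python
# Year-over-year similarity sketch: bag-of-words cosine similarity
# between two filings. Production models usually add term weighting.
import math
import re
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    va = Counter(re.findall(r"[a-z]+", text_a.lower()))
    vb = Counter(re.findall(r"[a-z]+", text_b.lower()))
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

A score near 1.0 means the two filings are textually stable year over year; large drops flag exactly the kind of change the delta models penalize.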
What Your Filing Looks Like to an Algorithm: A Sample Breakdown
Here is how a typical small-cap 10-K scores against institutional NLP metrics, based on our analysis of 500+ filings:
| Metric | Industry Average | Top Quartile | Bottom Quartile |
|---|---|---|---|
| Negative word density | 3.2% | 2.1% | 4.8% |
| Weak modal density | 1.8% | 0.9% | 2.9% |
| Uncertainty score | 1.4% | 0.8% | 2.3% |
| Fog Index (MD&A) | 17.2 | 13.8 | 21.4 |
| Year-over-year cosine similarity | 0.82 | 0.91 | 0.71 |
Companies in the bottom quartile on these metrics are not necessarily in worse shape financially. They are communicating the same reality in a way that algo systems penalize. That penalty is real and measurable in spread behavior, short interest, and institutional screening results.
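As a quick self-check, a filing's measured densities can be placed against the quartile thresholds in the table above. A minimal sketch, using those published thresholds (lower is better for all three densities shown):

```python
# Benchmark a filing's metrics against the quartile thresholds from the
# table above. Thresholds are the article's figures; labels are illustrative.
THRESHOLDS = {  # metric: (top quartile, industry average, bottom quartile)
    "negative_density":   (2.1, 3.2, 4.8),
    "weak_modal_density": (0.9, 1.8, 2.9),
    "uncertainty_score":  (0.8, 1.4, 2.3),
}

def classify(metric: str, value_pct: float) -> str:
    """Place a measured density (in percent) relative to peer quartiles."""
    top, avg, bottom = THRESHOLDS[metric]
    if value_pct <= top:
        return "top quartile"
    if value_pct >= bottom:
        return "bottom quartile"
    return "below average" if value_pct > avg else "above average"
```

For example, a weak-modal density of 2.4% lands between the industry average (1.8%) and the bottom-quartile cutoff (2.9%), so it classifies as below average.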
Five Filing Changes That Improve Your Algo Score
These are the highest-ROI edits based on what actually moves the composite signal:
1. Replace weak modals in forward guidance. Audit every sentence in your guidance section. "We may be able to achieve revenue growth" → "We expect revenue growth of X%." The commitment is the same; the signal is dramatically different.
2. Trim Risk Factors that have been resolved. Many companies add risk factors and never remove them. Accumulating risk factor count is a direct negative signal. If a risk has been mitigated, remove it — and document in your disclosure committee notes why you did.
3. Shorten your Liquidity section sentences. This is where Fog scores typically spike. Write it like a CFO talking to a smart analyst, not like a lawyer covering exposure.
4. Use parallel structure in MD&A. Algorithms parse sentence structure. Parallel construction ("Revenue increased 12%. Gross margin expanded 80 basis points. Operating expenses declined 3%.") scores better than discursive paragraphs that cover the same ground with more uncertainty language.
5. Anchor your Management Assessment. The "Management's Discussion and Analysis of Financial Condition" is the highest-weighted section in most NLP models. Specific numbers, specific timelines, and attribution to named business drivers outperform vague industry commentary.
Recommended Reading
For CEOs and CFOs who want to go deeper on how financial language affects stock performance:
[Financial Shenanigans: How to Detect Accounting Gimmicks and Fraud in Financial Reports](https://www.amazon.com/Financial-Shenanigans-Detecting-Accounting-Gimmicks/dp/0071792619?tag=axonir-20) by Howard Schilit — The inverse playbook. Understanding what red-flag language looks like from an analyst's perspective is the fastest way to ensure your disclosures don't accidentally pattern-match to it.
[The Little Book That Still Beats the Market](https://www.amazon.com/Little-Book-Still-Beats-Market/dp/0470624159?tag=axonir-20) by Joel Greenblatt — Not specifically about NLP, but essential for understanding the quantitative screens your stock is being measured against. If you understand what institutional quant models are optimizing for, you can align your disclosure strategy accordingly.
The One-Number Summary
Your 10-K, as read by an institutional NLP system, produces a composite score. That score influences whether quant screens include or exclude your stock, whether short-seller models flag you as a target, and whether momentum algorithms classify you as a buy or avoid signal.
Most small-cap companies have never seen their score.
Get your free algo score at [/free-score](/free-score) — it takes 60 seconds and benchmarks your filings against 500+ small-cap peers on every metric covered in this article. Understanding your current position is the prerequisite for improving it.