SEC-EDGAR-GPT

A 124M parameter GPT-2 trained on SEC-EDGAR financial filings


Parameters: 124M
Training Tokens: 1.55B
Val Loss: 2.28
Training Time: ~8h

Overview

SEC-EDGAR-GPT is a 124-million-parameter GPT-2 language model trained on 1.55 billion tokens of SEC-EDGAR corporate filings using the nanoGPT framework. The model was trained on a single NVIDIA RTX 4070 GPU over approximately 8 hours, reaching a final validation loss of 2.28.

We evaluated the model's generation quality across multiple SEC filing sections including business descriptions, management discussion and analysis, risk factors, financial notes, and proxy statements. The model successfully learns SEC document structure, financial vocabulary, and boilerplate language patterns, but exhibits characteristic limitations in long-range coherence, numerical consistency, and table extension.

Training Setup

Training workstation equipped with NVIDIA RTX 4070 GPU (12GB VRAM)

Model Architecture

Parameter	Value
Number of layers	12
Attention heads	12
Embedding dimension	768
Context length	1,024 tokens
Total parameters	123.59M
Vocabulary	GPT-2 BPE (50,257 tokens)
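
The 123.59M total can be reproduced from the table under a few assumptions that this page does not state: the vocabulary is padded from 50,257 to 50,304 entries, linear and layer-norm bias terms are disabled, the output head shares weights with the token embedding, and position embeddings are left out of the count. These match common nanoGPT defaults but are not confirmed here, so the sketch below is a plausibility check rather than the exact accounting used for training.

```python
def gpt2_param_count(n_layer, n_embd, vocab_size, block_size,
                     bias=False, count_position_embeddings=False):
    """Count parameters of a GPT-2 style decoder with tied input/output embeddings."""
    b = 1 if bias else 0
    token_emb = vocab_size * n_embd      # shared with the output head (weight tying)
    pos_emb = block_size * n_embd
    layer_norm = n_embd * (1 + b)        # scale, plus an optional shift
    # Attention: fused QKV projection followed by the output projection.
    attn = (n_embd * 3 * n_embd + b * 3 * n_embd) + (n_embd * n_embd + b * n_embd)
    # MLP: expand to 4x the embedding width, then project back.
    mlp = (n_embd * 4 * n_embd + b * 4 * n_embd) + (4 * n_embd * n_embd + b * n_embd)
    per_layer = 2 * layer_norm + attn + mlp
    total = token_emb + n_layer * per_layer + layer_norm  # plus the final layer norm
    if count_position_embeddings:
        total += pos_emb
    return total


# Assumed, not stated on this page: padded vocabulary of 50,304 and no bias terms.
n_params = gpt2_param_count(n_layer=12, n_embd=768, vocab_size=50304, block_size=1024)
print(f"{n_params:,} parameters ({n_params / 1e6:.2f}M)")
# prints: 123,587,328 parameters (123.59M)
```

The number of attention heads does not appear in the count because splitting the 768-dimensional projection into 12 heads does not change the size of the weight matrices.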

Training Data

Metric	Value
Total training tokens	1.55B
Training shards	16
Validation tokens	100M
Source	SEC-EDGAR via CodeParrot dataset
Tokenizer	GPT-2 BPE

Hyperparameters

Parameter	Value
Effective batch size	32,768 tokens
Learning rate	6e-4
Min learning rate	6e-5
Warmup iterations	2,000
Max iterations	47,000
Optimizer	AdamW (β₁=0.9, β₂=0.95)
Weight decay	0.1
Gradient clipping	1.0
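
The table lists a peak and a minimum learning rate plus a warmup length, but not the shape of the schedule. A minimal sketch follows, assuming linear warmup followed by cosine decay that reaches the minimum at the final iteration (a common choice in nanoGPT-style training, not confirmed for this run). The function name `lr_at` is illustrative.

```python
import math

LEARNING_RATE = 6e-4
MIN_LR = 6e-5
WARMUP_ITERS = 2_000
MAX_ITERS = 47_000        # assumed to double as the decay horizon
TOKENS_PER_ITER = 32_768  # effective batch size in tokens


def lr_at(it):
    """Learning rate at iteration `it`: linear warmup, then cosine decay to MIN_LR."""
    if it < WARMUP_ITERS:
        return LEARNING_RATE * it / WARMUP_ITERS
    if it >= MAX_ITERS:
        return MIN_LR
    progress = (it - WARMUP_ITERS) / (MAX_ITERS - WARMUP_ITERS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return MIN_LR + coeff * (LEARNING_RATE - MIN_LR)


for step in (0, 1_000, 2_000, 24_500, 47_000):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")

print(f"tokens seen in {MAX_ITERS:,} iterations: {TOKENS_PER_ITER * MAX_ITERS:,}")
```

At 32,768 tokens per iteration, 47,000 iterations cover about 1.54B tokens, which is in line with the 1.55B training tokens reported.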

Training Results

Evaluation prompts (Section 5) were generated with the assistance of Hermes Agent (Nous Research).

Loss Progression

Step	Train Loss	Val Loss
0	10.98	—
1,000	3.82	3.75
5,000	2.95	2.91
10,000	2.62	2.58
20,000	2.38	2.35
30,000	2.25	2.23
40,000	2.15	2.22
47,000	2.10	2.28
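
One way to read the table is to track the gap between validation and training loss. The sketch below uses only the rows above (step 0 is skipped because it has no validation loss) and reports the step with the lowest validation loss. A widening gap late in training is often taken as a sign of overfitting, though this table alone cannot confirm that.

```python
# (step, train_loss, val_loss) rows copied from the Loss Progression table.
loss_log = [
    (1_000, 3.82, 3.75),
    (5_000, 2.95, 2.91),
    (10_000, 2.62, 2.58),
    (20_000, 2.38, 2.35),
    (30_000, 2.25, 2.23),
    (40_000, 2.15, 2.22),
    (47_000, 2.10, 2.28),
]

# Step with the lowest validation loss.
best_step, _, best_val = min(loss_log, key=lambda row: row[2])
final_step, final_train, final_val = loss_log[-1]

for step, train, val in loss_log:
    print(f"step {step:>6}: val - train = {val - train:+.2f}")

print(f"lowest val loss {best_val} at step {best_step:,}")
print(f"final gap at step {final_step:,}: {final_val - final_train:.2f}")
```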

Cross-Domain Comparison

Dataset	Val Loss	Notes
OpenWebText (GPT-2 baseline)	2.85	General web text
GitHub Code	3.466	Noisy code data
SEC-EDGAR (this work)	2.28	Clean financial prose

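The SEC-EDGAR model reaches a lower validation loss (2.28) than the OpenWebText baseline (2.85) or the GitHub Code run (3.466). A likely reason is the structured, repetitive nature of SEC filings, which makes the next token easier to predict.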
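
To make the loss values easier to compare, they can be converted to per-token perplexity. This sketch assumes the reported losses are mean cross-entropy in nats per token, which is what PyTorch's cross-entropy loss returns by default but is not stated on this page.

```python
import math

# Validation losses copied from the Cross-Domain Comparison table.
val_losses = {
    "OpenWebText (GPT-2 baseline)": 2.85,
    "GitHub Code": 3.466,
    "SEC-EDGAR (this work)": 2.28,
}


def perplexity(loss_nats):
    """Perplexity is the exponential of the mean cross-entropy, assumed to be in nats."""
    return math.exp(loss_nats)


for name, loss in val_losses.items():
    print(f"{name:<30} loss {loss:<6} perplexity {perplexity(loss):.2f}")
```

Perplexity can be read as the effective number of equally likely next tokens, so a lower value means the model is less uncertain on that dataset.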

Training Completion

Training completion: iteration 47,400 on the RTX 4070 workstation, accessed remotely from macOS.

Generation Quality Analysis

We tested five prompts, each representing a different SEC filing section and consisting of 500–2,000 characters of authentic SEC text. For each prompt, the model generated 1,000 tokens at temperature 0.7 with top-k=100. The evaluation prompts were generated with Hermes Agent (Nous Research).
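
For readers unfamiliar with the sampling settings, the sketch below shows what temperature 0.7 and top-k=100 do to a single next-token distribution. It is a standalone illustration in plain Python, not the project's generation code. The function name `sample_top_k` and the toy logits are made up for the example.

```python
import math
import random


def sample_top_k(logits, temperature=0.7, top_k=100, rng=random):
    """Sample one token id from raw logits using temperature scaling and top-k filtering."""
    k = min(top_k, len(logits))
    # Keep only the k highest-scoring token ids.
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperatures below 1 sharpen the distribution toward the top candidates.
    scaled = [logits[i] / temperature for i in top_ids]
    # Softmax weights over the survivors (subtract the max for numerical stability).
    max_scaled = max(scaled)
    weights = [math.exp(s - max_scaled) for s in scaled]
    return rng.choices(top_ids, weights=weights, k=1)[0]


# Toy vocabulary of five tokens; with top_k=3 only ids 0, 1, and 2 can be drawn.
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
rng = random.Random(0)
draws = [sample_top_k(toy_logits, temperature=0.7, top_k=3, rng=rng) for _ in range(1000)]
print({token_id: draws.count(token_id) for token_id in sorted(set(draws))})
```

Setting `top_k=1` reduces this to greedy decoding, which always picks the highest-scoring token.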

Prompt 1: Business Description (Item 1)

Prompt (input text):

ITEM 1. BUSINESS Overview We are a leading provider of cloud-based enterprise software solutions for the healthcare industry. Our platform enables hospitals, clinics, and other healthcare organizations to streamline their operations, improve patient outcomes, and reduce costs. We serve over 2,500 customers across 48 states, including 18 of the top 25 hospital systems in the United States. Our solutions address critical challenges in healthcare delivery, including electronic health records (EHR), practice management, revenue cycle management, and population health analytics. Our AI-powered clinical decision support tools help physicians make better-informed treatment decisions, while our interoperability platform enables seamless data exchange between disparate healthcare systems. For the fiscal year ended December 31, 2023, we generated revenue of approximately $487.2 million, representing a 28% increase from the prior year. Our subscription revenue, which accounted for 78% of total revenue, grew 35% year-over-year. We had approximately 8,200 full-time employees as of December 31, 2023.

Generated continuation (first 500 tokens):

Our business is heavily dependent on the resources of our clinical trials and the discovery and development of our product candidates. Our technology costs are based on the materials, manufacturing, test, and customer support costs. We are a global leader in the healthcare industry, and the care industry is undergoing a number of product development, including the following: - Our core product development, product development and commercialization of our product candidates, which we currently license agreements with third-party manufacturers. - Our commercialization of X-Avent therapies, including the following: - Our commercialization of X-Zentib, including the following: - Our commercialization of X-Zentib, including the following:

Structure: Maintains SEC filing format
Loop: "Commercialization of product candidates" repeats 47+ times
Hallucination: Invented drug names: X-Avent, X-Zentib, S-Zentib, Q-partnerib

Prompt 2: MD&A Revenue Analysis

Prompt (input text):

ITEM 7. MANAGEMENT'S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND RESULTS OF OPERATIONS Results of Operations Revenue Total revenue increased by $105.4 million, or 28%, to $487.2 million for the year ended December 31, 2023, from $381.8 million for the year ended December 31, 2022. The increase was primarily attributable to growth in subscription revenue, which increased by $98.2 million, or 35%, driven by new customer acquisitions and expanded usage from existing customers. Professional services revenue increased by $7.2 million, or 8%, reflecting increased implementation activity. Cost of Revenue and Gross Profit Cost of revenue increased by $31.2 million, or 22%, to $173.4 million for the year ended December 31, 2023, from $142.2 million for the year ended December 31, 2022. The increase was primarily due to higher hosting costs associated with our cloud infrastructure and increased headcount in our customer support organization. Gross profit margin improved to 64.4% for the year ended December 31, 2023, compared to 62.7% for the prior year, reflecting economies of scale in our cloud infrastructure.

Generated continuation (first 400 tokens):

Cost of revenue increased by $32.1 million, or 26%, for the year ended December 31, 2023, compared to $162.2 million for the prior year. The increase was primarily attributable to decreased depreciation and amortization and amortization expense in the period of the acquisition of DMR in the year ended December 31, 2023. Cost of revenue increased by $40.0 million, or 2%, to $107.1 million for the year ended December 31, 2023, from $162.8 million for the year ended December 31, 2022. The increase was primarily attributable to the increase in service revenue, partially offset by depreciation on certain C-1 products, which increased by $73.4 million, or 1%, from $96.0 million for the year ended December 31, 2023. Cost of revenue increased by $20.5 million, or 2%, to $1.8 million for the year ended December 31, 2023, primarily due to the increase in service revenue and the increased amortization of intangible assets.

Template: Excellent first paragraph; references an acquisition (DMR)
Loop: 10 consecutive "Cost of revenue increased/decreased" paragraphs
Math: Percentages don't match dollar amounts cited

Prompt 3: Risk Factors

Prompt (input text):

RISK FACTORS Risks Related to Our Business We have a history of net losses and may not achieve or maintain profitability in the future. We incurred net losses of $42.3 million, $67.8 million, and $89.1 million for the years ended December 31, 2023, 2022, and 2021, respectively. As of December 31, 2023, we had an accumulated deficit of $523.4 million. We expect to continue to incur significant expenses related to research and development, sales and marketing, and general administrative operations as we expand our business. We may not achieve or maintain profitability in the future. Our quarterly operating results may fluctuate significantly, which could cause our stock price to decline. Our quarterly revenue and operating results have varied in the past and may continue to vary significantly from quarter to quarter. Our revenue growth rate may decline in future periods as our revenue base increases. Factors that may cause our quarterly results to fluctuate include the timing of large enterprise contracts, seasonal purchasing patterns in the healthcare industry, and the impact of regulatory changes on customer buying decisions.

Generated continuation (first 500 tokens):

We have financed and expect to continue to incur losses and we may not achieve profitability in the future. We have incurred significant losses since inception and expect our net losses since inception and expect to continue to incur losses and we may never achieve profitability through the development and commercialization of our product candidates. Many of our product candidates are in development, and we have not yet generated any revenue from product sales. We continue to develop and market new products and technologies in the development and commercialization of our product candidates, and as a result, our revenues from product sales and marketing, are subject to product commercialization and regulatory requirements. Our product candidates may fail to develop, develop and commercialize our product candidates may fail to develop and commercialize our product candidates.

Structure: Maintains risk factor format (heading + explanation)
Loop: "Product candidates" appears 47 times in 25 lines
Self-reference: Recursive: "product candidates may fail to develop, develop and commercialize"
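
Repetition of the kind noted for these prompts can be flagged automatically by counting repeated word n-grams. The sketch below is one simple way to do that. It splits on whitespace rather than using the GPT-2 tokenizer, and the function name `repeated_ngrams` and its thresholds are illustrative choices, not the method used to produce the counts above.

```python
from collections import Counter


def repeated_ngrams(text, n=3, min_count=3):
    """Return (n-gram, count) pairs for word n-grams that occur at least min_count times."""
    tokens = text.lower().split()
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return [(" ".join(gram), count) for gram, count in grams.most_common() if count >= min_count]


# Excerpt from the Prompt 1 continuation shown earlier.
sample = (
    "Our commercialization of X-Avent therapies, including the following: - "
    "Our commercialization of X-Zentib, including the following: - "
    "Our commercialization of X-Zentib, including the following:"
)

for gram, count in repeated_ngrams(sample):
    print(f"{count}x  {gram}")
```

On longer outputs, a larger `n` (for example 5 or more words) helps separate genuine loops from ordinary boilerplate phrases that recur in any filing.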

Key Findings

Echo vs. Generate

The model excels at echoing input content (tables, numbers, formatting) but struggles to generate new content that maintains consistency. It learned surface-level patterns rather than underlying data relationships.

Loop Attractors

Certain phrases act as "probability sinks" when the model is uncertain:

"Commercialization of product candidates" — biotech 10-K risk factors
"Cost of revenue increased by $X million" — standard MD&A opener
"Raise additional capital" — going-concern disclosures

Numerical Coherence

Dollar amounts: Plausible scale ($1M–$500M) but internally inconsistent
Percentages: Often don't match the dollar changes cited
Dates: Consistent (always "December 31, 2023/2022")
Implication: The model learned the format of numbers, not their meaning
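
The mismatch between percentages and dollar amounts can be checked mechanically. The sketch below tests two things for a sentence of the form "increased by $X million, or P%, to $NEW million from $OLD million": whether OLD + X equals NEW, and whether X / OLD matches P. The figures are taken from the Prompt 2 input text and its generated continuation. The function name `check_change` and the tolerances are illustrative.

```python
def check_change(increase, pct, new, old, tol_amount=0.05, tol_pct=0.5):
    """Check a reported change (amounts in $ millions) for internal consistency.

    Returns (amounts_ok, pct_ok, implied_pct).
    """
    amounts_ok = abs((old + increase) - new) <= tol_amount
    implied_pct = 100.0 * increase / old
    pct_ok = abs(implied_pct - pct) <= tol_pct
    return amounts_ok, pct_ok, implied_pct


claims = {
    # From the prompt: "increased by $105.4 million, or 28%, to $487.2 million ... from $381.8 million"
    "prompt: total revenue": (105.4, 28, 487.2, 381.8),
    # From the prompt: "increased by $31.2 million, or 22%, to $173.4 million ... from $142.2 million"
    "prompt: cost of revenue": (31.2, 22, 173.4, 142.2),
    # From the continuation: "increased by $40.0 million, or 2%, to $107.1 million ... from $162.8 million"
    "generated: cost of revenue": (40.0, 2, 107.1, 162.8),
}

for label, (increase, pct, new, old) in claims.items():
    amounts_ok, pct_ok, implied = check_change(increase, pct, new, old)
    print(f"{label:<28} amounts ok: {amounts_ok}  pct ok: {pct_ok}  implied: {implied:.1f}%")
```

Both sentences from the prompt pass both checks, while the generated sentence fails both: the stated increase does not connect the two totals, and the implied change is about 25% rather than 2%.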

What It Can Do

Generate SEC boilerplate language and standard disclosures
Suggest section structures and formatting conventions
Draft placeholder text that visually resembles authentic filings
Provide a starting point for human writers to edit and refine

What It Cannot Do

Generate accurate financial data or calculations
Maintain consistency across long documents
Produce factually grounded content
Extend tables with new, coherent rows

Citation

@misc{sec-edgar-gpt-124m,
  author    = {Zhiwei Li},
  title     = {SEC-EDGAR GPT-2 124M},
  year      = {2026},
  publisher = {GitHub},
  url       = {https://github.com/lzwjava/sec-edgar-gpt}
}