SEC-EDGAR-GPT

A 124M-parameter GPT-2 trained on SEC-EDGAR financial filings

Parameters: 124M
Training Tokens: 1.55B
Val Loss: 2.28
Training Time: ~8h

Overview

SEC-EDGAR-GPT is a 124-million-parameter GPT-2 language model trained on 1.55 billion tokens of SEC-EDGAR corporate filings using the nanoGPT framework. Training ran on a single NVIDIA RTX 4070 GPU for approximately 8 hours and reached a final validation loss of 2.28.

We evaluated the model's generation quality across multiple SEC filing sections including business descriptions, management discussion and analysis, risk factors, financial notes, and proxy statements. The model successfully learns SEC document structure, financial vocabulary, and boilerplate language patterns, but exhibits characteristic limitations in long-range coherence, numerical consistency, and table extension.

Training Setup

Figure: Training workstation equipped with an NVIDIA RTX 4070 GPU (12 GB VRAM).

Model Architecture

| Parameter | Value |
|---|---|
| Number of layers | 12 |
| Attention heads | 12 |
| Embedding dimension | 768 |
| Context length | 1,024 tokens |
| Total parameters | 123.59M |
| Vocabulary | GPT-2 BPE (50,257 tokens) |
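The parameter total can be checked from the architecture values by hand. The sketch below counts parameters for a GPT-2 style decoder with tied input and output embeddings. The function name and its flags are illustrative, not part of nanoGPT. One configuration that lands on 123.59M is bias-free layers, a vocabulary padded from 50,257 to 50,304 entries, and position embeddings left out of the count; these three choices are assumptions and are not confirmed by the source. The number of attention heads does not change the count, so it is not an argument.

```python
def count_gpt2_params(n_layer, n_embd, vocab_size, block_size,
                      bias=False, include_position_embeddings=False):
    """Count parameters of a GPT-2 style decoder with tied embeddings."""
    b = 1 if bias else 0
    token_emb = vocab_size * n_embd        # shared with the output head
    pos_emb = block_size * n_embd
    layer_norm = n_embd + b * n_embd       # scale, plus optional shift
    # Attention: fused QKV projection, then output projection.
    attn = (n_embd * 3 * n_embd + b * 3 * n_embd) + (n_embd * n_embd + b * n_embd)
    # MLP: expand to 4x width, then project back.
    mlp = (n_embd * 4 * n_embd + b * 4 * n_embd) + (4 * n_embd * n_embd + b * n_embd)
    per_layer = 2 * layer_norm + attn + mlp
    total = token_emb + n_layer * per_layer + layer_norm  # plus final layer norm
    if include_position_embeddings:
        total += pos_emb
    return total


# Assumed configuration: no biases, vocabulary padded to 50,304,
# position embeddings excluded from the reported total.
assumed = count_gpt2_params(12, 768, 50_304, 1_024)
print(f"{assumed / 1e6:.2f}M")
```

With biases enabled, the unpadded 50,257-token vocabulary, and position embeddings included, the same function gives 124,439,808, the figure usually quoted for GPT-2 small.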

Training Data

| Metric | Value |
|---|---|
| Total training tokens | 1.55B |
| Training shards | 16 |
| Validation tokens | 100M |
| Source | SEC-EDGAR via CodeParrot dataset |
| Tokenizer | GPT-2 BPE |

Hyperparameters

| Parameter | Value |
|---|---|
| Effective batch size | 32,768 tokens |
| Learning rate | 6e-4 |
| Min learning rate | 6e-5 |
| Warmup iterations | 2,000 |
| Max iterations | 47,000 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Weight decay | 0.1 |
| Gradient clipping | 1.0 |
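To make the schedule implied by these values concrete, here is a minimal sketch of linear warmup followed by cosine decay to the minimum learning rate, modeled on the schedule commonly used with nanoGPT. The source does not state the schedule shape or the decay horizon, so both the cosine form and `LR_DECAY_ITERS = MAX_ITERS` are assumptions. The second helper multiplies iterations by the effective batch size to estimate how many tokens the run consumed.

```python
import math

LEARNING_RATE = 6e-4
MIN_LR = 6e-5
WARMUP_ITERS = 2_000
MAX_ITERS = 47_000
LR_DECAY_ITERS = MAX_ITERS   # assumption: decay spans the whole run
TOKENS_PER_ITER = 32_768     # effective batch size in tokens


def get_lr(it):
    """Learning rate at iteration `it`: linear warmup, then cosine decay."""
    if it < WARMUP_ITERS:
        return LEARNING_RATE * (it + 1) / (WARMUP_ITERS + 1)
    if it > LR_DECAY_ITERS:
        return MIN_LR
    decay_ratio = (it - WARMUP_ITERS) / (LR_DECAY_ITERS - WARMUP_ITERS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes from 1 to 0
    return MIN_LR + coeff * (LEARNING_RATE - MIN_LR)


def tokens_processed(iters):
    """Total tokens consumed after `iters` optimizer steps."""
    return iters * TOKENS_PER_ITER


for step in (0, 2_000, 24_500, 47_000):
    print(step, f"{get_lr(step):.2e}")
print(f"{tokens_processed(MAX_ITERS) / 1e9:.2f}B tokens")
```

At 47,000 iterations the run processes 1,540,096,000 tokens, which is roughly one pass over the 1.55B-token training set.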

Training Results

Figure: Training loss curve.
Evaluation prompts (Section 5) were generated with the assistance of Hermes Agent (Nous Research).

Loss Progression

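| Step | Train Loss | Val Loss |
|---|---|---|
| 0 | 10.98 | |
| 1,000 | 3.82 | 3.75 |
| 5,000 | 2.95 | 2.91 |
| 10,000 | 2.62 | 2.58 |
| 20,000 | 2.38 | 2.35 |
| 30,000 | 2.25 | 2.23 |
| 40,000 | 2.15 | 2.22 |
| 47,000 | 2.10 | 2.28 |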
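The loss values above are easier to interpret as perplexity, and the table also shows where validation loss bottoms out. The sketch below is illustrative: it assumes the losses are mean cross-entropy in nats per token (the usual convention in nanoGPT-style training), and it omits the step 0 row because only one loss value is reported there. The helper names are hypothetical.

```python
import math

# (step, train_loss, val_loss) rows copied from the loss table.
LOSS_LOG = [
    (1_000, 3.82, 3.75),
    (5_000, 2.95, 2.91),
    (10_000, 2.62, 2.58),
    (20_000, 2.38, 2.35),
    (30_000, 2.25, 2.23),
    (40_000, 2.15, 2.22),
    (47_000, 2.10, 2.28),
]


def perplexity(loss):
    """Perplexity for a cross-entropy loss given in nats per token."""
    return math.exp(loss)


def best_checkpoint(log):
    """Row with the lowest validation loss."""
    return min(log, key=lambda row: row[2])


best_step, _, best_val = best_checkpoint(LOSS_LOG)
final_step, final_train, final_val = LOSS_LOG[-1]

print(f"best val loss {best_val} at step {best_step}")
print(f"final val perplexity {perplexity(final_val):.2f}")
print(f"final train/val gap {final_val - final_train:.2f}")
```

In this table the lowest validation loss (2.22) is at step 40,000, while the final checkpoint sits at 2.28 with a widening gap to the training loss. For reference, a uniform guess over the 50,257-token vocabulary corresponds to a loss of ln(50,257), about 10.82.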

Cross-Domain Comparison

| Dataset | Val Loss | Notes |
|---|---|---|
| OpenWebText (GPT-2 baseline) | 2.85 | General web text |
| GitHub Code | 3.466 | Noisy code data |
| SEC-EDGAR (this work) | 2.28 | Clean financial prose |

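The SEC-EDGAR model reaches a lower validation loss (2.28) than the OpenWebText baseline (2.85) and the GitHub Code run (3.466), reflecting the structured, repetitive nature of SEC filings. Each loss is measured on its own validation set, so the comparison says more about how predictable each domain is than about the relative quality of the models.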

Training Completion

Figure: Training completion at iteration 47,400 on the RTX 4070 workstation, accessed remotely from macOS.

Generation Quality Analysis

We used five prompts, each representing a different SEC filing section and consisting of 500–2,000 characters of authentic SEC text. For each prompt, the model generated 1,000 tokens at temperature 0.7 with top-k = 100. The evaluation prompts were generated with Hermes Agent (Nous Research).
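For readers unfamiliar with the two decoding settings, the following is a minimal, dependency-free sketch of temperature scaling combined with top-k filtering for a single decoding step. It is not the sampling code used for this evaluation; the function name and the toy logits are illustrative.

```python
import math
import random


def sample_top_k(logits, temperature=0.7, top_k=100, rng=random):
    """Sample one token id using temperature scaling and top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    k = min(top_k, len(logits))
    # Keep only the ids of the k largest logits.
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperatures below 1 sharpen the distribution; above 1 flatten it.
    scaled = [logits[i] / temperature for i in top_ids]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(top_ids, weights=weights, k=1)[0]


toy_logits = [0.1, 2.0, -1.0, 1.5]
print(sample_top_k(toy_logits, temperature=0.7, top_k=2, rng=random.Random(0)))
```

With top-k = 100 out of a vocabulary of roughly 50,000 tokens, low-probability tokens are never sampled, and a temperature of 0.7 further concentrates mass on the most likely continuations. Both settings favor fluent boilerplate and may also make repetitive loops more likely.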

Prompt 1: Business Description (Item 1)

Prompt (input text):
ITEM 1. BUSINESS Overview We are a leading provider of cloud-based enterprise software solutions for the healthcare industry. Our platform enables hospitals, clinics, and other healthcare organizations to streamline their operations, improve patient outcomes, and reduce costs. We serve over 2,500 customers across 48 states, including 18 of the top 25 hospital systems in the United States. Our solutions address critical challenges in healthcare delivery, including electronic health records (EHR), practice management, revenue cycle management, and population health analytics. Our AI-powered clinical decision support tools help physicians make better-informed treatment decisions, while our interoperability platform enables seamless data exchange between disparate healthcare systems. For the fiscal year ended December 31, 2023, we generated revenue of approximately $487.2 million, representing a 28% increase from the prior year. Our subscription revenue, which accounted for 78% of total revenue, grew 35% year-over-year. We had approximately 8,200 full-time employees as of December 31, 2023.
Generated continuation (first 500 tokens):
Our business is heavily dependent on the resources of our clinical trials and the discovery and development of our product candidates. Our technology costs are based on the materials, manufacturing, test, and customer support costs. We are a global leader in the healthcare industry, and the care industry is undergoing a number of product development, including the following: - Our core product development, product development and commercialization of our product candidates, which we currently license agreements with third-party manufacturers. - Our commercialization of X-Avent therapies, including the following: - Our commercialization of X-Zentib, including the following: - Our commercialization of X-Zentib, including the following:
Structure: Maintains SEC filing format
Loop: "Commercialization of product candidates" repeats 47+ times
Hallucination: Invented drug names (X-Avent, X-Zentib, S-Zentib, Q-partnerib)

Prompt 2: MD&A Revenue Analysis

Prompt (input text):
ITEM 7. MANAGEMENT'S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND RESULTS OF OPERATIONS Results of Operations Revenue Total revenue increased by $105.4 million, or 28%, to $487.2 million for the year ended December 31, 2023, from $381.8 million for the year ended December 31, 2022. The increase was primarily attributable to growth in subscription revenue, which increased by $98.2 million, or 35%, driven by new customer acquisitions and expanded usage from existing customers. Professional services revenue increased by $7.2 million, or 8%, reflecting increased implementation activity. Cost of Revenue and Gross Profit Cost of revenue increased by $31.2 million, or 22%, to $173.4 million for the year ended December 31, 2023, from $142.2 million for the year ended December 31, 2022. The increase was primarily due to higher hosting costs associated with our cloud infrastructure and increased headcount in our customer support organization. Gross profit margin improved to 64.4% for the year ended December 31, 2023, compared to 62.7% for the prior year, reflecting economies of scale in our cloud infrastructure.
Generated continuation (first 400 tokens):
Cost of revenue increased by $32.1 million, or 26%, for the year ended December 31, 2023, compared to $162.2 million for the prior year. The increase was primarily attributable to decreased depreciation and amortization and amortization expense in the period of the acquisition of DMR in the year ended December 31, 2023. Cost of revenue increased by $40.0 million, or 2%, to $107.1 million for the year ended December 31, 2023, from $162.8 million for the year ended December 31, 2022. The increase was primarily attributable to the increase in service revenue, partially offset by depreciation on certain C-1 products, which increased by $73.4 million, or 1%, from $96.0 million for the year ended December 31, 2023. Cost of revenue increased by $20.5 million, or 2%, to $1.8 million for the year ended December 31, 2023, primarily due to the increase in service revenue and the increased amortization of intangible assets.
Template: Excellent first paragraph that references an acquisition (DMR)
Loop: 10 consecutive "Cost of revenue increased/decreased" paragraphs
Math: Percentages don't match the dollar amounts cited
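The numerical mismatch can be checked mechanically. The sketch below tests whether a statement of the form "increased by $X million, or P%, to $NEW million from $OLD million" is internally consistent. The function name and tolerances are illustrative choices; the figures are taken from the prompt and from the generated continuation above.

```python
def check_change(delta, pct, new, old, abs_tol=0.06, pct_tol=0.5):
    """Return a list of inconsistencies in a stated year-over-year change.

    Amounts are in millions of dollars; tolerances allow for rounding
    to one decimal place and to whole percentages.
    """
    problems = []
    if abs((new - old) - delta) > abs_tol:
        problems.append(f"new - old = {new - old:.1f}, stated change = {delta:.1f}")
    actual_pct = 100.0 * delta / old
    if abs(actual_pct - pct) > pct_tol:
        problems.append(f"delta / old = {actual_pct:.1f}%, stated = {pct}%")
    return problems


# From the prompt: revenue up $105.4M, or 28%, to $487.2M from $381.8M.
print(check_change(105.4, 28, 487.2, 381.8))
# From the continuation: cost of revenue "increased by $40.0 million,
# or 2%, to $107.1 million ... from $162.8 million".
print(check_change(40.0, 2, 107.1, 162.8))
```

The prompt's figures pass both checks, while the generated sentence fails both: the stated "increase" ends at a lower value than it starts from, and $40.0 million on a base of $162.8 million is about 25%, not 2%.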

Prompt 3: Risk Factors

Prompt (input text):
RISK FACTORS Risks Related to Our Business We have a history of net losses and may not achieve or maintain profitability in the future. We incurred net losses of $42.3 million, $67.8 million, and $89.1 million for the years ended December 31, 2023, 2022, and 2021, respectively. As of December 31, 2023, we had an accumulated deficit of $523.4 million. We expect to continue to incur significant expenses related to research and development, sales and marketing, and general administrative operations as we expand our business. We may not achieve or maintain profitability in the future. Our quarterly operating results may fluctuate significantly, which could cause our stock price to decline. Our quarterly revenue and operating results have varied in the past and may continue to vary significantly from quarter to quarter. Our revenue growth rate may decline in future periods as our revenue base increases. Factors that may cause our quarterly results to fluctuate include the timing of large enterprise contracts, seasonal purchasing patterns in the healthcare industry, and the impact of regulatory changes on customer buying decisions.
Generated continuation (first 500 tokens):
We have financed and expect to continue to incur losses and we may not achieve profitability in the future. We have incurred significant losses since inception and expect our net losses since inception and expect to continue to incur losses and we may never achieve profitability through the development and commercialization of our product candidates. Many of our product candidates are in development, and we have not yet generated any revenue from product sales. We continue to develop and market new products and technologies in the development and commercialization of our product candidates, and as a result, our revenues from product sales and marketing, are subject to product commercialization and regulatory requirements. Our product candidates may fail to develop, develop and commercialize our product candidates may fail to develop and commercialize our product candidates.
Structure: Maintains risk factor format (heading + explanation)
Loop: "Product candidates" appears 47 times in 25 lines
Self-reference: Recursive phrasing, e.g. "product candidates may fail to develop, develop and commercialize"
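The loop behavior noted for each prompt can be quantified with a simple n-gram count. The following sketch is one possible way to do so, not the method used to produce the counts above: it reports the most frequent word n-gram and the fraction of n-grams that repeat an earlier one. The function names and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter


def word_ngrams(text, n=3):
    """Lower-cased, whitespace-tokenized word n-grams."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def most_repeated_ngram(text, n=3):
    """Return (ngram_string, count) for the most frequent n-gram."""
    counts = Counter(word_ngrams(text, n))
    if not counts:
        return None, 0
    gram, count = counts.most_common(1)[0]
    return " ".join(gram), count


def repetition_rate(text, n=3):
    """Fraction of n-grams that duplicate an earlier one (0.0 = no repeats)."""
    grams = word_ngrams(text, n)
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)


sample = ("Our commercialization of X-Zentib, including the following: - "
          "Our commercialization of X-Zentib, including the following:")
print(most_repeated_ngram(sample))
print(f"{repetition_rate(sample):.2f}")
```

A repetition rate near zero indicates varied text, while values that climb over the course of a generation are a cheap signal that the model has fallen into a loop.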

Key Findings

Echo vs. Generate

The model excels at echoing input content (tables, numbers, formatting) but struggles to generate new content that maintains consistency. It learned surface-level patterns rather than underlying data relationships.

Loop Attractors

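Certain phrases act as "probability sinks" when the model is uncertain. In the samples above these include "Our commercialization of ..., including the following:" (Prompt 1), "Cost of revenue increased by ..." (Prompt 2), and "product candidates" (Prompt 3).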

Numerical Coherence

What It Can Do

What It Cannot Do

Citation

@misc{sec-edgar-gpt-124m,
  author    = {Zhiwei Li},
  title     = {SEC-EDGAR GPT-2 124M},
  year      = {2026},
  publisher = {GitHub},
  url       = {https://github.com/lzwjava/sec-edgar-gpt}
}