ByteDance Seed Open-Sources Code Model for the First Time!
Seed-Coder, an 8B-parameter model, surpasses Qwen3 and achieves multiple SOTA (State-of-the-Art) results.
It demonstrates that “with minimal human intervention, LLMs can autonomously manage code training data.”
By self-generating and filtering high-quality training data, the model’s code generation capabilities are significantly enhanced.

This can be viewed as an extension of DeepSeek-R1’s strategy for self-generating and filtering training data.

It includes three versions:
- Base
- Instruct
- Reasoning
Among them, the Instruct version excels in programming, achieving SOTA on two benchmark tests.

The Reasoning version surpassed QwQ-32B and DeepSeek-R1 in the IOI 2024 evaluation.

The model has a context length of 32K, was trained on 6T tokens, and is released under the permissive MIT open-source license. The complete code has been published on Hugging Face.
Managing Training Data with Models
Seed-Coder’s predecessor was doubao-coder, which adopted the Llama 3 architecture with 8.2B parameters, 6 layers, a hidden size of 4096, and utilized Grouped Query Attention (GQA).
The most critical work involved data processing. The Seed team proposed a “model-centric” data processing approach, using models to curate the data.
Specifically, the model scrapes raw code data from GitHub and web archives, which then undergo several processing steps to output the final pre-training data.

Seed-Coder’s filtered data is divided into four categories:
- File-level Code: Individual code files from GitHub, processed to retain high-quality code content.
- Repository-level Code: Code files organized by repository structure, preserving project layout so the model can learn the relationships between files.
- Commit Data: Snapshots of GitHub commits, including commit messages, repository metadata, related files, and code patches. This includes 74 million commits from 140,000 high-quality repositories.
- Code-related Web Data: Documents extracted from web archives that contain code blocks or are highly relevant to code.
Let’s look at the code processing first. In the preprocessing stage, the system implements deduplication at both repository and file levels: SHA256 hashing for exact deduplication and MinHash algorithms for approximate deduplication.
This two-layer strategy produces two variants of the code corpus—the file-level variant is used for short context window training, while the repository-level variant preserves project structure to support more coherent long-context learning.
Subsequently, the system uses syntax parsers like Tree-sitter to check remaining files and discard those containing syntax errors. This preprocessing stage reduces the original data volume by approximately 98%.
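The two-layer deduplication described above can be sketched roughly as follows. This is a minimal illustration, not the Seed team's pipeline: the shingle size, the signature length `NUM_HASHES`, the 0.8 similarity threshold, and all function names are assumptions made for the example, and a production system would use locality-sensitive hashing rather than the pairwise comparison shown here.

```python
import hashlib
import re

NUM_HASHES = 64  # illustrative signature length (assumption)


def exact_key(text: str) -> str:
    """SHA256 digest used as the exact-duplicate key."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    """Set of k-token shingles; very short files become a single shingle."""
    tokens = re.findall(r"\w+", text)
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(text: str) -> list:
    """One minimum per salted hash function over the shingle set."""
    items = shingles(text)
    signature = []
    for seed in range(NUM_HASHES):
        salt = str(seed).encode("utf-8")
        signature.append(min(
            int.from_bytes(
                hashlib.sha256(salt + repr(s).encode("utf-8")).digest()[:8], "big"
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(files: list, threshold: float = 0.8) -> list:
    """Drop exact duplicates first, then near duplicates above the threshold."""
    seen_hashes = set()
    kept = []  # (text, signature) pairs
    for text in files:
        key = exact_key(text)
        if key in seen_hashes:
            continue
        seen_hashes.add(key)
        sig = minhash_signature(text)
        if any(estimated_jaccard(sig, other) >= threshold for _, other in kept):
            continue
        kept.append((text, sig))
    return [text for text, _ in kept]
```

The exact pass is cheap and removes byte-identical copies; the MinHash pass catches files that differ only slightly, such as forks with a changed header.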
In the quality filtering phase, Seed-Coder uses a scoring model specially trained on over 220,000 code documents to filter out low-quality code files.
The scoring model is based on DeepSeek-V2-Chat, and its evaluation metrics cover four key aspects:
- Readability: Contains a reasonable number of comments, follows consistent naming conventions, and adheres to general formatting and structural standards;
- Modularity: Well-structured, avoiding overly complex or lengthy functions, achieving clear separation of logical functions through modularity;
- Clarity: Reduces redundancy (such as excessive function calls, large blocks of commented-out code, or debug print statements), ensuring the intent of each code block is clearly expressed;
- Reusability: Free of syntax and logic errors, avoids excessive hard-coded data, designed for easy integration with other projects, and features complete and meaningful functionality.
The scoring model is asked to provide an overall score from 0 to 10 along with a detailed explanation. The scores are then rescaled to the [0,1] range, and a pre-trained Llama 2 model with 1.3B parameters is fine-tuned for one epoch using a regression head as the quality scorer.
Based on this scoring method, the Seed team filtered out approximately the bottom 10% of files by score, resulting in a corpus supporting 89 programming languages and containing about 1 trillion unique tokens.
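The rescaling and bottom-10% cut described above amount to simple arithmetic. The sketch below assumes the scores are already available as plain numbers; the function names and the `(path, score)` tuple layout are hypothetical, chosen only for illustration.

```python
def rescale(score_0_to_10: float) -> float:
    """Map a 0-10 quality score onto the [0, 1] range."""
    return score_0_to_10 / 10.0


def drop_bottom_fraction(scored_files: list, fraction: float = 0.10) -> list:
    """Remove the lowest-scoring fraction of (path, score) pairs, keeping order."""
    n_drop = int(len(scored_files) * fraction)
    by_score = sorted(range(len(scored_files)), key=lambda i: scored_files[i][1])
    dropped = set(by_score[:n_drop])
    return [item for i, item in enumerate(scored_files) if i not in dropped]


# Usage: ten files scored 0..9, rescaled, then the single worst file is dropped.
files = [(f"file_{i}.py", rescale(i)) for i in range(10)]
kept = drop_bottom_fraction(files)
```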

Next is the commit data. Seed-Coder collected 74 million commits from 140,000 high-quality GitHub repositories. The selection criteria for these repositories included: at least 100 stars, 10 forks, 100 commits, and 100 days of maintenance activity.
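As a rough illustration, the repository selection criteria above can be expressed as a single predicate. The `RepoStats` record and its field names are hypothetical; only the four thresholds come from the article.

```python
from dataclasses import dataclass


@dataclass
class RepoStats:
    # Hypothetical record; field names are assumptions for this sketch.
    name: str
    stars: int
    forks: int
    commits: int
    maintenance_days: int


def is_high_quality(repo: RepoStats) -> bool:
    """Apply the stated minimums: 100 stars, 10 forks, 100 commits, 100 days."""
    return (
        repo.stars >= 100
        and repo.forks >= 10
        and repo.commits >= 100
        and repo.maintenance_days >= 100
    )
```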
Each commit record contains rich metadata, such as commit messages, code patches, merge status, and pre-commit code snapshots.
To effectively utilize this data for pre-training, Seed-Coder formats each commit sample into a code change prediction task. Given a commit message and its related context, the model needs to predict the modified file paths and the corresponding code changes.
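One way to picture the code change prediction format is as a prompt/target pair. The layout below is an assumption for illustration only; the article does not specify the actual template, and the section markers and function name are hypothetical.

```python
def format_commit_sample(message: str, context_files: list, changes: list) -> dict:
    """Build a prompt (message + context) and a target (paths + patches).

    context_files: list of (path, content) pairs visible before the commit.
    changes: list of (path, patch) pairs the model is asked to predict.
    """
    prompt_parts = ["# Commit message", message, "# Context"]
    for path, content in context_files:
        prompt_parts.append(f"## {path}\n{content}")

    target_parts = []
    for path, patch in changes:
        target_parts.append(f"## {path}\n{patch}")

    return {"prompt": "\n".join(prompt_parts), "target": "\n".join(target_parts)}


# Usage with a toy commit.
sample = format_commit_sample(
    "Fix off-by-one in pagination",
    [("pager.py", "def last_page(n, size):\n    return n // size\n")],
    [("pager.py", "-    return n // size\n+    return (n + size - 1) // size\n")],
)
```

Keeping the patch out of the prompt is what turns the record into a prediction task: the model sees the intent and the pre-commit state, and has to produce the paths and the changes.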
After deduplication and preprocessing, Seed-Coder obtained a corpus of approximately 100 billion tokens from commit data for pre-training.
For data obtained from the web, Seed-Coder also proposed a specialized extraction framework.
In the preprocessing stage, the framework efficiently preprocesses large-scale web archives and identifies two types of raw data:
- The first type consists of web pages in HTML with explicit code tags (e.g., <code>, <pre>), which can be extracted directly via standard rules;
- The second type includes data without explicit code tags but potentially containing code or related knowledge. This type presents extraction challenges due to its volume and complexity.
Similar to GitHub data processing, the research team implemented exact and approximate deduplication techniques and developed heuristic rules to eliminate obvious low-quality documents during preprocessing (e.g., documents with fewer than 10 words).
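For the first type of page, rule-based extraction can be as simple as collecting the text inside code tags. The sketch below uses Python's standard `html.parser`; the class and function names are hypothetical, and the real framework's rules are not described in the article. The `too_short` helper mirrors the "fewer than 10 words" heuristic mentioned above.

```python
from html.parser import HTMLParser


class CodeBlockExtractor(HTMLParser):
    """Collect the text found inside <code> and <pre> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside code/pre tags
        self.blocks = []    # finished code blocks
        self._buffer = []   # text of the block currently being read

    def handle_starttag(self, tag, attrs):
        if tag in ("code", "pre"):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in ("code", "pre") and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.blocks.append("".join(self._buffer))
                self._buffer = []

    def handle_data(self, data):
        if self.depth > 0:
            self._buffer.append(data)


def extract_code_blocks(html: str) -> list:
    parser = CodeBlockExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


def too_short(text: str, min_words: int = 10) -> bool:
    """Heuristic from the article: discard documents under 10 words."""
    return len(text.split()) < min_words
```

Nested tags such as `<pre><code>...</code></pre>` are treated as one block, which avoids emitting the same snippet twice.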
In the quality filtering phase, the framework adopts two complementary strategies to ensure data quality: first identifying code relevance, then evaluating the intrinsic quality of the identified content.
In the code relevance identification step, the research team first extracted 10 million web page samples from Common Crawl data, marked pages with code features, and established an evaluation dataset.
70% of this dataset was used as a training set to train a fastText model for automatically identifying code-related content, while the remaining 30% served as a validation set to evaluate model performance.
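The 70/30 split and the training file for the classifier might be prepared along these lines. The `__label__` prefix is fastText's standard convention for supervised training data; the label names, the fixed seed, and the function names are assumptions made for this sketch, and the actual training call is omitted.

```python
import random


def to_fasttext_line(text: str, is_code: bool) -> str:
    """Render one example as a single line in fastText's supervised format."""
    label = "__label__code" if is_code else "__label__other"
    # fastText expects one example per line, so collapse all whitespace.
    return f"{label} {' '.join(text.split())}"


def split_train_valid(examples: list, train_fraction: float = 0.7, seed: int = 0):
    """Shuffle deterministically, then cut into training and validation sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Usage: ten labelled pages become seven training and three validation lines.
pages = [(f"page {i} def f(): pass", i % 2 == 0) for i in range(10)]
lines = [to_fasttext_line(text, is_code) for text, is_code in pages]
train, valid = split_train_valid(lines)
```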
In the quality assessment step, the system uses LLMs to score the identified code-related content on a 0-10 scale, evaluating its adherence to conventions, completeness, and value.
However, during the actual evaluation process, researchers discovered systematic biases in the scores assigned to different types of websites:
Document websites and technical blogs generally received higher scores due to their standardized formats and clear structures. In contrast, technical forums and Q&A platforms often contained valuable technical discussions and solutions but scored lower because of their informal formatting.
To address this scoring bias, the research team optimized the evaluation system by first categorizing websites based on content format and functionality, then establishing specific evaluation criteria and filtering thresholds for each category.
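The category-specific filtering described above could look something like this. The category names and threshold values are invented for illustration; the article does not report the actual categories or cutoffs.

```python
# Hypothetical per-category cutoffs on the 0-10 quality score.
THRESHOLDS = {
    "documentation": 7.0,
    "blog": 7.0,
    "forum": 5.0,
    "qa": 5.0,
}
DEFAULT_THRESHOLD = 6.0  # assumption for categories not listed above


def keep_document(category: str, score: float) -> bool:
    """Keep a page if its score clears the bar for its own site category."""
    return score >= THRESHOLDS.get(category, DEFAULT_THRESHOLD)
```

With a single global cutoff of 6, a forum answer scoring 5.5 would be discarded; with a lower bar for its category, it survives while a documentation page with the same score does not.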
Through this optimized dual-filtering mechanism, the system ultimately constructed a web data corpus containing approximately 1.2 trillion tokens.

Based on the four data categories mentioned earlier, Seed-Coder’s pre-training was divided into two stages.
The first stage is standard pre-training, which uses file-level code and code-related web data to build the model’s foundational capabilities.
The second stage is continued pre-training, utilizing all four data categories while additionally introducing high-quality datasets and long-context datasets to enhance performance, align the model, and stimulate its ability to understand long-context data.
In addition to the standard next-token prediction objective, Seed-Coder also employs Fill-in-the-Middle (FIM) and Suffix-Prefix-Middle (SPM) training methods to enhance context-aware completion and mid-content generation capabilities.
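To make the fill-in-the-middle idea concrete, the sketch below splits a document into prefix, middle, and suffix and serializes it in two orderings. The sentinel strings are placeholders, not Seed-Coder's actual special tokens, and the character-level random split is an assumption for illustration.

```python
import random

# Placeholder sentinels (hypothetical; real tokenizers define their own).
PRE, SUF, MID = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"


def split_document(text: str, rng: random.Random):
    """Pick two cut points in a non-empty text and return (prefix, middle, suffix)."""
    i, j = sorted(rng.sample(range(len(text) + 1), 2))
    return text[:i], text[i:j], text[j:]


def to_psm(prefix: str, middle: str, suffix: str) -> str:
    """Prefix-Suffix-Middle ordering: the middle is generated last."""
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"


def to_spm(prefix: str, middle: str, suffix: str) -> str:
    """Suffix-Prefix-Middle ordering: suffix first, middle still generated last."""
    return f"{SUF}{suffix}{PRE}{prefix}{MID}{middle}"


# Usage: turn one snippet into an SPM training string.
rng = random.Random(0)
prefix, middle, suffix = split_document("def add(a, b):\n    return a + b\n", rng)
example = to_spm(prefix, middle, suffix)
```

In both orderings the middle comes last, so ordinary next-token prediction on the serialized string teaches the model to fill a gap given the code on both sides.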
Building on the base model, the Seed team developed two special variants of Seed-Coder:
- Instruction Model (-Instruct): Designed to enhance the model’s ability to follow instructions. Its training consists of two stages: Supervised Fine-Tuning (SFT) followed by Direct Preference Optimization (DPO).
- Reasoning Model (-Reasoning): Aimed at improving multi-step reasoning capabilities in complex programming tasks. It is trained with Long Chain-of-Thought (LongCoT) reinforcement learning. Training begins with a warm-up phase on programming competition problems and solutions generated by high-quality models, followed by reinforcement learning implemented via the GRPO framework.
The establishment of these two variants further expands the practical utility of Seed-Coder.
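For readers unfamiliar with GRPO, its central step is scoring each sampled answer relative to the other answers drawn for the same prompt. The sketch below shows only that advantage computation as it is commonly described, with a population standard deviation and a small `eps` chosen here as assumptions; it is not the Seed team's training code and omits the policy update entirely.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list, eps: float = 1e-6) -> list:
    """Normalize each reward by the mean and spread of its own sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Usage: four sampled solutions to one problem, two of which pass the tests.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Passing solutions receive positive advantages and failing ones negative, without needing a separate value model. If every sample in a group gets the same reward, all advantages are zero and the group contributes no learning signal.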
ByteDance Seed Becomes Increasingly Open
In addition to open-sourcing Seed-Coder, recent actions by ByteDance Seed have focused heavily on lowering barriers and promoting openness.
For instance, in the realm of base models, they released video generation and reasoning models.
The video generation model, Seaweed, natively supports 1280×720 resolution, arbitrary aspect ratios, and variable durations with only 7 billion parameters, outperforming models with 14 billion parameters.
It emphasizes cost advantages, having been trained using 665,000 H100 GPU hours. It is deployable by small to medium-sized teams, requiring only a single GPU with 40GB of VRAM to generate videos at resolutions up to 1280×720.

The deep thinking model Seed-Thinking-v1.5 is more lightweight with fewer activated parameters, surpassing DeepSeek-R1 in reasoning tasks such as mathematics and coding.

The team also published technical reports detailing their secrets, explaining how they improved reasoning performance through data, RL algorithms, and RL infrastructure.
In the area of agents, they partnered with Tsinghua University to launch UI-TARS, a computer operation agent that outperforms GPT-4o and others, and is free for commercial use.
Built upon Qwen-VL, it can autonomously complete complex cross-task operations step-by-step and is compatible with various systems. It currently boasts over 5.8k stars on GitHub.

Additionally, they introduced Multi-SWE-bench, a multilingual benchmark for issue resolving. It spans seven programming languages and includes 1,632 high-quality instances.
…
Meanwhile, internal adjustments are ongoing within ByteDance Seed. Reports indicate that the three teams under LLM—Pre-train (pre-training), Post-train (post-training), and Horizon—now report directly to Wu Yonghui, head of Seed. Furthermore, three directions explored within ByteDance AI Lab—robotics & embodied intelligence, AI for Science, and AI safety/interpretability—have been merged into Seed.
Earlier this year, ByteDance officially established a research project codenamed “Seed Edge.” Its core objective is to conduct longer-term, more fundamental AGI frontier research than pre-training and large model iterations. Project members enjoy a relaxed research environment, independent computing resources, and longer evaluation cycles. The five proposed research directions are entirely focused on next-generation AI research, original innovation, or paradigm shifts.
Through ByteDance’s moves, the new trends in the AI community have become clearer.
Open source, openness, original innovation, AI for all…
In short, should we thank DeepSeek? (doge)
Project Address:
https://bytedance-seed-coder.github.io/