Surpassing Devin! A New Player Arrives on the SWEBench Leaderboard:
StarShip CodeGen Agent, developed by OpenCSG, a startup led by members of Tsinghua University’s Yao Class, has achieved a global ranking of second place with a score of 23.67%.
It also set a new State-of-the-Art (SOTA) record for non-GPT-4o base models.

As we know, the SWEBench evaluation closely mirrors real-world programming scenarios and is extremely difficult. It requires models not only to understand requirements and coordinate changes across multiple functions, classes, or even files but also to interact with execution environments, handle ultra-long contexts, and perform complex logical reasoning far beyond traditional code generation tasks.
In such rigorous real-world tests, the industry’s most advanced models, GPT-4 and Devin, could only solve 1.74% and 13.86% of the problems, respectively.
OpenCSG’s achievement marks a leading step by Chinese companies in driving language models toward greater practicality, intelligence, and autonomy.
How Difficult Is Large Model Programming, Really?
In March 2024, the debut of Devin, the first AI software engineer, ignited excitement across the tech industry. Despite accompanying controversies, Devin’s powerful innovation capabilities and immense potential brought new expectations to many AI enthusiasts and practitioners.
Devin can not only handle coding tasks with ease but also autonomously complete the entire software development lifecycle, from project planning to deployment, covering website building, autonomous bug hunting and fixing, and training and fine-tuning AI models, among other activities.

Why did Devin dare to challenge the programming capabilities of base models like GPT-4?
The core reason is that software engineering involves more than just writing code; it encompasses requirement understanding, code interpretation, planning, generation, debugging, and exception handling. Each of these stages impacts the usability and effectiveness of large model-based programming.
To address such real-world scenarios, Princeton University introduced SWEBench, a tool for quantitatively evaluating end-to-end code generation capabilities.
GPT-4 scored only 1.74% on SWEBench. Even with Retrieval-Augmented Generation (RAG) technology added, the score remained below 3%, indicating that relying solely on base models to directly solve real-world programming problems is currently unfeasible.
Devin’s technological innovation lies in constructing workflows based on Agents, elevating the success rate on SWEBench to new heights.
In March, Devin topped the leaderboard by independently solving 13.86% of the problems, effectively lifting “large model programming” from a nearly unusable state to one showing promise. Major Silicon Valley tech giants and AI startups rushed into the LLM for Software Engineering (LLM4SE) field, continuously rewriting this record.
By late April 2024, the best record was set by Amazon Q Developer Agent, launched by Amazon’s AI team, with a score of 20.33%.
Regrettably, while Chinese companies have been well represented in base model rankings, they rarely took part in such high-difficulty challenges until OpenCSG recently surpassed this score.
From a Chinese Startup
In the latest update of the SWEBench evaluation results, OpenCSG has risen to second place on the leaderboard. Its OpenCSG StarShip CodeGen Agent achieved a pass rate of 23.67% in the Lite evaluation, surpassing the results of both Devin and Amazon.
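For readers unfamiliar with how such a percentage arises, the pass rate is simply the share of benchmark tasks whose fix passes the tests. The sketch below is illustrative only: the task count of 300 for the Lite split and the resolved counts (71 and 61) are assumptions back-calculated from the published percentages, not figures stated in this article, and `pass_rate` is a hypothetical helper.

```python
def pass_rate(resolved: int, total: int) -> float:
    """Return the percentage of tasks resolved, rounded to two decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * resolved / total, 2)


# Assumption: the Lite split contains 300 task instances.
LITE_TOTAL = 300

# Assumption: resolved counts inferred from the reported percentages.
print(pass_rate(71, LITE_TOTAL))  # 23.67, matching the StarShip CodeGen Agent score
print(pass_rate(61, LITE_TOTAL))  # 20.33, matching the Amazon Q Developer Agent score
```

Under these assumptions, the gap between the two scores corresponds to about ten additional resolved tasks.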
Founded just one year ago, OpenCSG (Kaifang Chuanshen) is dedicated to building an ecosystem community for large models, bringing together upstream and downstream enterprises in the AI industry to provide solutions and tool platforms for vertical industry applications of large models.

The team possesses deep expertise in open source and large model integration:
CEO Chen Ran is a well-known entrepreneur in the open-source software field, having successfully built multiple commercial companies within the open-source sector.
CTO Wang Wei graduated from Tsinghua University’s Yao Class (Class of 2005) and has years of R&D experience in artificial intelligence.
The core research team also includes elite graduates from prestigious institutions such as Tsinghua University, Peking University, Wharton, and the Hong Kong University of Science and Technology.
So, how did this team achieve a new record?
While many enterprises are actively exploring base models, vertical domain models, and RAG technologies, OpenCSG chose a focused direction: dedicating itself to the innovative development of programming Agents and deep optimization of large model algorithms.
At the Agent level: Unlike LLM+RAG or general Agent frameworks, the OpenCSG StarShip CodeGen Agent is highly customized for software R&D. It implements the stages of development (requirement understanding, code retrieval, planning, coding, iterative verification) via LLM Agents and integrates software engineering methods such as AST syntax analysis and dependency retrieval for deep optimization. This meticulous refinement of every stage ultimately enables higher-precision code generation.
At the algorithm level: Addressing typical issues like API conflicts caused by code version changes, OpenCSG proposed an adaptive teacher mode. The teacher model analyzes code version change logs to generate high-quality programming data, which is then used to improve the base model’s generation performance. According to evaluations, these innovations yield improvements significantly superior to current RAG approaches, especially in popular projects with frequent API structure updates. Related findings have been submitted as papers to international conferences.
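To make the AST syntax analysis and dependency retrieval mentioned at the Agent level more concrete, here is a minimal sketch using Python's standard `ast` module. It is not OpenCSG's implementation, which the article does not describe. The sample source and the helper names `build_call_graph` and `dependencies` are hypothetical, and the sketch only tracks direct calls to top-level functions defined in the same file.

```python
import ast
from typing import Dict, Set

# Hypothetical sample module used only for illustration.
SOURCE = '''
def load(path):
    return open(path).read()

def parse(text):
    return text.split()

def run(path):
    return parse(load(path))

def unused():
    return 0
'''


def build_call_graph(source: str) -> Dict[str, Set[str]]:
    """Map each top-level function to the top-level functions it calls by name."""
    tree = ast.parse(source)
    defined = {n.name for n in tree.body if isinstance(n, ast.FunctionDef)}
    graph: Dict[str, Set[str]] = {}
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            called = {
                c.func.id
                for c in ast.walk(node)
                if isinstance(c, ast.Call) and isinstance(c.func, ast.Name)
            }
            # Keep only calls to functions defined in this file.
            graph[node.name] = called & defined
    return graph


def dependencies(graph: Dict[str, Set[str]], target: str) -> Set[str]:
    """Collect every function reachable from `target` through calls."""
    seen: Set[str] = set()
    stack = [target]
    while stack:
        for callee in graph.get(stack.pop(), set()):
            if callee not in seen:
                seen.add(callee)
                stack.append(callee)
    return seen


graph = build_call_graph(SOURCE)
print(sorted(dependencies(graph, "run")))  # ['load', 'parse']
```

In a retrieval step of this kind, only the functions returned for the target (here `load` and `parse`, but not `unused`) would need to be placed in the model's context, which is one plausible way structural analysis can narrow what an LLM has to read.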
It is this dual approach of algorithm + engineering, characterized by relentless refinement, that allowed OpenCSG’s CodeGen Agent to stand out among other models.
“StarShip Is Like Various Home Appliances”
If the real-world evaluation of the CodeGen Agent was merely a trial run, then StarShip carries OpenCSG’s grand blueprint.
Regarding StarShip’s product positioning, OpenCSG CEO Chen Ran stated:
“StarShip embodies our vision of large models reshaping software development. Users can build their own digital employee teams using the intelligent Agents built into StarShip. The CodeGen Agent serves as the platform’s embedded digital programmer; recently released agents also include a CodeReview Agent for code review and a CodeSearch Agent for code Q&A. Unlike traditional code assistance tools, we aim for these digital employees to work independently without human intervention. In the future, we will release more types of digital employees to comprehensively cover requirements, design, coding, testing, and operations.”
CTO Wang Wei noted that this path is challenging but fascinating: “From first principles, the question is no longer whether large models can boost productivity, but when, where, and in what form they will do so. StarShip is our attempt to answer that.”

Beyond StarShip, the OpenCSG team has been highly productive, launching products such as CSGHub (an open-source model platform), Wukong (a pre-trained model), and CSGCoder (a fine-tuned code model). These products are clearly positioned and have received positive feedback in the industry.
The rapid launch and iteration of these products not only meet market demands but also serve a common goal: empowering every enterprise and individual with large models.
To empower every enterprise and individual, large models must become as ubiquitous as water and electricity. If large models are like electrical energy, CSGHub is the power grid, and StarShip represents the various home appliances that ultimately deliver value to households everywhere.
OpenCSG’s philosophy centers on openness and open source. As a company committed to an open-source core strategy, it has open-sourced not only its models and code but also its platform.
CTO Wang Wei summarized: “We are a young company. Benefiting from open source allowed us to achieve results in a short time, and we will fully give back to the open-source community, which is a fundamental principle of that ecosystem. Furthermore, I strongly agree with Sam Altman’s view that open source is merely a model; what matters more than the model itself is product value.”
“Benchmarks are just numbers. With the release of GPT-4o, SWEBench scores are expected to exceed 30% soon, and optimistically, they may break 50% next year. However, we focus more on the product value behind these numbers: As model capabilities and engineering techniques improve, digital employees will undergo a qualitative shift driven by quantitative growth—moving from ‘usable’ to ‘highly effective’—leading to comprehensive breakthroughs across industries,” Wang Wei explained. “This may be a significant change in the era of large models. From companies to individuals, we must all prepare for it.”