Introducing GPT-5.4
OpenAI has officially released GPT-5.4, a model designed for professional work that delivers major advances in reasoning, coding, and agentic workflows. GPT-5.4 is the first OpenAI model to natively integrate state-of-the-art computer-use capabilities, supporting complex tasks across applications, with a context window of up to 1 million tokens. It also surpasses its predecessors on key measures of factual accuracy, visual perception, and document processing, making it one of the most efficient reasoning models available today.
Introducing GPT‑5.4 | OpenAI

March 5, 2026

Introducing GPT‑5.4
===================

Designed for professional work

Today, we're releasing **GPT‑5.4** in ChatGPT (as GPT‑5.4 Thinking), the API, and Codex. It's our most capable and efficient frontier model for professional work. We're also releasing **GPT‑5.4 Pro** in ChatGPT and the API, for people who want maximum performance on complex tasks. GPT‑5.4 brings together the best of our recent advances in reasoning, coding, and agentic workflows into a single frontier model.
It incorporates the industry-leading coding capabilities of [GPT‑5.3‑Codex](https://openai.com/index/introducing-gpt-5-3-codex/) while improving how the model works across tools, software environments, and professional tasks involving spreadsheets, presentations, and documents. The result is a model that gets complex real work done accurately, effectively, and efficiently, delivering what you asked for with less back-and-forth.

In ChatGPT, GPT‑5.4 Thinking can now provide an upfront plan of its thinking, so you can **adjust course mid-response** while it's working and arrive at a final output more closely aligned with what you need, without additional turns. GPT‑5.4 Thinking also improves **deep web research**, particularly for highly specific queries, while **better maintaining context** for questions that require longer thinking. Together, these improvements mean higher-quality answers that arrive faster and stay relevant to the task at hand.

In Codex and the API, GPT‑5.4 is the first general-purpose model we've released with native, state-of-the-art **computer-use capabilities**, enabling agents to operate computers and carry out complex workflows across applications. It supports up to **1M tokens of context**, allowing agents to plan, execute, and verify tasks across long horizons. GPT‑5.4 also improves how models work across large ecosystems of tools and connectors with **tool search**, helping agents find and use the right tools more efficiently without sacrificing intelligence. Finally, GPT‑5.4 is our **most token-efficient reasoning model** yet, using significantly fewer tokens than GPT‑5.2 to solve problems, which translates to reduced token usage and faster responses.

Together with advances in general reasoning, coding, and professional knowledge work, GPT‑5.4 enables more reliable agents, faster developer workflows, and higher-quality outputs across ChatGPT, the API, and Codex.
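As a back-of-envelope illustration of what a 1M-token window buys long-running agents, the sketch below estimates whether a transcript still fits. The ~4-characters-per-token rule is a rough heuristic, not GPT‑5.4's actual tokenizer, and the output-reserve figure is an arbitrary assumption.

```python
# Rough check of whether an agent transcript fits in a 1M-token context.
# Uses the ~4 chars/token rule of thumb; real counts require the model's
# actual tokenizer, which this sketch does not assume access to.
CONTEXT_LIMIT = 1_000_000
CHARS_PER_TOKEN = 4  # crude heuristic, not the real tokenizer

def estimated_tokens(text: str) -> int:
    """Pessimistic token estimate from character count."""
    return len(text) // CHARS_PER_TOKEN + 1

def fits_in_context(transcript: list[str], reserve_for_output: int = 8_000) -> bool:
    """True if the estimated prompt tokens leave room for the reply."""
    used = sum(estimated_tokens(t) for t in transcript)
    return used + reserve_for_output <= CONTEXT_LIMIT

# 50 steps of ~9,000 characters each: roughly 113K estimated tokens.
print(fits_in_context(["step log " * 1000] * 50))  # -> True
```

Real applications should count tokens with the model's own tokenizer; this heuristic only gives a quick go/no-go before a long agent run.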
| | **GPT-5.4** | **GPT-5.3-Codex** | **GPT-5.2** |
| --- | --- | --- | --- |
| GDPval (wins or ties) | 83.0% | 70.9% | 70.9% |
| SWE-Bench Pro (Public) | 57.7% | 56.8% | 55.6% |
| OSWorld-Verified | 75.0% | 74.0%* | 47.3% |
| Toolathlon | 54.6% | 51.9% | 46.3% |
| BrowseComp | 82.7% | 77.3% | 65.8% |

\*Previously reported as 64.7%. GPT‑5.3‑Codex achieves 74.0% with a newly introduced API parameter that preserves the original image resolution.

Knowledge work
--------------

Building on GPT‑5.2's general reasoning capabilities, GPT‑5.4 delivers even more consistent and polished results on real-world tasks that matter to professionals. On [GDPval](https://openai.com/index/gdpval/), which tests agents' abilities to produce well-specified knowledge work across 44 occupations, GPT‑5.4 achieves a new state of the art, matching or exceeding industry professionals in **83.0%** of comparisons, compared to **70.9%** for GPT‑5.2.

_In GDPval, models attempt well-specified knowledge work spanning 44 occupations from the top 9 industries contributing to U.S. GDP. Tasks request real work products, such as sales presentations, accounting spreadsheets, urgent care schedules, manufacturing diagrams, or short videos. Reasoning effort was set to xhigh for GPT‑5.4 and heavy for GPT‑5.2 (a slightly lower level in ChatGPT)._

> "GPT-5.4 is the best model we've ever tried. It's now top of the leaderboard on our APEX-Agents benchmark, which measures model performance for professional services work. It excels at creating long-horizon deliverables such as slide decks, financial models, and legal analysis, delivering top performance while running faster and at a lower cost than competitive frontier models." — Brendan Foody, CEO at Mercor

We put a particular focus on improving GPT‑5.4's ability to create and edit spreadsheets, presentations, and documents.
On an internal benchmark of spreadsheet modeling tasks that a junior investment banking analyst might do, GPT‑5.4 achieves a mean score of **87.3%**, compared to **68.4%** for GPT‑5.2. On a set of presentation evaluation prompts, human raters preferred presentations from GPT‑5.4 **68.0%** of the time over those from GPT‑5.2, due to stronger aesthetics, greater visual variety, and more effective use of image generation.

_Documents were generated with reasoning effort set to xhigh._

You can try these capabilities in ChatGPT using GPT‑5.4 Thinking or Pro. If you're an Enterprise customer, we recommend our [ChatGPT for Excel add-in](https://chatgpt.com/apps/spreadsheets/), which also launched today. We've also updated our [spreadsheet](https://github.com/openai/skills/tree/main/skills/.curated/spreadsheet) and [presentation skills](https://github.com/openai/skills/tree/main/skills/.curated/slides) available in Codex and the API.

To make GPT‑5.4 better at real-world work, we continued our progress at driving down hallucinations and errors. GPT‑5.4 is our most factual model yet: on a set of de-identified prompts where users flagged factual errors, GPT‑5.4's individual claims are 33% less likely to be false and its full responses are 18% less likely to contain any errors, relative to GPT‑5.2.

> "GPT-5.4 sets a new bar for document-heavy legal work. On our BigLaw Bench eval, it scored 91%. Compared to other models, GPT‑5.4 is currently better at structuring complex transactional analysis, maintaining accuracy across lengthy contracts, and delivering the high level of detail legal practitioners require."
— Niko Grupen, Head of Applied Research at Harvey

Computer use and vision
-----------------------

GPT‑5.4 is our first general-purpose model with native **computer-use capabilities** and marks a major step forward for developers and agents alike. It's the best model currently available for developers building agents that complete real tasks across websites and software systems.

We've designed GPT‑5.4 to be performant across a wide range of computer-use workloads. It excels at writing code to operate computers via libraries like Playwright, as well as at issuing mouse and keyboard commands in response to screenshots. Its behavior is steerable via developer messages, so developers can adjust it to suit particular use cases. Developers can even configure the model's safety behavior to match different levels of risk tolerance by specifying custom confirmation policies.

The model's performance and flexibility are reflected across benchmarks that test computer use in different settings. On **OSWorld-Verified**, which measures a model's ability to navigate a desktop environment through screenshots and keyboard/mouse actions, GPT‑5.4 achieves a state-of-the-art **75.0%** success rate, far exceeding GPT‑5.2's **47.3%** and surpassing human performance at **72.4%**.¹ On **WebArena-Verified**, which tests browser use, GPT‑5.4 achieves a leading **67.3%** success rate when using both DOM- and screenshot-driven interaction, compared to GPT‑5.2's **65.4%**. On **Online-Mind2Web**, which also tests browser use, GPT‑5.4 achieves a **92.8%** success rate using screenshot-based observations alone, improving over ChatGPT Atlas's Agent Mode, which achieves **70.9%**.

_A tool yield is when an assistant yields to await tool responses. If 3 tools are called in parallel, followed by 3 more tools called in parallel, the number of yields would be 2.
Tool yields are a better proxy of latency than tool calls because they reflect the benefits of parallelization._

GPT‑5.4 interprets screenshots of a browser interface and interacts with UI elements through coordinate-based clicking to send emails and schedule a calendar event. **Video is not sped up.**

GPT‑5.4's improved computer use is built on the model's improved general visual perception. On **MMMU-Pro**, a test of a model's visual understanding and reasoning, GPT‑5.4 achieves an **81.2%** success rate without tool use, an improvement over GPT‑5.2's **79.5%**. Improved visual perception also translates into better document parsing. On **OmniDocBench**, GPT‑5.4 without reasoning effort achieves an average error (measured by normalized edit distance between model prediction and ground truth) of **0.109**, improved from GPT‑5.2's **0.140**.

_MMMU-Pro was run with reasoning effort set to xhigh. OmniDocBench was run with reasoning effort set to none, to reflect low-cost, low-latency performance._

We're also improving visual understanding for dense, high-resolution images where full fidelity matters. Starting with GPT‑5.4, we're introducing an `original` image [input detail](https://developers.openai.com/api/docs/guides/images-vision/#specify-image-input-detail-level) level that supports full-fidelity perception up to 10.24M total pixels or a 6000-pixel maximum dimension, whichever is lower; the `high` image input detail level now supports up to 2.56M total pixels or a 2048-pixel maximum dimension. In early testing with API users, we observed strong gains in localization ability, image understanding, and click accuracy when using `original` or `high` detail.
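The pixel budgets quoted above are easy to check mechanically. The sketch below validates an image against the stated `original` limits and builds an image content block in the Responses-API style; the numeric limits come from this post, but the validation logic and exact field layout are assumptions for illustration, not documented behavior.

```python
# Check an image against the stated "original" detail budget:
# <= 10.24M total pixels AND <= 6000 px on the longest side.
# Limits are quoted from the announcement; enforcement details are assumed.

MAX_PIXELS = 10_240_000  # 10.24M total pixels
MAX_DIM = 6_000          # longest-side cap in pixels

def fits_original_detail(width: int, height: int) -> bool:
    """True if both the total-pixel and max-dimension caps are satisfied."""
    return width * height <= MAX_PIXELS and max(width, height) <= MAX_DIM

def image_input(url: str, detail: str = "original") -> dict:
    """Build an image content block in the Responses-API style."""
    return {"type": "input_image", "image_url": url, "detail": detail}

print(fits_original_detail(4000, 2500))  # 10.0M px, longest side 4000 -> True
print(fits_original_detail(6400, 1800))  # longest side 6400 > 6000 -> False
```

An oversized image would presumably need downscaling before upload, or a fallback to `high` detail, which trades fidelity for the smaller 2.56M-pixel budget.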
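The tool-yield metric defined in the note above is straightforward to compute from an agent trace. The sketch below assumes a hypothetical trace format, a list of parallel tool-call batches, purely for illustration:

```python
# Count tool calls vs. tool yields from an agent trace, per the definition
# above: one yield per batch of tool calls issued in parallel.
# The batch-of-calls trace format here is hypothetical, for illustration.

def count_yields(trace: list[list[str]]) -> tuple[int, int]:
    """Return (tool_calls, tool_yields) for a trace of parallel batches."""
    calls = sum(len(batch) for batch in trace)
    yields = sum(1 for batch in trace if batch)  # empty batches don't yield
    return calls, yields

# Example from the note: 3 parallel calls, then 3 more -> 6 calls, 2 yields.
trace = [["search", "fetch", "parse"], ["search", "fetch", "parse"]]
print(count_yields(trace))  # -> (6, 2)
```

Sequential calls would appear as six single-call batches, giving 6 yields for the same 6 calls, which is why yields track latency better than raw call counts.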
> "In our evals measuring computer use performance across ~30K HOA and property tax portals, GPT‑5.4 achieved a 95% success rate on the first attempt and 100% within three attempts, compared to ~73–79% with prior CUA models. It also completed sessions ~3x faster while using ~70% fewer tokens, materially improving reliability and cost efficiency at scale." — Dod Fraser, CEO at Mainstay

In the API, developers can access these capabilities using the updated `computer` tool. Please see our [updated documentation](https://developers.openai.com/api/docs/guides/latest-model) fo
