When an AI Agent Needs a Browser - agent-browser in Practice and How It Compares

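Lately I've been using Claude Code for some automation work that requires the AI to drive a browser: opening pages, filling in forms, taking screenshots, that sort of thing. I started with Playwright MCP and watched the context window blow up almost immediately. A simple login flow ate nearly 100K tokens, and the quality of everything the AI said afterwards fell off a cliff (╯°□°)╯︵ ┻━┻

Then I found agent-browser from Vercel Labs, a browser automation CLI designed specifically for AI agents. After trying it out I found the design ideas interesting, so I also put together a comparison of the main options currently available.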

TL;DR

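Conclusions first. If you're in a hurry, read this section and go:

Option	Best for	Token efficiency	Learning curve
agent-browser	Everyday AI agent tasks, long conversations	Very high (~82% fewer tokens than MCP)	Low, works straight from the CLI
Playwright MCP	Short tasks, sandboxed environments	Low (a snapshot is pushed on every step)	Medium, requires setting up an MCP server
Playwright CLI	Coding agents that need advanced features	High (~4x fewer tokens than MCP)	Medium, 50+ commands to learn
Stagehand	Describing actions in natural language	Medium (an extra layer of LLM inference)	Low, but requires TypeScript

If your AI agent drives a browser frequently and you care about token usage (you should, it's money), agent-browser is currently the most economical choice.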

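1. First, understand the problem: why not just use Playwright MCP?

Before getting into agent-browser, a word on why the conventional Playwright MCP runs into trouble in AI agent scenarios.

The way Playwright MCP works: every time the AI performs an action, the server pushes the page's entire accessibility tree back into the context window. For an ordinary dashboard page, a single snapshot is about 3,800 tokens. Over 15 steps that is at least 15 × 3,800 ≈ 57,000 tokens, and with heavier pages along the way it lands in the 60,000 to 80,000 range, all of it spent on page state alone.

To get a feel for the numbers:

Step 1: open the login page             → ~3,800 tokens cumulative
Step 5: fill in and submit the form     → ~25,000 tokens cumulative
Step 10: work in the dashboard          → ~50,000 tokens cumulative
Step 15: the AI starts talking nonsense → ~80,000 tokens cumulative

And that's only the page state. Add MCP's 26 tool schemas (roughly 4,200 more tokens per interaction) and your context window won't survive many steps.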
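
To make the arithmetic concrete, here is a minimal sketch that accumulates the two per-step figures quoted above. It assumes a flat cost per step, which is a simplification: real pages differ in size, so treat the output as a rough floor rather than a measurement. The function name is only illustrative.

```python
# Figures quoted in the article; real pages will vary.
SNAPSHOT_TOKENS = 3_800  # one accessibility-tree snapshot of a typical dashboard page
SCHEMA_TOKENS = 4_200    # tool-schema overhead the article attributes to each interaction


def cumulative_tokens(steps: int, include_schema: bool = False) -> int:
    """Total tokens after `steps` actions, assuming every step costs the same."""
    per_step = SNAPSHOT_TOKENS + (SCHEMA_TOKENS if include_schema else 0)
    return steps * per_step


if __name__ == "__main__":
    for step in (1, 5, 10, 15):
        # step, page state only, page state plus schema overhead
        print(step, cumulative_tokens(step), cumulative_tokens(step, include_schema=True))
```

Under this flat-rate assumption, page state alone reaches 57,000 tokens at step 15, and about 120,000 once the schema overhead is counted on every step.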

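As an analogy: Playwright MCP is like asking for directions and having the other person unfold an entire map across your desk; agent-browser just tells you "turn left at the next intersection, then it's the third building."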

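2. The core design of agent-browser: Snapshot + Refs

agent-browser's biggest innovation is the Snapshot + Refs workflow. Instead of dumping the whole accessibility tree into context, it returns a compact list of elements, each carrying a reference ID: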

$ agent-browser snapshot

button "Sign In" [ref=e1]
textbox "Email" [ref=e2]
textbox "Password" [ref=e3]
link "Forgot password?" [ref=e4]
link "Create account" [ref=e5]

That's it, 5 lines. For the same page, Playwright MCP would hand you an accessibility tree hundreds of lines long.

Then, when you want to act, you just use the ref:

$ agent-browser fill @e2 "user@example.com"
$ agent-browser fill @e3 "my-password"
$ agent-browser click @e1

No CSS selectors, no XPath, none of the brittle ways of locating elements. Refs are generated from the ARIA role and accessible name, which makes them far more stable than CSS selectors.
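
If you are scripting around the CLI, the snapshot text is easy to turn into a lookup table. The sketch below assumes the output looks exactly like the lines shown above (an optional leading "- ", a role, an optional quoted name, then `[ref=...]`); that format is inferred from this article's examples, not from a specification, and `parse_snapshot` and `find_ref` are hypothetical helper names.

```python
import re
from typing import NamedTuple, Optional


class Element(NamedTuple):
    role: str
    name: str
    ref: str


# Matches lines such as: button "Sign In" [ref=e1]   or   - textbox [ref=e2]
LINE_RE = re.compile(
    r'^\s*(?:-\s*)?(?P<role>\w+)(?:\s+"(?P<name>[^"]*)")?\s+\[ref=(?P<ref>\w+)\]\s*$'
)


def parse_snapshot(text: str) -> list[Element]:
    """Collect every line that carries a ref; lines without one are skipped."""
    elements = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            elements.append(Element(match["role"], match["name"] or "", match["ref"]))
    return elements


def find_ref(elements: list[Element], role: str, name: str) -> Optional[str]:
    """Return the ref of the first element with this role and accessible name."""
    for element in elements:
        if element.role == role and element.name == name:
            return element.ref
    return None


SAMPLE = '''button "Sign In" [ref=e1]
textbox "Email" [ref=e2]
textbox "Password" [ref=e3]
link "Forgot password?" [ref=e4]
link "Create account" [ref=e5]'''

elements = parse_snapshot(SAMPLE)
```

With the sample above, `find_ref(elements, "textbox", "Email")` gives the ref to pass to `agent-browser fill` as `@e2`.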

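3. Hands-on demo: a run against Hacker News

Principles alone don't give you a feel for it, so let's test against the Hacker News login page. The results below are from an actual run, not made-up data.

Installation

# Global install (recommended; the Rust binary gives the best performance)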
npm install -g agent-browser
agent-browser install
# → automatically downloads Chromium, about 162 MB

Step 1: Open Hacker News

$ agent-browser open https://news.ycombinator.com

✓ Hacker News
  https://news.ycombinator.com/

The response is clean: just the page title and URL, nothing more.

Step 2: See what's on the front page

$ agent-browser snapshot
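# → 879 lines / 45,754 bytes

The front page has 30 stories, each with a title link, an upvote button and a comments link, so the snapshot is on the large side. The key point, though, is that this data is not automatically pushed into context; the AI agent can selectively read only the parts it needs.

In the snapshot you can see the structure of the navigation bar: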

- link "new" [ref=e5]
- link "past" [ref=e6]
- link "comments" [ref=e7]
- link "ask" [ref=e8]
- link "show" [ref=e9]
- link "jobs" [ref=e10]
- link "submit" [ref=e11]
- link "login" [ref=e13]

Step 3: Go to the login page

$ agent-browser click @e13
✓ Done

Step 4: Look at the login page snapshot

This is the important part. Look how compact the login page snapshot is:

$ agent-browser snapshot

- text: Login
- table:
    - row "username:":
        - cell "username:"
        - cell:
          - textbox [ref=e2]
    - row "password:":
        - cell "password:"
        - cell:
          - textbox [ref=e4]
- button "login" [ref=e5]
- link "Forgot your password?" [ref=e6]
- text: Create Account
- table:
    - row "username:":
        - cell "username:"
        - cell:
          - textbox [ref=e8]
    - row "password:":
        - cell "password:"
        - cell:
          - textbox [ref=e10]
- button "create account" [ref=e11]

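25 lines. The entire login page is 25 lines, and the AI agent can tell at a glance that @e2 is the username, @e4 is the password, and @e5 is the login button.

Step 5: Fill in the form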

$ agent-browser fill @e2 "demo_user"
✓ Done

$ agent-browser fill @e4 "demo_password"
✓ Done

Each action returns a single line, ✓ Done, and puts nothing extra into context.

Step 6: Take a screenshot to confirm

$ agent-browser screenshot hn-login-filled.png
✓ Screenshot saved to hn-login-filled.png

The screenshot confirms the form really is filled in (the username field shows demo_user and the password field shows dots).

Step 7: Wrap up

$ agent-browser close
✓ Browser closed
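
For anyone who wants to replay the seven steps from a script rather than by hand, here is a rough sketch using Python's `subprocess`. It only issues the commands shown above. The refs (`e13`, `e2`, `e4`) are the ones from this particular run and may well differ on another day, so a real script would read them from the snapshot output instead of hard-coding them. `hn_login_commands` and `run_all` are hypothetical helper names.

```python
import subprocess


def hn_login_commands(username: str, password: str) -> list[list[str]]:
    """The walkthrough above as argv lists; refs are copied from the article's run."""
    return [
        ["agent-browser", "open", "https://news.ycombinator.com"],
        ["agent-browser", "snapshot"],          # refs are read from snapshot output
        ["agent-browser", "click", "@e13"],     # the "login" link
        ["agent-browser", "snapshot"],          # fresh refs for the login page
        ["agent-browser", "fill", "@e2", username],
        ["agent-browser", "fill", "@e4", password],
        ["agent-browser", "screenshot", "hn-login-filled.png"],
        ["agent-browser", "close"],
    ]


def run_all(commands: list[list[str]]) -> list[str]:
    """Run each command in order and collect stdout; raises if any command fails."""
    outputs = []
    for argv in commands:
        result = subprocess.run(argv, capture_output=True, text=True, check=True)
        outputs.append(result.stdout.strip())
    return outputs


commands = hn_login_commands("demo_user", "demo_password")
# run_all(commands)  # uncomment on a machine where agent-browser is installed
```

Passing arguments as a list rather than a shell string means credentials containing spaces or quotes need no escaping.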

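Measured results

Across the whole flow, this is how much data each step returned to the AI:

Command	Output size
`open`	2 lines (title + URL)
`snapshot` (front page)	879 lines / ~46 KB
`click`	1 line (`✓ Done`)
`snapshot` (login page)	25 lines / ~0.7 KB
`fill` × 2	1 line each (`✓ Done`)
`screenshot`	1 line (file path)
`close`	1 line

If the AI agent only needs to work with the login page (and doesn't need to read the full front-page snapshot), the context cost of the whole flow is roughly the 25 lines of the login page snapshot plus a few lines of action confirmations. Very cheap.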

4. One more test: a GitHub repo page

A login page is too easy, so let's look at a page with more content. We'll open agent-browser's own GitHub repo:

$ agent-browser open https://github.com/vercel-labs/agent-browser
✓ GitHub - vercel-labs/agent-browser: Browser automation CLI for AI agents
  https://github.com/vercel-labs/agent-browser

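Snapshot size: 1,757 lines / ~104 KB. A GitHub repo page has a lot of elements (navigation bar, tabs, file list, README content...), so the snapshot is fairly large.

The same logic applies, though: the AI agent doesn't need to read it all at once. It can act on just the elements it needs:

# Read the star count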
$ agent-browser get text @e19
Star 17.7k

# Take a screenshot
$ agent-browser screenshot agent-browser-repo.png
✓ Screenshot saved

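One thing worth noting here: snapshot size scales with page complexity. The simple login page is 25 lines; the complex GitHub repo page is 1,757. But however large it gets, agent-browser never pushes it into context automatically. The AI agent decides whether to read it and how much.

That is the fundamental difference from Playwright MCP: MCP "forces the full page state back after every action," while agent-browser "only tells you when you ask."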
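
One way to picture "read only what you need" is a filter that keeps just the snapshot lines mentioning the things the agent cares about. This is an illustrative sketch, not something agent-browser does for you: `select_lines` is a hypothetical helper, and the sample text is the Hacker News navigation bar from earlier in this article.

```python
def select_lines(snapshot: str, keywords: list[str]) -> str:
    """Keep only the snapshot lines that mention any keyword (case-insensitive)."""
    wanted = [keyword.lower() for keyword in keywords]
    kept = [
        line
        for line in snapshot.splitlines()
        if any(keyword in line.lower() for keyword in wanted)
    ]
    return "\n".join(kept)


NAV_SNAPSHOT = '''- link "new" [ref=e5]
- link "past" [ref=e6]
- link "comments" [ref=e7]
- link "ask" [ref=e8]
- link "show" [ref=e9]
- link "jobs" [ref=e10]
- link "submit" [ref=e11]
- link "login" [ref=e13]'''

login_only = select_lines(NAV_SNAPSHOT, ["login"])
```

Applied to an 879-line front-page snapshot, a filter like this would hand the model a handful of lines instead of the whole thing.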

5. Going further: semantic locators

Besides refs, agent-browser also supports semantic locators, which let you act without running a snapshot first:

# Find an element by role
$ agent-browser find role button click --name "Submit"

# Find a form field by label
$ agent-browser find label "Email" fill "test@example.com"

# Find an element by its text content
$ agent-browser find text "Sign In" click

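This is handy when you already know what the page looks like, since it saves a snapshot round trip.

6. Going further: saving state and visual comparison

Saving login state

Logging in again every time is tedious, so you can save the session:

# Save state after logging in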
$ agent-browser state save ./auth-state.json

# Next time, load it directly and skip the login
$ agent-browser state load ./auth-state.json
$ agent-browser open https://example-app.com/dashboard
# → lands straight on the dashboard, no login needed
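
In a script, the save/load pair usually turns into a small branch: load the state file if it exists, otherwise go to the login page and save afterwards. Below is a sketch of that decision using only the `state save`, `state load` and `open` commands shown above. The helper names and the login URL are placeholders invented for the example.

```python
from pathlib import Path


def plan_session(has_saved_state: bool, state_file: str,
                 target_url: str, login_url: str) -> list[list[str]]:
    """Pick the commands to start a session, depending on whether state was saved."""
    if has_saved_state:
        return [
            ["agent-browser", "state", "load", state_file],
            ["agent-browser", "open", target_url],
        ]
    # No saved state yet: start at the login page and log in as usual.
    return [["agent-browser", "open", login_url]]


def save_state_command(state_file: str) -> list[str]:
    """Command to run once the login has succeeded."""
    return ["agent-browser", "state", "save", state_file]


def plan_from_disk(state_file: str, target_url: str, login_url: str) -> list[list[str]]:
    """Same as plan_session, but checks the filesystem for the state file."""
    return plan_session(Path(state_file).is_file(), state_file, target_url, login_url)


DASHBOARD = "https://example-app.com/dashboard"
LOGIN = "https://example-app.com/login"  # placeholder URL, not from the article
```

Keeping the decision in a pure function (`plan_session`) makes it easy to check without touching the browser or the disk.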

Visual diffing

After changing the UI, you may want to confirm nothing broke:

# First take a "before" screenshot
$ agent-browser screenshot before.png

# ... deploy the new version ...

# Take the "after" screenshot and compare it against the earlier one
$ agent-browser diff screenshot --baseline before.png

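7. Comparison: four options for AI browser automation

There are currently four main options for letting an AI agent drive a browser. Let's go through them one by one.

agent-browser (Vercel Labs)

Architecture: Rust CLI + Node.js daemon + Playwright

Pros:

Highest token efficiency; the Snapshot + Refs design keeps context to a minimum
Zero configuration; usable as soon as it's installed
Simple, intuitive CLI that an AI agent can call through Bash
Cross-platform: macOS / Linux / Windows

Limitations:

Relatively new; the ecosystem is still taking shape
Requires shell access (unusable in sandboxed environments)

Best for: everyday AI agent tasks, long conversations, and scenarios with frequent browser interaction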

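Playwright MCP (Microsoft)

Architecture: MCP server + Playwright

Pros:

Officially maintained by Microsoft; very stable
Supports the full Playwright API (26+ tools)
No shell access required, so it works in sandboxed environments too

Limitations:

Heavy token consumption; every step returns the full accessibility tree
Context tends to break down beyond 15 steps
Tool schema overhead of ~4,200 tokens per interaction

Best for: short tasks (5 steps or fewer), sandboxed environments, and scenarios that need full Playwright functionality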

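Playwright CLI (Microsoft)

Architecture: CLI + on-disk snapshots + Playwright

Pros:

About 4x fewer tokens than MCP (~27K vs ~114K)
50+ commands, the most complete feature set
Snapshots are written to disk and take up no context
A ref system similar to agent-browser's

Limitations:

Many commands, so a steeper learning curve
Slightly more verbose than agent-browser

Best for: coding agents that need advanced features (network mocking, tracing)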

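Stagehand (Browserbase)

Architecture: TypeScript SDK + Playwright + LLM

Pros:

Lets you describe actions in natural language (page.act("click the login button"))
v3 is 44% faster and supports Shadow DOM and iframes
Offers a cloud browser service (Browserbase) for running at scale

Limitations:

Every action goes through a layer of LLM inference, costing an extra API call
Supports TypeScript/JavaScript only
Cloud features require a Browserbase account

Best for: scenarios where you don't want to deal with selectors, large-scale cloud execution, and teams already using TypeScript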

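8. Token efficiency: letting the numbers talk

This is probably the most practical comparison. The estimate uses a "log in → enter the dashboard → use 3 features" scenario:

Task: log in and perform 5 operations (about 15 steps)

Playwright MCP  ████████████████████████████████  ~114,000 tokens
Playwright CLI  ████████                          ~27,000 tokens
agent-browser   █████                             ~20,000 tokens (estimate)
Stagehand       ██████████████                    ~50,000 tokens (including LLM inference)

agent-browser and Playwright CLI aren't far apart; both take the "snapshot to disk / compact response" route. But agent-browser's commands are more concise, and it feels better in AI agent scenarios.

Playwright MCP is in a different league altogether: on longer tasks it will basically blow through the token budget.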
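
The "~82%" and "~4x" figures quoted in this article follow directly from the estimates in the chart. A short sketch of the arithmetic, using only those four numbers (the function names are illustrative):

```python
# Token estimates from the chart above (the agent-browser figure is itself an estimate).
ESTIMATES = {
    "Playwright MCP": 114_000,
    "Playwright CLI": 27_000,
    "agent-browser": 20_000,
    "Stagehand": 50_000,
}


def savings_vs(baseline: str, option: str) -> float:
    """Fraction of tokens saved by `option` relative to `baseline`."""
    return 1 - ESTIMATES[option] / ESTIMATES[baseline]


def times_fewer(baseline: str, option: str) -> float:
    """How many times fewer tokens `option` uses than `baseline`."""
    return ESTIMATES[baseline] / ESTIMATES[option]


if __name__ == "__main__":
    for name in ("Playwright CLI", "agent-browser", "Stagehand"):
        saved = savings_vs("Playwright MCP", name)
        ratio = times_fewer("Playwright MCP", name)
        print(f"{name}: {saved:.0%} fewer tokens, {ratio:.1f}x smaller")
```

agent-browser comes out at about 82% fewer tokens than Playwright MCP (roughly 5.7x), and Playwright CLI at about 76% fewer (roughly 4.2x).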

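9. When to choose what

It's probably fairly obvious by now, but here's a summary anyway:

Choose agent-browser if:

You use Claude Code, Cursor, Copilot or another AI tool with shell access
Your tasks need more than 10 browser steps
You care about token cost (roughly $0.30 per 100K tokens, depending on the model; it adds up over time)
You want the simplest setup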
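
To put the cost point in dollars, here is a back-of-the-envelope sketch using the $0.30 per 100K tokens figure above and the per-task estimates from section 8. Actual pricing depends on the model and on how input and output tokens are billed, so this is a rough illustration only; the function names are hypothetical.

```python
PRICE_PER_100K_USD = 0.30  # rough figure quoted above; varies by model


def cost_usd(tokens: int, price_per_100k: float = PRICE_PER_100K_USD) -> float:
    """Cost of a given number of tokens at a flat per-100K price."""
    return tokens / 100_000 * price_per_100k


def saving_over_runs(expensive_tokens: int, cheap_tokens: int, runs: int) -> float:
    """Money saved by using the cheaper option for `runs` repetitions of a task."""
    return (cost_usd(expensive_tokens) - cost_usd(cheap_tokens)) * runs


if __name__ == "__main__":
    print(round(cost_usd(114_000), 3))  # Playwright MCP estimate, per task
    print(round(cost_usd(20_000), 3))   # agent-browser estimate, per task
    print(round(saving_over_runs(114_000, 20_000, 1_000), 2))
```

At these numbers a single task costs about $0.34 with Playwright MCP versus about $0.06 with agent-browser, which works out to roughly $282 over 1,000 runs.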

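Choose Playwright MCP if:

Your AI tool has no shell access (a pure chat interface)
Your tasks are short, done within 5 steps
You need to run in a sandboxed environment

Choose Playwright CLI if:

You need advanced features (network interception, tracing, video recording)
You want the most complete Playwright feature set
You don't mind learning 50+ commands

Choose Stagehand if:

You want to describe actions in natural language and not deal with refs or selectors
You need large-scale cloud execution
Your project is already in TypeScript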

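Closing thoughts

agent-browser's design philosophy is worth learning from: rather than throwing all the information at the AI, give it only the minimum it needs. This is a lot like a familiar idea in everyday programming, the principle of least privilege, except here it becomes a "principle of least context."

agent-browser is still under active development at Vercel Labs, and 17.7k stars on GitHub say something about the community's interest. If you're building anything around AI agents, I recommend giving it a try; the difference in feel is really noticeable ヽ(✿ﾟ▽ﾟ)ノ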