Browser agent can log in to SaaS but can't complete multi-step actions with state

Question

A browser automation agent can log in to Salesforce / HubSpot / Notion and navigate the UI reliably. But multi-step flows ("move this opportunity to 'Closed Won', then create a follow-up task for next Tuesday") fail ~60% of the time, either because selectors shift between steps or because state from step N isn't available at step N+1.

Agent stack: vision-capable model (gpt-4o) taking one screenshot per step, with Playwright as the executor. The action space is click/type/wait. State is passed between steps as a plain string, and it is often stale by the time the model sees it.
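
To make the setup concrete, here is a minimal sketch of the step loop as described above. It is an illustration rather than the actual code: `Action`, `Executor`, `Model`, and `run_flow` are hypothetical names, the executor stands in for a thin wrapper around Playwright, and the model is any callable that maps a screenshot, the goal, and the state string to the next action.

```python
from dataclasses import dataclass
from typing import Callable, List, Literal, Optional, Protocol


@dataclass(frozen=True)
class Action:
    """One step in the click/type/wait action space (hypothetical type)."""
    kind: Literal["click", "type", "wait"]
    target: str = ""  # selector or element description chosen by the model
    text: str = ""    # only meaningful for "type"


class Executor(Protocol):
    """Assumed interface of the Playwright-backed executor."""
    def screenshot(self) -> bytes: ...
    def perform(self, action: Action) -> str: ...  # returns a state summary


# Assumed model interface: (screenshot, goal, state string) -> next action,
# or None when the model considers the flow finished.
Model = Callable[[bytes, str, str], Optional[Action]]


def run_flow(model: Model, executor: Executor, goal: str,
             max_steps: int = 20) -> List[Action]:
    state = ""  # free-form string carried from step N to step N+1
    history: List[Action] = []
    for _ in range(max_steps):
        shot = executor.screenshot()       # fresh view of the page
        action = model(shot, goal, state)  # state was captured one step ago
        if action is None:
            break
        # The state string is produced right after the action runs. If the
        # page keeps re-rendering afterwards, the next screenshot and this
        # string can disagree, which is the staleness described above.
        state = executor.perform(action)
        history.append(action)
    return history
```

In this sketch the only thing linking step N to step N+1 is the `state` string, and nothing checks it against the page before the model acts on it.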

What architecture changes would make multi-step stateful flows reliable (>95% success)? In particular, is the answer better memory, DOM-aware selectors, step re-grounding, or a planner/executor split?

The agent must remain vision-driven so that it generalizes across unfamiliar SaaS tools.
