Hybrid Self-evolving Structured Memory for GUI Agents

Sibo Zhu^*, Wenyi Wu^*, Kun Zhou^†, Stephen Wang, Biwei Huang

^*Equal contribution ^†Corresponding author

Framework Overview

Hybrid memory graph with discrete strategy nodes and continuous trajectory embeddings.

Abstract

Progress in vision-language models (VLMs) has enabled GUI agents to interact with computers in a human-like manner. Yet real-world computer-use tasks remain difficult because of long-horizon workflows, diverse interfaces, and frequent intermediate errors. HyMEM is a graph-based hybrid memory that couples discrete, high-level symbolic strategies with continuous trajectory embeddings. It supports multi-hop retrieval, self-evolution through structured updates, and on-the-fly refresh of working memory during inference.
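
To make the idea of a hybrid memory graph more concrete, the following is a minimal sketch, not the authors' implementation. It assumes each node stores a textual strategy (the discrete part) alongside a precomputed trajectory embedding (the continuous part), that retrieval ranks nodes by cosine similarity and then expands along graph edges for a fixed number of hops, and that a structured update amounts to adding a linked node. The names `StrategyNode`, `HybridMemory`, `add`, and `retrieve`, as well as the toy two-dimensional embeddings, are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class StrategyNode:
    """Hypothetical memory node: discrete strategy text plus a continuous embedding."""
    name: str
    strategy: str
    embedding: list  # assumed to be a precomputed trajectory embedding
    neighbors: set = field(default_factory=set)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class HybridMemory:
    """Toy graph memory; the real retrieval and update rules are not given on this page."""

    def __init__(self):
        self.nodes = {}

    def add(self, node, link_to=()):
        """Assumed form of a structured update: insert a node and link it to existing ones."""
        self.nodes[node.name] = node
        for other in link_to:
            if other in self.nodes:
                node.neighbors.add(other)
                self.nodes[other].neighbors.add(node.name)

    def retrieve(self, query, top_k=1, hops=1):
        """Pick the top_k nodes by embedding similarity, then expand `hops` steps along edges."""
        ranked = sorted(self.nodes.values(),
                        key=lambda n: cosine(query, n.embedding),
                        reverse=True)
        selected = [n.name for n in ranked[:top_k]]
        frontier = list(selected)
        for _ in range(hops):
            next_frontier = []
            for name in frontier:
                for neighbor in sorted(self.nodes[name].neighbors):
                    if neighbor not in selected:
                        selected.append(neighbor)
                        next_frontier.append(neighbor)
            frontier = next_frontier
        return [self.nodes[name] for name in selected]


# Usage with made-up strategies and embeddings.
memory = HybridMemory()
memory.add(StrategyNode("search", "Start with a keyword search.", [1.0, 0.0]))
memory.add(StrategyNode("filter", "Apply constraint filters to the results.", [0.6, 0.8]),
           link_to=["search"])
memory.add(StrategyNode("compare", "Compare the remaining candidates.", [0.0, 1.0]),
           link_to=["filter"])

one_hop = [n.name for n in memory.retrieve([1.0, 0.1], top_k=1, hops=1)]
two_hop = [n.name for n in memory.retrieve([1.0, 0.1], top_k=1, hops=2)]
print(one_hop)
print(two_hop)
```

In this toy setup the query is closest to the `search` node, and each extra hop pulls in the next linked strategy. The sketch shows how multi-hop retrieval can surface strategies that embedding similarity alone would rank low.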

Case Studies

Amazon — HyMEM Success

HyMEM retrieves diverse strategy memories instead of repeating the same actions.
The agent adapts from an initial search to constraint-focused filtering.
The task is completed within the step budget, with stable long-horizon planning.

Amazon — Baseline Failure

The single-agent baseline over-relies on generic search and misses key filters.
It enters repetitive browsing loops without satisfying the task constraints.
Execution terminates at the maximum-step limit before a valid result is found.

Performance Comparison Leaderboard

Task success rates (%) across WebVoyager, Mind2Web, and MMInA.

Backbone	Model / Method	WebVoyager				Mind2Web				MMInA	Overall
		Amz	Cour	Recp	Map	Info	Svc	Ent	Trav	Wiki	
Closed-Source
	GPT-4o	24.4	7.1	13.3	36.6	7.8	14.1	3.4	19.4	51.0	19.7
	Gemini-Pro-Vision	41.7	42.9	20.0	53.7	16.7	22.4	0.0	19.4	50.0	29.6
	Claude-4	63.4	28.6	33.3	70.0	25.6	40.0	6.9	9.7	52.0	36.6
Open-Source
	Qwen2.5-VL-32B	46.3	26.2	6.7	29.3	14.1	20.0	6.9	9.7	43.0	22.5
	CogAgent	12.2	9.5	26.7	9.8	24.4	8.2	13.8	16.1	21.0	15.7
	Websight	24.4	4.8	13.3	29.3	10.3	3.5	3.4	0.0	12.0	11.2
UI-TARS-1.5-7B	Baseline	31.7	16.7	20.0	31.7	6.4	4.7	6.9	0.0	36.0	17.1
UI-TARS-1.5-7B	+ Reasoning Bank	0.0	11.9	6.7	12.2	7.7	8.2	3.7	6.7	44.0	11.2
UI-TARS-1.5-7B	+ AWM	4.9	7.1	4.4	14.6	10.3	3.5	3.7	3.3	26.0	8.6
UI-TARS-1.5-7B	+ Discrete	19.5	16.7	8.9	17.1	11.5	8.2	7.4	3.4	46.0	15.4
UI-TARS-1.5-7B	+ Continuous	43.9	33.3	24.4	43.9	16.7	28.2	10.3	3.2	54.0	28.7
UI-TARS-1.5-7B	+ HyMEM	58.5	28.6	26.7	53.7	16.7	25.9	10.3	6.5	54.0	31.2
Qwen2.5-VL-7B	Baseline	14.6	2.4	15.9	16.7	9.0	11.8	0.0	4.4	38.0	12.5
Qwen2.5-VL-7B	+ Reasoning Bank	29.3	9.5	6.7	29.3	9.0	20.0	3.4	6.5	44.0	17.5
Qwen2.5-VL-7B	+ AWM	17.1	4.8	11.1	29.3	7.7	10.6	0.0	6.5	31.0	13.1
Qwen2.5-VL-7B	+ Discrete	22.0	21.4	8.9	31.7	9.0	12.9	10.3	3.2	34.0	17.0
Qwen2.5-VL-7B	+ Continuous	24.4	17.1	8.9	34.1	16.7	23.5	10.3	12.9	47.0	21.7
Qwen2.5-VL-7B	+ HyMEM	63.4	54.8	20.0	53.7	17.9	23.5	3.4	22.6	56.0	35.0
Qwen3-VL-8B	Baseline	36.6	16.7	13.3	43.9	9.0	20.0	10.3	9.7	38.0	21.9
Qwen3-VL-8B	+ Reasoning Bank	36.6	11.9	17.8	31.7	10.3	24.7	17.2	6.5	48.0	22.7
Qwen3-VL-8B	+ AWM	39.0	19.0	8.9	31.7	12.5	16.5	6.9	16.1	48.0	22.1
Qwen3-VL-8B	+ Discrete	31.7	16.7	15.6	31.7	11.5	16.5	13.8	16.1	38.0	21.3
Qwen3-VL-8B	+ Continuous	43.9	19.0	17.8	34.1	12.5	17.9	3.4	19.4	42.0	23.3
Qwen3-VL-8B	+ HyMEM	46.3	19.0	26.7	39.0	17.9	20.0	6.9	25.8	47.0	27.6
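
As a reading aid, the absolute gain of HyMEM over each backbone's baseline can be computed directly from the Overall column. The snippet below is a small sketch that copies those six values from the table. The variable and function names are illustrative, and it does not attempt to reproduce how the Overall column itself is aggregated, which this page does not specify.

```python
# Overall success rates (%) copied from the leaderboard table above.
overall = {
    "UI-TARS-1.5-7B": {"Baseline": 17.1, "+ HyMEM": 31.2},
    "Qwen2.5-VL-7B": {"Baseline": 12.5, "+ HyMEM": 35.0},
    "Qwen3-VL-8B": {"Baseline": 21.9, "+ HyMEM": 27.6},
}


def absolute_gain(scores):
    """Percentage-point difference between HyMEM and the baseline, rounded to one decimal."""
    return round(scores["+ HyMEM"] - scores["Baseline"], 1)


gains = {backbone: absolute_gain(scores) for backbone, scores in overall.items()}

for backbone, gain in gains.items():
    print(f"{backbone}: +{gain:.1f} points")
# UI-TARS-1.5-7B: +14.1 points
# Qwen2.5-VL-7B: +22.5 points
# Qwen3-VL-8B: +5.7 points
```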

BibTeX

@misc{zhu2026hymem,
  title={Hybrid Self-evolving Structured Memory for GUI Agents},
  author={Sibo Zhu and Wenyi Wu and Kun Zhou and Stephen Wang and Biwei Huang},
  year={2026},
  note={Manuscript}
}

More Works from Our Lab

Towards General Continuous Memory for Vision-Language Models

Auto-Scaling Continuous Memory For GUI Agent