written by Wojciech Sieńko and Rostyslav Kovalenko

What are AI agents?

AI agents are intelligent systems designed to complete tasks autonomously with minimal human input. They can observe their environment and make decisions based on instructions and data. These agents can help us achieve specific goals, such as scheduling meetings or answering emails.

Why should you use AI agents?

Imagine a personal assistant you can direct with your voice to handle tasks such as finding the best places to travel at the best prices. First, the agent searches and compares options, finding the most affordable and attractive destinations for you. When you choose a place to visit, the system can use a graphical user interface (GUI) to book the ticket and make reservations.

Systems like these can significantly benefit the elderly and people with disabilities who encounter challenges navigating a digital environment. Many platforms lack accessibility features, making tasks such as booking tickets or managing reservations difficult; for these groups, GUI agents can deliver meaningful improvements.

What are GUI agents?

GUI agents are a specific type of AI agent that interacts with digital platforms, such as desktops or mobile phones, through their graphical user interface. They identify and observe interactable visual elements displayed on the device's screen and engage with them by clicking, typing, or tapping, mimicking the interaction of a human user.

Training and evaluation of GUI agents

GUI agent training and benchmarking datasets can be divided into two categories, static datasets and interactive environments, each of which can be either closed-world or open-world.

Static datasets vs Interactive environments

Static datasets typically store a predefined set of tasks with ground-truth solutions. Nowadays, these datasets usually contain screenshots along with text instructions and GUI observations. Although such datasets are easier to produce, they are limited because agents can only be evaluated against the solutions provided in the dataset.
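To make this concrete, here is a minimal sketch of what one record in a static GUI-agent dataset might look like, together with the kind of exact-match evaluation such datasets impose. All field names and the evaluation rule are illustrative assumptions, not a specific benchmark's schema.

```python
from dataclasses import dataclass, field

# Hypothetical record in a static GUI-agent dataset: a text instruction,
# a pre-captured screenshot, and the fixed ground-truth action sequence.
@dataclass
class StaticTask:
    instruction: str                # natural-language goal
    screenshot_path: str            # GUI observation captured ahead of time
    ground_truth: list[str] = field(default_factory=list)  # fixed solution

task = StaticTask(
    instruction="Open the settings menu and enable dark mode",
    screenshot_path="screens/home.png",
    ground_truth=["click(settings_icon)", "click(dark_mode_toggle)"],
)

def exact_match(predicted: list[str], task: StaticTask) -> bool:
    """Static evaluation: the agent passes only if its action sequence
    matches the predefined solution exactly -- there is no live feedback,
    so valid alternative solutions are scored as failures."""
    return predicted == task.ground_truth
```

The rigidity of `exact_match` is precisely the limitation mentioned above: an agent that reaches the goal by a different but valid route still fails the benchmark.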

In contrast, interactive environments allow agents to interact with a GUI in real time and adapt their actions based on the feedback they receive.

Closed-world environments vs Open-world environments

A closed-world environment is isolated and created specifically for evaluation and training. It provides a great level of control over the process but lacks the variability of scenarios agents encounter in practice.

Open-world environments address closed-world limitations by executing the whole process in real-world applications. However, deploying GUI agents in an open-world setting raises several risks, including security concerns, unreproducible scenarios, and reduced control over the process.

The construction of GUI agents

As we learned earlier, the purpose of a GUI agent is to complete user-defined tasks by interacting with the graphical interface. While various approaches to building such agents exist, many share a common structure. This structure can be divided into the following four broad components: Perception, Planning and Decision-making, Action Execution, and Memory Access. Below, we will discuss each component and look at how different systems implement them.

Perception

The perception component serves as the entry point for a GUI agent. Its main task is to gather relevant information about the graphical interface so it can be further processed to fulfill the user's request. This component can take many different forms depending on the underlying language model. Earlier GUI agents that relied on a single-modal LLM could only process textual input and, therefore, required external modules to convert the GUI into text representations.

Recent studies utilize multimodal LLMs capable of processing visual inputs by themselves. Thus, GUI screenshots are now frequently treated as visual inputs, with research focusing on fine-tuning MLLMs to enhance their comprehension.
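The split between text-only and multimodal perception can be sketched as follows. Everything here is an illustrative assumption: a real agent would capture an actual screenshot and, for a text-only LLM, run an external module (OCR, accessibility tree, or HTML dump) to produce the textual GUI description, which is stubbed below.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Observation:
    screenshot: Optional[bytes]   # raw pixels, consumed directly by an MLLM
    text_repr: Optional[str]      # textual GUI description for a text-only LLM

def perceive(raw_screenshot: bytes, model_is_multimodal: bool) -> Observation:
    if model_is_multimodal:
        # Multimodal LLMs can take the screenshot as-is.
        return Observation(screenshot=raw_screenshot, text_repr=None)
    # Text-only LLMs need the GUI converted to text first (stubbed here;
    # in practice this comes from OCR or the platform's accessibility API).
    elements = "button 'Submit' at (120, 340); textbox 'Email' at (120, 200)"
    return Observation(screenshot=None, text_repr=elements)
```

The design choice is simply where the GUI-to-representation conversion happens: inside the model (multimodal) or in an external perception module (text-only).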

Planning and Decision-making

After receiving all the information, a GUI agent needs to effectively break down complex tasks into feasible small steps that can be executed on the GUI. This can be achieved by leveraging reasoning approaches such as Chain of Thought (CoT). CoT involves generating intermediate explanations before arriving at the final answer.

Furthermore, a graphical interface is an ever-changing environment, adapting with each action executed by an agent. To handle this, some GUI agents can adjust their plans dynamically, typically through a ReAct style. ReAct (Reasoning + Acting) is a method where reasoning and action execution occur in cycles, with each action influencing subsequent reasoning steps.
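The ReAct cycle described above can be sketched as a minimal control loop. The model call is stubbed with a scripted policy; in a real agent it would be an (M)LLM that emits a "thought" and an "action" given the latest observation, and the environment transition would come from the actual GUI.

```python
def scripted_policy(observation: str) -> tuple[str, str]:
    """Stand-in for the LLM: returns (thought, action)."""
    if "login page" in observation:
        return ("I need to authenticate first", "click(login_button)")
    return ("The task looks done", "finish")

def react_loop(initial_observation: str, max_steps: int = 5) -> list[str]:
    trace, observation = [], initial_observation
    for _ in range(max_steps):
        thought, action = scripted_policy(observation)   # Reason
        trace.append(f"Thought: {thought}")
        trace.append(f"Action: {action}")                # Act
        if action == "finish":
            break
        # Acting changes the GUI; the new observation feeds the next
        # reasoning step (stubbed environment transition).
        observation = "dashboard"
    return trace
```

The key property is the interleaving: each action produces a new observation, which shapes the next round of reasoning, rather than planning the whole sequence up front.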

Action Execution

Once planning is complete, the agent decides on the specific operation to execute. These operations can range from simple actions, such as clicking, inputting text, or scrolling, to more complex cases.

Complex operations extend an agent's capabilities, enabling more flexible and powerful behavior. These may include code execution, allowing an agent to go beyond a predefined set of actions, or API integrations, providing an agent with capabilities to access external tools and information resources.
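One way to structure this is a dispatch table mapping action names to handlers, with simple GUI primitives alongside a "complex" action that reaches out to an external API. The handlers below are stubs and the action names are assumptions; a real agent would drive an automation backend (e.g. an OS accessibility API) here.

```python
def click(target: str) -> str:
    return f"clicked {target}"

def type_text(target: str, text: str) -> str:
    return f"typed '{text}' into {target}"

def call_api(endpoint: str) -> str:
    # Complex operation: reach beyond the GUI to an external tool.
    return f"fetched data from {endpoint}"

# Dispatch table: the planner outputs an action name, execution looks it up.
ACTIONS = {"click": click, "type": type_text, "api": call_api}

def execute(name: str, *args: str) -> str:
    if name not in ACTIONS:
        raise ValueError(f"unknown action: {name}")
    return ACTIONS[name](*args)
```

Adding code execution or a new API integration is then just another entry in the table, which is what makes complex operations an extension rather than a redesign.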

After an operation is completed, the process may repeat itself by starting again from Perception to address changes and execute the next action accordingly.

Memory Access

This component serves as a source of additional information to aid efficient task execution. Usually, it is divided into internal and external memory. Internal memory holds information about previous actions, GUI screenshots, and other data generated during execution, while external memory includes prior knowledge and rules about the UI and specific tasks.
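The internal/external split might be modeled like this. The structure and field names are illustrative assumptions: external memory is loaded once with prior knowledge, while internal memory accumulates during execution.

```python
class AgentMemory:
    def __init__(self, external_rules: dict[str, str]):
        # External memory: prior knowledge and rules about the UI, fixed.
        self.external = external_rules
        # Internal memory: actions and screenshots generated during the run.
        self.internal: list[dict] = []

    def record(self, action: str, screenshot_id: str) -> None:
        self.internal.append({"action": action, "screenshot": screenshot_id})

    def recall_rule(self, key: str) -> str:
        return self.external.get(key, "no rule found")

memory = AgentMemory({"checkout": "always confirm the price before paying"})
memory.record("click(cart_icon)", "shot_001")
```

During planning, the agent can consult both stores: internal memory to avoid repeating actions, external memory to apply task-specific rules.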

The Classification of GUI Agents

GUI agents can be divided into two main categories based on their creation methods: Prompt-based and Training-based.

Prompt-based methods

Prompt-based methods do not require additional training as they rely on (M)LLM's preexisting skills to interpret and execute instructions. These methods are mostly focused on giving the language model detailed and clear instructions while incorporating previously mentioned techniques like CoT and ReAct.

Prompt-based methods also incorporate a memory system, enabling the agent to remember context, previous interactions, and task-specific knowledge. As a result, the agents can optimize their operations, learn from past experiences, and reduce mistakes during task execution.
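A prompt-based agent's behavior therefore lives largely in how its prompt is assembled: clear instructions, CoT-style guidance, and retrieved memory packed into the context window. The template wording below is an assumption for illustration, not a prompt from any specific system.

```python
def build_prompt(task: str, past_steps: list[str], gui_description: str) -> str:
    """Assemble a single prompt from instructions, memory, and observation."""
    history = "\n".join(f"- {s}" for s in past_steps) or "- (none yet)"
    return (
        "You are a GUI agent. Think step by step before acting.\n"  # CoT cue
        f"Task: {task}\n"
        f"Previous actions:\n{history}\n"                           # memory
        f"Current screen: {gui_description}\n"                      # perception
        "Reason about the next step, then output exactly one action."
    )

prompt = build_prompt(
    task="Book the cheapest flight to Lisbon",
    past_steps=["click(search_box)", "type('Lisbon')"],
    gui_description="results page listing flights sorted by price",
)
```

Because no weights change, improving such an agent means improving this assembly: better instructions, better retrieval of past interactions, better formatting of the GUI observation.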

Training-based methods

In contrast, training-based methods adapt the model itself by fine-tuning LLMs or multimodal models such as LLaVA (Large Language and Vision Assistant). These agents can use multimodal datasets, which include screenshots, code, and APIs, to enhance reasoning, planning, and execution capabilities.

Pre-training

Pre-training involves training models on large datasets to create a solid base before fine-tuning them on smaller, domain-specific datasets. For GUI tasks, this helps the models better understand visual and textual elements and solve their tasks more effectively.

Fine-tuning

Fine-tuning helps ground MLLMs and LLMs in GUI elements and improves their ability to execute instructions reliably. It reduces model hallucinations and can also help decrease model complexity and size.

Reinforcement Learning

Reinforcement learning optimizes models through a reward system that encourages correct actions and penalizes errors. This approach is commonly used in interactive environments, where the model interacts with GUI and gets feedback for its actions.

Further Research

This article has introduced GUI agents, including their construction components, classification, and training methods. However, this covers only a small portion of an extensive field, as each implementation has its own characteristics and nuances. We encourage you to delve deeper, conduct your own research, and continue learning about this fascinating topic.
