World Model

What is the World Model?

The World Model takes a user objective and the current state of a webpage and uses this information to generate:

the next instruction required to eventually achieve the user objective
the name of the Action Engine sub-engine component which should be used to execute the instruction

In our agentic workflow, the Agent will get the next instruction and engine to be used by calling the WorldModel.get_instruction method before passing these two strings to the Action Engine, which will handle executing the instruction.

Getting started with the World Model

Let's now create an instance of World Model that we can work with directly to understand better how it works.

from lavague.core import  WorldModel

world_model = WorldModel()

The World Model receives multiple inputs to make informed decisions:

objective: The user's objective
current_state: Internal and external state that includes some internal memory from agent outputs and user inputs
past: previous generated instructions
obs: The current HTML, screenshot, URL of the webpage + the current date/time

Before we can start using our World Model, we first need to generate these observations.

from lavague.drivers.selenium import SeleniumDriver
from lavague.core.memory import ShortTermMemory

selenium_driver = SeleniumDriver(headless=True, url="https://huggingface.co/")
st_memory = ShortTermMemory()

# Generate observations & set objective
objective = "Go on the quicktour of PEFT"
current_state, past = st_memory.get_state()
obs = selenium_driver.get_obs()

To understand what inputs the World Model receives, we can visualize the screenshots captured during the observation process.

# Visualize the screenshot
from lavague.core.logger import load_images_from_folder
from IPython.display import display 

images = load_images_from_folder(obs["screenshots_path"])
display(images[0])

The World Model receives multiple inputs to make informed decisions:

Objective
Internal and external state that comprises the current HTML, screenshot, some internal memory in agent_outputs and user provided data with user_inputs
Past with previous instructions

print("Objective: ", objective)
print("Current State: ", current_state)
print("Past: ", past)
print("Observations: ", obs)

world model inputs

We can now pass this information to the World Model to get the next instruction to be completed.

world_model_output = world_model.get_instruction(
    objective, current_state, past, obs
)

print(world_model_output)

world model outputs

The generated instruction provides insights into the World Model's thought process, including the recommended engine and instruction for the agent.

from lavague.core.utilities.format_utils import (
    extract_next_engine,
    extract_world_model_instruction,
)

next_engine_name = extract_next_engine(world_model_output)
instruction = extract_world_model_instruction(world_model_output)

print(next_engine_name)
print(instruction)

world model instruction

How does it work?

The World Model generates the next instruction by querying the multi-modal LLM (by default, we use OpenAI's gpt-4o) with our prompt template.

You can print out our default World Model template with the following code or view the full template here:

from lavague.core.world_model import WORLD_MODEL_PROMPT_TEMPLATE

print(WORLD_MODEL_PROMPT_TEMPLATE.template)

world model template

This shows us the overall guidelines. Let's now see now how examples are provided to provide world knowledge:

from lavague.core.world_model import WORLD_MODEL_GENERAL_EXAMPLES

print(WORLD_MODEL_GENERAL_EXAMPLES)

world model examples

If we wanted to add knowledge to our World Model prompt template, we can do so with the add_knowledge method, which allows us to add knowledge from a file:

extra_knowledge = """
Objective: Find the latest papers on Fine tuning
Previous instructions:
- SCAN
- Click on 'Previous'
Last engine: [NONE]
Current state:
external_observations:
  vision: '[SCREENSHOT]'
internal_state:
    agent_outputs: []
    user_inputs: []

Thoughts:
- The current screenshot shows the top of a page showing papers papers published on the 22nd May 2024 on Hugging Face.
- The objective is to find the latest papers on Fine tuning.
- As we need to find the latest papers, the best next step is to gather more information to see if this page contains the information we need.
- The best next step is to use the Navigation Controls to take a screenshot of the whole page to find the latest papers on Fine tuning.
Next engine: Navigation Controls
Instruction: SCAN
"""

# save extra knowledge to a file
file_path = "knowledge.txt"

with open(file_path, "w") as file:
    file.write(extra_knowledge)

# Import extra knowledge from file
world_model.add_knowledge(file_path=file_path)

Improving performance

By adding knowledge, we can improve the World Model's performance where it is failing to generate correct next instructions.

If you want to replace the default examples passed to the World Model entirely, you can do so by initializing your WorldModel and passing your examples as a string as an examples optional argument:

new_examples = "[YOUR_EXAMPLES_HERE]"
world_model = WorldModel(examples=new_examples)

Customizing your World Model

World Model from Context

You can customize the mm_llm multi-modal LLM used by the World Model by instantiating it from a Context object with the WorldModel.from_context method.

Details on how to do this are provided in our customization guide.

Optional arguments

Alternatively, you can pass your World Model class any of the four key customizable arguments:

mm_llm: A Multi-modal LLM. This can be any llama_index.multi_modal_llms object. By default, we use OpenAI's gpt-4o.
prompt_template: The prompt template used to query the multi-model LLM to generate the next instruction
examples: A string containing examples of how we expect the WorldModel to reason and produce a good next instruction
logger: An AgentLogger which records information about the World Model

Once initialized, the World Model can be used as seen previously.