Navigation Engine
The Navigation Engine is the component dedicated to generating and executing Selenium code to perform an action on a web page.
Let's get to know the Navigation Engine better by removing the usual layers of abstraction and working with one directly, just as an agent would!
Initializing a Navigation Engine
First of all, we'll need to create an instance of the Navigation Engine.
from lavague.drivers.selenium import SeleniumDriver
from lavague.core.navigation import NavigationEngine
selenium_driver = SeleniumDriver(headless=True, url="https://huggingface.co/docs")
nav_engine = NavigationEngine(selenium_driver)
Optional arguments
| Argument | Type | Description | Default Value |
|---|---|---|---|
| `driver` | `BaseDriver` | The web driver used to interact with the headless browser (must be provided). | None |
| `llm` | `BaseLLM` | An LLM object passed directly to the NavigationEngine to override the default/context LLM. Any `llama_index.llms` LLM is supported. | None (defaults to the LLM from the context) |
| `retriever` | `BaseHtmlRetriever` | The retriever used to retrieve the web elements to perform our action on. | None (defaults to the retriever based on the driver and embedding) |
| `prompt_template` | `PromptTemplate` | The prompt template used to query the LLM to generate an action. | `NAVIGATION_ENGINE_PROMPT_TEMPLATE.prompt_template` |
| `extractor` | `BaseExtractor` | Specifies how to extract the final code from the LLM's response. | `DynamicExtractor()` |
| `time_between_actions` | `float` | Time (in seconds) between each action executed by the engine. | 1.5 seconds |
| `n_attempts` | `int` | The number of attempts the LLM should make to generate a valid action. Since LLMs are non-deterministic, retries can increase the success rate: a later attempt may succeed even if the first one fails. | 5 attempts |
| `logger` | `AgentLogger` | Logger used to log the actions taken by the agent. | None |
| `display` | `bool` | Indicates whether the agent is running in display mode. When running headless, this mode can be used to show visual screenshot updates of the agent's progress. | False |
| `raise_on_error` | `bool` | Whether to raise an exception if an error occurs during execution. | False |
| `embedding` | `BaseEmbedding` | An embedding model object passed directly to the NavigationEngine, used instead of the default/context embedding model. Any `llama_index.embeddings` model is supported. | None |
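To illustrate, here is a minimal sketch of overriding a few of these defaults. It assumes an OpenAI API key is configured and the llama_index OpenAI integrations are installed; the specific model names and values chosen are purely illustrative.
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from lavague.drivers.selenium import SeleniumDriver
from lavague.core.navigation import NavigationEngine

selenium_driver = SeleniumDriver(headless=True, url="https://huggingface.co/docs")

# Override the context defaults with explicit llama_index models and custom settings
nav_engine = NavigationEngine(
    selenium_driver,
    llm=OpenAI(model="gpt-4o"),
    embedding=OpenAIEmbedding(model="text-embedding-3-small"),
    time_between_actions=2.0,  # wait 2 seconds between executed actions
    n_attempts=3,              # allow up to 3 generation attempts
    raise_on_error=True,       # raise an exception instead of failing silently
)
Any other llama_index LLM or embedding model could be swapped in the same way.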
Retrieval
The first task handled by the Navigation Engine is to perform retrieval on the web page to collect the most relevant chunks, or nodes, of HTML code.
The Navigation Engine's embedding model is used at this stage (here, we use the default embedding model, OpenAI's text-embedding-3-small).
instruction = "Click on the PEFT section."
nodes = nav_engine.get_nodes(instruction)
We can print out these nodes with the following code:
from IPython.display import display, HTML, Code
for node in nodes:
    display(HTML(node))  # Display node as visual element
    display(Code(node, language="html"))  # Display code
    print("--------------")
Generating automation code
We can now provide these nodes as context for our LLM (here, we use the default LLM, gpt-4o) when generating the appropriate code for our instruction.
context = "\n".join(nodes)
action = nav_engine.get_action_from_context(context, instruction)
display(Code(action, language="python"))
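So far we have only generated the action code; in a full agent run, the Navigation Engine also executes the code it generates against the driver. The sketch below assumes a higher-level `execute_instruction` method that performs retrieval, generation, and execution in a single call; both the method and the attribute names on its result are assumptions here and may differ across LaVague versions.
# Assumed one-shot helper: retrieve nodes, generate code, and execute it
result = nav_engine.execute_instruction(instruction)

# Attribute names below are illustrative assumptions
print(result.success)  # whether the action executed without errors
print(result.code)     # the code that was run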
Navigation Engine LLM prompt
The LLM was queried with our default Navigation Engine prompt template, which you can view in full here.
We see that the prompt is made up of three parts:
- The driver capability, or driver prompt template
- The context string, or retrieved nodes
- The query itself - this will be the original instruction received by the Navigation Engine from the World Model
We can see the default Selenium driver prompt template with the following code (or view the full code here):
print(nav_engine.prompt_template)