- Read structured information
- Repeat a task on multiple objects
- Write structured output
- Locate an element precisely
General
Act
Completes a task specified in natural language, up to a maximum number of steps.
Returns: None
Application
Open
Open or switch to an application.
Returns: None
Keyboard control
Type
Type using the keyboard.
Returns: None
Shortcut
Perform a keyboard shortcut in the current application.
Returns: None
Mouse Control
Click
Click on something, specified by either the at argument or the element argument.
Disambiguate by specifying the spatial relation between the target element and an anchor concept.
Available modes:
- default is text-based grounding.
- “textAndScreenshot”: grounding using both text and vision.
- “vision”: vision-only grounding.
When using modes with vision, all other arguments are ignored except
withCommand, clickType, and waits.
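The mode rule above can be pictured with a small Python sketch. It is not the Click implementation: the function name, the mode parameter name, and the example argument values are placeholders chosen for illustration, and only the mode strings and the three retained argument names come from the description above.

```python
# Hypothetical stand-in illustrating which Click arguments matter per mode.
VISION_MODES = {"textAndScreenshot", "vision"}
KEPT_IN_VISION_MODES = {"withCommand", "clickType", "waits"}


def effective_click_args(args, mode=None):
    """Return the arguments that would still be considered for the given mode.

    mode=None stands for the default text-based grounding (an assumption
    about how the default is expressed).
    """
    if mode in VISION_MODES:
        # Vision-based modes keep only the three arguments named in the text.
        return {k: v for k, v in args.items() if k in KEPT_IN_VISION_MODES}
    return dict(args)


# Placeholder values, not documented ones.
example = {"at": "Submit", "clickType": "placeholder", "withCommand": True}
print(effective_click_args(example, mode="vision"))
```

With the default mode the dictionary is passed through unchanged, while either vision-based mode drops at and any other argument outside the retained set.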
Move
Moves the cursor to an object. Disambiguate by specifying the spatial relation between the target element and an anchor concept. All parameters besides to have the same definition as those in the Click action.
Returns: None
Drag
Drag on the screen, starting from where the mouse is located.
Returns:
Scroll
Scroll on the screen in a specified direction.
Returns: None
Perception
ConceptsExist
Checks if all the concepts can be found on the currently visible screen.
Returns: true if all concepts can be found, otherwise false.
pageContent
Gets a JSON object containing the structural text content and base64 encoded image of the current screen. This object can be sent to a vision-language model for answering questions about the current screen.
Returns: A JSON dictionary with the following fields:
- text: A text description of the current web page;
- imageFilePath: temporary location in memory of the screenshot (accessible by ask).
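As a rough illustration of the returned dictionary and how it is typically handed on, here is a Python sketch. Both functions are hypothetical stand-ins that perform no real screen capture or model call; the field values, including the file path, are placeholders, and only the two field names come from the list above.

```python
def page_content_stub():
    """Hypothetical stand-in returning a dictionary shaped like the fields above."""
    return {
        "text": "Example page with a heading and two links.",
        "imageFilePath": "/tmp/example-screenshot.png",  # placeholder path
    }


def ask_stub(prompt, context):
    """Hypothetical stand-in for ask: reports what a model call would receive."""
    return f"prompt={prompt!r}; context fields={sorted(context)}"


context = page_content_stub()
reply = ask_stub("How many links are on this page?", context)
print(reply)
```

The point of the pattern is that the whole dictionary travels as the context, so the model can draw on both the text description and the screenshot.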
Text Generation
ask
Runs a large vision-language model on the given input prompt string and a JSON dictionary context. Often used after pageContent().
Returns: String response from a large vision-language model.
Wait
Wait
Puts the agent into a sleep state for a specified amount of time.
Returns: None
WaitForConcepts
Waits until all concepts can be found in the current frontmost window. If not all concepts can be found within 10 seconds, the action returns failure.
Returns: None
User interaction
Respond
Respond to the user with a message and optionally ask for user confirmation to proceed.
Returns: None
System IO
CopyToClipboard
Copies a String to the clipboard.
Returns: None
GetFromClipboard
Get the content of the current clipboard.
Returns: Content of the current clipboard
SaveScreenshot
Takes a screenshot of an element on the screen or the whole screen, and saves the screenshot as a PNG to a file.
Returns: None
ScreenshotToClipboard
Take a screenshot of an element or the current page and save it to the system clipboard.
Returns: None
ReadFile
Read the contents of a file whose location is specified by path.
Returns: Contents of the file as a String
WriteToFile
Writes the given text to a file. If the file already exists, the text is appended to it, with an option to overwrite the existing content instead. Unless a path is specified, writes to /Library/Caches/com.simular.Simular-Pro/SimularActionResult/. Throws an error if there is an existing non-folder file named SimularActionResult.
Returns: None
Google Sheet control
GetGoogleSheetCellValue
Gets the value of a cell in a Google Sheet.
Returns: Value of the cell
SetGoogleSheetCellValue
Sets the value of a Google Sheet cell.
Returns: None
GetGoogleSheetColumns
Gets the column ids of each header in a given array of column headers in a Google Sheet. For example, if the sheet has column headers “website”, “description”, “date” in cells A1, B1, C1, respectively, then GetGoogleSheetColumns(headers: ["website", "description", "date"]) returns [“A”, “B”, “C”].
Note: This function currently assumes that the table headers are on row 1.
Returns: Array of column ids, each a capital letter from A to Z
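The header-to-column mapping can be sketched in Python as follows. This is a local illustration, not the GetGoogleSheetColumns implementation: the function name and the idea of passing row 1 as a list are assumptions, and raising KeyError for a missing header is a choice made here, since the text does not say what happens in that case.

```python
from string import ascii_uppercase


def sheet_columns_for(header_row, headers):
    """Map each requested header to its column letter (A to Z).

    header_row is the content of row 1, left to right (hypothetical input).
    """
    letter_by_header = {}
    for position, name in enumerate(header_row[:26]):
        # Keep the first column if a header appears more than once.
        letter_by_header.setdefault(name, ascii_uppercase[position])
    # A header that is not in row 1 raises KeyError in this sketch.
    return [letter_by_header[name] for name in headers]


row_one = ["website", "description", "date"]
print(sheet_columns_for(row_one, ["website", "description", "date"]))
```

The result follows the order of the requested headers, not the order of the columns in the sheet, so asking for "date" first would put its letter first.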
Advanced GUI functions
GetElements
Get elements that satisfy some conditions inside the current frontmost application or inside a root element (if given). For disambiguation, one can constrain the search to elements that satisfy certain spatial relations to anchor elements. This function supports multiple return types according to returnType.
Returns: Depending on returnType: [UIElement], String, [String], [String: UIElement]
GetAttributeOfElement
Searches for an element that matches the input criteria and gets the element’s value for a specified attribute.
Returns: String value of an attribute of an element
GetContent
Get text content from the current frontmost window or a region corresponding to the provided concept or element.
Returns: If the inElement argument is given or the frontmost window was used (because neither inConcept nor inElement was given), then returns a single String. Otherwise, returns a [String] array with one String per root element.
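The return-type rule for GetContent is easy to misread, so here is a small Python sketch of it. The function and its texts parameter are hypothetical; texts stands for the text already extracted, one entry per root element, and only the inConcept and inElement argument names come from the description above.

```python
def shape_get_content_result(texts, in_concept=None, in_element=None):
    """Pick between a single string and a list, following the rule above."""
    used_frontmost_window = in_concept is None and in_element is None
    if in_element is not None or used_frontmost_window:
        # One root (the given element or the frontmost window): single string.
        return texts[0]
    # Only inConcept was given: one string per matching root element.
    return list(texts)


print(shape_get_content_result(["window text"]))
print(shape_get_content_result(["row 1", "row 2"], in_concept="table row"))
```

Callers that pass only inConcept should therefore be prepared to iterate over the result, even when a single region matches.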
GetCells
Get all cells from a row or column element. Either row or column must be given.
Returns: An array of cell elements contained in the given row or column. If input is a row, the output array is sorted by increasing x-coordinate (left to right). If input is a column, the output array is sorted by increasing y-coordinate (top to bottom).
GetCellValue
Get the value of a given cell element.
Returns: Value contained in the cell.
GetCellLabel
Get the label of the given cell element in Excel.
Returns: Cell’s label String. Example: “A1”
GetCellIndices
Given an array of table cell values, return a corresponding array of cell indices. For example, suppose the table has value1 in cell A10, then GetCellIndices(cellValues: ["value1"]) returns [“A10”].
Returns: [String] array of cell indices
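The value-to-index lookup can be sketched in Python as below. This is an illustration under assumptions, not the GetCellIndices implementation: the table is modelled as a plain dictionary from cell index to value, the function name is hypothetical, and returning the first matching cell for a repeated value is a choice made here because the text does not cover duplicates.

```python
def cell_indices_for(table, cell_values):
    """Return the cell index of each requested value, in request order.

    table maps a cell index such as "A10" to the value in that cell
    (hypothetical input shape).
    """
    index_by_value = {}
    for index, value in table.items():
        # Keep the first cell seen for a value that occurs more than once.
        index_by_value.setdefault(value, index)
    # A value that is not in the table raises KeyError in this sketch.
    return [index_by_value[value] for value in cell_values]


table = {"A10": "value1", "B2": "value2"}
print(cell_indices_for(table, ["value1"]))
```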
GetTableColumn
Given a header or an index String, return the column under it as an [index: Element] dictionary. For example, if the table has a column with header “Website” in cell A1, and elements elem1 and elem2 under it, then this function returns [“A2”: elem1, “A3”: elem2].
Returns: Dictionary of [String: UIElement] pairs for all information in the column under header
GetStructuredDescription
Gets an XML-formatted description of the contents in each element.
Returns: An array of Strings [s_1, ..., s_n], where each s_i is an XML-formatted description of the contents rooted at u_i.

