Model HQ

Hybrid Inferencing using Model HQ (AI PC + API Server)

Seamlessly combine local and server-based inference modes for maximum flexibility

Goal

Seamlessly combine local and server-based inference modes—chat, agents, and semantic search—all from one interface.

This walkthrough is for developers and technical practitioners who want to toggle between local AI PC inference and API server-based inference, including how to access remote vector libraries, run agents remotely, and build enterprise-wide RAG pipelines.

Video Tutorial Available

This walkthrough is also demonstrated step-by-step on our YouTube video:

"Unlock Hybrid AI: AI PC + API Server"

Requirements

Model HQ (on AI PC)

Run local inference and manage flows

Model HQ API Server

Run inference remotely; host vector DB

Vector DB

Store document embeddings (included in Model HQ)

Sample PDFs/Text Docs

Created from source material included with Model HQ

Step-by-Step Process

1. Connect Model HQ Desktop to the API Server

1.1 Launch the Model HQ App on your AI PC

  1. Ensure your AI PC is network-accessible to the API server
  2. Select the Configure button on the top right of the app
  3. Go to App in the Model HQ Configuration Center
  4. Toggle Connected Enterprise Servers to ON
  5. Click > at the bottom of the screen
Turn ON Enterprise

Note

If you do not have an API connection pre-established, you will be directed to the 'Add New API Connection' screen, where you can enter the API Name, IP Address, Port and Secret Key information to establish a connection.
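
Those four fields map to an ordinary HTTP target. As a quick sanity check before adding the connection in the app, you can confirm the server is reachable from the AI PC. The sketch below is a minimal illustration only; the /ping path and x-api-key header are hypothetical placeholders, not the Model HQ API.

```python
import requests

# Connection details as entered on the 'Add New API Connection' screen.
# The /ping path and x-api-key header are illustrative placeholders only.
API_NAME = "model-hq-server"
IP_ADDRESS = "192.168.1.50"   # example value; use your server's address
PORT = 8088                   # example value; use your server's port
SECRET_KEY = "your-secret-key"

base_url = f"http://{IP_ADDRESS}:{PORT}"

# Verify the AI PC can reach the server before adding the connection in the app.
resp = requests.get(f"{base_url}/ping", headers={"x-api-key": SECRET_KEY}, timeout=5)
print(f"{API_NAME} reachable:", resp.ok)
```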

1.2 Confirm Server Discoverability

  • When connected, a Library button appears on the Main Menu bar
Library Button

This unlocks access to:

  • Chats and agents, run locally or through the API
  • Vector search libraries hosted on the API server
  • Remote model options (e.g., larger LLMs)
  • Server-side agents

2. Run Chat Inference Locally or via Server

2.1 Start a New Chatbot Session

  1. From the Main Menu, go to Bots
  2. You'll see local bots like Fast Start Chatbot or Model HQ Biz Bot
  3. You'll also see server-based bots like Model HQ API Server Biz Bot
  4. Select the Model HQ API Server Biz Bot and click >
  5. Choose a model (e.g., Phi-4-ov) running on the server
  6. Begin chatting with the model
phi4-ov

2.2 What Happens Behind the Scenes

Local Mode

Query processed by local model (e.g., 7B)

Server Mode

Request sent over API to Model HQ Server

  • Can be hosted in the cloud, in a datacenter, or on an office server
  • Example: a 14B-parameter model
  • Response is returned and shown in chat
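
Conceptually, server mode turns each chat turn into a remote call from the AI PC to the API server. The sketch below illustrates that round trip, assuming a JSON inference endpoint; the /inference path, payload fields, response key, and model name are hypothetical placeholders, not the documented Model HQ API.

```python
import requests

SERVER_URL = "http://192.168.1.50:8088"   # example address of your Model HQ API Server

def chat_via_server(prompt: str, model: str = "phi-4-ov") -> str:
    """Send one chat turn to the server and return the generated text.

    The endpoint path, payload shape, and response key are illustrative only.
    """
    payload = {"model": model, "prompt": prompt}
    resp = requests.post(f"{SERVER_URL}/inference", json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json().get("llm_response", "")

print(chat_via_server("Summarize the key risks in our supplier agreements."))
```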

3. Use Remote Knowledge with Local Inference (Hybrid RAG)

3.1 Start a Local Chat Session

  1. From the Main Menu, select Chat
  2. Choose Medium (7B) – Analysis and Typical RAG, then click >
Medium Bot

3.2 Connect to Remote Knowledge Base

  1. Once the chatbot is open, click Source
  2. Select a server-hosted library (e.g., UN_Resolutions_0527)

     If no pre-created source exists, follow Step 5 to build one

Source

3.3 Enter a RAG-style Question

Example:

"What are the resolutions related to Ukraine?"

3.4 What Happens

  • Vector search is executed on the API server
  • Retrieved documents are returned to your AI PC
  • Local model performs inference over those chunks

You see:

  • Full source references
  • Answer generated on-device
  • No tokens leave your machine
Output
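
The division of labor can be summarized in a short sketch: retrieval runs on the server, generation runs on the AI PC. Both helper functions below are hypothetical stand-ins used only to show the shape of the flow; they are not Model HQ functions.

```python
# Hybrid RAG: retrieval runs on the API server, generation runs on the AI PC.
# Both helpers are hypothetical stand-ins that show the shape of the flow.

def search_remote_library(library: str, query: str, top_k: int = 5) -> list[str]:
    """Semantic search against a server-hosted library; returns text chunks."""
    ...  # e.g., an HTTP call to the API server's vector search

def generate_locally(query: str, context: list[str]) -> str:
    """Run the local model over the retrieved chunks; tokens stay on-device."""
    ...  # e.g., a call into the local inference runtime

query = "What are the resolutions related to Ukraine?"
chunks = search_remote_library("UN_Resolutions_0527", query)   # 1) remote vector search
answer = generate_locally(query, chunks)                        # 2) on-device generation
```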

4. Run Agent Workflows on the API Server

4.1 Choose a Prebuilt or Custom Agent

  1. From the Main Menu, go to Agents
  2. Select Intake Processing, then click Run as API
Run as API

4.2 Provide Input

  1. On the input screen, click Choose File
  2. Select: C:\Users\[username]\llmware_data\sample_files\customer_transcripts\customer_transcript_1.txt
  3. Click >, confirm the input appears, then continue
Choose Files

4.3 What Happens

  • The agent process and input are sent over the API
  • The agent runs fully on the API Server
  • Results are returned to the AI PC and displayed
Detailed output
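
For teams that want to trigger the same behavior from code, the flow amounts to uploading the input file and naming the agent to run on the server. The sketch below is purely illustrative; the /agent/run endpoint, form fields, and agent identifier are hypothetical placeholders rather than the documented Model HQ API.

```python
import requests

SERVER_URL = "http://192.168.1.50:8088"   # example address of your Model HQ API Server
TRANSCRIPT = r"C:\Users\[username]\llmware_data\sample_files\customer_transcripts\customer_transcript_1.txt"

# Upload the input file and run the agent on the server.
# The /agent/run path, form fields, and agent name are illustrative placeholders.
with open(TRANSCRIPT, "rb") as f:
    resp = requests.post(
        f"{SERVER_URL}/agent/run",
        data={"agent": "intake_processing"},
        files={"input_file": f},
        timeout=300,
    )
resp.raise_for_status()
print(resp.json())   # results come back to the AI PC, just as in the app
```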

5. Build a Shared Semantic Library with Vector DB

5.1 Create a New Library

  1. Click Library from the Main Menu
  2. Select Build New
  3. Name your library (e.g., agreements)
New Library

5.2 Upload Source Files

  1. Click +Add Files
  2. Choose ~20 PDFs from: C:\Users\[username]\llmware_data\sample_files\UN-Resolutions-500
  3. Select Done
  4. Files are sent to the API server

5.3 Configure Embedding Settings

  1. Go to: Library Actions → Library → [your library name]
  2. Click Create Embedding
Create Embedding

5.4 Trigger Embedding

  1. Select an embedding model (e.g., all-mini-lm-l6-v2-ov)
  2. Click > to start the embedding process
Trigger

Model HQ will:

  • Parse and chunk documents
  • Create embeddings
  • Store them in the server-hosted vector DB

Once complete, you'll see the library info and embeddings summary.
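
Behind that progress screen is a standard parse → chunk → embed → store pipeline. As a rough illustration, the sketch below builds chunks and embeddings with the open-source all-MiniLM-L6-v2 model (the OpenVINO variant of which appears in the embedding model list); it shows the technique, not the Model HQ implementation, and the sample filename is hypothetical.

```python
# Illustrative parse -> chunk -> embed pipeline, roughly what the server does
# when it builds a library. Uses the open-source all-MiniLM-L6-v2 model; the
# sample filename is hypothetical.
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer

def chunk_pdf(path: str, chunk_size: int = 400) -> list[str]:
    """Extract text from a PDF and split it into fixed-size word chunks."""
    words = []
    for page in PdfReader(path).pages:
        words.extend((page.extract_text() or "").split())
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

model = SentenceTransformer("all-MiniLM-L6-v2")
chunks = chunk_pdf("UN_resolution_example.pdf")
embeddings = model.encode(chunks)                  # one vector per chunk
print(len(chunks), "chunks ->", embeddings.shape)  # ready to store in the vector DB
```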

Final State

5.5 What You Now Have

  • A shareable, queryable knowledge base
  • Indexed and hosted on the API server
  • Accessible to any Model HQ user connected to the server

6. Use That Library in Any Chat or Agent Flow

6.1 Return to Chat

  1. Open a local or server-run chat
  2. In Sources, select your newly created vector library

6.2 Ask Questions

  • Type natural-language queries related to your document domain

6.3 See Results

  • Vector search occurs remotely
  • Context is retrieved
  • Inference runs locally or on the server (your choice)

Summary: Hybrid Modes You Can Mix and Match

Local-Only

Vector Search: Local files
Inference: AI PC
Trigger From: Desktop

Server-Only

Vector Search: Server KB
Inference: API Server
Trigger From: Desktop

Hybrid RAG

Vector Search: Server KB
Inference: AI PC
Trigger From: Desktop

Remote Agent

Vector Search: N/A or Server
Inference: API Server
Trigger From: Desktop

Full API

Vector Search: All remote
Inference: API Server
Trigger From: External app

Pro Tips

Privacy & Offline Access

Use local inference when privacy or offline access is important

Scale & Performance

Use server inference for larger models or batch processing

Team Collaboration

Build shared vector libraries for team-wide semantic search

Easy Configuration

All toggles and source configurations are available in a single UI—no CLI required

Need Help?

If you encounter any issues while setting up hybrid inferencing, feel free to contact our support team at support@aibloks.com