The Architect Forge: Distillation & Metadata Mastery

The Architect Forge is the developer pipeline of ANDARTIS: a specialized training framework that distills capabilities from large Teacher models (such as Mistral-7B) into lightweight, high-performance Student micro-models (100M-300M parameters) designed to run inference in milliseconds on standard Apple Silicon processors.

The Distillation Strategy

To achieve "Razor Blade" speed and efficiency without losing critical metadata extraction capabilities, the Forge pairs two models:

The Teacher (Local SLM/Mistral): Reads raw documentation libraries and generates detailed semantic tags, summaries, and entity schemas.
The Student (Global Extraction Head): A micro-transformer architecture trained directly on the Teacher's labels, learning to predict schemas and structures instantly.
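
Training "directly on the Teacher's labels" is commonly implemented as a cross-entropy loss between the Student's predicted distribution and the label the Teacher produced. The following is a minimal sketch of that idea in plain Python, not the Forge's actual training code; the function names and the framing of a label as a class index are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Convert raw Student scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hard_label_loss(student_logits, teacher_label):
    """Cross-entropy of the Student's prediction against the Teacher's label.

    `teacher_label` is the index of the class the Teacher chose (hypothetical
    encoding; the real label schema is not specified in this document).
    """
    probs = softmax(student_logits)
    return -math.log(probs[teacher_label])
```

The loss shrinks as the Student assigns more probability to the Teacher's choice, which is what gradient descent exploits during training.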

Forge Workflow

[Raw Files] ──> [Artisan Label] ──> [Artisan Export] ──> [Artisan Train] ──> [Package & Ship]
                     │ (Mistral labels)    │ (Builds JSONL)      │ (MLX Distillation)
                     ▼                     ▼                     ▼
              [SQLite Store]        [training.jsonl]       [weights.npz]

1. Labeling (Synthetic Data Generation)

Parse the files in a directory and run the local SLM to generate training labels. The Forge is structure-aware and automatically splits the data into training and validation sets based on the folder hierarchy:

bash

php artisan andartis:forge label \
    --source=/path/to/raw_docs \
    --domain=legal \
    --task=metadata

--domain: Groups target training datasets (e.g., legal, clinical, software).
--task: Selects the target extraction script (metadata, summarization, classification).
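
One way a folder-based split can work is to assign every file under the same top-level folder to the same set, so that closely related documents never leak between training and validation. The sketch below illustrates that approach under stated assumptions: the hashing scheme, the `val_fraction` parameter, and the function name are hypothetical and are not taken from the Forge itself.

```python
import hashlib
from pathlib import PurePosixPath


def assign_split(relative_path, val_fraction=0.1):
    """Return "train" or "val" for a file, keyed on its top-level folder.

    Assumption: `relative_path` is relative to the --source directory and
    uses forward slashes. Files directly in the root share one group.
    """
    parts = PurePosixPath(relative_path).parts
    group = parts[0] if len(parts) > 1 else ""
    # Hash the folder name so the assignment is stable across runs.
    digest = hashlib.sha256(group.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "val" if bucket < val_fraction else "train"
```

Because the decision depends only on the folder name, re-running the labeling step on a growing corpus keeps existing files in the set they were first assigned to.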

2. Importing Existing Datasets

If you already possess structured files (such as CSV or JSONL format), you can import them directly into the training database, skipping the Teacher phase:

bash

php artisan andartis:forge import --source=my_data.jsonl --domain=legal
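
If the source data is a CSV, it can help to picture how each row might map onto a training record. This sketch assumes a hypothetical record shape (`domain`, `text`, `labels`) and a hypothetical `text` column; the schema the import command actually expects is not documented here, so check it before relying on these names.

```python
import csv
import io


def csv_to_records(csv_text, domain, text_column="text"):
    """Turn CSV rows into training records (hypothetical schema).

    The column named by `text_column` becomes the input text; every other
    column is treated as a label field.
    """
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        text = row.pop(text_column)
        records.append({"domain": domain, "text": text, "labels": dict(row)})
    return records
```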

3. Exporting for Training

Combine all synthetic and imported labels into a single JSONL file structured for the Python MLX training pipeline:

bash

php artisan andartis:forge export --format=jsonl --domain=legal
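
JSONL stores one JSON object per line, which lets a training script stream examples without loading the whole file. The helpers below are a minimal sketch of writing and reading such a file from Python; the record fields used in the usage note are illustrative assumptions, not the Forge's actual export schema.

```python
import json


def write_jsonl(records, path):
    """Write one JSON object per line and return the number of lines written."""
    count = 0
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

For example, `write_jsonl(synthetic + imported, "training.jsonl")` would merge two lists of records (hypothetical variable names) into a single file.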

4. Running Distillation

Train the student micro-model on the exported dataset. This generates the specialized weights:

bash

php artisan andartis:forge train --steps=1000

Model Distribution

To keep the application repository size small and ensure the .dmg package is lightweight, model weights are not checked into Git:

Compression: Run php artisan andartis:forge package to compress the output directory into a .zip archive.
Hosting: Upload the archive to a CDN or host it as an asset in GitHub Releases.
Synchronization: On initial application launch, the desktop app performs a handshake and downloads the latest global models.
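
The packaging and download steps can be pictured as "zip the weights, publish a checksum, verify after download". The sketch below uses only the Python standard library; it does not describe how the `package` command or the app's handshake is implemented, and the SHA-256 verification step is an assumption added for illustration.

```python
import hashlib
import zipfile
from pathlib import Path


def package_dir(src_dir, archive_path):
    """Zip every file under src_dir and return the archive's SHA-256 hex digest.

    Note: archive_path should live outside src_dir, otherwise the archive
    could end up being added to itself.
    """
    src = Path(src_dir)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for item in sorted(src.rglob("*")):
            if item.is_file():
                zf.write(item, item.relative_to(src).as_posix())
    return hashlib.sha256(Path(archive_path).read_bytes()).hexdigest()


def verify_archive(archive_path, expected_sha256):
    """Check a downloaded archive against the published checksum."""
    actual = hashlib.sha256(Path(archive_path).read_bytes()).hexdigest()
    return actual == expected_sha256
```

Publishing the digest alongside the archive lets the client reject a truncated or tampered download before loading the weights.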

Local Node Adaptation

When running on a user's machine, the Extraction Forge switches from global training to local refinement:

The pre-shipped global model manages instant ingestion.
During idle times, low-priority background workers review files and refine the local weights, saving small Delta-Weight adjustments. This adapts the model to the user's specific vocabulary and document formatting.
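
Conceptually, a Delta-Weight adjustment is the difference between the locally refined weights and the shipped global weights, so only the difference needs to be saved. The following sketch shows that arithmetic with plain Python lists; the dictionary layout and function names are assumptions for illustration, and the real Forge may store or apply deltas differently.

```python
def compute_delta(global_weights, local_weights):
    """Return local - global for every named parameter."""
    return {
        name: [l - g for g, l in zip(values, local_weights[name])]
        for name, values in global_weights.items()
    }


def apply_delta(global_weights, delta):
    """Rebuild the adapted weights as global + delta.

    Parameters missing from `delta` are left at their global values.
    """
    adapted = {}
    for name, values in global_weights.items():
        change = delta.get(name, [0.0] * len(values))
        adapted[name] = [g + d for g, d in zip(values, change)]
    return adapted
```

Keeping the global weights untouched also means a new global model can be downloaded later without losing the record of what was learned locally, although reapplying an old delta to new weights is not guaranteed to be meaningful.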
