Training Data vs. Node Data (BYOD Philosophy)

To ensure privacy, scalability, and flexibility, ANDARTIS strictly separates model adaptation (how the local SLM learns to think and format) from context ingestion (which documents supply the active factual context).

This separation forms the core of our Bring Your Own Dataset (BYOD) philosophy.

Conceptual Overview

```mermaid
graph TD
    subgraph Forge ["🧠 1. Model Forge (How to Think)"]
        A[Mock/Seed Training Data] --> B(Specialized LoRA Weights)
        C[Anonymized Domain Data] --> B
    end

    subgraph ActiveNode ["💾 2. Active Node (What is Fact)"]
        D[Confidential Local Files] --> E(Isolated Node SQLite DB)
    end

    B --> F{Local Inference Engine}
    E --> F
    F --> G[Desktop UI]
```

1. The Forge (Model Training)

Purpose: Teaches the local SLM (Mistral-7B) how to extract entities, speak specific jargon (medical, legal, financial), and generate consistent structural layout blueprints.
Format: JSON Lines (.jsonl) or CSV files containing generalized prompts and their corresponding targets.
Attributes:
- Anonymized & Stylistic: It does not need (and should not have) active, highly confidential records. It uses representative mock/seed examples to teach style and pattern alignment.
- Avoids Memorization: Keeping the training data generic prevents the neural weights from memorizing specific patient names or transaction details, so confidential details stay out of the model files and the model remains generalizable.
- Infrequent Execution: Training is resource-heavy and is done once (or periodically when upgrading schemas/specialties).
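
As a rough illustration of the training format described above, the sketch below checks a JSON Lines file one record at a time. The field names `prompt` and `target` and the helper `validate_jsonl` are assumptions made for this example; the schema the Forge actually expects may differ.

```python
import json
from pathlib import Path

# Hypothetical field names; the real Forge schema may use different keys.
REQUIRED_FIELDS = ("prompt", "target")


def validate_jsonl(path: Path) -> list[str]:
    """Return a list of human-readable problems found in a JSONL file."""
    problems = []
    with path.open(encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {line_no}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(record, dict):
                problems.append(f"line {line_no}: expected a JSON object")
                continue
            for field in REQUIRED_FIELDS:
                value = record.get(field)
                if not isinstance(value, str) or not value.strip():
                    problems.append(f"line {line_no}: missing or empty '{field}'")
    return problems
```

An empty result means every non-blank line parsed as an object with both fields present. A check like this only verifies structure; it does not detect whether a record still contains confidential details.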

2. The Active Node (Context Ingestion)

Purpose: Represents the actual day-to-day source files that the user queries or extracts information from.
Format: Unstructured files (Markdown, PDF, CSV, TXT) organized inside a local directory.
Attributes:
- Strict Fact Base: The node reads files locally and registers their hashes and contents in the node's isolated core.sqlite database.
- No Network Egress: Documents are never uploaded, sent to third-party endpoints, or integrated into the global model weights.
- Instant Ingestion: Files can be added or updated dynamically. We do not retrain the neural model when adding new documents. Instead, we query the already-specialized model using our compiled layout blueprints and the documents' text.
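
The hash-based registration described above can be sketched as follows. This is a minimal illustration, not the node's actual implementation: the `documents` table layout and the `ingest_folder` helper are hypothetical, and only plain-text decoding is shown (PDF text extraction is left out).

```python
import hashlib
import sqlite3
from pathlib import Path

# Hypothetical table layout; the real core.sqlite schema is not shown here.
SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    path    TEXT PRIMARY KEY,
    sha256  TEXT NOT NULL,
    content TEXT NOT NULL
)
"""


def ingest_folder(db: sqlite3.Connection, folder: Path) -> list[str]:
    """Register new or changed files; return the paths that were (re)ingested."""
    db.execute(SCHEMA)
    changed = []
    for file in sorted(folder.rglob("*")):
        if not file.is_file():
            continue
        data = file.read_bytes()
        digest = hashlib.sha256(data).hexdigest()
        row = db.execute(
            "SELECT sha256 FROM documents WHERE path = ?", (str(file),)
        ).fetchone()
        if row is not None and row[0] == digest:
            continue  # unchanged since the last scan
        # Plain-text decoding only; binary formats would need a real extractor.
        text = data.decode("utf-8", errors="replace")
        db.execute(
            "INSERT OR REPLACE INTO documents (path, sha256, content) "
            "VALUES (?, ?, ?)",
            (str(file), digest, text),
        )
        changed.append(str(file))
    db.commit()
    return changed
```

Because only the hash comparison decides whether a file is re-read into the database, adding or editing a document touches the SQLite store alone and never the model weights.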

Developer vs. Production Environment

Developer Workflows (Mock Datasets)

When contributing to or developing ANDARTIS, you should use Mock/Seed Datasets to verify the ingestion, training, and self-healing pipelines.

A developer does not need clinical records or large private document vaults.
Developers use the included CLI to forge basic models from small mock dataset files, which is enough to confirm that the Python MLX bridge and the PHP orchestrator function correctly.
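
A small mock dataset of the kind mentioned above could be generated along these lines. Everything in this sketch is illustrative: the `prompt`/`target` field names, the placeholder values, and the `write_mock_dataset` helper are assumptions, not part of the ANDARTIS CLI.

```python
import json
import random
from pathlib import Path

# Entirely fictional values; nothing here comes from real records.
MOCK_NAMES = ["Patient A", "Patient B", "Patient C"]
MOCK_FINDINGS = ["mild hypertension", "seasonal allergy", "sprained ankle"]


def write_mock_dataset(path: Path, count: int, seed: int = 0) -> None:
    """Write `count` synthetic prompt/target pairs as JSON Lines."""
    rng = random.Random(seed)  # fixed seed keeps fixtures reproducible
    with path.open("w", encoding="utf-8") as handle:
        for _ in range(count):
            name = rng.choice(MOCK_NAMES)
            finding = rng.choice(MOCK_FINDINGS)
            record = {
                "prompt": (
                    "Extract the patient and diagnosis: "
                    f"{name} presents with {finding}."
                ),
                # The target is stored as a JSON string the model should emit.
                "target": json.dumps({"patient": name, "diagnosis": finding}),
            }
            handle.write(json.dumps(record) + "\n")
```

Seeding the generator means two developers produce byte-identical fixtures, which makes pipeline failures easier to reproduce.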

Sovereign Workflows (Sovereign BYOD)

When ANDARTIS is deployed inside a private practice, clinic, or independent legal office (for example, a medical clinic), the workflow looks like this:

Model Adaptation: The organization compiles an anonymized set of its own medical charts and trains a local clinical LoRA adapter using:
```bash
php artisan andartis:forge import --source=private_clinical_samples.jsonl --domain=medical
php artisan andartis:forge train
```
Dynamic Ingestion: The doctors add their active patient charts to their designated Node folders.
Outcome: The locally adapted clinical model parses patient history sheets on the device and stores the extracted data in the node's local SQLite database, so all patient details stay within the local machine's boundary.
