Tech – How Juice Factory AI Works

Juice Factory AI is a European AI infrastructure platform for LLM inference, multimodal models, RAG, and batch processing. The platform runs in EU data centers with a focus on data security, low latency, and full control over models and data.


Architecture

Hardware

Type                  | VRAM      | Configuration
B200                  | 80-192 GB | 8×GPU, 2×CPU (128 cores), 2 TB RAM
NVIDIA RTX 6000-class | 96 GB     | 4×GPU, 1×CPU (64 cores), 512 GB RAM
AMD MI300-class       | 192 GB    | 8×GPU, 2×CPU (128 cores), 2 TB RAM

Software Stack

Container Execution

Kubernetes for orchestration, Docker for isolation

Drivers

CUDA 12.x for NVIDIA, ROCm 6.x for AMD

Inference Frameworks

vLLM, TensorRT-LLM, Text-Gen WebUI, TGI

Model Management

Automatic download, quantization (INT8, FP16), caching
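To illustrate what INT8 quantization does, here is a minimal symmetric-quantization sketch in pure Python. This is a conceptual example, not the platform's actual quantization pipeline, which operates on full model tensors:

```python
def quantize_int8(weights):
    """Symmetric INT8 quantization: map floats onto [-127, 127] with one scale."""
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the INT8 values."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)       # q == [50, -127, 0, 100]
approx = dequantize_int8(q, scale)      # close to the originals, small values lost
```

The round trip shows the trade-off: memory drops to one byte per weight, at the cost of rounding error on small values.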

Security & Compliance (EU/GDPR-first)

Security By Default

Data Location: All storage and processing happens within the EU. No data leaves the EU.

Access Control: API keys, JWT tokens, role-based access, MFA support

Network Segmentation: Isolated networks per customer, no shared infrastructure

Log Policy: No data storage by default. Customer chooses retention policy.

Data Flows & Controls


Inference Data Flow Map

For each inference request, data follows a strictly defined flow:

Client → API Gateway → Inference Engine → Response → Memory cleared. Logging: metadata only (customer ID, tokens, response time).
  1. The client sends a request via our API (TLS-encrypted).
  2. The API layer authenticates the customer, validates the request, and forwards only necessary information to the inference engine.
  3. The inference engine calculates the response in RAM without writing prompts or outputs to disk.
  4. The response is returned to the client and all content is cleared from memory after the request is completed.
  5. Only technical metadata (e.g., customer ID, model name, token count, response time) may be logged for operations and billing – never the content of prompts or responses in standard mode.
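The five steps above can be sketched as a request handler that computes entirely in memory and logs only metadata. All names here are illustrative, not the platform's actual code:

```python
import time

METADATA_LOG = []  # stand-in for the operations/billing sink

def handle_inference(customer_id, model, prompt, generate):
    """Process a request in memory; persist metadata only, never content."""
    start = time.monotonic()
    response = generate(prompt)              # step 3: computed in RAM
    elapsed_ms = (time.monotonic() - start) * 1000
    METADATA_LOG.append({                    # step 5: no prompt/response fields
        "customer_id": customer_id,
        "model": model,
        "prompt_tokens": len(prompt.split()),
        "response_time_ms": round(elapsed_ms, 2),
    })
    return response                          # step 4: content returned, not retained

reply = handle_inference("acme", "llama-7b", "hello world", lambda p: p.upper())
```

Note that the log record has no field that could hold prompt or response content, which is exactly the property the code review below verifies.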

This data flow map is documented and version-controlled, making it possible to review each step during security and compliance audits.

Controls and Auditing

To ensure that no inference data is stored or used for training, we have implemented:

Code & Configuration Review

The inference code has no write access to databases or storage for customer content. The API gateway and logging platforms are configured not to log request or response bodies.
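One common way to enforce "no bodies in logs" at the configuration level is an allowlist filter that only passes whitelisted metadata fields through. A minimal sketch (field names are hypothetical):

```python
ALLOWED_LOG_FIELDS = {"customer_id", "model", "token_count", "response_time_ms"}

def sanitize_log_record(record: dict) -> dict:
    """Keep only allowlisted metadata fields; body fields can never pass through."""
    return {k: v for k, v in record.items() if k in ALLOWED_LOG_FIELDS}

raw = {"customer_id": "acme", "model": "llama-7b",
       "request_body": "confidential prompt", "token_count": 42}
safe = sanitize_log_record(raw)   # request_body is dropped before logging
```

An allowlist is safer than a blocklist here: a new field added upstream is dropped by default instead of silently logged.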

Separated Environments

Customer-specific namespaces and a clear separation between test, staging, and production prevent debug logging from accidentally ending up in production.

Log Policy

Log formats contain only technical metadata. No fields for prompts or outputs in standard mode.

Retention and Auto-Deletion

All log data is subject to time-based retention where data is automatically deleted after X days according to customer or platform policy.
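Time-based retention reduces to a periodic purge of entries older than the configured window. A sketch with the retention period as a parameter (the actual number of days is set per customer or platform policy):

```python
from datetime import datetime, timedelta, timezone

def purge_expired(log_entries, retention_days, now=None):
    """Drop entries older than the retention window (time-based auto-deletion)."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    return [e for e in log_entries if e["timestamp"] >= cutoff]

now = datetime(2025, 6, 15, tzinfo=timezone.utc)
entries = [
    {"id": 1, "timestamp": datetime(2025, 6, 1, tzinfo=timezone.utc)},   # expired
    {"id": 2, "timestamp": datetime(2025, 6, 14, tzinfo=timezone.utc)},  # kept
]
kept = purge_expired(entries, retention_days=7, now=now)
```

In practice this runs as a scheduled job (or as a TTL policy in the log store itself), so deletion happens without operator action.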

Audit Trail

Changes in log policy, configuration, and codebase are logged, enabling both internal and external audits (e.g., for ISO/SOC certifications).

Network & Performance

The platform is built for low latency and high throughput:

Multi-model & Isolation

Multiple LLMs can run simultaneously on the same infrastructure. Resource pooling allows models to share hardware when capacity exists, but each customer has isolated executions. The scheduler prioritizes low-latency requests over batch jobs.
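The "low latency before batch" scheduling policy can be sketched as a two-class priority queue that is FIFO within each class. This is a simplified model of the idea, not the platform's actual scheduler:

```python
import heapq

LATENCY, BATCH = 0, 1  # lower value = higher priority

class Scheduler:
    """Serve low-latency requests before batch jobs; FIFO within a class."""
    def __init__(self):
        self._queue = []
        self._seq = 0  # tie-breaker preserving submission order
    def submit(self, job, priority):
        heapq.heappush(self._queue, (priority, self._seq, job))
        self._seq += 1
    def next_job(self):
        return heapq.heappop(self._queue)[2]

s = Scheduler()
s.submit("batch-report", BATCH)
s.submit("chat-reply", LATENCY)
s.submit("batch-embed", BATCH)
order = [s.next_job() for _ in range(3)]
# chat-reply runs first even though it arrived second
```

The monotonically increasing sequence number keeps the heap ordering stable, so batch jobs still run in arrival order once no latency-sensitive work is waiting.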

Integrations & API

REST API and gRPC for programmatic access. Webhooks for event notifications. SSO via OIDC for easy integration with existing identity systems. SDKs for Python, JavaScript, and Go.
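The shape of a REST call looks roughly like the sketch below, built with only the standard library. The base URL, endpoint path, and field names are placeholders, not the platform's actual API; the request is constructed but deliberately not sent:

```python
import json
import urllib.request

def build_completion_request(api_key, model, prompt,
                             base_url="https://api.example.eu/v1"):
    """Build (but do not send) a REST completion request.
    Endpoint and payload fields are illustrative placeholders."""
    payload = json.dumps({"model": model, "prompt": prompt}).encode()
    return urllib.request.Request(
        f"{base_url}/completions",
        data=payload,
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )

req = build_completion_request("sk-test", "llama-7b", "Hello")
# urllib.request.urlopen(req) would send it
```

In real integrations the SDKs wrap this plumbing; the sketch just shows the authentication header and JSON payload a raw client would need.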

Pricing

Token-based pricing with clear cost control. You pay per generated token, with different prices for different model sizes. No lock-in, scale up and down as needed. Volume discounts for long-term commitments.
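Token-based billing is straightforward to estimate. The rates below are invented for illustration only; actual prices depend on the model tier and agreement:

```python
# Illustrative per-million-token rates in EUR; not actual platform pricing.
RATES_PER_MTOK = {"7b": 0.20, "13b": 0.40, "70b": 1.20}

def estimate_cost(model_size, generated_tokens, volume_discount=0.0):
    """Cost = tokens × per-million rate, minus any negotiated volume discount."""
    base = generated_tokens / 1_000_000 * RATES_PER_MTOK[model_size]
    return round(base * (1 - volume_discount), 4)

cost = estimate_cost("13b", 2_500_000, volume_discount=0.10)
# 2.5 Mtok × 0.40 × 0.9 = 0.90
```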


Operations & Monitoring

Metrics: Prometheus for metrics, Grafana for visualization
Tracing: OpenTelemetry for distributed tracing
Autoscaling: Automatic scaling based on load
Alerts: Proactive alerts on anomalies, capacity forecasting
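Load-based autoscaling typically follows a target-tracking rule: choose the replica count that would bring utilization back to a target, clamped to configured bounds. A sketch assuming a 50% utilization target (the actual target and bounds are deployment-specific):

```python
import math

def desired_replicas(current, utilization, target=0.5, min_r=1, max_r=16):
    """Target-tracking rule: scale so per-replica utilization approaches target."""
    desired = math.ceil(current * utilization / target)
    return max(min_r, min(max_r, desired))

desired_replicas(4, 0.8)   # 4 replicas at 80% load -> scale out to 7
desired_replicas(4, 0.05)  # nearly idle -> scale in, but never below min_r
```

This is the same formula Kubernetes' Horizontal Pod Autoscaler uses; the ceiling biases toward over-provisioning, which suits latency-sensitive inference.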

Use Case Examples

Production Customer Support Bot

An e-commerce company runs a 7B model for real-time responses in their chat. Average latency <50ms, 99.9% uptime.

Internal Search/RAG

A consultancy indexes internal documents and runs RAG queries against a 13B model. Secure, no data leaves the EU.

Batch Media Generation

A media agency generates thousands of product descriptions daily with a 70B model. Batch runs at night.

FAQ

How is my data protected?

All data stays in the EU. No data is logged or stored without your approval. Isolated networks per customer.

Which models can I run?

All open models (Llama, Mistral, etc.) and custom fine-tuned models. We help with deployment.

How fast do models respond?

Time to first token is under 10 ms; subsequent tokens stream at under 1 ms each. Batch jobs scale as needed.

How do I integrate with you?

REST API, gRPC, webhooks. SDKs for Python, JS, Go. Full OpenAPI documentation.

What does it cost?

Token-based pricing. Contact us for exact pricing based on your needs.

Ready to test?

Contact us for a technical demo or detailed documentation.
