Building a Real-Time Local LLM Runtime for Godot - Part 1
Like many developers, I've been fascinated by how AI could transform gaming experiences. When I first started playing with large language models (LLMs), I immediately saw the potential for creating more engaging NPCs. But there was a catch - getting LLMs to work smoothly in a game environment is anything but straightforward.
That's what led me to create LG (Little Guy), a local LLM runtime specifically designed for real-time games. In this first post, I want to share my journey building it, the challenges I've faced, and what I've learned along the way.
The Spark
The idea for LG came from a simple question: "What if NPCs could think and respond more like real characters?" I'd been working with Godot for a while and had experimented with various AI solutions, but nothing quite hit the mark. Traditional LLMs were either too slow, too resource-hungry, or required constant internet connectivity.
I needed something that could:
- Run completely locally (no cloud dependencies)
- Respond quickly enough for real-time gameplay
- Play nice with Godot's architecture
- Not eat up all available system resources
That's when I decided to build my own solution using Rust. Why Rust? Well, I needed the performance of a systems language, but also wanted strong safety guarantees. Plus, Rust's trait system turned out to be perfect for what I had in mind.
The Challenges
Making It Fast
The first wall I hit was performance. Traditional LLM inference can take several seconds or more - that's an eternity in game time! I needed responses in milliseconds, not seconds. Here's the interface I ended up designing:
// Core trait that different model implementations must satisfy
pub trait TextGeneration {
    fn run(&mut self, prompt: &str, sample_len: usize, use_prompt_cache: bool) -> Result<String>;
    fn clear_prompt_cache(&mut self);
}
This might look simple, but it's deliberately flexible. It lets me swap in different optimization strategies without changing the core API. The real magic happens in how different models implement this trait.
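To make the shape of the trait concrete, here's a minimal sketch of what an implementation might look like. EchoGenerator is a made-up stand-in rather than one of LG's real backends, and I'm assuming the trait's Result is anyhow::Result; a real backend would run the model's forward pass and manage KV-cache state inside run.

use anyhow::Result;

// Hypothetical toy backend: "generates" text by echoing a truncated prompt.
// A real backend would sample tokens from a model here instead.
pub struct EchoGenerator {
    cached_prompt: Option<String>,
}

impl TextGeneration for EchoGenerator {
    fn run(&mut self, prompt: &str, sample_len: usize, use_prompt_cache: bool) -> Result<String> {
        // A real implementation would reuse cached KV state rather than the raw prompt text.
        if use_prompt_cache {
            self.cached_prompt = Some(prompt.to_string());
        }
        // Stand-in for token sampling: cap the output at `sample_len` characters.
        Ok(prompt.chars().take(sample_len).collect())
    }

    fn clear_prompt_cache(&mut self) {
        self.cached_prompt = None;
    }
}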
Memory Management
Games are already resource-intensive. Add an LLM to the mix, and you're asking for trouble. I needed a way to be smart about memory usage. One solution was to implement a caching system that avoids recomputing responses:
// Cache structure that tracks both timing and usage patterns
pub struct CachedResponse {
    pub response: String,
    pub timestamp: SystemTime,
    pub usage_count: usize,
}
This way, frequently used responses stay readily available while less common ones can be cleared from memory. It's made a huge difference in both performance and memory usage. Admittedly, the model itself still consumes a lot of memory; bringing that down will mean smaller models and possibly quantization, both of which need more research and development on my part.
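For illustration, here's roughly how lookups and eviction against that cache could work. This is a simplified sketch rather than LG's actual cache code: the prompt-keyed HashMap and the max_age and min_usage thresholds are assumptions I'm making for the example.

use std::collections::HashMap;
use std::time::{Duration, SystemTime};

// Hypothetical eviction pass: drop entries that are both old and rarely used.
fn evict_stale(cache: &mut HashMap<String, CachedResponse>, max_age: Duration, min_usage: usize) {
    cache.retain(|_, entry| {
        let age = entry.timestamp.elapsed().unwrap_or(Duration::ZERO);
        age <= max_age || entry.usage_count >= min_usage
    });
}

// Hypothetical lookup: bump the usage count on a hit so hot entries survive eviction.
fn lookup<'a>(cache: &'a mut HashMap<String, CachedResponse>, prompt: &str) -> Option<&'a str> {
    cache.get_mut(prompt).map(|entry| {
        entry.usage_count += 1;
        entry.response.as_str()
    })
}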
Threading and Game Performance
This was the trickiest part. Game engines are sensitive about their main thread - block it for too long, and your game stutters. Not exactly ideal.
I solved this by creating a dedicated worker thread for inference. Here's what the request structure looks like:
#[derive(GodotClass)]
#[class(init, base=Object)]
pub struct InferenceRequest {
    base: Base<Object>,
    #[var]
    pub request_id: i64,
    pub prompt: String,
    pub use_prompt_cache: bool,
    pub status: String, // pending, completed, error
    pub result: String,
    pub error_message: String,
}
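The worker side is conceptually simple: a dedicated thread owns the generator, pulls requests off a channel, runs inference, and sends results back. Here's a hedged sketch of that pattern; InferenceWork, the fixed sample length of 256, and the exact InferenceMessage variants are simplifications of mine, not LG's real types.

use std::sync::mpsc;
use std::thread;

// Simplified stand-ins for the real request/response messages.
struct InferenceWork {
    request_id: i64,
    prompt: String,
    use_prompt_cache: bool,
}

enum InferenceMessage {
    Completed { request_id: i64, result: String },
    Failed { request_id: i64, error: String },
}

// Spawn a dedicated thread that owns the generator and drains the work queue,
// so inference never runs on the engine's main thread.
fn spawn_worker(
    mut generator: Box<dyn TextGeneration + Send>,
    work_rx: mpsc::Receiver<InferenceWork>,
    result_tx: mpsc::Sender<InferenceMessage>,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        for work in work_rx {
            let msg = match generator.run(&work.prompt, 256, work.use_prompt_cache) {
                Ok(result) => InferenceMessage::Completed { request_id: work.request_id, result },
                Err(e) => InferenceMessage::Failed { request_id: work.request_id, error: e.to_string() },
            };
            // If the main side hung up, there's nobody left to notify; just stop.
            if result_tx.send(msg).is_err() {
                break;
            }
        }
    })
}

The important property is that run only ever executes on this thread; the game side just posts work and polls for results.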
Using this from Godot turned out to be pretty straightforward. Here's a basic example:
# Initialize LG in your game
func _ready() -> void:
    var success = Lg.initialize_inference(
        "path/to/model",
        "path/to/tokenizer",
        MODEL_TYPE.LLAMA # or MAMBA
    )
    if not success:
        push_error("Failed to initialize LG")

# Request inference for an NPC response
func do_inference(prompt: String) -> void:
    var inf_req: InferenceRequest = Lg.request_inference(system_prompt, prompt, true)
    inf_req.completed.connect(_on_inference_completed)

# Handle the inference result
func _on_inference_completed(response: String, request_id: int) -> void:
    responses[request_id] = response
    print("Response: %s\n%s" % [request_id, response])
I plan to add token streaming support soon as well; showing tokens as they arrive creates the perception of better performance, which can sometimes go just as far as actually good performance 🏎️
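I haven't settled on an API for streaming yet, but the rough idea is to push tokens over a channel as they're decoded instead of waiting for the full response. Purely a sketch, with a made-up StreamEvent type:

use std::sync::mpsc;

// Made-up event type for a possible streaming API.
enum StreamEvent {
    Token { request_id: i64, text: String },
    Done { request_id: i64 },
}

// Instead of returning one String at the end, the worker would send each
// decoded token as soon as it's available.
fn stream_tokens(tx: &mpsc::Sender<StreamEvent>, request_id: i64, tokens: Vec<String>) {
    for text in tokens {
        let _ = tx.send(StreamEvent::Token { request_id, text });
    }
    let _ = tx.send(StreamEvent::Done { request_id });
}

On the Godot side, an NPC's dialogue box would append each Token event as it arrives, so the player sees text immediately even if the full response takes a while.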
The Architecture
The core of LG is built around a few key components that work together to make local LLM inference viable for games:
Model Support
I wanted to support different model architectures without rewriting everything each time. Combining the following enum with the TextGeneration trait I mentioned earlier gave me a single unifying interface over the various model backends.
pub enum TextGenerator {
    Mamba(mamba::TextGenerator),
    Llama(llama::TextGenerator),
    // Other supported models to come soon.
}
This has already proven valuable as I've experimented with different models. Each one has its strengths, and being able to swap them easily has been crucial for testing.
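Swapping works because the enum can implement TextGeneration itself and forward each call to whichever backend it wraps. I'm sketching that dispatch from the trait definition above rather than quoting LG's source, and assuming each backend's TextGenerator implements the trait:

use anyhow::Result; // assuming the same Result alias the trait uses

impl TextGeneration for TextGenerator {
    fn run(&mut self, prompt: &str, sample_len: usize, use_prompt_cache: bool) -> Result<String> {
        // Each arm delegates to the concrete backend's implementation.
        match self {
            TextGenerator::Mamba(m) => m.run(prompt, sample_len, use_prompt_cache),
            TextGenerator::Llama(l) => l.run(prompt, sample_len, use_prompt_cache),
        }
    }

    fn clear_prompt_cache(&mut self) {
        match self {
            TextGenerator::Mamba(m) => m.clear_prompt_cache(),
            TextGenerator::Llama(l) => l.clear_prompt_cache(),
        }
    }
}

Callers only ever see a TextGenerator, so switching from Llama to Mamba is a one-line change at construction time.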
The Brain of the Operation
The main singleton that ties everything together looks like this:
#[derive(GodotClass)]
#[class(init, base=Object)]
pub struct LittleGuySingleton {
    base: Base<Object>,
    inference_pool: Option<InferencePool>,
    inference_args: Option<Args>,
    model: Option<Model>,
    config: Option<ModelConfig>,
    tokenizer: Option<Tokenizer>,
    device: Option<Device>,
    inference_initialized: bool,
    result_receiver: Option<mpsc::Receiver<InferenceMessage>>,
    result_sender: Option<mpsc::Sender<InferenceMessage>>,
    request_map: HashMap<RequestId, Gd<InferenceRequest>>,
}
It might look complicated, but each piece serves a specific purpose in managing the model, handling requests, and keeping everything running smoothly.
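The piece that matters most for frame timing is how results come back to the game. Each frame the singleton drains result_receiver with try_recv, which never blocks, and routes each message to the matching InferenceRequest. The sketch below is my shorthand rather than LG's exact code: poll_results is a made-up method name, the InferenceMessage variants are the same stand-ins as in the worker sketch, I'm treating RequestId as a plain i64, and I've left out emitting the completed signal for brevity.

impl LittleGuySingleton {
    // Drained once per frame from the engine side; try_recv never blocks,
    // so a slow model can't stall the main thread.
    fn poll_results(&mut self) {
        let Some(receiver) = self.result_receiver.as_ref() else {
            return;
        };
        while let Ok(msg) = receiver.try_recv() {
            match msg {
                InferenceMessage::Completed { request_id, result } => {
                    if let Some(request) = self.request_map.get_mut(&request_id) {
                        let mut request = request.bind_mut();
                        request.status = "completed".to_string();
                        request.result = result;
                    }
                }
                InferenceMessage::Failed { request_id, error } => {
                    if let Some(request) = self.request_map.get_mut(&request_id) {
                        let mut request = request.bind_mut();
                        request.status = "error".to_string();
                        request.error_message = error;
                    }
                }
            }
        }
    }
}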
What's Coming Next
I've got big plans for LG in the coming months:
- Adding support for 4-bit and 8-bit quantized models (4-bit weights are roughly a quarter the size of 16-bit ones, hence the ~75% memory reduction I'm aiming for!)
- Expanding model support to include:
  - Mamba (still experimental, but looking promising)
  - RWKV models
  - Custom fine-tuned models
I'm focusing on performance and platform support:
- Multi-NPC optimization (batch inference)
- Better CPU fallback options for when a GPU isn't available
I'm also exploring some interesting possibilities:
- Cloud support for resource-constrained devices
- Integration with traditional NPC systems like utility AI and behavior trees
- A community model repository
- Vector store and RAG support
- Tools to quickly generate synthetic training data to fine-tune models for specific games
Coming Up in Part 2
In the next post, I'll get more hands-on:
- Walking through a complete working example
- Sharing real performance numbers
- Showing how the caching system works in practice
- Demonstrating LG in action with some video examples
I'm excited to show you LG running in a real game environment. Stay tuned!
This is Part 1 of my series on LG (Little Guy), my local LLM runtime for the Godot game engine. Check out the GitHub repository for the latest updates. Stay tuned for Part 2, where I'll show LG in action!