There has been a lot of hype regarding models like GPT-4, Claude and others. They're great, and I use them a lot, but there are scenarios where they might not be the best option. For instance, if you're making a game and want NPCs that can talk to the player dynamically, you probably don't want to send the player's message to a server, wait for a response, then relay that response back to the player: that introduces a lot of latency and makes you beholden to OpenAI or whoever (which can be a big deal if their servers go down and no one can play your game). You've also got to think about the cost of using these models, which can be a shock if you acquire a lot of players, and uneconomical if your game is cheap or played heavily. It can also push you towards minimising the amount of LLM-based content in your game, since you have to pay for every token, and that adds up fast.
I used llama.cpp with Mistral 7B to generate the dialogue, StyleTTS2 to generate the voice lines, and Unreal Engine 5 for the rendering. I tried to implement llama.cpp as a DLL in Unreal but ended up getting stuck, so what I ended up with was a horrible hacky solution built around a Node script. For the voice I used mrfakename's StyleTTS2 demo Docker image, which I talk to using the Gradio API client package. Ideally I wouldn't have to use a Docker image, but I couldn't get the StyleTTS2 model to work on my computer natively.
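For reference, talking to a Gradio app from Node looks roughly like this. It's a minimal sketch, assuming the demo is running locally on port 7860; the `/synthesize` endpoint name and the shape of the returned file object are assumptions on my part, so check the demo's "Use via API" page for the real signature.

```ts
// tts.ts - a minimal sketch of calling the StyleTTS2 demo's Gradio API from
// Node with the @gradio/client package. The port, the "/synthesize" endpoint
// name, and the payload shape are assumptions; the real demo may differ.
import { Client } from "@gradio/client";

export async function synthesize(text: string): Promise<string> {
  const app = await Client.connect("http://127.0.0.1:7860"); // the Docker container
  const result = await app.predict("/synthesize", { text });
  // Gradio returns generated files as objects carrying a URL; hand that (or a
  // downloaded local copy) back to Unreal for playback.
  return (result.data as any[])[0].url;
}
```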
I'm running this on my home PC on Windows 11 with the following specs:
The Node script is invoked using Unreal's FInteractiveProcess class, which passes the previous conversation history to the script as a command-line argument. The script then outputs the NPC's dialogue one sentence at a time: rather than waiting for the whole response to finish, we generate and voice the next line while the NPC is speaking the current one, which improves perceived performance. Each sentence is written to the command line as a JSON object containing the text (for the subtitle) and the location of that line's audio file, which Unreal parses into a struct and plays.
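To make the protocol concrete, here's a simplified sketch of that kind of script (not my actual code, which is far hackier). The llama.cpp server's `/completion` endpoint, its `stream` and `n_predict` parameters, and the SSE `data:` framing are real; the sentence splitting and JSON field names are illustrative.

```ts
// npc.ts - sketch of the glue script. Reads the conversation history from
// argv, streams tokens from the llama.cpp server, and emits one JSON object
// per finished sentence on stdout for Unreal to parse into a struct.
import { synthesize } from "./tts"; // the Gradio sketch above

const LLAMA_URL = "http://127.0.0.1:8080/completion";

async function* streamTokens(prompt: string): AsyncGenerator<string> {
  const res = await fetch(LLAMA_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt, n_predict: 256, stream: true }),
  });
  const decoder = new TextDecoder();
  // Naive SSE parsing: assumes each chunk contains whole "data: ..." lines.
  // A real implementation should buffer partial lines across chunks.
  for await (const chunk of res.body as any) {
    for (const line of decoder.decode(chunk).split("\n")) {
      if (!line.startsWith("data: ")) continue;
      const msg = JSON.parse(line.slice("data: ".length));
      if (msg.content) yield msg.content;
      if (msg.stop) return;
    }
  }
}

async function main() {
  const history = process.argv[2] ?? ""; // conversation so far, from Unreal
  let sentence = "";
  for await (const token of streamTokens(history)) {
    sentence += token;
    // Crude sentence boundary: abbreviations will split a line early.
    if (/[.!?]\s*$/.test(sentence)) {
      const text = sentence.trim();
      sentence = "";
      const audio = await synthesize(text); // voiced while the previous line plays
      console.log(JSON.stringify({ text, audio })); // one JSON object per line
    }
  }
  if (sentence.trim()) {
    const text = sentence.trim();
    console.log(JSON.stringify({ text, audio: await synthesize(text) }));
  }
}

main();
```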
The performance is surprisingly good. There are some minor stutters when a new line is generated, but it's not too bad. StyleTTS2 uses 14 GB of RAM and the llama.cpp server uses 3 GB, so you'll need a decent amount of RAM to run it; I'm sure StyleTTS2 could be optimised down, though. There isn't much of a hit to frame rate either: as you can see in the video, it easily holds a smooth 60fps.
The time to generate a new line is around 2-3 seconds, which is fine if you ask me. Maybe some kind of animation could play while the line is being generated to make the wait feel less jarring, but I don't think it's too bad as it is. If you used Whisper with it, you could slow down the speed at which the transcript of the player's voice appears on screen while the response is being generated, which is a pattern people are used to from things like Siri.
While it's faster, the main drawback is that the Mistral model I'm using is less coherent than GPT-3.5. It tends to go off on tangents and doesn't stick to the facts: in one of my tests it told the player to travel a few miles to the village it's supposed to currently be standing in. Also, as seen in the video, it knows the player's name is John, which isn't actually established within the conversation, and it mentions Angers, which won't be the settlement's name for hundreds of years.
It's also not particularly grounded in the reality of what is and isn't possible in the game world. For instance, as seen in the demo, the quest about training the villagers is actually impossible, since there are no game mechanics for anything. I think this could be fixed with a RAG-style approach (see the conclusion).
StyleTTS2 also doesn't quite sound natural; it's still got a bit of a robotic edge to it. It's also not very good at pronouncing words it doesn't know, or mispronounces them based on context (the city of Angers is pronounced differently from the verb "to anger").
This is a rough proof of concept, and there are a lot of things that can be improved on if you actually wanted to use this in a game.
In terms of the demo, there are a bunch of tangentially related things that would also be improvements but aren't really in scope. For one, the NPC's mouth doesn't move while it's talking, which is obviously bad. This could be done with something like Audio2Face, which I doubt would cause much of a performance hit. You could also let the LLM control the NPC's body language by having it return JSON that includes an animation to play for each line, as in the sketch below.
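For instance, the per-line JSON could grow an optional animation field, with the game validating it against the animations that actually exist. Everything here is illustrative, not part of the demo:

```ts
// Extending the per-line protocol with body language. Field names and
// animation names are illustrative; in Unreal they would map to montages.
interface DialogueLine {
  text: string;       // subtitle text
  audio: string;      // path/URL of the generated voice clip
  animation?: string; // e.g. "shrug" or "point_down_road"
}

const KNOWN_ANIMATIONS = new Set(["shrug", "nod", "point_down_road", "idle"]);

// Drop any animation the model invents so Unreal never tries to play a
// montage that doesn't exist.
function sanitize(line: DialogueLine): DialogueLine {
  if (line.animation && !KNOWN_ANIMATIONS.has(line.animation)) {
    return { ...line, animation: "idle" };
  }
  return line;
}
```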
In the long term I have an idea for a kind of database that's updated when the player begins a conversation. It would contain information about the player, the NPC (including their backstory and goals), the world, and so on, and would be used to keep the model grounded. You would generate a big document with the player's quest history (the parts this NPC knows about), what's visible on the player, the weather, the NPC's interactions with other NPCs, their chat history with the player (including dates and times, so it can infer that the player is talking about something that happened a while ago), and then run a natural-language query over that before passing the result to the LLM.
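As a starting point, the retrieval half of that could be as simple as embedding each fact and scoring it against the player's message. A rough sketch, assuming the llama.cpp server is started with `--embedding` so its `/embedding` endpoint is available (the response shape has varied between server versions, so treat the parsing as illustrative):

```ts
// rag.ts - sketch of the grounding idea: score world-state facts against the
// player's message and prepend the best ones to the prompt. The fact strings
// and top-k approach are illustrative.
const EMBED_URL = "http://127.0.0.1:8080/embedding";

async function embed(text: string): Promise<number[]> {
  const res = await fetch(EMBED_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ content: text }),
  });
  return (await res.json()).embedding; // shape may differ by server version
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

export async function relevantFacts(facts: string[], playerMessage: string, k = 3) {
  const q = await embed(playerMessage);
  const scored = await Promise.all(
    facts.map(async f => ({ f, score: cosine(await embed(f), q) }))
  );
  return scored.sort((a, b) => b.score - a.score).slice(0, k).map(s => s.f);
}
```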
Of course, it would still be best to have the possible quests hand-crafted. I don't think we're at the point where we can let the LLM come up with entire quest lines yet; you'd probably end up with something like Skyrim's radiant quests, where it's just "go here and do this". Those were definitely not a hit with players, and I think that's because they lacked that human touch.
The main thing to do would be to fine-tune the LLM to stay on track and refuse to do things it can't actually do. It could be passed a list of functions it can perform (give item, take item, start quest, etc.) and would refuse to do anything else.
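Concretely, the model could be asked to emit its action as JSON, with the game only executing actions on an allow-list and treating everything else as a refusal. A sketch, with illustrative names:

```ts
// actions.ts - the allow-list idea: the model proposes an action as JSON and
// the game only executes it if it's on the list. Action names are illustrative.
const ALLOWED = new Set(["give_item", "take_item", "start_quest", "say"]);

interface NpcAction {
  action: string;
  args?: Record<string, string>;
}

export function validateAction(raw: string): NpcAction | null {
  try {
    const parsed = JSON.parse(raw) as NpcAction;
    return ALLOWED.has(parsed.action) ? parsed : null; // refuse anything else
  } catch {
    return null; // not valid JSON: treat as plain dialogue
  }
}
```

llama.cpp can also constrain generation with a GBNF grammar via the `grammar` parameter on `/completion`, which would enforce the JSON shape at generation time rather than cleaning it up after the fact.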
All of those fixes are what you'd need to make this work in a full game, but the point of this post was to show that it's possible to run a whole NPC locally, and that it's not too hard to do. I think the results are very positive, and I'm excited to see what other people do with this.