Local A.I.: private large language models

Large Language Models (LLMs) are a breakthrough in search and human-computer interaction. However, they have problems with accuracy, relevance, privacy and cost.

We have deployed LLMs with appropriate filtering, privacy protections and guard-rails, using Retrieval-Augmented Generation (RAG) to improve contextual relevance. We continue to track the state of the art.

  • Privacy: there are significant concerns that many LLM tools "leak data" back into their training dataset, and risk sharing private data with 3rd parties.
  • We use "inference as a service", so this risk is substantially reduced (to the level of improbability), or use entirely local-inference, on your own physical hardware, to eliminate it.

  • Relevance: OpenAI doesn't know about your own documents, and even if it did, it wouldn't prefer them when finding results.
  • We use RAG to ensure that your data is available to the LLM and is prioritised in the results (see the sketch after this list).

  • Cost: training a foundation model costs £10m+, but even licensing an existing model costs about £25/user/month, which soon mounts up.
  • By aggregating LLM queries via the API, we can reduce the cost to only that of the processing actually consumed: in one case, cutting the bill to 0.04% (a worked example follows this list).

  • Guard-rails: companies are rightly concerned about the data that their users might share with (potentially untrusted) 3rd parties.
  • We built a system integrating user guidance, detailed logging, and feedback; this trains colleagues to use A.I. safely and reliably, and the audit trail verifies that private data isn't misused.

  • Accuracy: LLMs have a number of failure modes, relating to hallucination and oversimplification.
  • While fixing this is a billion-dollar research problem, we are highly aware of the current pitfalls and limitations, and can help you avoid expensive mistakes.
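To make the RAG step concrete, here is a minimal sketch of the pattern (not our production system): the documents, embedding model, model name and local endpoint are all illustrative assumptions. The idea is simply to embed your own documents, retrieve the ones most relevant to each question, and place them in the prompt so the answer is grounded in your data.

```python
# Minimal RAG sketch: embed local documents, retrieve the most relevant ones,
# and pass them to the model as context so answers prefer *your* data.
# Assumes `sentence-transformers` and `openai` are installed, and an
# OpenAI-compatible endpoint (e.g. a local Ollama/llama.cpp server) at BASE_URL.
import numpy as np
from sentence_transformers import SentenceTransformer
from openai import OpenAI

BASE_URL = "http://localhost:11434/v1"   # illustrative local endpoint
MODEL = "llama3"                          # illustrative model name

documents = [                             # in practice: your own files, wiki pages, tickets...
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday to Friday, 9am to 5pm UK time.",
    "The warehouse in Leeds ships orders within 2 working days.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the question (cosine similarity)."""
    q = embedder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vectors @ q
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

def answer(question: str) -> str:
    """Ask the model, with the retrieved documents injected as context."""
    context = "\n".join(f"- {d}" for d in retrieve(question))
    client = OpenAI(base_url=BASE_URL, api_key="not-needed-for-local")
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system",
             "content": "Answer using ONLY the context below.\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(answer("How long do customers have to return an item?"))
```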
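And to show why routing queries through an API can be so much cheaper than per-seat licences, here is a back-of-the-envelope comparison. Every figure in it is an illustrative assumption, not the data behind the 0.04% case above.

```python
# Back-of-the-envelope cost comparison: per-seat licence vs pay-per-token API.
# All figures below are illustrative assumptions.
users = 200
seat_price = 25.00                        # £ per user per month (typical licence)
licence_cost = users * seat_price         # £5,000 / month

queries_per_user = 40                     # per month
tokens_per_query = 2_000                  # prompt + response
price_per_million_tokens = 0.50           # £; varies widely by model

api_cost = (users * queries_per_user * tokens_per_query
            * price_per_million_tokens / 1_000_000)   # £8 / month

print(f"Licence: £{licence_cost:,.2f}/month  API: £{api_cost:,.2f}/month "
      f"({100 * api_cost / licence_cost:.2f}% of the licence cost)")
```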

Inference: local or cloud?

Inference is the process of actually "running" a large language model: the complex calculation (literally trillions of computations per word) that works out which word comes next, given the context (the previous words in this conversation, plus what the model learned from its training data). This requires very expensive hardware. There are three ways to do it:

  • Proprietary: delegate inference to a 3rd-party, closed-source model. This is how OpenAI's ChatGPT works. It works well, but you have no control over the systems or the data.
  • Cloud-based inference: send each specific conversation and its context to be processed (e.g. using the open-weight Llama 3 on Replicate or RunPod). This gives you reasonable levels of control, and lets you pay only for the processing you actually use.
  • Local: run it on your own hardware. To obtain decent performance on the larger models you need at least 32GB of graphics RAM, so this used to be very expensive, requiring ∼£50k of server hardware. It's best for privacy, but only cost-effective if used intensively. As of 2025, Nvidia's RTX 5090 high-end consumer GPU costs ∼£3k and has 32GB of onboard memory, so inference with good performance is now practical on your own hardware, as the sketch below shows.
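In practice, the difference between the cloud-based and local options is often just where the endpoint lives. The sketch below sends the same chat request to either; the URLs and model names are illustrative assumptions (any OpenAI-compatible server, such as Ollama or a llama.cpp server, will do).

```python
# The same chat request, pointed either at a cloud endpoint or at a model
# running on your own hardware. Endpoint URLs and model names are
# illustrative assumptions.
from openai import OpenAI

def ask(base_url: str, api_key: str, model: str, prompt: str) -> str:
    client = OpenAI(base_url=base_url, api_key=api_key)
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content

# Cloud-based inference: data leaves your network, and you pay per token.
# print(ask("https://api.openai.com/v1", "YOUR_KEY", "gpt-4o-mini", "Hello"))

# Local inference: e.g. an Ollama or llama.cpp server on your own GPU;
# nothing leaves the machine, and there is no per-token charge.
print(ask("http://localhost:11434/v1", "unused", "llama3", "Hello"))
```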

Machine Learning

The field of A.I. comprises far more than LLMs: it includes thousands of other Machine Learning algorithms for more specific process-optimisation and complex classification tasks.

This is the complex field of data science, in which we have worked for a decade. Furthermore, when using M.L. tools, it's important to consider the more traditional statistical and data-analysis tools, which still have their place and can sometimes outperform M.L.: machine learning isn't always better. Experience, combined with experiment, will find the best algorithm; the sketch below shows the kind of comparison we mean.
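As an illustration of that last point, this sketch compares a traditional statistical model (logistic regression) with a heavier M.L. model (gradient boosting) on a small public dataset. The dataset and models are illustrative assumptions; on other problems the ranking can easily reverse, which is exactly why we experiment rather than assume.

```python
# Compare a traditional statistical model with a heavier M.L. model on the
# same task, using cross-validation. Dataset and models are illustrative;
# which one wins depends on the data.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

models = {
    "logistic regression (traditional)": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
    "gradient boosting (M.L.)": GradientBoostingClassifier(),
}

for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {score:.3f} mean accuracy")
```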

Our recent successful collaboration with Q-Bot, under the auspices of Innovate UK, is an example of this: in practice, "A.I." often really means "A.I., M.L. or data science".


Why Neill Consulting?

We have worked on A.I.-related tools for 10 years (long before LLMs became usable and widespread), from when it was still all about M.L. (machine learning) and the implementation of classification algorithms in TensorFlow, and their use in data science. This gives us a deep understanding of A.I.: not just how to use it, but how it works, from Hebbian learning, to the Perceptron, to the emergent behaviour of today's models performing trillions of computations per word; and we continue to follow the fast-changing state of the art.

As a result, we know where you can, and where you can't, use A.I., how to avoid the hype, and how to find the use cases that will deliver value and accuracy.

We've written extensively about LLMs, developed and deployed L11g, presented at CIONet, and worked with Innovate UK.

We can advise you on A.I. projects and help with their implementation.

Contact Neill Consulting