AI Assistants: A Balancing Act Between Data Security and Technological Progress

Simon Krannig

Data Scientist

ADACOR

Simon Krannig, Data Scientist at ADACOR, shares insights into the transformative potential of open-source AI Assistants. These tools, when effectively linked with your own business documents, could reshape the way your company operates and it can be achieved without jeopardizing your sensitive data.

Understanding AI Language Models

AI Language Models, colloquially known as Large Language Models (LLMs), are designed to learn from an incredibly vast volume of data. Leading AI services like OpenAI’s ChatGPT or Google Bard feed their models with data so extensively that it rivals the collective content available on the Internet. The computing power required to process this immense amount of information is astounding. The training process often deploys an array of hundreds or even thousands of Graphics Processing Units (GPUs), which are kept in operation for several weeks. This makes these models some of the most resource-intensive in the field of AI. But what does the learning process look like for these models?

When it comes to text-generating LLMs, the underlying principle is elegantly simple. Given an input text, these models are tasked with predicting the most likely subsequent sequence of words. For instance, if the input is “Paris is the capital of …”, the model would, based on its previous training, suggest that a likely continuation is “France” or, less likely, “fashion”. This feature isn’t unfamiliar to most of us, as it mirrors the autocomplete functionality on our smartphones.

With additional rounds of training on data designed to mimic conversations, the model becomes skilled at continuing any user query, as if it were engaged in a natural dialogue. Afterwards, substantial human feedback and fine-tuning ensure that the model provides responses that are not only accurate but also user-friendly.

This unique setup gives LLMs the capability to carry out a wide array of instructions – from offering coding advice and simplifying complex topics and many more.

Infusing your data into existing models

Despite these innovative capabilities, there is one notable drawback to these models: they can only respond based on the information present in the data used to train them. This limitation excludes two critical sources of information.

Firstly, it doesn’t account for any recent events, news, discoveries, or discussions that happened after the model’s last training session.
Secondly, any non-public data such as internal business documents will remain inaccessible to the model. As a result, the model will not be able to answer questions related to your specific business operations.

Are we then stuck with LLMs that can only scratch the surface of the information most relevant to us? Fortunately, the answer is “no”. The solution is deceptively simple: When we feed a question to the model, we also include the corresponding reference information. This could be anything from a news article to a private document. If it encompasses the relevant information, the LLM will take it into account and use it to answer our queries.

This process would require us to know about the relevant reference information beforehand, but it can be made more useful with an automated knowledge retrieval system. Such a system serves as a bridge between the user’s question and the model’s response. It functions like a search engine for our custom documents, enabling us to find PDFs, protocols, source code, and any other text-based documents. The system then ranks these documents based on how relevant they are to the user’s query, and augments the original prompt with the most relevant pieces.

Augmenting LLM prompts with an automated knowledge retrieval system

Knowledge retrieval systems can be categorized into numerous types, yet when it comes to enhancing the prompting of conversational LLMs, semantic search engines take center stage. They rely on “Embedding LLMs”, which are designed with the specific focus of numerically encoding the semantics of a given text. The semantic information from every piece of text in your knowledge-base is then stored in a vector database. This interplay between Embedding LLMs and semantic search engines effectively transforms the way we utilize and interact with conversational LLMs.

Open source versus AI-as-a-Service

While the idea of integrating such a custom search engine with an LLM is attractive, it does bring up a considerable caveat. Each time we use this setup, we are potentially transmitting a large amount of private data to the entity that owns and operates the Model or the AI Service. Keep in mind that every single prompt is now packed with raw data from your knowledge retrieval system. Therefore, any hesitations we might have had about using major AI Services like OpenAI’s ChatGPT or Google Bard due to privacy concerns should now be even more acute.

The solution to this predicament may initially seem infeasible. We could consider hosting our very own LLM, thus taking ownership and control of the model to ensure that data security measures are maintained at all times and that private data never leaves our systems. However, given that training such a model requires a tremendous amount of raw data, tons of human feedback, and incredible amounts of computing power, the feasibility of this approach may appear questionable at first.

The answer to this dilemma, however, resides in the rapidly expanding open-source community, which is devoted to the development and enhancement of ready-to-use LLMs. Open source brings with it the promise of democratized access, allowing individuals and organizations to harness the power of AI while maintaining control over their data.

Navigating the open-source ecosystem

Open-source models tend to be smaller in comparison to their proprietary counterparts. In context, GPT-3 already boasts a whopping 175 billion model parameters, whereas Google Bard and GPT-4 scale this up by a factor of 3 to 6 times. Most open-source models, in contrast, typically fall within the range of 3 billion to 40 billion parameters. This size discrepancy, however, does not dampen their effectiveness or potential. It does, however, tremendously reduce the required compute resources.

The open-source landscape is divided into two main camps. The first group focuses on the meticulous development of foundational LLMs such as EleutherAI’s Pythia, MosaicML’s MPT, and the Falcon model from the Technology Innovation Institute. These are typically text-continuation models that are yet to be fine-tuned on conversational data, but their training consumes the largest amount of computational resources in the entire training lifecycle of LLMs.