Experience report: AI, LLM and data

Hello, World! This is a software-tech / coding-focused article about my experiences with Large Language Models (LLMs). The “AI” tech is at the peak of the current hype cycle, but it seems that the first issues are already manifesting.

    DALL-E vision of a bookworm in the AI era

    Experiential notes on Large Language Model development

    I followed my interests with an R&D project. During this project, I took a look at the typical components of an LLM software stack, which are:

    1. a Vector Store or Vector Database

    2. a Large Language Model

    3. Embeddings or other Natural Language Processing (NLP) approaches

    Currently, it seems to be en vogue to use Software as a Service (SaaS) components. This creates a large and costly supply chain of dependencies, most of which are venture-capital-backed startups. My experience with suppliers like this is that they change rapidly, go through mergers & acquisitions or leveraged buyouts (LBOs), or cease to exist. Besides that, their focus is growth, not contributing to foundational understanding.

    Can you own your LLM setup, end to end?

    Data generation and preparation

    I decided to create a custom LLM setup from a database of web-clips of Information Security articles (txt, html, pdf, docx). The original database has been released, but extracting the content wasn’t straightforward.

    Release of my curated InfoSec web-clips (2008-2021)

    The purpose of the data generation was to ingest the extracted text into a uni-modal LLM; an LLM which will only generate text. The source of the texts is the set of documents from the released database. I named the project “Bookworm” because it consumes many book-like documents (pdf, docx, etc.).

    Many PDFs were difficult to process in Python. There were:

    • encoding issues (Unicode related)

    • reference issues (PDF standard related)

    The documents needed to be repaired. The details of this are documented in the released Jupyter Notebooks.
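
    To illustrate the kind of repair pass I mean: pikepdf (an assumed choice here, not necessarily the library used in the released notebooks) rebuilds the PDF cross-reference table on save, which resolves many reference errors. A minimal sketch, with placeholder directory names:

        from pathlib import Path

        import pikepdf

        Path("repaired").mkdir(exist_ok=True)

        # Re-write every PDF through pikepdf/qpdf; saving rebuilds the xref table
        # and normalizes the file structure. Damaged files are reported and skipped.
        for pdf_path in Path("webclips").glob("*.pdf"):
            try:
                with pikepdf.open(pdf_path) as pdf:
                    pdf.save(Path("repaired") / pdf_path.name)
            except pikepdf.PdfError as exc:
                print(f"could not repair {pdf_path.name}: {exc}")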

    Establishing a data-sanitization process was key to enabling this project.

    With LangChain it’s possible to pre-process the (repaired) documents, because it integrates with Python-based document parsers (see the sketch after this list):

    • PDF, DOCX, Excel

    • HTML and TXT

    • Web sources: Confluence, Wikipedia etc.
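
    A loading sketch, assuming the langchain-community loaders (import paths differ between LangChain versions) and placeholder file names:

        # Map each (hypothetical) input file to a LangChain community loader.
        from langchain_community.document_loaders import BSHTMLLoader, PyPDFLoader, TextLoader

        sources = {
            "repaired/report.pdf": PyPDFLoader,
            "webclips/notes.txt": TextLoader,
            "webclips/article.html": BSHTMLLoader,
        }

        documents = []
        for path, Loader in sources.items():
            try:
                documents.extend(Loader(path).load())
            except Exception as exc:
                # Broken files are skipped; they go back into the repair pass above.
                print(f"skipping {path}: {exc}")

        print(f"loaded {len(documents)} document fragments")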

    One of the use cases I see for this LLM setup is to provide a unified search engine that can cross-reference some of the content. Recently, I have seen results from Google get “poisoned” by AI-driven over-optimization for Search Engine Optimization (SEO). In the first paragraph, these monetized articles spam keywords such as “Expert”, then offer little substance and a couple of advertisements. Building custom knowledge bases will therefore be essential, especially for domains such as Information Security, which are affected by Fear, Uncertainty and Doubt (FUD).

    To dive into the internals, the key components of the project had to be:

    • self-hosted, minimal overhead

    • Python-based, small code-base

    • documentation and learning focused

    Project Bookworm: Embedding and vectorization with different Vector Stores and DBs

    In my project, I utilized a BERT model, a type of Transformer network.

    Developments in the transformer network architecture include systems such as bidirectional encoder representations from transformers (BERT) and generative pretrained transformer (GPT). Pretrained word embeddings learned by transformers can be loaded in PyTorch using the popular open-source library provided by Hugging Face (https://huggingface.co/transformers/).

    Source: Interpretable AI - Building explainable Machine Learning systems (2022), Ajay Thampi

    According to Hugging Face benchmarks from 2022, it’s not a top-notch model (any more). I selected it mostly for efficiency and for the sake of an initial implementation. During the project, I learned that I should consult the benchmarks to select the right model. The approach my code takes is to process an entire sentence and project it into a 1024-dimensional vector. The selection of sentences (or similarly long spans of symbols) happens via “chunking”: the text is split into portions, fed to the Transformer model for embedding vectorization, and thereby projected into a numerical domain.
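
    A sketch of this chunking and embedding step; the splitter parameters and the 1024-dimensional encoder name are assumptions, not necessarily the exact model I used:

        from langchain_text_splitters import RecursiveCharacterTextSplitter
        from sentence_transformers import SentenceTransformer

        # Split the extracted text into overlapping chunks of roughly sentence length.
        splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
        with open("extracted_text.txt", encoding="utf-8") as fh:
            chunks = splitter.split_text(fh.read())

        # Any BERT-style sentence encoder with a 1024-dimensional output works here;
        # "BAAI/bge-large-en-v1.5" is one such model on Hugging Face.
        model = SentenceTransformer("BAAI/bge-large-en-v1.5")
        vectors = model.encode(chunks, batch_size=64, show_progress_bar=True)
        print(vectors.shape)  # (number_of_chunks, 1024)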

    The vectors get stored in a Vector Store based on FAISS. It’s a primitive implementation in which the vectorized data is serialized into a persistent Python pickle file. A Vector Database such as PGVecto or Milvus would be a more robust choice for larger data-sets. I wanted to use FAISS because, for this NLP task in particular, cosine similarity is preferable to the Euclidean (L2) distance, and FAISS supports cosine search via inner product on normalized vectors. Besides that, the resulting data-set on Kaggle is only roughly 3 GB.
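
    A minimal sketch of such a pickle-backed FAISS store, reusing the chunks and vectors from the previous snippet (the file name is a placeholder); cosine similarity is obtained via inner-product search on L2-normalized vectors:

        import pickle

        import faiss

        vectors = vectors.astype("float32")          # FAISS expects float32
        faiss.normalize_L2(vectors)                  # normalize in place
        index = faiss.IndexFlatIP(vectors.shape[1])  # inner product == cosine on unit vectors
        index.add(vectors)

        # Persist the index together with the chunk texts in a single pickle file.
        with open("bookworm_store.pkl", "wb") as fh:
            pickle.dump({"index": faiss.serialize_index(index), "chunks": chunks}, fh)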

    FAISS-based vector file store - vectorized data release on Kaggle

    The embedding vectorization of ~ 720,000 chunks took roughly 1 hour of computation time on Google Colab, using an Nvidia Tesla V100 GPU. I think the concurrency implementation might have created a bottleneck somewhere, and I will have to research whether a GPU utilization of 100% is enough of an indicator for optimal throughput.

    Google Colab allows you to accelerate your Machine Learning code with on-demand Nvidia GPUs

    Vector-search and using the Ollama self-hosted Mistral LLM

    Initially, I wanted to use OpenAI’s GPT-4 model via the APIs they provide, but the cost model for their embeddings appeared too difficult to understand. Therefore, I researched self-hosting and independent LLMs. I found that the Ollama project allows for an easy installation of a CPU-compatible LLM (slower, but possible) named Mistral. In the benchmarks I looked at, Mistral appears competitive with GPT-4, which is why I don’t feel that I am missing out on much. GPT-4 also has a problem with my field of work, Information Security / Governance, Risk & Compliance: due to its ethical constraints, it tends to tell me that the risks don’t matter so much. This is a key issue besides the hallucination problems.
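
    Querying the local Mistral model goes through Ollama’s HTTP API (default port 11434, after “ollama pull mistral”); the prompt composition below, including the retrieved top_chunks, is only illustrative:

        import requests

        def ask_mistral(prompt: str) -> str:
            # Ollama serves a REST API on localhost:11434; stream=False returns
            # the whole completion as a single JSON object.
            resp = requests.post(
                "http://localhost:11434/api/generate",
                json={"model": "mistral", "prompt": prompt, "stream": False},
                timeout=300,
            )
            resp.raise_for_status()
            return resp.json()["response"]

        # top_chunks is a placeholder for the texts retrieved from the vector store.
        context = "\n\n".join(top_chunks)
        answer = ask_mistral(f"Answer using only this context:\n{context}\n\nQuestion: ...")
        print(answer)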

    Maximal Marginal Relevance (MMR) search and Prompt Engineering

    To get a good response from any LLM, it’s essential to feed it enough input. The search results from the vector store / database have to be both relevant and sufficient in volume: qualitative and quantitative. This isn’t overly surprising, given that an LLM is just a model.

    In my experiments, I came across MMR search, which isn’t spoken about much. It’s an alternative to a strict (vector) similarity search: it balances relevance to the query against diversity among the returned results, and it can yield better results when combined with an LLM and a prompt. LangChain supports this search type as well, and it was useful for this data-set.
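
    A retrieval sketch using LangChain’s FAISS wrapper and MMR search; the embedding model and the query string are placeholders:

        from langchain_community.embeddings import HuggingFaceEmbeddings
        from langchain_community.vectorstores import FAISS

        embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")
        store = FAISS.from_texts(chunks, embeddings)  # chunks from the splitting step

        # MMR balances similarity to the query against diversity among the results:
        # fetch_k candidates are retrieved, and k of them are kept for the prompt.
        docs = store.max_marginal_relevance_search(
            "How does certificate pinning mitigate man-in-the-middle attacks?",
            k=5,
            fetch_k=25,
        )
        for doc in docs:
            print(doc.page_content[:80])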

    Conclusion

    The implementation of this was educational, and I haven’t finished my integration tasks. There are numerous learnings and takeaways in the

    • published Jupyter Notebooks

    • the data-set on Kaggle and GitHub

    • this small Blog entry (Q1 2024)

    I haven’t looked into training actual Transformer Networks or building LLMs such as Mistral or Llama myself.

    For the integration tasks, I want to look into a proper frontend, probably with a modern on-premises Low Code approach rather than a custom React application. Low Code shares the overall issue in software technology today: it’s very difficult to own an implementation end to end. I think the number of 3rd-party (SaaS / Cloud / uncontrolled) assets has to be kept to a minimum, because very few AI companies have sustainable business models.
