Tech
Building My Own Chatbot


Last semester, I had the opportunity to take a brand-new class at Georgia Tech called Conversational AI. It was taught by Professor Larry Heck, who previously led Microsoft's and Samsung's voice assistants: Cortana and Bixby.
Our final project for the course was to create our own Large Language Model (LLM) fine-tuned to be a Georgia Tech advisor for the School of Electrical and Computer Engineering.
So basically, a chatbot, à la ChatGPT, that could advise students on information related to their degree requirements and graduation plans.
In teams of two, we were given a month to complete this task. After many long days and tons of trial and error, my team created an LLM that outperformed the baseline model on our test set of questions by 11.7%.
I had a blast with this project and wanted to chronicle the steps our team took to achieve our results.
Base Model

We chose Meta's open-source LLM, Llama-2-7B-chat-hf, as our base model for this project, as strongly recommended by Professor Heck.
We chose the 7B parameter model because it was the best-performing open-source model that could run on a single Nvidia A100 GPU, which was all the computing power available to us for the project.
Raw Data Collection
Our next task was to obtain as much data related to the Georgia Tech ECE curriculum as possible.
We started with some good course-offering datasets, but also did the grunt work of scraping nearly every single GT ECE webpage for more data on degree requirements, rules, and more.
It was tedious, but when you consider that LLMs like ChatGPT were reportedly trained on some 300 billion words of scraped text, our haul was just a drop in the bucket.
Tools like Puppeteer and Python Excel/PDF scrapers helped expedite things, but the entire web scraping and data collection effort still took around two weeks.
While I just gloss over it here, let it be known that it took some serious time and energy to be done properly.
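To give a flavor of the scraping side, here is a minimal Python sketch of the idea using requests and BeautifulSoup. The URL and the stripped tags are illustrative, not our exact pages or tooling (a lot of the real work went through Puppeteer and the Excel/PDF scrapers mentioned above).

```python
# Minimal sketch of scraping one page of curriculum info into plain text.
# The URL below is a hypothetical example, not necessarily a page we used.
import requests
from bs4 import BeautifulSoup

def scrape_page_text(url: str) -> str:
    """Fetch a page and return its visible text with markup stripped."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Drop non-content tags so only human-readable text remains.
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    return " ".join(soup.get_text(separator=" ").split())

if __name__ == "__main__":
    text = scrape_page_text("https://ece.gatech.edu/undergraduate-programs")
    print(text[:500])
```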
From Raw Data to Consumable Data
This part was tricky.
As nice as it would have been to just feed the LLM a bunch of JSON and TXT files of raw data, that was unlikely to have any impact on fine-tuning, because the model has no way of making sense of data in that form.
Instead, we had to present the data in a way that was digestible for the LLM, enabling it to actually return results specific to our use case.
We followed the suggestions of some online tutorials and transformed our raw data into question-answer pairs, the same conversational format the LLM would use when interacting with students who had degree-related questions.
It took forever.
There is no straightforward way to turn raw JSON and TXT into question-answer pairs. We tried a bunch of different options with only limited success:

- Feeding GPT-4 our raw data and asking it to generate question-answer pairs. This was mildly successful, but it would only generate 10 Q/A pairs at a time before necessitating a new prompt. Even when prompted to give us 20 Q/A pairs at a time, it would only give us 10. The other limitation was that GPT-4 capped the number of queries you could make every few hours.
- Using LMQG, a Python package based on a couple of research papers published in 2023, which generated question-answer pairs from input text. This worked really well and ran much faster than GPT-4, but you could only input about one paragraph of text at a time, and it only worked well with TXT files.
- Parsing through spreadsheets and generating our own questions. For super structured data, like an Excel file of course schedules, it was easy to parse each line and generate a QA pair for each row of data (a rough sketch of this approach follows the list).
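Here is a rough sketch of that third approach. The CSV file name and column names are hypothetical stand-ins for our real spreadsheets, but the idea is just templating a question and an answer out of each row of structured data.

```python
# Sketch of turning a structured course-schedule spreadsheet into Q/A pairs.
# The file name and columns are hypothetical stand-ins for our real data.
import csv
import json

def schedule_to_qa_pairs(csv_path: str) -> list[dict]:
    pairs = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            question = f"When is {row['course_id']} ({row['title']}) offered at Georgia Tech?"
            answer = (
                f"{row['course_id']} is offered in {row['semester']} "
                f"and is worth {row['credits']} credit hours."
            )
            pairs.append({"question": question, "answer": answer})
    return pairs

if __name__ == "__main__":
    qa_pairs = schedule_to_qa_pairs("ece_course_schedule.csv")
    with open("ece_schedule_qa.json", "w") as f:
        json.dump(qa_pairs, f, indent=2)
```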
Between these three solutions, we cobbled together around 3,200 question-answer pairs in just a few days.
It felt like an impressive feat (no other team in our class that chose the QA route came close to the size of our dataset), but we knew the results would be modest at best.
According to Professor Heck, it would probably require hundreds of thousands of QA pairs to achieve the desired result.
Probably millions.
Ooof.
Perhaps with more time and ingenuity, we could have gotten there.
Fine Tuning

With a 7 billion parameter model and only 3,200 QA pairs to use, we had to find a method of fine-tuning that would make our dataset relevant to the pre-trained model and its existing swaths of data.
Oh, and if it could also not crash our single GPU that would be nice too.
Thankfully, as of 2023, there was a new fine-tuning method for this exact purpose: QLoRA.
QLoRA is a parameter-efficient fine-tuning (PEFT) method, which means the number of trainable parameters in the model is drastically reduced.
So instead of fine-tuning all 7B parameters, the base model's weights are frozen and only a small set of added low-rank adapter weights are actually adjusted.
This allows the model to be trained on a single GPU, and it makes training efficient enough that we could iterate on our fine-tuning hyperparameters for better results.
It's the latest in a line of PEFT techniques that includes LoRA; QLoRA adds 4-bit quantization of the frozen base weights, which makes it more memory-efficient than the rest while maintaining similar results, both welcome features in our hardware-constrained environment.
So we went with it.
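For context, here is a rough sketch of what a QLoRA setup looks like with the standard transformers + peft + bitsandbytes recipe. The rank and alpha shown are the values that ended up working best for us (more on that below), while the quantization settings and target modules are common defaults and assumptions on my part, not necessarily our exact training script.

```python
# Hedged sketch of a QLoRA setup with transformers + peft + bitsandbytes.
# Quantization settings and target modules are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-2-7b-chat-hf"

# Load the frozen base model in 4-bit precision (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach small low-rank adapters; only these weights get trained.
lora_config = LoraConfig(
    r=64,            # rank
    lora_alpha=128,  # alpha
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # a tiny fraction of the 7B total
```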
Fine-Tuning Environment

We conducted our fine-tuning in two different environments: Georgia Tech's PACE cluster, which we had access to as GT students, and Google Colab, which (sometimes) gave us access to A100 GPUs for $11 a month.
Both environments were super finicky. I've come to believe that's a result of LLMs still not being quite mainstream for most developers, so the accompanying dev tools just aren't there yet.
In fact, some of the packages we used (transformers and peft, specifically) were pushing out updates to address bugs we were experiencing the very week our project was due!
It was a glimpse into the fact that all this stuff is happening in real time.
Modifying Hyper-parameters and Testing
QLoRA comes with a bunch of parameters that we tweaked to get the best result possible.
However, by far the most effective improvement to our model came from our prompting. We tested outputs by asking the same questions to the same models but prepending the additional context, "For Georgia Tech," and all the models improved tremendously across the board.
The fine-tuning dataset did help the model, but the models were still relying heavily on pre-trained text. They just needed context to be oriented to the task at hand.
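As a tiny, illustrative example (the helper below is hypothetical, and the exact wording of our prompts differed), the change amounted to something like this, using the standard Llama-2 chat format:

```python
# Hypothetical sketch of prepending the "For Georgia Tech," context to a
# question before wrapping it in the standard Llama-2 chat format.
def build_prompt(question: str, add_context: bool = True) -> str:
    if add_context:
        question = f"For Georgia Tech, {question}"
    return f"<s>[INST] {question} [/INST]"

print(build_prompt("how many credit hours does the ECE degree require?"))
# <s>[INST] For Georgia Tech, how many credit hours does the ECE degree require? [/INST]
```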
The QLoRA parameters we tweaked were the learning rate, alpha, and rank. Due to time constraints, we did not modify the dropout, number of epochs, or other parameters that could have improved our model.
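To give a feel for the iteration loop, here is a hedged sketch of how a sweep over those three knobs could be wired up. Only the (2e-5, 128, 64) combination is our confirmed best result; the other grid values, the batch size, and the elided fine-tuning step are illustrative assumptions.

```python
# Illustrative sweep over learning rate, alpha, and rank. The actual
# fine-tuning/evaluation step is elided; this just builds the configs.
from itertools import product
from peft import LoraConfig
from transformers import TrainingArguments

learning_rates = [2e-5, 2e-4]
alphas = [16, 128]
ranks = [8, 64]

for lr, alpha, rank in product(learning_rates, alphas, ranks):
    lora_config = LoraConfig(
        r=rank,
        lora_alpha=alpha,
        target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM",
    )
    training_args = TrainingArguments(
        output_dir=f"runs/lr{lr}_a{alpha}_r{rank}",
        learning_rate=lr,
        num_train_epochs=3,             # left at the default; we did not tune epochs
        per_device_train_batch_size=4,  # assumption, constrained by GPU memory
    )
    # ...fine-tune with these configs and score the result on the test questions...
```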
We tested each of our models against the Llama-2-7b pre-trained model and scored the accuracy on a set of test questions with objectively correct answers relating to the Georgia Tech ECE curriculum.
Our scoring rubric was based on each model's context, accuracy, and concision in answering the question.
The Llama-2 base model scored 62.9% over our set of test questions, while our best-performing model scored 70.3%, a relative improvement of 11.7%!
Our best-performing model used a learning rate of 2e-5, an alpha of 128, and a rank of 64; explanations for why these values worked for us are in our final paper. However, success with these values was likely dataset-specific, as they contradicted the results of the LoRA paper.
A Glimpse Into What's Coming

Our solution was pretty good for a team of two given one month of work. But there's no chance our best model was ready for deployment in its current form.
That's the reality of a class project: there's only so much you can do under the resource and time constraints that companies doing this stuff don't have to contend with.
But, the cool part was getting my hands dirty with all the problems that are prevalent in building production-ready LLMs.
Data collection.
Hardware constraints.
Fine-tuning and modifying hyper-parameters.
Hell, even just setting up an environment that worked properly was a Herculean task.
It was a far cry from building a web app by popping open localhost and hot reloading my way through each issue.
That’s probably why I enjoyed the project so much. In school, you rarely get the chance to get your hands dirty with new, relevant technology. But in this course, we got to touch on all aspects of engineering the most cutting-edge technology out right now.
I'm very thankful to Professor Heck for giving us a taste of what real-world LLM building is like!
Feel free to check out all the source code, a couple of models, and the final paper!