Raspberry Pi 5 & offline LLM (Ollama.ai)

Well, it's been about a year since I was writing some Python code for this (yes, I'll admit I did that & to be fair it worked okay... as cardboard code). I was using llama.cpp and LangChain, and I wrote RAG and chain-of-thought (CoT) code. I was quantizing my own models on the laptop I was using, fully exploiting its 64 CPUs, 16GB GPU & 128GB RAM, and pushing the boundaries of that spec with streaming Q&A: ingesting my own documents, storing them in a Chroma vector store and then questioning the content... I actually overloaded and crashed the Windows 10 OS on the laptop & it needed a fresh rebuild afterwards... it never quite worked the same.

Anyway, my point being: I was chuffed that I could run a Llama LLM offline on a laptop & it worked pretty reasonably, using code that I'd written, and it was doing okay.  I attempted to explain it to other people & it turned out it was too complicated...

Then a very early version of privateGPT appeared, so I re-purposed that code, modified & tweaked it, and took a direction to use the GPT4All model, which seemed to work well... then I wrote a Python Gradio UI that worked very well.  However, it was still a single-user experience & I wasn't going to take the Python code further to handle the document upload side of things... it was good enough and simple enough for other people to understand, and basically I then got bored and handed the code over to other people to work with...

I then decided to sit on the fence as far as LLMs were concerned... grabbed some popcorn & focused on some other things.  Turns out that was a good idea.  Attempting to keep up with the latest changes, which were happening on a daily basis, was starting to distract me far too much, so I decided it was simpler to step back, keep an eye on things and wait for some of the dust to settle.


Then I got a Raspberry Pi 5 8GB device in Nov 2023... I installed Ubuntu and did some other work with the device.  Now that it's the end of Jan 2024 and that work has been wrapped up, I thought I'd take a look at how things have moved on & I wondered whether I could take last year's code, or a variant of it, and run it on the Raspberry Pi 5...


Then along came Ollama.  Wow, that has made life so much simpler / easier!

Here are some screenshots, more as a reminder for me, that show Ollama being set up & used on the RPi 5, as well as a chat UI interface, the ability to use LangChain and being able to use JavaScript too...
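
Since the screenshots don't capture the detail, the day-to-day usage really does boil down to a couple of commands, roughly along these lines - the model names here are just examples of what was around at the time:

    # download a model from the ollama library (model name is just an example)
    ollama pull llama2

    # chat with it interactively in the terminal
    ollama run llama2

    # list what's been pulled locally so far
    ollama list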

I'm going to take this a bit further now, as these examples showed that the responses were FAST.  You may think 2-5 tokens a second is NOT fast, but trust me: when I was using the original laptop it was taking hundreds of seconds to get a response that now takes 1-2 seconds.

Really liked the fact that you can use this model to question the text from an image - this may have some interesting uses... hmmm... need to think more about this.
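
If I end up scripting that rather than doing it through the chat UI, the Ollama API can take base64-encoded images alongside the prompt.  Something roughly like this should do it - the llava model name and the image path are just assumptions for the sketch:

    // rough sketch: question an image via the local Ollama API (Node 18+, ES modules)
    import { readFileSync } from "node:fs";

    // read the image and base64-encode it (file name is just an example)
    const imageBase64 = readFileSync("./photo.png").toString("base64");

    const response = await fetch("http://localhost:11434/api/generate", {
      method: "POST",
      body: JSON.stringify({
        model: "llava",                 // multimodal model name - an assumption here
        prompt: "What does the text in this image say?",
        images: [imageBase64],          // base64-encoded image(s)
        stream: false,                  // one JSON reply rather than a token stream
      }),
    });

    const data = await response.json();
    console.log(data.response);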

I'm so going to ramp up a JavaScript app to use the API and LangChain...


Ah, this should have been first: basically the curl command that does the install, and this is what happens behind the scenes - again, pretty simple:
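
From memory it's the one-liner from the ollama.ai homepage, something like:

    # download and run the ollama install script (pulled from the ollama.ai site)
    curl https://ollama.ai/install.sh | sh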

And for when I get back to work next week and set this up on the NVIDIA Jetson Orin device that I've had sitting on my desk, wondering what to do with it - it has a GPU, so I'll have to set up CUDA as explained below:
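
I haven't done that bit yet, but my assumption is that the Jetson's JetPack stack already ships CUDA, so it should mostly be a case of checking it's there before running the same install script - roughly:

    # on a Jetson, the CUDA toolkit comes as part of the nvidia-jetpack packages
    sudo apt install nvidia-jetpack

    # check the compiler is visible (it lives under /usr/local/cuda/bin if not on the PATH)
    nvcc --version

    # then the same ollama install script as on the Pi
    curl https://ollama.ai/install.sh | sh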


Ah, talking of the original privateGPT: this is how I can use it as a skeleton to morph my previous code to use Ollama and LangChain, via JavaScript this time - although, looking at the Python code above, it looks super simple.
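
The JavaScript side would probably start out as something like the sketch below - the import path is from memory and tends to move around between LangChain releases, and the model name is just whatever happens to be pulled locally:

    // rough skeleton for the LangChain.js + Ollama version (paths/names are assumptions)
    import { Ollama } from "@langchain/community/llms/ollama";

    const llm = new Ollama({
      baseUrl: "http://localhost:11434",   // the local ollama instance on the RPi 5
      model: "llama2",                     // whichever model has been pulled locally
    });

    // simple one-shot call; the RAG / document-ingest parts would build on top of this
    const answer = await llm.invoke(
      "Why is running an LLM locally on an edge device useful?"
    );
    console.log(answer);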

This might be about the right time to step back into the LLM world running on edge devices... the RPi 5 is a good start, then the Orin, and then I reckon it'll be time to run on Android and iOS platforms - who needs all that "Cloud" bullsh*t? No, that is so 2014...


Here's the future... it looks a lot like 2001 :-D



UPDATE:
RAG performance still sucks on the RPi 5 and the Jetson.  When I say sucks, I mean it takes 30-300 seconds, which is still pretty good considering, but it's not as quick as the native LLM access.  I'm going to do some performance tests between these two devices, the laptop mentioned at the start, and a beefy server with dedicated GPUs (yes, I will also run it on Azure Cloud, but I still don't see the point of running anything in any "Cloud" environment).  I guess I'm more of a Hybrid-Cloud kinda person.
