Who needs all the GPUs for ML training?

So... some great salesperson at Nvidia took a look in the warehouse one day and said, "How come we have so many GPUs on the shelf? Who ordered all these? Was it the same person who ordered all the oil drums in Half-Life 2? ...How the hell am I going to sell all these?"

...and thus, the AI/ML training requirement using GPUs was born.

Okay, so that could be a total fabrication (we'll never know), and I'm sure I'll get 1,001 comments telling me the real history (psst: I really don't care, and if you keep reading you'll figure out why).


I was at work the other day when the subject of training ML models came up, and some very clever Data Scientists were telling me that the training I wanted to do would require exclusively using Cloud servers, as they have the GPU scalability needed to get the training done in a decent amount of time.  I wasn't convinced.

Also, we cannot use the "Cloud" (i.e. someone else's servers); we're offline and OnPrem.  Yeah, the 1990s are back in trend again.  I asked for a spec of what we would need, and I fell off my chair when the equivalent of a Cray supercomputer spec was placed before me.  Really? REALLY? Nope.  I don't believe it.  I'm willing to accept that running the ML training on OnPrem kit is doable and that the sacrifice is "time": it might take 7 days to execute, but we'll get the model, and then it's just a case of deploying it and executing against it.

This bugged me.  I also wanted to perform the ML training on the deployed hardware that we have: basically, a ruggedised server with 16 CPUs, 128GB of RAM and 3TB of HDD.  Not bad for something the size of a laptop that fits into a rucksack.  I want to run the ML training on that piece of kit.  I was laughed at by the Data Scientists.  It's a bit unfair on them, as they report to me, so technically I'm the boss.  So I got bossy.  I said, "We're going to make this happen.  I don't know how, but we are."  One thing they've learnt, working with me over the past year or so, is... I usually figure it out and we all bask in the glory.

As I do... I keep many fingers in many different pies, observe a lot of information and keep threads on what might be useful later, and I rarely disregard information that isn't relevant today, as it might come in handy in the future.  (Yes, my brain often complains that the HDD is full and I need to delete files.)

So, it was with great glee that I read this news article:

https://spectrum.ieee.org/tech-talk/artificial-intelligence/machine-learning/simple-neural-network-upgrade-boosts-ai-performance


Whilst casually reading through this article, it started to justify quite a lot of the debates I've been having with IT people in general, i.e. modern-day developers relying on libraries and frameworks.  Whilst this is great for acceleration and re-use, it can start to bloat your code and its execution if you're only using, let's say, 2 functions out of the 3,000 in a library.  If the library is not written well, you end up loading all of that code into memory just for the 2 function calls you're making... kiss goodbye to a big chunk of RAM for no good reason.  But hey, RAM and HDD are cheap nowadays (still no excuse for bloat).

Anyway, this links back to the GPU niggle that I've had for a while.  Why do we need a Cray supercomputer to do some "simple" training?  Going through the above article, as it states, by "fixing" something at the core framework level a huge performance increase was made and everyone benefits.  Awesome.  No, I'm not one of those people who has to make everything from scratch; I appreciate the benefits of frameworks & libraries, but you have to balance what they offer against what it is you really want to achieve and whether you are willing to compromise or not.  Time is usually the main reason.  You need to put something together fast (because that is what IT has now become: make it quick, fail fast, show me a PoC working, that's great, now put it into production..... errrr, it's a PoC, it wasn't built for production purposes... you know the thread of that discussion).
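
Back to the library-bloat point for a second, because it's easy to see for yourself.  Here's a rough sketch (pandas is just a stand-in for "any big library you only call two functions from", tracemalloc only sees Python-level allocations so the real cost is higher, and the exact figure will vary by machine and version):

    import tracemalloc

    # Measure how much Python-level memory simply importing a big library costs,
    # before a single function from it has even been called.
    tracemalloc.start()

    import pandas as pd  # stand-in for "that library you only need 2 functions from"

    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    print(f"Importing pandas allocated roughly {peak / 1024 ** 2:.1f} MB of RAM "
          f"before any of its functions were called")

Run that once and you get a feel for what each "convenient" import is quietly costing you.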

Anyway, whilst reading the above article, my inner ego was getting very smug, as I was going to use it as a reference in future debates to justify why I make certain decisions in the workplace.

Then, as if the universe was listening to my thoughts, this article fell into my lap:

https://spectrum.ieee.org/tech-talk/artificial-intelligence/machine-learning/powerful-ai-can-now-be-trained-on-a-single-computer

Well, look at that!  "Necessity is the mother of invention" - totally agree!  A student didn't have access to Cray-supercomputer-style kit to do the ML training, so he looked for alternative ways to achieve what was needed and approached the problem differently.

Okay, he ended up using a "beefy" laptop with 36 CPUs and 1 GPU (yes, 1... ONE...) and it performed exceedingly well!  There... I now have my reference to back up my desire to do the same thing on the kit I have to hand.  I will make it work, and now I'm inspired even more to do so.

Here's a link to the GitHub repo for Sample Factory: https://github.com/alex-petrenko/sample-factory

Basically, it's an asynchronous reinforcement learning tool, so it's not going to be useful for supervised and unsupervised learning, but it's a start.  A very good start.
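
To be clear, the sketch below is my own back-of-a-napkin illustration of the asynchronous idea, not Sample Factory's actual API: rollout workers step their environments and push experience onto a queue, while a separate learner process drains it, so neither side sits around waiting for the other.

    import multiprocessing as mp
    import random

    def rollout_worker(worker_id, experience_queue, steps):
        """Pretend environment loop: pushes (worker, step, reward) tuples."""
        for step in range(steps):
            transition = (worker_id, step, random.random())  # stand-in for real env data
            experience_queue.put(transition)
        experience_queue.put(None)  # sentinel: this worker has finished

    def learner(experience_queue, n_workers):
        """Consumes experience as it arrives; a real learner would update the policy here."""
        finished, consumed = 0, 0
        while finished < n_workers:
            item = experience_queue.get()
            if item is None:
                finished += 1
            else:
                consumed += 1
        print(f"learner consumed {consumed} transitions from {n_workers} workers")

    if __name__ == "__main__":
        n_workers = 4
        queue = mp.Queue()
        workers = [mp.Process(target=rollout_worker, args=(i, queue, 1000))
                   for i in range(n_workers)]
        for w in workers:
            w.start()
        learner(queue, n_workers)
        for w in workers:
            w.join()

The real thing obviously does a lot more than count tuples, but that decoupling is the part that keeps a single box with one GPU busy.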

In a nutshell, the simple answer (as with most answers) is that it all depends on the question being asked, but it does sound like a lot of the problems encountered come down to people using Python to read/write/process huge datasets without caching them in memory.
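
As a trivial illustration of that last point (the file name and the "training step" are made up; the pattern is what matters): read the dataset once and keep it resident in RAM, rather than paying the disk I/O cost on every epoch.

    import numpy as np

    DATASET = "train_data.npy"  # hypothetical preprocessed dataset file

    def _run_one_epoch(data):
        data.mean()  # stand-in for the actual training work

    def train_rereading_disk(n_epochs):
        """Slow pattern: the dataset is re-read from disk on every single epoch."""
        for _ in range(n_epochs):
            data = np.load(DATASET)  # disk I/O repeated n_epochs times
            _run_one_epoch(data)

    def train_cached_in_ram(n_epochs):
        """Faster pattern: one read up front, then every epoch iterates the in-RAM copy."""
        data = np.load(DATASET)  # single read; costs roughly the dataset's size in RAM
        for _ in range(n_epochs):
            _run_one_epoch(data)

And if the dataset genuinely doesn't fit in RAM, that's where something like memory-mapping (np.load(..., mmap_mode="r")) earns its keep, rather than reaching straight for a rack of GPUs.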


The Data Science "lazy" answer of "you need hundreds of GPUs and millions to do that" is the same style of answer as "I'll just add these 10 frameworks & libraries to my code; RAM is cheap, we can just up it from 64GB to 128GB with the click of a button"..... no no no no... These are the people who have never written code in C.  They have never worked on limited-capacity devices, where 1MB (yes, MB, not GB!) of RAM was a luxury and 640KB of that 1MB was taken by the OS, so every line of your code had to be justified and every variable that took up memory had to be cleaned up efficiently in order to keep your code from crashing the device.  People with that mindset are becoming a dying breed, and it's a shame, as that mindset is what got us all to where we are today.

I want to get to tomorrow though and to do that, sometimes you have to go back to yesterday and look at how things were done and apply the techniques in a new way to accelerate the future.

I'm looking forward to next week.....

Unlike the salesperson from Nvidia.... who, hopefully, hasn't asked for 1,000,000 GPUs to be made so they can make a killing out of AI/ML training.

 
