Remote Working Data Annotation options - Digital Nomad
During the past, well, let's think, 10+ years, I've been involved in Machine Learning projects (okay, they call it "AI" now, but that's an argument for a different time) that require lots of data to be ingested and used to train the Machine Learning model.
For instance, if it is text, we need lots of snippets of text that can then be categorised, so that the model "knows" that the 10 words grouped together in that context have a "meaning", and that meaning is labelled - this helps with querying later on. It also helps to extract out the entities and relationships between the words to give more context, the sort of thing the spaCy tooling does.
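To make that concrete, here's a hand-rolled sketch of what labelled text spans look like as data - the sentence and labels are made up for illustration, and with spaCy itself you'd just load a model and read doc.ents:

```python
# A hand-rolled sketch of span annotation: character offsets plus a label.
# (With spaCy itself you'd run nlp(text) and read doc.ents; this just
# illustrates the data shape a human annotator produces.)
text = "Acme Corp hired Jane Doe in London last March."

# Each tuple is (start_offset, end_offset, label) over the raw text
annotations = [
    (0, 9, "ORG"),       # "Acme Corp"
    (16, 24, "PERSON"),  # "Jane Doe"
    (28, 34, "GPE"),     # "London"
]

for start, end, label in annotations:
    print(f"{text[start:end]!r} -> {label}")
```

The human's job in the annotation tool is exactly that middle bit: drawing the span boundaries and choosing the label, which then becomes the training data.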
https://www.labellerr.com/blog/image-annotation-services-and-data-labeling-for-ai-models/
If it is imagery, we need lots of images with the segmented parts identified, usually with a bounding box to mark the elements / objects inside the image so that, again, they can be labelled and categorised. The same goes for videos, which are just lots of images put together in a sequence - adding textual descriptions of what is happening in specific scenes or time ranges within the videos also helps with the querying later on.
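For imagery, the annotation output is much the same idea - here's a sketch of a COCO-style record (the filename, labels and pixel values are invented; real projects each have their own flavour of this format):

```python
# A sketch of a COCO-style image annotation record; the filename, labels
# and pixel values are invented. Bounding boxes are [x, y, width, height].
annotation = {
    "image": "street_scene_001.jpg",
    "objects": [
        {"label": "car",        "bbox": [34, 120, 200, 90]},
        {"label": "pedestrian", "bbox": [310, 95, 40, 130]},
    ],
}

for obj in annotation["objects"]:
    x, y, w, h = obj["bbox"]
    print(f"{obj['label']}: top-left=({x},{y}), size={w}x{h}")
```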
Same goes for voice - that was always the interesting one: having the audio displayed visually, then putting start and stop lines around the word or words and labelling what they were, adding that extra element that only a human can provide.
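For audio, those start and stop lines boil down to timestamped spans - a made-up sketch of the shape of the data:

```python
# A sketch of word-level audio annotation: start/end timestamps in seconds
# plus the transcribed word - the "start and stop lines" drawn on a waveform.
# (Values are made up for illustration.)
segments = [
    (0.00, 0.42, "hello"),
    (0.45, 0.90, "world"),
]

for start, end, word in segments:
    print(f"{word}: {start:.2f}s -> {end:.2f}s (duration {end - start:.2f}s)")
```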
So what? I hear you say. Well, all of the above projects that I worked on required some form of simplistic tooling to be created - those "annotation tools". They could be as simple as a Python Flask app, a NodeJS / JavaScript app that ran in a web browser, or even a super simple Excel spreadsheet that exported to CSV format and could be read in from Python code.
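The Excel-to-CSV route really was that simple - a sketch of reading an exported file with Python's standard csv module (column names and rows invented for illustration):

```python
# A sketch of reading annotator output exported from a spreadsheet to CSV,
# using only Python's standard library (column names and rows are invented).
import csv
import io

# Stand-in for open("annotations.csv") - the exported file contents
csv_export = io.StringIO(
    "snippet,label\n"
    "refund my order,COMPLAINT\n"
    "love this product,PRAISE\n"
)

rows = list(csv.DictReader(csv_export))
for row in rows:
    print(row["snippet"], "->", row["label"])
```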
What's the big deal? Well, this was considered low-value, but essential, work, and the companies and clients I worked with would frown upon allowing the highly expensive Data Scientists they were paying for to do such mundane, but crucially important, tasks.
The point was, if this task was done wrong, it could massively impact the quality of the Machine Learning model and thereby affect the quality of the responses it would return.
Now, this was all BEFORE LLMs (Large Language Models) came onto the scene! Do you remember those dark days? Yes! Back when it was Python libraries, NLP (Natural Language Processing) and lots of complex layers of algorithms - sometimes 10-12 layers of differing techniques, all very complex. Training a hugely complex Machine Learning model with the specific SME (Subject Matter Expert) data for a paying customer "could" take about 40 people and a few months of time. That would be a large invoice.
Then OpenAI / ChatGPT started to evolve in the quiet depths of the background. I recall reading back in 2023 that they had actually outsourced the RLHF (Reinforcement Learning from Human Feedback) steps to workers in Kenya, paid less than $2 per hour. Now, apart from this being an absolutely appalling exploitation of human labour, it was just done and everyone moved on and forgot about this step - most people probably think that "magic" scraping of the internet happened and the ChatGPT models just figured it all out.
Want to fact check me on that one? Okay, here are a couple of links:
https://time.com/6247678/openai-chatgpt-kenya-workers/
https://slate.com/technology/2023/05/openai-chatgpt-training-kenya-traumatic.html
Now those days have moved on, the initial clean-up of the models has been done, and the focus is now on "how to train the models on useful things".
As the press reported articles like the above, it's not so easy for Big Tech to abuse workers anymore, and that is where the "Digital Nomad", Gen Z, Gen Alpha, non-corporate, "I want to live a real life" generation has stepped in to fill the gap. I'm actually quite envious of them - had the option been there when I was in my late teens / twenties, I would most definitely have been sitting around a swimming pool in Bali, with a laptop, "training" AI models and earning enough to get by!
And this is where I come full circle. Whilst I used to be the Chief AI Architect - what a fancy sounding job title - for quite a few multi-million dollar projects, where we would define the need for these tasks to exist, I would still engage, get hands-on, be part of the teams, and understand the details as well as the 50,000ft business goals and deliveries. I've always been curious, I've always been an Engineer, I've always wanted to know "how things work", and "annotating data" was a task that kept that curiosity engaged.
Yep, annotation can easily introduce bias from the humans doing the tasks, and it can be very opinion based. I recall we made one of the custom "annotation tools" force a triage: when one person annotated a document of text, someone else did the same, and if the tool found a difference it would flag it up and force a discussion to understand why one person did one thing and the other did another. This could be for a genuine reason - experience, misunderstanding, a multitude of reasons - the point being, one person did not have the power to bias the model.
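The triage logic itself was trivial - a sketch, with made-up documents and labels, of how flagging a disagreement between two annotators might look:

```python
# A sketch of the double-annotation triage: two people label the same
# documents and any mismatch is flagged for a discussion. (Document ids
# and labels are made up for illustration.)
annotator_a = {"doc1": "COMPLAINT", "doc2": "PRAISE", "doc3": "QUERY"}
annotator_b = {"doc1": "COMPLAINT", "doc2": "QUERY",  "doc3": "QUERY"}

# Only documents where the two labels differ need a human discussion
flagged = [doc for doc in annotator_a if annotator_a[doc] != annotator_b[doc]]
print("Needs discussion:", flagged)
```

Only the flagged documents went to a discussion; everything the two annotators agreed on flowed straight into the training set.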
So, back to today.
Data Annotation has now become an "outsourcing task" - not for the cheapest labour force; I believe that was attempted multiple times and, to be blunt, like cheap boots, you get what you pay for in the corporate world.
So, a compromise has been made - there are now companies set up to manage the demand for "Data Annotation" on an ad hoc project basis. Basically, a company wants or needs a specific localised Machine Learning model to be "locally trained", evaluated or tested, and they reach out to these broker companies to get regular people to perform these tasks as and when they can.
I decided to watch a few YouTube videos - of course, where else are you going to find all of this information nowadays? Facebook is a no-go ("full of boomers", as the Gen Z and Alpha crowd would say), Instagram is just TikTok v2.0, TikTok is just, well, TikTok - it has some useful snippets, but not meeting the demographic, I don't have an account - then there is LinkedIn, which is basically the Gen X work-version of Facebook, and then there is good old fashioned "Googling" it. Which is amusing in itself, as I just found myself not saying "use an online search engine to search the internet for website articles that can provide possible information about what you are looking for". That would be an AI Prompt nowadays, and behind the scenes it may or may not go off and do that search for you and summarise / collate the responses, or just pull it from the training data.
From the 3-4 videos - which, I admit, like all things, got a bit repetitive - I concluded that the following websites are offering themselves up as the brokers for such services.
As the videos stated, there are varying degrees of hoops to jump through to "qualify" as a remote worker, from simple sign-ups through to quite complex 4-5 hour assessments. So if you want to have a go at requesting an account and offering your services, you might spend half a day doing the assessment only to never hear back from the company - it's a gamble. From what I can also tell, you don't need to be technical or know how to code (although I would expect you to); you just need to be an analytical thinker who can "say what they see on the tin" and basically be the educator of the models, so that when they are used they provide better responses to the people using the tooling.
In no particular order, I'll list them here:
https://www.clickworker.com/clickworker/
https://www.alignerr.com/
Now, this one took me off on a bit of a tangent, thanks to a novel concept I found in the middle of its approval process: an automated "AI" interview with a fake human being, via video.
Why was this novel? Well, a looooooooooooooooong time ago (2017/2018), I built out one of the first chatbots that had a 3D human (okay, it was a scary animation, not realistic, but they were going that way, as you can see in the 2nd article link) that could emulate expressions, listen to what you were saying, pass it over to a chatbot engine for processing and provide responses. It was also able to hook up to the webcam to "assess you", looking for facial expressions etc... again, it was way ahead of its time - perhaps too early!?!
https://tonyisageek.blogspot.com/2018/04/totally-100-selfish-plug-httpsdeveloper.html
https://tonyisageek.blogspot.com/2018/05/is-it-real-person-does-it-really-matter.html
But it was so cool to see that this is now the normal thing with ZARA above - but again, it can have flaws: if you don't get it exactly right, you'll get bounced, and it also means that those who know how to "play the system" will figure out how to say all the right things, even if they aren't qualified. Pffhhh, human nature versus software.
https://outlier.ai/blog/the-flexibility-work-pay-i-always-wanted-why-experts-love-outlier
https://outlier.ai/blog/your-first-steps-on-outlier-what-to-expect-from-onboarding
And last, but not least, the big one that everyone wants to use, but most fail to get beyond the first assessment - and even those who do often never hear back or never get projects.
https://www.dataannotation.tech/
It's worth reviewing the Reddit posts too: https://www.reddit.com/r/dataannotation/
https://app.dataannotation.tech/worker_signup?
As was mentioned in one of the YouTube videos, on fiverr.com there are numerous people offering their services to do data annotation who seem to be making a few quid on the side - although I suspect that on this website they are actually just using Python code and off-the-shelf YOLO models etc. to do the segmentation, bounding boxes, and so on.
https://www.fiverr.com/categories/data/data-labeling-annotation?source=vertical-buckets
You know what?
I was originally just posting this article as a way to express that an activity, a set of tasks, that I used to help co-ordinate on corporate-run projects has now become a business model and a normal part of life for people "training the models". However, as I'm looking into it, I realise I actually have a spare 10-20 hours per week, in the evenings and maybe across the weekends, where I could just "put myself to work" and do some of this myself!?!?
Okay, it's not earning mega-bucks, however it is keeping my hand in on the ways and techniques the models are currently being trained with, and on how best to get the best out of the models - which has the potential to be REALLY useful over the next 1-2 years. So it could be beneficial on both sides of the equation.
Well, it is Sunday afternoon, I'm writing this article and I do have a spare 3-5 hours in front of me. Maybe I should get a big cup of tea and go through the assessment process for a couple of them - what's the worst that can happen?