Hash checking similarities of text

This isn't a new thing, it's probably not even used that much, although it should be.  I have a scenario where I'd like to get text, think of news articles from a variety of sources, maybe 4-5 or even 1-200!? and I would like to analyse the text and if certain trigger content is detected then notify me.  

Such as, a major security breach has happened and it is being reported on multiple cyber security websites and I should action on it pretty fast - now, I don't want to get 1000 alert notifications of the same or similar thing, I would just like to get 1.  Or at least less than 10!


I was pondering on how to do this and then duck.ai sprang to mind.  Why use duck.ai? well, it doesn't slurp what you are doing, it doesn't use any of the data for training, it's also free.  So, why not?!


I gave it this prompt:

If I have multiple sources reporting the same textual data, can you show how to create a hash-based change detection system with python that would identify that the textual articles, whilst not the exact same text have a high probability of being the same content in order to reduce duplication


I got a pretty good response. I did re-ask to tweak it to include the pandas element and I could go further and ask it to then use a python library to visualise the similarities, however, this is just a small part of a workflow, the real aim would be to "filter the noise" here and then next step I could ask is to then analyse the tokenised values with ntlk to classify the urgency or use a method to determine the quantity of a repeated article content to raise it up a priority order to imply it is of more importance.  I could even ask it to write the API calls out to specific websites to scrape the content or read files from a folder or something as input rather than the small fixed array in the code.

I installed the couple of python libraries that were needed and the code executed as defined.  This was a really good starter for 10, it just worked.  

Whilst I'm not the greatest fan of LLM output code, this is really useful to get started with.  What do I mean by that statement? well, there is no concept of exception handling or security, it is just the bare minimum, but it is good enough to prove the concept and get the basic code structure.

I could now go back and ask for more exception handling to be added etc..etc... but I fear I would then cycle around for about an hour tweaking and enhancing, when I could actually just write that bit myself.


Anyway, the following is the output I asked for, yes, I asked it to output so I could paste it into a blog post.  Now, you may think, why not just write something in n8n that could auto-post into a blog post? I'm not too much of a fan of totally removing a human in the equation, also I'm also not ready to hand over my security credentials to a tool that could easily harvest that info behind the scenes (as recently proven by clawdbot), so for now, I'll still do a small manual task.


As to whether this is the "BEST" way to do this, that is open to debate.  It is "ONE" way to achieve this, further in-depth testing with much longer article text will be required to see if this is indeed of any use or not.  As with all things AI-slop, it might just be that.  It might look and sound convincing, but it might be very limited in reality.


---


## Creating a Hash-Based Change Detection System in Python


In today’s information-rich environment, it's essential to identify and analyze similar textual content effectively. This guide will demonstrate how to implement a hash-based change detection system using Python, aimed at recognizing articles that, while not identical, contain similar content. We’ll utilize libraries such as `hashlib`, `nltk`, `scikit-learn`, and `pandas` to achieve this.


### Prerequisites


To get started, ensure you have the necessary libraries installed:


```bash

pip install nltk scikit-learn pandas

```


### Step 1: Preprocessing the Text


First, we’ll preprocess the text by converting it to lowercase, removing punctuation, and tokenizing it. This is crucial for standardizing the articles before further processing.


```python

import hashlib

import string

import nltk


# Download NLTK resources

nltk.download('punkt')


def preprocess_text(text):

    text = text.lower()

    text = text.translate(str.maketrans('', '', string.punctuation))

    tokens = nltk.word_tokenize(text)

    return ' '.join(tokens)

```


### Step 2: Creating a Hash of the Text


Using SHA-256 hashing, we can generate a unique identifier for each article, allowing us to track changes efficiently.


```python

def create_hash(text):

    return hashlib.sha256(text.encode('utf-8')).hexdigest()

```


### Step 3: Calculating Similarity


We calculate the similarity between articles using cosine similarity, which measures how alike the articles are based on their vector representations.


```python

from sklearn.metrics.pairwise import cosine_similarity

from sklearn.feature_extraction.text import CountVectorizer


def calculate_similarity(articles):

    vectorizer = CountVectorizer().fit_transform(articles)

    vectors = vectorizer.toarray()

    cosine_sim = cosine_similarity(vectors)

    return cosine_sim

```


### Step 4: Combining Everything


Now, let’s bring everything together into one function that processes multiple articles and generates a readable similarity matrix.


```python

import pandas as pd


def detect_changes(articles):

    processed_articles = [preprocess_text(article) for article in articles]

    hashes = [create_hash(article) for article in articles]

    similarity_matrix = calculate_similarity(processed_articles)

    

    # Create a Pandas DataFrame for better readability

    similarity_df = pd.DataFrame(similarity_matrix, columns=[f'Article {i+1}' for i in range(len(articles))],

                                  index=[f'Article {i+1}' for i in range(len(articles))])

    

    return hashes, similarity_df


# Example usage

articles = [

    "The quick brown fox jumps over the lazy dog.",

    "A fast brown fox leaps over a lazy canine.",

    "The quick brown fox leapt over the lazy dog.",

]


hashes, similarity_df = detect_changes(articles)


# Print hashes

print("Hashes:")

for i, h in enumerate(hashes):

    print(f"Article {i+1}: {h}")


# Print similarity DataFrame

print("\nSimilarity Matrix:")

print(similarity_df)

```


### Output Explanation


Running the code will produce a set of unique hashes for each article along with a similarity matrix. The matrix shows the cosine similarity between articles:


```

Hashes:

Article 1: 50f2a1e2fa9843f97f933f2865ed310b3b19156941d64c3a1e0e8734b3658fa5

Article 2: 46c803be091cf8abb55e2f6b8d6a7434fb7f9299e70f5a925488747cdbb473d1

Article 3: 50f2a1e2fa9843f97f933f2865ed310b3b19156941d64c3a1e0e8734b3658fa5


Similarity Matrix:

           Article 1  Article 2  Article 3

Article 1    1.000000    0.544941    1.000000

Article 2    0.544941    1.000000    0.568681

Article 3    1.000000    0.568681    1.000000

```


- **Hashes:** Each article's unique hash reflects its content.

- **Similarity Matrix:** Each entry represents the cosine similarity between two articles. Values closer to **1** indicate high similarity, while those closer to **0** show dissimilarity.


### Conclusion


This implementation provides a robust system for identifying similar articles, effectively minimizing duplication in your dataset. With this approach, you can enhance your content curation processes and maintain a more organized information flow.


Feel free to use this code as a foundation for building your own content analysis tools!

--- 


As you do, I thought about connecting this up to read from websites and I asked for that code, which it generated, however, there's all the issues with cookie acceptance, captchas, API keys, security etc..etc.. that would be too much of a challenge.

So I opted for just re-asking if I could read from .csv files, after all, that's what every Data Scientist seems to revel in (I blame pandas library for that :-) ).  Here's the output from that request:

can you remove the connection to external websites and instead read the news article content from several .csv files instead

GPT-4o mini

Certainly! We can modify the code to read article content from several .csv files instead of connecting to external websites. This will allow you to work with local data files. Here’s how you can implement this:

Required Libraries

Ensure you have the required libraries installed:

bash

pip install pandas nltk scikit-learn matplotlib seaborn

Implementation

Here’s how to adjust the previous implementation to read articles from .csv files:

python

import hashlib import string import nltk import pandas as pd from sklearn.metrics.pairwise import cosine_similarity from sklearn.feature_extraction.text import CountVectorizer import matplotlib.pyplot as plt import seaborn as sns # Download NLTK resources nltk.download('punkt') # Function to preprocess the text def preprocess_text(text): text = text.lower() text = text.translate(str.maketrans('', '', string.punctuation)) tokens = nltk.word_tokenize(text) return ' '.join(tokens) # Function to create a hash of the text def create_hash(text): return hashlib.sha256(text.encode('utf-8')).hexdigest() # Function to calculate similarity def calculate_similarity(articles): vectorizer = CountVectorizer().fit_transform(articles) vectors = vectorizer.toarray() cosine_sim = cosine_similarity(vectors) return cosine_sim # Function to read articles from CSV files def read_articles_from_csv(file_paths): articles = [] for file_path in file_paths: df = pd.read_csv(file_path) # Assuming the content is in a column named 'content' articles.extend(df['content'].tolist()) return articles # Function to detect changes and visualize similarity def detect_changes_and_visualize(file_paths): all_articles = read_articles_from_csv(file_paths) processed_articles = [preprocess_text(article) for article in all_articles] hashes = [create_hash(article) for article in all_articles] similarity_matrix = calculate_similarity(processed_articles) # Create a Pandas DataFrame for better readability similarity_df = pd.DataFrame(similarity_matrix, columns=[f'Article {i+1}' for i in range(len(all_articles))], index=[f'Article {i+1}' for i in range(len(all_articles))]) # Visualize the similarity matrix using seaborn's heatmap plt.figure(figsize=(10, 8)) sns.heatmap(similarity_df, annot=True, fmt=".2f", cmap="YlGnBu", xticklabels=similarity_df.columns, yticklabels=similarity_df.index) plt.title('Article Similarity Matrix') plt.show() return hashes, similarity_df # Example file paths (Replace these with the paths to your actual CSV files) file_paths = [ 'articles1.csv', # Path to first CSV file 'articles2.csv' # Path to second CSV file ] hashes, similarity_df = detect_changes_and_visualize(file_paths) # Print hashes print("Hashes:") for i, h in enumerate(hashes): print(f"Article {i+1}: {h}") # Print similarity DataFrame print("\nSimilarity Matrix:") print(similarity_df)

Explanation of Code Changes

  1. Reading Articles from CSV:
    The read_articles_from_csv function reads article content from multiple CSV files. It assumes that there is a column named 'content' containing the text of the articles. You may need to adjust this depending on your actual CSV structure.

  2. Main Processing Function:
    The detect_changes_and_visualize function now processes articles read from the specified CSV files.

Usage

  • Ensure that your CSV files are formatted correctly with the content of the articles in a column named 'content'.
  • Update the file_paths list at the end of the code to include the paths to your actual CSV files.

After running the modified code, you will get a visual representation of the similarity matrix, along with hashes for each article. This setup allows for quick comparisons of articles from different sources stored locally. If you have additional questions or need further modifications, feel free to ask!

 

Again, whilst I really do not "like" doing this and I resisted for quite some time, I do recognise that "this" is the way you are now expected to operate, ie. ask the GPT tool of choice how to do something and it will spit out the basics quick enough that you can cut & paste.

Yes, the trick is knowing what it has output, whether it is valid or not - if it errors, how to fix it, I suppose you could ask it to fix it itself, however, if that were the case why was it wrong in the first place - ah, we're back to the old "it doesn't actually understand, it is just spitting out patterns of text it has previously seen, vectorization, what words are close to the words in the training, it's not intelligence"

However, it is a great boost to get something executable for a Proof of Concept or a Minimal Viable Product in order to conduct Research & Development.

For that, this is a great way to get up & running.

The Teens in the world should be encouraged to use this tooling in this way, but also be educated to know where the limitations are and what is then expected from them, rather than taking this as an instant solver of everything that doesn't involve them thinking.  If we don't put some boundaries in place, then we're going to lose something very valuable along the way.

Comments

Popular posts from this blog

Google Opal [experiment] - not in the UK yet

Weaviate Verba RAG with Node-Red & Ollama Engine

Remote Working Data Annotation options - Digital Nomad