Hash checking similarities of text
This isn't a new technique, and it's probably not used as much as it should be. I have a scenario where I'd like to collect text (think news articles from a variety of sources, maybe 4-5 or even 1-200!?), analyse it, and be notified when certain trigger content is detected.
For example: a major security breach has happened, it is being reported on multiple cyber security websites, and I should act on it pretty fast. Now, I don't want to get 1000 alert notifications of the same or similar thing, I'd just like to get 1. Or at least fewer than 10!
I was pondering how to do this and duck.ai sprang to mind. Why use duck.ai? Well, it doesn't slurp up what you are doing, it doesn't use any of the data for training, and it's free. So, why not?!
I gave it this prompt:
If I have multiple sources reporting the same textual data, can you show how to create a hash-based change detection system with python that would identify that the textual articles, whilst not the exact same text have a high probability of being the same content in order to reduce duplication
I got a pretty good response. I did re-ask to tweak it to include the pandas element, and I could go further and ask it to use a Python library to visualise the similarities. However, this is just a small part of a workflow; the real aim here is to "filter the noise". The next step I could ask for is to analyse the tokenised values with nltk to classify the urgency, or to count how many times an article's content is repeated and bump it up the priority order to imply it is more important. I could even ask it to write the API calls to specific websites to scrape the content, or to read files from a folder as input rather than the small fixed array in the code (a rough sketch of that input swap is below).
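For example, swapping the fixed array for a folder of text files would only take a few lines, something like this (the `articles/` folder and `.txt` extension are just my assumptions, not part of the generated code):
```python
from pathlib import Path

# Hypothetical input: read every .txt file in an articles/ folder into a list,
# which can then be fed to the detection code instead of the fixed array
articles = [path.read_text(encoding="utf-8")
            for path in sorted(Path("articles").glob("*.txt"))]
```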
I installed the couple of Python libraries that were needed and the code executed as defined. A really good starter for ten: it just worked.
Whilst I'm not the greatest fan of LLM-generated code, this is really useful for getting started. What do I mean by that? Well, there is no concept of exception handling or security; it is just the bare minimum. But it is good enough to prove the concept and lay out the basic code structure.
I could now go back and ask for more exception handling to be added, etc., but I fear I would then cycle around for about an hour tweaking and enhancing, when I could just write that bit myself.
Anyway, the following is the output I asked for. Yes, I asked it to format the output so I could paste it into a blog post. Now, you may think: why not just write something in n8n that could auto-post to the blog? I'm not a fan of totally removing the human from the equation, and I'm also not ready to hand over my security credentials to a tool that could easily harvest that info behind the scenes (as recently proven by clawdbot), so for now I'll still do the small manual task.
Whether this is the "BEST" way to do this is open to debate. It is "ONE" way to achieve it; further in-depth testing with much longer article text will be required to see whether it is of any real use. As with all things AI-slop, it might be just that: it might look and sound convincing, but be very limited in reality.
---
## Creating a Hash-Based Change Detection System in Python
In today’s information-rich environment, it's essential to identify and analyze similar textual content effectively. This guide will demonstrate how to implement a hash-based change detection system using Python, aimed at recognizing articles that, while not identical, contain similar content. We’ll utilize libraries such as `hashlib`, `nltk`, `scikit-learn`, and `pandas` to achieve this.
### Prerequisites
To get started, ensure you have the necessary libraries installed:
```bash
pip install nltk scikit-learn pandas
```
### Step 1: Preprocessing the Text
First, we’ll preprocess the text by converting it to lowercase, removing punctuation, and tokenizing it. This is crucial for standardizing the articles before further processing.
```python
import hashlib
import string
import nltk

# Download the NLTK tokenizer data used by word_tokenize
nltk.download('punkt')

def preprocess_text(text):
    # Lowercase, strip punctuation, then tokenise and re-join the words
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = nltk.word_tokenize(text)
    return ' '.join(tokens)
```
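As a quick check, running the function on the first sample article used further below gives:
```python
print(preprocess_text("The quick brown fox jumps over the lazy dog."))
# -> the quick brown fox jumps over the lazy dog
```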
### Step 2: Creating a Hash of the Text
Using SHA-256 hashing, we can generate a unique identifier for each article, allowing us to track changes efficiently.
```python
def create_hash(text):
    # SHA-256 digest of the UTF-8 encoded article text
    return hashlib.sha256(text.encode('utf-8')).hexdigest()
```
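Note what the hash does and doesn't give you: identical text always yields the same digest, while even a one-character change yields a completely different one, which is exactly why the similarity step below is needed to catch near-duplicates. A small illustration with made-up strings:
```python
# Identical input -> identical digest; one extra character -> a completely different digest
print(create_hash("major breach reported at example corp"))
print(create_hash("major breach reported at example corp!"))
```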
### Step 3: Calculating Similarity
We calculate the similarity between articles using cosine similarity, which measures how alike the articles are based on their vector representations.
```python
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

def calculate_similarity(articles):
    # Build a bag-of-words count matrix for the articles
    vectors = CountVectorizer().fit_transform(articles).toarray()
    # Pairwise cosine similarity between every article vector
    cosine_sim = cosine_similarity(vectors)
    return cosine_sim
```
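To get a feel for the scores this returns, here is a small illustrative call with made-up sentences; near-duplicate wording scores far higher than unrelated text:
```python
sim = calculate_similarity([
    "server breach reported at acme corp",
    "acme corp reports a server breach",
    "new cat cafe opens in the city centre",
])
# The first two sentences share most of their words, so their pairwise score is
# much higher than either scores against the unrelated third sentence.
print(sim.round(2))
```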
### Step 4: Combining Everything
Now, let’s bring everything together into one function that processes multiple articles and generates a readable similarity matrix.
```python
import pandas as pd

def detect_changes(articles):
    # Normalise the text, hash the original articles, and compare every pair
    processed_articles = [preprocess_text(article) for article in articles]
    hashes = [create_hash(article) for article in articles]
    similarity_matrix = calculate_similarity(processed_articles)

    # Create a Pandas DataFrame for better readability
    labels = [f'Article {i+1}' for i in range(len(articles))]
    similarity_df = pd.DataFrame(similarity_matrix, columns=labels, index=labels)

    return hashes, similarity_df

# Example usage
articles = [
    "The quick brown fox jumps over the lazy dog.",
    "A fast brown fox leaps over a lazy canine.",
    "The quick brown fox leapt over the lazy dog.",
]

hashes, similarity_df = detect_changes(articles)

# Print hashes
print("Hashes:")
for i, h in enumerate(hashes):
    print(f"Article {i+1}: {h}")

# Print similarity DataFrame
print("\nSimilarity Matrix:")
print(similarity_df)
```
### Output Explanation
Running the code will produce a set of unique hashes for each article along with a similarity matrix. The matrix shows the cosine similarity between articles:
```
Hashes:
Article 1: 50f2a1e2fa9843f97f933f2865ed310b3b19156941d64c3a1e0e8734b3658fa5
Article 2: 46c803be091cf8abb55e2f6b8d6a7434fb7f9299e70f5a925488747cdbb473d1
Article 3: 50f2a1e2fa9843f97f933f2865ed310b3b19156941d64c3a1e0e8734b3658fa5
Similarity Matrix:
Article 1 Article 2 Article 3
Article 1 1.000000 0.544941 1.000000
Article 2 0.544941 1.000000 0.568681
Article 3 1.000000 0.568681 1.000000
```
- **Hashes:** Each article's unique hash reflects its content.
- **Similarity Matrix:** Each entry represents the cosine similarity between two articles. Values closer to **1** indicate high similarity, while those closer to **0** show dissimilarity.
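For the de-duplication goal, one simple option is to treat any pair of articles whose similarity exceeds a chosen threshold as the same story and keep a single representative. The sketch below is illustrative only; the 0.8 threshold and the `deduplicate` helper are assumptions that would need tuning against real article text:
```python
def deduplicate(articles, similarity_df, threshold=0.8):
    # Keep the first article seen from each group of near-duplicates
    keep, suppressed = [], set()
    for i in range(len(articles)):
        if i in suppressed:
            continue
        keep.append(articles[i])
        for j in range(i + 1, len(articles)):
            if similarity_df.iloc[i, j] >= threshold:
                suppressed.add(j)
    return keep

unique_articles = deduplicate(articles, similarity_df)
print(f"{len(unique_articles)} of {len(articles)} articles kept after de-duplication")
```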
### Conclusion
This implementation provides a basic system for identifying similar articles and reducing duplication in your dataset. With this approach, you can enhance your content curation process and maintain a more organised information flow.
Feel free to use this code as a foundation for building your own content analysis tools!
---