Hash checking similarities of text
This isn't a new thing, and it probably isn't even used that much, although it should be. I have a scenario where I'd like to take in text, think of news articles from a variety of sources, maybe 4-5 or even a couple of hundred, analyse it, and be notified if certain trigger content is detected.
For example: a major security breach has happened, it is being reported on multiple cyber security websites, and I should act on it pretty fast. Now, I don't want to get 1000 alert notifications of the same or similar thing, I would just like to get 1. Or at least fewer than 10!
I was pondering how to do this and then duck.ai sprang to mind. Why use duck.ai? Well, it doesn't slurp up what you are doing, it doesn't use any of the data for training, and it's free. So, why not?!
I gave it this prompt:
If I have multiple sources reporting the same textual data, can you show how to create a hash-based change detection system with python that would identify that the textual articles, whilst not the exact same text have a high probability of being the same content in order to reduce duplication
I got a pretty good response. I did re-ask to tweak it to include the pandas element, and I could go further and ask it to use a Python library to visualise the similarities. However, this is just a small part of a workflow; the real aim is to "filter the noise" here. The next step I could ask for would be to analyse the tokenised values with nltk to classify the urgency, or to use the number of times an article's content is repeated to push it up a priority order, implying it is of more importance. I could even ask it to write the API calls out to specific websites to scrape the content, or to read files from a folder as input, rather than the small fixed array in the code.
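To give a flavour of that priority idea, here's a rough sketch of my own (not duck.ai output, and the 0.8 cut-off is an arbitrary guess) that counts, for each article, how many others in the batch look like near-duplicates of it, using the similarity matrix produced by the generated code further down:

```python
import numpy as np

def priority_from_repetition(similarity_matrix, threshold=0.8):
    """Rank articles by how many near-duplicates they have across sources."""
    sim = np.asarray(similarity_matrix)
    # Count, for each article, the OTHER articles scoring above the threshold
    duplicate_counts = (sim >= threshold).sum(axis=1) - 1
    # Most-repeated articles first: widely reported probably means more urgent
    ranked_indices = np.argsort(-duplicate_counts)
    return ranked_indices, duplicate_counts
```

The thinking being that a story repeated across twenty sources floats to the top of the pile, while a one-off sits at the bottom.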
I installed the couple of Python libraries that were needed and the code executed as defined. This was a really good starter for ten; it just worked.
Whilst I'm not the greatest fan of LLM-output code, this is really useful for getting started. What do I mean by that? Well, there is no concept of exception handling or security, it is just the bare minimum, but it is good enough to prove the concept and get the basic code structure in place.
I could now go back and ask for more exception handling to be added, and so on, but I fear I would then cycle around for about an hour tweaking and enhancing, when I could actually just write that bit myself.
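For the record, "that bit" is only a few lines. A minimal sketch of the sort of wrapping I'd add myself around the generated `detect_changes` function (which appears further down) would be something like:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def safe_detect_changes(articles):
    """Wrap detect_changes() with the bare-minimum error handling."""
    if not articles:
        logger.warning("No articles supplied, nothing to compare")
        return None
    try:
        return detect_changes(articles)
    except Exception:
        # Log and carry on rather than killing the whole workflow
        logger.exception("Similarity check failed for this batch of articles")
        return None
```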
Anyway, the following is the output I asked for; yes, I asked it to write the output up so I could paste it into a blog post. Now, you may think: why not just write something in n8n that could auto-post into a blog? I'm not much of a fan of totally removing the human from the equation, and I'm also not ready to hand over my security credentials to a tool that could easily harvest that info behind the scenes (as recently proven by clawdbot), so for now, I'll still do a small manual task.
As to whether this is the "BEST" way to do this, that is open to debate. It is "ONE" way to achieve it; further in-depth testing with much longer article text will be required to see if this is of any use or not. As with all things AI-slop, it might just be that: it might look and sound convincing, but be very limited in reality.
---
## Creating a Hash-Based Change Detection System in Python
In today’s information-rich environment, it's essential to identify and analyze similar textual content effectively. This guide will demonstrate how to implement a hash-based change detection system using Python, aimed at recognizing articles that, while not identical, contain similar content. We’ll utilize libraries such as `hashlib`, `nltk`, `scikit-learn`, and `pandas` to achieve this.
### Prerequisites
To get started, ensure you have the necessary libraries installed:
```bash
pip install nltk scikit-learn pandas
```
### Step 1: Preprocessing the Text
First, we’ll preprocess the text by converting it to lowercase, removing punctuation, and tokenizing it. This is crucial for standardizing the articles before further processing.
```python
import hashlib
import string
import nltk
# Download NLTK resources
nltk.download('punkt')
def preprocess_text(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = nltk.word_tokenize(text)
    return ' '.join(tokens)
```
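As a quick illustration of what that normalisation produces (my own example, not part of the generated guide):

```python
# Illustrative check of the preprocessing step
print(preprocess_text("The QUICK brown fox, jumps!"))
# Expected: "the quick brown fox jumps"
```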
### Step 2: Creating a Hash of the Text
Using SHA-256 hashing, we can generate a unique identifier for each article, allowing us to track changes efficiently.
```python
def create_hash(text):
    return hashlib.sha256(text.encode('utf-8')).hexdigest()
```
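Worth noting: the hash is an exact-match check only, so even a one-character difference gives a completely different digest; that is exactly why the similarity step below is needed rather than hashing alone. For example:

```python
# Even a tiny edit produces a totally different hash
print(create_hash("the quick brown fox") == create_hash("the quick brown fox."))  # False
```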
### Step 3: Calculating Similarity
We calculate the similarity between articles using cosine similarity, which measures how alike the articles are based on their vector representations.
```python
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
def calculate_similarity(articles):
    vectorizer = CountVectorizer().fit_transform(articles)
    vectors = vectorizer.toarray()
    cosine_sim = cosine_similarity(vectors)
    return cosine_sim
```
### Step 4: Combining Everything
Now, let’s bring everything together into one function that processes multiple articles and generates a readable similarity matrix.
```python
import pandas as pd
def detect_changes(articles):
    processed_articles = [preprocess_text(article) for article in articles]
    hashes = [create_hash(article) for article in articles]
    similarity_matrix = calculate_similarity(processed_articles)

    # Create a Pandas DataFrame for better readability
    similarity_df = pd.DataFrame(similarity_matrix,
                                 columns=[f'Article {i+1}' for i in range(len(articles))],
                                 index=[f'Article {i+1}' for i in range(len(articles))])
    return hashes, similarity_df
# Example usage
articles = [
    "The quick brown fox jumps over the lazy dog.",
    "A fast brown fox leaps over a lazy canine.",
    "The quick brown fox leapt over the lazy dog.",
]
hashes, similarity_df = detect_changes(articles)
# Print hashes
print("Hashes:")
for i, h in enumerate(hashes):
print(f"Article {i+1}: {h}")
# Print similarity DataFrame
print("\nSimilarity Matrix:")
print(similarity_df)
```
### Output Explanation
Running the code will produce a set of unique hashes for each article along with a similarity matrix. The matrix shows the cosine similarity between articles:
```
Hashes:
Article 1: 50f2a1e2fa9843f97f933f2865ed310b3b19156941d64c3a1e0e8734b3658fa5
Article 2: 46c803be091cf8abb55e2f6b8d6a7434fb7f9299e70f5a925488747cdbb473d1
Article 3: 50f2a1e2fa9843f97f933f2865ed310b3b19156941d64c3a1e0e8734b3658fa5
Similarity Matrix:
           Article 1  Article 2  Article 3
Article 1   1.000000   0.544941   1.000000
Article 2   0.544941   1.000000   0.568681
Article 3   1.000000   0.568681   1.000000
```
- **Hashes:** Each article's unique hash reflects its content.
- **Similarity Matrix:** Each entry represents the cosine similarity between two articles. Values closer to **1** indicate high similarity, while those closer to **0** show dissimilarity.
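To turn those scores into an actual de-duplication decision, a simple option (my own rough sketch rather than part of the generated code; the 0.9 cut-off is a guess that would need tuning on real articles) is to keep the first article in each near-duplicate group and drop the rest:

```python
def deduplicate(articles, similarity_df, threshold=0.9):
    """Keep only the first article from each group of near-duplicates."""
    keep, dropped = [], set()
    for i in range(len(articles)):
        if i in dropped:
            continue
        keep.append(articles[i])
        for j in range(i + 1, len(articles)):
            # similarity_df is the DataFrame returned by detect_changes()
            if similarity_df.iloc[i, j] >= threshold:
                dropped.add(j)
    return keep

unique_articles = deduplicate(articles, similarity_df)
print(f"{len(articles)} articles reduced to {len(unique_articles)}")
```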
### Conclusion
This implementation provides a robust system for identifying similar articles, effectively minimizing duplication in your dataset. With this approach, you can enhance your content curation processes and maintain a more organized information flow.
Feel free to use this code as a foundation for building your own content analysis tools!
---
As you do, I thought about connecting this up to read from websites, so I asked for that code, which it generated. However, there are all the issues of cookie acceptance, captchas, API keys, security and so on, which would be too much of a challenge.
So I opted for just re-asking if I could read from .csv files instead; after all, that's what every Data Scientist seems to revel in (I blame the pandas library for that :-) ). Here's the output from that request:
can you remove the connection to external websites and instead read the news article content from several .csv files instead
Certainly! We can modify the code to read article content from several .csv files instead of connecting to external websites. This will allow you to work with local data files. Here’s how you can implement this:
### Required Libraries
Ensure you have the required libraries installed:
```bash
pip install pandas nltk scikit-learn matplotlib seaborn
```
### Implementation
Here's how to adjust the previous implementation to read articles from .csv files:
```python
import hashlib
import string
import nltk
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt
import seaborn as sns

# Download NLTK resources
nltk.download('punkt')

# Function to preprocess the text
def preprocess_text(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = nltk.word_tokenize(text)
    return ' '.join(tokens)

# Function to create a hash of the text
def create_hash(text):
    return hashlib.sha256(text.encode('utf-8')).hexdigest()

# Function to calculate similarity
def calculate_similarity(articles):
    vectorizer = CountVectorizer().fit_transform(articles)
    vectors = vectorizer.toarray()
    cosine_sim = cosine_similarity(vectors)
    return cosine_sim

# Function to read articles from CSV files
def read_articles_from_csv(file_paths):
    articles = []
    for file_path in file_paths:
        df = pd.read_csv(file_path)
        # Assuming the content is in a column named 'content'
        articles.extend(df['content'].tolist())
    return articles

# Function to detect changes and visualize similarity
def detect_changes_and_visualize(file_paths):
    all_articles = read_articles_from_csv(file_paths)
    processed_articles = [preprocess_text(article) for article in all_articles]
    hashes = [create_hash(article) for article in all_articles]
    similarity_matrix = calculate_similarity(processed_articles)

    # Create a Pandas DataFrame for better readability
    similarity_df = pd.DataFrame(similarity_matrix,
                                 columns=[f'Article {i+1}' for i in range(len(all_articles))],
                                 index=[f'Article {i+1}' for i in range(len(all_articles))])

    # Visualize the similarity matrix using seaborn's heatmap
    plt.figure(figsize=(10, 8))
    sns.heatmap(similarity_df, annot=True, fmt=".2f", cmap="YlGnBu",
                xticklabels=similarity_df.columns, yticklabels=similarity_df.index)
    plt.title('Article Similarity Matrix')
    plt.show()

    return hashes, similarity_df

# Example file paths (Replace these with the paths to your actual CSV files)
file_paths = [
    'articles1.csv',  # Path to first CSV file
    'articles2.csv'   # Path to second CSV file
]

hashes, similarity_df = detect_changes_and_visualize(file_paths)

# Print hashes
print("Hashes:")
for i, h in enumerate(hashes):
    print(f"Article {i+1}: {h}")

# Print similarity DataFrame
print("\nSimilarity Matrix:")
print(similarity_df)
```
### Explanation of Code Changes
- **Reading Articles from CSV:** The `read_articles_from_csv` function reads article content from multiple CSV files. It assumes that there is a column named `'content'` containing the text of the articles. You may need to adjust this depending on your actual CSV structure.
- **Main Processing Function:** The `detect_changes_and_visualize` function now processes articles read from the specified CSV files.
### Usage
- Ensure that your CSV files are formatted correctly with the content of the articles in a column named `'content'`.
- Update the `file_paths` list at the end of the code to include the paths to your actual CSV files.
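If you want to try this without real data, a couple of throwaway CSV files in the expected shape can be knocked up like so (illustrative file names and text only):

```python
import pandas as pd

# Create two tiny CSV files with the expected 'content' column (illustrative data only)
pd.DataFrame({"content": [
    "The quick brown fox jumps over the lazy dog.",
    "A fast brown fox leaps over a lazy canine.",
]}).to_csv("articles1.csv", index=False)

pd.DataFrame({"content": [
    "The quick brown fox leapt over the lazy dog.",
]}).to_csv("articles2.csv", index=False)
```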
After running the modified code, you will get a visual representation of the similarity matrix, along with hashes for each article. This setup allows for quick comparisons of articles from different sources stored locally. If you have additional questions or need further modifications, feel free to ask!
Again, whilst I really do not "like" doing this, and I resisted for quite some time, I do recognise that "this" is the way you are now expected to operate, i.e. ask the GPT tool of choice how to do something and it will spit out the basics quickly enough that you can cut and paste.
Yes, the trick is knowing what it has output and whether it is valid or not, and if it errors, how to fix it. I suppose you could ask it to fix it itself; however, if that were the case, why was it wrong in the first place? Ah, we're back to the old "it doesn't actually understand, it is just spitting out patterns of text it has previously seen: vectorisation, which words are close to the words in its training data. It's not intelligence".
However, it is a great boost to get something executable for a Proof of Concept or a Minimum Viable Product in order to conduct Research & Development.
For that, this is a great way to get up & running.
Teens should be encouraged to use this tooling in this way, but also educated to know where the limitations are and what is then expected of them, rather than treating it as an instant solver of everything that spares them from thinking. If we don't put some boundaries in place, then we're going to lose something very valuable along the way.