First commit

Jonathan Rosenbaum 2026-01-17 13:46:20 -05:00
commit 3860abf1f0
2 changed files with 345 additions and 0 deletions

README.md Normal file

@ -0,0 +1,152 @@
# HyperKitty Search-and-Replace Management Command
This Django management command performs **search and replace operations** on
HyperKitty email bodies and text-based attachments. It is designed for cases
where sensitive information must be scrubbed from archived mailing-list data.
The command supports simulation mode, selective processing, and deduplication of
both emails and attachments.
---
## Features
- Replace sensitive strings in:
  - Email bodies
  - Text-based attachments (`text/plain`, `text/html`, `application/xhtml+xml`)
- Simulation mode (`--simulate`) to preview changes without saving
- Deduplication of:
  - Emails (by `message_id`)
  - Attachments (by SHA-1 content hash)
- Flexible processing:
  - `--only-emails`
  - `--only-attachments`
- Reads replacements from a simple text file using `shlex` parsing
---
## Installation
Place the command file in:
```
/usr/lib/python3.10/site-packages/hyperkitty/management/commands/
```
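(or the equivalent path for your Python/Django installation).
Django derives the command name from the filename. The file ships in this
repository as `sanitize_mail.py`; to invoke it as `sanitize_hyperkitty`, as in
the examples below, save it as:
```
sanitize_hyperkitty.py
```
Django then detects it automatically as a management command.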
---
## Usage
Run the command from your Django project directory:
```bash
./manage.py sanitize_hyperkitty \
--list mylist@example.com \
--replacements-file replacements.txt
```
### Common Options
| Option | Description |
|--------|-------------|
| `--list` | **Required.** Mailing list name (e.g. `team@lists.example.org`) |
| `--replacements-file` | **Required.** Path to a file containing replacement rules |
| `--simulate` | Show changes without saving them |
| `--only-emails` | Process only email bodies |
| `--only-attachments` | Process only attachments |
---
## Replacements File Format
The replacements file uses **shlex parsing**, allowing quoted strings.
Each line must contain **exactly two values**:
```
old_value new_value
```
### Examples
```
password "********"
"secret token" "[REDACTED]"
john@example.com jane@example.com
```
Blank lines and lines beginning with `#` are ignored.
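
To see how individual lines are tokenised, the sketch below runs Python's standard `shlex.split` over the example rules plus two lines that would be skipped. The helper name `parse_rule` is hypothetical and exists only for this illustration. Note that `shlex` treats an unpaired apostrophe or double quote as an unterminated string, so values containing one need to be quoted or escaped.

```python
import shlex


def parse_rule(line):
    """Return (old, new) for a valid rule line, or None if it would be skipped.

    Hypothetical helper for illustration; it mirrors the rules described above.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # blank line or comment
    try:
        parts = shlex.split(line)
    except ValueError:
        return None  # e.g. an unterminated quote
    if len(parts) != 2:
        return None  # a rule needs exactly two values
    return parts[0], parts[1]


examples = [
    'password "********"',
    '"secret token" "[REDACTED]"',
    "john@example.com jane@example.com",
    "secret token [REDACTED]",  # unquoted space gives three values: skipped
    '"unterminated value',      # shlex raises ValueError: skipped
]
for example in examples:
    print(repr(example), "->", parse_rule(example))
```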
---
## How It Works
### 1. Load Replacements
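The command reads the replacements file and builds a dictionary of `old → new`
pairs. Lines that `shlex` cannot parse, or that do not yield exactly two values,
are skipped with a warning. If the same `old` value appears more than once, the
last occurrence wins.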
### 2. Fetch and Deduplicate Emails
Emails are filtered by mailing list name and deduplicated by `message_id`.
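
The deduplication step can be sketched without a database. In this minimal example, `FakeEmail` is a hypothetical stand-in for HyperKitty's `Email` model and `dedupe_by_message_id` is an illustrative name; the real command iterates over a Django queryset instead of a list.

```python
from collections import namedtuple

# Hypothetical stand-in for hyperkitty.models.Email (only the fields used here).
FakeEmail = namedtuple("FakeEmail", ["message_id", "subject"])


def dedupe_by_message_id(emails):
    """Keep the first email seen for each message_id, preserving order."""
    unique = {}
    for msg in emails:
        # setdefault only stores the value if the key is not present yet
        unique.setdefault(msg.message_id, msg)
    return list(unique.values())


emails = [
    FakeEmail("a@example.org", "Hello"),
    FakeEmail("b@example.org", "Re: Hello"),
    FakeEmail("a@example.org", "Hello (duplicate row)"),
]
unique = dedupe_by_message_id(emails)
print([m.subject for m in unique])  # ['Hello', 'Re: Hello']
```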
### 3. Process Email Bodies
Unless `--only-attachments` is given, each email body is scanned and every
replacement is applied as a literal, case-sensitive substring substitution.
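
A minimal sketch of the substitution loop, assuming the rules are held in a plain dictionary. The function name `apply_replacements` and the sample data are illustrative and not taken from the command.

```python
def apply_replacements(text, replacements):
    """Apply each old -> new pair in insertion order.

    Returns the updated text and the list of pairs that actually matched.
    """
    hits = []
    for old, new in replacements.items():
        if old in text:
            hits.append((old, new))
            text = text.replace(old, new)
    return text, hits


rules = {"secret token": "[REDACTED]", "john@example.com": "jane@example.com"}
body = "Contact john@example.com and quote the secret token."
updated, hits = apply_replacements(body, rules)
print(updated)  # Contact jane@example.com and quote the [REDACTED].
```

Because the rules run one after another on the same string, a later rule can match text produced by an earlier one, so the order of lines in the replacements file can matter.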
### 4. Process Attachments
Unless `--only-emails` is given, the attachments of each email are:
- Deduplicated by SHA-1 hash
- Checked for text-based MIME types
- Decoded using the attachment's encoding (UTF-8 if none is set)
- Updated and saved if modified
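
The steps above can be sketched on raw bytes, without Django. The function names `dedupe_by_sha1` and `rewrite_attachment` are hypothetical, and a single `old`/`new` pair is used to keep the example short. SHA-1 serves here only to detect identical content, not as a security measure.

```python
import hashlib

TEXT_MIMETYPES = {"text/plain", "text/html", "application/xhtml+xml"}


def dedupe_by_sha1(blobs):
    """Keep the first blob for each distinct SHA-1 digest of its bytes."""
    unique = {}
    for blob in blobs:
        unique.setdefault(hashlib.sha1(blob).hexdigest(), blob)
    return list(unique.values())


def rewrite_attachment(raw, content_type, encoding, old, new):
    """Return the rewritten bytes, or None if skipped or unchanged."""
    if content_type not in TEXT_MIMETYPES:
        return None  # non-text attachment
    charset = encoding or "utf-8"  # fall back to UTF-8 when no encoding is set
    text = raw.decode(charset, errors="ignore")
    updated = text.replace(old, new)
    if updated == text:
        return None  # nothing to save
    return updated.encode(charset)


blobs = [b"token=abc123", b"token=abc123", b"\x89PNG"]
print(len(dedupe_by_sha1(blobs)))  # 2
print(rewrite_attachment(b"token=abc123", "text/plain", None, "abc123", "[REDACTED]"))
print(rewrite_attachment(b"\x89PNG", "image/png", None, "abc123", "[REDACTED]"))
```

Decoding with `errors="ignore"` silently drops bytes that are invalid in the chosen charset, so a rewritten attachment may differ from the original in more than the replaced strings.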
### 5. Simulation Mode
If `--simulate` is used:
- Changes are printed to stdout
- No data is saved
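
A minimal sketch of how a simulate flag can gate writes. `FakeRecord` is a hypothetical stand-in for a model instance and `scrub` is an illustrative name; the `out` parameter is added here only so the printed lines can be captured.

```python
class FakeRecord:
    """Hypothetical stand-in for a model instance; counts save() calls."""

    def __init__(self, content):
        self.content = content
        self.saves = 0

    def save(self):
        self.saves += 1


def scrub(record, replacements, simulate, out=print):
    """Apply replacements; report them when simulating, save otherwise."""
    updated = record.content
    for old, new in replacements.items():
        if old in updated:
            if simulate:
                out(f"  - {old}")
                out(f"  + {new}")
            updated = updated.replace(old, new)
    changed = updated != record.content
    if changed and not simulate:
        record.content = updated
        record.save()
    return changed


record = FakeRecord("password is hunter2")
scrub(record, {"hunter2": "********"}, simulate=True)
print(record.content, record.saves)  # password is hunter2 0
scrub(record, {"hunter2": "********"}, simulate=False)
print(record.content, record.saves)  # password is ******** 1
```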
### 6. Rebuild Search Index
After real modifications, rebuild the HyperKitty search index:
```bash
./manage.py rebuild_index
```
---
## Example
```bash
./manage.py sanitize_hyperkitty \
--list devteam@lists.example.org \
--replacements-file scrub.txt \
--simulate
```
This will scan all messages, show what would change, and leave the database untouched.
---
## Notes
- `--only-emails` and `--only-attachments` cannot be used together.
- For attachments without a MIME type, the command falls back to the filename extension (`.htm`, `.html`, `.txt`, `.xhtml`).
- Non-text attachments are skipped automatically.
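
The filename fallback can be sketched as follows. The function names `effective_mime` and `is_text_attachment` are hypothetical; the extension list and the `text/html` fallback value follow the script in this commit.

```python
TEXT_MIMETYPES = {"text/plain", "text/html", "application/xhtml+xml"}
TEXT_EXTENSIONS = (".htm", ".html", ".txt", ".xhtml")


def effective_mime(content_type, filename):
    """Return the MIME type to test, or None if it cannot be determined."""
    if content_type:
        return content_type
    if filename and filename.lower().endswith(TEXT_EXTENSIONS):
        # The script uses "text/html" for every matched extension.
        return "text/html"
    return None


def is_text_attachment(content_type, filename):
    """True if the attachment would be processed rather than skipped."""
    return effective_mime(content_type, filename) in TEXT_MIMETYPES


print(is_text_attachment(None, "NOTES.TXT"))         # True
print(is_text_attachment(None, "photo.png"))         # False
print(is_text_attachment("image/png", "page.html"))  # False
```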
---
## License
This script is intended for administrative use within Django/HyperKitty
environments and is licensed under the GNU General Public License v3.0.

sanitize_mail.py Normal file

@ -0,0 +1,193 @@
import hashlib
import shlex

from django.core.management.base import BaseCommand
from hyperkitty.models import Email, Attachment

# Put this command in /usr/lib/python3.10/site-packages/hyperkitty/management/commands

TEXT_MIMETYPES = {
    "text/plain",
    "text/html",
    "application/xhtml+xml",
}


def to_bytes(content):
    """Return attachment content as bytes (str content is UTF-8 encoded)."""
    if isinstance(content, str):
        return content.encode("utf-8", errors="ignore")
    # A BinaryField can yield memoryview on some database backends.
    return bytes(content)


class Command(BaseCommand):
    help = "Search and replace sensitive data in HyperKitty emails and attachments."

    def add_arguments(self, parser):
        parser.add_argument(
            "--list",
            required=True,
            help="Mailing list name, e.g. bikeboard@lists.bikelover.org",
        )
        parser.add_argument(
            "--simulate",
            action="store_true",
            help="Show what would be changed without saving.",
        )
        parser.add_argument(
            "--replacements-file",
            required=True,
            help="Path to a text file containing replacements, one per line.",
        )
        parser.add_argument(
            "--only-emails",
            action="store_true",
            help="Process only email bodies, not attachments.",
        )
        parser.add_argument(
            "--only-attachments",
            action="store_true",
            help="Process only attachments, not email bodies.",
        )

    def load_replacements(self, filepath):
        """Parse the replacements file into an {old: new} dictionary."""
        replacements = {}
        with open(filepath, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                # Skip blank lines and comments
                if not line or line.startswith("#"):
                    continue
                try:
                    parts = shlex.split(line)
                except ValueError:
                    self.stdout.write(self.style.WARNING(f"Skipping malformed line: {line}"))
                    continue
                if len(parts) != 2:
                    self.stdout.write(self.style.WARNING(f"Skipping invalid line: {line}"))
                    continue
                old, new = parts
                replacements[old] = new
        return replacements

    def handle(self, *args, **options):
        mailing_list = options["list"]
        simulate = options["simulate"]
        replacements_file = options["replacements_file"]
        only_emails = options["only_emails"]
        only_attachments = options["only_attachments"]

        # Validate flags
        if only_emails and only_attachments:
            self.stdout.write(self.style.ERROR("Cannot use --only-emails and --only-attachments together."))
            return

        replacements = self.load_replacements(replacements_file)
        if not replacements:
            self.stdout.write(self.style.ERROR("No valid replacements found."))
            return
        self.stdout.write(f"Loaded {len(replacements)} replacements.")

        emails = Email.objects.filter(mailinglist__name=mailing_list)
        self.stdout.write(f"Scanning {emails.count()} messages…")

        # Deduplicate emails by message_id (the first occurrence is kept)
        unique_emails = {}
        for msg in emails:
            if msg.message_id not in unique_emails:
                unique_emails[msg.message_id] = msg
        emails = unique_emails.values()

        for msg in emails:
            changed = False

            # --- Process email body ---
            if not only_attachments:
                if msg.content:
                    original = msg.content
                    updated = original
                    for old, new in replacements.items():
                        if old in updated and simulate:
                            self.stdout.write("  Change in email body:")
                            self.stdout.write(f"    - {old}")
                            self.stdout.write(f"    + {new}")
                        updated = updated.replace(old, new)
                    if updated != original:
                        changed = True
                        self.stdout.write(f"[Email] {msg.subject}")
                        if not simulate:
                            msg.content = updated
                            msg.save()

            # --- Process attachments ---
            if not only_emails:
                attachments = Attachment.objects.filter(email=msg)

                # Deduplicate attachments by content hash
                unique_attachments = {}
                for att in attachments:
                    digest = hashlib.sha1(to_bytes(att.content)).hexdigest()
                    if digest not in unique_attachments:
                        unique_attachments[digest] = att
                attachments = unique_attachments.values()

                seen_changes = set()
                for att in attachments:
                    mime = getattr(att, "content_type", None)
                    filename = getattr(att, "name", None)
                    if not mime:
                        # No MIME type recorded: fall back to the filename extension
                        if filename and filename.lower().endswith((".htm", ".html", ".txt", ".xhtml")):
                            mime = "text/html"
                        else:
                            continue
                    if mime not in TEXT_MIMETYPES:
                        continue

                    # Decode binary content; content that is already str is used as-is
                    charset = att.encoding or "utf-8"
                    raw = att.content
                    if isinstance(raw, str):
                        content = raw
                    else:
                        content = bytes(raw).decode(charset, errors="ignore")

                    original = content
                    updated = content
                    for old, new in replacements.items():
                        key = (filename, old, new)
                        if simulate and key not in seen_changes and old in updated:
                            seen_changes.add(key)
                            self.stdout.write(f"  Change in attachment {filename}:")
                            self.stdout.write(f"    - {old}")
                            self.stdout.write(f"    + {new}")
                        updated = updated.replace(old, new)
                    if updated != original:
                        changed = True
                        self.stdout.write(f"[Attachment] {filename} in {msg.subject}")
                        if not simulate:
                            encoded = updated.encode(charset)
                            att.content = encoded
                            att.size = len(encoded)
                            att.save()

            if changed and simulate:
                self.stdout.write("  (simulate mode: no changes saved)")

        self.stdout.write("\nDone.")
        if not simulate:
            self.stdout.write("Run `./manage.py rebuild_index` to refresh search index.")