I have a lot of historical personal email, going back as far as 1991, and from
time to time I need to find some old message. Although services like
Gmail would like you to keep all your mail in their cloud and pride themselves
on searching, I'd rather keep my email archive offline and encrypted at
rest unless there's some need to do a search. Indeed, I use
Google Takeout every month to remove all historic Gmail messages. Until this
year I used a tool called Zoe
to provide searchable email archives. You can import your emails, it
uses Apache Lucene as a back end, and it gives you a nice web-based interface to
find your mails. But Zoe has been unmaintained for over a decade and has
mostly vanished from the net. It was time to replace it.
Whenever I need some open source project, the first place I look is whether
there is an Apache Software Foundation community with a project along the same
lines. And the ASF is all about communities communicating over email, so
not only is there an ASF project with a solution, but that project is also used
to provide the web interface for all the archived ASF mailing lists.
Ponymail is the project and lists.apache.org is where you can see it
running. (Note that the Ponymail website refers to the old version of Pony
Mail, before "Foal".)
It comes with scripts to import emails into Elasticsearch, so it's really
straightforward to get up and running following the project instructions.
So I can just import the several hundred thousand email messages I have in
random text mbox format files and be done? Well, nearly. It almost
worked, but it needed a few tweaks:
Ponymail wasn't able to parse a fair number of email messages. Analysing the failures turned up only three root causes, listed in the bullets below.
Handling List-Ids. Ponymail likes to sort mails by the List-Id,
which makes a lot of sense when you have the thousands of Apache lists.
But with personal email, and certainly when you subscribe to various
newsletters, get bills, or have spam that got into the archives, you end up
with lots of list ids that are only used once or twice or are not
useful. Working on open source projects, there are lots of lists that
I'm on whose email I want archived, and it would be nice if it
was separated out in the Ponymail UI. So really I needed the ability to
have an 'allow list' of list ids that I want to keep separate,
with everything else defaulting to a generic list id (being my email address
where all those mails came into). Patch
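To make the allow list idea concrete, here is a minimal sketch in Python. It is not the code from the patch: the function name `effective_list_id`, the two example list ids and the default id are all made up for illustration.

```python
import email
import re

# Hypothetical values for illustration; use your own lists and address.
ALLOWED_LIST_IDS = {"<dev.httpd.apache.org>", "<users.tomcat.apache.org>"}
DEFAULT_LIST_ID = "<me.example.com>"


def effective_list_id(msg, allowed=ALLOWED_LIST_IDS, default=DEFAULT_LIST_ID):
    """Return the message's List-Id if it is on the allow list, else the default."""
    raw = msg.get("List-Id")
    if raw is None:
        return default
    # A List-Id header may carry a free-text description before the <id> part.
    match = re.search(r"<[^<>]+>", str(raw))
    if match is None:
        return default
    list_id = match.group(0).lower()
    return list_id if list_id in allowed else default


if __name__ == "__main__":
    msg = email.message_from_string(
        "List-Id: Apache HTTPD Dev <dev.httpd.apache.org>\n\nhello\n"
    )
    print(effective_list_id(msg))
```

Anything from a newsletter, a bill or a one-off list falls through to the default id, so it all ends up under a single entry in the UI.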
- Bad "Content-Type" headers. Even my bank gets this wrong,
with the header Content-Type: text/html; charset="utf-8
charset=\"iso-8859-1\"". I just made the code ignore similar
bad headers and try the fallbacks. Patch
- Messages with no text or HTML body and no attachments. These are
fairly common; for example, a calendar entry might be sent as "Content-Type:
text/calendar". I just made it so that if there is no recognised body
it uses whatever the last section it found was, regardless of content type.
- Google Chat messages from many years ago. These have nothing useful:
no body, no To:, no Message-ID, no return address. Rather than note
them as failures, I just made the code ignore them completely. Since this is
just a warning, no upstream patch was prepared.
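The three parsing tweaks can be sketched with Python's standard `email` module. This is only an illustration of the ideas, not the actual patches: the helper names, the fallback charset order and the set of headers used to spot an empty chat stub are all assumptions.

```python
import email

# Assumed fallback order; not taken from Ponymail's source.
FALLBACK_CHARSETS = ("utf-8", "iso-8859-1")
# Assumed headers for spotting empty chat stubs ("return address" is
# approximated here by Return-Path).
IDENTIFYING_HEADERS = ("To", "Message-ID", "Return-Path")


def decode_part(part):
    """Decode one leaf part, skipping a declared charset Python cannot use."""
    payload = part.get_payload(decode=True)  # bytes, or None for multipart
    if payload is None:
        return ""
    for charset in (part.get_content_charset(), *FALLBACK_CHARSETS):
        if not charset:
            continue
        try:
            return payload.decode(charset)
        except (LookupError, ValueError):
            continue  # unknown or wrong charset: try the next candidate
    return payload.decode("utf-8", errors="replace")


def choose_body(msg):
    """Prefer a text or HTML part; otherwise use the last leaf part found."""
    last_leaf = None
    for part in msg.walk():
        if part.is_multipart():
            continue
        last_leaf = part
        if part.get_content_type() in ("text/plain", "text/html"):
            return decode_part(part)
    return decode_part(last_leaf) if last_leaf is not None else ""


def is_empty_stub(msg):
    """True for messages with no identifying headers and no body at all."""
    has_headers = any(msg.get(name) for name in IDENTIFYING_HEADERS)
    return not has_headers and not choose_body(msg).strip()


if __name__ == "__main__":
    raw = (b'Content-Type: text/html; charset="utf-8 charset=\\"iso-8859-1\\""\n'
           b"\n<p>caf\xc3\xa9</p>\n")
    print(choose_body(email.message_from_bytes(raw)))
```

An importer built along these lines would call `is_empty_stub` first and silently skip such messages, then store whatever `choose_body` returns.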
HTML email. Where an email contains only HTML and no text version,
Ponymail will make and store a text conversion of the HTML, but sometimes,
especially with those pesky bank emails, it's useful to be able to see the HTML
with all the embedded images. Displaying HTML email in HTML isn't
really a goal for the project, especially since you have to be really careful
and might not want all those tracking images to suddenly start getting pinged.
But I'd really like a button that you could use on selected emails to
display them in HTML. Fortunately Ponymail stores a complete raw copy of
the email, and my proof-of-concept worked, so this can be easy to add in the future.
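As a rough idea of what such a button could do with the stored raw copy, here is a small Python sketch. It is not my proof-of-concept and not part of Ponymail: `extract_html` and `block_remote_images` are hypothetical helpers, and the regular expression only neutralises remote `<img>` sources. It is not a sanitiser and would not make untrusted HTML safe to display on its own.

```python
import email
import re
from email import policy

# Matches the src attribute of an <img> tag when it points at http(s).
REMOTE_IMG_SRC = re.compile(
    r"(<img\b[^>]*?\s)src(\s*=\s*[\"']?https?://)", re.IGNORECASE
)


def extract_html(raw_bytes):
    """Return the HTML body of a raw message, or None if it has none."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    part = msg.get_body(preferencelist=("html",))
    return part.get_content() if part is not None else None


def block_remote_images(html):
    """Rename remote image src attributes so the browser does not fetch them."""
    return REMOTE_IMG_SRC.sub(r"\1data-blocked-src\2", html)


if __name__ == "__main__":
    raw = (b'Content-Type: text/html; charset="utf-8"\n\n'
           b'<p>Hi</p><img src="http://t.example/p.gif">\n')
    print(block_remote_images(extract_html(raw)))
```

Images embedded in the message itself (`cid:` or `data:` sources) are left alone, so the mail can still be shown with its own images while remote trackers stay unfetched.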
Managing a personal email archive can be a daunting task, especially with the
volume of email correspondence. However, with Ponymail, it's possible to
take control of your email archive, keep it local and secure, and search through
it quickly and efficiently using the power of Elasticsearch.
Created: 28 Mar 2023
Tagged as: apache, fedora