mark :: blog

28 Mar 2023: How Ponymail helped me keep my email archive searchable but out of the cloud.

I have a lot of historical personal email, back as far as 1991, and from time to time there's a need to find some old message.  Although services like GMail would like to you keep all your mail in their cloud and pride themselves on searching, I'd rather keep my email archive offline and encrypted at rest unless there's some need to do a search.  Indeed I always use Google takeout every month to remove all historic GMail messages. Until this year I used a tool called Zoe for allowing searchable email archives.  You can import your emails, it uses Apache Lucene as a back end, and gives you a nice web based interface to find your mails.  But Zoe has been unmaintained for over a decade and has mostly vanished from the net. It was time to replace it.

Whenever I need some open source project my first place to look is if there is an Apache Software Foundation community with a project along the same lines.  And the ASF is all about communities communicating over Email, so not only is there an ASF project with a solution, but that project is used to provide the web interface for all the archived ASF mailing lists too.   "Ponymail Foal" is the project and lists.apache.org is where you can see it running.  (Note that the Ponymail website refers to the old version of Pony Mail before "Foal")

Internally the project is mostly Python, HTML and Javascript, using Python scripts to import emails into elasticsearch, so it's really straightforward to get up and running following the project instructions.

So I can just import my several hundred thousand email messages I have in random text mbox format files and be done?  Well, nearly.  It almost worked but it needed a few tweaks:

  • Ponymail wasn't able to parse a fair number of email messages.  Analysing the mails led to only three root causes of mails not being able to be imported:
    • Bad "Content-Type" headers.  Even my bank gets this wrong with the header Content-Type: text/html; charset="utf-8 charset=\"iso-8859-1\"".  I just made the code ignore similar bad headers and try the fallbacks.   Patch here
    • Messages with no text or HTML body and no attachments.  These are fairly common for example a calendar entry might be sent as "Content-Type: text/calendar".  I just made it so that if there is no recognised body it just uses whatever the last section it found was, regardless of content type.  Patch here
    • Google Chat messages from many years ago.  These have no useful anything, no body, no to: no message id, no return address. Rather than note them as failures I use made the code ignore them completely.  Since this is just a warning, no upstream patch prepared.
  • Handling List-Id's.  Ponymail likes to sort mails by the List-Id which makes a lot of sense where you have the thousands of Apache lists.  But with personal email, and certainly when you subscribe to various newsletters, or get bills, or spam that got into the archives then you end up with lots of list id's that are only used once or twice or are not useful.  Working on open source projects there's lots of lists that I'm on that I want the email to get archived, but it would be nice if it was separated out in the Ponymail UI.  So really I needed the ability to have an 'allow list' of list id's that I want to have separate, with everything else defaulting to a generic list id (being my email address where all those mails came into).  Patch here
  • HTML email.  Where an email contains only HTML and no text version then Ponymail will make and store a text conversion of the HTML, but sometimes, especially those pesky bank emails, it's useful to be able to see the HTML with all the embedded images.  Displaying HTML email in HTML isn't really a goal for the project, especially since you have to be really careful you don't end up parsing untrusted javascript for example.  And you might not want all those tracking images to suddenly start getting pinged.  But I'd really like a button that you could use on selected emails to display them in HTML.  Fortunately Ponymail stores a complete raw copy of the email, any my proof-of-concept worked, so this can be easy to add in the future.
  • Managing a personal email archive can be a daunting task especially with the volume of email correspondence. However, with Ponymail, it's possible to take control of your email archive, keep it local and secure, and search through it quickly and efficiently using the power of ElasticSearch.

    Created: 28 Mar 2023
    Tagged as: ,

    Hi! I'm Mark Cox. This blog gives my thoughts on security work, open source, home automation, and other topics.