detect metadata right from epub files
This issue has been raised with ticket:17 in the original tracker, and was transferred here slightly altered:
it would be nice if minicalope could fetch the title, author, lang, etc. right from the epub file, instead of taking the dir/filename or relying on additional *.data files that will be hard for the user to maintain in the long run.
This might be tricky to do in PHP, so an alternative idea could be to allow the user to use a 'backend' that will parse the epub file and return metadata in the format expected by minicalope.
The referenced "ticket" includes a patch, relying on a backend written in C and also attached.
I strongly advise against using such a feature in automatic runs, especially when unsupervised:
- epub description might contain "invalid HTML" (e.g. missing closing tags for lists), which then would break OPDS (while working fine in HTML)
- the same author might turn up in many different spellings
For the latter, an example: Bertha von Suttner. Most of the time (in my experience, from running checks against the ~7,000 books in the German catalog on ebooks.qumran.org), she turns up as either "Bertha Suttner" or "Suttner, Bertha". That makes two entries for the same author. But she had a title, so she might also turn up as "Bertha von Suttner", "Suttner, Bertha von" and even "von Suttner, Bertha" – making 5 different variants. Now, her title really would be "Freifrau von Suttner". And her full name is "Bertha Sophia Felicita Freifrau von Suttner" (and if you think that's already the most complicated name, check Ida Marie Louise Sophie Friederike Gustave Gräfin von Hahn 😇). Unsupervised automated runs would leave all possible combinations in place – making "books by author X" quite … well, a broken concept.
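To illustrate why these spellings need attention, here is a minimal Python sketch (not part of minicalope, which is written in PHP) that reduces an author string to a rough grouping key. The function name `author_key` and the `PARTICLES` set are assumptions made up for this example; the set is hand-picked and far from complete.

```python
import re
import unicodedata

# Assumption: a small, hand-picked set of particles and titles to ignore.
PARTICLES = {"von", "van", "de", "der", "zu",
             "freifrau", "freiherr", "gräfin", "graf"}


def author_key(name: str) -> str:
    """Reduce an author string to a rough grouping key.

    The key is the sorted, lowercased name parts with particles removed,
    so the order of first and last name does not matter.
    """
    name = unicodedata.normalize("NFC", name)
    # Turn "Suttner, Bertha von" into "Bertha von Suttner".
    if "," in name:
        last, _, first = name.partition(",")
        name = f"{first} {last}"
    parts = [p for p in re.split(r"\s+", name.strip().lower())
             if p and p not in PARTICLES]
    return " ".join(sorted(parts))


if __name__ == "__main__":
    for variant in ("Bertha Suttner", "Suttner, Bertha",
                    "Bertha von Suttner", "Suttner, Bertha von",
                    "von Suttner, Bertha"):
        print(f"{variant!r} -> {author_key(variant)!r}")
```

The five short variants collapse to one key, but the full name "Bertha Sophia Felicita Freifrau von Suttner" still yields a different one because of the middle names. A key like this can help group candidates for a manual check; it does not replace the check.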
So what I plan in a first run is:
- creating a class for reading epub metadata (done and in testing currently:
- creating a class extending this, taking care of creating the .data files for a given book (next on my schedule;
- creating a simple script making use of the two, and including it within e.g. the doc/ directory (script already exists and has been tested by me for the past couple of weeks; needs rework incl. splitting-out the
That way you can at least have all the metadata extracted semi-automatically (e.g. epubmeta book.epub would create the .data in the same place), and you can check (and fix/extend) the created files.
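For readers curious what reading metadata from an epub involves, here is a rough Python sketch under stated assumptions. It is not the PHP class described above and does not write .data files (their format is not shown here). It relies only on the general epub layout: META-INF/container.xml points to the OPF package file, which holds Dublin Core elements such as dc:title, dc:creator and dc:language. The function name `read_epub_metadata` and the returned dictionary keys are hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the epub container file and by Dublin Core metadata.
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_epub_metadata(path):
    """Return title, author and language values found in an epub's OPF file.

    `path` may be a file name or a binary file-like object.
    Every value is a list, because an epub may repeat an element.
    """
    with zipfile.ZipFile(path) as z:
        # container.xml names the location of the OPF package file.
        container = ET.fromstring(z.read("META-INF/container.xml"))
        rootfile = container.find(".//c:rootfile", NS)
        if rootfile is None:
            raise ValueError("no rootfile entry in META-INF/container.xml")
        opf = ET.fromstring(z.read(rootfile.get("full-path")))

    def texts(tag):
        # Collect the stripped text of every dc:<tag> element that has text.
        return [e.text.strip() for e in opf.iterfind(f".//dc:{tag}", NS) if e.text]

    return {
        "title": texts("title"),
        "author": texts("creator"),
        "lang": texts("language"),
    }
```

Returning lists rather than single strings keeps repeated elements (for example several dc:creator entries) visible, so they can be checked by hand.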
This is the next feature I have planned (of course, bug-fixes have higher priority, if bugs pop up 😉)
This feature has now been added for Metadata (by default, the .data files). As outlined above, there might be a few issues – depending on who built the .epub and how they've set up the metadata. I will outline the possible fields and their pitfalls here:
author: see above
isbn: safe. This is either an ISBN, or not present at all.
publisher: in my experience, this often holds more than just the publisher, usually also the place and year of publication. Up to you whether you want that.
rating: not sure. Rarely found in epubs.
series: Might not be the one you wish to file it under
tag: probably not one of those you are using to file your books, but you might wish to try
title: should be pretty safe, but no guarantees
uri: also pretty safe (and rarely used)
5cd0535 completed this task, so I'll close the issue now. Some remarks on extracting the book description that you should be aware of:
- though TOC is always present in .epub files, it's not always really useful (even if it fills the page)
- a book description may be available. If it is, it might contain HTML tags which might break the XML for the OPDS part (make sure to have $skip_broken_xml set to TRUE if you care for OPDS – otherwise OPDS users might be unable to access such a book)
- whether the head is useful or not is your decision. Doesn't usually break anything, but you never know how the metadata are set up (believe me, there are strange things around).
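To make the "HTML that breaks XML" point concrete, here is a small Python sketch that checks whether a description would survive being embedded in an XML document. It only illustrates the problem; it makes no claim about how $skip_broken_xml is implemented, and the function name `is_wellformed_fragment` is made up for this example.

```python
import xml.etree.ElementTree as ET


def is_wellformed_fragment(description: str) -> bool:
    """Return True if the description parses as XML inside a wrapper element."""
    try:
        # Wrap the text so that plain text or several sibling tags are allowed.
        ET.fromstring(f"<root>{description}</root>")
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    samples = [
        "<p>A fine book.</p>",
        "<ul><li>one<li>two</ul>",  # missing closing tags: tolerated as HTML
    ]
    for sample in samples:
        print(f"{sample!r}: {is_wellformed_fragment(sample)}")
```

A list with unclosed li tags renders fine in a browser but fails this check, which is exactly the kind of description that can make an OPDS feed unreadable.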
You can always check ebooks manually using the doc/epubmeta script, which extracts the full load of available values. Now enjoy!