Wednesday, March 30, 2011

HTML to Kindle: An Aggravation Odyssey

Today I spent several hours making a nifty HTML file of some recipes that I'd scanned out of an old cookbook and converted to text, and then had a hell of a time trying to get the email-to-Kindle service to accept my file. Allegedly, HTML files are JUST FINE, but it seemed like every time I sent it, I got an error message in return. It turns out I had omitted a critical HTML tag (the <body> tag), so my file would not convert no matter how many settings I twiddled, and I suspect that accounts for most of my aggravation. Yes, that's right; it was probably user error.

Still, I'm going to note down the steps I took to get my scanned recipes into a format that looks, if I say so myself, rather nice on my Kindle, and even has a table of contents. Mainly, this is in case *I* forget how to do it later, but hopefully it's useful to someone else, too. I used so many random web pages to compile these instructions that it's not even funny, but Kindle Formatting was one of the most helpful.

First, though, a note on why I didn't just send the PDF straight to my Kindle. Well, I did. BUT the PDF viewer did all kinds of annoying things like auto-rotating and generally making it hard to read. So I tried cutting and pasting the OCR text from my PDF into a Word doc, and then converting that back to PDF, and putting it on the Kindle. It was...better, but I wasn't able to browse to locations, and the spacing was all messed up. So here's what I did that DID work, from beginning to end, in ten not-so-easy steps:



1. I scanned the pages I wanted to turn into a document and saved them as PDF with OCR (optical character recognition) enabled. 

2. I copied and pasted the recipe text into a Word document. I then organized the document to be more or less the way I wanted it (removing extra spacing, fixing anything the OCR got wrong). Since the ultimate output would be a stripped-down HTML file, I didn't worry too much about font sizes or other font formatting--that has to be done manually later.

3. In OpenOffice (or Word), I did a Save As...HTML file. (Some of the web pages I consulted said that in Word for PC, you should select Web Page, Filtered.)

4. I opened the HTML file in a text editor (Dreamweaver also works) and then the tedious part began. I took out ALL NON-ESSENTIAL CODE, which means basically anything other than <html>, <head>, <title>, <body>, <h1> and other heading tags, <p>, <br />, and list tags. Anything referring to font sizes, boldface, or other "fancy" formatting had to go. (Apparently, it can be rendered inconsistently, so I played it safe.) The Find and Replace function really helped here.
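If hand-editing in step 4 gets too tedious, the cleanup could probably be scripted. Here's a rough Python sketch, not what was actually used for this file: it keeps only the tags listed above and throws away everything else, attributes included. The names `KEEP`, `Stripper`, and `strip_markup` are made up for illustration, and it should be run before adding any anchors or page breaks, since it would strip those too.

```python
from html import escape
from html.parser import HTMLParser

# Tags step 4 describes as safe to keep; every other tag is dropped.
KEEP = {"html", "head", "title", "body",
        "h1", "h2", "h3", "h4", "h5", "h6",
        "p", "br", "ul", "ol", "li"}
# Elements whose contents should disappear along with the tag itself.
DROP_CONTENT = {"style", "script"}


class Stripper(HTMLParser):
    """Collects a stripped-down copy of the document in self.out."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <style> or <script>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        elif tag in KEEP and not self.skip_depth:
            # Attributes (class, style, font settings) are never copied.
            self.out.append("<br />" if tag == "br" else f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in KEEP and tag != "br" and not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Re-escape &, <, > because the parser already decoded them.
            self.out.append(escape(data, quote=False))


def strip_markup(html_text):
    """Return html_text with only the KEEP tags left in place."""
    parser = Stripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Comments and the doctype line get dropped as a side effect, because the parser ignores them unless told otherwise. The text inside tags like <b> or <font> survives; only the tags themselves go.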

5. Then I did some fiddling with removing extra space in the document for my own ease of reading it (I think this can be automated in Dreamweaver) and added paragraph or line break tags where I felt they were necessary. I made all recipe titles h1 headings, any other titles of minor importance h3 headings, and everything else just enclosed in paragraph tags--except for one unordered list, which rendered fine.

6. Between each recipe--this is important--I added the Kindle tag which indicates a page break/section break: <mbp:pagebreak>  That way a new recipe would start on a new "page" rather than being continuous.

7. Next, as per the instructions on Kindle Formatting: "The Kindle has built-in bookmarks for the Table of Contents and the start of the book's content. Use the following anchor tags to mark those places in your book: <a name="TOC"/> and <a name="start"/>. Place the anchors right after the page break tag, before any headings or paragraphs."

8. Here's the part where knowing some basic HTML is helpful. Time to create the Table of Contents in a way that the Kindle can read and browse it. It's actually very easy: all you're doing is creating text anchors within the document, just like those you might create on a regular HTML page. Give the TOC its own "page" by separating it out with the page break tag from step 6. Put anchors in the text wherever you want the Kindle to be able to browse to (in this case, I put them before each recipe title). Then, in the TOC, link to each anchor using whatever link text you want (in this case, I used the title of the recipe). For example:

<p>Table of Contents</p>
<p><a href="#one">OKRA SKILLET</a></p>
<p><a href="#two">HERBED GREEN BEANS</a></p>
<mbp:pagebreak>
<p><a name="one">OKRA SKILLET</a></p>
<p>(recipe for okra skillet)</p>
<mbp:pagebreak>
<p><a name="two">HERBED GREEN BEANS</a></p>
<p>(recipe for herbed green beans)</p>

Simple enough, if you've done it before in HTML. Thanks to Michael R. Hicks for the tips.
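With more than a handful of recipes, typing out the TOC links and matching anchors by hand gets error-prone. Here's a minimal Python sketch that generates both from one list, following the pattern in the example above and using h1 for recipe titles as in step 5. The function name `build_book` and the `r1`, `r2`, ... anchor names are invented for illustration, and the TOC and start bookmarks from step 7 are left out to keep it short.

```python
from html import escape


def build_book(title, recipes):
    """Build a Kindle-ready HTML string.

    recipes is a list of (recipe_title, body_html) pairs in reading order.
    body_html is assumed to be already-cleaned HTML (paragraphs, lists).
    """
    toc = ["<p>Table of Contents</p>"]
    pages = []
    for number, (name, body_html) in enumerate(recipes, start=1):
        anchor = f"r{number}"  # hypothetical naming scheme: r1, r2, ...
        # One TOC link per recipe, pointing at the anchor created below.
        toc.append(f'<p><a href="#{anchor}">{escape(name)}</a></p>')
        # Page break first, so each recipe starts on its own "page".
        pages.append("<mbp:pagebreak>")
        pages.append(f'<h1><a name="{anchor}">{escape(name)}</a></h1>')
        pages.append(body_html)

    parts = ["<html>",
             f"<head><title>{escape(title)}</title></head>",
             "<body>"]  # the tag that is easy to forget
    parts += toc + pages
    parts += ["</body>", "</html>"]
    return "\n".join(parts)
```

Calling `build_book("Recipes", [("OKRA SKILLET", "<p>(recipe for okra skillet)</p>")])` gives back the whole file as one string, ready to save. Because the link and the anchor come from the same loop, they can't drift out of sync.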

9. Here's where I'm not entirely sure what's actually required, because I was missing that stupid BODY tag and the bounce-back messages from Amazon were not giving me any info about the actual error. So then began some trial-and-error fiddling that may not have been necessary. I read that Unicode UTF-8 can be a problem for some versions of the Kindle. So I went into Dreamweaver and made sure the encoding for the HTML file was set to Western/Latin as opposed to Unicode (in Preferences, New Document). I opened a new file and copied and pasted all the HTML into it, and voila, I had my file in Western/Latin encoding.
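For anyone without Dreamweaver, the same re-encoding can likely be done in a few lines of Python. This is a sketch under assumptions: that "Western/Latin" means ISO-8859-1 (Latin-1), and that swapping curly quotes and fancy dashes for plain ones is acceptable. The `REPLACEMENTS` table and the function name `to_latin1` are made up for illustration, and as noted above, this step may not be needed at all.

```python
# Characters that Latin-1 cannot represent, mapped to plain stand-ins.
REPLACEMENTS = str.maketrans({
    "\u2018": "'",    # left single curly quote
    "\u2019": "'",    # right single curly quote / apostrophe
    "\u201c": '"',    # left double curly quote
    "\u201d": '"',    # right double curly quote
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # ellipsis
})


def to_latin1(text):
    """Return text as Latin-1 bytes.

    Known troublemakers are swapped for plain ASCII first; anything else
    Latin-1 cannot hold becomes a numeric HTML reference like &#9731;.
    """
    plain = text.translate(REPLACEMENTS)
    return plain.encode("latin-1", errors="xmlcharrefreplace")
```

To use it, read the HTML file as text, pass it through `to_latin1`, and write the result back out in binary mode. Accented letters like the e in "saute" are part of Latin-1 and come through untouched.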

10. I e-mailed my HTML file as an attachment to my free Kindle address with "Convert" as the subject line, and lo and behold, there it was, table of contents and all. Evidently you can also use some basic CSS with the Kindle, though I didn't try that this time.
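The e-mail in step 10 could also be put together in Python rather than in a mail program. The sketch below only builds the message; it doesn't send anything. The function name `build_kindle_message` and the body text are placeholders, the addresses you pass in are your own, and whether the Kindle service cares about the attachment's exact content type is an assumption I can't confirm.

```python
from email.message import EmailMessage


def build_kindle_message(sender, kindle_address, filename, html_bytes):
    """Build (but do not send) an e-mail carrying the HTML file.

    sender and kindle_address are supplied by the caller; html_bytes is
    the finished file's contents, already encoded.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = kindle_address
    msg["Subject"] = "Convert"  # the subject line used in step 10
    msg.set_content("Recipes attached.")  # placeholder body text
    # Attach the file as text/html under its original filename.
    msg.add_attachment(html_bytes, maintype="text", subtype="html",
                       filename=filename)
    return msg
```

Sending it is a separate matter: the standard library's `smtplib.SMTP` has a `send_message` method that takes a message like this one, but the server name and login details depend entirely on your own mail provider.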

3 comments:

tanita davis said...

All the other fun stuff aside, I cannot believe you didn't tell me you got a Kindle!!!

Are you loving it? Are you reading a lot of YA on it, or mostly just using it for ...cooking?

David T. Macknet said...

Welcome to the world of e "books": where you have to do all of the formatting (and removing of idiot characters) yourself.

We're stuck with readers which are more than a decade out of vogue (when's the last time you heard of Handspring?). Thus, characters like "smart" quotes show up as gibberish. So, the conversion process involves LOTS of manipulation of text.

Of course, I wrote a program to do it, but still: annoying.

aquafortis said...

I don't know how much better it really is on Kindle vs. Handspring...a bit, I'm sure, but there's still all the proprietary junk that makes it difficult to move documents from one platform to another. Hence the need for the stripped-down HTML file. There are programs you can buy, too, that convert files, but I'm not making full e-books, just stacks of recipes I can browse on my Kindle instead of having them clutter up the kitchen in stacks of paper...

Sorry I forgot to mention the Kindle! I got it in Feb. as a Chinese New Year/early birthday present from the in-laws. My MIL absolutely adores her Kindle, which she got as a birthday present last year. I really like it so far, but I've only read Chris Cope's book at this point (only available as an ebook). The only other things I've done (besides messing with recipes) are downloading some of the free public domain books for future reading, and downloading a word game.

I have this stack of ARCs still from ALA that I'm working through, so I haven't gotten to reading much on the Kindle yet. But I'd love to try out that Netgalley thing sometime.