August 27, 2014

From LyX/LaTeX to Word

This is sort of a placeholder post. I'm busy meeting a deadline, but this should help future Steve, and anyone else, who needs to turn a LyX document into a Word document while keeping the format mostly sane. Broken, but sane.
  1. Export as LaTeX (plain).
  2. Run
    • latex <name of tex file, with or without extension>

  3. Run
    • bibtex <filename>

  4. Run
    • latex <filename>

  5. Run
    • latex <filename>

  6. Try to run
    • htlatex <filename> "html,0,charset=utf-8" "" -dhtml/
      • html: format to output
      • 0: normally each chapter goes into its own page; putting 0 here forces everything into a single page
      • charset=utf-8: let us be civilised
      • -dhtml/: puts the output files in an html sub-directory. Note that you can't have a space between -d and html/

  7. If the above fails with something like 'illegal storage address', and you get a warning about tex4ht.env not being found, then you need to find where it is in your TeX installation, and either:
    • export TEX4HTENV and try again
    • or copy tex4ht.env into your working directory
      • This approach also lets you tweak some export parameters locally. More on this in step 8.

  8. Open the html to verify correctness. You might object to the poor graphics quality. In this case copy tex4ht.env into the working directory if you haven't done so already, and then modify it so that it uses a higher density when converting images.
    • See this tex.stackexchange.com answer for more details
    • In my case, since dvipng was being used, I replaced all instances of
      • -D 96
    • with
      • -D 300

  9. It also helps if you
    • strip away html comments
      • These look like <!-- xxx -->
    • centre-align image divs
    • remove <hr/> instances

  10. These changes make the import into Libre/OpenOffice go more smoothly

  11. Open the html file in Libre/OpenOffice

  12. File > Export > ODT

  13. Close html file

  14. Open exported ODT

  15. Edit > Links

  16. Select all links

  17. Break Links

  18. Verify that the ODT file is now much larger, since the images should now be embedded rather than linked!

  19. File > Save As > Word 97 (doc)
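
If you end up repeating steps 2 to 6 a lot, they can be chained from a small Python 3 script. This is only a sketch: the function names (build_commands, run_all) and the file name in the usage comment are made up for illustration, and it assumes latex, bibtex and htlatex are on your PATH and that you pass the file name without its extension.

```python
import subprocess


def build_commands(basename, outdir="html/"):
    """Return the commands from steps 2 to 6, in order, as argument lists."""
    return [
        ["latex", basename],
        ["bibtex", basename],
        ["latex", basename],
        ["latex", basename],
        # No space between -d and the directory, as noted in step 6.
        ["htlatex", basename, "html,0,charset=utf-8", "", "-d" + outdir],
    ]


def run_all(basename):
    """Run each command in turn, stopping at the first one that fails."""
    for cmd in build_commands(basename):
        print("Running:", " ".join(cmd))
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(cmd, check=True)


# Example call (hypothetical file name, i.e. mydoc.tex exported from LyX):
# run_all("mydoc")
```

Passing each command as a list avoids shell quoting problems, which matters here because htlatex expects the empty "" argument to arrive as a real empty string.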

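The density change from step 8 can be scripted too. Again, a rough sketch under assumptions: it expects a local copy of tex4ht.env in the working directory, the helper names (raise_density, patch_env_file) are hypothetical, and it only touches flags written as -D 96.

```python
import re


def raise_density(env_text, old=96, new=300):
    """Replace every '-D <old>' density flag in the text with '-D <new>'."""
    # \b stops '-D 96' from also matching inside something like '-D 960'.
    pattern = re.compile(r"-D\s+%d\b" % old)
    return pattern.sub("-D %d" % new, env_text)


def patch_env_file(path="tex4ht.env", old=96, new=300):
    """Rewrite a local copy of tex4ht.env in place with the new density."""
    # latin-1 maps every byte to exactly one character, so the file
    # round-trips unchanged apart from the substitutions themselves;
    # newline="" leaves the existing line endings alone.
    with open(path, encoding="latin-1", newline="") as handle:
        text = handle.read()
    with open(path, "w", encoding="latin-1", newline="") as handle:
        handle.write(raise_density(text, old, new))


# Example call, run from the directory holding the copied file:
# patch_env_file("tex4ht.env")
```

Only run this against the copy in your working directory, so the tex4ht.env in your TeX installation stays as shipped.
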
Phew! To help future visitors, a simple Python script to fix up the html as I have described is included at the end of this post. You will need lxml and cssselect installed. Cheers, Steve

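#!/usr/bin/env python3
"""Tidy htlatex output before importing it into Libre/OpenOffice.

Takes the html file as its only argument and writes the result to stdout.
"""

import sys

from lxml import etree
from lxml.html import parse


def main(*args):
  if len(args) == 0:
    sys.stderr.write('usage: pass the html file to fix as the first argument\n')
    return 1

  doc = parse(args[0]).getroot()

  # turn each <hr/> into a <br/> in place to make doc conversion easier;
  # retagging keeps the element's position and any text that follows it
  for hr in doc.cssselect('body hr'):
    hr.tag = 'br'
    hr.attrib.clear()

  # remove comments because for some reason libreoffice opens up
  # html comments as document comments, slowing things down;
  # collect them first so the tree is not modified while iterating,
  # and use drop_tree() so that text following a comment is kept
  for comment in list(doc.iter(etree.Comment)):
    comment.drop_tree()

  # centre align all figures
  for div in doc.cssselect('div.figure'):
    div.attrib['style'] = 'text-align:center'

  # tostring() returns bytes when given an encoding, so write them raw
  sys.stdout.buffer.write(etree.tostring(doc, method='html', encoding='utf-8'))
  return 0


if __name__ == '__main__':
  sys.exit(main(*sys.argv[1:]))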