Info: pandoc used to convert exported XHTML to DOCX leads to very good results

t.dumm · May 31, 2018, 1:48pm

I do not know if this is of interest.

I tried the free software “pandoc” (https://pandoc.org/) to convert the XHTML file exported from Pressbooks. I did this because the exported ODT file does not contain paragraph styles (Heading 1, Heading 2, …), which makes it difficult to edit.

I first use an XSLT stylesheet to get the heading levels in the exported XHTML ‘right’ and to remove the table of contents and the chapter numbers (both are later generated dynamically in Word).
Then I use pandoc to convert the transformed XHTML to DOCX.

Pandoc can pull in the referenced images. If I like, I can also supply a reference Word document whose styles are then used for the output. Pandoc can also create ODT files.

The result is astonishingly good. Its quality allows further editing and a later re-import into Pressbooks (“roundtripping”). This is useful for a major overhaul of the content when the author, for whatever reason, wants to do it outside Pressbooks.

In case somebody wants to try it, here are the XSLT and the script that I ran on the command line on a Mac.

Script:

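#!/bin/sh
# Step 1: fix the heading levels and drop the TOC and chapter numbers (Saxon-HE, XSLT 2.0)
/usr/bin/java -jar /opt/saxon-he/saxon9he.jar -s:input.html -xsl:pandoc.xsl -o:output.html
# Step 2: convert the transformed XHTML to DOCX
pandoc output.html --data-dir . -f html -t docx -o output.docx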

--data-dir . means that pandoc looks for the optional reference Word document, named “reference.docx”, in the current directory (the one the script is run from).
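
To get a starting point for the reference document, pandoc can write out its built-in default, which can then be restyled in Word. The following is only a rough sketch and not part of the script above: the helper names run, make_reference_docx and html_to_odt are made up for illustration, and it assumes the installed pandoc supports the --print-default-data-file option. With DRY_RUN=1 the commands are printed instead of executed.

```shell
#!/bin/sh
# Sketch only. Helper names are hypothetical, not from the original post.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Write pandoc's default reference.docx into the current directory,
# unless a reference.docx is already there.
make_reference_docx() {
    if [ -e reference.docx ]; then
        echo "reference.docx already exists, leaving it alone" >&2
    else
        run pandoc -o reference.docx --print-default-data-file reference.docx
    fi
}

# Same conversion as in the script above, but with ODT as the target format.
# The output name is the input name with its extension replaced by .odt.
html_to_odt() {
    run pandoc "$1" --data-dir . -f html -t odt -o "${1%.*}.odt"
}
```

Calling make_reference_docx once and then editing the styles of the resulting file in Word should be enough for the DOCX route. For ODT output, pandoc should look for a file named reference.odt in the data directory instead.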

pandoc.xsl:

    <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/1999/xhtml" xmlns:h="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="h">

    <xsl:output method="xml" encoding="UTF-8" indent="no"
        doctype-public="-//W3C//DTD XHTML 1.0 Transitional//EN"
        doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"/>

    <xsl:template match="/">
        <xsl:apply-templates/>
    </xsl:template>


    <!-- Drop the table of contents -->
    <xsl:template match="h:div[@id = 'toc']"> </xsl:template>


    <!-- Chapter titles: h2 becomes h1 -->
    <xsl:template
        match="h:div[contains(concat(' ', normalize-space(@class), ' '), concat(' ', 'chapter', ' '))]/h:div[contains(concat(' ', normalize-space(@class), ' '), concat(' ', 'chapter-title-wrap', ' '))]/h:h2">
        <h1>
            <xsl:apply-templates select="@* | node()"/>
        </h1>
    </xsl:template>

    



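    <!-- Drop the h3 headings in the wrapper divs of front matter, back matter and chapters (the numbers) -->
    <xsl:template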
        match="h:div[contains(concat(' ', normalize-space(@class), ' '), concat(' ', 'front-matter', ' '))]/h:div/h:h3"> </xsl:template>

    <xsl:template
        match="h:div[contains(concat(' ', normalize-space(@class), ' '), concat(' ', 'back-matter', ' '))]/h:div/h:h3"> </xsl:template>

    <xsl:template
        match="h:div[contains(concat(' ', normalize-space(@class), ' '), concat(' ', 'chapter', ' '))]/h:div/h:h3"> </xsl:template>





    <!-- Headings inside user content (class "ugc") move down one level: h1 becomes h2, and so on -->
    <xsl:template
        match="h:div/h:div[contains(concat(' ', normalize-space(@class), ' '), concat(' ', 'ugc', ' '))]/h:h1">
        <h2>
            <xsl:apply-templates select="@* | node()"/>
        </h2>
    </xsl:template>

    <xsl:template
        match="h:div/h:div[contains(concat(' ', normalize-space(@class), ' '), concat(' ', 'ugc', ' '))]/h:h2">
        <h3>
            <xsl:apply-templates select="@* | node()"/>
        </h3>
    </xsl:template>


    <xsl:template
        match="h:div/h:div[contains(concat(' ', normalize-space(@class), ' '), concat(' ', 'ugc', ' '))]/h:h3">
        <h4>
            <xsl:apply-templates select="@* | node()"/>
        </h4>
    </xsl:template>

    <xsl:template
        match="h:div/h:div[contains(concat(' ', normalize-space(@class), ' '), concat(' ', 'ugc', ' '))]/h:h4">
        <h5>
            <xsl:apply-templates select="@* | node()"/>
        </h5>
    </xsl:template>

    <xsl:template
        match="h:div/h:div[contains(concat(' ', normalize-space(@class), ' '), concat(' ', 'ugc', ' '))]/h:h5">
        <h6>
            <xsl:apply-templates select="@* | node()"/>
        </h6>
    </xsl:template>

    <!-- HTML has no level below h6, so h6 becomes a div with the custom Word style "Titel o. Nr." -->
    <xsl:template
        match="h:div/h:div[contains(concat(' ', normalize-space(@class), ' '), concat(' ', 'ugc', ' '))]/h:h6">
        <div custom-style="Titel o. Nr.">
            <xsl:apply-templates select="@* | node()"/>
        </div>
    </xsl:template>






    <!-- Copy everything else as is -->

    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>
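
To try the stylesheet without a full export, a tiny test input can help. The sketch below writes one; its structure is an assumption derived only from the match patterns in the stylesheet, not from a real Pressbooks export, and the function name write_sample as well as the classes chapter-number, chapter-title and chapter-ugc are made up for illustration.

```shell
#!/bin/sh
# Sketch only: writes a minimal XHTML file for trying out pandoc.xsl.
# The markup is guessed from the XSLT match patterns, not copied from Pressbooks.
write_sample() {
    cat > "$1" <<'EOF'
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Sample</title></head>
  <body>
    <div id="toc"><p>Contents (should be removed)</p></div>
    <div class="chapter">
      <div class="chapter-title-wrap">
        <h3 class="chapter-number">1</h3>
        <h2 class="chapter-title">Chapter title (should become h1)</h2>
      </div>
      <div class="ugc chapter-ugc">
        <h1>Section heading (should become h2)</h1>
        <p>Body text should be copied unchanged.</p>
      </div>
    </div>
  </body>
</html>
EOF
}
```

After write_sample sample.html, running the Saxon command from the script with -s:sample.html should produce a file without the toc div and without the h3, with the chapter title as h1 and the section heading as h2.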

t.dumm · June 5, 2018, 6:57pm

Pandoc’s internal document model doesn’t allow colspans or rowspans.

As a consequence, colspans and rowspans in HTML documents are completely broken after conversion to DOCX/ODT. This makes pandoc useless for documents containing “complex” tables.
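
Before converting, it may be worth checking whether an export contains such tables at all. A rough sketch, assuming the attributes are written in the usual form (for example colspan="2") in the XHTML; the function name has_spanning_cells is hypothetical.

```shell
#!/bin/sh
# Sketch only: succeeds (exit status 0) if the file contains a colspan or
# rowspan attribute with a value of 2 or more, i.e. a real merged cell.
# A plain text search, not an HTML parser, so unusual markup may be missed.
has_spanning_cells() {
    grep -Eiq '(col|row)span="?([2-9]|[1-9][0-9])' "$1"
}
```

It could be used as: if has_spanning_cells output.html; then echo "complex tables found"; fi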