
July 8, 2008

Add DOCTYPE to output in Cocoon

We can add a <!DOCTYPE> declaration to our XML output in Cocoon very easily. The default XMLSerializer can be configured to include a <!DOCTYPE> declaration. We then only have to use this newly configured serializer, and the <!DOCTYPE> is added to the output.

Consider a scenario where we have to convert an XML file to another XML format. The original XML file contains several HTML entities (like &deg;) in CDATA sections. We want those entities to end up as their Unicode equivalents in our output XML format. Here is the sample input XML:

<?xml version="1.0" encoding="UTF-8"?>
<input>
 <sample><![CDATA[The current temperature is 17 &deg;C.]]></sample>
</input>

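We can transform this XML with a simple XSL transformation. The disable-output-escaping="yes" attribute writes the text of the CDATA section without escaping the ampersand, so &deg; ends up in the output as an entity reference: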

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        
    <xsl:template match="/">
        <output>
            <xsl:apply-templates/>
        </output>
    </xsl:template>
    
    <xsl:template match="sample">
        <xsl:copy>
            <xsl:value-of select="." disable-output-escaping="yes"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>

Now we create a simple Cocoon pipeline to do this transformation:

<map:match pattern="article">
    <map:generate src="resources/input.xml"/>
    <map:transform src="xslt/article.xslt"/>
    <map:serialize type="xml"/>
</map:match>

This produces the following output XML:

<?xml version="1.0" encoding="UTF-8"?>
<output>
 <sample>The current temperature is 17 &deg;C.</sample>
</output>

Looks fine, but if we try to parse this output XML we get an error saying that &deg; is an unknown entity. And that is correct: XML itself predefines only five entities (&lt;, &gt;, &amp;, &apos; and &quot;), so a parser cannot resolve &deg; on its own. Therefore we need to add a <!DOCTYPE> declaration to the output XML. The DTD it references must declare the entity &deg; and provide a replacement value. The good thing is that we can use an XHTML DTD, which declares the HTML entities, &deg; among them. In Cocoon we create a new serializer:

<map:serializer logger="sitemap.serializer.xml"
 mime-type="text/xml" name="xml-entity"
 src="org.apache.cocoon.serialization.XMLSerializer">
 <doctype-public>-//W3C//DTD XHTML 1.1//EN</doctype-public>
 <doctype-system>xhtml11-flat.dtd</doctype-system>
</map:serializer>

Notice that we use the standard XMLSerializer; we only add two configuration elements: doctype-public and doctype-system. Their values are added to the <!DOCTYPE> declaration in the output XML. We change our pipeline and use this new serializer:

<map:match pattern="article">
    <map:generate src="resources/input.xml"/>
    <map:transform src="xslt/article.xslt"/>
    <map:serialize type="xml-entity"/>
</map:match>
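For orientation, here is a rough sketch of where both pieces sit in a sitemap. It assumes a Cocoon 2.1-style sitemap.xmap. An existing sitemap will normally already contain these container elements, with their own attributes and other components, so in practice we only add the serializer and the match to what is there:

```xml
<map:sitemap xmlns:map="http://apache.org/cocoon/sitemap/1.0">

  <map:components>
    <map:serializers>
      <!-- the new serializer from above, declared next to any existing ones -->
      <map:serializer logger="sitemap.serializer.xml"
       mime-type="text/xml" name="xml-entity"
       src="org.apache.cocoon.serialization.XMLSerializer">
       <doctype-public>-//W3C//DTD XHTML 1.1//EN</doctype-public>
       <doctype-system>xhtml11-flat.dtd</doctype-system>
      </map:serializer>
    </map:serializers>
  </map:components>

  <map:pipelines>
    <map:pipeline>
      <!-- the pipeline refers to the serializer by its name attribute -->
      <map:match pattern="article">
        <map:generate src="resources/input.xml"/>
        <map:transform src="xslt/article.xslt"/>
        <map:serialize type="xml-entity"/>
      </map:match>
    </map:pipeline>
  </map:pipelines>

</map:sitemap>
```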

And now we get the following XML output if we run the pipeline again:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE output PUBLIC "-//W3C//DTD XHTML 1.1//EN" "xhtml11-flat.dtd">
<output>
 <sample>The current temperature is 17 °C.</sample>
</output>

A parser that can locate the DTD can now resolve the entity, and we see a nice little degree sign. This is just a simple example. We can extend it and, for example, use our own DTD.
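As a minimal sketch of the own-DTD variant: the file name entities.dtd and the serializer name xml-own-dtd are made up for illustration. The sketch also assumes that leaving out doctype-public makes the serializer write a SYSTEM-only <!DOCTYPE>. The DTD only has to declare the entities we actually use:

```xml
<!-- entities.dtd (hypothetical file name): declares only the entities we need -->
<!ENTITY deg "&#176;">
```

The serializer then points at that file instead of the XHTML DTD:

```xml
<map:serializer logger="sitemap.serializer.xml"
 mime-type="text/xml" name="xml-own-dtd"
 src="org.apache.cocoon.serialization.XMLSerializer">
 <!-- hypothetical system identifier; whoever parses the output must be able to resolve it -->
 <doctype-system>entities.dtd</doctype-system>
</map:serializer>
```

A relative system identifier like this one is resolved against the location of the output document, so the DTD file has to be reachable from wherever the output is parsed.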