Category Archives for Technology
Indexes, tables of contents, page numbers and other dynamically constructed elements in a Word document are driven by field codes. A table of contents for user-defined headings is typically { TOC \h \z “My heading style” }, and you can see the code if you right-click the table of contents and select “Toggle Field Codes” from the menu.
When you generate one of these structures, a rendered version of the field code is embedded in the document. This is what is viewed and printed, but it is not the master definition for the element; the field code is. Today I was generating a table of contents for a dynamically constructed WordProcessingML document, and found out how problematic this can be.
When you generate a table of contents, there is a lot of logic to perform. First you need a list of all the titles you’re going to use, then you need the page numbers that they’re on. But in order to get the correct page numbers, you need to repaginate the document with the completed table of contents included. So first you grab the titles and render the list with unspecified page numbers, which are also stored as field codes. Then you repaginate, then go back and render the page number field codes, embedding the non-master rendered page numbers alongside the field codes. Confusing?
If you use the \h TOC switch, you also need to generate hyperlinks from the titles in the TOC to the titles in the document, so when you ctrl-click a TOC title, it jumps to that page. This involves generating infrastructure bookmarks throughout the document, and referencing them from the TOC. It doesn’t help that the bookmarks aren’t actually part of the Word schema, but the amlcore schema, and are thus in a different namespace. The page numbers, which are in fact rendered field codes themselves, also need to be hyperlinked in the same way.
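To make the bookmark plumbing a little more concrete, here is a rough Python sketch of the hyperlink half of the job only. It is written from my recollection of the Word 2003 format, so treat the element and attribute names (aml:annotation with Word.Bookmark.Start and Word.Bookmark.End, and w:hlink with a w:bookmark attribute) as assumptions to check against the schema. The helper names and the _Toc naming pattern are made up for illustration. Page numbers are left out on purpose, since they are the part that needs Word's layout engine.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as I remember them for Word 2003 XML (assumption).
W = "http://schemas.microsoft.com/office/word/2003/wordml"
AML = "http://schemas.microsoft.com/aml/2001/core"
ET.register_namespace("w", W)
ET.register_namespace("aml", AML)


def w(name):
    return f"{{{W}}}{name}"


def aml(name):
    return f"{{{AML}}}{name}"


def text_run(text):
    """Build a w:r run holding a single w:t text node."""
    run = ET.Element(w("r"))
    ET.SubElement(run, w("t")).text = text
    return run


def bookmarked_heading(title, index):
    """Hypothetical helper: wrap a heading's text in a bookmark start/end pair."""
    name = f"_Toc{index:06d}"  # illustrative naming pattern, not required
    para = ET.Element(w("p"))
    ET.SubElement(para, aml("annotation"), {
        aml("id"): str(index),
        w("type"): "Word.Bookmark.Start",
        w("name"): name,
    })
    para.append(text_run(title))
    ET.SubElement(para, aml("annotation"), {
        aml("id"): str(index),
        w("type"): "Word.Bookmark.End",
    })
    return para, name


def toc_entry(title, bookmark_name):
    """Hypothetical helper: a TOC line whose title links to the bookmark."""
    para = ET.Element(w("p"))
    link = ET.SubElement(para, w("hlink"), {w("bookmark"): bookmark_name})
    link.append(text_run(title))
    return para


heading, name = bookmarked_heading("Introduction", 1)
entry = toc_entry("Introduction", name)
print(ET.tostring(heading, encoding="unicode"))
print(ET.tostring(entry, encoding="unicode"))
```

Note how the bookmark markers live in the aml namespace while everything around them is in the w namespace, which is exactly the split complained about above.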
In order to generate this rendered field codes inside rendered field codes mess, you need to have access to the Word rendering engine, so you can calculate page numbers, and if all you have available is XML and XSLT, that poses an interesting problem. How do you generate a Word table of contents from scratch, without invoking Word? (And without rewriting Word!)
At first you’d think that by leaving out the rendered version of the field code, Word might think to regenerate it. But due to what I’m guessing is a complex legacy issue, this isn’t the case. Leave out the rendered TOC, and you haven’t got a TOC. A simple dirty flag on the field code would probably solve the problem, triggering Word’s field code rendering, but Word currently doesn’t have anything like that for field codes. Inline styles sure, but not for field codes.
There are actually two kinds of field codes in Word. The first, unsurprisingly, is called a simple field code. The schema describes them as “These fields are run-time calculated entities in Word (for example, page numbers)”, and they look like this:
<w:fldSimple w:instr='TOC \z "Item title,1"'/>
While they embed the field code quite nicely, they still don’t get rendered (calculated) at run-time unless you manually tell them to. Arguably not exactly run-time.
The other kind of field code is, not surprisingly, a complex field code, but surprisingly the schema doesn’t call it that; it is just a field code. Complex field codes use what are called fldChar markers to mark up sections of a document, however large, as field codes and their rendered views. Not much else to add here other than no, you can’t auto render those when the document opens either.
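For anyone who hasn't seen one, a complex field is a run of begin, instruction, separate, rendered result and end markers. The Python sketch below builds that sequence with ElementTree. The namespace URI and the complex_field helper are my own assumptions for illustration, and the instruction string is the TOC example from above written with the backslash that Word's switch syntax normally uses. The placeholder text stands in for the rendered view, which is the part only Word can really produce.

```python
import xml.etree.ElementTree as ET

# Word 2003 XML namespace as I remember it (assumption).
W = "http://schemas.microsoft.com/office/word/2003/wordml"
ET.register_namespace("w", W)


def w(name):
    return f"{{{W}}}{name}"


def complex_field(instruction, placeholder):
    """Hypothetical helper: return the five w:r runs that form one complex field."""
    def run(child_tag, text=None, **attrs):
        r = ET.Element(w("r"))
        child = ET.SubElement(r, w(child_tag), {w(k): v for k, v in attrs.items()})
        child.text = text
        return r

    return [
        run("fldChar", fldCharType="begin"),      # field starts
        run("instrText", text=instruction),       # the master definition
        run("fldChar", fldCharType="separate"),   # code ends, rendered view starts
        run("t", text=placeholder),               # rendered view (stand-in text)
        run("fldChar", fldCharType="end"),        # field ends
    ]


paragraph = ET.Element(w("p"))
paragraph.extend(complex_field(r' TOC \z "Item title,1" ', "Update this field in Word."))
xml_text = ET.tostring(paragraph, encoding="unicode")
print(xml_text)
```

Everything between the separate and end markers is the rendered copy, so a generator that can't paginate can only put a placeholder there and leave the updating to whoever opens the document.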
So how do you automatically generate a table of contents when you open a Word document you’ve generated outside of Word? Absolutely no idea. But if I’m right, it’s yet another example of Microsoft simply jumping on the XML bandwagon, and just exporting the underlying Word structures as XML, instead of carefully thinking about why developers might actually want to do this.
Following on from the last post about WordProcessingML list gotchas, another piece of obscure information about lists is that you can’t technically restart them. Restarted lists are actually new w:list definitions, which may or may not point to an older w:listDef.
The problem here is that if you’re constructing a Word document from multiple parts or documents, then you need to keep track of how many lists there are in a document, and what their IDs are, because you’ll need them all at the end, when you create the various w:list structures needed to point to them all.
In Word every list is unique. If you restart a list with the UI, under the covers it actually creates another list which just points to the same w:listDef, so you can change the style of all the restarted lists in one place, but it causes havoc when you want to convert multiple HTML <ol>s to Word lists.
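The bookkeeping this implies can be sketched in a few lines of Python. ListRegistry is a hypothetical helper, not anything from Word or from my converter: it hands out a fresh list ID for each HTML <ol> as it is encountered, then writes one w:list per ID at the end, all pointing at a single shared w:listDef. The attribute names (w:ilfo on w:list, w:ilst with a w:val, w:listDefId) are from my memory of the 2003 schema and should be treated as assumptions.

```python
import xml.etree.ElementTree as ET

# Word 2003 XML namespace as I remember it (assumption).
W = "http://schemas.microsoft.com/office/word/2003/wordml"
ET.register_namespace("w", W)


def w(name):
    return f"{{{W}}}{name}"


class ListRegistry:
    """Hypothetical helper: one new list ID per <ol>, one shared definition."""

    def __init__(self, list_def_id=0):
        self.list_def_id = list_def_id
        self.ilfos = []

    def new_list(self):
        """Allocate the ID to use on the paragraphs of the next <ol>."""
        ilfo = len(self.ilfos) + 1  # IDs start at 1 in this sketch
        self.ilfos.append(ilfo)
        return ilfo

    def build_lists(self, list_def):
        """Emit w:lists: the shared w:listDef first, then one w:list per ID."""
        lists = ET.Element(w("lists"))
        lists.append(list_def)
        for ilfo in self.ilfos:
            entry = ET.SubElement(lists, w("list"), {w("ilfo"): str(ilfo)})
            ET.SubElement(entry, w("ilst"), {w("val"): str(self.list_def_id)})
        return lists


registry = ListRegistry()
first = registry.new_list()    # ID for the first <ol>
second = registry.new_list()   # ID for the second <ol>
shared_def = ET.Element(w("listDef"), {w("listDefId"): "0"})
lists = registry.build_lists(shared_def)
print(ET.tostring(lists, encoding="unicode"))
```

The sketch only covers allocating and declaring the IDs. Whether each new w:list also needs an explicit start override to restart its numbering is something to verify in Word.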
Microsoft Office has the ability to save documents in XML (WordProcessingML) format, so I’ve written an HTML to WordProcessingML converter for one of our projects at Synop. But while the schema is provided, there’s not much useful documentation, and there are some traps.
In WordProcessingML, lists are generated by applying a w:listPr element to a paragraph. The w:listPr points to what’s called a w:ilfo element, and it is the w:ilfo element which points to the structure which defines the style of the list, the w:listDef element. Think of it as a memory handle, as it works the same way.
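The handle analogy is easier to see when you follow the pointers by hand. The Python sketch below resolves a paragraph to its list definition in two hops. The sample document is a stripped-down guess at the 2003 markup from memory: there, the handle I call the w:ilfo element appears as a w:list element carrying a w:ilfo attribute, with a w:ilst child holding the definition ID. The exact names, and the resolve_list_def helper itself, are assumptions to check against the XSD.

```python
import xml.etree.ElementTree as ET

# Word 2003 XML namespace as I remember it (assumption).
W = "http://schemas.microsoft.com/office/word/2003/wordml"


def w(name):
    return f"{{{W}}}{name}"


SAMPLE = f"""
<w:wordDocument xmlns:w="{W}">
  <w:lists>
    <w:listDef w:listDefId="0"/>
    <w:list w:ilfo="1"><w:ilst w:val="0"/></w:list>
  </w:lists>
  <w:body>
    <w:p><w:pPr><w:listPr><w:ilvl w:val="0"/><w:ilfo w:val="1"/></w:listPr></w:pPr></w:p>
  </w:body>
</w:wordDocument>
""".strip()


def resolve_list_def(paragraph, lists):
    """Hypothetical helper: paragraph -> handle -> w:listDef, or None."""
    pointer = paragraph.find(f"{w('pPr')}/{w('listPr')}/{w('ilfo')}")
    if pointer is None:
        return None  # not a list paragraph
    handle = pointer.get(w("val"))
    # Hop 1: find the handle entry with a matching ID.
    for entry in lists.findall(w("list")):
        if entry.get(w("ilfo")) != handle:
            continue
        def_id = entry.find(w("ilst")).get(w("val"))
        # Hop 2: find the definition the handle points at.
        for list_def in lists.findall(w("listDef")):
            if list_def.get(w("listDefId")) == def_id:
                return list_def
    return None


root = ET.fromstring(SAMPLE)
paragraph = root.find(f"{w('body')}/{w('p')}")
resolved = resolve_list_def(paragraph, root.find(w("lists")))
print(resolved.get(w("listDefId")))
```

Repointing a list at a different style is then just a matter of changing the value held by the handle, which is the whole attraction of the indirection.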
So far so good, as it makes moving styles around fairly easy, by just changing pointer values. You can also restart the numbering of a list inside the w:ilfo, and the bullet characters from within the w:listPr, but for all intents the w:listDef is where the action happens.
Now the w:ilfo and w:listDef structures are all kept inside a master w:lists element at the document root, and while the ordering within each group doesn’t matter, the grouping of like elements does. For example, you can have two w:listDefs then two w:ilfos which point to either of the w:listDefs, in any order within each group, but you can’t have a w:listDef followed by a w:ilfo, followed by a w:listDef.
This of course flies in the face of having XML based handles in the first place, so my assumption was that it was a bug and not by design. However upon checking the schema (XSD), there’s an xs:sequence which dictates that listDefs must all be included before any ilfos. So either it is by design, or whoever coded the XSD wasn’t thinking DOM and XPath access to the data. Not only that, but the schema doesn’t actually validate (in XMLSpy), so in the immortal words of the XML nazis, it’s not technically a schema.
Anyway, aside from having to know how to read an XSD, this isn’t documented anywhere, so a tip for budding WordProcessingML developers: always put w:listDefs together, followed by w:ilfos.
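If your generator appends list structures in whatever order it meets them, one cheap way to follow that tip is to regroup the children of w:lists just before writing the file. This is a minimal Python sketch under the same assumptions as before: the namespace URI is from memory, group_list_children is a name I made up, and the handle entries are written as w:list elements with a w:ilfo attribute.

```python
import xml.etree.ElementTree as ET

# Word 2003 XML namespace as I remember it (assumption).
W = "http://schemas.microsoft.com/office/word/2003/wordml"
ET.register_namespace("w", W)


def w(name):
    return f"{{{W}}}{name}"


def group_list_children(lists):
    """Reorder w:lists in place: every w:listDef first, everything else after.

    Relative order inside each group is preserved.
    """
    defs = [child for child in lists if child.tag == w("listDef")]
    rest = [child for child in lists if child.tag != w("listDef")]
    lists[:] = defs + rest
    return lists


# Interleaved input, the arrangement Word refuses to accept.
mixed = ET.Element(w("lists"))
ET.SubElement(mixed, w("listDef"), {w("listDefId"): "0"})
ET.SubElement(mixed, w("list"), {w("ilfo"): "1"})
ET.SubElement(mixed, w("listDef"), {w("listDefId"): "1"})
ET.SubElement(mixed, w("list"), {w("ilfo"): "2"})

group_list_children(mixed)
order = [child.tag.split("}")[1] for child in mixed]
print(order)  # ['listDef', 'listDef', 'list', 'list']
```

Doing this as a final pass means the rest of the converter can keep treating the handles as plain pointers and never worry about where in w:lists they land.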
I’ve been doing a few of these recently, days when the blogosphere changed forever. Today Amazon released OpenSearch, an API for customised search of content sites. I’ve been crapping on about this kind of thing for 18 months now, the ability for consumers to make aggregation and consumption decisions. Finally Amazon has taken the next step. The big guys have finally caught on, and ironically the little guys never really understood what they’d invented. Next we wait and see how fast reader clients jump all over it.