If you are responsible for producing technical documentation, you have probably considered authoring content in a semantic markup such as XML. Moving in this direction enables you to separate content from presentation and to support reuse by combining XML with frameworks such as the DITA Open Toolkit (DITA-OT). I’ve helped many authoring groups make this transition. However, it’s not always easy, because XML adds complexity to the writing process and distances authors from many aspects of document layout and design.
On a recent project at Flatirons, we helped ease this transition by developing a framework based on Microsoft Word 2007/2010, a custom Office Add-in, and a MarkLogic NoSQL database. This solution enabled authors to write in a familiar tool, take advantage of Word’s strengths, and reuse content by searching the NoSQL database for reusable components. The following diagram illustrates the architecture.
Figure 1: Logical Architecture
This approach became feasible when Microsoft adopted the Office Open XML formats (.docx, .pptx, .xlsx) as the default file formats in Office 2007. An Office Open XML file is essentially a zipped package of XML parts that represents a word processing document, spreadsheet, or presentation, along with embedded items such as charts and images. In the same release, Microsoft introduced content controls, which enable you to add a degree of semantic markup to a Word document. Essentially, you can use content controls to indicate that a section of content is a reusable component such as a preface, chapter, list, table, or figure. If you import the contents of a .docx file into a NoSQL database, you can then search the XML and extract reusable components for use in another context.
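To make the idea concrete, here is a minimal Python sketch of how content controls can be read out of a .docx package. It is not the code from our project, just an illustration that relies on two facts about the format: the main document body lives in the word/document.xml part, and each content control is stored as a w:sdt element whose w:tag and w:alias properties carry its metadata. The function name, the make_docx helper, the sample XML, and the "reusable-preface" tag value are all hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx file.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Hypothetical sample: one content control wrapping a short preface,
# followed by an ordinary paragraph that is not inside any control.
SAMPLE_XML = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    '<w:sdt><w:sdtPr><w:alias w:val="Preface"/>'
    '<w:tag w:val="reusable-preface"/></w:sdtPr>'
    '<w:sdtContent><w:p><w:r><w:t>This guide explains </w:t></w:r>'
    '<w:r><w:t>the product.</w:t></w:r></w:p></w:sdtContent></w:sdt>'
    '<w:p><w:r><w:t>Ordinary paragraph.</w:t></w:r></w:p>'
    '</w:body></w:document>'
)


def make_docx(document_xml):
    """Hypothetical helper: zip the XML into an in-memory package.

    The result holds only word/document.xml, so it is enough for this
    sketch but is not a complete file that Word would open.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", document_xml)
    buffer.seek(0)
    return buffer


def list_content_controls(docx):
    """Return a (tag, alias, text) tuple for each content control.

    `docx` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    val = f"{{{W_NS}}}val"
    controls = []
    for sdt in root.iter(f"{{{W_NS}}}sdt"):
        tag = sdt.find("w:sdtPr/w:tag", NS)
        alias = sdt.find("w:sdtPr/w:alias", NS)
        content = sdt.find("w:sdtContent", NS)
        # Concatenate every text run inside the control, nested ones included.
        text = ""
        if content is not None:
            text = "".join(t.text or "" for t in content.iter(f"{{{W_NS}}}t"))
        controls.append((
            tag.get(val) if tag is not None else None,
            alias.get(val) if alias is not None else None,
            text,
        ))
    return controls


if __name__ == "__main__":
    for control in list_content_controls(make_docx(SAMPLE_XML)):
        print(control)
```

A loader built along these lines could store each w:sdt fragment, or the whole document part, in the database and later search on the tag value to find candidate components for reuse.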
To bridge the gap between Microsoft Word and the NoSQL database, we built a custom Microsoft Office Add-in, which gave us access to Word’s API. In the Add-in, we added functionality that makes it easier for users to add specific content controls to a document, navigate the document via the content control hierarchy, and manage content control metadata. It also enabled authors to search for reusable components stored in MarkLogic and insert them into the current document. The following figure illustrates a Section content control defining a reusable component and shows how the Add-in integrates with Microsoft Word.
Figure 2: Custom Office Add-in and Word Content Controls
In future posts, I’ll provide more detail on the implementation and highlight some of the technical challenges we faced using Microsoft Word and a NoSQL database to support content reuse.