1. Export your file (in this example, "MyDoc.doc") to "Filtered HTML" format via the File > Save As... menu option; this produces "MyDoc.htm". *Note:* this option is only available in Microsoft Office 2000 and later on the Windows platform. The standard HTML format will *not* work, as it contains too much non-standard markup to process.

2. If the file contains images, a directory (here, "MyDoc_files") will appear alongside the exported file. Rename this directory and upload it via WebDAV to the correct location on the server (documentation on this process is still pending).

3. Normalize the HTML into standards-compliant XHTML using HTMLTidy with the following command:

tidy -clean -asxhtml -i -o mydoc.html.2 MyDoc.htm

4. Simplify the resulting HTML file using the cleanhtml.php script (see attachment):

php cleanhtml.php < mydoc.html.2 > mydoc.html.1

Note that this script only works on valid, clean XHTML, such as the output of tidy in step 3.

5. Rewrite the image src attributes so that they point to the location where the images were uploaded in step 2:

perl -p -e 's|src="MyDoc_files|src="/system/files/library/mydoc|g' mydoc.html.1 > mydoc.html

6. Open mydoc.html in a text editor and clean up the remaining HTML by hand. The following list notes some common problems and suggested solutions:

- If the document begins with a title, remove the title and enter that text into the node title.
- Microsoft Word does not use h tags for headings, but rather p tags with CSS formatting. As a result, the process leaves section headings rendered as <p>Title</p>. Convert these to <h2>Title</h2>, or to h3, h4, etc. for subsections if desired.
- Captions under images and tables should be rendered in italics using the <em></em> markup tag.
- Tables may require some additional editing by hand to achieve the desired appearance.

7. Copy and paste the resulting HTML code directly into the body of the node, and select the "Raw HTML" input format.

Attachments