Saturday, July 31, 2010

Things are a bit quiet round here

If you are one of the many subscribers to my blog (where "many" is a value slightly larger than zero), you may have noticed things have been a bit quiet round here. There is a reason for this. My proper website has suddenly become relatively popular, so I’ve decided to spend my spare time updating that more regularly, since it is actually producing income for me, which in the current climate is definitely a good thing. I’ll still post here, but it may not be so frequent.

Friday, July 16, 2010

Quick hack to convert HTML to Word documents

There are lots of products that will convert HTML files to Word format or RTF, but they mostly cost money, and there are so many of them that it’s hard to know which are good without testing them all. Word can of course open HTML files directly, but one issue I found is that if you then save the file as a DOC file, the images are not embedded in it (although they are in DOCX files).

But being the cheapskate I am, I thought there must be a way to implement this myself, so I came up with this quick hack, which loads the HTML into IE, copies it to the clipboard, pastes it into Word and then saves it as a Word document. It does require Word to be installed on the user’s machine, and it does splat whatever was on the clipboard, but other than that it seems reasonably robust. My only other thought is how horrible the Word COM API is to use from .NET. Bring on optional parameters!

    // Needs references to System.Windows.Forms and Microsoft.Office.Interop.Word,
    // and must run on an STA thread (the WebBrowser control requires it).
    private bool loadComplete;

    public void ConvertToWord(string htmlFile, string filename)
    {
      // open in IE
      using (WebBrowser browser = new WebBrowser())
      {
        loadComplete = false;
        browser.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(
          browser_DocumentCompleted);
        browser.Navigate(htmlFile);
        while (!loadComplete)
        {
          // pump messages so DocumentCompleted can fire on this thread
          System.Windows.Forms.Application.DoEvents();
        }

        // copy to clipboard
        browser.Document.ExecCommand("SelectAll", true, null);
        browser.Document.ExecCommand("Copy", true, null);

        // open Word, paste and save...
        object dummy = Type.Missing;
        ApplicationClass wordApp = new ApplicationClass();
        try
        {
          Document newDoc = wordApp.Documents.Add(ref dummy, ref dummy, ref dummy, ref dummy);
          wordApp.Selection.Paste();

          object fileName = filename;
          newDoc.SaveAs(ref fileName, ref dummy, ref dummy, ref dummy, ref dummy, ref dummy,
            ref dummy, ref dummy, ref dummy, ref dummy, ref dummy, ref dummy, ref dummy,
            ref dummy, ref dummy, ref dummy);
          ((_Document)newDoc).Close(ref dummy, ref dummy, ref dummy);
        }
        finally
        {
          // always shut Word down, even if the paste or save fails
          wordApp.Quit(ref dummy, ref dummy, ref dummy);
        }
      }
    }

    void browser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
      loadComplete = true;
    }