Friday, July 16, 2010

Quick hack to convert HTML to Word documents

There are lots of products that will convert HTML files to Word format or RTF, but most of them cost money, and there are so many that it’s hard to know which are good without testing them all. Word can, of course, open HTML files directly, but one issue I found is that if you then save the file as a DOC file, the images are not embedded in it (although they are in DOCX files).

But being the cheapskate I am, I thought there must be a way to implement this myself, so I came up with this quick hack, which loads the HTML into IE, copies it to the clipboard, pastes it into Word and then saves it as a Word document. It does require Word to be installed on the user’s machine, and it does overwrite whatever was on the clipboard, but other than that it seems reasonably robust. My only other thought is how horrible the Word COM API is to use from .NET. Bring on optional parameters!

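    // At the top of the file (needs references to System.Windows.Forms and
    // the Word interop assembly; call ConvertToWord from an STA thread):
    using System;
    using System.Threading;
    using System.Windows.Forms;
    using Microsoft.Office.Interop.Word;

    // Set by browser_DocumentCompleted once the page has finished loading.
    private bool loadComplete;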
    public void ConvertToWord(string htmlFile, string filename)
    {
      // open in IE
      using (WebBrowser browser = new WebBrowser())
      {
        loadComplete = false;
        browser.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(
          browser_DocumentCompleted);
        browser.Navigate(htmlFile);
        while (!loadComplete)
        {
          Thread.Sleep(50);
          System.Windows.Forms.Application.DoEvents();
        }

        // copy to clipboard
        browser.Document.ExecCommand("SelectAll", true, null);
        browser.Document.ExecCommand("Copy", true, null);
        
        // open Word, paste and save...
        object dummy = Type.Missing;
        ApplicationClass wordApp = new ApplicationClass();
        try
        {
          Document newDoc = wordApp.Documents.Add(ref dummy, ref dummy, ref dummy, ref dummy);

          wordApp.Selection.Paste();

          object fileName = filename;
          newDoc.SaveAs(ref fileName, ref dummy, ref dummy, ref dummy, ref dummy, ref dummy, 
            ref dummy, ref dummy, ref dummy, ref dummy, ref dummy, ref dummy, ref dummy, 
            ref dummy, ref dummy, ref dummy);
          ((_Document)newDoc).Close(ref dummy, ref dummy, ref dummy);
        }
        finally
        {
          wordApp.Quit(ref dummy, ref dummy, ref dummy);
        }
      }
    }

    void browser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
      loadComplete = true;
    }
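
On the optional parameters wish: C# 4 lets you leave out optional arguments, and drop the ref keyword, when calling COM interop methods. Here is a minimal sketch of how the Word half of the hack might look, assuming a C# 4 compiler and a reference to Microsoft.Office.Interop.Word. The class and method names (WordPasteSketch, PasteClipboardIntoWord) are made up for illustration, and I haven't tested this against every Word version.

```csharp
using Word = Microsoft.Office.Interop.Word;

public static class WordPasteSketch
{
    // Hypothetical helper: pastes the current clipboard contents into a
    // new Word document and saves it to the given path.
    public static void PasteClipboardIntoWord(string filename)
    {
        Word.Application wordApp = new Word.Application();
        try
        {
            // Optional ref parameters can simply be left out in C# 4.
            Word.Document newDoc = wordApp.Documents.Add();

            wordApp.Selection.Paste();

            // The string is passed by value; the compiler supplies the
            // temporary needed for the COM "ref object" parameter.
            newDoc.SaveAs(filename);

            // The casts avoid the ambiguity between the Close/Quit methods
            // and the events of the same name.
            ((Word._Document)newDoc).Close();
        }
        finally
        {
            ((Word._Application)wordApp).Quit();
        }
    }
}
```

The browser and clipboard steps stay as they are above; only the Word calls get shorter.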
