Thursday, March 19, 2009

On Further Review…

Never mind.  Microsoft has a CTP of the 2.0 version of the Open XML SDK that does this much better.  Check out http://www.microsoft.com/downloads/thankyou.aspx?familyId=c6e744e5-36e9-45f5-8d8c-331df206e0d0&displayLang=en

Wednesday, March 18, 2009

Reading an Office 2007 docx file using C# and SharpZipLib

I found myself needing to read an Office 2007 docx file to get at the XML of the document, so I went and got the ECMA spec for Office Open XML. That is a heavy read at 1500 pages :-).  So I decided to keep it handy and just write some code to see how bad it would be. So far it has not been as bad as I expected, so I thought I would post some code.  It has been a while since I have posted anything, so here goes.

The docx file is actually just a zip archive of a bunch of files.  The trick here is to read the zip and pull out the content.  I decided I wanted to load the data into a class hierarchy so I could do some LINQ to Objects work with it.
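To make the "it is just a zip" point concrete, here is a minimal sketch that lists the entries of a package. It assumes a newer .NET with System.IO.Compression rather than the SharpZipLib used in this post, and the names ZipPeek, ListEntries and BuildSample are made up for illustration. BuildSample only creates a tiny stand-in archive so the sketch runs without a real docx.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;

public static class ZipPeek
{
    // Returns the entry names found in a zip stream (a docx is one such stream).
    public static List<string> ListEntries(Stream zipStream)
    {
        using (var archive = new ZipArchive(zipStream, ZipArchiveMode.Read, true))
        {
            return archive.Entries.Select(e => e.FullName).ToList();
        }
    }

    // Hypothetical helper: builds a small in-memory archive that mimics
    // two of the entry names a docx package contains.
    public static MemoryStream BuildSample()
    {
        var ms = new MemoryStream();
        using (var archive = new ZipArchive(ms, ZipArchiveMode.Create, true))
        {
            foreach (var name in new[] { "[Content_Types].xml", "word/document.xml" })
            {
                // Only one entry stream may be open at a time in Create mode,
                // so each writer is disposed before the next entry is created.
                using (var writer = new StreamWriter(archive.CreateEntry(name).Open()))
                {
                    writer.Write("<root/>");
                }
            }
        }
        ms.Position = 0; // rewind so the archive can be read back
        return ms;
    }
}
```

Passing File.OpenRead on a real docx to ListEntries should work the same way, since the method only needs a readable, seekable stream.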

So I will start with some simple code.  First, the DocumentPart class (DocumentPart.cs) holds the document content and some information from its zip entry.  The document content is stored as a LINQ to XML XDocument.

 

    using System;
    using System.Xml.Linq;

    namespace OfficeOpenXML.Package
    {
        /// <summary>
        /// A document part in the OfficeOpenXML Package
        /// </summary>
        public class DocumentPart
        {
            public string Name { get; set; }
            public string Comment { get; set; }
            public long CompressedSize { get; set; }
            public DateTime EntryDate { get; set; }
            public long Size { get; set; }
            public XDocument Content { get; set; }
        }
    }

It is pretty basic so far (love those automatic properties).
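As a rough idea of the LINQ to Objects work this kind of class makes possible, here is a small sketch. PartInfo is a trimmed, hypothetical stand-in for DocumentPart (only Name and Size) so the block compiles on its own, and PartQueries.LargestFirst is an invented helper, not something from the post's code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Trimmed stand-in for DocumentPart, kept to the two properties the query needs.
public class PartInfo
{
    public string Name { get; set; }
    public long Size { get; set; }
}

public static class PartQueries
{
    // Names of the parts under a given folder prefix, largest part first.
    public static List<string> LargestFirst(IEnumerable<PartInfo> parts, string folder)
    {
        return parts
            .Where(p => p.Name.StartsWith(folder, StringComparison.OrdinalIgnoreCase))
            .OrderByDescending(p => p.Size)
            .Select(p => p.Name)
            .ToList();
    }
}
```

With the real class, the same query could run over the values of a dictionary of DocumentPart objects instead of a list of PartInfo.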


Now, how do you read the zip file?  I use SharpZipLib to read it and store each member of the archive in a dictionary of DocumentPart objects, with the entry name as the dictionary key.

 

    1 using System;
    2 using System.Collections.Generic;
    3 using System.IO;
    4 using System.Text;
    5 using System.Xml.Linq;
    6 using ICSharpCode.SharpZipLib.Zip;
    7 
    8 namespace OfficeOpenXML.Package
    9 {
   10   public class Parts
   11   {
   12     public Dictionary<string, DocumentPart> DocumentParts { get; set; }
   13     public string FilePath { get; set; }
   14 
   15     public void OpenPackage(string filePath)
   16     {
   17         ZipEntry Entry;
   18         XDocument contents;
   19         MemoryStream EntryData;
   20         byte[] Buffer = new Byte[8192];
   21         int bytesRead;
   22 
   23         using (ZipInputStream Package =
   24           new ZipInputStream(File.OpenRead(filePath)))
   25         {
   26           while ((Entry = Package.GetNextEntry()) != null)
   27           {
   28             EntryData = new MemoryStream();
   29             while ((bytesRead = Package.Read(Buffer, 0, Buffer.Length)) != 0)
   30             {
   31               EntryData.Write(Buffer, 0, bytesRead); // raw bytes; ASCII decoding loses non-ASCII text
   32             }
   33             EntryData.Position = 0; // rewind so the reader sees the whole entry
   34             contents = XDocument.Load(System.Xml.XmlReader.Create(EntryData));
   35             DocumentParts.Add(Entry.Name, new DocumentPart() 
   36               { Name=Entry.Name, Size=Entry.Size, 
   37                 Comment=Entry.Comment, 
   38                 CompressedSize=Entry.CompressedSize, 
   39                 EntryDate=Entry.DateTime, Content=contents });
   40           }
   41         }
   42     }
   43 
   44     /// <summary>
   45     /// Construct the class
   46     /// </summary>
   47     /// <param name="filePath">The office Open XML document</param>
   48     public Parts(string filePath)
   49     {
   50         FilePath = filePath;
   51         DocumentParts = new Dictionary<string, DocumentPart>();
   52     }
   53   }
   54 }

I use the using statement (on line 23) to handle opening the docx file, then walk the zip archive with GetNextEntry.  The inner while loop (at line 29) reads the content of the entry into memory.  Finally, DocumentParts.Add() adds the entry to the dictionary.
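One thing the loop above does not guard against: a package can also contain parts that are not XML (images under word/media, for example), and parsing those as XML throws an XmlException. A minimal sketch of a guard follows. PartParsing.TryParse is a hypothetical helper name, and returning null for non-XML content is just one possible design.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class PartParsing
{
    // Returns the parsed document, or null when the text is not well-formed XML.
    public static XDocument TryParse(string text)
    {
        try
        {
            return XDocument.Parse(text);
        }
        catch (XmlException)
        {
            // Not XML (or not well-formed): let the caller decide what to do.
            return null;
        }
    }
}
```

A caller could store a null Content for such entries, or skip them entirely, depending on whether the binary parts matter.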


A simple NUnit test (not exhaustive by any means) is:


 

    using NUnit.Framework;

    namespace OfficeOpenXML.UnitTests.Package
    {
        [TestFixture]
        public class PartTests
        {
            [Test]
            public void OpenPackageTest()
            {
                // Parts has no parameterless constructor, so pass the path in.
                string path = "../../TestData/AbilitiesandConditions.docx";
                OfficeOpenXML.Package.Parts p = new OfficeOpenXML.Package.Parts(path);
                p.OpenPackage(path);
                Assert.IsTrue(p.DocumentParts.ContainsKey("[Content_Types].xml"), "No Content_Types?");
            }
        }
    }

In my next post I will show how to use the information from the [Content_Types].xml entry and the docProps/app.xml and docProps/core.xml files to create an object that describes the document being read (using some LINQ to XML to populate the class).
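As a taste of what LINQ to XML over [Content_Types].xml can look like, here is a minimal sketch. It is not the code from the follow-up post. The class name ContentTypes and the method ReadOverrides are invented for illustration. The namespace URI and the Override element with its PartName and ContentType attributes come from the Open Packaging Conventions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class ContentTypes
{
    // Namespace used by [Content_Types].xml in an Open Packaging Conventions package.
    private static readonly XNamespace Ns =
        "http://schemas.openxmlformats.org/package/2006/content-types";

    // Maps each explicitly overridden part name to its content type.
    public static Dictionary<string, string> ReadOverrides(XDocument contentTypes)
    {
        return contentTypes.Root
            .Elements(Ns + "Override")
            .ToDictionary(
                e => (string)e.Attribute("PartName"),
                e => (string)e.Attribute("ContentType"));
    }
}
```

Given the XDocument stored for the "[Content_Types].xml" entry, ReadOverrides returns a lookup from part name (such as /word/document.xml) to content type. Default elements, which map file extensions rather than part names, are left out of this sketch.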


Darrel

Wednesday, March 11, 2009

Red, Green, Re-factor

I find myself getting more and more into the test driven development paradigm.  I am working on some fairly heavy OO code with a great requirement specification (it is actually an ISO standard).


The application is growing pretty much via the unit tests.  Write the test, then make sure the code fits.  It is amazing how many times I am not sure how a piece of code is going to work, so I just put together the unit test to fit the standard and then code the class.  I can then re-factor as needed.



I am using the speech APIs to drive the interface, so the speech API simply outputs a string of text that I can use to execute the program commands.  I wanted the program to be driven both by speech commands and through the standard GUI approach.  Since I am working on a compiler/interpreter, I have simply added the speech grammar as another grammar in the compiler.  I will probably do the same to parse the command line options.


Makes for some interesting code :-).

Tuesday, March 10, 2009

Not posting much lately

I have been working on several projects.  They are starting to come to completion and I hope to post some more on them at the beginning of April.  I have also been busy with contracts and life in general, but I will get back to some posts in a few weeks.