Wednesday, March 18, 2009

Reading an Office 2007 docx file using C# and SharpZipLib

I found myself needing to read an Office 2007 docx file to get at the XML of the document, so I went and got the ECMA spec for Office Open XML. That is a heavy 1500-page read :-).  So I decided to keep it handy and just write some code to see how bad it would be. So far it has not been as bad as I expected, so I thought I would post some code.  It has been a while since I have posted anything, so here goes.

The docx file is actually just a zip archive of a bunch of files.  The trick here is to read the zip and pull out the content.  I decided to load the data into a class hierarchy so I could do some LINQ to Objects work with it.

So I will start with some simple code.  First, the DocumentPart class (DocumentPart.cs) holds the document content and some information from its zip entry.  The document content is stored as a LINQ to XML XDocument.

 

    using System;
    using System.Xml.Linq;

    namespace OfficeOpenXML.Package
    {
        /// <summary>
        /// A document part in the OfficeOpenXML Package
        /// </summary>
        public class DocumentPart
        {
            public string Name { get; set; }
            public string Comment { get; set; }
            public long CompressedSize { get; set; }
            public DateTime EntryDate { get; set; }
            public long Size { get; set; }
            public XDocument Content { get; set; }
        }
    }

It is pretty basic so far (love those automatic properties).
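Since the point of loading the parts into objects is to query them, here is a minimal sketch of the kind of LINQ to Objects query this makes possible. The PartQueries helper and the sample entries (names and sizes) are made up for illustration, and DocumentPart is repeated in trimmed form only so the snippet compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Trimmed copy of DocumentPart so the sketch stands alone.
public class DocumentPart
{
    public string Name { get; set; }
    public long CompressedSize { get; set; }
    public long Size { get; set; }
}

public static class PartQueries
{
    // Hypothetical helper: names of the parts under a folder, largest first.
    public static List<string> NamesUnder(
        IDictionary<string, DocumentPart> parts, string folder)
    {
        return parts.Values
            .Where(p => p.Name.StartsWith(folder, StringComparison.OrdinalIgnoreCase))
            .OrderByDescending(p => p.Size)
            .Select(p => p.Name)
            .ToList();
    }

    // Made-up entries standing in for what a real package would contain.
    public static Dictionary<string, DocumentPart> SampleParts()
    {
        var parts = new Dictionary<string, DocumentPart>();
        foreach (var p in new[]
        {
            new DocumentPart { Name = "[Content_Types].xml", Size = 1312, CompressedSize = 357 },
            new DocumentPart { Name = "word/document.xml", Size = 20480, CompressedSize = 3120 },
            new DocumentPart { Name = "word/styles.xml", Size = 15000, CompressedSize = 1980 },
        })
        {
            parts.Add(p.Name, p);
        }
        return parts;
    }
}
```

Calling PartQueries.NamesUnder(PartQueries.SampleParts(), "word/") would return the two word/ entries, with document.xml ahead of styles.xml because it is larger.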


Now how do you read the zip file?  I use SharpZipLib to read it and store each entry of the zip archive in a dictionary of DocumentPart objects, with the entry name as the dictionary key.

 

    1 using System;
    2 using System.Collections.Generic;
    3 using System.IO;
    4 using System.Text;
    5 using System.Xml.Linq;
    6 using ICSharpCode.SharpZipLib.Zip;
    7 
    8 namespace OfficeOpenXML.Package
    9 {
   10   public class Parts
   11   {
   12     public Dictionary<string, DocumentPart> DocumentParts { get; set; }
   13     public string FilePath { get; set; }
   14 
   15     public void OpenPackage(string filePath)
   16     {
   17         ZipEntry Entry;
   18         XDocument contents;
   19         StringBuilder XMLDocument;
   20         byte[] Buffer = new Byte[8192];
   21         int bytesRead;
   22 
   23         using (ZipInputStream Package =
   24           new ZipInputStream(File.OpenRead(filePath)))
   25         {
   26           while ((Entry = Package.GetNextEntry()) != null)
   27           {
   28             XMLDocument = new StringBuilder();
   29             while ((bytesRead = Package.Read(Buffer, 0, Buffer.Length)) != 0)
   30             {
   31               XMLDocument.Append(
   32                 Encoding.UTF8.GetString(Buffer, 0, bytesRead));
   33             }
   34             contents = XDocument.Parse(XMLDocument.ToString());
   35             DocumentParts.Add(Entry.Name, new DocumentPart() 
   36               { Name=Entry.Name, Size=Entry.Size, 
   37                 Comment=Entry.Comment, 
   38                 CompressedSize=Entry.CompressedSize, 
   39                 EntryDate=Entry.DateTime, Content=contents });
   40           }
   41         }
   42     }
   43 
   44     /// <summary>
   45     /// Construct the class
   46     /// </summary>
   47     /// <param name="filePath">The office Open XML document</param>
   48     public Parts(string filePath)
   49     {
   50         FilePath = filePath;
   51         DocumentParts = new Dictionary<string, DocumentPart>();
   52     }
   53   }
   54 }

I use the using statement (on line 23) to handle opening the docx file and then walk the zip archive with GetNextEntry.  The inner while loop (at line 29) reads the content of the entry into a string.  Finally, DocumentParts.Add() adds the parsed entry to the dictionary.  Note that this assumes every entry is XML; XDocument.Parse will throw on anything else, such as an embedded image.
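Decoding the entry one buffer at a time has a catch whatever encoding is used: the XML parts are normally UTF-8, and a multi-byte character can straddle two buffers. A minimal sketch of one way around that is to collect the raw bytes first and decode once at the end. PartReader and LoadXml are names made up for this example. It uses only the base class library, so it should work with any readable Stream, the ZipInputStream included.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Linq;

public static class PartReader
{
    // Hypothetical helper: buffer the whole entry as bytes, then decode once.
    // StreamReader honours a byte order mark and otherwise assumes UTF-8.
    public static XDocument LoadXml(Stream entryStream)
    {
        byte[] buffer = new byte[8192];
        int bytesRead;
        using (MemoryStream raw = new MemoryStream())
        {
            // "> 0" stops on either 0 or -1 at the end of the entry.
            while ((bytesRead = entryStream.Read(buffer, 0, buffer.Length)) > 0)
            {
                raw.Write(buffer, 0, bytesRead);
            }
            raw.Position = 0;
            using (StreamReader reader = new StreamReader(raw))
            {
                return XDocument.Load(reader);
            }
        }
    }
}
```

Inside the GetNextEntry loop this would replace the StringBuilder and the inner while loop with a single call along the lines of contents = PartReader.LoadXml(Package), assuming the entry really is XML.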


A simple NUnit test (not exhaustive by any means) is:


 

    using NUnit.Framework;

    namespace OfficeOpenXML.UnitTests.Package
    {
        [TestFixture]
        public class PartTests
        {
            [Test]
            public void OpenPackageTest()
            {
                // Parts has no parameterless constructor, so pass the path in
                const string path = "../../TestData/AbilitiesandConditions.docx";
                OfficeOpenXML.Package.Parts p = new OfficeOpenXML.Package.Parts(path);
                p.OpenPackage(path);
                Assert.IsTrue(p.DocumentParts.ContainsKey("[Content_Types].xml"), "No Content_Types?");
            }
        }
    }

In my next post I will show how to use the information from the [Content_Types].xml entry and the docProps/app.xml and docProps/core.xml files to create an object that has information about the document being read (using some LINQ to XML to populate the class).


Darrel
