Donner's Daily Dose of Drama

Introduction to Microsoft’s Open Xml Format SDK 2.0 with a focus on Excel documents

Those looking for ways to render application data and content into Microsoft Office documents used to have a limited number of options. There once was something called Visual Basic for Applications (Microsoft ended support in 2007) or, more recently, Visual Studio Tools for Applications (see Wikipedia). OLE Automation has been around since the early days of Windows and can be used to access the Office application’s object model from the outside world (see Wikipedia). OLE is no longer mainstream technology, and there are conflicts with the managed code programming and deployment model of .Net.

All these options have in common that they are geared towards automating Office applications, rather than manipulating office documents. Consequently, they require that the Office applications are installed, which can be a prohibitive complexity for server-side document processing.

Alternatively, SQL Server Reporting Services has been an easy path to generating native Excel documents without Excel on the server, but Reporting Services is not exactly a lightweight product, and customization and integration  with other applications is a complex task.

This sets the stage for the arrival of the Open Xml SDK, made possible by the Office Open XML format specification in 2006 (see Wikipedia). This article will explain what the Open Xml SDK is and what it can be used for. I will also provide a simple code example for manipulating an Excel spreadsheet document.

The Microsoft Open XML Format SDK (visit the download page) is a very young product. Version 1.0 was released as recently as June 2008. Version 2.0 is available today as a Community Technology Preview release (CTP). The main difference is that 2.0 provides strongly typed classes, which means that your code is not dealing with generic Xml elements, but instead works with things like documents, sheets, and cells (in the case of a spreadsheet document).

In this article, I will not concern myself at all with standardization considerations or the politics of standards. Instead, I am solely interested in the practical applications of the SDK, and how it can shorten the path to achieving loose integration between Microsoft Office and custom .Net applications.

Deployment Models

After installation, the SDK makes a number of project templates available in Visual Studio. These templates can be used to create assemblies that extend Office (Add-In) or that can be attached to Office documents. The latter is an interesting concept. Excel executes managed code that can handle events in the document’s life cycle. A spreadsheet could be configured to execute managed code with pulls data from SQL Server and populates certain cells – all when the document is opened. This code would not be integrated with a stand-alone application, though, and sharing a security context and other state information would not be possible. 

Project types available through the SDK

Once the SDK libraries are installed, they can be referenced by and integrated with any application code. This is a model that works much better for an solutions that needs to generate or manipulate Office documents.

Application Integration

The SDK exposes objects and methods that allow direct access to elements of the document. There are types for Sheets, Rows,  Columns, and Cells. A typical workflow would be:

Unfortunately, but not surprisingly, the SDK does not provide methods that emulate the behavior of the Office application. For instance, when a formula is changed in an Excel spreadsheet that is being edited in Excel, the spreadsheet application will automatically search for and update references in other cells so that their validity is maintained.

The SDK does not do that. When an application needs to change a formula in the cell of an Excel document, it has explicitely to search the document for references and update them. This can quickly become a complex undertaking, as one might imagine, and it is well worth evaluating what type of manipulations should be allowed to the application in order to keep the amount of code managable.

The most uncritical approach for Excel spreadsheets is to update only values, not formulas. Formulas, if needed, can be in place in the  template, and need to be capable of handling variable amounts of data for the approach to work.

C# Code Example

Let’s update a cell in an existing spreadsheet document. I prefer to make my code reusable, so we will write a few wrapper methods for the SDK and place them in a static class that I call OpenXMLHelper.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Spreadsheet;
using System.Xml;
using System.IO;
namespace OpenXMLWindowsApp
{
    static class OpenXMLHelper
    {
        public static SpreadsheetDocument
                  OpenSpreadSheetForUpdate(string docName)
        {
            return SpreadsheetDocument.Open(docName, true);
        }

        public static void
           CloseSpreadSheet(SpreadsheetDocument spreadSheet)
        {
            spreadSheet.Close();
        }
    }
}

As you can see, we are starting out small. There are two methods that open and close a document.
Next, let’s add a method that we can call to update the value of a certain spreadsheet cell with a string. The method has to accept the following parameters: a Spreadsheet document object, a sheet name, a cell address (specified by a column name and a row index), a string value, and a flag indicating whether the cell node should be created if it does not already exist.

public static void UpdateCell(SpreadsheetDocument spreadSheet,
			      string sheetName,
			      string text,
			      uint rowIndex,
			      string columnName,
			      bool createCell)
{
    WorksheetPart worksheetPart =
          GetWorksheetPartByName(spreadSheet, sheetName);

    if (worksheetPart != null)
    {
	Cell cell = GetCell(worksheetPart.Worksheet, columnName,
                                                rowIndex, createCell);

	if (cell != null)
	{
	    cell.CellValue = new CellValue(text);
	    cell.DataType = new EnumValue(CellValues.Number);
	}

	// Save the worksheet.
	worksheetPart.Worksheet.Save();
    }
    else
    {
	throw new
              Exception("Worksheet not found in spreadsheet document");
    }
}

This method calls another method GetCell that we have to write. We assume here that the value that we passed in is numeric. This could be easily improved to handle different value types, of course. The spreadsheet document is comprised of parts, and worksheets are one type of parts. The method GetWorksheetPartByName is another method that is missing. It will return the WorksheetPart for a given worksheet name (the name is what you see on the little tab at the bottom of the Excel window). Other than that, there is nothing out of the ordinary here. The interesting code will be in the GetCell() method, so let’s look at that next.

public static Cell GetCell(Worksheet worksheet,
			   string columnName,
			   uint rowIndex,
			   bool createCell)
{
    Row row = GetRow(worksheet, rowIndex, createCell);

    Cell cell = null;

    if (row.Elements().Where(c => string.Compare(
       c.CellReference.Value, columnName + rowIndex, true)
              == 0).Count() > 0)
    {
	cell = row.Elements().Where(c => string.Compare(
                c.CellReference.Value, columnName + rowIndex, true)
                    == 0).First();
    }
    else if (createCell)
    {
	// Cells must be in sequential order according to CellReference.
	// Determine where to insert the new cell.
	Cell refCell = null;
	string cellReference = columnName + rowIndex;

	foreach (Cell ecell in row.Elements())
	{
	    if (string.Compare(
                   ecell.CellReference.Value, cellReference, true) > 0)
	    {
		refCell = ecell;
		break;
	    }
	}

	cell = new Cell() { CellReference = cellReference };
	row.InsertBefore(cell, refCell);
	worksheet.Save();
    }

    return cell;
}

This method retrieves the row, queries it for the cell what we are interested in (using a Lambda expression), and if the cell does not exist and the caller wants one to be created, inserts a new cell at the correct place in the row. Now let’s write the method that retrieves a Row object.

public static Row GetRow(Worksheet worksheet,
			 uint rowIndex,
			 bool createCell)
{
    Row row = null;

    if (worksheet.GetFirstChild().
          Elements().Where(
          r => r.RowIndex == rowIndex).Count() > 0)
    {
	row = worksheet.GetFirstChild().
	     Elements().Where(
	       r => r.RowIndex == rowIndex).First();
    }
    else if (createCell)
    {
	row = new Row() { RowIndex = rowIndex };
	worksheet.GetFirstChild().Append(row);
    }

    return row;
}

Since we allow a cell to be created if it does not exist, we have to first also create a new row if one does not exist. The method above handles that for us. The one thing left to do is the method that finds a worksheet by name. Here it is.

public static WorksheetPart GetWorksheetPartByName(
       SpreadsheetDocument document, string sheetName)
{
    IEnumerable sheets =
          document.WorkbookPart.Workbook.GetFirstChild().
          Elements().Where(s => s.Name == sheetName);

    if (sheets.Count() == 0)
    {
	return null;
    }

    string relationshipId = sheets.First().Id.Value;
    WorksheetPart worksheetPart =
        (WorksheetPart)document.WorkbookPart.
        GetPartById(relationshipId);

    return worksheetPart;
}

This method does not handle the situation if a worksheet does not exist. The following method will insert a new worksheet at the end of the worksheet list:

public static WorksheetPart
        InsertWorksheet(WorkbookPart workbookPart)
{
    // Add a new worksheet part to the workbook.
    WorksheetPart newWorksheetPart =
          workbookPart.AddNewPart();
    newWorksheetPart.Worksheet =
         new Worksheet(new SheetData());
    newWorksheetPart.Worksheet.Save();

    Sheets sheets =
         workbookPart.Workbook.GetFirstChild();
    string relationshipId =
         workbookPart.GetIdOfPart(newWorksheetPart);

    // Get a unique ID for the new sheet.
    uint sheetId = 1;
    if (sheets.Elements().Count() > 0)
    {
	sheetId = sheets.Elements().Select(
	           s => s.SheetId.Value).Max() + 1;
    }

    string sheetName = "Sheet" + sheetId;

    // Append the new worksheet
    // and associate it with the workbook.
    Sheet sheet = new Sheet()
     {
         Id = relationshipId,
         SheetId = sheetId,
         Name = sheetName
     };
    sheets.Append(sheet);
    workbookPart.Workbook.Save();

    return newWorksheetPart;
}

We now have a working utility class that let’s us update cells in a spreadsheet document, there is one more method that I would like to add. It has to do with a the fact that the SDK does not update references in formulas, as I already mentioned. It also does not recalculate formula results. So, if you update a cell and you have a formula that uses this cell to calculate something somewhere else in the spreadsheet, this value will not reflect the chance, when the spreadsheet is loaded into into Excel. Getting Excel to reevaluate formulas is not straight-forward, so we will work around that. By deleting all calculated cells in the spreadsheet, we can force Excel to reevaluate all of them when the document is opened. Here is a little method that does exactly that.

// remove all values of cells with formulas on a sheet
// so that Excel refreshes the them upon Open
public static void ClearAllValuesInSheet
      (SpreadsheetDocument spreadSheet, string sheetName)
{
    WorksheetPart worksheetPart =
        GetWorksheetPartByName(spreadSheet, sheetName);

    foreach (Row row in
       worksheetPart.Worksheet.
          GetFirstChild().Elements())
    {
	foreach (Cell cell in row.Elements())
	{
	    if (cell.CellFormula != null &&
	          cell.CellValue != null)
	    {
		cell.CellValue.Remove();
	    }
	}

    }

    worksheetPart.Worksheet.Save();
}

This concludes the Open XML Format SDK primer.

Get the complete source code for my OpenXMLHelper.cs class here.

Exit mobile version