Introduction to Microsoft’s Open Xml Format SDK 2.0 with a focus on Excel documents Christian Donner, March 24, 2009July 16, 2009 Those looking for ways to render application data and content into Microsoft Office documents used to have a limited number of options. There once was something called Visual Basic for Applications (Microsoft ended support in 2007) or, more recently, Visual Studio Tools for Applications (see Wikipedia). OLE Automation has been around since the early days of Windows and can be used to access the Office application’s object model from the outside world (see Wikipedia). OLE is no longer mainstream technology, and there are conflicts with the managed code programming and deployment model of .Net. All these options have in common that they are geared towards automating Office applications, rather than manipulating office documents. Consequently, they require that the Office applications are installed, which can be a prohibitive complexity for server-side document processing. Alternatively, SQL Server Reporting Services has been an easy path to generating native Excel documents without Excel on the server, but Reporting Services is not exactly a lightweight product, and customization and integration with other applications is a complex task. This sets the stage for the arrival of the Open Xml SDK, made possible by the Office Open XML format specification in 2006 (see Wikipedia). This article will explain what the Open Xml SDK is and what it can be used for. I will also provide a simple code example for manipulating an Excel spreadsheet document. The Microsoft Open XML Format SDK (visit the download page) is a very young product. Version 1.0 was released as recently as June 2008. Version 2.0 is available today as a Community Technology Preview release (CTP). The main difference is that 2.0 provides strongly typed classes, which means that your code is not dealing with generic Xml elements, but instead works with things like documents, sheets, and cells (in the case of a spreadsheet document). In this article, I will not concern myself at all with standardization considerations or the politics of standards. Instead, I am solely interested in the practical applications of the SDK, and how it can shorten the path to achieving loose integration between Microsoft Office and custom .Net applications. Deployment Models After installation, the SDK makes a number of project templates available in Visual Studio. These templates can be used to create assemblies that extend Office (Add-In) or that can be attached to Office documents. The latter is an interesting concept. Excel executes managed code that can handle events in the document’s life cycle. A spreadsheet could be configured to execute managed code with pulls data from SQL Server and populates certain cells – all when the document is opened. This code would not be integrated with a stand-alone application, though, and sharing a security context and other state information would not be possible. Project types available through the SDK Once the SDK libraries are installed, they can be referenced by and integrated with any application code. This is a model that works much better for an solutions that needs to generate or manipulate Office documents. Application Integration The SDK exposes objects and methods that allow direct access to elements of the document. There are types for Sheets, Rows, Columns, and Cells. A typical workflow would be: Open an XLSX document (the template) Update XLSX document Save a copy of, or print, the XLSX document (the output document) Unfortunately, but not surprisingly, the SDK does not provide methods that emulate the behavior of the Office application. For instance, when a formula is changed in an Excel spreadsheet that is being edited in Excel, the spreadsheet application will automatically search for and update references in other cells so that their validity is maintained. The SDK does not do that. When an application needs to change a formula in the cell of an Excel document, it has explicitely to search the document for references and update them. This can quickly become a complex undertaking, as one might imagine, and it is well worth evaluating what type of manipulations should be allowed to the application in order to keep the amount of code managable. The most uncritical approach for Excel spreadsheets is to update only values, not formulas. Formulas, if needed, can be in place in the template, and need to be capable of handling variable amounts of data for the approach to work. C# Code Example Let’s update a cell in an existing spreadsheet document. I prefer to make my code reusable, so we will write a few wrapper methods for the SDK and place them in a static class that I call OpenXMLHelper. using System; using System.Collections.Generic; using System.Linq; using System.Text; using DocumentFormat.OpenXml; using DocumentFormat.OpenXml.Packaging; using DocumentFormat.OpenXml.Spreadsheet; using System.Xml; using System.IO; namespace OpenXMLWindowsApp { static class OpenXMLHelper { public static SpreadsheetDocument OpenSpreadSheetForUpdate(string docName) { return SpreadsheetDocument.Open(docName, true); } public static void CloseSpreadSheet(SpreadsheetDocument spreadSheet) { spreadSheet.Close(); } } } As you can see, we are starting out small. There are two methods that open and close a document. Next, let’s add a method that we can call to update the value of a certain spreadsheet cell with a string. The method has to accept the following parameters: a Spreadsheet document object, a sheet name, a cell address (specified by a column name and a row index), a string value, and a flag indicating whether the cell node should be created if it does not already exist. public static void UpdateCell(SpreadsheetDocument spreadSheet, string sheetName, string text, uint rowIndex, string columnName, bool createCell) { WorksheetPart worksheetPart = GetWorksheetPartByName(spreadSheet, sheetName); if (worksheetPart != null) { Cell cell = GetCell(worksheetPart.Worksheet, columnName, rowIndex, createCell); if (cell != null) { cell.CellValue = new CellValue(text); cell.DataType = new EnumValue(CellValues.Number); } // Save the worksheet. worksheetPart.Worksheet.Save(); } else { throw new Exception("Worksheet not found in spreadsheet document"); } } This method calls another method GetCell that we have to write. We assume here that the value that we passed in is numeric. This could be easily improved to handle different value types, of course. The spreadsheet document is comprised of parts, and worksheets are one type of parts. The method GetWorksheetPartByName is another method that is missing. It will return the WorksheetPart for a given worksheet name (the name is what you see on the little tab at the bottom of the Excel window). Other than that, there is nothing out of the ordinary here. The interesting code will be in the GetCell() method, so let’s look at that next. public static Cell GetCell(Worksheet worksheet, string columnName, uint rowIndex, bool createCell) { Row row = GetRow(worksheet, rowIndex, createCell); Cell cell = null; if (row.Elements().Where(c => string.Compare( c.CellReference.Value, columnName + rowIndex, true) == 0).Count() > 0) { cell = row.Elements().Where(c => string.Compare( c.CellReference.Value, columnName + rowIndex, true) == 0).First(); } else if (createCell) { // Cells must be in sequential order according to CellReference. // Determine where to insert the new cell. Cell refCell = null; string cellReference = columnName + rowIndex; foreach (Cell ecell in row.Elements()) { if (string.Compare( ecell.CellReference.Value, cellReference, true) > 0) { refCell = ecell; break; } } cell = new Cell() { CellReference = cellReference }; row.InsertBefore(cell, refCell); worksheet.Save(); } return cell; } This method retrieves the row, queries it for the cell what we are interested in (using a Lambda expression), and if the cell does not exist and the caller wants one to be created, inserts a new cell at the correct place in the row. Now let’s write the method that retrieves a Row object. public static Row GetRow(Worksheet worksheet, uint rowIndex, bool createCell) { Row row = null; if (worksheet.GetFirstChild(). Elements().Where( r => r.RowIndex == rowIndex).Count() > 0) { row = worksheet.GetFirstChild(). Elements().Where( r => r.RowIndex == rowIndex).First(); } else if (createCell) { row = new Row() { RowIndex = rowIndex }; worksheet.GetFirstChild().Append(row); } return row; } Since we allow a cell to be created if it does not exist, we have to first also create a new row if one does not exist. The method above handles that for us. The one thing left to do is the method that finds a worksheet by name. Here it is. public static WorksheetPart GetWorksheetPartByName( SpreadsheetDocument document, string sheetName) { IEnumerable sheets = document.WorkbookPart.Workbook.GetFirstChild(). Elements().Where(s => s.Name == sheetName); if (sheets.Count() == 0) { return null; } string relationshipId = sheets.First().Id.Value; WorksheetPart worksheetPart = (WorksheetPart)document.WorkbookPart. GetPartById(relationshipId); return worksheetPart; } This method does not handle the situation if a worksheet does not exist. The following method will insert a new worksheet at the end of the worksheet list: public static WorksheetPart InsertWorksheet(WorkbookPart workbookPart) { // Add a new worksheet part to the workbook. WorksheetPart newWorksheetPart = workbookPart.AddNewPart(); newWorksheetPart.Worksheet = new Worksheet(new SheetData()); newWorksheetPart.Worksheet.Save(); Sheets sheets = workbookPart.Workbook.GetFirstChild(); string relationshipId = workbookPart.GetIdOfPart(newWorksheetPart); // Get a unique ID for the new sheet. uint sheetId = 1; if (sheets.Elements().Count() > 0) { sheetId = sheets.Elements().Select( s => s.SheetId.Value).Max() + 1; } string sheetName = "Sheet" + sheetId; // Append the new worksheet // and associate it with the workbook. Sheet sheet = new Sheet() { Id = relationshipId, SheetId = sheetId, Name = sheetName }; sheets.Append(sheet); workbookPart.Workbook.Save(); return newWorksheetPart; } We now have a working utility class that let’s us update cells in a spreadsheet document, there is one more method that I would like to add. It has to do with a the fact that the SDK does not update references in formulas, as I already mentioned. It also does not recalculate formula results. So, if you update a cell and you have a formula that uses this cell to calculate something somewhere else in the spreadsheet, this value will not reflect the chance, when the spreadsheet is loaded into into Excel. Getting Excel to reevaluate formulas is not straight-forward, so we will work around that. By deleting all calculated cells in the spreadsheet, we can force Excel to reevaluate all of them when the document is opened. Here is a little method that does exactly that. // remove all values of cells with formulas on a sheet // so that Excel refreshes the them upon Open public static void ClearAllValuesInSheet (SpreadsheetDocument spreadSheet, string sheetName) { WorksheetPart worksheetPart = GetWorksheetPartByName(spreadSheet, sheetName); foreach (Row row in worksheetPart.Worksheet. GetFirstChild().Elements()) { foreach (Cell cell in row.Elements()) { if (cell.CellFormula != null && cell.CellValue != null) { cell.CellValue.Remove(); } } } worksheetPart.Worksheet.Save(); } This concludes the Open XML Format SDK primer. Get the complete source code for my OpenXMLHelper.cs class here. Related Posts:Enphase Envoy Local AccessSUTAB Scam?TyreWiz not working after battery changeThe Great Cat Litter Poop OffAmazon threatens customer of 26 years Software Engineering Excel SDKOpen XMLOpen XML Format SDKXML Format
Hi Christian, Before the Microsoft Open XML SDK existed I’ve always found the cleanest (but not the simplest) solution to generating Office Documents from the server was with XSLT. At previous clients, we were able to develop transforms that created SpreadsheetML. Opened in Excel, the end result was multi-tabbed workbooks with formatting, formulas and bound charts. This is a nice solution that doesn’t require compiled code changes or redeployments when a report needs to get changed; simply update the XSLT. If you have experience with XSLT this can be a really solid solution. If you dont, you can always save the existing XLS file as XML and use that as a guide. Simple example: http://msdn.microsoft.com/en-us/library/bb226693(office.11).aspx
Hi Jona, I don’t see the strength of the Open XML SDK in generating spreadsheets from scratch. XSLT is probably the better choice for that. But the SDK is great for manipulating existing documents.
Do you have working code for the snippets you have above? I am getting errors when I try to compile your code into a library class.
I sent you an email with the source for a C# class. This is still prototype code. Let me know if you need examples for how to use it.
Hello, I am trying to create an excel file from scratch using openxml. Dim sheetData As SheetData = worksheetPart.Worksheet.GetFirstChild(Of SheetData)() blows up, because workSheetPart.WorkSheet is null. How can I add a WorkSheet to WorkSheetpart and add SheetData to Worksheet? Thank you, Burak Using myWorkBook As SpreadsheetDocument = SpreadsheetDocument.Create(FilePath, SpreadsheetDocumentType.Workbook) myWorkBook.AddWorkbookPart() myWorkBook.WorkbookPart.AddNewPart(Of WorksheetPart)(“TestSheet1”) Dim worksheetPart As WorksheetPart = myWorkBook.WorkbookPart.WorksheetParts.First Dim sheetData As SheetData = worksheetPart.Worksheet.GetFirstChild(Of SheetData)() Dim contentRow As New Row Dim firstCell As New Cell firstCell.DataType = CellValues.InlineString Dim cellStr As New InlineString Dim t As Text = New Text t.Text = “Hello world” cellStr.AppendChild(t) firstCell.AppendChild(cellStr) contentRow.AppendChild(firstCell) sheetData.AppendChild(contentRow) End Using
Burak, check out these links: http://msdn.microsoft.com/en-us/library/cc881781(office.14).aspx http://openxmldeveloper.org/articles/Creating_Spreadsheets_Server.aspx
Hello Christian, I used ExcelPackage but it has shortcomings and it’s not being added to more (which is shame) , so that’s why I started delving into OpenXml. I looked at the first site and modified my code as follows Dim spreadSheet As SpreadsheetDocument = SpreadsheetDocument.Create(FilePath, SpreadsheetDocumentType.Workbook) Using (spreadSheet) spreadSheet.AddWorkbookPart() Dim newWorksheetPart As WorksheetPart = spreadSheet.WorkbookPart.AddNewPart(Of WorksheetPart)() newWorksheetPart.Worksheet = New Worksheet(New SheetData()) newWorksheetPart.Worksheet.Save() Dim sheets As Sheets = spreadSheet.WorkbookPart.Workbook.GetFirstChild(Of Sheets)() it blows up this time with “object reference not set to an instance of an object” I guess it’s complaining about sheets. How do I create and add sheets? Thanks, Burak
I also noticed when we call spreadSheet.AddWorkbookPart() the workbook inside workbookpart shows up as being set to nothing and i think that’s what’s causing spreadSheet.WorkbookPart.Workbook.GetFirstChild(Of Sheets)() to blow up i looked inside workbookpart object with reflector but it’s constructor is not creating a workbook i am confused
I don’t have time to look into your code, but it appears that you are not creating a workbook. The hierarchy is document-workbook-workbookpart.
Christian, I’m trying to iterate through the cells of a simple spreadsheet to get some values. Your examples look very useful, but I am running into errors when I try to use your code snippets. A working example would be very helpful. Regards — Jim
Dear Christian, Your article is great. I have used it as a source of inspiration for my own code. The ClearAllValuesInSheet has been very helpful for me. I am now trying to do the same with arrays. Using the same code on the example below, I get rid of the “v” tag for C16. Once I have updated workbook.xml with a new array {9,6} under the definedname ITICTRD for instance, I get 9 in C16, which is great but I still get 0 in D16. If I remove the “v” tag from D16 and zip, D16 remains empty. -ITICTRD 0 Have you ever been through the array issue? If yes, it would be great if you could share your key findings. Thanks. Laurent
This is the example from sheet1.xml It was not taken into account into my previous post because of tags “” “-ITICTRD” “” “” “0” “”
Hi Christian, I am trying to use open xml format to recognize the repeating pattern in a set of powerpoint presentations so that I can later use the repeating parts to make a template like presentation and spare the users time instead of starting from scratch. Do you think open xml is a good idea for this purpose?
Roozy, I don’t know the Powerpoint API of the Open XML SDK, but if what you want to do can be done, I am sure it can be done with Open XML.
Hi Christian, UpdateCell doesn’t seem to work for me once I get passed ‘Z’ for the column name. For example ‘AA’ doesn’t seem to work correctly. Is this a known problem? Thanks,
Nevermind, I found the problem. This line of code is the culprit in GetCell(): if (string.Compare(ecell.CellReference.Value, cellReference, true) > 0) This will return true when comparing B1 to AA1.
Hi, i’m using the OpenXML SDK 2.0 and i have a little problem with the columns. However i come to the column AA or AB the first columns from B to Z are empty. Excel give me an error message that there is something unreadable in the sheet. But A , AA and AB isn’t empty. Thanks
Hi Christian, i have another question, I’m looking to a way to export some summations. I only have found the example in the SDK, but there is no way to export the formula. Is there another way to export it or is there another DataType for the formula? Thanks
Thanks. Florian and Ryan, I am having the same issue. How did you solve it? I found the line, but don’t know the work around.
Hi guys, I was able to figure it out. I just compared the length of each cell name and only do the second comparison if the lengths are equal. If one has a length of 3 (AA1), and the other has a length of 2 (B1) skip the if statement. if (string.Compare(ecell.CellReference.Value, cellReference, true) > 0) becomes: if ((ecell.CellReference.Value.Length == cellReference.Length) && (string.Compare(ecell.CellReference.Value, cellReference, true) > 0))