I've finally gotten back to working on the final series of place-data imports for Weather Now. One of the data sources comes as a 20,000-line Excel spreadsheet. Both because I wanted to learn how to read Excel files and because I wanted to make updating the Gazetteer transparent, I wrote the first draft of the import module using the DocumentFormat.OpenXml package from Microsoft.
The recommended way of reading a cell using that package looks like this:
private static string? CellText(
    WorkbookPart workbook,
    OpenXmlElement sheet,
    string cellId)
{
    // Find the cell by its reference (e.g. "A1") within the given element
    var cell = sheet.Descendants<Cell>()
        .FirstOrDefault(c => c.CellReference == cellId);
    if (cell is null) return null;

    // Inline values can be returned directly
    if (cell.DataType is null || cell.DataType != CellValues.SharedString)
    {
        return cell.InnerText;
    }

    // Shared-string cells hold an index into the workbook's shared string table
    if (!int.TryParse(cell.InnerText, out var id))
    {
        return cell.InnerText;
    }
    var sharedString = workbook.SharedStringTablePart?
        .SharedStringTable
        .Elements<SharedStringItem>()
        .ElementAt(id);
    if (sharedString?.Text is not null)
    {
        return sharedString.Text.Text;
    }
    return sharedString?.InnerText is null
        ? sharedString?.InnerXml
        : sharedString.InnerText;
}
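For reference, a call site looks something like this. (The workbook and sheetData variables come from the workbook-opening code shown later in this post; the loop and the column letter are illustrative, not the actual import code.)

foreach (var row in sheetData.Elements<Row>())
{
    // Look up one cell in this row by its reference, e.g. "B2" for row 2
    var name = CellText(workbook, row, $"B{row.RowIndex}");
}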
When I ran a dry import (meaning it only read and parsed the file, without writing the new data to Weather Now), it... dragged. A lot. It went so slowly, in fact, that I started logging the rate at which it read blocks of rows:
2022-05-29 18:43:14.2294|DEBUG|Still loading at 100 (rate: 1.4/s)
2022-05-29 18:44:26.9709|DEBUG|Still loading at 200 (rate: 1.4/s)
2022-05-29 18:45:31.3087|DEBUG|Still loading at 300 (rate: 1.4/s)
...
2022-05-29 22:26:27.7797|DEBUG|Still loading at 8300 (rate: 0.6/s)
2022-05-29 22:31:01.5823|DEBUG|Still loading at 8400 (rate: 0.6/s)
2022-05-29 22:35:40.3196|DEBUG|Still loading at 8500 (rate: 0.6/s)
Yes. At first, it looked like it would take 4 hours to read 20,000 rows of data, but as you can see, it got even slower as it went on.
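The rate logging itself is nothing fancy; here's a minimal sketch of the idea, assuming an NLog-style logger and a hypothetical ParseRow standing in for the real per-row work:

var stopwatch = Stopwatch.StartNew();
var rowsRead = 0;
foreach (var row in sheetData.Elements<Row>())
{
    ParseRow(row); // hypothetical per-row parsing
    rowsRead++;
    if (rowsRead % 100 == 0)
    {
        // The rate is cumulative since the start of the import
        var rate = rowsRead / stopwatch.Elapsed.TotalSeconds;
        logger.Debug($"Still loading at {rowsRead} (rate: {rate:0.0}/s)");
    }
}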
I finally broke out the profiler and ran a short test that parsed 14 lines of data. The profiler showed a few hot spots:
- 355,000 calls to OpenXmlElement<T>.MoveNext
- 740,000 calls to OpenXmlCompositeElement.get_FirstChild
- 906,000 calls to OpenXmlChildElements<GetEnumerator>.MoveNext
That's for 14 lines of data. In hindsight, the slowdown makes sense: Descendants<Cell>() walks the sheet from the top for every single cell lookup, so each read is a linear scan of everything before it, and the later the row, the longer the scan.
So I gave up and exported the data file to tab-delimited text instead. This code block, which opens the Excel workbook:
using var document = SpreadsheetDocument.Open(fileName, false);
var workbook = document.WorkbookPart;
if (workbook is null)
    throw new InvalidOperationException($"The file \"{fileName}\" was not a valid data file");
var sheet = workbook.Workbook.Descendants<Sheet>().FirstOrDefault(s => s.Name == "Airports");
if (sheet is null)
    throw new InvalidOperationException("Could not find the data sheet");
var sheetPart = (WorksheetPart)workbook.GetPartById(sheet.Id!);
var sheetData = sheetPart.Worksheet.Elements<SheetData>().First();
var rows = sheetData.Elements<Row>().Count();
now looks like this:
var lines = File.ReadAllLines(fileName);
And the code to read the data from an individual cell becomes:
return index < columns.Count ? columns[index] : null;
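Here, columns is just a line split on tabs. A minimal sketch of how the pieces fit together (the header-row skip, column positions, and helper name are illustrative, not the actual import code):

foreach (var line in lines.Skip(1)) // skip the header row
{
    var columns = line.Split('\t');
    var icao = CellText(columns, 0);
    var name = CellText(columns, 1);
    // ...map the values onto a place record...
}

static string? CellText(IList<string> columns, int index) =>
    index < columns.Count ? columns[index] : null;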
Boom. Done. Took 30 minutes to refactor. My profiler now says the most frequent call for the 14-row test occurs just 192 times, and the whole thing finishes in 307 ms.
So let's run it against the full file, now converted to tab-delimited text:
2022-05-30 09:19:33.6255|DEBUG|Still loading at 100 (rate: 211.3/s)
2022-05-30 09:19:33.8813|DEBUG|Still loading at 200 (rate: 274.2/s)
2022-05-30 09:19:34.1342|DEBUG|Still loading at 300 (rate: 305.4/s)
...
2022-05-30 09:20:14.9819|DEBUG|Still loading at 19600 (rate: 468.6/s)
2022-05-30 09:20:15.2609|DEBUG|Still loading at 19700 (rate: 467.8/s)
2022-05-30 09:20:15.5030|DEBUG|Still loading at 19800 (rate: 467.5/s)
Well, then. The first few hundred rows see a 200x improvement, and it actually gets faster as it goes, so the whole import takes 45 seconds instead of 6 hours.
So much time wasted yesterday. Just not worth it.