vault backup: 2025-12-10 11:37:35

2023-05-15 17:16:05 +02:00
committed by Thomas Peetz
parent 91bf72fc87
commit 73f2162ddf
6049 changed files with 513094 additions and 227748 deletions
@@ -0,0 +1,35 @@
---
title: Conveniently Processing Large XML Files with Java
source: https://dzone.com/articles/conveniently-processing-large
tags:
- IT/Development/Java
- IT/Development/XML
---
When processing XML data, it is usually most convenient to load the whole document using a **DOM** parser and fire some **XPath** queries against the result. However, since we're building a multi-tenant eCommerce platform, we regularly have to handle large XML files, with file sizes above 1 GB. You certainly don't want to load such a beast into the heap of a production server, since its DOM representation easily grows to 3 GB or more.
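For comparison, the convenient DOM-plus-XPath approach looks roughly like the sketch below. It uses only JDK classes; the class name `DomXPathExample`, the method `readNames`, and the sample element names are assumptions made for illustration, not part of the original article.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class DomXPathExample {

    /** Loads the WHOLE document into memory, then queries it with XPath. */
    static List<String> readNames(InputStream in) throws Exception {
        // The complete tree is built here; memory use grows with the file size.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hits = (NodeList) xpath.evaluate("/nodes/node/name", doc, XPathConstants.NODESET);
        List<String> names = new ArrayList<>();
        for (int i = 0; i < hits.getLength(); i++) {
            names.add(hits.item(i).getTextContent());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample data
        String xml = "<nodes><node><name>Node 1</name></node><node><name>Node 2</name></node></nodes>";
        System.out.println(readNames(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))));
        // prints [Node 1, Node 2]
    }
}
```

This is perfectly fine for small inputs; the problem described above only appears once the document no longer fits comfortably into the heap.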
So what to do? Well, **SAX** to the rescue! Processing a large XML file with a SAX parser requires only constant (low) memory, since it merely invokes callbacks for the XML tokens it detects. On the other hand, parsing complex XML this way quickly becomes a mess.
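To make the callback model concrete, here is a minimal sketch that counts elements with the JDK's SAX parser. The names `SaxCountExample` and `countElements` are hypothetical. Nothing of the document is retained between callbacks, which is why memory stays flat, but also why anything beyond trivial logic requires hand-written state tracking.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxCountExample {

    /** Counts how often an element with the given name starts in the stream. */
    static int countElements(InputStream in, String elementName) throws Exception {
        int[] count = {0}; // single-element array so the anonymous class can modify it
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                // Called once per opening tag; no tree is built.
                if (elementName.equals(qName)) {
                    count[0]++;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample data
        String xml = "<nodes><node/><node/><node/></nodes>";
        System.out.println(countElements(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), "node"));
        // prints 3
    }
}
```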
To resolve this problem, we need to take a closer look at our XML input data. Most of the time, at least in our case, you don't need the whole DOM at once. Say you're importing product information: it is sufficient to look at one product at a time. Example:
When processing Node 1, we don't need access to any attribute of Node 2 or 3; likewise, when processing Node 2, we don't need access to Node 1 or 3, and so on. So what we want is a partial DOM, in our example one for every `<node>`.
What we've therefore built is a SAX parser for which you can specify which XML elements you are interested in. Once such an element starts, we record its whole sub-tree. When the element is complete, we notify a handler, which can then run XPath expressions against this partial DOM. After that, the DOM is released and the SAX parser continues.
Here is a shortened example of how you could parse the XML above - one "`<node>`" at a time:
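The sketch below illustrates the technique using only JDK classes. It is not the scireum API from the linked repository: `PartialDomExample`, `parse`, `query`, and the sample XML are all assumptions made for illustration. A SAX handler records the sub-tree of each element of interest into a small DOM, passes it to a callback, and then drops it.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class PartialDomExample {

    /** Streams through the input and hands each complete elementName sub-tree to the handler. */
    static void parse(InputStream in, String elementName, Consumer<Element> handler) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        DefaultHandler sax = new DefaultHandler() {
            private Document doc; // non-null only while a sub-tree is being recorded
            private Node current; // node that receives the next child

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if (doc == null) {
                    if (!elementName.equals(qName)) {
                        return; // not interested, keep streaming
                    }
                    doc = builder.newDocument(); // start a fresh partial DOM
                    current = doc;
                }
                Element e = doc.createElement(qName);
                for (int i = 0; i < attrs.getLength(); i++) {
                    e.setAttribute(attrs.getQName(i), attrs.getValue(i));
                }
                current.appendChild(e);
                current = e;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (doc != null) {
                    current.appendChild(doc.createTextNode(new String(ch, start, length)));
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (doc == null) {
                    return;
                }
                current = current.getParentNode();
                if (current == doc) { // the recorded element is complete
                    handler.accept(doc.getDocumentElement());
                    doc = null; // release the partial DOM
                    current = null;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(in, sax);
    }

    /** Evaluates an XPath expression relative to the given element and returns its string value. */
    static String query(Element context, String expression) {
        try {
            return XPathFactory.newInstance().newXPath().evaluate(expression, context);
        } catch (XPathExpressionException ex) {
            throw new IllegalArgumentException("Invalid XPath: " + expression, ex);
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample data
        String xml = "<nodes>"
                + "<node id=\"1\"><name>Node 1</name><price>100</price></node>"
                + "<node id=\"2\"><name>Node 2</name><price>250</price></node>"
                + "</nodes>";
        parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), "node",
                node -> System.out.println(query(node, "name") + " costs " + query(node, "price")));
    }
}
```

At any point in time, at most one `<node>` sub-tree is held in memory, so heap usage depends on the size of the largest recorded element rather than on the size of the file.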
The full example, along with the implementation, is open source (MIT License) and available here:
https://github.com/andyHa/scireumOpen/tree/master/src/com/scireum/open/xml
https://github.com/andyHa/scireumOpen/blob/master/src/examples/ExampleXML.java
We successfully handle up to five parallel imports of 1 GB+ XML files in our production system, without measurable heap growth. (Instead of using a FileInputStream, we use Java's ZIP capabilities and directly open and process ZIP versions of the XML files. This shrinks those monsters down to 20-50 MB and makes uploads etc. much easier.)
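Reading straight from a ZIP archive can be sketched with `java.util.zip.ZipInputStream`, which decompresses on the fly while the SAX parser consumes it. The article does not say which ZIP classes are used in production, so this is only one possible approach; `ZippedXmlExample`, `countInZip`, and the helper `zip` are hypothetical names, and the sketch assumes the XML file is the first entry of the archive.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class ZippedXmlExample {

    /** Counts elements in the first entry of a ZIP archive without unpacking it to disk. */
    static int countInZip(InputStream zipped, String elementName) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(zipped)) {
            ZipEntry entry = zip.getNextEntry(); // assumption: the XML is the first entry
            if (entry == null) {
                throw new IllegalArgumentException("ZIP archive contains no entries");
            }
            int[] count = {0};
            DefaultHandler handler = new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes attrs) {
                    if (elementName.equals(qName)) {
                        count[0]++;
                    }
                }
            };
            // The parser reads decompressed bytes directly from the ZIP stream.
            SAXParserFactory.newInstance().newSAXParser().parse(zip, handler);
            return count[0];
        }
    }

    /** Helper for the demo: builds a one-entry ZIP archive in memory. */
    static byte[] zip(String entryName, String content) throws Exception {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buffer)) {
            out.putNextEntry(new ZipEntry(entryName));
            out.write(content.getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample data; in practice the stream would come from a file or an upload.
        byte[] archive = zip("products.xml", "<nodes><node/><node/></nodes>");
        System.out.println(countInZip(new ByteArrayInputStream(archive), "node"));
        // prints 2
    }
}
```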
Topics:
java, xml, bmecat