Handling Large XML Files

Talk about add-ons and extension development.
Benjamin Markson
Posts: 397
Joined: November 19th, 2011, 3:57 am
Location: en-GB

Handling Large XML Files

Post by Benjamin Markson »

My task was to import an XML file and output an SQLite database based on the import.

I started with what I guess is the straightforward approach: use XMLHttpRequest's responseXML object, then traverse the XML object to populate a Binding Params Array before executing its associated SQL statement to populate the SQLite database. This works fine until you import a really big XML file, say around 100 MB in size.
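
In outline it looked something like this (just a sketch from memory - the element and field names are only placeholders, and 'xhr' here stands for the XMLHttpRequest that fetched the file):

Code:

/* sketch only: walk responseXML and collect the values for the insert */
var rows = [];
var nodes = xhr.responseXML.getElementsByTagName("element");

for (var i = 0; i < nodes.length; i++)
{
    rows.push({
        id:    nodes[i].getAttribute("id"),
        value: nodes[i].textContent
    });
}
/* ...then bind 'rows' into a Binding Params Array and execute one big INSERT */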

This consumed a huge amount of memory (circa 2 GB), so much so that a computer without enough RAM and CPU just destroys the Firefox session - frozen browser, not responding, unresponsive script, unresponsive computer - the word 'lag' just doesn't do it justice.

I discovered two things. Importing a 100 MB XML file produces an XML object of around six times that size, and creating the Binding Params Array for that amount of data can quite happily consume over a gigabyte of memory on top. To put this in context, I was testing with a file containing over 2 million rows and iterating over 100,000 parent elements.

My next approach was to stop using responseXML and instead use responseText (with an overrideMimeType of text/plain), then use .match(regex) to produce an array of the XML parent elements - this is another memory hit, but nowhere near as bad as the responseXML object. Take that array and slice through it in smaller chunks. For each smaller array, join it back into a single string and parseFromString it to an XML object. Finally, traverse the (much smaller) XML object to populate a Binding Params Array as before. Rinse and repeat for each slice of the larger array until you're done.

Code:

/* create an array of the XML parent elements from responseText */
var xmlArray = this.responseText.match(/<element([\s\S]*?)<\/element>/gm);

/* slice, join, and parse one array chunk into a (much smaller) XML object */
var parser = new DOMParser();
var xmlObject = parser.parseFromString('<root>' + xmlArray.slice(start, end).join('') + '</root>', "application/xml");
This worked until I discovered a new problem. On a computer with only 2 GB of RAM, responseText would always get truncated to exactly 64 MB - I've googled this to death and can't find anything that sheds any light on the behaviour, although it only seems to happen for responseText with an overrideMimeType of text/plain.

So, as if this isn't already convoluted enough, back to the drawing board.

I then discovered that blobs don't get truncated! So I'm now taking XMLHttpRequest's response (with responseType = "blob") and using FileReader's readAsText to turn the blob into a text string. The readAsText result then goes forward as before to be match(ed), slice(d) and join(ed) into the smaller XML objects.
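
For anyone interested, the blob part boils down to something like this (a sketch only - I've trimmed out my own error handling, 'sourceUrl' is just a stand-in for wherever the file lives, and the regex is the same placeholder as before):

Code:

var xhr = new XMLHttpRequest();
xhr.open("GET", sourceUrl, true); // sourceUrl: stand-in for the XML file location
xhr.responseType = "blob";        // the blob does not get truncated

xhr.onload = function ()
{
    var reader = new FileReader();
    reader.onload = function (e)
    {
        /* e.target.result is the whole file as one text string,
           ready to be match(ed), slice(d) and join(ed) as before */
        var xmlArray = e.target.result.match(/<element([\s\S]*?)<\/element>/gm);
        /* ...slice into chunks and parseFromString as above */
    };
    reader.readAsText(xhr.response);
};

xhr.send();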

In this way I finally got an ancient Pentium 4 with 2 GB of memory to import and process a 90 MB XML file in about 5 minutes, but I still think this is a seriously ugly solution. On the other hand, it may be that it is, in fact, a really cunning solution, in which case I hope it is of help to someone.

As I don't have any great expertise in either XMLHttpRequest or XML - I have more experience with 'proper' databases - I've pretty much had to make the whole import process up with the help of Google and a lot of trial and error. So I'd be really interested to know if there is a more fit-for-purpose way to handle really large XML files using JavaScript. Most of the XML examples I find don't use JavaScript at all, which might be a clue. :)

Ben.
XUL is dead. Long live the Google Chrome Clones.
lithopsian
Posts: 3664
Joined: September 15th, 2010, 9:03 am

Re: Handling Large XML Files

Post by lithopsian »

Maybe you're hitting the maximum JavaScript string length? I thought it was 256 MB in Firefox, but maybe it is 64 MB on 32-bit machines.
Benjamin Markson
Posts: 397
Joined: November 19th, 2011, 3:57 am
Location: en-GB

Re: Handling Large XML Files

Post by Benjamin Markson »

Nope, I've tested it on three machines... 2 GB 32-bit, 4 GB 32-bit, and 8 GB 64-bit, and only the 2 GB 32-bit machine exhibited this 'feature'. It was only when testing on the weakest machine that I discovered there even was a problem.

It makes no real sense, as the much larger responseXML did not get truncated, although it did trash the machine before it could fully process the XML object (I'd returned a count of the parent elements before it froze, and the count was correct).

Ben.
XUL is dead. Long live the Google Chrome Clones.
lithopsian
Posts: 3664
Joined: September 15th, 2010, 9:03 am

Re: Handling Large XML Files

Post by lithopsian »

Have you found a limit on the 4 GB 32-bit machine? There is a limit on the total JavaScript heap which is based on total available memory, and it won't be much bigger on a 4 GB machine than on a 2 GB machine.

Internally, with Unicode conversions, you might be hitting those limits. Different methods are used for blob and XML objects precisely because they are required to handle very large sizes. Either way, you are getting into the territory of JavaScript limits, so it might make sense to use the slow-but-safe blobs. Blobs have an explicit size limit of 800 MB in Firefox, but I'm pretty sure you wouldn't be able to read that much into a text string anyway, so in practice you'd be limited to perhaps 256 MB at most.

Or look for a totally different method. Standard database practice would be to nibble off pieces of your XML file and insert them into the database, not necessarily one by one but in small chunks. 64 MB SQL transactions can cause problems all of their own :)
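
Roughly along these lines, for example (just a sketch - the table and column names are made up, 'db' is assumed to be your open mozIStorageConnection and 'rows' the values pulled out of one chunk of the XML):

Code:

/* insert one chunk's rows as a single small async statement, then move on */
var stmt = db.createAsyncStatement(
    "INSERT INTO items (id, value) VALUES (:id, :value)");

var params = stmt.newBindingParamsArray();
for (var i = 0; i < rows.length; i++)
{
    var bp = params.newBindingParams();
    bp.bindByName("id", rows[i].id);
    bp.bindByName("value", rows[i].value);
    params.addParams(bp);
}
stmt.bindParameters(params);

stmt.executeAsync({
    handleResult: function (aResultSet) {},
    handleError: function (aError) { Components.utils.reportError(aError.message); },
    handleCompletion: function (aReason) { /* kick off the next chunk from here */ }
});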
Benjamin Markson
Posts: 397
Joined: November 19th, 2011, 3:57 am
Location: en-GB

Re: Handling Large XML Files

Post by Benjamin Markson »

lithopsian wrote:Or look for a totally different method. Standard database practice would be to nibble pieces of your XML file and insert them into the database not necessarily one by one but in small chunks.
Absolutely. I've been playing with nsIFileInputStream and nsIConverterInputStream which, superficially, are just the job. I can even choose my own buffer size. The only problem is that, rather unhelpfully, they are synchronous. I'm going to try experimenting with some kind of setTimeout or callback regime, but in my limited experience this tends not to end well.
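
Something like this is the sort of thing I have in mind (just a sketch, not tested - 'fileXml' is assumed to be an nsILocalFile for the import, 'processChunk' is a stand-in for the match/parse/insert step, and in practice I'd probably want to wait for the SQL insert to finish before scheduling the next read):

Code:

var fstream = Components.classes["@mozilla.org/network/file-input-stream;1"].
              createInstance(Components.interfaces.nsIFileInputStream);
var cstream = Components.classes["@mozilla.org/intl/converter-input-stream;1"].
              createInstance(Components.interfaces.nsIConverterInputStream);

fstream.init(fileXml, -1, 0, 0);            // fileXml: an nsILocalFile (assumed)
cstream.init(fstream, "UTF-8", 0x200000, 0); // 2 MB buffer, whatever encoding the file uses

function readNextChunk()
{
    var buffer = {};
    if (cstream.readString(0xffffffff, buffer) != 0)
    {
        processChunk(buffer.value);         // stand-in for the real work
        setTimeout(readNextChunk, 0);       // give the UI a chance to breathe, then carry on
    }
    else
    {
        cstream.close();                    // also closes fstream
    }
}

readNextChunk();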

The asynchronous methods all seem to end up wanting to return all of the data. Catch 22.

Ben.
XUL is dead. Long live the Google Chrome Clones.
Benjamin Markson
Posts: 397
Joined: November 19th, 2011, 3:57 am
Location: en-GB

Re: Handling Large XML Files

Post by Benjamin Markson »

Okay, I think I have it under control. I'm now using a generator function, so the nsIFileInputStream read loop can yield after processing each input buffer. I'm not sure whether this is strictly asynchronous or not... pseudo-asynchronous? I'm sure the AMO editors will confuse me with some comment or other about it in due course. :wink:

Code:

var grabDataIterator = null;
var sourceXml = '';

var grabData = function*()
{
    var fileXml = Components.classes["@mozilla.org/file/local;1"].
                  createInstance(Components.interfaces.nsILocalFile);

    fileXml.initWithPath(sourceXml);

    var fstream = Components.classes["@mozilla.org/network/file-input-stream;1"].
                  createInstance(Components.interfaces.nsIFileInputStream);
    var cstream = Components.classes["@mozilla.org/intl/converter-input-stream;1"].
                  createInstance(Components.interfaces.nsIConverterInputStream);

    fstream.init(fileXml, -1, 0, 0);
    cstream.init(fstream, "ISO-8859-1", 0x200000, 0); // 2 MB read buffer

    var context = 'first';
    var trailingData = '';
    var buffer = {};
    var read = 0;

    try
    {
        do
        {
            read = cstream.readString(0xffffffff, buffer);

            if (read != 0)
            {
                // the closing root tag only appears in the final buffer
                var i = buffer.value.lastIndexOf('</tv>');
                if (i != -1) context = 'last';

                // prepend any partial element left over from the previous buffer
                var arrProgrammes = (trailingData + buffer.value).match(/<programme([\s\S]*?)<\/programme>/gm);
                loadSchedules(arrProgrammes, context);

                if (i == -1) context = 'more';
                if (context == 'more')
                {
                    // keep whatever follows the last complete </programme>
                    i = buffer.value.lastIndexOf('</programme>');
                    trailingData = buffer.value.substring(i + '</programme>'.length);
                }

                yield; // pause here; the next .next() call resumes the loop
            }
        } while (read != 0);
    }
    finally
    {
        cstream.close(); // also closes fstream; runs even if .return() ends the generator early
    }
}

// To initiate the read sequence (context == 'first')
sourceXml = '...'; // path of the XML file you want to read
grabDataIterator = grabData();
grabDataIterator.next();

// To continue the read sequence (context == 'more')
grabDataIterator.next();

// To shut down the generator function (context == 'last')
grabDataIterator.return();
It's important to shut down and re-initiate the generator function before fetching further files, otherwise the generator function sort of gets stuck at the end of the input buffer.

In theory this can import any size of file without maxing out the memory - the memory hit is pretty much determined by the 0x200000 (2 MB in this case) buffer size in the cstream.init statement. Of course, the bigger the file the longer it all takes, but the UI doesn't freeze and can be updated during the import.

I don't suppose my code is particularly sexy but as I could find precious few examples using yield I hope it might be of some help to others.

Ben.
XUL is dead. Long live the Google Chrome Clones.