Monday, January 21, 2008

Garbage collection: Performance test

Following my initial GC post, I received feedback regarding my comment "A well written and tuned garbage collector can be faster than manual allocation.". These comments can be summarised as "show us the proof".

I asked on the Boehm GC mailing list (if in doubt, ask for help). The conversation starts here.

They provided the following (my summary):
  • One benchmark is here, showing that speeds are comparable given sufficient memory (a gc will require more memory) .
  • Another is here from Hans Boehm's presentation. See pages 50 onwards. He comments that it is a toy benchmark, on linux.
  • Malloc implementations have improved
  • Code that favours manual allocation
    Simple create, do something, free
    Large objects
  • Code that favours gc
    Complicated lifetime management
    try ... finally, free
    multi threading

Well that kinda helps. But what about in delphi?

I have done some quick tests using my modified version of the delphi wrapper for the Boehm GC (Delphi GC for short). The modifications shouldn't make any major difference to the result.

Delphi benchmark 1:
This is a simple, trivial, benchmark. It creates 60,000,000 small objects and assigns a value.

The object is simply:

TTestObject = class
public
Lines: TStrings;
constructor Create();
destructor Destroy; override;
end;
and the test is simply
for f := 1 to TestCount do
begin
testObj:= TTestObject.Create;
{$ifdef USE_GC}
testObj.Lines.Add('aaa');
{$else}
try
testObj.Lines.Add('aaa');
finally
testObj.Free;
end;
{$endif}
The try ... finally free section is not required by the GC version as we don't have to worry about memory leaks.

The GC tests were repeated with a range of initial heap sizes and on different computers. The FastMem test was also tried without the try finally. The source code is available if anyone wants it.

The results are


Old laptop, 512mb Core 2, 2gig Single core 2 gig Quad core 3 gig
FMM (no try finally)
approx 31.5

FMM try finally 81.281 33.306 37.875 48.046
GC 0mb
73.181 59.047 46.25
GC 5mb
39.499 32.906 29.656
GC 10mb 60.891 30.857 29.422 27.984
GC 20mb 58.328 26.926 27.437 27.062



Given a large enough initial heap, the gc version ends up faster than the FastMM version.

This is not a serious benchmark, but it does indicate that a gc can be faster than manual allocation.

Delphi benchmark 2

For this, I added the gc to 2 of my existing unit tests. It was a 2 line conversion, I just added the gc and set the initial heap size.

Enable is a work injury management system. It is heavy on database access and single threaded.
Envisage is a document management system. Database access is done via the tiopf object persistence framework. It reads pdf files, checks for bar codes and creates new ones. It is multi-threaded. It uses a large amount of memory.

Here are the results:


Envisage, no threads Envisage, threaded Enable
FMM 70 114.4 16.4
GC 20 74 119.0
GC 40 71 117.5 14.1
GC 100 71 115.6




Conclusion
Based on these results, I would have to say that my comment "A well written and tuned garbage collector can be faster than manual allocation." is correct. Given sufficient heap space, the gc version is faster in some tests, and 1% slower in others.

Le me restate my conclusion as the initial one is not well worded in terms of what I intended to say. A better conclusion would be "It is possible for a garbage collected application to run at a speed similar to that of an application using manual deallocation". Or alternately, "adding a gc to an application doesn't automatically make it incredibly slow".

The gc performance could probably be improved further by surfacing the gc tuning options, improving the delphi wrapper and using a later version of the GC. The unit tests could also be sped up by removing the now redundant frees, destructors and try ... finally blocks

The Boehm GC used is an early version 6 (6.2 or so). Version 7 is available from cvs. V7.1 should be released soon.

There are downsides to using a gc, such as increased memory use. It is not appropriate for all applications, especially those with memory constraints. However speed does not appear to be one of those downsides.

Update

In response to a query, yes the garbage collector is running, and collecting the objects. After the initial run (which may increase the heap), the heap size remains static no matter how many times I repeat the test (any of the tests).

I also repeated the FastMM test removing the testObj.Free; line.

It "completed" in 35 seconds. By completed, I mean "used up 1.3gig of free mem, all my 4gig page file and then threw an "out of memory" exception.


Reference
GC Mailing list: Are there any benchmarks comparing the speed of gc v non gc
Garbage collector for Delphi
Boehm GC
Wikipedia article


10 comments:

Caleb Hattingh said...

Sean

If it is easy for you to do, could you also test

testObj := TTestObject.Create;
try
for f := 1 to TestCount do
begin
testObj.clear;
testObj.Lines.Add('aaa');
end;
finally
testObj.Free;
end;

and the equivalent GC version.

Also, if you have time, another version where
lines only get added to the stringlist inside the loop (no clear), and the whole stringlist is freed at the end

It would be awesome to compare the GC/non-GC effect of .Create and .Clear.

btw, excellent that you're measuring this.

Sean said...

Caleb,

I don't have much spare time at the moment, and I am not quite sure how you want the test to work.

Pierre le Riche has placed the source to his version in the
borland.public.attachments newsgroups so it would be easy to test yourself.

Caleb Hattingh said...

Okay

My suggestion was to see the effect of the memory allocation and deallocation for the strings being add, but it was not a good suggestion because strings are reference-counted, and I don't know how that muddies the waters (so don't bother).

As is clear from the nontech discussion, it seems that the speed benefit related to using the GC is directly related to the GC not calling .Destroy(). This is an interesting situation.

Destroy() could be re-written to do less work (in the first example provided and presumably in your second unit-test example) in which case FastMM will be significantly faster and use significantly less memory. But would this be "fair" or not? Certainly, in applications where Destroy() does something important, it seems the GC will entirely miss that code, which seems dangerous, but that situation seems to be not covered in these examples.

One important result here (found by Pierre), aside from the debate, is that TStringList.Destroy() appears to be relatively slow compared to just releasing the memory, and as such could be a useful optimization target for projects such as FastCode.

Anonymous said...

Your benchmarks are invalid, because the Boehm GC memory manager does not work properly with the VCL. The reason is seems fast is because it skips large chunks of critical code.

It is an interesting memory manager but unfortunately not very useful until the VCL gets rewritten.

Sean said...

Jan,

Are you referring to code contained within the .Destroy destructors, or something else?

I doubt that a TStringlist.Destroy is critical code other than to the string list (I can't check at the moment).

In my other 2 tests, I made no code changes other than adding and setting up the gc. All objects, vcl and otherwise are being freed.

If you are programming with the gc, you would still need to call free on vcl objects as they contain external resources and do funky things.

Calib,

Of the 3 benchmarks, only the first skips any destroys. The other 2, based on my unit tests, create and destroy (manually) all objects. The gc versions are still as fast as, or slightly slower, than the fmm ones.

Anonymous said...

Yes I was. TStrings and a lot of other VCL classes do a lot more besides releasing memory. As it stands you cannot test the garbage collection with the VCL.

One can still test its use as normal memory manager by freeing all objects which is what you did in your other two benchmarks. I did two tests too in a real life multi-threaded application with a 20MB GC heap size.

Test 1: Image processing stuff
D7 memory manager: 7005 msecs
FastMM4: 6650 msecs
Boehm GC: 8900 msecs

Test 2: String juglings stuff
D7 memory manager: 1060 msecs
FastMM4: 1154 msecs
Boehm GC: 5234 msecs

While repeating both tests the D7 memory manager and FastMM4 show a non shareable private working set memory of 9.3MB. The Boehm starts at 45MB and climbs with 10MB per cycle. So somewhere between 50 and 130MB it always crashes with a "Abnormal Program Termination". It looks like it is not releasing a significant part of the memory.

I sincerely think that the GC memory manager is an interesting experiment, but claiming it is as fast as FastMM4 is a bit silly. Due to the fact that garbage collection cannot be done with the VCL and the Boehm's MM buggyness.

Anonymous said...

I switched two numbers in my previous post. It should be:

Test 2: String juglings stuff
D7 memory manager: 1154 msecs
FastMM4: 1060 msecs
Boehm GC: 5234 msecs

Caleb Hattingh said...

Sean

Thanks for your reply. I expect you are wondering why everyone keeps focusing on your Stringlist example and not on your other "real-world" example?

I suspect it may be because your other example makes use of a lot of database access, as you stated. (It's not clear to me whether the app or the unit tests have DB access, but I assume both).

If this is the case, then, probably, it doesn't matter what memory management system is used because the cost of I/O far outweights the relative performance differences of the GC vs. FastMM.

In other words, using heavily I/O bound benchmarks to test memory managers will probably result in them all performing about equally. I'm not saying that this *is* the case with your second test, but I am assuming so based on the information you provided in your blog.

Interestingly, however, this is an argument *for* garbage collection, because in the heavily I/O bound domain, it can be argued that the speed hit (I maintain there is a speed hit till proven otherwise) when using a GC is negligible (in this domain) while you get all the perks, such as nicer syntax and low manual memory management cost, and so forth.

This is exactly the same argument the python community have been making for years: "Sure, python is slower than C, but it's not slower than I/O, so past that point, in that domain, speed doesn't matter."

Anonymous said...

may be you can add comparing with GC of .NET Framework (just compile with Delphi.NET)?

(or send me sources and I'll try it myself)

Sean said...

Caleb,

The later 2 benchmarks are with database access. Envisage is using the tiOPF framework, while Enable goes straight to the database.

Andrey,
There are sources in the borland.public.attachments newsgroup.