Allocation recording hurts concurrency?

Questions about YourKit Java Profiler
Post Reply
plethora
Posts: 314
Joined: Thu Jun 02, 2005 8:36 pm

Allocation recording hurts concurrency?

Post by plethora »

I've got a CPU-bound benchmark that will saturate 90% of the CPU (according to YourKit) when profiled with CPU sampling only. I'm running on an 8 core (2x quadcore Xeon) workstation. This is what the CPU graph of YourKit shows, and Process Explorer gives a comparable figure from the operating system level.

When I enable allocation recording, the CPU of the profiled process stays around 15%. Again this is reflected both in the YourKit CPU graph, as well as Process Explorer.
Enabling the sampling allocation recording does not drastically alter the picture.

Because of this, I get the impression that there's some contention within the allocation recording infrastructure of the YourKit agent. My code does not scale perfectly, usually I'm saturating 90%-95% of the CPU, but this drop is quite dramatic.
Anton Katilin
Posts: 6172
Joined: Wed Aug 11, 2004 8:37 am

Re: Allocation recording hurts concurrency?

Post by Anton Katilin »

Hello Taras

Allocation recording really takes a lock for a minimum period of time when it processes a recorded object. Overhead should depend on the number of parallel threads: more threads - bigger the overhead.

What was the "Record each N-th" setting?

Could you please try to change it (e.g. from 10 to 20 and then to 100) to see what the effect is.

What is the number of threads?

If it is possible e.g. via a pool setting could you please try to alter the number of threads.

How big is the performance snapshot file size (*.snapshot) ?

Best regards,
Anton
plethora
Posts: 314
Joined: Thu Jun 02, 2005 8:36 pm

Re: Allocation recording hurts concurrency?

Post by plethora »

Hi Anton,
Allocation recording really takes a lock for a minimum period of time when it processes a recorded object. Overhead should depend on the number of parallel threads: more threads - bigger the overhead.
Is there some room for improvement here? E.g. using CAS structures, or JDK8-style high-thoughput counters?
What was the "Record each N-th" setting?
Each 1 (e.g. full allocation recording)
Could you please try to change it (e.g. from 10 to 20 and then to 100) to see what the effect is.
Will do, and comment here later.
What is the number of threads?
8 - equal to the number of cores.
If it is possible e.g. via a pool setting could you please try to alter the number of threads.
Sure, it's possible. Machine is 8-core (no HT support, it's a "old" Xeon workstation).
Which thread counts do you want me to try?
How big is the performance snapshot file size (*.snapshot) ?
I didn't actually take a snapshot at this point.
I was just observing the "Allocations" pane so far, and keeping an eye on the CPU graphs.

Kind regards,
Taras
Anton Katilin
Posts: 6172
Joined: Wed Aug 11, 2004 8:37 am

Re: Allocation recording hurts concurrency?

Post by Anton Katilin »

Is there some room for improvement here?
I think yes. All data collected in a centralized storage access to which is granted through a lock can be collected in each thread's local storage instead, and then merged at some point.

It is not difficult to simply store its own copy of allocation data in each thread. The question is how much memory it would require, how it would scale as the number of threads grows.

Version 11 already provides similar optimization for CPU tracing, and it shows 3x improvement. Allocation recording is different in some technical aspects, and exactly the same optimization is not applicable for it.

But we'll see what we can do.
E.g. using CAS structures, or JDK8-style high-thoughput counters?
I'm afraid I'm not familiar with them. Could you please share a link.
Sure, it's possible. Machine is 8-core (no HT support, it's a "old" Xeon workstation).
Which thread counts do you want me to try?
If possible, let's try 2, 4 and 16.
I didn't actually take a snapshot at this point.
I was just observing the "Allocations" pane so far, and keeping an eye on the CPU graphs.
I asked about the snapshot size because it allows to roughly estimate the possible collected data size growth. The most pessimistic estimation (in the real life it will never be that bad): if we store each thread's copy of data separately, the current size may grow in up to <the number of threads> times.
plethora
Posts: 314
Joined: Thu Jun 02, 2005 8:36 pm

Re: Allocation recording hurts concurrency?

Post by plethora »

Hi Anton,

First some additional details:
* I'm specifying the following additional options: onexit=snapshot,disabletracing,disablestacktelemetry,disableexceptiontelemetry,noj2ee,builtinprobes=none,disableoomedumper
* Running JDK 1.6 update 26, HotSpot Server VM x64 on XP64
* ~7M objects/~300M bytes of allocations recorded during profiling run
* Non-profiled (normal execution): ~90% CPU usage, 3 seconds total runtime (usually the amount of work would be greater, to run at least a few minutes - this is just for comparison with the profiled run times below)
* Non sampled allocation recording: ~15% CPU usage, 310 seconds total runtime
* Sampled allocation recording: ~20% CPU usage, 79 seconds total runtime

So, sampled allocation recording is clearly a good improvement for this particular case, but the obvious drop is throughput and concurrency is less pleasant.
I'm afraid I'm not familiar with them. Could you please share a link.
They are part of jsr166e, see a blog post here:
http://dhruba.name/2011/08/28/jsr166e-u ... or-java-8/

Specifically, I was referring to classes like the one below:
http://gee.cs.oswego.edu/dl/jsr166/dist ... Adder.html

Of course, I have no idea what the structure of the data is that the agent is maintaining, and how easy it would be to use high-throughput lock-free constructs there.

Kind regards,
Taras
plethora
Posts: 314
Joined: Thu Jun 02, 2005 8:36 pm

Re: Allocation recording hurts concurrency?

Post by plethora »

In the earlier reply, "CAS" obviously refers to (lock-free. but not necessarily wait-free) compare-and-swap operations. Either from native code, though the "Unsafe" class, using jsr166 libraries, or simply using the regular java.util.concurrent.* classes.

Ok, I'll try to do a bit of benchmarking at work tomorrow, and report results here.
I'll try thread counts of 2, 4, 8 and 16, and permute those with allocation sample ratios of 1, 10, 20 and 100.

Questions:
1) Do you want me to run regular allocation profiling, or "sampled" allocation profiling?
2) In addition, to obtain a memory snapshot at the end of the process, I assume I need to add the option "onexit=memory". Is that correct? I assume you're just interested in the file size in bytes, right?

Kind regards,
Taras
Anton Katilin
Posts: 6172
Joined: Wed Aug 11, 2004 8:37 am

Re: Allocation recording hurts concurrency?

Post by Anton Katilin »

Hello Taras
Ok, I'll try to do a bit of benchmarking at work tomorrow, and report results here.
I'll try thread counts of 2, 4, 8 and 16, and permute those with allocation sample ratios of 1, 10, 20 and 100.
Thank you.
Questions:
1) Do you want me to run regular allocation profiling, or "sampled" allocation profiling?
Both, if possible. The results should seriously differ, and it is interesting to know the exact numbers for each of the cases.
2) In addition, to obtain a memory snapshot at the end of the process, I assume I need to add the option "onexit=memory". Is that correct? I assume you're just interested in the file size in bytes, right?
Yes, that is correct.

Best regards,
Anton
plethora
Posts: 314
Joined: Thu Jun 02, 2005 8:36 pm

Re: Allocation recording hurts concurrency?

Post by plethora »

Hi Anton,

Details on the measurements
  • Performed from IDEA, using the YK plugin for profiler configuration
  • "-server" option was enabled for the JVM running the benchmark
  • For the allocation profiling, "record each object bigger than" 1 MB was enabled
  • For the allocation profiling, the following additional options were set:

    Code: Select all

    onexit=memory,disabletracing,disablestacktelemetry,disableexceptiontelemetry,noj2ee,builtinprobes=none,disableoomedumper
  • CPU sampling was enabled for all allocation recording runs
  • For each pool size, I've also included CPU sampling only results for comparison (to see what the allocation recording overhead is)
Columns in the measurements
1) pool size, passed to Executors.newFixedThreadPool(...)
2) allocation recording ratio (1/10/20/100)
3) wether sampled or non-sampled allocation recording was used
4) CPU usage estimation: done visually by looking at the YourKit CPU graph, occasionally comparing with Process Explorer
5) total runtime of benchmark, in milliseconds
6) file size of the memory snapshot captured at end of run. There is some variation in the resulting snapshot sizes - for some configurations I performed multiple runs, and listed the various snapshot sizes separated by "/"
7) total objects/bytes GCed - obtained from some snapshots, just as a check

My conclusions
Sampled allocation recording, as well as choosing higher recording ratios (each N-th) helps the performance quite a lot.
In fact, using sampled allocation with a recording ratio of 1:100 is only a factor of 5-10 slower than CPU sampling.

Looking at the profiling runs using CPU sampling only, it's clear that the benchmark load scales to multiple cores reasonably well. Performance almost doubles going from 2->4 and 4->8 worker threads.
However, when using YourKit allocation recording, this scalability is totally lost. In fact, there's hardly any performance difference at all between using 2, 4, 8 or 16 worker threads.
I would assume that contention in the allocation recording of the agent is responsible for this.

pool size 16
pool size: 16, each 1th, non-sampled: ~15% cpu, 629842ms total, 412 MB (GCed objects: 13.037.031, GCed bytes: 516.649.392)
pool size: 16, each 1th, sampled: ~19% cpu, 158282ms total, 517/428 MB (GCed objects: 13.031.391, GCed bytes: 515.950.584)

pool size: 16, each 10th, non-sampled: ~20% cpu, 90717ms total, 329/392 MB (GCed objects: 1.303.249, GCed bytes: 70.017.400)
pool size: 16, each 10th, sampled: ~26% cpu, 42655ms total, 381/386 MB

pool size: 16, each 20th, non-sampled: ~23% cpu, 58373ms total, 298/371 MB (GCed objects: 651.391, GCed bytes: 45.220.464)
pool size: 16, each 20th, sampled: ~30% cpu, 33576ms total, 372/292 MB

pool size: 16, each 100th, non-sampled: ~30% cpu, 30390ms total, 286/285 MB (GCed objects: 130.251, GCed bytes: 25.610.280)
pool size: 16, each 100th, sampled: ~32% cpu, 24953ms total, 286/364 MB

cpu sampling only, pool size: 16, 10x jobs: ~90% cpu, 18499ms total
cpu sampling only, pool size: 16, 100x jobs: ~95% cpu, 139932ms total

pool size 8
pool size: 8, each 1th, non-sampled: ~15% cpu, 602613ms total, 418 MB
pool size: 8, each 1th, sampled: ~20% cpu, 151852ms total, 316 MB

pool size: 8, each 10th, non-sampled: ~20% cpu, 89605ms total, 308 MB
pool size: 8, each 10th, sampled: ~25% cpu, 41951ms total, 258 MB

pool size: 8, each 20th, non-sampled: ~22% cpu, 57904ms total, 229 MB
pool size: 8, each 20th, sampled: ~30% cpu, 32576ms total, 331 MB

pool size: 8, each 100th, non-sampled: ~30% cpu, 29811ms total, 223 MB
pool size: 8, each 100th, sampled: ~35% cpu, 24483ms total, 299 MB

cpu sampling only, pool size: 8, 10x jobs: ~90% cpu, 17109ms total
cpu sampling only, pool size: 8, 100x jobs: ~95% cpu, 157776ms total

pool size 4
pool size: 4, each 1th, non-sampled: ~15% cpu, 597989ms total, 355 MB
pool size: 4, each 1th, sampled: ~20% cpu, 153525ms total, 348 MB

pool size: 4, each 10th, non-sampled: ~20% cpu, 82277ms total, 260 MB
pool size: 4, each 10th, sampled: ~27% cpu, 38030ms total, 257 MB

pool size: 4, each 20th, non-sampled: ~23% cpu, 53061ms total, 254/295/207/272 MB
pool size: 4, each 20th, sampled: ~29% cpu, 29421ms total, 249/216/260/249 MB

pool size: 4, each 100th, non-sampled: ~30% cpu, 26202ms total, 254/203/251/83 MB
pool size: 4, each 100th, sampled: ~31% cpu, 21827ms total, 255/326/203/204 MB

cpu sampling only, pool size: 4, 10x jobs: ~50% cpu, 28936ms total
cpu sampling only, pool size: 4, 100x jobs: ~50% cpu, 257162ms total

pool size 2
pool size: 2, each 1th, non-sampled: ~15% cpu, 582006ms total, 346/309 MB
pool size: 2, each 1th, sampled: ~18% cpu, 141541ms total, 304/301 MB

pool size: 2, each 10th, non-sampled: ~17% cpu, 82247ms total, 290/226 MB
pool size: 2, each 10th, sampled: ~20% cpu, 40389ms total, 216/259 MB

pool size: 2, each 20th, non-sampled: ~17% cpu, 51482ms total, 327/262 MB
pool size: 2, each 20th, sampled: ~21% cpu, 29592ms total, 260/205 MB

pool size: 2, each 100th, non-sampled: ~20% cpu, 27374ms total, 314/209 MB
pool size: 2, each 100th, sampled: ~22% cpu, 22499ms total, 327/44/39/254 MB

cpu sampling only, pool size: 2, 10x jobs: ~25% cpu, 55170ms total
cpu sampling only, pool size: 2, 100x jobs: ~25% cpu, 526749ms total

Kind regards,
Taras
plethora
Posts: 314
Joined: Thu Jun 02, 2005 8:36 pm

Re: Allocation recording hurts concurrency?

Post by plethora »

Hi Anton,

Are the measurements in line with your expectations? Anything more I could contribute from my side?

Kind regards,
Taras
Anton Katilin
Posts: 6172
Joined: Wed Aug 11, 2004 8:37 am

Re: Allocation recording hurts concurrency?

Post by Anton Katilin »

Hello Taras

Sorry for delayed response. We were too busy with the release of version 11.
We'll thoroughly analyze the results in next few days, and I'll write back ASAP.

Best regards,
Anton
plethora
Posts: 314
Joined: Thu Jun 02, 2005 8:36 pm

Re: Allocation recording hurts concurrency?

Post by plethora »

No problem - just checking that you didn't miss my post (it took quite some time to perform these measurements).
I'll be here once you have some builds to try :)
plethora
Posts: 314
Joined: Thu Jun 02, 2005 8:36 pm

Re: Allocation recording hurts concurrency?

Post by plethora »

So, now that YK 12 EAP is open, do you have any plans to improve this for the coming release?
It kind of limits the allocation profiling to non-realistic (low load, load concurrency) tests.

Cheers,
Taras
Anton Katilin
Posts: 6172
Joined: Wed Aug 11, 2004 8:37 am

Re: Allocation recording hurts concurrency?

Post by Anton Katilin »

Hello Taras

Yes, we'll try to address this issue in version 12.

Best regards,
Anton
Anton Katilin
Posts: 6172
Joined: Wed Aug 11, 2004 8:37 am

Re: Allocation recording hurts concurrency?

Post by Anton Katilin »

Hello Taras

We have started working on allocation recording scalability improvements.

We have an intermediate result. It is not the final one, because we have not applied all the planned changes, but still we think there should be an improvement for higher "Record each N" parameter values.

If you have a wish and an ability to test it, we can give you the agent for testing. Please let us know the target platform then.

I think it would be enough to test only the following combinations: (sampled - non-sampled) x (each 1 - 20 - 100) with the same thread count e.g. 16.

Best regards,
Anton
plethora
Posts: 314
Joined: Thu Jun 02, 2005 8:36 pm

Re: Allocation recording hurts concurrency?

Post by plethora »

Hi Anton,

Windows x64.

Kind regards,
Taras
Post Reply