DAZ Studio and multiprocessing
wowie
Posts: 2,029
I've been contemplating building a machine just for rendering and keeping the current one just for posing. One of the options on the table is a 4-way Opteron 6300 G34 socket setup (16 cores each, for a total of 64 cores). I know the built-in 3delight renderer can spawn as many threads as there are cores (physical and logical). What I don't know is how it scales with more than 8 threads.
So, anybody here have experience running DAZ Studio with more than 8 threads? I'm particularly interested in performance scaling. Is there a diminishing return after a certain number of threads/cores?
Thanks.
Post edited by wowie on
Comments
The only thing I've seen on it, from looking around places like the 3DL forums, deals with licenses/cores... with enough licenses, or an unlocked build, it pretty much looks like there's no limit. Think about it... if there wasn't a gain, or it didn't scale well, then why would it be used on big productions that run their renders on many-core/many-thread renderfarms?
The only thing I'm not sure of is exactly how 'core unlocked' it is in DS... and I'm not sure if anyone who knows can actually tell. There were some things in the discussions about 3DL in the forum archive that were not divulged, or only vaguely answered... as if an NDA was standing in the way.
Thanks.
Yes, I could just use the standalone renderer, but I'm more interested in using the built-in renderer since I want to produce animation. With the standalone, I would need to export a RIB for each frame and maybe assign a specific number of cores to each RIB. But that would be quite complex to set up.
The reason I'm asking is whether 3delight, be it standalone or built in, suffers scaling problems with more than 8 cores. There's an article on fxguide - https://www.fxguide.com/featured/the-state-of-rendering-part-2/ - where they discuss the multithreading performance of renderers.
I think, but am not sure, that the exact details of what the included 3DL is capable of are protected under an NDA... at least that's what it seems like from the previous discussions, because the details always seemed kind of vague.
Interesting article...and it seems that many of the 'scaling' issues/losses stem from bottlenecks dealing with textures. One of the things with DS...you aren't going to be getting into the 10s/100s of GB range for textures, so the texture caching issues discussed in that article shouldn't be much of a problem.
I haven't seen any actual test/comparison data on more than 8 cores. But...at lower core counts (up to and including 8) yes, there are dramatic improvements.
It's not just textures. It's point clouds, which I think are used pretty extensively (SSS, photon mapping).
So, no one is actually running DS on a machine with more than 8 threads?
How many DS users use pointclouds? The included pointcloud render script barely scratches the surface of what PC rendering is capable of.
And all the existing SSS solutions in DS use raytracing methods, not pointclouds. (pwsurface, HSS, US, US2 and AoA are all raytraced SSS).
I vaguely remember at least one person has mentioned a dual 6-core xeon setup. That had all 24 threads in use.
The only limit of 3Delight in Studio, that is public, is that it has no networking.
Everything else seems to be open to scripting.
That may be true, but those shaders do need to store precomputation info.
http://www.3delight.com/en/uploads/docs/3delight/3delight_40.html#SEC170
And I would love to use a shader that's capable of two-pass subsurface scattering and avoid a separate precompute pass for each frame. Especially if you have lots of objects with SSS (each with its own material ID).
So would I, but two-pass rendering isn't all that easy to implement within DS. Unless you write your own shader, and the needed scripts too, it's going to be a long while before they are implemented in DS...
Now, nobody is saying that DS/3DL won't RUN on more than 8 cores... it's just a question of HOW much of an improvement there is at larger core counts. I'm saying that, as long as there is adequate memory, the amount of data a typical DS scene will cache (unlike most major productions) won't be a likely source of bottlenecks, and that the smaller amount of data itself will be the performance limiter.
The renderer would spend more time chopping up the small amount of data to send to each core than the cores would spend processing that data. So somewhere in the 8 to 16 core/thread area you will hit a point of diminishing returns... due to data starvation, more than anything else. Basically, there will be small increases in performance, but it will essentially be a gentle slope/plateau, not like the gains you see jumping from 2 to 4 to 8. (At least that's how I see it... of course I could be all wet, but reading that article it really seemed to me that the big problem is that large amounts of cached data are going to really hurt, because of the way the caches are handled.)
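The "gentle slope" intuition is essentially Amdahl's law: if some fixed fraction of the render is serial (scene parsing, bucket management, data shuffling), extra cores only speed up the parallel remainder. A minimal sketch; the 95% parallel fraction below is a made-up illustrative number, not a measurement of 3Delight:

```python
# Amdahl's law: speedup with n cores when a fraction p of the work
# is parallelizable (the remainder is serial overhead such as scene
# parsing and bucket management).
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# Illustrative only: with 95% parallel work, speedup flattens fast.
for n in (4, 8, 16, 32, 64):
    print(f"{n:2d} cores: {amdahl_speedup(0.95, n):.2f}x")
```

With that made-up 95% figure, 64 cores would give only about a 15x speedup, which is the plateau shape being described here; a higher parallel fraction pushes the plateau further out.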
And as far as I know, no, there haven't been any actual empirical tests done to prove/disprove/otherwise show it.
Using larger buckets will likely help there somewhat. Those Opterons do come with larger caches, and memory shouldn't be a problem (though it uses a NUMA scheme). If there are diminishing returns between 8 and 16 threads, then I guess it would be better to spread the render load across several 8-thread machines. After all, a 4-way Opteron setup means the renderer will have to spawn 64 threads.
Now, that's just looking at the render side of things... DS itself, while doing all the other stuff, including the non-render parts of animations, will probably see a different 'curve' as far as threads/cores go... and my guess there is that it would be more of a Windows limit, or more precisely, down to how Windows handles it.
And if you ever do use Luxrender, it can/does scale differently and can handle a fair amount of threads/cores...basically unlimited when networked.
Well, this second machine is purely for render purposes, so I'm not that concerned about DS itself. As for Luxrender, it has its merits, but I don't think it's particularly well suited for rendering animation at 24 or 48 fps. I'm aiming for less than one minute of render time per frame (the lower the better) at 720p resolution minimum.
I don't think you're right at all here. It's not a gentle slope, it's logarithmic, and at 24 cores I don't believe I've hit the top of the curve. Think about it a moment. The entire render is divided into buckets before rendering starts, so as soon as a CPU becomes available there are always more buckets queued to be rendered. The point of diminishing returns occurs only when the OS plus application spend more time on managing buckets than they do on crunching data.
There are two other scenarios where performance (buckets completed per time cycle) decreases... Those would be
A) CPU/Memory bus is overwhelmed (too many buckets trying to move data in/out of RAM at once)
B) #ofBlocksLeftToRender is less than Cores Available.
Regarding Point A: Architecture here is everything. From bus speed, memory clock, cpu clock, to cache on the CPU everything matters.
Regarding Point B: Once there are fewer buckets left to render than there are cores assigned to rendering, those extra cores go idle and obviously performance as measured in the number of blocks completed per time period is reduced as only one core can work on a given block at a time.
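Point B is easy to put rough numbers on. Assuming equal-cost buckets (a simplification; real buckets vary a lot), rendering proceeds in "waves" of up to one bucket per core, and only the final wave leaves cores idle:

```python
import math

# Back-of-the-envelope model of point B, assuming equal-cost
# buckets: a render with B buckets on C cores takes ceil(B/C)
# "waves", and only the last wave can leave cores idle.
def waves(buckets, cores):
    return math.ceil(buckets / cores)

def last_wave_utilization(buckets, cores):
    rem = buckets % cores
    return (rem if rem else cores) / cores

# e.g. 100 buckets on 12 cores: 9 waves, final wave uses 4 of 12.
print(waves(100, 12), last_wave_utilization(100, 12))
```

The idle-tail cost shrinks as the bucket count grows relative to the core count, which is one reason smaller buckets can keep many cores busier.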
Now, honestly, I can say that I never bothered to try and test 8 true cores vs. 10 vs. 12 true cores, but it would be trivial to do, and I'll just go do it to prove the point. :)
Give me a bit and I'll test 4 vs. 8 vs. 12 and do the math for each.
I proved long ago that hyperthreaded cores are less efficient than true cores (back on the old forums), but they still bring between 40 and 60% improvement to the table so are worth using.
Initial results are promising... 12 vs. 8 so far (4 is still rendering): the following scene rendered in 469.192 seconds on 12 cores and in 703.647 seconds on 8 cores. That's 1.49969948 times faster on 12 than on 8. Pretty much exactly what you'd suspect (given 12 = 1.5 x 8). These were true cores only. No hyperthreading.
By my guesstimate, it'll be another 15 minutes or so for the 4-core render to complete. It should come out around 24-25 minutes.
Scene was rendered using Area Lighting, with 128 samples, reflections, transmaps, etc as a nice diverse set of things to work off of.
Sorry for the crappy render, I just threw something together to test with :)
Thanks Adam,
That's very promising indeed. Don't worry about the render quality. For completeness' sake, could you test it with one thread? (I know it will take a long time to render.)
My guess of 16 cores being where it goes from log to linear is based solely on the amount of data that ends up being sent to the cores...it looks like 24 may be that limit instead of 16 or that it may scale better than I thought (32 cores...). But I still stand by the fact that at some point we aren't feeding it enough data to keep showing such dramatic improvements.
And OS plays a role...especially with its CPU scheduler.
Interestingly, there's a more than 100% jump between 4 and 8. I think this has to do with the way buckets are assigned. It looks like, in transmapped areas, buckets are permanently assigned to cores. This makes sense when you think about it, as you'd have to "redo" more math for lighting if you didn't... but what it does mean is that the 4-core render bogged down considerably in the transmapped hair, compared to the 8 or the 12. Both the 8- and 12-core renders had 2 or more buckets working outside the transmapped area, whereas the 4-core render did not.
As a result, at 4 cores, I was expecting it to take around 1400-1500 seconds, but instead it came in at 1872.23s (about 5 minutes longer than expected). That's almost a 4x jump (3.99) instead of the expected 3x jump (from 4 to 12).
I'm not sure I have the patience this morning for 1 core... guesstimates would put that at least at 2 hours (but likely more).
My guess...4 or more, possibly as much as 8 (if it were a true logarithmic scaling it would definitely be in the 8 range).
Yeah, I think it's more complicated than that. There's some "pre-determination" that's occurring on bucket assignment that makes it a bit more voodoo-ish to figure out. What I know for certain is that I've never hit a limit on CPUs where I didn't notice an improvement. I was pondering playing with one of my dual xeon 10ways to see but I'm sure my boss would frown on that. :)
Well, I'm glad I was off by a couple of dozen cores...so right now, 24 and 32 are probably well within the 'still show gains' part of the graph.
1 core render did better than expected. 1h 59m 10s or 7150 seconds, just 50 seconds shy of 2 hours.
Keep in mind that none of the tests were perfectly sanitized, so there might be some slight variations. For instance I was running RDP, a web browser, skype, calc, notepad and all the routine windows services during all of the tests. System load on other tasks may contribute to some variation in time. When possible (everything but the 12 core test) I assigned these other tasks to otherwise idle CPUs.
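Pulling the four measured times from this thread together (same scene, true cores only; the caveats above about background tasks apply):

```python
# Render times measured in this thread (seconds); 1 core is the
# baseline for speedup and per-core efficiency.
times = {1: 7150.0, 4: 1872.23, 8: 703.647, 12: 469.192}

for cores in sorted(times):
    speedup = times[1] / times[cores]
    print(f"{cores:2d} cores: {speedup:5.2f}x speedup, "
          f"{speedup / cores:6.1%} per core")
```

Notably, against the 1-core baseline the 8- and 12-core runs come out better than linear (over 100% per-core efficiency), which lines up with the bucket-assignment observation about the jump between 4 and 8.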
I noticed, in the article Wowie linked to, that the PTEX routines we've just got our hands on are particularly badly threaded!
I believe this is more of an implementation problem than an inherent one. I'm hoping DAZ3D implements it (properly) though, for both DS and Carrara.
Outstanding thread; thank you all for participating. Adam, thank you for testing...although yes, I agree that a single-core test would be very revealing and would provide a baseline value. But I understand the patience thing too. :D
From what I remember of my tests, the two-pass SSS with the point cloud works. But indeed, the problem is that the precomputation is lost because the temp files are deleted when the render is finished. You'd have to rewrite the shader and modify the script to keep the first pass.
One thing nobody mentioned is multiprocessing vs. multithreading. From the 3delight manual, it seems that 3DL will start as many threads as there are cores, and that may be how it is done inside DS. But if you have a lot of cores and memory, maybe you could get faster renders with multiprocessed renders.
I'm assuming you mean rendering different frames at the same time to maximize utilization of all available cores. It will probably help a lot in situations where there are fewer buckets than cores: the idle cores can start processing the next frame. However, I don't think this is possible with the built-in renderer. It's doable with the standalone, provided you've exported all the RIBs and have some sort of render queue management going on. That can be very complex to set up, so I do want to avoid that if possible.
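For what it's worth, the queue-management side doesn't have to be elaborate. A minimal sketch that just builds one command line per exported frame RIB: `renderdl` is 3Delight's standalone command-line renderer, but the `-t` thread flag and the file names here are assumptions, so check the 3Delight manual for the exact options:

```python
# Hypothetical helper: one standalone-render command per frame RIB.
# "renderdl" is 3Delight's command-line renderer; the "-t" thread
# flag is an assumption (verify against your 3Delight manual).
def build_render_jobs(rib_paths, threads_per_job=8):
    return [["renderdl", "-t", str(threads_per_job), rib]
            for rib in sorted(rib_paths)]

jobs = build_render_jobs(["frame_0002.rib", "frame_0001.rib"], 16)
```

Each command list could then be handed to something like Python's subprocess.Popen, launching two or three at a time so idle cores always have a frame to chew on.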
I've made a chart based on Adam's data. Assuming the decrease in render time from going to 8 to 12 cores will be the same from 12 to 16 cores, it looks very promising indeed.
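To make the chart's assumption explicit: it carries the absolute time saved going from 8 to 12 cores forward to 12 to 16. That's optimistic (real scaling flattens), but as a ballpark:

```python
t8, t12 = 703.647, 469.192   # measured times from this thread (s)
saved = t8 - t12             # time saved by adding 4 cores (8->12)
t16_est = t12 - saved        # chart's assumed 16-core time
print(f"estimated 16-core time: {t16_est:.3f} s")
```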
No, I meant that there is a different behaviour between multiprocessing and multithreading.
You can read some details in chapter 7.1 of the 3delight manual.
They also say that you could get some speed gain with certain settings, like choosing the right tiling strategy.
Yes, I am certain that tiling strategy plays into it. I was using the default Horizontal bucket order. Unfortunately, I really don't know enough about how bucket order impacts the bucket/cpu assignment (other than the obvious) so I really don't know why you'd choose one bucket order over another.
I decide the bucket order based on the primary direction of displacement... for things like grass I use a vertical bucket order. It seems to help prevent clipping on the displaced grass. Other than that... ????