
PhysX: The Case Of The Leaky Cloth


Summary


We found quite a large leak within Apex 3.4 (currently used in Unreal Engine 4, amongst others): in a relatively short run of the title that we were working on, we saw over 150 MB of leaked memory. Here, we present a small change that fixes it.

Memory Profiling Analysis


Initially, once we knew that we had a memory leak, we started profiling using all of the standard tools offered to us through UE4 (the STAT commands, Malloc Profiler and more) and, after several hours, we’d gotten nowhere. There was clearly quite a large leak happening, and we found ways to make it happen faster, but there simply wasn’t a good way of closing in on the problem. In theory, Malloc Profiler could’ve done it, but you couldn’t run it for long enough before it crashed. Even when it didn’t crash, the tool for viewing and analysing the profile couldn’t cope with the data and would hang too.

In the end, we integrated our Malloc Profiler 2 changes (available here) and, within about 5 minutes, we’d tracked down the problem. We had a solution to it very soon after that.

The process for obtaining this information went something like this:-

  • run the project for a while to allow buffers to expand to their desired size, content to preload, etc;
  • start malloc profiling so that it tracks all new allocations, deallocations and reallocations;
  • play the project for a few minutes;
  • stop tracking allocations – but continue tracking deallocations;
  • play the project for several minutes more – moving out of the present location, returning to the frontend, etc.. essentially trying to get the project to deallocate as much as possible;
  • force a garbage collect (using the console command “OBJ GC”);
  • stop profiling completely.
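
The idea behind those steps can be sketched in a few lines of C++. This is only a toy illustration, not the Malloc Profiler implementation: the class WindowedLeakTracker and all of its method names are hypothetical, and a real profiler would hook the allocator and capture callstacks rather than take a call-site string.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical toy tracker (NOT the Malloc Profiler API). It records
// allocations made inside a window, keeps honouring frees after the window
// closes, and reports whatever is still outstanding at the end.
class WindowedLeakTracker {
public:
    // Start recording both allocations and deallocations.
    void startTracking() { trackAllocs_ = true; trackFrees_ = true; }

    // Close the allocation window, but keep processing deallocations.
    void stopTrackingAllocations() { trackAllocs_ = false; }

    // Stop recording completely.
    void stopTracking() { trackAllocs_ = false; trackFrees_ = false; }

    void onAlloc(const void* ptr, std::size_t size, const std::string& callsite) {
        if (trackAllocs_) {
            live_[ptr] = Record{size, callsite};
        }
    }

    void onFree(const void* ptr) {
        if (trackFrees_) {
            live_.erase(ptr);  // erasing an untracked pointer is a no-op
        }
    }

    // Bytes allocated inside the window and never freed, grouped by call site.
    std::map<std::string, std::size_t> outstandingByCallsite() const {
        std::map<std::string, std::size_t> totals;
        for (const auto& entry : live_) {
            totals[entry.second.callsite] += entry.second.size;
        }
        return totals;
    }

private:
    struct Record {
        std::size_t size;
        std::string callsite;
    };

    std::map<const void*, Record> live_;
    bool trackAllocs_ = false;
    bool trackFrees_ = false;
};
```

Anything allocated during warm-up is ignored, anything freed during the "wind down" phase drops out of the report, and what remains is a list of suspects rather than proof of a leak.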

With this done, we’re left with all the allocations that were made during play that weren’t later deallocated. Here’s a dump that we got from the profiler that highlighted the leakage:-

So, yeah, I’ve highlighted the suspicious function here. Out of 170.86 MB (416,801 allocations) of potentially leaked memory, 163.97 MB (322,840 allocations) is coming from ClothingActorImpl::updateRenderProxy() within NVIDIA’s Apex library. That’s definitely the area that we need to start sniffing around.

Investigating The Leak


Digging in, ~148MB of potentially leaked memory was coming from ClothingRenderProxyImpl::ClothingRenderProxyImpl(), a constructor… so either the constructor was allocating something which wasn’t being freed or the object itself was being leaked. Looking at the parent, getRenderProxy(), didn’t show an obvious leak, but the caller of that, updateRenderProxy, contained something interesting:-

void ClothingActorImpl::updateRenderProxy()
{
   PX_PROFILE_ZONE("ClothingActorImpl::updateRenderProxy", GetInternalApexSDK()->getContextId());
   PX_ASSERT(mGraphicalMeshes[mCurrentGraphicalLodId].renderProxy == NULL);

   // get a new render proxy from the pool
   RenderMeshAssetIntl* renderMeshAsset = mAsset->getGraphicalMesh(mCurrentGraphicalLodId);
   ClothingRenderProxyImpl* renderProxy = mClothingScene->getRenderProxy(renderMeshAsset, mActorDesc->fallbackSkinning, mClothingSimulation != NULL,
      mOverrideMaterials, mActorDesc->morphGraphicalMeshNewPositions.buf,
      &mGraphicalMeshes[mCurrentGraphicalLodId].morphTargetVertexOffsets[0]);

   mGraphicalMeshes[mCurrentGraphicalLodId].renderProxy = renderProxy;
}

Something to note here: if we’re not using the debug version of Apex/PhysX, PX_ASSERT() won’t trigger. As it happens, Unreal Engine 4 is set up not to use debug versions of these by default, so in the case where renderProxy is non-null on entry, which other code within Apex suggests is possible and valid, the assert is never triggered. Looking further along, you’ll see that renderProxy is then overwritten without being released…
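
To illustrate why the problem goes unnoticed, here’s a minimal sketch of a debug-only assert. DEMO_ASSERT, DEMO_DEBUG, Slot and overwriteSlot are all made-up names for this example, and we’re only assuming that PX_ASSERT() behaves like a typical assert macro that compiles to nothing outside debug builds.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical stand-in for a debug-only assert; this is NOT the PhysX macro.
// With DEMO_DEBUG undefined (our "release" configuration) it expands to nothing.
#if defined(DEMO_DEBUG)
#define DEMO_ASSERT(expr)                                          \
    do {                                                           \
        if (!(expr)) {                                             \
            std::fprintf(stderr, "Assert failed: %s\n", #expr);    \
            std::abort();                                          \
        }                                                          \
    } while (0)
#else
#define DEMO_ASSERT(expr) ((void)0)
#endif

// Hypothetical slot holding a non-owning pointer, loosely playing the role of
// the renderProxy member in the listing above.
struct Slot {
    int* proxy = nullptr;
};

// Overwrites the slot and returns whatever was there before. In a "release"
// build the assert is gone, so a non-null previous value is silently dropped
// by any caller that ignores the return value.
int* overwriteSlot(Slot& slot, int* fresh) {
    DEMO_ASSERT(slot.proxy == nullptr);
    int* previous = slot.proxy;
    slot.proxy = fresh;
    return previous;
}
```

Built without DEMO_DEBUG, calling overwriteSlot() twice in a row never complains. The second call simply hands back the pointer that would otherwise have been lost.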

While looking into other uses of renderProxy, we found this:-

void ClothingActorImpl::markRenderProxyReady()
{
   PX_PROFILE_ZONE("ClothingActorImpl::markRenderProxyReady", GetInternalApexSDK()->getContextId());
   mRenderProxyMutex.lock();
   if (mRenderProxyReady != NULL)
   {
      // user didn't request the renderable after fetchResults,
      // let's release it, so it can be reused
      mRenderProxyReady->release();
   }

   ClothingRenderProxyImpl* renderProxy = mGraphicalMeshes[mCurrentGraphicalLodId].renderProxy;
   if (renderProxy != NULL)
   {
      updateBoneBuffer(renderProxy);
      mGraphicalMeshes[mCurrentGraphicalLodId].renderProxy = NULL;
      renderProxy->setBounds(mRenderBounds);
   }
 
   mRenderProxyReady = renderProxy;
   mRenderProxyMutex.unlock();
}

renderProxy is being moved to mRenderProxyReady, and the entry within mGraphicalMeshes is set to NULL (the line just before the setBounds() call). This function would’ve correctly cleared up our leak, since it releases mRenderProxyReady at the top of the function if it’s non-null. But, while the function is called often, it’s not being called in the cases where we’re seeing leakage. The reason why is a little convoluted, but I’ll explain…

Looking up the callstack, we find markRenderProxyReady() is called by syncActorData(), which is called by waitForFetchResults(), which in turn is called by getSimulationPositions(). The interesting thing here is that getSimulationPositions() has an early exit when mClothingSimulation is null, so markRenderProxyReady() is never called in that case:-

const PxVec3* ClothingActorImpl::getSimulationPositions()
{
   if (mClothingSimulation == NULL)
      return NULL;
 
   waitForFetchResults();
 
   return mClothingSimulation->sdkWritebackPosition;
}

Note the comment on the function that calls getSimulationPositions(), USkeletalMeshComponent::ParallelEvaluateCloth():-

void USkeletalMeshComponent::ParallelEvaluateCloth(float DeltaTime, const FClothingActor& ClothingActor, const FClothSimulationContext& ClothSimulationContext)
{
... LOTS OF CODE ...
    
   {
      SCOPE_CYCLE_COUNTER(STAT_ClothSimTime)
      ApexClothingActor->simulate(DeltaTime);
      ApexClothingActor->getSimulationPositions();  //This is a hack that we use to internally call waitForFetchResults.
   }
}

Tellingly, if we set a breakpoint within updateRenderProxy() for the case where we’re going to leak (renderProxy being non-null), we see that mClothingSimulation is null, which is why getSimulationPositions() is dropping out early. Going up one more level, we see that lodTick_LocksPhysX() contains the following lines:-

bool actorCooked = isCookedDataReady();
... MORE CODE ...
if (actorCooked && /*[...]*/)
{
   if (mClothingSimulation == NULL)
   {
      createPhysX_LocksPhysX(simulationDelta);
   }
   ... MORE CODE ...
}
else
{
   ... MORE CODE ...
   removePhysX_LocksPhysX();
}

As the null check around createPhysX_LocksPhysX() hints, that function initializes mClothingSimulation and removePhysX_LocksPhysX() destroys it again. That means, if isCookedDataReady() returns false, mClothingSimulation is null, the render proxy is never consumed by markRenderProxyReady() and, on the next tick, updateRenderProxy() leaks it. Oops.

The Solution


We can fix the leak quite simply by replacing the PX_ASSERT() with a release().. something like this:-

void ClothingActorImpl::updateRenderProxy()
{
   PX_PROFILE_ZONE("ClothingActorImpl::updateRenderProxy", GetInternalApexSDK()->getContextId());

   // Release the old renderProxy if there was one
   if (mGraphicalMeshes[mCurrentGraphicalLodId].renderProxy)
   {
      mGraphicalMeshes[mCurrentGraphicalLodId].renderProxy->release();
   }
 
   // get a new render proxy from the pool
   RenderMeshAssetIntl* renderMeshAsset = mAsset->getGraphicalMesh(mCurrentGraphicalLodId);
   ClothingRenderProxyImpl* renderProxy = mClothingScene->getRenderProxy(renderMeshAsset, mActorDesc->fallbackSkinning, mClothingSimulation != NULL,
      mOverrideMaterials, mActorDesc->morphGraphicalMeshNewPositions.buf, 
      &mGraphicalMeshes[mCurrentGraphicalLodId].morphTargetVertexOffsets[0]);
 
   mGraphicalMeshes[mCurrentGraphicalLodId].renderProxy = renderProxy;
}
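
As a sanity check on the reasoning, the whole chain can be modelled with a toy pool that just counts outstanding handles. Everything here (ToyPool, ToyActor, outstandingAfterTicks) is a hypothetical stand-in rather than Apex code, and the per-tick ordering is our simplification of the real call flow.

```cpp
// Hypothetical pool that only counts handles; 0 means "no proxy".
class ToyPool {
public:
    int acquire() { ++outstanding_; return ++nextId_; }
    void release(int /*handle*/) { --outstanding_; }
    int outstanding() const { return outstanding_; }

private:
    int outstanding_ = 0;
    int nextId_ = 0;
};

// Hypothetical actor mirroring the shape of the two functions discussed above.
class ToyActor {
public:
    ToyActor(ToyPool& pool, bool releaseOldProxy)
        : pool_(pool), releaseOldProxy_(releaseOldProxy) {}

    void tick(bool simulationExists) {
        updateRenderProxy();
        // Mirrors the early exit in getSimulationPositions(): with no
        // simulation, the pending proxy is never handed over.
        if (simulationExists) {
            markRenderProxyReady();
        }
    }

private:
    void updateRenderProxy() {
        // The fix: release the old proxy before overwriting it.
        if (releaseOldProxy_ && pending_ != 0) {
            pool_.release(pending_);
        }
        pending_ = pool_.acquire();
    }

    void markRenderProxyReady() {
        if (ready_ != 0) {
            pool_.release(ready_);  // nobody consumed it, so hand it back
        }
        ready_ = pending_;
        pending_ = 0;
    }

    ToyPool& pool_;
    bool releaseOldProxy_;
    int pending_ = 0;
    int ready_ = 0;
};

// Number of proxies still checked out of the pool after `ticks` updates.
int outstandingAfterTicks(bool releaseOldProxy, int ticks, bool simulationExists) {
    ToyPool pool;
    ToyActor actor(pool, releaseOldProxy);
    for (int i = 0; i < ticks; ++i) {
        actor.tick(simulationExists);
    }
    return pool.outstanding();
}
```

In this model, the unpatched actor holds a single proxy while the simulation exists, but gains one outstanding proxy per tick once the simulation is missing. With the release in place it stays at one in both cases.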

With this implemented, we’re able to run another profile in Malloc Profiler to verify whether or not we’re still seeing any leakage. We were very happy to see that we weren’t:-

 

So, yeah, a very successful bit of analysis – and further evidence of how useful our improved version of Malloc Profiler can be.

Credit(s): Gareth Martin (Coconut Lizard)
Status: Currently unfixed in Apex/PhysX 3.4
