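As covered in "GPU Skinning, Part 1: The Principles of Skeletal Animation", once the math behind skeletal animation is understood, it is clear that the core cost lies in the skinning computation. Unity3D's skeletal animation component, SkinnedMeshRenderer, performs skinning on the CPU (mainly to support animation state machine features such as animation blending and IK). When many skinned models are on screen at once, the CPU workload rises significantly with the number of bones and vertices, and mobile devices show noticeable stuttering and heating. This is one of the reasons many games cap the number of characters on screen. To address this, Unity provides GPU Skinning as an optimization. The idea is simple: if the bottleneck is the CPU, we can relieve it by moving part of the computation to the GPU.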
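
To make the CPU cost concrete, here is a minimal sketch of linear blend skinning on the CPU. The types `Vec3`, `Mat3x4`, `BoneWeights` and the function `SkinOnCPU` are hypothetical names chosen for illustration and are not Unity's actual data layout; the sketch assumes up to four bone influences per vertex.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal types for illustration; not Unity's data layout.
struct Vec3 { float x, y, z; };

// Row-major 3x4 affine matrix; the last column holds the translation.
struct Mat3x4 { float m[3][4]; };

struct BoneWeights {
    std::array<int, 4> index;     // indices into the skin-matrix array
    std::array<float, 4> weight;  // blend weights, expected to sum to 1
};

// Apply an affine transform to a point.
Vec3 TransformPoint(const Mat3x4& a, const Vec3& p) {
    return Vec3{
        a.m[0][0] * p.x + a.m[0][1] * p.y + a.m[0][2] * p.z + a.m[0][3],
        a.m[1][0] * p.x + a.m[1][1] * p.y + a.m[1][2] * p.z + a.m[1][3],
        a.m[2][0] * p.x + a.m[2][1] * p.y + a.m[2][2] * p.z + a.m[2][3]};
}

// Linear blend skinning: each output vertex is the weighted sum of the
// bind-pose vertex transformed by every bone that influences it.
std::vector<Vec3> SkinOnCPU(const std::vector<Vec3>& bindVerts,
                            const std::vector<BoneWeights>& influences,
                            const std::vector<Mat3x4>& skinMatrices) {
    assert(bindVerts.size() == influences.size());
    std::vector<Vec3> out(bindVerts.size());
    for (std::size_t v = 0; v < bindVerts.size(); ++v) {
        Vec3 acc{0.0f, 0.0f, 0.0f};
        for (std::size_t k = 0; k < 4; ++k) {
            const float w = influences[v].weight[k];
            if (w == 0.0f) continue;  // skip unused influence slots
            const std::size_t boneIndex =
                static_cast<std::size_t>(influences[v].index[k]);
            const Vec3 p = TransformPoint(skinMatrices[boneIndex], bindVerts[v]);
            acc.x += w * p.x;
            acc.y += w * p.y;
            acc.z += w * p.z;
        }
        out[v] = acc;
    }
    return out;
}
```

The inner loop runs once per vertex per influence, every frame, for every character, which is why the cost scales with both vertex count and the number of characters on screen.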

Unity's Built-in GPU Skinning

We use the GPU Skinning source code from Unity 4.6 for this analysis.

void SkinnedMeshRenderer::AwakeFromLoad(AwakeFromLoadMode awakeMode)
{
	if (!m_MemExportInfo)
		m_MemExportInfo = CreateGPUSkinningIfAvailable();
    ...
}

GPUSkinningInfo * GFX_GL_IMPL::CreateGPUSkinningInfo()
{
	if (gGraphicsCaps.gles30.useTFSkinning)
		return UNITY_NEW(TransformFeedbackSkinningInfo(), kMemGfxDevice);
	else
		return 0;
}

TransformFeedbackSkinningInfo() : GPUSkinningInfo(),
		m_SourceVBO(0), m_SourceVBOSize(0), m_BoneCount(0), m_VAO(0), m_TF(0), m_Shader(0)
	{}

class GPUSkinningInfo
{
protected: 
	//! Number of vertices in the skin
	UInt32	m_VertexCount;
	//! Channel map for the VBO
	UInt32	m_ChannelMap;
	//! Destination VBO stride
	int		m_Stride;

	//! Destination VBO
	VBO		*m_DestVBO;

	//! Bones per vertex, must be 1, 2 or 4
	UInt32 m_BonesPerVertex;
    ...
};

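First, SkinnedMeshRenderer creates the GPU skinning info when it is awakened after loading. Because this path depends on the Transform Feedback API, in the GL implementation it only takes effect on gles30 (OpenGL ES 3.0), and only when gGraphicsCaps.gles30.useTFSkinning is set; otherwise CreateGPUSkinningInfo returns 0.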

void SkinnedMeshRenderer::Render (int subsetIndex, const ChannelAssigns& channels)
{
	...
    bool success = SkinMeshImmediate(requiredChannels);
    ...
}

bool SkinnedMeshRenderer::SkinMeshImmediate( UInt32 requiredChannels )
{
	GfxDevice& device = GetGfxDevice();
	// Double check there are no fences inserted during skinning
	UInt32 expectedFence = device.GetNextCPUFence();
	device.BeginSkinning(1);
	SkinMeshInfo skin;
	int flags = SF_AllowMemExport;
	bool success = PrepareSkin(requiredChannels, flags, skin);
	if (success)
	{
		SkinMesh(skin, true, expectedFence, flags);
#if UNITY_EDITOR
		UpdateClothDataForEditing(skin);
#endif
	}
	device.EndSkinning();
	// Insert fence after all skinning is complete
	UInt32 fence = device.InsertCPUFence();

	return success;
}

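SkinnedMeshRenderer starts the skinning work at render time. One small detail is worth noting: Unity uses fences to synchronize the CPU and the GPU. The CPU records the expected fence value before skinning so it can confirm that no fence was inserted while skinning was in progress, and it inserts a fence once the skinning commands have been submitted. It then waits until the GPU has executed all of those commands before clearing and reusing the CPU-side command data, which keeps the CPU and GPU running in parallel safely.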
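
The fence pattern described above can be illustrated with a toy model. `ToyDevice`, `SignalCompleted`, `IsFencePassed` and `SkinWithFence` are hypothetical names invented for this sketch; only `GetNextCPUFence` and `InsertCPUFence` mirror names from the excerpt, and their real semantics inside Unity are not shown there, so treat the behavior below as an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical toy device, not Unity's GfxDevice: fences are modeled as
// monotonically increasing IDs placed in the command stream.
class ToyDevice {
public:
    // ID that the next inserted fence will receive.
    std::uint32_t GetNextCPUFence() const { return nextFence_; }

    // Place a fence after everything submitted so far and return its ID.
    std::uint32_t InsertCPUFence() { return nextFence_++; }

    // Simulate the consumer finishing all work up to and including `fence`.
    void SignalCompleted(std::uint32_t fence) {
        if (fence + 1 > completedCount_) completedCount_ = fence + 1;
    }

    // True once the work guarded by `fence` has finished, so data written
    // before the fence can be reused safely.
    bool IsFencePassed(std::uint32_t fence) const {
        return fence < completedCount_;
    }

private:
    std::uint32_t nextFence_ = 0;
    std::uint32_t completedCount_ = 0;
};

// Follows the shape of SkinMeshImmediate: remember the expected fence,
// record the skinning work, verify no fence slipped in, then insert one.
std::uint32_t SkinWithFence(ToyDevice& device) {
    const std::uint32_t expectedFence = device.GetNextCPUFence();
    // ... skinning commands would be recorded here ...
    assert(device.GetNextCPUFence() == expectedFence);
    return device.InsertCPUFence();
}
```

A caller would keep the returned fence ID and check `IsFencePassed` before overwriting the buffers that the skinning commands read from.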

bool SkinnedMeshRenderer::PrepareSkinGPU( UInt32 requiredChannels, int flags, SkinMeshInfo& skin, CalculateSkinMatricesTask* calcSkinMatricesTask )
{
	if (!PrepareSkinCommon( requiredChannels, flags, skin, calcSkinMatricesTask ))
		return false;
	...
}

bool SkinnedMeshRenderer::PrepareSkinCommon(UInt32 requiredChannels, int flags, SkinMeshInfo& skin, CalculateSkinMatricesTask* calcSkinMatricesTask)
{
	...

	if (hasSkin)
	{
		skin.bonesPerVertex = GetBonesPerVertexCount();
		skin.compactSkin = m_CachedMesh->GetSkinInfluence(skin.bonesPerVertex);

		Matrix4x4f rootPose;
		if (!(flags & SF_ClothPlaying))
			rootPose = GetActualRootBone().GetWorldToLocalMatrixNoScale ();
		else
			// clothed skins are simulated using world space rotation, so rotating the character will affect the cloth simulation.
			// translation is applied using forces in the cloth, which is smoother.
			rootPose.SetTranslate (-GetActualRootBone().GetPosition());

		...

		if (!canCalcSkinMatricesInMT)
		{
			// slow code path
			if (!CalculateSkinningMatrices(rootPose, skin.cachedPose, bindposeCount))
				return false;
		}
	}

	...
}

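In PrepareSkinCommon, we can see that Unity computes the bone transforms from the animation data it has read: it builds a root pose from the root bone (the world-to-local matrix, or just the negated translation when cloth is playing) and then calculates the skinning matrices.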
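
The body of CalculateSkinningMatrices is not shown in the excerpt. As a sketch of the usual formulation, assuming each skinning matrix is the product of the root pose, the bone's current local-to-world matrix, and the bone's bind pose, it could look like the following. `Mat4`, `Translation`, `Multiply` and `CalculateSkinMatricesSketch` are hypothetical names for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major 4x4 matrix using the column-vector convention: p' = M * p.
struct Mat4 { float m[4][4]; };

Mat4 Identity() {
    Mat4 r{};  // zero-initialized
    for (int i = 0; i < 4; ++i) r.m[i][i] = 1.0f;
    return r;
}

Mat4 Translation(float x, float y, float z) {
    Mat4 r = Identity();
    r.m[0][3] = x;
    r.m[1][3] = y;
    r.m[2][3] = z;
    return r;
}

Mat4 Multiply(const Mat4& a, const Mat4& b) {
    Mat4 r{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k) sum += a.m[i][k] * b.m[k][j];
            r.m[i][j] = sum;
        }
    return r;
}

// Assumed formulation: skin[i] = rootPose * boneLocalToWorld[i] * bindPose[i].
// bindPose[i] takes a mesh-space vertex into bone i's bind-time local space,
// boneLocalToWorld[i] applies the bone's animated pose, and rootPose brings
// the result back into the root bone's local space.
std::vector<Mat4> CalculateSkinMatricesSketch(
        const Mat4& rootPose,
        const std::vector<Mat4>& boneLocalToWorld,
        const std::vector<Mat4>& bindPoses) {
    assert(boneLocalToWorld.size() == bindPoses.size());
    std::vector<Mat4> skin(bindPoses.size());
    for (std::size_t i = 0; i < bindPoses.size(); ++i)
        skin[i] = Multiply(rootPose, Multiply(boneLocalToWorld[i], bindPoses[i]));
    return skin;
}
```

For example, with the root at (10, 0, 0), a bone bound at height 1 and animated up to height 3, the resulting skinning matrix is a pure translation of 2 along Y in root space.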

bool SkinnedMeshRenderer::SkinMesh( SkinMeshInfo& skin, bool lastMemExportThisFrame, UInt32 cpuFence, int flags )
{
    ...
	device.SkinOnGPU(m_MemExportInfo, lastMemExportThisFrame);
	device.GetFrameStats().AddDrawCall (skin.vertexCount, skin.vertexCount);
    ...
}

// All actual functionality is performed in TransformFeedbackSkinningInfo, just forward the calls
void GFX_GL_IMPL::SkinOnGPU( GPUSkinningInfo * info, bool lastThisFrame )
{
	reinterpret_cast<TransformFeedbackSkinningInfo *>(info)->SkinMesh(lastThisFrame);
}

void TransformFeedbackSkinningInfo::SkinMesh(bool last)
{
	if (!m_TF)
	{
		GLES_CALL(glGenTransformFeedbacks, 1, &m_TF);
	}
    ...
}

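SkinOnGPU passes the array of bone matrices to a shader that performs the vertex blending on the GPU. Note that OpenGL ES 3.0 has no geometry shader stage: with Transform Feedback, the blending runs in the vertex shader and its output is captured into the destination vertex buffer (m_DestVBO). After that, only one more draw submission is needed to render the mesh. If the model's material has multiple passes, the skinned vertices are reused, which saves recomputing the vertex blending for each pass.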
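
The multi-pass saving can be shown with a toy CPU emulation that only counts how often the blend runs. This is not GL code: `Vertex`, `BlendStandIn`, `BlendsWithPerPassSkinning` and `BlendsWithPreSkinnedBuffer` are hypothetical names, and the blend itself is replaced by a fixed offset since only the invocation count matters here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vertex { float x, y, z; };

// Stand-in for the per-vertex bone blend; here just a fixed offset.
Vertex BlendStandIn(const Vertex& v) { return Vertex{v.x + 1.0f, v.y, v.z}; }

// Skinning inside every pass: the blend runs once per vertex per pass.
std::size_t BlendsWithPerPassSkinning(const std::vector<Vertex>& src,
                                      std::size_t passCount) {
    std::size_t blends = 0;
    for (std::size_t pass = 0; pass < passCount; ++pass) {
        for (std::size_t i = 0; i < src.size(); ++i) {
            const Vertex skinned = BlendStandIn(src[i]);
            (void)skinned;  // a pass would rasterize this vertex here
            ++blends;
        }
    }
    return blends;
}

// Skin once into destVBO (the role Transform Feedback plays), then let
// every pass read the stored result without blending again.
std::size_t BlendsWithPreSkinnedBuffer(const std::vector<Vertex>& src,
                                       std::size_t passCount,
                                       std::vector<Vertex>& destVBO) {
    std::size_t blends = 0;
    destVBO.clear();
    for (std::size_t i = 0; i < src.size(); ++i) {
        destVBO.push_back(BlendStandIn(src[i]));
        ++blends;
    }
    assert(destVBO.size() == src.size());
    for (std::size_t pass = 0; pass < passCount; ++pass) {
        for (std::size_t i = 0; i < destVBO.size(); ++i) {
            (void)destVBO[i];  // a pass would draw from destVBO here
        }
    }
    return blends;
}
```

With 100 vertices and 3 passes, the first function performs 300 blends and the second performs 100, at the cost of the extra memory for the destination buffer.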