Previously, I explained how to collect information while the CPU is running high, potentially because of custom apps installed on a SharePoint farm.
This post is about analysing the results and finding the root cause of the problem.
Part 1 : Link
In this article, we will:
- Analyze the result using DebugDiag 2 Analysis
- Analyze even more using WinDbg
- Find the root cause and corrective actions
Using DebugDiag 2 Analysis, analyze the results from the dump previously collected
- From the Start menu, run DebugDiag 2 Analysis.
- Click PerfAnalysis, then click the Add Data Files button.
- Select the files generated in part 1 of this blog. Do not select the biggest file (the full dump).
- Disconnect from the internet, otherwise the DebugDiag analysis is really slow (in some cases it can take 2 to 3 days).
- Click Start Analysis.
- Wait about 10 minutes.
- Once the results are available, they open in the browser.
- Save this page as an .mht file for future analysis.
We can see that the top functions called in most of the threads are in mscorlib.dll, an out-of-the-box .NET DLL called by the SharePoint code we have implemented.
This analysis is not really helpful, as it doesn't show which function called the mscorlib functions. Let's dig into the details using WinDbg.
Further analyse the results, using WinDbg – Installation
Since the result from DebugDiag 2 Analysis does not show which function from mscorlib is taking the most CPU, we need another tool to find the root cause.
WinDbg is part of the Windows SDK, which you can download from here.
Once installed, move the .exe to the folder “F:\Installs\WinDbg\resource”.
Below is how to load the information in WinDbg and find the root cause:
- Load the mini dump (from part 1 of this blog).
- Build the SharePoint project, then drop the project .pdb and .dll files in the same folder as before, F:\Installs\WinDbg\resource.
- Find the thread list: !runaway
- Select the first thread: ~165s
- Show the call stack: !clrstack. If the call stack is not shown, run those commands again.
WinDbg shows us which function is being called from this thread.
If we select other threads (using the ~NBRs command, where NBR is the thread number, then showing the call stack as in the previous step), we see this same function.
This is the root cause: our custom C# code calls the taxonomy class from SharePoint.
Finding the root cause
Digging into the SharePoint code (SharePoint 2013 SP1), we can see that it uses a Dictionary to retrieve terms from the term store.
However, the Dictionary class does not support concurrent reads and writes from multiple threads (and a website must support multithreading, as many concurrent users connect to the site).
In the first dump, the thread was doing a lookup in the Dictionary, and in the second dump, the same thread was still doing a lookup in the same Dictionary.
That should be impossible, as the Dictionary has only 3 items, so these threads had clearly entered an endless loop.
Reviewing the code, we found that it modifies and reads the Dictionary object without any lock. A quick look at the FindEntry code in Reflector confirms that this is the cause of the problem.
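
To make the endless loop easier to picture, here is a minimal sketch of a bucket chain walk. It is a simplified, hypothetical model and not the actual mscorlib source: the `Entry` struct, the `maxSteps` guard and the helper names are assumptions made for illustration. It shows how a lookup over only 3 entries can spin forever once an unsynchronized write leaves the chain pointing back at itself.

```csharp
using System;

// Simplified, hypothetical model of a hash-table bucket chain.
// This is NOT the real Dictionary implementation, only an illustration.
public static class ChainWalkDemo
{
    public struct Entry
    {
        public int Key;
        public int Next; // index of the next entry in the chain, -1 ends the chain
    }

    // Walks the chain from 'start' looking for 'key'.
    // Returns the entry index, -1 if the key is absent, or -2 if more than
    // 'maxSteps' hops were needed, which only happens when the chain is cyclic.
    public static int FindEntry(Entry[] entries, int start, int key, int maxSteps)
    {
        int steps = 0;
        for (int i = start; i >= 0; i = entries[i].Next)
        {
            if (entries[i].Key == key) return i;
            // Guard added for this demo only; without it the loop below never ends
            // on a cyclic chain.
            if (++steps > maxSteps) return -2;
        }
        return -1;
    }

    // A healthy 3-entry chain: 0 -> 1 -> 2 -> end.
    public static Entry[] HealthyChain()
    {
        return new[]
        {
            new Entry { Key = 10, Next = 1 },
            new Entry { Key = 20, Next = 2 },
            new Entry { Key = 30, Next = -1 },
        };
    }

    // The same chain after a simulated racing write left entry 2 pointing back at entry 0.
    public static Entry[] CorruptedChain()
    {
        Entry[] entries = HealthyChain();
        entries[2].Next = 0;
        return entries;
    }
}
```

Looking up a missing key in `HealthyChain()` ends after three hops, while the same lookup in `CorruptedChain()` only stops because of the demo guard. A thread stuck in such a loop keeps a CPU core busy, which matches the same thread being found in the same lookup in both dumps.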
Below is the information from MSDN: Dictionary is not thread safe.
A Dictionary<TKey, TValue> can support multiple readers concurrently, as long as the collection is not modified. Even so, enumerating through a collection is intrinsically not a thread-safe procedure. In the rare case where an enumeration contends with write accesses, the collection must be locked during the entire enumeration. To allow the collection to be accessed by multiple threads for reading and writing, you must implement your own synchronization.
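
As an illustration of "implement your own synchronization", the sketch below wraps a Dictionary behind a single lock. The `LockedTermCache` class, its members and the key format are hypothetical names chosen for this example, not SharePoint code.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical cache: every read and write of the Dictionary takes the same lock.
public sealed class LockedTermCache
{
    private readonly object _sync = new object();
    private readonly Dictionary<string, Guid> _terms = new Dictionary<string, Guid>();

    public void Set(string label, Guid id)
    {
        lock (_sync) { _terms[label] = id; }
    }

    public bool TryGet(string label, out Guid id)
    {
        lock (_sync) { return _terms.TryGetValue(label, out id); }
    }

    public int Count
    {
        get { lock (_sync) { return _terms.Count; } }
    }

    // Writes and reads from several threads at once, then returns the entry count.
    public static int RunParallelWrites(int writers, int keysPerWriter)
    {
        var cache = new LockedTermCache();
        Parallel.For(0, writers, w =>
        {
            for (int k = 0; k < keysPerWriter; k++)
            {
                cache.Set("term-" + w + "-" + k, Guid.NewGuid());
                Guid ignored;
                cache.TryGet("term-0-0", out ignored); // concurrent read, also under the lock
            }
        });
        return cache.Count;
    }
}
```

Because readers take the lock too, no lookup can ever observe the Dictionary halfway through a write. The trade-off is that all access is serialized, which can become a bottleneck on a busy site.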
Looking at the code from the latest SharePoint CU (January 2017 at the time of writing) shows that Microsoft has fixed this issue with fetching terms in the managed metadata service by using a ConcurrentDictionary.
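
For comparison, here is a minimal sketch of the same kind of cache built on ConcurrentDictionary<TKey, TValue>. It is not the code from the cumulative update: `ConcurrentTermCacheDemo` and `LoadTerm` are hypothetical stand-ins used only to show the pattern.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class ConcurrentTermCacheDemo
{
    // Hypothetical stand-in for an expensive term store lookup.
    private static string LoadTerm(int id)
    {
        return "term-" + id;
    }

    // Several workers request the same ids concurrently; returns the cache size.
    public static int Run(int workers, int distinctIds)
    {
        var cache = new ConcurrentDictionary<int, string>();
        Parallel.For(0, workers, w =>
        {
            for (int id = 0; id < distinctIds; id++)
            {
                // GetOrAdd is safe to call from many threads without an explicit lock.
                cache.GetOrAdd(id, key => LoadTerm(key));
            }
        });
        return cache.Count;
    }
}
```

One detail to keep in mind: under contention, the value factory passed to `GetOrAdd` may run more than once for the same key, although only one result is stored. The factory should therefore be free of side effects.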
Using DebugDiag, we collected information while the CPU was peaking at 100%, and using DebugDiag Analysis and WinDbg we identified the function calling the out-of-the-box code (mscorlib).
Looking into the SharePoint code showed the root cause: a Dictionary used without any locking mechanism to support multithreading.
A cumulative update fixes this issue by using ConcurrentDictionary<TKey, TValue> instead of Dictionary, because it is designed for multithreaded access.
Have you also fixed issues using these tools? Please share your findings.