Use hprof & gprof2dot to visualize the runtime of your Java application
Over the past years of my working experiences, I have learned a lot of interesting techniques from my colleagues. Today I like to share one of them regarding how to visually do performance profiling in Java, utilizing Java’s standard hprof and an open-source application called gprof2dot.
Such visualization will give you a graphical representation of how the functions inside your code are called. Utilizing this graph, you can understand which part of your program takes the longest runtime and optimize it accordingly.
Installing
gprof2dot lies at https://github.com/jrfonseca/gprof2dot. You may follow the install instruction there to install it and its corresponding dependencies.
Just one side note regarding installing from pip for Python newbies: if you installed it in sudo mode or installed it on a mac, this will result in /usr/local/bin/gprof2dot and the program should be accessible in your current folder. If not, it should either be in your virtualenv executable folders or under ~/.local/bin/gprof2dot. Under both circumstances, you may need to set up the $PATH variable manually if you have not already done so.
Running
First, run your target Java program in this way to generate the hprof file:
java -agentlib:hprof=cpu=samples,interval=20,depth=20 -jar target.jar
This will generate a text file named java.hprof.txt
Then run the following:
gprof2dot -f hprof java.hprof.txt | dot -Tpng -o output.png
This will generate the output.png, which is the final visualization file.
Example
Following is a sample code I have which we can use to test the method in this post.
[SPOILER ALERT] Contains the solution for Leetcode 1466, do not read it if you want to attempt it yourself some time in the future!
The above code generate a runtime graph that looks like this:
(For full resolution, please visit: https://www.flickr.com/photos/191200776@N03/50651336101/in/dateposted-public/)
Here’s how I read this graph & how I can use this graph to improve my performance to the solution of Leetcode 1466:
- Two trees to the right, each accounted for 1.37% of the runtime, is related to a separate thread regarding my usage of Java’s lambda function, since they don’t take too much time, I will just ignore them.
- To the left, you got the main, which is the entry point to the main thread.
- The minReorder called by main is the core part of my solution which takes most of the time, 94.52% of the time. This makes sense since that’s the question asked by Leetcode with the other function being just used to create a sample data. Let’s break it down:
- 21.92% of the total runtime is being used for basic operation within minReorder that does not involves calling other functions yet. This includes maintaining the for loop, collecting the response from one function and prepare to pass it to the other,
- 46.58% of the total runtime is being used by my dfs function, which also makes sense since that’s the primary approach to Leetcode 1466.
- Going down the tree from my dfs functions, different streaming associated functions have taken different runtime to maintain the iteration I have inside my function, until you reach the core part of my iteration, which is the lambda$dfs function I have that I used to iterate over nextNodes. This takes 31.51% of the total runtime.
- Then I realize a significant portion of lambda$dfs’s child — mathematically would be 6.85% / 31.51% = 21.74% — of the time, it is calling buildHashableConnection to the left, a function I wrote in my code. This function is also called by minReorder all the way from the above. In total, this function takes 16.44% of the time. This is one place where I can identify a location to optimize my performance by looking at the graph — why do I have to spend 16.44% of my time just constructing a String out of Integer that there is no real point to use? Why can’t I just hash Pair<Integer, Integer>? Oh yeah, I forget, I wanted to test my code on Leetcode Platform, which it does not support outside libraries :( so happy hacking that is.
- Then I can see that lambda$dfs + its child is also spending a lot of time trying to add items to HashSet. In together with another call from minReorder, all HashSet.add operation takes 35.62% of the total runtime with 10.96% / 35.62% = roughly 1/3 being trying to hash the key and another 21/02% / 35.62% = 2/3 being trying to put the value actually into the set, and a very small portion of it being as part of the basic operations of the HashMap.put function. This is another place where I can identify a location to optimize my performance by looking at the graph — can some of the Java Set be turns to a List? Maybe some could, for instance, the one from adjacencyList, which the heck is even named after a List yet I used a Set. The Set add operation associated with the adjacencyList takes 13.70% of the total runtime. Here, the input format already guarantees no duplicates, so a set is not needed. The reason why I still did a Set is because I just don’t want to argue with that since it no longer affects Big(O), and usually with a good Big(O) my solution is already good enough to be accepted by Leetcode, so I was happy with a Set when I first tried to implement it.
Some of my personal advices regarding to how to use this tools
Make sure you have an efficient algorithm first
As you can see from the example above, had there been an issue in my algorithm — let’s say that instead of actually doing Depth-first Search in the dfs function I did some strange brute force algorithm that has some weird O(n³) runtime complexity, then this graph won’t be helpful as the best it could do is proportionally adding more runtime to the dfs and its children — which there ain’t that much to add since this is already the entry point to the bulk of our workload. So you will result in a similar graph, and you can conduct a similar analysis, except your analysis would be wrong, as the primary place where you could improve your runtime performance is to improve your algorithm, not trying to call function that you don’t need to call.
Test the performance of API by writing an executor
There are situations where you may want to test the performance of, let’s say:
curl -X GET your_server_host:your_server_port/endpoint
Doing that through my approach is possible but has some concerns:
- You will get a lot of garbage related to how to start a service which may not be easily distinguishable from the API call you would like to fire,
- Imagine that once this API finishes — and do note that some API could be Async by its nature— you have to quickly kill the jar to avoid some other garbage being dumped into your graph.
My recommended way to solve this problem is to write an executor — basically, a standalone class that has the main function, that at best loads your dependency injection libraries and then ignore all the server setup stuff but just call the core logic of your API afterward. Then, you can call the jar that includes both the API you want to test and the executor you just wrote in this way:
java -agentlib:hprof=cpu=samples,interval=20,depth=20 -cp target.jar package.path.to.your.Executor
This will generate the hprof file that includes minimal extra information and then you can analyze that from gprof2dot easily.
Apache Spark
Do note that the above approach does not work with Apache Spark directly, since Spark jobs are mostly running on the Spark worker, which is a separate JVM from your main JVM. That being said, this is not the end of the universe:
- Spark worker nodes are, by their nature, still JVMs. So you can add the hprof arguments to your Spark worker nodes, get the hprof file, and then generate a visualization on top of it.
- If you are utilizing some of the more unified distributed processing libraries to call Spark for you e.g. Apache Beam, these libraries may support running in standalone mode besides running in Spark mode, where running in the standalone mode usually means running in the same JVM as the one your main Application is in. Doing performance analysis in Standalone mode may already be good enough to reveal enough information regarding where can you improve your performance.
What if the final graph is only garbage of epollWait?
Well, that’s my daily life:
When this happens, one thing to consider is to lower gprof2dot’s default min sample threshold, all the way down to 0:
gprof2dot -n0 -e0 -f hprof java.hprof.txt | dot -Tpng -o output.png
This is because gprof2dot by default ignores “nodes and edges with little or no impact in the total computation time”. By lowering the threshold, you will get them back.
Do be noted that to make this effective, it is highly recommended that you have a lower interval and higher depth when doing java hprof so that in CPU sampling a sample is more likely to hit a class of your interests.
If we consider all the classes that appear in the original java.hprof.txt file, it will be easy for us to identify what are some of the other calls going around at that time. These calls, while not consuming much of the total computation time, might be responsible for creating the thread that is responsible for most of the total computation time, and will give you a good understanding of what type of operations is your program is trying to achieve in a while it was doing the waiting.
In my example, I was able to use the 0 threshold graph — unfortunately I can not show it in full since it includes some other information but I can explain verbally —to do my analysis. In this graph, there is a tiny link leaving io.netty.channel.nio.NioEventLoop.run in the middle of the original graph going down towards something inside org.elasticsearch. This shows that the epollWait comes from Elasticsearch API calls.
After identifying what causes epollWait, the next step is to understand why this happens. This usually happens in a scenario where your code is making an external API call. And this whatever API call you are making to the outside — in my case, a call to Elasticsearch — is spending too much time waiting for the response to get back instead of actually processing the data it returns. Here you may have a lot of options you can try to pursue in order to reduce the percentage of runtime being spend on useless waiting, which are not the topic of this publication so I will just stop here.