
'Unable to find job' - Node stays active #4280

JShelton ·

We're experiencing the 'Unable to find job' error (which we typically expect to see when a node disconnects for some reason), but this time it occurs on a node that remains active after the error and continues to report it consistently.

Unable to find job '3d141bfa-d53b-4cbc-8d03-61cab6776f78' on node '<hostname>'.

In reviewing the logs on the affected node, I noticed that "Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded" appeared whenever a build attempted to run on the node and hit the error, despite the host being identical in specs and usage to several other Linux hosts that are not experiencing the issue. I raised the wrapper.java.maxmemory value in wrapper.conf and restarted the agent process. Builds processed on the node just fine for several hours, but eventually succumbed to the same error as before. As seen previously, the node remains active, and any build targeting this node directly, or a resource that contains it, will fail.
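For reference, a minimal sketch of the kind of wrapper.conf change described above (wrapper.java.maxmemory takes the JVM's maximum heap in MB; 2048 is an illustrative value, not the actual figure used):

# wrapper.conf -- raise the agent JVM's maximum heap (value in MB, illustrative)
wrapper.java.maxmemory=2048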

Any ideas on what could be going on here? To keep the error reliably reproducible, I have left the node active but removed it from all resources on the server. I set up a small test configuration that calls the node directly (to run a simple "aws s3 ls" shell command), and it returns the 'Unable to find job' error 100% of the time (before this error appears, the build log shows the true out-of-memory error, which is then replaced by 'Unable to find job' when the build finishes). Is it possible to catch these types of errors in QuickBuild, either via a plugin or otherwise? The node does not seem to run anything at all after encountering this error.

robinshen ADMIN ·

In case of an OutOfMemoryError, it is almost useless to try to recover. We need to find the root cause of the error in order to avoid it. While keeping the problematic node isolated, please take a memory dump of the node when it experiences this error running your test builds:

/path/to/jdk/bin/jmap -dump:format=b,file=/path/to/dump <pid>
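If you are not sure of the agent's process ID, jps (bundled with the JDK) lists the running JVMs along with their main class; a minimal sketch, assuming the agent runs under the same JDK:

/path/to/jdk/bin/jps -l

Pass the PID of the agent JVM as the last argument to the jmap command above.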

Then you may analyze the result to see what the largest objects are, using a profiling tool such as YourKit, or you may send it to me for diagnostics (robin AT pmease DOT com).