We're experiencing the 'Unable to find job' error (the one we typically expect to see when a node disconnects for some reason), but this time it is occurring on a node that remains active after the error and continues to produce the same error consistently.
Unable to find job '3d141bfa-d53b-4cbc-8d03-61cab6776f78' on node '<hostname>'.
In reviewing the logs on the connected node, I noticed "Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded" whenever a build attempted to run on this node and hit the error. This is despite the host being identical in specs and usage to several other Linux hosts that are not experiencing the issue. I increased the heap available to the agent via the wrapper.java.maxmemory property in wrapper.conf and restarted the agent process. Builds processed on the node just fine for several hours, but eventually succumbed to the same error as before. As before, the node remains active, and any builds that call specifically on this node (or on a resource that resolves to it) fail.
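For reference, the change I made looked roughly like the following (the value shown is illustrative, not the exact number I used; if I'm reading the Java Service Wrapper docs correctly, wrapper.java.maxmemory is specified in MB and maps to the JVM's -Xmx setting):

```
# conf/wrapper.conf on the agent
# Illustrative value only: raises the agent JVM's maximum heap to 2 GB
wrapper.java.maxmemory=2048
```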
Any ideas on what could be going on here? To keep the error reliably reproducible, I have left the node active but removed it from all resources on the server. I set up a small test configuration that calls the node directly (to run a simple "aws s3 ls" shell command), and it returns the 'Unable to find job' error 100% of the time. While the build is running, the build log shows the true out-of-memory error, which is then replaced with 'Unable to find job' when it finishes. Is it possible to catch these types of errors in QuickBuild, either via a plugin or otherwise? The node does not seem to run anything at all after encountering this error.