Our QuickBuild (10.0.30) server freezes at seemingly random times about once per month (at least once, occasionally twice), and has done so consistently since earlier this year. Only one of our several QB server instances is affected; another instance, which was initially set up from the same configuration and database as the freezing server, does not show the problem. When the "freeze" occurs, the web browser shows this error:
The proxy server received an invalid response from an upstream server.
The proxy server could not handle the request GET /.
Reason: Error reading from remote server
While the server is in this state and the above error is shown, QB does not respond to script commands (restart or stop), and the running java process has to be killed with "kill -9". After that, the wrapper script, which is still running, restarts QB on its own.
I've collected logs after each occurrence over the past few months to try to determine what is going on. A common pattern is a burst of messages removing each build agent, "Active build agent 'agent hostname:port' timed out, removing...", followed by "Job entry no longer exists at task node 'agent hostname:port', will cancel running job..." (always for a different build agent). The time of day varies from incident to incident (it has occurred in the middle of the day once). These messages are followed by repeating log entries about thread interruption, with the same message repeated for each build agent that was connected:
jvm 1 | 2021-09-07 06:34:05,033 WARN com.mchange.v2.resourcepool.BasicResourcePool@22715a21 -- an attempt to checkout a resource was interrupted, and the pool is still live: some other thread must have either interrupted the Thread attempting checkout!
jvm 1 | java.lang.InterruptedException
jvm 1 | at java.lang.Object.wait(Native Method)
jvm 1 | at com.mchange.v2.resourcepool.BasicResourcePool.awaitAvailable(BasicResourcePool.java:1414)
jvm 1 | at com.mchange.v2.resourcepool.BasicResourcePool.prelimCheckoutResource(BasicResourcePool.java:606)
jvm 1 | at com.mchange.v2.resourcepool.BasicResourcePool.checkoutResource(BasicResourcePool.java:526)
jvm 1 | at com.mchange.v2.c3p0.impl.C3P0PooledConnectionPool.checkoutAndMarkConnectionInUse(C3P0PooledConnectionPool.java:755)
jvm 1 | at com.mchange.v2.c3p0.impl.C3P0PooledConnectionPool.checkoutPooledConnection(C3P0PooledConnectionPool.java:682)
jvm 1 | at com.mchange.v2.c3p0.impl.AbstractPoolBackedDataSource.getConnection(AbstractPoolBackedDataSource.java:140)
jvm 1 | at org.hibernate.c3p0.internal.C3P0ConnectionProvider.getConnection(C3P0ConnectionProvider.java:90)
jvm 1 | at org.hibernate.internal.AbstractSessionImpl$NonContextualJdbcConnectionAccess.obtainConnection(AbstractSessionImpl.java:380)
jvm 1 | at org.hibernate.engine.jdbc.internal.LogicalConnectionImpl.obtainConnection(LogicalConnectionImpl.java:228)
jvm 1 | at org.hibernate.engine.jdbc.internal.LogicalConnectionImpl.getConnection(LogicalConnectionImpl.java:171)
jvm 1 | at org.hibernate.engine.transaction.internal.jdbc.JdbcTransaction.doBegin(JdbcTransaction.java:67)
jvm 1 | at org.hibernate.engine.transaction.spi.AbstractTransactionImpl.begin(AbstractTransactionImpl.java:162)
jvm 1 | at org.hibernate.internal.SessionImpl.beginTransaction(SessionImpl.java:1471)
jvm 1 | at com.pmease.quickbuild.entitymanager.impl.DefaultMeasurementDataManager.save(DefaultMeasurementDataManager.java:94)
jvm 1 | at com.pmease.quickbuild.plugin.measurement.core.reporter.MeasurementServerReporter.save(MeasurementServerReporter.java:105)
jvm 1 | at com.pmease.quickbuild.plugin.measurement.core.reporter.MeasurementServerReporter.access$100(MeasurementServerReporter.java:34)
jvm 1 | at com.pmease.quickbuild.plugin.measurement.core.reporter.MeasurementServerReporter$NodeMetricsSender.execute(MeasurementServerReporter.java:93)
jvm 1 | at com.pmease.quickbuild.grid.NodeJobExecuteJob.execute(NodeJobExecuteJob.java:25)
jvm 1 | at com.pmease.quickbuild.grid.GridJob.run(GridJob.java:131)
jvm 1 | at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
jvm 1 | at java.util.concurrent.FutureTask.run(FutureTask.java:266)
jvm 1 | at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
jvm 1 | at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
jvm 1 | at java.lang.Thread.run(Thread.java:748)
jvm 1 | 2021-09-07 06:34:05,034 WARN SQL Error: 0, SQLState: null
jvm 1 | 2021-09-07 06:34:05,034 ERROR An SQLException was provoked by the following failure: java.lang.InterruptedException
jvm 1 | 2021-09-07 06:34:06,218 WARN Job entry no longer exists at task node 'agent hostname:port', will cancel running job...
jvm 1 | 2021-09-07 06:34:06,218 WARN com.mchange.v2.resourcepool.BasicResourcePool@22715a21 -- an attempt to checkout a resource was interrupted, and the pool is still live: some other thread must have either interrupted the Thread attempting checkout!
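To compare incidents, the collected logs could be reduced to a per-minute timeline of the three warnings quoted above. The following is only a rough sketch: the function name `summarize`, the event labels, and the log path passed on the command line are made up for illustration, and it assumes every relevant line carries a timestamp in the same `YYYY-MM-DD HH:MM:SS,mmm` form as the excerpt.

```python
import re
import sys
from collections import Counter

# Message fragments copied from the log lines quoted in this post.
# The agent address is whatever appears between the single quotes.
TIMEOUT_RE = re.compile(r"Active build agent '([^']+)' timed out, removing")
CANCEL_RE = re.compile(r"Job entry no longer exists at task node '([^']+)'")
INTERRUPT_RE = re.compile(r"an attempt to checkout a resource was interrupted")
# Captures the timestamp down to the minute, e.g. "2021-09-07 06:34".
TIMESTAMP_RE = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2},\d{3}")


def summarize(lines):
    """Count each warning type per minute, and agent timeouts per agent."""
    per_minute = Counter()
    timeouts_per_agent = Counter()
    for line in lines:
        ts = TIMESTAMP_RE.search(line)
        if ts is None:
            continue  # stack-trace lines and other untimestamped output
        minute = ts.group(1)
        timeout = TIMEOUT_RE.search(line)
        if timeout:
            per_minute[(minute, "agent_timeout")] += 1
            timeouts_per_agent[timeout.group(1)] += 1
        elif CANCEL_RE.search(line):
            per_minute[(minute, "job_cancelled")] += 1
        elif INTERRUPT_RE.search(line):
            per_minute[(minute, "pool_interrupted")] += 1
    return per_minute, timeouts_per_agent


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python summarize_log.py /path/to/collected.log
    with open(sys.argv[1], errors="replace") as fh:
        per_minute, timeouts_per_agent = summarize(fh)
    for (minute, kind), count in sorted(per_minute.items()):
        print(minute, kind, count)
    for agent, count in timeouts_per_agent.most_common():
        print("timeouts", agent, count)
```

Running this over the log from each incident would show whether the agent timeouts always come before the connection pool interruptions, and whether the same agents time out first each time.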
Server: CentOS 7
Build agents: Both Linux and Windows
Database: MySQL (database is on another server)
This is a strange issue. Again, we have a second running server set up the same way as this one, with a mostly identical configuration, and it has never experienced this problem. Any ideas on how we could narrow down the cause so that we can fix it?