Unexpected end of file from server part 2 #1499

puneetg ·
Hi Robin

Thanks for helping with this issue. We have done fine-tuning in many areas over the last several weeks. Additionally, we saw that the nightly backups were another cause of QB server slowdown (they are now scheduled over the weekend). The build storage has been moved to a NetApp shelf, and that helped quite a bit.

Though the error has decreased substantially, we still see issues with a few build nodes that were installed in the secondary data center. It is possible that the terminated line between the data centers is intermittently overloaded, causing those nodes to lose their connection to the QB server in the primary data center.

Is it possible to fine-tune the timeouts of the Hessian APIs so that QB can handle network latency? Also, what is the timeout at this time?

Thanks
-Puneet

01:18:03,763 [master>checkout.code.cws>checkout.tools.and.setup@bammatrix02.efi.internal:8811] ERROR - Step 'master>checkout.code.cws>checkout.tools.and.setup' is failed.
java.lang.RuntimeException: Error executing grid job 'master>checkout.code.cws>checkout.tools.and.setup>setup.tools'
at com.pmease.quickbuild.grid.GridTaskFuture.get(GridTaskFuture.java:63)
at com.pmease.quickbuild.grid.GridTaskFuture.get(GridTaskFuture.java:85)
at com.pmease.quickbuild.stepsupport.SequentialStep.triggerChildren(SequentialStep.java:46)
at com.pmease.quickbuild.stepsupport.CompositeStep.run(CompositeStep.java:95)
at com.pmease.quickbuild.stepsupport.Step.execute(Step.java:442)
at com.pmease.quickbuild.stepsupport.StepJob.execute(StepJob.java:42)
at com.pmease.quickbuild.grid.GridJob.run(GridJob.java:122)
at java.lang.Thread.run(Thread.java:637)
Caused by: com.caucho.hessian.client.HessianConnectionException: 500: java.net.SocketException: Unexpected end of file from server
at com.caucho.hessian.client.HessianProxy.invoke(HessianProxy.java:195)
at $Proxy11.updateStepRuntime(Unknown Source)
at com.pmease.quickbuild.stepsupport.Step.updateRuntime(Step.java:688)
at com.pmease.quickbuild.plugin.basis.CommandBuildStep$$EnhancerByCGLIB$$601544e7.CGLIB$updateRuntime$74(<generated>)
at com.pmease.quickbuild.plugin.basis.CommandBuildStep$$EnhancerByCGLIB$$601544e7$$FastClassByCGLIB$$c42739d3.invoke(<generated>)
at net.sf.cglib.proxy.MethodProxy.invokeSuper(MethodProxy.java:215)
at com.pmease.quickbuild.DefaultScriptEngine$Interpolator.intercept(DefaultScriptEngine.java:273)
at com.pmease.quickbuild.plugin.basis.CommandBuildStep$$EnhancerByCGLIB$$601544e7.updateRuntime(<generated>)
at com.pmease.quickbuild.stepsupport.StepJob.beforeExecute(StepJob.java:32)
at com.pmease.quickbuild.grid.GridJob.run(GridJob.java:113)
... 1 more
Caused by: java.net.SocketException: Unexpected end of file from server
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1368)
at java.security.AccessController.doPrivileged(Native Method)
at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1362)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1016)
at com.caucho.hessian.client.HessianProxy.invoke(HessianProxy.java:177)
... 10 more
Caused by: java.net.SocketException: Unexpected end of file from server
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:769)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:652)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1072)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:373)
at com.caucho.hessian.client.HessianProxy.invoke(HessianProxy.java:166)
... 10 more
robinshen ADMIN ·
Hi Puneet,

The socket connect/read timeout should not be relevant here, as otherwise a SocketTimeoutException would be thrown. I guess the server is still under heavy load. How many projects are configured, and how frequently do they check for changes? Also, how about increasing the memory of the server JVM?
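
To illustrate the distinction, here is a minimal, self-contained Java sketch. It is not QuickBuild or Hessian code: the class name `EofVersusTimeout`, the throwaway local server, and the 500 ms read timeout are all assumptions made up for this example. It uses `HttpURLConnection`, which is the client that appears in the stack trace above. A server that reads the request and then hangs up without answering produces a `SocketException`, while a server that stays silent past the read timeout produces a `SocketTimeoutException`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.InetAddress;
import java.net.Proxy;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class EofVersusTimeout {

    /** Reads one HTTP request head, then hangs up at once or stalls past the client's read timeout. */
    static void handle(Socket socket, boolean hangUp) {
        try (Socket s = socket) {
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(s.getInputStream(), StandardCharsets.ISO_8859_1));
            String line;
            while ((line = in.readLine()) != null && !line.isEmpty()) {
                // consume the request line and headers
            }
            if (!hangUp) {
                Thread.sleep(2000); // stay silent longer than the 500 ms read timeout
            }
        } catch (IOException | InterruptedException ignored) {
            // nothing to clean up in this sketch
        }
    }

    /** Sends one request to a throwaway local server and reports how the request failed. */
    static String probe(boolean hangUp) {
        try (ServerSocket server = new ServerSocket(0, 50, InetAddress.getByName("127.0.0.1"))) {
            Thread acceptor = new Thread(() -> {
                try {
                    while (true) { // keep accepting, since the client may reconnect once
                        Socket socket = server.accept();
                        Thread worker = new Thread(() -> handle(socket, hangUp));
                        worker.setDaemon(true);
                        worker.start();
                    }
                } catch (IOException closed) {
                    // server socket closed: stop accepting
                }
            });
            acceptor.setDaemon(true);
            acceptor.start();

            URI uri = URI.create("http://127.0.0.1:" + server.getLocalPort() + "/");
            HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection(Proxy.NO_PROXY);
            conn.setConnectTimeout(1000);
            conn.setReadTimeout(500);
            conn.getResponseCode();
            return "no failure";
        } catch (IOException e) {
            return e.getClass().getSimpleName() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println("server hangs up -> " + probe(true));
        System.out.println("server stalls   -> " + probe(false));
    }
}
```

The two cases fail with different exception types, which is why raising a timeout would not be expected to change the error in the log above.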

Regards
Robin
puneetg ·
Hi Robin

There are about 60 configurations, of which 30 are scheduled via cron every 30 minutes. The cron schedule is arranged so that only one configuration checks for changes at any time. A few points to consider:

1. Until now the issue has been seen with nodes in the secondary data center, though I did see it with nodes in the primary several months back.
2. I checked the CPU/memory/network history of the QB server at that time, and it looked normal with no spikes.
3. It is possible that many users were logged into the server at that time. Is there a way to find the real number? (I have a 15-second refresh.)
4. The issue is seen while builds perform SVN repository checkouts.

How do I increase memory for the JVM?

Is it possible to add a user-configurable retry mechanism to build nodes so that they do not fail the build on the first missing response? (That is generally good for fault tolerance.)

Thanks
-Puneet
robinshen ADMIN ·
Hi Puneet,

QB should be fine with this number of configurations and checking frequency. What are the specs of your server hardware? We have customers running more than 4000 configurations, with hundreds of agents and several hundred builds each day, without any performance issue on a Windows server machine with 4 CPUs and 8 GB of memory.

To isolate the problem, please try disabling auto-refresh (set it to 0) to see if it helps. To increase JVM memory, please edit the entry "wrapper.java.maxmemory" in "conf/wrapper.conf". If the problem still happens, it is more likely a network issue than a server issue.
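
For reference, the entry might look like the fragment below. The value 1024 is only an illustrative assumption, not a recommendation for this server; the Java Service Wrapper reads this property in megabytes, and the server has to be restarted for the change to take effect.

```properties
# conf/wrapper.conf
# Maximum Java heap size in MB (1024 is an illustrative value)
wrapper.java.maxmemory=1024
```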

When this error happens, is the node in the secondary data center transferring a large amount of data (such as publishing big artifacts, fetching files via input/output files, etc.)?

We have considered retrying before, but it may result in undesired behavior. For instance, some RPC calls change state, and the exception may occur after the remote end has finished the state change. If we retry at this point, the change will be applied a second time.
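
A minimal Java sketch of that failure mode follows. It is a simulation, not QuickBuild code: `LostResponseRetry`, `Server`, and `applyChange` are hypothetical names, and the "server" is just an in-memory object that commits its change and then loses the response once.

```java
import java.io.IOException;

class LostResponseRetry {

    /** Stand-in for server-side state that an RPC modifies (hypothetical). */
    static final class Server {
        int appliedChanges = 0;
        boolean dropNextResponse = false;

        void applyChange() throws IOException {
            appliedChanges++; // the state change is committed first
            if (dropNextResponse) {
                dropNextResponse = false;
                // the response is lost on the way back to the caller
                throw new IOException("Unexpected end of file from server");
            }
        }
    }

    /** Naive client that retries once on any I/O error. */
    static int callWithRetry(Server server) {
        for (int attempt = 1; attempt <= 2; attempt++) {
            try {
                server.applyChange();
                break;
            } catch (IOException e) {
                // the caller cannot tell whether the change was applied
            }
        }
        return server.appliedChanges;
    }

    /** One logical call whose first response is lost; returns how often the change was applied. */
    static int demo() {
        Server server = new Server();
        server.dropNextResponse = true;
        return callWithRetry(server);
    }

    public static void main(String[] args) {
        System.out.println("Changes applied for one logical call: " + demo());
    }
}
```

Here a single logical call ends up applying the change twice, because the first attempt had already succeeded on the server side when the exception reached the caller.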

Regards
Robin
puneetg ·
Hi Robin

We were able to reduce the number of configurations by implementing userinput beans (thanks for that solution).

The QB server is a 2-CPU, 8 GB Windows 2008 Server machine. It is a virtual machine in vCenter. After the last few weeks of fine-tuning, there is not much going on with the 2 CPUs; they are 20-30% loaded. The Windows nodes are 4-CPU/4 GB Windows 2008 Server VMs. The Mac nodes are 8-core, 8 GB Apple Xserves. We are also running Red Hat Linux VMs for test farming on Lab Manager; they have 1 CPU and 1 GB each. For now, I have reduced the number of nodes to 77 for testing.

There may be other data being transferred between the data centers (outside QB); there are many syncs going on between them. I will test with no user refresh, but engineers tend not to like it.

Retry would still be good for fault tolerance even if it results in duplicate entries; that's better than build failures.

Thanks
-Puneet
robinshen ADMIN ·
Hi Puneet,

Duplicate entries are not the only consequence of retrying, and even a duplicate entry may result in a build failure, as the database will report a unique constraint violation error.

QB actually already retries at the socket level, but only for connection failures. At that point, no RPC call has been issued yet.
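
As a rough sketch of that idea (not QuickBuild's actual implementation; `ConnectOnlyRetry`, `Rpc`, and `invoke` are hypothetical names), a wrapper can retry only on `java.net.ConnectException`, which signals that the connection was never established, and let every other I/O error propagate:

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketException;

class ConnectOnlyRetry {

    /** A remote call that may fail with an I/O error (hypothetical interface). */
    interface Rpc<T> {
        T call() throws IOException;
    }

    /** Retries only when the connection could not be established; other errors propagate. */
    static <T> T invoke(Rpc<T> rpc, int maxAttempts) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        ConnectException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return rpc.call();
            } catch (ConnectException e) {
                last = e; // never connected, so the request was not sent: safe to retry
            }
        }
        throw last;
    }

    /** Simulated call that is refused twice, then succeeds; returns the number of attempts made. */
    static int attemptsUntilConnected() {
        int[] attempts = {0};
        Rpc<String> rpc = () -> {
            if (++attempts[0] < 3) {
                throw new ConnectException("Connection refused");
            }
            return "ok";
        };
        try {
            invoke(rpc, 5);
        } catch (IOException e) {
            return -1;
        }
        return attempts[0];
    }

    /** Simulated call that fails after the request was sent; returns the number of attempts made. */
    static int attemptsOnEof() {
        int[] attempts = {0};
        Rpc<String> rpc = () -> {
            attempts[0]++;
            throw new SocketException("Unexpected end of file from server");
        };
        try {
            invoke(rpc, 5);
        } catch (IOException e) {
            // not retried: the remote end may already have processed the call
        }
        return attempts[0];
    }

    public static void main(String[] args) {
        System.out.println("attempts when refused twice: " + attemptsUntilConnected());
        System.out.println("attempts on unexpected EOF:  " + attemptsOnEof());
    }
}
```

The refused connection is retried until it succeeds, while the end-of-file error is attempted only once, since by then the remote end may already have acted on the request.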

Regards
Robin
robinshen ADMIN ·
Also, have you ever observed nodes in data center 1 experiencing this issue?
puneetg ·
Hi Robin

We saw this issue in the primary data center a month back, but it has never reproduced since. My network admin says the bandwidth between the data centers is never maxed out from either side, though there can be intermittent spikes. We do not use QB artifact publishing; we just leave the artifacts on an SMB share in each data center.

Is it possible to skip the update for the step to prevent DB collisions, and to special-case build completion while retrying? I mean, if the steps and updates are incorrect but the build succeeds, that is a far better situation for us.

Thanks
-Puneet
robinshen ADMIN ·
Hi Puneet,

We'd rather not keep inconsistent data in the database, as this harms the data in the long run. It also makes data maintenance and manipulation difficult. I would suggest retrying the whole build upon network failure. To do so, you may use the post-build script below to trigger the build again if a network failure is found:

groovy:
import com.pmease.quickbuild.*;

// Re-run the build if any step failed with the network error.
for (each in build.stepRuntimes.values) {
    // Safe navigation (?.) guards against a null errorMessage.
    if (each.errorMessage?.contains("Unexpected end of file from server")) {
        logger.warn("Found network error. Retrying build...");
        // Delete the failed build and request a new one with the same variables.
        system.buildManager.delete(build);
        def request = new BuildRequest();
        request.setConfigurationId(configuration.getId());
        request.setRespectBuildCondition(false);
        request.getVariables().putAll(build.variableValues);
        // Keep the original requester, if any.
        def requesterId = build.requester?.getId();
        Quickbuild.getServerService().requestBuild(requesterId, build.scheduled, request);
        break;
    }
}

Regards
Robin
puneetg ·
Thanks Robin, we will try this.

We are also investigating on our end. Since this happens only at certain times of day, it looks like a network issue. I will update you once there are any results.

Thanks
-Puneet
sriniefi ·
Hi Robin,

Retrying the build works fine, but we have 2 issues here.
1) The deleted build waits in the queue for some time, and clicking on it throws an error (This normally happens when the entity you are accessing is deleted.)
2) An email notification is sent for the deleted build.

Can you please let us know how to disable the notification for the deleted build?

Thanks,
Srinivas
robinshen ADMIN ·
Hi Srinivas,

This will be fixed in the next patch release. Please watch this issue:
http://track.pmease.com/browse/QB-974

Regards
Robin
puneetg ·
Thanks Robin, we are trying out the latest build.

Can we use the Groovy code to build a module that saves running/waiting builds during a QB server restart? When QB is restarted, we could automatically delete the cancelled builds and respool new ones with the same properties.

In my case QB is always in use and there is no good time to restart; we end up canceling many builds and emailing users to ask them to rerun the configurations.

Thanks
-Puneet
robinshen ADMIN ·
Hi Puneet,

Unfortunately, this is not supported at the moment. We will take this into account for QB4.

Regards
Robin
puneetg ·
Hi Robin

Retrying the build in case of the above error is working with build 52.

Thanks for considering the restart-server-and-respool-builds feature for QB 4.0. We have also developed many userinput beans, and applying a patch needs a restart as well. It's a very important feature for deploying bean fixes and QB patches.

Regards
-Puneet
puneetg ·
Hi Robin

While most of the "Unexpected end of file" issues are fixed by firing the build again, we occasionally see the following error. In this case, master is not executed and the post-build script is not called. Is there a way to resolve this error similar to what was done in the post-build script?

thanks
-Puneet

02:50:05,937 INFO - Execute condition satisfied, selecting node to execute step 'master'...
02:50:09,394 [master@bawqbvm05:8811] INFO - Processing job...
02:50:09,394 [master@bawqbvm05:8811] INFO - Executing step 'master' on node 'bawqbvm05:8811'...
02:53:30,898 [master@bawqbvm05:8811] INFO - Job finished.
02:53:33,009 ERROR - Build is failed.
java.lang.RuntimeException: Error executing grid job 'master'
at com.pmease.quickbuild.grid.GridTaskFuture.get(GridTaskFuture.java:63)
at com.pmease.quickbuild.DefaultBuildEngine.run(DefaultBuildEngine.java:412)
at com.pmease.quickbuild.DefaultBuildEngine.process(DefaultBuildEngine.java:319)
at com.pmease.quickbuild.DefaultBuildEngine.access$1(DefaultBuildEngine.java:242)
at com.pmease.quickbuild.DefaultBuildEngine$2.run(DefaultBuildEngine.java:758)
at java.lang.Thread.run(Unknown Source)
Caused by: com.caucho.hessian.client.HessianConnectionException: 500: java.net.SocketException: Unexpected end of file from server
at com.caucho.hessian.client.HessianProxy.invoke(HessianProxy.java:195)
at $Proxy11.updateStepRuntime(Unknown Source)
at com.pmease.quickbuild.stepsupport.Step.updateRuntime(Step.java:701)
at com.pmease.quickbuild.stepsupport.SequentialStep$$EnhancerByCGLIB$$9bf08a3a.CGLIB$updateRuntime$73(<generated>)
at com.pmease.quickbuild.stepsupport.SequentialStep$$EnhancerByCGLIB$$9bf08a3a$$FastClassByCGLIB$$6e184263.invoke(<generated>)
at net.sf.cglib.proxy.MethodProxy.invokeSuper(MethodProxy.java:215)
at com.pmease.quickbuild.DefaultScriptEngine$Interpolator.intercept(DefaultScriptEngine.java:273)
at com.pmease.quickbuild.stepsupport.SequentialStep$$EnhancerByCGLIB$$9bf08a3a.updateRuntime(<generated>)
at com.pmease.quickbuild.stepsupport.StepJob.beforeExecute(StepJob.java:32)
at com.pmease.quickbuild.grid.GridJob.run(GridJob.java:113)
... 1 more
Caused by: java.net.SocketException: Unexpected end of file from server
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection$6.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at sun.net.www.protocol.http.HttpURLConnection.getChainedException(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at com.caucho.hessian.client.HessianProxy.invoke(HessianProxy.java:177)
... 10 more
Caused by: java.net.SocketException: Unexpected end of file from server
at sun.net.www.http.HttpClient.parseHTTPHeader(Unknown Source)
at sun.net.www.http.HttpClient.parseHTTP(Unknown Source)
at sun.net.www.http.HttpClient.parseHTTP(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at java.net.HttpURLConnection.getResponseCode(Unknown Source)
at com.caucho.hessian.client.HessianProxy.invoke(HessianProxy.java:166)
... 10 more
robinshen ADMIN ·
Hi Puneet,

Please modify the script to handle this case as shown below:

groovy:
import com.pmease.quickbuild.*;

// Check the build-level error first, then each step's error.
// Safe navigation (?.) guards against a null errorMessage.
boolean networkErrorFound = false;
if (build.errorMessage?.contains("Unexpected end of file from server")) {
    networkErrorFound = true;
} else {
    for (each in build.stepRuntimes.values) {
        if (each.errorMessage?.contains("Unexpected end of file from server")) {
            networkErrorFound = true;
            break;
        }
    }
}

if (networkErrorFound) {
    logger.warn("Found network error. Retrying build...");
    // Delete the failed build and request a new one with the same variables.
    system.buildManager.delete(build);
    def request = new BuildRequest();
    request.setConfigurationId(configuration.getId());
    request.setRespectBuildCondition(false);
    request.getVariables().putAll(build.variableValues);
    // Keep the original requester, if any.
    def requesterId = build.requester?.getId();
    Quickbuild.getServerService().requestBuild(requesterId, build.scheduled, request);
}
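
One thing a retry script like this does not bound is how many times a build is re-requested if the network error keeps recurring. Because the script copies the build's variables into the new request, a counter could ride along with them. The Java sketch below shows only that bookkeeping. It is an assumption-laden illustration, not part of QuickBuild: the variable name `networkRetryCount`, the class `RetryBudget`, and the idea that variables can be treated as a string-to-string map are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class RetryBudget {

    /** Hypothetical variable name used to carry the retry count between builds. */
    static final String COUNTER = "networkRetryCount";

    /**
     * Returns the variables for the next attempt with the counter incremented,
     * or an empty Optional once maxRetries has been reached.
     */
    static Optional<Map<String, String>> nextAttempt(Map<String, String> variables, int maxRetries) {
        int used = Integer.parseInt(variables.getOrDefault(COUNTER, "0"));
        if (used >= maxRetries) {
            return Optional.empty(); // budget exhausted: let the build stay failed
        }
        Map<String, String> next = new HashMap<>(variables);
        next.put(COUNTER, Integer.toString(used + 1));
        return Optional.of(next);
    }

    /** Counts how many re-requests are granted when every attempt hits the error. */
    static int retriesGranted(int maxRetries) {
        Map<String, String> variables = new HashMap<>();
        int granted = 0;
        Optional<Map<String, String>> next;
        while ((next = nextAttempt(variables, maxRetries)).isPresent()) {
            variables = next.get();
            granted++;
        }
        return granted;
    }

    public static void main(String[] args) {
        System.out.println("Retries granted with a budget of 3: " + retriesGranted(3));
    }
}
```

With a budget of 3, a build that fails with the network error every time would be re-requested three times and then left as failed.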

Regards
Robin