Monday, February 05, 2007

The Mystery of "Too many open files"

Last week we saw an interesting problem while running the regression tests on Linux. Immediately after the tests were started, the VM started throwing error "FileNotFoundException : Too many open files". Ofcourse the files were there but the VM was trying to say that there are too many file descriptors open which were not closed. It was hitting the open file limit set by the OS. This was kind of absurd because we always close all the files that we open. And moreover we had never seen this problem before. So we started suspecting 1.6 VM on which we were running it.

Immediately after this, it got worse. The exception changed into "SocketException: Too many open files" and all the socket connection started getting rejected. So merely after serving 100 requests, the server was down to its knees.

A quick google search suggested to increase the open file descriptor limit on the machine.

"ulimit -aH" that gives the max limit for number of open files returned 1024.

I added every possible way to increase it. Here are few

1. In /etc/security/limits.conf
* soft nofile 1024
* hard nofile 65535
2. Increase ulimit by "ulimit -n 65535"
3. echo 65535 > /proc/sys/fs/file-max
4. In /etc/sysctl.conf
fs.file-max=65535

Increasing file descriptor limit did not help much either. Even after increasing this limit, we were still getting this error.And then Sanjeev (another brilliant CF engineer with an amazing sense of humour) cracked it !!
Just before we started getting these errors, there was another error which I had overlooked assuming it was test problem which infact was the clue.
The error was something like
coldfusion.jsp.CompilationFailedException: Errors reported by Java compiler: error: error reading /opt/coldfusionscorpio/lib/./././././././././././././././././././././././././././././
././././././././././././././././././././././././././././././././././././././././././.
/././././././././././././././././././././././././././././././././././././././././././
././././././././././././././././././././././././././././././././././././././././././.
/././././././././././././././././././././././././././././././././././././././././././
././././././././././././././././././././././././././././././././././././././././././.
/././././././././././././././././././././././././././././././././././././././././././
././././././././././././././././././././././././././././././././././././././././././.
/././././././././././././././././././././././././././././././././././././././././././
././././././././././././././././././././././././././././././././././././././././././.
/././././././././././././././././././././././././././././././././././././././././././
././././././././././././././././././././././././././././././././././....
at coldfusion.jsp.JavaCompiler.compileClass(JavaCompiler.java:138)

Sanjeev ran it through the debugger and he nailed the culprit. It was something which no one had even suspected. It was javac. :)
We had hit upon a Sun's bug in javac where if the classpath contains a jar which has a manifest and manifest contains classpath entry with relative paths of other jars as well as path to itself, javac goes in an infinite loop. (Sun's bug no 6400872, 6446657, 6456960, 6485027, 6206485)

I dont have the source for javac to pin point what exactly it did, but probably it kept opening all the entries in the classpath and because of the stack overflow it could not close those files - hence reaching the max limit of open file descriptors. This was a third party jar that had this manifest entry. Once we found the problem, the solution was simple - just remove the classpath entry from the manifest! Whew !!!

7 comments:

Unknown said...

This was a nice catch Sanjeev/Rupesh.

Hemant

Unknown said...

Hi Rupesh. Thanks for posting the blog entry. I'm having the same problem. First I suspected if there is any code which don't close the file. But, I can't find one.

But, how can I find which jar has the wrong classpath in the manifest? I use several 3rd party jar. Some of them are: spring, struts2, velocity, activation.jar, quartz, hibernate, and some common libraries from Apache. On your case, which jar caused the problem?

Wagner said...

Great!Great!Great Discovery.

But, I still did not understand a thing: when I startup the Jboss Aplication Service this kind of problemns don't arrive, but when I startup Geronimo Aplication Services the problems arrive.

I am wandering if all of folks that use Geronimo runs in this problem and if all of them visit the manifest of all jar to solve the problem?

Best regards!

Unknown said...

Hi

I have this problem, and I tryed to solve it by removing the classpath from the MANIFEST in the external jars (third party jars), but when I packed again the jar file, it does not work. Any sugest? Thanks a lot

Brew said...

I had a similar problem in my own code. Alas mine was not a Sun Bug, this bit of code helped me track down where my code was misbehaving. It prints out all the files open by the current java process

public static int pid() {
String id = ManagementFactory.getRuntimeMXBean().getName();
String[] ids = id.split("@");
return Integer.parseInt(ids[0]);
}


public static void listOpenFiles() throws IOException {
int pid = pid();
Runtime runtime = Runtime.getRuntime();
Process process = runtime.exec("lsof -p " + pid);
InputStream is = process.getInputStream();
InputStreamReader isr = new InputStreamReader(is);
BufferedReader br = new BufferedReader(isr);
String line;

while ((line = br.readLine()) != null) {
log.info(line);
}
}

Anonymous said...

Thanks for this. You saved me a good number of hours, not to mention a couple hundred strands of hair.

Anonymous said...

With which jdk versions can this problem occur?