distributed processing: too many files?

Stefano Zanella

Hi,

Has anybody experienced problems with LBS when the number of files in
the netlist directory is around 1000? I haven't exactly nailed down the
number of files that causes problems, but it is certainly less than 1024
(or some 2^10-related number). The symptoms are the following: not a
single file is copied into the jobXXX directories and spectre fails (no
input file found). I am running in OCEAN and basically doing

foreach( job jobList      ; jobList: placeholder for the list of jobs
    ; ... set up this job ...
    run()                 ; submit the job for distributed processing
)

wait()                    ; block until all submitted jobs have completed

It works beautifully for run directories that contain fewer than roughly
1000 files and fails for directories with more. The problem seems to be
independent of the design (I tried different designs) and of the size of
the run directory. I have all the log levels set to the maximum, but
they don't say anything.

Thanks in advance,
Stefano
 
Hi Stefano,

The issue is probably filesystem related. Here is the solution for
Linux:

The value in file-max denotes the maximum number of file handles
that the Linux kernel will allocate. When you get a lot of error
messages about running out of file handles, you might want to raise
this limit. The default value is 4096. To change it, just write the
new number into the file:

# cat /proc/sys/fs/file-max
4096
# echo 8192 > /proc/sys/fs/file-max
# cat /proc/sys/fs/file-max
8192

[...]

The value in inode-max denotes the maximum number of inode
handlers. This value should be 3 to 4 times larger than the value
in file-max, since stdin, stdout, and network sockets also need an
inode struct to handle them. If you regularly run out of inodes,
you should increase this value.
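
If you want the change to survive a reboot, the same file-max setting
can also go into /etc/sysctl.conf (standard on most Linux distributions;
8192 is just the example value from above):

# echo "fs.file-max = 8192" >> /etc/sysctl.conf
# sysctl -p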

Regards
Raman


 
Hi Raman,

Thanks a lot. Unfortunately that does not seem to be the case:

sh-2.05a$ cat /proc/sys/fs/file-max
104802

I checked it on the LBS server and on all client machines. I did not get
any error messages at all from LBS, which is the worrisome part. I am
wondering whether there is a hard-coded limit somewhere.

Regards,
Stefano


 
Hi Stefano,

You might want to check whether there is any quota set for the user
account, and also check the remote /tmp directories (in case there are
any issues there). You could also try running the job as root to rule
out any user-specific limits (even though that is not advised).
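
For a quick check of both (standard commands on most Linux systems; run
them on the LBS server and on each client machine):

$ quota          # report this user's disk quotas, if any are set
$ df -h /tmp     # free space left in /tmp on this machine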
The following .cdsenv setting might be helpful:
asimenv.distributed copyMode boolean nil
I do suspect that it could be a system issue, especially when the logs
become useless.

Regards
Raman
 
Hi Raman,

Thanks a lot (again!). It is not a size issue: I can use a test case
that is 100 times as big (in terms of data size) but with fewer files,
and everything is fine. I can't try the root option (my IT department
will never allow me).

asimenv.distributed copyMode is already nil. I guess the next step is
Cadence support.

Regards,
Stefano


 
Hmmmm.
Are all of these files a job, i.e. 1000 jobs?
Isn't this a limitation related to the default naming job-job999, tracked
by artMonitor/jobMonitor?
Just a thought...
//BEE

"Stefano Zanella" <stefanoDOTzanella@pdfDOTcom> wrote in message
news:apadnb9itMfPzNPZnZ2dnUVZ_smdnZ2d@comcast.com...
Hi Raman,

Thanks a lot (again!). It is not a size issue. I can use a test case that
is 100 times as big (in terms of data size) with less files and everything
will be ok. I can't try the root option (my IT will never allow me).

asimenv.distributed copyMode is already nil. I guess that next step is
cadence's support.

Regards,
Stefano


raman@webquarry.com wrote:
Hi Stefano,

You might want to check if there is any quota set for the user
account, and also
the remote /tmp directories(if there is any issues). You could also
try running
the job as root to rule out any user specific limits(eventhough is not
advised).
The following .cdsenv setting might be helpful:
asimenv.distributed copyMode boolean nil
I do suspect that it could be a system issue, esp. when the logs
beccome useless.

Regards
Raman
 
Nope, just a few jobs.
Stefano

 
Stefano

There are a zillion things with limits in a UNIX environment.
In your case a likely suspect is the number of open files:

/usr/sbin/lsof -p `pgrep icfb.exe` | wc -l

will tell you how many files icfb.exe has open.

You can check this against the limit: ulimit -n

To find out all the limits, type ulimit -a. Mind you, these are per-process
limits. There are also per-system limits, as Raman points out.
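
If the open-file count is close to the soft limit, you can raise it in
the shell that launches the tools (a sketch; 4096 is an arbitrary example,
and anything above the hard limit needs an administrator change, e.g. in
/etc/security/limits.conf):

$ ulimit -n 4096      # raise the soft limit for this shell and its children
$ icfb &              # then launch the Cadence session from the same shell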

Hope this helps.

Satya
 
