Thread: Misbehaving RT applications

  1. #1

    Misbehaving RT applications

    Hi,

    We have an HP ProLiant DL980 G7 server (4x 6-core CPUs) running SLES 11 SP2 + SLERT. The OS is configured to use CPU sets: the first 4 cores are reserved for the OS, and the application runs on the remaining 20 cores.

    An application running with real-time priority can cause the server to stop responding. We can still ping the server, but we cannot interact with it via the console or SSH into it.

    My impression was that putting the application on different CPUs from the OS would prevent the application from hindering the OS.

    What am I missing?

    Thanks, Jason

  2. #2

    Re: Misbehaving RT applications

    It's possible for an application to effectively inhibit a system's ability
    to do anything else even if it is not consuming all CPUs directly, at
    least on non-RT systems, and I suspect the same is true on the RT line. If
    the four cores dedicated to the OS are busy doing work that supports the
    application, for example, then that may be what is happening.

    I've seen cases where a runaway process has taken all the RAM and then
    starts consuming swap as well, on a system with too much swap. Even
    though the process is single-threaded (so it only gets one core out of
    sixteen), the system is effectively useless until the Out-Of-Memory (OOM)
    killer takes over and nukes the lousy thing. Tying up the hard drive is
    probably one of the easiest, minimal-effort ways to lock up a system,
    particularly while also using up the majority of system memory. For this
    reason, things like ulimit exist to prevent a process from using more
    resources than necessary, as do cgroups, and perhaps this is what your
    application is using.
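
    As a rough illustration of the ulimit idea, here is a minimal Python
    sketch that starts a command with a capped address space (RLIMIT_AS,
    the same limit that "ulimit -v" sets in the shell). The helper names
    soft_cap, apply_memory_cap and run_with_memory_cap are made up for this
    example, and it assumes a Linux system with a Python 3 interpreter.

```python
import resource
import subprocess


def soft_cap(requested, hard):
    """Clamp a requested soft limit so it never exceeds the hard limit."""
    if hard == resource.RLIM_INFINITY:
        return requested
    return min(requested, hard)


def apply_memory_cap(max_bytes):
    """Lower the soft address-space limit of the calling process."""
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (soft_cap(max_bytes, hard), hard))


def run_with_memory_cap(argv, max_bytes):
    """Run argv as a child whose address space is capped at max_bytes.

    The limit is applied in the child between fork and exec, so the
    parent process keeps its own limits.
    """
    return subprocess.run(argv, preexec_fn=lambda: apply_memory_cap(max_bytes))
```

    With a cap in place, a runaway allocation fails inside the process
    instead of pushing the whole machine into swap. Only the soft limit is
    lowered here, so the process could still raise it again up to the hard
    limit.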

    If you are already SSH'd into the system when this happens, or if you go
    to the system console, can you still interact that way, meaning it is just
    new connections that fail? Could you create a script to watch system
    resources (I/O, memory usage, CPU utilization) while the application is
    running, so you can gather statistics even while disconnected or unable
    to interact with the system?
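
    A monitoring script along those lines could look something like the
    sketch below. It is only a starting point, assuming a Linux /proc
    filesystem and a Python interpreter on the box; the function names and
    the log format are invented for this example. It logs per-CPU busy and
    I/O-wait percentages, the 1-minute load average, and free memory and
    swap, and it syncs each line to disk so the data survives a hard reset.

```python
import os
import time


def parse_cpu_times(stat_text):
    """Map each 'cpu*' line of /proc/stat to (busy, iowait, total) jiffies."""
    times = {}
    for line in stat_text.splitlines():
        parts = line.split()
        if not parts or not parts[0].startswith("cpu"):
            continue
        # Fields: user nice system idle iowait irq softirq steal.
        # Guest time is already counted in user/nice, so stop at steal.
        values = [int(v) for v in parts[1:9]]
        values += [0] * (8 - len(values))
        idle, iowait = values[3], values[4]
        total = sum(values)
        times[parts[0]] = (total - idle - iowait, iowait, total)
    return times


def cpu_percentages(prev, cur):
    """Return {cpu: (busy %, iowait %)} between two parse_cpu_times() results."""
    result = {}
    for name, (busy, iowait, total) in cur.items():
        if name not in prev:
            continue
        p_busy, p_iowait, p_total = prev[name]
        delta = total - p_total
        if delta > 0:
            result[name] = (100.0 * (busy - p_busy) / delta,
                            100.0 * (iowait - p_iowait) / delta)
    return result


def parse_meminfo(meminfo_text):
    """Return the numeric fields of /proc/meminfo (mostly kB) as a dict."""
    info = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info


def read_text(path):
    with open(path) as handle:
        return handle.read()


def format_sample(timestamp, loadavg_text, meminfo, percentages):
    """Build one log line; each CPU is written as name=busy/iowait."""
    fields = [
        time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(timestamp)),
        "load1=" + loadavg_text.split()[0],
        "MemFree_kB=%d" % meminfo.get("MemFree", -1),
        "SwapFree_kB=%d" % meminfo.get("SwapFree", -1),
    ]
    for name in sorted(percentages):
        fields.append("%s=%.0f/%.0f" % ((name,) + percentages[name]))
    return " ".join(fields)


def monitor(log_path, interval_seconds, samples):
    """Append `samples` log lines to log_path, one every interval_seconds."""
    prev = parse_cpu_times(read_text("/proc/stat"))
    with open(log_path, "a") as log:
        for _ in range(samples):
            time.sleep(interval_seconds)
            cur = parse_cpu_times(read_text("/proc/stat"))
            log.write(format_sample(time.time(),
                                    read_text("/proc/loadavg"),
                                    parse_meminfo(read_text("/proc/meminfo")),
                                    cpu_percentages(prev, cur)) + "\n")
            log.flush()
            os.fsync(log.fileno())  # push the line to disk right away
            prev = cur
```

    Started before the application, for example with
    monitor("rt_monitor.log", 5.0, 720) for an hour of 5-second samples, it
    should show which cores and resources were saturated just before the
    hang. If the monitor itself gets starved, the gaps between timestamps
    in the log are a clue in their own right.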

    --
    Good luck.

