SLES 11 SP3 Memory not free, but no process ?!?



mpibgc
21-Nov-2018, 09:34
Hi,

we have the following problem with some HPC nodes. After some time (7-30 days) large parts of the memory are occupied. There are no active or inactive processes using the memory. Any idea how to find the memory? (ps, lsof, /proc/.. did not help.)

System: SLES 11 SP3, 64 GB RAM



BAD node:
node22:~ # cat /proc/meminfo
MemTotal: 66066804 kB
MemFree: 49673020 kB
Buffers: 12298540 kB
Cached: 1467596 kB
SwapCached: 0 kB
Active: 12875052 kB
Inactive: 948312 kB
Active(anon): 42288 kB
Inactive(anon): 88 kB
Active(file): 12832764 kB
Inactive(file): 948224 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 16779260 kB
SwapFree: 16779260 kB
Dirty: 52 kB
Writeback: 0 kB
AnonPages: 42236 kB
Mapped: 13780 kB
Shmem: 140 kB
Slab: 1638204 kB
SReclaimable: 1578768 kB
SUnreclaim: 59436 kB
KernelStack: 6400 kB
PageTables: 2696 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 49812660 kB
Committed_AS: 115352 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 383588 kB
VmallocChunk: 34359352372 kB
HardwareCorrupted: 0 kB
AnonHugePages: 14336 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 227328 kB
DirectMap2M: 11290624 kB
DirectMap1G: 55574528 kB


GOOD node:
node08:~ # cat /proc/meminfo
MemTotal: 66066804 kB
MemFree: 63056704 kB
Buffers: 814328 kB
Cached: 668740 kB
SwapCached: 6528 kB
Active: 1056692 kB
Inactive: 511812 kB
Active(anon): 32872 kB
Inactive(anon): 47336 kB
Active(file): 1023820 kB
Inactive(file): 464476 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 16779260 kB
SwapFree: 16761860 kB
Dirty: 88 kB
Writeback: 0 kB
AnonPages: 74692 kB
Mapped: 12376 kB
Shmem: 52 kB
Slab: 507524 kB
SReclaimable: 364540 kB
SUnreclaim: 142984 kB
KernelStack: 6416 kB
PageTables: 2980 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 49812660 kB
Committed_AS: 150608 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 383588 kB
VmallocChunk: 34359352372 kB
HardwareCorrupted: 0 kB
AnonHugePages: 32768 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 71680 kB
DirectMap2M: 4106240 kB
DirectMap1G: 62914560 kB

ab
21-Nov-2018, 13:52
On 11/21/2018 01:44 AM, mpibgc wrote:
>
> we have the following problem with some HPC nodes. After some time
> (7-30 days) large parts of the memory
> are occupied. There are no active or inactive processes using the
> memory. Any idea how do find the memory (ps,lsof , /proc/.. did not
> help).
>
> System SLES 11SP3, 64GB RAM

Just to be sure, other than the display from pseudo-files like
/proc/meminfo do you have any actual symptoms that make you classify this
as a "problem" rather than a great feature which improves your system
performance? Have you done any benchmarking, particularly of things using
disk I/O for files, that might indicate which box performs better?

> Code:
> --------------------
>
> BAD node:
> node22:~ # cat /proc/meminfo
> MemTotal: 66066804 kB
> MemFree: 49673020 kB
> Buffers: 12298540 kB
> Cached: 1467596 kB

The MemFree, Buffers, and Cached lines above add up to about 96% of
MemTotal, so it looks to me like your memory is mostly available. A large
Buffers value may seem like a "bad thing" at first, but it is actually the
way that Linux (the kernel) helps increase performance of your system.
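
As a rough check of that arithmetic, here is a minimal Python sketch that
sums the three lines for the BAD node. The helper name parse_meminfo is
made up for this illustration, and the sample text is copied from the
output quoted above.

```python
def parse_meminfo(text):
    """Return a dict mapping each /proc/meminfo key to its integer value."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep or not rest.split():
            continue  # skip blank or malformed lines
        values[key.strip()] = int(rest.split()[0])
    return values


# First four lines of the BAD node's /proc/meminfo, as posted.
BAD_NODE = """\
MemTotal: 66066804 kB
MemFree: 49673020 kB
Buffers: 12298540 kB
Cached: 1467596 kB
"""

info = parse_meminfo(BAD_NODE)
reusable_kb = info["MemFree"] + info["Buffers"] + info["Cached"]
share = reusable_kb / info["MemTotal"]
print(reusable_kb, "kB, %.1f%% of MemTotal" % (share * 100))
```

For the posted values this comes to 63439156 kB, about 96% of MemTotal.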

RAM that is not being used is essentially wasted, and RAM, used or not, is
very fast, so its waste is particularly unfortunate. Disks, on the other
hand, have traditionally been slower than RAM, particularly before SSDs
but even still today. As a result, ideally we would have the system do an
operation with RAM rather than with the disk. So far, hopefully this is
all clear.

Linux (the kernel) automatically keeps recently used file data in RAM
in a filesystem cache. This is why you have things like 'Active(file)'
below (or Buffers above) showing roughly 12 GiB of use. Any time you then
ask the OS for that file, the OS will first check its cache and,
assuming the cached copy is current, serve the file from there. Writes
still go to disk, of course, and update the cache, but since many files
are read far more often than they are written, it makes sense to load
from cache and leave the slow disk out of the picture. As a result, the
system's built-up cache of commonly used files helps system performance
significantly.

Of course, the often-asked question (see Google) that follows is, "What
happens when I need to load a process that takes 50+ GiB of RAM and my
system only shows 'free' RAM of 49 GiB?" The answer is that the system
isn't stupid: it recognizes that a need for RAM trumps a desire to cache
things and, because RAM is very fast, it simply clears out the cache to
make room for a real process's needs. You can easily test this with a few
utilities online that test RAM, or a simple script that does the same, or
real-life programs. When you do, you will see that the cache amounts drop
and the RAM is given to the process that needs it.
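
A "simple script" of that kind could look roughly like the Python sketch
below. It is only an illustration: the function names and the 4096-byte
page size are assumptions, and the size to allocate is up to you (start
small, and stay below RAM plus swap).

```python
PAGE = 4096  # assumed page size in bytes


def allocate_and_touch(mib):
    """Allocate `mib` MiB and write one byte per page so the kernel
    has to back every page with real memory."""
    buf = bytearray(mib * 1024 * 1024)
    for offset in range(0, len(buf), PAGE):
        buf[offset] = 1
    return buf


def cache_kb(meminfo_text):
    """Sum the Buffers and Cached values (kB) in /proc/meminfo-style text."""
    total = 0
    for line in meminfo_text.splitlines():
        fields = line.split()
        if fields and fields[0] in ("Buffers:", "Cached:"):
            total += int(fields[1])
    return total


# Intended use on a node (not run here):
#   before = cache_kb(open("/proc/meminfo").read())
#   hog = allocate_and_touch(50 * 1024)   # keep a reference while measuring
#   after = cache_kb(open("/proc/meminfo").read())
```

If the allocation is larger than MemFree, the second reading should show
a smaller cache than the first.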

Usually this question comes up (in this forum as well as other places)
with the 'free' command output, so perhaps review that as its simpler
dataset makes things a little clearer than the verbose output from
/proc/meminfo, and if you Google for this question and the 'free' command
you'll see this same kind of response all over the Internet.
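
To show what that simpler view would say for your two nodes, here is a
small Python sketch. It assumes the older 'free' layout that prints a
"-/+ buffers/cache" line; the function name free_summary is invented for
this example, and the numbers are the kB values from your post.

```python
def free_summary(total, free, buffers, cached):
    """Mimic the figures of the 'Mem:' and '-/+ buffers/cache:' lines (kB)."""
    used = total - free
    return {
        "used": used,
        "free": free,
        # What applications really hold once cache is discounted.
        "used_minus_cache": used - buffers - cached,
        # What is free once reclaimable cache is counted.
        "free_plus_cache": free + buffers + cached,
    }


bad = free_summary(66066804, 49673020, 12298540, 1467596)    # node22
good = free_summary(66066804, 63056704, 814328, 668740)      # node08

for name, node in (("node22", bad), ("node08", good)):
    print(name, node["used_minus_cache"], node["free_plus_cache"])
```

Counted this way the two nodes differ by roughly 1 GiB, not by the 13 GiB
that MemFree alone suggests.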

> SwapCached: 0 kB
> Active: 12875052 kB
> Inactive: 948312 kB
> Active(anon): 42288 kB
> Inactive(anon): 88 kB
> Active(file): 12832764 kB
> Inactive(file): 948224 kB
> Unevictable: 0 kB
> Mlocked: 0 kB
> SwapTotal: 16779260 kB
> SwapFree: 16779260 kB
> Dirty: 52 kB
> Writeback: 0 kB
> AnonPages: 42236 kB
> Mapped: 13780 kB
> Shmem: 140 kB
> Slab: 1638204 kB
> SReclaimable: 1578768 kB
> SUnreclaim: 59436 kB
> KernelStack: 6400 kB
> PageTables: 2696 kB
> NFS_Unstable: 0 kB
> Bounce: 0 kB
> WritebackTmp: 0 kB
> CommitLimit: 49812660 kB
> Committed_AS: 115352 kB
> VmallocTotal: 34359738367 kB
> VmallocUsed: 383588 kB
> VmallocChunk: 34359352372 kB
> HardwareCorrupted: 0 kB
> AnonHugePages: 14336 kB
> HugePages_Total: 0
> HugePages_Free: 0
> HugePages_Rsvd: 0
> HugePages_Surp: 0
> Hugepagesize: 2048 kB
> DirectMap4k: 227328 kB
> DirectMap2M: 11290624 kB
> DirectMap1G: 55574528 kB
>
>
> GOOD node:
> node08:~ # cat /proc/meminfo
> MemTotal: 66066804 kB
> MemFree: 63056704 kB
> Buffers: 814328 kB
> Cached: 668740 kB

By "good" I presume you mean wasting a lot of RAM by doing nothing
efficient with it. Yes, that describes this machine better than the
other. :-)

> SwapCached: 6528 kB
> Active: 1056692 kB
> Inactive: 511812 kB
> Active(anon): 32872 kB
> Inactive(anon): 47336 kB
> Active(file): 1023820 kB
> Inactive(file): 464476 kB
> Unevictable: 0 kB
> Mlocked: 0 kB
> SwapTotal: 16779260 kB
> SwapFree: 16761860 kB
> Dirty: 88 kB
> Writeback: 0 kB
> AnonPages: 74692 kB
> Mapped: 12376 kB
> Shmem: 52 kB
> Slab: 507524 kB
> SReclaimable: 364540 kB
> SUnreclaim: 142984 kB
> KernelStack: 6416 kB
> PageTables: 2980 kB
> NFS_Unstable: 0 kB
> Bounce: 0 kB
> WritebackTmp: 0 kB
> CommitLimit: 49812660 kB
> Committed_AS: 150608 kB
> VmallocTotal: 34359738367 kB
> VmallocUsed: 383588 kB
> VmallocChunk: 34359352372 kB
> HardwareCorrupted: 0 kB
> AnonHugePages: 32768 kB
> HugePages_Total: 0
> HugePages_Free: 0
> HugePages_Rsvd: 0
> HugePages_Surp: 0
> Hugepagesize: 2048 kB
> DirectMap4k: 71680 kB
> DirectMap2M: 4106240 kB
> DirectMap1G: 62914560 kB
>
> --------------------

--
Good luck.


jmozdzen
21-Nov-2018, 14:41
Hi Peer & all,

this seems to be a duplicate of "https://forums.suse.com/showthread.php?12909-Memory-not-free-but-no-process-!" and the post there reports that processes won't start, reporting insufficient memory.

Please also see my answer to that thread, which also references a command to "flush" buffers and caches, to test if this is really the cause for not being able to start the process.

Regards,
J

mpibgc
23-Nov-2018, 13:39
You are right,

doing a "sync; echo 3 > /proc/sys/vm/drop_caches" solved the "problem".

I'll have to run it on all idle nodes, so that the queueing system sees the correct memory values.
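
For running this from a script on each idle node, something like the following Python sketch might do. It assumes Python 3.3 or later (for os.sync) and root privileges; the function name and the path parameter are my own choices, the latter only so the function can be tried against an ordinary file first.

```python
import os

DROP_CACHES = "/proc/sys/vm/drop_caches"


def drop_caches(mode=3, path=DROP_CACHES):
    """Flush dirty pages to disk, then ask the kernel to drop clean caches.

    mode 1 = page cache, 2 = dentries and inodes, 3 = both.
    """
    if mode not in (1, 2, 3):
        raise ValueError("mode must be 1, 2 or 3")
    os.sync()  # same purpose as the leading "sync" in the shell one-liner
    with open(path, "w") as handle:
        handle.write("%d\n" % mode)
```

Calling drop_caches() with the defaults is meant to match "sync; echo 3 > /proc/sys/vm/drop_caches".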

Thanks, Peer