Out of memory process killer¶
Problem¶
Elos should trigger an event when a OOM (Out of memory) killer is invoked and a process is terminated as a result.
Goals¶
When an oom-killer has terminated a process an event corresponding to this is published so that it is simpler to analyse why this has happened.
Assumptions¶
The system has an OOM-killer configured and when system memory is too low for system to function effectively, the OOM-killer is triggered to clean up processes that are using up system memory. This is then logged in the system somewhere.
Solutions¶
Upon OOM-killer invocation a log is created and it is found to logged as kernel
log message and we already have a Kmsg Scanner
which publishes kernel log
messages as event, so having made this clear we have the following option to publish a
separate event for OOM
Extend existing Kmsg Scanner : The configuration for
Kmsg Scanner
does not contain a ‘MappingRules’ option, to map an event based on a givenMessageCode
filter. The configuration can be extended by adding aMappingRules
option as given below:"KmsgScanner": { "KmsgFile": "/dev/kmsg" "MappingRules": { "MessageCodes": { "5020": ".event.payload '.*Out of memory: Killed process*' REGEX" } } }
When this is done, the Kmsg Scanner
shall be expanded similar to Syslog Scanner
to
include an logline mapper
where the Kernel log
is parsed to check for OOM-killer
log message and when a match is found an event is published.
pros
Expansion to filter other messages like OOPs can be easily implemented by adding an appropriate filter and parsing the log line for the same.
Use of existing code. Implementation will quicker since we can copy log line mapper code from
Syslog Scanner
cons
Kernel logs are not in caononical form, needs to be analysed and parsed, and
log line mapper
fromSyslog Scanner
can not be used as is. Complex to implement.
Implement a new OOM-killer scanner :
a) Implement a new oom-killer scanner from scratch. Since the
Kmsg Scanner
already publishes kernel log messages as events, a new scanner is implemented that subscribes to the events published by theKmsg Scanner
and when an event payload has an oom-killer message, then the oom-killer invoked event is published.b) Implement a new scanner from scratch and directly link to kernel using netlink and scan incoming messages for oom-killer activity.
pros
a) Subscribing to existing `Kmsg Scanner` event :
- Event already exists in canonical form, no need parse log lines, only
subscription to publishing client and generating new event needs to be done.
Simple implementation.
- Expand subscription to other events like OOP's easily.
b) Connect to kernel using nelink socket :
- Directly access to kernel logs to check for invocation of oom-killer, which means
this is quicker.
cons
a) Takes time, since the subscribing client needs to wait for the `Kmsg Scanner`
to publish a event before creating a new event.
b) Netlink library needs to be analysed to check if netlink provides necessary
protocols for interacting with kernel logs directly. Upon analysing netlink
protocols to interact with kernel, it is found that netlink provides the
following protocols
NETLINK_KOBJECT_UEVENT : This protocol is used for recieving communication from
the kernel related to device and driver management.
NETLINK_CONNECTOR : This protocol is used for generic communication with the kernel
used for communicating, process, network and filesystem based events.
Programming a simple monitor to check for oom-killer using the above protocols did not provide
the expected result. On further reading, a user defined protocol can be defined to implement the
above test, but this is not tested yet. There is a lot of over head here in implementing a custom
protocol for this purpose.
Implement a Scanner from another source
Instead of the kernel log another source where an indication of oom-killer invocation is present can be used. The following interfaces can be checked for oom-killer occurence:
check
/proc/vmstat/
for oom_kill_count, if greater than 0 then publish an eventread
/sys/kernel/debug/oom\_kill
to get information about oom-killerread
/sys/kernel/debug/oom\_kill\_allocating\_task
to get information about oom-killer
A scanner can be implemented to scann the above interfaces to check for oom-killer activity.
pros
This is similar to the existing
coredump
client and can be implemented relatively easily.
cons
Depends heavily on kernel debug configuration and
proc
interfaces, which might always not be available. The kernel needs to be configured to includesys/kernel/debug
interface. This is not a client requirement and hence not implemented, therefore this method is not feasible.
Decision¶
Having taken into account all pros and cons of all the provided solutions given above it is decided that a separate
scanner that subscribes to Kmsg Scanner
events for oom-killer invocation will be implemented.
Open Points¶
When parsing a Kernel log event for oom killer invocation, it is possible to retrieve the process name and its pid,
but it is not possible to retrieve the path. This is because the proc
interface with the pid is not available after
the process is killed by the oom killer. To predetermine which process will be killed by oom killer is too much of an
overhead. The oom killer process path will therefore be set to empty “”.