Out of memory process killer

Problem

Elos should trigger an event when a OOM (Out of memory) killer is invoked and a process is terminated as a result.

Goals

When an oom-killer has terminated a process an event corresponding to this is published so that it is simpler to analyse why this has happened.

Assumptions

The system has an OOM-killer configured and when system memory is too low for system to function effectively, the OOM-killer is triggered to clean up processes that are using up system memory. This is then logged in the system somewhere.

Solutions

Upon OOM-killer invocation a log is created and it is found to logged as kernel log message and we already have a Kmsg Scanner which publishes kernel log messages as event, so having made this clear we have the following option to publish a separate event for OOM

  1. Extend existing Kmsg Scanner : The configuration for Kmsg Scanner does not contain a ‘MappingRules’ option, to map an event based on a given MessageCode filter. The configuration can be extended by adding a MappingRules option as given below:

         "KmsgScanner": {
             "KmsgFile": "/dev/kmsg"
             "MappingRules": {
                 "MessageCodes": {
                     "5020": ".event.payload '.*Out of memory: Killed process*' REGEX"
                 }
             }
         }
    

When this is done, the Kmsg Scanner shall be expanded similar to Syslog Scanner to include an logline mapper where the Kernel log is parsed to check for OOM-killer log message and when a match is found an event is published.

pros

  • Expansion to filter other messages like OOPs can be easily implemented by adding an appropriate filter and parsing the log line for the same.

  • Use of existing code. Implementation will quicker since we can copy log line mapper code from Syslog Scanner

cons

  • Kernel logs are not in caononical form, needs to be analysed and parsed, and log line mapper from Syslog Scanner can not be used as is. Complex to implement.

  1. Implement a new OOM-killer scanner :

    a) Implement a new oom-killer scanner from scratch. Since the Kmsg Scanner already publishes kernel log messages as events, a new scanner is implemented that subscribes to the events published by the Kmsg Scanner and when an event payload has an oom-killer message, then the oom-killer invoked event is published.

    b) Implement a new scanner from scratch and directly link to kernel using netlink and scan incoming messages for oom-killer activity.

pros

a) Subscribing to existing `Kmsg Scanner` event :

     - Event already exists in canonical form, no need parse log lines, only 
       subscription to publishing client and generating new event needs to be done.
       Simple implementation.
     - Expand subscription to other events like OOP's easily.

b) Connect to kernel using nelink socket :

     - Directly access to kernel logs to check for invocation of oom-killer, which means
       this is quicker. 

cons

a) Takes time, since the subscribing client needs to wait for the `Kmsg Scanner` 
   to publish a event before creating a new event.

b) Netlink library needs to be analysed to check if netlink provides necessary 
   protocols for interacting with kernel logs directly. Upon analysing netlink
   protocols to interact with kernel, it is found that netlink provides the 
   following protocols

   NETLINK_KOBJECT_UEVENT : This protocol is used for recieving communication from
                             the kernel related to device and driver management.

   NETLINK_CONNECTOR : This protocol is used for generic communication with the kernel
                        used for communicating, process, network and filesystem based events.

   Programming a simple monitor to check for oom-killer using the above protocols did not provide
   the expected result. On further reading, a user defined protocol can be defined to implement the
   above test, but this is not tested yet. There is a lot of over head here in implementing a custom
   protocol for this purpose.
  1. Implement a Scanner from another source

    Instead of the kernel log another source where an indication of oom-killer invocation is present can be used. The following interfaces can be checked for oom-killer occurence:

    • check /proc/vmstat/ for oom_kill_count, if greater than 0 then publish an event

    • read /sys/kernel/debug/oom\_kill to get information about oom-killer

    • read /sys/kernel/debug/oom\_kill\_allocating\_task to get information about oom-killer

    A scanner can be implemented to scann the above interfaces to check for oom-killer activity.

pros

  • This is similar to the existing coredump client and can be implemented relatively easily.

cons

  • Depends heavily on kernel debug configuration and proc interfaces, which might always not be available. The kernel needs to be configured to include sys/kernel/debug interface. This is not a client requirement and hence not implemented, therefore this method is not feasible.

Decision

Having taken into account all pros and cons of all the provided solutions given above it is decided that a separate scanner that subscribes to Kmsg Scanner events for oom-killer invocation will be implemented.

Open Points

When parsing a Kernel log event for oom killer invocation, it is possible to retrieve the process name and its pid, but it is not possible to retrieve the path. This is because the proc interface with the pid is not available after the process is killed by the oom killer. To predetermine which process will be killed by oom killer is too much of an overhead. The oom killer process path will therefore be set to empty “”.