# Out of memory process killer

## Problem 

Elos should trigger an event when a OOM (Out of memory) killer is invoked and a
process is terminated as a result.


## Goals
When an oom-killer has terminated a process an event corresponding to this is
published so that it is simpler to analyse why this has happened.


## Assumptions

The system has an OOM-killer configured and when system memory is too low for
system to function effectively, the OOM-killer is triggered to clean up
processes that are using up system memory. This is then logged in the system 
somewhere.


## Solutions

Upon OOM-killer invocation a log is created and it is found to logged as kernel
log message and we already have a `Kmsg Scanner` which publishes kernel log
messages as event, so having made this clear we have the following option to publish a 
separate event for OOM

1) **Extend existing Kmsg Scanner** : The configuration for `Kmsg Scanner` does not 
   contain a 'MappingRules' option, to map an event based on a given `MessageCode` filter.
   The configuration can be extended by adding a `MappingRules` option as given below:


   ```json
        "KmsgScanner": {
            "KmsgFile": "/dev/kmsg"
            "MappingRules": {
                "MessageCodes": {
                    "5020": ".event.payload '.*Out of memory: Killed process*' REGEX"
                }
            }
        }
   ```

  When this is done, the `Kmsg Scanner` shall be expanded similar to `Syslog Scanner` to 
  include an `logline mapper` where the `Kernel log` is parsed to check for `OOM-killer`
  log message and when a match is found an event is published.

  **pros**
  
   - Expansion to filter other messages like OOPs can be easily implemented by adding an
     appropriate filter and parsing the log line for the same.
   - Use of existing code. Implementation will quicker since we can copy log line mapper
     code from `Syslog Scanner`
  
  
  **cons**
  
   - Kernel logs are not in caononical form, needs to be analysed and parsed, and 
     `log line mapper` from  `Syslog Scanner` can not be used as is. Complex to implement.


2) **Implement a new OOM-killer scanner** : 

    a) Implement a new oom-killer scanner from scratch. Since the `Kmsg Scanner` already
       publishes kernel log messages as events, a new scanner is implemented that subscribes 
       to the events published by the `Kmsg Scanner` and when an event payload has an oom-killer
       message, then the oom-killer invoked event is published.

    b) Implement a new scanner from scratch and directly link to kernel using netlink and
       scan incoming messages for oom-killer activity.

  **pros**
  
    a) Subscribing to existing `Kmsg Scanner` event :

         - Event already exists in canonical form, no need parse log lines, only 
           subscription to publishing client and generating new event needs to be done.
           Simple implementation.
         - Expand subscription to other events like OOP's easily.

    b) Connect to kernel using nelink socket :

         - Directly access to kernel logs to check for invocation of oom-killer, which means
           this is quicker. 
  
  **cons**
  
    a) Takes time, since the subscribing client needs to wait for the `Kmsg Scanner` 
       to publish a event before creating a new event.
  
    b) Netlink library needs to be analysed to check if netlink provides necessary 
       protocols for interacting with kernel logs directly. Upon analysing netlink
       protocols to interact with kernel, it is found that netlink provides the 
       following protocols
  
       NETLINK_KOBJECT_UEVENT : This protocol is used for recieving communication from
                                 the kernel related to device and driver management.
  
       NETLINK_CONNECTOR : This protocol is used for generic communication with the kernel
                            used for communicating, process, network and filesystem based events.
  
       Programming a simple monitor to check for oom-killer using the above protocols did not provide
       the expected result. On further reading, a user defined protocol can be defined to implement the
       above test, but this is not tested yet. There is a lot of over head here in implementing a custom
       protocol for this purpose.


3) **Implement a Scanner from another source**
 
   Instead of the kernel log another source where an indication of oom-killer invocation is
   present can be used. The following interfaces can be checked for oom-killer occurence:
   
   * check `/proc/vmstat/` for oom\_kill\_count, if greater than 0 then publish an event
   * read `/sys/kernel/debug/oom\_kill` to get information about oom-killer
   * read `/sys/kernel/debug/oom\_kill\_allocating\_task`  to get information about oom-killer
   
   A scanner can be implemented to scann the above interfaces to check for oom-killer activity.

  **pros**

   - This is similar to the existing `coredump` client and can be implemented relatively easily.

  **cons**

   - Depends heavily on kernel debug configuration and `proc` interfaces, which might always not be
     available. The kernel needs to be  configured to include `sys/kernel/debug` interface. This is 
     not a client requirement and hence not implemented, therefore this method is not feasible.


## Decision

Having taken into account all pros and cons of all the provided solutions given above it is decided that a separate
scanner that subscribes to `Kmsg Scanner` events for oom-killer invocation will be implemented.


## Open Points

When parsing a Kernel log event for oom killer invocation, it is possible to retrieve the process name and its pid,
but it is not possible to retrieve the path. This is because the `proc` interface with the pid is not available after
the process is killed by the oom killer. To predetermine which process will be killed by oom killer is too much of an
overhead. The oom killer process path will therefore be set to empty "".