Architecture Design Record - Event Storage Backend¶
Problem¶
An event storage backend for elos is responsible for persisting and retrieving events, using techniques that fulfill the characteristics required to store and retrieve one or more classes of events (event storage classes).
Influencing factors¶
retention policy
assurance about write integrity
assurance about data integrity
write speed
read speed
flash usage per event persistence operation
space requirements per event (compression)
The task is now to define possible event storage backend implementations and their characteristics. These can then be used to choose the implementation that fits best to store a given class of events.
Assumptions¶
Considered Alternatives¶
1) RDBMS - SQLite¶
SQLite is not a conventional RDBMS, as it comes as a library rather than following the classical client/server approach. SQLite targets embedded systems and is intended to provide a commonly used and well-known SQL interface to manage data.
TBD: evaluation, measurements, PoC
pros
optimized for embedded system by design
extension API to add custom driver to store data
cons
probably less effective on big, complex data sets than common full-featured RDBMSs
But: is it even intended to manage data quantities and data structures complex enough that a classical RDBMS representative should be considered?
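For illustration, a minimal sketch of how an event could be persisted via the SQLite C library; the table layout and column names are assumptions for this example, not the elos schema:

```c
#include <sqlite3.h>

/* Minimal sketch: persist one event into a local SQLite file.
 * The table layout and column names are illustrative assumptions. */
int store_event(const char *dbPath, long date, int severity, const char *payload) {
    sqlite3 *db = NULL;
    sqlite3_stmt *stmt = NULL;
    int rc = sqlite3_open(dbPath, &db);
    if (rc != SQLITE_OK) return rc;

    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS events (date INTEGER, severity INTEGER, payload TEXT);",
        NULL, NULL, NULL);

    rc = sqlite3_prepare_v2(db,
        "INSERT INTO events (date, severity, payload) VALUES (?, ?, ?);",
        -1, &stmt, NULL);
    if (rc == SQLITE_OK) {
        sqlite3_bind_int64(stmt, 1, date);
        sqlite3_bind_int(stmt, 2, severity);
        sqlite3_bind_text(stmt, 3, payload, -1, SQLITE_TRANSIENT);
        rc = (sqlite3_step(stmt) == SQLITE_DONE) ? SQLITE_OK : SQLITE_ERROR;
    }
    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return rc;
}
```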
2) NoSQL – MongoDB¶
MongoDB is a representative of the document-oriented NoSQL databases. Each event can be considered a document in the NoSQL context. NoSQL databases are designed to search through documents and for particular attributes of documents.
TBD: evaluation, measurements, PoC
pros
simplicity: straightforwardly take the event and store it without further processing
cons
needs a MongoDB server, which comes with additional dependencies like Python
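For illustration, a minimal sketch of how an event might be inserted as a document using the MongoDB C driver (libmongoc); the database, collection, and field names are assumptions for this example:

```c
#include <mongoc/mongoc.h>

/* Minimal sketch: store one event as a BSON document via libmongoc.
 * Database, collection and field names are illustrative assumptions. */
int store_event(const char *uri, long date, int severity, const char *payload) {
    mongoc_init();
    mongoc_client_t *client = mongoc_client_new(uri); /* e.g. "mongodb://localhost:27017" */
    mongoc_collection_t *coll = mongoc_client_get_collection(client, "elos", "events");

    bson_t *doc = BCON_NEW("date", BCON_INT64(date),
                           "severity", BCON_INT32(severity),
                           "payload", BCON_UTF8(payload));
    bson_error_t error;
    bool ok = mongoc_collection_insert_one(coll, doc, NULL, NULL, &error);

    bson_destroy(doc);
    mongoc_collection_destroy(coll);
    mongoc_client_destroy(client);
    mongoc_cleanup();
    return ok ? 0 : -1;
}
```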
3) Custom File Storage – Json File¶
To address the special requirements on storing events, a sequential approach is possible in which events are serialized as newline-separated JSON strings.
To reduce writes, techniques like preallocating a file on the backing filesystem and storage are used. To address the atomic write requirement, specific flags for writing like O_SYNC and O_DSYNC can be used. More details on this approach can be obtained from the corresponding design decision.
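A minimal sketch of such an approach, assuming a newline-delimited JSON format, a fixed preallocation size, and O_SYNC appends (not the actual elos implementation):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Minimal sketch: append newline-delimited JSON events to a preallocated file.
 * The preallocation size and flag choices are illustrative assumptions. */
int open_event_log(const char *path) {
    /* O_SYNC makes each write() return only after data and metadata
     * have reached the storage device. */
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
    if (fd < 0) return -1;

    /* Preallocate space on the backing filesystem so later appends do not
     * repeatedly extend the file; FALLOC_FL_KEEP_SIZE keeps the logical
     * file size at the end of the written data (Linux-specific). */
    fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 4 * 1024 * 1024);
    return fd;
}

int append_event(int fd, const char *jsonEvent) {
    /* One serialized event per line: "<json>\n". */
    if (write(fd, jsonEvent, strlen(jsonEvent)) < 0) return -1;
    if (write(fd, "\n", 1) < 0) return -1;
    return 0;
}
```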
TBD: evaluation, measurements, PoC
pros
simplicity: straightforwardly take the event and store it without further processing
an implementation from scratch can be highly customized to the specific needs and capabilities of the target system
cons
probably high development effort
danger of reinventing some other stream or file storage system over time, as more and more lessons are learned
4) systemd-like storage of logs¶
https://systemd.io/JOURNAL_FILE_FORMAT/
https://github.com/systemd/systemd
https://www.freedesktop.org/software/systemd/man/sd-journal.html
systemd's journald subsystem is a logging system not too different from syslog. It is, effectively, a block-based protocol that writes its logs to a socket.
Decision¶
systemd's journald will not be used. If the decision is reached to implement a completely new logging mechanism, the data storage format from journald is a good reference on how to design a logging format that is easily searchable.
Rationale¶
The API of journald does not support writing to a custom file/location, which means that we cannot simply use the API for logging. It is possible to change the location of the logging directory by setting an environment variable and giving it to the journald server. However, using the journald server requires starting all of systemd. From investigations it seems no functions related to the journald server are available to other programs via a shared library. Additionally, due to systemd's design, it seems unlikely that a separate journald server would run without the rest of systemd available on the machine.
Furthermore, it is unsure whether using the journald protocol would satisfy our requirements for a logging protocol. According to the official documentation, it is to be assumed that we need to write at least 3 blocks when creating an entry: one block for the header that needs updating, one block to update the entry array element which will contain the new entry, and one for the entry itself. When the current entry array is full, we might only need to write two blocks, since the entry array struct and the entry itself should fit into a single block. Additionally, a tag struct is sometimes written for corruption protection, but this can fit into the same block as the entry as well. So the best case is two blocks for a single entry write, and the worst case is four. While a new log entry is not necessarily written to the disk instantly, current code research indicates that every write does schedule a sync with the disk. This means that multiple log entries can pile up before the sync actually occurs, which reduces unnecessary updates of the file and list headers. If the number of log entries piling up is sufficiently large, the overhead from those header writes becomes relatively small.
The focus of the protocol is corruption resilience: to ensure that as little data as possible is corrupted and as much of the data as possible remains usable after a corruption is detected. To achieve this, every read checks for data consistency while reading, and tags are written after a certain number of entries. The first protection mechanism is therefore highly dependent on the number of actual reads that happen. The other focus of the protocol is to make the data easily searchable, with a search efficiency of O(n) in the worst case for n total entries, even when searching by multiple parameters.
The compatibility between the protocol's data storage and our event storage is rather good. The format stores the date as well as a “priority” data field, which we can use to store our date and severity data. Additionally, the protocol does not have strong requirements for the names of its data fields, meaning we can store the rest of our event data fields in plain text, with an appropriate encoding of our field names. Combining that with the efficient search using field names as search parameters would make lookups pretty efficient.
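For reference, a minimal sketch of how custom event fields could be written to and searched in the journal via the sd-journal API; the ELOS_* field names are assumptions for this example, not the actual encoding:

```c
#include <systemd/sd-journal.h>
#include <stdio.h>

/* Minimal sketch: write an entry with custom fields and search for it again.
 * The field names ELOS_SEVERITY / ELOS_PAYLOAD are illustrative assumptions. */
int main(void) {
    /* Write: arbitrary FIELD=value pairs, terminated by NULL. */
    sd_journal_send("MESSAGE=example event",
                    "PRIORITY=3",
                    "ELOS_SEVERITY=error",
                    "ELOS_PAYLOAD=something happened",
                    NULL);

    /* Read: open the local journal and match on a field value. */
    sd_journal *j = NULL;
    if (sd_journal_open(&j, SD_JOURNAL_LOCAL_ONLY) < 0) return 1;
    sd_journal_add_match(j, "ELOS_SEVERITY=error", 0);

    SD_JOURNAL_FOREACH(j) {
        const void *data;
        size_t length;
        if (sd_journal_get_data(j, "MESSAGE", &data, &length) >= 0) {
            printf("%.*s\n", (int)length, (const char *)data);
        }
    }
    sd_journal_close(j);
    return 0;
}
```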
Open Points¶
It is unclear, should we be able to create a shared library for the journald server, how much of systemd's other sources we would need to install as well to enable the server to run. It was not possible for our engineers to create such a shared library from those sources; this does not necessarily mean that a more skilled engineer could not make it possible. It is unclear how long the time between a sync scheduling and the actual disk sync is, and how many logs would accumulate in that time. It is unclear how well the corruption protection would work for elos, as this depends on how many lookups actually happen.
5) Apache Avro Storage of logs¶
https://avro.apache.org/docs/1.11.1/api/c/
Avro supports storing binary data in an easy way.
Decision¶
Creating a code PoC is necessary to determine how the API performs with regard to writing blocks. During the creation of the PoC, further development was halted and Avro was abandoned as a possible logging backend.
Rationale¶
It is certain that we can store an event fully in the data structures available from Avro.
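For illustration, a minimal sketch of how such a write path might look with the Avro C API; the schema and field names below are simplified assumptions, not the elos event schema:

```c
#include <avro.h>
#include <stdint.h>

/* Minimal sketch: append one event to an Avro container file.
 * The schema below is a simplified assumption, not the elos event schema. */
static const char EVENT_SCHEMA[] =
    "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
    "{\"name\":\"date\",\"type\":\"long\"},"
    "{\"name\":\"severity\",\"type\":\"int\"},"
    "{\"name\":\"payload\",\"type\":\"string\"}]}";

int store_event(const char *path, int64_t date, int severity, const char *payload) {
    avro_schema_t schema;
    if (avro_schema_from_json_literal(EVENT_SCHEMA, &schema)) return -1;

    avro_file_writer_t writer;
    if (avro_file_writer_create(path, schema, &writer)) return -1;

    avro_value_iface_t *iface = avro_generic_class_from_schema(schema);
    avro_value_t event, field;
    avro_generic_value_new(iface, &event);

    /* Fill the record fields of the event. */
    avro_value_get_by_name(&event, "date", &field, NULL);
    avro_value_set_long(&field, date);
    avro_value_get_by_name(&event, "severity", &field, NULL);
    avro_value_set_int(&field, severity);
    avro_value_get_by_name(&event, "payload", &field, NULL);
    avro_value_set_string(&field, payload);

    int rc = avro_file_writer_append_value(writer, &event);

    avro_value_decref(&event);
    avro_value_iface_decref(iface);
    avro_file_writer_close(writer);
    avro_schema_decref(schema);
    return rc;
}
```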
During the development of the PoC, an issue with the locally available Avro development library was found: the library contained a bug that made it unable to open a file it had previously created. This made it impossible to reopen a file that was previously written to, which makes closing the file during operation impossible and would require a new log file after each start of the application. Additionally, and more importantly, it is impossible to open old log files for reading.
Trying to build Avro locally in order to patch it ourselves proved difficult as well, due to the number of dependencies. Some dependencies are not available locally in the necessary version, which would require building them as well.
When trying to build Avro locally while supplying the necessary dependencies, the build failed for varying reasons, even with the same setup.
Open Points¶
The number of actual writes that happen when storing an event is unclear, but at least from the PoC development it seems reasonable to assume that it is possible to cache multiple events before actually writing them to the file.
6) Time-Series Databases¶
As a representative of Time-Series Databases, InfluxDB was chosen.
https://www.influxdata.com/products/influxdb-overview/.
Decision¶
Creating a code PoC is necessary to determine how the API performs with regard to writing blocks. Due to the unavailability of InfluxDB v2 for Yocto, the PoC was implemented against the API of InfluxDB version 1.8. The code works with version 2 as well, since version 2 is backwards compatible with the version 1 API.
As of the development of this ADR, version 3 of InfluxDB had already been released, but storing was only possible in an Amazon cloud, which is incompatible with the local storage we need for elos.
Further development has not been decided as of yet.
Rationale¶
It is confirmed that we can store an elos event in an InfluxDB table and read it back again.
The current test results have shown that InfluxDB performs better the more similar writes have been done, presumably since it needs to write its metadata for the table only once, and subsequent writes are a lot smaller than for the other loggers.
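For reference, a minimal sketch of how a single event can be written over the InfluxDB 1.x HTTP write API using the line protocol and libcurl; the database name, measurement, and field names are assumptions for this example (a real implementation would also need proper line-protocol escaping of the payload):

```c
#include <curl/curl.h>
#include <stdio.h>

/* Minimal sketch: write one event to InfluxDB 1.x via the /write endpoint.
 * Database name, measurement and field names are illustrative assumptions. */
int store_event(long long timestampNs, const char *severity, const char *payload) {
    CURL *curl = curl_easy_init();
    if (curl == NULL) return -1;

    /* InfluxDB line protocol: measurement,tag=... field=... timestamp */
    char line[512];
    snprintf(line, sizeof(line),
             "events,severity=%s payload=\"%s\" %lld",
             severity, payload, timestampNs);

    curl_easy_setopt(curl, CURLOPT_URL, "http://localhost:8086/write?db=elos");
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, line);

    CURLcode res = curl_easy_perform(curl);
    curl_easy_cleanup(curl);
    return (res == CURLE_OK) ? 0 : -1;
}
```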
Open Points¶
Version 2 of InfluxDB uses a different storage format. The assumption is that it could perform better in writes than the previous storage format.
It is also unclear how the write performance changes should we decide to cache events and write multiple at once, which is easily possible with the InfluxDB API in both versions.
Test Results¶
The following table displays the results of performance tests, executed on the S32G. We measured, primarily, the amount of bits written for a given set of elos events that needed to be logged. This test was executed for 4 different configurations, using 4 different JSON files:
basic.json: Control group, configures no logger.
influxdb.json: Configures the InfluxDB Backend as the logger.
json.json: Configures the Json Logging Backend as the logger.
sqlite.json: Configures the SQLite Database Logger as the logger.
The different columns of the table represent different stages of the test execution. The main bulk of writes should happen between “write_message_start” and “write_message_stop”, with the possibility of a few messages trailing behind due to the system not syncing in time.
Each table block represents a different test run. As visible, there are slight differences in the numbers between the test runs for each configuration file. Some of this can be explained by the fact that we have small amounts of writes even without any actual logging, as can be seen in the values for basic.json. Every difference besides that is in a very small percentage range that we can mostly ignore, or use to calculate an average for each importer.
| name | number events | elosd_start | write_message_start | write_message_stop | elosd_shutdown | sync_before_umount | total |
|---|---|---|---|---|---|---|---|
| basic.json | 1 | 2 | 0 | 0 | 0 | 8 | 10 |
| | 10 | 0 | 0 | 0 | 0 | 2 | 2 |
| | 100 | 0 | 0 | 0 | 0 | 2 | 2 |
| | 1000 | 0 | 0 | 0 | 0 | 2 | 2 |
| influxdb.json | 1 | 0 | 764 | 38 | 0 | 50 | 852 |
| | 10 | 2 | 842 | 38 | 0 | 52 | 934 |
| | 100 | 2 | 1734 | 2 | 0 | 50 | 1788 |
| | 1000 | 2 | 11296 | 2 | 0 | 12 | 11312 |
| json.json | 1 | 4 | 24 | 2 | 0 | 10 | 40 |
| | 10 | 2 | 168 | 2 | 8 | 10 | 190 |
| | 100 | 4 | 1662 | 2 | 0 | 8 | 1676 |
| | 1000 | 2 | 16548 | 2 | 0 | 10 | 16562 |
| sqlite.json | 1 | 2 | 188 | 16 | 0 | 10 | 216 |
| | 10 | 2 | 1884 | 16 | 0 | 10 | 1912 |
| | 100 | 2 | 18840 | 16 | 0 | 10 | 18868 |
| | 1000 | 2 | 188564 | 16 | 0 | 10 | 188592 |

| name | number events | elosd_start | write_message_start | write_message_stop | elosd_shutdown | sync_before_umount | total |
|---|---|---|---|---|---|---|---|
| basic.json | 1 | 0 | 0 | 0 | 10 | 6 | 16 |
| | 10 | 0 | 0 | 0 | 10 | 6 | 16 |
| | 100 | 0 | 0 | 0 | 0 | 2 | 2 |
| | 1000 | 0 | 0 | 0 | 10 | 6 | 16 |
| influxdb.json | 1 | 0 | 764 | 38 | 0 | 50 | 852 |
| | 10 | 0 | 842 | 38 | 0 | 52 | 932 |
| | 100 | 0 | 1730 | 2 | 0 | 50 | 1782 |
| | 1000 | 0 | 11296 | 2 | 0 | 10 | 11308 |
| json.json | 1 | 2 | 24 | 2 | 8 | 10 | 46 |
| | 10 | 2 | 168 | 2 | 0 | 10 | 182 |
| | 100 | 2 | 1664 | 2 | 0 | 8 | 1676 |
| | 1000 | 2 | 16694 | 2 | 0 | 10 | 16708 |
| sqlite.json | 1 | 2 | 190 | 16 | 0 | 10 | 218 |
| | 10 | 2 | 1884 | 16 | 8 | 10 | 1920 |
| | 100 | 2 | 18840 | 16 | 0 | 10 | 18868 |
| | 1000 | 2 | 188592 | 18 | 0 | 10 | 188622 |

| name | number events | elosd_start | write_message_start | write_message_stop | elosd_shutdown | sync_before_umount | total |
|---|---|---|---|---|---|---|---|
| basic.json | 1 | 0 | 0 | 0 | 10 | 6 | 16 |
| | 10 | 0 | 0 | 0 | 10 | 6 | 16 |
| | 100 | 2 | 0 | 6 | 2 | 8 | 18 |
| | 1000 | 0 | 8 | 2 | 8 | 8 | 26 |
| influxdb.json | 1 | 0 | 764 | 38 | 0 | 50 | 852 |
| | 10 | 0 | 886 | 2 | 0 | 54 | 942 |
| | 100 | 0 | 1732 | 2 | 0 | 50 | 1784 |
| | 1000 | 0 | 11300 | 2 | 0 | 12 | 11314 |
| json.json | 1 | 2 | 24 | 2 | 0 | 10 | 38 |
| | 10 | 2 | 168 | 2 | 8 | 10 | 190 |
| | 100 | 2 | 1664 | 2 | 0 | 8 | 1676 |
| | 1000 | 4 | 16686 | 2 | 0 | 10 | 16702 |
| sqlite.json | 1 | 2 | 188 | 16 | 0 | 10 | 216 |
| | 10 | 2 | 1884 | 16 | 0 | 10 | 1912 |
| | 100 | 2 | 18844 | 16 | 0 | 10 | 18872 |
| | 1000 | 2 | 188568 | 16 | 0 | 10 | 188596 |

| name | number events | elosd_start | write_message_start | write_message_stop | elosd_shutdown | sync_before_umount | total |
|---|---|---|---|---|---|---|---|
| basic.json | 1 | 0 | 0 | 0 | 0 | 2 | 2 |
| | 10 | 0 | 0 | 0 | 10 | 6 | 16 |
| | 100 | 0 | 0 | 0 | 0 | 16 | 16 |
| | 1000 | 0 | 0 | 0 | 10 | 6 | 16 |
| influxdb.json | 1 | 0 | 764 | 38 | 0 | 50 | 852 |
| | 10 | 0 | 842 | 38 | 10 | 10 | 900 |
| | 100 | 2 | 1732 | 2 | 0 | 50 | 1786 |
| | 1000 | 2 | 11290 | 2 | 0 | 12 | 11306 |
| json.json | 1 | 2 | 24 | 2 | 8 | 10 | 46 |
| | 10 | 4 | 168 | 2 | 0 | 10 | 184 |
| | 100 | 2 | 1664 | 2 | 0 | 8 | 1676 |
| | 1000 | 2 | 16546 | 2 | 0 | 10 | 16560 |
| sqlite.json | 1 | 2 | 188 | 16 | 0 | 10 | 216 |
| | 10 | 2 | 1886 | 16 | 0 | 10 | 1914 |
| | 100 | 2 | 18840 | 16 | 0 | 10 | 18868 |
| | 1000 | 2 | 188564 | 16 | 0 | 10 | 188592 |