A lightweight distributed cron-like job system (made while learning Rust)
era a4c69ee336 fixing some small warnings 2022-04-28 16:22:37 +01:00

dcron

Status: WIP

The idea for this project is to enable users to schedule and run cron-like jobs without having to worry about where the script will run. For this to happen, a client should be able to talk to any server to schedule a job or retrieve a job result/log. To keep the logs accessible from a web interface, the log is uploaded to an object storage service (such as S3, Ceph or OpenStack Swift) once the job is done. The client also uploads the script it wants to run to an object storage service and gives the server only the time it wants the script to run (using cron syntax) and the interpreter the server should use to run it.

If the job is set for every 5 minutes and one execution is taking more than that, the next execution will be delayed until the job is finished.
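The delay rule above can be sketched as a small piece of per-job state. This is only an illustration under stated assumptions: `JobState`, its fields and the minute-based clock are hypothetical names, and skipping every slot missed during a long run (instead of catching up) is an assumed policy, not something the project confirms.

```rust
/// Hypothetical per-job scheduling state. Times are whole minutes since an
/// arbitrary epoch; none of these names come from the dcron source.
#[derive(Debug, Clone, PartialEq)]
pub struct JobState {
    /// Interval between runs, e.g. 5 for "every 5 minutes".
    pub interval_min: u64,
    /// Earliest time at which the next run is due.
    pub next_due: u64,
    /// True while an execution is still in progress.
    pub running: bool,
}

impl JobState {
    pub fn new(interval_min: u64) -> Self {
        // An interval of zero would never advance, so clamp it to one minute.
        Self { interval_min: interval_min.max(1), next_due: 0, running: false }
    }

    /// A run may start only when it is due and the previous one has finished.
    pub fn should_start(&self, now: u64) -> bool {
        !self.running && now >= self.next_due
    }

    /// Marks the job as running and moves `next_due` to the first slot after `now`.
    /// Assumption: slots missed during a long run are skipped, not replayed.
    pub fn start(&mut self, now: u64) {
        self.running = true;
        while self.next_due <= now {
            self.next_due += self.interval_min;
        }
    }

    /// Marks the current execution as finished (successfully or not).
    pub fn finish(&mut self) {
        self.running = false;
    }
}
```

With a 5-minute interval, a run started at minute 0 that is still going at minute 5 blocks the next run; once it finishes, the delayed run can start right away.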

The service itself uses a document database (the DocumentDB), such as CouchDB (https://couchdb.apache.org/), to store its data.

The service assumes the client is trustworthy, so there won't be any complex checks regarding the safety of scripts.

The service always looks 5 minutes backward when scheduling jobs. If the service is down for a long period, it won't try to run all the jobs it missed during that period.
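A minimal sketch of the look-back rule, assuming scheduled run times are already expanded into Unix timestamps in seconds. The function name, the constant and the inclusive window bounds are assumptions made for illustration.

```rust
/// Look-back window from the README: 5 minutes, expressed in seconds.
pub const LOOKBACK_SECS: u64 = 5 * 60;

/// Keeps only the scheduled run times that fall inside
/// `[now - LOOKBACK_SECS, now]`. Older runs are dropped, which is why a long
/// outage does not trigger a backlog of executions. Future runs are left for
/// a later poll.
pub fn due_within_lookback(scheduled: &[u64], now: u64) -> Vec<u64> {
    let window_start = now.saturating_sub(LOOKBACK_SECS);
    scheduled
        .iter()
        .copied()
        .filter(|&t| t >= window_start && t <= now)
        .collect()
}
```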

API

For the public API, look at the Public service definition in proto/dcron.proto. dcron-client is a client of that gRPC service. The Internal service is used for communication between the leader and followers.

Communication between client and server

API: CREATE JOB

The user defines the job name. If the job name is already taken, an error is returned unless the update flag is set.

Request message
  • name: Job name
  • time: cron job syntax to define the frequency and time of execution
  • script type: python or ruby
  • script_location: object storage service url/id
  • update: boolean flag; if true and a job with this name already exists, that job is updated.
Response message
  • success message/errorcode
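
The request fields and the name-collision rule can be sketched with plain Rust types. This is not the generated gRPC code: `JobStore`, `CreateJobError` and the in-memory map are hypothetical stand-ins for the real service and its database, and the example URL is made up.

```rust
use std::collections::HashMap;

/// Interpreters listed in the README.
#[derive(Debug, Clone, PartialEq)]
pub enum ScriptType {
    Python,
    Ruby,
}

/// Mirrors the CREATE JOB request fields described above.
#[derive(Debug, Clone, PartialEq)]
pub struct CreateJobRequest {
    pub name: String,
    /// Cron syntax, e.g. "*/5 * * * *".
    pub time: String,
    pub script_type: ScriptType,
    /// Object storage URL or id of the script.
    pub script_location: String,
    /// If true, an existing job with the same name is overwritten.
    pub update: bool,
}

#[derive(Debug, PartialEq)]
pub enum CreateJobError {
    NameTaken,
}

/// Hypothetical in-memory stand-in for the job collection in the database.
#[derive(Default)]
pub struct JobStore {
    jobs: HashMap<String, CreateJobRequest>,
}

impl JobStore {
    /// Rejects a duplicate name unless the request asks for an update.
    pub fn create_job(&mut self, req: CreateJobRequest) -> Result<(), CreateJobError> {
        if self.jobs.contains_key(&req.name) && !req.update {
            return Err(CreateJobError::NameTaken);
        }
        self.jobs.insert(req.name.clone(), req);
        Ok(())
    }

    pub fn get(&self, name: &str) -> Option<&CreateJobRequest> {
        self.jobs.get(name)
    }
}
```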

API: DISABLE JOB EXECUTIONS

Jobs cannot be deleted, but they can be disabled if needed.

Request message
  • name: Job name

Response message

  • success / error code
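
A small sketch of the disable-only rule, assuming a registry keyed by job name. `JobRegistry` and `DisableError` are hypothetical names; the point is only that disabling flips a flag and never removes the record.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq)]
pub enum DisableError {
    UnknownJob,
}

/// Hypothetical registry mapping job name to its enabled flag.
#[derive(Default)]
pub struct JobRegistry {
    enabled_by_name: BTreeMap<String, bool>,
}

impl JobRegistry {
    /// New jobs start out enabled.
    pub fn register(&mut self, name: &str) {
        self.enabled_by_name.insert(name.to_string(), true);
    }

    /// Disables a job. The record itself stays in the registry.
    pub fn disable(&mut self, name: &str) -> Result<(), DisableError> {
        match self.enabled_by_name.get_mut(name) {
            Some(enabled) => {
                *enabled = false;
                Ok(())
            }
            None => Err(DisableError::UnknownJob),
        }
    }

    /// True if the job exists, whether enabled or disabled.
    pub fn contains(&self, name: &str) -> bool {
        self.enabled_by_name.contains_key(name)
    }

    /// Names of the jobs the scheduler should still consider, in sorted order.
    pub fn schedulable(&self) -> Vec<&str> {
        self.enabled_by_name
            .iter()
            .filter(|(_, enabled)| **enabled)
            .map(|(name, _)| name.as_str())
            .collect()
    }
}
```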

API: GET JOB EXECUTIONS

Request message
  • Job name
Response message
  • Array of execution object:
    • Execution date
    • Execution result
    • Logs path

Communication between leader and workers

The distributed system has an active (leader) node, which is responsible for polling the database every minute. When there is a new job to be executed, the leader sends it to a worker node (for now chosen by round-robin). If the worker is busy, it can refuse to execute the job.
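
One way to picture round-robin dispatch with refusal is the sketch below. It assumes the leader simply offers the job to the next worker in turn when one declines; the README does not say what happens after a refusal, so that fallback, like the `Worker` and `RoundRobin` types, is an assumption.

```rust
/// Hypothetical worker handle. `busy` stands in for the worker's own
/// decision to decline an EXECUTE JOB request.
pub struct Worker {
    pub name: String,
    pub busy: bool,
}

impl Worker {
    /// Returns true if the worker accepts the job.
    pub fn try_accept(&self, _job_id: &str) -> bool {
        !self.busy
    }
}

/// Round-robin cursor kept by the leader.
#[derive(Default)]
pub struct RoundRobin {
    next: usize,
}

impl RoundRobin {
    /// Offers the job to each worker at most once, starting at the cursor.
    /// Returns the index of the worker that accepted, or None if all declined.
    pub fn dispatch(&mut self, workers: &[Worker], job_id: &str) -> Option<usize> {
        let n = workers.len();
        for offset in 0..n {
            let idx = (self.next + offset) % n;
            if workers[idx].try_accept(job_id) {
                // The next dispatch starts after the worker that just accepted.
                self.next = (idx + 1) % n;
                return Some(idx);
            }
        }
        None
    }
}
```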

There is no need for the worker to communicate the result of the job back to the leader*; it just needs to update the DocumentDB. A job can have a timeout, in which case the leader retries the job on another node once the timeout expires. If the job does not have a timeout set, the leader waits forever for the execution to finish (failing or succeeding), and manual intervention may be needed.

*Not settled yet; the edge cases around timeouts need more thought.
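
Since the timeout behaviour is still open, the following is only one possible reading of it. It assumes the leader learns about a finished job by seeing a result in the database, and it treats a timeout of 0 as "no timeout", as described under Leader Role. `Dispatched`, `LeaderAction` and `next_action` are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
pub enum LeaderAction {
    /// Keep waiting for the worker.
    Wait,
    /// Timeout expired with no result: send the job to another node.
    Retry,
    /// A result was recorded in the database; nothing left to do.
    Done,
}

/// Hypothetical bookkeeping the leader keeps for a job it has handed out.
pub struct Dispatched {
    pub started_at_min: u64,
    /// Timeout in minutes; 0 means no timeout (wait forever).
    pub timeout_min: u64,
    /// True once an execution result for this run is visible in the database.
    pub result_recorded: bool,
}

/// Decides what the leader does with a dispatched job at time `now_min`.
pub fn next_action(job: &Dispatched, now_min: u64) -> LeaderAction {
    if job.result_recorded {
        return LeaderAction::Done;
    }
    if job.timeout_min == 0 {
        return LeaderAction::Wait;
    }
    let elapsed = now_min.saturating_sub(job.started_at_min);
    if elapsed >= job.timeout_min {
        LeaderAction::Retry
    } else {
        LeaderAction::Wait
    }
}
```

Note that a retry can overlap with a slow first execution that is still running, which is why scripts should tolerate two runs in parallel.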

API: EXECUTE JOB

Leader node to worker

Request message
  • job id
  • Timeout in minutes
  • script type
  • script location
Response message
  • Accepted status (declined/accepted)

API: SAVE JOB STATUS

Worker to leader node

Request message
  • Job id
  • Log path
  • exit status code

Implementation Details

Leader Role

The leader only needs to coordinate which machine runs which job. The machines can talk directly to the DocumentDB to save job results and to insert new jobs.

The Execution document in the database is append-only, meaning that if two machines execute the same job because of a timeout, both executions are kept in the database.
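
The append-only behaviour can be illustrated with an in-memory log. This sketch assumes one record per execution with the fields from the SAVE JOB STATUS message plus a worker name; `ExecutionLog` is a hypothetical stand-in for the database collection, not the project's DB trait.

```rust
/// One execution record; the field names are illustrative.
#[derive(Debug, Clone, PartialEq)]
pub struct Execution {
    pub job_id: String,
    pub worker: String,
    pub exit_code: i32,
    pub log_path: String,
}

/// Hypothetical append-only store: records are added, never edited or removed.
#[derive(Default)]
pub struct ExecutionLog {
    entries: Vec<Execution>,
}

impl ExecutionLog {
    /// Appends a record, even if another record for the same job already exists.
    pub fn append(&mut self, execution: Execution) {
        self.entries.push(execution);
    }

    /// All recorded executions of one job, in insertion order.
    pub fn for_job(&self, job_id: &str) -> Vec<&Execution> {
        self.entries.iter().filter(|e| e.job_id == job_id).collect()
    }
}
```

If a timeout causes a second machine to run the same job, both records stay visible, so a GET JOB EXECUTIONS call would return two entries for that run.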

Timeouts can be set to zero, meaning the service won't retry the job until the execution finishes. You should still always write your scripts keeping in mind that two jobs can run in parallel (in case of a network split).

Libre Software

This will be Free Software as defined by the Free Software Foundation: https://www.gnu.org/philosophy/free-sw.html

Thought about, not used

Consensus: https://raft.github.io/