📞 On Call

Pre-Incident

We understand that errors are a normal part of software engineering. However, we must be mindful of the number of errors we can afford to make. It's important that we never blame anyone personally but instead focus on the code and the system.

During Incident

📞 If you receive an on-call notification, please acknowledge your presence by writing a message in the #on-call channel. This will help us ensure that we have the right person on call and can quickly resolve any issues.

🚨 If you're unsure how to handle a particular issue, escalate it to the relevant team member for further guidance.

🤝 We appreciate your help and commitment to keeping our systems running smoothly, even during off-hours. Remember, with great power comes great responsibility.

Post-Incident

Please open a ticket to track the incident and collaborate with the team to identify actionable items to prevent similar issues from happening in the future.

Playbook

Here are the steps to follow when something goes wrong:

Ask the team internally about the credentials.

Check Worker Server

  1. Log in to the worker machine.

  2. Navigate to the directory that contains the docker compose file.

  3. Make sure the containers are up and running:

docker compose ps

  4. View the logs:

docker compose logs
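
During an incident the full log output can be long, so it may help to limit it to the most recent lines. The following is a minimal sketch, assuming it is run from the directory that holds the docker compose file. The `check_worker` helper name and the default of 100 lines are illustrative assumptions, not part of the original playbook.

```shell
# Hypothetical helper: show container status, then the most recent log lines.
# Run it from the directory that contains the docker compose file.
check_worker() {
  tail_lines="${1:-100}"   # assumed default; adjust to the incident
  docker compose ps
  docker compose logs --tail "$tail_lines"
}
```

Calling `check_worker 200` would print the status table followed by the last 200 log lines of each container.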

View API Server Resource Utilization from Cloud Provider

Scale API Server Replicas in K8s

  1. Log in to the DevOps machine using SSH.

  2. Edit the k8s/ap-prod.yml file.

  3. Apply the changes using the following command:

kubectl apply -f k8s/ap-prod.yml
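
After applying the change, it is usually worth confirming that the new replicas actually came up. A minimal sketch follows, assuming the API server runs as a Kubernetes Deployment in the current namespace. The `verify_rollout` helper, the deployment name passed to it, and the 120-second timeout are assumptions; use the name defined in the manifest.

```shell
# Hypothetical helper: wait for a deployment's rollout, then print its replica counts.
# $1 = deployment name (an assumption; take it from the manifest you edited).
verify_rollout() {
  deployment="$1"
  # Blocks until the rollout completes or the timeout expires.
  kubectl rollout status "deployment/$deployment" --timeout=120s || return 1
  # Shows READY / UP-TO-DATE / AVAILABLE counts for a quick sanity check.
  kubectl get deployment "$deployment"
}
```

If the rollout does not finish within the timeout, the helper returns a non-zero status and skips the replica listing.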

Check Flow Queue Usage

  1. Log in to the DevOps machine using SSH, forwarding local port 4000 to port 1234 on the machine.

ssh root@DEV_OPS_MACHINE -L 4000:localhost:1234

  2. Navigate to the queue directory.

  3. Start the application with node app.js.

  4. Visit localhost:4000 and check the queues.
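
If the page does not load, a quick check from a second local terminal can tell whether anything is answering on the forwarded port. This is a minimal sketch, assuming `curl` is installed locally and the SSH session from the first step is still open. The `queue_ui_up` helper name and the 5-second timeout are illustrative assumptions.

```shell
# Hypothetical helper: report whether anything answers HTTP on the forwarded local port.
# $1 = local port (defaults to 4000, the local side of the SSH port forward).
queue_ui_up() {
  port="${1:-4000}"
  if curl -sS -o /dev/null --max-time 5 "http://localhost:$port"; then
    echo "queue UI reachable on port $port"
  else
    echo "queue UI not reachable on port $port" >&2
    return 1
  fi
}
```

A failure here usually points to the tunnel being closed or the node application not running on the DevOps machine, rather than to a problem with the queues themselves.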
