📞 On Call

Pre-Incident

We understand that errors are a normal part of software engineering. However, we must be mindful of the number of errors we can afford to make. It’s important that we never blame anyone personally but instead focus on the code and the system.

During Incident

📞 If you receive an on-call notification, please acknowledge your presence by writing a message in the #on-call channel. This will help us ensure that we have the right person on call and can quickly resolve any issues.

🚨 If you’re unsure how to handle a particular issue, escalate it to the relevant team member for further guidance.

🤝 We appreciate your help and commitment to keeping our systems running smoothly, even during off-hours. Remember, with great power comes great responsibility.

Post-Incident

Please open a ticket to track the incident and collaborate with the team to identify actionable items to prevent similar issues from happening in the future.

Playbook

Here are the steps to follow when something goes wrong:

Ask the team internally for any credentials you need.

Check Worker Server

Log in to the worker machine.
Navigate to the directory that contains the Docker Compose file.
Make sure the services are up and running:

docker compose ps

To view the logs:

docker compose logs
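
When the Compose file defines several services, it can be easier to list only the ones that are down than to read the full `docker compose ps` table. The following is a minimal sketch, assuming Docker Compose v2 and that it is run from the directory containing the Compose file. The function name `not_running_services` is hypothetical and not part of any existing tooling.

```shell
#!/bin/sh
# Print every service defined in the Compose file that is not currently running.
# Assumption: Docker Compose v2 (the "docker compose" plugin) is installed.
not_running_services() {
    # Services that currently have a running container
    running=$(docker compose ps --services --status running)
    # Compare against every service declared in the Compose file
    for svc in $(docker compose config --services); do
        # -F: fixed string, -x: match the whole line, -q: no output
        if ! printf '%s\n' "$running" | grep -Fqx -- "$svc"; then
            printf '%s\n' "$svc"
        fi
    done
}
```

Any service it prints is a good candidate for `docker compose logs --tail=100 SERVICE`, where SERVICE is the printed name.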

View API Server Resource Utilization from Cloud Provider

Scale API Server Replicas in K8s

Log in to the DevOps machine using SSH.
Edit the k8s/ap-prod.yml file and set the desired number of replicas.
Apply the changes using the following command:

kubectl apply -f k8s/ap-prod.yml
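
After applying the change, it is worth confirming that the new replicas actually became available. The sketch below wraps the apply step with a rollout check. The function name `apply_and_verify` and the deployment name passed to it are hypothetical; take the real deployment name from the manifest, and note that the sketch assumes the deployment lives in the namespace of the current kubectl context.

```shell
#!/bin/sh
# Apply a manifest, wait for the rollout to finish, then show the replica counts.
# $1: path to the manifest   $2: deployment name (hypothetical; read it from the manifest)
apply_and_verify() {
    manifest=$1
    deployment=$2
    kubectl apply -f "$manifest" || return 1
    # Blocks until the rollout completes or the timeout expires
    kubectl rollout status "deployment/$deployment" --timeout=120s || return 1
    # Shows READY, UP-TO-DATE, and AVAILABLE counts for a final check
    kubectl get deployment "$deployment"
}
```

For example, `apply_and_verify k8s/ap-prod.yml DEPLOYMENT_NAME` returns a non-zero status if the apply fails or the rollout does not complete within two minutes.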

Check Flow Queue Usage

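Open an SSH session to the DevOps machine, forwarding local port 4000 to port 1234 on that machine:

ssh root@DEV_OPS_MACHINE -L 4000:localhost:1234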

Navigate to the queue directory.
Start the application with node app.js.
Visit localhost:4000 in your browser and check the queues.
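
If the page does not load right away, the application may still be starting or the tunnel may not be up. The sketch below polls the forwarded port from your local machine until it answers. It assumes that `curl` is installed locally and that the queue application serves plain HTTP on the forwarded port; the function name `wait_for_dashboard` is hypothetical.

```shell
#!/bin/sh
# Poll a URL until it answers with a successful HTTP status, or give up.
# $1: URL (default http://localhost:4000)   $2: number of attempts (default 10)
wait_for_dashboard() {
    url=${1:-http://localhost:4000}
    attempts=${2:-10}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # --fail makes curl return non-zero on HTTP errors (4xx/5xx)
        if curl --silent --fail --output /dev/null --max-time 2 "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}
```

Run `wait_for_dashboard` in a second local terminal while the SSH session stays open. A non-zero exit status suggests that either the tunnel or the application on the DevOps machine is not running.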


Last updated 2 years ago