📞 On Call
Pre-Incident
We understand that errors are a normal part of software engineering. However, we must be mindful of how many errors we can afford to make. We never blame anyone personally; instead, we focus on the code and the system.
During Incident
📞 If you receive an on-call notification, please acknowledge your presence by writing a message in the #on-call channel. This will help us ensure that we have the right person on call and can quickly resolve any issues.
🚨 If you’re unsure how to handle a particular issue, escalate it to the relevant team member for further guidance.
🤝 We appreciate your help and commitment to keeping our systems running smoothly, even during off-hours. Remember, with great power comes great responsibility.
Post-Incident
Please open a ticket to track the incident and collaborate with the team to identify actionable items to prevent similar issues from happening in the future.
Playbook
Here are the steps to follow when something goes wrong:
Check Worker Server
Log in to the worker machine.
Navigate to the directory that contains the Docker Compose file.
Make sure the worker is up and running:
docker compose ps
You can view the logs:
docker compose logs
View API Server Resource Utilization from Cloud Provider
Scale API Server Replicas in K8s
Log in to the DevOps machine using SSH.
Edit the k8s/ap-prod.yml file.
Apply the changes using the following command:
kubectl apply -f k8s/ap-prod
Check Flow Queue Usage
Log in to the DevOps machine using SSH, forwarding local port 4000 to port 1234 on that machine.
ssh root@DEV_OPS_MACHINE -L 4000:localhost:1234
Navigate to the queue directory.
Start the node app.js application.
Visit localhost:4000 and check the queues.
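The playbook commands above could be collected into a small helper script so they are easy to find during an incident. This is only a sketch: the function names (run, check_worker, apply_api_manifest, queue_tunnel) and the DRY_RUN and COMPOSE_DIR variables are hypothetical and not part of any existing tooling, and the --tail 100 limit on the logs is an added assumption to keep the output short. Commands are printed rather than executed unless DRY_RUN is set to 0.

```shell
#!/bin/sh
# Sketch of the playbook commands as reusable functions.
# All function and variable names here are hypothetical.

# With DRY_RUN=1 (the default) commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Check Worker Server: run this on the worker machine.
# COMPOSE_DIR (assumed) is the directory that holds the compose file.
check_worker() {
  (
    cd "${COMPOSE_DIR:-.}" &&
      run docker compose ps &&
      run docker compose logs --tail 100
  )
}

# Scale API Server Replicas: run this on the DevOps machine
# after editing the manifest by hand.
apply_api_manifest() {
  run kubectl apply -f k8s/ap-prod
}

# Check Flow Queue Usage: forward local port 4000 to port 1234
# on the DevOps machine (DEV_OPS_MACHINE is a placeholder host).
queue_tunnel() {
  run ssh root@DEV_OPS_MACHINE -L 4000:localhost:1234
}
```

Sourcing the file and calling check_worker prints the two Docker Compose commands. Setting DRY_RUN=0 before the call executes them instead.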