📞 On Call
Pre-Incident
Errors are a normal part of software engineering, but we must stay mindful of how many we can afford to make. We never blame anyone personally; we focus on the code and the system instead.
During Incident
📞 If you receive an on-call notification, acknowledge it by posting a message in the #on-call channel. This confirms that the right person is on call and helps us resolve issues quickly.
🚨 If you’re unsure how to handle a particular issue, escalate it to the relevant team member for further guidance.
🤝 We appreciate your help and commitment to keeping our systems running smoothly, even during off-hours. Remember, with great power comes great responsibility.
Post-Incident
Please open a ticket to track the incident and collaborate with the team to identify actionable items to prevent similar issues from happening in the future.
Playbook
Here are the steps to follow when something goes wrong:
Check Worker Server
Log in to the worker machine.
Navigate to the directory that contains the Docker Compose file.
Make sure the services are up and running:
docker compose ps
View the logs:
docker compose logs
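The worker check above can be sketched as a small script. This is only an illustration under assumptions: it assumes Docker Compose v2 (for `docker compose config --services` and `ps --status`), and the function names `missing_services` and `check_worker` are hypothetical helpers, not part of the existing setup.

```shell
# Print every service from the first list that is absent from the second.
# $1: all service names, $2: running service names (whitespace-separated).
missing_services() {
  for svc in $1; do
    found=no
    for up in $2; do
      [ "$svc" = "$up" ] && found=yes
    done
    [ "$found" = yes ] || printf '%s\n' "$svc"
  done
}

# Hypothetical wrapper: run it from the directory with the compose file.
# Assumes Docker Compose v2 flags.
check_worker() {
  all="$(docker compose config --services)" || return 1
  running="$(docker compose ps --services --status running)" || return 1
  down="$(missing_services "$all" "$running")"
  if [ -z "$down" ]; then
    echo "all services are running"
    return 0
  fi
  for svc in $down; do
    echo "service not running: $svc"
    # Show the most recent log lines for the stopped service.
    docker compose logs --tail 100 "$svc"
  done
  return 1
}
```

Calling `check_worker` on the worker machine would list any stopped service together with its recent logs, which may save a manual pass through the full `docker compose logs` output.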
View API Server Resource Utilization from Cloud Provider
Scale API Server Replicas in K8s
Log in to the DevOps machine using SSH.
Edit the k8s/ap-prod.yml file.
Apply the changes using the following command:
kubectl apply -f k8s/ap-prod.yml
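As a hedged sketch of the scaling step, the helpers below read the replica count from the manifest before applying it, then wait for the rollout to finish. The function names and the deployment name argument are hypothetical; the playbook does not name the deployment, so pass whatever name the manifest actually defines.

```shell
# Print the value of each "replicas:" key found in the manifest at $1.
manifest_replicas() {
  sed -n 's/^[[:space:]]*replicas:[[:space:]]*\([0-9][0-9]*\).*$/\1/p' "$1"
}

# Hypothetical wrapper around the apply step.
# $1: manifest path, $2: deployment name (an assumption, not given in the playbook).
apply_and_wait() {
  manifest="$1"
  deployment="$2"
  echo "replicas in manifest: $(manifest_replicas "$manifest")"
  # kubectl diff exits with 1 when differences exist, so do not treat that as failure.
  kubectl diff -f "$manifest" || true
  kubectl apply -f "$manifest" || return 1
  # Block until the new replicas are ready or the timeout expires.
  kubectl rollout status "deployment/$deployment" --timeout=120s
}
```

For example, `apply_and_wait k8s/ap-prod.yml <deployment-name>` would show the pending change, apply it, and report whether the scaled replicas became ready.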
Check Flow Queue Usage
Log in to the DevOps machine using SSH, forwarding local port 4000 to port 1234 on the machine:
ssh root@DEV_OPS_MACHINE -L 4000:localhost:1234
Navigate to the queue directory and start the application:
node app.js
Visit localhost:4000 and check the queues.
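The tunnel and dashboard check can be sketched as follows. This is a minimal illustration: the helper names are hypothetical, the ports are the ones used in the command above, and the `curl` probe assumes the queue application answers plain HTTP on the forwarded port.

```shell
# Build the argument for ssh -L from a local and a remote port.
# Prints LOCAL:localhost:REMOTE, or fails if either port is not numeric.
forward_spec() {
  for p in "$1" "$2"; do
    case "$p" in
      ''|*[!0-9]*) echo "invalid port: $p" >&2; return 1 ;;
    esac
  done
  printf '%s:localhost:%s\n' "$1" "$2"
}

# Hypothetical wrapper: open an interactive session with the port forward.
# $1: SSH target, for example root@DEV_OPS_MACHINE.
open_queue_tunnel() {
  spec="$(forward_spec 4000 1234)" || return 1
  ssh -L "$spec" "$1"
}

# Run locally in a second terminal once node app.js is started on the remote side.
# Succeeds if something answers HTTP on the forwarded port (assumption: plain HTTP).
dashboard_up() {
  curl -fsS -o /dev/null "http://localhost:${1:-4000}"
}
```

If `dashboard_up` fails, the likely causes are that the SSH session has closed or that the application is not yet listening on port 1234.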