On Call
Pre-Incident
We understand that errors are a normal part of software engineering. However, we must be mindful of the number of errors we can afford to make. It's important that we never blame anyone personally but instead focus on the code and the system.
During Incident
If you receive an on-call notification, please acknowledge your presence by writing a message in the #on-call channel. This will help us ensure that we have the right person on call and can quickly resolve any issues.
If you're unsure how to handle a particular issue, escalate it to the relevant team member for further guidance.
We appreciate your help and commitment to keeping our systems running smoothly, even during off-hours. Remember, with great power comes great responsibility.
Post-Incident
Please open a ticket to track the incident and collaborate with the team to identify actionable items to prevent similar issues from happening in the future.
Playbook
Here are the steps to follow when something goes wrong:
Ask the team internally about the credentials.
Check Worker Server
Log in to the worker machine.
Navigate to the Docker Compose file.
Make sure the worker is up and running.
View the logs if anything looks wrong.
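The worker check above can be sketched as two small shell helpers. This is only an illustration: it assumes the worker runs under Docker Compose v2 (the `docker compose` subcommand), and `WORKER_COMPOSE_FILE`, `worker_status`, and `worker_logs` are hypothetical names, not part of the playbook. Replace the placeholder path with the real location of the compose file.

```shell
# Sketch only. WORKER_COMPOSE_FILE is a hypothetical variable; the default
# path is a placeholder, not the real location on the worker machine.
WORKER_COMPOSE_FILE="${WORKER_COMPOSE_FILE:-/path/to/docker-compose.yml}"

# List the services defined in the compose file and their current state.
worker_status() {
  docker compose -f "$WORKER_COMPOSE_FILE" ps
}

# Show the last 100 log lines from every service in the compose file.
worker_logs() {
  docker compose -f "$WORKER_COMPOSE_FILE" logs --tail=100
}
```

After loading the snippet on the worker machine, run `worker_status` first and `worker_logs` only if a service is not in a running state. On older installations the standalone `docker-compose` binary may be needed instead.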
View API Server Resource Utilization from Cloud Provider
Scale API Server Replicas in K8s
Log in to the DevOps machine using SSH.
Edit the k8s/ap-prod.yml file.
Apply the changes using the following command:
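A minimal sketch of this step, assuming the manifest is applied with kubectl and that kubectl on the DevOps machine already targets the production cluster. Both are assumptions, and the helper names are illustrative rather than part of the playbook; only the manifest path comes from the steps above.

```shell
# Sketch only. Assumes kubectl is installed and configured for the
# production cluster; run from the directory that contains k8s/.
MANIFEST="${MANIFEST:-k8s/ap-prod.yml}"

# Show the difference between the live objects and the edited manifest.
preview_api_changes() {
  kubectl diff -f "$MANIFEST"
}

# Apply the edited manifest to the cluster.
apply_api_changes() {
  kubectl apply -f "$MANIFEST"
}

# List deployments in the current namespace to confirm the replica count.
list_deployments() {
  kubectl get deployments
}
```

Running `preview_api_changes` before `apply_api_changes` gives a chance to confirm that only the replica count changed. Note that `kubectl diff` exits with a non-zero status when it finds differences, which is expected here.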
Check Flow Queue Usage
Log in to the DevOps machine using SSH with port 1234 forwarded.
Navigate to the queue directory.
Start the node app.js application.
Visit localhost:4000 and check the queues.
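The queue check can be sketched as follows. The steps mention port 1234 for the forward and localhost:4000 for the browser; one reading that satisfies both is forwarding local port 4000 to port 1234 on the DevOps machine, but that mapping is an assumption and should be adjusted to the real setup. The user, host, and function name are hypothetical placeholders.

```shell
# Sketch only. User and host are placeholders, and the port mapping
# (local 4000 -> remote 1234) is an assumption, not confirmed by the playbook.
DEVOPS_USER="${DEVOPS_USER:-user}"
DEVOPS_HOST="${DEVOPS_HOST:-devops.example.com}"
LOCAL_PORT="${LOCAL_PORT:-4000}"
REMOTE_PORT="${REMOTE_PORT:-1234}"

# Open an SSH session that forwards LOCAL_PORT on this machine
# to REMOTE_PORT on the DevOps machine.
open_queue_tunnel() {
  ssh -L "${LOCAL_PORT}:localhost:${REMOTE_PORT}" "${DEVOPS_USER}@${DEVOPS_HOST}"
}

# Inside the SSH session, on the DevOps machine:
#   cd queue
#   node app.js
# Then open http://localhost:4000 in a local browser.
```

The tunnel stays open only while the SSH session is alive, so keep that terminal open while checking the queues.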