Overview:
In this role, you will be responsible for ensuring the availability, latency, performance, efficiency, and stability of our client’s critical infrastructure. You will also collaborate with development teams to implement and maintain reliable and scalable systems.
Key Responsibilities:
- Monitor our systems and identify potential issues that could impact their availability.
- Implement and maintain automated alerting mechanisms to notify the appropriate parties of potential outages or performance degradation.
- Analyse performance metrics to identify and resolve latency bottlenecks in our infrastructure.
- Implement performance optimization techniques and tools to improve the overall responsiveness of our systems.
- Work with development teams to ensure that new features and code changes do not introduce performance regressions.
- Develop and maintain metrics dashboards to track key performance indicators (KPIs) for our critical systems.
- Identify performance trends and anomalies that may indicate potential issues or areas for improvement.
- Optimize resource utilization and minimize unnecessary expenditure on IT infrastructure.
- Identify and implement cost-effective solutions to improve the efficiency of our IT operations.
Release Management:
- Design and implement automated deployment and rollback procedures to mitigate risks associated with software updates.
- Monitor the performance of new releases and promptly address any issues that arise.
- Lead the team that executes release management.
Monitoring and Alerting:
- Design, implement, and maintain a comprehensive monitoring infrastructure to track the health and performance of our systems.
- Analyse monitoring data to identify potential issues and proactively troubleshoot problems before they impact users.
- Develop and implement alerts and notifications for critical events to ensure timely intervention.
Incident Response:
- Build and lead the team that responds promptly to incidents and works collaboratively to resolve them.
- Analyse the root causes of incidents and implement preventive measures to minimize recurrence.
- Document incident responses and communicate lessons learned to enhance our incident handling processes.
Technical Leadership:
- Collaborate with your peers on the leadership team to define a multi-year technical roadmap. Stay up to date with industry developments and enterprise infrastructure, and anticipate significant risks.
Required Experience:
- 10+ years of experience as a Site Reliability Engineer or in an equivalent role.
- Proven experience in monitoring, analysing, and optimizing the performance of large-scale distributed systems.
- Expertise in Linux systems administration, including managing servers, operating systems, and network configurations.
- Strong scripting and automation skills, preferably with experience in Bash, Python, or similar languages.
- Strong troubleshooting and problem-solving skills, with a knack for identifying and resolving complex technical issues.
Desired Experience:
- Bachelor’s degree in Computer Science, Information Technology, or a related field.
- Familiarity with AWS.
- Experience with DevOps tools and practices, such as GitLab CI/CD and Docker.