Site Reliability Engineer with Python
Our client is looking to bring on a Site Reliability Engineer to help deploy, manage, troubleshoot, and enhance our complex cloud-based set of internal tools and externally managed services for a variety of users across our wide-ranging organization.
You will have 7 to 10 years of hands-on experience working as a Site Reliability Engineer.
You will work closely with IT, product, and engineering to extend and maintain this set of tools and services and to help debug and resolve problems.
In addition, the ideal candidate will use monitoring and the data we aggregate through various tools in our organization's IT & DevOps toolkit to proactively look for system weaknesses and resolve them before they can cause production issues.
- Keep our suite of internal apps and services up and running, and get it back up and running quickly if a failure occurs
- Be the technical point person with operational responsibility for two core platforms (one mobile and one web application), i.e. engaging as appropriate on escalations from the IT support group, whether that means problem solving, addressing production issues, or enhancing features, and collaborating with engineers and others as needed
- Work closely with internal partners and teams as well as external vendors to ensure that we ship software that meets our code quality, security, and performance requirements
- Write, update, and use our documentation, including runbooks and/or playbooks
- Help automate existing internal workflows or build new ones, including ongoing infrastructure needs, testing, failover mitigations, and more
- Debug complex problems across our entire web and mobile application stack, advise key stakeholders on solutions, and implement those solutions if appropriate
- Further our internal CI/CD processes to improve release cadence and developer experience
- Participate in the daily / weekly software development process (standups, sprint planning, retros, issue tracking, etc.)
- Actively lead post-mortem processes for critical issues, including coordinating meetings and follow-up steps
- 7+ years of experience with software engineering, software development, and/or system operations
- Experience debugging complex problems and implementing timely, cost-effective solutions
- Experience designing, building, and operating large-scale production systems
- Deep knowledge of Python is preferred, though other languages like Java, Go, Rust, or similar will also be heavily considered
- Experience using source control (Git, GitHub) and feature branching strategies
- Experience with a variety of open-source databases (MySQL, Postgres, Redis, etc.)
- Experience with DevOps engineering and working with containers and container orchestration, such as Docker or Kubernetes
- Experience with log monitoring and observability via platforms like Sumo Logic or CloudWatch
- Experience automating infrastructure, testing, and deployments using tools like CircleCI
- Knowledge of configuration management tooling and infrastructure as code is preferred but not required
- Experience working with AWS services, with knowledge of Azure / Google ecosystems helpful but not required
- Cross-functional team collaboration experience, especially working with engineers and user experience / product designers, as well as external stakeholders
- Strong skills for weighing and managing scope, risk, quality, and timelines
- Strong focus on quality, security, performance, and end-user experience
This is an exciting position with an organisation based in Central London and New York.
The position can be London or New York based.
The salary for this position will be circa £80K - £100K.
Please send us your CV in Word format, along with your salary and notice period.