Corellium is seeking an experienced Site Reliability Engineer to help us build, maintain, and troubleshoot our rapidly expanding infrastructure. In this role, you will focus on measuring and improving the availability, reliability, performance, and capacity of our cloud-based enterprise software services. Contributions will range from small cloud scaling operations and logging initiatives to large-scale, multifaceted rebuilds of production systems. You will also help define our infrastructure strategy, tooling, metrics, processes, and overall product scalability as we grow our customer base and increase production efficiency.
You’ll be successful in this role if you have experience increasing the reliability and performance of AWS-based production systems, and providing thought leadership to implement best practices and tools. The position requires the ability to work across departments and negotiate outcomes with other engineers. A holistic, end-to-end approach to reliability requires general programming skills and strong computer science fundamentals. As a startup, we place a strong emphasis on individual contribution and diversity of thought, and a friendly, collaborative voice is greatly appreciated.
Successful candidates will have experience with the following tools and languages:
- Shell scripting
- Terraform / Ansible
- Docker / Kubernetes
- Node.js (our homegrown CI/CD for automated iOS and Android testing is Node.js-based)
In this role, your responsibilities will include:
- Owning cloud-related #alerts: tracking down the cause of each alert, finding the relevant logs, troubleshooting, and resolving the source. This may range from diagnosing an errored virtual device to determining why a server went offline.
- Owning, overseeing, and managing our AWS resources – including AWS accounts, permissions, settings, rogue unused resources, etc.
- Optimizing our strategy for auto-scaling.
- Debugging any cloud-related services and infrastructure bugs.
- Facilitating system maintenance and incident response.
- Analyzing logs to detect new bugs, anomalies, and malicious use.
- Managing observability services, expanding the metrics we can observe, using metrics for performance tuning, and detecting metric anomalies.
- Analyzing the technology currently in use, developing plans for improvement, recommending performance enhancements and cost optimizations, and identifying alternative solutions.
- Contributing to code review.
- Managing by, with, and through Service Level Objectives (SLOs).
- Iterative development of a holistic reliability approach.
- Creating documentation, procedures, and reports.
We are looking for candidates with the following qualifications:
- At least 4 years of experience in DevOps (infrastructure), including mentorship, coaching of junior engineers, and leadership across an organization
- At least 3 years of experience as an SRE
- Engineering background or degree
- Experience working in a startup environment
- Experience in risk-based testing