Site Reliability Engineer - Core Services, Engineering - Site Reliability
Who we are
At Criteo, we connect 1.5 billion active shoppers with the things they need and love. Our technology takes an algorithmic approach to predicting which user to show an ad to, when, and for which products. Our dataset is about 50 petabytes in Hadoop (growing by more than 120 TB per day), and we take less than 10 ms to respond to an ad request. This is truly big data and machine learning without the buzzwords. If scale and complexity excite you, join us.
Most of all, we are creators. From designing ground-breaking products to finding unique ways to solve technical challenges at an exceptional scale, our tech teams work with state-of-the-art methodologies to shape the future of advertising.
The Site Reliability teams keep one of the largest computing platforms in the AdTech world running like clockwork. They handle processing, storage and monitoring through large-scale data compute and storage services (Hadoop, SQL & NoSQL), streaming (Kafka), platform as a service (Chef, Mesos), identity management (Kerberos) and analytics (Hive, Druid, Vertica).
The Core Services team builds the current and next-generation platform for consuming resources on Criteo infrastructure. We are building the platform as a service for all stateless and stateful applications at Criteo (think web servers, databases, distributed file systems). We use Apache Mesos to achieve that vision.
We take the time to understand our clients' needs and help them launch hundreds of instances of their apps across the world, isolated using containers and connected through our internal service mesh using Consul.
What you'll do
You will be in charge of building and operating our platform-as-a-service clusters that make the best decision in a very short time, half a million times per second, across three continents and 15 datacenters, 24/7. You will have to imagine and implement mechanisms that let users forget about infrastructure and focus on building their product.
Your day-to-day tasks will include developing features that guarantee access to computing resources such as network bandwidth, memory and GPUs, as well as improving operations automation so that significant changes can be deployed with controlled disruption across thousands of containers.
You will work with engineering leadership to develop long-term roadmaps and architectures to scale our infrastructure and improve our SLA.
You will be involved in several projects such as:
- Isolating network bandwidth between containers so that highly latency-sensitive applications can be colocated on the same servers;
- Organizing maintenance of large clusters to avoid application unavailability in an always-moving infrastructure.
Who you are
- Deep knowledge of Linux fundamentals
- Relevant experience working in a Software Engineering or DevOps Engineering role (at least 2 years)
- Knowledge of database concepts and cluster systems such as Mesos or Hadoop is a plus
- Good knowledge of shell scripting and at least one object-oriented language (Go, Python, Ruby, Java, .NET…)
- Strong interpersonal and communication skills