
Uberlayer comparison





  1. UBERLAYER COMPARISON OFFLINE
  2. UBERLAYER COMPARISON PLUS

Bansal: I am Mayank Bansal from the data infra team at Uber, and here is Apoorva, who is from the compute team at Uber. We've been working on Uber's open-source unified resource scheduler, which we're calling Peloton. Let's look at what the cluster management story at Uber is today. Currently, we have thousands of microservices running in production. We have thousands of builds per day happening in production, and tens of thousands of instances deployed per day. We have 100K-plus instances running per cluster, and more than 10 million batch job containers running per day. We have thousands of GPUs per cluster, and 25-plus clusters like that.

UBERLAYER COMPARISON OFFLINE

Right now, all the microservices, which we call stateless jobs, run on their own clusters. Stateless services run on Mesos and Aurora; batch jobs on Hadoop, Spark, and TensorFlow run on Hadoop YARN. We have Cassandra, Redis, MySQL, and other stateful services, which run on bare metal on their own clusters. Then you have daemon jobs, which go and run on all of these clusters, and there is no resource accounting for those. The vision for Peloton is to combine all these workloads together onto the same big cluster; we wanted to do that to improve cluster utilization. The scale Uber runs at is thousands and thousands of machines, and if we co-locate all these workloads together, we envision a lot of resource efficiencies, which will translate into millions of dollars.
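To make that taxonomy concrete, here is a minimal Go sketch of the four workload families a unified scheduler would have to understand. The type names are illustrative, not Peloton's actual API:

```go
package main

import "fmt"

// JobType enumerates the four workload families described above, each of
// which runs on its own cluster today: stateless microservices on
// Mesos/Aurora, batch on Hadoop YARN, stateful stores on bare metal,
// and daemon jobs that run on every host.
type JobType int

const (
	Stateless JobType = iota // microservices
	Batch                    // Hadoop, Spark, TensorFlow
	Stateful                 // Cassandra, Redis, MySQL
	Daemon                   // per-host agents, previously unaccounted for
)

func (j JobType) String() string {
	return [...]string{"stateless", "batch", "stateful", "daemon"}[j]
}

func main() {
	// A unified cluster schedules all four side by side instead of
	// keeping separate, over-provisioned clusters per type.
	for _, j := range []JobType{Stateless, Batch, Stateful, Daemon} {
		fmt.Println(j)
	}
}
```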


The other reason we can co-locate all these jobs is that the profiles of these jobs are very different. The online services, or microservices, are very latency-sensitive jobs which cannot be preempted, because if you preempt them, you are actually impacting the business: when you open the Uber app and call for an Uber, if you preempt a service there, the guy's not going to show up. The batch jobs, which include offline training, distributed training, machine learning jobs, Spark jobs, and all that analytics, can be preempted.
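A minimal Go sketch of that distinction, using a hypothetical Task type rather than anything from Peloton's real API: batch work is marked preemptible, online services are not.

```go
package main

import "fmt"

// WorkloadClass captures the two broad profiles described above:
// latency-sensitive online services and preemptible offline batch work.
type WorkloadClass int

const (
	OnlineService WorkloadClass = iota // microservices; preempting them impacts the business
	BatchJob                           // training, analytics; safe to preempt and retry
)

type Task struct {
	Name     string
	Class    WorkloadClass
	Priority int
}

// Preemptible encodes the rule from the talk: only offline batch work
// may be evicted to make room for higher-priority demand.
func (t Task) Preemptible() bool {
	return t.Class == BatchJob
}

func main() {
	tasks := []Task{
		{Name: "rider-api", Class: OnlineService, Priority: 100},
		{Name: "ml-training", Class: BatchJob, Priority: 10},
	}
	for _, t := range tasks {
		fmt.Printf("%s preemptible=%v\n", t.Name, t.Preemptible())
	}
}
```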

UBERLAYER COMPARISON PLUS

What we are thinking is, we will co-locate them into the same cluster. By that, we can preempt if one of the other profiles spikes, so we don't need to hold dedicated DR capacity for each cluster on principle. If we preempt the batch jobs, we can use that capacity for higher-priority jobs. We don't need to buy extra capacity, which we do currently because we over-provision the services cluster as well as the batch cluster for spikes and all these DR reasons. And because of the complementary resource nature of the different workloads, we can get better cluster utilization.
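The preemption idea can be sketched in a few lines of Go. This is a toy illustration of the argument, evicting the lowest-priority preemptible tasks until the spike fits, not Peloton's actual algorithm:

```go
package main

import (
	"fmt"
	"sort"
)

type Task struct {
	Name        string
	CPU         float64
	Priority    int
	Preemptible bool
}

// preemptForSpike picks the lowest-priority preemptible tasks until
// enough CPU is freed for a spiking high-priority service.
func preemptForSpike(running []Task, neededCPU float64) []Task {
	candidates := make([]Task, 0, len(running))
	for _, t := range running {
		if t.Preemptible {
			candidates = append(candidates, t)
		}
	}
	// Evict cheapest-to-lose work first.
	sort.Slice(candidates, func(i, j int) bool {
		return candidates[i].Priority < candidates[j].Priority
	})
	var evicted []Task
	var freed float64
	for _, t := range candidates {
		if freed >= neededCPU {
			break
		}
		evicted = append(evicted, t)
		freed += t.CPU
	}
	return evicted
}

func main() {
	running := []Task{
		{"spark-etl", 8, 10, true},
		{"gpu-training", 16, 20, true},
		{"rider-api", 4, 100, false}, // online service, never evicted
	}
	// An online spike needs 10 CPUs; batch work absorbs the preemption.
	for _, t := range preemptForSpike(running, 10) {
		fmt.Println("preempt:", t.Name)
	}
}
```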


We looked at the existing cluster management solutions back in the day, and we found there is no silver bullet right now. We looked primarily at Borg, YARN, Mesos, and Kubernetes. Borg is Google's cluster manager, which runs all of Google's application workloads: job/task lifecycle management, placement, preemption, and allocation happen at the Borgmaster, and the Borglet does all the task execution and orchestration. YARN does job/task lifecycle management through the application master, the rest of the things happen at the resource manager, and execution happens at the node manager. Mesos is pretty much not a scheduler, it's a resource manager: resource allocation happens at the Mesos master, orchestration happens at the agent, and the rest of the scheduler primitives live in the frameworks on top of Mesos.
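As a compact restatement of that comparison, here is a small Go program mapping where each system places its core scheduling responsibilities. It only summarizes the prose above, not the full design of any of these systems:

```go
package main

import "fmt"

// responsibilities restates the comparison: which component owns which
// scheduling function in Borg, YARN, and Mesos.
var responsibilities = map[string]map[string]string{
	"Borg": {
		"lifecycle, placement, preemption, allocation": "Borgmaster",
		"execution and orchestration":                  "Borglet",
	},
	"YARN": {
		"job/task lifecycle":  "application master",
		"resource allocation": "resource manager",
		"execution":           "node manager",
	},
	"Mesos": {
		"resource allocation":   "Mesos master",
		"orchestration":         "agent",
		"scheduling primitives": "frameworks on top of Mesos",
	},
}

func main() {
	for system, roles := range responsibilities {
		for role, component := range roles {
			fmt.Printf("%s: %s -> %s\n", system, role, component)
		}
	}
}
```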






