i build the cloud

@ibuildthecloud - darren0

What’s Missing From Rocket? The User Experience

Flying to Amsterdam for DockerCon, I was thinking about this week's introduction of Rocket by CoreOS, and wanted to share a few thoughts. First, it's important to remember that container technology has been around for a long time. Docker supports Linux kernels that predate the release of Docker itself. If containers have been around for a while, why are Docker and containers suddenly all the rage? It has to do with user experience. The beauty of Docker is that it packages container technology in a way that clicks with users. By consistently focusing on the user experience and interface, Docker has built a following of passionate developers who love working with it.

Rocket and the accompanying App Container specification seem to have missed this fundamental point about Docker. If you can accuse Docker, Inc. of one thing, it is that it consistently strives to control the experience of its users. Rocket and the App Container spec take a very different approach: they try to define a low-level technical specification with no vision for how it will be consumed by users. This approach seems misguided and bound to have no more impact than previous container projects like LXC or Solaris Zones (which are technically great projects, especially Solaris Zones).

Rocket itself is a light wrapper around systemd's nspawn. This could change over time, but given that systemd is an essential part of CoreOS, a non-systemd implementation is unlikely to come from CoreOS. The most important thing about this week's announcement is the App Container specification. Much of the concern in the community lately has been that Docker is increasing its scope. Not Docker, Inc., but Docker the tool. By creating a specification, CoreOS can limit the scope to a single composable unit, and that specification can then be used as a building block in other tools and products. Since Docker has no formal definition of the scope of its API, it is less clear to those wishing to build products around Docker how the future of Docker will impact their products.

So what does the App Container specification mean to users, and how will it impact how users interact with containers? Perhaps a good starting point is to imagine what would happen if Docker implemented the App Container specification. Users interface with Docker in three ways: the Docker CLI, the Remote API, and the Dockerfile. I don't believe the App Container spec would have any direct impact on any of these touch points. Some of how Docker works would change, such as how Docker stores and looks up images and their metadata, but most of this would be hidden from the user.
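To make those three touch points concrete, here is a rough sketch (the image, names, and the netcat invocation are illustrative; /containers/json is the Remote API call to list containers):

    # 1. The CLI
    docker run -d --name web nginx

    # 2. The Remote API (the daemon listens on a Unix socket by default;
    #    this assumes a netcat build that supports -U)
    echo -e "GET /containers/json HTTP/1.0\r\n" | nc -U /var/run/docker.sock

    # 3. The Dockerfile (shown here as comments)
    #    FROM ubuntu
    #    RUN apt-get install -y nginx
    #    CMD ["nginx", "-g", "daemon off;"]

The App Container spec sits below all three of these; none of them would need to change.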

To sum this all up, Docker's success is largely attributable to its user interface. Rocket and its App Container specification attempt to standardize internal details of container technologies that mean very little to users. Building upon the momentum and success of Docker has more to do with innovating on and expanding the user interface than with low-level container technologies. Rocket is really not a competitor to Docker; it can potentially be used to implement Docker. If Rocket is extended to include a richer user experience, it will be a real alternative to Docker, but that would defeat the purpose of Rocket being a simple container runtime and packaging format.

Is Docker Fundamentally Flawed?

When Rocket was announced, CoreOS stated:

From a security and composability perspective, the Docker process model - where everything runs through a central daemon - is fundamentally flawed.

I wanted to give a bit more context that may help users understand why CoreOS might feel so strongly that Docker is fundamentally flawed. …And surprisingly, this has a lot to do with systemd.

When systemd starts a service, it creates a series of cgroups. As the service spawns child processes, the children by default stay in the same cgroup. This is how systemd always knows which processes are associated with a service and, additionally, what resources they can consume. If a service dies, systemd can clean up all processes associated with the service, regardless of which process is the parent. This is really a clever design.
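You can see this design on any systemd machine. A minimal sketch, assuming a hypothetical myapp.service that forks workers (cgroup paths vary slightly by systemd version):

    # /etc/systemd/system/myapp.service (hypothetical):
    #   [Service]
    #   ExecStart=/usr/bin/myapp --workers 4

    systemctl start myapp.service

    # status shows the unit's cgroup with the parent and all children in it
    systemctl status myapp.service

    # the same PIDs, straight from the cgroup filesystem
    cat /sys/fs/cgroup/systemd/system.slice/myapp.service/cgroup.procs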

Now here's the problem: Docker breaks all of this (not intentionally). When you do docker run, the Docker client makes an RPC call to the Docker daemon, which spawns a child process that becomes PID 1 of the container. If your service unit contains a docker run command, systemd will monitor the Docker client process. As soon as the Docker client makes that RPC call, systemd loses track of what is going on, because the container is spawned in a different cgroup from the Docker client that systemd is monitoring. In the end, systemd cannot manage Docker containers because of the daemon. So you can begin to see why CoreOS would say that it is fundamentally flawed.
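Here is a hedged illustration of the mismatch (the unit name and image are hypothetical, and the container's cgroup path depends on the Docker version and cgroup driver):

    # /etc/systemd/system/web.service (hypothetical):
    #   [Service]
    #   ExecStart=/usr/bin/docker run --name web nginx

    systemctl start web.service

    # the unit's cgroup contains only the Docker client...
    cat /sys/fs/cgroup/systemd/system.slice/web.service/cgroup.procs

    # ...while the container's processes live in a cgroup created by the daemon
    cat /sys/fs/cgroup/systemd/docker/<container-id>/cgroup.procs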

What would CoreOS want to see Docker do differently? CoreOS would like Docker to remove the daemon so that containers are spawned as children of the Docker client. If Docker were to launch containers as children of the client, systemd would be able to effectively manage Docker. Again, you can see why CoreOS says it's fundamentally flawed.

So why doesn't Docker change? Why doesn't Docker just get rid of the daemon, as it so clearly conflicts with systemd? Well, first you have to consider what the daemon does. The daemon has several roles: it provides the remote API, maintains the state of containers, sets up resources (like staging images), and then does the execution and cleanup of containers. Rocket has no standalone daemon, so how does it achieve all of this? Rocket largely handles setting up the resources for the container (such as staging the image) and then delegates everything to systemd-nspawn to do the rest.
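For a sense of what that delegation looks like, this is roughly how you run a container directly with systemd-nspawn (the directory and binary are placeholders for a staged image):

    # stage the root filesystem into a directory, then hand execution to systemd
    systemd-nspawn -D /var/lib/containers/myapp /usr/bin/myapp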

There's the magic: systemd. It's not that Rocket doesn't have a standalone daemon; it's just that the daemon happens to be systemd, PID 1, which runs as root. So is Docker fundamentally flawed? I can't imagine how you could say that, because Rocket follows a very similar paradigm; it just happens to be built into systemd. The fundamental issue is that both Docker and systemd want to be the daemon that manages containers.

This conflict ends up being the crux of the issue. While CoreOS is touted as the best platform to run Docker, you have to realize that CoreOS revolves around systemd, not Docker. Fleet, CoreOS's container scheduler, in fact interfaces with systemd, not Docker. It is merely a systemd unit scheduler, and those units may just happen to run Docker containers. As I mentioned before, this model does not work well because systemd cannot effectively manage Docker containers. If you want to understand some more nitty-gritty details about this, refer to systemd-docker, a project I created as an attempt to find a happy compromise between systemd and Docker.
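The basic trick of systemd-docker is to move the container's processes into the service unit's cgroups so that systemd sees them again. A unit using it looks roughly like this (adapted from the project's README; the path and image are illustrative):

    # /etc/systemd/system/nginx.service:
    #   [Service]
    #   ExecStart=/opt/bin/systemd-docker run --rm --name %n nginx
    #   Restart=always
    #   Type=notify
    #   NotifyAccess=all

    systemctl start nginx.service
    systemctl status nginx.service   # now shows the container's processes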

While CoreOS may list many reasons why Rocket is better, it also happens to be a nice opportunity for them to write a systemd-friendly container system. In fact, if you want to build a Rocket stage1 implementation that is not built on systemd, you will most likely find yourself back at a design very similar to what Docker is today.

How can this be resolved?

I like CoreOS very much. The sad truth is that running Docker containers on CoreOS is really not a nice experience. I’ve spent a huge amount of time dealing with this and attempting to facilitate some solution. We absolutely don’t need a rewrite of Docker to fix all of this but we do need some cooperation.

There are two possible solutions to this problem. Before describing them, it must be clear that Docker will not give up its daemon. That is just not feasible, because Docker would have to give up its remote API and also be tied to systemd. So the question becomes: how can these two daemons (Docker and systemd) nicely cooperate?

One solution is for systemd to expose an API such that a caller can put a process into an existing cgroup or a child of a cgroup. One of the main issues today is that the Docker daemon has no ability to launch a container that is related to an existing systemd service unit. I've brought this up with the systemd developers, and they have added it to their to-do list.
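Mechanically, adopting a process into a service's cgroup is just a write to the cgroup filesystem; what is missing is a supported systemd API for it. A hedged sketch, using a hypothetical container and unit both named web:

    # find the container's PID 1 and adopt it into the unit's cgroup
    PID=$(docker inspect --format '{{.State.Pid}}' web)
    echo $PID > /sys/fs/cgroup/systemd/system.slice/web.service/cgroup.procs

    # a complete solution (this is roughly what systemd-docker does) would also
    # move the PID in each resource controller hierarchy, not just name=systemd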

The second solution is for the Docker daemon to communicate back to the Docker client so that the container is launched as a child of the client and not the daemon. I have had a couple of discussions with Michael Crosby from Docker, Inc. on this topic, which started with a long IRC discussion. In the end, it is a workable solution that could also solve some other issues, like Docker restarts not restarting containers.

There needs to be a change in systemd or Docker to fix this. Why hasn’t anything been done? Simply put, nobody seems to be sufficiently motivated to do the work to fix the issue (including me). Is Docker fundamentally flawed? I don’t think so.

Announcing Rancher.io: Portable Infrastructure Services for Docker

Almost one year ago I started Stampede as an R&D project to look at the implications of Docker on cloud computing going forward, and along the way I've explored many ideas. After releasing Stampede and getting so much great feedback, I've decided to concentrate my efforts. I'm renaming Stampede.io to Rancher.io to signify the new direction and focus the project is taking. Going forward, instead of the experimental personal project that Stampede was, Rancher will be a well-sponsored open source project focused on building a portable implementation of infrastructure services similar to EBS, VPC, ELB, and many others.

Most Docker projects today look to build solutions that sit on top of Docker and allow developers to schedule, monitor, and manage applications. Rancher takes a different approach: it focuses on developing infrastructure services that sit below Docker. Rancher will deliver a completely portable implementation of the infrastructure services you would expect to find in a cloud such as AWS, including EBS, VPC, ELB, Security Groups, Monitoring, RDS, and many more.

Docker has dramatically impacted cloud computing because it offers a portable package for an application. This portability means an application will run on any infrastructure, whether it is your laptop, a physical server, or a cloud. Once you have a portable application, you can do some amazing things to improve application performance, availability, and cost using scheduling, service discovery, application templating, policy management, etc. Exciting projects including Kubernetes, Panamax, Helios, Clocker, Deis, etc., are building technology on top of Docker to deliver this type of value.

Rancher focuses on a very different problem. Imagine I have an application running on AWS today that uses functionality from EBS and VPC. If I Dockerize my application and run it on AWS, I will still be able to leverage EBS and VPC. However, if I move that application to Digital Ocean, or my own datacenter, those services just don't exist. While Docker itself is portable, infrastructure services vary dramatically between clouds and data centers, making real application portability almost impossible without architecting around those differences in your application. Rancher focuses on building portable implementations of these infrastructure services that can run on any cloud, or even multiple clouds at the same time. With Rancher you will be able to get infrastructure services as reliable as those AWS provides, anywhere, including on your own hardware, another cloud provider, a dedicated server provider, or any combination of physical and virtual resources. With Rancher, hybrid cloud is no longer an over-hyped marketing term that relies on trying to make incompatible APIs work together, but instead a core capability as ubiquitous as Linux and Docker.

In the short term you can expect Rancher to focus on fundamental storage and networking services similar to EBS, VPC, ELB, and Route 53. Once those fundamental services are implemented, they will serve as the foundation for other infrastructure services similar to CloudWatch, CloudMetrics, AutoScaling, RDS, etc.

I'm building Rancher because I want users to be able to access awesome portable infrastructure services everywhere they can run Docker. Docker is portable because Linux is everywhere, and Rancher takes the same approach: we build storage, networking, and other infrastructure services from simple Linux VMs and servers. Thank you again for all of the input on Stampede, and I hope you will join me in making Rancher an enormous success.

Stampede: How You Can Help

The response to the initial release of Stampede has been overwhelmingly positive, and I thank everyone for their nice comments. I've had many questions regarding how people can help. The biggest help at this point would be feedback. Stampede has been my pet project for a couple of months, and it has served as a way for me to validate some ideas I've had. Whether and how I continue forward with this work all depends on whether other people think this would be a useful platform. What would be helpful is for people to think about what's missing from this platform that would prevent them from using it to manage their production apps. Don't think about new greenfield applications, but instead your current production apps that have some warts and hacks. If you were to Docker-ize those apps, what would Stampede really need in order to manage them in production? For example, my current roadmap would be to implement the following features:

  • Fully managed volumes (Docker and KVM) with snapshot, restore, and backup to S3
  • Load balancing
  • Security groups

What other features would you need? Think about bare metal too. For apps running on AWS: if you were to Docker-ize them and run them on bare metal, what features would you lack because you no longer have EC2?

Stampede, in its current form, has big functionality gaps and is not production worthy. But this code base is not something I just hacked together. My background is in writing these orchestration systems, and many of the ideas I wanted to validate were about how to write a better orchestration platform, not about any specific virtualization/container technology. Of the ~50k lines of code I wrote, more than 90% is just framework. There is actually very little code that has to do with Docker/KVM. The point I'm trying to make is that I feel this is a strong platform and I have only scratched the surface of what it can do.

Please tell your friends about Stampede, run it yourself, and let me know the gaps and how you would use it. Once I get an idea of people's interests, I can better decide how to take this platform forward.

Announcing Stampede.io: A Hybrid IaaS/Docker Orchestration Platform Running on CoreOS

I'd like to announce Stampede.io. Stampede is a hybrid IaaS/Docker orchestration platform running on CoreOS. It's extremely simple to get up and running and should take less than 10 minutes, whether you already have Fleet running or do a fresh install through Vagrant. There's also a short demo that gives a good overview of the current functionality. The main features at the moment are:

  • Virtual Machines
    • Libvirt/KVM
    • EC2/OpenStack images work out of the box
    • EC2-style metadata
    • OpenStack config drive
    • Managed DNS/DHCP
    • User data
    • Floating IPs
    • Private networking
    • VNC Console
    • CoreOS, Ubuntu, Fedora, and Cirros templates preconfigured
  • Docker
    • Link containers across servers
    • Dynamically reassign links and ports
  • Networking
    • VMs and containers can share the same network space
    • By default, a private IPSec VPN is created that spans servers
    • All containers and VMs live on a virtual network that can span clouds
    • Can also use any libvirt networking models for VMs
  • Interface
    • UI
    • REST API
      • Use a web browser to explore and use the API
    • Command line client
    • Python API bindings

While I feel Stampede is currently a pretty useful platform, I think the ideas behind the platform and what I'd like to accomplish are far more compelling. On the surface this may appear similar to most IaaS or Docker orchestration tools you've seen, but I assure you that under the hood it's implemented quite differently. There are many new ideas I'm playing around with, but there are two specific ideas I'd like to point out. The first is Orchestration as a Service, and the second is hybrid IaaS/Container orchestration. I talked about both of these topics at more length in a recent blog post.

Orchestration as a Service

The basic idea behind Orchestration as a Service (OaaS) is to level the playing field and make it more feasible for smaller providers to compete in the cloud space. AWS, GCE, and Azure are such juggernauts that it's hard to imagine another company entering the IaaS market, especially with margins perceived to be so low. To compete with the big three, you have to operate at their scale. On the other hand, you can get physical servers from just about anywhere, and at a far cheaper price. The premise of Stampede and Orchestration as a Service is that with a good amount of orchestration I could construct a cloud on par with AWS (ELB, EBS, VPC, etc.) if all I start with is a pool of x86_64 servers that have an empty Linux distro (CoreOS), L3 connectivity, and additional block devices for storage. By decoupling the physical infrastructure from the orchestration layer, you enable the consumer to acquire hardware from whatever provider they choose. I strongly believe this model could transform the cloud space and make it far healthier than it is today. The only companies that can compete today are the ones that can afford to burn cash.

Hybrid IaaS/Container Orchestration

I very much believe that containerization is the next big thing. But, as I spoke about before, I feel the current approach of Container/Docker orchestration tools will further cement the future of the current IaaS players. In order to make other clouds, and especially bare metal, more attractive, container orchestration tools need to take on more complex storage and networking orchestration; basically, the same storage and networking orchestration seen in IaaS. While many tout that new application architectures will remove the need for reliable storage, or networking orchestration in general, the whole world will not rewrite their applications. AWS is a great example of this. When AWS launched, it was ephemeral VMs only. Not until customers demanded it did they add EBS and VPC. While many may talk of architecting your application for the cloud, the real success of AWS was that they found technologies that helped move legacy, non-cloud architectures into the cloud. With containerization, that same practical route will be key to widespread adoption. While I very much like the newer architectures that are emerging, you cannot expect everyone to rewrite their apps to be based on ephemeral storage and service discovery.

In the end I believe the same approaches and technologies used in IaaS orchestration are very applicable to containers. IaaS by itself is not sufficient for containers, and thus I've built a hybrid IaaS/container orchestration system.

I’m very proud to show off this work, but it is still raw. The intention is to demonstrate the feasibility of what could be accomplished. Hopefully you’ll find this work appealing. Have fun!

Darren Shepherd - https://www.linkedin.com/in/darrensshepherd

Containers as a Service (CaaS) Is the Cloud Operating System

Historically, the architecture of PaaS and IaaS has been that PaaS sits on top of IaaS. As containers have emerged as a first-class technology, there seems to be a new intermediate layer forming. With the rise of Docker, I've observed two classes of platforms being built: PaaS "powered by Docker," and Docker orchestration. A better PaaS powered by Docker is personally of very little interest to me, as it's just a better iteration of what has been done in the past. Docker orchestration has the potential to be more transformative. It's Docker orchestration that has introduced this new layer I mentioned. Many have coined this Containers as a Service (CaaS). As I've sat and pondered the real value and impact of this layer, it finally dawned on me that CaaS is in fact "The Cloud Operating System," and Docker's role is to define its interface.

The IaaS/PaaS dichotomy

With IaaS you get raw assets that are extremely flexible. With PaaS you get a very locked-down experience that is optimized for a particular use case. These two solutions are at opposite ends of the spectrum.

Netflix serves as probably the best example of someone fully exploiting the capabilities of the cloud. What they have done with AWS is truly amazing and shows what can be accomplished. They also serve as a cautionary tale. While they've fully exploited the cloud, when I last looked, Netflix had released over 35 open source projects, all centered on running scalable applications in the cloud. While most people are not trying to run at the scale of Netflix, it does point out that there is still a significant amount of tooling needed to run in the cloud. IaaS gives you raw infrastructure on demand, but you still need to manage that infrastructure in some fashion.

For those wanting to avoid the overhead of maintaining servers, operating systems, and runtimes, PaaS has historically been presented as the solution. PaaS is great in that you can just focus on your code and not worry about all the runtime maintenance. Quickly, though, you run into restrictions on what your code can do. For example, running Java on Google AppEngine (GAE), you often need to first check whether the third-party libraries you are using are compatible with GAE, as not all of the JRE is fully exposed. Other functionality, like listening on sockets, is just not possible.

Introduction of Containers as a Service (CaaS)

The introduction of CaaS solves a lot of practical issues with the IaaS/PaaS dichotomy and further expands the capabilities of what one can do. CaaS really serves as the missing layer between IaaS and PaaS. When you look at a server, you can break it into three logical layers: physical hardware, operating system, and applications. Respectively, you can look at those as IaaS, CaaS, and PaaS.

IaaS provides hardware assets, whether physical or a virtual representation of the physical counterpart. What you really get are RAM, CPUs, hard drives, and NICs. You, as the consumer of IaaS, are in control of everything else. While the cloud provider may supply a template with an operating system, the consumer is still ultimately responsible for it.

PaaS provides an application runtime. This ends up being language-specific, so what PaaS is providing is a managed environment for Java, Ruby, Python, etc. PaaS is really concerned with providing what runs inside the process.

What if I just want a generic framework to run processes of any size or shape? That is the focus of CaaS. While IaaS provides the hardware and PaaS provides the runtime in the process, CaaS is the missing layer that glues these two things together.

The Cloud Operating System

CaaS plays the role of the operating system. Wikipedia states, “An operating system (OS) is software that manages computer hardware and software resources and provides common services for computer programs.” CaaS’s responsibility is to provide a platform in which you can run processes and manage the services and resources needed by those processes. What runs inside the process is completely arbitrary to CaaS.

CaaS really is the next logical evolution after IaaS. Now that I have hardware on demand, I no longer want to manage an operating system. Instead I'd like to focus on my application, without the constraints that PaaS imposes. There is still value in PaaS, but its scope is diminished to providing a managed runtime. In fact, with CaaS it is significantly easier to build a PaaS.

Many projects in the past have taken on the moniker of "The Cloud Operating System." I've never particularly liked that, as I didn't feel there was any real definition of what it meant. Instead it was a marketing term just as useless as the term "Cloud." Finally, with CaaS, I think "The Cloud Operating System" is a correct fit. CaaS is your operating system.

The Role of Docker

How does Docker fit into all this? As Solomon Hykes, CTO of Docker, said, “The real value of Docker is not technology. It’s getting people to agree on something.” If CaaS provides the operating system, and the consumer provides the user space, what is the interface of that operating system? In the UNIX world we have POSIX. Love it or hate it, it is a standard that defines an interface to the operating system. Docker largely fills a similar role.

Docker provides the language with which we describe and talk about containers. It also defines what basic storage, networking, and services are available to containers. Furthermore, it helps ensure that applications written for Docker stay portable across varying types of infrastructure.

In summary, CaaS is the Cloud Operating System and Docker serves as its portable interface.

Evolution of Docker and Its Impact on AWS

I've been pondering a lot lately on how the Docker ecosystem will evolve and what its impact will be on the larger infrastructure market. My current day job has allowed me, over the past 8 months or so, to focus almost 100% of my time on R&D efforts around Docker. As a result, I've had a lot of time to think about these things. In my mind I see three phases in which Docker could evolve. It isn't obviously trending the way I envision, and the purpose of this post is to articulate the subtle differences in how I think it should change, and their impact.

The phases through which I think Docker should evolve can be described as, first, Docker as an application package; second, Docker as a unit of orchestration; and, finally, a container cloud. While all the hype in the media says that containers will revolutionize the cloud, based on where I see things headed today, they won't, at least not by my standards. But they can, and that's what I'd like to explain. We have an incredible opportunity to completely turn the infrastructure business on its head. So stick with me here; it gets good at the end.

Docker as an Application Package

The first phase is where we are today, and it's what made Docker so popular. The basic concept of using Docker as an application package is really to put Docker on a very similar level as rpm and deb. To run nginx on your server today, you log in, "apt-get install nginx", and then run nginx. With Docker you can now log in and run "docker run nginx" or "docker run my-nodejs-app." Docker just becomes a more portable, easier-to-use rpm/deb. That terse explanation undersells the incredible benefits Docker brings to application packaging, deployment, and management, but conceptually I think it's a very useful analogy.

At this phase of evolution, the impact on the larger infrastructure space is small. Basically, at this level of maturity you can expect Docker to become a very popular DevOps tool with an impact similar to something like Puppet or Chef. Puppet and Chef are great tools and do enable you to run infrastructure better, but the impact of those tools can't be compared to something like AWS, which was an absolute game changer.

The talk of containers changing the world, threatening AWS's stronghold on the IaaS market, and other grandiose claims will not be realized if we don't evolve past this simple use case. The reason is quite basic. If Docker remains just a means of packaging, you are running the same infrastructure you run today, merely packaging the applications in a different manner. One argument people make is that you'll now see a trend towards bare metal because of Docker. AWS enabled a paradigm shift where the unit of deployment became the server. With Docker, you're now able to move back up the stack and make the unit of deployment the container, with even more agility than what AWS originally enabled. This means you aren't creating and destroying servers as often, and from this perspective buying bare metal should be more attractive.

If you are running Docker on AWS today, you are most likely using some aspect of EBS, security groups, ELB, and maybe some VPC functionality. Therein lies the problem. With containers you still need persistent storage and snapshots, load balancing, firewalling, and other random things. When you move to bare metal, you need to solve all these problems again. There's nothing magical about applications in Docker that would suddenly negate the need for something like EBS. If your application today uses EBS, then your application in Docker will still need EBS. In practice, what I think will happen is that people will just move to larger, more static instances on AWS. To a certain degree this could help AWS, because most likely the small t1, t2, and m1 instances have the lowest margins.

Just because you can package an application in a better fashion doesn’t mean you’re going to change the face of the cloud.

Container Cloud

The next level of evolution that I think Docker should enter is that of orchestration. Before I address what I mean by orchestration, I’m going to talk first about the last phase, which is a container cloud. I want to talk about this first because this seems to be the logical next step for most people. Most higher level orchestration systems are really about building a container cloud.

A container cloud can be described easily by saying the level of abstraction becomes the container. In IaaS today, the level of abstraction is the virtual machine (VM). The VM runs on some physical hardware, but as the consumer of the cloud you have no visibility or access to the physical hardware. This is what makes the cloud a cloud. It just runs somewhere and you shouldn’t need to concern yourself with that. In a container cloud, similar to the VM, the container runs on something and you have no visibility/access to the underlying platform.

Most Docker orchestration tools I see today gravitate conceptually towards this model and maybe one day we’ll get there. Right now, I think there are some very practical issues in going towards this model. The first is security, but that is less of a concern to me. The second is a general question of usability and the appeal to the user.

Regarding security, as it stands today, Docker is not secure for multi-tenant environments. This is well known. If you wish to run multi-tenancy, you have three basic options. The first option is to run a restricted set of Docker functionality. This means not allowing root, controlling the image, and probably adding more SELinux rules. If you wish to run a PaaS or a specific use case, these restrictions may be just fine for you. This is essentially how Heroku can run container technology today in a multi-tenant environment. The second option is to use physical machines or virtual machines as your security boundary. When a customer goes to deploy a container, they buy a 1GB bucket, and that bucket ends up being a 1GB VM running somewhere. If you're actually spinning up VMs, you're then tying a group of containers to a specific host. I don't think that's really what the user wants. The third option is to just ignore it and assume somebody will figure it out. Option three is quite common.
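As a rough sketch of what the first option, a restricted set of Docker functionality, might look like (the user ID, flags, and image name are illustrative, not a complete hardening recipe):

    # no root inside the container, no capabilities, and only an approved image
    docker run -u 1000 --cap-drop ALL registry.internal/approved/web-app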

Assuming security will probably be fixed, or maybe the current security is sufficient for your needs, the much, much larger issue is the general usability of a container cloud. There's an ongoing debate over whether you should run SSH in your container. The "right" answer is no, you should not. The practical answer is often "yes, I need to." This basically sums up the issues of a container cloud. A container should really be a process. If I now have a running process somewhere in the cloud, how do I do runtime introspection of that process and provide other services like syslog, cron, monitoring, etc.?

If you don’t give access to the host system, you need to either run all the tools in your container, or the container cloud must build a large suite of tools to provide all the runtime introspection you need. If you run all your tools in the container, suddenly your containers are almost as fat as a VM. This really shouldn’t be the goal of the community.

If you build a large collection of tools so that the users of your cloud can introspect their containers, there's a very good chance that you'll irritate the same audience of users you want to attract. It's the same conundrum that plagues PaaS. PaaS targets developers so that they don't need to worry about all the underlying details of running their app in production, but by doing this it often restricts what the developer can do, and that makes the PaaS less attractive. DevOps people are the crowd today that loves Docker. There's a very good chance that they will not like the tools provided by the "container cloud" for introspection. Instead they will want to just log into the host and run the tools they want. In the end, the success of the "container cloud" will probably require extending the audience past the DevOps crowd.
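This is exactly what host access buys you: from the host, the standard tools still work against a container's processes. A minimal sketch, with a hypothetical container named web:

    # resolve the container to a host PID, then use whatever tools you like
    PID=$(docker inspect --format '{{.State.Pid}}' web)
    strace -p $PID
    lsof -p $PID

    # or enter the container's namespaces while keeping your host binaries
    nsenter --target $PID --mount --uts --ipc --net --pid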

It will take quite a bit of time to overcome the issues that I've outlined. Assuming we do overcome them, what will be the impact on AWS and others? I venture to say almost nothing. You can imagine that in this container cloud model you will now go to a "container cloud" provider, or maybe you can run your own container cloud with some open source Docker orchestration tool.

The "big three" cloud providers today (AWS, GCE, Azure) will be the best suited to provide this "container cloud." You still need to run large quantities of physical hardware, which they already have. Besides the basic need for hardware, the other issue is the orchestration itself. As I mentioned earlier, if you're using Docker today on AWS, there's a very good chance you're leveraging EBS, ELB, or VPC functionality. Containers still need storage and networking functionality, and that needs to be orchestrated in some fashion. It's slightly different with containers, but it still exists. Either you run your Docker orchestration tool on AWS, or you build all the orchestration of storage and networking separately. If OpenStack has taught us anything, it is that these orchestration tools are hard to build. AWS, GCE, and Azure have a huge advantage here. Orchestrating a VM or orchestrating a container carries all the same basic issues, and those are issues they know how to deal with. OpenStack has failed to put a dent in AWS, and many can argue that it has in fact helped them (or, more specifically, helped GCE and Azure). Building an open source container platform to replace AWS in the model I've just described will largely follow suit. The trend I'm seeing is that these container cloud systems will in fact just be built on top of AWS.

At the end of the day, given what I’m currently observing, if we do ever get to the model of a full container cloud, that cloud will probably be running on AWS. Yet again, we fail to really impact AWS.

Orchestration as a Service

From my perspective, given how I see the current trends in the Docker community playing out, there will not be a mass exodus from AWS. But there could be, and let me explain how.

We need to find the happy compromise between the application packaging model and the container cloud; one that pulls the power away from AWS. Let me present to you a very, very subtle change to the "container cloud" model I've presented thus far. So subtle you'll probably say, "duh, that's what everyone is already doing," but I'll show you why that's not true. There are two key parts to this. First, the level of abstraction is not the container; instead, the user still has access to the underlying server, whether physical or virtual. Second, you support a "bring your own server" model. I think the easiest way to describe this model is to walk through a hypothetical user experience.

A user goes to my-awesome-container-orchestrator.com and creates an account. They then go to AWS, SoftLayer, GCE, etc., get a physical server or VM, and register that server with my-awesome-container-orchestrator.com. Once that server is registered, they can deploy containers on it, manage Docker volumes, snapshot them, back them up, move volumes around, dynamically link containers across servers, dynamically map exposed ports, manage service discovery, set up security groups on the server, place containers on a private L2 with custom subnetting, add load balancing, add metrics and monitoring, add autoscaling, etc. Since the user owns this server, they can still log in. If they wish to run lsof or strace or install additional monitoring tools, they can.

With this model we are basically providing Orchestration as a Service (OaaS). Since all that containers require is Linux, you really don't care about the underlying hardware. Running CoreOS on AWS, GCE, a physical server in SoftLayer, some colo, or your basement is basically the same. With OaaS you basically create one gigantic cloud comprising all the servers in the world. This basic idea was tried in the past with VMs, but it fell apart because there is no portability between clouds today. Docker and Linux now make this possible.

The scenario I just described may seem obvious and in line with much of the hype in the media about how Docker will change the world. There are two subtle, yet very important, differences in the model I'm envisioning. The first has to do with how we build these orchestration systems; the second is regarding how we treat storage and networking.

Decouple Orchestration from Infrastructure

First, orchestration systems need to be decoupled from the ownership of the underlying infrastructure. This is required for the "bring your own server" model to work. In the past, hardware and orchestration have always come together. There is no such thing as "OpenStack as a Service" that is not bundled with a specific hardware offering. The nature of VMs and IaaS makes this almost impossible: setting up a virtual machine cloud is very specific to the hardware environment. With Docker and containers we move up the stack from a hardware dependency to a Linux dependency. This means Orchestration as a Service is actually feasible without requiring any specific hardware environment beyond your standard x86_64 server.

Most orchestration systems being built today carry the fundamental assumption that the infrastructure owner is also the owner of the orchestration system. Whoever owns the VM or server running the containers is also in charge of managing the orchestration system. By own, I mean control. Some systems I've seen allow you to enter your AWS creds, but the VM that they deploy on AWS is a black box to you. This is problematic. The first problem is that you are locking the user out of the host. Going back to the issues I described about annoying DevOps people, I think the most practical solution is to give access to the host. The second issue is that if you're giving your AWS creds to some orchestration system, that system is specific to AWS or whatever cloud provider. You need to make it such that the user can obtain the server and register it. This opens the possibility of getting infrastructure from anywhere.

It is actually quite difficult to build an orchestration/IaaS system that assumes the host could be malicious. There are a lot of things to consider.

Storage and Networking

The second point is how we implement storage and networking. The great thing about Docker is that all you need is Linux. Docker runs wherever Linux runs, and that is how Docker achieves its incredible portability. This high degree of portability is what has sparked the notion that Docker will transform the cloud. The three basic building blocks of IaaS are compute, networking, and storage. Docker is basically a compute technology. As I keep mentioning, what will keep people tied to AWS is largely the storage and networking services that it provides. Docker achieves its high portability because it only needs Linux. For storage and networking, we can take the same basic approach.

Linux and the software that runs on Linux have all the raw technologies needed to provide the same functionality that EBS, ELB, and VPC provide. If you look at Linux hypervisors today, the hypervisor largely takes care of virtualizing the CPU and giving access to storage and networking. The actual implementation of the storage and networking is provided by Linux technologies (OVS, GRE, VXLAN, bridging, ip/eb/arptables, dnsmasq, qcow2, vhd, iscsi, nfs, etc.). Even if you consider iSCSI or NFS, the actual implementation of the iSCSI or NFS server can be Linux. What is missing, to truly provide functionality on par with EBS, ELB, and VPC, is the complex orchestration to piece together all of the individual raw technologies.
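To make that concrete, here is a hedged sketch of one such raw building block: a bridge on each host joined by a GRE tunnel, the kind of primitive a VPC-style private network is orchestrated from (addresses and names are placeholders):

    # on host A (203.0.113.1), peering with host B (203.0.113.2)
    ip link add br0 type bridge
    ip link set br0 up
    ip link add gretap1 type gretap local 203.0.113.1 remote 203.0.113.2
    ip link set gretap1 up
    ip link set gretap1 master br0

    # containers and VMs attached to br0 on either host now share one L2 segment

Run the mirror-image commands on host B and you have a private network spanning two machines, or even two clouds. None of the individual pieces are exotic; it is the orchestration around them at scale that is missing.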

Let's reimagine the cloud and start with the basic assumption that all I have available to me are empty Linux servers with basic L3 connectivity and additional block devices for storage. If I add a high degree of orchestration, I can take those basic building blocks and construct a full cloud with functionality on par with AWS.

I don't know if I can emphasize this point enough. All you need is Linux, and you can get Linux from anywhere. This completely removes the stronghold AWS has on infrastructure. It is drastically cheaper to run physical servers. The value that AWS provides today is the additional services of EBS, ELB, VPC, etc. With containers you still need those services. But if Linux can provide the raw technology needed for that functionality, and the orchestration of that technology is provided as a service, you are then free to choose whatever infrastructure provider you want. There's almost no reason to use AWS anymore. Use some local colo company. Then use some other colo in Singapore. You don't need to go with one company that has a global footprint. Now any data center provider can compete with AWS. Throw in a global CDN like CloudFlare and you have an incredibly complete picture at a fraction of the cost of AWS. This approach is basically going to encourage small physical hosting companies to pop up that can operate at a lower margin than AWS can (unless they just lose money, which Amazon seems to be fine with).

I'm sure many are thinking that there are use cases in which you really do need some non-Linux technology, like a hardware load balancer or an expensive SAN. First, I would say the vast majority of use cases can be solved with Linux technologies, and the performance is acceptable. There are always cases where people need more. The model I've described is better suited to handle these requirements. For example, imagine that for whatever reason you really want to run NetApp. In AWS, if you don't like what they have, you're out of luck. In the model I've described, since the user is supplying the infrastructure, you are free to install NetApp in the same rack and use it however you choose.

I hope you understand that if we just slightly tweak our perspective, we can have a huge impact. If we continue on the path I see today, I think we will just further enable the domination of the current cloud providers. The ideas I've presented are very complex to implement, but this is what I do and this is what I spend most of my time thinking about. I know it can be built, no doubt.