Cloud computing maturity: Childhood, Puberty and Adulthood


If two companies offer a form of cloud computing, do they offer exactly the same thing, or can we find differences, and how? For a self-respecting IT company it is nowadays almost impossible to ignore the latest buzzwords such as “Cloud”, “PaaS”, “IaaS” or “SaaS”. They are often presented and thought of as the Holy Grail for all our IT challenges. Most hosting companies will offer some form of cloud computing (or use a different buzzword for something similar) and will try to convince us that all IT challenges belong to the past. We need a way to distinguish these different solutions and providers in order to understand which solution and/or provider aligns best with our needs.

Maturity Levels

We can distinguish 3 maturity levels of cloud computing, which for easy reference we will call childhood, puberty and adulthood. It is important to understand that, just like in normal life, a provider that qualifies for a certain maturity level can over time evolve into the next phase. This evolution follows from a natural learning process and the drive to continuously invest in and improve the provided solution.

Childhood

Lots of hosting companies just changed the name of their hosting offering into one of the current buzzwords without actually changing the underlying offering, business/IT processes or business model:

  • Traditional hosting, but now using a “cloud” buzzword
  • No automation of processes
  • No transparent “pay-per-use” invoicing
  • No self-service portal
  • Great diversity in the hosting platform, no homogeneity
  • Hardly any consolidation of systems, inefficient use of computing resources
  • There is a hard cut-off in terms of performance, no (auto) scaling possibilities
  • General Availability/uptime will be limited

Puberty

Some companies started to understand what cloud computing is really about and started to gradually transform the operations of their business. The puberty phase is, like in normal life, a transition phase in which a company streamlines the platform and all its operational processes. This cannot be done overnight; it is a gradual process.

  • Transparent pay-per-use invoicing
  • Passive self-service portal, meaning that customers can retrieve (usage) reports on their environment, but cannot directly manage the environment from the portal
  • The hosting stack is becoming more homogenous but some legacy diversity is still there
  • Basic operations are handled through a management framework sitting on top of the hardware
  • Availability and performance can vary because the company is learning and introducing lots of changes all the time

Adulthood

This is where you’d want a cloud provider to be. This level covers fully matured hosting providers that have successfully transformed their business to run a smooth, consolidated and automated cloud platform:

  • All recurring tasks have been fully automated
  • There is a transparent pay-per-use invoicing model
  • The complete hosting stack is homogenous
  • A cloud management framework sits on top of the bare metal hardware, providing operational/management tools to engineers and customers
  • A full-fledged self-service portal is available for customers and integrates with the platform management framework
  • Auto scaling and consolidation are available for efficient use of the computing resources
  • Availability/uptime will be very high because of fewer human interventions (fewer errors) and auto scaling/failover processes

The presented key characteristics are just a few examples of how to classify cloud providers. There are a lot more that can be considered, but this should give a general idea on how to look at this.
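As a rough illustration, the characteristics above can be turned into a simple checklist. The sketch below is only one possible reading: the field names, the `Provider` class and the thresholds between levels are assumptions made for this example, not an established classification scheme.

```python
from dataclasses import dataclass


@dataclass
class Provider:
    """Hypothetical checklist based on the characteristics listed above."""
    pay_per_use_invoicing: bool
    self_service_portal: str        # "none", "passive" (reports only) or "full"
    homogeneous_stack: bool
    recurring_tasks_automated: bool
    auto_scaling: bool


def maturity_level(p: Provider) -> str:
    """Return a rough maturity level; the cut-off rules are an assumption."""
    if (p.pay_per_use_invoicing
            and p.self_service_portal == "full"
            and p.homogeneous_stack
            and p.recurring_tasks_automated
            and p.auto_scaling):
        return "adulthood"
    if p.pay_per_use_invoicing and p.self_service_portal in ("passive", "full"):
        return "puberty"
    return "childhood"
```

A provider with pay-per-use invoicing and a reports-only portal would land in "puberty" under these assumed rules; a real assessment would weigh many more characteristics.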

Costs

At first glance, providers in the “Adulthood” category can be assumed to provide the best offering, but also the most expensive one. This is not necessarily true, as they should also have established best-in-class internal efficiency thanks to the high level of automation of all operational processes and self-support for customers. This should limit the cost price of these providers, lowering the costs for customers. Providers in the middle of their “Puberty” phase are in the middle of a very large transformation and are investing lots of money in their company. This most likely increases costs and hence prices for customers.

Cloud computing: A business model


There are many articles on the Internet about cloud computing, covering all sorts of subjects from history to future trends, advantages, disadvantages, what it is, what it is not, and the list could go on and on. Lately, more and more people seem to be questioning the nature of cloud computing services: is it really about technology, or is it as much a business model? Of course, there are also articles describing the cloud as an overhyped marketing ploy, but let’s just concentrate on the facts. Cloud computing comes with a value proposition and is not about competing technologies but about a new, business-oriented model. This is one way of looking at it, but without the technology, the business model would not be possible.

The business model says that computing services will be provided on demand, how and when they are needed, quickly, efficiently, and very cost-effectively. This is a new way of ensuring a business has the IT support it needs, defined in terms of services, terms and costs. The discussion is no longer about how many servers, IT specialists or licenses we need, and how long it will take to make it happen.

The technology is really what makes cloud computing possible. Cloud computing is, at its core, about IT capabilities. It is about providing the same resources, applications, infrastructure, platforms but now in a scalable way using the Internet to access them. This is a challenge to the IT industry, both software and hardware, and a driver for technological progress to respond to demands covering very diverse services but also a significant (and constantly increasing) number of different devices.

What this means is that technology has to do what it has always done: provide the best services to support the business. The business can now stop worrying about IT capabilities and concentrate on what it does best. Cloud computing simplifies IT services and makes them more cost-effective for the business, but it requires the IT industry as a whole, and especially the professionals working within it, to raise their game: not only technically, but also by becoming more business savvy.

Best Quotes on DevOps and Web Performance – Velocity Europe 2011


We’ve seen a lot of great talks at Velocity Conference Europe 2011. We grabbed the following words of wisdom from the keynote presenters:

  • “DevOps is bullshit, we need *Ops” –  @postwait
  • “Rule 1: what you build will break” –  @postwait
  • “If it’s not monitored, it’s not in production” –  @postwait
  • “DevOps needs balance: DevOps has been about putting more Dev into Ops. We need more Ops in Dev” - @postwait
  • “Exceptional performance means being faster than a user expects us to be” – @AloisReitbauer
  • “It’s easy to go fast, but it’s hard to go fast in the right direction” – @jonjenk
  • “OPS needs to reduce the cost of making mistakes” - @jonjenk
  • “Developers are not done until all tests pass” - @timmorrow
  • “Don’t manage complex systems, but reduce system complexity” - @schlomoschapiro
  • “Adaptive systems, are adaptive, they are not magic ” - @allspaw
  • “Complex systems is not the same as complicated systems” – @allspaw
  • “Fear is the anticipation of future failure, confidence is the anticipation of future success” - @allspaw
  • “One cannot take past successes as a guarantee against future failures” - @allspaw
  • “You should graph tweets about your business to find out if you have a problem” - Johannes Mainusch, XING AG
  • “The ‘paradox of operations’: if you don’t fail, you don’t improve. So we need to embrace failures and learn from them” – @cstrep
  • “Dev and Ops are good at guessing. Dev and Ops are bad at guessing correctly” - @bruntonspall
  • “A great tool for web ops needs to be automatic, accessible and actionable” –  @briandoll

Please do leave a comment if you have another one that should be added here.

Monitoring Is Easy, Why Are We So Bad At It?


Abstract of a Velocity Conference Europe 2011 Keynote by Theo Schlossnagle

The Purpose of Monitoring

What is monitoring:

  • Analytics
  • Trending
  • Fault detection / alerting
  • Capacity planning

It’s the collection and use of telemetry data

Monitoring is not:

  • Controls
  • Via monitoring you observe, you do not influence

Monitoring and management/controls are two different things. You should never mix the two.

We suck at it because we think about:

  • Networks
  • Systems
  • Applications

What matters: Business! The purpose of monitoring is to make your company’s web business operate. It is very important to understand this purpose.

As a technology Operations group you have the technology to make it work and to do great things. Other departments don’t have this technology or these tools. You cannot directly influence the business, but you can indirectly influence it through the things you operate/control.

Visualize!

We don’t really understand computers. We need to visualize the use of our computers to know what’s going on. It’s very hard to create a proper dashboard (like you would have in a car), as these sorts of meters don’t always make sense. In general, graphs are much better than gauges. Gauges are often misleading if you don’t know the limits, but can be great for percentages, temperature, power per rack or bandwidth per uplink. It is often much better to just use graphs though, as they tell us much more: they not only show us what’s going on now, but also what happened before. It makes sense to project historic trends (e.g. last day/week) in the same graphs so you can compare easily.
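The idea of projecting a historic trend into the same graph can be sketched as follows. This is a minimal illustration, not something shown in the keynote: it assumes samples are stored in a plain dictionary keyed by timestamp, and the function name `with_baseline` is made up for this example. The resulting rows could be fed to any plotting tool as two series.

```python
from datetime import datetime, timedelta


def with_baseline(samples, offset=timedelta(days=7)):
    """Pair each sample with the value recorded `offset` earlier.

    `samples` maps a timestamp to a metric value. Returns a list of
    (timestamp, current, baseline) tuples in time order; baseline is
    None when no sample exists at exactly that earlier timestamp.
    """
    return [(ts, value, samples.get(ts - offset))
            for ts, value in sorted(samples.items())]
```

Plotting `current` and `baseline` on the same axes makes it easy to see whether today’s traffic deviates from the same moment last week.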

Another way of presenting useful information is using text. Using text messages as part of monitoring can be really useful. Geolocation is interesting to visualize for most people, but not really for Operations.

It’s all about real-time: You need to combine the data, you need to make the data real-time. You don’t want to wait for 5 minutes hitting the refresh button. You want to see what’s going on, right now.

A Career in Web Operations


Abstract of a Velocity Conference Europe 2011 Keynote by Theo Schlossnagle

Your Career

Your career is about being better, not making more money. To be truly excellent, you really need to treat your job as a true craft.

Basic steps for success in your career:

  • Step 1: Educate yourself. Make sure you know what you are talking about. Be the expert.
  • Step 2: Be disciplined
  • Step 3: Learn from & share with your peers
  • Step 4: Be patient: Experience takes time (and mistakes)

SaaS & DevOps & OpsDev

If you have one copy of the code that runs for all of your users, it means you are running Software as a Service (SaaS). Your website usually runs one copy for all users, so that makes it SaaS as well. What does it take to make all of this work? DevOps.

Theo argues that DevOps on its own is bullshit. DevOps is incomplete, interpreted wrongly, and too isolated. There should be an Ops element in every department of the company, so basically we need some form of *Ops. We need everything to be Ops-oriented: everyone needs to have an operational mentality at every step in delivering to the customer. The mindset should not be about how we run our servers or software, it should be about how we run our business.

DevOps has always been about putting more Dev into Ops. However, DevOps needs balance. OpsDev: we need to put more Ops into Dev. Operations is like security: it’s a mindset, a mentality, not something to tack on at the end. When something is wrong in production, IT’S WRONG, whether or not you can reproduce it in other environments. Technology is everywhere and is no longer just a vertical in companies. In the future, the CIO/CTO role will change into a COO role.

 Someone needs to make it run better – You!

Key success factor in IT outsourcing


Outsourcing in IT has now grown to a 313 billion dollar industry. This represents growth of 6.9 percent compared to 2010. Until 2015 the market is expected to grow by 4.6 percent per year on average. According to Gartner research involving 47 major IT outsourcers, 62 percent of respondents plan to further grow revenues this year (Source: http://www.computable.nl/artikel/ict_topics/outsourcing/4160345/1276946/itoutsourcing-groeit-naar-313-miljard-dollar.html). This steady growth is due to the fact that companies are starting to realize that focusing on their core product is the strategy to stay ahead of the competition. Why is it, then, that IT outsourcing so often fails drastically, or at least does not deliver the desired customer satisfaction? A lot of these projects result in unhappy customers combined with long-term vendor lock-in.

Why does it fail?

The real reason IT outsourcing projects fail is that companies often do not consider the IT platform part of the core product. If this IT platform is used to deliver products or services to the customer, it is just as business critical as the product itself. In a way, the combination of the IT platform and the products delivered on top of it becomes the core product of a company. The success of this combination determines end-user satisfaction. The risk is that companies disconnect the two and just throw the IT part over the fence to an outsourcing party, telling them to “go make it work”. This approach usually results in project failure and a waste of money.

Key success factor

IT outsourcing should never be treated as a traditional customer-vendor relationship but needs to be viewed as a long-term, strategically important, close partnership in which customer and vendor become one team with one mission. This can only really be achieved by a dedicated team that has end-to-end ownership of and responsibility for the environment. Traditional outsourcing parties usually operate in a “one can do all” mode, where engineers on shift don’t have the knowledge or buy-in a dedicated team would have. The dedicated team will need to act as an extension of the company’s IT team and work side by side with it on a daily basis. The team will need to be involved in all phases of the life cycle, from design and implementation through operation to decommissioning, to ensure ownership and buy-in. This also ensures that knowledge about the platform does not drain away to the outsourcer but stays available within the company. Companies need to anchor this knowledge to ensure they keep understanding their own product and how it is operated.

Summarizing the key factors listed above:

  • Understand the true value and business criticality of the IT platform for the success of the company
  • No traditional customer-vendor relationship: Work closely together with the outsourcing party, as one team with one mission
  • Dedicated team that has end-to-end ownership and responsibility in all phases of the platform life cycle, no “one can do all” mode
  • Anchor the knowledge on how to operate the platform at both sides

Cloud model of the future


The hybrid cloud model is expected to be the key IT model for the next 5 to 10 years. This is due to the fact that most businesses are not likely to move fully to a public cloud at once. The other important reason is that there are still lots of business critical traditional applications that are not built to run securely in the cloud. It will probably take another 10 years before all current applications are “cloud ready” and the concerns about running them in the cloud have been taken away.

We nowadays distinguish several different kinds of cloud models:

  • private external cloud
    This comes down to renting a dedicated pool of resources (hardware) from another provider and running all applications from that environment. In this model, the resources are dedicated to you and you know exactly where the hardware is hosted.
  • private internal cloud
    This is actually not a real cloud, but more a bunch of hardware hosted on premise by the company itself. IT departments like buzzwords and therefore often start calling their old set of hardware resources a “private cloud”. Pay-per-use does not apply in this case either, as you buy all hardware upfront.
  • public cloud
    This is the real deal. A public cloud offers computing power to multiple customers from a shared pool of resources. The general location where the public cloud runs is often known, but not the specifics.
  • hybrid cloud
    The hybrid cloud model combines on-premise (cloud) infrastructure with public cloud platforms. A combination of the two is often seen when companies start to move some services into the cloud but are not (yet) ready to run everything in the cloud.

When considering or implementing cloud computing, it is very important to classify all applications and services to see which are suitable to move into the cloud and which need to stay on premise for now. Important aspects such as performance, security and export limitations should be looked at when doing this. Businesses will choose to keep applications that raise such concerns on premise, but move suitable applications and services into the cloud.
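A classification like this can start out as a very simple rule set. The sketch below is only an illustration under assumptions: the `Application` fields are a hypothetical way of recording the performance, security and export aspects mentioned above, and the rule that any single concern keeps an application on premise is a deliberately conservative choice.

```python
from dataclasses import dataclass


@dataclass
class Application:
    name: str
    latency_sensitive: bool       # performance concern
    holds_sensitive_data: bool    # security concern
    export_restricted: bool       # export limitations


def placement(app: Application) -> str:
    """Return "on-premise" if any concern applies, else "public cloud"."""
    if (app.latency_sensitive
            or app.holds_sensitive_data
            or app.export_restricted):
        return "on-premise"
    return "public cloud"
```

Running `placement` over the full application inventory gives a first cut of the hybrid split, which can then be refined per application.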

Handling unpredictable risk


The need for risk management is becoming increasingly important. The ongoing recovery of the market and increasing revenues make companies feel comfortable enough to take more risk in order to boost revenues even further. Risk management is becoming increasingly challenging because modern businesses and customers expect the latest and greatest without any disturbance of production services. They basically want to replace the jet engines while the plane is flying. Without any risk. What makes risk management even more complex is the fact that an important part of the risk is external. External risk is always very hard to predict and remediate.

Unpredictable Risk

In any project there will always be risk factors that no one thinks of. Due to the complexity of risk analysis, there is always a small fraction that does not end up in the risk assessment because nobody ever thought of it. It’s easier to predict what will go right than to predict beforehand all the small (or big) things that can possibly go wrong. This leaves all IT (change) projects with the notion of unpredictable risk. And the one thing everybody (especially in IT) hates is unpredictable risk.

The only way to properly handle unpredictable risk is to have proper procedures and controls in place. These controls and procedures should be general but powerful enough to handle any unforeseen risk that might become reality. Any unpredictable risk that becomes reality should be classified. Based on this classification the appropriate controls and procedures should be applied. If the controls or procedures are not able to handle the risk at hand, they are not good enough. You need to make sure you learn from it and apply better controls and procedures next time.
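The classify-then-apply step described above could look something like the sketch below. Everything in it is hypothetical: the three severity levels, the classification thresholds and the procedure texts are assumptions chosen to keep the example short, and a real organisation would define its own.

```python
from enum import Enum


class Severity(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical procedures, one per classification.
PROCEDURES = {
    Severity.LOW: "log the incident and review it in the next retrospective",
    Severity.MEDIUM: "roll back the change and notify the service owner",
    Severity.HIGH: "invoke the incident process and fail over",
}


def classify(customers_affected: int, production_down: bool) -> Severity:
    """Classify a risk that has become reality; thresholds are assumptions."""
    if production_down:
        return Severity.HIGH
    if customers_affected > 0:
        return Severity.MEDIUM
    return Severity.LOW


def respond(customers_affected: int, production_down: bool) -> str:
    """Look up the procedure that belongs to the classification."""
    return PROCEDURES[classify(customers_affected, production_down)]
```

The point of keeping the table general is that an unforeseen event still maps to some procedure; if the procedure turns out to be inadequate, the table is what gets improved afterwards.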

When the cloud goes down: Software design for failure


Cloud computing & HA

A recent outage of Amazon’s Virginia-based cloud resulted in several companies being completely offline for several hours, days or worse. The world was shocked, as it had always assumed the cloud to just be there, always. One can only imagine the damage done to businesses in terms of missed revenue, but also because of bad publicity. Yet the problem did not hit all companies that were serving their customers from the same Virginia-based cloud. How did they manage to handle these problems?

The great thing about cloud computing is that you don’t have to worry about the availability of the underlying platform that is hosting your applications. Or at least you think you don’t have to care about it. Most people assume availability is something we used to worry about in a traditional non-cloud set-up, and nowadays expect the cloud provider to take care of it. This is a fundamental mistake, as we have seen with the recent outage at Amazon, the world’s largest cloud provider. Another example is the recent bankruptcy of the Dutch cloud provider iTricity. Their strategy was to deliver IT as electricity, but we all know even electricity coming from a wall socket is never guaranteed. That’s why we have emergency power supplies in data centers and hospitals. It’s funny that we don’t always take the same kind of measures when it comes to applications we run in the cloud. We probably just assume the cloud already takes care of this. Although there are numerous technical solutions for implementing HA, this post focuses primarily on the fundamental idea of how to deal with a cloud computing world that has shown itself to be unpredictable when it comes to HA.

Cloud Software Design Principles

The companies that did not have major problems during Amazon’s recent outages did something different from the others. It’s not rocket science: they just assumed the cloud would most likely be unavailable at some point in time and had already implemented a way of dealing with this. Some of them made sure all data was continuously replicated to a different (cloud based) location and were able to deploy resources fast to get their applications up and running. This is great, but it needs manual intervention and still means downtime, because it simply takes time to start everything in a different location. Others took cloud HA to the next level and made sure the application was aware of different cloud locations. Perfect HA software never trusts the platform it’s running on and has mechanisms built in to automatically move to a different location when needed. Without downtime.

The following are key design principles for software engineering in the cloud

  • don’t trust the underlying platform
  • assume things will go wrong
  • design for failure, to automatically handle platform failures
  • make the application aware of the underlying platform, so it can manage its own platform resources if things go wrong
  • make sure you physically know where your cloud runs, even if it is cloud based
  • implement auto provisioning to dynamically scale capacity. Automatically failing over services to another location may require more resources than usual. You don’t want a lack of capacity to take down this environment too

The ideas described here assume you have at least 2 completely independent locations that are geographically distributed. This means that no problem in one location can in any way disturb the other location. There can be no dependencies between them. You also want to check any cloud services you are using, such as DNS or email. Do they depend on the availability of one supplier? Are they at the same locations you are in? You need to make sure these services are also in redundant locations without any dependencies between them.
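At the application level, “design for failure” can be as simple as never assuming a single location will answer. The following is a minimal sketch under assumptions: `request` is a hypothetical callable that performs one operation against one location and raises on failure, and the locations are just labels. It is not how any particular company handled the outage, only an illustration of the principle.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class AllLocationsFailed(Exception):
    """Raised when no location could serve the request."""


def call_with_failover(locations: Sequence[str],
                       request: Callable[[str], T]) -> T:
    """Try each location in order and return the first successful result.

    `request` is a hypothetical callable that talks to one location
    and raises an exception when that location is unavailable.
    """
    errors = []
    for location in locations:
        try:
            return request(location)
        except Exception as exc:  # design for failure: expect any error
            errors.append((location, exc))
    raise AllLocationsFailed(errors)
```

This only helps if the locations are truly independent, as described above; if both depend on the same supplier, the second attempt fails for the same reason as the first.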

New trend in Cloud software design

We have traditionally focused on building highly available infrastructure in data centers. With cloud computing, we no longer have control at the infrastructure level, but we should never assume it is always going to be available. There is a new trend coming: the focus is no longer on the infrastructure being HA, it’s the software that is architected and built to be highly available. The software can now handle infrastructure failures and even expects them to happen. This fundamental shift in thinking will require tighter cooperation between development and operations (DevOps) in software companies.

Other Posts:

http://blog.rightscale.com/2011/05/02/aws-outage-follow-up-if-you-wanted-details-you-got-details/

Virtual SAN storage vs traditional SAN storage


With traditional SAN storage platforms, creating a storage design that meets performance requirements now and in the future has always been a critical aspect of an infrastructure architecture. You need to know in advance how the applications will use the storage layer, but also prepare for growth scenarios and align your design with those aspects. In a lot of situations these are details that may not be completely clear (yet). Think of deploying a new kind of application, or of a business that starts growing rapidly. Making mistakes is painful because SANs are really expensive. In today’s dynamic environment, it’s almost impossible to predict how an infrastructure will be used 3 months from now, let alone 2 years from now. You may very well find that your LUN and RAID group design no longer supports the requirements of the application or is used very inefficiently.

Virtual Storage

Traditional SANs, such as those provided by EMC, require you to configure everything in advance. Making changes afterwards without downtime is often very difficult on a running system. This problem calls for a more flexible storage platform that allows changes later on, to better suit the requirements at a specific point in time. The best solution would be to start off with a default layout and have the system automatically determine how to configure its underlying RAID groups and disks according to how the application uses the storage platform. This should be a dynamic process that can keep on changing the layout over time. One of the SANs available nowadays that supports a real virtual storage layer is EqualLogic. These boxes allow you to start off with a certain RAID group, which can very easily be changed later on without any downtime when the applications ask for it. Not knowing your storage space requirements in the beginning doesn’t matter: just thin provision all your volumes. These volumes can also be changed to thick volumes later on if necessary. Most of these changes in SAN configuration can be managed by the EqualLogic box automatically, which makes it easier to manage in large, complex infrastructures.
