SEOmoz has published a report about the outage that is affecting their crawling service.
What happened? As described in the company’s blog, SEOmoz runs several services using Amazon Web Services (AWS). While most of the services run on “on-demand” or “reserved instances”, a few applications employ “spot instances”, a purchasing model that allows customers to rent Amazon’s unused resources at a discounted price which often exceeds 50%.
So far, so good. There is a caveat though, which is vital, as it will become clear later: the prices are set by means of periodic auctions (these occur every hour circa). Users bid for one or multiple virtual servers at a limit price (the maximum price the bidder is willing to pay per hour). Amazon gathers the bids and determines a clearing price (a.k.a. spot price) based on the bids and the available capacity. A bidder gets the required instances if his/her limit price is above the clearing price. In this case, the bidder pays the clearing price (not his/her limit price). The clearing price is updated as new bids arrive. If the clearing price goes above the bidder’s limit price, the bidder’s running spot instances are terminated without warning (more details can be found here and here).
As pointed out by Amazon and many others it is clear that, given the way spot pricing works, spot instances are intended to be used by application that can handle interruptions, e.g., business analytics, financial modeling testing, web crawling, data processing, etc.
A very important decision to be taken when employing spot instances is the bidding strategy to employ (see, for example, this short tutorial). As shown in Figures 1 to 4 (the y axis employs a logarithmic scale), spot prices are generally lower than the on-demand price(0.68 $/hour for m1.xlarge instances in the us-east region, see here), but there are occasional events which cause the prices to rocket up to 2 $/hour or more (this is the value used by SEOmoz on all its bids). In other words, the spot price can exceed the on-demand price, with a possible explanation being that when capacity is scarce, Amazon deliberately increases the price to a very high value in order to make room for required capacity (on-demand and possibly reserved instances).
It is perhaps worth noting that while the spikes might indeed be due to resources’ shortage, this study speculates that (1) Amazon might employ an algorithm that is not market driven and thus the transient increases in the spot prices might be artificially set by Amazon, and that (2) the algorithm used to determine the spot price for the next epoch changes over the time.
According to the information made available by SEOmoz, a few things were not planned properly:
The use of servers in different availability zones: this would have probably mitigated the problem, as prices between different availability zones are not correlated (see the figures below)
The use different instance types (e.g., spot and on-demand)
All servers were acquired at the same price. A better alternative would have been to use a different price distribution (e.g., some bids at 1 $, others at 2 $, and others at 3 $), in order for the servers to be claimed by Amazon more gradually in case of price rises.
While the points 1 and 2 were made clear in the blog entry, bidding excessively high prices is not the right approach (as the blog author claims). In the following I will explain why that bidding strategy would see only one winner, Amazon.
First of all, while it is true that often bidders pay less than their limit price, that strategy would cause the price to increase in the long term, thus limiting its effectiveness. Besides that, previous studies have demonstrated that the availability of post instances is a discontinuous function with respect to the price, i.e., small changes in the bid have large effects on the availability. In other words, bidding 2.1 $ might be much better (e.g., the probability of Amazon interrupting the VM much lower) than bidding 2 $, which in turns might be very similar to bidding 1.9 $.
Second, when bidders submit their bids without knowing the bid of the other people in the auction (like in the case of spot instances), the best strategy is to bid the true value. Why is this so? Auction’s theory tells us that whatever happens in the auction, there will be some bid zthat is the highest bid made by the other users.
If your own value is v, then whatever the value of z is, your profit will be v−z if your bid is greater than z (because you get the object and pay z for it) and it will be 0 if you bid less than z (because in this case you don’t get the object and don’t pay for it). What difference would it make if you tried to “overbid” by bidding b > v? If z < v, it wouldn’t matter at all, because you would win the object and pay z just as you would if you bid your true value v. But what if b>z>v? In this case, you would not win the object if you bid your true value v, but you will win the object with your bid of b. But if b>z>v,then your profit is v−z < 0. You will be paying more for the object than it is worth to you and will be worse off than if you had bid the truth and not won the object. This means that you can never make yourself better off and you may make yourself worse off by overbidding than by bidding your true values. A similar situation occurs in case of “underbid”.
As the blogs points out, predicting the optimal price to bid is far from trivial. However heuristic policies can be employed. Assuming that the prices are set by the market (see the observation earlier in this article), existing literature suggests that prices in similar markets follow a Normal distribution. Hence, it is legitimate to employ a Normal approximation to model the distribution of spot prices.
A recent paper analyzes the spot prices of different instance types over a three months period, and finds that while the prices are not exactly normally distributed (the distribution of the spot prices is more heavy-tailed), that approximation is adequately accurate. Having established such kind of relationship between the prices, the authors propose a prediction algorithm which, by means of a number of well known statistic techniques, is capable of predicting future prices with an accuracy higher than 99% (for small instances), see table below, while enabling considerable savings compared to the scenario where on-demand instances are employed.
To conclude, it is certainly possible to employ spot instances in a production environment. However users should remind that:
- The application should be fault tolerant, and that
- Smart price prediction algorithms are capable of considerably reducing the probability of out-of-bid events while at the same time maintaining the expenditure for infrastructure under control.