Saturday, August 9, 2014

S3 with strong consistency

We use S3 extensively here at Korrelate, but we frequently run into problems with its eventual consistency model. We looked into working around it by using the US West region, which has read-after-write consistency, but most of our infrastructure is in the US East region and moving it would be a lot of work. Netflix has a project called s3mper that provides a consistency-checking layer for Hadoop using DynamoDB, but we really needed something for Ruby.

Since we also have a lot of infrastructure built around Redis, we decided to use it for our consistency layer because it's very fast and has built-in key expiration. The implementation is fairly simple: every write to S3 also writes a key to Redis containing the etag of the new object. When a read method is called, the etag stored in Redis is checked against the etag S3 returns for the object. If they match, the read proceeds as normal. If they don't, an AWS::S3::Errors::PreconditionFailed is raised, and the client decides how to handle it, whether that means retrying or doing something else. If the Redis key is nil, the data is assumed to be consistent.

In practice, it never takes more than a second or two for reads to become consistent after a write, but we set the Redis key timeout to 24 hours to give ourselves plenty of buffer without polluting the DB with an endless number of keys.
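
For reference, here is a minimal sketch of the idea using the aws-sdk (v1) and redis gems. The class name, Redis key scheme, and helper methods here are made up for illustration, and it assumes simple single-part PUTs, where the etag is just the MD5 of the body:

  require 'aws-sdk'     # aws-sdk v1, provides AWS::S3
  require 'redis'
  require 'digest/md5'

  class ConsistentS3
    TTL = 24 * 60 * 60  # remember etags for 24 hours

    def initialize(bucket_name)
      @bucket = AWS::S3.new.buckets[bucket_name]
      @redis  = Redis.new
    end

    def write(key, data)
      @bucket.objects[key].write(data)
      # For a single-part PUT the etag is the MD5 of the body, so we can
      # record it without another round trip to S3.
      @redis.set(redis_key(key), Digest::MD5.hexdigest(data), ex: TTL)
    end

    def read(key)
      expected = @redis.get(redis_key(key))
      object   = @bucket.objects[key]
      actual   = object.etag.to_s.delete('"')

      # A nil Redis key means no recent write was recorded, so the data
      # is assumed to be consistent. Otherwise the etags must match.
      if expected && expected != actual
        raise AWS::S3::Errors::PreconditionFailed
      end

      object.read
    end

    private

    def redis_key(key)
      "s3_etag:#{@bucket.name}:#{key}"
    end
  end

A caller that catches the error can simply sleep and retry the read until the etags agree.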

This is still incomplete because it doesn't cover listing methods in ObjectCollection like with_prefix and each, but it's a start.

Thursday, August 7, 2014

The Premortem

We are all familiar with the postmortem. When things go wrong, we want to understand why they went wrong so that hopefully we can avoid them in the future. This works well for classes of failures that are due to things like infrastructure or procedural deficiencies: maybe you had a missing alert in your monitoring system, or something wasn't communicated to the right person. The problem is that we can't plan for the unexpected, and it's the unexpected that causes the most problems.

The way the premortem works is this: imagine it is one year from now and (insert your project here) has failed horribly. Please write a brief history explaining what went wrong.

This exercise attempts to bypass our natural tendency to downplay potential issues. Whether you are in favor of the project or not, it will engage your imagination to come up with failure scenarios you might not have otherwise considered.

Give it a shot and please post comments about what you thought of it.


Wednesday, August 6, 2014

You are bad at estimating

For the last four years or so at Korrelate we have been using Scrum: two-week sprints, estimated stories, and planning based on those estimates. In that time we have learned one very important lesson: we are very bad at estimating. In retrospect, this shouldn't have come as a surprise to anyone. There is a mountain of research across multiple fields that demonstrates just how bad expert estimates are. They are so bad that, on average, they are worse than randomly assigned numbers. If you think you are somehow different and can do it better, you are wrong. You may be thinking that your estimates have been fairly accurate and your sprints largely successful. There are some reasonable explanations for this phenomenon:

  1. You are doing a lot of very similar tasks. Given a history of nearly identical work it is possible to come up with fairly good estimates
  2. You have a large enough team that your bad estimates have roughly balanced each other out so far
  3. You are working a lot of extra hours
  4. You have yet to experience regression to the mean

Given that we know we are bad at estimating, what should we do? I propose a fairly simple change: weight all stories equally. That's right, don't estimate anything. This may sound crazy at first, but the evidence shows that equal weighting is on average better than expert estimates and as good or nearly as good as sophisticated algorithms. You would probably do better than your current estimates by basing them on the number of words in the story.

Now, obviously some stories will require more effort than others; we just don't have a good idea of which ones those are. I propose another change to help here: break every story up into the smallest pieces that make logical sense. Some stories will become epics that contain multiple smaller stories. If a story can't be sensibly broken up, just leave it. Do not make any attempt to equalize them, either within the epic or against other stories; that's just a back door to estimating. I think this is the best attempt we can and should make to reduce the difference in effort between stories.

So, without estimates, how do you plan? This brings up the issue of sprints, and my final proposal is that we abandon them as well. If you want to know how many stories the team is likely to complete over the next month, just take the average number of stories they have completed over the last few months. The time each story took to finish is irrelevant. You can still follow trends in velocity over time and use them to provide better estimates about things like how adding another engineer will affect velocity and how long it will take to ramp them up. Resist the urge to add your "professional intuition" into the equation; you will only screw it up. Trust the data, not your gut.
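
As a toy illustration of that kind of forecast (the numbers are made up):

  # Stories completed in each of the last three months (hypothetical counts).
  completed_per_month = [23, 31, 27]

  # The forecast for next month is just the average; no estimating involved.
  forecast = completed_per_month.inject(:+) / completed_per_month.size.to_f
  puts "Expect roughly #{forecast.round} stories next month"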

I'd love to hear about your personal experience with estimating or not estimating, sprints vs. continuous deployment, or anything else related to improving the development process.