Tuesday, February 12, 2008

Production Deployment

I don't think my work is interesting enough to subject anyone to a minute-by-minute account of my day, but I wrote up a long sequence of events after today's production issues and posted it on my blog at work (I doubt anyone there is interested either, except perhaps to point out how they would have fixed the issue faster, with their superpower of hindsight no doubt). Since it was posted there, I might as well post it here too. So for any poor saps who would actually subject themselves to reading such drivel, here goes:

A rundown of 2/12 Production Issue Activity that I was involved in:

2/11
  • ~4pm - double-checking properties and the binary build (we really should do these during business hours to save after-hours aggravation and cost). The whole Denver office was down (including VOIP phones) due to a network outage.
  • 5:30pm - Conclusion of review: property files weren't updated yet, and we *really* need to fix the property file formatting so that it's possible to use the cool Eclipse diffs between environments without spending hours scanning through ordering and formatting changes. Started looking into how to get the production ROC DB properties.
  • ~6pm - Todd P was working on the build in San Diego - ran into the properties issue and sent an email. Turns out he was building 1.100 instead of the planned 1.111. But that's because 1.100 was never tested in the QA2 env by front end QA. OK, we need a staggered code freeze for future deployments.
  • ~8:30pm - got properties from Jon Bigelow, checked in and
  • ~9:30pm - checked with Todd P and build was successful. Sweet - we should be good for the deployment to proceed smoothly now.
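
On the property file formatting gripe from the 5:30pm entry: one way to make environment diffs readable is to normalize both files before comparing them. Below is a minimal sketch in Python, not anything we actually ran. The function names and sample keys are made up for illustration, and it assumes simple one-line key=value or key:value entries (no line continuations or escaped separators).

```python
import difflib


def normalize_properties(text):
    """Return sorted 'key=value' lines, dropping comments and blank lines."""
    entries = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue  # skip blank lines and comment lines
        # Split on the first '=' or ':' (assumption: no escaped separators).
        for i, ch in enumerate(line):
            if ch in "=:":
                key, value = line[:i].strip(), line[i + 1:].strip()
                break
        else:
            key, value = line, ""  # a bare key with no value
        entries[key] = value  # a later duplicate key overrides an earlier one
    return [f"{k}={v}" for k, v in sorted(entries.items())]


def diff_properties(a_text, b_text, a_name="a", b_name="b"):
    """Diff two property files after normalizing order and whitespace."""
    a = normalize_properties(a_text)
    b = normalize_properties(b_text)
    return list(difflib.unified_diff(a, b, a_name, b_name, lineterm="", n=0))


# Hypothetical sample content: same keys, different order and spacing.
qa2 = "# QA2 settings\nb.url = http://qa\na.timeout=30\n"
prod = "a.timeout = 30\n\nb.url=http://prod\n"

for line in diff_properties(qa2, prod, "qa2", "prod"):
    print(line)
```

With ordering and spacing out of the way, only the entries whose values really differ show up in the diff.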

2/12

  • 5am - woken by Steve Fletcher to get on the testing bridge. The Verisign daily had run long and so testing was just starting - and some calls were failing (esnValidation)
  • ~6am - standard deploy troubleshooting step of restarting cluster completed
    Having trouble logging into BO - it appears the "Enterprise" password I've used every previous time no longer works. The AD (Active Directory) login isn't working either... annoying!
  • ~7am - intermittent failures begin to get worse - percentage of failures rises from ~25% to ~90%
  • 7:10am - Bigelow says his AD account works in BO - just normal login (ex: jbigelow) and Windows password. I try it a few more times (there's no way I could have mistyped my password so many times), and it finally lets me in!
  • 8:20am - After researching in BO and browsing the code some, while answering all kinds of questions from the bridge, finally came up with a theory of what was going on (see the Auth.createSalesCode functional spec page for the 3 issues identified)
  • 8:23am - called Zafar - left message - he called me back - still sleepy, he concurred that my theory of what was happening was correct
  • 8:40am - called and briefed Cassisa
  • 8:40 - 9:20am - discussion on the bridge on what to do.
  • 9:20am - decision is reached to have me work the fix ASAP, and also to do a roll-back in parallel
    branch auth-165:
    cvs rtag -r csp-authentication-165 csp-authentication-165_branch_ROOT csp-adapters/authentication
    cvs rtag -r csp-authentication-165_branch_ROOT csp-authentication-165_branch csp-adapters/authentication
  • 9:26am - cvs up -r csp-authentication-165_branch - seemed like forever to switch the branch - weird
  • 9:32am - make the code modification (no local testing) - initially thought to modify the GetSalesmanCode operation class, but instead modified SalesmanCodeCache.java - I think it's a simpler change and has less chance of missing something
  • 9:44am - checkin and tag
    validate diff: cvs diff -r csp-authentication-165 -r csp-authentication-165_branch_1
  • 10:06am - edit deploy_versions.props, create binary build based on 1.100
  • 10:29am - first build failed - fixed tag and restarting
    cvs up -r 1.113 deploy_versions.props
    ./binbuild-csp.sh dev3 deploy-to-repo HEAD
  • 10:56am - start deploy to dev server - sancapvmcsptr3
    ./bindeploy-csp.sh dev3 80 1.113 v20080212_3
  • 10:58am - deploy complete
  • 11:02am - JBoss finished starting up
  • 11:18am - completed SOATest validation of createSession, searchByName, getAccount, and auth.getSalesCode. getSalesCode returning 000 in ~100ms... searchByName and getAccountInfo ~3-8 seconds
  • ~12:30pm - QA2 had to be built 2x, with prod build in between - CSP build had been rolled back to 1.68 and everything was partially functional
  • ~2:30pm - Command Center decides to wait until 8pm Mountain to deploy this fix
