Today I received a call from Salesforce's VP of Development, set up by my customer success manager. He asked to remain unnamed for fear of being bombarded by reporters, so I will just refer to him as "He" in this article. The conversation was very informative, developer to developer. He didn't mind me posting the information to my blog here, so I'll do my best to relay it accurately.
 
First we talked about salesforce's move to their new data center, known as mirrorforce, which he admits salesforce did a bad job communicating. The first set of significant outages occurred in November, after the move to a new data center in Sunnyvale, CA. The move was required for their future growth: it gives them more capacity, and it was a prerequisite for building a completely mirrored data center in Virginia. Along with the move came some issues that took a week to fully iron out. These have since been resolved and shouldn't recur unless they move to a new data center again, which should be at least another 5 years away. Their redundant data center, which should be live in the next couple of months, is meant solely for disaster recovery, and is designed to turn a 24 hour outage into about a 1 hour outage. More importantly, it will not fix the 1-2 hour outages, or the stability and performance problems that have been occurring over the last couple of months. Those issues are being addressed by work at the primary data center.

What of course interested me the most was the explanation of Monday's outage and what they plan on doing to prevent this from happening in the future. Apparently it was another database problem. He was the first to say, "Fundamentally this is a salesforce problem, we are aggressively looking at architecture changes to avoid these issues." This is where I get a little fuzzy, because I don't deal with large database systems myself. The database architecture at salesforce is a single cluster of four Sun servers running Oracle. The outage Monday was due to a bug in the clustering software, which, if I understood properly, propagated the problem across all four servers. The load on the Oracle system is extremely high, and seems to be surfacing some very obscure bugs. This leads me to what they plan on doing to resolve this issue.

To fix these stability issues, they are taking three actions. The first is to apply a new build from Oracle which fixes numerous bugs. Their current build is about 9 months old, and since then they have been selectively applying patches. Their next plan of attack is to split the database four ways. This will reduce the load on each individual database, and if there is a problem with one of the four servers, the remaining servers should take up the slack four times faster. This will hopefully keep these obscure bugs from surfacing. Finally, further ahead, the plan is to put additional clusters in place, which will add even more redundancy.

I asked the question, "Why was the API down for 7 hours? Did you not realize many customers rely on the API as heavily as, or more than, the actual interface itself?" His response was that there is no misunderstanding about how critical the API is: 45% of all requests are from the API. When the database problem appeared, the API was taken offline until things stabilized. Ten minutes later, the API was brought back online, and the entire system froze. The API was taken back offline to re-enable the UI, the idea being that having the UI is better than having nothing. It wasn't until 3-4pm PST that the team put the API back online, after spending all that time trying to isolate the problem. I'm unclear whether they found the problem, or just gave up and put the API back online hoping it would work, which luckily it did.

I should say that despite the problems with reliability, salesforce is very customer focused. Atalasoft is just a small ISV, and even though we don't show up on their radar as far as revenue, we have been treated like an enterprise customer. Having a single point of contact, the "Success Manager", and having a VP call me personally go a long way in that respect. This conversation was very eye-opening, and I was assured that salesforce is putting their best efforts and very smart people on the task of solving their reliability problems. As for whether on-demand CRM software is the best thing for small software companies, I think that still remains open for debate.