Introduction 

 

From 2020-02-22 23:30 UTC to 2020-02-23 00:44 UTC we experienced a system-wide issue.

 

Incident Details

 

  • 23:30 UTC – Our automated monitoring detected a higher-than-usual number of failed requests and alerted our engineers. This appeared to be impacting approximately 10% of requests. We began investigating the issue within 4 minutes.
  • 23:36 UTC – We logged a critical support issue with Microsoft Azure.
  • 00:04 UTC – We observed the failure rate increase to approximately 30% of requests.
  • 00:12 UTC – Microsoft Azure confirmed they were experiencing a platform issue and were investigating.
  • 00:44 UTC – Full service was restored.
  • 01:03 UTC – We posted the issue on our Service Status page. We switched our phone lines to play a pre-recorded message directing customers to our Service Status page.
  • 01:32 UTC – We updated our Service Status page to reflect the earlier resolution.

 

 

Impacted Services 

 

  • VestiPOS Management Portal 
  • VestiPOS API 
  • VestiPOS iOS App

 

Our hospitality customers running in “Server” mode were significantly impacted, as they were unable to load open table sales. Users were still able to process offline sales and use “Standalone” mode.

 

 

Communication

 

We understand the importance of communication when issues occur. There was a significant delay between the issue first being detected and it being posted on our Service Status page. In addition, we are aware that our support team did not communicate the issue effectively when speaking to our customers. We fell short in this instance and are urgently reviewing our internal processes to learn what went wrong.



Future Mitigation & Internal Process Improvements

 

[1] We’ve invested in our platform to ensure a high level of uptime, achieving 99.99% platform availability over the past 5 years. Upon receiving the Root Cause Analysis report from Microsoft, we will review our system architecture to identify any potential improvements. A complete history of previous outages is available on our status page at https://vestipos.status.io/pages/history/542bd6e5723e21dc04000028

 

[2] We have a major development in progress, which we are targeting for release this quarter. This development would have removed the impact our hospitality customers suffered during this outage. You can learn more about this upcoming update at https://support.vestipos.com/support/solutions/articles/43000559946-serverless-multi-terminal-support

 

[3] Currently our engineers have to manually update our Service Status page when an outage occurs. We are going to automate this process, so that users are notified via email during any platform issue. We will also automatically switch our phone systems to our pre-recorded message so users know we are investigating the issue. Automation should ensure both happen within 15 minutes of an issue first being detected.


[4] Our support team leader is reviewing our internal policies to ensure we improve our communication during an outage.