FWIW - Just noticed in the Audit Viewer, there's a bunch of entries for the Edge devices showing issues starting at 11:36am and finally resolving at 6:54pm on 29 May 2023.
The properties logged show the Health Status alternating between "FAILED_NOHEARTBEAT" and "HEALTHY".
Next time there's an issue and there's no immediate information on the status page, I'll have to come in here and check for any entries.
Original Message:
Sent: 06-05-2023 21:01
From: Vaun McCarthy
Subject: How did your organisation cope with the APAC (Sydney) outage on 29 May 2023?
Thanks for creating the thread Jeff. I think it's important that we have more of these types of robust and transparent discussions.
As a partner and firing-line target, what concerned me about this outage was why nothing was seemingly triggered among all the microservices monitoring etc. to indicate there was an issue somewhere before it became a bigger issue. In addition, it took what seemed like an unreasonable length of time before anything appeared on the "status" website, and even then it was the usual vague "elevated errors" update. In fact, for the last two major outages it seems it was only through an avalanche of partners and customers calling in and reporting that anything was actually triggered. During the fault we had last year, I know a number of us were on calls to support at the same time and were all told "no, we don't have any issues".
When we're in real time being pressured from clients, customers, agents etc and all we can say is "sorry yes there are elevated errors"..... :(
You're right that the marketing speak around microservices etc did fall a bit flat during this one but let's just hope that this was an exception to the norm :)
------------------------------
Vaun McCarthy
Original Message:
Sent: 06-05-2023 20:49
From: Jeff Hoogkamer
Subject: How did your organisation cope with the APAC (Sydney) outage on 29 May 2023?
Thanks everyone who responded so far.
Luckily, in last Monday's incident the actual voice routing within Genesys Cloud was still functional, so we were able to use our pre-built emergency closure and forwarding modes to get critical calls handled by different numbers. But this was initially set up as just a 'Transfer to Number' step in Architect and could only effectively route the call to a single number.
We also worked out that direct call routing to Groups and Remote Stations was still functional within Genesys Cloud, so we were able to set up a pseudo 'Transfer to Multiple Numbers' in group-hunt style using Groups, dummy users and remote stations:
- create a Remote Station with the mobile number
- create a 'Dummy User' with a Communicate license and any number/extension assigned (it doesn't matter what it is, as it's not actually used)
- assign the Remote Station as the Dummy User's default station
- create a Group for the relevant queue/enquiry type
- activate 'Enable Calls' on the Group and assign any number/extension as the group phone (again, it doesn't matter, as it's not actually used)
- add the Dummy User to the Group
After setting this up for multiple users and queues, we could then update the Architect routing to send the urgent calls to the relevant Genesys Cloud Group for the enquiry type instead of Transfer to ACD.
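For anyone wanting to sanity-check the relationships in this workaround before building it out, the chain (Remote Station → Dummy User → Group → Architect transfer target) can be sketched in plain Python. Note this is purely an illustrative model of the configuration; the class names, fields and phone numbers below are made up and are not the Genesys Cloud API:

```python
from dataclasses import dataclass, field

@dataclass
class RemoteStation:
    name: str
    remote_number: str  # the mobile/PSTN number that will actually ring

@dataclass
class DummyUser:
    name: str
    default_station: RemoteStation  # Remote Station assigned as default station
    # the user's own assigned number/extension is irrelevant -- it's never dialled

@dataclass
class HuntGroup:
    name: str
    calls_enabled: bool = True  # 'Enable Calls' must be on for the Group
    members: list = field(default_factory=list)

    def ring_targets(self):
        """Numbers that ring when Architect transfers a call to this Group."""
        return [m.default_station.remote_number for m in self.members]

# Build the pseudo 'Transfer to Multiple Numbers' for one enquiry type
urgent = HuntGroup("Urgent Enquiries")
for name, mobile in [("Dummy One", "+61400000001"), ("Dummy Two", "+61400000002")]:
    urgent.members.append(DummyUser(name, RemoteStation(f"{name} remote", mobile)))

# Architect step: Transfer to Group 'Urgent Enquiries' hunts these numbers
print(urgent.ring_targets())  # ['+61400000001', '+61400000002']
```

The key point the model captures is that the Dummy User's own extension and the group phone number are never dialled; only the Remote Station's number matters.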
Obviously this workaround was still dependent on the core call routing working in Genesys Cloud, and wouldn't cover a scenario where Genesys Cloud was entirely down or not functional even at this top level. That would need either the 'Warm BC' option in another GC region (or a voice solution from another provider) and the ability to change the routing on demand at the carrier/provider level.
I would be curious to know the actual architecture of Genesys Cloud within AWS and how it *should* be able to recover much quicker than it did. Whilst I get the marketing speak of 'based on AWS micro services architecture' etc., and this distributed nature should be more resilient, it's still based on physical infrastructure somewhere.
I'm still in the old mindset of an on-premise high-availability pair or hot-swap environment in a separate geographic location (but still within the same country) that can be switched to in these events; is there any concept of this in Genesys Cloud / AWS? Does AWS in Sydney actually have more than one data centre, and how is a site issue handled (power or data outage to a rack, a floor, a centre, etc.)?
------------------------------
Jeff
Original Message:
Sent: 05-29-2023 18:46
From: Jeff Hoogkamer
Subject: How did your organisation cope with the APAC (Sydney) outage on 29 May 2023?
Hi All,
For those outside of the APAC (Sydney APSE2) region, yesterday there was a significant outage of Genesys Cloud, with delays or an inability to answer or receive calls from around 12:30pm AEST right up until 6pm AEST.
It appears a number of organisations within Australia/NZ were impacted.
Symptoms included:
- calls ringing on physical phones but not appearing or significantly delayed in the desktop application
- delay in IVR/architect announcements and menu options (up to 10-30 seconds between prompts/actions)
- agents on auto-answer being delivered calls but then being put into 'Not Responding'
- digital interactions (email and web messaging) were also impacted (not being able to be answered or timing out in delivery)
- calls in queue not routing to idle agents
- dashboard/performance information not updating for hours
- direct calls ringing/presenting in the desktop app and unable to be dismissed
- phantom call alerts/notifications for calls that didn't exist (or existed minutes/hours prior)
As this length of outage on Genesys Cloud is (at least for me) unprecedented, I'm curious what your organisation did/does to cope with outages in cloud-based services, especially when the failover/recovery of the system is no longer under your control (i.e. initiating a switchover/failover in an on-premise system like PureConnect).
I'm keen to hear of any 'creative' workarounds or solutions that you use/used, whether the business continuity/resilience plans you had in place were suitable for this outage, and any lessons learnt on what to have as a viable backup option in the future.
#Routing(ACD/IVR)
#Telephony
------------------------------
Jeff
------------------------------