Genesys Cloud - Main


How did your organisation cope with the APAC (Sydney) outage on 29 May 2023?

  • 1.  How did your organisation cope with the APAC (Sydney) outage on 29 May 2023?

    Posted 05-29-2023 18:47
    Edited by Jeff Hoogkamer 06-12-2023 23:17

    Hi All,

    For those outside of the APAC (Sydney APSE2) region, yesterday there was a significant outage of Genesys Cloud with delays or inability to answer or receive calls from around 12:30pm AEST right up until 6pm AEST.

    It appears a number of organisations within Australia/NZ were impacted.


    Symptoms included:

    • calls ringing on physical phones but not appearing or significantly delayed in the desktop application
    • delay in IVR/architect announcements and menu options (up to 10-30 seconds between prompts/actions)
    • agents on auto-answer being delivered calls but then being put into not responding
    • digital interactions (email and web messaging) were also impacted (not being able to be answered or timing out in delivery)
    • calls in queue not routing to idle agents
    • dashboard/performance information not updating for hours
    • direct calls ringing/presenting in the desktop app and unable to be dismissed
    • phantom call alerts/notifications for calls that didn't exist (or existed minutes/hours prior)
    • some Participant Data was not reliably recorded against the interaction/conversation and may be missing for interactions processed through Architect during the entire outage window (Edit: only noticed on June 13, added to list)


    As an outage of this length on Genesys Cloud is (at least for me) unprecedented - I'm curious what your organisation did/does to cope with outages in cloud-based services - especially when the failover/recovery of the system is no longer under your control (i.e. you can't initiate a switchover/failover yourself, as you could in an on-premises system like PureConnect).

    I'm keen to hear of any 'creative' workarounds or solutions that you use/used, whether the business continuity/resilience plans you had in place were suitable for this outage, and any lessons learnt on what to have as a viable backup option in the future.


    #Routing(ACD/IVR)
    #Telephony

    ------------------------------
    Jeff
    ------------------------------



  • 2.  RE: How did your organisation cope with the APAC (Sydney) outage on 29 May 2023?

    Posted 05-30-2023 02:36

    Lots of crying and swearing :)

    To be honest, as rare as this was/is I think we're limited to some old school type approach.  If the org has toll free numbers going through a carrier platform first, potentially you put a DR/BCP plan into effect which may simply be a message followed by release.  If regular IVR behaviour was fine, trigger playing a message at the top of the IVR then release.  I wouldn't recommend using callback :)

    If digital is/was impacted, then I guess maybe having messaging switched off on the website or leveraging a digital bot flow (assuming those work).  An alternative may be having a "twin" org in another region that acts as a standby that you point as many services over to as possible.




    ------------------------------
    Vaun McCarthy
    ------------------------------



  • 3.  RE: How did your organisation cope with the APAC (Sydney) outage on 29 May 2023?

    Posted 05-31-2023 02:12

    Hi Jeff,

    As Vaun McCarthy mentioned in his reply, you could have another org on standby in a different region and implement a solution like "Build resiliency in your IVR with Genesys Cloud emergency groups and callbacks", as described in the Genesys blueprints section.
    Here is the URL for this blueprint:
    https://developer.genesys.cloud/blueprints/dr-fallback-ivr/



    ------------------------------
    Ramu P
    Global Technology Solutions Inc.
    ------------------------------



  • 4.  RE: How did your organisation cope with the APAC (Sydney) outage on 29 May 2023?

    Posted 05-31-2023 06:09

    Some may have a bit of a concern though with the concept of paying for another org, to be used as BCP/DR, when the platform provider itself (Genesys) is where the outage lies, such as in this case.  It's different to back in the on-premise/Engage type days when you'd have HA etc in place in case of a WAN or Data Centre outage and most companies wouldn't blink at paying for that.  But when it's to partly cover the actual platform provider themselves having an outage....

    While this event was a bit of a, hopefully, one-off, I think Genesys as a whole should be looking at some type of inter/cross region fall-back for this type of scenario and make it standard rather than commercialised :)

    All of this said, anybody actually got any info on the warm BC bundles that seem to be mentioned behind some closed doors?  Keen for someone from Genesys to jump on here with more info.  Google and other searching produces pretty much no info.



    ------------------------------
    Vaun McCarthy
    ------------------------------



  • 5.  RE: How did your organisation cope with the APAC (Sydney) outage on 29 May 2023?

    Posted 06-01-2023 03:09

    Genesys makes a BCP environment available at additional cost; however, you have to consider whether data residency is important to the organisation, as the BCP environment is typically outside of Australia. As every customer's needs differ, what the BCP looks like for one customer may not be the same for another.

    We have been looking at a few options however, each one is tailored for the customer.  



    ------------------------------
    Anish
    ------------------------------



  • 6.  RE: How did your organisation cope with the APAC (Sydney) outage on 29 May 2023?

    Posted 06-01-2023 03:15

    Yep that's my point exactly Anish :)  Looks good on paper but when you start looking at it closer, it's actually not going to be usable in a lot of companies - especially government departments where things like data sovereignty are critical, as you pointed out.

    Has anybody got a link to the BCP options you mention?  I can only find mention of them in partner price listings but no actual info anywhere else.



    ------------------------------
    Vaun McCarthy
    ------------------------------



  • 7.  RE: How did your organisation cope with the APAC (Sydney) outage on 29 May 2023?

    Posted 06-01-2023 03:22

    Search for 'Warm BC' in the partner portal



    ------------------------------
    Anish
    ------------------------------



  • 8.  RE: How did your organisation cope with the APAC (Sydney) outage on 29 May 2023?

    Posted 06-01-2023 03:26

    Thanks Anish I'll get our guys to check.



    ------------------------------
    Vaun McCarthy
    ------------------------------



  • 9.  RE: How did your organisation cope with the APAC (Sydney) outage on 29 May 2023?

    Posted 06-05-2023 20:50

    Thanks everyone who responded so far.

    Luckily on last Monday's incident, the actual voice routing within Genesys Cloud was still functional, as we were able to use our pre-built emergency closure and forwarding modes to still get critical calls handled by different numbers. But this was initially set up as just a 'Transfer to Number' step in Architect and could only effectively route the call to a single number.

    We also worked out that direct call routing to 'Groups' and Remote Number was still functional within Genesys Cloud - so we were able to effectively set up a pseudo 'Transfer to Multiple Numbers' in group hunt style within Genesys Cloud using groups, dummy users and remote stations:

    - create a Remote Station with the mobile number
    - create a 'Dummy User' with a Communicate license and any number/extension assigned (doesn't matter what it is, as it's not actually used)
    - assign the Remote Station as the Dummy User's default station
    - create a Group for the relevant queue/enquiry type
    - activate 'Enable Calls' on the Group and assign any number/extension as the group phone (doesn't matter what it is, as it's not actually used)
    - add the Dummy User to the Group

    After setting this up for multiple users and queues - we could then update the Architect routing to send the urgent calls to the relevant Genesys Cloud Group for the enquiry type instead of Transfer to ACD.
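
    A rough sketch of how the pieces in the steps above relate to each other, written as a plain Python data model. This does not call the Genesys Cloud API; the class names, the hunt_numbers helper and the phone numbers are all made up for illustration, and it only shows which object points at which.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RemoteStation:
    number: str  # external (e.g. mobile) number the station sends calls to


@dataclass
class DummyUser:
    name: str
    default_station: RemoteStation  # the Remote Station set as default station


@dataclass
class Group:
    name: str
    calls_enabled: bool = False  # stands in for 'Enable Calls' on the Group
    members: List[DummyUser] = field(default_factory=list)


def hunt_numbers(group: Group) -> List[str]:
    """Return the external numbers a call sent to this group could reach."""
    if not group.calls_enabled:
        raise ValueError(f"'Enable Calls' is off for group {group.name!r}")
    return [user.default_station.number for user in group.members]


# Placeholder numbers only, one dummy user per mobile.
billing = Group(name="Billing urgent", calls_enabled=True)
for i, mobile in enumerate(["+61491570006", "+61491570156"], start=1):
    billing.members.append(DummyUser(f"dummy-billing-{i}", RemoteStation(mobile)))

print(hunt_numbers(billing))
```

    Keeping one Group per enquiry type means the Architect flow only needs a single transfer target per type, and the list of mobiles can be changed by editing group membership rather than the flow.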

    Obviously this workaround was still dependent on the core call routing working in Genesys Cloud - and wouldn't cover a scenario where Genesys Cloud was entirely down or not functional even at this top level. This would need either the 'Warm BC' option in another GC region (or a voice solution from another provider) and ability to change the routing on demand at the carrier/provider level.

    I would be curious to know the actual architecture of Genesys Cloud within AWS and how it *should* be able to recover much quicker than it did. Whilst I get the marketing speak of 'based on AWS microservices architecture' etc... and this distributed nature should be more resilient - it's still based on physical infrastructure somewhere.

    I'm still in the old mindset of an on-premise high availability pair or hot-swap environment, in separate geographic locations (but still within the same country), that can be switched to in these events. Is there any concept of this in Genesys Cloud / AWS? Does AWS in Sydney actually have more than 1 data centre, and how is a site issue handled (power or data outage to a rack, a floor, a centre, etc)?



    ------------------------------
    Jeff
    ------------------------------



  • 10.  RE: How did your organisation cope with the APAC (Sydney) outage on 29 May 2023?

    Posted 06-05-2023 21:02

    Thanks for creating the thread Jeff.  I think it's important that we have more of these types of robust and transparent discussions.

    As a partner and firing line target, what concerned me about this outage was why nothing was seemingly triggered in among all the microservices monitoring etc to indicate there was an issue somewhere before it became a bigger issue.  In addition to that, it took what seemed like an unreasonable length of time before anything appeared on the "status" website, and even then it was the usual vague "elevated errors" update.  In fact, for the last two major outages it seems like it was only through an avalanche of partners and customers calling in and reporting that anything was actually triggered.  With the fault we had last year, I know that a number of us were on calls to support at the same time and were all told "no we don't have any issues".

    When we're in real time being pressured from clients, customers, agents etc and all we can say is "sorry yes there are elevated errors"..... :(

    You're right that the marketing speak around microservices etc did fall a bit flat during this one but let's just hope that this was an exception to the norm :)



    ------------------------------
    Vaun McCarthy
    ------------------------------



  • 11.  RE: How did your organisation cope with the APAC (Sydney) outage on 29 May 2023?

    Posted 06-06-2023 03:22

    FWIW - Just noticed in the Audit Viewer, there's a bunch of entries for the Edge devices showing issues starting 11:36am and finally resolving at 6:54pm on 29 May 2023.



    The properties logged show the Health Status going between "FAILED_NOHEARTBEAT" and "HEALTHY"



    Next time there's an issue and there's no immediate information on the status page - I'll have to come in here and check for any entries.
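
    If those entries were exported, a minimal sketch like the one below could turn the status flips into outage windows. It assumes each entry has already been reduced to a (timestamp, health status) pair, which is my assumption and not the actual Audit Viewer export format. Only the two status values and the first and last times come from the entries described above; the flips in between are invented sample data.

```python
from datetime import datetime


def unhealthy_windows(entries):
    """Collapse (timestamp, status) pairs into (start, end) unhealthy windows.

    Any status other than "HEALTHY" opens a window; the next "HEALTHY"
    entry closes it. A window still open at the end of the log gets end=None.
    """
    windows = []
    start = None
    for ts, status in sorted(entries):
        if status != "HEALTHY" and start is None:
            start = ts
        elif status == "HEALTHY" and start is not None:
            windows.append((start, ts))
            start = None
    if start is not None:
        windows.append((start, None))
    return windows


# Illustrative sample: the middle two entries are made up.
sample = [
    (datetime(2023, 5, 29, 11, 36), "FAILED_NOHEARTBEAT"),
    (datetime(2023, 5, 29, 12, 10), "HEALTHY"),
    (datetime(2023, 5, 29, 12, 25), "FAILED_NOHEARTBEAT"),
    (datetime(2023, 5, 29, 18, 54), "HEALTHY"),
]

for start, end in unhealthy_windows(sample):
    print(start, "->", end)
```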



    ------------------------------
    Jeff
    ------------------------------



  • 12.  RE: How did your organisation cope with the APAC (Sydney) outage on 29 May 2023?

    Posted 06-07-2023 19:22

    Does anyone know when Genesys will be implementing a Melbourne instance following AWS opening there in January?



    ------------------------------
    Tarquin Bell
    Precision Administration Services (Pty) Ltd
    ------------------------------



  • 13.  RE: How did your organisation cope with the APAC (Sydney) outage on 29 May 2023?

    Posted 06-12-2023 23:21

    Hi All,

    Just adding an additional symptom/impact that we've just discovered.

    During the outage window, some Participant Data was not reliably recorded against the interaction/conversation and may be missing for interactions processed through Architect.

    About a third to a half of interactions are missing values - particularly if the Participant Data is written in one flow, and then updated or added in another flow.
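
    For anyone wanting to check their own interactions, here is a minimal sketch of the comparison, assuming the Participant Data has already been exported into a dict keyed by conversation ID. The attribute names, the conversation IDs and the export shape are hypothetical; how you pull the data out of Genesys Cloud is not covered here.

```python
def missing_participant_data(conversations, expected_keys):
    """Map conversation ID -> sorted list of expected attributes with no value.

    An attribute counts as missing if the key is absent or its value is
    empty (None or an empty string).
    """
    report = {}
    for conv_id, attributes in conversations.items():
        missing = sorted(k for k in expected_keys if not attributes.get(k))
        if missing:
            report[conv_id] = missing
    return report


# Hypothetical attributes: EnquiryType written in the first flow,
# CustomerId added later in a second flow.
expected = {"EnquiryType", "CustomerId"}
exported = {
    "conv-001": {"EnquiryType": "Billing", "CustomerId": "12345"},
    "conv-002": {"EnquiryType": "Billing"},
    "conv-003": {"EnquiryType": "", "CustomerId": ""},
}

report = missing_participant_data(exported, expected)
print(report)
print(f"{len(report)} of {len(exported)} conversations are missing values")
```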

    Curious to know if anyone else noticed the same behaviour?



    ------------------------------
    Jeff
    ------------------------------


