ivan.regalado | 2022-10-24 17:54:05 UTC | #1
Hi,
Currently deploying flows via v1.7.0 and ran into this error:
Stack trace from the terraform-provider-genesyscloud_v1.7.0 plugin:
```
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0xd21a47]

goroutine 93921 [running]:
github.com/mypurecloud/terraform-provider-genesyscloud/genesyscloud.updateFlow.func1()
        github.com/mypurecloud/terraform-provider-genesyscloud/genesyscloud/resource_genesyscloud_flow.go:192 +0x187
github.com/hashicorp/terraform-plugin-sdk/v2/helper/resource.RetryContext.func1()
        github.com/hashicorp/terraform-plugin-sdk/v2@v2.23.0/helper/resource/wait.go:27 +0x56
github.com/hashicorp/terraform-plugin-sdk/v2/helper/resource.(*StateChangeConf).WaitForStateContext.func1()
        github.com/hashicorp/terraform-plugin-sdk/v2@v2.23.0/helper/resource/state.go:110 +0x207
created by github.com/hashicorp/terraform-plugin-sdk/v2/helper/resource.(*StateChangeConf).WaitForStateContext
        github.com/hashicorp/terraform-plugin-sdk/v2@v2.23.0/helper/resource/state.go:83 +0x1d8
```
Error: The terraform-provider-genesyscloud_v1.7.0 plugin crashed!
This is always indicative of a bug within the plugin. It would be immensely helpful if you could report the crash with the plugin's maintainers so that it can be fixed. The output above should help diagnose the issue.
charlie.conneely | 2022-10-25 10:06:06 UTC | #2
Hi, Ivan. Thanks for reporting.
If you set sdk_debug=true in your genesyscloud provider block and re-run terraform apply, you could send us the CorrelationId, which can then be found in the file sdk_debug.log. Although the plugin seems to be crashing due to the mishandling of an error, having the correlation ID would provide us with more context around the error itself so we can recreate and solve the problem.
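For example, a minimal provider block (credentials and other settings omitted for brevity):

```hcl
provider "genesyscloud" {
  sdk_debug = true
}
```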
Thanks, Charlie
ivan.regalado | 2022-10-25 17:15:15 UTC | #3
@charlie.conneely Here is the correlation ID on a recent failed TF plan for one of our flows: [id=a45e1386-b0c4-4891-b931-9c14b522053c]
charlie.conneely | 2022-10-26 15:14:44 UTC | #4
@ivan.regalado it seems you may have sent the flow ID by mistake. The issue within the code is arising on a GET of the endpoint /api/v2/flows/jobs/{jobId}. So that we can understand why this is happening, could you search the sdk_debug.log file for something along these lines:
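For example (the exact log format may vary):

```
GET https://api.mypurecloud.com/api/v2/flows/jobs/<jobId>
```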
Of course, your website domain could look different and I've put "<jobId>" in place of a real ID. If you find an instance of this request, please send the job ID and any information about the success or failure of the operation, such as the status code or correlation ID if one is present.
John_Carnell | 2022-10-26 20:03:32 UTC | #5
Hi Ivan,
Two things:
- We have a fix for the stack trace issue. It had to do with how we handled an error return code from the API call-out to get the job status. I asked Charlie to get the correlation id so we could understand why the error condition was occurring. In talking with @Jeremy_Gillip it sounds like you might have been trying to deploy a large flow that takes longer than 15 minutes to deploy.
- I think I finally figured out what was going on with the flow deploying and kicking off on a terraform plan. While investigating another question around the flow resource I was finally able to reproduce this problem. I am in the process of writing a fix and need to make the tests pass before I can release it. It was a fairly subtle bug and it took me a while to track it down.
- My goal is to get these fixes through our testing and then get them deployed in short order.
Thanks, John Carnell Manager, Developer Engagement
Ihor | 2022-10-28 01:14:03 UTC | #6
Hi John,
Thanks for the update. Please let us know once the fix is ready.
-Ihor
ivan.regalado | 2022-10-28 18:34:45 UTC | #7
Hey @John_Carnell - any ETA on the fix for these two issues? We're currently blocked from delivering any new code by the stack trace issue, and flows deploying on a TF plan created a production outage this morning. Please let me know, thanks!
John_Carnell | 2022-10-28 18:51:59 UTC | #8
Hi Ivan,
I was waiting for final confirmation from my dev that everything was good before we deployed. I just got news that v1.8.0 has been published. That should include the fix for both the Terraform plan issue and the nil pointer crash.
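If you pin the provider version in your terraform block, you will need to bump it to pick up the new release. A minimal sketch, assuming the standard registry source:

```hcl
terraform {
  required_providers {
    genesyscloud = {
      source  = "mypurecloud/genesyscloud"
      version = ">= 1.8.0"
    }
  }
}
```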
Just be aware that you will need to change how you are deploying your flows. We now require an explicit hash of the file inside your HCL in order to determine whether your flow has changed. The original bug was around hashing the file directly in the provider.
Here is an example of the change:
resource "genesyscloud_flow" "flow" {
filepath = "./SimpleFinancialIvr_v2-0.yaml"
file_content_hash = filesha256("./SimpleFinancialIvr_v2-0.yaml")
substitutions = {
my_queue_names = "Simple Financial IRA queue"
}
}
Just be aware that the fix for the stack trace issue will show you what the error is and keep the provider from crashing, but you still might have an issue with your deploy (e.g. the flow takes longer than 15 minutes to deploy).
Thanks, John Carnell Manager, Developer Engagement
ivan.regalado | 2022-10-28 18:55:06 UTC | #9
Great news @John_Carnell! Thanks for pushing this out quickly. We're already making use of the `file_content_hash` parameter, but we'll make sure to add it to all our flows/environments going forward. We'll also work on the flow that was creating the stack trace issue, which hopefully can now give us more details about what may be causing it to fail. Thanks again!
John_Carnell | 2022-10-28 19:44:06 UTC | #10
Hi Ivan,
No worries, please let @Ihor know too :). This was a nasty bug that took us a while to track down. We had an updateFlow() call inside the read() method on the flow resource. (Long story, but the original engineer who wrote it was trying to hash the files and performed an update in the read flow whenever the hash changed on a flow.) The problem is that the defect only manifests itself when a YAML file has changed, and our tests run against a set of pre-defined YAML files with every deployment.
What made this so ugly is that this particular part of the code had no tracing. It was not until I was working on another part of the code base, added a bunch of tracing, and ran it locally that I saw what you had been reporting.
Needless to say, I fixed it and added a test case around it to make sure we did not see weird behavior if the YAML file is modified and then re-deployed.
Thanks, John Carnell Manager, Developer Engagement
John_Carnell | 2022-10-31 14:29:50 UTC | #11
Hi @ivan.regalado ,
Just checking to see if the fixes to the flow resource helped.
Thanks, John Carnell Manager, Developer Engagement
ivan.regalado | 2022-10-31 16:18:43 UTC | #12
Hi @John_Carnell - yes, the fix appears to have worked. @Ihor helped test it locally, and a given TF plan did not update the org. Once we updated our repo to 1.8.0, our pipeline did not deploy any flow changes on CI checks (TF plans) either. Thanks again John!
John_Carnell | 2022-10-31 16:38:05 UTC | #13
Hi guys,
Excellent. Sorry for the delay in resolving the issue, but it took me forever to finally track it down. Were you able to get beyond the nil pointer exception? Also, I talked with Jeremy this morning: if you run into any more weird issues, please don't hesitate to post to the developer forum and let Jeremy know. Jeremy and I meet at least 2-3 times a week, and he has a direct line to let me know if there are issues.
Thanks again for your patience John Carnell Manager, Developer Engagement
ivan.regalado | 2022-10-31 17:01:27 UTC | #14
That large flow still cannot be published through TF, but it can be published via Archy locally. Now that you've fixed the stack trace issue, the error message isn't too helpful:
Error: Flow publish failed. JobID: 3a951096-ad97-4686-80f3-ba52c2f3f415, no tracing messages available.
John_Carnell | 2022-10-31 18:41:53 UTC | #15
Hi Ivan,
If you have a large flow that takes more than 15 minutes to deploy, the flow service that CX as Code uses will time out. That would explain why it publishes via Archy: Archy does not use the flow service, so it does not have this limitation. I am going to chat with the Architect team, who owns the flow service, to validate my suspicions. If this is the case, I would highly recommend you look at refactoring that flow into smaller components, as flows that big are incredibly difficult to maintain over time.
The error message about no tracing messages means we just received no tracing messages back from the service. I need to dig through the service API, but I suspect on a timeout we are only returning a timeout status code without all of the messages filled in.
Let me dig into this more.
Thanks, John Carnell Manager, Developer Engagement
ivan.regalado | 2022-10-31 19:19:06 UTC | #16
@Ihor fyi ^^ we'll need to chat with our flow dev team on how to possibly refactor this
John_Carnell | 2022-11-01 19:52:35 UTC | #17
Hi @ivan.regalado and @Ihor,
Really big flows are always a pain because they are painful to understand and maintain. I did sit with the Architect team to review the issue.
They do not believe this is a timeout issue. They suspect one of the lambdas backing the flow service is maxing out the memory as we see the error occur only about three to four minutes into the flow deploy.
From a CX as Code perspective, we get a 500 status code back, but without traces of any kind. So, I have an internal ticket opened with the Architect team.
Here is what I ask you to do:
- Open a support ticket with our Care team.
- Reference our conversation in this post and request that a ServOps ticket be opened and associated with DEVENGAGE-1476. This is the internal ticket that the engineering teams will use to track the work.
- DM the Care case number to me. I will ping it to one of our Care case managers to make sure that this gets a ServOps.
- The engineering groups are working on this issue even without the Care ticket being opened. The Care ticket is for you to officially track the progress of the work and provide notes and feedback.
- NOTE: You can use the Archy CLI within your Terraform flows to deploy this flow instead of the genesyscloud_flow resource. This will require you to have Archy installed in your CI/CD pipeline, but it will enable you to move forward. A rough sketch of how this could look is shown after this list.
- I am going to be reaching out to George and Jeremy to see if we can get a copy of your flow. The Architect team is curious as to why it might be causing memory issues in the context of the flow service trying to deploy the flow.
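Here is that sketch. The file name is illustrative, and the exact Archy command-line flags and authorization setup are assumptions you should verify against the Archy documentation:

```hcl
# Hypothetical example: publish a large flow with the Archy CLI instead of the
# genesyscloud_flow resource. Assumes archy is installed and already authorized
# in the CI/CD pipeline; the command flags below are illustrative.
resource "null_resource" "large_flow_deploy" {
  # Re-run the provisioner whenever the flow definition changes
  triggers = {
    flow_hash = filesha256("./LargeFinancialIvr_v1-0.yaml")
  }

  provisioner "local-exec" {
    command = "archy publish --file ./LargeFinancialIvr_v1-0.yaml"
  }
}
```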
Thanks, John Carnell Manager, Developer Engagement
system | 2022-12-02 19:51:40 UTC | #18
This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.