Potential raciness in webhooks?


#1

In my integration, I respond to installation events by cloning the repositories (then doing some analysis and sending a PR) and to PR review events (what exactly I do depends on the review, but the first step is to fetch the comments associated with the review). I'm seeing occasional errors in the log that I can best explain as a race with the server-side update. During cloning of the repository this manifests as a 401, and while retrieving the comments for a review I sometimes get:

SERVER ERROR: Error found in GitHub reponse:
    Status Code: 422
    Message: Validation Failed
    Docs URL: https://developer.github.com/v3/pulls/reviews/#get-a-single-reviews-comments
    Errors: Any["Could not resolve to a node with the global id of 'MDE3OlB1bGxSZXF1ZXN0UmV2aWV3NTU4Njk0ODk='."]
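For context, here's roughly the shape of the request that fails, sketched in Python with `requests` rather than my actual client; the endpoint is the one from the docs URL above, and the owner, repo, numbers, and token are placeholders:

    # Minimal sketch of the failing call: fetch the comments for a single
    # pull request review via the REST v3 endpoint from the docs URL above.
    # Owner, repo, the numbers, and the token below are placeholders.
    import requests

    def fetch_review_comments(owner, repo, pull_number, review_id, token):
        url = (f"https://api.github.com/repos/{owner}/{repo}"
               f"/pulls/{pull_number}/reviews/{review_id}/comments")
        resp = requests.get(url, headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github.v3+json",
        })
        resp.raise_for_status()  # the intermittent 422 above surfaces here
        return resp.json()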

Redelivering the same notification immediately afterwards usually seems fine, so I suspect my code is ok, but of course I can't exclude the possibility that something is wonky on my end as well. Is there a recommended best practice (e.g. wait a few milliseconds before handling a request, or just retry until it works, or something along those lines)? I'll try to capture the traffic between my app and GitHub as well, in case that would be helpful.


#2

Thanks for reaching out about this, @Keno. I'm wondering whether this is related to GitHub Apps, and if so, in what way – are you able to reproduce the problem with regular webhooks as well (e.g. a webhook that you create manually on a repository and subscribe to review events), or only with webhooks associated with GitHub Apps?

Also, next time this happens – can you provide the full HTTP request and response for the webhook delivery in question (all headers and bodies included) and the API request you made (again, all headers and bodies included)? I’m guessing it might happen for a private repository, so please feel free to send that data via https://github.com/contact and reference this thread there.

It's also quite possible that you're hitting a race there. If that's the case, then either of the two solutions you mentioned should help you out – either wait for a second or two before making the call, or implement retries (I'd probably go for retries with exponential backoff, and limit the number of retries to 5). If that doesn't help – let us know.
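To sketch what I mean (Python; `fetch` stands in for whatever API call intermittently fails, and `should_retry` is a placeholder for your own check of which errors look like the race, e.g. that 422):

    # Sketch of retries with exponential backoff, capped at 5 attempts.
    import time

    def with_backoff(fetch, should_retry, max_attempts=5, base_delay=1.0):
        for attempt in range(max_attempts):
            try:
                return fetch()
            except Exception as exc:
                if attempt == max_attempts - 1 or not should_retry(exc):
                    raise
                time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, 8s, ...

Capping the attempts keeps a genuinely broken delivery from retrying forever, and the growing delay gives the server-side update time to land.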

Thanks again!


#3

Thanks @izuzak, I ran a packet capture over the weekend, but didn't see the problem again. I'll keep it running though, so the next time it happens I'll have the complete trace as it looked from my server.