How to get next results beyond 1,000?


#1

I’m new to GraphQL. Thanks in advance for any guidance you can provide.

I would like to get all issues from a repo based on a search term. For this particular search, there are over 5,000 results. I have attempted to use pagination with the end cursor, but I cannot get more than 1,000 results. I make requests for the first 100 after each endCursor, and this works great for the first 1,000. But once I get to 1,000, the last call’s hasNextPage is false and endCursor is an empty string. How do I advance from the first 1,000 to the next 1,000 (and then the next 1,000 and then the next and next…)?

The only way I can think of is some sort of hacky workaround using the dates.


#2

Hi @kschumy, I just did this myself in Python. It sounds like you may have tried this, but here’s a walkthrough of what my code did that will hopefully be helpful:

  1. Run the query with pageInfo, like you mentioned, and ask for hasNextPage and endCursor, as shown below:
    query { 
      repository(owner: "Microsoft" name: "vscode") {
        issues(first: 100) {
          edges {
            node {
              number
              title
            }
          }
          pageInfo {
            hasNextPage
            endCursor
          }
        }
      }
    }
    
  2. In your code, if hasNextPage is true, save the value of endCursor to a variable. Then, run the query again, adding after: "<endCursor>" to your query:
    query { 
      repository(owner: "Microsoft" name: "vscode") {
        issues(first: 100, after: "Y3Vyc29yOnYyOpHOBwPPkA==") {
    
  3. Repeat until hasNextPage is no longer true.

In my case, I found that GraphQL was forgiving enough that I could simply include the after: bit in my query and allow it to be empty on the first pass in a while-loop, like this:

pullRequests(first: 100, states: MERGED, after: ) {

At the end of each loop, as long as hasNextPage was true, I saved the value of endCursor and spliced it into my query for the next loop. Note: if you take this approach, a bare after: is fine, but after: "" (with quotes) will give you an error. Instead, have the endCursor variable itself include the quotation marks.
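The loop in steps 1-3 can be sketched in Python. This is only a sketch under my own assumptions (the standard library's urllib, a token in a GITHUB_TOKEN environment variable), not code from this thread. One way to sidestep the quoting problem entirely is to pass the cursor as a GraphQL variable: after: null is valid on the first pass, so you never splice a quoted string into the query text.

```python
import json
import os
import urllib.request

URL = "https://api.github.com/graphql"

# Passing the cursor as a $cursor variable avoids string-splicing bugs:
# null is a valid value for `after` on the first request.
QUERY = """
query ($cursor: String) {
  repository(owner: "Microsoft", name: "vscode") {
    issues(first: 100, after: $cursor) {
      nodes { number title }
      pageInfo { hasNextPage endCursor }
    }
  }
}
"""

def paginate(fetch_page):
    """Collect nodes across all pages. fetch_page(cursor) must return
    (nodes, has_next_page, end_cursor) for the page after `cursor`."""
    nodes, cursor = [], None
    while True:
        page, has_next, cursor = fetch_page(cursor)
        nodes.extend(page)
        if not has_next:
            return nodes

def github_fetcher(token):
    """Build a fetch_page function that runs QUERY against the GitHub API."""
    def fetch_page(cursor):
        body = json.dumps({"query": QUERY, "variables": {"cursor": cursor}})
        req = urllib.request.Request(
            URL,
            data=body.encode(),
            headers={"Authorization": f"bearer {token}",
                     "Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            conn = json.load(resp)["data"]["repository"]["issues"]
        info = conn["pageInfo"]
        return conn["nodes"], info["hasNextPage"], info["endCursor"]
    return fetch_page

# all_issues = paginate(github_fetcher(os.environ["GITHUB_TOKEN"]))
```

Keeping the cursor loop (paginate) separate from the network call also makes the pagination logic easy to test with a fake fetcher.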

I hope that makes sense and that it helps. Reply if you have more questions; I’d be happy to help further with your code.

Edit: I forgot to mention one important thing. This approach did work for me to retrieve more than 1,000 results, which is why I even thought to reply. But if this is exactly what you did and it didn’t work for you, let me know! Maybe we’ll figure it out.


#3

Thank you so much @pbaity! This is hands down the clearest explanation I’ve read on this issue.

I ended up figuring out the 100-results issue. It turns out that I left off the quotes around the endCursor value in my Go template, and that is why pagination didn’t work. I will never get back the ten hours of my life it took to figure that out.

But now I’m having trouble getting queries to paginate past 1,000, even if there are more than 1,000 issues found. After 1,000 results, the call returns false for hasNextPage and a call on the endCursor does not provide any issue results.

Below are print statements demonstrating this issue, specifically making calls for 100 results each time and storing those results. The [....] represent things I removed from the output for brevity. The ❗️ are comments I added.

$ go run main.go
Getting all 3735 results...
---------------------------------------
Request 1: query{[...]search(query:[...], first:100)[...]}
From request 1:
- total issue count: 100
- endCursor: Y3Vyc29yOjEwMA==
- Has next page? true
---------------------------------------
Request 2: query{[...]search(query:[...], first:100,after:"Y3Vyc29yOjEwMA==")[...]}
From request 2:
- total issue count: 200
- endCursor: Y3Vyc29yOjIwMA==
- Has next page? true
---------------------------------------
Request 3: query{[...]search(query:[...], first:100,after:"Y3Vyc29yOjIwMA==")[...]}
From request 3:
- total issue count: 300
- endCursor: Y3Vyc29yOjMwMA==
- Has next page? true
---------------------------------------
[skipped 4 - 8 but same idea as above]
---------------------------------------
Request 9: query{[...]search(query:[...], first:100,after:"Y3Vyc29yOjgwMA==")[...]}
From request 9:
- total issue count: 900
- endCursor: Y3Vyc29yOjkwMA==
- Has next page? true
---------------------------------------
Request 10: query{[...]search(query:[...], first:100,after:"Y3Vyc29yOjkwMA==")[...]}
From request 10:
- total issue count: 1000
- endCursor: Y3Vyc29yOjEwMDA=   ❗️very similar to the first request call 
- Has next page? false    ❗️expect this to be true 
---------------------------------------
Request 11: query{[...]search(query:[...], first:100,after:"Y3Vyc29yOjEwMDA=")[...]}
From request 11:
- total issue count: 1000   ❗️no additional issues from previous call 
- endCursor:    ❗️returned null/nil 
- Has next page? false
---------------------------------------

This last call (request 11) ends up getting no issues, despite there being over 2,000 that I’m missing. You can check it out in the explorer. The full query I’m doing for request 11 is:

query{search(query:"node OR cluster in:title is:issue repo:kubernetes/kubernetes",type:ISSUE,first:100,after:"Y3Vyc29yOjEwMDA="){issueCount,pageInfo{endCursor,hasNextPage,startCursor,hasPreviousPage},nodes{...on Issue{author{login},authorAssociation,closed,closedAt,createdAt,id,lastEditedAt,locked,number,title,updatedAt,url}}}}

It will return:

{
  "data": {
    "search": {
      "issueCount": 3735,
      "pageInfo": {
        "endCursor": null,
        "hasNextPage": false,
        "startCursor": null,
        "hasPreviousPage": true
      },
      "nodes": []
    }
  }
}

Do you know why this is happening?


#5

@kschumy I think I found out why this is happening, and I hope I found a somewhat acceptable solution for you too. I wanted to explain both why and how, but if that’s too wordy for anyone reading, skip to Proposed Solution below. Here’s what I found:

First, I tried out your query in the explorer. I got the same result you did - no surprises there. So your query is just fine. (You can hopefully find some relief as a developer to know you weren’t missing any brackets!)

Next, I checked out the v4 API’s GraphQL resource limitations documentation. There didn’t seem to be anything that mentioned a limit of 1,000 search results, and your query certainly wasn’t hitting any of the other rate limits mentioned. Weird.

Then I went to the GitHub website and searched with your query. The total number of results checked out, but then I noticed that there were only 100 pages of results and ten results on each page…1,000 results? The top of the page says 3,740!

I googled around for this problem and found other people mentioning a problem similar to yours with the v3 REST API and with the GitHub Eclipse Java API. (Here, here, and here if you’re interested.) What several people mentioned was this note from the v3 documentation about the search API:

Just like searching on Google, you sometimes want to see a few pages of search results so that you can find the item that best meets your needs. To satisfy that need, the GitHub Search API provides up to 1,000 results for each search.

So there’s the answer to why: it’s simply a hard limit of the GitHub Search API (which I guess is why I never ran into it, since I wasn’t using search). But I didn’t want to be only a bearer of bad news, so I tried to come up with a way to get around it for you.

Proposed Solution: The best way to do it, it seems, is to do something a little hacky with the dates, as you mentioned in your original post. Since you’re using the search function, you could use GitHub’s nifty search syntax to get the next batch of issues after or on the date of the last issue from the first set of results. Check out this Query for dates section of the search syntax documentation.

Note, however, that there seems to be a quirk here. The 1,000th result was created on June 30, 2016, when I checked, so I searched with this query:

node OR cluster in:title is:issue repo:kubernetes/kubernetes created:<=2016-06-30

And here are the results of that search. You might notice that there are 1,401 results, which means there are 2,339 results after June 30, 2016 (and up to the date of this writing). Either way, that’s still more than the 1,000-result limit.

I think the trick would be to query in a loop based not on the cursor, but on the created date. You’d have to decide what span of time to search in each iteration, but let’s say you went with a month: you’d get all the issues from the current month, then all the issues from the previous month, and so on. You could stop the loop either 1) when you hit the date of the first issue, or 2) when you hit the created date of the repository. Either could be retrieved easily enough before starting the loop. The one thing you don’t want to do is stop the loop when you hit a month with no issues, since that can obviously just happen sometimes. (I know none of my repositories have any issues :disappointed_relieved:)
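To make the month-by-month idea concrete, here is a small Python sketch of my own (not from anyone’s actual code in this thread) that generates the created: qualifiers, newest month first, stopping at the month the repository was created:

```python
from datetime import date, timedelta

def month_windows(repo_created, today):
    """Yield `created:` search qualifiers covering one calendar month each,
    newest month first, back to the month the repository was created."""
    first_of_month = today.replace(day=1)
    while True:
        # Last day of this month: jump past it (32 days always lands in the
        # next month when starting from day 1), then step back one day.
        next_month = (first_of_month + timedelta(days=32)).replace(day=1)
        last_day = next_month - timedelta(days=1)
        yield f"created:{first_of_month}..{last_day}"
        if first_of_month <= repo_created.replace(day=1):
            break  # reached the repository's creation month
        first_of_month = (first_of_month - timedelta(days=1)).replace(day=1)
```

Each yielded qualifier gets appended to the search string (for example, node OR cluster in:title is:issue repo:kubernetes/kubernetes created:2016-06-01..2016-06-30), and you paginate normally within each window.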

Whatever stretch of time you search, just be mindful of the API’s resource limitations.

All right, hopefully that novel will provide some help to you. I wanted to give the full answer since neither of us had any luck finding the full answer elsewhere, so I hope it will help others too. Let me know if it works for you!


#6

Thanks Paul for this post/reply. I have been trying for two days to figure out why my query would not paginate past 1,000 records.

What I was looking for was a list of repos with their last-push time and total commit count, to figure out which repos are active or inactive.

I have tried the query below, and it works perfectly, but only until 1,000 repos have been fetched :frowning:

query {
  rateLimit {
    cost
    limit
    remaining
    resetAt
  }
  search(query: "archived:false", type: REPOSITORY, first: 5) {
    repositoryCount
    pageInfo {
      endCursor
      startCursor
      hasNextPage
    }
    edges {
      node {
        ... on Repository {
          nameWithOwner
          pushedAt
          defaultBranchRef {
            target {
              ... on Commit {
                history {
                  totalCount
                }
              }
            }
          }
        }
      }
    }
  }
}

I’ll keep searching to see if I can rewrite the query above some other way to get all the records.

NOTE: I’m running this against GitHub Enterprise. I’m assuming everything works the same there.


#7

@tahershaakir Glad my reply was helpful! My original use case was for GitHub Enterprise as well. Something you could try with your query is to limit by date, like I suggested above - especially if your instance of GHE doesn’t contain millions of repositories that get pushed to by the hundreds per second (like regular GitHub). Possibly something like this:

search(query: "archived:false pushed:2018-09-11..2018-10-11", type: REPOSITORY, first: 5) {

That query gets the repositories pushed to in the last month. As long as it returns fewer than 1,000 results, you can keep looping through the repositories, one month at a time. If you get more than 1,000 results, try a smaller increment of time: maybe a week, or, if you really need to, a day.
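The shrink-the-window idea can even be automated: if a date range matches more than 1,000 repositories, split it in half and retry each half. A hedged sketch, where count_for is a hypothetical stand-in for whatever function runs the search with a pushed:start..end qualifier and reads back repositoryCount:

```python
from datetime import date, timedelta

def split_windows(count_for, start, end, limit=1000):
    """Recursively split the [start, end] date range until each window's
    search result count fits under the 1,000-result search API limit.
    count_for(start, end) is assumed to run the search and return
    repositoryCount for that window."""
    if count_for(start, end) <= limit or start >= end:
        return [(start, end)]
    mid = start + (end - start) // 2
    return (split_windows(count_for, start, mid, limit)
            + split_windows(count_for, mid + timedelta(days=1), end, limit))
```

Each call to count_for costs a search request, so with very large ranges this is slower than picking a sensible fixed increment up front, but it never misses results.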

Another option would be to loop through the organizations in GHE and run a query something like this:

query ($orgName: String!, $endCursor: String) {
  organization(login: $orgName) {
    repositories(first: 100, after: $endCursor) {
      totalCount
      pageInfo {
        endCursor
        startCursor
        hasNextPage
      }
      edges {
        node {
          nameWithOwner
          pushedAt
          defaultBranchRef {
            target {
              ... on Commit {
                history {
                  totalCount
                }
              }
            }
          }
        }
      }
    }
  }
}

I’m not sure how you’d get that list of organization names, but maybe you can figure that out, or do something similar with repository names instead. I hope this helps too and sparks an idea for you!
