Request: Improve API so it's faster to pull large data sets


#1

Hello GitHub,

Our dashboard application needs to import a large set of issues and pull requests (and their timelines) from a repository. At the moment we get fewer than 15 issues per second from the API using the search “endpoint”, so it takes about 70-80 seconds to load 1000 issues from a repository. That’s a long time for a user to wait to see their metrics. We use search because it’s the only way to read issues ordered by updatedAt, but reading issues via repository->issues was not any faster when we tried it.

Do you have plans to improve the speed of the API or otherwise make it easier to pull large data sets? Are there other ways of using the API that would be faster?
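
For reference, the throughput figures reported above can be checked with a few lines of arithmetic. This is only a sketch; the function name is made up for illustration.

```python
def issues_per_second(issue_count: int, elapsed_seconds: float) -> float:
    """Average throughput of one import run."""
    return issue_count / elapsed_seconds


# 1000 issues loaded in 70-80 seconds, as reported in this post.
fastest = issues_per_second(1000, 70)
slowest = issues_per_second(1000, 80)
print(f"{slowest:.1f} to {fastest:.1f} issues per second")
```

Both ends of the range stay under the 15 issues per second mentioned above.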

Best regards,
Tuomas


#2

Hi! Could you share your query? It’s ok if the repo is private, I just want to see how the information is being requested.


#3

This is the query

query getIssuesSearch (
  $query: String!, 
  $eventBatchSize: Int!, 
  $issueBatchSize: Int!,
  $labelsBatchSize: Int!,
  $assigneesBatchSize: Int!,
  $issueCursor: String
) 
{
  search(query:$query, first:$issueBatchSize, after: $issueCursor, type:ISSUE) {
    pageInfo {
      endCursor
      hasNextPage
    }
    edges {
      node {
        ... on Issue {
          ...issueData
        }
        ... on PullRequest {
          ...pullRequestData
        }
      }
    }
  }
}

fragment issueData on Issue {
  __typename
  repository {
    id
    nameWithOwner
  }
  id
  databaseId
  number
  title
  url
  createdAt
  updatedAt
  milestone {
    id 
    title
    dueOn
  }
  assignees(first: $assigneesBatchSize) {
    edges {
      node {
        login
        name
        avatarUrl
      }
    }
  }
  labels(first: $labelsBatchSize) {
    edges {
      node {
        id
        name
        color
      }
    }
  }
  timeline(first: $eventBatchSize) {
    pageInfo {
      endCursor
      hasNextPage
    }
    edges {
      node {
        ...events
      }
    }
  }
}

fragment pullRequestData on PullRequest {
  __typename 
  repository {
    id
    nameWithOwner
  }
  id
  databaseId
  number
  title
  url
  createdAt
  updatedAt
  labels(first: $labelsBatchSize) {
    edges {
      node {
        id
        name
        color
      }
    }
  }
  assignees(first: $assigneesBatchSize) {
    edges {
      node {
        login
        name
        avatarUrl
      }
    }
  }
  timeline(first: $eventBatchSize) {
    pageInfo {
      endCursor
      hasNextPage
    }
    edges {
      node {
        ...events
      }
    }
  }
}

fragment events on Node {
  __typename
  ... on ClosedEvent {
    createdAt
  }
  ... on ReopenedEvent {
    createdAt
  }
  ... on UnassignedEvent {
    createdAt
    unassignedActor: actor {
      login
    }
  }
  ... on AssignedEvent {
    createdAt
    assignedActor: actor {
      login
    }
  }
}

and here’s a sample set of parameters

{
  "query": "repo:sequelize/sequelize updated:>2017-03-01T20:24:49Z sort:updated",
  "eventBatchSize": 99,
  "issueBatchSize": 99,
  "labelsBatchSize": 20,
  "assigneesBatchSize": 10,
  "issueCursor": null
}
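
To make the paging behaviour concrete, here is a minimal Python sketch of how a client might walk the search results with the query and parameters above, feeding pageInfo.endCursor back in as issueCursor until hasNextPage is false. The run_query callable is hypothetical: it stands for whatever transport sends the query with the given variables and returns the decoded "data" object. It is not part of any GitHub library.

```python
from typing import Any, Callable, Dict, Iterator, Optional

# Hypothetical transport: takes the GraphQL variables and returns the
# decoded "data" object of the response.
RunQuery = Callable[[Dict[str, Any]], Dict[str, Any]]


def iter_search_nodes(
    run_query: RunQuery,
    variables: Dict[str, Any],
    max_nodes: Optional[int] = None,
) -> Iterator[Dict[str, Any]]:
    """Yield issue and pull request nodes page by page."""
    variables = dict(variables)  # copy so the caller's dict is not mutated
    variables.setdefault("issueCursor", None)
    yielded = 0
    while True:
        search = run_query(variables)["search"]
        for edge in search["edges"]:
            yield edge["node"]
            yielded += 1
            if max_nodes is not None and yielded >= max_nodes:
                return
        page_info = search["pageInfo"]
        if not page_info["hasNextPage"]:
            return
        # Continue after the last item of the page just read.
        variables["issueCursor"] = page_info["endCursor"]
```

Each page is a separate round trip, so 1000 issues at a batch size of 99 means at least 11 sequential requests.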

#4

One additional note. We often get spurious 405 responses when pulling issues from repositories that have a lot of them (thousands). A current example would be the “kubernetes/kubernetes” repository. Setting the batch size smaller helps sometimes, but not always.
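
The batch size workaround mentioned above could be sketched roughly like this: retry the same page with a halved issueBatchSize whenever the request fails. QueryFailed and run_query are hypothetical stand-ins for the client's own error type and transport, and, as noted, shrinking the batch does not always help.

```python
from typing import Any, Callable, Dict


class QueryFailed(Exception):
    """Hypothetical error raised by the transport for a failed request (e.g. a 405)."""


def fetch_with_shrinking_batch(
    run_query: Callable[[Dict[str, Any]], Dict[str, Any]],
    variables: Dict[str, Any],
    min_batch_size: int = 10,
) -> Dict[str, Any]:
    """Retry one page, halving issueBatchSize after each failure."""
    batch_size = variables["issueBatchSize"]
    while True:
        try:
            return run_query(dict(variables, issueBatchSize=batch_size))
        except QueryFailed:
            if batch_size <= min_batch_size:
                raise  # already at the floor; give up and surface the error
            batch_size = max(min_batch_size, batch_size // 2)
```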


#5

At the moment the example query above returns an error after a couple of iterations using issueCursor.

Something went wrong while executing your query. This is most likely a GitHub bug. Please include “E4FB:15875:7194B33:8755C1E:592551ED” when reporting this issue.


#6

Phew! Yeah that took almost 5 seconds for me to bring in. Based on the profiling, I see a couple of easy wins for us to tackle. I’ll log an issue for it.


#7

We’re still consistently getting the “Something went wrong …” error trying to pull the third page using the query and parameters above.

Something went wrong while executing your query. This is most likely a GitHub bug. Please include “D3D7:4594:3613D85:3F70EE4:592C000B” when reporting this issue.

Is this related to performance or a separate issue?


#8

I believe it’s a separate, intermittent issue; I can’t seem to reproduce it reliably.


#9

I just tried to refresh the data from around 20 repos we have using the above query. Basically just doing a search and loading a few pages of results, up to 500 issues or pull requests or so.

Results

  • Lots of 405 errors; retrying a few times helped in some cases but not in others
  • Triggered 403 “rate limit abuse” errors (the rate limit wasn’t used up)
  • A few “Something went wrong…” errors

Overall I was able to pull data for around only half of the repos that had a large number (500+) of issues and pull requests.
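
For context, the retrying described above follows roughly the pattern below: exponential backoff with a little jitter, giving up after a fixed number of attempts. This is a sketch, not the actual client code. TransientError is a hypothetical exception for any retryable response, and its retry_after field assumes the server may suggest a wait time.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical error for a retryable response (405, 403 abuse, "Something went wrong")."""

    def __init__(self, message: str, retry_after: Optional[float] = None):
        super().__init__(message)
        self.retry_after = retry_after  # seconds suggested by the server, if any


def call_with_backoff(
    request: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call request(), retrying transient failures with growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return request()
        except TransientError as err:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the last error
            if err.retry_after is not None:
                delay = err.retry_after
            else:
                delay = base_delay * 2 ** attempt  # 1s, 2s, 4s, ...
            # Up to 10% jitter so parallel clients do not retry in lockstep.
            sleep(delay + random.uniform(0, 0.1 * delay))
    raise AssertionError("unreachable")
```

Backoff only helps with failures that are actually transient; a page that fails every time still fails after the last attempt.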

We’re hoping to add a significant number of new users in the near future, but it looks like the v4 API is not ready yet. Can you help? Or can you give an estimate of when these issues will be resolved?


#10

Hello, I ran into this issue as well. Do you have any updates on this? Thanks!


#11

I suspect that the performance implications of this query are a result of it having to call out to our search infrastructure because you’re using the search field. It would be preferable to avoid the use of that and get the information directly from the issues connection off of the top level repository field, but we don’t support an updated argument. @ynie added a schema request in “Request: Improve API so it’s faster to pull large data sets” that I think would greatly improve the performance of your query.


#12

Hey @gjtorikian,

I’m having a similar intermittent issue. Please see if these steps help as I can reproduce the issue reliably this way.

  1. Start searching with the query below. It should work.
  2. Replace first: 21 with 22. It should fail.
  3. Keep 22. Remove repository block. It should work again.
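
The boundary in the steps above (21 works, 22 fails) can also be found by bisection rather than by hand. The sketch below assumes the failure is monotonic, meaning that once a value of first pulls in the problematic result, every larger value fails too. query_succeeds is a hypothetical callable that runs the search with the given first and reports whether it succeeded. The default upper bound of 100 is an assumption.

```python
from typing import Callable


def smallest_failing_first(
    query_succeeds: Callable[[int], bool],
    low: int = 1,
    high: int = 100,
) -> int:
    """Return the smallest `first` in [low, high] for which the query fails."""
    if not query_succeeds(low):
        return low
    if query_succeeds(high):
        raise ValueError("query succeeds at the upper bound; nothing to bisect")
    # Invariant: the query succeeds at `low` and fails at `high`.
    while high - low > 1:
        mid = (low + high) // 2
        if query_succeeds(mid):
            low = mid
        else:
            high = mid
    return high
```

With the scenario described here it would return 22 after a handful of requests.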

It seemed intermittent at first, but I’m thinking it may be related to specific repositories? That’s because:

  • As long as I join other entities (issue authors, reactions etc) it keeps working fine as I page through.
  • Once I join repository, it fails “intermittently” (well not quite as I can reproduce reliably now).

BTW, the repository this scenario is failing at is ApplePride/PIDOR (currently 404 on github.com).

Hope this helps and looking forward to hearing your thoughts.

query {
  search(type: ISSUE, query: "reactions:>=100", first: 21, after: "Y3Vyc29yOjEwMA==") {
    pageInfo {
      endCursor
    }
    edges {
      node {
        ... on Issue {
          url
          repository {
            id
          }
        }
      }
    }
  }
}

Mar 13, 2018 UPDATE: Edited first indexes according to current cursor positions, as there are more search results with >= 100 reactions now, and the ApplePride/PIDOR repo is at position first: 22.


#13

It’s funny you mentioned this. We noticed this issue earlier today and are working on a fix for it. We believe it’s related to spammy repositories which are returning issues erroneously. I’ll update this thread when a fix is shipped!


#14

Glad to hear a fix is on the way @gjtorikian, cheers!


#15

Hey @gjtorikian, sorry to bother with this type of question.

May I ask whether there has been any progress, or whether a rough ETA is available at all?

By the way, I updated the first indexes in the steps to reproduce above.


#16

I forgot to reply here, but your problem should be resolved now!