GitHub is a software development platform for source control, collaboration, and code review. The Nekt GitHub connector uses the GitHub REST API to extract repository metadata, pull requests, and commits into your Catalog.

Configuring GitHub as a Source

In the Sources tab, click the “Add source” button in the top right of your screen. Then select GitHub from the list of connectors. Click Next and you’ll be prompted to provide your access credentials.

1. Add account access

You’ll need a GitHub Personal Access Token (classic or fine-grained) with permission to read the repositories you want to sync. The following configurations are available:
  • Personal Access Token: Your GitHub Personal Access Token. A classic token needs the repo scope; a fine-grained token needs read access to the repositories you want to sync. This field is required and stored securely.
  • Repositories: Optional list of repositories in owner/repo format (one per line). If provided, only these repositories are synced. If left empty, the connector syncs all repositories accessible by the token.
  • Start Date: Optional starting point for incremental commit syncs. Only records created or updated after this date are synced.
Once you’re done, click Next.
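The Repositories field expects one owner/repo entry per line. As a minimal sketch (not the connector's own validation, which is not described here), a list can be checked before pasting it in. The function name `parse_repositories` and the character rules in the pattern are assumptions made for illustration.

```python
import re

# Assumed pattern: owner and repo made of letters, digits, '.', '_' or '-'.
# GitHub's real naming rules are stricter in places; this only catches obvious mistakes.
_REPO_PATTERN = re.compile(r"^[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+$")


def parse_repositories(raw):
    """Split a one-per-line repository list and reject malformed entries."""
    repos = []
    for line in raw.splitlines():
        entry = line.strip()
        if not entry:
            continue  # ignore blank lines
        if not _REPO_PATTERN.match(entry):
            raise ValueError(f"expected owner/repo, got: {entry!r}")
        repos.append(entry)
    return repos
```

An empty input returns an empty list, which corresponds to the "sync all accessible repositories" behavior described above.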

2. Select streams

Choose which data streams you want to sync:
  • repositories
  • pull_requests
  • commits
For faster extractions, select only the streams you need. Select the streams and click Next.

3. Configure data streams

Customize how you want your data to appear in your catalog. Select the desired layer where the data will be placed, a folder to organize it inside the layer, a name for each table, and the type of sync.
  • Layer: Choose the layer where extracted GitHub tables will be created.
  • Folder: Optionally group all GitHub tables inside a folder.
  • Table name: A default name is suggested, but you can customize it. You can also add a prefix to all tables.
  • Sync Type: Choose between INCREMENTAL and FULL_TABLE.
    • Incremental: Recommended for commits, using committed_at as the replication key.
    • Full table: Useful for one-off backfills or full refreshes.
Once you are done configuring, click Next.
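To illustrate the difference between the two sync types, the sketch below models an incremental run as "keep only records whose replication key is newer than the saved bookmark, then advance the bookmark". It is a simplified model under assumptions, not Nekt's implementation: the function name, the record shape, and the strict "greater than" comparison are all hypothetical.

```python
from datetime import datetime


def incremental_batch(records, bookmark, key="committed_at"):
    """Return (new_records, new_bookmark) for one simulated incremental run.

    records: list of dicts with an ISO 8601 timestamp string under `key`.
    bookmark: ISO 8601 string saved by the previous run, or None on the first run.
    """
    def ts(value):
        # datetime.fromisoformat rejects a trailing 'Z' before Python 3.11
        return datetime.fromisoformat(value.replace("Z", "+00:00"))

    if bookmark is None:
        fresh = list(records)  # with no bookmark, every record is new
    else:
        fresh = [r for r in records if ts(r[key]) > ts(bookmark)]

    if fresh:
        # advance the bookmark to the newest replication key seen
        bookmark = max((r[key] for r in fresh), key=ts)
    return fresh, bookmark
```

A full table sync, by contrast, would reload every record on each run regardless of any bookmark.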

4. Configure data source

Add a description of up to 140 characters so the data source is easy to identify within your organization. To define your Trigger, consider how often your repositories change:
  • Hourly / every few hours for active engineering analytics.
  • Daily for standard operational reporting.
  • Weekly for low-change repositories.
Optionally, you can define:
  • Delta Log Retention: How long Nekt keeps previous table states. See Resource control.
  • Additional Full Sync: Periodic full syncs in addition to incrementals.
When you are ready, click Next to finalize the setup.

5. Check your new source

You can view your new source on the Sources page. If needed, manually trigger the extraction by clicking on the arrow button. Once a run completes successfully, your data appears in the Catalog.
You need at least one successful source run to see the tables in your Catalog.

Streams and Fields

Below you’ll find the available GitHub streams and their core fields.
repositories

Repository metadata for all repositories accessible by the token (or only the configured list in repositories).

Key fields:

| Field | Type | Description |
| --- | --- | --- |
| id | Integer | Repository numeric ID (primary key) |
| full_name | String | Repository name in owner/repo format |
| private | Boolean | Indicates whether the repository is private |
| visibility | String | Repository visibility (public, private, etc.) |
| default_branch | String | Default branch |
| language | String | Primary detected language |
| stargazers_count | Integer | Number of stars |
| forks_count | Integer | Number of forks |
| open_issues_count | Integer | Number of open issues |
| created_at | DateTime | Timestamp when the repository was created |
| updated_at | DateTime | Timestamp when the repository was last updated |
| pushed_at | DateTime | Timestamp when the repository was last pushed to |

Notes:
  • Primary key: id
  • Replication: full-table style (no replication key)
  • Child context: each repository emits owner and repo context used by pull_requests and commits

pull_requests

Pull requests for each repository. The connector fetches pull requests in every state: open, closed, and merged.

Key fields:

| Field | Type | Description |
| --- | --- | --- |
| id | Integer | Pull request ID (primary key) |
| number | Integer | Pull request number inside the repository |
| title | String | Pull request title |
| body | String | Pull request description/body |
| state | String | Current state (open or closed; a merged pull request is closed with merged_at set) |
| draft | Boolean | Indicates whether it is a draft PR |
| locked | Boolean | Indicates whether the PR is locked |
| user | Object | Pull request author |
| head | Object | Source branch metadata |
| base | Object | Target branch metadata |
| merged_at | DateTime | Timestamp when merged |
| closed_at | DateTime | Timestamp when closed |
| created_at | DateTime | Timestamp when created |
| updated_at | DateTime | Timestamp when last updated |
| additions | Integer | Lines added |
| deletions | Integer | Lines deleted |
| changed_files | Integer | Number of files changed |
| comments | Integer | Number of comments |
| review_comments | Integer | Number of review comments |
| commits | Integer | Number of commits in the PR |
| _sdc_repository | String | Repository context in owner/repo format |

Notes:
  • Primary key: id
  • Replication: full-table style (no replication key)
  • Includes repository context fields (owner, repo, _sdc_repository) for easier joins

commits

Commits for each repository. This stream supports incremental sync using the commit timestamp.

Key fields:

| Field | Type | Description |
| --- | --- | --- |
| sha | String | Git SHA hash value of the object (primary key). |
| node_id | String | GraphQL node identifier of the record. |
| html_url | String | Web URL for this resource in GitHub. |
| url | String | API URL for this resource. |
| commit.message | String | Commit message. |
| commit.author | Object | Author metadata embedded in the commit (name, email, date). |
| commit.committer | Object | Committer metadata embedded in the commit (name, email, date). |
| commit.tree | Object | Git tree object referenced by this commit. |
| commit.verification | Object | Verification details (verified, reason, signature, payload). |
| author | Object | GitHub user object for the author (when available). |
| committer | Object | GitHub user object for the committer (when available). |
| parents | Array | Parent commit references. |
| stats.additions | Integer | Lines added. |
| stats.deletions | Integer | Lines deleted. |
| stats.total | Integer | Total lines changed. |
| committed_at | DateTime | Replication key (derived from commit.committer.date). |
| _sdc_repository | String | Repository context in owner/repo format. |

Notes:
  • Primary key: sha
  • Replication key: committed_at
  • Incremental sync passes the since parameter to the GitHub API, taking its value from the state bookmark (or from start_date when no state is available)
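
The last note can be sketched as follows. GitHub's "list commits" endpoint (GET /repos/{owner}/{repo}/commits) accepts a since query parameter holding an ISO 8601 timestamp. How the connector builds the request internally is not described here, so the function name `build_commits_url` and the per_page value are assumptions.

```python
from urllib.parse import urlencode


def build_commits_url(repository, bookmark=None, start_date=None):
    """Build a list-commits URL, preferring the state bookmark over start_date."""
    params = {"per_page": 100}  # assumed page size; 100 is the API maximum
    since = bookmark or start_date
    if since:
        params["since"] = since  # ISO 8601, e.g. 2024-01-01T00:00:00Z
    query = urlencode(params)
    return f"https://api.github.com/repos/{repository}/commits?{query}"
```

With neither a bookmark nor a start date, no since parameter is sent and the request covers the full commit history.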

Data Model

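The connector follows a repository-centered model: repositories is the parent stream, and pull_requests and commits are child streams. Each child record carries its repository context in _sdc_repository (owner/repo format), which links it back to its parent repository.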
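
As an illustration of this model, the sketch below attaches repository attributes to child records by matching _sdc_repository against full_name. That the two values line up is an assumption based on both being described in owner/repo format; the helper name `attach_repository` and the sample rows are made up.

```python
def attach_repository(children, repositories):
    """Left-join child records (pull requests or commits) to their repository."""
    by_name = {repo["full_name"]: repo for repo in repositories}
    joined = []
    for child in children:
        # children without a matching repository keep None for the repo fields
        repo = by_name.get(child["_sdc_repository"], {})
        joined.append({
            **child,
            "repo_id": repo.get("id"),
            "repo_language": repo.get("language"),
        })
    return joined


# Made-up sample rows using field names from the Streams and Fields section
repositories = [{"id": 1, "full_name": "acme/api", "language": "Python"}]
pull_requests = [
    {"id": 10, "number": 7, "_sdc_repository": "acme/api"},
    {"id": 11, "number": 3, "_sdc_repository": "acme/unknown"},
]
rows = attach_repository(pull_requests, repositories)
```

The same join can be written in SQL in Explorer by matching the child table's _sdc_repository to the repositories table's full_name, under the same assumption.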

Use Cases for Data Analysis

This section includes practical SQL examples you can run in Explorer.

1. Pull Request throughput by repository

Measure how many pull requests are created, closed, and merged by repository.
SELECT
   _sdc_repository AS repository,
   COUNT(*) AS total_prs,
   SUM(CASE WHEN state = 'open' THEN 1 ELSE 0 END) AS open_prs,
   SUM(CASE WHEN closed_at IS NOT NULL THEN 1 ELSE 0 END) AS closed_prs,
   SUM(CASE WHEN merged_at IS NOT NULL THEN 1 ELSE 0 END) AS merged_prs
FROM
   nekt_raw.github_pull_requests
GROUP BY
   1
ORDER BY
   total_prs DESC;

2. Commit activity in the last 30 days

Track commit volume and active contributors by repository.
SELECT
   _sdc_repository AS repository,
   COUNT(*) AS commits_last_30d,
   COUNT(DISTINCT COALESCE(author.login, commit.author.email)) AS active_authors
FROM
   nekt_raw.github_commits
WHERE
   CAST(committed_at AS timestamp) >= current_timestamp - interval '30' day
GROUP BY
   1
ORDER BY
   commits_last_30d DESC;
