Configuring Apify as a Source
In the Sources tab, click the “Add source” button at the top right of your screen. Then select the Apify option from the list of connectors. Click Next and you’ll be prompted to add your access.
1. Add account access
You’ll need to provide your Apify API token for authentication. You can find your API token in the Apify Console under Settings > API & Integrations. The following configurations are available:
- API Key: Your Apify API token used for authentication. This is required.
- Dataset IDs (optional): A list of dataset IDs to extract. You can find dataset IDs in the Apify Console under Storage > Datasets, or in the output tab of your Actor runs. Note that most Apify datasets (especially those created by Actor runs) are unnamed and won’t appear in generic listings, so you must provide their IDs explicitly.
- Actor IDs (optional): A list of Actor IDs to extract run data from. The tap will incrementally sync succeeded runs and their dataset items. You can find Actor IDs in the Apify Console under Actors, or use the format username~actor-name. This is ideal when you have scheduled Actors running periodically and want to automatically sync their results without manually tracking dataset IDs.
You must provide at least one of Dataset IDs or Actor IDs. You can also use both at the same time.
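The rules above can be sketched as a small validation helper. This is only an illustration: the `ApifySourceConfig` class, its field names, and both functions are hypothetical and do not reflect the connector's actual configuration schema. The conversion of a slash-separated Actor name into the username~actor-name format is also an assumption about how you might paste the name in.

```python
from dataclasses import dataclass, field


@dataclass
class ApifySourceConfig:
    """Hypothetical container mirroring the three settings described above."""
    api_key: str
    dataset_ids: list[str] = field(default_factory=list)
    actor_ids: list[str] = field(default_factory=list)


def normalize_actor_id(actor_id: str) -> str:
    """Rewrite 'username/actor-name' as 'username~actor-name' (assumed input form)."""
    return actor_id.strip().replace("/", "~")


def validate_config(config: ApifySourceConfig) -> ApifySourceConfig:
    """Raise ValueError when the settings break the rules stated above."""
    if not config.api_key.strip():
        raise ValueError("API Key is required")
    if not config.dataset_ids and not config.actor_ids:
        raise ValueError("Provide at least one of Dataset IDs or Actor IDs")
    config.actor_ids = [normalize_actor_id(a) for a in config.actor_ids]
    return config
```

Running a check like this before saving the source catches an empty configuration early, instead of at the first extraction.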
2. Select streams
Choose which data streams you want to sync. For faster extractions, select only the streams that are relevant to your analysis.

Tip: You can find a stream more easily by typing its name.

Select the streams and click Next.
3. Configure data streams
Customize how you want your data to appear in your catalog. Select the layer where the data will be placed, a folder to organize it inside the layer, a name for each table, and the type of sync.
- Layer: choose among the existing layers in your catalog. This is where you will find your newly extracted tables once the extraction runs successfully.
- Folder: a folder can be created inside the selected layer to group all tables being created from this new data source.
- Table name: we suggest a name, but feel free to customize it. You have the option to add a prefix to all tables at once and make this process faster!
- Sync Type: you can choose between INCREMENTAL and FULL_TABLE.
  - Incremental: every time the extraction happens, we’ll get only the new data.
  - Full table: every time the extraction happens, we’ll get the current state of the data.
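The difference between the two sync types can be illustrated with a short conceptual sketch. It is not the platform's implementation: the `select_records` function, the `bookmark` argument (the highest replication-key value seen in the previous extraction), and the strict "newer than" comparison are all assumptions made for illustration.

```python
from datetime import datetime


def select_records(records, sync_type, replication_key, bookmark=None):
    """Return the records a single extraction would pick up.

    FULL_TABLE returns every record. INCREMENTAL returns only records whose
    replication key is newer than the bookmark; with no bookmark yet (first
    extraction), it also returns everything.
    """
    if sync_type == "FULL_TABLE":
        return list(records)
    if sync_type == "INCREMENTAL":
        if bookmark is None:
            return list(records)
        return [r for r in records if r[replication_key] > bookmark]
    raise ValueError(f"Unknown sync type: {sync_type}")


# Hypothetical sample data: three Actor runs keyed by their start time.
runs = [
    {"id": "run-1", "startedAt": datetime(2024, 1, 1)},
    {"id": "run-2", "startedAt": datetime(2024, 1, 2)},
    {"id": "run-3", "startedAt": datetime(2024, 1, 3)},
]
new_runs = select_records(runs, "INCREMENTAL", "startedAt", bookmark=datetime(2024, 1, 2))
```

In this sample, the incremental call keeps only the run started after the bookmark, while a FULL_TABLE call would return all three runs on every extraction.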
4. Configure data source
Describe your data source for easy identification within your organization, using no more than 140 characters. To define your Trigger, consider how often you want data to be extracted from this source. Optionally, you can define some additional settings:
- Configure Delta Log Retention and determine for how long we should store old states of this table as it gets updated. Read more about this resource here.
- Determine when to execute an Additional Full Sync.
5. Check your new source
You can view your new source on the Sources page. If needed, manually trigger the source extraction by clicking on the arrow button. Once executed, your data will appear in your Catalog.
Streams and Fields
Below you’ll find all available data streams from Apify and their corresponding fields:
Datasets
Lists all datasets in your Apify account. This stream supports incremental sync based on the modifiedAt field.

| Field | Type | Description |
|---|---|---|
| id | string | Unique identifier of the dataset |
| name | string | Name of the dataset |
| createdAt | date-time | Timestamp when the dataset was created |
| modifiedAt | date-time | Timestamp when the dataset was last modified |
| accessedAt | date-time | Timestamp when the dataset was last accessed |
| itemCount | integer | Total number of items in the dataset |
| cleanItemCount | integer | Number of clean (deduplicated) items |
| actId | string | ID of the Actor that created the dataset |
| actRunId | string | ID of the Actor run that created the dataset |
| stats | string | Dataset statistics as a JSON string |
| fields | string | Dataset field definitions as a JSON string |
Dataset Items
Retrieves all items (records) stored in each dataset. Each item is serialized as a JSON string in the data field, preserving the original structure from the Actor run.

| Field | Type | Description |
|---|---|---|
| dataset_id | string | ID of the parent dataset |
| _row_index | integer | Sequential index of the item within the dataset |
| data | string | The dataset item serialized as a JSON string |
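Because each item arrives as a JSON string, downstream code typically needs to decode the data field before using it. A minimal sketch, assuming the rows have been loaded as Python dictionaries with the three fields listed above; the `parse_dataset_items` function and the sample rows are hypothetical.

```python
import json


def parse_dataset_items(rows):
    """Decode the JSON string in each row's data field.

    Rows are ordered by dataset and row index so items come back in the
    order they were stored. The decoded value is kept under "item" because
    an item is not guaranteed to be a JSON object.
    """
    ordered = sorted(rows, key=lambda r: (r["dataset_id"], r["_row_index"]))
    return [
        {
            "dataset_id": row["dataset_id"],
            "_row_index": row["_row_index"],
            "item": json.loads(row["data"]),
        }
        for row in ordered
    ]


# Hypothetical rows shaped like the Dataset Items stream.
rows = [
    {"dataset_id": "ds1", "_row_index": 1, "data": '{"title": "Second", "price": 20}'},
    {"dataset_id": "ds1", "_row_index": 0, "data": '{"title": "First", "price": 10}'},
]
items = parse_dataset_items(rows)
```

The same approach applies to the stats and fields columns of the Datasets stream, which are also stored as JSON strings.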
Dataset Statistics
Retrieves detailed metadata and statistics for each dataset. This stream supports incremental sync based on the modifiedAt field.

| Field | Type | Description |
|---|---|---|
| id | string | Unique identifier of the dataset |
| name | string | Name of the dataset |
| createdAt | date-time | Timestamp when the dataset was created |
| modifiedAt | date-time | Timestamp when the dataset was last modified |
| accessedAt | date-time | Timestamp when the dataset was last accessed |
| itemCount | integer | Total number of items in the dataset |
| cleanItemCount | integer | Number of clean (deduplicated) items |
| actId | string | ID of the Actor that created the dataset |
| actRunId | string | ID of the Actor run that created the dataset |
Actor Runs
Lists all succeeded runs for each configured Actor, with incremental sync based on the startedAt field. Each run includes metadata such as cost, schedule info, and a reference to its default dataset.

| Field | Type | Description |
|---|---|---|
| id | string | Unique identifier of the Actor run |
| actId | string | ID of the Actor |
| userId | string | ID of the user who triggered the run |
| actorTaskId | string | ID of the Actor task (if run via a task) |
| status | string | Run status (always SUCCEEDED for this stream) |
| startedAt | date-time | Timestamp when the run started |
| finishedAt | date-time | Timestamp when the run finished |
| buildId | string | ID of the Actor build used |
| buildNumber | string | Semantic version of the build |
| usageTotalUsd | number | Total cost of the run in USD |
| defaultKeyValueStoreId | string | ID of the run’s default key-value store |
| defaultDatasetId | string | ID of the run’s default dataset |
| defaultRequestQueueId | string | ID of the run’s default request queue |
| meta | object | Run metadata including origin, scheduleId, scheduledAt, clientIp, and userAgent |
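As an example of working with this stream, the sketch below totals run cost per Actor and picks out runs started by a schedule. It assumes the rows are available as Python dictionaries with the fields listed above, and that a run with a non-empty scheduleId inside meta was triggered by a schedule; the function names and sample values are hypothetical.

```python
from collections import defaultdict


def cost_per_actor(runs):
    """Sum usageTotalUsd for each actId, treating missing costs as zero."""
    totals = defaultdict(float)
    for run in runs:
        totals[run["actId"]] += run.get("usageTotalUsd") or 0.0
    return dict(totals)


def scheduled_runs(runs):
    """Keep runs whose meta object carries a scheduleId (assumed to mean scheduled)."""
    return [r for r in runs if (r.get("meta") or {}).get("scheduleId")]


# Hypothetical rows shaped like the Actor Runs stream.
runs = [
    {"id": "r1", "actId": "actorA", "usageTotalUsd": 0.25, "meta": {"origin": "SCHEDULER", "scheduleId": "s1"}},
    {"id": "r2", "actId": "actorA", "usageTotalUsd": 0.5, "meta": {"origin": "WEB"}},
    {"id": "r3", "actId": "actorB", "usageTotalUsd": None, "meta": None},
]
totals = cost_per_actor(runs)
from_schedule = scheduled_runs(runs)
```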
Actor Run Dataset Items
Retrieves all items from each Actor run’s default dataset. This is a child stream of Actor Runs, meaning it automatically fetches dataset items for every new run synced. Each item is serialized as a JSON string in the data field.

| Field | Type | Description |
|---|---|---|
| dataset_id | string | ID of the dataset (from the Actor run) |
| actor_id | string | ID of the Actor that produced this data |
| run_id | string | ID of the Actor run |
| run_started_at | date-time | Timestamp when the parent Actor run started (used as replication key for incremental sync) |
| _row_index | integer | Sequential index of the item within the dataset |
| data | string | The dataset item serialized as a JSON string |
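With scheduled Actors, a common need is to keep only the results of the most recent run. The sketch below shows one way to do that over rows from this stream. It assumes the rows are Python dictionaries with the fields above and that run_started_at values are ISO 8601 strings in a single consistent format and time zone, so that comparing them as strings matches chronological order. The `latest_run_items` function and the sample rows are hypothetical.

```python
import json


def latest_run_items(rows):
    """Return decoded items from the most recent run of each Actor."""
    latest = {}  # actor_id -> (run_started_at, run_id) of the newest run seen
    for row in rows:
        key = (row["run_started_at"], row["run_id"])
        if row["actor_id"] not in latest or key > latest[row["actor_id"]]:
            latest[row["actor_id"]] = key
    # Keep only rows belonging to each Actor's newest run, in stored order.
    selected = [r for r in rows if latest[r["actor_id"]][1] == r["run_id"]]
    selected.sort(key=lambda r: (r["actor_id"], r["_row_index"]))
    return [json.loads(r["data"]) for r in selected]


# Hypothetical rows shaped like the Actor Run Dataset Items stream.
rows = [
    {"dataset_id": "d1", "actor_id": "a1", "run_id": "r1",
     "run_started_at": "2024-01-01T00:00:00Z", "_row_index": 0, "data": '{"price": 10}'},
    {"dataset_id": "d2", "actor_id": "a1", "run_id": "r2",
     "run_started_at": "2024-01-02T00:00:00Z", "_row_index": 1, "data": '{"price": 12}'},
    {"dataset_id": "d2", "actor_id": "a1", "run_id": "r2",
     "run_started_at": "2024-01-02T00:00:00Z", "_row_index": 0, "data": '{"price": 11}'},
]
current_items = latest_run_items(rows)
```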