Configuring AWS S3 (CSV) as a Source
In the Sources tab, click the “Add source” button at the top right of your screen. Then select the AWS S3 (CSV) option from the list of connectors. Click Next and you’ll be prompted to add your access credentials.

1. Add account access
You’ll need to provide the following credentials to connect to AWS S3:

- Bucket Name: The name of your AWS S3 bucket where the CSV files are stored.
- Stream Name: A unique identifier for the stream (e.g., sales_data). Must start with a letter and contain only lowercase letters, numbers, and underscores.
- Delimiter: The character that separates fields in your CSV files (default: ,). Common options:
  - Comma (,) for standard CSV files
  - Semicolon (;) for some European formats
  - Tab (\t) for TSV files
- S3 Folder Path (optional): The path within your bucket where the files are located (e.g., raw_data/sales/). Leave empty if files are in the bucket’s root.
- File Search Pattern: A regex pattern to match the files you want to process (default: .*). Examples:
  - .* matches all files
  - .*\.csv$ matches only files ending in .csv
  - sales_.*\.csv$ matches CSV files starting with “sales_”
  - 2024-.*\.csv$ matches CSV files from 2024
- Unique Key Columns: The column(s) that uniquely identify each row in your CSV files (e.g., id or order_number). This helps with deduplication and incremental syncs.
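If you are unsure whether a file search pattern will match the right objects, you can try it locally with Python’s re module before saving the source. The object keys below are made up for illustration:

```python
import re

# Hypothetical object keys as they might appear in your bucket
keys = [
    "raw_data/sales/sales_2024-01.csv",
    "raw_data/sales/2024-02-01-sales.csv",
    "raw_data/sales/notes.txt",
]

# The File Search Pattern is applied as a regular expression
pattern = re.compile(r".*\.csv$")
matched = [k for k in keys if pattern.match(k)]
print(matched)  # the .txt file is filtered out
```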
2. Select streams
Choose which data streams you want to sync; you can select all streams or pick only the ones that matter most to you.

Tip: you can find a stream more easily by typing its name.

Select the streams and click Next.
3. Configure data streams
Customize how you want your data to appear in your catalog. Select a name for each table (which will contain the fetched data) and the type of sync.

- Table name: we suggest a name, but feel free to customize it. You also have the option to add a prefix to speed this process up!
- Sync Type: you can choose between INCREMENTAL and FULL_TABLE.
  - Incremental: every time the extraction happens, we’ll fetch only the files that were added or modified since the last extraction.
  - Full table: every time the extraction happens, we’ll process all matching files from the bucket.
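The difference between the two sync types can be sketched as follows. This is a minimal illustration, assuming the connector tracks a last-run timestamp and compares it against each object’s last-modified time; all names and dates are made up:

```python
from datetime import datetime, timezone

# Hypothetical bucket listing with last-modified times
objects = [
    {"Key": "sales_2024-01.csv", "LastModified": datetime(2024, 1, 5, tzinfo=timezone.utc)},
    {"Key": "sales_2024-02.csv", "LastModified": datetime(2024, 2, 5, tzinfo=timezone.utc)},
]

last_sync = datetime(2024, 1, 31, tzinfo=timezone.utc)

# INCREMENTAL: only files added or modified since the last extraction
incremental = [o["Key"] for o in objects if o["LastModified"] > last_sync]

# FULL_TABLE: every matching file, every run
full_table = [o["Key"] for o in objects]

print(incremental)  # only the February file
print(full_table)   # both files
```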
4. Configure data source
Describe your data source for easy identification within your organization, using no more than 140 characters. To define your Trigger, consider how often you want data to be extracted from this source. This usually depends on how frequently you need the table data updated (every day, once a week, or only at specific times). Optionally, you can schedule a periodic full sync. This complements the incremental extractions, ensuring that your data is fully synchronized with your source every once in a while. Once you are ready, click Next to finalize the setup.

5. Check your new source
You can view your new source on the Sources page. If needed, manually trigger the source extraction by clicking the arrow button. Once it runs, your data will appear in your Catalog.

Additional Information
Best Practices
File Organization
- Keep related CSV files in dedicated folders
- Use consistent naming patterns for your files
- Consider using date-based prefixes for time-series data (e.g., YYYY-MM-DD-sales.csv)
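A small helper for producing date-prefixed names in the YYYY-MM-DD style suggested above (the function name is illustrative, not part of the connector):

```python
from datetime import date

def dated_filename(day: date, name: str = "sales") -> str:
    # An ISO-format date prefix keeps time-series files sorted chronologically
    return f"{day.isoformat()}-{name}.csv"

print(dated_filename(date(2024, 3, 1)))  # 2024-03-01-sales.csv
```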
CSV Structure
- Ensure consistent column names across files
- Use appropriate data types for each column
- Avoid special characters in column names
- Include a header row in all CSV files
Performance Optimization
- Use compression (e.g., .gz) for large files
- Set appropriate file search patterns to limit unnecessary file scanning
- Choose unique key columns that have good cardinality
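To illustrate the compression tip, a gzipped CSV can be produced and read back with Python’s standard library alone. This sketch uses an in-memory buffer instead of a real file:

```python
import csv
import gzip
import io

buf = io.BytesIO()

# Write a small CSV compressed with gzip into the buffer
with gzip.open(buf, "wt", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "amount"])
    writer.writerow(["1", "19.99"])

# Read it back; gzip decompresses transparently
buf.seek(0)
with gzip.open(buf, "rt", newline="") as f:
    rows = list(csv.reader(f))

print(rows)  # [['id', 'amount'], ['1', '19.99']]
```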
Troubleshooting
File Not Found
- Verify the bucket name is correct
- Check the file search pattern
- Ensure the S3 folder path is correct
- Confirm file permissions
Parsing Errors
- Verify the correct delimiter is specified
- Check for special characters or quotes in the CSV
- Ensure consistent column counts across rows
- Look for hidden characters (BOM, etc.)
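When debugging parsing errors locally, Python’s standard library can check for several of the issues above at once: a leading UTF-8 BOM, the delimiter in use, and consistent column counts. The sample bytes here are made up:

```python
import codecs
import csv
import io

# Sample file contents: a UTF-8 BOM followed by semicolon-delimited rows
raw = codecs.BOM_UTF8 + b"id;name\n1;Alice\n"

# A BOM at the start of the file is a common hidden-character culprit
has_bom = raw.startswith(codecs.BOM_UTF8)

# Decoding with utf-8-sig strips the BOM if present
text = raw.decode("utf-8-sig")

# Parse with the delimiter the file actually uses (here, a semicolon)
rows = list(csv.reader(io.StringIO(text), delimiter=";"))

# Verify that every row has the same number of columns
consistent = len({len(r) for r in rows}) == 1
print(has_bom, consistent, rows[0])
```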
Authentication Issues
- Verify AWS credentials are correct
- Check bucket permissions
- Ensure the credentials have not expired
- Confirm the bucket region
Performance Issues
- Review file search patterns for efficiency
- Check file sizes and consider splitting large files
- Monitor AWS request quotas
- Consider using compression for large files