Common data transformations

When working with raw data from APIs, databases, and file uploads, you’ll often need to reshape, clean, or enrich it before it’s ready for analysis. This section provides step-by-step recipes for the most common transformations you’ll encounter. Each recipe includes the same transformation implemented in Athena SQL, BigQuery, and Python notebooks (using the Nekt SDK with PySpark), so you can pick the approach that fits your stack.

Schema and column operations

Rename columns

Standardize column names across sources — apply snake_case, remove prefixes, or align naming conventions.

Cast or convert data types

Convert strings to dates, numbers, or booleans and handle format differences across sources.

Handle NULL values

Replace NULLs with defaults using COALESCE and clean up missing data for downstream consumers.

Structured data handling

Unnest arrays

Explode array columns into individual rows — essential for tags, labels, and line items from APIs.

Parse JSON fields

Extract values from JSON string columns into proper, typed columns ready for analysis.

Flatten nested structures

Normalize deeply nested objects into flat, query-friendly rows.

Expand custom fields

Pivot a key-value custom fields table into columns on the main entity — common for CRMs like Pipedrive and HubSpot.

Date parsing and calculations

Extract date parts, calculate differences, and format dates consistently across SQL engines.

Data quality

Deduplicate rows

Remove exact or near-duplicate records using ROW_NUMBER, DISTINCT, and other dedup strategies.

Filter invalid rows

Remove or flag rows that fail validation rules before they reach your trusted layer.

Combining and reshaping

Join tables

Combine related tables with LEFT, INNER, and FULL joins using practical, real-world examples.

Pivot / unpivot data

Reshape rows into columns (and vice versa) to prepare data for reporting and analysis.

Historical KPIs Rename columns

⌘I

​Schema and column operations