Creating PySpark Transformations
Nekt makes it simple to create, test, and deploy transformations using Jupyter Notebooks. Whether you're new to data engineering or an experienced user, our templates provide an easy starting point for building transformations.
Step 1: Generate a Token
To allow Jupyter Notebooks to access your data:
- Navigate to the Add Transformation page.
- Select the data tables you want to include in your transformation.
- Generate a token with appropriate access permissions.
- You can create multiple tokens, each with specific access levels for better security.
Step 2: Develop and Test Your Transformation
Testing your transformation locally lets you validate its logic before running it on Nekt. You can do this by running your transformation in a Jupyter notebook environment. Nekt provides several environment options for running Jupyter Notebooks; choose the one that best fits your development preferences:
- Google Colab Notebook: Run a Jupyter notebook in the cloud with minimal setup. Requires a Google account.
- GitHub Codespaces: Use a cloud-hosted Jupyter notebook powered by GitHub for easy setup. Requires a GitHub account.
- Local Dev Container: Set up a Jupyter notebook on your local machine using an isolated environment with all dependencies pre-installed.
- Local Jupyter Notebook: Manually configure a Jupyter notebook on your local machine. Best for users with advanced knowledge of Python environments.
- Copy your token and paste it into the template as instructed.
- Use the Nekt SDK to load the data you need for your transformation.
  - The `layer_name` is the layer of the table you want to load.
  - The `table_name` is the name of the table you want to load.
- Develop and test your transformation code within the notebook environment.
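Putting the loading steps above together, here is a minimal sketch of what the notebook code looks like. The loader name `nekt.load_table`, the token attribute, and the layer/table names are assumptions for illustration; follow the exact calls shown in your Nekt notebook template.

```python
import nekt  # hypothetical import of the Nekt SDK

# Hypothetical: paste the token generated in Step 1 as instructed by the template.
nekt.access_token = "<your-token>"

# layer_name: the layer of the table you want to load
# table_name: the name of the table you want to load
# ("raw" and "orders" are example names, not part of your workspace)
df = nekt.load_table(layer_name="raw", table_name="orders")

# The loaded data is a Spark DataFrame (assumed), so you can inspect it directly:
df.show(5)
```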
Step 3: Add Your Transformation to Nekt
Once your transformation is validated in your local notebook, you can put it into production by creating a transformation on Nekt. This allows you to run it on a schedule, based on other pipelines' events, or on demand. It also lets you save the transformed dataframes in your Lakehouse as new tables for further processing and activation. Here's the step-by-step guide for creating a transformation on Nekt:
- Navigate back to the Add Transformation page in Nekt.
- Copy your transformation code into Nekt.
- Use the Nekt SDK `.save_table` method to save the transformed dataframes as new tables in your Lakehouse. You can call it as many times as you need to save multiple dataframes in the same transformation.
  - The `layer_name` is the layer of the table you want to save.
  - The `table_name` is the name of the table you want to save.
  - The `folder_name` is the name of the folder to save the table in. It is optional; if not provided, the table is saved in the root of the layer.
- If you need external dependencies, add them in the Define dependencies section.
- Save the transformation to make it available for execution.
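The save step above can be sketched as follows. The dataframe names are placeholders, and `.save_table` is written here as a module-level call for illustration; mirror the exact invocation from your Nekt template.

```python
import nekt  # hypothetical import of the Nekt SDK

# Save the main result in the root of the "trusted" layer
# (layer and table names here are examples, not fixed by Nekt).
nekt.save_table(df=orders_clean, layer_name="trusted", table_name="orders_clean")

# save_table can be called multiple times in one transformation.
# folder_name is optional; without it, the table lands in the root of the layer.
nekt.save_table(
    df=orders_by_day,
    layer_name="trusted",
    table_name="orders_by_day",
    folder_name="aggregations",
)
```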
Best Practices
- Development:
  - Use Notebook Templates to reduce setup time
  - Start with simple transformations and scale complexity
  - Leverage PySpark's built-in functions when possible
- Testing:
  - Validate transformation logic in your Jupyter notebook before deployment
  - Test with sample data of various sizes
  - Verify edge cases and error handling
- Performance:
  - Monitor resource usage and execution times
  - Optimize Spark configurations for your workload
- Maintenance:
  - Document complex logic and dependencies
  - Keep dependencies up to date
  - Use version control for your notebooks
Need Help?
If you encounter challenges during the transformation process, our team is ready to assist. You can:
- Check our PySpark documentation for reference
- Use our example notebooks as templates
- Contact our support team for additional guidance