PySpark transformations in Nekt combine the distributed computing power of Apache Spark with the flexibility of Python to help you build sophisticated data pipelines. Whether you’re implementing complex business logic, handling large-scale data processing, or creating custom transformations, our Jupyter notebook integration makes it easy to develop and deploy your code.

Creating PySpark Transformations

Nekt makes it simple to create, test, and deploy transformations using Jupyter Notebooks. Whether you’re new to data engineering or an experienced user, our templates provide an easy starting point for building transformations.

Step 1: Generate a Token

To allow Jupyter Notebooks to access your data:
  1. Navigate to the Add Transformation page.
  2. Select the data tables you want to include in your transformation.
  3. Generate a token with appropriate access permissions.
    • You can create multiple tokens, each with specific access levels for better security.

Step 2: Develop and Test Your Transformation

Testing your transformation locally lets you validate its logic before running it on Nekt. You can do this by running it in a Jupyter notebook environment. Nekt provides several environment options for running Jupyter Notebooks; choose the one that best fits your development preferences:
  • Google Colab Notebook: Run a Jupyter notebook in the cloud with minimal setup. Requires a Google account.
  • GitHub Codespaces: Use a cloud-hosted Jupyter notebook powered by GitHub for easy setup. Requires a GitHub account.
  • Local Dev Container: Set up a Jupyter notebook on your local machine using an isolated environment with all dependencies pre-installed.
  • Local Jupyter Notebook: Manually configure a Jupyter notebook on your local machine. Best for users with advanced knowledge of Python environments.
How to Start:
  1. Copy your token and paste it into the template as instructed.
  2. Use the Nekt SDK to load the data you need for your transformation.
    • The layer_name is the layer of the table you want to load.
    • The table_name is the name of the table you want to load.
      import nekt
      
      df = nekt.load_table(
         layer_name="layer_name", 
         table_name="table_name"
      )
      
  3. Develop and test your transformation code within the notebook environment.

Step 3: Add Your Transformation to Nekt

Once your transformation is validated in your notebook environment, you can put it into production by creating a transformation in Nekt. This allows you to run it on a schedule, in response to other pipelines’ events, or on demand. It also allows you to save the transformed DataFrames in your Lakehouse as new tables for further processing and activation. Here’s the step-by-step guide for creating a transformation in Nekt:
  1. Navigate back to the Add Transformation page in Nekt.
  2. Copy your transformation code into Nekt.
  3. Use the Nekt SDK .save_table method to save the transformed DataFrames as new tables in your Lakehouse. You can call it as many times as you need to save multiple DataFrames in the same transformation.
    • The layer_name is the layer where the table will be saved.
    • The table_name is the name of the table you want to save.
    • The folder_name is the folder to save the table in. It is optional; if not provided, the table is saved at the root of the layer.
      import nekt
      
      nekt.save_table(
         df=df,
         layer_name="layer_name", 
         table_name="table_name", 
         folder_name="folder_name"
      ) 
      
  4. If you need external dependencies, add them in the Define dependencies section.
  5. Save the transformation to make it available for execution.

Best Practices

  1. Development:
    • Use Notebook Templates to reduce setup time
    • Start with simple transformations and scale complexity
    • Leverage PySpark’s built-in functions when possible
  2. Testing:
    • Validate transformation logic in your Jupyter notebook before deployment
    • Test with sample data of various sizes
    • Verify edge cases and error handling
  3. Performance:
    • Monitor resource usage and execution times
    • Optimize Spark configurations for your workload
  4. Maintenance:
    • Document complex logic and dependencies
    • Keep dependencies up to date
    • Use version control for your notebooks

Need Help?

If you encounter challenges during the transformation process, our team is ready to assist. You can:
  • Check our PySpark documentation for reference
  • Use our example notebooks as templates
  • Contact our support team for additional guidance