Filtered Full Load

This scenario is very similar to the full load, but instead of loading the complete dataset it filters the data coming from the source. As in the other cases, the ACON configuration should be executed with `load_data`:

```python
from lakehouse_engine.engine import load_data
acon = {...}
load_data(acon=acon)
```
Example of ACON configuration:
```json
{
  "input_specs": [
    {
      "spec_id": "sales_source",
      "read_type": "batch",
      "data_format": "csv",
      "options": {
        "header": true,
        "delimiter": "|",
        "inferSchema": true
      },
      "location": "file:///app/tests/lakehouse/in/feature/full_load/with_filter/data"
    }
  ],
  "transform_specs": [
    {
      "spec_id": "filtered_sales",
      "input_id": "sales_source",
      "transformers": [
        {
          "function": "expression_filter",
          "args": {
            "exp": "date like '2016%'"
          }
        }
      ]
    }
  ],
  "output_specs": [
    {
      "spec_id": "sales_bronze",
      "input_id": "filtered_sales",
      "write_type": "overwrite",
      "data_format": "parquet",
      "location": "file:///app/tests/lakehouse/out/feature/full_load/with_filter/data"
    }
  ]
}
```

Relevant notes:
  • As seen in the ACON, the filtering capability is provided by a transformer called expression_filter, which accepts a custom Spark SQL filter expression.
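
To make the semantics of the `exp` argument concrete: `date like '2016%'` is a Spark SQL `LIKE` predicate where the trailing `%` matches any suffix, so it keeps rows whose `date` column starts with `2016`. The sketch below emulates that prefix-matching behavior in plain Python on hypothetical sample rows (no Spark dependency); the row data and helper name are illustrative, not part of the engine's API.

```python
# Hypothetical sample rows, standing in for the CSV source data.
rows = [
    {"date": "2016-01-15", "amount": 10},
    {"date": "2017-03-02", "amount": 20},
    {"date": "2016-11-30", "amount": 30},
]

def like_prefix(value: str, prefix: str) -> bool:
    """Emulate the SQL predicate `value LIKE 'prefix%'` (prefix match)."""
    return value.startswith(prefix)

# Equivalent of the ACON filter: "date like '2016%'".
filtered = [row for row in rows if like_prefix(row["date"], "2016")]
```

In Spark itself this corresponds to `df.filter("date like '2016%'")`; the transformer simply applies the expression you supply to the input DataFrame.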