Full Load
This scenario reads CSV data from one path and writes it in full, overwriting the target, to another path as Delta Lake files.
Relevant notes
- This ACON infers the schema automatically through the inferSchema option (we use it for local tests only). Inferring the schema is usually not a best practice with CSV files; you should instead provide a schema through one of the InputSpec variables schema_path, read_schema_from_table or schema.
- The transform_specs in this case are purely optional. We use the repartition transformer to create one partition per combination of date and customer, but this does not mean you have to use it in your algorithm.
- A full load is also adequate for an init load (initial load).
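The schema note above can be sketched as follows. This is a minimal example, assuming the InputSpec schema variable accepts a dictionary in Spark's StructType JSON layout. The column types and the amount column are hypothetical, since this scenario only names date and customer, and struct_field is a helper written for this example, not part of lakehouse_engine.

```python
import json


def struct_field(name: str, data_type: str, nullable: bool = True) -> dict:
    """Build one field entry in Spark's StructType JSON layout."""
    return {"name": name, "type": data_type, "nullable": nullable, "metadata": {}}


# Hypothetical schema: only `date` and `customer` appear in this scenario;
# the types and the `amount` column are assumptions for illustration.
sales_schema = {
    "type": "struct",
    "fields": [
        struct_field("date", "integer"),
        struct_field("customer", "string"),
        struct_field("amount", "double"),
    ],
}

# Input spec that passes an explicit schema instead of setting "inferSchema".
input_spec = {
    "spec_id": "sales_source",
    "read_type": "batch",
    "data_format": "csv",
    "options": {"header": True, "delimiter": "|"},
    "schema": sales_schema,  # assumption: accepted as a StructType-style dict
    "location": "file:///app/tests/lakehouse/in/feature/full_load/full_overwrite/data",
}

print(json.dumps(input_spec["schema"], indent=2))
```

With an explicit schema the column types no longer depend on what Spark infers from the sampled CSV rows.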
As in the other cases, the ACON configuration should be executed with load_data, as follows:
from lakehouse_engine.engine import load_data
acon = {...}
load_data(acon=acon)
Example of ACON configuration:
{
  "input_specs": [
    {
      "spec_id": "sales_source",
      "read_type": "batch",
      "data_format": "csv",
      "options": {
        "header": true,
        "delimiter": "|",
        "inferSchema": true
      },
      "location": "file:///app/tests/lakehouse/in/feature/full_load/full_overwrite/data"
    }
  ],
  "transform_specs": [
    {
      "spec_id": "repartitioned_sales",
      "input_id": "sales_source",
      "transformers": [
        {
          "function": "repartition",
          "args": {
            "num_partitions": 1,
            "cols": ["date", "customer"]
          }
        }
      ]
    }
  ],
  "output_specs": [
    {
      "spec_id": "sales_bronze",
      "input_id": "repartitioned_sales",
      "write_type": "overwrite",
      "data_format": "delta",
      "partitions": [
        "date",
        "customer"
      ],
      "location": "file:///app/tests/lakehouse/out/feature/full_load/full_overwrite/data"
    }
  ]
}
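Because the configuration above uses JSON literals such as true, one way to use it from Python is to parse it with json.loads first. The sketch below does that for a trimmed-down ACON and checks that every input_id points at a spec_id defined somewhere in the ACON. find_dangling_inputs is a hypothetical helper written for this example, not part of lakehouse_engine.

```python
import json

# Trimmed-down ACON in JSON form (locations and options omitted for brevity).
ACON_JSON = """
{
  "input_specs": [
    {"spec_id": "sales_source", "read_type": "batch", "data_format": "csv"}
  ],
  "transform_specs": [
    {
      "spec_id": "repartitioned_sales",
      "input_id": "sales_source",
      "transformers": [
        {"function": "repartition",
         "args": {"num_partitions": 1, "cols": ["date", "customer"]}}
      ]
    }
  ],
  "output_specs": [
    {"spec_id": "sales_bronze", "input_id": "repartitioned_sales",
     "write_type": "overwrite", "data_format": "delta"}
  ]
}
"""


def find_dangling_inputs(acon: dict) -> list:
    """Return (spec_id, input_id) pairs whose input_id matches no spec_id."""
    sections = ("input_specs", "transform_specs", "output_specs")
    specs = [spec for section in sections for spec in acon.get(section, [])]
    known_ids = {spec["spec_id"] for spec in specs}
    return [
        (spec["spec_id"], spec["input_id"])
        for spec in specs
        if "input_id" in spec and spec["input_id"] not in known_ids
    ]


acon = json.loads(ACON_JSON)  # JSON true/false become Python True/False
# An empty list means every input_id resolves to a defined spec_id.
print(find_dangling_inputs(acon))
```

The parsed dictionary can then be passed to load_data(acon=acon).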