Definitions
Definitions of standard values and structures for core components.
CollectEngineUsage
Bases: Enum
Options for collecting engine usage stats.
- enabled, enables the collection and storage of Lakehouse Engine usage statistics for any environment.
- prod_only, enables the collection and storage of Lakehouse Engine usage statistics for production environment only.
- disabled, disables the collection and storage of Lakehouse Engine usage statistics, for all environments.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
DQDefaults
Bases: Enum
Defaults used on the data quality process.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
DQExecutionPoint
DQFunctionSpec
dataclass
Bases: object
Defines a data quality function specification.
- function - name of the data quality function (expectation) to execute. It follows the great_expectations api https://greatexpectations.io/expectations/.
- args - args of the function (expectation). Follow the same api as above.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
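For illustration, a minimal sketch of two function specifications, assuming the dataclass is importable from lakehouse_engine.core.definitions and accepts the documented fields as keyword arguments; the column names are placeholders.

```python
# Hypothetical import path; the module name may differ in your installation.
from lakehouse_engine.core.definitions import DQFunctionSpec

# Expectation names follow the great_expectations API referenced above.
not_null_check = DQFunctionSpec(
    function="expect_column_values_to_not_be_null",
    args={"column": "customer_id"},
)
range_check = DQFunctionSpec(
    function="expect_column_values_to_be_between",
    args={"column": "order_value", "min_value": 0, "max_value": 100000},
)
```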
DQSpec
dataclass
Bases: object
Data quality overall specification.
- spec_id - id of the specification.
- input_id - id of the input specification.
- dq_type - type of DQ process to execute (e.g. validator).
- dq_functions - list of function specifications to execute.
- dq_db_table - name of table to derive the dq functions from.
- dq_table_table_filter - name of the table whose rules are to be applied in the validations (only used when deriving dq functions).
- dq_table_extra_filters - extra filters to be used when deriving dq functions. This is a sql expression to be applied to the dq_db_table.
- execution_point - execution point of the dq functions. [at_rest, in_motion]. This is set during the load_data or dq_validator functions.
- unexpected_rows_pk - the list of columns composing the primary key of the source data to identify the rows failing the DQ validations. Note: only one of tbl_to_derive_pk or unexpected_rows_pk arguments need to be provided. It is mandatory to provide one of these arguments when using tag_source_data as True. When tag_source_data is False, this is not mandatory, but still recommended.
- tbl_to_derive_pk - db.table to automatically derive the unexpected_rows_pk from. Note: only one of tbl_to_derive_pk or unexpected_rows_pk arguments need to be provided. It is mandatory to provide one of these arguments when using tag_source_data as True. When tag_source_data is False, this is not mandatory, but still recommended.
- sort_processed_keys - when using the prisma dq_type, a column processed_keys is automatically added to give observability over the PK values that were processed during a run. This parameter (sort_processed_keys) controls whether the processed keys column value should be sorted or not. Default: False.
- gx_result_format - great expectations result format. Default: "COMPLETE".
- tag_source_data - when set to true, this will ensure that the DQ process ends by tagging the source data with an additional column with information about the DQ results. This column makes it possible to identify if the DQ run succeeded in general and, if not, it unlocks the insights to know what specific rows have made the DQ validations fail and why. Default: False. Note: it only works if result_sink_explode is True, gx_result_format is COMPLETE, fail_on_error is False (which is done automatically when you specify tag_source_data as True) and tbl_to_derive_pk or unexpected_rows_pk is configured.
- store_backend - which store_backend to use (e.g. s3 or file_system).
- local_fs_root_dir - path of the root directory. Note: only applicable for store_backend file_system.
- data_docs_local_fs - the path for data docs only for store_backend file_system.
- bucket - the bucket name to consider for the store_backend (store DQ artefacts). Note: only applicable for store_backend s3.
- data_docs_bucket - the bucket name for data docs only. When defined, it will supersede bucket parameter. Note: only applicable for store_backend s3.
- expectations_store_prefix - prefix where to store expectations' data. Note: only applicable for store_backend s3.
- validations_store_prefix - prefix where to store validations' data. Note: only applicable for store_backend s3.
- data_docs_prefix - prefix where to store data_docs' data.
- checkpoint_store_prefix - prefix where to store checkpoints' data. Note: only applicable for store_backend s3.
- data_asset_name - name of the data asset to consider when configuring the great expectations' data source.
- expectation_suite_name - name to consider for great expectations' suite.
- result_sink_db_table - db.table_name indicating the database and table in which to save the results of the DQ process.
- result_sink_location - file system location in which to save the results of the DQ process.
- data_product_name - name of the data product.
- result_sink_partitions - the list of partitions to consider.
- result_sink_format - format of the result table (e.g. delta, parquet, kafka...).
- result_sink_options - extra spark options for configuring the result sink. E.g: can be used to configure a Kafka sink if result_sink_format is kafka.
- result_sink_explode - flag to determine if the output table/location should have the columns exploded (as True) or not (as False). Default: True.
- result_sink_extra_columns - list of extra columns to be exploded (following the pattern "<name>.*") or columns to be selected. It is only used when result_sink_explode is set to True.
- source - name of the data source, to make it easier to identify in analysis. If not specified, it is set as "default". This will only be used when result_sink_explode is set to True.
- fail_on_error - whether to fail the algorithm if the validations of your data in the DQ process failed.
- cache_df - whether to cache the dataframe before running the DQ process or not.
- critical_functions - functions that should not fail. When this argument is defined, fail_on_error is nullified.
- max_percentage_failure - percentage of failure that should be allowed. This argument has priority over both fail_on_error and critical_functions.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
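A hedged sketch of how a dq_specs entry could look inside an ACON, using only the fields documented above; all ids, bucket and table names are placeholders, and the exact set of required fields depends on your setup.

```python
# Illustrative dq_specs entry; ids, bucket and table names are placeholders.
dq_spec = {
    "spec_id": "orders_dq",
    "input_id": "orders_input",
    "dq_type": "validator",
    "store_backend": "s3",
    "bucket": "my-dq-artifacts-bucket",
    "result_sink_db_table": "analytics.dq_results",
    "unexpected_rows_pk": ["order_id"],
    "tag_source_data": True,  # requires a PK (above) and the COMPLETE result format
    "dq_functions": [
        {"function": "expect_column_values_to_not_be_null",
         "args": {"column": "order_id"}},
        {"function": "expect_table_row_count_to_be_between",
         "args": {"min_value": 1}},
    ],
}
```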
DQTableBaseParameters
Bases: Enum
Base parameters for importing DQ rules from a table.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
DQType
DQValidatorSpec
dataclass
Bases: object
Data Quality Validator Specification.
- input_spec: input specification of the data to be checked/validated.
- dq_spec: data quality specification.
- restore_prev_version: specify if, having delta table/files as input, they should be restored to the previous version if the data quality process fails. Note: this is only considered if fail_on_error is kept as True.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
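A minimal sketch of a matching ACON for a DQ validator run, assuming the usual input_spec/dq_spec nesting; formats, locations, ids and rules are placeholders.

```python
# Illustrative DQ validator ACON; locations, ids and rules are placeholders.
dq_validator_acon = {
    "input_spec": {
        "spec_id": "sales_input",
        "read_type": "batch",
        "data_format": "delta",
        "location": "s3://my-bucket/silver/sales",
    },
    "dq_spec": {
        "spec_id": "sales_dq",
        "input_id": "sales_input",
        "dq_type": "validator",
        "dq_functions": [
            {"function": "expect_column_values_to_not_be_null",
             "args": {"column": "sale_id"}},
        ],
    },
    "restore_prev_version": False,
}
```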
EngineConfig
dataclass
Bases: object
Definitions that can come from the Engine Config file.
- dq_bucket: S3 prod bucket used to store data quality related artifacts.
- dq_dev_bucket: S3 dev bucket used to store data quality related artifacts.
- notif_disallowed_email_servers: email servers not allowed to be used for sending notifications.
- engine_usage_path: path where the engine prod usage stats are stored.
- engine_dev_usage_path: path where the engine dev usage stats are stored.
- collect_engine_usage: whether to enable the collection of lakehouse engine usage stats or not.
- dq_functions_column_list: list of columns to be added to the meta argument of GX when using PRISMA.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
EngineStats
Bases: Enum
Definitions for collection of Lakehouse Engine Stats.
Note: whenever the value comes from a key inside a Spark Config that returns an array, it can be specified with a '#' so that it is adequately processed.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
FileManagerAPIKeys
Bases: Enum
File Manager S3 API keys.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
GABCadence
Bases: Enum
Representation of the supported cadences on GAB.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
get_cadences()
classmethod
Get the cadences values as set.
Returns:
Type | Description
---|---
set[str] | set containing all possible cadence values.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
get_ordered_cadences()
classmethod
Get the cadences ordered by the value.
Returns:
Type | Description
---|---
dict | dict containing the ordered cadences.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
order_cadences(cadences_to_order)
classmethod
Order a list of cadences by value.
Returns:
Type | Description
---|---
list[str] | ordered set containing the received cadences.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
GABCombinedConfiguration
Bases: Enum
GAB combined configuration.
Based on the use case configuration, return the values to override in the SQL file.
This enum aims to exhaustively map each combination of cadence, reconciliation, week_start and snap_flag, returning the corresponding join_select, project_start and project_end values to replace in the stages SQL file.
Return the corresponding configuration (join_select, project_start, project_end) for each combination (cadence x recon x week_start x snap_flag).
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
GABDefaults
Bases: Enum
Defaults used on the GAB process.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
GABKeys
Constants used to update pre-configured gab dict key.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
GABReplaceableKeys
Constants used to replace pre-configured gab dict values.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
GABSpec
dataclass
Bases: object
Gab Specification.
- query_label_filter: query use-case label to execute.
- queue_filter: queue to execute the job.
- cadence_filter: selected cadences to build the asset.
- target_database: target database to write.
- curr_date: current date.
- start_date: period start date.
- end_date: period end date.
- rerun_flag: rerun flag.
- target_table: target table to write.
- source_database: source database.
- gab_base_path: base path to read the use cases.
- lookup_table: gab configuration table.
- calendar_table: gab calendar table.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
create_from_acon(acon)
classmethod
Create GabSpec from acon.
Parameters:
Name | Type | Description | Default
---|---|---|---
acon | dict | gab ACON. | required
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
GABStartOfWeek
Bases: Enum
Representation of start of week values on GAB.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
get_start_of_week()
classmethod
Get the start of week enum as a dict.
Returns:
Type | Description
---|---
dict | dict containing all enum entries.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
get_values()
classmethod
Get the start of week enum values as set.
Returns:
Type | Description
---|---
set[str] | set containing all possible values.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
InputFormat
Bases: Enum
Formats of algorithm input.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
exists(input_format)
classmethod
Checks if the input format exists in the enum values.
Parameters:
Name | Type | Description | Default
---|---|---|---
input_format | str | format to check if exists. | required
Returns:
Type | Description
---|---
bool | If the input format exists in our enum.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
values()
classmethod
Generates a list containing all enum values.
Returns:
A list with all enum values.
InputSpec
dataclass
Bases: object
Specification of an algorithm input.
This is very aligned with the way the execution environment connects to the sources (e.g., spark sources).
- spec_id: spec_id of the input specification.
- read_type: ReadType type of read operation.
- data_format: format of the input.
- sftp_files_format: format of the files (csv, fwf, json, xml...) in a sftp directory.
- df_name: dataframe name.
- db_table: table name in the form of <db>.<table>.
- location: uri that identifies from where to read data in the specified format.
- enforce_schema_from_table: if we want to enforce the table schema or not, by providing a table name in the form of <db>.<table>.
- query: sql query to execute and return the dataframe. Use it if you do not want to read from a file system nor from a table, but rather from a sql query instead.
- schema: dict representation of a schema of the input (e.g., Spark struct type schema).
- schema_path: path to a file with a representation of a schema of the input (e.g., Spark struct type schema).
- disable_dbfs_retry: optional flag to disable file storage dbfs.
- with_filepath: if we want to include the path of the file that is being read. Only works with the file reader (batch and streaming modes are supported).
- options: dict with other relevant options according to the execution environment (e.g., spark) possible sources.
- calculate_upper_bound: whether to calculate the upper bound used to extract from SAP BW or not.
- calc_upper_bound_schema: specific schema for the calculated upper_bound.
- generate_predicates: whether to generate predicates used to extract from SAP BW or not.
- predicates_add_null: if we want to include an 'is null' clause in the partition-by predicates.
- temp_view: optional name of a view to point to the input dataframe to be used to create or replace a temp view on top of the dataframe.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
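Two hedged input_specs sketches built from the fields above: a batch file read and a streaming table read. The paths, table names, options and the lowercase read_type strings are illustrative assumptions.

```python
# Illustrative input_specs entries; paths, tables and options are placeholders.
input_specs = [
    {
        "spec_id": "sales_raw",
        "read_type": "batch",
        "data_format": "csv",
        "location": "s3://my-bucket/landing/sales/",
        "schema_path": "s3://my-bucket/schemas/sales.json",
        "with_filepath": True,
        "options": {"header": True, "delimiter": ";"},
    },
    {
        "spec_id": "customers",
        "read_type": "streaming",
        "data_format": "delta",
        "db_table": "bronze.customers",
    },
]
```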
MergeOptions
dataclass
Bases: object
Options for a merge operation.
- merge_predicate: predicate to apply to the merge operation so that we can check if a new record corresponds to a record already included in the historical data.
- insert_only: indicates if the merge should only insert data (e.g., deduplicate scenarios).
- delete_predicate: predicate to apply to the delete operation.
- update_predicate: predicate to apply to the update operation.
- insert_predicate: predicate to apply to the insert operation.
- update_column_set: rules to apply to the update operation which allows to set the value for each column to be updated. (e.g. {"data": "new.data", "count": "current.count + 1"} )
- insert_column_set: rules to apply to the insert operation which allows to set the value for each column to be inserted. (e.g. {"date": "updates.date", "count": "1"} )
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
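A hedged sketch of merge options for an upsert, mirroring the "current"/"new" alias style used in the examples above; the aliases, predicates and column names are placeholders and may differ from what your merge implementation expects.

```python
# Illustrative merge options for an upsert; aliases and columns are placeholders.
merge_opts = {
    "merge_predicate": "current.order_id = new.order_id",
    "insert_only": False,
    "update_predicate": "new.updated_at > current.updated_at",
    "update_column_set": {"data": "new.data", "count": "current.count + 1"},
    "insert_column_set": {"date": "new.date", "count": "1"},
}
```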
NotificationRuntimeParameters
Bases: Enum
Parameters to be replaced in runtime.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
NotifierType
OutputFormat
Bases: Enum
Formats of algorithm output.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
exists(output_format)
classmethod
Checks if the output format exists in the enum values.
Parameters:
Name | Type | Description | Default
---|---|---|---
output_format | str | format to check if exists. | required
Returns:
Type | Description
---|---
bool | If the output format exists in our enum.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
values()
classmethod
Generates a list containing all enum values.
Returns:
A list with all enum values.
OutputSpec
dataclass
Bases: object
Specification of an algorithm output.
This is very aligned with the way the execution environment connects to the output systems (e.g., spark outputs).
- spec_id: id of the output specification.
- input_id: id of the corresponding input specification.
- write_type: type of write operation.
- data_format: format of the output. Defaults to DELTA.
- db_table: table name in the form of <db>.<table>.
- location: uri that identifies where to write data in the specified format.
- sharepoint_opts: options to apply on writing on sharepoint operations.
- partitions: list of partition input_col names.
- merge_opts: options to apply to the merge operation.
- streaming_micro_batch_transformers: transformers to invoke for each streaming micro batch, before writing (i.e., in Spark's foreachBatch structured streaming function). Note: the lakehouse engine manages this for you, so we don't advise you to manually specify transformations through this parameter; supply them as regular transformers in the transform_specs sections of an ACON.
- streaming_once: if the streaming query is to be executed just once, or not, generating just one micro batch.
- streaming_processing_time: if streaming query is to be kept alive, this indicates the processing time of each micro batch.
- streaming_available_now: if set to True, set a trigger that processes all available data in multiple batches then terminates the query. When using streaming, this is the default trigger that the lakehouse-engine will use, unless you configure a different one.
- streaming_continuous: set a trigger that runs a continuous query with a given checkpoint interval.
- streaming_await_termination: whether to wait (True) for the termination of the streaming query (e.g. timeout or exception) or not (False). Default: True.
- streaming_await_termination_timeout: a timeout to set to the streaming_await_termination. Default: None.
- with_batch_id: whether to include the streaming batch id in the final data, or not. It only takes effect in streaming mode.
- options: dict with other relevant options according to the execution environment (e.g., spark) possible outputs. E.g.,: JDBC options, checkpoint location for streaming, etc.
- streaming_micro_batch_dq_processors: similar to streaming_micro_batch_transformers but for the DQ functions to be executed. Used internally by the lakehouse engine, so you don't have to supply DQ functions through this parameter. Use the dq_specs of the acon instead.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
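A hedged output_specs sketch that appends the transformed data to a Delta table; the ids, write_type value, table name and checkpoint location are illustrative assumptions.

```python
# Illustrative output_specs entry; ids, table and checkpoint paths are placeholders.
output_specs = [
    {
        "spec_id": "sales_gold",
        "input_id": "sales_transformed",
        "write_type": "append",
        "data_format": "delta",
        "db_table": "gold.sales",
        "partitions": ["order_date"],
        "options": {"checkpointLocation": "s3://my-bucket/checkpoints/sales"},
    }
]
```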
ReadMode
Bases: Enum
Different modes that control how we handle compliance to the provided schema.
These read modes map to Spark's read modes at the moment.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
ReadType
Bases: Enum
Define the types of read operations.
- BATCH - read the data in batch mode (e.g., Spark batch).
- STREAMING - read the data in streaming mode (e.g., Spark streaming).
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
ReconciliatorSpec
dataclass
Bases: object
Reconciliator Specification.
- metrics: list of metrics in the form of: [{ metric: name of the column present in both truth and current datasets, aggregation: sum, avg, max, min, ..., type: percentage or absolute, yellow: value, red: value }].
- recon_type: reconciliation type (percentage or absolute). Percentage calculates the difference between truth and current results as a percentage ((x - y) / x), and absolute calculates the raw difference (x - y).
- truth_input_spec: input specification of the truth data.
- current_input_spec: input specification of the current results data.
- truth_preprocess_query: additional query on top of the truth input data to preprocess the truth data before it gets fueled into the reconciliation process. Important note: you need to assume that the data out of the truth_input_spec is referencable by a table called 'truth'.
- truth_preprocess_query_args: optional dict having the functions/transformations to apply on top of the truth_preprocess_query and respective arguments. Note: cache is being applied on the Dataframe, by default. For turning the default behavior off, pass "truth_preprocess_query_args": [].
- current_preprocess_query: additional query on top of the current results input data to preprocess the current results data before it gets fueled into the reconciliation process. Important note: you need to assume that the data out of the current_results_input_spec is referencable by a table called 'current'.
- current_preprocess_query_args: optional dict having the functions/transformations to apply on top of the current_preprocess_query and respective arguments. Note: cache is being applied on the Dataframe, by default. For turning the default behavior off, pass "current_preprocess_query_args": [].
- ignore_empty_df: optional boolean to ignore the recon process if the source and target dataframes are empty; the recon will exit with success code (passed).
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
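A hedged sketch of a reconciliator configuration using the fields above; the metric names, thresholds, table names and input specs are placeholders.

```python
# Illustrative reconciliator configuration; metrics and specs are placeholders.
reconciliator_acon = {
    "metrics": [
        {"metric": "net_sales", "aggregation": "sum",
         "type": "percentage", "yellow": 0.05, "red": 0.10},
    ],
    "recon_type": "percentage",
    "truth_input_spec": {
        "spec_id": "truth",
        "read_type": "batch",
        "data_format": "delta",
        "db_table": "gold.sales_truth",
    },
    "current_input_spec": {
        "spec_id": "current",
        "read_type": "batch",
        "data_format": "delta",
        "db_table": "gold.sales",
    },
    "truth_preprocess_query": "SELECT order_date, net_sales FROM truth",
    "ignore_empty_df": False,
}
```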
RestoreStatus
RestoreType
Bases: Enum
Archive types.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
exists(restore_type)
classmethod
Checks if the restore type exists in the enum values.
Parameters:
Name | Type | Description | Default
---|---|---|---
restore_type | str | restore type to check if exists. | required
Returns:
Type | Description
---|---
bool | If the restore type exists in our enum.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
values()
classmethod
Generates a list containing all enum values.
Returns:
A list with all enum values.
SAPLogchain
Bases: Enum
Defaults used on consuming data from SAP Logchain.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
SQLDefinitions
Bases: Enum
SQL definitions statements.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
SQLParser
Bases: Enum
Defaults to use for parsing.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
SensorSpec
dataclass
Bases: object
Sensor Specification.
- sensor_id: sensor id.
- assets: a list of assets that are considered as available to consume downstream after this sensor has status PROCESSED_NEW_DATA.
- control_db_table_name: db.table to store sensor metadata.
- input_spec: input specification of the source to be checked for new data.
- preprocess_query: SQL query to transform/filter the result from the upstream. Consider that we should refer to 'new_data' whenever we are referring to the input of the sensor. E.g.: "SELECT dummy_col FROM new_data WHERE ..."
- checkpoint_location: optional location to store checkpoints to resume from. These checkpoints use the same as Spark checkpoint strategy. For Spark readers that do not support checkpoints, use the preprocess_query parameter to form a SQL query to filter the result from the upstream accordingly.
- fail_on_empty_result: if the sensor should throw an error if there is no new data in the upstream. Default: True.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
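A hedged sensor ACON sketch for the create_from_acon method below, built only from the documented fields; the ids, tables and checkpoint path are placeholders.

```python
# Illustrative sensor ACON; ids, tables and checkpoint locations are placeholders.
sensor_acon = {
    "sensor_id": "upstream_sales_sensor",
    "assets": ["sales"],
    "control_db_table_name": "internal.sensor_control",
    "input_spec": {
        "spec_id": "upstream_sales",
        "read_type": "streaming",
        "data_format": "delta",
        "db_table": "bronze.sales",
    },
    "preprocess_query": "SELECT COUNT(1) AS new_rows FROM new_data",
    "checkpoint_location": "s3://my-bucket/checkpoints/upstream_sales_sensor",
    "fail_on_empty_result": True,
}
```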
create_from_acon(acon)
classmethod
Create SensorSpec from acon.
Parameters:
Name | Type | Description | Default
---|---|---|---
acon | dict | sensor ACON. | required
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
SensorStatus
SharepointOptions
dataclass
Bases: object
Options for a sharepoint write operation.
- client_id (str): Azure application (client) ID.
- tenant_id (str): tenant ID associated with the SharePoint site.
- site_name (str): name of the SharePoint site where the document library resides.
- drive_name (str): name of the document library where the file will be uploaded.
- file_name (str): name of the file to be uploaded to local path and to SharePoint.
- secret (str): client secret for authentication.
- local_path (str): local path (or similar, e.g. mounted file systems like databricks volumes or dbfs) where files will be temporarily stored during the SharePoint write operation.
- api_version (str): version of the Graph SharePoint API to be used for operations.
- folder_relative_path (Optional[str]): relative folder path within the document library to upload the file.
- chunk_size (Optional[int]): Optional; size (in bytes) of the file chunks for uploading to SharePoint. Default is 100 MB.
- local_options (Optional[dict]): Optional; additional options for customizing write to csv action to local path. You can check the available options here: https://spark.apache.org/docs/3.5.3/sql-data-sources-csv.html
- conflict_behaviour (Optional[str]): Optional; behavior to adopt in case of a conflict (e.g., 'replace', 'fail').
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
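A hedged sketch of sharepoint options; every id, name, path and secret below is a placeholder, and real secrets should be resolved from a secret scope rather than hard-coded.

```python
# Illustrative sharepoint options; all ids, names, paths and secrets are placeholders.
sharepoint_opts = {
    "client_id": "00000000-0000-0000-0000-000000000000",
    "tenant_id": "11111111-1111-1111-1111-111111111111",
    "site_name": "my-site",
    "drive_name": "Documents",
    "file_name": "sales_report.csv",
    "secret": "<resolved-from-a-secret-scope>",  # never hard-code real secrets
    "local_path": "/Volumes/catalog/schema/tmp/",
    "api_version": "v1.0",
    "folder_relative_path": "reports/2024",
    "conflict_behaviour": "replace",
    "local_options": {"header": True},
}
```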
TerminatorSpec
dataclass
Bases: object
Terminator Specification.
I.e., the specification that defines a terminator operation to be executed. Examples are compute statistics, vacuum, optimize, etc.
- function: terminator function to execute.
- args: arguments of the terminator function.
- input_id: id of the corresponding output specification (Optional).
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
TransformSpec
dataclass
Bases: object
Transformation Specification.
I.e., the specification that defines the many transformations to be done to the data that was read.
- spec_id: id of the transform specification.
- input_id: id of the corresponding input specification.
- transformers: list of transformers to execute.
- force_streaming_foreach_batch_processing: sometimes, when using streaming, we want to force the transform to be executed in the foreachBatch function to ensure non-supported streaming operations can be properly executed.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
TransformerSpec
dataclass
Bases: object
Transformer Specification, i.e., a single transformation amongst many.
- function: name of the function (or callable function) to be executed.
- args: (not applicable if using a callable function) dict with the arguments to pass to the function, as <k,v> pairs with the name of the parameter of the function and the respective value.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
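A hedged transform_specs sketch chaining two transformers in the <function, args> structure described above; the transformer names and their arguments are placeholders for whatever transformers your deployment exposes.

```python
# Illustrative transform_specs entry; transformer names and args are placeholders.
transform_specs = [
    {
        "spec_id": "sales_transformed",
        "input_id": "sales_raw",
        "transformers": [
            {"function": "rename", "args": {"cols": {"id": "sale_id"}}},
            {"function": "repartition", "args": {"num_partitions": 10}},
        ],
    }
]
```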
WriteType
Bases: Enum
Types of write operations.