Definitions
Definitions of standard values and structures for core components.
CollectEngineUsage
¶
Bases: Enum
Options for collecting engine usage stats.
- enabled, enables the collection and storage of Lakehouse Engine usage statistics in any environment.
- prod_only, enables the collection and storage of Lakehouse Engine usage statistics in the production environment only.
- disabled, disables the collection and storage of Lakehouse Engine usage statistics in all environments.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
DQDefaults
¶
Bases: Enum
Defaults used on the data quality process.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
DQExecutionPoint
¶
DQFunctionSpec
dataclass
¶
Bases: object
Defines a data quality function specification.
- function - name of the data quality function (expectation) to execute. It follows the great_expectations API: https://greatexpectations.io/expectations/.
- args - args of the function (expectation), following the same API as above.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
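As an illustrative sketch (not taken from the official examples), a DQ function specification can be written as a plain dict inside an ACON; the expectation name follows the great_expectations API and the column names are hypothetical:

```python
# Hedged sketch of DQFunctionSpec-shaped dicts; "salesorder" and the
# chosen expectations are hypothetical examples.
dq_function = {
    "function": "expect_column_values_to_not_be_null",
    "args": {"column": "salesorder"},
}

# A typical dq_functions list combines several such specs.
dq_functions = [
    dq_function,
    {
        "function": "expect_table_row_count_to_be_between",
        "args": {"min_value": 1},
    },
]
```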
DQResultFormat
¶
DQSpec
dataclass
¶
Bases: object
Data quality overall specification.
- spec_id - id of the specification.
- input_id - id of the input specification.
- dq_type - type of DQ process to execute (e.g. validator).
- dq_functions - list of function specifications to execute.
- dq_db_table - name of table to derive the dq functions from.
- dq_table_table_filter - name of the table whose rules are to be applied in the validations (only used when deriving dq functions).
- dq_table_extra_filters - extra filters to be used when deriving dq functions. This is a sql expression to be applied to the dq_db_table.
- execution_point - execution point of the dq functions. [at_rest, in_motion]. This is set during the load_data or dq_validator functions.
- unexpected_rows_pk - the list of columns composing the primary key of the source data, used to identify the rows failing the DQ validations. Note: only one of the tbl_to_derive_pk or unexpected_rows_pk arguments needs to be provided. It is mandatory to provide one of them when tag_source_data is True. When tag_source_data is False, this is not mandatory, but still recommended.
- tbl_to_derive_pk - db.table to automatically derive the unexpected_rows_pk from. Note: only one of the tbl_to_derive_pk or unexpected_rows_pk arguments needs to be provided. It is mandatory to provide one of them when tag_source_data is True. When tag_source_data is False, this is not mandatory, but still recommended.
- gx_result_format - great expectations result format. Default: "COMPLETE".
- tag_source_data - when set to true, this will ensure that the DQ process ends by tagging the source data with an additional column containing information about the DQ results. This column makes it possible to identify if the DQ run succeeded in general and, if not, it unlocks the insights to know which specific rows made the DQ validations fail and why. Default: False. Note: it only works if result_sink_explode is True, gx_result_format is COMPLETE, fail_on_error is False (which is done automatically when you specify tag_source_data as True) and tbl_to_derive_pk or unexpected_rows_pk is configured.
- store_backend - which store_backend to use (e.g. s3 or file_system).
- local_fs_root_dir - path of the root directory. Note: only applicable for store_backend file_system.
- bucket - the bucket name to consider for the store_backend (store DQ artefacts). Note: only applicable for store_backend s3.
- expectations_store_prefix - prefix where to store expectations' data. Note: only applicable for store_backend s3.
- validations_store_prefix - prefix where to store validations' data. Note: only applicable for store_backend s3.
- checkpoint_store_prefix - prefix where to store checkpoints' data. Note: only applicable for store_backend s3.
- data_asset_name - name of the data asset to consider when configuring the great expectations' data source.
- expectation_suite_name - name to consider for great expectations' suite.
- result_sink_db_table - db.table_name indicating the database and table in which to save the results of the DQ process.
- result_sink_location - file system location in which to save the results of the DQ process.
- result_sink_chunk_size - number of records per chunk when writing the results of the DQ process. Default: 1000000 records.
- processed_keys_location - file system location where the keys processed by the DQ Process are saved. This is specifically used when the DQ Type is PRISMA. Note that this location is always constructed during the process, so any value defined in the configuration will be overwritten.
- data_product_name - name of the data product.
- result_sink_partitions - the list of partitions to consider.
- result_sink_format - format of the result table (e.g. delta, parquet, kafka...).
- result_sink_options - extra spark options for configuring the result sink. E.g: can be used to configure a Kafka sink if result_sink_format is kafka.
- result_sink_explode - flag to determine if the output table/location should have the columns exploded (as True) or not (as False). Default: True.
- result_sink_extra_columns - list of extra columns to be exploded (following the pattern ".*") or columns to be selected. It is only used when result_sink_explode is set to True.
- source - name of the data source, to make it easier to identify in analysis. If not specified, a default is set. This will only be used when result_sink_explode is set to True.
- fail_on_error - whether to fail the algorithm if the validations of your data in the DQ process failed.
- cache_df - whether to cache the dataframe before running the DQ process or not.
- critical_functions - functions that should not fail. When this argument is defined, fail_on_error is nullified.
- max_percentage_failure - percentage of failure that should be allowed. This argument has priority over both fail_on_error and critical_functions.
- enable_row_condition - flag to determine if the row_conditions should be enabled or not. row_conditions allow you to filter the rows that are processed by the DQ functions. This is useful when you want to run the DQ functions only on a subset of the data. Default: False. Note: when using PRISMA, if you enable this flag, bear in mind that the number of processed keys will be numerically different from the evaluated keys. This happens because the row_conditions limit the number of rows that are processed by the DQ functions, but we consider processed keys to be all the keys that are passed to the dq_spec.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
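Tying several of the fields above together, here is a hedged sketch of a dq_specs entry expressed as an ACON-style dict. All ids, table names and bucket values are hypothetical placeholders; the keys mirror the bullet list above:

```python
# Hedged sketch of a DQSpec-shaped dict; ids, tables and the bucket
# are hypothetical. Note how tag_source_data requires result_sink_explode,
# a COMPLETE result format and a primary-key source, per the notes above.
dq_spec = {
    "spec_id": "orders_dq",
    "input_id": "orders_input",
    "dq_type": "validator",
    "store_backend": "s3",
    "bucket": "my-dq-bucket",             # hypothetical bucket name
    "result_sink_db_table": "dq_db.orders_dq_results",
    "result_sink_explode": True,
    "gx_result_format": "COMPLETE",
    "tag_source_data": True,
    "unexpected_rows_pk": ["order_id"],   # mandatory when tag_source_data is True
    "fail_on_error": False,               # implied by tag_source_data=True
    "dq_functions": [
        {"function": "expect_column_values_to_not_be_null",
         "args": {"column": "order_id"}},
    ],
}
```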
DQTableBaseParameters
¶
Bases: Enum
Base parameters for importing DQ rules from a table.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
DQType
¶
DQValidatorSpec
dataclass
¶
Bases: object
Data Quality Validator Specification.
- input_spec: input specification of the data to be checked/validated.
- dq_spec: data quality specification.
- restore_prev_version: specify if, having delta table/files as input, they should be restored to the previous version if the data quality process fails. Note: this is only considered if fail_on_error is kept as True.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
EngineConfig
dataclass
¶
Bases: object
Definitions that can come from the Engine Config file.
- dq_bucket: S3 prod bucket used to store data quality related artifacts.
- dq_dev_bucket: S3 dev bucket used to store data quality related artifacts.
- notif_disallowed_email_servers: email servers not allowed to be used for sending notifications.
- engine_usage_path: path where the engine prod usage stats are stored.
- engine_dev_usage_path: path where the engine dev usage stats are stored.
- collect_engine_usage: whether to enable the collection of lakehouse engine usage stats or not.
- dq_functions_column_list: list of columns to be added to the meta argument of GX when using PRISMA.
- raise_on_config_not_available: whether to raise an exception if a spark config is not available.
- prod_catalog: name of the prod catalog being used. This is useful to derive whether the environment is prod or dev, so the dev or prod buckets/paths can be used for storing engine usage stats and dq artifacts.
- environment: environment that the engine is being executed on. Takes precedence over prod_catalog when defining if the environment is prod or dev.
- sharepoint_authority: authority for the Sharepoint api.
- sharepoint_company_domain: company domain for the Sharepoint api.
- sharepoint_api_domain: api domain for the Sharepoint api.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
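As a hedged illustration of the fields above, an engine config could carry values like the following; all bucket names and paths are hypothetical placeholders, not real defaults:

```python
# Hedged sketch of EngineConfig-shaped values as a dict; every value
# here is a hypothetical placeholder.
engine_config = {
    "dq_bucket": "s3://prod-dq-artifacts",
    "dq_dev_bucket": "s3://dev-dq-artifacts",
    "engine_usage_path": "s3://prod-usage/stats",
    "engine_dev_usage_path": "s3://dev-usage/stats",
    "collect_engine_usage": "enabled",
    "environment": "dev",  # takes precedence over prod_catalog
}
```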
EngineStats
¶
Bases: object
Definitions for collection of Lakehouse Engine Stats.
Note: whenever the value comes from a key inside a Spark Config that returns an array, it can be specified with a '#' so that it is adequately processed.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
FileManagerAPIKeys
¶
Bases: Enum
File Manager s3 api keys.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
GABCadence
¶
Bases: Enum
Representation of the supported cadences on GAB.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
get_cadences()
classmethod
¶
Get the cadences values as set.
Returns:
| Type | Description |
|---|---|
| set[str] | set containing all possible cadence values. |
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
get_ordered_cadences()
classmethod
¶
Get the cadences ordered by the value.
Returns:
| Type | Description |
|---|---|
| dict | dict containing the ordered cadences. |
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
order_cadences(cadences_to_order)
classmethod
¶
Order a list of cadences by value.
Returns:
| Type | Description |
|---|---|
| list[str] | ordered list containing the received cadences. |
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
GABCombinedConfiguration
¶
Bases: Enum
GAB combined configuration.
Based on the use case configuration, return the values to override in the SQL file.
This enum aims to exhaustively map each combination of cadence, reconciliation, week_start and snap_flag to the corresponding join_select, project_start and project_end values, replacing them in the stages SQL file.
It returns the corresponding configuration (join_select, project_start, project_end) for each combination (cadence x recon x week_start x snap_flag).
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
GABDefaults
¶
Bases: Enum
Defaults used on the GAB process.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
GABKeys
¶
Constants used to update pre-configured gab dict key.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
GABReplaceableKeys
¶
Constants used to replace pre-configured gab dict values.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
GABSpec
dataclass
¶
Bases: object
Gab Specification.
- query_label_filter: query use-case label to execute.
- queue_filter: queue to execute the job.
- cadence_filter: selected cadences to build the asset.
- target_database: target database to write.
- curr_date: current date.
- start_date: period start date.
- end_date: period end date.
- rerun_flag: rerun flag.
- target_table: target table to write.
- source_database: source database.
- gab_base_path: base path to read the use cases.
- lookup_table: gab configuration table.
- calendar_table: gab calendar table.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
create_from_acon(acon)
classmethod
¶
Create GabSpec from acon.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| acon | dict | gab ACON. | required |
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
GABStartOfWeek
¶
Bases: Enum
Representation of start of week values on GAB.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
get_start_of_week()
classmethod
¶
Get the start of week enum as a dict.
Returns:
| Type | Description |
|---|---|
| dict | dict containing all enum entries. |
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
get_values()
classmethod
¶
Get the start of week enum values as set.
Returns:
| Type | Description |
|---|---|
| set[str] | set containing all possible values. |
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
HeartbeatConfigSpec
dataclass
¶
Bases: object
Heartbeat Configurations and control table specifications.
This provides the way in which the Heartbeat can pass environment and specific quantum related config information to sensor acon.
- sensor_source: specifies the source system of the sensor, e.g. sap_b4, sap_bw, delta_table, kafka, lmu_delta_table, trigger_file, etc. It is also part of the heartbeat control table, therefore it is useful for filtering data from the Heartbeat control table based on the template source system.
- data_format: format of the input source, e.g jdbc, delta, kafka, cloudfiles etc.
- heartbeat_sensor_db_table: heartbeat control table along with database from config.
- lakehouse_engine_sensor_db_table: Control table along with database(config).
- options: dict with other relevant options for reading data from the specified input data_format. This can vary per source system. E.g., for SAP systems, DRIVER, URL, USERNAME and PASSWORD are required, which are all read from the quantum config file.
- jdbc_db_table: schema and table name of JDBC sources.
- token: token to access Databricks Job API(read from config).
- domain: workspace domain url for quantum(read from config).
- base_checkpoint_location: checkpoint location for streaming sources(from config).
- kafka_configs: configs required for kafka, read from config as JSON. The config hierarchy is [sensor_kafka --> ... --> main kafka options].
- kafka_secret_scope: secret scope for kafka (read from config).
- base_trigger_file_location: location where all the trigger files are being created (read from config).
- schema_dict: dict representation of schema of the trigger file (e.g. Spark struct type schema).
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
create_from_acon(acon)
classmethod
¶
Create HeartbeatConfigSpec from acon.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| acon | dict | Heartbeat ACON. | required |
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
HeartbeatSensorSource
¶
Bases: Enum
Formats of algorithm input.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
values()
classmethod
¶
Generates a list containing all enum values.
Returns:
| Type | Description |
|---|---|
| list | A list with all enum values. |
HeartbeatStatus
¶
InputFormat
¶
Bases: Enum
Formats of algorithm input.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
exists(input_format)
classmethod
¶
Checks if the input format exists in the enum values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input_format | str | format to check if it exists. | required |
Returns:
| Type | Description |
|---|---|
| bool | If the input format exists in our enum. |
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
values()
classmethod
¶
Generates a list containing all enum values.
Returns:
| Type | Description |
|---|---|
| list | A list with all enum values. |
InputSpec
dataclass
¶
Bases: object
Specification of an algorithm input.
This is very aligned with the way the execution environment connects to the sources (e.g., spark sources).
- spec_id: spec_id of the input specification.
- read_type: type of read operation (ReadType).
- data_format: format of the input.
- sftp_files_format: format of the files (csv, fwf, json, xml...) in a sftp directory.
- df_name: dataframe name.
- db_table: table name in the form of <db>.<table>.
- location: uri that identifies from where to read data in the specified format.
- sharepoint_opts: Options to apply when reading from Sharepoint.
- enforce_schema_from_table: if we want to enforce the table schema or not, by providing a table name in the form of <db>.<table>.
- query: sql query to execute and return the dataframe. Use it if you do not want to read from a file system nor from a table, but rather from a sql query.
- schema: dict representation of a schema of the input (e.g., Spark struct type schema).
- schema_path: path to a file with a representation of a schema of the input (e.g., Spark struct type schema).
- disable_dbfs_retry: optional flag to disable file storage dbfs.
- with_filepath: if we want to include the path of the file that is being read. Only works with the file reader (batch and streaming modes are supported).
- options: dict with other relevant options according to the execution environment (e.g., spark) possible sources.
- calculate_upper_bound: whether to calculate the upper bound to extract from SAP BW or not.
- calc_upper_bound_schema: specific schema for the calculated upper_bound.
- generate_predicates: whether to generate predicates to extract from SAP BW or not.
- predicates_add_null: if we want to include is null on partition by predicates.
- temp_view: optional name of a view to point to the input dataframe to be used to create or replace a temp view on top of the dataframe.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
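As a hedged sketch of the fields above, here are two input_specs entries in ACON dict form, one batch and one streaming; all ids, table names and option values are hypothetical:

```python
# Hedged sketch of InputSpec-shaped dicts; names and options are
# hypothetical placeholders.
input_spec = {
    "spec_id": "sales_source",
    "read_type": "batch",
    "data_format": "delta",
    "db_table": "my_db.sales",  # <db>.<table> form
}

streaming_input_spec = {
    "spec_id": "events_source",
    "read_type": "streaming",
    "data_format": "kafka",
    "options": {
        "kafka.bootstrap.servers": "host:9092",  # hypothetical broker
        "subscribe": "events_topic",
    },
}
```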
__post_init__()
¶
Normalize Sharepoint options if passed as a raw dictionary.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| self |  | Instance of the class. | required |
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
MergeOptions
dataclass
¶
Bases: object
Options for a merge operation.
- merge_predicate: predicate to apply to the merge operation so that we can check if a new record corresponds to a record already included in the historical data.
- insert_only: indicates if the merge should only insert data (e.g., deduplicate scenarios).
- delete_predicate: predicate to apply to the delete operation.
- update_predicate: predicate to apply to the update operation.
- insert_predicate: predicate to apply to the insert operation.
- update_column_set: rules to apply to the update operation which allows to set the value for each column to be updated. (e.g. {"data": "new.data", "count": "current.count + 1"} )
- insert_column_set: rules to apply to the insert operation which allows to set the value for each column to be inserted. (e.g. {"date": "updates.date", "count": "1"} )
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
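Following the update_column_set and insert_column_set examples above, a merge_opts block inside an output spec could look like this hedged sketch (predicates are hypothetical; "current" refers to the historical data and "new" to the incoming data):

```python
# Hedged sketch of MergeOptions-shaped values; the predicate and
# column names are hypothetical.
merge_opts = {
    "merge_predicate": "current.id = new.id",
    "insert_only": False,
    # Update rules: keep a running count, take the latest data value.
    "update_column_set": {"data": "new.data", "count": "current.count + 1"},
    # Insert rules: seed the count at 1 for unseen records.
    "insert_column_set": {"date": "new.date", "count": "1"},
}
```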
NotificationRuntimeParameters
¶
Bases: Enum
Parameters to be replaced in runtime.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
NotifierType
¶
OutputFormat
¶
Bases: Enum
Formats of algorithm output.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
exists(output_format)
classmethod
¶
Checks if the output format exists in the enum values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| output_format | str | format to check if it exists. | required |
Returns:
| Type | Description |
|---|---|
| bool | If the output format exists in our enum. |
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
values()
classmethod
¶
Generates a list containing all enum values.
Returns:
| Type | Description |
|---|---|
| list | A list with all enum values. |
OutputSpec
dataclass
¶
Bases: object
Specification of an algorithm output.
This is very aligned with the way the execution environment connects to the output systems (e.g., spark outputs).
- spec_id: id of the output specification.
- input_id: id of the corresponding input specification.
- write_type: type of write operation.
- data_format: format of the output. Defaults to DELTA.
- db_table: table name in the form of <db>.<table>.
- location: uri that identifies where to write data in the specified format.
- sharepoint_opts: options to apply on writing on Sharepoint operations.
- partitions: list of partition input_col names.
- merge_opts: options to apply to the merge operation.
- streaming_micro_batch_transformers: transformers to invoke for each streaming micro batch, before writing (i.e., in Spark's foreachBatch structured streaming function). Note: the lakehouse engine manages this for you, so we don't advise you to manually specify streaming transformations through this parameter. Supply them as regular transformers in the transform_specs sections of an ACON.
- streaming_once: if the streaming query is to be executed just once, or not, generating just one micro batch.
- streaming_processing_time: if streaming query is to be kept alive, this indicates the processing time of each micro batch.
- streaming_available_now: if set to True, set a trigger that processes all available data in multiple batches then terminates the query. When using streaming, this is the default trigger that the lakehouse-engine will use, unless you configure a different one.
- streaming_continuous: set a trigger that runs a continuous query with a given checkpoint interval.
- streaming_await_termination: whether to wait (True) for the termination of the streaming query (e.g. timeout or exception) or not (False). Default: True.
- streaming_await_termination_timeout: a timeout to set to the streaming_await_termination. Default: None.
- with_batch_id: whether to include the streaming batch id in the final data, or not. It only takes effect in streaming mode.
- options: dict with other relevant options according to the execution environment (e.g., spark) possible outputs. E.g.,: JDBC options, checkpoint location for streaming, etc.
- streaming_micro_batch_dq_processors: similar to streaming_micro_batch_transformers but for the DQ functions to be executed. Used internally by the lakehouse engine, so you don't have to supply DQ functions through this parameter. Use the dq_specs of the acon instead.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
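As a hedged sketch of the fields above, an output_specs entry could look like the following; the ids, location and options are hypothetical placeholders:

```python
# Hedged sketch of an OutputSpec-shaped dict; all values are
# hypothetical placeholders.
output_spec = {
    "spec_id": "sales_bronze",
    "input_id": "sales_source",
    "write_type": "append",
    "data_format": "delta",
    "partitions": ["order_date"],
    "location": "s3://my-bucket/bronze/sales",  # hypothetical uri
    "options": {
        # Checkpoint location is only relevant for streaming writes.
        "checkpointLocation": "s3://my-bucket/checkpoints/sales",
    },
}
```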
ReadMode
¶
Bases: Enum
Different modes that control how we handle compliance to the provided schema.
These read modes map to Spark's read modes at the moment.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
ReadType
¶
Bases: Enum
Define the types of read operations.
- BATCH - read the data in batch mode (e.g., Spark batch).
- STREAMING - read the data in streaming mode (e.g., Spark streaming).
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
ReconciliatorSpec
dataclass
¶
Bases: object
Reconciliator Specification.
- metrics: list of metrics in the form of: [{ metric: name of the column present in both truth and current datasets, aggregation: sum, avg, max, min, ..., type: percentage or absolute, yellow: value, red: value }].
- recon_type: reconciliation type (percentage or absolute). Percentage calculates the difference between truth and current results as a percentage (x-y/x), and absolute calculates the raw difference (x - y).
- truth_input_spec: input specification of the truth data.
- current_input_spec: input specification of the current results data.
- truth_preprocess_query: additional query on top of the truth input data to preprocess the truth data before it gets fueled into the reconciliation process. Important note: you need to assume that the data out of the truth_input_spec is referencable by a table called 'truth'.
- truth_preprocess_query_args: optional dict having the functions/transformations to apply on top of the truth_preprocess_query and respective arguments. Note: cache is applied on the Dataframe by default. To turn the default behavior off, pass "truth_preprocess_query_args": [].
- current_preprocess_query: additional query on top of the current results input data to preprocess the current results data before it gets fueled into the reconciliation process. Important note: you need to assume that the data out of the current_results_input_spec is referencable by a table called 'current'.
- current_preprocess_query_args: optional dict having the functions/transformations to apply on top of the current_preprocess_query and respective arguments. Note: cache is applied on the Dataframe by default. To turn the default behavior off, pass "current_preprocess_query_args": [].
- ignore_empty_df: optional boolean to ignore the recon process if the source & target dataframes are empty; recon will exit with a success code (passed).
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
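The metrics shape described above can be sketched as follows; the column name and thresholds are hypothetical examples, not recommended values:

```python
# Hedged sketch of a ReconciliatorSpec metrics list; "net_sales" and
# the thresholds are hypothetical.
metrics = [
    {
        "metric": "net_sales",   # column present in both truth and current
        "aggregation": "sum",
        "type": "percentage",    # difference computed as (x - y) / x
        "yellow": 0.05,          # warn above 5% deviation
        "red": 0.1,              # fail above 10% deviation
    }
]
```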
RestoreStatus
¶
RestoreType
¶
Bases: Enum
Archive types.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
exists(restore_type)
classmethod
¶
Checks if the restore type exists in the enum values.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| restore_type | str | restore type to check if it exists. | required |
Returns:
| Type | Description |
|---|---|
| bool | If the restore type exists in our enum. |
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
values()
classmethod
¶
Generates a list containing all enum values.
Returns:
| Type | Description |
|---|---|
| list | A list with all enum values. |
SAPLogchain
¶
Bases: Enum
Defaults used on consuming data from SAP Logchain.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
SQLDefinitions
¶
Bases: Enum
SQL definitions statements.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
SQLParser
¶
Bases: Enum
Defaults to use for parsing.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
SensorSpec
dataclass
¶
Bases: object
Sensor Specification.
- sensor_id: sensor id.
- assets: a list of assets that are considered as available to consume downstream after this sensor has status PROCESSED_NEW_DATA.
- control_db_table_name: db.table to store sensor metadata.
- input_spec: input specification of the source to be checked for new data.
- preprocess_query: SQL query to transform/filter the result from the upstream. Consider that we should refer to 'new_data' whenever we are referring to the input of the sensor. E.g.: "SELECT dummy_col FROM new_data WHERE ..."
- checkpoint_location: optional location to store checkpoints to resume from. These checkpoints use the same as Spark checkpoint strategy. For Spark readers that do not support checkpoints, use the preprocess_query parameter to form a SQL query to filter the result from the upstream accordingly.
- fail_on_empty_result: if the sensor should throw an error if there is no new data in the upstream. Default: True.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
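As a hedged sketch of the fields above, a sensor specification in ACON dict form could look like this; all ids, table names and the preprocess query are hypothetical:

```python
# Hedged sketch of a SensorSpec-shaped dict; every value is a
# hypothetical placeholder. Note that the preprocess query refers to
# the sensor input as 'new_data', per the documentation above.
sensor_spec = {
    "sensor_id": "upstream_sales_sensor",
    "assets": ["sales_bronze"],
    "control_db_table_name": "ctrl_db.sensor_control",
    "preprocess_query": "SELECT dummy_col FROM new_data WHERE dummy_col > 0",
    "fail_on_empty_result": True,  # default per the spec above
}
```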
create_from_acon(acon)
classmethod
¶
Create SensorSpec from acon.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| acon | dict | sensor ACON. | required |
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
SensorStatus
¶
SharepointFile
dataclass
¶
Represents a file from Sharepoint with metadata and optional content.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
SharepointOptions
dataclass
¶
Bases: object
Options for Sharepoint I/O (used by both reader and writer).
This dataclass is shared by the Sharepoint reader and writer. Some fields
are required/used only in read mode, others only in write mode.
Use validate_for_reader() / validate_for_writer() to enforce the
correct subsets.
Common (reader & writer):
- client_id (str): Azure AD application (client) ID.
- tenant_id (str): Azure AD tenant (directory) ID.
- site_name (str): Sharepoint site name.
- drive_name (str): Document library/drive name.
- secret (str): Client secret.
- local_path (str): Local/volume path for staging (read/write temp).
- api_version (str): Microsoft Graph API version (default: "v1.0").
- conflict_behaviour (Optional[str]): e.g. 'replace', 'fail'.
- allowed_extensions (Optional[Collection[str]]): Defaults to SHAREPOINT_SUPPORTED_EXTENSIONS {".csv", ".xlsx"}.
Reader-specific:
- folder_relative_path (Optional[str]): Folder (or full file path) to read from.
- file_name (Optional[str]): Name of a single file inside the folder to read. If folder_relative_path already points to a file, file_name must be None.
- file_type (Optional[str]): "csv" or "xlsx" when reading a folder.
- file_pattern (Optional[str]): Glob (e.g. '*.csv') when reading a folder.
- local_options (Optional[dict]): Spark CSV read options (e.g. header, sep).
- chunk_size (Optional[int]): Download chunk size (bytes).
Writer-specific:
- file_name (Optional[str]): Target file name to upload.
- local_options (Optional[dict]): Spark CSV write options.
- chunk_size (Optional[int]): Upload chunk size (bytes).
Archiving (reader):
- archive_enabled (bool): Whether to move files after a successful/failed read. Default: True.
- archive_success_subfolder (Optional[str]): Success folder (default "done"). Set None to keep in place.
- archive_error_subfolder (Optional[str]): Error folder (default "error"). Set None to keep in place.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
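As a hedged sketch of the reader-side options above, a sharepoint_opts dict could look like this; all ids, names and paths are hypothetical placeholders:

```python
# Hedged sketch of SharepointOptions-shaped values for reading; every
# value is a hypothetical placeholder.
sharepoint_opts = {
    "client_id": "00000000-0000-0000-0000-000000000000",
    "tenant_id": "11111111-1111-1111-1111-111111111111",
    "secret": "<client-secret>",          # never hardcode real secrets
    "site_name": "finance-site",
    "drive_name": "Documents",
    "local_path": "/tmp/sharepoint_staging",
    "folder_relative_path": "reports/2024",
    "file_type": "csv",
    "file_pattern": "*.csv",
    "local_options": {"header": "true", "sep": ";"},
}
```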
__post_init__()
¶
Normalize and validate Sharepoint options (types, extensions, etc).
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
validate_for_reader()
¶
Validate Sharepoint options required for reading.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
validate_for_writer()
¶
Validate Sharepoint options required for writing.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
TerminatorSpec
dataclass
¶
Bases: object
Terminator Specification.
I.e., the specification that defines a terminator operation to be executed. Examples are compute statistics, vacuum, optimize, etc.
- function: terminator function to execute.
- args: arguments of the terminator function.
- input_id: id of the corresponding output specification (Optional).
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
TransformSpec
dataclass
¶
Bases: object
Transformation Specification.
I.e., the specification that defines the many transformations to be done to the data that was read.
- spec_id: id of the transformation specification.
- input_id: id of the corresponding input specification.
- transformers: list of transformers to execute.
- force_streaming_foreach_batch_processing: sometimes, when using streaming, we want to force the transform to be executed in the foreachBatch function to ensure non-supported streaming operations can be properly executed.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
TransformerSpec
dataclass
¶
Bases: object
Transformer Specification, i.e., a single transformation amongst many.
- function: name of the function (or callable function) to be executed.
- args: (not applicable if using a callable function) dict with the arguments to pass to the function, as <k,v> pairs with the name of the parameter of the function and the respective value.
Source code in mkdocs/lakehouse_engine/packages/core/definitions.py
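Putting the transform and transformer specifications above together, a transform_specs entry can be sketched as follows; the function names and their args are hypothetical illustrations of the <k,v> argument pairs, not a list of real transformers:

```python
# Hedged sketch of a TransformSpec-shaped dict whose "transformers"
# list holds TransformerSpec-shaped entries; function names and args
# are hypothetical.
transform_spec = {
    "spec_id": "sales_transform",
    "input_id": "sales_source",
    "transformers": [
        {"function": "rename", "args": {"cols": {"old_name": "new_name"}}},
        {"function": "with_literals", "args": {"literals": {"source": "erp"}}},
    ],
}
```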
WriteType
¶
Bases: Enum
Types of write operations.