Dq validator
Module to define Data Validator class.
DQValidator
Bases: Algorithm
Validate data using an algorithm configuration (ACON represented as dict).
This algorithm focuses on isolating Data Quality Validations from loading: it applies a set of data quality functions to a specific input dataset, without the need to define any output specification. You can use any input specification compatible with the lakehouse engine (dataframe, table, files, etc.).
Source code in mkdocs/lakehouse_engine/packages/algorithms/dq_validator.py
__init__(acon)
Construct DQValidator algorithm instances.
A data quality validator needs the following specifications to work properly:
- input specification (mandatory): specify how and what data to read.
- data quality specification (mandatory): specify how to execute the data quality process.
- restore_prev_version (optional): specify whether delta table/files used as input should be restored to the previous version if the data quality process fails. Note: this is only considered if fail_on_error is kept as True.
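As a rough illustration of the three specifications listed above, the sketch below builds an ACON as a plain dict and checks it with two small helpers. Only `restore_prev_version` and `fail_on_error` are named in this page; the keys `input_spec` and `dq_spec`, every nested key and value, the placement of `fail_on_error` inside the data quality specification, and the helpers `missing_mandatory_specs` and `restore_is_considered` are assumptions made for illustration, not the engine's confirmed schema or API.

```python
from typing import Any

# Assumed top-level key names for the two mandatory specifications.
MANDATORY_SPECS = ("input_spec", "dq_spec")

# Hypothetical ACON: all nested keys and values are illustrative only.
acon: dict[str, Any] = {
    "input_spec": {
        "spec_id": "sales_source",
        "data_format": "delta",
        "db_table": "my_database.sales",
    },
    "dq_spec": {
        "spec_id": "dq_sales",
        "input_id": "sales_source",
        # Assumed location of fail_on_error; "kept as True" suggests True is the default.
        "fail_on_error": True,
        "dq_functions": [
            {"function": "expect_column_to_exist", "args": {"column": "article"}},
        ],
    },
    "restore_prev_version": True,
}


def missing_mandatory_specs(acon: dict[str, Any]) -> list[str]:
    """Return the mandatory specification keys that are absent or empty."""
    return [key for key in MANDATORY_SPECS if not acon.get(key)]


def restore_is_considered(acon: dict[str, Any]) -> bool:
    """Mirror the note above: restore only matters while fail_on_error is True."""
    fail_on_error = acon.get("dq_spec", {}).get("fail_on_error", True)
    return bool(acon.get("restore_prev_version", False)) and bool(fail_on_error)


print(missing_mandatory_specs(acon))
print(restore_is_considered(acon))
```

The helpers only inspect the dict; they do not read data or run any validation, so they can be used as a quick sanity check before handing a configuration to the algorithm.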
Parameters:
Name | Type | Description | Default |
---|---|---|---|
acon | dict | Algorithm configuration. | required |
execute()
Define the algorithm execution behaviour.
process_dq(data)
Process the data quality tasks for the data that was read.
It supports a single input dataframe.
You can use data quality validators/expectations to validate your data and fail the process when the expectations are not met. The DQ process also generates and keeps updating a site containing the results of the expectations run against your data. The location of the site is configurable and can be either the file system or S3. If you store it on S3, you can configure the S3 bucket to serve the site so that people can easily check the quality of your data. It is also possible to store the result of the DQ process in a defined result sink.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | DataFrame | Input dataframe on which to run the DQ process. | required |
Returns:
Type | Description |
---|---|
DataFrame | Validated dataframe. |
read()
Read data from an input location into a distributed dataframe.
Returns:
Type | Description |
---|---|
DataFrame | Dataframe with data that was read. |