DQ Validator
Module to define the Data Validator class.
DQValidator
Bases: Algorithm
Validate data using an algorithm configuration (ACON represented as a dict).
This algorithm focuses on isolating Data Quality Validations from loading, applying a set of data quality functions to a specific input dataset without the need to define any output specification. You can use any input specification compatible with the lakehouse engine (dataframe, table, files, etc.).
Source code in mkdocs/lakehouse_engine/packages/algorithms/dq_validator.py
__init__(acon)
Construct DQValidator algorithm instances.
A data quality validator needs the following specifications to work properly:
- input specification (mandatory): specify how and what data to read.
- data quality specification (mandatory): specify how to execute the data quality process.
- restore_prev_version (optional): specify whether delta tables/files used as input should be restored to their previous version if the data quality process fails. Note: this is only considered if fail_on_error is kept as True.
Parameters:
Name | Type | Description | Default
---|---|---|---
acon | dict | algorithm configuration. | required
Source code in mkdocs/lakehouse_engine/packages/algorithms/dq_validator.py
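A minimal ACON sketch putting the three specifications above together. The key names and values here (input_spec, dq_spec, the table name, and the expectation function) are illustrative assumptions, not taken from this page; check the lakehouse engine ACON documentation for the exact schema.

```python
# Hypothetical ACON sketch; key names and values are illustrative assumptions.
acon = {
    # Input specification (mandatory): how and what data to read.
    "input_spec": {
        "spec_id": "sales_input",
        "read_type": "batch",
        "data_format": "delta",
        "db_table": "my_database.my_sales_table",
    },
    # Data quality specification (mandatory): how to execute the DQ process.
    "dq_spec": {
        "spec_id": "sales_dq",
        "input_id": "sales_input",
        "dq_type": "validator",
        "dq_functions": [
            {
                "function": "expect_column_values_to_not_be_null",
                "args": {"column": "order_id"},
            },
        ],
    },
    # Optional: restore the delta table/files to the previous version if the
    # DQ process fails (only considered if fail_on_error is kept as True).
    "restore_prev_version": True,
}
```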
execute()
Define the algorithm execution behaviour.
Source code in mkdocs/lakehouse_engine/packages/algorithms/dq_validator.py
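With an ACON like the sketch above, running the validation could look like the following. The import path is an assumption inferred from the source file reference on this page and may differ in your installation.

```python
# Hypothetical usage sketch; the import path is assumed from the
# dq_validator.py source reference above.
from lakehouse_engine.algorithms.dq_validator import DQValidator

# Reads the input and runs the DQ process on it, without any output spec.
DQValidator(acon=acon).execute()
```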
process_dq(data)
Process the data quality tasks for the data that was read.
It supports a single input dataframe.
It is possible to use data quality validators/expectations that will validate your data and fail the process in case the expectations are not met. The DQ process also generates and keeps updating a site containing the results of the expectations that were run on your data. The location of the site is configurable and can be either the file system or S3. If you define it to be stored on S3, you can even configure your S3 bucket to serve the site so that people can easily check the quality of your data. Moreover, it is also possible to store the result of the DQ process in a defined result sink.
Parameters:
Name | Type | Description | Default
---|---|---|---
data | DataFrame | input dataframe on which to run the DQ process. | required
Returns:
Type | Description
---|---
DataFrame | Validated dataframe.
Source code in mkdocs/lakehouse_engine/packages/algorithms/dq_validator.py
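To make the site and result sink options above concrete, here is a hedged sketch of a DQ spec pointing both at S3. Every key name here (bucket, result_sink_db_table, result_sink_location) is an assumption for illustration only.

```python
# Hypothetical dq_spec sketch; all key names are illustrative assumptions.
dq_spec = {
    "spec_id": "sales_dq",
    "input_id": "sales_input",
    "dq_type": "validator",
    # Bucket where the expectations results site is generated and kept
    # up to date; an S3 bucket can be configured to serve the site.
    "bucket": "my-dq-bucket",
    # Optional result sink where the outcome of each DQ run is stored.
    "result_sink_db_table": "my_database.dq_results",
    "result_sink_location": "s3://my-dq-bucket/dq_results/",
    "dq_functions": [
        {
            "function": "expect_column_values_to_be_unique",
            "args": {"column": "order_id"},
        },
    ],
}
```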
read()
¶
Read data from an input location into a distributed dataframe.
Returns:
Type | Description
---|---
DataFrame | Dataframe with data that was read.
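Since read() accepts any lakehouse-engine-compatible input specification, two hedged input_spec variants (one table-based, one file-based; all field names are assumptions) might look like this:

```python
# Hypothetical input_spec variants; field names are illustrative assumptions.
table_input_spec = {
    "spec_id": "sales_table_input",
    "read_type": "batch",
    "data_format": "delta",
    "db_table": "my_database.my_sales_table",  # read from a delta table
}

files_input_spec = {
    "spec_id": "sales_files_input",
    "read_type": "batch",
    "data_format": "csv",
    "location": "s3://my-bucket/landing/sales/",  # read from files in a location
    "options": {"header": True, "inferSchema": True},
}
```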