# Validations Failing
The scenarios presented on this page are similar, but their goal is to show what happens when a DQ expectation fails the validations. The logs generated by the execution of the code contain information about which expectation(s) failed and why.
## 1. Fail on Error
In this scenario, two parameters are specified:

- `"fail_on_error": False` - this parameter controls what happens if a DQ expectation fails. If it is set to `true` (the default), your job fails/is aborted and an exception is raised. If it is set to `false`, a log message is printed about the error (as shown in this scenario) and the result status is also available in the result sink (if configured) and in the [data docs great expectation site](../data_quality.html#3-data-docs-website). In this scenario it is set to `false` to avoid failing the execution of the notebook.
- the `max_value` of the function `expect_table_column_count_to_be_between` is defined with a specific value so that this expectation fails the validations.
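The assumed behaviour of `fail_on_error` can be sketched in plain Python. This is an illustrative simplification, not lakehouse-engine source code; the function name and signature are hypothetical:

```python
# Illustrative sketch (NOT lakehouse-engine internals) of the assumed
# semantics of "fail_on_error" when some DQ expectations fail.
def handle_failed_expectations(failed_expectations, fail_on_error=True):
    """Build log messages for failed expectations; abort only if fail_on_error."""
    messages = [f"DQ expectation failed: {name}" for name in failed_expectations]
    if failed_expectations and fail_on_error:
        # Default behaviour: the job is aborted and an exception is raised.
        raise RuntimeError("; ".join(messages))
    # fail_on_error=False: failures are only logged and execution continues.
    return messages

# With fail_on_error=False, a failing expectation yields a log message
# instead of stopping the run.
logs = handle_failed_expectations(
    ["expect_table_column_count_to_be_between"], fail_on_error=False
)
```

In either case, the result status would still end up in the result sink and data docs site, as described above.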
```python
from lakehouse_engine.engine import load_data

acon = {
    "input_specs": [
        {
            "spec_id": "dummy_deliveries_source",
            "read_type": "batch",
            "data_format": "csv",
            "options": {
                "header": True,
                "delimiter": "|",
                "inferSchema": True,
            },
            "location": "s3://my_data_product_bucket/dummy_deliveries/",
        }
    ],
    "dq_specs": [
        {
            "spec_id": "dq_validator",
            "input_id": "dummy_deliveries_source",
            "dq_type": "validator",
            "bucket": "my_data_product_bucket",
            "data_docs_bucket": "my_dq_data_docs_bucket",
            "data_docs_prefix": "dq/my_data_product/data_docs/site/",
            "result_sink_db_table": "my_database.dq_result_sink",
            "result_sink_location": "my_dq_path/dq_result_sink/",
            "tbl_to_derive_pk": "my_database.dummy_deliveries",
            "source": "deliveries_fail",
            "fail_on_error": False,
            "dq_functions": [
                {"function": "expect_column_to_exist", "args": {"column": "salesorder"}},
                {"function": "expect_table_row_count_to_be_between", "args": {"min_value": 15, "max_value": 20}},
                {"function": "expect_table_column_count_to_be_between", "args": {"max_value": 5}},
                {"function": "expect_column_values_to_be_null", "args": {"column": "article"}},
                {"function": "expect_column_values_to_be_unique", "args": {"column": "status"}},
                {
                    "function": "expect_column_min_to_be_between",
                    "args": {"column": "delivery_item", "min_value": 1, "max_value": 15},
                },
                {
                    "function": "expect_column_max_to_be_between",
                    "args": {"column": "delivery_item", "min_value": 15, "max_value": 30},
                },
            ],
        }
    ],
    "output_specs": [
        {
            "spec_id": "dummy_deliveries_bronze",
            "input_id": "dq_validator",
            "write_type": "overwrite",
            "data_format": "delta",
            "location": "s3://my_data_product_bucket/bronze/dummy_deliveries_dq_template/",
        }
    ],
}

load_data(acon=acon)
```
If you run the command below, you will see that the `success` column has the value `false` for the last execution.

```python
display(spark.table(RENDER_UTILS.render_content("my_database.dq_result_sink")))
```
## 2. Critical Functions
In this scenario, alternative parameters to `fail_on_error` are used:

- `critical_functions` - this parameter defaults to `None` if not defined. It controls which DQ functions are considered a priority: whenever a function defined as critical does not pass the test, the validation is stopped and an execution error is thrown. If any other function that is not defined in this parameter fails, only an error message is printed in the logs. This parameter has priority over `fail_on_error`. In this specific example, `expect_table_column_count_to_be_between` is defined as critical, so the execution is stopped whenever the conditions for this function are not met.

Additionally, you can also define parameters like:

- `max_percentage_failure` - this parameter defaults to `None` if not defined. It controls what percentage of the total functions can fail without stopping the execution of the validation. If the threshold is surpassed, the execution stops and a failure error is thrown. This parameter has priority over `fail_on_error` and `critical_functions`.

You can also pair `critical_functions` with `max_percentage_failure`, e.g. by defining a maximum percentage of failure of 0.6 and also defining some critical functions. In this case, even if the threshold is respected, the list defined in `critical_functions` is still checked.
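The interplay described above can be sketched as follows. This is an assumed simplification for illustration only, not lakehouse-engine source; the function name and signature are hypothetical:

```python
# Illustrative sketch (assumed semantics, NOT lakehouse-engine internals) of how
# "critical_functions" and "max_percentage_failure" interact with "fail_on_error".
def evaluate_dq_run(results, critical_functions=(), max_percentage_failure=None,
                    fail_on_error=True):
    """results: mapping of DQ function name -> bool (True means it passed)."""
    failed = [name for name, passed in results.items() if not passed]

    # critical_functions is checked even when the failure threshold is respected:
    # any critical failure stops the run.
    critical_failed = [name for name in failed if name in critical_functions]
    if critical_failed:
        raise RuntimeError(f"Critical DQ functions failed: {critical_failed}")

    # max_percentage_failure has priority over fail_on_error: below the
    # threshold, non-critical failures are only logged.
    if max_percentage_failure is not None:
        if results and len(failed) / len(results) > max_percentage_failure:
            raise RuntimeError("DQ failure threshold surpassed")
        return failed

    if failed and fail_on_error:
        raise RuntimeError(f"DQ functions failed: {failed}")
    return failed
```

For example, with `max_percentage_failure=0.6` and one failure out of two functions (50%), the run would continue, but a failure in a function listed in `critical_functions` would still abort it.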
```python
from lakehouse_engine.engine import load_data

acon = {
    "input_specs": [
        {
            "spec_id": "dummy_deliveries_source",
            "read_type": "batch",
            "data_format": "csv",
            "options": {
                "header": True,
                "delimiter": "|",
                "inferSchema": True,
            },
            "location": "s3://my_data_product_bucket/dummy_deliveries/",
        }
    ],
    "dq_specs": [
        {
            "spec_id": "dq_validator",
            "input_id": "dummy_deliveries_source",
            "dq_type": "validator",
            "bucket": "my_data_product_bucket",
            "data_docs_bucket": "my_dq_data_docs_bucket",
            "data_docs_prefix": "dq/my_data_product/data_docs/site/",
            "result_sink_db_table": "my_database.dq_result_sink",
            "result_sink_location": "my_dq_path/dq_result_sink/",
            "source": "deliveries_critical",
            "tbl_to_derive_pk": "my_database.dummy_deliveries",
            "dq_functions": [
                {"function": "expect_column_to_exist", "args": {"column": "salesorder"}},
                {"function": "expect_table_row_count_to_be_between", "args": {"min_value": 15, "max_value": 25}},
            ],
            "critical_functions": [
                {"function": "expect_table_column_count_to_be_between", "args": {"max_value": 5}},
            ],
        }
    ],
    "output_specs": [
        {
            "spec_id": "dummy_deliveries_bronze",
            "input_id": "dq_validator",
            "write_type": "overwrite",
            "data_format": "delta",
            "location": "s3://my_data_product_bucket/bronze/dummy_deliveries_dq_template/",
        }
    ],
}

load_data(acon=acon)
```