Validations Failing

The scenarios presented on this page are similar to each other; their goal is to show what happens when a DQ expectation fails the validations. The logs generated by executing the code contain information about which expectation(s) failed and why.

1. Fail on Error

In this scenario, two parameters are specified:

  • "fail_on_error": False - this parameter controls what happens when a DQ expectation fails. If it is set to true (the default), the job fails/is aborted and an exception is raised. If it is set to false, a log message about the error is printed instead (as shown in this scenario), and the result status is also available in the result sink (if configured) and in the [data docs great expectation site](../data_quality.html#3-data-docs-website). In this scenario it is set to `false` to avoid failing the execution of the notebook.
  • the max_value of the function expect_table_column_count_to_be_between is deliberately set to a value that the data does not meet, so that this expectation fails the validations.

from lakehouse_engine.engine import load_data

acon = {
    "input_specs": [
        {
            "spec_id": "dummy_deliveries_source",
            "read_type": "batch",
            "data_format": "csv",
            "options": {
                "header": True,
                "delimiter": "|",
                "inferSchema": True,
            },
            "location": "s3://my_data_product_bucket/dummy_deliveries/",
        }
    ],
    "dq_specs": [
        {
            "spec_id": "dq_validator",
            "input_id": "dummy_deliveries_source",
            "dq_type": "validator",
            "bucket": "my_data_product_bucket",
            "data_docs_bucket": "my_dq_data_docs_bucket",
            "data_docs_prefix": "dq/my_data_product/data_docs/site/",
            "result_sink_db_table": "my_database.dq_result_sink",
            "result_sink_location": "my_dq_path/dq_result_sink/",
            "tbl_to_derive_pk": "my_database.dummy_deliveries",
            "source": "deliveries_fail",
            "fail_on_error": False,
            "dq_functions": [
                {"function": "expect_column_to_exist", "args": {"column": "salesorder"}},
                {"function": "expect_table_row_count_to_be_between", "args": {"min_value": 15, "max_value": 20}},
                {"function": "expect_table_column_count_to_be_between", "args": {"max_value": 5}},
                {"function": "expect_column_values_to_be_null", "args": {"column": "article"}},
                {"function": "expect_column_values_to_be_unique", "args": {"column": "status"}},
                {
                    "function": "expect_column_min_to_be_between",
                    "args": {"column": "delivery_item", "min_value": 1, "max_value": 15},
                },
                {
                    "function": "expect_column_max_to_be_between",
                    "args": {"column": "delivery_item", "min_value": 15, "max_value": 30},
                },
            ],
        }
    ],
    "output_specs": [
        {
            "spec_id": "dummy_deliveries_bronze",
            "input_id": "dq_validator",
            "write_type": "overwrite",
            "data_format": "delta",
            "location": "s3://my_data_product_bucket/bronze/dummy_deliveries_dq_template/",
        }
    ],
}

load_data(acon=acon)

If you run the command below, you can see that the success column has the value false for the last execution.

display(spark.table(RENDER_UTILS.render_content("my_database.dq_result_sink")))
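To focus on the failures only, you could also filter the result sink table directly. This is a minimal sketch, reusing the spark and display calls from the command above; the failed variable name is just illustrative:

# Illustrative query over the result sink configured in the dq_specs above.
# It keeps only the records whose validation did not succeed.
failed = spark.table("my_database.dq_result_sink").filter("success = false")
display(failed)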

2. Critical Functions

In this scenario, alternative parameters to fail_on_error are used:

  • critical_functions - this parameter defaults to None if not defined. It controls which DQ functions are considered a priority: whenever a function defined as critical does not pass, the validation stops and an execution error is thrown. If any other function, not listed in this parameter, fails, only an error message is printed in the logs. This parameter has priority over fail_on_error. In this specific example, defining expect_table_column_count_to_be_between as critical ensures that the execution is stopped whenever the conditions of that function are not met.

Additionally, other parameters can also be defined, such as:

  • max_percentage_failure - this parameter defaults to None if not defined. It controls what percentage of the total functions can fail without stopping the execution of the validation. If the threshold is surpassed, the execution stops and a failure error is thrown. This parameter has priority over fail_on_error and critical_functions.

You can also pair critical_functions with max_percentage_failure, for example by defining a max percentage of failure of 0.6 and also defining some critical functions. In this case, even if the threshold is respected, the functions listed in critical_functions are still checked, as sketched below.
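For illustration, the sketch below pairs the two parameters in a single dq_specs entry. It is a hypothetical variation of the scenario that follows, assuming max_percentage_failure is set at the same level as critical_functions; the 0.6 threshold is just the example value from the paragraph above:

# Hypothetical dq_specs entry pairing max_percentage_failure with critical_functions.
# It would slot into an acon like the ones shown on this page.
dq_spec = {
    "spec_id": "dq_validator",
    "input_id": "dummy_deliveries_source",
    "dq_type": "validator",
    "bucket": "my_data_product_bucket",
    "result_sink_db_table": "my_database.dq_result_sink",
    "result_sink_location": "my_dq_path/dq_result_sink/",
    "tbl_to_derive_pk": "my_database.dummy_deliveries",
    # Abort the execution if more than 60% of the dq_functions fail.
    "max_percentage_failure": 0.6,
    "dq_functions": [
        {"function": "expect_column_to_exist", "args": {"column": "salesorder"}},
        {"function": "expect_table_row_count_to_be_between", "args": {"min_value": 15, "max_value": 25}},
    ],
    # Checked even when the overall failure percentage stays below the threshold:
    # a failure here stops the execution immediately.
    "critical_functions": [
        {"function": "expect_table_column_count_to_be_between", "args": {"max_value": 5}},
    ],
}

The full example for this scenario, using critical_functions only, follows: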

from lakehouse_engine.engine import load_data

acon = {
    "input_specs": [
        {
            "spec_id": "dummy_deliveries_source",
            "read_type": "batch",
            "data_format": "csv",
            "options": {
                "header": True,
                "delimiter": "|",
                "inferSchema": True,
            },
            "location": "s3://my_data_product_bucket/dummy_deliveries/",
        }
    ],
    "dq_specs": [
        {
            "spec_id": "dq_validator",
            "input_id": "dummy_deliveries_source",
            "dq_type": "validator",
            "bucket": "my_data_product_bucket",
            "data_docs_bucket": "my_dq_data_docs_bucket",
            "data_docs_prefix": "dq/my_data_product/data_docs/site/",
            "result_sink_db_table": "my_database.dq_result_sink",
            "result_sink_location": "my_dq_path/dq_result_sink/",
            "source": "deliveries_critical",
            "tbl_to_derive_pk": "my_database.dummy_deliveries",
            "dq_functions": [
                {"function": "expect_column_to_exist", "args": {"column": "salesorder"}},
                {"function": "expect_table_row_count_to_be_between", "args": {"min_value": 15, "max_value": 25}},
            ],
            "critical_functions": [
                {"function": "expect_table_column_count_to_be_between", "args": {"max_value": 5}},
            ],
        }
    ],
    "output_specs": [
        {
            "spec_id": "dummy_deliveries_bronze",
            "input_id": "dq_validator",
            "write_type": "overwrite",
            "data_format": "delta",
            "location": "s3://my_data_product_bucket/bronze/dummy_deliveries_dq_template/",
        }
    ],
}

load_data(acon=acon)