# Schema utils

Utilities to facilitate dataframe schema management.

## SchemaUtils

Bases: `object`

Schema utils that help retrieve and manage schemas of dataframes.
Source code in mkdocs/lakehouse_engine/packages/utils/schema_utils.py
### from_dict(struct_type)

*staticmethod*

Get a spark schema from a dict.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `struct_type` | `dict` | Dict containing a spark schema structure. Check here. | *required* |

Returns:

| Type | Description |
|---|---|
| `StructType` | Spark schema struct type. |
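The dict passed to `from_dict` follows Spark's JSON representation of a `StructType`. A minimal sketch of what such a dict looks like (the field names are invented for illustration, and the pyspark conversion step is shown only in a comment so the snippet runs without pyspark installed):

```python
import json

# Illustrative dict in the structure that Spark's StructType JSON
# representation uses (the same structure StructType.fromJson() accepts).
schema_dict = {
    "type": "struct",
    "fields": [
        {"name": "id", "type": "long", "nullable": False, "metadata": {}},
        {"name": "name", "type": "string", "nullable": True, "metadata": {}},
    ],
}

# Round-trip through JSON to confirm the structure is serialisable,
# as it would be when stored in a config file.
as_json = json.dumps(schema_dict)
assert json.loads(as_json) == schema_dict

# With pyspark available, the conversion would be roughly:
#   from pyspark.sql.types import StructType
#   schema = StructType.fromJson(schema_dict)
print([f["name"] for f in schema_dict["fields"]])  # → ['id', 'name']
```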
### from_file(file_path, disable_dbfs_retry=False)

*staticmethod*

Get a spark schema from a file (spark StructType json file) in a file system.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `str` | Path of the file in a file system. Check here. | *required* |
| `disable_dbfs_retry` | `bool` | Optional flag to disable the DBFS retry when accessing file storage. | `False` |

Returns:

| Type | Description |
|---|---|
| `StructType` | Spark schema struct type. |
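A rough sketch of what the file-based helpers do (the file name and schema contents are invented; the pyspark conversion performed by `from_file` is only indicated in a comment so the example runs with the standard library alone):

```python
import json
import tempfile
from pathlib import Path

# Write a StructType JSON document to a temporary file, then read it back.
schema_json = {
    "type": "struct",
    "fields": [
        {"name": "order_id", "type": "string", "nullable": True, "metadata": {}}
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    schema_path = Path(tmp) / "order_schema.json"
    schema_path.write_text(json.dumps(schema_json))

    # Loading the file yields the dict form (what from_file_to_dict returns);
    # from_file would additionally convert it, roughly via
    # StructType.fromJson(loaded).
    loaded = json.loads(schema_path.read_text())

print(loaded["fields"][0]["name"])  # → order_id
```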
### from_file_to_dict(file_path, disable_dbfs_retry=False)

*staticmethod*

Get a dict with the spark schema from a file in a file system.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `str` | Path of the file in a file system. Check here. | *required* |
| `disable_dbfs_retry` | `bool` | Optional flag to disable the DBFS retry when accessing file storage. | `False` |

Returns:

| Type | Description |
|---|---|
| `Any` | Spark schema in a dict. |
### from_input_spec(input_spec)

*classmethod*

Get a spark schema from an input specification.

This covers scenarios where the schema is provided as part of the input specification of the algorithm. The schema can come either from the table specified in the input specification (enforce_schema_from_table) or from the dict with the spark schema provided there.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `input_spec` | `InputSpec` | Input specification. | *required* |

Returns:

| Type | Description |
|---|---|
| `Optional[StructType]` | Spark schema struct type. |
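The branching described above can be sketched as follows. Note that `FakeInputSpec` is a made-up stand-in for the library's `InputSpec`, the precedence shown (table wins over inline dict) is an assumption for illustration, and the real method returns a `StructType` rather than a string:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical, simplified stand-in for the library's InputSpec.
@dataclass
class FakeInputSpec:
    enforce_schema_from_table: Optional[str] = None
    schema: Optional[dict] = None

def resolve_schema_source(spec: FakeInputSpec) -> Optional[str]:
    # Sketch of the branching only; the real method builds a StructType.
    if spec.enforce_schema_from_table:
        return f"table:{spec.enforce_schema_from_table}"
    if spec.schema is not None:
        return "dict"
    return None

print(resolve_schema_source(FakeInputSpec(enforce_schema_from_table="db.orders")))
print(resolve_schema_source(FakeInputSpec(schema={"type": "struct", "fields": []})))
print(resolve_schema_source(FakeInputSpec()))  # no schema in the spec → None
```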
### from_table_schema(table)

*staticmethod*

Get a spark schema from a table.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `str` | Table name from which to inherit the schema. | *required* |

Returns:

| Type | Description |
|---|---|
| `StructType` | Spark schema struct type. |
### schema_flattener(schema, prefix=None, level=1, max_level=None, shorten_names=False, alias=True, num_chars=7, ignore_cols=None)

*staticmethod*

Recursive method to flatten the schema of the dataframe.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `schema` | `StructType` | Schema to be flattened. | *required* |
| `prefix` | `str` | Prefix of the struct to get the value for. Only relevant for the internal recursive logic. | `None` |
| `level` | `int` | Level of depth in the schema being flattened. Only relevant for the internal recursive logic. | `1` |
| `max_level` | `int` | Level until which you want to flatten the schema. Default: None. | `None` |
| `shorten_names` | `bool` | Whether to shorten the names of the prefixes of the fields being flattened or not. Default: False. | `False` |
| `alias` | `bool` | Whether to define an alias for the columns being flattened or not. Default: True. | `True` |
| `num_chars` | `int` | Number of characters to consider when shortening the names of the fields. Default: 7. | `7` |
| `ignore_cols` | `List` | Columns which you don't want to flatten. Default: None. | `None` |

Returns:

| Type | Description |
|---|---|
| `List` | A function to be called in the `.transform()` spark function. |
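The recursion can be sketched over the dict form of a schema. This is a simplified re-implementation for illustration, not the library's code: it works on plain dicts instead of `StructType` so it runs without pyspark, and it only produces the flattened dotted column names, whereas the real method emits column expressions with optional aliasing and name shortening:

```python
from typing import List, Optional

def flatten_names(
    schema: dict,
    prefix: Optional[str] = None,
    level: int = 1,
    max_level: Optional[int] = None,
    ignore_cols: Optional[List[str]] = None,
) -> List[str]:
    """Recursively collect dotted column names from a StructType-style dict."""
    cols = []
    for field in schema["fields"]:
        name = f"{prefix}.{field['name']}" if prefix else field["name"]
        is_struct = isinstance(field["type"], dict) and field["type"].get("type") == "struct"
        reached_max = max_level is not None and level >= max_level
        if is_struct and name not in (ignore_cols or []) and not reached_max:
            # Recurse into nested structs, extending the dotted prefix.
            cols += flatten_names(field["type"], name, level + 1, max_level, ignore_cols)
        else:
            # Leaf field, ignored column, or max depth reached: keep as-is.
            cols.append(name)
    return cols

nested = {
    "type": "struct",
    "fields": [
        {"name": "id", "type": "long", "nullable": False, "metadata": {}},
        {
            "name": "address",
            "type": {
                "type": "struct",
                "fields": [
                    {"name": "city", "type": "string", "nullable": True, "metadata": {}},
                    {"name": "zip", "type": "string", "nullable": True, "metadata": {}},
                ],
            },
            "nullable": True,
            "metadata": {},
        },
    ],
}

print(flatten_names(nested))               # → ['id', 'address.city', 'address.zip']
print(flatten_names(nested, max_level=1))  # → ['id', 'address']
```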