Spark utils

Utilities to facilitate Spark dataframe management.

SparkUtils

Bases: object

Spark utils that help retrieve and manage dataframes.

Source code in mkdocs/lakehouse_engine/packages/utils/spark_utils.py
class SparkUtils(object):
    """Spark utils that help retrieve and manage dataframes."""

    @staticmethod
    def create_temp_view(
        df: DataFrame,
        view_name: str,
        return_prefix: bool = False,
        is_serverless: bool | None = None,
    ) -> None | str:
        """Create a temporary view from a dataframe.

        In serverless environments this creates a session-scoped temporary view,
        because serverless does not support global temporary views. The view is
        still accessible from other queries in the same session.
        In non-serverless environments it creates a global temporary view, so the
        view is also accessible from other sessions of the same Spark application.

        Args:
            df: dataframe to create the view from.
            view_name: name of the view to create.
            return_prefix: whether to return the prefix to use in queries
                for this view.
            is_serverless: whether the workload is running in a serverless
                environment. If None, defaults to ExecEnv.IS_SERVERLESS.

        Returns:
            None or the prefix to use in queries for this view, depending on the
            value of return_prefix.
        """
        if is_serverless is None:
            is_serverless = ExecEnv.IS_SERVERLESS

        if is_serverless:
            df.createOrReplaceTempView(view_name)
            prefix = ""
        else:
            df.createOrReplaceGlobalTempView(view_name)
            prefix = "global_temp."
        if return_prefix:
            return prefix
        return None

create_temp_view(df, view_name, return_prefix=False, is_serverless=None) staticmethod

Create a temporary view from a dataframe.

In serverless environments this creates a session-scoped temporary view, because serverless does not support global temporary views. The view is still accessible from other queries in the same session. In non-serverless environments it creates a global temporary view, so the view is also accessible from other sessions of the same Spark application.

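Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | dataframe to create the view from. | required |
| view_name | str | name of the view to create. | required |
| return_prefix | bool | whether to return the prefix to use in queries for this view. | False |
| is_serverless | bool \| None | whether the workload is running in a serverless environment. If None, defaults to ExecEnv.IS_SERVERLESS. | None |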

Returns:

| Type | Description |
|---|---|
| None \| str | The prefix to use in queries for this view ("" for a temporary view, "global_temp." for a global temporary view) if return_prefix is True; otherwise None. |
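
The returned prefix is meant to be prepended to the view name when building a query. The following is a minimal sketch of that pattern, not the library's own code: `FakeDataFrame` and `build_query` are hypothetical names introduced here so the example runs without a Spark session, and the local `create_temp_view` only mirrors the branching shown in the source above (it omits the `ExecEnv.IS_SERVERLESS` fallback).

```python
from typing import Optional


class FakeDataFrame:
    """Hypothetical stand-in for a Spark DataFrame; it only records view calls."""

    def __init__(self) -> None:
        self.calls = []

    def createOrReplaceTempView(self, name: str) -> None:
        self.calls.append(("temp", name))

    def createOrReplaceGlobalTempView(self, name: str) -> None:
        self.calls.append(("global_temp", name))


def create_temp_view(
    df, view_name: str, return_prefix: bool = False, is_serverless: bool = False
) -> Optional[str]:
    """Mirror of the documented branching, without the ExecEnv fallback."""
    if is_serverless:
        df.createOrReplaceTempView(view_name)
        prefix = ""
    else:
        df.createOrReplaceGlobalTempView(view_name)
        prefix = "global_temp."
    return prefix if return_prefix else None


def build_query(view_name: str, prefix: Optional[str]) -> str:
    """Qualify the view name with the prefix; None is treated as no prefix."""
    return f"SELECT * FROM {prefix or ''}{view_name}"


df = FakeDataFrame()
serverless_sql = build_query(
    "orders", create_temp_view(df, "orders", return_prefix=True, is_serverless=True)
)
classic_sql = build_query(
    "orders", create_temp_view(df, "orders", return_prefix=True, is_serverless=False)
)
print(serverless_sql)  # SELECT * FROM orders
print(classic_sql)     # SELECT * FROM global_temp.orders
```

With a real Spark session, the same idea would apply to `SparkUtils.create_temp_view`: request the prefix with `return_prefix=True` and prepend it to the view name, so the query resolves in both serverless and non-serverless environments.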
