Recipes

delete_extra_columns

delete_extra_columns(df)

This context manager provides commonly needed functionality. Unlike pandas, which lets you compute a temporary quantity in a separate Series or a numpy array, pyspark requires you to create a new column even for temporary quantities.

This context manager ensures that any new columns you create within it are deleted (in place) afterwards, even if your code raises an Exception. Any columns that you start with will not be deleted (unless, of course, you deliberately delete them yourself).

Note that this context manager will not prevent you from overwriting any column (new or otherwise).
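
For example, here is a minimal sketch (assuming an active SparkSession named spark, as in the examples below) showing that an overwrite of an existing column persists, while a newly created column is removed:

>>> from flicker import FlickerDataFrame, delete_extra_columns
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, ['a', 'b'])
>>> with delete_extra_columns(df) as names_to_keep:
...     df['a'] = 0  # overwrite of an existing column; not undone on exit
...     df['c'] = 1  # new column; deleted on exit
>>> print(df.names)  # 'a' keeps its overwritten values, but 'c' is gone
['a', 'b']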

Parameters:

    df : FlickerDataFrame (required)
        The FlickerDataFrame object from which extra columns will be deleted.

Yields:

    names_to_keep : list
        The list of column names to keep; any column not in this list when the context manager exits will be deleted.

Raises:

    TypeError
        If df is not an instance of FlickerDataFrame.

Examples:

>>> from pyspark.sql import SparkSession
>>> from flicker import FlickerDataFrame, delete_extra_columns
>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, ['a', 'b'])
>>> df
FlickerDataFrame[a: double, b: double]
>>> with delete_extra_columns(df) as names_to_keep:
...     print(names_to_keep)
...     df['c'] = 1
...     print(df.names)
['a', 'b']
['a', 'b', 'c']
>>> print(df.names)  # 'c' is deleted automatically on exit
['a', 'b']
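
The cleanup also runs when the code inside the context manager raises an exception. A minimal sketch, continuing from the example above:

>>> try:
...     with delete_extra_columns(df) as names_to_keep:
...         df['temp'] = 0
...         raise RuntimeError('something went wrong')
... except RuntimeError:
...     pass
>>> print(df.names)  # 'temp' was deleted despite the exception
['a', 'b']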

find_empty_columns

find_empty_columns(df, verbose=True)

A very opinionated function that returns the names of 'empty' columns in a FlickerDataFrame.

A column is considered empty if all of its values are None or have length 0. Note that a column with all NaNs is not considered empty.

Parameters:

    df : FlickerDataFrame (required)
        The DataFrame object to check for empty columns.
    verbose : bool (default: True)
        Whether to print progress information while checking the columns.

Returns:

    list[str]
        A list of names of the empty columns found in the DataFrame.

Raises:

    TypeError
        If the provided df is not of type FlickerDataFrame.

Examples:

>>> import numpy as np
>>> from pyspark.sql import SparkSession
>>> from flicker import FlickerDataFrame, find_empty_columns
>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['col1', 'col2'], fill='rowseq')
>>> df['col3'] = None
>>> df['col4'] = np.nan
>>> empty_cols = find_empty_columns(df)
>>> print(empty_cols)
['col3']
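
Because values with length 0 also count as empty, a column consisting entirely of empty strings would be reported as well. A hedged sketch, continuing from the example above (the expected output assumes the empty-string assignment behaves like the None and NaN assignments):

>>> df['col5'] = ''  # every value has length 0
>>> find_empty_columns(df, verbose=False)
['col3', 'col5']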