Recipes

delete_extra_columns

delete_extra_columns(df)

This context manager provides commonly needed functionality. Unlike pandas, which lets you compute a temporary quantity in a separate Series or a numpy array, pyspark requires you to create a new column even for temporary quantities.

This context manager ensures that any new columns you create within it are deleted (in place) afterwards, even if your code raises an Exception. Any columns that you start with will not be deleted (unless, of course, you deliberately delete them yourself).

Note that this context manager will not prevent you from overwriting any column (new or otherwise).
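
For example, here is a minimal sketch (assuming an active SparkSession named spark, as in the examples below) showing that an overwrite of an existing column persists, while a newly created column is removed:

>>> from flicker import FlickerDataFrame, delete_extra_columns
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, ['a', 'b'])
>>> with delete_extra_columns(df) as names_to_keep:
...     df['a'] = 0  # overwrite of an existing column; not undone on exit
...     df['c'] = 1  # new column; deleted on exit
>>> print(df.names)  # 'a' keeps its overwritten values, but 'c' is gone
['a', 'b']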

Parameters:

    df : FlickerDataFrame (required)
        The FlickerDataFrame object from which extra columns will be deleted.

Yields:

    names_to_keep : list
        The list of column names to keep; any column not in this list when the context manager exits will be deleted.

Raises:

    TypeError
        If df is not an instance of FlickerDataFrame.

Examples:

>>> from pyspark.sql import SparkSession
>>> from flicker import FlickerDataFrame, delete_extra_columns
>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, ['a', 'b'])
>>> df
FlickerDataFrame[a: double, b: double]
>>> with delete_extra_columns(df) as names_to_keep:
...     print(names_to_keep)
...     df['c'] = 1
...     print(df.names)
['a', 'b']
['a', 'b', 'c']
>>> print(df.names)  # 'c' is deleted automatically on exit
['a', 'b']
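
The cleanup also runs when the code inside the context manager raises an exception. A minimal sketch, continuing from the example above:

>>> try:
...     with delete_extra_columns(df) as names_to_keep:
...         df['temp'] = 0
...         raise RuntimeError('something went wrong')
... except RuntimeError:
...     pass
>>> print(df.names)  # 'temp' was deleted despite the exception
['a', 'b']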

find_empty_columns

find_empty_columns(df, verbose=True)

A very opinionated function that returns the names of 'empty' columns in a FlickerDataFrame.

A column is considered empty if all of its values are None or have length 0. Note that a column with all NaNs is not considered empty.

Parameters:

    df : FlickerDataFrame (required)
        The DataFrame object to check for empty columns.
    verbose : bool (default: True)
        Whether to print progress information while checking the columns.

Returns:

    list[str]
        A list of names of the empty columns found in the DataFrame.

Raises:

    TypeError
        If the provided df is not of type FlickerDataFrame.

Examples:

>>> import numpy as np
>>> from pyspark.sql import SparkSession
>>> from flicker import FlickerDataFrame, find_empty_columns
>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['col1', 'col2'], fill='rowseq')
>>> df['col3'] = None
>>> df['col4'] = np.nan
>>> empty_cols = find_empty_columns(df)
>>> print(empty_cols)
['col3']
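
Because values with length 0 also count as empty, a column consisting entirely of empty strings would be reported as well. A hedged sketch, continuing from the example above (the expected output assumes the empty-string assignment behaves like the None and NaN assignments):

>>> df['col5'] = ''  # every value has length 0
>>> find_empty_columns(df, verbose=False)
['col3', 'col5']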