Recipes
delete_extra_columns
delete_extra_columns(df)
This context manager provides commonly needed functionality. Unlike pandas, which lets you compute a temporary quantity in a separate Series or a numpy array, pyspark requires you to create a new column even for temporary quantities.
This context manager ensures that any new columns you create within it are deleted (in-place) when it exits, even if your code raises an Exception. Columns that existed when the context was entered are not deleted (unless, of course, you deliberately delete them yourself).
Note that this context manager will not prevent you from overwriting any column (new or otherwise).
Parameters:

Name | Type | Description | Default
---|---|---|---
df | FlickerDataFrame | The FlickerDataFrame object from which extra columns will be deleted. | required
Yields:

Name | Type | Description
---|---|---
names_to_keep | list | A list of column names to keep after deleting extra columns.
Raises:

Type | Description
---|---
TypeError | If df is not a FlickerDataFrame.
Examples:
>>> from pyspark.sql import SparkSession
>>> from flicker import FlickerDataFrame, delete_extra_columns
>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, ['a', 'b'])
>>> df
FlickerDataFrame[a: double, b: double]
>>> with delete_extra_columns(df) as names_to_keep:
... print(names_to_keep)
... df['c'] = 1
... print(df.names)
['a', 'b']
['a', 'b', 'c']
>>> print(df.names)
['a', 'b'] # 'c' column is deleted automatically
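The cleanup also happens when the block exits with an exception. The following is a minimal sketch continuing from the example above; the RuntimeError merely stands in for any failure inside the block:
>>> try:
...     with delete_extra_columns(df) as names_to_keep:
...         df['c'] = 1
...         raise RuntimeError('something went wrong')
... except RuntimeError:
...     pass
>>> print(df.names)
['a', 'b']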
find_empty_columns
find_empty_columns(df, verbose=True)
A very opinionated function that returns the names of 'empty' columns in a FlickerDataFrame.
A column is considered empty if all of its values are None or have length 0. Note that a column with all NaNs is not considered empty.
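This rule is easy to state directly in PySpark. The sketch below is only an illustration of the definition above, not flicker's implementation; it builds a plain pyspark.sql.DataFrame and checks each column with the same None-or-length-0 criterion (casting to string is one way to make the length check well-defined for non-string columns):
>>> from pyspark.sql import SparkSession, functions as F
>>> spark = SparkSession.builder.getOrCreate()
>>> spark_df = spark.createDataFrame([(None, float('nan'), ''), (None, float('nan'), '')],
...                                  'col3 string, col4 double, col5 string')
>>> def looks_empty(name):
...     c = F.col(name)
...     # A value counts as non-empty if it is not null and, rendered as a string,
...     # has length > 0. A double NaN renders as 'NaN', so an all-NaN column is not empty.
...     return spark_df.where(c.isNotNull() & (F.length(c.cast('string')) > 0)).count() == 0
>>> [name for name in spark_df.columns if looks_empty(name)]
['col3', 'col5']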
Parameters:

Name | Type | Description | Default
---|---|---|---
df | FlickerDataFrame | The DataFrame object to check for empty columns. | required
verbose | bool | Flag indicating whether to print progress information while checking the columns. | True
Returns:

Type | Description
---|---
list[str] | A list of names of empty columns found in the DataFrame.
Raises:

Type | Description
---|---
TypeError | If the provided df is not a FlickerDataFrame.
Examples:
>>> import numpy as np
>>> from pyspark.sql import SparkSession
>>> from flicker import FlickerDataFrame, find_empty_columns
>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['col1', 'col2'], fill='rowseq')
>>> df['col3'] = None
>>> df['col4'] = np.nan
>>> empty_cols = find_empty_columns(df)
>>> print(empty_cols)
['col3']
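A typical follow-up is to delete the reported columns. The snippet below assumes that FlickerDataFrame supports pandas-style in-place deletion with del df[name]; if your version of flicker provides a different deletion method, use that instead:
>>> for name in empty_cols:
...     del df[name]  # assumed pandas-style deletion; substitute flicker's own method if needed
>>> print(df.names)
['col1', 'col2', 'col4']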