FlickerDataFrame

FlickerDataFrame is a wrapper over pyspark.sql.DataFrame. FlickerDataFrame provides a modern, clean, intuitive, pythonic, polars-like API over a pyspark backend.

Construct a FlickerDataFrame from a pyspark.sql.DataFrame. Construction will fail if the pyspark.sql.DataFrame contains duplicate column names.
Parameters:
Name | Type | Description | Default
---|---|---|---
df | DataFrame | The input pyspark DataFrame to wrap | required
Raises:
Type | Description
---|---
TypeError | If the df parameter is not an instance of pyspark.sql.DataFrame
ValueError | If the df parameter contains duplicate column names
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> rows = [('spark', 1), ('pandas', 3), ('polars', 2)]
>>> spark_df = spark.createDataFrame(rows, schema=['package', 'rank'])
>>> df = FlickerDataFrame(spark_df)
>>> df()
package rank
0 spark 1
1 pandas 3
2 polars 2
dtypes (property)

Returns the column names and corresponding data types as an OrderedDict. The order of key-value pairs in the output matches the left-to-right order of columns in the dataframe.

Returns:
Type | Description
---|---
OrderedDict | Keys are column names and values are dtypes
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['col1', 'col2'], fill='zero')
>>> df.dtypes
OrderedDict([('col1', 'bigint'), ('col2', 'bigint')])
names (property)

Returns a list of column names in the FlickerDataFrame.

Returns:
Type | Description
---|---
list[str] | List of column names in order of occurrence
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['col1', 'col2'], fill='zero')
>>> df.names
['col1', 'col2']
ncols (property)

Returns the number of columns. This property returns immediately regardless of the number of rows in the dataframe.

Returns:
Type | Description
---|---
int | Number of columns
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['col1', 'col2'], fill='zero')
>>> df.ncols
2
nrows (property)

Returns the number of rows. Computing this may take a long time because all the rows in the dataframe must be counted. Once computed, the row count is automatically cached until the dataframe is mutated; the cached count is returned immediately without re-counting.

Returns:
Type | Description
---|---
int | Number of rows
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 1000, 2, names=['col1', 'col2'], fill='zero')
>>> df.nrows
1000
schema (property)

Returns the dataframe schema as an object of type pyspark.sql.types.StructType.
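Examples (an illustrative sketch; the exact StructType repr varies across pyspark versions):
>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['col1', 'col2'], fill='zero')
>>> df.schema
StructType([StructField('col1', LongType(), True), StructField('col2', LongType(), True)])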
shape (property)

Returns the shape of the FlickerDataFrame as (nrows, ncols).

Returns:
Type | Description
---|---
tuple[int, int] | Shape as (nrows, ncols)
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['col1', 'col2'], fill='zero')
>>> df.shape
(3, 2)
__call__
__call__(n=5, use_pandas_dtypes=False)

Return a selection of the underlying pyspark.sql.DataFrame as a pandas.DataFrame.
Parameters:
Name | Type | Description | Default
---|---|---|---
n | int or None | Number of rows to return. If not specified, defaults to 5. If df.nrows < n, only df.nrows rows are returned. If n=None, all rows are returned. | 5
use_pandas_dtypes | bool | If False (recommended and default), the resulting pandas.DataFrame will have all column dtypes as object; this option preserves NaNs and Nones as-is. If True, the resulting pandas.DataFrame will have parsed dtypes; this option may be a little faster, but it allows pandas to convert Nones in numeric columns to NaNs. | False

Returns:
Type | Description
---|---
DataFrame | pandas DataFrame
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> rows = [('spark', 1), ('pandas', 3), ('polars', 2)]
>>> df = FlickerDataFrame.from_rows(spark, rows, names=['package', 'rank'])
>>> df() # call the FlickerDataFrame to quickly see a snippet
package rank
0 spark 1
1 pandas 3
2 polars 2
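The effect of use_pandas_dtypes can be checked via the returned pandas dtypes. A sketch (exact dtype names may vary with the pandas version):
>>> df(use_pandas_dtypes=False).dtypes
package    object
rank       object
dtype: object
>>> df(use_pandas_dtypes=True).dtypes
package    object
rank        int64
dtype: object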
__getitem__
__getitem__(item)

Index into the dataframe in various ways.

Parameters:
Name | Type | Description | Default
---|---|---|---
item | tuple, slice, str, list, Column, or FlickerColumn | The index value to retrieve from the FlickerDataFrame object | required

Returns:
Type | Description
---|---
FlickerColumn or FlickerDataFrame | Depends on the type of the index value, as detailed below

- str: a FlickerColumn containing the column named by the string.
- Column: a new FlickerDataFrame with only the specified column.
- FlickerColumn: a new FlickerDataFrame with only the column of the FlickerColumn object.
- slice: a new FlickerDataFrame with the columns specified by the slice.
- tuple of two slices: a new FlickerDataFrame with the columns specified by the second slice, limited by the stop value of the first slice.
- iterable: a new FlickerDataFrame with the columns specified by the elements of the iterable.

Raises:
Type | Description
---|---
KeyError | If the index value is not a supported index type
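Examples (a sketch of the common indexing forms; outputs assume 'rowseq' fills values row by row, as in the from_shape examples on this page):
>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 3, names=['a', 'b', 'c'], fill='rowseq')
>>> df[['a', 'b']]()  # list of names -> FlickerDataFrame with those columns
   a  b
0  0  1
1  3  4
2  6  7
>>> df[:2, :]()  # two slices -> all columns, limited to the first 2 rows
   a  b  c
0  0  1  2
1  3  4  5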
_ipython_key_completions_
_ipython_key_completions_()

Provide the list of auto-completions for __getitem__ keys (not attributes), triggered by df["c + TAB. Note that attribute completion is separate and happens automatically even when dir() is not explicitly defined.
See https://ipython.readthedocs.io/en/stable/config/integrating.html
This method enables auto-completion in both Jupyter notebooks and the IPython terminal.
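Assuming the completions are simply the column names (consistent with the names property), a quick check might look like this:
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['col1', 'col2'], fill='zero')
>>> df._ipython_key_completions_()
['col1', 'col2']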
concat
concat(other, ignore_names=False)

Return a new FlickerDataFrame with the rows of this dataframe and the other dataframe concatenated together. This is a non-mutating method that calls pyspark.sql.DataFrame.union after some checks. The resulting concatenated DataFrame always contains the same column names, in the same order, as the current DataFrame.
Parameters:
Name | Type | Description | Default
---|---|---|---
other | FlickerDataFrame or DataFrame | The DataFrame to concatenate with the current DataFrame | required
ignore_names | bool, optional | If True, the column names of other are ignored and rows are concatenated positionally, keeping this dataframe's column names. If False, other must contain the same column names as the current DataFrame. | False

Returns:
Type | Description
---|---
FlickerDataFrame | The concatenated DataFrame
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> df_zero = FlickerDataFrame.from_shape(spark, 2, 2, names=['a', 'b'], fill='zero')
>>> df_one = FlickerDataFrame.from_shape(spark, 2, 2, names=['a', 'b'], fill='one')
>>> df_rand = FlickerDataFrame.from_shape(spark, 2, 2, names=['b', 'c'], fill='rand')
>>> df_zero.concat(df_one)
FlickerDataFrame[a: bigint, b: bigint]
>>> df_zero.concat(df_one, ignore_names=False)()
a b
0 0 0
1 0 0
2 1 1
3 1 1
>>> df_zero.concat(df_one, ignore_names=True)() # ignore_names has no effect
a b
0 0 0
1 0 0
2 1 1
3 1 1
>>> df_zero.concat(df_rand, ignore_names=True)()
a b
0 0.0 0.0
1 0.0 0.0
2 0.85428 0.148739
3 0.031665 0.14922
>>> df_zero.concat(df_rand, ignore_names=False) # KeyError
describe
describe()

Returns a pandas.DataFrame with a statistical summary of the FlickerDataFrame. This method supports numeric (int, bigint, float, double), string, timestamp, and boolean columns. Unsupported columns are ignored without an error. This method returns a different, better-typed output than pyspark.sql.DataFrame.describe. The output contains count, mean, stddev, min, and max values.

Returns:
Type | Description
---|---
DataFrame | A pandas DataFrame with a statistical summary of the FlickerDataFrame
Examples:
>>> from datetime import datetime, timedelta
>>> spark = SparkSession.builder.getOrCreate()
>>> t = datetime(2023, 1, 1)
>>> dt = timedelta(days=1)
>>> rows = [('Bob', 23, 100.0, t - dt, False), ('Alice', 22, 110.0, t, True), ('Tom', 21, 120.0, t + dt, False)]
>>> names = ['name', 'age', 'weight', 'time', 'is_jedi']
>>> df = FlickerDataFrame.from_rows(spark, rows, names)
>>> df()
name age weight time is_jedi
0 Bob 23 100.0 2022-12-31 00:00:00 False
1 Alice 22 110.0 2023-01-01 00:00:00 True
2 Tom 21 120.0 2023-01-02 00:00:00 False
>>> df.describe()
name age weight time is_jedi
count 3 3 3 3 3
max Tom 23 120.0 2023-01-02 00:00:00 True
mean NaN 22.0 110.0 2023-01-01 00:00:00 0.333333
min Alice 21 100.0 2022-12-31 00:00:00 False
stddev NaN 1.0 10.0 1 day, 0:00:00 0.57735
>>> df.describe()['time']['stddev'] # output contains appropriately typed values instead of strings
datetime.timedelta(days=1)
drop
drop(names)

Delete columns by name. This is the non-mutating form of the __del__ method.

Parameters:
Name | Type | Description | Default
---|---|---|---
names | list[str] | A list of column names to delete from the FlickerDataFrame | required

Returns:
Type | Description
---|---
FlickerDataFrame | A new instance of the FlickerDataFrame class with the specified columns deleted
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 4, names=['col1', 'col2', 'col3', 'col4'], fill='zero')
>>> df
FlickerDataFrame[col1: bigint, col2: bigint, col3: bigint, col4: bigint]
>>> df.drop(['col2', 'col4'])
FlickerDataFrame[col1: bigint, col3: bigint]
>>> df
FlickerDataFrame[col1: bigint, col2: bigint, col3: bigint, col4: bigint]
from_columns (classmethod)
from_columns(spark, columns, names=None, nan_to_none=True)

Create a FlickerDataFrame from columns.

Parameters:
Name | Type | Description | Default
---|---|---|---
spark | SparkSession | The Spark session used for creating the DataFrame | required
columns | Iterable[Iterable] | The columns to create the DataFrame from. Each column should be an iterable; for example, [[1, 2, 3], ['a', 'b', 'c']]. | required
names | list[str] or None | The column names of the DataFrame. If None, column names are generated as '0', '1', ..., str(ncols - 1). | None
nan_to_none | bool | Flag indicating whether to convert all NaN values to None. Default and recommended value is True. | True

Returns:
Type | Description
---|---
FlickerDataFrame | A new FlickerDataFrame built from the given columns
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> columns = [[1, 2, 3], ['a', 'b', 'c']]
>>> names = ['col1', 'col2']
>>> df = FlickerDataFrame.from_columns(spark, columns, names)
>>> df()
col1 col2
0 1 a
1 2 b
2 3 c
Raises:
Type | Description
---|---
ValueError | If the columns contain different numbers of rows
from_dict (classmethod)
from_dict(spark, data, nan_to_none=True)

Create a FlickerDataFrame object from a dictionary, in which dict keys represent column names and dict values represent column values.

Parameters:
Name | Type | Description | Default
---|---|---|---
spark | SparkSession | The Spark session used for creating the DataFrame | required
data | dict | The dictionary containing column names as keys and column values as values; for example, {'col1': [1, 2, 3], 'col2': [4, 5, 6]}. | required
nan_to_none | bool | Flag indicating whether to convert all NaN values to None. Default and recommended value is True. | True

Returns:
Type | Description
---|---
FlickerDataFrame | A new FlickerDataFrame built from the dictionary
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> data = {'col1': [1, 2, 3], 'col2': [4, 5, 6]}
>>> df = FlickerDataFrame.from_dict(spark, data)
>>> df()
col1 col2
0 1 4
1 2 5
2 3 6
from_pandas (classmethod)
from_pandas(spark, df, nan_to_none=True)

Create a FlickerDataFrame from a pandas.DataFrame.

Parameters:
Name | Type | Description | Default
---|---|---|---
spark | SparkSession | The Spark session used for creating the DataFrame | required
df | DataFrame | The pandas DataFrame to convert to a FlickerDataFrame | required
nan_to_none | bool | Flag indicating whether to convert all NaN values to None. Default and recommended value is True. | True

Returns:
Type | Description
---|---
FlickerDataFrame | A new FlickerDataFrame built from the pandas DataFrame
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> pandas_df = pd.DataFrame({'col1': [1, np.nan, 3], 'col2': [4, 5, np.nan]})
>>> pandas_df
col1 col2
0 1.0 4.0
1 NaN 5.0
2 3.0 NaN
>>> df = FlickerDataFrame.from_pandas(spark, pandas_df, nan_to_none=True)
>>> df()
col1 col2
0 1.0 4.0
1 None 5.0
2 3.0 None
>>> df = FlickerDataFrame.from_pandas(spark, pandas_df, nan_to_none=False)
>>> df()
col1 col2
0 1.0 4.0
1 NaN 5.0
2 3.0 NaN
from_records (classmethod)
from_records(spark, records, nan_to_none=True)

Create a FlickerDataFrame from a list of dictionaries (similar to the JSON Lines format).

Parameters:
Name | Type | Description | Default
---|---|---|---
spark | SparkSession | The Spark session used for creating the DataFrame | required
records | Iterable[dict] | An iterable of dictionaries. Each dictionary represents a row (aka record). | required
nan_to_none | bool | Flag indicating whether to convert all NaN values to None. Default and recommended value is True. | True

Returns:
Type | Description
---|---
FlickerDataFrame | A new FlickerDataFrame built from the records
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> records = [{'col1': 1, 'col2': 1}, {'col1': 2, 'col2': 2}, {'col1': 3, 'col2': 3}]
>>> df = FlickerDataFrame.from_records(spark, records)
>>> df()
col1 col2
0 1 1
1 2 2
2 3 3
from_rows (classmethod)
from_rows(spark, rows, names=None, nan_to_none=True)

Create a FlickerDataFrame from rows.

Parameters:
Name | Type | Description | Default
---|---|---|---
spark | SparkSession | The Spark session used for creating the DataFrame | required
rows | Iterable[Iterable] | The rows of data to be converted into a DataFrame; for example, [['a', 1, 2.0], ['b', 2, 4.0]]. | required
names | list[str] or None | The column names of the DataFrame. If None, column names are generated as '0', '1', ..., str(ncols - 1). | None
nan_to_none | bool | Flag indicating whether to convert all NaN values to None. Default and recommended value is True. | True

Returns:
Type | Description
---|---
FlickerDataFrame | A new FlickerDataFrame built from the rows
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> rows = [['a', 1, 2.0], ['b', 2, 4.0]]
>>> names = ['col1', 'col2', 'col3']
>>> df = FlickerDataFrame.from_rows(spark, rows, names)
>>> df()
col1 col2 col3
0 a 1 2.0
1 b 2 4.0
Raises:
Type | Description
---|---
ValueError | If the rows contain different numbers of columns
from_schema (classmethod)
from_schema(spark, schema=None, data=())

Creates a FlickerDataFrame object from a schema and, optionally, some data. This method can be very useful for creating an empty dataframe with a given schema. The best way to obtain a schema from another dataframe is df.schema. The input schema can also be empty, in which case a dataframe with no rows and no columns is created.

Parameters:
Name | Type | Description | Default
---|---|---|---
spark | SparkSession | The Spark session used for creating the DataFrame | required
schema | StructType or str or None | The schema for the DataFrame. Can be specified as a Spark StructType, as a schema string such as 'a string, b int', or as None. | None
data | RDD, Iterable, DataFrame, or np.ndarray | The data to populate the DataFrame. Can be an RDD, an iterable collection, an existing Spark DataFrame, or a NumPy array. | ()

Returns:
Type | Description
---|---
FlickerDataFrame | A new instance of FlickerDataFrame created with the provided schema and data
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> FlickerDataFrame.from_schema(spark) # Create a dataframe with no rows and no columns
FlickerDataFrame[]
>>> FlickerDataFrame.from_schema(spark, schema='a string, b int') # Create a dataframe with zero rows
FlickerDataFrame[a: string, b: int]
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['a', 'b'])
>>> df
FlickerDataFrame[a: bigint, b: bigint]
>>> FlickerDataFrame.from_schema(spark, schema=df.schema) # Create a dataframe with the same schema as df
FlickerDataFrame[a: bigint, b: bigint]
from_shape (classmethod)
from_shape(spark, nrows, ncols, names=None, fill='zero')

Create a FlickerDataFrame from a given shape and fill. This method is useful for creating test data and for experimentation.

Parameters:
Name | Type | Description | Default
---|---|---|---
spark | SparkSession | The Spark session used for creating the DataFrame | required
nrows | int | The number of rows in the DataFrame | required
ncols | int | The number of columns in the DataFrame | required
names | list[str] or None | The names of the columns in the DataFrame. If not provided, column names are generated as '0', '1', ..., str(ncols - 1). | None
fill | str | The value used for filling the DataFrame. Accepted values are: 'zero', 'one', 'rand', 'randn', 'rowseq', 'colseq'. | 'zero'

Returns:
Type | Description
---|---
FlickerDataFrame | A new instance of the FlickerDataFrame class created from the given shape and parameters
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['col1', 'col2'], fill='rowseq')
>>> df()
col1 col2
0 0 1
1 2 3
2 4 5
groupby
groupby(names)

Groups the rows of the DataFrame based on the specified column names, so that aggregations can be run on the groups. Returns a FlickerGroupedData object. This method is a pass-through to pyspark.sql.DataFrame.groupBy but returns a FlickerGroupedData object instead of a pyspark.sql.GroupedData object.

Parameters:
Name | Type | Description | Default
---|---|---|---
names | list[str] | The column names based on which the DataFrame rows should be grouped | required

Returns:
Type | Description
---|---
FlickerGroupedData | The grouped data
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> rows = [('spark', 10), ('pandas', 10), ('spark', 100)]
>>> df = FlickerDataFrame.from_rows(spark, rows, names=['name', 'number'])
>>> df.groupby(['name'])
FlickerGroupedData[grouping expressions: [name], value: [name: string, number: bigint], type: GroupBy]
>>> df.groupby(['name']).count()
FlickerDataFrame[name: string, count: bigint]
>>> df.groupby(['name']).count()()
name count
0 spark 2
1 pandas 1
head
head(n=5)

Return the top n rows as a FlickerDataFrame. This method differs from FlickerDataFrame.__call__(), which returns a pandas.DataFrame.

Parameters:
Name | Type | Description | Default
---|---|---|---
n | int or None | Number of rows to return. If not specified, defaults to 5. If df.nrows < n, only df.nrows rows are returned. If n=None, all rows are returned. | 5

Returns:
Type | Description
---|---
FlickerDataFrame | A new instance of FlickerDataFrame containing the top (at most) n rows
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 10, 2, names=['col1', 'col2'], fill='zero')
>>> df.head(3)
FlickerDataFrame[col1: bigint, col2: bigint]
join
join(right, on, how='inner', lprefix='', lsuffix='_l', rprefix='', rsuffix='_r')

Join the current FlickerDataFrame with another dataframe. This non-mutating method returns the joined dataframe as a FlickerDataFrame. This method preserves duplicate column names (those that are joined on) by renaming them in the join result.
Note that FlickerDataFrame.join is different from FlickerDataFrame.merge in both function signature and the merged/joined result.

Parameters:
Name | Type | Description | Default
---|---|---|---
right | FlickerDataFrame or DataFrame | The right DataFrame to join with the left DataFrame | required
on | dict[str, str] | Dictionary specifying which column names to join on. Keys represent column names from the left dataframe and values represent column names from the right dataframe. | required
how | str | The type of join to perform. 'inner': returns only the matching rows from both DataFrames. 'left': returns all the rows from the left DataFrame and the matching rows from the right DataFrame. 'right': returns all the rows from the right DataFrame and the matching rows from the left DataFrame. 'outer': returns all the rows from both DataFrames, including unmatched rows, with nulls for the missing side. | 'inner'
lprefix | str | Prefix to add to column names from the left dataframe that are duplicated in the join result | ''
lsuffix | str | Suffix to add to column names from the left dataframe that are duplicated in the join result | '_l'
rprefix | str | Prefix to add to column names from the right dataframe that are duplicated in the join result | ''
rsuffix | str | Suffix to add to column names from the right dataframe that are duplicated in the join result | '_r'

Returns:
Type | Description
---|---
FlickerDataFrame | The joined dataframe

Raises:
Type | Description
---|---
TypeError | If the right parameter is not a FlickerDataFrame or a pyspark.sql.DataFrame
ValueError | If the how parameter is not a supported join type
TypeError | If the keys or values of the on dictionary are not strings
KeyError | If the left or right DataFrame contains duplicate column names after renaming
NotImplementedError | To guard against unexpected changes in the underlying pyspark implementation
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> left = FlickerDataFrame.from_rows(spark, [('a', 1), ('b', 2), ('c', 3), ], ['x', 'number'])
>>> right = FlickerDataFrame.from_rows(spark, [('a', 4), ('d', 5), ('e', 6), ], ['x', 'number'])
>>> inner_join = left.join(right, on={'x': 'x'}, how='inner')
>>> inner_join() # 'x' columns from both left and right dataframes are preserved
x_l number_l x_r number_r
0 a 1 a 4
>>> spark = SparkSession.builder.getOrCreate()
>>> left = FlickerDataFrame.from_rows(spark, [('a', 1), ('b', 2), ('c', 3), ], ['x1', 'number'])
>>> right = FlickerDataFrame.from_rows(spark, [('a', 4), ('d', 5), ('e', 6), ], ['x2', 'number'])
>>> inner_join = left.join(right, on={'x1': 'x2'}, how='inner')
>>> inner_join() # renaming happens only when needed
x1 number_l x2 number_r
0 a 1 a 4
merge
merge(right, on, how='inner', lprefix='', lsuffix='_l', rprefix='', rsuffix='_r')

Merge the current FlickerDataFrame with another dataframe. This non-mutating method returns the merged dataframe as a FlickerDataFrame.
Note that FlickerDataFrame.merge is different from FlickerDataFrame.join in both function signature and the merged/joined result.

Parameters:
Name | Type | Description | Default
---|---|---|---
right | FlickerDataFrame or DataFrame | The right dataframe to merge with | required
on | Iterable[str] | Column names to join on. The column names must exist in both left and right dataframes; they appear only once in the merged result. | required
how | str | Type of join to perform. Possible values are 'inner', 'left', 'right', and 'outer'. | 'inner'
lprefix | str | Prefix to add to column names from the left dataframe that are duplicated in the merge result | ''
lsuffix | str | Suffix to add to column names from the left dataframe that are duplicated in the merge result | '_l'
rprefix | str | Prefix to add to column names from the right dataframe that are duplicated in the merge result | ''
rsuffix | str | Suffix to add to column names from the right dataframe that are duplicated in the merge result | '_r'

Returns:
Type | Description
---|---
FlickerDataFrame | The merged dataframe

Raises:
Type | Description
---|---
TypeError | If right is not a FlickerDataFrame or a pyspark.sql.DataFrame
ValueError | If how is not a supported join type
TypeError | If any element in on is not a string
KeyError | If renaming results in duplicate column names in the left dataframe
KeyError | If renaming results in duplicate column names in the right dataframe
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> left = FlickerDataFrame.from_rows(spark, [('a', 1), ('b', 2), ('c', 3), ], ['name', 'number'])
>>> right = FlickerDataFrame.from_rows(spark, [('a', 4), ('d', 5), ('e', 6), ], ['name', 'number'])
>>> inner_merge = left.merge(right, on=['name'], how='inner')
>>> inner_merge()
name number_l number_r
0 a 1 4
>>> left_merge = left.merge(right, on=['name'], how='left')
>>> left_merge()
name number_l number_r
0 a 1 4
1 b 2 None
2 c 3 None
rename
rename(from_to_mapper)

Renames columns in the FlickerDataFrame based on a provided mapping of the form {'old_col_name1': 'new_col_name1', 'old_col_name2': 'new_col_name2', ...}. This is a non-mutating method.

Parameters:
Name | Type | Description | Default
---|---|---|---
from_to_mapper | dict[str, str] | A dictionary containing the mapping of current column names to new column names | required

Returns:
Type | Description
---|---
FlickerDataFrame | A new instance of FlickerDataFrame with the renamed columns

Raises:
Type | Description
---|---
TypeError | If the provided from_to_mapper is not a dict
KeyError | If any of the keys in from_to_mapper is not an existing column name
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 4, names=['col1', 'col2', 'col3', 'col4'], fill='zero')
>>> df
FlickerDataFrame[col1: bigint, col2: bigint, col3: bigint, col4: bigint]
>>> df.rename({'col1': 'col_a', 'col3': 'col_c'})
FlickerDataFrame[col_a: bigint, col2: bigint, col_c: bigint, col4: bigint]
>>> df # df is not mutated
FlickerDataFrame[col1: bigint, col2: bigint, col3: bigint, col4: bigint]
show
show(n=5, truncate=True, vertical=False)

Prints the first n rows to the console as a (possibly giant) string. This is a pass-through method to pyspark.sql.DataFrame.show().

Parameters:
Name | Type | Description | Default
---|---|---|---
n | int or None | Number of rows to show. Defaults to 5. | 5
truncate | bool or int | If True, strings longer than 20 characters are truncated. If set to a number greater than one, long strings are truncated to that length and cells are right-aligned. | True
vertical | bool | If True, print output rows vertically (one line per column value). | False

Returns:
Type | Description
---|---
None | Nothing is returned; output is printed to the console
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['col1', 'col2'], fill='zero')
>>> df.show()
+----+----+
|col1|col2|
+----+----+
| 0| 0|
| 0| 0|
| 0| 0|
+----+----+
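For wide dataframes, vertical=True can be easier to read. A sketch of the expected output (the exact formatting comes from pyspark and may vary by version):
>>> df.show(n=2, vertical=True)
-RECORD 0------
 col1 | 0
 col2 | 0
-RECORD 1------
 col1 | 0
 col2 | 0
only showing top 2 rows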
sort
sort(names, ascending=True)

Returns a new FlickerDataFrame sorted by the specified column name(s). This non-mutating method is a pass-through to pyspark.sql.DataFrame.sort but with some checks and a slightly different function signature.

Parameters:
Name | Type | Description | Default
---|---|---|---
names | list[str] | The list of column names to sort the DataFrame by | required
ascending | bool | Whether to sort the DataFrame in ascending order | True

Returns:
Type | Description
---|---
FlickerDataFrame | The sorted dataframe

Raises:
Type | Description
---|---
KeyError | If any name in names is not a column in the dataframe
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> rows = [(10, 1), (1, 2), (100, 3)]
>>> df = FlickerDataFrame.from_rows(spark, rows, names=['x', 'y'])
>>> df()
x y
0 10 1
1 1 2
2 100 3
>>> df.sort(['x'])
FlickerDataFrame[x: bigint, y: bigint]
>>> df.sort(['x'])()
x y
0 1 2
1 10 1
2 100 3
>>> df # df is not mutated
FlickerDataFrame[x: bigint, y: bigint]
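A descending sort on the same data is a one-flag change (a sketch; only the ascending flag differs):
>>> df.sort(['x'], ascending=False)()
     x  y
0  100  3
1   10  1
2    1  2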
take
take(n=5, convert_to_dict=True)

Return the top n rows as a list.

Parameters:
Name | Type | Description | Default
---|---|---|---
n | int or None | Number of rows to return. If not specified, defaults to 5. If df.nrows < n, only df.nrows rows are returned. If n=None, all rows are returned. | 5
convert_to_dict | bool | If False, output is a list of pyspark.sql.Row objects. If True, each row is converted to a dict. | True

Returns:
Type | Description
---|---
list[dict or Row] | A list of at most n items. Each item is either a pyspark.sql.Row or a dict object.
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> rows = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
>>> df = FlickerDataFrame.from_rows(spark, rows, names=['col1', 'col2'])
>>> df.take(2, convert_to_dict=True)
[{'col1': 1, 'col2': 'a'}, {'col1': 2, 'col2': 'b'}]
>>> df.take(2, convert_to_dict=False)
[Row(col1=1, col2='a'), Row(col1=2, col2='b')]
to_dict
to_dict(n=5)

Converts the FlickerDataFrame into a dictionary representation, in which dict keys represent column names and dict values represent column values.

Parameters:
Name | Type | Description | Default
---|---|---|---
n | int or None | Number of rows to return. If not specified, defaults to 5. If n=None, all rows are returned. | 5

Returns:
Type | Description
---|---
dict | A dictionary representation of the FlickerDataFrame
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['col1', 'col2'], fill='colseq')
>>> df()
col1 col2
0 0 3
1 1 4
2 2 5
>>> df.to_dict(n=2)
{'col1': [0, 1], 'col2': [3, 4]}
to_pandas
to_pandas()

Converts a FlickerDataFrame to a pandas.DataFrame. Calling this method on a big FlickerDataFrame may result in out-of-memory errors. This method is simply a pass-through to pyspark.sql.DataFrame.toPandas(). Consider using FlickerDataFrame.__call__() instead of FlickerDataFrame.to_pandas(), because pyspark.sql.DataFrame.toPandas() can cause unwanted None-to-NaN conversions. See the example below.

Returns:
Type | Description
---|---
DataFrame | pandas DataFrame
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> pandas_df = pd.DataFrame({'col1': [1.0, np.nan, None], 'col2': [4.0, 5.0, np.nan]}, dtype=object)
>>> pandas_df
col1 col2
0 1.0 4.0
1 NaN 5.0
2 None NaN
>>> df = FlickerDataFrame.from_pandas(spark, pandas_df, nan_to_none=False)
>>> df()
col1 col2
0 1.0 4.0
1 NaN 5.0
2 None NaN
>>> df.to_pandas() # causes unwanted None to NaN conversion in df.to_pandas().iloc[2, 0]
col1 col2
0 1.0 4.0
1 NaN 5.0
2 NaN NaN
unique
unique()

Returns a new FlickerDataFrame with unique rows. This non-mutating method is just a pass-through to pyspark.sql.DataFrame.distinct.

Returns:
Type | Description
---|---
FlickerDataFrame | A new FlickerDataFrame with unique rows
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['col1', 'col2'], fill='zero')
>>> df()
col1 col2
0 0 0
1 0 0
2 0 0
>>> df.unique()
FlickerDataFrame[col1: bigint, col2: bigint]
>>> df.unique()()
col1 col2
0 0 0
>>> df.shape # df is not mutated
(3, 2)