FlickerDataFrame

FlickerDataFrame is a wrapper over pyspark.sql.DataFrame that provides a modern, clean, intuitive, pythonic, polars-like API over a pyspark backend.

Construct a FlickerDataFrame from a pyspark.sql.DataFrame. Construction will fail if the pyspark.sql.DataFrame contains duplicate column names.

Parameters:

Name Type Description Default
df DataFrame

The input pyspark.sql.DataFrame to initialize a FlickerDataFrame object

required

Raises:

Type Description
TypeError

If the df parameter is not an instance of pyspark.sql.DataFrame

ValueError

If the df parameter contains duplicate column names

Examples:

>>> spark = SparkSession.builder.getOrCreate()
>>> rows = [('spark', 1), ('pandas', 3), ('polars', 2)]
>>> spark_df = spark.createDataFrame(rows, schema=['package', 'rank'])
>>> df = FlickerDataFrame(spark_df)
>>> df()
  package rank
0   spark    1
1  pandas    3
2  polars    2

dtypes property

dtypes

Returns the column names and corresponding data types as an OrderedDict. The order of key-value pairs in the output matches the left-to-right order of columns in the dataframe.

Returns:

Type Description
OrderedDict

Keys are column names and values are dtypes

Examples:

>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['col1', 'col2'], fill='zero')
>>> df.dtypes
OrderedDict([('col1', 'bigint'), ('col2', 'bigint')])

names property

names

Returns a list of column names in the FlickerDataFrame

Returns:

Type Description
list[str]

list of column names in order of occurrence

Examples:

>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['col1', 'col2'], fill='zero')
>>> df.names
['col1', 'col2']

ncols property

ncols

Returns the number of columns. This property always returns immediately, regardless of the number of rows in the dataframe.

Returns:

Type Description
int

number of columns

Examples:

>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['col1', 'col2'], fill='zero')
>>> df.ncols
2

nrows property

nrows

Returns the number of rows. Accessing this property may take a long time because all the rows in the dataframe must be counted. Once computed, the number of rows is automatically cached until the dataframe is mutated; the cached value is returned immediately without re-counting.

Returns:

Type Description
int

number of rows

Examples:

>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 1000, 2, names=['col1', 'col2'], fill='zero')
>>> df.nrows
1000
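
Subsequent accesses return the cached count immediately (an illustrative continuation; no re-count happens until the dataframe is mutated):

>>> df.nrows  # cached; returns immediately without re-counting
1000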

schema property

schema

Returns the dataframe schema as an object of type pyspark.sql.types.StructType.
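
Examples:

A possible interaction (the exact StructType repr varies across pyspark versions, so the output below is indicative):

>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['col1', 'col2'], fill='zero')
>>> df.schema
StructType([StructField('col1', LongType(), True), StructField('col2', LongType(), True)])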

shape property

shape

Returns the shape of the FlickerDataFrame as (nrows, ncols)

Returns:

Type Description
tuple[int, int]

shape as (nrows, ncols)

Examples:

>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['col1', 'col2'], fill='zero')
>>> df.shape
(3, 2)

__call__

__call__(n=5, use_pandas_dtypes=False)

Return the first n rows of the underlying pyspark.sql.DataFrame as a pandas.DataFrame.

Parameters:

Name Type Description Default
n int | None

Number of rows to return. If not specified, defaults to 5. If df.nrows < n, only df.nrows rows are returned. If n=None, all rows are returned.

5
use_pandas_dtypes bool

If False (recommended and default), the resulting pandas.DataFrame will have all column dtypes as object; this option preserves NaNs and Nones as-is. If True, the resulting pandas.DataFrame will have inferred dtypes; this option may be a little faster, but it allows pandas to convert Nones in numeric columns to NaNs.

False

Returns:

Type Description
DataFrame

pandas DataFrame

Examples:

>>> spark = SparkSession.builder.getOrCreate()
>>> rows = [('spark', 1), ('pandas', 3), ('polars', 2)]
>>> df = FlickerDataFrame.from_rows(spark, rows, names=['package', 'rank'])
>>> df() # call the FlickerDataFrame to quickly see a snippet
  package rank
0   spark    1
1  pandas    3
2  polars    2
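
The n parameter limits how many rows are returned (an illustrative continuation of the session above):

>>> df(2)
  package rank
0   spark    1
1  pandas    3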

__getitem__

__getitem__(item)

Index into the dataframe in various ways

Parameters:

Name Type Description Default
item tuple | slice | str | list | Column | FlickerColumn

The index value to retrieve from the FlickerDataFrame object

required

Returns:

Type Description
FlickerColumn | FlickerDataFrame

- If the index value is a string, returns a FlickerColumn containing the column named by the string.
- If the index value is a Column object, returns a new FlickerDataFrame with only the specified column.
- If the index value is a FlickerColumn object, returns a new FlickerDataFrame with only the column of the FlickerColumn object.
- If the index value is a slice object, returns a new FlickerDataFrame with the columns specified by the slice.
- If the index value is a tuple of two slices, returns a new FlickerDataFrame with the columns specified by the second slice, limited by the stop value of the first slice.
- If the index value is an iterable, returns a new FlickerDataFrame with the columns specified by the elements of the iterable.

Raises:

Type Description
KeyError

If the index value is not a supported index type.
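
Examples:

A sketch of the supported index types (output reprs are indicative; the FlickerColumn repr is omitted):

>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['col1', 'col2'], fill='zero')
>>> column = df['col1']  # str -> FlickerColumn
>>> df[['col1']]  # iterable of names -> FlickerDataFrame with those columns
FlickerDataFrame[col1: bigint]
>>> df[df['col1']]  # FlickerColumn -> FlickerDataFrame with only that column
FlickerDataFrame[col1: bigint]
>>> df[:1]  # slice -> FlickerDataFrame with the columns selected by the slice
FlickerDataFrame[col1: bigint]
>>> df[:2, :]  # tuple of two slices -> all columns, at most 2 rows
FlickerDataFrame[col1: bigint, col2: bigint]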

_ipython_key_completions_

_ipython_key_completions_()

Provide a list of auto-completions for item access (not attribute access), triggered by df["c" followed by Tab. Note that attribute completion is separate and happens automatically even when __dir__() is not explicitly defined.

See https://ipython.readthedocs.io/en/stable/config/integrating.html

This function enables auto-completion in both jupyter notebook and ipython terminal.
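
Examples:

This method is normally invoked by IPython rather than called directly. A sketch of the expected behavior, assuming the completions are simply the column names:

>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['col1', 'col2'], fill='zero')
>>> df._ipython_key_completions_()
['col1', 'col2']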

concat

concat(other, ignore_names=False)

Return a new FlickerDataFrame with the rows of this dataframe and the other dataframe concatenated together. This is a non-mutating method that calls pyspark.sql.DataFrame.union after some checks. The resulting concatenated dataframe always contains the same column names, in the same order, as the current dataframe.

Parameters:

Name Type Description Default
other FlickerDataFrame | DataFrame

The DataFrame to concatenate with the current DataFrame

required
ignore_names bool

If True, the column names of the other dataframe are ignored when concatenating; concatenation happens by column order, and the resulting dataframe has column names in the same order as the current dataframe. If False, this method checks that the current and other dataframes have the same column names (even if not in the same order). If this check fails, a KeyError is raised.

False

Returns:

Type Description
FlickerDataFrame

The concatenated DataFrame

Examples:

>>> spark = SparkSession.builder.getOrCreate()
>>> df_zero = FlickerDataFrame.from_shape(spark, 2, 2, names=['a', 'b'], fill='zero')
>>> df_one = FlickerDataFrame.from_shape(spark, 2, 2, names=['a', 'b'], fill='one')
>>> df_rand = FlickerDataFrame.from_shape(spark, 2, 2, names=['b', 'c'], fill='rand')
>>> df_zero.concat(df_one)
FlickerDataFrame[a: bigint, b: bigint]
>>> df_zero.concat(df_one, ignore_names=False)()
   a  b
0  0  0
1  0  0
2  1  1
3  1  1
>>> df_zero.concat(df_one, ignore_names=True)()  # ignore_names has no effect
   a  b
0  0  0
1  0  0
2  1  1
3  1  1
>>> df_zero.concat(df_rand, ignore_names=True)()
          a         b
0       0.0       0.0
1       0.0       0.0
2   0.85428  0.148739
3  0.031665   0.14922
>>> df_zero.concat(df_rand, ignore_names=False)  # KeyError

describe

describe()

Returns a pandas.DataFrame with a statistical summary of the FlickerDataFrame. This method supports numeric (int, bigint, float, double), string, timestamp, and boolean columns. Unsupported columns are ignored without an error. This method returns a different and better-typed output than pyspark.sql.DataFrame.describe.

The output contains count, mean, stddev, min, and max values.

Returns:

Type Description
DataFrame

A pandas DataFrame with statistical summary of the FlickerDataFrame

Examples:

>>> from datetime import datetime, timedelta
>>> spark = SparkSession.builder.getOrCreate()
>>> t = datetime(2023, 1, 1)
>>> dt = timedelta(days=1)
>>> rows = [('Bob', 23, 100.0, t - dt, False), ('Alice', 22, 110.0, t, True), ('Tom', 21, 120.0, t + dt, False)]
>>> names = ['name', 'age', 'weight', 'time', 'is_jedi']
>>> df = FlickerDataFrame.from_rows(spark, rows, names)
>>> df()
    name age weight                 time is_jedi
0    Bob  23  100.0  2022-12-31 00:00:00   False
1  Alice  22  110.0  2023-01-01 00:00:00    True
2    Tom  21  120.0  2023-01-02 00:00:00   False
>>> df.describe()
         name   age weight                 time   is_jedi
count       3     3      3                    3         3
max       Tom    23  120.0  2023-01-02 00:00:00      True
mean      NaN  22.0  110.0  2023-01-01 00:00:00  0.333333
min     Alice    21  100.0  2022-12-31 00:00:00     False
stddev    NaN   1.0   10.0       1 day, 0:00:00   0.57735
>>> df.describe()['time']['stddev']  # output contains appropriately typed values instead of strings
datetime.timedelta(days=1)

drop

drop(names)

Delete columns by name. This is the non-mutating counterpart of the __delitem__ method (del df[name]).

Parameters:

Name Type Description Default
names list[str]

A list of column names to delete from the FlickerDataFrame.

required

Returns:

Type Description
FlickerDataFrame

A new instance of the FlickerDataFrame class with the specified columns deleted

Examples:

>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 4, names=['col1', 'col2', 'col3', 'col4'], fill='zero')
>>> df
FlickerDataFrame[col1: bigint, col2: bigint, col3: bigint, col4: bigint]
>>> df.drop(['col2', 'col4'])
FlickerDataFrame[col1: bigint, col3: bigint]
>>> df
FlickerDataFrame[col1: bigint, col2: bigint, col3: bigint, col4: bigint]

from_columns classmethod

from_columns(spark, columns, names=None, nan_to_none=True)

Create a FlickerDataFrame from columns

Parameters:

Name Type Description Default
spark SparkSession
required
columns Iterable[Iterable]

The columns to create the DataFrame from. Each column should be an iterable of values. For example, [[1, 2, 3], ['a', 'b', 'c']] represents two columns of three rows each.

required
names list[str] | None

The column names of the DataFrame. If None, column names are generated as '0', '1', '2', ..., str(ncols - 1).

None
nan_to_none bool

Flag indicating whether to convert all NaN values to None. Default and recommended value is True.

True

Returns:

Type Description
FlickerDataFrame

Examples:

>>> spark = SparkSession.builder.getOrCreate()
>>> columns = [[1, 2, 3], ['a', 'b', 'c']]
>>> names = ['col1', 'col2']
>>> df = FlickerDataFrame.from_columns(spark, columns, names)
>>> df()
  col1 col2
0    1    a
1    2    b
2    3    c

Raises:

Type Description
ValueError

If the columns contain different number of rows

from_dict classmethod

from_dict(spark, data, nan_to_none=True)

Create a FlickerDataFrame object from a dictionary in which dict keys represent column names and dict values represent column values.

Parameters:

Name Type Description Default
spark SparkSession
required
data dict

The dictionary containing column names as keys and column values as values. For example, {'col1': [1, 2, 3], 'col2': [4, 5, 6]}

required
nan_to_none bool

Flag indicating whether to convert all NaN values to None. Default and recommended value is True.

True

Returns:

Type Description
FlickerDataFrame

Examples:

>>> spark = SparkSession.builder.getOrCreate()
>>> data = {'col1': [1, 2, 3], 'col2': [4, 5, 6]}
>>> df = FlickerDataFrame.from_dict(spark, data)
>>> df()
  col1 col2
0    1    4
1    2    5
2    3    6

from_pandas classmethod

from_pandas(spark, df, nan_to_none=True)

Create a FlickerDataFrame from a pandas.DataFrame

Parameters:

Name Type Description Default
spark SparkSession
required
df DataFrame

The pandas DataFrame to convert to a FlickerDataFrame.

required
nan_to_none bool

Flag indicating whether to convert all NaN values to None. Default and recommended value is True.

True

Returns:

Type Description
FlickerDataFrame

Examples:

>>> spark = SparkSession.builder.getOrCreate()
>>> pandas_df = pd.DataFrame({'col1': [1, np.nan, 3], 'col2': [4, 5, np.nan]})
>>> pandas_df
   col1  col2
0   1.0   4.0
1   NaN   5.0
2   3.0   NaN
>>> df = FlickerDataFrame.from_pandas(spark, pandas_df, nan_to_none=True)
>>> df()
   col1  col2
0   1.0   4.0
1  None   5.0
2   3.0  None
>>> df = FlickerDataFrame.from_pandas(spark, pandas_df, nan_to_none=False)
>>> df()
  col1 col2
0  1.0  4.0
1  NaN  5.0
2  3.0  NaN

from_records classmethod

from_records(spark, records, nan_to_none=True)

Create a FlickerDataFrame from a list of dictionaries (similar to JSON lines format)

Parameters:

Name Type Description Default
spark SparkSession
required
records Iterable[dict]

An iterable of dictionaries. Each dictionary represents a row (aka record).

required
nan_to_none bool

Flag indicating whether to convert all NaN values to None. Default and recommended value is True.

True

Returns:

Type Description
FlickerDataFrame

Examples:

>>> spark = SparkSession.builder.getOrCreate()
>>> records = [{'col1': 1, 'col2': 1}, {'col1': 2, 'col2': 2}, {'col1': 3, 'col2': 3}]
>>> df = FlickerDataFrame.from_records(spark, records)
>>> df()
  col1 col2
0    1    1
1    2    2
2    3    3

from_rows classmethod

from_rows(spark, rows, names=None, nan_to_none=True)

Create a FlickerDataFrame from rows.

Parameters:

Name Type Description Default
spark SparkSession
required
rows Iterable[Iterable]

The rows of data to be converted into a DataFrame. For example, [('row1', 1), ('row2', 2)].

required
names list[str] | None

The column names of the DataFrame. If None, column names are generated as '0', '1', '2', ..., str(ncols - 1).

None
nan_to_none bool

Flag indicating whether to convert all NaN values to None. Default and recommended value is True.

True

Returns:

Type Description
FlickerDataFrame

Examples:

>>> spark = SparkSession.builder.getOrCreate()
>>> rows = [['a', 1, 2.0], ['b', 2, 4.0]]
>>> names = ['col1', 'col2', 'col3']
>>> df = FlickerDataFrame.from_rows(spark, rows, names)
>>> df()
  col1 col2 col3
0    a    1  2.0
1    b    2  4.0

Raises:

Type Description
ValueError

If the rows contain different number of columns

from_schema classmethod

from_schema(spark, schema=None, data=())

Creates a FlickerDataFrame object from a schema and optionally some data.

This method can be very useful for creating an empty dataframe with a given schema. The best way to obtain a schema from another dataframe is df.schema. The input schema can also be None, in which case an empty StructType schema is used and the resulting dataframe has no rows and no columns.

Parameters:

Name Type Description Default
spark SparkSession

The Spark session used for creating the DataFrame.

required
schema StructType or str or None

The schema for the DataFrame. Can be specified as a Spark StructType object, a string representation of the schema, or None. If None, an empty StructType schema is used by default.

None
data RDD, Iterable, DataFrame, or np.ndarray

The data to populate the DataFrame. Can be an RDD, an iterable collection, an existing Spark DataFrame, or a NumPy array.

()

Returns:

Type Description
FlickerDataFrame

A new instance of FlickerDataFrame created with the provided schema and data.

Examples:

>>> spark = SparkSession.builder.getOrCreate()
>>> FlickerDataFrame.from_schema(spark)  # Create a dataframe with no rows and no columns
FlickerDataFrame[]
>>> FlickerDataFrame.from_schema(spark, schema='a string, b int')  # Create a dataframe with zero rows
FlickerDataFrame[a: string, b: int]
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['a', 'b'])
>>> df
FlickerDataFrame[a: bigint, b: bigint]
>>> FlickerDataFrame.from_schema(spark, schema=df.schema)  # Create a dataframe with the same schema as df
FlickerDataFrame[a: bigint, b: bigint]

from_shape classmethod

from_shape(spark, nrows, ncols, names=None, fill='zero')

Create a FlickerDataFrame from a given shape and fill value. This method is useful for creating test data and for experimentation.

Parameters:

Name Type Description Default
spark SparkSession

The Spark session used for creating the DataFrame.

required
nrows int

The number of rows in the DataFrame.

required
ncols int

The number of columns in the DataFrame.

required
names list[str] | None

The names of the columns in the DataFrame. If not provided, column names are generated as '0', '1', '2', ..., str(ncols - 1).

None
fill str

The value used for filling the DataFrame. Default is 'zero'. Accepted values are: 'zero', 'one', 'rand', 'randn', 'rowseq', 'colseq'

'zero'

Returns:

Type Description
FlickerDataFrame

A new instance of the FlickerDataFrame class created from the given shape and parameters.

Examples:

>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['col1', 'col2'], fill='rowseq')
>>> df()
  col1 col2
0    0    1
1    2    3
2    4    5
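
Other fill values behave analogously (a sketch; per the parameter description, 'zero' and 'one' fill constant values while 'rand' and 'randn' fill random numbers):

>>> df = FlickerDataFrame.from_shape(spark, 2, 2, names=['a', 'b'], fill='one')
>>> df()
   a  b
0  1  1
1  1  1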

groupby

groupby(names)

Groups the rows of the DataFrame based on the specified column names so that aggregations can be run on them. Returns a FlickerGroupedData object. This method is a pass-through to pyspark.sql.DataFrame.groupBy but returns a FlickerGroupedData object instead of a pyspark.sql.GroupedData object.

Parameters:

Name Type Description Default
names list[str]

The column names based on which the DataFrame rows should be grouped

required

Returns:

Type Description
FlickerGroupedData

Examples:

>>> spark = SparkSession.builder.getOrCreate()
>>> rows = [('spark', 10), ('pandas', 10), ('spark', 100)]
>>> df = FlickerDataFrame.from_rows(spark, rows, names=['name', 'number'])
>>> df.groupby(['name'])
FlickerGroupedData[grouping expressions: [name], value: [name: string, number: bigint], type: GroupBy]
>>> df.groupby(['name']).count()
FlickerDataFrame[name: string, count: bigint]
>>> df.groupby(['name']).count()()
     name count
0   spark     2
1  pandas     1

head

head(n=5)

Return top n rows as a FlickerDataFrame. This method differs from FlickerDataFrame.__call__(), which returns a pandas.DataFrame.

Parameters:

Name Type Description Default
n int | None

Number of rows to return. If not specified, defaults to 5. If df.nrows < n, only df.nrows rows are returned. If n=None, all rows are returned.

5

Returns:

Type Description
FlickerDataFrame

A new instance of FlickerDataFrame containing top (at most) n rows

Examples:

>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 10, 2, names=['col1', 'col2'], fill='zero')
>>> df.head(3)
FlickerDataFrame[col1: bigint, col2: bigint]
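
To inspect the contents, call the returned FlickerDataFrame (continuing the session above):

>>> df.head(3)()
  col1 col2
0    0    0
1    0    0
2    0    0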

join

join(right, on, how='inner', lprefix='', lsuffix='_l', rprefix='', rsuffix='_r')

Join the current FlickerDataFrame with another dataframe. This non-mutating method returns the joined dataframe as a FlickerDataFrame.

This method preserves duplicate column names (that are joined on) by renaming them in the join result. Note that FlickerDataFrame.join is different from FlickerDataFrame.merge in both function signature and the merged/joined result.

Parameters:

Name Type Description Default
right FlickerDataFrame | DataFrame

The right DataFrame to join with the left DataFrame.

required
on dict[str, str]

Dictionary specifying which column names to join on. Keys represent column names from the left dataframe and values represent column names from the right dataframe.

required
how str

The type of join to perform:
- 'inner': Returns only the matching rows from both DataFrames
- 'left': Returns all the rows from the left DataFrame and the matching rows from the right DataFrame
- 'right': Returns all the rows from the right DataFrame and the matching rows from the left DataFrame
- 'outer': Returns all the rows from both DataFrames, including unmatched rows, with null values for non-matching columns

'inner'
lprefix str

Prefix to add to column names from the left dataframe that are duplicated in the join result

''
lsuffix str

Suffix to add to column names from the left dataframe that are duplicated in the join result

'_l'
rprefix str

Prefix to add to column names from the right dataframe that are duplicated in the join result

''
rsuffix str

Suffix to add to column names from the right dataframe that are duplicated in the join result

'_r'

Returns:

Type Description
FlickerDataFrame

Raises:

Type Description
TypeError

If the on parameter is not a dictionary

ValueError

If the on parameter is an empty dictionary

TypeError

If the keys or values of the on parameter are not of str type

KeyError

If the left or right DataFrame contains duplicate column names after renaming

NotImplementedError

Raised to guard against unexpected changes in the underlying pyspark.sql.DataFrame.join

Examples:

>>> spark = SparkSession.builder.getOrCreate()
>>> left = FlickerDataFrame.from_rows(spark, [('a', 1), ('b', 2), ('c', 3), ], ['x', 'number'])
>>> right = FlickerDataFrame.from_rows(spark, [('a', 4), ('d', 5), ('e', 6), ], ['x', 'number'])
>>> inner_join = left.join(right, on={'x': 'x'}, how='inner')
>>> inner_join()  # 'x' columns from both left and right dataframes are preserved
  x_l number_l x_r number_r
0   a        1   a        4
>>> spark = SparkSession.builder.getOrCreate()
>>> left = FlickerDataFrame.from_rows(spark, [('a', 1), ('b', 2), ('c', 3), ], ['x1', 'number'])
>>> right = FlickerDataFrame.from_rows(spark, [('a', 4), ('d', 5), ('e', 6), ], ['x2', 'number'])
>>> inner_join = left.join(right, on={'x1': 'x2'}, how='inner')
>>> inner_join()  # renaming happens only when needed
  x1 number_l x2 number_r
0  a        1  a        4

merge

merge(right, on, how='inner', lprefix='', lsuffix='_l', rprefix='', rsuffix='_r')

Merge the current FlickerDataFrame with another dataframe. This non-mutating method returns the merged dataframe as a FlickerDataFrame.

Note that FlickerDataFrame.merge is different from FlickerDataFrame.join in both function signature and the merged/joined result.

Parameters:

Name Type Description Default
right FlickerDataFrame | DataFrame

The right dataframe to merge with

required
on Iterable[str]

Column names to 'join' on. The column names must exist in both left and right dataframes. The column names provided in on are not duplicated and are not renamed using prefixes/suffixes.

required
how str

Type of join to perform. Possible values are {'inner', 'outer', 'left', 'right'}.

'inner'
lprefix str

Prefix to add to column names from the left dataframe that are duplicated in the merge result

''
lsuffix str

Suffix to add to column names from the left dataframe that are duplicated in the merge result

'_l'
rprefix str

Prefix to add to column names from the right dataframe that are duplicated in the merge result

''
rsuffix str

Suffix to add to column names from the right dataframe that are duplicated in the merge result

'_r'

Returns:

Type Description
FlickerDataFrame

Raises:

Type Description
TypeError

If on is not an Iterable[str] or if it is a dict

ValueError

If on is an empty Iterable[str]

TypeError

If any element in on is not a str

KeyError

If renaming results in duplicate column names in the left dataframe

KeyError

If renaming results in duplicate column names in the right dataframe

Examples:

>>> spark = SparkSession.builder.getOrCreate()
>>> left = FlickerDataFrame.from_rows(spark, [('a', 1), ('b', 2), ('c', 3), ], ['name', 'number'])
>>> right = FlickerDataFrame.from_rows(spark, [('a', 4), ('d', 5), ('e', 6), ], ['name', 'number'])
>>> inner_merge = left.merge(right, on=['name'], how='inner')
>>> inner_merge()
  name number_l number_r
0    a        1        4
>>> left_merge = left.merge(right, on=['name'], how='left')
>>> left_merge()
  name number_l number_r
0    a        1        4
1    b        2     None
2    c        3     None

rename

rename(from_to_mapper)

Renames columns in the FlickerDataFrame based on the provided mapping of the form {'old_col_name1': 'new_col_name1', 'old_col_name2': 'new_col_name2', ...}. This is a non-mutating method.

Parameters:

Name Type Description Default
from_to_mapper dict[str, str]

A dictionary containing the mapping of current column names to new column names

required

Returns:

Type Description
FlickerDataFrame

A new instance of FlickerDataFrame with renamed columns

Raises:

Type Description
TypeError

If the provided from_to_mapper is not a dictionary

KeyError

If any of the keys in from_to_mapper do not match existing column names in the FlickerDataFrame

Examples:

>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 4, names=['col1', 'col2', 'col3', 'col4'], fill='zero')
>>> df
FlickerDataFrame[col1: bigint, col2: bigint, col3: bigint, col4: bigint]
>>> df.rename({'col1': 'col_a', 'col3': 'col_c'})
FlickerDataFrame[col_a: bigint, col2: bigint, col_c: bigint, col4: bigint]
>>> df  # df is not mutated
FlickerDataFrame[col1: bigint, col2: bigint, col3: bigint, col4: bigint]

show

show(n=5, truncate=True, vertical=False)

Prints the first n rows to the console as a (possibly) giant string. This is a pass-through method to pyspark.sql.DataFrame.show().

Parameters:

Name Type Description Default
n int | None

Number of rows to show. Defaults to 5.

5
truncate bool | int

If True, strings longer than 20 chars are truncated. If truncate > 1, strings longer than truncate are truncated to length=truncate and made right-aligned.

True
vertical bool

If True, print output rows vertically (one line per column value).

False

Returns:

Type Description
None

Examples:

>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['col1', 'col2'], fill='zero')
>>> df.show()
+----+----+
|col1|col2|
+----+----+
|   0|   0|
|   0|   0|
|   0|   0|
+----+----+

sort

sort(names, ascending=True)

Returns a new FlickerDataFrame sorted by the specified column name(s). This non-mutating method is a pass-through to pyspark.sql.DataFrame.sort but with some checks and a slightly different function signature.

Parameters:

Name Type Description Default
names list[str]

The list of column names to sort the DataFrame by

required
ascending bool

Whether to sort the DataFrame in ascending order or not

True

Returns:

Type Description
FlickerDataFrame

Raises:

Type Description
KeyError

If names contains a non-existent column name

Examples:

>>> spark = SparkSession.builder.getOrCreate()
>>> rows = [(10, 1), (1, 2), (100, 3)]
>>> df = FlickerDataFrame.from_rows(spark, rows, names=['x', 'y'])
>>> df()
     x  y
0   10  1
1    1  2
2  100  3
>>> df.sort(['x'])
FlickerDataFrame[x: bigint, y: bigint]
>>> df.sort(['x'])()
     x  y
0    1  2
1   10  1
2  100  3
>>> df  # df is not mutated
FlickerDataFrame[x: bigint, y: bigint]
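
A descending sort works the same way (an illustrative continuation of the session above):

>>> df.sort(['x'], ascending=False)()
     x  y
0  100  3
1   10  1
2    1  2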

take

take(n=5, convert_to_dict=True)

Return top n rows as a list.

Parameters:

Name Type Description Default
n int | None

Number of rows to return. If not specified, defaults to 5. If df.nrows < n, only df.nrows rows are returned. If n=None, all rows are returned.

5
convert_to_dict bool

If False, output is a list of pyspark.sql.Row objects. If True, output is a list of dict objects.

True

Returns:

Type Description
list[dict | Row]

A list of at most n items. Each item is either a pyspark.sql.Row or a dict object.

Examples:

>>> spark = SparkSession.builder.getOrCreate()
>>> rows = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
>>> df = FlickerDataFrame.from_rows(spark, rows, names=['col1', 'col2'])
>>> df.take(2, convert_to_dict=True)
[{'col1': 1, 'col2': 'a'}, {'col1': 2, 'col2': 'b'}]
>>> df.take(2, convert_to_dict=False)
[Row(col1=1, col2='a'), Row(col1=2, col2='b')]

to_dict

to_dict(n=5)

Converts the FlickerDataFrame into a dictionary representation in which dict keys represent column names and dict values represent column values.

Parameters:

Name Type Description Default
n int | None

Number of rows to return. If not specified, defaults to 5. If df.nrows < n, only df.nrows rows are returned. If n=None, all rows are returned.

5

Returns:

Type Description
dict

A dictionary representation of the FlickerDataFrame where keys are column names and values are lists containing up to n values from each column.

Examples:

>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['col1', 'col2'], fill='colseq')
>>> df()
  col1 col2
0    0    3
1    1    4
2    2    5
>>> df.to_dict(n=2)
{'col1': [0, 1], 'col2': [3, 4]}

to_pandas

to_pandas()

Converts a FlickerDataFrame to a pandas.DataFrame. Calling this method on a big FlickerDataFrame may result in out-of-memory errors.

This method is simply a pass-through to pyspark.sql.DataFrame.toPandas(). Consider using FlickerDataFrame.__call__() instead of FlickerDataFrame.to_pandas() because pyspark.sql.DataFrame.toPandas() can cause unwanted None-to-NaN conversions. See the example below.

Returns:

Type Description
DataFrame

pandas DataFrame

Examples:

>>> spark = SparkSession.builder.getOrCreate()
>>> pandas_df = pd.DataFrame({'col1': [1.0, np.nan, None], 'col2': [4.0, 5.0, np.nan]}, dtype=object)
>>> pandas_df
   col1 col2
0   1.0  4.0
1   NaN  5.0
2  None  NaN
>>> df = FlickerDataFrame.from_pandas(spark, pandas_df, nan_to_none=False)
>>> df()
   col1 col2
0   1.0  4.0
1   NaN  5.0
2  None  NaN
>>> df.to_pandas()  # causes unwanted None to NaN conversion in df.to_pandas().iloc[2, 0]
   col1  col2
0   1.0   4.0
1   NaN   5.0
2   NaN   NaN

unique

unique()

Returns a new FlickerDataFrame with unique rows. This non-mutating method is just a pass-through to pyspark.sql.DataFrame.distinct.

Returns:

Type Description
FlickerDataFrame

A new FlickerDataFrame with unique rows

Examples:

>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['col1', 'col2'], fill='zero')
>>> df()
  col1 col2
0    0    0
1    0    0
2    0    0
>>> df.unique()
FlickerDataFrame[col1: bigint, col2: bigint]
>>> df.unique()()
  col1 col2
0    0    0
>>> df.shape  # df is not mutated
(3, 2)