FlickerDataFrame

FlickerDataFrame is a wrapper over pyspark.sql.DataFrame. FlickerDataFrame provides a modern, clean, intuitive, pythonic, polars-like API over a pyspark backend.

Construct a FlickerDataFrame from a pyspark.sql.DataFrame. Construction will fail if the pyspark.sql.DataFrame contains duplicate column names.
Parameters:
Name | Type | Description | Default
---|---|---|---
df | DataFrame | The input pyspark DataFrame to wrap | required
Raises:
Type | Description
---|---
TypeError | If the df parameter is not an instance of pyspark.sql.DataFrame
ValueError | If the df parameter contains duplicate column names
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> rows = [('spark', 1), ('pandas', 3), ('polars', 2)]
>>> spark_df = spark.createDataFrame(rows, schema=['package', 'rank'])
>>> df = FlickerDataFrame(spark_df)
>>> df()
package rank
0 spark 1
1 pandas 3
2 polars 2
dtypes (property)

Returns the column names and corresponding data types as an OrderedDict. The order of key-value pairs in the output matches the left-to-right order of columns in the dataframe.

Returns:
Type | Description
---|---
OrderedDict | Keys are column names and values are dtypes
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['col1', 'col2'], fill='zero')
>>> df.dtypes
OrderedDict([('col1', 'bigint'), ('col2', 'bigint')])
names (property)

Returns a list of column names in the FlickerDataFrame.

Returns:
Type | Description
---|---
list[str] | List of column names in order of occurrence
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['col1', 'col2'], fill='zero')
>>> df.names
['col1', 'col2']
ncols (property)

Returns the number of columns. This property returns immediately regardless of the number of rows in the dataframe.

Returns:
Type | Description
---|---
int | Number of columns
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['col1', 'col2'], fill='zero')
>>> df.ncols
2
nrows (property)

Returns the number of rows. Computing this may take a long time because all the rows in the dataframe must be counted. Once computed, the row count is automatically cached until the dataframe is mutated; the cached count is returned immediately without re-counting.

Returns:
Type | Description
---|---
int | Number of rows
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 1000, 2, names=['col1', 'col2'], fill='zero')
>>> df.nrows
1000
schema (property)

Returns the dataframe schema as an object of type pyspark.sql.types.StructType.
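Examples (an illustrative sketch; the exact StructType repr varies across pyspark versions):
>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['col1', 'col2'], fill='zero')
>>> df.schema
StructType([StructField('col1', LongType(), True), StructField('col2', LongType(), True)])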
shape (property)

Returns the shape of the FlickerDataFrame as (nrows, ncols).

Returns:
Type | Description
---|---
tuple[int, int] | Shape as (nrows, ncols)
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['col1', 'col2'], fill='zero')
>>> df.shape
(3, 2)
__call__
__call__(n=5, use_pandas_dtypes=False)

Return a selection of the underlying pyspark.sql.DataFrame as a pandas.DataFrame.
Parameters:
Name | Type | Description | Default
---|---|---|---
n | int or None | Number of rows to return. If not specified, defaults to 5. If df.nrows < n, only df.nrows rows are returned. If n=None, all rows are returned. | 5
use_pandas_dtypes | bool | If False (recommended and default), the resulting pandas.DataFrame will have all column dtypes as object; this option preserves NaNs and Nones as-is. If True, the resulting pandas.DataFrame will have parsed dtypes; this option may be a little faster, but it allows pandas to convert Nones in numeric columns to NaNs. | False

Returns:
Type | Description
---|---
DataFrame | pandas DataFrame
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> rows = [('spark', 1), ('pandas', 3), ('polars', 2)]
>>> df = FlickerDataFrame.from_rows(spark, rows, names=['package', 'rank'])
>>> df() # call the FlickerDataFrame to quickly see a snippet
package rank
0 spark 1
1 pandas 3
2 polars 2
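The effect of use_pandas_dtypes can be checked via the returned pandas dtypes. A sketch (exact dtype names may vary with the pandas version):
>>> df(use_pandas_dtypes=False).dtypes
package    object
rank       object
dtype: object
>>> df(use_pandas_dtypes=True).dtypes
package    object
rank        int64
dtype: object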
__getitem__
__getitem__(item)

Index into the dataframe in various ways.

Parameters:
Name | Type | Description | Default
---|---|---|---
item | tuple, slice, str, list, Column, or FlickerColumn | The index value to retrieve from the FlickerDataFrame object | required

Returns:
Type | Description
---|---
FlickerColumn or FlickerDataFrame | Depends on the type of the index value, as detailed below

- str: a FlickerColumn containing the column named by the string.
- Column: a new FlickerDataFrame with only the specified column.
- FlickerColumn: a new FlickerDataFrame with only the column of the FlickerColumn object.
- slice: a new FlickerDataFrame with the columns specified by the slice.
- tuple of two slices: a new FlickerDataFrame with the columns specified by the second slice, limited by the stop value of the first slice.
- iterable: a new FlickerDataFrame with the columns specified by the elements of the iterable.

Raises:
Type | Description
---|---
KeyError | If the index value is not a supported index type
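Examples (a sketch of the common indexing forms; outputs assume 'rowseq' fills values row by row, as in the from_shape examples on this page):
>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 3, names=['a', 'b', 'c'], fill='rowseq')
>>> df[['a', 'b']]()  # list of names -> FlickerDataFrame with those columns
   a  b
0  0  1
1  3  4
2  6  7
>>> df[:2, :]()  # two slices -> all columns, limited to the first 2 rows
   a  b  c
0  0  1  2
1  3  4  5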
_ipython_key_completions_
_ipython_key_completions_()

Provide the list of auto-completions for __getitem__ keys (not attributes), triggered by df["c + TAB. Note that attribute completion is separate and happens automatically even when dir() is not explicitly defined.
See https://ipython.readthedocs.io/en/stable/config/integrating.html
This method enables auto-completion in both Jupyter notebooks and the IPython terminal.
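Assuming the completions are simply the column names (consistent with the names property), a quick check might look like this:
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['col1', 'col2'], fill='zero')
>>> df._ipython_key_completions_()
['col1', 'col2']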
concat
concat(other, ignore_names=False)

Return a new FlickerDataFrame with the rows of this dataframe and the other dataframe concatenated together. This is a non-mutating method that calls pyspark.sql.DataFrame.union after some checks. The resulting concatenated DataFrame always contains the same column names, in the same order, as the current DataFrame.
Parameters:
Name | Type | Description | Default
---|---|---|---
other | FlickerDataFrame or DataFrame | The DataFrame to concatenate with the current DataFrame | required
ignore_names | bool, optional | If True, the column names of other are ignored and rows are concatenated positionally, keeping this dataframe's column names. If False, other must contain the same column names as the current DataFrame. | False

Returns:
Type | Description
---|---
FlickerDataFrame | The concatenated DataFrame
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> df_zero = FlickerDataFrame.from_shape(spark, 2, 2, names=['a', 'b'], fill='zero')
>>> df_one = FlickerDataFrame.from_shape(spark, 2, 2, names=['a', 'b'], fill='one')
>>> df_rand = FlickerDataFrame.from_shape(spark, 2, 2, names=['b', 'c'], fill='rand')
>>> df_zero.concat(df_one)
FlickerDataFrame[a: bigint, b: bigint]
>>> df_zero.concat(df_one, ignore_names=False)()
a b
0 0 0
1 0 0
2 1 1
3 1 1
>>> df_zero.concat(df_one, ignore_names=True)() # ignore_names has no effect
a b
0 0 0
1 0 0
2 1 1
3 1 1
>>> df_zero.concat(df_rand, ignore_names=True)()
a b
0 0.0 0.0
1 0.0 0.0
2 0.85428 0.148739
3 0.031665 0.14922
>>> df_zero.concat(df_rand, ignore_names=False) # KeyError
describe
describe()

Returns a pandas.DataFrame with a statistical summary of the FlickerDataFrame. This method supports numeric (int, bigint, float, double), string, timestamp, and boolean columns. Unsupported columns are ignored without an error. This method returns a different, better-typed output than pyspark.sql.DataFrame.describe. The output contains count, mean, stddev, min, and max values.

Returns:
Type | Description
---|---
DataFrame | A pandas DataFrame with a statistical summary of the FlickerDataFrame
Examples:
>>> from datetime import datetime, timedelta
>>> spark = SparkSession.builder.getOrCreate()
>>> t = datetime(2023, 1, 1)
>>> dt = timedelta(days=1)
>>> rows = [('Bob', 23, 100.0, t - dt, False), ('Alice', 22, 110.0, t, True), ('Tom', 21, 120.0, t + dt, False)]
>>> names = ['name', 'age', 'weight', 'time', 'is_jedi']
>>> df = FlickerDataFrame.from_rows(spark, rows, names)
>>> df()
name age weight time is_jedi
0 Bob 23 100.0 2022-12-31 00:00:00 False
1 Alice 22 110.0 2023-01-01 00:00:00 True
2 Tom 21 120.0 2023-01-02 00:00:00 False
>>> df.describe()
name age weight time is_jedi
count 3 3 3 3 3
max Tom 23 120.0 2023-01-02 00:00:00 True
mean NaN 22.0 110.0 2023-01-01 00:00:00 0.333333
min Alice 21 100.0 2022-12-31 00:00:00 False
stddev NaN 1.0 10.0 1 day, 0:00:00 0.57735
>>> df.describe()['time']['stddev'] # output contains appropriately typed values instead of strings
datetime.timedelta(days=1)
drop
drop(names)

Delete columns by name. This is the non-mutating form of the __del__ method.

Parameters:
Name | Type | Description | Default
---|---|---|---
names | list[str] | A list of column names to delete from the FlickerDataFrame | required

Returns:
Type | Description
---|---
FlickerDataFrame | A new instance of the FlickerDataFrame class with the specified columns deleted
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 4, names=['col1', 'col2', 'col3', 'col4'], fill='zero')
>>> df
FlickerDataFrame[col1: bigint, col2: bigint, col3: bigint, col4: bigint]
>>> df.drop(['col2', 'col4'])
FlickerDataFrame[col1: bigint, col3: bigint]
>>> df
FlickerDataFrame[col1: bigint, col2: bigint, col3: bigint, col4: bigint]
from_columns (classmethod)
from_columns(spark, columns, names=None, nan_to_none=True)

Create a FlickerDataFrame from columns.

Parameters:
Name | Type | Description | Default
---|---|---|---
spark | SparkSession | The Spark session used for creating the DataFrame | required
columns | Iterable[Iterable] | The columns to create the DataFrame from. Each column should be an iterable; for example, [[1, 2, 3], ['a', 'b', 'c']]. | required
names | list[str] or None | The column names of the DataFrame. If None, column names are generated as '0', '1', ..., str(ncols - 1). | None
nan_to_none | bool | Flag indicating whether to convert all NaN values to None. Default and recommended value is True. | True

Returns:
Type | Description
---|---
FlickerDataFrame | A new FlickerDataFrame built from the given columns
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> columns = [[1, 2, 3], ['a', 'b', 'c']]
>>> names = ['col1', 'col2']
>>> df = FlickerDataFrame.from_columns(spark, columns, names)
>>> df()
col1 col2
0 1 a
1 2 b
2 3 c
Raises:
Type | Description
---|---
ValueError | If the columns contain different numbers of rows
from_dict (classmethod)
from_dict(spark, data, nan_to_none=True)

Create a FlickerDataFrame object from a dictionary, in which dict keys represent column names and dict values represent column values.

Parameters:
Name | Type | Description | Default
---|---|---|---
spark | SparkSession | The Spark session used for creating the DataFrame | required
data | dict | The dictionary containing column names as keys and column values as values; for example, {'col1': [1, 2, 3], 'col2': [4, 5, 6]}. | required
nan_to_none | bool | Flag indicating whether to convert all NaN values to None. Default and recommended value is True. | True

Returns:
Type | Description
---|---
FlickerDataFrame | A new FlickerDataFrame built from the dictionary
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> data = {'col1': [1, 2, 3], 'col2': [4, 5, 6]}
>>> df = FlickerDataFrame.from_dict(spark, data)
>>> df()
col1 col2
0 1 4
1 2 5
2 3 6
from_pandas (classmethod)
from_pandas(spark, df, nan_to_none=True)

Create a FlickerDataFrame from a pandas.DataFrame.

Parameters:
Name | Type | Description | Default
---|---|---|---
spark | SparkSession | The Spark session used for creating the DataFrame | required
df | DataFrame | The pandas DataFrame to convert to a FlickerDataFrame | required
nan_to_none | bool | Flag indicating whether to convert all NaN values to None. Default and recommended value is True. | True

Returns:
Type | Description
---|---
FlickerDataFrame | A new FlickerDataFrame built from the pandas DataFrame
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> pandas_df = pd.DataFrame({'col1': [1, np.nan, 3], 'col2': [4, 5, np.nan]})
>>> pandas_df
col1 col2
0 1.0 4.0
1 NaN 5.0
2 3.0 NaN
>>> df = FlickerDataFrame.from_pandas(spark, pandas_df, nan_to_none=True)
>>> df()
col1 col2
0 1.0 4.0
1 None 5.0
2 3.0 None
>>> df = FlickerDataFrame.from_pandas(spark, pandas_df, nan_to_none=False)
>>> df()
col1 col2
0 1.0 4.0
1 NaN 5.0
2 3.0 NaN
from_records (classmethod)
from_records(spark, records, nan_to_none=True)

Create a FlickerDataFrame from a list of dictionaries (similar to the JSON Lines format).

Parameters:
Name | Type | Description | Default
---|---|---|---
spark | SparkSession | The Spark session used for creating the DataFrame | required
records | Iterable[dict] | An iterable of dictionaries. Each dictionary represents a row (aka record). | required
nan_to_none | bool | Flag indicating whether to convert all NaN values to None. Default and recommended value is True. | True

Returns:
Type | Description
---|---
FlickerDataFrame | A new FlickerDataFrame built from the records
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> records = [{'col1': 1, 'col2': 1}, {'col1': 2, 'col2': 2}, {'col1': 3, 'col2': 3}]
>>> df = FlickerDataFrame.from_records(spark, records)
>>> df()
col1 col2
0 1 1
1 2 2
2 3 3
from_rows (classmethod)
from_rows(spark, rows, names=None, nan_to_none=True)

Create a FlickerDataFrame from rows.

Parameters:
Name | Type | Description | Default
---|---|---|---
spark | SparkSession | The Spark session used for creating the DataFrame | required
rows | Iterable[Iterable] | The rows of data to be converted into a DataFrame; for example, [['a', 1, 2.0], ['b', 2, 4.0]]. | required
names | list[str] or None | The column names of the DataFrame. If None, column names are generated as '0', '1', ..., str(ncols - 1). | None
nan_to_none | bool | Flag indicating whether to convert all NaN values to None. Default and recommended value is True. | True

Returns:
Type | Description
---|---
FlickerDataFrame | A new FlickerDataFrame built from the rows
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> rows = [['a', 1, 2.0], ['b', 2, 4.0]]
>>> names = ['col1', 'col2', 'col3']
>>> df = FlickerDataFrame.from_rows(spark, rows, names)
>>> df()
col1 col2 col3
0 a 1 2.0
1 b 2 4.0
Raises:
Type | Description
---|---
ValueError | If the rows contain different numbers of columns
from_schema (classmethod)
from_schema(spark, schema=None, data=())

Creates a FlickerDataFrame object from a schema and, optionally, some data. This method can be very useful for creating an empty dataframe with a given schema. The best way to obtain a schema from another dataframe is df.schema. The input schema can also be empty, in which case a dataframe with no rows and no columns is created.

Parameters:
Name | Type | Description | Default
---|---|---|---
spark | SparkSession | The Spark session used for creating the DataFrame | required
schema | StructType or str or None | The schema for the DataFrame. Can be specified as a Spark StructType, as a schema string such as 'a string, b int', or as None. | None
data | RDD, Iterable, DataFrame, or np.ndarray | The data to populate the DataFrame. Can be an RDD, an iterable collection, an existing Spark DataFrame, or a NumPy array. | ()

Returns:
Type | Description
---|---
FlickerDataFrame | A new instance of FlickerDataFrame created with the provided schema and data
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> FlickerDataFrame.from_schema(spark) # Create a dataframe with no rows and no columns
FlickerDataFrame[]
>>> FlickerDataFrame.from_schema(spark, schema='a string, b int') # Create a dataframe with zero rows
FlickerDataFrame[a: string, b: int]
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['a', 'b'])
>>> df
FlickerDataFrame[a: bigint, b: bigint]
>>> FlickerDataFrame.from_schema(spark, schema=df.schema) # Create a dataframe with the same schema as df
FlickerDataFrame[a: bigint, b: bigint]
from_shape (classmethod)
from_shape(spark, nrows, ncols, names=None, fill='zero')

Create a FlickerDataFrame from a given shape and fill. This method is useful for creating test data and for experimentation.

Parameters:
Name | Type | Description | Default
---|---|---|---
spark | SparkSession | The Spark session used for creating the DataFrame | required
nrows | int | The number of rows in the DataFrame | required
ncols | int | The number of columns in the DataFrame | required
names | list[str] or None | The names of the columns in the DataFrame. If not provided, column names are generated as '0', '1', ..., str(ncols - 1). | None
fill | str | The value used for filling the DataFrame. Accepted values are: 'zero', 'one', 'rand', 'randn', 'rowseq', 'colseq'. | 'zero'

Returns:
Type | Description
---|---
FlickerDataFrame | A new instance of the FlickerDataFrame class created from the given shape and parameters
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['col1', 'col2'], fill='rowseq')
>>> df()
col1 col2
0 0 1
1 2 3
2 4 5
groupby
groupby(names)

Groups the rows of the DataFrame based on the specified column names, so that aggregations can be run on the groups. Returns a FlickerGroupedData object. This method is a pass-through to pyspark.sql.DataFrame.groupBy but returns a FlickerGroupedData object instead of a pyspark.sql.GroupedData object.

Parameters:
Name | Type | Description | Default
---|---|---|---
names | list[str] | The column names based on which the DataFrame rows should be grouped | required

Returns:
Type | Description
---|---
FlickerGroupedData | The grouped data
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> rows = [('spark', 10), ('pandas', 10), ('spark', 100)]
>>> df = FlickerDataFrame.from_rows(spark, rows, names=['name', 'number'])
>>> df.groupby(['name'])
FlickerGroupedData[grouping expressions: [name], value: [name: string, number: bigint], type: GroupBy]
>>> df.groupby(['name']).count()
FlickerDataFrame[name: string, count: bigint]
>>> df.groupby(['name']).count()()
name count
0 spark 2
1 pandas 1
head
head(n=5)

Return the top n rows as a FlickerDataFrame. This method differs from FlickerDataFrame.__call__(), which returns a pandas.DataFrame.

Parameters:
Name | Type | Description | Default
---|---|---|---
n | int or None | Number of rows to return. If not specified, defaults to 5. If df.nrows < n, only df.nrows rows are returned. If n=None, all rows are returned. | 5

Returns:
Type | Description
---|---
FlickerDataFrame | A new instance of FlickerDataFrame containing the top (at most) n rows
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 10, 2, names=['col1', 'col2'], fill='zero')
>>> df.head(3)
FlickerDataFrame[col1: bigint, col2: bigint]
join
join(right, on, how='inner', lprefix='', lsuffix='_l', rprefix='', rsuffix='_r')

Join the current FlickerDataFrame with another dataframe. This non-mutating method returns the joined dataframe as a FlickerDataFrame. This method preserves duplicate column names (those that are joined on) by renaming them in the join result.
Note that FlickerDataFrame.join is different from FlickerDataFrame.merge in both function signature and the merged/joined result.

Parameters:
Name | Type | Description | Default
---|---|---|---
right | FlickerDataFrame or DataFrame | The right DataFrame to join with the left DataFrame | required
on | dict[str, str] | Dictionary specifying which column names to join on. Keys represent column names from the left dataframe and values represent column names from the right dataframe. | required
how | str | The type of join to perform. 'inner': returns only the matching rows from both DataFrames. 'left': returns all the rows from the left DataFrame and the matching rows from the right DataFrame. 'right': returns all the rows from the right DataFrame and the matching rows from the left DataFrame. 'outer': returns all the rows from both DataFrames, including unmatched rows, with nulls for the missing side. | 'inner'
lprefix | str | Prefix to add to column names from the left dataframe that are duplicated in the join result | ''
lsuffix | str | Suffix to add to column names from the left dataframe that are duplicated in the join result | '_l'
rprefix | str | Prefix to add to column names from the right dataframe that are duplicated in the join result | ''
rsuffix | str | Suffix to add to column names from the right dataframe that are duplicated in the join result | '_r'

Returns:
Type | Description
---|---
FlickerDataFrame | The joined dataframe

Raises:
Type | Description
---|---
TypeError | If the right parameter is not a FlickerDataFrame or a pyspark.sql.DataFrame
ValueError | If the how parameter is not a supported join type
TypeError | If the keys or values of the on dictionary are not strings
KeyError | If the left or right DataFrame contains duplicate column names after renaming
NotImplementedError | To guard against unexpected changes in the underlying pyspark implementation
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> left = FlickerDataFrame.from_rows(spark, [('a', 1), ('b', 2), ('c', 3), ], ['x', 'number'])
>>> right = FlickerDataFrame.from_rows(spark, [('a', 4), ('d', 5), ('e', 6), ], ['x', 'number'])
>>> inner_join = left.join(right, on={'x': 'x'}, how='inner')
>>> inner_join() # 'x' columns from both left and right dataframes are preserved
x_l number_l x_r number_r
0 a 1 a 4
>>> spark = SparkSession.builder.getOrCreate()
>>> left = FlickerDataFrame.from_rows(spark, [('a', 1), ('b', 2), ('c', 3), ], ['x1', 'number'])
>>> right = FlickerDataFrame.from_rows(spark, [('a', 4), ('d', 5), ('e', 6), ], ['x2', 'number'])
>>> inner_join = left.join(right, on={'x1': 'x2'}, how='inner')
>>> inner_join() # renaming happens only when needed
x1 number_l x2 number_r
0 a 1 a 4
merge
merge(right, on, how='inner', lprefix='', lsuffix='_l', rprefix='', rsuffix='_r')

Merge the current FlickerDataFrame with another dataframe. This non-mutating method returns the merged dataframe as a FlickerDataFrame.
Note that FlickerDataFrame.merge is different from FlickerDataFrame.join in both function signature and the merged/joined result.

Parameters:
Name | Type | Description | Default
---|---|---|---
right | FlickerDataFrame or DataFrame | The right dataframe to merge with | required
on | Iterable[str] | Column names to join on. The column names must exist in both left and right dataframes; they appear only once in the merged result. | required
how | str | Type of join to perform. Possible values are 'inner', 'left', 'right', and 'outer'. | 'inner'
lprefix | str | Prefix to add to column names from the left dataframe that are duplicated in the merge result | ''
lsuffix | str | Suffix to add to column names from the left dataframe that are duplicated in the merge result | '_l'
rprefix | str | Prefix to add to column names from the right dataframe that are duplicated in the merge result | ''
rsuffix | str | Suffix to add to column names from the right dataframe that are duplicated in the merge result | '_r'

Returns:
Type | Description
---|---
FlickerDataFrame | The merged dataframe

Raises:
Type | Description
---|---
TypeError | If right is not a FlickerDataFrame or a pyspark.sql.DataFrame
ValueError | If how is not a supported join type
TypeError | If any element in on is not a string
KeyError | If renaming results in duplicate column names in the left dataframe
KeyError | If renaming results in duplicate column names in the right dataframe
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> left = FlickerDataFrame.from_rows(spark, [('a', 1), ('b', 2), ('c', 3), ], ['name', 'number'])
>>> right = FlickerDataFrame.from_rows(spark, [('a', 4), ('d', 5), ('e', 6), ], ['name', 'number'])
>>> inner_merge = left.merge(right, on=['name'], how='inner')
>>> inner_merge()
name number_l number_r
0 a 1 4
>>> left_merge = left.merge(right, on=['name'], how='left')
>>> left_merge()
name number_l number_r
0 a 1 4
1 b 2 None
2 c 3 None
rename
rename(from_to_mapper)

Renames columns in the FlickerDataFrame based on a provided mapping of the form {'old_col_name1': 'new_col_name1', 'old_col_name2': 'new_col_name2', ...}. This is a non-mutating method.

Parameters:
Name | Type | Description | Default
---|---|---|---
from_to_mapper | dict[str, str] | A dictionary containing the mapping of current column names to new column names | required

Returns:
Type | Description
---|---
FlickerDataFrame | A new instance of FlickerDataFrame with the renamed columns

Raises:
Type | Description
---|---
TypeError | If the provided from_to_mapper is not a dict
KeyError | If any of the keys in from_to_mapper is not an existing column name
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 4, names=['col1', 'col2', 'col3', 'col4'], fill='zero')
>>> df
FlickerDataFrame[col1: bigint, col2: bigint, col3: bigint, col4: bigint]
>>> df.rename({'col1': 'col_a', 'col3': 'col_c'})
FlickerDataFrame[col_a: bigint, col2: bigint, col_c: bigint, col4: bigint]
>>> df # df is not mutated
FlickerDataFrame[col1: bigint, col2: bigint, col3: bigint, col4: bigint]
show
show(n=5, truncate=True, vertical=False)

Prints the first n rows to the console as a (possibly giant) string. This is a pass-through method to pyspark.sql.DataFrame.show().

Parameters:
Name | Type | Description | Default
---|---|---|---
n | int or None | Number of rows to show. Defaults to 5. | 5
truncate | bool or int | If True, strings longer than 20 characters are truncated. If set to a number greater than one, long strings are truncated to that length and cells are right-aligned. | True
vertical | bool | If True, print output rows vertically (one line per column value). | False

Returns:
Type | Description
---|---
None | Nothing is returned; output is printed to the console
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['col1', 'col2'], fill='zero')
>>> df.show()
+----+----+
|col1|col2|
+----+----+
| 0| 0|
| 0| 0|
| 0| 0|
+----+----+
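For wide dataframes, vertical=True can be easier to read. A sketch of the expected output (the exact formatting comes from pyspark and may vary by version):
>>> df.show(n=2, vertical=True)
-RECORD 0------
 col1 | 0
 col2 | 0
-RECORD 1------
 col1 | 0
 col2 | 0
only showing top 2 rows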
sort
sort(names, ascending=True)

Returns a new FlickerDataFrame sorted by the specified column name(s). This non-mutating method is a pass-through to pyspark.sql.DataFrame.sort but with some checks and a slightly different function signature.

Parameters:
Name | Type | Description | Default
---|---|---|---
names | list[str] | The list of column names to sort the DataFrame by | required
ascending | bool | Whether to sort the DataFrame in ascending order | True

Returns:
Type | Description
---|---
FlickerDataFrame | The sorted dataframe

Raises:
Type | Description
---|---
KeyError | If any name in names is not a column in the dataframe
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> rows = [(10, 1), (1, 2), (100, 3)]
>>> df = FlickerDataFrame.from_rows(spark, rows, names=['x', 'y'])
>>> df()
x y
0 10 1
1 1 2
2 100 3
>>> df.sort(['x'])
FlickerDataFrame[x: bigint, y: bigint]
>>> df.sort(['x'])()
x y
0 1 2
1 10 1
2 100 3
>>> df # df is not mutated
FlickerDataFrame[x: bigint, y: bigint]
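A descending sort on the same data is a one-flag change (a sketch; only the ascending flag differs):
>>> df.sort(['x'], ascending=False)()
     x  y
0  100  3
1   10  1
2    1  2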
take
take(n=5, convert_to_dict=True)

Return the top n rows as a list.

Parameters:
Name | Type | Description | Default
---|---|---|---
n | int or None | Number of rows to return. If not specified, defaults to 5. If df.nrows < n, only df.nrows rows are returned. If n=None, all rows are returned. | 5
convert_to_dict | bool | If False, output is a list of pyspark.sql.Row objects. If True, each row is converted to a dict. | True

Returns:
Type | Description
---|---
list[dict or Row] | A list of at most n items. Each item is either a pyspark.sql.Row or a dict object.
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> rows = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
>>> df = FlickerDataFrame.from_rows(spark, rows, names=['col1', 'col2'])
>>> df.take(2, convert_to_dict=True)
[{'col1': 1, 'col2': 'a'}, {'col1': 2, 'col2': 'b'}]
>>> df.take(2, convert_to_dict=False)
[Row(col1=1, col2='a'), Row(col1=2, col2='b')]
to_dict
to_dict(n=5)

Converts the FlickerDataFrame into a dictionary representation, in which dict keys represent column names and dict values represent column values.

Parameters:
Name | Type | Description | Default
---|---|---|---
n | int or None | Number of rows to return. If not specified, defaults to 5. If n=None, all rows are returned. | 5

Returns:
Type | Description
---|---
dict | A dictionary representation of the FlickerDataFrame
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['col1', 'col2'], fill='colseq')
>>> df()
col1 col2
0 0 3
1 1 4
2 2 5
>>> df.to_dict(n=2)
{'col1': [0, 1], 'col2': [3, 4]}
to_pandas
to_pandas()

Converts a FlickerDataFrame to a pandas.DataFrame. Calling this method on a big FlickerDataFrame may result in out-of-memory errors. This method is simply a pass-through to pyspark.sql.DataFrame.toPandas(). Consider using FlickerDataFrame.__call__() instead of FlickerDataFrame.to_pandas(), because pyspark.sql.DataFrame.toPandas() can cause unwanted None-to-NaN conversions. See the example below.

Returns:
Type | Description
---|---
DataFrame | pandas DataFrame
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> pandas_df = pd.DataFrame({'col1': [1.0, np.nan, None], 'col2': [4.0, 5.0, np.nan]}, dtype=object)
>>> pandas_df
col1 col2
0 1.0 4.0
1 NaN 5.0
2 None NaN
>>> df = FlickerDataFrame.from_pandas(spark, pandas_df, nan_to_none=False)
>>> df()
col1 col2
0 1.0 4.0
1 NaN 5.0
2 None NaN
>>> df.to_pandas() # causes unwanted None to NaN conversion in df.to_pandas().iloc[2, 0]
col1 col2
0 1.0 4.0
1 NaN 5.0
2 NaN NaN
unique
unique()

Returns a new FlickerDataFrame with unique rows. This non-mutating method is just a pass-through to pyspark.sql.DataFrame.distinct.

Returns:
Type | Description
---|---
FlickerDataFrame | A new FlickerDataFrame with unique rows
Examples:
>>> spark = SparkSession.builder.getOrCreate()
>>> df = FlickerDataFrame.from_shape(spark, 3, 2, names=['col1', 'col2'], fill='zero')
>>> df()
col1 col2
0 0 0
1 0 0
2 0 0
>>> df.unique()
FlickerDataFrame[col1: bigint, col2: bigint]
>>> df.unique()()
col1 col2
0 0 0
>>> df.shape # df is not mutated
(3, 2)