Quick Example

This example describes how to use flicker and provides a good comparison between the flicker and PySpark dataframe APIs.

The example assumes that you have already installed flicker. Please see Getting Started for installation instructions. This example uses flicker 0.0.16 and pyspark 2.4.5.

You can follow along without having to install or set up a Spark environment by using the flicker-playground Docker image, as described in Getting Started.

The entire code is available as example.py or example.ipynb.

Create the DataFrame

To begin, let's create a Spark session and a PySpark dataframe; this step is plain PySpark. In some cases, you may already have a spark: SparkSession object defined for you (such as when running the pyspark executable or on AWS EMR).

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('PySparkShell').getOrCreate()
pyspark_df = spark.createDataFrame(
    [(1, 'Turing', 41), (2, 'Laplace', 77), (3, 'Kolmogorov', 84)],
    'id INT, name STRING, age INT')

To get the benefits of flicker, let's create a FlickerDataFrame from the PySpark dataframe. This is easy – just call the default constructor.

from flicker import FlickerDataFrame
df = FlickerDataFrame(pyspark_df)

If you're following along with this example in your own interactive Python terminal, you'll notice that the above step is pretty fast. A FlickerDataFrame simply wraps the PySpark dataframe within itself [1] (you can access it at df._df). Flicker does not copy any data from pyspark_df to df in the above code snippet. This means that no matter how big your PySpark dataframe is, creating a Flicker dataframe is always quick.
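
You can quickly convince yourself of this. Assuming the constructor stores the passed-in dataframe as-is (which is how composition works here), the wrapped object is the very same PySpark dataframe, not a copy:

df._df is pyspark_df
# True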

In the rest of this example, we show code for both Flicker and PySpark side by side (the Flicker snippet first, then the PySpark equivalent).

Printing a distributed dataframe may require pulling data in from worker nodes, which, in turn, may need to perform any un-executed operations before they can send the data. This makes printing a PySpark or Flicker dataframe a slow operation, unlike printing a pandas dataframe. This is why neither PySpark nor Flicker shows you the contents when you print df or pyspark_df.

df
# FlickerDataFrame[id: int, name: string, age: int]
pyspark_df
# DataFrame[id: int, name: string, age: int]

To see the contents (actually, just the first few rows) of the dataframe, we can invoke the .show() method. Since we have to print dataframes very often (such as when performing interactive analysis), Flicker lets you "print" the first few rows by simply calling the dataframe.

df()
#    id        name  age
# 0   1      Turing   41
# 1   2     Laplace   77
# 2   3  Kolmogorov   84
pyspark_df.show()
# +---+----------+---+
# | id|      name|age|
# +---+----------+---+
# |  1|    Turing| 41|
# |  2|   Laplace| 77|
# |  3|Kolmogorov| 84|
# +---+----------+---+

If you're running the commands in a terminal, you will see output like that shown above. For this small example, the Flicker version of the printed content looks unimpressive next to the PySpark version, but Flicker-printed content looks much better for bigger dataframes. If you're running the same commands in a Jupyter notebook, the Flicker-printed content would appear as a pretty, mildly-interactive HTML dataframe, whereas the PySpark-printed content would just be text.

Under the hood, Flicker and PySpark have very different behaviors. PySpark's pyspark_df.show() uses side effects – it truly prints the formatted string to stdout and then returns None. Flicker's df(), on the other hand, returns a small pandas dataframe, which then gets printed appropriately depending on the interactive tool (such as Jupyter or IPython) [2]. This also means that if you want to inspect the printed dataframe, you can simply do this:

pandas_df_sample = df()
pandas_df_sample['name'].values
# array(['Turing', 'Laplace', 'Kolmogorov'], dtype=object)
pandas_df_sample = pyspark_df.limit(5).toPandas()
pandas_df_sample['name'].values
# array(['Turing', 'Laplace', 'Kolmogorov'], dtype=object)

Obviously, PySpark lets you do this too, but with more verbosity.

Inspect shape and columns

Flicker provides a pandas-like API. The same result may be obtained using PySpark with a little bit more verbosity.

df.shape
# (3, 3)
(pyspark_df.count(), len(pyspark_df.columns))
# (3, 3)

Note that Flicker still uses PySpark's .count() method under the hood to get the number of rows. This means that both the Flicker and PySpark snippets above may be slow the first time we run them. However, FlickerDataFrame caches the row count, which means that invoking df.shape a second time should be instantaneous, as long as df has not been modified since the first invocation.
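
A rough way to observe this caching is shown below; this snippet is just an illustration, and the exact timings depend entirely on your data and cluster:

import time

t0 = time.time()
_ = df.shape   # the first call may trigger a full count
t1 = time.time()
_ = df.shape   # the second call should reuse the cached count
t2 = time.time()
# For a large dataframe, expect (t2 - t1) to be much smaller than (t1 - t0).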

Getting the column names is also easy. Flicker differentiates between a column (a pyspark.sql.Column object) and a column name (a str object), which is why we named the property df.names instead of df.columns. As with a PySpark dataframe, we can get the data types for all the columns.

df.names
# ['id', 'name', 'age']
df.dtypes
# [('id', 'int'), ('name', 'string'), ('age', 'int')]
pyspark_df.columns
# ['id', 'name', 'age']
pyspark_df.dtypes
# [('id', 'int'), ('name', 'string'), ('age', 'int')]

Extracting a column

Unlike a pandas.Series object, the pyspark.sql.Column object does not materialize the column for us. Since Flicker is just an API over PySpark, Flicker does not materialize the column either.

df['name']  # not a FlickerDataFrame object
# Column<b'name'>
pyspark_df['name']  # not a pyspark.sql.DataFrame object
# Column<b'name'>

PySpark does not provide a proper equivalent of pandas' Series object. If we wanted to perform an operation on a column, we may still be able to do it, albeit in a roundabout way. For example, we can count the number of distinct names like this:

df[['name']].distinct().nrows
# 3
pyspark_df[['name']].distinct().count()
# 3

Extracting multiple columns

Luckily, this is the same in Flicker, PySpark, and pandas. As previously mentioned, the contents don't get printed unless we specifically ask for them.

df[['name', 'age']]
# FlickerDataFrame[name: string, age: int]
pyspark_df[['name', 'age']]
# DataFrame[name: string, age: int]

Creating a new column

This is where Flicker shines – you can use a pandas-like assignment API. Observant readers may notice that the following makes FlickerDataFrame objects mutable, unlike the immutable pyspark.sql.DataFrame object. This is by design.

df['is_age_more_than_fifty'] = df['age'] > 50
df()  # Must print to see the output
#    id        name  age  is_age_more_than_fifty
# 0   1      Turing   41                   False
# 1   2     Laplace   77                    True
# 2   3  Kolmogorov   84                    True
pyspark_df = pyspark_df.withColumn('is_age_more_than_fifty', pyspark_df['age'] > 50)
pyspark_df.show()  # Must print to see the output
# +---+----------+---+----------------------+
# | id|      name|age|is_age_more_than_fifty|
# +---+----------+---+----------------------+
# |  1|    Turing| 41|                 false|
# |  2|   Laplace| 77|                  true|
# |  3|Kolmogorov| 84|                  true|
# +---+----------+---+----------------------+
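
Because assignment mutates the FlickerDataFrame in place, there is no reassignment on the Flicker side (unlike pyspark_df = pyspark_df.withColumn(...) above). A quick check, added here for illustration:

'is_age_more_than_fifty' in df.names
# True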

The above combination of performing an operation and then printing is a common pattern in interactive analysis. This is because simply executing df['is_age_more_than_fifty'] = df['age'] > 50 does not actually perform the computation. It's only when you print (or count, or take any other action) that the previously specified computation is actually performed. Printing immediately after specifying an operation helps catch errors early.

Filtering

This is also the same in Flicker, PySpark, and pandas.

# Use boolean column to filter
df[df['is_age_more_than_fifty']]
# FlickerDataFrame[id: int, name: string, age: int, is_age_more_than_fifty: boolean]

# Filter and print in one-line
df[df['age'] < 50]()
#    id    name  age  is_age_more_than_fifty
# 0   1  Turing   41                   False
# Use boolean column to filter
pyspark_df[pyspark_df['is_age_more_than_fifty']]
# DataFrame[id: int, name: string, age: int, is_age_more_than_fifty: boolean]

# Filter and print in one-line
pyspark_df[pyspark_df['age'] < 50].show()
# +---+------+---+----------------------+
# | id|  name|age|is_age_more_than_fifty|
# +---+------+---+----------------------+
# |  1|Turing| 41|                 false|
# +---+------+---+----------------------+

Common operations

Flicker comes loaded with methods that perform common operations. A prime example is generating value counts, typically done in pandas via the .value_counts() method. Flicker also provides this method, with only minor (but sensible) modifications to the method arguments.

df.value_counts('name')
# FlickerDataFrame[name: string, count: bigint]

df.value_counts('name')()
#          name  count
# 0      Turing      1
# 1     Laplace      1
# 2  Kolmogorov      1
pyspark_df.groupby('name').count()
# DataFrame[name: string, count: bigint]

pyspark_df.groupby('name').count().show()
# +----------+-----+
# |      name|count|
# +----------+-----+
# |    Turing|    1|
# |Kolmogorov|    1|
# |   Laplace|    1|
# +----------+-----+

Even though the PySpark snippet above looks simple enough, it requires the programmer to know that they have to use the .groupby() method to generate value counts (much like in SQL). This additional cognitive load on the programmer is a hallmark of the PySpark dataframe API. But Flicker can do more than that.

df.value_counts('is_age_more_than_fifty', normalize=True,
                sort=True, ascending=True)()
#    is_age_more_than_fifty     count
# 0                   False  0.333333
# 1                    True  0.666667
nrows = pyspark_df.count()
count_df = (pyspark_df.groupBy('is_age_more_than_fifty')
            .count()
            .orderBy('count', ascending=True))
count_df.withColumn('count', count_df['count'] / nrows).show()
# +----------------------+------------------+
# |is_age_more_than_fifty|             count|
# +----------------------+------------------+
# |                 false|0.3333333333333333|
# |                  true|0.6666666666666666|
# +----------------------+------------------+

PySpark requires defining more variables to normalize the counts. In practice, we need to generate value counts for a lot of dataframes simply to inspect the data. The obvious solution is to wrap the PySpark code snippet into a function and then re-use it. That's exactly what Flicker does!
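
For illustration, here is a minimal sketch of the kind of helper you would end up writing in plain PySpark. This is our own sketch, not Flicker's actual implementation:

def value_counts(pyspark_df, name, normalize=False, sort=True, ascending=False):
    # Group by the column and count the occurrences of each value.
    out = pyspark_df.groupBy(name).count()
    if normalize:
        # Divide by the total number of rows to get fractions.
        out = out.withColumn('count', out['count'] / pyspark_df.count())
    if sort:
        out = out.orderBy('count', ascending=ascending)
    return out

value_counts(pyspark_df, 'is_age_more_than_fifty',
             normalize=True, ascending=True).show()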

Generating value counts is only one such example. See other methods, such as any, all, min, and rows_with_max, for more common operations. Even more useful are the rename and join methods, which do a lot more than the corresponding PySpark methods.
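
As a quick taste of rows_with_max (it also appears in the chaining example below), and assuming it keeps the rows where the given column attains its maximum, this snippet should select Kolmogorov's row:

df.rows_with_max('age')()
# expect the row with name 'Kolmogorov' and age 84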

Chain everything together

You can chain everything together into complex operations. Flicker can often perform a sequence of operations in one line without having to define any temporary variables.

df[df['age'] < 50].rows_with_max('age')[['name']]()['name'][0]
# 'Turing'
filtered_df = pyspark_df[pyspark_df['age'] < 50]
age_max = filtered_df.agg({'age': 'max'}).collect()[0][0]
filtered_df[filtered_df['age'].isin([age_max])][['name']].toPandas()['name'][0]
# 'Turing'

It may appear that the Flicker expression above is too complicated to be meaningful. However, while performing interactive analysis, such an expression arises naturally as your mind searches for increasingly specific information. This is better experienced than described.

Get the PySpark dataframe

If you have to use the PySpark dataframe for some operations, you can easily get the underlying PySpark dataframe stored in the ._df attribute. This may be useful when there is no Flicker method available to perform an operation that can easily be performed with PySpark [3]. You can also mix and match – perform some computation with Flicker and the rest with PySpark.

pyspark_df = df._df
processed_pyspark_df = df[df['age'] < 50].rows_with_max('age')._df
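
Mixing and matching can go both ways: you can drop down to PySpark for an operation and wrap the result back into a FlickerDataFrame to continue with Flicker. In this sketch, repartition is just a stand-in for any PySpark-only operation:

repartitioned_df = FlickerDataFrame(df._df.repartition(4))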

There is more

This example describes only the basic Flicker dataframe API. We note some more advantages below:

  • FlickerDataFrame does not allow duplicate column names and never creates them (a PySpark dataframe does both and then fails awkwardly).
  • The FlickerDataFrame.rename method lets you rename multiple columns at once.
  • FlickerDataFrame.join lets you specify the join condition using a dict and lets you add a suffix/prefix in one line of code.
  • FlickerDataFrame comes with many factory constructors, such as from_rows, from_columns, and even from_shape, that let you create a FlickerDataFrame quickly.
  • flicker.udf contains some commonly needed UDF functions, such as type_udf and len_udf.
  • flicker.recipes contains more useful tools for real-world data analysis.

  [1] Composition is the fancy term for it.

  [2] Conversion to a pandas dataframe can sometimes convert np.nan into None.

  [3] If possible, please contribute by filing a GitHub issue and/or sending a PR.


Last update: 2020-08-13