Lambda in Databricks - The Basics

A lambda in programming is a small piece of code that can (among other things) be passed as an argument to a function. It has many uses, but one of the most common is that it lets a calling function customize the behavior of the called function, without the latter needing to know the details.

Morten Mjelde

Published: 26 April 2023

This article will present a simple example of how lambdas (sometimes called anonymous functions) can be used in Databricks. The cases we will look at demonstrate how we can write transformation code specific to a data set and inject this code into a standard transformation function. We can then use the standard transformation on this particular data set, without having to rewrite existing code. Using this method, we can better organize our notebook and reuse code. 

The main goals of this article are:

  1. Demonstrate a simple use of lambda in Databricks.
  2. Show how we can use this to improve code quality.
  3. Point out why, in certain cases, this method works where simpler alternatives do not.

In this article we will use PySpark, but the general principle works in similar languages, such as Scala, and indeed outside of Databricks as well. Note that for brevity we will omit import statements and code comments.

Note that this article is not intended to be an in-depth guide to lambdas in programming, but rather an introduction to their use in data processing.
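Still, as a minimal illustration of the general idea, here is a plain Python example of passing a lambda to a function. The names here are purely illustrative:

def applyTwice(value, transform):
   # transform can be any callable, including a lambda
   return transform(transform(value))

result = applyTwice(3, lambda x: x * 2)  # 3 -> 6 -> 12

The function applyTwice does not know what transform does; the caller decides that at the call site. The Databricks examples below apply the same principle to DataFrame transformations.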

Case 1: Simple data 

Consider the data shown below, which we will call DataSet1. Let's say that our task is to write transformations that add columns with each person's initials and full name.

location  | firstName | lastName
----------|-----------|----------
Oslo      | Harald    | Haraldson
Stockholm | Ingrid    | Hansen
Oslo      | Mark      | Hunter
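To follow along, DataSet1 can be created directly in a notebook. The snippet below is just a sketch; it assumes the SparkSession that Databricks notebooks expose as spark:

DataSet1 = spark.createDataFrame(
   [
       ("Oslo", "Harald", "Haraldson"),
       ("Stockholm", "Ingrid", "Hansen"),
       ("Oslo", "Mark", "Hunter"),
   ],
   ["location", "firstName", "lastName"],
)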

The transformations themselves can be implemented with the following two functions:

def addInitials(df: DataFrame) -> DataFrame:
   return df.withColumn("initials", concat(col("firstName").substr(1, 1), col("lastName").substr(1, 1)))

def addFullName(df: DataFrame) -> DataFrame:
   return df.withColumn("fullName", concat(col("firstName"), lit(" "), col("lastName")))

And to make things even easier we define a third function, nameTransformations, that calls the first two.

def nameTransformations(data: DataFrame) -> DataFrame:
   data = addInitials(data)
   data = addFullName(data)
   return data

We can use this to transform our data:

result = nameTransformations(data=DataSet1)

And we get the following:

location  | firstName | lastName  | initials | fullName
----------|-----------|-----------|----------|------------------
Oslo      | Harald    | Haraldson | HH       | Harald Haraldson
Stockholm | Ingrid    | Hansen    | IH       | Ingrid Hansen
Oslo      | Mark      | Hunter    | MH       | Mark Hunter

So far so good, but anyone familiar with data engineering or data science will know that data from different sources can be messy and often organized differently. Consider DataSet2: 

location   | firstName | nobleName
-----------|-----------|--------------------
Copenhagen | Roger     | of house Henriksen
Copenhagen | Hans      | of house Dale

This new data contains the same type of information (more or less) as DataSet1, but the schema is different: instead of a lastName column we have a nobleName column. We want to reuse the existing code as much as possible when processing this new data, and this is where the lambda comes in.

We first modify nameTransformations to accept a callable parameter, which is essentially a reference to a piece of code that can be invoked with a DataFrame as input.

def nameTransformations(data: DataFrame, correctNameColumns: Callable[[DataFrame], DataFrame]) -> DataFrame:
   data = correctNameColumns(data)
   data = addInitials(data)
   data = addFullName(data)
   return data

Next, we supply this argument for DataSet2: a single line of code that creates the missing lastName column from nobleName.

result = nameTransformations(
   data=DataSet2,
   correctNameColumns=lambda df: df.withColumn("lastName", element_at(split(col("nobleName"), " "), -1))
)

This gives the correct result.

location   | firstName | nobleName          | lastName  | initials | fullName
-----------|-----------|--------------------|-----------|----------|-----------------
Copenhagen | Roger     | of house Henriksen | Henriksen | RH       | Roger Henriksen
Copenhagen | Hans      | of house Dale      | Dale      | HD       | Hans Dale

Note that the change to nameTransformations breaks backwards compatibility for DataSet1. This is easily solved by adding a default value to the correctNameColumns parameter.

def nameTransformations(data: DataFrame, correctNameColumns: Callable[[DataFrame], DataFrame] = lambda df: df) -> DataFrame:
   ...

This is the shortest lambda possible: it simply returns its input, unchanged. The code for DataSet1 will now work as before, and when no correctNameColumns is provided, no additional transformation is applied.
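Since the parameter now has a default, the original call for DataSet1 works exactly as before, with no lambda supplied:

result = nameTransformations(data=DataSet1)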

We now have a complete example of how to use a lambda to inject additional steps into a data transformation function. This lets us add code specific to the data at hand, without having to modify the nameTransformations function for each new case. Looking at this from a higher perspective, the function nameTransformations is only concerned with creating the new columns, and to do this it needs a firstName and a lastName column. If these are not already in the data, the lambda lets us tell the function how to add one or both; the function itself neither knows nor needs to know the details.

Imagine an additional data set that contains neither a firstName nor a lastName column, but instead a single name column formatted like this:

location | name
---------|----------------
Helsinki | Miller, George

In this case we can still reuse the above nameTransformations function without any modifications, simply by providing a lambda that generates the needed columns, as sketched below.
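One possible lambda for this format splits the name column on the comma. This is only a sketch, and the data set name DataSet5 is hypothetical:

result = nameTransformations(
   data=DataSet5,  # hypothetical name for a data set shaped like the table above
   correctNameColumns=lambda df: df
       .withColumn("firstName", trim(element_at(split(col("name"), ","), 2)))
       .withColumn("lastName", trim(element_at(split(col("name"), ","), 1)))
)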

The complete code for this case can be found in this notebook: Databricks Lambda Case 1.

Case 2: Nested data

The observant reader might have noticed that in the first example the lambda function was not strictly necessary. After all, we could create the lastName column for DataSet2 prior to calling the nameTransformations function and achieve the same result. We will next see an example where this is not practical. Note that to make this example clearer, we will not reuse any code from Case 1, except addInitials and addFullName.

Consider a variant of DataSet1, named DataSet3, but where the people living at a location are now in an array column.

location  | population
----------|---------------------------------------------------------------------------------
Oslo      | [{firstName: Harald, lastName: Haraldson}, {firstName: Mark, lastName: Hunter}]
Stockholm | [{firstName: Ingrid, lastName: Hansen}]
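A nested data set like this can be created for testing as follows. Again, this is a sketch that assumes the Databricks-provided SparkSession spark; the schema is given as a DDL string so that population becomes an array of structs:

DataSet3 = spark.createDataFrame(
   [
       ("Oslo", [("Harald", "Haraldson"), ("Mark", "Hunter")]),
       ("Stockholm", [("Ingrid", "Hansen")]),
   ],
   "location string, population array<struct<firstName: string, lastName: string>>",
)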

To produce the same result as before, we need to add an explode step to our transformation.

def explodePopulation(df: DataFrame) -> DataFrame:
   return df.withColumn("exploded_population", explode("population")).select("location", "exploded_population.*")

def nameTransformationsNested(data: DataFrame) -> DataFrame:
   data = explodePopulation(data)
   data = addInitials(data)
   data = addFullName(data)
   return data

The new nameTransformationsNested function includes one additional step compared to nameTransformations. We can now run this using DataSet3 and get the same output as in the first case.

result = nameTransformationsNested(data=DataSet3)

location  | firstName | lastName  | initials | fullName
----------|-----------|-----------|----------|------------------
Oslo      | Harald    | Haraldson | HH       | Harald Haraldson
Stockholm | Ingrid    | Hansen    | IH       | Ingrid Hansen
Oslo      | Mark      | Hunter    | MH       | Mark Hunter

Let us now consider DataSet4. 

location   | population
-----------|-------------------------------------------------------------------------------------------------
Copenhagen | [{firstName: Roger, nobleName: of house Henriksen}, {firstName: Hans, nobleName: of house Dale}]
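DataSet4 can be sketched the same way, under the same assumptions as the DataSet3 snippet above:

DataSet4 = spark.createDataFrame(
   [
       ("Copenhagen", [("Roger", "of house Henriksen"), ("Hans", "of house Dale")]),
   ],
   "location string, population array<struct<firstName: string, nobleName: string>>",
)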

As with DataSet3, the names are now in an array column, making it hard to add the required lastName column up front. But if the explode step has already been executed, then the exact same lambda as before can be used for this purpose. We modify the nameTransformationsNested function and provide the lambda.

def nameTransformationsNested(data: DataFrame, correctNameColumns: Callable[[DataFrame], DataFrame] = lambda df: df) -> DataFrame:
   data = explodePopulation(data)
   data = correctNameColumns(data)
   data = addInitials(data)
   data = addFullName(data)
   return data

result = nameTransformationsNested(
   data=DataSet4,
   correctNameColumns=lambda df: df.withColumn("lastName", element_at(split(col("nobleName"), " "), -1))
)

Notice that the correctNameColumns function must be executed between explodePopulation and addInitials, which means that, unlike in the first example, we cannot easily create the lastName column prior to calling nameTransformationsNested. The finished function can now be used to transform both DataSet3 and DataSet4.

location   | firstName | nobleName          | lastName  | initials | fullName
-----------|-----------|--------------------|-----------|----------|-----------------
Copenhagen | Roger     | of house Henriksen | Henriksen | RH       | Roger Henriksen
Copenhagen | Hans      | of house Dale      | Dale      | HD       | Hans Dale

The complete code for this case can be found in this notebook: Databricks Lambda Case 2.

Conclusion

We have seen how to use a lambda in Databricks to inject a custom transformation into a function, thus allowing us to reuse code for multiple data sets with similar, but not identical, schemas. Situations like these are very common in data analysis and data engineering. The described method lets us write code that is reusable, readable, and maintainable, all of which are important qualities of production-quality software.