How to Read and Write Data using Azure Databricks

Azure Databricks is a Microsoft Azure cloud service that provides a scalable platform for data analytics using Apache Spark.  

It is a fully managed, cloud-based data analytics platform that allows developers to create enterprise-grade data apps more quickly, which accelerates AI and innovation.

In this tutorial, you will learn how to get started with the platform in Azure and how to perform data interactions, including reading, writing, and analyzing datasets.

By the end of this tutorial, you will be able to use Azure Databricks to read multiple file types, both with and without a schema.

What is Databricks?

Azure Databricks was built as a joint effort between Microsoft and Databricks, the company founded by the team that created Apache Spark. It provides data science, engineering, and analytics teams with a single platform for big data processing and machine learning.

Not only does Databricks give you a fast, optimized Apache Spark environment, but it also supports large-scale data processing for both batch and streaming workloads.

As you will see in this tutorial, Azure Databricks makes it easy to execute massive Spark workloads. It does this by fusing the scalability and security of Microsoft’s Azure platform with the power of Databricks, an end-to-end, managed Apache Spark platform built for the cloud.

In all, Databricks is one of the most prominent platforms you can use to deal with big data and perform collaborative tasks in the Data Science field.

How to Create Your Databricks Workspace

To unlock insights from all your data (and build Artificial Intelligence solutions) with Azure Databricks, you must set up an Azure Databricks workspace in your Azure subscription.

After you’ve created a valid and active Microsoft Azure account, you will then navigate to the Azure portal.  

Click the Create a resource button.

On the search prompt in the Create a resource page, search for Azure Databricks and select the Azure Databricks option that appears.

Next, click the Create button.

At this point, you will have to enter the project details before clicking the Review + create button.  

The Subscription option will differ for you, depending on the Azure subscription you have available.

For the Resource group option, you can select a pre-created one if you are familiar with Azure and already have one. Otherwise, create a new one using the Create new button.

A resource group is simply a container that groups related services in your Azure account, making them easier to manage.

The Workspace name option must be a globally unique name. Mine is named salim-freeCodeCamp-databricks.

The Region option should be filled in with the location closest to where you are. A region is a set of physical data centers that host Azure services. Since I am based in Nigeria, I selected South Africa North.

Ideally, for the Pricing Tier option, you would select the Standard option, which includes Apache Spark with Azure AD. However, the other two options are also valid: the Premium tier adds role-based access controls and other enterprise-level features, and the Trial tier is a 14-day free trial of a Premium-level workspace.

At this point, you can then click the Review + create button.

The validation process usually takes three to five minutes, give or take. Once the validation and deployment processes are complete for the workspace, you can launch the workspace using the Launch Workspace button that appears.

By launching it this way, you take advantage of the Azure integration with Databricks: clicking the button automatically signs you in using Azure Active Directory single sign-on.

In a new tab, the Microsoft Azure Databricks home page will come up.

You will then create a cluster using the green Create a cluster button (the top one of the three) on the left of the page.

When you click that button, a list of your available clusters will come up. If, like me, you have not created any yet, your list will be empty as well.

I will then use the deep blue Create Cluster button to create one.  

You will then select the Single node option (changing it from the default Multi node option) and keep the other settings as their defaults. Then click the Create Cluster button at the bottom of the page. This will take a few minutes.

Although the name of my cluster is shown as Salim Oyinlola's freeCodeCamp Cluster, the initial pre-assigned name will be full_name's Cluster, where full_name is the name associated with your Azure account. You can edit this name. You can also explore the Multi node option if your dataset is large.

You can specify the configuration settings yourself, although we're leaving them as the defaults in this tutorial.

With the cluster created, you will then have to import some ready-to-use notebooks. To do this, use the left taskbar to navigate through Workspace > Users > your_account, then right-click to open the dropdown menu and select the Import option.

Once you click on the Import button, you will then select the URL option and paste the following URL:

After importing, your workspace should look like this.

With that done, you have successfully created your Azure Databricks workspace.

When you open the file named Reading Data - CSV, you will see the notebook shown below.

On the top left corner, you will change the dropdown which initially shows Detached to your cluster’s name. Mine is named Salim Oyinlola's freeCodeCamp Cluster.

With your cluster attached, you will then run all the cells one after the other.

At its core, the notebook simply reads the data in CSV format. It then adds an option that tells the reader that the data contains a header row and to use that header to determine the column names. You can also add an option that tells the reader to infer each column's data type (the column names and types together are known as a schema).
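As a rough sketch, the core of that notebook looks something like the PySpark snippet below. The file path is only a placeholder, not the path used in the imported notebook, and `spark` is the SparkSession that Databricks notebooks provide automatically.

```python
# A minimal sketch of reading a CSV file in a Databricks notebook.
# The path below is a placeholder; substitute the path to your own file on DBFS.
csv_path = "dbfs:/FileStore/tables/sample_data.csv"

df = (spark.read
      .option("header", "true")       # the first row holds the column names
      .option("inferSchema", "true")  # let Spark infer each column's data type
      .csv(csv_path))

df.printSchema()  # inspect the inferred schema
display(df)       # preview the data in the notebook
```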

It is important to note that data can be read in different formats, such as JSON (with or without a schema), Parquet, and tables and views. To achieve this, you can simply run the respective notebook for each format; the read pattern carries over, as the sketch below shows.
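The snippet below is only a sketch with placeholder paths and a made-up schema, not the code from the imported notebooks, but it shows the same pattern applied to JSON, Parquet, and views.

```python
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Reading JSON (placeholder path). Spark infers the schema by default,
# or you can supply one explicitly to skip the inference pass.
json_schema = StructType([
    StructField("id", LongType(), True),
    StructField("name", StringType(), True),
])
json_df = spark.read.schema(json_schema).json("dbfs:/FileStore/tables/sample.json")

# Parquet files carry their schema with them, so no extra options are needed.
parquet_df = spark.read.parquet("dbfs:/FileStore/tables/sample.parquet")

# A DataFrame can also be exposed as a temporary view and queried with SQL.
parquet_df.createOrReplaceTempView("sample_view")
spark.sql("SELECT COUNT(*) AS row_count FROM sample_view").show()
```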

How to Write Data into a Parquet File

Just as there are many ways to read data, there are many ways to write data. But in this notebook, we’ll get a quick peek of how to write data back out to Parquet files.

Apache Parquet is a columnar storage file format used by Hadoop-ecosystem systems such as Spark and Hive. The format is cross-platform and language-independent, stores data in a columnar layout, and uses a binary representation.

Parquet files, which store large datasets efficiently, have the extension .parquet.

Like what you did when reading data, you will also run the cells one after the other.

Integral to writing the Parquet file is creating a DataFrame. You will create one by running the cell sketched below.
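As a sketch (the path is a placeholder, not the one baked into the notebook), the DataFrame-creation cell does something along these lines:

```python
# Create a DataFrame by reading a tab-separated file.
# A .tsv file is read with the CSV reader, just with a tab as the separator.
tsv_path = "dbfs:/FileStore/tables/sample_data.tsv"  # placeholder path

df = (spark.read
      .option("header", "true")
      .option("sep", "\t")            # tab-separated values
      .option("inferSchema", "true")
      .csv(tsv_path))
```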

The .mode("overwrite") method shown below means that when you write the DataFrame to Parquet files, you replace any files that already exist at the destination.
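Roughly, the write cell looks like this (again with a placeholder output path):

```python
# Write the DataFrame out as Parquet, replacing any existing files at the destination.
output_path = "dbfs:/FileStore/tables/sample_data.parquet"  # placeholder path

(df.write
   .mode("overwrite")   # replace whatever is already at output_path
   .parquet(output_path))

# Reading it back is a one-liner, and the schema travels with the file.
spark.read.parquet(output_path).printSchema()
```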

At its core, the notebook reads a .tsv file (the same data used in the CSV-reading notebook) and writes it back out as a Parquet file.

How to Delete the Azure Databricks Instance (Optional)

Finally, the Azure resources that you created in this tutorial can incur ongoing costs. To avoid such costs, it is important to delete the resource or resource group that contains all those resources. You can do that by using the Azure portal.

In this tutorial, you have learned the basics about reading and writing data in Azure Databricks.

You now know what Azure Databricks is, how to set it up, how to read CSV and Parquet files, and how to write Parquet files to the Databricks File System (DBFS) with compression options.

Thank you for reading 🙂
