Apache Superset: An Open-Source Visualization Tool
The articles document the views of the DataOps apprentices views and opinions on Apache Superset and may not exhaustively cover all features of the tool. The intent behind this article is to provide an introductory overview of the tool, and compare it against common alternative visualization tools.
1. What is Superset ?
Fig 1.1 Apache Superset Logo Retrieved from Apache Superset Github.
Apache Superset can be classified as a business intelligence tool. It can serve for data visualisation. Apache is fast and lightweight allowing users for easy usage. It is loaded with options that can enable users of all skill sets to visualise charts and data. There are also options for SQL if the user is used to using SQL queries. Apache is also open source and supports a wide variety of databases. Overall, Apache Supertset is relatively user friendly but requires some time to learn the intuition of the usage.
More information can be found in Apache Superset’s Documentation.
2. Motivation for superset
Fig 2.1 Apache Superset Features Retrieved from Apache Superset’s Homepage .
Apache Superset provides dashboards which can provide a clear picture to the audiences for business use cases. The dashboards have configurable options that allow the users to adjust the flow in which the charts are presented. The charts themselves have customisable options that allow the user to choose the appropriate configurations. An example of a good business use case is the use of the filter. During business presentations, on the go questions may be raised such as the comparison of certain items of the data. The filter options allows the user to update all the charts in the dashboard directly in an efficient manner. Apache Superset is also easy to integrate with existing modern data infrastructure and supports a wide range of databases used in businesses.
Although this also can be done in a python jupyter notebook, Apache Superset provides the edge in presentation for business use cases in an efficient way that the standard EDA may lack.
3. Installation
Apache Superset has two ways of installation:
- Installing Superset Locally Using Docker Compose
- Installing Superset from Scratch
3.1 Installing Superset Using Docker Compose
Image Retrieved from the following link.
Linux users can install docker through the following link found in the docker’s installation documentation for Linux. The docker compose is already included in this documentation and need not require further installation as stated in Apache Superset’s documentation.
Image Retrieved from the following link.
Macbook Users can refer to the docker’s installation documentation for Mac. This installation is also used in Apache Superset’s Documentation.
Image Retrieved from the following link.
Windows Users can refer to the docker’s installation for Windows. It is important to choose WSL2 as the backend for the configuration during the installation.
After installing the Docker, head to the working folder and clone the Superset github repository with the following code.
git clone https://github.com/apache/superset.git
After cloning, navigate to the superset folder with the first line of code. Run the docker compose with the next two lines of code and Superset is ready.
cd superset docker-compose -f docker-compose-non-dev.yml pull docker-compose -f docker-compose-non-dev.yml up
Access the Superset through the local host.
https://localhost:8088/
There are other configurations such as Superset configurations, additional network settings, event logging and so on. These configurations can be found in the documentation, navigate to the configuration of interest in the tabs on the left.
3.2. Installing Superset from Scratch
This method involves installing Superset in a python virtual environment. For more information, please refer to this guide.
4. Data loading
Installing Superset via Docker Compose installs a local Postgres SQL server, whilst installing it from scratch installs a default SQLite data base at “~/. superset/superset.db”. To scale the underlying database, Superset also supports a plethora of database connections. The various connection methods will be discussed below.
4.1. Connect to a database (DB)
Connecting Superset to a database requires an existing (1) functional SQLAlchemy dialect and (2) Python driver. A Python database driver is to be installed for each database the user wants to connect to. Depending on the installation method, there are different ways to install the database driver.
4.1.1. DB connection for docker compose installations
When setting up Superset locally via docker-compose, the necessary drivers and packages need to be stated in a “superset/docker/requirements-local.txt” file to be installed automatically. Please refer to this guide for more information.
4.1.2. DB connection for installations from scratch
Using this reference, pip install the Python DB driver packages necessary to connect to the database. Thereafter, these steps can be followed:
- Click on the “+” icon on the top right corner of the Superset UI, then select “Connect database”
- Select the database type. If the database is not on the list but an interface exists, the user may file an issue on the Superset Github repository.
- Either fill in the connection details using the input form or choose the option to use SQL Alchemy URI connection string instead.
4.2 Upload a .csv file
1. Navigating to the Database Connection Settings
There is a feature that allows a csv file to be uploaded to the database. In order to activate this feature head to the settings sign at the top right hand corner and select database connection. This will navigate the user to their database tab as shown.
2. Navigating to the Database settings
Under the actions column, Click the edit button to navigate to the database configurations and select the “ADVANCED” tab.
3. Selecting the upload option
Check the “Allow files upload to database” as shown. The upload csv to database option will be available thereafter.
5. Data wrangling
An individual running Superset with its default local database may choose to do the bulk of data wrangling before data is loaded into the database. This can give the flexibility to process the data using the Python Pandas package for example.
However, Superset also provides an excellent SQL IDE for data wrangling. Some basic examples will be covered below.
5.1. Calculated columns
Example of casting a string column to float data type using a combination of SQL and the Superset UI.
- Select the “Datasets” tab on the top of the Superset UI.
- Put the cursor over the data table and select the edit icon.
- In the edit menu that pops up, go to the “Calculated Columns” tab, then click on “+ Add Item”.
- As an example, a string column “remaining_years” with values in the format of “xx years xx months”, where “xx” refers to zero-padded integers, extract the first 2 integers in the string representing the number of years, then cast it into a float data type calculated column via SQL as follows:
- Scroll down and set the data type to “Numeric”.
5.2. Temporal data type casting
Apart from casting data types by creating a calculated column, temporal data type casting specifically can be done through the Superset UI.
- Select the “Datasets” tab on the top of the Superset UI.
- Put the cursor over the data table and select the edit icon.
- In the edit menu that pops up, go to the “Columns” tab, then click on the temporal column of the user choice. In this example, it is the string column “month” that has values “YYYY-MM”.
- Check the “Is temporal” checkbox, and click the “Default datetime” radio button.
- Scroll down and fill in the Python datetime string pattern that corresponds to the format of the “month” column values to facilitate the parsing of the string column into datetime. Do remember to include separators if applicable.
- Do note that any charts using temporal columns should have “Time Grain” set to “Original Value”.
5.3. Creating views / virtual tables
Views can also be created using tables stored within the underlying database.
- Click on the “SQL” tab on the top of the Superset UI and select “SQL Lab”.
- Type in the SQL query to return the desired output view, then click on the dropdown arrow beside the save button.
- Click on “Save dataset” and input the desired view name. The view will now be accessible within Superset as a dataset.
6. Dashboard creation
The Dashboard is quite intuitive on its own. To access the dashboard head to the dashboard tab, click on the dashboard and finally click on edit dashboard. If creating a new dashboard click the + Dashboard button and the user will be directed to an empty Dashboard
The dashboard is best learnt through usage so most of the functions will not be covered.
There are a few special functionalities that perhaps one may find useful. The three dots highlighted will give access to options like edit CSS which gives the option to input CSS code. There is also the edit properties button where color schemes for the dashboard, owners and certifications can be defined.
Another more uncommon tool that could be used is the Text label under the layout elements tab. An image can be displayed in the text box by using the following simple code below.
<img src="Input Image Link Here">
Overall, the Dashboard is straightforward and just takes a bit of exploration for the user to gain the intuition to use it.
7. Chart creation
Superset supports a variety of charts for different visualization needs. These charts have been conveniently categorized according to their use cases:
Chart Category |
Use Case |
Correlation |
Plot relationships between features |
Distribution |
Plot volume / frequency of features |
Evolution |
Time series plots |
Flow |
Visualize connections between features |
KPI |
Visualize key metrics |
Map |
Geospatial plots |
Part of whole |
Provides visual breakdown of a population |
Ranking |
Visualizations that impose an order or score to data |
To access the Chart creation menu, click on the “Charts” tab on the top of the Superset UI, then click on the “+ Chart” button. A few examples will be discussed below.
7.1. Line charts
For this example, we will be creating a time-series line chart on Singapore’s public housing (HDB) resale prices across time, by the different apartment flat types.
The steps to create this chart are as follows:
- Select the temporal column of choice.
- Select the granularity of time series: Select “Original value” as other granularities were observed to throw errors.
- Select the feature metric to measure on the vertical axis.
- Select any categorical feature dimensions to use to slice the line series.
- Click on “Update Chart”.
7.2. Annotation layers
Annotation layers are layers that are put on top of a chart to provide supplementary information on the visualizations. For this example, annotations will be added for events that may have an effect on the resale prices from the example in 7.1.
- On the top right corner of the Superset UI, click on “Settings”, and select “Annotation Layers”.
- Provide the desired annotation layer name in the window that pops up and press “Add”. for this example.
- Within the annotation layer created, labels can be added associated with that label. For example, we can add a separate label for each lockdown that occurred in Singapore. Click on the “+ Annotation” button on the top right corner of the annotation layer page.
- Provide an annotation name. This will be part of the label that shows on the chart. The other part of the label is derived from the name that was specified on the chart creation menu which will be shown in a later step.
- Provide a date timeframe that corresponds to the lockdown event.
- Click on save and go back to the chart. Within the “Annotations and Layers” segment, click on “Add annotation layer” and fill in the inputs as follows:
- Annotation layer type: Either choose “Event” or “Interval”, which will reference the date timeframe that was configured in the previous step. “Formula” allows the user to input a function of the horizontal axis feature, whilst “Time Series” adds a layer based on another chart.
- Annotation source: Select “Superset annotation” to indicate an annotation layer to be used.
- Annotation layer: Select “Covid Lockdown” to reference the annotation layer that was created in the earlier step.
- Select “OK” and then “Update Chart”. The annotation should now appear on the chart.
7.3. Geospatial charts
Superset relies on API calls to a 3rd party map service, Mapbox, to generate maps for geospatial plots. Consequently, the user is required to set up a free account with Mapbox, generate an API token, and copy paste it into the “superset/docker/pythonpath_dev/superset_config.py” file as a new line as shown below:
Additionally, it is essential that the underlying dataset has geospatial features such as longitude and latitude in order to plot them on a map grid.
The following steps illustrate how to generate a “deck.gl screen grid” chart, where each square polygon is an aggregate of Singapore HDB resale price, with darker shades representing higher prices.
- Select the temporal column of choice.
- Select the geospatial columns with longitude and latitude data.
- Select the feature metric to represent as a color gradient.
- Click on “Update Chart”.
8. Superset vs. Tableau vs. Python
Superset is a useful visualization tool, however it should not be viewed in isolation from other solutions. A list of relevant features were used as a basis for comparison between Superset, Tableau, and Python in the table below:
Features |
Python |
Superset |
Tableau |
Cost |
Free |
Free |
Limited free license; subscription-based |
Chart interactivity |
Limited interactivity using plotly |
Supports a variety of interactive filters |
Supports a variety of interactive filters and tooltips |
Ease of use |
Potentially involved – everything is in code |
User-friendly UI available; SQL knowledge required |
User-friendly UI available; SQL / Tableau functions required for advanced use cases |
Variety of charts |
Supports basic and niche charts |
Does not compare to Python by supports most charts |
Does not compare to Python by supports most charts |
Security |
Not applicable |
Role-based or dashboard / chart level access management |
Role-based access management to control site / project / dashboard access. |
Compute speed |
Dependent on package optimization |
Caching utility available |
Has Hyper DB which facilitates quick dashboard rendering |
Hosting |
Local / server |
Local / server / cloud |
Desktop / server / cloud |
Support |
Not applicable |
Limited support for immediate production issues |
Technical support available for issues with server and desktop software |
Community |
Large, active and helpful |
Active and helpful, but not large |
Large, active and helpful |
9. Conclusion
With the variety of visualization tools available, a prudent adopter should carefully consider their idiosyncrasies to determine not the best, but the right tool for an organization and its users. Specifically, Superset fills a niche gap between the open-sourced but code-based Python, and the licensed but UI-based Tableau. Whilst esoteric with inflexibilities inherent to abstracted UI-based solutions, Superset is cost-free, easy to install, and is intuitive to use – there is little reason not to give it a try.