Pandas and Bokeh: create interactive graphics

Analyzing data also requires graphing the data or the results from the analysis performed. Many libraries in Python provide useful tools for visualization, but the plots produced are static. The Pandas Bokeh library is a great alternative for creating interactive plots and including them in web projects. Let's find out how to use it and the results we can achieve through some examples.


Data visualization is one of the fundamental aspects of data analysis. Exploring the nature of the data and its distribution allows the data analyst to understand how to analyze it. In addition, visualizing analysis results allows for immediate communication of the result of complex analyses. 

Choosing a library to display data and results can be complicated. Some libraries are easy to use but limit interaction with the data itself. Others allow interaction with graphs but have a steep learning curve. Fortunately, there are open-source libraries that partially solve this problem.

We have seen in the article PandasGUI: Graphical user interface for analyzing data with Pandas how we can use a tool to interact with data through a graphical interface. However, if we want to include graph generation within our code, we need to use other libraries.

In this article, we will compare the plotting provided by pandas with the pandas_bokeh library. We will analyze the syntax and the results obtained by comparing some of the available types of graphs. In particular, we will limit ourselves to six basic graphs, namely line graphs, bar graphs, stacked bar graphs, histograms, scatter plots and pie charts. The goal is to make the graphs interactive so that we can take full advantage of the information they present to us.

Dataset

There are many public datasets with which you can test these tools. For example, on Kaggle you can download free datasets covering a variety of areas, from financial data to weather data. There are also several repositories that provide so-called open data, i.e. free data published online.  

In this tutorial we will use the open data about the COVID-19 cases in Italy available here. We will not do an analysis of that data, but we will only use it to show the functionality of the libraries. In particular, we will use the national trend data and focus only on the hospitalizations data. You can download the dataset here.

Before installing the library we recommend that you create a dedicated development environment, for example with pipenv. This way you install only the libraries your project needs in a dedicated workspace and not at the operating system level. After activating the workspace with pipenv shell, you can install pandas-bokeh with one of the commands below (pandas is pulled in as a dependency).

# from PyPi
pip install pandas-bokeh

# with Conda
conda install -c patrikhlobil pandas-bokeh 

At this point we can import the necessary libraries and dataset.

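# Importing required modules
import pandas as pd
import pandas_bokeh
import numpy as np
import matplotlib.pyplot as plt

# IPython/Jupyter magic: render matplotlib figures inline (omit in a plain script)
%matplotlib inline

# Reading in the data, parsing the "data" (date) column and using it as the index
data = pd.read_csv('covid.csv', parse_dates=['data'], index_col='data')
data.head()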

The last command shows the first rows of the DataFrame read from the CSV file. There are 23 columns (attributes) for each date. To simplify, we select only the data related to hospitalizations, i.e. the columns ricoverati_con_sintomi (hospitalized with symptoms) and terapia_intensiva (intensive care). In addition, to have fewer points to display, we resample the time series to monthly means using the pandas resample() method. The commands and the output obtained are reported below.

Out[1]:
                    stato  ricoverati_con_sintomi  terapia_intensiva  ...  totale_positivi_test_antigenico_rapido  tamponi_test_molecolare  tamponi_test_antigenico_rapido
data                                                                  ...                                                                                                 
2020-02-24 18:00:00   ITA                     101                 26  ...                                     NaN                      NaN                             NaN
2020-02-25 18:00:00   ITA                     114                 35  ...                                     NaN                      NaN                             NaN
2020-02-26 18:00:00   ITA                     128                 36  ...                                     NaN                      NaN                             NaN
2020-02-27 18:00:00   ITA                     248                 56  ...                                     NaN                      NaN                             NaN
2020-02-28 18:00:00   ITA                     345                 64  ...                                     NaN                      NaN                             NaN

[5 rows x 23 columns]


data = data[['ricoverati_con_sintomi', 'terapia_intensiva']]

In [2]: data.head()
Out[2]: 
                     ricoverati_con_sintomi  terapia_intensiva
data                                                          
2020-02-24 18:00:00                     101                 26
2020-02-25 18:00:00                     114                 35
2020-02-26 18:00:00                     128                 36
2020-02-27 18:00:00                     248                 56
2020-02-28 18:00:00                     345                 64

In [3]: data_resample = data.resample(rule='M').mean()

In [4]: data_resample
Out[4]: 
            ricoverati_con_sintomi  terapia_intensiva
data                                                 
2020-02-29              222.833333          53.666667
2020-03-31            12762.612903        1985.645161
2020-04-30            25565.800000        2975.466667
2020-05-31            11501.645161         882.516129
2020-06-30             3353.800000         207.433333
2020-07-31              808.741935          57.870968
2020-08-31              890.451613          58.129032
2020-09-30             2204.833333         194.566667
2020-10-31             7843.870968         785.290323
2020-11-30            29880.800000        3231.666667
2020-12-31            27113.612903        3005.516129
2021-01-31            22392.354839        2483.451613
2021-02-28            18799.535714        2128.571429
2021-03-31            25040.580645        3127.032258
2021-04-30            24889.600000        3306.833333
2021-05-31            12422.516129        1783.645161
2021-06-30             3408.100000         531.966667
2021-07-31             1248.000000         179.400000 
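The loading, column selection and resampling steps above can be tried without downloading the dataset. The following is a minimal sketch on a tiny made-up CSV: the values are invented for illustration and only the column names follow the real file. It uses the month-start alias 'MS', which labels each bin with the first day of the month, whereas the 'M' rule used above labels it with the last day. Recent pandas releases spell the month-end rule 'ME' and warn about 'M', so 'MS' is chosen here only to keep the sketch independent of the installed version.

```python
import io

import pandas as pd

# Made-up sample that mimics the layout of the real file (values are illustrative only)
csv_text = """data,stato,ricoverati_con_sintomi,terapia_intensiva
2021-01-30T17:00:00,ITA,100,10
2021-01-31T17:00:00,ITA,200,20
2021-02-01T17:00:00,ITA,300,30
2021-02-02T17:00:00,ITA,500,50
"""

# Parse the "data" column as dates and use it as the index, as in the article
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["data"], index_col="data")

# Keep only the two hospitalization columns
df = df[["ricoverati_con_sintomi", "terapia_intensiva"]]

# One row per calendar month, holding the mean of that month's daily values
monthly = df.resample("MS").mean()
print(monthly)
```

Each row of the result averages the daily rows that fall in the same calendar month: the two January rows collapse into one value per column, and so do the two February rows.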

Now that we have our dataframes ready, it’s time to visualize them via different graphs.

Plot syntax

The two tools we will use to generate the plots are the plotting API built into pandas and the pandas_bokeh library. Below we look at the syntax of both.

Pandas

To generate a plot using pandas, you use the .plot() method of the dataframe. This method is a simple wrapper around matplotlib’s plt.plot(). You can also specify some additional parameters such as those below.

Some of the important Parameters
--------------------------------

x : label or position, default None
    Only used if data is a DataFrame.
y : label, position or list of label, positions, default None
title : title to be used for the plot
xlabel, ylabel : names to use for the labels on the x-axis and y-axis
figsize : size of the figure object, as a (width, height) tuple in inches
kind : str
    The kind of plot to produce:

    - 'line' : line plot (default)
    - 'bar' : vertical bar plot
    - 'barh' : horizontal bar plot
    - 'hist' : histogram
    - 'box' : boxplot
    - 'kde' : Kernel Density Estimation plot
    - 'density' : same as 'kde'
    - 'area' : area plot
    - 'pie' : pie plot
    - 'scatter' : scatter plot
    - 'hexbin' : hexbin plot. 

For a complete list of parameters and their usage, see the official documentation.
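To see how some of these parameters combine, here is a minimal sketch on a three-row DataFrame built by hand (the numbers are taken from the first rows shown earlier). The label texts "Date" and "Patients" and the Agg backend are choices made for this example, not something the dataset or pandas prescribes. The Agg backend is selected only so that the code also runs in a script without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed

import pandas as pd

# Small hand-built frame with the same column names as the article's data
df = pd.DataFrame(
    {
        "ricoverati_con_sintomi": [101, 114, 128],
        "terapia_intensiva": [26, 35, 36],
    },
    index=pd.date_range("2020-02-24", periods=3, freq="D"),
)

# kind selects the chart type; title, xlabel, ylabel and figsize style the figure
ax = df.plot(
    kind="line",
    title="Covid",
    xlabel="Date",
    ylabel="Patients",
    figsize=(8, 4),
)

# .plot() returns a matplotlib Axes, so the usual matplotlib API is available
print(ax.get_title(), len(ax.get_lines()))
```

Because the return value is a regular matplotlib Axes, further customization (annotations, saving with ax.figure.savefig, and so on) works as in any matplotlib figure.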

Pandas bokeh

The Pandas Bokeh library provides a Bokeh plotting backend for Pandas, GeoPandas and PySpark DataFrames by adding the plot_bokeh() method. Before plotting, you must choose one of two output modes: embedding the plots in a Jupyter notebook or exporting them to an HTML file. The syntax is as follows.

# for embedding plots in Jupyter Notebooks.
pandas_bokeh.output_notebook()
# for exporting plots as HTML.
pandas_bokeh.output_file(filename) 

Comparison of different plots

In this section we show both the commands and the results for different chart types. In particular we will analyze the types:

  • Line charts
  • Scatter
  • Histograms
  • Bar Charts
  • Stacked bar charts
  • Pie charts

Line charts

# pandas
data.plot(title='Covid', xlabel='Values').figure.show()

# pandas_bokeh (uses the same DataFrame as the pandas call above)
data.plot_bokeh(kind='line')

Pandas

Bokeh

Scatter plot

# pandas
data.plot(kind='scatter',
    x='ricoverati_con_sintomi',
    y='terapia_intensiva',
    title='Scatter Covid').figure.show()

# pandas_bokeh (uses the same DataFrame as the pandas call above)
data.plot_bokeh.scatter(x='ricoverati_con_sintomi',
    y='terapia_intensiva',
    title='Scatter Covid')

Pandas

Bokeh

Histograms

#pandas
data.plot(kind='hist', bins=30).figure.show()

# pandas_bokeh
data.plot_bokeh(kind='hist', bins=30) 

Pandas

Bokeh

Bar Charts

#pandas
data_resample.plot(kind='bar').figure.show()

# pandas_bokeh
data_resample.plot_bokeh(kind='bar') 

Pandas

Bokeh

Stacked bar charts

#pandas
data_resample.plot(kind='barh', stacked=True).figure.show()

# pandas_bokeh
data_resample.plot_bokeh(kind='barh', stacked=True) 

Pandas

Bokeh

Pie charts

#pandas
data_resample['terapia_intensiva'].plot.pie(legend=False, autopct='%.1f').figure.show()

# pandas_bokeh
data_resample.plot_bokeh.pie(y='terapia_intensiva') 

Pandas

Bokeh

For this chart type, Pandas Bokeh also provides another type of output that combines multiple pie charts in a single figure when no y column is specified. The syntax and result are shown below.

data_resample.plot_bokeh.pie() 

Conclusions

Pandas provides convenient plotting methods for generating graphs of the data contained in DataFrames. However, these graphs are static and lack interactive capabilities such as zooming and panning. The Pandas Bokeh library overcomes this limitation and also provides HTML output that can be easily included in websites. Ultimately, the choice of one library over the other depends on the application context and the preferences of the developer or data analyst.
