Visualizations are at the heart of data science. There is often no clearer way to describe the statistics of a dataset or the results of a model than a well-organized graph. Luckily, Python has many libraries that make it easier to build a well-constructed visualization. In this post, I will give a brief overview of three libraries, Matplotlib, Seaborn and Plotly, and show some examples of the differences between them.
Matplotlib is the standard graphing library in Python, and is typically the first graphing library a data scientist learns when using Python. It is well integrated with pandas and NumPy for easy and efficient plotting. Furthermore, Matplotlib gives the user full control over fonts, graph styling and axes properties, though this control can come at the cost of lengthy blocks of code. Matplotlib is especially good for exploratory analysis because of its integration with pandas, which allows quick transformations from dataframe to graph. It is particularly good for basic plots like scatter plots, bar graphs and line plots, but looks a little rough for more complex plots like polar scatter plots.
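To give a feel for that trade-off between control and length, here is a minimal sketch of a scatter plot where the fonts, axis limits and grid are all set by hand. The data values, title and labels are made up purely for illustration.

```python
import matplotlib.pyplot as plt

# Made-up values, only here to have something to draw
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

# Create the figure and axes explicitly so each property can be set by hand
fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(x, y, color="tab:blue", marker="o")

# Fonts and labels
ax.set_title("Hypothetical example", fontsize=16, fontfamily="serif")
ax.set_xlabel("x value", fontsize=12)
ax.set_ylabel("y value", fontsize=12)

# Axes properties and styling
ax.set_xlim(0, 5)
ax.tick_params(axis="both", labelsize=10)
ax.grid(True, linestyle="--", alpha=0.5)
```

Every one of those settings is optional, which is exactly why a fully styled Matplotlib chart can run to many lines.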
Seaborn is a library built on top of Matplotlib. It provides a high-level interface with a simpler syntax and more intuitive parameter settings. Additionally, Seaborn includes a more aesthetically pleasing collection of colors, themes and styles, which produces a smoother and more professional-looking plot than the pyplot defaults. This library is especially useful for more complex plots that call for more refined graphics.
Now that we have gone over what these libraries are used for, let’s look at an example of how we can build plots in each. For this example, we will use the Titanic dataset to create a bar plot showing the average fare for each passenger class, split by whether or not an individual survived (1 for survived, 0 for did not survive). Below is the Matplotlib code for creating this plot and the resulting image.
# Import pyplot module
import matplotlib.pyplot as plt
# Set default size for all pyplot plots
plt.rcParams["figure.figsize"] = (12, 8)
# Group the dataframe by passenger class and survival,
# then calculate average fare price,
# then unstack the grouped data, and plot a barplot
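As a minimal sketch of the chain those comments describe: the few rows below are made-up stand-ins rather than real Titanic records, and the names df and avg_fare are my own choices for illustration. With the real data you would use your own dataframe with pclass, survived and fare columns instead.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up stand-in rows (NOT real Titanic records), two per group
df = pd.DataFrame({
    "pclass":   [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "survived": [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
    "fare":     [50.0, 70.0, 80.0, 100.0, 12.0, 14.0,
                 20.0, 24.0, 7.0, 9.0, 10.0, 12.0],
})

# Set default size for all pyplot plots
plt.rcParams["figure.figsize"] = (12, 8)

# Group by class and survival, take the mean fare,
# then unstack so 'survived' becomes the columns
avg_fare = df.groupby(["pclass", "survived"])["fare"].mean().unstack()

# Plot straight off the pandas object: one pair of bars per passenger class
ax = avg_fare.plot(kind="bar")
ax.set_ylabel("average fare")
```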
There is a good deal to unpack here. In order to get the information that we wanted, we needed to group the dataframe by passenger class and by survival. This gives us 6 groups: those who survived and those who didn’t, for each passenger class. We choose the mean fare as the aggregate for each group. Next we unstack the grouped data in order to plot 3 paired groupings, rather than 6 individual bars. Finally we choose to plot it as a bar plot. We can see how clunky Matplotlib can be here. One other thing to note is that the plotting function is called off of the pandas dataframe. We call the .plot() method of the dataframe object and do not need to feed the dataframe into a separate function. Now let’s compare this to creating the same plot in Seaborn.
# Import Seaborn
import seaborn as sns
# Set global parameters like font and label sizes
sns.set_context('talk')
# Set style parameters such as presence of grid and background color
sns.set_style('darkgrid')
# Plot data
sns.barplot(data=df, x='pclass', y='fare', hue='survived')
There are a few things to note here. First, we do not need to set the size of this chart. Because Seaborn is built on Matplotlib, we are using the default size that we have already set. Similarly, now that we have set the context and style in Seaborn, every subsequent Matplotlib chart will use the same values. Next, we can see how much simpler the syntax is for Seaborn compared to pyplot. We don’t need to group our data here; we just pass the raw dataframe in and Seaborn calculates the mean of each group automatically. Finally, Seaborn also makes our chart a little nicer by filling in the axis labels from the column names. We can always supply our own labels by calling plt.xlabel() or plt.ylabel(), but this is not strictly necessary when using Seaborn. Now let’s look at how we can create the same plot using Plotly.
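Before moving on, here is a small sketch that checks the aggregation point above: the bar heights Seaborn draws should match the group means we would otherwise compute by hand with pandas. The rows are made-up stand-ins rather than real Titanic records, and the variable names are my own.

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in rows (NOT real Titanic records), two per group
df = pd.DataFrame({
    "pclass":   [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "survived": [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
    "fare":     [50.0, 70.0, 80.0, 100.0, 12.0, 14.0,
                 20.0, 24.0, 7.0, 9.0, 10.0, 12.0],
})

# Seaborn receives the raw rows and aggregates them itself
ax = sns.barplot(data=df, x="pclass", y="fare", hue="survived")

# Heights of the drawn bars (skipping any empty helper patches)
bar_heights = sorted(p.get_height() for p in ax.patches if p.get_height() > 0)

# The same numbers computed explicitly with pandas
expected = sorted(df.groupby(["pclass", "survived"])["fare"].mean())
```

Comparing bar_heights with expected shows that the two approaches arrive at the same six averages.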
# Import Plotly
import plotly.express as px
# Set up graph by grouping the dataframe by pclass and survival.
# We need to keep our column names, so we set as_index to False.
fig = px.bar(df.groupby(['pclass', 'survived'],
# Setting 'barmode' to group creates paired plots
# instead of stacked plots.
barmode = 'group')
When the code for this plot is run, hovering a mouse over any of the bars will tell you the passenger class, the survival group and the average ticket price for that group. This plot requires the data to be grouped like the Matplotlib plot, but it is a little more complex since we need access to our column names. To account for this, we set ‘as_index’ to False inside the group by call and aggregate with the .agg() function, so that the grouping columns stay available as regular columns. The other major difference with this code is that this graph defaults to a stacked bar plot instead of a paired bar plot. To account for this, we simply set ‘barmode’ to ‘group’. This is a very basic example and doesn’t show off the full usefulness of Plotly; it is mostly here to show the differences in creating a simple chart. It’s important to note, however, that Plotly lets you create some really advanced visualizations and makes busy visualizations very easy to read. I will likely write a post soon showing off some of the higher-end visualizations that you can create using Plotly.
For some examples of graphics you can create using each of these three libraries, make sure to check out the documentation: