The Next Level of Data Visualization in Python

How to create stunning, interactive charts with only one line of Python

The sunk-cost fallacy is one of many harmful cognitive biases to which people fall prey. It describes our tendency to keep devoting time and resources to a lost cause simply because we have already invested (sunk) so much into it. The sunk-cost fallacy applies to staying in a bad job longer than we should, slaving away at a project even when it's clear it won't work, and yes, continuing to use the tedious, outdated plotting library matplotlib when more efficient, interactive, and better-looking alternatives are available.

Plotly Brief Overview

The plotly Python package is an open-source library built on plotly.js, which in turn is built on d3.js. We'll be using Cufflinks, a wrapper on plotly designed to work with Pandas dataframes.

Our complete stack is therefore cufflinks > plotly > plotly.js > d3.js, which gives us the productivity of writing in Python along with the interactive graphics capabilities of d3.

(Plotly itself is a graphics company with several products and open-source tools. The Python library is free to use: we can make as many charts as we want in offline mode, plus up to 25 charts in online mode to share with the rest of the world.)

Distributions of a single variable: histograms and boxplots

Single-variable (univariate) plots are a standard way to start an analysis, and the histogram is a go-to plot for visualizing a distribution (although it has some issues). Here, using my Medium article statistics (you can see how to collect your own statistics here or use mine here), let's make an interactive histogram of the number of claps for articles (df is a standard Pandas dataframe):

df['claps'].iplot(kind='hist', xTitle='claps',
                  yTitle='count', title='Claps Distribution')

For those accustomed to matplotlib, all we have to do is add one more letter (iplot instead of plot) to get a much better-looking, interactive chart. We can click on the data to see more detail, zoom into sections of the plot, and, as we'll see later, select which categories to highlight.

Plotting overlapping histograms is equally straightforward:

df[['time_started', 'time_published']].iplot(
    kind='hist', histnorm='percent', barmode='overlay',
    xTitle='Time of Day', yTitle='(%) of Articles',
    title='Time Started and Time Published')
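To make histnorm='percent' a little more concrete, here is a rough sketch in plain Python of what that normalization amounts to: each bin's count is divided by the total number of observations, so the bars of each series sum to 100. The helper percent_histogram, the fixed bin width, and the sample hours are made up for illustration; plotly chooses its own bins.

```python
from collections import Counter


def percent_histogram(values, bin_width):
    """Return {bin_start: percent of observations that fall in that bin}."""
    # Map every value to the left edge of the bin it belongs to
    counts = Counter((v // bin_width) * bin_width for v in values)
    total = len(values)
    return {start: 100 * n / total for start, n in sorted(counts.items())}


# Hypothetical hours of the day (0-23) at which articles were started
hours_started = [8, 9, 9, 10, 14, 20, 21, 21]
hist = percent_histogram(hours_started, bin_width=4)
print(hist)  # {8: 50.0, 12: 12.5, 20: 37.5}
```

Because both series are expressed as percentages, two overlaid histograms stay comparable even if one column has more observations than the other.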

We can create a barplot with a bit of Pandas manipulation:

# Resample to monthly frequency and plot
df2 = df[['views', 'reads', 'published_date']].set_index('published_date').resample('M').mean()
df2.iplot(kind='bar', xTitle='Date', yTitle='Average',
          title='Monthly Average Views and Reads')
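For readers less familiar with resampling, the monthly average above can be pictured as a group-by on (year, month) followed by a mean. The sketch below does this in plain Python; the function monthly_mean and the sample records are hypothetical, and pandas additionally fills in months with no data, which this sketch does not.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_mean(records):
    """Average the values that fall into each (year, month) bucket."""
    buckets = defaultdict(list)
    for day, value in records:
        buckets[(day.year, day.month)].append(value)
    return {month: mean(vals) for month, vals in sorted(buckets.items())}


# Hypothetical (published date, views) pairs
views = [
    (date(2018, 1, 5), 100),
    (date(2018, 1, 20), 300),
    (date(2018, 2, 11), 250),
]
monthly = monthly_mean(views)
print(monthly)
```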

As we have seen, we can combine the power of pandas with plotly + cufflinks. For a boxplot of the fans per story by publication, we use a pivot and then plot:

df.pivot(columns='publication', values='fans').iplot(
    kind='box', yTitle='fans',
    title='Fans Distribution by Publication')

Thanks to interactivity, we can explore and subset the data any way we like. A box plot carries a lot of information, and without the ability to see the numbers we would miss most of it.
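The numbers in question are essentially the five-number summary of each group. As a minimal sketch, assuming made-up fan counts, they can be computed with the standard library; note that plotting libraries may use a slightly different quartile convention, so hover values will not always match exactly.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the figures a box plot draws."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Hypothetical fan counts for the stories in one publication
fans = [1, 2, 3, 4, 5, 6, 7, 8, 9]
summary = five_number_summary(fans)
print(summary)  # (1, 3.0, 5.0, 7.0, 9)
```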

Scatterplots

The scatterplot is the heart of most analyses. It lets us see how a variable changes over time or the relationship between two (or more) variables.

Time-Series

A considerable portion of real-world data has a time element. Luckily, plotly + cufflinks was designed with time-series visualizations in mind. Let's make a dataframe of my TDS articles and look at how the trends have changed.

# Create a dataframe of Towards Data Science articles
tds = df[df['publication'] == 'Towards Data Science'].set_index('published_date')

# Plot claps and fans as a time series
tds[['claps', 'fans', 'title']].iplot(
    y='claps', mode='lines+markers',
    secondary_y='fans', secondary_y_title='Fans',
    xTitle='Date', yTitle='Claps', text='title',
    title='Fans and Claps through Time')

In this line, we're doing a lot of different things at once:

Automatically generating a time-series x-axis with a good format

Adding a second y-axis because the ranges of our variables differ

Including the article titles as hover information

We can also easily add text annotations for extra information:

# `text` holds the annotation strings and is assumed to be defined beforehand
tds_monthly_totals.iplot(
    mode='lines+markers+text', text=text, y='word_count',
    opacity=0.8, xTitle='Date', yTitle='Word Count',
    title='Total Word Count per Month')

For a two-variable scatter plot colored by a third categorical variable, we use:

df.iplot(
    x='read_time', y='read_ratio',
    # Specify the category
    categories='publication',
    xTitle='Read Time', yTitle='Reading Percent',
    title='Reading Percent vs. Read Ratio by Publication')

Let's get a little more sophisticated by using a log axis, which is specified as a plotly layout (see the Plotly documentation for the layout specifications), and by sizing the bubbles by a numeric variable:

tds.iplot(
    x='word_count', y='reads', size='read_ratio',
    text=text, mode='markers',
    # Log x-axis
    layout=dict(xaxis=dict(type='log', title='Word Count'),
                yaxis=dict(title='Reads'),
                title='Reads vs. Log Word Count Sized by Read Ratio'))

As before, we can combine pandas with plotly + cufflinks to create useful plots:

df.pivot_table(
    values='views', index='published_date',
    columns='publication').cumsum().iplot(
        mode='markers+lines', size=8, symbol=[1, 2, 3, 4, 5],
        layout=dict(xaxis=dict(title='Date'),
                    yaxis=dict(type='log', title='Total Views'),
                    title='Total Views over Time by Publication'))
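The cumsum() step is what turns per-date views into running totals for each publication. A small sketch with the standard library shows the idea; the view counts and the second publication name are invented for illustration.

```python
from itertools import accumulate

# Hypothetical view counts per period for two publications
period_views = {
    "Towards Data Science": [1000, 4000, 2500],
    "Another Publication (hypothetical)": [20, 0, 90],
}

# Running total per publication, analogous to a column-wise cumsum()
running_totals = {
    publication: list(accumulate(views))
    for publication, views in period_views.items()
}
print(running_totals["Towards Data Science"])  # [1000, 5000, 7500]
```

When totals differ by orders of magnitude like this, a log y-axis keeps the smaller series readable on the same chart.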

Complex Plots

We'll now move on to a couple of plots that you probably won't use very often, but which can be quite impressive. Using the plotly figure_factory, even these plots can be kept to a single line.

Scatter Matrix

A scatter matrix, also known as a splom, is a great tool for exploring the relationships among many variables:

import plotly.figure_factory as ff

figure = ff.create_scatterplotmatrix(
    df[['claps', 'publication', 'views', 'read_ratio', 'word_count']],
    diag='histogram',
    index='publication')

Correlation Heatmap

We compute the correlations between the numerical variables, and then create an annotated heatmap to show the correlations:

corrs = df.corr()

figure = ff.create_annotated_heatmap(
    z=corrs.values,
    x=list(corrs.columns),
    y=list(corrs.index),
    annotation_text=corrs.round(2).values,
    showscale=True)
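As a reminder of what the heatmap cells contain, here is a minimal sketch of a Pearson correlation matrix in plain Python. The functions pearson and correlation_matrix and the tiny data set are illustrative assumptions, not part of pandas or plotly; df.corr() does the equivalent job on a real dataframe.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Nested dict of pairwise correlations, rounded like the annotations."""
    names = list(columns)
    return {a: {b: round(pearson(columns[a], columns[b]), 2) for b in names}
            for a in names}


# Hypothetical article statistics
data = {
    "claps": [10, 20, 30, 40],
    "fans": [1, 2, 3, 4],
    "read_ratio": [0.8, 0.6, 0.4, 0.2],
}
matrix = correlation_matrix(data)
print(matrix["claps"])
```

In this toy data, claps and fans move together perfectly while read_ratio moves in the opposite direction, so the corresponding cells sit at the two ends of the color scale.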

Conclusions

The worst part about the sunk-cost fallacy is that you only realize how much time you've wasted after you've quit the endeavor. Fortunately, now that I've made the mistake of sticking with matplotlib for too long, you don't have to!

When thinking about plotting libraries, there are a few things we want:

One-line charts for rapid exploration
Interactive elements for subsetting and investigating data
The option to dig into details as needed
Easy customization for a final presentation

As of right now, plotly is the best option for doing all of these in Python. It lets us make visualizations quickly, and its interactivity helps us get better insight into our data. Let's not forget that plotting ought to be one of the most enjoyable parts of data science!

HARIDHA P
