Published: Aug. 8, 2019 by lukemakin |  estimated reading time: 14 minutes
In this article you will discover the basics of the Seaborn library, which makes attractive statistical data visualization easy. Using the "tips" dataset built into seaborn, we will explore the basic functions that produce neat charts and plots. Hopefully you'll find here the most useful lines of code to help you get started. Let's begin!

Open a Jupyter notebook and set up the necessary imports as below:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

The last line makes the plots display inside the Jupyter notebook. Now let's load the dataset:


df = sns.load_dataset('tips')
df.head()

The head method lets us check that everything has been loaded correctly:



Let's also not forget to run two common methods, info() and describe(), which provide a lot of information about the data we're dealing with (in just two words!). Let's check them out, starting with info:

df.info()
'''
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill 244 non-null float64
tip 244 non-null float64
sex 244 non-null category
smoker 244 non-null category
day 244 non-null category
time 244 non-null category
size 244 non-null int64
dtypes: category(4), float64(2), int64(1)
memory usage: 7.2 KB
'''

and now describe:


df.describe()

Output:



Let's now globally set the label size of our axes and a white background for all seaborn plots:

plt.rcParams["axes.labelsize"] = 20
sns.set_style("white")
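The two settings above are global, so they affect every later plot in the notebook. If you only want them for a single figure, a minimal sketch (assuming matplotlib's rc_context and seaborn's axes_style context managers; the plotted numbers are made up) could look like this:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Apply the label size and the white style to one figure only.
with plt.rc_context({"axes.labelsize": 20}), sns.axes_style("white"):
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [2, 4, 3])  # made-up data, just to have something to draw
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    scoped_size = ax.xaxis.label.get_size()  # label size picked up inside the block

# Outside the block the global settings are whatever they were before.
size_after = plt.rcParams["axes.labelsize"]
```

For a whole article that uses one look everywhere, the global calls are simpler; the scoped form is handy when one chart needs a different style.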

We are now ready to create our first plot, which presents the distribution of tips by sex. Note that this chart is the most difficult of all those included in this article.


g = sns.FacetGrid(df, hue='sex')
g = g.map(sns.distplot, "tip", hist_kws=dict(edgecolor="k", linewidth=1))

g.fig.set_size_inches(18,5)
g.fig.suptitle('Distribution plot of tips by sex')
g.add_legend(fontsize=17)

plt.setp(g._legend.get_title(), fontsize=20)

plt.savefig("dist1.png")
plt.show()

Lines 1 and 2 set up the plot. In line 3 we specified the size of the figure (18 is the width, 5 the height). In line 4 we created a title, and in line 5 we added a legend and set its font size. In line 6 we set the font size of the legend title. Lines 7 and 8 save and display the plot. Important! plt.show() comes last; if you run it before savefig() you will get a blank image.



In a simple way we created a nice-looking distribution plot with the data distinguished by sex. Let's now check the correlation and visualize it with a heatmap:

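# keep only the numeric columns; pandas 2.0+ raises an error on the categorical ones
df.select_dtypes('number').corr()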

'''
Output:
total_bill tip size
total_bill 1.000000 0.675734 0.598315
tip 0.675734 1.000000 0.489299
size 0.598315 0.489299 1.000000

'''

To create a heatmap we pass the correlation matrix either directly or as a variable. Setting annot=True lets us see the correlation coefficients on the chart. cmap is the equivalent of a color palette (so we can display the heatmap in different shades and colors). If your plot gets cut off, you can use the autoscale() method, which should fix the problem.
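As a sketch of the "variable" route, the matrix can be stored first and then handed to heatmap. The frame demo below is a made-up stand-in for the tips data (hypothetical values), and select_dtypes is used here as one way to restrict the correlation to numeric columns:

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
n = 120
# Made-up stand-in for the tips data (hypothetical values, same column names).
demo = pd.DataFrame({
    "total_bill": rng.uniform(5, 50, n).round(2),
    "tip": rng.uniform(1, 8, n).round(2),
    "size": rng.integers(1, 7, n),
    "sex": rng.choice(["Male", "Female"], n),
})

# Keep only the numeric columns so the text column "sex" is left out.
corr = demo.select_dtypes("number").corr()

# annot=True writes one coefficient into every cell of the 3x3 matrix.
ax = sns.heatmap(corr, annot=True, cmap="coolwarm")
```

Storing the matrix is useful when you also want to inspect or reuse the numbers. Passing it directly looks like this: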


sns.heatmap(df.select_dtypes('number').corr(), annot=True, cmap='coolwarm')
plt.autoscale()
plt.show()

The result:



Remember, if you want to save a plot you must call plt.savefig("yourname.png"). Moving on: we've noticed that the highest correlation is between total_bill and tip (which seems logical). Let's take a closer look at this relationship with some additional visualizations, beginning with a jointplot:

sns.jointplot(x='total_bill',y='tip',data=df, kind="reg")
plt.show()

The result:



To distinguish categories we can create a similar plot - linear regression plot:

sns.lmplot(data=df, x='total_bill', y='tip', hue='sex', markers=['x','v'], palette='seismic')
plt.savefig("lm1.png")
plt.show()

The result:



The next plot is extremely easy to create, yet it gives a whole set of histograms and scatterplots. Let's do a pairplot by smoker:



Now let's make something more 'classical': we will create a couple of barplots, starting with the average total_bill by day:

sns.barplot(x="day", y="total_bill", data=df, palette='seismic')
plt.savefig("bar2.png", bbox_inches = "tight")
plt.show()

Because the x-label was a little bit cut off, I used bbox_inches="tight" while saving the plot to get a perfectly fitting layout.



The next chart splits the same bars by sex:

ax = sns.barplot(x="day", y="total_bill", data=df, hue='sex', palette='plasma')
ax.legend(fontsize=14, bbox_to_anchor=(1,1))
ax.set_title('Bar plot of total_bill by sex', fontdict={'fontsize': 22, 'fontweight': 'medium'})
plt.xticks(rotation=90)
plt.savefig("bar3.png", bbox_inches = "tight")
plt.show()

This one is a little more involved because we moved the legend outside the plot (line 2). Next we set the title in line 3, and in line 4 we rotated the x-axis tick labels. Result:



The next barplot shows the sum of income by day. Again we are using a barplot, but this time we've specified an estimator equal to sum (so we've overridden the default, which is the mean).

# np.sum comes from the numpy import at the top, so the built-in sum is not shadowed
sns.barplot(x="day", y="total_bill", data=df, estimator=np.sum, palette='seismic')
plt.savefig("bar4.png", bbox_inches = "tight")
plt.show()

The result:



We see that the restaurant had its lowest income on Friday. Just to be sure, let's run some pandas code to confirm the outcome:

# select total_bill before summing; pandas 2.0+ cannot sum the categorical columns
df.groupby('day')['total_bill'].sum()

'''
Output:
day
Thur 1096.33
Fri 325.88
Sat 1778.40
Sun 1627.16
Name: total_bill, dtype: float64
'''
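The comparison between chart and table can also be done in code rather than by eye. This is only a sketch: demo is a made-up stand-in for the tips data (hypothetical values), and bar_heights and sums are illustrative names.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
n = 120
# Made-up stand-in for the tips data (hypothetical values, same column names).
demo = pd.DataFrame({
    "total_bill": rng.uniform(5, 50, n).round(2),
    "day": rng.choice(["Thur", "Fri", "Sat", "Sun"], n),
})

ax = sns.barplot(x="day", y="total_bill", data=demo, estimator=np.sum)

# Each bar is a matplotlib patch; its height is the estimator's value for that day.
bar_heights = sorted(p.get_height() for p in ax.patches)
sums = sorted(demo.groupby("day")["total_bill"].sum())

same = np.allclose(bar_heights, sums)  # True when chart and groupby agree
```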

So basically we group our dataset by day and sum the total_bill values. The result matches the graph above. Let's now move on and count the values to see how many bills were paid in the restaurant on each day, first in pandas and then visualized with seaborn's countplot:


df['day'].value_counts()

'''
Sat 87
Sun 76
Thur 62
Fri 19
Name: day, dtype: int64
'''

The countplot below shows the same counts:


sns.countplot(x='day', data=df, palette='seismic')
plt.savefig("count1.png", bbox_inches = "tight")
plt.show()
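If you would like the bars in the same order as the value_counts output (most frequent first), one option is countplot's order parameter. A small sketch, again with a made-up stand-in frame demo (hypothetical values):

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
# Made-up stand-in for the tips data: days drawn with unequal probabilities.
demo = pd.DataFrame({
    "day": rng.choice(["Thur", "Fri", "Sat", "Sun"], 200, p=[0.25, 0.1, 0.35, 0.3]),
})

counts = demo["day"].value_counts()  # sorted from most to least frequent
ax = sns.countplot(x="day", data=demo, order=counts.index.tolist())

# Read the bar heights back from left to right.
heights = [p.get_height() for p in sorted(ax.patches, key=lambda p: p.get_x())]
```

Without order, the bars follow the order of the categories in the data.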





Another way of representing data is a boxplot. Again it's very simple and follows all the principles above:
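A minimal sketch of a boxplot built on those principles, assuming total_bill by day as in the barplots. The frame demo is a made-up stand-in for the tips data (hypothetical values); in the notebook you would pass df instead.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
n = 120
# Made-up stand-in for the tips data (hypothetical values, same column names).
demo = pd.DataFrame({
    "total_bill": rng.uniform(5, 50, n).round(2),
    "day": rng.choice(["Thur", "Fri", "Sat", "Sun"], n),
})

# One box per day, showing the median, quartiles and outliers of total_bill.
ax = sns.boxplot(x="day", y="total_bill", data=demo)
```

Saving works the same way as before, e.g. with plt.savefig and bbox_inches="tight".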



The next chart we're going to create is similar to a pairplot, but lets us specify which elements should be included. As you remember, a pairplot consists of scatter plots and histograms.

g = sns.PairGrid(df, hue='smoker')
g.map_diag(sns.distplot)
g.map_lower(plt.scatter)
g.map_upper(sns.kdeplot)
plt.savefig("pgrid1.png", bbox_inches = "tight")
plt.show()

So we've specified that the diagonal should contain distribution plots, the lower left part scatter plots, and the upper right part KDE plots.



The last visualization concerns catplots, which are constructed a little differently. Inside catplot we specify most of the parameters our chart should consist of, including the kind of chart (so I could pass in box, strip and so on):

sns.catplot(data=df, x="smoker", y="total_bill", hue="sex", kind="violin", height=5, aspect=2, palette="magma", legend=False)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2)
plt.savefig("cat-violin1.png", bbox_inches = "tight")
plt.show()

In this case we set kind to violin, which is another chart type that shows the distribution of the data.
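To see how little changes between chart types, here is a small sketch that draws the same data with three different kind values. The frame demo is a made-up stand-in for the tips data (hypothetical values), and grids is just an illustrative name.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
n = 120
# Made-up stand-in for the tips data (hypothetical values, same column names).
demo = pd.DataFrame({
    "total_bill": rng.uniform(5, 50, n).round(2),
    "sex": rng.choice(["Male", "Female"], n),
    "smoker": rng.choice(["Yes", "No"], n),
})

# Same call each time; only the kind argument changes.
grids = {}
for kind in ["box", "violin", "strip"]:
    grids[kind] = sns.catplot(data=demo, x="smoker", y="total_bill", hue="sex",
                              kind=kind, height=5, aspect=2, palette="magma")
```

Each call returns a FacetGrid, so the figure-level tricks used earlier (titles, legends, figure size) apply here as well.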





While playing around with this, your layout may become messy (e.g. labels overlap). You can try tight_layout to organize it better:

plt.tight_layout()
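A small sketch of the effect, using plain matplotlib subplots with deliberately large labels (all values here are made up). fig.tight_layout() is the figure-level form of the same call:

```python
import matplotlib.pyplot as plt

# Four small subplots with large titles and labels that crowd each other.
fig, axes = plt.subplots(2, 2, figsize=(5, 4))
for ax in axes.flat:
    ax.set_title("A fairly long subplot title", fontsize=14)
    ax.set_xlabel("x label", fontsize=14)
    ax.set_ylabel("y label", fontsize=14)

before = axes[0, 0].get_position().bounds
fig.tight_layout()  # recompute subplot spacing so titles and labels fit
after = axes[0, 0].get_position().bounds
```

Comparing before and after shows that the subplots were moved and resized to make room.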



That is it for this one. As you can see, a few lines of code can produce wonderful images for, e.g., your PowerPoint presentation.

 
© copyright 2019 pyplane.com