Published: Aug. 8, 2019 by lukemakin | estimated reading time: 14 minutes

In this post you will discover the basics of the Seaborn library, which enables statistical data visualization in a beautiful form. Using the dataset built into seaborn ("tips"), we will explore the basics that produce neat charts and plots. Hopefully you'll find here the most useful lines of code to help you get started. Let's begin!

Open a Jupyter notebook and set up the necessary imports as below:
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
```

The last line makes the plots render inside the Jupyter notebook. Now let's load the dataset:

```python
df = sns.load_dataset('tips')
df.head()
```

The head method lets us check that everything has been loaded correctly. Let's also not forget to run two common methods, info() and describe(), which tell us a lot about the data we're dealing with (in just 2 words!). Let's check them out, starting with info:
```python
df.info()
'''
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null category
smoker        244 non-null category
day           244 non-null category
time          244 non-null category
size          244 non-null int64
dtypes: category(4), float64(2), int64(1)
memory usage: 7.2 KB
'''
```

and now describe:

``df.describe()``
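
By default describe() summarizes only the numeric columns, so the four category columns reported by info() are left out. Here is a minimal sketch of how they could be summarized as well. It uses a small hand-made frame; `toy` and its values are made up for illustration and are not taken from the tips dataset.

```python
import pandas as pd

# Hypothetical miniature frame that mimics the mix of dtypes in "tips"
toy = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "day": pd.Categorical(["Sat", "Sat", "Sun", "Fri"]),
})

# Default behaviour: numeric columns only (count, mean, std, min, quartiles, max)
numeric_summary = toy.describe()

# Ask explicitly for the categorical columns (count, unique, top, freq)
category_summary = toy.describe(include="category")

print(numeric_summary)
print(category_summary)
```

On the real dataset the same idea would be df.describe(include="category").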

Let's now globally set the label size of our axes and a white background for all seaborn plots:
```python
plt.rcParams["axes.labelsize"] = 20
sns.set_style("white")
```

We are now ready to create our first plot, which will present the distribution of the tip by sex. Note that this chart is the most difficult of all those included in this article.

```python
g = sns.FacetGrid(df, hue='sex')
g = g.map(sns.distplot, "tip", hist_kws=dict(edgecolor="k", linewidth=1))
g.fig.set_size_inches(18,5)
g.fig.suptitle('Distribution plot of tips by sex')
g.add_legend(fontsize=17)
plt.setp(g._legend.get_title(), fontsize=20)
plt.savefig("dist1.png")
plt.show()
```

Lines 1 and 2 set up the plot. In line 3 we specified the size of the figure (18 is the width, 5 the height). In line 4 we created a title, and in line 5 we added a legend and its font size. In line 6 we set the font size of the legend title. Lines 7 and 8 save and display the plot. Important! plt.show() comes last; if you run it before savefig() you will receive a blank image. Note that distplot has been deprecated since seaborn 0.11 (histplot and displot are its successors), so newer versions may show a warning here.

In a simple way we created a pretty-looking distribution plot with the data distinguished by sex. Let's now check the correlation and visualize it with a heatmap:
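```python
# numeric_only=True skips the category columns (pandas 2.0+ raises an error without it)
df.corr(numeric_only=True)
'''
Output:
            total_bill       tip      size
total_bill    1.000000  0.675734  0.598315
tip           0.675734  1.000000  0.489299
size          0.598315  0.489299  1.000000
'''
```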
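
The correlation table covers only total_bill, tip and size because correlation is defined for numeric columns. As far as I know, pandas 2.0 and later no longer drop the non-numeric columns silently, so it helps to select the numeric ones explicitly. Here is a minimal sketch on a hand-made frame; `toy` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature frame with two numeric columns and one category column
toy = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.0, 2.0, 3.0, 4.0],
    "sex": pd.Categorical(["Male", "Female", "Male", "Female"]),
})

# Keep only the numeric columns before correlating
corr = toy.select_dtypes(include="number").corr()

print(corr)
```

The resulting `corr` frame is the kind of object that sns.heatmap() expects.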

To create a heatmap we need to pass the correlation matrix returned by df.corr(), either directly or as a variable. Adding annot=True lets us see the correlation coefficients on the chart. cmap is the equivalent of a color palette (so we can display the heatmap in different shades and colors). If your plot gets cut off, you can use the autoscale() method, which should fix this problem.

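```python
# numeric_only=True skips the category columns (pandas 2.0+ raises an error without it)
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.autoscale()
plt.show()
```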

Remember, if you want to save the plot, you must call plt.savefig("yourname.png") before plt.show(). Moving forward: we've noticed that the highest correlation is between total_bill and tip (which seems logical). Let's look closer at this relationship with some additional visualizations, beginning with a jointplot:
```python
sns.jointplot(x='total_bill', y='tip', data=df, kind="reg")
plt.show()
```
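
With kind="reg" the jointplot draws a fitted regression line over the scatter. As a rough sketch of what such a line represents, here is an ordinary least-squares fit done with numpy on made-up numbers. The arrays are illustrative only, and this is not seaborn's internal code.

```python
import numpy as np

# Made-up bills and tips where the tip is exactly 15% of the bill plus 0.5
total_bill = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
tip = 0.15 * total_bill + 0.5

# A degree-1 polynomial fit is a straight line; polyfit returns slope, then intercept
slope, intercept = np.polyfit(total_bill, tip, 1)

# Pearson correlation coefficient, the same measure df.corr() reports by default
r = np.corrcoef(total_bill, tip)[0, 1]

print(f"slope={slope:.3f}, intercept={intercept:.3f}, r={r:.3f}")
```

Because the made-up data lie exactly on a line, the fit recovers the slope and intercept used to build them; real data such as tips scatter around the line instead.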

To distinguish categories we can create a similar plot, a linear regression plot:
```python
sns.lmplot(data=df, x='total_bill', y='tip', hue='sex', markers=['x','v'], palette='seismic')
plt.savefig("lm1.png")
plt.show()
```

The next plot is extremely easy to make, yet it gives a whole set of histograms and scatterplots: a pairplot by smoker, created with a call such as sns.pairplot(df, hue='smoker').

Finally let's make something more 'classical': we will create a couple of barplots, starting with the average total_bill by day:
```python
sns.barplot(x="day", y="total_bill", data=df, palette='seismic')
plt.savefig("bar2.png", bbox_inches="tight")
plt.show()
```

Because the x-label was slightly cut off, I used bbox_inches="tight" when saving the plot so that the layout fits perfectly. The next chart adds hue='sex' to split each day by sex:
```python
ax = sns.barplot(x="day", y="total_bill", data=df, hue='sex', palette='plasma')
ax.legend(fontsize=14, bbox_to_anchor=(1,1))
ax.set_title('Bar plot of total_bill by sex', fontdict={'fontsize': 22, 'fontweight': 'medium'})
plt.xticks(rotation=90)
plt.savefig("bar3.png", bbox_inches="tight")
plt.show()
```

This one is a little more involved because we moved the legend outside the plot (line 2). Next we set the title in line 3, and in line 4 we changed the rotation of the x-axis tick labels.

The next barplot shows the sum of income by day. Again we are using a barplot, but this time we've specified an estimator equal to sum (so we've overridden the default, which is the mean).
```python
# np.sum replaces the default estimator (the mean) without shadowing Python's built-in sum
sns.barplot(x="day", y="total_bill", data=df, estimator=np.sum, palette='seismic')
plt.savefig("bar4.png", bbox_inches="tight")
plt.show()
```

We see that on Friday the restaurant had the lowest income. Let's make sure by running some pandas code to confirm the outcome:
```python
# select total_bill before summing so the category columns are not summed
df.groupby('day')['total_bill'].sum()
'''
Output:
day
Thur    1096.33
Fri      325.88
Sat     1778.40
Sun     1627.16
Name: total_bill, dtype: float64
'''
```

So basically we group our dataset by day and sum the total_bill values. The result agrees with the graph above. Let's now move forward and count the values to see how many customers paid in the restaurant on a particular day (first in pandas), and then visualize it with seaborn's countplot:

```python
df['day'].value_counts()
'''
Sat     87
Sun     76
Thur    62
Fri     19
Name: day, dtype: int64
'''
```
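
The bar heights in these charts come down to three aggregations per day: the mean (barplot's default), the sum (estimator set to sum) and the row count (what value_counts() returns). Here is a minimal sketch that computes all three at once on a hand-made frame; `toy` and its numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature frame: six bills spread over three days
toy = pd.DataFrame({
    "day": ["Thur", "Fri", "Fri", "Sat", "Sat", "Sat"],
    "total_bill": [10.0, 20.0, 30.0, 5.0, 15.0, 25.0],
})

# One row per day, one column per aggregation
per_day = toy.groupby("day")["total_bill"].agg(["mean", "sum", "count"])

# value_counts() gives the same numbers as the "count" column
counts = toy["day"].value_counts()

print(per_day)
print(counts)
```

Comparing such a table with the plotted bars is a quick way to confirm that a chart shows the aggregation you intended.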

The chart (countplot) representing these counts is below:

```python
sns.countplot(x='day', data=df, palette='seismic')
plt.savefig("count1.png", bbox_inches="tight")
plt.show()
```

Another way of representing data is a boxplot. Again it's very simple and based on all the principles above, for example sns.boxplot(x='day', y='total_bill', data=df).

Another chart we're going to create is similar to the pairplot, but lets us specify which elements should be included. As you remember, a pairplot consists of scatter plots and histograms.
```python
g = sns.PairGrid(df, hue='smoker')
g.map_diag(sns.distplot)
g.map_lower(plt.scatter)
g.map_upper(sns.kdeplot)
plt.savefig("pgrid1.png", bbox_inches="tight")
plt.show()
```

So we've specified that the diagonal (the middle) should contain distribution plots, on the lower left we've placed scatter plots, and on the upper right kde plots.

The last visualization concerns catplots, which are constructed a little differently. Inside catplot we specify most of the parameters our chart should consist of, including the kind of chart (so we could also pass in box, strip and so on):
```python
sns.catplot(data=df, x="smoker", y="total_bill", hue="sex", kind="violin", height=5, aspect=2, palette="magma", legend=False)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2)
plt.savefig("cat-violin1.png", bbox_inches="tight")
plt.show()
```

In this case we set kind to violin, which is another chart for showing the distribution of data. While playing around with this, your layout may become messy (e.g. labels overlap); you can try tight_layout to organize it better:
``plt.tight_layout()``
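
To show where tight_layout fits in relation to saving, here is a minimal sketch in plain matplotlib. The Agg backend and the in-memory buffer are assumptions chosen so that the snippet runs outside a notebook without writing a file; the bar heights are the day counts from value_counts() earlier.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this also runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(["Thur", "Fri", "Sat", "Sun"], [62, 19, 87, 76])  # day counts shown earlier
ax.set_xlabel("day")
ax.set_ylabel("count")

fig.tight_layout()        # adjust the padding first, so labels fit inside the figure
buffer = io.BytesIO()     # in-memory stand-in for a file name such as "count1.png"
fig.savefig(buffer, format="png", bbox_inches="tight")  # save before displaying
# plt.show() would come here, after the figure has been saved

png_size = buffer.getbuffer().nbytes
print(png_size > 0)
```

Keeping a reference to the figure (`fig`) and calling its own savefig method makes it explicit which figure is being written.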

That is it for this one. As you can see, a few lines of code can produce wonderful images for, e.g., your PowerPoint presentation.
