Data Visualisation

Monday 27 February 2017

Visualizing data with tableau_2

Visualizing data with tableau_1

Thursday 2 February 2017

31.01.2017 - Python / Seaborn

In today's post, I shall take you through a very important visualisation package for the Python programming language called Seaborn. Let's jump right into it.

Let's start by importing the package.

import seaborn as sns

Then, so that the images appear inline between the code cells, run the IPython magic below, which invokes the matplotlib library:

%matplotlib inline

HeatMap

# Load the example flights dataset (long-form) and pivot it to wide-form
flights_long = sns.load_dataset("flights")
flights = flights_long.pivot(index="month", columns="year", values="passengers")

# Draw a heatmap with the numeric values in each cell
sns.heatmap(flights, annot=True, fmt="d", linewidths=.5)
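To see what the pivot step does before the heatmap is drawn, here is a minimal sketch on a tiny hand-made table. The rows in `long_df` are illustrative only and are not loaded from the seaborn dataset; the column names simply mirror the ones used above.

```python
import pandas as pd

# Illustrative long-form records: one row per (month, year) observation.
long_df = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [112, 118, 115, 126],
})

# Pivot to wide form: months become rows, years become columns,
# and each cell holds the passenger count for that month and year.
wide_df = long_df.pivot(index="month", columns="year", values="passengers")
print(wide_df)
```

The wide table is the shape a heatmap needs: one cell per row/column pair.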

Kdeplot

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="dark")
rs = np.random.RandomState(50)

# Set up the matplotlib figure
f, axes = plt.subplots(3, 3, figsize=(9, 9), sharex=True, sharey=True)

# Rotate the starting point around the cubehelix hue circle
for ax, s in zip(axes.flat, np.linspace(0, 3, 10)):

    # Create a cubehelix colormap to use with kdeplot
    cmap = sns.cubehelix_palette(start=s, light=1, as_cmap=True)

    # Generate and plot a random bivariate dataset
    x, y = rs.randn(2, 50)
    # Recent seaborn releases expect keyword x=/y= and fill=;
    # older releases used positional x, y and shade=True
    sns.kdeplot(x=x, y=y, cmap=cmap, fill=True, cut=5, ax=ax)
    ax.set(xlim=(-3, 3), ylim=(-3, 3))

f.tight_layout()

tsplot

sns.set(style="darkgrid", palette="Set2")

# Create a noisy periodic dataset
sines = []
rs = np.random.RandomState(8)
for _ in range(15):
    x = np.linspace(0, 30 / 2, 30)
    y = np.sin(x) + rs.normal(0, 1.5) + rs.normal(0, .3, 30)
    sines.append(y)

# Plot the average over replicates with bootstrap resamples
# (tsplot was deprecated and later removed from seaborn, so this call needs an older release)
sns.tsplot(sines, err_style="boot_traces", n_boot=500)
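If you are curious what "bootstrap resamples" means here, the idea can be sketched in plain NumPy. This is my own rough illustration, not seaborn's internal code: the names `replicates`, `boot_means`, `low` and `high` are made up for this example, and the 95% percentile interval is an assumption on my part.

```python
import numpy as np

rng = np.random.RandomState(8)

# Synthetic replicates, built the same way as the `sines` list above.
x = np.linspace(0, 15, 30)
replicates = np.array([
    np.sin(x) + rng.normal(0, 1.5) + rng.normal(0, .3, 30)
    for _ in range(15)
])

# Each bootstrap draw picks 15 replicates with replacement
# and averages them into one resampled mean curve.
n_boot = 500
idx = rng.randint(0, len(replicates), size=(n_boot, len(replicates)))
boot_means = replicates[idx].mean(axis=1)  # shape: (n_boot, 30)

# Central estimate plus a 95% percentile band at every time point.
mean_curve = replicates.mean(axis=0)
low, high = np.percentile(boot_means, [2.5, 97.5], axis=0)
```

Each row of `boot_means` is one resampled average curve. Plotting `mean_curve` together with the band between `low` and `high` gives a comparable picture with plain matplotlib.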

Swarmplot

import pandas as pd
import seaborn as sns
sns.set(style="whitegrid", palette="muted")

# Load the example iris dataset
iris = sns.load_dataset("iris")

# "Melt" the dataset to "long-form" or "tidy" representation
iris = pd.melt(iris, "species", var_name="measurement")

# Draw a categorical scatterplot to show each observation
sns.swarmplot(x="measurement", y="value", hue="species", data=iris)
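The "melt" step is easier to follow on a two-row table. The sketch below uses made-up values and only two measurement columns; it is an illustration of the reshaping, not the real iris data.

```python
import pandas as pd

# Illustrative wide-form table: one column per measurement.
wide = pd.DataFrame({
    "species": ["setosa", "virginica"],
    "sepal_length": [5.1, 6.3],
    "petal_length": [1.4, 6.0],
})

# Keep "species" as the identifier and stack the remaining columns
# into (measurement, value) pairs, one row per observation.
tidy = pd.melt(wide, id_vars="species", var_name="measurement")
print(tidy)
```

Every measurement now sits in its own row, which is why the swarmplot call can use `x="measurement"` and `y="value"`.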

Pairgrid

import seaborn as sns
sns.set(style="whitegrid")

# Load the dataset
crashes = sns.load_dataset("car_crashes")

# Make the PairGrid
g = sns.PairGrid(crashes.sort_values("total", ascending=False),
                 x_vars=crashes.columns[:-3], y_vars=["abbrev"],
                 height=10, aspect=.25)  # `size` was renamed to `height` in newer seaborn

# Draw a dot plot using the stripplot function
g.map(sns.stripplot, size=10, orient="h",
      palette="Reds_r", edgecolor="gray")

# Use the same x axis limits on all columns and add better labels
g.set(xlim=(0, 25), xlabel="Crashes", ylabel="")

# Use semantically meaningful titles for the columns
titles = ["Total crashes", "Speeding crashes", "Alcohol crashes",
          "Not distracted crashes", "No previous crashes"]

for ax, title in zip(g.axes.flat, titles):

    # Set a different title for each axes
    ax.set(title=title)

    # Make the grid horizontal instead of vertical
    ax.xaxis.grid(False)
    ax.yaxis.grid(True)

sns.despine(left=True, bottom=True)

Based on this relatively small sample, which do you like better: R's ggplot2 or Seaborn? Let me know in the comments below.

Tuesday 24 January 2017

24.01.2017 - Geo Mapping

5 State Data in Google GeoChart

4 Plotting Cities on a Map along with Data

Showing Addresses on a Google Map

15.01.2017 - Google Charts

This one is called the 'Gantt Chart'.

Thursday 12 January 2017

11.01.2017.R

For this visualization techniques session, let's use the 'Big Mart' dataset. You can download it here.

As always, let's start by loading the ggplot2 library.

library(ggplot2)

# Scatter plot

A Scatter Plot is used to see the relationship between two continuous variables.

library(ggplot2)
ggplot(Big_Mart_Dataset_Sheet1, aes(Item_Visibility, Item_MRP)) +
geom_point() + scale_x_continuous("Item Visibility", breaks = seq(0,0.35,0.05)) +
scale_y_continuous("Item MRP", breaks = seq(0,270,by = 30))+ theme_bw()

ggplot(Big_Mart_Dataset_Sheet1, aes(Item_Visibility, Item_MRP)) +
geom_point(aes(color = Item_Type)) +
scale_x_continuous("Item Visibility", breaks = seq(0,0.35,0.05))+
scale_y_continuous("Item MRP", breaks = seq(0,270,by = 30))+
theme_bw() + labs(title="Scatterplot")

# facet_wrap works superbly here: it draws one panel per Item_Type and wraps the panels into a rectangular layout.
ggplot(Big_Mart_Dataset_Sheet1, aes(Item_Visibility, Item_MRP)) + geom_point(aes(color = Item_Type)) +
scale_x_continuous("Item Visibility", breaks = seq(0,0.4,0.1))+
scale_y_continuous("Item MRP", breaks = seq(0,270,by = 30))+
theme_bw() + labs(title="Scatterplot") + facet_wrap( ~ Item_Type)

#Histogram

A histogram is used to plot a continuous variable. It breaks the data into bins and shows the frequency distribution across these bins. We can always change the bin size and see the effect it has on the visualization.
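To get a feel for how the bin count changes the picture, here is a small sketch using NumPy rather than ggplot2. The `values` array is synthetic (a right-skewed random sample I made up for this example), not the Big Mart data.

```python
import numpy as np

rng = np.random.RandomState(0)
# Synthetic, right-skewed sample standing in for a continuous variable.
values = rng.gamma(shape=2.0, scale=50.0, size=1000)

# Bin the same data three ways and compare bin width and peak height.
for bins in (5, 20, 50):
    counts, edges = np.histogram(values, bins=bins)
    width = edges[1] - edges[0]
    print(f"{bins:>2} bins, width {width:6.1f}, tallest bin holds {counts.max()} values")
```

Fewer, wider bins smooth the shape out; more, narrower bins show finer detail but also more noise.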

#Bar & Stack Bar Chart

Bar charts are recommended when you want to plot a categorical variable or a combination of continuous and categorical variables.

ggplot(Big_Mart_Dataset_Sheet1, aes(Outlet_Establishment_Year)) + geom_bar(fill = "red")+theme_bw()+
scale_x_continuous("Establishment Year", breaks = seq(1985,2010)) +
scale_y_continuous("Count", breaks = seq(0,1500,150)) +
coord_flip()+ labs(title = "Bar Chart") + theme_gray()

Another variation under this kind of visualization is the

Vertical Bar Chart:

ggplot(Big_Mart_Dataset_Sheet1, aes(Item_Type, Item_Weight)) +
geom_bar(stat = "identity", fill = "darkblue") +
scale_x_discrete("Item Type")+
scale_y_continuous("Item Weight", breaks = seq(0,15000, by = 500))+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
labs(title = "Bar Chart")

Stacked Bar Chart:

ggplot(Big_Mart_Dataset_Sheet1, aes(Outlet_Location_Type, fill = Outlet_Type)) + geom_bar()+
labs(title = "Stacked Bar Chart", x = "Outlet Location Type", y = "Count of Outlets")

#Box plot

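Box plots are used to plot a combination of categorical and continuous variables. They are useful for visualizing the spread of the data and detecting outliers. A box plot is built around the five-number summary: the minimum, the 25th percentile, the median, the 75th percentile and the maximum. Note that geom_boxplot, by default, draws each whisker only as far as the most extreme point within 1.5 times the interquartile range of the box and plots anything beyond that as an individual outlier point.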

ggplot(Big_Mart_Dataset_Sheet1, aes(Outlet_Identifier, Item_Outlet_Sales)) + geom_boxplot(fill = "red")+
scale_y_continuous("Item Outlet Sales", breaks= seq(0,15000, by=500))+
labs(title = "Box Plot", x = "Outlet Identifier")
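The numbers behind a box plot can be worked out by hand. The sketch below does so in NumPy on a made-up `sales` array (not the Big Mart data) and flags outliers with the common 1.5 x IQR rule; treat it as an illustration of the idea rather than a reproduction of ggplot2's exact computation.

```python
import numpy as np

# Made-up sales figures with one unusually large value.
sales = np.array([120, 340, 560, 610, 790, 820, 1100, 1350, 1500, 9800],
                 dtype=float)

# Five-number summary: min, 25th percentile, median, 75th percentile, max.
minimum, q1, median, q3, maximum = np.percentile(sales, [0, 25, 50, 75, 100])
iqr = q3 - q1

# Points further than 1.5 * IQR from the box are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = sales[(sales < lower_fence) | (sales > upper_fence)]

print(minimum, q1, median, q3, maximum)
print("outliers:", outliers)
```

Here only the value 9800 falls outside the fences, so it is the one point a box plot would draw separately.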

#Area Chart

An area chart is used to show continuity across a variable or data set. It is much the same as a line chart and is commonly used for time series plots. It can also be used to plot continuous variables and analyze the underlying trends.

ggplot(Big_Mart_Dataset_Sheet1, aes(Item_Outlet_Sales)) +
geom_area(stat = "bin", bins = 30, fill = "steelblue") +
scale_x_continuous(breaks = seq(0,11000,1000))+
labs(title = "Area Chart", x = "Item Outlet Sales", y = "Count")

#Heat Map

A heat map uses the intensity (density) of color to display the relationship between two, three or more variables in a two-dimensional image. Two dimensions are shown on the axes and the third is shown by the intensity of color.

ggplot(Big_Mart_Dataset_Sheet1, aes(Outlet_Identifier, Item_Type))+
geom_raster(aes(fill = Item_MRP))+
labs(title ="Heat Map", x = "Outlet Identifier", y = "Item Type")+
scale_fill_continuous(name = "Item MRP")

#Correlogram

A correlogram is used to inspect the level of correlation among the variables in the data set. The cells of the matrix can be shaded or colored to show the correlation value.

The darker the color, the stronger the correlation between variables. Positive correlations are displayed in blue and negative correlations in red. Color intensity is proportional to the correlation value.

install.packages("corrgram")
library(corrgram)

corrgram(Big_Mart_Dataset_Sheet1, order=NULL, panel=panel.shade, text.panel=panel.txt,
main="Correlogram")
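The matrix that a correlogram colors in is just a table of pairwise correlation values. Here is a minimal sketch of that table in pandas, using synthetic columns (`weight`, `price` and `visibility` are names and values I invented for the example, not the Big Mart columns).

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(1)
n = 200
weight = rng.normal(12, 4, n)

# Synthetic data: price rises with weight, visibility falls with it.
df = pd.DataFrame({
    "weight": weight,
    "price": 10 * weight + rng.normal(0, 5, n),
    "visibility": -0.01 * weight + rng.normal(0, 0.01, n),
})

# Pairwise Pearson correlations: values near +1 or -1 would be shaded
# darkest in a correlogram, values near 0 lightest.
corr = df.corr()
print(corr.round(2))
```

The sign of each entry decides the color and its magnitude decides the shade.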

That's all for this post. More on visualizations in later posts.
