Data Visualisation
Monday 27 February 2017
Thursday 2 February 2017
31.01.2017 - Python / Seaborn
In today's post, I shall take you through Seaborn, an important visualisation package for the Python programming language. Let's jump right in.
Let's start by importing the package.
import seaborn as sns
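Then, so that the plots are rendered inline between the code cells, run the IPython magic command below, which selects matplotlib's inline backend (this applies to Jupyter/IPython sessions, not to plain Python scripts).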
%matplotlib inline
HeatMap
# Load the example flights dataset (long-form) and pivot it into a wide month-by-year table
flights_long = sns.load_dataset("flights")
flights = flights_long.pivot(index="month", columns="year", values="passengers")
# Draw a heatmap with the numeric values in each cell
sns.heatmap(flights, annot=True, fmt="d", linewidths=.5)
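To make the reshaping step above concrete, here is a minimal pure-Python sketch of what a pivot does: it turns long-form rows (one row per month/year pair) into a wide month-by-year grid, which is the shape a heatmap needs. The `pivot_long_to_wide` helper is hypothetical and is not part of pandas or seaborn, and the passenger figures are made up for illustration.

```python
def pivot_long_to_wide(rows, index, columns, values):
    """Reshape long-form rows (dicts) into a nested dict: wide[index][column] = value."""
    wide = {}
    for row in rows:
        wide.setdefault(row[index], {})[row[columns]] = row[values]
    return wide


# Illustrative long-form records: one row per (month, year) observation.
# The numbers are invented; they are not taken from the flights dataset.
long_rows = [
    {"month": "Jan", "year": 1949, "passengers": 10},
    {"month": "Feb", "year": 1949, "passengers": 12},
    {"month": "Jan", "year": 1950, "passengers": 11},
    {"month": "Feb", "year": 1950, "passengers": 13},
]

wide = pivot_long_to_wide(long_rows, "month", "year", "passengers")
for month, by_year in wide.items():
    print(month, by_year)
```

Each outer key becomes a heatmap row and each inner key a heatmap column, so every cell holds exactly one value.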
Kdeplot
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="dark")
rs = np.random.RandomState(50)
# Set up the matplotlib figure
f, axes = plt.subplots(3, 3, figsize=(9, 9), sharex=True, sharey=True)
# Rotate the starting point around the cubehelix hue circle
for ax, s in zip(axes.flat, np.linspace(0, 3, 10)):
    # Create a cubehelix colormap to use with kdeplot
    cmap = sns.cubehelix_palette(start=s, light=1, as_cmap=True)
    # Generate and plot a random bivariate dataset
    x, y = rs.randn(2, 50)
    sns.kdeplot(x, y, cmap=cmap, shade=True, cut=5, ax=ax)
    ax.set(xlim=(-3, 3), ylim=(-3, 3))
f.tight_layout()
tsplot
sns.set(style="darkgrid", palette="Set2")
# Create a noisy periodic dataset
sines = []
rs = np.random.RandomState(8)
for _ in range(15):
    x = np.linspace(0, 30 / 2, 30)
    y = np.sin(x) + rs.normal(0, 1.5) + rs.normal(0, .3, 30)
    sines.append(y)
# Plot the average over replicates with bootstrap resamples
sns.tsplot(sines, err_style="boot_traces", n_boot=500)
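If you are wondering what "bootstrap resamples" means in the snippet above, the idea can be sketched in plain Python: repeatedly draw whole replicates with replacement and average each draw, which gives one "trace" of the mean per resample. This is only a sketch of the general technique under my own assumptions, not seaborn's actual implementation; the `bootstrap_means` helper and the tiny replicates are hypothetical.

```python
import random
import statistics


def bootstrap_means(replicates, n_boot, seed=0):
    """Resample whole replicates with replacement; return one mean curve per resample."""
    rng = random.Random(seed)
    n_points = len(replicates[0])
    curves = []
    for _ in range(n_boot):
        # Draw as many replicates as we started with, allowing repeats.
        sample = [rng.choice(replicates) for _ in replicates]
        curves.append(
            [statistics.fmean(rep[t] for rep in sample) for t in range(n_points)]
        )
    return curves


# Three made-up replicates observed at four time points.
replicates = [
    [0.0, 1.0, 0.0, -1.0],
    [0.5, 1.5, 0.5, -0.5],
    [-0.5, 0.5, -0.5, -1.5],
]

traces = bootstrap_means(replicates, n_boot=200)
print(len(traces), "bootstrap traces of length", len(traces[0]))
```

The spread of the traces at each time point gives a feel for how uncertain the average is there.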
Swarmplot
import pandas as pd
import seaborn as sns
sns.set(style="whitegrid", palette="muted")
# Load the example iris dataset
iris = sns.load_dataset("iris")
# "Melt" the dataset to "long-form" or "tidy" representation
iris = pd.melt(iris, "species", var_name="measurement")
# Draw a categorical scatterplot to show each observation
sns.swarmplot(x="measurement", y="value", hue="species", data=iris)
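The "melt" step above goes in the opposite direction to a pivot: every measurement column becomes its own row. Here is a minimal pure-Python sketch of that idea. The `melt` function below is a hypothetical stand-in written for illustration, not the pandas function (pandas groups the output by variable, whereas this sketch goes row by row), and the values are made up.

```python
def melt(rows, id_var, var_name="variable", value_name="value"):
    """Turn each non-id column of each wide row into its own long-form row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_var:
                continue  # the id column is repeated on every output row instead
            long_rows.append({id_var: row[id_var], var_name: column, value_name: value})
    return long_rows


# Two made-up wide rows shaped like the iris table (values are illustrative).
wide_rows = [
    {"species": "setosa", "sepal_length": 5.0, "petal_length": 1.5},
    {"species": "virginica", "sepal_length": 6.5, "petal_length": 5.5},
]

long_rows = melt(wide_rows, "species", var_name="measurement")
for long_row in long_rows:
    print(long_row)
```

After melting, a single "measurement" column can go on the x axis and a single "value" column on the y axis, which is what the swarmplot call relies on.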
Pairgrid
import seaborn as sns
sns.set(style="whitegrid")
# Load the dataset
crashes = sns.load_dataset("car_crashes")
# Make the PairGrid
g = sns.PairGrid(crashes.sort_values("total", ascending=False),
                 x_vars=crashes.columns[:-3], y_vars=["abbrev"],
                 size=10, aspect=.25)
# Draw a dot plot using the stripplot function
g.map(sns.stripplot, size=10, orient="h",
      palette="Reds_r", edgecolor="gray")
# Use the same x axis limits on all columns and add better labels
g.set(xlim=(0, 25), xlabel="Crashes", ylabel="")
# Use semantically meaningful titles for the columns
titles = ["Total crashes", "Speeding crashes", "Alcohol crashes",
          "Not distracted crashes", "No previous crashes"]
for ax, title in zip(g.axes.flat, titles):
    # Set a different title for each axes
    ax.set(title=title)
    # Make the grid horizontal instead of vertical
    ax.xaxis.grid(False)
    ax.yaxis.grid(True)
sns.despine(left=True, bottom=True)
Based on this admittedly small sample, which do you like better: R's ggplot2 or Seaborn? Let me know in the comments below.
Tuesday 24 January 2017
24.01.2017 - Geo Mapping
Thursday 12 January 2017
11.01.2017 - R
For this session on visualization techniques, let's use the 'Big Mart' dataset. You can download it here. The code below assumes it has been loaded into a data frame named Big_Mart_Dataset_Sheet1.
As always, let's start by loading the ggplot2 library.
library(ggplot2)
# Scatter plot
A Scatter Plot is used to see the relationship between two continuous variables.
# Plain scatter plot of visibility against MRP
ggplot(Big_Mart_Dataset_Sheet1, aes(Item_Visibility, Item_MRP)) +
  geom_point() +
  scale_x_continuous("Item Visibility", breaks = seq(0, 0.35, 0.05)) +
  scale_y_continuous("Item MRP", breaks = seq(0, 270, by = 30)) +
  theme_bw()
# Same plot with the points colored by Item_Type
ggplot(Big_Mart_Dataset_Sheet1, aes(Item_Visibility, Item_MRP)) +
  geom_point(aes(color = Item_Type)) +
  scale_x_continuous("Item Visibility", breaks = seq(0, 0.35, 0.05)) +
  scale_y_continuous("Item MRP", breaks = seq(0, 270, by = 30)) +
  theme_bw() + labs(title = "Scatterplot")
# facet_wrap() draws one panel per Item_Type and arranges the panels in a rectangular grid
ggplot(Big_Mart_Dataset_Sheet1, aes(Item_Visibility, Item_MRP)) +
  geom_point(aes(color = Item_Type)) +
  scale_x_continuous("Item Visibility", breaks = seq(0, 0.4, 0.1)) +
  scale_y_continuous("Item MRP", breaks = seq(0, 270, by = 30)) +
  theme_bw() + labs(title = "Scatterplot") + facet_wrap(~ Item_Type)
#Histogram
A histogram is used to plot a continuous variable. It breaks the data into bins and shows the frequency distribution across those bins. We can always change the bin size and see the effect it has on the visualization.
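The binning idea can be sketched in a few lines of Python (the language of the Seaborn post on this blog). This is a minimal illustration with equal-width bins and made-up values; the `histogram_counts` helper is hypothetical, it assumes the values are not all identical, and ggplot2 may place its bin edges differently.

```python
def histogram_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    low, high = min(values), max(values)
    width = (high - low) / bins  # assumes high > low
    counts = [0] * bins
    for value in values:
        # The maximum value would land one past the end, so clamp it into the last bin.
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts


# Made-up continuous measurements, for illustration only.
values = [1.0, 1.5, 2.0, 2.5, 3.0, 4.5, 5.0, 7.5, 9.0, 10.0]

coarse = histogram_counts(values, bins=3)  # few wide bins: a coarse picture
fine = histogram_counts(values, bins=9)    # many narrow bins: finer detail
print(coarse)
print(fine)
```

Comparing the two printed lists shows the trade-off: fewer bins smooth the picture, more bins reveal gaps and clusters.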
#Bar & Stack Bar Chart
Bar charts are recommended when you want to plot a categorical variable, or a combination of a continuous and a categorical variable.
ggplot(Big_Mart_Dataset_Sheet1, aes(Outlet_Establishment_Year)) +
  geom_bar(fill = "red") +
  scale_x_continuous("Establishment Year", breaks = seq(1985, 2010)) +
  scale_y_continuous("Count", breaks = seq(0, 1500, 150)) +
  coord_flip() + labs(title = "Bar Chart") + theme_gray()
Another variation of this kind of visualization is the vertical bar chart:
ggplot(Big_Mart_Dataset_Sheet1, aes(Item_Type, Item_Weight)) +
  geom_bar(stat = "identity", fill = "darkblue") +
  scale_x_discrete("Item Type") +
  scale_y_continuous("Item Weight", breaks = seq(0, 15000, by = 500)) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
  labs(title = "Bar Chart")
Stacked Bar Chart:
ggplot(Big_Mart_Dataset_Sheet1, aes(Outlet_Location_Type, fill = Outlet_Type)) + geom_bar()+
labs(title = "Stacked Bar Chart", x = "Outlet Location Type", y = "Count of Outlets")
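To see what the stacked bar chart is counting, here is a small Python sketch: the height of each bar is the number of rows per location type, and each colored segment is the number of rows per (location type, outlet type) pair. The rows below are made up for illustration and are not taken from the Big Mart data.

```python
from collections import Counter

# Made-up (location type, outlet type) pairs, one per row of a hypothetical table.
outlets = [
    ("Tier 1", "Grocery Store"),
    ("Tier 1", "Supermarket Type1"),
    ("Tier 1", "Supermarket Type1"),
    ("Tier 2", "Supermarket Type1"),
    ("Tier 3", "Grocery Store"),
    ("Tier 3", "Supermarket Type2"),
]

# Height of each stacked segment: rows per (location, outlet type) pair.
segment_counts = Counter(outlets)
# Total height of each bar: rows per location, regardless of outlet type.
bar_totals = Counter(location for location, _ in outlets)

for location in sorted(bar_totals):
    segments = {kind: n for (loc, kind), n in segment_counts.items() if loc == location}
    print(location, bar_totals[location], segments)
```

The segments within a bar always add up to that bar's total, which is why the stacked form is handy for comparing both the whole and its parts.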
#Box plot
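Box plots are used to plot a combination of categorical and continuous variables. This plot is useful for visualizing the spread of the data and detecting outliers. It is built on the five-number summary: the minimum, the 25th percentile, the median, the 75th percentile, and the maximum. Note that in geom_boxplot() the whiskers extend only to the most extreme values within 1.5 times the interquartile range of the box; points beyond that are drawn individually as outliers.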
ggplot(Big_Mart_Dataset_Sheet1, aes(Outlet_Identifier, Item_Outlet_Sales)) + geom_boxplot(fill = "red")+
scale_y_continuous("Item Outlet Sales", breaks= seq(0,15000, by=500))+
labs(title = "Box Plot", x = "Outlet Identifier")
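The numbers behind a box plot are easy to compute by hand. Below is a minimal Python sketch of the five-number summary and the common 1.5 × IQR rule for flagging outliers. The helper names and the sales figures are hypothetical, and the quartile method chosen here ("inclusive") is just one of several conventions, so other tools may report slightly different quartiles.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def outliers(values):
    """Values farther than 1.5 * IQR outside the quartiles (a common box-plot rule)."""
    _, q1, _, q3, _ = five_number_summary(values)
    fence = 1.5 * (q3 - q1)
    return [v for v in values if v < q1 - fence or v > q3 + fence]


# Made-up sales figures with one unusually large value.
sales = [10, 12, 13, 15, 16, 18, 20, 22, 95]

summary = five_number_summary(sales)
flagged = outliers(sales)
print(summary)
print(flagged)
```

The box spans Q1 to Q3, the line inside it is the median, and anything returned by `outliers` would show up as an individual point.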
#Area Chart
An area chart is used to show continuity across a variable or data set. It is very similar to a line chart and is commonly used for time series plots. It can also be used to plot a continuous variable and analyze its underlying trend.
ggplot(Big_Mart_Dataset_Sheet1, aes(Item_Outlet_Sales)) +
geom_area(stat = "bin", bins = 30, fill = "steelblue") +
scale_x_continuous(breaks = seq(0,11000,1000))+
labs(title = "Area Chart", x = "Item Outlet Sales", y = "Count")
#Heat Map
A heat map uses the intensity (density) of colors to display the relationship between two, three, or more variables in a two-dimensional image. Two variables form the axes, and a third is shown through the intensity of the color.
ggplot(Big_Mart_Dataset_Sheet1, aes(Outlet_Identifier, Item_Type))+
geom_raster(aes(fill = Item_MRP))+
labs(title ="Heat Map", x = "Outlet Identifier", y = "Item Type")+
scale_fill_continuous(name = "Item MRP")
#Correlogram
A correlogram is used to inspect the level of correlation among the variables available in the data set. The cells of the matrix can be shaded or colored to show the correlation value. The darker the color, the stronger the correlation between the variables. Positive correlations are displayed in blue and negative correlations in red, and the color intensity is proportional to the correlation value.
install.packages("corrgram")
library(corrgram)
corrgram(Big_Mart_Dataset_Sheet1, order=NULL, panel=panel.shade, text.panel=panel.txt,
main="Correlogram")
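Each cell of a correlogram shows a correlation value for one pair of variables. As a rough Python sketch of where those numbers come from, the snippet below computes the Pearson correlation coefficient for every pair of columns. I am assuming Pearson correlation here for illustration; the `pearson` helper is hypothetical, and the columns are made up (the names only echo the Big Mart columns used above).

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up numeric columns, for illustration only.
columns = {
    "Item_MRP": [50.0, 100.0, 150.0, 200.0, 250.0],
    "Item_Outlet_Sales": [400.0, 900.0, 1600.0, 2100.0, 2400.0],
    "Item_Visibility": [0.20, 0.15, 0.12, 0.06, 0.02],
}

names = list(columns)
for a in names:
    print(a, [round(pearson(columns[a], columns[b]), 2) for b in names])
```

Values close to 1 or -1 correspond to the darkest cells, and values near 0 to the palest ones.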
That's all for this post. More on visualizations in later posts.