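For this session on visualization techniques, let's use the 'Big Mart' dataset. You can download it here. The code below assumes the data has been loaded into a data frame named Big_Mart_Dataset_Sheet1.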
As always, let's start by loading the ggplot2 library:
library(ggplot2)
#Scatter Plot
A Scatter Plot is used to see the relationship between two continuous variables.
ggplot(Big_Mart_Dataset_Sheet1, aes(Item_Visibility, Item_MRP)) +
  geom_point() +
  scale_x_continuous("Item Visibility", breaks = seq(0, 0.35, 0.05)) +
  scale_y_continuous("Item MRP", breaks = seq(0, 270, by = 30)) +
  theme_bw()
# Color each point by its item type.
ggplot(Big_Mart_Dataset_Sheet1, aes(Item_Visibility, Item_MRP)) +
  geom_point(aes(color = Item_Type)) +
  scale_x_continuous("Item Visibility", breaks = seq(0, 0.35, 0.05)) +
  scale_y_continuous("Item MRP", breaks = seq(0, 270, by = 30)) +
  theme_bw() +
  labs(title = "Scatterplot")
# facet_wrap() draws one panel per Item_Type and wraps the panels into a rectangular layout.
ggplot(Big_Mart_Dataset_Sheet1, aes(Item_Visibility, Item_MRP)) +
  geom_point(aes(color = Item_Type)) +
  scale_x_continuous("Item Visibility", breaks = seq(0, 0.4, 0.1)) +
  scale_y_continuous("Item MRP", breaks = seq(0, 270, by = 30)) +
  theme_bw() +
  labs(title = "Scatterplot") +
  facet_wrap(~ Item_Type)
#Histogram
A histogram is used to plot a continuous variable. It breaks the data into bins and shows the frequency distribution across these bins. We can always change the bin width and see the effect it has on the visualization.
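No histogram code appears in this section, so here is a minimal sketch of what one might look like with geom_histogram(). It assumes the Item_MRP column used in the other plots. To stay runnable on its own, it builds a small simulated stand-in called toy_mart (a hypothetical name, not the real Big Mart data); with the real dataset you would pass Big_Mart_Dataset_Sheet1 instead.

```r
library(ggplot2)

# Simulated stand-in for the Big Mart data (an assumption, not the real dataset).
set.seed(42)
toy_mart <- data.frame(Item_MRP = runif(500, min = 30, max = 270))

# binwidth sets the width of each bin in MRP units; try other values
# (for example 2 or 20) to see how the shape of the histogram changes.
hist_plot <- ggplot(toy_mart, aes(Item_MRP)) +
  geom_histogram(binwidth = 10, fill = "steelblue", color = "white") +
  scale_x_continuous("Item MRP", breaks = seq(0, 270, by = 30)) +
  scale_y_continuous("Count") +
  labs(title = "Histogram") +
  theme_bw()

print(hist_plot)
```

A smaller binwidth shows more detail but also more noise, while a larger one smooths the distribution at the cost of hiding local peaks.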
#Bar & Stacked Bar Chart
Bar charts are recommended when you want to plot a categorical variable or a combination of continuous and categorical variables.
# coord_flip() swaps the axes, so the bars run horizontally.
ggplot(Big_Mart_Dataset_Sheet1, aes(Outlet_Establishment_Year)) +
  geom_bar(fill = "red") +
  scale_x_continuous("Establishment Year", breaks = seq(1985, 2010)) +
  scale_y_continuous("Count", breaks = seq(0, 1500, 150)) +
  coord_flip() +
  labs(title = "Bar Chart") +
  theme_gray()
Another variation of this kind of visualization is the vertical bar chart:
# stat = "identity" uses the Item_Weight values as bar heights, summed per Item_Type.
ggplot(Big_Mart_Dataset_Sheet1, aes(Item_Type, Item_Weight)) +
  geom_bar(stat = "identity", fill = "darkblue") +
  scale_x_discrete("Item Type") +
  scale_y_continuous("Item Weight", breaks = seq(0, 15000, by = 500)) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
  labs(title = "Bar Chart")
Stacked Bar Chart:
ggplot(Big_Mart_Dataset_Sheet1, aes(Outlet_Location_Type, fill = Outlet_Type)) +
  geom_bar() +
  labs(title = "Stacked Bar Chart", x = "Outlet Location Type", y = "Count of Outlets")
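geom_bar() stacks the fill groups by default. As a rough sketch of an alternative, the position argument can place the groups side by side instead. The toy_outlets data frame and its values below are made up for illustration (they are not the real Big Mart rows); only the column names are taken from the stacked chart.

```r
library(ggplot2)

# Made-up stand-in with the two categorical columns used above (an assumption).
toy_outlets <- data.frame(
  Outlet_Location_Type = c("Tier 1", "Tier 1", "Tier 2", "Tier 2",
                           "Tier 3", "Tier 3", "Tier 3"),
  Outlet_Type = c("Grocery Store", "Supermarket Type1", "Supermarket Type1",
                  "Supermarket Type1", "Grocery Store", "Supermarket Type2",
                  "Supermarket Type1")
)

# The counts that geom_bar() computes for each combination.
outlet_counts <- table(toy_outlets$Outlet_Location_Type, toy_outlets$Outlet_Type)
print(outlet_counts)

# position = "dodge" draws the groups side by side instead of stacking them.
dodged_plot <- ggplot(toy_outlets, aes(Outlet_Location_Type, fill = Outlet_Type)) +
  geom_bar(position = "dodge") +
  labs(title = "Grouped Bar Chart", x = "Outlet Location Type", y = "Count of Outlets")

print(dodged_plot)
```

Stacking makes the totals per location easy to read, while dodging makes it easier to compare the individual outlet types against each other.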
#Box Plot
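Box plots are used to plot a combination of categorical and continuous variables. This plot is useful for visualizing the spread of the data and detecting outliers. It shows five summary statistics: the minimum, the 25th percentile, the median, the 75th percentile, and the maximum. Note that ggplot2 draws the whiskers only as far as the most extreme points within 1.5 times the interquartile range of the box; points beyond that are plotted individually as outliers.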
ggplot(Big_Mart_Dataset_Sheet1, aes(Outlet_Identifier, Item_Outlet_Sales)) +
  geom_boxplot(fill = "red") +
  scale_y_continuous("Item Outlet Sales", breaks = seq(0, 15000, by = 500)) +
  labs(title = "Box Plot", x = "Outlet Identifier")
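To see the numbers behind a box plot, the five summary values can be computed directly in base R. The sketch below uses a made-up toy_sales data frame with hypothetical outlet names (OUT_A, OUT_B), not the real data. It also uses boxplot.stats(), which applies a 1.5 * IQR rule to flag outliers; it works with hinges rather than quantiles, so its results can differ slightly from what ggplot2 draws.

```r
# Made-up sales figures for two hypothetical outlets (not the real data).
toy_sales <- data.frame(
  Outlet_Identifier = rep(c("OUT_A", "OUT_B"), each = 5),
  Item_Outlet_Sales = c(100, 200, 300, 400, 500, 150, 250, 350, 450, 5000)
)

# Minimum, quartiles, and maximum for each outlet.
five_numbers <- lapply(
  split(toy_sales$Item_Outlet_Sales, toy_sales$Outlet_Identifier),
  quantile,
  probs = c(0, 0.25, 0.5, 0.75, 1)
)
print(five_numbers)

# boxplot.stats() lists the points that fall outside the 1.5 * IQR whiskers.
outliers_b <- boxplot.stats(
  toy_sales$Item_Outlet_Sales[toy_sales$Outlet_Identifier == "OUT_B"]
)$out
print(outliers_b)
```

Here the value 5000 in OUT_B lies far above the rest of that outlet's sales, so it is reported as an outlier rather than stretching the whisker.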
#Area Chart
An area chart is used to show continuity across a variable or data set. It is very similar to a line chart and is commonly used for time series plots. It can also be used to plot continuous variables and analyze the underlying trends.
# stat = "bin" counts the observations in 30 bins and fills the area under those counts.
ggplot(Big_Mart_Dataset_Sheet1, aes(Item_Outlet_Sales)) +
  geom_area(stat = "bin", bins = 30, fill = "steelblue") +
  scale_x_continuous(breaks = seq(0, 11000, 1000)) +
  labs(title = "Area Chart", x = "Item Outlet Sales", y = "Count")
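Since area charts are commonly used for time series, here is a short sketch of that use. The Big Mart data has no date column to plot against, so this example switches to the economics dataset that ships with ggplot2 (monthly US economic indicators); that choice of dataset is an assumption made only for illustration.

```r
library(ggplot2)

# economics ships with ggplot2; unemploy is the number of unemployed, in thousands.
area_ts_plot <- ggplot(economics, aes(date, unemploy)) +
  geom_area(fill = "steelblue", alpha = 0.7) +
  labs(title = "Area Chart (Time Series)", x = "Date", y = "Unemployed (thousands)")

print(area_ts_plot)
```

Replacing geom_area() with geom_line() gives the equivalent line chart, which makes the similarity between the two chart types easy to see.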
#Heat Map
A heat map uses the intensity (density) of colors to display the relationship between two, three, or more variables in a two-dimensional image. It lets you explore two dimensions as the axes and a third dimension through the intensity of color.
ggplot(Big_Mart_Dataset_Sheet1, aes(Outlet_Identifier, Item_Type)) +
  geom_raster(aes(fill = Item_MRP)) +
  labs(title = "Heat Map", x = "Outlet Identifier", y = "Item Type") +
  scale_fill_continuous(name = "Item MRP")
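When several rows share the same Outlet_Identifier and Item_Type, a single cell receives more than one Item_MRP value, so its color ends up reflecting just one of them rather than a summary. One possible way around this (an alternative sketch, not the approach used above) is to average Item_MRP per cell first and plot the aggregated table. The toy_mart data frame and its values are made up for illustration.

```r
library(ggplot2)

# Made-up stand-in with repeated outlet / item-type combinations (not the real data).
toy_mart <- data.frame(
  Outlet_Identifier = c("OUT_A", "OUT_A", "OUT_A", "OUT_B", "OUT_B", "OUT_B"),
  Item_Type = c("Dairy", "Dairy", "Snacks", "Dairy", "Snacks", "Snacks"),
  Item_MRP = c(100, 200, 50, 120, 60, 80)
)

# Average Item_MRP for every outlet / item-type cell.
mean_mrp <- aggregate(Item_MRP ~ Outlet_Identifier + Item_Type,
                      data = toy_mart, FUN = mean)
print(mean_mrp)

# One tile per cell, colored by the mean MRP of that cell.
heat_plot <- ggplot(mean_mrp, aes(Outlet_Identifier, Item_Type)) +
  geom_tile(aes(fill = Item_MRP)) +
  labs(title = "Heat Map", x = "Outlet Identifier", y = "Item Type") +
  scale_fill_continuous(name = "Mean Item MRP")

print(heat_plot)
```

Other summaries, such as the median or the maximum, can be swapped in through the FUN argument of aggregate().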
#Correlogram
A correlogram is used to examine the level of correlation among the variables available in the data set. The cells of the matrix can be shaded or colored to show the correlation value. The darker the color, the stronger the correlation between variables. Positive correlations are displayed in blue and negative correlations in red. Color intensity is proportional to the correlation value.
# Install the package once, then load it.
install.packages("corrgram")
library(corrgram)
corrgram(Big_Mart_Dataset_Sheet1, order = NULL, panel = panel.shade,
         text.panel = panel.txt, main = "Correlogram")
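The shading in a correlogram represents correlation coefficients between pairs of numeric variables. As a sketch of how to inspect those numbers directly, base R's cor() can compute the matrix. The toy_mart data frame below is a made-up stand-in that borrows a few column names from the plots above; its values are invented for illustration.

```r
# Made-up stand-in for the real dataset (an assumption): two numeric columns,
# one character column, and one missing value.
toy_mart <- data.frame(
  Item_Weight = c(5, 10, 15, NA, 8, 12),
  Item_MRP = c(250, 50, 140, 180, 55, 60),
  Item_Type = c("Dairy", "Soft Drinks", "Meat", "Fruits", "Household", "Baking")
)
# Sales constructed to track MRP closely, so a strong positive correlation is expected.
toy_mart$Item_Outlet_Sales <- 15 * toy_mart$Item_MRP + c(10, -20, 30, -40, 50, -60)

# Keep only the numeric columns; cor() cannot handle character columns.
numeric_cols <- toy_mart[sapply(toy_mart, is.numeric)]

# Pairwise Pearson correlations, skipping missing values pair by pair.
cor_matrix <- cor(numeric_cols, use = "pairwise.complete.obs")
print(round(cor_matrix, 2))
```

Values near 1 or -1 correspond to the darkest cells in the correlogram, and values near 0 to the lightest.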