Boxplot¶
Parameters:
df: pd.DataFrame,
x: object --> string column name in the DF holding the boxplot values for the X axis
y: object --> string column name in the DF holding the boxplot values for the Y axis
title='' --> string title
xlabel='' --> string x label
ylabel='' --> string y label
hue=None --> column you want to colour by
order=None --> order of your values
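box_colors=None, --> a list of colours to plot the boxes by --> only works in older versions of matplotlib
showfliers=False, --> show fliers (outliers) on the boxes
add_dots=False, --> add dots for each data point
add_stats=True, --> add pairwise statistical tests between boxes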
stat_method='Mann-Whitney', # options: t-test_ind, t-test_welch, t-test_paired, Mann-Whitney, Mann-Whitney-gt, Mann-Whitney-ls, Levene, Wilcoxon, Kruskal from: https://www.statsmodels.org/stable/api.html
box_pairs=None --> a list of box pairs i.e. comparisons for the statistics
figsize=(3, 3),
title_font_size=12,
label_font_size=8,
title_font_weight=700,
config={})
Config options = any of the parameters above, passed under the same name in a dictionary instead. The config can also set default parameters for the visualisation, such as the font family and font.
Example config:
config={'palette': ['red', 'yellow', 'pink'],
'figsize':(4, 5), # Size of figure (x, y)
'title_font_size': 16, # Size of the title (pt)
'label_font_size': 12, # Size of the labels (pt)
'title_font_weight': 700, # 700 = bold, 600 = semibold, 400 = normal
'font_family': 'sans-serif', # 'serif', 'sans-serif', or 'monospace'
'font': ['Tahoma'] # Default: Arial # http://jonathansoma.com/lede/data-studio/matplotlib/list-all-fonts-available-in-matplotlib-plus-samples/
}
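One way to picture how the keyword arguments and `config` interact is a plain dictionary merge. The sketch below is not sciviso's implementation: `DEFAULTS` and `resolve_options` are hypothetical names, and it assumes that a key in `config` takes precedence over the keyword argument of the same name.

```python
# Hypothetical defaults mirroring the parameter list above (not read from sciviso).
DEFAULTS = {
    "figsize": (3, 3),
    "title_font_size": 12,
    "label_font_size": 8,
    "title_font_weight": 700,
}


def resolve_options(config=None, **kwargs):
    """Merge defaults, keyword arguments and a config dict, in that order.

    Assumption: later sources win, so a key in `config` overrides the
    keyword argument of the same name.
    """
    options = dict(DEFAULTS)      # start from a copy of the defaults
    options.update(kwargs)        # explicit keyword arguments
    options.update(config or {})  # config dictionary applied last
    return options


# The same dictionary can be reused for several charts.
shared_config = {"figsize": (4, 5), "title_font_size": 16}
options = resolve_options(config=shared_config, label_font_size=10)
print(options)
```

Keeping the style settings in one dictionary means the same look can be applied to every chart in a figure panel without repeating the arguments.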
Loading data¶
[1]:
import pandas as pd
from sciviso import Barchart, Boxplot, Heatmap, Histogram, Scatterplot, Violinplot, Volcanoplot, Line
import matplotlib.pyplot as plt
df = pd.read_csv('iris.csv')
df
[1]:
 | sepal_length | sepal_width | petal_length | petal_width | label |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
... | ... | ... | ... | ... | ... |
145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
149 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |
150 rows × 5 columns
Basic boxplot¶
[2]:
boxplot = Boxplot(df, x='label', y='sepal_width')
boxplot.plot()
plt.show()
p-value annotation legend:
ns: 5.00e-02 < p <= 1.00e+00
*: 1.00e-02 < p <= 5.00e-02
**: 1.00e-03 < p <= 1.00e-02
***: 1.00e-04 < p <= 1.00e-03
****: p <= 1.00e-04
Iris-setosa v.s. Iris-versicolor: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=8.950e-13 U_stat=2.306e+03
Iris-versicolor v.s. Iris-virginica: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=1.372e-02 U_stat=8.410e+02
Iris-setosa v.s. Iris-virginica: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=3.543e-08 U_stat=2.074e+03
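The legend printed above maps each p-value to a symbol. As a minimal sketch of that mapping (the function name `p_to_annotation` is hypothetical, and the thresholds are copied from the legend rather than from the library's code):

```python
def p_to_annotation(p_value):
    """Map a p-value to the symbol used in the printed legend."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    # Upper bounds taken from the legend, checked from most to least significant.
    thresholds = [(1e-4, "****"), (1e-3, "***"), (1e-2, "**"), (5e-2, "*")]
    for upper, symbol in thresholds:
        if p_value <= upper:
            return symbol
    return "ns"


# p-values as reported in the output above
reported = {
    "Iris-setosa v.s. Iris-versicolor": 8.950e-13,
    "Iris-versicolor v.s. Iris-virginica": 1.372e-02,
    "Iris-setosa v.s. Iris-virginica": 3.543e-08,
}
annotations = {pair: p_to_annotation(p) for pair, p in reported.items()}
for pair, symbol in annotations.items():
    print(pair, symbol)
```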

Formatting data for boxplot¶
Data needs to be formatted for the boxplot. For example, if we have a gene list, we may want to plot just a few of the genes or some groups of genes (e.g. a group of genes we’re interested in comparing between two conditions).
For this we’ll use a different example dataset.
[3]:
df = pd.read_csv('volcano.csv')
df
[3]:
 | external_gene_name | logfc | padj |
---|---|---|---|
0 | MT-TF | -2.6 | 0.02128 |
1 | MT-RNR1 | -6.1 | 0.83880 |
2 | MT-TV | -8.6 | 0.25140 |
3 | MT-RNR2 | -0.9 | 0.29380 |
4 | MT-TL1 | 1.1 | 0.58210 |
... | ... | ... | ... |
73620 | ARHGEF5 | 6.5 | 0.55980 |
73621 | NOBOX | 1.5 | 0.01870 |
73622 | AC004864.1 | -8.5 | 0.05760 |
73623 | MTRF1LP2 | -4.8 | 0.17570 |
73624 | GSDMC | 3.5 | 0.78250 |
73625 rows × 3 columns
[4]:
# Now we'll do an example where we look at the logFC between two conditions
# of a group of genes and test whether they are significantly different.
# Pretend we have two conditions
df['cond_1'] = df['logfc'] + 1
df['cond_2'] = df['logfc'] - 1
boxplot = Boxplot(df, x='external_gene_name', y='logfc')
hox_genes = [g for g in df['external_gene_name'].values if 'HOX' in g]
# conditions: list, filter_column=None, filter_values=None
formatted_df = boxplot.format_data_for_boxplot(
df,
conditions=["cond_1", "cond_2"],
filter_column="external_gene_name",
filter_values=hox_genes
)
print(formatted_df)
# Reinitialise boxplot with the new data
boxplot = Boxplot(formatted_df, "Conditions", "Values",
box_colors=["plum", "gold"],
add_dots=True)
boxplot.plot()
Samples Values Conditions
0 cond_1 0.0 cond_1
1 cond_2 -2.0 cond_2
2 cond_1 -1.8 cond_1
3 cond_2 -3.8 cond_2
4 cond_1 -6.9 cond_1
.. ... ... ...
125 cond_2 -2.4 cond_2
126 cond_1 7.0 cond_1
127 cond_2 5.0 cond_2
128 cond_1 -7.7 cond_1
129 cond_2 -9.7 cond_2
[130 rows x 3 columns]
p-value annotation legend:
ns: 5.00e-02 < p <= 1.00e+00
*: 1.00e-02 < p <= 5.00e-02
**: 1.00e-03 < p <= 1.00e-02
***: 1.00e-04 < p <= 1.00e-03
****: p <= 1.00e-04
cond_1 v.s. cond_2: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=2.169e-02 U_stat=2.606e+03
[4]:
<AxesSubplot:>
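The formatting step turns one column per condition (wide format) into one row per value (long format), which is what the boxplot expects. The sketch below approximates this with plain pandas. It is not the code behind `format_data_for_boxplot`: the helper `wide_to_long` and the toy gene values are made up for illustration, and it leaves out the `Samples` column and the row ordering shown in the printed output.

```python
import pandas as pd

# Toy wide-format data: one row per gene, one column per condition.
wide_df = pd.DataFrame({
    "external_gene_name": ["HOXA1", "HOXB2", "GSDMC"],
    "cond_1": [0.0, -1.8, 4.5],
    "cond_2": [-2.0, -3.8, 2.5],
})


def wide_to_long(df, conditions, filter_column=None, filter_values=None):
    """Keep the rows of interest, then stack the condition columns."""
    if filter_column is not None and filter_values is not None:
        df = df[df[filter_column].isin(filter_values)]
    # melt produces one row per (condition, value) pair
    return df.melt(value_vars=conditions, var_name="Conditions", value_name="Values")


long_df = wide_to_long(
    wide_df,
    conditions=["cond_1", "cond_2"],
    filter_column="external_gene_name",
    filter_values=["HOXA1", "HOXB2"],
)
print(long_df)
```

The resulting frame has a `Conditions` column for the X axis and a `Values` column for the Y axis, matching the column names passed to `Boxplot` above.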

Advanced style options¶
Here is an example where options such as the title, axis labels, fonts and figure size have been changed.
[5]:
# boxplot = Boxplot(df: pd.DataFrame, x: object, y: object, title='', xlabel='', ylabel='', box_colors=None,
# hue=None, order=None, hue_order=None, showfliers=False, add_dots=False, add_stats=True,
# stat_method='Mann-Whitney', box_pairs=None, figsize=(3, 3), title_font_size=12, label_font_size=8, title_font_weight=700)
# Config options = any of the parameters with the same name, passed in a dictionary instead
# Let's continue with the previous example with the formatted data
boxplot = Boxplot(df=formatted_df, x='Conditions', y='Values', title='Hox genes', xlabel='', ylabel='Log FC',
box_colors=None, # An ordered list of colours to match the conditions
hue=None, # A column in your dataset that you want to colour by
order=None, # Order of the boxes
hue_order=None, # Order of the colours
showfliers=False, # Show fliers (on the boxes)
add_dots=False, # Add dots for each data point
add_stats=True, # Add pairwise statistical tests between boxes
stat_method='Mann-Whitney', # Type of stat
box_pairs=None, # Pre-specified comparisons (if you don't want to do all pairs)
figsize=(3, 3),
title_font_size=12,
label_font_size=8,
title_font_weight=700,
# Config options = any of the parameters with the same name, passed in a dictionary instead.
# You could also pass these as individual parameters, but it's easier to set them in a dictionary,
# and then you can re-use it for other charts!
config={'figsize':(4, 5), # Size of figure (x, y)
'title_font_size': 16, # Size of the title (pt)
'label_font_size': 12, # Size of the labels (pt)
'title_font_weight': 700, # 700 = bold, 600 = semibold, 400 = normal
'font_family': 'sans-serif', # 'serif', 'sans-serif', or 'monospace'
'font': ['Tahoma'] # Default: Arial # http://jonathansoma.com/lede/data-studio/matplotlib/list-all-fonts-available-in-matplotlib-plus-samples/
})
boxplot.plot()
plt.show()
p-value annotation legend:
ns: 5.00e-02 < p <= 1.00e+00
*: 1.00e-02 < p <= 5.00e-02
**: 1.00e-03 < p <= 1.00e-02
***: 1.00e-04 < p <= 1.00e-03
****: p <= 1.00e-04
cond_1 v.s. cond_2: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=2.169e-02 U_stat=2.606e+03
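The `stat_method='Mann-Whitney'` setting used above refers to the Mann-Whitney U test, which compares two groups by counting how often a value from one group exceeds a value from the other. Here is a minimal sketch of the U statistic by direct pair counting, with toy values. The function name is hypothetical, and which of the two U values a given library reports is a convention this sketch does not try to reproduce.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) by direct pair counting; ties count as one half."""
    u_a = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    # The two statistics always sum to the number of pairs.
    u_b = len(sample_a) * len(sample_b) - u_a
    return u_a, u_b


# Toy values, not taken from the dataset above
group_a = [1.0, 3.0, 5.0]
group_b = [2.0, 3.0, 6.0]
u_a, u_b = mann_whitney_u(group_a, group_b)
print(u_a, u_b)
```

Each U value lies between 0 and the product of the two sample sizes; values near either extreme indicate that one group tends to be larger than the other.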

Show multiple comparisons¶
This example has two conditions for two groups of genes, with the groups distinguished by colour using hue.
[6]:
# Pretend we have two conditions
df['cond_1'] = df['logfc'] + 1
df['cond_2'] = df['logfc'] - 1
boxplot = Boxplot(df, x='external_gene_name', y='logfc')
mt_genes = [g for g in df['external_gene_name'].values if 'MT' in g]
hox_genes = [g for g in df['external_gene_name'].values if 'HOX' in g]
# conditions: list, filter_column=None, filter_values=None
hox_df = boxplot.format_data_for_boxplot(
df,
conditions=["cond_1", "cond_2"],
filter_column="external_gene_name",
filter_values=hox_genes)
# Add another column to the hox df that's the label
hox_df['Gene Group'] = 'Hox'
# Create a df for the MT genes
mt_df = boxplot.format_data_for_boxplot(
df,
conditions=["cond_1", "cond_2"],
filter_column="external_gene_name",
filter_values=mt_genes)
# Add another column to the mt df that's the label
mt_df['Gene Group'] = 'MT'
gene_df = pd.concat([hox_df, mt_df])
# Now we set hue
boxplot = Boxplot(df=gene_df, x='Conditions', y='Values', title='Hox & MT genes', xlabel='', ylabel='Log FC',
box_colors=None, # An ordered list of colours to match the conditions
hue='Gene Group', # A column in your dataset that you want to colour by
order=None, # Order of the boxes
hue_order=None, # Order of the colours
showfliers=False, # Show fliers (on the boxes)
add_dots=False, # Add dots for each data point
add_stats=True, # Add pairwise statistical tests between boxes
stat_method='t-test_ind', # Type of stat
box_pairs=None, # Pre-specified comparisons (if you don't want to do all pairs)
figsize=(3, 3),
title_font_size=12,
label_font_size=8,
title_font_weight=700) # These could also be passed in a config dictionary instead
boxplot.plot()
plt.show()
p-value annotation legend:
ns: 5.00e-02 < p <= 1.00e+00
*: 1.00e-02 < p <= 5.00e-02
**: 1.00e-03 < p <= 1.00e-02
***: 1.00e-04 < p <= 1.00e-03
****: p <= 1.00e-04
cond_1 v.s. cond_2: t-test independent samples with Bonferroni correction, P_val=1.843e-13 stat=7.417e+00
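The `stat` value reported for `t-test_ind` is a t statistic for two independent samples. As a rough sketch of how such a statistic is computed, here is the classic pooled-variance (Student's) form using only the standard library. This assumes equal variances in both groups and uses toy values; it is an illustration rather than the routine the plotting code actually calls.

```python
from math import sqrt
from statistics import mean, variance


def independent_t_statistic(sample_a, sample_b):
    """Student's t statistic for two independent samples (pooled variance)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Weighted average of the two sample variances
    pooled_var = (
        (n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)
    ) / (n_a + n_b - 2)
    standard_error = sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error


# Toy values, not taken from the dataset above
toy_cond_1 = [2.0, 4.0, 6.0]
toy_cond_2 = [1.0, 2.0, 3.0]
t_stat = independent_t_statistic(toy_cond_1, toy_cond_2)
print(round(t_stat, 4))
```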

Another example¶
[7]:
# Pretend we have two conditions and two groups of genes we want to identify the significance between
df['cond_1'] = df['logfc'] + 1
df['cond_2'] = df['logfc'] - 1
boxplot = Boxplot(df, x='external_gene_name', y='logfc')
mt_genes = [g for g in df['external_gene_name'].values if 'MT' in g]
hox_genes = [g for g in df['external_gene_name'].values if 'HOX' in g]
# conditions: list, filter_column=None, filter_values=None
hox_df = boxplot.format_data_for_boxplot(
df,
conditions=["cond_1", "cond_2"],
filter_column="external_gene_name",
filter_values=hox_genes)
# Add another column to the hox df that's the label
hox_df['Gene Group'] = 'Hox'
hox_df['Conditions'] = hox_df['Conditions'].values + hox_df['Gene Group'].values
# Create a df for the MT genes
mt_df = boxplot.format_data_for_boxplot(
df,
conditions=["cond_1", "cond_2"],
filter_column="external_gene_name",
filter_values=mt_genes)
# Add another column to the mt df that's the label
mt_df['Gene Group'] = 'MT'
mt_df['Conditions'] = mt_df['Conditions'].values + mt_df['Gene Group'].values
gene_df = pd.concat([hox_df, mt_df])
# First plot all pairwise comparisons, coloured by gene group
boxplot = Boxplot(df=gene_df, x='Conditions', y='Values', title='Hox & MT genes', xlabel='', ylabel='Log FC',
hue='Gene Group') # A column in your dataset that you want to colour by
boxplot.plot()
plt.show()
# Let's limit our tests to only the comparisons between things we're interested in
boxplot = Boxplot(df=gene_df, x='Conditions', y='Values', title='Hox & MT genes', xlabel='', ylabel='Log FC',
hue='Gene Group',
box_pairs=[('cond_1Hox', 'cond_2Hox'),
('cond_1MT', 'cond_2MT'),
])
boxplot.plot()
plt.show()
# Alternatively, instead of explicitly setting the hue through seaborn, we can set the colours manually (but the legend won't show up).
# I personally prefer this and just set the label manually; adding the legend is a to-do.
# Again, we limit our tests to only the comparisons we're interested in
boxplot = Boxplot(df=gene_df, x='Conditions', y='Values', title='Hox & MT genes', xlabel='', ylabel='Log FC',
box_colors=['lightblue', 'darkblue', 'lightgrey', 'darkgrey'],
box_pairs=[('cond_1Hox', 'cond_2Hox'),
('cond_1MT', 'cond_2MT'),
])
boxplot.plot()
plt.show()
p-value annotation legend:
ns: 5.00e-02 < p <= 1.00e+00
*: 1.00e-02 < p <= 5.00e-02
**: 1.00e-03 < p <= 1.00e-02
***: 1.00e-04 < p <= 1.00e-03
****: p <= 1.00e-04
cond_1Hox v.s. cond_1MT: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=1.000e+00 U_stat=2.699e+04
cond_1MT v.s. cond_2Hox: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=5.865e-02 U_stat=3.216e+04
cond_2Hox v.s. cond_2MT: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=1.000e+00 U_stat=2.699e+04
cond_1Hox v.s. cond_2Hox: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=1.301e-01 U_stat=2.606e+03
cond_1MT v.s. cond_2MT: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=7.349e-11 U_stat=4.106e+05
cond_1Hox v.s. cond_2MT: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=6.067e-02 U_stat=3.214e+04

p-value annotation legend:
ns: 5.00e-02 < p <= 1.00e+00
*: 1.00e-02 < p <= 5.00e-02
**: 1.00e-03 < p <= 1.00e-02
***: 1.00e-04 < p <= 1.00e-03
****: p <= 1.00e-04
cond_1Hox v.s. cond_2Hox: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=4.338e-02 U_stat=2.606e+03
cond_1MT v.s. cond_2MT: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=2.450e-11 U_stat=4.106e+05

p-value annotation legend:
ns: 5.00e-02 < p <= 1.00e+00
*: 1.00e-02 < p <= 5.00e-02
**: 1.00e-03 < p <= 1.00e-02
***: 1.00e-04 < p <= 1.00e-03
****: p <= 1.00e-04
cond_1Hox v.s. cond_2Hox: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=4.338e-02 U_stat=2.606e+03
cond_1MT v.s. cond_2MT: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=2.450e-11 U_stat=4.106e+05
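The three outputs above show why limiting `box_pairs` matters. With four boxes there are six possible pairwise comparisons, and the printed p-values are consistent with a Bonferroni correction that multiplies each raw p-value by the number of comparisons (capped at 1). The sketch below checks this arithmetic, assuming the uncorrected Hox p-value is the 2.169e-02 reported earlier when only one comparison was run. The helper `bonferroni` is a hypothetical name.

```python
from itertools import combinations

groups = ["cond_1Hox", "cond_2Hox", "cond_1MT", "cond_2MT"]

# Without box_pairs every pair of boxes is compared.
all_pairs = list(combinations(groups, 2))
# With box_pairs only the listed comparisons are run.
box_pairs = [("cond_1Hox", "cond_2Hox"), ("cond_1MT", "cond_2MT")]


def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value: raw p times the number of tests, capped at 1."""
    return min(1.0, p_value * n_tests)


raw_p = 2.169e-02  # cond_1 v.s. cond_2 for the Hox genes, single comparison
p_all = bonferroni(raw_p, len(all_pairs))  # six comparisons
p_sel = bonferroni(raw_p, len(box_pairs))  # two comparisons
print(len(all_pairs), p_all)
print(len(box_pairs), p_sel)
```

With all six comparisons the Hox result is adjusted to about 1.301e-01 (not significant), whereas with two pre-specified comparisons it is about 4.338e-02, matching the outputs above.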

Saving¶
Saving is the same for all plots and very simple; just make sure the filename ends with the extension of the format you want.
[10]:
# Now we'll do an example where we look at the logFC between two conditions
# of a group of genes and test whether they are significantly different.
df = pd.read_csv('volcano.csv')
df
# Pretend we have two conditions
df['cond_1'] = df['logfc'] + 1
df['cond_2'] = df['logfc'] - 1
boxplot = Boxplot(df, x='external_gene_name', y='logfc')
hox_genes = [g for g in df['external_gene_name'].values if 'HOX' in g]
# conditions: list, filter_column=None, filter_values=None
formatted_df = boxplot.format_data_for_boxplot(
df,
conditions=["cond_1", "cond_2"],
filter_column="external_gene_name",
filter_values=hox_genes
)
print(formatted_df)
# Reinitialise boxplot with the new data
boxplot = Boxplot(formatted_df, "Conditions", "Values",
ylabel='logFC',
title='Gene expression changes',
box_colors=["orchid", "gold"],
add_dots=True,
config={'palette': ['orchid', 'paleturquoise', 'gold'],
'figsize':(3, 3), # Size of figure (x, y)
's': 20,
'title_font_size': 16, # Size of the title (pt)
'label_font_size': 12, # Size of the labels (pt)
'title_font_weight': 700, # 700 = bold, 600 = semibold, 400 = normal
'font_family': 'sans-serif', # 'serif', 'sans-serif', or 'monospace'
'font': ['Tahoma'] # Default: Arial # http://jonathansoma.com/lede/data-studio/matplotlib/list-all-fonts-available-in-matplotlib-plus-samples/
})
boxplot.plot()
plt.savefig('boxplot.svg', bbox_inches='tight') # .png, .pdf, .jpg
plt.savefig('boxplot.png', dpi=300) # .png, .pdf, .jpg
plt.savefig('chart.pdf') # .png, .pdf, .jpg
Samples Values Conditions
0 cond_1 0.0 cond_1
1 cond_2 -2.0 cond_2
2 cond_1 -1.8 cond_1
3 cond_2 -3.8 cond_2
4 cond_1 -6.9 cond_1
.. ... ... ...
125 cond_2 -2.4 cond_2
126 cond_1 7.0 cond_1
127 cond_2 5.0 cond_2
128 cond_1 -7.7 cond_1
129 cond_2 -9.7 cond_2
[130 rows x 3 columns]
p-value annotation legend:
ns: 5.00e-02 < p <= 1.00e+00
*: 1.00e-02 < p <= 5.00e-02
**: 1.00e-03 < p <= 1.00e-02
***: 1.00e-04 < p <= 1.00e-03
****: p <= 1.00e-04
cond_1 v.s. cond_2: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=2.169e-02 U_stat=2.606e+03
