Boxplot

Parameters:

df: pd.DataFrame,
x: object --> string column name of the violinplot values in the DF for the X
y: object --> string column name of the violinplot values in the DF for the Y
title='' --> string title
xlabel='' --> string x label
ylabel='' --> string y label
hue=None --> column you want to colour by
order=None --> order of your values
box_colors=None, --> a list of colours to plot the boxes by --> only works in older version of matplotlib
showfliers=False,
add_dots=False,
add_stats=True,
stat_method='Mann-Whitney',  # options: t-test_ind, t-test_welch, t-test_paired, Mann-Whitney, Mann-Whitney-gt, Mann-Whitney-ls, Levene, Wilcoxon, Kruskal from: https://www.statsmodels.org/stable/api.html
box_pairs=None --> a list of box pairs i.e. comparisons for the statistics
figsize=(3, 3),
title_font_size=12,
label_font_size=8,
title_font_weight=700,
config={})

Config options = any of the parameters with the same name but with in a dictionary format instead, and also includes default parameters for the visualisation such as the font family and font.

Example config:

config={'palette': ['red', 'yellow', 'pink'],
       'figsize':(4, 5),  # Size of figure (x, y)
        'title_font_size': 16, # Size of the title (pt)
        'label_font_size': 12, # Size of the labels (pt)
        'title_font_weight': 700, # 700 = bold, 600 = normal, 400 = thin
        'font_family': 'sans-serif', # 'serif', 'sans-serif', or 'monospace'
        'font': ['Tahoma'] # Default: Arial  # http://jonathansoma.com/lede/data-studio/matplotlib/list-all-fonts-available-in-matplotlib-plus-samples/
}

Loading data

[1]:
import pandas as pd
from sciviso import Barchart, Boxplot, Heatmap, Histogram, Scatterplot, Violinplot, Volcanoplot, Line
import matplotlib.pyplot as plt

df = pd.read_csv('iris.csv')
df
[1]:
sepal_length sepal_width petal_length petal_width label
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica

150 rows × 5 columns

Basic boxplot

[2]:
boxplot = Boxplot(df, x='label', y='sepal_width')
boxplot.plot()
plt.show()
p-value annotation legend:
ns: 5.00e-02 < p <= 1.00e+00
*: 1.00e-02 < p <= 5.00e-02
**: 1.00e-03 < p <= 1.00e-02
***: 1.00e-04 < p <= 1.00e-03
****: p <= 1.00e-04

Iris-setosa v.s. Iris-versicolor: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=8.950e-13 U_stat=2.306e+03
Iris-versicolor v.s. Iris-virginica: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=1.372e-02 U_stat=8.410e+02
Iris-setosa v.s. Iris-virginica: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=3.543e-08 U_stat=2.074e+03
../_images/examples_Boxplot_4_1.png

Formatting data for boxplot

Data needs to be formamted for the boxplot, for example, if we have a gene list and want to do a boxplot of just a few of them or some groups of genes (e.g. a group of genes we’re interested in comparing between two conditions).

For this we’ll use a different example dataset.

[3]:
df = pd.read_csv('volcano.csv')
df
[3]:
external_gene_name logfc padj
0 MT-TF -2.6 0.02128
1 MT-RNR1 -6.1 0.83880
2 MT-TV -8.6 0.25140
3 MT-RNR2 -0.9 0.29380
4 MT-TL1 1.1 0.58210
... ... ... ...
73620 ARHGEF5 6.5 0.55980
73621 NOBOX 1.5 0.01870
73622 AC004864.1 -8.5 0.05760
73623 MTRF1LP2 -4.8 0.17570
73624 GSDMC 3.5 0.78250

73625 rows × 3 columns

[4]:
# Now we'll do an example where we look at the logFC between two conditions
# of a group of genes and test whether they are significantly different.

# Pretend we have two conditions
df['cond_1'] = df['logfc'] + 1
df['cond_2'] = df['logfc'] - 1

boxplot = Boxplot(df, x='external_gene_name', y='logfc')
hox_genes = [g for g in df['external_gene_name'].values if 'HOX' in g]
# conditions: list, filter_column=None, filter_values=None
formatted_df = boxplot.format_data_for_boxplot(
                   df,
                   conditions=["cond_1", "cond_2"],
                   filter_column="external_gene_name",
                   filter_values=hox_genes
                )
print(formatted_df)

# Reinitialise boxplot with the new data
boxplot = Boxplot(formatted_df, "Conditions", "Values",
                  box_colors=["plum", "gold"],
                  add_dots=True)
boxplot.plot()
    Samples  Values Conditions
0    cond_1     0.0     cond_1
1    cond_2    -2.0     cond_2
2    cond_1    -1.8     cond_1
3    cond_2    -3.8     cond_2
4    cond_1    -6.9     cond_1
..      ...     ...        ...
125  cond_2    -2.4     cond_2
126  cond_1     7.0     cond_1
127  cond_2     5.0     cond_2
128  cond_1    -7.7     cond_1
129  cond_2    -9.7     cond_2

[130 rows x 3 columns]
p-value annotation legend:
ns: 5.00e-02 < p <= 1.00e+00
*: 1.00e-02 < p <= 5.00e-02
**: 1.00e-03 < p <= 1.00e-02
***: 1.00e-04 < p <= 1.00e-03
****: p <= 1.00e-04

cond_1 v.s. cond_2: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=2.169e-02 U_stat=2.606e+03
[4]:
<AxesSubplot:>
../_images/examples_Boxplot_7_2.png

Advanced sytle options

Here are some examples where things like the bin, color and fig size have been changed.

[5]:
# boxplot = Boxplot(df: pd.DataFrame, x: object, y: object, title='', xlabel='', ylabel='', box_colors=None,
#                 hue=None, order=None, hue_order=None, showfliers=False, add_dots=False, add_stats=True,
#                 stat_method='Mann-Whitney', box_pairs=None, figsize=(3, 3), title_font_size=12, label_font_size=8, title_font_weight=700)# Config options = any of the parameters with the same name but with in a dictionary format instead

# Let's continue with the previous example with the formatted data

boxplot = Boxplot(df=formatted_df, x='Conditions', y='Values', title='Hox genes', xlabel='', ylabel='Log FC',
                  box_colors=None, # An ordered list of colours to match the conditions
                  hue=None, # A column in your dataset that you want to colour by
                  order=None, # Order of the box's
                  hue_order=None, # order of the colours
                  showfliers=False, # Show fliers (on the box's)
                  add_dots=False,  # Add dots for each data point
                  add_stats=True, # Add statistics between box's pairwise tests
                  stat_method='Mann-Whitney', # Type of stat
                  box_pairs=None, # Pre-specified comparisons (if you don't want to do all pairs)
                  figsize=(3, 3),
                  title_font_size=12,
                  label_font_size=8,
                  title_font_weight=700, # Config options = any of the parameters with the same name but with in a dictionary format instead
                  # You could also pass these as individual parameters, but it's easier to set as a dictionary
                  # also, then you can re-use it for other charts!
                  config={'figsize':(4, 5),  # Size of figure (x, y)
                       'title_font_size': 16, # Size of the title (pt)
                       'label_font_size': 12, # Size of the labels (pt)
                       'title_font_weight': 700, # 700 = bold, 600 = normal, 400 = thin
                       'font_family': 'sans-serif', # 'serif', 'sans-serif', or 'monospace'
                       'font': ['Tahoma'] # Default: Arial  # http://jonathansoma.com/lede/data-studio/matplotlib/list-all-fonts-available-in-matplotlib-plus-samples/
                  })
boxplot.plot()
plt.show()
p-value annotation legend:
ns: 5.00e-02 < p <= 1.00e+00
*: 1.00e-02 < p <= 5.00e-02
**: 1.00e-03 < p <= 1.00e-02
***: 1.00e-04 < p <= 1.00e-03
****: p <= 1.00e-04

cond_1 v.s. cond_2: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=2.169e-02 U_stat=2.606e+03
../_images/examples_Boxplot_9_1.png

Show multiple comparisons

In this one we have an example where we have two conditions for two groups of genes.

[6]:
# Pretend we have two conditions
df['cond_1'] = df['logfc'] + 1
df['cond_2'] = df['logfc'] - 1

boxplot = Boxplot(df, x='external_gene_name', y='logfc')
mt_genes = [g for g in df['external_gene_name'].values if 'MT' in g]
hox_genes = [g for g in df['external_gene_name'].values if 'HOX' in g]
# conditions: list, filter_column=None, filter_values=None
hox_df = boxplot.format_data_for_boxplot(
                   df,
                   conditions=["cond_1", "cond_2"],
                   filter_column="external_gene_name",
                   filter_values=hox_genes)

# Add another column to the hox df that's the label
hox_df['Gene Group'] = 'Hox'

# Create a df for the MT genes
mt_df = boxplot.format_data_for_boxplot(
                   df,
                   conditions=["cond_1", "cond_2"],
                   filter_column="external_gene_name",
                   filter_values=mt_genes)

# Add another column to the hox df that's the label
mt_df['Gene Group'] = 'MT'

gene_df = pd.concat([hox_df, mt_df])

# Now we set hue
boxplot = Boxplot(df=gene_df, x='Conditions', y='Values', title='Hox & MT genes', xlabel='', ylabel='Log FC',
                  box_colors=None, # An ordered list of colours to match the conditions
                  hue='Gene Group', # A column in your dataset that you want to colour by
                  order=None, # Order of the box's
                  hue_order=None, # order of the colours
                  showfliers=False, # Show fliers (on the box's)
                  add_dots=False,  # Add dots for each data point
                  add_stats=True, # Add statistics between box's pairwise tests
                  stat_method='t-test_ind', # Type of stat
                  box_pairs=None, # Pre-specified comparisons (if you don't want to do all pairs)
                  figsize=(3, 3),
                  title_font_size=12,
                  label_font_size=8,
                  title_font_weight=700) # Config options = any of the parameters with the same name but with in a dictionary format instead)
boxplot.plot()
plt.show()
p-value annotation legend:
ns: 5.00e-02 < p <= 1.00e+00
*: 1.00e-02 < p <= 5.00e-02
**: 1.00e-03 < p <= 1.00e-02
***: 1.00e-04 < p <= 1.00e-03
****: p <= 1.00e-04

cond_1 v.s. cond_2: t-test independent samples with Bonferroni correction, P_val=1.843e-13 stat=7.417e+00
../_images/examples_Boxplot_11_1.png

Another example

[7]:
# Pretend we have two conditions and two groups of genes we want to identify the significance between
df['cond_1'] = df['logfc'] + 1
df['cond_2'] = df['logfc'] - 1

boxplot = Boxplot(df, x='external_gene_name', y='logfc')
mt_genes = [g for g in df['external_gene_name'].values if 'MT' in g]
hox_genes = [g for g in df['external_gene_name'].values if 'HOX' in g]
# conditions: list, filter_column=None, filter_values=None
hox_df = boxplot.format_data_for_boxplot(
                   df,
                   conditions=["cond_1", "cond_2"],
                   filter_column="external_gene_name",
                   filter_values=hox_genes)

# Add another column to the hox df that's the label
hox_df['Gene Group'] = 'Hox'
hox_df['Conditions'] = hox_df['Conditions'].values + hox_df['Gene Group'].values

# Create a df for the MT genes
mt_df = boxplot.format_data_for_boxplot(
                   df,
                   conditions=["cond_1", "cond_2"],
                   filter_column="external_gene_name",
                   filter_values=mt_genes)

# Add another column to the hox df that's the label
mt_df['Gene Group'] = 'MT'
mt_df['Conditions'] = mt_df['Conditions'].values + mt_df['Gene Group'].values

gene_df = pd.concat([hox_df, mt_df])

#
boxplot = Boxplot(df=gene_df, x='Conditions', y='Values', title='Hox & MT genes', xlabel='', ylabel='Log FC',
                  hue='Gene Group') # A column in your dataset that you want to colour by) # Config options = any of the parameters with the same name but with in a dictionary format instead)
boxplot.plot()
plt.show()

# Let's limit our tests to only the comparisons between things we're interested in
boxplot = Boxplot(df=gene_df, x='Conditions', y='Values', title='Hox & MT genes', xlabel='', ylabel='Log FC',
                  hue='Gene Group',
                  box_pairs=[('cond_1Hox', 'cond_2Hox'),
                             ('cond_1MT', 'cond_2MT'),
                            ])
boxplot.plot()
plt.show()

# Alternatively instead of explicity setting the hue through seaborn we can do it manually (but the legend won't come up)
# I personally prefer this and just manually set the label, it's a to do to add it in.
# Let's limit our tests to only the comparisons between things we're interested in
boxplot = Boxplot(df=gene_df, x='Conditions', y='Values', title='Hox & MT genes', xlabel='', ylabel='Log FC',
                  box_colors=['lightblue', 'darkblue', 'lightgrey', 'darkgrey'],
                  box_pairs=[('cond_1Hox', 'cond_2Hox'),
                             ('cond_1MT', 'cond_2MT'),
                            ])
boxplot.plot()
plt.show()
p-value annotation legend:
ns: 5.00e-02 < p <= 1.00e+00
*: 1.00e-02 < p <= 5.00e-02
**: 1.00e-03 < p <= 1.00e-02
***: 1.00e-04 < p <= 1.00e-03
****: p <= 1.00e-04

cond_1Hox v.s. cond_1MT: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=1.000e+00 U_stat=2.699e+04
cond_1MT v.s. cond_2Hox: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=5.865e-02 U_stat=3.216e+04
cond_2Hox v.s. cond_2MT: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=1.000e+00 U_stat=2.699e+04
cond_1Hox v.s. cond_2Hox: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=1.301e-01 U_stat=2.606e+03
cond_1MT v.s. cond_2MT: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=7.349e-11 U_stat=4.106e+05
cond_1Hox v.s. cond_2MT: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=6.067e-02 U_stat=3.214e+04
../_images/examples_Boxplot_13_1.png
p-value annotation legend:
ns: 5.00e-02 < p <= 1.00e+00
*: 1.00e-02 < p <= 5.00e-02
**: 1.00e-03 < p <= 1.00e-02
***: 1.00e-04 < p <= 1.00e-03
****: p <= 1.00e-04

cond_1Hox v.s. cond_2Hox: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=4.338e-02 U_stat=2.606e+03
cond_1MT v.s. cond_2MT: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=2.450e-11 U_stat=4.106e+05
../_images/examples_Boxplot_13_3.png
p-value annotation legend:
ns: 5.00e-02 < p <= 1.00e+00
*: 1.00e-02 < p <= 5.00e-02
**: 1.00e-03 < p <= 1.00e-02
***: 1.00e-04 < p <= 1.00e-03
****: p <= 1.00e-04

cond_1Hox v.s. cond_2Hox: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=4.338e-02 U_stat=2.606e+03
cond_1MT v.s. cond_2MT: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=2.450e-11 U_stat=4.106e+05
../_images/examples_Boxplot_13_5.png

Saving

Saving is the same for all plots and v simple, just make sure you specify what ending you want it to have.

[10]:
# Now we'll do an example where we look at the logFC between two conditions
# of a group of genes and test whether they are significantly different.
df = pd.read_csv('volcano.csv')
df
# Pretend we have two conditions
df['cond_1'] = df['logfc'] + 1
df['cond_2'] = df['logfc'] - 1

boxplot = Boxplot(df, x='external_gene_name', y='logfc')
hox_genes = [g for g in df['external_gene_name'].values if 'HOX' in g]
# conditions: list, filter_column=None, filter_values=None
formatted_df = boxplot.format_data_for_boxplot(
                   df,
                   conditions=["cond_1", "cond_2"],
                   filter_column="external_gene_name",
                   filter_values=hox_genes
                )
print(formatted_df)

# Reinitialise boxplot with the new data
boxplot = Boxplot(formatted_df, "Conditions", "Values",
                  ylabel='logFC',
                  title='Gene expression changes',
                  box_colors=["orchid", "gold"],
                  add_dots=True,
                  config={'palette': ['orchid', 'paleturquoise', 'gold'],
                           'figsize':(3, 3),  # Size of figure (x, y)
                           's': 20,
                           'title_font_size': 16, # Size of the title (pt)
                           'label_font_size': 12, # Size of the labels (pt)
                           'title_font_weight': 700, # 700 = bold, 600 = normal, 400 = thin
                           'font_family': 'sans-serif', # 'serif', 'sans-serif', or 'monospace'
                           'font': ['Tahoma'] # Default: Arial  # http://jonathansoma.com/lede/data-studio/matplotlib/list-all-fonts-available-in-matplotlib-plus-samples/
                           })
boxplot.plot()
plt.savefig('boxplot.svg', bbox_inches='tight') # .png, .pdf, .jpg
plt.savefig('boxplot.png', dpi=300) # .png, .pdf, .jpg
plt.savefig('chart.pdf') # .png, .pdf, .jpg
    Samples  Values Conditions
0    cond_1     0.0     cond_1
1    cond_2    -2.0     cond_2
2    cond_1    -1.8     cond_1
3    cond_2    -3.8     cond_2
4    cond_1    -6.9     cond_1
..      ...     ...        ...
125  cond_2    -2.4     cond_2
126  cond_1     7.0     cond_1
127  cond_2     5.0     cond_2
128  cond_1    -7.7     cond_1
129  cond_2    -9.7     cond_2

[130 rows x 3 columns]
p-value annotation legend:
ns: 5.00e-02 < p <= 1.00e+00
*: 1.00e-02 < p <= 5.00e-02
**: 1.00e-03 < p <= 1.00e-02
***: 1.00e-04 < p <= 1.00e-03
****: p <= 1.00e-04

cond_1 v.s. cond_2: Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction, P_val=2.169e-02 U_stat=2.606e+03
../_images/examples_Boxplot_15_1.png