hicluster contact-distance

This step calculates the number of contacts at different genomic distances and the sparsity of the contact matrices at a given resolution for all chromosomes.

Command Docs

usage: hicluster contact-distance [-h] --contact_table CONTACT_TABLE
                                  --chrom_size_path CHROM_SIZE_PATH
                                  --output_prefix OUTPUT_PREFIX
                                  [--resolution RESOLUTION] [--chr1 CHROM1]
                                  [--pos1 POS1] [--chr2 CHROM2] [--pos2 POS2]
                                  [--cpu CPU]

optional arguments:
  -h, --help            show this help message and exit
  --resolution RESOLUTION
                        Resolution of contact length (default: 10000)
  --chr1 CHROM1         0 based index of chr1 column. (default: 1)
  --pos1 POS1           0 based index of pos1 column. (default: 2)
  --chr2 CHROM2         0 based index of chr2 column. (default: 5)
  --pos2 POS2           0 based index of pos2 column. (default: 6)
  --cpu CPU             number of cpus to parallel. (default: 20)

required arguments:
  --contact_table CONTACT_TABLE
                        Contains all the cell contact files after blacklist
                        region removal; information in two tab-separated
                        columns: 1. cell_uid, 2. file_path. No header
                        (default: None)
  --chrom_size_path CHROM_SIZE_PATH
                        Path to UCSC chrom size file. Contains all the
                        chromosome information in two tab-separated columns:
                        1. chromosome name, 2. chromosome length. No header
                        (default: None)
  --output_prefix OUTPUT_PREFIX
                        Output hdf file prefix including the directory
                        (default: None)

Command Example

hicluster contact-distance \
--contact_table contact_table_rmbkl.tsv \
--chrom_size_path /data/aging/ref/m3C/mm10.main.nochrM.nochrY.chrom.sizes \
--output_prefix contact_distance \
--resolution 10000 \
--chr1 1 \
--pos1 2 \
--chr2 5 \
--pos2 6 \
--cpu 20
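
The --chr1, --pos1, --chr2 and --pos2 values are 0-based column indices into each tab-separated contact file. The sketch below illustrates how those indices pick fields out of one contact line and how a contact could be assigned to a distance bin at a given --resolution. The helper name, the example line, and the simple linear binning are assumptions made for illustration; hicluster's own binning scheme may differ.

```python
def contact_distance_bin(line, chr1=1, pos1=2, chr2=5, pos2=6, resolution=10000):
    """Return the distance bin of an intra-chromosomal contact, or None.

    Hypothetical helper, not part of hicluster. Column indices are 0-based
    and default to the same values as the command-line options.
    Linear binning (distance // resolution) is an assumption.
    """
    fields = line.rstrip("\n").split("\t")
    if fields[chr1] != fields[chr2]:
        return None  # contacts between chromosomes have no genomic distance
    distance = abs(int(fields[pos2]) - int(fields[pos1]))
    return distance // resolution


# Made-up contact line (not real data); columns 1, 2, 5 and 6 hold
# chr1, pos1, chr2 and pos2 respectively.
example = "read_1\tchr1\t3000000\t+\tread_1\tchr1\t3125000\t-"
print(contact_distance_bin(example))  # 125000 bp apart -> bin 12
```

If your contact files use a different column layout, only the four index arguments need to change.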

Command Break Down

--contact_table contact_table_rmbkl.tsv

Specify the path to the contact table, which lists the contact file of each cell after blacklist regions have been removed (e.g. /home/qzeng_salk_edu/project/aging/230711_m3C/rmbkl/AMB_220712_18mo_12D_13B_2_P4-1-I15-K1.contact.rmbkl.tsv.gz). Here is an example of what contact_table_rmbkl.tsv looks like:

cell_1 absolute_hic_rmbkl_contact_path_1
cell_2 absolute_hic_rmbkl_contact_path_2
cell_3 absolute_hic_rmbkl_contact_path_3

The first column is the cell name (e.g. AMB_220712_18mo_12D_13B_2_P4-1-I15-K1), and the second column is the path to that cell's Hi-C contact file after blacklist removal. Make sure the two columns are separated by a tab and that the file has no header.
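
A contact table in this format can be generated from a directory of blacklist-filtered contact files. The following is a minimal sketch, assuming every file is named `<cell name>.contact.rmbkl.tsv.gz` as in the example path above; the function name and the directory layout are hypothetical.

```python
import csv
from pathlib import Path


def write_contact_table(contact_dir, table_path, suffix=".contact.rmbkl.tsv.gz"):
    """Write a two-column, tab-separated, header-free contact table.

    Column 1 is the cell name (file name minus `suffix`); column 2 is the
    absolute path of that cell's contact file. Hypothetical helper.
    """
    contact_dir = Path(contact_dir).resolve()
    rows = [
        (path.name[: -len(suffix)], str(path))
        for path in sorted(contact_dir.glob(f"*{suffix}"))
    ]
    with open(table_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)  # no header row is written
    return rows
```

For example, `write_contact_table("rmbkl", "contact_table_rmbkl.tsv")` would write one line per matching file found in a directory named `rmbkl`.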

The output files of this command are contact_distance_decay.hdf5 and contact_distance_chromsparsity.hdf5, which can be read with the pd.read_hdf function. The decay file records the number of contacts at different genomic distances, while the chromsparsity file shows the total number of contacts on each chromosome.
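
Loading the two output files might look like the sketch below. It assumes that each file stores a single pandas object, so that `pd.read_hdf` can be called without a `key`, and that pandas and PyTables are installed. The helper names are hypothetical.

```python
from pathlib import Path


def output_paths(output_prefix):
    """Return the two output paths produced for a given --output_prefix."""
    return {
        "decay": Path(f"{output_prefix}_decay.hdf5"),
        "chromsparsity": Path(f"{output_prefix}_chromsparsity.hdf5"),
    }


def load_outputs(output_prefix):
    """Load both output tables with pandas.

    Assumes each file holds exactly one pandas object (no `key` is passed).
    """
    import pandas as pd  # imported here so output_paths works without pandas

    paths = output_paths(output_prefix)
    return {name: pd.read_hdf(path) for name, path in paths.items()}
```

With the example command above, `load_outputs("contact_distance")` would return a dictionary with the entries `"decay"` and `"chromsparsity"`.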