SoDoPE FAQ

Detailed guide to protein solubility prediction

Core Concepts

What does SoDoPE predict?

SoDoPE (Soluble Domain for Protein Expression) is a tool designed to predict the solubility of protein targets in Escherichia coli. It helps identify region-wise solubility, including that of protein domains and 'Better Regions': sub-segments of your protein sequence that are predicted to have a higher probability of soluble expression than the full-length protein.

How can I explore properties of my protein?

Please use the two-way slider below the sequence topology plot to select your region of interest. Domains, if found, are also clickable; clicking one moves the slider to the domain region. All protein properties update automatically based on your selection, which enables granular analysis and comparison of different potential 'Better Regions'.

What data are used?

The training data for SWI are curated by DNASU and can be downloaded here. The data consist of sequences in the pET21 and pET15 vectors, with solubility labelled as 1 (soluble) or 0 (insoluble). The tag for pET15 is MGHHHHHHSH at the N-terminus, and the tag for pET21 is LEHHHHHH at the C-terminus. The testing data are from eSOL and can be downloaded here. The sequences have flanking regions with the amino acids MRGSHHHHHHTDPALRA and GLCGR at the N- and C-termini, respectively (Niwa et al. 2009). Solubility here is the percentage of the supernatant relative to the total uncentrifuged fraction.

How do I cite the tools?

Please visit our general FAQ page using the button below for the citation details.

General FAQ

Technical Parameters

What is the Solubility-Weighted Index (SWI)?

The Solubility-Weighted Index (SWI) is a new metric that we derived from the crystallographic B-factors on soluble (N=8,238) and insoluble (N=3,978) sequences from PSI:Biology experiments using an E. coli T7 lac promoter system. The per-residue weights are 0.84, 0.77, 0.86, 0.91, 0.52, 0.79, 0.99, 0.8, 0.89, 0.68, 0.66, 0.93, 0.63, 0.58, 0.82, 0.74, 0.81, 0.64, 0.61, 0.74 for A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V, respectively. The mean SWI across the protein sequence is converted to a probability using a logistic regression fitted to the PSI:Biology data: $P(\text{Solubility}) = \frac{1}{1+e^{-(AX + B)}}$, where $A = 81.0581$, $B = -62.7775$, and $X$ is the mean SWI across the protein.
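
As an illustration, the SWI calculation can be sketched in a few lines of Python. This is a minimal sketch and not the SoDoPE source code: the weights and the constants A and B are the ones listed in this answer, while the function names `mean_swi` and `solubility_probability` and the decision to reject non-standard residues are assumptions made for the example.

```python
import math

# Per-residue SWI weights, copied from the list in this FAQ answer.
SWI_WEIGHTS = {
    "A": 0.84, "R": 0.77, "N": 0.86, "D": 0.91, "C": 0.52,
    "Q": 0.79, "E": 0.99, "G": 0.80, "H": 0.89, "I": 0.68,
    "L": 0.66, "K": 0.93, "M": 0.63, "F": 0.58, "P": 0.82,
    "S": 0.74, "T": 0.81, "W": 0.64, "Y": 0.61, "V": 0.74,
}

# Logistic regression coefficients, as given in this FAQ answer.
A = 81.0581
B = -62.7775


def mean_swi(sequence):
    """Return the mean SWI weight over a protein sequence (hypothetical helper)."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence is empty")
    unknown = sorted(set(seq) - set(SWI_WEIGHTS))
    if unknown:
        # Assumption: residues outside the 20 standard amino acids are rejected.
        raise ValueError("no SWI weight for: " + ", ".join(unknown))
    return sum(SWI_WEIGHTS[aa] for aa in seq) / len(seq)


def solubility_probability(sequence):
    """Convert the mean SWI (X) to a probability with 1 / (1 + exp(-(A*X + B)))."""
    x = mean_swi(sequence)
    return 1.0 / (1.0 + math.exp(-(A * x + B)))


if __name__ == "__main__":
    # Arbitrary example sequence, not taken from the training or test data.
    print(solubility_probability("MKTAYIAKQR"))
```

Because A is large, small changes in the mean SWI move the probability a lot: the curve crosses 0.5 at a mean SWI of about 0.77 (that is, at -B/A).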

AA Weight vs. Solubility Contribution (SWI)

Where are other protein features, such as isoelectric point and instability index?

In our testing on eSOL (N=3,198), the gold-standard E. coli protein solubility database, many of these features correlate poorly with protein solubility. We do, however, display hydrophobicity, flexibility and solubility profile plots using a sliding window of 9 amino acids, as these may be useful in determining protein features along the sequence, such as central core and surface residues, as well as coiled and structured regions. We also display the average hydrophobicity (GRAVY) and flexibility, in addition to the probability of solubility, for the entire sequence.
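
To show what a sliding-window profile means in practice, here is a minimal Python sketch that averages per-residue values over a window of 9. The function name `sliding_window_mean` is hypothetical, and the handling of the sequence ends is an assumption: this sketch only reports full windows, which may differ from how the plots on the site are drawn.

```python
def sliding_window_mean(values, window=9):
    """Return the mean of every full window of per-residue values.

    Assumption: only complete windows are reported, so a sequence of
    length n yields n - window + 1 points, and a sequence shorter than
    the window yields an empty profile.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        sum(values[start:start + window]) / window
        for start in range(len(values) - window + 1)
    ]


if __name__ == "__main__":
    # Toy per-residue scores; in practice these would be hydrophobicity,
    # flexibility or SWI values looked up for each amino acid.
    scores = [0.84, 0.77, 0.86, 0.91, 0.52, 0.79, 0.99, 0.80, 0.89, 0.68]
    print(sliding_window_mean(scores))
```

With this convention, an input shorter than the window produces no profile points at all.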

eSOL (N=3,198)

What is the mRNA expression score?

If the input is a nucleotide sequence, we also check the mRNA accessibility using our mRNA optimization tool TIsigner (Translation Initiation coding region designer). The score ranges from 0 to 100, and higher scores generally mean better expression. We also display GC content using a sliding window of 19 bp. To help ensure that your gene or selected region within the sequence is easy to synthesize and express, we also offer to optimize it using TIsigner, assuming your host is E. coli with a T7 lac promoter system. Please see the TIsigner FAQ and our paper (DOI:10.1371/journal.pcbi.1009461) for more details.
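
The GC content profile can be illustrated with a short Python sketch. It is only an approximation of what the plot shows: the function name `gc_content_profile` is hypothetical, values are returned as fractions between 0 and 1, and only full 19 bp windows are reported, all of which are assumptions for this example. The mRNA expression score itself is computed by TIsigner and is not reproduced here.

```python
def gc_content_profile(sequence, window=19):
    """Return the GC fraction of every full window along a nucleotide sequence.

    Assumption: only complete windows are reported, so sequences shorter
    than the window give an empty profile.
    """
    seq = sequence.upper()
    profile = []
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        gc_count = chunk.count("G") + chunk.count("C")
        profile.append(gc_count / window)
    return profile


if __name__ == "__main__":
    # Arbitrary example sequence, used only to show the call.
    example = "ATGGCTAGCAAAGGAGAAGAACTTTTCACTGGAGTT"
    print(gc_content_profile(example))
```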

TIsigner FAQ

Troubleshooting & Privacy

Why did I get a 'Job Not Found' error?

To protect user privacy and manage server load, job results are automatically deleted after 7 days. If you need to keep your results longer, please download the optimized sequences in FASTA or CSV format or take a screenshot of the analysis.

What does a 'NaN' value in the results mean?

This usually occurs if the input sequence is too short for the selected window or if the sequence contains non-standard characters. Please ensure your input is a valid nucleotide or protein sequence.
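
If you want to check a sequence before submitting it, the two causes above can be tested with a small Python sketch. This is not the validation that SoDoPE itself performs: the function name `check_sequence`, the alphabets, and the rule that the sequence must be at least as long as the window are assumptions for the example.

```python
# Assumed alphabets: the 20 standard amino acids and the four DNA bases.
PROTEIN_ALPHABET = set("ARNDCQEGHILKMFPSTWYV")
NUCLEOTIDE_ALPHABET = set("ACGT")


def check_sequence(sequence, alphabet, window):
    """Return a list of problems that could lead to missing values."""
    # Remove whitespace and line breaks, then normalise the case.
    seq = "".join(sequence.split()).upper()
    problems = []
    unexpected = sorted(set(seq) - alphabet)
    if unexpected:
        problems.append("non-standard characters: " + ", ".join(unexpected))
    if len(seq) < window:
        problems.append(
            f"length {len(seq)} is shorter than the window of {window}"
        )
    return problems


if __name__ == "__main__":
    # A protein checked against the 9-residue profile window.
    print(check_sequence("MKTAYIAKQRQ", PROTEIN_ALPHABET, 9))
    # A nucleotide sequence checked against the 19 bp GC window.
    print(check_sequence("ATGNNNGCT", NUCLEOTIDE_ALPHABET, 19))
```

An empty list means that neither problem was found.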

Why are there no 'Better Regions' suggested for my sequence?

This can happen for several reasons:

The sequence might be too short to identify distinct soluble domains.
The entire protein might be predicted to have uniformly low solubility, meaning no significantly 'better' sub-segments could be found.

What if I still get an error?

Please contact us and include the job ID, if possible, or a screenshot of the error or results page.

I found a bug 🐛 !

Please contact us or open an issue in our GitHub repo.

Can I use SoDoPE for commercial purposes?

Please see our license page for more details.

License