Hi, Howard, very interesting work. I have a few questions. First, what is the innovative part of this project? Second, what is the most challenging part of this project? Have you overcome it? If not, do you have a plan? Third, what is the progress so far? How much of the work has been completed? Thank you.
Dear Professor SHAO,
Thank you for the questions. My responses to the three questions are as follows:
First, the project adopts a deep learning approach to study peptide sequences and their immune responses with respect to specific chemicals (IL-4 and IFN-γ are the target chemicals for this project). The results can therefore assist medical specialists in vaccine design. Although the project scope covers only COVID-19 and the designated immune responses, replacing the testing samples and the training data would allow the model to handle other diseases and even other immune responses. In other words, with little adjustment, the model can be applied to the task in general.
The most challenging part is data cleaning, in which I prepare the training and testing data from a general spreadsheet downloaded from the public database IEDB. To prepare the dataset, I need to split the data into four categories based on the two criteria shown in “Immediate Results” on the poster. The process requires sorting the peptide sequences alphabetically, splitting the data by organism (whether the sequence belongs to COVID-19), and categorizing each sequence according to the result shared by the vast majority of its assay tests. The training and testing data have already been prepared, as stated in “Immediate Results” and shown in the demo code for Data Cleaning provided below.
Apart from data cleaning, I have pre-trained a BERT-based model for the classification task. The demo code provided below shows the general workflow of the model, including encoding, training, testing, and final implementation for one chemical (IL-4). Formal training will be conducted after I return to campus in early February. Furthermore, example code for the scanning method has also been completed. This method scans through a whole virus sequence, as indicated in the poster. This example code can be provided upon request. Thank you.
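For illustration only, the cleaning steps described above could be sketched roughly as follows. This is not the linked demo code. The row layout, the organism labels, the placeholder peptide strings, and the choice to drop ties are all assumptions made for the example. It also assumes a simple majority vote, and that the two criteria are organism and assay outcome, which may differ from the rule and criteria used on the poster.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (peptide, organism, assay outcome).
# The sequences are made-up placeholders, not real IEDB records.
ROWS = [
    ("KLMNPQRST", "SARS-CoV-2", "Positive"),
    ("KLMNPQRST", "SARS-CoV-2", "Positive"),
    ("KLMNPQRST", "SARS-CoV-2", "Negative"),
    ("ACDEFGHIK", "OtherOrganism", "Negative"),
    ("GHIKLMNPQ", "OtherOrganism", "Positive"),
    ("GHIKLMNPQ", "OtherOrganism", "Negative"),
]


def majority_label(outcomes):
    """Return the more frequent outcome, or None on a tie (assumed rule)."""
    counts = Counter(outcomes)
    pos, neg = counts["Positive"], counts["Negative"]
    if pos == neg:
        return None
    return "Positive" if pos > neg else "Negative"


def clean(rows, target_organism="SARS-CoV-2"):
    """Group peptides by (belongs to target organism, majority label)."""
    outcomes = defaultdict(list)
    organism_of = {}
    for peptide, organism, outcome in rows:
        outcomes[peptide].append(outcome)
        organism_of[peptide] = organism

    groups = defaultdict(list)
    # Visit peptides in alphabetical order so each group comes out sorted.
    for peptide in sorted(outcomes):
        label = majority_label(outcomes[peptide])
        if label is None:
            continue  # assumption: tied peptides are left out
        is_target = organism_of[peptide] == target_organism
        groups[(is_target, label)].append(peptide)
    return dict(groups)


groups = clean(ROWS)
for key, peptides in sorted(groups.items()):
    print(key, peptides)
```

Each key is a pair such as `(True, "Positive")`, so at most four groups are produced, one per combination of the two assumed criteria.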
Sincerely,
Howard
Demo code of Data Cleaning (https://drive.google.com/file/d/1tq-3r5feu6DL_4Bw3Jz1r-TRKSxdWzhJ/view?usp=sharing) and BERT model (https://drive.google.com/file/d/1TnPHahy2dwXIC5TqFaSx2ZSzzbc7vN90/view?usp=sharing)
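The scanning method mentioned in the reply is not included in the linked demo code. A minimal sketch of the general idea, sliding a fixed-length window along a sequence and scoring each window, might look like the following. The window length, stride, threshold, and the `toy_score` function are hypothetical stand-ins; in the real project the score would come from the trained classifier.

```python
def sliding_windows(sequence, window=15, stride=1):
    """Yield (start, peptide) for every full-length window in the sequence."""
    for start in range(0, len(sequence) - window + 1, stride):
        yield start, sequence[start:start + window]


def scan(sequence, score_fn, window=15, stride=1, threshold=0.5):
    """Score each window and keep those at or above the threshold."""
    hits = []
    for start, peptide in sliding_windows(sequence, window, stride):
        score = score_fn(peptide)
        if score >= threshold:
            hits.append((start, peptide, score))
    return hits


def toy_score(peptide):
    """Placeholder for a trained model: fraction of 'L' residues."""
    return peptide.count("L") / len(peptide)


# Made-up sequence, used only to exercise the functions.
example_hits = scan("ALLLAAAA", toy_score, window=4, threshold=0.5)
for start, peptide, score in example_hits:
    print(start, peptide, score)
```

Sequences shorter than the window simply yield no windows, so `scan` returns an empty list in that case.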