
My COVID-19 Jupyter notebook

Over the last few weeks I have been working on understanding the COVID-19 genome. As a computer scientist, I worked out a way to compare genome sequences, and this post is the result of that comparison.

It needs much more understanding than my knowledge allows, so I am sharing this unfinished work with anyone interested in the topic who can lend a hand in understanding genomic similarities, a field that is not my expertise.

First of all, let me explain the idea. It originally came from a Kaggle call to arms: we want to understand how the virus evolved to infect humans, and the best way to start is by analysing the genomic sequences.

I started with the genomes published in GenBank and decided on the following strategy:

1. Download the FASTA files found on GenBank of:
        - COVID-19 from Wuhan
        - SARS found in bats
        - COVID-19 reported in Spain
        - COVID-19 reported in Italy
        - COVID-19 reported in USA
        - COVID-19 reported in Brazil
        - HIV (1991)
        - Ebola virus

2. Perform a Levenshtein similarity analysis of the Wuhan COVID-19 genome against all the others.

3. Perform a sequence-to-sequence analysis with Wuhan COVID, HIV, and Ebola.

4. Ideas? I would be HUGELY GRATEFUL if a MICROBIOLOGIST or a GENOMICS expert took these ideas further!
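
As a rough illustration of steps 1 and 2, here is a minimal sketch of how FASTA records could be read and compared with a Levenshtein similarity. It is not necessarily the code used in the notebook. The function names (`read_fasta`, `levenshtein`, `similarity`, `compare_to_reference`), the normalisation by the longer sequence length, and the toy sequences are all assumptions made for illustration; the toy sequences are not real genomes.

```python
from io import StringIO

# Toy FASTA text standing in for files downloaded from GenBank.
# These sequences are made up for illustration only.
TOY_FASTA = """>reference
ACGTACGTAC
>close_variant
ACGTACGAAC
>distant
TTTTGGGGCC
"""


def read_fasta(handle):
    """Parse FASTA text into a dict mapping header -> sequence."""
    records = {}
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:].strip()
            chunks = []
        else:
            # Sequences may span several lines, so collect the pieces.
            chunks.append(line.upper())
    if header is not None:
        records[header] = "".join(chunks)
    return records


def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) using two DP rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalisation: 1 - distance / length of the longer sequence."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def compare_to_reference(records, reference_name):
    """Similarity of every other record against the chosen reference."""
    reference = records[reference_name]
    return {name: similarity(reference, sequence)
            for name, sequence in records.items()
            if name != reference_name}


if __name__ == "__main__":
    records = read_fasta(StringIO(TOY_FASTA))
    for name, score in compare_to_reference(records, "reference").items():
        print(f"{name}: {score:.3f}")
```

This pure-Python version takes time proportional to the product of the two sequence lengths, so on full genomes of roughly 30,000 nucleotides it will be slow. A compiled Levenshtein library or a dedicated alignment tool is likely a better fit for the real data.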

https://calvodatascientist.blogspot.com/2020/07/covid-19-genome-analysis.html

