
Covid-19 Genome analysis

In [2]:
import pandas
import numpy as np
import sklearn.cluster
import distance
In [3]:
genomes_df = pandas.read_csv("/Users/johncalvo/Downloads/covid_sequences.csv")
genomes_df.head()
Out[3]:
   Virus version                                      Nucleotides sequence
0  >MT198652 |Severe acute respiratory syndrome c...  AGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTC...
1  >MT198653 |Severe acute respiratory syndrome c...  AGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTC...
2  >MT192758 |Severe acute respiratory syndrome c...  CCGCAATCCTGCTAACAATGCTGCAATCGTGCTACAACTTCCTCAA...
3  >MT186679 |Severe acute respiratory syndrome c...  CGCGATCAAAACAACGTCGGCCCCAAGGTTTACCCAATAATACTGC...
4  >LC529905 |Severe acute respiratory syndrome c...  ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGA...
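The `Virus version` values start with `>`, which is how FASTA header lines begin, so the CSV was presumably assembled from FASTA records. A minimal sketch of that step, assuming plain FASTA text as input; the helper `parse_fasta` and the sample records are hypothetical and not part of the original notebook:

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        elif header is not None:
            # Sequence lines are concatenated and upper-cased.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up records, only to illustrate the format.
sample = """>seq1 |made-up record
AGAT
CTGT
>seq2 |made-up record
ccgc
"""
rows = parse_fasta(sample)
```

The resulting pairs could then be loaded with `pandas.DataFrame(rows, columns=["Virus version", "Nucleotides sequence"])` to get a table shaped like the one above.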
In [4]:
words = genomes_df["Nucleotides sequence"]
print(words)
0     AGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTC...
1     AGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTC...
2     CCGCAATCCTGCTAACAATGCTGCAATCGTGCTACAACTTCCTCAA...
3     CGCGATCAAAACAACGTCGGCCCCAAGGTTTACCCAATAATACTGC...
4     ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGA...
5     ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGA...
6     ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGA...
7     TGATAGAGCCATGCCTAACATGCTTAGAATTATGGCCTCACTTGTT...
8     ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGA...
9     CTTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGA...
10    ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGA...
11    TTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGAT...
12    ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGA...
13    CAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTT...
14    ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGA...
15    ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGA...
16    GGTCTCTCTGGTTAGACCAGATCTGAGCCTGGGAGCTCTCTGGCTA...
17    ATATTAGGTTTTTACCTACCCAGGAAAAGCCAACCAACCTCGATCT...
Name: Nucleotides sequence, dtype: object
In [5]:
words = np.asarray(words) #So that indexing with a list will work
# Grid of "i_j" labels, one per pair of sequences (0-based, matching the indices used below)
words_match_index = np.array([[str(i) + '_' + str(j) for i in range(len(words))]
                              for j in range(len(words))])
In [6]:
lev_15_16 = distance.levenshtein(words[15],words[16], normalized=True)
lev_1_15 = distance.levenshtein(words[1],words[15], normalized=True)
lev_15_17 = distance.levenshtein(words[15],words[17], normalized=True)
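To make the metric concrete, here is a small pure-Python sketch of a normalized Levenshtein distance on toy strings. It assumes the normalization divides the edit distance by the length of the longer sequence; that is an assumption about what `normalized=True` does, not something the notebook states. The function names are illustrative.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insert, delete, substitute each cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def normalized_levenshtein(a, b):
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# Two edits (one substitution, one deletion) over a longest length of 8.
d = normalized_levenshtein("AGATCTGT", "AGTTCTG")  # 0.25
similarity = 1 - d                                 # 0.75
```

This two-row version uses memory proportional to the shorter sequence, but the running time still grows with the product of the two lengths, which is why comparing whole genomes is expensive.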
In [7]:
print("Similarity between SARS-CoV-Whu-1 genome and HIV")
print(1 - lev_15_16)
Similarity between SARS-CoV-Whu-1 genome and HIV
0.3011403538106544
In [8]:
print("Similarity between the last reported genome and SARS-CoV-Whu-1")
print(1 - lev_1_15)
Similarity between the last reported genome and SARS-CoV-Whu-1
0.9826773233454837
In [9]:
print("Similarity between SARS-CoV-Whu-1 genome and SARS-CoV original")
print(1 - lev_15_17)
Similarity between SARS-CoV-Whu-1 genome and SARS-CoV original
0.7996187673477577
In [22]:
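# Negated normalized distances: values closer to 0 mean more similar sequences (all 18 x 18 pairs, which can be slow on full genomes)
lev_similarity = -1*np.array([[distance.levenshtein(w1,w2, normalized=True) for w1 in words] for w2 in words])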
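The notebook imports `sklearn.cluster` but stops before using it. One plausible next step, offered here only as an assumption about where the analysis was heading, is to feed a negated distance matrix to affinity propagation as a precomputed similarity. The sketch below uses a made-up 4 x 4 matrix rather than the real genome distances:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Toy negated distance matrix (made-up values): items 0-1 are close to each
# other, and so are items 2-3.
toy_similarity = -1 * np.array([
    [0.00, 0.02, 0.70, 0.72],
    [0.02, 0.00, 0.71, 0.69],
    [0.70, 0.71, 0.00, 0.03],
    [0.72, 0.69, 0.03, 0.00],
])

# affinity="precomputed" tells scikit-learn to treat the input as similarities
# instead of computing them from feature vectors.
model = AffinityPropagation(affinity="precomputed", damping=0.5, random_state=0)
model.fit(toy_similarity)

labels = model.labels_  # one cluster label per row of the matrix
for index, label in enumerate(labels):
    print(index, label)
```

With the real data, `lev_similarity` would take the place of `toy_similarity`, and the labels could be joined back to the `Virus version` column to see which genomes group together.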
