
Good coding practices for Data Scientists


In this rampant era of “Big Data”, we find ourselves having to keep on top of the “new wave” of information technology. Data Science is often seen as “the” career evolution for a diverse range of professionals, including computer scientists, mathematicians, business analysts, and many more… Many, many more.



The core of Data Science, according to The Data Science Journal, lies at the intersection of Statistics, Business Analysis, Open Data, Data Engineering and Computer Science. Yet many Data Science applications, including official examples on the web showing how to apply it (e.g. for Python machine learning libraries), are quite useful but lack proper data modelling and good programming practices.

I am not blaming practitioners for this. Rather, I want to add my bit towards building a better body of knowledge for everyone who wants to join the Data revolution, from the perspective of a Computer Science background.

So, here is my take on the recurrent coding errors I have found in Data Science, which every data scientist should avoid:



Some good coding practices for Data Scientists

  • Comment your code

One of the main problems of coding is, basically, understanding someone else’s code. It is hugely painful to read through lines of uncommented code. I have tried to replicate some of the most relevant papers in machine learning for text analysis, and spent weeks just getting the examples to run, because the code is simply not properly commented.
When coding, think beyond your own use, and if you want to publish your code, please be kind to your peers, as your code may well be used (and read) by a large community.
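As a minimal sketch of what “properly commented” can look like in Python: the function below is a hypothetical illustration (its name, parameter and behaviour are my own assumptions, not taken from any paper or library). The docstring says what the function does, and the inline comments say why each step is there.

```python
import re


def tokenize_text(text, min_token_length=2):
    """Split raw text into lower-case word tokens.

    Args:
        text: The raw input string, e.g. one sentence of a document.
        min_token_length: Tokens shorter than this are dropped, on the
            assumption that very short tokens carry little meaning.

    Returns:
        A list of lower-case tokens in their original order.
    """
    # Lower-case first so "Data" and "data" count as the same token.
    lowered = text.lower()
    # Keep only runs of letters; digits and punctuation act as separators.
    candidate_tokens = re.findall(r"[a-z]+", lowered)
    # Filter at the end so the length rule lives in one place.
    return [token for token in candidate_tokens if len(token) >= min_token_length]
```

Someone reading this for the first time can tell what goes in, what comes out, and why the steps are ordered the way they are, without running anything.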

  • Find a good name for your variables, please!

One thing that might be even more painful than understanding uncommented code is understanding randomly named variables! Dear statisticians and mathematicians (in general) and computer scientists (some of you too!): xx1, xx2, yy1 and yy2 are simply not good names! I kindly suggest meaningful variable names. The reason, as you might guess, is that your code will be read by many other Data Scientists from different backgrounds (not only maths and stats).

So next time you publish your code, make sure you follow good conventions. Wikipedia offers a good starting point: https://en.wikipedia.org/wiki/Naming_convention_(programming). Furthermore, if you work for a big company, your code is likely to be reviewed and tested, so good names will be a huge help to your peer reviewer.
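To make the difference concrete, here is a small before-and-after sketch in Python. Both functions are hypothetical examples of mine and compute the same thing; only the names change.

```python
# Hard to follow: the names say nothing about what the values mean.
def f(xx1, yy1):
    return sum((a - b) ** 2 for a, b in zip(xx1, yy1)) / len(xx1)


# Easier to follow: the same computation with names that carry the meaning.
def mean_squared_error(observed_values, predicted_values):
    squared_errors = [
        (observed - predicted) ** 2
        for observed, predicted in zip(observed_values, predicted_values)
    ]
    return sum(squared_errors) / len(squared_errors)
```

The second version needs no comment to explain what it returns, which is usually a sign that the names are doing their job.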

  • Avoid “Spaghetti code”

In TensorFlow, scikit-learn and similar libraries, I have found examples (and library implementations too) written in the “spaghetti paradigm”, which, from the computer science perspective, is not an effective way of building reliable, maintainable software. By spaghetti code, I mean code written in one single file, sometimes without even defining or using methods. Methods are an effective way of dividing your code, and they make it much easier to visualize and define your solutions.

Python allows quick implementations because it is a multi-paradigm programming language, which is great for building quick demos and prototypes. However, this does not mean that you (especially if you are a programmer) no longer need to break up your code into separate methods and files.
Remember: divide and conquer! Divide your code into separate files. Create classes when appropriate, and divide the responsibilities between classes and between methods. As a rule of thumb, I try to avoid methods longer than 10 lines and classes longer than 100 lines.
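As a rough sketch of the idea, the script below splits a tiny analysis into short functions with one job each. The function names, the column name and the inline sample data are all assumptions made for illustration; a real script would read from a file.

```python
import csv
import io
import statistics


def read_rows(csv_text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def extract_numeric_column(rows, column_name):
    """Return the values of one column as floats, skipping empty cells."""
    return [float(row[column_name]) for row in rows if row[column_name] != ""]


def summarise(values):
    """Compute a few descriptive statistics for a list of numbers."""
    return {"count": len(values), "mean": statistics.mean(values)}


def main():
    # Inline sample data keeps the sketch self-contained.
    sample_csv = "name,height_cm\nAna,160\nBen,\nCarla,170\n"
    rows = read_rows(sample_csv)
    heights = extract_numeric_column(rows, "height_cm")
    print(summarise(heights))


if __name__ == "__main__":
    main()
```

Each function can now be read, tested and replaced on its own, and `main` reads almost like a description of the analysis.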

  • Keep consistency


One other recurrent issue I’ve found is the lack of consistency throughout the code. Scikit-learn, for example, names its classes in CapWords (such as OneClassSVM) but its dataset loaders in lower case (load_boston). This follows Python’s convention of CapWords for classes and lower case for functions, yet it can look inconsistent if you do not know the rule. Consistency is part of good naming practice, but it goes further and applies to how you define methods too. My advice: follow the naming conventions of your favourite programming language.
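For Python, the usual reference is the PEP 8 style guide. A short sketch of its naming rules, using made-up names of my own, might look like this:

```python
# Constants: upper case with underscores.
DEFAULT_NEIGHBOUR_COUNT = 5


class DistanceClassifier:  # Classes: CapWords.
    def __init__(self, neighbour_count=DEFAULT_NEIGHBOUR_COUNT):
        # Attributes and arguments: lower case with underscores.
        self.neighbour_count = neighbour_count


def load_sample_points():  # Functions: lower case with underscores.
    return [(0.0, 0.0), (1.0, 1.0)]
```

Whichever convention you pick, the point is that a reader can tell what kind of thing a name refers to just from how it is written.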

  • Please, avoid hard-coded values!

I have seen an overwhelming number of projects, especially during my Ph.D., that rely heavily on deliberately hand-picked values. From the programming perspective, it is very bad practice to hard-code variable values (like theta = 0.129371). Use a configuration file instead, or ask the user for the value as input, so you don’t need to edit (or re-compile) the whole program to change it. From the research perspective, it is acceptable to use constants as long as there is a good argument for them, that is, a real reason why that particular value is used.
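One possible way to do this in Python is a small JSON file read at start-up. The sketch below assumes a file called settings.json with a single key, theta; the file name, the helper functions and the thresholding step are all hypothetical and only there to show the pattern.

```python
import json
import tempfile
from pathlib import Path


def load_settings(config_path):
    """Read model settings from a JSON file and return them as a dict."""
    with open(config_path, encoding="utf-8") as config_file:
        return json.load(config_file)


def apply_threshold(scores, theta):
    """Keep only the scores at or above the threshold theta."""
    return [score for score in scores if score >= theta]


# For the sake of a runnable sketch, write a throwaway config file first.
# In a real project this file would live next to the code and be edited by hand.
with tempfile.TemporaryDirectory() as temp_dir:
    config_path = Path(temp_dir) / "settings.json"
    config_path.write_text(json.dumps({"theta": 0.129371}), encoding="utf-8")

    settings = load_settings(config_path)
    kept_scores = apply_threshold([0.05, 0.2, 0.9], settings["theta"])
```

Changing theta is now a one-line edit to the settings file, and the value has a single, visible home where its justification can be documented.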

  • Learn OO programming, or Aspect-oriented programming


Try to take courses on OO programming (Coursera has very good free options). If you are unsure and you work for a big company, there will always be a computer scientist able to help you out. If you do not, I strongly suggest following a YouTube series on how to apply OO programming in your code. If you dislike OO or prefer another programming paradigm, feel free to use it with proper standards and conventions. The key point is to improve the readability, reliability and scalability of your source code.
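To give a flavour of what OO programming buys you in a data setting, here is a minimal sketch of a class that keeps some state (the range seen in the training data) together with the methods that use it. RangeScaler is a hypothetical class written for this post, not part of any library, although its fit and transform methods deliberately echo a style many machine learning libraries use.

```python
class RangeScaler:
    """Rescale numbers to the interval [0, 1] using the range seen in fit()."""

    def __init__(self):
        self.minimum = None
        self.maximum = None

    def fit(self, values):
        """Remember the smallest and largest value of the training data."""
        self.minimum = min(values)
        self.maximum = max(values)
        return self

    def transform(self, values):
        """Rescale values with the range learned by fit()."""
        if self.minimum is None:
            raise ValueError("Call fit() before transform().")
        value_range = self.maximum - self.minimum
        if value_range == 0:
            # All training values were identical, so there is no range to scale by.
            return [0.0 for _ in values]
        return [(value - self.minimum) / value_range for value in values]
```

A call such as `RangeScaler().fit(training_values).transform(new_values)` then applies the training range to new data, without minimum and maximum floating around as loose global variables.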


There are many other coding mistakes that I will leave for another post (on data modelling). If you spot more, see that I have made a mistake, or have any other comment, please comment below and share! We are trying to improve our knowledge of Data Science as a whole.

Cheers!
