The Case for Switching From Conda to Virtual Environments in Python

Recently our Applied Data Science and Machine Learning team at Sainsbury’s has undergone a massive transformation in our ways of working. One of the transformations we made was moving away from the Anaconda Python Distribution towards managing our Python environments ourselves with the use of Virtual Environments. In order to discuss why we made these changes, we will first introduce Python Virtual Environments and explain what they do.

Why we came to adopt Virtual Environments

When doing any Data Science, reproducibility is key. We train our algorithms to predict, classify and automate lots of low-level decisions. To ensure that our algorithms are behaving sensibly, we need to be able to understand and explain why they produced any specific result.

We need to minimise the possible ways in which reproducibility can be compromised. One such way is a package update that changes the behaviour of a function and, with it, the output of an algorithm.

Version 0.21.0 of scikit-learn, for instance, may differ from the previous version 0.20.0 in such a way that the same data and parameters produce a different model. Previously we were using Anaconda to manage our package versions and guard against this.

One of Anaconda’s most important features is the ability to replicate environments in which your models run. Anaconda also comes with a Python interpreter and most of the packages that you would ever need to train a machine learning model and do some Data Science straight out of the box.

Recently we have reorganised our team to include Python Engineers, who are integral to the productionisation of our algorithms. After joining the Applied Data Science & Machine Learning team, they often raised the point that Anaconda adds an extra layer of obscurity to our algorithms, and that they preferred using Virtual Environments. Commands in Anaconda are also distinct from Virtual Environment commands. To export the list of packages that you are using in your environment you would run

conda list -e > requirements.txt

in Anaconda, but in a Virtual Environment you would run

pip freeze > requirements.txt

We wanted to align our ways of working with those of our engineers, so we also moved to Python Virtual Environments.
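A pinned requirements.txt only helps reproducibility if the environment that runs a model actually matches it. As a minimal sketch (not part of our original tooling), the Python below compares exact pins of the form name==version against what is installed. The function names `parse_pins` and `find_mismatches` are hypothetical, and lines that are not exact pins are simply skipped.

```python
from importlib import metadata


def parse_pins(lines):
    """Parse 'name==version' lines (pip freeze style) into a dict."""
    pins = {}
    for raw in lines:
        line = raw.strip()
        # Skip blanks, comments, and anything that is not an exact pin.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins, installed_version=metadata.version):
    """Return {name: (pinned, installed or None)} for packages that differ."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = installed_version(name)
        except metadata.PackageNotFoundError:
            # The package is pinned but not installed at all.
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

Calling `find_mismatches(parse_pins(open("requirements.txt")))` inside an activated Virtual Environment would then list every package whose installed version has drifted from its pin; an empty dict means the environment matches the file.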

Advantages of Virtual Environments over Anaconda

Since we made the jump from Anaconda to Virtual Environments we have, as a team, increased our understanding of Python. Our algorithms are more versatile and much easier to productionise, and we do not feel that we have lost anything in moving away from Anaconda.

If you’re doing any Data Science you can just download Anaconda, and the chances are that any packages you need are already installed and ready to be imported. It also includes commonly used utilities like IPython and Jupyter Notebooks. These features make Anaconda a “batteries included” Python install.

However, with all of these batteries comes a lot of bloat. As of the time of writing this article the smallest Anaconda distribution that can be downloaded on the Anaconda website is 530MB. It includes over 1500 packages for Data Science, many of which most of us will likely never use. Installing Anaconda can take well over 10 minutes.

Most Data Science projects use a core set of packages such as Pandas, Scikit-Learn and Matplotlib, as well as Jupyter Notebooks. All of these can easily be pip installed into a Virtual Environment from the command line, as can most packages that are needed only for a specific project.
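To make the steps concrete, here is a minimal sketch that creates a Virtual Environment with the standard library `venv` module and builds the pip command for installing packages into it. The helper names (`env_python`, `create_env`, `pip_install_command`, `install`) and the `.venv` directory used below are illustrative assumptions, not a description of our actual setup.

```python
import subprocess
import sys
import venv
from pathlib import Path


def env_python(env_dir):
    """Return the path of the interpreter inside a Virtual Environment."""
    env_dir = Path(env_dir)
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def create_env(env_dir, with_pip=True):
    """Create a Virtual Environment and return the path to its interpreter."""
    venv.create(env_dir, with_pip=with_pip)
    return env_python(env_dir)


def pip_install_command(python, packages):
    """Build the command that installs packages with the environment's own pip."""
    return [str(python), "-m", "pip", "install", *packages]


def install(python, packages):
    """Run pip inside the environment; raises CalledProcessError on failure."""
    subprocess.run(pip_install_command(python, packages), check=True)
```

For example, `install(create_env(".venv"), ["pandas", "scikit-learn", "matplotlib", "jupyter"])` would set up the core packages mentioned above. Running pip through the environment's own interpreter means the packages land in that environment even if it has not been activated in the shell.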

Anaconda loyalists amongst you might be up in arms, arguing that Anaconda also manages package dependencies for you, or wondering how you’re going to use Jupyter Notebooks without Anaconda. Sure, Anaconda does manage package dependencies well. However, because so many packages are installed in the distribution, Anaconda often does not offer the latest version of a package, since upgrading it would break a dependency of some other package that you are unlikely to be using.

If you need the most recent update to a Python package, you will have to wait until Anaconda is updated to work with it. Alternatively, you could pip install it as you would with a Virtual Environment, but doing this leads to more convoluted package management.

We can also use Jupyter Notebooks with our Virtual Environments in the same way that we would with Anaconda. Installing it is as easy as entering

pip install jupyter

into the command line with your Virtual Environment activated. To use an IPython kernel that is associated with a Virtual Environment you can then run `pip install ipykernel` and register the kernel with Jupyter.
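The exact command we used for registering the kernel is not shown here. One common approach, sketched below as an assumption rather than as our own method, is to call `ipykernel install` with the interpreter of the activated environment. The function names and the kernel name `my-project` are hypothetical.

```python
import subprocess
import sys


def kernel_install_command(name, display_name=None, python=sys.executable):
    """Build the command that registers the given interpreter as a Jupyter kernel."""
    command = [python, "-m", "ipykernel", "install", "--user", "--name", name]
    if display_name:
        # The display name is what appears in the Jupyter kernel menu.
        command += ["--display-name", display_name]
    return command


def register_kernel(name, display_name=None):
    """Register the current interpreter as a kernel (requires ipykernel)."""
    subprocess.run(kernel_install_command(name, display_name), check=True)
```

Run from inside the activated Virtual Environment, `register_kernel("my-project", "Python (my-project)")` is equivalent to entering `python -m ipykernel install --user --name my-project --display-name "Python (my-project)"` on the command line, after which the kernel can be selected from within Jupyter.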

In summary, the switch from Anaconda to Virtual Environments has been a positive one: our algorithms are easier to productionise, our ways of working are aligned with those of our engineers, and we do not feel that we have lost anything in moving away from Anaconda.
