Summary and Setup

Introduction to Data Analysis in Python

This is a 4-session introductory Python workshop hosted by the Center for Computational Biomedicine at Harvard Medical School.

Learning Objectives:

  • Apply fundamentals of Python programming including variables, expressions, loops, and conditional statements.
  • Create modular Python programs that handle file I/O, string manipulation, and object manipulation.
  • Manage Conda environments and look up documentation to use built-in and third-party Python packages.
  • Manipulate data and objects using Python and Pandas.
  • Visualize data in Python using Matplotlib and Seaborn.

Participants will also be introduced to popular packages for machine learning and bioinformatics analyses in Python, and pointed towards resources to continue their learning.

Week 1 Schedule

Session 1: April 18 from 2:00pm - 4:00pm

Self-Guided Sessions 1

Session 2: April 20 from 2:00pm - 4:00pm

Week 2 Schedule

Session 3: April 25 from 2:00pm - 4:00pm

Self-Guided Sessions 3

Session 4: April 27 from 2:00pm - 4:00pm

Prerequisite

  • Please follow the steps found here to install the needed software and data for this workshop.

Materials for this workshop have been based on materials from the following:

Learn to Discover Basic Python

Programming with Python

Plotting and Programming with Python

Introduction to Conda for (Data) Scientists

Introduction to data analysis with R and Bioconductor

This lesson is made from The Carpentries Workbench template.

Python 3


It is sometimes claimed that biologists should use Python 2 because most biology-related Python libraries were written for that version. This is wrong. Python 2 stayed in use for so long because Python 3.0, released in December 2008, was not backward compatible: code written in Python 2 could not be run with Python 3 without modification, and libraries took time to migrate. Python 2 is now obsolete (it reached its official end of life on January 1, 2020). Do not use it.
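To make the incompatibility concrete, here is a minimal sketch of two well-known differences: `print` became a function, and `/` between integers now performs true division. The helper name `compiles_in_this_interpreter` is made up for this illustration and is not part of any library.

```python
import sys


def compiles_in_this_interpreter(source):
    """Return True if `source` is valid syntax for the running interpreter."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


# Python 2 allowed `print` as a statement; Python 3 requires a function call.
python2_style = 'print "hello"'
python3_style = 'print("hello")'

print("Major version:", sys.version_info.major)
print("Python 2 style print is valid here:", compiles_in_this_interpreter(python2_style))
print("Python 3 style print is valid here:", compiles_in_this_interpreter(python3_style))

# In Python 3, `/` is true division and `//` is floor division.
# In Python 2, `7 / 2` between integers gave 3.
print(7 / 2)
print(7 // 2)
```

Run under Python 3, the Python 2 style line is reported as invalid syntax, which is why old scripts need to be modified before they will run.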

Why Anaconda?


For this course, we recommend the Anaconda distribution of Python, which is developed and maintained by Anaconda, Inc.

Anaconda automatically installs many of the packages needed for scientific work (over 250 are installed by default). It is easy to install, and it takes care of dependencies between packages. This is particularly important because some of Python’s scientific libraries depend on Fortran- and C-based libraries, which can be challenging for beginners to install.

Installation


To install the Anaconda distribution of Python, follow the installation instructions for your operating system in the Anaconda documentation. Use the Python 3.x graphical installer for Windows and macOS (there is no graphical installer for Linux). Once the installer has downloaded, install the distribution as you would any other application on your computer.
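Once the installation has finished, a quick check like the one below can confirm that you are running Python 3 and that the packages named in the learning objectives can be found. This is only a sketch: the function names `is_python3` and `missing_packages` are invented for this example, and the package list is an assumption based on the objectives above rather than the workshop's official requirements.

```python
import importlib.util
import sys


def is_python3():
    """Return True if the running interpreter is Python 3."""
    return sys.version_info.major == 3


def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed list, taken from the packages named in the learning objectives.
workshop_packages = ["pandas", "matplotlib", "seaborn"]

print("Python version:", sys.version.split()[0])
print("Running Python 3:", is_python3())
print("Missing packages:", missing_packages(workshop_packages))
```

An empty list of missing packages suggests the distribution is set up as expected. Any names that are listed can be installed later through Anaconda Navigator or conda.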

Anaconda Navigator is a desktop graphical user interface (GUI) included in the Anaconda distribution. It lets users launch applications and manage conda packages, environments, and channels without using the command line. Navigator can search for packages on Anaconda Cloud or in a local Anaconda Repository, install them in an environment, run them, and update them. It is available for Windows, macOS, and Linux.

The following applications are available by default in Navigator:

  • JupyterLab

  • Jupyter Notebook

  • QtConsole

  • Spyder

  • Glue

  • Orange

  • RStudio

  • Visual Studio Code

We recommend using JupyterLab for writing and practicing your code.

Starting with JupyterLab

We will go over how to use JupyterLab during our first session, but if you want to get started early, here is an official video by Jupyter covering some uses and basic shortcuts in JupyterLab.