Summary and Setup
Introduction to Data Analysis in Python
This is a 4-session introductory Python workshop hosted by the Center for Computational Biomedicine at Harvard Medical School.
Learning Objectives:
- Apply fundamentals of Python programming including variables, expressions, loops, and conditional statements.
- Create modular Python programs which handle file I/O, string manipulation, and object manipulation.
- Manage Conda environments and look up documentation to use built-in and third-party Python packages.
- Manipulate data and objects using Python and Pandas.
- Visualize data in Python using Matplotlib and Seaborn.
Participants will also be introduced to popular packages for machine learning and bioinformatics analyses in Python, and pointed towards resources to continue their learning.
Week 1 Schedule
Session 1: April 18 from 2:00pm - 4:00pm
- Introduction to Python and Jupyter
- Variables
- Break
- Types
- Built-in functions
- Wrap-up
Session 2: April 20 from 2:00pm - 4:00pm
- Lists
- Loops
- Break
- Libraries
- Loading Data
- Wrap-up
Week 2 Schedule
Session 4: April 27 from 2:00pm - 4:00pm
- Plotting with Seaborn and Matplotlib
- Break (10 min)
- Using scikit-learn for machine learning
- Using mygene to get gene annotations
- Wrap-up and next steps
Prerequisite
- Please follow the steps found here to install the needed software and data for this workshop.
Materials for this workshop are based on the following:
Learn to Discover Basic Python
Plotting and Programming with Python
Introduction to Conda for (Data) Scientists
Introduction to data analysis with R and Bioconductor
This lesson is made from The Carpentries Workbench template.
Python 3
It is sometimes claimed that biologists should use Python 2 because most biology-related Python libraries were written for that version. This is wrong. Python 2 lingered because Python 3.0, released in December 2008, introduced changes that were not backward compatible: code written in Python 2 could not be run with Python 3 without modification. Python 2 reached end of life in January 2020 and is now obsolete. Do not use it.
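As a quick check that the interpreter you are running is Python 3, and to see one example of the incompatibility described above, the following minimal sketch may help. The helper names `is_python3` and `compiles_here` are hypothetical and chosen only for this illustration; the Python 2 `print` statement is used as one example of syntax that Python 3 rejects.

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True if the given version tuple is Python 3 or newer."""
    return version_info[0] >= 3


def compiles_here(source):
    """Return True if `source` is valid syntax for the running interpreter."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


print("Running Python", sys.version.split()[0])
print("Python 3 or newer:", is_python3())

# Python 2 style print statement: a syntax error under Python 3.
print("Python 2 print statement accepted:", compiles_here('print "hello"'))
# Python 3 style print function call.
print("Python 3 print function accepted:", compiles_here('print("hello")'))
```

If the second line of output reports `False`, the interpreter being used is Python 2 and the workshop material is unlikely to run as written.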
Why Anaconda?
For this course, we recommend the Anaconda distribution of Python, which is developed and maintained by Anaconda, Inc.
Anaconda automatically installs many packages needed for scientific work (over 250 are installed by default). It is easy to install, and it takes care of dependencies between packages. This is particularly important because some of Python’s scientific libraries depend on Fortran- and C-based libraries, which can be challenging for beginners to install.
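One way to confirm that the packages used in this workshop are available in your environment is sketched below. It only checks whether each package can be located and does not import it. The list of import names is an assumption based on the packages mentioned in this lesson (note that scikit-learn is imported as `sklearn`), and `check_packages` is a hypothetical helper name.

```python
from importlib.util import find_spec

# Import names assumed for the packages named in this workshop.
WORKSHOP_PACKAGES = ["pandas", "matplotlib", "seaborn", "sklearn", "mygene"]


def check_packages(names):
    """Map each top-level import name to True if it can be found, else False."""
    return {name: find_spec(name) is not None for name in names}


for name, available in check_packages(WORKSHOP_PACKAGES).items():
    status = "found" if available else "MISSING"
    print(f"{name:<12} {status}")
```

Any package reported as missing may not be part of the default installation and can be added to your environment with conda or through Anaconda Navigator.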
Installation
To install the Anaconda distribution of Python, please follow the installation instructions in the Anaconda documentation for your operating system. Use the Python 3.x graphical installer on Windows and macOS (there is no graphical installer for Linux). Once the installer has downloaded, install the distribution as you would any other application on your computer.
Anaconda Navigator is a desktop graphical user interface (GUI) included in the Anaconda distribution that allows users to launch applications and manage conda packages, environments, and channels without using command-line commands. Navigator can search for packages on Anaconda Cloud or in a local Anaconda Repository, install them in an environment, run them, and update them. It is available for Windows, macOS, and Linux.
The following applications are available by default in Navigator:
JupyterLab
Jupyter Notebook
QtConsole
Spyder
Glue
Orange
RStudio
Visual Studio Code
We recommend using JupyterLab for writing and practicing your code.