Summary and Schedule

Introduction to Data Analysis in Python

This is a 4-session introductory Python workshop hosted by the Center for Computational Biomedicine at Harvard Medical School.

Learning Objectives:

Apply fundamentals of Python programming including variables, expressions, loops, and conditional statements.
Create modular Python programs which handle file I/O, string manipulation, and object manipulation.
Manage Conda environments and look up documentation to use built-in and third-party Python packages.
Manipulate data and objects using Python and Pandas.
Visualize data in Python using Matplotlib and Seaborn.

Participants will also be introduced to popular packages for machine learning and bioinformatics analyses in Python, and pointed towards resources to continue their learning.

Week 1 Schedule

Session 1: April 18 from 2:00pm - 4:00pm

Self-Guided Sessions 1

Session 2: April 20 from 2:00pm - 4:00pm

Lists
Loops
Break
Libraries
Loading Data
Wrap-up

Self-Guided Sessions 2

Week 2 Schedule

Session 3: April 25 from 2:00pm - 4:00pm

Self-Guided Sessions 3

Session 4: April 27 from 2:00pm - 4:00pm

Day Schedule

Current status: Days 2 and 3 lessons still need a final pass for polish and small fixes.

We need to add instructions for using conda environments in Jupyter.
The plotting lesson (not pushed yet) has matplotlib but needs seaborn and some more biological examples like heatmaps.
The application example lessons on days 3 and 4 have not been started, though I have existing notebooks I’m planning to base them on.

Day 1

Introduction (20 min)
Variables (25 min)
Break (10 min)
Types (25 min)
Built-in functions (30 min)
Wrap-up (10 min)

At-home Lessons 1

Strings
Objects

Day 2

Lists (45 min)
Loops (25 min)
Break (10 min)
Libraries (20 min)
Loading Data (20 min)

At-home Lessons 2

Conda
Dictionaries

Day 3

Conditionals (35 min)
Dataframes (40 min)
Break (10 min)
Writing Functions (25 min)
Using scipy for statistics (10 min)

At-home Lessons 3

Data Wrangling Practice
Preparing data for plotting

Day 4

Plotting with Seaborn and Matplotlib (60 min) half-done
Break (10 min)
Using scikit-learn for machine learning (20 min) half-done
Using BioMart to get gene annotations (15 min) not started
Wrap-up and next steps (15 min)

Prerequisite

Please follow the steps found here to install the needed software and data for this workshop.

Materials for this workshop have been based on materials from the following:

Learn to Discover Basic Python

Programming with Python

Plotting and Programming with Python

Introduction to Conda for (Data) Scientists

Introduction to data analysis with R and Bioconductor

This lesson is made from The Carpentries Workbench template.

Setup Instructions

Download files required for the lesson

00h 00m

1. Welcome to Python

How can I run Python programs?

00h 20m

2. Variables in Python

How do I run Python code?
What are variables?
How do I set variables in Python?

00h 45m

3. Basic Types

What kinds of data are there in Python?
How are different data types treated differently?
How can I identify a variable’s type?
How can I convert between types?

01h 10m

4. Built-in Functions and Help

How can I use built-in functions?
How can I find out what they do?
What kind of errors can occur in programs?

01h 40m

5. String Manipulation

How can I manipulate text?
How can I create neat and dynamic text output?

02h 05m

6. Using Objects

What is an object?

02h 25m

7. Lists

How can I store many values together?
How do I access items stored in a list?
How are variables assigned to lists different from variables assigned to values?

03h 10m

8. For Loops

How can I make a program do many things?
How can I do something for each thing in a list?

03h 35m

9. Libraries

How can I use software that other people have written?
How can I find out what that software does?

03h 55m

10. Reading tabular data

How can I read tabular data?

04h 15m

11. Managing Python Environments

How do I manage different sets of packages?
How do I install new packages?

04h 55m

12. Dictionaries

How is a dictionary defined in Python?
What are the ways to interact with a dictionary?
Can a dictionary be nested?

05h 20m

13. Conditionals

How can programs do different things for different data?

05h 55m

14. Pandas DataFrames

How can I perform statistical analysis of tabular data?

06h 35m

15. Writing Functions

How can I create my own functions?
How can I use functions to write better programs?

07h 00m

16. Perform Statistical Tests with Scipy

How do I use distributions?
How do I perform statistical tests?

07h 10m

17. Reshaping Data

How can I change the shape of my data?
What is the difference between long and wide data?

07h 35m

18. Combining Data

How can I combine dataframes?
How do I handle missing or incomplete data mappings?

08h 00m

19. Visualizing data with matplotlib and seaborn

How can I plot my data?
How can I save my plot for publishing?

09h 00m

20. Perform machine learning with Scikit-learn

What is Scikit-learn?
How is Scikit-learn used for a typical machine learning workflow?

09h 20m

21. ID mapping using mygene

How can I get gene annotations in Python?

09h 30m

Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.

Python 3

It is sometimes claimed that biologists should use Python 2 because most biology-related libraries in Python were written for that version. This is wrong. Python 2 lingered because Python 3.0, released in December 2008, was not backward compatible: code written in Python 2 could not be run with Python 3 without modification, so libraries took time to migrate. Python 2 reached its official end of life in January 2020 and is now obsolete. Do not use it.
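If you are unsure which version of Python you are running, a quick check like the one below can help. This is a minimal sketch using only the standard library; the helper name `is_python3` is made up for illustration and is not part of the workshop materials.

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the given version tuple belongs to Python 3 or later."""
    # The first element of the version tuple is the major version number.
    return version_info[0] >= 3


# Show the version of the interpreter that is running this code.
print("Running Python", sys.version.split()[0])
print("Python 3 or later:", is_python3())
```

If the second line prints `False`, you are most likely running an old Python 2 interpreter and should install a current Python 3 distribution before the workshop.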

Why Anaconda?

For this course, we recommend the Anaconda distribution of Python, which is developed and maintained by Anaconda, Inc.

Anaconda automatically installs many packages needed for scientific work (over 250). It is easy to install, and it takes care of dependencies between packages. This is particularly important because some of Python’s scientific libraries require Fortran- and C-based libraries, which may be challenging for beginners to install.

Installation

To install the Anaconda distribution of Python, please visit the installation instructions in the Anaconda documentation and follow the steps for your operating system. Make sure that you use the Python 3.x graphical installer for Windows and macOS (there is no graphical installer for Linux). Once it has downloaded, you can install the distribution as you would any other application on your computer.
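After installing, you may want to confirm that the packages used in this workshop can be found. The sketch below is one possible way to do that with the standard library's `importlib.util.find_spec`. The list of import names is an assumption based on the packages mentioned in the schedule, and the helper `missing_packages` is a hypothetical name introduced for this example.

```python
from importlib.util import find_spec

# Import names of packages mentioned in the workshop schedule (assumed list).
# Note that scikit-learn is imported under the name "sklearn".
WORKSHOP_PACKAGES = ["pandas", "matplotlib", "seaborn", "scipy", "sklearn"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(WORKSHOP_PACKAGES)
if missing:
    print("Not found:", ", ".join(missing))
else:
    print("All workshop packages were found.")
```

Any package reported as not found can usually be added from Anaconda Navigator or with `conda install`.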

Anaconda Navigator is a desktop graphical user interface (GUI) included in the Anaconda distribution that allows users to launch applications and manage conda packages, environments, and channels without using the command line. Navigator can search for packages on Anaconda Cloud or in a local Anaconda Repository, install them in an environment, run them, and update them. It is available for Windows, macOS, and Linux.

The following applications are available by default in Navigator:

JupyterLab
Jupyter Notebook
QtConsole
Spyder
Glue
Orange
RStudio
Visual Studio Code

We recommend using JupyterLab for writing and practicing your code.

Starting with JupyterLab

We will go over how to use JupyterLab during our first session, but if you want to get started early, here is an official video by Jupyter covering some uses and basic shortcuts in JupyterLab.