Summary and Schedule
Introduction to Data Analysis in Python
This is a 4-session introductory Python workshop hosted by the Center for Computational Biomedicine at Harvard Medical School.
Learning Objectives:
- Apply fundamentals of Python programming including variables, expressions, loops, and conditional statements.
- Create modular Python programs which handle file I/O, string manipulation, and object manipulation.
- Manage Conda environments and look up documentation to use built-in and third-party Python packages.
- Manipulate data and objects using Python and Pandas.
- Visualize data in Python using Matplotlib and Seaborn.
Participants will also be introduced to popular packages for machine learning and bioinformatics analyses in Python, and pointed towards resources to continue their learning.
Week 1 Schedule
Session 1: April 18 from 2:00pm - 4:00pm
- Introduction to Python and Jupyter
- Variables
- Break
- Types
- Built-in functions
- Wrap-up
Session 2: April 20 from 2:00pm - 4:00pm
- Lists
- Loops
- Break
- Libraries
- Loading Data
- Wrap-up
Week 2 Schedule
Session 4: April 27 from 2:00pm - 4:00pm
- Plotting with Seaborn and Matplotlib
- Break (10 min)
- Using scikit-learn for machine learning
- Using mygene to get gene annotations
- Wrap-up and next steps
Current status: Days 2 and 3 lessons still need a final pass for polish and small fixes.
- We need to add instructions for using Conda environments in Jupyter.
- The plotting lesson (not pushed yet) covers Matplotlib but still needs Seaborn and more biological examples, such as heatmaps.
- The application example lessons on days 3 and 4 have not been started, though I have existing notebooks I’m planning to base them on.
Day 1
- Introduction (20 min)
- Variables (25 min)
- Break (10 min)
- Types (25 min)
- Built-in functions (30 min)
- Wrap-up (10 min)
Prerequisite
- Please follow the steps found here to install the needed software and data for this workshop.
Materials for this workshop have been based on materials from the following:
Learn to Discover Basic Python
Plotting and Programming with Python
Introduction to Conda for (Data) Scientists
Introduction to data analysis with R and Bioconductor
This lesson is made from The Carpentries Workbench template.
Setup Instructions | Download files required for the lesson | |
Duration: 00h 00m | 1. Welcome to Python | How can I run Python programs? |
Duration: 00h 20m | 2. Variables in Python | How do I run Python code? What are variables? How do I set variables in Python? |
Duration: 00h 45m | 3. Basic Types | What kinds of data are there in Python? How are different data types treated differently? How can I identify a variable’s type? How can I convert between types? |
Duration: 01h 10m | 4. Built-in Functions and Help | How can I use built-in functions? How can I find out what they do? What kind of errors can occur in programs? |
Duration: 01h 40m | 5. String Manipulation | How can I manipulate text? How can I create neat and dynamic text output? |
Duration: 02h 05m | 6. Using Objects | What is an object? |
Duration: 02h 25m | 7. Lists | How can I store many values together? How do I access items stored in a list? How are variables assigned to lists different than variables assigned to values? |
Duration: 03h 10m | 8. For Loops | How can I make a program do many things? How can I do something for each thing in a list? |
Duration: 03h 35m | 9. Libraries | How can I use software that other people have written? How can I find out what that software does? |
Duration: 03h 55m | 10. Reading tabular data | How can I read tabular data? |
Duration: 04h 15m | 11. Managing Python Environments | How do I manage different sets of packages? How do I install new packages? |
Duration: 04h 55m | 12. Dictionaries | How is a dictionary defined in Python? What are the ways to interact with a dictionary? Can a dictionary be nested? |
Duration: 05h 20m | 13. Conditionals | How can programs do different things for different data? |
Duration: 05h 55m | 14. Pandas DataFrames | How can I perform statistical analysis of tabular data? |
Duration: 06h 35m | 15. Writing Functions | How can I create my own functions? How can I use functions to write better programs? |
Duration: 07h 00m | 16. Perform Statistical Tests with Scipy | How do I use distributions? How do I perform statistical tests? |
Duration: 07h 10m | 17. Reshaping Data | How can I change the shape of my data? What is the difference between long and wide data? |
Duration: 07h 35m | 18. Combining Data | How can I combine dataframes? How do I handle missing or incomplete data mappings? |
Duration: 08h 00m | 19. Visualizing data with matplotlib and seaborn | How can I plot my data? How can I save my plot for publishing? |
Duration: 09h 00m | 20. Perform machine learning with Scikit-learn | What is Scikit-learn? How is Scikit-learn used for a typical machine learning workflow? |
Duration: 09h 20m | 21. ID mapping using mygene | How can I get gene annotations in Python? |
Duration: 09h 30m | Finish |
The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.
Python 3
It is sometimes claimed that biologists should use Python 2 because most biology-related Python libraries were written for that version. This is wrong. Python 2 lingered because Python 3.0, released in December 2008, was not backward compatible: code written in Python 2 could not be run with Python 3 without modification. Python 2 reached its end of life on January 1, 2020 and is now obsolete. Do not use it.
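If you are unsure which version an existing installation provides, a short check like the one below can help. It is a minimal sketch using only the standard library; the helper name `is_python3` is our own and not part of any workshop material.

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the major version number is 3 or later.

    `version_info` defaults to the running interpreter, but any
    (major, minor, ...) tuple can be passed in for comparison.
    """
    return version_info[0] >= 3


# Report the version of the interpreter that is running this code.
print("Running Python", sys.version.split()[0])
print("Python 3 or later:", is_python3())
```

If the second line reports `False`, the interpreter is Python 2 and a fresh Python 3 installation is needed before the workshop.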
Why Anaconda?
For the purposes of this course, we recommend the Anaconda distribution of Python, which is developed and maintained by Anaconda, Inc.
Anaconda automatically installs many packages needed for scientific work (over 250 are installed automatically). It is easy to install, and it takes care of dependencies between packages. This is particularly important because some of Python’s scientific libraries depend on Fortran- and C-based libraries, which can be challenging for beginners to install.
Installation
To install the Anaconda distribution of Python, please visit the installation instructions in the Anaconda documentation and follow the steps for your operating system. Make sure that you use the Python 3.x graphical installer for Windows or macOS (there is no graphical installer for Linux). Once it has downloaded, you can install the distribution as you would any other application on your computer.
Anaconda Navigator is a desktop graphical user interface (GUI) included in the Anaconda distribution that lets users launch applications and manage conda packages, environments, and channels without using command-line commands. Navigator can search for packages on Anaconda Cloud or in a local Anaconda Repository, install them in an environment, run them, and update them. It is available for Windows, macOS, and Linux.
The following applications are available by default in Navigator:
JupyterLab
Jupyter Notebook
QtConsole
Spyder
Glue
Orange
RStudio
Visual Studio Code
We recommend using JupyterLab for writing and practicing your code.
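Once the installation has finished, a cell like the following can be run in JupyterLab to confirm that the main packages used in the workshop can be found. This is a rough sketch: the package list is our assumption, based on the learning objectives and schedule, and `missing_packages` is a hypothetical helper name. `mygene` is left out because it may need to be installed separately.

```python
import importlib.util

# Import names for packages mentioned in the schedule (an assumed list;
# scikit-learn is imported under the name "sklearn").
WORKSHOP_PACKAGES = ["pandas", "matplotlib", "seaborn", "scipy", "sklearn"]


def missing_packages(names):
    """Return the top-level package names that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(WORKSHOP_PACKAGES)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All workshop packages were found.")
```

The check only looks each package up without importing it, so it runs quickly and does not fail if one of them is absent.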