Content from Welcome to Python


Last updated on 2023-04-18

Overview

Questions

  • How can I run Python programs?

Objectives

  • Launch the JupyterLab server.
  • Create a new Python script.
  • Create a Jupyter notebook.
  • Shut down the JupyterLab server.
  • Understand the difference between a Python script and a Jupyter notebook.
  • Create Markdown cells in a notebook.
  • Create and run Python cells in a notebook.

Key Points

  • Python scripts are plain text files.
  • Use the Jupyter Notebook for editing and running Python.
  • The Notebook has Command and Edit modes.
  • Use the keyboard and mouse to select and edit cells.
  • The Notebook will turn Markdown into pretty-printed documentation.
  • Markdown does most of what HTML does.

Many software developers use an integrated development environment (IDE) or a text editor to create and edit their Python programs, which can be executed through the IDE or the command line directly. While this is a common approach, we are going to use the Jupyter Notebook via JupyterLab for the remainder of this workshop.

This has several advantages:

  • You can easily type, edit, and copy and paste blocks of code.
  • Tab completion allows you to easily access the names of things you are using and learn more about them.
  • It allows you to annotate your code with links, different sized text, bullets, etc. to make it more accessible to you and your collaborators.
  • It allows you to display figures next to the code that produces them to tell a complete story of the analysis.

Each notebook contains one or more cells that contain code, text, or images.

Scripts vs. Notebooks vs. Programs

A notebook, as described above, is a combination of text, figures, results, and code in a single document. In contrast, a script is a file containing only code (and comments). A Python script has the extension .py, while a Python notebook has the extension .ipynb. Python notebooks are built on top of Python and are not considered a part of the Python language, but an extension to it. Cells in a notebook can be run interactively, while a script is run all at once, typically by calling it at the command line.

A program is a set of machine instructions which are executed together. We can think about both Python scripts and notebooks as being programs. It should be noted, however, that in computer science the terms programming and scripting are defined more formally, and no Python code would be considered a program under these definitions.

Getting Started with JupyterLab


JupyterLab is an application with a web-based user interface from Project Jupyter that enables one to work with documents and activities such as Jupyter notebooks, text editors, terminals, and even custom components in a flexible, integrated, and extensible manner. JupyterLab requires a reasonably up-to-date browser (ideally a current version of Chrome, Safari, or Firefox); Internet Explorer versions 9 and below are not supported.

JupyterLab is included as part of the Anaconda Python distribution. If you have not already installed the Anaconda Python distribution, see the setup instructions.

Even though JupyterLab is a web-based application, JupyterLab runs locally on your machine and does not require an internet connection.

  • The JupyterLab server sends messages to your web browser.
  • The JupyterLab server does the work and the web browser renders the result.
  • You will type code into the browser and see the result when the web page talks to the JupyterLab server.

JupyterLab? What about Jupyter notebooks?

JupyterLab is the next stage in the evolution of the Jupyter Notebook. If you have prior experience working with Jupyter notebooks, then you will have a good idea of what to expect from JupyterLab.

Experienced users of Jupyter notebooks interested in a more detailed discussion of the similarities and differences between the JupyterLab and Jupyter notebook user interfaces can find more information in the JupyterLab user interface documentation.

Starting JupyterLab


You can start the JupyterLab server through the command line or through an application called Anaconda Navigator. Anaconda Navigator is included as part of the Anaconda Python distribution.

macOS - Command Line

To start the JupyterLab server you will need to access the command line through the Terminal. There are two ways to open Terminal on Mac.

  1. In your Applications folder, open Utilities and double-click on Terminal
  2. Press Command + spacebar to launch Spotlight. Type Terminal and then double-click the search result or hit Enter

After you have launched Terminal, type the command to launch the JupyterLab server.

BASH

$ jupyter lab

Windows Users - Command Line

To start the JupyterLab server you will need to access the Anaconda Prompt.

Press the Windows Logo Key and search for Anaconda Prompt, then click the result or press Enter.

After you have launched the Anaconda Prompt, type the command:

BASH

$ jupyter lab

Anaconda Navigator

To start a JupyterLab server from Anaconda Navigator you must first start Anaconda Navigator. You can search for Anaconda Navigator via Spotlight on macOS (Command + spacebar), via the Windows search function (Windows Logo Key), or by opening a terminal shell and executing the anaconda-navigator executable from the command line.

After you have launched Anaconda Navigator, click the Launch button under JupyterLab. You may need to scroll down to find it.

Here is a screenshot of an Anaconda Navigator page similar to the one that should open on either macOS or Windows.

Anaconda Navigator landing page

And here is a screenshot of a JupyterLab landing page that should be similar to the one that opens in your default web browser after starting the JupyterLab server on either macOS or Windows.

JupyterLab landing page

The JupyterLab Interface


JupyterLab has many features found in traditional integrated development environments (IDEs) but is focused on providing flexible building blocks for interactive, exploratory computing.

The JupyterLab Interface consists of the Menu Bar, a collapsible Left Side Bar, and the Main Work Area which contains tabs of documents and activities.

The Menu Bar at the top of JupyterLab has the top-level menus that expose various actions available in JupyterLab along with their keyboard shortcuts (where applicable). The following menus are included by default.

  • File: Actions related to files and directories such as New, Open, Close, Save, etc. The File menu also includes the Shut Down action used to shut down the JupyterLab server.
  • Edit: Actions related to editing documents and other activities such as Undo, Cut, Copy, Paste, etc.
  • View: Actions that alter the appearance of JupyterLab.
  • Run: Actions for running code in different activities such as notebooks and code consoles (discussed below).
  • Kernel: Actions for managing kernels. Kernels in Jupyter will be explained in more detail below.
  • Tabs: A list of the open documents and activities in the main work area.
  • Settings: Common JupyterLab settings can be configured using this menu. There is also an Advanced Settings Editor option in the dropdown menu that provides more fine-grained control of JupyterLab settings and configuration options.
  • Help: A list of JupyterLab and kernel help links.

Kernels

The JupyterLab docs define kernels as “separate processes started by the server that run your code in different programming languages and environments.” When we open a Jupyter Notebook, that starts a kernel - a process - that is going to run the code. In this lesson, we’ll be using the Jupyter ipython kernel which lets us run Python 3 code interactively.

Using other Jupyter kernels for other programming languages would let us write and execute code in other programming languages in the same JupyterLab interface, like R, Java, Julia, Ruby, JavaScript, Fortran, etc.

A screenshot of the default Menu Bar is provided below.

JupyterLab Menu Bar

The left sidebar contains a number of commonly used tabs, such as a file browser (showing the contents of the directory where the JupyterLab server was launched), a list of running kernels and terminals, the command palette, and a list of open tabs in the main work area. A screenshot of the default Left Side Bar is provided below.

JupyterLab Left Side Bar

The left sidebar can be collapsed or expanded by selecting “Show Left Sidebar” in the View menu or by clicking on the active sidebar tab.

Main Work Area

The main work area in JupyterLab enables you to arrange documents (notebooks, text files, etc.) and other activities (terminals, code consoles, etc.) into panels of tabs that can be resized or subdivided. A screenshot of the default Main Work Area is provided below.

JupyterLab Main Work Area

Drag a tab to the center of a tab panel to move the tab to the panel. Subdivide a tab panel by dragging a tab to the left, right, top, or bottom of the panel. The work area has a single current activity. The tab for the current activity is marked with a colored top border (blue by default).

Creating a Python script


  • To start writing a new Python program click the Text File icon under the Other header in the Launcher tab of the Main Work Area.
    • You can also create a new plain text file by selecting New -> Text File from the File menu in the Menu Bar.
  • To convert this plain text file to a Python program, select the Save File As action from the File menu in the Menu Bar and give your new text file a name that ends with the .py extension.
    • The .py extension lets everyone (including the operating system) know that this text file is a Python program.
    • This is convention, not a requirement.
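
As an illustration, a minimal script (a hypothetical file named hello.py, invented here for the example) could contain nothing more than the following, and would be run from the command line with python hello.py.

PYTHON

```python
# hello.py -- a hypothetical minimal Python script
greeting = 'Hello from a Python script!'
print(greeting)
```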

Creating a Jupyter Notebook


To open a new notebook click the Python 3 icon under the Notebook header in the Launcher tab in the main work area. You can also create a new notebook by selecting New -> Notebook from the File menu in the Menu Bar.

Additional notes on Jupyter notebooks:

  • Notebook files have the extension .ipynb to distinguish them from plain-text Python programs.
  • Notebooks can be exported as Python scripts that can be run from the command line.

Below is a screenshot of a Jupyter notebook running inside JupyterLab. If you are interested in more details, then see the official notebook documentation.

Example Jupyter Notebook

How It’s Stored

  • The notebook file is stored in a format called JSON.
  • Just like a webpage, what’s saved looks different from what you see in your browser.
  • But this format allows Jupyter to mix source code, text, and images, all in one file.
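
Because the file is JSON, it can be read with Python's built-in json module. Below is a sketch using a small hand-written structure; the cells and nbformat fields mirror the real notebook format, but the cell contents are invented for illustration.

PYTHON

```python
import json

# A minimal notebook-like structure, written out by hand for illustration.
notebook_text = '''
{
  "cells": [
    {"cell_type": "markdown", "source": ["# My analysis"]},
    {"cell_type": "code", "source": ["print(1 + 1)"]}
  ],
  "nbformat": 4,
  "nbformat_minor": 5
}
'''

# Parse the JSON text and list each cell's type and contents.
notebook = json.loads(notebook_text)
for cell in notebook['cells']:
    print(cell['cell_type'], '->', ''.join(cell['source']))
```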

Arranging Documents into Panels of Tabs

In the JupyterLab Main Work Area you can arrange documents into panels of tabs. Here is an example from the official documentation.

Multi-panel JupyterLab

First, create a text file, Python console, and terminal window and arrange them into three panels in the main work area. Next, create a notebook, terminal window, and text file and arrange them into three panels in the main work area. Finally, create your own combination of panels and tabs. What combination of panels and tabs do you think will be most useful for your workflow?

After creating the necessary tabs, you can drag one of the tabs to the center of a panel to move the tab to the panel; next you can subdivide a tab panel by dragging a tab to the left, right, top, or bottom of the panel.

Code vs. Text

Jupyter mixes code and text in different types of blocks, called cells. We often use the term “code” to mean “the source code of software written in a language such as Python”. A “code cell” in a Notebook is a cell that contains software; a “text cell” is one that contains ordinary prose written for human beings.

The Notebook has Command and Edit modes.


  • If you press Esc and Return alternately, the outer border of your code cell will change from gray to blue.
  • These are the Command (gray) and Edit (blue) modes of your notebook.
  • Command mode allows you to edit notebook-level features, and Edit mode changes the content of cells.
  • When in Command mode (esc/gray),
    • The b key will make a new cell below the currently selected cell.
    • The a key will make one above.
    • The x key will delete the current cell.
    • The z key will undo your last cell operation (which could be a deletion, creation, etc.).
  • All actions can be done using the menus, but there are lots of keyboard shortcuts to speed things up.

Command Vs. Edit

In the Jupyter notebook page are you currently in Command or Edit mode?
Switch between the modes. Use the shortcuts to generate a new cell. Use the shortcuts to delete a cell. Use the shortcuts to undo the last cell operation you performed.

  • Command mode has a gray border and Edit mode has a blue border. Use Esc and Return to switch between modes.
  • To generate a new cell, make sure you are in Command mode (press Esc if your cell is blue) and type b or a.
  • To delete a cell, make sure you are in Command mode and type x.
  • To undo the last cell operation, make sure you are in Command mode and type z.

Use the keyboard and mouse to select and edit cells.

  • Pressing the Return key turns the border blue and engages Edit mode, which allows you to type within the cell.
  • Because we want to be able to write many lines of code in a single cell, pressing the Return key when in Edit mode (blue) moves the cursor to the next line in the cell just like in a text editor.
  • We need some other way to tell the Notebook we want to run what’s in the cell.
  • Pressing Shift+Return together will execute the contents of the cell.
  • Notice that the Return and Shift keys on the right of the keyboard are right next to each other.

The Notebook will turn Markdown into pretty-printed documentation.

  • Notebooks can also render Markdown.
    • A simple plain-text format for writing lists, links, and other things that might go into a web page.
    • Equivalently, a subset of HTML that looks like what you’d send in an old-fashioned email.
  • Turn the current cell into a Markdown cell by entering Command mode (Esc/gray) and pressing the m key.
  • In [ ]: will disappear to show it is no longer a code cell and you will be able to write in Markdown.
  • Turn the current cell back into a code cell by entering Command mode (Esc/gray) and pressing the y key.

Markdown does most of what HTML does.

*   Use asterisks
*   to create
*   bullet lists.
  • Use asterisks
  • to create
  • bullet lists.
1.  Use numbers
1.  to create
1.  numbered lists.
  1. Use numbers
  2. to create
  3. numbered lists.
*  You can use indents
	*  To create sublists
	*  of the same type
*  Or sublists
	1. Of different
	1. types
  • You can use indents
    • To create sublists
    • of the same type
  • Or sublists
    1. Of different
    2. types
# A Level-1 Heading

A Level-1 Heading

## A Level-2 Heading (etc.)

A Level-2 Heading (etc.)

Line breaks
don't matter.

But blank lines
create new paragraphs.

Line breaks don’t matter.

But blank lines create new paragraphs.

[Create links](http://software-carpentry.org) with `[...](...)`.
Or use [named links][data_carpentry].

[data_carpentry]: http://datacarpentry.org

Create links with [...](...). Or use named links.

Creating Lists in Markdown

Create a nested list in a Markdown cell in a notebook that looks like this:

  1. Get funding.
  2. Do work.
    • Design experiment.
    • Collect data.
    • Analyze.
  3. Write up.
  4. Publish.

This challenge integrates both the numbered list and the bullet list. Note that the bullet list is indented 2 spaces so that it is in line with the items of the numbered list.

1.  Get funding.
1.  Do work.
    *   Design experiment.
    *   Collect data.
    *   Analyze.
1.  Write up.
1.  Publish.

More Math

What is displayed when a Python cell in a notebook that contains several calculations is executed? For example, what happens when this cell is executed?

PYTHON

7 * 3
2 + 1

Python returns the output of the last calculation.

OUTPUT

3

Change an Existing Cell from Code to Markdown

What happens if you write some Python in a code cell and then you switch it to a Markdown cell? For example, put the following in a code cell:

PYTHON

x = 6 * 7 + 12
print(x)

And then run it with Shift+Return to be sure that it works as a code cell. Now go back to the cell and use Esc then m to switch the cell to Markdown and “run” it with Shift+Return. What happened and how might this be useful?

The Python code gets treated like Markdown text. The lines appear as if they are part of one contiguous paragraph. This could be useful to temporarily turn on and off cells in notebooks that get used for multiple purposes.

PYTHON

x = 6 * 7 + 12 print(x)

Equations

Standard Markdown (such as we’re using for these notes) won’t render equations, but the Notebook will. Create a new Markdown cell and enter the following:

$\sum_{i=1}^{N} 2^{-i} \approx 1$

(It’s probably easier to copy and paste.) What does it display? What do you think the underscore, _, circumflex, ^, and dollar sign, $, do?

The notebook shows the equation as it would be rendered from LaTeX equation syntax. The dollar sign, $, is used to tell Markdown that the text in between is a LaTeX equation. If you’re not familiar with LaTeX, underscore, _, is used for subscripts and circumflex, ^, is used for superscripts. A pair of curly braces, { and }, is used to group text together so that the statement i=1 becomes the subscript and N becomes the superscript. Similarly, -i is in curly braces to make the whole statement the superscript for 2. \sum and \approx are LaTeX commands for “sum over” and “approximate” symbols.

Closing JupyterLab


  • From the Menu Bar select the “File” menu and then choose “Shut Down” at the bottom of the dropdown menu. You will be prompted to confirm that you wish to shut down the JupyterLab server (don’t forget to save your work!). Click “Shut Down” to shut down the JupyterLab server.
  • To restart the JupyterLab server you will need to re-run the following command from a shell.
$ jupyter lab

Closing JupyterLab

Practice closing and restarting the JupyterLab server.

Content from Variables in Python


Last updated on 2023-04-18

Overview

Questions

  • How do I run Python code?
  • What are variables?
  • How do I set variables in Python?

Objectives

  • Use the Python console to perform calculations.
  • Assign values to variables in Python.
  • Reuse variables in Python.

Key Points

  • Python is an interpreted programming language, and can be used interactively.
  • Values are assigned to variables in Python using =.
  • You can use print to output variable values.
  • Use meaningful variable names.

Variables store values.


  • Variables are names for values.
  • In Python the = symbol assigns the value on the right to the name on the left.
  • The variable is created when a value is assigned to it.
  • Here, Python assigns an age to a variable age and a name in quotes to a variable first_name.

PYTHON

age = 42
first_name = 'Ahmed'
  • Variable names
    • can only contain letters, digits, and underscore _ (typically used to separate words in long variable names)
    • cannot start with a digit
    • are case sensitive (age, Age and AGE are three different variables)
  • Variable names that start with underscores like __alistairs_real_age have a special meaning so we won’t do that until we understand the convention.
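
The case-sensitivity rule is easy to check directly; the three variables below (invented for illustration) are all distinct.

PYTHON

```python
# age, Age, and AGE are three different variables.
age = 42
Age = 43
AGE = 44
print(age, Age, AGE)  # 42 43 44
```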

Use print to display values.


Callout

A string is the type that stores text in Python. Strings can be thought of as sequences, or strings, of individual characters (using the word string this way actually goes back to a printing term from the pre-computer era). We will be learning more about types and strings in other lessons.

  • Python has a built-in function called print that prints things as text.
  • Call the function (i.e., tell Python to run it) by using its name.
  • Provide values to the function (i.e., the things to print) in parentheses.
  • To add a string to the printout, wrap the string in single or double quotes.
  • The values passed to the function are called arguments.

Single vs. double quotes

In Python, you can use single quotes or double quotes to denote a string, but you need to use the same one for both the beginning and the end.

PYTHON

a = "mouse" # is a string
a = 'mouse' # is a string
a = "mouse'nt" # we can use a single quote inside double quotes without ending the string
a = 'mouse" # This doesn't work: the opening and closing quotes don't match

For example, in Python:

PYTHON

print(first_name, 'is', age, 'years old')

OUTPUT

Ahmed is 42 years old
  • print automatically puts a single space between items to separate them.
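
If a different separator is wanted, print accepts a sep keyword argument; a quick sketch:

PYTHON

```python
print('Ahmed', 'is', 42, 'years old')             # default separator: one space
print('Ahmed', 'is', 42, 'years old', sep='...')  # custom separator between items
```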

When using Jupyter notebooks, we can also simply write a variable name and its value will be displayed:

PYTHON

age

OUTPUT

42

However, this will not work in other programming environments or when running scripts. We will be displaying variables using both methods throughout this workshop.

Variables must be created before they are used.


  • If a variable doesn’t exist yet, or if the name has been mis-spelled, Python reports an error. (Unlike some languages, which “guess” a default value.)

PYTHON

print(last_name)

ERROR

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-c1fbb4e96102> in <module>()
----> 1 print(last_name)

NameError: name 'last_name' is not defined
  • The last line of an error message is usually the most informative.
  • We will look at error messages in detail later.

Variables Persist Between Cells

Be aware that it is the order of execution of cells that is important in a Jupyter notebook, not the order in which they appear. Python will remember all the code that was run previously, including any variables you have defined, irrespective of the order in the notebook. Therefore if you define variables lower down the notebook and then (re)run cells further up, those defined further down will still be present. As an example, create two cells with the following content, in this order:

PYTHON

print(myval)

PYTHON

myval = 1

If you execute this in order, the first cell will give an error. However, if you run the first cell after the second cell it will print out 1. To prevent confusion, it can be helpful to use the Kernel -> Restart & Run All option which clears the interpreter and runs everything from a clean slate going top to bottom.

Variables can be used in calculations.


  • We can use variables in calculations just as if they were values.
    • Remember, we assigned the value 42 to age a few lines ago.

PYTHON

age = age + 3
print('Age in three years:', age)

OUTPUT

Age in three years: 45

Python is case-sensitive.


  • Python thinks that upper- and lower-case letters are different, so Name and name are different variables.
  • There are conventions for using upper-case letters at the start of variable names so we will use lower-case letters for now.

Use meaningful variable names.


  • Python doesn’t care what you call variables as long as they obey the rules (alphanumeric characters and the underscore).

PYTHON

flabadab = 42
ewr_422_yY = 'Ahmed'
print(ewr_422_yY, 'is', flabadab, 'years old')
  • Use meaningful variable names to help other people understand what the program does.
  • The most important “other person” is your future self.

Swapping Values

Fill the table showing the values of the variables in this program after each statement is executed.

PYTHON

# Command  # Value of x   # Value of y   # Value of swap #
x = 1.0    #              #              #               #
y = 3.0    #              #              #               #
swap = x   #              #              #               #
x = y      #              #              #               #
y = swap   #              #              #               #

Try using this Python visualization tool to visualize what happens in the code.

OUTPUT

# Command  # Value of x   # Value of y   # Value of swap #
x = 1.0    # 1.0          # not defined  # not defined   #
y = 3.0    # 1.0          # 3.0          # not defined   #
swap = x   # 1.0          # 3.0          # 1.0           #
x = y      # 3.0          # 3.0          # 1.0           #
y = swap   # 3.0          # 1.0          # 1.0           #

These three lines exchange the values in x and y using the swap variable for temporary storage. This is a fairly common programming idiom.
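
Python also offers a shorthand for this idiom: simultaneous assignment evaluates the right-hand side first and then assigns both names, so the values can be exchanged without a temporary variable.

PYTHON

```python
x = 1.0
y = 3.0
# The right-hand side (y, x) is evaluated before either name is reassigned,
# so no temporary swap variable is needed.
x, y = y, x
print(x, y)  # 3.0 1.0
```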

Predicting Values

What is the final value of position in the program below? (Try to predict the value without running the program, then check your prediction.)

PYTHON

initial = 'left'
position = initial
initial = 'right'

PYTHON

print(position)

OUTPUT

left

The initial variable is assigned the value 'left'. In the second line, the position variable also receives the string value 'left'. In the third line, the initial variable is given the value 'right', but the position variable retains its string value of 'left'.

Choosing a Name

Which is a better variable name, m, min, or minutes? Why?

Hint: think about which code you would rather inherit from someone who is leaving the lab:

  1. ts = m * 60 + s
  2. tot_sec = min * 60 + sec
  3. total_seconds = minutes * 60 + seconds

minutes is better because min might mean something like “minimum” (and actually is an existing built-in function in Python that we will cover later).
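
The clash with the built-in is easy to see: min is a real Python function, and assigning to that name shadows it for the rest of the program.

PYTHON

```python
# min is a built-in function that returns the smallest of its arguments.
print(min(3, 1, 2))  # 1

# Assigning to the name min shadows the built-in function,
# which is one reason it makes a poor variable name.
min = 10
# Calling min(3, 1, 2) now would raise TypeError: 'int' object is not callable
```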

Content from Basic Types


Last updated on 2023-04-18

Overview

Questions

  • What kinds of data are there in Python?
  • How are different data types treated differently?
  • How can I identify a variable’s type?
  • How can I convert between types?

Objectives

  • Explain key differences between integers and floating point numbers.
  • Explain key differences between numbers and character strings.
  • Use built-in functions to convert between integers, floating point numbers, and strings.
  • Subset a string using slicing.

Key Points

  • Every value has a type.
  • Use the built-in function type to find the type of a value.
  • Types control what operations can be done on values.
  • Strings can be added and multiplied.
  • Strings have a length (but numbers don’t).

Every value has a type.


  • Every value in a program has a specific type.
  • Integer (int): represents positive or negative whole numbers like 3 or -512.
  • Floating point number (float): represents real numbers like 3.14159 or -2.5.
  • Character string (usually called “string”, str): text.
    • Written in either single quotes or double quotes (as long as they match).
    • The quote marks aren’t printed when the string is displayed.

Use the built-in function type to find the type of a value.


  • Use the built-in function type to find out what type a value has.
  • Works on variables as well.
    • But remember: the value has the type — the variable is just a label.

PYTHON

print(type(52))

OUTPUT

<class 'int'>

PYTHON

fitness = 'average'
print(type(fitness))

OUTPUT

<class 'str'>

Types control what operations (or methods) can be performed on a given value.


  • A value’s type determines what the program can do to it.

PYTHON

print(5 - 3)

OUTPUT

2

PYTHON

print('hello' - 'h')

OUTPUT

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-67f5626a1e07> in <module>()
----> 1 print('hello' - 'h')

TypeError: unsupported operand type(s) for -: 'str' and 'str'

Simple and compound types.


We can broadly split different types into two categories in Python: simple types and compound types.

Simple types consist of a single value. In Python, these simple types include:

  • int
  • float
  • bool
  • NoneType

Compound types contain multiple values. Compound types include:

  • str
  • list
  • dictionary
  • tuple
  • set

In this lesson we will be learning about simple types and strings (str), which are made up of multiple characters. We will go into more detail on other compound types in future lessons.
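
The categories above can be checked directly with the built-in type function; a quick sketch:

PYTHON

```python
# Simple types hold a single value.
print(type(3))         # <class 'int'>
print(type(3.14))      # <class 'float'>
print(type(True))      # <class 'bool'>
print(type(None))      # <class 'NoneType'>

# Compound types hold multiple values.
print(type('text'))          # <class 'str'>
print(type([1, 2, 3]))       # <class 'list'>
print(type({'key': 'val'}))  # <class 'dict'>
print(type((1, 2)))          # <class 'tuple'>
print(type({1, 2}))          # <class 'set'>
```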

You can use the “+” and “*” operators on strings.


  • “Adding” character strings concatenates them.

PYTHON

full_name = 'Ahmed' + ' ' + 'Walsh'
print(full_name)

OUTPUT

Ahmed Walsh

The empty string

We can initialize a string to contain no letters: empty = "". This is called the empty string, and it is often used when we want to build up a string character by character.
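
A minimal sketch of that pattern, building a short string one character at a time:

PYTHON

```python
# Start from the empty string and append one character per loop iteration.
result = ''
for char in 'abc':
    result = result + char
print(result)  # abc
```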

  • Multiplying a character string by an integer N creates a new string that consists of that character string repeated N times.
    • Since multiplication is repeated addition.

PYTHON

separator = '=' * 10
print(separator)

OUTPUT

==========

Strings have a length (but numbers don’t).


  • The built-in function len counts the number of characters in a string.

PYTHON

print(len(full_name))

OUTPUT

11
  • But numbers don’t have a length (not even zero).

PYTHON

print(len(52))

OUTPUT

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-f769e8e8097d> in <module>()
----> 1 print(len(52))

TypeError: object of type 'int' has no len()

You must convert numbers to strings or vice versa when operating on them.


  • Cannot add numbers and strings.

PYTHON

print(1 + '2')

ERROR

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-fe4f54a023c6> in <module>()
----> 1 print(1 + '2')

TypeError: unsupported operand type(s) for +: 'int' and 'str'
  • The result here would be 3 if both values were considered ints, and '12' if both values were considered strings. Due to this ambiguity addition is not allowed between strings and integers.
  • Some types can be converted to other types by using the type name as a function.

PYTHON

print(1 + int('2'))
print(str(1) + '2')

OUTPUT

3
12

You can mix integers and floats freely in operations.


  • Integers and floating-point numbers can be mixed in arithmetic.
    • Python 3 automatically converts integers to floats as needed.

PYTHON

print('half is', 1 / 2.0)
print('three squared is', 3.0 ** 2)

OUTPUT

half is 0.5
three squared is 9.0

Variables only change value when something is assigned to them.


  • If we make one cell in a spreadsheet depend on another, and update the latter, the former updates automatically.
  • This does not happen in programming languages.

PYTHON

variable_one = 1
variable_two = 5 * variable_one
variable_one = 2
print('first is', variable_one, 'and second is', variable_two)

OUTPUT

first is 2 and second is 5
  • The computer reads the value of variable_one when doing the multiplication, creates a new value, and assigns it to variable_two.
  • Afterwards, the value of variable_two is set to the new value and not dependent on variable_one so its value does not automatically change when variable_one changes.

Fractions

What type of value is 3.4? How can you find out?

It is a floating-point number (often abbreviated “float”). It is possible to find out by using the built-in function type().

PYTHON

print(type(3.4))

OUTPUT

<class 'float'>

Automatic Type Conversion

What type of value is 3.25 + 4?

It is a float: integers are automatically converted to floats as necessary.

PYTHON

result = 3.25 + 4
print(result, 'is', type(result))

OUTPUT

7.25 is <class 'float'>

Choose a Type

What type of value (integer, floating point number, or character string) would you use to represent each of the following? Try to come up with more than one good answer for each problem. For example, in # 1, when would counting days with a floating point variable make more sense than using an integer?

  1. Number of days since the start of the year.
  2. Time elapsed from the start of the year until now in days.
  3. Serial number of a piece of lab equipment.
  4. A lab specimen’s age
  5. Current population of a city.
  6. Average population of a city over time.

The answers to the questions are:

  1. Integer, since the number of days would lie between 1 and 365.
  2. Floating point, since fractional days are required.
  3. Character string if the serial number contains letters and numbers, otherwise integer if the serial number consists only of numerals.
  4. This will vary! How do you define a specimen’s age? Whole days since collection (integer)? Date and time (string)?
  5. Choose floating point to represent population as large aggregates (e.g., millions), or integer to represent population in units of individuals.
  6. Floating point number, since an average is likely to have a fractional part.

Division Types

In Python 3, the // operator performs integer (whole-number) floor division, the / operator performs floating-point division, and the % (or modulo) operator calculates and returns the remainder from integer division:

PYTHON

print('5 // 3:', 5 // 3)
print('5 / 3:', 5 / 3)
print('5 % 3:', 5 % 3)

OUTPUT

5 // 3: 1
5 / 3: 1.6666666666666667
5 % 3: 2

Imagine that you are buying cages for mice. num_mice is the number of mice you need cages for, and num_per_cage is the maximum number of mice which can live in a single cage. Write an expression that calculates the exact number of cages you need to purchase.

PYTHON

num_mice = 56
num_per_cage = 3

We want the number of cages to house all of our mice, which is the rounded up result of num_mice / num_per_cage. This is equivalent to performing a floor division with // and adding 1. Before the division we need to subtract 1 from the number of mice to deal with the case where num_mice is evenly divisible by num_per_cage.

PYTHON

num_cages = ((num_mice - 1) // num_per_cage) + 1

print(num_mice, 'mice,', num_per_cage, 'per cage:', num_cages, 'cages')

OUTPUT

56 mice, 3 per cage: 19 cages
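
An alternative sketch uses math.ceil from the standard library to round the division up directly, which avoids the subtract-one trick:

```python
import math

num_mice = 56
num_per_cage = 3

# Ceiling division: any fractional cage is rounded up to a whole cage.
num_cages = math.ceil(num_mice / num_per_cage)

print(num_mice, 'mice,', num_per_cage, 'per cage:', num_cages, 'cages')  # 19 cages
```

Both approaches agree when num_mice divides evenly: for 6 mice, math.ceil(6 / 3) and ((6 - 1) // 3) + 1 both give 2 cages.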

Strings to Numbers


Where reasonable, float() will convert a string to a floating point number, and int() will convert a floating point number to an integer:

PYTHON

print("string to float:", float("3.6"))
print("float to int:", int(3.6))

OUTPUT

string to float: 3.6
float to int: 3

Note that converting a float to an int does not round the result, but instead truncates it by removing everything past the decimal point.
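
Truncation happens toward zero, which matters for negative numbers; int() and round() can therefore disagree:

```python
print(int(3.6))     # truncates to 3
print(int(-3.6))    # truncates toward zero, giving -3
print(round(-3.6))  # rounds to the nearest integer, giving -4
```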

If the conversion doesn’t make sense, however, an error message will occur.

PYTHON

print("string to float:", float("Hello world!"))

ERROR

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-df3b790bf0a2> in <module>
----> 1 print("string to float:", float("Hello world!"))
 ValueError: could not convert string to float: 'Hello world!'

Challenge

What do you expect the following program to do?

What does it actually do?

Why do you think it does that?

PYTHON

print("fractional string to int:", int("3.6"))

What do you expect this program to do? It would not be so unreasonable to expect the Python 3 int function to convert the string “3.6” to 3.6 and then, with an additional type conversion, to 3. After all, Python 3 performs a lot of other magic - isn’t that part of its charm?

PYTHON

int("3.6")

ERROR

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-ec6729dfccdc> in <module>
----> 1 int("3.6")
ValueError: invalid literal for int() with base 10: '3.6'

However, Python 3 throws an error. Why? To be consistent, possibly. If you want Python to perform two consecutive typecasts, you must write them out explicitly in code.

PYTHON

int(float("3.6"))

OUTPUT

3

Challenge

Arithmetic with Different Types

Which of the following will return the floating point number 2.0? Note: there may be more than one right answer.

PYTHON

first = 1.0
second = "1"
third = "1.1"
  1. first + float(second)
  2. float(second) + float(third)
  3. first + int(third)
  4. first + int(float(third))
  5. int(first) + int(float(third))
  6. 2.0 * second

Answer: 1 and 4

Use an index to get a single character from a string.


Python uses 0-based indexing.
  • The characters (individual letters, numbers, and so on) in a string are ordered. For example, the string 'AB' is not the same as 'BA'. Because of this ordering, we can treat the string as a list of characters.
  • Each position in the string (first, second, etc.) is given a number. This number is called an index or sometimes a subscript.
  • Indices are numbered from 0.
  • Use the position’s index in square brackets to get the character at that position.

PYTHON

atom_name = 'sodium'
print(atom_name[0])

OUTPUT

s

Use a slice to get a substring.


  • A part of a string is called a substring. A substring can be as short as a single character.
  • An item in a list is called an element. Whenever we treat a string as if it were a list, the string’s elements are its individual characters.
  • Slicing gets a part of a string (or, more generally, a part of any list-like thing).
  • Slicing uses the notation [start:stop], where start is the integer index of the first element we want and stop is the integer index of the element just after the last element we want.
  • The difference between stop and start is the slice’s length.
  • Taking a slice does not change the contents of the original string. Instead, taking a slice returns a copy of part of the original string.

PYTHON

atom_name = 'sodium'
print(atom_name[0:3])

OUTPUT

sod

Use the built-in function len to find the length of a string.


PYTHON

print(len('sodium'))

OUTPUT

6
  • Nested functions are evaluated from the inside out, like in mathematics.
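
For example, in the call below Python first evaluates len('sodium'), then str(6), and finally len('6'):

```python
# Innermost first: len('sodium') -> 6, then str(6) -> '6', then len('6') -> 1
print(len(str(len('sodium'))))  # prints 1
```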

Indexing

If you assign a = 123, what happens if you try to get the second digit of a via a[1]?

Numbers are not strings or sequences and Python will raise an error if you try to perform an index operation on a number. In the next lesson on types and type conversion we will learn more about types and how to convert between different types. If you want the Nth digit of a number you can convert it into a string using the str built-in function and then perform an index operation on that string.

PYTHON

a = 123
print(a[1])

OUTPUT

TypeError: 'int' object is not subscriptable

PYTHON

a = str(123)
print(a[1])

OUTPUT

2

Slicing practice

What does the following program print?

PYTHON

atom_name = 'carbon'
print('atom_name[1:3] is:', atom_name[1:3])

OUTPUT

atom_name[1:3] is: ar

Slicing concepts

Given the following string:

PYTHON

species_name = "Acacia buxifolia"

What would these expressions return?

  1. species_name[2:8]
  2. species_name[11:] (without a value after the colon)
  3. species_name[:4] (without a value before the colon)
  4. species_name[:] (just a colon)
  5. species_name[11:-3]
  6. species_name[-5:-3]
  7. What happens when you choose a stop value which is out of range? (i.e., try species_name[0:20] or species_name[:103])
  1. species_name[2:8] returns the substring 'acia b'
  2. species_name[11:] returns the substring 'folia', from position 11 until the end
  3. species_name[:4] returns the substring 'Acac', from the start up to but not including position 4
  4. species_name[:] returns the entire string 'Acacia buxifolia'
  5. species_name[11:-3] returns the substring 'fo', from the 11th position to the third last position
  6. species_name[-5:-3] also returns the substring 'fo', from the fifth last position to the third last
  7. If a part of the slice is out of range, the operation does not fail. species_name[0:20] gives the same result as species_name[0:], and species_name[:103] gives the same result as species_name[:]
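
You can check these answers directly; a quick verification of a few of them:

```python
species_name = "Acacia buxifolia"

print(species_name[2:8])    # 'acia b'
print(species_name[11:-3])  # 'fo'

# Out-of-range stop values are clipped to the end of the string.
print(species_name[0:20] == species_name)  # True
```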

Content from Built-in Functions and Help


Last updated on 2023-04-14 | Edit this page

Overview

Questions

  • How can I use built-in functions?
  • How can I find out what they do?
  • What kind of errors can occur in programs?

Objectives

  • Explain the purpose of functions.
  • Correctly call built-in Python functions.
  • Correctly nest calls to built-in functions.
  • Use help to display documentation for built-in functions.
  • Correctly describe situations in which SyntaxError and NameError occur.

Key Points

  • Use comments to add documentation to programs.
  • A function may take zero or more arguments.
  • Commonly-used built-in functions include max, min, and round.
  • Functions may only work for certain (combinations of) arguments.
  • Functions may have default values for some arguments.
  • Use the built-in function help to get help for a function.
  • Python reports a syntax error when it can’t understand the source of a program.
  • Python reports a runtime error when something goes wrong while a program is executing.

Use comments to add documentation to programs.


PYTHON

# This sentence isn't executed by Python.
adjustment = 0.5   # Neither is this - anything after '#' is ignored.

Comments are important for the readability of your code. They are where you can explain what your code is doing, decisions you made, and things to watch out for when using the code.

Well-commented code will be easier to use by others, and by you if you are returning to code you wrote months or years ago.

A function may take zero or more arguments.


  • We have seen some functions already — now let’s take a closer look.
  • An argument is a value passed into a function.
  • len takes exactly one.
  • int, str, and float create a new value from an existing one.
  • print takes zero or more.
  • print with no arguments prints a blank line.
    • Must always use parentheses, even if they’re empty, so that Python knows a function is being called.

PYTHON

print('before')
print()
print('after')

OUTPUT

before

after

Every function returns something.


  • Every function call produces some result.
  • If the function doesn’t have a useful result to return, it usually returns the special value None. None is a Python object that stands in anytime there is no value.

PYTHON

result = print('example')
print('result of print is', result)

Note that even though we set the result of print equal to a variable, printing still occurs.

OUTPUT

example
result of print is None

Commonly-used built-in functions include max, min, and round.


  • Use max to find the largest value of one or more values.
  • Use min to find the smallest.
  • Both work on character strings as well as numbers.
    • “Larger” and “smaller” use (0-9, A-Z, a-z) to compare letters.

PYTHON

print(max(1, 2, 3))
print(min('a', 'A', '0'))

OUTPUT

3
0
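
This ordering comes from each character’s underlying code point, which the built-in ord function exposes; digits sort before uppercase letters, which sort before lowercase letters:

```python
# Code points explain why min('a', 'A', '0') is '0'
print(ord('0'), ord('A'), ord('a'))  # 48 65 97
print('0' < 'A' < 'a')               # True
```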

Functions may only work for certain (combinations of) arguments.


  • max and min must be given at least one argument.
    • “Largest of the empty set” is a meaningless question.
  • And they must be given things that can meaningfully be compared.

PYTHON

print(max(1, 'a'))

ERROR

TypeError                                 Traceback (most recent call last)
<ipython-input-52-3f049acf3762> in <module>
----> 1 print(max(1, 'a'))

TypeError: '>' not supported between instances of 'str' and 'int'

Functions may have default values for some arguments.


  • round will round off a floating-point number.
  • By default, rounds to zero decimal places.

PYTHON

print(round(3.712))

OUTPUT

4
  • We can specify the number of decimal places we want.

PYTHON

print(round(3.712, 1))

OUTPUT

3.7

Functions attached to objects are called methods


  • Functions take another form that will be common in the pandas episodes.
  • Methods have parentheses like functions, but come after the variable.
  • Some methods are used for internal Python operations, and are marked with double underlines.

PYTHON

my_string = 'Hello world!'  # creation of a string object 

print(len(my_string))       # the len function takes a string as an argument and returns the length of the string

print(my_string.swapcase()) # calling the swapcase method on the my_string object

print(my_string.__len__())  # calling the internal __len__ method on the my_string object, used by len(my_string)

OUTPUT

12
hELLO WORLD!
12
  • You might even see them chained together. They operate left to right.

PYTHON

print(my_string.isupper())          # Not all the letters are uppercase
print(my_string.upper())            # This capitalizes all the letters

print(my_string.upper().isupper())  # Now all the letters are uppercase

OUTPUT

False
HELLO WORLD!
True

Use the built-in function help to get help for a function.


  • Every built-in function has online documentation.

PYTHON

help(round)

OUTPUT

Help on built-in function round in module builtins:

round(number, ndigits=None)
    Round a number to a given precision in decimal digits.

    The return value is an integer if ndigits is omitted or None.  Otherwise
    the return value has the same type as the number.  ndigits may be negative.
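
The note that ndigits may be negative means round can also round to the left of the decimal point:

```python
# A negative ndigits rounds to tens, hundreds, and so on.
print(round(1234.5678, -2))  # 1200.0 (nearest hundred; float because ndigits was given)
print(round(1234, -2))       # 1200   (stays an integer)
```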

The Jupyter Notebook has two ways to get help.


  • Option 1: Place the cursor near where the function is invoked in a cell (i.e., the function name or its parameters),
    • Hold down Shift, and press Tab.
    • Do this several times to expand the information returned.
  • Option 2: Type the function name in a cell with a question mark after it. Then run the cell.

Python reports a syntax error when it can’t understand the source of a program.


  • Won’t even try to run the program if it can’t be parsed.

PYTHON

# Forgot to close the quote marks around the string.
name = 'Feng

ERROR

  File "<ipython-input-56-f42768451d55>", line 2
    name = 'Feng
                ^
SyntaxError: EOL while scanning string literal

PYTHON

# An extra '=' in the assignment.
age = = 52

ERROR

  File "<ipython-input-57-ccc3df3cf902>", line 2
    age = = 52
          ^
SyntaxError: invalid syntax
  • Look more closely at the error message:

PYTHON

print("hello world"

ERROR

  File "<ipython-input-6-d1cc229bf815>", line 1
    print ("hello world"
                        ^
SyntaxError: unexpected EOF while parsing
  • The message indicates a problem on the first line of the input (“line 1”).
    • In this case the “ipython-input” section of the file name tells us that we are working with input into IPython, the Python interpreter used by the Jupyter Notebook.
  • The -6- part of the filename indicates that the error occurred in cell 6 of our Notebook.
  • Next is the problematic line of code, indicating the problem with a ^ pointer.

Python reports a runtime error when something goes wrong while a program is executing.


PYTHON

age = 53
remaining = 100 - aege # mis-spelled 'age'

ERROR

NameError                                 Traceback (most recent call last)
<ipython-input-59-1214fb6c55fc> in <module>
      1 age = 53
----> 2 remaining = 100 - aege # mis-spelled 'age'

NameError: name 'aege' is not defined
  • Fix syntax errors by reading the source and runtime errors by tracing execution.

What Happens When

  1. Explain in simple terms the order of operations in the following program: when does the addition happen, when does the subtraction happen, when is each function called, etc.
  2. What is the final value of radiance?

PYTHON

radiance = 1.0
radiance = max(2.1, 2.0 + min(radiance, 1.1 * radiance - 0.5))
  1. Order of operations:
    1. 1.1 * radiance = 1.1
    2. 1.1 - 0.5 = 0.6
    3. min(radiance, 0.6) = 0.6
    4. 2.0 + 0.6 = 2.6
    5. max(2.1, 2.6) = 2.6
  2. At the end, radiance = 2.6

Spot the Difference

  1. Predict what each of the print statements in the program below will print.
  2. Does max(len(rich), poor) run or produce an error message? If it runs, does its result make any sense?

PYTHON

easy_string = "abc"
print(max(easy_string))
rich = "gold"
poor = "tin"
print(max(rich, poor))
print(max(len(rich), len(poor)))

PYTHON

print(max(easy_string))

OUTPUT

c

PYTHON

print(max(rich, poor))

max for strings ranks them alphabetically.

OUTPUT

tin

PYTHON

print(max(len(rich), len(poor)))

OUTPUT

4

max(len(rich), poor) throws a TypeError. This turns into max(4, 'tin') and as we discussed earlier a string and integer cannot meaningfully be compared.

ERROR

TypeError                                 Traceback (most recent call last)
<ipython-input-65-bc82ad05177a> in <module>
----> 1 max(len(rich), poor)

TypeError: '>' not supported between instances of 'str' and 'int'

Why Not?

Why is it that max and min do not return None when they are called with no arguments?

max and min raise a TypeError in this case because the correct number of arguments was not supplied. If they simply returned None, the error would be much harder to trace: the None would likely be stored in a variable and used later in the program, only to cause a runtime error far from the original mistake.

Last Character of a String

If Python starts counting from zero, and len returns the number of characters in a string, what index expression will get the last character in the string name? (Note: we will see a simpler way to do this in a later episode.)

PYTHON

name[len(name) - 1]

Explore the Python docs!

The official Python documentation is arguably the most complete source of information about the language. It is available in different languages and contains a lot of useful resources. The Built-in Functions page contains a catalogue of all of these functions, including the ones that we’ve covered in this lesson. Some of these are more advanced and unnecessary at the moment, but others are very simple and useful.

Content from String Manipulation


Last updated on 2023-04-27 | Edit this page

Overview

Questions

  • How can I manipulate text?
  • How can I create neat and dynamic text output?

Objectives

  • Extract substrings of interest.
  • Format dynamic strings using f-strings.
  • Explore Python’s built-in string functions

Key Points

  • Strings can be indexed and sliced.
  • Strings cannot be directly altered.
  • You can build complex strings based on other variables using f-strings and format.
  • Python has a variety of useful built-in string functions.

Strings can be indexed and sliced.


As we saw in class during the types lesson, strings in Python can be indexed and sliced.

PYTHON

my_string = "What a lovely day."

# Indexing
my_string[2]

OUTPUT

'a'

PYTHON

# Slicing
my_string[1:3]

OUTPUT

'ha'

PYTHON

my_string[:3] # Same as my_string[0:3]

OUTPUT

'Wha'

PYTHON

my_string[1:] # Same as my_string[1:len(my_string)]

OUTPUT

'hat a lovely day.'

Strings are immutable.


We will talk about this concept in more detail when we explore lists. However, for now it is important to note that strings, like simple types, cannot have their values altered in Python. Instead, a new value is created.

For simple types, this behavior isn’t that noticeable:

PYTHON

x = 10
# While this line appears to be changing the value 10 to 11, in reality a new integer with the value 11 is created and assigned to x. 
x = x + 1
x

OUTPUT

11

However, for strings we can easily cause errors if we attempt to change them directly:

PYTHON

# This attempts to change the last character to 'g' 
my_string[-1] = "g"

ERROR

---------------------------------------------------------------------
TypeError                           Traceback (most recent call last)
Input In [6], in <cell line: 1>()
----> 1 my_string[-1] = "g"

TypeError: 'str' object does not support item assignment

Thus, we need to learn ways in Python to manipulate and build strings. In this instance, we can build a new string using indexing:

PYTHON

my_new_string = my_string[:-2] + "g"
my_new_string

OUTPUT

'What a lovely dag'

Or use the built-in function str.replace() if we wanted to replace a larger portion of the string.

PYTHON

my_new_string = my_string.replace("day", 'dog')
print(my_new_string)

OUTPUT

What a lovely dog.

Build complex strings based on other variables using format.


What if we want to use values inside of a string? For instance, say we want to print a sentence denoting how many and the percent of samples we dropped and kept as a part of a quality control analysis.

Suppose we have variables denoting how many samples and dropped samples we have:

PYTHON

good_samples = 964
bad_samples = 117
percent_dropped = bad_samples/(good_samples + bad_samples)

One option would be to simply put everything in print:

PYTHON

print("Dropped", percent_dropped, "percent samples")

OUTPUT

Dropped 0.10823311748381129 percent samples

Or we could convert and use addition:

PYTHON

out_str = "Dropped " + str(percent_dropped) + "% samples"
print(out_str)

OUTPUT

Dropped 0.10823311748381129% samples

However, neither of these options gives us much control over how the percent is displayed. Python uses systems called string formatting and f-strings to give us greater control.

We can use Python’s built-in format function to create better-looking text:

PYTHON

print('Dropped {0:.2%} of samples, with {1:n} samples remaining.'.format(percent_dropped, good_samples))

OUTPUT

Dropped 10.82% of samples, with 964 samples remaining.

Calls to format have a number of components:

  • The use of brackets {} to create replacement fields.
  • The index inside each bracket (0 and 1) to denote the index of the variable to use.
  • Format instructions: .2% indicates that we want to format the number as a percent with 2 decimal places, and n indicates that we want to format it as a number.
  • The format method: we call format on the string and pass it the variables we want to use as arguments. The order of the variables matches the indices referenced in the replacement fields.

For instance, we can switch or repeat indices:

PYTHON

print('Dropped {1:.2%} of samples, with {0:n} samples remaining.'.format(percent_dropped, good_samples))
print('Dropped {0:.2%} of samples, with {0:n} samples remaining.'.format(percent_dropped, good_samples))

OUTPUT

Dropped 96400.00% of samples, with 964 samples remaining.
Dropped 10.82% of samples, with 0.108233 samples remaining.

Python has a shorthand for using format called f-strings. These strings let us directly use variables and create expressions inside of strings. We denote an f-string by putting f in front of the string definition:

PYTHON

print(f'Dropped {percent_dropped:.2%} of samples, with {good_samples:n} samples remaining.')
print(f'Dropped {100*(bad_samples/(good_samples + bad_samples)):.2f}% of samples, with {good_samples:n} samples remaining.')

OUTPUT

Dropped 10.82% of samples, with 964 samples remaining.
Dropped 10.82% of samples, with 964 samples remaining.

Here, {100*(bad_samples/(good_samples + bad_samples)):.2f} is treated as an expression, and then printed as a float with 2 digits.

Full documentation on all of the string formatting mini-language can be found here.

Python has many useful built-in functions for string manipulation.


Python has many built-in methods for manipulating strings; simple and powerful text manipulation is considered one of Python’s strengths. We will go over some of the more common and useful functions here, but be aware that there are many more you can find in the official documentation.

Dealing with whitespace

str.strip() strips the whitespace from the beginning and ending of a string. This is especially useful when reading in files which might have hidden spaces or tabs at the end of lines.

PYTHON

"  a   \t\t".strip()

Note that \t denotes a tab character.

OUTPUT

'a'

str.split() splits a string into a list of strings around a separator. By default it splits on whitespace, but we can give it any separator string to split by. We will learn how to use lists in the next session.

PYTHON

"My favorite sentence I hope nothing bad happens to it".split()

OUTPUT

['My',
 'favorite',
 'sentence',
 'I',
 'hope',
 'nothing',
 'bad',
 'happens',
 'to',
 'it']

PYTHON

"2023-:04-:12".split('-:')

OUTPUT

['2023', '04', '12']

Pattern matching

str.find() finds the first occurrence of the specified string inside the search string, and str.rfind() finds the last.

PYTHON

"(collected)ENSG00000010404(inprogress)ENSG000000108888".find('ENSG')

OUTPUT

11

PYTHON

"(collected)ENSG00000010404(inprogress)ENSG000000108888".rfind('ENSG')

OUTPUT

38
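
If the substring is not present, find and rfind do not raise an error; both return -1, so it is worth checking the result before using it as an index:

```python
# 'ABCD' does not occur in the string, so find returns -1 rather than raising an error.
print("(collected)ENSG00000010404".find('ABCD'))  # -1
```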

str.startswith() and str.endswith() perform a similar function but return a bool based on whether or not the string starts or ends with a particular string.

PYTHON

"(collected)ENSG00000010404".startswith('ENSG')

OUTPUT

False

PYTHON

"ENSG00000010404".startswith('ENSG')

OUTPUT

True

Case

str.upper(), str.lower(), str.capitalize(), and str.title() all change the capitalization of strings.

PYTHON

my_string = "liters (coffee) per minute"
print(my_string.lower())
print(my_string.upper())
print(my_string.capitalize())
print(my_string.title())

OUTPUT

liters (coffee) per minute
LITERS (COFFEE) PER MINUTE
Liters (coffee) per minute
Liters (Coffee) Per Minute

Challenge

A common problem when analyzing data is that multiple features of the data will be stored as a column name or single string.

For instance, consider the following column headers:

WT_0Min_1	WT_0Min_2	X.RAB7A_0Min_1	X.RAB7A_0Min_2	WT_5Min_1	WT_5Min_2	X.RAB7A_5Min_1	X.RAB7A_5Min_2	X.NPC1_5Min_1	X.NPC1_5Min_2

There are two variables of interest, the time, 0, 5, or 60 minutes post-infection, and the genotype, WT, NPC1 knockout and RAB7A knockout. We also have replicate information at the end of each column header. For now, let’s just try extracting the timepoint, genotype, and replicate from a single column header.

Given the string:

PYTHON

sample_info = "X.RAB7A_0Min_1"

Try to print the string:

Sample is RAB7A 0min knockout, replicate 1. 

Use f-strings, slicing and indexing, and built-in string functions. You can try to use lists as a challenge, but it’s fine to instead extract each piece of information separately from sample_info.

PYTHON

s_info = sample_info.split('_')
genotype = s_info[0]
time = s_info[1]
rep = s_info[2]

#Cleaning the parts up
genotype = genotype.split('.')[-1]
time = time.lower()

print(f"Sample is {genotype} {time} knockout, replicate {rep}")

OUTPUT

Sample is RAB7A 0min knockout, replicate 1

Content from Using Objects


Last updated on 2023-04-20 | Edit this page

Overview

Questions

  • What is an object?

Objectives

  • Define objects.
  • Use an object’s methods.
  • Call a constructor for an object.

Key Points

  • Objects are entities with both data and methods
  • Methods are unique to objects, and so methods with the same name may work differently on different objects.
  • You can create an object using a constructor.
  • Objects need to be explicitly copied.

Objects are entities with both data and methods


In addition to basic types, Python also has objects; we refer to the type of an object as its class. In other programming languages there is a sharper distinction between base types and objects, but in Python these terms are essentially interchangeable. However, thinking about more complex data structures through an object-oriented lens will allow us to better understand how to write effective code.

We can think of an object as an entity with two aspects: data and methods. The data (sometimes called properties) is what we would typically think of as the value of that object: what it is storing. Methods define what we can do with an object; essentially, they are functions specific to that object (you will often hear function and method used interchangeably).

Next session we will explore the list object in Python. A list consists of its data, or what is stored in the list.

PYTHON

# Here, we are storing 3, 67, 1, and 33 as the data inside the list object
sizes = [3,67,1,33]

# Reverse is a list method which reverses that list
sizes.reverse()

print(sizes)

OUTPUT

[33, 1, 67, 3]

Methods are unique to objects, and so methods with the same name may work differently on different objects.


In Python, many objects share method names. However, those methods may do different things.

We have already seen this implicitly looking at how operations like addition interact with different basic types. For instance, performing multiplication on a list:

PYTHON

sizes * 3

OUTPUT

[33, 1, 67, 3, 33, 1, 67, 3, 33, 1, 67, 3]

May not have the behavior you expect. Whenever our code isn’t doing what we want, one of the first things to check is that the type of our variables is what we expect.
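
As a quick sketch, the same * operator behaves differently on lists, strings, and numbers, and type() is the tool for diagnosing which behavior you are getting:

```python
sizes = [33, 1, 67, 3]

print(sizes * 2)  # list repetition: [33, 1, 67, 3, 33, 1, 67, 3]
print('3' * 2)    # string repetition: '33'
print(3 * 2)      # numeric multiplication: 6

# Checking types is the first debugging step when an operation misbehaves.
print(type(sizes), type('3'), type(3))
```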

You can create an object using a constructor.


All objects have constructors, which are special functions that create that object. Constructors are often called implicitly when we, for instance, define a list using the [] notation or create a Pandas dataframe object using read_csv. However, objects also have explicit constructors which can be called directly.

PYTHON

# The basic list constructor
new_sizes = list()

new_sizes.append(33)
new_sizes.append(1)
new_sizes.append(67)
new_sizes.append(3)

print(new_sizes)

OUTPUT

[33, 1, 67, 3]

Objects need to be explicitly copied.


We will continue to circle back to this point, but one important note about complex objects is that they need to be explicitly copied. This is due to them being mutable, which we will discuss more next session.

We can have multiple variables refer to the same object. This can result in us unknowingly changing an object we didn’t expect to.

For instance, two variables can refer to the same list:

PYTHON

# Here, we lose the original sizes list and now both variables are pointing to the same list
sizes = new_sizes
print(sizes)
print(new_sizes)

# Thus, if we change sizes:
sizes.append(100)

# We see that new_sizes has also changed
print(sizes)
print(new_sizes)

OUTPUT

[33, 1, 67, 3]
[33, 1, 67, 3]
[33, 1, 67, 3, 100]
[33, 1, 67, 3, 100]

In order to make a variable refer to a different copy of a list, we need to explicitly copy it. One way to do this is to use a copy constructor. Most objects in Python have a copy constructor which accepts another object of the same type and creates a copy of it.

PYTHON

# Calling the copy constructor
sizes = list(new_sizes)
print(sizes)
print(new_sizes)

# Now if we change sizes:
sizes.append(100)

# We see that new_sizes does NOT change
print(sizes)
print(new_sizes)

OUTPUT

[33, 1, 67, 3, 100]
[33, 1, 67, 3, 100]
[33, 1, 67, 3, 100, 100]
[33, 1, 67, 3, 100]

However, even copying an object is sometimes not enough. Some objects store other objects; in that case, copying the outer object does not copy the objects inside it, which remain shared.

PYTHON

# Multiplying lists gets weird
list_of_lists = [sizes] * 4
print(list_of_lists)

OUTPUT

[[33, 1, 67, 3, 100, 100], [33, 1, 67, 3, 100, 100], [33, 1, 67, 3, 100, 100], [33, 1, 67, 3, 100, 100]]

PYTHON

# While the external list here is different, the internal lists are still the same
new_lol = list(list_of_lists)
print(new_lol)

OUTPUT

[[33, 1, 67, 3, 100, 100], [33, 1, 67, 3, 100, 100], [33, 1, 67, 3, 100, 100], [33, 1, 67, 3, 100, 100]]

PYTHON

new_lol[0].reverse()
print(list_of_lists)

OUTPUT

[[100, 3, 67, 1, 33], [100, 3, 67, 1, 33], [100, 3, 67, 1, 33], [100, 3, 67, 1, 33]]

Overall, we need to be careful in Python when copying objects. This is one of the reasons why most libraries’ methods return new objects rather than altering existing ones: it avoids any confusion over different object versions. Additionally, we can perform a deep copy of an object, which copies the object along with any objects stored inside it.
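
The standard library’s copy module provides deepcopy for exactly this; a minimal sketch on a nested list like the one above:

```python
import copy

sizes = [33, 1, 67, 3]
list_of_lists = [sizes] * 4

# deepcopy copies the outer list AND the inner lists it contains.
new_lol = copy.deepcopy(list_of_lists)
new_lol[0].reverse()

print(new_lol[0])        # the copy changed: [3, 67, 1, 33]
print(list_of_lists[0])  # the original did not: [33, 1, 67, 3]
```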

Challenge

Pandas is one of the most popular python libraries for manipulating data which we will be diving into next week. Take a look at the official documentation for Pandas and try to find how to create a deep copy of a dataframe. Is what you found a function or a method? You can find the documentation here

Dataframes in Pandas have a copy method found here. This method by default performs a deep copy as the deep argument’s default value is True.

Content from Lists


Last updated on 2023-04-20 | Edit this page

Overview

Questions

  • How can I store many values together?
  • How do I access items stored in a list?
  • How are variables assigned to lists different than variables assigned to values?

Objectives

  • Explain what a list is.
  • Create and index lists of simple values.
  • Change the values of individual elements
  • Append values to an existing list
  • Reorder and slice list elements
  • Create and manipulate nested lists

Key Points

  • [value1, value2, value3, ...] creates a list.
  • Lists can contain any Python object, including lists (i.e., list of lists).
  • Lists are indexed and sliced with square brackets (e.g., list[0] and list[2:9]), in the same way as strings and arrays.
  • Lists are mutable (i.e., their values can be changed in place).
  • Strings are immutable (i.e., the characters in them cannot be changed).

Python lists


We create a list by putting values inside square brackets and separating the values with commas:

PYTHON

odds = [1, 3, 5, 7]
print('odds are:', odds)

OUTPUT

odds are: [1, 3, 5, 7]

The empty list

Similar to an empty string, we can create an empty list, which has a length of 0 and no elements. We create an empty list as empty_list = [] or empty_list = list(); this is useful when we want to build up a list element by element.
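For example, we can start from an empty list and grow it with append (a small sketch):

```python
squares = []            # start from an empty list
for n in range(1, 4):
    squares.append(n ** 2)  # add one element per iteration
print(squares)          # [1, 4, 9]
```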

We can access elements of a list using indices – numbered positions of elements in the list. These positions are numbered starting at 0, so the first element has an index of 0.

PYTHON

print('first element:', odds[0])
print('last element:', odds[3])
print('"-1" element:', odds[-1])

OUTPUT

first element: 1
last element: 7
"-1" element: 7

Yes, we can use negative numbers as indices in Python. When we do so, the index -1 gives us the last element in the list, -2 the second to last, and so on. Because of this, odds[3] and odds[-1] point to the same element here.
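A couple more negative indices, for illustration:

```python
odds = [1, 3, 5, 7]
print(odds[-2])  # second-to-last element: 5
print(odds[-4])  # counting back four elements lands on the first: 1
```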

There is one important difference between lists and strings: we can change the values in a list, but we cannot change individual characters in a string. For example:

PYTHON

names = ['Curie', 'Darwing', 'Turing']  # typo in Darwin's name
print('names is originally:', names)
names[1] = 'Darwin'  # correct the name
print('final value of names:', names)

OUTPUT

names is originally: ['Curie', 'Darwing', 'Turing']
final value of names: ['Curie', 'Darwin', 'Turing']

works, but:

PYTHON

name = 'Darwin'
name[0] = 'd'

ERROR

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-220df48aeb2e> in <module>()
     1 name = 'Darwin'
----> 2 name[0] = 'd'

TypeError: 'str' object does not support item assignment

does not. This is because a list is mutable while a string is immutable.

Mutable and immutable

Data which can be modified in place is called mutable, while data which cannot be modified is called immutable. Strings and numbers are immutable. This does not mean that variables with string or number values are constants, but when we want to change the value of a string or number variable, we can only replace the old value with a completely new value.
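For example, to "change" a string we build a new string and rebind the variable to it (a small sketch):

```python
name = 'darwin'
name = name.capitalize()  # builds a brand-new string; the old value is discarded
print(name)               # Darwin
```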

Lists and arrays, on the other hand, are mutable: we can modify them after they have been created. We can change individual elements, append new elements, or reorder the whole list. For some operations, like sorting, we can choose whether to use a function that modifies the data in-place or a function that returns a modified copy and leaves the original unchanged.
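Sorting illustrates this choice between in-place and copying operations, using the built-ins sorted and list.sort:

```python
values = [3, 1, 2]
fresh = sorted(values)  # returns a sorted copy; the original is untouched
print(fresh)            # [1, 2, 3]
print(values)           # [3, 1, 2]

values.sort()           # sorts in place and returns None
print(values)           # [1, 2, 3]
```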

Be careful when modifying data in-place. If two variables refer to the same list, and you modify the list value, it will change for both variables!

PYTHON

mild_salsa = ['peppers', 'onions', 'cilantro', 'tomatoes']
hot_salsa = mild_salsa        # <-- mild_salsa and hot_salsa point to the *same* list data in memory
hot_salsa[0] = 'hot peppers'
print('Ingredients in mild salsa:', mild_salsa)
print('Ingredients in hot salsa:', hot_salsa)

OUTPUT

Ingredients in mild salsa: ['hot peppers', 'onions', 'cilantro', 'tomatoes']
Ingredients in hot salsa: ['hot peppers', 'onions', 'cilantro', 'tomatoes']

Let’s go to this link to visualize this code.

If you want variables with mutable values to be independent, you must make a copy of the value when you assign it.

PYTHON

mild_salsa = ['peppers', 'onions', 'cilantro', 'tomatoes']
hot_salsa = list(mild_salsa)        # <-- makes a *copy* of the list
hot_salsa[0] = 'hot peppers'
print('Ingredients in mild salsa:', mild_salsa)
print('Ingredients in hot salsa:', hot_salsa)

OUTPUT

Ingredients in mild salsa: ['peppers', 'onions', 'cilantro', 'tomatoes']
Ingredients in hot salsa: ['hot peppers', 'onions', 'cilantro', 'tomatoes']

Because of pitfalls like this, code which modifies data in place can be more difficult to understand. However, it is often far more efficient to modify a large data structure in place than to create a modified copy for every small change. You should consider both of these aspects when writing your code.
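One way to check whether two names refer to the same list is the is operator, which compares object identity rather than contents (a small sketch):

```python
a = ['x', 'y']
b = a          # same list object
c = list(a)    # a copy with equal contents
print(b is a)  # True: both names point to one list
print(c is a)  # False: different objects
print(c == a)  # True: equal contents
```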

Nested Lists

Since a list can contain any Python variables, it can even contain other lists.

For example, you could represent the products on the shelves of a small grocery shop as a nested list called veg:

veg is represented as a shelf full of produce. There are three rows of vegetables on the shelf, and each row contains three baskets of vegetables. We can label each basket according to the type of vegetable it contains, so the top row contains (from left to right) lettuce, lettuce, and peppers.

To store the contents of the shelf in a nested list, you write it this way:

PYTHON

veg = [['lettuce', 'lettuce', 'peppers', 'zucchini'],
    ['lettuce', 'lettuce', 'peppers', 'zucchini'],
    ['lettuce', 'cilantro', 'peppers', 'zucchini']]

Here are some visual examples of how indexing a list of lists veg works. First, you can reference each row on the shelf as a separate list. For example, veg[2] represents the bottom row, which is a list of the baskets in that row.

veg is now shown as a list of three rows, with veg[0] representing the top row of three baskets, veg[1] representing the second row, and veg[2] representing the bottom row.

Index operations using the image would work like this:

PYTHON

print(veg[2])

OUTPUT

['lettuce', 'cilantro', 'peppers', 'zucchini']

PYTHON

print(veg[0])

OUTPUT

['lettuce', 'lettuce', 'peppers', 'zucchini']

To reference a specific basket on a specific shelf, you use two indices. The first index represents the row (from top to bottom) and the second index represents the specific basket (from left to right).

veg is now shown as a two-dimensional grid, with each basket labeled according to its index in the nested list. The first index is the row number and the second index is the basket number, so veg[1][3] represents the basket on the far right side of the second row (basket 4 on row 2): zucchini.

PYTHON

print(veg[0][0])

OUTPUT

lettuce

PYTHON

print(veg[1][2])

OUTPUT

peppers

Heterogeneous Lists

Lists in Python can contain elements of different types. Example:

PYTHON

sample_ages = [10, 12.5, 'Unknown']
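We can confirm the mixed types with the built-in type function (a small sketch):

```python
sample_ages = [10, 12.5, 'Unknown']
for item in sample_ages:
    print(item, type(item).__name__)  # int, float, str
```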

Manipulating Lists


There are many ways to change the contents of lists besides assigning new values to individual elements:

PYTHON

odds.append(11)
print('odds after adding a value:', odds)

OUTPUT

odds after adding a value: [1, 3, 5, 7, 11]

PYTHON

removed_element = odds.pop(0)
print('odds after removing the first element:', odds)
print('removed_element:', removed_element)

OUTPUT

odds after removing the first element: [3, 5, 7, 11]
removed_element: 1

PYTHON

odds.reverse()
print('odds after reversing:', odds)

OUTPUT

odds after reversing: [11, 7, 5, 3]

When modifying lists in place, it is useful to remember that Python treats them in a slightly counter-intuitive way.

As we saw earlier with the mild_salsa example, if we make a list, (attempt to) copy it by plain assignment, and then modify the "copy", we can cause all sorts of trouble. This also applies to modifying a list using the methods above:

PYTHON

odds = [3, 5, 7]
primes = odds
primes.append(2)
print('primes:', primes)
print('odds:', odds)

OUTPUT

primes: [3, 5, 7, 2]
odds: [3, 5, 7, 2]

This is because Python stores a list in memory, and then can use multiple names to refer to the same list. If all we want to do is copy a (simple) list, we can again use the list function, so we do not modify a list we did not mean to:

PYTHON

odds = [3, 5, 7]
primes = list(odds)
primes.append(2)
print('primes:', primes)
print('odds:', odds)

OUTPUT

primes: [3, 5, 7, 2]
odds: [3, 5, 7]
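Lists also have a copy method, which makes the same kind of shallow copy as list():

```python
odds = [3, 5, 7]
primes = odds.copy()  # equivalent to list(odds)
primes.append(2)
print('primes:', primes)  # [3, 5, 7, 2]
print('odds:', odds)      # [3, 5, 7]
```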

Subsets of lists can be accessed by slicing, similar to how we accessed ranges of positions in a string.

PYTHON

binomial_name = 'Drosophila melanogaster'
genus = binomial_name[0:10]
print('genus:', genus)

species = binomial_name[11:23]
print('species:', species)

chromosomes = ['X', 'Y', '2', '3', '4']
autosomes = chromosomes[2:5]
print('autosomes:', autosomes)

last = chromosomes[-1]
print('last:', last)

OUTPUT

genus: Drosophila
species: melanogaster
autosomes: ['2', '3', '4']
last: 4

Slicing From the End

Use slicing to access only the last four characters of a string or entries of a list.

PYTHON

string_for_slicing = 'Observation date: 02-Feb-2013'
list_for_slicing = [['fluorine', 'F'],
                   ['chlorine', 'Cl'],
                   ['bromine', 'Br'],
                   ['iodine', 'I'],
                   ['astatine', 'At']]

Desired output:

OUTPUT

'2013'
[['chlorine', 'Cl'], ['bromine', 'Br'], ['iodine', 'I'], ['astatine', 'At']]

Would your solution work regardless of whether you knew beforehand the length of the string or list (e.g. if you wanted to apply the solution to a set of lists of different lengths)? If not, try to change your approach to make it more robust.

Hint: Remember that indices can be negative as well as positive

Use negative indices to count elements from the end of a container (such as list or string):

PYTHON

string_for_slicing[-4:]
list_for_slicing[-4:]

Non-Continuous Slices

So far we’ve seen how to use slicing to take single blocks of successive entries from a sequence. But what if we want to take a subset of entries that aren’t next to each other in the sequence?

You can achieve this by providing a third argument to the range within the brackets, called the step size. The example below shows how you can take every third entry in a list:

PYTHON

primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]
subset = primes[0:12:3]
print('subset', subset)

OUTPUT

subset [2, 7, 17, 29]

Notice that the slice taken begins with the first entry in the range, followed by entries taken at equally-spaced intervals (the steps) thereafter. If you wanted to begin the subset with the third entry, you would need to specify that as the starting point of the sliced range:

PYTHON

primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]
subset = primes[2:12:3]
print('subset', subset)

OUTPUT

subset [5, 13, 23, 37]

Use the step size argument to create a new string that contains only every other character in the string “In an octopus’s garden in the shade”. Start with creating a variable to hold the string:

PYTHON

beatles = "In an octopus's garden in the shade"

What slice of beatles will produce the following output (i.e., the first character, third character, and every other character through the end of the string)?

OUTPUT

I notpssgre ntesae

To obtain every other character you need to provide a slice with the step size of 2:

PYTHON

beatles[0:35:2]

You can also leave out the beginning and end of the slice to take the whole string and provide only the step argument to go every second element:

PYTHON

beatles[::2]

If you want to take a slice from the beginning of a sequence, you can omit the first index in the range:

PYTHON

date = 'Monday 4 January 2016'
day = date[0:6]
print('Using 0 to begin range:', day)
day = date[:6]
print('Or omit the beginning index to slice from 0:', day)

OUTPUT

Using 0 to begin range: Monday
Or omit the beginning index to slice from 0: Monday

And similarly, you can omit the ending index in the range to take a slice to the very end of the sequence:

PYTHON

months = ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec']
sond = months[8:12]
print('With known last position:', sond)
sond = months[8:len(months)]
print('Using len() to get last entry:', sond)
sond = months[8:]
print('Or omit the final index to go to the end of the list:', sond)

OUTPUT

With known last position: ['sep', 'oct', 'nov', 'dec']
Using len() to get last entry: ['sep', 'oct', 'nov', 'dec']
Or omit the final index to go to the end of the list: ['sep', 'oct', 'nov', 'dec']

Going past len

Python does not consider it an error to go past the end of a list when slicing. Python will just slice to the end of the list:

PYTHON

months[:30]

OUTPUT

['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec']
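Likewise, a slice that starts beyond the end of the list is not an error; it simply produces an empty list:

```python
months = ['jan', 'feb', 'mar']
print(months[30:])   # []
print(months[5:30])  # also []
```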

However, trying to get a single item from a list at an index greater than its length will result in an error:

PYTHON

months[30]

ERROR

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Input In [5], in <cell line: 1>()
----> 1 months[30]

IndexError: list index out of range

Overloading

+ usually means addition, but when used on strings or lists, it means “concatenate”. Given that, what do you think the multiplication operator * does on lists? In particular, what will be the output of the following code?

PYTHON

counts = [2, 4, 6, 8, 10]
repeats = counts * 2
print(repeats)
  1. [2, 4, 6, 8, 10, 2, 4, 6, 8, 10]
  2. [4, 8, 12, 16, 20]
  3. [[2, 4, 6, 8, 10],[2, 4, 6, 8, 10]]
  4. [2, 4, 6, 8, 10, 4, 8, 12, 16, 20]

The technical term for this is operator overloading: a single operator, like + or *, can do different things depending on what it’s applied to.

The multiplication operator * used on a list replicates elements of the list and concatenates them together:

OUTPUT

[2, 4, 6, 8, 10, 2, 4, 6, 8, 10]


It’s equivalent to:

PYTHON

counts + counts

Content from For Loops


Last updated on 2023-04-20 | Edit this page

Overview

Questions

  • How can I make a program do many things?
  • How can I do something for each thing in a list?

Objectives

  • Explain what for loops are normally used for.
  • Trace the execution of a simple (unnested) loop and correctly state the values of variables in each iteration.
  • Write for loops that use the Accumulator pattern to aggregate values.

Key Points

  • A for loop executes commands once for each value in a collection.
  • A for loop is made up of a collection, a loop variable, and a body.
  • The first line of the for loop must end with a colon, and the body must be indented.
  • Indentation is always meaningful in Python.
  • Loop variables can be called anything (but it is strongly advised to have a meaningful name to the looping variable).
  • The body of a loop can contain many statements.
  • Use range to iterate over a sequence of numbers.

A for loop executes commands once for each value in a collection.


  • Doing calculations on the values in a list one by one is as painful as working with pressure_001, pressure_002, etc.
  • A for loop tells Python to execute some statements once for each value in a list, a character string, or some other collection.
  • “for each thing in this group, do these operations”

PYTHON

for number in [2, 3, 5]:
    print(number)
  • This for loop is equivalent to:

PYTHON

print(2)
print(3)
print(5)
  • And the for loop’s output is:

OUTPUT

2
3
5

A for loop is made up of a collection, a loop variable, and a body.


PYTHON

for number in [2, 3, 5]:
    print(number)
  • The collection, [2, 3, 5], is what the loop is being run on.
  • The body, print(number), specifies what to do for each value in the collection.
  • The loop variable, number, is what changes for each iteration of the loop.
    • The “current thing”.

The first line of the for loop must end with a colon, and the body must be indented.


  • The colon at the end of the first line signals the start of a block of statements.
  • Python uses indentation rather than {} or begin/end to show nesting.
    • Any consistent indentation is legal, but almost everyone uses four spaces.

PYTHON

for number in [2, 3, 5]:
print(number)

OUTPUT

IndentationError: expected an indented block
  • Indentation is always meaningful in Python.

PYTHON

firstName = "Jon"
  lastName = "Smith"

ERROR

  File "<ipython-input-7-f65f2962bf9c>", line 2
    lastName = "Smith"
    ^
IndentationError: unexpected indent
  • This error can be fixed by removing the extra spaces at the beginning of the second line.

Loop variables can be called anything.


  • As with all variables, loop variables are:
    • Created on demand.
    • Meaningless: their names can be anything at all.

PYTHON

for kitten in [2, 3, 5]:
    print(kitten)

The body of a loop can contain many statements.


  • But no loop should be more than a few lines long.
  • Hard for human beings to keep larger chunks of code in mind.

PYTHON

primes = [2, 3, 5]
for p in primes:
    squared = p ** 2
    cubed = p ** 3
    print(p, squared, cubed)

OUTPUT

2 4 8
3 9 27
5 25 125

Use range to iterate over a sequence of numbers.


  • The built-in function range produces a sequence of numbers.
    • Not a list: the numbers are produced on demand to make looping over large ranges more efficient.
  • range(N) is the numbers 0..N-1
    • Exactly the legal indices of a list or character string of length N

PYTHON

print('a range is not a list: range(0, 3)')
for number in range(0, 3):
    print(number)

OUTPUT

a range is not a list: range(0, 3)
0
1
2
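Because range(N) produces exactly the legal indices of a sequence of length N, it can be used to loop over positions and values together (a small sketch):

```python
letters = ['a', 'b', 'c']
for i in range(len(letters)):  # i takes the values 0, 1, 2
    print(i, letters[i])
```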

The Accumulator pattern turns many values into one.


  • A common pattern in programs is to:
    1. Initialize an accumulator variable to zero, the empty string, or the empty list.
    2. Update the variable with values from a collection.

PYTHON

# Sum the first 5 integers.
my_sum = 0 # Line 1
for number in range(5): # Line 2
   my_sum = my_sum + (number + 1) # Line 3
print(my_sum) # Line 4

OUTPUT

15
  • Read my_sum = my_sum + (number + 1) as:
    • Add 1 to the current value of the loop variable number.
    • Add that to the current value of the accumulator variable my_sum.
    • Assign that to my_sum, replacing the current value.
  • We have to add number + 1 because range(5) produces 0..4, not 1..5.
    • You could also have used range(1, 6) and added number directly.

We can trace the program output by looking at which line of code is being executed and what each variable’s value is at each line:

Line No Variables
1 my_sum = 0
2 my_sum = 0 number = 0
3 my_sum = 1 number = 0
2 my_sum = 1 number = 1
3 my_sum = 3 number = 1
2 my_sum = 3 number = 2
3 my_sum = 6 number = 2
2 my_sum = 6 number = 3
3 my_sum = 10 number = 3
2 my_sum = 10 number = 4
3 my_sum = 15 number = 4
4 my_sum = 15 number = 4

Let’s double check our work by visualizing the code.
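An equivalent version avoids the off-by-one adjustment by starting the range at 1 (a small sketch):

```python
my_sum = 0
for number in range(1, 6):  # produces 1, 2, 3, 4, 5
    my_sum = my_sum + number
print(my_sum)  # 15
```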

Classifying Errors

Is an indentation error a syntax error or a runtime error?

An indentation error (IndentationError) is a syntax error. Programs with syntax errors cannot be started. A program with a runtime error will start but an error will be thrown under certain conditions.

Tracing Execution

Trace through the following code and create a table showing the lines that are executed when this program runs, and the values of the variables after each line is executed.

PYTHON

total = 0 # Line 1
for char in "tin": # Line 2
    total = total + 1 # Line 3
Line no Variables
1 total = 0
2 total = 0 char = ‘t’
3 total = 1 char = ‘t’
2 total = 1 char = ‘i’
3 total = 2 char = ‘i’
2 total = 2 char = ‘n’
3 total = 3 char = ‘n’

Reversing a String

Fill in the blanks in the program below so that it prints “nit” (the reverse of the original character string “tin”).

PYTHON

original = "tin"
result = ____
for char in original:
    result = ____
print(result)

PYTHON

original = "tin"
result = ""
for char in original:
    result = char + result
print(result)

Practice Accumulating

Fill in the blanks in each of the programs below to produce the indicated result.

PYTHON

# A
# Total length of the strings in the list: ["red", "green", "blue"] => 12
total = 0
for word in ["red", "green", "blue"]:
    ____ = ____ + len(word)
print(total)

# B
# Concatenate all words: ["red", "green", "blue"] => "redgreenblue"
words = ["red", "green", "blue"]
result = ____
for ____ in ____:
    ____
print(result)

# C
# List of word lengths: ["red", "green", "blue"] => [3, 5, 4]
lengths = ____
for word in ["red", "green", "blue"]:
    lengths.____(____)
print(lengths)

PYTHON

# A
# Total length of the strings in the list: ["red", "green", "blue"] => 12
total = 0
for word in ["red", "green", "blue"]:
    total = total + len(word)
print(total)

# B
# Concatenate all words: ["red", "green", "blue"] => "redgreenblue"
words = ["red", "green", "blue"]
result = ""
for word in words:
    result = result + word
print(result)

# C
# List of word lengths: ["red", "green", "blue"] => [3, 5, 4]
lengths = []
for word in ["red", "green", "blue"]:
    lengths.append(len(word))
print(lengths)

String accumulation

Starting from the list ["red", "green", "blue"], create the acronym "RGB" using a for loop.

You may need to use a string method to properly format the acronym.

PYTHON

acronym = ""
for word in ["red", "green", "blue"]:
    acronym = acronym + word[0].upper()
print(acronym)

Cumulative Sum

Reorder and properly indent the lines of code below so that they print a list with the cumulative sum of data. The result should be [1, 3, 5, 10].

PYTHON

cumulative.append(total)
for number in data:
cumulative = []
total = total + number
total = 0
print(cumulative)
data = [1,2,2,5]

PYTHON

total = 0
data = [1,2,2,5]
cumulative = []
for number in data:
    total = total + number
    cumulative.append(total)
print(cumulative)

Identifying Variable Name Errors

  1. Read the code below and try to identify what the errors are without running it.
  2. Run the code and read the error message. What type of NameError do you think this is? Is it a string with no quotes, a misspelled variable, or a variable that should have been defined but was not?
  3. Fix the error.
  4. Repeat steps 2 and 3, until you have fixed all the errors.

This is a first taste of if statements. We will be going into more detail on if statements in future lessons. No errors in this code have to do with how the if statement is being used.

PYTHON

for number in range(10):
    # use a if the number is a multiple of 3, otherwise use b
    if (Number % 3) == 0:
        message = message + a
    else:
        message = message + "b"
print(message)
  • Python variable names are case sensitive: number and Number refer to different variables.
  • The variable message needs to be initialized as an empty string.
  • We want to add the string "a" to message, not the undefined variable a.

PYTHON

message = ""
for number in range(10):
    # use a if the number is a multiple of 3, otherwise use b
    if (number % 3) == 0:
        message = message + "a"
    else:
        message = message + "b"
print(message)

Identifying Item Errors

  1. Read the code below and try to identify what the errors are without running it.
  2. Run the code, and read the error message. What type of error is it?
  3. Fix the error.

PYTHON

seasons = ['Spring', 'Summer', 'Fall', 'Winter']
print('My favorite season is ', seasons[4])

This list has 4 elements and the index to access the last element in the list is 3.

PYTHON

seasons = ['Spring', 'Summer', 'Fall', 'Winter']
print('My favorite season is ', seasons[3])

Content from Libraries


Last updated on 2023-04-20 | Edit this page

Overview

Questions

  • How can I use software that other people have written?
  • How can I find out what that software does?

Objectives

  • Explain what software libraries are and why programmers create and use them.
  • Write programs that import and use modules from Python’s standard library.
  • Find and read documentation for the standard library interactively (in the interpreter) and online.

Key Points

  • Most of the power of a programming language is in its libraries.
  • A program must import a library module in order to use it.
  • Use help to learn about the contents of a library module.
  • Import specific items from a library to shorten programs.
  • Create an alias for a library when importing it to shorten programs.

Most of the power of a programming language is in its libraries.


  • A library is a collection of files (called modules) that contains functions for use by other programs.
    • May also contain data values (e.g., numerical constants) and other things.
    • Library’s contents are supposed to be related, but there’s no way to enforce that.
  • The Python standard library is an extensive suite of modules that comes with Python itself.
  • Many additional libraries are available from PyPI (the Python Package Index).

Libraries, packages, and modules

A module is a set of code, typically contained in a single Python file, which is intended to be imported into scripts or other modules. A package is a set of related modules, often contained in a single directory. A library is a more general term for a collection of modules and packages. For instance, the Python Standard Library contains functionality ranging from compressing files to parallel programming.

However, these definitions are not particularly formal or strict. Module, package, and library are often used interchangeably, especially since many libraries only consist of a single module.

A program must import a library module before using it.


  • Use import to load a library module into a program’s memory.
  • Then refer to things from the module as module_name.thing_name.
    • Python uses . to mean “part of”.
  • Using math, one of the modules in the standard library:

PYTHON

import math

print('pi is', math.pi)
print('cos(pi) is', math.cos(math.pi))

OUTPUT

pi is 3.141592653589793
cos(pi) is -1.0
  • Have to refer to each item with the module’s name.
    • math.cos(pi) won’t work: the reference to pi doesn’t somehow “inherit” the function’s reference to math.

Use help to learn about the contents of a library module.


  • Works just like help for a function.

PYTHON

help(math)

OUTPUT

Help on module math:

NAME
    math

MODULE REFERENCE
    http://docs.python.org/3/library/math

    The following documentation is automatically generated from the Python
    source files.  It may be incomplete, incorrect or include features that
    are considered implementation detail and may vary between Python
    implementations.  When in doubt, consult the module reference at the
    location listed above.

DESCRIPTION
    This module is always available.  It provides access to the
    mathematical functions defined by the C standard.

FUNCTIONS
    acos(x, /)
        Return the arc cosine (measured in radians) of x.

Import specific items from a library module to shorten programs.


  • Use from ... import ... to load only specific items from a library module.
  • Then refer to them directly without library name as prefix.

PYTHON

from math import cos, pi

print('cos(pi) is', cos(pi))

OUTPUT

cos(pi) is -1.0

Create an alias for a library module when importing it to shorten programs.


  • Use import ... as ... to give a library a short alias while importing it.
  • Then refer to items in the library using that shortened name.

PYTHON

import math as m

print('cos(pi) is', m.cos(m.pi))

OUTPUT

cos(pi) is -1.0
  • Allows less typing for long and/or frequently used packages.
    • E.g., the matplotlib.pyplot plotting package is often aliased as plt.
  • But can make programs harder to understand, since readers must learn your program’s aliases.

Exploring the Math Module

  1. What function from the math module can you use to calculate a square root without using sqrt?
  2. Since the library contains this function, why does sqrt exist?
  1. Using help(math) we see that we’ve got pow(x,y) in addition to sqrt(x), so we could use pow(x, 0.5) to find a square root.

  2. The sqrt(x) function is arguably more readable than pow(x, 0.5) when implementing equations. Readability is a cornerstone of good programming, so it makes sense to provide a special function for this specific common case.

    Also, the design of Python’s math library has its origin in the C standard, which includes both sqrt(x) and pow(x,y), so a little bit of the history of programming is showing in Python’s function names.

Locating the Right Module

You want to select a random character from a string:

PYTHON

bases = 'ACTTGCTTGAC'
  1. Which standard library module could help you?
  2. Which function would you select from that module? Are there alternatives?
  3. Try to write a program that uses the function.

While you can use help from within JupyterLab, searching the internet will often yield a faster answer when working with Python.

The random module seems like it could help.

The string has 11 characters, each having a positional index from 0 to 10. You could use the random.randrange or random.randint functions to get a random integer between 0 and 10, and then select the bases character at that index:

PYTHON

from random import randrange

random_index = randrange(len(bases))
print(bases[random_index])

or more compactly:

PYTHON

from random import randrange

print(bases[randrange(len(bases))])

Perhaps you found the random.sample function? It allows for slightly less typing but might be a bit harder to understand just by reading:

PYTHON

from random import sample

print(sample(bases, 1)[0])

The simplest and shortest solution is the random.choice function that does exactly what we want:

PYTHON

from random import choice

print(choice(bases))

Jigsaw Puzzle (Parson’s Problem) Programming Example

Rearrange the following statements so that a random DNA base is printed and its index in the string. Not all statements may be needed. Feel free to use/add intermediate variables.

PYTHON

bases="ACTTGCTTGAC"
import math
import random
___ = random.randrange(n_bases)
___ = len(bases)
print("random base ", bases[___], "base index", ___)

PYTHON

import math 
import random
bases = "ACTTGCTTGAC" 
n_bases = len(bases)
idx = random.randrange(n_bases)
print("random base", bases[idx], "base index", idx)

When Is Help Available?

When a colleague of yours types help(math), Python reports an error:

OUTPUT

NameError: name 'math' is not defined

What has your colleague forgotten to do?

Importing the math module (import math)

Importing With Aliases

  1. Fill in the blanks so that the program below prints 90.0.
  2. Rewrite the program so that it uses import without as.
  3. Which form do you find easier to read?

PYTHON

import math as m
angle = ____.degrees(____.pi / 2)
print(____)

PYTHON

import math as m
angle = m.degrees(m.pi / 2)
print(angle)

can be written as

PYTHON

import math
angle = math.degrees(math.pi / 2)
print(angle)

Since you just wrote the code and are familiar with it, you might actually find the first version easier to read. But when trying to read a huge piece of code written by someone else, or when getting back to your own huge piece of code after several months, non-abbreviated names are often easier, except where there are clear abbreviation conventions.

There Are Many Ways To Import Libraries!

Match the following print statements with the appropriate library calls.

Print commands:

  1. print("sin(pi/2) =", sin(pi/2))
  2. print("sin(pi/2) =", m.sin(m.pi/2))
  3. print("sin(pi/2) =", math.sin(math.pi/2))

Library calls:

  1. from math import sin, pi
  2. import math
  3. import math as m
  4. from math import *

  1. Library calls 1 and 4. In order to directly refer to sin and pi without the library name as prefix, you need to use the from ... import ... statement. Whereas library call 1 specifically imports the two functions sin and pi, library call 4 imports all functions in the math module.
  2. Library call 3. Here sin and pi are referred to with a shortened library name m instead of math. Library call 3 does exactly that using the import ... as ... syntax: it creates an alias for math in the form of the shortened name m.
  3. Library call 2. Here sin and pi are referred to with the regular library name math, so the regular import ... call suffices.

Note: although library call 4 works, importing all names from a module using a wildcard import is not recommended as it makes it unclear which names from the module are used in the code. In general it is best to make your imports as specific as possible and to only import what your code uses. In library call 1, the import statement explicitly tells us that the sin function is imported from the math module, but library call 4 does not convey this information.

Importing Specific Items

  1. Fill in the blanks so that the program below prints 90.0.
  2. Do you find this version easier to read than preceding ones?
  3. Why wouldn’t programmers always use this form of import?

PYTHON

____ math import ____, ____
angle = degrees(pi / 2)
print(angle)

PYTHON

from math import degrees, pi
angle = degrees(pi / 2)
print(angle)

Most likely you find this version easier to read since it’s less dense. The main reason not to use this form of import is to avoid name clashes. For instance, you wouldn’t import degrees this way if you also wanted to use the name degrees for a variable or function of your own. Or if you were to also import a function named degrees from another library.
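A minimal sketch of such a name clash: once we reuse the name degrees for a variable of our own, the imported function is no longer reachable under that name.

```python
from math import degrees, pi

print(degrees(pi))  # 180.0

degrees = 45  # reusing the name shadows the imported function
# calling degrees(pi) now would raise TypeError: 'int' object is not callable
```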

Reading Error Messages

  1. Read the code below and try to identify what the errors are without running it.
  2. Run the code, and read the error message. What type of error is it?

PYTHON

from math import log
log(0)

OUTPUT

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-d72e1d780bab> in <module>
      1 from math import log
----> 2 log(0)

ValueError: math domain error

  1. The logarithm of x is only defined for x > 0, so 0 is outside the domain of the function.
  2. You get an error of type ValueError, indicating that the function received an inappropriate argument value. The additional message “math domain error” makes it clearer what the problem is.
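If your program needs to cope with out-of-domain inputs, one option is to catch the ValueError. This is a sketch; safe_log is a name chosen here for illustration.

```python
from math import log

def safe_log(x):
    """Return log(x), or None when x is outside the domain of log."""
    try:
        return log(x)
    except ValueError:
        return None

print(safe_log(1))  # 0.0
print(safe_log(0))  # None
```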

Content from Reading tabular data


Last updated on 2023-04-20 | Edit this page

Overview

Questions

  • How can I read tabular data?

Objectives

  • Import the Pandas library.
  • Use Pandas to load a simple CSV data set.
  • Get some basic information about a Pandas DataFrame.

Key Points

  • Use the Pandas library to get basic statistics out of tabular data.
  • Use index_col to specify that a column’s values should be used as row headings.
  • Use DataFrame.info to find out more about a dataframe.
  • The DataFrame.columns variable stores information about the dataframe’s columns.
  • Use DataFrame.T to transpose a dataframe.
  • Use DataFrame.describe to get summary statistics about data.

Use the Pandas library to do statistics on tabular data.


  • Pandas is a widely-used Python library for statistics, particularly on tabular data.
  • Borrows many features from R’s dataframes.
    • A 2-dimensional table whose columns have names and potentially have different data types.
  • Load it with import pandas as pd. The alias pd is commonly used for Pandas.
  • Read a Comma Separated Values (CSV) data file with pd.read_csv.
    • Argument is the name of the file to be read.
    • Assign result to a variable to store the data that was read.

Data description


We are going to use part of the data published by Blackmore et al. (2017), The effect of upper-respiratory infection on transcriptomic changes in the CNS. The goal of the study was to determine the effect of an upper-respiratory infection on changes in RNA transcription occurring in the cerebellum and spinal cord post infection. Gender-matched, eight-week-old C57BL/6 mice were inoculated with saline or with Influenza A by intranasal route, and transcriptomic changes in the cerebellum and spinal cord tissues were evaluated by RNA-seq at days 0 (non-infected), 4 and 8.

The dataset is stored as a comma-separated values (CSV) file. Each row holds information for a single RNA expression measurement, and the first eleven columns represent:

Column Description
gene The name of the gene that was measured
sample The name of the sample the gene expression was measured in
expression The value of the gene expression
organism The organism/species - here all data stem from mice
age The age of the mouse (all mice were 8 weeks here)
sex The sex of the mouse
infection The infection state of the mouse, i.e. infected with Influenza A or not infected.
strain The mouse strain; C57BL/6 in all cases.
time The duration of the infection (in days).
tissue The tissue that was used for the gene expression experiment, i.e. cerebellum or spinal cord.
mouse The mouse unique identifier.

We will be seeing different parts of this total dataset in different lessons, such as just the expression data or sample data. Later, we will learn how to convert between these different data layouts. For this lesson, we will only be using a subset of the total dataset.

PYTHON

import pandas as pd

url = "https://raw.githubusercontent.com/ccb-hms/workbench-python-workshop/main/episodes/data/rnaseq_reduced.csv"
rnaseq_df = pd.read_csv(url)
print(rnaseq_df)

OUTPUT

         gene      sample  expression  time
0         Asl  GSM2545336        1170     8
1        Apod  GSM2545336       36194     8
2     Cyp2d22  GSM2545336        4060     8
3        Klk6  GSM2545336         287     8
4       Fcrls  GSM2545336          85     8
...       ...         ...         ...   ...
7365    Mgst3  GSM2545340        1563     4
7366   Lrrc52  GSM2545340           2     4
7367     Rxrg  GSM2545340          26     4
7368    Lmx1a  GSM2545340          81     4
7369     Pbx1  GSM2545340        3805     4

[7370 rows x 4 columns]

  • The columns in a dataframe are the observed variables, and the rows are the observations.
  • Pandas uses backslash \ to show wrapped lines when output is too wide to fit the screen.

Use index_col to specify that a column’s values should be used as row headings.


  • Row headings are numbers (0 to 7369 in this case).
  • Really want to index by gene.
  • Pass the name of the column to read_csv as its index_col parameter to do this.

PYTHON

rnaseq_df = pd.read_csv(url, index_col='gene')
print(rnaseq_df)

OUTPUT

             sample  expression  time
gene
Asl      GSM2545336        1170     8
Apod     GSM2545336       36194     8
Cyp2d22  GSM2545336        4060     8
Klk6     GSM2545336         287     8
Fcrls    GSM2545336          85     8
...             ...         ...   ...
Mgst3    GSM2545340        1563     4
Lrrc52   GSM2545340           2     4
Rxrg     GSM2545340          26     4
Lmx1a    GSM2545340          81     4
Pbx1     GSM2545340        3805     4

[7370 rows x 3 columns]

Use the DataFrame.info() method to find out more about a dataframe.


PYTHON

rnaseq_df.info()

OUTPUT

<class 'pandas.core.frame.DataFrame'>
Index: 7370 entries, Asl to Pbx1
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   sample      7370 non-null   object
 1   expression  7370 non-null   int64
 2   time        7370 non-null   int64
dtypes: int64(2), object(1)
memory usage: 230.3+ KB

  • This is a DataFrame.
  • It has 7370 rows, with the first row name being Asl and the last being Pbx1.
  • It has 3 columns, two of which are 64-bit integer values.
    • We will talk later about null values, which are used to represent missing observations.
  • Uses 230.3 KB of memory.
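To get just the dimensions or the column types programmatically, rather than as printed text, you can use the shape and dtypes members. A small sketch on a toy frame (the values here are made up for illustration):

```python
import pandas as pd

# a tiny stand-in frame with the same columns as rnaseq_df
df = pd.DataFrame({
    "sample": ["GSM2545336", "GSM2545340"],
    "expression": [1170, 3805],
    "time": [8, 4],
})

print(df.shape)   # (rows, columns) as a tuple
print(df.dtypes)  # the type of each column
```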

The DataFrame.columns variable stores information about the dataframe’s columns.


  • Note that this is data, not a method. (It doesn’t have parentheses.)
    • Like math.pi.
    • So do not use () to try to call it.
  • Called a member variable, or just member.

PYTHON

print(rnaseq_df.columns)

OUTPUT

Index(['sample', 'expression', 'time'], dtype='object')

Use DataFrame.T to transpose a dataframe.


  • Sometimes want to treat columns as rows and vice versa.
  • Transpose (written .T) doesn’t copy the data, just changes the program’s view of it.
  • Like columns, it is a member variable.

PYTHON

print(rnaseq_df.T)

OUTPUT

gene               Asl        Apod     Cyp2d22        Klk6       Fcrls  \
sample      GSM2545336  GSM2545336  GSM2545336  GSM2545336  GSM2545336
expression        1170       36194        4060         287          85
time                 8           8           8           8           8

gene            Slc2a4        Exd2        Gjc2        Plp1        Gnb4  ...  \
sample      GSM2545336  GSM2545336  GSM2545336  GSM2545336  GSM2545336  ...
expression         782        1619         288       43217        1071  ...
time                 8           8           8           8           8  ...

gene            Dusp27        Mael     Gm16418     Gm16701     Aldh9a1  \
sample      GSM2545340  GSM2545340  GSM2545340  GSM2545340  GSM2545340
expression          20           4          15         149        1462
time                 4           4           4           4           4

gene             Mgst3      Lrrc52        Rxrg       Lmx1a        Pbx1
sample      GSM2545340  GSM2545340  GSM2545340  GSM2545340  GSM2545340
expression        1563           2          26          81        3805
time                 4           4           4           4           4

[3 rows x 7370 columns]

Use DataFrame.describe() to get summary statistics about data.


DataFrame.describe() gets the summary statistics of only the columns that have numerical data. All other columns are ignored, unless you use the argument include='all'.

PYTHON

print(rnaseq_df.describe())

OUTPUT

          expression         time
count    7370.000000  7370.000000
mean     1774.729308     3.200000
std      4402.155716     2.993529
min         0.000000     0.000000
25%        65.000000     0.000000
50%       515.500000     4.000000
75%      1821.750000     4.000000
max    101241.000000     8.000000
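To include the non-numeric columns as well, pass include='all' to describe. A sketch on a toy frame (the values here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "sample": ["GSM2545336", "GSM2545340", "GSM2545340"],
    "expression": [1170, 3805, 26],
})

# include='all' adds statistics such as unique, top, and freq
# for the non-numeric columns
print(df.describe(include="all"))
```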

Inspecting Data

Use help(rnaseq_df.head) and help(rnaseq_df.tail) to find out what DataFrame.head and DataFrame.tail do.

  1. What method call will display the first three rows of this data?
  2. Using head and/or tail, how would you display the last three columns of this data? (Hint: you may need to change your view of the data.)
  1. We can check out the first five rows of rnaseq_df by executing rnaseq_df.head() (allowing us to view the head of the DataFrame). We can specify the number of rows we wish to see by specifying the parameter n in our call to rnaseq_df.head(). To view the first three rows, execute:

    PYTHON

    print(rnaseq_df.head(n=3))

    OUTPUT

                  sample  expression  time
     gene
     Asl      GSM2545336        1170     8
     Apod     GSM2545336       36194     8
     Cyp2d22  GSM2545336        4060     8
  2. To check out the last three rows of rnaseq_df, we would use the command, rnaseq_df.tail(n=3), analogous to head() used above. However, here we want to look at the last three columns so we need to change our view and then use tail(). To do so, we create a new DataFrame in which rows and columns are switched:

    PYTHON

    rnaseq_df_flipped = rnaseq_df.T

    We can then view the last three columns of rnaseq_df by viewing the last three rows of rnaseq_df_flipped:

    PYTHON

    print(rnaseq_df_flipped.tail(n=3))

    OUTPUT

     gene               Asl        Apod     Cyp2d22  ...        Rxrg       Lmx1a        Pbx1
     sample      GSM2545336  GSM2545336  GSM2545336  ...  GSM2545340  GSM2545340  GSM2545340
     expression        1170       36194        4060  ...          26          81        3805
     time                 8           8           8  ...           4           4           4

     [3 rows x 7370 columns]

    This shows the data that we want, but we may prefer to display three columns instead of three rows, so we can flip it back:

    PYTHON

    print(rnaseq_df_flipped.tail(n=3).T)

    OUTPUT

                 sample  expression  time
    gene
    Asl      GSM2545336        1170     8
    Apod     GSM2545336       36194     8
    Cyp2d22  GSM2545336        4060     8
    Klk6     GSM2545336         287     8
    Fcrls    GSM2545336          85     8
    ...             ...         ...   ...
    Mgst3    GSM2545340        1563     4
    Lrrc52   GSM2545340           2     4
    Rxrg     GSM2545340          26     4
    Lmx1a    GSM2545340          81     4
    Pbx1     GSM2545340        3805     4

    Note: we could have done the above in a single line of code by ‘chaining’ the commands:

    PYTHON

    rnaseq_df.T.tail(n=3).T

Writing Data

As well as the read_csv function for reading data from a file, Pandas provides a to_csv function to write dataframes to files. Applying what you’ve learned about reading from files, write one of your dataframes to a file called processed.csv. You can use help to get information on how to use to_csv.

In order to write the DataFrame rnaseq_df to a file called processed.csv, execute the following command:

PYTHON

rnaseq_df.to_csv('processed.csv')

For help on to_csv, you could execute, for example:

PYTHON

help(rnaseq_df.to_csv)

Note that help(to_csv) throws an error! This is because to_csv is not a standalone function but a method of the DataFrame, so the actual call is rnaseq_df.to_csv.
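A related detail worth knowing: when to_csv is called without a file path, it returns the CSV text as a string, which is handy for quick inspection without creating a file.

```python
import pandas as pd

df = pd.DataFrame({"gene": ["Asl", "Apod"], "expression": [1170, 36194]})

# with no path argument, to_csv returns the CSV as a string
print(df.to_csv(index=False))
```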

Content from Managing Python Environments


Last updated on 2023-04-20 | Edit this page

Overview

Questions

  • How do I manage different sets of packages?
  • How do I install new packages?

Objectives

  • Understand how Conda environments can improve your research workflow.
  • Create a new environment.
  • Activate (deactivate) a particular environment.
  • Install packages into existing environments using Conda (+pip).
  • Specify the installation location of an environment.
  • List all of the existing environments on your machine.
  • List all of the installed packages within a particular environment.
  • Delete an entire environment.

Key Points

  • A Conda environment is a directory that contains a specific collection of Conda packages that you have installed.
  • You create (remove) a new environment using the conda create (conda remove) commands.
  • You activate (deactivate) an environment using the conda activate (conda deactivate) commands.
  • You install packages into environments using conda install; you install packages into an active environment using pip install.
  • You should install each environment as a sub-directory inside its corresponding project directory
  • Use the conda env list command to list existing environments and their respective locations.
  • Use the conda list command to list all of the packages installed in an environment.

What is a Conda environment


A Conda environment is a directory that contains a specific collection of Conda packages that you have installed. For example, you may be working on a research project that requires NumPy 1.18 and its dependencies, while another environment associated with a finished project has NumPy 1.12 (perhaps because version 1.12 was the most current version of NumPy at the time the project finished). If you change one environment, your other environments are not affected. You can easily activate or deactivate environments, which is how you switch between them.

Avoid installing packages into your base Conda environment

Conda has a default environment called base that includes a Python installation and some core system libraries and dependencies of Conda. It is a “best practice” to avoid installing additional packages into your base software environment. Additional packages needed for a new project should always be installed into a newly created Conda environment.

Working from the command line

So far, we have been working inside Jupyter Lab. However, we now want to manipulate the Python kernels and will be working from outside Jupyter.

For installing packages and manipulating conda environments, we will be working from the terminal or Anaconda prompt.

macOS - Command Line

To manage Conda environments you will need to access the command line through the Terminal. There are two ways to open Terminal on Mac.

  1. In your Applications folder, open Utilities and double-click on Terminal
  2. Press Command + spacebar to launch Spotlight. Type Terminal and then double-click the search result or hit Enter

After you have launched Terminal, run conda init.

BASH

$ conda init

You should now see (base) at the start of your terminal line.

BASH

(base) $ 

Windows Users - Command Line

To manage Conda environments on Windows you will need to access the Anaconda Prompt.

Press Windows Logo Key and search for Anaconda Prompt, click the result or press enter.

After you have launched the Anaconda Prompt, run conda init.

BASH

$ conda init

You should now see (base) at the start of your terminal line.

BASH

(base)

Creating environments


To create a new environment for Python development using conda you can use the conda create command.

BASH

$ conda create --name python3-env python

For a list of all commands, take a look at Conda general commands.

It is a good idea to give your environment a meaningful name in order to help yourself remember the purpose of the environment. While naming things can be difficult, $PROJECT_NAME-env is a good convention to follow. The specific version of a package that prompted you to create the new environment can also make a good name.

The command above will create a new Conda environment called “python3-env” and install the most recent version of Python. If you wish, you can specify a particular version of packages for conda to install when creating the environment.

BASH

$ conda create --name python36-env python=3.6

Always specify a version number for each package you wish to install

In order to make your results more reproducible and to make it easier for research colleagues to recreate your Conda environments on their machines it is a “best practice” to always explicitly specify the version number for each package that you install into an environment. If you are not sure exactly which version of a package you want to use, you can see which versions are available with the conda search command.

BASH

$ conda search $PACKAGE_NAME

So, for example, if you wanted to see which versions of Scikit-learn, a popular Python library for machine learning, were available, you would run the following.

BASH

$ conda search scikit-learn

As always you can run conda search --help to learn about available options.

You can create a Conda environment and install multiple packages by listing the packages that you wish to install.

BASH

$ conda create --name basic-scipy-env ipython=7.13 matplotlib=3.1 numpy=1.18 scipy=1.4

When conda installs a package into an environment it also installs any required dependencies. For example, even though Python is not listed as a package to install into the basic-scipy-env environment above, conda will still install Python into the environment because it is a required dependency of at least one of the listed packages.

Creating a new environment

Create a new environment called “machine-learning-env” with Python and the most current versions of IPython, Matplotlib, Pandas, Numba and Scikit-Learn.

In order to create a new environment you use the conda create command as follows.

BASH

$ conda create --name machine-learning-env \
 ipython \
 matplotlib \
 pandas \
 python \
 scikit-learn \
 numba

Since no version numbers are provided for any of the Python packages, Conda will download the most current, mutually compatible versions of the requested packages. However, since it is best practice to always provide explicit version numbers, you should prefer the following solution.

BASH

$ conda create --name machine-learning-env \
 ipython=7.19 \
 matplotlib=3.3 \
 pandas=1.2 \
 python=3.8 \
 scikit-learn=0.23 \
 numba=0.51

However, please be aware that the version numbers for each package may not be the latest available and would need to be adjusted.

Activating an existing environment


Activating environments is essential to making the software in environments work well (or sometimes at all!). Activation of an environment does two things.

  1. Adds entries to PATH for the environment.
  2. Runs any activation scripts that the environment may contain.

Step 2 is particularly important as activation scripts are how packages can set arbitrary environment variables that may be necessary for their operation. You can activate the basic-scipy-env environment by name using the activate command.

BASH

$ conda activate basic-scipy-env

You can see that an environment has been activated because the shell prompt will now include the name of the active environment.

BASH

(basic-scipy-env) $

Deactivate the current environment


To deactivate the currently active environment use the Conda deactivate command as follows.

BASH

(basic-scipy-env) $ conda deactivate

You can see that an environment has been deactivated because the shell prompt will no longer include the name of the previously active environment.

BASH

$

Returning to the base environment

To return to the base Conda environment, it’s better to call conda activate with no environment specified, rather than to use deactivate. If you run conda deactivate from your base environment, you may lose the ability to run conda commands at all. Don’t worry if you encounter this undesirable state! Just start a new shell.

Activate an existing environment by name

Activate the machine-learning-env environment created in the previous challenge by name.

In order to activate an existing environment by name you use the conda activate command as follows.

BASH

$ conda activate machine-learning-env

Deactivate the active environment

Deactivate the machine-learning-env environment that you activated in the previous challenge.

In order to deactivate the active environment you use the conda deactivate command.

BASH

(active-environment-name) $ conda deactivate

Installing a package into an existing environment


You can install a package into an existing environment using the conda install command. This command accepts a list of package specifications (e.g., numpy=1.18) and installs a set of packages consistent with those specifications and compatible with the underlying environment. If full compatibility cannot be assured, an error is reported and the environment is not changed.

By default the conda install command will install packages into the current, active environment. The following would activate the basic-scipy-env we created above and install Numba, an open source JIT compiler that translates a subset of Python and NumPy code into fast machine code, into the active environment.

BASH

$ conda activate basic-scipy-env
$ conda install numba

As was the case when listing packages to install when using the conda create command, if version numbers are not explicitly provided, Conda will attempt to install the newest versions of any requested packages. To accomplish this, Conda may need to update some packages that are already installed or install additional packages. It is always a good idea to explicitly provide version numbers when installing packages with the conda install command. For example, the following would install a particular version of Scikit-Learn, into the current, active environment.

BASH

$ conda install scikit-learn=0.22

Using your environment in Jupyter

We need to perform a few extra steps to make our conda environments available in Jupyter Lab.

First, we’re going to break convention and install the nb_conda_kernels package into our base environment.

BASH

$ conda activate
$ conda install nb_conda_kernels

Now, for any environment that we want to be available in Jupyter Lab, we simply install the ipykernel package into it.

BASH

$ conda activate basic-scipy-env
$ conda install ipykernel

We then want to go back to our base environment and start Jupyter Lab:

BASH

$ conda activate
$ jupyter lab

Now, when we launch Jupyter Lab our environments should be available as kernels. You can change your kernel in the top-right of the screen:

Changing the kernel

And you should see your conda environments available:

Selecting conda kernels in Jupyter Lab
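If you are ever unsure which environment a running notebook is using, a quick check from inside any Python kernel is to print the interpreter path; the environment's name will appear somewhere in that path.

```python
import sys

# the running interpreter lives inside the environment's directory,
# so its path reveals which environment the kernel belongs to
print(sys.executable)
```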

Freezing installed packages

To prevent existing packages from being updated when using the conda install command, you can use the --freeze-installed option. This may force Conda to install older versions of the requested packages in order to maintain compatibility with previously installed packages. Using the --freeze-installed option does not prevent additional dependency packages from being installed.

Where do Conda environments live?


Environments created with conda, by default, live in the envs/ folder of your miniconda3 (or anaconda3) directory, the absolute path to which will look something like the following: /Users/$USERNAME/miniconda3/envs or C:\Users\$USERNAME\Anaconda3\envs.

Running ls (Linux/macOS) or dir (Windows) on your Anaconda envs/ directory will list out the directories containing the existing Conda environments.

How do I specify a location for a Conda environment?


You can control where a Conda environment lives by providing a path to a target directory when creating the environment. For example, the following command will create a new environment in a sub-directory of the current working directory called env.

BASH

$ conda create --prefix ./env ipython=7.13 matplotlib=3.1 pandas=1.0 python=3.6

You activate an environment created with a prefix using the same command used to activate environments created by name.

BASH

$ conda activate ./env

It is often a good idea to specify a path to a sub-directory of your project directory when creating an environment. Why?

  1. Makes it easy to tell if your project utilizes an isolated environment by including the environment as a sub-directory.
  2. Makes your project more self-contained as everything including the required software is contained in a single project directory.

An additional benefit of creating your project’s environment inside a sub-directory is that you can then use the same name for all your environments; if you keep all of your environments in your ~/miniconda3/env/ folder, you’ll have to give each of them a different name.

Listing existing environments


Now that you have created a number of Conda environments on your local machine you have probably forgotten the names of all of the environments and exactly where they live. Fortunately, there is a conda command to list all of your existing environments together with their locations.

BASH

$ conda env list

Listing the contents of an environment


In addition to forgetting names and locations of Conda environments, at some point you will probably forget exactly what has been installed in a particular Conda environment. Again, there is a conda command for listing the contents of an environment. To list the contents of the basic-scipy-env that you created above, run the following command.

BASH

$ conda list --name basic-scipy-env

If you created your Conda environment using the --prefix option to install packages into a particular directory, then you will need to use that prefix in order for conda to locate the environment on your machine.

BASH

$ conda list --prefix /path/to/conda-env

Listing the contents of a particular environment.

List the packages installed in the machine-learning-env environment that you created in a previous challenge.

You can list the packages and their versions installed in machine-learning-env using the conda list command as follows.

BASH

$ conda list --name machine-learning-env

To list the packages and their versions installed in the active environment leave off the --name or --prefix option.

BASH

$ conda list

Deleting entire environments


Occasionally, you will want to delete an entire environment. Perhaps you were experimenting with conda commands and you created an environment you have no intention of using; perhaps you no longer need an existing environment and just want to get rid of cruft on your machine. Whatever the reason, the command to delete an environment is the following.

BASH

$ conda remove --name my-first-conda-env --all

If you wish to delete an environment that you created with a --prefix option, then you will need to provide the prefix again when removing the environment.

BASH

$ conda remove --prefix /path/to/conda-env/ --all

Delete an entire environment

Delete the entire “basic-scipy-env” environment.

In order to delete an entire environment you use the conda remove command as follows.

BASH

$ conda remove --name basic-scipy-env --all --yes

This command will remove all packages from the named environment before removing the environment itself. The use of the --yes flag short-circuits the confirmation prompt (and should be used with caution).

Content from Dictionaries


Last updated on 2023-04-20 | Edit this page

Overview

Questions

  • How is a dictionary defined in Python?
  • What are the ways to interact with a dictionary?
  • Can a dictionary be nested?

Objectives

  • Understanding the structure of a dictionary.
  • Accessing data from a dictionary.
  • Practising nested dictionaries to deal with complex data.

Key Points

  • Dictionaries associate a set of values with a number of keys.
  • keys are used to access the values of a dictionary.
  • Dictionaries are mutable.
  • Nested dictionaries are constructed to organise data in a hierarchical fashion.
  • Some of the useful methods to work with dictionaries are: .items(), .get()

Dictionary


One of the most useful built-in tools in Python, dictionaries associate a set of values with a number of keys.

Think of an old-fashioned paperback dictionary where we have a range of words with their definitions. The words are the keys, and the definitions are the values that are associated with the keys. A Python dictionary works in the same way.

Consider the following scenario:

Suppose we have a number of protein kinases, and we would like to associate them with their descriptions for future reference.

This is an example of an associative array. We may visualise this problem as displayed below:

Illustrative diagram of associative arrays, showing the sets of keys and their association with some of the values.

One way to associate the proteins with their definitions would be to use nested arrays. However, this would make it difficult to retrieve the values at a later time, because to retrieve a value we would need to know the index at which a given protein is stored.
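To see the problem concretely, here is a sketch of the nested-array approach: retrieving a description requires remembering the numeric position of each protein.

```python
# nested lists: each inner list pairs a protein with its description
kinases = [
    ['PKA', 'Involved in regulation of glycogen, sugar, and lipid metabolism.'],
    ['PKC', 'Regulates signal transduction pathways such as the Wnt pathway.'],
]

# we must know that PKC sits at index 1 to look up its description
print(kinases[1][1])
```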

Instead of using normal arrays, in such circumstances, we use associative arrays. The most popular way to construct an associative array in Python is to create a dictionary, or dict.

Remember

To implement a dict in Python, we place our entries in curly brackets, separated by commas. We separate keys and values using a colon, e.g. {'key': 'value'}. The combination of a dictionary key and its associated value is known as a dictionary item.

Note

When constructing a long dict with several items that span several lines, it is not necessary to write one item per line or use indentation for each item or line. All we must do is write each item as {'key': 'value'} inside curly brackets and separate each pair with a comma. However, it is good practice to write one item per line and use indentation, as this makes it considerably easier to read the code and understand the hierarchy.

We can therefore implement the diagram displayed above in Python as follows:

PYTHON


protein_kinases = {
  'PKA': 'Involved in regulation of glycogen, sugar, and lipid metabolism.',
  'PKC': 'Regulates signal transduction pathways such as the Wnt pathway.',
  'CK1': 'Controls the function of other proteins through phosphorylation.'
  }
	
print(protein_kinases)

OUTPUT

{'PKA': 'Involved in regulation of glycogen, sugar, and lipid metabolism.', 'PKC': 'Regulates signal transduction pathways such as the Wnt pathway.', 'CK1': 'Controls the function of other proteins through phosphorylation.'}

PYTHON

print(type(protein_kinases))

OUTPUT

<class 'dict'>

Constructing dictionaries

Use the Universal Protein Resource (UniProt) to find the following human proteins:

  • Axin-1
  • Rhodopsin

Construct a dictionary for these proteins and the number of amino acids in each of them. The keys should represent the names of the proteins. Display the result.

PYTHON

proteins = {
  'Axin-1': 862,
  'Rhodopsin': 348
  }
		
print(proteins)

OUTPUT

{'Axin-1': 862, 'Rhodopsin': 348}

Now that we have created a dictionary, we can test whether or not a specific key exists in our dictionary:

PYTHON

'CK1' in protein_kinases

OUTPUT

True

PYTHON

'GSK3' in protein_kinases

OUTPUT

False

Using in

Using the proteins dictionary you created in the above challenge, test whether or not a protein called ERK exists as a key in your dictionary. Display the result as a Boolean value.

PYTHON

print('ERK' in proteins)

OUTPUT

False

Interacting with a dictionary

We have already learnt that in programming, the more explicit our code, the better it is. Interacting with dictionaries in Python is very easy, coherent, and explicit. This makes them a powerful tool that we can exploit for different purposes.

In lists and tuples, we use indexing and slicing to retrieve values. In dictionaries, however, we use keys to do that. Because we can define the keys of a dictionary ourselves, we no longer have to rely exclusively on numeric indices.

As a result, we can retrieve the values of a dictionary using their respective keys as follows:

PYTHON

print(protein_kinases['CK1'])

OUTPUT

Controls the function of other proteins through phosphorylation.

However, if we attempt to retrieve the value for a key that does not exist in our dict, a KeyError will be raised:

PYTHON

print(protein_kinases['GSK3'])

ERROR

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'GSK3'

Dictionary lookup

Implement a dict to represent the following set of information:

Cystic Fibrosis:

Full Name                                             Gene   Type
Cystic fibrosis transmembrane conductance regulator   CFTR   Membrane Protein

Using the dictionary you implemented, retrieve and display the gene associated with cystic fibrosis.

PYTHON

cystic_fibrosis = {
  'full name': 'Cystic fibrosis transmembrane conductance regulator',
  'gene': 'CFTR',
  'type': 'Membrane Protein'
  }
		
print(cystic_fibrosis['gene'])

OUTPUT

CFTR

Remember

Whilst the values in a dict can be of virtually any type supported in Python, the keys may only be defined using immutable types such as string, int, or tuple. Additionally, the keys in a dictionary must be unique.

If we attempt to construct a dict using a mutable value as key, a TypeError will be raised.

For instance, list is a mutable type and therefore cannot be used as a key:

PYTHON

test_dict = {
  ['a', 'b']: 'some value'
  }

ERROR

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'

But we can use any immutable type as a key:

PYTHON

test_dict = {
  'ab': 'some value'
  }
  
print(test_dict)

OUTPUT

{'ab': 'some value'}

PYTHON

test_dict = {
  ('a', 'b'): 'some value'
  }
  
print(test_dict)  

OUTPUT

{('a', 'b'): 'some value'}

If we define a key more than once, the Python interpreter constructs the entry in the dict using the last instance.

In the following example, we repeat the key 'pathway' twice; and as expected, the interpreter only uses the last instance, which in this case represents the value 'Canonical':

PYTHON

signal = {
  'name': 'Wnt', 
  'pathway': 'Non-Canonical',  # first instance
  'pathway': 'Canonical'  # second instance
  }

print(signal)

OUTPUT

{'name': 'Wnt', 'pathway': 'Canonical'}

Dictionaries are mutable

Dictionaries are mutable. This means that we can alter their contents. We can make any alterations to a dictionary as long as we use immutable values for the keys.

Suppose we have a dictionary stored in a variable called protein, holding some information about a specific protein:

PYTHON

protein = {
  'full name': 'Cystic fibrosis transmembrane conductance regulator', 
  'alias': 'CFTR',
  'gene': 'CFTR',
  'type': 'Membrane Protein',
  'common mutations': ['Delta-F508', 'G542X', 'G551D', 'N1303K']
  }

We can add new items to our dictionary or alter the existing ones:

PYTHON

# Adding a new item:
protein['chromosome'] = 7
	
print(protein)

print(protein['chromosome'])

OUTPUT

{'full name': 'Cystic fibrosis transmembrane conductance regulator', 'alias': 'CFTR', 'gene': 'CFTR', 'type': 'Membrane Protein', 'common mutations': ['Delta-F508', 'G542X', 'G551D', 'N1303K'], 'chromosome': 7}
7

We can also alter an existing value in a dictionary using its key. To do so, we simply access the value using its key, and treat it as a normal variable; i.e. the same way we do with members of a list:

PYTHON

print(protein['common mutations'])

OUTPUT

['Delta-F508', 'G542X', 'G551D', 'N1303K']

PYTHON

protein['common mutations'].append('W1282X')
print(protein)

OUTPUT

{'full name': 'Cystic fibrosis transmembrane conductance regulator', 'alias': 'CFTR', 'gene': 'CFTR', 'type': 'Membrane Protein', 'common mutations': ['Delta-F508', 'G542X', 'G551D', 'N1303K', 'W1282X'], 'chromosome': 7}

Altering values

Implement the following dictionary:

signal = {'name': 'Wnt', 'pathway': 'Non-Canonical'}
	

with respect to signal:

  • Correct the value of pathway to “Canonical”;
  • Add a new item to the dictionary to represent the receptors for the canonical pathway as “Frizzled” and “LRP”.

Display the altered dictionary as the final result.

PYTHON

signal = {'name': 'Wnt', 'pathway': 'Non-Canonical'}
	
signal['pathway'] = 'Canonical'
signal['receptors'] = ('Frizzled', 'LRP')
	
print(signal)

OUTPUT

{'name': 'Wnt', 'pathway': 'Canonical', 'receptors': ('Frizzled', 'LRP')}

Advanced Topic

Displaying an entire dictionary using the print() function can look a little messy because it is not properly structured. There is, however, a module called pprint (Pretty-Print) in Python's standard library that behaves in a very similar way to the default print() function, but structures dictionaries and other arrays in a more presentable way before displaying them. Because pprint is part of the standard library, it is installed with Python automatically. We do not discuss Pretty-Print further in this course; to learn more about it, have a read through the official documentation for the library and review the examples.
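As a brief illustration, here is a minimal sketch of pprint applied to a dictionary similar to the protein example above:

```python
from pprint import pprint

protein = {
    'full name': 'Cystic fibrosis transmembrane conductance regulator',
    'alias': 'CFTR',
    'gene': 'CFTR',
    'type': 'Membrane Protein',
    'common mutations': ['Delta-F508', 'G542X', 'G551D', 'N1303K']
}

# pprint displays one item per line when the dictionary
# does not fit on a single line:
pprint(protein)
```

Compared with print(protein), which emits the whole dictionary on one line, pprint places each item on its own line, which is much easier to scan for long dictionaries.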

Because keys are immutable, they cannot be altered in place. However, we can get around this limitation by introducing a new key and assigning the values of the old key to the new one. Once we do that, we can go ahead and remove the old item. The easiest way to remove an item from a dictionary is the del statement:

PYTHON

# Creating a new key and assigning to it the 
# values of the old key:
protein['human chromosome'] = protein['chromosome']

print(protein)

OUTPUT

{'full name': 'Cystic fibrosis transmembrane conductance regulator', 'alias': 'CFTR', 'gene': 'CFTR', 'type': 'Membrane Protein', 'common mutations': ['Delta-F508', 'G542X', 'G551D', 'N1303K', 'W1282X'], 'chromosome': 7, 'human chromosome': 7}

PYTHON

# Now we remove the old item from the dictionary:
del protein['chromosome']

print(protein)

OUTPUT

{'full name': 'Cystic fibrosis transmembrane conductance regulator', 'alias': 'CFTR', 'gene': 'CFTR', 'type': 'Membrane Protein', 'common mutations': ['Delta-F508', 'G542X', 'G551D', 'N1303K', 'W1282X'], 'human chromosome': 7}

We can simplify the above operation using the .pop() method, which removes the specified key from a dictionary and returns any values associated with it:

PYTHON

protein['common mutations in caucasians'] = protein.pop('common mutations')

print(protein)

OUTPUT

{'full name': 'Cystic fibrosis transmembrane conductance regulator', 'alias': 'CFTR', 'gene': 'CFTR', 'type': 'Membrane Protein', 'human chromosome': 7, 'common mutations in caucasians': ['Delta-F508', 'G542X', 'G551D', 'N1303K', 'W1282X']}

Reassigning values

Implement a dictionary as:

PYTHON

signal = {'name': 'Beta-Galactosidase', 'pdb': '4V40'}

with respect to signal:

  • Change the key name from 'pdb' to 'pdb id' using the .pop() method.

  • Write code to find out whether the dictionary:

    • contains the new key (i.e. 'pdb id');
    • no longer contains the old key (i.e. 'pdb').

If both conditions are met, display:

Contains the new key, but not the old one.

Otherwise:

Failed to alter the dictionary.

PYTHON

signal = {
    'name': 'Beta-Galactosidase',
    'pdb': '4V40'
}
	
signal['pdb id'] = signal.pop('pdb')
	
if 'pdb id' in signal and 'pdb' not in signal:
    print('Contains the new key, but not the old one.')
else:
    print('Failed to alter the dictionary.')
    

OUTPUT

Contains the new key, but not the old one.

Useful methods for dictionary

Now we use some snippets to demonstrate some of the useful methods associated with dict in Python.

Given a dictionary as:

PYTHON

lac_repressor = {
    'pdb id': '1LBI',
    'deposit data': '1996-02-17',
    'organism': 'Escherichia coli',
    'method': 'x-ray',
    'resolution': 2.7,
}

We can create an array of all items in the dictionary using the .items() method:

PYTHON

print(lac_repressor.items())

OUTPUT

dict_items([('pdb id', '1LBI'), ('deposit data', '1996-02-17'), ('organism', 'Escherichia coli'), ('method', 'x-ray'), ('resolution', 2.7)])

The .items() method returns an array of tuples. Each tuple consists of 2 members and is structured as ('key', 'value'). On that account, we can use its output in the context of a for-loop as follows:

PYTHON

for key, value in lac_repressor.items():
    print(key, value, sep=': ')

OUTPUT

pdb id: 1LBI
deposit data: 1996-02-17
organism: Escherichia coli
method: x-ray
resolution: 2.7
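Alongside .items(), dictionaries also provide the .keys() and .values() methods, which are not used in the examples above but follow the same pattern; a brief sketch using the lac_repressor dictionary:

```python
lac_repressor = {
    'pdb id': '1LBI',
    'deposit data': '1996-02-17',
    'organism': 'Escherichia coli',
    'method': 'x-ray',
    'resolution': 2.7,
}

# .keys() and .values() return dynamic "view" objects;
# wrap them in list() to obtain ordinary lists:
print(list(lac_repressor.keys()))
print(list(lac_repressor.values()))
```

Like .items(), both methods can be used directly in a for-loop without the call to list().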

We learned earlier that asking for a key that is not in the dict raises a KeyError. If we anticipate this, we can handle it using the .get() method. The method takes in the key and searches the dictionary for it. If found, the associated value is returned. Otherwise, the method returns None by default. We can also pass a second argument to .get() to be returned instead of None in cases where the requested key does not exist:

PYTHON

print(lac_repressor['gene'])

OUTPUT

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'gene'

PYTHON

print(lac_repressor.get('gene'))

OUTPUT

None

PYTHON

print(lac_repressor.get('gene', 'Not found...'))

OUTPUT

Not found...

Getting multiple values

Implement the lac_repressor dictionary and try to extract the values associated with the following keys:

  • organism
  • authors
  • subunits
  • method

If a key does not exist in the dictionary, display No entry instead.

Display the results in the following format:

organism: XXX
authors: XXX  

PYTHON

lac_repressor = {
    'pdb id': '1LBI',
    'deposit data': '1996-02-17',
    'organism': 'Escherichia coli',
    'method': 'x-ray',
    'resolution': 2.7,
}
	
requested_keys = ['organism', 'authors', 'subunits', 'method']

for key in requested_keys:
    print(key, lac_repressor.get(key, 'No entry'), sep=': ')

OUTPUT

organism: Escherichia coli
authors: No entry
subunits: No entry
method: x-ray

for-loops and dictionaries

Dictionaries and for-loops create a powerful combination. We can leverage the accessibility of dictionary values through specific keys that we define ourselves in a loop to extract data iteratively and repeatedly.

One of the most useful tools that we can create using nothing more than a for-loop and a dictionary, in only a few lines of code, is a sequence converter.

Here, we are essentially iterating through a sequence of DNA nucleotides (sequence), extracting one character per loop cycle from our string (nucleotide). We then use that character as a key to retrieve its corresponding value from our dictionary (dna2rna). Once we get the value, we add it to the variable that we initialised as an empty string outside the scope of our for-loop (rna_sequence). At the end of the process, the variable rna_sequence will contain a converted version of our sequence.

PYTHON

sequence = 'CCCATCTTAAGACTTCACAAGACTTGTGAAATCAGACCACTGCTCAATGCGGAACGCCCG'
	
dna2rna = {"A": "U", "T": "A", "C": "G", "G": "C"}
	
rna_sequence = str()  # Creating an empty string.
	
for nucleotide in sequence:
    rna_sequence += dna2rna[nucleotide]
	
print('DNA:', sequence)
print('RNA:', rna_sequence)

OUTPUT

DNA: CCCATCTTAAGACTTCACAAGACTTGTGAAATCAGACCACTGCTCAATGCGGAACGCCCG
RNA: GGGUAGAAUUCUGAAGUGUUCUGAACACUUUAGUCUGGUGACGAGUUACGCCUUGCGGGC
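As an aside, Python's built-in str.maketrans and str.translate methods perform the same character-by-character mapping more concisely; a minimal sketch equivalent to the loop above:

```python
sequence = 'CCCATCTTAAGACTTCACAAGACTTGTGAAATCAGACCACTGCTCAATGCGGAACGCCCG'

# str.maketrans builds a translation table from a mapping of characters:
dna2rna_table = str.maketrans({'A': 'U', 'T': 'A', 'C': 'G', 'G': 'C'})

# str.translate applies the table to every character at once:
rna_sequence = sequence.translate(dna2rna_table)
print('RNA:', rna_sequence)
```

This produces the same RNA sequence as the for-loop version, but the explicit loop makes the mechanism clearer while learning.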

Using dictionaries as maps

We know that in reverse transcription, RNA nucleotides are converted to their complementary DNA as shown:

Type   Direction   Nucleotides
RNA    5'…3'       U A G C
cDNA   5'…3'       A T C G

with that in mind:

  1. Use the table to construct a dictionary for reverse transcription, and another dictionary for the conversion of cDNA to DNA.

  2. Using the appropriate dictionary, convert the following mRNA (exon) sequence for human G protein-coupled receptor to its cDNA.

PYTHON

human_gpcr = (
    'AUGGAUGUGACUUCCCAAGCCCGGGGCGUGGGCCUGGAGAUGUACCCAGGCACCGCGCAGCCUGCGGCCCCCAACACCACCUC'
    'CCCCGAGCUCAACCUGUCCCACCCGCUCCUGGGCACCGCCCUGGCCAAUGGGACAGGUGAGCUCUCGGAGCACCAGCAGUACG'
    'UGAUCGGCCUGUUCCUCUCGUGCCUCUACACCAUCUUCCUCUUCCCCAUCGGCUUUGUGGGCAACAUCCUGAUCCUGGUGGUG'
    'AACAUCAGCUUCCGCGAGAAGAUGACCAUCCCCGACCUGUACUUCAUCAACCUGGCGGUGGCGGACCUCAUCCUGGUGGCCGA'
    'CUCCCUCAUUGAGGUGUUCAACCUGCACGAGCGGUACUACGACAUCGCCGUCCUGUGCACCUUCAUGUCGCUCUUCCUGCAGG'
    'UCAACAUGUACAGCAGCGUCUUCUUCCUCACCUGGAUGAGCUUCGACCGCUACAUCGCCCUGGCCAGGGCCAUGCGCUGCAGC'
    'CUGUUCCGCACCAAGCACCACGCCCGGCUGAGCUGUGGCCUCAUCUGGAUGGCAUCCGUGUCAGCCACGCUGGUGCCCUUCAC'
    'CGCCGUGCACCUGCAGCACACCGACGAGGCCUGCUUCUGUUUCGCGGAUGUCCGGGAGGUGCAGUGGCUCGAGGUCACGCUGG'
    'GCUUCAUCGUGCCCUUCGCCAUCAUCGGCCUGUGCUACUCCCUCAUUGUCCGGGUGCUGGUCAGGGCGCACCGGCACCGUGGG'
    'CUGCGGCCCCGGCGGCAGAAGGCGCUCCGCAUGAUCCUCGCGGUGGUGCUGGUCUUCUUCGUCUGCUGGCUGCCGGAGAACGU'
    'CUUCAUCAGCGUGCACCUCCUGCAGCGGACGCAGCCUGGGGCCGCUCCCUGCAAGCAGUCUUUCCGCCAUGCCCACCCCCUCA'
    'CGGGCCACAUUGUCAACCUCACCGCCUUCUCCAACAGCUGCCUAAACCCCCUCAUCUACAGCUUUCUCGGGGAGACCUUCAGG'
    'GACAAGCUGAGGCUGUACAUUGAGCAGAAAACAAAUUUGCCGGCCCUGAACCGCUUCUGUCACGCUGCCCUGAAGGCCGUCAU'
    'UCCAGACAGCACCGAGCAGUCGGAUGUGAGGUUCAGCAGUGCCGUG'
)

Q1:

PYTHON

mrna2cdna = {
    'U': 'A',
    'A': 'T',
    'G': 'C',
    'C': 'G'
}
		
cdna2dna = {
    'A': 'T',
    'T': 'A',
    'C': 'G',
    'G': 'C'
}

Q2:

PYTHON

cdna = str()
for nucleotide in human_gpcr:
    cdna += mrna2cdna[nucleotide]
		
print(cdna)

OUTPUT

TACCTACACTGAAGGGTTCGGGCCCCGCACCCGGACCTCTACATGGGTCCGTGGCGCGTCGGACGCCGGGGGTTGTGGTGGAGGGGGCTCGAGTTGGACAGGGTGGGCGAGGACCCGTGGCGGGACCGGTTACCCTGTCCACTCGAGAGCCTCGTGGTCGTCATGCACTAGCCGGACAAGGAGAGCACGGAGATGTGGTAGAAGGAGAAGGGGTAGCCGAAACACCCGTTGTAGGACTAGGACCACCACTTGTAGTCGAAGGCGCTCTTCTACTGGTAGGGGCTGGACATGAAGTAGTTGGACCGCCACCGCCTGGAGTAGGACCACCGGCTGAGGGAGTAACTCCACAAGTTGGACGTGCTCGCCATGATGCTGTAGCGGCAGGACACGTGGAAGTACAGCGAGAAGGACGTCCAGTTGTACATGTCGTCGCAGAAGAAGGAGTGGACCTACTCGAAGCTGGCGATGTAGCGGGACCGGTCCCGGTACGCGACGTCGGACAAGGCGTGGTTCGTGGTGCGGGCCGACTCGACACCGGAGTAGACCTACCGTAGGCACAGTCGGTGCGACCACGGGAAGTGGCGGCACGTGGACGTCGTGTGGCTGCTCCGGACGAAGACAAAGCGCCTACAGGCCCTCCACGTCACCGAGCTCCAGTGCGACCCGAAGTAGCACGGGAAGCGGTAGTAGCCGGACACGATGAGGGAGTAACAGGCCCACGACCAGTCCCGCGTGGCCGTGGCACCCGACGCCGGGGCCGCCGTCTTCCGCGAGGCGTACTAGGAGCGCCACCACGACCAGAAGAAGCAGACGACCGACGGCCTCTTGCAGAAGTAGTCGCACGTGGAGGACGTCGCCTGCGTCGGACCCCGGCGAGGGACGTTCGTCAGAAAGGCGGTACGGGTGGGGGAGTGCCCGGTGTAACAGTTGGAGTGGCGGAAGAGGTTGTCGACGGATTTGGGGGAGTAGATGTCGAAAGAGCCCCTCTGGAAGTCCCTGTTCGACTCCGACATGTAACTCGTCTTTTGTTTAAACGGCCGGGACTTGGCGAAGACAGTGCGACGGGACTTCCGGCAGTAAGGTCTGTCGTGGCTCGTCAGCCTACACTCCAAGTCGTCACGGCAC

Content from Conditionals


Last updated on 2023-04-27 | Edit this page

Overview

Questions

  • How can programs do different things for different data?

Objectives

  • Correctly write programs that use if and else statements and simple Boolean expressions (without logical operators).
  • Trace the execution of unnested conditionals and conditionals inside loops.

Key Points

  • Use if statements to control whether or not a block of code is executed.
  • Conditionals are often used inside loops.
  • Use else to execute a block of code when an if condition is not true.
  • Use elif to specify additional tests.
  • Conditions are tested once, in order.

Use if statements to control whether or not a block of code is executed.


  • An if statement (more properly called a conditional statement) controls whether some block of code is executed or not.
  • Structure is similar to a for statement:
    • First line opens with if and ends with a colon
    • Body containing one or more statements is indented (usually by 4 spaces)

PYTHON

mass = 3.54
if mass > 3.0:
    print(mass, 'is large')

mass = 2.07
if mass > 3.0:
    print (mass, 'is large')

OUTPUT

3.54 is large

Conditionals are often used inside loops.


  • Not much point using a conditional when we know the value (as above).
  • But useful when we have a collection to process.

PYTHON

masses = [3.54, 2.07, 9.22, 1.86, 1.71]
for m in masses:
    if m > 3.0:
        print(m, 'is large')

OUTPUT

3.54 is large
9.22 is large

Use else to execute a block of code when an if condition is not true.


  • else can be used following an if.
  • Allows us to specify an alternative to execute when the if branch isn’t taken.

PYTHON

masses = [3.54, 2.07, 9.22, 1.86, 1.71]
for m in masses:
    if m > 3.0:
        print(m, 'is large')
    else:
        print(m, 'is small')

OUTPUT

3.54 is large
2.07 is small
9.22 is large
1.86 is small
1.71 is small

Use elif to specify additional tests.


  • May want to provide several alternative choices, each with its own test.
  • Use elif (short for “else if”) and a condition to specify these.
  • Always associated with an if.
  • Must come before the else (which is the “catch all”).

PYTHON

masses = [3.54, 2.07, 9.22, 1.86, 1.71]
for m in masses:
    if m > 9.0:
        print(m, 'is HUGE')
    elif m > 3.0:
        print(m, 'is large')
    else:
        print(m, 'is small')

OUTPUT

3.54 is large
2.07 is small
9.22 is HUGE
1.86 is small
1.71 is small

Conditions are tested once, in order.


  • Python steps through the branches of the conditional in order, testing each in turn.
  • So ordering matters.

PYTHON

grade = 85
if grade >= 70:
    print('grade is C')
elif grade >= 80:
    print('grade is B')
elif grade >= 90:
    print('grade is A')

OUTPUT

grade is C
  • Does not automatically go back and re-evaluate if values change.

PYTHON

velocity = 10.0
if velocity > 20.0:
    print('moving too fast')
else:
    print('adjusting velocity')
    velocity = 50.0

OUTPUT

adjusting velocity
  • Often use conditionals in a loop to “evolve” the values of variables.

PYTHON

velocity = 10.0
for i in range(5): # execute the loop 5 times
    print(i, ':', velocity)
    if velocity > 20.0:
        print('moving too fast')
        velocity = velocity - 5.0
    else:
        print('moving too slow')
        velocity = velocity + 10.0
print('final velocity:', velocity)

OUTPUT

0 : 10.0
moving too slow
1 : 20.0
moving too slow
2 : 30.0
moving too fast
3 : 25.0
moving too fast
4 : 20.0
moving too slow
final velocity: 30.0

Compound Relations Using and, or, and Parentheses

Often, you want some combination of things to be true. You can combine relations within a conditional using and and or. Continuing the example above, suppose you have

PYTHON

mass     = [ 3.54,  2.07,  9.22,  1.86,  1.71]
velocity = [10.00, 20.00, 30.00, 25.00, 20.00]

for i in range(5):
    if mass[i] > 5 and velocity[i] > 20:
        print("Fast heavy object.  Duck!")
    elif mass[i] > 2 and mass[i] <= 5 and velocity[i] <= 20:
        print("Normal traffic")
    elif mass[i] <= 2 and velocity[i] <= 20:
        print("Slow light object.  Ignore it")
    else:
        print("Whoa!  Something is up with the data.  Check it")

Just like with arithmetic, you can and should use parentheses whenever there is possible ambiguity. A good general rule is to always use parentheses when mixing and and or in the same condition. That is, instead of:

PYTHON

if mass[i] <= 2 or mass[i] >= 5 and velocity[i] > 20:

write one of these:

PYTHON

if (mass[i] <= 2 or mass[i] >= 5) and velocity[i] > 20:
if mass[i] <= 2 or (mass[i] >= 5 and velocity[i] > 20):

so it is perfectly clear to a reader (and to Python) what you really mean.

Tracing Execution

What does this program print?

PYTHON

pressure = 71.9
if pressure > 50.0:
    pressure = 25.0
elif pressure <= 50.0:
    pressure = 0.0
print(pressure)

OUTPUT

25.0

Trimming Values

Fill in the blanks so that this program creates a new list containing zeroes where the original list’s values were negative and ones where the original list’s values were positive.

PYTHON

original = [-1.5, 0.2, 0.4, 0.0, -1.3, 0.4]
result = ____
for value in original:
    if ____:
        result.append(0)
    else:
        ____
print(result)

OUTPUT

[0, 1, 1, 1, 0, 1]

PYTHON

original = [-1.5, 0.2, 0.4, 0.0, -1.3, 0.4]
result = []
for value in original:
    if value < 0.0:
        result.append(0)
    else:
        result.append(1)
print(result)

Initializing

Modify this program so that it finds the largest and smallest values in the list no matter what the range of values originally is.

PYTHON

values = [...some test data...]
smallest, largest = None, None
for v in values:
    if ____:
        smallest, largest = v, v
    ____:
        smallest = min(____, v)
        largest = max(____, v)
print(smallest, largest)

What are the advantages and disadvantages of using this method to find the range of the data?

PYTHON

values = [-2,1,65,78,-54,-24,100]
smallest, largest = None, None
for v in values:
    if smallest is None and largest is None:
        smallest, largest = v, v
    else:
        smallest = min(smallest, v)
        largest = max(largest, v)
print(smallest, largest)

If you wrote == None instead of is None, that works too, but Python programmers always write is None because of the special way None works in the language.

It can be argued that an advantage of using this method would be to make the code more readable. However, a disadvantage is that this code is not efficient because within each iteration of the for loop statement, there are two more loops that run over two numbers each (the min and max functions). It would be more efficient to iterate over each number just once:

PYTHON

values = [-2,1,65,78,-54,-24,100]
smallest, largest = None, None
for v in values:
    if smallest is None or v < smallest:
        smallest = v
    if largest is None or v > largest:
        largest = v
print(smallest, largest)

Now we have one loop, but four comparison tests. There are two ways we could improve it further: either use fewer comparisons in each iteration, or use two loops that each contain only one comparison test. The simplest solution is often the best:

PYTHON

values = [-2,1,65,78,-54,-24,100]
smallest = min(values)
largest = max(values)
print(smallest, largest)

Content from Pandas DataFrames


Last updated on 2023-04-24 | Edit this page

Overview

Questions

  • How can I perform statistical analysis of tabular data?

Objectives

  • Select individual values from a Pandas dataframe.
  • Select entire rows or entire columns from a dataframe.
  • Select a subset of both rows and columns from a dataframe in a single operation.
  • Select a subset of a dataframe by a single Boolean criterion.

Key Points

  • Use DataFrame.iloc[..., ...] to select values by integer location.
  • Use : on its own to mean all columns or all rows.
  • Select multiple columns or rows using DataFrame.loc and a named slice.
  • Result of slicing can be used in further operations.
  • Use comparisons to select data based on value.
  • Select values or NaN using a Boolean mask.

Note about Pandas DataFrames/Series


A DataFrame is a collection of Series; the DataFrame is the way Pandas represents a table, and a Series is the data structure Pandas uses to represent a column.

Pandas is built on top of the Numpy library, which in practice means that most of the methods defined for Numpy Arrays apply to Pandas Series/DataFrames.

What makes Pandas so attractive is the powerful interface to access individual records of the table, proper handling of missing values, and relational-databases operations between DataFrames.

The Pandas cheatsheet is a great quick reference for popular dataframe operations.
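To illustrate the "relational-databases operations" mentioned above, here is a minimal sketch joining two small DataFrames on a shared key with pd.merge; the column names and values are made up for illustration:

```python
import pandas as pd

# Two toy tables sharing a 'gene' column (hypothetical data):
expression = pd.DataFrame({'gene': ['Aamp', 'Zw10'], 'counts': [5621, 1436]})
annotation = pd.DataFrame({'gene': ['Aamp', 'Zw10'], 'chromosome': [1, 5]})

# An SQL-style inner join on the shared key:
merged = pd.merge(expression, annotation, on='gene', how='inner')
print(merged)
```

The how argument selects the join type ('inner', 'left', 'right', or 'outer'), much like the corresponding SQL joins.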

Data description


We will be using the same data as the previous lesson from Blackmore et al. (2017), The effect of upper-respiratory infection on transcriptomic changes in the CNS.

However, this data file consists only of the expression data in a gene x sample matrix.

Selecting values


To access a value at the position [i,j] of a DataFrame, we have two options, depending on the meaning of i. Remember that a DataFrame provides an index as a way to identify the rows of the table; a row, then, has a position inside the table as well as a label, which uniquely identifies its entry in the DataFrame.

Use DataFrame.iloc[..., ...] to select values by their (entry) position


  • Can specify location by numerical index analogously to 2D version of character selection in strings.

PYTHON

import pandas as pd

url = "https://raw.githubusercontent.com/ccb-hms/workbench-python-workshop/main/episodes/data/expression_matrix.csv"
rnaseq_df = pd.read_csv(url, index_col=0)
print(rnaseq_df.iloc[0, 0])

OUTPUT

1230

Note how here we used the index of the gene column as opposed to its name. It is often safer to use names instead of indices so that your code is robust to changes in the data layout, but sometimes indices are necessary.

Use DataFrame.loc[..., ...] to select values by their (entry) label.


  • Can specify location by row and/or column name.

PYTHON

print(rnaseq_df.loc["Cyp2d22", "GSM2545351"])

OUTPUT

3678

Use : on its own to mean all columns or all rows.


  • Just like Python’s usual slicing notation.

PYTHON

print(rnaseq_df.loc["Cyp2d22", :])

OUTPUT

GSM2545336    4060
GSM2545337    1616
GSM2545338    1603
GSM2545339    1901
GSM2545340    2171
GSM2545341    3349
GSM2545342    3122
GSM2545343    2008
GSM2545344    2254
GSM2545345    2277
GSM2545346    2985
GSM2545347    3452
GSM2545348    1883
GSM2545349    2014
GSM2545350    2417
GSM2545351    3678
GSM2545352    2920
GSM2545353    2216
GSM2545354    1821
GSM2545362    2842
GSM2545363    2011
GSM2545380    4019
Name: Cyp2d22, dtype: int64
  • We would get the same result printing rnaseq_df.loc["Cyp2d22"] (without a second index).

PYTHON

print(rnaseq_df.loc[:, "GSM2545351"])

OUTPUT

gene
AI504432    1136
AW046200      67
AW551984     584
Aamp        4813
Abca12         4
            ...
Zkscan3     1661
Zranb1      8223
Zranb3       208
Zscan22      433
Zw10        1436
Name: GSM2545351, Length: 1474, dtype: int64
  • We would get the same result printing rnaseq_df["GSM2545351"]
  • We would also get the same result printing rnaseq_df.GSM2545351 (not recommended, because easily confused with . notation for methods)
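The equivalence of these three forms can be checked on a small, made-up DataFrame (the full expression matrix is not needed for the comparison):

```python
import pandas as pd

# A tiny stand-in for the expression matrix (hypothetical values):
df = pd.DataFrame({'GSM2545351': [1136, 67, 584]},
                  index=['AI504432', 'AW046200', 'AW551984'])

# All three expressions select the same column as a Series:
a = df.loc[:, 'GSM2545351']
b = df['GSM2545351']
c = df.GSM2545351

print(a.equals(b) and b.equals(c))  # True
```

Attribute access (df.GSM2545351) only works when the column name is a valid Python identifier, which is another reason to prefer the bracket forms.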

Select multiple columns or rows using DataFrame.loc and a named slice.


PYTHON

print(rnaseq_df.loc['Nadk':'Nid1', 'GSM2545337':'GSM2545341'])

OUTPUT

         GSM2545337  GSM2545338  GSM2545339  GSM2545340  GSM2545341
gene
Nadk           2285        2237        2286        2398        2201
Nap1l5        18287       17176       21950       18655       19160
Nbeal1         2230        2030        2053        1970        1910
Nbl1           2033        1859        1772        2439        3720
Ncf2             76          60          83          73          61
Nck2            683         706         690         644         648
Ncmap            37          40          46          51         138
Ncoa2          5053        4374        4406        4814        5311
Ndufa10        4218        3921        4268        3980        3631
Ndufs1         7042        6672        7413        7090        6943
Neto1           214         183         383         217         164
Nfkbia          740         897         724         873        1000
Nfya           1003         962         944         919         587
Ngb              56          46         135          63          50
Ngef            498         407         587         410         193
Ngp              50          18         131          47          27
Nid1           1521        1395         862         795         673

In the above code, we discover that slicing using loc is inclusive at both ends, which differs from slicing using iloc, where slicing indicates everything up to but not including the final index.
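This difference between loc and iloc slicing can be demonstrated on a small, made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

# loc slicing is inclusive of the final label:
print(len(df.loc['a':'c']))   # 3 rows: 'a', 'b', 'c'

# iloc slicing excludes the final position:
print(len(df.iloc[0:2]))      # 2 rows: positions 0 and 1
```

Label slices must be inclusive: unlike integer positions, there is no natural "one past the end" label to write instead.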

Result of slicing can be used in further operations.


  • Usually don’t just print a slice.
  • All the statistical operators that work on entire dataframes work the same way on slices.
  • E.g., calculate max of a slice.

PYTHON

print(rnaseq_df.loc['Nadk':'Nid1', 'GSM2545337':'GSM2545341'].max())

OUTPUT

GSM2545337    18287
GSM2545338    17176
GSM2545339    21950
GSM2545340    18655
GSM2545341    19160
dtype: int64

PYTHON

print(rnaseq_df.loc['Nadk':'Nid1', 'GSM2545337':'GSM2545341'].min())

OUTPUT

GSM2545337    37
GSM2545338    18
GSM2545339    46
GSM2545340    47
GSM2545341    27
dtype: int64

Use comparisons to select data based on value.


  • Comparison is applied element by element.
  • Returns a similarly-shaped dataframe of True and False.

PYTHON

# Use a subset of data to keep output readable.
subset = rnaseq_df.iloc[:5, :5]
print('Subset of data:\n', subset)

# Which values were greater than 1000 ?
print('\nWhere are values large?\n', subset > 1000)

OUTPUT

Subset of data:
           GSM2545336  GSM2545337  GSM2545338  GSM2545339  GSM2545340
gene
AI504432        1230        1085         969        1284         966
AW046200          83         144         129         121         141
AW551984         670         265         202         682         246
Aamp            5621        4049        3797        4375        4095
Abca12             5           8           1           5           3

Where are values large?
           GSM2545336  GSM2545337  GSM2545338  GSM2545339  GSM2545340
gene
AI504432        True        True       False        True       False
AW046200       False       False       False       False       False
AW551984       False       False       False       False       False
Aamp            True        True        True        True        True
Abca12         False       False       False       False       False

Select values or NaN using a Boolean mask.


  • A frame full of Booleans is sometimes called a mask because of how it can be used.

PYTHON

mask = subset > 1000
print(subset[mask])

OUTPUT

          GSM2545336  GSM2545337  GSM2545338  GSM2545339  GSM2545340
gene
AI504432      1230.0      1085.0         NaN      1284.0         NaN
AW046200         NaN         NaN         NaN         NaN         NaN
AW551984         NaN         NaN         NaN         NaN         NaN
Aamp          5621.0      4049.0      3797.0      4375.0      4095.0
Abca12           NaN         NaN         NaN         NaN         NaN
  • Get the value where the mask is true, and NaN (Not a Number) where it is false.
  • NaNs are ignored by operations like max, min, and mean. This is different from some other programming languages, such as R, and makes using masks in this way very useful.
  • The behavior towards NaN values can be changed with the skipna argument for most pandas methods and functions.

PYTHON

print(subset[subset > 1000].describe())

OUTPUT

        GSM2545336   GSM2545337  GSM2545338   GSM2545339  GSM2545340
count     2.000000     2.000000         1.0     2.000000         1.0
mean   3425.500000  2567.000000      3797.0  2829.500000      4095.0
std    3104.905876  2095.864499         NaN  2185.667061         NaN
min    1230.000000  1085.000000      3797.0  1284.000000      4095.0
25%    2327.750000  1826.000000      3797.0  2056.750000      4095.0
50%    3425.500000  2567.000000      3797.0  2829.500000      4095.0
75%    4523.250000  3308.000000      3797.0  3602.250000      4095.0
max    5621.000000  4049.000000      3797.0  4375.000000      4095.0

Challenge

Why are there still NaNs in this output, when Pandas methods like min and max ignore NaN by default?

We still see NaN because GSM2545338 and GSM2545340 only have a single value over 1000 in this subset. The standard deviation (std) is undefined for a single value.
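As noted above, most pandas reductions accept a skipna argument; a minimal sketch of the difference, using a small hypothetical frame in place of the masked subset:

```python
import numpy as np
import pandas as pd

# Small frame standing in for the masked subset (hypothetical values)
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 6.0]})

print(df.max())              # NaNs skipped by default: a -> 3.0, b -> 6.0
print(df.max(skipna=False))  # any NaN in a column makes that result NaN
```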

Using multiple dataframes


Pandas’ vectorized methods and grouping operations give users a great deal of flexibility to analyze their data.

For instance, let’s say we want to take a look at genes which are highly expressed over time.

To start, let’s get a sense of overall expression levels. Calling mean on the dataframe computes the mean of each column, giving us the mean expression per sample:

PYTHON

mean_exp = rnaseq_df.mean()
print(mean_exp.head())

OUTPUT

GSM2545336    2062.191995
GSM2545337    1765.508820
GSM2545338    1667.990502
GSM2545339    1696.120760
GSM2545340    1681.834464
dtype: float64

Notice that this prints differently than the DataFrames we printed earlier. Methods like mean, which reduce each column to a single value, return a Series instead of a DataFrame by default.

PYTHON

print(type(mean_exp))

OUTPUT

<class 'pandas.core.series.Series'>

We can think of a Series as a single-column DataFrame. Since a Series only contains a single column, it can be sliced like a list: series[start:end].
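For instance, a Series built by hand behaves this way (a minimal sketch with made-up values):

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s[0:2])              # positional slice: rows 'a' and 'b'
print(s.to_frame().shape)  # promote to a single-column DataFrame: (3, 1)
```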

Now we can use the sort_values method and look at the first 10 entries. sort_values sorts in ascending order by default, so these are the samples with the lowest mean expression. Note that when sort_values is used on a DataFrame with multiple columns, it needs a by argument specifying which column to sort by.
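For a DataFrame, that call looks like this (a sketch using a small hypothetical frame):

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": [9, 8, 7]})
print(df.sort_values(by="a"))                   # ascending by default
print(df.sort_values(by="a", ascending=False))  # largest values first
```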

PYTHON

mean_exp = mean_exp.sort_values()
mean_exp[:10]

OUTPUT

GSM2545342    1594.116689
GSM2545341    1637.532564
GSM2545338    1667.990502
GSM2545340    1681.834464
GSM2545346    1692.774084
GSM2545339    1696.120760
GSM2545345    1700.161465
GSM2545344    1712.440299
GSM2545337    1765.508820
GSM2545347    1805.396201
dtype: float64

Now let’s say we want to do something a bit more complicated. Instead of ranking whole samples, we want to find the genes which had the highest absolute expression change from 0 to 4 days.

To do that, we need to load in the sample metadata as a separate dataframe:

PYTHON

url = "https://raw.githubusercontent.com/ccb-hms/workbench-python-workshop/main/episodes/data/metadata.csv"
metadata = pd.read_csv(url, index_col=0)
print(metadata)

OUTPUT

                organism  age     sex    infection   strain  time      tissue  \
sample
GSM2545336  Mus musculus    8  Female   InfluenzaA  C57BL/6     8  Cerebellum
GSM2545337  Mus musculus    8  Female  NonInfected  C57BL/6     0  Cerebellum
GSM2545338  Mus musculus    8  Female  NonInfected  C57BL/6     0  Cerebellum
GSM2545339  Mus musculus    8  Female   InfluenzaA  C57BL/6     4  Cerebellum
GSM2545340  Mus musculus    8    Male   InfluenzaA  C57BL/6     4  Cerebellum
GSM2545341  Mus musculus    8    Male   InfluenzaA  C57BL/6     8  Cerebellum
GSM2545342  Mus musculus    8  Female   InfluenzaA  C57BL/6     8  Cerebellum
GSM2545343  Mus musculus    8    Male  NonInfected  C57BL/6     0  Cerebellum
GSM2545344  Mus musculus    8  Female   InfluenzaA  C57BL/6     4  Cerebellum
GSM2545345  Mus musculus    8    Male   InfluenzaA  C57BL/6     4  Cerebellum
GSM2545346  Mus musculus    8    Male   InfluenzaA  C57BL/6     8  Cerebellum
GSM2545347  Mus musculus    8    Male   InfluenzaA  C57BL/6     8  Cerebellum
GSM2545348  Mus musculus    8  Female  NonInfected  C57BL/6     0  Cerebellum
GSM2545349  Mus musculus    8    Male  NonInfected  C57BL/6     0  Cerebellum
GSM2545350  Mus musculus    8    Male   InfluenzaA  C57BL/6     4  Cerebellum
GSM2545351  Mus musculus    8  Female   InfluenzaA  C57BL/6     8  Cerebellum
GSM2545352  Mus musculus    8  Female   InfluenzaA  C57BL/6     4  Cerebellum
GSM2545353  Mus musculus    8  Female  NonInfected  C57BL/6     0  Cerebellum
GSM2545354  Mus musculus    8    Male  NonInfected  C57BL/6     0  Cerebellum
GSM2545362  Mus musculus    8  Female   InfluenzaA  C57BL/6     4  Cerebellum
GSM2545363  Mus musculus    8    Male   InfluenzaA  C57BL/6     4  Cerebellum
GSM2545380  Mus musculus    8  Female   InfluenzaA  C57BL/6     8  Cerebellum

            mouse
sample
GSM2545336     14
GSM2545337      9
GSM2545338     10
GSM2545339     15
GSM2545340     18
GSM2545341      6
GSM2545342      5
GSM2545343     11
GSM2545344     22
GSM2545345     13
GSM2545346     23
GSM2545347     24
GSM2545348      8
GSM2545349      7
GSM2545350      1
GSM2545351     16
GSM2545352     21
GSM2545353      4
GSM2545354      2
GSM2545362     20
GSM2545363     12
GSM2545380     19

Time is a property of the samples, not of the genes. While we could add another row to our dataframe indicating time, it makes more sense to flip our dataframe and add it as a column. We will learn about other ways to combine dataframes in a future lesson, but for now, since metadata and rnaseq_df have the same index values, we can use the following syntax:

PYTHON

flipped_df = rnaseq_df.T
flipped_df['time'] = metadata['time']
print(flipped_df.head())

OUTPUT

gene        AI504432  AW046200  AW551984  Aamp  Abca12  Abcc8  Abhd14a  Abi2  \
GSM2545336      1230        83       670  5621       5   2210      490  5627
GSM2545337      1085       144       265  4049       8   1966      495  4383
GSM2545338       969       129       202  3797       1   2181      474  4107
GSM2545339      1284       121       682  4375       5   2362      468  4062
GSM2545340       966       141       246  4095       3   2475      489  4289

gene        Abi3bp  Abl2  ...  Zfp92  Zfp941  Zfyve28  Zgrf1  Zkscan3  Zranb1  \
GSM2545336     807  2392  ...     91     910     3747    644     1732    8837
GSM2545337    1144  2133  ...     46     654     2568    335     1840    5306
GSM2545338    1028  1891  ...     19     560     2635    347     1800    5106
GSM2545339     935  1645  ...     50     782     2623    405     1751    5306
GSM2545340     879  1926  ...     48     696     3140    549     2056    5896

gene        Zranb3  Zscan22  Zw10  time
GSM2545336     207      483  1479     8
GSM2545337     179      535  1394     0
GSM2545338     199      533  1279     0
GSM2545339     208      462  1376     4
GSM2545340     184      439  1568     4 

Group By: split-apply-combine


We can now group our data by timepoint using groupby. groupby will group our data by one or more columns, and any summarizing functions we then call, such as mean or max, will be computed for each group separately.

PYTHON

grouped_time = flipped_df.groupby('time')
print(grouped_time.mean())

OUTPUT

gene     AI504432    AW046200    AW551984         Aamp    Abca12        Abcc8  \
time
0     1033.857143  155.285714  238.000000  4602.571429  5.285714  2576.428571
4     1104.500000  152.375000  302.250000  4870.000000  4.250000  2608.625000
8     1014.000000   81.000000  342.285714  4762.571429  4.142857  2291.571429

gene     Abhd14a         Abi2       Abi3bp         Abl2  ...      Zfp810  \
time                                                     ...
0     591.428571  4880.571429  1174.571429  2170.142857  ...  537.000000
4     546.750000  4902.875000  1060.625000  2077.875000  ...  629.125000
8     432.428571  4945.285714   762.285714  2131.285714  ...  940.428571

gene      Zfp92      Zfp941      Zfyve28       Zgrf1      Zkscan3   Zranb1  \
time
0     36.857143  673.857143  3162.428571  398.571429  2105.571429  6014.00
4     43.000000  770.875000  3284.250000  450.125000  1981.500000  6464.25
8     49.000000  765.571429  3649.571429  635.571429  1664.428571  8187.00

gene      Zranb3     Zscan22         Zw10
time
0     191.142857  607.428571  1546.285714
4     196.750000  534.500000  1555.125000
8     191.428571  416.428571  1476.428571

[3 rows x 1474 columns]

Another way to get a DataFrame back after using groupby is to use Pandas’ aggregation method, agg. Let’s also flip the data back.

PYTHON

time_means = grouped_time.agg('mean').T
print(time_means.head())

OUTPUT

time                0         4            8
gene
AI504432  1033.857143  1104.500  1014.000000
AW046200   155.285714   152.375    81.000000
AW551984   238.000000   302.250   342.285714
Aamp      4602.571429  4870.000  4762.571429
Abca12       5.285714     4.250     4.142857
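
agg can also apply several summaries at once, returning one column per statistic; a sketch on a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({"time": [0, 0, 4, 4], "expr": [10, 20, 30, 50]})
# One row per group, with a (column, statistic) pair for each summary
print(df.groupby("time").agg(["mean", "max"]))
```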

Let’s also make the column names something more legible using rename.

PYTHON

time_means = time_means.rename(columns={0: "day_0", 4: "day_4", 8: "day_8"})
print(time_means.head())

OUTPUT

time            day_0     day_4        day_8
gene
AI504432  1033.857143  1104.500  1014.000000
AW046200   155.285714   152.375    81.000000
AW551984   238.000000   302.250   342.285714
Aamp      4602.571429  4870.000  4762.571429
Abca12       5.285714     4.250     4.142857

Now we can calculate the difference between 0 and 4 days:

PYTHON

time_means['diff_4_0'] = time_means['day_4'] - time_means['day_0']
print(time_means.head())

OUTPUT

time            day_0     day_4        day_8    diff_4_0
gene
AI504432  1033.857143  1104.500  1014.000000   70.642857
AW046200   155.285714   152.375    81.000000   -2.910714
AW551984   238.000000   302.250   342.285714   64.250000
Aamp      4602.571429  4870.000  4762.571429  267.428571
Abca12       5.285714     4.250     4.142857   -1.035714

And get the top genes. This time we’ll use the nlargest method instead of sorting:

PYTHON

print(time_means.nlargest(10, 'diff_4_0'))

OUTPUT

time           day_0      day_4         day_8     diff_4_0
gene
Glul    48123.285714  55357.500  73947.857143  7234.214286
Sparc   35429.571429  38832.000  56105.714286  3402.428571
Atp1b1  57350.714286  60364.250  59229.000000  3013.535714
Apod    11575.428571  14506.500  31458.142857  2931.071429
Ttyh1   21453.571429  24252.500  30457.428571  2798.928571
Mt1      7601.428571  10112.375  14397.285714  2510.946429
Etnppl   4516.714286   6702.875   8208.142857  2186.160714
Kif5a   36079.571429  37911.750  33410.714286  1832.178571
Pink1   15454.571429  17252.375  23305.285714  1797.803571
Itih3    2467.714286   3976.500   5534.571429  1508.785714

Creating new columns

We looked at the absolute expression change when finding top genes, but we typically want to look at the log (base 2) fold change.

The log fold change from x to y can be calculated as either:

\(log(y) - log(x)\)

or as

\(log(\frac{y}{x})\)

Try calculating the log fold change from 0 to 4 days and from 0 to 8 days, and store the results as two new columns in time_means.

We will need the log2 function from the numpy package to do this:

PYTHON

import numpy as np
np.log2(time_means['day_0'])

PYTHON

import numpy as np
time_means['logfc_0_4'] = np.log2(time_means['day_4']) - np.log2(time_means['day_0'])
time_means['logfc_0_8'] = np.log2(time_means['day_8']) - np.log2(time_means['day_0'])
print(time_means.head())

OUTPUT

time            day_0     day_4        day_8    diff_4_0  logfc_0_4  logfc_0_8
gene
AI504432  1033.857143  1104.500  1014.000000   70.642857   0.095357  -0.027979
AW046200   155.285714   152.375    81.000000   -2.910714  -0.027299  -0.938931
AW551984   238.000000   302.250   342.285714   64.250000   0.344781   0.524240
Aamp      4602.571429  4870.000  4762.571429  267.428571   0.081482   0.049301
Abca12       5.285714     4.250     4.142857   -1.035714  -0.314636  -0.351472

We can also use the assign dataframe method to create new columns.

PYTHON

time_means = time_means.assign(fc_0_8 = 2**time_means["logfc_0_8"])
print(time_means.iloc[:,5:])

OUTPUT

time      logfc_0_8    fc_0_8
gene
AI504432  -0.027979  0.980793
AW046200  -0.938931  0.521619
AW551984   0.524240  1.438175
Aamp       0.049301  1.034763
Abca12    -0.351472  0.783784
...             ...       ...
Zkscan3   -0.339185  0.790488
Zranb1     0.445010  1.361324
Zranb3     0.002155  1.001495
Zscan22   -0.544646  0.685560
Zw10      -0.066695  0.954823

[1474 rows x 2 columns]

Extent of Slicing

  1. Do the two statements below produce the same output?
  2. Based on this, what rule governs what is included (or not) in numerical slices and named slices in Pandas?

Here’s metadata as a reminder:

                organism  age     sex    infection   strain  time      tissue  \
sample
GSM2545336  Mus musculus    8  Female   InfluenzaA  C57BL/6     8  Cerebellum
GSM2545337  Mus musculus    8  Female  NonInfected  C57BL/6     0  Cerebellum
GSM2545338  Mus musculus    8  Female  NonInfected  C57BL/6     0  Cerebellum
GSM2545339  Mus musculus    8  Female   InfluenzaA  C57BL/6     4  Cerebellum
GSM2545340  Mus musculus    8    Male   InfluenzaA  C57BL/6     4  Cerebellum

            mouse
sample
GSM2545336     14
GSM2545337      9
GSM2545338     10
GSM2545339     15
GSM2545340     18  

PYTHON

print(metadata.iloc[0:2, 0:2])
print(metadata.loc['GSM2545336':'GSM2545338', 'organism':'sex'])

No, they do not produce the same output! The output of the first statement is:

OUTPUT

                organism  age
sample
GSM2545336  Mus musculus    8
GSM2545337  Mus musculus    8

The second statement gives:

OUTPUT

                organism  age     sex
sample
GSM2545336  Mus musculus    8  Female
GSM2545337  Mus musculus    8  Female
GSM2545338  Mus musculus    8  Female

Clearly, the second statement produces an additional column and an additional row compared to the first statement.
What conclusion can we draw? We see that a numerical slice, 0:2, omits the final index (i.e. index 2) in the range provided, while a named slice, ‘GSM2545336’:‘GSM2545338’, includes the final element.

Reconstructing Data

Explain what each line in the following short program does: what is in first, second, etc.?

PYTHON

first = metadata.copy(deep=True)
second = first[first['sex'] == 'Female']
third = second.drop('GSM2545338')
fourth = third.drop('sex', axis = 1)
fourth.to_csv('result.csv')

Let’s go through this piece of code line by line.

PYTHON

first = metadata.copy(deep=True)

This line makes a copy of the metadata dataframe. While we probably don’t need deep=True here, since none of our columns contain compound objects such as lists, it never hurts to be safe.

PYTHON

second = first[first['sex'] == 'Female']

This line makes a selection: only those rows of first for which the ‘sex’ column matches ‘Female’ are extracted. Notice how the Boolean expression inside the brackets, first['sex'] == 'Female', is used to select only those rows where the expression is true. Try printing this expression! Can you print also its individual True/False elements? (hint: first assign the expression to a variable)

PYTHON

third = second.drop('GSM2545338')

As the syntax suggests, this line drops the row from second where the label is ‘GSM2545338’. The resulting dataframe third has one row less than the original dataframe second.

PYTHON

fourth = third.drop('sex', axis = 1)

Again we apply the drop function, but in this case we are dropping not a row but a whole column. To accomplish this, we need to specify also the axis parameter (axis is by default set to 0 for rows, and we change it to 1 for columns).
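Equivalently, drop also accepts index= and columns= keywords, which can read more clearly than the axis parameter (sketched on a tiny hypothetical frame):

```python
import pandas as pd

df = pd.DataFrame({"sex": ["Female", "Male"], "mouse": [14, 9]},
                  index=["GSM2545336", "GSM2545337"])
print(df.drop(columns="sex"))       # same as df.drop("sex", axis=1)
print(df.drop(index="GSM2545337"))  # same as df.drop("GSM2545337")
```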

PYTHON

fourth.to_csv('result.csv')

The final step is to write the data that we have been working on to a CSV file. Pandas makes this easy with the to_csv() method. The only required argument is the filename. Note that the file will be written in the directory from which you started the Jupyter or Python session.
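By default, to_csv writes the index as the first column, so reading the file back with index_col=0 restores the original frame; a sketch with a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({"mouse": [14, 9]}, index=["GSM2545336", "GSM2545337"])
df.index.name = "sample"
df.to_csv("result.csv")                        # index becomes the first column
back = pd.read_csv("result.csv", index_col=0)  # restore the index on re-read
print(back.equals(df))
```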

Selecting Indices

Explain in simple terms what idxmin and idxmax do in the short program below. When would you use these methods?

PYTHON

print("idxmin:")
print(rnaseq_df.idxmin())
print("idxmax:")
print(rnaseq_df.idxmax())

OUTPUT

idxmin:
GSM2545336       Cfhr2
GSM2545337       Cfhr2
GSM2545338        Aox2
GSM2545339       Ascl5
GSM2545340       Ascl5
GSM2545341    BC055402
GSM2545342       Cryga
GSM2545343       Ascl5
GSM2545344        Aox2
GSM2545345    BC055402
GSM2545346        Cpa6
GSM2545347       Cfhr2
GSM2545348       Acmsd
GSM2545349     Fam124b
GSM2545350       Cryga
GSM2545351       Glrp1
GSM2545352       Cfhr2
GSM2545353       Ascl5
GSM2545354       Ascl5
GSM2545362       Cryga
GSM2545363    Adamts19
GSM2545380       Ascl5
dtype: object
idxmax:
GSM2545336    Glul
GSM2545337    Plp1
GSM2545338    Plp1
GSM2545339    Plp1
GSM2545340    Plp1
GSM2545341    Glul
GSM2545342    Glul
GSM2545343    Plp1
GSM2545344    Plp1
GSM2545345    Plp1
GSM2545346    Glul
GSM2545347    Glul
GSM2545348    Plp1
GSM2545349    Plp1
GSM2545350    Plp1
GSM2545351    Glul
GSM2545352    Plp1
GSM2545353    Plp1
GSM2545354    Plp1
GSM2545362    Plp1
GSM2545363    Plp1
GSM2545380    Glul
dtype: object

For each column in rnaseq_df, idxmin returns the index label of that column’s minimum value; idxmax does the same for each column’s maximum value.

You can use these functions whenever you want to get the row index of the minimum/maximum value and not the actual minimum/maximum value.
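On a single Series the behaviour is easy to see (a minimal sketch with made-up expression values):

```python
import pandas as pd

s = pd.Series([5, 1, 9], index=["Aamp", "Abca12", "Zw10"])
print(s.idxmin())  # index label of the smallest value: 'Abca12'
print(s.idxmax())  # index label of the largest value: 'Zw10'
```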

Many Ways of Access

There are at least two ways of accessing a value or slice of a DataFrame: by name or index. However, there are many others. For example, a single column or row can be accessed either as a DataFrame or a Series object.

Suggest different ways of doing the following operations on a DataFrame:

  1. Access a single column
  2. Access a single row
  3. Access an individual DataFrame element
  4. Access several columns
  5. Access several rows
  6. Access a subset of specific rows and columns
  7. Access a subset of row and column ranges

1. Access a single column:

PYTHON

# by name
rnaseq_df["col_name"]   # as a Series
rnaseq_df[["col_name"]] # as a DataFrame

# by name using .loc
rnaseq_df.T.loc["col_name"]  # as a Series
rnaseq_df.T.loc[["col_name"]].T  # as a DataFrame

# Dot notation (Series)
rnaseq_df.col_name

# by index (iloc)
rnaseq_df.iloc[:, col_index]   # as a Series
rnaseq_df.iloc[:, [col_index]] # as a DataFrame

# using a mask
rnaseq_df.T[rnaseq_df.T.index == "col_name"].T

2. Access a single row:

PYTHON

# by name using .loc
rnaseq_df.loc["row_name"] # as a Series
rnaseq_df.loc[["row_name"]] # as a DataFrame

# by name
rnaseq_df.T["row_name"] # as a Series
rnaseq_df.T[["row_name"]].T # as a DataFrame

# by index
rnaseq_df.iloc[row_index]   # as a Series
rnaseq_df.iloc[[row_index]]   # as a DataFrame

# using mask
rnaseq_df[rnaseq_df.index == "row_name"]

3. Access an individual DataFrame element:

PYTHON

# by column/row names
rnaseq_df["column_name"]["row_name"]         # as a value

rnaseq_df[["col_name"]].loc["row_name"]  # as a Series
rnaseq_df[["col_name"]].loc[["row_name"]]  # as a DataFrame

rnaseq_df.loc["row_name"]["col_name"]  # as a value
rnaseq_df.loc[["row_name"]]["col_name"]  # as a Series
rnaseq_df.loc[["row_name"]][["col_name"]]  # as a DataFrame

rnaseq_df.loc["row_name", "col_name"]  # as a value
rnaseq_df.loc[["row_name"], "col_name"]  # as a Series. Preserves index. Column name is moved to `.name`.
rnaseq_df.loc["row_name", ["col_name"]]  # as a Series. Index is moved to `.name.` Sets index to column name.
rnaseq_df.loc[["row_name"], ["col_name"]]  # as a DataFrame (preserves original index and column name)

# by column/row names: Dot notation
rnaseq_df.col_name.row_name

# by column/row indices
rnaseq_df.iloc[row_index, col_index] # as a value
rnaseq_df.iloc[[row_index], col_index] # as a Series. Preserves index. Column name is moved to `.name`
rnaseq_df.iloc[row_index, [col_index]] # as a Series. Index is moved to `.name.` Sets index to column name.
rnaseq_df.iloc[[row_index], [col_index]] # as a DataFrame (preserves original index and column name)

# column name + row index
rnaseq_df["col_name"][row_index]
rnaseq_df.col_name[row_index]
rnaseq_df["col_name"].iloc[row_index]

# column index + row name
rnaseq_df.iloc[:, [col_index]].loc["row_name"]  # as a Series
rnaseq_df.iloc[:, [col_index]].loc[["row_name"]]  # as a DataFrame

# using masks
rnaseq_df[rnaseq_df.index == "row_name"].T[rnaseq_df.T.index == "col_name"].T

4. Access several columns:

PYTHON

# by name
rnaseq_df[["col1", "col2", "col3"]]
rnaseq_df.loc[:, ["col1", "col2", "col3"]]

# by index
rnaseq_df.iloc[:, [col1_index, col2_index, col3_index]]

5. Access several rows

PYTHON

# by name
rnaseq_df.loc[["row1", "row2", "row3"]]

# by index
rnaseq_df.iloc[[row1_index, row2_index, row3_index]]

6. Access a subset of specific rows and columns

PYTHON

# by names
rnaseq_df.loc[["row1", "row2", "row3"], ["col1", "col2", "col3"]]

# by indices
rnaseq_df.iloc[[row1_index, row2_index, row3_index], [col1_index, col2_index, col3_index]]

# column names + row indices
rnaseq_df[["col1", "col2", "col3"]].iloc[[row1_index, row2_index, row3_index]]

# column indices + row names
rnaseq_df.iloc[:, [col1_index, col2_index, col3_index]].loc[["row1", "row2", "row3"]]

7. Access a subset of row and column ranges

PYTHON

# by name
rnaseq_df.loc["row1":"row2", "col1":"col2"]

# by index
rnaseq_df.iloc[row1_index:row2_index, col1_index:col2_index]

# column names + row indices
rnaseq_df.loc[:, "col1_name":"col2_name"].iloc[row1_index:row2_index]

# column indices + row names
rnaseq_df.iloc[:, col1_index:col2_index].loc["row1":"row2"]

Content from Writing Functions


Last updated on 2023-04-24 | Edit this page

Overview

Questions

  • How can I create my own functions?
  • How can I use functions to write better programs?

Objectives

  • Explain and identify the difference between function definition and function call.
  • Write a function that takes a small, fixed number of arguments and produces a single result.

Key Points

  • Break programs down into functions to make them easier to understand.
  • Define a function using def with a name, parameters, and a block of code.
  • Defining a function does not run it.
  • Arguments in a function call are matched to its defined parameters.
  • Functions may return a result to their caller using return.

Break programs down into functions to make them easier to understand.


  • Human beings can only keep a few items in working memory at a time.
  • Understand larger/more complicated ideas by understanding and combining pieces.
    • Components in a machine.
    • Lemmas when proving theorems.
  • Functions serve the same purpose in programs.
    • Encapsulate complexity so that we can treat it as a single “thing”.
  • Also enables re-use.
    • Write one time, use many times.

Define a function using def with a name, parameters, and a block of code.


  • Begin the definition of a new function with def.
  • Followed by the name of the function.
    • Must obey the same rules as variable names.
  • Then parameters in parentheses.
    • Empty parentheses if the function doesn’t take any inputs.
    • We will discuss this in detail in a moment.
  • Then a colon.
  • Then an indented block of code.

PYTHON

def print_greeting():
    print('Hello!')
    print('The weather is nice today.')
    print('Right?')

Defining a function does not run it.


  • Defining a function does not run it.
    • Like assigning a value to a variable.
  • Must call the function to execute the code it contains.

PYTHON

print_greeting()

OUTPUT

Hello!
The weather is nice today.
Right?

Arguments in a function call are matched to its defined parameters.


  • Functions are most useful when they can operate on different data.
  • Specify parameters when defining a function.
    • These become variables when the function is executed.
    • Are assigned the arguments in the call (i.e., the values passed to the function).
    • If you don’t name the arguments when using them in the call, the arguments will be matched to parameters in the order the parameters are defined in the function.

PYTHON

def print_date(year, month, day):
    joined = str(year) + '/' + str(month) + '/' + str(day)
    print(joined)

print_date(1871, 3, 19)

OUTPUT

1871/3/19

Or, we can name the arguments when we call the function, which allows us to specify them in any order and adds clarity to the call site; otherwise, a reader of the code might forget whether the second argument is the month or the day, for example.

PYTHON

print_date(month=3, day=19, year=1871)

OUTPUT

1871/3/19

  • A helpful analogy for functions: the parentheses () hold the ingredients (the arguments), while the body contains the recipe.

Functions may return a result to their caller using return.


  • Use return ... to give a value back to the caller.
  • May occur anywhere in the function.
  • But functions are easier to understand if return occurs:
    • At the start to handle special cases.
    • At the very end, with a final result.

PYTHON

def average(values):
    if len(values) == 0:
        return None
    return sum(values) / len(values)

PYTHON

a = average([1, 3, 4])
print('average of actual values:', a)

OUTPUT

average of actual values: 2.6666666666666665

PYTHON

print('average of empty list:', average([]))

OUTPUT

average of empty list: None

PYTHON

result = print_date(1871, 3, 19)
print('result of call is:', result)

OUTPUT

1871/3/19
result of call is: None

Identifying Syntax Errors

  1. Read the code below and try to identify what the errors are without running it.
  2. Run the code and read the error message. Is it a SyntaxError or an IndentationError?
  3. Fix the error.
  4. Repeat steps 2 and 3 until you have fixed all the errors.

PYTHON

def another_function
  print("Syntax errors are annoying.")
   print("But at least python tells us about them!")
  print("So they are usually not too hard to fix.")

PYTHON

def another_function():
  print("Syntax errors are annoying.")
  print("But at least Python tells us about them!")
  print("So they are usually not too hard to fix.")

Definition and Use

What does the following program print?

PYTHON

def report(pressure):
    print('pressure is', pressure)

print('calling', report, 22.5)

OUTPUT

calling <function report at 0x7fd128ff1bf8> 22.5

A function call always needs parentheses; otherwise you get the memory address of the function object. So, if we wanted to call the function named report and give it the value 22.5 to report on, we could write the call as follows:

PYTHON

print("calling")
report(22.5)

OUTPUT

calling
pressure is 22.5

Order of Operations

  1. What’s wrong in this example?

    PYTHON

    result = print_time(11, 37, 59)
    
    def print_time(hour, minute, second):
       time_string = str(hour) + ':' + str(minute) + ':' + str(second)
       print(time_string)
  2. After fixing the problem above, explain why running this example code:

    PYTHON

    result = print_time(11, 37, 59)
    print('result of call is:', result)

    gives this output:

    OUTPUT

    11:37:59
    result of call is: None
  3. Why is the result of the call None?

  1. The problem with the example is that the function print_time() is defined after the call to the function is made. Python doesn’t know how to resolve the name print_time since it hasn’t been defined yet and will raise a NameError e.g., NameError: name 'print_time' is not defined

  2. The first line of output 11:37:59 is printed by the first line of code, result = print_time(11, 37, 59) that binds the value returned by invoking print_time to the variable result. The second line is from the second print call to print the contents of the result variable.

  3. print_time() does not explicitly return a value, so it automatically returns None.
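This is true of any function that ends without an explicit return statement (a minimal sketch with a hypothetical function name):

```python
def no_return():
    message = "computed but never returned"  # no return statement

result = no_return()
print(result is None)  # True
```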

Encapsulation

Fill in the blanks to create a function that takes a single filename as an argument, loads the data in the file named by the argument, and returns the minimum value in that data.

PYTHON

import pandas as pd

def min_in_data(____):
    data = ____
    return ____

PYTHON

import pandas as pd

def min_in_data(filename):
    data = pd.read_csv(filename)
    return data.min()

Find the First

Fill in the blanks to create a function that takes a list of numbers as an argument and returns the first negative value in the list. What does your function do if the list is empty? What if the list has no negative numbers?

PYTHON

def first_negative(values):
    for v in ____:
        if ____:
            return ____

PYTHON

def first_negative(values):
    for v in values:
        if v < 0:
            return v

If an empty list or a list with all positive values is passed to this function, it returns None:

PYTHON

my_list = []
print(first_negative(my_list))

OUTPUT

None

Calling by Name

Earlier we saw this function:

PYTHON

def print_date(year, month, day):
    joined = str(year) + '/' + str(month) + '/' + str(day)
    print(joined)

We saw that we can call the function using named arguments, like this:

PYTHON

print_date(day=1, month=2, year=2003)
  1. What does print_date(day=1, month=2, year=2003) print?
  2. When have you seen a function call like this before?
  3. When and why is it useful to call functions this way?
  1. 2003/2/1
  2. We saw examples of using named arguments when working with the pandas library. For example, when reading in a dataset using data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country'), the last argument index_col is a named argument.
  3. Using named arguments can make code more readable since one can see from the function call what name the different arguments have inside the function. It can also reduce the chances of passing arguments in the wrong order, since by using named arguments the order doesn’t matter.

Encapsulation of an If/Print Block

The code below will run on a label-printer for chicken eggs. A digital scale will report a chicken egg mass (in grams) to the computer and then the computer will print a label.

PYTHON

import random
for i in range(10):

    # simulating the mass of a chicken egg
    # the (random) mass will be 70 +/- 20 grams
    mass = 70 + 20.0 * (2.0 * random.random() - 1.0)

    print(mass)

    # egg sizing machinery prints a label
    if mass >= 85:
        print("jumbo")
    elif mass >= 70:
        print("large")
    elif mass < 70 and mass >= 55:
        print("medium")
    else:
        print("small")

The if-block that classifies the eggs might be useful in other situations, so to avoid repeating it, we could fold it into a function, get_egg_label(). Revising the program to use the function would give us this:

PYTHON

# revised version
import random
for i in range(10):

    # simulating the mass of a chicken egg
    # the (random) mass will be 70 +/- 20 grams
    mass = 70 + 20.0 * (2.0 * random.random() - 1.0)

    print(mass, get_egg_label(mass))
  1. Create a function definition for get_egg_label() that will work with the revised program above. Note that the get_egg_label() function’s return value will be important. Sample output from the above program would be 71.23 large.
  2. A dirty egg might have a mass of more than 90 grams, and a spoiled or broken egg will probably have a mass that’s less than 50 grams. Modify your get_egg_label() function to account for these error conditions. Sample output could be 25 too light, probably spoiled.

PYTHON

def get_egg_label(mass):
    # egg sizing machinery prints a label
    egg_label = "Unlabelled"
    if mass >= 90:
        egg_label = "warning: egg might be dirty"
    elif mass >= 85:
        egg_label = "jumbo"
    elif mass >= 70:
        egg_label = "large"
    elif mass < 70 and mass >= 55:
        egg_label = "medium"
    elif mass < 50:
        egg_label = "too light, probably spoiled"
    else:
        egg_label = "small"
    return egg_label

Simulating a dynamical system

In mathematics, a dynamical system is a system in which a function describes the time dependence of a point in a geometrical space. A canonical example of a dynamical system is the logistic map, a growth model that computes a new population density (between 0 and 1) based on the current density. In the model, time takes discrete values 0, 1, 2, …

  1. Define a function called logistic_map that takes two inputs: x, representing the current population (at time t), and a growth parameter r (we will use r = 1 throughout this exercise). This function should return a value representing the state of the system (population) at time t + 1, using the mapping function:

    f(t+1) = r * f(t) * [1 - f(t)]

  2. Using a for or while loop, iterate the logistic_map function defined in part 1, starting from an initial population of 0.5, for a period of time t_final = 10. Store the intermediate results in a list so that after the loop terminates you have accumulated a sequence of values representing the state of the logistic map at times t = [0,1,...,t_final] (11 values in total). Print this list to see the evolution of the population.

  3. Encapsulate the logic of your loop into a function called iterate that takes the initial population as its first input, the parameter t_final as its second input and the parameter r as its third input. The function should return the list of values representing the state of the logistic map at times t = [0,1,...,t_final]. Run this function for periods t_final = 100 and 1000 and print some of the values. Is the population trending toward a steady state?

  1. PYTHON

       def logistic_map(x, r):
           return r * x * (1 - x)
  2. PYTHON

       initial_population = 0.5
       t_final = 10
       r = 1.0
       population = [initial_population]
       for t in range(t_final):
           population.append( logistic_map(population[t], r) )
  3. PYTHON

    def iterate(initial_population, t_final, r):
        population = [initial_population]
        for t in range(t_final):
            population.append( logistic_map(population[t], r) )
        return population
    
    for period in (10, 100, 1000):
        population = iterate(0.5, period, 1)
        print(population[-1])

    OUTPUT

    0.06945089389714401
    0.009395779870614648
    0.0009913908614406382
    The population seems to be approaching zero.

Using Functions With Conditionals in Pandas

Functions will often contain conditionals. Here is a short example that will indicate which quartile the argument is in based on hand-coded values for the quartile cut points.

PYTHON

def calculate_life_quartile(exp):
    if exp < 58.41:
        # This observation is in the first quartile
        return 1
    elif exp >= 58.41 and exp < 67.05:
        # This observation is in the second quartile
        return 2
    elif exp >= 67.05 and exp < 71.70:
        # This observation is in the third quartile
        return 3
    elif exp >= 71.70:
        # This observation is in the fourth quartile
        return 4
    else:
        # This observation has bad data
        return None

calculate_life_quartile(62.5)

OUTPUT

2

That function would typically be used within a for loop, but Pandas has a different, more efficient way of doing the same thing, and that is by applying a function to a dataframe or a portion of a dataframe. Here is an example, using the definition above.

PYTHON


data = pd.DataFrame({'lifeExp_1992': {'Argentina': 71.868, 'Bolivia': 59.957, 'Brazil': 67.057, 'Canada': 77.95, 'Chile': 74.126, 'Colombia': 68.421, 'Costa Rica': 75.713, 'Cuba': 74.414, 'Dominican Republic': 68.457, 'Ecuador': 69.613, 'El Salvador': 66.798, 'Guatemala': 63.373, 'Haiti': 55.089, 'Honduras': 66.399, 'Jamaica': 71.766, 'Mexico': 71.455, 'Nicaragua': 65.843, 'Panama': 72.462, 'Paraguay': 68.225, 'Peru': 66.458, 'Puerto Rico': 73.911, 'Trinidad and Tobago': 69.862, 'United States': 76.09, 'Uruguay': 72.752, 'Venezuela': 71.15}, 'lifeExp_2007': {'Argentina': 75.32, 'Bolivia': 65.554, 'Brazil': 72.39, 'Canada': 80.653, 'Chile': 78.553, 'Colombia': 72.889, 'Costa Rica': 78.782, 'Cuba': 78.273, 'Dominican Republic': 72.235, 'Ecuador': 74.994, 'El Salvador': 71.878, 'Guatemala': 70.259, 'Haiti': 60.916, 'Honduras': 70.198, 'Jamaica': 72.567, 'Mexico': 76.195, 'Nicaragua': 72.899, 'Panama': 75.537, 'Paraguay': 71.752, 'Peru': 71.421, 'Puerto Rico': 78.746, 'Trinidad and Tobago': 69.819, 'United States': 78.242, 'Uruguay': 76.384, 'Venezuela': 73.747}})

data['life_qrtl_1992'] = data['lifeExp_1992'].apply(calculate_life_quartile)
data['life_qrtl_2007'] = data['lifeExp_2007'].apply(calculate_life_quartile)
print(data.iloc[:,2:])

There is a lot in the line data['life_qrtl_1992'] = data['lifeExp_1992'].apply(calculate_life_quartile), so let's take it piece by piece. On the right side of the = we start with data['lifeExp_1992'], which is the column in the dataframe called data labeled lifeExp_1992. We use apply() to do what it says: apply calculate_life_quartile to the value of this column for every row in the dataframe.
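To see apply() in isolation, here is a minimal sketch with a toy Series (the values and index labels are made up for illustration; the quartile function is the one defined above):

```python
import pandas as pd

def calculate_life_quartile(exp):
    if exp < 58.41:
        return 1
    elif exp >= 58.41 and exp < 67.05:
        return 2
    elif exp >= 67.05 and exp < 71.70:
        return 3
    elif exp >= 71.70:
        return 4
    else:
        return None

# A small toy column of life expectancies (illustrative values, not the workshop data)
life_exp = pd.Series([55.0, 62.5, 70.0, 80.0], index=["A", "B", "C", "D"])

# apply() calls the function once per value and returns a new Series
print(life_exp.apply(calculate_life_quartile))
```

Each element of the result is the function's return value for the corresponding element of the input column.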

Content from Perform Statistical Tests with Scipy


Last updated on 2023-04-27

Overview

Questions

  • How do I use distributions?
  • How do I perform statistical tests?

Objectives

  • Explore the scipy library.
  • Define and sample from a distribution.
  • Perform a t-test on prepared data.

Key Points

  • Scipy is a package with a variety of scientific computing functionality.
  • Scipy.stats contains functionality for distributions and statistical tests.

SciPy is a package with a variety of scientific computing functionality.


SciPy contains a variety of scientific computing modules. It has modules for:

  • Integration (scipy.integrate)
  • Optimization (scipy.optimize)
  • Interpolation (scipy.interpolate)
  • Fourier Transforms (scipy.fft)
  • Signal Processing (scipy.signal)
  • Linear Algebra (scipy.linalg)
  • Compressed Sparse Graph Routines (scipy.sparse.csgraph)
  • Spatial data structures and algorithms (scipy.spatial)
  • Statistics (scipy.stats)
  • Multidimensional image processing (scipy.ndimage)

We are only going to look at some of the functionality in scipy.stats.

Let’s load the module and take a look. Note that the output is not shown here as it is very long.

PYTHON

import scipy.stats as stats

help(stats)

This is a lot, but we can see that scipy.stats has a variety of different distributions and functions available for performing statistics.

Sampling from a distribution


One common reason to use a distribution is to generate random data that has some kind of structure. Python has a built-in random module which can perform basic tasks like generating uniform random numbers or shuffling a list, but we need more power to generate more complicated random data.

Distributions in scipy share a set of methods. Some of the important ones are:

  • rvs which generates random numbers from the distribution.
  • pdf which gets the probability density function.
  • cdf which gets the cumulative distribution function.
  • stats which reports various properties of the distribution.

Scipy’s convention is not to create distribution objects, but to pass parameters as needed into each of these functions. For instance, to generate 100 normally distributed variables with a mean of 8 and a standard deviation of 3, we call stats.norm.rvs with those parameters as loc and scale, respectively.

PYTHON

stats.norm.rvs(size=100, loc=8, scale=3)

OUTPUT

array([10.81294892,  9.35960484,  4.13064284,  5.01349956, 11.91542804,
        4.40831262,  7.873177  ,  2.80427116,  6.93137287,  8.6260419 ,
       10.48824661,  4.03472414, 10.01449037,  7.37493941,  6.8729883 ,
        8.8247789 ,  6.9956787 ,  8.2084562 ,  9.4272925 ,  8.14680254,
        1.61100441,  7.14171227,  6.83756279, 13.15778935,  5.87752233,
       11.53910465,  9.7899608 , 10.99998659,  5.67069185,  4.43542582,
        8.05798467,  4.56883381, 11.2219477 ,  9.49666323,  6.09194101,
       10.0844057 , 10.3107259 ,  5.50683223,  9.97121225, 10.71650187,
        7.11465757,  1.81891326,  5.11893454,  5.7602409 ,  7.21754014,
        8.79988949, 10.37762164, 14.33689265,  6.33571171,  8.06869862,
        8.54040514,  7.70807529, 11.35719793,  9.60738274,  6.02998292,
        5.68116531,  2.35490176, 10.74582778,  9.8661685 , 13.39578467,
       10.04354226,  7.28494967, 10.16128058, -0.59049207,  7.2093563 ,
        6.81705905,  5.95187581,  7.51137727, 12.00011092, 10.99417942,
        7.84189409,  1.51154885,  5.85094646,  9.24591807, 10.0216898 ,
        9.79350275,  7.26730344,  4.94176518,  9.06454997,  2.99129021,
       10.8972046 , 12.51642136,  7.31422469,  4.54086114,  4.36204651,
        8.33272365,  9.53609612,  7.21441855,  8.58643188,  7.67419071,
       10.36948034,  4.405381  ,  8.16845496,  2.9752478 ,  5.93608394,
        4.91781677, 11.60177026,  7.97727669,  8.43215961,  6.97469055])
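The other shared methods follow the same convention: pass loc and scale into each call. A short sketch using the same normal distribution:

```python
import scipy.stats as stats

# Probability density at the mean of a normal(loc=8, scale=3)
print(stats.norm.pdf(8, loc=8, scale=3))   # about 0.133

# Cumulative probability up to the mean is exactly 0.5
print(stats.norm.cdf(8, loc=8, scale=3))   # 0.5

# Mean and variance of the distribution (8.0 and 9.0 here)
print(stats.norm.stats(loc=8, scale=3))
```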

Numpy arrays

Note that the output isn’t a list or pandas object, but is something called an array. This is an object in the NumPy package. NumPy objects are used by most scientific packages in Python such as pandas, scikit-learn, and scipy. It provides a variety of objects and functionality for performing fast numeric operations.

Numpy arrays can be indexed and manipulated like Python lists in most instances, but provide extra functionality. We will not be going into more detail about numpy during this workshop.
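As a quick sketch of what arrays add over lists (toy usage, not part of the lesson's analysis):

```python
import scipy.stats as stats

sample = stats.norm.rvs(size=100, loc=8, scale=3)

# Arrays can be indexed and sliced like lists...
print(sample[0])
print(sample[:5])

# ...but also support fast vectorized operations
print(sample.mean())           # mean of all 100 values
print(sample + 1)              # adds 1 to every element at once
print(sample[sample > 8])      # boolean mask: keep only values above 8
```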

Performing a t-test with scipy


First, let’s load up a small toy dataset looking at how mouse weight responded to a particular treatment. This data contains 4 columns, the mouse label/number, whether or not it was treated, its sex and its weight.

PYTHON

import pandas as pd

url = "https://raw.githubusercontent.com/ccb-hms/workbench-python-workshop/main/episodes/data/mouse_weight_data.csv"
mouse_df = pd.read_csv(url, index_col="Mouse")
print(mouse_df)

OUTPUT

       Treated Sex  Weight
Mouse
1         True   M      22
2         True   F      25
3         True   M      27
4         True   F      23
5         True   M      23
6         True   F      24
7         True   M      26
8         True   F      24
9         True   M      27
10        True   F      27
11        True   M      25
12        True   F      27
13        True   M      26
14        True   F      24
15        True   M      26
16       False   F      25
17       False   M      29
18       False   F      28
19       False   M      27
20       False   F      28
21       False   M      25
22       False   F      23
23       False   M      24
24       False   F      30
25       False   M      24
26       False   F      30
27       False   M      30
28       False   F      28
29       False   M      27
30       False   F      24

We can start by seeing if the mean weight is different between the groups:

PYTHON

treated_mean = mouse_df[mouse_df["Treated"]]["Weight"].mean()
untreated_mean = mouse_df[~mouse_df["Treated"]]["Weight"].mean()
print(f"Treated mean weight: {treated_mean:0.0f}g\nUntreated mean weight: {untreated_mean:0.0f}g")

OUTPUT

Treated mean weight: 25g
Untreated mean weight: 27g

We can see that there is a slight difference, but it's unclear whether it is real or just due to randomness in the data.

The newline character

\n is a special character called the newline character. It tells Python to start a new line of text. \n is used for this purpose in almost all programming languages.

~ in Pandas

Though in Python the keyword not is how we indicate logical NOT, for pandas boolean series we have to use the ~ operator to invert every boolean in the series.
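A minimal sketch of ~ on a toy boolean series:

```python
import pandas as pd

treated = pd.Series([True, False, True])

# ~ flips every value in the boolean series
print(~treated)  # False, True, False
```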

Let’s perform a t-test to check for a significant difference. A one-sample t-test is performed using ttest_1samp. As we want to compare two groups, we will need to perform a two-sample t-test using ttest_ind.

PYTHON

treated_weights = mouse_df[mouse_df["Treated"]]["Weight"]
untreated_weights = mouse_df[~mouse_df["Treated"]]["Weight"]

stats.ttest_ind(treated_weights, untreated_weights)

OUTPUT

Ttest_indResult(statistic=-2.2617859482443694, pvalue=0.03166586638057747)

Our p-value is about 0.03.

There are a number of arguments we could use to change the test, such as accounting for unequal variance and performing one-sided tests; these are described in the scipy documentation for ttest_ind.
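For example, here is a sketch of those variants on two small made-up samples (the data is illustrative only; the alternative argument requires a recent version of scipy):

```python
import scipy.stats as stats

# Two small synthetic samples (made-up numbers for illustration)
group_a = [22, 25, 27, 23, 23, 24, 26, 24, 27, 27]
group_b = [25, 29, 28, 27, 28, 25, 23, 24, 30, 24]

# Welch's t-test: does not assume the two groups have equal variance
result = stats.ttest_ind(group_a, group_b, equal_var=False)
print(result.statistic, result.pvalue)

# One-sided test: is group_a's mean less than group_b's?
result_one_sided = stats.ttest_ind(group_a, group_b, alternative="less")
print(result_one_sided.pvalue)
```

Note that the result object exposes the statistic and p-value as attributes, so we don't have to read them off the printed output.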

Challenge

Perform a t-test looking at whether there is a difference by sex as opposed to treatment.

PYTHON

m_weights = mouse_df[mouse_df["Sex"] == "M"]["Weight"]
f_weights = mouse_df[mouse_df["Sex"] == "F"]["Weight"]

stats.ttest_ind(m_weights, f_weights)

OUTPUT

Ttest_indResult(statistic=-0.16005488537123244, pvalue=0.8739869209422747)

Content from Reshaping Data


Last updated on 2023-04-27

Overview

Questions

  • How can I change the shape of my data?
  • What is the difference between long and wide data?

Objectives

  • Distinguish between long and wide data formats.
  • Convert from wide to long format with melt.
  • Convert from long to wide format with pivot_table.

Key Points

  • Long data has one value per row; wide data has one row per entity and one column per variable.
  • Use melt to convert from wide to long format.
  • Use pivot_table to convert from long to wide format.
  • Missing data appears as absent rows in long format but as explicit null values in wide format.

Wide data and long data


We’ve so far seen our gene expression dataset in two distinct formats which we used for different purposes.

Our original dataset contains a single count value per row, and has all metadata included:

PYTHON

import pandas as pd

url = "https://raw.githubusercontent.com/ccb-hms/workbench-python-workshop/main/episodes/data/rnaseq_reduced.csv"
rnaseq_df = pd.read_csv(url)
print(rnaseq_df)

OUTPUT

         gene      sample  expression  time
0         Asl  GSM2545336        1170     8
1        Apod  GSM2545336       36194     8
2     Cyp2d22  GSM2545336        4060     8
3        Klk6  GSM2545336         287     8
4       Fcrls  GSM2545336          85     8
...       ...         ...         ...   ...
7365    Mgst3  GSM2545340        1563     4
7366   Lrrc52  GSM2545340           2     4
7367     Rxrg  GSM2545340          26     4
7368    Lmx1a  GSM2545340          81     4
7369     Pbx1  GSM2545340        3805     4

[7370 rows x 4 columns]

Data in this format is referred to as long data.

We also looked at a version of the same data which only contained expression data, where each sample was a column:

PYTHON

url = "https://raw.githubusercontent.com/ccb-hms/workbench-python-workshop/main/episodes/data/expression_matrix.csv"
expression_matrix = pd.read_csv(url, index_col=0)
print(expression_matrix.iloc[:10, :5])

OUTPUT

          GSM2545336  GSM2545337  GSM2545338  GSM2545339  GSM2545340
gene
AI504432        1230        1085         969        1284         966
AW046200          83         144         129         121         141
AW551984         670         265         202         682         246
Aamp            5621        4049        3797        4375        4095
Abca12             5           8           1           5           3
Abcc8           2210        1966        2181        2362        2475
Abhd14a          490         495         474         468         489
Abi2            5627        4383        4107        4062        4289
Abi3bp           807        1144        1028         935         879
Abl2            2392        2133        1891        1645        1926

Data in this format is referred to as wide data.

Each format has its pros and cons, and we often need to switch between the two. Wide data is more human readable, and allows for easy comparison across different samples, such as finding the genes with the highest average expression. However, wide data requires any sample metadata to exist as a separate dataframe. This makes it much more difficult to work with sample metadata, such as calculating the mean expression at a particular timepoint.

In contrast, long data is considered less human readable. We have to first aggregate our data by gene if we want to do something like calculate average expression. However, we can easily group the data by whatever sample metadata we wish. We will also see in the next session that long data is the preferred format for plotting.

Wide and long data also handle missing data differently. In long data, any values which don’t exist for a particular sample simply aren’t rows in the dataset. Thus, our data does not have null values, but it is harder to tell where data is missing. In wide data, there is a matrix position for every combination of sample and gene. Thus, any missing data is shown as a null value. This can be more difficult to deal with, but makes missing data clear.
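A tiny made-up example makes the difference concrete: below, sample S2 has no measurement for gene g2, which is an absent row in the long version but an explicit NaN in the wide version.

```python
import pandas as pd
import numpy as np

# Toy long data: the g2/S2 combination is simply not a row
long_df = pd.DataFrame({
    "gene":       ["g1", "g2", "g1"],
    "sample":     ["S1", "S1", "S2"],
    "expression": [10, 5, 12],
})
print(long_df.isnull().sum().sum())   # 0 -- no null values anywhere

# The same data in wide format has an explicit NaN in the g2/S2 cell
wide_df = pd.DataFrame(
    {"S1": [10, 5], "S2": [12, np.nan]},
    index=pd.Index(["g1", "g2"], name="gene"),
)
print(wide_df.isnull().sum().sum())   # 1 -- the gap shows up as a null value
```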

Convert from wide to long with melt.


Melt goes from wide to long data

We can convert from wide data to long data using melt. melt takes in the arguments var_name and value_name, which are strings that name the two created columns. We also typically want to set the ignore_index argument to False, otherwise pandas will drop the dataframe index.

We could also allow pandas to drop the index, if there is some other column we want the new rows to be named by, and pass that in instead as the id_vars argument.

PYTHON

long_data = expression_matrix.melt(var_name="sample", value_name="expression", ignore_index=False)
print(long_data)

OUTPUT

              sample  expression
gene
AI504432  GSM2545336        1230
AW046200  GSM2545336          83
AW551984  GSM2545336         670
Aamp      GSM2545336        5621
Abca12    GSM2545336           5
...              ...         ...
Zkscan3   GSM2545380        1900
Zranb1    GSM2545380        9942
Zranb3    GSM2545380         202
Zscan22   GSM2545380         527
Zw10      GSM2545380        1664

[32428 rows x 2 columns]

Convert from long to wide with pivot.


Pivot_table goes from long to wide

To go the other way, we want to use the pivot_table method.

This method takes in columns, the column to get new column names from; values, what to populate the matrix with; and index, what the row names of the wide data should be.

PYTHON

wide_data = rnaseq_df.pivot_table(columns = "sample", values = "expression", index = "gene")
print(wide_data)

OUTPUT

sample    GSM2545336  GSM2545337  GSM2545338  GSM2545339  GSM2545340
gene
AI504432        1230        1085         969        1284         966
AW046200          83         144         129         121         141
AW551984         670         265         202         682         246
Aamp            5621        4049        3797        4375        4095
Abca12             5           8           1           5           3
...              ...         ...         ...         ...         ...
Zkscan3         1732        1840        1800        1751        2056
Zranb1          8837        5306        5106        5306        5896
Zranb3           207         179         199         208         184
Zscan22          483         535         533         462         439
Zw10            1479        1394        1279        1376        1568

[1474 rows x 5 columns]

Note that any columns in the dataframe which are not used as values are dropped.

Challenge

Create a dataframe of the rnaseq dataset where each row is a gene and each column is a timepoint instead of a sample. The values in each column should be the mean count across all samples at that timepoint.

Take a look at the aggfunc argument in pivot_table. What is the default?

By default, pivot_table aggregates values by mean.

PYTHON

rnaseq_df.pivot_table(columns = "time", values = "expression", index="gene")
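To make the aggregation explicit, here is a toy sketch (made-up values) where two rows share the same gene/time combination and collapse to their mean by default:

```python
import pandas as pd

df = pd.DataFrame({
    "gene":       ["g1", "g1", "g2", "g2"],
    "time":       [0, 0, 0, 4],
    "expression": [10, 20, 5, 7],
})

# Default aggfunc is "mean": the two g1/time-0 rows collapse to 15
print(df.pivot_table(columns="time", values="expression", index="gene"))

# We can pass an explicit alternative, e.g. take the maximum instead
print(df.pivot_table(columns="time", values="expression", index="gene", aggfunc="max"))
```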

Content from Combining Data


Last updated on 2023-04-27

Overview

Questions

  • How can I combine dataframes?
  • How do I handle missing or incomplete data mappings?

Objectives

  • Determine which kind of combination is desired between two dataframes.
  • Combine two dataframes row-wise or column-wise.
  • Identify the different types of joins.

Key Points

  • Concatenate dataframes to add additional rows.
  • Merge/join data frames to add additional columns.
  • Change the on argument to choose what is matched between dataframes when joining.
  • The different types of joins control how missing data is handled for the left and right dataframes.

There are a variety of ways we might want to combine data when performing a data analysis. We can generally group these into concatenating (sometimes called appending) and merging (sometimes called joining).

We will continue to use the rnaseq dataset:

PYTHON

import pandas as pd

url = "https://raw.githubusercontent.com/ccb-hms/workbench-python-workshop/main/episodes/data/rnaseq.csv"
rnaseq_df = pd.read_csv(url, index_col=0)

Concatenate dataframes to add additional rows.


When we want to combine two dataframes by adding one as additional rows, we concatenate them together. This is often the case if our observations are spread out over multiple files. To simulate this, let’s make two miniature versions of rnaseq_df with the first and last 10 rows of the data:

PYTHON

rnaseq_mini = rnaseq_df.loc[:,["sample", "expression"]].head(10)
rnaseq_mini_tail = rnaseq_df.loc[:,["sample", "expression"]].tail(10)
print(rnaseq_mini)
print(rnaseq_mini_tail)

OUTPUT

             sample  expression
gene
Asl      GSM2545336        1170
Apod     GSM2545336       36194
Cyp2d22  GSM2545336        4060
Klk6     GSM2545336         287
Fcrls    GSM2545336          85
Slc2a4   GSM2545336         782
Exd2     GSM2545336        1619
Gjc2     GSM2545336         288
Plp1     GSM2545336       43217
Gnb4     GSM2545336        1071
             sample  expression
gene
Dusp27   GSM2545380          15
Mael     GSM2545380           4
Gm16418  GSM2545380          16
Gm16701  GSM2545380         181
Aldh9a1  GSM2545380        1770
Mgst3    GSM2545380        2151
Lrrc52   GSM2545380           5
Rxrg     GSM2545380          49
Lmx1a    GSM2545380          72
Pbx1     GSM2545380        4795

We can then concatenate the dataframes using the pandas function concat.

PYTHON

combined_df = pd.concat([rnaseq_mini, rnaseq_mini_tail])
print(combined_df)

OUTPUT

             sample  expression
gene
Asl      GSM2545336        1170
Apod     GSM2545336       36194
Cyp2d22  GSM2545336        4060
Klk6     GSM2545336         287
Fcrls    GSM2545336          85
Slc2a4   GSM2545336         782
Exd2     GSM2545336        1619
Gjc2     GSM2545336         288
Plp1     GSM2545336       43217
Gnb4     GSM2545336        1071
Dusp27   GSM2545380          15
Mael     GSM2545380           4
Gm16418  GSM2545380          16
Gm16701  GSM2545380         181
Aldh9a1  GSM2545380        1770
Mgst3    GSM2545380        2151
Lrrc52   GSM2545380           5
Rxrg     GSM2545380          49
Lmx1a    GSM2545380          72
Pbx1     GSM2545380        4795

We now have 20 rows in our combined dataset, and the same number of columns. Note that concat is a function of the pd module, as opposed to a dataframe method. It takes in a list of dataframes, and outputs a combined dataframe.

If one dataframe has columns which don’t exist in the other, these values are filled in with NaN.

PYTHON

rnaseq_mini_time = rnaseq_df.loc[:,["sample", "expression","time"]].iloc[10:20,:]
print(rnaseq_mini_time)

OUTPUT

            sample  expression  time
gene
Tnc     GSM2545336         219     8
Trf     GSM2545336        9719     8
Tubb2b  GSM2545336        2245     8
Fads1   GSM2545336        6498     8
Lxn     GSM2545336        1744     8
Prr18   GSM2545336        1284     8
Cmtm5   GSM2545336        1381     8
Enpp1   GSM2545336         388     8
Clic4   GSM2545336        5795     8
Tm6sf2  GSM2545336          32     8

PYTHON

mini_dfs = [rnaseq_mini, rnaseq_mini_time, rnaseq_mini_tail]
combined_df = pd.concat(mini_dfs)
print(combined_df)

OUTPUT

             sample  expression  time
gene
Asl      GSM2545336        1170   NaN
Apod     GSM2545336       36194   NaN
Cyp2d22  GSM2545336        4060   NaN
Klk6     GSM2545336         287   NaN
Fcrls    GSM2545336          85   NaN
Slc2a4   GSM2545336         782   NaN
Exd2     GSM2545336        1619   NaN
Gjc2     GSM2545336         288   NaN
Plp1     GSM2545336       43217   NaN
Gnb4     GSM2545336        1071   NaN
Tnc      GSM2545336         219   8.0
Trf      GSM2545336        9719   8.0
Tubb2b   GSM2545336        2245   8.0
Fads1    GSM2545336        6498   8.0
Lxn      GSM2545336        1744   8.0
Prr18    GSM2545336        1284   8.0
Cmtm5    GSM2545336        1381   8.0
Enpp1    GSM2545336         388   8.0
Clic4    GSM2545336        5795   8.0
Tm6sf2   GSM2545336          32   8.0
Dusp27   GSM2545380          15   NaN
Mael     GSM2545380           4   NaN
Gm16418  GSM2545380          16   NaN
Gm16701  GSM2545380         181   NaN
Aldh9a1  GSM2545380        1770   NaN
Mgst3    GSM2545380        2151   NaN
Lrrc52   GSM2545380           5   NaN
Rxrg     GSM2545380          49   NaN
Lmx1a    GSM2545380          72   NaN
Pbx1     GSM2545380        4795   NaN

Merge or join data frames to add additional columns.


As opposed to concatenating data, we instead might want to add additional columns to a dataframe. We’ve already seen how to add a new column based on a calculation, but often we have some other data table we want to combine.

If we know that the rows are in the same order, we can use the same syntax we use to assign new columns. However, this is often not the case.

PYTHON

# There is an ongoing ideological debate among developers whether variables like this should be named:
# url1, url2, and url3
# url0, url1, and url2 or
# url, url2, and url3
url1 = "https://raw.githubusercontent.com/ccb-hms/workbench-python-workshop/main/episodes/data/annot1.csv"
url2 = "https://raw.githubusercontent.com/ccb-hms/workbench-python-workshop/main/episodes/data/annot2.csv"
url3 = "https://raw.githubusercontent.com/ccb-hms/workbench-python-workshop/main/episodes/data/annot3.csv"

# Here .sample is being used to shuffle the dataframe rows into a random order
annot1 = pd.read_csv(url1, index_col=0).sample(frac=1)
annot2 = pd.read_csv(url2, index_col=0).sample(frac=1)
annot3 = pd.read_csv(url3, index_col=0).sample(frac=1)
print(annot1)

OUTPUT

                                          gene_description
gene
Fcrls    Fc receptor-like S, scavenger receptor [Source...
Plp1     proteolipid protein (myelin) 1 [Source:MGI Sym...
Apod     apolipoprotein D [Source:MGI Symbol;Acc:MGI:88...
Klk6     kallikrein related-peptidase 6 [Source:MGI Sym...
Gnb4     guanine nucleotide binding protein (G protein)...
Cyp2d22  cytochrome P450, family 2, subfamily d, polype...
Slc2a4   solute carrier family 2 (facilitated glucose t...
Asl      argininosuccinate lyase [Source:MGI Symbol;Acc...
Gjc2     gap junction protein, gamma 2 [Source:MGI Symb...
Exd2     exonuclease 3'-5' domain containing 2 [Source:...

PYTHON

print(rnaseq_mini.join(annot1))

OUTPUT

             sample  expression  \
gene
Asl      GSM2545336        1170
Apod     GSM2545336       36194
Cyp2d22  GSM2545336        4060
Klk6     GSM2545336         287
Fcrls    GSM2545336          85
Slc2a4   GSM2545336         782
Exd2     GSM2545336        1619
Gjc2     GSM2545336         288
Plp1     GSM2545336       43217
Gnb4     GSM2545336        1071

                                          gene_description
gene
Asl      argininosuccinate lyase [Source:MGI Symbol;Acc...
Apod     apolipoprotein D [Source:MGI Symbol;Acc:MGI:88...
Cyp2d22  cytochrome P450, family 2, subfamily d, polype...
Klk6     kallikrein related-peptidase 6 [Source:MGI Sym...
Fcrls    Fc receptor-like S, scavenger receptor [Source...
Slc2a4   solute carrier family 2 (facilitated glucose t...
Exd2     exonuclease 3'-5' domain containing 2 [Source:...
Gjc2     gap junction protein, gamma 2 [Source:MGI Symb...
Plp1     proteolipid protein (myelin) 1 [Source:MGI Sym...
Gnb4     guanine nucleotide binding protein (G protein)...  

We have combined the two dataframes to add the gene_description column. By default, join looks at the index column of the left and right dataframe, combining rows when it finds matches. The data used to determine which rows should be combined between the dataframes is referred to as what is being joined on, or the keys. Here, we would say we are joining on the index columns, or the index columns are the keys. The row order and column order depends on which dataframe is on the left.

PYTHON

print(annot1.join(rnaseq_mini))

OUTPUT

                                          gene_description      sample  \
gene
Fcrls    Fc receptor-like S, scavenger receptor [Source...  GSM2545336
Plp1     proteolipid protein (myelin) 1 [Source:MGI Sym...  GSM2545336
Apod     apolipoprotein D [Source:MGI Symbol;Acc:MGI:88...  GSM2545336
Klk6     kallikrein related-peptidase 6 [Source:MGI Sym...  GSM2545336
Gnb4     guanine nucleotide binding protein (G protein)...  GSM2545336
Cyp2d22  cytochrome P450, family 2, subfamily d, polype...  GSM2545336
Slc2a4   solute carrier family 2 (facilitated glucose t...  GSM2545336
Asl      argininosuccinate lyase [Source:MGI Symbol;Acc...  GSM2545336
Gjc2     gap junction protein, gamma 2 [Source:MGI Symb...  GSM2545336
Exd2     exonuclease 3'-5' domain containing 2 [Source:...  GSM2545336

         expression
gene
Fcrls            85
Plp1          43217
Apod          36194
Klk6            287
Gnb4           1071
Cyp2d22        4060
Slc2a4          782
Asl            1170
Gjc2            288
Exd2           1619

And the index columns do not need to have the same name.

PYTHON

print(annot2)

OUTPUT

                                                          description
external_gene_name
Slc2a4              solute carrier family 2 (facilitated glucose t...
Gnb4                guanine nucleotide binding protein (G protein)...
Fcrls               Fc receptor-like S, scavenger receptor [Source...
Gjc2                gap junction protein, gamma 2 [Source:MGI Symb...
Klk6                kallikrein related-peptidase 6 [Source:MGI Sym...
Plp1                proteolipid protein (myelin) 1 [Source:MGI Sym...
Cyp2d22             cytochrome P450, family 2, subfamily d, polype...
Exd2                exonuclease 3'-5' domain containing 2 [Source:...
Apod                apolipoprotein D [Source:MGI Symbol;Acc:MGI:88...
Asl                 argininosuccinate lyase [Source:MGI Symbol;Acc...

PYTHON

print(rnaseq_mini.join(annot2))

OUTPUT

             sample  expression  \
gene
Asl      GSM2545336        1170
Apod     GSM2545336       36194
Cyp2d22  GSM2545336        4060
Klk6     GSM2545336         287
Fcrls    GSM2545336          85
Slc2a4   GSM2545336         782
Exd2     GSM2545336        1619
Gjc2     GSM2545336         288
Plp1     GSM2545336       43217
Gnb4     GSM2545336        1071

                                               description
gene
Asl      argininosuccinate lyase [Source:MGI Symbol;Acc...
Apod     apolipoprotein D [Source:MGI Symbol;Acc:MGI:88...
Cyp2d22  cytochrome P450, family 2, subfamily d, polype...
Klk6     kallikrein related-peptidase 6 [Source:MGI Sym...
Fcrls    Fc receptor-like S, scavenger receptor [Source...
Slc2a4   solute carrier family 2 (facilitated glucose t...
Exd2     exonuclease 3'-5' domain containing 2 [Source:...
Gjc2     gap junction protein, gamma 2 [Source:MGI Symb...
Plp1     proteolipid protein (myelin) 1 [Source:MGI Sym...
Gnb4     guanine nucleotide binding protein (G protein)...  

join by default uses the indices of the dataframes it combines. To change this, we can use the on parameter. For instance, let’s say we want to combine our sample metadata back with rnaseq_mini.

PYTHON

url = "https://raw.githubusercontent.com/ccb-hms/workbench-python-workshop/main/episodes/data/metadata.csv"
metadata = pd.read_csv(url, index_col=0)
print(metadata)

OUTPUT

                organism  age     sex    infection   strain  time      tissue  \
sample
GSM2545336  Mus musculus    8  Female   InfluenzaA  C57BL/6     8  Cerebellum
GSM2545337  Mus musculus    8  Female  NonInfected  C57BL/6     0  Cerebellum
GSM2545338  Mus musculus    8  Female  NonInfected  C57BL/6     0  Cerebellum
GSM2545339  Mus musculus    8  Female   InfluenzaA  C57BL/6     4  Cerebellum
GSM2545340  Mus musculus    8    Male   InfluenzaA  C57BL/6     4  Cerebellum
GSM2545341  Mus musculus    8    Male   InfluenzaA  C57BL/6     8  Cerebellum
GSM2545342  Mus musculus    8  Female   InfluenzaA  C57BL/6     8  Cerebellum
GSM2545343  Mus musculus    8    Male  NonInfected  C57BL/6     0  Cerebellum
GSM2545344  Mus musculus    8  Female   InfluenzaA  C57BL/6     4  Cerebellum
GSM2545345  Mus musculus    8    Male   InfluenzaA  C57BL/6     4  Cerebellum
GSM2545346  Mus musculus    8    Male   InfluenzaA  C57BL/6     8  Cerebellum
GSM2545347  Mus musculus    8    Male   InfluenzaA  C57BL/6     8  Cerebellum
GSM2545348  Mus musculus    8  Female  NonInfected  C57BL/6     0  Cerebellum
GSM2545349  Mus musculus    8    Male  NonInfected  C57BL/6     0  Cerebellum
GSM2545350  Mus musculus    8    Male   InfluenzaA  C57BL/6     4  Cerebellum
GSM2545351  Mus musculus    8  Female   InfluenzaA  C57BL/6     8  Cerebellum
GSM2545352  Mus musculus    8  Female   InfluenzaA  C57BL/6     4  Cerebellum
GSM2545353  Mus musculus    8  Female  NonInfected  C57BL/6     0  Cerebellum
GSM2545354  Mus musculus    8    Male  NonInfected  C57BL/6     0  Cerebellum
GSM2545362  Mus musculus    8  Female   InfluenzaA  C57BL/6     4  Cerebellum
GSM2545363  Mus musculus    8    Male   InfluenzaA  C57BL/6     4  Cerebellum
GSM2545380  Mus musculus    8  Female   InfluenzaA  C57BL/6     8  Cerebellum

            mouse
sample
GSM2545336     14
GSM2545337      9
GSM2545338     10
GSM2545339     15
GSM2545340     18
GSM2545341      6
GSM2545342      5
GSM2545343     11
GSM2545344     22
GSM2545345     13
GSM2545346     23
GSM2545347     24
GSM2545348      8
GSM2545349      7
GSM2545350      1
GSM2545351     16
GSM2545352     21
GSM2545353      4
GSM2545354      2
GSM2545362     20
GSM2545363     12
GSM2545380     19  

PYTHON

print(rnaseq_mini.join(metadata, on="sample"))

OUTPUT

             sample  expression      organism  age     sex   infection  \
gene
Asl      GSM2545336        1170  Mus musculus    8  Female  InfluenzaA
Apod     GSM2545336       36194  Mus musculus    8  Female  InfluenzaA
Cyp2d22  GSM2545336        4060  Mus musculus    8  Female  InfluenzaA
Klk6     GSM2545336         287  Mus musculus    8  Female  InfluenzaA
Fcrls    GSM2545336          85  Mus musculus    8  Female  InfluenzaA
Slc2a4   GSM2545336         782  Mus musculus    8  Female  InfluenzaA
Exd2     GSM2545336        1619  Mus musculus    8  Female  InfluenzaA
Gjc2     GSM2545336         288  Mus musculus    8  Female  InfluenzaA
Plp1     GSM2545336       43217  Mus musculus    8  Female  InfluenzaA
Gnb4     GSM2545336        1071  Mus musculus    8  Female  InfluenzaA

          strain  time      tissue  mouse
gene
Asl      C57BL/6     8  Cerebellum     14
Apod     C57BL/6     8  Cerebellum     14
Cyp2d22  C57BL/6     8  Cerebellum     14
Klk6     C57BL/6     8  Cerebellum     14
Fcrls    C57BL/6     8  Cerebellum     14
Slc2a4   C57BL/6     8  Cerebellum     14
Exd2     C57BL/6     8  Cerebellum     14
Gjc2     C57BL/6     8  Cerebellum     14
Plp1     C57BL/6     8  Cerebellum     14
Gnb4     C57BL/6     8  Cerebellum     14  

Note that if we want to join on columns which have different names in the two dataframes, we can simply rename the columns we wish to join on. If there is a reason we don't want to do this, we can instead use the more powerful, but harder to use, pandas merge function.
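As a minimal sketch of both approaches (using made-up toy dataframes, not the lesson's dataset):

```python
import pandas as pd

# Toy dataframes whose key columns have different names (hypothetical values)
left = pd.DataFrame({"gene_id": ["Asl", "Apod"], "expression": [1170, 36194]})
right = pd.DataFrame({"gene_name": ["Asl", "Apod"],
                      "description": ["lyase", "apolipoprotein"]})

# Option 1: rename the key column so the names match, then join on the index
renamed = right.rename(columns={"gene_name": "gene_id"}).set_index("gene_id")
joined = left.set_index("gene_id").join(renamed)

# Option 2: merge can join on differently named columns directly
merged = pd.merge(left, right, left_on="gene_id", right_on="gene_name")
print(merged)
```

Note that merge keeps both key columns (`gene_id` and `gene_name`) in the result, while the rename-then-join approach leaves a single key.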

Missing and duplicate data

In the above join, there were a few other things going on. The first is that multiple rows of rnaseq_mini contain the same sample. Pandas by default will repeat the rows wherever needed when joining.
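A tiny sketch of this repetition, using toy data rather than the lesson's dataset:

```python
import pandas as pd

# Two expression rows share the same sample (hypothetical values)
expr = pd.DataFrame({"sample": ["s1", "s1"], "expression": [10, 20]},
                    index=["geneA", "geneB"])
meta = pd.DataFrame({"sex": ["Female"]},
                    index=pd.Index(["s1"], name="sample"))

# The single metadata row is repeated for every matching expression row
result = expr.join(meta, on="sample")
print(result)  # both rows carry sex == "Female"
```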

The second is that the metadata dataframe contains samples which are not present in rnaseq_mini. By default, these rows were dropped from the joined dataframe. However, there might be cases where we want to keep unmatched rows. Let’s explore how joining handles missing data with another example:

PYTHON

print(annot3)

OUTPUT

                                          gene_description
gene
mt-Rnr1  mitochondrially encoded 12S rRNA [Source:MGI S...
mt-Rnr2  mitochondrially encoded 16S rRNA [Source:MGI S...
Slc2a4   solute carrier family 2 (facilitated glucose t...
Asl      argininosuccinate lyase [Source:MGI Symbol;Acc...
Exd2     exonuclease 3'-5' domain containing 2 [Source:...
Gnb4     guanine nucleotide binding protein (G protein)...
mt-Tl1   mitochondrially encoded tRNA leucine 1 [Source...
Apod     apolipoprotein D [Source:MGI Symbol;Acc:MGI:88...
mt-Tv    mitochondrially encoded tRNA valine [Source:MG...
Cyp2d22  cytochrome P450, family 2, subfamily d, polype...
mt-Tf    mitochondrially encoded tRNA phenylalanine [So...
Fcrls    Fc receptor-like S, scavenger receptor [Source...
Gjc2     gap junction protein, gamma 2 [Source:MGI Symb...
Plp1     proteolipid protein (myelin) 1 [Source:MGI Sym...

Missing data in joins

What data is missing between annot3 and rnaseq_mini?

Try joining them. How is this missing data handled?

There are both genes in annot3 but not in rnaseq_mini, and genes in rnaseq_mini not in annot3.

When we join, we keep rows from rnaseq_mini with missing data and drop rows from annot3 with missing data.

PYTHON

print(rnaseq_mini.join(annot3))

OUTPUT

             sample  expression  \
gene
Asl      GSM2545336        1170
Apod     GSM2545336       36194
Cyp2d22  GSM2545336        4060
Klk6     GSM2545336         287
Fcrls    GSM2545336          85
Slc2a4   GSM2545336         782
Exd2     GSM2545336        1619
Gjc2     GSM2545336         288
Plp1     GSM2545336       43217
Gnb4     GSM2545336        1071

                                          gene_description
gene
Asl      argininosuccinate lyase [Source:MGI Symbol;Acc...
Apod     apolipoprotein D [Source:MGI Symbol;Acc:MGI:88...
Cyp2d22  cytochrome P450, family 2, subfamily d, polype...
Klk6                                                   NaN
Fcrls    Fc receptor-like S, scavenger receptor [Source...
Slc2a4   solute carrier family 2 (facilitated glucose t...
Exd2     exonuclease 3'-5' domain containing 2 [Source:...
Gjc2     gap junction protein, gamma 2 [Source:MGI Symb...
Plp1     proteolipid protein (myelin) 1 [Source:MGI Sym...
Gnb4     guanine nucleotide binding protein (G protein)...  

Types of joins

We see the above behavior because by default pandas performs a left join: every row of the left (calling) dataframe is kept, while rows of the right dataframe are dropped unless they match.

We can change the type of join performed by changing the how argument of the join method.
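The effect of each how value can be seen on a pair of toy dataframes (an illustrative sketch, not the lesson's data):

```python
import pandas as pd

left = pd.DataFrame({"x": [1, 2]}, index=["a", "b"])
right = pd.DataFrame({"y": [3, 4]}, index=["b", "c"])

# Which index labels survive each kind of join:
# left  -> ['a', 'b']        right -> ['b', 'c']
# inner -> ['b']             outer -> ['a', 'b', 'c']
for how in ["left", "right", "inner", "outer"]:
    print(how, list(left.join(right, how=how).index))
```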

A summary of the types of joins and what they keep and drop.

A right join keeps all genes in the right dataframe, but drops any genes unique to rnaseq_mini.

PYTHON

print(rnaseq_mini.join(annot3, how = 'right'))

OUTPUT

             sample  expression  \
gene
Plp1     GSM2545336     43217.0
Exd2     GSM2545336      1619.0
Gjc2     GSM2545336       288.0
mt-Rnr2         NaN         NaN
Gnb4     GSM2545336      1071.0
mt-Rnr1         NaN         NaN
Apod     GSM2545336     36194.0
Fcrls    GSM2545336        85.0
mt-Tv           NaN         NaN
mt-Tf           NaN         NaN
Slc2a4   GSM2545336       782.0
Cyp2d22  GSM2545336      4060.0
mt-Tl1          NaN         NaN
Asl      GSM2545336      1170.0

                                          gene_description
gene
Plp1     proteolipid protein (myelin) 1 [Source:MGI Sym...
Exd2     exonuclease 3'-5' domain containing 2 [Source:...
Gjc2     gap junction protein, gamma 2 [Source:MGI Symb...
mt-Rnr2  mitochondrially encoded 16S rRNA [Source:MGI S...
Gnb4     guanine nucleotide binding protein (G protein)...
mt-Rnr1  mitochondrially encoded 12S rRNA [Source:MGI S...
Apod     apolipoprotein D [Source:MGI Symbol;Acc:MGI:88...
Fcrls    Fc receptor-like S, scavenger receptor [Source...
mt-Tv    mitochondrially encoded tRNA valine [Source:MG...
mt-Tf    mitochondrially encoded tRNA phenylalanine [So...
Slc2a4   solute carrier family 2 (facilitated glucose t...
Cyp2d22  cytochrome P450, family 2, subfamily d, polype...
mt-Tl1   mitochondrially encoded tRNA leucine 1 [Source...
Asl      argininosuccinate lyase [Source:MGI Symbol;Acc...  

An inner join only keeps genes shared by the dataframes, and drops all genes which are only in one dataframe.

PYTHON

print(rnaseq_mini.join(annot3, how = 'inner'))

OUTPUT

             sample  expression  \
gene
Asl      GSM2545336        1170
Apod     GSM2545336       36194
Cyp2d22  GSM2545336        4060
Fcrls    GSM2545336          85
Slc2a4   GSM2545336         782
Exd2     GSM2545336        1619
Gjc2     GSM2545336         288
Plp1     GSM2545336       43217
Gnb4     GSM2545336        1071

                                          gene_description
gene
Asl      argininosuccinate lyase [Source:MGI Symbol;Acc...
Apod     apolipoprotein D [Source:MGI Symbol;Acc:MGI:88...
Cyp2d22  cytochrome P450, family 2, subfamily d, polype...
Fcrls    Fc receptor-like S, scavenger receptor [Source...
Slc2a4   solute carrier family 2 (facilitated glucose t...
Exd2     exonuclease 3'-5' domain containing 2 [Source:...
Gjc2     gap junction protein, gamma 2 [Source:MGI Symb...
Plp1     proteolipid protein (myelin) 1 [Source:MGI Sym...
Gnb4     guanine nucleotide binding protein (G protein)...  

Finally, an outer join keeps all genes across both dataframes.

PYTHON

print(rnaseq_mini.join(annot3, how = 'outer'))

OUTPUT

             sample  expression  \
gene
Apod     GSM2545336     36194.0
Asl      GSM2545336      1170.0
Cyp2d22  GSM2545336      4060.0
Exd2     GSM2545336      1619.0
Fcrls    GSM2545336        85.0
Gjc2     GSM2545336       288.0
Gnb4     GSM2545336      1071.0
Klk6     GSM2545336       287.0
Plp1     GSM2545336     43217.0
Slc2a4   GSM2545336       782.0
mt-Rnr1         NaN         NaN
mt-Rnr2         NaN         NaN
mt-Tf           NaN         NaN
mt-Tl1          NaN         NaN
mt-Tv           NaN         NaN

                                          gene_description
gene
Apod     apolipoprotein D [Source:MGI Symbol;Acc:MGI:88...
Asl      argininosuccinate lyase [Source:MGI Symbol;Acc...
Cyp2d22  cytochrome P450, family 2, subfamily d, polype...
Exd2     exonuclease 3'-5' domain containing 2 [Source:...
Fcrls    Fc receptor-like S, scavenger receptor [Source...
Gjc2     gap junction protein, gamma 2 [Source:MGI Symb...
Gnb4     guanine nucleotide binding protein (G protein)...
Klk6                                                   NaN
Plp1     proteolipid protein (myelin) 1 [Source:MGI Sym...
Slc2a4   solute carrier family 2 (facilitated glucose t...
mt-Rnr1  mitochondrially encoded 12S rRNA [Source:MGI S...
mt-Rnr2  mitochondrially encoded 16S rRNA [Source:MGI S...
mt-Tf    mitochondrially encoded tRNA phenylalanine [So...
mt-Tl1   mitochondrially encoded tRNA leucine 1 [Source...
mt-Tv    mitochondrially encoded tRNA valine [Source:MG...   

Duplicate column names

One common challenge we encounter when combining datasets from different sources is that they have identical column names.

Imagine that you’ve collected a second set of observations from the same samples, stored in an identically structured file. We can simulate this by generating some random numbers in a copy of rnaseq_mini.

PYTHON

new_mini = rnaseq_mini.copy()
new_mini["expression"] = pd.Series(range(50,50000)).sample(10, replace=True).array
print(new_mini)

Note: as these are pseudo-random numbers, your exact values will be different

OUTPUT

             sample  expression
gene
Asl      GSM2545336       48562
Apod     GSM2545336         583
Cyp2d22  GSM2545336       39884
Klk6     GSM2545336        6161
Fcrls    GSM2545336       10318
Slc2a4   GSM2545336       15991
Exd2     GSM2545336       44471
Gjc2     GSM2545336       40629
Plp1     GSM2545336       23146
Gnb4     GSM2545336       22506

Try joining rnaseq_mini and new_mini. What happens? Take a look at the lsuffix and rsuffix arguments for join. How can you use these to improve your joined dataframe?

When we try to join these dataframes we get an error.

PYTHON

rnaseq_mini.join(new_mini)

ERROR

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[17], line 1
----> 1 rnaseq_mini.join(new_mini)

File ~\anaconda3\envs\ml-env\lib\site-packages\pandas\core\frame.py:9976, in DataFrame.join(self, other, on, how, lsuffix, rsuffix, sort, validate)
   9813 def join(
   9814     self,
   9815     other: DataFrame | Series | list[DataFrame | Series],
   (...)
   9821     validate: str | None = None,
   9822 ) -> DataFrame:
   9823     """
   9824     Join columns of another DataFrame.
   9825
   (...)
   9974     5  K1  A5   B1
   9975     """
-> 9976     return self._join_compat(
   9977         other,
   9978         on=on,
   9979         how=how,
   9980         lsuffix=lsuffix,
   9981         rsuffix=rsuffix,
   9982         sort=sort,
   9983         validate=validate,
   9984     )

File ~\anaconda3\envs\ml-env\lib\site-packages\pandas\core\frame.py:10015, in DataFrame._join_compat(self, other, on, how, lsuffix, rsuffix, sort, validate)
  10005     if how == "cross":
  10006         return merge(
  10007             self,
  10008             other,
   (...)
  10013             validate=validate,
  10014         )
> 10015     return merge(
  10016         self,
  10017         other,
  10018         left_on=on,
  10019         how=how,
  10020         left_index=on is None,
  10021         right_index=True,
  10022         suffixes=(lsuffix, rsuffix),
  10023         sort=sort,
  10024         validate=validate,
  10025     )
  10026 else:
  10027     if on is not None:

File ~\anaconda3\envs\ml-env\lib\site-packages\pandas\core\reshape\merge.py:124, in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     93 @Substitution("\nleft : DataFrame or named Series")
     94 @Appender(_merge_doc, indents=0)
     95 def merge(
   (...)
    108     validate: str | None = None,
    109 ) -> DataFrame:
    110     op = _MergeOperation(
    111         left,
    112         right,
   (...)
    122         validate=validate,
    123     )
--> 124     return op.get_result(copy=copy)

File ~\anaconda3\envs\ml-env\lib\site-packages\pandas\core\reshape\merge.py:775, in _MergeOperation.get_result(self, copy)
    771     self.left, self.right = self._indicator_pre_merge(self.left, self.right)
    773 join_index, left_indexer, right_indexer = self._get_join_info()
--> 775 result = self._reindex_and_concat(
    776     join_index, left_indexer, right_indexer, copy=copy
    777 )
    778 result = result.__finalize__(self, method=self._merge_type)
    780 if self.indicator:

File ~\anaconda3\envs\ml-env\lib\site-packages\pandas\core\reshape\merge.py:729, in _MergeOperation._reindex_and_concat(self, join_index, left_indexer, right_indexer, copy)
    726 left = self.left[:]
    727 right = self.right[:]
--> 729 llabels, rlabels = _items_overlap_with_suffix(
    730     self.left._info_axis, self.right._info_axis, self.suffixes
    731 )
    733 if left_indexer is not None:
    734     # Pinning the index here (and in the right code just below) is not
    735     #  necessary, but makes the `.take` more performant if we have e.g.
    736     #  a MultiIndex for left.index.
    737     lmgr = left._mgr.reindex_indexer(
    738         join_index,
    739         left_indexer,
   (...)
    744         use_na_proxy=True,
    745     )

File ~\anaconda3\envs\ml-env\lib\site-packages\pandas\core\reshape\merge.py:2458, in _items_overlap_with_suffix(left, right, suffixes)
   2455 lsuffix, rsuffix = suffixes
   2457 if not lsuffix and not rsuffix:
-> 2458     raise ValueError(f"columns overlap but no suffix specified: {to_rename}")
   2460 def renamer(x, suffix):
   2461     """
   2462     Rename the left and right indices.
   2463
   (...)
   2474     x : renamed column name
   2475     """

ValueError: columns overlap but no suffix specified: Index(['sample', 'expression'], dtype='object')

We need to give the datasets suffixes so that there is no column name collision.

PYTHON

print(rnaseq_mini.join(new_mini, lsuffix="_exp1", rsuffix="_exp2"))

OUTPUT

        sample_exp1  expression_exp1 sample_exp2  expression_exp2
gene
Asl      GSM2545336             1170  GSM2545336            29016
Apod     GSM2545336            36194  GSM2545336            46560
Cyp2d22  GSM2545336             4060  GSM2545336             1823
Klk6     GSM2545336              287  GSM2545336            27428
Fcrls    GSM2545336               85  GSM2545336            45369
Slc2a4   GSM2545336              782  GSM2545336            31129
Exd2     GSM2545336             1619  GSM2545336            21478
Gjc2     GSM2545336              288  GSM2545336            34747
Plp1     GSM2545336            43217  GSM2545336            46074
Gnb4     GSM2545336             1071  GSM2545336            16370

While this works, we now have duplicate sample columns.

To avoid this, we could either drop the sample column in one of the dataframes before joining, or use merge to join on multiple columns and reset the index afterward.

PYTHON

#Option 1: Drop the column
print(rnaseq_mini.join(new_mini.drop('sample',axis=1), lsuffix="_exp1", rsuffix="_exp2"))

OUTPUT

             sample  expression_exp1  expression_exp2
gene
Asl      GSM2545336             1170            29016
Apod     GSM2545336            36194            46560
Cyp2d22  GSM2545336             4060             1823
Klk6     GSM2545336              287            27428
Fcrls    GSM2545336               85            45369
Slc2a4   GSM2545336              782            31129
Exd2     GSM2545336             1619            21478
Gjc2     GSM2545336              288            34747
Plp1     GSM2545336            43217            46074
Gnb4     GSM2545336             1071            16370

PYTHON

#Option 2: Merging on gene and sample
print(pd.merge(rnaseq_mini, new_mini, on=["gene","sample"]))

OUTPUT

             sample  expression_x  expression_y
gene
Asl      GSM2545336          1170         29016
Apod     GSM2545336         36194         46560
Cyp2d22  GSM2545336          4060          1823
Klk6     GSM2545336           287         27428
Fcrls    GSM2545336            85         45369
Slc2a4   GSM2545336           782         31129
Exd2     GSM2545336          1619         21478
Gjc2     GSM2545336           288         34747
Plp1     GSM2545336         43217         46074
Gnb4     GSM2545336          1071         16370

Note that pd.merge does not throw an error when dealing with duplicate column names, but instead automatically uses the suffixes _x and _y. We could change these defaults with the suffixes argument.
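For example, on toy dataframes with hypothetical values, suffixes overrides the _x/_y defaults:

```python
import pandas as pd

a = pd.DataFrame({"gene": ["Asl"], "expression": [1170]})
b = pd.DataFrame({"gene": ["Asl"], "expression": [29016]})

# Name the overlapping columns explicitly instead of _x and _y
out = pd.merge(a, b, on="gene", suffixes=("_exp1", "_exp2"))
print(out.columns.tolist())  # ['gene', 'expression_exp1', 'expression_exp2']
```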

Content from Visualizing data with matplotlib and seaborn


Last updated on 2023-04-27 | Edit this page

Overview

Questions

  • How can I plot my data?
  • How can I save my plot for publishing?

Objectives

  • Visualize list-type data using matplotlib.
  • Visualize dataframe data with matplotlib and seaborn.
  • Customize plot aesthetics.
  • Create publication-ready plots for a particular context.

Key Points

  • matplotlib is the most widely used scientific plotting library in Python.
  • Plot data directly from a Pandas dataframe.
  • Select and transform data, then plot it.
  • Many styles of plot are available: see the Python Graph Gallery for more options.
  • Seaborn extends matplotlib and provides useful defaults and integration with dataframes.

matplotlib is the most widely used scientific plotting library in Python.


  • Commonly use a sub-library called matplotlib.pyplot.
  • The Jupyter Notebook will render plots inline by default.

PYTHON

import matplotlib.pyplot as plt

  • Simple plots are then (fairly) simple to create.

PYTHON

time = [0, 1, 2, 3]
position = [0, 100, 200, 300]

plt.plot(time, position)
plt.xlabel('Time (hr)')
plt.ylabel('Position (km)')
Simple Position-Time Plot

Display All Open Figures

In our Jupyter Notebook example, running the cell should generate the figure directly below the code. The figure is also included in the Notebook document for future viewing. However, other Python environments like an interactive Python session started from a terminal or a Python script executed via the command line require an additional command to display the figure.

Instruct matplotlib to show a figure:

PYTHON

plt.show()

This command can also be used within a Notebook - for instance, to display multiple figures if several are created by a single cell.
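A sketch of one cell (or script) producing two separate figures; the Agg backend line is only an assumption so the sketch runs without a display attached:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for headless runs
import matplotlib.pyplot as plt

# Each plt.figure() starts a new, separate figure
fig1 = plt.figure()
plt.plot([0, 1, 2], [0, 1, 4])

fig2 = plt.figure()
plt.plot([0, 1, 2], [4, 1, 0])

plt.show()  # in a notebook or interactive session, both figures are displayed
```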

Plot data directly from a Pandas dataframe.


PYTHON

import pandas as pd

url = "https://raw.githubusercontent.com/ccb-hms/workbench-python-workshop/main/episodes/data/rnaseq.csv"
rnaseq_df = pd.read_csv(url, index_col=0)

rnaseq_df.loc[:,'expression'].plot(kind = 'hist')
Expression histogram

Select and transform data, then plot it.


  • By default, DataFrame.plot plots with the rows as the X axis.
  • We can transform the data to plot multiple samples.

PYTHON

expression_matrix = rnaseq_df.pivot_table(columns = "time", values = "expression", index="gene")
expression_matrix.plot(kind="box", logy=True)
Boxplot by timepoint

Many plot options are available.


  • For example, we can use the subplots argument to separate our plot automatically by column.

PYTHON

expression_matrix.plot(kind="hist", subplots=True)
Histogram by timepoint as subplots.

Data can also be plotted by calling the matplotlib plot function directly.


  • The command is plt.plot(x, y)
  • The color and format of markers can also be specified as an additional optional argument e.g., b- is a blue line, g-- is a green dashed line.

PYTHON

expression_asl = expression_matrix.loc['Asl']
plt.plot(expression_asl, 'g--')
Asl expression over time

You can plot many sets of data together.


PYTHON

plt.plot(expression_matrix.loc['Asl'], 'g--', label = 'Asl')
plt.plot(expression_matrix.loc['Apod'], 'r-', label = 'Apod')
plt.plot(expression_matrix.loc['Cyp2d22'], 'b-', label = 'Cyp2d22')
plt.plot(expression_matrix.loc['Klk6'], 'b--', label = 'Klk6')

# Create legend.
plt.legend(loc='upper left')
plt.xlabel('Time (days)')
plt.ylabel('Mean Expression')
Multiple lines on the same plot with a legend

Adding a Legend

Often when plotting multiple datasets on the same figure it is desirable to have a legend describing the data.

This can be done in matplotlib in two stages:

  • Provide a label for each dataset in the figure:

PYTHON

plt.plot(expression_matrix.loc['Asl'], 'g--', label = 'Asl')
plt.plot(expression_matrix.loc['Apod'], 'r-', label = 'Apod')

  • Instruct matplotlib to create the legend.

PYTHON

plt.legend()

By default matplotlib will attempt to place the legend in a suitable position. If you would rather specify a position, this can be done with the loc= argument, e.g. to place the legend in the upper left corner of the plot, specify loc='upper left'.

Seaborn provides more integrated plotting with Pandas and useful defaults


Seaborn is a plotting library built on top of matplotlib.pyplot. It provides higher-level functions for creating high-quality data visualizations, and integrates easily with pandas dataframes.

Let’s see what a boxplot looks like using seaborn.

PYTHON

import seaborn as sns
import numpy as np

#Use seaborn default styling
sns.set_theme()

#Add a log expression column with a pseudocount of 1
rnaseq_df = rnaseq_df.assign(log_exp = np.log2(rnaseq_df["expression"]+1))

sns.boxplot(data = rnaseq_df, x = "time", y = "log_exp")
A simple boxplot using seaborn

Instead of providing data to x and y directly, with seaborn we instead give it a dataframe with the data argument, then tell seaborn which columns of the dataframe we want to be used for each axis.

Notice that we don’t have to give reshaped data to seaborn; it aggregates our data for us. However, we still have to perform more complicated transformations ourselves.

Scenario: log foldchange scatterplot


Imagine we want to compare how different genes behave at different timepoints, and we are especially interested in whether genes on any chromosome or of a specific type stand out. Let’s make a scatterplot of the log foldchange of our genes step-by-step.

Prepare the data

First, we need to aggregate our data. We could join expression_matrix with the columns we want from rnaseq_df, but instead let’s re-pivot our data and bring gene_biotype and chromosome_name along.

PYTHON

time_matrix = rnaseq_df.pivot_table(columns = "time", values = "log_exp", index=["gene", "chromosome_name", "gene_biotype"])
time_matrix = time_matrix.reset_index(level=["chromosome_name", "gene_biotype"])
print(time_matrix)

OUTPUT

time     chromosome_name    gene_biotype          0          4          8
gene
AI504432               3          lncRNA  10.010544  10.085962   9.974814
AW046200               8          lncRNA   7.275920   7.226188   6.346325
AW551984               9  protein_coding   7.860330   7.264752   8.194396
Aamp                   1  protein_coding  12.159714  12.235069  12.198784
Abca12                 1  protein_coding   2.509810   2.336441   2.329376
...                  ...             ...        ...        ...        ...
Zkscan3               13  protein_coding  11.031923  10.946440  10.696602
Zranb1                 7  protein_coding  12.546820  12.646836  12.984985
Zranb3                 1  protein_coding   7.582472   7.611006   7.578124
Zscan22                7  protein_coding   9.239555   9.049392   8.689101
Zw10                   9  protein_coding  10.586751  10.598198  10.524456

[1474 rows x 5 columns]

Dataframes practice

Starting with time_matrix, add new columns for the log foldchange from 0 to 4 hours and one with the log foldchange from 0 to 8 hours.

We already have the mean log expression of each gene at each timepoint. We need to calculate the foldchange.

PYTHON

# We have to use loc here since our column name is a number
time_matrix["logfc_0_4"] = time_matrix.loc[:,4] - time_matrix.loc[:,0]
time_matrix["logfc_0_8"] = time_matrix.loc[:,8] - time_matrix.loc[:,0]
print(time_matrix)

OUTPUT

time     chromosome_name    gene_biotype          0          4          8  \
gene
AI504432               3          lncRNA  10.010544  10.085962   9.974814
AW046200               8          lncRNA   7.275920   7.226188   6.346325
AW551984               9  protein_coding   7.860330   7.264752   8.194396
Aamp                   1  protein_coding  12.159714  12.235069  12.198784
Abca12                 1  protein_coding   2.509810   2.336441   2.329376
...                  ...             ...        ...        ...        ...
Zkscan3               13  protein_coding  11.031923  10.946440  10.696602
Zranb1                 7  protein_coding  12.546820  12.646836  12.984985
Zranb3                 1  protein_coding   7.582472   7.611006   7.578124
Zscan22                7  protein_coding   9.239555   9.049392   8.689101
Zw10                   9  protein_coding  10.586751  10.598198  10.524456

time      logfc_0_4  logfc_0_8
gene
AI504432   0.075419  -0.035730
AW046200  -0.049732  -0.929596
AW551984  -0.595578   0.334066
Aamp       0.075354   0.039070
Abca12    -0.173369  -0.180433
...             ...        ...
Zkscan3   -0.085483  -0.335321
Zranb1     0.100015   0.438165
Zranb3     0.028535  -0.004348
Zscan22   -0.190164  -0.550454
Zw10       0.011447  -0.062295

[1474 rows x 7 columns]

Note that we could have first calculated the foldchange and then log transformed it. Since log2(a/b) = log2(a) - log2(b), subtracting the log-transformed values gives the same result, and working with the already-transformed columns avoids an extra division step, so log transforming first is preferred.
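A quick numeric check that subtracting log values matches the log of the ratio (illustrative numbers chosen as exact powers of two):

```python
import numpy as np

a, b = 1024.0, 64.0
# log2 of a ratio equals the difference of the log2 values
print(np.log2(a / b))           # 4.0
print(np.log2(a) - np.log2(b))  # 4.0
```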

Making an effective scatterplot

Now we can make a scatterplot of the foldchanges.

PYTHON

sns.scatterplot(data = time_matrix, x = "logfc_0_4", y = "logfc_0_8")
Basic foldchange scatterplot.

Let’s now improve this plot step-by-step. First we can give the axes more human-readable names than the dataframe’s column names.

PYTHON

sns.scatterplot(data = time_matrix, x = "logfc_0_4", y = "logfc_0_8")
plt.xlabel("0min to 4min (log FC)")
plt.ylabel("0min to 8min (log FC)")
Axis labels changed

It is difficult to see the density of genes because there are so many points. Let’s make the points a little transparent to help with this by changing their alpha level.

PYTHON

sns.scatterplot(data = time_matrix, x = "logfc_0_4", y = "logfc_0_8", alpha = 0.4)
plt.xlabel("0min to 4min (log FC)")
plt.ylabel("0min to 8min (log FC)")
Adjusted alpha level

Or we could make the points smaller by adjusting the s argument.

PYTHON

sns.scatterplot(data = time_matrix, x = "logfc_0_4", y = "logfc_0_8", alpha = 0.6, s=4)
plt.xlabel("0min to 4min (log FC)")
plt.ylabel("0min to 8min (log FC)")
Adjusted point size

Now let’s incorporate gene_biotype. To do this, we can set the hue of the scatterplot to the column name. We’re also going to make some more room for the plot by changing matplotlib’s rcParams, which are the global plotting settings.

Global plot settings with rcParams

rcParams is a specialized dictionary used by matplotlib to store all global plot settings. You can change rcParams to set things like the default figure size, font sizes, and line widths.
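For instance, a couple of commonly adjusted settings (a sketch; figure.figsize and font.size are standard rcParams keys):

```python
import matplotlib.pyplot as plt

# Global defaults applied to every subsequently created figure
plt.rcParams["figure.figsize"] = [12, 8]  # width, height in inches
plt.rcParams["font.size"] = 14            # base font size in points
print(plt.rcParams["figure.figsize"])     # [12.0, 8.0]
```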

PYTHON

plt.rcParams['figure.figsize'] = [12, 8]
sns.scatterplot(data = time_matrix, x = "logfc_0_4", y = "logfc_0_8", alpha = 0.6, hue="gene_biotype")
plt.xlabel("0min to 4min (log FC)")
plt.ylabel("0min to 8min (log FC)")
Added gene_biotype as hue

It looks like we overwhelmingly have protein coding genes in the dataset. Instead of plotting every biotype, let’s just plot whether or not the genes are protein coding.

Challenge

Create a new boolean column, is_protein_coding which specifies whether or not each gene is protein coding.

Make a scatterplot where the style of the points varies with is_protein_coding.

PYTHON

time_matrix["is_protein_coding"] = time_matrix["gene_biotype"]=="protein_coding"
print(time_matrix)

OUTPUT

time     chromosome_name    gene_biotype          0          4          8  \
gene
AI504432               3          lncRNA  10.010544  10.085962   9.974814
AW046200               8          lncRNA   7.275920   7.226188   6.346325
AW551984               9  protein_coding   7.860330   7.264752   8.194396
Aamp                   1  protein_coding  12.159714  12.235069  12.198784
Abca12                 1  protein_coding   2.509810   2.336441   2.329376
...                  ...             ...        ...        ...        ...
Zkscan3               13  protein_coding  11.031923  10.946440  10.696602
Zranb1                 7  protein_coding  12.546820  12.646836  12.984985
Zranb3                 1  protein_coding   7.582472   7.611006   7.578124
Zscan22                7  protein_coding   9.239555   9.049392   8.689101
Zw10                   9  protein_coding  10.586751  10.598198  10.524456

time      logfc_0_4  logfc_0_8  is_protein_coding
gene
AI504432   0.075419  -0.035730              False
AW046200  -0.049732  -0.929596              False
AW551984  -0.595578   0.334066               True
Aamp       0.075354   0.039070               True
Abca12    -0.173369  -0.180433               True
...             ...        ...                ...
Zkscan3   -0.085483  -0.335321               True
Zranb1     0.100015   0.438165               True
Zranb3     0.028535  -0.004348               True
Zscan22   -0.190164  -0.550454               True
Zw10       0.011447  -0.062295               True

[1474 rows x 8 columns]

PYTHON

sns.scatterplot(data = time_matrix, x = "logfc_0_4", y = "logfc_0_8", alpha = 0.6, style="is_protein_coding")
plt.xlabel("0min to 4min (log FC)")
plt.ylabel("0min to 8min (log FC)")
Style set to is_protein_coding

Now that hue is freed up, we can use it to encode for chromosome.

PYTHON

# This hit my personal stylistic limit, so I'm spreading the call across multiple lines
sns.scatterplot(data = time_matrix, 
                x = "logfc_0_4", 
                y = "logfc_0_8", 
                alpha = 0.6, 
                style = "is_protein_coding",
                hue = "chromosome_name")
plt.xlabel("0min to 4min (log FC)")
plt.ylabel("0min to 8min (log FC)")
Adding chromosome

There doesn’t seem to be any interesting pattern. Let’s go back to just plotting whether or not the gene is protein coding, and do a little more to clean up the plot.

PYTHON

# This scales all text in the plot by 150%
sns.set(font_scale = 1.5)

sns.scatterplot(data = time_matrix, 
                x = "logfc_0_4", 
                y = "logfc_0_8", 
                alpha = 0.6, 
                hue = "is_protein_coding", 
# Sets the color palette to map values to
                palette = "mako")
plt.xlabel("0min to 4min (log FC)")
plt.ylabel("0min to 8min (log FC)")

# Set a legend title and give it 2 columns
plt.legend(ncol = 2, title="Is Protein Coding")
Cleaning up the plot

Heatmaps


Seaborn also has great built-in support for heatmaps.

We can make heatmaps using heatmap or clustermap, depending on if we want our data to be clustered. Let’s take a look at the correlation matrix for our samples.

PYTHON

sample_matrix = rnaseq_df.pivot_table(columns = "sample", values = "log_exp", index="gene")
print(sample_matrix.corr())

OUTPUT

sample      GSM2545336  GSM2545337  GSM2545338  GSM2545339  GSM2545340  \
sample
GSM2545336    1.000000    0.972832    0.970700    0.979381    0.957032
GSM2545337    0.972832    1.000000    0.992230    0.983430    0.972068
GSM2545338    0.970700    0.992230    1.000000    0.981653    0.970098
GSM2545339    0.979381    0.983430    0.981653    1.000000    0.961275
GSM2545340    0.957032    0.972068    0.970098    0.961275    1.000000
GSM2545341    0.957018    0.951554    0.950648    0.940430    0.979411
GSM2545342    0.987712    0.980026    0.978246    0.974903    0.968918
GSM2545343    0.948861    0.971826    0.973301    0.959663    0.988114
GSM2545344    0.973664    0.987626    0.985568    0.980757    0.970084
GSM2545345    0.943748    0.963804    0.963715    0.946226    0.984994
GSM2545346    0.967839    0.961595    0.957958    0.955665    0.984706
GSM2545347    0.958351    0.957363    0.955393    0.944399    0.982981
GSM2545348    0.973558    0.992018    0.991673    0.982984    0.971515
GSM2545349    0.948938    0.970211    0.971276    0.955778    0.988263
GSM2545350    0.952803    0.973764    0.971737    0.960796    0.991229
GSM2545351    0.991921    0.977846    0.973363    0.979936    0.962843
GSM2545352    0.980437    0.989274    0.986909    0.983319    0.974796
GSM2545353    0.974628    0.989940    0.989857    0.982648    0.971325
GSM2545354    0.950621    0.972800    0.972051    0.959123    0.987231
GSM2545362    0.986494    0.981655    0.978767    0.988969    0.962088
GSM2545363    0.936577    0.962115    0.962892    0.940839    0.980774
GSM2545380    0.988869    0.977962    0.975382    0.973683    0.965520

sample      GSM2545341  GSM2545342  GSM2545343  GSM2545344  GSM2545345  ...  \
sample                                                                  ...
GSM2545336    0.957018    0.987712    0.948861    0.973664    0.943748  ...
GSM2545337    0.951554    0.980026    0.971826    0.987626    0.963804  ...
GSM2545338    0.950648    0.978246    0.973301    0.985568    0.963715  ...
GSM2545339    0.940430    0.974903    0.959663    0.980757    0.946226  ...
GSM2545340    0.979411    0.968918    0.988114    0.970084    0.984994  ...
GSM2545341    1.000000    0.967231    0.967720    0.958880    0.978091  ...
GSM2545342    0.967231    1.000000    0.959923    0.980462    0.957994  ...
GSM2545343    0.967720    0.959923    1.000000    0.965428    0.983462  ...
GSM2545344    0.958880    0.980462    0.965428    1.000000    0.967708  ...
GSM2545345    0.978091    0.957994    0.983462    0.967708    1.000000  ...
GSM2545346    0.983828    0.973720    0.977478    0.962581    0.976912  ...
GSM2545347    0.986949    0.968972    0.974794    0.962113    0.981518  ...
GSM2545348    0.950961    0.980532    0.974156    0.986710    0.965095  ...
GSM2545349    0.971058    0.960674    0.992085    0.966662    0.986116  ...
GSM2545350    0.973947    0.964523    0.990495    0.970990    0.986146  ...
GSM2545351    0.961638    0.989280    0.953650    0.979976    0.950451  ...
GSM2545352    0.960888    0.985541    0.971954    0.990066    0.968691  ...
GSM2545353    0.952424    0.982904    0.971896    0.989196    0.965436  ...
GSM2545354    0.973524    0.961077    0.988615    0.972600    0.987209  ...
GSM2545362    0.951186    0.982616    0.955145    0.983885    0.946956  ...
GSM2545363    0.970944    0.952557    0.980218    0.965252    0.987553  ...
GSM2545380    0.963521    0.990511    0.957152    0.978371    0.956813  ...

sample      GSM2545348  GSM2545349  GSM2545350  GSM2545351  GSM2545352  \
sample
GSM2545336    0.973558    0.948938    0.952803    0.991921    0.980437
GSM2545337    0.992018    0.970211    0.973764    0.977846    0.989274
GSM2545338    0.991673    0.971276    0.971737    0.973363    0.986909
GSM2545339    0.982984    0.955778    0.960796    0.979936    0.983319
GSM2545340    0.971515    0.988263    0.991229    0.962843    0.974796
GSM2545341    0.950961    0.971058    0.973947    0.961638    0.960888
GSM2545342    0.980532    0.960674    0.964523    0.989280    0.985541
GSM2545343    0.974156    0.992085    0.990495    0.953650    0.971954
GSM2545344    0.986710    0.966662    0.970990    0.979976    0.990066
GSM2545345    0.965095    0.986116    0.986146    0.950451    0.968691
GSM2545346    0.960875    0.978673    0.981337    0.971079    0.970621
GSM2545347    0.958524    0.978974    0.980361    0.964883    0.967565
GSM2545348    1.000000    0.971489    0.973997    0.977225    0.990905
GSM2545349    0.971489    1.000000    0.990858    0.954456    0.971922
GSM2545350    0.973997    0.990858    1.000000    0.959045    0.975085
GSM2545351    0.977225    0.954456    0.959045    1.000000    0.983727
GSM2545352    0.990905    0.971922    0.975085    0.983727    1.000000
GSM2545353    0.992297    0.969924    0.972310    0.979291    0.991556
GSM2545354    0.972151    0.990242    0.989228    0.956395    0.972436
GSM2545362    0.980224    0.955712    0.959541    0.988591    0.984683
GSM2545363    0.963226    0.983872    0.982521    0.943499    0.966163
GSM2545380    0.979398    0.956662    0.961779    0.989683    0.985750

sample      GSM2545353  GSM2545354  GSM2545362  GSM2545363  GSM2545380
sample
GSM2545336    0.974628    0.950621    0.986494    0.936577    0.988869
GSM2545337    0.989940    0.972800    0.981655    0.962115    0.977962
GSM2545338    0.989857    0.972051    0.978767    0.962892    0.975382
GSM2545339    0.982648    0.959123    0.988969    0.940839    0.973683
GSM2545340    0.971325    0.987231    0.962088    0.980774    0.965520
GSM2545341    0.952424    0.973524    0.951186    0.970944    0.963521
GSM2545342    0.982904    0.961077    0.982616    0.952557    0.990511
GSM2545343    0.971896    0.988615    0.955145    0.980218    0.957152
GSM2545344    0.989196    0.972600    0.983885    0.965252    0.978371
GSM2545345    0.965436    0.987209    0.946956    0.987553    0.956813
GSM2545346    0.963956    0.975386    0.961966    0.967480    0.973104
GSM2545347    0.961118    0.977208    0.952988    0.972364    0.967319
GSM2545348    0.992297    0.972151    0.980224    0.963226    0.979398
GSM2545349    0.969924    0.990242    0.955712    0.983872    0.956662
GSM2545350    0.972310    0.989228    0.959541    0.982521    0.961779
GSM2545351    0.979291    0.956395    0.988591    0.943499    0.989683
GSM2545352    0.991556    0.972436    0.984683    0.966163    0.985750
GSM2545353    1.000000    0.971512    0.980850    0.962497    0.981468
GSM2545354    0.971512    1.000000    0.958851    0.983429    0.957011
GSM2545362    0.980850    0.958851    1.000000    0.942461    0.980664
GSM2545363    0.962497    0.983429    0.942461    1.000000    0.951069
GSM2545380    0.981468    0.957011    0.980664    0.951069    1.000000

[22 rows x 22 columns]

We can see the structure of these correlations in a clustered heatmap.

PYTHON

sns.clustermap(sample_matrix.corr())
Default clustered heatmap

We can change the color mapping.

PYTHON

sns.clustermap(sample_matrix.corr(), cmap = "magma")
Using magma colormap

We can add sample information to the plot by changing the index. First we need to get the data.

PYTHON

# We select all of the columns related to the samples we might want to look at, group by sample, then aggregate by the most common value, or mode
sample_data = rnaseq_df[['sample','organism', 'age', 'sex', 'infection',
       'strain', 'time', 'tissue', 'mouse']].groupby('sample').agg(pd.Series.mode)
print(sample_data)

OUTPUT

	organism 	age 	sex 	infection 	strain 	time 	tissue 	mouse
sample 								
GSM2545336 	Mus musculus 	8 	Female 	InfluenzaA 	C57BL/6 	8 	Cerebellum 	14
GSM2545337 	Mus musculus 	8 	Female 	NonInfected 	C57BL/6 	0 	Cerebellum 	9
GSM2545338 	Mus musculus 	8 	Female 	NonInfected 	C57BL/6 	0 	Cerebellum 	10
GSM2545339 	Mus musculus 	8 	Female 	InfluenzaA 	C57BL/6 	4 	Cerebellum 	15
GSM2545340 	Mus musculus 	8 	Male 	InfluenzaA 	C57BL/6 	4 	Cerebellum 	18
GSM2545341 	Mus musculus 	8 	Male 	InfluenzaA 	C57BL/6 	8 	Cerebellum 	6
GSM2545342 	Mus musculus 	8 	Female 	InfluenzaA 	C57BL/6 	8 	Cerebellum 	5
GSM2545343 	Mus musculus 	8 	Male 	NonInfected 	C57BL/6 	0 	Cerebellum 	11
GSM2545344 	Mus musculus 	8 	Female 	InfluenzaA 	C57BL/6 	4 	Cerebellum 	22
GSM2545345 	Mus musculus 	8 	Male 	InfluenzaA 	C57BL/6 	4 	Cerebellum 	13
GSM2545346 	Mus musculus 	8 	Male 	InfluenzaA 	C57BL/6 	8 	Cerebellum 	23
GSM2545347 	Mus musculus 	8 	Male 	InfluenzaA 	C57BL/6 	8 	Cerebellum 	24
GSM2545348 	Mus musculus 	8 	Female 	NonInfected 	C57BL/6 	0 	Cerebellum 	8
GSM2545349 	Mus musculus 	8 	Male 	NonInfected 	C57BL/6 	0 	Cerebellum 	7
GSM2545350 	Mus musculus 	8 	Male 	InfluenzaA 	C57BL/6 	4 	Cerebellum 	1
GSM2545351 	Mus musculus 	8 	Female 	InfluenzaA 	C57BL/6 	8 	Cerebellum 	16
GSM2545352 	Mus musculus 	8 	Female 	InfluenzaA 	C57BL/6 	4 	Cerebellum 	21
GSM2545353 	Mus musculus 	8 	Female 	NonInfected 	C57BL/6 	0 	Cerebellum 	4
GSM2545354 	Mus musculus 	8 	Male 	NonInfected 	C57BL/6 	0 	Cerebellum 	2
GSM2545362 	Mus musculus 	8 	Female 	InfluenzaA 	C57BL/6 	4 	Cerebellum 	20
GSM2545363 	Mus musculus 	8 	Male 	InfluenzaA 	C57BL/6 	4 	Cerebellum 	12
GSM2545380 	Mus musculus 	8 	Female 	InfluenzaA 	C57BL/6 	8 	Cerebellum 	19

Then we can join, set the index, and plot to see the sample groupings.

PYTHON

corr_df = sample_matrix.corr()
corr_df = corr_df.join(sample_data)
corr_df = corr_df.set_index("infection")
sns.clustermap(corr_df.iloc[:,:-8],
               cmap = "magma")
Heatmap with infection information

Creating a boxplot

Visualize log mean gene expression by sample and some other variable of choice in rnaseq_df. Take some time to try to get this plot close to publication ready for either a poster, presentation, or paper.

You may want to rotate the x labels of the plot, which can be done with plt.xticks(rotation = degrees) where degrees is the number of degrees you wish to rotate by.

Check out some of the details of sns.set_context() here. Seaborn has useful size defaults for different figure types.

Your plot will of course vary, but here are some examples of what we can do:

PYTHON

sns.set_context("talk")
# Things like this will depend on your screen resolution and size
#plt.rcParams['figure.figsize'] = [8, 6]
#plt.rcParams['figure.dpi'] = 300
#sns.set(font_scale=1.25)
sns.set_palette(sns.color_palette("rocket_r", n_colors=3))
sns.boxplot(rnaseq_df, 
            x = "sample", 
            y = "log_exp",
            hue = "time", 
            dodge=False)
plt.xticks(rotation=45, ha='right');
plt.ylabel("Log Expression")
plt.xlabel("")
# See this thread for more discussion of legend positions: https://stackoverflow.com/a/43439132
plt.legend(title='Days Post Infection', loc=(1.04, 0.5), labels=['0 days','4 days','8 days'])

PYTHON

sns.set_palette(["#A51C30","#8996A0"])
sns.violinplot(rnaseq_df, 
            x = "sample", 
            y = "log_exp",
            hue = "infection", 
            dodge=False)
plt.xticks(rotation=45, ha='right');
plt.legend(loc = 'lower right')

PYTHON

sns.set_context("talk")
pal = sns.color_palette("blend:#A51C30,#FFDB6D", 24)
g = sns.catplot(rnaseq_df,
            kind = "box",
            x = "sample", 
            y = "log_exp",
            col = "sex",
            dodge=False, 
            hue = "sample",
            palette = pal,
            sharex = False)
for axes in g.axes.flat:
    _ = axes.set_xticklabels(axes.get_xticklabels(), rotation=45, ha='right')
g.set_axis_labels("", "Log Expression")
plt.subplots_adjust(wspace=0.1)

Saving your plot to a file

If you are satisfied with the plot you see you may want to save it to a file, perhaps to include it in a publication. There is a function in the matplotlib.pyplot module that accomplishes this: savefig. Calling this function, e.g. with

PYTHON

plt.savefig('my_figure.png')

will save the current figure to the file my_figure.png. The file format will automatically be deduced from the file name extension (other formats are pdf, ps, eps and svg).

Note that functions in plt refer to a global figure variable and after a figure has been displayed to the screen (e.g. with plt.show) matplotlib will make this variable refer to a new empty figure. Therefore, make sure you call plt.savefig before the plot is displayed to the screen, otherwise you may find a file with an empty plot.
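A minimal sketch of this ordering (assuming matplotlib is installed; the Agg backend is used here only so the example runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt

plt.plot([0, 1, 2], [0, 1, 4])
plt.savefig("my_figure.png")  # save BEFORE show: the figure still holds our data
plt.show()                    # after this, plt refers to a new, empty figure
```

If the two calls were reversed, my_figure.png would contain an empty set of axes.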

When using dataframes, data is often generated and plotted to screen in one line. In addition to using plt.savefig, we can save a reference to the current figure in a local variable (with plt.gcf) and call the savefig class method from that variable to save the figure to file.

PYTHON

data.plot(kind='bar')
fig = plt.gcf() # get current figure
fig.savefig('my_figure.png')

This supports most common image formats such as png, svg, pdf, etc.

Making your plots accessible

Whenever you are generating plots to go into a paper or a presentation, there are a few things you can do to make sure that everyone can understand your plots.

  • Always make sure your text is large enough to read. Use the fontsize parameter in xlabel, ylabel, title, and legend, and tick_params with labelsize to increase the text size of the numbers on your axes.
  • Similarly, you should make your graph elements easy to see. Use s to increase the size of your scatterplot markers and linewidth to increase the sizes of your plot lines.
  • Using color (and nothing else) to distinguish between different plot elements will make your plots unreadable to anyone who is colorblind, or who happens to have a black-and-white office printer. For lines, the linestyle parameter lets you use different types of lines. For scatterplots, marker lets you change the shape of your points. If you’re unsure about your colors, you can use Coblis or Color Oracle to simulate what your plots would look like to those with colorblindness.
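The suggestions above can be sketched in one small plot (toy data; the Agg backend is used only so the sketch runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this sketch runs anywhere
import matplotlib.pyplot as plt

x = [0, 1, 2, 3]
# Different line styles and thicker lines distinguish series without color
plt.plot(x, [0, 1, 4, 9], linestyle="--", linewidth=3, label="quadratic")
plt.plot(x, [0, 1, 2, 3], linestyle=":", linewidth=3, label="linear")
# Larger markers with a distinct shape for scatter points
plt.scatter(x, [9, 4, 1, 0], s=100, marker="^", label="points")
plt.xlabel("x", fontsize=16)
plt.ylabel("y", fontsize=16)
plt.tick_params(labelsize=14)  # larger tick numbers
plt.legend(fontsize=14)
```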

Content from Perform machine learning with Scikit-learn


Last updated on 2023-04-22 | Edit this page

Overview

Questions

  • What is Scikit-learn?
  • How is Scikit-learn used for a typical machine learning workflow?

Objectives

  • Explore the sklearn library.
  • Walk through a typical machine learning workflow in Scikit-learn.
  • Examine machine learning evaluation plots.

Key Points

  • Scikit-learn is a popular package for machine learning in Python.
  • Scikit-learn has a variety of useful functionality for creating predictive models.
  • A machine learning workflow involves preprocessing, model selection, training, and evaluation.

Scikit-learn (also called sklearn) is a popular machine learning package in Python. It has a wide variety of functionality which aid in creating, training, and evaluating machine learning models.

Some of its functionality includes:

  • Classification: predicting a category such as active/inactive or what animal a picture is of.
  • Regression: predicting a quantity such as temperature, survival time, or age.
  • Clustering: finding groups based on the structure of the data, such as disease subtypes or cell types.
  • Dimensionality reduction: projecting data from a high-dimensional to a low-dimensional space for visualization and exploration. PCA, UMAP, and t-SNE are all dimensionality reduction methods.
  • Model selection: choosing which model to use. This module includes methods for splitting data for training, evaluation metrics, hyperparameter optimization, and more.
  • Preprocessing: cleaning and transforming data for machine learning. This includes imputing missing data, normalization, denoising, and extracting features (for instance, creating a “season” feature from dates, or creating a “brightness” feature from an image).

Let’s import the modules and other packages we need.

PYTHON

import pandas as pd
from sklearn import model_selection, ensemble, metrics
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
sns.set_theme(style='ticks')

Data description


The data we will be using for this lesson comes from Lee et al, 2018. In order to achieve replicative immortality, human cancer cells need to activate a telomere maintenance mechanism (TMM). Cancer cells typically use either the enzyme telomerase or the Alternative Lengthening of Telomeres (ALT) pathway. The machine learning task we will be exploring is to create a model which can use a cancer cell’s telomere repeat composition to predict whether or not a tumor cell is using the ALT pathway.

More formally, we are predicting a binary variable for each sample: whether the TMM is using the ALT pathway (positive) or not (negative). In machine learning we refer to this as the target variable or class of the data. We have a set of columns in the dataset which indicate the six-mer composition of the sample’s telomeres by percent.

PYTHON

# Load data
data = pd.read_csv("https://raw.githubusercontent.com/HMS-Data-Club/python-environments/main/data/telomere_ALT/telomere.csv", sep='\t')
num_row = data.shape[0]     # number of samples in the dataset
num_col = data.shape[1]     # number of features in the dataset (plus the label column)
print(data)

OUTPUT

     TTAGGG  ATAGGG  CTAGGG  GTAGGG  TAAGGG  TCAGGG  TGAGGG  TTCGGG  TTGGGG  \
0    94.846   0.019   0.430   0.422   0.216   0.544   1.762   0.535   0.338
1    94.951   0.011   0.241   0.491   0.223   0.317   1.351   0.818   0.702
2    94.889   0.043   0.439   0.478   0.355   0.316   1.151   0.625   0.313
3    94.202   0.017   0.252   0.509   0.396   0.548   1.877   0.856   0.440
4    96.368   0.011   0.078   0.131   0.015   0.306   1.525   1.165   0.126
..      ...     ...     ...     ...     ...     ...     ...     ...     ...
156  98.608   0.001   0.100   0.219   0.000   0.196   0.421   0.101   0.229
157  98.176   0.000   0.145   0.304   0.088   0.158   0.446   0.181   0.292
158  99.393   0.000   0.017   0.155   0.000   0.030   0.190   0.065   0.073
159  98.742   0.003   0.077   0.169   0.002   0.091   0.304   0.211   0.206
160  98.738   0.009   0.138   0.113   0.000   0.371   0.186   0.080   0.163

     TTTGGG  ...  TTACGG  TTATGG  TTAGAG  TTAGCG  TTAGTG  TTAGGA  TTAGGC  \
0     0.068  ...   0.028   0.118   0.153   0.000   0.049   0.060   0.033
1     0.090  ...   0.024   0.125   0.080   0.024   0.035   0.155   0.030
2     0.079  ...   0.041   0.253   0.195   0.032   0.043   0.161   0.047
3     0.097  ...   0.053   0.110   0.125   0.000   0.043   0.069   0.029
4     0.000  ...   0.014   0.099   0.022   0.000   0.019   0.026   0.009
..      ...  ...     ...     ...     ...     ...     ...     ...     ...
156   0.000  ...   0.019   0.020   0.016   0.000   0.004   0.025   0.011
157   0.023  ...   0.012   0.013   0.043   0.000   0.005   0.009   0.015
158   0.000  ...   0.008   0.014   0.009   0.000   0.003   0.012   0.013
159   0.000  ...   0.011   0.053   0.047   0.000   0.004   0.019   0.014
160   0.000  ...   0.015   0.049   0.023   0.000   0.021   0.037   0.021

     TTAGGT  rel_TL  TMM
0     0.089   -0.89    -
1     0.093   -0.39    -
2     0.185   -1.66    -
3     0.110   -1.73    -
4     0.014    0.21    -
..      ...     ...  ...
156   0.001    2.00    +
157   0.013    0.98    +
158   0.003    1.29    +
159   0.007    1.24    +
160   0.009    1.71    +

[161 rows x 21 columns]

Machine learning workflows involve preprocessing, model selection, training, and evaluation.


Before we can use a model to predict ALT pathway activity, however, we have to complete three steps:

  1. Preprocessing: Gather data and get it ready for use in the machine learning model.
  2. Learning/Training: Choose a machine learning model and train it on the data.
  3. Evaluation: Measure how well the model performed. Can we trust the predictions of the trained model?

Supervised Machine Learning Workflow

In this case, our data is already preprocessed. In order to perform training, we need to split our data into a training set and a testing set.

Splitting a training set.

Scikit-learn has useful built-in functions for splitting data inside of its model_selection module. Here, we use train_test_split. The convention in Scikit-learn (as well as more generally in machine learning) is to call our features X and our target variable y.

PYTHON

X = data.iloc[:, 0: num_col-1]    # feature columns
y = data['TMM']                   # label column
X_train, X_test, y_train, y_test = \
    model_selection.train_test_split(X, y, 
                                     test_size=0.2,      # reserve 20 percent data for testing
                                     stratify=y,         # stratified sampling
                                     random_state=0) # random seed

Setting a test set aside from the training and validation sets from the beginning, and only using it once for a final evaluation, is very important for properly evaluating how well a machine learning algorithm learned. If the test set leaks into training (data leakage), it contaminates the evaluation, so the evaluation no longer accurately reflects how well the model actually performs. Letting the machine learning method learn from the test set is like giving a student the answers to an exam; once a student sees any exam answers, their exam score will no longer reflect their true understanding of the material.

In other words, improper data splitting and data leakage means that we will not know if our model works or not.
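As a small self-contained sketch (using synthetic labels rather than the telomere data) of what the stratify argument guarantees, note that the class balance is preserved exactly in both splits:

```python
import numpy as np
from sklearn import model_selection

# Synthetic example: 100 samples, 30% positive class
X = np.arange(100).reshape(-1, 1)
y = np.array(["+"] * 30 + ["-"] * 70)

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y,
    test_size=0.2,   # 20 test samples, 80 training samples
    stratify=y,      # keep the 30/70 class balance in both splits
    random_state=0)

print((y_train == "+").mean())  # 0.3
print((y_test == "+").mean())   # 0.3
```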

Now that we have our data split up, we can select and train our model, which in classification tasks like this is also called our classifier. We want to further split our training data into a training set and a validation set so that we can freely explore different models before making our final choice.

Holdout validation strategy

However, in this case we are simply going to go with a random forest classifier. Random forests are fast classifiers which tend to perform well, and they are often recommended as a first classifier to try.

Definitions

Training set - The training set is a part of the original dataset that trains or fits the model. This is the data that the model uses to learn patterns and set the model parameters.

Validation set - Part of the training set is used to validate that the fitted model works on new data. This is not the final evaluation of the model. This step is used to change hyperparameters and then train the model again.

Test set - The test set checks how well we expect the model to work on new data in the future. The test set is used in the final phase of the workflow, and it evaluates the final model. It can only be used one time, and the model cannot be adjusted after using it.

Parameters - These are the aspects of a machine learning model that are learned from the training data. The parameters define the prediction rules of the trained model.

Hyperparameters - These are the user-specified settings of a machine learning model. Each machine learning method has different hyperparameters, and they control various trade-offs which change how the model learns. Hyperparameters control parts of a machine learning method such as how much emphasis the method should place on being perfectly correct versus becoming overly complex, how fast the method should learn, the type of mathematical model the method should use for learning, and more.
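Hyperparameter tuning is often automated. As an illustration (a sketch on toy data, not part of this lesson's workflow), scikit-learn's GridSearchCV tries every combination in a parameter grid using cross-validation on the training data:

```python
from sklearn import datasets, ensemble, model_selection

# Toy data standing in for a real training set
X, y = datasets.make_classification(n_samples=200, random_state=0)

grid = model_selection.GridSearchCV(
    ensemble.RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 50], "max_depth": [3, 5]},
    cv=3)  # 3-fold cross-validation within the training data
grid.fit(X, y)
print(grid.best_params_)  # the best-scoring hyperparameter combination
```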

Model Creation

PYTHON

rf = ensemble.RandomForestClassifier(
    n_estimators = 10,           # 10 random trees in the forest
    criterion = 'entropy',       # use entropy as the measure of uncertainty
    max_depth = 3,               # maximum depth of each tree is 3
    min_samples_split = 5,       # generate a split only when there are at least 5 samples at current node
    class_weight = 'balanced',   # class weight is inversely proportional to class frequencies
    random_state = 0)

Each model in Scikit-learn has a wide variety of possible hyperparameters to choose from. These can be thought of as the settings, or knobs and dials, of the model.
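To see every hyperparameter a model exposes, along with its current value, each scikit-learn estimator provides a get_params method (shown here on a fresh classifier for illustration):

```python
from sklearn import ensemble

rf = ensemble.RandomForestClassifier(n_estimators=10, max_depth=3)
params = rf.get_params()        # dictionary of every hyperparameter
print(params["n_estimators"])   # 10
print(params["max_depth"])      # 3
```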

Once we’ve created the model, in Scikit-learn we train it using the fit method on the training data. We can then look at its score on the testing data using the score method.

Training

PYTHON

# Refit the classifier using the full training set
rf = rf.fit(X_train, y_train)

# Compute classifier's accuracy on the test set
test_accuracy = rf.score(X_test, y_test)
print(f'Test accuracy: {test_accuracy: 0.3f}')

OUTPUT

Test accuracy: 0.848

Evaluation

This means that we predicted the TMM correctly in about 85% of our test samples. As opposed to the overall score, we can also directly get each of the model’s predictions. We can either use predict to get a single prediction per sample or predict_proba to get a probabilistic confidence score.

PYTHON

# Predict labels for the test set
y = rf.predict(X_test)             # Hard prediction (class labels)
y_prob = rf.predict_proba(X_test)  # Soft prediction (class probabilities)
print(y)
print(y_prob)

OUTPUT

['+' '+' '+' '-' '-' '+' '+' '-' '-' '-' '+' '-' '+' '-' '-' '+' '-' '-'
 '-' '-' '+' '-' '-' '+' '-' '-' '-' '-' '-' '-' '+' '+' '+']
[[0.81355661 0.18644339]
 [0.85375    0.14625   ]
 [0.72272654 0.27727346]
 [0.03321429 0.96678571]
 [0.04132653 0.95867347]
 [0.98434442 0.01565558]
 [0.99230769 0.00769231]
 [0.09321429 0.90678571]
 [0.0225     0.9775    ]
 [0.08788497 0.91211503]
 [0.99238014 0.00761986]
 [0.         1.        ]
 [0.87857143 0.12142857]
 [0.0225     0.9775    ]
 [0.1        0.9       ]
 [0.52973416 0.47026584]
 [0.17481069 0.82518931]
 [0.29859926 0.70140074]
 [0.43977573 0.56022427]
 [0.         1.        ]
 [0.80117647 0.19882353]
 [0.03061224 0.96938776]
 [0.22071729 0.77928271]
 [0.72899028 0.27100972]
 [0.0975     0.9025    ]
 [0.         1.        ]
 [0.03061224 0.96938776]
 [0.         1.        ]
 [0.45651511 0.54348489]
 [0.1        0.9       ]
 [0.81355661 0.18644339]
 [0.85375    0.14625   ]
 [0.88516484 0.11483516]]

Instead of simple metrics such as accuracy, which can be misleading, we can also look at and plot more complex metrics such as the receiver operating characteristic (ROC) curve or the precision-recall (PR) curve.

PYTHON

# Compute the ROC and PR curves for the test set
fpr, tpr, _ = metrics.roc_curve(y_test, y_prob[:, 0], pos_label='+')
precision_plus, recall_plus, _ = metrics.precision_recall_curve(y_test, y_prob[:, 0], pos_label='+')
precision_minus, recall_minus, _ = metrics.precision_recall_curve(y_test, y_prob[:, 1], pos_label='-')

# Compute the AUROC and AUPRCs for the test set
auroc = metrics.auc(fpr, tpr)
auprc_plus = metrics.auc(recall_plus, precision_plus)
auprc_minus = metrics.auc(recall_minus, precision_minus)

# Plot the ROC curve for the test set
plt.plot(fpr, tpr, label='test (area = %.3f)' % auroc)
plt.plot([0,1], [0,1], linestyle='--', color=(0.6, 0.6, 0.6))
plt.plot([0,0,1], [0,1,1], linestyle='--', color=(0.6, 0.6, 0.6))

plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend(loc='lower right')
plt.title('ROC curve')
plt.show()

# Plot the PR curve for the test set
plt.plot(recall_plus, precision_plus, label='positive class (area = %.3f)' % auprc_plus)
plt.plot(recall_minus, precision_minus, label='negative class (area = %.3f)' % auprc_minus)
plt.plot([0,1,1], [1,1,0], linestyle='--', color=(0.6, 0.6, 0.6))
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('recall')
plt.ylabel('precision')
plt.legend(loc='lower left')
plt.title('Precision-recall curve')
plt.show()
ROC curve
PR curve

Finally, we can also look at the confusion matrix, another useful summary of model performance.

PYTHON

# Compute the confusion matrix for the test set
cm = metrics.confusion_matrix(y_test, y)


# Plot the confusion matrix for the test set
plt.matshow(cm, interpolation='nearest', cmap=plt.cm.Blues, alpha=0.5)
plt.xticks(np.arange(2), ('+', '-'))
plt.yticks(np.arange(2), ('+', '-'))

for i in range(2):
    for j in range(2):
        plt.text(j, i, cm[i, j], ha='center', va='center')

plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()
Confusion Matrix
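Scikit-learn can also summarize per-class precision, recall, and F1 score in a single call with classification_report; a sketch on toy labels (standing in for y_test and the predictions y):

```python
from sklearn import metrics

# Toy true labels and predictions for illustration
y_true = ["+", "+", "-", "-", "-", "+"]
y_pred = ["+", "-", "-", "-", "+", "+"]

report = metrics.classification_report(y_true, y_pred)
print(report)
```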

Resources for learning more


The Google machine learning glossary and ML4Bio guides define common machine learning terms.

The scikit-learn tutorials provide a Python-based introduction to machine learning. There is also a third-party scikit-learn tutorial and a Carpentries lesson.

The book Python Machine Learning has machine learning example code.

The Elements of AI course presents general introductory materials to machine learning and related topics.

Galaxy ML provides access to classification and regression workflows through the Galaxy interface.

You can find additional resources at these links for beginners and intermediate users.

Content from ID mapping using mygene


Last updated on 2023-04-27 | Edit this page

Overview

Questions

  • How can I get gene annotations in Python?

Objectives

  • Explore the mygene library.
  • Access and merge gene annotations using mygene.

Key Points

  • mygene is a Python module which allows access to a gene annotation database.
  • You can query mygene with multiple identifiers using querymany.

Installing mygene from bioconda


ID mapping is a very common, often not fun, task for every bioinformatician. Different datasets and tools often use different sets of biological identifiers. Our goal is to take a list of gene symbols or other gene identifiers and convert them to another set of identifiers to facilitate some analysis.

mygene is a convenient Python module to access MyGene.info gene query web services.

Mygene is a part of a family of biothings API services which also has modules for chemicals, diseases, genetic variants, and taxonomy. All of these modules use a similar Python interface.

Mygene can be installed using conda, but it exists in a separate collection of biomedical packages called bioconda. We access this using the -c flag, which tells conda which channel to search.

BASH

$ conda install -c bioconda mygene

Once installed, we can import mygene and create a MyGeneInfo object, which is used for accessing and downloading gene annotations.

PYTHON

import mygene

mg = mygene.MyGeneInfo()

Mapping gene symbols

Suppose xli is a list of gene symbols we want to convert to Entrez gene ids:

PYTHON

xli = ['DDX26B',
'CCDC83',
'MAST3',
'FLOT1',
'RPL11',
'ZDHHC20',
'LUC7L3',
'SNORD49A',
'CTSH',
'ACOT8']

We can then use the querymany method to get Entrez gene ids for this list, telling it that our input is a “symbol” with the scopes argument, and that we want “entrezgene” (Entrez gene ids) back with the fields argument.

PYTHON

out = mg.querymany(xli, scopes='symbol', fields='entrezgene', species='human')
print(out)

OUTPUT

querying 1-10...done.
Finished.
1 input query terms found no hit:
	['DDX26B']
Pass "returnall=True" to return complete lists of duplicate or missing query terms.
[{'query': 'DDX26B', 'notfound': True}, {'query': 'CCDC83', '_id': '220047', '_score': 18.026102, 'entrezgene': '220047'}, {'query': 'MAST3', '_id': '23031', '_score': 18.120222, 'entrezgene': '23031'}, {'query': 'FLOT1', '_id': '10211', '_score': 18.43404, 'entrezgene': '10211'}, {'query': 'RPL11', '_id': '6135', '_score': 16.711576, 'entrezgene': '6135'}, {'query': 'ZDHHC20', '_id': '253832', '_score': 18.165747, 'entrezgene': '253832'}, {'query': 'LUC7L3', '_id': '51747', '_score': 17.696375, 'entrezgene': '51747'}, {'query': 'SNORD49A', '_id': '26800', '_score': 22.741982, 'entrezgene': '26800'}, {'query': 'CTSH', '_id': '1512', '_score': 17.737492, 'entrezgene': '1512'}, {'query': 'ACOT8', '_id': '10005', '_score': 17.711779, 'entrezgene': '10005'}]

The mapping result is returned as a list of dictionaries. Each dictionary contains the fields we asked to return, in this case the “entrezgene” field. Each dictionary also returns the matching query term, “query”, and an internal id, “_id”, which is the same as “entrezgene” most of the time (it will be an Ensembl gene id if a gene is available from Ensembl only). Mygene could not find a mapping for one of the genes.
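Because the result is just a list of dictionaries, it is easy to turn into a plain symbol-to-id lookup. A minimal sketch, using a shortened copy of the output above (unmatched queries carry the "notfound" key, so we skip those):

PYTHON

```python
# Shortened querymany-style result, copied from the output above.
out = [
    {'query': 'DDX26B', 'notfound': True},
    {'query': 'CCDC83', '_id': '220047', 'entrezgene': '220047'},
    {'query': 'MAST3', '_id': '23031', 'entrezgene': '23031'},
]

# Build a symbol -> Entrez id dictionary, skipping unmapped queries.
symbol_to_entrez = {
    hit['query']: hit['entrezgene']
    for hit in out
    if not hit.get('notfound', False)
}
print(symbol_to_entrez)  # {'CCDC83': '220047', 'MAST3': '23031'}
```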

We could also get Ensembl gene ids. This time, we will use the as_dataframe argument to get the result back as a pandas dataframe.

PYTHON

ensembl_res = mg.querymany(xli, scopes='symbol', fields='ensembl.gene', species='human', as_dataframe=True)
print(ensembl_res)

OUTPUT

querying 1-10...done.
Finished.
1 input query terms found no hit:
	['DDX26B']
Pass "returnall=True" to return complete lists of duplicate or missing query terms.
         notfound     _id     _score     ensembl.gene  \
query
DDX26B       True     NaN        NaN              NaN
CCDC83        NaN  220047  18.026253  ENSG00000150676
MAST3         NaN   23031  18.120222  ENSG00000099308
FLOT1         NaN   10211  18.416580              NaN
RPL11         NaN    6135  16.711576  ENSG00000142676
ZDHHC20       NaN  253832  18.165747  ENSG00000180776
LUC7L3        NaN   51747  17.701532  ENSG00000108848
SNORD49A      NaN   26800  22.728123  ENSG00000277370
CTSH          NaN    1512  17.725145  ENSG00000103811
ACOT8         NaN   10005  17.717607  ENSG00000101473

                                                    ensembl
query
DDX26B                                                  NaN
CCDC83                                                  NaN
MAST3                                                   NaN
FLOT1     [{'gene': 'ENSG00000206480'}, {'gene': 'ENSG00...
RPL11                                                   NaN
ZDHHC20                                                 NaN
LUC7L3                                                  NaN
SNORD49A                                                NaN
CTSH                                                    NaN
ACOT8                                                   NaN 

Mixed symbols and multiple fields

PYTHON

xli = ['ENSG00000150676',
 'MAST3',
 'FLOT1',
 'RPL11',
 '1007_s_at',
 'AK125780']

The above id list contains multiple types of identifiers. Let’s take this list and get back both Entrez gene ids and UniProt ids. The scopes, fields, and species parameters are all flexible enough to support multiple values, given either as a list or as a comma-separated string.

PYTHON

query_res = mg.querymany(xli, scopes='symbol,reporter,accession,ensembl.gene', fields='entrezgene,uniprot', species='human', as_dataframe=True)
print(query_res)

OUTPUT

querying 1-6...done.
Finished.
1 input query terms found dup hits:
	[('1007_s_at', 2)]
Pass "returnall=True" to return complete lists of duplicate or missing query terms.
                       _id     _score entrezgene uniprot.Swiss-Prot  \
query
ENSG00000150676     220047  24.495731     220047             Q8IWF9
MAST3                23031  18.115461      23031             O60307
FLOT1                10211  18.416580      10211             O75955
RPL11                 6135  16.715294       6135             P62913
1007_s_at        100616237  18.266214  100616237                NaN
1007_s_at              780  18.266214        780             Q08345
AK125780         118142757  21.342941  118142757             P43080

                                                    uniprot.TrEMBL
query
ENSG00000150676                                             H0YDV3
MAST3            [V9GYV0, A0A8V8TLL8, A0A8I5KST9, A0A8V8TMW0, A...
FLOT1            [A2AB11, Q5ST80, A2AB13, A2AB10, A2ABJ5, A2AB1...
RPL11                                 [Q5VVC8, Q5VVD0, A0A2R8Y447]
1007_s_at                                                      NaN
1007_s_at        [A0A024RCJ0, A0A024RCL1, A0A0A0MSX3, A0A024RCQ...
AK125780                                      [A0A7I2V6E2, B2R9P6]  

Finally, we can take a look at all of the information mygene has to offer by setting fields to “all”.

PYTHON

query_res = mg.querymany(xli, scopes='symbol,reporter,accession,ensembl.gene', fields='all', species='human', as_dataframe=True)
print(list(query_res.columns))

OUTPUT

querying 1-6...done.
Finished.
1 input query terms found dup hits:
	[('1007_s_at', 2)]
Pass "returnall=True" to return complete lists of duplicate or missing query terms.
['AllianceGenome', 'HGNC', '_id', '_score', 'alias', 'entrezgene', 'exons', 'exons_hg19', 'generif', 'ipi', 'map_location', 'name', 'other_names', 'pharmgkb', 'symbol', 'taxid', 'type_of_gene', 'unigene', 'accession.genomic', 'accession.protein', 'accession.rna', 'accession.translation', 'ensembl.gene', 'ensembl.protein', 'ensembl.transcript', 'ensembl.translation', 'ensembl.type_of_gene', 'exac._license', 'exac.all.exp_lof', 'exac.all.exp_mis', 'exac.all.exp_syn', 'exac.all.lof_z', 'exac.all.mis_z', 'exac.all.mu_lof', 'exac.all.mu_mis', 'exac.all.mu_syn', 'exac.all.n_lof', 'exac.all.n_mis', 'exac.all.n_syn', 'exac.all.p_li', 'exac.all.p_null', 'exac.all.p_rec', 'exac.all.syn_z', 'exac.bp', 'exac.cds_end', 'exac.cds_start', 'exac.n_exons', 'exac.nonpsych.exp_lof', 'exac.nonpsych.exp_mis', 'exac.nonpsych.exp_syn', 'exac.nonpsych.lof_z', 'exac.nonpsych.mis_z', 'exac.nonpsych.mu_lof', 'exac.nonpsych.mu_mis', 'exac.nonpsych.mu_syn', 'exac.nonpsych.n_lof', 'exac.nonpsych.n_mis', 'exac.nonpsych.n_syn', 'exac.nonpsych.p_li', 'exac.nonpsych.p_null', 'exac.nonpsych.p_rec', 'exac.nonpsych.syn_z', 'exac.nontcga.exp_lof', 'exac.nontcga.exp_mis', 'exac.nontcga.exp_syn', 'exac.nontcga.lof_z', 'exac.nontcga.mis_z', 'exac.nontcga.mu_lof', 'exac.nontcga.mu_mis', 'exac.nontcga.mu_syn', 'exac.nontcga.n_lof', 'exac.nontcga.n_mis', 'exac.nontcga.n_syn', 'exac.nontcga.p_li', 'exac.nontcga.p_null', 'exac.nontcga.p_rec', 'exac.nontcga.syn_z', 'exac.transcript', 'genomic_pos.chr', 'genomic_pos.end', 'genomic_pos.ensemblgene', 'genomic_pos.start', 'genomic_pos.strand', 'genomic_pos_hg19.chr', 'genomic_pos_hg19.end', 'genomic_pos_hg19.start', 'genomic_pos_hg19.strand', 'go.MF.category', 'go.MF.evidence', 'go.MF.id', 'go.MF.pubmed', 'go.MF.qualifier', 'go.MF.term', 'homologene.genes', 'homologene.id', 'interpro.desc', 'interpro.id', 'interpro.short_desc', 'pantherdb.HGNC', 'pantherdb._license', 'pantherdb.ortholog', 'pantherdb.uniprot_kb', 'pharos.target_id', 
'reagent.GNF_Qia_hs-genome_v1_siRNA', 'reagent.GNF_hs-ORFeome1_1_reads.id', 'reagent.GNF_hs-ORFeome1_1_reads.relationship', 'reagent.GNF_mm+hs-MGC.id', 'reagent.GNF_mm+hs-MGC.relationship', 'reagent.GNF_mm+hs_RetroCDNA.id', 'reagent.GNF_mm+hs_RetroCDNA.relationship', 'reagent.NOVART_hs-genome_siRNA', 'refseq.genomic', 'refseq.protein', 'refseq.rna', 'refseq.translation', 'reporter.GNF1H', 'reporter.HG-U133_Plus_2', 'reporter.HTA-2_0', 'reporter.HuEx-1_0', 'reporter.HuGene-1_1', 'reporter.HuGene-2_1', 'umls.cui', 'uniprot.Swiss-Prot', 'uniprot.TrEMBL', 'MIM', 'ec', 'interpro', 'pdb', 'pfam', 'prosite', 'summary', 'go.BP', 'go.CC.evidence', 'go.CC.gocategory', 'go.CC.id', 'go.CC.qualifier', 'go.CC.term', 'go.MF', 'reagent.GNF_hs-Origene.id', 'reagent.GNF_hs-Origene.relationship', 'reagent.GNF_hs-druggable_lenti-shRNA', 'reagent.GNF_hs-druggable_plasmid-shRNA', 'reagent.GNF_hs-druggable_siRNA', 'reagent.GNF_hs-pkinase_IDT-siRNA', 'reagent.Invitrogen_IVTHSSIPKv2', 'reagent.NIBRI_hs-Secretome_pDEST.id', 'reagent.NIBRI_hs-Secretome_pDEST.relationship', 'reporter.HG-U95Av2', 'ensembl', 'genomic_pos', 'genomic_pos_hg19', 'go.CC', 'pathway.kegg.id', 'pathway.kegg.name', 'pathway.netpath.id', 'pathway.netpath.name', 'pathway.reactome', 'pathway.wikipathways', 'reagent.GNF_hs-Origene', 'wikipedia.url_stub', 'pir', 'pathway.kegg', 'pathway.pid', 'pathway.wikipathways.id', 'pathway.wikipathways.name', 'miRBase', 'retired', 'reagent.GNF_mm+hs-MGC'

You can find a neater version of all available fields here.

Challenge

Load in the rnaseq dataset as a pandas dataframe. Add columns to the dataframe with entrez gene ids and at least one other field from mygene.
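For the merging step, pandas’ merge can attach a querymany dataframe to an existing table. A minimal sketch with made-up data (the “gene” and “expression” column names and the toy values are assumptions for illustration, not the actual rnaseq dataset; a real annotations frame would come from querymany with as_dataframe=True, whose index is the query term, hence the reset_index):

PYTHON

```python
import pandas as pd

# Toy stand-in for the rnaseq dataframe (column names are assumptions).
rnaseq = pd.DataFrame({'gene': ['CCDC83', 'MAST3'],
                       'expression': [5.2, 7.1]})

# Toy stand-in for a querymany(..., as_dataframe=True) result,
# which is indexed by the query term.
annotations = pd.DataFrame({'entrezgene': ['220047', '23031']},
                           index=pd.Index(['CCDC83', 'MAST3'], name='query'))

# Move the query index into a column, then merge on the gene symbol.
merged = rnaseq.merge(annotations.reset_index(),
                      left_on='gene', right_on='query', how='left')
print(merged[['gene', 'expression', 'entrezgene']])
```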

Other bioinformatics resources


Biopython is a library with a wide variety of bioinformatics applications. You can find its tutorial/cookbook here.


This lesson was adapted from the official mygene Python tutorial.