String Manipulation

Last updated on 2023-04-27

Overview

Questions

  • How can I manipulate text?
  • How can I create neat and dynamic text output?

Objectives

  • Extract substrings of interest.
  • Format dynamic strings using f-strings.
  • Explore Python’s built-in string functions.

Key Points

  • Strings can be indexed and sliced.
  • Strings cannot be directly altered.
  • You can build complex strings based on other variables using f-strings and format.
  • Python has a variety of useful built-in string functions.

Strings can be indexed and sliced.


As we saw in class during the types lesson, strings in Python can be indexed and sliced:

PYTHON

my_string = "What a lovely day."

# Indexing
my_string[2]

OUTPUT

'a'

PYTHON

# Slicing
my_string[1:3]

OUTPUT

'ha'

PYTHON

my_string[:3] # Same as my_string[0:3]

OUTPUT

'Wha'

PYTHON

my_string[1:] # Same as my_string[1:len(my_string)]

OUTPUT

'hat a lovely day.'
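Negative indices count from the end of the string, and a slice can take an optional third value, the step. The short sketch below reuses the same example string; the names last_char, last_word, and every_other are only illustrative.

```python
my_string = "What a lovely day."

# Negative indices count backwards from the end of the string
last_char = my_string[-1]      # same as my_string[len(my_string) - 1]

# Negative indices also work in slices
last_word = my_string[-4:-1]   # the three characters before the final "."

# An optional third value sets the step between characters
every_other = my_string[::2]   # every second character, starting at index 0

print(last_char)
print(last_word)
print(every_other)
```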

Strings are immutable.


We will talk about this concept in more detail when we explore lists. However, for now it is important to note that strings, like simple types, cannot be altered in place in Python. Instead, a new value is created.

For simple types, this behavior isn’t that noticeable:

PYTHON

x = 10
# While this line appears to be changing the value 10 to 11, in reality a new integer with the value 11 is created and assigned to x. 
x = x + 1
x

OUTPUT

11

However, for strings we can easily cause errors if we attempt to change them directly:

PYTHON

# This attempts to change the last character to 'g' 
my_string[-1] = "g"

ERROR

---------------------------------------------------------------------
TypeError                           Traceback (most recent call last)
Input In [6], in <cell line: 1>()
----> 1 my_string[-1] = "g"

TypeError: 'str' object does not support item assignment

Thus, we need ways to build new strings from existing ones. In this instance, we can build a new string using slicing and concatenation. Note that the slice [:-2] drops the final two characters, "y.", before "g" is added:

PYTHON

my_new_string = my_string[:-2] + "g"
my_new_string

OUTPUT

'What a lovely dag'

Or we can use the built-in method str.replace() if we want to replace a larger portion of the string.

PYTHON

my_new_string = my_string.replace("day", 'dog')
print(my_new_string)

OUTPUT

What a lovely dog.
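Because strings are immutable, str.replace() returns a new string and leaves the original untouched. A minimal sketch of this, using the illustrative names original and changed:

```python
original = "What a lovely day."

# str.replace() builds and returns a new string
changed = original.replace("day", "dog")

print(original)   # the original string is unchanged
print(changed)

# To "update" a variable, assign the new string back to the same name
original = original.replace("lovely", "rainy")
print(original)
```

If the search text does not occur in the string, replace() simply returns an equal, unchanged string rather than raising an error.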

Build complex strings based on other variables using format.


What if we want to use values inside of a string? For instance, say we want to print a sentence reporting the number and percentage of samples we dropped and kept as part of a quality control analysis.

Suppose we have variables holding the number of samples we kept and the number we dropped:

PYTHON

good_samples = 964
bad_samples = 117
percent_dropped = bad_samples/(good_samples + bad_samples)

One option would be to simply put everything in print:

PYTHON

print("Dropped", percent_dropped, "percent samples")

OUTPUT

Dropped 0.10823311748381129 percent samples

Or we could convert and use addition:

PYTHON

out_str = "Dropped " + str(percent_dropped) + "% samples"
print(out_str)

OUTPUT

Dropped 0.10823311748381129% samples

However, neither of these options gives us much control over how the percent is displayed. Python provides string formatting and f-strings to give us greater control.

We can use Python’s built-in str.format() method to create better-looking text:

PYTHON

print('Dropped {0:.2%} of samples, with {1:n} samples remaining.'.format(percent_dropped, good_samples))

OUTPUT

Dropped 10.82% of samples, with 964 samples remaining.

Calls to format have a number of components:

  • The use of curly braces {} to create replacement fields.
  • The index inside each pair of braces (0 and 1) to denote the index of the variable to use.
  • Format instructions after the colon. .2% indicates that we want to format the number as a percent with 2 decimal places. n indicates that we want to format the value as a number, using the current locale's separators.
  • The call to format itself: we call format on the string and pass the variables we want to use as arguments. The position of each argument is the index referenced in the replacement fields.

For instance, we can switch or repeat indices:

PYTHON

print('Dropped {1:.2%} of samples, with {0:n} samples remaining.'.format(percent_dropped, good_samples))
print('Dropped {0:.2%} of samples, with {0:n} samples remaining.'.format(percent_dropped, good_samples))

OUTPUT

Dropped 96400.00% of samples, with 0.108233 samples remaining.
Dropped 10.82% of samples, with 0.108233 samples remaining.

Python has a shorthand for using format called f-strings. These strings let us directly use variables and create expressions inside of strings. We denote an f-string by putting f in front of the string definition:

PYTHON

print(f'Dropped {percent_dropped:.2%} of samples, with {good_samples:n} samples remaining.')
print(f'Dropped {100*(bad_samples/(good_samples + bad_samples)):.2f}% of samples, with {good_samples:n} samples remaining.')

OUTPUT

Dropped 10.82% of samples, with 964 samples remaining.
Dropped 10.82% of samples, with 964 samples remaining.

Here, {100*(bad_samples/(good_samples + bad_samples)):.2f} is evaluated as an expression, and the result is printed as a float with 2 decimal places.

Full documentation of the string formatting mini-language can be found in the official Python documentation under "Format Specification Mini-Language".
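The format instructions can also control width, alignment, and thousands separators. The sketch below shows a few of these options on the sample counts from above; total_reads is a made-up value used only for illustration, as are the row names.

```python
good_samples = 964
bad_samples = 117
percent_dropped = bad_samples / (good_samples + bad_samples)
total_reads = 1234567  # hypothetical value, for illustration only

# A comma in the format instructions adds a thousands separator
with_separator = f"{total_reads:,}"

# A number sets the minimum width; < left-aligns and > right-aligns
row_kept = f"{'kept':<10}{good_samples:>6d}"
row_dropped = f"{'dropped':<10}{bad_samples:>6d}"

# Width and precision can be combined
row_fraction = f"{'fraction':<10}{percent_dropped:>6.3f}"

print(with_separator)
print(row_kept)
print(row_dropped)
print(row_fraction)
```

Fixed widths like these are one simple way to line values up in columns when printing a small table.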

Python has many useful built-in functions for string manipulation.


Python has many built-in methods for manipulating strings; simple and powerful text manipulation is considered one of Python’s strengths. We will go over some of the more common and useful methods here, but be aware that there are many more you can find in the official documentation.

Dealing with whitespace

str.strip() strips the whitespace from the beginning and end of a string. This is especially useful when reading in files which might have hidden spaces or tabs at the end of lines.

PYTHON

"  a   \t\t".strip()

Note that \t denotes a tab character.

OUTPUT

'a'
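Two related methods, str.lstrip() and str.rstrip(), strip only one side of the string, and all three accept an optional string of characters to remove instead of whitespace. A short sketch, where padded_id is a hypothetical value made up for this example:

```python
line = "  a   \t\t"

# repr() makes the remaining whitespace visible when printing
print(repr(line.lstrip()))  # leading whitespace removed only
print(repr(line.rstrip()))  # trailing whitespace removed only

# Passing characters strips any of those characters from both ends
padded_id = "__ENSG00000010404__"  # hypothetical padded identifier
clean_id = padded_id.strip("_")
print(clean_id)
```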

str.split() splits a string up into a list of strings by some separator. By default it splits on whitespace, but we can give it any separator string to split by. We will learn how to use lists in the next session.

PYTHON

"My favorite sentence I hope nothing bad happens to it".split()

OUTPUT

['My',
 'favorite',
 'sentence',
 'I',
 'hope',
 'nothing',
 'bad',
 'happens',
 'to',
 'it']

PYTHON

"2023-:04-:12".split('-:')

OUTPUT

['2023', '04', '12']
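The reverse operation is str.join(), which glues a list of strings back together with a separator. str.split() also takes an optional second argument that limits the number of splits. A brief sketch, with illustrative variable names:

```python
date_parts = "2023-04-12".split("-")

# join() is called on the separator and given the list of pieces
slash_date = "/".join(date_parts)
print(slash_date)

# The second argument limits how many splits are made
header_parts = "X.RAB7A_0Min_1".split("_", 1)
print(header_parts)
```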

Pattern matching

str.find() finds the first occurrence of the specified string inside the search string and returns its index, and str.rfind() finds the last.

PYTHON

"(collected)ENSG00000010404(inprogress)ENSG000000108888".find('ENSG')

OUTPUT

11

PYTHON

"(collected)ENSG00000010404(inprogress)ENSG000000108888".rfind('ENSG')

OUTPUT

38
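The index returned by str.find() can be combined with slicing to pull out a piece of the string. The sketch below extracts the first gene ID from the same example string; it assumes each ID is followed by an opening parenthesis, and the search text "FBgn" is just an example of something that is not present.

```python
record = "(collected)ENSG00000010404(inprogress)ENSG000000108888"

# Find where the first ID starts
start = record.find("ENSG")

# The optional second argument tells find() where to begin searching
end = record.find("(", start)

first_id = record[start:end]
print(first_id)

# find() returns -1 when the text is not present
missing = record.find("FBgn")
print(missing)
```

Because -1 is also a valid index (the last character), it is worth checking for it before using the result in a slice.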

str.startswith() and str.endswith() perform a similar function but return a bool based on whether or not the string starts or ends with a particular string.

PYTHON

"(collected)ENSG00000010404".startswith('ENSG')

OUTPUT

False

PYTHON

"ENSG00000010404".startswith('ENSG')

OUTPUT

True
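As a small extension, both methods also accept a tuple of options and return True if any of them match, and the in operator checks whether a substring occurs anywhere in the string. A minimal sketch using one of the column headers from the challenge below (the variable names are illustrative):

```python
sample = "WT_0Min_1"

# endswith() works the same way as startswith()
is_first_replicate = sample.endswith("_1")

# A tuple of options matches if any one of them matches
known_prefix = sample.startswith(("WT", "X."))

# "in" checks for a substring anywhere in the string
is_zero_min = "0Min" in sample

print(is_first_replicate, known_prefix, is_zero_min)
```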

Case

str.upper(), str.lower(), str.capitalize(), and str.title() each return a copy of the string with its capitalization changed.

PYTHON

my_string = "liters (coffee) per minute"
print(my_string.lower())
print(my_string.upper())
print(my_string.capitalize())
print(my_string.title())

OUTPUT

liters (coffee) per minute
LITERS (COFFEE) PER MINUTE
Liters (coffee) per minute
Liters (Coffee) Per Minute
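One common use of these methods is comparing strings without regard to case, for example when the same label has been typed inconsistently. A minimal sketch, assuming two hypothetical spellings of the same column header:

```python
header_a = "WT_0Min_1"
header_b = "wt_0min_1"  # hypothetical inconsistent spelling

# A direct comparison is case-sensitive
exact_match = header_a == header_b

# Converting both sides to lowercase first ignores case
loose_match = header_a.lower() == header_b.lower()

print(exact_match)
print(loose_match)
```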

Challenge

A common problem when analyzing data is that multiple features of the data will be stored as a column name or single string.

For instance, consider the following column headers:

WT_0Min_1	WT_0Min_2	X.RAB7A_0Min_1	X.RAB7A_0Min_2	WT_5Min_1	WT_5Min_2	X.RAB7A_5Min_1	X.RAB7A_5Min_2	X.NPC1_5Min_1	X.NPC1_5Min_2

There are two variables of interest: the time (0, 5, or 60 minutes post-infection) and the genotype (WT, NPC1 knockout, or RAB7A knockout). We also have replicate information at the end of each column header. For now, let’s just try extracting the timepoint, genotype, and replicate from a single column header.

Given the string:

PYTHON

sample_info = "X.RAB7A_0Min_1"

Try to print the string:

Sample is 0min RAB7A knockout, replicate 1.

Use f-strings, slicing and indexing, and built-in string functions. You can try to use lists as a challenge, but it's fine to instead get each piece of information separately from sample_info.

PYTHON

s_info = sample_info.split('_')
genotype = s_info[0]
time = s_info[1]
rep = s_info[2]

# Clean up the parts
genotype = genotype.split('.')[-1]
time = time.lower()

print(f"Sample is {time} {genotype} knockout, replicate {rep}.")

OUTPUT

Sample is 0min RAB7A knockout, replicate 1.
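If you would rather not use lists yet, one possible alternative is to locate the underscores with str.find() and str.rfind() and slice out each piece. This is only a sketch of that approach; the names first_sep, last_sep, and message are illustrative.

```python
sample_info = "X.RAB7A_0Min_1"

# Positions of the first and last underscore
first_sep = sample_info.find("_")
last_sep = sample_info.rfind("_")

# Everything before the first underscore, then drop the "X." prefix
genotype = sample_info[:first_sep]
genotype = genotype[genotype.find(".") + 1:]

# The text between the underscores, in lowercase
time = sample_info[first_sep + 1:last_sep].lower()

# Everything after the last underscore
rep = sample_info[last_sep + 1:]

message = f"Sample is {time} {genotype} knockout, replicate {rep}."
print(message)
```

When the genotype has no "." (as in WT), find() returns -1, so the slice starts at index 0 and keeps the whole name.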