## Demo2: Useful R functions for working with strings

Barry Grant < http://thegrantlab.org >

It is important to note that all strings in R are stored as vectors. For example both of the below are vectors of strings.

```
v1 <- c("A", "B", "C", "D", "E")
v2 <- "ABCDE"
v1
```

```
## [1] "A" "B" "C" "D" "E"
```

```
v2
```

```
## [1] "ABCDE"
```

The first vector **v1** has 5 elements (i.e values A to E) and is thus of length 5. The second vector **v2** has only one element and is of length 1:

```
length(v1)
```

```
## [1] 5
```

```
length(v2)
```

```
## [1] 1
```

To find out how many characters (i.e. letters) are in each element of a vector we can use the **nchar()** function:

```
nchar(v1)
```

```
## [1] 1 1 1 1 1
```

```
nchar(v2)
```

```
## [1] 5
```

Note that the example

nchar(v1)here shows that thenchar()function is vectorized - that is, it will operate on every element of our input vector without having to explicitly loop over these elements. This is a really cool feature of R!

## The paste() function

A particularly useful function is **paste()**, which constructs strings by “pasting” together the parts. The **paste()** function takes any number of arguments and concatenates them together using the separating string specified by the sep argument (which is a space by default). Like many of R’s functions, paste() is vectorized:

```
paste("chr", c(1:22, "X", "Y"), sep="")
```

```
## [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8"
## [9] "chr9" "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16"
## [17] "chr17" "chr18" "chr19" "chr20" "chr21" "chr22" "chrX" "chrY"
```

What do you think would happen with the following example?

```
paste(v1, 1)
```

```
## [1] "A 1" "B 1" "C 1" "D 1" "E 1"
```

How about these examples?

```
paste(v1, c(1:5))
```

```
## [1] "A 1" "B 2" "C 3" "D 4" "E 5"
```

```
paste(v1, c(1:3))
```

```
## [1] "A 1" "B 2" "C 3" "D 1" "E 2"
```

In the last example above, **paste(v1, c(1:3))** we see **recycling** in operation - in this case the shorter input vector got “recycled” until it matched the length of the longer input vector to produce our output. This is a feature, not a bug!

## Pattern matching and replacement with grep() and sub()

The **grep()** function, like its UNIX namesake, is useful for finding matches within and between strings. Here are some simple examples:

```
x <- c("AGCTAG", "ATA", "GATCTGAG", "")
length(x)
```

```
## [1] 4
```

```
nchar(x)
```

```
## [1] 6 3 8 0
```

```
grep("AT",x)
```

```
## [1] 2 3
```

```
grep("AT[CG]",x)
```

```
## [1] 3
```

```
grep("AT[CG]",x, value=TRUE)
```

```
## [1] "GATCTGAG"
```

Unlike **grep()**, the related **regexpr(pattern, x)** function returns where in each element of x it matched pattern. If an element doesn’t match the pattern, regexpr() returns –1. For example:

```
regexpr("AT[CG]",x)
```

```
## [1] -1 -1 2 -1
## attr(,"match.length")
## [1] -1 -1 3 -1
## attr(,"useBytes")
## [1] TRUE
```

Note from the output here the pattern **AT[CG]** is not found in first, second and fourth element of the vector **x** but is found in the third element starting at the second character continuing for 3 characters.

The **sub()** function is useful for replacement of pattern matched segments of strings. For example:

```
sub("AT[CG]", "--barry--", x)
```

```
## [1] "AGCTAG" "ATA" "G--barry--TGAG" ""
```

If you want to match and replace more than the first occurrence of your pattern you will want to use the related **gsub()** function.

## Extracting values from a string

The final function essential to string processing in R is **strsplit(x, split)**, which splits string **x** by **split**. Like R’s other string processing functions, **strsplit()** supports optional *perl* and *fixed* arguments.

For example, if we had a string like *gene=LEAFY;locus=2159208;gene_model=AT5G61850.1* and we wished to extract each part, we’d need to split by “;”:

```
y <- "gene=LEAFY;locus=2159208;gene_model=AT5G61850.1"
strsplit(y, ";")
```

```
## [[1]]
## [1] "gene=LEAFY" "locus=2159208"
## [3] "gene_model=AT5G61850.1"
```

Also, like all of R’s other string functions, **strsplit()** is vectorized, so it can process entire character vectors at once. Because the number of split chunks can vary, **strsplit()** always returns results in a list.