Working with Text:
As a data analyst you will inevitably encounter situations where you need to process STRINGS. It might be to conduct a simple content analysis (counting word occurrences), but most likely you will use these tools at the data cleaning and database preparation steps.
Text analysis is a large and complex field with packages that specialize in natural language processing tools, but you will be surprised at how sophisticated you can get with a handful of core R string functions.
When you work with text in computer programs, it is called ‘string processing’ because the computer does not know anything about words or concepts, so it treats text as strings of characters.
Some basic vocabulary common to tasks that treat text as data:
We can atomize a large body of text by breaking it into sentences, words, letters, etc.
x <- c("This is a string.", "These", "words","are", "also", "strings." )
x
## [1] "This is a string." "These" "words"
## [4] "are" "also" "strings."
# putting text together
paste( "This is a string.", "These", "words","are", "also", "strings.", sep=" ")
## [1] "This is a string. These words are also strings."
# breaking it apart
unlist( strsplit( x, " " ) )
## [1] "This" "is" "a" "string." "These" "words" "are"
## [8] "also" "strings."
unlist( strsplit( x, "" ) )
## [1] "T" "h" "i" "s" " " "i" "s" " " "a" " " "s" "t" "r" "i" "n" "g" "." "T" "h"
## [20] "e" "s" "e" "w" "o" "r" "d" "s" "a" "r" "e" "a" "l" "s" "o" "s" "t" "r" "i"
## [39] "n" "g" "s" "."
There are a handful of functions that you will use to work with strings. These functions find specific words or characters in your data, find parts of words, and replace them with other words or characters. There are also some functions to break text apart, put text together, or format it.
Function | Use |
---|---|
grep() | Find a word or phrase (returns the positions of the matching elements, or the elements themselves with value=TRUE). |
grepl() | Find a word or phrase (returns a logical vector). |
regexpr() | Find a part of a word or phrase (returns the position of the match); very flexible. |
agrep() | Find an approximate match. |
sub() | Replace the first occurrence of a word or phrase. |
gsub() | Replace ALL occurrences of a word or phrase. |
paste() | Combine multiple strings into a single string. |
strsplit() | Split one string into multiple strings. |
substr() | Extract part of a string. |
Let’s look at some examples of these functions in action.
We often need to combine several pieces of text into one string, called concatenation. R’s function for this is paste().
paste( "My","name","is","mud.")
## [1] "My name is mud."
a <- "mud."
paste("My","name","is", a ) # it can handle objects as arguments
## [1] "My name is mud."
b <- c("Larry","Moe","Curly")
paste("My","name","is", b ) # it is vectorized
## [1] "My name is Larry" "My name is Moe" "My name is Curly"
# Need to create some vector names?
paste("x",1:3,sep="")
## [1] "x1" "x2" "x3"
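Two related tools are worth a quick sketch (the inputs and object names here are made up for illustration): paste0() is shorthand for paste() with sep="", and the collapse argument joins the elements of a vector into a single string.

```r
# paste0() behaves like paste() with sep=""
var.names <- paste0( "x", 1:3 )
var.names     # "x1" "x2" "x3"

# collapse glues the elements of one vector together
one.string <- paste( c("Larry","Moe","Curly"), collapse=", " )
one.string    # "Larry, Moe, Curly"
```

Use sep when combining several arguments and collapse when the pieces already sit in one vector.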
toupper( "AbCdEfG" )
## [1] "ABCDEFG"
tolower( "AbCdEfG" )
## [1] "abcdefg"
Need to sort a column of text by the length of words? You count characters with the function nchar():
nchar( c("micky","snuffleupagus") )
## [1] 5 13
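To actually sort by length, one option is to pass the character counts to order(). This is a minimal sketch with a made-up vector called words:

```r
# hypothetical vector of words to sort
words <- c("snuffleupagus","minnie","micky")

# order() returns the positions that would sort the character counts
sorted.words <- words[ order( nchar(words) ) ]
sorted.words   # "micky" "minnie" "snuffleupagus"
```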
Counting words is a little more complicated, since a passage of text is often stored as a single character string.
nchar( "a b c")
## [1] 5
nchar( "This is all one piece of text." )
## [1] 30
We can split text using the string split function strsplit(). We just need to tell it the delimiter, which is a single space in this case.
strsplit( "This is all one piece of text.", split=" " )
## [[1]]
## [1] "This" "is" "all" "one" "piece" "of" "text."
length( strsplit( "This is all one piece of text.", split=" " )[[1]] )
## [1] 7
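The same idea extends to several strings at once. As a sketch (the sentences vector is invented for this example), lengths() counts the elements in each piece of the list that strsplit() returns:

```r
# hypothetical sentences
sentences <- c( "This is all one piece of text.", "Short one." )

# strsplit() returns one vector of words per sentence;
# lengths() counts the elements in each of those vectors
word.counts <- lengths( strsplit( sentences, split=" " ) )
word.counts   # 7 2
```

Note that this counts anything separated by a single space as a word, so punctuation stays attached to the word it follows.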
If we want to split a string into individual characters, we give it an empty string as the delimiter:
strsplit("abc", "") # returns 3 element vector "a","b","c"
## [[1]]
## [1] "a" "b" "c"
nchar( "This is all one piece of text." )
## [1] 30
length( strsplit( "This is all one piece of text.", split="" )[[1]] )
## [1] 30
Recall that the census downloads contain a field called GEO.id, which consists of several FIPS codes pasted together. If we inspect this ID we can see that the county FIPS code (the one we often use for merges) is included as the last five digits. How can we use this variable to extract the county FIPS codes?
The function substr() takes a character vector as its argument and returns the substring specified by the start and stop positions.
substr( "Micky", start=2, stop=4 )
## [1] "ick"
GEO.id <- c("0500000US01001","0500000US01003","0500000US01005")
substr( GEO.id, start=10, stop=14 ) # returns county fips codes only
## [1] "01001" "01003" "01005"
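Hard-coding the start position only works when every ID has the same length. A sketch that instead counts back from the end of each string, using nchar() (the object names n and county.fips are chosen for this example):

```r
GEO.id <- c("0500000US01001","0500000US01003","0500000US01005")

# substr() is vectorized over start and stop,
# so each string can use its own length
n <- nchar( GEO.id )
county.fips <- substr( GEO.id, start=n-4, stop=n )
county.fips   # "01001" "01003" "01005"
```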
# replacement using substr (only as many characters as fit in positions 2-4 are used)
substr( GEO.id, 2, 4) <- "22222"
GEO.id
## [1] "0222000US01001" "0222000US01003" "0222000US01005"
If we want to search text for a keyword we use grep().
In case you are curious about what ‘grep’ means, it is a term inherited from Unix operating systems.
GREP (g/re/p): Globally search for a Regular Expression and Print
my.text <- c("micky","minnie","goofy","pluto")
grep( pattern="goofy", my.text ) # correctly returns the third line
## [1] 3
grep( pattern="Goofy", my.text ) # whoops! case matters
## integer(0)
grep( pattern="Goofy", my.text, ignore.case=T ) # there we go
## [1] 3
# returns each line that contains the match text
grep( "new", c("california","new york","new jersey","tennessee") )
## [1] 2 3
# perhaps we want to see all of the lines that match
grep( "new", c("california","new york","new jersey","tennessee"), value=T )
## [1] "new york" "new jersey"
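The logical version, grepl(), is listed in the table above and is often the more convenient choice for counting or subsetting. A short sketch using the same state names (the object names are chosen for this example):

```r
states <- c("california","new york","new jersey","tennessee")

# one TRUE or FALSE per element
has.new <- grepl( "new", states )
has.new              # FALSE TRUE TRUE FALSE

# TRUE counts as 1, so sum() counts the matches
n.matches <- sum( has.new )
n.matches            # 2

# a logical vector can be used directly as a subset index
new.states <- states[ has.new ]
new.states           # "new york" "new jersey"
```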
Find and replace the first case in a string with sub() or all cases with gsub():
sub( pattern="New", replacement="Old", "We are traveling from New York to New Jersey" )
## [1] "We are traveling from Old York to New Jersey"
sub( pattern="new", replacement="old", c("california","new york","new jersey","tennessee") )
## [1] "california" "old york" "old jersey" "tennessee"
sub( pattern="rave", replacement="party", "We are traveling from New York to New Jersey" )
## [1] "We are tpartyling from New York to New Jersey"
gsub( pattern="New", replacement="Old", "We are traveling from New York to New Jersey" )
## [1] "We are traveling from Old York to Old Jersey"
gsub( pattern="new", replacement="old", c("california","new york","new jersey","tennessee") )
## [1] "california" "old york" "old jersey" "tennessee"
# must use escapes for special characters
sub("?",".","Hello there?") # that's not right
## [1] ".Hello there?"
sub("\\?",".","Hello there?") # there we go
## [1] "Hello there."
sub("\\s",".","Hello There") # this works for spaces
## [1] "Hello.There"
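When the search term is plain text rather than a pattern, an alternative to escaping is the fixed=TRUE argument, which tells sub() and gsub() to treat the pattern literally. A brief sketch (the date-like string is invented for illustration):

```r
# fixed=TRUE turns off regular expression matching
fixed.result <- sub( "?", ".", "Hello there?", fixed=TRUE )
fixed.result    # "Hello there."

# gsub() accepts the same argument; here "." is a literal period
dashed <- gsub( ".", "-", "2021.09.09", fixed=TRUE )
dashed          # "2021-09-09"
```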
We often need to search large bodies of text for patterns.
Regular expressions are a stylized syntax used to query bodies of text and return very specific results. They use symbols that help match groups of characters, as well as expressions that query locations within strings (a pattern at the beginning of a word or the end of a sentence).
Note that this section borrows heavily from Gloria Li and Jenny Bryan (the original link is now defunct).
Some clearer examples are provided by Dean Attali at:
https://github.com/daattali/UBC-STAT545/blob/master/reference/regex/regularExpressions.md
Recall that logical operators are symbols that allow us to translate nuanced questions into computer code. For example, how many left-handed batters have been inducted into the Baseball Hall of Fame?
Similarly, regular expression operators allow us to create complex search terms.
Instead of saying, search for the word “cat” in the text, we might want to say, search for the word “cat”, only at the beginning of sentences, and do not return instances like “catch” that contain “cat”.
In order to specify these searches, we need a more flexible language. Regular expressions give us this.
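As a preview, the “cat” search might be sketched as follows. The example assumes each element of a made-up vector holds one sentence, and it relies on two operators explained later in this section: ^ anchors the match to the start of the string and \\b marks the edge of a word.

```r
# hypothetical sentences, one per element
sentences <- c( "cat food is on sale", "catch the ball",
                "the cat sat", "cats sleep all day" )

# "cat" at the start of the string, followed by a word edge
starts.with.cat <- grepl( "^cat\\b", sentences )
starts.with.cat   # TRUE FALSE FALSE FALSE
```

The word edge rules out “catch” and “cats”, and the anchor rules out “the cat sat”.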
Each of these symbols functions as an operator in the regular expressions framework:
$ * + . ? [ ] ^ { } | ( )
Here are the uses of some of these:
Operator | Use |
---|---|
. | matches any single character (wild card for single character) |
^ | start of a string |
$ | end of a string |
? | match any time a preceding character appears 0 or 1 times |
* | match any time a preceding character appears 0 or more times |
+ | match any time a preceding character appears 1 or more times |
\| | OR statement - match either pattern given |
[ ] | OR statement - match any of the characters given |
[^ ] | match any characters EXCEPT those given in the list |
\ | escape character - turns an operator into plain text |
Note that the position of RegEx operators matters, and it is not the same for every operator. For example, the position of an anchor changes with the anchor type:
"^word" # start of string anchor acts on W, precedes letter
"word$" # end of string anchor acts on D, follows letter
The quantifiers ?, * and + always follow the letter they act upon.
"loo?se" # matches lose or loose
strings <- c("^ab", "ab", "abc", "abd", "abe", "ab 12", "ab$")
# match any string that contains ab followed by any character
grep("ab.", strings, value = TRUE)
## [1] "abc" "abd" "abe" "ab 12" "ab$"
# search for abc OR abd
grep("abc|abd", strings, value = TRUE)
## [1] "abc" "abd"
# match abc OR abd OR abe
grep("ab[c-e]", strings, value = TRUE)
## [1] "abc" "abd" "abe"
# match ab followed by any character EXCEPT c
grep("ab[^c]", strings, value = TRUE)
## [1] "abd" "abe" "ab 12" "ab$"
# match any string where ab occurs at the beginning
grep("^ab", strings, value = TRUE)
## [1] "ab" "abc" "abd" "abe" "ab 12" "ab$"
# match any string where ab occurs at the end
grep("ab$", strings, value = TRUE)
## [1] "^ab" "ab"
# search for matches that contain the character ^
grep("^", strings, value = TRUE)
## [1] "^ab" "ab" "abc" "abd" "abe" "ab 12" "ab$"
# try again
grep("\\^", strings, value = TRUE)
## [1] "^ab"
# same here
grep("$", strings, value = TRUE)
## [1] "^ab" "ab" "abc" "abd" "abe" "ab 12" "ab$"
grep("\\$", strings, value = TRUE)
## [1] "ab$"
If we want to search for one of these special operators in our text, we need to tell R that we are looking for the literal character and not using it as a regular expression operator. We accomplish this with an escape sequence.
Create an escape sequence by placing a double backslash “\\” in front of a special operator. Some characters have escape sequences of their own: to search for a double quote, a newline, or a tab in the text, use \", \n, or \t.
string <- "Here is a long string
           of text that contains 
           some breaks."
string
## [1] "Here is a long string\n           of text that contains \n           some breaks."
# find the positions of the breaks
nchar( string )
## [1] 79
gregexpr( "\\n", string )
## [[1]]
## [1] 22 56
## attr(,"match.length")
## [1] 1 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
# find all of the spaces that directly follow a word
gregexpr( "\\b ", string )
## [[1]]
## [1] 5 8 10 15 36 41 46 55 72
## attr(,"match.length")
## [1] 1 1 1 1 1 1 1 1 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
# all of the o's at the beginning of words
gregexpr( "\\bo", string )
## [[1]]
## [1] 34
## attr(,"match.length")
## [1] 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
# all of the o's in the middle of words
gregexpr( "\\Bo", string )
## [[1]]
## [1] 12 48 69
## attr(,"match.length")
## [1] 1 1 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
The regexpr() and gregexpr() functions are odd because they return character positions instead of elements from the character vector: regexpr() reports the first match in each string and gregexpr() reports every match. The start position and the match length can then be used to extract pieces of text from the whole body of text.
# extracting text using start and stop values
regexpr( "c.*g", "abcdefghi" )
## [1] 3
## attr(,"match.length")
## [1] 5
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
start.pos <- regexpr( "c.*g", "abcdefghi" )
# the match runs from start.pos to start.pos + match.length - 1
stop.pos <- start.pos + attr( start.pos, "match.length" ) - 1
substr( "abcdefghi", start=start.pos, stop=stop.pos )
## [1] "cdefg"
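Base R also provides regmatches(), which takes the text together with the result of regexpr() or gregexpr() and returns the matched substrings directly, so the start and stop arithmetic is handled for you. A small sketch (the object names are chosen for this example):

```r
txt <- "abcdefghi"

# first match only
first.match <- regmatches( txt, regexpr( "c.*g", txt ) )
first.match        # "cdefg"

# gregexpr() finds every match, so regmatches() returns a list
# with one character vector per input string
all.vowels <- regmatches( txt, gregexpr( "[aeiou]", txt ) )
all.vowels[[1]]    # "a" "e" "i"
```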
The quantifiers allow us to specify the number of times a character is repeated.
Operator | Use |
---|---|
* | matches at least 0 times. |
. | matches any single character exactly 1 time. |
+ | matches at least 1 time. |
? | matches at most 1 time. |
{n} | matches exactly n times. |
{n,} | matches at least n times. |
{n,m} | matches between n and m times. |
strings <- c("ht","hot","hoot","hooot")
# match at least zero times
grep("ho*t", strings, value = TRUE)
## [1] "ht" "hot" "hoot" "hooot"
# match ONLY one time
grep("h.t", strings, value = TRUE)
## [1] "hot"
# match at least one time
grep("ho+t", strings, value = TRUE)
## [1] "hot" "hoot" "hooot"
# match zero or one times
grep("ho?t", strings, value = TRUE)
## [1] "ht" "hot"
# match exactly n times
grep("ho{2}t", strings, value = TRUE)
## [1] "hoot"
# match at least n times
grep("ho{2,}t", strings, value = TRUE)
## [1] "hoot" "hooot"
# match between n and m times
grep("ho{1,2}t", strings, value = TRUE)
## [1] "hot" "hoot"
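Quantifiers are handy for validating the format of an ID field during data cleaning. As a sketch, suppose county FIPS codes should be exactly five digits; the codes vector below is invented for illustration, with the anchors ^ and $ forcing the pattern to cover the whole string:

```r
# hypothetical codes, some of them malformed
codes <- c( "01001", "1001", "010011", "0100A" )

# exactly five characters from the set 0-9, and nothing else
is.valid <- grepl( "^[0-9]{5}$", codes )
is.valid   # TRUE FALSE FALSE FALSE
```

Without the anchors, "010011" would also match, because it contains a run of five digits.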
The position operators specify whether the characters occur at the beginning, middle, or end of a word or phrase.
Note that “a dog” is a STRING that contains two WORDS for the definitions below.
Operator | Use |
---|---|
^ | matches the start of the STRING. |
$ | matches the end of the STRING. |
\\b | matches the empty string at either edge of a WORD. |
\\B | matches the string provided it is NOT at an edge of a word. |
strings <- c("abcd", "cdab", "cabd", "c abd")
# anywhere in the text
grep("ab", strings, value = TRUE)
## [1] "abcd" "cdab" "cabd" "c abd"
# at the beginning of a STRING
grep("^ab", strings, value = TRUE)
## [1] "abcd"
# at the end of a STRING
grep("ab$", strings, value = TRUE)
## [1] "cdab"
# at the beginning of a WORD
grep("\\bab", strings, value = TRUE)
## [1] "abcd" "c abd"
# in the middle of a WORD
grep("\\Bab", strings, value = TRUE)
## [1] "cdab" "cabd"
# Searching for special characters using escape
regexpr( "\\*", "abcd*efghi" )
## [1] 5
## attr(,"match.length")
## [1] 1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
my.text <- c("micky","minnie","goofy","pluto")
grep( pattern="g*fy", my.text )
## [1] 3
grep( pattern="g*y", my.text )
## [1] 1 3
grep( pattern="pluo?to", my.text )
## [1] 4
grep( pattern="pluo?t", my.text )
## [1] 4
grep( pattern="plo?to", my.text )
## integer(0)
grep( pattern="mi*", my.text )
## [1] 1 2
# FormA OR FormB OR FormC
my.text <- c( "FormA", "FormC", "FormE" )
grep( pattern="Form[ABC]", my.text )
## [1] 1 2
grep( pattern="h[oi]t" , c("hot","hat","hit","hop") )
## [1] 1 3
# replace land with LAND in all country names
gsub( "land", "LAND", c("finland", "iceland", "michael landon") )
## [1] "finLAND" "iceLAND" "michael LANDon"
# need to anchor the word to the end
gsub( "land$", "LAND", c("finland", "iceland", "michael landon") )
## [1] "finLAND" "iceLAND" "michael landon"
R has a special class for dates. This class translates letters and numbers into calendar dates, and it knows how to convert them easily between units such as days and years.
You would use this class to re-cast characters from a database into calendar dates, or to calculate the time between events.
date()
## [1] "Thu Sep 09 13:25:26 2021"
Perhaps we are running simulations and need to print output to a file in a way that we can generate random names for the files but still keep track of the order. We can create filenames using dates:
paste( date(), ".pdf", sep="" )
## [1] "Thu Sep 09 13:25:26 2021.pdf"
That’s a complicated title. Perhaps we want a simple representation of the full date. We can format a date object using some simple commands. For a full list see strptime().
Sys.time()
## [1] "2021-09-09 13:25:26 MST"
format(Sys.time(), "%a %b %d %Y" )
## [1] "Thu Sep 09 2021"
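For file names specifically, a format without spaces or colons tends to be safer, and putting the year first means the files sort chronologically. A minimal sketch, using a fixed time so the result is reproducible (in practice you would use Sys.time(); the object names are chosen for this example):

```r
# a fixed time standing in for Sys.time()
run.time <- as.POSIXct( "2021-09-09 13:25:26", tz="UTC" )

# year-month-day, then hours, minutes and seconds
stamp <- format( run.time, "%Y-%m-%d_%H%M%S", tz="UTC" )
file.name <- paste0( stamp, ".pdf" )
file.name   # "2021-09-09_132526.pdf"
```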
Suppose you want to calculate the time between two dates in your data set:
start.date <- c("2011/06/13","2011/07/25","2011/05/24")
end.date <- c("2012/01/01","2012/01/01","2012/03/19")
start.date
## [1] "2011/06/13" "2011/07/25" "2011/05/24"
class( start.date )
## [1] "character"
# you will get an error here:
# end.date - start.date
You will notice that our dates were read in as characters, so we first need to translate them to the date class in order to make any meaningful comparisons between them. So we cast them as dates.
as.Date( end.date )
## [1] "2012-01-01" "2012-01-01" "2012-03-19"
as.Date( end.date ) - as.Date( start.date )
## Time differences in days
## [1] 202 160 300
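The result of subtracting dates is a difftime object. If you need plain numbers, or different units, a sketch along these lines may help. The names begin and finish are used here so the character vectors above are left untouched:

```r
begin  <- as.Date( c("2011/06/13","2011/07/25","2011/05/24") )
finish <- as.Date( c("2012/01/01","2012/01/01","2012/03/19") )

# strip the difftime class to get an ordinary numeric vector
days.elapsed <- as.numeric( finish - begin )
days.elapsed                # 202 160 300

# difftime() lets you choose the units explicitly
weeks.elapsed <- as.numeric( difftime( finish, begin, units="weeks" ) )
round( weeks.elapsed, 1 )   # 28.9 22.9 42.9
```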
It worked! Let’s be a little more careful, though, about how we are conducting the translation to make sure we are not introducing any errors. We can explicitly specify the format of the dates to ensure they are interpreted correctly:
as.Date( start.date, format="%Y/%m/%d")
## [1] "2011-06-13" "2011-07-25" "2011-05-24"
That works correctly. What if we mix up days and months, though? European and American dates often order days and months differently.
as.Date( start.date, format="%Y/%d/%m")
## [1] NA NA NA
as.Date( "2004/30/06", format="%Y/%d/%m" ) # ok
## [1] "2004-06-30"
as.Date( "2004/31/06", format="%Y/%d/%m" ) # June only has 30 days
## [1] NA
At least R is smart enough to know that there are no months higher than 12 and only 30 days in June, so it returns NA instead of recycling the extra day into the next month.
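The NA safeguard only catches impossible dates. As a sketch with made-up values, American-style dates need their own format string, and a date whose day and month are both 12 or less parses without complaint under either ordering:

```r
# hypothetical month/day/year dates
us.dates <- c( "06/13/2011", "07/25/2011" )
parsed <- as.Date( us.dates, format="%m/%d/%Y" )
parsed       # "2011-06-13" "2011-07-25"

# ambiguous: both formats succeed, but give different dates
as.american <- as.Date( "03/04/2011", format="%m/%d/%Y" )
as.european <- as.Date( "03/04/2011", format="%d/%m/%Y" )
as.american  # "2011-03-04" (March 4)
as.european  # "2011-04-03" (April 3)
```

This is why it is worth confirming the ordering from the data source rather than relying on R to flag a mistake.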
We can use the sequence function to generate lists of dates as long as the arguments are dates.
a <- as.Date("2010/01/01")
b <- as.Date("2010/02/01")
c <- as.Date("2011/01/15")
seq( from=a, to=b, by=1 ) # sequence of days
## [1] "2010-01-01" "2010-01-02" "2010-01-03" "2010-01-04" "2010-01-05"
## [6] "2010-01-06" "2010-01-07" "2010-01-08" "2010-01-09" "2010-01-10"
## [11] "2010-01-11" "2010-01-12" "2010-01-13" "2010-01-14" "2010-01-15"
## [16] "2010-01-16" "2010-01-17" "2010-01-18" "2010-01-19" "2010-01-20"
## [21] "2010-01-21" "2010-01-22" "2010-01-23" "2010-01-24" "2010-01-25"
## [26] "2010-01-26" "2010-01-27" "2010-01-28" "2010-01-29" "2010-01-30"
## [31] "2010-01-31" "2010-02-01"
seq( from=a, to=b, by=7 ) # sequence of weeks
## [1] "2010-01-01" "2010-01-08" "2010-01-15" "2010-01-22" "2010-01-29"
seq( from=a, to=b, by="week" ) # same output
## [1] "2010-01-01" "2010-01-08" "2010-01-15" "2010-01-22" "2010-01-29"
seq( from=a, to=c, by="month" ) # end date does not land on the 15th
## [1] "2010-01-01" "2010-02-01" "2010-03-01" "2010-04-01" "2010-05-01"
## [6] "2010-06-01" "2010-07-01" "2010-08-01" "2010-09-01" "2010-10-01"
## [11] "2010-11-01" "2010-12-01" "2011-01-01"
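If you know how many dates you need rather than where the sequence should stop, seq() also accepts length.out in place of to. A brief sketch (the object name quarter.starts is chosen for this example):

```r
a <- as.Date("2010/01/01")

# four dates, three months apart, starting from a
quarter.starts <- seq( from=a, by="quarter", length.out=4 )
quarter.starts   # "2010-01-01" "2010-04-01" "2010-07-01" "2010-10-01"
```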
ASCII stands for the American Standard Code for Information Interchange, a standard table of letters, numbers and punctuation based upon the English alphabet. ASCII defines 128 characters: 95 printable characters (letters, numbers, etc.) and 33 control characters (end of line, tab, etc.). It is limited to text without accent marks or special characters. ASCII was the most common character encoding on the World Wide Web until it was overtaken by UTF-8, a more flexible global standard.
Data analysis can be adversely affected if non-ASCII characters find their way into datasets. If they are causing you trouble, it’s useful to know some tricks to find and remove non-ASCII text. The iconv() function is one option:
# not run because it returns an error
# x <- c("Ekstr\xf8m", "J\xf6reskog", "bi\xdfchen Z\xfcrcher") # e.g. from ?iconv
# Encoding(x) <- "latin1" # (just to make sure)
#
# x
#
# iconv(x, "latin1", "ASCII", sub="" )
#
# iconv(x, "latin1", "ASCII", sub="_" )
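A variation that may run more reliably is sketched below. It assumes the text is stored as UTF-8 and writes the accented letters as \u escapes so the script itself stays pure ASCII. The exact behavior of iconv() depends on the conversion library on your platform, so treat this as a starting point rather than a guaranteed recipe; the object names are chosen for this example.

```r
# \u00f8 and \u00f6 are the Unicode escapes for the accented letters
x <- c( "Ekstr\u00f8m", "J\u00f6reskog" )

# without sub, elements that cannot be converted are returned as NA,
# which can be used to flag the strings that contain non-ASCII text
has.non.ascii <- is.na( iconv( x, from="UTF-8", to="ASCII" ) )

# sub="" drops the bytes that have no ASCII equivalent
cleaned <- iconv( x, from="UTF-8", to="ASCII", sub="" )
cleaned
```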