String functions


Contents

abbrev(s,n) name s, abbreviated to a length of n

char(n) the character corresponding to ASCII or extended ASCII code n; "" if n is not in the domain

collatorlocale(loc,type) the most closely related locale supported by ICU from loc if type is 1; the actual locale where the collation data comes from if type is 2

collatorversion(loc) the version string of a collator based on locale loc

indexnot(s1,s2) the position in ASCII string s1 of the first character of s1 not found in ASCII string s2, or 0 if all characters of s1 are found in s2

plural(n,s) the plural of s if n ≠ ±1

plural(n,s1,s2) the plural of s1, as modified by or replaced with s2, if n ≠ ±1

real(s) s converted to numeric or missing

regexcapture(n) subexpression n from a previous regexm() or regexmatch() match

regexcapturenamed(grp) subexpression corresponding to matching group named grp in regular expression from a previous regexm() or regexmatch() match

regexm(s,re) a match of a regular expression, which evaluates to 1 if regular expression re is satisfied by the ASCII string s; otherwise, 0

regexmatch(s,re[,noc[,std[,nlalt]]]) a match of a regular expression, which evaluates to 1 if regular expression re is satisfied by the ASCII string s; otherwise, 0

regexr(s1,re,s2) replaces the first substring within ASCII string s1 that matches re with ASCII string s2 and returns the resulting string

regexreplace(s1,re,s2[,noc[,fmt[,std[,nlalt]]]]) replaces the first substring within ASCII string s1 that matches re with ASCII string s2 and returns the resulting string

regexreplaceall(s1,re,s2[,noc[,fmt[,std[,nlalt]]]]) replaces all substrings within ASCII string s1 that match re with ASCII string s2 and returns the resulting string

regexs(n) subexpression n from a previous regexm() or regexmatch() match, where 0 ≤ n < 10

soundex(s) the soundex code for a string, s

soundex_nara(s) the U.S. Census soundex code for a string, s

strcat(s1,s2) there is no strcat() function; instead the addition operator is used to concatenate strings

strdup(s1,n) there is no strdup() function; instead the multiplication operator is used to create multiple copies of strings

string(n) a synonym for strofreal(n)


string(n,s) a synonym for strofreal(n,s)

stritrim(s) s with multiple, consecutive internal blanks (ASCII space character char(32)) collapsed to one blank

strlen(s) the number of characters in ASCII s or length in bytes

strlower(s) lowercase ASCII characters in string s

strltrim(s) s without leading blanks (ASCII space character char(32))

strmatch(s1,s2) 1 if s1 matches the pattern s2; otherwise, 0

strofreal(n) n converted to a string

strofreal(n,s) n converted to a string using the specified display format

strpos(s1,s2) the position in s1 at which s2 is first found, 0 if s2 does not occur, and 1 if s2 is empty

strproper(s) a string with the first ASCII letter and any other letters immediately following characters that are not letters capitalized; all other ASCII letters converted to lowercase

strreverse(s) the reverse of ASCII string s

strrpos(s1,s2) the position in s1 at which s2 is last found, 0 if s2 does not occur, and 1 if s2 is empty

strrtrim(s) s without trailing blanks (ASCII space character char(32))

strtoname(s[,p]) s translated into a Stata 13 compatible name

strtrim(s) s without leading and trailing blanks (ASCII space character char(32)); equivalent to strltrim(strrtrim(s))

strupper(s) uppercase ASCII characters in string s

subinstr(s1,s2,s3,n) s1, where the first n occurrences in s1 of s2 have been replaced with s3

subinword(s1,s2,s3,n) s1, where the first n occurrences in s1 of s2 as a word have been replaced with s3

substr(s,n1,n2) the substring of s, starting at n1, for a length of n2

tobytes(s[,n]) escaped decimal or hex digit strings of up to 200 bytes of s

uchar(n) the Unicode character corresponding to Unicode code point n or an empty string if n is beyond the Unicode code-point range

udstrlen(s) the number of display columns needed to display the Unicode string s in the Stata Results window

udsubstr(s,n1,n2) the Unicode substring of s, starting at character n1, for n2 display columns

uisdigit(s) 1 if the first Unicode character in s is a Unicode decimal digit; otherwise, 0

uisletter(s) 1 if the first Unicode character in s is a Unicode letter; otherwise, 0

ustrcompare(s1,s2[,loc]) compares two Unicode strings

ustrcompareex(s1,s2,loc,st,case,cslv,norm,num,alt,fr) compares two Unicode strings

ustrfix(s[,rep]) replaces each invalid UTF-8 sequence with a Unicode character

ustrfrom(s,enc,mode) converts the string s in encoding enc to a UTF-8 encoded Unicode string

ustrinvalidcnt(s) the number of invalid UTF-8 sequences in s

ustrleft(s,n) the first n Unicode characters of the Unicode string s


ustrlen(s) the number of characters in the Unicode string s

ustrlower(s[,loc]) lowercase all characters of Unicode string s under the given locale loc

ustrltrim(s) removes the leading Unicode whitespace characters and blanks from the Unicode string s

ustrnormalize(s,norm) normalizes Unicode string s to one of the five normalization forms specified by norm

ustrpos(s1,s2[,n]) the position in s1 at which s2 is first found; otherwise, 0

ustrregexm(s,re[,noc]) performs a match of a regular expression and evaluates to 1 if regular expression re is satisfied by the Unicode string s; otherwise, 0

ustrregexra(s1,re,s2[,noc]) replaces all substrings within the Unicode string s1 that match re with s2 and returns the resulting string

ustrregexrf(s1,re,s2[,noc]) replaces the first substring within the Unicode string s1 that matches re with s2 and returns the resulting string

ustrregexs(n) subexpression n from a previous ustrregexm() match

ustrreverse(s) the reverse of Unicode string s

ustrright(s,n) the last n Unicode characters of the Unicode string s

ustrrpos(s1,s2[,n]) the position in s1 at which s2 is last found; otherwise, 0

ustrrtrim(s) removes trailing Unicode whitespace characters and blanks from the Unicode string s

ustrsortkey(s[,loc]) generates a null-terminated byte array that can be used by the sort command to produce the same order as ustrcompare()

ustrsortkeyex(s,loc,st,case,cslv,norm,num,alt,fr) generates a null-terminated byte array that can be used by the sort command to produce the same order as ustrcompare()

ustrtitle(s[,loc]) a string with the first characters of Unicode words titlecased and other characters lowercased

ustrto(s,enc,mode) converts the Unicode string s in UTF-8 encoding to a string in encoding enc

ustrtohex(s[,n]) escaped hex digit string of s up to 200 Unicode characters

ustrtoname(s[,p]) string s translated into a Stata name

ustrtrim(s) removes leading and trailing Unicode whitespace characters and blanks from the Unicode string s

ustrunescape(s) the Unicode string corresponding to the escaped sequences of s

ustrupper(s[,loc]) uppercase all characters in string s under the given locale loc

ustrword(s,n[,loc]) the nth Unicode word in the Unicode string s

ustrwordcount(s[,loc]) the number of nonempty Unicode words in the Unicode string s

usubinstr(s1,s2,s3,n) replaces the first n occurrences of the Unicode string s2 with the Unicode string s3 in s1

usubstr(s,n1,n2) the Unicode substring of s, starting at n1, for a length of n2

word(s,n) the nth word in s; missing ("") if n is missing


wordbreaklocale(loc,type) the most closely related locale supported by ICU from loc if type is 1, the actual locale where the word-boundary analysis data come from if type is 2; or an empty string is returned for any other type

wordcount(s) the number of words in s

Functions

In the display below, s indicates a string subexpression (a string literal, a string variable, or another

string expression) and n indicates a numeric subexpression (a number, a numeric variable, or another

numeric expression).

If your strings contain Unicode characters or you are writing programs that will be used by others

who might use Unicode strings, read [U] 12.4.2 Handling Unicode strings.

abbrev(s,n)

Description: name s, abbreviated to a length of n

Length is measured in the number of display columns, not in the number of

characters. For most users, the number of display columns equals the number of

characters. For a detailed discussion of display columns, see [U] 12.4.2.2 Displaying

Unicode characters.

If any of the characters of s are a period, “.”, and n < 8, then the value of n

defaults to a value of 8. Otherwise, if n < 5, then n defaults to a value of 5.

If n is missing, abbrev() will return the entire string s. abbrev() is typically

used with variable names and variable names with factor-variable or time-series

operators (the period case).

abbrev("displacement",8) is displa~t

Domain s: strings

Domain n: integers 5 to 32

Range: strings

char(n)

Description: the character corresponding to ASCII or extended ASCII code n; "" if n is not in

the domain

Note: ASCII codes are from 0 to 127; extended ASCII codes are from 128 to

255. Prior to Stata 14, the display of extended ASCII characters was encoding

dependent. For example, char(128) on Microsoft Windows using Windows-1252

encoding displayed the Euro symbol, but on Linux using ISO-Latin-1 encoding,

char(128) displayed an invalid character symbol. Beginning with Stata 14, Stata’s

display encoding is UTF-8 on all platforms. The result of char(128) is an invalid

UTF-8 sequence and thus displays as a question mark. There are two Unicode

functions corresponding to char(): uchar() and ustrunescape(). You can

use uchar(8364) or ustrunescape("\u20AC") to display a Euro sign on all

platforms.

Domain n: integers 0 to 255

Range: ASCII characters


uchar(n)

Description: the Unicode character corresponding to Unicode code point n or an empty string

if n is beyond the Unicode code-point range

Note that uchar() takes the decimal value of the Unicode code point.

ustrunescape() takes an escaped hex digit string of the Unicode code point. For

example, both uchar(8364) and ustrunescape("\u20ac") produce the Euro

sign.

Domain n: integers ≥ 0

Range: Unicode characters
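The relationship between uchar() and ustrunescape() can be checked directly. The following is a minimal sketch, assuming a Stata 14 or later session; it uses only the Euro-sign code point given above and compares byte length with character length.

```stata
* Two ways to produce the Euro sign (U+20AC, decimal 8364).
display uchar(8364)
display ustrunescape("\u20ac")

* Both calls should yield the same UTF-8 string.
assert uchar(8364) == ustrunescape("\u20ac")

* The Euro sign is one character stored in three UTF-8 bytes.
display strlen(uchar(8364))
display ustrlen(uchar(8364))
```

Because the result is a multibyte character, byte-based functions such as strlen() and character-based functions such as ustrlen() report different lengths for it.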

collatorlocale(loc,type)

Description: the most closely related locale supported by ICU from loc if type is 1; the actual

locale where the collation data comes from if type is 2

For any other type, loc is returned in a canonicalized form.

collatorlocale("en_us_texas", 0) = en_US_TEXAS

collatorlocale("en_us_texas", 1) = en_US

collatorlocale("en_us_texas", 2) = root

Domain loc: strings of locale name

Domain type: integers

Range: strings

collatorversion(loc)

Description: the version string of a collator based on locale loc

The Unicode standard is constantly adding more characters and the sort key format

may change as well. This can cause ustrsortkey() and ustrsortkeyex()

to produce incompatible sort keys between different versions of International

Components for Unicode. The version string can be used for versioning the sort

keys to indicate when saved sort keys must be regenerated.

Range: strings

indexnot(s1,s2)

Description: the position in ASCII string s1 of the first character of s1 not found in ASCII string

s2, or 0 if all characters of s1 are found in s2

indexnot() is intended for use only with plain ASCII strings. For Unicode

characters beyond the plain ASCII range, the position and character are given in

bytes, not characters.

Domain s1: ASCII strings (to be searched)

Domain s2: ASCII strings (to search for)

Range: integers ≥ 0


plural(n,s)

Description: the plural of s if n ≠ ±1

The plural is formed by adding “s” to s.

plural(1, "horse") = "horse"

plural(2, "horse") = "horses"

Domain n: real numbers

Domain s: strings

Range: strings

plural(n,s1,s2)

Description: the plural of s1, as modified by or replaced with s2, if n ≠ ±1

If s2 begins with the character “+”, the plural is formed by adding the remainder

of s2 to s1. If s2 begins with the character “-”, the plural is formed by subtracting

the remainder of s2 from s1. If s2 begins with neither “+” nor “-”, then the plural

is formed by returning s2.

plural(2, "glass", "+es") = "glasses"

plural(1, "mouse", "mice") = "mouse"

plural(2, "mouse", "mice") = "mice"

plural(2, "abcdefg", "-efg") = "abcd"

Domain n: real numbers

Domain s1: strings

Domain s2: strings

Range: strings
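A common use of plural() is building a message whose noun agrees with a count. The loop below is an illustrative sketch only; the word "match" and the counts are arbitrary choices, not taken from the entry above.

```stata
* Print "0 matches", "1 match", "2 matches": the "+es" suffix is
* appended whenever the count is not 1 or -1.
foreach n of numlist 0 1 2 {
    display "`n' " + plural(`n', "match", "+es")
}
```

Note that a count of 0 takes the plural form, because the rule tests only whether n differs from ±1.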

real(s)

Description: s converted to numeric or missing

Also see strofreal().

real("5.2")+1 = 6.2

real("hello") = .

Domain s: strings

Range: −8e+307 to 8e+307 or missing

regexcapture(n)

Description: subexpression n from a previous regexm() or regexmatch() match

regexcapture(0) returns the entire string that satisfied the regular expression.

Domain n: integers

Range: ASCII strings or missing

regexcapturenamed(grp)

Description: subexpression corresponding to matching group named grp in regular expression

from a previous regexm() or regexmatch() match

Domain grp: ASCII strings

Range: ASCII strings or missing


regexm(s,re)

Description: a match of a regular expression, which evaluates to 1 if regular expression re is

satisfied by the ASCII string s; otherwise, 0

Regular expression syntax is based on Henry Spencer’s NFA algorithm, and this is

nearly identical to the POSIX.2 standard. s and re may not contain binary 0 (\0).

regexm() is intended for use only with plain ASCII characters. For Unicode

characters beyond the plain ASCII range, the match is based on bytes. For a

character-based match, see ustrregexm().

For more advanced regular expression matching, see regexmatch().

Domain s: ASCII strings

Domain re: regular expressions

Range: 0, 1, or missing

regexmatch(s,re[,noc[,std[,nlalt]]])

Description: a match of a regular expression, which evaluates to 1 if regular expression re is

satisfied by the ASCII string s; otherwise, 0

regexmatch() is intended for use only with plain ASCII characters. For Unicode

characters beyond the plain ASCII range, the match is based on bytes. For a

character-based match, see ustrregexm().

If noc is specified and is not 0, a case-insensitive match is performed; otherwise,

a case-sensitive match is performed.

std specifies the regular expression standard: 1 for POSIX Extended Regular, 2 for

POSIX Basic Regular, 3 for Emacs, 4 for AWK, 5 for grep, 6 for egrep, or any

other number for Perl, the default.

If nlalt is specified and is 0, the newline character, char(10), is not treated like the

alternation operator |; otherwise, newline has the same effect as |.

s and re may not contain binary 0 (\0).

Domain s: ASCII strings

Domain re: regular expression

Domain noc: integers

Domain std: integers

Domain nlalt: integers

Range: 0, 1, or missing


regexr(s1,re,s2)

Description: replaces the first substring within ASCII string s1 that matches re with ASCII string

s2 and returns the resulting string

If s1 contains no substring that matches re, the unaltered s1 is returned. s1 and

the result of regexr() may be at most 1,100,000 characters long. s1, re, and s2

may not contain binary 0 (\0).

regexr() is intended for use only with plain ASCII characters. For Unicode

characters beyond the plain ASCII range, the match is based on bytes, and the result

is restricted to 1,100,000 bytes. For a character-based match, see ustrregexrf()

or ustrregexra().

For more advanced regular expression replacement, see regexreplace() and

regexreplaceall().

Domain s1: ASCII strings

Domain re: regular expressions

Domain s2: ASCII strings

Range: ASCII strings

regexreplace(s1,re,s2[,noc[,fmt[,std[,nlalt]]]])

Description: replaces the first substring within ASCII string s1 that matches re with ASCII string

s2 and returns the resulting string

If noc is specified and is not 0, a case-insensitive match is performed; otherwise,

a case-sensitive match is performed.

fmt specifies the format string syntax supported in s2: 1 for literal, where s2 is

treated as a string literal (no special character substitution), 2 for sed, or any other

number for Perl, the default.

std specifies the regular expression standard: 1 for POSIX Extended Regular, 2 for

POSIX Basic Regular, 3 for Emacs, 4 for AWK, 5 for grep, 6 for egrep, or any

other number for Perl, the default.

If nlalt is specified and is 0, the newline character, char(10), is not treated like the

alternation operator |; otherwise, newline has the same effect as |.

If s1 contains no substring that matches re, the unaltered s1 is returned. s1, s2,

and re may not contain binary 0 (\0).

Domain s1: ASCII strings

Domain re: regular expression

Domain s2: ASCII strings

Domain noc: integers

Domain fmt: integers

Domain std: integers

Domain nlalt: integers

Range: ASCII strings


regexreplaceall(s1,re,s2[,noc[,fmt[,std[,nlalt]]]])

Description: replaces all substrings within ASCII string s1 that match re with ASCII string s2

and returns the resulting string

If noc is specified and is not 0, a case-insensitive match is performed; otherwise,

a case-sensitive match is performed.

fmt specifies the format string syntax supported in s2: 1 for literal, where s2 is

treated as a string literal (no special character substitution), 2 for sed, or any other

number for Perl, the default.

std specifies the regular expression standard: 1 for POSIX Extended Regular, 2 for

POSIX Basic Regular, 3 for Emacs, 4 for AWK, 5 for grep, 6 for egrep, or any

other number for Perl, the default.

If nlalt is specified and is 0, the newline character, char(10), is not treated like the

alternation operator |; otherwise, newline has the same effect as |.

If s1 contains no substring that matches re, the unaltered s1 is returned. s1, s2,

and re may not contain binary 0 (\0).

Domain s1: ASCII strings

Domain re: regular expression

Domain s2: ASCII strings

Domain noc: integers

Domain fmt: integers

Domain std: integers

Domain nlalt: integers

Range: ASCII strings

regexs(n)

Description: subexpression n from a previous regexm() or regexmatch() match, where

0 ≤ n < 10

Subexpression 0 is reserved for the entire string that satisfied the regular expression.

The returned subexpression may be at most 1,100,000 characters (bytes) long.

For more options to return matching substrings, see regexcapture() and

regexcapturenamed().

Domain n: 0 to 9

Range: ASCII strings
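The interplay of regexm() and regexs() is easiest to see with a pattern that has two parenthesized groups. The sketch below is illustrative only: the input string and the pattern are made up, and it assumes that regexs() is called after a successful regexm() in the same session.

```stata
* Split an identifier of the form digits-digits into its two parts.
local s "order 2024-0153"
if regexm("`s'", "([0-9]+)-([0-9]+)") {
    display regexs(0)    // entire matched substring
    display regexs(1)    // first parenthesized group
    display regexs(2)    // second parenthesized group
}
```

Here subexpression 0 is the whole match, "2024-0153", and subexpressions 1 and 2 are the pieces captured by the two pairs of parentheses.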

ustrregexm(s,re[,noc])

Description: performs a match of a regular expression and evaluates to 1 if regular expression

re is satisfied by the Unicode string s; otherwise, 0

If noc is specified and not 0, a case-insensitive match is performed. The function

may return a negative integer if an error occurs.

ustrregexm("12345", "([0-9]){5}") = 1

ustrregexm("de TRÈS près", "rès") = 1

ustrregexm("de TRÈS près", "Rès") = 0

ustrregexm("de TRÈS près", "Rès", 1) = 1

Domain s: Unicode strings

Domain re: Unicode regular expressions

Domain noc: integers

Range: integers


ustrregexrf(s1,re,s2[,noc])

Description: replaces the first substring within the Unicode string s1 that matches re with s2

and returns the resulting string

If noc is specified and not 0, a case-insensitive match is performed. The function

may return an empty string if an error occurs.

ustrregexrf("très près", "rès", "X") = "tX près"

ustrregexrf("TRÈS près", "Rès", "X") = "TRÈS près"

ustrregexrf("TRÈS près", "Rès", "X", 1) = "TX près"

Domain s1: Unicode strings

Domain re: Unicode regular expressions

Domain s2: Unicode strings

Domain noc: integers

Range: Unicode strings

ustrregexra(s1,re,s2[,noc])

Description: replaces all substrings within the Unicode string s1 that match re with s2 and

returns the resulting string

If noc is specified and not 0, a case-insensitive match is performed. The function

may return an empty string if an error occurs.

ustrregexra("très près", "rès", "X") = "tX pX"

ustrregexra("TRÈS près", "Rès", "X") = "TRÈS près"

ustrregexra("TRÈS près", "Rès", "X", 1) = "TX pX"

Domain s1: Unicode strings

Domain re: Unicode regular expressions

Domain s2: Unicode strings

Domain noc: integers

Range: Unicode strings
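The difference between replacing the first match and replacing all matches shows up as soon as a pattern occurs more than once. The following is a small sketch with made-up input strings; it uses only ustrregexrf() and ustrregexra() as described above.

```stata
* Strip runs of digits: first run only, then every run.
display ustrregexrf("a1b22c", "[0-9]+", "")
display ustrregexra("a1b22c", "[0-9]+", "")

* A nonzero fourth argument makes the match case insensitive.
display ustrregexra("Stata STATA stata", "stata", "X", 1)
```

Because both functions may return an empty string when an error occurs, an empty result is ambiguous whenever the replacement could legitimately remove the entire input.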

ustrregexs(n)

Description: subexpression n from a previous ustrregexm() match

Subexpression 0 is reserved for the entire string that satisfied the regular expression.

The function may return an empty string if n is larger than the maximum count

of subexpressions from the previous match or if an error occurs.

Domain n: integers ≥ 0

Range: strings


soundex(s)

Description: the soundex code for a string, s

The soundex code consists of a letter followed by three numbers: the letter is the

first ASCII letter of the name and the numbers encode the remaining consonants.

Similar sounding consonants are encoded by the same number. Unicode characters

beyond the plain ASCII range are ignored.

soundex("Ashcraft") = "A226"

soundex("Robert") = "R163"

soundex("Rupert") = "R163"

Domain s: strings

Range: strings
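Soundex codes are typically used to flag names that sound alike despite different spellings. The lines below are a minimal sketch that reuses the example names from this entry and the soundex_nara() example from the contents.

```stata
* Names with similar consonant patterns share a code.
display soundex("Robert")
display soundex("Rupert")
assert soundex("Robert") == soundex("Rupert")

* The two variants can disagree on the same name.
display soundex("Ashcraft") + " vs. " + soundex_nara("Ashcraft")
```

In a dataset, the same comparison could be made by generating a soundex variable for each name and comparing the two, though equal codes only suggest, and do not establish, that two names refer to the same person.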

soundex_nara(s)

Description: the U.S. Census soundex code for a string, s

The soundex code consists of a letter followed by three numbers: the letter is the

first ASCII letter of the name and the numbers encode the remaining consonants.

Similar sounding consonants are encoded by the same number. Unicode characters

beyond the plain ASCII range are ignored.

soundex_nara("Ashcraft") = "A261"

Domain s: strings

Range: strings

strcat(s1,s2)

Description: there is no strcat() function; instead the addition operator is used to concatenate

strings

"hello " + "world" = "hello world"

"a" + "b" = "ab"

"Café " + "de Flore" = "Café de Flore"

Domain s1: strings

Domain s2: strings

Range: strings

strdup(s1,n)

Description: there is no strdup() function; instead the multiplication operator is used to create

multiple copies of strings

"hello" * 3 = "hellohellohello"

3 * "hello" = "hellohellohello"

0 * "hello" = ""

"hello" * 1 = "hello"

Domain s1: strings

Domain n: nonnegative integers 0, 1, 2, ...

Range: strings
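The addition and multiplication operators combine naturally, for example, to underline a heading with a rule of matching width. This is an illustrative sketch; the local macro name title and its contents are arbitrary.

```stata
* Underline a heading with dashes of the same byte length.
local title "Summary"
display "`title'"
display "-" * strlen("`title'")

* Concatenate first, then duplicate the result.
display ("ab" + "c") * 2
```

For headings containing characters beyond plain ASCII, udstrlen() would be the more appropriate width, because strlen() counts bytes rather than display columns.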


string(n)

Description: a synonym for strofreal(n)

string(n,s)

Description: a synonym for strofreal(n,s)

stritrim(s)

Description: s with multiple, consecutive internal blanks (ASCII space character char(32))

collapsed to one blank

stritrim("hello     there") = "hello there"

Domain s: strings

Range: strings with no multiple, consecutive internal blanks

strlen(s)

Description: the number of characters in ASCII s or length in bytes

strlen() is intended for use only with plain ASCII characters and for use by

programmers who want to obtain the byte-length of a string. Note that any Unicode

character beyond ASCII range (code point greater than 127) takes more than 1 byte

in the UTF-8 encoding; for example, é takes 2 bytes.

For the number of characters in a Unicode string, see ustrlen().

strlen("ab") = 2

strlen("é") = 2

Domain s: strings

Range: integers ≥ 0

ustrlen(s)

Description: the number of characters in the Unicode string s

An invalid UTF-8 sequence is counted as one Unicode character. An invalid UTF-8

sequence may contain one byte or multiple bytes. Note that any Unicode character

beyond the plain ASCII range (code point greater than 127) takes more than 1 byte

in the UTF-8 encoding; for example, é takes 2 bytes.

ustrlen("médiane") = 7

strlen("médiane") = 8

Domain s: Unicode strings

Range: integers ≥ 0
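The three length functions answer different questions about the same string. The sketch below builds the word with uchar(233), the precomposed é, so that the result does not depend on how the do-file itself is encoded; the local macro name w is arbitrary.

```stata
* "médiane" with a precomposed e-acute (U+00E9).
local w = "m" + uchar(233) + "diane"

display strlen("`w'")      // bytes
display ustrlen("`w'")     // Unicode characters
display udstrlen("`w'")    // display columns
```

Use strlen() when sizing storage, ustrlen() when counting characters, and udstrlen() when aligning output in the Results window.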


udstrlen(s)

Description: the number of display columns needed to display the Unicode string s in the Stata

Results window

A Unicode character in the CJK (Chinese, Japanese, and Korean) encoding usually

requires two display columns; a Latin character usually requires one column. Any

invalid UTF-8 sequence requires one column.

Domain s: Unicode strings

Range: integers ≥ 0

strlower(s)

Description: lowercase ASCII characters in string s

Unicode characters beyond the plain ASCII range are ignored.

strlower("THIS") = "this"

strlower("CAFÉ") = "cafÉ"

Domain s: strings

Range: strings with lowercased characters

ustrlower(s[,loc])

Description: lowercase all characters of Unicode string s under the given locale loc

If loc is not specified, the default locale is used. The same s but different loc

may produce different results; for example, the lowercase letter of “I” is “i” in

English but a dotless “ı” in Turkish. The same Unicode character can be mapped

to different Unicode characters based on its surrounding characters; for example,

Greek capital letter sigma Σ has two lowercases: ς, if it is the final character of a

word, or σ. The result can be longer or shorter than the input Unicode string in

bytes.

ustrlower("MÉDIANE","fr") = "médiane"

ustrlower("ISTANBUL","tr") = "ıstanbul"

Domain s: Unicode strings

Domain loc: locale name

Range: Unicode strings

strltrim(s)

Description: s without leading blanks (ASCII space character char(32))

strltrim(" this") = "this"

Domain s: strings

Range: strings without leading blanks


ustrltrim(s)

Description: removes the leading Unicode whitespace characters and blanks from the Unicode

string s

Note that, in addition to char(32), ASCII characters char(9), char(10),

char(11), char(12), and char(13) are whitespace characters in the Unicode

standard.

ustrltrim(" this") = "this"

ustrltrim(char(9)+"this") = "this"

ustrltrim(ustrunescape("\u1680")+" this") = "this"

Domain s: Unicode strings

Range: Unicode strings

strmatch(s1,s2)

Description: 1 if s1 matches the pattern s2; otherwise, 0

strmatch("17.4","1??4") returns 1. In s2, "?" means that one character goes

here, and "*" means that zero or more bytes go here. Note that a Unicode

character may contain multiple bytes; thus, using "*" with Unicode characters

can infrequently result in matches that do not occur at a character boundary.

Also see regexm(), regexr(), and regexs().

strmatch("café", "caf?") = 1

Domain s1: strings

Domain s2: strings

Range: integers 0 or 1

strofreal(n)

Description: n converted to a string

Also see real().

strofreal(4)+"F" = "4F"

strofreal(1234567) = "1234567"

strofreal(12345678) = "1.23e+07"

strofreal(.) = "."

Domain n: −8e+307 to 8e+307 or missing

Range: strings


strofreal(n,s)

Description: n converted to a string using the specified display format

Also see real().

strofreal(4,"%9.2f") = "4.00"

strofreal(123456789,"%11.0g") = "123456789"

strofreal(123456789,"%13.0gc") = "123,456,789"

strofreal(0,"%td") = "01jan1960"

strofreal(225,"%tq") = "2016q2"

strofreal(225,"not a format") = ""

Domain n: −8e+307 to 8e+307 or missing

Domain s: strings containing %fmt numeric display format

Range: strings
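The choice of display format matters when a number must survive a round trip through a string. The following sketch, which assumes only strofreal() and real() as documented here, contrasts the default format with an explicit one; the format %12.0g is an arbitrary choice wide enough for the example value.

```stata
* The default format switches to scientific notation and drops digits.
display strofreal(12345678)
assert real(strofreal(12345678)) != 12345678

* An explicit, wider format preserves the value.
display strofreal(12345678, "%12.0g")
assert real(strofreal(12345678, "%12.0g")) == 12345678
```

When exact recovery is required, it is safer to pass a format than to rely on the default.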

strpos(s1,s2)

Description: the position in s1 at which s2 is first found, 0 if s2 does not occur, and 1 if s2

is empty

strpos() is intended for use only with plain ASCII characters and for use by

programmers who want to obtain the byte-position of s2. Note that any Unicode

character beyond ASCII range (code point greater than 127) takes more than 1 byte

in the UTF-8 encoding; for example, é takes 2 bytes.

To find the character position of s2 in a Unicode string, see ustrpos().

strpos("this","is") = 3

strpos("this","it") = 0

strpos("this","") = 1

Domain s1: strings (to be searched)

Domain s2: strings (to search for)

Range: integers ≥ 0

ustrpos(s1,s2[,n])

Description: the position in s1 at which s2 is first found; otherwise, 0

If n is specified and is greater than 0, the search starts at the nth Unicode character

of s1. An invalid UTF-8 sequence in either s1 or s2 is replaced with a Unicode

replacement character \ufffd before the search is performed.

ustrpos("médiane", "édi") = 2

ustrpos("médiane", "édi", 3) = 0

ustrpos("médiane", "éci") = 0

Domain s1: Unicode strings (to be searched)

Domain s2: Unicode strings (to search for)

Domain n: integers

Range: integers
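Byte positions and character positions diverge as soon as a multibyte character precedes the match. The sketch below illustrates this with the word used in the examples above, built with uchar(233) so that the é is a single precomposed code point; the local macro name w is arbitrary.

```stata
* "médiane": the é occupies bytes 2 and 3 but is character 2.
local w = "m" + uchar(233) + "diane"

display strpos("`w'", "diane")     // byte position of "diane"
display ustrpos("`w'", "diane")    // character position of "diane"
```

A position returned by strpos() should therefore be passed to byte-based functions such as substr(), and a position returned by ustrpos() to character-based functions such as usubstr().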


strproper(s)

Description: a string with the first ASCII letter and any other letters immediately following

characters that are not letters capitalized; all other ASCII letters converted to

lowercase

strproper() implements a form of titlecasing and is intended for use only with

plain ASCII strings. Unicode characters beyond ASCII are treated as characters that

are not letters. To titlecase strings with Unicode characters beyond the plain ASCII

range or to implement language-sensitive rules for titlecasing, see ustrtitle().

strproper("mR. joHn a. sMitH") = "Mr. John A. Smith"

strproper("jack o'reilly") = "Jack O'Reilly"

strproper("2-cent's worth") = "2-Cent'S Worth"

strproper("vous êtes") = "Vous êTes"

Domain s: strings

Range: strings

ustrtitle(s[,loc])

Description: a string with the first characters of Unicode words titlecased and other characters

lowercased

If loc is not specified, the default locale is used. Note that a Unicode word is

different from a Stata word produced by function word(). The Stata word is a

space-separated token. A Unicode word is a language unit based on either a set of

word-boundary rules or dictionaries for some languages (Chinese, Japanese, and

Thai). The titlecase is also locale dependent and context sensitive; for example,

lowercase “ij” is considered a digraph in Dutch. Its titlecase is “IJ”.

ustrtitle("vous êtes", "fr") = "Vous Êtes"

ustrtitle("mR. joHn a. sMitH") = "Mr. John A. Smith"

ustrtitle("ijmuiden", "en") = "Ijmuiden"

ustrtitle("ijmuiden", "nl") = "IJmuiden"

Domain s: Unicode strings

Domain loc: Unicode strings

Range: Unicode strings

strreverse(s)

Description: the reverse of ASCII string s

strreverse() is intended for use only with plain ASCII characters. For Unicode

characters beyond ASCII range (code point greater than 127), the encoded bytes

are reversed.

To reverse the characters of Unicode string, see ustrreverse().

strreverse("hello") = "olleh"

Domain s: ASCII strings

Range: ASCII reversed strings


ustrreverse(s)

Description: the reverse of Unicode string s

The function does not take Unicode character equivalence into consideration.

Hence, a Unicode character in a decomposed form will not be reversed as one

unit. An invalid UTF-8 sequence is replaced with a Unicode replacement character

\ufffd.

ustrreverse("médiane") = "enaidém"

Domain s: Unicode strings

Range: reversed Unicode strings

strrpos(s1,s2)

Description: the position in s1 at which s2 is last found, 0 if s2 does not occur, and 1 if s2

is empty

strrpos() is intended for use only with plain ASCII characters and for use

by programmers who want to obtain the last byte-position of s2. Note that any

Unicode character beyond ASCII range (code point greater than 127) takes more

than 1 byte in the UTF-8 encoding; for example, é takes 2 bytes.

To find the last character position of s2 in a Unicode string, see ustrrpos().

strrpos("this","is") = 3

strrpos("this is","is") = 6

strrpos("this is","it") = 0

strrpos("this is","") = 1

Domain s1: strings (to be searched)

Domain s2: strings (to search for)

Range: integers ≥ 0

ustrrpos(s





)

Description: the position in s

at which s

is last found; otherwise, 0

If n is speciﬁed and is greater than 0, only the part between the ﬁrst Unicode

character and the nth Unicode character of s

is searched. An invalid UTF-8

sequence in either s

or s

is replaced with a Unicode replacement character

\ufffd before the search is performed.

ustrrpos("enchant´e", "n") = 6

ustrrpos("enchant´e", "n", 5) = 2

ustrrpos("enchant´e", "n", 6) = 6

ustrrpos("enchant´e", "ne") = 0

Domain s1: Unicode strings (to be searched)

Domain s2: Unicode strings (to search for)

Domain n: integers

Range: integers
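
Byte positions and character positions diverge as soon as the searched string contains a non-ASCII character. The Python sketch below models that distinction; last_byte_pos and last_char_pos are hypothetical names, and both assume a nonempty search string (the empty-string case is left out).

```python
def last_byte_pos(s1: str, s2: str) -> int:
    """1-based byte position of the last occurrence of s2 in s1; 0 if absent."""
    if not s2:
        raise ValueError("this sketch assumes a nonempty search string")
    return s1.encode("utf-8").rfind(s2.encode("utf-8")) + 1


def last_char_pos(s1: str, s2: str, n: int = 0) -> int:
    """1-based character position of the last occurrence of s2 in s1; 0 if absent.

    If n > 0, only the first n characters of s1 are searched.
    """
    if not s2:
        raise ValueError("this sketch assumes a nonempty search string")
    haystack = s1[:n] if n > 0 else s1
    return haystack.rfind(s2) + 1


print(last_byte_pos("this is", "is"))     # ASCII: same as the character position
print(last_char_pos("enchanté", "n", 5))  # only "encha" is searched
print(last_char_pos("très près", "s"), last_byte_pos("très près", "s"))
```

In the last line the two results differ because each è occupies 2 bytes in UTF-8.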

strrtrim(s)

Description: s without trailing blanks (ASCII space character char(32))

strrtrim("this ") = "this"

Domain s: strings

Range: strings without trailing blanks

ustrrtrim(s)

Description: removes trailing Unicode whitespace characters and blanks from the Unicode string s

Note that, in addition to char(32), ASCII characters char(9), char(10), char(11), char(12), and char(13) are considered whitespace characters in the Unicode standard.

ustrrtrim("this ") = "this"

ustrrtrim("this"+char(10)) = "this"

ustrrtrim("this "+ustrunescape("\u2000")) = "this"

Domain s: Unicode strings

Range: Unicode strings

strtoname(s[,p])

Description: s translated into a Stata 13 compatible name

strtoname() results in a name that is truncated to 32 bytes. Each character in s that is not allowed in a Stata name is converted to an underscore character, _. If the first character in s is a numeric character and p is not 0, then the result is prefixed with an underscore. Stata 14 names may be 32 characters; see [U] 11.3 Naming conventions.

strtoname("name") = "name"

strtoname("a name") = "a_name"

strtoname("5",1) = "_5"

strtoname("5:30",1) = "_5_30"

strtoname("5",0) = "5"

strtoname("5:30",0) = "5_30"

Domain s: strings

Domain p: integers 0 or 1

Range: strings
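
The translation rule can be approximated in a few lines of Python. This is a sketch under stated assumptions, not Stata's implementation: to_name is a hypothetical helper, only ASCII letters, digits, and underscores are treated as allowed, p defaults to 1, and truncation is applied after the underscore prefix.

```python
import re


def to_name(s: str, p: int = 1) -> str:
    """Approximate the name translation described above for ASCII input."""
    # Assumption: every character other than A-Z, a-z, 0-9, and _ is disallowed.
    name = re.sub(r"[^A-Za-z0-9_]", "_", s)
    # Prefix an underscore when the name starts with a digit and p is not 0.
    if p != 0 and name[:1].isdigit():
        name = "_" + name
    # Assumption: the 32-byte limit is applied last.
    return name[:32]


for arg in [("name",), ("a name",), ("5:30", 1), ("5:30", 0)]:
    print(arg, "->", to_name(*arg))
```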

ustrtoname(s[,p])

Description: string s translated into a Stata name

ustrtoname() results in a name that is truncated to 32 characters. Each character in s that is not allowed in a Stata name is converted to an underscore character, _. If the first character in s is a numeric character and p is not 0, then the result is prefixed with an underscore.

ustrtoname("name",1) = "name"

ustrtoname("the médiane") = "the_médiane"

ustrtoname("0médiane") = "_0médiane"

ustrtoname("0médiane", 1) = "_0médiane"

ustrtoname("0médiane", 0) = "0médiane"

Domain s: Unicode strings

Domain p: integers 0 or 1

Range: Unicode strings

strtrim(s)

Description: s without leading and trailing blanks (ASCII space character char(32)); equivalent

to strltrim(strrtrim(s))

strtrim(" this ") = "this"

Domain s: strings

Range: strings without leading or trailing blanks

ustrtrim(s)

Description: removes leading and trailing Unicode whitespace characters and blanks from the Unicode string s

Note that, in addition to char(32), ASCII characters char(9), char(10), char(11), char(12), and char(13) are considered whitespace characters in the Unicode standard.

ustrtrim(" this ") = "this"

ustrtrim(char(11)+" this "+char(13)) = "this"

ustrtrim(" this "+ustrunescape("\u2000")) = "this"

Domain s: Unicode strings

Range: Unicode strings
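
The contrast between trimming only char(32) and trimming all Unicode whitespace can be sketched in Python. The helper names are hypothetical, and Python's own whitespace set is used as a stand-in; it is an assumption that it agrees with Stata's for the characters shown.

```python
def trim_blanks(s: str) -> str:
    """Remove only leading and trailing ASCII spaces, as described for strtrim()."""
    return s.strip(" ")


def trim_whitespace(s: str) -> str:
    """Remove leading and trailing Unicode whitespace, as described for ustrtrim()."""
    return s.strip()


print(repr(trim_blanks(" this ")))
print(repr(trim_blanks("this\n")))              # the line feed survives
print(repr(trim_whitespace("\x0b this \r")))    # char(11) and char(13) are removed
print(repr(trim_whitespace(" this \u2000")))    # so is the Unicode space U+2000
```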

strupper(s)

Description: uppercase ASCII characters in string s

Unicode characters beyond the plain ASCII range are ignored.

strupper("this") = "THIS"

strupper("café") = "CAFé"

Domain s: strings

Range: strings with uppercased characters

ustrupper(s[,loc])

Description: uppercase all characters in string s under the given locale loc

If loc is not specified, the default locale is used. The same s but a different loc may produce different results; for example, the uppercase letter of “i” is “I” in English, but “İ” (“I” with a dot) in Turkish. The result can be longer or shorter than the input string in bytes; for example, the uppercase form of the German letter ß (code point \u00df) is two capital letters “SS”.

ustrupper("médiane","fr") = "MÉDIANE"

ustrupper("Rußland", "de") = "RUSSLAND"

ustrupper("istanbul", "tr") = "İSTANBUL"

Domain s: Unicode strings

Domain loc: locale name

Range: Unicode strings
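
ASCII-only and Unicode-aware uppercasing can be compared with a short Python sketch. The helper names are hypothetical. Python's str.upper() is not locale aware, so the Turkish dotted “İ” case is not reproduced here.

```python
import string

# Translation table that touches only the 26 ASCII lowercase letters.
_ASCII_UPPER = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)


def upper_ascii(s: str) -> str:
    """Uppercase ASCII letters only, as described for strupper()."""
    return s.translate(_ASCII_UPPER)


def upper_unicode(s: str) -> str:
    """Uppercase using Unicode case mappings (locale independent in Python)."""
    return s.upper()


print(upper_ascii("café"))       # é is left untouched
print(upper_unicode("médiane"))
print(upper_unicode("Rußland"))  # ß expands to two letters
```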

subinstr(s1,s2,s3,n)

Description: s1, where the first n occurrences in s1 of s2 have been replaced with s3

subinstr() is intended for use only with plain ASCII characters and for use by programmers who want to perform byte-based substitution. Note that any Unicode character beyond the ASCII range (code point greater than 127) takes more than 1 byte in the UTF-8 encoding; for example, é takes 2 bytes.

To perform character-based replacement in Unicode strings, see usubinstr().

If n is missing, all occurrences are replaced.

Also see regexm(), regexr(), and regexs().

subinstr("this is the day","is","X",1) = "thX is the day"

subinstr("this is the hour","is","X",2) = "thX X the hour"

subinstr("this is this","is","X",.) = "thX X thX"

Domain s1: strings (to be substituted into)

Domain s2: strings (to be substituted from)

Domain s3: strings (to be substituted with)

Domain n: integers ≥ 0 or missing

Range: strings
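
The role of n, including the missing-value case, can be modeled in Python. In this sketch replace_first_n is a hypothetical name and None stands in for Stata's missing value; an empty search string is not handled.

```python
from typing import Optional


def replace_first_n(s1: str, s2: str, s3: str, n: Optional[int]) -> str:
    """Replace the first n occurrences of s2 in s1 with s3; None means all."""
    if n is None:
        return s1.replace(s2, s3)
    return s1.replace(s2, s3, n)


print(replace_first_n("this is the day", "is", "X", 1))
print(replace_first_n("this is the hour", "is", "X", 2))
print(replace_first_n("this is this", "is", "X", None))
```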

usubinstr(s1,s2,s3,n)

Description: replaces the first n occurrences of the Unicode string s2 with the Unicode string s3 in s1

If n is missing, all occurrences are replaced. An invalid UTF-8 sequence in s1, s2, or s3 is replaced with a Unicode replacement character \ufffd before replacement is performed.

usubinstr("de très près","ès","es",1) = "de tres près"

usubinstr("de très près","ès","X",2) = "de trX prX"

Domain s1: Unicode strings (to be substituted into)

Domain s2: Unicode strings (to be substituted from)

Domain s3: Unicode strings (to be substituted with)

Domain n: integers ≥ 0 or missing

Range: Unicode strings

subinword(s1,s2,s3,n)

Description: s1, where the first n occurrences in s1 of s2 as a word have been replaced with s3

A word is defined as a space-separated token. A token at the beginning or end of s1 is considered space-separated. This is different from a Unicode word, which is a language unit based on either a set of word-boundary rules or dictionaries for several languages (Chinese, Japanese, and Thai). If n is missing, all occurrences are replaced.

Also see regexm(), regexr(), and regexs().

subinword("this is the day","is","X",1) = "this X the day"

subinword("this is the hour","is","X",.) = "this X the hour"

subinword("this is this","th","X",.) = "this is this"

Domain s1: strings (to be substituted for)

Domain s2: strings (to be substituted from)

Domain s3: strings (to be substituted with)

Domain n: integers ≥ 0 or missing

Range: strings
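
Whole-word replacement differs from plain substring replacement only in the boundary test. The Python sketch below models the space-separated-token rule with regular-expression lookarounds; replace_word is a hypothetical name and None stands in for a missing n.

```python
import re
from typing import Optional


def replace_word(s1: str, s2: str, s3: str, n: Optional[int]) -> str:
    """Replace the first n occurrences of s2 in s1 where s2 is a whole token."""
    if n == 0:
        return s1  # re.sub would treat count=0 as "replace all"
    # A token is bounded by a space or by the start or end of the string.
    pattern = r"(?<![^ ])" + re.escape(s2) + r"(?![^ ])"
    count = 0 if n is None else n
    return re.sub(pattern, lambda match: s3, s1, count=count)


print(replace_word("this is the day", "is", "X", 1))
print(replace_word("this is this", "th", "X", None))  # "th" is never a whole token
```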

substr(s,n1,n2)

Description: the substring of s, starting at n1, for a length of n2

substr() is intended for use only with plain ASCII characters and for use by programmers who want to extract a subset of bytes from a string. For those with plain ASCII text, n1 is the starting character, and n2 is the length of the string in characters. For programmers, substr() is technically a byte-based function. For plain ASCII characters, the two are equivalent but you can operate on byte values beyond that range. Note that any Unicode character beyond the ASCII range (code point greater than 127) takes more than 1 byte in the UTF-8 encoding; for example, é takes 2 bytes.

To obtain substrings of Unicode strings, see usubstr().

If n1 < 0, n1 is interpreted as the distance from the end of the string; if n2 = . (missing), the remaining portion of the string is returned.

substr("abcdef",2,3) = "bcd"

substr("abcdef",-3,2) = "de"

substr("abcdef",2,.) = "bcdef"

substr("abcdef",-3,.) = "def"

substr("abcdef",2,0) = ""

substr("abcdef",15,2) = ""

Domain s: strings

Domain n1: integers ≥ 1 or ≤ −1

Domain n2: integers ≥ 1

Range: strings

usubstr(s,n1,n2)

Description: the Unicode substring of s, starting at n1, for a length of n2

If n1 < 0, n1 is interpreted as the distance from the last character of s; if n2 = . (missing), the remaining portion of the Unicode string is returned.

usubstr("médiane",2,3) = "édi"

usubstr("médiane",-3,2) = "an"

usubstr("médiane",2,.) = "édiane"

Domain s: Unicode strings

Domain n1: integers ≥ 1 or ≤ −1

Domain n2: integers ≥ 1

Range: Unicode strings
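
The indexing rules shared by the byte-based and character-based substring functions can be sketched in Python. All names are hypothetical, None stands in for a missing n2, and returning an empty result when n1 is 0 or points before the start of the string is an assumption not stated above.

```python
from typing import Optional


def _sub(seq, n1: int, n2: Optional[int]):
    """Slice seq with a 1-based start n1 (negative counts from the end) and length n2."""
    length = len(seq)
    start = n1 - 1 if n1 > 0 else length + n1
    if n1 == 0 or not 0 <= start < length:
        return seq[:0]  # empty result of the same type
    if n2 is None:
        return seq[start:]
    return seq[start:start + max(n2, 0)]


def substr_bytes(s: str, n1: int, n2: Optional[int]) -> bytes:
    """Byte-based extraction, as described for substr()."""
    return _sub(s.encode("utf-8"), n1, n2)


def substr_chars(s: str, n1: int, n2: Optional[int]) -> str:
    """Character-based extraction, as described for usubstr()."""
    return _sub(s, n1, n2)


print(substr_bytes("abcdef", -3, 2))
print(substr_chars("médiane", 2, 3))
print(substr_bytes("médiane", 2, 3))  # 3 bytes cover only é (2 bytes) and d
```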

udsubstr(s,n1,n2)

Description: the Unicode substring of s, starting at character n1, for n2 display columns

If n2 = . (missing), the remaining portion of the Unicode string is returned. If n2 display columns from n1 is in the middle of a Unicode character, the substring stops at the previous Unicode character.

udsubstr("médiane",2,3) = "édi"

Domain s: Unicode strings

Domain n1: integers ≥ 1

Domain n2: integers ≥ 1

Range: Unicode strings
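
The rule that a substring stops before a character that would overrun the column budget can be sketched in Python. This is an approximation: substr_columns and columns are hypothetical names, and the East Asian Width property from Python's unicodedata module is assumed as the measure of display width (2 columns for wide and fullwidth characters, 1 otherwise), which may not match Stata's own width calculation.

```python
import unicodedata
from typing import Optional


def columns(ch: str) -> int:
    """Assumed display width: 2 for East Asian wide/fullwidth characters, else 1."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def substr_columns(s: str, n1: int, n2: Optional[int]) -> str:
    """Take characters from position n1 (1-based) until n2 columns would be exceeded."""
    out = []
    used = 0
    for ch in s[n1 - 1:]:
        width = columns(ch)
        if n2 is not None and used + width > n2:
            break  # the next character does not fit; stop at the previous one
        out.append(ch)
        used += width
    return "".join(out)


print(substr_columns("médiane", 2, 3))
print(repr(substr_columns("中值", 1, 1)))  # one column cannot hold a wide character
print(substr_columns("中值", 1, 3))        # the third column would split 值
```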

tobytes(s[,n])

Description: escaped decimal or hex digit strings of up to 200 bytes of s

The escaped decimal digit string is in the form of \dDDD. The escaped hex digit string is in the form of \xhh. If n is not specified or is 0, the decimal form is produced. Otherwise, the hex form is produced.

tobytes("abc") = "\d097\d098\d099"

tobytes("abc", 1) = "\x61\x62\x63"

tobytes("café") = "\d099\d097\d102\d195\d169"

Domain s: Unicode strings

Domain n: integers

Range: strings
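
The two escape formats are easy to reproduce in Python, which makes the byte layout of a UTF-8 string visible. In this sketch to_bytes_escaped is a hypothetical name, and the use of lowercase hex letters is an assumption (the examples above contain only digits).

```python
def to_bytes_escaped(s: str, n: int = 0) -> str:
    """Escape up to 200 UTF-8 bytes of s as \\dDDD (n == 0) or \\xhh (otherwise)."""
    raw = s.encode("utf-8")[:200]
    if n == 0:
        return "".join(f"\\d{byte:03d}" for byte in raw)
    return "".join(f"\\x{byte:02x}" for byte in raw)


print(to_bytes_escaped("abc"))
print(to_bytes_escaped("abc", 1))
print(to_bytes_escaped("café"))  # é contributes two bytes, 195 and 169
```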

uisdigit(s)

Description: 1 if the ﬁrst Unicode character in s is a Unicode decimal digit; otherwise, 0

A Unicode decimal digit is a Unicode character with the character property Nd

according to the Unicode standard. The function returns -1 if the string starts with

an invalid UTF-8 sequence.

Domain s: Unicode strings

Range: integers

uisletter(s)

Description: 1 if the ﬁrst Unicode character in s is a Unicode letter; otherwise, 0

A Unicode letter is a Unicode character with the character property L according to

the Unicode standard. The function returns -1 if the string starts with an invalid

UTF-8 sequence.

Domain s: Unicode strings

Range: integers
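
The Nd and L character properties used by uisdigit() and uisletter() are available in Python's unicodedata module, so the classification can be sketched as follows. The helper names are hypothetical, input is taken as raw bytes so that an invalid leading sequence can be detected, and returning 0 for an empty string is an assumption.

```python
import unicodedata
from typing import Optional


def first_char(raw: bytes) -> Optional[str]:
    """Return the first UTF-8 character of raw, "" if raw is empty, None if invalid."""
    for size in range(1, 5):  # a UTF-8 character occupies 1 to 4 bytes
        try:
            return raw[:size].decode("utf-8")
        except UnicodeDecodeError:
            continue
    return None


def is_digit(raw: bytes) -> int:
    """1 if the first character has property Nd, 0 if not, -1 if it is invalid."""
    ch = first_char(raw)
    if ch is None:
        return -1
    return int(ch != "" and unicodedata.category(ch) == "Nd")


def is_letter(raw: bytes) -> int:
    """1 if the first character has a property in class L, 0 if not, -1 if invalid."""
    ch = first_char(raw)
    if ch is None:
        return -1
    return int(ch != "" and unicodedata.category(ch).startswith("L"))


print(is_digit("٣".encode("utf-8")))   # Arabic-Indic digit three
print(is_letter("é".encode("utf-8")))
print(is_letter(b"\xc8abc"))           # starts with an invalid sequence
```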

ustrcompare(s1,s2[,loc])

Description: compares two Unicode strings

The function returns -1, 1, or 0 if s1 is less than, greater than, or equal to s2. The function may return a negative number other than −1 if an error happens. The comparison is locale dependent. For example, z < ö in Swedish but ö < z in German.

If loc is not specified, the default locale is used. The comparison is diacritic and case sensitive. If you need different behavior, for example, case-insensitive comparison, you should use the extended comparison function ustrcompareex(). Unicode string comparison compares Unicode strings in a language-sensitive manner. On the other hand, the sort command compares strings in code-point (binary) order. For example, uppercase “Z” (code-point value 90) comes before lowercase “a” (code-point value 97) in code-point order but comes after “a” in any English dictionary.

ustrcompare("z", "ö", "sv") = -1

ustrcompare("z", "ö", "de") = 1

Domain s1: Unicode strings

Domain s2: Unicode strings

Domain loc: Unicode strings

Range: integers

ustrcompareex(s1,s2,loc,st,case,cslv,norm,num,alt,fr)

Description: compares two Unicode strings

The function returns -1, 1, or 0 if s1 is less than, greater than, or equal to s2. The function may return a negative number other than -1 if an error occurs. The comparison is locale dependent. For example, z < ö in Swedish but ö < z in German. If loc is not specified, the default locale is used.

st controls the strength of the comparison. Possible values are 1 (primary), 2 (secondary), 3 (tertiary), 4 (quaternary), or 5 (identical). -1 means to use the default value for the locale. Any other numbers are treated as tertiary. The primary difference represents base letter differences; for example, letter “a” and letter “b” have primary differences. The secondary difference represents diacritical differences on the same base letter; for example, letters “a” and “ä” have secondary differences. The tertiary difference represents case differences of the same base letter; for example, letters “a” and “A” have tertiary differences. Quaternary strength is useful to distinguish between Katakana and Hiragana for the JIS 4061 collation standard. Identical strength is essentially the code-point order of the string and, hence, is rarely useful.

ustrcompareex("café","cafe","fr", 1, -1, -1, -1, -1, -1, -1) = 0

ustrcompareex("café","cafe","fr", 2, -1, -1, -1, -1, -1, -1) = 1

ustrcompareex("Café","café","fr", 3, -1, -1, -1, -1, -1, -1) = 1

case controls the uppercase and lowercase letter order. Possible values are 0 (use

order speciﬁed in tertiary strength), 1 (uppercase ﬁrst), or 2 (lowercase ﬁrst). -1

means to use the default value for the locale. Any other values are treated as 0.

ustrcompareex("Café","café","fr", -1, 1, -1, -1, -1, -1, -1) = -1

ustrcompareex("Café","café","fr", -1, 2, -1, -1, -1, -1, -1) = 1

cslv controls whether an extra case level between the secondary level and the

tertiary level is generated. Possible values are 0 (off) or 1 (on). -1 means to use

the default value for the locale. Any other values are treated as 0. Combining this

setting to be “on” and the strength setting to be primary can achieve the effect

of ignoring the diacritical differences but preserving the case differences. If the

setting is “on”, the result is also affected by the case setting.

ustrcompareex("café","Cafe","fr", 1, -1, 1, -1, -1, -1, -1) = -1

ustrcompareex("café","Cafe","fr", 1, 1, 1, -1, -1, -1, -1) = 1

norm controls whether the normalization check and normalizations are performed.

Possible values are 0 (off) or 1 (on). -1 means to use the default value for the locale.

Any other values are treated as 0. Most languages do not require normalization

for comparison. Normalization is needed in languages that use multiple combining

characters such as Arabic, ancient Greek, or Hebrew.

num controls how contiguous digit substrings are sorted. Possible values are 0

(off) or 1 (on). -1 means to use the default value for the locale. Any other values

are treated as 0. If the setting is “on”, substrings consisting of digits are sorted

based on the numeric value. For example, “100” is after value “20” instead of

before it. Note that the digit substring is limited to 254 digits, and plus/minus

signs, decimals, or exponents are not supported.

ustrcompareex("100", "20","en", -1, -1, -1, -1, 0, -1, -1) = -1

ustrcompareex("100", "20","en", -1, -1, -1, -1, 1, -1, -1) = 1

alt controls how spaces and punctuation characters are handled. Possible values

are 0 (use primary strength) or 1 (alternative handling). Any other values are

treated as 0. If the setting is 1 (alternative handling), “onsite”, “on-site”, and “on

site” are considered equals.

ustrcompareex("onsite", "on-site","en", -1, -1, -1, -1, -1, 1, -1) = 0

ustrcompareex("onsite", "on site","en", -1, -1, -1, -1, -1, 1, -1) = 0

ustrcompareex("onsite", "on-site","en", -1, -1, -1, -1, -1, 0, -1) = 1

fr controls the direction of the secondary strength. Possible values are 0 (off)

or 1 (on). -1 means to use the default value for the locale. All other values are

treated as “off”. If the setting is “on”, the diacritical letters are sorted backward.

Note that the setting is “on” by default only for Canadian French (locale fr_CA).

ustrcompareex("coté", "côte","fr_CA",-1,-1,-1,-1,-1,-1,0) = -1

ustrcompareex("coté", "côte","fr_CA",-1,-1,-1,-1,-1,-1,1) = 1

ustrcompareex("coté", "côte","fr_CA",-1,-1,-1,-1,-1,-1,-1) = 1

ustrcompareex("coté", "côte","fr",-1,-1,-1,-1,-1,-1,-1) = 1

Domain s1: Unicode strings

Domain s2: Unicode strings

Domain loc: Unicode strings

Domain st: integers

Domain case: integers

Domain cslv: integers

Domain norm: integers

Domain num: integers

Domain alt: integers

Domain fr: integers

Range: integers

ustrfix(s[,rep])

Description: replaces each invalid UTF-8 sequence with a Unicode character

In the one-argument case, the Unicode replacement character \ufffd is used. In the two-argument case, the first Unicode character of rep is used. If rep starts with an invalid UTF-8 sequence, then Unicode replacement character \ufffd is used. Note that an invalid UTF-8 sequence can contain one byte or multiple bytes.

ustrfix(char(200)) = ustrunescape("\ufffd")

ustrfix("ab"+char(200)+"cdé", "") = "abcdé"

ustrfix("ab"+char(229)+char(174)+"cdé", "é") = "abécdé"

Domain s: Unicode strings

Domain rep: Unicode character

Range: Unicode strings

ustrfrom(s,enc,mode)

Description: converts the string s in encoding enc to a UTF-8 encoded Unicode string

mode controls how invalid byte sequences in s are handled. The possible values

are 1, which substitutes an invalid byte sequence with a Unicode replacement

character \ufffd; 2, which skips any invalid byte sequences; 3, which stops at

the ﬁrst invalid byte sequence and returns an empty string; or 4, which replaces

any byte in an invalid sequence with an escaped hex digit sequence %Xhh. Any

other values are treated as 1. A good use of value 4 is to check what invalid

bytes a Unicode string ust contains by examining the result of ustrfrom(ust,

"utf-8", 4).

Also see ustrto().

ustrfrom("caf"+char(233), "latin1", 1) = "café"

ustrfrom("caf"+char(233), "utf-8", 1) = "caf"+ustrunescape("\ufffd")

ustrfrom("caf"+char(233), "utf-8", 2) = "caf"

ustrfrom("caf"+char(233), "utf-8", 3) = ""

ustrfrom("caf"+char(233), "utf-8", 4) = "caf%XE9"

Domain s: strings in encoding enc

Domain enc: Unicode strings

Domain mode: integers

Range: Unicode strings
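
The four handling modes can be modeled with Python's codec machinery. The sketch below is illustrative only: decode_to_unicode is a hypothetical name, Python codec names are assumed to stand in for the encoding names, and the loop for mode 4 assumes a stateless encoding such as UTF-8 or a single-byte code page.

```python
def decode_to_unicode(raw: bytes, enc: str, mode: int) -> str:
    """Decode raw from encoding enc, handling invalid bytes according to mode."""
    if mode == 2:                       # skip invalid byte sequences
        return raw.decode(enc, errors="ignore")
    if mode == 3:                       # give up and return an empty string
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            return ""
    if mode == 4:                       # escape every invalid byte as %Xhh
        out = []
        pos = 0
        while pos < len(raw):
            try:
                out.append(raw[pos:].decode(enc))
                break
            except UnicodeDecodeError as exc:
                # exc.start and exc.end are relative to raw[pos:]
                out.append(raw[pos:pos + exc.start].decode(enc))
                out.extend(f"%X{byte:02X}" for byte in raw[pos + exc.start:pos + exc.end])
                pos += exc.end
        return "".join(out)
    # mode 1 and any other value: substitute the replacement character
    return raw.decode(enc, errors="replace")


raw = b"caf\xe9"  # "café" in Latin-1, invalid as UTF-8
for mode in (1, 2, 3, 4):
    print(mode, repr(decode_to_unicode(raw, "utf-8", mode)))
```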

ustrinvalidcnt(s)

Description: the number of invalid UTF-8 sequences in s

An invalid UTF-8 sequence may contain one byte or multiple bytes.

ustrinvalidcnt("médiane") = 0

ustrinvalidcnt("médiane"+char(229)) = 1

ustrinvalidcnt("médiane"+char(229)+char(174)) = 1

ustrinvalidcnt("médiane"+char(174)+char(158)) = 2

Domain s: Unicode strings

Range: integers
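
Counting and replacing invalid sequences both rest on the same step: splitting the bytes into valid runs and invalid sequences. The Python sketch below does this with the built-in UTF-8 decoder. The names are hypothetical, and it is an assumption that Python groups invalid bytes into sequences the same way Stata does in general; it does agree with the examples shown for ustrfix() and ustrinvalidcnt().

```python
from typing import List, Optional


def split_invalid(raw: bytes) -> List[Optional[str]]:
    """Decode raw as UTF-8; each invalid sequence is represented by None."""
    parts: List[Optional[str]] = []
    pos = 0
    while pos < len(raw):
        try:
            parts.append(raw[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to raw[pos:]
            parts.append(raw[pos:pos + exc.start].decode("utf-8"))
            parts.append(None)
            pos += exc.end
    return parts


def fix(raw: bytes, rep: str = "\ufffd") -> str:
    """Replace each invalid sequence with the first character of rep."""
    substitute = rep[:1]
    return "".join(substitute if part is None else part for part in split_invalid(raw))


def invalid_count(raw: bytes) -> int:
    """Number of invalid UTF-8 sequences in raw."""
    return sum(part is None for part in split_invalid(raw))


word = "médiane".encode("utf-8")
print(invalid_count(word + b"\xe5\xae"))  # a truncated 3-byte character: one sequence
print(invalid_count(word + b"\xae\x9e"))  # two stray continuation bytes: two sequences
print(fix(b"ab\xe5\xaecd\xc3\xa9", "é"))
```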

ustrleft(s,n)

Description: the ﬁrst n Unicode characters of the Unicode string s

An invalid UTF-8 sequence is replaced with a Unicode replacement character

\ufffd.

Domain s: Unicode strings

Domain n: integers

Range: Unicode strings

ustrnormalize(s,norm)

Description: normalizes Unicode string s to one of the five normalization forms specified by norm

The normalization forms are nfc, nfd, nfkc, nfkd, or nfkcc. The function returns an empty string for any other value of norm. Unicode normalization removes the Unicode string differences caused by Unicode character equivalence. nfc specifies Normalization Form C, which normalizes decomposed Unicode code points to a composited form. nfd specifies Normalization Form D, which normalizes composited Unicode code points to a decomposed form. nfc and nfd produce canonical equivalent forms. nfkc and nfkd are similar to nfc and nfd but produce compatibility equivalent forms. nfkcc specifies nfkc with casefolding. This normalization and casefolding implement the Unicode Character Database.

In the Unicode standard, both “i” (\u0069) followed by a diaeresis (\u0308) and the composite character \u00ef represent “i” with 2 dots as in “naïve”. Hence, the code-point sequence \u0069\u0308 and the code point \u00ef are considered Unicode equivalent. According to the Unicode standard, they should be treated as the same single character in Unicode string operations, such as in display, comparison, and selection. However, Stata does not support multiple code-point characters; each code point is considered a separate Unicode character. Hence, \u0069\u0308 is displayed as two characters in the Results window. ustrnormalize() can be used with "nfc" to normalize \u0069\u0308 to the canonical equivalent composited code point \u00ef.

ustrnormalize(ustrunescape("\u0069\u0308"), "nfc") = "ï"

The decomposed form nfd can be used to remove diacritical marks from base letters. First, normalize the Unicode string to canonical decomposed form, and then call ustrto() with mode 2 (skip) to skip all non-ASCII characters.

Also see ustrfrom().

ustrto(ustrnormalize("café", "nfd"), "ascii", 2) = "cafe"

Domain s: Unicode strings

Domain norm: Unicode strings

Range: Unicode strings
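
Composition, decomposition, and the diacritic-stripping recipe can be tried with Python's unicodedata module, which implements the same Unicode normalization forms. This is a sketch in Python rather than Stata; strip_diacritics is a hypothetical helper name.

```python
import unicodedata


def strip_diacritics(s: str) -> str:
    """Decompose s (NFD), then drop every character that is not plain ASCII."""
    decomposed = unicodedata.normalize("NFD", s)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


decomposed = "i\u0308"                               # "i" followed by a combining diaeresis
composed = unicodedata.normalize("NFC", decomposed)  # the single code point \u00ef
print(len(decomposed), len(composed))
print(strip_diacritics("café"))
```

After NFD, the accent is a separate combining code point, so skipping non-ASCII characters removes the accent and leaves the base letter.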

ustrright(s,n)

Description: the last n Unicode characters of the Unicode string s

An invalid UTF-8 sequence is replaced with a Unicode replacement character

\ufffd.

Domain s: Unicode strings

Domain n: integers

Range: Unicode strings

ustrsortkey(s[,loc])

Description: generates a null-terminated byte array that can be used by the sort command to produce the same order as ustrcompare()

The function may return an empty array if an error occurs. The result is locale dependent. If loc is not specified, the default locale is used. The result is also diacritic and case sensitive. If you need different behavior, for example, case-insensitive results, you should use the extended function ustrsortkeyex(). See [U] 12.4.2.5 Sorting strings containing Unicode characters for details and examples.

Domain s: Unicode strings

Domain loc: Unicode strings

Range: null-terminated byte array

ustrsortkeyex(s,loc,st,case,cslv,norm,num,alt,fr)

Description: generates a null-terminated byte array that can be used by the sort command to

produce the same order as ustrcompare()

The function may return an empty array if an error occurs. The result is locale

dependent. If loc is not speciﬁed, the default locale is used. See [U] 12.4.2.5 Sorting

strings containing Unicode characters for details and examples.

st controls the strength of the comparison. Possible values are 1 (primary), 2

(secondary), 3 (tertiary), 4 (quaternary), or 5 (identical). -1 means to use the

default value for the locale. Any other numbers are treated as tertiary. The primary

difference represents base letter differences; for example, letter “a” and letter “b”

have primary differences. The secondary difference represents diacritical differences

on the same base letter; for example, letters “a” and “ä” have secondary differences.

The tertiary difference represents case differences of the same base letters; for

example, letters “a” and “A” have tertiary differences. Quaternary strength is useful

to distinguish between Katakana and Hiragana for the JIS 4061 collation standard.

Identical strength is essentially the code-point order of the string and, hence, is

rarely useful.

case controls the uppercase and lowercase letter order. Possible values are 0 (use

order speciﬁed in tertiary strength), 1 (uppercase ﬁrst), or 2 (lowercase ﬁrst). -1

means to use the default value for the locale. Any other values are treated as 0.

cslv controls if an extra case level between the secondary level and the tertiary

level is generated. Possible values are 0 (off) or 1 (on). -1 means to use the

default value for the locale. Any other values are treated as 0. Combining this

setting to be “on” and the strength setting to be primary can achieve the effect

of ignoring the diacritical differences but preserving the case differences. If the

setting is “on”, the result is also affected by the case setting.

norm controls whether the normalization check and normalizations are performed.

Possible values are 0 (off) or 1 (on). -1 means to use the default value for the locale.

Any other values are treated as 0. Most languages do not require normalization

for comparison. Normalization is needed in languages that use multiple combining

characters such as Arabic, ancient Greek, or Hebrew.

num controls how contiguous digit substrings are sorted. Possible values are 0

(off) or 1 (on). -1 means to use the default value for the locale. Any other values

are treated as 0. If the setting is “on”, substrings consisting of digits are sorted

based on the numeric value. For example, “100” is after “20” instead of before

it. Note that the digit substring is limited to 254 digits, and plus/minus signs,

decimals, or exponents are not supported.

alt controls how spaces and punctuation characters are handled. Possible values

are 0 (use primary strength) or 1 (alternative handling). Any other values are

treated as 0. If the setting is 1 (alternative handling), “onsite”, “on-site”, and “on

site” are considered equals.

fr controls the direction of the secondary strength. Possible values are 0 (off)

or 1 (on). -1 means to use the default value for the locale. All other values are

treated as “off”. If the setting is “on”, the diacritical letters are sorted backward.

Note that the setting is “on” by default only for Canadian French (locale fr_CA).

Domain s: Unicode strings

Domain loc: Unicode strings

Domain st: integers

Domain case: integers

Domain cslv: integers

Domain norm: integers

Domain num: integers

Domain alt: integers

Domain fr: integers

Range: null-terminated byte array

ustrto(s,enc,mode)

Description: converts the Unicode string s in UTF-8 encoding to a string in encoding enc

See [D] unicode encoding for details on available encodings. Any invalid sequence in s is replaced with a Unicode replacement character \ufffd. mode controls how unsupported Unicode characters in the encoding enc are handled. The possible values are 1, which substitutes any unsupported characters with enc’s substitution string (the substitution character for both ascii and latin1 is char(26)); 2, which skips any unsupported characters; 3, which stops at the first unsupported character and returns an empty string; or 4, which replaces any unsupported character with an escaped hex digit sequence \uhhhh or \Uhhhhhhhh. The hex digit sequence contains either 4 or 8 hex digits, depending on whether the Unicode character’s code-point value is less than or greater than \uffff. Any other values are treated as 1.

ustrto("café", "ascii", 1) = "caf"+char(26)

ustrto("café", "ascii", 2) = "caf"

ustrto("café", "ascii", 3) = ""

ustrto("café", "ascii", 4) = "caf\u00E9"

ustrto() can be used to remove diacritical marks from base letters. First, normalize the Unicode string to NFD form using ustrnormalize(), and then call ustrto() with value 2 to skip all non-ASCII characters.

Also see ustrfrom().

ustrto(ustrnormalize("café", "nfd"), "ascii", 2) = "cafe"

Domain s: Unicode strings

Domain enc: Unicode strings

Domain mode: integers

Range: strings in encoding enc
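
The four modes for unsupported characters can be modeled by encoding one character at a time. The Python sketch below is illustrative only: encode_from_unicode is a hypothetical name, char(26) is used as the substitution byte for every target encoding (stated above only for ascii and latin1), and the character-by-character approach assumes a stateless, ASCII-compatible target encoding.

```python
def encode_from_unicode(s: str, enc: str, mode: int) -> bytes:
    """Encode s into enc, handling unsupported characters according to mode."""
    out = bytearray()
    for ch in s:
        try:
            out += ch.encode(enc)
        except UnicodeEncodeError:
            if mode == 2:        # skip the unsupported character
                continue
            if mode == 3:        # stop and return an empty string
                return b""
            if mode == 4:        # escape as \uhhhh or \Uhhhhhhhh
                code = ord(ch)
                escape = f"\\u{code:04X}" if code <= 0xFFFF else f"\\U{code:08X}"
                out += escape.encode("ascii")
            else:                # mode 1 and any other value: substitute char(26)
                out += b"\x1a"
    return bytes(out)


for mode in (1, 2, 3, 4):
    print(mode, encode_from_unicode("café", "ascii", mode))
```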

ustrtohex(s[,n])

Description: escaped hex digit string of s up to 200 Unicode characters

The escaped hex digit string is in the form of \uhhhh for code points less than

\uffff or \Uhhhhhhhh for code points greater than \uffff. The function starts at

the nth Unicode character of s if n is speciﬁed and larger than 0. Any invalid UTF-8

sequence is replaced with a Unicode replacement character \ufffd. Note that the

null terminator char(0) is a valid Unicode character. Function ustrunescape()

can be applied on the result to get back the original Unicode string s if s does

not contain any invalid UTF-8 sequences.

Also see ustrunescape().

ustrtohex("i"+char(200)+char(0)+"s") = "\u0069\ufffd\u0000\u0073"

Domain s: Unicode strings

Domain n: integers ≥ 1

Range: strings

ustrunescape(s)

Description: the Unicode string corresponding to the escaped sequences of s

The following escape sequences are recognized: 4 hex digit form \uhhhh; 8 hex digit form \Uhhhhhhhh; 1–2 hex digit form \xhh; and 1–3 octal digit form \ooo, where h is [0-9A-Fa-f] and o is [0-7]. The standard ANSI C escapes \a, \b, \t, \n, \v, \f, \r, \e, \", \', \?, \\ are recognized as well. The function returns an empty string if an escape sequence is badly formed. Note that the 8 hex digit form \Uhhhhhhhh begins with a capital letter “U”.

Also see ustrtohex().

Domain s: strings of escaped hex values

Range: Unicode strings

word(s,n)

Description: the nth word in s; missing ("") if n is missing

Positive numbers count words from the beginning of s, and negative numbers count words from the end of s. (1 is the first word in s, and -1 is the last word in s.) A word is a set of characters that starts and terminates with spaces. This is different from a Unicode word, which is a language unit based on either a set of word-boundary rules or dictionaries for several languages (Chinese, Japanese, and Thai).

Domain s: strings

Domain n: integers

Range: strings

ustrword(s,n[,loc])

Description: the nth Unicode word in the Unicode string s

Positive n counts Unicode words from the beginning of s, and negative n counts Unicode words from the end of s. For example, n equal to 1 returns the first word in s, and n equal to −1 returns the last word in s. If loc is not specified, the default locale is used. A Unicode word is different from a Stata word produced by the word() function. A Stata word is a space-separated token. A Unicode word is a language unit based on either a set of word-boundary rules or dictionaries for some languages (Chinese, Japanese, and Thai). The function returns missing ("") if n is greater than cnt or less than −cnt, where cnt is the number of words s contains. cnt can be obtained from ustrwordcount(). The function also returns missing ("") if an error occurs.

ustrword("Parlez-vous français", 1, "fr") = "Parlez"

ustrword("Parlez-vous français", 2, "fr") = "-"

ustrword("Parlez-vous français",-1, "fr") = "français"

ustrword("Parlez-vous français",-2, "fr") = "vous"

Domain s: Unicode strings

Domain loc: Unicode strings

Domain n: integers

Range: Unicode strings

wordbreaklocale(loc,type)

Description: the most closely related locale supported by ICU from loc if type is 1; the actual locale where the word-boundary analysis data come from if type is 2; or an empty string for any other type

wordbreaklocale("en_us_texas", 1) = en_US

wordbreaklocale("en_us_texas", 2) = root

Domain loc: strings of locale name

Domain type: integers

Range: strings

wordcount(s)

Description: the number of words in s

A word is a set of characters that starts and terminates with spaces, starts with

the beginning of the string, or terminates with the end of the string. This is

different from a Unicode word, which is a language unit based on either a set of

word-boundary rules or dictionaries for several languages (Chinese, Japanese, and

Thai).

Domain s: strings

Range: nonnegative integers 0, 1, 2, . . .
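
The space-separated notion of a word used by word() and wordcount() can be modeled in Python as follows. The helper names are hypothetical, and the sketch rests on several assumptions: only the ASCII space separates tokens, quoted substrings receive no special treatment, None stands in for a missing n, and n equal to 0 or out of range yields an empty string.

```python
from typing import List, Optional


def tokens(s: str) -> List[str]:
    """Split s on ASCII spaces and discard the empty pieces between repeated spaces."""
    return [piece for piece in s.split(" ") if piece]


def word_count(s: str) -> int:
    """Number of space-separated tokens in s."""
    return len(tokens(s))


def word(s: str, n: Optional[int]) -> str:
    """The nth token of s; negative n counts from the end."""
    parts = tokens(s)
    if n is None or n == 0 or abs(n) > len(parts):
        return ""
    return parts[n - 1] if n > 0 else parts[n]


print(word("this is the day", 1), word("this is the day", -1))
print(word_count("  this  is "))
```

By contrast, the Unicode word functions report “Parlez-vous français” as four words because the hyphen counts as a word of its own.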

ustrwordcount(s[,loc])

Description: the number of nonempty Unicode words in the Unicode string s

An empty Unicode word is a Unicode word consisting of only Unicode whitespace characters. If loc is not specified, the default locale is used. A Unicode word is different from a Stata word produced by the word() function. A Stata word is a space-separated token. A Unicode word is a language unit based on either a set of word-boundary rules or dictionaries for some languages (Chinese, Japanese, and Thai). The function may return a negative number if an error occurs.

ustrwordcount("Parlez-vous français", "fr") = 4

Domain s: Unicode strings

Domain loc: Unicode strings

Range: integers

References

Cox, N. J. 2004. Stata tip 6: Inserting awkward characters in the plot. Stata Journal 4: 95–96.

Cox, N. J. 2011. Stata tip 98: Counting substrings within strings. Stata Journal 11: 318–320.

Cox, N. J. 2022. Stata tip 148: Searching for words within strings. Stata Journal 22: 998–1003.

Jeanty, P. W. 2013. Dealing with identifier variables in data management and analysis. Stata Journal 13: 699–718.

Koplenig, A. 2018. Stata tip 129: Efficiently processing textual data with Stata’s new Unicode features. Stata Journal 18: 287–289.

Schwarz, C. 2019. lsemantica: A command for text similarity based on latent semantic analysis. Stata Journal 19: 129–142.

Also see

[FN] Functions by category

[D] egen — Extensions to generate

[D] generate — Create or change contents of variable

[M-4] String — String manipulation functions

[U] 12.4.2 Handling Unicode strings

[U] 13.2.2 String operators

[U] 13.3 Functions

Stata, Stata Press, and Mata are registered trademarks of StataCorp LLC. Stata and Stata Press are registered trademarks with the World Intellectual Property Organization of the United Nations. StataNow and NetCourseNow are trademarks of StataCorp LLC. Other brand and product names are registered trademarks or trademarks of their respective companies. Copyright © 1985–2023 StataCorp LLC, College Station, TX,

For suggested citations, see the FAQ on citing Stata documentation.