Char Sets
To use the bindings from this module:
(import :std/text/char-set)
def-codepoint
(def-codepoint (name x y ...) body ...)
This macro defines two functions, codepoint-name and char-name
(where name is replaced by the provided symbol).
The first one takes a fixnum x as its first argument
and optionally more arguments y ..., and evaluates body ... with them.
Typically, it is a predicate returning a boolean, but could be anything.
The second one takes any value x as its first argument
and optionally more arguments y ...;
if x is a character, the first function is called with its codepoint
and the rest of the arguments, otherwise #f is returned.
Examples:
> (def-codepoint (chess-piece? c) (<= 9812 c 9823))
> (codepoint-chess-piece? 9817)
#t
> (codepoint-chess-piece? 9999)
#f
> (char-chess-piece? #\♞)
#t
> (char-chess-piece? #\A)
#f
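The examples above do not use the optional extra arguments y .... As a minimal sketch, assuming a hypothetical predicate name in-range? (chosen for illustration, not part of the module), both generated functions pass the extra arguments through:

```scheme
(import :std/text/char-set)

;; in-range? is a hypothetical name used only for this illustration.
;; lo and hi play the role of the extra arguments y ... in the signature.
(def-codepoint (in-range? c lo hi) (<= lo c hi))

(codepoint-in-range? 98 97 122) ; => #t
(char-in-range? #\b 97 122)     ; => #t, called with codepoint 98
(char-in-range? #\B 97 122)     ; => #f, codepoint 66 is below 97
(char-in-range? "b" 97 122)     ; => #f, a string is not a character
```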
codepoint-ascii? char-ascii?
(codepoint-ascii? x) => bool
(char-ascii? x) => bool
Returns true if the designated character is valid ASCII, with codepoint between 0 and 127 included.
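A short illustration of the boundary, written as a sketch that only uses the two functions documented here:

```scheme
(import :std/text/char-set)

(codepoint-ascii? 65)             ; => #t
(codepoint-ascii? 127)            ; => #t, the upper bound is included
(codepoint-ascii? 128)            ; => #f
(char-ascii? #\A)                 ; => #t
(char-ascii? (integer->char 233)) ; => #f, U+00E9 is outside ASCII
```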
codepoint-ascii-uppercase? char-ascii-uppercase?
(codepoint-ascii-uppercase? x) => bool
(char-ascii-uppercase? x) => bool
Returns true if the designated character is
a valid ASCII uppercase letter from #\A to #\Z,
with codepoint between 65 and 90 included.
codepoint-ascii-lowercase? char-ascii-lowercase?
(codepoint-ascii-lowercase? x) => bool
(char-ascii-lowercase? x) => bool
Returns true if the designated character is
a valid ASCII lowercase letter from #\a to #\z,
with codepoint between 97 and 122 included.
codepoint-ascii-alphabetic? char-ascii-alphabetic?
(codepoint-ascii-alphabetic? x) => bool
(char-ascii-alphabetic? x) => bool
Returns true if the designated character is a valid ASCII letter,
either uppercase from #\A to #\Z
or lowercase from #\a to #\z.
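The three letter predicates can be compared side by side. This is a small sketch using only the functions documented above:

```scheme
(import :std/text/char-set)

(char-ascii-uppercase? #\Q)      ; => #t
(char-ascii-uppercase? #\q)      ; => #f
(char-ascii-lowercase? #\q)      ; => #t
(char-ascii-lowercase? #\Q)      ; => #f
(char-ascii-alphabetic? #\Q)     ; => #t
(char-ascii-alphabetic? #\q)     ; => #t
(char-ascii-alphabetic? #\5)     ; => #f
(codepoint-ascii-alphabetic? 91) ; => #f, codepoint 91 is #\[ just after #\Z
```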
codepoint-ascii-numeric? char-ascii-numeric?
(codepoint-ascii-numeric? x) => bool
(char-ascii-numeric? x) => bool
Returns true if the designated character is a valid ASCII digit
between #\0 and #\9 included,
with codepoint from 48 to 57 included.
codepoint-ascii-alphanumeric? char-ascii-alphanumeric?
(codepoint-ascii-alphanumeric? x) => bool
(char-ascii-alphanumeric? x) => bool
Returns true if the designated character is a valid ASCII letter or digit.
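For instance, the digit and alphanumeric predicates behave as follows (a sketch using only the documented functions):

```scheme
(import :std/text/char-set)

(char-ascii-numeric? #\7)      ; => #t
(char-ascii-numeric? #\a)      ; => #f
(codepoint-ascii-numeric? 48)  ; => #t, codepoint 48 is #\0
(codepoint-ascii-numeric? 58)  ; => #f, codepoint 58 is #\: just after #\9
(char-ascii-alphanumeric? #\7) ; true: a digit
(char-ascii-alphanumeric? #\a) ; true: a letter
(char-ascii-alphanumeric? #\_) ; false: neither a letter nor a digit
```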
codepoint-ascii-alphanumeric-or-underscore? char-ascii-alphanumeric-or-underscore?
(codepoint-ascii-alphanumeric-or-underscore? x) => bool
(char-ascii-alphanumeric-or-underscore? x) => bool
Returns true if the designated character is a valid ASCII letter or digit
or the underscore character #\_ (codepoint 95).
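One plausible use is checking C-style identifiers. The helper ascii-identifier? below is hypothetical (it is not part of the module) and is only a sketch of how the predicate might be combined with char-ascii-numeric?:

```scheme
(import :std/text/char-set)

;; Hypothetical helper: true if s is a non-empty string of ASCII letters,
;; digits and underscores that does not start with a digit.
(def (ascii-identifier? s)
  (and (string? s)
       (< 0 (string-length s))
       (not (char-ascii-numeric? (string-ref s 0)))
       (let loop ((i 0))
         (or (= i (string-length s))
             (and (char-ascii-alphanumeric-or-underscore? (string-ref s i))
                  (loop (+ i 1)))))))

(ascii-identifier? "foo_bar1") ; => #t
(ascii-identifier? "1abc")     ; => #f, starts with a digit
(ascii-identifier? "foo-bar")  ; => #f, #\- is not accepted
(ascii-identifier? "")         ; => #f
```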
codepoint-ascii-printable? char-ascii-printable?
(codepoint-ascii-printable? x) => bool
(char-ascii-printable? x) => bool
Returns true if the designated character is a valid
ASCII printable character,
codepoint from 32 to 126 included.
Note that codepoint 32 is actually the #\space
that prints to a blank space,
but that other whitespace characters are not included.
Codepoint 127 is actually #\delete, which isn’t printable.
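The edges of the range can be checked directly. A brief sketch:

```scheme
(import :std/text/char-set)

(codepoint-ascii-printable? 32)  ; => #t, space is included
(codepoint-ascii-printable? 126) ; => #t, #\~ is the last printable character
(codepoint-ascii-printable? 127) ; => #f, delete
(codepoint-ascii-printable? 31)  ; => #f, a control character
(char-ascii-printable? #\tab)    ; => #f, other whitespace is excluded
(char-ascii-printable? #\A)      ; => #t
```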
codepoint-strict-whitespace? char-strict-whitespace?
(codepoint-strict-whitespace? x) => bool
(char-strict-whitespace? x) => bool
These functions are the first of several predicates that recognize whitespace.
There is no consensus as to what is a whitespace character
for either ASCII or Unicode,
and these follow the strictest definition, as specified by HTML and JSON:
whitespace characters are codepoints
32 (#\space), 9 (#\tab), 10 (#\newline), 13 (#\return).
The latest Scheme standard R7RS also specifies that this is
the set of whitespace accepted by all Scheme implementations,
though implementations may allow additional whitespace “such as page-break”.
codepoint-ascii-whitespace? char-ascii-whitespace?
(codepoint-ascii-whitespace? x) => bool
(char-ascii-whitespace? x) => bool
These predicates recognize ASCII whitespace characters
as defined by C, C++ and Python.
In addition to the four strict whitespace characters, they also accept
codepoints 11 (#\vtab, vertical tab, C '\v') and
12 (#\page, page break, form feed, C '\f').
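The difference between the strict and ASCII definitions shows up only for vertical tab (codepoint 11) and form feed (codepoint 12). A small sketch using the codepoint variants:

```scheme
(import :std/text/char-set)

;; Tab, newline, return and space are accepted by both predicates.
(map codepoint-strict-whitespace? '(9 10 13 32)) ; all true
(map codepoint-ascii-whitespace? '(9 10 13 32))  ; all true

;; Vertical tab (11) and form feed (12) are ASCII whitespace only.
(map codepoint-strict-whitespace? '(11 12))      ; all false
(map codepoint-ascii-whitespace? '(11 12))       ; all true

;; A letter is whitespace under neither definition.
(codepoint-ascii-whitespace? 65)                 ; false
```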
codepoint-scheme-whitespace? char-scheme-whitespace?
(codepoint-scheme-whitespace? x) => bool
(char-scheme-whitespace? x) => bool
These predicates recognize the same whitespace characters
as the underlying Scheme implementation.
For Gambit and thus Gerbil (so far), it is the union of the ASCII whitespace above
plus Unicode Space Separators
(codepoints #x20, #xA0, #x1680, #x2000-#x200A, #x202F, #x205F, #x3000)
plus Unicode Line Separators
(codepoints #x0A, #x0D, #x85, #x2028, #x2029).
Note that JavaScript accepts the ASCII whitespace, the Unicode Space Separators,
and #xFEFF (ZWNBSP), but doesn't consider the line separators whitespace;
rather it considers #x0A, #x0D, #x2028 and #x2029 as line terminators,
but not #x85 (Next Line).
Meanwhile Rust recognizes the ASCII whitespace plus
#x85, #x200E, #x200F, #x2028 and #x2029.
Whichever language or grammar you parse, be sure to look at its latest specification to identify its specific definition of “whitespace”.
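As an illustration, the no-break space (#xA0) is one character on which the ASCII and Scheme definitions differ. This sketch assumes the Gambit behavior described above; the name nbsp is introduced here only for the example:

```scheme
(import :std/text/char-set)

;; nbsp is a local name for this example: NO-BREAK SPACE, a Unicode space separator.
(def nbsp (integer->char #xA0))

(char-ascii-whitespace? nbsp)   ; false: outside ASCII
(char-scheme-whitespace? nbsp)  ; expected true on Gambit, per the list above
(char-scheme-whitespace? #\tab) ; true: ASCII whitespace is included
(char-scheme-whitespace? #\a)   ; false
```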
codepoint-ascii-printable-or-whitespace? char-ascii-printable-or-whitespace?
(codepoint-ascii-printable-or-whitespace? x) => bool
(char-ascii-printable-or-whitespace? x) => bool
These predicates recognize ASCII characters that are either printable or whitespace (the C definition, which also equals the intersection of the underlying Scheme definition and ASCII).
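A few representative codepoints, as a sketch:

```scheme
(import :std/text/char-set)

(codepoint-ascii-printable-or-whitespace? 65)   ; true: #\A is printable
(codepoint-ascii-printable-or-whitespace? 9)    ; true: tab is whitespace
(char-ascii-printable-or-whitespace? #\newline) ; true: newline is whitespace
(codepoint-ascii-printable-or-whitespace? 0)    ; false: NUL is neither
(codepoint-ascii-printable-or-whitespace? 127)  ; false: delete is neither
```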
codepoint-ascii-digit char-ascii-digit
(codepoint-ascii-digit x [base 10]) => number-or-false
(char-ascii-digit x [base 10]) => number-or-false
Given a character x and a base from 2 to 36 (defaults to 10),
if that character represents a digit in that base
(with letters being the digits from 10 to 35),
return the numerical value of the digit.
Otherwise return #f.
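The following sketch shows the effect of the base argument. It uses lowercase letters for the digits above 9:

```scheme
(import :std/text/char-set)

(char-ascii-digit #\7)     ; => 7, base defaults to 10
(char-ascii-digit #\7 8)   ; => 7
(char-ascii-digit #\8 8)   ; => #f, 8 is not an octal digit
(char-ascii-digit #\f 16)  ; => 15
(char-ascii-digit #\g 16)  ; => #f, g would be 16, out of range for base 16
(char-ascii-digit #\z 36)  ; => 35
(codepoint-ascii-digit 55) ; => 7, codepoint 55 is #\7
```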
digit-char
(digit-char n [base 10] [upper-case? #f]) => char-or-false
Given a number n and a base from 2 to 36 (defaults to 10),
if the number is an exact-integer between 0 (included) and base (excluded),
then return an ASCII character that represents that digit in the given base.
If the digit value is 10 to 35, then use a lowercase letter if upper-case? is false,
an uppercase letter if upper-case? is true.
If the argument n is not a valid digit for that base, return #f.
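A sketch of direct calls, followed by a hypothetical helper natural->string (not part of the module) that builds the representation of a non-negative integer one digit at a time:

```scheme
(import :std/text/char-set)

(digit-char 7)        ; => #\7
(digit-char 11 16)    ; => #\b
(digit-char 11 16 #t) ; => #\B
(digit-char 16 16)    ; => #f, 16 is not a digit in base 16
(digit-char 10)       ; => #f, 10 is not a decimal digit

;; Hypothetical helper: render a non-negative exact integer n in the given base.
(def (natural->string n (base 10))
  (let loop ((n n) (acc '()))
    (let (acc (cons (digit-char (remainder n base) base) acc))
      (if (< n base)
        (list->string acc)
        (loop (quotient n base) acc)))))

(natural->string 255 16) ; => "ff"
(natural->string 5 2)    ; => "101"
(natural->string 0)      ; => "0"
```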
char-eol?
(char-eol? x) => bool
Is x, a result from calling read-char or peek-char from a Port or Reader,
a line terminator?
This is the case if x is one of the characters #\newline or #\return,
or the special object #!eof.
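Because #!eof counts as a line terminator, a reading loop can use a single test for both end of line and end of input. The helper read-until-eol below is hypothetical and only sketches that pattern with a string port:

```scheme
(import :std/text/char-set)

;; Hypothetical helper: collect characters up to, but not including,
;; the next line terminator or the end of input.
(def (read-until-eol port)
  (let loop ((acc '()))
    (if (char-eol? (peek-char port))
      (list->string (reverse acc))
      (loop (cons (read-char port) acc)))))

(def port (open-input-string "abc\ndef"))
(def first-line (read-until-eol port))  ; "abc", stops before #\newline
(read-char port)                        ; consume the #\newline
(def second-line (read-until-eol port)) ; "def", stops at #!eof
```

The helper never consumes the terminator itself, so the caller decides how to handle #\return followed by #\newline.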