3 Ways to Detect Chinese Characters in JavaScript | Unicode Regular Expressions in Practice

Published: 2023-09-08 Updated: 2023-09-08 Reading time: 3 minutes

javascript

regular expression

unicode

string

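While brushing up on TypeScript recently, I decided to rewrite wordcounter, a project I had originally written in Golang, in TypeScript to build fluency.

The project's core feature is counting the Chinese characters in a piece of text, so it needs a way to detect them. Go has good Unicode support: the unicode.Han range table in the standard unicode package can be used directly to detect Chinese characters, and that check forms the project's core algorithm: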

func (c *TextCounter) Count(input interface{}) error {
  str := ""
  switch v := input.(type) {
  case string:
    str = v
  case []byte:
    str = string(v)
  }
  if str == "" {
    return errors.New("no input provided")
  }
  scanner := bufio.NewScanner(strings.NewReader(str))
  for scanner.Scan() {
    c.S.Lines++
    line := scanner.Text()
    for _, r := range line {
      c.S.TotalChars++
      if unicode.In(r, unicode.Han) {
        c.S.ChineseChars++
      } else {
        c.S.NonChineseChars++
      }
    }
  }
  if err := scanner.Err(); err != nil {
    return err
  }
  return nil
}

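In short, whenever a character belongs to the Han character set, the Chinese-character count is incremented by one.

JavaScript, however, does not seem to offer a standard-library equivalent to Go's, so matching Chinese characters usually means using a regular expression over a Unicode range, for example: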

/[\u4e00-\u9fa5]/.test("中文"); // true

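\u4e00 and \u9fa5 are the first and last characters of the commonly used range of Chinese characters, which lies within the CJK Unified Ideographs block. Ideographs outside that range, such as those in the extension blocks, are not matched. The same regular expression generally works in other languages as well, such as Python.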
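As a quick check of where that range stops, the sketch below tests a few characters against it. The helper name inBasicRange and the sample characters are my own choices for illustration and are not part of the original project.

```typescript
// Hypothetical helper: true if the text contains a code unit in U+4E00..U+9FA5.
const basicRange = /[\u4e00-\u9fa5]/;

function inBasicRange(text: string): boolean {
  return basicRange.test(text);
}

// Sample characters, written as escapes so the code points are explicit.
const samples: Array<[string, string]> = [
  ["U+4E2D (inside the range)", "\u4e2d"],
  ["U+3400 (Extension A, below the range)", "\u3400"],
  ["U+9FA6 (just above the range)", "\u9fa6"],
  ["U+20BB7 (Extension B, a surrogate pair in UTF-16)", "\u{20BB7}"],
];

for (const [label, ch] of samples) {
  console.log(label, inBasicRange(ch));
}
```

Only the first sample is reported as a match; the other three are Han ideographs that the fixed range misses.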

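Unicode itself, however, provides a way to detect Han characters, namely through the Script property (the u flag is required for \p{...} to be recognized):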

/\p{Script=Han}/u.test("中文"); // true

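It matches not only the Chinese characters in the common range but also other CJK ideographs, such as those in the extension blocks and the Han characters used in Japanese and Korean text. It does not match kana or Hangul, which belong to other scripts.

So in the end I chose this approach to detect Chinese characters in the TypeScript version: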
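To make the scope of Script=Han more concrete, here is a small sketch that probes a handful of characters. The helper isHan and the samples are assumptions chosen for illustration, and the code assumes a runtime and compile target that support Unicode property escapes (ES2018 or later).

```typescript
// Matches a single code point whose Unicode Script property is Han.
const han = /\p{Script=Han}/u;

// Hypothetical helper for illustration.
function isHan(ch: string): boolean {
  return han.test(ch);
}

console.log(isHan("\u4e2d"));    // true:  U+4E2D, a common Han ideograph
console.log(isHan("\u{20BB7}")); // true:  U+20BB7, Extension B, outside the BMP
console.log(isHan("\u3042"));    // false: U+3042, Hiragana
console.log(isHan("\u30ab"));    // false: U+30AB, Katakana
console.log(isHan("\ud55c"));    // false: U+D55C, Hangul
console.log(isHan("\u3002"));    // false: U+3002, ideographic full stop (Script=Common)
console.log(isHan("A"));         // false: Latin letter
```

Full-width punctuation is therefore not counted as a Chinese character under this check, which is usually what a character counter wants.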

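// Stats shape inferred from the fields used below (it mirrors the Go version).
interface Stats {
  lines: number;
  totalChars: number;
  chineseChars: number;
  nonChineseChars: number;
}

// Compiled once; matches a single code point whose Script property is Han.
const HAN = /\p{Script=Han}/u;

class TextCounter {
  s: Stats = { lines: 0, totalChars: 0, chineseChars: 0, nonChineseChars: 0 };

  count(input: string | Uint8Array): Error | null {
    let str = "";
    if (typeof input === "string") {
      str = input;
    } else if (input instanceof Uint8Array) {
      // Decode the bytes as UTF-8 (the TextDecoder default).
      str = new TextDecoder().decode(input);
    }
    if (str === "") {
      return new Error("No input provided");
    }

    const lines = str.split("\n");
    for (const line of lines) {
      this.s.lines++;
      // for...of iterates by code point, so surrogate pairs stay intact.
      for (const char of line) {
        this.s.totalChars++;
        if (HAN.test(char)) {
          this.s.chineseChars++;
        } else {
          this.s.nonChineseChars++;
        }
      }
    }

    return null;
  }
}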
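For comparison, the same tally can be sketched as a standalone function that lets a global regular expression find the matches instead of testing each character in a loop. The names HanCount and countHan are hypothetical and not part of the original project, and the code assumes an ES2018 or later target.

```typescript
// Hypothetical result shape for illustration.
interface HanCount {
  total: number;      // code points, excluding line feeds
  chinese: number;    // code points with Script=Han
  nonChinese: number; // everything else
}

function countHan(text: string): HanCount {
  // Drop line feeds, which the per-line loop above does not count either.
  const body = text.replace(/\n/g, "");
  // Array.from iterates a string by code point, not by UTF-16 code unit.
  const total = Array.from(body).length;
  // With the g flag, match() returns every match, or null when there are none.
  const chinese = (body.match(/\p{Script=Han}/gu) ?? []).length;
  return { total, chinese, nonChinese: total - chinese };
}

console.log(countHan("Hello, 世界\n\u{20BB7}野家"));
// { total: 12, chinese: 5, nonChinese: 7 }
```

Counting by code point matters here: "\u{20BB7}".length is 2 because the character is stored as a surrogate pair, while Array.from sees it as one character.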

References: