Benchmark: UTF-8 byte length Arabic 4-methods

Script Preparation code:

var arabic_str = "kksldfnjeoliwkfmirewogbregiojrmfikrefnهنصتةبهنمخصةبهخصثةبثصهنخقلةثقملىثقمةلكمسقنلمنكقةلمنيبةىلميبنكلكمطسيوبصثقنلحخثقتلثقنلوسيكملنسيكملklrmglkedrmg;ler,g;lkerdmglkermg;ler,g;lerkmglkerglk;ermg;lermg;hjfbwefseoifnuiwefصثعىبصثبىهصثىةبخهمصقاىلثخقتلةحخثقلةثقلةثقلةثقلةثخقلةحخثقلةخقثلةثحخقلةثينقىلنمقثيىلنمقىلمنيثقسىلقىلمقىلنمكقىلمنقىلمنيثقىلمنقىلمنقثىيةلمنخىثيقمنلةيمنلتىنثتقيىلخهمثقيةلخمثقيىلثقفيىلنثقيةلمنخىفقثيلىىىتنعصثىبمخنثةصبحةثصقخمبىثقخمنهلىةثحكقةلكحثقةلمنثقىلمنثقىلكمثةقلكمثقةلمنىثقلمنىثقلمنىثقلنمكثىقلنمكثقىلمنثيقىلمطنثقىلمنطثقىةسلبثمكسقيلةثطقكةلونطحكصثقنلخحثقةلحكخمقفةاقفهىاقفاىقفةلحخثقهخابهقعثالخهثقتلخهثقالخهاثقهعلخباثقحهخبةصحخثجؤوخحجصسثؤوحخصثرةىخهمثقيلىرخهمثقلىخهثقمللاخهثمقلتحخثقتلبحخصثتبحخصثستىبخهمثقىلخهمنثقىلحخثقىلحثقىلخهمنثقصىلثقةلحخثقىلمخهنثيقىلمصثويبجحصضثنيحخصثتبهختثصقخهلبةثقهنخلمةثقيرنمهىيؤنتءراىهخثقةرثقمرةثقيمخهنلىخهثقنلىخهمثقهتل"
var english_str = "UTF-8 encodes each of the 1,112,064 valid code points in the Unicode code space (1,114,112 code points minus 2,048 surrogate code points) using one to four 8-bit bytes (a group of 8 bits is known as an «octet» in the Unicode Standard). Code points wi"

function lengthInUtf8Bytes(str) {
  // Counts the continuation bytes (10xxxxxx, percent-encoded as %8x..%Bx) of each
  // multi-byte sequence; str.length already contributes one byte per UTF-16 code unit.
  var m = encodeURIComponent(str).match(/%[89ABab]/g);
  return str.length + (m ? m.length : 0);
}


function byteLength(str) {
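  // Returns the byte length of a UTF-8 string.
  // Note: each half of a surrogate pair is counted as 3 bytes, so characters
  // outside the BMP are reported as 6 bytes instead of 4.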
  var s = str.length;
  for (var i=str.length-1; i>=0; i--) {
    var code = str.charCodeAt(i);
    if (code > 0x7f && code <= 0x7ff) s++;
    else if (code > 0x7ff && code <= 0xffff) s+=2;
  }
  return s;
}

function byteLengthTextEncoder(str) {
  return (new TextEncoder().encode(str)).length;
}

function byteLengthBlob(str) {
  return new Blob([str]).size;
}
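
As a quick sanity check before timing anything, the four functions can be run on a short input to confirm they agree. The sketch below restates the preparation code with `let`/`const`; the `sample` string and the `results` object are hypothetical and are not part of the benchmark. It assumes an environment where `TextEncoder` and `Blob` are available as globals.

```javascript
// Short mixed ASCII + Arabic sample (hypothetical; not the benchmark's full input).
// "abc " is 4 ASCII bytes; U+0647, U+0646, U+0635 take 2 bytes each in UTF-8.
const sample = "abc \u0647\u0646\u0635";

function lengthInUtf8Bytes(str) {
  // str.length plus the number of UTF-8 continuation bytes.
  const m = encodeURIComponent(str).match(/%[89ABab]/g);
  return str.length + (m ? m.length : 0);
}

function byteLength(str) {
  // One byte per code unit, plus extras for 2- and 3-byte characters.
  let s = str.length;
  for (let i = str.length - 1; i >= 0; i--) {
    const code = str.charCodeAt(i);
    if (code > 0x7f && code <= 0x7ff) s++;
    else if (code > 0x7ff && code <= 0xffff) s += 2;
  }
  return s;
}

function byteLengthTextEncoder(str) {
  return new TextEncoder().encode(str).length;
}

function byteLengthBlob(str) {
  return new Blob([str]).size;
}

const results = {
  loop: byteLength(sample),
  regex: lengthInUtf8Bytes(sample),
  textEncoder: byteLengthTextEncoder(sample),
  blob: byteLengthBlob(sample),
};
console.log(results); // all four values should be 10 for this sample
```

Agreement on a sample like this only covers characters in the Basic Multilingual Plane; it says nothing about inputs containing surrogate pairs.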


Tests:

with a loop Arabic
byteLength(arabic_str);

with a regex Arabic
lengthInUtf8Bytes(arabic_str);

with a TextEncoder Arabic
byteLengthTextEncoder(arabic_str);

with a Blob Arabic
byteLengthBlob(arabic_str);

Latest run results:

Run details: (Test run date: one month ago)

User agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36

Browser/OS: Chrome 133 on Mac OS X 10.15.7

Test name	Executions per second
with a loop Arabic	1397361.6 Ops/sec
with a regex Arabic	91756.8 Ops/sec
with a TextEncoder Arabic	567086.8 Ops/sec
with a Blob Arabic	19844.5 Ops/sec
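
To put the table in relative terms, the ratios between test cases can be computed directly from the reported figures. This is a small sketch; the `opsPerSec` and `relative` names are illustrative only, and the numbers are copied from the table above.

```javascript
// Ops/sec figures copied from the results table above.
const opsPerSec = {
  "with a loop Arabic": 1397361.6,
  "with a regex Arabic": 91756.8,
  "with a TextEncoder Arabic": 567086.8,
  "with a Blob Arabic": 19844.5,
};

const fastest = Math.max(...Object.values(opsPerSec));

// ratio = how many times more executions per second the fastest case achieved.
const relative = Object.entries(opsPerSec).map(([name, ops]) => ({
  name,
  ratio: fastest / ops,
}));

for (const { name, ratio } of relative) {
  console.log(`${name}: fastest / this = ${ratio.toFixed(2)}`);
}
```

In this single run the loop comes out roughly 2.5 times faster than TextEncoder, about 15 times faster than the regex approach, and about 70 times faster than Blob. These ratios describe one run on Chrome 133 and may differ on other browsers or inputs.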

Autogenerated LLM Summary (model llama3.2:3b, generated 5 months ago):

This summary covers what is being tested, the options compared, the pros and cons of each approach, library usage, special JS features or syntax (if any), and alternatives.

Benchmark Definition

The benchmark measures four ways of computing the UTF-8 byte length of arabic_str, a string that mixes ASCII and Arabic letters (english_str is defined in the preparation code but is not used by any test):

byteLength(arabic_str);
lengthInUtf8Bytes(arabic_str);
byteLengthTextEncoder(arabic_str)
byteLengthBlob(arabic_str)

Options Compared

The four options being compared are:

Direct byte length calculation using byteLength function: This method starts from str.length and loops over the string with charCodeAt(), adding one extra byte for each code unit in the range U+0080 to U+07FF and two extra bytes for each code unit in the range U+0800 to U+FFFF.
Using regular expression with lengthInUtf8Bytes function: This method percent-encodes the string with encodeURIComponent, uses a regex to count the UTF-8 continuation bytes (those encoded as %8x to %Bx), and adds that count to str.length.
Using TextEncoder API (byteLengthTextEncoder function): This method uses the TextEncoder API to encode the string into a Uint8Array of UTF-8 bytes and then returns the length of that array.
Using Blob object (byteLengthBlob function): This method creates a new Blob object from the string and returns the size of that blob in bytes.
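
The regex approach is the least obvious of the four, so here is a minimal walk-through on a single Arabic letter. The variable names are illustrative and not part of the benchmark.

```javascript
const ch = "\u0647"; // Arabic letter heh, U+0647 (2 bytes in UTF-8: D9 87)

// Step 1: percent-encode the UTF-8 bytes.
const encoded = encodeURIComponent(ch); // "%D9%87"

// Step 2: find continuation bytes, whose high hex digit is 8, 9, A, or B.
const continuation = encoded.match(/%[89ABab]/g) || []; // ["%8"]

// Step 3: one byte per UTF-16 code unit, plus one per continuation byte.
const total = ch.length + continuation.length; // 1 + 1 = 2

console.log(encoded, continuation, total);
```

The lead byte `%D9` is not matched because its high hex digit is `D`, so only the continuation byte adds to the count.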

Pros and Cons

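Direct byte length calculation: Simple, and the fastest in the recorded run (1,397,361.6 ops/sec). As written it treats each half of a surrogate pair as a 3-byte character, so code points above U+FFFF are counted as 6 bytes instead of 4; the test string appears to contain none, so this does not affect the result here.
Regular expression with lengthInUtf8Bytes: Correct for the test string, but about 15 times slower than the loop (91,756.8 ops/sec), since it first builds a percent-encoded copy of the string and then scans it with a regex. It over-counts surrogate pairs by one byte (5 instead of 4), and encodeURIComponent throws a URIError on lone surrogates.
TextEncoder API: Accurate for well-formed strings and second fastest in the recorded run (567,086.8 ops/sec), but it requires TextEncoder support and produces a full encoded Uint8Array on each call.
Blob object: Simple and accurate, but the slowest in the recorded run (19,844.5 ops/sec), presumably because a Blob is constructed on each call.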
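
The edge cases in question are mostly characters outside the Basic Multilingual Plane, which JavaScript stores as surrogate pairs. The sketch below compares the loop and regex functions from the preparation code against `TextEncoder` on one such character. `byteLengthSurrogateAware` is a hypothetical variant, not part of the benchmark, that skips the lead surrogate after counting the trail surrogate; it assumes well-formed input with no lone surrogates.

```javascript
// Loop version as used in the benchmark.
function byteLength(str) {
  let s = str.length;
  for (let i = str.length - 1; i >= 0; i--) {
    const code = str.charCodeAt(i);
    if (code > 0x7f && code <= 0x7ff) s++;
    else if (code > 0x7ff && code <= 0xffff) s += 2;
  }
  return s;
}

// Regex version as used in the benchmark.
function lengthInUtf8Bytes(str) {
  const m = encodeURIComponent(str).match(/%[89ABab]/g);
  return str.length + (m ? m.length : 0);
}

// Hypothetical variant: assumes well-formed UTF-16 (every trail surrogate
// is preceded by its lead surrogate).
function byteLengthSurrogateAware(str) {
  let s = str.length;
  for (let i = str.length - 1; i >= 0; i--) {
    const code = str.charCodeAt(i);
    if (code > 0x7f && code <= 0x7ff) s++;
    else if (code > 0x7ff && code <= 0xffff) s += 2;
    if (code >= 0xdc00 && code <= 0xdfff) i--; // trail surrogate: skip its lead
  }
  return s;
}

const emoji = "\uD83D\uDE00"; // U+1F600, 4 bytes in UTF-8 (F0 9F 98 80)

console.log(byteLength(emoji));                       // 6 (each surrogate counted as 3)
console.log(lengthInUtf8Bytes(emoji));                // 5 (2 code units + 3 continuation bytes)
console.log(new TextEncoder().encode(emoji).length);  // 4
console.log(byteLengthSurrogateAware(emoji));         // 4
```

For text limited to ASCII and Arabic letters, as in this benchmark, all of these functions return the same value, so the discrepancy matters only when the input can contain emoji or other supplementary-plane characters.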

Library Usage

No third-party libraries are used in this benchmark. TextEncoder and Blob are built-in web platform APIs, and encodeURIComponent is a standard JavaScript global function.

Special JS Features or Syntax (None)

No special JavaScript features or syntax are being used in this benchmark.

Alternatives

If you wanted to add more alternatives to the benchmark, some options could be:

Using String.prototype.codePointAt() or for...of iteration: This would walk the string by Unicode code point rather than by UTF-16 code unit (which is what charCodeAt(), already used by the loop test, returns), so surrogate pairs would be counted as 4 bytes.
Using a library like ICU for internationalization and localization: This would provide a more comprehensive way of handling different encoding scenarios, but ICU is a C/C++ and Java library, so it would need a binding or port to be usable from JavaScript.

Neither alternative has been measured here, so their speed relative to the four options above is unknown.
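
Iterating by code point could be sketched as follows. `byteLengthByCodePoint` is a hypothetical name and is not one of the benchmarked functions; its performance relative to them has not been measured.

```javascript
// Sketch: sum the UTF-8 width of each code point.
// for...of iterates a string by code point, so a surrogate pair arrives as
// one two-unit string and codePointAt(0) returns the full code point.
function byteLengthByCodePoint(str) {
  let bytes = 0;
  for (const ch of str) {
    const cp = ch.codePointAt(0);
    if (cp <= 0x7f) bytes += 1;        // ASCII
    else if (cp <= 0x7ff) bytes += 2;  // e.g. Arabic letters
    else if (cp <= 0xffff) bytes += 3; // rest of the BMP (lone surrogates land here)
    else bytes += 4;                   // supplementary planes
  }
  return bytes;
}

console.log(byteLengthByCodePoint("abc"));          // 3
console.log(byteLengthByCodePoint("\u0647"));       // 2
console.log(byteLengthByCodePoint("\u20AC"));       // 3
console.log(byteLengthByCodePoint("\uD83D\uDE00")); // 4
```

A lone surrogate is counted as 3 bytes here, which matches the length of the U+FFFD replacement character that TextEncoder emits in that case.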

Related benchmarks:

UTF-8 byte length Arabic 4-methods (version: 2)

Comparing performance of: with a loop Arabic vs with a regex Arabic vs with a TextEncoder Arabic vs with a Blob Arabic

Created: 4 years ago by: Registered User
