Search for unicode character (ranges) - Page 2

Karellen · 08-21-2023, 07:46 PM

You could also use the Characters list in the Reports module. Double-clicking a character in the list jumps to its next instance.

kovidgoyal · 08-21-2023, 11:04 PM

This is a Python-based regex engine, so you need \U followed by 8 hex digits; for a five-digit code point, insert three leading zeros. For example, for pouting cat face (😾 U+1F63E):

\U0001f63e
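
The padding rule can be sketched in plain Python. This is only an illustration using the standard `re` module as a stand-in for the editor's regex engine; the helper name `to_regex_escape` is hypothetical and not part of calibre.

```python
import re

def to_regex_escape(label: str) -> str:
    """Turn a label such as 'U+1F63E' into the 8-digit form '\\U0001f63e'."""
    value = int(label[2:], 16)   # drop the leading 'U+' and parse the hex digits
    return f"\\U{value:08x}"     # zero-pad to exactly 8 hex digits

pattern = to_regex_escape("U+1f63e")
print(pattern)  # \U0001f63e

text = "grumpy \U0001f63e cat"
match = re.search(pattern, text)
print(match.span())  # (7, 8)
```

The same pattern text, `\U0001f63e`, is what goes into the search box.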

Karellen · 08-21-2023, 11:17 PM

Quote:

Originally Posted by kovidgoyal

This is a Python-based regex engine, so you need \U followed by 8 hex digits; for a five-digit code point, insert three leading zeros. For example, for pouting cat face (😾 U+1F63E):

\U0001f63e

Thanks.

Why does the Search find the character, then not recognize it when trying to replace it?

kovidgoyal · 08-21-2023, 11:23 PM

That's a limitation of Qt: with non-BMP Unicode characters, find will select only the first UTF-16 code unit making up the character. I could possibly work around it, but it's a lot of effort for a niche use case. Just use Replace all for this case.
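
To see why a non-BMP character is awkward for a UTF-16 based toolkit, here is a small sketch of how such a character splits into two 16-bit code units (a surrogate pair). It shows only the encoding fact, not Qt's internals; the helper name `utf16_units` is made up for this example.

```python
def utf16_units(ch: str) -> list[str]:
    """Return the UTF-16 code units of a string as 4-digit hex strings."""
    data = ch.encode("utf-16-be")  # big-endian, no byte-order mark
    return [data[i:i + 2].hex() for i in range(0, len(data), 2)]

# A BMP character fits in a single code unit.
print(utf16_units("é"))            # ['00e9']

# A non-BMP character needs two code units (high and low surrogate).
print(utf16_units("\U0001f63e"))   # ['d83d', 'de3e']
```

A selection that covers only the first of those two units does not contain the whole character.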

Karellen · 08-21-2023, 11:37 PM

Quote:

Originally Posted by kovidgoyal

That's a limitation of Qt: with non-BMP Unicode characters, find will select only the first UTF-16 code unit making up the character. I could possibly work around it, but it's a lot of effort for a niche use case. Just use Replace all for this case.

Ok, thanks.

kovidgoyal · 08-21-2023, 11:45 PM

Actually, the workaround turns out to be quite easy:

https://github.com/kovidgoyal/calibr...f210a1129f00ab

Comfy.n · 08-22-2023, 12:46 AM

Quote:

Originally Posted by Karellen

Thanks.

Why does the Search find the character, then not recognize it when trying to replace it?

Today I had a similar issue in Notepad++ while trying to remove the first character of every line in a text file with regex.

Using ^(.), it would mark the first character occurrences in the search panel, but would not actually replace them ('\1').

Then I found two alternative ways to do it: one is by using ^.?(.*) instead, and the other is surprisingly simple - Alt-selecting vertically the text "column" and deleting it.
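
For comparison, both patterns can be tried in Python's `re` module. This is a sketch only: Notepad++ uses a different regex engine, so behaviour there may differ, and the sample text is invented.

```python
import re

text = "1alpha\n2beta\n3gamma"

# Variant 1: match the first character of each line and replace it with nothing.
stripped = re.sub(r"^.", "", text, flags=re.MULTILINE)

# Variant 2: optionally consume one character, keep the rest of the line.
stripped_alt = re.sub(r"^.?(.*)", r"\1", text, flags=re.MULTILINE)

print(stripped)
print(stripped == stripped_alt)  # True
```

With `re.MULTILINE`, `^` anchors at the start of every line, and `.` never matches a newline, so empty lines are left untouched.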

Karellen · 08-22-2023, 02:38 AM

Quote:

Originally Posted by kovidgoyal

Actually the workaround turns out to be quite easy

https://github.com/kovidgoyal/calibr...f210a1129f00ab

Oh, nice!!
Yep, looks like a simple fix... but probably took a bit to figure out that's what was needed.

Quote:

Originally Posted by Comfy.n

and the other is surprisingly simple - Alt-selecting vertically the text "column" and deleting it.

Oh, geez. I didn't even know you could do that.
Thanks for the tip!!

Azraelo · 08-22-2023, 08:22 PM

Thank you for the feedback. With the syntax "\U0001D4B7" it finds some characters.

If I have the text "𝐧𝒐𝗏𝑬𝔩𝓤𝗌𝒷.𝓬𝑶𝐦" and search for this, it shows me 4 occurrences (including wrong ones).
But when replacing, it correctly replaces only the one character it should match.

I assume this also corresponds to your bugfix.

In some cases it doesn't seem to work consistently, but I'm looking forward to this bugfix to get better visual information about the matches. Then I'll continue testing.
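
As a cross-check outside the editor, plain Python matches whole code points, so a single-character pattern should hit exactly once, and a character range can cover the whole Mathematical Alphanumeric Symbols block (U+1D400 to U+1D7FF). The sample string below is a shortened, hypothetical stand-in built from explicit code points, not the exact text from the post.

```python
import re

# Hypothetical sample: five styled letters plus a dot, written as escapes.
sample = "\U0001D427\U0001D490\U0001D5CF\U0001D4B7.\U0001D4EC"

# Single character: MATHEMATICAL SCRIPT SMALL B (U+1D4B7).
single = re.findall(r"\U0001D4B7", sample)
print(len(single))    # 1

# Character range: everything in the Mathematical Alphanumeric Symbols block.
in_block = re.findall(r"[\U0001D400-\U0001D7FF]", sample)
print(len(in_block))  # 5
```

The dot is outside the range, so it is not counted.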


