|
(?<!(;|^))(?!"(;|$))" Demo[^]
"These people looked deep within my soul and assigned me a number based on the order in which I joined."
- Homer
|
|
|
|
|
This is an example of why I always insist on tabs as a separator (not commas nor semi-colons.)
Guus2005 wrote: Remove all double quotes not directly preceded or directly followed by a semicolon
You are trying to solve this incorrectly.
Guus2005 wrote: 1;200;345;"Apotheker "Blue tongue"";"Apeldoorn";12;"ABCD12"
You do not want to "remove" the double quotes because they are part of the value. The following is the correct value from the above.
Apotheker "Blue tongue"
The pattern for the CSV is as follows:
1. Semi-colons separate values.
2. Some values are quoted (double quotes.)
For the second case, the following applies to each value (a single value from the line, not the whole line):
1. The double quotes MUST be at both the start and the end of the value. If either one is missing, the quotes are left alone.
2. In that case the outer double quotes are removed. Internal double quotes are not affected.
Additionally you need to deal with the possibility that there is a semi-colon in the middle of a value.
If there is a semi-colon in a value then I doubt you should be using a regex to parse lines. Certainly if I were doing it I would not use a regex. Rather I would build a parser/tokenizer, since the rules would be easier to see (and debug). It would probably be faster as well.
The tokenizer makes the case with the semi-colon much easier to deal with. In general the tokenizer rules would be:
1. Start at a value boundary (the start of the line or just after a semi-colon.)
2. If the next character is a double quote, set a flag so that the next break is a double quote followed by a semi-colon.
3. If the next character is not a double quote, set a flag so that the next break is the next semi-colon.
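The tokenizer rules above can be sketched in Python roughly as follows. This is only a sketch under a few assumptions: one record per line, a quoted value ends at a double quote followed by a semi-colon (or at the end of the line), and a value whose opening quote is never closed is treated as unquoted. The function name `split_line` is made up for illustration.

```python
def split_line(line, sep=";", quote='"'):
    """Split one line into values, stripping only the outer quotes."""
    values = []
    pos = 0
    while True:
        if line.startswith(quote, pos):
            # Quoted value: the next break is a quote followed by the separator.
            end = line.find(quote + sep, pos + 1)
            if end != -1:
                values.append(line[pos + 1:end])
                pos = end + len(quote) + len(sep)
                continue
            # Last value on the line: the closing quote is the final character.
            if len(line) - pos >= 2 and line.endswith(quote):
                values.append(line[pos + 1:-1])
                return values
            # No closing quote found: fall through and treat it as unquoted.
        end = line.find(sep, pos)
        if end == -1:
            values.append(line[pos:])
            return values
        values.append(line[pos:end])
        pos = end + len(sep)


row = split_line('1;200;345;"Apotheker "Blue tongue"";"Apeldoorn";12;"ABCD12"')
print(row[3])  # Apotheker "Blue tongue"
```

A semi-colon inside a quoted value is handled by the same rule, for example `split_line('"a;b";c')` gives `['a;b', 'c']`. A value that itself contains a quote directly followed by a semi-colon is still ambiguous, and no tokenizer can fix that without a different file format.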
|
|
|
|
|
Do you think this regular expression that I made for US addresses is good enough for 99.9% of addresses out there?
^(?<housenumber>\d{1,5}) (?:(?<predirectional>N|E|S|W|NE|SE|SW|NW) ){0,1}(?<streetname>(?:[A-Z][A-Za-z]{0,40}|(?:[1-9]\d{0,2}(?:st|rd|nd|th)))(?: [A-Z][A-Za-z]{0,40}){0,5}) (?<streettype>Alley|Aly|Annex|Anx|Arcade|Arc|Avenue|Ave|Bayou|Byu|Beach|Bch|Bend|Bnd|Bluff|Blf|Bluffs|Blfs|Bottom|Btm|Boulevard|Blvd|Branch|Br|Bridge|Brg|Brook|Brk|Brooks|Brks|Burg|Bg|Burgs|Bgs|Bypass|Byp|Camp|Cp|Canyon|Cyn|Cape|Cpe|Causeway|Cswy|Center|Ctr|Centers|Ctrs|Circle|Cir|Circles|Cirs|Cliff|Clf|Cliffs|Clfs|Club|Clb|Common|Cmn|Commons|Cmns|Concourse|Conc|Corner|Cor|Corners|Cors|Course|Crse|Court|Ct|Courts|Cts|Cove|Cv|Coves|Cvs|Creek|Crk|Crescent|Cres|Crest|Crst|Crossing|Xing|Crossroad|Xrd|Crossroads|Xrds|Curve|Curv|Dale|Dl|Dam|Dm|Divide|Dv|Drive|Dr|Drives|Drs|Estate|Est|Estates|Ests|Expressway|Expy|Extension|Ext|Extensions|Exts|Fall|Falls|Fl|Ferry|Fry|Field|Fld|Fields|Flds|Flat|Flt|Flats|Flts|Ford|Frd|Fords|Frds|Forest|Frst|Forge|Frg|Forges|Frgs|Fork|Frk|Forks|Frks|Fort|Ft|Freeway|Fwy|Garden|Gdn|Gardens|Gdns|Gateway|Gtwy|Glen|Gln|Glens|Glns|Green|Grn|Greens|Grns|Grove|Grv|Groves|Grvs|Harbor|Hbr|Harbors|Hbrs|Haven|Hvn|Heights|Hts|Highway|Hwy|Hill|Hl|Hills|Hls|Hollow|Holw|Inlet|Inlt|Island|Is|Islands|Iss|Isle|Junction|Jct|Junctions|Jcts|Key|Ky|Keys|Kys|Knoll|Knl|Knolls|Knls|Lake|Lk|Lakes|Lks|Land|Landing|Lndg|Lane|Ln|Light|Lgt|Lights|Lgts|Loaf|Lf|Lock|Lck|Locks|Lcks|Lodge|Ldg|Loop|Mall|Manor|Mnr|Manors|Mnrs|Meadow|Mdw|Meadows|Mdws|Mews|Mill|Ml|Mills|Mls|Mission|Msn|Motorway|Mtwy|Mount|Mt|Mountain|Mtn|Mountains|Mtns|Neck|Nck|Orchard|Orch|Oval|Overpass|Opas|Park|Parks|Parkway|Pkwy|Parkways|Pass|Passage|Psge|Path|Pike|Pine|Pines|Pnes|Place|Pl|Plain|Pln|Plains|Plns|Plaza|Plz|Point|Pt|Points|Pts|Port|Prt|Ports|Prts|Prairie|Pr|Radial|Radl|Ramp|Ranch|Rnch|Rapid|Rpd|Rapids|Rpds|Rest|Rst|Ridge|Rdg|Ridges|Rdgs|River|Riv|Road|Rd|Roads|Rds|Route|Rte|Row|Rue|Run|Shoal|Shl|Shoals|Shls|Shore|Shr|Shores|Shrs|Skyway|Skwy|Spring|Spg|Springs|Spgs|Spur|Spurs|Square|Sq|Squares|Sqs|Station|Sta|Stravenue|Stra|Stream|Strm|Street|St|Streets|Sts|Summit|Smt|Terrace|Ter|Throughway|Trwy|Trace|Trce|Track|Trak|Trafficway|Trfy|Trail|Trl|Trailer|Trlr|Tunnel|Tunl|Turnpike|Tpke|Underpass|Upas|Union|Un|Unions|Uns|Valley|Vly|Valleys|Vlys|Viaduct|Via|View|Vw|Views|Vws|Village|Vill|Vlg|Villages|Vlgs|Ville|Vl|Vista|Vis|Walk|Walks|Wall|Way|Ways|Well|Wl|Wells|Wls)(?: (?<streetnumber>[1-9]\d{0,4}[A-Z]{0,2})){0,1}(?: (?<postdirectional>N|E|S|W|NE|SE|SW|NW)){0,1}$
|
|
|
|
|
Not hardly. I live in an area with many Spanish-style addresses -- Calle This, Avenida That, Caminito The Other. I assume that would be true throughout the southwest. I imagine French-style addresses abound in some parts of the north and Louisiana.
It cannot be done. You'd have a better time using Regular Expressions to parse HTML, and that would only summon Cthulhu. Parsing Html The Cthulhu Way[^]
Edit: Oh, man! I just remembered Palmdale, CA -- look at their street naming convention!
modified 21-Feb-23 21:14pm.
|
|
|
|
|
Well I tried... lol. What is up with their street numbering convention in CA...
Also, I might be forgetting some characters like á and ñ...
|
|
|
|
|
"Urban Planning" (ptui) is simply ridiculous. You wind up with the Esperanto version of a city and no one wants it or asked for it.
|
|
|
|
|
PIEBALDconsult wrote: Oh, man! I just remembered Palmdale, CA -- look at their street naming convention!
Nuts! 
|
|
|
|
|
jpcodex153 wrote: US addresses is good enough
What is even the point? Why do you think you need to validate a postal address at all? What business need are you serving by attempting to validate?
Let's say your app ends up shipping a product to a postal address, so you want to validate that address. There is a service (at least one) that lets you at least check that the US Postal Service recognizes it. That is what you should actually use.
If you do not actually need to deliver something then don't validate it at all.
|
|
|
|
|
Maybe he wants a filter to prevent non-US-style foreign addresses from getting through?
We see that a lot from US web shops: They insist on
- a 'state' level between the city and the country (not accepting a blank, full stop, dash, ... It must be an alphabetic word)
... Norway is not split into 'states'. The counties are never used in mail addresses (and they were reorganized a couple of years ago).
- a five digit zip code.
... Norway uses four digit zip codes. (Usually you can get away with adding a leading zero, which really looks silly.)
- the zip code placed after the state name.
... In Norway, the zip code is written before the City name.
If some non-US-style address is presented, it can be rejected: This is un-American! We do not want to be bothered by un-American stuff!
(Yes, I see the subject line explicitly referring to 'us addresses'. Explicitly saying: We do not care for, and probably will never in the future care for, anything outside the US. We are actively working to keep up the stereotypical image.)
|
|
|
|
|
trønderen wrote: a 'state' level between the city and the country
If that shows up then you probably need to educate the requirements writer.
|
|
|
|
|
Rant about regex taken from the Code Project newsletter links. Figured I might as well put a response here.
Regular Expressions make me feel like a powerful wizard – and that’s not a good thing – Terence Eden’s Blog[^]
"The other day I had to fix a multi-line Regular Expression (RegEx). After a few hours of peering at it with a variety of tools,"
But the author doesn't suggest that there was in fact a different way to approach the problem.
"There's no space for comments"
Well no, that isn't true.
For starters, Perl actually allows embedded comments, although I would not use them.
However I can certainly explain the regex, and often do, outside of the regex itself.
Of course a few people complain that code, all code, should not have comments at all, but that is a personal problem.
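As an illustration of embedded comments, here is a small sketch in Python rather than Perl. Python's `re.VERBOSE` flag ignores unescaped whitespace in the pattern and treats `#` to end of line as a comment. The pattern and the sample file name are only an example, not anything from the blog post being discussed.

```python
import re

# A date such as 2020-12-06, written with one commented piece per line.
DATE_RE = re.compile(
    r"""
    (?P<year>\d{4})    # four-digit year
    -
    (?P<month>\d{2})   # two-digit month
    -
    (?P<day>\d{2})     # two-digit day
    """,
    re.VERBOSE,
)

match = DATE_RE.search("localhost_access_log..2020-12-06.txt")
print(match.group("year"), match.group("month"), match.group("day"))
```

Whether the comments live inside the pattern like this or in ordinary code comments next to it is a matter of taste; either way the claim that there is no space for comments does not hold.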
"Here are some positive use-cases for RegEx":
Obviously just sarcasm, but it misses the point that there is a range of problems where regexes should be used. The alternative to using them would be to write out by hand the code that the regex represents, which would be much more verbose and more likely a source of errors.
There are, however, cases of misuse that the author does not name, where someone thinks they can avoid other types of solutions by incorrectly applying regexes. That is often the case for XML/HTML parsing.
"We should be writing intelligible code for each other"
Agreed. But there are always trade-offs. I can't write code that a junior developer is going to easily understand and still produce code at the rate at which I do. That doesn't mean that I should write code that a mid-level or even senior-level developer is going to have to spend days figuring out. And I do try to make it easier for them to understand (often by adding comments to explain odd constraints and/or external factors.)
|
|
|
|
|
Hello all,
I'm a regex newbie and I'm stuck with regex in PCRE/PHP.
First expression: I'd like to match letters, numbers or underscores, but
- no space at start or end
- no more than one consecutive space
This works fine: ^[\w]+([_\s]{1}[a-zA-Z0-9]+)*$
Second expression:
- no "private" in lower or upper case, i.e. avoid "private", "PRIVATE", "Private" or "pRivATe"
This seems to work fine: ^((?i)(?!private).(?-i))*$
I tried to combine both regexes but cannot get it to work!
By the way, I'm using extendsclass.com online tester.
Thanks for help.
JLO
|
|
|
|
|
Finally, after many trials, I think I found an answer:
^(?=((?i)(?!private).(?-i))*$)[\w]+([_\s]{1}[a-zA-Z0-9]+)*$
Bye
|
|
|
|
|
I didn't attempt to parse that, but there seem to be several constructs in there that would make me nervous.
You have two '$' and only one '^'.
It appears you have two optional clauses. Optional clauses without hard anchors are generally a problem because they likely make the regex engine do a lot of work.
Since you presumably already have a working solution, what makes you think you need to combine them into one expression? Put another way: regex engines use an iterative process to find a match, and more complex expressions, unless carefully crafted, can cause unexpected problems (slowness.)
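Keeping the two checks separate could look something like the sketch below. It is written in Python rather than PCRE/PHP, so treat it as an outline of the idea only. The function name `is_valid` is made up, and `re.ASCII` is used on the assumption that `\w` should only mean `[A-Za-z0-9_]`.

```python
import re

# Check 1: word characters, with single separators (underscore or whitespace)
# between alphanumeric runs. fullmatch() replaces the ^ and $ anchors.
SHAPE_RE = re.compile(r"\w+(?:[_\s][a-zA-Z0-9]+)*", re.ASCII)

# Check 2: the forbidden word, in any mix of upper and lower case.
FORBIDDEN_RE = re.compile(r"private", re.IGNORECASE)


def is_valid(name):
    """Return True if the name has the right shape and no forbidden word."""
    if SHAPE_RE.fullmatch(name) is None:
        return False
    return FORBIDDEN_RE.search(name) is None


for candidate in ["my name", "my  name", " name", "my pRivATe name"]:
    print(repr(candidate), is_valid(candidate))
```

Each pattern stays small enough to read and test on its own, and neither needs a lookahead.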
|
|
|
|
|
I need to find strings like the ones below
mTimerManager
mAutolockManager
but not like
mv this is comment timerManager
anand sunku
|
|
|
|
|
And?
What have you tried, and where are you stuck?
"I need..." is not a question.
|
|
|
|
|
You did not say which regex flavor you are working with.
Your definition of a 'word' is anything except a space, which is not really what 'word' generally means. But that is what I used.
In Perl:
m[^ ]+
The above will match the following (because that is the definition of 'word')
m&extra---stuff.
It is also limited by the following since, again with that definition of word, it is not clear what might be expected
mFirst,mSecond
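The behaviour described above can be checked with a few lines of Python (used here instead of Perl only for convenience; the pattern `m[^ ]+` is the same). The second pattern, `\bm[A-Z]\w*`, is purely an assumption about what the original poster might have wanted, namely an `m` at a word boundary followed by an uppercase letter.

```python
import re

LOOSE = re.compile(r"m[^ ]+")         # 'm' followed by anything except a space
TIGHTER = re.compile(r"\bm[A-Z]\w*")  # assumed intent: m + uppercase + word chars

samples = [
    "mTimerManager",
    "m&extra---stuff.",
    "mFirst,mSecond",
    "mv this is comment timerManager",
]

for text in samples:
    print(repr(text), LOOSE.findall(text), TIGHTER.findall(text))
```

With the loose pattern, `mFirst,mSecond` comes back as a single match because a comma is not a space. The tighter pattern splits it into two names and finds nothing in the comment line.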
|
|
|
|
|
I am trying to write some regex to pull out fields from a set of web pages. The information contained in them can vary: for example, they can have all or only some of the fields (I think I have identified all the possibilities). I think I can deal with this by including all the potential options and having data returned if the field is present, as long as I can figure out how to make them absolute references. The other challenge is that these fields sometimes contain bullet lists with one or more bullet items, which I don't know how to handle. An example is below; I am trying to identify the details associated with "Type of surveyor", "Works for", "Business type", "Surveying services", "Partners and directors", "Accreditations" and "Registered valuer". If anyone can help, that would be greatly appreciated.
<div class="office inner grid">
<!-- Office title -->
<h1 class="office__title grid__col grid__col--md-16 grid__push--md-8">Patterson Surveying</h1>
<!-- Office information -->
<div class="office__content grid__col grid__col--md-16">
<p class="office__about">Patterson Surveying is an independent surveying firm run by Paul Patterson</p>
<section class="office__info">
<div class="office-info__row">
<h3 class="office-info__heading">Type of surveyor</h3>
<div class="office-info__content">
<ul class="bullet-list">
<li class="bullet-list__item">Chartered Valuation Surveyor</li>
</ul>
</div>
</div>
<div class="office-info__row">
<h3 class="office-info__heading">Works for</h3>
<div class="office-info__content">
<ul class="bullet-list">
<li class="bullet-list__item">Residential customers</li>
<li class="bullet-list__item">Commercial contracts</li>
</div>
</div>
<div class="office-info__row">
<h3 class="office-info__heading">Business type</h3>
<div class="office-info__content">
Private Practice
</div>
</div>
<div class="office-info__row">
<h3 class="office-info__heading">Surveying services</h3>
<div class="office-info__content">
<ul class="bullet-list bullet-list--2col">
<li class="bullet-list__item">Building surveying</li>
<li class="bullet-list__item">RICS Home Survey – Level 2</li>
</ul>
</div>
</div>
<div class="office-info__row">
<h3 class="office-info__heading">Partners and Directors </h3>
<div class="office-info__content">
<ul class="bullet-list bullet-list--2col">
<li class="bullet-list__item">Mr P M Patterson MRICS</li>
</ul>
</div>
</div>
<div class="office-info__row">
<h3 class="office-info__heading">Accreditations</h3>
<div class="office-info__content">
<h3>Registered Valuer</h3>
<ul class="bullet-list">
<li class="bullet-list__item">Mr P M Patterson MRICS</li>
</ul>
</div>
</div>
<div class="section section--shaded">
<a name="Contact"></a>
<h3 class="office__title">Contact Patterson Surveying</h3>
|
|
|
|
|
Basically, don't use a Regex: HTML is notorious for being difficult to process effectively if you treat it as text - it pretty much needs a browser engine to mostly render the page before it can be parsed effectively as it contains so many different ways to do anything.
Instead, I'd suggest you use an HTML parser (I use HTMLAgilityPack[^] in C#, but your language may need a different one) and scrape the sites that way - it's a lot easier to work with, and a whole load easier to change when the site admin alters the format, which happens a lot as features are added, removed, modified, or bugs are fixed.
Doing it with a regex means it might work for a week, and then fail - and then the whole regex has to be re-written, re-tested, fixed, and released.
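To give an idea of what the parser approach looks like, here is a rough sketch using Python's standard-library `html.parser` instead of HTMLAgilityPack. The class name `OfficeInfoParser` is made up, the CSS class names are taken from the posted markup, and the sketch assumes every field is an `office-info__heading` followed by an `office-info__content` div.

```python
from html.parser import HTMLParser


class OfficeInfoParser(HTMLParser):
    """Collect heading -> list of values from 'office-info' rows."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._heading = None      # heading of the row being read
        self._in_heading = False  # inside <h3 class="office-info__heading">
        self._div_depth = 0       # > 0 while inside the content <div>

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h3" and "office-info__heading" in classes:
            self._in_heading = True
            self._heading = ""
        elif tag == "div":
            if self._div_depth:
                self._div_depth += 1  # nested div inside the content
            elif "office-info__content" in classes:
                self._div_depth = 1

    def handle_endtag(self, tag):
        if tag == "h3" and self._in_heading:
            self._in_heading = False
            self._heading = self._heading.strip()
            self.fields[self._heading] = []
        elif tag == "div" and self._div_depth:
            self._div_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._in_heading:
            self._heading += data
        elif self._div_depth and self._heading is not None and text:
            self.fields[self._heading].append(text)


SAMPLE = """
<div class="office-info__row">
  <h3 class="office-info__heading">Works for</h3>
  <div class="office-info__content">
    <ul class="bullet-list">
      <li class="bullet-list__item">Residential customers</li>
      <li class="bullet-list__item">Commercial contracts</li>
    </ul>
  </div>
</div>
<div class="office-info__row">
  <h3 class="office-info__heading">Business type</h3>
  <div class="office-info__content">
    Private Practice
  </div>
</div>
"""

parser = OfficeInfoParser()
parser.feed(SAMPLE)
parser.close()
print(parser.fields)
```

Bullet lists with one item or many come out the same way, as a list of values under the heading, and a field that is missing from a page simply does not appear in the dictionary.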
"I have no idea what I did, but I'm taking full credit for it." - ThisOldTony
"Common sense is so rare these days, it should be classified as a super power" - Random T-shirt
AntiTwitter: @DalekDave is now a follower!
|
|
|
|
|
Thanks Original Griff,
That is beyond my technical know-how at this point but I am looking to learn. I am using this within Octoparse which from what I have learnt to date can only use regex to make the fields absolute / more accurate. So I think I am stuck with trying to make it work using regex. Unless anyone knows differently or can help with the regex please?
|
|
|
|
|
This is for my VMware vCenter servers where I am trying to clean out extra log files which are no longer required. The type of files for this example are:
Quote:
/storage/log/vmware/eam/web/localhost_access_log..2020-12-06.txt
/storage/log/vmware/eam/web/localhost_access_log..2020-09-13.txt
/storage/log/vmware/eam/web/localhost_access_log..2020-10-31.txt
/storage/log/vmware/eam/web/localhost_access_log..2020-12-13.txt
/storage/log/vmware/eam/web/localhost_access_log..2020-10-03.txt
/storage/log/vmware/eam/web/localhost_access_log..2020-09-08.txt
/storage/log/vmware/eam/web/localhost_access_log..2020-07-21.txt
/storage/log/vmware/eam/web/localhost_access_log..2020-08-03.txt
/storage/log/vmware/eam/web/localhost_access_log..2020-11-30.txt
/storage/log/vmware/eam/web/localhost_access_log..2020-11-08.txt
/storage/log/vmware/eam/web/localhost_access_log..2020-11-27.txt
/storage/log/vmware/eam/web/localhost_access_log..2020-12-14.txt
/storage/log/vmware/eam/web/localhost_access_log..2020-09-28.txt
/storage/log/vmware/eam/web/localhost_access_log..2020-10-01.txt
/storage/log/vmware/eam/web/localhost_access_log..2020-11-29.txt
/storage/log/vmware/eam/web/localhost_access_log..2020-10-19.txt
/storage/log/vmware/eam/web/localhost_access_log..2020-12-05.txt
The expression which works for me in my Linux file system is this one:
find /storage/log/vmware/ -mtime +10 -type f -name "localhost_access_log..2020-[0-9][0-9]-[0-9][0-9].txt"
It uses the Linux "find" command to find the files; "-mtime +10" limits the results to files last modified more than 10 days ago. I would like to simplify the pattern, and using RegExr: Learn, Build, & Test RegEx[^] as my tester, I found that the following regex works:
localhost_access_log\.\.2020-[0-9]{2}-[0-9]{2}\.txt
However when I try it on my Linux filesystem, it fails to produce results. I get nothing returned.
|
+-- JDMils
|
+-- VB6
+-- VB Dot Net
|
|
|
|
|
|
Go back to the version that does work.
|
|
|
|
|
Unless told otherwise, find's -name test uses file globs, not regexes. To match with a regex, use the -regex test; you can change the regex engine using -regextype, e.g. -regextype posix-extended. find will tell you what regex engines it knows about if you say -regextype help. Possibly one of the engines knows how to parse your regex expression to your liking.
Keep Calm and Carry On
|
|
|
|
|
Thanks K5054, that was the clincher. To find the files I need, I had to perform the following:
* State the Regex Engine as 'posix-extended'
* Put the expression '.*' at the start of the filename as the files are treated as fully qualified filenames (file path & filename).
Thus, I can now use the following:
find /storage/log/vmware/ -type f -regextype posix-extended -regex '.*vpxd-svcs-access-.2022-[0-9]{2}-[0-9]{2}.log.gz'
And....
find /storage/log/vmware/ -type f -regextype posix-extended -regex '.*sps.log.[0-9]{2}.gz'
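The need for the leading '.*' comes from find's -regex test matching against the whole path rather than searching within it. A small Python sketch of the same whole-string behaviour, using `re.fullmatch` as a stand-in for find (an analogy only, the regex dialects are not identical):

```python
import re

path = "/storage/log/vmware/eam/web/localhost_access_log..2020-12-06.txt"

# File name only: fine for a search, but not for a whole-path match.
name_only = re.compile(r"localhost_access_log\.\.2020-[0-9]{2}-[0-9]{2}\.txt")

# Same pattern with '.*' in front to absorb the directory part.
with_prefix = re.compile(r".*localhost_access_log\.\.2020-[0-9]{2}-[0-9]{2}\.txt")

print(name_only.fullmatch(path) is None)        # True: the whole path does not match
print(with_prefix.fullmatch(path) is not None)  # True: the prefix absorbs the directories
```

Note that the dots are escaped here. In the find commands above the unescaped '.' matches any single character, which still finds the files but is looser than intended.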
|
|
|
|
|
|
|
Regular Expression to find parts of a <script/img src=''> or <link href=''> attribute value
Been using my go-to regex101.com editor to work this out, but I always have problems with URLs and filesystem paths. I generally have the 'https' URL/resource in order.
I am trying to read and parse the link 'href' and img/script 'src' attribute values from the elements extracted in the markup.
The groupings/captures I want are
- "path provider" (PowerShell terminology), basically the drive
- The path leading to the file part. I prefer groupings between the path separator "\" or "/", both must be accounted for but will accept a long string
Thus, suppose D:\a\b\c\file.ext
This part can be grouped as '\a\b\c', but if it can be multiple groups '\a', '\b', '\c', even better.
One or more path separators required
- The file basename without path separator
- The file extension with the leading '.' which is the last '.' of the path
My working pattern/RE is: ^([a-zA-Z]:)?(([/\\]?[^/\\]+)*)[/\\]([^.]+).(\S+)$
The pattern might be more specific regular expressions separated by the alternative separator (|) instead of trying to match the strings with a single expression.
I specifically include the '^' and '$' start and end assertions for the markup attribute value.
Test string #1: ${SPREST_JS_FolderPath}/SPListREST.js
- No path provider/drive, so no Group 1 - OK
- Group 2: ${SPREST_JS_FolderPath} # Item (ii)
- Group 3: ${SPREST_JS_FolderPath} # repeat of Group 2 -- not wanted
- Group 4: SPListREST # file basename Item (iii)
- Group 5: js # file type/extension Item (iv)
Test string #2 D:\dev\SharePoint\SPTools\src\pagestyle.css
- Group 1: D: # Item (i)
- Group 2: \dev\SharePoint\SPTools\src # Item (ii) exactly as required if groupings by '\pathseg' not possible
- Group 3: \src # the last path segment--unwanted
- Group 4: pagestyle # file basename Item (iii)
- Group 5: css # file type/extension Item (iv)
Test string #3 ./js/SPREST/SPRestEmail.js
- No path provider/drive, so no group 1
- Group 2: ./js/SPREST # Item (ii) exactly as required if groupings by '\pathseg' not possible
- Group 3: /SPREST # the last path segment--unwanted
- Group 4: SPRestEmail # file basename Item (iii)
- Group 5: js # file type/extension Item (iv)
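One way to get rid of the unwanted repeated group is to capture the directory part as a single named group and split it into segments afterwards, since most regex engines keep only the last repetition of a repeated group. The sketch below is in Python and is only one possible approach; the group names and the function `split_path` are made up for illustration, and unlike the groups listed above the extension is captured with its leading '.'.

```python
import re

PATH_RE = re.compile(
    r"""^
    (?P<drive>[A-Za-z]:)?     # optional drive (path provider), e.g. D:
    (?P<dir>.*)               # everything before the last separator
    [/\\]                     # the last path separator
    (?P<base>[^/\\]*?)        # file basename
    (?P<ext>\.[^./\\]+)?      # extension, starting at the last '.'
    $""",
    re.VERBOSE,
)


def split_path(value):
    """Return (drive, segments, basename, extension) or None if no match."""
    m = PATH_RE.match(value)
    if m is None:
        return None
    # Split the directory part on either separator and drop empty pieces.
    segments = [s for s in re.split(r"[/\\]", m.group("dir")) if s]
    return m.group("drive"), segments, m.group("base"), m.group("ext")


for test in [
    "${SPREST_JS_FolderPath}/SPListREST.js",
    r"D:\dev\SharePoint\SPTools\src\pagestyle.css",
    "./js/SPREST/SPRestEmail.js",
]:
    print(split_path(test))
```

For the second test string this gives the drive `D:`, the segments `dev`, `SharePoint`, `SPTools`, `src`, the basename `pagestyle` and the extension `.css`. A value with no path separator at all does not match, in line with the requirement above.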
[composed in Markdown, so presentation affected by your settings/stylings]
|
|
|
|