I am parsing HTML strings in PHP to extract values and write them to a database. Here is an example string:
<b>Adress:</b> 22 Examplary road, Nowhere <br>
<b>Phone:</b> +371 12345678, +371 23456789<br>
<b>E-mail: </b>[email protected]<br>
The string can be formatted in arbitrary ways. It can contain additional keys that I am not parsing out, and it can contain duplicate keys. It can also contain only some of the keys I am interested in, or be completely empty. The HTML can also be broken (example tag: <br). I have decided to follow the rule that entries are separated by \n and have the form key: value + some HTML.
First, I use this code to make the string parseable:
$parse = strip_tags($string);
$parse = str_replace(':', '=', $parse);
$parse = str_replace("\n", '&', $parse);
$parse = str_replace("\r", '', $parse);
$parse = str_replace("\t", '', $parse);
My string looks something like this now:
Adress= 22 Examplary road, Nowhere&Phone= +371 12345678, +371 23456789&E-mail= [email protected]
Then I use parse_str() to get the values, and I pick out the ones I need if their keys are present:
parse_str($parse, $values);
$address = null;
if (isset($values['Adress']))
{
    $address = trim($values['Adress']);
}
$phone = null;
if (isset($values['Phone']))
{
    $phone = trim($values['Phone']);
}
The problem is that I end up with $phone = '371 12345678, 371 23456789': I lose the + signs. How can I preserve them?
Also, if you have any hints on how to improve this procedure, I would be glad to hear them. Some entries have Website: example.com, others have Web Site example.com... I am pretty sure it will not be possible to parse all of the information automatically, but I am looking for the best possible solution.
Solution
Using tips provided by WEBjuju I am now using this:
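// Note: the pattern requires a trailing \n, so the last entry is only
// matched if the string ends with a newline.
preg_match_all('/([^:]*):\s?(.*)\n/Usi', $string, $matches, PREG_SET_ORDER);
$values = [];
foreach ($matches as $match)
{
    // Normalize the key: no tags, lower case, no whitespace or hyphens.
    $key = strip_tags($match[1]);
    $key = trim($key);
    $key = mb_strtolower($key);
    // str_replace() does not understand "\s", so use a regex for whitespace.
    $key = preg_replace('/\s+/', '', $key);
    $key = str_replace('-', '', $key);
    $value = strip_tags($match[2]);
    $value = trim($value);
    $values[$key] = $value;
}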
This allows me to go from this input:
<b>Venue:</b> The Hall<br
<b>Adress:</b> 22 Examplary road, Nowhere <br>
<b>Phone:</b> +371 12345678<br>
<b>E-mail: </b>[email protected]<br>
<b>Website:</b> <a href="http://example.com/" target="_blank">example.com</a><br>
To a nice PHP array with homogenized and hopefully recognizable keys:
[
'venue' => 'The Hall',
'adress' => '22 Examplary road, Nowhere',
'phone' => '+371 12345678',
'email' => '[email protected]',
'website' => 'example.com',
];
It still doesn't account for the cases of missing colons, but I don't think I can solve that...
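One partial workaround for the missing colons, sketched here under the assumption that the labels of interest are known in advance, is to match each label explicitly and make the colon optional. The $labels map and the parseKnownLabels() helper are hypothetical names for illustration, and the label spellings are assumptions that would need extending for real data.

```php
<?php
// Hypothetical map: normalized key => regex fragment for the label as it
// may appear in the text (assumed spellings, extend as needed).
$labels = [
    'adress'  => 'Adress',
    'phone'   => 'Phone',
    'email'   => 'E-?mail',
    'website' => 'Web\s?site',
];

function parseKnownLabels(string $string, array $labels): array
{
    $values = [];
    // Strip tags line by line so that a broken tag such as "<br"
    // cannot swallow text from the following lines.
    foreach (preg_split('/\R/', $string) as $line) {
        $line = trim(strip_tags($line));
        foreach ($labels as $key => $pattern) {
            // Label at the start of the line, optional colon, then the value.
            if (preg_match('/^' . $pattern . '\s*:?\s*(.+)$/iu', $line, $m)) {
                $values[$key] = trim($m[1]);
                break;
            }
        }
    }
    return $values;
}

$input = "<b>Web Site</b> example.com<br>\n"
       . "<b>Phone:</b> +371 12345678<br\n"
       . "<b>Adress:</b> 22 Examplary road, Nowhere <br>";

$result = parseKnownLabels($input, $labels);
print_r($result);
```

This only recognizes labels listed in the map, so unknown keys are silently skipped, and a later duplicate key overwrites an earlier one.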
Since your HTML is preformatted and conforms to a simple, standard structure, regular expression matching is probably the best way to grab this data. Here is an example to get you on your way - I'm sure it doesn't solve everything, but it addresses the issue in this post, namely finding the key/value matches.
// now go get those matches!
preg_match_all('/<b>([^:]*):\s?<\/b>(.*)<br>/Usi', $string, $matches, PREG_SET_ORDER);
die('<pre>'.print_r($matches,true));
That will output, for instance, something like this:
Array
(
[0] => Array
(
[0] => <b>Adress:</b> 22 Examplary road, Nowhere <br>
[1] => Adress
[2] => 22 Examplary road, Nowhere
)
[1] => Array
(
[0] => <b>Phone:</b> +371 12345678, +371 23456789<br>
[1] => Phone
[2] => +371 12345678, +371 23456789
)
[2] => Array
(
[0] => <b>E-mail: </b>[email protected]<br>
[1] => E-mail
[2] => [email protected]
)
)
And from there, I'd have to guess that you can putt that in for par.
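As for the lost + signs in the original approach: parse_str() URL-decodes its input the way query strings are decoded, and in that encoding a literal + stands for a space. If you would rather keep the parse_str() pipeline, a minimal sketch of a workaround is to percent-encode the plus signs (and any literal % signs) before parsing. The sample string is shortened from the question.

```php
<?php
// Shortened sample input from the question.
$string = "<b>Phone:</b> +371 12345678, +371 23456789<br>\n";

$parse = strip_tags($string);
// Encode characters that parse_str() would otherwise decode:
// '%' first, so the '%2B' added next is not double-encoded.
$parse = str_replace('%', '%25', $parse);
$parse = str_replace('+', '%2B', $parse);
$parse = str_replace(':', '=', $parse);
$parse = str_replace("\n", '&', $parse);
$parse = str_replace(["\r", "\t"], '', $parse);

parse_str($parse, $values);
$phone = isset($values['Phone']) ? trim($values['Phone']) : null;

echo $phone, "\n"; // +371 12345678, +371 23456789
```

Values that contain & or a second : would still be mangled by this pipeline, which is one more reason to prefer the regular expression route.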