Ben Wendt's blog

11. June 2013

Longest Common Substring in PHP

Longest common substring is a function that can be useful once in a while. Here’s a PHP implementation. Be forewarned, this runs in O(mn) time.

function longest_common_substring($string1, $string2) {
    $L = array();
    $length = 0;
    $pos = 0;
    $array1 =str_split($string1);
    $array2 =str_split($string2);
    foreach ($array1 as $i => $c1) { 
        $L[$i] = array();
        foreach ($array2 as $j => $c2) { 
            $L[$i][$j] = 0;
            if ($c1 == $c2) {
                if ($i == 0 || $j == 0) {
                    // initialize that this character position exists.
                    $L[$i][$j] = 1;
                } else {
                    // increment previous or reset.
                    if (isset($L[$i-1][$j-1])) {
                        $L[$i][$j] = $L[$i-1][$j-1] + 1;
                    } else {
                        $L[$i][$j] = 0;
                    }
                }
                if ($L[$i][$j] > $length) {
                    $length = $L[$i][$j];
                }
                if ((isset($L[$i][$j]))&&($L[$i][$j] == $length)) {
                    $pos = $i;
                }
            }
        }
    }
    if ($length > 0) {
        return substr($string1, $pos - $length + 1, $length);
    } else {
        return '';
    }
}

Usage:

$string1 = 'sadjjasdf this is the string  sdlkjhaskl';
$string2 = 'eriuhysdfnbasi this is the stringbhdjubsdi';

echo longest_common_substring($string1, $string2);

10. June 2013

Finding the main content element on a page in javascript

Short of going to something more complex like measuring information or doing some natural language processing, you can estimate which element on a page contains the content by determining which element has the highest ratio of contained content to contained markup. Here’s a javascript snippet that does just that:

// not perfect obviously. Not terrible neither.

var id, tag;
var all = document.querySelectorAll('body *'), max = 0, el, i, L;

// list some commons ids that denote the outermost element on a page.
var badIds = {
    "wrapper" : 1,
    "container" : 1,
    "wrapper-content" : 1
};

// we don't want to include content from certain tags.
var badTags = {
    "SCRIPT" : 1,
    "STYLE" : 1,
    "HEADER" : 1
}

// the goal rate of markup per content
var contentPercent = 0.45;

var contentRatio = function(el) {
    var i, L, totalScript = 0, scripts = el.getElementsByTagName("script");
    for (i =0, L= scripts.length; i < L; i++) {
        totalScript += scripts[i].length;
    }
    totalScript = 0;
    return (el.textContent.length - totalScript) / el.innerHTML.length;
};

for (i = 0, L =all.length; i < L; i++) {
    id = all[i].getAttribute('id');
    tag = all[i].tagName;
    if (all[i].textContent && all[i].textContent.length > max && (contentRatio(all[i]) > contentPercent) && !badIds[id] && !badTags[tag]) {
        max = all[i].textContent.length;
        el = all[i];
    }
}

// show the results.
console.log(el)
console.log(el.textContent.length / el.innerHTML.length)

30. May 2013

Simple Markov Chain in PHP

Here’s a simple Markov chain implementation in PHP, loosely adapted from this excellent write up about implementing Markov chains in javascript:

class Link {
    private $nexts = array();
    public function addNextWord($word) {
        if (!is_string($word)) {
            throw new Exception('addNextWord method in Link class is run with an string parameter');
        }
        if (!isset($this->nexts[$word])) {
            $this->nexts[$word] = 0;
        }
        $this->nexts[$word]++;
    }
    public function getNextWord() {
        $total = 0;
        foreach($this->nexts as $word => $count) {
            $total += $count;
        }
        $randomIndex = rand(1, $total);
        $total = 0;
        foreach($this->nexts as $word => $count) {
            $total += $count;
            if ($total >= $randomIndex) {
                return $word;
            }
        }
    }
}

class Chain {
    private $words = array();
    function __construct($words) {
        if (!is_array($words)) {
            throw new Exception('Chain class is instantiated with an array');
        }

        for($i = 0; $i < count($words); $i++) {
            $word = (string) $words[$i];
            if (!isset($this->words[$word])) {
                $this->words[$word] = new Link();
            }
            if (isset($words[$i + 1])) {
                $this->words[$word]->addNextWord($words[$i + 1]);
            }
        }
    }
    public function getChainOfLength($word, $i) {
        if (!is_string($word)) {
            throw new Exception('getChainOfLength method in Chain class is run with an string parameter');
        }
        if (!is_integer($i)) {
            throw new Exception('getChainOfLength method should be called with an integer');
        }
        if (!isset($this->words[$word])) {
            return '';
        } else {
            $chain = array($word);
            for ($j = 0; $j < $i; $j++) {
                $word = $this->words[$word]->getNextWord();
                $chain[] = $word;
            }
            return implode(' ', $chain);
        }
    }
}

And here is an example of usage:

function get_all_words_in_file($file) {
    return preg_split('/s+/ ', file_get_contents($file));
}

$file = 'testtext2.txt';

$words = get_all_words_in_file($file);
$chain = new Chain($words);
$newSentence = $chain->getChainOfLength('The', 200);
echo wordwrap($newSentence, 80, "n");

Conceptually, a Markov chain captures the idea of likelihood of traversing from state to state. You can populate this data for a block of text by passing through a block of text and counting the number of occurrences of words that follow a given word. You can then use this data to generate new blocks of text.