Code Snippets
Code snippets around interesting algorithms
Breaking a Paragraph into Lines
Category: Dynamic Programming
Given a sequence of words \( w_{1} \dots w_{n} \) we need to insert them into a paragraph where we penalize the empty space at the end of each line. Let the width of each line be \(W\) and the space required to fit words \(w_{i} \dots w_{j}\) be \(y(i,j)\) then the total penalty we incur is \( \sum_{i=0}^{m} P(W, y(i,j)) \) where \(P\) is a generic increasing loss function and \(m\) is the number of lines we output. The lines of the paragraph are defined by a sequence of breakpoints \( \{b_{1}, \dots b_{m}\} \), where, \( b_{i} \) means that we start a new line starting with word \( i \). Finally, the goal is to output the optimal sequence of breakpoints \( b_{1} \dots b_{m} \) so as to minimize the cost function \( P \). In this problem we make the assumption that the first line does not incur any penalty.
We approach this problem using Dynamic Programming. Consider the optimal solution \( \mathcal{O} \) to to this problem, then given we are considering the \( i^{th} \) word we have to decide on which line to insert it. We can create a new line and insert words \( w_{j}, w_{j+1} \dots w_{i} \) for some index \( j \) such that \(y(j,i) \leq W \), in order for the sequence of words to fit in the same line. Let \( q(i)\) be a function that outputs the least \(j\) such that we can fit \(w_{j} \dots w_{i}\) on the same line. Then we have to find the optimal breakpoint \(b_{j}\) and then behave optimally for the sequence \(w_{1} \dots w_{j-1}\). These observations suggest that we should look into subproblems that include the first \(i\) words of the sequence.
Hence, let \( OPT(i) \) be the optimal cost of breaking the sequence of the first \( i \) words into a paragraph. Then if \( i = 0 \) then the cost of breaking no words is also \( 0 \). In general for \( i > 0 \) we get: $$ OPT(i) = \min\limits_{q(i) \leq k \leq i} \{ OPT(k-1) + P(W, y(k,j) \} $$ Our goal is to compute \( OPT(n) \). To do this, based on the above identity we form a \( 1D \) array, say \( M \), to store the optimal value. We proceed to fill out \( M \) in the following order. First we fill in \(M[0]=0\). In general we notice that in order to fill entry \(i\), of \(OPT\), we need to know the value at entries \(q(i),q(i)+1, \dots i-1\). Therefore we can fill \(M\) by increasing \(i\) and compute the cases where \(y(0,n) \leq W\) directly from the above recurrence. In order to calculate a particular entry for the array \(M\), takes \(O(n)\) time since we take a minimum of (at most) \(n\) numbers . Since there is a total of $n$ entries, we get that the running time of this algorithm is \(O(n^{2}) \).
My code for the problem
import numpy as np
def space(index_i, index_j, maxWidth, par):
''' returns the space (characters) needed for words
indexed from i to j to be stored including
spaces betweem different words '''
c = 0
for i in range(index_i, index_j+1):
c += len(par[i])
c += index_j - index_i #gaps between words
return c
def init(maxWidth, par):
'''
Let s(i,j) be the space left if we put words
starting at i and ending at j in the same sentence
Let q(i) be the min indexed word that we could break up
the sentence if we consider a line ending at i
'''
s = np.zeros((len(par), len(par)), dtype = int)
q = np.zeros(len(par), dtype = int)
#initialize s
for i in range(len(par)):
for j in range(len(par)):
if j >= i:
s[i,j] = space(i, j, maxWidth, par)
#initialize q
for j in range(len(par)):
temp = np.Inf
for i in range(j+1):
if temp > i and s[i,j] <= maxWidth:
temp = i
q[j] = temp
return s, q
def cost_third_function(margin):
''' Cost function to penalize "slack" '''
return margin**3
def cost_square_function(margin):
''' Cost function to penalize "slack" '''
return margin**2
def break_par(maxWidth, sentence, penalize_First = False, cost_function = cost_square_function):
'''
outputs a set of breakpoints as well as
the corresponsing optimal score. In this implementation
the first line has no penalty whereas all the other lines have
penalties. The convention is that if the breakpoint is b_i,
then this means that a new line starts at the beginning of word i.
By default we do not penalize the first line and use the square function
to penalize the margin. You can define your costum cost_function and pass it
as a parameter to this function.
'''
par = [i for i in sentence.split()]
#guard against invalid input
for i in par:
if len(i) > maxWidth:
print("Lenght of each word must be less than W. Aborting.")
return
s, q = init(maxWidth, par)
#this takes care of opt[0] = 0
opt = np.zeros(len(par))
b = np.zeros(len(par), dtype = int) #hold breakpoints
for i in range(1, len(par)):
temp_min = np.Inf
for k in range(q[i], i+1):
if penalize_First == False: # if do not penalize the first line
if k == 0 and temp_min > opt[k-1]:
temp_min = opt[k-1] #no penatly in the first line
b[i] = k
elif temp_min > opt[k-1] + cost_function(maxWidth - s[k,i]) :
temp_min = opt[k-1] + cost_function(maxWidth - s[k,i])
b[i] = k
else: # if we want to penalize the first line
if temp_min > opt[k-1] + cost_function(maxWidth - s[k,i]) :
temp_min = opt[k-1] + cost_function(maxWidth - s[k,i])
b[i] = k
opt[i] = temp_min
#subtle point in the code
#we basically do not want all the
#breakpoints in b since we "overwrite" all
#the breakpoints from q[i] to i every time we
#consider the ith word
last_k = b[-1]
good_breaks = []
while last_k >= 1:
good_breaks.append(last_k)
last_k = b[last_k-1]
good_breaks.append(0)
good_breaks.reverse()
return good_breaks, opt[-1]
def print_par(maxWidth, sentence, breakpoints, cost = None):
''' Pretty printing of paragraph given breakpoints '''
par = [i for i in sentence.split()]
print "+++++++++++++++++++++++++++++++++++++"
print "Number of words in paragraph = ", len(par)
print "Width = ", maxWidth
print "breakpoints = ", breakpoints
if cost is not None:
print "Total Cost = ", cost
print "+++++++++++++++++++++++++++++++++++++"
#create upper side of the box
print u'\u2554' + u'\u2550'* (maxWidth) + u'\u2557'
for i in range(len(breakpoints)):
if i == len(breakpoints) - 1: #case of last index dump everything
s = ' '.join(par[breakpoints[i]:])
print u'\u2551'+ s + u'\u25AA' * (maxWidth-len(s)) + u'\u2551'
else:
s = ' '.join(par[ breakpoints[i] : breakpoints[i+1]])
print u'\u2551'+ s + u'\u25AA' * (maxWidth-len(s)) + u'\u2551'
#create lower side of the box
print u'\u255A'+ u'\u2550' * (maxWidth) + u'\u255D'
print '\n'
def main():
W = 8
s2 = 'aa bbbbb c'
#breakpoints2, cost2 = break_par(W, s2, cost_function = cost_third_function)
#print_par(W, s2, breakpoints2, cost2)
s3 = 'aa bbbbb c d'
#breakpoints3, cost3 = break_par(W, s3, cost_function = cost_third_function)
#print_par(W, s3, breakpoints3, cost3)
W = 12
s1 = ' "Do only what only you can do" '
#breakpoints1,cost1 = break_par(W, s1, cost_function = cost_third_function, penalize_First = True)
#print_par(W, s1, breakpoints1, cost1)
W = 40
s0 = '''And Heisenberg said: I remember discussions with
Bohr which went through many hours till
very late at night and ended almost in despair;
and when at the end of the discussion I went alone
for a walk in the neighbouring park I repeated to
myself again and again the question:
Can nature possibly be so absurd as it seemed to
us in these atomic experiments?'''
#breakpoints1,cost1 = break_par(W, s0, penalize_First = True)
#print_par(W, s0, breakpoints1, cost1)
Example Usage with the Cubic loss function
And Heisenberg said: I remember discussions with Bohr which went through many hours till very late at night and ended almost in despair; and when at the end of the discussion I went alone for a walk in the neighbouring park I repeated to myself again and again the question: Can nature possibly be so absurd as it seemed to us in these atomic experiments?