Notes on a kind of string equivalence

The problem I had been battling with started when a friend asked me to write a program that could match the headers of different columns across some CSV files. The headers would be really similar, but not so similar that casting both files' headers to uppercase would resolve it. An example might be “CurveConfig1”, “Curve Config 1”, “curve_config1” and “curveconfig1”. My first idea was to convert them all down to the last format in that list. You have to convert both lists of headers to that final format, as you can't go back the other way: once the spaces, underscores and capitals are gone, there is no way to recover them. Once they are all in this 'normalised' format they can be compared directly.
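To make the idea concrete, here is a minimal sketch of that normalisation in Python. The names `normalise` and `match_headers` are made up for illustration, and the rule used (drop everything that is not a letter or digit, then lowercase) is an assumption about what "the last format" means rather than the code from the original project.

```python
import re


def normalise(header: str) -> str:
    """Reduce a header to lowercase letters and digits only (assumed rule)."""
    # [\W_]+ matches runs of anything that is not a letter or digit,
    # which covers spaces, underscores and punctuation.
    return re.sub(r"[\W_]+", "", header).lower()


def match_headers(left, right):
    """Pair each header in `left` with a header in `right` that has the
    same normalised form. Hypothetical helper, not from the original post."""
    lookup = {normalise(h): h for h in right}
    return {h: lookup[normalise(h)] for h in left if normalise(h) in lookup}


variants = ["CurveConfig1", "Curve Config 1", "curve_config1", "curveconfig1"]
print({normalise(v) for v in variants})
print(match_headers(["Curve Config 1", "Depth"], ["curve_config1", "Time"]))
```

Both lists go through the same one-way function, so the comparison happens entirely in the normalised form. If two headers in `right` normalise to the same string, this sketch keeps only the last one.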

The following is from my notes at the time:

Question 

Does encoding down to the lower-information state make the comparison less reliable, and should there be a hierarchical value (not quite a probability) stating the confidence of the comparison?

If we did then: 

 “CurveConfig1” -> “CurveConfig1” would be 

 “Curve Config 1” -> “CurveConfig1” would be 

 “CurveConfig1” -> “curveconfig1”
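One way such a hierarchy could look is sketched below. The level names and their ordering are assumptions made for illustration; the notes above do not record the actual values. The idea is that the more information has to be thrown away before two headers agree, the lower the confidence.

```python
import re
from enum import IntEnum


class Confidence(IntEnum):
    """Hypothetical confidence levels, highest means least information lost."""
    NONE = 0                # no match even after full normalisation
    CASE_FOLDED = 1         # match only after removing separators and lowercasing
    SEPARATORS_REMOVED = 2  # match after removing spaces, underscores, punctuation
    EXACT = 3               # strings are identical


def strip_separators(header: str) -> str:
    # Remove everything that is not a letter or digit, keeping case.
    return re.sub(r"[\W_]+", "", header)


def confidence(a: str, b: str) -> Confidence:
    """Return the highest level at which the two headers agree."""
    if a == b:
        return Confidence.EXACT
    a_stripped, b_stripped = strip_separators(a), strip_separators(b)
    if a_stripped == b_stripped:
        return Confidence.SEPARATORS_REMOVED
    if a_stripped.lower() == b_stripped.lower():
        return Confidence.CASE_FOLDED
    return Confidence.NONE


print(confidence("CurveConfig1", "CurveConfig1").name)
print(confidence("Curve Config 1", "CurveConfig1").name)
print(confidence("CurveConfig1", "curveconfig1").name)
```

Because the levels are an `IntEnum`, they can be compared or sorted, so a caller could accept only matches at or above a chosen level.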

I recently came back to this after not having looked at the problem in over a year and thought it would be a good experiment to see how quickly I could code up something this simple. I was surprised that it didn't take too long. It's the perfect experiment in which to use TDD and I'm glad I did because I think that was core to how I sketched it out so quickly.
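As an illustration of how a test-first approach might look for a function this small, here is a sketch using Python's built-in `unittest`. The test cases come from the example headers earlier in the post; the function name `normalise` and its exact rule are assumptions, not the original code.

```python
import re
import unittest


def normalise(header: str) -> str:
    """Assumed normalisation: keep only letters and digits, lowercased."""
    return re.sub(r"[\W_]+", "", header).lower()


class NormaliseTests(unittest.TestCase):
    """Tests written first, one per header variant from the example."""

    def test_camel_case(self):
        self.assertEqual(normalise("CurveConfig1"), "curveconfig1")

    def test_spaces(self):
        self.assertEqual(normalise("Curve Config 1"), "curveconfig1")

    def test_underscores(self):
        self.assertEqual(normalise("curve_config1"), "curveconfig1")

    def test_already_normalised(self):
        self.assertEqual(normalise("curveconfig1"), "curveconfig1")

    def test_different_headers_stay_different(self):
        self.assertNotEqual(normalise("CurveConfig1"), normalise("CurveConfig2"))
```

Saved to a file, the tests can be run with `python -m unittest <module name>`. In a TDD workflow, each test would be written and seen to fail before the line in `normalise` that makes it pass.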

The following is what I came up with in Python:

The functional code being:


I then thought that, since this is a nice concise function, it would be interesting to see how long it would take me to sketch it up in a second language. I do find it frustrating when simple functions like this are discussed online without something embedded to play with. It didn't take too long to convert, but the conversion really highlights how many built-in functions Python provides.

JavaScript version:
