Deep Dive into String Manipulation
- Austin De Witt
- Jun 25, 2018
- 3 min read
Updated: Jun 27, 2018
An old client who knew about the data work I've been getting into reached out. They were hand-joining over 100k lines of data in Excel and were hoping to automate things a bit.
I wanted to give it a shot and quickly found myself tangled in lots of very dirty, string-based data. Really, really dirty data.

Nearly 500 lines of SQL code later, I was able to automate 25% of the work, which will save 60 hours of time and labor costs annually. I'm hoping to bump it up to 50% before the next quarterly reporting is due, but I'm very happy with the results!
PROJECT OVERVIEW
The project uses proprietary data that cannot be shared, but I was given the okay to share the SQL work. The data below has been modified to be demonstrative.
The problem with the data was twofold:
1) No primary or foreign keys to join on, meaning the tables had to be joined using string data
2) Non-uniform, incorrect, or missing labels within the strings
Below is an example of the data from the two tables that needed joining.
Table 1 has the name of every power line they did work around and the starting and ending poles (Work Start and Work Stop)

Table 2 has a list of every span* from substation** to substation for each power line
* Span: Stretch of line indicated by the poles on either end
** Power lines typically begin and end at substations, transferring power across the state
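The real screenshots can't be reproduced here, so below is an invented sketch of the two shapes; the column names and values are made up purely for illustration.

-- Table 1 (Projects): one row per power line worked on
--   line_name | work_start | work_stop
--   LINE 12   | *N1234A    | N1301 SUB
--
-- Table 2 (Lines): one row per span on each power line
--   line_name | span
--   LINE 12   | N1234 - N1235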

DATA ISSUES
As you can see from the above data, there's A LOT of information packed into both the Span column and the Work Start and Work Stop columns. The first step is to pull out the individual pieces of information locked inside this string data.
A few things make this difficult; let's just look at the Projects table:
1) Between 2 and 4 unique pieces of information per pole (Work Start/Work Stop)
2) Sometimes a prefix ( : ; ! * ) or suffix (A, B, C, x, * ) is important, sometimes not
3) Start and Stop locations sometimes listed in reverse
4) Start or Stop location entered incorrectly so it doesn't actually exist
5) Substations identified by characters (SUB or TERM)
The Lines table has nearly all the same issues, but there are two poles per span! Double the fun...
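To make those issues concrete, here's a small, invented example of the kind of check involved, assuming a PostgreSQL-style dialect and the placeholder columns sketched earlier; the real values and rules differ.

-- Invented examples of the value patterns described above:
--   '*N1234A'    -> prefix '*' and suffix 'A' may or may not matter
--   'N1301 SUB'  -> the text 'SUB' marks a substation
--   'N2200 TERM' -> the text 'TERM' marks a substation
-- Flag which Work Start entries mention a substation:
SELECT work_start,
       work_start ~ '(SUB|TERM)' AS touches_substation
FROM projects;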
THE GOAL
Realistically, automating the entire process is not in the picture, as the client is a contractor and unable to change how the data is formatted by their larger client. But it doesn't need to be! Automating just the cleanest portions of the work, between 25% and 50% of the data, would be a huge improvement, reducing costs and freeing up time that is better spent elsewhere.
So here's how I went about doing it!
INTO THE CODE
The full code can be found here: https://github.com/dewitta/reporting_automation
1) Create two tables in SQL and read in .CSV files
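Roughly like this, assuming a PostgreSQL-style setup; the table and column names are placeholders, not the client's real schema.

-- Create the two working tables...
CREATE TABLE projects (
    line_name  TEXT,
    work_start TEXT,
    work_stop  TEXT
);

CREATE TABLE lines (
    line_name TEXT,
    span      TEXT
);

-- ...and load the CSV exports with psql's \copy
\copy projects FROM 'projects.csv' WITH CSV HEADER
\copy lines FROM 'lines.csv' WITH CSV HEADER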

2) Initial cleaning of all string data: there were end-of-line and tab characters present
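Something along these lines, using PostgreSQL string functions on the placeholder columns from step 1:

-- Strip stray newline, carriage return, and tab characters, then trim whitespace
UPDATE projects
SET work_start = trim(regexp_replace(work_start, E'[\n\r\t]', '', 'g')),
    work_stop  = trim(regexp_replace(work_stop,  E'[\n\r\t]', '', 'g'));

UPDATE lines
SET span = trim(regexp_replace(span, E'[\n\r\t]', '', 'g'));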

3) Pull out specific portions within pole string identifiers based on multiple conditions
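The real conditions depend on the client's labeling, so the branches below are invented stand-ins, but the shape of the actual code is the same: a CASE expression per new column, keyed on regexp tests.

-- Add a working column and pull the core pole identifier out of Work Start
ALTER TABLE projects ADD COLUMN start_pole TEXT;

UPDATE projects
SET start_pole = CASE
        -- rows that reference a substation keep a simple marker (invented rule)
        WHEN work_start ~ '(SUB|TERM)' THEN 'SUBSTATION'
        -- otherwise keep the pole number plus any suffix letter
        ELSE substring(work_start FROM '[0-9]+[A-Z]?')
    END;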

4) Conditional string cleaning: lots of ifs and regexps.
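The actual rules belong to the client, so these two are invented stand-ins, but every real statement looked like this: one narrow condition, one small fix.

-- Drop a trailing 'X' marker where it turned out not to matter (invented rule)
UPDATE projects
SET start_pole = regexp_replace(start_pole, 'X$', '')
WHERE start_pole ~ 'X$';

-- Strip a trailing A/B/C suffix only when the bare pole number appears in the Lines table (invented rule)
UPDATE projects p
SET start_pole = left(start_pole, -1)
WHERE start_pole ~ '[ABC]$'
  AND EXISTS (SELECT 1 FROM lines l
              WHERE l.span LIKE '%' || left(p.start_pole, -1) || '%');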

This might seem inefficient, but writing the code this way made finding and fixing any issues very simple. Each line does a specific cleaning task and can easily be added to or edited.
5) Join the data and limit to where spans are between starting and ending points
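A sketch of the join, assuming the cleaned pole identifiers have been cast to numbers on both sides (start_num, stop_num, span_start_num, and span_stop_num are placeholder columns, not the real names):

-- Keep only the spans that sit between the cleaned start and stop poles.
-- least()/greatest() cover the rows where start and stop were entered in reverse.
CREATE TABLE matched_spans AS
SELECT p.line_name, l.span
FROM projects p
JOIN lines l ON l.line_name = p.line_name
WHERE l.span_start_num BETWEEN least(p.start_num, p.stop_num)
                           AND greatest(p.start_num, p.stop_num)
  AND l.span_stop_num  BETWEEN least(p.start_num, p.stop_num)
                           AND greatest(p.start_num, p.stop_num);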
6) Export to .csv
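And with psql again (the file name is a placeholder):

-- Write the matched spans back out as a CSV for the quarterly report
\copy matched_spans TO 'matched_spans.csv' WITH CSV HEADER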

TAKEAWAYS
This data was a major pain, and I loved it.
This was the messiest string data I had worked with and I learned a ton! Here are some highlights:
1. KISS - Keep It Simple, Stupid! But seriously, on my first attempt I tried to solve everything at once. It was a waste because I couldn't troubleshoot it easily. I gave up and started over using what I learned, writing each step separately and updating columns one step at a time.
2. Name variables consistently. This project required very similar operations to be repeated with minor tweaks. By naming consistently and only changing a letter or two between tables, I was quickly able to double and then quadruple the code I wrote, saving multiple hours.
3. Document heavily. This was a long SQL script, nearly 500 lines. I found it very helpful to have BIG EXAGGERATED headers for different sections that listed out exactly what stage I was at and the process happening below.
THE RESULT
This is work that has to be done quarterly. Automating just 25% of the work saves 60 hours of time and pay annually. That's 1.5 weeks of an employee's time that can be focused on more important tasks. Bumping that number up to 50% would mean 3 weeks of time saved each year.