After writing my last post, I was so ready to dive into my own digital history project. Enough exercises! I was ready to show the mess that is the Shawville Equity OCR transcription who’s boss. I was ready to have all my data sparkly and clean. I was ready to succeed.
I must have forgotten whose class I’m in.
We started this week with a class-wide fail: our DHBox, the computer we’re all interfacing with to access a Unix command line, went down on us. At the time of my writing, we’re approaching the 48-hour deadline for this module and it isn’t yet back up. Luckily, the exercises for this week were doable on our own computers, even my Windows laptop. Unluckily for me, I felt the need to poke and prod at my data more than Notepad++ allowed. I wound up installing Cygwin, basically a Unix command line for Windows, and some of the packages we’ve been using on the DHBox (nano, wget and curl among them) through this Lifehacker tutorial, some trial, and even more error.
Like last week, it felt good to find a solution (my classmate Jeff Blackadar also found one by downloading Ubuntu through the Windows store, proving that there’s always more than one creative solution to a computing problem). Unlike last week’s ten magic lines of Python, though, I found my solution opened up even more problems. No, that’s the wrong way of looking at it – I should say my solution opened up even more possibilities for failure. Those look a lot like problems when you’re starting out.
For example, once I finished the exercises for this module, it was time to apply the lessons I learned to my .txt copies of The Shawville Equity. For my project, I want to discover more about local community events advertised and covered by the Equity. I started by looking for mentions of days of the week, commonly used in newspapers to refer to recent or upcoming events. I used regex patterns to find all instances of “Monday,” “Tuesday,” “Thursday,” etc., thinking that if I could isolate the words around weekday words, I could determine what kinds of events were being talked about. As I tried to do that, though, I ran into a lot of problems with the OCR and with cleaning data in general. Of note:
- Cleaning data takes a long time. I wrote earlier in the week about my fascination with XML. When I started looking through the Equity, my first instinct was to XML my way to data management, as M. H. Beals put it. I started tagging the days of the week in the January 9, 1890 issue of the Equity. Even with regex finding the strings for me, making sure that the string I found was a day of the week (and not an instance of “today” or “holiday,” which my regex pattern also turned up) meant some manual checking.
- Cleaning data takes a lot of thought. I could have streamlined my regex and maybe even used a replacement function to tag days of the week automatically across multiple files. I realized as I was tagging, though, that just tagging the dates wasn’t going to help much. Sure, it would make finding instances of weekdays easier in future searches (as I could just search for everything between <date> and </date> tags), but without an overall structure to my XML document, which can get very in-depth, those scattered tags wouldn’t help much. For example, if I don’t separate out each article with <div> tags, then XML analysis programs have no way of linking a date to the article in which it appears. I played around with adding <div> and <p> and <dateline> and <headline> tags to the January 9, 1890 issue, but I quickly realized that unless I wanted to spend the rest of the summer XMLing newspapers, I had to come up with a smarter way of cleaning my data. This piecemeal method wasn’t going to work.
- When data is messy, sometimes it’s really messy. I lauded messy data in my last post and in my annotations on Paige Morgan’s “This talk doesn’t have a name.” I considered the digital text as artifact and considered the interesting scholarly implications of messy OCR. I still agree with what I wrote, but I’ve realized that messiness in data is sometimes just a hindrance to research (which is why we clean it, obviously). This is especially true when the OCR is as bad as the Equity’s. The OCR read across multiple columns on a lot of issues, meaning that lines from different articles are all mixed together. This problem is especially bad in the “Local News” section of the paper – the exact section I’m interested in. Following one of Dr. Graham’s exercises, I tried isolating the lines containing days of the week in the Equity (a process I can do very quickly thanks to command line tools like sed and grep), but a line referring to “next Monday” is just as likely to contain text from the neighbouring article as from the article about whatever’s going on next Monday. Some of my classmates and I are looking into solutions to this on our class Slack group, but for now it looks like we have the choice of taking a lot of time to sort out individual articles (which, if it doesn’t defeat the purpose of distant reading, sure makes it a hell of a lot more time-consuming) or taking the bad OCR as it is.
- Cleaning techniques don’t like ambiguity. Regex is complicated. OpenRefine hides complicated data-cleaning algorithms behind a nice interface. I’m still trying to work out how best to use them. I haven’t even had complete success finding and isolating days of the week: sometimes the OCR has mangled the words beyond recognition; other times, as I mentioned, I pick up words I don’t mean to. How on earth am I going to isolate a nebulous category like “social events”? Sure, I could make a list of every single word that might refer to a community get-together. But, as Ted Underwood points out, how am I to know that these are the words that the Equity would have used to describe social events? How many words would I be missing? How much would my bias influence the data? At the moment, it doesn’t seem like a viable strategy. I may try topic modelling next week and see if those methodologies serve my purpose better.
- It’s hard to make data speak to a historical argument. Dr. Graham posted links to a number of articles this week about the need to connect data and data-based methodologies to historical argumentation. I agree wholeheartedly – I took this class because I wanted to enhance, not replace, my work as a student of history. Actually doing it, though, is hard, especially at the early stage of data cleanup. Of course, it’s important to have a plan during the cleanup stage, but it’s overwhelming – in part for all the reasons I listed above. I’m primarily a textual historian, so playing around with text – rearranging it, deleting parts, and adding code – sometimes feels anathema. What if my intervention skews my results? Is the Equity still the Equity if I cut it up into CSVs? Dr. Graham and Matthew D. Lincoln remind us that cleaning data is a method of interpretation, just as valid as the interpretation we do on standardized texts. It certainly feels different, though, and it’s something I’ve yet to get used to.
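For anyone who wants to try the weekday search described in the list above, here is a minimal Python sketch. It is not the pattern that was actually run on the Equity files (that pattern isn’t recorded here). The function names `tag_weekdays` and `weekday_contexts` and the sample sentence are made up for illustration. The sketch shows how word boundaries (`\b`) keep “today” and “holiday” out of the results, how a replacement can wrap matches in `<date>` tags, and how to pull a few words of context around each hit. It assumes the weekday is spelled correctly, so it does nothing for words the OCR has mangled.

```python
import re

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# \b marks a word boundary, so only whole weekday names match;
# words that merely end in "day" ("today", "holiday") are skipped.
WEEKDAY_RE = re.compile(r"\b(" + "|".join(WEEKDAYS) + r")\b", re.IGNORECASE)


def tag_weekdays(text):
    """Wrap each whole-word weekday in <date>...</date> tags."""
    return WEEKDAY_RE.sub(r"<date>\1</date>", text)


def weekday_contexts(text, window=5):
    """Return each weekday with up to `window` words on either side."""
    words = text.split()
    hits = []
    for i, word in enumerate(words):
        if WEEKDAY_RE.search(word):
            start = max(0, i - window)
            hits.append(" ".join(words[start:i + window + 1]))
    return hits


# Invented sample text, not a quotation from the Equity.
sample = ("The social is today at the hall; "
          "the holiday concert is next Monday evening.")

print(tag_weekdays(sample))
print(weekday_contexts(sample, window=2))
```

The context window only helps when the words next to a weekday belong to the same article, which the multi-column OCR problem described above does not guarantee.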
Read more about the process in this week’s fail log.
All in all, this week has seen little product and a lot of problems. I worried that I’d have nothing to submit and nothing to write at the end of it – not just because DHBox was down, but because I couldn’t get anything to work. Yet looking at my wall of text above, I wouldn’t call this week unproductive after all. In fact, I think I learned a lot. I still don’t know where to go from here, but I know a little better where I stand in relation to the Equity and its damn OCR.
If it seems like I’m buying Dr. Graham’s “productive failure” model hook, line, and sinker, it’s because I am. I wouldn’t have learned much this week if I hadn’t just tried some things out. I enjoyed finding out what worked and what didn’t, and I look forward to finding out where my classmates are at with their exploration of the Equity. Data is messy, the command line is confusing, and I don’t know what I’m going to do next. That’s okay, though. I don’t need to succeed. I just need to do something.