How did I spend my summer, you ask? Working on a project that required a lot of research… and a lot of development, which is generally what you would expect from an R&D Lab.
At the beginning of my internship with the New York Times R&D Lab, I set to work on figuring out how to computationally categorize images from the New York Times Archive as articles, photographs, or advertisements. This project was presented to me with the words “potentially impossible” attached. I had no experience with any sort of image analysis, but I did know how to code. So I did what I do best: drafted up a bunch of scripts and pressed go, hoping for some conclusive output (there may have been some research and carefully crafted code involved, too). Thirty scripts, decades of sample data, and one broken machine later, some patterns started emerging. Eventually, I was able to develop an analysis algorithm that classified images correctly 95% of the time.
This process basically boiled down to extracting zones from newspaper pages in the archive, running scripts that did all sorts of data analysis on them, and looking for patterns. OpenCV, Tesseract-OCR, and I became very good friends. Presently, I have an algorithm running on ~3.8 million newspaper pages, classifying approximately 40 million images. Eventually, people will be able to request just the photos from the archive for a given time period, just the ads, or just the articles! Why would this be useful? Well, besides learning how those 20th-century readers avoided “the dangers of denture breath,” it is completely up to you. Although this is for internal use for now, it will free up new information for many future studies.
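To give a feel for what "extract zones, compute some numbers, look for patterns" can look like in code, here is a minimal sketch in plain Python. To be clear, this is not the algorithm from the project: the `Zone` class, the two features, and both thresholds are made up for illustration, and the character count is assumed to come from some OCR step (such as Tesseract) that is not shown here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Zone:
    """A rectangular region cut from a page scan (hypothetical structure)."""
    pixels: List[List[int]]  # grayscale values, 0 (black) to 255 (white)
    ocr_char_count: int      # characters an OCR engine reported for this zone


def zone_features(zone: Zone) -> Dict[str, float]:
    """Compute two toy features for a zone."""
    flat = [p for row in zone.pixels for p in row]
    if not flat:
        raise ValueError("zone has no pixels")
    area = len(flat)
    return {
        # Recognized characters per pixel: high for dense body text.
        "text_density": zone.ocr_char_count / area,
        # Share of mid-gray pixels: halftone photos tend to have many,
        # while printed text is mostly black ink on white paper.
        "midtone_ratio": sum(1 for p in flat if 64 <= p <= 192) / area,
    }


def classify(zone: Zone,
             text_threshold: float = 0.01,
             midtone_threshold: float = 0.5) -> str:
    """Label a zone with simple threshold rules (thresholds are invented)."""
    f = zone_features(zone)
    if f["midtone_ratio"] >= midtone_threshold and f["text_density"] < text_threshold:
        return "photograph"
    if f["text_density"] >= text_threshold:
        return "article"
    return "advertisement"


if __name__ == "__main__":
    gray_block = Zone(pixels=[[128] * 10 for _ in range(10)], ocr_char_count=0)
    print(classify(gray_block))
```

In practice, the interesting part is the "looking for patterns" step: picking features and thresholds by running them over lots of sample zones and seeing what actually separates the categories.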
I was the only person on my team dedicating time to this project, but I had an awesome group of makers around me who were always eager to help. This kind of atmosphere was all that I’d hoped for in an internship. This project is also a wonderful example of writing code that contributes to something larger than yourself, and larger than most things you’ll probably touch in a classroom, which is why it’s incredibly important for all programmers to work on projects outside of class. The best thing about programming is that you are perpetually learning. Running into the inevitable roadblocks can be discouraging, but they really just enhance your ability to google the answer... er, problem solve. So, get out and code!
Conclusion: Did I expect to be liberating and categorizing millions of images from the confines of their .tiff prisons this summer? Not at all. But it was awesome.
You can follow my future projects and tech ramblings on Twitter: @rehilee
Hey all, took a pretty long blogging hiatus but have done some cool stuff in the past year. Let me start catching you up!
First off, check out my jewelry store on Etsy, PterofractylArt: http://www.etsy.com/shop/PterofractylArt
We specialize in chemical structures.
If you like what you see, check us out on Facebook: https://www.facebook.com/pterofractylart. The store is always changing, so check back frequently. If you want something you don’t see, just let us know!