Clean the Data! Story time


In another life, I became a "data guy". In 2009 I was in my second year with the Fiscal and Economic Research Center (FERC) at the University of Wisconsin at Whitewater, working with a professor doing studies on education using open data. All 421 school districts in Wisconsin were legally obligated to report certain "open" data, things like standardized test results, free lunch program data, and other anonymized statistics about their inputs and "results".

My first job was to go to the website for each of these 421 school districts and download the data. Once we did that, we could run the analysis on it... right?


Data is not enough, it must be "structured"

So it turned out that while all 421 school districts were reporting all the mandatory data - not a single pair of them were reporting it in the same "way", with the same column structure, in the same order, with the same format. It was all different. Every single one. Some were csv files, some were xlsx files, some were txt files, and so on.

So my job was to "clean the data" - which basically means getting it all into the same format. Once that is done, regression analysis is easy, but before that it's impossible. I spent months learning to work with SPSS and cleaning data before I could even put my econometrics regression analysis skills into practice. It was a very "real world application" for me, tying everything I had ever learned in the classroom together and... throwing it out the window immediately for some practical obstacle.
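
To make "same format" concrete, here is a minimal sketch in Python (not the SPSS I actually used). It assumes a small canonical column list and a table of header aliases; every column name and alias below is hypothetical, invented for illustration, and not taken from the real district reports.

```python
import csv
import io

# Hypothetical canonical schema; the real reports had other fields.
CANONICAL = ["district", "year", "test_score", "free_lunch_pct"]

# Hypothetical header variants, each mapped to a canonical column name.
ALIASES = {
    "district": "district",
    "district name": "district",
    "dist": "district",
    "year": "year",
    "school year": "year",
    "test score": "test_score",
    "avg score": "test_score",
    "free lunch pct": "free_lunch_pct",
    "free lunch %": "free_lunch_pct",
}


def normalize_header(name):
    """Trim, lowercase, and treat underscores as spaces, then look up the alias.

    Returns None for headers that are not in ALIASES.
    """
    key = name.strip().lower().replace("_", " ")
    return ALIASES.get(key)


def clean_rows(text, delimiter=","):
    """Yield each data row as a dict keyed by CANONICAL, in CANONICAL order.

    Source column order does not matter; unknown columns are dropped and
    missing ones are filled with an empty string.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = [normalize_header(h) for h in next(reader)]
    for raw in reader:
        row = {canon: value.strip() for canon, value in zip(header, raw) if canon}
        yield {col: row.get(col, "") for col in CANONICAL}
```

Calling clean_rows on a comma-separated file and again on a tab-delimited file (with delimiter="\t") gives rows with identical keys in identical order, which is the only thing the regression step cares about. An xlsx file would need a separate reader in front of this, which the sketch leaves out.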


Unstructured Data vs Data that has not yet been Structured

Officially, unstructured data is data that cannot be structured, not data that you can get structured with an undergraduate assistant. But practically, it's the same thing - you cannot run the analysis on it - so you have to get it structured. There are a number of ways to do this. For example, you could retype it all into a new csv or Excel file, this time ensuring that every one has the same format. But this way has its drawbacks.

For example, assuming we were doing this analysis on the 2010 data, we could - with some effort from an undergraduate tryhard - manually fix all the data. But what do we do "next year" with all the data from 2011? Are we doomed to repeat this manual process every year for the rest of the history of the Fiscal and Economic Research Center budget?

Instead, I was taught how to make data cleaning scripts with a program called SPSS.
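
The point of a script over retyping is that the rules get written once and rerun on each new year's files. Here is a rough Python analogue of that idea (again, not the SPSS syntax I actually wrote). The district ids, column names, and the RECIPES table are all hypothetical, and the sketch assumes each district keeps the same layout from year to year.

```python
import csv
import io

# Hypothetical per-district recipes: written once, reused on every year's files.
# Each recipe records the file's delimiter and how to rename its columns.
RECIPES = {
    "district_a": {
        "delimiter": ",",
        "rename": {"Dist": "district", "Score": "test_score"},
    },
    "district_b": {
        "delimiter": "\t",
        "rename": {"NAME": "district", "AVG_SCORE": "test_score"},
    },
}


def clean_file(district_id, year, text):
    """Apply one district's recipe to its raw text and return uniform rows."""
    recipe = RECIPES[district_id]
    reader = csv.DictReader(io.StringIO(text), delimiter=recipe["delimiter"])
    rows = []
    for raw in reader:
        # Keep only the columns named in the recipe, under their new names.
        row = {new: raw[old].strip() for old, new in recipe["rename"].items()}
        row["year"] = year
        rows.append(row)
    return rows


def clean_year(year, raw_files):
    """Clean every district's file for one year.

    raw_files maps a district id to that district's raw file contents.
    """
    combined = []
    for district_id in sorted(raw_files):
        combined.extend(clean_file(district_id, year, raw_files[district_id]))
    return combined
```

With this shape, handling 2011 is the same clean_year call with 2011's files. The manual effort only comes back when a district changes its layout, and then only that district's recipe needs updating.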



Now SPSS does a lot of other things too, but we used it mostly for cleaning data, as the professor had access to other, more powerful, analytics programs once the data was clean.

I spent many, many hours of my life earning beer and rent money by cleaning data. And one day, after nearly 18 months of work on several years of school district data, the professor (who I still remember fondly for all the things he taught me) said, "We could write a paper together". And you know what I did?

I dropped out of college and ran away to South America without money or a plan.



So long, no thanks for the cubicle.

And I haven't thought about it in a while, though what I learned about data and analysis there has always stuck with me, always been "a part" of my tool box or skill set. It wasn't until recently that I really remembered this whole story - when @thecrazygm and I bumped up against the widest and least organized data set I have ever seen. We are interested in working with this data - it's open source data - and we will.

Right after I clean it.

Freedom and Friendship

7 comments

Good Luck and Godspeed to us both, it is no small feat. 😂


Data cleaning is certainly a useful skill to have under your belt. I really like the idea of data cleaning applications or scripts, especially considering how often I've cleaned data manually...lol! May you make your chaotic data immaculate! 😁 🙏 💚 ✨ 🤙


not a single pair of them were reporting it in the same "way", with the same column structure, in the same order, with the same format. It was all different. Every single one

Welcome to my world. This has happened so many times!


Have you read the paper "Tidy Data"?

Everyone should read it. It isn't just about data, but about observation, and how each data point should be an observation.


I will read it!


You might want to consider using RStudio with the Tidyverse package. I found it is a very strong tool for data cleaning, and it is completely free. I haven't used SPSS though, so I can't compare. Hope you get your work done soon and with the least pain possible.
Good luck


I used R for a project a few years later - I love it!! Appreciate you sharing that!


Thank you, I had to learn R cuz of work requirements, and when I discovered RStudio and Tidyverse it was a game changer for working with data.

hahahahaha, I love it:

I dropped out of college and ran away to South America without money or a plan.

You make my day, my friend. ❤️


Previously, there was no such software, so it was quite difficult to clean up data from every single source. Now, many different tools have come up. We can search and find some very good software that makes this work easy.
