In another life, I became a "data guy". In 2009 I was in my second year with the Fiscal and Economic Research Center (FERC) at the University of Wisconsin at Whitewater, working with a professor doing studies on education using open data. All 421 school districts in Wisconsin were legally obligated to report certain "open" data, things like standardized test results, free lunch program data, and other anonymized statistics about their inputs and "results".
My first job was to go to the website for each of these 421 school districts and download the data. Once we did that, we could run analysis on it... right?
It turned out that while all 421 school districts were reporting the mandatory data, no two of them were reporting it the same "way", with the same column structure, in the same order, in the same format. It was all different. Every single one. Some were CSV files, some were XLSX files, some were TXT files, and so on.
So my job was to "clean the data" - which basically means getting it all into the same format. Once that is done, regression analysis is easy, but before that - it's impossible. I spent months learning to work with SPSS and cleaning data before I could even put my econometrics regression analysis skills into practice. It was a very "real world application" for me, tying together everything I had ever learned in the classroom and... throwing it out the window immediately for some practical obstacle.
Officially, unstructured data is data that has no predefined structure at all, not data that you can get structured with an undergraduate assistant. But practically, it's the same thing - you cannot run the analysis on data in 421 different layouts - so you have to get it structured. There are a number of ways to do this. For example, you could retype it all into a new CSV or Excel file, this time ensuring that every record has the same format. But this way has its drawbacks.
For example, assuming we were doing this analysis on the 2010 data, we could - with some effort of an undergraduate tryhard - manually fix all the data. But what do we do "next year" with all the data from 2011? Are we doomed to repeat this manual process every year for the rest of the history of the Fiscal and Economic Research Center budget?
Instead, I was taught how to make data cleaning scripts with a program called SPSS.
Now SPSS does a lot of other things too, but we used it mostly for cleaning data, as the professor had access to other, more powerful, analytics programs once the data was clean.
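For anyone curious what a cleaning script like that looks like, here is a minimal sketch in Python rather than SPSS. It is not the script we used at FERC. The column names, the alias table, and the two sample "district files" are all made up for illustration. The idea is just that every district's header spelling gets mapped to one canonical name, and every row comes out in the same column order, so next year's files can go through the same script.

```python
import csv
import io

# Hypothetical canonical schema: these column names are illustrative only.
CANONICAL_COLUMNS = ["district", "year", "enrollment", "free_lunch_pct"]

# Hypothetical aliases: each district's header spelling -> canonical name.
ALIASES = {
    "district": "district",
    "district name": "district",
    "dist": "district",
    "year": "year",
    "school year": "year",
    "enrollment": "enrollment",
    "total enrolled": "enrollment",
    "free_lunch_pct": "free_lunch_pct",
    "free lunch %": "free_lunch_pct",
}


def clean_file(text, delimiter=","):
    """Return rows as dicts with canonical column names in a fixed order."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    cleaned = []
    for raw in reader:
        # Start every row with the same keys, in the same order.
        row = dict.fromkeys(CANONICAL_COLUMNS, "")
        for header, value in raw.items():
            if header is None or value is None:
                continue  # ragged row: extra or missing cells
            key = ALIASES.get(header.strip().lower())
            if key is not None:
                # Trim whitespace and a trailing percent sign.
                row[key] = value.strip().rstrip("%")
        cleaned.append(row)
    return cleaned


# Two made-up district files: different headers, column order, and delimiter.
district_a = (
    "District Name,School Year,Total Enrolled,Free Lunch %\n"
    "Alpha,2010,1200,35%\n"
)
district_b = (
    "free_lunch_pct\tdist\tyear\tenrollment\n"
    "41.5\tBeta\t2010\t800\n"
)

rows = clean_file(district_a) + clean_file(district_b, delimiter="\t")
for row in rows:
    print(row)
```

When a new district shows up with yet another header spelling, the only change is one more entry in the alias table, which is the whole advantage over retyping everything by hand.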
I spent many, many hours of my life earning beer and rent money by cleaning data. And one day, after nearly 18 months of work on several years of school district data, the professor (who I still remember fondly for all the things he taught me) said, "We could write a paper together". And you know what I did?
I dropped out of college and ran away to South America without money or a plan.
And I haven't thought about it in a while, though what I learned about data and analysis there has always stuck with me, always been "a part" of my tool box or skill set. It wasn't until recently that I really remembered this whole story - when @thecrazygm and I bumped up against the widest and least organized data set I have ever seen. We are interested in working with this data - it's open source data - and we will.
Right after I clean it.
Good Luck and Godspeed to us both, it is no small feat. 😂
Data cleaning is certainly a useful skill to have under your belt. I really like the idea of data cleaning applications or scripts, especially considering how often I've cleaned data manually...lol! May you make your chaotic data immaculate! 😁 🙏 💚 ✨ 🤙
Welcome to my world. This has happened so many times!
Have you read the paper "Tidy Data"?
Everyone should read it. It isn't just about data, but about observation, and how each data point should be an observation.
I will read it!
You might want to consider using RStudio with the tidyverse package. I found it is a very strong tool for data cleaning, and it is completely free. I haven't used SPSS though, so I can't compare. Hope you get your work done soon and with the least pain possible.
Good luck
I used R for a project a few years later - I love it!! Appreciate you sharing that!
Thank you, I had to learn R cuz of work requirements and when I discovered
It was a game changer for working with data.
hahahahaha, I love it:
You make my day, my friend. ❤️
Previously, there was no such software, so it was quite difficult to clean data from every single source. Now a lot of different software has come out. If we search, we can find some very good software that makes this work easy.