bryce

GSOD Data Processor

Why?

If you read the word salad above for Weathered, you know the motivation for the project. But the reason I wanted to make this data processor was to solve a different problem and learn some new things.

NOAA keeps the GSOD dataset on a public file share. Each station gets its own CSV file, named after its station ID, and each row of that file is a different day of the year. There are over 13,000 stations at the time of writing, and each station's CSV file is placed in a folder named after the year the data was recorded.

There are also GZipped Tar archives of each year of data, updated whenever updates come, I suppose. I need to get those and parse the CSVs for all that sweet, sweet data.

How?

My server runs this program from an hourly cron job. The program first parses the HTML listing on the file share and checks whether the current year's GZipped Tar archive has an update date/time newer than the one seen by the previous hour's check. If not, we're done!
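To make the "is there anything new?" check concrete, here is a minimal sketch in Python (the real program is .NET and uses HtmlAgilityPack, so this is only an illustration of the idea, not the actual code). It assumes the file share serves an Apache-style index page where each link is followed by a `YYYY-MM-DD HH:MM` timestamp; the function names `archive_modified_at` and `needs_download` are made up for this example.

```python
import re
from datetime import datetime
from typing import Optional

# Assumption: each row of the listing shows a "YYYY-MM-DD HH:MM" stamp
# somewhere after the link to the file.
TIMESTAMP = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2})")


def archive_modified_at(listing_html: str, archive_name: str) -> Optional[datetime]:
    """Return the timestamp listed after the link to archive_name, or None."""
    link = re.search(r'href="' + re.escape(archive_name) + r'"', listing_html)
    if link is None:
        return None
    # Take the first timestamp that appears after the link.
    stamp = TIMESTAMP.search(listing_html, link.end())
    if stamp is None:
        return None
    return datetime.strptime(stamp.group(1), "%Y-%m-%d %H:%M")


def needs_download(listed: Optional[datetime], last_seen: Optional[datetime]) -> bool:
    """True when the listing shows a newer archive than the previous check saw."""
    if listed is None:
        return False  # archive not in the listing, nothing to fetch
    return last_seen is None or listed > last_seen
```

The timestamp from the previous run would have to be stored somewhere between cron invocations (a small file or a database document would both work); that part is left out here.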

If there is an updated file available, the program downloads it, decompresses it, then parses each CSV file in the archive. Rows from each file dated within the last 8 days are loaded into memory. I then query the database for records that do not have yesterday's data and update them with whatever is available.
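The decompress-and-filter step could look roughly like the Python sketch below (again, the real program does this in .NET with SharpZipLib and TinyCsvParser). It assumes the archive has already been downloaded into memory, that each CSV has a header row with a `DATE` column in `YYYY-MM-DD` form, and that the file name minus `.csv` is the station ID; `recent_rows` is a made-up name for this example.

```python
import csv
import io
import tarfile
from datetime import date, timedelta


def recent_rows(archive_bytes: bytes, today: date, max_age_days: int = 8) -> dict:
    """Map station ID to the rows of its CSV dated within max_age_days of today."""
    cutoff = today - timedelta(days=max_age_days)
    recent = {}
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as archive:
        for member in archive:
            if not (member.isfile() and member.name.endswith(".csv")):
                continue
            # Assumption: the file name without ".csv" is the station ID.
            station_id = member.name.rsplit("/", 1)[-1][: -len(".csv")]
            text = io.TextIOWrapper(
                archive.extractfile(member), encoding="utf-8", newline=""
            )
            rows = [
                row
                for row in csv.DictReader(text)
                if date.fromisoformat(row["DATE"]) >= cutoff
            ]
            if rows:
                recent[station_id] = rows
    return recent
```

Stations with no rows inside the window are dropped, so the result only holds what might need to be written to the database.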

There are definitely performance improvements that can and should be made, but I just wanted to get this done fairly quickly. I want to optimize it as much as I can, then rewrite it in another language for fun.

With What?

.NET 8.0, SharpZipLib, HtmlAgilityPack, TinyCsvParser, MongoDB