Open Food Facts Data Analysis

As a self-taught nutrition enthusiast, I have been contributing to the crowdsourced India database of Open Food Facts - "the Wikipedia of food", by adding and editing products. The decentralized open source project has data on more than 3 million products (as of April 2024) from countries around the world.

Open Food Facts (OFF) data is released as Open Data: it can be reused freely by anyone, under the Open Database License (ODBL).

I've been cleaning up & compiling the data by food categories to extract insights.

There are multiple ways to get the data:

CSV or Excel format using the advanced search feature
MongoDB daily export - more than 30GB uncompressed!
JSONL daily export
CSV daily export
API

As I was interested in select product features and macronutrient details, I decided to use the OFF API. This is how the process for compiling data for each food category currently goes:

1. I construct the following URL with the fields I require to fetch the nutrition data as JSON -

https://in.openfoodfacts.org/cgi/search.pl?search_terms=Breakfast Cereals&fields=code,brands,quantity,product_name,energy-kcal_100g,carbohydrates_100g,added-sugars_100g,fat_100g,calcium_100g,cholesterol_100g,saturated-fat_100g,proteins_100g,salt_100g,sugars_100g,fiber_100g,nutriscore_grade,nova_group&json=1&sort_by=last_modified_t&page_size=500

2. I then use the Data.Page Online JSON to CSV conversion tool to convert the JSON and provide me with a downloadable CSV file. Alternatively, you can use this Python script to convert the JSON output to a CSV file. AI assistants can also help in the direct conversion if the size of the output is small.

3. I edit the CSV file using Excel or Google Sheets to clean the data - add minor missing details, review anomalies and fix inconsistencies (sometimes sodium values may be mistakenly specified in grams instead of mg), simultaneously edit the data on the OFF website and rearrange the columns so that there is some order. As the data on OFF is crowdsourced, checking for outliers (by filtering & looking for extreme values in the ranges) helps in weeding out anomalies.

4. I upload the CSV data to Github following a folder structure matching the Food Safety and Standards Authority of India (FSSAI) categorization for packaged food products. The CSV file is then available in a neat tabular format through a publicly accessible URL that looks like this - https://github.com/mvark/indiafoodstats/blob/main/1.%20Dairy%20products%20and%20analogues/Milk.csv

5. To better appreciate, visualize and analyze the product data, I use the Flat Viewer tool by just prefixing "flat" to the CSV file on Github, thus - https://flatgithub.com/mvark/indiafoodstats/blob/main/1.%20Dairy%20products%20and%20analogues/Milk.csv

As I browse, sort and filter I jot down my notes about the patterns and number ranges that emerge from viewing the histograms generated by the tool.

Histograms generated by the Flat Viewer tool

Also see -

Search This Blog

Tech Tips, Tricks & Trivia

Open Food Facts Data Analysis

Comments

Post a Comment

Popular posts from this blog

Things Near Me – Find & Learn About Landmarks Nearby

What is the difference between Browser Mode & Document Mode in IE

Tech-No-Logic