Open Food Facts Data Analysis
As a self-taught nutrition enthusiast, I have been contributing to the crowdsourced India database of Open Food Facts, "the Wikipedia of food", by adding and editing products. The decentralized, open-source project has data on more than 3 million products (as of April 2024) from countries around the world.
Open Food Facts (OFF) data is released as Open Data: it can be reused freely by anyone under the Open Database License (ODbL).
I've been cleaning up & compiling the data by food categories to extract insights.
There are multiple ways to get the data:
- CSV or Excel format using the advanced search feature
- MongoDB daily export - more than 30GB uncompressed!
- JSONL daily export
- CSV daily export
- API
As I was interested in select product features and macronutrient details, I decided to use the OFF API. This is how the process for compiling data for each food category currently goes:
1. I construct the following URL with the fields I require to fetch the nutrition data as JSON (a Python sketch of this fetch appears after the list) -
https://in.openfoodfacts.org/cgi/search.pl?search_terms=Breakfast Cereals&fields=code,brands,quantity,product_name,energy-kcal_100g,carbohydrates_100g,added-sugars_100g,fat_100g,calcium_100g,cholesterol_100g,saturated-fat_100g,proteins_100g,salt_100g,sugars_100g,fiber_100g,nutriscore_grade,nova_group&json=1&sort_by=last_modified_t&page_size=500
2. I then use the Data.Page online JSON to CSV conversion tool to convert the JSON and give me a downloadable CSV file. Alternatively, you can use this Python script to convert the JSON output to a CSV file; a minimal version is sketched after the list. AI assistants can also handle the conversion directly if the output is small.
3. I edit the CSV file using Excel or Google Sheets to clean the data: I add minor missing details, review anomalies and fix inconsistencies (sometimes sodium values are mistakenly specified in grams instead of milligrams), edit the data on the OFF website in parallel, and rearrange the columns into a sensible order. As the data on OFF is crowdsourced, checking for outliers (by filtering and looking for extreme values in the ranges) helps in weeding out anomalies; a pandas sketch of such checks also appears after the list.
4. I upload the CSV data to GitHub, following a folder structure that matches the Food Safety and Standards Authority of India (FSSAI) categorization for packaged food products. The CSV file is then available in a neat tabular format through a publicly accessible URL that looks like this - https://github.com/mvark/indiafoodstats/blob/main/1.%20Dairy%20products%20and%20analogues/Milk.csv
5. To better appreciate, visualize and analyze the product data, I use the Flat Viewer tool by simply prefixing "flat" to "github.com" in the CSV file's URL, thus - https://flatgithub.com/mvark/indiafoodstats/blob/main/1.%20Dairy%20products%20and%20analogues/Milk.csv
As I browse, sort and filter, I jot down notes about the patterns and number ranges that emerge from the histograms generated by the tool (a pandas version of this exploration is sketched below).
Histograms generated by the Flat Viewer tool
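For anyone who prefers to script these steps, here are a few minimal Python sketches. This first one reproduces the step 1 fetch with the requests library; the field list mirrors the URL above, while the output file name, timeout and page size are my own illustrative choices, and the response is assumed to return the matching items under a "products" key.

```python
# A minimal sketch of step 1: fetching the search results with requests
# instead of pasting the URL into a browser.
import json
import requests

FIELDS = ",".join([
    "code", "brands", "quantity", "product_name",
    "energy-kcal_100g", "carbohydrates_100g", "added-sugars_100g",
    "fat_100g", "calcium_100g", "cholesterol_100g", "saturated-fat_100g",
    "proteins_100g", "salt_100g", "sugars_100g", "fiber_100g",
    "nutriscore_grade", "nova_group",
])

params = {
    "search_terms": "Breakfast Cereals",
    "fields": FIELDS,
    "json": 1,
    "sort_by": "last_modified_t",
    "page_size": 500,
}

resp = requests.get("https://in.openfoodfacts.org/cgi/search.pl",
                    params=params, timeout=60)
resp.raise_for_status()

# The matching items are assumed to sit under the "products" key.
products = resp.json().get("products", [])
print(f"Fetched {len(products)} products")

# Save the raw JSON so it can be converted to CSV in the next step.
with open("breakfast_cereals.json", "w", encoding="utf-8") as f:
    json.dump(products, f, ensure_ascii=False, indent=2)
```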
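For step 2, a sketch of the JSON-to-CSV conversion using only the standard library. The file names are the ones assumed in the previous sketch, and the code tolerates nutrient values appearing either at the top level or nested under a "nutriments" key, since the exact shape depends on the API response.

```python
# A minimal sketch of step 2: converting the saved JSON to a flat CSV.
import csv
import json

COLUMNS = [
    "code", "brands", "quantity", "product_name",
    "energy-kcal_100g", "carbohydrates_100g", "added-sugars_100g",
    "fat_100g", "calcium_100g", "cholesterol_100g", "saturated-fat_100g",
    "proteins_100g", "salt_100g", "sugars_100g", "fiber_100g",
    "nutriscore_grade", "nova_group",
]

with open("breakfast_cereals.json", encoding="utf-8") as f:
    products = json.load(f)

with open("breakfast_cereals.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=COLUMNS)
    writer.writeheader()
    for product in products:
        # Merge any nested "nutriments" values with the top-level fields so
        # every requested column can be looked up flatly; missing values
        # are written as blanks.
        flat = {**product, **product.get("nutriments", {})}
        writer.writerow({col: flat.get(col, "") for col in COLUMNS})
```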
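For the sanity checks in step 3, a pandas sketch that flags implausible values before I review and fix them on the OFF site. The thresholds are illustrative assumptions, not official limits; for example, a salt value far above what any packaged food could contain usually means milligrams were entered as grams.

```python
# A minimal sketch of step 3: flagging extreme values for manual review.
import pandas as pd

df = pd.read_csv("breakfast_cereals.csv")

# Illustrative plausibility limits per 100 g; anything above is suspect.
CHECKS = {
    "salt_100g": 25,      # likely a mg value entered as g
    "sugars_100g": 100,   # cannot exceed 100 g per 100 g
    "fat_100g": 100,
    "proteins_100g": 100,
}

for column, limit in CHECKS.items():
    if column in df.columns:
        values = pd.to_numeric(df[column], errors="coerce")
        suspect = df[values > limit]
        if not suspect.empty:
            print(f"{column}: {len(suspect)} suspect rows")
            print(suspect[["code", "product_name", column]]
                  .to_string(index=False))
```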
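And for the exploration in step 5, a pandas/matplotlib sketch that reads the published CSV straight from GitHub and draws per-column histograms similar to the ones Flat Viewer shows. The raw.githubusercontent.com URL is my derivation of the raw-file counterpart of the github.com link above.

```python
# A minimal sketch of step 5: summary statistics and histograms from the
# published CSV, as a local alternative to Flat Viewer.
import matplotlib.pyplot as plt
import pandas as pd

URL = ("https://raw.githubusercontent.com/mvark/indiafoodstats/main/"
       "1.%20Dairy%20products%20and%20analogues/Milk.csv")

df = pd.read_csv(URL)

# Quick numeric summary: ranges, medians and quartiles per nutrient column.
print(df.describe().T)

# One histogram per numeric column, roughly what Flat Viewer renders.
df.select_dtypes("number").hist(figsize=(12, 8), bins=20)
plt.tight_layout()
plt.show()
```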