Study of MySQL Data Loading

By: cfcodefans

MySQL

While preparing another blog post about the spatial features of MySQL, I started with data collection, and then wondered how capable MySQL could be at ETL (Extract/Transform/Load)  [+].

First, I found some geo data from AshKyd  [+].

I picked the low-resolution geo data; please refer to here  [+]. The data looks like this:

{"type":"FeatureCollection",
    "features":[{"type":"Feature",
        "properties":{"scalerank":1,
            "featurecla":"Admin-0 country", "labelrank":4,
            "sovereignt":"Central African Republic", "sov_a3":"CAF",
            "type":"Sovereign country", "admin":"Central African Republic",
            "name":"Central African Rep.","name_long":"Central African Republic",
            "brk_group":null,"abbrev":"C.A.R.",
            "postal":"CF","formal_en":"Central African Republic",
            "note_brk":null,"name_sort":"Central African Republic",
            "pop_est":4511488, "gdp_md_est":3198,"pop_year":-99,
            "lastcensus":2003,"gdp_year":-99,
            "economy":"7. Least developed region","income_grp":"5. Low income",
...lots of properties
            "region_un":"Africa","subregion":"Middle Africa",
            "region_wb":"Sub-Saharan Africa","name_len":20,
            "tiny":-99,"homepart":1,"filename":"CAF.geojson"},
            "geometry":{"type":"Polygon",
                "coordinates":[[[15.279460483469109,7.421924546737969],[16.10623172370677,7.497087917506505],[16.290561557691888,7.754307359239306],[16.456184523187346,7.734773667832968],[16.705988396886255,7.508327541529979],[17.964929640380888,7.890914008002866],[18.38955488452322,8.281303615751824],[18.911021762780507,8.630894680206353],[18.81200971850927,8.982914536978598],[19.09400800952602,9.07484691002584],[20.05968549976427,9.012706000194854],[21.000868361096167,9.475985215691509],[21.723821648859452,10.567055568885976],[22.231129184668788,10.97188873946051],[22.864165480244225,11.142395127807546],[22.97754357269261,10.71446259199854],[23.554304233502194,10.089255275915308],[23.557249790142826,9.681218166538684],[23.394779087017184,9.265067857292223],[23.459012892355986,8.954285793488893],[23.805813429466752,8.666318874542426],[24.567369012152085,8.229187933785468],[25.11493248871679,7.825104071479174],[25.124130893664727,7.500085150579436],[25.79664798351118,6.979315904158071],[26.213418409945117,6.546603298362072],[26.46590945812323,5.94671743410187],[27.21340905122517,5.550953477394557],[27.37422610851749,5.233944403500061],[27.04406538260471,5.127852688004836],[26.402760857862543,5.150874538590871],[25.650455356557472,5.256087754737123],[25.278798455514302,5.170408229997192],[25.12883344900328,4.927244777847789],[24.805028924262416,4.89724660890235],[24.410531040146253,5.108784084489129],[23.29721398285014,4.609693101414223],[22.841479526468106,4.710126247573484],[22.70412356943629,4.633050848810157],[22.405123732195538,4.029160061047321],[21.659122755630023,4.22434194581372],[20.927591180106276,4.322785549329737],[20.290679152108936,4.691677761245288],[19.467783644293146,5.03152781821278],[18.93231245288476,4.709506130385975],[18.54298221199778,4.201785183118318],[18.45306521980993,3.504385891123349],[17.809900343505262,3.56019643799857],[17.133042433346304,3.728196519379452],[16.537058139724135,3.198254706226279],[16.012852410555354,2.267639675298085],[15.907380812247652,2.557389431158612],[15.86273237474748,3.013537298998983],[15.405395948964383,3.33530060466434],[15.036219516671252,3.851367295747124],[14.950953403389661,4.210389309094921],[14.478372430080467,4.732605495620447],[14.558935988023505,5.03059764243153],[14.459407179429348,5.4517605656103],[14.536560092841112,6.22695872642069],[14.776545444404576,6.408498033062045],[15.279460483469109,7.421924546737969]]]}}
    ]}

The data file is downloaded to some path "/xxx/yyy/xxx/geo.json". Hereafter let us use only MySQL to process this data. First we would like to create a table to accommodate the JSON data in the file (the SQL is here  [+]):

CREATE TABLE `general_data` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `name` varchar(256) DEFAULT NULL,
  `json_data` json DEFAULT NULL,
  -- ...
  PRIMARY KEY (`id`),
  UNIQUE KEY `id_UNIQUE` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
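
Since json_data is declared with the json type, the server validates every document on the way in. A hypothetical smoke test like the following (the 'smoke-test' row is made up, not part of the original data) confirms the table behaves as expected:

-- hypothetical smoke test: a json column accepts well-formed documents...
insert into general_data (name, json_data)
values ('smoke-test', '{"type": "FeatureCollection", "features": []}');
-- ...and rejects malformed ones with ER_INVALID_JSON_TEXT:
-- insert into general_data (name, json_data) values ('bad', '{oops');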

ETL starts with loading data, and MySQL's data-loading feature, detailed here  [+], is very versatile. Let us just load the file directly into the general_data table:

-- location of the data file; on Windows, backslashes in the path must be
-- doubled, e.g. "D:\\source\\data\\geo.json" (forward slashes also work)
load data infile "/xxx/yyy/xxx/geo.json"
into table general_data (json_data)
set name='raw-country-geo-data'; -- other columns can be set while the file content is loaded into the "json_data" column
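
Note that LOAD DATA splits its input into rows on newlines by default, so a pretty-printed, multi-line JSON file would arrive as many broken fragments. If the downloaded file is not all on one line, a sketch using the LOAD_FILE() function (which reads a whole file into a single string and is subject to the same file-access rules) can be used instead:

-- alternative sketch: LOAD_FILE() reads the entire file into one string,
-- so the document stays intact even if it spans many lines; it returns
-- NULL if the server is not allowed to (or cannot) read the file
insert into general_data (name, json_data)
values ('raw-country-geo-data', load_file('/xxx/yyy/xxx/geo.json'));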

This fails if the MySQL server lacks the secure_file_priv configuration  [+]. This option controls whether the MySQL server can read/write files in certain folders. I believe it is fine to relax it when MySQL is just a data tool on my laptop rather than the data warehouse of a million-dollar business... (please DO NOT do this in production without discretion).
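
The current setting can be inspected from any client. secure_file_priv is read-only at runtime, so changing it means editing the server configuration (e.g. secure_file_priv = "" under the [mysqld] section of my.cnf) and restarting the server:

-- NULL disables file import/export entirely, an empty string lifts the
-- restriction, and a directory path limits file access to that folder
show variables like 'secure_file_priv';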

Now the data in the general_data table looks like this:

id | name                 | json_data
1  | raw-country-geo-data | {"type": "FeatureCollection", "features": [{"type": "Feature", "geometry": {"type": "Polygon", "coordinates": [[[15.279460483469109, 7.421924546737969], [16.10623172370677, 7.497087917506505], [16.290561557691888, 7.754307359239306], [16.456184523187346, 7.734773667832968], [16.705988396886255, 7.508327541529979], [17.964929640380888, 7.890914008002866], [18.38955488452322, 8.281303615751824], [18.911021762780507, 8.630894680206353], [18.81200971850927, 8.982914536978598], [19.09400800952602, 9.07484691002584], [20.05968549976427, 9.012706000194854], [21.000868361096167, 9.475985215691509], [21.723821648859452, 10.567055568885976], [22.231129184668788, 10.97...... a lot of data
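
As a quick sanity check on the load, JSON_LENGTH can count the elements of the features array inside the stored document; if everything arrived intact, it should report one entry per country:

-- count the features (countries) inside the loaded json document
select json_length(json_data, '$.features') as feature_count
from general_data
where name = 'raw-country-geo-data';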

The data still needs to be broken down into finer pieces. Here the Extract step of ETL uses MySQL's JSON functions  [+].

select
json_data->>"$.features[0].type" as _type,
json_data->>"$.features[0].properties" as properties,
json_data->>"$.features[0].geometry" as geometry
from general_data where name = 'raw-country-geo-data';
_type   | properties | geometry
Feature | {"name": "Central African Rep.", "tiny": -99, "type": "Sovereign country", "admin": "Central African Republic", "gu_a3": "CAF", "level": 2, "su_a3": "CAF", "un_a3": "140", "wb_a2": "CF", "wb_a3": "CAF", "abbrev": "C.A.R.", "brk_a3": "CAF", "iso_a2": "CF", "iso_a3": "CAF", "iso_n3": "140", "postal": "CF", "sov_a3": "CAF", "su_dif": 0, "woe_id": -99, "adm0_a3": "CAF", "economy": "7. Least developed region", "fips_10": null, "geounit": "Central African Republic", "pop_est": 4511488, "subunit": "Central African Republic", "adm0_dif": 0, "brk_diff": 0, "brk_name": "Central African Rep.", "filename": "CAF.geojson", "gdp_year": -99, "geou_dif": 0, "homepart": 1, "long_len": 24, "name_alt": null, "name_len": 20, "note_brk": null, "pop_year": -99, ....} | {"type": "Polygon", "coordinates": [[[15.279460483469109, 7.421924546737969], [16.10623172370677, 7.497087917506505], [16.290561557691888, 7.754307359239306], [16.456184523187346, 7.734773667832968], [16.705988396886255, 7.508327541529979], [17.964929640380888, 7.890914008002866], [18.38955488452322, 8.281303615751824], [18.911021762780507, 8.630894680206353], [18.81200971850927, 8.982914536978598], [19.09400800952602, 9.07484691002584], [20.05968549976427, 9.012706000194854], [21.000868361096167, 9.475985215691509], [21.723821648859452, 10.567055568885976], [22.231129184668788, 10.97188873946051], [22.864165480244225, 11.142395127807546], [22.97754357269261, 10.71446259199854], [23.554304233502194, 10.089255275915308], [23.557249790142823 ....
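
As a side note, the ->> operator used above is shorthand for JSON_UNQUOTE(JSON_EXTRACT(...)); a small comparison on the same document shows the equivalence:

-- ->> extracts a value and strips the surrounding quotes; the two
-- expressions below return the same unquoted text
select json_data->>"$.features[0].properties.name" as name_a,
       json_unquote(json_extract(json_data, '$.features[0].properties.name')) as name_b
from general_data
where name = 'raw-country-geo-data';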

This is just the data for one country, and we need finer data. Specifically, json_data is a single row that contains the geo data of 175 countries; we need to break it into 175 rows with the json_table function  [+].

select geo_jsons.*
from general_data, json_table(general_data.json_data,
    -- json_table takes the json_data column from the general_data table (placed before it
    -- in the FROM clause) and yields its result as a table; the path "$.features[*]"
    -- addresses every element of the features array, producing one output row each
    "$.features[*]" COLUMNS(
        -- ...
        continent varchar(32) path "$.properties.continent", -- each country's continent, extracted as a continent column
        economy varchar(256) path "$.properties.economy",
        featurecla varchar(256) path "$.properties.featurecla",
        filename varchar(256) path "$.properties.filename",
        fips_10 varchar(32) path "$.properties.fips_10",
        formal_en varchar(256) path "$.properties.formal_en",
        formal_fr varchar(256) path "$.properties.formal_fr",
        gdp_md_est int(11) path "$.properties.gdp_md_est",
        gdp_year int(11) path "$.properties.gdp_year",
        geou_dif int(11) path "$.properties.geou_dif"
        -- ...
    )
) as geo_jsons;

The result looks like this:

abbrev     | admin                            | continent | economy                   | gdp_md_est | gdp_year | border | others
C.A.R.     | Central African Republic         | Africa    | 7. Least developed region | 3198       | -99      | [[[15.279460483469109, 7.421924546737969], ...  | ...
Bwa.       | Botswana                         | Africa    | 6. Developing region      | 27060      | -99      | [[[25.64916344575016, -18.53602589281899], ...  | ...
Ang.       | Angola                           | Africa    | 7. Least developed region | 110300     | -99      | [[[[16.326528354567046, -5.877470391466218], ... | ...
I.C.       | Ivory Coast                      | Africa    | 6. Developing region      | 33850      | -99      | [[[-2.856125047202397, 4.994475816259509], ...  | ...
Cam.       | Cameroon                         | Africa    | 6. Developing region      | 42750      | -99      | [[[13.075822381246752, 2.267097072759015], ...  | ...
D.R.C.     | Democratic Republic of the Congo | Africa    | 7. Least developed region | 20640      | -99      | [[[30.833859897593808, 3.509165961110341], ...  | ...
Rep. Congo | Republic of Congo                | Africa    | 6. Developing region      | 15350      | -99      | [[[12.995517205465177, -4.781103203961884], ... | ...
Benin      | Benin                            | Africa    | 7. Least developed region | 12830      | -99      | [[[2.691701694356254, 6.258817246928629], ...   | ...
B.F.       | Burkina Faso                     | Africa    | 7. Least developed region | 17820      | -99      | [[[-2.827496303712707, 9.642460842319778], ...  | ...
Bur.       | Burundi                          | Africa    | 7. Least developed region | 3102       | -99      | [[[29.339997592900342, -4.499983412294092], ... | ...
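
To carry on with the Transform/Load part of ETL, the exploded rows could be materialized into a table of their own, so later steps do not re-parse the big JSON document on every query. A minimal sketch, with the hypothetical table name country_geo and only a few of the columns:

-- minimal sketch: persist one row per country into a new table
-- (country_geo is a hypothetical name, columns trimmed for brevity)
create table country_geo as
select geo_jsons.*
from general_data,
     json_table(general_data.json_data,
         "$.features[*]" COLUMNS(
             admin varchar(256) path "$.properties.admin",
             continent varchar(32) path "$.properties.continent",
             `geometry` json path "$.geometry"
         )
     ) as geo_jsons
where general_data.name = 'raw-country-geo-data';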

So far the geo data has been roughly processed using only MySQL's built-in functions.