DM-70 - Problems of large spatial databases

Large spatial databases, often labeled geospatial big data, exceed the capacity of commonly used computing systems as a result of data volume, variety, velocity, and veracity. Sources include satellites, aircraft and drone platforms, vehicles, geosocial networking services, mobile devices, and cameras. The problems in processing these data to extract useful information include query, analysis, and visualization. Data mining techniques and machine learning algorithms, such as deep convolutional neural networks, are often used with geospatial big data. The obvious problem is handling the large data volumes, particularly for input and output operations, which requires parallel reading and writing of the data as well as high-speed computers, disk services, and network transfer speeds. Additional problems of large spatial databases include the variety and heterogeneity of the data, which require advanced algorithms to handle different data types and characteristics, and integration with other data. The velocity at which the data are acquired is a problem, especially with today's advanced sensors and the Internet of Things, in which millions of devices create data on short temporal scales of microseconds to minutes. Finally, the veracity, or truthfulness, of large spatial databases is difficult to establish and verify, particularly for all data elements in the database.

Topic Description: 
  1. Definitions
  2. Introduction
  3. Classification
  4. Big Data Challenges
  5. Geospatial big data management and processing
  6. Summary and Conclusions

 

1. Definitions

Geospatial Big Data: datasets with locational identifiers that exceed the capacity of current computing systems to manage, process, or analyze the data with reasonable effort.

 

2. Introduction

Large spatial databases are currently (2018) most often referred to as geospatial big data. By definition, geospatial big data are datasets with locational identifiers that exceed the capacity of current computing systems to manage, process, or analyze the data with reasonable effort. These excesses result from volume, variety, velocity, veracity, and other Vs of the spatial data (Firican, 2017; Panimalar et al., 2017). Traditional forms of large spatial databases are vector-formatted points and linework and raster images, including satellite images and aerial photographs. Additional forms of large databases with locational components now come from a plethora of new sensors, including lidar and other electronic sensors, video systems, and human sensors with cell phones and other data collections, resulting in Volunteered Geographic Information (VGI) and a wealth of social media data including Twitter feeds and photographic archives. These big data sources provide locational information as part of the data, and all exceed current capacities for processing and defining the context and nature of geospatial big data.

Geospatial big data provide a major source for innovation, competition, and productivity, but simultaneously are problematic for data handling, processing, storage, retrieval, and use. The challenges facing management and application of geospatial big data have necessitated developing new software tools and techniques as well as parallel computing hardware architectures to meet the data handling requirements (Wright & Wang, 2011).

 

3. Classification

Geospatial big data may be classified into groups with similar data characteristics. A basic classification, after Yao and Li (2018), using types of data for organization follows.

3.1 Remote sensing

Remote sensing data, from the first satellite systems such as Landsat 1 to modern sensors with high spatial and temporal resolution and new sensor designs, form a class of geospatial big data. Current systems acquire multispectral and hyperspectral images that are multi-resolution and multi-temporal and come from multi-sensor systems, resulting in big data volumes and other challenges (Chi et al., 2016). Collected and archived in traditional raster formats, these mostly continuous-tone images usually exceed current computer processing capabilities, requiring new solutions with parallel computing and enhanced distributed cyberinfrastructure (Ma et al., 2015).

3.2 Surveying and mapping data

Surveying and mapping data are geographically referenced data, often referenced through Global Navigation Satellite Systems (GNSS). They include industry geographics; thematic maps; digital products such as Digital Line Graphs, Digital Elevation Models, Digital Raster Graphics, and Digital Orthophotographs; data from most national mapping organizations, such as The National Map of the U.S. Geological Survey; land use data; and other basic surveying and mapping data. In recent years, high-resolution lidar data have become the dominant form of mapping data, and the data volumes (terabytes to petabytes) result in geospatial big data (Sugarbaker et al., 2014). Mobile mapping with moving sensors and sensor fields has also contributed to the big data problem in the area of mapping.

3.3 Location based data for location services

Location-based data typically include geographic and human social information data with spatial and temporal locations. These data are primarily acquired from GNSS inputs generated by smartphones, field-collected data, and human and traffic trajectories. These geospatial big data have become a critical resource in socially based service industries, vehicle routing, and other activities that sense the activities of human groups (Liu et al., 2015).

3.4 Social media platforms

These data are Internet and Web-based and include geospatial location. These locations may be specified in Web page data or in social media platforms, such as Twitter, Facebook, Google Plus, and others. Location specification may be as specific as a set of global positioning coordinates or as vague as a simple place name that requires disambiguation. Geospatial big data from social media carry a host of problems, including data handling, location dereferencing, and failure to meet basic statistical assumptions of independent, unbiased samples (Goodchild, 2013; Kitchin, 2013; Tsou, 2015).

3.5 Internet of Things

The Internet of Things (IoT) includes sensors that monitor and collect data, including environmental and atmospheric measurements, water, intelligent devices in the household, field collection for science and management, wearable devices, and a host of sensors that contribute real-time data at intervals of microseconds to minutes to big data servers. Most of these data acquired by new sensors are data streams of arbitrarily high density; include many different dimensional measurements, such as optical, acoustical, and mechanical; and have different positional accuracies and precision. Compared to traditional data on the Internet, IoT data are generally of much greater variety and much greater frequency, leading to a true geospatial big data problem (Alelaiwi, 2017).

 

4. Big Data Challenges

Big Data Challenges include architectures for processing the large data volumes as well as inherent problems in the data such as heterogeneity, vagueness of geographic feature definitions and boundaries, and uncertainty. Among the identified processing problems are quality assessment, data modeling and structuring, functional programming for geospatial big data streams, geospatial big data analytics, data mining and knowledge discovery, and geospatial big data visualization and visual analytics (Li et al., 2015).

4.1 Inherent geospatial big data problems

All of the types of geospatial big data listed above suffer from problems common to all geospatial data, which include vagueness and indeterminate boundaries for geographic features (Burrough & Frank, 1996). For example, using lidar data for terrain feature extraction immediately requires definition and conceptualization of the terrain features themselves. That is, where are the boundaries of the natural geographic feature that can be used to extract the entity from the data? Couple this indeterminacy with the big data problem, and the natural vagueness of the features becomes more problematic with the large data volumes and heterogeneous data. Another inherent problem is the uncertainty of geospatial big data and the lack of methods and specifications for measuring and quantifying such uncertainties. Many of these datasets lack normal scientific standards of replicability and rigorous sampling (Goodchild, 2013).

4.2 Quality Assessment

Geospatial big data are often continuous measurements, such as pixel values in satellite images or lidar responses, and, as abstractions of real-time variables, are uncertain. The sheer volume of data magnifies the uncertainty and requires quality assessment to assure appropriate abstraction, processing, feature extraction, and analytical and visualization processing. Standards have been developed, and rigorous procedures for monitoring quality exist, for many traditional structured geospatial datasets. Examples are the international standards ISO 8402 and ISO 19157, the latter of which specifically addresses the quality of geospatial data. A similar standard exists for lidar data collections, the U.S. Geological Survey Lidar Base Specification 1.3 (Heidemann, 2018). These standards and other similar standards frameworks define quantitative measures of data quality, including spatial, temporal, and thematic accuracy; spatial, temporal, and thematic resolution; consistency; and completeness (Li et al., 2015). Thus, data quality assessment is defined for the collection processes and the creation of metadata reflecting the quality of those processes. Similar standards do not exist for the majority of geospatial big data. For example, assessment of the quality of social media data is extremely difficult since many of these data violate basic statistical assumptions and are often from biased and limited samples, e.g., those who own cell phones and those who participate in the data collection process for a particular type of big data. Quality of locations in geospatial big data is also problematic, and often the locational component, as in Twitter feeds, is determined from context rather than from highly refined surveying methods.
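
As one illustration of a quantitative measure of spatial accuracy, the sketch below computes a horizontal root-mean-square error (RMSE) between measured coordinates and independent check points. The function name and the sample coordinates are hypothetical and chosen only for illustration; the standards cited above define their own procedures and thresholds, which this sketch does not reproduce.

```python
import math


def horizontal_rmse(measured, reference):
    """Root-mean-square horizontal error between paired (x, y) coordinates.

    Both inputs are sequences of (x, y) tuples in the same projected
    coordinate system and the same linear unit (for example, meters).
    """
    if len(measured) != len(reference) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [
        (mx - rx) ** 2 + (my - ry) ** 2
        for (mx, my), (rx, ry) in zip(measured, reference)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical check points, not real survey data.
measured = [(100.0, 200.0), (150.0, 250.0), (303.0, 404.0)]
reference = [(100.0, 200.0), (150.0, 250.0), (300.0, 400.0)]
print(horizontal_rmse(measured, reference))
```

A measure of this kind presupposes trusted reference locations, which is part of why it is hard to apply to social media data whose locations are inferred from context.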

4.3 Geospatial Big Data Streams

Twitter, Facebook, and other social media platforms provide continuous streams of geospatial big data. These streams require continual processing and analysis. The geospatial components are often hidden or ambiguous and can be determined only from the context of the message or the scenes of a photograph. Thus, the spatial component is inherently uncertain and must be resolved in any analytics or visualizations related to these types of data. Similar big data streams are collected from sensor fields and objects on the Internet of Things. With these sources, the high-volume data streams are more problematic than the spatial location, since most of these devices have fixed sensor locations that can be used in analytics and visualization.
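
For sensors with fixed locations, the location can be joined from a static lookup table while the readings themselves are processed incrementally. The following is a minimal sketch of that idea using a sliding-window mean per sensor. The sensor identifiers, coordinates, window size, and all names are assumptions made for illustration, not part of any particular streaming platform.

```python
from collections import defaultdict, deque

# Hypothetical fixed sensor locations as (longitude, latitude).
SENSOR_LOCATIONS = {
    "s1": (-92.33, 38.95),
    "s2": (-90.20, 38.63),
}


class SlidingWindowMean:
    """Keep a mean of the last `size` readings for each sensor."""

    def __init__(self, size):
        if size < 1:
            raise ValueError("size must be at least 1")
        self.windows = defaultdict(lambda: deque(maxlen=size))

    def update(self, sensor_id, value):
        window = self.windows[sensor_id]
        window.append(value)  # the deque drops the oldest reading when full
        return sum(window) / len(window)


def process(stream, size=3):
    """Yield (sensor_id, location, windowed_mean) for each incoming reading."""
    aggregator = SlidingWindowMean(size)
    for sensor_id, value in stream:
        location = SENSOR_LOCATIONS.get(sensor_id)  # None for unknown sensors
        yield sensor_id, location, aggregator.update(sensor_id, value)


readings = [("s1", 1.0), ("s1", 3.0), ("s2", 10.0), ("s1", 5.0), ("s1", 7.0)]
for record in process(readings):
    print(record)
```

Because only the most recent readings per sensor are held in memory, the cost per reading stays bounded regardless of how long the stream runs.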

 

5. Geospatial big data management and processing

Geospatial big data management and processing is problematic because of the volume and characteristics of the data and the velocity with which some of the data are collected. Among the solutions for geospatial big data processing are data organization methods such as indexing on spatial location, feature identifiers, or thematic or temporal attribution.
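
Indexing on spatial location can be illustrated with one of the simplest schemes, a uniform grid that buckets points by the cell containing them. This is a sketch under stated assumptions: the class name, cell size, and sample points are hypothetical, and production systems typically use more elaborate structures.

```python
from collections import defaultdict


class GridIndex:
    """Uniform grid index: bucket points by the cell that contains them."""

    def __init__(self, cell_size):
        if cell_size <= 0:
            raise ValueError("cell_size must be positive")
        self.cell_size = cell_size
        self.cells = defaultdict(list)

    def _cell(self, x, y):
        # Floor division keeps negative coordinates in the correct cell.
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, x, y, item):
        self.cells[self._cell(x, y)].append((x, y, item))

    def query(self, xmin, ymin, xmax, ymax):
        """Return items whose points fall inside the closed query rectangle."""
        cx0, cy0 = self._cell(xmin, ymin)
        cx1, cy1 = self._cell(xmax, ymax)
        hits = []
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                # Only buckets overlapping the rectangle are visited.
                for x, y, item in self.cells.get((cx, cy), ()):
                    if xmin <= x <= xmax and ymin <= y <= ymax:
                        hits.append(item)
        return hits


index = GridIndex(cell_size=10.0)
index.insert(1.0, 1.0, "a")
index.insert(15.0, 15.0, "b")
index.insert(95.0, 95.0, "c")
print(index.query(0.0, 0.0, 20.0, 20.0))
```

The query touches only the cells that overlap the search rectangle rather than every stored point, which is the basic benefit any spatial index aims to provide.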

5.1 Data modeling and structuring

Data modeling and structuring is a common approach to handling and processing geospatial big data, particularly in traditional vector and raster formats. For example, vector data consist of points, lines, and polygons, and sometimes volumetric figures. Standard arc-node based models have been developed with high degrees of indexing that allow direct access to the geometric, topologic, thematic, and temporal attributes and relationships of these geospatial features. Raster data can be supported with a variety of indexing methods, such as encoding methods, linear quadtrees and other quadtrees, k-d trees, R-trees, binary trees, and many others (Samet, 2006). The use of ontologies and semantics to structure and index geospatial big data now offers potential to support modeling, analytics, and visualization (Zhang et al., 2017).
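
One common way to store a quadtree linearly is to give each raster cell a Z-order (Morton) key formed by interleaving the bits of its column and row numbers. The sketch below shows that encoding and its inverse. The function names and the default of 16 bits per axis are assumptions made for illustration, and this is only one of several possible linearizations.

```python
def morton_encode(col, row, bits=16):
    """Interleave the bits of (col, row) into a single Z-order key."""
    limit = 1 << bits
    if not (0 <= col < limit and 0 <= row < limit):
        raise ValueError("col and row must fit in the given number of bits")
    key = 0
    for i in range(bits):
        key |= ((col >> i) & 1) << (2 * i)      # column bits, even positions
        key |= ((row >> i) & 1) << (2 * i + 1)  # row bits, odd positions
    return key


def morton_decode(key, bits=16):
    """Recover (col, row) from a Z-order key produced by morton_encode."""
    col = row = 0
    for i in range(bits):
        col |= ((key >> (2 * i)) & 1) << i
        row |= ((key >> (2 * i + 1)) & 1) << i
    return col, row


# The four cells of a 2 x 2 block receive consecutive keys.
for cell in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(cell, morton_encode(*cell))
```

Sorting cells by this key tends to keep cells from the same quadrant close together, so an ordinary one-dimensional ordered index can be used to store the two-dimensional data.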

5.2 Geospatial big data analytics

Geospatial big data analytics often involve distributed and high performance computing systems with algorithms adapted for parallel computation, processing, and visualization. The approach often involves data mining and knowledge discovery. Methods include parametric statistics, which require assumptions of a probability distribution function and often randomness and independence of samples; non-parametric statistics, which simply assume local smoothness; and functional analysis, including wavelets and spatial data generalization, spatial data clustering, and mining spatial association rules. Machine learning and the use of deep convolutional neural networks for geospatial big data mining and knowledge discovery are often implemented in high performance computing environments. These methods can be applied to the large variety of geospatial big data, including the vast archives of satellite and other Earth observation imagery, sensor field data, social media, and the data feeds from the Internet of Things (Vatsavai et al., 2012; Vignesh, 2014).
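
Adapting an algorithm for parallel computation often follows a split, apply, and combine pattern: the data are partitioned into tiles, each tile is processed independently, and the partial results are merged. The sketch below applies that pattern to the mean of a small raster. The function names and the tiny in-memory raster are hypothetical, and a thread pool is used only to keep the example self-contained; CPU-bound work at scale would more likely run on a process pool or a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_tiles(raster, tile_rows):
    """Split a raster (a list of rows) into strips of at most tile_rows rows."""
    return [raster[i:i + tile_rows] for i in range(0, len(raster), tile_rows)]


def tile_stats(tile):
    """Partial result for one tile: (sum of cell values, number of cells)."""
    values = [value for row in tile for value in row]
    return sum(values), len(values)


def parallel_mean(raster, tile_rows=2, workers=4):
    """Mean cell value, computed by combining per-tile partial results."""
    tiles = split_into_tiles(raster, tile_rows)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(tile_stats, tiles))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    if count == 0:
        raise ValueError("raster has no cells")
    return total / count


# Hypothetical 5 x 2 raster used only to exercise the functions.
raster = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
print(parallel_mean(raster))
```

The mean works here because it can be rebuilt exactly from per-tile sums and counts. Statistics that depend on neighboring cells across tile edges need overlapping tiles or an extra merge step.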

 

6. Summary and Conclusions

Geospatial big data exceed the capacity of commonly used computing systems as a result of data volume, variety, velocity, and veracity. Examples of geospatial big data with extensive volumes and heterogeneity include remotely sensed images from satellite, aircraft, and handheld platforms; mapping data types including lidar, elevation, hydrography, land cover classifications, and others; location-based data for location services, which are highly dependent on GNSS coordinates; social media feeds such as Twitter and images from Facebook; and sensor streams from sensor fields and the IoT. Challenges and problems of geospatial big data include computer processing architectures, requirements for parallel and distributed computational systems, high performance computing, basic uncertainty of the spatial and non-spatial components of the data, quality assessment, analytical processing and statistical assumptions and violations, and analysis and visualization of large datasets with clusters of significant context. The future of geospatial big data is further growth in the volume and variety of data and the harnessing of computational methods to turn geospatial big data into science results and applications for society.

 

Any use of trade, firm, or product names in this publication is for descriptive purposes only and does not imply endorsement by the U.S. Government.

References: 

Alelaiwi, A. (2017). A Collaborative Resource Management Tool for Big IoT Data Processing in Cloud. Cluster Computing: The Journal of Networks, Software, Tools and Applications, 20(2), 1791-1799, DOI: 10.1007/s10586-017-0839-y

Burrough, P. A., & A.U. Frank, eds. (1996). Geographic Objects with Indeterminate Boundaries. Taylor and Francis, London, p. 71-85.

Chi, M., Plaza, A., Benediktsson, J. A., Sun, Z., Shen, I., & Zhu, Y. (2016). Big Data for Remote Sensing: Challenges and Opportunities. Proceedings of the Institute for Electrical and Electronics Engineers, 104(11), p. 2207-2219, DOI: 10.1109/Jproc.2016.2598228

Firican, G. (2017). The 10 Vs of Big Data. Retrieved September 11, 2018 from https://tdwi.org/articles/2017/02/08/10-vs-of-big-data.aspx

Goodchild, M. F. (2013). The Quality of Big (Geo)data. Dialogues in Human Geography, 3(3), 280-284, DOI: 10.1177/2043820613513392

Heidemann, H. K. (2018). Lidar base specification (ver. 1.3, February 2018): U.S. Geological Survey Techniques and Methods, book 11, chap. B4, 101. DOI: 10.3133/tm11b4

Kitchin, R. (2013). Big Data and Human Geography Opportunities, Challenges and Risks. Dialogues in Human Geography, 3(3), 262-267, DOI: 10.1177/2043820613513388

Lee, J.-G., & Kang, M. (2015). Geospatial Big Data: Challenges and Opportunities. Big Data Research 2(2), 74-81.

Li, S., Dragicevic, S., Anton, F., Sester, M., Winter, S., Coltekin, A., Pettit, C., Jiang, B., Haworth, J., Stein, A., & Cheng, T. (2015). Geospatial Big Data Handling Theory and Methods: A Review and Research Challenges, pp. 2-19.

Liu, Y., Liu, X., Gao, S., Gong, L., Kang, C., Zhi, Y., & Shi, L. (2015). Social Sensing: A New Approach to Understanding our Socioeconomic Environments. Annals of the Association of American Geographers, 105(3), p. 512-530, DOI: 10.1080/00045608.2015.1018773

Ma, Y., Wu, H., Wang, L., Huang, B., Ranjan, R., Zomaya, A., & Jie, W. (2015). Remote Sensing Big Data Computing: Challenges and Opportunities. Future Generation Computer Systems, 51, 47-60, DOI: 10.1016/j.future.2014.10.029

Panimalar, A., Varnekha, S., & Veneshia, K. (2017). The 17 Vs of Big Data. International Research Journal of Engineering and Technology (IRJET), 4(9). Retrieved September 11, 2018 from https://www.irjet.net/archives/V4/i9/IRJET-V4I957.pdf

Samet, H. (2006). Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann Publishers, 1024.

Sugarbaker, L. J., Constance, E. W., Heidemann, H. K., Jason, A. L., Lukas, V., Saghy, D. L., & Stoker, J. M. (2014). The 3D Elevation Program initiative—A call for action. U.S. Geological Survey Circular 1399, 35. DOI: 10.3133/cir1399

Tsou, M.-H. (2015). Research Challenges and Opportunities in Mapping Social Media and Big Data. Cartography and Geographic Information Science, 42(1), 70-74. DOI: 10.1080/15230406.2015.1059251

Vatsavai, R. R., Ganguly, A., Chandola, V., Stefanidis, A., Klasky, S., & Shekhar, S. (2012). Spatiotemporal Data Mining in the Era of Big Geospatial Data: Algorithms and Applications. BigSpatial ’12 Proceedings of the 1st ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data. Redondo Beach, CA, 1-10. DOI: 10.1145/2447481.2447482

Vignesh, M. (2014). Spatial Data Mining: Progress and Challenges. International Journal of Computer Science and Information Technology Research, 2(3), 1-16.

Wright, D. & Wang, S. (2011). The Emergence of Spatial Cyberinfrastructure. Proceedings of the National Academy of Sciences, 108(14), 5488-5491. Retrieved September 11, 2018 from http://www.pnas.org/content/108/14/5488

Yao, X., & Li, G. (2018). Big Spatial Vector Data Management: A Review. Big Earth Data, 2(1), 108-129, DOI: 10.1080/20964471.2018.1432115

Zhang, C., Zhao, T., & Li, W. (2017). Big Geospatial Data and Geospatial Semantic Web: Current State and Future Opportunities. In Y. Wu, F. Hu, G. Min, and A. Zomaya (Eds.), Big Data and Computational Intelligence in Networking. Taylor & Francis LLC, CRC Press. pp 43-64.

Learning Objectives: 
  • Describe the basic types of geospatial big data
  • Describe emerging geocomputation techniques for geospatial big data
  • Explain how to recognize contaminated data in large datasets
  • Identify the primary methods for structuring and modeling geospatial big data
  • Explain the statistical limitations of large spatial databases
  • Describe difficulties in dealing with large spatial databases, especially those arising from spatial heterogeneity
  • Describe some of the problems of large spatial datasets from social media
Instructional Assessment Questions: 
  1. Provide a basic definition of geospatial big data.
  2. Identify the major problems of large spatial databases, aka, geospatial big data.
  3. Define several methods of indexing large raster datasets. What are advantages of each?
  4. Standards work for some large spatial databases and not others. Why?
  5. Define some methods of analyzing and mining geospatial big data.
  6. Why does the Internet of Things create a geospatial big data problem?