
C2C Digital Magazine (Fall 2021 / Winter 2022)


Book review: Analyzing big data for real-world value

By Shalin Hai-Jew, Kansas State University

 



A Closer Look at Big Data Analytics
R. Anandan, Editor
Nova Science Publishers
2021
366 pp.


If now is the time of the Fourth Industrial Revolution (4IR), with its melding of technologies from the biological sciences, artificial intelligence, and robotics, what are people supposed to do with so much big and diverse data?  Now that big data is generally past the initial part of the hype cycle, what are actual practical ways to use available big data?  What are ways to find patterns in data?  What are some practical problems that may be solved with big data?  

R. Anandan’s A Closer Look at Big Data Analytics (2021) comprises research works exploring the prior questions.  The collection seems to be written for engineering and computer science students, to provide a sense of possibilities in this current datafied moment, rather than for practitioners or experts in the space.  

Anticipating people's food needs using online food delivery services


H.M. Moyeenudin and R. Anandan’s “Artificial Intelligence for Knowing the Anticipation of Client from Online Food Delivery Using Big Data” (Ch. 1) begins with the basic premise that those who order food online and those who dine in differ somewhat in their preferences.  They suggest that analyzing data from an online food delivery application may have inferential and forecasting value, ultimately informing international quick service restaurants (IQSR) and other food providers about how to stock their refrigerators and freezers, how to schedule employees, and other decisions for their respective businesses.   

The work seems to be based out of India, with ordered food delivered by motorized scooter (Moyeenudin & Anandan, 2021, p. 5).  The co-researchers write:  “Customers are begun place (sic) their orders through online food ordering applications, with greatest comfort and directness, guessing a related chance which is available choices and benefits in receiving their food” (Moyeenudin & Anandan, 2021, p. 6).  Much of the writing requires focus and diligence to understand fully or even partially.  Writing about technology has its challenges, and it can be much more difficult in a second or third language.  Still, one is left with the sense that the opening chapter would read better with more in-depth editorial oversight.  

This work suggests that data may be drawn from various sources, such as ratings and comments from social media.  The authors posit that there may be value in trying to understand the experiences of both “fulfilled and disappointed clients.”  They point to the importance of the restaurant sector to people’s livelihoods in a developing country.  The chapter then lists various programming languages and tools for analyzing data, such as Python, Pandas, Bokeh, MapReduce, Xplenty, Apache Hadoop, Apache Spark, and Jupyter Notebook, albeit without specifying what data per se…or much in the way of the local food context.  There is a short section on machine learning approaches and algorithms, from linear regression to the Naïve Bayes algorithm, k-means clustering, the k-nearest neighbor algorithm, and random forests, as well as artificial neural networks.  
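To make one of the named algorithms concrete: below is a minimal sketch, not drawn from the chapter, of classifying hypothetical online food-order records with a random forest, using the pandas and scikit-learn tooling the authors gesture toward.  All column names and values are invented for illustration.

```python
# A hedged sketch, not the chapter's method: random-forest classification
# of invented food-order records with pandas and scikit-learn.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

orders = pd.DataFrame({
    "hour_of_day": [12, 19, 20, 13, 21, 18, 12, 22],
    "order_value": [250, 480, 520, 300, 610, 450, 280, 700],  # hypothetical rupees
    "is_weekend":  [0, 0, 1, 0, 1, 1, 0, 1],
    "reordered":   [1, 0, 1, 1, 1, 0, 0, 1],  # target: did the customer reorder?
})

X = orders[["hour_of_day", "order_value", "is_weekend"]]
y = orders["reordered"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # toy accuracy on held-out orders
```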

The use case of online food ordering seems like a pretext to discuss the technologies (albeit superficially).  The details are not quite in-depth enough to get learners started on the tools and techniques.  The work perhaps underplays the complexity of thinking in such spare ways and of engaging data with Python.  It also under-emphasizes the amount of effort it takes to acquire fluency with data analytics and data visualization.  And the writing is not quite breezy enough to enable readers to skim through the work.  

This chapter would benefit from examples of how data could be captured, cleaned, and analyzed for this use case.  Some real-life data patterns from a dataset would be helpful.  For example, are there demographic associations between the individual ordering food and what is ordered?  What about various events (cultural, weather, pay day) and particular food orders?  What are some differences in data patterns between dine-in, take-away, and online-ordered foods?  Where are the competitive advantages to be had, and why?  Are there other routes to the same analytical insights?  
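As one hypothetical illustration of that kind of exploratory comparison (channel names and values invented, assuming pandas):

```python
# A hedged sketch of cleaning and comparing order channels with pandas;
# the data here is fabricated for illustration only.
import pandas as pd

df = pd.DataFrame({
    "channel":     ["dine_in", "online", "online", "take_away", "dine_in", "online"],
    "order_value": [350, 520, 610, 280, 400, 480],
    "items":       [3, 4, 5, 2, 3, 4],
})

# Basic cleaning: drop missing rows, clip implausible order totals.
df = df.dropna()
df = df[df["order_value"].between(1, 10_000)]

# Compare channels on average spend and basket size.
print(df.groupby("channel")[["order_value", "items"]].mean())
```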

Surveying convolutional neural networks (CNNs) to explore Internet of Things (IoT) data


The Industrial Internet of Things (IIoT) is somewhat similar to the general IoT, except that the sensor arrays and other data sources reflect the work done inside factories and similar settings.   The captured data shed light on the machines’ and operations’ throughput.  Convolutional neural networks (CNNs) extract visual features from imagery (still images, or motion imagery and video broken out into frames), with successive layers of neurons abstracting the images for analysis.  CNNs may also be applied to audio and text.  
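For readers new to the term, here is a minimal sketch (not from the chapter) of the convolution at a CNN's core: sliding a small kernel over an image to detect a visual feature, here a hand-set vertical edge.  Real networks learn many such kernels per layer rather than hand-setting them.

```python
# A hedged sketch of 2-D convolution in plain NumPy; the image and kernel
# are toy values chosen to show a vertical-edge response.
import numpy as np

image = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

kernel = np.array([      # a hand-set vertical-edge detector
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

kh, kw = kernel.shape
out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)

print(out)  # strong responses where the dark-to-light edge sits
```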

S. Karthik and K. Priyadarsini’s “An Overview on IoT Data Analytics with a Survey on CNN Accelerator Architecture” (Ch. 2) focuses on industrial data related to “gesture control, M2M (machine to machine communications), analytical maintenance, smart energy, smart monitoring, and connected medical systems” (p. 37).  This work opens with the suggestion that applying neural networks to IoT devices could “bring about a generation of programs” (p. 38) that enable people to extract knowledge from data patterns and inform automated and human-based decision-making.  One major challenge involves the IIoT devices themselves:  

Specialized hardware architectures for processing CNNs need to be designed to deploy CNNs in edge/embedded devices; only then can they offer sufficient processing power in a strict power envelope.  Most of the CNN networks are exceedingly computationally demanding and require billions of computations to evaluate single input instances.   (Karthik & Priyadarsini, 2021, p. 43)  

What is in order is a hardware accelerator enabling convolutional neural networks on IIoT devices.  Such hardware needs to offer fast computational processing, low memory cost, universality (to “support diverse CNN architectures out of the box”), and scalability (enabling different applications, even those that require more computing) (Karthik & Priyadarsini, 2021, p. 44).  The researchers describe a fused-layer CNN that may enable particular processing efficiencies (p. 45).  Other efforts in this space include intra-kernel parallelization, a reconfigurable neuromorphic computing accelerator (pp. 49-50), an approach called DianNao (“electronic brain” or “computer” in pinyin) (pp. 50-51), a Tensor Processing Unit architecture (p. 51), the PRIME architecture, and other approaches.  This work provides an introductory summary of each, with some light illustration, such as in block diagram format.  With that light knowledge, readers can search the computation literature to learn more.  
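To give a feel for the “billions of computations” claim, a back-of-the-envelope sketch (layer dimensions hypothetical, loosely VGG-like):

```python
# A hedged estimate of multiply-accumulate (MAC) counts for one
# convolutional layer; the shape below is invented for illustration.
def conv_macs(h_out, w_out, c_in, c_out, k):
    """MACs for one conv layer: every output pixel of every output channel
    sums over a k x k window across all input channels."""
    return h_out * w_out * c_out * c_in * k * k

macs = conv_macs(h_out=112, w_out=112, c_in=64, c_out=128, k=3)
print(f"{macs:,} MACs for a single layer")  # ~0.9 billion
# A full network stacks dozens of such layers, hence billions of
# computations to evaluate a single input instance.
```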

Overhead in the aerospace industry


P. Kalaichelvi, V. Akila, T.P. Rani, S. Sowmiya, and C. Divya’s “Big Data in Multi-Decision Making System of the Aerospace Industry” (Ch. 3) defines big data as “streams of data that are huge by volume and rapid generation” (p. 69).  The authors mention the yottabyte, i.e., 2^80 bytes or roughly 1,000 zettabytes, as a measure.  How can the aerospace industry handle large data streams from so many different technologies and for so many purposes?  This work provides a light summary of various real-time decision-making systems and automation related to human flight, for flight safety, efficiencies, fuel savings, and environmental concerns.  They suggest that such knowledge may inform the design of next-generation aeronautical systems (p. 70).  The co-researchers conceptualize the IoT in the airline industry as comprising concerns around the running of airports, airline companies, and aircraft and equipment manufacturers (p. 79), all interrelated as a complex system.  They point to common computing tools applied to big data:  Apache Hadoop, Xplenty, and Cloudera Distribution for Hadoop.  They see a need not only to understand the available data but to strategize how to process these using AI and other logics, for continuing improvements in processes.  [If the complexity of aerospace systems seems elusive, the 2018 and 2019 aircraft accidents attributed in part to the MCAS system on the 737 MAX 8 show something of the risks of automation with over-reliance on single sensors, insufficient design overrides, and insufficient pilot training, among other factors.]  

This work provides an overview of some general aspects of flight and aerospace, possibly to spark interest among researchers and learners.  The authors suggest that artificial intelligence (AI) and neural network-based machine learning may be applied to big data implementations of aerospace data to achieve the following objectives:  compete more effectively in the “aerospace business markets”; “improve a secure, safe, reliable and passenger friendly air journey”; improve aeronautical technology worldwide; “track and solve the environmental problems during air travel”; and “develop and implement the technological innovations in airline industries” (Kalaichelvi, Akila, Rani, Sowmiya, & Divya, 2021, pp. 104-105).  This work is written at a general level, and it would benefit from a more specific use case or two.  

Employing personal healthcare databases for learning


T. Nalini and Sudhakar Murugesan’s “New Trends and Applications of Big Data Analytics for Medical Science and Healthcare” (Ch. 4) explores a hodgepodge of computer tools that may enhance the practice of medicine, the making of drugs, and other aspects of healthcare.  Big data may enhance the following:  “fast identification of high-risk patients, better decision making, closer monitoring, more storage, powerful analytics, (and) massive computation” (p. 116).

They write as if all hospital systems were generic and as if data collection were frictionless in the world, even with federal laws about health data privacy and even in litigious environments.  They make big claims:  “With the help of big data, we can solve all the problems indicated above for all the world hospitals” (p. 116).  They then launch into a textbook summary of various tools for cloud computing, telemedicine data, machine learning, robotic surgery, computational medical image diagnostics, and IoT data within a healthcare setting.  They do not actually make a direct case for which data can be examined to benefit which outcomes.  The over-simplifications may serve as a bridge to the actual work, but this chapter does not really offer in-world insights.  [For example, a data anomaly is not necessarily fraud.  There are certainly other factors to examine.]  

Motif identification via ANN analyses of bioinformatics data


New drugs require plenty of discovery and development work, and they have to pass stringent guidelines to make it to market.  Part of the work involves studying various molecular structures of compounds.  Currently, there are over 200,000 protein structures identified in different protein data banks.  

D. Shine Babu and Latha Parthiban’s “Deep Neural Networks in Bioinformatics for Motif Identification” (Ch. 5) suggests the use of deep convolutional networks to identify DNA motif binding.  The researchers explain:  “Protein sequences and structures contain patterns of similar amino acid composition with distinct functions…These patterns are nothing but motifs with a mixture of secondary and tertiary structure elements to form the final protein with complex functionality” (p. 145).  They describe the deep neural network approach in three phases:  motif discovery and the defining of “a weighted motif adjacency matrix”; generation of random walk sequences “of original and motif(s) graph”; and a third phase of outputting the “node embeddings” based on the skip-gram network model (p. 151).  The experiment’s results are reported via a confusion matrix.
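As a rough illustration of the second phase (toy graph and walk parameters invented, not the authors' data):

```python
# A hedged sketch: generating random-walk sequences over a small graph,
# which a skip-gram model would then consume to learn node embeddings.
import random

graph = {  # toy adjacency list standing in for a motif graph
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B"],
    "D": ["B"],
}

def random_walk(graph, start, length, rng=random):
    walk = [start]
    for _ in range(length - 1):
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

# One fixed-length walk per node; a skip-gram model treats each walk
# like a "sentence" of node tokens.
walks = [random_walk(graph, node, length=5) for node in graph]
print(walks)
```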

Their approaches show prediction accuracies ranging from 82.5% (deep neural network) to 83.4% (convolutional neural network) to 84.7% (convolutional neural network using ensemble word embeddings) (Babu & Parthiban, 2021, p. 158).  Their work lays the groundwork for others’ work (and their own future work), which may achieve higher levels of accuracy in a complex context.  [Side note:  A common research trajectory in machine learning prediction is early initial progress with a lot of noise, then advancement to higher and higher levels of accuracy, up to and including near-100% accuracy or certitude in some data prediction contexts.  The 80% range is respectable, but how valuable that is will depend on the particular data context.]

Teaching “computational thinking in school” with Scratch and AI


Computational thinking, which involves the use of computers to solve practical and other problems, is considered an essential skill throughout the learning spectrum, with educational outreach from pre-school age onwards.  Over time, such knowledge is thought to confer competitive advantage.  “Utilizing Scratch to Create Computational Thinking at School with Artificial Intelligence,” by Matta Krishna Kumari, T.P. Latchoumi, G. Kalusuraman, M. Chithambarathanu, and Latha Parthiban (Ch. 6), describes a popular high-level visual programming language used to teach youth methods for engaging big data.  The co-authors write loftily:  knowledge is “the spirit of the universe” (p. 177).  Their strategy is to make big data attainable and understandable, not to hype it as something elusive.  

Staying atop the speed of big data updates can be a challenge (p. 171).  Security comes to the fore, with the need to control access to sensitive information, maintain logs to enable “information governance” (p. 173), “disinfect and censor sensitive information on the fly depending on the assignability of the information and the demand of the AI model” (p. 173), and other challenges.  

The lesson they describe focuses on two main algorithms:  k-means and the artificial neural network (both of which lend themselves to clear visual description, which enables visual thinking).  They write:  “Students will have one hour to complete the code (20 minutes for K-means and 40 minutes for the two neural frame pieces) and run the long-term applications to see if they work properly” (p. 186).  Their studies benefit from the presence of supervising teachers.  The coauthors offer engaging suggestions about how to enrich the uses of Scratch in various learning contexts, including experiments and efforts integrating other technologies like Excel (pp. 187-188).   
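For a sense of what the k-means portion asks of students, here is an equivalent sketch in ordinary Python (Scratch being visual, it has no compact text form to quote); one-dimensional points and two clusters keep it at a 20-minute scale:

```python
# A hedged, classroom-sized k-means sketch in plain Python, standing in
# for the Scratch exercise; points and starting centers are invented.
def kmeans_1d(points, centers, iterations=10):
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to its cluster's mean.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d([1, 2, 3, 10, 11, 12], centers=[2.0, 4.0])
print(centers)  # converges near [2.0, 11.0]
```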

Tracking bird migration patterns with sensors


Monitoring bird flight may provide humanity with useful information about the state of the environment, bird health, and more.  It may inform farmers about what to expect in terms of insect infestations (or their absence) or other factors.  It may inform governments about tourism-related data.  So asserts Battula. Bhavya, T.R. Rajesh, T.P. Latchoumi, Narra. Harika, and Latha Parthiban’s “Tracking System for Birds Migration Using Sensors” (Ch. 7).  [The names are represented in the book with the periods in the prior locations, so they were left that way.]  The coauthors describe a system by which bird migratory patterns may be ascertained from flight directions, speeds, altitudes, and other time-based data, tracked along with ambient temperature and other readings.  They describe an “Internet of Birds” approach, which includes sensors, ground stations, satellites, and other expensive technologies.  They describe their proposed innovation:  

The migration of birds tracking was proposed to improve the sensors and medium RF (radio frequency) communication between the tracker and cloud storage.  Android phones function as the transmission unit.  Our system eliminates the requirement for fixed base stations that are re-established by mobile transceiver units.  The proposed system consists of four functional requirements.  The first module follows a device attached to the bird (that) comprises a power supply, a sensor, a microcontroller, and a memory.  Secondly, a module is a two-way transmitter that helps us transfer data between Android devices and the tracker.  The third module is an Android application used to upload data from the tracking device and upload (to) the cloud storage.  The fourth module is a web application of Google Maps used to see the models and migration routes of the bird(s) needed. (Bhavya, Rajesh, Latchoumi, Harika, & Parthiban, 2021, p. 208)  
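A hypothetical sketch of what the first module's buffered payload might look like before the Android relay ships it to cloud storage (all field names invented, not from the chapter):

```python
# A hedged sketch of a bird-borne tracker's reading record; fields and
# values are illustrative assumptions only.
import json
from dataclasses import dataclass, asdict

@dataclass
class TrackerReading:
    bird_id: str
    timestamp_utc: str   # ISO 8601
    latitude: float
    longitude: float
    altitude_m: float
    speed_mps: float
    ambient_temp_c: float

reading = TrackerReading("ring-4471", "2021-10-02T06:15:00Z",
                         13.0827, 80.2707, 420.0, 14.2, 24.5)

# The two-way transmitter module would hand off something like this JSON.
print(json.dumps(asdict(reading)))
```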

Bird migration patterns are influenced by “seasonality and ecological productivity” (Bhavya, Rajesh, Latchoumi, Harika, & Parthiban, 2021, p. 216).  The bird flight and migration data may provide insights about changes in the “geographic range boundaries of birds and other taxa” due to global warming and birds moving to higher ranges (p. 218).  

Analyzing big data with available tools


R. Anandan, Syed Rizwan, and Usha Kumari’s “Big Data Analytics Tools” (Ch. 8) highlights the need to use proper technologies and techniques to analyze big data, given that such data is hosted in various locales:  data centers and the cloud in most cases.  The different high-performance frameworks and resources are measured on four common characteristics:  processing capacity, memory, storage, and network (p. 227).  The authors then offer summaries of the Hadoop Distributed File System (a system to process big data across distributed digital spaces), MapReduce and YARN (big data processing and resource management), Zookeeper (a distributed configuration service for Hadoop), HBase (the Hadoop database), Hive (data warehousing), Apache Pig (a platform to analyze big data tables), the Mahout libraries, and other tools, each with its own unique aspects.  After the brief overviews, the researchers offer side-by-side comparisons of the tools, to capture both strengths and weaknesses…and to provide insider knowledge.  They note that big data may be structured, semi-structured, or unstructured.  This work reads like a helpful light primer for getting started.  (For many, getting over needless fear or intimidation is important to learning anything new.)  
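For readers who have never seen the MapReduce pattern the chapter summarizes, the canonical word count, sketched here in Python in the Hadoop Streaming spirit (both phases run in one process purely for illustration):

```python
# A hedged sketch of the MapReduce idea: a mapper emits (key, value)
# pairs, and a reducer aggregates them per key after a sort/shuffle.
from itertools import groupby

def mapper(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reducer(pairs):
    # Hadoop sorts by key between phases; sorted() stands in for that shuffle.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

lines = ["big data tools", "big data analytics"]
print(dict(reducer(mapper(lines))))
# {'analytics': 1, 'big': 2, 'data': 2, 'tools': 1}
```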

Data mining techniques


Data mining is about finding patterns in data collected in various sets.  With ever-expanding data collections, some analysts randomly sample a percentage of the available data; those who want to engage the entire set often have to turn to big data analytics.  

J. Priya and R. Anandan’s “Data Mining Techniques and its Applications” (Ch. 9) explains:  “Data mining techniques are useful to uncover hidden styles, unknown relationships among data, market behaviors in association with consumer’s decisions, and various useful information which can help the enterprise to take decisions” (p. 251).  This work offers a walk-through of data pre-processing (aka cleaning), data transformation, and then the actual data mining based on machine learning algorithms like decision trees and neural networks (to enable association, classification, clustering, regression, and outlier detection).  It offers more in-depth explanations than other chapters, including code (in various font types).  The writing glides over approaches that may actually take some teaching to enable.  [For example, the Bayes Theorem is mentioned in part and covered in a page or two.  In a graduate game theory course I took years ago, Bayes Theorem was the course…along with decision trees of various kinds.]  This work also summarizes K-Means, PAM, CLARA, and other clustering algorithms.  The authors cover the regression method, as well as various methods of detecting outlier datapoints and outlier groups in a dataset.  For people who may never have heard of some of these approaches, this may be a helpful start.  [One quibble:  Some of the formulas are stretched with strange aspect ratios.  I am a stickler for consistent and accessible fonts.]  
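To show why Bayes Theorem deserves more than a page or two, here is a worked sketch with invented rates (the classic diagnostic-test surprise):

```python
# A hedged worked example of Bayes Theorem: P(disease | positive test).
# All rates below are assumptions chosen for illustration.
p_disease = 0.01             # prior: 1% base rate in the population
p_pos_given_disease = 0.95   # test sensitivity
p_pos_given_healthy = 0.05   # false positive rate

# Total probability of a positive result, then Bayes' rule.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(f"P(disease | positive) = {p_disease_given_pos:.3f}")  # ~0.161
# Even a "95% accurate" test leaves most positives false when the
# condition is rare -- the kind of nuance a page or two cannot convey.
```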

Designing an intelligent decision support system


If the healthcare system is fairly fraught, how decision support systems are deployed may ease some of the tensions or potentially make things worse (if applied unthinkingly).  Maithili Devi Reddy and Latha Parthiban’s “Design of Computationally Intelligent Decision Support System using Data Analytics” (Ch. 10) suggests that AI-informed systems may observe symptoms of various diseases and provide insight in clinical settings, predicting whether a patient has a particular disease or not.  The idea is that the system can stand in for an actual test in some cases, and treatments may start for the suspected health issue (without lab validation).  Such a system can help with speed (knowing quickly what the system predicts) and cost (obviating the need for a lab test).  The question is how it would work in a live setting and what the risks might be, depending on the particular health issue.  After all, how things work in theory and / or in a research setting may not transfer effectively into the real world.  In terms of the technology, the coauthors write:  

In this work, we take the features from the patient.  Perform Data Normalization on the obtained data.  Once the data is normalized, then we perform a feature selection algorithm on the data.  We check the data with at least three to four feature selection algorithm(s) and select the best one, which shows higher accuracy.  We use an artificial neural network to classify our data.  Artificial Neural network is trained by drawing the relative advantages of gradient descent-based backpropagation algorithm. (p. 279)
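A minimal sketch of that pipeline (normalize, select features, classify with a backpropagation-trained neural network), assuming scikit-learn and with synthetic data standing in for the clinical sets:

```python
# A hedged sketch of the normalize -> select -> ANN pipeline the quote
# describes; scikit-learn and the synthetic data are my substitutions.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=0)

pipeline = Pipeline([
    ("normalize", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=4)),
    ("ann", MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000,
                          random_state=0)),
])

# The chapter tries three to four feature-selection algorithms and keeps
# the most accurate; cross-validation is one way to make that comparison.
print(cross_val_score(pipeline, X, y, cv=5).mean())
```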

They used “three benchmark clinical datasets, namely, Pima Indian Diabetes, Wisconsin Breast Cancer, and Cleveland Heart Disease” (p. 278) to train and ultimately test the machine learning algorithms.  They write:  “An improper value may ruin a patient’s life, so we try to develop accurate values” (p. 280).  Accuracy ranged from 70-80% under different conditions on the Pima Indian diabetes dataset, and from 70-90% under different conditions on the breast cancer dataset.  The coauthors write:  “Patients can use this system to know about their illness instead of visiting hospitals every time for the examinations, thereby saving time and cost” (p. 297).  They suggest that with different and more data, higher levels of accuracy may be achieved.  

Harnessing computational intelligent agents for medical diagnosis


Maithili Devi Reddy and Latha Parthiban’s “Data Analytics using Computationally Intelligent Agents for Medical Diagnosis” (Ch. 11) begins with the idea that intelligent agents may be harnessed to process information captured on social media and in other locales, engage in text mining, and perhaps provide early warning of serious disease.  As one example, they point to lung cancer, which may be treatable in the first two stages but less so in the latter two.  Intelligent agent modeling may result in flexible and adaptable programs:  

Agent based searching considers the present distribution of the searching authority and can constantly readjust the different searches in response to the dynamic environment…Such systems enable the representation of every single coordination object as a single autonomous agent with its own goals. (Reddy & Parthiban, “Data analytics…,” 2021, p. 305)  

They describe their proposed multi-agent system through various flowcharts depicting processes and algorithms and algorithmic flows and decisions, along with human-readable text.  

Confidentiality and big data


D. Raghunath Kumar Babu, R. Balakrishna, and R. Anandan’s “Stability and Confidentiality Mechanism in Big Data” (Ch. 12) takes on the important challenge of keeping big data safe and confidential throughout the data lifecycle, when not using a third-party cloud solution.   The big data lifecycle includes data generation, data processing, and data storage (p. 333).  There is a need to guard data against “illicit access, disclosure, disorder, augmentation, update, monitoring,” and other threats (p. 333).  The researchers offer a range of policy and technological means to protect data.  They make the point that the granularity of data may affect how easy it is to protect (coarse more easily than fine).  They cite various technologies that enable encryption for “datasets of fragmented personal data” (p. 344).  One approach, called “factual security,” involves a system “that changes touchy information into information that is at the same time valuable and non-delicate” (p. 347).  They explain the importance of setting proper service rules and enforcing them, and the need to keep logs to deter people from misusing data.   Big data systems have to work at the speed that people need them to function to be usable, but they have to have sufficient defenses if attacked, so the data is not compromised or inappropriately accessed or changed.  There are a variety of strategies (something like “security in depth”) that sound fairly similar to those protecting digital data on other systems at smaller scales.  They write about the current state of the art:  

At present, a ton of stages and instruments for large information are rising.  Nonetheless, the stages and devices that are being created to deal with these gigantic informational collections are regularly not intended to join satisfactory security or protection measures.  Most condition (sic) of craftsmanship (of) large information stages will in general depend on conventional executions in application layer and to limit admittance information. (Babu, Balakrishna, & Anandan, 2021, p. 349)  

This work offers a review of the literature for how such systems may be cobbled together; however, it is not clear whether the authors have had hands-on experience in this space.  
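For concreteness, a minimal sketch of encrypting one fragment of personal data at rest, using the Python cryptography package's Fernet recipe (the chapter surveys such mechanisms generally; this particular library choice is mine):

```python
# A hedged sketch of symmetric encryption at rest with the `cryptography`
# package's Fernet recipe; the record contents are invented.
from cryptography.fernet import Fernet

key = Fernet.generate_key()      # the key must itself be stored securely
cipher = Fernet(key)

record = b'{"patient_id": "P-1042", "diagnosis": "..."}'
token = cipher.encrypt(record)   # safe to place in big data storage

assert cipher.decrypt(token) == record  # only key holders can read it
```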

Conclusion


R. Anandan’s A Closer Look at Big Data Analytics offers some perspectives on how big data may be analyzed and perhaps kept safe.  It does not read like a manual but more like a tourist guidebook to the topic.  Some parts remain unintelligible even after multiple readings, which may say something about the editing as well as the original writing.  The work mostly argues for the power of big data analytics to inform on behaviors in the world.  





About the Author


Shalin Hai-Jew works as an instructional designer / researcher at Kansas State University.  She is working on multiple book projects. Her email is shalin@ksu.edu.  