Article

An Electronic Commerce Big Data Analytics Architecture and Platform

1 College of Computer and Information Systems, Umm Al-Qura University, Makkah 21955, Saudi Arabia
2 Salla Research and Innovation, Makkah 24225, Saudi Arabia
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(19), 10962; https://doi.org/10.3390/app131910962
Submission received: 9 August 2023 / Revised: 24 September 2023 / Accepted: 25 September 2023 / Published: 4 October 2023
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

The COVID-19 pandemic significantly accelerated e-commerce growth, adding more than 218 billion US dollars to United States e-commerce sales. With this significant growth, various operational challenges have appeared, including logistic difficulties and customer satisfaction. Businesses that strive to take advantage of increased e-commerce growth must understand their data and rely on e-commerce analytics. The large scale of e-commerce data requires sophisticated information technology techniques and cyber-infrastructure to leverage and analyze. This study presents a big e-commerce data platform to address several challenges in e-commerce. The platform’s design is based on a distributed system architecture that supports e-commerce analytics applications using historical and real-time data and features a continuous feedback loop to observe the decision-making and evaluation processes and achieve the desired objectives. The platform was validated using analytical applications: the first identifies the periods in which customers prefer to place orders, while the second verifies the big e-commerce data platform by analyzing the spatial distribution of customers’ orders. Viewing and acting on the resulting insights promotes informed decisions, and the proposed platform can support numerous e-commerce applications that potentially benefit the e-commerce industry.

1. Introduction

In the world of empowered consumers, electronic commerce (e-commerce) is continuously growing because of the benefits it provides to consumers and businesses. During the COVID-19 pandemic, e-commerce grew significantly, adding more than 218 billion US dollars to United States e-commerce sales [1]. By the end of 2022, the global e-commerce market was anticipated to reach 5.5 trillion US dollars, and by 2025 it is projected to rise by 24.5% to reach 7.4 trillion US dollars [2]. Accompanying this growth are several operational challenges that are primarily related to providing appropriate service levels that meet consumer expectations. Businesses that strive to take advantage of increased e-commerce growth and be prepared for the future are required to rely on e-commerce analytics. For example, using consumer behavior data to perform predictive analysis stands out as a promising option for overcoming many operational challenges, including logistic difficulties. Similarly, analyzing data to reveal bestselling and non-selling products helps determine the reasons for unsought products and allows the resulting insights to be used to optimize the product portfolio. E-commerce data thus have the potential to be extremely valuable and to be translated into actionable information for business decision-making. However, there are numerous challenges in effectively handling e-commerce data.

Recently, data analytics has received more attention in e-commerce. In references [3,4], systematic reviews and agendas for e-commerce data challenges and future research are discussed. In order to collect and analyze e-commerce data in a way that is appropriate for the issues they present, it is necessary to understand the features of e-commerce data. These characteristics can be summarized as follows: (1) the data are produced at exponentially large volumes; (2) they are produced in various data types (structured, semi-structured, and unstructured forms); (3) the data production rate is of high velocity; and (4) the produced data must be analyzed efficiently to make decisions and gain business momentum. Considering these characteristics, e-commerce data incorporate all of the “4V’s” of big data, i.e., volume, variety, velocity, and value. However, managing big data requires expensive infrastructure and expertise. Previous studies [4,5,6,7,8] have highlighted the role of big data analytics in refining e-commerce from various perspectives and have explored its types, challenges, and opportunities in theory and practice to clarify big data’s commercial significance in e-commerce. Reference [9] presents a comprehensive study that identifies the issues that big data analytics in e-commerce encounters, such as data security and accuracy, and supplies tools and solutions to address these issues; it also highlights the rationale for investigating big data analytics applications in e-commerce. References [10,11] present case studies of the U.S. and China to examine the potential of big data analytics in e-commerce. Insights into exploratory data analysis in e-commerce using machine-learning-based models are presented in [12,13,14,15,16], and customer segmentation is addressed in [17]. Accordingly, big data analytics in e-commerce is considered to be of revolutionary significance to businesses and is currently an area of significant research interest.
However, it remains underdeveloped as a concept that impedes theoretical and practical advances in the field [18,19]. This study presents a big e-commerce data platform to address the many challenges in big e-commerce data. In particular, the main research question addressed in this paper is how to leverage open-source big data frameworks and tools to build a comprehensive platform based on the lambda architecture that addresses the needs of novice users in the e-commerce domain in Saudi Arabia. Thus, to the best of the authors’ knowledge, this proposal is the first work that aims to address such challenges for the growing Saudi e-commerce sector. The main contributions of this paper can be summarized as follows:
1. We present a comprehensive platform and framework that covers the life-cycle of big e-commerce data, starting from data acquisition from production environments through to performing analytics to gain valuable insights and make useful predictions.
2. The framework features a feedback loop to support the decision-making process in enterprises.
3. The framework adopts state-of-the-art open-source tools to provide an easy and cost-effective development environment in which practicing engineers and non-data scientists can develop similar tools for their demanding e-commerce applications and address e-commerce big data challenges.
4. The platform is built using cloud-based technologies for data lake repositories and the integration of data warehouses.
5. The platform supports performing various e-commerce data analytical tasks.
6. The proposed framework allows novice users without data science skills to apply business intelligence in their e-commerce applications.
We organize the remainder of this paper as follows. Section 2 gives an overview of our big e-commerce data platform. Section 3 focuses on the cloud-based data lake architecture and the integration of a data warehouse for storing e-commerce data to maximize their availability for data analytics applications; it also presents the implementation of the big e-commerce data platform on a cloud platform. In Section 4, the platform’s feasibility is tested through practical analytical applications, and finally, the conclusions of this study are presented in Section 5.

2. Big E-Commerce Data Platform

In e-commerce transactions, there are two mainstreams: informational and physical transactions. The efficient integration of both is the core of constructing a platform that is capable of handling data in a timely and dynamic manner for the operation of e-commerce. Generally, from a big data perspective, the life cycle of e-commerce data comprises five stages, namely, data generation, data storage, data processing, data querying, and data analytics. Accordingly, the big e-commerce data platform is constructed in a dedicated formulation with respect to e-commerce needs and decomposed into five subsequent layers. The first stage is data generation, which is performed by various sources in the e-commerce environment. The second stage is data storage, which is required for later processing. The third stage is data processing, which is typically performed using large-scale data processing techniques. The fourth stage is data query processing, which supports the final fifth stage of data analytics. After the data analytics stage, informed decisions can be made, and the effects of those decisions can be monitored through a feedback loop, which allows corrective measures to be taken to achieve the desired objectives.

Featuring a feedback loop gives businesses a better understanding of their customers’ needs and helps them make informed decisions to enhance the overall customer experience. This allows businesses to engage with their customers and work continually to address problems, increase customer satisfaction, and improve products and services. Feedback loops are also critical for identifying customer satisfaction issues, improving retention, driving innovation, and fostering loyalty, contributing significantly to long-term business success and positive customer relationships. As more data are acquired, this ensures that customer behavior and churn prediction models do not degrade over time due to being retrained on biased data that do not reflect the true data distribution; hence, the performance of the various prediction models improves over time. In Figure 1, we present a high-level overview of our proposed big e-commerce data platform. We discuss the five stages of the platform in more detail in the following subsections.

2.1. Data Production and Generation

E-commerce data are continuously produced at high rates from various sources. This includes any data that may influence the industry, such as data concerning products, pricing, sales, stock, customers, manufacturing and supplier details, reviews, advertisements, marketing strategies, and competitor activities. In addition, social media, sentiment, images and videos, weather, and Internet-of-Things (IoT) device data could also be factors that influence the e-commerce industry. The data produced can take different forms, such as structured, unstructured, and semi-structured. The explosion of produced data is driven by the fact that e-commerce firms that take a variety of data sources into account and effectively explore analytical correlations across the data in their big data analytics can achieve higher productivity than their competitors. Currently, with increased investment in social media campaigns, it is essential to consider their effects on e-commerce, including product sales, as well as the effects of weather conditions on the e-commerce industry; however, these factors have not yet received much attention. Furthermore, emerging technologies are revolutionizing the e-commerce industry. For example, drone deployment will propel e-commerce to new heights because of reduced delivery times.

2.2. Cloud Data Lake Storage

Regardless of the source and data type, the e-commerce data produced must be stored cost-effectively in their raw form. Usually, in business analysis, data warehouses are used for storing data. However, the constant influx of e-commerce-related data is increasing in businesses, which requires agile and flexible solutions to store, analyze, and report data. In order to store large amounts of data in their raw form (structured, semi-structured, and unstructured) with no data schema defined, a centralized repository, namely a data lake, is used. On this platform, a cloud data lake approach was considered. Data lakes are more flexible than traditional data warehouse infrastructures because they do not require excessive pre-processing, cleaning, transformation, or other preparation methods. The data may be stored and delivered for analysis in their raw, original form. Furthermore, for organizations currently relying on data warehouses, the cloud data lake presented in this platform features the integration of data warehouses. This allows the data stored in the data lake to be loaded into data warehouses or consumed directly by analytical and business intelligence software and tools. Furthermore, businesses that wish to establish a data warehouse in the future will be able to obtain the required data from the data lake. Figure 1 illustrates a cloud data lake featuring integration with a data warehouse. Based on e-commerce data and industry requirements, an e-commerce data lake can generally be divided into six zones according to data sensitivity and processing status:
1. The Raw Data Zone: this zone is where raw data are preserved after ingestion by the platform without any processing.
2. The Refined Data Zone: this zone stores the output of the preprocessing and cleaning of data from the Raw Data Zone.
3. The Data Warehouse: this zone stores the structured version of the cleaned data from the Refined Data Zone.
4. The Public Data Zone: this zone stores the cleaned data from the Refined Data Zone and makes them available for public access.
5. The Work Zone: this zone is where multiple “data puddles” (pooling together several data puddles creates a data lake) are created for different users or projects. It can be utilized by platform users (e.g., data scientists in the e-commerce sector) as a staging area to prepare data for analytical tasks. The results of the analytical tasks can be saved into the Refined Data Zone to support further applications that require these data.
6. The Sensitive Data Zone: this zone is where sensitive/private data are stored. It must have a form of access control so that only authorized users can access it. Having access control ensures that sensitive data do not leak elsewhere in the data lake.
This data lake architecture supports diverse e-commerce analytical tasks, from data exploration to sophisticated visualizations and summary dashboards. Furthermore, the presented data lake architecture is flexible because additional zones and data puddles can be added or modified at any time according to business changes.
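For concreteness, the sketch below shows one possible way to map the six zones onto prefixes in a cloud object store; the bucket name and prefixes are illustrative assumptions rather than the paths used by the platform.

```python
# A minimal sketch of a zone layout for the e-commerce data lake.
# The bucket name and prefixes are hypothetical.
DATA_LAKE_BUCKET = "gs://example-ecommerce-lake"

ZONES = {
    "raw":       f"{DATA_LAKE_BUCKET}/raw",        # data preserved as ingested
    "refined":   f"{DATA_LAKE_BUCKET}/refined",    # preprocessed and cleaned data
    "warehouse": f"{DATA_LAKE_BUCKET}/warehouse",  # structured, query-ready tables
    "public":    f"{DATA_LAKE_BUCKET}/public",     # cleaned data open to all users
    "work":      f"{DATA_LAKE_BUCKET}/work",       # per-user/per-project data puddles
    "sensitive": f"{DATA_LAKE_BUCKET}/sensitive",  # access-controlled private data
}

def zone_path(zone: str, dataset: str) -> str:
    """Return the storage path of a dataset inside a given zone."""
    return f"{ZONES[zone]}/{dataset}"

# Example: where a Work Zone copy of a churn-analysis dataset would live.
print(zone_path("work", "churn_analysis/orders"))
```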

2.3. Data Processing

Our platform organizes the stored data into “data lake zones”, as described above. In the proposed platform, the processing is accomplished on a cluster of machines that follows a leader–worker paradigm. In this paradigm, applications interact with the leader node to deploy a data processing job. The job is further decomposed into tasks that can be executed concurrently by the worker nodes. The proposed platform is designed to process diverse types of data. These data can be in their original format (e.g., log records and social network feeds). Alternatively, the data can be stored as traditional relational data structured as rows and columns with fixed schemas. Moreover, e-commerce requires analytical tasks to be performed in batches and in real time (streams). For this, the platform uses a distributed system architecture, namely the lambda architecture, a data deployment model that consists of a batch-based data pipeline and a fast streaming-based data pipeline for handling real-time data [20,21], to process real-time streams of e-commerce data as well as batches of data on a cluster of commodity machines in a distributed and parallel fashion.

2.4. Data Query Processing

Data query processing is considered the cornerstone of e-commerce analytical applications. As described in the preceding subsection, the lambda architecture facilitates two tasks for big data applications: query answering and analytics. In addition, at the query processing stage, a data warehouse is included in answering queries. E-commerce data can include structured data (customer information), semi-structured data (transactions), and unstructured data (social media data). The variety of such data makes it challenging to query e-commerce data, where the required data are occasionally the result of information retrieval or data analytics. Considering the needs of the e-commerce industry, it is essential to have query components that enable batch and stream analytical applications. To achieve this, the presented big e-commerce data platform uses SparkSQL [22] because it provides better performance when the working set fits in memory. Consequently, a range of e-commerce query-answering and analytical tasks can be carried out using the big e-commerce data platform.

2.5. Data Analytics Tasks

The core objective of performing data analytics tasks is to gain insights and support the decision-making process that effectively endorses e-commerce optimization operations. This includes:
  • Finding interesting e-commerce data correlations within the data lake.
  • Predicting events before they occur using big data.
  • Analyzing data to understand customers and optimizing products and processes to meet customer expectations.
  • Studying the effects of social media campaigns on e-commerce, including product sales.
  • Providing personalized services and customized products.
  • Performing real-time data analytics to provide personalized services with unique content and make relevant promotional offerings.
  • Supporting dynamic pricing systems with monitoring of competing prices and alerts.
  • Identifying credit card fraud, product returns, and identity theft.
  • Performing fraud detection in real time by analyzing data from multiple sources, such as transactional data, customer purchase history, social media feeds, and location data.
  • Analyzing customer churn and making decisions to avoid it.
Furthermore, the presented big e-commerce data platform includes a feedback cycle to observe the outcomes of the decisions made by e-commerce firms. This is potentially beneficial for monitoring the sales of items after promotions or monitoring a cluster of customers after personalized promotions are offered. Generally, applications that include tasks such as data mining, knowledge discovery, and visual analytics can be performed at this stage to optimize the e-commerce firm, and the outcomes are monitored so that corrective action can be taken.

3. Big E-Commerce Data Platform Implementation

In this section, we give an overview of the implementation of the proposed big e-commerce data platform in accordance with the general requirements of an e-commerce firm. The platform’s structure can serve as a test bed for developing and validating e-commerce-related applications. The following subsections describe the data lake cloud storage featuring a data warehouse, following the stages presented in Section 2. We use the Google Cloud Platform for the infrastructure of our implementation.

3.1. Data Lake and Data Production

Various data sources and the data flow were specified at this stage, including data such as present and past products, pricing, stock, customer information, customer transactions, manufacturing and supplier details, reviews, advertisements, marketing strategies, competitor activities, events, social media data, sentiment, images and videos, weather, and Internet-of-Things (IoT) device data. These continuously produced data sources include structured, semi-structured, and unstructured data that grow rapidly. Furthermore, the acquired data may be produced from multiple sources, including real-time, near-real-time, and batch-oriented systems. The data are collected and loaded into a storage environment, which in this case is the data lake, and the data flow from production to storage must be specified. This includes identifying the data source, flow medium, and destination. In this implementation, the acquisition component Flume [23] was adopted, which is capable of collecting e-commerce data from single or multiple sources, making it a reliable component for collecting data. In addition, Flume can pull data from the prespecified resources and transfer the data to ensure that they are stored at the desired destination in the data lake, guaranteeing delivery even in cases where disconnections occur. The Flume topology is shown in Figure 2. In a Flume agent, data flow through four consecutive levels before reaching their destination (the data lake).
In other words, these levels constitute the logic for ingesting data streams into the data lake. First, data are forwarded to the IP address of one or more prespecified Flume agents, which is called the Flume agent Source (the first level in Figure 2). The Source may actively poll for data or wait for the data to be forwarded, allowing the data to pass through an interceptor interface (the second level in Figure 2), which modifies or drops records from data streams based on any prespecified criteria. Data that meet the users’ criteria pass the interceptor and flow through a Channel (the third level in Figure 2), which is the medium between the Source and the Sink (the fourth and final level in Figure 2). At the final level, the Sink drains the data to their destination, the data lake. In this implementation, a cloud platform stores e-commerce-related data in their prespecified destination (zone) in the data lake. In addition to functionality requirements, a cloud platform must adhere to the Cloud Security Alliance standards. Compliance with the standards allows for delivering secure forms of analytics. In addition, the Secure Shell (SSH) protocol can be used to establish a secure, encrypted connection between cluster nodes. However, maintaining a secure connection is an ongoing process that is beyond the scope of this study. Data lakes with cloud-based deployments can benefit from the high-availability feature of cloud-based services, which allows data acquisition from various sources over commodity network connections. Once the data are received, depending on the source from which they are sent, the data can be stored in their original format in the Raw Data Zone. With the exception of sensitive data, all data can be retained in the Raw Data Zone on demand, whilst sensitive data are relocated from the Raw Data Zone into the Sensitive Data Zone. Sensitive data may include customer or financial information. The data in this study were organized in the data lake with respect to the architecture discussed in Section 2.

3.2. Processing of Big E-Commerce Data

To specify adequate processing frameworks for e-commerce data analytics, the requirements of applications related to e-commerce data need to be determined. In this implementation, the underlying requirements of e-commerce can be linked to the characteristics of big data, i.e., volume, velocity, and variety. Table 1 presents the e-commerce processing requirements and their respective values.
From the requirements listed in Table 1, it can be concluded that the processing stage for e-commerce data requires framework components that are capable of processing large amounts of data in batches and in real time, featuring scalability, reliability, and availability. Based on these requirements, the concept of lambda architecture processing [20] is adopted in this implementation. The basis of the lambda architecture is to run arbitrary, distributed data workloads with batch and real-time capabilities while balancing data latency, scalability, and reliability. The Hadoop Distributed File System (HDFS) and Spark [22] frameworks were utilized to fulfill the lambda architecture capabilities at the processing stage of the big e-commerce data platform. HDFS is a robust distributed file system designed for distributed data-processing clusters of commodity servers. It permits the processing of large amounts of e-commerce data in parallel after splitting the data into manageable chunks. The Spark framework was utilized to achieve low-latency e-commerce processing and features in-memory cluster computation capabilities that increase processing speeds. Incorporating both frameworks fulfills the real-time processing requirements of e-commerce. To accomplish this, a layer that incorporates both frameworks at the processing stage was added. The proposed layer merges the outcomes of the Hadoop and Spark computations to deliver real-time computational results. The lambda architecture is adopted at the batch/stream processing stage shown in Figure 1 to perform the processing of big e-commerce data.
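As a rough illustration of how the batch and speed layers can be wired together on this stack, the PySpark sketch below reads historical orders from HDFS as a batch view and aggregates newly arriving order events as a stream; the paths, schema reuse, and column names are illustrative assumptions rather than the platform’s exact jobs.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ecommerce-lambda-sketch").getOrCreate()

# Batch layer: historical orders already landed in the data lake (path is assumed).
batch_orders = spark.read.parquet("hdfs:///datalake/refined/orders")
batch_view = batch_orders.groupBy("store_id").agg(F.sum("amount").alias("revenue"))

# Speed layer: recent order events arriving as JSON files (Kafka would also work).
stream_orders = (spark.readStream
                 .schema(batch_orders.schema)      # reuse the batch schema
                 .json("hdfs:///datalake/raw/orders_stream"))
stream_view = stream_orders.groupBy("store_id").agg(F.sum("amount").alias("revenue"))

# Serving: keep the streaming aggregate in an in-memory table that can later be
# merged with the batch view to answer low-latency "revenue so far" queries.
query = (stream_view.writeStream
         .outputMode("complete")
         .format("memory")
         .queryName("realtime_revenue")
         .start())
```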

3.3. Data Query Processing

For big e-commerce data, query engines that offer interfaces while hiding the complexity of the data storage configuration are required. This requires distributed computing, which involves implementing a cluster of computing machines that work together to perform queries. Based on the requirements of big e-commerce data, query engines must be capable of performing queries in a distributed manner. In this setting, the cluster of machines includes a resource manager for administering the work between the cluster nodes and a group of worker nodes that perform the computations. The data-querying stage includes components that permit the extraction, loading, and aggregation of raw data stored in the cloud data lake into a form suitable for analytics. To fulfill the requirements of big e-commerce data, in which data vary in form, structure, and complexity and need to be processed in batch and real time, two querying components are considered. Each querying component varies in terms of how it executes queries in parallel on a cluster. SparkSQL [22] is a layer built on top of Spark in the lambda architecture and leverages resilient distributed datasets (RDDs) that can perform parallel operations on cluster nodes. An advantage of using Spark lies in its ability to execute both batch and real-time queries.
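As a minimal sketch of how an analyst might query the lake through SparkSQL, the snippet below registers cleaned transactions as a temporary view and runs standard SQL that is executed in parallel across the worker nodes; the path, view name, and columns are assumptions for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ecommerce-query-sketch").getOrCreate()

# Cleaned transactions from the Refined Data Zone (hypothetical path and columns).
transactions = spark.read.parquet("gs://example-ecommerce-lake/refined/transactions")
transactions.createOrReplaceTempView("transactions")

# Top products by revenue over the last 30 days, computed in parallel on the cluster.
top_products = spark.sql("""
    SELECT product_id, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM transactions
    WHERE order_date >= date_sub(current_date(), 30)
    GROUP BY product_id
    ORDER BY revenue DESC
    LIMIT 10
""")
top_products.show()
```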

3.4. Analytics

The data analytics stage is the stage at which the value of the big e-commerce data platform is determined. The primary objectives of this stage are to gather pertinent data and insights, support decision-making, and effectively advance the e-commerce industry. In this implementation, a feedback loop is considered to evaluate the outcomes of decision-making on e-commerce. At this stage, it is necessary to perform statistical, data mining, knowledge discovery, and visual analytics applications. In our platform, we consider novice users, i.e., users without data science skills but with deep knowledge of the e-commerce domain. To make this stage of our platform accessible to novice users in the e-commerce sector and allow them to perform analytically demanding applications on top of the big e-commerce data platform, a remote connection between the analytical tools and the cluster of machines is established through a master node(s). The connections were established via Open Database Connectivity (ODBC) drivers, which are open-standard Application Programming Interfaces (APIs) for accessing data sources [24]; these permit access to the data-querying component (Spark) and, through it, to the data lake. The following analytical tools were used in this implementation of e-commerce applications:
- MATLAB [25], which uses tall arrays to work with data backed by a distributed data store.
- Python and PySpark [26,27] for analytics libraries that include data analysis, natural language processing, and image processing packages.
- Tableau [28] for visual analytics and dashboards.
It should be noted that these tools can be used in other e-commerce applications, whilst the platform can be connected to other analytical tools for a variety of applications.
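For a tool that connects over ODBC rather than running inside the cluster, a connection might look like the Python sketch below; the data source name (DSN) and table are hypothetical, and an appropriate Spark or Hive ODBC driver is assumed to be installed and configured.

```python
# Assumes an installed Spark (or Hive) ODBC driver with a configured DSN.
import pyodbc
import pandas as pd

# "SparkSQL_DSN" is a hypothetical data source name defined in the ODBC manager.
conn = pyodbc.connect("DSN=SparkSQL_DSN", autocommit=True)

# Pull a small, aggregated result set into the client tool for further analysis.
monthly_orders = pd.read_sql(
    "SELECT month(order_date) AS month, COUNT(*) AS orders "
    "FROM transactions GROUP BY month(order_date)",
    conn,
)
print(monthly_orders)
```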

4. Practical Application of the E-Commerce Platform

In this section, the construction of the big e-commerce data platform is discussed. We implement three applications to validate the feasibility of our proposed platform. The dataset concerns a Saudi Arabian e-commerce platform that hosts more than 10,000 stores. The data were obtained from the cloud data lake, and data pre-processing was performed to store the transactional data in the data warehouse. The first application identifies the periods in which customers are most likely to purchase items. The second application is a visualization that provides insight into how customers are distributed on the map and determines where orders are placed. The third application is a time-series forecasting application on e-commerce sales data; it predicts seasonal trends in revenue by building models on sales transactions, providing observations that help online merchants make better business decisions.

4.1. Big E-Commerce Data Platform Construction

Because it satisfies the big data requirements of e-commerce, an Infrastructure-as-a-Service (IaaS) cloud computing platform was considered. Hardware services, including virtual machines and storage, offer dependability, scalability, cost-effectiveness, and service provision. For this implementation, the platform was hosted on Google Cloud computing. Six machines were used to construct the big e-commerce data platform cluster, with one master node and five worker nodes. Each worker node has four vCPUs and 15 GB of RAM, running a 64-bit Linux operating system, whilst the master node has eight vCPUs. A secure shell (SSH) is used to create a secure, encrypted connection between cluster nodes. E-commerce data resources are proactively transferred to the master node’s IP address once a secure connection has been made. It should be noted that providing a secure operating environment is a continuous process; however, this is beyond the scope of this study. All five worker node machines ran CentOS Linux and were deployed on the Hadoop platform, whilst the remaining master machine ran the Windows OS for performing the analytical tasks. The IP address and host names of the nodes were identified at each node in the respective host configuration. Figure 3 presents the cluster setup, in which HDFS distributes file blocks among the cluster nodes. The work involved in any analysis task is distributed to the worker nodes. Each involved node runs the assigned task on its block of the file, and the results are collated and digested into a single result after each involved block has been analyzed.
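As an illustration of how a Spark session might be sized for such a cluster, the sketch below requests five executors matched to the 4 vCPU / 15 GB worker nodes; the exact values are assumptions for illustration rather than the tuned configuration used in this deployment.

```python
from pyspark.sql import SparkSession

# Illustrative session configuration for a YARN cluster with five worker nodes,
# each offering 4 vCPUs and 15 GB of RAM (values are assumptions, not tuned).
spark = (SparkSession.builder
         .appName("ecommerce-analytics")
         .master("yarn")
         .config("spark.executor.instances", "5")
         .config("spark.executor.cores", "3")     # leave a core for OS daemons
         .config("spark.executor.memory", "10g")  # leave headroom for overhead
         .getOrCreate())
```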
To construct the big e-commerce data platform, the following components were set up on the cluster nodes:
  • Flume: data acquisition.
  • Hadoop platform: comprises the storage component HDFS, the processing component MapReduce, and the resource negotiator YARN [29].
  • Spark Core [22]: includes components for scheduling, fault recovery, memory management, and interaction with the storage system.
  • SparkSQL: facilitates reading, writing, and managing data stored in the data lake repository and querying the data [22,26].
The data acquisition component, Flume, is set up on the master node to actively poll data and is responsible for sinking the data into a specified zone in the data lake repository with respect to the data resource. Because the e-commerce data need to be organized to avoid a “Data Swamp”, a term used to describe data that have lost their governing processes and standards and have become difficult to find, manipulate, and, inevitably, analyze, multiple Flume agents are configured. Once the data were available in the data lake, the analytical tools were connected to the big e-commerce data platform via a network connection. The analytical tools perform analytical tasks by sending queries through Spark, and the querying components return the outcome to the analytical tool. Thus, the analytical tools work on top of the big e-commerce data platform. In this implementation, the Cloudera Manager Hadoop distribution was adopted because it provides an easy development environment for practicing users to develop tools for e-commerce-related applications. The Cloudera Manager automates the installation of the required configurations, debugging, the Structured Query Language (SQL) database, Cloudera Distributed Hadoop (CDH) agents, and other components necessary for this platform.
Users can use SparkSQL’s built-in libraries (e.g., SQL and DataFrames) or third-party integrations (e.g., Zeppelin [30]). These tools can be leveraged to perform data pre-processing tasks such as missing-value imputation, PCA, anomaly detection, and data cleaning.
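A hedged sketch of this kind of pre-processing with the DataFrame API and MLlib is shown below; the input path, feature columns, and the choice of two principal components are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, PCA

spark = SparkSession.builder.appName("preprocessing-sketch").getOrCreate()

# Hypothetical customer-feature table exported to the Refined Data Zone.
df = spark.read.parquet("gs://example-ecommerce-lake/refined/customer_features")

# Basic cleaning: drop exact duplicates and fill missing numeric values with 0.
df = df.dropDuplicates().fillna(0, subset=["orders_30d", "revenue_30d", "returns_30d"])

# Assemble the numeric columns and reduce them to two principal components.
assembler = VectorAssembler(
    inputCols=["orders_30d", "revenue_30d", "returns_30d"], outputCol="features")
features = assembler.transform(df)
pca_model = PCA(k=2, inputCol="features", outputCol="pca_features").fit(features)
reduced = pca_model.transform(features)
reduced.select("pca_features").show(5, truncate=False)
```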

4.2. Identifying Order Time Periods

In this application, the periods in which customers are most likely to place orders are of interest, as they can assist organizations in the planning stages to accommodate higher-traffic periods. Furthermore, this allows for identifying periods when customers are more likely to purchase items; accordingly, informed decisions related to promotions and sales can be made in ways that may increase revenue and satisfy customers. This application was performed on one e-store to identify periods with higher order demand. Figure 4 presents a heatmap illustrating order demand by weekday for each month of the year. It can be observed that, for this e-store, demand was higher from December to May. Specifically, the first days of April are typically the busiest periods of the year. In addition, businesses may take action to encourage customers to place orders at other times after conducting proper research.
A heatmap of more than 10,000 e-stores is shown in Figure 5. It can be observed that April is the busiest month. These insights could potentially assist in inventory planning and in making informed decisions on specific periods that benefit the organization. It is worth noting that a similar analysis can be conducted to identify the hours of the day when customers are more likely to place orders.
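The sketch below outlines one way such a month-by-weekday heatmap could be produced once the transactions have been pulled from the platform; the input file and column names are assumptions for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical export of order records obtained through the platform (e.g., via SparkSQL).
orders = pd.read_parquet("orders.parquet")
orders["order_date"] = pd.to_datetime(orders["order_date"])
orders["month"] = orders["order_date"].dt.month
orders["weekday"] = orders["order_date"].dt.day_name()

# Count orders per (weekday, month) cell and order the rows chronologically.
heat = (orders.pivot_table(index="weekday", columns="month",
                           values="order_date", aggfunc="count")
              .reindex(["Monday", "Tuesday", "Wednesday", "Thursday",
                        "Friday", "Saturday", "Sunday"])
              .fillna(0))

plt.imshow(heat, aspect="auto", cmap="viridis")
plt.xticks(range(len(heat.columns)), heat.columns)
plt.yticks(range(len(heat.index)), heat.index)
plt.colorbar(label="Number of orders")
plt.xlabel("Month")
plt.ylabel("Weekday")
plt.show()
```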

4.3. Analysis of Customers’ Spatial Distribution

The second application verifies the big e-commerce data platform and analyzes customers’ spatial distribution through transactions. Finding distinct patterns in the spatial distribution of consumers is made easier by visualizing the distribution of the locations where orders are placed. Furthermore, comparing the number of transactions, the distributions, and their respective properties, such as the distance from warehouses, could reveal noteworthy differences and similarities between e-stores. Figure 6 shows the number of orders placed for a specific e-store during a month. From this visualization, decisions that are necessary for effective urban planning and determining an optimized location for warehouses can be made. Delivery options and priorities can also be set once demand locations are identified.
Further analysis can be conducted to investigate why certain cities may have no or fewer transactions, and corrective measures, such as location-specific commercial advertisements, can be implemented accordingly. Furthermore, the consequences of the actions can be witnessed through the feedback loop of the big e-commerce data platform.
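One way to aggregate orders by city and draw a simple spatial scatter of the kind shown in Figure 6 is sketched below; the input file, column names, and coordinates are illustrative assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical export: one row per order with the delivery city and its coordinates.
orders = pd.read_parquet("orders_with_city.parquet")

# Number of orders placed per city.
by_city = (orders.groupby(["city", "latitude", "longitude"])
                 .size()
                 .reset_index(name="orders"))

# One dot per city, scaled by its order count (cf. Figure 6).
plt.scatter(by_city["longitude"], by_city["latitude"],
            s=by_city["orders"], alpha=0.5)
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Orders placed per city")
plt.show()
```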

4.4. Forecasting Store Sales

The third application is a time series forecast that makes predictions based on historical sales transactions with time stamps. It involves building models that make trend and seasonal observations to help merchants improve store planning, e.g., inventory planning, where the decision maker can determine the optimal quantity and timing for his/her inventory. This application was carried out on an actual dataset from three e-stores. The data cover daily sales recorded for almost five years, from January 2018 to October 2022, with more than 74,000 online sales records that contain information such as order date and sales amount. In this application, Prophet [31], an open-source framework developed by Facebook’s data science team, is used to forecast the time series of sales data.
In forecasting, the patterns of the data are important. Based on Figure 7, the data show a seasonal monthly pattern per year in the three online stores. For example, e-store 3 (see Figure 7c) has a changing trend that appeared in August 2021. Note that Figure 7 shows the sales forecast for the three e-stores, with the actual values of the time series plotted as black dots and the forecasted values as a blue line. In addition, the model can forecast sales based on the patterns in the actual data even though there are outliers in the data; see, for example, e-store 1 in Figure 7a.
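A minimal Prophet sketch of the kind of daily sales forecast described here is given below; the input file, column names, and 90-day horizon are assumptions for illustration.

```python
import pandas as pd
from prophet import Prophet  # older releases: from fbprophet import Prophet

# Hypothetical daily sales export for one e-store; Prophet expects the
# columns 'ds' (date) and 'y' (value to forecast).
daily_sales = pd.read_csv("estore1_daily_sales.csv")
df = daily_sales.rename(columns={"order_date": "ds", "sales": "y"})

model = Prophet(yearly_seasonality=True, weekly_seasonality=True)
model.fit(df)

# Forecast 90 days beyond the last observed date and plot actuals vs. forecast.
future = model.make_future_dataframe(periods=90)
forecast = model.predict(future)
model.plot(forecast)  # black dots = actual sales, blue line = forecast
```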

5. Conclusions

This paper addresses the research question of how to leverage big data frameworks and tools to build a comprehensive platform that addresses the needs of novice users in the e-commerce domain in Saudi Arabia. In particular, this paper presented a big e-commerce data platform based on the lambda architecture. The requirements of applications related to e-commerce data were discussed to determine the appropriate components for building the e-commerce platform. The platform was hosted on a cloud service, following the lambda architecture principles and design, to perform batch and real-time processing for e-commerce applications. By utilizing batch and real-time processing frameworks, the proposed platform can handle enormous amounts of e-commerce data, collecting and storing them in a cloud data lake. This allows the storage of various e-commerce-related data, including customer information, transactions, social feeds, weather, and image and video data. Furthermore, for industries currently relying on data warehouses, the proposed cloud data lake enables the integration of such data warehouses, allowing the data to be loaded into data warehouses or consumed directly by analytics and business intelligence software and tools. Businesses that wish to establish a data warehouse can also obtain the required data from the data lake.

Three analytical applications were implemented to validate the effectiveness of the big e-commerce data platform. The first application identified the periods in which customers prefer to place orders. The second application, which verifies the big e-commerce data platform, presented a visualization analyzing the distribution of the locations where orders were placed. The third application forecast time series data; we considered the case of forecasting store sales and how it can serve as a better tool for planning and management by e-commerce merchants. Such applications could provide insights that could potentially assist with numerous business aspects. Furthermore, viewing and acting on the insight results and findings promotes informed decisions that potentially benefit the e-commerce industry. Moreover, our current implementation of the proposed platform uses cloud-based storage services (e.g., Google Cloud Storage), which allows integration with other software systems via RESTful APIs and language-specific SDKs. The impact of the presented big e-commerce data platform extends beyond the presented applications, and as additional e-commerce-related data become available, the platform becomes more proficient in performing further analytical applications.

The limitations of the presented work can be summarized as follows: (1) the chance of a “data swamp” increases as more data become available; however, proper organization of the data lake and the inclusion of multiple Flume agents help to mitigate this. (2) The data lake may be vulnerable to security threats, data breaches, and cyberattacks unless it is safeguarded via modern technologies; securing the platform is an ongoing process that is beyond the scope of this work. Future research could investigate the best security practices for implementing and securing the platform as a whole. Analytical applications that include numerous data resources to further understand customer behavior patterns, enhance the customer experience, and streamline various processes could also be studied.

Author Contributions

Conceptualization, A.M., A.A. (Ahmad Alhindi) and T.M.Q.; writing—original draft, A.M., A.A. (Ahmad Alhindi) and T.M.Q.; writing—review and editing, A.M. and T.M.Q.; project administration, A.A. (Amjad Alqurashi); funding acquisition, A.A. (Ahmad Alhindi). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Salla Research & Innovation Centre, Makkah, grant number [SRI222].

Data Availability Statement

Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Duffin, A. The State of the Digital Commerce; Gartner for Marketers; Gartner: Stamford, CT, USA, 2021. [Google Scholar]
  2. von Abrams, K. Global Ecommerce Forecast 2021: Digital Leads the Way, Building on 2020’s Growth; eMarketer: New York, NY, USA, 2021. [Google Scholar]
  3. Akter, S.; Fosso Wamba, S. Big data analytics in e-commerce: A systematic review and agenda for future research. Electron. Mark. 2016, 26, 173–194. [Google Scholar] [CrossRef]
  4. Arya, K.; Kumar, T.; Jain, M.K. Big data analytics of global e-commerce organisations: A study, survey and analysis. Int. J. Sci. Eng. Res. 2016, 7, 82–84. [Google Scholar]
  5. Pavithra, B.; Niranjanmurthy, M.; Shaker, J.K.; Mani, M.S. The study of big data analytics in e-commerce. Int. J. Adv. Res. Comput. Commun. Eng. 2016, 5, 126–131. [Google Scholar]
  6. Vinodhini, M.; Appukuttan, M. A survey on big data analytics in e-commerce. IJEST ISSN 2016, 1, 61–64. [Google Scholar]
  7. Hadwan, M. Social commerce in saudi arabia: Literature review. Int. J. Eng. Res. Technol. 2019, 12, 3018–3026. [Google Scholar]
  8. Mikalef, P.; Pappas, I.; Krogstie, J.; Giannakos, M. Big data analytics capabilities: A systematic literature review and research agenda. Inf. Syst. Bus. Manag. 2018, 16, 547–578. [Google Scholar] [CrossRef]
  9. Alrumiah, S.; Hadwan, M. Implementing big data analytics in e-commerce: Vendor and customer view. IEEE Access 2021, 9, 37281–37286. [Google Scholar] [CrossRef]
  10. Zhuang, W. The influence of big data analytics on e-commerce: Case study of the U.S. and China. Wirel. Commun. Mob. Comput. 2021, 2021, e2888673. [Google Scholar] [CrossRef]
  11. Weiqing, Z.; Wang, M.C.; Nakamoto, I.; Jiang, M. Big Data Analytics in E-commerce for the U.S. and China Through Literature Reviewing. J. Syst. Sci. Inf. 2021, 9, 16–44. [Google Scholar] [CrossRef]
  12. Panwar, M.; Wadhwa, A.; Pippal, S. An overview: Exploratory data analytics on e-commerce dataset. In Proceedings of the 2021 3rd International Conference on Advances in Computing, Communication Control and Networking (ICAC3N), Greater Noida, India, 17–18 December 2021; pp. 91–93. [Google Scholar]
  13. Mohamed, M.; El-Henawy, I.; Salah, A. Price prediction of seasonal items using machine learning and statistical methods. Comput. Mater. Contin. 2022, 70, 3473–3489. [Google Scholar] [CrossRef]
  14. Vapiwala, F.; Pandita, D. Analyzing the Application of Artificial Intelligence for E-Commerce Customer Engagement. In Proceedings of the 2022 International Conference on Data Analytics for Business and Industry (ICDABI), Sakhir, Bahrain, 25–26 October 2022; pp. 423–427. [Google Scholar] [CrossRef]
  15. Yu, J.; Wen, Y.; Yang, L.; Zhao, Z.; Guo, Y.; Guo, X. Monitoring on Triboelectric Nanogenerator and Deep Learning Method. Nano Energy 2022, 92, 106698. [Google Scholar] [CrossRef]
  16. Salamai, A.; Ageeli, A.; El-kenawy, E.-S. Forecasting e-commerce adoption based on bidirectional recurrent neural networks. Comput. Mater. Contin. 2022, 70, 5091–5106. [Google Scholar] [CrossRef]
  17. Luo, Y.; Xu, S.; Xie, C. E-commerce big data classification and mining algorithm based on artificial intelligence. In Proceedings of the 2022 IEEE 2nd International Conference on Electronic Technology, Communication and Information (ICETCI), Virtual Event, 27–29 May 2022; pp. 1153–1155. [Google Scholar] [CrossRef]
  18. Yu, R.; Wu, C.; Yan, B.; Yu, B.; Zhou, X.; Yu, Y.; Chen, N. Analysis of the impact of big data on e-commerce in cloud computing environment. Complexity 2021, 2021, e5613599. [Google Scholar] [CrossRef]
  19. Desai, P.; Ganatra, K. Artificial Intelligence In Strengthening The Operations Of Ecommerce Based Business. In Proceedings of the 2022 Interdisciplinary Research in Technology and Management (IRTM), Kolkata, India, 26–28 February 2022; pp. 1–7. [Google Scholar] [CrossRef]
  20. Marz, N.; Warren, J. Big Data: Principles and Best Practices of Scalable Real-Time Data Systems, Manning; Simon and Schuster: New York, NY, USA, 2015. [Google Scholar]
  21. Munshi, A.; Mohamed, Y. Data lake lambda architecture for smart grids big data analytics. IEEE Access 2018, 6, 40463–40471. [Google Scholar] [CrossRef]
  22. Armbrust, M.; Reynold, S.X.; Cheng, L.; Yin, H.; Davies., L.; Bradley, J.K.; Meng, X. Spark sql: Relational data processing in spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, VIC, Australia, 31 May–4 June 2015; pp. 1383–1394. [Google Scholar] [CrossRef]
  23. VMware. Apache Flume and Apache Sqoop Data Ingestion to Apache Hadoop Clusters on VMware vSphere; White Paper Edition; VMware: Palo Alto, CA, USA, 2013. [Google Scholar]
  24. Cloudera ODBC Driver for Impala. Cloudera, Palo Alto, CA, USA. 2016. Available online: http://www.cloudera.com/documentation/other/connectors/impala-odbc/latest/Cloudera-ODBC-Driver-for-Impala-Install-Guide.pdf (accessed on 1 September 2023).
  25. MATLAB. Tall Arrays. Available online: https://www.mathworks.com/help/matlab/import-export/tall-arrays.html (accessed on 1 September 2023).
  26. PySpark. Available online: https://spark.apache.org/docs/latest/api/python/ (accessed on 1 September 2023).
  27. Meng, X.; Bradley, J.; Yavuz, B.; Sparks, E.; Venkataraman, S.; Liu, D.; Freeman, J.; Tsai, D.B.; Amde, M.; Owen, S.; et al. MLlib: Machine learning in apache spark. J. Mach. Learn. Res. 2016, 17, 1235–1241. Available online: http://jmlr.org/papers/v17/15-237.html (accessed on 1 September 2023).
  28. Tableau Software. Tableau. Available online: https://www.tableau.com (accessed on 1 September 2023).
  29. Vavilapalli, V.K.; Murthy, A.C.; Douglas, C.; Agarwal, S.; Konar, M.; Evans, R.; Graves, T.; Lowe, J.; Shah, H.; Seth, S. Apache hadoop YARN: Yet another resource negotiator. In Proceedings of the 4th Annual Symposium on Cloud Computing, SoCC, Santa Clara, CA, USA, 1–3 October 2013. [Google Scholar] [CrossRef]
  30. Cheng, Y.; Liu, F.; Jing, S.; Xu, W.; Chau, D. Building Big Data Processing and Visualization Pipeline through Apache Zeppelin. In Proceedings of the Practice and Experience on Advanced Research Computing, New Orleans, LA, USA, 9–13 July 2017; pp. 1–7. [Google Scholar] [CrossRef]
  31. Taylor, S.; Letham, B. Forecasting at scale. PeerJ 2017, 9, 2167–9843. Available online: https://peerj.com/preprints/3190 (accessed on 1 September 2023). [CrossRef]
Figure 1. This figure shows the high-level architecture of the proposed big e-commerce data platform.
Figure 2. Flume data stream topology.
Figure 3. Cluster setup and HDFS distributes file blocks among cluster nodes.
Figure 4. Heatmap of transactions for a single e-store.
Figure 5. Heatmap of transactions for around 10,000 e-stores.
Figure 6. A visualization showing the distribution of customers for an e-store. Each dot represents a city. Each city is associated with a number of customers’ orders.
Figure 7. Prediction results using FB-prophet. Black dots represent the actual data points, while the blue line represents the FB-prophet predictions.
Table 1. E-commerce processing requirements depicted with their respective parameter values.

E-Commerce Requirement | Respective Values
The size of the data to be processed is large, and the vast amount of data must be divided into more manageable portions | Volume
The data require processing in parallel across multiple machines | Volume
The data require processing simultaneously across multiple program modules | Volume
The data need to be processed completely and simultaneously due to high volumes | Volume
The processing of data needs to be reliable and resumable from any point of failure, as restarting the process from the beginning would be extremely time consuming | Volume
The data are required to be processed at real-time streaming speeds | Velocity
The data requiring processing come from multiple resources, sometimes simultaneously | Velocity
The data may critically need multi-pass processing and scalability | Velocity
The data are of different forms | Variety
The data are of different structures | Variety
The data may be complex and need to use multiple algorithms for prompt processing | Variety

