Proceedings of the Second Annual Data Science Symposium, Mercyhurst University, Erie, PA, May 4, 2019
Program Chair & Proceedings Editor: M. Afzal Upal, PhD
Chair of Computing & Information Science Department
Mercyhurst University
501 E 38th St
Erie, PA, 16546
Table of Contents
Music Genre Classification Using Machine Learning Techniques (Andrew Innes)
Predicting Hole by Hole Golf Scores on the PGA Tour (Ron Richardson)
Logistic Regression Versus Convolutional Neural Networks for Classification (Jerrin Varghese)
Machine Learning for the Detection of Mobile Malware on Android Devices (Christina Eusanio)
Building a Gun Detection Model Using Deep Learning (Shraddha Dubey)
Flight Delay/Cancellation Prediction Using Machine Learning (Milos Veres)
How do Socioeconomic Factors Effect the Amount of Waste Produced (Heidi Beezub)
Using Stock Market Data to Evaluate Genetic Algorithm Performance (Bill Fisher)
Stock Market Price Model Using Sentiment and Market Analysis (Justin Minsk)
Connecting People: Psychology and Machine Learning (Praveen Neelappa)
Medical Brain Drain: The Relationship between Regulation and Emigration (Kimberly Staudt)
Predicting Future Poaching Sites in African Reserves (Stephanie Le Grange)
Mass Shootings in the United States: An Analysis and Prediction (Dayana Moncada)
Are ISIS Sympathizers More Like Republicans or "Water For All" Charity Members? (M. Afzal Upal)
Music Genre Classification Using Machine Learning Techniques
Andrew Innes
Mercyhurst University
501 E 38th St,
Erie, PA 16546
ainnes54@lakers.mercyhurst.edu
ABSTRACT
Music genre classification has been a widely studied topic for
many years and has grown drastically with the rise of
machine learning. Various techniques have been used to
classify music, but which techniques perform the best? In this
study, multiple machine learning techniques are compared to
determine which algorithm is the most accurate and the most
efficient.
Keywords
Music Genre Classification, Neural Network, CNN, CRNN,
Sequential, Decision Tree, Logistic Regression, Random Forest,
SVM.
1. INTRODUCTION
Machine learning promises to transform the music industry for the
better, and we will see consistent change for years to come. With
the emergence of massive streaming services, machine learning has
taken over the science behind the music. Machine learning has
already helped artists almost fully create songs and has also helped
get new artists off the ground. Other uses include predicting the
next big hit, which combines big data analytics with machine
learning techniques. In this paper, we
will focus on the difficult task of music genre classification.
Music genre classification is an ongoing research problem that has
increased in importance since the emergence of music streaming
applications, personalized radios, and at-home playlists. There are
currently over a thousand micro-genres of popular music which
makes genre classification a very challenging task. Micro-genres
or sub-genres can make most music hard to categorize because they
can carry multiple elements of different genres. In this paper, we
address the music genre classification problem using machine
learning techniques.
Music genre classification has two key uses. The first is what
streaming services like Spotify and Pandora use it for: recommending
a similar song while a user listens to music. The service takes the
song that is currently playing, classifies it into a specific genre or
sub-genre, and then picks another song from that genre to play next.
Another use for music genre classification would be for sorting out
a large music collection. This can be helpful for the everyday
music listener or for today's disc jockeys (DJs), who often have
massive music libraries. The problem is,
many of these songs are not tagged with proper labels or are not
tagged at all. Automatic music genre classification aims to solve
this problem by quickly sifting through a user’s music library and
correctly placing genre tags on each song. In the end, this would
make it easier for a user to sort through their music and generate
playlists with similar music. This paper will focus on music genre
classification for the user.
Music professionals such as DJs could benefit significantly from a
genre classification tool. There are already many applications that
can automatically sort through music using key features to create
playlists. However, these key features must already be labeled.
These labels can be anything from the genre of the music, the mood,
or just the song title. As stated before, many songs aren’t labeled
or are not labeled correctly.
Music can be broken down into multiple genres and multiple subgenres. This makes the task of classifying music generally hard.
Another reason music genre classification is difficult is the genre
definitions themselves [1]. Many genres, and especially sub-genres,
have unclear definitions. For example, electronic music has many
sub-genres with very similar elements, where one specific sound can
be the reason a track falls into a particular sub-genre. Another
example is New Weird America, which is considered an indie folk/rock
variant descended from the psychedelic folk and rock of the 60s
and 70s. It contains elements from multiple genres, including
metal, free jazz, electronic music, world music, Latin, noise, and
opera [2]. This definition brings in the human aspect of genre
classification.
When you add in the human element to classify music, it can get
subjective. Humans typically have their own opinions when it
comes to music. For instance, one person could say this song is
amazing and is considered this genre while another could consider
it just noise. Also, humans want things done relatively quickly.
Numerous previous studies have investigated music genre
classification and have used a variety of different machine learning
algorithms. As with any machine learning algorithm, some work
better than others. In a study done using the GTZAN dataset,
college students could achieve a 70% classification accuracy after
listening to 3 seconds of music [3]. When listening to longer clips
of music, the accuracy was not any better. One previous work
used a convolutional deep belief network that achieved a 70%
classification accuracy on a 5-genre classification task [3]. Based
on this report there is still more room for improvement.
In another report, a combined algorithm for music genre
classification, based on specific parameters and on a set of SVMs,
was used to classify up to 100% of songs correctly. However, it was
trained on 80% of a database that contained only 72 songs [1]. When
using only 10% of the dataset, recognition rates varied from 51% to 92%.
Overall, the classification method worked but was only used on a
small dataset. This shows that there is still room for improvement
when using a larger dataset.
Another report used a combination of convolutional neural
networks (CNN) and recurrent neural networks (RNN) to form a
CRNN for music tagging [4]. Within this report, they used the
million-song dataset. However, they used the CRNN to predict the
top-50 tags, whereas we will look at genre specifically. Their
evaluation allows them to predict genre, mood, instruments used,
and the era of music.
To further understand how we aim to classify music into
genres, you must understand what features make up a song. Some
genres contain specific segments such as a solo guitarist in rock
music, a drop in electronic music, or a chorus in pop music. All
these can be used to help classify which genre of music a song
belongs to. More features can include mood, beats per minute
(BPM), pitch, tone, instruments used, lyrics, and release date.
There are many more features that can be used to classify songs.
We will provide detailed information about the features used in this
report later.
2. RELEVANT WORK
Many different techniques have been applied to the problem of
music genre classification. The average person can classify music
with roughly 70% accuracy when listening to 30-second audio
samples. Many previous attempts to use machine learning to
classify music have achieved accuracy rates approaching that of
humans. This could be because the classification of
music into genres is a highly subjective issue. Many genres of
music contain very similar musical elements that make it difficult
for both humans and machines to accurately classify music.
However, music genre classification in data science is used to speed
up the process of classifying genre while maintaining a high
accuracy rate. In this section we will look at relevant works that
have successfully classified music using neural network approaches,
that use similar datasets, and that focus on feature importance.
The first relevant work we will discuss compares several
approaches to classifying genre: support vector machines, neural
networks, decision trees, k-nearest neighbor, and composer
classification [5]. They tested each model, reporting the average F1
score across genres. For their support vector machine model, they
used both linear and polynomial kernels; the linear kernel showed an
F-score of 0.847 and the polynomial kernel an F-score of 0.838 [5].
Their neural network approach used a model with two hidden layers,
which achieved an average F-score of 0.9 [5]. As discussed in their
results, the neural network approach showed the best classification
results across 4 genres with 100 songs each. Their next approach,
k-nearest neighbor, saw a 0.842 F-score, and their final model, a
decision tree, showed an F-score of 0.793. This paper also showed
that there is much room for improvement, as only a few genres were
attempted on a small dataset.
Many previous studies focus on the use of convolutional neural
networks (CNNs) to assist in music genre classification. One
specific paper written at South China University of Technology
focused on feature extraction and the use of CNNs to predict music
genre at a 72.4% accuracy [6]. Their algorithm was based on
spectrograms and CNNs. A spectrogram is a visual representation
of the frequencies in sound. The spectrogram contains more details
of music components such as pitch and tempo which can assist in
classifying music into a specific genre. They used a feature
detector as a filter to divide the spectrogram into four feature maps,
which were used to see trends of the spectrogram in both time and
frequency. Using CNNs will assist in music genre classification
based on the convolution of the spectrogram [6]. Their final step
was connecting the features to a multi-layer perceptron classifier
(MLP).
The GTZAN dataset has been used by many other reports written
on this topic [5]. It contains 10 genres, each represented by 100
thirty-second audio clips. Other studies have used only a portion of
the GTZAN dataset to test the accuracy of many machine learning
approaches. Although this paper will not use the GTZAN dataset,
it is a good baseline to classify genre using audio clips and can help
focus on specific features to classify music. Evaluating audio clips
will be of importance in classifying music into specific genres.
This study concludes with a discussion of future work. The authors
note that they manually selected the feature detector and are
interested in how to automatically learn the feature detector [6].
They also suggest trying to add more layers to the CNNs. These
two changes could help create higher-level features which could
help with improving the accuracy of the study.
Another study uses the same dataset and focuses on similar
techniques but better explains the use of CNNs. CNNs are used to
learn filters that extract features in the time and frequency domain
[3]. If the filters mimic spectro-temporal receptive fields (STRF) in
the human auditory system, useful features can then be extracted
for music genre classification [3]. In order to successfully fit the
CNN to the spectrogram of the music signal, the spectrogram must
be split into 3-second segments. This allows the CNN to make
predictions for each segment and then combine the predictions
together. This approach is used because human classification accuracy
plateaus at 3 seconds, and good results were obtained using 3-second
segments to train a convolutional deep belief network [6]. In
short, this study used the “Divide and Conquer” technique to
successfully implement the CNN and reach human level accuracy.
Another study uses a combination of CNNs and recurrent neural
networks (RNNs) to form a convolutional recurrent neural network
(CRNN) for music genre classification [4]. A CRNN is a modified
CNN with the last convolutional layers replaced with an RNN.
The CNN plays the role of a feature extractor while the RNN acts as a
temporal summarizer. The RNN allows the networks to take the
global structure into account while local features are extracted with
the remaining convolutional layers. This allows the networks to
focus on all features such as mood and instruments. Mood would
be considered a global feature while instrument would be
considered a local feature. In their report, they tested their CRNN
against 3 other CNNs. When testing for speed, one of the CNNs
performed faster than the CRNN in all testing parameters [4].
However, the CRNN outperformed that CNN with the same
number of parameters. In this paper, we will be testing for both
speed and accuracy in the hopes of finding the best overall
classifier.
A previous study also used the million-song dataset (MSD). The
million-song dataset is considered to be the largest dataset in the
field of music [8]. It is a freely-available collection of audio
features and metadata for a million contemporary popular music
tracks [8]. The MSD is a unique dataset that worked around the
issue of music licensing by using songs that were legally available
to The Echo Nest. The Echo Nest is one of the world’s largest
music data companies that focuses on music intelligence to power
smarter music applications. The MSD was created to further
research into music tagging and genre classification in a legal and
larger way. This dataset is pushing the boundaries for music and
data as many previous datasets have not come close to the size or
amount of tags as the MSD. As stated before, the MSD contains audio
features and metadata for 1 million songs, including 280 GB
of data, 44,745 unique artists, 7,643 unique terms (Echo Nest tags),
2,321 unique musicbrainz tags, 43,943 artists with at least one term,
2,201,916 asymmetric similarity relationships, and 515,576 dated
tracks starting from 1922 [8]. The MSD can be used alongside The
Echo Nest's API to retrieve extra identifiers and updated tags that
have changed since the release of the dataset. The MSD also contains
audio clips with acoustic features such as pitches,
timbre, and loudness, as well as peak loudness.
When looking for important features that could help classify music
into genre, we must look at music information retrieval. One
relevant work was written on multiple-instance learning for music
information retrieval [9]. In this paper they use two types of
features to describe musical audio. One of those features is spectral
features that capture temporal aspects of music in relation to
instruments and production quality [9]. The second features are
types of temporal features that summarize the beat, tempo, and
rhythmic complexity in four different frequency bands. The beat
would be considered the overall structure of the song. In today's
world of music, most songs follow a standard 4/4 time structure.
The tempo is how fast a song is played out. Most songs vary in
tempo however, much of today’s music is in tempos that are related
to specific genres. For instance, dance music tends to be around 128
beats per minute (BPM), and most pop music is around 100 BPM.
Rhythmic complexity describes how complicated the song is to
follow. Songs are more complex when they don’t follow a
structure. This could be from a speed up or slow-down of the tempo
or a guitar solo that throws off the main portion of the song.
Temporal features are calculated on the magnitude of the Mel
spectrogram. Closely related is the Mel-frequency cepstrum (MFC),
a representation of the short-term power spectrum of a sound based
on a linear cosine transform
of a log power spectrum on a nonlinear Mel scale of frequency. The
Mel bands are then combined into four large bands at low, low-mid,
high-mid, and high frequencies given the total magnitude of each
[9]. The spectral features consist of the mean and unwrapped
covariance of clip’s Mel-frequency cepstral coefficients (MFCC).
The MFCCs are calculated from the Mel spectrogram used in the
temporal features above. The MFCC is a non-linear spectrum of a
spectrum. Although Mandel and Ellis did not attempt to solve our
problem or use the same dataset, their work focused heavily on
features that we will need to analyze in order to classify genre
correctly.
To summarize, previous studies focus on audio clips and the
spectrogram. Most recent work has focused on classifying genre
with the use of audio features. Audio features are the sounds and
structure that make up a song. Through neural network
techniques, previous studies were able to classify audio clips into
genre categories at around a 70% accuracy rate. In this
report, we will attempt to achieve a higher classification accuracy
by incorporating other aspects of music data, including music
tags from the million-song dataset such as artist, title, and
release date. This paper focuses on using as many features as
possible in order to accurately classify music. In the next section
we will discuss our proposed solution and the exact steps we will
be taking to solve the music genre classification problem.
3. PROPOSED SOLUTION
In order to solve the problem of music genre classification we must
use a music library or dataset with genre labels. This study will use
a combination of the large and medium subset of the FMA dataset.
The FMA large dataset contains approximately 105,000 tracks.
The FMA medium dataset is made up of 25,000 tracks that are 30
seconds long and contains 16 genres. The dataset also includes a
metadata folder that contains CSV files with feature information
and music tags such as genre. During research, many studies
focused on audio extraction for their neural network to better
understand what a song is made of. For this study, these features
are already extracted using the LibROSA package which will be
discussed later in this report.
Since the dataset is made up of audio files with some metadata, we
can implement a comparison study on different machine learning
algorithms and build two datasets. The first dataset includes a
database of features extracted from each song using the LibROSA
package in Python. However, this dataset does not include the
spectrogram. The second dataset includes only the spectrogram
(See Figure 1) for each song and uses the genre as the label. Using
two datasets allows us to test different algorithms to see which
performs better.
Figure 1: Sample spectrogram showing frequency and time.

For the first dataset, we focus on using multiple classifiers to
determine which performs the fastest and is the most accurate. These
classifiers include a Decision Tree, an SVC, Logistic Regression, and
a Random Forest model. In the FMA medium dataset, the data is
already split between train, test, and validation. This information
is accessed through the Tracks CSV which contains metadata for
each track. Although there are 25,000 songs in this dataset only
13,522 are available for training, and 1,705 are available for testing.
This gives us approximately an 87% split for training and a 13%
split for testing. This dataset will include 252 features that were
extracted using the LibROSA package.
For the second dataset, we focus on an image recognition task. This
task will be used on the spectrograms of each song to classify its
genre. For this, we use deep learning with a combination of a
Convolutional Neural Network (CNN) with a Recurrent Neural
Network (RNN) as a CRNN. Although this has been tested in
previous reports, we are doing a comparison study between
different machine learning techniques. To gather more data for our
training sets, we can combine the large and medium subsets. Only
24,986 songs are available for use in our models due to an issue
with the Tracks CSV. 2,906 songs are available for the validation
set and 3,974 are available for the test set. This gives us a split of
73%, 12%, and 15%. In the next section we will provide the details
about each algorithm and explain the appropriate steps taken to
reach our results.
4. EVALUATION
The first step in the evaluation process is data collection. For this
research paper, music was collected from the FMA dataset. As
mentioned before, this dataset includes subsets. For this paper, we
are using the FMA medium dataset, which includes 25,000 songs
across 16 genres. These genres include Blues, Classical, Country,
Easy Listening, Electronic, Experimental, Folk, Hip-Hop,
Instrumental, International, Jazz, Old-Time/Historic, Pop, Rock,
Soul-R&B, and Spoken. Unfortunately, the FMA medium dataset
does not have an even split among genres, but it is best to use a
well-constructed dataset with correct genre labels when testing various
algorithms. The FMA dataset includes two CSV files to easily
connect each track with its metadata and its extracted features from
the LibROSA package. Metadata information is found within the
Tracks CSV and features information is found within the Features
CSV.
The next step involved the LibROSA package in Python that allows
us to convert our music into ready-to-use data. The LibROSA
package is a python package for music and audio analysis that
provides the building blocks for music information retrieval. The
LibROSA package was used to extract features for the Features
CSV. For this paper, we use 6 features that are commonly used in
music information retrieval. The first feature used is the Root Mean
Square Energy (RMSE) value for each spectrogram. In music, the
energy of the signal is also known as the magnitude of the signal or
how loud the song is. The next feature is a chromagram generated
from the song's waveform. A chromagram is used to predict the
song's pitch class, based on the 12 semitones of a piano keyboard,
and allows us to express the song's key as a numerical value. The
third feature is the spectral centroid. The spectral centroid
indicates where the "center of mass" of the spectrum is located and
is connected with the "brightness" of a sound. The next feature is
the spectral bandwidth, which is the difference between the upper
and lower frequencies of a spectrum. This can determine whether a
song has more bass or more high-end sounds. The fifth feature is the
zero-crossing rate, which indicates the number of times a signal
crosses the horizontal axis. The zero-crossing rate is useful for
determining at which points drums are present in a song. The final
feature is a set of 20 different features that make up the
Mel-Frequency Cepstral Coefficients (MFCCs). MFCCs are coefficients
that collectively make up the Mel-Frequency Cepstrum (MFC), a
representation of the short-term power spectrum of sound. In the
MFCC, the bands are equally spaced on the Mel scale, which more
closely approximates the response of the human ear. Using 20
different MFCCs breaks the song up into 20 bins, which produces more
accurate MFCCs. One thing to note is that the first MFCC normally
contains silence, which can produce an inaccurate mean value for the
first bin.
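To make this step concrete, the sketch below pulls the same six descriptors from a single clip with LibROSA and summarizes each with its mean, roughly how the Features CSV stores summary statistics. The file name is a placeholder, and note that librosa versions before 0.7 name the RMS function rmse rather than rms.

import librosa
import numpy as np

def extract_features(path):
    # Load the clip at librosa's default 22,050 Hz sample rate.
    y, sr = librosa.load(path)
    features = {
        "rmse": librosa.feature.rms(y=y),                           # loudness
        "chroma": librosa.feature.chroma_stft(y=y, sr=sr),          # pitch class
        "centroid": librosa.feature.spectral_centroid(y=y, sr=sr),  # brightness
        "bandwidth": librosa.feature.spectral_bandwidth(y=y, sr=sr),
        "zcr": librosa.feature.zero_crossing_rate(y),               # axis crossings
        "mfcc": librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20),        # 20 MFCCs
    }
    # Each call returns one value per frame; summarize with the mean.
    return {name: np.mean(v, axis=1) for name, v in features.items()}

print(extract_features("example_clip.mp3"))  # placeholder path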
The next step in the evaluation process is to split the data into train
and test. The FMA dataset already includes a train, test, and
validation split that can be found in the Tracks CSV. This split is
approximately an 87% train and 13% test or validation data. For
more testing, we could use another dataset with the same features
extracted and run our algorithms on that. For the purpose of this
paper, we will be using the test and validation data that is given.
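As an illustration of working with the given split, the following sketch loads the two CSVs and separates training from test data. It assumes the multi-level column layout documented in the FMA repository (a ('set', 'split') column for the split and ('track', 'genre_top') for labels); the file paths are placeholders.

import pandas as pd

# tracks.csv ships with two header rows; features.csv with three.
tracks = pd.read_csv("tracks.csv", index_col=0, header=[0, 1])
features = pd.read_csv("features.csv", index_col=0, header=[0, 1, 2])

# Keep the medium subset (which also contains the small subset).
medium = tracks[tracks["set", "subset"].isin(["small", "medium"])]
split = medium["set", "split"]
labels = medium["track", "genre_top"]

# Use the split shipped in the Tracks CSV rather than a random one.
X_train = features.loc[labels[split == "training"].index]
y_train = labels[split == "training"]
X_test = features.loc[labels[split == "test"].index]
y_test = labels[split == "test"]
print(X_train.shape, X_test.shape)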
As mentioned previously in this paper, we are not just testing for
accuracy but speed of each algorithm on the test set. For this
approach, we train and test our data on four different algorithms
and output the time it takes to train and test the data as well as the
F1 score. These algorithms include a Decision Tree Classifier,
SVC, Logistic Regression, and a Random Forest Classifier. The F1
score, the harmonic mean of precision and recall, serves as the
accuracy measure.
The first algorithm used is a Decision Tree Classifier. Decision
Tree models can be used for classification or regression. The model
builds a tree structure, breaking the data down into smaller and
smaller subsets that eventually lead to a prediction. Decision Tree
models tend to be one
of the most used algorithms in machine learning due to its high
classification accuracy.
The second algorithm we train our data on is an SVC which comes
from support vector machines. The goal of the SVC model is to find
the decision boundary that best separates the classes in the training
data. This helps categorize the data and return a more accurate
prediction. Many
music genre classification tasks use an SVC model.
The third algorithm is Logistic Regression. Logistic Regression is
typically used for a binary classification problem. For the
multi-class case here, the model reduces the problem to binary
decisions by treating each category as either true or false.
The fourth algorithm we are testing is a Random Forest Classifier
in the TensorFlow package. A Random Forest Classifier is a
supervised learning approach that takes a Decision Tree model and
adds more trees to the model. A higher number of trees generally
leads to a more accurate model, and when classifying data, accuracy
means everything.
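A minimal sketch of the timing comparison follows, using the scikit-learn implementations of the four classifiers (the Random Forest in this study came from the TensorFlow package; scikit-learn's is substituted here for brevity). X_train, y_train, X_test, and y_test are the feature matrices and genre labels from the loading sketch above.

import time
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

models = {
    "Decision Tree": DecisionTreeClassifier(),
    "SVC": SVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100),
}

for name, model in models.items():
    start = time.time()
    model.fit(X_train, y_train)        # time the training step
    train_time = time.time() - start

    start = time.time()
    predictions = model.predict(X_test)  # time the prediction step
    test_time = time.time() - start

    score = f1_score(y_test, predictions, average="weighted")
    print(f"{name}: train {train_time:.2f}s, test {test_time:.2f}s, F1 {score:.3f}")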
4.1 Neural Network Approach (CRNN)
The next classifier uses a neural network approach in order to
classify genres. For this approach, we must extract the spectrogram
for each track. This will turn the problem into an image recognition
task. In order to complete this task, the spectrogram must be turned
into an array of numbers. This approach will be explained further
in the paragraphs that follow.
The first step of this approach is to extract the spectrogram from
each song. For this, we use the LibROSA package that allows you
to plot a spectrogram that shows the frequency in Hz and the time
of the track. It also allows us to see the energy of the track in dB
with a color spectrum. Just like the previous approach mentioned,
we must use the Tracks CSV to link the tracks with their metadata.
This allows us to build a data frame with the spectrograms that
includes each track's genre tag. During the extraction, we were able
to extract 24,986 tracks for our train dataset, 2,906 for validation,
and 3,974 for our test data. This gives us a 73% train, 12%
validation, and 15% test split.
After extracting the spectrograms, the next step is to convert each
spectrogram into a NumPy array. In order to do this task on a
normal laptop, you must batch your data. For the purpose of this
paper, we split our training data into 11 batches with approximately
1,600 songs in each batch. This cut our original train data size
down from 130,000 to 24,986 due to an issue with the Tracks CSV;
however, this is still the best approach to creating the arrays for our
data. After each batch is extracted, we then save the arrays to npz
files which are used specifically for arrays. We then do the same
with our test and validation data. The next step is to then convert
our files using the db_to_power function in the LibROSA package.
Then we finally scale the data using the log function. This allows
us to determine the loudness of the sound data in decibels as it
relates to human-perceived pitch. After the songs are converted,
we then concatenate the data and save it to a final npz file.
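The sketch below illustrates this batching pipeline under some assumptions: mel spectrograms stored in decibels, placeholder file names, and a fixed trim width so the clips stack into one array.

import librosa
import numpy as np

def spectrogram_for(path, n_frames=640):
    # Mel spectrogram in dB, like the one shown in Figure 1.
    y, sr = librosa.load(path)
    S = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr))
    return S[:, :n_frames]  # trim to a fixed width so clips stack

# One batch of tracks (placeholder names; the study used ~1,600 per batch).
batch_paths = ["track_0001.mp3", "track_0002.mp3"]
arrays = np.stack([spectrogram_for(p) for p in batch_paths])
np.savez("train_batch_01.npz", spectrograms=arrays)

# Convert dB back to power, then log-scale, as described above.
batch = np.load("train_batch_01.npz")["spectrograms"]
scaled = np.log(librosa.db_to_power(batch) + 1e-10)  # offset avoids log(0)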
To train this method, we use a combination of CNN and an RNN
to develop a CRNN. CNNs are often used in image recognition
tasks and RNNs are typically used for sequential data, in this case
time. This model was inspired by and modified from a recently
developed model by Priya Dwivedi [10]. This approach takes 1D
convolution layers that perform the convolution operation across
the time dimension. ReLU is then applied after the convolution
operation. ReLU is an activation function that passes positive
values through unchanged and sets all negative values to 0; it is
also the most commonly used activation function for CNNs. Next,
Batch Normalization is applied, which normalizes the inputs to
layers within the network. Finally, we apply 1D Max Pooling, which
reduces the spatial dimension of the image and helps prevent us
from overfitting our data. This block is repeated 5 times with 64
filters per layer.
The output of the convolutional layers is then fed into an LSTM.
The LSTM is a type of RNN that, in principle, can compute anything a
conventional computer can compute; in this case, we use it to
capture the short- and long-term structure of the song. The LSTM
output is then fed into a Dense Layer, which is simply a regular
layer of neurons in a neural network: each neuron receives an input
from all the neurons in the previous layer, making them densely
connected. The final output is another Dense Layer with SoftMax
activation. The SoftMax function maps the outputs to values between
0 and 1 that sum to one, giving a probability for each genre.
To reduce overfitting, we used dropout and L2 Regularization
between each layer. Dropout is a technique used where randomly
chosen neurons are ignored or dropped out. Regularization is a
technique used to discourage the complexity of the model by
penalizing the loss function. L2 Regularization adds the sum of the
squares of all feature weights to the loss, forcing the weights to
be small but not zero.
We use an Adam Optimizer to train our data with a learning rate of
0.001. The Adam Optimizer is known to reach good results quickly
because it adapts each weight's update from gradient statistics. For
the loss
function we use categorical cross entropy which measures the
probability error in discrete classification tasks in which classes are
mutually exclusive. The model is trained for a total of 70 epochs.
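A sketch of this architecture in Keras follows. The text fixes the filter count (64), block count (5), activation, optimizer, learning rate, loss, and epoch count; the kernel width, pooling size, dropout rate, L2 strength, LSTM size, and input shape are not stated here and are assumptions.

from tensorflow.keras import layers, models, optimizers, regularizers

n_frames, n_mels, n_genres = 640, 128, 16  # assumed input shape; 16 genres

model = models.Sequential()
model.add(layers.InputLayer(input_shape=(n_frames, n_mels)))
# Five convolution blocks across the time dimension, 64 filters each:
# Conv1D with ReLU, then Batch Normalization, then 1D Max Pooling.
for _ in range(5):
    model.add(layers.Conv1D(64, kernel_size=3, activation="relu",
                            kernel_regularizer=regularizers.l2(0.001)))
    model.add(layers.BatchNormalization())
    model.add(layers.MaxPooling1D(pool_size=2))
    model.add(layers.Dropout(0.2))
# The LSTM summarizes the short- and long-term structure of the song.
model.add(layers.LSTM(64))
model.add(layers.Dense(64, activation="relu",
                       kernel_regularizer=regularizers.l2(0.001)))
model.add(layers.Dropout(0.2))
# SoftMax output: one probability per genre.
model.add(layers.Dense(n_genres, activation="softmax"))

model.compile(optimizer=optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=70)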
5. RESULTS
As stated before, each model was tested for speed and accuracy.
When training and testing the data using the extracted features, the
Decision Tree Classifier performed the fastest. The model trained
in 1.4467 seconds with a 59% accuracy on the training data and
made predictions on the test set almost instantly (reported as 0
seconds) with a 60% accuracy.
However, there is at least a 10% loss in accuracy with this model
when compared to the top performing model. The SVC model gave
us the best accuracy with an 82% on the training data and a 73% on
the test data. The SVC model trained in 49.60 seconds and made
predictions on the test set in 4.6 seconds. To further review the
results, we used the classification_report function in sklearn,
which lets us see the classification results for each genre. After
reviewing the classification report, it became apparent that
imbalances in our training data led to misclassification of the test
set.
The CRNN model trained each epoch in approximately 250
seconds. It took approximately 5 hours to train all the data with the
best model producing a 72% accuracy with the train data and a 60%
on the validation data. When using the test set, we received an
accuracy score of 65% in 15 seconds. When viewing our
classification report, Old-Time / Historic received an F1 score of
99% making it the most accurate class. A total of 6 classes were
completely misclassified with a 0% F1 score. These genres are
Blues, Country, Easy Listening, Instrumental, Pop, and RnB. This
is most likely due to unbalanced sample sizes in the training set. In
the next section we will further discuss our findings.
6. DISCUSSION
After receiving the results of each algorithm, the CRNN did not
perform as well as expected. We think this may be due to
unbalanced sample sizes in each class when training the data.
However, the CRNN model did produce results for our test data
quickly. It was able to process approximately 4,000 songs in 16
seconds and classify them into genres.
Unsurprisingly, the SVC model performed the best. During our
research, many reports mentioned the use of SVC for music genre
classification, and the SVC approach almost always showed the best
accuracy scores next to a neural network approach. The big
difference between these approaches is the way the data is
formulated. For the SVC approach, features must be extracted from
each song; the neural network approach is an image recognition task
that requires the user to extract the spectrogram from each song and
formulate arrays.
One downfall of this project was that the dataset did not have an
equal number of songs per genre in the training data. Although it
isn't required for a classification task, having an equal number of
samples per class allows the classifiers to train each class equally,
which would let us build a more accurate model. One solution to this
issue is resampling the training set, which would allow us to better
fit our model with equal sample sizes. A related method was tested on
the Decision Tree and SVC models using the built-in class_weight
parameter, which balances the weights of each class to better fit the
model. For instance, if Class A has more samples than Class B, Class
B will be weighted higher than Class A. Using this on the Decision
Tree model produced undesirable results, with only the
highest-weighted class being classified. The SVC model produced
similar results with the class_weight parameter, as discussed in the
results section.
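For reference, this balancing experiment can be reproduced in scikit-learn with the class_weight option, where "balanced" weights each class inversely to its frequency in the training data; a minimal sketch, reusing X_train and y_train from the earlier sketches:

from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# "balanced" gives rarer genres proportionally larger weights.
balanced_tree = DecisionTreeClassifier(class_weight="balanced")
balanced_svc = SVC(class_weight="balanced")

balanced_tree.fit(X_train, y_train)
balanced_svc.fit(X_train, y_train)
print(balanced_tree.score(X_test, y_test), balanced_svc.score(X_test, y_test))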
Another cause of reduced accuracy could be that some songs were not
labeled properly to begin with or are ambiguous across multiple
genres. Music Genre Classification is a highly opinionated topic
making it a difficult classification task. Music can also be easily
classified into more than one genre or have similarities to another.
For example, an instrumental track can have a guitar riff that can
misclassify the song as rock. Another example could be a pop song
with electronic roots. All these issues must be taken into
consideration when developing a strong dataset for Music Genre
Classification.
7. CONCLUSION
In this project, we attempted to classify music into its appropriate
genre category. The key findings show us that the best classifier to
use on this dataset is an SVC, which gave us a 73% accuracy score on the test data.
Our CRNN model did not perform as well as expected with only a
65% accuracy score. These key findings show that there is still
much room for improvement when it comes to music genre
classification. They may also show us that the dataset plays a key
role in how accurate the predictions of your model may be.
8. FUTURE WORK
In order to extend this work, we could try batching the spectrogram
into 3-second windows to more accurately predict each song's
classification. Another suggestion would be to build a better
dataset with expertly tagged songs with more genres. The topic of
music genre relies heavily on opinion and when extracting songs
from a non-expertly tagged database, inaccuracies should be
expected. After a satisfactory accuracy score is met, our next
suggestion would be to build an app that would allow a user to take
songs in their music library and have it place genre tags on each
song. With thousands of genres and subgenres of music this app
would have to be constantly updated in order to keep up with the
pace of today’s music.
9. ACKNOWLEDGEMENTS
A special thank you goes to Dr. Upal, who has helped guide me
through this project throughout my final year of graduate school.
Another thank you goes to all who have supported me throughout
my studies at Mercyhurst University.
10. REFERENCES
[1] Antonio Jose Homsi Goulart, Rodrigo Capobianco Guido, Carlos Dias Maciel. 2012. Exploring different approaches for music genre classification. Egyptian Informatics Journal 13, 2 (July 2012), 59-63. DOI: https://www.sciencedirect.com/science/article/pii/S1110866512000151#b0015
[2] Eliot Van Buskirk. 2015. 50 Genres with the Strangest Names on Spotify. (Sep. 2015). Retrieved October 18, 2018 from https://insights.spotify.com/us/2015/09/30/50-strangest-genre-names/
[3] Mingwen Dong. 2018. Convolutional Neural Network Achieves Human-level Accuracy in Music Genre Classification. arXiv:1802.09697v1. Retrieved from https://arxiv.org/pdf/1802.09697.pdf
[4] Keunwoo Choi, György Fazekas, Mark Sandler. 2016. Convolutional Recurrent Neural Networks for Music Classification. arXiv:1609.04243v3. Retrieved from https://arxiv.org/pdf/1609.04243.pdf
[5] Matthew Creme, Charles Burlin, Raphael Lenain. 2016. Music Genre Classification. (December 2016). Retrieved November 8, 2018 from http://cs229.stanford.edu/proj2016/report/BurlinCremeLenain-MusicGenreClassification-report.pdf
[6] Qiuqiang Kong, Xiaohui Feng, Yanxiong Li. 2014. Music Genre Classification Using Convolutional Neural Network. (2014). Retrieved November 8, 2018 from http://www.terasoft.com.tw/conf/ismir2014/LBD%5CLBD17.pdf
[7] Alexandros Tsaptsinos. 2017. Lyrics-Based Music Genre Classification Using a Hierarchical Attention Network. (2017). Retrieved November 8, 2018 from https://ccrma.stanford.edu/groups/meri/assets/pdf/tsaptsinos2017preprint.pdf
[8] Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, Paul Lamere. 2011. The Million Song Dataset. (2011). Retrieved November 8, 2018 from https://www.ee.columbia.edu/~dpwe/pubs/BertEWL11-msd.pdf
[9] Michael I. Mandel, Daniel P.W. Ellis. 2008. Multiple-Instance Learning for Music Information Retrieval. (2008). Retrieved November 8, 2018 from http://www.ee.columbia.edu/~dpwe/pubs/MandelE08-MImusic.pdf
[10] Priya Dwivedi. 2018. Using CNNs and RNNs for Music Genre Recognition. (Dec. 2018). Retrieved February 6, 2019 from https://towardsdatascience.com/using-cnns-and-rnns-for-music-genre-recognition-2435fb2ed6af
About the authors:
Andrew J. Innes is a 2nd year Data Science Graduate Student at
Mercyhurst University. Andrew has an undergraduate degree in
Business Competitive Intelligence. He also has a strong passion
for music and enjoys sharing his passion with others.
Predicting Hole by Hole Golf Scores on the PGA Tour
Ron Richardson
Department of Computing and Information Science
Mercyhurst University
Erie, PA, USA
ron.richardson@gmail.com
ABSTRACT
This paper tests different machine learning techniques to predict
the score made by a golfer on the PGA Tour, comparing features
of the golfer’s skills with additional features of the course and
hole, to examine the impact the course has on performance. For
this study, data from the PGA Tour was used between 2016 and
2018. Using a few handpicked features of golfer performance
alongside features of the course being played, the paper
concludes that the course does not have a significant impact on
the outcome of the hole, and the skills of the golfer alone
determine the score on the hole. Using an optimized Random
Forest classifier, an accuracy score of 62.7% was achieved.
Keywords
Golf, PGA Tour, ShotLink, machine learning, random forest,
classification.
1. INTRODUCTION
Using machine learning techniques to predict the outcomes of
sporting events is nothing new. There has been plenty of research
on the major sports (baseball, basketball, football), but the amount
of research on golf has been limited. What makes golf so hard to
predict is the amount of randomness that can occur during a round
of golf. A golfer might hole out a long shot where the probability
of doing so is very low. A hole in one by a professional golfer is
estimated at 3,000 to 1, whereas for an average golfer the
probability jumps to 12,000 to 1 [1]. A golfer making a hole in
one earns roughly 2.1 strokes gained for the hole, whereas the
mean strokes gained is usually between 0.5 and 1.
Determining the best feature set for predicting golf scores has
been evaluated several times, starting with Davidson and Templin
in 1986, where only three features explained 86% of a golfer’s
scoring variance (greens in regulation (GIR), total putts, and
driving proficiency) [2]. In 1992, Shmanske used three different,
but similar, features for prediction [3]. This type of research
continued, with some variations, until Sen created a single metric
to use for score prediction [4].
Previous research focuses on total round score, whereas this paper
breaks scoring prediction down to individual holes.
Breaking down scoring predictions to the hole level can open a
whole new level of betting options and prop bets. This can
increase the volume of betting and fan engagement. Additionally,
this level of breakdown can help professional golfers identify
which types of holes can cause issues allowing them to work on
different strategies or practice different skills to improve their
chances of success.
1.1 Available Data
With the introduction of ShotLink by the PGA Tour in 2004, the
amount of available data has increased significantly [5]. ShotLink
is a real-time system that collects every shot hit by a golfer during
tournaments on the PGA Tour. The data is collected by a team of
volunteers at the tournament using a laser to pinpoint the starting
and ending location of each shot, while logging the type of
condition the golfer hit from (e.g. fairway, rough, sand, etc.).
Over the past 14 years, the amount of data that has come from this
system is staggering. From 2010 to 2018 alone, over 10 million
shots have been recorded and over 300 data features calculated.
1.2 Strokes Gained
Coming out of this trove of data was a new statistic called Strokes
Gained. This statistic was first created by Mark Broadie in 2008
for putting and expanded in 2012 to include other aspects of the
game [6]. Strokes gained provides a benchmark for comparing
other golfers in certain skills and has been valuable in giving
viewers and fans a better insight into how a golfer is performing.
Strokes gained is a measure of the effectiveness of a golf shot to
the golfer’s score and represents the decrease in the average
number of strokes to finish the hole, from the beginning of the
shot to the end of the shot, minus one to account for the stroke
taken [6]. If J(distance, condition) is a function that represents the
average number of strokes it takes a PGA Tour golfer to complete
the hole, where distance is the distance the ball is from the hole,
and condition is the current location of the ball for the shot (e.g.
fairway, rough, green), then strokes gained is defined as the
difference of the current shot and the next shot, minus one to
account for the stroke actually taken.
g = J(distance_current, condition_current) − J(distance_next, condition_next) − 1
Broadie gives the following example: Suppose the average
number of shots to complete a hole from 40 yards away in the
fairway is 2.6. If the golfer hits the shot to one foot away from the
hole (where the average number of shots to complete the hole is
1.0), then the strokes gained is 0.6 [6].
𝑔 = 2.6 − 1.0 − 1 = 0.6
In general, a positive strokes gained value represents that the shot
is better than a PGA Tour golfer’s average shot from that distance
and condition.
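As a toy illustration of this definition, the snippet below computes Broadie's example with a two-entry baseline table standing in for the PGA Tour's full J(distance, condition) benchmark; the table values come from the example above, and everything else is hypothetical.

# Hypothetical two-entry baseline standing in for J(distance, condition).
baseline = {
    ("40 yards", "fairway"): 2.6,  # average strokes to finish from here
    ("1 foot", "green"): 1.0,
}

def strokes_gained(before, after):
    # J(before) - J(after) - 1; the minus one accounts for the stroke taken.
    return baseline[before] - baseline[after] - 1

print(strokes_gained(("40 yards", "fairway"), ("1 foot", "green")))  # 0.6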
Strokes gained helps quantify some of the legacy stats that exist
on the PGA Tour. For example, the total putts in a round statistic
was commonly used but can be deceiving. If golfer A takes only
29 putts during a round, while golfer B takes 31 putts, does that
mean that golfer A is a better putter than golfer B? Golfer A might
have missed more greens in regulation (measured as being on the
green in 2 shots or less than par for the hole). By missing the green
in regulation, golfer A has a shorter shot (not a putt) that has a
decent likelihood of getting much closer to the hole resulting in
an easier putt and a greater probability of needing only one putt
to complete the hole. However, golfer B might be on the green in
regulation, but from a much longer distance, and require 2 or more
putts to complete the hole.
In his book “Every Shot Counts” [7], Broadie calculated the
probabilities of one-putting from different distances, compared to
the probability of three-putting throughout the 2003 and 2012
seasons. From 2-feet, the probability of making it in one shot is
99%, for an average of 1.01 shots to complete the hole. From 60 feet, the one-putt probability is reduced to 2%, for an average of
2.21 shots to complete the hole. So, if a golfer only takes 2 shots
from 60-feet, they gained 0.21 strokes. Another golfer might only
have 2-feet left to putt, and by only taking one shot to complete
the hole, they gained only 0.01 strokes.
The par on a hole is a rating that is designated by the course
architect and represents what a golfer with a handicap rating of 0
(e.g. a “scratch” golfer) should score on the hole. Golfers on the
PGA Tour don’t play with a handicap but are generally
considered about 4-8 strokes better than scratch. Par is generally
related to the length of the hole but does not describe the difficulty
of the hole. In some cases, the scoring average on a par 5 hole can
actually be lower than the scoring average on a par 4 hole. The
distribution of scores in 2018 by par values is shown in Figure 1.
Strokes gained has evolved into the following categories,
covering nearly all aspects of the game: off the tee (OTT),
approach the green (ATG), around the green (ARG), putting, tee
to green (T2G), and total. For this paper, the average strokes
gained for OTT, ATG, ARG, and putting were used as predictive
features for a golfer’s skill.
2. DATA COLLECTION
The data for this research was obtained from the PGA Tour
ShotLink System. Academic access was granted by the PGA Tour
in August 2017, but the program closed in January 2019. When
the 2018 golf season ended in September, the following data was
downloaded for the 2016 through 2018 seasons: course data,
round data, hole data, and stroke data.
This data is provided in a collection of delimited text files for each
year. Within the files, there are hundreds of measured and
calculated statistics, so feature reduction is important. Two
different feature sets were created, with the first set having a total
of 21 features, combining both golfer features and course/hole
features. The second feature set removed all course/hole features,
resulting in only 12 features.
The 2016 season was used to train the algorithms, with the 2017
season used to validate the results. Finally, the 2018 season was
used as a final test. This resulted in a training set of 271,462 holes,
a validation set of 290,122 holes, and a test set of 281,376 holes.
2.1 Course/Hole Features
For almost every course that is played on the PGA Tour, categorical
data for each hole is provided for the type of grass used on the tee
boxes, fairways, rough, and greens. Additionally, the height of
those grasses is provided as some courses have much taller grass
in the rough which can make the course tougher (e.g. GC of
Houston had a rough height of 1.25 inches in 2018, while
Aronimink GC had a rough height of 4.75 inches). Wind speed
and direction are measured both in the morning and afternoon, as
is how firm the fairways and greens are.
As of 2010, additional data included is the width of the fairways
at certain distances. These widths are measured at distances
between 275 and 350 yards, which contains most drives hit by
professional golfers. Broadie recently measured the cost of
missing the fairway during the 2019 PGA Tour season and found
Figure 1: Score distribution by hole par value
2.2 Golfer Features
With nearly 300 different statistics for a golfer’s performance
included in the data files, picking the most predictive ones is key.
Since strokes gained is widely used as a measure of a golfer’s
performance, it makes sense to include these in the feature set.
Additionally, a golfer’s average driving distance, driving
accuracy, percentage of greens hit in regulation, and putts per hole
were calculated for each golfer.
Before training the algorithms with these features, each golfer's
feature values had to be calculated as of the time of the hole being
predicted. This was done by averaging over the previous 25 rounds of
golf (450 holes) played by the golfer prior to the tournament being
played.
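A sketch of that rolling calculation with pandas follows; the round-level column names (golfer_id, date, sg_ott, and so on) are hypothetical stand-ins for the ShotLink fields.

import pandas as pd

# Hypothetical round-level file: one row per golfer per round.
rounds = pd.read_csv("rounds.csv", parse_dates=["date"])
rounds = rounds.sort_values(["golfer_id", "date"])

skill_cols = ["sg_ott", "sg_app", "sg_arg", "sg_putt",
              "driving_distance", "driving_accuracy", "gir_pct"]

# Trailing mean of the previous 25 rounds; shift(1) excludes the
# round being predicted so no future information leaks in.
for col in skill_cols:
    rounds[f"avg_{col}"] = (
        rounds.groupby("golfer_id")[col]
        .transform(lambda s: s.shift(1).rolling(25).mean())
    )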
3. MODEL SELECTION
Since golf scores are always whole numbers, and generally vary
between 1 and 8, a classifier can be used rather than regression.
To test the best model to use, a sample of 12 golfers from the 2016
season was selected and run through a couple of classifiers,
including Random Forest and Gaussian Naïve Bayes. From this
initial test, a Random Forest classifier performed the best, before
any hyperparameter optimization.
To optimize the hyperparameters, both a random search and grid
search were performed, with the grid search providing the best
optimization of the Random Forest classifier. After the grid
search, the full feature set with course/hole features included was
reduced from 21 features down to 10. The grid search was also
run on the feature set that excluded course/hole features, and no
feature reduction was needed, keeping the total feature count at 12.
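A sketch of the grid search with scikit-learn follows; the grid itself is an assumption built around the values the search ultimately selected, and X_train/y_train and X_valid/y_valid stand for the 2016 and 2017 season hole data.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Candidate values surrounding the winning combination reported below.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 10, None],
    "max_features": [10, 12, "sqrt"],
    "min_samples_split": [2, 5, 10],
}

search = GridSearchCV(RandomForestClassifier(), param_grid, cv=3, n_jobs=-1)
search.fit(X_train, y_train)            # 2016 season holes
print(search.best_params_)
print(search.score(X_valid, y_valid))   # validate on the 2017 season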
4. RESULTS
4.1 Model Accuracy
The default Random Forest classifier with both golfer and
course/hole features included resulted in an accuracy score of
59.4% against the validation data set. When the course/hole
features were removed, the accuracy of the default Random
Forest classifier remained steady at 59.3%.
The results of the grid search determined that using entropy as the
criterion for measuring the quality of a split was best. A max
depth of 3 was used, along with only 10 features, and a minimum
sample split of two. This resulted in a bump in accuracy to 62.7%
with the full feature set, and the exact same for the reduced feature
set for just golfer skills.
4.1.1 Feature Importance
In the full feature set, the top two features in terms of importance
were the actual yardage of the hole, and the par value for the hole.
This is not surprising since the length of the hole is highly
correlated to the par value (e.g. a 200 yard hole is always a par 3,
and a 550 yard hole is always a par 5). Since PGA Tour golfers
are considered better than a 0 handicap golfer (for which par is
determined), their scores will usually hover very closely around
par (within one stroke). The top 10 most important features for
the full feature set are shown below in Table 1.
Table 1: Top 10 Features for Golfers and Courses/Holes by Importance

Feature                     Importance
Actual Yardage              0.345919
Par                         0.089438
Actual 275 Distance         0.062597
Actual 350 Distance         0.041917
Actual 325 Distance         0.038473
Avg SG Approach             0.035605
Avg SG Around the Green     0.035171
Avg Driving Distance        0.035159
Driving Accuracy            0.034993
Avg SG Off the Tee          0.034901
Once the course/hole features were removed, the same test of
feature importance was run, and again, the actual yardage of the
hole and par value were the top two predictive values. However,
the importance of approach shots, shots off the tee, and average
driving distance were also high on the list. The top 10 most
important features for the reduced feature set are shown below in
Table 2.
Table 2: Top 10 Features for Golfers by Importance

Feature                     Importance
Actual Yardage              0.584660
Par                         0.115456
Avg SG Approach             0.031246
Avg SG Off the Tee          0.031179
Avg Driving Distance        0.031016
Avg SG Around the Green     0.030791
Avg SG Putting              0.030652
Scrambling Success          0.030519
Driving Accuracy            0.030041
Putts per Hole (GIR)        0.029770
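For reference, importances like those in Tables 1 and 2 can be read off a fitted scikit-learn forest through its feature_importances_ attribute; a minimal sketch, where search is the tuned model from the grid search sketch and feature_names is assumed to match the training columns:

import pandas as pd

# Impurity-based importances, one value per input feature.
importances = pd.Series(search.best_estimator_.feature_importances_,
                        index=feature_names)
print(importances.sort_values(ascending=False).head(10))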
4.1.2 One Round Test
A test was performed on one round played in 2018 to show the
hole by hole predictions. The round chosen was by Phil
Mickelson in 2018 during the first round of the WGC Bridgestone
Invitational played at Firestone Country Club in Akron, OH. Par
for the course is 70, and Mickelson shot a 66 that day. Tables 3
and 4 show the prediction results for each hole with all 4 models
used.
Table 3: Prediction results for one round vs. Actual results (Front 9)

Model (Features)      1   2   3   4   5   6   7   8   9    %
Actual Score          4   3   4   4   2   5   2   4   4    -
Default (Full)        4   6   4   4   3   4   3   4   3    44%
Optimized (Full)      4   5   4   4   3   4   3   4   4    56%
Default (Reduced)     5   5   4   5   3   5   3   5   4    33%
Optimized (Reduced)   4   5   4   4   3   4   3   4   4    56%
Table 4: Prediction results for one round vs. Actual results (Back 9)

Model (Features)      10  11  12  13  14  15  16  17  18   %
Actual Score          3   4   3   4   4   3   5   4   4    -
Default (Full)        4   4   3   4   4   3   5   4   4    89%
Optimized (Full)      4   4   3   4   4   3   5   4   4    89%
Default (Reduced)     4   5   3   5   4   3   4   4   5    44%
Optimized (Reduced)   4   4   3   4   4   3   5   4   4    89%
Both the default and optimized Random Forest models, using the
full feature set, predicted a total round score of 70. The default
model with the reduced feature set predicted a total round score
of 76, while the optimized model predicted a total round score of
70. However, since we are looking at hole by hole predictions, the
best models were the optimized models, regardless of feature set,
having an accuracy of 72.2% for this 18-hole round.
Most of the holes were predicted correctly, however, hole #2 was
never predicted correctly by any model. On this hole (a par 5
measuring 529 yards), Mickelson scored an eagle (3), which is 2
shots better than par. An eagle only occurred 2.2% of the time
during the 2018 season, so predicting this value is difficult. No
other hole was incorrect by more than one shot.
5. DISCUSSION
This paper discussed using and optimizing a Random Forest
classifier to predict a PGA Tour golfer’s score on a specific hole,
while also investigating if a course or hole setup makes much of
an impact in the predictions. A classifier was chosen because golf
scores on a hole usually fall within the range of 2 and 6. The best
model in this research never classified a score outside the range
of 3 and 5.
While the default Random Forest models show that some
course/hole features are important, removing them provided no
significant decrease in the accuracy of the predictions. Matt and
Will Courchene of the website datagolf.ca have also found a very
small correlation between a golfer’s past performance at a course
versus future performance, indicating that using course history
and setup blindly won’t always tell the whole story [9].
There is an old adage in golf that says, “drive for show, putt for
dough”, meaning that the long tee shots look neat and are fun to
hit, but putting is where you will make your money and lower
your scores. From this research, putting statistics were barely an
influence in the prediction, but approach shots to the green and
driving distance and accuracy were more important. Broadie has
confirmed this several times, most recently after one tournament
where Rory McIlroy won and was in the top 10 of the field in
driving and approach shots [10].
The PGA Tour has the best golfers in the world and a lot of that
can come down to course management and strategy. With a
narrower hole, a golfer might choose to play to a strategic portion
of the hole to optimize their chances of scoring close to or below
par. Most amateur golfers might hit their standard tee shot and
end up in a more difficult situation, which might lead to a higher
number. The consistency of shots for a professional golfer versus
an amateur golfer is the biggest difference between them.
6. FUTURE WORK
Admittedly, there is a lot of future work that can be performed.
The number of features available in the ShotLink system is
staggering and include granular statistics such as proximity to the
hole from certain distances, putting accuracy from certain
distances, and detailed notes on the type of condition each shot
was taken from. Using these additional features, rather than hand-picked ones, could increase the accuracy of the hole predictions.
Principal Component Analysis could be utilized to quickly figure
out which features can be grouped together.
Additional algorithms can also be explored. The Poisson
regression algorithm is a type of generalized linear model that is
used when the target value is a count, which a golf score is.
Ordered logistic regression was also recommended and has been
used by several people in the daily fantasy sports betting arena.
Dan Rosenheck of The Economist used the Burr Type XII
distribution for his hole by hole prediction system called EAGLE,
first presented at the 2017 MIT Sloan Sports Analytics
Conference in Boston and updated in 2019 [11].
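As a rough illustration of the Poisson option, here is a minimal sketch using scikit-learn's PoissonRegressor on synthetic stand-in data; the real ShotLink features are not reproduced here:

    import numpy as np
    from sklearn.linear_model import PoissonRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 6))         # hypothetical hole/golfer features
    y = rng.poisson(lam=4.0, size=500)    # hole scores as counts, centered near par

    model = PoissonRegressor(alpha=1e-3, max_iter=300).fit(X, y)
    rates = model.predict(rng.normal(size=(18, 6)))
    print(np.rint(rates))                 # round predicted rates to whole strokes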
For this research, certain course features were not included, such
as grass type and firmness, along with environmental features
such as wind speed and direction. The wind can have a big impact
on a golf score, most notably in Europe where tournaments have
been played in 40 mph gusts. This research did not investigate
scoring differences between each round (except for the difference
in hole yardages, which only varies by a few yards), but there
might be some additional information that can be extracted for
each round that has an impact on scoring predictions.
7. ACKNOWLEDGMENTS
The author would like to thank the PGA Tour for their access to
the ShotLink system and data. Without this information readily
available, additional work to piece together the data would be
required. The author has yet to find another reliable source for the
level of course and hole detail anywhere else. The author would
also like to thank Dr. Afzal Upal and Dr. Stephen Ousley for their
guidance throughout this process.
8. REFERENCES
[1] Auclair, T.J. June 29, 2018. Odds of a hole in one, albatross, condor and golf's other unlikely shots. Retrieved December 12, 2018 from https://www.pga.com/news/golf-buzz/odds-hole-in-one-albatross-condor

[2] Davidson, J.D. and Templin, T.J. 1986. Determinants of Success Among Professional Golfers. Research Quarterly for Exercise and Sport, 57, 1 (1986), 60-67.

[3] Shmanske, S. 1992. Human Capital Formation in Professional Sports: Evidence from the PGA Tour. Atlantic Economic Journal, 20, 3, 66-80.

[4] Sen, K.C. 2012. Mapping statistics to success on the PGA Tour: Insights from the use of a single metric. Sport, Business and Management: An International Journal, 2, 1 (2012), 39-50.

[5] ShotLink Background. Retrieved December 12, 2018 from http://www.shotlink.com/about/background

[6] Broadie, Mark. 2012. Assessing Golfer Performance on the PGA TOUR. Interfaces, 42, 2 (April 1, 2012), 105-228. DOI: https://doi.org/10.1287/inte.1120.0626

[7] Broadie, Mark. 2014. Every Shot Counts. Gotham, New York, NY.

[8] @MarkBroadie. 2019. Mark Broadie on Twitter: A standard way to measure the cost of a missed fairway. Retrieved March 30, 2019 from https://twitter.com/MarkBroadie/status/1108804384673222659

[9] @DataGolf. 2019. data golf on Twitter: (Course History thread!). Retrieved March 30, 2019 from https://twitter.com/DataGolf/status/1086138883916496896

[10] @MarkBroadie. 2019. Mark Broadie on Twitter: Strokes gained results. Retrieved March 30, 2019 from https://twitter.com/MarkBroadie/status/1108013944738930689

[11] Rosenheck, Dan. 2017. The EAGLE has landed: Real-time win probabilities in men's major golf tournaments. In MIT Sloan Sports Analytics Conference (March 3-4, 2017). Boston, MA. Retrieved from http://www.sloansportsconference.com/content/eagle-landed-real-time-win-probabilities-mens-major-golf-tournaments/
About the author:
Ron Richardson is a graduate student at Mercyhurst University
studying Data Science. Previously, he graduated from Penn State
University, majoring in both Computer Science and Mathematics.
He has worked for Fortune 500 companies as a software engineer
and IT Manager and has built enterprise-level software for
businesses of varying sizes. In his spare time, he can be found on
the golf course, claiming it benefits his research projects.
Logistic Regression Versus Convolutional Neural Networks for Classification
Jerrin Joe Varghese
Department of Data Science
Mercyhurst University
jvargh81@lakers.mercyhurst.edu
ABSTRACT
Machine learning algorithms are becoming popular and are
widely used to give machines the ability to learn for themselves
without human intervention. These algorithms are used for
object detection, image classification, stock prediction, etc. Some
machine learning algorithms are complex and require more
memory and processing power. This paper proposes the use of
logistic regression to overcome the problem of memory and
processing power when the data can be turned into a binary
classification problem. In order to test this hypothesis, the paper
works through the problem of driver distraction and uses both a
convolutional neural network (CNN) [8] and logistic regression
[10] to analyze the performance of both models on different
machines with different memory and processing power.
1. INTRODUCTION
Machine learning is a type of artificial intelligence technique that
learns to identify new patterns in data. This technique is now
widely used in several industries for various tasks. There are
different types of machine learning, such as supervised learning,
unsupervised learning, and reinforcement learning, so it is
important to know which type of machine learning algorithm is
best suited for a given problem. Supervised learning is the search
for algorithms that reason from externally supplied instances to
produce general hypotheses, which then make predictions about
future instances. In other words, the goal of supervised learning
is to build a concise model of the distribution of class labels in
terms of predictive features [1]. When we provide the machine
with data that is already labeled, so that it is not learning on its
own, we call this technique supervised learning.
Supervised learning can be further divided into two categories:
classification, where the output is in the form of categories, and
regression, where the output is in the form of real values [3].
Examples of supervised learning algorithms are Linear Regression,
Logistic Regression, CART, Naive Bayes, KNN, etc. Unsupervised
learning, on the other hand, studies how systems can learn to
represent input patterns in a way that reflects the statistical
structure of the overall collection of input patterns [2]. In this type
of machine learning, the machine simply receives data but obtains
neither supervised target outputs nor any rewards from its
environment [4]; it learns whatever patterns it finds in the data.
Unsupervised learning can be further subdivided into three
subcategories: clustering, association, and dimensionality
reduction. Examples of unsupervised learning are K-means, PCA,
etc. Finally, reinforcement learning [6] is a type of machine
learning that helps the machine learn by rewarding its choice of the
next action. This methodology is typically used in robotics, where
the machine learns by trial and error. One of the most recent uses
of reinforcement learning is DeepMind's StarCraft project [5],
where rewards are given based on the score obtained from the
StarCraft II engine against the built-in computer opponent.

This paper focuses on supervised learning and whether we can use
logistic regression to solve machine learning problems that can be
converted to binary classification. The paper looks into the binary
classification problem because it relies on the following premise: if
a problem can be converted to a binary classification and logistic
regression gives results close to those of deep learning or
convolutional neural networks, then we can save time, solve the
machine learning problem faster, and utilize computers with low
memory and processing power. Hence, driver distraction [7] is the
problem used in this paper, as it can be converted to a binary
classification where the classes are divided into distracted and
undistracted drivers. This machine learning problem is solved using
both a convolutional neural network and logistic regression [10];
therefore, three models were developed: one for binary
classification using logistic regression, a second consisting of
multiple individual logistic regression models, and a third
multi-class model using a convolutional neural network [8] and
multi-class logistic regression [9]. All the models are available on
Kaggle for reference [11,12,13].
2. DATA
The dataset is collected from Kaggle's State Farm dataset [14].
Figure 1 shows two examples from the dataset. The dataset consists
of 10 classes: safe driving, texting (right hand), talking on the
phone (right hand), texting (left hand), talking on the phone (left
hand), operating the radio, drinking, reaching behind, hair and
makeup, and talking to a passenger [14]. The dataset consists of
22,424 labeled images for training and 79,726 images for
validation. Each image is 480 x 640 pixels.
Figure 1. Examples from the driver distraction dataset: (a) safe
driving; (b) dangerous driving (texting).
3. RELEVANT WORK
Convolutional neural networks are traditionally used for image
processing, as they extract features by convolving over the images
and extracting useful information. Logistic regression is another
machine learning algorithm, widely used for binary classification.
This paper examines both algorithms to understand whether a
binary classification problem, or one that can be converted to a
binary classification problem, needs a CNN, which requires more
memory and processing power.
3.1 Convolution Neural Network (CNN)
A Convolutional Neural Network [8] consists of an input layer,
hidden layers, and an output layer. Some of the layers used in such
a network are: Convolution [15], Activation [16], Pooling [17],
Dropout [18], Dense, and SoftMax [19].
The Convolution layer consists of a set of filters, where each filter
can be considered a small square that extends through the full
depth of the input volume. During each pass, the filter convolves
across the width and height of the input, which results in a 2-d
activation map that gives the responses of that filter at every
spatial position. To avoid over-fitting, pooling layers are used to
apply non-linear down sampling on the activation maps. It means
that, this layer is aggressive at discarding information, but can be
useful if used appropriately. Dropout layers also help to reduce
over-fitting by randomly ignoring certain activation functions,
while dense layers are fully connected layers and often come at
the end of the Neural Network. The output of the layers of the
neural network are processed using an activation function, which
is a node that is added to the hidden layers and output layers.
You’ll often find that the RELU activation [16] function is used
in hidden layers, while the final layer typically consists of a
SoftMax activation function. The idea is that by stacking layers
of linear and non-linear functions, we can detect a large range of
patterns and accurately predict a label for a given image. SoftMax
is often found in the final layer which acts as basically a
normalizer and produces a discrete probability distribution vector.
Because of these benefits, CNN is most widely used in image
classification or problems related to images.
3.1.1 Pooling
The pooling layer reduces the spatial dimensions of the input and
the computational complexity of our model. Pooling also helps in
controlling the overfitting problem, as it operates on every slice
independently. There are different functions such as Max pooling,
average pooling or L2-norm pooling. Max pooling is the most
used type of pooling that takes the most important part from each
slice of the input data.
3.1.2 Rectified Linear Unit (Relu)
ReLU is an activation function that outputs 0 when x < 0 and
outputs x itself when x ≥ 0 [16]:

f(x) = max(0, x)
3.1.3 Dropout
Dropout is one of the most effective regularization techniques used
in a neural network. Dropout randomly keeps each neuron active
with some probability p, which forces the network to remain
accurate even when some information is missing and, in turn,
keeps the network from depending on any one neuron.
3.1.4 Fully Connected Layer
In a fully connected layer, every neuron in one layer is connected
to every neuron in the other layer. The last fully connected layer
uses the SoftMax activation function, which classifies based on the
features generated from the trained data.
3.2 Logistic Regression
Logistic regression is a statistical machine learning model for
binary classification. At its core is the sigmoid function, which
takes any real input and outputs a value between 0 and 1 [21].
The sigmoid function is given by the formula:

sigmoid(x) = 1 / (1 + e^(-x))
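A one-line implementation makes the squashing behavior easy to verify:

    import numpy as np

    def sigmoid(x):
        # Squashes any real input into the open interval (0, 1)
        return 1.0 / (1.0 + np.exp(-x))

    print(sigmoid(np.array([-4.0, 0.0, 4.0])))  # approx. [0.018, 0.5, 0.982]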
4. PROPOSED SOLUTION
This paper uses three models to understand whether it is possible
to use logistic regression for machine learning problems that can
be converted to binary classification and get the same results.
4.1 Pre-processing
This paper also examines how different algorithms perform on the
driver distraction problem: how each algorithm differs in its
predictions, and whether this analysis helps us understand if even
logistic regression can play a vital role in problems like driver
distraction. The image data from the dataset is split into training
and validation sets, where the training set consists of images
resized to 240 x 240 along with the image class number. The
training set is then split into features and labels, and the features
are converted to a 4-d NumPy array. This data is then used by the
models for training.
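A minimal sketch of this pre-processing step, assuming OpenCV for resizing; the helper name and pixel normalization are illustrative, not the exact code used here:

    import cv2
    import numpy as np

    def build_training_arrays(image_paths, labels, size=(240, 240)):
        # Resize every image to 240 x 240 and stack into a 4-d array (n, 240, 240, 3)
        images = [cv2.resize(cv2.imread(p), size) for p in image_paths]
        X = np.asarray(images, dtype=np.float32) / 255.0  # scale pixels to [0, 1]
        y = np.asarray(labels)
        return X, y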
4.2 Convolutional neural net model
The convolutional neural network uses Keras's Sequential model
[11] and is divided into three convolutional groups. Each group
consists of two convolutional layers, with filter counts of 32, 64,
and 128 across the three groups and 3x3 kernels. The convolutional
layers use zero padding and ReLU as the activation. Each
convolutional layer is followed by batch normalization to
normalize the data, and each group of convolutional layers ends
with a max pooling layer and a dropout layer.

The convolutional layers are flattened and fed to the fully
connected layers. The fully connected part consists of three dense
layers with 512, 128, and 10 neurons respectively. The loss
function used is categorical cross entropy, with Adam as the
optimizer.
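A hedged sketch of this architecture with Keras's Sequential API follows; the dropout rate and pooling size are assumptions (the paper does not state them), while the layer pattern follows the description above:

    from tensorflow.keras import layers, models

    model = models.Sequential([layers.Input(shape=(240, 240, 3))])
    for filters in (32, 64, 128):             # three convolutional groups
        for _ in range(2):                    # two conv layers per group
            model.add(layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
            model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling2D((2, 2)))   # pooling size assumed
        model.add(layers.Dropout(0.25))          # dropout rate assumed
    model.add(layers.Flatten())
    model.add(layers.Dense(512, activation="relu"))
    model.add(layers.Dense(128, activation="relu"))
    model.add(layers.Dense(10, activation="softmax"))  # one output per class
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])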
4.3 Logistic Regression model
The logistic regression model uses Keras's Sequential model [13]
with one batch normalization layer and one dense layer, cross
entropy as the loss function, and Adam as the optimizer. The model
also uses early stopping to avoid overfitting. It splits the data into
nine different groups, where each group pairs the good-driver class
with one distracted-driver class. The model is then trained to
predict whether the driver is good or bad, and these per-group
outputs can be further processed and amalgamated to get output
similar to the convolutional neural network.

The second logistic regression model [12] splits the data into two
classes, i.e., good driver and bad driver, where the bad-driver data
is a combination of all the distracted classes, matched to the class
size of the good driver. The model uses Keras's Sequential model
with a configuration like the previous logistic model, and is trained
to classify images as good or bad driver.
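A minimal sketch of the binary model with early stopping, under the assumption that images are fed in directly and flattened before the single dense layer; the patience value is illustrative:

    from tensorflow.keras import callbacks, layers, models

    model = models.Sequential([
        layers.Input(shape=(240, 240, 3)),
        layers.Flatten(),                       # flatten pixels for the dense layer
        layers.BatchNormalization(),
        layers.Dense(1, activation="sigmoid"),  # single logistic output: good vs. bad
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    stopper = callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                      restore_best_weights=True)
    # model.fit(X, y, validation_split=0.2, epochs=50, callbacks=[stopper])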
5. RESULTS
This paper compares the R-squared values, confusion matrices, and
accuracy to understand whether logistic regression can be used
instead of a CNN for problems that can be converted to a binary
classification problem. The results also consider the time required
to train the models, as computers with less CPU power and
memory are likely to take more time compared to powerful
machines. The R-squared values for the CNN model and the two
logistic models are shown in Table 1.
Table 1: R-squared values of each model.

Model Name                                   GPU Used   R square   Accuracy   Time consumed
CNN                                          Yes        1.0        0.99       1.5 hours
CNN                                          No         1.0        0.99       40 hours
Individual Logistic model (avg. of all)      No         0.97       0.99       20 mins
Logistic model                               No         0.837      0.89       10 mins

Figure 2. Accuracy graph for all models: (a) Convolutional neural network (b) Individual Logistic regression model (c) Logistic regression model.

Figure 3. Loss graph for all models: (a) Convolutional neural network (b) Individual Logistic regression model (c) Logistic regression model.

Figure 4. Confusion matrix graph for all models: (a) Convolutional neural network (b) Individual Logistic regression model (c) Logistic regression model.
Figures 2, 3, and 4 show the accuracy, loss, and confusion matrix
for each model. Table 1 also shows the time taken to train each
model. The results show that the values are approximately equal
and that the time required by logistic regression is far less than that
of the CNN. Hence, logistic regression can also be used when the
data can be framed as a binary classification problem and when
memory and CPU power are limited.
6. CONCLUSION AND FUTURE WORK

Memory and processing power have been major issues for machine
learning models. This paper compares logistic regression models
with a convolutional neural network in order to understand whether
it is possible to replace the convolutional neural network, which
needs more processing power than logistic regression. Our results
show that logistic regression gives results similar to the
convolutional neural network, which is promising. Future work
will include more complex machine learning problems, where the
data is more complicated, with more diverse images, to see if the
results hold.

REFERENCES

[1] S. B. Kotsiantis. Supervised Machine Learning: A Review of Classification Techniques.

[2] R. A. Wilson and F. Keil, editors. The MIT Encyclopedia of the Cognitive Sciences.

[3] Elnaz Barshan, Ali Ghodsi, Zohreh Azimifar, and Mansoor Zolghadri Jahromi. Supervised Principal Component Analysis: Visualization, Classification and Regression on Subspaces and Submanifolds.

[4] Zoubin Ghahramani. Unsupervised Learning. University College London. http://www.gatsby.ucl.ac.uk/~zoubin

[5] Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander Sasha Vezhnevets, Michelle Yeo, Alireza Makhzani, Heinrich Küttler, John Agapiou, Julian Schrittwieser, Stephen Gaffney, Stig Petersen, Karen Simonyan, Tom Schaul, Hado van Hasselt, David Silver, and Timothy Lillicrap (DeepMind); Kevin Calderone, Paul Keet, Anthony Brunasso, David Lawrence, Anders Ekermo, Jacob Repp, and Rodney Tsing (Blizzard). StarCraft II: A New Challenge for Reinforcement Learning.

[6] Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The Arcade Learning Environment: An Evaluation Platform for General Agents. J. Artif. Intell. Res. (JAIR), 47:253-279, 2013.

[7] Understanding the Distracted Brain: Why Driving While Using Hands-Free Cell Phones Is Risky Behavior. National Safety Council White Paper, April 2012.

[8] Wenling Shang, Kihyuk Sohn, Diogo Almeida, and Honglak Lee. Understanding and Improving Convolutional Neural Networks via Concatenated Rectified Linear Units. University of Michigan, Ann Arbor; NEC Laboratories America; Enlitic; Oculus VR.

[9] Peter Karsmakers, Kristiaan Pelckmans, and Johan A. K. Suykens. Multi-class Kernel Logistic Regression: A Fixed-Size Implementation.

[10] David W. Hosmer (University of Massachusetts, Amherst, MA) and Stanley Lemeshow (The Ohio State University, Columbus, OH). Applied Logistic Regression, Second Edition.

[11] https://www.kaggle.com/jerrinv/driver-distraction. Retrieved May 1, 2019.

[12] https://www.kaggle.com/jerrinv/logistic-regression. Retrieved May 1, 2019.

[13] https://www.kaggle.com/jerrinv/driver-distraction-using-logistic-regression. Retrieved May 1, 2019.

[14] https://www.kaggle.com/c/state-farm-distracted-driver-detection/data. Retrieved May 1, 2019.

[15] Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. Relation Classification via Convolutional Deep Neural Network. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China.

[16] Abien Fred M. Agarap. Deep Learning using Rectified Linear Units (ReLU).

[17] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition. University of Bonn, Institute of Computer Science VI, Autonomous Intelligent Systems Group, Bonn, Germany.

[18] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Department of Computer Science, University of Toronto.

[19] Kaibo Duan, S. Sathiya Keerthi, Wei Chu, Shirish Krishnaj Shevade, and Aun Neow Poo. Multi-Category Classification by Soft-Max Combination of Binary Classifiers. National University of Singapore; Indian Institute of Science, Bangalore.

[20] Paul D. Allison. Measures of Fit for Logistic Regression. Statistical Horizons LLC and the University of Pennsylvania.

[21] Wikipedia: https://en.wikipedia.org/wiki/Logistic_regression#Definition_of_the_logistic_function
About the author:
Jerrin Varghese is a Graduate Student at Mercyhurst University.
Machine Learning for the Detection of Mobile Malware on Android Devices
Christina Eusanio
Ridge College of Intelligence Studies and Applied Sciences
Mercyhurst University
Erie, PA
ceusan60@lakers.mercyhurst.edu
ABSTRACT
The widespread use of Android smartphones with access to
third-party applications has spawned many security challenges,
among them malware. Mobile malware applications appear to be
harmless but can access sensitive data like users' contacts,
pictures, and passwords. Malware is becoming increasingly
sophisticated and may not be flagged in early review stages by
platforms like the Google Play store due to code obfuscation
techniques. A better way to detect the presence of malware is to
use machine learning to analyze a mobile device's
behavior. This paper utilizes a labelled Android mobile malware
dataset to train machine learning algorithms to detect the presence
of malware on Android devices by using attributes such as battery
percentage and CPU usage. The evaluation results suggest that
those features are effective and can be utilized to successfully
classify malicious applications with machine learning algorithms.
KEYWORDS
Android smartphone; machine learning; malware
detection; anomaly detection
1 INTRODUCTION
Smartphone usage has become pervasive across the
globe. According to a Pew Research Center survey of 39
countries conducted in January 2018, a median of 59% of people
reported owning a smartphone, with a higher reported usage of
72% among those in developed countries [1]. The convenience of
smartphones has led society to rely on them for much of what we
do, including making video calls, reading email, navigating to
new locations, taking pictures, streaming music and videos, and
even playing games for entertainment. This convenience comes
with a cost to security. Mobile devices are prime targets for
malware to steal users’ sensitive data, such as user location,
photographs, contacts, and passwords.
One of the largest mobile malware threats originates
from one of the greatest advantages of smartphones: their
application stores, which allow users to download an
ever-growing variety of applications from third-party creators.
Android mobile devices are particularly susceptible to this threat.
For users of the iPhone’s Apple iOS, the application store is a
closed system, allowing Apple to control the marketplace where
users can download applications. Although malicious
applications still slip through the cracks, this system allows Apple
to review applications for security flaws before releasing them to
the market. Android devices are able to access not only the
controlled, official Google Play store to download mobile
applications, but also alternative marketplaces. The applications
in these unofficial marketplaces often contain malware in the
form of popular, known applications repackaged to include
malicious code [2].
When users download applications from any
marketplace, the user typically is prompted with requests for
certain privileges the first time they use the application. These
permissions include access to sensitive data such as a user’s
contacts, photos, calendar, and location data, as well as access to
specific hardware items, including the phone’s camera and
microphone [3]. Users often answer these prompts without
thinking much about what that permission means and how it can
impact their privacy [2]. This can lead to the installation of
trojans, malicious applications that appear benign, on users’
phones that can then exploit the permissions granted without
needing to figure out a means to exploit vulnerabilities within the
phone’s software [4].
Given this threat, it is not enough to rely on official
application marketplaces to do a thorough security check of each
app. Personal computers have their own form of malware
detection, but it often consumes too much memory and CPU and
would not be suitable for mobile devices given the limitations of
processing power and battery. Additionally, traditional malware
detection programs on personal computers often rely on a
database of malware signatures; this does not help with the
detection of new malware types that have not been encountered
[5,6].
One major obstacle to studying the applicability of
machine learning algorithms to malware detection on
smartphones is the lack of an extensive, labeled dataset. This
research will make use of the SherLock dataset that fills this
research gap and was collected with cybersecurity research in
mind. It was collected over the course of three years beginning in
2015. Fifty participants were given Samsung Galaxy S5
smartphones with a malicious application installed, and the data
collected is a time-series representation of a wide range of
monitorable features of the smartphone, including CPU and
memory data for each running application, along with a labeled
set of activity by the malicious application.
19
20
M. A. Upal (editor)
The malware used in this data collection experiment
was updated at different time intervals so that a wide range of
malware types could be captured by the data. Additionally, the
malware used was based on malware samples found in the wild
but modified so the participants’ privacy would be protected—
this ensured the phone’s captured data would accurately reflect
that of a phone infected with genuine malware, outside of the
controlled model of the experiment [4].
This research will expand on the techniques used by Shabtai et al. in 2010 that introduced "Andromaly," a framework for continuous, dynamic mobile malware analysis that uses machine learning algorithms to classify collected data instances as either benign or malicious based on low-level features, similar to the SherLock dataset [6]. A major limitation of that research is that the malware used was not found in the wild, as Android was a new platform at the time the study was done. The researchers created four instances of malware to use for testing, which limits the applicability of their model to real malware samples.

The goal of this research is to utilize machine learning algorithms to classify applications as malicious or not based on the collected features in the dataset. The labels in the dataset will allow an accurate representation of the performance of the algorithm to determine if this approach is a viable option for malware detection on Android mobile devices.

2 RELATED WORK

Malware analysis can be broken down into two types: static and dynamic analysis [2,7]. Static analysis of malware is performed independently of the code's execution and attempts to find malicious behavior before it actually happens. Static analysis is a fast and efficient way to detect the presence of malware, but it is insufficient when used alone: malware is still found in the official Google Play store [2], indicating that there are workarounds to static malware analysis. Malicious code can avoid detection through techniques such as obfuscation, which makes lines of code difficult to read or reverse engineer. Additionally, malicious code that uses newer techniques and has not been detected and labelled as malicious by antimalware programs will not be caught by this type of analysis because it lacks similarity to previously identified malware. The proposed solution of this research will implement the dynamic analysis of mobile malware, so much of the focus of this section will be dedicated to the dynamic approach.

Shabtai et al. approached the problem of detecting previously unseen malware in Android mobile devices with a knowledge-based temporal abstraction (KBTA) method [5]. Their solution was to use raw, time-stamped data that includes, for example, CPU usage and events by the user such as keyboard or touch screen usage to detect unusual usage patterns as defined by a domain expert, such as high CPU usage while the phone was not in use. The researchers also tested their solution's applicability to mobile devices: given the limited battery and computing power, they wanted to ensure that the KBTA method could be employed directly on the smartphone to alert users in real time if malware is detected on their phones. They found that their method was effective, reporting 97% accuracy with only an average of 3% of the phone's CPU consumed [5].

A major limitation of this study is that the researchers created the five malicious applications for the collection experiment, as the Android platform was in its infancy and no malware could be found in the wild [5]. Additionally, the data collected for this study came from five users over only a week. A richer dataset with malicious applications found in the wild would provide better testing conditions for this method. Another limitation of this method is the input needed from a security expert. With new iterations of malware being created constantly, the security expert in this scenario would need to constantly ensure the security context was up to date so that any anomalies could be detected. If machine learning is used, the algorithm could be constantly updated with new knowledge without the need for additional human input to protect the phone from malware.

Kim et al. focus their research on detecting malware by creating a resource-conscious application to monitor and analyze power consumption. The detection framework consists of a monitoring phase where a user's power consumption is monitored and a baseline power consumption is determined. Then, a power signature is created from the historical data, and the application uses anomaly detection to analyze a user's current power usage and power signature against the established, historical signature database [9]. The power monitor has to take measurements of the power consumption at intervals to detect any anomalies caused by running applications. In order to leave out benign applications and actions that also consume large amounts of energy, such as media players that display video footage, the researchers characterized the application to define what kind of power consumption would be expected from it without generating a false positive [9]. The study proved successful in detecting previously unseen malware, with accuracy rates up to 95%. A limitation of the study was the small dataset used for training and testing the algorithm: 90 power signatures were used in the training set, and 270 for testing. Another limitation is that this study did not evaluate its algorithm on real-world malware; the researchers created a worm emulator for the purpose of testing their energy-focused method to identify malware. A consistent barrier to research on malware is the limited sample of publicly available malware.

Xue et al. introduce Malton, an on-device application that dynamically detects malware through multi-layer monitoring and information flow tracking, along with efficient path exploration, within the Android runtime framework [8]. To evaluate their detection model, the researchers tested Malton on real-world malware samples and compared Malton to previously proposed models that used similar methods to detect malware. Comparatively, Malton outperformed the other models and added new monitored features that allowed it to detect applications that were using native code loading [8]. A limitation of this study is the requirement for humans to select entry and exit points when parsing an application's code for analysis. A fully automated approach would be better.

In their research, Shabtai et al. introduce Andromaly, an anomaly-based malware detection method for Android mobile devices. Their research uses low-privileged monitorable features including CPU consumption, battery level, and number of data packets sent over the network to train a machine learning algorithm to classify mobile applications as either malicious or benign [6]. The researchers tested several machine learning classification algorithms on the data they collected, including k-means, logistic regression, decision trees, Naïve Bayes, and Bayesian networks. To evaluate the performance of the algorithms in this binary classification problem, the researchers used the Receiver Operating Characteristic (ROC) curve and used it to calculate the Area-Under-the-Curve (AUC). The researchers found that the Naïve Bayes and logistic regression classifiers performed the best in the majority of the experiments performed. Additionally, the researchers employed feature selection and found that using the Fisher score to keep only the top 10 features yielded the best results. A major limitation of this study was the lack of availability of real-world samples of malware. The researchers created four malicious applications, encapsulating denial-of-service and information-theft malware, to use in validating this approach to anomaly-based malware detection. It is difficult to tell how this would perform against the various types and iterations of malware found in the wild. Another limitation is that only two phones were used to collect data for this study. A larger, more diverse dataset that includes a wide variety of malware samples could give a more accurate picture of how well this methodology can be applied in the real world. A suggestion that the researchers make for further research is to use time stamps to give more context to the features; for example, instead of having a feature that represents the battery level at a point in time, have a variable that represents the change in battery level over the last 10 minutes.

Mirsky et al. collected the previously mentioned SherLock dataset with cybersecurity in mind. Their study collected data on a total of 50 volunteers that were given a Samsung Galaxy S5 to use as their primary device for two years. The phones were loaded with a malware application named Moriarty that was updated throughout the study to encapsulate several different types of malware. Examples of the malicious applications include a web browser that contains spyware to either capture users' location and audio data or to capture their web traffic and history, and a popular game that was repackaged to include phishing attempts that prompt users to log in with their Facebook, Gmail, or Skype credentials via fake login pages. The malware in Mirsky et al.'s study was based on malware samples found in the wild, but the code was modified for the study to protect the volunteers' privacy. For example, malware that retrieved files from the user's phone to send to a remote server scrambled that data prior to transfer to make it unintelligible. The researchers demonstrate, using the SherLock dataset, that features such as battery consumption, network traffic flow information, and data on the usage of CPU and memory can be used to detect the presence of malware dynamically [4]. However, their analysis is limited to showing the correlation and information gain scores of each variable; the main goal of their research was to provide a rich dataset that could be used with machine learning to further explore research in mobile cybersecurity, which is the aim of this paper.
3 PROPOSED SOLUTION AND
EVALUATION METHODOLOGY
The current body of research demonstrates that it is
possible to use features such as CPU usage, battery level, and
memory as indicators for detecting malware efficiently in mobile
devices. The goal of this research is to find the best combination
of machine learning algorithms and parameter tuning to get the
highest accuracy for classifying applications as benign or
malicious. Algorithms tested will include Naïve Bayes, logistic
regression, and support-vector machines, as well as the ensemble
methods of random forest and soft voting, in order to test a wide
range of algorithms and see which has the best outcome with
processing efficiency. Efficiency is important to keep in mind
because ideally, a mobile malware detection application can be
downloaded and run from anyone’s Android to ensure continuous
protection that does not monopolize CPU or battery usage. This
research will use a portion of the SherLock dataset as it is the
largest publicly available dataset for analyzing Android devices
for the presence of malware using low-level features. It contains
two years’ worth of data from a total of 50 devices. The malware
samples used represent a variety of different types of malware and
closely reflect malware samples found in the wild, modified only
slightly in order to protect the study volunteers' privacy.
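As a sketch of the ensemble setup described above, a scikit-learn soft-voting classifier over the named algorithms might look as follows; the features and labels here are synthetic stand-ins for the SherLock statistics:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))       # stand-ins for CPU/battery/memory features
    y = rng.integers(0, 2, size=200)     # 1 = malicious session, 0 = benign

    # Soft voting averages each member's predicted class probabilities;
    # SVC needs probability=True to supply them.
    ensemble = VotingClassifier(
        estimators=[
            ("nb", GaussianNB()),
            ("lr", LogisticRegression(max_iter=1000)),
            ("svc", SVC(probability=True)),
            ("rf", RandomForestClassifier(n_estimators=100)),
        ],
        voting="soft",
    ).fit(X, y)
    print(ensemble.predict_proba(X[:3]))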
Since labelling the software as benign or malicious is a
binary classification problem, the Receiver Operating
Characteristic’s Area-Under-the-Curve (AUC) will be used to
measure the accuracy of the algorithm’s performance. The key to
validating the machine learning algorithms on the dataset is that
the malicious application, Moriarty, leaves “clues” when it is
running—the application alternates between benign and
malicious mode, and the nature of each action and session are
both captured in the dataset. While running in benign mode, the
application, while in its nature is malicious, only performs benign
actions. While running in malicious mode, the application
performs both benign and malicious actions in conjunction [4].
The benign sessions allow the machines to learn a baseline of the
application’s performance so that it can detect anomalies when
the application is acting suspiciously.
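Scoring a model with ROC AUC is a one-liner in scikit-learn; the labels and scores below are hypothetical:

    from sklearn.metrics import roc_auc_score

    y_true = [0, 0, 1, 1, 0, 1]                     # 1 = malicious clue
    y_score = [0.10, 0.40, 0.80, 0.35, 0.20, 0.90]  # model's probability estimates
    print(roc_auc_score(y_true, y_score))           # area under the ROC curve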
To create the dataset for machine learning, the clues labelled as either benign or malicious by the malicious Moriarty application, located in the "Moriarty" table, were joined with data from the "Application" and "t4" tables, which were continuously sampled during the SherLock data collection every 5 seconds. Figure 2 presents an illustration of the data used in the research. The Moriarty table included a timestamp and a number indicating which version of the malicious app was running on each user's device, as the app changed throughout the data collection project to include a broad range of malware that would impact the devices differently. Features such as battery data and global application statistics such as device memory and storage were located in the "t4" data table. Data for every application, otherwise known as local application statistics, also included data on CPU usage, the network, process information, and memory usage. For each user, the data was joined to each Moriarty clue on the closest timestamp available, and grouped by each version of Moriarty, which reflects a new instance of the malware application with a new set of malicious behavior to identify.

Figure 2 - An illustration of the data used in the research: Moriarty "clues" (the description of the action taken by the app, a timestamp, the benign/malicious label, and the version of the Moriarty application), joined by timestamp with the Application table (per-application info: CPU, network statistics, memory, process information) and the t4 table (global application statistics: memory, storage, CPU, and battery statistics) to form the machine learning dataset.

4 RESULTS AND DISCUSSION

The purpose of this research was to apply commonly used machine learning algorithms to a known, labelled set of malware application data to evaluate whether a set of low-privileged statistics including battery consumption, memory, and CPU usage could dynamically detect the presence of malware. This research trained and tested on data from version one of the Moriarty malware application. This version of the application was a puzzle game that stole and transmitted a user's contacts. During data preprocessing, numerical data was standardized by removing the mean of the training set and then scaling to unit variance of the training set.

Figure 3 shows a chart of the results of training and testing on a single user. The results show that the features chosen to detect the presence of malware perform well on the tested algorithms. Naïve Bayes, which was chosen as it was the best-performing algorithm in the "Andromaly" framework [6], performed the worst of the tested algorithms. However, it still yielded an AUC score of 0.91. Random forest was the best-performing algorithm on this dataset, with an AUC of 0.99 and an f1 score of 0.97. The f1 score is the harmonic mean of precision and recall. As with AUC, the f1 score is another indicator that should be as close to 1 as possible, with 1 being a perfect score and 0 being the lowest possible score. The support-vector classifier and the soft-voting classifier, which utilized all of the other classification algorithms in its ensemble voting method, performed nearly as well as random forest.

Figure 3 - Chart of evaluation scores for selected algorithms

Algorithm                               AUC    f1-score
Naïve Bayes                             0.91   0.94
SVC                                     0.98   0.97
Logistic Regression                     0.97   0.95
Random Forest                           0.99   0.97
Soft Voting (using all of the above)    0.98   0.96

This research is limited to one of the many types of malware represented in the SherLock dataset, so more research should be done to confirm that the chosen features work well across different types of malicious activity within mobile applications, and to see if random forest will continue to be the best-performing algorithm, or if ensemble voting methods such as soft voting will provide a better indicator at large when faced with many different types of malware. However, the results shown by this research are indeed promising and prompt further exploration to improve the performance metrics even further.

5 CONCLUSIONS AND FUTURE WORK

With society spending more time than ever glued to their phones, application developers aim to grab ahold of this captive market with entertaining applications for smartphone users to enjoy. These applications sometimes go through a review process before being made available in a smartphone application store, which can help to keep malware from smartphones through a static review of the code. However, these methods of detection are becoming less effective as hackers with malicious intent get creative with code obfuscation to cover up any signs of malware in their code.

A better way to continuously protect smartphone users might be to use a dynamic approach, where usage statistics like battery, CPU, and memory are evaluated through machine learning to detect strange behavior that could compromise an individual's personal data and privacy. Previous studies have shown these features to be effective in classifying an application as malware. The evaluation results in this study suggest that global application features such as CPU usage, battery usage, and memory, as well as attributes for running applications and their individual CPU and memory usage statistics, are effective and can be utilized to successfully classify malicious applications with machine learning algorithms.

In future work, more of the SherLock dataset should be explored. This analysis focused on the first quarter of 2016, but the dataset contains over two years of the labelled malicious software data, with more types of malware used within the "Moriarty" application. Another idea to further this research is to use a similar methodology outlined in the "Andromaly" proposed framework to examine the algorithm's performance under different circumstances. The authors conducted several different experiments with similar data in addition to the device-specific classification algorithms outlined in this paper. For another experiment, Shabtai et al. tested on each participant's device whether the algorithm could detect malicious applications that were not included in the training set. Additionally, an experiment was conducted with data from all benign and malicious applications, but the training and testing data were split along devices. Lastly, the researchers evaluated their algorithm's ability to classify an application as benign or malicious when it was not included in the training set and with training and testing performed on different devices [6]. This test would help determine if the algorithm is attuned to a single user or if the data can be generalized to identify malware on different devices.
ACKNOWLEDGMENTS

I would like to thank my professors Dr. Afzal Upal and Dr. Chad Redmond for assistance with the research process and with data processing. I would also like to thank Dr. Chris Mansour, whose guidance led me to focus my research on applications of machine learning in cybersecurity.

REFERENCES

[1] Poushter, Jacob, Bishop, Caldwell, and Chwe, Hanyu. Social Media Use Continues to Rise in Developing Countries but Plateaus Across Developed Ones. Pew Research Center (June 19, 2018).

[2] Suarez-Tangil, Guillermo, Tapiador, Juan E., Peris-Lopez, Pedro, and Ribagorda, Arturo. Evolution, Detection and Analysis of Malware for Smart Devices. IEEE Communications Surveys Tutorials, 16, 2 (2014), 961-987.

[3] Google Developers. Permissions overview. Android Developers (November 20, 2018).

[4] Mirsky, Yisroel, Shabtai, Asaf, Rokach, Lior, Shapira, Bracha, and Elovici, Yuval. SherLock vs Moriarty: A Smartphone Dataset for Cybersecurity Research. (Vienna, Austria 2016), ACM.

[5] Shabtai, Asaf, Kanonov, Uri, and Elovici, Yuval. Intrusion Detection for Mobile Devices Using the Knowledge-based, Temporal Abstraction Method. Journal of Systems and Software, 83, 8 (August 2010), 1524-1537.

[6] Shabtai, Asaf, Kanonov, Uri, Elovici, Yuval, Glezer, Chanan, and Weiss, Yael. "Andromaly": a behavioral malware detection framework for android devices. Journal of Intelligent Information Systems, 38, 1 (February 2012), 161-190.

[7] Arshad, Saba, Shah, Munam Ali, Khan, Abid, and Ahmed, Mansoor. Android Malware Detection & Protection: A Survey. International Journal of Advanced Computer Science and Applications (IJACSA), 7, 2 (2016), 463-475.

[8] Xue, Lei, Zhou, Yajin, Chen, Ting, Luo, Xiapu, and Gu, Guofei. Malton: Towards On-Device Non-Invasive Mobile Malware Analysis for ART. In 26th USENIX Security Symposium (USENIX Security 17) (Vancouver, BC 2017), USENIX Association, 289-306.

[9] Kim, Hahnsang, Smith, Joshua, and Shin, Kang G. Detecting Energy-Greedy Anomalies and Mobile Malware Variants. In Proceedings of the 6th International Conference on Mobile Systems, Applications, and Services (Breckenridge, CO, USA 2008), ACM, 239-252.
About the author:
Christina Eusanio is a Graduate Student at Mercyhurst
University.
Building a Gun Detection Model Using Deep Learning
Shraddha Dubey
Graduate Research Assistant
Mercyhurst University
Shraddha.dubey04@gmail.com
ABSTRACT
Mass shootings and homicides involving guns are on the rise. The
recent mass shooting at the Christchurch mosque in New Zealand
is yet another horrifying example of the pain and destruction such
incidents bring to society. The ease of obtaining handheld guns
in the open market adds to the risk of these incidents repeating.
The objective of this research is to build a trained model that can
detect hidden handguns. Manual analysis of security images to
identify threats of gun-related violence is labor-intensive,
time-consuming, and prone to human error. This research is aimed
at finding the most suitable model for detecting the presence of a
gun in still images using neural networks, such as Faster R-CNN
and SSD models. The dataset used for this research comes from
open source platforms.
Table 1: Estimated total civilian-held legal and illicit firearms
in the 25 top ranked countries and territories, 2017 [Source:
Small Arms Survey (2018)]
Keywords
Image classification, deep learning, R-CNNs
1. INTRODUCTION
After every mass shooting reported in the media and after every
minute of silence, the question of gun control arises. Almost
always, the debate over stricter gun laws comes up short before the
next shooting. This paper does not delve into the discussion
regarding stricter gun control laws or limiting access to
ammunition. Instead, it focuses on the use of machine learning and
artificial intelligence to identify firearms (handheld guns) in
images in order to detect and alert concerned parties to take further
action.
Table 2: Estimated rate of civilian firearms holdings in the 25
top-ranked countries and territories, 2017 (firearms per 100
residents) [Source: Small Arms Survey (2018)]
Firearm detection in still images, particularly handgun detection,
is not yet perfected and there are benefits in improving the
technology. The primary goal of this work is to prevent firearm
misuse. This would be particularly valuable in countries where
illegal handgun use/misuse is a challenge for law enforcement.
Another important application would be to incorporate it into
surveillance systems and social media platforms where such
pictures may end up.
Figure 4: Annual acquisition of new firearms in the United States [Source: Small Arms Survey (2018)]

The focus of researchers interested in building models similar to handgun detection is mainly driven by high crime rates that affect many people worldwide. According to the Small Arms Survey [1], by the end of 2017 there were approximately 1,013 million firearms in the 230 countries and autonomous territories of the world. An estimated 84.6 percent of these were held by civilians, 13.1 percent by state militaries, and 2.2 percent by law enforcement agencies.

A research paper by Olmos, Tabik, and Herrera [2] claims that "psychological studies demonstrate that the simple fact of having access to a gun increases drastically the probability of committing violent behavior." In most cases, early prevention and detection of firearms, particularly handguns, are the key to preventing such behavior. For the most part, past studies have focused on discovering firearms with techniques such as X-rays combined with some of the traditional machine learning methods.
Recent research, however, has incorporated more machine learning and deep learning models, such as Convolutional Neural Networks (CNNs), which ultimately have performed better than traditional machine learning models.

Valldor, Stenborg, and Gustafsson [3], from the Swedish Defense Research Agency, have focused more closely on social media and images posted on those platforms. In terms of previous work, in countries such as Sweden, many experts have analyzed data (images) manually, and some still continue to do so; however, the expansion of social media makes this difficult due to the volume of information available. This causes overload and can often lead to a slow turnaround in the detection and prevention of violent behavior. Because of these reasons, the main focus of the research of Valldor and his team was not only to build a model to detect firearms in images but also to investigate what is needed to build such a model in terms of time and other resources (mainly monetary). They suggested three uses for their model:

1. General forensic analysis of images;
2. Disinformation and troll detection on the internet;
3. Lone actor terrorist detection.

Figure 5: ROC curve for firearm detector

Figure 5 shows a graph of the ROC curve for their firearm detector. The true positive rate is measured on a test set of 200 images containing firearms. The false positive rate is measured on 5000 images of the MS-COCO 2017 validation set. Results from their model were very promising; however, there was a large number of false positives: their model quite often categorized handheld gadgets (TV remotes) and skiing gear (sticks) as weapons. In the graph from their paper (Figure 5), we see a high false positive rate.

Another team of researchers, Wu, Yao, Fu, and Jiang [4], focuses on applying deep learning to video classification and captioning, driven by the recent advances in picture classification and captioning. They hope to increase the rate at which various videos on the internet can be analyzed and classified. Services such as YouTube, Vimeo, or Imgur, where users have complete control over the content, were the main motivation of this team for such a classifier. For example, if a user decides to post a very graphic, violent video on YouTube in which this user is explaining various fighting techniques, and this video gets published under a name such as "Mickey Mouse" in hopes that children may come across it, this classifier would process the video before it is published and would automatically red-flag it before it reaches any users. Even though YouTube is making advances in video classification, there are still other online services where video classifiers are rather lenient.

An interesting takeaway from this article is the approach to building a classifier. Wu, Yao, Fu, and Jiang suggested that CNNs are one of the most promising approaches for building a model for object detection. Just as many research papers have done with CNNs, this paper also states that even though the model they built does have some success in classifying videos, it is still far from practical use. This may be discouraging for some; however, with enough focus and attention to developing CNNs for firearm detection, we may develop practical models with reliable results.

2. RELATED WORK

One of the early works on detecting firearms was done by Hua-Mei Chen [5] and his team, published in March 2005. It focuses on the detection of weapons underneath a person's clothing and recognizes that as a very important "obstacle," one which could have a major impact on security in highly populated areas such as airports, bus stations, and train stations. The authors of this paper appealed to The Concealed Weapon Detection (CWD) program that was started in 1995 under the sponsorship of the National Institute of Justice, administered by the Air Force Research Lab in the United States.

Table 3: Summary of the imaging sensors being developed by the CWD

The quest has not ended yet, even 14 years after the article was published and 24 years since the program started. The main reason for the never-ending research is the changing technology, as well as technological advances on both sides of this arms race. Guns can be 3D printed and can be made of various alloys of steel and other materials that can make them harder to discover through a standard imaging process. As more effort is put into early detection, as well as detection from a distance, security enforcers gain more time to act and prevent an incident from happening.

The biggest strides are being made by using millimeter wave (mmW) advanced imaging technology (AIT). The mmW cameras are commonly used by two of the biggest security agencies, the FDA and TSA/DHS. The most well-known cases of these cameras being widely utilized are by the Transportation Security Administration (TSA) of the United States of America. One of the
-1-
-2-
M. A. Upal (editor)
best features of this technology, in terms of firearm detection, is
that it can discover hidden handguns from about 65 feet (20
meters) away in real time. This technology analyzes the waves
emitted by the human bodies which are usually ‘warm’, compared
to the ‘cold’ waves of metals and other objects. The mmW cameras
only reflect red light if they see any ‘cold’ bodies and agent behind
the screen is the only one able to see but not the potential threat.
This allows for a quick reaction from the defender side rather than
the potentially dangerous subject
Table 5: Classification accuracy of SVM
In conclusion, the research reviewed here only gives us a glimpse of the potential of machine learning for handgun detection. The last research paper digs a bit deeper than previous work, and the results are not discouraging; rather, they suggest that once enough minds have been put to the matter, handgun detection will not be as big of a problem in the future.
Another paper, published by Rohit Kumar Tiwari and Gyanendra K. Verma [6] in 2015, focuses on the CCTV cameras commonly used for security and surveillance purposes. The paper discusses using the Harris interest point detector alongside FREAK [7] (Fast Retina Keypoint) for automated gun detection, thus saving time and increasing the efficiency of the same task done by an operator or security personnel. The Harris detector is invariant to geometric transformations (a different gun model, a picture of it from a different angle, etc.), and FREAK serves as a feature extractor for each point, which allows for a clear and coherent result on whether or not a gun shape is detected.
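As a rough illustration of this pipeline (not the authors' code), the sketch below detects Harris corners and computes FREAK descriptors with OpenCV; it assumes the opencv-contrib-python package, and the file name is hypothetical.

import cv2

# load a test image in grayscale (hypothetical file name)
img = cv2.imread("gun.jpg", cv2.IMREAD_GRAYSCALE)

# Harris-based corner detection
corners = cv2.goodFeaturesToTrack(img, maxCorners=500, qualityLevel=0.01,
                                  minDistance=5, useHarrisDetector=True)
keypoints = [cv2.KeyPoint(float(x), float(y), 7)
             for x, y in corners.reshape(-1, 2)]

# FREAK computes a binary descriptor at each detected keypoint
freak = cv2.xfeatures2d.FREAK_create()
keypoints, descriptors = freak.compute(img, keypoints)
print(len(keypoints), "keypoints described")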
After system initialization, Harris combined with FREAK creates a description of a gun from basic gun images through various data points. The results are shown in Table 4.
Table 4: Performance under Harris plus FREAK descriptor matching
Session | Image description                | No. of images | Correct classification (TP) | TPR (%)
1       | different backgrounds            | 12            | 11                          | 91.66
2       | different degree of illumination | 9             | 7                           | 77.77
3       | interclass variation             | 11            | 9                           | 81.81
4       | degree of occlusion              | 17            | 14                          | 82.35
5       | rotation variation               | 10            | 8                           | 80.00
6       | multiple guns                    | 6             | 5                           | 83.33
This system, though promising, is not able to perform well under illumination change, because the color-based segmentation algorithm is not able to segment the image accurately. Hence, during the gun color extraction only part of the gun is recovered, which affects the performance, as seen in Table 4.
Another paper, published by Gyanendra K. Verma and Anamika Dhillon in 2017 [8], pointed out that according to the United States Department of Justice the majority of crimes are committed using handguns, and those crimes include robberies, rapes, and auto thefts. I believe that CNNs are some of the most capable algorithms, yet many researchers shy away from them, in part because of their potential complexity of implementation, as well as the fact that we do not necessarily know what goes on inside the black box of a CNN. CNNs do not provide access to the learned knowledge the way decision-tree based learning algorithms do. The researchers in this paper decided to apply a Deep Convolutional Network (DCN) with a Faster Region-based CNN model to automatically detect handheld guns in a cluttered setting. When referring to previous work, they point to the CWD (Concealed Weapon Detection) program, since this program is already implementing multiple technologies for imaging detection of handguns, knives and similar handheld weapons, mainly used by the TSA at airports.
In order to train the CNN, the researchers used the IMFDB (Internet Movie Firearms Database), an online database of images from various movies, TV shows, video games, and other media. Besides handguns this database stores pictures of other firearms, from which the researchers chose revolvers, rifles, and shotguns. The system was trained using a mini-batch gradient descent approach, with adjustments to the multinomial logistic regression objective made during training to develop the best training approach. During testing, the accuracy of the system was measured through the True Positive Rate, False Positive Rate, Positive Prediction Value, and False Detection Rate.
3. METHODOLOGY
Obtaining a large enough dataset to train the model is probably the most difficult task when applying Deep Learning. Collecting these images from the most common search engines requires time and resources to tag individual images, as Deep Learning algorithms require tens of thousands of images.
In addition to data collection and training, establishing a TensorFlow and Anaconda environment with the default Python package imports (e.g. NumPy, Anaconda's Protobuf, lxml, Cython, and OpenCV) was necessary. The object detection model was then downloaded from the TensorFlow Model Zoo [14], a holding repository for all current models for object detection and image classification. In this case, the FasterRCNN-Inception-V2-COCO model was packaged and installed for testing.
For this model, the images were collected from open-source platforms. The images used have only one class, labeled as pistol. The training set contains 2,000 images, and the evaluation set consists of 100 images with four labels, namely face, car, pistol, and hand, unevenly distributed among these groups.
The verification boxes of every instance of the targeted object were manually labeled in each image and then converted into .xml files for training, using LabelImg [14]. The .xml files were then converted to .csv files in order to be read by TensorFlow. The last step was to configure the object detection training pipeline, which defines which model and what parameters will be used for training. In the configuration file, the parameters num_classes, fine_tune_checkpoint, and num_examples were altered, in addition to the train_input_reader and eval_input_reader sections, to fit our custom path directories.
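As a hedged sketch of the .xml-to-.csv conversion step described above (not the exact script used; the folder layout and column order are assumptions mirroring the common TensorFlow object detection tutorial format):

import glob
import xml.etree.ElementTree as ET
import pandas as pd

rows = []
for xml_file in glob.glob("annotations/*.xml"):        # LabelImg output (assumed folder)
    root = ET.parse(xml_file).getroot()
    for obj in root.findall("object"):                 # one entry per labeled box
        box = obj.find("bndbox")
        rows.append({
            "filename": root.find("filename").text,
            "width": int(root.find("size/width").text),
            "height": int(root.find("size/height").text),
            "class": obj.find("name").text,            # e.g. "pistol"
            "xmin": int(box.find("xmin").text),
            "ymin": int(box.find("ymin").text),
            "xmax": int(box.find("xmax").text),
            "ymax": int(box.find("ymax").text),
        })

pd.DataFrame(rows).to_csv("train_labels.csv", index=False)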
4. SOLUTION
The detector used in this work is based on Faster R-CNN and uses the Inception network for feature extraction. It constructs a single, unified model composed of an RPN (region proposal network) and Fast R-CNN with shared convolutional feature layers [13].
Figure 6: An illustration of Faster R-CNN model [13]
The Faster R-CNN algorithm replaces the slow selective search algorithm of previous models with a fast neural net, particularly through the introduction of the Region Proposal Network (RPN). An RPN works as follows:
1. At the last layer of an initial CNN, a 3x3 sliding window moves across the feature map and maps it to a lower dimension (e.g. 256-d).
2. For each sliding-window location, it generates multiple possible regions based on k fixed-ratio anchor boxes (default bounding boxes).
3. Each region proposal consists of an "objectness" score for that region and 4 coordinates representing the bounding box of the region.
A Faster R-CNN is a two-stage classifier. The first stage involves object localization: a sliding window is applied to the output of the last layer of the feature extraction network, with the goal of identifying regions in the image that contain an object of interest with the help of bounding boxes. In the second stage, regions with high scores from the first stage are extracted from the feature map and fed through a classifier that predicts both the object type and a bounding box for each such region.
A bounding box can be initialized using the following parameters:
bx, by: coordinates of the center of the bounding box;
bw: width of the bounding box w.r.t. the image width;
bh: height of the bounding box w.r.t. the image height.
Along with the information about bounding boxes, we can treat the class of an image as a multi-class classification problem, defined as $y = c_i$, where $c_i$ is the probability of the $i$th class. If there are three classes, the target variable is defined as
$y = \begin{bmatrix} c_1 \\ c_2 \\ c_3 \end{bmatrix}$
Loss Function
The model is optimized for a loss combining two tasks (classification + localization). The loss function sums up the cost of classification and bounding box prediction: $L = L_{cls} + L_{box}$. For "background" regions, $L_{box}$ is ignored by the indicator function $\mathbb{1}[u \ge 1]$, defined as
$\mathbb{1}[u \ge 1] = \begin{cases} 1 & \text{if } u \ge 1 \\ 0 & \text{otherwise} \end{cases}$
Faster R-CNN is thus optimized for a multi-task loss function that combines the losses of classification and bounding-box regression (for more than two classes).
Figure 7: symbol and explanation [13]
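For reference, the multi-task loss defined by Ren et al. [13], which the indicator function above feeds into, can be written as

$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}}\sum_i p_i^* L_{reg}(t_i, t_i^*)$$

where $p_i$ is the predicted probability that anchor $i$ contains an object, $p_i^*$ is the ground-truth label, $t_i$ and $t_i^*$ are the predicted and ground-truth bounding-box coordinates, and $\lambda$ balances the classification and regression terms.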
5. EVALUATION
In real-time applications, datasets have multiple classes and their distribution is non-uniform, so a simple accuracy-based metric will introduce biases in favor of the larger classes. It is also important to assess the risk of misclassification. Thus, there is a need to associate a confidence score with each bounding box the model detects and to assess the model at various levels of confidence. Therefore, object detection evaluation involves two distinct measures.
1. Determining whether an object exists in the image (classification). Average Precision (AP) is a popular accuracy measure, defined as the average of the precision scores after each true positive TP in the scope S. Mean Average Precision (mAP) is the average of the AP values over all classes.
2. Determining the location of the object (localization, a regression task). To evaluate a model on localization we must first identify how well it predicted the location of the object. This is evaluated against an Intersection over Union (IoU) threshold, which summarizes how well the ground-truth object overlaps the object boundary predicted by the model.
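A minimal sketch of the IoU computation (not code from the paper; boxes are assumed to be (xmin, ymin, xmax, ymax) tuples in pixel coordinates):

def iou(box_a, box_b):
    # corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0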
6. RESULTS
Figure 8: Classification loss (loss score vs training steps)
The classification loss shows the price paid for inaccurate classification of the objects in the images by this model. The graph shows that the loss stabilizes close to 3. This can be reduced further with more training data and more repetitions; in an ideal model the value should be closer to 2.
Figure 9: Localization loss (loss score vs training steps)
The localization loss shows the price paid for inaccurate bounding boxes/coordinates predicted by the model. We see it converging to 0.65, which is a good measure. This can be attributed to the small number of class labels in the model.
Figure 10: Total loss (loss score vs training steps)
The total loss is the sum of the classification and localization losses. In this model it is close to 4, which can be attributed to the higher value of the classification loss. The model has scope for more training so that the total loss can be brought down to between 2 and 1.5.
Figure 11: Trained Model Output
7. CONCLUSION
The model predicted the labels with accuracies as high as 99% when the image has high contrast with the background, but when the image and the background have lower contrast, it produces accuracies as low as 64%. This could be attributed to the small number of training images and the relatively low number of training steps.
In terms of speed of detection, it took ~15 seconds to identify the objects, which makes the model impractical for real-time object detection.
8. FUTURE WORK
In the future, the model can be trained on a larger number of real-time surveillance images with more classes, such as handgun, rifle, and shotgun. There is a possibility of applying this model to videos and live streaming data, and it will be interesting to see how it extends to those situations.
The Faster R-CNN model (and the R-CNN family in general) is a region-based object detection algorithm. It can achieve high accuracy but could be too slow for certain applications such as autonomous driving. It would be interesting to see the dataset used with faster object detection models such as SSD and RetinaNet.
Acknowledgment
I am extremely fortunate for the constant support and guidance I
received from Dr. Afzal Upal and Dr. Mahesh Maddumala from
the Department of Computer and Information Science, Mercyhurst
University. They were instrumental in helping me in every
challenge I came across during this research. They were patient
and understanding of my queries and guided me appropriately. I
would also like to thank my fellow classmate Heidi Beezub and
Mercyhurst alumnus Praveen Kumar Neelappa, who provided
constant help and critique during the process of writing this paper.
REFERENCES
[1] Aaron Karp. Estimating Global Civilian-Held Firearms Numbers. http://www.smallarmssurvey.org/fileadmin/docs/T-Briefing-Papers/SAS-BP-Civilian-Firearms-Numbers.pdf. (January 2019)
[2] Roberto Olmos, Siham Tabik, and Francisco Herrera. 2017. Automatic Handgun Detection Alarm in Videos Using Deep Learning. Cornell University Library (February 2017).
[3] Erik Valdor and David Gustafsson. Firearm Detection in Social Media. NATO STO.
[4] Zuxuan Wu, Ting Yao, Yanwei Fu, and Yu-Gang Jiang. 2016. Deep Learning for Video Classification and Captioning. (September 2016)
[5] Hua-Mei Chen, Seungsin Lee, Raghuveer M. Rao, Mohamed-Adel Slamani, and Pramod K. Varshney. 2005. Imaging for Concealed Weapon Detection. IEEE Signal Processing Magazine (March 2005).
[6] Rohit Kumar Tiwari and Gyanendra K. Verma. 2015. A Computer Vision based Framework for Visual Gun Detection Using Harris Interest Point Detector. Procedia Computer Science 54 (August 2015), 703–712. DOI: https://doi.org/10.1016/j.procs.2015.06.083
[7] A. Alahi, R. Ortiz, and P. Vandergheynst. 2012. FREAK: Fast Retina Keypoint. 2012 IEEE Conference on Computer Vision and Pattern Recognition (2012). DOI: http://dx.doi.org/10.1109/cvpr.2012.6247715
[8] Gyanendra K. Verma and Anamika Dhillon. 2017. A Handheld Gun Detection using Faster R-CNN Deep Learning. Proceedings of the 7th International Conference on Computer and Communication Technology - ICCCT-2017 (2017), 84–88. DOI: http://dx.doi.org/10.1145/3154979.3154988
[9] David G. Lowe. 2004. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 60, 2 (January 2004), 91–110. DOI: http://dx.doi.org/10.1023/b:visi.0000029664.99615.94
[10] Jim Handy. NGD's New "In-Situ Processing" SSD. https://thessdguy.com/tag/machine-learning/. (July 25, 2017)
[11] Open Images Dataset V4+. https://storage.googleapis.com/openimages/web/index.html
[12] Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. Visual Geometry Group, Department of Engineering Science, University of Oxford. (10 April 2015)
[13] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2016. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.
[14] TensorFlow Detection Model Zoo. https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md. (15 January 2019)
About the author:
Shraddha Dubey is a Graduate Student at Mercyhurst University.
Flight delay/cancellation prediction using machine learning
Adapting new ways to help stranded passengers
Miloš Vereš
Department of Computing & Information Science
Mercyhurst University, Erie, PA
mveres04@lakers.mercyhurst.edu
ABSTRACT
In 2017 the US airline industry experienced an 11.3 percent increase in the cost of delayed flights, from $23.9 to $26.6 billion. This money is lost out of the $1.5 trillion that the industry contributes to the US economy. A flight cancellation prediction model is one way to address this problem. By knowing in advance that a flight will be canceled, the industry has a chance to "save" a potential loss in demand by offering a variety of ways to compensate the stranded travelers, which can lead to an increase in revenue. This program can help respond to the needs of stranded passengers. The main goal of the work reported here is to create a model that can be implemented by the airline and hospitality industries. Raw flight data was collected from the webpage of the Bureau of Transportation Statistics (BTS), while weather data was found on the National Oceanic and Atmospheric Administration (NOAA) webpage. Of the models we investigated, the best one was the Isolation Forest model, which is commonly used for anomaly detection.
1. INTRODUCTION
The commercial airline industry has been the backbone of the worldwide transportation system ever since the 1950s, when a few US airlines started introducing a new way of fast, comfortable and efficient travel. In the beginning, sports teams and businessmen were the main customers, but as the need for quick, reliable and comfortable travel increased, the industry grew. Currently, the FAA (Federal Aviation Administration) handles over 15 million flights annually, which translates to over 46 thousand flights per day serving 2.6 million passengers [FAA 2018]. These numbers apply to the United States of America only. More than half a century after the commercial airline trend was adopted, the industry has seen major improvements that have created outstanding reliability, and almost all of the commercial airplanes created today are engineered to perform better than the current worldwide safety standards. A major reason for that is the small room for error: if an accident were to happen, it would cause great disruptions to the industry and not just individual carriers. Despite the fact that the industry's safety and customer service record is close to the best it has ever been, there are still rare occasions where passengers are inconvenienced by delays or even cancellations, which end up costing the industry and the economy a pretty penny. Between 2012 and 2017 the industry experienced a significant increase in the total costs of delays. The Federal Aviation Administration defines the total cost of delay as "the sum of costs to airlines, passengers, lost demand and indirect costs." The total costs went up from $19.2 billion in 2012 to $26.6 billion in 2017. This significant increase has been driven by the increase in the cost for passengers who experienced delays/cancellations. It is also important to note that according to the FAA a delay means that a flight was 15 or more minutes late for departure [FAA 2018]. A large number of the delays are related to the four main causes shown in Figure 12 below.
Figure 12: Costs of different types of delays. Adapted from www.faa.gov/nextgen/programs/weather/.
By far the largest factor is sub-optimal weather conditions, which accounted for 69 percent of delayed flights. The majority of weather delays happen during the summer season (April through September) as opposed to the winter season. According to the Operations Network (OPSNET), the official source of all air traffic and delay data, summer months are usually characterized by more convective weather, meaning heavy rains and thunderstorms [Elliot 2013]. These conditions are often the most disruptive to airplanes, as thunderstorms create strong turbulence and updrafts. Those updrafts may also carry large pieces of hail that can seriously damage airplanes, mainly causing malfunctions to the nose of the plane where radars are kept. Damage to a radar system can seriously impair communication with air control, potentially leaving pilots "alone" in the air [Krajewski 2015]. The second biggest cause of delays is pure volume, in this case high demand. Even though the Federal Aviation Administration confirms that airline carriers possess "unconstrained resource capacity", high demand still causes up to 19 percent of flights to be delayed. Although this happens mainly during the holiday season (Thanksgiving, Christmas, New Year's, Fourth of July), these delays still account for a major portion of all annual delays. Runway availability, or in this case unavailability due to high traffic volume, is the third delay factor, with only 6 percent of all flights experiencing delay due to it. The "Other" category in this case is represented by various general aviation, air taxi and
military aircraft that have flown under FAA radar but have experienced some delays. The last category in this group is delays caused by equipment failure, which accounted for less than 1 percent of all delays [Olson & Philips 2018]. The report that OPSNET published in 2015 regarding flight delays was the most thorough one; however, each year the Bureau of Transportation Statistics publishes a simple report on flight delays over the last year, and weather-related delays still account for at least 55 percent of all delays, suggesting that a better understanding of suboptimal weather conditions for aircraft could lead to improvements in scheduling that can significantly reduce delays [Anon. 2018]. Even if a flight gets delayed, knowing that in advance can help passengers plan their time at the airport, and it is at this point that the hospitality industry in the airport vicinity can experience an influx of customers who are on standby for their flights. The FAA has predicted steady growth of 2.4 percent a year for the commercial airline industry as part of its "Aerospace Forecast Project: Fiscal Years 2017 to 2037" [FAA 2017]. Growth of such magnitude suggests that cancellation/delay prediction models may become even more valuable in the future.
2. RELATED WORK
Cancellation prediction models have been widely utilized in the hospitality industry by third-party booking agencies such as Expedia. The goal in this case is to predict how many of their customers will cancel their bookings, because every cancellation can significantly influence potential profit. Over the last three years there has been an increase in research papers related to air traffic, various optimization models, as well as a few delay predictors mainly focused on the delay at the arrival airport. One particular article titled "Unfriendly Skies" focused on predicting cancellations based on weather data; however, our approach differs from that employed by Balduino, since the author only uses data for 10 US airports. This may seem like a lot of data, but we need to keep in mind that a cancellation is an anomaly, accounting for less than 2 percent of all flights in the United States on an annual basis. Another difference is that they used SPSS's "Auto Classifier" node. Balduino found that the Random Trees algorithm was the most accurate one. It is important to note that the only variable used to predict cancellations was the weather. The model was almost 88 percent accurate in predicting that a particular flight would be cancelled [Balduino 2017].
Kuhn and Jamadagni’s [2017] approach to delays/cancellations
focused on predicting flight delays at the arrival airport. Even
though this has a real-world value as well, this is a common
practice done by OPSNET in United States since an airline has to
report no later than 30 minutes after they know that the airplane
will be arriving late at the destination. In this case researchers
have had a better outcome as they were able to utilize Neural
Network, Decision Tree and Logistic Regression to achieve
around 90 percent accuracy that a certain flight would be arriving
late. Main motives behind this research was to offer a way to
better manage air-traffic as a way to lower economic and
environmental impact of delays. It is important to note that
researchers have used dataset for one year which may be deemed
insufficient since the year of 2015 had multiple outliers due to
unusual extreme weather.
3. DATA
Our prediction model used two main sources of data:
1) Bureau of Transportation Statistics (BTS) – flight data [Anon. 2007];
2) National Oceanic and Atmospheric Administration (NOAA) – weather data [Anon. 2019].
The Bureau of Transportation Statistics provides an in-depth dataset that contains information about every domestic flight, with over 20 variables for every flight. Scraping the data took the longest time, as only a limited amount of data can be scraped for any one-month period. This ended up being a major bottleneck, as the BTS server to which we sent requests would take up to several hours to approve a dataset download. The scraped dataset consisted of the last three years of information. Weather data collection had its own challenges, as the collected data referred to the location closest to the airport, usually selected by zip code. The original idea was to scrape data off the airport weather stations; however, that historical data is not accessible to the public. The number of flights at each airport, from most popular to least popular, is shown in Figure 13.
Figure 13: Flight data from BTS included 20 out of the top 30 Core Tower Operations listed by the FAA. Adapted from www.faa.gov/airTraffic/media/Air_Traffic_by_the_Numbers_2018.pdf.
After selecting the airports to pursue data from, the next step was to select the variables that would suit our model best. The variables selected were: Flight Date, Origin, Origin City Name, Origin State Name, Destination, Destination City Name, Destination State Name, Cancelled, Cancellation Code, Cruise Elapsed Time, Actual Elapsed Time, Air Time, Distance, Carrier Delay, Weather Delay, NAS Delay, Security Delay, Late Aircraft Delay.
The weather data consisted of these variables: Temperature, Humidity, Air Pressure, and Precipitation Type and Amount (if any). Joining these two datasets was a challenging task, since the weather observations were made at times that do not match scheduled departure times. The majority of weather stations report weather as soon as flight conditions change. By this time our flight data consisted of millions of rows, so joining the data based on the closest possible observation times would be very taxing due to the size of the dataset. We dealt with this problem by considering the mean value of each weather data point for the day of the flight.
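A minimal sketch of this daily-mean join (not our original code; file and column names are hypothetical):

import pandas as pd

flights = pd.read_csv("bts_flights.csv", parse_dates=["flight_date"])
weather = pd.read_csv("noaa_weather.csv", parse_dates=["obs_time"])

# collapse intra-day observations to one mean value per airport per day
weather["date"] = weather["obs_time"].dt.normalize()
daily = (weather
         .groupby(["airport", "date"])[["temperature", "humidity",
                                        "air_pressure", "precipitation"]]
         .mean()
         .reset_index())

# attach the daily means to each flight record at the origin airport
merged = flights.merge(daily, left_on=["origin", "flight_date"],
                       right_on=["airport", "date"], how="left")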
By having such a wide variety of variables in the dataset, we had a few options for which variable to predict. The Bureau of Transportation Statistics, where the flight data came from, has five major variables for flight delays/cancellations:
1. Carrier;
2. Weather;
3. National Airspace System;
4. Security;
5. Late Aircraft.
It was nearly impossible to determine what exactly variables #3 and #4 stand for, since the terms were very loosely defined and regarded as "classified" by the airspace system. Since the main goal and the reason to pursue this research was to determine the relationship between weather and delayed/cancelled flights, the Weather variable was chosen. Below is a visualization of the cancellation distribution for different causes.
Figure 14: Total cost of cancellations for different reasons.
4. METHODOLOGY AND FINDINGS
In the beginning phases of the modeling we tried decision trees and KNN; however, neither of those two yielded good results for this particular cancellation prediction task. The highest value with those two models was only about 62 percent accuracy that a flight would get delayed/cancelled, which is barely better than flipping a coin. After reexamining the data, we realized that delays/cancellations were just an anomaly in this dataset: our three years of flight data for the busiest airports in the United States contained only about 2 percent cancelled flights. The next method used to model this dataset was the Isolation Forest (Tree), an unsupervised machine learning approach that is used particularly for anomaly detection. As such it is widely used in online fraud prevention, and it can provide the highest accuracy if the dataset has anomalies. In our dataset the main anomalies were cancelled flights.
Training used a standard 70/30 split: 70 percent of the data was allocated to train our model and the other 30 percent was then used to test the accuracy of our model. A confusion matrix is one of the easiest ways to describe the performance of our model. Our overall accuracy came to 76 percent and the misclassification rate was 23 percent. These results are very promising and can be a strong starting point in approaching the issue of weather-caused delayed and/or cancelled flights in the future.
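As a rough illustration (not the exact code used in this work; feature names are hypothetical), an Isolation Forest can be set up with scikit-learn as follows:

import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

df = pd.read_csv("flights_weather.csv")           # hypothetical merged BTS + NOAA data
features = df[["mean_temp", "mean_humidity", "mean_pressure", "precipitation"]]
labels = df["cancelled"]                          # 1 = cancelled, 0 = flown

# 70/30 split, as described above
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.30, random_state=42)

# contamination of ~2% mirrors the share of cancelled flights in the data
model = IsolationForest(n_estimators=100, contamination=0.02, random_state=42)
model.fit(X_train)

# IsolationForest returns -1 for anomalies; map that to 1 = cancelled
pred = (model.predict(X_test) == -1).astype(int)
print("Accuracy:", (pred == y_test).mean())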
5. CHALLENGES AND DISCUSSION
At this moment our model has decent accuracy; however, changes can be made that may positively impact the accuracy. Some of the things that can be done are:
1. Daily mean weather values are not ideal; the weather can change a few times per day, and since weather is the main feature in the model, real-time airport weather data could change the outcome;
2. Different features can be added to the model, which may result in better accuracy; however, the main focus of this paper was weather impact only.
An interesting point of discussion is the real-world value that this model can have, because passengers can benefit from knowing in advance whether their flight is delayed or canceled. Businesses can also benefit from knowing this in advance. It would be great to see this model utilized in the hospitality industry, particularly the various hospitality businesses surrounding airports. Businesses like restaurants at the terminals may design special offers for customers who are stranded at the airport, offering stranded passengers a place to temporarily stay and relax while enjoying a meal. By creating special offers for passengers in distress, businesses may ease passengers' struggles, making their bad airport experience a little bit better. Businesses can also be
more likely to experience an increase in profits as their customer counts increase. At the moment there is no literature available that has looked into the potential of flight cancellation prediction and its benefits to businesses around airports. This could be a very innovative approach to marketing in the travel and tourism industry that would carry potential benefits to passengers and businesses.
6. FUTURE WORK
While this model presents a good start, there are ways to improve it. I would like to see this model adapted to a particular sector in hospitality, for example restaurants. There are various point of sale (POS) systems, like Oracle's, used in restaurants that could integrate models like ours. Only by integrating this model can we find its real value. In order for any improvements to happen we would first need to know the shortcomings of the model, which can be found if the model is put to a real-world test. Another option would be to use more in-depth weather data, which has the potential to improve the classification accuracy. Approaching this problem with more knowledge of either the flight delay/cancellation or weather field could be another way to get better results, since all of the information gathered was done through research without any previous field knowledge.
REFERENCES
[1] Anon. Data Tools: Local Climatological Data (LCD). National Center for Environmental Information. (2007)
[2] Anon. Understanding the Reporting of Causes of Flight Delays and Cancellations. https://www.bts.gov. (March 2018)
[3] Anon. Data Access. Retrieved May 1, 2019 from https://www.ncdc.noaa.gov/data-access. (2019)
[4] Beth Krajewski. Flying in Convective Weather … And Why You Shouldn't. https://business.weather.com. (September 2015)
[5] Bureau of Transportation Statistics. Reporting Carrier On-Time Performance. Bureau of Transportation Statistics.
[6] Christopher Elliott. Storm Warnings: How Do Airlines Know If It's Safe to Fly in Bad Weather? National Geographic. (November 2013)
[7] David Olson and Ted Philips. Weather, volume cause flight delays on one of the busiest travel days of the year, FAA reports. www.newsday.com. (November 2018)
[8] Federal Aviation Administration. FAA Forecasts Continued Growth in Air Travel. www.faa.gov. (2017)
[9] Federal Aviation Administration, ed. Air Traffic by the Numbers. https://www.faa.gov/air_traffic/by_the_numbers/media/Air_Traffic_by_the_Numbers_2018.pdf. (November 2018)
[10] Federal Aviation Administration. 2017. What is the largest cause of delay in the National Airspace System? (August 2017)
[11] Nathalie Kuhn and Navaneeth Jamadagni. 2017. Application of Machine Learning Algorithms to Predict Flight Arrival Delays. (October 2017)
[12] Ricardo Balduino. Unfriendly Skies: Predicting Flight Cancellations Using Weather Data. Inside Machine Learning. (December 2017)
About the author:
Milos Veres is a Graduate Student at Mercyhurst University.
How do Socioeconomic Factors Affect the Amount of Waste (Garbage) Produced
Heidi Beezub
Mercyhurst University
Erie, PA 16546 USA
hbeezu68@lakers.mercyhurst.edu
ABSTRACT
In this paper, I attempt to show a correlation between socioeconomic factors and the production of waste. The increasing amount of waste (garbage) produced impacts the environment and can create challenges for safe, efficient and effective disposal. This paper looks at waste and socioeconomic data from the United States, the Buffalo, New York region (USA) and the United Kingdom to find socioeconomic factors that are predictors of waste generation.
Keywords
Waste, Garbage, regression.
1. INTRODUCTION
Plastic straws received a tremendous amount of bad press when a video of a straw being pulled out of the nostril of a sea turtle [1] went viral in 2018. But plastic straws are not the only problem; one report predicts the amount of plastic will outweigh fish in the ocean by 2050 [2]. Plastics, specifically single-use plastics, have become a major contributor to the waste stream. The waste stream
is the flow of waste (garbage) through to final disposal. In addition
to plastics, the waste stream includes household, industrial,
construction, hazardous, and medical wastes. Global waste
production is estimated to be 1.3 billion metric tons* (MT) or 1.2 kg/capita/day, with amounts projected to reach 2.4 billion MT or 1.4 kg/capita/day by 2025 [3]. That is a lot of garbage!
(*A U.S. ton, also called a short ton, is equal to 2,000 U.S. pounds; a metric ton is slightly larger than a U.S. ton, converting to 2,204.6 pounds [7].)
As the world’s population increases, the demands on the finite
resources of our terrarium home we call Earth become more and
more taxed. Resources are not just fossil fuels and mineral ores.
Resources include things we take for granted: clean air, clean water,
arable land for food production, and even technology, labor and
time. Although historically technology in the form of the industrial
revolution has increased pollution, technology can be a critical
resource in the conservation of other resources. Technology has
helped clean polluted air from factory smokestacks, provided
renewable energy sources (wind, solar, hydro), and even resulted in
more time by automating manual systems. Advanced economies
have more access to technology than developing economies.
Developing economies produce more pollution and waste during
manufacturing processes than developed economies [4]. As
population increases, people also want a higher standard of living (i.e. everyone wants a washing machine and an air conditioner). The carrying capacity of the Earth
(how many people can be supported) varies among studies. One
study suggests that, based on the resources required for an 'American' standard of living, the Earth's carrying capacity is approximately 1.5 billion people [5]. Most estimates use food production to top out the sustainable population at between 8 and 11 billion people [6]. With a current population of approximately 7.5 billion we are close to these estimates. For both an equitable and sustainable environment and an equitable and sustainable economy we need to consume less.
In order to maintain a sustainable planet environment, we need to
decrease and eliminate the amount of waste produced. Recycling
is a popular go-to option, but only 9% of plastic waste has been
recycled globally [8]. Recent articles are calling for more emphasis
on the circular economy where no waste is produced (i.e.
everything is not only recyclable but also IS recycled or reused). In
addition, decoupling consumption from economic growth is also a
solution, whereby the economy can grow without reliance on
consumerism.
Waste and pollution produced in one part of the globe spills out and (eventually) affects other parts of the world. Air pollution from Indian and Chinese factories circumnavigates the globe and affects cities in the United States and Europe. Approximately 1.3 to 3.5
million MT of plastic waste alone enters the oceans annually due to
China’s lack of infrastructure to dispose of waste properly [8]. Part
of this plastic makes its way to the ‘Great Pacific Garbage Patch.’
The Great Pacific garbage patch, also described as the Pacific trash vortex, is an area in the central North Pacific Ocean; it covers an area approximately twice the size of Texas and contains an estimated 7 million tons of plastic waste. [9] Even more concerning,
microplastics less than five millimeters in length (or about the size
of a sesame seed) [10] and nanoplastics 1000 times smaller than an
algal cell [11] have been entering the food chain. One study found
that 93% of bottled water contained some sort of microplastic. [12].
These tiny plastic particles can be ingested by aquatic life including
plankton and microscopic algae which form the basis for much of
our food chain. [13] The detrimental effects on organisms that
ingest plastic as it moves up the food chain can include intestinal
blockages and toxic effects to both animals and plants. [13] The
fibers from your favorite flannel shirt are polluting the environment
and even the water you drink. [14]
The interconnectedness of our global ecological (and economic) system ensures that what my neighbor does will affect me and what I do will affect my neighbor. My neighbor (and your neighbor) is
anyone who shares this same planet. A recent estimate forecasts
the world population at approximately 9 billion by 2050 [15].
Action is needed to ensure all the inhabitants have equitable use of
resources to prevent global social and political instability.
Many studies have been done to predict the amount of garbage that
is produced from the global to the local level. [16, 17, 18, 19, 20,
21, 22, 23, 24, 25, 26] These studies are focused on predicting the
waste stream so that disposal mechanisms will be in place. Waste
determinants have included such factors as urbanization, household
size, wealth, education, and tourism [27]. Household size, wealth
and urbanization are the greatest predictors of waste amounts in
most models [27]. Most of the models are based on correlation and
regression analysis, very few have used artificial intelligence
systems [27]. Changes in the waste stream have altered the ability
to target waste production to regional populations. Local landfills
have been shut down in favor of larger regional landfills which
accept wastes not only from local and regional municipalities but
also from neighboring states [28].
Regression analysis with stratified sampling was performed for a study in Sri Lanka [24] to determine socioeconomic factors in
waste generation. Waste was collected from specified households,
separated, and weighed to determine the composition percentages
(organic, plastic, paper, glass and metal). This study looked at not
only the total amount of waste produced, but how the composition
of the waste was affected by income. As found in other studies [18,
19, 21, 23, 24, 28, 30], organic waste was the largest component of
the Sri Lanka waste stream. Organic waste was also identified as
the most potentially harmful “in terms of potential to cause
environmental pollution and resource recovery” [24]. Organic
waste creates greenhouse gasses (specifically Methane) and
organic matter can foul water and other eco systems close to
disposal sites. The Sri Lanka study suggested possible feasibility
analysis for diversion of organic waste for composting. The study
focused on waste prediction to identify and build waste
management structures that could be implemented for a rapidly
growing city and region.
Waste generation information can be used to focus waste reduction
efforts toward specific populations and to identify areas where
technological advances/processes are needed to reduce waste. Qi
and Roe used regression and PCA analysis to identify behaviors
and attitudes regarding food waste [29]. The ability to identify
attitudes and behaviors, aids in the development of advertising
campaigns, tax incentives or other methods targeted at changing
behaviors. The data itself can also be used to shock effect. The
sheer large number/volume of what is wasted when presented is
staggering. Where prior studies focus on waste prediction to aid
in waste disposal, my analysis will be focused on determining the
amounts of component waste (plastic, glass, paper, metal, etc.)
generated. Hopefully this information can be used to target
reduction and recycling efforts to decrease waste production. To
get the most effect in waste reduction, the most affluent must
actively work to reduce waste through changes in behavior and
consumption (buying habits). The most affluent (at the ‘top of the
food chain’) need to recognize that inequities in consumption and
pollution/waste generation can affect their own daily lives through
the impact that it creates for the rest of society.
Income and other socioeconomic factors are typically used to
predict waste generation. A 1998 study proposed an alternative
income measure of Total Consumer Expenditures (TCE) [21] as a
better measurement for the actual amount spent annually on
consumer goods. Since some consumer spending does not result in
waste generation, a Relative TCE (RTCE) was developed to
address the actual spending that resulted in waste production from
TCE. This method results in a waste prediction model more closely
tied to consumer spending habits. The study used total data for the
US and the United Kingdom (UK) to predict waste generation for
the US and the European Union (EU) respectively. As with many
waste generation studies, accurate data for specific countries was
not available. The UK spending and waste generation behavior was
assumed to be representative of other EU countries. The RTCE
figures were used to derive composition percentages of plastic, paper, glass, metal and organic wastes. Total waste predictions
were obtained with linear regression. Polynomial equations were
used to develop best fit curves for each waste component to predict
amounts that could be recycled and diverted from landfills.
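A minimal sketch of fitting such a component curve (the data points and the polynomial degree are hypothetical, for illustration only):

import numpy as np

# hypothetical per-capita plastic waste (kg/year) against consumer spending
spending = np.array([1000, 1500, 2000, 2500, 3000])
plastic = np.array([12.0, 15.5, 18.2, 20.1, 21.4])

# fit a degree-2 polynomial best-fit curve, as the study describes
coeffs = np.polyfit(spending, plastic, deg=2)
predict = np.poly1d(coeffs)
print(predict(2750))   # predicted plastic waste at an intermediate spend level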
2. RELEVANT WORK
The prediction of Municipal Solid Waste (MSW) generated is
important to ensure the appropriate disposal and removal methods
are in place. As population and the standard of living increases,
waste (garbage) is yet another aspect of our lives that we expect to
be managed without much additional thought. Recycling to divert
waste from landfills becomes more important as waste continues to
pollute the environment and more environmental protections
restrict safe waste disposal.
Systems Dynamic (SD) modeling has been used to predict waste
generation and the feasibility of a Material Recycling Facility [18]
to divert recyclable material from landfill disposal. SD uses
computer software (specifically Stella®) to simulate system inputs
and flows (identified as stocks, flows and converters). SD
modeling was selected for this San Antonio study due to a small
limited dataset. The simulation information was: total income per
area, people per household, historical waste generation, income per
household, recycling patterns, and population. The SD information
was then used in conjunction with a traditional linear regression
model to predict waste generation and the feasibility of a Material
Recycling Facility. An unusual SD prediction result was a plateau of waste generation as income increased. Other studies have shown increased recycling participation at higher incomes, but the increased recycling does not offset the additional waste generation at increased incomes [18, 33].
Most studies on waste production have used regression analysis techniques based on population and income variables [16, 18, 24, 27, 30, 31]. Information on studies in the US is scarce; waste management practices are fully developed, and waste collection and disposal systems are readily available in all parts of the United States (US). Although some studies have been performed in developed countries, much of the current literature available is from developing countries. Developing economies are interested in empirical information to aid in creating and establishing mechanisms for waste disposal to accommodate growing populations, environmental issues, and health concerns. The factors that contribute to waste generation vary from population to population. [32] Cultural influences, consumption habits, standard of living, and current infrastructure affect attitudes and actions regarding waste generation and disposal (or the reuse and recycling of materials). Even within country borders, the factors that influence waste generation can vary from region to region or from one city to the next. Although regression models are common, as machine learning algorithms have gained more acceptance there have been attempts to use different methodologies for predictions.
The Stella® SD model was applied in a Newark NJ study to project
the effects of recycling on landfill capacity. [17] The SD inputs
were derived from previous studies of factors found to have
significant influence on waste generation. These factors included:
Gross Domestic Product (GDP), infant mortality rates, population
density, household size, life expectancy, and labor force
(agricultural, service or industry). Waste figures had to be adjusted
for total amounts of waste either diverted to other states or accepted
from other states. The consolidation of landfill sites in the US has
made waste prediction more difficult. The study determined that
increased recycling has a long-term result in MORE waste being
sent to landfills. Waste diverted to recycling provides increased
landfill capacity which results in lower landfill costs. The analysis
showed that current recycling policies would not result in
additional recycling or increased economic feasibility of recycling
and concluded that more cost-effective recycling methods are
needed (specifically lower collection costs) to increase recycled
waste.
A recent 2017 study combined Multivariate Linear Regression with
Bayesian Model Averaging (BMA) to develop a better model for
waste prediction. [23] The model developed for Hoi An City,
Vietnam was significantly better than existing models using linear
regression. Like other models, socioeconomic factors were used
for prediction. This study used location (urban vs rural), presence
of a home business, number of people in household, and house area
per person to predict waste generation. Per capita income was not
used as a specific factor in the model due to the difficulty in
obtaining this information in developing areas. Residents are
unwilling to divulge income for fear of increased taxation. The
presence of a home business and general higher wages in urban
areas are good indicators of economic progress. This particular
study was labor intensive. The study included face-to-face
interviews as well as daily collection and weighing of household
waste which was separated as biodegradable (compostable) or
nonbiodegradable. Statistical tests (R2, MRS, RMSE, etc.) were
performed to validate the model.
Although labor intensive studies that break down waste by category
(paper, plastic, glass, etc.) are more difficult to implement, they
provide much more useful information for the prediction of
recycling possibilities and economic feasibility. A study in India
took a slightly different approach using an existing combined
socioeconomic (SES) parameter that included education,
occupation, and family income to divide the population into five
hierarchal groups. [19] The participants separated waste into
biodegradable and non-biodegradable bags. Waste was collected
daily. The non-biodegradable waste was further separated (paper,
plastics, glass, metal, etc.) and each component was weighed.
Unlike prior Indian studies cited, which indicated the highest SES
group generated the most waste, this study indicated the
Medium/Middle SES group generated the most waste. Waste
generation was broken down by component for complete analysis
on a day of the week basis. Several theories on why the highest
income group produced less waste included the use of LP gas for
heating and cooking (no coal ash in waste generation) and eating
outside the home in restaurants. A negative correlation between
family size and per capita waste generation was consistent with
prior studies.
Other approaches to predicting waste generation include Fuzzy
Logic (FL). Data from a prior study in Mexicali, Mexico was
analyzed using FL. [34]
FL can be applied when there are
uncertainties in the information. It works with mixed data types
(quantitative and qualitative) and can be used with sparse data and
missing values. Fuzzy reasoning is used to infer information based
on series of if/then statements. A degree of membership (rather
than a classification in or not in) is expressed by FL. [22] The
Mexico study was predominately interested in the amount of plastic
waste generated per socioeconomic group. This study focused on
recommendations targeted to reduce the amount of plastic waste.
[34] The data included the content of glass, cardboard, and metal
components which were included in the analysis. The study
concluded lower income households produce more plastic
packaging waste due to using smaller package sizes than higher
income households which can buy in larger packages. However,
high income, smaller household size resulted in higher per capita
generation of plastic waste.
Fuzzy Logic can provide a good model when data is sparse and
there are uncertainties. Geospatial data was used for waste
prediction in an Athens suburb. [22] Waste was collected from
centrally located bins (rather than household curbside pickup). The
study was focused on cost reduction to predict the optimal times for
garbage collection. Subject matter experts weighted the factors
used which included real estate values, building density, area size,
electric bills, commercial traffic, and the specific waste bin
locations. Separate inputs to calculate residential vs commercial
waste were then combined to arrive at a single number for a
specified area. A relationship between electricity consumption and
waste generation was indicated. Validation techniques for the
predicted results were not available due to the sparse data. The goal
was to predict when bins would be 90% full to optimize collection
times.
Although Artificial Neural Networks (ANN) were developed in the
1940’s, the technique did not begin to receive acceptance as a
modeling method until the 1980’s [35]. Neural networks attempt
to mimic the way humans process information to produce decisions
based on non-linear information. "ANNs are information-processing algorithms inspired by the way biological nervous
systems make generalizations from similar situations, such as
learning from past experience, and produce decisions out of
incomplete knowledge of states with large inherit complexities and
nonlinearities” [36]. Some studies have used ANN for waste
prediction. A General Regression Neural Network (GRNN) model
outperformed a Back Propagation Neural Network (BP) model in a
study that covered 26 European countries. [26] In addition to being
more accurate, the GRNN model training was significantly faster.
The inner layers of the two neural networks, although both based on the default minimum, were not equal. The BP model had 10 neurons while the GRNN model was based on 84 neurons (the minimum
required for 84 data sets in the training data). Adding additional
Neurons to the BP model to improve performance was not
explored. R2 statistic was used for model validation. Significant
model errors were attributed to uncertainties in data estimations
made for missing values. The model performed better for more
developed countries where data was most complete.
A 2011 study used ANN to predict a 20-year future period of waste
prediction. [20] The authors used a combination of linear regression
and a Multilayer Perceptron (MLP) ANN model for their forecasts.
Although the authors performed various statistical tests to validate
the test results, the data used to predict future waste generation in
the ANN brings uncertainty into the model. In addition, the data
had to be re-scaled for the MLP to be able to handle data at the far
future dates as it was too much out of range from the actual data
used for initial training. The authors solved this by using
logarithms for the data. The scaled data was used for training with
comparable results to the unscaled data.
Support Vector Machines (SVM) can be used for linear or
nonlinear classification and regression tasks [35]. An Iranian study
wanted to develop a model that would generalize well using two cities: Tehran and Mashhad. [25] SVM was used with an additional component of the Wavelet Transform (WT) to pre-process the time series data. WT decomposes the signal into a set of basis functions using a prescribed formula; the resulting sub-signals retain the structure/shape of the series. WT was used to eliminate noise in the time series data used for weekly forecasting with seasonal
variations in waste generation. Although the models produced
provided good performance, the WT process is not easy to
understand. The income and socioeconomic factors used in the
SVM model were not clearly defined. Long term waste predictions
are important for planning future landfill and recycling recovery
operations.
I will use linear regression to analyze the data. The most common
models for waste prediction are correlation and regression analysis
[27]. Other more complicated models (Fuzzy Logic, Artificial
Neural Networks, Support Vector Machines, etc.) have been used
to predict waste generation. The more complicated methods do not
necessarily yield better results and are less easily explained or
examined than regression solutions.
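As a hedged sketch of the planned regression (column names are assumptions, not the actual dataset schema):

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("waste_socioeconomic.csv")   # hypothetical combined dataset

X = sm.add_constant(df[["income", "household_size", "education_years"]])
y = df["waste_per_capita"]

model = sm.OLS(y, X).fit()
print(model.summary())                        # coefficients, R-squared, p-values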
The factor/characteristic I am most interested in as it relates to waste generation is income. Other factors that I would like to explore include education and employment (type of occupation).
My hypothesis is that the amount of waste produced increases with
increased income. Most of the studies I have reviewed indicate
waste increases with wealth. I would further like to explore if there
are differences in the component make-up of the waste generated.
For example: do the component percentage amounts vary based on
socioeconomic factors? Does higher educational background result
in increased generation of paper waste and lower generation of
plastic wastes?
3. PROPOSED SOLUTION
Aggregate/total waste data is available going back to the 1960s for
the US. Waste component data (breakdown of how much glass,
plastic, metal, paper, etc.) is only available from the mid-1990s.
Socioeconomic and demographic information (income, household
size, education, etc.) is available through census data every ten
years. In order to strictly use the census data, the information
would need to be annualized each year between the census dates.
Annual information on birth rates, infant mortality, population,
marriage rates is available (starting from the late 1990s to the mid-2000s, depending on the type of data). The annual information as well as the census data (and/or extrapolated census data) will be
used to determine how these demographic and economic factors
affect the amount of component waste generated (i.e. the amount
of plastic, glass, paper, etc.).
4. EVALUATION METHODOLOGY
Data compilation took a significant amount of time. I was able to
obtain US waste data from 1960, 1970, 1980 and 1990 and annually
from 1991 to 2015 from the Environmental Protection Agency
(EPA) website and the EPA archives. The information came from
the EPA publication titled “Advancing Sustainable Materials
Management: Facts and Figures” (formerly known as
“Characterization of Municipal Solid Waste in the United States”).
The archive (under the older title) was an unexpected source of data that included material breakdowns in paper/PDF copies of reports. The amount of waste generated continues to increase with time; generation increased rapidly from the 1960s to the 1990s, with slower increases after the 1990s. Much of the focus today is on
plastics. The amount of plastic waste nearly tripled from the year
1980 (6,830 thousand tons) to the year 1990 (17,130 thousand
tons). By comparison, the next largest increase was wood waste
7,010 thousand tons in the year 1980 compared to 12,210 thousand
tons in the year 1990. Annualized figures for intervening years
between 1960 and 1990 were extrapolated by evenly indexing the
figures. With additional resources (time) a regression model could
be developed that could reflect incrementally increasing figures for
intervening years.
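As an illustration of the even indexing described above, the short sketch below fills the intervening years linearly with pandas; the column name is hypothetical, while the endpoint values are the plastic figures quoted above.

```python
# Minimal sketch of annualizing decennial figures by even division
# (assumes pandas); 'plastic_kt' is a hypothetical column name.
import pandas as pd

waste = pd.DataFrame({"plastic_kt": [6830.0, 17130.0]},
                     index=pd.Index([1980, 1990], name="year"))

# Reindex to every intervening year, then fill the gaps linearly so
# each year receives an equal share of the decade's total change.
annual = waste.reindex(range(1980, 1991)).interpolate(method="linear")
print(annual.loc[1985, "plastic_kt"])  # 11980.0, the midpoint
```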
Regional data is available for the United Kingdom (UK), with total waste generation, component information, and census data (by region). US data on waste components is not centrally collected and is not as easily available. There is a dataset for Buffalo, NY, which includes waste component breakdowns (plastic, paper, etc.); US census data and annual data on birth rates and population are available for Buffalo socioeconomic and demographic information.
I will obtain my data by downloading the UK waste data set and
combining this with the UK census data for socioeconomic
information such as income, education, and household size. I will
need to also look at additional factors due to the ten-year limit with
census data; other annually available information such as birthrates,
population (and other information I can obtain such as marriage and
divorce rates) will be added into the data. Once I have a
‘completed’ dataset I will then either create a ‘copy’ or add
additional columns to annualize the census data between census
years. I plan on simple equal division among the intervening years
(unless I find documentation of a different trend). I will be
able to analyze the data both with and without the modified/added
yearly figures. I will use this same process for the Buffalo, NY
information combining the waste data with census socioeconomic
and demographic information. If time and data permit, I would also like to locate data for an additional country. These datasets would be analyzed individually and then combined into a larger overall dataset, giving me more data to work with and allowing additional analysis of whether similar or different patterns emerge between the smaller datasets.
Buffalo waste data was obtained from the ‘Open Data for Buffalo’
website. The city of Buffalo makes data openly and publicly
available. The city benefits from open source data for continuous
improvement of city services and to help improve the city’s
livability. This was my smallest span of data, available from 2010 through 2017. The Buffalo waste data increases from 2010 to 2015; however, nine of the 12 features either decreased or stayed the same in 2016, and six of the 12 features decreased or stayed the same in 2017.
Waste data for the UK came from ‘Find Open Data’ a UK
government site with data and links to data. UK waste data was
available from 1997 to 2010. Data available for later years was
aggregate and not broken down into as many individual waste
features. The only ‘spike’ noted in the UK data is Co-mingled
recycling, which went from zero in the year 2000 to 226 tons in 2001. There were corresponding marked decreases in the 'Cans' and 'Other Recycling' categories in 2001. Increases are seen each year from 1997 to 2008. Five of the 10 features decreased in 2009, and four waste features decreased in 2010 from the prior year.
Census data for the US and the UK is openly available. However, downloading the data was a tedious process. Although much socioeconomic data is available, it does not align easily from one census to the next. The information collected (and how it is broken down) has changed, sometimes significantly, over time. For example, "ages 20-24" was later broken down into "age 20", "age 21", and "ages 22-24", and category names changed form (e.g., "Male>>22to24 years" versus "Male: 22 to 24 years"). Other socioeconomic characteristics, such as types of employment, have changed dramatically and are not easily matched from year to year. I did not have an effective automatic way to line the data up, so this was done manually. Some parameters were more difficult to align than others, depending on how much had changed from the preceding census. Where possible, estimates were determined for additional breakdowns, or data was merged to be consistent.
Data was obtained from various sources. The American Community Survey (ACS) has easily downloadable data from 2005 through 2017. This data is updated yearly and can be extracted for multiple items in a single download by using pre-formatted tables. Although the ACS is part of the US Census website, the actual US Census data is not as easy to extract. Data prior to the 1990 census is not available electronically; although the 1980 census was the first with information stored on computers (magnetic tape), the census website only has paper (PDF) files for the 1980 and earlier census years.
The National Historical Geographic Information System (NHGIS)
website provides electronically downloadable information for all
US census data. Kudos to those that manually input this
information from the paper Census reports! The information can
be downloaded for multiple years and multiple parameters;
however, it quickly becomes unwieldy. The information collected
(and how it is broken down) has changed, sometimes significantly,
over time (as noted). I needed to align each year and factor for consistency. Again, some parameters were more difficult to align than others.
[Figure 1: UK Total Waste Production by category (Cans, Co-mingled, Compost, Glass), 1997-2009]
Like the US Census website, the UK Census website made downloading data difficult. The Nomis website (run by the University of Durham on behalf of the Office for National Statistics) provided easily downloadable information, although with limitations similar to the NHGIS website: changing statistical definitions and restrictions on the ability to download information on an annual basis.
Downloading and formatting additional data could have continued for another year, since the maximum amount of data was desired for analysis. In the interest of obtaining results, however, data gathering was finally halted.
Finding relevant factors.
Scikit-learn Random Forest Regressor was used to obtain the top
socioeconomic predictor for each waste factor for each data area.
A scatter plot with a best fit linear regression line was created for
each top predictor and the corresponding waste feature.
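A minimal sketch of this two-step procedure is shown below, assuming scikit-learn and pandas; the feature names and values are hypothetical stand-ins for the census-derived columns.

```python
# Minimal sketch: rank socioeconomic features with a random forest,
# then fit the best-fit line for the top predictor (assumes sklearn).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(50, 3)),
                 columns=["median_income", "household_size", "pct_degree"])
y = 2.0 * X["median_income"] + rng.normal(0, 0.5, 50)  # toy waste feature

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
top = X.columns[np.argmax(forest.feature_importances_)]

# Best-fit regression line used for the scatter plot of the top pair.
line = LinearRegression().fit(X[[top]], y)
print(top, line.coef_[0], line.intercept_)
```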
5. DISCUSSION
Overall, I did not find strong correlations between socioeconomic factors and waste generation, but there were still trends evident in the data. The data can be divided into three datasets.
USA – Although correlations were low, the US data produced the most expected results, with the most influential predictors related to household size, income, and education. The US data included 17 waste features and 499 socioeconomic features for the years 1960 through 2017. Table 1 lists the waste features and top predictors for the US.
UK – The UK data offered the most promise of finding patterns. This data included 10 waste features and 138 associated socioeconomic features for the years 1997 to 2010. Instead of steady increases each year, the UK data showed decreases in the total waste features during the last two years, as shown in the graph in Figure 1.
Figure 1: Total trend of UK waste
The top predictive socioeconomic features for the UK included:
population
population by age and sex
household composition
education (qualifications)
The UK data was sorted by region, waste feature and
socioeconomic feature to look for any similarities or patterns in the
results. The three predominant socioeconomic features were:
‘total population’ (23 waste-region pairs)
‘mean age’ (16 waste-region pairs)
‘males aged 18 – 24’ (13 waste-region pairs).
When the total UK was considered (all 9 regions), 'one family and no others - lone parent households - all children non-dependent' was a top predictor for 6 waste features. Tables 3a-3d list the waste features and top predictors for the UK.
Buffalo – The Buffalo data included 12 waste features and 453 socioeconomic features covering the years 2010 through 2017. This was the smallest dataset, with only 8 years of data. Similar to the UK data, there was a decrease in waste during the last two years of the Buffalo data. Table 2 lists the waste features and top predictors for Buffalo.
6. RESULTS
Although my goal was to find a relationship between socioeconomic factors (specifically income) and waste generation, I did not find a statistically significant connection between these factors. The Random Forest Regressor was used to identify the most influential socioeconomic predictor for each waste factor; Tables 1-3d list the features identified for each dataset.
After obtaining these features, linear regression was used to attempt to predict waste generation based on them. A training/test split of 85% training and 15% test performed better for the large US dataset than a 30% test set, while a 30% test set performed better for the Buffalo and UK data. This could be due to the smaller size of those datasets and their lower correlations. Performance measures of R-squared (R2), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE) were used to measure linear regression performance. Overall, the measures showed low model performance: R2 scores ranged from .07 to .99, RMSE ranged from 1.7 to 41637.5, and MAE values ranged from 1.5 to 24022.0 (a sketch of these metric computations follows below).
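For reference, the three measures can be computed as follows with scikit-learn; the arrays here are hypothetical.

```python
# Minimal sketch of the three performance measures (assumes sklearn).
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([120.0, 135.0, 150.0, 160.0])  # hypothetical values
y_pred = np.array([118.0, 140.0, 149.0, 158.0])

r2 = r2_score(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # root of the MSE
mae = mean_absolute_error(y_true, y_pred)
print(f"R2={r2:.3f} RMSE={rmse:.2f} MAE={mae:.2f}")
```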
7. CONCLUSIONS AND FUTURE WORK
The best results were obtained from the US data, which was the largest dataset. If more complete data could be obtained, correlations between waste and socioeconomic features could be better identified. In addition to gathering additional data, future work could include looking at more than one top waste predictor. As noted, linear regression could be used to extrapolate intervening data between census years, and using fewer waste features may help to reduce noise and improve performance.
8. ACKNOWLEDGMENTS
My thanks to my proofreader and husband for reviewing many drafts, and to Shraddha Dubey, who helped me organize my thoughts. Special thanks to Ron Richardson for his patience and help while I was learning to program in Python, R, and SQL. I'd also like to thank my Mercyhurst cohorts; I have drawn from their youthful reserves of energy and optimism.
9. List of Figures/Tables
Figure 1: Total trend of UK waste
Tables 1 through 3d (in attached Appendix):
Table 1: US geographic area waste and top socioeconomic predictor
Table 2: Buffalo geographic area waste and top socioeconomic predictor
Table 3a: UK regions East, East Midlands, London, and North East waste and top socioeconomic predictor
Table 3b: UK regions North West, South East, and South West waste and top socioeconomic predictor
Table 3c: UK regions West Midlands and 'Yorkshire and the Humber' waste and top socioeconomic predictor
Table 3d: UK region (Total UK) waste and top socioeconomic predictor
10. REFERENCES/BIBLIOGRAPHY
[13] YouTube video: Sea Turtle with Straw up its Nostril - "NO" TO PLASTIC STRAWS. 2015. Retrieved October 18, 2018 from https://www.youtube.com/watch?v=4wH878t78bw.
[14] The New Plastics Economy — Rethinking the Future of Plastics. 2016. World Economic Forum, Ellen MacArthur Foundation and McKinsey & Company. https://www.ellenmacarthurfoundation.org/assets/downloads/EllenMacArthurFoundation_TheNewPlasticsEconomy_Pages.pdf.
[15] D. Hoornweg and P. Bhada-Tata. 2012. What a Waste: A Global Review of Solid Waste Management. The World Bank, Urban Development Series Knowledge Papers, no. 15. https://siteresources.worldbank.org/INTURBANDEVELOPMENT/Resources/336387-1334852610766/What_a_Waste2012_Final.pdf.
[16] D. Hoornweg, P. Bhada-Tata, and C. Kennedy. 2015. Peak Waste: When Is It Likely to Occur? Journal of Industrial Ecology, 19, 117-128. DOI: https://onlinelibrary.wiley.com/doi/abs/10.1111/jiec.12165.
[17] Andrew D. Hwang. 2018. 7.5 billion and counting: How many humans can the Earth support? (July 30, 2018). Retrieved October 19, 2018 from https://theconversation.com/7-5-billion-and-counting-how-many-humans-can-the-earth-support-98797.
[18] Bruce Pengra. 2012. One Planet, How Many People? A Review of Earth's Carrying Capacity. UNEP Global Environmental Alert Service (GEAS). https://na.unep.net/geas/archive/pdfs/geas_jun_12_carrying_capacity.pdf.
[19] Ton vs. Tonne: What's the Difference? Retrieved October 21, 2018 from https://writingexplained.org/ton-vs-tonnes-difference.
[20] A. L. Brooks, S. Wang, and J. R. Jambeck. 2018. The Chinese import ban and its impact on global plastic waste trade. Science Advances, 4, 6, Article eaat1313 (Jun. 2018), 7 pages. DOI: 10.1126/sciadv.aat1313.
[21] L. Lebreton, B. Slat, F. Ferrari, B. Sainte-Rose, J. Aitken, R. Marthouse, S. Hajbane, S. Cunsolo, A. Schwarz, A. Levivier, K. Noble, P. Debeljak, H. Maral, R. Schoeneich-Argent, R. Brambini, and J. Reisser. 2018. Evidence that the Great Pacific Garbage Patch is rapidly accumulating plastic. Scientific Reports, 8, Article 4666 (2018). DOI: 10.1038/s41598-018-22939-w.
[22] NOAA. What are microplastics? Retrieved December 10, 2018 from the National Ocean Service website, https://oceanservice.noaa.gov/facts/microplastics.html, 6/25/18.
[23] Wageningen University & Research. Microplastics & Nanoplastics. Retrieved from https://www.wur.nl/en/Dossiers/file/Microplastics-and-Nanoplastics.htm.
[24] David Common and Eric Szeto. 2018. Microplastics found in 93% of bottled water tested in global study. (April 2018). Retrieved December 10, 2018 from https://www.cbc.ca/news/technology/bottled-water-microplastics-1.4575045.
[25] Betty Staugler. 2017. Microplastics – What's the big deal? (January 2017). Retrieved December 10, 2018 from http://blogs.ifas.ufl.edu/charlotteco/2017/01/26/microplastics-are-a-major-concern/.
[26] Jay Sinha. 2018. Life Without Plastic. Guest Lecture, held at
Tom Ridge Environmental Center, Erie, PA on October 5,
2018.
[27] TED Radio Hour, NPR. 2016. Marcel Dicke: Are Insects The Future Of Food? (Nov. 2016). Retrieved October 19, 2018 from https://www.npr.org/programs/ted-radio-hour/?showDate=2018-10-19.
[28] S. Lebersorger, P. Beigl. 2011. Municipal solid waste
generation in municipalities: quantifying impacts of
household structure, commercial waste and domestic fuel.
Waste Management, 31 (Sep 2011), 1907-1915. DOI:
https://doi.org/10.1016/j.wasman.2011.05.016.
[29] N. Kollikkathara, H. Feng, and D. Yu. 2010. A system dynamic modeling approach for evaluating municipal solid waste generation, landfill capacity and related cost management issues. Waste Management, 30 (Jun 2010), 2194-2203. DOI: 10.1016/j.wasman.2010.05.012.
[30] B. Dyson and N. Chang. 2005. Forecasting municipal solid waste generation in a fast-growing urban region with system dynamics modeling. Waste Management, 25 (Jan 2005), 669-679. DOI: 10.1016/j.wasman.2004.10.005.
[31] D. Khan, A. Kumar, and S.R. Samadder. 2016. Impact of socioeconomic status on municipal solid waste generation rate. Waste Management, 49 (Mar 2016), 15-25. DOI: http://dx.doi.org/10.1016/j.wasman.2016.01.019.
[32] D. Antanasijević, V. Pocajt, I. Popović, N. Redžić, and M. Ristić. 2012. Long term forecasting of solid waste generation by the artificial neural networks. Environmental Progress & Sustainable Energy, 31, 4 (Dec 2012), 628-636. DOI: 10.1002/ep.10591.
[33] E. Daskalopoulos, O. Badr, and S.D. Probert. 1998. Municipal solid waste: a prediction methodology for the generation rate and composition in the European Union countries and the United States of America. Resources, Conservation and Recycling, 24 (Nov 1998), 155-166. DOI: https://doi.org/10.1016/S0921-3449(98)00032-9.
[34] N.V. Karadimas, V. Loumos, and A. Orsoni. 2006. Municipal solid waste generation modelling based on fuzzy logic. In Proceedings of the 20th European Conference on Modelling and Simulation. Bonn, Sankt Augustin, Germany (May 2006). DOI: https://doi.org/10.7148/2006-0309.
[35] M. G. Hoang, T. Fujiwara, S. T. Pham Phu, and K. T. Nguyen Thi. 2017. Predicting waste generation using Bayesian model averaging. Global Journal of Environmental Science and Management, 3 (Sep 2017), 385-402. DOI: 10.22034/GJESM.2017.03.04.005.
[36] N.J.G.J. Bandara, J.P.A. Hettiaratchi, S.C. Wirasinghe, and S. Pilapiiya. 2007. Relation of waste generation and composition to socio-economic factors: a case study. Environmental Monitoring and Assessment, 135 (Dec 2007), 31-39. DOI: 10.1007/s10661-007-9705-3.
[37] M. Abbasi, M. Abdoli, M. Abdoli, B. Omidvar, and A. Baghvand. 2014. Results uncertainty of support vector machine and hybrid of wavelet transform-support vector machine models for solid waste generation forecasting. Environmental Progress & Sustainable Energy, 33, 1 (Apr 2014), 220-228. DOI: 10.1002/ep.11747.
[38] D. Antanasijević, V. Pocajt, I. Popović, N. Redžić, and M. Ristić. 2013. The forecasting of municipal waste generation using artificial neural networks and sustainability indicators. Sustainability Science, 8 (Apr 2013), 37-46. DOI: 10.1007/s11625-012-0161-9.
[39] K.A. Kolekar, T. Hazra, and S.N. Chakrabarty. 2016. A Review on Prediction of Municipal Solid Waste Generation Models. Procedia Environmental Sciences, 35 (2016), 238-244. DOI: 10.1016/j.proenv.2016.07.087.
[40] Matthew Thomas Clement. 2009. A Basic Accounting of Variation in Municipal Solid-Waste Generation at the County Level in Texas, 2006: Groundwork for Applying Metabolic-Rift Theory to Waste Generation. Rural Sociology, 74, 3 (Sep. 2009), 412-429. DOI: https://doi.org/10.1526/003601109789037196.
[41] Danyi Qi and Brian E. Roe. 2016. Household Food Waste: Multivariate Regression and Principal Components Analyses of Awareness and Attitudes among U.S. Consumers. PLoS One, 11, 7 (Jul. 2016), 1-19. DOI: https://doi.org/10.1371/journal.pone.0159250.
[42] D. Hockett, D.J. Lober, and K. Pilgrim. 1995. Determinants of Per Capita Municipal Solid Waste Generation in the Southeastern United States. Journal of Environmental Management, 45 (1995), 205-217.
[43] O. O. Samuel. 2015. Socio-Economic Correlates of Household Solid Waste Generation: Evidence from Lagos Metropolis, Nigeria. Management Research and Practice, Research Centre in Public Administration and Public Services, 7, 1 (Mar. 2015), 44-54.
[44] H. K. Ozcan, S. Y. Guvenc, L. Guvenc, and G. Demir. 2016. Municipal Solid Waste Characterization according to Different Income Levels: A Case Study. Sustainability, 8, 1044 (Oct 2016), 1-11. DOI: https://doi.org/10.3390/su8101044.
[45] Matheus Bueno and Marica Valente. 2018. The Effects of Pricing Waste Generation: A Synthetic Control Approach. Discussion Papers of DIW Berlin 1737. DIW Berlin, German Institute for Economic Research, Berlin, DE.
[46] G. Lozano-Olvera, S. Ojeda-Benítez, J. Castro-Rodríguez, M. Bravo-Zanoguera, and A. Rodríguez-Díaz. 2008. Identification of waste packaging profiles using fuzzy logic. Resources, Conservation, and Recycling, 52 (Jul 2008), 1022-1030. DOI: 10.1016/j.resconrec.2008.03.008.
[47] Aurélien Géron. 2017. Hands-On Machine Learning with Scikit-Learn & TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly, Sebastopol, CA.
[48] S. Bayar, I. Demir, and G. Engin. 2009. Modeling leaching behavior of solidified wastes using back-propagation neural networks. Ecotoxicology and Environmental Safety, 72, 3 (Mar. 2009), 843-850. DOI: https://doi.org/10.1016/j.ecoenv.2007.10.019.
11. DATA SOURCES
[49] Buffalo socioeconomic information, American Community Survey (ACS), https://www.census.gov/programs-surveys/acs/data.html.
[50] Buffalo waste information, Open Data for all of Buffalo, Monthly Recycling and Waste Collection Statistics, https://data.buffalony.gov/Quality-of-Life/Monthly-Recycling-and-Waste-Collection-Statistics/2cjd-uvx7/data.
[51] Buffalo socioeconomic information, https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=bkmk.
[52] Buffalo socioeconomic information, https://health.data.ny.gov/browse?limitTo=datasets&tags=vital+statistics&utf8=%E2%9C%93.
[53] US waste information, EPA, Municipal Solid Waste in the United States: Facts and Figures (archive 1995-2012), https://archive.epa.gov/epawaste/nonhaz/municipal/web/html/msw99.html.
[54] UK socioeconomic information, Office for National Statistics, www.nomisweb.co.uk.
[55] US Census information, https://www.census.gov.
[56] Office for National Statistics; National Records of Scotland; Northern Ireland Statistics and Research Agency. 2017. 2011 Census aggregate data. UK Data Service (Edition: February 2017). DOI: http://dx.doi.org/10.5257/census/aggregate-2011-2. https://census.ukdataservice.ac.uk/.
[57] Household recycling by material and region, England, https://data.gov.uk/dataset/c9a3d775-6e00-4b8f-9f80-7f28fea7d944/household-recycling-by-material-and-region-england.
12. MACHINE LEARNING AND DATA MANIPULATION
[58] Wes McKinney. 2010. Data Structures for Statistical Computing in Python. Proceedings of the 9th Python in Science Conference (2010), 51-56.
[59] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12 (2011), 2825-2830.
[60] John D. Hunter. 2007. Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering, 9 (2007), 90-95. DOI: 10.1109/MCSE.2007.55.
About the author:
Heidi L. Beezub is a graduate student at Mercyhurst University. Previously, she held several positions with STERIS Corporation, including six years as a Contract Administrator and seven years as an Incentive Analyst, and then worked for four years as an Inside Sales Specialist for SPX Corporation. She entered the Data Science program at Mercyhurst to gain the programming skills needed to return to work as an analyst. Heidi's undergraduate degree is a BA in Business Administration from Mercyhurst. In addition, she holds a Secondary Education teaching certification from Edinboro University.
Using Stock Market Data to Evaluate Genetic Algorithm
Performance
William Fisher
Department of Computing and Information Science
Mercyhurst University, Erie, PA
wfishe96@lakers.mercyhurst.edu
ABSTRACT:
Stock market prediction is a particularly interesting problem because the stock market is widely regarded as very hard to forecast. One of the reasons the market is tough to predict is the multitude of variables and their interconnectivity. It therefore makes sense to use a feature selection algorithm to increase prediction accuracy. The following paper attempts to accurately predict whether the market will have a positive day by combining a Support Vector Machine (SVM) and an Artificial Neural Network (ANN) with a genetic algorithm. The results show that it is plausible that adding a genetic algorithm to the feature selection phase will increase accuracy.
Researchers have already explored this topic using a number of methods. Among topical papers, the two most prominently used algorithms are the Support Vector Machine (SVM) and the Artificial Neural Network (ANN). Most of the papers show that machine learning algorithms can be used to predict stock market prices. Some of the papers also used feature selection algorithms; genetic algorithms and various component analyses are the most popular techniques. Feature selection techniques have been shown to produce superior results to feeding the prediction algorithm all the relevant features.
1. INTRODUCTION
Stock markets are some of the most lucrative investment vehicles in the world. Countries across the globe offer stock markets as a way for people to invest in the future prosperity of publicly traded companies. Stock markets are composed of individual stocks that work on the principles of supply and demand. If a company is doing well, the theory is that more people will want to buy shares of that company, which increases demand and thus the price of a share. The reverse is true as well: if a company delivers poor performances, people will sell their shares, which increases the supply of shares in the marketplace and drives share prices down.
Because of the wealth that stock markets can create, both institutions and individuals have long tried to create systems that maximize profits. Systems are "a group of specific parameters that combine to create buy and sell signals for a given security" [18]. Some examples of trading systems include Pairs Trading, Trend or Countertrend Following, and News Related Positioning. As an example of how a system works, a Pairs Trade involves taking two stocks that are highly correlated and looking for times when one of the pair components is lagging. The trader would then place buy orders on the lagging company or sell the rising one [19]. The system that this paper is concerned with takes the predictions of the machine learning process and turns them into profitable trades.
One of the reasons predicting the market is so hard is the number of variables that can affect market prices. These factors can range from individual company performances, to sector-specific information, to market-wide dynamics, to macroeconomic intricacies. Some of the variables that can be extracted from individual stocks include simple features such as price, time of year, and earnings, or more complicated, generated features such as moving averages and oscillators. Sector-specific information includes whether or not tariffs are affecting the industry or whether a stock is cyclical and only performs well at certain times. Examples of economic data are GDP growth and unemployment numbers. Due to the vastness of possible features, a good place to start when trying to predict market movements is to narrow down the difference between noise and legitimate signals.
There are a few key areas in this space that have not been explored to their fullest potential. First, there are so many variables that can be included in the study of the stock market that it is nearly impossible to test all of them and their effectiveness. In particular, a number of data points in the economic reports have not been utilized. Additionally, earnings data can cause huge moves in stock price and is an inside look at how companies feel about past and future performance. Markets are very interconnected with data from around the world, so, while it might not be obvious, economic data from one country can affect markets in seemingly unrelated countries. Features and the feature selection process are extremely important to model accuracy, which makes feature selection a vital step in creating a successful model. My model will focus on this untapped data, extracting the superior features by using a genetic algorithm for feature selection and pairing it with an SVM and an ANN. I will then compare its results with the standalone algorithms as well as a "buy and hold" method.
This research is relevant for a couple of reasons. First, it illustrates to the data science community the importance of tracking down and retrieving the best possible data, even when it is not available in obvious places or datasets. Second, the paper demonstrates the effect that feature selection algorithms, when given a diverse set of features, can have on the accuracy of a model. Finally, the paper shows both the financial and machine learning communities that, despite the perceived randomness, markets have a predictable component.
2. RELEVANT WORK
Using machine learning to predict the stock market is a feat
that data scientists continue to strive for. The stock market
represents a complex problem with many variables that
machine learning is perfectly suited for. To tackle the
problem, past work must first be considered. In particular, three specific areas need to be thoroughly understood. In this related work section, I will focus on
the stock market, algorithms that were previously used to try
and predict the market, and feature selection algorithms that
can help narrow down the features for market predictions.
All of these areas represent important pieces of knowledge
when attempting to build a successful model in this space.
Stock market domain knowledge gives a solid base for
where to start when coming up with creative features and
theories to test with various models. Algorithms that were
used in previous papers shed light on what types of models
were successful. This can offer both a learning experience
and can save time when looking for a prediction algorithm.
There are thousands of possible features to select from.
Some can be obvious and might be common among other
papers, but using previously successful feature selection
algorithms can lead to undiscovered correlations that can
result in successful model building. All three of these
components are necessary for understanding the subject
matter and crafting a successful model.
2A) Technical Background: The Stock Market
Understanding the stock market is vital to producing a
machine learning algorithm that can predict future
fluctuations in the market. There are many works on the
subject in academia. Many of the works display various
characteristics of the market that can be stored as knowledge
and used when constructing a workable model.
For hundreds of years, those who participated in the stock
market have used many techniques to try and maximize their gains. One such technique is technical analysis.
Technical analysis is the process of finding repeatable
patterns that have price prediction capabilities [16].
Technical analysis patterns can be simple price trends or
complex chart figures. Some of the most common
techniques are moving average evaluations, identification of
support and resistance, and reading momentum indicators
[10]. Moving averages are simply the average of a stock's price over a designated time period [10]. Support and
resistance are price barriers that many stocks adhere to due
to trading psychology and the role of automated trading set
to buy and sell at certain price points [12]. Momentum
indicators are those that show a stock's current trend, either
up or down, and whether or not it might continue to move in
that direction. These and many other indicators are used to
take the uncertainty out of future movements.
Another key area is the fundamentals. Fundamentals are
supposed to be the underlying factors of stock prices or the
intrinsic value. Fundamentals are used to calculate a stock’s
value to the market. Some examples of commonly used
fundamentals include a company’s profits, revenue, or debt
[2]. The fundamentals of a stock can vary from sector to
sector because what is an important measure in one stock
might not be important in another. For instance, airline
stocks rely on different sources of revenue including revenue
from passengers and revenue from cargo [1]. Contrarily, an
oil company’s price might be contingent on the number of
oil rigs they are currently operating and their outputs [14].
Due to many traders’ beliefs in fundamentals dictating
prices, it is important to have a basic understanding of them
and how stocks react to them.
The stock market does not operate in a vacuum. The
fundamentals previously mentioned are decided by more
than just a company’s performance. Many macroeconomic
forces play a role in the construction of the fundamentals.
With the interconnectivity of markets around the world,
factors such as interest rates, bond prices, and inflation all
serve a role in a stock’s price [3]. Besides those mentioned,
there are many more economic factors that are at play in the
stock market. Machine learning can be used to sort through these factors and find those most predictive of price movement.
2B) Technical Background: Algorithms Previously Used
for Market Predictions
The stock market can behave unpredictably and is constantly evolving. Certain algorithms therefore perform better at certain times, and an algorithm that was more accurate historically will not necessarily be the best going forward.
That said, there is merit in looking at the algorithms that have
been used in past research and examining their results.
2B.1) ANNs
A popular method for market prediction is the use of Artificial Neural Networks (ANNs). Fernández-Rodríguez et al. constructed an ANN that takes in nine input values, the returns of the previous nine days [5]. Their model has one hidden layer with four units and one output layer that produces a number between -1 and 1 [5]. When the output is positive it is a "buy" signal, and when it is negative it is a "sell" signal [5]. Fernández-Rodríguez et al. found that their model worked best in bear and stable markets, but was outperformed by a buy-and-hold strategy in bull markets [5].
Another author, Ticknor, also uses ANNs to predict stock
prices. Ticknor also uses nine input variables and specifies
that he uses the opening price, closing price, and high of the
day among other attributes [13]. His model contains three
layers. What makes Ticknor’s model unique is that instead
of the output being binary, the output is a predicted stock
price for the next day [13]. Ticknor finds that his model is
accurate compared to models in a previous paper written by
Hassan et al [13].
Finally, Vanstone and Finnie also wrote about ANNs. The
authors use a model that takes a set of 13 inputs of various
technical stock data points [15]. The output for their model
was the high that occurred in the next 20 days, roughly one
trading month [15]. If a monthly high occurred on day five
of the 20-day period, that day’s price would be the predicted
output [15]. Using the outputs, the authors created trading
rules that they showed, over time, are more profitable than
buying and holding stocks [15].
2B.3) SVMs
SVMs appear to be the most popular method among stock
market prediction papers. In his paper, Kim uses an SVM to
predict the prices of Korean stocks [8]. He uses 12 technical
indicators as input variables and predicts whether or not the
market will go up or down on a daily basis with one being a
move to the positive side and zero being negative [8]. Kim’s
study focuses on whether or not an SVM based model will
outperform a back-propagation ANN and a case-based
reasoning algorithm [8]. Kim found that the SVM
outperformed the other two models [8].
Huang et al. also use SVMs as the primary algorithm of their paper [6]. The authors test the effectiveness of SVMs against Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and Elman Backpropagation Neural Network (EBNN) algorithms [6]. Similarly to Kim, Huang et al. find that the SVM classifiers perform better than the rest of the algorithms [6]. The authors theorize that SVMs outperform the other models because of the nature of the algorithm and its propensity to avoid overfitting [6]. The authors also find that a combining method, in which the SVM is paired with the various other algorithms, performs even better than the SVM by itself [6].
Another author who uses an SVM approach to predict the stock market is Lahmiri, who focuses on a comparison between Probabilistic Neural Networks (PNN) and SVMs [9]. Lahmiri uses technical and macroeconomic variables to predict daily stock movements [9] and also explores a combination of the two methods [9]. The paper shows the best results were obtained by using an SVM with the macroeconomic data as an input [9].
Li et al. also tested the SVM algorithm against a number of different algorithms and once again found that the SVM was the most accurate [11]. Li et al. tested the algorithm against the extreme learning machine (ELM) algorithm and various versions of neural networks [11]. Notably, Li et al. also discovered that, in comparison to regular back-propagation neural networks (BPNN) and SVMs, the ELMs were faster when training and testing on the same data [11].
2C) Technical Background: Algorithms for Feature Selection
2C.1) Genetic Algorithms
Many stock market prediction papers use genetic algorithms (GA) when looking for an appropriate model. One such paper, by Kim and Han, uses a GA combined with an ANN [7]. In their paper, the role of the GA is to extract the best features and the best weights for the ANN [7]. The study concludes that the GA-ANN combination performed better than either model by itself [7]. Kim has another paper in which he explores the effects of two similar GAs [8]. In this paper, Kim finds that a GA model that continuously updates beats a similar GA that does not update [8].
Choudhry and Garg also use a hybrid model that utilizes a GA [4]. Choudhry and Garg use a GA and an SVM on highly correlated stock pairs to try to predict future prices [4]. The GA is used to select the best features from an original list of 35 [4]. The authors' experiment compares the results of the GA-SVM model with a regular SVM that uses all 35 inputs [4]. The comparison shows that the model using inputs narrowed down by the GA produces superior results to the SVM that uses all 35 input features [4].
3. PROPOSED SOLUTION
3A) Rationale:
3A.1) SVMs:
There are many reasons for using SVMs as one of the test algorithms. First, many other studies of stock market prediction also use the SVM algorithm [8] [6] [9] [11], which makes comparisons and readability easier for the audience. Second, SVMs do an adequate job of avoiding overfitting [6]. This is especially important for the stock market because its swings can be unpredictable at times. Finally, SVMs work well with non-linear problems [20]. Again, this is important for a problem involving the stock market, since market movements are rarely linear.
3A.2) ANNs:
ANNs are often used for complicated, hard-to-model problems and have been a favorite of many researchers studying stock market predictions. Like SVMs, they have been used for many financial analysis problems [5] [13] [15]. They are good for non-linear problems and good at generalizing, meaning they also sufficiently avoid overfitting [21]. Finally, ANNs handle heteroskedasticity well [21]. Highly volatile data and non-constant variance are staples of financial analysis, which makes ANNs a good algorithm choice for this problem.
3A.3) GAs:
Genetic algorithms are inspired by nature and mimic the natural evolution process. They allow users to search and traverse the space of possible solutions in an efficient way. Additionally, GAs suit stock market analysis and this particular problem because they are easy to program and proven to find features that achieve optimal results [21] [4] [7] [8]. They are useful in this space, in particular, because many features need to be streamlined to find the best possible combination.
3A.4) Features:
There are multiple components to predicting a stock price, and each decision can have profound impacts down the line. For the benefit of traders and money managers, stocks can be bought and sold in groups called Exchange Traded Funds (ETFs). ETFs usually represent a segment of stocks. For instance, the XLF ETF represents stocks that fall under the financial sector description, and the QQQ ETF represents stocks that fall under the technology category.
One prominent way to try to predict the movement of ETFs is to trade funds that are highly correlated. While the stock market is known as being unpredictable, stocks rarely move in isolation. Due to the interconnectivity of the global economy, sectors often have a snowball effect on each other: when one sector goes up, we can predict that similar sectors relying on many of the same factors will also increase in price. An example of this is XHB and XRT. XHB is an ETF that encompasses companies that focus on home building, while XRT contains retailers. Since consumers propel both of these sectors, it stands to reason that they will have similar price movements. Historically, this is true: these two sectors are positively correlated, with a correlation value of 0.78 using their monthly returns since 01/01/2010. A sketch of this kind of correlation check appears below.
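Below is a minimal sketch of such a monthly-return correlation check, assuming pandas; the two price series are randomly generated stand-ins for the actual XHB and XRT history.

```python
# Minimal sketch of a monthly-return correlation between two ETFs
# (assumes pandas); prices are simulated, not real XHB/XRT data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
months = pd.date_range("2010-01-31", periods=108, freq="M")
shared = rng.normal(0, 0.03, months.size)  # common consumer-driven factor
xhb = pd.Series(100 * np.cumprod(1 + shared + rng.normal(0, 0.02, months.size)), index=months)
xrt = pd.Series(100 * np.cumprod(1 + shared + rng.normal(0, 0.02, months.size)), index=months)

# Correlate month-over-month percentage returns, not raw price levels.
print(round(xhb.pct_change().corr(xrt.pct_change()), 2))
```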
This is important for our proposed solution because I use
many of these ETF prices as features in my dataset. The
selection process is a simple inclusion of the most popular
ETFs by trade volume.
As with the ETFs, macroeconomic factors also have a major
impact on the stock market. Leading indicators are
measurements that have the potential to forecast economic
conditions. This category of indicators includes reports such
as manufacturing activity, retail sales, the housing market,
and inventory levels. Lagging indicators such as GDP, profits,
and interest rates also can have effects on the stock market.
These are considered a measurement of current economic
conditions. Although they do not have as much forecasting
power as the leading indicators, they are a good gauge of the
overall health of an economy. Both categories can have an
effect on the stock market, especially if they show data that
contradicts the current consensus of a market.
Fundamentals are another set of features that are included in
the model. As mentioned above, many investors use
fundamentals to make investment decisions.
The
fundamental values to be included in the model will be those
that are popular and easily recognizable to both seasoned
investors and casual stock market participants. Technical
measures will also be included in the original feature set.
Technical indicators chosen for the original pool, before
being narrowed down via the Genetic Algorithm, will
include a set of simple, easy to read, and easy to calculate
indicators. Choosing easy to use and easy to understand
indicators of both the technical and fundamental variety has
many benefits including results that should be easier to read
and, subsequently, reproduce.
3B) Description:
3B.1) Written Description:
This paper examines four models (an SVM, an ANN, a GA + SVM hybrid, and a GA + ANN hybrid) and tests their performance against each other and against the general moves of the broader market. The models aim to predict whether or not the DJI, the ticker symbol for the Dow Jones Industrial Average, has a daily positive gain (PositiveGain in the dataset). All of the models are built from the same initial data, which comes from a variety of sources. Index prices come from Yahoo Finance. The technicals are calculated using a combination of fundamentals, time, and prices; specific calculations are provided below. Finally, economic data comes from the Data.gov website.
The first date in the price data is 01/04/2010 and the final
date is 07/01/2018. The data includes all the trading days
between the previously described bookends. Trading days
are all business days not including holidays or various other
days where the market is closed. The date range was chosen
because it was the largest sample size that could be obtained
where all desired features were available.
Much of the economic data is given as monthly or quarterly statistics. When this is the case, the value is carried forward over the entire period until newer data becomes available. For
example, the GDP report for 9/1/2018 was 20658.204. Since
a new number is not reported until 10/1/2018, the September
number is used for 9/2, 9/3, 9/4, etc. until the October
number comes in. The same technique was used for
quarterly data.
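A minimal sketch of this carry-forward step is shown below, assuming pandas; the September figure is the one quoted above, while the October figure is a made-up placeholder.

```python
# Minimal sketch of carrying a periodic report forward over trading
# days (assumes pandas); the October value is a made-up placeholder.
import pandas as pd

gdp = pd.Series([20658.204, 20900.0],
                index=pd.to_datetime(["2018-09-01", "2018-10-01"]))

# Reindex onto business days, forward-filling so 9/2, 9/3, 9/4, ...
# reuse the September number until the October report lands.
days = pd.bdate_range("2018-09-03", "2018-10-05")
daily = gdp.reindex(days, method="ffill")
print(daily.loc["2018-09-04"])  # 20658.204
```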
Data is processed via the scikit-learn preprocessing packages; processing includes removing missing data and scaling the data so the machine learning algorithms perform well. Data was scaled using (xi − min(x)) / (max(x) − min(x)), since not all features are evenly distributed. For the models, data is split using train-test split packages, and the training data is split again into training and validation data. Due to the nature of time-series data, the data is split chronologically. The training set starts on 01/04/2010 and ends on 02/06/2015. The cross-validation set starts on 02/09/2015 and ends on 10/18/2016. The test set starts on 10/19/2016 and ends on 6/29/2018. This split can be seen in Figure 15.
Figure 15: DJI prices split by train, validation, and
test segments
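The split and scaling described above might look like the sketch below, which assumes scikit-learn and pandas; the single 'feat' column is a stand-in for the real feature set.

```python
# Minimal sketch of the chronological split plus min-max scaling
# (assumes scikit-learn and pandas); data here is randomly generated.
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

dates = pd.bdate_range("2010-01-04", "2018-06-29")
rng = np.random.default_rng(3)
df = pd.DataFrame({"feat": rng.normal(size=dates.size),
                   "PositiveGain": rng.integers(0, 2, dates.size)}, index=dates)

# Time-series data is split by date, never shuffled.
train = df.loc[:"2015-02-06"]
valid = df.loc["2015-02-09":"2016-10-18"]
test = df.loc["2016-10-19":]

# Fit (x - min(x)) / (max(x) - min(x)) on training data only, then
# apply the same transform to the later periods.
scaler = MinMaxScaler().fit(train[["feat"]])
train_X, valid_X, test_X = (scaler.transform(s[["feat"]]) for s in (train, valid, test))
```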
For the hybrid models, data is split like before but instead of
feeding right into the model, features are first filtered with
the GAs. The best-fit features are then used as a part of the
SVM or ANN to make predictions.
3B.2) Algorithm Descriptions:
SVM:
Support Vector Machines are a subset of supervised learning algorithms. They achieve predictions by maximizing margins between classifications. This means that training examples are mapped onto a hyperplane that increases the distance from one class to another. On the optimal hyperplane, the training examples that are closest to the maximum margin are called the support vectors (see Figure 16).
When data is linearly separable, a hyperplane separating the prediction classes can be represented with the equation:
y = w0 + w1x1 + w2x2 + ... + wnxn
where y is the outcome, the x's are the variable values, and the w's are the weights.
The maximum margin hyperplane can be represented by:
y = b + Σ αi yi x(i) · x
where y is the prediction value, x is a vector that represents an instance, x(i) and yi are the support vectors, · represents the dot product, and b and the αi are parameters that define the hyperplane, calculated by solving a linearly constrained quadratic programming problem.
When the data is not linearly separable, a kernel is added to the equation:
y = b + Σ αi yi K(x(i), x)
One common kernel, the Gaussian radial basis function (RBF), is represented as follows:
K(x, y) = exp(−(x − y)² / δ²)
where δ² is the bandwidth of the Gaussian RBF kernel.
The sklearn Support Vector Classifier was used to implement the algorithm. In the package, a regularization parameter, C, determines how closely the separator fits the predicted classes: when C is higher, the boundary fits the data more closely; when it is lower, the boundary is smoother. For this paper, C was set to 1.0. The gamma parameter controls which points are used to calculate the line of separation: when gamma is set higher, only the points closest to the line of separation are used; when it is lower, all points are considered. In this model, gamma is set to 1 / (number of features).
Figure 16: SVM on a 2-D plane showing the difference between strict and loose gamma values [23]
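A minimal sketch of this configuration, assuming scikit-learn; the training arrays are random stand-ins. Note that gamma="auto" is scikit-learn's spelling of 1 / (number of features).

```python
# Minimal sketch of the SVC configuration described above (assumes
# scikit-learn); X_train/y_train are random stand-ins for the data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X_train = rng.random((100, 8))     # eight scaled features
y_train = rng.integers(0, 2, 100)  # PositiveGain labels

model = SVC(C=1.0, gamma="auto").fit(X_train, y_train)  # gamma = 1/n_features
print(model.score(X_train, y_train))
```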
GA:
Genetic algorithms start with an initialization. A
gene in the case of a genetic algorithm is a
combination of features. First, the “genes” of the
algorithm are randomized and results are
analyzed. If a feature or combination of features
produces a correct result, it is marked as so.
Combinations that are more successful are ranked
higher than those that are less successful and given
a higher value which corresponds to a higher
probability of selection. During the selection
phase, each combination of features is given a
probability and instead of simply taking on the
best performing features from the training phase,
the selection is based on this probability. This is
how GAs avoid overfitting while still rewarding
features that initially perform well. Selected
combinations of features are then mixed and
matched to give new feature combinations.
Mutations can then be added in order to introduce
another level of randomness. Final results are
generated and the best feature combination is
selected.
Figure 18: Flow chart illustrating the steps of a genetic algorithm [25]
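A minimal sketch of genetic-algorithm feature selection along these lines appears below; the dataset, population size, and mutation rate are illustrative choices, not the paper's settings.

```python
# Minimal sketch of GA feature selection wrapped around an SVM
# (assumes scikit-learn); data and GA settings are illustrative.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = rng.random((300, 12))
y = (X[:, 0] + X[:, 3] > 1).astype(int)  # only features 0 and 3 matter
X_tr, y_tr, X_va, y_va = X[:200], y[:200], X[200:], y[200:]

def fitness(mask):
    # Score a candidate feature combination by validation accuracy.
    if not mask.any():
        return 0.0
    model = SVC(C=1.0, gamma="auto").fit(X_tr[:, mask], y_tr)
    return model.score(X_va[:, mask], y_va)

pop = rng.random((20, 12)) < 0.5  # random initial "genes"
for generation in range(15):
    scores = np.array([fitness(m) for m in pop])
    # Selection is probabilistic in proportion to fitness, rewarding
    # strong combinations without discarding weaker ones outright.
    probs = (scores + 1e-9) / (scores + 1e-9).sum()
    parents = pop[rng.choice(len(pop), size=len(pop), p=probs)]
    # Single-point crossover mixes and matches selected combinations.
    cut = rng.integers(1, 12)
    children = np.concatenate([parents[::2, :cut], parents[1::2, cut:]], axis=1)
    pop = np.concatenate([parents[:len(pop) - len(children)], children])
    # Mutation flips a few bits to reintroduce randomness.
    pop = pop ^ (rng.random(pop.shape) < 0.02)

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected features:", np.flatnonzero(best))
```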
ANN:
Artificial Neural Networks are a rough
representation of how the brain works with
neurons and connections woven together in order
to take a number of inputs and predict a
classification for the input (see Figure 17). The
neurons are units of calculations while the
connections are weights to be applied to the next
equation.
Each neural network has three parts: the input layer, the hidden layers, and the output layer. The number of nodes in the input layer corresponds to the number of variables. In ANNs, the activation function of a node sets the output of the node given a set of inputs. The activation function used for the input and hidden layers is the ReLU function, defined as:
f(x) = max(0, x)
The network starts with a random initialization. The activation rate is then computed from the input layer through the hidden layer, and an output is produced using the softmax function, defined as:
softmax(x)i = e^(xi) / Σj e^(xj)
Figure 17: A representation of a neural network with two inputs, one hidden layer, and two outputs [24]
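A minimal sketch of such a network, assuming scikit-learn's MLPClassifier; the layer size and data are illustrative. MLPClassifier applies ReLU in the hidden layers as described; for a two-class label it uses a logistic output unit, which is mathematically equivalent to a two-unit softmax.

```python
# Minimal sketch of an ANN with ReLU hidden units (assumes
# scikit-learn); layer size and data are illustrative stand-ins.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(6)
X_train = rng.random((200, 8))
y_train = rng.integers(0, 2, 200)

ann = MLPClassifier(hidden_layer_sizes=(16,), activation="relu",
                    random_state=0, max_iter=500).fit(X_train, y_train)
print(ann.predict_proba(X_train[:3]))  # per-row class probabilities
```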
4. RESULTS
The four models were fitted on train data and cross-validated
on the preset cross-validation data. Once properly adjusted,
the models were used on test data. The results for how each
individual model performed on the test data can be found in
Table 6. Accuracy, or (number of correct predictions / total
number of predictions) was the metric used to evaluate the
model. This was chosen for its simplicity and its application
to the real world. If the target data was imbalanced, another
metric such as precision or recall may have been chosen but
since the number of positive days and number of negative
days were split almost evenly, accuracy worked well for this
study.
Baseline results were also collected and can be seen in Table 7. The baseline methods used were a random walk and a buy-and-hold strategy. The random walk method predicted
random 1s and 0s (positive and negative days). The buy and
hold strategy is the equivalent of predicting all positive days.
Model        Accuracy
SVM          0.513
ANN          0.507
GA + SVM     0.531
GA + ANN     0.522
Table 6: Model results

Baseline     Accuracy
Random Walk  0.501
Buy + Hold   0.521
Table 7: Baseline results

Based on the results, utilizing the genetic algorithm appears to have a positive effect on accuracy. In both cases, the models paired with the genetic algorithm outperformed the standalone SVM and ANN. The GA combinations were also able to beat a buy-and-hold strategy, which the standalone models could not do. The GA + SVM model performed the best of the four machine learning models, followed by the GA + ANN model and then the two standalone models.
5. FUTURE WORK
There are many ways in which this paper, and many of the other papers out there, could be improved. First, the economic landscape is so vast that many variables remain untested. Examples of variables that might improve a model include individual company data, more foreign macroeconomic data, text analysis of various reports or articles, and countless other sources.
Another area to explore is the addition of various other
predictive algorithms. SVMs and ANNs were selected for
this paper because of their popularity in previous papers.
New ways to approach the problem could be to use various
other classification algorithms such as decision trees,
random forests, logistic regression, etc. Aside from the
prediction method, a new approach could also be taken in
terms of the prediction label. In this paper, market gains
were defined as a binary positive gain. Similar approaches
could be used in regression problems to try and predict
continuous prices. The time constraint of the prediction
label could also be modified. This paper works with single
day predictions but longer and shorter time periods could
also be explored.
Finally, this paper focuses on price prediction. A next step
could be to attempt to turn this into an actual trading system.
This would most likely involve looking at ‘PositiveGain’
predictions and their probabilities and then analyzing
hypothetical gains. This could also include taking into
account things like trade prices, percent of gains, the timing
of trades, etc.
6. CONCLUSION
The paper illustrates a few important points regarding
machine learning, algorithm selection, feature selection, and
predicting daily stock market directions. First, the paper
shows that it is plausible, with the right combination of
machine learning algorithms, to predict market direction.
Next, the paper shows that with feature selection, model
improvement is possible. In this instance, the feature
selection algorithm was a genetic algorithm and it was paired
with a support vector classifier and an artificial neural
network. Results were compared between the standalone
algorithms and the hybrids that included the feature
selection. Finally, the results also showed that the SVM
outperformed the ANN. While there are many more areas
that need exploration, the paper lays out some key points for
any future work.
REFERENCES
1. Matthieu Medhi Belarouci. 2012. The Relation between Technical Efficiency and Stock Prices: Evidence from the US Airlines Industry (1990-2010). SSRN Electronic Journal. DOI: http://dx.doi.org/10.2139/ssrn.2083651
2. Olivier Blanchard, Changyong Rhee, and Lawrence Summers. 1990. The Stock Market, Profit and Investment. DOI: http://dx.doi.org/10.3386/w3370
3. Nai-Fu Chen, Richard Roll, and Stephen A. Ross. 1986. Economic Forces and the Stock Market. The Journal of Business, 59, 3, 383. DOI: http://dx.doi.org/10.1086/296344
4. Rohit Choudhry and Kumkum Garg. 2008. A Hybrid Machine Learning System for Stock Market Forecasting. World Academy of Science, Engineering and Technology International Journal of Computer and Information Engineering, 2, 3, 689-692.
5. Fernando Fernández-Rodríguez, Christian González-Martel, and Simón Sosvilla-Rivero. 2000. On the profitability of technical trading rules based on artificial neural networks. Economics Letters, 69, 1, 89-94. DOI: http://dx.doi.org/10.1016/s0165-1765(00)00270-6
6. W. Huang. 2004. Forecasting stock market movement direction with support vector machine. Computers & Operations Research. DOI: http://dx.doi.org/10.1016/s0305-0548(04)00068-1
7. Kyoung-Jae Kim and Ingoo Han. 2000. Genetic algorithms approach to feature discretization in artificial neural networks for the prediction of stock price index. Expert Systems with Applications, 19, 2, 125-132. DOI: http://dx.doi.org/10.1016/s0957-4174(00)00027-0
8. Kyoung-Jae Kim. 2003. Financial time series forecasting using support vector machines. Neurocomputing, 55, 1-2, 307-319. DOI: http://dx.doi.org/10.1016/s0925-2312(03)00372-2
9. Salim Lahmiri. 2011. A Comparison of PNN and SVM for Stock Market Trend Prediction using Economic and Technical Information. International Journal of Computer Applications, 29, 3 (September 2011), 24-30.
10. Blake LeBaron, W. Brian Arthur, and Richard Palmer. 1999. Time series properties of an artificial stock market. Journal of Economic Dynamics and Control, 23, 9-10, 1487-1516. DOI: http://dx.doi.org/10.1016/s0165-1889(98)00081-5
11. Xiaodong Li et al. 2014. Empirical analysis: stock market prediction via extreme learning machine. Neural Computing and Applications, 27, 1 (February 2014), 67-78. DOI: http://dx.doi.org/10.1007/s00521-014-1550-z
12. Anon. 2017. Support and Resistance. A Complete Guide to the Futures Market (June 2017), 91-108. DOI: http://dx.doi.org/10.1002/9781119209713.ch8
13. Jonathan L. Ticknor. 2013. A Bayesian regularized artificial neural network for stock market forecasting. Expert Systems with Applications, 40, 14 (2013), 5501-5506. DOI: http://dx.doi.org/10.1016/j.eswa.2013.04.013
14. Bruce Vanstone and Gavin Finnie. 2010. Enhancing stockmarket trading performance with ANNs. Expert Systems with Applications, 37, 9 (2010), 6602-6610. DOI: http://dx.doi.org/10.1016/j.eswa.2010.02.124
15. Ruud Weijermars. 2011. Price scenarios may alter gas-to-oil strategy for US unconventionals. Oil & Gas Journal (January 2011), 74-81.
16. Wing-Keung Wong, Meher Manzur, and Boon-Kiat Chew. 2003. How rewarding is technical analysis? Evidence from Singapore stock market. Applied Financial Economics, 13, 7 (2003), 543-551. DOI: http://dx.doi.org/10.1080/0960310022000020906
17. Lean Yu, Huanhuan Chen, Shouyang Wang, and Kin Keung Lai. 2009. Evolving Least Squares Support Vector Machines for Stock Market Trend Mining. IEEE Transactions on Evolutionary Computation, 13, 1 (2009), 87-102. DOI: http://dx.doi.org/10.1109/tevc.2008.928176
18. Justin Kuepper. 2018. What Is A Trading System? (March 2018). Retrieved October 22, 2018 from https://www.investopedia.com/university/tradingsystems/tradingsytems1.asp
19. Jean Folger. 2018. Guide to Pairs Trading. (February 2018). Retrieved October 22, 2018 from https://www.investopedia.com/university/guide-pairs-trading/
20. Bala Deshpande. 2013. When do support vector machines trump other classification methods? (January 2013). Retrieved April 21, 2019 from http://www.simafore.com/blog/bid/112816/When-do-support-vector-machines-trump-other-classification-methods
21. Jahnavi Mahanta. 2017. Introduction to Neural Networks, Advantages and Applications. (July 2017). Retrieved April 21, 2019 from https://towardsdatascience.com/introduction-to-neural-networks-advantages-and-applications-96851bd1a207
22. Fernando Gomez and Alberto Quesada. Genetic algorithms for feature selection. Machine Learning Blog. Retrieved April 22, 2019 from https://www.neuraldesigner.com/blog/genetic_algorithms_for_feature_selection
23. Nicolas Panel. node-svm. Retrieved April 22, 2019 from https://www.npmjs.com/package/node-svm
24. Anon. Retrieved April 22, 2019 from http://neuroph.sourceforge.net/tutorials/MultiLayerPerceptron.html
25. Anon. Retrieved April 22, 2019 from https://www.hindawi.com/journals/mpe/2013/504895/fig3/
2
M. A. Upal (editor)
2
Stock Market Price Model Using Sentiment and Market
Analysis
Justin Minsk
Department of Computing & Information Science
Mercyhurst University
Erie, PA, US
jminsk64@lakers.mercyhurst.edu
ABSTRACT
Zhang et al. [6] take social media, news media, and market indicators and predict stock prices from them. All three of these data sources are used to predict a set of stocks from the Hong Kong stock exchange. Instead of using information from Chinese sources, this paper uses United States sources: Twitter, the Wall Street Journal, and Amazon stock indicators. Our goal is to examine whether the ideas presented in [6] can be generalized to other stock markets, in particular whether these methods can be used to predict Amazon's stock price for each minute of open trading from December 12th, 2018 to January 22nd, 2019.

1 INTRODUCTION
Predicting the stock market is an ongoing research topic in the academic and business sectors. The techniques and knowledge of stock market prediction have changed over time, from the idea that stocks follow a random walk [6] to models that can predict stock movement using deep learning techniques [1-3]. One of the most influential indicators has been found in news and social media sentiment analysis [5, 6, 8, 9]. While there are research papers combining multiple techniques [7], there is still more work to be done to combine deep learning techniques, such as long short-term memory (LSTM) models [2], with sentiment analysis of news and social media [5, 6, 8, 9]. A review of sentiment analysis techniques in conjunction with deep learning models can be fruitful for identifying techniques that could be used and where new research could continue.

Questions around which sources and features should be used, and how to combine sentiment analysis with traditional price predictors, have been explored in recent research papers on stock market prediction [6]. Multiple tensors for social media and news [6] combine sentiment analysis with traditional stock market price predictors. Zhang's model [6] mines Chinese social media and news, which have different properties compared to United States news and social media. Applying the ideas presented by Zhang et al. to United States media, such as Twitter and the Wall Street Journal, is a good way to test the generality of their model.

The unique approach taken by this paper is to combine social media, news media, and economic metrics into one model designed for real-time predictions. Using the sentiment analysis techniques for social media [4, 5, 8] and news media [7] with the method from [6] creates a combined model for the United States stock market that could eventually provide a reliable way to predict stock prices.

2 BACKGROUND AND RELATED WORK
2.1 Related Work
Recent work on stock market price prediction focuses more on sentiment analysis and less on historical prices and economic indexes [4-8]. Sentiment analysis is the act of taking text data and scoring it. This can be done in several ways, the most popular being scoring positive and negative sentiment about a subject [5]. In stock market prediction, social media, such as tweets from Twitter [4, 5, 6, 8], and news articles [6, 7, 9] are commonly used to predict whether a stock's price will go up or down.

While stock market prediction has focused on sentiment analysis, the underlying data has the characteristics of a time series problem, so papers addressing time series modeling are also relevant. A long short-term memory neural network (LSTM) [1, 2] and a gated recurrent unit network (GRU) [10] are two approaches that have been used to effectively model problems of this type.

New stock market research has focused on event-based models [7], which take news or social media events and predict how they affect a stock's price. Sentiment analysis assigns a positive or negative sentiment score to a given piece of text based on whether a given topic is covered positively or negatively in that text. Deep learning is used to find the relation between sentiment and the stock price, allowing for both regression and classification. Regression can be used to summarize and explain how a specific event affects a stock's price; classification can be used to label stocks as either "buy" or "sell."

The most influential papers on US stock prices have focused on social media [4, 5, 8] or news articles [7]. Little has been done to combine both sources into a complex model that uses multiple news sources, social media platforms, and basic economic indicators.

An example of a paper that combines multiple sources of data to predict a stock's price is [6]. That paper, however, focuses on the Hong Kong stock market rather than the US stock market. It creates a framework for combining social media, a news source, and traditional economic metrics into one model. The idea is to create separate models for each source: one model for social media, one for a news source, and a final model for the economic metrics. These models are referred to and treated as tensors [6]. The social media and news media tensors are combined into a tensor for qualitative, or text, data, while the economic metrics are combined into a quantitative tensor. These qualitative and quantitative tensors are then fed into a tensor that outputs the final value, the predicted stock price. The tensors for social media, news media, and economic metrics use variations of LSTMs and support vector machines; the output, or blender, tensor uses a variation of a support vector machine.

2.2 Amazon.com, Inc.
The online bookstore turned technology company Amazon.com, Inc. (AMZN) was chosen for this research due to a general interest in the technology field and the ability to collect Twitter data about the company. Since Amazon encourages online customer reviews and interaction on public platforms, it was assumed that massive amounts of data could be easily collected and related to stock market prices. Amazon is constantly expanding into relatively new business environments with services such as cloud computing, Prime shipping, streaming services, and its own product lines in both electronics and consumer products. The company holds special interest because it has a wide range of products and services closely connected to its customer base, it appeals to all demographics of consumers, and it has had major fluctuations in stock price since its founding. If a model can be applied to a company as diverse as Amazon, the same model could eventually be applied to more specialized companies in the future.

2.3 Recurrent Neural Networks
Recurrent neural networks (RNN) are important to time series analysis because of their ability to capture temporal behavior. This behavior is preserved through the interconnection of the nodes in a layer [2]. An RNN consists of an input layer, a number of hidden layers, and an output layer; the hidden layers update their weights over time, and any number of them can be used, increasing the complexity of the RNN. However, RNNs suffer from the vanishing gradient problem, which makes longer time series data sets much less accurate: the network loses information over time. This loss of information makes long-term trends disappear during the training phase of the RNN model.

2.4 Long Short-Term Memory Neural Networks
Long short-term memory neural networks (LSTM) are a direct improvement over RNNs, since they address the vanishing gradient problem by adding gating functions to their dynamic state [2]. Instead of having only hidden layers like RNNs, they also have memory vectors that maintain the state updates and outputs [2]. This memory vector allows longer-term temporal trends to be remembered by an LSTM where an RNN might lose that information.

2.5 Gated Recurrent Unit Networks
Gated recurrent unit networks (GRU) are similar to LSTMs in that they also solve the RNN vanishing gradient problem [10]. LSTMs use multiple gates (input, forget, and output gates), while GRUs, which were created after LSTMs, use only reset and update gates. GRUs are therefore simpler than LSTMs, but neither architecture is consistently better than the other.

2.6 Support Vector Machines
Support vector machine (SVM) models are classic classification machine learning models. They work by finding the line that divides the data; unlike linear regression, this line is designed to be as far away from the clusters of different classes as possible, which tends to yield models that generalize well compared to other classic classifiers. SVMs can be used in both classification and regression tasks; in a regression task the logic is similar, but it is used to fit a line of best fit instead. Since three different predictions are fed into a blender model, there needs to be a way to predict the final value, and an SVM is well suited to this task.
3: DATA COLLECTION
3.1 Twitter Data
Twitter data was collected from December 12th, 2018 to January 22nd, 2019. A grand total of 23,585,132 tweets containing the words "amazon" or "amzn" were collected during this time. The data was joined to the average AMZN price for each minute the market was open between those two dates. Tweets were combined to form one string for each minute, and a tweet count was also collected. A term frequency-inverse document frequency (TF-IDF) score was used to weigh words by frequency and importance within each document and then rank the documents within the collection. For example, if "amazon" appears multiple times across a collection of documents, the document using "amazon" the most would score highest for that term; but because "amazon" is common across the collection, the word itself is ranked lower in significance. The score of each document thus reflects how words are used in both the document and the collection. The subject of every document is not always important; instead, the focus lies in the overall sentiment of the collection of documents. This turns the text into numeric features and should help models find important words. In the end, 1-4 grams were used and 75,001 features were extracted from the data.
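As a sketch of this featurization step, the following uses scikit-learn's TfidfVectorizer with the settings reported above (1-4 grams, 75,001 features); tweets_df, with one row per tweet and 'minute' and 'text' columns, is a hypothetical stand-in for the collected data.

# TF-IDF over per-minute tweet documents; settings follow the text above.
from sklearn.feature_extraction.text import TfidfVectorizer

# One document per market minute: the concatenation of that minute's tweets.
minute_docs = tweets_df.groupby("minute")["text"].apply(" ".join)

vectorizer = TfidfVectorizer(ngram_range=(1, 4), max_features=75001)
X_tweets = vectorizer.fit_transform(minute_docs)   # sparse (minutes x 75001) matrix

# The Wall Street Journal features were built the same way, but with
# ngram_range=(2, 5), as noted in Section 3.2.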
3.2 Wall Street Journal Data
Articles suggested by the Wall Street Journal as related to Amazon were collected between December 12th, 2018 and January 22nd, 2019. Article text was carried forward in time, attached to each minute of open-market prices until a new article was written, allowing Wall Street Journal data to continue to influence the price. The variable "time since article posted" was added to distinguish an article just posted from one posted minutes, hours, or days ago. A term frequency-inverse document frequency score was used, which assigns larger weights to words that appear in fewer documents but are used multiple times within a document or article. This turns the text into numeric features and helps models find important words. In the end, 2-5 grams were used and 75,001 features were extracted from the data.
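The forward-in-time join and the "time since article posted" variable can be sketched with pandas as follows; prices_df and articles_df are hypothetical DataFrames with datetime columns 'minute' and 'published', so the column names are assumptions.

# Match each trading minute with the most recent WSJ article at or before it.
import pandas as pd

prices_df = prices_df.sort_values("minute")
articles_df = articles_df.sort_values("published")

merged = pd.merge_asof(
    prices_df, articles_df,
    left_on="minute", right_on="published",
    direction="backward",              # latest article at or before each minute
)
merged["time_since_article"] = (
    merged["minute"] - merged["published"]
).dt.total_seconds() / 60.0            # minutes since the article was posted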
3.3 Stock Market Data
Price and stock indicators include: open time, open price for the trading day, close time, previous close price for the trading day, high price for the trading minute, low price for the trading minute, latest price, latest update, latest time, latest volume, delayed price time, delayed price, extended price, extended change, extended change percent, previous close, change in price, percent change, market cap, 52-week high, 52-week low, and year-to-date change. These variables are collected from IEX's API and contain every minute of open trading price data for each day of the selected time period.
4: MODEL FORMULATION
4.1 Basic Model Building Blocks
Each of the models (IEX, Twitter, and Wall Street Journal) uses similar building blocks, since all three try to solve the same problem: predicting the next minute's average stock price. The temporal nature of the data needed to be maintained, so training uses batches that start at a random time and contain a set number of steps, a fixed window into the future. This way, the model is trained on time sequences, but different chunks of data are trained on at different times. The second departure from a basic setup is that the mean squared error loss is not calculated over the whole validation sequence: the first 20 steps of each sequence are skipped, since the model starts at a random point and only then begins to make sensible predictions. Skipping the first 20 steps keeps that random starting point from influencing the loss. The last layer is always a dense layer used to output a single predicted value, namely the average price. The learning rate was reduced during training to help find the minimum loss.
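To make these building blocks concrete, here is a minimal TensorFlow/Keras sketch of a random-window batch generator, a warm-up MSE loss that skips the first 20 steps, and a stacked GRU ending in a Dense layer (the best-performing shape reported in Section 4.2). The names x, y, and num_features are hypothetical placeholders, and the layer widths and window length are illustrative, not the paper's exact configuration.

# x: (time, num_features) array of inputs; y: (time, 1) array of average prices.
import numpy as np
import tensorflow as tf

def batch_generator(x, y, batch_size=32, steps=128):
    """Yield batches of random windows so different chunks train at different times."""
    while True:
        xb = np.empty((batch_size, steps, x.shape[1]))
        yb = np.empty((batch_size, steps, 1))
        for i in range(batch_size):
            start = np.random.randint(0, len(x) - steps)
            xb[i] = x[start:start + steps]
            yb[i] = y[start:start + steps]
        yield xb, yb

def warmup_mse(y_true, y_pred, warmup=20):
    """MSE that skips the first `warmup` steps, since each window starts at a random point."""
    return tf.reduce_mean(tf.square(y_true[:, warmup:, :] - y_pred[:, warmup:, :]))

model = tf.keras.Sequential([
    tf.keras.layers.GRU(64, return_sequences=True, input_shape=(None, num_features)),
    tf.keras.layers.GRU(64, return_sequences=True),
    tf.keras.layers.Dense(1),          # single predicted value: the average price
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss=warmup_mse)
# model.fit(batch_generator(x, y), steps_per_epoch=100, epochs=10)

A callback such as tf.keras.callbacks.ReduceLROnPlateau can lower the learning rate during training, matching the learning-rate reduction described above.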
4.2 IEX Model
Multiple configurations of deep learning models were used to attempt to predict the next minute's average stock price for AMZN. Five models were run, and their validation losses were compared to decide which performed best:

Model                                     Validation Loss (MSE)
GRU - Dense                               3.3585554774617776e-05
LSTM - Dense                              4.118878860026598e-05
GRU - GRU - Dense                         2.2447573428507894e-05
GRU - Dropout - GRU - Dropout - Dense     0.00045545812463387847
LSTM - LSTM - Dense                       0.00016301091818604618

The best model, GRU - GRU - Dense, had the lowest validation loss.

4.3 Twitter and Wall Street Journal Models
Both the Twitter and Wall Street Journal models are structured the same way due to financial constraints: each has one GRU layer leading to the final Dense layer. The Twitter model had a validation loss of 0.11920106410980225, and the Wall Street Journal model a validation loss of 0.10996559262275696.

4.4 Models Performance
The IEX model was the only model with acceptable performance. The Twitter and Wall Street Journal models effectively predict an average and do not seem to pick up on the temporal nature of the data.
4.5 Ensemble Model
Multiple machine learning models were tested to blend the
predictions from the IEX, Twitter, and Wall Street Journal
models. The models used were Support Vector Regressor
(SVR), Decision Tree Regressor (DTR), Random Forest
Regressor (RFR), and Gradient Boosting Regressor (GBR).
Cross-validation was run for each model, and R-squared scores were used to pick the best ensemble model. The model with the best R-squared cross-validation score was the SVR.
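A minimal sketch of this selection step, assuming the three models' per-minute predictions have been stacked into a hypothetical matrix X_blend with the true average prices in y_price:

# Compare candidate blender regressors by cross-validated R-squared.
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

candidates = {
    "SVR": SVR(),
    "DTR": DecisionTreeRegressor(),
    "RFR": RandomForestRegressor(),
    "GBR": GradientBoostingRegressor(),
}
for name, model in candidates.items():
    r2 = cross_val_score(model, X_blend, y_price, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {r2.mean():.4f}")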
5: CONCLUSION
The only model that performed acceptably was the IEX model. Even with the IEX model as a factor in the ensemble model, the ensemble seems to pick up only on an average value. This does not prove that the methodology from [6] does not work on US sources, but it may indicate that the problem of predicting stock values becomes harder to model at scale, or that there was not enough information in the sources we collected to predict the stock we chose, namely Amazon.
5.2 Future Work
Both the Twitter and Wall Street Journal sentiment analysis models need to be improved before they become useful. In the papers [4-7] that used a single source to predict a stock's price, the sentiment analysis models still performed well. The models presented in this paper performed poorly for a variety of possible reasons. The first possibility is that a lack of compute power inhibited the processing of such a large corpus of text. A second is that, during the period of data collection, social media and news media were simply not reflecting AMZN's market value. A third is that better data engineering is needed to eliminate some of the noise the data certainly contained. Whatever the reason, standard sentiment analysis techniques were used and the models did not perform as well as intended. The next clear step is to improve the text models; one method could be to move from regression to classification: instead of predicting the price, predict whether the price will go up, go down, or stay the same. These classifications could still be fed into the ensemble model to create a better regression model.
References
[1] Martin Längkvist, Lars Karlsson, and Amy Loutfi. 2014. A review of unsupervised feature learning and deep learning for time-series modeling. Pattern Recognition Letters 42, 1, 11-24. DOI: https://doi.org/10.1016/j.patrec.2014.01.008
[2] Fazle Karim, Somshubra Majumdar, Houshang Darabi, and Shun Chen. 2017. LSTM Fully Convolutional Networks for Time Series Classification. IEEE Access 6, 1662-1669. DOI: 10.1109/ACCESS.2017.2779939
[3] Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos. 2018. Statistical and Machine Learning forecasting methods: Concerns and ways forward. PLoS ONE 13, 3, e0194889. DOI: https://doi.org/10.1371/journal.pone.0194889
[4] Chung-Chi Chen, Hen-Hsen Huang, and Hsin-Hsi Chen. 2018. NTUSD-Fin: A Market Sentiment Dictionary for Financial Social Media Data Applications. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). ELRA.
[5] Johan Bollen, Huina Mao, and Xiaojun Zeng. 2010. Twitter mood predicts the stock market. Journal of Computational Science 2, 1-8. DOI: https://doi.org/10.1016/j.jocs.2010.12.007
[6] Xi Zhang, Yunjia Zhang, Senzhang Wang, Yuntao Yao, Binxing Fang, and Philip S. Yu. 2018. Improving Stock Market Prediction via Heterogeneous Information Fusion. Knowledge-Based Systems 143, 236-247. DOI: 10.1016/j.knosys.2017.12.025
[7] Paul C. Tetlock. 2007. Giving Content to Investor Sentiment: The Role of Media in the Stock Market. Journal of Finance 62, 1139-1168. DOI: http://dx.doi.org/10.2139/ssrn.685145
[8] Mengmeng Wang, Wanli Zuo, and Ying Wang. 2015. A Multilayer Naïve Bayes Model for Analyzing User's Retweeting Sentiment Tendency. Computational Intelligence and Neuroscience 2015, 510281. DOI: http://doi.org/10.1155/2015/510281
[9] Nesreen Ahmed, Amir Atiya, Neamat Gayar, and Hisham El-Shishiny. 2010. An Empirical Comparison of Machine Learning Models for Time Series Forecasting. Econometric Reviews 29, 5-6, 594-621. DOI: 10.1080/07474938.2010.481556
[10] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. CoRR, abs/1412.3555.
About the author:
Justin Minsk is a Graduate Student at Mercyhurst
University.
Connecting People: Psychology and Machine Learning
Praveen Kumar Neelappa
Data Research and Analytics Department
Uhisi Data Solution
Bengaluru, India/Toronto, Canada
pnkumar@outlook.com
Abstract— This paper tries to connect two major fields of science, machine learning and psychology, to find an approach to connecting people. The method was created using an unsupervised machine learning algorithm, k-means clustering, to identify traits and personality in people, group them together, and use an Android application to give customized recommendations for becoming friends or finding dates or partners. The advantage of this method is that it uses machine learning to learn from data and automate the process of connecting people by their personality rather than their appearance. The limitation of this method is that it was tested on a small sample of 100 students, all in the age group of 17-24, which does not represent the whole population. Further research is required.
9. INTRODUCTION
Connection is key in a world occupied by a population of about 7.7 billion people,[1] all of them with diverse cultures, psychologies, and personalities. Individuals crave interactions with people who bear similar perceptions, instincts, psychology, and culture in the outside world.

Albert Einstein stated: "A human being is part of the whole, called by us 'Universe,' his thoughts and feelings as something separated from the rest, a kind of optical delusion of his consciousness. This delusion is a prison, restricting us to our personal desires and affection for a few persons close to us. Our task must be to free ourselves from our prison by widening our circle of compassion to embrace all humanity and the whole of nature and its beauty."[2] This quote clearly depicts the urgent need for effective connection among individuals with a similar personality. Without the proper means, developing a system or algorithm that effectively connects people with similar attributes is hard. For instance, Facebook connects diverse individuals by utilizing basic information like geographical location, preferences, age, pictures, friends, etc.[3] This concept, used by Facebook and other social media platforms, tends to connect individuals who exhibit a similar demography.[4]
In this research, we are trying to develop a somewhat similar but unique approach to connecting individuals all over the world.[5] This research aims to use psychology and machine learning, utilizing a few concepts from psychology to help optimize worldwide connection among individuals with a similar personality. This method is in sharp contrast to what current social media platforms adopt.[6]

The mental, emotional, and behavioral characteristics pertaining to a specific person will be scrutinized, and a connection will be initiated between such a person and another individual who exhibits a similar psychology.[7] With this method of connecting individuals, the problem of incompatible connections, which some social media users experience, will be greatly reduced.[8]

The algorithm or method developed using psychology and machine learning will be transformed into an app or integrated into a product.[9] This will, unlike other methods utilized by social media platforms, connect individuals not only through their basic information but in a broader scope or perspective. We also want to use a machine learning algorithm to connect people (using psychology) from diverse cultures who share a similar personality.[10] Through the concept of this research, which will be utilized in developing an application or a product, individuals can find compatible friends, partners, or dates.[11]

10. BACKGROUND
There have been several research efforts to resolve the problem of how to connect people from diverse parts of the world.[12] The main aim of this research, whether ongoing or completed, is to foster better relationships between individuals by matching people with similar personalities and traits.[13] The social media giant Facebook utilizes a clever algorithm to connect individuals (via friend suggestions) of similar personality, location, instincts, etc.[14] Aside from this research, which is focused on connecting people using psychology and the k-means clustering machine learning algorithm, there are other research works with a similar aim but a somewhat different approach.[15] Some of this work takes a different route to the issue of connecting individuals with similar personalities or traits. One such method is the link prediction method.[16]

The link prediction method has been an important research topic for many years. This approach tends to predict social connections between users by utilizing the common-neighbor prediction method.[17] This method assumes that two nodes (individuals) with several common neighbors will have a future connection. The link prediction method utilizes three metrics: the time-varied weight, the change degree of common neighbors, and the intimacy between common neighbors. This approach puts several factors into consideration before making its prediction, which is generally reliable given the right pieces of information. One of the factors considered is how many common features (e.g., common hobbies, age, tastes, geographical locations) the two individuals share.[18] This is used to measure the likelihood of a link between the two individuals. Since this approach relies on information provided by the nodes to make an accurate prediction, it will suffer when information about the nodes is inaccessible or unreliable, perhaps due to privacy policies.[19] To overcome this problem, we suggest a more basic approach to obtaining information about each node; for instance, a system or model that uses little information about the nodes to make predictions should be developed.[20] Individuals will likely not give out some personal information about themselves for security reasons, so another approach should be adopted to obtain this information and make the prediction one that yields good results.[21]
Another study similar to ours is that of Professor C. V. Longani of SRES's College of Engineering, Kopargaon, India. According to Longani, the friend suggestion system in his research uses the lifestyles of users to suggest friends, instead of a social graph. A friend-matching graph is drawn from the lifestyles of the users.[22] This graph is generated in tabular form and is used to discover which users are most similar. Based on the friend-matching graph, users are recommended and connections are initiated between two users of similar personality.[23] The lifestyle of a user can be determined from the user's daily activities. This approach utilizes factors like habits, attitudes, tastes, moral standards, and economic level to connect individuals. By connecting individuals based on their lifestyle, a compatible union is achieved, be it in terms of friendship, partnership, etc.[24] But this approach comes with a defect: lifestyle is dynamic, and some individuals are good at portraying fake lifestyles. Matching an individual with someone with a fake lifestyle could result in an incompatible or failed connection. Further research is required in this respect.[25]
11. EXPERIMENTAL STUDIES
11.1 Research Question
Is it possible to create a method to collect data and use a machine learning algorithm to connect people who have similar interests and psychological behavior?
11.2 Variables
Personal Information - The basic personal information of the candidates is collected in the first round of screening. Afterward, the algorithm is utilized to screen the provided information for recommendations. These basic pieces of information are used to search for the perfect match for individuals with a similar trait, personality, lifestyle, psychology, etc.
Age - A discrete or continuous variable which is essential, because most established connections or relationships depend on the age factor.
Sex - Categorical (male, female, other).
Interest - Categorical (friendship, partnership, dating, relationship).
Sexual status - Categorical. An essential variable that must be considered before a connection is initiated; values include heterosexual, homosexual, bisexual, and transgender.
Other personal information obtained is not used as data for the machine learning algorithm.
Psychology Test - This test is directed at assessing the candidate's behavior, cognitive abilities, personality, and several other domains. Questions are posed to the candidates to test their basic psychology. Ten psychological questions were created, and the candidates are required to pick an answer from 1 to 5, where 1 denotes that they strongly disagree and 5 that they strongly agree with the question or scenario presented to them.
Personality Test - Questions are created to properly test the personality of the candidates, so as to extract the extrinsic and intrinsic personality of each candidate. The extracted qualities are then transformed into data for the machine learning algorithm. Ten personality-related questions are presented to the candidates, who are required to pick an answer from 1 to 5, where 1 denotes that they strongly disagree and 5 that they strongly agree with the question or scenario presented to them.
Subject of Interest - Various subjects of interest are provided as options to choose from, ranging from music, food, art, entertainment, politics, sport, and literature to movies, technology, etc. To extract information about the candidates' subjects of interest, they are required to select three interesting topics, after which five questions are created from the selected topics. The candidates are required to pick an answer from 1 to 5, where 1 denotes that they strongly disagree and 5 that they strongly agree with the question or scenario presented to them.

Hypothesis
People with a similar interest and psychological score should be comfortable communicating and developing a friendship.
11.3 Data Collection
Data collection can be described as the means of collecting and measuring information on targeted variables in an established system, which then helps one answer important questions and evaluate results.[26] The aim of all data collection in this research is to seize quality evidence that permits analysis to produce effective and real answers to the questions that have been asked.

For this research, data were collected through an online survey using Google Sheets. 100 participants were invited to be part of the data collection process and of the post-result discussion. Most of the participants were students and were the targeted audience for the method. The only information given to the participants was that they were part of a survey, keeping the whole data collection single-blinded. The questions were administered online, and there was no subjective influence from our side on the outcome. Every participant was given a unique identification number and was identified through their mail ID.[28] Follow-up questionnaires were sent to the participants, from which the results and conclusion of this paper have been derived.
12. DATA & METHOD
Data can be defined as any series of one or more symbols yielding meaning by a certain act of interpretation; for data to become information, it requires interpretation.[29]

After data collection, the final data set was made up of 100 rows and 40 columns. All the data was converted to integers. Python was used as the programming language, and the unsupervised machine learning algorithm k-means clustering was used to cluster similar users together.

K-means clustering is one of the most basic and widely known unsupervised machine learning algorithms. Normally, unsupervised algorithms make deductions from datasets using only input vectors, without considering known or labeled outcomes.[30] A cluster is a collection of data points assembled together because of specific similarities. You describe a target number "k", which refers to the number of centroids you require in the dataset.[31] A centroid is the location representing the center of its cluster. That is to say, the k-means algorithm selects "k" centroids and then assigns each data point to the closest cluster, while keeping the centroids as tight as possible.[32] The "means" in k-means refers to the averaging used to discover the centroids.

HOW THE K-MEANS ALGORITHM WORKS
To process the learning data, the k-means algorithm begins with a first group of randomly picked centroids, which are used as the starting points for each cluster, and then performs repeated calculations to refine the positions of the centroids.

It stops creating and refining clusters when either:
1) the centroids have stabilized and there is no change in their values, because the clustering has succeeded; or
2) the selected number of iterations has been reached.[33]
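As a concrete illustration of the procedure above, this minimal scikit-learn sketch computes the inertia curve used by the elbow method and then fits the final k = 5 clustering; survey_df is a hypothetical name for the 100 x 40 integer response table.

# K-means on the encoded survey responses, with an elbow-method scan over k.
from sklearn.cluster import KMeans

X = survey_df.values                      # 100 participants x 40 integer features

inertias = []
for k in range(1, 11):                    # try k = 1..10 and look for the "elbow"
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)          # within-cluster sum of squares

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
groups = kmeans.labels_                   # cluster assignment for each participant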
13. RESULT AND DISCUSSION
After running the data through the k-means clustering algorithm, we found k = 5 to be the best number of clusters using the elbow method. The 100 participants were divided into 5 groups as follows:

Group    No. of Participants
1        24
2        12
3        8
4        17
5        39

Fig. 1. Groups with the number of data points around the centroids

We sent each participant the results, recommending the people who were part of their group as friends to talk to, and made the following observations:
- 23% of participants came back with the information that they were already friends with the participants recommended to them.
- 7% of participants informed us that their friends were not recommended to them.
- 80% of participants informed us that they had a pleasant conversation with the recommended friends.
- The groups with fewer than 20 members showed more positive results, suggesting the algorithm works better at that scale; in a real-life scenario, the sample size would be much higher.
14. CONCLUSIONS
Although the results obtained seem promising and suggest that the model can be used to connect people, there are some issues with the method:
1. The way the data was collected and analyzed can lead to subjective bias.
2. The participants might not have given 100%, as the results were incentive-based.
3. The sample size used to collect data was small and does not represent the entire population.
4. Since most of the participants were from India, cultural bias might have been introduced into the results.
5. The results might change if a population of all ages were used in the experiment.
6. The criteria, scenarios, and questions used to measure psychological and personality traits might alter the results if reframed, or if different questions were used.
7. Other machine learning and deep learning algorithms should be used, and their results compared.

Despite these limitations, an application that uses psychology and machine learning to connect people can be created with this method. But the results cannot be generalized, and further data collection and research are needed.
REFERENCES
[1] https://populationeducation.org/sites/default/files/the_people_connection.pdf
[2] http://www.elise.com/quotes/einstein__a_human_being_is_part_of_the_whole
[3] https://www.impactbnd.com/blog/the-difference-between-facebook-twitter-linkedin-google-youtube-pinterest
[4] https://www.cs.ucsb.edu/~ravenben/publications/pdf/interaction-eurosys09.pdf
[5] http://interactions.acm.org/archive/view/january-february-2019/beyond-generalization
[6] https://mcb.unco.edu/students/ets-resources/ETS-Marketing-Strategy-Review.doc
[7] http://www.ilocis.org/documents/chpt5e.htm
[8] https://www.edge.org/responses/how-is-the-internet-changing-the-way-you-think
[9] https://searchenterpriseai.techtarget.com/definition/AI-Artificial-Intelligence
[10] https://towardsdatascience.com/machine-learning-vs-deep-learning-62137a1c9842
[11] https://www3.nd.edu/~ghaeffel/OnineDating_Aron.pdf
[12] https://ctb.ku.edu/en/table-of-contents/culture/cultural-competence/building-relationships/main
[13] https://courses.lumenlearning.com/boundless-management/chapter/defining-leadership/
[14] https://www.recode.net/2016/10/1/13079770/how-facebook-people-you-may-know-algorithm-works
[15] https://towardsdatascience.com/k-means-clustering-8e1e64c1561c
[16] https://buffer.com/resources/how-the-big-five-personality-traits-can-help-you-build-a-more-effective-team
[17] Lin Yao, Luning Wong, Lv Pan, and Kai Yao: Link prediction based on common-neighbors for dynamic social network. The 7th International Conference on Ambient Systems, Networks and Technologies (ANT 2016).
[18] Ton Wang, Xing-Sheng He, Ming-Yang Zhou, and Zhong-Qian Fu: Link prediction in evolving networks based on popularity of nodes. Scientific Reports 7, Article number: 7147, 2017.
[19] https://link.springer.com/article/10.1007/s10940-014-9235-4
[20] https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/
[21] https://www.nap.edu/read/1864/chapter/4
[22] https://www.ijedr.org/papers/IJEDR1603037.pdf
[23] https://www.wordstream.com/blog/ws/2016/09/28/generational-marketing-tactics
[24] http://www.pondiuni.edu.in/storage/dde/downloads/markiii_cb.pdf
[25] https://www.academia.edu/2194220/Media_culture_Cultural_studies_identity_and_politics_between_the_modern_and_the_postmodern
[26] http://www.fao.org/3/x2465e/x2465e09.htm
[27] https://www.researchgate.net/publication/13370611_Qualitative_Research_Methods_in_Health_Technology_Assessment_A_Review_of_the_Literature
[28] https://hbr.org/2002/02/getting-the-truth-into-workplace-surveys
[29] https://en.wikipedia.org/wiki/Conditional_probability
[30] https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/
[31] https://mineracaodedados.files.wordpress.com/2012/07/data-mining-in-excel.pdf
[32] http://szeliski.org/Book/drafts/SzeliskiBook_20100903_draft.pdf
[33] https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1
Medical Brain Drain: The Relationship between
Regulation and Emigration
Kimberly Staudt
Department of Computing & Information Science
Mercyhurst University, Erie, PA
United States
Kstaud85@lakers.mercyhurst.edu
ABSTRACT
Few studies use data analysis to examine the effects of regulation on medical brain drain. This paper reports experiments with two models: a hierarchical linear model and an OLS linear model. The former is constructed using the Python package linearmodels with random effects, and the latter with sklearn's linear_model. A variance-component, or random-effects, model is employed on the panel data to correct for missing values and to account for unobserved heterogeneity; heterogeneity assumes that all agents are unique. The results showed a strong positive correlation between GDP per capita and migration, while regulation was insignificant. Infant mortality was found to be negatively correlated with migration.
Keywords
Migration, brain drain, regulation, medicine, regression
1. INTRODUCTION TO BRAIN DRAIN AND MEDICAL REGULATION
What is the role of regulating technology in aiding or preventing brain drain? This
paper seeks to analyze the relationship between the regulation of
medical technology and brain drain. Brain drain is defined as the
exodus of intelligence from a country or a region. This paper
defines a migrating individual’s birth and receiving country as the
exodus and host country, respectively. The importance of
studying this issue is that brain drain limits the opportunity for the
exodus country to increase innovation and economic output. It
affects every global citizen, since brain drain can aid either
economic prosperity or decline. Decreasing brain drain helps exodus countries retain their intellectual reserves and boosts their economies.
Research exposing the relationship between medical regulation and brain drain is limited. This study seeks to fill this gap by utilizing a random-effects, or variance-component, model. Both topics are important economic and social welfare issues, and big data analysis provides the opportunity to analyze the medical brain drain dilemma. This study uses regression methods on panel data. Physician migration data is of limited availability to the public, but more readily available to professional agencies. This study uses 884 observations across 76 countries from 2005 to 2016. Countries include, but are not limited to, Algeria, Canada, India, and the United States. For example, in 2012 Algeria had a total of 3,083 doctors in foreign countries and a high infant mortality rate of 22.4 deaths per 1,000 live births.
2. BRAIN DRAIN AND ECONOMICS
The promise of economic prosperity is what often drives an intellectual away from the exodus country and into a host country. An example of a model host country is the United States (U.S.), which has held the spot of #1 host country for over 14 years [8]. Much of this appeal is due to the U.S. having the highest nominal GDP. The United States Central Intelligence Agency (C.I.A.) reports American Samoa as having the lowest net migration rate, meaning that it loses a larger share of its citizens to emigration than any other country [6]. The question here is whether this emigration can be classified as brain drain. If the negative net migration is due to the exodus of low-skilled workers, then it does not fit the definition of brain drain.
A great threat to the exodus country is the lack of incentive to diversify the fields of study offered to its migrating students. Docquier and Rapoport [2012] describe the resulting market oversaturation by stating, "…brain drain distorts the provision of public education away from internationally transferable education (e.g., exact sciences, engineering, economics, medical professions) and towards country-specific skills (e.g., law), with the source country possibly ending up training too few engineers and too many lawyers;" [9]. The authors then propose monetary penalties for intellectuals who participate in brain drain as an incentive to return to the exodus country. However, this may have little effect on the prevention of brain drain, especially concerning the medical community. For an intellectual who gains more economic opportunity by immigrating to the host country, a brain drain tax would give little incentive to return. This still leaves the exodus country with one less intellectual.
Unfortunately, the issue becomes more complicated when migration is the result of political or social issues, such as war. Since this paper focuses on the economic regulation of medicine and medical technology, the migration data is specific to the medical field. This study uses migration data on high-skilled workers as a proxy variable for brain drain; here, high-skilled workers are defined as individuals who have completed tertiary education.
A great threat to medical technology is monetary investment withheld because of over-regulation: countries that have greater regulation see a decrease in financial investment. In Medical Device: Lost in Regulation, Citron [2012] states, "A slow but inexorable process of added regulatory requirements superimposed on existing requirements has driven up complexity and cost and has extended the time required to obtain device approval to levels that often make such investments unattractive. It must be noted that the market for many medical devices is relatively small. If the cost in time and resources of navigating the regulatory process is high relative to the anticipated economic return, the project is likely to be shelved" [7]. This constitutes both short-run and long-run costs at the expense of both patients and medical personnel. For the aspiring medical student, the appeal of migrating to a country with more robust medical technology is greater than that of remaining in a country with poor medical development.
3. RELEVANT WORK
While brain drain and the regulation of medicine and its technologies have each been researched, fewer studies have examined the relationship between the two topics. Most research on these topics relies, at best, on a combination of survey data and descriptive techniques when analyzing the data. The importance of this study lies in analyzing a hypothesized connection between regulation and the emigration of medical personnel. Understanding the history of the two aids decision making for both public policy and healthcare. Brain drain, the emigration of an intellectual from an exodus country to a host country, negatively affects the exodus country's economy; some of these individuals are medical professionals.
3.1 Brain Drain & Medical Retention
Regulation of medical technology has a negative impact on medical
staff as well as patients. Lofters & Slater surveyed nearly 500
physicians who emigrated because of political/economic issues and
underdeveloped healthcare. This results in a lower supply of
professionals in exodus countries. The surveyed data showed that
many physicians who emigrated were in their thirties and from
South Asia [11]. Asian exodus countries struggle to retain doctors,
while developed countries, such as the United States, have an
oversaturation of doctors.
Additionally, many doctors specialize in a specific practice. Specialization allows professionals to focus on specific tasks to optimize output, but it also narrows employment prospects to a finite number of positions, creating competition between emigrant medical personnel and natural-born citizen physicians. Competition can increase the incentive for both parties to provide optimal healthcare, but it is less desirable for the individual and their income, since competition drives down salaries. This is expected in an oversaturated medical market due to lower bargaining power. It can be deduced that emigrant doctors coming from exodus countries must have a greater incentive to leave than to stay. This explains in part why developing countries with low income per capita have higher brain drain than wealthier nations. Holding other privileges constant, host countries offer a higher income than the exodus country, while simultaneously underpaying physicians.
The goal of Lofters & Slater's study was to survey the physicians and use descriptive statistics to influence public policy. Their research considers political and sociological factors that are thought to be important based on theoretical considerations, rather than automatically discovering such factors using machine learning. Theirs is just one of many studies that use descriptive statistics to consider a limited number of factors, while the approach of using machine learning to look at a broader set of factors, and automatically identify the important ones, remains neglected.
3.2 Brain Drain & Medical Care Quality
The economic cost to the exodus country is in the millions: many medical students study in the exodus country before emigrating. Research notes that, "Low-resource nations spend US$ 500 million each year to educate health workers who leave to work in North America, Western Europe, and South Asia" [Serour, 2009, 7]. Ergo, it can be inferred that this yearly deficit negatively affects the low-income country's economy. Historical data also shows a decrease in healthcare quality for patients who remain in these countries. Serour concludes that there is a positive correlation between brain drain and decreased quality of life and health service in these exodus countries [Serour, 2009, 7].
Organizations such as the World Health Organization (WHO) have laid out goals and steps to prevent the health and income issues that plague developing countries [12]. However, few proposed solutions to brain drain have been implemented successfully; reducing brain drain is difficult due to its complexity and the multiplicity of contributing factors. While Lofters & Slater rely on soft data, Serour's work seeks to contribute to policy with some hard data collected by the WHO. The lack of statistical models in these papers leaves a gap in understanding the relationship between brain drain and medicine. Solving the problem of brain drain has the potential to increase the quality of life in exodus countries. Therefore, this paper hypothesizes that there is a correlation between healthcare quality traits, such as infant mortality rates and life expectancy, and medical professional emigration.
3.3 The Effects of Medical Regulation on Hospitals
An example of medical regulation is restrictions placed on medical equipment and facilities. The demand for a facility to be within a reasonable distance is expected to be high; intuitively, distance is a factor in the quality of healthcare. This is especially true for those with lower income who cannot afford transportation, or those who live outside of urban areas. Trogdon [2009] describes a market where "certification-of-need (CON) regulations require a hospital to show…a cardiac facility, in any given market before states permit entry…" [16]. For hospitals, this results in a difficult choice between providing a highly demanded facility and one that serves both rural and urban communities [16]. Facilities outside of urban centers are less likely to see use, leading to financial loss and the underutilization of surgeons. Trogdon uses the American Hospital Association's data to model this trade-off. He notes that previous studies used classification to model the severity and risk of heart attack in patients [16]. Trogdon's solution utilizes a multinomial logit function to model hospital service parameters. His model suggests that regulation decreases hospital competition and hurts patients. Hence, this study provides important evidence concerning the relationship between regulation and medicine.
Trogdon's solution and evaluation differ from this paper's approach in that his study uses a smaller data set and relies solely on regression techniques. This paper provides a larger data set in addition to classification methods. The problem addressed is similar in that it observes medical regulation. A drawback of Trogdon's study is that it fails to address the effects of regulation on medical personnel, taking a demand-side approach. This study uses data including factors such as employment rates and emigration rates of medical personnel.
3.4 Barriers to Entry: Regulation of Doctors
Another issue that arises is the regulation of a doctor's qualifications. For countries with different standards of medical knowledge, this creates a difficult situation for medical personnel. Medical professionals who cannot receive proper medical training in an exodus country may choose to study in foreign countries. A professional with a greater skill set is incentivized to remain in the host country, due both to personal costs acquired throughout their education and to greater pay for their skills.
Historical research shows that for doctors trained in non-English-speaking countries, 66% of UK supervisors reported complaints from patients concerning many issues [Bhat, 2014, pp.3]. Bhat states that many foreign-born doctors are less likely to understand patient-doctor relationships, medical regulations, and clinical skills. It was also noted that non-whites were far more likely to be reported, hinting at racial factors in the host country [Bhat, 2014, pp.3]. It can be argued that an unequal regulation ratio between exodus and host countries contributes to this.

Bhat's report shows a unique issue that stems from the exodus country. As previously stated, countries with poorly developed healthcare have issues retaining doctors and other citizens. Governments argue that medical regulations are implemented to ensure patient safety, and in countries with poor safety regulation, the demand to emigrate has been shown to increase. Bhat's study uses soft data to observe trends covering migrant doctors and regulations. His study differs from this paper's proposed solution in that it provides a sociological history but has very limited data. Additionally, it does not provide a solution or any evaluation techniques.

3.5 Pharmaceutical Regulations
It has been established that there is a connection between brain drain and poor healthcare in exodus countries. An emerging problem is the lack of pharmaceutical regulation in these countries. Ravinetto [2016] reports that pharmacists and organizations such as the WHO have begun campaigning for better policy and awareness in African communities where disease and poverty are prevalent [14]. He cites a study where Chinese regulators found that counterfeit drugs make up 1%-2% of the market [Pan & Luo, 2016, pp.300]. Counterfeit drugs included antibiotics and antimalarials in high quantities, with deadly consequences; see Table 1 [Baratta, 2012, pp.175]. While Ravinetto does not provide a statistical solution, the remaining studies both utilize statistical analysis of cross-sectional data. Baratta's study goes further by conducting medical experimentation on the drugs to observe their legitimacy.

Observing the relationship between pharmaceutical regulations and brain drain is important because quality healthcare contributes to the incentive to remain in an individual's birth country. This paper uses WHO reports, medical studies, and the World Data Bank's data to characterize the level of regulation implemented in a country.

Table 1 [Baratta, 2012]

3.6 Historical Data
This paper presents a unique approach to the research methods concerning brain drain. Previous research utilizes survey data along with statistical analysis and focuses on the negative effects of brain drain on the exodus country. Fewer studies analyze brain drain together with medical regulations. While this presents a new opportunity, especially concerning medical technology and personnel, there is limited available data. Previous studies typically use surveys and government-sourced data. This paper uses data from the World Health Organization (WHO), the Organisation for Economic Co-operation and Development (OECD), and the World Data Bank. International data is prone to truncated values. Inconsistencies in some international data are often attributed to biased, or a lack of, reporting; countries with corrupt governments may report inaccurate statistics to suit political agendas.
3.7 Data and Expectations
The quality of healthcare is based on the WHO's efficiency index. Countries with poorer healthcare standards receive a score near 0, while countries with maximized healthcare benefits receive a score near 1. Countries with observed higher regulation standards, such as sanitation regulations, have a higher index score.

Life expectancy and infant mortality rates are used as proxies for the quality of life in a country. A healthcare rating index is assigned to each country based on the efficiency of its healthcare system. It is assumed that doctors gravitate toward immigrating to a country with better healthcare and quality of life. Countries with greater regulation, concerning sanitation and accessible medicine, are observed to have a higher index, so this paper uses the healthcare index as a proxy for regulation quality. It is also assumed that professionals seek higher monetary gains,
seeking to move from a developing exodus country to a developed host country. I hypothesize that there is a negative correlation between GDP per capita and the exodus country’s doctor stock. This study uses foreign doctor stock data from 76 countries to model migration patterns among doctors. The model assumes that a doctor who migrates resides only in the host country during the census year.
3.8 Evaluation Techniques
In the past, studies employed regression analysis on less specific data sets. Rather than observing trends within brain drain or regulation alone, this paper uses doctor migration data from the OECD and the World Data Bank. The result is a cross-sectional data set. This paper uses the following variables: GDP per capita, infant mortality rate, life expectancy, an exodus country’s doctor stock, and a healthcare rating index.
4. SOLUTION
This paper’s proposed solution expands upon past research on brain drain and medical regulation by employing its technique on a unique dataset. Previous researchers have implemented regression techniques to analyze the effects of brain drain and regulation separately. This paper analyzes trends between regulatory policy, healthcare quality, and the migration rates of medical professionals. The first experiment employs a random effects model from Kevin Sheppard’s relatively new linearmodels module [2], in addition to scikit-learn, on the panel data. The second experiment uses a traditional OLS linear regression. The data was split 70/30 into training and testing sets, the models were fit, and the fitted models were used for both hypothesis testing and predictions.
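Since the passage above names the exact tooling, a minimal sketch of the two experiments follows. The file and column names (doctor_migration.csv, gdp_per_capita, and so on) are illustrative assumptions, not the study's actual schema; only the use of linearmodels' RandomEffects estimator, an OLS regression, and the 70/30 split are taken from the text.

```python
import pandas as pd
import statsmodels.api as sm
from linearmodels.panel import RandomEffects
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Panel data indexed by (country, year), as linearmodels expects.
# File and column names here are placeholders.
df = pd.read_csv("doctor_migration.csv").set_index(["country", "year"])
exog = sm.add_constant(
    df[["gdp_per_capita", "infant_mortality", "healthcare_index"]])
dep = df["doctor_stock"]

# Experiment 1: random effects on the panel data (partial pooling).
re_results = RandomEffects(dep, exog).fit()
print(re_results.summary)

# Experiment 2: traditional OLS with a 70/30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    exog, dep, test_size=0.3, random_state=0)
ols = LinearRegression().fit(X_train, y_train)
print(ols.score(X_test, y_test))  # R^2 on the held-out 30%
```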
My findings show a relationship between the exodus country’s income and infant mortality rate on the one hand, and doctor migration on the other. An increase in the exodus country’s GDP per capita is correlated with a slight increase in migration. I speculate that this is because individuals in developed countries are satisfied by their country’s wealth, while those in poorer countries lack the opportunity to leave; when a country is developing, individuals who previously could not leave now can do so. An example of this trend is the relationship between the two variables for the exodus country Albania, see Graph 1. However, a higher infant mortality rate in the exodus country is correlated with a decreased doctor stock from the exodus country. Perhaps this is due to the desire to better one’s country, or because of a loss of potential: a higher infant mortality rate is assumed to negatively affect population numbers, and individuals who survive infancy have the potential to become mobile doctors, while those who do not lose this opportunity.
Surprisingly, healthcare quality and regulation showed no significant effect on doctor migration. Perhaps countries with higher indexes impose migration barriers, which leaves doctors with the choice of moving to another developing country or a less developed one. Another scenario is that developing countries compete with developed ones to host doctors by offering greater opportunities. Future work can improve upon this study by including additional features, such as accounting for war, or additional regulatory statistics.
Hypothesis Testing
My research uses international panel data that is prone to being truncated. A random effects model allows for partial pooling, to maximize the precision of the estimated values. Testing showed multicollinearity between life expectancy and infant mortality rates; life expectancy was dropped from the model, which greatly improved the estimates and reduced noise. After normalizing the data, a single-tailed significance test was conducted on the model. This model assumes α = 0.05, where:
Ho: µ1 = µ2
Ha: µ1 ≠ µ2
5. CONCLUSION
The linear model perfectly fit the data, with an R² of 1. Overall, GDP per capita and infant mortality rates are shown to be better predictors of a country’s doctor stock, or migration, see Table 3. Both were significant: GDP per capita was positively, and the infant mortality rate negatively, correlated with doctor migration. A one-unit increase in GDP per capita increased doctor stock by 0.2, while a one-unit increase in the infant mortality rate decreased doctor stock by 245.8, see Table 2 below in Tables & Graphs.
Feature abbreviations for Table 2: IM - Infant Mortality Rate; HC - Health Care Rating Index; GDPP - GDP per Capita.
Table 2: Parameter estimates show that GDP per Capita has a significant correlation with migration.
Table 3: The first 50 predictions, shown as an array.
Graph 1: Doctor Stock by Exodus Country and the GDP per Capita of the Exodus Country, Albania (2005-2015).
6. REFERENCES
[1] Baratta, F., Germano, A., & Brusa, P. (2012). Diffusion of counterfeit drugs in developing countries and stability of galenics stored for months under different conditions of temperature and relative humidity. Croatian Medical Journal, 53(2), 173-184. doi:10.3325/cmj.2012.53.173
[2] Bashtage. (2017). bashtage/linearmodels. Retrieved December 2018 from https://github.com/bashtage/linearmodels
[3] Bhat, M., Ajaz, A., & Zaman, N. (2014). Difficulties for international medical graduates working in the NHS. BMJ: British Medical Journal, 348. Retrieved from https://www.jstor.org/stable/26514841
[4] CEICData. (n.d.). Syria GDP per Capita [2002-2019] [Data & Charts]. Retrieved from https://www.ceicdata.com/en/indicator/syria/gdp-per-capita
[5] CEICData. (n.d.). Venezuela GDP per Capita [1960-2019] [Data & Charts]. Retrieved from https://www.ceicdata.com/en/indicator/venezuela/gdp-per-capita
[6] Central Intelligence Agency. (n.d.). Country comparison: Net migration rate. Retrieved from https://www.cia.gov/library/publications/the-world-factbook/rankorder/2112rank.html
[7] Citron, P. (2011). Medical Devices: Lost in Regulation. Issues in Science and Technology, 27(3), 23-28. Retrieved from http://www.jstor.org/stable/43315484
[8] Department of Social and Economic Affairs. (2015). International Migration Report 2015 (United Nations, Ed.). Retrieved from http://www.un.org/en/development/desa/population/migration/publications/migrationreport/docs/MigrationReport2015_Highlights.pdf
[9] Docquier, F., & Rapoport, H. (2012). Globalization, Brain Drain, and Development. Journal of Economic Literature, 50(3), 681-730. Retrieved from http://www.jstor.org/stable/23270475
[10] Hautamaki, V., Karkkainen, I., & Franti, P. (2004). Outlier detection using k-nearest neighbor graph. Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004). doi:10.1109/icpr.2004.1334558
[11] Lofters, A., Slater, M., Fumakia, N., & Thulien, N. (2014). “Brain drain” and “brain waste”: Experiences of international medical graduates in Ontario [Abstract]. Risk Management and Healthcare Policy, 81-89. Retrieved from https://pdfs.semanticscholar.org/f29c/442619aa802d865fbb772106b37461edaf49.pdf
[12] Millennium Development Goals (MDGs). (2017, October 17). Retrieved from https://www.who.int/topics/millennium_development_goals/en/
[13] OECD. (n.d.). Retrieved from https://stats.oecd.org/Index.aspx?QueryId=68336
[14] Pan, H., Luo, H., Chen, S., & Ba-Thein, W. (2016). Pharmacopoeial quality of antimicrobial drugs in southern China. The Lancet Global Health, 4(5). doi:10.1016/s2214-109x(16)00049-8
[15] Ravinetto, R., Vandenbergh, D., Macé, C., Pouget, C., Renchon, B., Rigal, J., . . . Caudron, J. (2016). Fighting poor-quality medicines in low- and middle-income countries: The importance of advocacy and pedagogy. Journal of Pharmaceutical Policy and Practice, 9(1). doi:10.1186/s40545-016-0088-0
[16] Serour, G. (2010). Healthcare Workers and the Brain Drain. Obstetric Anesthesia Digest, 30(3), 141. doi:10.1097/01.aoa.0000386811.00064.72
[17] Trogdon, J. (2009). Demand for and Regulation of Cardiac Services. Retrieved from https://www.jstor.org/stable/25621506
7. Acknowledgements
I give my thanks to Dr. M. Afzal Upal of Mercyhurst University for his guidance during my research. I would like to give special thanks to Kevin Sheppard, the developer of linearmodels.
About the Author
Kimberly Staudt is a data science graduate student at Mercyhurst University. In 2017 she was the Press Secretary for the Mercyhurst Data Science Club. She has a Bachelor of Arts in economics and a minor in sociology from Duquesne University. During her undergraduate studies she was the Vice-President of the Economic Student Union. Previously, she has interned for State Budget Solutions as a data analyst and op-ed writer. Her interests include medical science and psychology, economics, physics, philosophy, and fine art.
Predicting Future Poaching Sites in African Reserves
Stephanie Le Grange
Department of Computing & Information Science
Mercyhurst University
Erie, PA
Slegra78@lakers.mercyhurst.edu
ABSTRACT
Poaching has always been an issue in Africa not only due to the
massive loss of wildlife, but also due to the impact that the loss of
these animals has on the ecosystem and on the local community.
Poaching is driven by the demand for ivory and rhino horns, but
with the advancements in the illegal trade markets we have seen
an increase in poaching rates. Illegal animal trade can be reduced
if we can determine poaching hot spots [1]. This can be done with
the help of African reserves as well as the rangers that are
patrolling these areas. Rangers help collect data on animal observations, locations with signs of illegal activity, and poachers [2]. This data can then be used to provide patrol managers with tools that analyze what they have collected to generate forecasts of poachers’ behavior as well as future patrolling routes [2].
There have been other models built to assist rangers in patrolling the vast conservation areas in Africa, including PAWS, CAPTURE, and INTERCEPT. All of these models have their own advantages and disadvantages, but all have helped improve rangers’ patrol routes, capture poachers, and remove snares.
My results from this project show when an attack would be successful based on the date and location in question, providing rangers with a better estimate of where they should patrol next.
1. INTRODUCTION
African mammal populations have shown a dramatic decline in size over the years, while poaching and illegal wildlife trade have continued to grow since the late 2000s [3]. This loss in species has consequences for the surrounding ecosystem [2] as well as the local economies that depend on these animals to help drive tourism. With new advancements in technology as well as social media, it is getting easier for poachers to move poached artifacts or animals to other countries with little to no detection. To help reduce the loss of African wildlife, conservation organizations assign rangers to protect these large areas, but due to the harsh environment, the size of the areas that need to be patrolled, and the limited number of rangers, it is difficult to actively protect these areas and the animals in them from a growing number of poachers.
CAPTURE has two layers; the first layer predicts the
attackability of an area while the second layer predicts the
likelihood of an attack being seen given the patrol routes [2].
INTERCEPT is efficient in assisting rangers with patrol
planning, while PAWS patrol planning is generated on
potential risk of an area not on predicted poacher attack areas
[2]. These technologies along with a new AI camera called
TrailGuard AI are being used in African reserves to assist
rangers with catching poachers before they kill the animals
that they are looking for.
Along with using AI and machine learning algorithms
rangers are teaming up with conservation organizations and
have been altering and coloring rhino horns to lessen their
appeal to poachers.
We need to continue looking into what is motivating
poachers to carry out their actions. One driving factor is that
the benefits to the poacher outweigh the risk of being caught.
Real world data is also noisy, and data can only be collected
from areas that are being patrolled [2].
2. RELEVANT WORK
Poaching has grown exponentially in the past few years and the
illegal trade of animals and animal parts is one of the driving
forces for this. The majority of the items being illegally traded are derived from African elephant ivory. Many of these poachers have very little to no experience with weapons, but they are skilled at avoiding detection.
In 2013 alone 51 tons of ivory was seized, and the number of
elephants killed was estimated at 50,000 [4] which is significant
given that there were only about 434,000 elephants remaining that
year.
Ivory prices have also increased over the years. In 2013 black
market ivory averaged between $2,500 and $3,000 per kilogram
and rhino horns prices were reported at $65,000 per kilogram [3].
Researchers have also investigated the question of where poachers are recruited and found that many are recruited in local villages; 32% of villagers knew who the poachers were but were not willing to identify them [3].
Past research has shown that the distribution of poaching sites is
non-random and instead there is a spatial autocorrelation [5].
There has also been a strong correlation between poaching sites
and the landscape especially areas near water [5]. This
information is important in helping with predictions of future
poaching sites.
CAPTURE has a few shortcomings that INTERCEPT improves on. CAPTURE takes hours to run and the model is difficult to understand [2], which does not work for rangers, as they have limited access to computing power and are not well versed in computing. INTERCEPT uses decision trees along with BoostIT to provide predictions of potential future poaching sites.
Work is also being done in DNA testing of seized ivory to help
narrow down the origin of the ivory, so they know whether it is
from savannah or forest elephants [4]. DNA analyses can help
predict where the animals are being removed from, which can then
help rangers intercept poachers, helping stop the trade before it
happens [1].
3. PROPOSED SOLUTION
Finding relevant real-world data on poaching sites was challenging, as much of this data is not made publicly available due to its sensitive nature. I pulled data from two different sources, which provided the date of each poaching incident, its location (longitude and latitude), and the number of animal carcasses found at each location. I prepared two train/test splits of the dataset: one trained on 2002-2012 to evaluate data from 2013, and one trained on 2002-2017 to evaluate data from 2018. I use a binary decision tree along with a regression model to predict the probability of a potential future poaching site, and I measure the accuracy score to evaluate the model.
4. METHOD
I started by importing the necessary libraries: NumPy, Pandas, Seaborn, and Matplotlib. I then pulled in the data and added labels to each column of the dataset. The dataset has 660 observations and 5 features. The reported date and the description were factorized before assigning x and y. X represents the features we want to test on, and Y is the prediction we want to obtain. X consists of the following columns: Number Killed, Date Reported, Longitude, and Latitude, while Y predicts the likelihood of a poaching incident at a given location on a given day. Before running the train/test split, I checked the shapes of x and y and transposed x so that the shapes matched.
The train/test split was set to a test size of 0.3 with random state 0. I then imported linear_model from sklearn as well as pyplot and printed the shapes of the training and test data. A decision tree classifier with a max depth of 3 was run on the training data to report the numbers of true negatives (TN), true positives (TP), false negatives (FN), and false positives (FP). These results are printed with true negatives and false positives on the top row, and false negatives and true positives on the bottom row: TN is the non-successful attacks predicted correctly, TP is the successful attacks predicted correctly, FN is the successful attacks wrongly predicted to be non-successful, and FP is the non-successful attacks wrongly predicted to be successful. The resulting confusion matrix shows 145 true negatives and 1 true positive.
I then ran the same code with a random forest classifier and reprinted the confusion matrix; the results show a decrease in the number of true negatives and an increase in the number of true positives.
Finally, I built a model that allows the user to enter data for their next patrol route to predict how many resources they should assign to a given location on a given day. The user enters the month, the day, and the location (longitude and latitude), and the model prints out a statement on whether or not an attack is likely. With this information, rangers can assign more resources to higher-risk areas, which might increase the chances of poachers being caught in the act and reduce the number of animals added to the endangered and extinct lists.
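The steps above can be illustrated with a short sketch. This is a minimal reconstruction under stated assumptions: the file name and the binary outcome column successful_attack are placeholders for the study's actual schema, while the depth-3 decision tree, the random forest comparison, the 0.3 test size, and random state 0 follow the text.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

df = pd.read_csv("poaching_incidents.csv")  # placeholder file name
df["date_reported"] = pd.factorize(df["date_reported"])[0]

X = df[["number_killed", "date_reported", "longitude", "latitude"]]
y = df["successful_attack"]  # assumed 0/1 label for attack success

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Decision tree with max depth 3; confusion_matrix prints [[TN FP], [FN TP]].
tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
print(confusion_matrix(y_test, tree.predict(X_test)))
print(accuracy_score(y_test, tree.predict(X_test)))

# Re-run with a random forest for comparison.
forest = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(confusion_matrix(y_test, forest.predict(X_test)))
```

A user-facing prediction then reduces to calling forest.predict on a single row containing the month, day, longitude, and latitude the ranger enters.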
5. DISCUSSION
There has been discussion on how to get communities involved in helping to reduce poacher recruitment in their villages. The engagement of the community can also assist in the speedy arrest of poachers, as seen in Namibia [3].
Technology should be deployed to assist with these efforts to
protect Africa’s wildlife and technology can assist with keeping
records and analyzing future deployment of rangers based on
poaching patterns [3].
Having rangers record seasonal patterns as well as distances from bodies of water, roads, and shelters would help increase the accuracy of models like this one, as these factors are thought to weigh heavily when poachers choose locations to hunt. For example, during a drought season more animals gather around the last few water holes, making those more likely spots for poachers, and the same can be said of the weather and the hour of the day.
Poachers might be more likely to attack during a specific hour
depending on the weather.
6. CONCLUSION
Technology will play a big role in the future of conservation
in Africa and in reducing the number of poaching incidents
as well as illegal trafficking, but it will not be a replacement
for well-trained rangers. Rangers will continue to play an
important role in the protection of African wildlife as they
know the bush, can spot signs that intruders have been there,
and are willing to put themselves in harm’s way to protect
the animals [3].
Studies should continue not only on the potential of a poaching location, but also on the motivation of the poacher to take the risk of going through with his actions. More research needs to be done on poaching sites, especially with the collection of real-world data. Models also need continued improvement to help them distinguish between relevant data and noise. In the future it would be useful to have a program that can automatically enter locations and show them on a map; this would remove the tedious work of manually entering dates and locations by longitude and latitude.
7. REFERENCES
[1] S. K. Wasser et al. Combating the Illegal Trade in African Elephant Ivory with DNA Forensics. Conservation Biology 22, 1065-1071 (2008).
[2] D. Kar et al. Cloudy with a Chance of Poaching: Adversary Behavior Modeling and Forecasting with Real-World Poaching Data. International Conference on Autonomous Agents and Multiagent Systems (2017).
[3] B. Anderson and J. Jooste. Wildlife Poaching: Africa’s Surging Trafficking Threat. Africa Security Brief (2014).
[4] S. K. Wasser et al. Genetic assignment of large seizures of elephant ivory reveals Africa’s major poaching hotspots. Science 349, 6243 (2015).
[5] M. J. Shaffer and J. A. Bishop. Predicting and Preventing Elephant Poaching Incidents through Statistical Analysis, GIS-Based Risk Analysis, and Aerial Surveillance Flight Path Modeling. Tropical Conservation Science, 525-548 (2016).
About the author:
Stephanie Le Grange is a graduate student at Mercyhurst University in Erie, PA. She earned her undergraduate degree in Environmental Science at Edinboro University.
Mass Shootings in the United States: An Analysis and
Prediction
Dayana Moncada
Mercyhurst University
501 East 38th Street
Erie, Pennsylvania 16501
dmonca63@lakers.mercyhurst.edu
ABSTRACT
One thousand, one hundred and fifty-three people have died in the course of U.S. history because of mass shootings. Although these occurrences are only a tiny fraction of gun violence, they are still terrifying because of their unexpected nature. In this research, we utilized linear regressions, using scikit-learn, to analyze our data. We were not able to conclude how to predict mass shootings, but we were able to bring to light some conclusions and correlations that can serve as a guide for future research.
Keywords
Mass shootings, crime prediction, linear regression, data science, Python
1. INTRODUCTION
Some of the deadliest mass shootings in modern U.S. history have occurred in the past few years. These include the shooting at the Pulse nightclub, where 49 people were killed; the Las Vegas concert shooting at Mandalay Bay, where 58 people were killed; and the shooting at Marjory Stoneman Douglas High School in Parkland, Florida, which killed 17 people and mobilized the country into one big rally across the nation. Unfortunately, these are not the only instances. One thousand, one hundred and thirty-five people have been killed in mass shootings in the United States’ modern history. A mass shooting is commonly defined as an incident in which more than four people are killed, but there is no single definition that everyone agrees with. The Washington Post, among other media outlets, has defined a mass shooting as one in which “four or more people were killed by a lone shooter (two shooters in a few cases). It does not include shootings tied to gang disputes or robberies that went awry, and it does not include domestic shootings that took place exclusively in private homes. A broader definition would yield much higher numbers.” [1]
As the media’s coverage of mass shootings has increased, the American population has become more aware of such events. The timeline of a mass shooting and its repercussions, in the eye of the public, is eerily similar each time: it starts with the occurrence of the shooting, followed by millions of posts on social media timelines. The outrage from the public often becomes visible; activists call for gun legislation and stricter gun control. This usually lasts for a couple of weeks. Finally, the whole situation is forgotten until another similar incident occurs. The idea of a mass shooting has become so normalized in our society that people cannot rely on gun legislation to deter the casualties and psychological damage caused by these incidents.
With the growth of an insatiable 24-hour news cycle, social media, reality TV, and a culture that thrives on instantaneous access to information, it is no wonder there has been a symbiotic, metastatic growth of the consumer/producer dynamic that feeds into the worst parts of our nature. The more content produced, the more content we consume (Pescara-Kovach et al., 2017) [2].
2. RELATED WORK
To understand how mass shootings have occurred throughout the years, their societal implications, and what analyses have been done in academia, we investigated different journals and articles to better understand, and possibly build on, already ongoing research. The articles pertinent to this topic came from academic journals.
The paper by Towers et al. was chosen as the center of this project due to its statistical methods and its contagion model: “We fit a contagion model to recent data sets related to mass shootings in the US, with terms that take into account the fact that a school shooting or mass murder may temporarily increase the probability of a similar event in the immediate future, by assuming an exponential decay in contagiousness after an event” (Towers et al.) [3].
According to the authors, past studies have found that media reports of suicides and homicides appear to subsequently increase the incidence of similar events in the community, apparently due to the coverage planting the seeds of ideation in at-risk individuals to commit similar acts.
It is very interesting to see these findings given the increasing news coverage of mass shootings over the past couple of years. The authors also found that a state’s prevalence of firearm ownership is significantly associated with the state’s incidence of mass killings with firearms, school shootings, and mass shootings. Their research topic and direction are similar to ours in terms of methodology. Their research was performed through binned statistical analysis, a type of analysis the authors claim is used in the life and social sciences to distinguish between null and alternate hypotheses.
A very interesting and challenging situation the authors faced was having contradictory outcomes when testing their model. As an example of the advantage of unbinned likelihood methods in increasing the statistical power of an analysis, they compare and contrast two recent analyses of contagion in mass killings in America, both of which were based on exactly the same data but used different methodology. One concluded that there was evidence of contagion in mass killings. [4]
Their model is built and evaluated through unbinned statistics. These concepts were illustrated in their comparison of two analyses of contagion in mass killings that have appeared in the literature, both of which used exactly the same data but different analysis methodologies. The second data set is explained in Lankford and Tomek (2017), discussed later in this section. The Towers et al. (2015) [5]
“analysis of mass killings used unbinned maximum likelihood
methods to examine the temporal distribution of the events, and
found significant evidence of contagion [6]”. In contrast, according to Towers et al., the very coarsely binned method used by Lankford and Tomek (2017) was “not sensitive to differences between the null hypothesis model of no contagion and an alternate hypothesis of a self-excitation contagion model” [7]. The comparison of the two analyses provides an excellent example of the power of unbinned likelihood methods. It is important to note that Lankford and Tomek took a different approach; the preceding citations and wording are from Towers et al. The model used by Lankford and Tomek will be considered further below.
It is not easy to deny that the media has had a very important influence on the modern world. With the expansion of technology, smartphones, and social media, news reaches the palm of one’s hand in a matter of seconds; whether it is fake or not is a different matter. Several media outlets have covered the topic of mass shootings for a while now. We are especially interested in the different approaches taken to diminish and deter mass shootings in the United States. The New York Times has suggested “Mass Killings May Have Created Contagion, Feeding on Itself,” and a recent headline in The Washington Post asked “Are Mass Shootings Contagious? Some Scientists Who Study Viruses Say Yes” (Carey, 2016; Rosenwald, 2016) [6]. The approaches to understanding and preventing mass shootings are vast and complex and have been broadcast through different media outlets. Lankford and Tomek start their article by explaining the contagion effect in the incidence of suicide.
Lankford and Tomek write: “Some researchers have found that suicide rates increase in the days after a highly publicized suicide, such as that of a celebrity or well-known fictional character (Abrutyn & Mueller, 2014; Niederkrotenthaler et al., 2010; Phillips, 1974; Wasserman, 1984).” [7] The authors
continue by explaining the similarities between these incidents and mass killers: mass killers, in most cases, commit suicide after committing their intended crimes.
Approximately 30% of mass killers die by suicide or refuse to
surrender and are killed by police, which constitutes “suicide by
cop” (Duwe, 2004; Lankford, 2015; Lindsay & Lester, 2004) [8].
Social contagion in the case of mass killings follows a precedent similar to the social contagion of suicides. When applied
to mass killings, the social contagion thesis suggests that
perpetrators receive so much attention for their attacks that each
high-profile killer ends up “infecting” the minds of other
impressionable individuals (Kisner, 2016; Towers et al., 2015). The
authors continue by explaining that Kissner (2016) found that in the
United States from 2000 to 2012, there was an increased risk of
active shootings in the 14 days following an incident, and Towers
et al. (2015) similarly found that from 2006 to 2013, there was an
increased risk of mass killings and school shootings in the 13 days
following a previous incident. However, there is speculation that these findings are not entirely helpful or true, since Kissner (2016) and Towers et al. (2015) based their studies on incident dates. Joiner (1999), in his article The Clustering and Contagion of Suicide, argues that chronological clusters of mass killings that are more prevalent than would be expected at random do not necessarily provide evidence of contagion effects.
Lankford and Tomek continue by explaining that incident clusters could be attributed to other social and environmental factors such as political cycles, stock market gains or losses, or other news events unrelated to crime. This poses a challenge to the initial question of whether media contagion transmits to potential mass killers the urge to commit a crime, thereby making these incidents difficult to deter and prevent. We believe this challenge is not a reason to completely change the original question but a push to keep looking for an answer or answers. The findings made by Lankford and Tomek challenge what is known and studied.
Contagion cannot occur without transmission, say Lankford and Tomek on page 460 of their article. The social contagion thesis requires that the imitative mass killer be at least indirectly exposed to the model killer’s behavior. However, although mass murderers receive a large amount of media attention, Duwe (2004) found that only 45% of all mass killings in the United States from 1976 to 1999 were even covered by The New York Times. [8]
This might suggest that the media does not play an important part in the propagation of such news. However, it is worth mentioning that between 1976 and 1999, social media and today’s rapid news delivery did not exist, so the finding is not directly relevant to media contagion in modern times, though it is still worth taking into consideration. An important aspect of Lankford and Tomek’s study is that they noted the importance of differentiating between high-profile and low-profile incidents to account for variation in the amount of attention that each incident receives (460).
The methodology used was the same as in Towers et al. (2015). The data set contains 232 mass killings in the United States from 2006 to 2013 and provides the population of incidents, not just a sample. A very interesting part of this study was the use of a second, additional data set of randomly generated dates simulating 232 mass killing incidents across an 8-year time frame (2006-2013), for a total of 116,000 randomly simulated dates (461).
The findings of this paper challenge the first article mentioned: their statistical examination showed that although very high-profile incidents such as mass shootings resulted in more public attention, “this did not significantly increase either the proximity of the event or the number of events within the next 14 days” (464). [9]
The findings were concluded as follows:
“Overall, the present study’s findings have direct
implications for crime prevention and response. If the data showed
that risks of mass killings were significantly greater in the days
following high-profile incidents, officials would be wise to issue
alerts during these critical periods to inform the public of the
heightened risks. This strategy would be similar to the U.S. Centers
for Disease Control and Prevention’s public alerts following the
outbreak of contagious viruses and could help people take
precautionary measures until the dangerous period passed.
However, because chronological clusters of mass killings appear
like randomly distributed events, law enforcement officials have a
more difficult challenge: encouraging constant vigilance. Previous
research suggests that warning signs often exist: School shooters,
for example, are prone to tell at least one person about their violent
plans prior to striking (Pollack, Modzeleski, & Rooney, 2008).
Unfortunately, their comments are often dismissed or ignored
(Levin & Madfis, 2008; Newman et al., 2004) . Security officials
should do everything they can to ensure that these critical warning
signs are taken equally seriously at all times of the year, regardless
of the recency of previous mass killings (465).”
Despite the fact that it challenges the previous article, it provides an impetus to keep digging and researching, and eventually to create a tangible solution concerning how to deter mass killings: whether media contagion serves as a source of mass killings spreading throughout the country, or whether something else causes offenders the urge to inflict death and sorrow in the United States.
Towers et al. [10] support the analysis performed in both of the previous articles. Binned statistical methods are used frequently in the social and life sciences. The background given by the authors explains how this type of analysis is used: binned statistics methodology is based on the moments of a distribution (such as the mean and variance). These methods have the advantage of simplicity of implementation and simplicity of explanation.
The authors, Towers et al., discuss the advantages of unbinned likelihood methods: “in increasing the statistical power of an analysis, here we compare and contrast two recent analyses of contagion in America, both of which were based on exactly the same data, but used different methodology. One concluded that there was evidence of contagion in mass killings, while the later analysis contradicted this claim” (2).
How the two differ is simple yet not easy. In 2015, Towers et al. published their findings under the hypothesis that a mass killing temporarily raises the probability of a similar event occurring in the near future, with an exponential decay of the probability: a mass killing appears to inspire approximately 0.28 new mass killings ([0.10, 0.56], 95% CI), with an average exponential decay period of approximately 13 days (2).
On the other hand, Lankford and Tomek published an article in 2017 claiming the complete opposite of Towers et al. (2015)’s results. Lankford and Tomek (2017) based their conclusion on an analysis of how many events occurred within 14 days of a prior event, under the null hypothesis assumption that the data were randomly and uniformly distributed in time. They compared the mean and variance of the observed distribution to the mean and variance expected under their null hypothesis, and performed statistical tests of the null hypothesis using these quantities with Student’s t, F, and Z tests (2).
The main difference was that Towers et al. used an unbinned maximum likelihood method while Lankford and Tomek (2017) used a simple binned analysis.
3. DISCUSSION
a. Problem and Motivation
There is not a day when we open our news outlets and do not see deaths and multiple killings at the hands of violence. We may wonder whether there is something that private citizens or corporations should do, and whether safer and stricter gun legislation will be enough. There is a limited amount of power over the nature of mass shootings.
The motivation for this project is to use technology and science to find a way to better analyze mass shootings, and to use information science resources to find a solution or to predict the likelihood of these incidents happening.
b. Data Collection Methodology
The data was gathered from a full data set from an in-depth investigation into mass shootings hosted on Kaggle, ranging from 1966 to 2019, a total of 53 years’ worth of data. It is restricted to continental United States cities. The data can be found at https://www.kaggle.com/zusmani/us-mass-shootings-last-50-years [12].
The following list shows what is included in the database and how it is organized. It is important to note that the owner of the data maintains it as an ongoing project, adding instances of mass shootings as they occur.
Data Fields: Title, Location, Date, Incident Area, Open/Close Location, Target, Cause, Summary, Fatalities, Injured, Total victims, Policeman Killed, Age, Employed, Employed At, Mental Health Issues, Race, Gender, Latitude, and Longitude.
Data Coverage: 1966-2019, updated frequently. The key inclusion criteria were that the perpetrator took the lives of at least four people, that the killings occurred in a public space, and that the shooting was a spree killing or mass murder. Shootings based on conventionally motivated crimes such as armed robbery, gang violence, or domestic abuse are not included.
The features of the data that will be used are:
Title - A case name that the shooting has been assigned
Location - The city and the state where the shooting occurred
Date - The date of the shooting
Incident Area - School, church, park, etc.
Open/Close Location - Whether the shooting occurred in an open space, a closed space, or both
Target - The victims of the shooting
Cause - The shooting’s cause, e.g. racial, terrorism, psychotic outbreak, etc.
Summary - A brief summary of the event
Fatalities - Death count (also counts the shooter, if they were killed or committed suicide afterwards)
Injured - Injury count
Total victims - Injury count plus death count (excludes the perpetrator if he/she was killed)
Age - Age of the shooter; data is only available for events with a single shooter
Mental_Health_Issues - Whether the shooter had mental health issues, or whether the case is unclear or unknown
Race - Race of the shooter
Latitude - Self-explanatory
Longitude - Self-explanatory
During the analysis, the variable Date was divided into three parts (day, month, and year) for better analytical purposes.
c. Exploratory Analysis
To gain some background, we performed exploratory analysis on the data. We loaded the data as a Pandas data frame and based the analysis on location. We split the Location field to look deeper into targeted areas. The result is a bar chart titled “Top 10 States of Mass Shootings,” which shows that California has had the most mass shooting incidents in the last 53 years, followed by Pennsylvania, Maryland, and Florida. Furthermore, we produced the same bar chart for US cities.
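A minimal sketch of this exploratory step follows. The file name and the "City, State" format of the Location field are assumptions; the Location split, the per-state counts, and the bar chart follow the description above.

```python
import pandas as pd
import matplotlib.pyplot as plt

# File name and encoding are assumptions about the Kaggle export.
df = pd.read_csv("Mass Shootings Dataset.csv", encoding="latin-1")

# Split "City, State" and count incidents per state.
df["State"] = df["Location"].str.split(", ").str[-1]
top10 = df["State"].value_counts().head(10)

top10.plot(kind="bar", title="Top 10 States of Mass Shootings")
plt.tight_layout()
plt.show()
```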
d. Map Analysis
We analyzed longitude and latitude. This analysis was chosen because maps are easy to read and give a visual understanding of mass shooting occurrences.
See Appendix A for Figure 1.
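A hedged sketch of such a map view (cf. Figure 1) is given below; reading the Kaggle CSV this way and sizing markers by Fatalities are assumptions, not the paper's actual plotting code.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("Mass Shootings Dataset.csv", encoding="latin-1")

# Plot each incident at its coordinates; marker size scales with fatalities.
plt.scatter(df["Longitude"], df["Latitude"],
            s=df["Fatalities"] * 5, alpha=0.5)
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Shooting Fatalities by Latitude/Longitude in the United States")
plt.show()
```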
4. SOLUTION
In this paper, we focus on answering the question: are mass shootings random events, or are they events that could be considered “copycat” crimes due to contagion? Following this question, we ask whether mass shootings can be predicted and what the relationship between years and total victims is across the last 53 years. Unfortunately, due to the way the data is arranged, the date and time of mass shootings cannot be predicted because the data is scattered. However, we can draw conclusions and create linear regressions among the variables to understand what can be done by law enforcement and researchers.
Most of the studies done with regression and Poisson distributions show that, unfortunately, mass shootings are random events that cannot be predicted, and that a shooting occurring days, weeks, or months after another shooting does not have anything to do with it.
Crime prediction has been found useful when there are different types of crimes in the dataset; because our dataset is restricted to mass shootings only, it is more difficult to train, test, and evaluate an algorithm.
Studies have implemented regression analysis to understand the relationships between the features. We employ linear regression and draw conclusions for better prediction, as well as checking our answer with a Random Forest algorithm and Gradient Boosting.
Our research uses the NumPy package, a Python package that allows performant analytical operations on single and multi-dimensional arrays and offers easy mathematical analysis of the data. We also utilize the scikit-learn package, a widely used Python library for machine learning. Scikit-learn provides preprocessing of data, dimensionality reduction, regression, classification, and clustering, among others.
We will be analyzing total victims and years. We followed five
steps when implementing this linear regression.
1. Import packages and classes
2. Provide data to work with and do transformations
3. Create a regression model and fit it with existing data
4. Check the results of the model fitting to know if the model performed satisfactorily
5. Apply the model for predictions
The model assumes a regression line y = mx + c, where y is the dependent variable, x is the independent variable, m is the slope, and c is the y-intercept.
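The five steps can be sketched as follows. This is a minimal illustration under stated assumptions: the file name, the date parsing, and the 70/30 split are placeholders and assumed choices, while the Year and Total victims features and the R squared/RMSE/MAE checks follow the text.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Step 1: import packages; Step 2: provide and transform the data.
df = pd.read_csv("Mass Shootings Dataset.csv", encoding="latin-1")
df["Year"] = pd.to_datetime(df["Date"]).dt.year

X = df[["Year"]]
y = df["Total victims"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Step 3: create and fit the regression model.
model = LinearRegression().fit(X_train, y_train)

# Step 4: check the fit; Step 5: apply the model for predictions.
pred = model.predict(X_test)
print("R^2: ", model.score(X_test, y_test))
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
print("MAE: ", mean_absolute_error(y_test, pred))
```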
5. RESULTS
We analyzed the data using linear regression. The features used
were Years and Total Victims. The goal is to build a model which
will learn and will be able to predict the number of victims for the
following years.
We separated the features and the dependent variable into variables x and y, and we used the LinearRegression class. We did not apply feature scaling, which is not needed for ordinary least-squares regression.
After building our linear regression model, we calculated R squared, which came to -0.1366. Our root mean squared error was 14.8025, which means that our model predicted the total victims per year in the test set to within about 14.80 victims. To check this, we also computed the mean absolute error, which came to 13.1941, close to the root mean squared error.
To check our answers rather than relying on a single model, we also tried a Random Forest algorithm and a Gradient Boosting algorithm; however, their results did not seem realistic and the models did not work well with our dataset. They were left in the code for future work.
6. CONCLUSION
Our findings show a relationship between total victims and years: as years pass, total victims of mass shootings increase. Our assumption is that this is due to the political environment in the United States, such as the lack of focus on gun reform at the federal level of government, as well as studies showing that mental health does not make someone become a mass shooter.
Unfortunately, our results explain only a small fraction of why and how a mass shooting occurs. At the public policy level, mass shootings are a complicated topic. Studies have shown that these events are random, and the government and private citizens can only follow certain precautions.
The following factors are common to communities where mass shootings have occurred [13]:
1. More access to mental health resources; this happens because mass shootings tend to occur in urban areas, whereas rural areas have a shortage of mental health physicians.
2. Lack of socialization: less time for physical activity and fewer recreation areas.
3. Income inequality.
4. Stricter gun laws are correlated with lower risks of mass shootings.
These final conclusions are not set in stone, and while we may not yet have the solutions to deter mass shootings, we can help communities become healthier through exercise and access to mental health care, as well as by taking into consideration income inequality and limited access to opportunities for social mobility.
7. FUTURE WORK
The size of the dataset is essential and plays an important role in the predictive analysis of mass shootings. We were only able to gather 53 years’ worth of data, from which we were able to see that, on average, two mass shootings happen per year. This means that we will need to acquire more data to have better success in the analysis. Our suggestion is to gather data from other developed countries such as the United Kingdom, New Zealand, and/or Australia.
REFERENCES
[2] Lisa Pescara-Kovach and Mary-Jeanne Raleigh. 2017. The Contagion Effect as it Relates to Public Mass Shootings and Suicides. The Journal of Campus Behavioral Intervention 3, 35-45 (2017).
[3] Sherry Towers, Anuj Mubayi, and Carlos Castillo-Chavez. 2018. Detecting the contagion effect in mass killings; a constructive example of the statistical advantages of unbinned likelihood methods. PLoS One 13, 5 (2018). DOI: http://dx.doi.org/10.1371/journal.pone.0196863
[4] Sherry Towers, Anuj Mubayi, and Carlos Castillo-Chavez. 2018. Detecting the contagion effect in mass killings; a constructive example of the statistical advantages of unbinned likelihood methods. PLoS One 13, 5 (2018). DOI: http://dx.doi.org/10.1371/journal.pone.0196863
[5] Sherry Towers, Andres Gomez-Lievano, Maryam Khan, Anuj Mubayi, and Carlos Castillo-Chavez. 2015. Contagion in Mass Killings and School Shootings. PLoS One 10, 7 (2015). DOI: http://dx.doi.org/10.1371/journal.pone.0117259
[6] Michael S. Rosenwald. 2016. Are mass shootings contagious? Some scientists who study how viruses spread say yes. (March 2016). Retrieved April 23, 2019 from https://www.washingtonpost.com/local/are-mass-shootings-contagious-some-scientists-who-study-how-viruses-spread-say-yes/2016/03/07/be44866a-df31-11e5-846c-10191d1fc4ec_story.html
[7] Sherry Towers, Andres Gomez-Lievano, Maryam Khan, Anuj Mubayi, and Carlos Castillo-Chavez. 2015. Contagion in Mass Killings and School Shootings. PLoS One 10, 7 (2015). DOI: http://dx.doi.org/10.1371/journal.pone.0117259
[8] Grant Duwe. 2016. The Patterns and Prevalence of Mass Public Shootings in the United States, 1915-2013. The Wiley Handbook of the Psychology of Mass Shootings (2016), 20-35. DOI: http://dx.doi.org/10.1002/9781119048015.ch2
[9] Adam Lankford and Sara Tomek. 2017. Mass Killings in the United States from 2006 to 2013: Social Contagion or Random Clusters? (July 2017). Retrieved April 23, 2019 from https://onlinelibrary.wiley.com/doi/full/10.1111/sltb.12366
[10] Sherry Towers, Andres Gomez-Lievano, Maryam Khan, Anuj Mubayi, and Carlos Castillo-Chavez. 2015. Contagion in Mass Killings and School Shootings. PLoS One 10, 7 (2015). DOI: http://dx.doi.org/10.1371/journal.pone.0117259
[11] Sherry Towers, Andres Gomez-Lievano, Maryam Khan, Anuj Mubayi, and Carlos Castillo-Chavez. 2015. Contagion in Mass Killings and School Shootings. PLoS One 10, 7 (2015). DOI: http://dx.doi.org/10.1371/journal.pone.0117259
[12] Zeeshan-ul-hassan Usmani. 2017. US Mass Shootings. (November 2017). Retrieved April 23, 2019 from https://www.kaggle.com/zusmani/us-mass-shootings-last-50-years
[13] Peter Hess. Communities With Mass Shootings Share 4 Common Traits, Study Shows. Retrieved May 2, 2019 from https://www.inverse.com/article/50072-why-are-there-mass-shootings-some-places-and-not-others
About the author:
Dayana Moncada is a graduate student in Data Science at Mercyhurst University in Erie, Pennsylvania.
Appendix A.
Figure 1. Shooting Fatalities by Latitude/Longitude in the United States
Figure 2. U.S. Mass Shootings Victim Count from 1966 – 2019.
Figure 3. Victims Grouped by Years.
Figure 4. Top 10 States of Mass Shootings in the past 53 years.
Figure 5. An Approximation in the Linear Regression graph.
Are ISIS Sympathizers More Like Republicans or “Water
For All” Charity Members?
M. Afzal Upal
Department of Computing & Information Science
Mercyhurst University, Erie, PA, 16546
mupal@mercyhurst.edu
ABSTRACT
The rapid adoption of social media by billions of people from all
over the world has unleashed unprecedented opportunities for
marketers as well as military and public policy officials to better
understand their target audiences and design more effective
messages for them. Previously, we have reported on a novel
technique for automatically deriving insights about sociocultural
groups (including Doctors without Borders, US Republican Party,
and Water.org) from Twitter posts by members of those groups
[1]. We discovered that while readers of the Republican Party
account @GOP were more likely to like and retweet surprising,
positive, and social identity related (us-versus-them) messages,
Water.org readers preferred emotional, negative, and
religious/ideological messages. This study was designed to apply
this technique to learn more about the target audience for the
terrorist group called ISIS. We also compare ISIS’s target
audience with those of the non-terrorist groups we have
previously studied. This analysis should help counter-terrorism
and counter-insurgency officials to design more effective
messages to counter ISIS’s online propaganda.
CCS Concepts: • Theory of Computation → Machine Learning Theory; • Information Systems → Web and Social Media Search
General Terms: Social media mining, machine
learning, big data.
Additional Key Words and Phrases: natural language
processing.
1. INTRODUCTION
Experts credit ISIS’s ability to exploit social media as one of the
key factors responsible for its rapid rise to prominence in the
Middle East [2, 3]. In order to counter ISIS’s online propaganda,
we must understand how it spreads its messages through various
social media platforms. What is it about ISIS messages that
allows them to resonate with their target audience? What types of
messages do ISIS sympathizers like? Are religious messages
more likely to be liked and shared by ISIS sympathizers or are
messages appealing to nationalistic notions of “us versus them”
more likely to become viral among them? The question of whether religious doctrine, social identity, or resource deprivation is the primary motivator of terrorism is hotly debated by scientists [4, 5] as well as columnists [6]. Traditional empirical research to investigate such questions is notoriously difficult, not least due to the safety and security issues involved in carrying out research with human participants in an active conflict zone. Since social media messages can be accessed from anywhere in the world, is it possible to learn this information from them? Previously, we
reported on a study carried out to better understand factors
responsible for popularity of messages among members of a
variety of social groups on Twitter including the US Republican
Party, Doctors without Borders (MSF), Toronto Maple Leafs,
Proctor & Gamble’s Always Pads, Water.org, and People for
Ethical Treatment of Animals (PETA) [1]. We found some
common factors such as having a picture, length of time the tweet
has been up, and emotionality of a tweet’s message that predicted
tweet popularity in all groups. We also found differences among
groups. The tweets that ask to be liked were liked by followers
of the Doctors Without Borders (MSF) and the Toronto Maple
Leafs accounts but not by others. Similarly, having a URL
predicted tweet-popularity among readers of P&G’s Always
account but not among others. In fact, not having a URL was a
predictor of tweet success among the Republicans. If a tweet
explicitly asks its readers to comment on it, only readers of the
MSF account complied. Having ideological content in a tweet
was a good predictor of tweet popularity in the Water.org
community but not in others (especially in among readers of the
MSF account, where it was a negative predictor). Having
surprising content was a good way to catch the attention of the
Republicans but not others. Humorous tweets were liked and
shared by readers of the Water.org tweets but not by readers of
other Twitter accounts. Tweets that appeal to notions of “us”
versus “them” were popular among Republicans but not among
readers of the other accounts. Tweets with negative content were
popular among readers of the Water.org and Doctors Without
Borders account while tweets with positive content were popular
among Toronto Maple Leaf fans and Republicans. The objective
of work presented here was to carry out similar analysis for
Twitter accounts of ISIS sympathizers to better understand what
makes ISIS messages popular and compare the results with those
of the groups we previously studied to understand how similar
and different the preferences of ISIS sympathizers are from those
of the groups we studied before.
2. EXPERIMENTAL AND COMPUTATIONAL DETAILS
2.1 Data
In our previous work with the five Twitter groups described earlier, we downloaded as many tweets as we could that had been posted by each of the following Twitter handles before 1 November 2015.
1. @Always: Proctor and Gamble’s Always pads (23k+
followers, 1.8k+ messages).
2. @mapleleafs: Toronto Maple Leafs Hockey Club (1.1
million+ followers, 74K+ messages).
3. @gop: The US Republican National Committee (650k+
followers, 20k+ messages).
4. @MSF_USA: The Doctors Without Borders (530k+
followers, 17k+ messages)
5. @Water: Safe water for all (750k+ followers, 10k+
messages).
This included all 1880 messages that had been posted by
@always and 3300 messages for the remaining 4 accounts‒the
upper limit set by the Twitter API. For each tweet, we also
downloaded the number of likes and retweets it had received. We
added the number of likes and retweets to compute the popularity
number for each tweet. Using this popularity measure, we labeled
the top 10% tweets as popular and the bottom 10% as unpopular.
We recruited six coders (5 females and 1 male) with academic
training in psychology as well as experience in coding
psychological and linguistic data to code the tweets on 22 features
that have been identified by cognitive science as well as social
media researchers for contributing to message popularity.
1. Surprising. Is the message surprising to its target audience?
2. Emotional. Does the message arouse emotions in its target audience?
3. Positive/negative. How positively/negatively is the message perceived by the target audience?
4. Humorous. How humorous would this message be considered by its target audience?
5. Concrete. Does this message contain mostly concrete, easy-to-imagine concepts?
6. Coherent. Is the message in this tweet coherent?
7. Repetitive. Has this message been posted on this group account before?
8. Social identity related. Would this tweet be perceived by the target audience members to be about "us" versus "them"?
9. Exaggerated. Does this tweet contain exaggeration or only facts?
10. Ideological or Religious. Does this tweet contain an ideological or religious message?
11. Conspiratorial. Does this tweet invoke a conspiracy theory in the minds of its target audience members?
12. About an event. Is the message about an upcoming or past event?
13. Personal communication. Is this message part of a personal communication between the official group handle and an individual?
14. Asks to like. Does the tweet ask its readers to like it?
15. Asks to retweet. Does the tweet ask its readers to retweet it?
16. Asks for real world action (RWA). Does this tweet ask its target audience to take a real-world action?
17. Story. Is this message a story?
18. An arcing narrative. Is this an arcing narrative that reminds the target audience of the group's glorious past and promises a glorious future if the group enacts the proposed reform [7]?
19. Posted duration. How long has this tweet been up on Twitter?
20. Has a picture. Does this tweet have an embedded picture?
21. Has a video. Does this tweet have an embedded video?
22. Has a URL. Does this tweet contain a link to another website?
Features 1 through 8 were coded from 0 to 2, with 0 indicating the absence of the feature, 1 indicating that the feature was somewhat present, and 2 indicating that the feature had a strong presence in the tweet. Features 9 to 22 were coded as binary, with a score of 0 if the feature was deemed absent and 1 if it was thought to be present in the tweet. The last four features were coded automatically by examining the relevant fields of the JSON data structure returned by the Twitter API.
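As an illustration, the automatic coding of these last four features could be implemented roughly as follows, assuming a classic v1.1-style tweet JSON object (field names vary across Twitter API versions, so this is a sketch rather than our exact code):

from datetime import datetime, timezone

def auto_code(tweet: dict) -> dict:
    """Code the four automatic features from a v1.1-style tweet JSON object."""
    entities = tweet.get("entities", {})
    # Videos live under extended_entities in v1.1; photos may appear in either.
    media = (tweet.get("extended_entities", {}).get("media")
             or entities.get("media", []))
    # v1.1 created_at format: "Wed Aug 27 13:08:45 +0000 2008"
    created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    age_hours = (datetime.now(timezone.utc) - created).total_seconds() / 3600
    return {
        "posted_duration_hours": age_hours,
        "has_a_picture": int(any(m.get("type") == "photo" for m in media)),
        "has_a_video": int(any(m.get("type") == "video" for m in media)),
        "has_a_url": int(bool(entities.get("urls"))),
    }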
We repeated this methodology to obtain data for the present study, with some variations. The first problem was obtaining a large number of tweets, some with many likes and retweets and others with none or few, so that our algorithms would have something to learn from. Unlike the previous accounts, for which we could simply download tweets posted by the official Twitter handle, ISIS does not have an official Twitter account. There are, however, a number of Twitter accounts that are known to belong to ISIS members. We started with 42 of these accounts and looked for the accounts that had retweeted their 20 most recent tweets. We allowed the algorithm to run for 36 hours. This resulted in 4823 users and 6855 unique tweets in 35 languages. Of these, we selected the 3733 English tweets and had one of the coders who had coded the five-group tweets code them for the 22 features discussed above.
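The collection step described above can be read as a breadth-first crawl of the following kind; fetch_recent_tweets and fetch_retweeters below are hypothetical stand-ins for whatever Twitter API wrappers are actually used:

import time
from collections import deque

def fetch_recent_tweets(account, n=20):
    return []  # placeholder: would call the Twitter API

def fetch_retweeters(tweet_id):
    return []  # placeholder: would call the Twitter API

def snowball(seed_accounts, time_budget_hours=36):
    """Breadth-first expansion from known accounts via their retweeters."""
    deadline = time.time() + time_budget_hours * 3600
    queue = deque(seed_accounts)
    seen_users, tweets = set(seed_accounts), {}
    while queue and time.time() < deadline:
        account = queue.popleft()
        for tw in fetch_recent_tweets(account, n=20):
            tweets[tw["id"]] = tw                    # de-duplicate by tweet id
            for user in fetch_retweeters(tw["id"]):  # retweeters join the crawl
                if user not in seen_users:
                    seen_users.add(user)
                    queue.append(user)
    return seen_users, tweets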
2.2 Method
We selected seven classification algorithms that are known to perform well in social media mining applications [8]. We accessed six of these algorithms (the Logistic classifier, RIPPER, Random Forest, C4.5, Alternating Decision Trees, and K*) through the Weka machine learning toolkit [9], while the Support Vector Machine (SVM) was accessed through the LIBSVM library [10]. Algorithm performance was measured using ten-fold cross validation: the data set was divided into ten segments, and each segment was used in turn as a test set while the remaining nine segments were used for training. The measures of performance we computed were accuracy, precision, recall, Cohen's kappa, the F-measure, and the Area Under the ROC Curve (AUC) [11].
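Although our experiments used Weka and LIBSVM, the same ten-fold protocol and measures can be expressed compactly in scikit-learn; the sketch below substitutes a Random Forest and stand-in data for our actual models and coded features:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             cohen_kappa_score, f1_score, roc_auc_score)

# X: coded features (n_tweets x 22), y: 1 = popular, 0 = unpopular.
rng = np.random.default_rng(0)
X = rng.random((200, 22))
y = rng.integers(0, 2, 200)  # stand-in labels

clf = RandomForestClassifier(random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
pred = cross_val_predict(clf, X, y, cv=cv)                       # hard labels
prob = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]

print("Accuracy :", accuracy_score(y, pred))
print("Precision:", precision_score(y, pred))
print("Recall   :", recall_score(y, pred))
print("Kappa    :", cohen_kappa_score(y, pred))
print("F-measure:", f1_score(y, pred))
print("AUC      :", roc_auc_score(y, prob))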
3 RESULTS
Classification measures of performance (Table 10, in the Appendix) show that the features we selected can be used effectively to predict whether or not a tweet will become popular.
Table 8 shows the top rules learned by the RIPPER algorithm for each data set. Such rules can be extremely useful for message designers because they indicate how to design ads that will be liked and retweeted by the target audience members; one of them is turned into executable form in the sketch following Table 8.
GOP:
1. Has_a_picture & is_very_emotional
2. Has_a_picture & is_emotional
3. does_not_have_a_URL & is_negative & Duration ≥ 5883

MSF:
1. Duration ≤ 1988 & is_very_emotional & is_not_a_story
2. Has_a_picture & is_not_an_event
3. 501 ≤ Duration ≤ 991 & is_emotional
4. 815 ≤ Duration ≤ 919
5. Duration ≤ 769 & does_not_have_a_URL & is_very_coherent
6. is_very_surprising & 4556 ≤ Duration ≤ 5615

Water:
1. is_not_personal
2. Has_a_URL & is_emotional & Duration ≤ 11877
3. is_humorous & is_positive

Always:
1. Duration ≤ 282 & is_arcing
2. Has_a_URL & Duration ≤ 498 & is_not_personal
3. Has_a_picture & Duration ≤ 496

TML:
1. is_not_personal & Duration > 1820
2. Has_a_picture & is_positive
3. Has_a_picture & is_not_personal & 678 ≤ Duration ≤ 933 & is_very_coherent
4. 290 ≤ Duration ≤ 1199 & is_very_emotional
5. 2373 ≤ Duration ≤ 2468
6. 700 ≤ Duration ≤ 1300 & Has_a_URL

All tweets:
1. Has_a_picture & is_very_emotional
2. Has_a_picture & is_emotional
3. Duration ≥ 358.16 & is_not_humorous
4. is_not_personal & 285 ≤ Duration ≤ 2047 & is_arcing
5. is_not_personal & 1296 ≤ Duration ≤ 2407 & is_not_social_identity_related
6. is_not_personal & Duration ≤ 1969 & is_not_arcing & is_emotional & is_an_event & does_not_ask_for_RWA

Table 8: The top rules learned for each data set using the Repeated Incremental Pruning to Produce Error Reduction (RIPPER) algorithm.
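To make the use of such rules concrete, the first GOP rule translates into a simple predicate over a coded tweet (a sketch; the feature encoding follows the scheme in Section 2.1, where 2 indicates a strong presence of the feature):

# The first GOP rule from Table 8, expressed as a predicate over a coded tweet.
def gop_rule_1(tweet: dict) -> bool:
    """Predict 'popular' when the tweet has a picture and is very emotional."""
    return tweet["has_a_picture"] == 1 and tweet["emotional"] == 2

draft_ad = {"has_a_picture": 1, "emotional": 2}
if gop_rule_1(draft_ad):
    print("Predicted to be popular with the @GOP audience")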
Calculating logistic regression odds ratios is a good way to identify features that are critical to predicting tweet popularity in a given target audience. The odds ratio measures the association between the presence of a variable in a tweet and the tweet's popularity: an odds ratio of one indicates no correlation, a value above one indicates a positive correlation, and a value below one indicates a negative correlation. The P-values represent the probability of the odds ratios taking the observed values given that there is no association between the variable and tweet popularity; a small P-value therefore indicates that there is a very low probability that these results could have been obtained without a relationship. Table 3 below shows the logistic regression odds ratios and P-values for each data set.
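For reference, odds ratios and P-values of the kind reported in Table 3 can be obtained by exponentiating the coefficients of a fitted logistic regression; the sketch below uses statsmodels with stand-in data rather than our coded tweets:

import numpy as np
import statsmodels.api as sm

# X: binary/ordinal feature matrix, y: 1 = popular, 0 = unpopular (stand-ins).
rng = np.random.default_rng(1)
X = rng.integers(0, 2, (300, 4)).astype(float)
y = rng.integers(0, 2, 300)

res = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
odds_ratios = np.exp(res.params)   # OR = e^coefficient
p_values = res.pvalues             # Pr(>|z|), as reported in Table 3
for name, oratio, p in zip(["const", "f1", "f2", "f3", "f4"],
                           odds_ratios, p_values):
    print(f"{name}: OR={oratio:.2f}, p={p:.3f}")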
The results show that, like the Water.org, Doctors Without Borders (MSF), Republican Party (GOP), and P&G's Always data sets, ISIS tweets are also likely to become more popular the longer they stay up on Twitter. Similar to Water.org and MSF (and unlike Toronto Maple Leafs, US Republicans, and Always), having a picture is not predictive of tweet popularity in the ISIS data set. Because most of the ISIS sympathizer accounts had been removed by Twitter by the time our coders coded the data in 2016, our coders were not able to access the multimedia aspects of the ISIS tweets, and the ISIS coding for "has a picture" and "has a video" was therefore woefully incomplete. We will accordingly ignore this aspect in the rest of the discussion. Unlike the Always data set, and similar to all other data sets, having a URL is not predictive of tweet popularity in the ISIS data. Unlike the Doctors Without Borders, Toronto Maple Leafs, US Republicans, and Always data sets, and similar to the Water.org data set, being "very emotional" is not predictive of a tweet's popularity. Similar to Water.org, "very concrete" tweets are likely to become popular among ISIS sympathizer accounts. Similar to the US Republican Party, and unlike all other data sets, social identity related (i.e., "us versus them") messages are also likely to become popular in the ISIS data set. Very social-identity related messages are significantly associated with tweet popularity in both groups, while somewhat social-identity related messages are statistically significant only in the US Republican data set and only approach significance in the ISIS data set.
Similarities and differences between the ISIS data set and the US Republican Party (GOP) data set:
- Social identity related messages are preferred by readers of both groups.
- Somewhat concrete messages are preferred by readers of the Republican Party account, while "very concrete" messages are preferred by readers of the ISIS Twitter accounts.
- Positive messages are preferred by Republican readers but not by readers of the ISIS accounts.
- Somewhat surprising messages are preferred by Republican readers but not by readers of the ISIS accounts.

Similarities and differences between the ISIS data set and the Water.org data set:
- Very emotional messages are preferred by readers of both groups.
- Very concrete messages are preferred by readers of both groups.
- Negative messages are preferred by Water.org readers but not by readers of the ISIS accounts.
- Very humorous messages are preferred by Water.org readers but not by ISIS readers.
- Ideological messages are preferred by Water.org readers but not by readers of the ISIS Twitter accounts.

Table 9: Summary of similarities and differences between readers of the ISIS Twitter accounts and readers of the US Republican Party (@GOP) and Water.org (@Water) Twitter accounts.

As Table 9 shows, readers of the ISIS Twitter accounts are similar to readers of Water.org in that they preferentially like and retweet emotional and concrete messages. Readers of the ISIS accounts are similar to readers of the US Republican Party account in their preference for concrete as well as social identity related messages that appeal to notions of us-versus-them. Readers of the ISIS Twitter accounts are dissimilar from Water.org readers in that Water.org readers like and retweet humorous, negative, and ideological messages while readers of the ISIS accounts do not. The fact that readers of ISIS accounts are not more likely to retweet and like religious or ideological messages may seem surprising, but it lends support to those who argue that social identity factors are a stronger motivator for ISIS supporters than religious and doctrinal factors. Readers of the ISIS Twitter accounts are dissimilar to the US Republicans in that US Republicans prefer positive and surprising messages while readers of the ISIS Twitter accounts do not. In fact, readers of the ISIS Twitter accounts seem to prefer neutral messages over positive or negative ones.

4 CONCLUSIONS
This paper has described the results of applying a semi-automated technique for understanding the target audience of the terrorist group ISIS on Twitter. The results of our analysis offer insights that civilian and military decision makers can use to design messages that are more likely to be effective in countering ISIS propaganda. The limitations of the work presented here are that it considers only a subset of ISIS's social media messages, namely the English messages posted on Twitter, and that it involved significant human coding effort. We are working to overcome these limitations by considering messages posted on other social media platforms (such as Facebook) in English as well as Arabic. We are also working on automating the coding process, using sentiment analysis techniques to automatically code tweets for the 22 features used in our study.

REFERENCES
[1] Upal, M. A. and Marupaka, P. Ad-Oracle for Predicting the Popularity of Marketing Campaign Messages on Twitter. Submitted.
[2] Berger, J. M. and Morgan, J. The ISIS Twitter Census: Defining and describing the population of ISIS supporters on Twitter. The Brookings Institution, Washington, DC, 2015.
[3] Bodine-Baron, E., Helmus, T. C., Magnuson, M. and Winkelman, Z. Examining ISIS Support and Opposition Networks on Twitter. RR-1328, RAND Corporation, Santa Monica, CA, 2016.
[4] Brubaker, R. Religious Dimensions of Political Conflict and Violence. Sociological Theory, 33(1), 2015, 1-19.
[5] Rink, A. and Sharma, K. The determinants of religious radicalization: Evidence from Kenya. Journal of Conflict Resolution, 2016.
[6] Wood, G. What ISIS really wants. The Atlantic, March 2015.
[7] Upal, M. A., Packer, D., Moskowitz, G. B. and Kugler, M. B. Investigating the Dynamics of Identity Formation and Narrative Information Comprehension: Final Report. Defence Research & Development Canada, 2011.
[8] Japkowicz, N. and Stefanowski, J. Big Data Analysis: New Algorithms for a New Society. Springer, 2015.
[9] Witten, I. H., Frank, E. and Hall, M. A. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, 2011.
[10] Chang, C. and Lin, C. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3), 2011, 27:1-27:27.
[11] Japkowicz, N. and Shah, M. Evaluating Learning Algorithms: A Classification Perspective. Cambridge University Press, New York, 2011.
APPENDIX
Algorithm | Measure | MSF | GOP | Water | Always | TML | ISIS | All Tweets
Weka Logistic Classifier | Accuracy | 86.32 | 89.30 | 96.48 | 93.14 | 91.57 | 97.49 | 85.43
 | Recall | 0.88 | 0.89 | 0.96 | 0.86 | 0.92 | 0.98 | 0.84
 | Precision | 0.85 | 0.89 | 0.96 | 0.89 | 0.91 | 0.98 | 0.84
 | Cohen's Kappa | 0.73 | 0.79 | 0.93 | 0.83 | 0.83 | 0.95 | 0.71
 | F-Measure | 0.86 | 0.89 | 0.96 | 0.93 | 0.92 | 0.98 | 0.85
 | AUC | 0.93 | 0.95 | 0.98 | 0.95 | 0.97 | 1.00 | 0.92
RIPPER | Accuracy | 89.02 | 87.92 | 96.19 | 94.85 | 94.66 | 99.84 | 89.37
 | Recall | 0.89 | 0.88 | 0.96 | 0.91 | 0.95 | 1.0 | 0.87
 | Precision | 0.89 | 0.88 | 0.96 | 0.91 | 0.94 | 1.0 | 0.89
 | Cohen's Kappa | 0.78 | 0.76 | 0.92 | 0.87 | 0.89 | 1.0 | 0.78
 | F-Measure | 0.88 | 0.89 | 0.96 | 0.95 | 0.95 | 1.0 | 0.89
 | AUC | 0.90 | 0.90 | 0.96 | 0.93 | 0.95 | 1.0 | 0.91
Random Forest | Accuracy | 91.56 | 91.87 | 96.92 | 94.07 | 96.21 | 100 | 91.68
 | Recall | 0.92 | 0.91 | 0.96 | 0.86 | 0.96 | 1.0 | 0.91
 | Precision | 0.91 | 0.92 | 0.98 | 0.92 | 0.96 | 1.0 | 0.91
 | Cohen's Kappa | 0.83 | 0.84 | 0.94 | 0.85 | 0.92 | 1.0 | 0.83
 | F-Measure | 0.92 | 0.92 | 0.97 | 0.94 | 0.95 | 1.0 | 0.92
 | AUC | 0.97 | 0.97 | 0.99 | 0.98 | 0.98 | 1.0 | 0.98
C4.5 Decision Trees | Accuracy | 88.5 | 90.16 | 96.19 | 94.92 | 94.38 | 99.95 | 90.45
 | Recall | 0.88 | 0.89 | 0.96 | 0.88 | 0.92 | 1.0 | 0.89
 | Precision | 0.89 | 0.90 | 0.96 | 0.90 | 0.96 | 1.0 | 0.90
 | Cohen's Kappa | 0.77 | 0.80 | 0.92 | 0.85 | 0.89 | 1.0 | 0.81
 | F-Measure | 0.88 | 0.90 | 0.96 | 0.94 | 0.94 | 1.0 | 0.90
 | AUC | 0.87 | 0.91 | 0.96 | 0.94 | 0.95 | 1.0 | 0.93
Alternating Decision Tree | Accuracy | 89.23 | 90.29 | 96.63 | 93.60 | 95.37 | 99.95 | 88.49
 | Recall | 0.90 | 0.89 | 0.95 | 0.97 | 0.95 | 1.0 | 0.86
 | Precision | 0.88 | 0.91 | 0.98 | 0.91 | 0.96 | 1.0 | 0.88
 | Cohen's Kappa | 0.78 | 0.81 | 0.93 | 0.84 | 0.91 | 1.0 | 0.77
 | F-Measure | 0.89 | 0.90 | 0.97 | 0.93 | 0.95 | 1.0 | 0.88
 | AUC | 0.95 | 0.95 | 0.99 | 0.97 | 0.98 | 1.0 | 0.93
K* Lazy Classifier | Accuracy | 91.12 | 88.16 | 95.45 | 93.14 | 94.80 | 99.74 | 91.30
 | Recall | 0.91 | 0.89 | 0.94 | 0.83 | 0.92 | 1.0 | 0.89
 | Precision | 0.91 | 0.87 | 0.96 | 0.92 | 0.97 | 1.0 | 0.92
 | Cohen's Kappa | 0.82 | 0.76 | 0.91 | 0.83 | 0.90 | 1.0 | 0.82
 | F-Measure | 0.91 | 0.88 | 0.95 | 0.93 | 0.95 | 1.0 | 0.91
 | AUC | 0.96 | 0.95 | 0.99 | 0.97 | 0.98 | 1.0 | 0.96
Support Vector Machine | Accuracy | 79.76 | 66.95 | 62.46 | 90.65 | 80.62 | 99.37 | 73.69
 | Recall | 0.80 | 0.67 | 0.62 | 0.91 | 0.80 | 0.99 | 0.74
 | Precision | 0.80 | 0.67 | 0.63 | 0.91 | 0.80 | 0.99 | 0.74
 | Cohen's Kappa | 0.59 | 0.34 | 0.25 | 0.76 | 0.61 | 0.99 | 0.46
 | F-Measure | 0.80 | 0.67 | 0.62 | 0.90 | 0.81 | 0.99 | 0.73
 | AUC | 0.80 | 0.67 | 0.62 | 0.86 | 0.81 | 0.99 | 0.73

Table 10: Performance of various machine learning algorithms, measured by ten-fold cross validation, on the coded Twitter data from the five previously studied Twitter campaigns, the ISIS data set, and the combined All Tweets data set.
Variable | Water | MSF | TML | GOP | ISIS | Always
(Intercept) | 0.47 (0.67) | 0.00 (0.99) | 0.00 (1.00) | 0.00 (0.99) | 0.26 (0.25) | 1.30E+15 (0.99)
Duration | 1.00 (0.001 ***) | 1.00 (0.001 ***) | 1.00 (0.19) | 1.00 (0.001 ***) | 0.99 (0.001 ***) | 1.00 (0.001 ***)
Asks to Like | 0.02 (0.16) | 59.26 (0.06) | 22.85 (0.02 **) | 3.72 (0.52) | 0.47 (0.15) |
Asks to Share | 0.00 (1.00) | 0.33 (0.39) | 5.43E+07 (1.00) | 0.00 (0.99) | 8.13 (0.42) | 0.75 (0.08)
Asks to Comment | 0.01 (1.00) | 8.13 (0.001 ***) | 0.00 (1.00) | 1.38 (0.18) | |
Has a Picture | 1.03E+12 (0.99) | 8.23E+05 (0.99) | 51.02 (0.001 ***) | | |
Has a Video | 77.29 (0.001 ***) | 40.91 (0.001 ***) | 0.22 (0.26) | 0.28 (0.18) | |
Has a URL | 6.73 (0.15) | 0.54 (0.06 .) | 1.22 (0.67) | 0.13 (0.001 ***) | 10.32 (0.001 ***) | 1.3 (0.16)
Asks for RWA | 0.62 (0.72) | 0.45 (0.06 .) | 0.00 (0.99) | 0.90 (0.78) | 2.00 (0.14) | 0.22 (0.001)
Is Conspiratorial | 7.14E+03 (1.00) | 0.77 (0.63) | 0.12 (0.99) | 0.33 (0.99) | 0.88 (0.99) |
Is an Event | 1.41 (0.84) | 0.75 (0.85) | 1.01 (0.99) | 1.41 (0.33) | |
Is Ideological or Religious | 2.00E+04 (0.01 **) | 0.24 (0.001 ***) | 1.12E+04 (1.00) | 0.61 (0.21) | |
Is Personal | 0.00 (0.001 ***) | 0.05 (0.001 ***) | 0.00 (0.001 ***) | 1.23 (0.89) | 0.79 (0.68) | 416.85 (0.96)
Is Arcing | 0.00 (0.12) | 0.00 (1.00) | 1.83 (0.29) | 4.35 (0.03) | 2.43 (0.99) |
Is a Story | 0.00 (0.02 *) | 0.26 (0.06 .) | 0.68 (0.29) | 7.58 (0.31) | 0.66 (0.90) |
Is Exaggerated | 9.88E+05 (1.00) | 3.45E+05 (1.00) | 0.57 (0.38) | 0.57 (0.99) | |
Is Somewhat Surprising | 1.20 (0.93) | 0.88 (0.75) | 0.00 (1.00) | 2.07 (0.04 *) | 0.72 (0.86) | 0.83 (0.001)
Is Very Surprising | 0.03 (0.92) | 1.04 (0.92) | 5.43E+08 (1.00) | 1.03 (0.98) | 6.64E+12 (0.99) | 0.04 (0.99)
Is Somewhat Emotional | 20.65 (0.02 *) | 1.79 (0.10) | 3.69 (0.20) | 9.53 (0.001 ***) | 23.46 (0.02 *) | 1.04 (0.23)
Is Very Emotional | 81.27 (0.30) | 7.26 (0.001 ***) | 13.10 (0.02 **) | 32.18 (0.001 ***) | 34.86 (0.06 .) | 0.26 (0.02)
Is Somewhat Humorous | 9.68E+05 (0.001 ***) | 1.13 (0.96) | 0.28 (0.12) | 1.41 (0.55) | 0.69 (0.81) | 11.55 (0.64)
Is Very Humorous | 3.90E+09 (1.00) | 1.51 (0.15) | 0.38 (0.45) | 2.09 (0.56) | 0.00 (1.00) |
Is Somewhat Concrete | 8.10 (0.19) | 32.92 (0.17) | 2.25E+06 (1.00) | 3.63 (0.07 .) | 1.04 (0.96) | 0.99 (0.78)
Is Very Concrete | 450.17 (0.06 .) | 2.14E+06 (0.99) | 2.45E+06 (1.00) | 2.24 (0.28) | |
Is Somewhat Coherent | 1.30 (0.88) | 7.71E+06 (0.99) | 0.41 (0.49) | 609.91 (1.00) | |
Is Very Coherent | 98.34 (0.16) | 3.14 (0.24) | 2.95E+03 (1.00) | 0.57 | 2.21 (0.21) | 0.22
Is Very Repetitive | | | | | |
Is Somewhat Social Identity Related | 0.02 (0.08 .) | 6.59E+06 (0.99) | | 3.92 (0.01 **) | 7.78E+04 (0.07 .) |
Is Very Social Identity Related | 1.55 (0.03 *) | | | 9.58 (0.001 ***) | 2.68 (0.04 *) |
Is Positive | 3.63 (0.45) | 1.62 (0.21) | 7.83 (0.001 ***) | 5.35 (0.08 .) | 0.23 (0.26) | 0.79 (0.002)
Is Positive & Negative | 159.66 (0.19) | 1.93 (0.29) | 7.71 (1.00) | 2.65 (0.40) | 0.23 (0.31) | 0.72 (0.99)
Is Negative | 1.38 | | | | |
Is Neutral | | | | | |

Table 3: Logistic regression odds ratios (OR) and P-values, Pr(>|z|), shown as OR (P-value). Three stars indicate highly significant results (P<0.001), two stars indicate very significant results (P<0.01), and one star indicates somewhat significant results (P<0.05). Results that approach the statistical significance threshold but do not meet it are marked with a period. Cells for which no data exists are reported as blank.