Proceedings of the Second Annual Data Science Symposium, Mercyhurst University, Erie, PA, May 4, 2019

Program Chair & Proceedings Editor: M. Afzal Upal, PhD
Chair of Computing & Information Science Department
Mercyhurst University
501 E 38th St
Erie, PA, 16546

Table of Contents
Music Genre Classification Using Machine Learning Techniques - Andrew Innes
Predicting Hole by Hole Golf Scores on the PGA Tour - Ron Richardson
Logistic regression Versus Convolutional Neural Networks for classification - Jerrin Varghese
Machine Learning for the Detection of Mobile Malware on Android Devices - Christina Eusanio
Building a Gun Detection Model Using Deep Learning - Shraddha Dubey
Flight delay/cancellation prediction using machine learning - Milos Veres
How do Socioeconomic Factors Effect the Amount of Waste Produced - Heidi Beezub
Using Stock Market Data to Evaluate Genetic Algorithm Performance - Bill Fisher
Stock Market Price Model Using Sentiment and Market Analysis - Justin Minsk
Connecting People: Psychology and Machine Learning - Praveen Neelappa
Medical Brain Drain: The Relationship between Regulation and Emigration - Kimberly Staudt
Predicting Future Poaching Sites in African Reserves - Stephanie Le Grange
Mass Shootings in the United States: An Analysis and Prediction - Dayana Moncada
Are ISIS Sympathizers More Like Republicans or "Water For All" Charity Members? - M. Afzal Upal

Music Genre Classification Using Machine Learning Techniques
Andrew Innes
Mercyhurst University
501 E 38th St, Erie, PA 16546
ainnes54@lakers.mercyhurst.edu

ABSTRACT
Music genre classification has been a widely studied topic for many years and has grown drastically with the rise of machine learning. Various techniques have been used to classify music, but which techniques perform the best? In this study, multiple machine learning techniques are compared to determine which algorithm performs most accurately and most efficiently.

Keywords
Music Genre Classification, Neural Network, CNN, CRNN, Sequential, Decision Tree, Linear Regression, Random Forest, SVM.

1. INTRODUCTION
Machine learning promises to transform the music industry for the better, and we will see consistent change for years to come. With the emergence of massive streaming services, machine learning has taken over the science behind the music. Machine learning has already helped artists almost fully create songs and has also helped get new artists off the ground. Other uses include predicting the next big hit, which combines big data analytics and machine learning techniques. In this paper, we focus on the difficult task of music genre classification. Music genre classification is an ongoing research problem that has increased in importance since the emergence of music streaming applications, personalized radios, and at-home playlists. There are currently over a thousand micro-genres of popular music, which makes genre classification a very challenging task. Micro-genres or sub-genres can make most music hard to categorize because they can carry multiple elements of different genres. In this paper, we address the music genre classification problem using machine learning techniques. Music genre classification can be broken down into two key uses. The first is how streaming services like Spotify and Pandora use it.
They use music genre classification to recommend a similar song when listening to music. This means the service takes the song that is currently playing, analyzes it into a specific genre or subgenre, and then picks another song from that genre to play next. Another use for music genre classification is sorting a large music collection. This can be helpful for the everyday music listener or for today's disc-jockeys (DJs) who have massive music libraries. The problem is that many of these songs are not tagged with proper labels or are not tagged at all. Automatic music genre classification aims to solve this problem by quickly sifting through a user's music library and correctly placing genre tags on each song. In the end, this would make it easier for a user to sort through their music and generate playlists with similar music. This paper will focus on music genre classification for the user. Music professionals such as DJs could benefit significantly from a genre classification tool. There are already many applications that can automatically sort through music using key features to create playlists. However, these key features must already be labeled. These labels can be anything from the genre of the music, to the mood, to just the song title. As stated before, many songs aren't labeled or are not labeled correctly. Music can be broken down into multiple genres and multiple subgenres. This makes the task of classifying music generally hard. Another reason music genre classification is difficult is the genre definitions themselves [1]. Many genres, and especially subgenres, have unclear definitions of what the music is. For example, Electronic music has many sub-genres with very similar elements, where one specific sound could be the reason a track falls in a particular sub-genre. Another example is New Weird America. New Weird America is considered an indie folk/rock variant descended from the psychedelic folk and rock of the 60s and 70s. It contains elements from multiple genres that include metal, free jazz, electronic music, world music, Latin, noise, and opera [2]. This definition brings in the human aspect of genre classification. When you add the human element to classifying music, it can get subjective. Humans typically have their own opinions when it comes to music. For instance, one person could say a song is amazing and belongs to a particular genre, while another could consider it just noise. Also, humans want things done relatively quickly. Numerous previous studies have investigated music genre classification and have used a variety of different machine learning algorithms. As with any machine learning algorithm, some work better than others. In a study done using the GTZAN dataset, college students could achieve a 70% classification accuracy after listening to 3 seconds of music [3]. When listening to longer clips of music the accuracy was not any better. In one previous work, a convolutional deep belief network achieved a 70% classification accuracy on a 5-genre classification task [3]. Based on this report there is still more room for improvement. In another report, a combined algorithm for music genre classification based on specific parameters and on a set of SVMs was used to classify up to 100% of songs correctly. However, it was trained on 80% of a database that only had 72 songs [1]. When using only 10% of the dataset, recognition rates varied from 51% to 92%.
Overall, the classification method worked but was only used on a small dataset. This shows that there is still room for improvement when using a larger dataset. Another report used a combination of convolutional neural networks (CNN) and recurrent neural networks (RNN) to form a CRNN for music tagging [4]. Within this report, they used the million-song dataset. However, they used the CRNN to predict the top-50 tags, whereas we will be looking at genre specifically. Their evaluation allows them to predict genre, mood, instruments used, and the era of music. To further understand how we aim to classify music into genres, you must understand what features make up a song. Some genres contain specific segments such as a solo guitarist in rock music, a drop in electronic music, or a chorus in pop music. All of these can be used to help classify which genre of music a song belongs to. More features can include mood, beats per minute (BPM), pitch, tone, instruments used, lyrics, and release date. There are many more features that can be used to classify songs. We will provide detailed information about the features used in this report later.

2. RELEVANT WORK
Many different techniques have been applied to the problem of music genre classification. Some of these are machine learning techniques that attempt to classify genre correctly. The average person can classify music at a 70% accuracy rate when listening to 30-second audio samples. Many previous attempts to use machine learning to classify music correctly have seen an accuracy rate approaching that of humans. This could be because the classification of music into genres is a highly subjective issue. Many genres of music contain very similar musical elements that make it difficult for both humans and machines to accurately classify music. However, music genre classification in data science is used to speed up the process of classifying genre while maintaining a high accuracy rate. In this section we will look at relevant works that have successfully classified music using neural network approaches, use similar datasets, and focus on feature importance. The first relevant work we will discuss presents several different approaches to classifying genre. These methods include support vector machines, neural networks, decision trees, k-nearest neighbor, and composer classification [5]. They tested each model, showing the average F1 score of each genre for the model. For their support vector machine model, they used both linear and polynomial kernels. The linear kernel showed an F-score of 0.847 and the polynomial kernel had an F-score of 0.838 [5]. Their neural network approach used a model with two hidden layers, which achieved an average F-score of 0.9 [5]. As discussed in their results, their neural network approach showed the best classification results across 4 genres with 100 songs each. Their next approach was k-nearest neighbor, which saw a 0.842 F-score. Their final model was a decision tree model, which showed an F-score of 0.793. This paper also showed that there is much room for improvement, as few genres were attempted on a small dataset. Many previous studies focus on the use of convolutional neural networks (CNNs) to assist in music genre classification.
One specific paper written at South China University of Technology focused on feature extraction and the use of CNNs to predict music genre at a 72.4% accuracy [6]. Their algorithm was based on spectrograms and CNNs. A spectrogram is a visual representation of the frequencies in sound. The spectrogram contains more details of music components such as pitch and tempo, which can assist in classifying music into a specific genre. They used a feature detector as a filter to divide the spectrogram into four feature maps, which were used to see trends of the spectrogram in both time and frequency. Using CNNs assists in music genre classification based on the convolution of the spectrogram [6]. Their final step was connecting the features to a multi-layer perceptron classifier (MLP). The GTZAN dataset has been used by many other reports written on this topic [5]. It contains 10 genres, each represented by 100 30-second audio clips. Other studies have used only a portion of the GTZAN dataset to test the accuracy of many machine learning approaches. Although this paper will not use the GTZAN dataset, it is a good baseline for classifying genre using audio clips and can help focus on specific features to classify music. Evaluating audio clips will be of importance in classifying music into specific genres. That study concludes with a discussion of future work. The authors note that they manually selected the feature detector and are interested in how to automatically learn the feature detector [5]. They also suggest adding more layers to the CNNs. These two changes could help create higher-level features, which could improve the accuracy of the study. Another study uses the same dataset and focuses on similar techniques but better explains the use of CNNs. CNNs are used to learn filters that extract features in the time and frequency domain [3]. If the filters mimic spectro-temporal receptive fields (STRF) in the human auditory system, useful features can then be extracted for music genre classification [3]. In order to successfully fit the CNN to the spectrogram of the music signal, the spectrogram must be split into 3-second segments. This allows the CNN to make predictions for each segment and then combine the predictions together. This is used because human classification accuracy plateaus at 3 seconds and good results were obtained using 3-second segments to train a convolutional deep belief network [6]. In short, this study used the "Divide and Conquer" technique to successfully implement the CNN and reach human-level accuracy. Another study uses a combination of CNNs and recurrent neural networks (RNNs) to form a convolutional recurrent neural network (CRNN) for music genre classification [4]. A CRNN is a modified CNN with the last convolutional layers replaced with an RNN. The CNN and RNN parts play the roles of a feature extractor and a temporal summarizer, respectively. The RNN allows the network to take the global structure into account while local features are extracted with the remaining convolutional layers. This allows the network to focus on all features such as mood and instruments. Mood would be considered a global feature while instrument would be considered a local feature. In their report, they tested their CRNN against 3 other CNNs. When testing for speed, one of the CNNs performed faster than the CRNN in all testing parameters [4]. However, the CRNN outperformed that CNN with the same number of parameters.
In this paper, we will be testing for both speed and accuracy in the hopes of finding the best overall classifier. A previous study also used the million-song dataset (MSD). The million-song dataset is considered to be the largest dataset in the field of music [8]. It is a freely-available collection of audio features and metadata for a million contemporary popular music tracks [8]. The MSD is a unique dataset that worked around the issue of music licensing by using songs that were legally available to The Echo Nest. The Echo Nest is one of the world's largest music data companies and focuses on music intelligence to power smarter music applications. The MSD was created to further research into music tagging and genre classification in a legal and larger way. This dataset is pushing the boundaries for music and data, as many previous datasets have not come close to the size or number of tags of the MSD. As stated before, the MSD contains audio features and metadata for 1 million songs, which includes 280 GB of data, 44,745 unique artists, 7,643 unique terms (Echo Nest tags), 2,321 unique MusicBrainz tags, 43,943 artists with at least one term, 2,201,916 asymmetric similarity relationships, and 515,576 dated tracks starting from 1922 [8]. The MSD can be used alongside The Echo Nest's API to give extra identifiers and updated tags that have been changed since the release of the dataset. The MSD also contains audio file clips with acoustic features such as pitches, timbre, and loudness, as well as peak loudness. When looking for important features that could help classify music into genres, we must look at music information retrieval. One relevant work was written on multiple-instance learning for music information retrieval [9]. In this paper they use two types of features to describe musical audio. The first type is spectral features, which capture aspects of the music related to the instruments and production quality [9]. The second type is temporal features, which summarize the beat, tempo, and rhythmic complexity in four different frequency bands. The beat would be considered the overall structure of the song. In today's world of music, most songs follow a standard 4 by 4 beat structure. The tempo is how fast a song is played. Most songs vary in tempo; however, much of today's music falls in tempo ranges that are related to specific genres. For instance, dance music tends to be around 128 beats per minute (BPM), and most pop music is around 100 BPM. Rhythmic complexity describes how complicated the song is to follow. Songs are more complex when they don't follow a structure. This could come from a speed-up or slow-down of the tempo or a guitar solo that throws off the main portion of the song. Temporal features are calculated on the magnitude of the Mel spectrogram. The Mel spectrogram is related to the Mel-frequency cepstrum (MFC), which is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear Mel scale of frequency. The Mel bands are then combined into four large bands at low, low-mid, high-mid, and high frequencies, given the total magnitude of each [9]. The spectral features consist of the mean and unwrapped covariance of a clip's Mel-frequency cepstral coefficients (MFCC). The MFCCs are calculated from the Mel spectrogram used in the temporal features above. The MFCC is a non-linear spectrum of a spectrum.
Although Mandel and Ellis did not attempt to solve our problem or use the same dataset, their work focused heavily on features that we will need to analyze in order to classify genre correctly. To summarize, previous studies focus on audio clips and the spectrogram. Most recent work has focused on classifying genre with the use of audio features. Audio features are the sounds and structure that make up a song. Through neural network techniques, previous studies were able to classify music from audio clips into genre categories at around a 70% accuracy rate. In this report, we will attempt to achieve a higher classification accuracy by incorporating other aspects of music data. These include music tags that are in the million-song dataset, such as artist, title, release date, etc. This paper focuses on using as many features as possible in order to accurately classify music. In the next section we will discuss our proposed solution and the exact steps we will be taking to solve the music genre classification problem.

3. PROPOSED SOLUTION
In order to solve the problem of music genre classification we must use a music library or dataset with genre labels. This study will use a combination of the large and medium subsets of the FMA dataset. The FMA large dataset contains approximately 105,000 tracks. The FMA medium dataset is made up of 25,000 tracks that are 30 seconds long and contains 16 genres. The dataset also includes a metadata folder that contains CSV files with feature information and music tags such as genre. During research, many studies focused on audio extraction for their neural network to better understand what a song is made of. For this study, these features are already extracted using the LibROSA package, which will be discussed later in this report. Since the dataset is made up of audio files with some metadata, we can implement a comparison study on different machine learning algorithms and build two datasets. The first dataset includes a database of features extracted from each song using the LibROSA package in Python. However, this dataset does not include the spectrogram. The second dataset includes only the spectrogram (see Figure 1) for each song and uses the genre as the label. Using two datasets allows us to test different algorithms to see which performs better. For the first dataset, we focus on using multiple classifiers to determine which performs the fastest and most accurately. These classifiers include a Decision Tree, an SVC, Linear Regression, and a Random Forest model.

Figure 1: Sample Spectrogram showing frequency and time.

In the FMA medium dataset, the data is already split between train, test, and validation. This information is accessed through the Tracks CSV, which contains metadata for each track. Although there are 25,000 songs in this dataset, only 13,522 are available for training and 1,705 are available for testing. This gives us approximately an 87% split for training and a 13% split for testing. This dataset will include 252 features that were extracted using the LibROSA package. For the second dataset, we focus on an image recognition task. This task will be used on the spectrograms of each song to classify its genre. For this, we use deep learning with a combination of a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN), forming a CRNN. Although this has been tested in previous reports, we are doing a comparison study between different machine learning techniques.
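To make the two-dataset setup above concrete, the following is a minimal pandas sketch of loading the FMA metadata and applying the split provided in the Tracks CSV. The file paths and the multi-level column names (for example ('set', 'split') and ('track', 'genre_top')) are assumptions based on the public FMA metadata release rather than details given in this paper, and may need to be adjusted.

```python
import pandas as pd

# Load the FMA metadata; the multi-level headers here are assumptions based on
# the public FMA release and may differ from the files used in this study.
tracks = pd.read_csv("fma_metadata/tracks.csv", index_col=0, header=[0, 1])
features = pd.read_csv("fma_metadata/features.csv", index_col=0, header=[0, 1, 2])

# Keep tracks in the small/medium subsets and join each track with its
# pre-extracted LibROSA features and its top-level genre label.
keep = tracks[("set", "subset")].isin(["small", "medium"])
tracks_med = tracks[keep]
X = features.loc[tracks_med.index]
y = tracks_med[("track", "genre_top")]

# Use the train/validation/test split provided in the Tracks CSV.
split = tracks_med[("set", "split")]
X_train, y_train = X[split == "training"], y[split == "training"]
X_val, y_val = X[split == "validation"], y[split == "validation"]
X_test, y_test = X[split == "test"], y[split == "test"]

print(X_train.shape, X_val.shape, X_test.shape)
```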
To gather more data for our training sets, we can combine the large and medium subsets. Only 24,986 songs are available for use in our models due to an issue with the Tracks CSV. 2,906 songs are available for the validation set and 3,974 are available for the test set. This gives us a split of 73%, 12%, and 15%. In the next section we will provide the details about each algorithm and explain the appropriate steps taken to reach our results.

4. EVALUATION
The first step in the evaluation process is data collection. For this research paper, music was collected from the FMA dataset. As mentioned before, this dataset includes subsets. For this paper, we are using the FMA medium dataset, which includes 25,000 songs with 16 genres. These genres include Blues, Classical, Country, Easy Listening, Electronic, Experimental, Folk, Hip-Hop, Instrumental, International, Jazz, Old-Time/Historic, Pop, Rock, Soul-R&B, and Spoken. Unfortunately, the FMA medium dataset does not have an even split among genres, but it is best to use a well put together dataset with correct genre labels when testing various algorithms. The FMA dataset includes two CSV files to easily connect each track with its metadata and its extracted features from the LibROSA package. Metadata information is found within the Tracks CSV and feature information is found within the Features CSV. The next step involved the LibROSA package in Python, which allows us to convert our music into ready-to-use data. The LibROSA package is a Python package for music and audio analysis that provides the building blocks for music information retrieval. The LibROSA package was used to extract features for the Features CSV. For this paper, we use 6 features that are commonly used in music information retrieval. The first feature used is the Root Mean Square Energy (RMSE) value for each spectrogram. In music, the energy of the signal is also known as the magnitude of the signal, or how loud the song is. The next feature is a chromagram generated from the song's waveform. A chromagram is used to predict the song's pitch class. This is based on the 12 semitones of a piano's keyboard. This allows us to find the song's pitch key as a numerical value. The third feature is the spectral centroid. The spectral centroid indicates where the "center of mass" of the spectrum is located. It is connected with the "brightness" of sound. The next feature is the spectral bandwidth, which is the difference between upper and lower frequencies on a spectrum. This can determine whether the song has more bass or more high-end sounds. The fifth feature used is the zero-crossing rate. The zero-crossing rate indicates the number of times a signal crosses the horizontal axis. The zero-crossing rate is useful to determine at which points drums are present in a song. The final feature is a set of 20 different features that make up the Mel-Frequency Cepstral Coefficients (MFCC). MFCCs are coefficients that collectively make up the Mel-Frequency Cepstrum (MFC). The MFC is a representation of the short-term power spectrum of sound. In MFCC, the bands are equally spaced on the Mel scale, which approximates the response of the human ear more closely. Using 20 different MFCCs allows the song to be broken up into 20 bins throughout the song. This produces more accurate MFCCs. One thing to note is that the first MFCC normally contains silence, which can produce an inaccurate mean value for the first bin.
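The six feature groups just described can be computed with the LibROSA package roughly as follows. This is a simplified sketch: the file name is hypothetical, and summarizing each frame-level feature by its mean is a simplification of the 252 statistics provided in the Features CSV.

```python
import numpy as np
import librosa

def extract_features(path, n_mfcc=20):
    """Compute clip-level summaries of the six feature groups for one audio file."""
    y, sr = librosa.load(path, sr=22050, mono=True)

    rmse = librosa.feature.rms(y=y)                           # signal energy / loudness
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)          # 12-bin pitch-class profile
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # "brightness"
    bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)
    zcr = librosa.feature.zero_crossing_rate(y)               # axis crossings per frame
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # 20 MFCC bins

    # Summarize each frame-level feature by its mean over the clip.
    return np.hstack([
        rmse.mean(), chroma.mean(), centroid.mean(),
        bandwidth.mean(), zcr.mean(), mfcc.mean(axis=1),
    ])

# Hypothetical track file; in practice the paths come from the FMA audio folders.
feature_vector = extract_features("track_000002.mp3")
print(feature_vector.shape)   # 5 scalar summaries + 20 MFCC means = 25 values
```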
The next step in the evaluation process is to split the data into train and test. The FMA dataset already includes a train, test, and validation split that can be found in the Tracks CSV. This split is approximately 87% train and 13% test or validation data. For more testing, we could use another dataset with the same features extracted and run our algorithms on that. For the purposes of this paper, we will be using the test and validation data that is given. As mentioned previously in this paper, we are not just testing for accuracy but also for the speed of each algorithm on the test set. For this approach, we train and test our data on four different algorithms and output the time it takes to train and test the data as well as the F1 score. These algorithms include a Decision Tree Classifier, an SVC, Logistic Regression, and a Random Forest Classifier. The F1 score is a measure of accuracy. The first algorithm used is a Decision Tree Classifier. Decision Tree models can be used for classification or regression. A Decision Tree builds a tree structure and breaks down the data into smaller subsets that eventually lead to a prediction. Decision Tree models tend to be among the most used algorithms in machine learning due to their high classification accuracy. The second algorithm we train our data on is an SVC, which comes from support vector machines. The goal of the SVC model is to fit the data you provide, returning a best-fit model. This helps categorize the data and return a more accurate prediction. Many music genre classification tasks use an SVC model. The third algorithm is Logistic Regression. Logistic Regression is typically used for a binary classification problem. In this case, the problem is still treated as binary by asking, for each category, whether it is true or false. The fourth algorithm we are testing is a Random Forest Classifier in the TensorFlow package. A Random Forest Classifier is a supervised learning approach that takes a Decision Tree model and adds more trees to the model. A higher number of trees generally leads to a more accurate model, and when classifying data, accuracy means everything.

4.1 Neural Network Approach (CRNN)
The next classifier uses a neural network approach in order to classify genres. For this approach, we must extract the spectrogram for each track. This turns the problem into an image recognition task. In order to complete this task, the spectrogram must be turned into an array of numbers. This approach will be explained further in the paragraphs that follow. The first step of this approach is to extract the spectrogram from each song. For this, we use the LibROSA package, which allows you to plot a spectrogram showing the frequency in Hz over the duration of the track. It also allows us to see the energy of the track in dB with a color spectrum. Just like the previous approach, we must use the Tracks CSV to link the tracks with their metadata. This allows us to build a data frame with the spectrograms that includes each track's genre tag. During the extraction, we were able to extract 24,986 tracks for our train dataset, 2,906 for validation, and 3,974 for our test data. This gives us a 73% train, 12% validation, and 15% test split. After extracting the spectrograms, the next step is to convert each spectrogram into a NumPy array. In order to do this task on a normal laptop, you must batch your data. For the purposes of this paper, we split our training data into 11 batches with approximately 1,600 songs in each batch.
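A minimal sketch of the spectrogram extraction and batching step described above is shown below. The file list and labels are hypothetical stand-ins for what would come from the Tracks CSV, and the mel-spectrogram settings are assumptions rather than the exact parameters used in this study.

```python
import numpy as np
import librosa

def track_to_spectrogram(path, n_mels=128, duration=30.0):
    """Return a mel spectrogram in decibels for one 30-second clip."""
    y, sr = librosa.load(path, sr=22050, mono=True, duration=duration)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(S, ref=np.max)

# Hypothetical file list and genre labels; in practice these come from the Tracks CSV,
# and all clips are assumed to be the same length so the arrays stack cleanly.
paths = [f"fma_audio/{track_id:06d}.mp3" for track_id in range(3200)]
labels = ["Rock"] * 3200
batch_size = 1600

# Convert the clips in batches so a normal laptop can hold each batch in memory,
# then save each batch of arrays and labels to its own .npz file.
for i in range(0, len(paths), batch_size):
    arrays = np.stack([track_to_spectrogram(p) for p in paths[i:i + batch_size]])
    np.savez_compressed(f"train_batch_{i // batch_size:02d}.npz",
                        X=arrays, y=np.array(labels[i:i + batch_size]))
```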
This extraction cut our original training data size down from 130,000 to 24,986 tracks due to an issue with the Tracks CSV; however, this is still the best approach to creating the arrays for our data. After each batch is extracted, we then save the arrays to npz files, which are used specifically for arrays. We then do the same with our test and validation data. The next step is to convert our files using the db_to_power function in the LibROSA package. Then we finally scale the data using the log function. This allows us to determine the loudness of the sound data in decibels as it relates to human-perceived pitch. After the songs are converted, we then concatenate the data and save it to a final npz file. To train this method, we use a combination of a CNN and an RNN to develop a CRNN. CNNs are often used in image recognition tasks and RNNs are typically used for sequential data, in this case time. This model was inspired by and modified from a recently developed model by Priya Dwivedi [10]. This approach uses 1D convolution layers that perform the convolution operation across the time dimension. ReLU is then applied after the convolution operation. ReLU is an activation function that keeps positive values unchanged (a linear identity) and sets all negative values to 0. ReLU is also the most commonly used activation function for CNNs. Next, Batch Normalization is applied, which normalizes the inputs to layers within the network. Finally, we apply 1D Max Pooling, which is used to reduce the spatial dimension of the image and helps prevent overfitting our data. This is performed 5 times with 64 filters per layer. The output of the convolutional layers is then fed into an LSTM. An LSTM is an RNN that, at its base, can compute anything a conventional computer can compute. In this case, we are using it to capture the short- and long-term structure of the song. The LSTM output is then fed into a Dense layer, which is just a regular layer of neurons in a neural network. To simplify this, each neuron receives an input from all the neurons in the previous layer, making them densely connected. The final output is another Dense layer with SoftMax activation. The SoftMax function gives the prediction for each class as a value between 0 and 1, which can be read as a probability. To reduce overfitting, we used dropout and L2 Regularization between each layer. Dropout is a technique where randomly chosen neurons are ignored, or dropped out, during training. Regularization is a technique used to discourage the complexity of the model by penalizing the loss function. L2 Regularization is the sum of the squares of all feature weights and forces the weights to be small but does not make them zero. We use an Adam Optimizer to train our data with a learning rate of 0.001. The Adam Optimizer is said to produce good results quickly as it updates the weights with each gradient. For the loss function we use categorical cross entropy, which measures the probability error in discrete classification tasks in which classes are mutually exclusive. The model is trained for a total of 70 epochs.

5. RESULTS
As stated before, each model was tested for speed and accuracy. When training and testing the data using the extracted features, the Decision Tree Classifier performed the fastest. The model trained in 1.4467 seconds with a 59% accuracy on the training data and made predictions on the test set in 0 seconds with a 60% accuracy. However, there is at least a 10% loss in accuracy with this model when compared to the top performing model.
The SVC model gave us the best accuracy, with 82% on the training data and 73% on the test data. The SVC model trained in 49.60 seconds and made predictions on the test set in 4.6 seconds. To further review the results, we used the classification_report function in sklearn. This allows us to see the classification results for each genre. After reviewing the classification report, it became clear that we have imbalances in our training data that lead to misclassification of the test set. The CRNN model trained each epoch in approximately 250 seconds. It took approximately 5 hours to train all the data, with the best model producing a 72% accuracy on the training data and 60% on the validation data. When using the test set, we received an accuracy score of 65% in 15 seconds. When viewing our classification report, Old-Time/Historic received an F1 score of 99%, making it the most accurate class. A total of 6 classes were completely misclassified with a 0% F1 score. These genres are Blues, Country, Easy Listening, Instrumental, Pop, and RnB. This is most likely due to unbalanced sample sizes in the training set. In the next section we will further discuss our findings.

6. DISCUSSION
After receiving the results of each algorithm, the CRNN did not perform as well as expected. We think this may be due to unbalanced sample sizes in each class when training the data. However, the CRNN model did produce results for our test data quickly. It was able to process approximately 4,000 songs in 16 seconds and classify them into genres. To no surprise, the SVC model performed the best. During research, many reports mentioned the use of SVC for music genre classification. The SVC approach almost always showed the best accuracy scores next to a neural network approach. The big difference between these approaches is the way the data is formulated. In the SVC approach, you must extract features from each song. The neural network approach is an image recognition task that requires the user to extract the spectrogram from each song and formulate arrays. One downfall of this project was the dataset not having an equal number of songs in each genre for the training dataset. Although it isn't required for a classification task, having an equal number of samples per class allows the classifiers to train each class equally. This would allow us to build a more accurate model. One solution to this issue is resampling of the training set. This would allow us to better fit our model with equal sample sizes. This method was tested on the Decision Tree and SVC models using the built-in class_weights function. This function allows us to balance the weights of each class to better fit the model. For instance, if Class A has more samples than Class B, Class B will be weighted higher than Class A. Using this on the Decision Tree model produced undesirable results, with only the highest weighted class being classified. The SVC model produced similar results with the class_weights function, as discussed in the results section. Another cause of lower accuracy could be that some songs were not labeled properly to begin with or are ambiguous across multiple genres. Music genre classification is a highly opinionated topic, making it a difficult classification task. Music can also be easily classified into more than one genre or have similarities to another. For example, an instrumental track can have a guitar riff that causes the song to be misclassified as rock. Another example could be a pop song with electronic roots.
All these issues must be taken into consideration when developing a strong dataset for music genre classification.

7. CONCLUSION
In this project, we attempted to classify music into its appropriate genre category. The key findings show us that the best classifier to use on this dataset is an SVC, which gave us a 72% accuracy score. Our CRNN model did not perform as well as expected, with only a 65% accuracy score. These key findings show that there is still much room for improvement when it comes to music genre classification. They may also show us that the dataset plays a key role in how accurate the predictions of your model may be.

8. FUTURE WORK
In order to extend this work, we could try batching the spectrogram into 3-second windows to more accurately predict each song's classification. Another suggestion would be to build a better dataset with expertly tagged songs and more genres. The topic of music genre relies heavily on opinion, and when extracting songs from a non-expertly tagged database, inaccuracies should be expected. After a satisfactory accuracy score is met, our next suggestion would be to build an app that would allow a user to take songs in their music library and have it place genre tags on each song. With thousands of genres and subgenres of music, this app would have to be constantly updated in order to keep up with the pace of today's music.

9. ACKNOWLEDGEMENTS
A special thank you goes to Dr. Upal, who has helped guide me through this project throughout my final year of graduate school. Another thank you goes to all who have supported me throughout my studies at Mercyhurst University.

10. REFERENCES
[1] Antonio Jose Homsi Goulart, Rodrigo Capobianco Guido, Carlos Dias Maciel. 2012. Exploring different approaches for music genre classification. Egyptian Informatics Journal 13, 2 (July 2012), 59-63. DOI: https://www.sciencedirect.com/science/article/pii/S1110866512000151#b0015
[2] Eliot Van Buskirk. 2015. 50 Genres with the Strangest Names on Spotify. (Sep. 2015). Retrieved October 18, 2018 from https://insights.spotify.com/us/2015/09/30/50-strangest-genre-names/
[3] Mingwen Dong. 2018. Convolutional Neural Network Achieves Human-level Accuracy in Music Genre Classification. arXiv:1802.09697v1. Retrieved from https://arxiv.org/pdf/1802.09697.pdf
[4] Keunwoo Choi, György Fazekas, Mark Sandler. 2016. Convolutional Recurrent Neural Networks for Music Classification. arXiv:1609.04243v3. Retrieved from https://arxiv.org/pdf/1609.04243.pdf
[5] Matthew Creme, Charles Burlin, Raphael Lenain. 2016. Music Genre Classification. (December 2016). Retrieved November 8, 2018 from http://cs229.stanford.edu/proj2016/report/BurlinCremeLenain-MusicGenreClassification-report.pdf
[6] Qiuqiang Kong, Xiaohui Feng, Yanxiong Li. 2014. Music Genre Classification Using Convolutional Neural Network. (2014). Retrieved November 8, 2018 from http://www.terasoft.com.tw/conf/ismir2014/LBD%5CLBD17.pdf
[7] Alexandros Tsaptsinos. 2017. Lyrics-Based Music Genre Classification Using a Hierarchical Attention Network. (2017). Retrieved November 8, 2018 from https://ccrma.stanford.edu/groups/meri/assets/pdf/tsaptsinos2017preprint.pdf
[8] Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, Paul Lamere. 2011. The Million Song Dataset. (2011). Retrieved November 8, 2018 from https://www.ee.columbia.edu/~dpwe/pubs/BertEWL11msd.pdf
[9] Michael I. Mandel, Daniel P.W. Ellis. 2008.
Multiple-Instance Learning for Music Information Retrieval. (2008). Retrieved November 8, 2018 from http://www.ee.columbia.edu/~dpwe/pubs/MandelE08MImusic.pdf
[10] Priya Dwivedi. 2018. Using CNNs and RNNs for Music Genre Recognition. (Dec. 2018). Retrieved February 6, 2019 from https://towardsdatascience.com/using-cnns-and-rnns-for-music-genre-recognition-2435fb2ed6af

About the author: Andrew J. Innes is a 2nd year Data Science Graduate Student at Mercyhurst University. Andrew has an undergraduate degree in Business Competitive Intelligence. He also has a strong passion for music and enjoys sharing his passion with others.

Predicting Hole by Hole Golf Scores on the PGA Tour
Ron Richardson
Department of Computing and Information Science
Mercyhurst University
Erie, PA, USA
ron.richardson@gmail.com

ABSTRACT
This paper tests different machine learning techniques to predict the score made by a golfer on the PGA Tour, comparing features of the golfer's skills with additional features of the course and hole, to examine the impact the course has on performance. For this study, data from the PGA Tour was used between 2016 and 2018. Using a few handpicked features of golfer performance alongside features of the course being played, the paper concludes that the course does not have a significant impact on the outcome of the hole, and the skills of the golfer alone determine the score on the hole. Using an optimized Random Forest classifier, an accuracy score of 62.7% was achieved.

Keywords
Golf, PGA Tour, ShotLink, machine learning, random forest, classification.

1. INTRODUCTION
Using machine learning techniques to predict the outcomes of sporting events is nothing new. There has been plenty of research on the major sports (baseball, basketball, football), but the amount of research on golf has been limited. What makes golf so hard to predict is the amount of randomness that can occur during a round of golf. A golfer might hole out a long shot where the probability of doing so is very low. A hole in one by a professional golfer is estimated at 3,000 to 1, whereas for an average golfer the probability jumps to 12,000 to 1 [1]. A hole in one earns a golfer roughly 2.1 strokes gained for the hole, where the mean strokes gained is usually between 0.5 and 1. Determining the best feature set for predicting golf scores has been evaluated several times, starting with Davidson and Templin in 1986, where only three features explained 86% of a golfer's scoring variance (greens in regulation (GIR), total putts, and driving proficiency) [2]. In 1992, Shmanske used three different, but similar, features for prediction [3]. This type of research continued, with some variations, until Sen created a single metric to use for score prediction [4]. Previous research focuses on total round score, whereas this paper breaks scoring prediction down to individual holes. Breaking down scoring predictions to the hole level can open a whole new level of betting options and prop bets. This can increase the volume of betting and fan engagement. Additionally, this level of breakdown can help professional golfers identify which types of holes cause issues, allowing them to work on different strategies or practice different skills to improve their chances of success.

1.1 Available Data
With the introduction of ShotLink by the PGA Tour in 2004, the amount of available data has increased significantly [5].
ShotLink is a real-time system that collects every shot hit by a golfer during tournaments on the PGA Tour. The data is collected by a team of volunteers at the tournament using a laser to pinpoint the starting and ending location of each shot, while logging the type of condition the golfer hit from (e.g. fairway, rough, sand, etc.). Over the past 14 years, the amount of data that has come from this system is staggering. From 2010 to 2018 alone, over 10 million shots have been recorded and over 300 data features calculated.

1.2 Strokes Gained
Coming out of this trove of data was a new statistic called Strokes Gained. This statistic was first created by Mark Broadie in 2008 for putting and expanded in 2012 to include other aspects of the game [6]. Strokes gained provides a benchmark for comparing golfers in certain skills and has been valuable in giving viewers and fans better insight into how a golfer is performing. Strokes gained is a measure of the effectiveness of a golf shot with respect to the golfer's score and represents the decrease in the average number of strokes to finish the hole, from the beginning of the shot to the end of the shot, minus one to account for the stroke taken [6]. If J(distance, condition) is a function that represents the average number of strokes it takes a PGA Tour golfer to complete the hole, where distance is the distance the ball is from the hole, and condition is the current location of the ball for the shot (e.g. fairway, rough, green), then the strokes gained of shot i is defined as the difference between the current shot and the next shot, minus one to account for the stroke actually taken:

g_i = J(distance_i, condition_i) - J(distance_{i+1}, condition_{i+1}) - 1

Broadie gives the following example: Suppose the average number of shots to complete a hole from 40 yards away in the fairway is 2.6. If the golfer hits the shot to one foot away from the hole (where the average number of shots to complete the hole is 1.0), then the strokes gained is 0.6 [6].

g = 2.6 - 1.0 - 1 = 0.6

In general, a positive strokes gained value indicates that the shot is better than a PGA Tour golfer's average shot from that distance and condition. Strokes gained helps quantify some of the legacy stats that exist on the PGA Tour. For example, the total putts in a round statistic was commonly used but can be deceiving. If golfer A takes only 29 putts during a round, while golfer B takes 31 putts, does that mean that golfer A is a better putter than golfer B? Golfer A might have missed more greens in regulation (measured as reaching the green in at least two strokes fewer than par for the hole). By missing the green in regulation, golfer A has a shorter shot (not a putt) that has a decent likelihood of getting much closer to the hole, resulting in an easier putt and a greater probability of needing only one putt to complete the hole. However, golfer B might be on the green in regulation, but from a much longer distance, and require 2 or more putts to complete the hole. In his book "Every Shot Counts" [7], Broadie calculated the probabilities of one-putting from different distances, compared to the probability of three-putting, between the 2003 and 2012 seasons. From 2 feet, the probability of making it in one shot is 99%, for an average of 1.01 shots to complete the hole. From 60 feet, the one-putt probability is reduced to 2%, for an average of 2.21 shots to complete the hole.
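The definition above can be expressed directly in code. In this small sketch, only the 2.6 and 1.0 baseline values come from Broadie's example; the other entries and the lookup structure are illustrative assumptions.

```python
# Hypothetical baseline table J(condition, distance): average strokes for a PGA
# Tour golfer to hole out. Only the 2.6 and 1.0 values come from Broadie's example.
BASELINE = {
    ("fairway", 40.0): 2.6,   # 40 yards out, in the fairway
    ("green", 1.0): 1.0,      # roughly one foot from the hole
    ("holed", 0.0): 0.0,
}

def strokes_gained(before, after):
    """g_i = J(distance_i, condition_i) - J(distance_{i+1}, condition_{i+1}) - 1."""
    return BASELINE[before] - BASELINE[after] - 1

# Broadie's example: a 40-yard fairway shot hit to one foot gains 0.6 strokes.
print(round(strokes_gained(("fairway", 40.0), ("green", 1.0)), 2))   # 0.6

# Likewise, holing out in two putts from 60 feet, where J is 2.21, gains
# 2.21 - 2 = 0.21 strokes in total across the two shots.
```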
So, if a golfer only takes 2 shots from 60 feet, they gained 0.21 strokes. Another golfer might only have 2 feet left to putt, and by only taking one shot to complete the hole, they gained only 0.01 strokes. The par on a hole is a rating that is designated by the course architect and represents what a golfer with a handicap rating of 0 (e.g. a "scratch" golfer) should score on the hole. Golfers on the PGA Tour don't play with a handicap but are generally considered about 4-8 strokes better than scratch. Par is generally related to the length of the hole but does not describe the difficulty of the hole. In some cases, the scoring average on a par 5 hole can actually be lower than the scoring average on a par 4 hole. The distribution of scores in 2018 by par values is shown in Figure 1.

Figure 1: Score distribution by hole par value

Strokes gained has evolved into the following categories, covering nearly all aspects of the game: off the tee (OTT), approach the green (ATG), around the green (ARG), putting, tee to green (T2G), and total. For this paper, the average strokes gained for OTT, ATG, ARG, and putting were used as predictive features for a golfer's skill.

2. DATA COLLECTION
The data for this research was obtained from the PGA Tour ShotLink System. Academic access was granted by the PGA Tour in August 2017, but the program closed in January 2019. When the 2018 golf season ended in September, the following data was downloaded for the 2016 through 2018 seasons: course data, round data, hole data, and stroke data. This data is provided in a collection of delimited text files for each year. Within the files, there are hundreds of measured and calculated statistics, so feature reduction is important. Two different feature sets were created, with the first set having a total of 21 features, combining both golfer features and course/hole features. The second feature set removed all course/hole features, resulting in only 12 features. The 2016 season was used to train the algorithms, with the 2017 season used to validate the results. Finally, the 2018 season was used as a final test. This resulted in a training set of 271,462 holes, a validation set of 290,122 holes, and a test set of 281,376 holes.

2.1 Course/Hole Features
For most every course that is played on the PGA Tour, categorical data for each hole is provided for the type of grass used on the tee boxes, fairways, rough, and greens. Additionally, the height of those grasses is provided, as some courses have much taller grass in the rough, which can make the course tougher (e.g. GC of Houston had a rough height of 1.25 inches in 2018, while Aronimink GC had a rough height of 4.75 inches). Wind speed and direction are measured both in the morning and afternoon, as well as how firm the fairways and greens are. As of 2010, additional data includes the width of the fairways at certain distances. These widths are measured at distances between 275 and 350 yards, which contains most drives hit by professional golfers. Broadie recently measured the cost of missing the fairway during the 2019 PGA Tour season and found that not hitting the fairway with a tee shot costs a golfer on average 0.32 strokes [8].

2.2 Golfer Features
With nearly 300 different statistics for a golfer's performance included in the data files, picking the most predictive ones is key. Since strokes gained is widely used as a measure of a golfer's performance, it makes sense to include these in the feature set.
Additionally, a golfer's average driving distance, driving accuracy, percentage of greens hit in regulation, and putts per hole were calculated for each golfer. Before training the algorithms with these features, the golfer's features at the time of the hole being predicted needed to be calculated. This was performed by taking the previous 25 rounds of golf played by the golfer (450 holes) prior to the tournament being played.

3. MODEL SELECTION
Since golf scores are always whole numbers, and generally vary between 1 and 8, a classifier can be used rather than regression. To test the best model to use, a sample of 12 golfers from the 2016 season was selected and run through a couple of classifiers, including Random Forest and Gaussian Naïve Bayes. From this initial test, a Random Forest classifier performed the best, before any hyperparameter optimization. To optimize the hyperparameters, both a random search and grid search were performed, with the grid search providing the best optimization of the Random Forest classifier. After the grid search, the full feature set with course/hole features included was reduced from 21 features down to 10. The grid search was also run on the feature set that excluded course/hole features, and no feature reduction was needed, keeping the total feature count at 12.

4. RESULTS
4.1 Model Accuracy
The default Random Forest classifier with both golfer and course/hole features included resulted in an accuracy score of 59.4% against the validation data set. When the course/hole features were removed, the accuracy of the default Random Forest classifier remained steady at 59.3%. The results of the grid search determined that using entropy as the criterion for measuring the quality of a split was best. A max depth of 3 was used, along with only 10 features, and a minimum sample split of two. This resulted in a bump in accuracy to 62.7% with the full feature set, and exactly the same for the reduced feature set of just golfer skills.

4.1.1 Feature Importance
In the full feature set, the top two features in terms of importance were the actual yardage of the hole and the par value for the hole. This is not surprising since the length of the hole is highly correlated to the par value (e.g. a 200 yard hole is always a par 3, and a 550 yard hole is always a par 5). Since PGA Tour golfers are considered better than a 0 handicap golfer (for which par is determined), their scores will usually hover very closely around par (within one stroke). The top 10 most important features for the full feature set are shown below in Table 1.

Table 1: Top 10 Features for Golfers and Courses/Holes by Importance
  Actual Yardage            0.345919
  Par                       0.089438
  Actual 275 Distance       0.062597
  Actual 350 Distance       0.041917
  Actual 325 Distance       0.038473
  Avg SG Approach           0.035605
  Avg SG Around the Green   0.035171
  Avg Driving Distance      0.035159
  Driving Accuracy          0.034993
  Avg SG Off the Tee        0.034901

Once the course/hole features were removed, the same test of feature importance was run, and again, the actual yardage of the hole and par value were the top two predictive values. However, the importance of approach shots, shots off the tee, and average driving distance were also high on the list. The top 10 most important features for the reduced feature set are shown below in Table 2.
Table 2: Top 10 Features for Golfers by Importance
  Actual Yardage            0.584660
  Par                       0.115456
  Avg SG Approach           0.031246
  Avg SG Off the Tee        0.031179
  Avg Driving Distance      0.031016
  Avg SG Around the Green   0.030791
  Avg SG Putting            0.030652
  Scrambling Success        0.030519
  Driving Accuracy          0.030041
  Putts per Hole (GIR)      0.029770

4.1.2 One Round Test
A test was performed on one round played in 2018 to show the hole by hole predictions. The round chosen was by Phil Mickelson in 2018 during the first round of the WGC Bridgestone Invitational played at Firestone Country Club in Akron, OH. Par for the course is 70, and Mickelson shot a 66 that day. Tables 3 and 4 show the prediction results for each hole with all 4 models used.

Table 3: Prediction results for one round vs. Actual results (Front 9)
  Model (Features)      1  2  3  4  5  6  7  8  9   %
  Actual Score          4  3  4  4  2  5  2  4  4
  Default (Full)        4  6  4  4  3  4  3  4  3   44%
  Optimized (Full)      4  5  4  4  3  4  3  4  4   56%
  Default (Reduced)     5  5  4  5  3  5  3  5  4   33%
  Optimized (Reduced)   4  5  4  4  3  4  3  4  4   56%

Table 4: Prediction results for one round vs. Actual results (Back 9)
  Model (Features)      10 11 12 13 14 15 16 17 18  %
  Actual Score          3  4  3  4  4  3  5  4  4
  Default (Full)        4  4  3  4  4  3  5  4  4   89%
  Optimized (Full)      4  4  3  4  4  3  5  4  4   89%
  Default (Reduced)     4  5  3  5  4  3  4  4  5   44%
  Optimized (Reduced)   4  4  3  4  4  3  5  4  4   89%

Both the default and optimized Random Forest models, using the full feature set, predicted a total round score of 70. The default model with the reduced feature set predicted a total round score of 76, while the optimized model predicted a total round score of 70. However, since we are looking at hole by hole predictions, the best models were the optimized models, regardless of feature set, having an accuracy of 72.2% for this 18-hole round. Most of the holes were predicted correctly; however, hole #2 was never predicted correctly by any model. On this hole (a par 5 measuring 529 yards), Mickelson scored an eagle (3), which is 2 shots better than par. An eagle only occurred 2.2% of the time during the 2018 season, so predicting this value is difficult. No other hole was incorrect by more than one shot.

5. DISCUSSION
This paper discussed using and optimizing a Random Forest classifier to predict a PGA Tour golfer's score on a specific hole, while also investigating whether a course or hole setup makes much of an impact on the predictions. A classifier was chosen because golf scores on a hole usually fall within the range of 2 and 6. The best model in this research never classified a score outside the range of 3 and 5. While the default Random Forest models show that some course/hole features are important, removing them provided no significant decrease in the accuracy of the predictions. Matt and Will Courchene of the website datagolf.ca have also found a very small correlation between a golfer's past performance at a course and future performance, indicating that using course history and setup blindly won't always tell the whole story [9]. There is an old adage in golf that says, "drive for show, putt for dough", meaning that the long tee shots look neat and are fun to hit, but putting is where you will make your money and lower your scores. From this research, putting statistics were barely an influence in the prediction, but approach shots to the green and driving distance and accuracy were more important.
Broadie has confirmed this several times, most recently after one tournament where Rory McIlroy won and was in the top 10 of the field in driving and approach shots [10]. The PGA Tour has the best golfers in the world, and a lot of that can come down to course management and strategy. With a narrower hole, a golfer might choose to play to a strategic portion of the hole to optimize their chances of scoring close to or below par. Most amateur golfers might hit their standard tee shot and end up in a more difficult situation, which might lead to a higher number. The consistency of shots for a professional golfer versus an amateur golfer is the biggest difference between them.

6. FUTURE WORK
Admittedly, there is a lot of future work that can be performed. The number of features available in the ShotLink system is staggering and includes granular statistics such as proximity to the hole from certain distances, putting accuracy from certain distances, and detailed notes on the type of condition each shot was taken from. Using these additional features, rather than handpicked ones, could increase the accuracy of the hole predictions. Principal Component Analysis could be utilized to quickly figure out which features can be grouped together. Additional algorithms can also be explored. The Poisson regression algorithm is a type of generalized linear model that is used when the target value is a count, which a golf score is. Ordered logistic regression was also recommended and has been used by several people in the daily fantasy sports betting arena. Dan Rosenheck of The Economist used the Burr Type XII distribution for his hole by hole prediction system called EAGLE, first presented at the 2017 MIT Sloan Sports Analytics Conference in Boston and updated in 2019 [11]. For this research, certain course features were not included, such as grass type and firmness, along with environmental features such as wind speed and direction. The wind can have a big impact on a golf score, most notably in Europe where tournaments have been played in 40 mph gusts. This research did not investigate scoring differences between each round (except for the difference in hole yardages, which only varies by a few yards), but there might be some additional information that can be extracted for each round that has an impact on scoring predictions.

7. ACKNOWLEDGMENTS
The author would like to thank the PGA Tour for access to the ShotLink system and data. Without this information readily available, additional work to piece together the data would be required. The author has yet to find another reliable source for this level of course and hole detail anywhere else. The author would also like to thank Dr. Afzal Upal and Dr. Stephen Ousley for their guidance throughout this process.

8. REFERENCES
[1] Auclair, T.J. June 29, 2018. Odds of a hole in one, albatross, condor and golf's other unlikely shots. Retrieved December 12, 2018 from https://www.pga.com/news/golf-buzz/odds-hole-in-one-albatross-condor
[2] Davidson, J.D. and Templin, T.J. 1986. Determinants of Success Among Professional Golfers. Research Quarterly for Exercise and Sport, 57, 1 (1986), 60-67.
[3] Shmanske, S. 1992. Human Capital Formation in Professional Sports: Evidence from the PGA Tour. Atlantic Economic Journal, 20, 3, 66-80.
[4] Sen, K.C. 2012. Mapping statistics to success on the PGA Tour: Insights from the use of a single metric. Sport, Business and Management: An International Journal, 2, 1 (2012), 39-50.
[5] ShotLink Background.
Retrieved December 12, 2018 from http://www.shotlink.com/about/background [6] Broadie, Mark. 2012. Assessing Golfer Performance on the PGA TOUR. Interfaces, 42, 2 (April 1 2012), 105-228. DOI: https://doi.org/10.1287/inte.1120.0626 [7] Broadie, Mark. 2014. Every Shot Counts. Gotham, New York, NY. [8] @MarkBroadie. 2019. Mark Broadie on Twitter: A standard way to measure the cost of a missed fairway. Retrieved March 30, 2019 from https://twitter.com/MarkBroadie/status/1108804384673222 659 [9] @DataGolf. 2019. data golf on Twitter: (Course History thread!). Retrieved March 30, 2019 from https://twitter.com/DataGolf/status/1086138883916496896 [10] @MarkBroadie. 2019. Mark Broadie on Twitter: Strokes gained results. Retrieved March 30, 2019 from https://twitter.com/MarkBroadie/status/1108013944738930 689 [11] Rosenheck, Dan. 2017. The EAGLE has landed: Real-time win probabilities in men’s major golf tournaments. In MIT Sloan Sports Analytics Conference (March 3-4, 2017). Boston, MA. Retrieved from http://www.sloansportsconference.com/content/eaglelanded-real-time-win-probabilities-mens-major-golftournaments/ 13 M. A. Upal (editor) About the author: Ron Richardson is a graduate student at Mercyhurst University studying Data Science. Previously, he graduated from Penn State University, majoring in both Computer Science and Mathematics. 14 He has worked for Fortune 500 companies as a software engineer and IT Manager and has built enterprise-level software for businesses of varying sizes. In his spare time, he can be found on the golf course, claiming it benefits his research projects. Proceedings of the Second Annual Data Science Symposium, Mercyhurst University, Erie, PA, May 4, 2019 Logistic regression Versus Convolutional neural network for classification Jerrin Joe Varghese Department of Data Science Mercyhurst University jvargh81@lakers.mercyhurst.edu ABSTRACT Machine learning algorithms are becoming popular and are widely used for giving machines the ability to learn for them self without human intervention. Hence, these algorithms are used for object detection, image classification, stock prediction etc. Some machine learning algorithms are complex and requires more memory and processing power. This paper proposes the use of logistic regression to overcome the problem of memory and processing power, if the data can be turned into a binary classification problem. In order to test this hypothesis, the paper goes through the problem of driver distraction and uses both convolutional neural network (CNN) [8] and logistic regression [10] to analyze the performance of both models on different machines with different memory and processing power. 1. INTRODUCTION Machine learning is a type of artificial intelligence technique that learns to identify new pattern in data. This technique is now widely used in several industries for various tasks. There are different types of machine learning algorithms such as supervised learning, unsupervised learning, and reinforcement learning. So, it’s important to know which type of machine learning algorithm is best suited for a machine learning problem. Supervised learning is the search for algorithms that reason from externally supplied instances to produce general hypotheses, which then make predictions about future instances. In other words, the goal of supervised learning is to build a concise model of the distribution of class labels in terms of predictive features [1]. 
We provide the machine with data that is already labeled and it’s not learning own its own, we call this type of technique as supervised learning. Supervised learning can be further divided into two: classification, where the output is in the form of categories and regression, where the output is in the form of real values [3]. Examples of supervised learning is Linear Regression, Logistic Regression, CART, Naive Bayes, KNN etc. Unsupervised on the other hand, studies how systems can learn to represent input patterns in a way that reflects the statistical structure of the overall collection of input patterns [2]. In this type of machine learning, the machine simply receives as data, but obtains neither supervised target outputs nor any rewards from its environment [4]. The machine learns patterns that it feels are present in the data. Unsupervised learning can be further subdivided into three sub categories such as clustering, association and dimensionality reduction. Examples of unsupervised learning are K-means, PCA etc. Finally, reinforcement [6] learning is a type of machine learning that helps the machine learn by helping it decide the next action by rewarding it. This methodology is typically used in robotics where the machine learns by trial and error technique. One of the most recent use of reinforcement learning is its use in Deep mind’s StarCraft project [5]. Where rewards are given for learning based on the score obtained from the StarCraft II engine against the built-in computer opponent. This paper focuses on supervised learning and whether we can use logistic regression to solve machine learning problems that could be converted to binary classification. The paper is looking into the binary classification problem because it relies on the premise: if any problem can be converted to a binary classifier and gives us a result that is close to deep learning or convolution neural networks with Logistic regression, then we could save time and solve the machine learning problems faster and utilize computers with low memory and computer processing power. Hence, driver distraction [7] is the problem used in the paper as it can be converted to a binary classification, where the classes are divided into distracted and undistracted drivers. This machine learning problem is solved using convolutional neural network and logistic regression [10]; therefore, developed three models, one with binary classification using logistic regression, the second multiple models of logistic regression and multi-class using convolutional neural network [8] and multi-class logistic regression [9]. All the models are available on Kaggle for reference [11,12,13]. 2. DATA The dataset is collected from Kaggle’s State Farm dataset [14]. Figure 1 shows two examples of the dataset. The dataset consists of 10 classes such as safe driving, texting–right, talking on the phone–right, texting–left, talking on the phone–left, operating the radio, drinking, reaching behind, hair and makeup, and talking to passenger [14]. The dataset consists of 22424 labeled data for training and 79726 data for validation. The shape of each image is of 480 x 640. (a) (b) Figure 1. Example of a driver distraction dataset: (a) safe driving (b) dangerous driving- texting. 3. RELEVANT WORK Convolutional neural network is traditionally used for image processing as they extract features by convolving the images and extracting useful information. 
Logistic regression is another machine learning algorithm which is widely used for binary classification. The paper goes through both algorithms to understand whether a binary classification problem or the one that can be converted to a binary classification problem needs CNN as it requires more memory and processing power. 3.1 Convolution Neural Network (CNN) 15 16 M. A. Upal (editor) Convolution Neural Network [8], consist of an input layer, hidden layer, and an output layer. Some of these layers in the network are: Convolution [15], Activation [16], Pooling [17], Dropout [18], Dense, and SoftMax [19]. The Convolution layer consists a set of filters, where each filter can be considered as a small square that extends through the full depth of the input volume. During each pass, the filter convolves across the width and height of the input, which results in a 2-d activation map that gives the responses of that filter at every spatial position. To avoid over-fitting, pooling layers are used to apply non-linear down sampling on the activation maps. It means that, this layer is aggressive at discarding information, but can be useful if used appropriately. Dropout layers also help to reduce over-fitting by randomly ignoring certain activation functions, while dense layers are fully connected layers and often come at the end of the Neural Network. The output of the layers of the neural network are processed using an activation function, which is a node that is added to the hidden layers and output layers. You’ll often find that the RELU activation [16] function is used in hidden layers, while the final layer typically consists of a SoftMax activation function. The idea is that by stacking layers of linear and non-linear functions, we can detect a large range of patterns and accurately predict a label for a given image. SoftMax is often found in the final layer which acts as basically a normalizer and produces a discrete probability distribution vector. Because of these benefits, CNN is most widely used in image classification or problems related to images. 3.1.1 Pooling The pooling layer reduces the spatial dimensions of the input and the computational complexity of our model. Pooling also helps in controlling the overfitting problem, as it operates on every slice independently. There are different functions such as Max pooling, average pooling or L2-norm pooling. Max pooling is the most used type of pooling that takes the most important part from each slice of the input data. 3.1.2 Rectified Linear Unit (Relu) Relu is an activation function that simply outputs 0 when x < 0, and conversely, it outputs a linear function when x ≥ 0 [16]. f (x) = max (0, x) 3.1.3 Dropout Dropout is one of the most effective regularization that is used in a neural network. Using dropout helps us to randomly keep only a neuron active with some probability ‘p’. This helps it to force the network to be accurate even if some information is not present, which in turn helps the network not to be dependent on any one neuron. 3.1.4 Fully Connected Layer In a fully connected layer, every neuron in one layer is connected to every neuron in the other layer. The last fully connected layer is the SoftMax activation function that classifies based on the generated features from the trained data. 3.2 Logistic Regression Logistic regression is a binary classification statistical machine learning model. The logistic regression is a sigmoid function, which takes any real input and outputs a value between 1 and 0 [21]. 
The sigmoid function is given by the formula:

Sigmoid(x) = 1 / (1 + e^(-x))

4. PROPOSED SOLUTION
This paper uses three models to understand whether it is possible to use logistic regression for machine learning problems that can be converted to binary classification and get the same results.

4.1 Pre-processing
This paper also focuses on different algorithms and how they perform on the driver distraction problem. The problem tackled here is to understand how each algorithm differs in its predictions and whether this analysis helps in understanding if even logistic regression can play a vital role in problems like driver distraction. The image data from the dataset is split into a training and a validation set, where the training set consists of images of size 240 x 240 and the image class number. The training set is then split into features and labels. The features are then converted to a 4-d NumPy array. The data is then used by the models for training.

4.2 Convolutional neural net model
The convolutional neural network uses Keras's sequential model [11] and is divided into three convolutional groups. Each group consists of two convolutional layers, with filter sizes of 32, 64, and 128 across the groups and a 3x3 kernel. The convolutional layers also use zero padding and ReLU as the activation. Each convolutional layer is followed by batch normalization to normalize the data, and each group of convolutional layers ends with a max pooling layer and a dropout layer. The convolutional layers are flattened and connected to the fully connected layers. The fully connected part consists of three dense layers with 512, 128, and 10 neurons respectively. The loss function used is categorical cross entropy and the optimizer is Adam.

4.3 Logistic Regression model
In this paper, logistic regression uses Keras's sequential model [13] with one batch normalization layer and one dense layer, with cross entropy as the loss function and Adam as the optimizer. The model also uses early stopping to avoid overfitting. The model splits the data into nine different groups, where each group has a good driver and bad driver (for each distraction) combination. The data is then trained to predict whether the driver is good or bad. This information can be further processed and amalgamated to get similar output to the convolutional neural network. The second logistic regression model [12] splits the data into two classes, i.e. good driver and bad driver, where the bad driver data is a combination of all the distracted classes matching the class size of the good driver. The model uses Keras's sequential model and a configuration like the previous logistic model. The model is then trained to predict different images as good or bad driver.

5. RESULTS
This paper compares the r squared results, confusion matrix, and accuracy to understand whether logistic regression can be used instead of a CNN for problems that can be converted to a binary classification problem. The results also consider the time required by the models to train, as computers with less CPU power and memory are likely to take more time compared to powerful machines. The r squared values for the CNN model and the two logistic models are shown in Table 1.

Table 1: R squared values of each model

Model Name                                       GPU Used   R square   Accuracy   Time consumed
CNN                                              Yes        1.0        0.99       1.5 hours
CNN                                              No         1.0        0.99       40 hours
Individual Logistic model (average of all)       No         0.97       0.99       20 mins
Logistic model                                   No         0.837      0.89       10 mins
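A rough sketch of the two kinds of models being compared in Table 1 (layer sizes follow Sections 4.2 and 4.3; the input shape, padding, dropout rates, and other details are assumptions, not the author's exact configuration):

from tensorflow.keras import layers, models

def build_cnn(num_classes=10):
    # Three convolutional groups (32, 64, 128 filters), each with two 3x3
    # conv layers, batch normalization, max pooling and dropout (Section 4.2)
    model = models.Sequential()
    model.add(layers.Input(shape=(240, 240, 3)))
    for filters in (32, 64, 128):
        for _ in range(2):
            model.add(layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
            model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling2D())
        model.add(layers.Dropout(0.25))
    model.add(layers.Flatten())
    model.add(layers.Dense(512, activation="relu"))
    model.add(layers.Dense(128, activation="relu"))
    model.add(layers.Dense(num_classes, activation="softmax"))
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model

def build_logistic(input_dim):
    # Logistic regression as a Keras model: one batch normalization layer and
    # a single dense sigmoid output over flattened pixel features (Section 4.3)
    model = models.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.BatchNormalization(),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model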
Figure 2. Accuracy graph for all models: (a) Convolutional neural network (b) Individual Logistic regression model (c) Logistic regression model.

Figure 3. Loss graph for all models: (a) Convolutional neural network (b) Individual Logistic regression model (c) Logistic regression model.

Figure 4. Confusion matrix graph for all models: (a) Convolutional neural network (b) Individual Logistic regression model (c) Logistic regression model.

Figures 2, 3, and 4 show the accuracy, loss, and confusion matrix for each model. Table 1 also shows the time taken to train each model. The results show that the values are approximately equal, and the time required by logistic regression is much less than that of the CNN. Hence, we can also use logistic regression when the data can be framed as a binary classification problem and when memory and CPU power are limited.

6. CONCLUSION AND FUTURE WORK
Memory and processing power have been major issues for machine learning models. The paper compares logistic regression models and a convolutional neural network in order to understand whether it is possible to replace the convolutional neural network, because a convolutional neural network needs more processing power than logistic regression. Our results show that logistic regression gives results similar to the convolutional neural network, which is promising. Future work will include using more complex machine learning problems, where the data is more complicated with more diverse images, to see if the results are the same.

REFERENCES
[1] S. B. Kotsiantis. Supervised Machine Learning: A Review of Classification Techniques.
[2] Appeared in Wilson, R.A. & Keil, F., editors. The MIT Encyclopedia of the Cognitive Sciences.
[3] Elnaz Barshan, Ali Ghodsi, Zohreh Azimifar, Mansoor Zolghadri Jahromi. Supervised Principal Component Analysis: Visualization, Classification and Regression on Subspaces and Submanifolds.
[4] Zoubin Ghahramani. Unsupervised Learning. University College London, UK. zoubin@gatsby.ucl.ac.uk, http://www.gatsby.ucl.ac.uk/~zoubin.
[5] Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander Sasha Vezhnevets, Michelle Yeo, Alireza Makhzani, Heinrich Küttler, John Agapiou, Julian Schrittwieser, Stephen Gaffney, Stig Petersen, Karen Simonyan, Tom Schaul, Hado van Hasselt, David Silver, Timothy Lillicrap (DeepMind), Kevin Calderone, Paul Keet, Anthony Brunasso, David Lawrence, Anders Ekermo, Jacob Repp, Rodney Tsing (Blizzard). StarCraft II: A New Challenge for Reinforcement Learning.
[6] Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The Arcade Learning Environment: An evaluation platform for general agents. J. Artif. Intell. Res. (JAIR), 47:253-279, 2013.
[7] Understanding the Distracted Brain: Why Driving While Using Hands-Free Cell Phones Is Risky Behavior. National Safety Council White Paper, April 2012.
[8] Wenling Shang, Kihyuk Sohn, Diogo Almeida, Honglak Lee. Understanding and Improving Convolutional Neural Networks via Concatenated Rectified Linear Units. University of Michigan, Ann Arbor; NEC Laboratories America; Enlitic; Oculus VR.
[9] Peter Karsmakers, Kristiaan Pelckmans, Johan A.K.
Suykens. Multi-class kernel logistic regression: a fixed-size implementation.
[10] David W. Hosmer (University of Massachusetts, Amherst, Massachusetts) and Stanley Lemeshow (The Ohio State University, Columbus, Ohio). Applied Logistic Regression, second edition.
[11] https://www.kaggle.com/jerrinv/driver-distraction. Dated: May 01, 2019.
[12] https://www.kaggle.com/jerrinv/logistic-regression. Dated: May 01, 2019.
[13] https://www.kaggle.com/jerrinv/driver-distraction-using-logistic-regression. Dated: May 01, 2019.
[14] https://www.kaggle.com/c/state-farm-distracted-driver-detection/data. Dated: May 01, 2019.
[15] Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. Relation Classification via Convolutional Deep Neural Network. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, 95 Zhongguancun East Road, Beijing 100190, China. {djzeng,kliu,swlai,gyzhou,jzhao}@nlpr.ia.ac.cn.
[16] Abien Fred M. Agarap. Deep Learning using Rectified Linear Units (ReLU).
[17] Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition. University of Bonn, Institute of Computer Science VI, Autonomous Intelligent Systems Group, Römerstr. 164, 53117 Bonn, Germany.
[18] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Department of Computer Science, University of Toronto, 10 Kings College Road, Rm 3302, Toronto, Ontario, M5S 3G4, Canada.
[19] Kaibo Duan, S. Sathiya Keerthi, Wei Chu, Shirish Krishnaj Shevade, and Aun Neow Poo. Multi-Category Classification by Soft-Max Combination of Binary Classifiers. Control Division, Department of Mechanical Engineering, National University of Singapore, Singapore 119260; Department of Computer Science and Automation, Indian Institute of Science, Bangalore 560.
[20] Paul D. Allison. Measures of Fit for Logistic Regression. Statistical Horizons LLC and the University of Pennsylvania.
[21] Wikipedia: https://en.wikipedia.org/wiki/Logistic_regression#Definition_of_the_logistic_function

About the author: Jerrin Varghese is a Graduate Student at Mercyhurst University.

Machine Learning for the Detection of Mobile Malware on Android Devices
Christina Eusanio
Ridge College of Intelligence Studies and Applied Sciences
Mercyhurst University
Erie, PA
ceusan60@lakers.mercyhurst.edu

ABSTRACT
The widespread use of Android smartphones with access to third-party applications has spawned many security challenges, among them malware. Mobile malware applications appear to be harmless, but can access sensitive data like users' contacts, pictures, and passwords. Malware is becoming increasingly sophisticated and may not be flagged in early review stages by platforms like the Google Play store due to code obfuscation techniques. A better way to detect the presence of malware is to use machine learning to analyze a mobile device's behavior. This paper utilizes a labelled Android mobile malware dataset to train machine learning algorithms to detect the presence of malware on Android devices by using attributes such as battery percentage and CPU usage. The evaluation results suggest that those features are effective and can be utilized to successfully classify malicious applications with machine learning algorithms.

KEYWORDS
Android smartphone; machine learning; malware detection; anomaly detection

1 INTRODUCTION
Smartphone usage has become pervasive across the globe.
According to a Pew Research Center survey of 39 countries conducted in January 2018, a median of 59% of people reported owning a smartphone, with a higher reported usage of 72% among those in developed countries [1]. The convenience of smartphones has led society to rely on them for much of what we do, including making video calls, reading email, navigating to new locations, taking pictures, streaming music and videos, and even playing games for entertainment. This convenience comes with a cost to security. Mobile devices are prime targets for malware to steal users’ sensitive data, such as user location, photographs, contacts, and passwords. One of the largest mobile malware threats originates from one of the greatest advantages of smartphones—their application stores, which allow users to download an evergrowing variety of applications from third-party creators. Android mobile devices are particularly susceptible to this threat. For users of the iPhone’s Apple iOS, the application store is a closed system, allowing Apple to control the marketplace where users can download applications. Although malicious applications still slip through the cracks, this system allows Apple to review applications for security flaws before releasing them to the market. Android devices are able to access not only the controlled, official Google Play store to download mobile applications, but also alternative marketplaces. The applications in these unofficial marketplaces often contain malware in the form of popular, known applications repackaged to include malicious code [2]. When users download applications from any marketplace, the user typically is prompted with requests for certain privileges the first time they use the application. These permissions include access to sensitive data such as a user’s contacts, photos, calendar, and location data, as well as access to specific hardware items, including the phone’s camera and microphone [3]. Users often answer these prompts without thinking much about what that permission means and how it can impact their privacy [2]. This can lead to the installation of trojans, malicious applications that appear benign, on users’ phones that can then exploit the permissions granted without needing to figure out a means to exploit vulnerabilities within the phone’s software [4]. Given this threat, it is not enough to rely on official application marketplaces to do a thorough security check of each app. Personal computers have their own form of malware detection, but it often consumes too much memory and CPU and would not be suitable for mobile devices given the limitations of processing power and battery. Additionally, traditional malware detection programs on personal computers often rely on a database of malware signatures, this does not help with the detection of new malware types that have not been encountered [5,6]. One major obstacle to studying the applicability of machine learning algorithms to malware detection on smartphones is the lack of an extensive, labeled dataset. This research will make use of the SherLock dataset that fills this research gap and was collected with cybersecurity research in mind. It was collected over the course of three years beginning in 2015. 
50 participants were given Samsung Galaxy S5 smartphones with a malicious application installed, and the data collected is a time-series representation of a wide range of monitorable features of the smartphone, including CPU and memory data for each running application, along with a labeled set of activity by the malicious application. 19 20 M. A. Upal (editor) The malware used in this data collection experiment was updated at different time intervals so that a wide range of malware types could be captured by the data. Additionally, the malware used was based on malware samples found in the wild but modified so the participants’ privacy would be protected— this ensured the phone’s captured data would accurately reflect that of a phone infected with genuine malware, outside of the controlled model of the experiment [4]. applicability to mobile devices—given the limited battery and computing power, they wanted to ensure that the KBTA method could be employed directly on the smartphone to alert users in real time if malware is detected on their phones. They found that their method was effective, reporting 97% accuracy with only an average of 3% of the phone’s CPU consumed [5]. A major limitation of this study is that the researchers created the five malicious applications for this collection experiment, as the Android platform was in its infancy and no malware could be found in the wild [5]. Additionally, this data collected for this study was from five users over only a week. A richer dataset with malicious applications found in the wild would provide better testing conditions for this method. Another limitation of this method is the input needed by a security expert. With new iterations of malware being created constantly, the security expert in this scenario would need to constantly ensure the security context was up to date so that any anomalies could be detected. If machine learning is used, the algorithm could be constantly updated with new knowledge without the need for additional human input to protect the phone from malware. This research will expand on the techniques used by Shabtai et. al in 2010 that introduced “Andromaly,” a framework for continuous, dynamic mobile malware analysis that uses machine learning algorithms to classify collected data instances as either benign or malicious based on low-level features, similar to the SherLock dataset [6]. A major limitation of this research is the malware used was not found in the wild, as Android was a new platform at the time this study was done. The researchers created four instances of malware to use for testing, which limits the applicability of their model to real malware samples. The goal of this research is to utilize machine learning algorithms to classify applications as malicious or not based on the collected features in the dataset. The labels in the dataset will allow an accurate representation of the performance of the algorithm to determine if this approach is a viable option for malware detection on Android mobile devices. 2 Xue et. al introduces Malton, an on-device application that dynamically detects malware through multi-layer monitoring and information flow tracking along with efficient path exploration within the Android runtime framework [8]. To evaluate their detection model, the researchers tested Malton on real-world malware samples and compared Malton to previously proposed models that used similar methods to detect malware. 
Comparatively, Malton outperformed the other models and added new monitored features that allowed Malton to detect applications that were using native code loading [8]. A limitation of this study is the requirement for humans to select entry and exit points when parsing an application’s code for analysis. A fully automated approach would be better. RELATED WORK Malware analysis can be broken down into two types: static and dynamic analysis [2,7]. Static analysis of malware is performed independently of the code’s execution and attempts to find malicious behavior before it actually happens. Static analysis is a fast and efficient way to detect the presence of malware, but it is insufficient when used alone. Malware is still found in the official Google Play store [2], indicating that there are workarounds to static malware analysis and it is insufficient when used alone. Malicious code can avoid detection through techniques such as obfuscation, which makes lines of code difficult to read or reverse engineer. Additionally, malicious code that uses newer techniques and has not been detected and labelled as malicious by antimalware programs will not be caught by this type of analysis because it lacks similarity to previously identified malware. The proposed solution of this research will implement the dynamic analysis of mobile malware, so much of the focus of this section will be dedicated to the dynamic approach. Kim et. al focuses their research on detecting malware by creating a resource conscious application to monitor and analyze power consumption. The detection framework consists of a monitoring phase where a user’s power consumption is monitored, and a baseline power consumption is determined. Then, a power signature is created from the historical data, and the application uses anomaly detection to analyze a user’s current power usage and power signature against the established, historical signature database [9]. The power monitor has to take measurements of the power consumption at intervals to detect any anomalies caused by running applications. In order to leave out benign applications and actions that also consume large amounts of energy, such as media players that display video footage, the researchers characterized the application to define what kind of power consumption would be expected from it without generating a false positive [9]. The study proved successful in detecting previously unseen malware, with accuracy rates up to 95%. A limitation of the study was the small dataset used for training and testing the algorithm—90 power signatures were used in the training set, and 270 for testing. Another limitation is that this Shabtai et. al approached the problem of detecting previously unseen malware in Android mobile devices with a knowledge-based temporal abstraction (KBTA) method [5]. Their solution was to use raw, time stamped data that includes, for example, CPU usage and events by the user such as keyboard or touch screen usage to detect unusual usage patterns as defined by a domain expert—patterns such as high CPU usage while the phone was not in use. The researchers also tested their solution’s 20 Proceedings of the Second Annual Data Science Symposium, Mercyhurst University, Erie, PA, May 4, 2019 study did not evaluate its algorithm on real-world malware. The researchers created a worm emulator for the purpose of testing their energy-focused method to identify malware. A consistent barrier to research on malware is the limited sample of publicly available malware. 
In their research, Shabtai et. al introduce Andromaly— an anomaly-based malware detection method for Android mobile devices. Their research uses low-privileged monitorable features including CPU consumption, battery level, and number of data packets sent over the network to train a machine learning algorithm to classify mobile applications as either malicious or benign [6]. The researchers tested several machine learning classification algorithms on the data they collected, including kmeans, logistic regression, decision trees, Naïve Bayes, and Bayesian networks. To evaluate the performance of the algorithms in this binary classification problem, the researchers used the Receiver Operating Characteristic (ROC) curve and used that to calculate the Area-Under-the-Curve (AUC). The researchers found that the Naïve Bayes and logistic regression classifiers performed the best in the majority of the experiments performed. Additionally, the researchers employed feature selection and found that using the Fisher score to only use the top 10 features yielded the best results. A major limitation of this study was the lack of availability of real-world samples of malware. The researchers created four malicious applications to use in validating this approach to anomaly-based malware detection, which encapsulated denial-of-service and information theft malware. It is difficult to tell how this would perform against the various types and iterations of malware found in the wild. Another limitation is that only two phones were used to collect data for this study. A larger, more diverse dataset that includes a wide variety of malware samples could give a more accurate picture of how well this methodology can be applied in the real world. A suggestion that the researchers make for further research is to use time stamps to give more context to the features—for example, instead of having a feature that represents the battery level at that point in time, have a variable that represents the change in battery level over the last 10 minutes. Mirsky et. al collected the previously mentioned SherLock dataset with cybersecurity in mind. Their study collected data on a total of 50 volunteers that were given a Samsung Galaxy S5 to use as their primary device for two years. The phones were loaded with a malware application named Moriarty that was updated throughout the study to encapsulate several different types of malware. Examples of the malicious applications include a web browser that contains spyware to either capture users’ location and audio data, or to capture their web traffic and history, a popular game that was repackaged to include phishing attempts that prompt users to login with their Facebook, Gmail, or Skype credentials via fake login pages. The malware in Mirsky et. al’s study was based on malware samples found in the wild, but the code was modified for the study to protect the volunteers’ privacy. For example, malware that retrieved files from the user’s phone to send to a remote server scrambled that data prior to transfer to make it unintelligible. The researchers demonstrate, using the SherLock dataset, that features such as battery consumption, network traffic flow information, data on the usage of CPU and memory can be used to detect the presence of malware dynamically [4]. 
However, their analysis is limited to showing the correlation and information gain scores of each variable; the main goal of their research was to provide a rich dataset that could be used with machine learning to further explore research in mobile cybersecurity, which is the aim of this paper.

3 PROPOSED SOLUTION AND EVALUATION METHODOLOGY
The current body of research demonstrates that it is possible to use features such as CPU usage, battery level, and memory as indicators for detecting malware efficiently in mobile devices. The goal of this research is to find the best combination of machine learning algorithms and parameter tuning to get the highest accuracy for classifying applications as benign or malicious. The algorithms tested will include Naïve Bayes, logistic regression, and support-vector machines, as well as the ensemble methods of random forest and soft voting, in order to test a wide range of algorithms and see which has the best outcome with processing efficiency. Efficiency is important to keep in mind because, ideally, a mobile malware detection application could be downloaded and run from anyone's Android device to ensure continuous protection that does not monopolize CPU or battery usage.

This research will use a portion of the SherLock dataset, as it is the largest publicly available dataset for analyzing Android devices for the presence of malware using low-level features. It contains two years' worth of data from a total of 50 devices. The malware samples used represent a variety of different types of malware and closely reflect malware samples found in the wild, modified only slightly in order to protect the study volunteers' privacy. Since labelling the software as benign or malicious is a binary classification problem, the Receiver Operating Characteristic's Area-Under-the-Curve (AUC) will be used to measure the accuracy of the algorithms' performance.

The key to validating the machine learning algorithms on the dataset is that the malicious application, Moriarty, leaves "clues" when it is running: the application alternates between benign and malicious mode, and the nature of each action and session are both captured in the dataset. While running in benign mode, the application, malicious as it is by nature, only performs benign actions. While running in malicious mode, the application performs both benign and malicious actions in conjunction [4]. The benign sessions allow the machines to learn a baseline of the application's performance so that anomalies can be detected when the application is acting suspiciously.

To create the dataset for machine learning, the clues labelled as either benign or malicious from the malicious Moriarty application, located in the "Moriarty" table, were joined with data from the "Application" and "t4" tables, which were continuously sampled during the SherLock data collection every 5 seconds. Figure 2 presents an illustration of the data used in the research. The Moriarty table included a timestamp and a number indicating which version of the malicious app was running on each user's device, as the app changed throughout the data collection project to include a broad range of malware that would impact the devices differently. Features such as battery data and global application statistics such as device memory and storage were located in the "t4" data table.
Data for every application, otherwise known as local application statistics, also included data on CPU usage, the network, process information, and memory usage. For each user, the data was joined to each Moriarty clue on the closest timestamp available, and grouped by each version of Moriarty, which reflects a new instance of the malware application with a new set of malicious behavior to identify.

Figure 2 - An illustration of the data used in the research: Moriarty "clues" (description of the action taken by the app, timestamp, benign/malicious label, and version of the Moriarty application) joined on timestamp with the Application table (info for each individual application: CPU, network statistics, memory, process information) and the t4 table (global application statistics: memory, storage, CPU, and battery statistics).

4 RESULTS AND DISCUSSION
The purpose of this research was to apply commonly used machine learning algorithms to a known, labelled set of malware application data to evaluate whether a set of low-privileged statistics including battery consumption, memory, and CPU usage could dynamically detect the presence of malware. This research trained and tested on data from version one of the Moriarty malware application. This version of the application was a puzzle game that stole and transmitted a user's contacts. During data preprocessing, numerical data was standardized by removing the mean of the training set and then scaling to unit variance of the training set.
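A minimal sketch of that preprocessing and of the evaluation reported below (not the author's exact code; the file name, column names, and hyperparameters are assumptions), standardizing with training-set statistics only and scoring the same five classifiers by AUC and f1:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import roc_auc_score, f1_score

df = pd.read_csv("sherlock_moriarty_v1_user1.csv")       # hypothetical joined table
X, y = df.drop(columns=["malicious"]), df["malicious"]   # 1 = malicious clue
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scaler = StandardScaler().fit(X_tr)   # mean and variance come from the training set only
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

models = {
    "Naive Bayes": GaussianNB(),
    "SVC": SVC(probability=True),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
models["Soft Voting"] = VotingClassifier(list(models.items()), voting="soft")

for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    f1 = f1_score(y_te, clf.predict(X_te))
    print(f"{name}: AUC={auc:.2f}, f1={f1:.2f}")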
Figure 3 shows a chart of the results of training and testing on a single user. The results show that the features chosen to detect the presence of malware perform well on the tested algorithms. Naïve Bayes, which was chosen as it was the best-performing algorithm in the "Andromaly" framework [6], performed the worst of the tested algorithms. However, it still yielded an AUC score of 0.91. Random forest was the best-performing algorithm on this dataset, with an AUC of 0.99 and an f1 score of 0.97. The f1 score is the harmonic mean of precision and recall. As with AUC, the f1 score is another indicator that should be as close to 1 as possible, with 1 being a perfect score and 0 being the lowest possible score. The support-vector classifier and the soft-voting classifier, which utilized all of the other classification algorithms in its ensemble voting method, performed nearly as well as random forest.

Figure 3 - Chart of Evaluation Scores for Selected Algorithms

Algorithm                               AUC    f1-score
Naïve Bayes                             0.91   0.94
SVC                                     0.98   0.97
Logistic Regression                     0.97   0.95
Random Forest                           0.99   0.97
Soft Voting (using all of the above)    0.98   0.96

This research is limited to one of the many types of malware represented in the SherLock dataset, so more research should be done to confirm that the chosen features work well across different types of malicious activity within mobile applications, and to see if random forest will continue to be the best-performing algorithm, or if ensemble voting methods such as soft voting will provide a better indicator at large when faced with many different types of malware. However, the results shown by this research are indeed promising and prompt further exploration to improve the performance metrics even further.

5 CONCLUSIONS AND FUTURE WORK
With society spending more time than ever glued to their phones, application developers aim to grab a hold in this captive market with entertaining applications for smartphone users to enjoy. These applications sometimes go through a review process before being made available in a smartphone application store, which can help to keep malware from smartphones through a static review of the code. However, these methods of detection are becoming less effective as hackers with malicious intent get creative with code obfuscation to cover up any signs of malware in their code.

A better way to continuously protect smartphone users might be to use a dynamic approach, where usage statistics like battery, CPU, and memory are evaluated through machine learning to detect strange behavior that could compromise an individual's personal data and privacy. Previous studies have shown these features to be effective in classifying an application as malware. The evaluation results in this study suggest that global application features such as CPU usage, battery usage, and memory, as well as attributes for running applications and their individual CPU and memory usage statistics, are effective and can be utilized to successfully classify malicious applications with machine learning algorithms.

In future work, more of the SherLock dataset should be explored. This analysis focused on the first quarter of 2016, but the dataset contains over two years of the labelled malicious software data, with more types of malware used within the "Moriarty" application. Another idea to further this research is to use a similar methodology outlined in the "Andromaly" proposed framework to examine the algorithm's performance under different circumstances. The authors conducted several different experiments with similar data in addition to the device-specific classification algorithms outlined in this paper. For another experiment, Shabtai et al. tested on each participant's device whether the algorithm could detect malicious applications that were not included in the training set. Additionally, an experiment was conducted with data from all benign and malicious applications, but the training and testing data were split along devices. Lastly, the researchers evaluated their algorithm's ability to classify an application as benign or malicious when it was not included in the training set and with training and testing performed on different devices [6]. This test would help determine if the algorithm is attuned to a single user or if the data can be generalized to identify malware on different devices.

ACKNOWLEDGMENTS
I would like to thank my professors Dr. Afzal Upal and Dr. Chad Redmond for assistance with the research process and with data processing. I would also like to thank Dr. Chris Mansour, whose guidance led me to focus my research on applications of machine learning in cybersecurity.

REFERENCES
[1] Poushter, Jacob, Bishop, Caldwell, and Chwe, Hanyu. Social Media Use Continues to Rise in Developing Countries but Plateaus Across Developed Ones. Pew Research Center (June 19, 2018).
[2] Suarez-Tangil, Guillermo, Tapiador, Juan E., Peris-Lopez, Pedro, and Ribagorda, Arturo. Evolution, Detection and Analysis of Malware for Smart Devices. IEEE Communications Surveys Tutorials, 16, 2 (2014), 961-987.
[3] GOOGLE DEVELOPERS. Permissions overview. Android Developers (November 20, 2018).
[4] Mirsky, Yisroel, Shabtai, Asaf, Rokach, Lior, Shapira, Bracha, and Elovici, Yuval.
SherLock vs Moriarty: A Smartphone Dataset for Cybersecurity Research. (Vienna, Austria 2016), ACM.
[5] Shabtai, Asaf, Kanonov, Uri, and Elovici, Yuval. Intrusion Detection for Mobile Devices Using the Knowledge-based, Temporal Abstraction Method. Journal of Systems and Software, 83, 8 (August 2010), 1524-1537.
[6] Shabtai, Asaf, Kanonov, Uri, Elovici, Yuval, Glezer, Chanan, and Weiss, Yael. "Andromaly": a behavioral malware detection framework for android devices. Journal of Intelligent Information Systems, 38, 1 (February 2012), 161-190.
[7] Arshad, Saba, Shah, Munam Ali, Khan, Abid, and Ahmed, Mansoor. Android Malware Detection & Protection: A Survey. International Journal of Advanced Computer Science and Applications (IJACSA), 7, 2 (2016), 463-475.
[8] Xue, Lei, Zhou, Yajin, Chen, Ting, Luo, Xiapu, and Gu, Guofei. Malton: Towards On-Device Non-Invasive Mobile Malware Analysis for ART. In 26th USENIX Security Symposium (USENIX Security 17) (Vancouver, BC 2017), USENIX Association, 289-306.
[9] Kim, Hahnsang, Smith, Joshua, and Shin, Kang G. Detecting Energy-Greedy Anomalies and Mobile Malware Variants. In Proceedings of the 6th International Conference on Mobile Systems, Applications, and Services (Breckenridge, CO, USA 2008), ACM, 239-252.

About the author: Christina Eusanio is a Graduate Student at Mercyhurst University.

Building a Gun Detection Model Using Deep Learning
Shraddha Dubey
Graduate Research Assistant
Mercyhurst University
Shraddha.dubey04@gmail.com

ABSTRACT
Mass shootings and homicides involving guns are on the rise. The recent mass shooting at the Christchurch mosque in New Zealand is yet another horrifying example of the pain and destruction such incidents bring to society. The ease of obtaining handheld guns in the open market adds to the risk of these incidents repeating. The objective of the research is to build a trained model that can detect hidden handguns. Manual analysis of security images to identify threats of gun-related violence is labor-intensive, time-consuming, and prone to human error. This research is aimed at finding the most suitable model to detect the presence of a gun in still images using neural networks, such as Faster R-CNN and SSD models. The dataset used for this research comes from open source platforms.

Table 1: Estimated total civilian-held legal and illicit firearms in the 25 top ranked countries and territories, 2017 [Source: Small Arms Survey (2018)]

Keywords
Image classification, deep learning, R-CNNs

1. INTRODUCTION
After every mass shooting reported in the media and after every minute of silence, the question of gun control arises. Almost always, the answer of stricter gun laws ends up coming up short by the time of the next shooting. This paper does not delve into the discussion regarding stricter gun control laws or limiting access to ammunition. Instead, it focuses on the use of machine learning and artificial intelligence to identify firearms (handheld guns) in images in order to detect and alert concerned parties to take further action.

Table 2: Estimated rate of civilian firearms holdings in the 25 top-ranked countries and territories, 2017 (firearms per 100 residents) [Source: Small Arms Survey (2018)]

Firearm detection in still images, particularly handgun detection, is not yet perfected, and there are benefits in improving the technology. The primary goal of this work is to prevent firearm misuse. This would be particularly valuable in countries where illegal handgun use/misuse is a challenge for law enforcement.
Another important aspect, and a good way to benefit, would be to incorporate it in surveillance methods and in social media platforms where such pictures may end up.

The focus of researchers interested in building models similar to handgun detection is mainly driven by high crime rates that affect many people worldwide. According to the Small Arms Survey [1], by the end of 2017 there were approximately 1,013 million firearms in the 230 countries and autonomous territories of the world. An estimated 84.6 percent of these were held by civilians, 13.1 percent by state militaries, and 2.2 percent by law enforcement agencies.

Figure 4: Annual acquisition of new firearms in the United States [Source: Small Arms Survey (2018)]

A research paper by Olmos, Tabik, and Herrera [2] claims that "psychological studies demonstrate that the simple fact of having access to a gun increases drastically the probability of committing violent behavior." In most cases, early prevention and detection of firearms, particularly handguns, are the key to preventing such behavior. For the most part, past studies have focused on discovering firearms with techniques such as X-rays combined with some of the traditional machine learning methods. Recent research, however, has incorporated more machine learning and deep learning models, such as Convolutional Neural Networks (CNNs), which ultimately have performed better than traditional machine learning models.

Valldor, Stenborg, and Gustafsson [3], from the Swedish Defense Research Agency, have focused more closely on social media and images posted on those platforms. In terms of previous work, in countries such as Sweden many experts have analyzed data (images) manually, and some still continue to do so; however, the expansion of social media makes this difficult due to the volume of information available. This causes overload and can often lead to a slow turnaround in the detection and prevention of violent behavior. Because of these reasons, the main focus of the research of Valldor and his team was not only to build a model to detect firearms in images but also to investigate what is needed to build such a model in terms of time and other resources (mainly monetary). They suggested three uses for their model:
1. General forensic analysis of images;
2. Disinformation and troll detection on the internet;
3. Lone actor terrorist detection.

Figure 5: ROC curve for firearm detector

Figure 5 shows a graph of the ROC curve for the firearm detector. The true positive rate is measured on a test of 200 images containing firearms. The false positive rate is measured on 5000 images of the MS-COCO 2017 validation set. Results from their model were very promising; however, there was a large number of false positives. In their example, the model quite often categorized handheld gadgets (TV remotes) and skiing gear (sticks) as weapons. In the graph (Figure 1) from their paper, we see a high false positive rate.

Another team of researchers, Wu, Yao, Fu, and Jiang [4], focuses on applying deep learning to video classification and captioning, driven by the recent advances in picture classification and captioning. They hope to increase the rate at which various videos on the internet can be analyzed and classified. Services such as YouTube, Vimeo, or Imgur, where users have complete control over the content, were the main motivation of this team for such a classifier. For example, if a user decides to post a very graphic, violent video on YouTube in which this user is explaining various fighting techniques, and this video gets published under a name such as "Mickey Mouse" in hopes that children may come across a violent video, this classifier would process the video before it is published and would automatically red-flag it before it reaches any users. Even though YouTube is making advances in video classification, there are still other online services where video classifiers are rather lenient.

An interesting takeaway from this article is the approach to building a classifier. Wu, Yao, Fu, and Jiang suggested that CNNs are one of the most promising approaches for building a model for object detection. Just as many research papers have done with CNNs, this paper also states that even though the model they built does have some success in classifying videos, it is still far from practical use. This may be discouraging for some; however, with enough focus and attention to developing CNNs for firearm detection, we may develop practical models with reliable results.

2. RELATED WORK
One of the early works was done by Hua-Mei Chen [5] and his team on detecting firearms, published in March of 2005.
It focuses on the detection of weapons underneath a person's clothing and recognizes that as a very important "obstacle" which could have a major impact on security in highly populated areas such as airports, bus stations, and train stations. The authors of this paper appealed to The Concealed Weapon Detection (CWD) program that was started in 1995 under the sponsorship of the National Institute of Justice, administered by the Air Force Research Lab in the United States.

Table 3: Summary of the imaging sensors being developed by the CWD

The quest has not ended yet, even 14 years after the article was published and 27 years since the program started. The main reason for the never-ending research is the changing technology as well as technological advances on both sides of this arms race. Guns can be 3D printed and can be made of various alloys of steel and other materials that can make it harder for such technologies to discover them through a standard imaging process. As more effort is put into early detection as well as detection from a distance, it can give security enforcers more time to act and prevent an incident from happening. The biggest strides are being made by using millimeter wave (mmW) advanced imaging technology (AIT). The mmW technology cameras are commonly used by two of the biggest security agencies, the FDA and TSA/DHS. The most well-known cases of these cameras being widely utilized are by the Transportation Security Administration (TSA) of the United States of America. One of the best features of this technology, in terms of firearm detection, is that it can discover hidden handguns from about 65 feet (20 meters) away in real time. This technology analyzes the waves emitted by human bodies, which are usually 'warm', compared to the 'cold' waves of metals and other objects. The mmW cameras only reflect red light if they see any 'cold' bodies, and the agent behind the screen is the only one able to see this, not the potential threat. This allows for a quick reaction from the defender's side rather than the potentially dangerous subject.

Another paper, published by Rohit Kumar Tiwari and Gyanendra K. Verma [6] in 2015, focuses on CCTV cameras commonly used for security and surveillance purposes. The paper discusses using the Harris interest point detector alongside FREAK [7] (Fast Retina Keypoint) for automated gun detection, thus saving time and increasing the efficiency of the same task done by an operator or security personnel. The Harris detector is invariant to geometric transformations (different gun model, picture of it from a different angle, etc.) and FREAK serves as a feature extractor for each point, which allows for a clear and coherent result as to whether or not a gun shape is detected. After system initialization, Harris combined with FREAK creates a description of a gun through basic gun images from various data points. The results are as follows in Table 4.

Table 4: Performance under Harris plus FREAK descriptor matching

Session   Image description                  No. of TP   Correct classification   TPR (%)
1         different backgrounds              12          11                       91.66
2         different degree of illumination   9           7                        77.77
3         interclass variation               11          9                        81.81
4         degree of occlusion                17          14                       82.35
5         rotation variation                 10          8                        80.00
6         multiple guns                      6           5                        83.33

This system, though promising, is not able to perform well under illumination change because the color-based segmentation algorithm is not able to segment the image accurately. Hence, during the gun color extraction they get only some part of the gun, and it affects the performance, as seen in Table 4.

Another paper, by Gyanendra K. Verma and Anamika Dhillon, published in 2017 [8], pointed out that according to the United States Department of Justice the majority of crimes are committed using handguns, and those crimes include robberies, rapes, and auto thefts. The researchers in this paper decided to apply a Deep Convolutional Network (DCN) with a Faster Region-based CNN model to automatically detect handheld guns in a cluttered setting. When referring to the previous work done, the researchers point to the CWD (Concealed Weapon Detection) program, since this program is already implementing multiple technologies for imaging detection of handguns, knives, and similar handheld weapons, mainly used by the TSA at airports. In order to train the CNN, the researchers used the IMFDB (Internet Movie Firearms Database), which is an online database of images from various movies, TV shows, video games, and other media. Besides handguns, this database stores pictures of other firearms, from which the researchers chose revolvers, rifles, and shotguns. The system was trained using a mini-batch gradient descent approach. During training, adjustments to the multinomial logistic regression objective were made to develop the best training approach. During testing, the accuracy of the system was measured through the True Positive Rate, False Positive Rate, Positive Prediction Value, and False Detection Rate. The results are reproduced below:

Table 5: Classification accuracy of SVM

In conclusion, the research reviewed here only gives us a glimpse of the potential that machine learning holds for handgun detection. The last research paper digs a bit deeper than the previous work, and the results are not discouraging but rather promising: once enough minds have been put to the matter, problems such as handgun detection will not be as big of a problem in the future.

3. METHODOLOGY
Obtaining a large enough dataset to train the model is probably the most difficult task when applying Deep Learning. Collecting these images from the most common search engines will require time and resources to tag individual images, as Deep Learning algorithms require tens of thousands of images. I believe that CNNs are some of the most capable algorithms, which many researchers shy away from in part because of their potential complexity of implementation, as well as the fact that we do not necessarily know what goes on inside the black box of a CNN. CNNs do not provide access to the learned knowledge as decision-tree based learning algorithms do.

For this model, the images were collected from open-source platforms. The images used have only one class, labeled as pistol. The training set contains 2000 images and the evaluation set consists of 100 images with four labels, namely face, car, pistol, and hand, unevenly distributed among these groups.
This allows for a quick reaction from the defender side rather than the potentially dangerous subject Table 5: Classification accuracy of SVM In conclusion, this research reviewed here only gives us a glimpse of the potential that machine learning for handgun detection. Last research paper digs a bit deeper than previous work and the results are not discouraging, rather promising that once enough minds have been put to the matter problems such as handgun detection will not be as big of a problem in the future. Another paper published by Rohit Kumar Tiwari and Gyanendra K. Verma [6] in 2015, focuses on CCTV cameras commonly used for security and surveillance purposes. The paper discusses using Harris interest point detector alongside FREAK [7] (Fast Retina Keypoint) for an automated gun detection, thus save time and increase the efficiency of the same task done by an operator or security personnel. Harris detector is constant to every geometric transformation (different gun model, picture of it from a different angle, etc.) and FREAK as a feature extractor for each point which allows for a clear and coherent result whether or not a gun shape is detected. 3. METHODOLOGY Correct classification TPR (%) 1 different backgrounds 12 11 91.66 2 different degree of illumination 9 7 77.77 3 interclass variation 11 9 81.81 4 degree of occlusion 17 14 82.35 5 rotation variation 10 8 80.00 6 multiple guns 6 5 83.33 Session This system though promising is not able to perform well in illumination change because the color based segmentation algorithm is not able to segment the image accurately. Hence during the gun color extraction, they get only some part of the gun and it affects the performance as seen in Table4. No. of TP After system initialization, Harris combined with FREAK creates a description of a gun through basic gun images from various data points. The results are as follow in Table 4. Image description Obtaining a large enough dataset to train the model is probably the most difficult task when applying Deep Learning. Collecting these images from most common search engines will require time and resources to tag individual images for the as Deep Learning algorithms require tens of thousands of images. Another paper by Gyanendra K. Verma and Anamika Dhillon published in 2017 [8] pointed out that according to the United States Department of Justice the majority of crimes are committed using handguns and those crimes include robberies, rapes, and auto thefts. I believe that CNNs are some of the most capable algorithms that many researchers shy away from in part because of their potential complexity of implementation, as well as the fact that we do not necessarily know what goes on inside the black box of a CNN. CNNs do not provide access to the learned knowledge as decision-tree based learning algorithms do. Researcher in this paper decided to apply Deep Convolutional Network (DCN) with a Faster Region-based CNN model to automatically detect handheld guns in a cluttered setting. When referring to the previous work done, researchers of this paper point out to the CWD (Concealed Weapon Detection program) since this program is already implementing multiple technologies for imagining detection of handguns, knives and such handheld weapons mainly used by TSA at the airports. Table 4: Performance under Harris plus FREAK descriptor matching In addition to data collection and training, establishing a TensorFlow and Anaconda environment with default Python package imports (e.g. 
Another paper, by Gyanendra K. Verma and Anamika Dhillon, published in 2017 [8], pointed out that according to the United States Department of Justice the majority of crimes are committed using handguns, and those crimes include robberies, rapes, and auto thefts. The researchers in this paper decided to apply a Deep Convolutional Network (DCN) with a Faster Region-based CNN model to automatically detect handheld guns in a cluttered setting. When referring to previous work, they point to the CWD (Concealed Weapon Detection) program, since that program already implements multiple technologies for imaging-based detection of handguns, knives, and similar handheld weapons, mainly used by the TSA at airports. In order to train the CNN, the researchers used the IMFDB (Internet Movie Firearms Database), an online database of images from various movies, TV shows, video games, and other media. Besides handguns, this database stores pictures of other firearms, from which the researchers chose revolvers, rifles, and shotguns. The system was trained using a mini-batch gradient descent approach; during training, adjustments to the multinomial logistic regression objective were made to develop the best training approach. During testing, the accuracy of the system was measured through the True Positive Rate, False Positive Rate, Positive Prediction Value, and False Detection Rate.

I believe that CNNs are some of the most capable algorithms, yet many researchers shy away from them, in part because of their potential complexity of implementation and in part because we do not necessarily know what goes on inside the black box of a CNN. CNNs do not provide access to the learned knowledge the way decision-tree-based learning algorithms do.

3. METHODOLOGY

Obtaining a large enough dataset to train the model is probably the most difficult task when applying Deep Learning. Collecting images from the most common search engines requires time and resources to tag individual images, as Deep Learning algorithms require tens of thousands of images. For this model, the images were collected from open-source platforms. The images used have only one class, labeled as pistol. The training set contains 2,000 images, and the evaluation set consists of 100 images with four labels, namely face, car, pistol, and hand, unevenly distributed among these groups. The bounding boxes of every instance of the targeted object were manually labeled in each image and saved as .xml files for training using LabelImg [14]. The .xml files were then converted to .csv files in order to be read by TensorFlow; a sketch of this conversion step is shown below.

In addition to data collection and labeling, establishing a TensorFlow and Anaconda environment with the default Python package imports (e.g., NumPy, Anaconda's Protobuf, lxml, Cython, and OpenCV) was necessary. The object detection model was then downloaded from the TensorFlow Model Zoo [14], a repository holding all current models for object detection and image classification. In this case, the FasterRCNN-Inception-V2-COCO model was packaged and installed for testing. The last step was to configure the object detection training pipeline. The pipeline defines which model and what parameters will be used for training. In the configuration file, the parameters num_classes, fine_tune_checkpoint, and num_examples were altered, in addition to the train_input_reader and eval_input_reader sections, to fit our custom path directories.
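The .xml-to-.csv conversion step mentioned above is commonly done with a short script along the lines of the sketch below. This is an illustrative version rather than the exact script used here; the directory names, output file names, and column layout (the layout used in the common TensorFlow object-detection tutorials) are assumptions.

import csv
import glob
import xml.etree.ElementTree as ET

def xml_to_csv(xml_dir, csv_path):
    """Flatten LabelImg's Pascal-VOC .xml annotations into one .csv file."""
    rows = []
    for xml_file in glob.glob(f"{xml_dir}/*.xml"):
        root = ET.parse(xml_file).getroot()
        filename = root.find("filename").text
        width = root.find("size/width").text
        height = root.find("size/height").text
        for obj in root.findall("object"):
            box = obj.find("bndbox")
            rows.append([filename, width, height, obj.find("name").text,
                         box.find("xmin").text, box.find("ymin").text,
                         box.find("xmax").text, box.find("ymax").text])
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "width", "height", "class",
                         "xmin", "ymin", "xmax", "ymax"])
        writer.writerows(rows)

xml_to_csv("images/train", "train_labels.csv")   # paths are placeholders
xml_to_csv("images/test", "test_labels.csv")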
4. SOLUTION

The detector used in this work is based on Faster R-CNN and uses the Inception network for feature extraction. It constructs a single, unified model composed of an RPN (region proposal network) and Fast R-CNN with shared convolutional feature layers [13].

Figure 6: An illustration of the Faster R-CNN model [13]

The Faster R-CNN algorithm replaces the slow selective-search algorithm of previous models with a fast neural net, in particular through the introduction of the Region Proposal Network (RPN). The RPN works as follows:
- At the last layer of an initial CNN, a 3x3 sliding window moves across the feature map and maps it to a lower dimension (e.g., 256-d).
- For each sliding-window location, it generates multiple possible regions based on k fixed-ratio anchor boxes (default bounding boxes).
- Each region proposal consists of an "objectness" score for that region and 4 coordinates representing the bounding box of the region.

Faster R-CNN is thus a two-stage classifier. The first stage involves object localization: a sliding window is applied to the output of the last layer of the feature extraction network, and the goal is to identify regions in the image that contain an object of interest with the help of bounding boxes. In the second stage, regions with high scores from the first stage are extracted from the feature map and fed through a classifier that predicts both the object type and a bounding box for each such region.

A bounding box can be initialized using the following parameters:
- bx, by: coordinates of the center of the bounding box
- bw: width of the bounding box w.r.t. the image width
- bh: height of the bounding box w.r.t. the image height

Along with the information about bounding boxes, we can treat the class of the image as a multi-class classification problem, defined as y = (c1, ..., cn), where ci is the probability of the i-th class. If there are three classes, the target variable is y = (c1, c2, c3).

Loss Function. The model is optimized for a loss combining the two tasks (classification + localization). The loss function sums the cost of classification and bounding box prediction: L = Lcls + Lbox. For the "background" class, Lbox is ignored by the indicator function 1[u >= 1], defined as 1 if u >= 1 and 0 otherwise. Faster R-CNN is therefore optimized for a multi-task loss function that combines the losses of classification and bounding box regression (applied when u >= 1).

Figure 7: symbol and explanation [13]

5. EVALUATION

In real-world settings the datasets have multiple classes and their distribution is non-uniform, so a simple accuracy-based metric would introduce biases in favor of the larger classes. It is also important to assess the risk of misclassification. Thus, there is a need to associate a confidence score of the model with each bounding box detected and to assess the model at various levels of confidence. Therefore, object detection evaluation involves two distinct measures:
1. Determining whether an object exists in the image (classification).
2. Determining the location of the object (localization, a regression task).

Average Precision (AP) is a popular metric for measuring classification accuracy. The AP is defined as the average of the precision scores after each true positive TP in the scope S, i.e., AP = (1/|S|) * sum over TP in S of Precision(TP). Mean Average Precision (mAP) is the average of the AP values over all classes.

To evaluate a model on localization, we must first quantify how well the model predicted the location of the object. This is evaluated with the Intersection over Union (IoU) threshold, which summarizes how well the ground-truth object overlaps the object boundary predicted by the model.
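To make the IoU measure just described concrete, the following minimal sketch computes it for two axis-aligned boxes; the coordinates and the 0.5 threshold mentioned in the comment are only illustrative.

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

# A detection usually counts as a true positive only if the IoU exceeds a
# chosen threshold (0.5 is a common default).
ground_truth = (48, 40, 210, 190)     # illustrative coordinates
prediction   = (60, 45, 220, 200)
print(f"IoU = {iou(ground_truth, prediction):.2f}")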
6. RESULTS

Figure 8: Classification loss (loss score vs. training steps)

The classification loss shows the price paid for inaccurate classification of the objects in the images by this model. The graph shows that the loss stabilizes close to 3. This can be reduced further with more training data and more repetitions; in an ideal model the value should be closer to 2.

Figure 9: Localization loss (loss score vs. training steps)

The localization loss shows the price paid for inaccurate bounding boxes/coordinates predicted by the model. We see it converging to 0.65, which is a good measure. This can be attributed to the small number of class labels in the model.

Figure 10: Total loss (loss score vs. training steps)

The total loss is the sum of the classification and localization losses. In this model it is close to 4, which can be attributed to the higher value of the classification loss. The model has scope for more training so that the total loss can be brought down to 2 - 1.5.

Figure 11: Trained Model Output

7. CONCLUSION

The model predicted the labels with accuracies as high as 99% when the image has a high contrast with the background, but when the image and the background have lower contrast it produces accuracies as low as 64%. This could be attributed to the small number of training images and the relatively low number of training steps. In terms of speed of detection, it took ~15 seconds to identify the objects, which makes it impractical for real-time object detection.

8. FUTURE WORK

In the future, the model can be trained on a larger number of real-time surveillance images with more classes, such as handgun, rifle, and shotgun. There is also the possibility of applying this model to videos and live-streaming data, and it will be interesting to see how it extends to those situations. The Faster R-CNN model (and the R-CNN family in general) is a region-based object detection algorithm; such models can achieve high accuracy but may be too slow for certain applications such as autonomous driving. It would be interesting to see the dataset used with faster object detection models such as SSD and RetinaNet.

Acknowledgment

I am extremely fortunate for the constant support and guidance I received from Dr. Afzal Upal and Dr. Mahesh Maddumala from the Department of Computer and Information Science, Mercyhurst University. They were instrumental in helping me with every challenge I came across during this research; they were patient and understanding of my queries and guided me appropriately. I would also like to thank my fellow classmate Heidi Beezub and Mercyhurst alumnus Praveen Kumar Neelappa, who provided constant help and critique during the process of writing this paper.

REFERENCES
[1] Aaron Karp. Estimating Global Civilian-Held Firearms Numbers. http://www.smallarmssurvey.org/fileadmin/docs/T-BriefingPapers/SAS-BP-Civilian-Firearms-Numbers.pdf (January 2019).
[2] Roberto Olmos, Siham Tabik, and Francisco Herrera. 2017. Automatic Handgun Detection Alarm in Videos Using Deep Learning. Cornell University Library (February 2017).
[3] Erik Valdor and David Gustafsson. Firearm Detection in Social Media. NATO STO.
[4] Zuxuan Wu, Ting Yao, Yanwei Fu, and Yu-Gang Jiang. 2016. Deep Learning for Video Classification and Captioning. (September 2016).
[5] Hua-Mei Chen, Seungsin Lee, Raghuveer M. Rao, Mohamed-Adel Slamani, and Pramod K. Varshney. 2005. Imaging for Concealed Weapon Detection. IEEE Signal Processing Magazine (March 2005).
[6] Rohit Kumar Tiwari and Gyanendra K. Verma. 2015. A Computer Vision based Framework for Visual Gun Detection Using Harris Interest Point Detector. Procedia Computer Science 54 (August 2015), 703-712. DOI: https://doi.org/10.1016/j.procs.2015.06.083
[7] A. Alahi, R. Ortiz, and P. Vandergheynst. 2012. FREAK: Fast Retina Keypoint. 2012 IEEE Conference on Computer Vision and Pattern Recognition (2012). DOI: http://dx.doi.org/10.1109/cvpr.2012.6247715
[8] Gyanendra K. Verma and Anamika Dhillon. 2017. A Handheld Gun Detection using Faster R-CNN Deep Learning. Proceedings of the 7th International Conference on Computer and Communication Technology - ICCCT-2017 (2017), 84-88. DOI: http://dx.doi.org/10.1145/3154979.3154988
[9] David G. Lowe. 2004. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 60, 2 (January 2004), 91-110. DOI: http://dx.doi.org/10.1023/b:visi.0000029664.99615.94
[10] Jim Handy. NGD's New "In-Situ Processing" SSD. https://thessdguy.com/tag/machine-learning/ (July 25, 2017).
[11] Open Images Dataset V4+. https://storage.googleapis.com/openimages/web/index.html
[12] Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large Scale Image Recognition. Visual Geometry Group, Department of Engineering Science, University of Oxford (10 April 2015).
[13] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2016. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.
[14] TensorFlow Model Zoo. https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md (15 January 2019).

About the author: Shraddha Dubey is a Graduate Student at Mercyhurst University.

Flight delay/cancellation prediction using machine learning
Adapting new ways to help stranded passengers
Miloš Vereš
Department of Computing & Information Science
Mercyhurst University, Erie, PA
mveres04@lakers.mercyhurst.edu

ABSTRACT
In 2017 the US airline industry experienced an 11.3 percent increase in the cost of delayed flights, from $23.9 billion to $26.6 billion. The total money lost comes out of the $1.5 trillion that this industry contributes to the US economy. A flight cancellation prediction model is one way to address this problem. By knowing in advance that a flight will be canceled, the industry has a chance to "save" a potential loss in demand by offering a variety of ways to compensate the stranded travelers, which can lead to an increase in revenue. Such a program can help respond to the needs of stranded passengers. The main goal of the work reported here is to create a model that can be implemented by the airline and hospitality industries. Raw flight data was collected from the webpage of the Bureau of Transportation Statistics (BTS), while weather data was found on the National Oceanic and Atmospheric Administration (NOAA) webpage. Of the models we investigated, the best one was the Isolation Forest model, which is commonly used for anomaly detection.

1. INTRODUCTION
The commercial airline industry has been the backbone of the worldwide transportation system ever since the 1950s, when a few US airlines started introducing a new way of fast, comfortable, and efficient travel. In the beginning, sports teams and businessmen were the main customers, but as the need for quick, reliable, and comfortable travel increased, the industry grew. Currently, the FAA (Federal Aviation Administration) handles over 15 million flights annually, which translates to over 46 thousand flights per day serving 2.6 million passengers [FAA 2018]. These numbers apply to the United States of America only. After more than half a century since the commercial airline trend was adopted, the industry has seen major improvements that have created outstanding reliability, and almost all of the commercial airplanes built today are engineered to perform better than the current worldwide safety standards. A major reason for that is the small room for error: if an accident were to happen, it would cause great disruptions to the industry and not just to individual carriers. Despite the fact that the industry's safety and customer service record is close to the best it has ever been, there are still rare occasions where passengers are inconvenienced by delays or even cancellations, which end up costing the industry and the economy a pretty penny.
Between 2012 and 2017 the industry experienced a significant increase in the total cost of delays. The Federal Aviation Administration defines the total cost of delay as "the sum of costs to airlines, passengers, lost demand and indirect costs." The total costs went up from $19.2 billion in 2012 to $26.6 billion in 2017. This significant increase has been driven by the increase in the cost to passengers who experienced delays or cancellations. It is also important to note that, according to the FAA, a delay means that a flight was 15 or more minutes late for departure [FAA 2018]. A large number of the delays are related to the four main causes shown in Figure 12.

Figure 12: Costs of different types of delays. Adopted from www.faa.gov/nextgen/programs/weather/

By far the largest factor is sub-optimal weather conditions, which accounted for 69 percent of the delayed flights. The majority of weather delays happen during the summer season (April through September) as opposed to the winter season. According to the Operations Network (OPSNET), the official source of all air traffic and delay data, summer months are usually characterized by more convective weather, that is, heavy rains and thunderstorms [Elliot 2013]. These conditions are often the most disruptive to airplanes, as thunderstorms create strong turbulence and updrafts. Those updrafts may also carry large pieces of hail that can seriously damage airplanes, mainly causing malfunctions to the nose of the plane where the radar is housed. Damage to a radar system can seriously impair communication with air control, potentially leaving pilots "alone" in the air [Krajewski 2015]. The second biggest cause of delays is pure volume, in this case high demand. Even though the Federal Aviation Administration confirms that airline carriers possess "unconstrained resource capacity," high demand still causes up to 19 percent of flights to be delayed. Although this happens mainly during the holiday seasons (Thanksgiving, Christmas, New Year's, Fourth of July), these delays still account for a major portion of all annual delays. Runway availability, or in this case unavailability due to high traffic volume, is the third delay factor, with only 6 percent of all flights experiencing a delay due to it. The "Other" category is represented by various general aviation, air taxi, and military aircraft that have flown under FAA radar but have experienced some delays. The last category in this group is delays caused by equipment failure, which accounted for less than 1 percent of all delays [Olson & Philips 2018]. The report that OPSNET published in 2015 regarding flight delays was the most thorough one; however, each year the Bureau of Transportation Statistics publishes a simple report on flight delays over the last year, and weather-related delays still account for at least 55 percent of all delays, suggesting that a better understanding of sub-optimal weather conditions for aircraft could lead to improvements in scheduling that significantly reduce delays [Anon. 2018]. Even if a flight gets delayed, knowing that in advance can help passengers plan their time at the airport, and it is at this point that the hospitality industry in the airport vicinity can experience an influx of customers who are on standby for their flights. The FAA has predicted steady growth of 2.4 percent a year for the commercial airline industry as part of the "Aerospace Forecast Project: Fiscal Years 2017 to 2037" [FAA 2017].
Growth of such magnitude suggests that cancellation/delay prediction models may become even more valuable in the future.

2. RELATED WORK
Cancellation prediction models have been widely utilized in the hospitality industry by third-party booking agencies such as Expedia. The goal in that case is to predict how many customers will cancel their bookings, because every cancellation can significantly influence potential profit. Over the last three years there has been an increase in research papers related to air traffic, various optimization models, as well as a few delay predictors focused mainly on delay at the arrival airport. One particular article, titled "Unfriendly Skies," focused on predicting cancellations based on weather data. Our approach differs from that employed by Balduino [8], since the author only uses data for 10 US airports; that may seem like a lot of data, but we need to keep in mind that a cancellation is an anomaly and accounts for less than 2 percent of all flights in the United States on an annual basis. Another difference is that they used SPSS's "Auto Classifier" node. Balduino found that the Random Trees algorithm was the most accurate one. It is important to note that the only variable used to predict cancellations was the weather. The model predicted with almost 88 percent accuracy whether a particular flight would be cancelled [Balduino 2017]. Kuhn and Jamadagni's [2017] approach to delays/cancellations focused on predicting flight delays at the arrival airport. Even though this has real-world value as well, it is common practice done by OPSNET in the United States, since an airline has to report no later than 30 minutes after it knows that an airplane will be arriving late at the destination. In this case the researchers had a better outcome, as they were able to utilize a Neural Network, a Decision Tree, and Logistic Regression to achieve around 90 percent accuracy in predicting that a certain flight would arrive late. The main motive behind their research was to offer a way to better manage air traffic and thereby lower the economic and environmental impact of delays. It is important to note that the researchers used a dataset for only one year, which may be deemed insufficient since the year 2015 had multiple outliers due to unusually extreme weather.

3. DATA
Our prediction model used two main sources of data: 1) the Bureau of Transportation Statistics (BTS) for flight data [Anon. 2007]; and 2) the National Oceanic and Atmospheric Administration (NOAA) for weather data [Anon. 2019]. The Bureau of Transportation Statistics provides an in-depth dataset that contains information about every domestic flight, with over 20 variables for every flight. Scraping the data took the longest time, as only a limited amount of data can be scraped over a one-month period. This ended up being a major bottleneck, as the BTS server to which we sent requests would take up to several hours to approve a dataset download. The scraped dataset consisted of the last three years of information. Weather data collection had its own challenges, as the collected data referred to the location closest to the airport, usually selected by zip code. The original idea was to scrape data from the airport weather stations; however, that historical data is not accessible to the public. The number of flights at each airport, from most popular to least popular, is shown in Figure 13. After selecting the airports to pursue data from, the next step was to select the variables that would suit our model best.
The variables selected were: Flight Date, Origin, Origin City Name, Origin State Name, Destination, Destination City Name, Destination State Name, Cancelled, Cancellation Code, Cruise Elapsed Time, Actual Elapsed Time, Air Time, Distance, Carrier Delay, Weather Delay, NAS Delay, Security Delay, and Late Aircraft Delay. The weather data consisted of these variables: Temperature, Humidity, Air Pressure, and Precipitation Type and Amount (if any).

Figure 13: Flight data from BTS included 20 out of the top 30 Core Tower Operations listed by the FAA. Adopted from www.FAA.gov/airTraffoc/media/Air_Traffic_by_the_Numbers_2018.pdf.

Joining these two datasets was a challenging task, since the weather observations were made at times that do not match scheduled departure times; most weather stations report as soon as flight conditions change. By this point our flight data consisted of millions of rows, so joining the data on the closest possible observation times would have been very taxing given the size of the dataset. We dealt with this problem by considering the mean value of each weather data point for the day of the flight.

With such a wide variety of variables in the dataset, there were a few options for which variable to predict. The Bureau of Transportation Statistics, where the flight data came from, has five major variables for flight delays/cancellations:
1. Carrier;
2. Weather;
3. National Airspace System;
4. Security;
5. Late Aircraft.
It was nearly impossible to determine what exactly variables #3 and #4 stand for, since the terms are very loosely defined and regarded as "classified" by the airspace system. Since the main goal and the reason to pursue this research was to determine the relationship between weather and delayed/cancelled flights, the Weather variable was chosen. Figure 14 shows a visualization of the cancellation distribution for different causes.

Figure 14: Total cost of cancellations for different reasons.

4. METHODOLOGY AND FINDINGS
In the beginning phases of the modeling we tried decision trees and KNN; however, neither of the two yielded good results for cancellation prediction. The highest value with those two models was only about 62 percent accuracy in predicting whether a flight would be delayed/cancelled, which is nearly as good as flipping a coin. After reexamining the data, we realized that delays/cancellations were essentially an anomaly: our dataset consisted of three years of flight data for the busiest airports in the United States and contained only about 2 percent cancelled flights. The next method used to model this dataset was the Isolation Forest. This is an unsupervised machine learning approach used particularly for anomaly detection; as such, it is widely used in online fraud prevention, and it can provide high accuracy if the dataset has anomalies. In our dataset, the main anomalies were cancelled flights.
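As an illustration of the two steps just described, collapsing weather observations to daily means and treating cancellations as anomalies, the sketch below uses pandas and scikit-learn's IsolationForest. It is a simplified outline rather than the exact code used in this work: the file names and column names are placeholders for the BTS and NOAA fields, and the 2 percent contamination setting simply mirrors the cancellation rate noted above.

import pandas as pd
from sklearn.ensemble import IsolationForest

# Column names below (FlightDate, Origin, Cancelled, temp, humidity, pressure,
# precip) are placeholders for the actual BTS and NOAA fields.
flights = pd.read_csv("bts_flights.csv", parse_dates=["FlightDate"])
weather = pd.read_csv("noaa_weather.csv", parse_dates=["obs_time"])

# Collapse irregular station observations to one mean value per airport per day.
weather["date"] = weather["obs_time"].dt.date
daily = (weather.groupby(["station", "date"])
                [["temp", "humidity", "pressure", "precip"]]
                .mean()
                .reset_index())

flights["date"] = flights["FlightDate"].dt.date
data = flights.merge(daily, left_on=["Origin", "date"],
                     right_on=["station", "date"], how="inner")

features = data[["temp", "humidity", "pressure", "precip"]]

# Cancellations are rare (~2% of flights), so they are treated as anomalies.
model = IsolationForest(contamination=0.02, random_state=42)
data["flagged"] = model.fit_predict(features) == -1   # -1 marks an anomaly

print(pd.crosstab(data["Cancelled"], data["flagged"]))  # simple confusion table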
Training and testing used a standard 70/30 split: 70 percent of the data was allocated to train our model and the other 30 percent was used to test its accuracy. A confusion matrix is one of the easiest ways to describe the performance of a model. Our overall accuracy came to 76 percent and the misclassification rate was 23 percent. These results are very promising and can be a strong starting point in approaching the issue of weather-caused delayed and/or cancelled flights in the future. Below is a confusion matrix that summarizes our findings.

5. CHALLENGES AND DISCUSSION
At this moment our model has decent accuracy; however, changes can be made that may positively impact it. Some of the things that can be done are:
1. Daily mean weather values are not ideal; weather can change a few times per day, and since weather is the main feature in the model, getting real-time airport weather data could change the outcome;
2. Different features can be added to the model, which may result in better accuracy; however, the main focus of this paper was the impact of weather only.

An interesting point of discussion is the real-world value this model can have, because passengers can benefit from knowing whether their flight will be delayed or canceled. Businesses can also benefit from knowing this in advance. It would be great to see this model utilized in the hospitality industry, particularly by the various hospitality businesses surrounding airports. Businesses like restaurants at the terminals could design special offers for customers who are stranded at the airport, giving stranded passengers a place to temporarily stay and relax while enjoying a meal. By creating special offers for passengers in distress, a business may ease passengers' struggles, making their bad airport experience a little bit better. Businesses are also more likely to experience an increase in profits as their customer counts increase. At the moment there is no literature available that has looked into the potential of flight cancellation prediction and its benefits to businesses around airports. This could be a very innovative approach to marketing in the travel and tourism industry that would carry potential benefits to both passengers and businesses.

6. FUTURE WORK
While this model presents a good start, there are ways to improve it. I would like to see this model adapted to a particular sector of hospitality, for example restaurants. There are various point of sale (POS) systems, like Oracle, used in restaurants that could integrate models like ours. Only by integrating this model can we find its real value. In order for any improvements to happen, we would first need to know the shortcomings of the model, which can be found if the model is put to a real-world test. Another option would be to have more in-depth weather data, which has the potential to improve the classification accuracy. Approaching this problem with more knowledge of either the flight delay/cancellation or the weather field could be another way to get better results, since all of the information gathered here was obtained through research without any previous field knowledge.

REFERENCES
[1] Anon. Data Tools: Local Climatological Data (LCD). National Center for Environmental Information. (2007)
[2] Anon. Understanding the Reporting of Causes of Flight Delays and Cancellations. https://www.bts.gov. (March 2018)
[3] Anon. Data Access. Retrieved May 1, 2019 from https://www.ncdc.noaa.gov/data-access. (2019)
[4] Beth Krajewski. Flying in Convective Weather … And Why You Shouldn't. https://business.weather.com. (September 2015)
[5] Bureau of Transportation Statistics. Reporting Carrier On-Time Performance. Bureau of Transportation Statistics.
[6] Christopher Elliott. Storm Warnings: How Do Airlines Know If It's Safe to Fly in Bad Weather? National Geographic. (November 2013)
[7] David Olson and Ted Philips. Weather, volume cause flight delays on one of the busiest travel days of the year, FAA reports. www.newsday.com. (November 2018)
[8] Federal Aviation Administration. FAA Forecasts Continued Growth in Air Travel. www.faa.gov. (2017)
[9] Federal Aviation Administration, ed. Air Traffic by the Numbers. https://www.faa.gov/air_traffic/by_the_numbers/media/Air_Traffic_by_the_Numbers_2018.pdf. (November 2018)
[10] Federal Aviation Administration. 2017. What is the largest cause of delay in the National Airspace System? (August 2017)
[11] Nathalie Kuhn and Navaneeth Jamadagni. 2017. Application of Machine Learning Algorithms to Predict Flight Arrival Delays. (October 2017)
[12] Ricardo Balduino. Unfriendly Skies: Predicting Flight Cancellations Using Weather Data. Inside Machine Learning. (December 2017)

About the author: Milos Veres is a Graduate Student at Mercyhurst University.

How do Socioeconomic Factors Effect the Amount of Waste (garbage) Produced
Heidi Beezub
Mercyhurst University
Erie, PA 16546 USA
hbeezu68@lakers.mercyhurst.edu

ABSTRACT
In this paper, I attempt to show a correlation between socioeconomic factors and the production of waste. The increasing amount of waste (garbage) produced impacts the environment and can create challenges for safe, efficient, and effective disposal. This paper looks at waste and socioeconomic data from the United States, the Buffalo, New York region (USA), and the United Kingdom to find socioeconomic factors that are predictors of waste generation.

Keywords
Waste, Garbage, regression.

1. INTRODUCTION
Plastic straws received a tremendous amount of bad press when a video of a straw being pulled out of the nostril of a sea turtle [1] went viral in 2018. But plastic straws are not the only problem; one report predicts that the amount of plastics will outweigh fish in the ocean by 2050 [2]. Plastic, specifically single-use plastics, has become a major contributor to the waste stream. The waste stream is the flow of waste (garbage) through to final disposal. In addition to plastics, the waste stream includes household, industrial, construction, hazardous, and medical wastes. Global waste production is estimated to be 1.3 billion metric tons* (MT) or 1.2 kg/capita/day, with amounts projected to reach 2.4 billion MT or 1.4 kg/capita/day by 2025 [3]. (*A U.S. ton, also called a short ton, is equal to 2,000 U.S. pounds; a metric ton is slightly larger than a U.S. ton and converts to 2,204.6 pounds [7].) That is a lot of garbage!

As the world's population increases, the demands on the finite resources of our terrarium home we call Earth become more and more taxed. Resources are not just fossil fuels and mineral ores. Resources include things we take for granted: clean air, clean water, arable land for food production, and even technology, labor, and time. Although historically technology in the form of the industrial revolution has increased pollution, technology can be a critical resource in the conservation of other resources. Technology has helped clean polluted air from factory smokestacks, provided renewable energy sources (wind, solar, hydro), and even resulted in more time by automating manual systems. Advanced economies have more access to technology than developing economies. Developing economies produce more pollution and waste during manufacturing processes than developed economies [4]. As population increases, people also
want a higher standard of living (i.e., everyone wants a washing machine and an air conditioner). The carrying capacity of the Earth (how many people can be supported) varies among studies. One study suggests that, based on the resources required for an 'American' standard of living, the Earth's carrying capacity is approximately 1.5 billion people [5]. Most estimates use food production to top out the sustainable population between 8 and 11 billion people [6]. With a current population of approximately 7.5 billion, we are close to these estimates. For both an equitable and sustainable environment and an equitable and sustainable economy, we need to consume less. In order to maintain a sustainable planet environment, we need to decrease and eliminate the amount of waste produced. Recycling is a popular go-to option, but only 9% of plastic waste has been recycled globally [8]. Recent articles are calling for more emphasis on the circular economy, in which no waste is produced (i.e., everything is not only recyclable but also IS recycled or reused). In addition, decoupling consumption from economic growth is also a solution, whereby the economy can grow without reliance on consumerism.

Waste and pollution produced in one part of the globe spill out and (eventually) affect other parts of the world. Air pollution from Indian and Chinese factories circumnavigates the globe and affects cities in the United States and Europe. Approximately 1.3 to 3.5 million MT of plastic waste alone enters the oceans annually due to China's lack of infrastructure to dispose of waste properly [8]. Part of this plastic makes its way to the 'Great Pacific Garbage Patch.' The Great Pacific Garbage Patch, also described as the Pacific trash vortex, is an area in the central North Pacific Ocean; it covers an area approximately twice the size of Texas and holds an estimated 7 million tons of plastic waste [9]. Even more concerning, microplastics less than five millimeters in length (about the size of a sesame seed) [10] and nanoplastics 1,000 times smaller than an algal cell [11] have been entering the food chain. One study found that 93% of bottled water contained some sort of microplastic [12]. These tiny plastic particles can be ingested by aquatic life, including the plankton and microscopic algae which form the basis for much of our food chain [13]. The detrimental effects on organisms that ingest plastic as it moves up the food chain can include intestinal blockages and toxic effects to both animals and plants [13]. Even the fibers from your favorite flannel shirt are polluting the environment and the water you drink [14]. The interconnectedness of our global ecological (and economic) system ensures that what my neighbor does will affect me and what I do will affect my neighbor.
My neighbor (and your neighbor) is anyone who shares this same planet. A recent estimate forecasts the world population at approximately 9 billion by 2050 [15]. Action is needed to ensure all inhabitants have equitable use of resources to prevent global social and political instability. Many studies have been done to predict the amount of garbage produced, from the global to the local level [16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]. These studies are focused on predicting the waste stream so that disposal mechanisms will be in place. Waste determinants have included such factors as urbanization, household size, wealth, education, and tourism [27]. Household size, wealth, and urbanization are the greatest predictors of waste amounts in most models [27]. Most of the models are based on correlation and regression analysis; very few have used artificial intelligence systems [27]. Changes in the waste stream have altered the ability to attribute waste production to regional populations: local landfills have been shut down in favor of larger regional landfills, which accept wastes not only from local and regional municipalities but also from neighboring states [28].

Regression analysis with stratified sampling was performed for a study in Sri Lanka [24] to determine socioeconomic factors in waste generation. Waste was collected from specified households, separated, and weighed to determine the composition percentages (organic, plastic, paper, glass, and metal). This study looked not only at the total amount of waste produced but also at how the composition of the waste was affected by income. As found in other studies [18, 19, 21, 23, 24, 28, 30], organic waste was the largest component of the Sri Lanka waste stream. Organic waste was also identified as the most potentially harmful "in terms of potential to cause environmental pollution and resource recovery" [24]: organic waste creates greenhouse gases (specifically methane), and organic matter can foul water and other ecosystems close to disposal sites. The Sri Lanka study suggested a possible feasibility analysis for the diversion of organic waste to composting. The study focused on waste prediction to identify and build waste management structures that could be implemented for a rapidly growing city and region.

Waste generation information can be used to focus waste reduction efforts toward specific populations and to identify areas where technological advances/processes are needed to reduce waste. Qi and Roe used regression and PCA analysis to identify behaviors and attitudes regarding food waste [29]. The ability to identify attitudes and behaviors aids in the development of advertising campaigns, tax incentives, or other methods targeted at changing behaviors. The data itself can also be used for shock effect: the sheer number/volume of what is wasted, when presented, is staggering. Where prior studies focus on waste prediction to aid in waste disposal, my analysis focuses on determining the amounts of component waste (plastic, glass, paper, metal, etc.) generated. Hopefully this information can be used to target reduction and recycling efforts to decrease waste production. To get the most effect in waste reduction, the most affluent must actively work to reduce waste through changes in behavior and consumption (buying habits).
The most affluent (at the 'top of the food chain') need to recognize that inequities in consumption and pollution/waste generation can affect their own daily lives through the impact created for the rest of society.

Income and other socioeconomic factors are typically used to predict waste generation. A 1998 study proposed an alternative income measure, Total Consumer Expenditures (TCE) [21], as a better measurement of the actual amount spent annually on consumer goods. Since some consumer spending does not result in waste generation, a Relative TCE (RTCE) was developed to capture the portion of TCE that actually results in waste production. This method results in a waste prediction model more closely tied to consumer spending habits. The study used total data for the US and the United Kingdom (UK) to predict waste generation for the US and the European Union (EU), respectively. As with many waste generation studies, accurate data for specific countries was not available, so the UK's spending and waste generation behavior was assumed to be representative of other EU countries. The RTCE figures were used to derive composition percentages of plastic, paper, glass, metal, and organic wastes. Total waste predictions were obtained with linear regression, and polynomial equations were used to develop best-fit curves for each waste component to predict the amounts that could be recycled and diverted from landfills.

2. RELEVANT WORK
The prediction of Municipal Solid Waste (MSW) generation is important to ensure the appropriate disposal and removal methods are in place. As population and the standard of living increase, waste (garbage) is yet another aspect of our lives that we expect to be managed without much additional thought. Recycling to divert waste from landfills becomes more important as waste continues to pollute the environment and more environmental protections restrict safe waste disposal.

Systems Dynamics (SD) modeling has been used to predict waste generation and the feasibility of a Material Recycling Facility [18] to divert recyclable material from landfill disposal. SD uses computer software (specifically Stella®) to simulate system inputs and flows (identified as stocks, flows, and converters). SD modeling was selected for this San Antonio study due to a small, limited dataset. The simulation inputs were: total income per area, people per household, historical waste generation, income per household, recycling patterns, and population. The SD information was then used in conjunction with a traditional linear regression model to predict waste generation and the feasibility of a Material Recycling Facility. An unusual SD prediction result was a plateau of waste generation as income increased. Other studies have shown increased recycling participation at higher incomes, but the
increased recycling does not offset the additional waste generated at higher incomes [18, 33].

Most studies on waste production have used regression analysis techniques based on population and income variables [16, 18, 24, 27, 30, 31]. Information on studies in the US is scarce; waste management practices there are fully developed, and waste collection and disposal systems are readily available in all parts of the United States (US). Although some studies have been performed in developed countries, much of the current literature available is from developing countries. Developing economies are interested in empirical information to aid in creating and establishing mechanisms for waste disposal to accommodate growing populations, environmental issues, and health concerns. The factors that contribute to waste generation vary from population to population [32]. Cultural influences, consumption habits, standard of living, and current infrastructure affect attitudes and actions regarding waste generation and disposal (or the reuse and recycling of materials). Even within country borders, the factors that influence waste generation can vary from region to region or from one city to the next. Although regression models are common, as machine learning algorithms have gained more acceptance there have been attempts to use different methodologies for prediction.

The Stella® SD model was applied in a Newark, NJ study to project the effects of recycling on landfill capacity [17]. The SD inputs were derived from previous studies of factors found to have significant influence on waste generation. These factors included: Gross Domestic Product (GDP), infant mortality rates, population density, household size, life expectancy, and labor force (agricultural, service, or industry). Waste figures had to be adjusted for the total amounts of waste either diverted to other states or accepted from other states; the consolidation of landfill sites in the US has made waste prediction more difficult. The study determined that increased recycling has the long-term result of MORE waste being sent to landfills: waste diverted to recycling provides increased landfill capacity, which results in lower landfill costs. The analysis showed that current recycling policies would not result in additional recycling or increased economic feasibility of recycling, and concluded that more cost-effective recycling methods (specifically lower collection costs) are needed to increase recycled waste.

A 2017 study combined Multivariate Linear Regression with Bayesian Model Averaging (BMA) to develop a better model for waste prediction [23]. The model, developed for Hoi An City, Vietnam, was significantly better than existing models using linear regression. Like other models, socioeconomic factors were used for prediction. This study used location (urban vs. rural), presence of a home business, number of people in the household, and house area per person to predict waste generation. Per capita income was not used as a specific factor in the model due to the difficulty of obtaining this information in developing areas: residents are unwilling to divulge income for fear of increased taxation. The presence of a home business and the generally higher wages in urban areas are good indicators of economic progress. This particular study was labor-intensive. It included face-to-face interviews as well as daily collection and weighing of household waste, which was separated as biodegradable (compostable) or non-biodegradable. Statistical tests (R2, MRS, RMSE, etc.) were performed to validate the model. Although labor-intensive studies that break down waste by category (paper, plastic, glass, etc.) are more difficult to implement, they provide much more useful information for the prediction of recycling possibilities and economic feasibility.

A study in India took a slightly different approach, using an existing combined socioeconomic (SES) parameter that included education, occupation, and family income to divide the population into five hierarchical groups [19]. The participants separated waste into biodegradable and non-biodegradable bags, and waste was collected daily. The non-biodegradable waste was further separated (paper, plastics, glass, metal, etc.) and each component was weighed. Unlike prior Indian studies cited, which indicated that the highest SES group generated the most waste, this study indicated that the medium/middle SES group generated the most waste. Waste generation was broken down by component for complete analysis on a day-of-the-week basis.
Several theories were offered on why the highest income group produced less waste, including the use of LP gas for heating and cooking (no coal ash in the waste) and eating outside the home in restaurants. A negative correlation between family size and per capita waste generation was consistent with prior studies.

Other approaches to predicting waste generation include Fuzzy Logic (FL). Data from a prior study in Mexicali, Mexico was analyzed using FL [34]. FL can be applied when there are uncertainties in the information. It works with mixed data types (quantitative and qualitative) and can be used with sparse data and missing values. Fuzzy reasoning is used to infer information based on a series of if/then statements; a degree of membership (rather than a hard in-or-out classification) is expressed by FL [22]. The Mexico study was predominantly interested in the amount of plastic waste generated per socioeconomic group, and it focused on recommendations targeted at reducing the amount of plastic waste [34]. The data included the glass, cardboard, and metal components, which were included in the analysis. The study concluded that lower-income households produce more plastic packaging waste because they buy smaller package sizes than higher-income households, which can buy in larger packages. However, for high-income, smaller households the result was a higher per capita generation of plastic waste. Fuzzy Logic can provide a good model when data is sparse and there are uncertainties.

Geospatial data was used for waste prediction in an Athens suburb [22]. Waste was collected from centrally located bins (rather than household curbside pickup). The study was focused on cost reduction, predicting the optimal times for garbage collection. Subject matter experts weighted the factors used, which included real estate values, building density, area size, electric bills, commercial traffic, and the specific waste bin locations. Separate inputs to calculate residential vs. commercial waste were then combined to arrive at a single number for a specified area. A relationship between electricity consumption and waste generation was indicated. Validation techniques for the predicted results were not available due to the sparse data. The goal was to predict when bins would be 90% full in order to optimize collection times.

Although Artificial Neural Networks (ANN) were developed in the 1940s, the technique did not begin to receive acceptance as a modeling method until the 1980s [35]. Neural networks attempt to mimic the way humans process information to produce decisions based on non-linear information. "ANNs are information-processing algorithms inspired by the way biological nervous systems make generalizations from similar situations, such as learning from past experience, and produce decisions out of incomplete knowledge of states with large inherent complexities and nonlinearities" [36]. Some studies have used ANN for waste prediction. A General Regression Neural Network (GRNN) model outperformed a Back Propagation Neural Network (BP) model in a study that covered 26 European countries [26]. In addition to being more accurate, the GRNN model trained significantly faster. The inner layers of the two neural networks, although both based on the default minimum, were not equal: the BP model had 10 neurons, while the GRNN model was based on 84 neurons (the minimum required for the 84 data sets in the training data). Adding additional neurons to the BP model to improve performance was not explored. The R2 statistic was used for model validation.
Significant model errors were attributed to uncertainties in the data estimations made for missing values; the model performed better for the more developed countries where the data was most complete. A 2011 study used ANN to predict waste generation over a 20-year future period [20]. The authors used a combination of linear regression and a Multilayer Perceptron (MLP) ANN for their forecasts. Although the authors performed various statistical tests to validate the results, the data used to predict future waste generation in the ANN brings uncertainty into the model. In addition, the data had to be re-scaled for the MLP to be able to handle data at the far future dates, as it was too far out of range from the actual data used for initial training. The authors solved this by taking logarithms of the data; the scaled data was used for training, with results comparable to the unscaled data.

Support Vector Machines (SVM) can be used for linear or nonlinear classification and regression tasks [35]. An Iranian study wanted to develop a model that would generalize well, using two cities, Tehran and Mashhad [25]. SVM was used with an additional component of Wavelet Transform (WT) to pre-process the time series data. WT decomposes the signal into a set of basis functions using a prescribed formula; the resulting sub-signals retain the structure/shape of the series. WT was used to eliminate noise in the time series data used for weekly forecasting with seasonal variations in waste generation. Although the models produced provided good performance, the WT process is not easy to understand, and the income and socioeconomic factors used in the SVM model were not clearly defined. Long-term waste predictions are important for planning future landfill and recycling recovery operations.

I will use linear regression to analyze the data. The most common models for waste prediction are correlation and regression analysis [27]. Other, more complicated models (Fuzzy Logic, Artificial Neural Networks, Support Vector Machines, etc.) have been used to predict waste generation, but the more complicated methods do not necessarily yield better results and are less easily explained or examined than regression solutions. The factor/characteristic I am most interested in as it relates to waste generation is income. Other factors that I would like to explore include education and employment (type of occupation). My hypothesis is that the amount of waste produced increases with increased income; most of the studies I have reviewed indicate waste increases with wealth. I would further like to explore whether there are differences in the component make-up of the waste generated. For example: do the component percentage amounts vary based on socioeconomic factors? Does a higher educational background result in increased generation of paper waste and lower generation of plastic waste?

3. PROPOSED SOLUTION
Aggregate/total waste data is available going back to the 1960s for the US. Waste component data (a breakdown of how much glass, plastic, metal, paper, etc.) is only available from the mid-1990s. Socioeconomic and demographic information (income, household size, education, etc.)
is available through census data every ten years. In order to strictly use the census data, the information would need to be annualized for each year between the census dates. Annual information on birth rates, infant mortality, population, and marriage rates is available (starting with the late 1990s to mid-2000s, depending on the type of data). The annual information as well as the census data (and/or extrapolated census data) will be used to determine how these demographic and economic factors affect the amount of each component waste generated (i.e., the amount of plastic, glass, paper, etc.).

Regional data is available for the United Kingdom (UK), with total amounts of waste generation and component information as well as census data (by region). US data on waste component information is not centrally collected and is not as easily available, but there is a dataset for Buffalo, NY, which includes waste component breakdowns (plastic, paper, etc.); US census data and annual data on birth rates and population are available for Buffalo socioeconomic and demographic data. I will obtain my data by downloading the UK waste data set and combining it with the UK census data for socioeconomic information such as income, education, and household size. I will also need to look at additional factors due to the ten-year limit of census data; other annually available information such as birth rates and population (and other information I can obtain, such as marriage and divorce rates) will be added into the data. Once I have a 'completed' dataset I will either create a 'copy' or add additional columns to annualize the census data between census years. I plan on a simple equal division among the intervening years (unless I find documentation of a different trend), and I will be able to analyze the data both with and without the modified/added yearly figures. I will use this same process for the Buffalo, NY information, combining the waste data with census socioeconomic and demographic information. If time and data permit, I would also like to locate data for an additional country. These datasets would be analyzed individually, and the two or three datasets would then be combined into a larger overall dataset. By combining the smaller datasets, I will have a larger dataset with which to work and will be able to do additional analysis to see whether similar or different patterns emerge, or whether there appear to be differences between the smaller datasets.

4. EVALUATION METHODOLOGY
Data compilation took a significant amount of time. I was able to obtain US waste data for 1960, 1970, 1980, and 1990 and annually from 1991 to 2015 from the Environmental Protection Agency (EPA) website and the EPA archives. The information came from the EPA publication titled "Advancing Sustainable Materials Management: Facts and Figures" (formerly known as "Characterization of Municipal Solid Waste in the United States"). The archive (under the alternate name) was an unexpected find of data that included material breakdowns (in paper/pdf copies of reports). The amount of waste generated continues to increase with time; generation increased rapidly from the 60s to the 90s, with slower increases after the 90s. Much of the focus today is on plastics: the amount of plastic waste nearly tripled from 1980 (6,830 thousand tons) to 1990 (17,130 thousand tons). By comparison, the next largest increase was wood waste, at 7,010 thousand tons in 1980 compared to 12,210 thousand tons in 1990. Annualized figures for the intervening years between 1960 and 1990 were extrapolated by evenly indexing the figures; with additional resources (time), a regression model could be developed that reflects incrementally increasing figures for the intervening years.
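The "equal division among the intervening years" idea amounts to linear interpolation between census values. A minimal sketch with pandas is shown below; the census years and figures are invented purely to show the mechanics, not real data.

import pandas as pd

# Toy census-style series: values known only for the census years.
census = pd.DataFrame(
    {"median_income": [52000, 56000], "household_size": [2.6, 2.5]},
    index=[2000, 2010],
)

# Reindex to every intervening year and divide the change equally among them,
# which is exactly what linear interpolation does for evenly spaced years.
annual = census.reindex(range(2000, 2011)).interpolate(method="linear")
print(annual)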
Buffalo waste data was obtained from the 'Open Data for Buffalo' website; the city of Buffalo makes data openly and publicly available and benefits from open data for continuous improvement of city services and the city's livability. This was my smallest number of years with data, available from 2010 through 2016. The Buffalo waste data increases from 2010 to 2015; however, nine of the 12 features either decrease or stay the same in 2016, and six of the 12 features decreased or stayed the same in 2017.

Waste data for the UK came from 'Find Open Data', a UK government site with data and links to data. UK waste data was available from 1997 to 2010; data available for later years was aggregate and not broken down into as many individual waste features. The only 'spike' noted in the UK data is co-mingled recycling, which went from zero in the year 2000 to 226 tons in 2001, with corresponding marked decreases in the 'Cans' and 'Other Recycling' categories in 2001. Data increases are seen each year from 1997 to 2008; five of the 10 features decreased in 2009, and in 2010 there was a decrease in four waste features from the prior year.

Census data for the US and the UK is openly available; however, downloading the data was a tedious process. Although much socioeconomic data is available, it does not easily align from one census to the next. The information collected (and how it is broken down) has changed, sometimes significantly, over time. For example, "ages 20-24" became a further breakdown of "ages 20", "ages 21", and "ages 22-24", and category names would change from "Male>>22to24 years" to "Male: 22 to 24 years". Other socioeconomic characteristics, such as types of employment, have changed dramatically and are not easily matched from year to year. I did not have an effective "automatic" way to line the data up; this was done manually. Some parameters were more difficult to align than others (depending on how much had changed from the preceding census). Where possible, estimates were determined for additional breakdowns, or data was merged to be consistent.

Data was obtained from various sources. The American Community Survey (ACS) has easily downloadable data from 2005 through 2017; this data is updated yearly and can be extracted for multiple items in a single download by using pre-formatted tables. Although the ACS is part of the US Census website, the actual US Census data is not as easy to extract. Data prior to the 1990 census is not available electronically; although the 1980 census was the first with information stored on computers (magnetic tape), the census website only has paper (pdf) files of the 1980 and earlier census years. The National Historical Geographic Information System (NHGIS) website provides electronically downloadable information for all US census data (kudos to those who manually input this information from the paper Census reports!). The information can be downloaded for multiple years and multiple parameters; however, it quickly becomes unwieldy, and, as noted, the information collected (and how it is broken down) has changed, sometimes significantly, over time. I needed to align each year and factor for consistency, and again, some parameters were more difficult to align than others. Like the US Census website, the UK Census website was also difficult to download data from.
Like the US Census website, the UK Census website made it difficult to download data. The Nomis website (run by the University of Durham on behalf of the Office for National Statistics) provided easily downloadable information. Similar limitations were encountered as with the NHGIS website, with changing statistical information and restrictions on the ability to download information on an annual basis.

Downloading and formatting additional data could have continued for another year; the maximum amount of data was desired for analysis. However, in the interest of obtaining results, the data gathering was finally halted.

Finding relevant factors: Scikit-learn's Random Forest Regressor was used to obtain the top socioeconomic predictor for each waste factor for each data area. A scatter plot with a best-fit linear regression line was created for each top predictor and the corresponding waste feature.

5. DISCUSSION
Overall, I did not find a strong correlation between socioeconomic factors and waste generation, but there were still trends evident in the data. The data can be divided into three datasets.

USA - Although correlations were still low, the US data produced the most expected results, with the most influential predictors related to household size, income, and education. The US data included 17 waste features and 499 socioeconomic features for the years 1960 through 2017. Table 1 lists the waste and top predictor feature for the US.

UK - The UK data area offered the most promise of finding patterns. This data covered the years 1997 to 2010 and included 10 waste features and 138 associated socioeconomic features. Instead of steady increases each year, the UK data showed decreases in the total waste features during the last two years, as shown in Figure 1.

Figure 1: Total trend of UK waste

The top predictive socioeconomic features for the UK included population, population by age and sex, household composition, and education (qualifications). The UK data was sorted by region, waste feature and socioeconomic feature to look for any similarities or patterns in the results. The three predominant socioeconomic features were 'total population' (23 waste-region pairs), 'mean age' (16 waste-region pairs), and 'males aged 18-24' (13 waste-region pairs). When the total UK was considered (all 9 regions), 'one family and no others - Lone parent households - all children non-dependent' was a top predictor for 6 waste features. Tables 3a-3d list the waste and top predictor feature for the UK.

Buffalo - The Buffalo data included 12 waste features and 453 socioeconomic features covering the years 2010 through 2017. This was the smallest set of data with the fewest years (only 8 years). Similar to the UK data, there was a decrease in waste for the last two years in the Buffalo data. Table 2 lists the waste and top predictor feature for Buffalo.
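The region-by-region tallies described above (how often each socioeconomic feature tops a waste-region pair) can be produced with a simple groupby-style count. The sketch below uses a hypothetical summary table and illustrative values only.

```python
import pandas as pd

# Hypothetical summary table: one row per (region, waste feature) pair with
# the top socioeconomic predictor found by the Random Forest Regressor.
pairs = pd.DataFrame({
    "region": ["London", "North East", "London", "South West"],
    "waste_feature": ["Glass", "Glass", "Compost", "Cans"],
    "top_predictor": ["total population", "mean age",
                      "total population", "total population"],
})

# Count how often each predictor tops a waste-region pair.
counts = pairs["top_predictor"].value_counts()
print(counts)
```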
6. RESULTS
Although my goal was to find a relationship between socioeconomic factors, specifically income, and waste generation, I did not find statistical significance connecting these factors. Random Forest Regressor was used to identify the most influential socioeconomic predictor for each waste factor. Tables 1 through 3d list the features identified for each dataset.

After obtaining these features, linear regression was used to attempt to predict waste generation based on them. A training/test split of 85% training and 15% test performed better for the large US dataset than a 30% test set. With the Buffalo and UK data, a 30% test set provided higher performance. This could be due to the smaller size of the data sets and the lower correlation numbers. Performance measures of R-squared (R2), Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) were used to measure linear regression performance. Overall, the performance measures showed low model performance: R2 scores ranged from .07 to .99, RMSE ranged from 1.7 to 41,637.5, and MAE values ranged from 1.5 to 24,022.0.
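The two-step procedure described above (rank predictors with a Random Forest Regressor, then fit and score a linear regression) can be sketched with scikit-learn as follows. The data, column names, and split are hypothetical stand-ins, not the author's actual pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Hypothetical dataset: annual socioeconomic features plus one waste feature.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(58, 5)),
                 columns=["median_income", "household_size", "education",
                          "population", "birth_rate"])
y = 100 + 3 * X["population"] + rng.normal(scale=2, size=58)  # e.g. plastic waste

# Rank socioeconomic predictors for this waste feature.
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
top = X.columns[np.argmax(rf.feature_importances_)]
print("Top predictor:", top)

# Fit a linear regression on the top predictor and report R2, RMSE and MAE.
X_train, X_test, y_train, y_test = train_test_split(X[[top]], y,
                                                    test_size=0.15, random_state=0)
pred = LinearRegression().fit(X_train, y_train).predict(X_test)
print("R2  :", r2_score(y_test, pred))
print("RMSE:", mean_squared_error(y_test, pred) ** 0.5)
print("MAE :", mean_absolute_error(y_test, pred))
```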
7. CONCLUSIONS AND FUTURE WORK
The best results were obtained from the US data (which was the largest dataset). If more complete data could be obtained, the correlation between waste and socioeconomic features could be identified better. In addition to gathering additional data, future work could include looking at more than one top waste predictor. As noted, linear regression could be used to extrapolate intervening data between census years. Using fewer waste features may help to reduce noise and show better performance.

8. ACKNOWLEDGMENTS
My thanks to my proofreader and husband for reviewing many drafts, and to Shraddha Dubey, who helped me organize my thoughts. Special thanks to Ron Richardson for his patience and help when I was learning to program in Python, R and SQL. I'd like to thank my Mercyhurst cohorts; I have drawn from their youthful reserves of energy and optimism.

9. List of Figures/Tables
Tables 1 through 3d are in the attached Appendix.
1. Figure 1: Total trend of UK waste
2. Table 1: US geographic area waste and top socioeconomic predictor
3. Table 2: Buffalo geographic area waste and top socioeconomic predictor
4. Table 3a: UK regions East, East Midlands, London, and North East waste and top socioeconomic predictor
5. Table 3b: UK regions North West, South East, and South West waste and top socioeconomic predictor
6. Table 3c: UK regions West Midlands and 'Yorkshire and the Humber' waste and top socioeconomic predictor
7. Table 3d: UK region (Total UK) waste and top socioeconomic predictor

10. REFERENCES/BIBLIOGRAPHY
[13] YouTube video: Sea Turtle with Straw up its Nostril - "NO" TO PLASTIC STRAWS. 2015. Retrieved October 18, 2018 from https://www.youtube.com/watch?v=4wH878t78bw.
[14] The New Plastics Economy - Rethinking the future of plastics. 2016. World Economic Forum, Ellen MacArthur Foundation and McKinsey & Company. From https://www.ellenmacarthurfoundation.org/assets/downloads/EllenMacArthurFoundation_TheNewPlasticsEconomy_Pages.pdf.
[15] Hoornweg, Dan & Perinaz, B.T. 2012. What a waste: a global review of solid waste management. The World Bank, Urban Development Series Knowledge Papers, no. 15. https://siteresources.worldbank.org/INTURBANDEVELOPMENT/Resources/3363871334852610766/What_a_Waste2012_Final.pdf.
[16] Hoornweg, D., Bhada-Tata, P. and Kennedy, C. 2015. Peak Waste: When Is It Likely to Occur? Journal of Industrial Ecology, 19, 117-128. DOI: https://onlinelibrary.wiley.com/doi/abs/10.1111/jiec.12165.
[17] Andrew D. Hwang. 2018. 7.5 billion and counting: How many humans can the Earth support? (July 30, 2018). Retrieved October 19, 2018 from https://theconversation.com/7-5-billion-and-counting-how-many-humans-can-the-earth-support-98797.
[18] Bruce Pengraa. 2012. One Planet, How Many People? A Review of Earth's Carrying Capacity. UNEP Global Environmental Alert Service (GEAS). https://na.unep.net/geas/archive/pdfs/geas_jun_12_carrying_capacity.pdf.
[19] Ton vs. Tonne: What's the Difference? Retrieved October 21, 2018 from https://writingexplained.org/ton-vs-tonnes-difference.
[20] A. L. Brooks, S. Wang, J. R. Jambeck. 2018. The Chinese import ban and its impact on global plastic waste trade. Science Advances, Vol. 4, No. 6, Article eaat1313 (Jun. 2018), 7 pages. DOI: 10.13140/RG.2.2.11029.63202.
[21] Lebreton, B. Slat, F. Ferrari, B. Sainte-Rose, J. Aitken, R. Marthouse, S. Hajbane, S. Cunsolo, A. Schwarz, A. Levivier, K. Noble, P. Debeljak, H. Maral, R. Schoeneich-Argent, R. Brambini & J. Reisser. 2018. Evidence that the Great Pacific Garbage Patch is rapidly accumulating plastic. Scientific Reports, volume 8, Article number 4666 (2018). DOI: 10.1038/s41598-018-22939-w.
[22] NOAA. What are microplastics? Retrieved December 10, 2018 from the National Ocean Service website, https://oceanservice.noaa.gov/facts/microplastics.html, 6/25/18.
[23] Wageningen University & Research. Microplastics & Nanoplastics. Retrieved from https://www.wur.nl/en/Dossiers/file/Microplastics-and-Nanoplastics.htm.
[24] David Common, Eric Szeto. 2018. Microplastics found in 93% of bottled water tested in global study. (April 2018). Retrieved December 10, 2018 from https://www.cbc.ca/news/technology/bottled-water-microplastics-1.4575045.
[25] Betty Staugler. 2017. Microplastics - What's the big deal? (January 2017). Retrieved December 10, 2018 from http://blogs.ifas.ufl.edu/charlotteco/2017/01/26/microplastics-are-a-major-concern/.
[26] Jay Sinha. 2018. Life Without Plastic. Guest lecture, held at Tom Ridge Environmental Center, Erie, PA on October 5, 2018.
[27] TED Radio Hour, NPR. 2016. Marcel Dicke: Are Insects The Future Of Food? (Nov. 2016). Retrieved October 19, 2018 from https://www.npr.org/programs/ted-radio-hour/?showDate=2018-10-19.
[28] S. Lebersorger, P. Beigl. 2011. Municipal solid waste generation in municipalities: quantifying impacts of household structure, commercial waste and domestic fuel. Waste Management, 31 (Sep 2011), 1907-1915. DOI: https://doi.org/10.1016/j.wasman.2011.05.016.
[29] N. Kollikkathara, H. Feng, and D. Yu. 2010. A system dynamic modeling approach for evaluating municipal solid waste generation, landfill capacity and related cost management issues. Waste Management, 30 (Jun 2010), 2194-2203. DOI: 10.1016/j.wasman.2010.05.012.
[30] B. Dyson, N. Chang. 2005. Forecasting municipal solid waste generation in a fast-growing urban region with system dynamics modeling. Waste Management, 25 (Jan 2005), 669-679. DOI: 10.1016/j.wasman.2004.10.005.
[31] D. Khan, A. Kumar, and S.R. Samadder. 2016. Impact of socioeconomic status on municipal solid waste generation rate. Waste Management, 49 (Mar 2016), 15-25. DOI: http://dx.doi.org/10.1016/j.wasman.2016.01.019.
[32] D. Antanasijević, V. Pocajt, I. Popović, N. Redžić, and M. Ristić. 2013. Long term forecasting of solid waste generation by the artificial neural networks. Environmental Progress & Sustainable Energy, 31, 4 (Dec 2012), 628-636. DOI: 10.1002/ep.10591.
[33] E. Daskalopoulos, O. Badr, and S.D. Probert. 1998. Municipal solid waste: a prediction methodology for the generation rate and composition in the European Union countries and the United States of America. Resources, Conservation and Recycling, 24 (Nov 1998), 155-166. DOI: https://doi.org/10.1016/S0921-3449(98)00032-9.
[34] N.V. Karadimas, V. Loumos and A. Orsoni. 2006. Municipal solid waste generation modelling based on fuzzy logic. In Proceedings of the 20th European Conference on Modelling and Simulation, Bonn, Sankt Augustin, Germany (May 2006). DOI: https://doi.org/10.7148/2006-0309.
[35] M. G. Hoang, T. Fujiwara, S. T. Pham Phu, and K. T. Nguyen Thi. 2017. Predicting waste generation using Bayesian model averaging. Global Journal of Environmental Science and Management, 3 (Sep 2017), 385-402. DOI: 10.22034/GJESM.2017.03.04.005.
[36] N. J. G. J. Bandara, J. P. A. Hettiaratchi, S. C. Wirasinghe, and S. Pilapiiya. 2007. Relation of waste generation and composition to socio-economic factors: a case study. Environmental Monitoring and Assessment, 135 (Dec 2007), 31-39. DOI: 10.1007/s10661-007-9705-3.
[37] M. Abbasi, M. Abdoli, M. Abdoli, B. Omidvar, and A. Baghvand. 2013. Results uncertainty of support vector machine and hybrid of wavelet transform-support vector machine models for solid waste generation forecasting. Environmental Progress & Sustainable Energy, 33, 1 (Apr 2014), 220-228. DOI: 10.1002/ep.11747.
[38] D. Antanasijević, V. Pocajt, I. Popović, N. Redžić, and M. Ristić. 2013. The forecasting of municipal waste generation using artificial neural networks and sustainability indicators. Sustainability Science, 8 (Apr 2013), 37-46. DOI: 10.1007/s11625-012-0161-9.
[39] K.A. Kolekar, T. Hazra, S.N. Chakrabarty. 2016. A Review on Prediction of Municipal Solid Waste Generation Models. Procedia Environmental Sciences, Vol. 35 (2016), 238-244. DOI: 10.1016/j.proenv.2016.07.087.
[40] Matthew Thomas Clement. 2009. A Basic Accounting of Variation in Municipal Solid-Waste Generation at the County Level in Texas, 2006: Groundwork for Applying Metabolic-Rift Theory to Waste Generation. Rural Sociology, Vol. 74, 3 (Sep. 2009), 412-429. DOI: https://doi.org/10.1526/003601109789037196.
[41] Danyi Qi and Brian E. Roe. 2016. Household Food Waste: Multivariate Regression and Principal Components Analyses of Awareness and Attitudes among U.S. Consumers. PLoS One, Vol. 11, 7 (Jul. 2016), 1-19. DOI: https://doi.org/10.1371/journal.pone.0159250.
[42] D. Hockett, D.J. Lober, and K. Pilgrim. 1995. Determinants of Per Capita Municipal Solid Waste Generation in the Southeastern United States. J. Environ. Manage., 45 (1995), 205-217.
[43] O. O. Samuel. 2015. Socio-Economic Correlates Of Household Solid Waste Generation: Evidence From Lagos Metropolis, Nigeria. Management Research and Practice, Research Centre in Public Administration and Public Services, vol. 7, 1 (Mar. 2015), 44-54.
[44] H. K. Ozcan, S. Y. Guvenc, L. Guvenc, and G. Demir. 2016. Municipal Solid Waste Characterization according to Different Income Levels: A Case Study. Sustainability, 8, 1044 (Oct 2016), 1-11. DOI: https://doi.org/10.3390/su8101044.
[45] Matheus Bueno and Marica Valente. 2018. The Effects of Pricing Waste Generation: A Synthetic Control Approach. Discussion Papers of DIW Berlin 1737. DIW Berlin, German Institute for Economic Research, Berlin, DE. DOI: 10.13140/RG.2.2.11029.63202.
[46] G. Lozano-Olvera, S. Ojeda-Benítez, J. Castro-Rodríguez, M. Bravo-Zanoguera, and A. Rodríguez-Diaz. 2008. Identification of waste packaging profiles using fuzzy logic. Resources, Conservation, and Recycling, 52 (Jul 2008), 1022-1030. DOI: 10.1016/j.resconrec.2008.03.008.
[47] Aurelien Geron. 2017. Hands-on Machine Learning with Scikit-Learn & TensorFlow: Concepts, tools, and techniques to build intelligent systems (7th ed.). O'Reilly, Sebastopol, CA.
[48] S. Bayar, I. Demir, and G. Engin. 2009. Modeling leaching behavior of solidified wastes using back-propagation neural networks. Ecotoxicology and Environmental Safety, 72, 3 (Mar. 2009), 843-850. DOI: https://doi.org/10.1016/j.ecoenv.2007.10.019.

11. DATA SOURCES
[49] Buffalo socioeconomic information, American Community Survey (ACS), https://www.census.gov/programs-surveys/acs/data.html.
[50] Buffalo waste information, Open Data for all of Buffalo, Monthly Recycling and Waste Collection Statistics, https://data.buffalony.gov/Quality-of-Life/Monthly-Recycling-and-Waste-Collection-Statistics/2cjd-uvx7/data.
[51] Buffalo socioeconomic information, https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=bkmk.
[52] Buffalo socioeconomic information, https://health.data.ny.gov/browse?limitTo=datasets&tags=vital+statistics&utf8=%E2%9C%93.
[53] US waste information, EPA, Municipal Solid Waste in the United States: Facts and Figures (archive 1995-2012), https://archive.epa.gov/epawaste/nonhaz/municipal/web/html/msw99.html.
[54] UK socioeconomic information, Office for National Statistics, www.nomisweb.co.uk.
[55] US Census information, https://www.census.gov.
[56] Office for National Statistics; National Records of Scotland; Northern Ireland Statistics and Research Agency (2017): 2011 Census aggregate data. UK Data Service (Edition: February 2017). DOI: http://dx.doi.org/10.5257/census/aggregate2011-2. https://census.ukdataservice.ac.uk/.
[57] Household recycling by material and region, England, https://data.gov.uk/dataset/c9a3d775-6e00-4b8f-9f80-7f28fea7d944/household-recycling-by-material-and-region-england.

12. MACHINE LEARNING AND DATA MANIPULATION
[58] Wes McKinney. Pandas Data Structures for Statistical Computing in Python. Proceedings of the 9th Python in Science Conference (2010), 51-56.
[59] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, Édouard Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12 (2011), 2825-2830.
[60] John D. Hunter. Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering, 9 (2007), 90-95. DOI: 10.1109/MCSE.2007.55.

About the author: Heidi L Beezub is a graduate student at Mercyhurst University. Previously, she held several positions with STERIS Corporation, including six years as a Contract Administrator and seven years as an Incentive Analyst, and worked for four years as an Inside Sales Specialist for SPX Corporation. She entered the Data Science program at Mercyhurst to gain the programming skills needed to return to work as an analyst. Heidi's undergraduate degree is a BA in Business Administration from Mercyhurst. In addition, she holds a Secondary Education teaching certification from Edinboro University.
Using Stock Market Data to Evaluate Genetic Algorithm Performance
William Fisher
Department of Computing and Information Science, Mercyhurst University, Erie, PA
wfishe96@lakers.mercyhurst.edu

ABSTRACT:
Stock market prediction is a particularly interesting problem because the stock market is widely regarded as very hard to forecast. One of the reasons the market is tough to predict is the multitude of variables and their interconnectivity. With that, it makes sense to use a feature selection algorithm to increase prediction accuracy. The following paper attempts to accurately predict whether the market will have a positive day by combining a Support Vector Machine (SVM) and an Artificial Neural Network (ANN) with a genetic algorithm. The results show that it is plausible that adding a genetic algorithm as part of the feature selection phase will increase accuracy.

1. INTRODUCTION
Stock markets are some of the most lucrative investment vehicles in the world. Countries across the globe offer stock markets as a way for people to invest in the future prosperity of publicly traded companies. Stock markets are composed of individual stocks that work on the principles of supply and demand. If a company is doing well, the theory is that more people will want to buy shares of that company, which increases demand and thus increases the price of a share. The reverse is true as well: if a company delivers poor performances, people will sell their shares, which increases the supply of shares in the marketplace and drives the prices of shares down.
Because of the wealth that stock markets can create, both institutions and individuals have long tried to create systems that maximize profits. Systems are "a group of specific parameters that combine to create buy and sell signals for a given security" [18]. Some examples of trading systems include Pairs Trading, Trend or Countertrend Following, and News Related Positioning. As an example of how a system works, a Pairs Trade involves taking two stocks that are highly correlated and looking for times when one of the pair components is lagging. The trader would then place buy orders on the lagging company or sell the rising one [19]. The system that this paper is concerned with takes the predictions of the machine learning process and turns them into profitable trades.

One of the reasons predicting the market is so hard is the number of variables that can affect market prices. These factors can range from individual company performances, to sector-specific information, to market-wide dynamics, to macroeconomic intricacies. Some of the variables that can be extracted from individual stocks include simple features such as price, time of year, and earnings, or more complicated, generated features such as moving averages and oscillators. Sector-specific information includes whether or not tariffs are affecting the industry or whether a stock is cyclical and only performs well at certain times. Examples of economic data are GDP growth and unemployment numbers. Due to the vastness of possible features, a good place to start when trying to predict market movements is to narrow down the difference between noise and legitimate signals.

Researchers have already explored this topic using a number of methods. Among topical papers, the two most prominently used algorithms are the Support Vector Machine (SVM) and the Artificial Neural Network (ANN). Most of the papers show that machine learning algorithms can be used to predict stock market prices. Some of the papers also used feature selection algorithms. Genetic algorithms and various component analyses are the most popular techniques for feature selection. Using feature selection techniques has been shown to produce superior results to feeding the prediction algorithm all the relevant features.

There are a few key areas in this space that have not been explored to their fullest potential. First, there are so many variables that can be included in the study of the stock market that it is nearly impossible to test all of them and their effectiveness. In particular, there are a number of data points in the economic reports that have not been utilized. Additionally, earnings data can cause huge moves in stock price and is an inside look at how companies feel about past and future performance. Markets are very interconnected with data from around the world, so, while it might not be obvious, economic data from other countries can affect markets in seemingly unrelated countries. Features and the feature selection process are extremely important in model accuracy, which makes feature selection a vital process for creating a successful model.

My model will focus on this untapped data, extracting the superior features by using a genetic algorithm for feature selection and pairing it with an SVM and an ANN. Then, I will compare its results with the standalone algorithms as well as a "buy and hold" method.

This research is relevant for a couple of reasons. First, it illustrates to the data science community the importance of tracking down and retrieving the best possible data, even when it is not available in obvious places or datasets. Second, the paper demonstrates the effects that feature selection algorithms, when given a diverse set of features, can have on the accuracy of a model. Finally, the paper shows both the financial and machine learning communities that, despite the perceived randomness, markets are predictable to a useful degree.

2. RELEVANT WORK
Using machine learning to predict the stock market is a feat that data scientists continue to strive for. The stock market represents a complex problem with many variables that machine learning is perfectly suited for. To tackle the problem, past work must first be considered. In particular, a thorough understanding of three specific areas is needed. For this relevant works section, I will focus on the stock market, algorithms that were previously used to try and predict the market, and feature selection algorithms that can help narrow down the features for market predictions. All of these areas represent important pieces of knowledge when attempting to build a successful model in this space. Stock market domain knowledge gives a solid base for where to start when coming up with creative features and theories to test with various models. Algorithms that were used in previous papers shed light on what types of models were successful. This can offer both a learning experience and can save time when looking for a prediction algorithm. There are thousands of possible features to select from. Some can be obvious and might be common among other papers, but using previously successful feature selection algorithms can lead to undiscovered correlations that can result in successful model building. All three of these components are necessary for understanding the subject matter and crafting a successful model.
2A) Technical Background: The Stock Market Understanding the stock market is vital to producing a machine learning algorithm that can predict future fluctuations in the market. There are many works on the subject in academia. Many of the works display various characteristics of the market that can be stored as knowledge and used when constructing a workable model. For hundreds of years, those who participated in the stock market have used many techniques to try and manipulate their gains. One such technique is technical analysis. Technical analysis is the process of finding repeatable patterns that have price prediction capabilities [16]. Technical analysis patterns can be simple price trends or complex chart figures. Some of the most common techniques are moving average evaluations, identification of support and resistance, and reading momentum indicators [10]. Moving averages are simply a calculation of a stock’s price over a designated time period [10]. Support and resistance are price barriers that many stocks adhere to due to trading psychology and the role of automated trading set to buy and sell at certain price points [12]. Momentum indicators are those that show a stock's current trend, either up or down, and whether or not it might continue to move in that direction. These and many other indicators are used to take the uncertainty out of future movements. Another key area is the fundamentals. Fundamentals are supposed to be the underlying factors of stock prices or the intrinsic value. Fundamentals are used to calculate a stock’s value to the market. Some examples of commonly used fundamentals include a company’s profits, revenue, or debt [2]. The fundamentals of a stock can vary from sector to sector because what is an important measure in one stock might not be important in another. For instance, airline stocks rely on different sources of revenue including revenue from passengers and revenue from cargo [1]. Contrarily, an oil company’s price might be contingent on the number of oil rigs they are currently operating and their outputs [14]. Due to many traders’ beliefs in fundamentals dictating prices, it is important to have a basic understanding of them and how stocks react to them. The stock market does not operate in a vacuum. The fundamentals previously mentioned are decided by more than just a company’s performance. Many macroeconomic forces play a role in the construction of the fundamentals. With the interconnectivity of markets around the world, factors such as interest rates, bond prices, and inflation all serve a role in a stock’s price [3]. Besides those mentioned, there are many more economic factors that are at play in the stock market. Using machine learning can be used to sort through these and find those that are most predictive of price movement. 2B) Technical Background: Algorithms Previously Used for Market Predictions The stock market can behave unpredictably and is constantly evolving. Therefore, certain algorithms perform better in certain times and while one may be more accurate historically, it does not mean going forward it will be best. That said, there is merit in looking at the algorithms that have been used in past research and examining their results. 2B.1) ANNs A popular method for market prediction is the use of Artificial Neural Networks (ANN). Fernández-Rodríguez et al constructed an ANN that takes in nine input values where the values are the returns of the previous nine days [5]. 
His model has one hidden layer with four units and one output layer, a number between -1 and 1 [5]. When the output is positive it is a "buy" signal, and when the number is negative it is a "sell" signal [5]. Fernández-Rodríguez et al. found that their model worked best in bear and stable markets, but was outperformed by a buy-and-hold strategy in bull markets [5].

Another author, Ticknor, also uses ANNs to predict stock prices. Ticknor also uses nine input variables and specifies that he uses the opening price, closing price, and high of the day among other attributes [13]. His model contains three layers. What makes Ticknor's model unique is that instead of the output being binary, the output is a predicted stock price for the next day [13]. Ticknor finds that his model is accurate compared to models in a previous paper written by Hassan et al. [13].

Finally, Vanstone and Finnie also wrote about ANNs. The authors use a model that takes a set of 13 inputs of various technical stock data points [15]. The output for their model was the high that occurred in the next 20 days, roughly one trading month [15]. If a monthly high occurred on day five of the 20-day period, that day's price would be the predicted output [15]. Using the outputs, the authors created trading rules that they showed, over time, are more profitable than buying and holding stocks [15].

2B.2) SVMs
SVMs appear to be the most popular method among stock market prediction papers. In his paper, Kim uses an SVM to predict the prices of Korean stocks [8]. He uses 12 technical indicators as input variables and predicts whether or not the market will go up or down on a daily basis, with one being a move to the positive side and zero being negative [8]. Kim's study focuses on whether or not an SVM-based model will outperform a back-propagation ANN and a case-based reasoning algorithm [8]. Kim found that the SVM outperformed the other two models [8].

Huang et al. also use SVMs as the primary algorithm of their paper [6]. The authors test the effectiveness of SVMs against that of Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and Elman Backpropagation Neural Network (EBNN) algorithms [6]. Similarly to Kim, Huang et al. find that the SVM classifiers perform better than the rest of the algorithms [6]. The authors theorize that the reason SVMs perform better than the other models is the nature of the algorithm and its propensity to avoid overfitting [6]. The authors also find that a combining method, where the SVM is paired with the various other algorithms, performs even better than the SVM by itself [6].

Another author who uses an SVM approach to try and predict the stock market is Lahmiri. Lahmiri focuses on a comparison between Probabilistic Neural Networks (PNN) and SVMs [9]. Lahmiri uses technical and macroeconomic variables to try and predict daily stock movements [9]. The author also explores a combination of the two methods [9]. The paper shows the best results were obtained by using an SVM with the macroeconomic data as an input [9].

Li et al. also tested the SVM algorithm against a number of different algorithms and once again found that the SVM was the most accurate [11]. Li et al. tested the algorithm against the extreme learning machine (ELM) algorithm and various versions of neural networks [11]. Notably, Li et al. also discovered that in comparison to regular back-propagation neural networks (BPNN) and SVMs, the ELMs used were faster when training and testing on the same data [11].
2C) Technical Background: Algorithms for Feature Selection
2C.1) Genetic Algorithms
Many stock market prediction papers use genetic algorithms (GA) when looking for an appropriate model. One such paper by Kim and Han uses a GA combined with an ANN [7]. In their paper, the role of the GA is to extract the best features and the best weights for the ANN [7]. The study concludes that the GA-ANN combination performed better than either model by itself [7]. Kim has another paper in which he explores the effects of two similar GAs [8]. In this paper, Kim finds that a GA model that continuously updates beats a similar GA that does not update [8].

Choudhry and Garg also use a hybrid model that utilizes a GA [4]. Choudhry and Garg use a GA and SVM on highly correlated stock pairs to try and predict future prices [4]. The GA is used to select the best features from an original list of 35 [4]. The authors' experiment compares the results of the GA-SVM model with a regular SVM that uses all 35 inputs [4]. The comparison shows that the model that uses inputs narrowed down by the GA produces superior results to the SVM that uses all 35 input features [4].

3. PROPOSED SOLUTION
3A) Rationale:
3A.1) SVMs: There are many reasons for using SVMs as one of the test algorithms. First, many other studies looking at stock market prediction also use the SVM algorithm [8] [6] [9] [11]. This makes comparisons and readability easier across the audience. Second, SVMs do an adequate job of avoiding overfitting [6]. This is especially important for the stock market because the swings can be unpredictable at times. Finally, SVMs work well with non-linear problems [20]. Again, this is important for a problem involving the stock market, since stock market movements are rarely linear.

3A.2) ANNs: ANNs are often used for complicated, hard-to-model problems and have been a favorite of many researchers studying stock market predictions. Like SVMs, they have been used for many financial analysis problems [5] [13] [15]. They are good for non-linear problems and good at generalizing, meaning they also sufficiently avoid overfitting [21]. Finally, ANNs model heteroskedasticity challenges well [21]. Highly volatile data and non-constant variance are staples in financial analysis, which makes ANNs a good algorithm choice for this problem.

3A.3) GAs: Genetic algorithms are inspired by nature and mimic the natural evolution process. They allow users to search and traverse the space of possible solutions in an efficient way. Additionally, GAs are well suited to stock market analysis and this particular problem because they are easy to program and have been shown to find features that achieve strong results [21] [4] [7] [8]. They are useful for this space, in particular, because there are many features that need to be streamlined to find the best possible combination.
3A.4) Features: There are multiple components to predicting a stock price, and each decision can have profound impacts down the line. For the benefit of traders and money managers, stocks can be bought and sold in groups called Exchange Traded Funds (ETFs). ETFs usually represent a segment of stocks. For instance, the XLF ETF represents stocks that fall under the financial sector description and the QQQ ETF represents stocks that fall under the technology category.

One prominent way to try and predict the movement of ETFs is to trade funds that are highly correlated. While the stock market is known as being unpredictable, stocks rarely move in isolation. Due to the interconnectivity of the global economy, sectors often have a snowball effect on each other. This means that when one sector goes up, we can predict that similar sectors that rely on many of the same factors will also increase in price. An example of this is XHB and XRT. XHB is an ETF that encompasses companies that focus on home building. XRT contains retailers. Since consumers propel both of these sectors, it stands to reason that they will have similar price movements. Historically, this is true. These two sectors are positively correlated and have a correlation value of 0.78 using their monthly returns since 01/01/2010. This is important for our proposed solution because I use many of these ETF prices as features in my dataset. The selection process is a simple inclusion of the most popular ETFs by trade volume.

As with the ETFs, macroeconomic factors also have a major impact on the stock market. Leading indicators are measurements that have the potential to forecast economic conditions. This category of indicators includes reports such as manufacturing activity, retail sales, the housing market, and inventory levels. Lagging details such as GDP, profits, and interest rates can also have effects on the stock market. These are considered a measurement of current economic conditions. Although they do not have as much forecasting power as the leading indicators, they are a good gauge of the overall health of an economy. Both categories can have an effect on the stock market, especially if they show data that contradicts the current consensus of the market.

Fundamentals are another set of features that are included in the model. As mentioned above, many investors use fundamentals to make investment decisions. The fundamental values to be included in the model will be those that are popular and easily recognizable to both seasoned investors and casual stock market participants. Technical measures will also be included in the original feature set. Technical indicators chosen for the original pool, before being narrowed down via the genetic algorithm, will include a set of simple, easy-to-read, and easy-to-calculate indicators. Choosing easy-to-use and easy-to-understand indicators of both the technical and fundamental variety has many benefits, including results that should be easier to read and, subsequently, reproduce.

3B) Description:
3B.1) Written Description: This paper examines four models, an SVM, an ANN, a GA + SVM hybrid, and a GA + ANN hybrid, and tests their performances against each other along with the general moves of the broader market. The models aim to predict whether or not the DJI, the ticker symbol for the Dow Jones Industrial Average, has a daily positive gain (PositiveGain in the dataset). All of the models are built from the same initial data. The data for this paper is from a variety of sources. For index prices, data comes from Yahoo Finance. The technicals are calculated using a combination of fundamentals, time, and prices; specific calculations are provided below. Finally, economic data comes from the Data.gov website. The first date in the price data is 01/04/2010 and the final date is 07/01/2018. The data includes all the trading days between the previously described bookends. Trading days are all business days not including holidays or various other days where the market is closed.
The date range was chosen because it was the largest sample size that could be obtained where all desired features were available. Much of the economic data is given in monthly or quarterly statistics. When this is the case, the data is extrapolated over the entire time period until the newest data is available. For example, the GDP report for 9/1/2018 was 20658.204. Since a new number is not reported until 10/1/2018, the September number is used for 9/2, 9/3, 9/4, etc. until the October number comes in. The same technique was used for quarterly data.

Data is processed via the scikit-learn preprocessing packages; processing includes removing any missing data and scaling data so machine learning algorithms will perform well. Data was scaled using (x_i - min(x)) / (max(x) - min(x)) since not all features are evenly distributed.

For the models, data is split using train-test split packages. Training data is split again into training data and validation data. Due to the nature of time-series data, the data is split chronologically. The training data set starts on 01/04/2010 and ends on 02/06/2015. The cross-validation set starts on 02/09/2015 and ends on 10/18/2016. The test set starts on 10/19/2016 and ends on 6/29/2018. This can be seen in Figure 15.

Figure 15: DJI prices split by train, validation, and test segments

For the hybrid models, data is split like before, but instead of feeding right into the model, features are first filtered with the GAs. The best-fit features are then used as part of the SVM or ANN to make predictions.
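The preparation steps described above (carrying monthly figures forward, building the PositiveGain label, min-max scaling, and a chronological split) can be sketched as follows. This is a minimal illustration with synthetic values and hypothetical column names, not the author's exact preprocessing code.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical daily frame of DJI closes plus a monthly economic series.
df = pd.DataFrame({
    "close": [24000.0, 24100.0, 24050.0, 24200.0],
    "gdp":   [20658.204, None, None, None],          # reported monthly
}, index=pd.to_datetime(["2018-09-04", "2018-09-05", "2018-09-06", "2018-09-07"]))

# Carry the latest monthly/quarterly figure forward until a new one arrives.
df["gdp"] = df["gdp"].ffill()

# Binary label: did the index close higher than the previous trading day?
df["PositiveGain"] = (df["close"].diff() > 0).astype(int)

# Min-max scale the features, then split chronologically (no shuffling).
features = ["close", "gdp"]
df[features] = MinMaxScaler().fit_transform(df[features])
train, test = df.iloc[: int(len(df) * 0.8)], df.iloc[int(len(df) * 0.8):]
print(train, test, sep="\n")
```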
3B.2) Algorithm Descriptions:

SVM: Support Vector Machines are a subset of supervised learning algorithms. Support Vector Machines achieve predictions by maximizing margins between classifications. This means that training examples are transformed onto a hyperplane that increases the distance from one class to another. On the optimal hyperplane, the training examples that are closest to the maximum margin are called the support vectors (see Figure 16). When data is linearly separable, a hyperplane separating the prediction classes can be represented with the equation:

y = w0 + w1x1 + w2x2 + ... + wnxn

In the equation, y is the outcome, the x's are the variable values, and the w's are the weights. The maximum margin hyperplane can be represented by:

y = b + Σ αi yi x(i) · x

Here y is the prediction value, x is a vector that represents an instance, x(i) and yi are the support vectors, · represents the dot product, and b and αi are parameters that represent the hyperplane and are calculated by solving a linearly constrained quadratic programming problem. When data is not linearly separable, a kernel is added to the equation:

y = b + Σ αi yi K(x(i), x)

One common kernel, the Gaussian radial basis function (RBF), is represented as follows:

K(x, y) = exp(-(x - y)² / δ²)

where δ² is the bandwidth of the Gaussian RBF kernel. The sklearn Support Vector Classifier was used to implement the algorithm. Through the package, a regularization parameter, C, determines how to fit the separator to the predicted classes. When C is higher, the line is fit to the data more closely; when it is lower, it is more linear. For this paper, C was set to a value of 1.0. The gamma parameter takes into consideration which points to use to calculate the line of separation. When the gamma parameter is set higher, only the points closest to the line of separation are used for calculations. When the gamma setting is set to a lower value, all points are considered. In this model, the value of gamma is set to (1 / number of features).

Figure 16: SVM on a 2-D plane showing the difference between strict and loose gamma values [23]

ANN: Artificial Neural Networks are a rough representation of how the brain works, with neurons and connections woven together in order to take a number of inputs and predict a classification for the input (see Figure 17). The neurons are units of calculation while the connections are weights to be applied to the next equation. Each neural network has three parts: the input layer, the hidden layers, and the output layer. The number of nodes in the input layer corresponds to the number of variables. In ANNs, the activation function of a node sets the output of the node given a set of inputs. The activation function used for the input and hidden layers is the ReLU function. The ReLU function is defined as:

f(x) = max(0, x)

The network starts with a random initialization. The activation rate is then found from the input layer and the hidden layer. An output is found using the softmax function. The softmax function is defined as follows:

softmax(x)i = e^(xi) / Σj e^(xj)

Figure 17: A representation of a neural network with two inputs, one hidden layer, and two outputs [24]

GA: Genetic algorithms start with an initialization. A gene in the case of a genetic algorithm is a combination of features. First, the "genes" of the algorithm are randomized and results are analyzed. If a feature or combination of features produces a correct result, it is marked as such. Combinations that are more successful are ranked higher than those that are less successful and given a higher value, which corresponds to a higher probability of selection. During the selection phase, each combination of features is given a probability and, instead of simply taking the best performing features from the training phase, the selection is based on this probability. This is how GAs avoid overfitting while still rewarding features that initially perform well. Selected combinations of features are then mixed and matched to give new feature combinations. Mutations can then be added in order to introduce another level of randomness. Final results are generated and the best feature combination is selected.

Figure 18: Flow chart illustrating the steps of a genetic algorithm [25]
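The GA + SVM idea described above can be sketched as a small loop over binary feature masks whose fitness is the validation accuracy of an SVC trained on the selected subset. This is only an illustrative sketch under assumed data shapes and hyperparameters, not the author's implementation; the dataset is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical stand-in for the scaled feature matrix and PositiveGain label.
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, shuffle=False)

def fitness(mask):
    """Validation accuracy of an SVC trained on the selected feature subset."""
    if not mask.any():
        return 0.0
    clf = SVC(C=1.0, kernel="rbf", gamma="auto")  # gamma = 1 / number of features
    clf.fit(X_tr[:, mask], y_tr)
    return clf.score(X_val[:, mask], y_val)

# Initialize a population of random feature masks ("genes").
pop = rng.random((12, X.shape[1])) < 0.5

for generation in range(10):
    scores = np.array([fitness(ind) for ind in pop])
    probs = scores / scores.sum()
    # Selection: sample parents in proportion to fitness.
    parents = pop[rng.choice(len(pop), size=len(pop), p=probs)]
    # Crossover: mix and match feature choices from pairs of parents.
    split = rng.integers(1, X.shape[1], size=len(pop))
    children = np.array([np.concatenate([parents[i, :s], parents[-i - 1, s:]])
                         for i, s in enumerate(split)])
    # Mutation: flip a few bits to keep exploring.
    children ^= rng.random(children.shape) < 0.05
    pop = children

best = pop[np.argmax([fitness(ind) for ind in pop])]
print("Selected features:", np.flatnonzero(best))
```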
4. RESULTS
The four models were fitted on train data and cross-validated on the preset cross-validation data. Once properly adjusted, the models were used on test data. The results for how each individual model performed on the test data can be found in Table 6. Accuracy, or (number of correct predictions / total number of predictions), was the metric used to evaluate the models. This was chosen for its simplicity and its application to the real world. If the target data was imbalanced, another metric such as precision or recall may have been chosen, but since the number of positive days and number of negative days were split almost evenly, accuracy worked well for this study. Baseline results were also collected and can be seen in Table 7. The baseline methods used were a random walk and a buy and hold strategy. The random walk method predicted random 1s and 0s (positive and negative days). The buy and hold strategy is the equivalent of predicting all positive days.

Model       Accuracy
SVM         0.513
ANN         0.507
GA + SVM    0.531
GA + ANN    0.522
Table 6: Model results

Baseline     Accuracy
Random Walk  0.501
Buy + Hold   0.521
Table 7: Baseline results

Based on the results, utilizing the genetic algorithm appears to have a positive effect on accuracy. In both cases, the models paired with the genetic algorithm outperformed the standalone SVM and ANN results. The GA combinations were also able to beat a buy and hold strategy, which the standalone models could not do. The GA + SVM model performed the best of the four machine learning models. The GA + ANN model was the second-best performer, followed by the two standalone models.

5. FUTURE WORK
There are many ways in which this, and many of the other papers out there, could be improved. First, the economic landscape is so vast that there are many variables that have not been used. Some examples of variables that may be able to improve a model include individual company data, more foreign macroeconomic data, text analysis from various reports or articles, and countless other sources.

Another area to explore is the addition of various other predictive algorithms. SVMs and ANNs were selected for this paper because of their popularity in previous papers. New ways to approach the problem could be to use various other classification algorithms such as decision trees, random forests, logistic regression, etc. Aside from the prediction method, a new approach could also be taken in terms of the prediction label. In this paper, market gains were defined as a binary positive gain. Similar approaches could be used in regression problems to try and predict continuous prices. The time constraint of the prediction label could also be modified. This paper works with single-day predictions, but longer and shorter time periods could also be explored. Finally, this paper focuses on price prediction. A next step could be to attempt to turn this into an actual trading system. This would most likely involve looking at 'PositiveGain' predictions and their probabilities and then analyzing hypothetical gains. This could also include taking into account things like trade prices, percent of gains, the timing of trades, etc.

6. CONCLUSION
The paper illustrates a few important points regarding machine learning, algorithm selection, feature selection, and predicting daily stock market directions. First, the paper shows that it is plausible, with the right combination of machine learning algorithms, to predict market direction. Next, the paper shows that with feature selection, model improvement is possible. In this instance, the feature selection algorithm was a genetic algorithm and it was paired with a support vector classifier and an artificial neural network. Results were compared between the standalone algorithms and the hybrids that included the feature selection. Finally, the results also showed that the SVM outperformed the ANN. While there are many more areas that need exploration, the paper lays out some key points for any future work.

REFERENCES
1. Matthieu Medhi Belarouci. 2012. The Relation between Technical Efficiency and Stock Prices: Evidence from the US Airlines Industry (1990-2010). SSRN Electronic Journal. DOI: http://dx.doi.org/10.2139/ssrn.2083651
2. Olivier Blanchard, Changyong Rhee, and Lawrence Summers. 1990. The Stock Market, Profit and Investment. DOI: http://dx.doi.org/10.3386/w3370
3. Nai-Fu Chen, Richard Roll, and Stephen A. Ross. 1986. Economic Forces and the Stock Market. The Journal of Business, 59, 3, 383. DOI: http://dx.doi.org/10.1086/296344
4. Rohit Choudhry and Kumkum Garg. 2008.
A Hybrid Machine Learning System for Stock Market Forecasting. World Academy of Science, Engineering and Technology International Journal of Computer and Information Engineering, 2, 3, 689–692. 5. Fernández-Rodrı́guez Fernando, Christian GonzálezMartel, and Simón Sosvilla-Rivero. 2000. On the profitability of technical trading rules based on artificial neural networks, Economics Letters 69, 1, 89–94. DOI:http://dx.doi.org/10.1016/s01651765(00)00270-6 6. W. Huang. 2004. Forecasting stock market movement direction with support vector machine. Computers & Operations Research. DOI:http://dx.doi.org/10.1016/s0305-0548(04)000681 7. Kyoung-Jae Kim and Ingoo Han. 2000. Genetic algorithms approach to feature discretization in artificial neural networks for the prediction of stock price index. Expert Systems with Applications, 19, 2, 125–132. DOI:http://dx.doi.org/10.1016/s09574174(00)00027-0 8. Kyoung-Jae Kim. 2003. Financial time series forecasting using support vector machines. Neurocomputing, 55, 1-2, 307–319. DOI:http://dx.doi.org/10.1016/s0925-2312(03)003722 9. Salim Lahmiri. 2011. A Comparison of PNN and SVM for Stock Market Trend Prediction using Economic and Technical Information. International Journal of Computer Applications, 29, 3 (September 2011), 24–30. 10. Blake Lebaron, W.brian Arthur, and Richard Palmer. 1999. Time series properties of an artificial stock market. Journal of Economic Dynamics and Control, 23, 9-10, 1487–1516. DOI:http://dx.doi.org/10.1016/s0165-1889(98)000815 11. Xiaodong Li et al.2014. Empirical analysis: stock market prediction via extreme learning machine. Neural Computing and Applications, 27, 1 (February 2014), 67–78. DOI:http://dx.doi.org/10.1007/s00521-014-1550-z 12. Anon. 2017. Support and Resistance. A Complete Guide to the Futures Market, June 2017, 91–108. DOI:http://dx.doi.org/10.1002/9781119209713.ch8 13. Jonathan L. Ticknor. 2013. A Bayesian regularized artificial neural network for stock market forecasting. Expert Systems with Applications, 40, 14 (2013), 5501–5506. DOI:http://dx.doi.org/10.1016/j.eswa.2013.04.013 14. Bruce Vanstone and Gavin Finnie. 2010. Enhancing stockmarket trading performance with ANNs. Expert Systems with Applications, 37, 9 (2010), 6602–6610. DOI:http://dx.doi.org/10.1016/j.eswa.2010.02.124 15. Ruud Weijermars. 2011. Price scenarios may alter gas-to-oil strategy for US unconventionals. Oil & Gas Journal (January 2011), 74–81. 16. Wing-Keung Wong, Meher Manzur, and Boon-Kiat Chew. 2003. How rewarding is technical analysis? Evidence from Singapore stock market. Applied Financial Economics13, 7 (2003), 543–551. DOI:http://dx.doi.org/10.1080/0960310022000020906 17. Lean Yu, Huanhuan Chen, Shouyang Wang, and Kin Keung Lai. 2009. Evolving Least Squares Support Vector Machines for Stock Market Trend Mining. IEEE Transactions on Evolutionary Computation13, 1 (2009), 87–102. DOI:http://dx.doi.org/10.1109/tevc.2008.928176 18. Justin Kuepper. 2018. What Is A Trading System? (March 2018). Retrieved October 22, 2018 from https://www.investopedia.com/university/tradingsyste ms/tradingsytems1.asp 19. Jean Folger. 2018. Guide to Pairs Trading. (February 2018). Retrieved October 22, 2018 from https://www.investopedia.com/university/guide-pairstrading/ 20. Bala Deshpande. 2013. When do support vector machines trump other classification methods?. (January 2013). Retrieved April 21, 2019 from http://www.simafore.com/blog/bid/112816/When-dosupport-vector-machines-trump-other-classificationmethods 21. Jahnavi Mahanta. 2017. 
Introduction to Neural Networks, Advantages and Applications. (July 2017). Retrieved April 21, 2019 from https://towardsdatascience.com/introduction-to-neural-networks-advantages-and-applications-96851bd1a207
22. Fernando Gomez and Alberto Quesada. Machine Learning Blog: Genetic algorithms for feature selection. Retrieved April 22, 2019 from https://www.neuraldesigner.com/blog/genetic_algorithms_for_feature_selection
23. Nicolas Panel. node-svm. Retrieved April 22, 2019 from https://www.npmjs.com/package/node-svm
24. Anon. Retrieved April 22, 2019 from http://neuroph.sourceforge.net/tutorials/MultiLayerPerceptron.html
25. Anon. Retrieved April 22, 2019 from https://www.hindawi.com/journals/mpe/2013/504895/fig3/

Stock Market Price Model Using Sentiment and Market Analysis
Justin Minsk
Department of Computing & Information Science, Mercyhurst University, Erie, PA, US
jminsk64@lakers.mercyhurst.edu

ABSTRACT
Zhang et al. [6] take social and news media together with market indicators and predict stock prices based on those indicators. All three of these data sources predict a set of stocks from the Hong Kong stock exchange. Instead of using information from Chinese sources, this paper uses United States sources, including Twitter, the Wall Street Journal, and Amazon stock indicators. Our goal is to examine whether the same ideas presented in [6] can be generalized to other stock markets, in particular whether these methods can be used to predict Amazon's stock price for each minute the stock market was open from December 12th, 2018 to January 22nd, 2019.

1 INTRODUCTION
Predicting the stock market is an ongoing research topic in academic and business sectors. The techniques and knowledge of stock market prediction have changed over time, from the idea that stocks follow a random walk [6] to models that can predict stock movement using deep learning techniques [1-3]. One of the most influential indicators has been found in news and social media sentiment analysis [5, 6, 8, 9]. While there are research papers combining multiple techniques [7], there is still more work to be done to combine deep learning techniques such as long short-term memory (LSTM) deep learning models [2] and sentiment analysis of news and social media [5, 6, 8, 9].
A review of some of the sentiment analysis techniques in conjunction with deep learning models can be fruitful for figuring out which techniques could be used and where new research could be continued.

Questions around which sources and features should be used and how to combine the sentiment analysis with traditional price predictors have been explored in recent research papers on stock market predictions [6]. Multiple tensors used for social media and news [6] combine sentiment analysis with traditional stock market price predictors. Zhang's model [6] mines Chinese social media and news, which have different properties compared to United States news and social media. Applying the ideas presented by Zhang et al. to United States media such as Twitter and the Wall Street Journal is a good way to test the generality of their model.

2 BACKGROUND AND RELATED WORK
2.1 Related Work
Recent work on stock market price prediction focuses more on sentiment analysis and less on historical prices and economic indexes [4-8]. Sentiment analysis is the act of taking text data and scoring that data. This can be done several ways, the most popular being positive and negative sentiment about a subject [5]. In stock market predictions, social media, such as tweets from Twitter [4, 5, 6, 8], and news articles [7, 6, 9] are commonly used to predict whether a stock's price will go up or down.

While stock market prediction has focused on sentiment analysis, the overall data has the characteristics of a time series problem. Papers addressing time series modeling are thus also relevant. A long short-term memory neural net (LSTM) [1, 2] or a gated recurrent network (GRU) [10] are two approaches that have been used to effectively model problems of this type. New stock market research has focused on event-based models [7], which take news or social media events and predict how they affect a stock's price. Sentiment analysis assigns a positive or negative sentiment score to a given piece of text based on whether a given topic is covered positively or negatively in that text. Deep learning is used to find the relation between sentiment and the stock price, allowing for regression and classification. Regression can be used to summarize and explain how a specific event affects a stock's price. Classification can be used to label stocks as either "buy" or "sell."

The most influential papers on US stock prices have focused on social media [4, 5, 8] or news articles [7]. Little has been done to combine both sources into a complex model that uses multiple news sources, social media platforms, and basic economic indicators. An example of a paper that takes multiple sources of data and combines them to predict a stock's price is [6]. This paper, however, is focused on the Hong Kong stock market and not the US stock market. It creates a framework for combining social media, a news source, and traditional economic metrics into one model. The idea is to create separate models for each source: one model for social media, one model for a news source, and a final model for the economic metrics. These models are referred to and treated as tensors [6]. The social media and news media tensors get combined into a tensor for qualitative or text data. The economic metrics data is combined into a quantitative tensor. These tensors, qualitative and quantitative, are fed into a tensor that outputs the final value or predicted stock price. The tensors for social media, news media, and economic metrics use variations of LSTMs and support vector machines. The output tensor, or blender tensor, uses a variation of a support vector machine.

The unique approach taken by this paper is to combine social media, news media, and economic metrics into one model designed for real-time predictions. Using the techniques for sentiment analysis of social media [4, 5, 8] and news media [7] with the method from [6] creates a combined model for the United States stock market and will eventually provide a reliable way to predict stock prices.
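The blender idea described above, where several sub-model outputs are combined by a support vector model into one final price, can be illustrated with a toy sketch. The data shapes, values, and model names are hypothetical; this is not the architecture from [6] or the author's implementation.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)

# Hypothetical per-minute price predictions from three separate sub-models.
twitter_pred = rng.normal(1650, 5, size=200)   # social media tensor output
news_pred    = rng.normal(1652, 5, size=200)   # news media tensor output
market_pred  = rng.normal(1651, 3, size=200)   # economic/market tensor output
true_price   = rng.normal(1651, 4, size=200)   # observed AMZN minute price

# Blender: stack the sub-model outputs and regress the final price with an SVM.
X = np.column_stack([twitter_pred, news_pred, market_pred])
blender = SVR(kernel="rbf").fit(X[:150], true_price[:150])
print("Blended prediction for the next minute:", blender.predict(X[150:151])[0])
```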
Gated recurrent networks (GRU) are similar to LSTMs in the sense that they solve RNN’s vanishing gradient problem [10]. LSTMs have multiple gates and are composed of multiple complicated algorithms such as input, forget, and output gates. GRUs, which were created after LSTMs, only use reset and update gates. While GRUs are similar to LSTMs they are simpler, however. Neither GRUs or LSTMs are necessarily better than each other. 2.2 Amazon.com, Inc. 2.6 Support Vector Machines The online bookstore turned technology company, Amazon.com, Inc. (AMZN) was chosen to research due to a general interest in the technological field and the ability to collect Twitter data about the company. Since Amazon encourages online customer reviews and interaction on online public platforms, it was assumed that massive amounts of data could be easily collected and applied to stock market prices. Amazon is constantly expanding to relatively new business environments with services such as cloud services, Prime shipping, streaming services and their own product lines in both electronics and consumer products. This company holds special interest since they have a wide range of products and services, that are closely connected to their customer base, appeal to all demographics of consumers, and they have had major fluctuations in stock prices since their founding. Hopefully, if a model can be applied to a diverse company such as Amazon, the same model could eventually be applied to more specialized companies in the future. 2.3 Recurrent Neural Networks Recurrent neural networks (RNN) are important to time series analysis because of their ability to keep temporal behavior. This behavior is saved because of the interconnection of each node in a layer [2]. RNNs contain hidden layers that update their weights after a set of time. There can be any number of hidden layers that increase the complexity of the RNN. An RNN consists of an input layer, a number of hidden layers and an output layer. However, RNNs have a vanishing gradient problem which makes longer time series data sets much less accurate and they lose information over time. This loss of information makes long term trends disappear in the training phase of the RNN model. Support vector machine (SVM) models are classic classification machine learning models. They use the concept of finding the line that divides the data. This line, unlike linear regression, is designed to be as far away from the clusters of different data as possible. This allows for more general models to be compared to other classic classification models. SVMs can be used in both classification and regression tasks. In a regression task, the logic is similar to classification, but instead the logic is used to make a line of best fit. Since we have three different predictions being fed into a blender model, there needs to be a way to predict the final value. A SVM would work perfectly for this task. 3: DATA COLLECTION 3.1 Twitter Data Twitter data was collected from December 12th, 2018 to January 22nd, 2019. A grand total of 23,585,132 tweets were collected during this time, containing the words “amazon” and “amzn.”. The data was appended to average AMZN price for each minute the market was open between those two dates. Tweets were combined to form one string for each minute and a count was also collected. Term frequency– inverse document frequency score was used to weigh words in terms of frequency and importance within each document and then rank the documents within the collection. 
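The TF-IDF weighting described in this section can be reproduced with scikit-learn. The sketch below is illustrative only; the per-minute strings are made up, and the 1-4 gram range and 75,001-feature cap simply mirror the setup described above rather than the author's actual code.

# Illustrative sketch of the TF-IDF step described above (not the author's original code).
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical input: one concatenated string of tweets per trading minute.
minute_docs = [
    "amazon amzn earnings beat expectations",
    "amazon prime delivery late complaint amzn",
    "amzn stock rally amazon cloud growth",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 4),   # 1-4 grams, as described for the Twitter data
                             max_features=75001)   # cap the vocabulary at 75,001 features
X = vectorizer.fit_transform(minute_docs)          # sparse matrix: minutes x weighted n-grams
print(X.shape)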
3.2 Wall Street Journal Data
Articles suggested by the Wall Street Journal as related to Amazon were collected between December 12th, 2018 and January 22nd, 2019. Article text data was appended forward in time for each minute of prices during open stock trading until a new article was written. This allows Wall Street Journal data to continue to influence the price. The variable "time since article posted" was added to capture the difference between an article just posted and an article that was posted a few minutes, hours, or days ago. A term frequency-inverse document frequency score was used, which assigns larger weights to words that appear in fewer documents but are used multiple times within a document or article. This enumerates the data and helps models find important words. In the end, 2-5 grams of words were used and 75,001 features were used from the data.

3.3 Stock Market Data
Price and stock indicators include: open time, open price for trade day, close time, previous close price for trade day, high price for trade minute, low price for trade minute, latest price, latest update, latest time, latest volume, delayed price time, delayed price, extended price, extended change, extended change percent, previous close, change in price, percent change, market cap, week 52 high, week 52 low, and year to date change. These variables are collected from IEX's API and contain every minute of open trading price data for each day of the selected time period.

4: MODEL FORMULATION
4.1 Basic Model Building Blocks
Each of the models (IEX, Twitter, and Wall Street Journal) uses similar building blocks since they are trying to solve the same problem: predicting the next minute's average stock price. The temporal nature of the data needed to be maintained. Batches start at a random time and contain a set number of steps, or a set window into the future. This way, time sequences are trained on the model, but different chunks of data are trained at different times. Another difference from the base model is that the mean squared error loss was not calculated for the whole validation sequence. The first 20 steps of the sequence are skipped, since the model starts at a random point and only then starts to make correct predictions; skipping the first 20 steps keeps that random starting point from influencing the loss. The last layer is always a dense layer that is used to output a single predicted value, namely, the average price. The learning rate was reduced during learning to help find the minimum loss.

4.2 IEX Model
Multiple configurations of deep learning models were used to attempt to predict the next minute's average stock price for AMZN. Five models were run and the validation loss was compared to decide which model performed the best, as shown below.

Model                                      Validation Loss (Mean Squared Error)
Gated Recurrent Network (GRU) - Dense      3.3585554774617776e-05
Long Short-Term Memory (LSTM) - Dense      4.118878860026598e-05
GRU - GRU - Dense                          2.2447573428507894e-05
GRU - Dropout - GRU - Dropout - Dense      0.00045545812463387847
LSTM - LSTM - Dense                        0.00016301091818604618

The best model, GRU - GRU - Dense, had the lowest validation loss.
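As an illustration of what a GRU - GRU - Dense architecture of the kind listed in the table might look like, here is a minimal Keras sketch under stated assumptions: the layer sizes, window length, and feature count are placeholders, not values reported by the author.

# Minimal sketch of a GRU - GRU - Dense regression model (layer sizes are assumptions).
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dense

window, n_features = 30, 22   # hypothetical window length and number of IEX indicators

model = Sequential([
    GRU(64, return_sequences=True, input_shape=(window, n_features)),
    GRU(32),
    Dense(1),                 # single output: next minute's average price
])
model.compile(optimizer="adam", loss="mse")

# Dummy arrays with the right shapes, just to show the call pattern.
X = np.random.rand(256, window, n_features)
y = np.random.rand(256, 1)
model.fit(X, y, epochs=2, batch_size=32, verbose=0)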
4.3 Twitter and Wall Street Journal Models
Both the Twitter and Wall Street Journal models are structured the same due to financial constraints. They both have one GRU layer that leads to the final Dense layer. The Twitter model had a validation loss of 0.11920106410980225. The Wall Street Journal model had a validation loss of 0.10996559262275696.

4.4 Models Performance
The IEX model was the only model with acceptable performance. The Twitter and Wall Street Journal models try to predict an average and do not seem to pick up on the temporal nature of the data.

4.5 Ensemble Model
Multiple machine learning models were tested to blend the predictions from the IEX, Twitter, and Wall Street Journal models. The models used were the Support Vector Regressor (SVR), Decision Tree Regressor (DTR), Random Forest Regressor (RFR), and Gradient Boosting Regressor (GBR). Cross validation was run for each model and R-squared scores were used to pick the best ensemble model. The model with the best R-squared cross validation score was the SVR.

5: CONCLUSION
The only model that performed acceptably was the IEX model. Even with the IEX model as a factor in the ensemble model, the model seems to only pick up on an average value. This does not prove that the methodology from [6] does not work on US sources, but it may indicate that the problem of predicting stock values becomes harder to model with scale, or that there wasn't enough information in the sources we collected to predict the stock we chose, namely Amazon.

5.2 Future Work
Both the Twitter and Wall Street Journal sentiment analysis models need to be improved before becoming useful. Looking at papers [4-7] that used a single source to predict a stock's price, those sentiment analysis models still performed well. The models presented within this paper performed poorly for a variety of possible reasons. The first possibility is that a lack of compute power inhibited the processing of such a large corpus of text. A second theory is that, during the period of data collection, social media and news media were not reflecting the AMZN market value. A third theory is that better data engineering is needed to eliminate some of the noise that the data certainly contained. Whatever the reason, standard sentiment analysis techniques were used and the models did not perform as well as intended. The next clear step is to improve the text models; one method could be to move from regression to classification. Instead of predicting the price, predict whether the price will go up, down, or stay the same. These classifications could still be used with the ensemble model to create a better regression model.

References
[1] Martin Längkvist, Lars Karlsson, and Amy Loutfi 2014. A review of unsupervised feature learning and deep learning for time-series modeling. Pattern Recognition Letters 42, 1, 11-24. DOI: https://doi.org/10.1016/j.patrec.2014.01.008
[2] Fazle Karim, Somshubra Majumdar, Houshang Darabi, Shun Chen 2017. LSTM Fully Convolutional Networks for Time Series Classification. IEEE Access 6, 1662-1669. DOI: 10.1109/ACCESS.2017.2779939
[3] Spyros Makridakis, Evangelos Spiliotis, Vassilios Assimakopoulos 2018. Statistical and Machine Learning forecasting methods: Concerns and ways forward. PLoS ONE 13, 3, e0194889. DOI: https://doi.org/10.1371/journal.pone.0194889
[4] Chung-Chi Chen, Hen-Hsen Huang, Hsin-Hsi Chen 2018. NTUSD-Fin: A Market Sentiment Dictionary for Financial Social Media Data Applications. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). ELRA.
[5] Johan Bollen, Huina Mao, Xiaojun Zeng 2010. Twitter mood predicts the stock market. Journal of Computational Science 2, 1-8. DOI: https://doi.org/10.1016/j.jocs.2010.12.007
[6] Xi Zhang, Yunjia Zhang, Senzhang Wang, Yuntao Yao, Binxing Fang, Philip S. Yu 2018. Improving Stock Market Prediction via Heterogeneous Information Fusion. Knowl.-Based Syst. 143, 236-247. DOI: 10.1016/j.knosys.2017.12.025
[7] Paul C. Tetlock 2005. Giving Content to Investor Sentiment: The Role of Media in the Stock Market. Journal of Finance 62, 1139-1168. DOI: http://dx.doi.org/10.2139/ssrn.685145
[8] Mengmeng Wang, Wanli Zuo, Ying Wang 2015. A Multilayer Naïve Bayes Model for Analyzing User's Retweeting Sentiment Tendency. Computational Intelligence and Neuroscience 2015, 510281. DOI: http://doi.org/10.1155/2015/510281
[9] Nesreen Ahmed, Amir Atiya, Neamat Gayar, Hisham El-Shishiny 2010. An Empirical Comparison of Machine Learning Models for Time Series Forecasting. Econometric Reviews 29(5-6), 594-621. DOI: 10.1080/07474938.2010.481556
[10] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, Yoshua Bengio 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. CoRR abs/1412.3555.

About the author: Justin Minsk is a Graduate Student at Mercyhurst University.

Connecting People: Psychology and Machine Learning
Praveen Kumar Neelappa
Data Research and Analytics Department, Uhisi Data Solution, Bengaluru, India / Toronto, Canada
pnkumar@outlook.com

among individuals with a similar personality. This method is by far a huge contrast to what current social media platforms adopt.[6]

Abstract— This paper tries to connect two major fields of science, machine learning and psychology, to find an approach to connect people. The method was created using an unsupervised machine learning algorithm, K-means clustering, to identify traits and personality in people, group them together, and use an Android application to give customized recommendations for becoming friends or finding dates or partners. The advantage of this method is that it uses machine learning techniques to learn from data and automate the process of connecting people using their personality and not their appearance. The limitation of this method is that it was tested using a small sample of 100 students, all of whom were in the age group of 17-24, which does not represent the whole population. Further research is required.
The mental, emotional, and behavioral characteristics pertaining to a specific person will be scrutinized and a connection will be initiated between such a person and another individual who exhibits similar psychology.[7] With this method of connecting individuals, the problem of incompatible connections which some social media users experience will be squashed.[8] The algorithm or method which will be developed using psychology and machine learning will be transformed into an app or integrated into a product.[9] This will, unlike other methods utilized by social media platforms, connect individuals not only through their basic information but in a broader scope or perspective. We also want to use a machine learning algorithm to connect people (using psychology) from diverse cultures who share similar personality.[10] Through the concept of this research which will be utilized in developing an application or a product, individuals can find compatible friends, partners, or dates.[11] 9. Introduction Connection is key in a world occupied by a population of about 7.7 billion people,[1] all of them with a diverse culture, psychology, and personality. Individuals crave for interactions with people who bear similar perceptions, instincts, psychology, and culture in the outside world. Albert Einstein stated that; "A human being is part of the whole, called by us 'Universe,' his thoughts and feelings as something separated from the rest, a kind of optical delusion of his consciousness. This delusion is a prison, restricting us to our personal desires and affection for a few persons close to us. Our task must be to free ourselves from our prison by widening our circle of compassion to embrace all humanity and the whole of nature and its beauty."[2] This quote by Albert Einstein clearly depicts the urgent need for effective connection amongst individuals with a similar personality. Without the proper means, developing system or algorithm which effectively connects people with similar attribute could be hard. For instance, Facebook connects diverse individuals by utilizing basic information like geographical location, preferences, age, pictures, friends, etc.[3] This concept utilized by Facebook and other social media platforms tends to connect individuals who exhibit a similar demography.[4] 10. BACKGROUND There have been several research techniques to resolve the problem of how to connect people from diverse parts of this world.[12] The main aim of these researches, either ongoing or completed is to foster a better relationship between individuals by matching people with similar personality and traits.[13] The social media giant, Facebook, utilizes a clever algorithm to connect individuals (via friend suggestion) of similar personality, location, instincts, etc.[14] Aside from this research which is focused on connecting people using psychology and k-means clustering machine learning algorithm, there are other research works out there with a similar aim but a somewhat diverse approach.[15] Some of these research work utilizes a different approach to resolve the issue of connecting individuals with similar personality or traits. 
One of this method is the link prediction method.[16] In this research, we are trying to develop a somewhat similar but unique approach to connecting individuals all over the world.[5] This research aims to use psychology and machine learning by utilizing a few concepts from psychology to help optimize worldwide connection The link prediction method has been an important research topic for many years. This approach to connecting individuals tends to predict social 6 Second Annual International Great Lakes Data Science Symposium Praveen Kumar Neelappa May 4-5, 2019, Erie PA, United State of America connections between users. It does this by utilizing the common neighbor prediction method.[17] This method assumes that two nodes(individuals) with several similar neighbors will have a future connection. This link prediction method utilizes three matrices, the timevaried weight, the change degree of common neighbor and the intimacy between common neighbors. This approach of connecting individuals puts several factors into consideration before making its prediction which is always reliable given right pieces of information. One of the factors considered by this approach is how many common features (e.g. common hobbies, age, tastes, geographical locations) the two individuals share.[18] This consideration is used to measure the likelihood of links between the two individuals. Since this approach utilizes information provided by the nodes to make an accurate prediction, in the case whereby information about the nodes are inaccessible and unreliable due to perhaps privacy policy, the approach will suffer.[19] In overcoming this problem, we suggest that a more basic approach is used in obtaining information about each node. For instance, a system or model which utilizes little information of the nodes to make predictions should be developed.[20] Individuals will likely not give out some personal information about themselves for security reasons. So, another approach should be adopted in obtaining this information about individuals to make the prediction one that yields good results.[21] Also, another similar research to ours can be observed in that of professor C.V.Longani of the SRES's college of engineering, Kopargaon, India. According to professor C.V Longani, the friend suggestion system as stated in his research uses the lifestyle of users to suggest friends, instead of a social graph. The lifestyle between the users and the friend matching graph is drawn.[22] This graph is generated in a tabular form and it's used to discover which users are more similar. Based on the friend matching graph, users are recommended and connections are initiated between two users of a similar personality.[23] The lifestyle of the user can be determined from the user's daily activities. This approach to connecting people utilizes factors like the habit, attitude, taste, moral standards, and economic level of people to connect individuals. By connecting individuals based on their lifestyle, a compatible union is achieved, be it in terms of friendship, partnership, etc.[24] But this approach comes with a defect, because the lifestyle is dynamic, and some individuals are good at portraying fake lifestyles. Matching an individual with someone with a fake lifestyle could result in an incompatible or failed connection. Further research is required in this aspect.[25] 7 11. 
EXPERIMENTAL STUDIES
11.1 Research Question
Is it possible to create a method to collect data and use a machine learning algorithm to connect people who have similar interests and psychological behavior?

11.2 Variables
Personal Information - The basic personal information of the candidates is collected in the first round of screening. Afterward, the algorithm is utilized to screen the provided information for recommendations. This basic information is used to search for the best match for individuals with a similar trait, personality, lifestyle, psychology, etc.
Age - A discrete or continuous variable which is essential because most established connections or relationships depend on the age factor.
Sex - Categorical (male, female, others).
Interest - Categorical (friendship, partnership, dating, relationship).
Sexual status - Categorical. An essential variable that must be considered before the connection is initiated. Sexual status of diverse individuals includes heterosexual, homosexual, bisexual, and transgender.
Other personal information obtained is not used as data for the machine learning algorithm.
Psychology Test - This test is directed at assessing the candidate's behavior, cognitive abilities, personality, and several other domains. Questions are posed to the candidates to test their basic psychology. Ten psychological questions are created. The candidates are required to pick an answer from the options 1-5, where option 1 denotes that they strongly disagree and option 5 denotes that they strongly agree with the question or scenario presented to them.
Personality Test - Questions are created to properly test the personality of the candidates. This test is carried out so as to extract the extrinsic and intrinsic personalities of each candidate. The extracted qualities will then be transformed into data for the machine learning algorithm. Ten personality-related questions are presented to the candidates. For each of the ten questions, the candidates are required to pick an answer from the options 1-5, where option 1 denotes that they strongly disagree and option 5 denotes that they strongly agree with the question or scenario presented to them.
The Subject of Interest - Various subjects of interest are provided as options to choose from. These subjects of interest range from music, food, art, entertainment, politics, sport, literature, movies, technology, etc. To extract information about the candidate's subjects of interest, they are required to select three interesting topics, after which five questions are created from the selected topics. The candidates are required to pick an answer from the options 1-5, where option 1 denotes that they strongly disagree and option 5 denotes that they strongly agree with the question or scenario presented to them.

Hypothesis
People with a similar interest and psychological score should be comfortable communicating and developing a friendship.

11.3 Data Collection
Data collection can be described as the means of collecting and measuring information on targeted variables in an established system, which then helps one to answer important questions and evaluate results.[26] The aim of all data collection regarding this research is to obtain quality evidence that permits analysis to produce effective and real answers to the questions that have been asked.
Regarding this research, data were collected through an online survey using Google Sheets. 100 participants were invited to be part of the data collection process and a post-result discussion. Most of the participants were students and were the targeted audience for the method. The only information given to the participants was that they were part of a survey, keeping the whole data collection single-blinded. The questions were administered online and there was no subjective influence from our side on the outcome. Every participant was given a unique identification number and was identified through their mail ID.[28] Follow-up questionnaires were sent to the participants, from which the results and conclusions of this paper have been derived.

12. DATA & METHOD
Data can be defined as any series of one or more symbols yielding meaning by a certain act of interpretation. For data to become information, it requires interpretation.[29]
- After the necessary data collection, the final data was made up of 100 rows and 40 columns. All the data was converted to integers. Python was used as the programming language, and the unsupervised machine learning algorithm K-means clustering was used to cluster similar users together.
- K-means clustering is one of the basic and widely known unsupervised machine learning algorithms. Normally, unsupervised algorithms make deductions from datasets utilizing only input vectors, without considering known or labeled outcomes.[30] A cluster is a collection of data points assembled together because of specific relations. You describe a number "k", which refers to the number of centroids you require in the dataset.[31] A centroid is the location depicting the middle of a cluster. That is to say, the k-means algorithm selects "k" centroids and then matches each data point to the closest cluster while keeping the centroids as small as possible.[32] The "means" in k-means refers to discovering the centroids.

HOW THE K-MEANS ALGORITHM WORKS
To prepare the learning data, the k-means algorithm begins with a first group of randomly picked centroids, which are utilized as the starting points for each cluster, and then executes repeated calculations to optimize the positions of the centroids. It stops creating and optimizing clusters when either: 1) the centroids are stable and there is no alteration in their values because the clustering has been accomplished, or 2) the selected number of iterations has been reached.[33] A brief illustrative sketch of this procedure is given after the results below.

13. RESULT AND DISCUSSION
After running the data through the K-means clustering algorithm, we found k = 5 to be the best number of clusters to form, using the elbow method. With this, the 100 participants were divided into 5 groups as below:

Group    No. of Participants
1        24
2        12
3        8
4        17
5        39

Fig. 1. Groups with the number of data points around the centroids.

We sent the results to each participant, recommending the people who were part of their group as friends to talk to, and below were the observations:
- 23% of participants came back with information that they were already friends with the participants who were recommended to them.
- 7% of participants informed us that their friends were not recommended to them.
- 80% of participants informed us that they had a pleasant conversation with the recommended friends.
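To make the clustering procedure described in the Data & Method section concrete, here is the minimal sketch referred to above: K-means with an elbow-style inspection of the within-cluster sum of squares. The survey matrix is random placeholder data, not the study's actual responses.

# Illustrative K-means clustering with an elbow-method check (placeholder data).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
responses = rng.integers(1, 6, size=(100, 40))   # hypothetical 100 participants x 40 integer answers

# Elbow method: inspect inertia (within-cluster sum of squares) for several values of k.
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(responses)
    print(k, km.inertia_)

# Fit the chosen model (k = 5 in this study) and read off group assignments.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(responses)
print(np.bincount(labels))   # number of participants per group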
The group with less than 20 members showed more positive result toward algorithm working. In a real-life scenario, the sample size will be much higher. Second Annual International Great Lakes Data Science Symposium Praveen Kumar Neelappa May 4-5, 2019, Erie PA, United State of America 14. CONCLUSIONS The result obtained although seems promising that the model can be used to connect people. There are some issues with the method 1. 2. 3. 4. 5. 6. 7. The way the data was collected and analyzed can lead to subjective bias The participant might have not given 100% on the result as it was incentive based result The sample size used to collect data was small and does not represent the entire population Since most of the participant were from India, there might be cultural bias introduced in the result The result might change if the population from all ages are used in the experiment The criteria, scenario, and question used to measure psychological and personality trait might alter result if reframed or different question were used Should use other machine learning and deep learning algorithm and comparison must be made with the result Although there are limitations, the usefulness of the method to create an application which uses psychology and machine learning to connect people can be created. But the result cannot be generalized, and further data collection and research is needed. 14.1.1.1.1 REFERENCES [15] https://populationeducation.org/sites/default/files/the_people_c onnection.pdf [16] http://www.elise.com/quotes/einstein__a_human_being_is_part_of_the_whole [17] https://www.impactbnd.com/blog/the-difference-betweenfacebook-twitter-linkedin-google-youtube-pinterest [18] https://www.cs.ucsb.edu/~ravenben/publications/pdf/interactio n-eurosys09.pdf [19] http://interactions.acm.org/archive/view/january-february2019/beyond-generalization [20] https://mcb.unco.edu/students/ets-resources/ETS-MarketingStrategy-Review.doc [21] http://www.ilocis.org/documents/chpt5e.htm [22] https://www.edge.org/responses/how-is-the-internet-changingthe-way-you-think [23] https://searchenterpriseai.techtarget.com/definition/AIArtificial-Intelligence 9 [24] https://towardsdatascience.com/machine-learning-vs-deeplearning-62137a1c9842 [25] https://www3.nd.edu/~ghaeffel/OnineDating_Aron.pdf [26] https://ctb.ku.edu/en/table-of-contents/culture/culturalcompetence/building-relationships/main [27] https://courses.lumenlearning.com/boundlessmanagement/chapter/defining-leadership/ [28] https://www.recode.net/2016/10/1/13079770/how-facebookpeople-you-may-know-algorithm-works [29] https://towardsdatascience.com/k-means-clustering8e1e64c1561c [30] https://buffer.com/resources/how-the-big-five-personalitytraits-can-help-you-build-a-more-effective-team [31] Lin Yao, Luning Wong, Lv pan, Kai Yao: Link prediction based on common-neighbors for dynamic social network. The 7th international conference on ambient system, neutrons, and technology (ANT 2016). [32] Ton Wang, Xing-Sheng He, Ming-Yang Zhou, and ZhongQian Fu: Link prediction in evolving networks based on popularity of nodes. 
Scientific report 7, Article number: 7147,2017 [33] https://link.springer.com/article/10.1007/s10940-014-9235-4 [34] https://machinelearningmastery.com/how-to-configure-thenumber-of-layers-and-nodes-in-a-neural-network/ [35] https://www.nap.edu/read/1864/chapter/4 [36] https://www.ijedr.org/papers/IJEDR1603037.pdf [37] https://www.wordstream.com/blog/ws/2016/09/28/generational -marketing-tactics [38] http://www.pondiuni.edu.in/storage/dde/downloads/markiii_cb. pdf [39] https://www.academia.edu/2194220/Media_culture_Cultural_st udies_identity_and_politics_between_the_modern_and_the_po stmodern [40] http://www.fao.org/3/x2465e/x2465e09.htm [41] https://www.researchgate.net/publication/13370611_Qualitativ e_Research_Methods_in_Health_Technology_Assessment_A_ Review_of_the_Literature [42] https://hbr.org/2002/02/getting-the-truth-into-workplacesurveys [43] https://en.wikipedia.org/wiki/Conditional_probability [44] https://machinelearningmastery.com/supervised-andunsupervised-machine-learning-algorithms/ [45] https://mineracaodedados.files.wordpress.com/2012/07/datamining-in-excel.pdf [46] http://szeliski.org/Book/drafts/SzeliskiBook_20100903_draft.p df [47] https://towardsdatascience.com/understanding-k-meansclustering-in-machine-learning-6a6e67336aa1 . 10 M. A. Upal (editor) Medical Brain Drain: The Relationship between Regulation and Emigration Kimberly Staudt Department of Computing & Information Science Mercyhurst University, Erie, PA United States Kstaud85@lakers.mercyhurst.edu 2. BRAIN DRAIN AND ECONOMICS ABSTRACT The effects of regulation on medical brain drain have few available studies using data analysis. This paper reports experiments with two models: a hierarchal linear model and an OLS linear model. The former model is constructed using the Python packages linearmodels with random effects, and Sklearn’s linear_model. A variance component, or random effects, model is employed on the panel data to correct for missing values, and to account for unobserved heterogeneity. Heterogeneity assumes that all agents are unique. The results showed a strong positive correlation between GDP per Capita and migration, while regulation was insignificant. Infant mortality was found to be negatively correlated with migration. Keywords Migration, brain drain, regulation, medicine, regression 1. INTRODUCTION TO BRAIN DRAIN AND MEDICAL REGULATION What is the role of regulating technology in aiding or preventing brain drain? This paper seeks to analyze the relationship between the regulation of medical technology and brain drain. Brain drain is defined as the exodus of intelligence from a country or a region. This paper defines a migrating individual’s birth and receiving country as the exodus and host country, respectively. The importance of studying this issue is that brain drain limits the opportunity for the exodus country to increase innovation and economic output. It affects every global citizen, since brain drain can aid either economic prosperity or decline. Decreasing brain drain aids exodus countries to retain their intellectual reserve and boosts their economy. Research that exposes the relationship between medical regulation and brain drain is limited. This study seeks to fill this gap by utilizing a random effects, or a variance component, model. Both topics are important enough economic and social welfare issues, where big data analysis provides the opportunity to analyze the medical brain drain dilemma. This study uses regression methods on the panel data. 
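As a rough illustration of the modeling approach just described (a random effects panel model via the linearmodels package alongside an ordinary least squares fit with scikit-learn), the sketch below uses made-up column names and placeholder data; it is not the author's code, and the variables are assumptions chosen to mirror the study's description.

# Illustrative sketch: random effects panel model plus a plain OLS fit (placeholder data).
import numpy as np
import pandas as pd
from linearmodels.panel import RandomEffects
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical panel: (country, year) index with made-up values.
idx = pd.MultiIndex.from_product([["Algeria", "Canada", "India"], range(2005, 2017)],
                                 names=["country", "year"])
rng = np.random.default_rng(1)
df = pd.DataFrame({"doctor_stock": rng.normal(size=len(idx)),
                   "gdp_per_capita": rng.normal(size=len(idx)),
                   "infant_mortality": rng.normal(size=len(idx))}, index=idx)

# Random effects (variance components) model on the panel data.
re_res = RandomEffects(df["doctor_stock"], df[["gdp_per_capita", "infant_mortality"]]).fit()
print(re_res.params)

# Plain OLS with a 70/30 train/test split, as described in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    df[["gdp_per_capita", "infant_mortality"]], df["doctor_stock"], test_size=0.3, random_state=0)
ols = LinearRegression().fit(X_train, y_train)
print(ols.score(X_test, y_test))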
Physician migration data is limited to the public, but more readily available to professional agencies. This study uses 884 observations across 76 countries from 2005 to 2016. Countries include, but are not limited to, Algeria, Canada, India, and the United States. For example, Algeria had a total of 3,083 doctors in foreign countries and a high infant mortality rate of 22.4% in 2012.
The promise of economic prosperity is what often drives an intellectual away from the exodus country and into their host country. An example of a model host country is the United States (U.S.), which has held the spot of #1 host country for over 14 years [8]. Much of this appeal is due to the U.S. having the highest nominal GDP. The United States Central Intelligence Agency (C.I.A.) reports American Samoa as having the lowest net migration rate, meaning that it loses more citizens than any other country [6]. The question here is whether this emigration can be classified as brain drain or not. If the negative net migration is due to the exodus of low-skilled workers, then it does not fit the definition of brain drain.
A great threat to the exodus country is the lack of incentive to diversify the fields of study offered to its migrating students. Docquier and Rapoport [2012] describe the resulting market oversaturation by stating, "…brain drain distorts the provision of public education away from internationally transferable education (e.g., exact sciences, engineering, economics, medical professions) and towards country-specific skills (e.g., law), with the source country possibly ending up training too few engineers and too many lawyers;" [9]. The authors then propose monetary penalties for intellectuals who participate in brain drain as an incentive to return to the exodus country. However, this may have little effect in the prevention of brain drain, especially concerning the medical community. For the intellectual who gains more economic opportunity by immigrating to the host country, a brain drain tax would give them little incentive to return. This still leaves the exodus country with one less intellectual. Unfortunately, the issue becomes more complicated when migration is the result of political or social issues, such as war.
Since this paper focuses on the economic regulation of medicine and medical technology, the migration data will be more specific to the medical field. This study uses migration data from high-skilled workers as a proxy variable for brain drain. Here, high-skilled workers are defined as individuals who have completed tertiary education.
A great threat to medical technology is monetary investment withheld because of over-regulation. Countries that have greater regulation see a decrease in financial investment. In Medical Devices: Lost in Regulation, Citron [2012] states, "A slow but inexorable process of added regulatory requirements superimposed on existing requirements has driven up complexity and cost and has extended the time required to obtain device approval to levels that often make such investments unattractive. It must be noted that the market for many medical devices is relatively small.
If the cost in time and resources of navigating the regulatory process is high relative to the anticipated economic return, the project is likely to be shelved” [7].This constitutes both short-run and long-term costs at the expense of both patients and medical personnel. For the aspiring medical student, the appeal to migrate to a country with more robust medical technology is greater than to remain in a country with poor medical development. 3. RELEVENT WORK While brain drain and the regulation of medicine and its technologies have been researched, fewer studies have been conducted concerning the relationship between the two topics. Most research on these topics uses a combination of survey data and research techniques at best when analyzing the data. The importance of this study lays in analyzing a hypothesized connection between regulation and the emigration of medical personnel. Understanding the history of the two aids in decision making for both public policy and healthcare. Brain drain, the emigration of an intellectual from an exodus country to a host country, negatively affects the exodus country’s economy. Some of these individuals include medical professionals. 3.1 Brain Drain & Medical Retention Regulation of medical technology has a negative impact on medical staff as well as patients. Lofters & Slater surveyed nearly 500 physicians who emigrated because of political/economic issues and underdeveloped healthcare. This results in a lower supply of professionals in exodus countries. The surveyed data showed that many physicians who emigrated were in their thirties and from South Asia [11]. Asian exodus countries struggle to retain doctors, while developed countries, such as the United States, have an oversaturation of doctors. Additionally, many doctors in the medical field specialize in a specific practice. Specialization allows all professionals to focus on specific tasks to optimize output. This also narrows down employment prospects to a finite number of positions, creating competition between the emigrant medical personnel and the natural born citizen physicians. Competition can increase the incentive for both parties to provide optimal healthcare but is less desirable for the individual and their income. Competition drives down salaries. This is expected in an oversaturated medical market due to lower bargaining power. It can be deduced that the emigrant doctors coming from exodus countries must have a greater incentive to leave than to stay. This explains in part as to why developing, low-income per capita countries have higher brain drain than wealthier nations. Holding other privileges constant, the host countries offer a higher income than the exodus country, while simultaneously underpaying physicians. The goal of Lofter & Slater’s study was to survey the physicians and use descriptive statistics to influence public-policy. Their research considers political and sociological factors that are thought to be important based on theoretical consideration, rather than automatically discovering such factors using machine learning. This is just one of the many studies that use descriptive 11 statistics to consider a limited number of factors, while the approach of using machine learning to look at a broader set of factors to automatically identify important factors seems to be neglected. 3.2 Brain Drain & Medical Care Quality The economic cost to the exodus country is in the millions. Many medical students study in the exodus country before emigration. 
Research notes that, “Low‐resource nations spend US$ 500 million each year to educate health workers who leave to work in North America, Western Europe, and South Asia” [Serour, 2009, 7]. Ergo, can be inferred that this yearly deficit negatively affects the low-income country’s economy. Historical data also reports a decrease in healthcare quality for patients who remain in these countries. He concludes that there is a positive correlation between brain drain and decreased quality of life and health service in these exodus countries [Serour,2009,7]. Organizations such as the World Health Organization (WHO) have several goals and steps to prevent health and income issues that plague developing countries [12]. However, there are few proposed solutions to solving brain drain that have been implemented successfully. Reducing brain drain is difficult due to the complexity of and contribution by a multiplicity of factors. While Lofter & Slater rely on soft data, Serour’s work seeks to contribute to policy with some hard data collected by the WHO. The lack of statistical models implemented in these papers results in a gap in understanding the relationship between brain drain and medicine. Solving the problem of brain drain has the potential to increase the quality of life in exodus countries. Therefore, this paper hypothesizes that there is a correlation between healthcare quality traits such as infant mortality rates and life expectancy, and medical professional emigration. 3.3 The Effects of Medical Regulation on Hospitals An example of medical regulation is restrictions placed on medical equipment and facilities. The demand for a facility to be within a reasonable distance is expected to be high. Intuitively speaking, distance is a factor in the quality of healthcare. This is especially true for those with lower-income and those who cannot afford transportation, or those who live outside of urban areas. Trogdon [2009] describes a market where, “certification-of-need (CON) regulations require a hospital to show…a cardiac facility, in any given market before states permit entry…” [16]. For hospitals, this results in the difficult choice between providing a highly demanded facility, and one that serves both rural and urban communities [16]. Facilities outside of urban centers are less likely to see use, leading to financial loss and the underutilization of surgeons. Trogdon uses the American Hospital Association’s data to model this trade-off. He notes that previous studies used classification to model the severity and risk of heart attack in patients [16]. Trogdon’s solution utilizes a multinomial logit function to model hospital service parameters. Trogdon’s model suggests that regulation decreases hospital competition and hurts patients. Hense, this study provides important evidence concerning the relationship between regulation and medicine. 12 M. A. Upal (editor) Trogdon’s solution and evaluation differs from this paper’s proposed solution in that his study uses a smaller data set and solely relies on regression techniques. This paper strives to provide a larger data set in addition to classification methods. The problem that the solution addresses is similar in that it observes medical regulation. A drawback to Trogdon’s study is that it fails to address the effects of regulation of medical personnel, taking a demandside approach. This study will use data including factors such as employment rates and emigration rates of medical personnel. 
3.4 Barriers to Entry: Regulation of Doctors Another issue that arises is the regulation of a doctor’s qualifications. For countries with different standards in medical knowledge, this creates a difficult situation for medical personnel. Medical professionals that cannot receive the proper medical training in an exodus country may choose to study in foreign countries. A professional with a greater skill set is incentivized to remain in the host country. This is due to both personal costs acquired throughout their education, as well as a greater pay for their skills. Table 1 [Baretta, 2012] Observing the relationship between pharmaceutical regulations and brain drain is important because quality healthcare contributes to the incentives to remain in an individual’s birth country. This paper uses WHO reports, medical studies, and the World Data Bank’s data to characterize the level of regulation implemented in any country. Historical research shows that for doctors trained in non-English speaking countries, 66% of UK supervisors reported complaints from patients concerning many issues [Bhat, 2014, pp.3]. He states that many foreign-born doctors are less likely to understand patientdoctor relationships, medical regulations, and clinic skills. It was also noted that non-whites were far more likely to be reported, hinting at racial factors coming from the host country. [Bhat, 2014, pp.3]. It can be argued that an unequal regulation ratio between exodus and hosts countries contributes to this. 3.6 Historical Data This paper presents a unique approach to the research methods concerning brain drain. Previous research utilizes survey data along with statistical analysis, and focuses on the negative effects of brain drain on the exodus country. Fewer studies analyze brain drain and medical regulations. While this presents a new opportunity, especially concerning medical technology and personnel, there is limited available data. Previous studies typically use surveys and government sourced data. This paper uses data from the World Health Organization (WHO), the Organisation for Economic Cooperation and Development(OECD), and the World Data Bank. International data is prone to truncated values. Inconsistencies in some international data is often attributed to bias, or a lack of, reporting. Countries with corrupt governments may report inaccurate statics to suit political agenda. Bhat’s report shows a unique issue that stems from the exodus country. As previously stated, host countries that have poor developed healthcare have issues retaining doctors and other citizens. Governments argue that medical regulations are implemented to ensure patient safety. In countries that have poor safety regulation, the demand to emigrate has been shown to increase. Bhat’s study uses soft data to observe trends that cover migrant doctors and regulations. His study differs from this paper’s proposed solution in that it provides a sociological history but has very limited data. Additionally, it does not provide a solution or any evaluation techniques. 3.5 Pharmaceutical Regulations It has been established that there is a connection between brain drain and poor healthcare in exodus countries. An emerging problem is the lack of pharmaceutical regulation in these Ravinetto [2016] reports that pharmacists and organizations such as WHO have begun campaigning for better policy and awareness in African communities where disease and poverty are prevalent [14]. 
He cites a study where Chinese regulators found that counterfeit drugs make up 1%-2% of the market [Pan & Luo, 2016, pp.300]. Counterfeit drugs included antibiotics and antimalarics in high quantities, resulting in deadly consequences, see Table 1 [Baratta, 2012, pp.175]. While Ravinetto does not provide a statistical solution, the remaining studies both utilize statistical analysis of cross-sectional data as a solution. Baratta’s study goes further by conducting medical experimentation on the drugs to observe their legitimacy. 3.7 Data and Expectations The quality of healthcare is based on the WHO’s efficiency index. Countries with poorer healthcare standards and maximized healthcare benefits, receive a score nearing to 0 and 1, respectively. Countries with observed higher regulation standards, such as sanitation regulations, have a higher index score. Life expectancy and infant mortality rates are used as proxies for the quality of life in a country. A healthcare rating index is implemented on each country based on the efficiency of its healthcare system. It is assumed that doctors gravitate towards immigrating to a country with better healthcare and life quality. Countries with greater regulations, concerning sanitation and accessible medicine, are observed to have a higher index. This paper uses the healthcare index as a proxy for regulation quality. It is also assumed that professionals seek higher monetary gains, 12 Second Annual International Great Lakes Data Science Symposium Praveen Kumar Neelappa May 4-5, 2019, Erie PA, United State of America seeking to move from a developing exodus country to a developed host country. I hypothesize that there is a negative correlation between GDP per capita and the exodus country’s doctor stock. This study uses foreign doctor stock data from 76 countries to model migration patterns among doctors. This model assumes that a doctor who migrates reside only in the host country during the census year. 3.8 Evaluation Techniques In the past, studies employed regression analysis on less specific data sets. Rather than observing trends within brain drain or regulation, this paper uses doctor migration data from the OECD and World Data Bank. The result is a cross-sectional data set. This paper uses the following variables: GDP per Capita, infant mortality rates, and life expectancy, an exodus country’s doctor stock, and a healthcare rating index. 4. SOLUTION This paper’s proposed solution expands upon past research on brain drain and medical regulation by employing its technique on a unique dataset. Previous researchers have implemented regression techniques on the data to analyze the effects of brain drain and regulation, separately. This paper analyzes trends between regulatory policy, healthcare quality, and the migration rates of medical professionals. This first experiment employs random effects from Kevin Sheppard’s relatively new module linearmodels [2], in addition to Sklearn, on the panel data. The second experiment uses a traditional OLS linear regression. The training and testing model was split 70/30, fit, and utilized for both hypothesis testing and predictions. My findings show a relationship between the exodus country’s income and infant mortality, and doctor migration. An increase in the exodus country’s GDP per capita is correlated with a slight increase in migration. I speculate this to be due to an individual in a developed country being satisfied by their country’s wealth, while those in poorer countries lack opportunities to leave. 
When a country is developing, individuals who previously could not leave now can do so. An example of this trend is the relationship between the two variables for the exodus country Albania, see Graph 1. However, a higher infant mortality rate in the exodus country is correlated with a decreased doctor stock from the exodus country. Perhaps this is due to the desire to better one's country, or because of a loss of potential: a higher infant mortality rate is assumed to negatively affect population numbers, and individuals who survive infancy have the potential to become mobile doctors, while those who do not lose this opportunity. Surprisingly, healthcare quality and regulation presented no significance towards doctor migration. Perhaps countries with higher indexes impose migration barriers, which leaves doctors with the choice to move to another developing country or a country less developed. Another scenario is that developing countries compete with developed ones to host by offering greater opportunities. Future work can improve upon this study by including additional features, such as accounting for war, or additional regulatory statistics.

Hypothesis Testing
My research uses international panel data that is prone to being truncated. A random effects model allows for partial pooling, to maximize the precision in estimating values. Testing showed multicollinearity between life expectancy and infant mortality rates. Life expectancy was dropped from the model, which greatly improved the estimates and reduced noise. After normalizing the data, a single-tailed significance test was conducted on the model. This model assumes α = 0.05, where:
Ho: µ1 = µ2
Ha: µ1 ≠ µ2
GDP per Capita and infant mortality rate were significant, and positively and negatively correlated with doctor migration, respectively. A 1 unit increase in GDP per Capita increased doctor stock by 0.2, while a 1 unit increase in the infant mortality rate decreased doctor stock by 245.8, see Table 2 below in Tables & Graphs.

Table 2: Parameter estimates show that GDP per Capita has a significant correlation between itself and migration.
Feature abbreviations for Table 2:
Feature    Description
IM         Infant Mortality Rate
HC         Health Care Rating Index
GDPP       GDP per Capita

5. CONCLUSION
The linear model perfectly fit the data with an R2 of 1. Overall, GDP per Capita and infant mortality rates are shown to be better predictors of a country's doctor stock, or migration, see Table 3.

Table 3: This table shows the first 50 predictions, shown as an array.

Graph 1: Doctor Stock vs. GDP per Capita for the exodus country Albania (2005-2015); axes show doctor stock against GDP per capita.

REFERENCES
[5] CEICData. (n.d.). Venezuela GDP per Capita [1960 - 2019] [Data & Charts]. Retrieved from https://www.ceicdata.com/en/indicator/venezuela/gdp-per-capita
[6] Central Intelligence Agency. (n.d.). COUNTRY COMPARISON: NET MIGRATION RATE. Retrieved from https://www.cia.gov/library/publications/the-worldfactbook/rankorder/2112rank.html
[7] Citron, P. (2011). Medical Devices: Lost in Regulation. Issues in Science and Technology, 27(3), 23-28. Retrieved from http://www.jstor.org/stable/43315484
[8] Department of Social and Economic Affairs. (2015). International Migration Report 2015 (United Nations, Ed.).
Retrieved from http://www.un.org/en/development/desa/population/migration/pub lications/migrationreport/docs/MigrationReport2015_Highlights.p df [1] Baratta, F., Germano, A., & Brusa, P. (2012). Diffusion of counterfeit drugs in developing countries and stability of galenics stored for months under different conditions of temperature and relative humidity. Croatian Medical Journal,53(2), 173-184. doi:10.3325/cmj.2012.53.173 [2] Bashtage. 2017. bashtage/linearmodels. (2017). Retrieved December 2018 from https://github.com/bashtage/linearmodelsF [9]Docquier, F., & Rapoport, H. (2012). Globalization, Brain Drain, and Development. Journal of Economic Literature, 50(3), 681-730. Retrieved from http://www.jstor.org/stable/23270475 [3] Bhat, M., Ajaz, A., & Zaman, N. (2014). Difficulties for international medical graduates working in the NHS. BMJ: British Medical Journal, 348. Retrieved from https://www.jstor.org/stable/26514841 [8]Hautamaki, V., Karkkainen, I., & Franti, P. (2004). Outlier detection using k-nearest neighbor graph. Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004. doi:10.1109/icpr.2004.1334558 [4]CEICData. (n.d.). Syria GDP per Capita [2002 - 2019] [Data & Charts]. Retrieved from https://www.ceicdata.com/en/indicator/syria/gdp-per-capita [11]Lofters, A., Slater, M., Fumakia, N., & Thulien, N. (2014). “Brain drain” and “brain waste”: Experiences of international medical graduates in Ontario [Abstract]. Risk Management and Healthcare Policy,81-89. Retrieved from https://pdfs.semanticscholar.org/f29c/442619aa802d865fbb7721 06b37461edaf49.pdf?_ga=2.155211643.1377461583.15421409 76-1451209627.1542140976. [16]Serour, G. (2010). Healthcare Workers and the Brain Drain. Obstetric Anesthesia Digest, 30(3), 141. doi:10.1097/01.aoa.0000386811.00064.72 [12]Millennium Development Goals (MDGs). (2017, October 17). Retrieved from https://www.who.int/topics/millennium_development_goals/en/ [17] Trogdon., J. 2009. Demand for and Regulation of Cardiac Services. (2009). Research, Osaka University Previous Item | Next Item https://www.jstor.org/stable/25621506 [13]OECD. Retrieved https://stats.oecd.org/Index.aspx?QueryId=68336 7. Acknowledgements from I give my thanks to Dr. M. Afzal Upal of Mercyhurst University for his guidance during my research. I would like to give a special thanks to Kevin Sheppard, the developer of linearmodels. [14]Pan, H., Luo, H., Chen, S., & Ba-Thein, W. (2016). Pharmacopoeial quality of antimicrobial drugs in southern China. The Lancet Global Health, 4(5). doi:10.1016/s2214109x(16)00049-8 About the Author Kimberly Staudt is a data science graduate student at Mercyhurst University. In 2017 she was the Press Secretary for the Mercyhurst Data Science Club. She has a Bachelor of Arts in economics, and a minor in sociology from Duquesne University. During her undergrad she was the Vice-President of the Economic Student Union. Previously, she has interned for State Budget Solutions as a data analyst and op-ed writer. Her interests include medical science and psychology, economics, physics, philosophy, and fine art. [15]Ravinetto, R., Vandenbergh, D., Macé, C., Pouget, C., Renchon, B., Rigal, J., . . . Caudron, J. (2016). Fighting poorquality medicines in low- and middle-income countries: The importance of advocacy and pedagogy. Journal of Pharmaceutical Policy and Practice, 9(1). 
Predicting Future Poaching Sites in African Reserves
Stephanie Le Grange
Department of Computing & Information Science
Mercyhurst University, Erie, PA
Slegra78@lakers.mercyhurst.eu

ABSTRACT
Poaching has always been an issue in Africa, not only due to the massive loss of wildlife but also due to the impact that the loss of these animals has on the ecosystem and on the local community. Poaching is driven by the demand for ivory and rhino horns, and with the advancements in the illegal trade markets we have seen an increase in poaching rates. Illegal animal trade can be reduced if we can determine poaching hot spots [1]. This can be done with the help of African reserves as well as the rangers that are patrolling these areas. Rangers help collect data on animal observations, locations with signs of illegal activity, and poachers [2]. This data can then be used to provide patrol managers with tools that analyze what they have collected to generate a forecast of poachers' behavior as well as to suggest future patrolling routes [2]. Other models have been built to help assist rangers in patrolling the vast conservation areas in Africa, including PAWS, CAPTURE, and INTERCEPT. All of these models have their own advantages and disadvantages, but all have helped improve rangers' patrol routes, capture poachers, and remove snares. My results from this project show when an attack would be successful based on the date and location in question, providing rangers with a better estimate of where they should be patrolling next.

15. INTRODUCTION
African mammal populations have shown a dramatic decline in size over the years, while poaching and illegal wildlife trade have continued to grow since the late 2000s [3]. This loss of species has consequences for the surrounding ecosystem [2] as well as for the local economies that depend on these animals to help drive tourism. With the new advancements in technology as well as social media, it is getting easier for poachers to move poached artifacts or animals to other countries with little to no detection. To help reduce the loss of African wildlife, conservation organizations assign rangers to protect these large areas, but due to the harsh environment, the size of the areas that need to be patrolled, and the limited number of rangers, it is difficult to actively protect these areas and the animals in them from a growing number of poachers.

CAPTURE has two layers; the first layer predicts the attackability of an area while the second layer predicts the likelihood of an attack being seen given the patrol routes [2]. INTERCEPT is efficient in assisting rangers with patrol planning, while PAWS patrol planning is generated from the potential risk of an area, not from predicted poacher attack areas [2]. These technologies, along with a new AI camera called TrailGuard AI, are being used in African reserves to assist rangers with catching poachers before they kill the animals that they are looking for. Along with using AI and machine learning algorithms, rangers are teaming up with conservation organizations and have been altering and coloring rhino horns to lessen their appeal to poachers. We need to continue looking into what is motivating poachers to carry out their actions. One driving factor is that the benefits to the poacher outweigh the risk of being caught. Real-world data is also noisy, and data can only be collected from areas that are being patrolled [2].
16. RELEVANT WORK
Poaching has grown exponentially in the past few years, and the illegal trade of animals and animal parts is one of the driving forces behind this. The majority of the items that are being illegally traded are derived from African elephant ivory. Many of these poachers have very little to no experience with weapons, but they are skilled in avoiding detection. In 2013 alone, 51 tons of ivory was seized, and the number of elephants killed was estimated at 50,000 [4], which is significant given that there were only about 434,000 elephants remaining that year. Ivory prices have also increased over the years. In 2013 black market ivory averaged between $2,500 and $3,000 per kilogram, and rhino horn prices were reported at $65,000 per kilogram [3]. Researchers have also investigated the question of where poachers are being recruited, and they have found that many poachers were being recruited in the local villages and that 32% of villagers knew who the poachers were but were not willing to identify them [3].

Past research has shown that the distribution of poaching sites is non-random and instead shows spatial autocorrelation [5]. There has also been a strong correlation between poaching sites and the landscape, especially areas near water [5]. This information is important in helping with predictions of future poaching sites.

CAPTURE has a few shortcomings that INTERCEPT improves on. CAPTURE takes hours to run and the model is difficult to understand [2], which does not work for rangers, as they have limited access to computing power and are not well versed in computing. INTERCEPT uses decision trees along with BOOST IT to provide a prediction of potential future poaching sites.

Work is also being done in DNA testing of seized ivory to help narrow down the origin of the ivory, so they know whether it is from savannah or forest elephants [4]. DNA analyses can help predict where the animals are being removed from, which can then help rangers intercept poachers, helping stop the trade before it happens [1].

18. PROPOSED SOLUTION
Finding relevant real-world data on poaching sites was challenging, as much of this data is not made publicly available due to its sensitive nature. I pulled data from two different sources which provided information on the date of the poaching incident, the location (longitude and latitude), as well as the number of animal carcasses that were found in each location. I prepared two train/test splits of the dataset: one from 2002 to 2012 to evaluate data from 2013, and one from 2002 to 2017 to evaluate data from 2018.
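Below is a minimal sketch of the year-based evaluation splits described above and of the decision-tree and random-forest classification described in the Method section that follows. The file name and column labels are assumptions, and the factorization of the date and description fields only approximates the preprocessing reported in the paper.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Hypothetical labels for the five columns of the raw poaching data.
cols = ["number_killed", "date_reported", "longitude", "latitude", "description"]
df = pd.read_csv("poaching_incidents.csv", names=cols, parse_dates=["date_reported"])

# Year-based splits: train on 2002-2012 to evaluate 2013, and on 2002-2017 to evaluate 2018.
year = df["date_reported"].dt.year
train_a, eval_2013 = df[(year >= 2002) & (year <= 2012)], df[year == 2013]
train_b, eval_2018 = df[(year >= 2002) & (year <= 2017)], df[year == 2018]

# Factorize the date and description so the classifiers receive numeric inputs;
# the description is treated here as the attack-success label (an assumption).
df["date_code"] = pd.factorize(df["date_reported"])[0]
df["success"] = pd.factorize(df["description"])[0]

X = df[["number_killed", "date_code", "longitude", "latitude"]]
y = df["success"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Decision tree with max depth 3; for a binary label, confusion_matrix prints TN and FP
# on the top row and FN and TP on the bottom row, matching the ordering described in the text.
tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
print(confusion_matrix(y_test, tree.predict(X_test)))

# Rerun with a random forest and reprint the confusion matrix.
forest = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(confusion_matrix(y_test, forest.predict(X_test)))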
I will use a binary decision tree along with a regression model to predict the probability of a potential future poaching site. I will then measure the accuracy score to evaluate my model.

19. METHOD
I started by importing the necessary libraries: numpy, pandas data frames, seaborn, and matplotlib. I then pulled in the data and added labels to each column in the dataset.

This dataset has 660 observations and 5 features. The reported date and the description were then factorized before assigning x and y. X represents the features that we want to test on, and Y is the prediction that we want to obtain. X consists of the following columns: Number Killed, Date Reported, Longitude, and Latitude, while Y predicts the likelihood of a poaching incident at a given location on a given day. Before running the train/test split I had to check the shape of x and y and transpose x so the shapes match. The train/test split was set to a size of 0.3 and random state 0. I then imported linear_model from sklearn as well as pyplot and printed the shape of the train and test data. A decision tree classifier with a max depth of 3 was run on the training data to show the number of true negatives (TN), true positives (TP), false negatives (FN), and false positives (FP).

These results are printed in the following order: TN is the non-successful attacks predicted correctly, TP is the successful attacks predicted correctly, FN is the successful attacks wrongfully predicted to be non-successful, and FP is the non-successful attacks wrongfully predicted to be successful. True Negative and False Positive appear on the top row, and False Negative and True Positive on the bottom row. In the resulting table we can see that we have 145 true negatives and 1 true positive.

I ran the same code again with the random forest classifier and reprinted a confusion matrix with the below results: we can see that we have a decrease in the number of true negatives and an increase in the number of true positives.

I then built a model that would allow the user to enter data based on their next patrol route to predict how many resources they should assign to a given location on a given day. They would need to enter:
• The month
• The day
• The location (Longitude and Latitude)
The model will print out a statement on whether or not an attack is likely, and with this information the rangers can ensure that they assign more resources to higher-risk areas, which might increase the chances of more poachers being caught in the act and fewer animals being added to the endangered/extinction list.

20. DISCUSSION
There has been discussion on how to get communities involved in helping to reduce poacher recruitment in their villages. The engagement of the community can also assist in the speedy arrest of poachers, as seen in Namibia [3]. Technology should be deployed to assist with these efforts to protect Africa's wildlife, and technology can assist with keeping records and analyzing future deployment of rangers based on poaching patterns [3]. Having rangers record the seasonal patterns as well as the distance from bodies of water, roads, and shelters would help increase the accuracy of models like this, as these factors are thought to be major considerations when poachers are choosing locations to hunt. We know this is true because if it is drought season, more animals are going to gather around the last few water holes, so those would be more likely spots for poachers to gather; the same can be said for the weather and the hour of the day. Poachers might be more likely to attack during a specific hour depending on the weather.

21. CONCLUSION
Technology will play a big role in the future of conservation in Africa and in reducing the number of poaching incidents as well as illegal trafficking, but it will not be a replacement for well-trained rangers. Rangers will continue to play an important role in the protection of African wildlife as they know the bush, can spot signs that intruders have been there, and are willing to put themselves in harm's way to protect the animals [3].
Studies should continue not only on the potential of a poaching location, but also on the motivation of the poacher to take the risk of going through with his actions. More research needs to be done on poaching sites, especially with the collection of real-world data. Models also need to continue to be improved upon to help them distinguish between relevant data and noisy data. In the future it would be nice to have another program that can automatically enter locations and show them on a map. This would remove the tedious work of manually entering dates and locations by longitude and latitude.

22. REFERENCES
[1] S. K. Wasser et al., Combating the Illegal Trade in African Elephant Ivory with DNA Forensics. Conservation Biology 22, 1065-1071 (2008).
[2] D. Kar et al., Cloudy with a Chance of Poaching: Adversary Behavior Modeling and Forecasting with Real-World Poaching Data. International Conference on Autonomous Agents and Multiagent Systems (2017).
[3] B. Anderson and J. Jooste, Wildlife Poaching: Africa's Surging Trafficking Threat. Africa Security Brief (2014).
[4] S. K. Wasser et al., Genetic assignment of large seizures of elephant ivory reveals Africa's major poaching hotspots. Conservation 349, 6243 (2015).
[5] M. J. Shaffer and J. A. Bishop, Predicting and Preventing Elephant Poaching Incidents through Statistical Analysis, GIS-Based Risk Analysis, and Aerial Surveillance Flight Path Modeling. Tropical Conservation Science, 525-548 (2016).

About the authors: Stephanie Le Grange is a graduate student at Mercyhurst University in Erie, PA. She earned her undergrad in Environmental Science at Edinboro University.

Mass Shootings in the United States: An Analysis and Prediction
Dayana Moncada
Mercyhurst University, 501 East 38th Street, Erie, Pennsylvania 16501
dmonca63@lakers.mercyhurst.edu

ABSTRACT
One thousand, one hundred and fifty-three people have died in the course of U.S. history because of mass shootings. Although these occurrences are only a tiny fraction of gun violence, they are still terrifying because of their unexpected nature. In this research, we utilized linear regressions, using scikit-learn, to analyze our data. We were not able to conclude how to predict mass shootings, but we were able to bring to light some conclusions and correlations that can serve as a guide for future research.

Keywords
Mass shootings, crime prediction, linear regression, data science, Python
1. INTRODUCTION
The deadliest mass shootings in modern history have occurred in the United States since October 2017. These include the shooting at the Pulse nightclub, where 49 people were killed; the shooting in Las Vegas during a concert at Mandalay Bay, where 58 people were killed; and the shooting at Marjory Stoneman Douglas High School in Parkland, Florida, which mobilized the country into one big rally across the nation and where 17 people were killed. Unfortunately, these are not the only instances. One thousand one hundred and thirty-five people have been killed in mass shootings in the United States' modern history.

A mass shooting is defined as an incident where more than 4 people are killed. However, there is no single definition of a mass shooting that everyone agrees on. The Washington Post, and other media outlets, have defined a mass shooting as one in which "four or more people were killed by a lone shooter (two shooters in a few cases). It does not include shootings tied to gang disputes or robberies that went awry, and it does not include domestic shootings that took place exclusively in private homes. A broader definition would yield much higher numbers." [1]

As the media's coverage of mass shootings has increased, the American population has become more aware of such events. The timeline of a mass shooting and its repercussions, in the eye of the public, is eerily similar: it starts with the occurrence of the shooting, followed by millions of posts on social media timelines. The outrage from the public often becomes visible; activists call for gun legislation and stricter gun control. This usually lasts for a couple of weeks. Finally, the whole situation is forgotten until another similar incident occurs. The idea of a mass shooting has become normalized in our society to the point that people cannot rely on gun legislation to deter the casualties and psychological damage caused by these incidents. With the growth of an insatiable 24-hour news cycle, social media, reality TV, and a culture that thrives on instantaneous access to information, it is no wonder there has been a symbiotic, metastatic growth of the consumer/producer dynamic that feeds into the worst parts of our nature. The more content produced, the more content we consume (Pescara-Kovach et al., 2017) [2].

2. RELATED WORK
To understand how mass shootings have occurred throughout the years, their societal implications, and what analyses have been done in academia, we investigated different journals and articles to better understand and possibly further an already ongoing line of research. The articles that were pertinent to this topic came from academic journals.

The following paper was chosen to be the center of this project due to its statistical methods and its contagion model: "We fit a contagion model to recent data sets related to mass shootings in the US, with terms that take into account the fact that a school shooting or mass murder may temporarily increase the probability of a similar event in the immediate future, by assuming an exponential decay in contagiousness after an event" (Towers et al.). [3]

According to the authors, past studies have found that media reports of suicides and homicides appear to subsequently increase the incidence of similar events in the community, apparently due to the coverage planting the seeds of ideation in at-risk individuals to commit similar acts. It is very interesting to see these findings given the increasing news coverage of mass shootings over the past couple of years. The authors also found that a state's prevalence of firearm ownership is significantly associated with the state's incidence of mass killings with firearms, school shootings, and mass shootings.

The research topic and direction is similar to the one we will approach in terms of the methodology. Their research was performed through binned statistical analysis. The authors claim that this type of analysis is used in the life and social sciences to distinguish between null and alternate hypotheses.

A very interesting and challenging situation the authors faced was having contradictory outcomes when testing their model: "As an example of the advantage of unbinned likelihood methods in increasing the statistical power of an analysis, here we compare and contrast two recent analyses of contagion in mass killings in America, both of which were based on exactly the same data, but used different methodology. One concluded that there was evidence of contagion in mass killings." [4] Their model is built and performed through unbinned statistics.
These concepts were illustrated in our comparison of two analyses of contagion in mass killings that have appeared in the literature, both of which used exactly the same data but different analysis methodologies. The second set of data will be explained in the discussion of Lankford and Tomek (2017) in the following part of the relevant works section. The Towers et al. (2015) [5] "analysis of mass killings used unbinned maximum likelihood methods to examine the temporal distribution of the events, and found significant evidence of contagion [6]". In contrast, the analysis by Lankford and Tomek (2017) examined the data using coarsely binned methods [7]. The comparison of the two analyses provides an excellent example of the power of unbinned likelihood methods; the very coarsely binned method used by Lankford and Tomek was "not sensitive to differences between the null hypothesis model of no contagion and an alternate hypothesis of a self-excitation contagion model." It is important to note that Lankford and Tomek took a different approach. The preceding citations and wording are from Towers et al.; the model taken by Lankford and Tomek will be considered further below.

It is not easy to deny that the media has had a very important influence on the modern world. With the expansion of technology, smartphones, and social media, news reaches the palm of one's hand in a matter of seconds; whether it is fake or not is a different matter. Several media outlets have been covering the topic of mass shootings for a while now. We are especially interested in the different approaches taken to diminish and deter the instances of mass shootings in the United States. The New York Times has suggested "Mass Killings May Have Created Contagion, Feeding on Itself," and a recent headline in The Washington Post suggested "Are Mass Shootings Contagious? Some Scientists Who Study Viruses Say Yes" (Carey, 2016; Rosenwald, 2016) [6]. The approaches to understanding and preventing mass shootings are vast and complex and have been broadcast through different media outlets.

Lankford and Tomek start their article by explaining the contagion effect in the incidence of suicide. Lankford and Tomek write: "Some researchers have found that suicide rates increase in the days after a highly publicized suicide, such as that of a celebrity or well-known fictional character (Abrutyn & Mueller, 2014; Niederkrotenthaler et al., 2010; Phillips, 1974; Wasserman, 1984)." [7] The authors continue by explaining the similarities between such incidents and mass killers. Mass killers, in most cases, commit suicide after they commit the crimes they intended to commit in the first place. Approximately 30% of mass killers die by suicide or refuse to surrender and are killed by police, which constitutes "suicide by cop" (Duwe, 2004; Lankford, 2015; Lindsay & Lester, 2004) [8]. Social contagion in the case of mass killings follows a similar precedent to the social contagion of suicides. When applied to mass killings, the social contagion thesis suggests that perpetrators receive so much attention for their attacks that each high-profile killer ends up "infecting" the minds of other impressionable individuals (Kisner, 2016; Towers et al., 2015).
The authors continue by explaining that Kissner (2016) found that in the United States from 2000 to 2012, there was an increased risk of active shootings in the 14 days following an incident, and Towers et al. (2015) similarly found that from 2006 to 2013, there was an increased risk of mass killings and school shootings in the 13 days following a previous incident. However, there are speculations these findings are not entirely helpful or true. Following, Kisnner (2016) and Towers et al. (2015) based their studies on incident dates. Joiner, T in 1999, in his article The Clustering and Contagion of Suicide, comes up with the idea that chronological clusters of mass killings that are more prevalent than would be expected at random do not necessarily provide evidence of contagion effects (Joiner, 1999). Lankford and Tomek continue by explaining that incident clusters could be attributed to other social and environmental factors such as political cycles, stock market gains or losses, or other news events unrelated to crime. This paper poses a challenge to the initial question of does media contagion transmits the need of potential mass killers to commit a crime and therefore, making it difficult to deter and prevent these incidents. We believe the challenge is not a sign of completely changing the original question but of one that challenges to keep looking for an answer or answers. The findings made my Lankford and Tomek pose a challenge to what is known and studied. Contagion cannot occur without transmission, says Lankford and Tomek, in page 460 of their article. The social contagion thesis requires that the imitative mass killer be at least indirectly exposed to the model killer’s behavior. However, although mass murderers receive a large amount of media attention, Duwe (2004) found that only 45% of all mass killings in the United States from 1976 to 1999 were even covered by The New York Times. [8] The latter gives the assumption that the media does not play an important part in the propagation of news. However, it is worth to mention that during 1976 and 1999, social media and the rapidness of news outlet to deliver news. It is not relevant to media contagion in modern times. Nevertheless important to take into consideration. An important aspect of study made by Lankford and Tomek is they made note of the importance of differentiating between high-profile incidents and low-profile incidents of account for variation in the amount of attention that each incident receives (460). The methodology used was the same used in Towers et al. (2015). The data set contains 232 mass killings in the United States from 2006 and 2013 and provides population of incidents, not just sample. A very interesting part of this study performed was they used a second additional data set which was randomly generated dates that simulate 232 mass killing incidents across an 8-year time frame (2006-2013), for a total of 116,000 randomly simulated dates (461). The findings in this paper were challenging to the first article mentioned after their statistical exam that although the very high-incident profiles such as mass shooting resulted in more public attention, “this did not significantly increase either the proximity of the event or the number of events within the next 14 days (464). [9] The following excerpt was how the findings were concluded: “Overall, the present study’s findings have direct implications for crime prevention and response. 
If the data showed that risks of mass killings were significantly greater in the days following high-profile incidents, officials would be wise to issue alerts during these critical periods to inform the public of the heightened risks. This strategy would be similar to the U.S. Centers for Disease Control and Prevention's public alerts following the outbreak of contagious viruses and could help people take precautionary measures until the dangerous period passed. However, because chronological clusters of mass killings appear like randomly distributed events, law enforcement officials have a more difficult challenge: encouraging constant vigilance. Previous research suggests that warning signs often exist: School shooters, for example, are prone to tell at least one person about their violent plans prior to striking (Pollack, Modzeleski, & Rooney, 2008). Unfortunately, their comments are often dismissed or ignored (Levin & Madfis, 2008; Newman et al., 2004). Security officials should do everything they can to ensure that these critical warning signs are taken equally seriously at all times of the year, regardless of the recency of previous mass killings (465)."

Although this study challenges the previous article, it provides an impetus to keep digging and researching for more information and eventually create a tangible solution regarding how to deter mass killings, whether media contagion serves as a source of mass killings spreading throughout the country, or whether something else might be giving offenders the urge to cause death and sorrow in the United States.

Towers et al. [10] support the analysis performed in both of the previous articles. Binned statistical methods are used frequently in the social and life sciences, and the authors give background on how this type of analysis is used. Binned statistics methodology is based on the moments of a distribution (such as the mean and variance). These methods have the advantage of simplicity of implementation and simplicity of explanation.
The authors, Towers et al., talk about the advantages of unbinned likelihood methods: "in increasing the statistical power of an analysis, here we compare and contrast two recent analyses of contagion in America, both of which were based on exactly the same data, but used different methodology. One concluded that there was evidence of contagion in mass killings, while the later analysis contradicted this claim" (2).

How both differ is simple yet not easy. In 2015, Towers et al. published their findings under the hypothesis that a mass killing temporarily raises the probability of a similar event occurring in the near future, with an exponential decay of the probability. This means that each mass killing appears to inspire approximately 0.28 new mass killings ([0.10, 0.56], 95% CI), with an average exponential decay period of approximately 13 days (2).

On the other hand, Lankford and Tomek in 2017 published an article that claims the complete opposite of Towers et al. (2015)'s results. Lankford and Tomek (2017) based their conclusion on an analysis of how many events occurred within 14 days of a prior event, under the null hypothesis assumption that the data were randomly and uniformly distributed in time. They also compared the mean and variance of the distribution to the mean and variance expected under their null hypothesis, and performed statistical tests of the null hypothesis using these quantities with Student's t, F, and Z tests (2).

The main difference was that Towers et al. used an unbinned maximum likelihood method, while Lankford and Tomek (2017) used a simple binned analysis.

3. DISCUSSION
a. Problem and Motivation
There is not a day when we open our news outlets and we do not see deaths and multiple killings at the hands of violence. We may wonder whether there is something that private citizens or corporations should do, and/or whether safer and stricter gun legislation will be enough. There is a limited amount of power over the nature of mass shootings. The motivation for this project is to use technology and science to find a way to better analyze mass shootings and to use information science resources to find a solution or to predict the likelihood of these incidents happening.

b. Data Collection Methodology
The data was gathered from a full data set from an in-depth investigation into mass shootings, hosted on Kaggle, which ranges from 1966 to 2019, a total of 53 years' worth of data. It is restricted to continental United States cities. The data can be found at https://www.kaggle.com/zusmani/us-mass-shootings-last-50-years [12].

The following list shows what is included in the database and how it is organized. It is important to note that it is an ongoing project by the owner of the data to keep adding instances of mass shootings as they occur. Data Fields: Title, Location, Date, Incident Area, Open/Close Location, Target, Cause, Summary, Fatalities, Injured, Total victims, Policeman Killed, Age, Employed, Employed At, Mental Health Issues, Race, Gender, Latitude and Longitude.

Data Coverage: 1966 – 2019, updated frequently. A very important criterion was whether the perpetrator took the lives of at least four people, whether the killings occurred in a public space, and whether the shooting was a spree killing or mass murder. Shootings based on conventionally motivated crimes such as armed robbery, gang violence, or domestic abuse are not included. The features of the data that will be used are:
Title - A case name that the 'shooting' has been assigned
Location - The city and the state where the shooting occurred
Date - The date of the shooting
Incident Area - School, Church, Parks, etc.
Open/Close Location - Describes whether the shooting occurred in an open space, a closed space, or both
Target - The victims of the shooting
Cause - Shooting cause, e.g., Racial, Terrorism, Psychotic outbreak, etc.
Summary - A brief summary of the event
Fatalities - Death count (also counts the shooter, if they were killed or committed suicide afterwards)
Injured - Injury count
Total victims - Injury plus the death count (excludes the perpetrator if he/she is killed)
Age - Age of the shooter. Data is only available for events which had a single shooter.
Mental_Health_Issues - States whether the shooter had mental health issues or not, or whether the case is unclear or unknown
Race - Race of the shooter
Latitude - self-explanatory
Longitude - self-explanatory

During the analysis, the variable date was divided into three: day, month, and year for better analytical purposes.

b. Exploratory Analysis
To gain some background we performed exploratory analysis on the data.
We introduced the data as a Pandas data frame, and we based the analysis on location. We were able to split Location to look deeper into targeted areas. The result is a bar chart titled "Top 10 States of Mass Shootings," in which we were able to see that California has ranked number one in having the most mass shooting incidents in the last 53 years, followed by Pennsylvania, Maryland, and Florida. Furthermore, we were able to explore an analysis of US cities, for which the same bar chart was produced.

c. Map Analysis
We analyzed longitude and latitude. This analysis was chosen due to the ease of reading maps and having a visual understanding of mass shooting occurrences. See Appendix A for Figure 1.

4. SOLUTION
In this paper, we will be focusing on answering the question, "Are mass shootings random events, or are they events that could be considered 'copycat' crimes due to contagion?" Following this question, we are asking whether mass shootings can be predicted and what the relationship between years and total victims is across the last 53 years. Unfortunately, due to the way the data is arranged, mass shooting dates and times cannot be predicted because the data is scattered. However, we can come up with conclusions and create linear regressions among the variables to understand what can be done by law enforcement and researchers. Most of the studies done with regression and Poisson distributions show that, unfortunately, mass shootings are random events that cannot be predicted, and that a shooting occurring days, weeks, or months after another shooting does not have anything to do with the earlier one. Crime prediction has been found useful when there are different types of crimes in the dataset; because our dataset is restricted to mass shootings only, it is more difficult to do training, testing, and evaluation through an algorithm. Studies have implemented regression analysis to understand the relationships between the features. We will be employing linear regression and drawing conclusions for better predicting, as well as checking our answer with a Random Forest algorithm and Gradient Boosting.

Our research will use the package NumPy, which is a Python package that allows analytical operations on single and multi-dimensional arrays. NumPy also offers easy use of mathematical analysis on the data. We also utilize the scikit-learn package, which is a widely used Python library for machine learning. Scikit-learn provides data preprocessing, dimensionality reduction, and implementations of regression, classification, and clustering, among others. We will be analyzing total victims and years.

We followed five steps when implementing this linear regression:
1. Import packages and classes
2. Provide data to work with and do transformations
3. Create a regression model and fit it with existing data
4. Check the results of model fitting to know if the model performed satisfactorily
5. Apply the model for predictions
The model assumes the following regression line: y = mx + c, where y is the dependent variable, x is the independent variable, m is the slope, and c is the y-intercept.

5. RESULTS
We analyzed the data using linear regression. The features used were Years and Total Victims. The goal is to build a model which will learn and will be able to predict the number of victims for the following years. We separated the features and the dependent variable into variables x and y, and we are using the LinearRegression class. We are not worried about feature scaling, since the library does that by itself.
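A minimal sketch of the five steps and of the error measures reported in the Results section follows. The CSV file name, the use of the 'Date' and 'Total victims' fields, and the aggregation of victims by year are assumptions about the Kaggle data described above, not the author's exact code.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Steps 1-2: import packages and load/transform the data (file name is an assumption).
df = pd.read_csv("us-mass-shootings-last-50-years.csv", parse_dates=["Date"])
df["Year"] = df["Date"].dt.year
yearly = df.groupby("Year")["Total victims"].sum().reset_index()

X = yearly[["Year"]]           # independent variable x
y = yearly["Total victims"]    # dependent variable y
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 3: create a regression model and fit it with existing data.
model = LinearRegression().fit(X_train, y_train)

# Step 4: check the results of model fitting.
pred = model.predict(X_test)
print("R squared:", r2_score(y_test, pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
print("MAE:", mean_absolute_error(y_test, pred))

# Step 5: apply the model for predictions; the fitted line is y = m*x + c.
print("slope m:", model.coef_[0], "intercept c:", model.intercept_)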
After building our linear regression model, we calculated R squared, which resulted in -0.1366. Our root mean squared error came back as 14.8025, which means that our model was able to predict the total victims by year in the test set within 14.80 victims per year. In order to check our prediction, we utilized the mean absolute error to determine its accuracy. Our mean absolute error came back as 13.1941, which is close to the root mean squared error. In order to keep checking our answers and not rely on just one, we utilized the Random Forest algorithm and a Gradient Boosting algorithm; however, the results did not seem realistic and the models did not work with our dataset. They were left in the code for future work.

6. CONCLUSION
The size of the dataset is essential and plays an important role in the predictive analysis of mass shootings. We were only able to gather 53 years' worth of data, in which we were able to see that, on average, two mass shootings happen per year. This means that we will need to acquire more data to have better success in the analysis.

Our findings show a relationship between total victims and years; as years pass, total victims will increase due to mass shootings. Our assumption is that this is due to the political environment in the United States, such as not focusing on gun reform at the federal level of government, as well as studies showing that mental health does not make someone become a mass shooter.
Unfortunately, our results are only a fraction of why and how a mass shooting occurs. At the public policy level, mass shootings are a complicated topic. Studies have shown that these events are random, and the government and private citizens can only follow certain precautions.

The following factors explain what communities where mass shootings have occurred have in common [13]:
1. More access to mental health resources; this happens because mass shootings tend to occur in urban areas, while rural areas have a shortage of mental health physicians.
2. Lack of socialization: less time for physical activity and recreation areas.
3. Income inequality.
4. Stricter gun laws are correlated with lower risks of mass shootings.
These final conclusions are not set in stone, and while we may not have the solutions yet to deter mass shootings, we can help communities become healthier through exercise and access to mental health care, as well as by taking into consideration income inequality and limited access to opportunities for social mobility.

7. FUTURE WORK
How large the dataset is becomes essential and plays an important role in the predictive analysis of mass shootings. We were only able to gather 53 years' worth of data, in which we were able to see that, on average, two mass shootings happen per year. This means that we will need to acquire more data to have better success in the analysis. Our suggestion is to gather data from other developed countries such as the United Kingdom, New Zealand, and/or Australia.

REFERENCES
[2] Lisa Pescara-Kovach and Mary-Jeanne Raleigh. 2017. The Contagion Effect as it Relates to Public Mass Shootings and Suicides. The Journal of Campus Behavioral Intervention 3, 35-45 (2017).
[3] Sherry Towers, Anuj Mubayi, and Carlos Castillo-Chavez. 2018. Detecting the contagion effect in mass killings; a constructive example of the statistical advantages of unbinned likelihood methods. Plos One 13, 5 (2018). DOI: http://dx.doi.org/10.1371/journal.pone.0196863
[4] Sherry Towers, Anuj Mubayi, and Carlos Castillo-Chavez. 2018. Detecting the contagion effect in mass killings; a constructive example of the statistical advantages of unbinned likelihood methods. Plos One 13, 5 (2018). DOI: http://dx.doi.org/10.1371/journal.pone.0196863
[5] Sherry Towers, Andres Gomez-Lievano, Maryam Khan, Anuj Mubayi, and Carlos Castillo-Chavez. 2015. Contagion in Mass Killings and School Shootings. Plos One 10, 7 (2015). DOI: http://dx.doi.org/10.371/journal.pone.0117259
[6] Michael S. Rosenwald. 2016. Are mass shootings contagious? Some scientists who study how viruses spread say yes. (March 2016). Retrieved April 23, 2019 from https://www.washingtonpost.com/local/are-mass-shootingscontagious-some-scientists-who-study-how-viruses-spread-sayyes/2016/03/07/be44866a-df31-11e5-846c10191d1fc4ec_story.html?noredirect=on&utm_term=.c170b15069a2
[7] Sherry Towers, Andres Gomez-Lievano, Maryam Khan, Anuj Mubayi, and Carlos Castillo-Chavez. 2015. Contagion in Mass Killings and School Shootings. Plos One 10, 7 (2015). DOI: http://dx.doi.org/10.371/journal.pone.0117259
[8] Grant Duwe. 2016. The Patterns and Prevalence of Mass Public Shootings in the United States, 1915-2013. The Wiley Handbook of the Psychology of Mass Shootings (2016), 20–35. DOI: http://dx.doi.org/10.1002/9781119048015.ch2
[9] Adam Lankford and Sara Tomek. 2017. Mass Killings in the United States from 2006 to 2013: Social Contagion or Random Clusters? (July 2017). Retrieved April 23, 2019 from https://onlinelibrary.wiley.com/doi/full/10.1111/sltb.12366
[10] Sherry Towers, Andres Gomez-Lievano, Maryam Khan, Anuj Mubayi, and Carlos Castillo-Chavez. 2015. Contagion in Mass Killings and School Shootings. Plos One 10, 7 (2015). DOI: http://dx.doi.org/10.371/journal.pone.0117259
[11] Sherry Towers, Andres Gomez-Lievano, Maryam Khan, Anuj Mubayi, and Carlos Castillo-Chavez. 2015. Contagion in Mass Killings and School Shootings. Plos One 10, 7 (2015). DOI: http://dx.doi.org/10.371/journal.pone.0117259
[12] Zeeshan-ul-hassan Usmani. 2017. US Mass Shootings. (November 2017). Retrieved April 23, 2019 from https://www.kaggle.com/zusmani/us-mass-shootings-last-50-years
[13] Peter Hess. Communities With Mass Shootings Share 4 Common Traits, Study Shows. Retrieved May 2, 2019 from https://www.inverse.com/article/50072-why-are-there-massshootings-some-places-and-not-others

About the author: Dayana Moncada is a Graduate Student in Data Science at Mercyhurst University in Erie, Pennsylvania.

Appendix A.
Figure 1. Shooting Fatalities by Latitude/Longitude in the United States
Figure 2. U.S. Mass Shootings Victim Count from 1966 – 2019.
Figure 3. Victims Grouped by Years.
Figure 4. Top 10 States of Mass Shootings in the past 53 years.
Figure 5. An Approximation in the Linear Regression graph.

Are ISIS Sympathizers More Like Republicans or "Water For All" Charity Members?
M. Afzal Upal
Department of Computing & Information Science
Mercyhurst University, Erie, PA, 16506
mupal@mercyhurst.edu

ABSTRACT
The rapid adoption of social media by billions of people from all over the world has unleashed unprecedented opportunities for marketers as well as military and public policy officials to better understand their target audiences and design more effective messages for them.
Previously, we have reported on a novel technique for automatically deriving insights about sociocultural groups (including Doctors without Borders, US Republican Party, and Water.org) from Twitter posts by members of those groups [1]. We discovered that while readers of the Republican Party account @GOP were more likely to like and retweet surprising, positive, and social identity related (us-versus-them) messages, Water.org readers preferred emotional, negative, and religious/ideological messages. This study was designed to apply this technique to learn more about the target audience for the terrorist group called ISIS. We also compare ISIS’s target audience with those of the non-terrorist groups we have previously studied. This analysis should help counter-terrorism and counter-insurgency officials to design more effective messages to counter ISIS’s online propaganda. CCS Concepts: • Theory of Computation → Machine Learning Theory; Redundancy; Robotics; • Information Systems → Web and Social Media Search General Terms: Social media mining, machine learning, big data. Additional Key Words and Phrases: natural language processing. 1. INTRODUCTION Experts credit ISIS’s ability to exploit social media as one of the key factors responsible for its rapid rise to prominence in the Middle East [2, 3]. In order to counter ISIS’s online propaganda, we must understand how it spreads its messages through various social media platforms. What is it about ISIS messages that allows them to resonate with their target audience? What types of messages do ISIS sympathizers like? Are religious messages more likely to be liked and shared by ISIS sympathizers or are messages appealing to nationalistic notions of “us versus them” more likely to become viral among them? The questions of whether religious doctrine, social identity, or resource deprivation is the primary motivator of terrorism is hotly debated by scientists [4, 5] as well as columnists [6]. Traditional empirical research to investigate such questions is notoriously difficult not the least due to the safety and security issues involved in carrying out research with human participants in an active conflict zone. Since, social media messages can be accessed from anywhere in the world, is it possible to learn this information from them? Previously, we reported on a study carried out to better understand factors responsible for popularity of messages among members of a variety of social groups on Twitter including the US Republican Party, Doctors without Borders (MSF), Toronto Maple Leafs, Proctor & Gamble’s Always Pads, Water.org, and People for Ethical Treatment of Animals (PETA) [1]. We found some common factors such as having a picture, length of time the tweet has been up, and emotionality of a tweet’s message that predicted tweet popularity in all groups. We also found differences among groups. The tweets that ask to be liked were liked by followers of the Doctors Without Borders (MSF) and the Toronto Maple Leafs accounts but not by others. Similarly, having a URL predicted tweet-popularity among readers of P&G’s Always account but not among others. In fact, not having a URL was a predictor of tweet success among the Republicans. If a tweet explicitly asks its readers to comment on it, only readers of the MSF account complied. Having ideological content in a tweet was a good predictor of tweet popularity in the Water.org community but not in others (especially in among readers of the MSF account, where it was a negative predictor). 
Having surprising content was a good way to catch the attention of the Republicans but not others. Humorous tweets were liked and shared by readers of the Water.org tweets but not by readers of other Twitter accounts. Tweets that appeal to notions of "us" versus "them" were popular among Republicans but not among readers of the other accounts. Tweets with negative content were popular among readers of the Water.org and Doctors Without Borders account while tweets with positive content were popular among Toronto Maple Leaf fans and Republicans.

The objective of work presented here was to carry out similar analysis for Twitter accounts of ISIS sympathizers to better understand what makes ISIS messages popular and compare the results with those of the groups we previously studied to understand how similar and different the preferences of ISIS sympathizers are from those of the groups we studied before.

2 EXPERIMENTAL AND COMPUTATIONAL DETAILS
2.1 Data
In our previous work with the five Twitter groups described earlier, we downloaded as many tweets posted by each of the following Twitter handles before 1 November 2015.
1. @Always: Proctor and Gamble's Always pads (23k+ followers, 1.8k+ messages).
2. @mapleleafs: Toronto Maple Leafs Hockey Club (1.1 million+ followers, 74K+ messages).
3. @gop: The US Republican National Committee (650k+ followers, 20k+ messages).
4. @MSF_USA: The Doctors Without Borders (530k+ followers, 17k+ messages).
5. @Water: Safe water for all (750k+ followers, 10k+ messages).
This included all 1880 messages that had been posted by @always and 3300 messages for the remaining 4 accounts‒the upper limit set by the Twitter API. For each tweet, we also downloaded the number of likes and retweets it had received. We added the number of likes and retweets to compute the popularity number for each tweet. Using this popularity measure, we labeled the top 10% tweets as popular and the bottom 10% as unpopular. We recruited six coders (5 females and 1 male) with academic training in psychology as well as experience in coding psychological and linguistic data to code the tweets on 22 features that have been identified by cognitive science as well as social media researchers for contributing to message popularity.
1. Surprising. Is the message surprising to its target audience?
2. Emotional. Does the message arouse emotions in its target audience?
3. Positive/negative. How positively/negatively is the message perceived by the target audience?
4. Humorous. How humorous would this message be considered by its target audience?
5. Concrete. Does this message contain mostly concrete easy to imagine concepts?
6. Coherent. Is the message in this tweet coherent?
7. Repetitive. Has this message been posted on this group account before?
8. Social identity related. Would this tweet be perceived by the target audience members to be about "us" versus "them"?
9. Exaggerated. Does this tweet contain exaggeration or facts?
10. Ideological or Religious. Does this tweet contain ideological or religious message?
11. Conspiratorial. Does this tweet invoke a conspiracy theory in the minds of its target audience members?
12. About an event. Is the message about an upcoming or past event?
13. Personal communication. Is this message part of a personal communication between the official group handle and an individual?
14. Asks to like. Does the tweet ask its readers to like it?
15. Asks to retweet. Does the tweet ask its readers to retweet it?
16. Asks for real world action (RWA). Does this tweet ask its target audience to take a real world action?
17. Story. Is this message a story?
18. An arcing narrative. Is this an arcing narrative that reminds the target audience of the group's glorious past and promises a glorious future if the group enacts the proposed reform [7]?
19. Posted duration. How long has this tweet been up on Twitter?
20. Has a picture. Does this tweet have an embedded picture?
21. Has a video. Does this tweet have an embedded video?
22. Has a URL. Does this tweet contain a link to another website?
Features 1 through 8 were coded from 0 to 2 with 0 indicating the absence of the feature, 1 when the feature is somewhat present, and 2 when the feature was thought to have a strong presence in the tweet. Features 9 to 22 were coded as binary with a score of 0 if the feature was deemed absent and 1 if it was thought to be present in the tweet. The last four features were automatically coded by examining the relevant fields provided in the JSON data structure returned by the Twitter API.
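The popularity labeling described above (likes plus retweets, keeping the top 10% as popular and the bottom 10% as unpopular) can be sketched as follows. The file name and DataFrame layout are assumptions; only the favorite_count and retweet_count fields come from the Twitter API JSON mentioned in the text.

import pandas as pd

# Assumed layout: one row per downloaded tweet with its like and retweet counts.
tweets = pd.read_json("group_tweets.json")
tweets["popularity"] = tweets["favorite_count"] + tweets["retweet_count"]

# Keep the top 10% as popular (label 1) and the bottom 10% as unpopular (label 0).
hi = tweets["popularity"].quantile(0.90)
lo = tweets["popularity"].quantile(0.10)
popular = tweets[tweets["popularity"] >= hi].assign(label=1)
unpopular = tweets[tweets["popularity"] <= lo].assign(label=0)
labeled = pd.concat([popular, unpopular])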
We repeated this methodology to obtain data for the present study with some variations. The first problem was obtaining a large number of tweets, some of which have a large number of likes and retweets while others have none or few, so that our algorithms will have something to learn from. Unlike the previous six accounts, for which we could simply download tweets posted by the official Twitter handle, ISIS does not have an official Twitter account. There are, however, a number of Twitter accounts that are known to belong to ISIS members. We started with 42 of these accounts and looked for the accounts that retweeted their 20 most recent tweets. We allowed the algorithm to run for 36 hours. This resulted in 4823 users and 6855 unique tweets in 35 languages. Of these we selected 3733 English tweets and had one of the coders who had coded the five-group tweets code them for the 22 features discussed above.

2.2 METHOD
We selected seven classification algorithms that have been known to perform well on social media mining applications [8]. We accessed six of these algorithms (Logistic classifier, RIPPER, Random Forest, C4.5, Alternating DTs, and K*) from the Weka Machine Learning toolkit [9], while the Support Vector Machine (SVM) was accessed through the LIBSVM library [10]. Algorithm performance was measured using ten-fold cross validation. This involved dividing the data set into ten segments. Each segment was then used as a test set while the remaining nine segments were used for training the algorithm. The measures of performance we computed were accuracy, precision, recall, Cohen's kappa, the F-measure, and the Area Under the ROC Curve (AUC) [11].
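The ten-fold cross-validation evaluation described in the Method section can be sketched with scikit-learn stand-ins for the Weka and LIBSVM implementations named above; the coded-feature file and its column names are assumptions. Cohen's kappa is not a built-in scorer string and would need sklearn.metrics.make_scorer(cohen_kappa_score). The last two lines illustrate how logistic-regression odds ratios, discussed in the Results, can be read off the fitted coefficients.

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Assumed: one row per tweet with the 22 coded features and a popular/unpopular label (1/0).
coded = pd.read_csv("coded_tweets.csv")
X = coded.drop(columns=["label"])
y = coded["label"]

scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]
classifiers = {
    "logistic": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(),    # stand-in for C4.5
    "random_forest": RandomForestClassifier(),
    "svm": SVC(probability=True),                 # probability estimates used for ROC AUC
}

# Ten-fold cross validation: each fold is held out once while the other nine train the model.
for name, clf in classifiers.items():
    scores = cross_validate(clf, X, y, cv=10, scoring=scoring)
    print(name, {m: scores["test_" + m].mean() for m in scoring})

# Odds ratios: exponentiated logistic-regression coefficients (one per coded feature).
logit = LogisticRegression(max_iter=1000).fit(X, y)
print(dict(zip(X.columns, np.exp(logit.coef_[0]))))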
2.3 RESULTS
Classification measures of performance (Table 10) show that the features we selected can be effectively used to predict whether or not a tweet will become popular. Table 8 shows the top five rules learned by the RIPPER algorithm for each data set. Such rules can be extremely useful for ad designers because they can be used to figure out how to design ads that will be liked and retweeted by their target audience members.

Table 8: The top five rules learned using the Repeated Incremental Pruning to Produce Error Reduction (RIPPER) algorithm (one column of rules for each of the GOP, MSF, Water, Always, TML, ISIS, and All tweets data sets).

Calculating the logistic regression odds ratios is a good way to identify features that are critical to predicting tweet popularity in a given target audience. The odds ratios measure the association between the presence of a variable in a tweet and its popularity. An odds ratio of one indicates no correlation, above one indicates a positive correlation, and below one indicates a negative correlation. The P-values represent the probability of the odds ratios having the observed value given that there is no association between the variable and tweet popularity. A smaller P-value indicates that there is a very low probability that these results could have been obtained without a relationship. The rightmost column in Table 1 below shows the logistic regression odds ratios for the All Tweets data set.

The results show that, similar to Water, Doctors without Borders (MSF), Republican Party (GOP), and P&G's Always, ISIS tweets are also likely to become more popular the longer they stay up on Twitter. Similar to Water and MSF (and unlike Toronto Maple Leafs, US Republicans, and Always), having a picture is not predictive of tweet popularity in the ISIS data set. Since our coders were not able to access the multimedia aspects of the ISIS tweets (because most of the ISIS sympathizer accounts had been removed by Twitter by the time our coders coded the data in 2016), the ISIS coding for "has a picture" and "has a video" was woefully incomplete. Therefore, we will ignore this aspect in the rest of the discussion. Unlike the Always data set, and similar to all other data sets, having a URL is not predictive of tweet popularity in the ISIS data. Unlike the Doctors without Borders, Toronto Maple Leafs, US Republicans, and Always data sets, and similar to the Water.org data set, being "very emotional" is not predictive of a tweet's popularity. Similar to Water.org, "very concrete" tweets are likely to become popular among ISIS sympathizer accounts. Similar to the US Republican party, and unlike all other data sets, social identity related (i.e., "us versus them") messages are also likely to become popular in the ISIS data set.
Very social-ID related messages are significantly associated with tweet popularity in both groups; the somewhat social-ID related messages are only statistically significant in the US Republican data set, and they only approach significance in the ISIS data set.

Similarities and differences between the ISIS data set and the US Republican data set: social identity related messages are preferred by readers of both groups; somewhat concrete messages are preferred by readers of the Republican Party account, while "very concrete" messages are preferred by readers of the ISIS Twitter accounts; positive messages are preferred by Republican readers but not by readers of the ISIS accounts; somewhat surprising messages are preferred by Republican readers but not by readers of the ISIS accounts.

Similarities and differences between the ISIS data set and the Water.org data set: very emotional messages are preferred by readers of both groups; very concrete messages are preferred by readers of both groups; negative messages are preferred by Water.org readers but not by readers of the ISIS accounts; very humorous messages are preferred by Water.org readers but not by ISIS readers; ideological messages are preferred by Water.org readers but not by readers of the ISIS Twitter accounts.

Table 9: Summary of similarities between readers of the ISIS Twitter accounts and US Republican Party Twitter account (@GOP).

As Table 9 summarizing the results shows, readers of the ISIS Twitter accounts are similar to the readers of Water.org in that they preferentially like and retweet emotional and concrete messages.
REFERENCES
[1] Upal, M. A. and Marupaka, P. Ad-Oracle for Predicting the Popularity of Marketing Campaign Messages on Twitter. Submitted.
[2] Berger, J. M. and Morgan, J. The ISIS Twitter Census: Defining and describing the population of ISIS supporters on Twitter. The Brookings Institution, Washington, DC, 2015.
[3] Bodine-Baron, E., Helmus, T. C., Magnuson, M. and Winkelman, Z. Examining ISIS Support and Opposition Networks on Twitter. RR-1328, RAND Corporation, Santa Monica, CA, 2016.
[4] Brubaker, R. Religious Dimensions of Political Conflict and Violence. Sociological Theory, 33(1), 2015, 1-19.
[5] Rink, A. and Sharma, K. The determinants of religious radicalization: Evidence from Kenya. Journal of Conflict Resolution, 2016.
[6] Wood, G. What ISIS really wants. The Atlantic, March 2015.
[7] Upal, M. A., Packer, D., Moskowitz, G. B. and Kugler, M. B. Investigating the Dynamics of Identity Formation, and Narrative Information Comprehension: Final Report. Defence Research & Development Canada, 2011.
[8] Japkowicz, N. and Stefanowski, J. Big Data Analysis: New Algorithms for a New Society. Springer Verlag, 2015.
[9] Witten, I. H., Frank, E. and Hall, M. A. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, 2011.
[10] Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3), 2011, 27:1-27:27.
[11] Japkowicz, N. and Shah, M. Evaluating Learning Algorithms: A Classification Perspective. Cambridge University Press, New York, 2011.

APPENDIX

Performance of each classification algorithm, by data set (columns, left to right: MSF, GOP, Water, Always, TML, ISIS, All Tweets):

Weka Logistic Classifier
  Accuracy (%)    86.32   89.30   96.48   93.14   91.57   97.49   85.43
  Recall           0.88    0.89    0.96    0.86    0.92    0.98    0.84
  Precision        0.85    0.89    0.96    0.89    0.91    0.98    0.84
  Cohen's Kappa    0.73    0.79    0.93    0.83    0.83    0.95    0.71
  F-Measure        0.86    0.89    0.96    0.93    0.92    0.98    0.85
  AUC              0.93    0.95    0.98    0.95    0.97    1.00    0.92

RIPPER
  Accuracy (%)    89.02   87.92   96.19   94.85   94.66   99.84   89.37
  Recall           0.89    0.88    0.96    0.91    0.95    1.0     0.87
  Precision        0.89    0.88    0.96    0.91    0.94    1.0     0.89
  Cohen's Kappa    0.78    0.76    0.92    0.87    0.89    1.0     0.78
  F-Measure        0.88    0.89    0.96    0.95    0.95    1.0     0.89
  AUC              0.90    0.90    0.96    0.93    0.95    1.0     0.91

Random Forest
  Accuracy (%)    91.56   91.87   96.92   94.07   96.21   100     91.68
  Recall           0.92    0.91    0.96    0.86    0.96    1.0     0.91
  Precision        0.91    0.92    0.98    0.92    0.96    1.0     0.91
  Cohen's Kappa    0.83    0.84    0.94    0.85    0.92    1.0     0.83
  F-Measure        0.92    0.92    0.97    0.94    0.95    1.0     0.92
  AUC              0.97    0.97    0.99    0.98    0.98    1.0     0.98

C4.5 Decision Trees
  Accuracy (%)    88.5    90.16   96.19   94.92   94.38   99.95   90.45
  Recall           0.88    0.89    0.96    0.88    0.92    1.0     0.89
  Precision        0.89    0.90    0.96    0.90    0.96    1.0     0.90
  Cohen's Kappa    0.77    0.80    0.92    0.85    0.89    1.0     0.81
  F-Measure        0.88    0.90    0.96    0.94    0.94    1.0     0.90
  AUC              0.87    0.91    0.96    0.94    0.95    1.0     0.93

Alternating Decision Tree
  Accuracy (%)    89.23   90.29   96.63   93.60   95.37   99.95   88.49
  Recall           0.90    0.89    0.95    0.97    0.95    1.0     0.86
  Precision        0.88    0.91    0.98    0.91    0.96    1.0     0.88
  Cohen's Kappa    0.78    0.81    0.93    0.84    0.91    1.0     0.77
  F-Measure        0.89    0.90    0.97    0.93    0.95    1.0     0.88
  AUC              0.95    0.95    0.99    0.97    0.98    1.0     0.93

K* Lazy Classifier
  Accuracy (%)    91.12   88.16   95.45   93.14   94.80   99.74   91.30
  Recall           0.91    0.89    0.94    0.83    0.92    1.0     0.89
  Precision        0.91    0.87    0.96    0.92    0.97    1.0     0.92
  Cohen's Kappa    0.82    0.76    0.91    0.83    0.90    1.0     0.82
  F-Measure        0.91    0.88    0.95    0.93    0.95    1.0     0.91
  AUC              0.96    0.95    0.99    0.97    0.98    1.0     0.96

Support Vector Machine
  Accuracy (%)    79.76   66.95   62.46   90.65   80.62   99.37   73.69
  Recall           0.80    0.67    0.62    0.91    0.80    0.99    0.74
  Precision        0.80    0.67    0.63    0.91    0.80    0.99    0.74
  Cohen's Kappa    0.59    0.34    0.25    0.76    0.61    0.99    0.46
  F-Measure        0.80    0.67    0.62    0.90    0.81    0.99    0.73
  AUC              0.80    0.67    0.62    0.86    0.81    0.99    0.73

Table 10: Performance of various machine learning algorithms on the coded Twitter data consisting of tweets collected from five Twitter campaigns.
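The measures reported in Table 10 (accuracy, recall, precision, Cohen's Kappa, F-measure, and AUC) can also be computed outside Weka. The sketch below shows how one classifier/data-set cell of the table could be reproduced with scikit-learn; the labels and scores are made-up placeholders, not the paper's data.

    # Minimal sketch: computing the Table 10 measures for one classifier and
    # one data set. y_true, y_pred, and y_prob are placeholder values.
    from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
                                 precision_score, recall_score, roc_auc_score)

    y_true = [1, 0, 1, 1, 0, 1, 0, 0]                  # human-coded popularity
    y_pred = [1, 0, 1, 0, 0, 1, 0, 1]                  # predicted class labels
    y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.3, 0.6]  # predicted probabilities

    print("Accuracy (%): ", 100 * accuracy_score(y_true, y_pred))
    print("Recall:       ", recall_score(y_true, y_pred))
    print("Precision:    ", precision_score(y_true, y_pred))
    print("Cohen's Kappa:", cohen_kappa_score(y_true, y_pred))
    print("F-Measure:    ", f1_score(y_true, y_pred))
    print("AUC:          ", roc_auc_score(y_true, y_prob))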
Rules learned for each data set:

GOP:
  1. Has_a_picture & is_very_emotional
  2. Has_a_picture & is_emotional
  3. does_not_have_a_url & is_negative & Duration ≥ 5883

MSF:
  1. Duration ≤ 1988 & is_very_emotional & is_not_a_story
  2. Has_a_picture & is_not_an_event
  3. 501 ≤ Duration ≤ 991 & is_emotional
  4. 815 ≤ Duration ≤ 919
  5. Duration ≤ 769 & does_not_have_a_URL & is_very_coherent
  6. is_very_surprising & 4556 ≤ Duration ≤ 5615

Water:
  1. is_not_personal
  2. Has_a_URL & is_emotional & Duration ≤ 11877
  3. is_humorous & is_positive

Always:
  1. Duration ≤ 282 & is_arcing
  2. Has_a_URL & Duration ≤ 498 & is_not_personal
  3. Has_a_picture & Duration ≤ 496

TML:
  1. is_not_personal
  2. Has_a_picture & is_positive
  3. Has_a_picture & is_not_personal & 678 ≤ Duration ≤ 933 & is_very_coherent

ISIS:
  1. Duration > 1820
  2. Duration ≥ 358.16 & is_not_humorous
  3. 290 ≤ Duration ≤ 1199 & is_very_emotional
  4. 2373 ≤ Duration ≤ 2468
  5. 700 ≤ Duration ≤ 1300 & Has_a_URL

All tweets:
  1. Has_a_picture & is_very_emotional
  2. Has_a_picture & is_emotional
  3. is_not_personal & 285 ≤ Duration ≤ 2047 & is_arcing
  4. is_not_personal & 1296 ≤ Duration ≤ 2407 & Not_SocialID
  5. is_not_personal & Duration ≤ 1969 & is_not_arcing & is_emotional & is_an_event & does_not_ask_for_RWA

Table 11: The top rules learned for each data set using the Repeated Incremental Pruning to Produce Error Reduction (RIPPER) algorithm.
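The rule sets in Tables 8 and 11 come from the RIPPER algorithm, which is available in Weka as the JRip classifier. As a rough Python analogue, the sketch below assumes the third-party wittgenstein package and a hypothetical coded_tweets.csv file with a binary "popular" column; it is illustrative only, not the setup used in this study.

    # Illustrative sketch, assuming the third-party "wittgenstein" package,
    # which provides a RIPPER implementation in Python. The data file and the
    # "popular" class column are hypothetical placeholders.
    import pandas as pd
    import wittgenstein as lw

    df = pd.read_csv("coded_tweets.csv")
    ripper = lw.RIPPER()
    ripper.fit(df, class_feat="popular", pos_class=1)

    ripper.out_model()   # prints IF-THEN rules analogous to those in Table 11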
Logistic regression results, by data set. Columns, left to right: Water, MSF, TML, GOP, ISIS, Always; each cell gives the odds ratio (OR) followed by its Pr(>|z|).

(Intercept):               0.47 0.67, 0.00 0.99, 0.00 1.00, 0.00 0.99, 0.26 0.25, 1.30E+15 0.99
Duration:                  1.00 0.001 (***), 1.00 0.001 (***), 1.00 0.19, 1.00 0.001 (***), 0.99 0.001 (***), 1.00 0.001 (***)
Asks to Like:              0.02 0.16, 59.26 0.06, 22.85 0.02 (**), 3.72 0.52, 0.47 0.15
Asks to Share:             0.00 1.00, 0.33 0.39, 5.43E+07 1.00, 0.00 0.99, 8.13 0.42, 0.75 0.08
Asks to Comment:           0.01 1.00, 8.13 0.001 (***), 0.00 1.00, 1.38 0.18
Has a Picture:             1.03E+12 0.99, 8.23E+05 0.99, 51.02 0.001 (***)
Has a Video:               77.29 0.001 (***), 40.91 0.001 (***), 0.22 0.26, 0.28 0.18
Has a URL:                 6.73 0.15, 0.54 0.06 (.), 1.22 0.67, 0.13 0.001 (***), 10.32 0.001 (***), 1.3 0.16
Asks for RWA:              0.62 0.72, 0.45 0.06 (.), 0.00 0.99, 0.90 0.78, 2.00 0.14, 0.22 0.001
Is Conspiratorial:         7.14E+03 1.00, 0.77 0.63, 0.12 0.99, 0.33 0.99, 0.88 0.99
Is an Event:               1.41 0.84, 0.75 0.85, 1.01 0.99, 1.41 0.33
Is Ideological or Religious: 2.00E+04 0.01 (**), 0.24 0.001 (***), 1.12E+04 1.00, 0.61 0.21
Is Personal:               0.00 0.001 (***), 0.05 0.001 (***), 0.00 0.001 (***), 1.23 0.89, 0.79 0.68, 416.85 0.96
Is Arcing:                 0.00 0.12, 0.00 1.00, 1.83 0.29, 4.35 0.03, 2.43 0.99
Is a Story:                0.00 0.02 (*), 0.26 0.06 (.), 0.68 0.29, 7.58 0.31, 0.66 0.90
Is Exaggerated:            9.88E+05 1.00, 3.45E+05 1.00, 0.57 0.38, 0.57 0.99
Is Somewhat Surprising:    1.20 0.93, 0.88 0.75, 0.00 1.00, 2.07 0.04 (*), 0.72 0.86, 0.83 0.001
Is Very Surprising:        0.03 0.92, 1.04 0.92, 5.43E+08 1.00, 1.03 0.98, 6.64E+12 0.99, 0.04 0.99
Is Somewhat Emotional:     20.65 0.02 (*), 1.79 0.10, 3.69 0.20, 9.53 0.001 (***), 23.46 0.02 (*), 1.04 0.23
Is Very Emotional:         81.27 0.30, 7.26 0.001 (***), 13.10 0.02 (**), 32.18 0.001 (***), 34.86 0.06 (.), 0.26 0.02
Is Somewhat Humorous:      9.68E+05 0.001 (***), 1.13 0.96, 0.28 0.12, 1.41 0.55, 0.69 0.81, 11.55 0.64
Is Very Humorous:          3.90E+09 1.00, 1.51 0.15, 0.38 0.45, 2.09 0.56, 0.00 1.00
Is Somewhat Concrete:      8.10 0.19, 32.92 0.17, 2.25E+06 1.00, 3.63 0.07 (.), 1.04 0.96, 0.99 0.78
Is Very Concrete:          450.17 0.06 (.), 2.14E+06 0.99, 2.45E+06 1.00, 2.24 0.28
Is Somewhat Coherent:      1.30 0.88, 7.71E+06 0.99, 0.41 0.49, 609.91 1.00
Is Very Coherent:          98.34 0.16, 3.14 0.24, 2.95E+03 1.00, 0.57 2.21, 0.21 0.22
Is Very Repetitive:
Is Somewhat Social Identity Related: 0.02 0.08 (.), 6.59E+06 0.99
Is Very Social Identity Related: 1.55 0.03 (**), 0.90 0.99, 0.98 0.05, 0.60 0.001, 3.92 0.01 (**), 0.73 0.57, 0.99 0.002 (**), 9.58 0.001 (***), 1.30 0.83, 1.64 0.003 (***), 0.86 0.001, 7.78E+04 0.07 (.), 2.68 0.04 (*), NA 0.65, 4.43 0.18, 0.00 0.99
Is Positive:               3.63 0.45, 1.62 0.21, 7.83 0.001 (***), 5.35 0.08 (.), 0.23 0.26, 0.79 0.002
Is Positive & Negative:    159.66 0.19, 1.93 0.29, 7.71 1.00, 2.65 0.40, 0.23 0.31, 0.72 0.99
Is Negative:               1.38
Is neutral:

Table 3: Logistic regression odds ratios and P-values. Significant odds ratios and significant P-values are shown in a bold font: three stars indicate highly significant results (P < 0.001), two stars mark very significant results (P < 0.01), and one star indicates somewhat significant results (P < 0.05). Results that approach the statistical significance threshold but do not meet it are marked with a period beside them. Cells for which no data exists are reported as blank.