Earthquake Analysis

1. Introduction

1.1. Aim of the Report

This work aims to analyse a dataset about earthquakes and to gain a good understanding of their nature and behaviour. We will begin by discussing some fundamental facts about earthquakes. Next, we will examine the dataset we have, in order to get an idea of what information it contains. We will then use the dataset to find out where and when earthquakes most commonly occur. Lastly, we will explore earthquake predictions and find out whether we can develop a reliable model aimed at forecasting them.

We will be using Python as the programming language, the pandas library to work with our data, and the scikit-learn library to create and train models.

1.2. Earthquakes

Earthquakes are phenomena in which energy is suddenly released in the Earth’s crust, producing seismic waves that shake the ground. This energy release is most often caused by the movement of tectonic plates, but it can also result from volcanic activity or human activities such as mining, fracking, or the detonation of explosives.

The strength of earthquakes is commonly measured on the Richter scale, a logarithmic scale that quantifies the energy released. Each one-point increase on the scale corresponds to a tenfold increase in seismic wave amplitude, which translates to approximately \(31.6\) times more energy being released. For perspective, an earthquake typically needs a magnitude of at least 5 to cause notable damage to buildings, and there are usually one to two thousand such earthquakes worldwide each year. The strongest earthquake on record, of magnitude \(9.5\), occurred in Chile in 1960. Due to the nature of logarithmic scales, earthquakes can have negative magnitudes; such quakes happen constantly all around the world and are barely detectable even by seismographs.
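The relationship between magnitude difference and released energy can be made concrete with a one-line formula: energy scales as \(10^{1.5 \Delta M}\), so one magnitude point corresponds to \(10^{1.5} \approx 31.6\) times more energy. A small illustrative snippet (the function name is ours):

```python
def energy_ratio(delta_m: float) -> float:
    """Energy ratio between two earthquakes differing by delta_m magnitude points.

    Energy scales as 10 ** (1.5 * magnitude), so the ratio is 10 ** (1.5 * delta_m).
    """
    return 10 ** (1.5 * delta_m)

print(round(energy_ratio(1), 1))  # one point up: ~31.6 times more energy
print(round(energy_ratio(2), 1))  # two points up: 1000 times more energy
```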

1.3. Dataset

Our dataset is a collection of data about earthquakes recorded worldwide from January 1990 to July 2023. It consists of approximately \(3.4\) million records, of which almost \(17,000\) are duplicates; no record contains null values.

The dataset has the following columns:

  • time: Unix timestamp in milliseconds (int64)
  • place: geographical location of the event (object)
  • status: current state of the record, either reviewed or automatic (object)
  • tsunami: whether the earthquake was associated with a tsunami (a series of large ocean waves caused by an underwater disturbance), stored as a boolean value (int64)
  • significance: importance or impact level of the event, which can be used to assess its potential consequences (int64)
  • data_type: type of data being referenced (object)
  • magnitudo: magnitude of the earthquake, typically measured on the Richter or moment magnitude scale (float64)
  • state: administrative division or state where the event occurred, often applicable to specific countries (object)
  • longitude: coordinate (float64)
  • latitude: coordinate (float64)
  • depth: depth at which the earthquake occurred (float64)
  • date: datetime with timezone information (object)

Following is an overview of the values in our dataset. We can see that the strongest earthquake was of magnitude \(9.1\). Interestingly, the smallest magnitude value is \(-9.99\). This is very likely an erroneous value, as an “earthquake” this weak could be caused by even a grain of sand falling to the ground. We will deal with this in the next step.

        time          tsunami       significance  magnitudo      longitude      latitude       depth
count   3.445751e+06  3.445751e+06  3.445751e+06   3.445751e+06   3.445751e+06   3.445751e+06   3.445751e+06
mean    1.247124e+12  4.434447e-04  7.400973e+01   1.774076e+00  -1.012876e+02   3.746483e+01   2.285387e+01
std     2.976292e+11  2.105346e-02  1.016364e+02   1.291055e+00   7.697416e+01   2.041577e+01   5.484938e+01
min     6.311534e+11  0.000000e+00  0.000000e+00  -9.990000e+00  -1.799997e+02  -8.442200e+01  -1.000000e+01
25%     1.024401e+12  0.000000e+00  1.300000e+01   9.100000e-01  -1.464274e+02   3.406400e+01   3.120000e+00
50%     1.282338e+12  0.000000e+00  3.300000e+01   1.460000e+00  -1.189538e+02   3.793567e+01   7.700000e+00
75%     1.508701e+12  0.000000e+00  8.100000e+01   2.300000e+00  -1.159277e+02   4.784800e+01   1.612000e+01
max     1.690629e+12  1.000000e+00  2.910000e+03   9.100000e+00   1.800000e+02   8.738600e+01   7.358000e+02

1.4. Transformation

I transformed the dataset by first removing the duplicates. After that, I removed all \(111,000\) earthquakes with magnitude smaller than 0, as these have no real-world impact and removing them makes the dataset cleaner and easier to work with.
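These two cleaning steps are straightforward in pandas; a minimal sketch, using a toy DataFrame in place of the real \(3.4\)-million-row dataset:

```python
import pandas as pd

# Toy data standing in for the real dataset; the real one has the columns
# described above, in particular "magnitudo".
df = pd.DataFrame({
    "magnitudo": [1.2, 1.2, -9.99, 3.4],
    "state": ["Alaska", "Alaska", "California", "Chile"],
})

df = df.drop_duplicates()      # remove the ~17,000 duplicate records
df = df[df["magnitudo"] >= 0]  # drop negative-magnitude events
print(len(df))                 # 2 rows remain in this toy example
```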

The dataset also contains inconsistent denotations of US states, which are sometimes referred to by their abbreviation and other times by their name (e.g., “AK” and “Alaska”). I converted all state abbreviations to their full names to unify the naming.
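One way to perform this normalization is a dictionary lookup via pandas’ `replace`; a sketch, with only a few of the fifty abbreviations shown (the mapping name is ours):

```python
import pandas as pd

# Illustrative subset; the real mapping covers all US state abbreviations.
ABBREV_TO_NAME = {"AK": "Alaska", "CA": "California", "HI": "Hawaii"}

df = pd.DataFrame({"state": ["AK", "Alaska", "CA", "Chile"]})

# Values not present in the mapping (e.g. "Chile") are left untouched.
df["state"] = df["state"].replace(ABBREV_TO_NAME)
print(df["state"].tolist())  # ['Alaska', 'Alaska', 'California', 'Chile']
```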

Other than that, the dataset doesn’t seem to have any more inconsistencies that would cause us trouble.

2. Data Exploration

2.1. Temporal Analysis

Let us first visualize the distribution of earthquakes by their magnitudes. Looking at Figure 1, we can see that the majority of earthquakes in this dataset are weak: nearly \(2.4\) million of the \(3.4\) million earthquakes have a magnitude smaller than 2, which cannot even be felt by humans and can only be detected by seismographs.

Figure 1: Number of earthquakes by magnitude

We can also analyse whether the occurrence of earthquakes in the world depends on the time of the year. In Figure 2, we see that the number of earthquakes is stable throughout the year and does not depend on the month. This result is expected, since the movement of tectonic plates is not related to the position of Earth relative to the Sun.

Figure 2: Adjusted number of earthquakes per month
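The month-by-month comparison behind Figure 2 can be sketched as follows. We assume the adjustment accounts for unequal month lengths by dividing each month’s count by its average number of days; the toy dates stand in for the real `date` column:

```python
import pandas as pd

# Toy data standing in for the real `date` column.
df = pd.DataFrame({"date": pd.to_datetime([
    "2001-01-15", "2001-01-20", "2001-02-10", "2001-04-05",
])})

counts = df["date"].dt.month.value_counts().sort_index()

# Average days per calendar month (February averaged over leap years);
# dividing by it adjusts for February being shorter than, say, July.
days_in_month = pd.Series(
    [31, 28.25, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
    index=range(1, 13),
)
adjusted = counts / days_in_month.loc[counts.index]
print(adjusted.round(3))  # earthquakes per day, by calendar month
```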

Let us also look at how the number of earthquakes changes over time. In Figure 3, we can notice an upward trend in their frequency. There are several potential explanations for this finding. First, the simplest: there may genuinely be more earthquakes now than before. Second, seismographs may have become more sensitive over the years, allowing them to detect smaller earthquakes. Third, the Earth may now be better covered by seismographs, which would detect more remote earthquakes. Lastly, earthquakes may now be recorded more carefully than before.

Figure 3: Number of earthquakes per year

We can test whether the frequency of earthquakes is increasing by looking only at earthquakes of higher magnitudes. We can be confident that these earthquakes would be detected even by distant and weaker seismographs, and would likely be recorded and included in our dataset. Looking at Figure 4, we can see that the number of earthquakes stronger than magnitude 6 is not increasing and is relatively stable, with some random yearly fluctuations. This indicates that the trend seen in Figure 3 is not caused by earthquakes becoming more numerous, but rather by the factors discussed above. The literature confirms our assumption [1], and the most likely explanation seems to be the increased number of seismographs, resulting in better earthquake detection [2].

Figure 4: Number of earthquakes per year with magnitude higher than 6

2.2. Geographical Analysis

The strongest earthquakes are caused by tectonic plate movement. Tectonic plates interact in several ways, and we can classify plate boundaries into three primary types: convergent, divergent, and transform.

  1. Divergent boundaries
    • At divergent boundaries, tectonic plates move away from each other. The tension generated by this movement can cause earthquakes, but they are generally less severe than those at other boundaries.
  2. Convergent boundaries
    • At convergent boundaries, tectonic plates move toward each other and collide. The friction at these boundaries creates immense pressure, which can result in significant earthquakes.
  3. Transform boundaries
    • At transform boundaries, the plates slide past each other, generating substantial pressure that also results in earthquakes.

In Figure 5, we can see the strongest earthquakes in our dataset overlaid over tectonic plates. As expected, the vast majority of earthquakes have occurred along the borders of tectonic plates. Additionally, we can notice that the most active earthquake regions are located along convergent boundaries. For example, Chile is located at the convergence of the Nazca Plate and the South American Plate, and Japan is positioned at the convergence of four tectonic plates. An example of a transform boundary is the San Andreas Fault in coastal California, where many earthquakes occur, though they are typically not as strong. An example of a divergent boundary is between the North American Plate and the Eurasian Plate, where some earthquakes also occur, but they are significantly weaker and less numerous.

Figure 5: Earthquakes of magnitude >6.5 overlaid over tectonic plates

3. Earthquake Prediction

3.1 Overview of Earthquake Prediction

Earthquake prediction aims to forecast the timing, location, and magnitude of earthquakes. Various methods have been explored, including analysing patterns of seismic activity and analysing correlations between animal behaviour and impending earthquakes. To date, there has been no reproducible model that can predict earthquakes even to a specific month [3]. The major problem with unreliable predictions is that unless we can predict the exact day and strength of an earthquake, the prediction could do more harm than good by causing panic. Additionally, the resources spent on unnecessary evacuations would be much better spent on improving the durability of infrastructure threatened by earthquakes [4].

3.2 Modeling Earthquake Prediction

Due to the aforementioned grim prospects for predicting the exact strength, day, and location, we will simplify our task to predicting whether an earthquake stronger than magnitude 6 will occur in a given month in a given region of the world. These regions correspond to the “state” column in our dataset, which can hold entire countries, or subdivisions in the case of larger countries (for example, the United States is subdivided into California, Florida, etc.).

We will need a baseline model to compare our results to. For that purpose, we will create a simple prediction algorithm: we will look at all data in a given state prior to the given month and calculate how many of the previous months had an earthquake of magnitude at least 6. If the fraction of these months is larger than some threshold, the model will predict an earthquake happening this month.
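The baseline described above can be sketched in a few lines (the `baseline_predict` helper and its toy history are illustrative):

```python
# For each (state, month) pair, the baseline looks at all earlier months in
# that state and predicts a strong earthquake if the historical fraction of
# "strong" months exceeds `threshold`.
def baseline_predict(history: list[bool], threshold: float) -> bool:
    """history[i] is True if month i (before the target month) had an
    earthquake of magnitude >= 6 in this state."""
    if not history:
        return False
    fraction = sum(history) / len(history)
    return fraction > threshold

# Toy example: 3 strong months out of 12 observed -> fraction 0.25.
past = [True, False, False, True, False, False,
        False, True, False, False, False, False]
print(baseline_predict(past, threshold=0.2))  # True  (0.25 > 0.2)
print(baseline_predict(past, threshold=0.3))  # False (0.25 <= 0.3)
```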

Our main model will be based on the following thought: earthquakes caused by tectonic plate movement occur due to pressure building up, which is, after some time, suddenly released. The idea is that we might be able to observe this pressure by looking at previous months’ data. Therefore, when predicting whether an earthquake of magnitude at least 6 happens in a given state during a given month, we will analyse the state’s previous 18 months and look for the following features:

  1. How many total earthquakes happened.
  2. What the average magnitude of the earthquakes was.
  3. What the maximum magnitude of the earthquakes was.
  4. How many earthquakes of magnitude at least 4 happened.
  5. How many earthquakes of magnitude at least 5 happened.
  6. How many earthquakes of magnitude at least 6 happened.

This should hopefully give our model enough information to make successful predictions. We will be using 2015 as a cutoff year, where data before and including 2015 will be used to train our model, and data after 2015 to test it.
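A minimal sketch of how these six features might be computed for one state’s 18-month window; the real pipeline would slide this window over every (state, month) pair and split the resulting rows at the 2015 cutoff (the `window_features` helper is ours):

```python
import pandas as pd

def window_features(mags: pd.Series) -> dict:
    """The six features, computed over the magnitudes recorded in one
    state's previous 18 months."""
    return {
        "n_quakes": len(mags),          # 1. total earthquakes
        "avg_mag": mags.mean(),         # 2. average magnitude
        "max_mag": mags.max(),          # 3. maximum magnitude
        "n_mag4": (mags >= 4).sum(),    # 4. count of magnitude >= 4
        "n_mag5": (mags >= 5).sum(),    # 5. count of magnitude >= 5
        "n_mag6": (mags >= 6).sum(),    # 6. count of magnitude >= 6
    }

# Toy window of magnitudes from one state's previous 18 months.
window = pd.Series([2.0, 4.5, 6.1, 3.3, 5.2])
feats = window_features(window)
print(feats["n_quakes"], feats["max_mag"], feats["n_mag4"])  # 5 6.1 3
```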

3.3 Results

Baseline model

The only parameter for our baseline model is the threshold we choose. By setting the threshold too low, we will be predicting a lot of earthquakes, leading to a high recall, but a low precision. By increasing the threshold, we can increase our precision, but at the cost of recall. Below are results for different threshold values.

Threshold  Precision  Recall  F1 Score
0.05       0.09       0.86    0.16
0.1        0.12       0.60    0.20
0.15       0.16       0.59    0.25
0.2        0.20       0.43    0.27
0.3        0.21       0.31    0.25
0.5        0.15       0.12    0.13
0.7        0.04       0.01    0.02

We can see that the best F1 score was obtained by setting the threshold to \(0.2\). Interestingly, after a certain point, increasing the threshold decreased both recall and precision.
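The precision, recall, and F1 numbers in these tables can be computed from a model’s predictions with scikit-learn’s metrics; a sketch with made-up labels standing in for one threshold setting:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up ground truth (did a magnitude >= 6 quake occur?) and predictions.
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1, 0, 1]

# precision = TP / (TP + FP); recall = TP / (TP + FN);
# F1 is their harmonic mean.
print(round(precision_score(y_true, y_pred), 2))  # 0.6
print(round(recall_score(y_true, y_pred), 2))     # 0.75
print(round(f1_score(y_true, y_pred), 2))         # 0.67
```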

Main model

The first algorithm I tested was logistic regression. No matter the parameters I tried, however, this model didn’t even manage to beat the baseline. The second algorithm I tried was random forest, which yielded noticeably better results than logistic regression. The best results were reached with the following parameters:

  • n_estimators=200
  • min_samples_split=10
  • class_weight="balanced"

The other parameters were left default as per the sklearn.ensemble.RandomForestClassifier class. With these parameters, the model had the following results:

Model                    Precision  Recall  F1 Score
Random Forest            0.31       0.37    0.34
Baseline (best results)  0.20       0.43    0.27
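The random forest configuration above can be reproduced with scikit-learn as follows; synthetic data stands in for the real feature matrix of six features per (state, month) row:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real feature matrix (six features per row).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

model = RandomForestClassifier(
    n_estimators=200,         # number of trees in the forest
    min_samples_split=10,     # minimum samples required to split a node
    class_weight="balanced",  # compensate for the rarity of strong quakes
    random_state=0,
)
model.fit(X, y)
print(model.score(X, y))  # training accuracy on the synthetic data
```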

Compared to the baseline model, we observe an increase in precision and a smaller decrease in recall, resulting in a higher F1 score.

4. Conclusion

We analysed data on earthquakes and reached reasonable conclusions, such as the fact that their occurrence is independent of the time of year. We observed that earthquakes typically occur along tectonic plate boundaries, with the strongest occurring near convergent boundaries.

We created a model for earthquake prediction that performed slightly better than a baseline algorithm. Despite that, the model would still be unusable in the real world due to its high false positive rate and the fact that it can only predict in which month an earthquake would happen. For future work, it would be interesting to experiment with different features and more models in order to reach better performance.