Data Cleaning from Global Terrorism Database (GTD)

by Maria Jane Poncardas, 06/14/19

Data preprocessing:

The data was collected from the Global Terrorism Database (GTD) website (https://www.start.umd.edu/gtd/), an open-source database with annual updates on worldwide terrorist events from 1970 to 2017. The data file was obtained from the website by hovering over the “USING GTD” option and then selecting “Download GTD”. On the resulting page, the “Action” field has a drop-down menu that covers general inquiries and acquisition of the GTD file. The file contains comprehensive information on each event, including news sources, among others. The researchers selected only the information most relevant to the study:



  1. Date

  2. City and Province

  3. Latitude and longitude

  4. Attack Type:

    1. Bombing/Explosion

    2. Armed Assault

    3. Hostage Taking (Kidnapping)

    4. Assassination

    5. Facility/Infrastructure Attack

    6. Hijacking

    7. Hostage Taking (Barricade Incident)

    8. Unarmed Assault

  5. Target type:

    1. Military

    2. Private Citizens & Property

    3. Government

    4. Business

    5. Police

    6. Transportation

    7. Utilities

    8. Educational Institution

    9. Religious Figures/Institutions

    10. Journalists & Media

    11. Maritime

    12. Telecommunication

    13. Terrorists/Non-State Militia

    14. NGO

    15. Food or Water Supply

    16. Tourists

    17. Airports and Aircraft

  6. Weapon details:

    1. Firearms

    2. Explosives

    3. Incendiary

    4. Melee

    5. Chemical

    6. Sabotage equipment

  7. Doubted terrorism

  8. Suicide

  9. Infiltrator group name

  10. Number of dead victims

  11. Number of injured
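
As a rough illustration of this selection step, the sketch below keeps only the listed fields in pandas. The column names are assumed to follow the GTD codebook (`iyear`, `attacktype1_txt`, `gname`, and so on) and should be checked against the header of the downloaded file; `select_relevant` is a hypothetical helper, not the researchers' code.

```python
import pandas as pd

# Assumed GTD column names; verify against the header of the downloaded file.
RELEVANT_COLUMNS = [
    "iyear", "imonth", "iday",   # date
    "city", "provstate",         # city and province
    "latitude", "longitude",
    "attacktype1_txt",           # attack type
    "targtype1_txt",             # target type
    "weaptype1_txt",             # weapon details
    "doubtterr",                 # doubted terrorism
    "suicide",
    "gname",                     # group name
    "nkill", "nwound",           # number of dead and injured
]


def select_relevant(gtd: pd.DataFrame) -> pd.DataFrame:
    """Return only the columns of interest, failing loudly if one is missing."""
    missing = [c for c in RELEVANT_COLUMNS if c not in gtd.columns]
    if missing:
        raise KeyError(f"columns not found in GTD file: {missing}")
    return gtd[RELEVANT_COLUMNS].copy()
```

A call such as `select_relevant(pd.read_excel("globalterrorismdb.xlsx"))` would then give the reduced table; the file name here is only a placeholder.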



Data cleaning:


  1. The researchers filtered out events located outside Mindanao, i.e., events in Luzon, Visayas, and foreign localities.

  2. Using the reverse_geocoder module in Python, the latitude and longitude of each event were used to add locality information, i.e., city/municipality names and provinces.

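import pandas as pd
import reverse_geocoder as rg

# rg.search expects (latitude, longitude) tuples
coordinates = list(mindanao_terrorism['latitude_longitude'])
results = rg.search(coordinates)  # one dict of locality fields per coordinate
city_municipality = pd.DataFrame(results)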
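
Steps 1 and 2 leave a few details implicit: how the Mindanao rows were selected, how the `latitude_longitude` column was built, and how the geocoder output lines up with the original rows. The sketch below shows one possible way to handle these, not necessarily what the researchers did. It assumes GTD-style `provstate`, `latitude`, and `longitude` columns, a hypothetical and deliberately shortened set of Mindanao provinces, and a ` (rg)` suffix for the geocoder columns. The geocoder results are passed in as a plain list of dicts so that the example runs without calling `rg.search`.

```python
import pandas as pd

# Hypothetical, shortened set; a real run would list every Mindanao province.
MINDANAO_PROVINCES = {"Lanao del Norte", "Maguindanao", "Sulu"}


def prepare_mindanao(gtd: pd.DataFrame) -> pd.DataFrame:
    """Keep Mindanao rows that have coordinates and add (lat, lon) tuples."""
    keep = gtd["provstate"].isin(MINDANAO_PROVINCES)
    out = gtd[keep].dropna(subset=["latitude", "longitude"]).copy()
    out["latitude_longitude"] = list(zip(out["latitude"], out["longitude"]))
    # A fresh 0..n-1 index lets later results be joined by position.
    return out.reset_index(drop=True)


def attach_localities(events: pd.DataFrame, results: list) -> pd.DataFrame:
    """Join geocoder results (one dict per row, same order) column-wise."""
    if len(results) != len(events):
        raise ValueError("expected exactly one geocoder result per event")
    localities = pd.DataFrame(results).add_suffix(" (rg)")
    return pd.concat([events, localities], axis=1)
```

Resetting the index matters because `pd.concat(..., axis=1)` aligns rows on index labels; without it, rows that survived the filter would be paired with the wrong geocoder results or padded with missing values.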


  1. The data frame of augmented cities and municipalities was then concatenated to the GTD data so that each terrorist event retained its information.

  2. The Philippine shapefile was used as a reference to detect capitalization errors, misspellings, and typos in the city and municipality names produced by the reverse geocoder.

(Cities not found in the shapefile indicate erroneous information generated by the reverse geocoder.)


# unique names from the reverse geocoder and from the shapefile
gtd_cities = list(concat['city_municipality'].sort_values().unique())
shape_data = list(phil_shapefile['City'].sort_values().unique())
# names the geocoder produced that the shapefile does not contain
cities_not_in_shapefile = sorted(set(gtd_cities) - set(shape_data))

for city in cities_not_in_shapefile:
    # coordinates of the first event recorded under this name
    rows = concat[concat['city_municipality'] == city]
    print(city, rows['latitude'].iloc[0], rows['longitude'].iloc[0])

The for loop prints each unmatched city/municipality name together with the coordinates of one of its events.
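
To narrow down the manual checks, the unmatched names can also be compared against the shapefile names by string similarity. The following is a minimal sketch using Python's standard `difflib`; `suggest_corrections` and its `cutoff` value are illustrative assumptions rather than part of the original workflow, and the suggestions it returns are only candidates that still need to be verified.

```python
from difflib import get_close_matches


def suggest_corrections(unmatched, reference, cutoff=0.6):
    """Map each unmatched name to up to three similar reference names."""
    # Compare in lowercase so that capitalization errors still match.
    lowered = {name.lower(): name for name in reference}
    suggestions = {}
    for name in unmatched:
        hits = get_close_matches(name.lower(), list(lowered), n=3, cutoff=cutoff)
        suggestions[name] = [lowered[h] for h in hits]
    return suggestions
```

For example, `suggest_corrections(cities_not_in_shapefile, shape_data)` would return a dictionary keyed by the unmatched names. Names that were renamed or carved out of another municipality will usually have no close match and must still be looked up by hand.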


  1. Each incorrect city/municipality name was manually verified through Google Maps and Wikipedia, since most of these cities and municipalities have been renamed or were carved out of another city/municipality.

  2. Once verified, all records that share the same erroneous city/municipality name are corrected automatically with this code:

city_toedit = "Iligan"
city_correct = "Iligan City"

# dictionary of the matching rows: keys are the row index labels,
# values are the incorrect city name
error_city = concat[concat['name (rg)'] == city_toedit]['name (rg)'].to_dict()

for i in error_city.keys():  # overwrite the name row by row
    concat.at[i, 'name (rg)'] = city_correct
    print('changed from ' + city_toedit + ' to ' + city_correct)
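
When several names have been verified, the same correction can be applied in one pass with a lookup dictionary instead of editing one name at a time. This is a sketch of that alternative, not the code used in the study; `apply_corrections` is a hypothetical helper, and the mapping would be filled in from the manual verification.

```python
import pandas as pd


def apply_corrections(df: pd.DataFrame, column: str, corrections: dict) -> pd.DataFrame:
    """Return a copy of df where exact matches in `column` are replaced."""
    fixed = df.copy()
    # Series.replace with a dict swaps whole values, not substrings,
    # so "Iligan City" is left alone when "Iligan" is a key.
    fixed[column] = fixed[column].replace(corrections)
    return fixed
```

A call such as `apply_corrections(concat, 'name (rg)', {"Iligan": "Iligan City"})` returns a corrected copy and leaves the original data frame untouched, which makes it easy to compare the two before overwriting anything.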

