LA Crime Data Analysis


Introduction

Crime is a societal issue that impacts communities and cities around the world. Analyzing crime data helps to understand the nature, scale, and patterns of criminal activities, enabling better decision-making in law enforcement and policy creation. This project focuses on exploratory analysis of crime in Los Angeles, the second-most populous city in the United States, from 2020 through the first quarter of 2023.

The objective of this analysis is to reveal patterns and trends in crime, understand the demographics of victims, determine peak times and locations for various crimes, and gain other useful insights that can assist in developing effective crime prevention strategies.

Data for this project comes from LA Crime Data available on Kaggle. The dataset contains details about each reported crime event, including the date, type of crime, location, and demographics of the victim.

The primary tools used for this analysis were R, SQL, and Python. R was used to clean the data with the tidyverse library, which assists with data import, tidying, and manipulation. SQL's powerful querying capabilities allowed complex analysis to be performed directly on the database. For further visualizations, Python libraries such as pandas, matplotlib, and seaborn were used on the dataframes generated from the SQL query results.

The report includes detailed explanations of the data cleaning process, the SQL queries used for the analysis, the visualizations generated from the analysis, and a discussion of the potential implications of the results.

Data Cleaning and Preprocessing

Before delving into the analysis, an initial step of data cleaning and preprocessing was carried out to ensure data quality and integrity. The tool of choice for this phase was R, a language specifically designed for statistical analysis and data manipulation, along with its robust 'tidyverse' package, which includes a collection of efficient tools for data cleaning.

Initially, the dataset was inspected for null values using colSums(is.na(crime_data)) to find the total number of null values in each column. It was found that several columns contained a significant percentage of missing data. It is crucial to deal with missing values, since they can skew the analysis and lead to incorrect conclusions.
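For readers following along in Python (which is used later for the visualizations), roughly the same inspection can be reproduced with pandas. The sketch below is for illustration only and assumes the data sits in a CSV named la_crime.csv.

import pandas as pd

# Load the raw dataset (file name assumed for illustration)
crime_data = pd.read_csv("la_crime.csv")

# Null count and share of missing values per column,
# the pandas counterpart of colSums(is.na(crime_data)) in R
print(pd.DataFrame({
    "null_count": crime_data.isna().sum(),
    "percent_missing": (crime_data.isna().mean() * 100).round(2),
}))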

Specifically, columns such as 'Mocodes', 'Vict_Sex', 'Vict_Descent', 'Weapon_Used_Cd', and 'Weapon_Desc' had missing data. Rather than discarding these rows entirely, which might have resulted in a significant loss of information, the decision was made to impute the missing values. For categorical variables, missing values were replaced with 'Unknown', allowing these records to be retained for analysis without creating misleading results.

For instance, 'Weapon_Used_Cd' and 'Weapon_Desc', despite having a high percentage of missing data, were considered important for the analysis. The assumption was made that a missing value in these columns indicated that the weapon was unknown, so they were filled with 0 and 'Unknown' respectively.

For 'Vict Age', on the other hand, null values were set to 0. Negative values, which could be data entry errors, were filtered out during analysis, as there were only a small number of them.

Apart from these, 'Premis_Cd', 'Status Desc', and 'Crm Cd 1' had very few null values, so removing them would not make a significant difference to this large dataset. Rows with null values in these columns were therefore removed.
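The cleaning itself was carried out in R with tidyverse. Purely for illustration, an equivalent pandas sketch of the imputation and filtering logic described above could look like the following; the underscored column names and the file name are assumptions and may need to be adjusted to match the raw CSV headers.

import pandas as pd

crime_data = pd.read_csv("la_crime.csv")  # file name assumed

# Replace missing categorical values with 'Unknown' rather than dropping the rows
for col in ["Mocodes", "Vict_Sex", "Vict_Descent", "Weapon_Desc"]:
    crime_data[col] = crime_data[col].fillna("Unknown")

# A missing weapon code is treated as 'weapon unknown'
crime_data["Weapon_Used_Cd"] = crime_data["Weapon_Used_Cd"].fillna(0)

# Missing victim ages become 0; negative ages are filtered out later during analysis
crime_data["Vict_Age"] = crime_data["Vict_Age"].fillna(0)

# These columns have very few nulls, so the affected rows are simply dropped
crime_data = crime_data.dropna(subset=["Premis_Cd", "Status_Desc", "Crm_Cd_1"])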

This preprocessing step produced a clean, structured dataset, setting a solid foundation for the ensuing exploratory data analysis and visualization steps.

Exploratory Data Analysis

The Exploratory Data Analysis (EDA) phase was primarily executed using SQL, with the intent of gaining insights into the patterns and relationships within the dataset. SQL, with its powerful querying and data manipulation capabilities, was an ideal choice for this exercise. The visualizations of the data returned by the SQL queries were done in Python, using psycopg2 to connect to the database, pandas to build dataframes, and matplotlib and seaborn to generate the plots. The analysis was guided by a set of key questions designed to explore the structure, characteristics, and patterns within the data.
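As a rough sketch of that workflow, a query result can be pulled into a pandas dataframe as shown below; the connection parameters are placeholders, and the table name la_crime matches the queries used later in this post.

import pandas as pd
import psycopg2

# Placeholder connection details; replace with your own database credentials
conn = psycopg2.connect(
    host="localhost", dbname="crime_db", user="postgres", password="password"
)

# Run a SQL query and load the result into a dataframe for plotting
query = "SELECT area_name, COUNT(*) AS total_crimes FROM la_crime GROUP BY area_name;"
df = pd.read_sql_query(query, conn)
conn.close()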

1. What is the most common type of crime (based on 'Crm Cd Desc') in each area (based on 'AREA NAME')?

A combination of grouping and ordering SQL commands helped identify the most common crime in each area. Using a window function like ROW_NUMBER() made it possible to return the crime with the highest occurrence in each area. This provides insight into local crime dynamics and potential area-specific crime issues. The SQL query is shown below:

WITH crime_counts AS (
    SELECT area_name, crm_cd_desc,
        COUNT(*) AS count_per_area_per_crime,
        ROW_NUMBER() OVER (PARTITION BY area_name ORDER BY COUNT(*) DESC) AS rn
    FROM la_crime
    GROUP BY area_name, crm_cd_desc
)
SELECT area_name, crm_cd_desc AS most_common_crime, count_per_area_per_crime
FROM crime_counts
WHERE rn = 1;
[Bar plot: most common crime per area]

The bar plot suggests that 'Vehicle - Stolen' is the most common crime across most areas. Exceptions include 'Central', where 'Burglary from Vehicle' tops the list, 'Hollywood' with 'Battery - Simple Assault', and 'Burglary' in 'Topanga', 'West LA', and 'Wilshire'.
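A minimal seaborn sketch of how such a bar plot can be produced is shown below; df is assumed to hold the output of the query above (area_name, most_common_crime, count_per_area_per_crime), loaded via pd.read_sql_query as sketched earlier.

import matplotlib.pyplot as plt
import seaborn as sns

# df is assumed to contain the query result: one row per area with its most common crime
plt.figure(figsize=(12, 6))
sns.barplot(data=df, x="area_name", y="count_per_area_per_crime",
            hue="most_common_crime", dodge=False)
plt.xticks(rotation=90)
plt.xlabel("Area")
plt.ylabel("Crime count")
plt.title("Most common crime per area")
plt.tight_layout()
plt.show()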

Please visit the GitHub repo for the entire code. Thank you!


https://github.com/niranjanblank/LACrimeDataAnalysis/tree/main