Exploratory Data Analysis of Los Angeles City Parcels

Introduction and Problem Statement

In recent years, the real estate market in California, USA, has trended upward. According to a report by the California Department of Finance, the average price of single-family residential homes was 808K USD in September 2021, a 13.5% increase within a year. More specifically, the median sale price in Los Angeles (LA) County has risen by about 15% in the last year. So far in 2022, a 12% increase has been observed in the LA County real estate market. Keeping the current inflation rate (8.5%) in mind, the questions to be answered are:

  1. What could be the underlying reasons for the house prices to go up?

  2. How were the trends in LA County in the previous years?

To find out, the Assessor Parcels Data for LA County was selected to investigate total parcel values over the last 16 years. A Machine Learning (ML) model was then developed to estimate the most recent parcel prices in LA City.

Exploratory Data Analysis

As of April 2022, the Assessor Parcels Data - 2006 thru 2021 comprises 38,170,298 instances and 51 features for a total of 2,485,732 parcels located in LA County. In terms of general use type, more than 89% of the parcels are classified as Residential, and 4.2% are Commercial. The rest of the distribution of this column can be seen in Table 1. Property types of the parcels are classified into five categories: (i) SFR (Single Family Residence), (ii) CND (Condominium), (iii) R-I (Residential-Income), (iv) C/I (Commercial/Industrial), and (v) Other. Another column in the dataset is the number of units located in each parcel. Parcels with a single unit account for 74% of the data; of the remaining parcels, 12% have no unit, 5.4% have 2 units, 2.2% have 4 units, and 1.61% have 3 units.

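Table 1 - Distribution of Parcels in LA County

| General Use Type | Ratio (%) | Property Type | Ratio (%) | Unit  | Ratio (%) |
|------------------|-----------|---------------|-----------|-------|-----------|
| Residential      | 89.3      | SFR           | 62.6      | 1     | 73.9      |
| Commercial       | 4.2       | CND           | 12.3      | 0     | 12.1      |
| Industrial       | 2.2       | R-I           | 10.3      | 2     | 5.4       |
| Dry Farm         | 2.1       | C/I           | 5.7       | 4     | 2.1       |
| Other            | 2.6       | Other         | 1.6       | Other | 6.5       |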
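The shares reported in Table 1 are simple category frequencies. As a minimal sketch of that calculation, assuming one record per parcel, the following uses only the Python standard library; the sample values are hypothetical and are not taken from the Assessor dataset.

```python
from collections import Counter


def category_shares(values):
    """Return {category: percentage share}, rounded to one decimal place."""
    counts = Counter(values)
    total = sum(counts.values())
    # most_common() orders categories from most to least frequent.
    return {cat: round(100 * n / total, 1) for cat, n in counts.most_common()}


# Hypothetical sample of a general-use-type column (illustrative only).
sample_use_types = ["Residential"] * 8 + ["Commercial"] + ["Industrial"]
shares = category_shares(sample_use_types)
print(shares)
```

The same function could be applied to the property type and unit columns to fill the remaining columns of such a table.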

Feature Explorations

Because the original dataset was very large, parcels outside the City of Los Angeles were excluded from data exploration. Each parcel was located based on its coordinates, and its neighborhood was determined from the geographic dataset of LA City neighborhoods. In total, 114 neighborhoods were found within the LA City boundaries. These neighborhoods can be seen in Figure 1, which also highlights the average total value for each neighborhood in 2021.
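Assigning a parcel to a neighborhood is a point-in-polygon problem. The sketch below illustrates the idea with a plain ray-casting test rather than a GIS library, so it is a stand-in for whatever spatial join was actually used. The two rectangular "neighborhoods" and the sample coordinates are hypothetical; only the names CENTER_LAT and CENTER_LON come from the text.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) lies inside the polygon.

    `polygon` is a list of (lon, lat) vertices; the last vertex is
    implicitly connected back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def assign_neighborhood(lon, lat, neighborhoods):
    """Return the first neighborhood containing the point, or None."""
    for name, polygon in neighborhoods.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None  # the parcel lies outside every polygon


# Hypothetical neighborhood outlines (not real LA City boundaries).
neighborhoods = {
    "A": [(-118.30, 34.00), (-118.20, 34.00), (-118.20, 34.10), (-118.30, 34.10)],
    "B": [(-118.20, 34.00), (-118.10, 34.00), (-118.10, 34.10), (-118.20, 34.10)],
}

parcel = {"CENTER_LON": -118.25, "CENTER_LAT": 34.05}
parcel_neighborhood = assign_neighborhood(
    parcel["CENTER_LON"], parcel["CENTER_LAT"], neighborhoods
)
print(parcel_neighborhood)
```

Parcels for which the function returns None fall outside all neighborhood polygons and would be dropped when restricting the data to LA City.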

Figure 1 - Average Total Value of Single-Family Residence Parcels in LA City


In addition, three new features were extracted from the geographic dataset for LA County parcels: ShapeSTArea, ShapeSTLen, and geometry. These features were verified using GeoPandas functions that locate the latitudes and longitudes of the parcels, which correspond to the original variables CENTER_LAT and CENTER_LON in the dataset. Regression analyses were then conducted between the total value of the parcel and each chosen feature. As depicted in Figure 2, a few visualizations were created to investigate whether there is a valid pattern between the total value and the selected features. Some features were excluded because they showed no meaningful correlation with the total parcel values.
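A per-feature regression of this kind can be summarized by a correlation coefficient and a fitted line. The following is a minimal sketch using ordinary least squares and the Pearson coefficient; the text does not state which regression method was used, and the sample numbers are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical feature values and total values (illustrative only).
sqft_main = [1000, 1500, 2000, 2500]
total_value = [400_000, 600_000, 800_000, 1_000_000]

r = pearson_r(sqft_main, total_value)
slope, intercept = fit_line(sqft_main, total_value)
print(r, slope, intercept)
```

A feature whose coefficient stays close to zero across the data would be a candidate for exclusion, in the spirit of the screening described above.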

Data explorations were also conducted beyond the original features. Since the scope of the project is to investigate parcel values, aspects related to quality of living were considered and reviewed. One aspect was transit accessibility, and two features, BusBenchClosestDist and SubwayStopClosestDist, were created to understand whether transit accessibility is reflected in the parcel value.

Another aspect was the safety of the neighborhoods. For this reason, the most recent crime data were retrieved from the LA City website. Initially, to avoid introducing bias, crimes that happened beyond the dwelling property line were not considered. As a result, only trespassing data was selected for the exploration. However, further analysis suggested no significant relationship between the annual trespassing incidence and the average parcel value in the neighborhoods. Instead, a moderate pattern was discovered between crime incidence and the total parcel value when all crime types were included. Consequently, a new feature, crime_count, was created and added to the main dataset.

To increase the number of features in the dataset, the tidiness of neighborhood streets was considered, and a geographic dataset was downloaded from the LA City GeoHub website. As a result, five features were created to capture the relationship between neighborhood street tidiness and parcel values: (i) cleanliness score (C_score), (ii) bulky items score (BI_score), (iii) illegal dumping score (ID_score), (iv) weed score (WD_score), and (v) litter score (LL_score).

Data Cleaning and Preparation
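The closest-distance and crime-count features can be sketched as follows. This is an illustration under stated assumptions: great-circle (haversine) distance in kilometers is assumed because the text does not give the distance measure or unit, crime_count is assumed to be counted per neighborhood, and all coordinates and records are hypothetical.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def closest_distance_km(lat, lon, points):
    """Distance from (lat, lon) to the nearest of the given (lat, lon) points."""
    return min(haversine_km(lat, lon, p_lat, p_lon) for p_lat, p_lon in points)


# Hypothetical subway stop locations and one parcel centroid.
subway_stops = [(34.05, -118.25), (34.10, -118.30)]
parcel = {"CENTER_LAT": 34.06, "CENTER_LON": -118.25, "neighborhood": "A"}
parcel["SubwayStopClosestDist"] = closest_distance_km(
    parcel["CENTER_LAT"], parcel["CENTER_LON"], subway_stops
)

# Hypothetical crime records, already tagged with a neighborhood name.
crime_neighborhoods = ["A", "B", "A", "A", "C"]
crime_counts = Counter(crime_neighborhoods)
# Counter returns 0 for neighborhoods with no recorded incidents.
parcel["crime_count"] = crime_counts[parcel["neighborhood"]]
print(parcel)
```

For millions of parcels, a brute-force minimum over all stops would be slow; a spatial index would be the usual choice, which this sketch leaves out for brevity.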

As a first step, data cleaning was carried out. Data for residential parcels, specifically single-family residence parcels, was first filtered from the raw dataset. The cleaning process kept every parcel with “Residential” for its GeneralUseType, “SFR” for its PropertyType, and “1” for its Unit number. The number of unique parcels was reduced from 2,485,732 to 639,663, which indicated that about 26% of the parcels in LA County were Residential - SFR - 1 Unit parcels in LA City.

During the cleaning process, a small portion of the dataset, such as unusual parcels (e.g., modular homes and planned developments), was also removed based on the PropertyUseCode column. As an example, single-family residences were classified based on certain criteria: (i) “0100” represents “Single Family Residence with no pool”, (ii) “0101” represents “Single Residence with Pool”, (iii) “0103” represents “Single Residence with Pool and misc.”, and (iv) “0104” represents “Single Residence with Therapy Pool”. For the scope of our project, only the parcel data with a PropertyUseCode of “0100”, “0101”, “0103”, or “0104” were considered, and other types of properties (accounting for less than 4%) were removed from the dataset.

One of the existing columns, totBuildingDataLines, shows how many individual structures are in the parcel. Since the Unit number was limited to “1” in the previous steps, parcels with totBuildingDataLines greater than one were also removed. Moreover, there were public parcels, such as non-taxable and government-owned parcels, in the raw dataset. To ensure the dataset contains only private parcels, these public or special-use parcels were removed by filtering the SpecialParcelClassification column. Among the original features, some columns had the same value for every parcel, and these columns were considered redundant. For example, every selected parcel had a value of “YES” for the column isTaxableParcel.
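The filtering steps above can be sketched as a single predicate plus a check for constant columns. The column names follow the text where it gives them; the `Units` column name, the record layout, and the convention that an empty SpecialParcelClassification marks a private parcel are assumptions made for illustration, and the sample rows are hypothetical.

```python
# Property use codes for single-family residences named in the text.
SFR_USE_CODES = {"0100", "0101", "0103", "0104"}


def is_target_parcel(row):
    """True for private, single-unit, single-structure SFR parcels."""
    return (
        row["GeneralUseType"] == "Residential"
        and row["PropertyType"] == "SFR"
        and row["Units"] == 1  # assumed column name for the unit count
        and row["PropertyUseCode"] in SFR_USE_CODES
        and row["totBuildingDataLines"] <= 1
        # Assumption: an empty classification means an ordinary private parcel.
        and not row["SpecialParcelClassification"]
    )


def constant_columns(rows):
    """Names of columns that hold the same value in every row."""
    if not rows:
        return []
    return [col for col in rows[0] if len({r[col] for r in rows}) == 1]


def make_row(**overrides):
    """Build a hypothetical parcel record with sensible defaults."""
    row = {
        "GeneralUseType": "Residential", "PropertyType": "SFR", "Units": 1,
        "PropertyUseCode": "0100", "totBuildingDataLines": 1,
        "SpecialParcelClassification": "", "isTaxableParcel": "YES",
    }
    row.update(overrides)
    return row


rows = [
    make_row(),
    make_row(totBuildingDataLines=2),                 # two structures
    make_row(GeneralUseType="Commercial", PropertyType="C/I", Units=0),
    make_row(PropertyUseCode="0104"),
    make_row(SpecialParcelClassification="Government Owned"),
]

kept = [r for r in rows if is_target_parcel(r)]
redundant = constant_columns(kept)
print(len(kept), redundant)
```

Columns reported by `constant_columns` carry no information for a model once the filters are applied, which is why a column such as isTaxableParcel becomes redundant.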

Figure 2 - Correlation Between Some Selected Features and Total Value


As discussed in the previous section, visualizations of every feature were created, and some of them are shown in Figure 2. Besides revealing correlation patterns, the visualizations exposed outliers and obviously misleading data that needed to be removed. Based on these observations, parcels with unreasonable building ages were removed by keeping only those whose BuiltYear and EffectiveYearBuilt fall after the year 1850. EffectiveYearBuilt describes the year of the most recent renovation for the parcel. Similarly, parcels with a LandBaseYear of 1907 were observed to have unreasonable patterns and were therefore removed.

Cleaning was also done on quantitative data to eliminate misleading outliers. Parcels with TotalValue more than 50M USD were removed. Parcels with more than 19 bedrooms and more than 20 bathrooms were removed. For the geometry of the parcels, SQFTmain was limited to a range of 500 to 40,000 square feet to be statistically clean for the model. Also, ShapeSTArea, which describes the area of the parcel property, was limited to a range of 1,000 to 600,000 square feet, and ShapeSTLen, which describes the length of the parcel property, was limited to at most 4,000 feet. To summarize, only 0.2% of the selected dataset was identified as outliers, and these were filtered out before developing an ML model.
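The year and range filters described above can be expressed as a table of limits applied row by row. This is a sketch under assumptions: the `Bedrooms` and `Bathrooms` column names and the spelling `ShapeSTArea` are hypothetical, the bedroom and bathroom limits are applied independently, boundary values are treated as inclusive, and the sample rows are invented.

```python
MIN_YEAR = 1850            # build years must fall after this year
BAD_LAND_BASE_YEAR = 1907  # land base year observed to behave unreasonably

# (lower, upper) limits per column; None means no limit on that side.
RANGE_LIMITS = {
    "TotalValue": (None, 50_000_000),
    "Bedrooms": (None, 19),
    "Bathrooms": (None, 20),
    "SQFTmain": (500, 40_000),
    "ShapeSTArea": (1_000, 600_000),
    "ShapeSTLen": (None, 4_000),
}


def is_plausible(row):
    """True if the row passes every year and range check."""
    if row["BuiltYear"] <= MIN_YEAR or row["EffectiveYearBuilt"] <= MIN_YEAR:
        return False
    if row["LandBaseYear"] == BAD_LAND_BASE_YEAR:
        return False
    for col, (low, high) in RANGE_LIMITS.items():
        value = row[col]
        if low is not None and value < low:
            return False
        if high is not None and value > high:
            return False
    return True


# Hypothetical parcel records (illustrative only).
base = {
    "BuiltYear": 1955, "EffectiveYearBuilt": 1980, "LandBaseYear": 1975,
    "TotalValue": 750_000, "Bedrooms": 3, "Bathrooms": 2,
    "SQFTmain": 1_600, "ShapeSTArea": 6_000, "ShapeSTLen": 320,
}
rows = [
    base,
    {**base, "BuiltYear": 1800},          # implausible build year
    {**base, "LandBaseYear": 1907},       # flagged land base year
    {**base, "SQFTmain": 120},            # building too small
    {**base, "TotalValue": 75_000_000},   # value above the cap
]

clean = [r for r in rows if is_plausible(r)]
removed_share = 1 - len(clean) / len(rows)
print(len(clean), removed_share)
```

Keeping the limits in one dictionary makes it easy to report the share of rows removed, the figure quoted above as 0.2% for the real data, and to adjust thresholds without touching the filtering logic.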