
Scraping Online Publications
Using PHP, we scraped the online archives of a monthly magazine from January 1971 to January 2015. The scraped data was converted into a CSV file and imported into textual analysis software (Atlas.ti). Collecting and processing the data created an opportunity to perform a longitudinal discourse analysis of technology and innovation. The content was searched against 50 pre-determined keywords to identify relevant articles for in-depth analysis. This methodological approach was necessary because retrospective discourse about technological innovation often glosses over tensions. A historical content analysis circumvented this potential positivity bias by tracking the complex process of multiple technological innovations across more than four decades.
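For illustration, the keyword-filtering step can be sketched in a few lines of Python (the original pipeline was written in PHP; the file name, column name, and keywords here are hypothetical):

    import csv

    # Hypothetical subset of the 50 pre-determined keywords
    KEYWORDS = ["innovation", "automation", "computer"]

    # Keep only articles whose text mentions at least one keyword
    with open("articles.csv", newline="", encoding="utf-8") as f:
        relevant = [
            row for row in csv.DictReader(f)
            if any(k in row["text"].lower() for k in KEYWORDS)
        ]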
Flattening JSON into SPSS
Before using the Yelp academic data set in a collaborative project, the data needed to be cleaned and converted from a set of hierarchical JSON files into a flat file compatible with SPSS. Using Python, each row of the main business file was read into a table, and the additional data files were matched to each unique business. For each business in the table, the hierarchical categories were flattened into individual columns, and the data was cleaned to correct spelling mistakes and filter out irrelevant cases. The result was then exported as an SPSS-compatible file.
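A minimal sketch of this conversion, assuming the pandas and pyreadstat libraries are available (the file name and cleaning rule are illustrative; the real pipeline involved more matching and cleaning steps):

    import json
    import pandas as pd
    import pyreadstat

    # Read the line-delimited business file, one JSON record per line
    with open("business.json", encoding="utf-8") as f:
        businesses = [json.loads(line) for line in f]

    # Flatten nested fields (e.g. attributes -> one column per attribute)
    df = pd.json_normalize(businesses)

    # SPSS variable names cannot contain periods, so rename flattened columns
    df.columns = [c.replace(".", "_") for c in df.columns]

    # Illustrative cleaning step, then export an SPSS .sav file
    df = df.dropna(subset=["categories"])
    pyreadstat.write_sav(df, "yelp_business.sav")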
Parsing HL7 Data
Using Python, we created a Health Level Seven (HL7) parser to extract relevant patient information. Because HL7 is the standard messaging format for medical administration systems, these files present a wealth of information for statistical modeling. The results of this data processing are used to predict patient wait times (link).
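Because HL7 v2 messages are plain text (segments separated by carriage returns, fields by pipes), a minimal extraction needs no special library. This sketch uses a fabricated message; the field positions follow the standard PID segment:

    # Minimal HL7 v2 field extraction; the message below is fabricated
    message = (
        "MSH|^~\\&|ADT1|HOSPITAL|||201501011200||ADT^A01|MSG001|P|2.3\r"
        "PID|1||12345^^^HOSPITAL||DOE^JOHN||19700101|M\r"
    )

    # Index each segment by its three-letter name
    segments = {s.split("|")[0]: s.split("|") for s in message.strip().split("\r")}

    pid = segments["PID"]
    patient_id = pid[3].split("^")[0]  # PID-3: patient identifier list
    name = pid[5].replace("^", " ")    # PID-5: patient name (DOE JOHN)
    dob = pid[7]                       # PID-7: date of birth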
Edge Detection / Image Processing
This project used edge detection in Matlab to locate bone fractures that might be missed by the human eye. Edge detection requires computing the intensity change at each pixel; large changes can signal an edge. Once the edge pixels were isolated, this subset was analyzed to locate straight lines using a slope equation. The lines with the highest summed edge strength were the strongest candidates for fractures. The model provides a second opinion to medical professionals because it can find features within noisy data.
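The per-pixel intensity-change step can be sketched in Python with a Sobel gradient (the original work was in Matlab; the threshold is illustrative, and a grayscale image array is assumed):

    import numpy as np
    from scipy import ndimage

    def edge_map(image, threshold=0.2):
        # Mark pixels whose intensity gradient magnitude exceeds a threshold
        image = image.astype(float)
        gx = ndimage.sobel(image, axis=1)  # horizontal intensity change
        gy = ndimage.sobel(image, axis=0)  # vertical intensity change
        magnitude = np.hypot(gx, gy)
        magnitude /= magnitude.max()       # normalize to [0, 1]
        return magnitude > threshold       # True where an edge is likely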
Bilateral Filter / Image Processing
There are multiple ways to deal with noisy data in Matlab, such as averaging and regularization. This program used a bilateral filter, which resembles averaging but computes better weights: each neighbor's contribution combines a Gaussian term for spatial distance with a Gaussian term for intensity difference. Bilateral filters have several parameters that affect the result, but this formulation removes noise while preserving edges.
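A direct, unoptimized version in Python makes the two Gaussian terms explicit (the original was in Matlab; sigma values are illustrative and assume intensities normalized to [0, 1]):

    import numpy as np

    def bilateral_filter(image, radius=3, sigma_space=2.0, sigma_range=0.1):
        # Each output pixel is a weighted average of its neighborhood, where
        # weights combine spatial closeness with intensity similarity.
        out = np.zeros_like(image, dtype=float)
        h, w = image.shape
        for i in range(h):
            for j in range(w):
                i0, i1 = max(i - radius, 0), min(i + radius + 1, h)
                j0, j1 = max(j - radius, 0), min(j + radius + 1, w)
                patch = image[i0:i1, j0:j1].astype(float)
                yy, xx = np.mgrid[i0:i1, j0:j1]
                # Spatial term: nearby pixels count more
                spatial = np.exp(-((yy - i) ** 2 + (xx - j) ** 2) / (2 * sigma_space ** 2))
                # Range term: similar intensities count more (this preserves edges)
                similarity = np.exp(-((patch - float(image[i, j])) ** 2) / (2 * sigma_range ** 2))
                weights = spatial * similarity
                out[i, j] = (weights * patch).sum() / weights.sum()
        return out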
Scraping Blogs
Blogs were a central part of the social activist community. Scraping the two largest blogs with PHP allowed us to track 10 years of activity. For each blog post, we scraped the title, the URL, and the total number of comments. We used the total number of comments as a proxy for group size and cross-referenced those trends with the data collected from Facebook. We also used the total number of comments to isolate the posts with the greatest participation. These posts were then analyzed against historical data to assess why they were outliers. We concluded that controversial topics and posts that coincided with activist events received the most comments and were often precursors to increases in Facebook group membership.
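For illustration, the per-post extraction could look like this in Python (the original scraper was written in PHP; the URL and CSS selectors here are hypothetical):

    import requests
    from bs4 import BeautifulSoup

    # Hypothetical blog index page and markup
    html = requests.get("https://example-activist-blog.org/archive").text
    soup = BeautifulSoup(html, "html.parser")

    posts = []
    for post in soup.select("div.post"):
        posts.append({
            "title": post.select_one("h2.title").get_text(strip=True),
            "url": post.select_one("a")["href"],
            # Comment totals serve as the proxy for group size
            "comments": int(post.select_one("span.comment-count").get_text()),
        })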
Scraping Facebook
As part of the study of online social activism, we scraped Facebook to track historical growth patterns. The HTML was parsed with PHP, and the extracted data was exported as a CSV for analysis in Excel. The data allowed us to identify which types of activism or media attention resulted in significant increases in group membership.
However, the scraping raised data privacy issues. Many members of these closed groups participated only under the assumption of anonymity, and for many of them activism carried social and familial costs. To address these issues, we were careful to scrub any identifying information, such as names and birthdays, out of the data.
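The scrubbing step amounts to a column filter over the exported CSV; a sketch with hypothetical file and column names:

    import csv

    IDENTIFYING = {"name", "birthday"}  # hypothetical identifying columns

    # Copy the scraped CSV, dropping any column that could identify a member
    with open("facebook_raw.csv", newline="") as src, \
         open("facebook_clean.csv", "w", newline="") as dst:
        reader = csv.DictReader(src)
        fields = [f for f in reader.fieldnames if f not in IDENTIFYING]
        writer = csv.DictWriter(dst, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(reader)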
Scraping Weather Data
This app used Python and the Dark Sky API to translate numerical weather data into text-based instructions for appropriate daily clothing choices. The app was designed to help children learn the connection between weather and clothing while practicing independent decision making.
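A toy version of the translation logic (the endpoint and field names follow Dark Sky's documented API; the key, coordinates, and thresholds are illustrative):

    import requests

    # Dark Sky forecast endpoint: /forecast/[key]/[latitude],[longitude]
    url = "https://api.darksky.net/forecast/YOUR_KEY/43.65,-79.38"
    currently = requests.get(url).json()["currently"]

    def clothing_advice(temp_f, precip_prob):
        # Translate numeric weather into a child-friendly instruction
        if precip_prob > 0.5:
            return "Bring a raincoat and rain boots!"
        if temp_f < 40:
            return "Wear a warm coat, hat, and mittens."
        if temp_f < 65:
            return "A sweater or light jacket is a good idea."
        return "T-shirt weather!"

    print(clothing_advice(currently["temperature"], currently["precipProbability"]))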
