Statistical Modeling

Predicting Healthcare Costs

As the American population continues to age, a critical issue facing the healthcare industry is the steady rise in total medical costs. To identify some of the factors that contribute to total medical costs, we used Stata to analyze the 2005 Medical Expenditure Panel Survey. The goal of this project is to demonstrate that statistical models, such as multiple linear regression, can help identify possible interventions and highlight other cost-reducing measures.

Our main findings centered on a large number of significant interactions between the variables. Our testing showed that adding these interaction terms reduced our residual errors and predicted total cost more accurately (see plots above). These interactions were not surprising given the complexities of the human body and the intricate patterns of health-related behavior. Our most important conclusion from these interactions is that the focus should be on preventative care. While this is not a new or revolutionary suggestion, our data showed that the interactions between chronic disease and physical limitations were major contributors to total expenditure. Our results also suggested that preventative programs should be mindful of the impact that race, gender, marital status, and income have on total health care cost, and seek to address those factors early. However, due to the small size of this dataset, we suggest that additional relevant datasets be analyzed before any policies are adopted.
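To make the idea of an interaction concrete, here is a minimal Python sketch (the project itself used Stata). The cell means are invented for illustration and are not taken from the survey. With two yes/no predictors, the interaction is the difference between the two "chronic disease" effects, and it is exactly the amount by which a purely additive model misses the cell where both conditions are present.

```python
# Hypothetical mean annual cost by (chronic_disease, physical_limitation).
# These numbers are invented for illustration only.
cell_means = {
    (0, 0): 1000.0,
    (1, 0): 3000.0,
    (0, 1): 2500.0,
    (1, 1): 7000.0,
}


def interaction_contrast(m):
    """Difference of differences: how much the chronic-disease effect
    changes when a physical limitation is also present."""
    effect_without_limitation = m[(1, 0)] - m[(0, 0)]
    effect_with_limitation = m[(1, 1)] - m[(0, 1)]
    return effect_with_limitation - effect_without_limitation


def additive_prediction(m):
    """Prediction for the (1, 1) cell if the two effects simply added up."""
    baseline = m[(0, 0)]
    chronic_effect = m[(1, 0)] - baseline
    limitation_effect = m[(0, 1)] - baseline
    return baseline + chronic_effect + limitation_effect


interaction = interaction_contrast(cell_means)
additive_error = cell_means[(1, 1)] - additive_prediction(cell_means)
print("interaction:", interaction)
print("error of additive model on the (1, 1) cell:", additive_error)
```

In a regression with both indicators and their product, the coefficient on the product term equals this difference of differences, which is why adding it removes the residual on that cell.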

Final Report

Predicting Patient Wait Time

This model used Matlab and multiple linear regression to predict total patient wait time in a radiology department. Accurate wait time predictions benefit patients and also help hospital administrators, who must allocate hospital space, equipment expenditures, and staff salaries. Total wait time was calculated from HL7 data by subtracting arrival time from exam begin time. The results showed that the model was most successful when it used total wait time plus the interaction between wait time and current line size. To calculate current line size from the HL7 data, it was necessary to iterate over the patients who had arrived in the department before the current patient's check-in but had not yet started their exam.
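The wait time and line size calculations can be sketched as follows. This is a Python illustration under stated assumptions, not the original Matlab code: the record layout, the timestamp format, and the sample visits are all hypothetical, and real HL7 messages would need to be parsed into this shape first.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"  # assumed timestamp format, not an HL7 format

# Hypothetical visits: (patient_id, arrival, exam_begin). Values are invented.
visits = [
    ("A", "2015-03-02 08:00", "2015-03-02 08:20"),
    ("B", "2015-03-02 08:05", "2015-03-02 08:40"),
    ("C", "2015-03-02 08:10", "2015-03-02 08:30"),
    ("D", "2015-03-02 08:35", "2015-03-02 09:00"),
]


def parse(timestamp):
    return datetime.strptime(timestamp, FMT)


def wait_minutes(arrival, exam_begin):
    """Total wait = exam begin time minus arrival time, in minutes."""
    return (parse(exam_begin) - parse(arrival)).total_seconds() / 60


def line_size_at_arrival(current, all_visits):
    """Count patients who arrived before the current patient's check-in
    and had not yet started their exam at that moment."""
    check_in = parse(current[1])
    return sum(
        1
        for _, arrival, exam_begin in all_visits
        if parse(arrival) < check_in and parse(exam_begin) > check_in
    )


rows = [
    (v[0], wait_minutes(v[1], v[2]), line_size_at_arrival(v, visits))
    for v in visits
]
for patient_id, wait, line_size in rows:
    print(patient_id, wait, line_size)
```

Scanning every visit for every patient is quadratic in the number of visits, which is fine for a sketch; sorting by arrival time would allow a faster pass on a large extract.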

Code

Predicting Check-ins - Yelp I

This linear regression model is the first of three models created for a larger project that used the seventh Yelp Academic Dataset in SPSS. We initially wanted to explore variables that might measure a restaurant's success in terms of the quantity of visits (as opposed to average quality rating, i.e. stars). We hypothesized that a greater number of check-ins could indicate that more people were visiting a restaurant. However, upon investigation, we discovered there were Yelp marketing incentives associated with check-ins. We modified our goal to focus on the number of check-ins as a rough proxy for customers who were willing to announce their patronage of a restaurant publicly. Due to the large number of variables, we used stepwise and hierarchical methods to locate the most significant explanatory variables for the number of check-ins. The normality assumptions of linear regression required that many of the variables be transformed (see the image above).

The most jarring finding of our regression was that stars were not a significant predictor of restaurant check-ins. This finding was counter-intuitive given that stars are arguably the most visible and valued metric on the Yelp site. Additionally, we tested the interaction between stars and reviews (starsXreviews), the interaction between price range and reviews (priceXreviews), as well as a Bayesian Estimate. None of these variables turned out to be significant. We therefore concluded that reviews had the same effect on check-ins at all levels of stars and price. From the outcome of the multiple linear regression, barring any other confounding factors, we concluded that restaurants with higher online popularity (number of reviews), catering to the more affordable market segment (price range), and without the delivery option are more likely to receive more check-ins.

Final Report

Difference Between Cities - Yelp II

Building on the results of the regression (Yelp I), we conducted an Analysis of Variance (ANOVA). This model allowed us to test whether there were differences between three specific cities: Las Vegas, Montreal, and Charlotte. We used 1,000 cases from each of the three cities. The result supported our pre-analysis expectation that there would be a significant difference between the cities, in particular between Las Vegas and the other two. The model indicated that all three cities differed significantly from one another. ANOVA is limited, however, because it can show that the cities differ but cannot explain the variation. These differences need to be explored using a General Linear Model (GLM), see Yelp III.
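For readers unfamiliar with the test, a one-way ANOVA compares the variation between group means to the variation within groups. The Python sketch below computes the F statistic by hand; the analysis itself was run in SPSS, and the check-in counts here are invented and far smaller than the 1,000 cases per city used in the project.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means)
    )
    # Variation of each observation around its own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, means) for x in g
    )

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical check-in counts per restaurant, invented for illustration.
checkins = {
    "Las Vegas": [12, 15, 14, 19],
    "Montreal": [5, 7, 6, 6],
    "Charlotte": [8, 9, 7, 8],
}

f_stat, df_b, df_w = one_way_anova_f(list(checkins.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

The sketch stops at the F statistic. Turning it into a p-value requires the F distribution, and deciding which pairs of cities differ requires follow-up pairwise comparisons; neither is shown here.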

Final Report

Examining City Differences - Yelp III

Building on the results of the ANOVA (Yelp II), we ran a General Linear Model (GLM) to examine the differences between specific cities with a slight modification: we controlled for the continuous predictor and its interaction with cities. The GLM allowed us to compare outcomes across cities. The model was run with city as the independent variable, number of reviews as the covariate, and the interaction between number of reviews and city. Controlling for the number of reviews and the interaction term, there was a statistically significant difference in check-ins between cities. This meant that while reviews had an effect in all cities, the magnitude of the effect changed significantly from city to city. The inclusion of the interaction term in the GLM accounted for this effect. Although city and the interaction between city and reviews were significant, their effect sizes were minimal. Following the trend of the previous model, the largest impact on check-ins came from reviews. Overall, we conclude that (1) reviews, open, price range, and delivery were the predictors that best explained check-ins and that (2) there are clear differences between cities, whether we take a more conservative approach (accounting for the covariate and interaction) or look at the differences between cities alone.
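One way to picture the city-by-reviews interaction: a model with city, reviews, and their interaction gives each city its own intercept and its own slope for reviews, and the point estimates match what you get from fitting a separate line per city. The Python sketch below does exactly that on invented data; it is an illustration only, not the SPSS model, and it does not test whether the slopes differ significantly.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical (reviews, check-ins) per city, invented for illustration.
data = {
    "Las Vegas": ([10, 20, 30, 40], [25, 45, 65, 85]),
    "Montreal": ([10, 20, 30, 40], [8, 13, 18, 23]),
    "Charlotte": ([10, 20, 30, 40], [14, 24, 34, 44]),
}

fits = {city: fit_line(xs, ys) for city, (xs, ys) in data.items()}
for city, (intercept, slope) in fits.items():
    print(f"{city}: check-ins = {intercept:.1f} + {slope:.2f} * reviews")
```

If the slopes were the same in every city, the interaction term would add nothing and the cities would differ only by a constant shift.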

Final Report

Simulating Vaccination Protocol

This model used Arena to simulate various humanitarian vaccination procedures and protocols. Given the time constraints of humanitarian work, the primary goal was to identify which model used the staff most efficiently. For example, how can nurses be utilized in less complex situations to free up doctors for more complex and time-demanding cases? An additional constraint was placed on the model by adding a local non-traditional practitioner to the simulation, to observe how an additional variable affected patient flow.
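The staffing question can be illustrated with a very small queue simulation. The Python sketch below is not the Arena model: it assumes first-come, first-served queues, fixed service times, and an invented patient list, and it compares only two hypothetical setups (one doctor sees everyone, versus a nurse taking the simple cases while the doctor takes the complex ones).

```python
import heapq


def waits(jobs, n_servers):
    """Simulate a first-come, first-served queue.

    jobs is a list of (arrival_minute, service_minutes).
    Returns each patient's wait in arrival order.
    """
    free_at = [0.0] * n_servers  # time at which each staff member is free
    heapq.heapify(free_at)
    result = []
    for arrival, service in sorted(jobs):
        start = max(arrival, heapq.heappop(free_at))
        result.append(start - arrival)
        heapq.heappush(free_at, start + service)
    return result


# Hypothetical patients: (arrival_minute, service_minutes, is_complex).
patients = [
    (0, 5, False),
    (1, 15, True),
    (2, 5, False),
    (3, 5, False),
    (4, 15, True),
    (5, 5, False),
]

# Setup 1: a single doctor sees every patient.
doctor_only = waits([(a, s) for a, s, _ in patients], n_servers=1)

# Setup 2: a nurse handles simple cases, the doctor handles complex ones.
nurse_waits = waits([(a, s) for a, s, c in patients if not c], n_servers=1)
doctor_waits = waits([(a, s) for a, s, c in patients if c], n_servers=1)

mean_doctor_only = sum(doctor_only) / len(patients)
mean_split = (sum(nurse_waits) + sum(doctor_waits)) / len(patients)
print(f"mean wait, doctor only: {mean_doctor_only:.1f} min")
print(f"mean wait, nurse + doctor: {mean_split:.1f} min")
```

A fuller model would draw arrival and service times from distributions and run many replications, which is what a tool like Arena automates.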
