
Tips for logistic regression analysis

Mirrorglass

Hello, y'all.

I've recently taken my first steps into the world of scientific research and found out that statistics are hard. I'm looking for materials at my local library, of course, but I thought I might as well ask for assistance here.

What I'm looking for is books or online materials on statistical analysis, particularly logistic regression, and even more particularly logistic regression with several confounding factors using SPSS. At the moment I'm getting an analysis that looks right, but I have only a vague idea of what I'm actually testing for.

Anyway, thanks in advance!
 
I'm no great mathematician, but I have used stats a fair bit in previous research positions, so if you could give some more detail, I might be able to help a little.

(But I might not)
 
Ah, logistic regression. Back in the olden days when I took Generalized Linear Models, I think we used McCullagh and Nelder's Generalized Linear Models book as a reference. But it was all course notes, and we used R (because it's free). I know a little bit of statistics, though slightly more on the theoretical side. So, like Professor Yaffle, I could in theory help if you give more detail. In practice I might not.
 
Well, the actual problem I have at the moment is fairly simple. I'm using SPSS 16 to analyze a data set of about a thousand people. It contains a dozen variables for each person - the ones of interest here are sex, age, level of education, marital status and a positive diagnosis for the medical problem I'm researching.

Now, my goal is to use logistic regression analysis to determine whether or not being widowed significantly increases the likelihood of having this condition, and control for age, sex and education as confounding factors. (A simple chi-square test does give a significant correlation between widow status and the condition).

I know how to select the test, and I have run it, but I have been unable to determine whether I'm using the correct settings. I get different results depending on where I enter the covariates and whether or not I include a constant in the model. The bit about the constant is what has me most confused: what exactly does including it in the model do? All I can see is that it turns a significant correlation into a non-significant one.

ETA: Oh, and if you don't know how SPSS works, the theoretical answer would also be helpful. I'm going to have to understand what I'm doing if I'm ever going to get to actually writing for journals.
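(For reference, since R came up above and the formula view sometimes makes the SPSS dialogs clearer, the model being described would look something like the sketch below. The names - condition, widowed, age, sex, education, mydata - are made up for illustration, not taken from the real data set.)

# Hypothetical data frame 'mydata', one row per person:
# condition (0/1), widowed (0/1), age in years, sex and education as factors
fit <- glm(condition ~ widowed + age + sex + education,
           data = mydata, family = binomial)

summary(fit)       # Wald test for each coefficient, including 'widowed'
exp(coef(fit))     # coefficients expressed as odds ratios
exp(confint(fit))  # 95% confidence intervals on the odds-ratio scale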
 
I found a bit about using a constant in logistic regression here. I'm pretty rusty on this kind of stuff, but I'm still a master at using Google.

http://www.sma.org.sg/smj/4504/4504bs1.pdf

As far as I can gather, the constant in the model is the y-intercept. If you exclude it, the intercept is forced to be 0. It's odd that including the constant would reduce the significance; we might need someone who knows what they're talking about to come in here.
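(A small sketch of that distinction in R, using the same hypothetical variable names as above:)

# With the constant (the default): the intercept is estimated from the data
with_const <- glm(condition ~ widowed + age + sex + education,
                  data = mydata, family = binomial)

# Without the constant: the '- 1' drops the intercept, i.e. forces it to be 0
no_const <- glm(condition ~ widowed + age + sex + education - 1,
                data = mydata, family = binomial)

coef(with_const)["(Intercept)"]  # estimated log-odds when all predictors are 0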
 
I have vague memories of doing hierarchical regression analysis where I entered some variables and then looked to see if adding the final term explained a significant amount of variance over and above that - is that the sort of thing you are looking at?
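(If it helps, that sort of nested comparison might look roughly like the sketch below in R - hypothetical names again, not the actual analysis: fit the confounders first, then test whether adding the term of interest improves the fit.)

# Step 1: confounders only
reduced <- glm(condition ~ age + sex + education,
               data = mydata, family = binomial)

# Step 2: add the term of interest
full <- glm(condition ~ age + sex + education + widowed,
            data = mydata, family = binomial)

# Does adding 'widowed' significantly reduce the deviance?
anova(reduced, full, test = "Chisq")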
 
I found a bit about using a constant in logistic regression here. ... http://www.sma.org.sg/smj/4504/4504bs1.pdf

Thank you; that is much more helpful than the materials I managed to Google up. I think it may well be enough to get past this hurdle. I'll get back to you once I've tried it out. (I don't have SPSS on my home computer.)
 
... It's odd that including the constant would reduce the significance; we might need someone who knows what they're talking about to come in here.

It's not odd at all. By forcing the intercept to be 0 (not having the constant), you are making a strong assumption about the baseline probability (of disease, I'm guessing, here). Depending on the data, the distortion caused by forcing the value of the intercept can inflate the magnitude of the other regression coefficients (and so their significance) or, conversely, shrink them. By default, the constant should be included in the model.
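(To put a number on that "strong assumption" - an illustrative sketch in R with hypothetical names: the intercept is the log-odds of the outcome when every predictor is 0, so setting it to 0 pins the baseline probability at exactly 0.5.)

plogis(0)   # 0.5 - the baseline probability a no-constant model silently assumes

fit <- glm(condition ~ widowed + age + sex + education,
           data = mydata, family = binomial)
plogis(coef(fit)["(Intercept)"])   # the baseline probability the data actually estimate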
 
... the ones of interest here are sex, age, level of education, marital status and a positive diagnosis for the medical problem I'm researching.

Are all of your IVs binary? Sex, Marital Status, and Positive Diagnosis seem fine. Age and level of education may be a bit harder to fit into a binary framework, and the loss of information caused by the transformation may mask significant findings.
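(A hypothetical sketch in R of the point about dichotomising: age can simply be entered as a continuous covariate rather than split at a cut-off, keeping that information in the model.)

# Age kept continuous - no information thrown away
glm(condition ~ widowed + age + sex + education,
    data = mydata, family = binomial)

# Age split at an arbitrary cut-off (65 here, purely for illustration)
glm(condition ~ widowed + I(age >= 65) + sex + education,
    data = mydata, family = binomial)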
 
Are all of your DVs binary? Sex, Marital Status, and Positive Diagnosis seem fine. Age and level of education may be a bit harder to fit into a binary framework, and the loss of information caused by the transformation may mask significant findings.

I was assuming only one DV - presence or absence of the medical condition. The IVs are a mixture of continuous and discrete variables.

ETA: If all the IVs were dichotomous, I think discriminant analysis would usually be the better choice.
 
It's not odd at all. By forcing the intercept to be 0 (not having the constant), you are making a strong assumption about the baseline probability ...

Told you I was rusty.
 
I was assuming only one DV - presence or absence of the medical condition. The IVs are a mixture of continuous and discrete variables. ...

Got IV and DV screwed up there; I will let you guess whether it was a real error or a typo :-)

Off to edit now.
 
It's called logistic because it attempts to fit the data to a logistic curve - the DV being dichotomous, not continuous.

Thanks. A use of the word of which I was totally unaware. I learn something here every day.
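(For anyone else who was unaware: the curve in question is p = exp(x) / (1 + exp(x)), which maps any real number onto a probability between 0 and 1. A one-liner to see its shape in R:)

# The logistic (sigmoid) curve that the model fits on the probability scale
curve(plogis(x), from = -6, to = 6,
      xlab = "linear predictor (log-odds)", ylab = "probability")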
 
... By default, the constant should be included in the model.

Just out of curiosity, under what circumstances would you exclude a constant from your model?
 
It would really depend on the problem; you would need a situation where the baseline (the value at the origin) is known for some reason or another (from theory, say). I can't think of anything off the top of my head, but I might find examples if I looked through my old notes. Essentially, if the assumption that E(Y|X=0)=0 is reasonable (here Y is the logit of a probability, so that probability would be 0.5 at X=0), then you save a degree of freedom in your analysis by not estimating that 0.
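(A purely hypothetical sketch of that special case in R - a single predictor x and outcome y in a made-up data frame mydata, where theory really does fix the probability at 0.5 when x = 0:)

with_int <- glm(y ~ x,     data = mydata, family = binomial)  # baseline estimated from the data
no_int   <- glm(y ~ x - 1, data = mydata, family = binomial)  # baseline assumed: plogis(0) = 0.5

df.residual(no_int) - df.residual(with_int)  # 1 - the degree of freedom saved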
 
Are all of your IVs binary? ...

As the Prof noted, only the presence/absence of the condition is a DV (I think that's right, but the terminology still confuses me a bit). The other four are all IVs, and only sex is a binary variable.
 
I have a little software program that does statistics for me.

Basically, it's an internet spider that pulls in random pieces of any mathematical equations it finds during a search of random internet sites. It collates enough of these together to meet the minimum number of pages (a setting for running the program), then inserts your desired answer (another program input) as the final step.

Saves a lot of time, that does ;)

(On a serious note, I'm absolutely no help at all :))
 
Hi again. With the help of Prof Yaffle's link, I've made some progress with the problem; I'm now fairly certain widow status is not a significant risk factor once the other variables are taken into account.

However, there's still one thing confusing me; namely, what do the Blocks do in this model? From my attempts, it does not seem to make a difference whether I put all the variables into Block One or split marital status off into Block Two. What is the purpose of the different Blocks?

Once again, my thanks in advance for your assistance.
 
